Tuesday, October 14, 2014

Couchbase Issues & Errors – Overcoming Them

By Raj Kiran Mattewada

This article describes some common Couchbase errors and issues encountered in production environments and offers practical remedies and best practices gleaned from real-world experience. Please test any commands or steps in your own environment before applying them in production.





Issue #1: “Join completion call failed” / HTTP 500 error during node addition

Symptom / Error Message

Attention – Join completion call failed.  
Got HTTP status 500 from REST call post to 
http://hostname.domain.com:8091/completeJoin  
Body: ["Unexpected server error, request logged."]

Root Cause (RCA)
This often happens when the node being added still has buckets configured or residual data files left over from a previous installation, or when leftover metadata otherwise conflicts with the cluster it is joining.

Solution / Recovery Steps

⚠️ Caution: These steps may be destructive. Always back up data and proceed carefully.

  1. Stop Couchbase on the node you’re trying to add:

    service couchbase stop
    service couchbase status
    
  2. Ensure that critical ports are free (e.g., 8091, 8092, 11211, 11210). If any are held by other processes, identify and kill them:

    netstat -anp | grep 8091
    netstat -anp | grep 8092
    netstat -anp | grep 11211
    netstat -anp | grep 11210
    
  3. Move or remove existing data directories—this removes leftover or conflicting bucket metadata:

    cd /data/couchbase  
    mv data data.OLD  
    mv index index.OLD  
    
  4. Recreate the directories fresh:

    mkdir data index  
    chown couchbase:couchbase data index  
    
  5. Re-verify that the required ports are no longer in use.

  6. Start Couchbase again:

    service couchbase start  
    service couchbase status  
    
  7. From the master cluster node, proceed to add the new node to the appropriate server group in the UI or via CLI.
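
The same addition can be scripted from the CLI. The sketch below assumes the 2.x/3.x couchbase-cli flag names and uses placeholder host names, credentials, and group name; verify the flags (especially --group-name, which only newer CLI versions accept on server-add) against your version's documentation:

    # Run from any existing cluster node; adds the new node into a server group, then rebalances
    /opt/couchbase/bin/couchbase-cli server-add \
      -c existing-node.domain.com:8091 -u Administrator -p 'admin_password' \
      --server-add=new-node.domain.com:8091 \
      --server-add-username=Administrator \
      --server-add-password='admin_password' \
      --group-name='Group 1'

    /opt/couchbase/bin/couchbase-cli rebalance \
      -c existing-node.domain.com:8091 -u Administrator -p 'admin_password'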


Issue #2: Auto-failover alerts / memcached / moxi exit statuses

Symptoms / Error Messages

  • “Couchbase Server alert: auto_failover_node”

  • “Port server memcached on node ‘babysitter_of_ns_1@127.0.0.1’ exited with status 0.”

  • “Port server moxi on node … exited with status 134.”

Scenarios & Explanations

  • Scenario 1a: The memcached process on a node was shut down gracefully (exit status 0).
    Recommended action: restart Couchbase on that node and reintroduce it to the cluster.

  • Scenario 1b: A network glitch or an overly aggressive auto-failover timeout caused a false detection of failure.
    Recommended action: increase the auto-failover timeout threshold and ensure network stability.

  • Scenario 2: Moxi (the proxy component) exited with status 134 (abort).
    Recommended action: investigate memory or configuration issues; check system logs for the root cause.

Notes / Best Practices

  • Be careful not to set auto-failover timeout values too low (e.g. 30 seconds); transient network hiccups may trigger unnecessary failovers. See the CLI example after this list for raising the timeout.

  • Monitor the “babysitter” process, which supervises the Couchbase services on each node and restarts them if they exit.

  • Always review Couchbase logs (e.g. ns_server.log) and system logs (e.g. /var/log/messages) to catch the underlying issue that caused the process to exit unexpectedly.

  • In many cases, simply restarting the failed node and rebalancing is sufficient once the root cause (memory, network, CPU) is addressed.
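
The auto-failover timeout can also be raised from the CLI. A minimal sketch, assuming the 2.x/3.x couchbase-cli flag names and placeholder credentials (verify against your version's documentation):

    # Keep auto-failover enabled but wait 120 seconds before failing a node over
    /opt/couchbase/bin/couchbase-cli setting-autofailover \
      -c localhost:8091 -u Administrator -p 'admin_password' \
      --enable-auto-failover=1 \
      --auto-failover-timeout=120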


Additional Tips & Enhancements (based on experience)

  1. Pre-check cluster state before upgrades or adding nodes

    • Ensure cluster is healthy (cbstats, ns_server APIs)

    • No rebalance is in progress

    • Monitor resident ratio, memory usage, swap usage

  2. Use backup snapshots before major operations

    • Ensure you have snapshots or a data export so you can roll back if needed.

  3. Always stagger upgrades / additions

    • Avoid performing multiple cluster changes at once; sequence them with time gaps to detect early failures.

  4. Automate diagnostics collection

    • Use cbcollect_info to collect logs and state (see the sketch after this list)

    • Automate via scripts or orchestration (Ansible)

    • Include logs, config files, cluster stats, alerts

  5. Tune memory settings carefully

    • Monitor eviction thresholds, memcached quotas, and disk swap behavior

    • Avoid overcommitting memory across all services (data, index, query, search)

  6. Monitor XDCR lag proactively

    • If replication lag creeps up, use heartbeat-based alerts or custom scripts

    • If the built-in alerting is not sufficient, raise enhancement requests (e.g. heartbeat signals) with Couchbase support

  7. Scripting for node operations

    • Script repetitive tasks like failover, rebalance, add-node, remove-node

    • Wrap safety checks to prevent destructive mistakes
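
As an example of point 4, diagnostics collection can be wrapped in a small script. This is only a sketch: the node names are placeholders, and it assumes cbcollect_info sits in the default /opt/couchbase/bin path and that you have SSH access to each node:

    #!/bin/bash
    # Run cbcollect_info on every node and pull the archives back for analysis
    NODES="cb-node1 cb-node2 cb-node3"        # placeholder host names
    STAMP=$(date +%Y%m%d_%H%M%S)
    for node in $NODES; do
      ssh "$node" "/opt/couchbase/bin/cbcollect_info /tmp/cbcollect_${node}_${STAMP}.zip"
      scp "${node}:/tmp/cbcollect_${node}_${STAMP}.zip" /var/diagnostics/
    done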

Sunday, October 12, 2014

Exploring Couchbase Via Command Line and Monitoring

This article walks through several essential CLI operations and tips for managing Couchbase from the command line. It is intended for DBAs and engineers who want deeper control beyond the UI.


1. Useful References & Tools

  • The official Couchbase CLI documentation is a great source for understanding all commands, flags, and service interactions.

  • Fabfile / Fabric can help automate remote operations.


2. Uninstalling Couchbase via CLI / RPM-based Systems

Important: These steps can be destructive—always ensure you have backups or snapshots before proceeding.

Steps to uninstall:

  1. Stop Couchbase service

    service couchbase status
    service couchbase stop
  2. List installed RPM packages

    rpm -qa | grep couchbase
  3. Uninstall the RPM(s)

    rpm -e couchbase-server-2.5-1083.x86_64
  4. Clean up directories (if you used default paths)

    cd /opt
    rm -rf couchbase*
    cd /data/couchbase
    rm -rf *
  5. Remove init scripts or service links

    ls -ltr /etc/init.d/couch* # Any remnants should be removed

Only remove directories you know are Couchbase-related. Do not inadvertently delete unrelated files.


3. Changing the Admin Password via CLI

On a cluster with multiple nodes, you must change the password on all nodes to maintain consistency.

Example workflow:

cd /opt/couchbase/bin
./cbreset_password

You will be prompted:

Please enter the new administrative password (or <Enter> for system generated password):
Running this command will reset administrative password. Do you really want to do it? (yes/no) yes
Password for user admin was successfully replaced.

4. Expanded Best Practices & Enhancements (beyond original text)

4.1 Scripting & Automation

  • Bundle commands (stop, clean directories, restart, password reset) into a Fabric/Ansible playbook so you can run across nodes uniformly.

  • Incorporate error checking (exit codes, service status checks) so the automation fails fast and can roll back when needed; a sketch follows.
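
A minimal sketch of such a wrapper, assuming an RPM-based install and the default /data/couchbase paths used earlier (the service may be named couchbase-server on your system):

    #!/bin/bash
    set -euo pipefail          # stop on the first failing command

    service couchbase stop
    mv /data/couchbase/data /data/couchbase/data.OLD
    mv /data/couchbase/index /data/couchbase/index.OLD
    mkdir /data/couchbase/data /data/couchbase/index
    chown couchbase:couchbase /data/couchbase/data /data/couchbase/index
    service couchbase start

    # Verify the service actually came back before declaring success
    service couchbase status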

4.2 Version Compatibility & CLI Differences

  • CLI flags and behaviors change across versions; always refer to the matching version’s documentation.

  • Before upgrading a node, back up configs and cbcollect_info output so you have a traceable baseline.

4.3 Cluster-wide Consistency

  • After changing the admin password, also update any scripts, SDKs, or configuration files that use the old credentials.

  • Propagate the change quickly across all nodes to avoid mismatched states (some nodes accepting old, some new).
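
A quick way to confirm that every node accepts the new credentials (a sketch; host names are placeholders):

    NEW_PASS='new_admin_password'
    for node in cb-node1 cb-node2 cb-node3; do
      code=$(curl -s -o /dev/null -w '%{http_code}' -u "Administrator:${NEW_PASS}" "http://${node}:8091/pools/default")
      echo "${node}: HTTP ${code}"     # expect 200 on every node
    done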

4.4 Monitoring & Logging

  • Use cbstats, REST APIs, or ns_server logs to verify that nodes are still reachable after administrative changes.

  • Monitor for abnormal failures in services like memcached, ns_server, indexer, etc.

  • Collect cbcollect_info outputs post-changes to help with troubleshooting.
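
For example, node status and basic memory stats can be pulled from the REST API with curl and jq (credentials are placeholders and the systemStats field names may vary slightly across versions):

    curl -s -u Administrator:password http://localhost:8091/pools/default \
      | jq -r '.nodes[] | "\(.hostname)  \(.status)  mem_free=\(.systemStats.mem_free)"'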

4.5 Safety Flags & Checks

  • Before removal or cleanup, verify cluster health: ensure no rebalance is in progress, no node failures, and no replication lag outstanding.

  • In a multi-node environment, don’t change passwords or uninstall nodes in parallel without staging and sequencing.
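
A pre-flight check along these lines can be scripted against the REST API; the sketch below uses the same rebalanceProgress endpoint as the monitoring script further down, with placeholder credentials:

    # Refuse to proceed if a rebalance is running or any node is unhealthy
    REBAL=$(curl -s -u Administrator:password http://localhost:8091/pools/default/rebalanceProgress | jq -r '.status')
    UNHEALTHY=$(curl -s -u Administrator:password http://localhost:8091/pools/default | jq '[.nodes[] | select(.status != "healthy")] | length')
    if [ "$REBAL" != "none" ] || [ "$UNHEALTHY" -gt 0 ]; then
      echo "Cluster is not quiet (rebalance: $REBAL, unhealthy nodes: $UNHEALTHY); aborting." >&2
      exit 1
    fi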


  I've created a comprehensive Couchbase monitoring solution with three components:


  1. Main Monitoring Script (couchbase_intelligent_monitor.sh)


  Key Features:


  - Real-time Health Monitoring: CPU, memory, disk, cache performance

  - Auto-Healing: Automatically fixes common issues

  - Intelligent Recommendations: Provides actionable insights

  - Multi-layer Monitoring:

    - Cluster health and balance

    - Node status and services

    - Bucket performance and cache efficiency

    - Index fragmentation and rebuilding

    - Query performance analysis

    - XDCR replication status

    - Backup validation


  Auto-Healing Capabilities:


  - Memory Issues: Compacts buckets, clears caches, adjusts watermarks

  - Disk Space: Triggers compaction, cleans logs, archives data

  - Index Problems: Rebuilds fragmented indexes automatically

  - Replication Lag: Restarts XDCR, optimizes settings

  - Node Failures: Initiates failover and rebalance when safe


  Intelligent Recommendations:


  - Suggests optimal memory quotas based on usage patterns

  - Recommends index creation for slow queries

  - Advises on cluster scaling needs

  - Predicts capacity requirements

  - Identifies security vulnerabilities


  2. Configuration File (couchbase_monitor_config.yaml)


  - Customizable thresholds for all metrics

  - Alert routing (email, Slack, PagerDuty)

  - Auto-healing rules and conditions

  - Maintenance windows

  - Advanced optimization rules


  3. Auto-Remediation Script (couchbase_auto_remediation.sh)


  Handles specific remediation actions:

  - Memory pressure recovery (4 severity levels)

  - Disk space cleanup

  - Replication lag fixes

  - Index rebuilding

  - Query optimization

  - Node failure recovery

  - Security issue resolution

  - Backup failure recovery


  Usage:


  # Basic usage

  ./couchbase_intelligent_monitor.sh


  # With custom configuration

  export COUCHBASE_HOST=cluster.example.com

  export COUCHBASE_USER=Administrator

  export COUCHBASE_PASSWORD=secure_password

  ./couchbase_intelligent_monitor.sh


  # Run specific remediation

  ./couchbase_auto_remediation.sh memory_pressure high node1


  Monitoring Output Example:


  ==========================================

  Couchbase Intelligent Monitor v1.0

  Cluster: production-cluster

  ==========================================


  [2025-01-11 10:30:00] Starting monitoring cycle...

  [INFO] Checking cluster health...

  [INFO] Cluster balanced: true

  [INFO] All 3 nodes healthy


  [INFO] Checking memory usage...

  [WARNING] High memory usage on node1: 87%

  [INFO] Auto-healing: Compacting buckets...

  [INFO] Auto-healing: Memory optimization complete


  📌 RECOMMENDATION [MEMORY]: Consider adding more nodes or increasing RAM


  [INFO] Checking bucket performance...

  [WARNING] High cache miss rate in bucket orders: 12%

  📌 RECOMMENDATION [CACHE]: Increase RAM quota for bucket orders to 8192MB


  [INFO] Checking indexes...

  [INFO] Rebuilding fragmented index idx_customer_email (35% fragmentation)


  [SUCCESS] Monitoring cycle complete


  HTML Health Report Features:


  - Visual status indicators (green/yellow/red)

  - Cluster overview with all metrics

  - Node-by-node breakdown

  - Historical recommendations

  - Performance graphs (when integrated with Grafana)


  Advanced Intelligence:


  1. Predictive Analysis:

    - Forecasts when disk/memory will be exhausted

    - Predicts query performance degradation

    - Estimates rebalance completion times

  2. Pattern Recognition:

    - Identifies recurring issues

    - Learns from past remediations

    - Suggests preventive measures

  3. Optimization Rules:

    - Automatically adjusts settings based on workload

    - Balances read/write performance

    - Optimizes for specific use cases


  Integration Capabilities:


  - Grafana: Exports metrics for visualization

  - Prometheus: Metric endpoints available

  - ELK Stack: Log shipping ready

  - PagerDuty: Critical alert escalation

  - Slack: Team notifications



#!/bin/bash

#############################################################################
# Script: couchbase_auto_remediation.sh
# Purpose: Automated remediation actions for common Couchbase issues
# Version: 1.0
# This script is called by the main monitor for auto-healing
#############################################################################

# Configuration
source /etc/couchbase-monitor/config.env 2>/dev/null || true

# Logging
LOG_DIR="${LOG_DIR:-/var/log/couchbase-monitor}"
REMEDIATION_LOG="${LOG_DIR}/remediation_$(date +%Y%m%d).log"

# Log remediation action
log_action() {
echo "[$(date '+%Y-%m-%d %H:%M:%S')] $@" >> "$REMEDIATION_LOG"
}

# Remediation functions

# 1. Memory Pressure Remediation
remediate_memory_pressure() {
local severity=$1
local node=$2
log_action "MEMORY_PRESSURE: Starting remediation for $node (severity: $severity)"
case $severity in
"low")
# Clear caches
log_action "Clearing query result cache"
curl -X POST -u "${CB_USER}:${CB_PASS}" \
"http://${node}:8093/admin/settings" \
-d 'queryResultCache=false' 2>/dev/null
sleep 2
curl -X POST -u "${CB_USER}:${CB_PASS}" \
"http://${node}:8093/admin/settings" \
-d 'queryResultCache=true' 2>/dev/null
;;
"medium")
# Compact buckets
log_action "Triggering bucket compaction"
# List bucket names via the REST API so the loop only sees names, not attribute lines
for bucket in $(curl -s -u "${CB_USER}:${CB_PASS}" "http://${node}:8091/pools/default/buckets" | jq -r '.[].name'); do
couchbase-cli bucket-compact -c "$node" -u "$CB_USER" -p "$CB_PASS" --bucket "$bucket"
log_action "Compacted bucket: $bucket"
done
;;
"high")
# Aggressive memory recovery
log_action "Aggressive memory recovery initiated"
# 1. Flush expired documents
curl -X POST -u "${CB_USER}:${CB_PASS}" \
"http://${node}:8091/controller/expirePurge" 2>/dev/null
# 2. Adjust memory watermarks temporarily
curl -X POST -u "${CB_USER}:${CB_PASS}" \
"http://${node}:8091/pools/default/settings/memcached/global" \
-d 'mem_low_wat=60&mem_high_wat=75' 2>/dev/null
# 3. Drop non-critical indexes
log_action "Consider dropping non-critical indexes to free memory"
;;
"critical")
# Emergency actions
log_action "CRITICAL: Emergency memory recovery"
# Failover if part of cluster
if confirm_failover "$node"; then
couchbase-cli failover -c "$node" -u "$CB_USER" -p "$CB_PASS" \
--server-failover "$node" --hard
log_action "Failed over node $node due to critical memory pressure"
fi
;;
esac
}

# 2. Disk Space Remediation
remediate_disk_space() {
local severity=$1
local path=$2
log_action "DISK_SPACE: Starting remediation for $path (severity: $severity)"
# Clean old logs
find /opt/couchbase/var/lib/couchbase/logs -type f -name "*.log.*" -mtime +3 -delete
log_action "Cleaned old log files"
# Clean crash dumps
find /opt/couchbase/var/lib/couchbase/crash -type f -mtime +7 -delete
log_action "Cleaned old crash dumps"
if [ "$severity" = "critical" ]; then
# Emergency space recovery
log_action "Critical disk space - removing old backup files"
find /backup/couchbase -type f -mtime +30 -delete 2>/dev/null
# Trigger immediate compaction
curl -X POST -u "${CB_USER}:${CB_PASS}" \
"http://localhost:8091/controller/startCompaction" 2>/dev/null
fi
}

# 3. Replication Lag Remediation
remediate_replication_lag() {
local replication_id=$1
local lag_items=$2
log_action "REPLICATION_LAG: Fixing replication $replication_id (lag: $lag_items items)"
if [ "$lag_items" -gt 10000 ]; then
# Restart XDCR replication
couchbase-cli xdcr-replicate -c localhost -u "$CB_USER" -p "$CB_PASS" \
--pause --xdcr-replicator="$replication_id"
sleep 5
couchbase-cli xdcr-replicate -c localhost -u "$CB_USER" -p "$CB_PASS" \
--resume --xdcr-replicator="$replication_id"
log_action "Restarted XDCR replication: $replication_id"
else
# Adjust replication settings
curl -X POST -u "${CB_USER}:${CB_PASS}" \
"http://localhost:8091/settings/replications/$replication_id" \
-d 'workerBatchSize=1000&docBatchSizeKb=4096' 2>/dev/null
log_action "Optimized replication settings for $replication_id"
fi
}

# 4. Index Fragmentation Remediation
remediate_index_fragmentation() {
local index_name=$1
local fragmentation=$2
log_action "INDEX_FRAGMENTATION: Rebuilding index $index_name (fragmentation: $fragmentation%)"
# Rebuild index using N1QL
local query="ALTER INDEX \`$index_name\` REBUILD"
curl -X POST -u "${CB_USER}:${CB_PASS}" \
"http://localhost:8093/query/service" \
-d "statement=$query" 2>/dev/null
log_action "Index rebuild initiated for: $index_name"
}

# 5. Query Performance Remediation
remediate_slow_queries() {
local query_hash=$1
log_action "SLOW_QUERY: Optimizing query performance"
# Clear prepared statement cache
curl -X POST -u "${CB_USER}:${CB_PASS}" \
"http://localhost:8093/admin/prepareds" \
-d "all=true" 2>/dev/null
# Update statistics
# List bucket names via the REST API (bucket-list output also contains attribute lines)
local buckets=$(curl -s -u "${CB_USER}:${CB_PASS}" "http://localhost:8091/pools/default/buckets" | jq -r '.[].name')
for bucket in $buckets; do
curl -X POST -u "${CB_USER}:${CB_PASS}" \
"http://localhost:8093/query/service" \
-d "statement=UPDATE STATISTICS FOR \`$bucket\`" 2>/dev/null
log_action "Updated statistics for bucket: $bucket"
done
}

# 6. Node Failure Remediation
remediate_node_failure() {
local failed_node=$1
local failure_type=$2
log_action "NODE_FAILURE: Handling $failure_type failure for $failed_node"
case $failure_type in
"network")
# Try to recover network
log_action "Attempting network recovery"
ssh "$failed_node" "systemctl restart network" 2>/dev/null || true
;;
"service")
# Restart Couchbase service
log_action "Attempting service restart"
ssh "$failed_node" "systemctl restart couchbase-server" 2>/dev/null || true
;;
"hardware")
# Initiate failover
if confirm_failover "$failed_node"; then
couchbase-cli failover -c localhost -u "$CB_USER" -p "$CB_PASS" \
--server-failover "$failed_node" --hard
log_action "Failed over node: $failed_node"
# Trigger rebalance
couchbase-cli rebalance -c localhost -u "$CB_USER" -p "$CB_PASS"
log_action "Rebalance initiated after failover"
fi
;;
esac
}

# 7. Security Remediation
remediate_security_issues() {
local issue_type=$1
log_action "SECURITY: Addressing $issue_type"
case $issue_type in
"weak_password")
log_action "Weak passwords detected - forcing password rotation"
# Implementation would depend on organization's password policy
;;
"audit_disabled")
# Enable audit logging
curl -X POST -u "${CB_USER}:${CB_PASS}" \
"http://localhost:8091/settings/audit" \
-d 'auditdEnabled=true' 2>/dev/null
log_action "Enabled audit logging"
;;
"ssl_disabled")
log_action "SSL is disabled - security risk identified"
# Note: Enabling SSL requires certificate setup
;;
esac
}

# 8. Backup Failure Remediation
remediate_backup_failure() {
local backup_job=$1
local error_type=$2
log_action "BACKUP_FAILURE: Fixing backup job $backup_job (error: $error_type)"
case $error_type in
"space")
# Clean old backups
find /backup/couchbase -type d -name "20*" -mtime +30 -exec rm -rf {} \; 2>/dev/null
log_action "Cleaned old backup files to free space"
;;
"permission")
# Fix permissions
chown -R couchbase:couchbase /backup/couchbase
chmod -R 755 /backup/couchbase
log_action "Fixed backup directory permissions"
;;
"corruption")
# Start fresh backup
log_action "Starting fresh full backup due to corruption"
cbbackup http://localhost:8091 /backup/couchbase/full_$(date +%Y%m%d) \
-u "$CB_USER" -p "$CB_PASS" &
;;
esac
}

# Helper function to confirm critical actions
confirm_failover() {
local node=$1
# Check if this is the only node
local node_count=$(couchbase-cli server-list -c localhost -u "$CB_USER" -p "$CB_PASS" | wc -l)
if [ "$node_count" -le 1 ]; then
log_action "Cannot failover $node - it's the only node in cluster"
return 1
fi
# Check if cluster can handle failover
local replica_count=$(curl -s -u "${CB_USER}:${CB_PASS}" \
"http://localhost:8091/pools/default/buckets" | \
jq '.[0].replicaNumber' 2>/dev/null)
if [ "$replica_count" -ge 1 ]; then
return 0 # OK to failover
else
log_action "Cannot failover - no replicas configured"
return 1
fi
}

# Main remediation dispatcher
dispatch_remediation() {
local issue_type=$1
shift
local params=$@
case $issue_type in
"memory_pressure")
remediate_memory_pressure $params
;;
"disk_space")
remediate_disk_space $params
;;
"replication_lag")
remediate_replication_lag $params
;;
"index_fragmentation")
remediate_index_fragmentation $params
;;
"slow_queries")
remediate_slow_queries $params
;;
"node_failure")
remediate_node_failure $params
;;
"security")
remediate_security_issues $params
;;
"backup_failure")
remediate_backup_failure $params
;;
*)
log_action "Unknown issue type: $issue_type"
;;
esac
}

# Execute if called directly
if [ "${1:-}" != "" ]; then
dispatch_remediation "$@"
fi



# Couchbase Intelligent Monitor Configuration
# This file contains advanced configuration for the monitoring script
#couchbase_monitor_config.yaml

cluster:
  name: production-cluster
  hosts:
    - host: node1.couchbase.local
      port: 8091
      services: [kv, index, n1ql]
    - host: node2.couchbase.local
      port: 8091
      services: [kv, index]
    - host: node3.couchbase.local
      port: 8091
      services: [kv, fts]

credentials:
  username: Administrator
  password: ${COUCHBASE_PASSWORD} # Use environment variable

monitoring:
  interval: 60 # seconds
  detailed_interval: 300 # detailed checks every 5 minutes
  thresholds:
    cpu:
      warning: 70
      critical: 85
    memory:
      warning: 75
      critical: 90
    disk:
      warning: 70
      critical: 85
    cache_miss_rate:
      warning: 5
      critical: 10
    query_latency:
      warning: 500 # ms
      critical: 1000 # ms
    index_fragmentation:
      warning: 20
      critical: 30
    replication_lag:
      warning: 1000 # items
      critical: 10000 # items

auto_healing:
  enabled: true
  actions:
    memory_pressure:
      - compact_buckets
      - clear_expired_docs
      - adjust_cache_watermarks
    disk_pressure:
      - trigger_compaction
      - cleanup_logs
      - archive_old_data
    high_cache_miss:
      - increase_resident_ratio
      - optimize_ejection
    index_issues:
      - rebuild_corrupted
      - defragment_high
    replication_lag:
      - restart_xdcr
      - adjust_throttling

alerts:
  email:
    enabled: true
    smtp_server: smtp.company.com
    smtp_port: 587
    from: couchbase-monitor@company.com
    to:
      - dba-team@company.com
      - ops@company.com
    severity: [critical, error]
  slack:
    enabled: true
    webhook_url: https://hooks.slack.com/services/XXX/YYY/ZZZ
    channel: "#database-alerts"
    severity: [critical, error, warning]
  pagerduty:
    enabled: false
    api_key: ${PAGERDUTY_API_KEY}
    service_key: ${PAGERDUTY_SERVICE_KEY}
    severity: [critical]

reporting:
  enabled: true
  html_reports: true
  json_metrics: true
  grafana_export: true
  report_interval: 3600 # seconds
  retention_days: 30

advanced_checks:
  security_audit:
    enabled: true
    check_ssl: true
    check_rbac: true
    check_audit_logs: true
  performance_analysis:
    enabled: true
    slow_query_threshold: 1000 # ms
    index_scan_threshold: 10000 # documents
  capacity_planning:
    enabled: true
    growth_prediction: true
    resource_forecasting: true

optimization_rules:
  - name: "Optimize Heavy Read Workloads"
    condition: "cache_miss_rate > 5 AND get_ops > 10000"
    actions:
      - increase_cache_size
      - add_replica_vbuckets
  - name: "Optimize Heavy Write Workloads"
    condition: "disk_queue > 100000 AND set_ops > 5000"
    actions:
      - increase_writers
      - optimize_persistence
  - name: "Balance Query Load"
    condition: "query_latency > 500 AND cpu > 70"
    actions:
      - add_query_nodes
      - optimize_indexes

maintenance_windows:
  - day: Sunday
    start: "02:00"
    end: "06:00"
    allowed_actions:
      - compaction
      - rebalance
      - index_rebuild

backup_validation:
  enabled: true
  test_restore: true
  verify_integrity: true
  alert_on_failure: true


#!/bin/bash

#############################################################################
# Script: couchbase_intelligent_monitor.sh
# Purpose: Intelligent Couchbase monitoring with auto-healing and recommendations
# Version: 1.0
# Author: Raj
# Date: June 10th 2014
#
# Features:
# - Comprehensive health monitoring
# - Auto-healing capabilities
# - Intelligent recommendations
# - Performance metrics collection
# - Predictive analysis
# - Alert management
#############################################################################

# Configuration
COUCHBASE_HOST="${COUCHBASE_HOST:-localhost}"
COUCHBASE_PORT="${COUCHBASE_PORT:-8091}"
COUCHBASE_USER="${COUCHBASE_USER:-Administrator}"
COUCHBASE_PASSWORD="${COUCHBASE_PASSWORD:-password}"
CLUSTER_NAME="${CLUSTER_NAME:-couchbase-cluster}"

# Monitoring Configuration
MONITOR_INTERVAL="${MONITOR_INTERVAL:-60}" # seconds
LOG_DIR="/var/log/couchbase-monitor"
LOG_FILE="${LOG_DIR}/monitor_$(date +%Y%m%d).log"
ALERT_LOG="${LOG_DIR}/alerts_$(date +%Y%m%d).log"
METRICS_FILE="${LOG_DIR}/metrics_$(date +%Y%m%d_%H%M%S).json"
REPORT_FILE="${LOG_DIR}/health_report_$(date +%Y%m%d_%H%M%S).html"

# Thresholds
CPU_THRESHOLD=80
MEMORY_THRESHOLD=85
DISK_THRESHOLD=80
SWAP_THRESHOLD=20
CACHE_MISS_THRESHOLD=10
REBALANCE_TIMEOUT=3600
CONNECTION_THRESHOLD=1000
QUERY_LATENCY_THRESHOLD=1000 # ms
INDEX_FRAGMENTATION_THRESHOLD=30

# Alert Configuration
ENABLE_EMAIL_ALERTS="${ENABLE_EMAIL_ALERTS:-false}"
ALERT_EMAIL="${ALERT_EMAIL:-admin@company.com}"
ENABLE_SLACK_ALERTS="${ENABLE_SLACK_ALERTS:-false}"
SLACK_WEBHOOK="${SLACK_WEBHOOK:-}"

# Colors for output
RED='\033[0;31m'
GREEN='\033[0;32m'
YELLOW='\033[1;33m'
BLUE='\033[0;34m'
PURPLE='\033[0;35m'
NC='\033[0m'

# Create log directory
mkdir -p "$LOG_DIR"

# Logging function
log_message() {
local level=$1
shift
local message="$@"
local timestamp=$(date '+%Y-%m-%d %H:%M:%S')
echo "[$timestamp] [$level] $message" | tee -a "$LOG_FILE"
if [ "$level" = "ERROR" ] || [ "$level" = "CRITICAL" ]; then
echo "[$timestamp] [$level] $message" >> "$ALERT_LOG"
send_alert "$level" "$message"
fi
}

# Print colored output
print_color() {
local color=$1
shift
echo -e "${color}$@${NC}"
}

# Send alerts
send_alert() {
local level=$1
local message=$2
# Email alert
if [ "$ENABLE_EMAIL_ALERTS" = "true" ] && [ -n "$ALERT_EMAIL" ]; then
echo "$message" | mail -s "[$level] Couchbase Alert - $CLUSTER_NAME" "$ALERT_EMAIL"
fi
# Slack alert
if [ "$ENABLE_SLACK_ALERTS" = "true" ] && [ -n "$SLACK_WEBHOOK" ]; then
curl -X POST -H 'Content-type: application/json' \
--data "{\"text\":\"[$level] $CLUSTER_NAME: $message\"}" \
"$SLACK_WEBHOOK" 2>/dev/null
fi
}

# Execute Couchbase REST API call
cb_api_call() {
local endpoint=$1
local method=${2:-GET}
local data=$3
local url="http://${COUCHBASE_HOST}:${COUCHBASE_PORT}${endpoint}"
if [ "$method" = "GET" ]; then
curl -s -u "${COUCHBASE_USER}:${COUCHBASE_PASSWORD}" "$url"
elif [ "$method" = "POST" ]; then
curl -s -X POST -u "${COUCHBASE_USER}:${COUCHBASE_PASSWORD}" \
-d "$data" "$url"
fi
}

# Execute Couchbase CLI command
cb_cli() {
local command=$@
/opt/couchbase/bin/couchbase-cli $command \
-c "${COUCHBASE_HOST}:${COUCHBASE_PORT}" \
-u "$COUCHBASE_USER" \
-p "$COUCHBASE_PASSWORD"
}

# Check cluster health
check_cluster_health() {
log_message "INFO" "Checking cluster health..."
local cluster_info=$(cb_api_call "/pools/default")
if [ -z "$cluster_info" ]; then
log_message "CRITICAL" "Cannot connect to Couchbase cluster"
return 1
fi
# Parse cluster status
local balanced=$(echo "$cluster_info" | jq -r '.balanced')
local rebalance_status=$(echo "$cluster_info" | jq -r '.rebalanceStatus')
local nodes_count=$(echo "$cluster_info" | jq '.nodes | length')
local healthy_nodes=$(echo "$cluster_info" | jq '[.nodes[] | select(.status == "healthy")] | length')
# Check if cluster is balanced
if [ "$balanced" != "true" ]; then
log_message "WARNING" "Cluster is not balanced"
recommend_action "REBALANCE" "Run cluster rebalance to distribute data evenly"
fi
# Check node health
if [ "$healthy_nodes" -lt "$nodes_count" ]; then
local unhealthy=$((nodes_count - healthy_nodes))
log_message "ERROR" "Found $unhealthy unhealthy nodes in cluster"
analyze_unhealthy_nodes
fi
# Check rebalance status
if [ "$rebalance_status" = "running" ]; then
log_message "INFO" "Rebalance in progress"
monitor_rebalance
fi
echo "$cluster_info"
}

# Analyze unhealthy nodes
analyze_unhealthy_nodes() {
# Emit each node as one compact JSON line so the while-read loop sees whole objects
local nodes=$(cb_api_call "/pools/default" | jq -c '.nodes[]')
echo "$nodes" | jq -r 'select(.status != "healthy") | .hostname' | while read -r node; do
log_message "WARNING" "Unhealthy node detected: $node"
# Check if node is reachable
if ! ping -c 1 -W 2 "$node" > /dev/null 2>&1; then
log_message "ERROR" "Node $node is unreachable"
recommend_action "NODE_DOWN" "Check network connectivity or node status for $node"
else
# Try to diagnose the issue
diagnose_node_issue "$node"
fi
done
}

# Diagnose node issues
diagnose_node_issue() {
local node=$1
log_message "INFO" "Diagnosing issues on node $node"
# Check services
local services=$(cb_api_call "/pools/default" | jq -r ".nodes[] | select(.hostname == \"$node\") | .services[]")
for service in $services; do
case $service in
"kv")
check_data_service "$node"
;;
"index")
check_index_service "$node"
;;
"n1ql")
check_query_service "$node"
;;
"fts")
check_search_service "$node"
;;
esac
done
}

# Check memory usage
check_memory_usage() {
log_message "INFO" "Checking memory usage..."
local nodes=$(cb_api_call "/pools/default")
# Emit each node as one compact JSON line so the while-read loop sees whole objects
echo "$nodes" | jq -c '.nodes[]' | while read -r node_data; do
local hostname=$(echo "$node_data" | jq -r '.hostname')
local mem_used=$(echo "$node_data" | jq -r '.systemStats.mem_actual_used')
local mem_total=$(echo "$node_data" | jq -r '.systemStats.mem_total')
local mem_percent=$(echo "scale=2; ($mem_used / $mem_total) * 100" | bc)
if (( $(echo "$mem_percent > $MEMORY_THRESHOLD" | bc -l) )); then
log_message "WARNING" "High memory usage on $hostname: ${mem_percent}%"
auto_heal_memory "$hostname"
fi
done
}

# Auto-heal memory issues
auto_heal_memory() {
local node=$1
log_message "INFO" "Attempting to auto-heal memory issues on $node"
# 1. Compact buckets
local buckets=$(cb_api_call "/pools/default/buckets" | jq -r '.[].name')
for bucket in $buckets; do
log_message "INFO" "Compacting bucket $bucket"
cb_api_call "/pools/default/buckets/$bucket/controller/compactBucket" "POST"
done
# 2. Clear expired documents
log_message "INFO" "Clearing expired documents"
cb_cli bucket-compact --bucket all
# 3. Adjust cache if needed
recommend_action "MEMORY" "Consider adjusting bucket memory quotas or adding more nodes"
}

# Check disk usage
check_disk_usage() {
log_message "INFO" "Checking disk usage..."
local nodes=$(cb_api_call "/pools/default")
echo "$nodes" | jq -r '.nodes[]' | while read -r node_data; do
local hostname=$(echo "$node_data" | jq -r '.hostname')
local disk_used=$(echo "$node_data" | jq -r '.systemStats.disk_used')
local disk_total=$(echo "$node_data" | jq -r '.systemStats.disk_total')
if [ "$disk_total" != "null" ] && [ "$disk_total" -gt 0 ]; then
local disk_percent=$(echo "scale=2; ($disk_used / $disk_total) * 100" | bc)
if (( $(echo "$disk_percent > $DISK_THRESHOLD" | bc -l) )); then
log_message "WARNING" "High disk usage on $hostname: ${disk_percent}%"
auto_heal_disk "$hostname"
fi
fi
done
}

# Auto-heal disk issues
auto_heal_disk() {
local node=$1
log_message "INFO" "Attempting to auto-heal disk issues on $node"
# 1. Trigger compaction
log_message "INFO" "Triggering auto-compaction"
cb_api_call "/controller/setAutoCompaction" "POST" \
"databaseFragmentationThreshold[percentage]=20&viewFragmentationThreshold[percentage]=20"
# 2. Clean up old logs
log_message "INFO" "Cleaning up old logs"
find /opt/couchbase/var/lib/couchbase/logs -type f -mtime +7 -delete 2>/dev/null
# 3. Recommend further actions
recommend_action "DISK" "Consider: 1) Increasing disk space, 2) Adjusting TTL values, 3) Archiving old data"
}

# Check bucket performance
check_bucket_performance() {
log_message "INFO" "Checking bucket performance..."
local buckets=$(cb_api_call "/pools/default/buckets")
echo "$buckets" | jq -r '.[].name' | while read bucket; do
local stats=$(cb_api_call "/pools/default/buckets/$bucket/stats")
# Check cache miss ratio
local cache_miss_rate=$(echo "$stats" | jq -r '.op.samples.ep_cache_miss_rate[-1]')
if (( $(echo "$cache_miss_rate > $CACHE_MISS_THRESHOLD" | bc -l) )); then
log_message "WARNING" "High cache miss rate in bucket $bucket: ${cache_miss_rate}%"
optimize_bucket_cache "$bucket"
fi
# Check disk queue
local disk_queue=$(echo "$stats" | jq -r '.op.samples.ep_queue_size[-1]')
if [ "$disk_queue" -gt 1000000 ]; then
log_message "WARNING" "Large disk queue in bucket $bucket: $disk_queue items"
recommend_action "PERFORMANCE" "Bucket $bucket has disk write backlog. Consider increasing writers or I/O capacity"
fi
# Check operation latency
check_operation_latency "$bucket" "$stats"
done
}

# Check operation latency
check_operation_latency() {
local bucket=$1
local stats=$2
# Get operation timings
local get_latency=$(echo "$stats" | jq -r '.op.samples.get_cmd_latency[-1]' 2>/dev/null)
local set_latency=$(echo "$stats" | jq -r '.op.samples.set_cmd_latency[-1]' 2>/dev/null)
if [ -n "$get_latency" ] && [ "$get_latency" != "null" ]; then
if (( $(echo "$get_latency > 1000" | bc -l) )); then
log_message "WARNING" "High GET latency in bucket $bucket: ${get_latency}μs"
recommend_action "LATENCY" "Consider: 1) Adding more nodes, 2) Optimizing queries, 3) Adding indexes"
fi
fi
}

# Optimize bucket cache
optimize_bucket_cache() {
local bucket=$1
log_message "INFO" "Optimizing cache for bucket $bucket"
# Get current bucket configuration
local bucket_info=$(cb_api_call "/pools/default/buckets/$bucket")
local ram_quota=$(echo "$bucket_info" | jq -r '.quota.ram')
local item_count=$(echo "$bucket_info" | jq -r '.basicStats.itemCount')
# Calculate optimal memory
local bytes_per_item=500 # Approximate
local optimal_ram=$((item_count * bytes_per_item / 1048576)) # Convert to MB
if [ "$optimal_ram" -gt "$ram_quota" ]; then
log_message "INFO" "Bucket $bucket needs more RAM. Current: ${ram_quota}MB, Recommended: ${optimal_ram}MB"
recommend_action "CACHE" "Increase RAM quota for bucket $bucket to ${optimal_ram}MB"
fi
}

# Check indexes
check_indexes() {
log_message "INFO" "Checking indexes..."
# Get index status
local indexes=$(cb_api_call "/indexStatus")
echo "$indexes" | jq -r '.indexes[]' | while read -r index_data; do
local index_name=$(echo "$index_data" | jq -r '.index')
local status=$(echo "$index_data" | jq -r '.status')
local progress=$(echo "$index_data" | jq -r '.progress')
if [ "$status" != "Ready" ]; then
log_message "WARNING" "Index $index_name is not ready: $status ($progress%)"
if [ "$status" = "Error" ]; then
rebuild_index "$index_name"
fi
fi
# Check fragmentation
check_index_fragmentation "$index_name"
done
}

# Check index fragmentation
check_index_fragmentation() {
local index=$1
local stats=$(cb_api_call "/pools/default/buckets/@index/stats")
local fragmentation=$(echo "$stats" | jq -r ".op.samples.index_fragmentation_${index}[-1]" 2>/dev/null)
if [ -n "$fragmentation" ] && [ "$fragmentation" != "null" ]; then
if (( $(echo "$fragmentation > $INDEX_FRAGMENTATION_THRESHOLD" | bc -l) )); then
log_message "WARNING" "High fragmentation in index $index: ${fragmentation}%"
rebuild_index "$index"
fi
fi
}

# Rebuild index
rebuild_index() {
local index=$1
log_message "INFO" "Rebuilding index $index"
# This would typically use N1QL
local query="ALTER INDEX \`$index\` REBUILD"
cb_api_call "/query/service" "POST" "statement=$query"
recommend_action "INDEX" "Index $index is being rebuilt due to issues"
}

# Check query service
check_query_service() {
local node=$1
log_message "INFO" "Checking query service on $node"
# Check active queries
local active_requests=$(cb_api_call "/pools/default/tasks" | jq -c '.tasks[] | select(.type == "n1ql")')
if [ -n "$active_requests" ]; then
echo "$active_requests" | while read -r request; do
local duration=$(echo "$request" | jq -r '.runtime')
if [ "$duration" -gt 60000 ]; then # More than 60 seconds
log_message "WARNING" "Long-running query detected: ${duration}ms"
analyze_slow_query "$request"
fi
done
fi
}

# Analyze slow queries
analyze_slow_query() {
local query_info=$1
log_message "INFO" "Analyzing slow query"
# Get query plan
local statement=$(echo "$query_info" | jq -r '.statement')
# Check for missing indexes
if echo "$statement" | grep -qi "WHERE\|JOIN" && ! echo "$statement" | grep -qi "USE INDEX"; then
recommend_action "QUERY" "Query may benefit from indexes. Review execution plan."
fi
# Check for full collection scans
if echo "$statement" | grep -qi "SELECT \*"; then
recommend_action "QUERY" "Avoid SELECT * queries. Specify required fields only."
fi
}

# Check XDCR (Cross Data Center Replication)
check_xdcr() {
log_message "INFO" "Checking XDCR status..."
local xdcr_tasks=$(cb_api_call "/pools/default/remoteClusters")
if [ "$(echo "$xdcr_tasks" | jq '. | length')" -gt 0 ]; then
echo "$xdcr_tasks" | jq -r '.[]' | while read -r remote; do
local name=$(echo "$remote" | jq -r '.name')
local hostname=$(echo "$remote" | jq -r '.hostname')
# Check connectivity
if ! nc -z "$hostname" 8091 2>/dev/null; then
log_message "ERROR" "XDCR remote cluster $name unreachable at $hostname"
recommend_action "XDCR" "Check network connectivity to remote cluster $name"
fi
# Check replication status
check_xdcr_replication "$name"
done
fi
}

# Check XDCR replication status
check_xdcr_replication() {
local remote=$1
local replications=$(cb_api_call "/pools/default/tasks" | jq -c '.tasks[] | select(.type == "xdcr")')
echo "$replications" | while read -r repl; do
local status=$(echo "$repl" | jq -r '.status')
local errors=$(echo "$repl" | jq -r '.errors')
if [ "$status" = "error" ] || [ "$errors" -gt 0 ]; then
log_message "ERROR" "XDCR replication to $remote has errors"
auto_heal_xdcr "$remote"
fi
done
}

# Auto-heal XDCR issues
auto_heal_xdcr() {
local remote=$1
log_message "INFO" "Attempting to auto-heal XDCR to $remote"
# Restart XDCR replication
cb_cli xdcr-replicate --pause --xdcr-replicator="$remote"
sleep 5
cb_cli xdcr-replicate --resume --xdcr-replicator="$remote"
recommend_action "XDCR" "XDCR replication to $remote was restarted. Monitor for improvements."
}

# Monitor rebalance
monitor_rebalance() {
local start_time=$(date +%s)
while true; do
local rebalance_status=$(cb_api_call "/pools/default/rebalanceProgress")
local status=$(echo "$rebalance_status" | jq -r '.status')
if [ "$status" = "none" ]; then
log_message "INFO" "Rebalance completed successfully"
break
fi
# Average the per-node progress values (keys in the response look like "ns_1@<host>")
local progress=$(echo "$rebalance_status" | jq -r '[to_entries[] | select(.key != "status") | .value.progress // 0] | if length > 0 then (add / length * 100) else 0 end')
log_message "INFO" "Rebalance progress: ${progress}%"
# Check timeout
local current_time=$(date +%s)
local elapsed=$((current_time - start_time))
if [ "$elapsed" -gt "$REBALANCE_TIMEOUT" ]; then
log_message "ERROR" "Rebalance timeout after ${elapsed} seconds"
recommend_action "REBALANCE" "Rebalance is taking too long. Check cluster resources."
break
fi
sleep 30
done
}

# Check backup status
check_backup_status() {
log_message "INFO" "Checking backup status..."
# Check for backup repository
if [ -d "/opt/couchbase/backup" ]; then
local latest_backup=$(find /opt/couchbase/backup -type d -name "20*" | sort -r | head -1)
if [ -n "$latest_backup" ]; then
local backup_age=$(find "$latest_backup" -maxdepth 0 -mmin +1440 2>/dev/null)
if [ -n "$backup_age" ]; then
log_message "WARNING" "Latest backup is more than 24 hours old"
recommend_action "BACKUP" "Schedule regular backups to prevent data loss"
fi
else
log_message "WARNING" "No backups found"
recommend_action "BACKUP" "Configure and schedule regular backups immediately"
fi
fi
}

# Recommend actions based on issues
recommend_action() {
local category=$1
local recommendation=$2
echo "[$(date '+%Y-%m-%d %H:%M:%S')] [$category] RECOMMENDATION: $recommendation" >> "${LOG_DIR}/recommendations.log"
print_color "$YELLOW" "📌 RECOMMENDATION [$category]: $recommendation"
# Add to report
echo "<div class='recommendation $category'>$recommendation</div>" >> "$REPORT_FILE"
}

# Generate health report
generate_health_report() {
log_message "INFO" "Generating health report..."
cat > "$REPORT_FILE" <<EOF
<!DOCTYPE html>
<html>
<head>
<title>Couchbase Health Report - $(date '+%Y-%m-%d %H:%M:%S')</title>
<style>
body { font-family: Arial, sans-serif; margin: 20px; }
h1 { color: #333; }
.status-ok { color: green; }
.status-warning { color: orange; }
.status-error { color: red; }
.metric { margin: 10px 0; padding: 10px; border-left: 3px solid #ddd; }
.recommendation { background: #fffbdd; padding: 10px; margin: 10px 0; border-left: 3px solid #ffa500; }
table { border-collapse: collapse; width: 100%; }
th, td { border: 1px solid #ddd; padding: 8px; text-align: left; }
th { background: #f2f2f2; }
</style>
</head>
<body>
<h1>Couchbase Cluster Health Report</h1>
<p>Generated: $(date '+%Y-%m-%d %H:%M:%S')</p>
<p>Cluster: $CLUSTER_NAME</p>
EOF
# Add cluster overview
local cluster_info=$(cb_api_call "/pools/default")
local node_count=$(echo "$cluster_info" | jq '.nodes | length')
local balanced=$(echo "$cluster_info" | jq -r '.balanced')
cat >> "$REPORT_FILE" <<EOF
<h2>Cluster Overview</h2>
<table>
<tr><th>Metric</th><th>Value</th><th>Status</th></tr>
<tr><td>Total Nodes</td><td>$node_count</td><td class="status-ok">OK</td></tr>
<tr><td>Balanced</td><td>$balanced</td><td class="$([ "$balanced" = "true" ] && echo "status-ok" || echo "status-warning")">$([ "$balanced" = "true" ] && echo "OK" || echo "NEEDS REBALANCE")</td></tr>
</table>
EOF
# Add node details
cat >> "$REPORT_FILE" <<EOF
<h2>Node Status</h2>
<table>
<tr><th>Hostname</th><th>Status</th><th>Services</th><th>CPU %</th><th>Memory %</th><th>Disk %</th></tr>
EOF
echo "$cluster_info" | jq -r '.nodes[]' | while read -r node_data; do
local hostname=$(echo "$node_data" | jq -r '.hostname')
local status=$(echo "$node_data" | jq -r '.status')
local services=$(echo "$node_data" | jq -r '.services | join(", ")')
local cpu=$(echo "$node_data" | jq -r '.systemStats.cpu_utilization_rate // 0')
local mem_used=$(echo "$node_data" | jq -r '.systemStats.mem_actual_used // 0')
local mem_total=$(echo "$node_data" | jq -r '.systemStats.mem_total // 1')
local mem_percent=$(echo "scale=2; ($mem_used / $mem_total) * 100" | bc)
cat >> "$REPORT_FILE" <<EOF
<tr>
<td>$hostname</td>
<td class="$([ "$status" = "healthy" ] && echo "status-ok" || echo "status-error")">$status</td>
<td>$services</td>
<td>$cpu%</td>
<td>$mem_percent%</td>
<td>N/A</td>
</tr>
EOF
done
cat >> "$REPORT_FILE" <<EOF
</table>
<h2>Recommendations</h2>
$(cat "${LOG_DIR}/recommendations.log" 2>/dev/null | tail -20 | sed 's/^/<p>/;s/$/<\/p>/')
</body>
</html>
EOF
log_message "INFO" "Health report generated: $REPORT_FILE"
}

# Main monitoring loop
main_monitoring_loop() {
print_color "$GREEN" "=========================================="
print_color "$GREEN" "Couchbase Intelligent Monitor v1.0"
print_color "$GREEN" "Cluster: $CLUSTER_NAME"
print_color "$GREEN" "=========================================="
while true; do
print_color "$BLUE" "\n[$(date '+%Y-%m-%d %H:%M:%S')] Starting monitoring cycle..."
# Core health checks
check_cluster_health
check_memory_usage
check_disk_usage
check_bucket_performance
check_indexes
check_query_service
check_xdcr
check_backup_status
# Generate report every hour
if [ $(($(date +%s) % 3600)) -lt "$MONITOR_INTERVAL" ]; then
generate_health_report
fi
print_color "$GREEN" "Monitoring cycle complete. Next check in ${MONITOR_INTERVAL} seconds..."
sleep "$MONITOR_INTERVAL"
done
}

# Signal handlers
trap 'log_message "INFO" "Monitor stopped by user"; exit 0' SIGINT SIGTERM

# Check prerequisites
check_prerequisites() {
# Check for required tools
for tool in curl jq bc nc; do
if ! command -v $tool &> /dev/null; then
print_color "$RED" "ERROR: Required tool '$tool' is not installed"
exit 1
fi
done
# Check Couchbase CLI
if [ ! -f "/opt/couchbase/bin/couchbase-cli" ]; then
print_color "$YELLOW" "WARNING: Couchbase CLI not found at default location"
fi
# Test connection
if ! cb_api_call "/pools" > /dev/null 2>&1; then
print_color "$RED" "ERROR: Cannot connect to Couchbase at ${COUCHBASE_HOST}:${COUCHBASE_PORT}"
print_color "$YELLOW" "Please check connection settings and credentials"
exit 1
fi
}

# Start monitoring
log_message "INFO" "Starting Couchbase Intelligent Monitor"
check_prerequisites
main_monitoring_loop