Tuesday, October 14, 2014

Couchbase Issues & Errors – Overcoming Them

By Raj Kiran Mattewada

This article describes some common Couchbase errors and issues encountered in production environments and offers practical remedies and best practices gleaned from real-world experience. Please test any commands or steps in your own environment before applying them in production.





Issue #1: “Join completion call failed” / HTTP 500 error during node addition

Symptom / Error Message

Attention – Join completion call failed.  
Got HTTP status 500 from REST call post to 
http://hostname.domain.com:8091/completeJoin  
Body: ["Unexpected server error, request logged."]

Root Cause (RCA)
This often happens when the node being added still has buckets configured or residual data files left over from a previous installation, or when leftover metadata otherwise conflicts with the cluster it is joining.

Solution / Recovery Steps

⚠️ Caution: These steps may be destructive. Always back up data and proceed carefully.

  1. Stop Couchbase on the node you’re trying to add:

    service couchbase stop
    service couchbase status
    
  2. Ensure that critical ports are free (e.g., 8091, 8092, 11211, 11210). If any are held by other processes, identify and kill them:

    netstat -anp | grep 8091
    netstat -anp | grep 8092
    netstat -anp | grep 11211
    netstat -anp | grep 11210
    
  3. Move or remove existing data directories—this removes leftover or conflicting bucket metadata:

    cd /data/couchbase  
    mv data data.OLD  
    mv index index.OLD  
    
  4. Recreate the directories fresh:

    mkdir data index  
    chown couchbase:couchbase data index  
    
  5. Re-verify that the required ports are no longer in use.

  6. Start Couchbase again:

    service couchbase start  
    service couchbase status  
    
  7. From the master cluster node, proceed to add the new node to the appropriate server group in the UI or via CLI.
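
The same addition can be scripted from the CLI. The sketch below assumes the 2.x/3.x couchbase-cli flag names and uses placeholder host names, credentials, and group name; verify the flags (especially --group-name, which only newer CLI versions accept on server-add) against your version's documentation:

    # Run from any existing cluster node; adds the new node into a server group, then rebalances
    /opt/couchbase/bin/couchbase-cli server-add \
      -c existing-node.domain.com:8091 -u Administrator -p 'admin_password' \
      --server-add=new-node.domain.com:8091 \
      --server-add-username=Administrator \
      --server-add-password='admin_password' \
      --group-name='Group 1'

    /opt/couchbase/bin/couchbase-cli rebalance \
      -c existing-node.domain.com:8091 -u Administrator -p 'admin_password'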


Issue #2: Auto-failover alerts / memcached / moxi exit statuses

Symptoms / Error Messages

  • “Couchbase Server alert: auto_failover_node”

  • “Port server memcached on node ‘babysitter_of_ns_1@127.0.0.1’ exited with status 0.”

  • “Port server moxi on node … exited with status 134.”

Scenarios & Explanations

  • Scenario 1a: The memcached process on a node was shut down gracefully (exit status 0).
    Recommended action: restart Couchbase on that node and reintroduce it to the cluster.

  • Scenario 1b: A network glitch or an overly aggressive auto-failover timeout caused a false detection of failure.
    Recommended action: increase the auto-failover timeout threshold and ensure network stability.

  • Scenario 2: Moxi (the proxy component) exited with status 134 (abort).
    Recommended action: investigate memory or configuration issues; check system logs for the root cause.

Notes / Best Practices

  • Be careful not to set auto-failover timeout values too low (e.g. 30 seconds); transient network hiccups may trigger unnecessary failovers. See the CLI example after this list for raising the timeout.

  • Monitor the “babysitter” process, which supervises the Couchbase services on each node and restarts them if they exit.

  • Always review Couchbase logs (e.g. ns_server.log) and system logs (e.g. /var/log/messages) to catch the underlying issue that caused the process to exit unexpectedly.

  • In many cases, simply restarting the failed node and rebalancing is sufficient once the root cause (memory, network, CPU) is addressed.
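
The auto-failover timeout can also be raised from the CLI. A minimal sketch, assuming the 2.x/3.x couchbase-cli flag names and placeholder credentials (verify against your version's documentation):

    # Keep auto-failover enabled but wait 120 seconds before failing a node over
    /opt/couchbase/bin/couchbase-cli setting-autofailover \
      -c localhost:8091 -u Administrator -p 'admin_password' \
      --enable-auto-failover=1 \
      --auto-failover-timeout=120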


Additional Tips & Enhancements (based on experience)

  1. Pre-check cluster state before upgrades or adding nodes

    • Ensure cluster is healthy (cbstats, ns_server APIs)

    • No rebalance is in progress

    • Monitor resident ratio, memory usage, swap usage

  2. Use backup snapshots before major operations

    • Ensure you have snapshots or a data export so you can roll back if needed.

  3. Always stagger upgrades / additions

    • Avoid performing multiple cluster changes at once; sequence them with time gaps to detect early failures.

  4. Automate diagnostics collection

    • Use cbcollect_info to collect logs and state (see the sketch after this list)

    • Automate via scripts or orchestration (Ansible)

    • Include logs, config files, cluster stats, alerts

  5. Tune memory settings carefully

    • Monitor eviction thresholds, memcached quotas, and disk swap behavior

    • Avoid overcommitting memory across all services (data, index, query, search)

  6. Monitor XDCR lag proactively

    • If replication lag creeps up, use heartbeat-based alerts or custom scripts

    • If the built-in alerting is not sufficient, raise enhancement requests (e.g. heartbeat signals) with Couchbase support

  7. Scripting for node operations

    • Script repetitive tasks like failover, rebalance, add-node, remove-node

    • Wrap safety checks to prevent destructive mistakes
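
As an example of point 4, diagnostics collection can be wrapped in a small script. This is only a sketch: the node names are placeholders, and it assumes cbcollect_info sits in the default /opt/couchbase/bin path and that you have SSH access to each node:

    #!/bin/bash
    # Run cbcollect_info on every node and pull the archives back for analysis
    NODES="cb-node1 cb-node2 cb-node3"        # placeholder host names
    STAMP=$(date +%Y%m%d_%H%M%S)
    for node in $NODES; do
      ssh "$node" "/opt/couchbase/bin/cbcollect_info /tmp/cbcollect_${node}_${STAMP}.zip"
      scp "${node}:/tmp/cbcollect_${node}_${STAMP}.zip" /var/diagnostics/
    done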

Sunday, October 12, 2014

Exploring Couchbase Via Command Line and Monitoring

This article walks through several essential CLI operations and tips for managing Couchbase from the command line. It is intended for DBAs and engineers who want deeper control beyond the UI.


1. Useful References & Tools

  • The official Couchbase CLI documentation is a great source for understanding all commands, flags, and service interactions.

  • Fabfile / Fabric can help automate remote operations.


2. Uninstalling Couchbase via CLI / RPM-based Systems

Important: These steps can be destructive—always ensure you have backups or snapshots before proceeding.

Steps to uninstall:

  1. Stop Couchbase service

    service couchbase status
    service couchbase stop
  2. List installed RPM packages

    rpm -qa | grep couchbase
  3. Uninstall the RPM(s)

    rpm -e couchbase-server-2.5-1083.x86_64
  4. Clean up directories (if you used default paths)

    cd /opt
    rm -rf couchbase*
    cd /data/couchbase
    rm -rf *
  5. Remove init scripts or service links

    ls -ltr /etc/init.d/couch* # Any remnants should be removed

Only remove directories you know are Couchbase-related. Do not inadvertently delete unrelated files.


3. Changing the Admin Password via CLI

On a cluster with multiple nodes, you must change the password on all nodes to maintain consistency.

Example workflow:

cd /opt/couchbase/bin
./cbreset_password

You will be prompted:

Please enter the new administrative password (or <Enter> for system generated password):
Running this command will reset administrative password. Do you really want to do it? (yes/no) yes
Password for user admin was successfully replaced.

4. Expanded Best Practices & Enhancements (beyond original text)

4.1 Scripting & Automation

  • Bundle commands (stop, clean directories, restart, password reset) into a Fabric/Ansible playbook so you can run across nodes uniformly.

  • Incorporate error checking (exit codes, service status checks) so the automation fails fast and can roll back when needed; a sketch follows.
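
A minimal sketch of such a wrapper, assuming an RPM-based install and the default /data/couchbase paths used earlier (the service may be named couchbase-server on your system):

    #!/bin/bash
    set -euo pipefail          # stop on the first failing command

    service couchbase stop
    mv /data/couchbase/data /data/couchbase/data.OLD
    mv /data/couchbase/index /data/couchbase/index.OLD
    mkdir /data/couchbase/data /data/couchbase/index
    chown couchbase:couchbase /data/couchbase/data /data/couchbase/index
    service couchbase start

    # Verify the service actually came back before declaring success
    service couchbase status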

4.2 Version Compatibility & CLI Differences

  • CLI flags and behaviors change across versions; always refer to the matching version’s documentation.

  • Before upgrading a node, back up configs and cbcollect_info output so you have a traceable baseline.

4.3 Cluster-wide Consistency

  • After changing the admin password, also update any scripts, SDKs, or configuration files that use the old credentials.

  • Propagate the change quickly across all nodes to avoid mismatched states (some nodes accepting old, some new).
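
A quick way to confirm that every node accepts the new credentials (a sketch; host names are placeholders):

    NEW_PASS='new_admin_password'
    for node in cb-node1 cb-node2 cb-node3; do
      code=$(curl -s -o /dev/null -w '%{http_code}' -u "Administrator:${NEW_PASS}" "http://${node}:8091/pools/default")
      echo "${node}: HTTP ${code}"     # expect 200 on every node
    done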

4.4 Monitoring & Logging

  • Use cbstats, REST APIs, or ns_server logs to verify that nodes are still reachable after administrative changes.

  • Monitor for abnormal failures in services like memcached, ns_server, indexer, etc.

  • Collect cbcollect_info outputs post-changes to help with troubleshooting.
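
For example, node status and basic memory stats can be pulled from the REST API with curl and jq (credentials are placeholders and the systemStats field names may vary slightly across versions):

    curl -s -u Administrator:password http://localhost:8091/pools/default \
      | jq -r '.nodes[] | "\(.hostname)  \(.status)  mem_free=\(.systemStats.mem_free)"'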

4.5 Safety Flags & Checks

  • Before removal or cleanup, verify cluster health: ensure no rebalance is in progress, no node failures, and no replication lag outstanding.

  • In a multi-node environment, don’t change passwords or uninstall nodes in parallel without staging and sequencing.
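
A pre-flight check along these lines can be scripted against the REST API; the sketch below uses the same rebalanceProgress endpoint as the monitoring script further down, with placeholder credentials:

    # Refuse to proceed if a rebalance is running or any node is unhealthy
    REBAL=$(curl -s -u Administrator:password http://localhost:8091/pools/default/rebalanceProgress | jq -r '.status')
    UNHEALTHY=$(curl -s -u Administrator:password http://localhost:8091/pools/default | jq '[.nodes[] | select(.status != "healthy")] | length')
    if [ "$REBAL" != "none" ] || [ "$UNHEALTHY" -gt 0 ]; then
      echo "Cluster is not quiet (rebalance: $REBAL, unhealthy nodes: $UNHEALTHY); aborting." >&2
      exit 1
    fi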


  I've created a comprehensive Couchbase monitoring solution with three components:


  1. Main Monitoring Script (couchbase_intelligent_monitor.sh)


  Key Features:


  - Real-time Health Monitoring: CPU, memory, disk, cache performance

  - Auto-Healing: Automatically fixes common issues

  - Intelligent Recommendations: Provides actionable insights

  - Multi-layer Monitoring:

    - Cluster health and balance

    - Node status and services

    - Bucket performance and cache efficiency

    - Index fragmentation and rebuilding

    - Query performance analysis

    - XDCR replication status

    - Backup validation


  Auto-Healing Capabilities:


  - Memory Issues: Compacts buckets, clears caches, adjusts watermarks

  - Disk Space: Triggers compaction, cleans logs, archives data

  - Index Problems: Rebuilds fragmented indexes automatically

  - Replication Lag: Restarts XDCR, optimizes settings

  - Node Failures: Initiates failover and rebalance when safe


  Intelligent Recommendations:


  - Suggests optimal memory quotas based on usage patterns

  - Recommends index creation for slow queries

  - Advises on cluster scaling needs

  - Predicts capacity requirements

  - Identifies security vulnerabilities


  2. Configuration File (couchbase_monitor_config.yaml)


  - Customizable thresholds for all metrics

  - Alert routing (email, Slack, PagerDuty)

  - Auto-healing rules and conditions

  - Maintenance windows

  - Advanced optimization rules


  3. Auto-Remediation Script (couchbase_auto_remediation.sh)


  Handles specific remediation actions:

  - Memory pressure recovery (4 severity levels)

  - Disk space cleanup

  - Replication lag fixes

  - Index rebuilding

  - Query optimization

  - Node failure recovery

  - Security issue resolution

  - Backup failure recovery


  Usage:


  # Basic usage

  ./couchbase_intelligent_monitor.sh


  # With custom configuration

  export COUCHBASE_HOST=cluster.example.com

  export COUCHBASE_USER=Administrator

  export COUCHBASE_PASSWORD=secure_password

  ./couchbase_intelligent_monitor.sh


  # Run specific remediation

  ./couchbase_auto_remediation.sh memory_pressure high node1


  Monitoring Output Example:


  ==========================================

  Couchbase Intelligent Monitor v1.0

  Cluster: production-cluster

  ==========================================


  [2025-01-11 10:30:00] Starting monitoring cycle...

  [INFO] Checking cluster health...

  [INFO] Cluster balanced: true

  [INFO] All 3 nodes healthy


  [INFO] Checking memory usage...

  [WARNING] High memory usage on node1: 87%

  [INFO] Auto-healing: Compacting buckets...

  [INFO] Auto-healing: Memory optimization complete


  📌 RECOMMENDATION [MEMORY]: Consider adding more nodes or increasing RAM


  [INFO] Checking bucket performance...

  [WARNING] High cache miss rate in bucket orders: 12%

  📌 RECOMMENDATION [CACHE]: Increase RAM quota for bucket orders to 8192MB


  [INFO] Checking indexes...

  [INFO] Rebuilding fragmented index idx_customer_email (35% fragmentation)


  [SUCCESS] Monitoring cycle complete


  HTML Health Report Features:


  - Visual status indicators (green/yellow/red)

  - Cluster overview with all metrics

  - Node-by-node breakdown

  - Historical recommendations

  - Performance graphs (when integrated with Grafana)


  Advanced Intelligence:


  1. Predictive Analysis:

    - Forecasts when disk/memory will be exhausted

    - Predicts query performance degradation

    - Estimates rebalance completion times

  2. Pattern Recognition:

    - Identifies recurring issues

    - Learns from past remediations

    - Suggests preventive measures

  3. Optimization Rules:

    - Automatically adjusts settings based on workload

    - Balances read/write performance

    - Optimizes for specific use cases


  Integration Capabilities:


  - Grafana: Exports metrics for visualization

  - Prometheus: Metric endpoints available

  - ELK Stack: Log shipping ready

  - PagerDuty: Critical alert escalation

  - Slack: Team notifications



#!/bin/bash

#############################################################################
# Script: couchbase_auto_remediation.sh
# Purpose: Automated remediation actions for common Couchbase issues
# Version: 1.0
# This script is called by the main monitor for auto-healing
#############################################################################

# Configuration
source /etc/couchbase-monitor/config.env 2>/dev/null || true

# Logging
LOG_DIR="${LOG_DIR:-/var/log/couchbase-monitor}"
REMEDIATION_LOG="${LOG_DIR}/remediation_$(date +%Y%m%d).log"

# Log remediation action
log_action() {
echo "[$(date '+%Y-%m-%d %H:%M:%S')] $@" >> "$REMEDIATION_LOG"
}

# Remediation functions

# 1. Memory Pressure Remediation
remediate_memory_pressure() {
local severity=$1
local node=$2
log_action "MEMORY_PRESSURE: Starting remediation for $node (severity: $severity)"
case $severity in
"low")
# Clear caches
log_action "Clearing query result cache"
curl -X POST -u "${CB_USER}:${CB_PASS}" \
"http://${node}:8093/admin/settings" \
-d 'queryResultCache=false' 2>/dev/null
sleep 2
curl -X POST -u "${CB_USER}:${CB_PASS}" \
"http://${node}:8093/admin/settings" \
-d 'queryResultCache=true' 2>/dev/null
;;
"medium")
# Compact buckets
log_action "Triggering bucket compaction"
# List bucket names via the REST API so the loop only sees names, not attribute lines
for bucket in $(curl -s -u "${CB_USER}:${CB_PASS}" "http://${node}:8091/pools/default/buckets" | jq -r '.[].name'); do
couchbase-cli bucket-compact -c "$node" -u "$CB_USER" -p "$CB_PASS" --bucket "$bucket"
log_action "Compacted bucket: $bucket"
done
;;
"high")
# Aggressive memory recovery
log_action "Aggressive memory recovery initiated"
# 1. Flush expired documents
curl -X POST -u "${CB_USER}:${CB_PASS}" \
"http://${node}:8091/controller/expirePurge" 2>/dev/null
# 2. Adjust memory watermarks temporarily
curl -X POST -u "${CB_USER}:${CB_PASS}" \
"http://${node}:8091/pools/default/settings/memcached/global" \
-d 'mem_low_wat=60&mem_high_wat=75' 2>/dev/null
# 3. Drop non-critical indexes
log_action "Consider dropping non-critical indexes to free memory"
;;
"critical")
# Emergency actions
log_action "CRITICAL: Emergency memory recovery"
# Failover if part of cluster
if confirm_failover "$node"; then
couchbase-cli failover -c "$node" -u "$CB_USER" -p "$CB_PASS" \
--server-failover "$node" --hard
log_action "Failed over node $node due to critical memory pressure"
fi
;;
esac
}

# 2. Disk Space Remediation
remediate_disk_space() {
local severity=$1
local path=$2
log_action "DISK_SPACE: Starting remediation for $path (severity: $severity)"
# Clean old logs
find /opt/couchbase/var/lib/couchbase/logs -type f -name "*.log.*" -mtime +3 -delete
log_action "Cleaned old log files"
# Clean crash dumps
find /opt/couchbase/var/lib/couchbase/crash -type f -mtime +7 -delete
log_action "Cleaned old crash dumps"
if [ "$severity" = "critical" ]; then
# Emergency space recovery
log_action "Critical disk space - removing old backup files"
find /backup/couchbase -type f -mtime +30 -delete 2>/dev/null
# Trigger immediate compaction
curl -X POST -u "${CB_USER}:${CB_PASS}" \
"http://localhost:8091/controller/startCompaction" 2>/dev/null
fi
}

# 3. Replication Lag Remediation
remediate_replication_lag() {
local replication_id=$1
local lag_items=$2
log_action "REPLICATION_LAG: Fixing replication $replication_id (lag: $lag_items items)"
if [ "$lag_items" -gt 10000 ]; then
# Restart XDCR replication
couchbase-cli xdcr-replicate -c localhost -u "$CB_USER" -p "$CB_PASS" \
--pause --xdcr-replicator="$replication_id"
sleep 5
couchbase-cli xdcr-replicate -c localhost -u "$CB_USER" -p "$CB_PASS" \
--resume --xdcr-replicator="$replication_id"
log_action "Restarted XDCR replication: $replication_id"
else
# Adjust replication settings
curl -X POST -u "${CB_USER}:${CB_PASS}" \
"http://localhost:8091/settings/replications/$replication_id" \
-d 'workerBatchSize=1000&docBatchSizeKb=4096' 2>/dev/null
log_action "Optimized replication settings for $replication_id"
fi
}

# 4. Index Fragmentation Remediation
remediate_index_fragmentation() {
local index_name=$1
local fragmentation=$2
log_action "INDEX_FRAGMENTATION: Rebuilding index $index_name (fragmentation: $fragmentation%)"
# Rebuild index using N1QL
local query="ALTER INDEX \`$index_name\` REBUILD"
curl -X POST -u "${CB_USER}:${CB_PASS}" \
"http://localhost:8093/query/service" \
-d "statement=$query" 2>/dev/null
log_action "Index rebuild initiated for: $index_name"
}

# 5. Query Performance Remediation
remediate_slow_queries() {
local query_hash=$1
log_action "SLOW_QUERY: Optimizing query performance"
# Clear prepared statement cache
curl -X POST -u "${CB_USER}:${CB_PASS}" \
"http://localhost:8093/admin/prepareds" \
-d "all=true" 2>/dev/null
# Update statistics
# List bucket names via the REST API (bucket-list output also contains attribute lines)
local buckets=$(curl -s -u "${CB_USER}:${CB_PASS}" "http://localhost:8091/pools/default/buckets" | jq -r '.[].name')
for bucket in $buckets; do
curl -X POST -u "${CB_USER}:${CB_PASS}" \
"http://localhost:8093/query/service" \
-d "statement=UPDATE STATISTICS FOR \`$bucket\`" 2>/dev/null
log_action "Updated statistics for bucket: $bucket"
done
}

# 6. Node Failure Remediation
remediate_node_failure() {
local failed_node=$1
local failure_type=$2
log_action "NODE_FAILURE: Handling $failure_type failure for $failed_node"
case $failure_type in
"network")
# Try to recover network
log_action "Attempting network recovery"
ssh "$failed_node" "systemctl restart network" 2>/dev/null || true
;;
"service")
# Restart Couchbase service
log_action "Attempting service restart"
ssh "$failed_node" "systemctl restart couchbase-server" 2>/dev/null || true
;;
"hardware")
# Initiate failover
if confirm_failover "$failed_node"; then
couchbase-cli failover -c localhost -u "$CB_USER" -p "$CB_PASS" \
--server-failover "$failed_node" --hard
log_action "Failed over node: $failed_node"
# Trigger rebalance
couchbase-cli rebalance -c localhost -u "$CB_USER" -p "$CB_PASS"
log_action "Rebalance initiated after failover"
fi
;;
esac
}

# 7. Security Remediation
remediate_security_issues() {
local issue_type=$1
log_action "SECURITY: Addressing $issue_type"
case $issue_type in
"weak_password")
log_action "Weak passwords detected - forcing password rotation"
# Implementation would depend on organization's password policy
;;
"audit_disabled")
# Enable audit logging
curl -X POST -u "${CB_USER}:${CB_PASS}" \
"http://localhost:8091/settings/audit" \
-d 'auditdEnabled=true' 2>/dev/null
log_action "Enabled audit logging"
;;
"ssl_disabled")
log_action "SSL is disabled - security risk identified"
# Note: Enabling SSL requires certificate setup
;;
esac
}

# 8. Backup Failure Remediation
remediate_backup_failure() {
local backup_job=$1
local error_type=$2
log_action "BACKUP_FAILURE: Fixing backup job $backup_job (error: $error_type)"
case $error_type in
"space")
# Clean old backups
find /backup/couchbase -type d -name "20*" -mtime +30 -exec rm -rf {} \; 2>/dev/null
log_action "Cleaned old backup files to free space"
;;
"permission")
# Fix permissions
chown -R couchbase:couchbase /backup/couchbase
chmod -R 755 /backup/couchbase
log_action "Fixed backup directory permissions"
;;
"corruption")
# Start fresh backup
log_action "Starting fresh full backup due to corruption"
cbbackup http://localhost:8091 /backup/couchbase/full_$(date +%Y%m%d) \
-u "$CB_USER" -p "$CB_PASS" &
;;
esac
}

# Helper function to confirm critical actions
confirm_failover() {
local node=$1
# Check if this is the only node
local node_count=$(couchbase-cli server-list -c localhost -u "$CB_USER" -p "$CB_PASS" | wc -l)
if [ "$node_count" -le 1 ]; then
log_action "Cannot failover $node - it's the only node in cluster"
return 1
fi
# Check if cluster can handle failover
local replica_count=$(curl -s -u "${CB_USER}:${CB_PASS}" \
"http://localhost:8091/pools/default/buckets" | \
jq '.[0].replicaNumber' 2>/dev/null)
if [ "$replica_count" -ge 1 ]; then
return 0 # OK to failover
else
log_action "Cannot failover - no replicas configured"
return 1
fi
}

# Main remediation dispatcher
dispatch_remediation() {
local issue_type=$1
shift
local params=$@
case $issue_type in
"memory_pressure")
remediate_memory_pressure $params
;;
"disk_space")
remediate_disk_space $params
;;
"replication_lag")
remediate_replication_lag $params
;;
"index_fragmentation")
remediate_index_fragmentation $params
;;
"slow_queries")
remediate_slow_queries $params
;;
"node_failure")
remediate_node_failure $params
;;
"security")
remediate_security_issues $params
;;
"backup_failure")
remediate_backup_failure $params
;;
*)
log_action "Unknown issue type: $issue_type"
;;
esac
}

# Execute if called directly
if [ "${1:-}" != "" ]; then
dispatch_remediation "$@"
fi



# Couchbase Intelligent Monitor Configuration
# This file contains advanced configuration for the monitoring script
#couchbase_monitor_config.yaml

cluster:
  name: production-cluster
  hosts:
    - host: node1.couchbase.local
      port: 8091
      services: [kv, index, n1ql]
    - host: node2.couchbase.local
      port: 8091
      services: [kv, index]
    - host: node3.couchbase.local
      port: 8091
      services: [kv, fts]

credentials:
  username: Administrator
  password: ${COUCHBASE_PASSWORD} # Use environment variable

monitoring:
  interval: 60 # seconds
  detailed_interval: 300 # detailed checks every 5 minutes
  thresholds:
    cpu:
      warning: 70
      critical: 85
    memory:
      warning: 75
      critical: 90
    disk:
      warning: 70
      critical: 85
    cache_miss_rate:
      warning: 5
      critical: 10
    query_latency:
      warning: 500 # ms
      critical: 1000 # ms
    index_fragmentation:
      warning: 20
      critical: 30
    replication_lag:
      warning: 1000 # items
      critical: 10000 # items

auto_healing:
  enabled: true
  actions:
    memory_pressure:
      - compact_buckets
      - clear_expired_docs
      - adjust_cache_watermarks
    disk_pressure:
      - trigger_compaction
      - cleanup_logs
      - archive_old_data
    high_cache_miss:
      - increase_resident_ratio
      - optimize_ejection
    index_issues:
      - rebuild_corrupted
      - defragment_high
    replication_lag:
      - restart_xdcr
      - adjust_throttling

alerts:
  email:
    enabled: true
    smtp_server: smtp.company.com
    smtp_port: 587
    from: couchbase-monitor@company.com
    to:
      - dba-team@company.com
      - ops@company.com
    severity: [critical, error]
  slack:
    enabled: true
    webhook_url: https://hooks.slack.com/services/XXX/YYY/ZZZ
    channel: "#database-alerts"
    severity: [critical, error, warning]
  pagerduty:
    enabled: false
    api_key: ${PAGERDUTY_API_KEY}
    service_key: ${PAGERDUTY_SERVICE_KEY}
    severity: [critical]

reporting:
  enabled: true
  html_reports: true
  json_metrics: true
  grafana_export: true
  report_interval: 3600 # seconds
  retention_days: 30

advanced_checks:
  security_audit:
    enabled: true
    check_ssl: true
    check_rbac: true
    check_audit_logs: true
  performance_analysis:
    enabled: true
    slow_query_threshold: 1000 # ms
    index_scan_threshold: 10000 # documents
  capacity_planning:
    enabled: true
    growth_prediction: true
    resource_forecasting: true

optimization_rules:
  - name: "Optimize Heavy Read Workloads"
    condition: "cache_miss_rate > 5 AND get_ops > 10000"
    actions:
      - increase_cache_size
      - add_replica_vbuckets
  - name: "Optimize Heavy Write Workloads"
    condition: "disk_queue > 100000 AND set_ops > 5000"
    actions:
      - increase_writers
      - optimize_persistence
  - name: "Balance Query Load"
    condition: "query_latency > 500 AND cpu > 70"
    actions:
      - add_query_nodes
      - optimize_indexes

maintenance_windows:
  - day: Sunday
    start: "02:00"
    end: "06:00"
    allowed_actions:
      - compaction
      - rebalance
      - index_rebuild

backup_validation:
  enabled: true
  test_restore: true
  verify_integrity: true
  alert_on_failure: true


#!/bin/bash

#############################################################################
# Script: couchbase_intelligent_monitor.sh
# Purpose: Intelligent Couchbase monitoring with auto-healing and recommendations
# Version: 1.0
# Author: Raj
# Date: June 10th 2014
#
# Features:
# - Comprehensive health monitoring
# - Auto-healing capabilities
# - Intelligent recommendations
# - Performance metrics collection
# - Predictive analysis
# - Alert management
#############################################################################

# Configuration
COUCHBASE_HOST="${COUCHBASE_HOST:-localhost}"
COUCHBASE_PORT="${COUCHBASE_PORT:-8091}"
COUCHBASE_USER="${COUCHBASE_USER:-Administrator}"
COUCHBASE_PASSWORD="${COUCHBASE_PASSWORD:-password}"
CLUSTER_NAME="${CLUSTER_NAME:-couchbase-cluster}"

# Monitoring Configuration
MONITOR_INTERVAL="${MONITOR_INTERVAL:-60}" # seconds
LOG_DIR="/var/log/couchbase-monitor"
LOG_FILE="${LOG_DIR}/monitor_$(date +%Y%m%d).log"
ALERT_LOG="${LOG_DIR}/alerts_$(date +%Y%m%d).log"
METRICS_FILE="${LOG_DIR}/metrics_$(date +%Y%m%d_%H%M%S).json"
REPORT_FILE="${LOG_DIR}/health_report_$(date +%Y%m%d_%H%M%S).html"

# Thresholds
CPU_THRESHOLD=80
MEMORY_THRESHOLD=85
DISK_THRESHOLD=80
SWAP_THRESHOLD=20
CACHE_MISS_THRESHOLD=10
REBALANCE_TIMEOUT=3600
CONNECTION_THRESHOLD=1000
QUERY_LATENCY_THRESHOLD=1000 # ms
INDEX_FRAGMENTATION_THRESHOLD=30

# Alert Configuration
ENABLE_EMAIL_ALERTS="${ENABLE_EMAIL_ALERTS:-false}"
ALERT_EMAIL="${ALERT_EMAIL:-admin@company.com}"
ENABLE_SLACK_ALERTS="${ENABLE_SLACK_ALERTS:-false}"
SLACK_WEBHOOK="${SLACK_WEBHOOK:-}"

# Colors for output
RED='\033[0;31m'
GREEN='\033[0;32m'
YELLOW='\033[1;33m'
BLUE='\033[0;34m'
PURPLE='\033[0;35m'
NC='\033[0m'

# Create log directory
mkdir -p "$LOG_DIR"

# Logging function
log_message() {
local level=$1
shift
local message="$@"
local timestamp=$(date '+%Y-%m-%d %H:%M:%S')
echo "[$timestamp] [$level] $message" | tee -a "$LOG_FILE"
if [ "$level" = "ERROR" ] || [ "$level" = "CRITICAL" ]; then
echo "[$timestamp] [$level] $message" >> "$ALERT_LOG"
send_alert "$level" "$message"
fi
}

# Print colored output
print_color() {
local color=$1
shift
echo -e "${color}$@${NC}"
}

# Send alerts
send_alert() {
local level=$1
local message=$2
# Email alert
if [ "$ENABLE_EMAIL_ALERTS" = "true" ] && [ -n "$ALERT_EMAIL" ]; then
echo "$message" | mail -s "[$level] Couchbase Alert - $CLUSTER_NAME" "$ALERT_EMAIL"
fi
# Slack alert
if [ "$ENABLE_SLACK_ALERTS" = "true" ] && [ -n "$SLACK_WEBHOOK" ]; then
curl -X POST -H 'Content-type: application/json' \
--data "{\"text\":\"[$level] $CLUSTER_NAME: $message\"}" \
"$SLACK_WEBHOOK" 2>/dev/null
fi
}

# Execute Couchbase REST API call
cb_api_call() {
local endpoint=$1
local method=${2:-GET}
local data=$3
local url="http://${COUCHBASE_HOST}:${COUCHBASE_PORT}${endpoint}"
if [ "$method" = "GET" ]; then
curl -s -u "${COUCHBASE_USER}:${COUCHBASE_PASSWORD}" "$url"
elif [ "$method" = "POST" ]; then
curl -s -X POST -u "${COUCHBASE_USER}:${COUCHBASE_PASSWORD}" \
-d "$data" "$url"
fi
}

# Execute Couchbase CLI command
cb_cli() {
local command=$@
/opt/couchbase/bin/couchbase-cli $command \
-c "${COUCHBASE_HOST}:${COUCHBASE_PORT}" \
-u "$COUCHBASE_USER" \
-p "$COUCHBASE_PASSWORD"
}

# Check cluster health
check_cluster_health() {
log_message "INFO" "Checking cluster health..."
local cluster_info=$(cb_api_call "/pools/default")
if [ -z "$cluster_info" ]; then
log_message "CRITICAL" "Cannot connect to Couchbase cluster"
return 1
fi
# Parse cluster status
local balanced=$(echo "$cluster_info" | jq -r '.balanced')
local rebalance_status=$(echo "$cluster_info" | jq -r '.rebalanceStatus')
local nodes_count=$(echo "$cluster_info" | jq '.nodes | length')
local healthy_nodes=$(echo "$cluster_info" | jq '[.nodes[] | select(.status == "healthy")] | length')
# Check if cluster is balanced
if [ "$balanced" != "true" ]; then
log_message "WARNING" "Cluster is not balanced"
recommend_action "REBALANCE" "Run cluster rebalance to distribute data evenly"
fi
# Check node health
if [ "$healthy_nodes" -lt "$nodes_count" ]; then
local unhealthy=$((nodes_count - healthy_nodes))
log_message "ERROR" "Found $unhealthy unhealthy nodes in cluster"
analyze_unhealthy_nodes
fi
# Check rebalance status
if [ "$rebalance_status" = "running" ]; then
log_message "INFO" "Rebalance in progress"
monitor_rebalance
fi
echo "$cluster_info"
}

# Analyze unhealthy nodes
analyze_unhealthy_nodes() {
# Emit each node as one compact JSON line so the while-read loop sees whole objects
local nodes=$(cb_api_call "/pools/default" | jq -c '.nodes[]')
echo "$nodes" | jq -r 'select(.status != "healthy") | .hostname' | while read -r node; do
log_message "WARNING" "Unhealthy node detected: $node"
# Check if node is reachable
if ! ping -c 1 -W 2 "$node" > /dev/null 2>&1; then
log_message "ERROR" "Node $node is unreachable"
recommend_action "NODE_DOWN" "Check network connectivity or node status for $node"
else
# Try to diagnose the issue
diagnose_node_issue "$node"
fi
done
}

# Diagnose node issues
diagnose_node_issue() {
local node=$1
log_message "INFO" "Diagnosing issues on node $node"
# Check services
local services=$(cb_api_call "/pools/default" | jq -r ".nodes[] | select(.hostname == \"$node\") | .services[]")
for service in $services; do
case $service in
"kv")
check_data_service "$node"
;;
"index")
check_index_service "$node"
;;
"n1ql")
check_query_service "$node"
;;
"fts")
check_search_service "$node"
;;
esac
done
}

# Check memory usage
check_memory_usage() {
log_message "INFO" "Checking memory usage..."
local nodes=$(cb_api_call "/pools/default")
# Emit each node as one compact JSON line so the while-read loop sees whole objects
echo "$nodes" | jq -c '.nodes[]' | while read -r node_data; do
local hostname=$(echo "$node_data" | jq -r '.hostname')
local mem_used=$(echo "$node_data" | jq -r '.systemStats.mem_actual_used')
local mem_total=$(echo "$node_data" | jq -r '.systemStats.mem_total')
local mem_percent=$(echo "scale=2; ($mem_used / $mem_total) * 100" | bc)
if (( $(echo "$mem_percent > $MEMORY_THRESHOLD" | bc -l) )); then
log_message "WARNING" "High memory usage on $hostname: ${mem_percent}%"
auto_heal_memory "$hostname"
fi
done
}

# Auto-heal memory issues
auto_heal_memory() {
local node=$1
log_message "INFO" "Attempting to auto-heal memory issues on $node"
# 1. Compact buckets
local buckets=$(cb_api_call "/pools/default/buckets" | jq -r '.[].name')
for bucket in $buckets; do
log_message "INFO" "Compacting bucket $bucket"
cb_api_call "/pools/default/buckets/$bucket/controller/compactBucket" "POST"
done
# 2. Clear expired documents
log_message "INFO" "Clearing expired documents"
cb_cli bucket-compact --bucket all
# 3. Adjust cache if needed
recommend_action "MEMORY" "Consider adjusting bucket memory quotas or adding more nodes"
}

# Check disk usage
check_disk_usage() {
log_message "INFO" "Checking disk usage..."
local nodes=$(cb_api_call "/pools/default")
echo "$nodes" | jq -r '.nodes[]' | while read -r node_data; do
local hostname=$(echo "$node_data" | jq -r '.hostname')
local disk_used=$(echo "$node_data" | jq -r '.systemStats.disk_used')
local disk_total=$(echo "$node_data" | jq -r '.systemStats.disk_total')
if [ "$disk_total" != "null" ] && [ "$disk_total" -gt 0 ]; then
local disk_percent=$(echo "scale=2; ($disk_used / $disk_total) * 100" | bc)
if (( $(echo "$disk_percent > $DISK_THRESHOLD" | bc -l) )); then
log_message "WARNING" "High disk usage on $hostname: ${disk_percent}%"
auto_heal_disk "$hostname"
fi
fi
done
}

# Auto-heal disk issues
auto_heal_disk() {
local node=$1
log_message "INFO" "Attempting to auto-heal disk issues on $node"
# 1. Trigger compaction
log_message "INFO" "Triggering auto-compaction"
cb_api_call "/controller/setAutoCompaction" "POST" \
"databaseFragmentationThreshold[percentage]=20&viewFragmentationThreshold[percentage]=20"
# 2. Clean up old logs
log_message "INFO" "Cleaning up old logs"
find /opt/couchbase/var/lib/couchbase/logs -type f -mtime +7 -delete 2>/dev/null
# 3. Recommend further actions
recommend_action "DISK" "Consider: 1) Increasing disk space, 2) Adjusting TTL values, 3) Archiving old data"
}

# Check bucket performance
check_bucket_performance() {
log_message "INFO" "Checking bucket performance..."
local buckets=$(cb_api_call "/pools/default/buckets")
echo "$buckets" | jq -r '.[].name' | while read bucket; do
local stats=$(cb_api_call "/pools/default/buckets/$bucket/stats")
# Check cache miss ratio
local cache_miss_rate=$(echo "$stats" | jq -r '.op.samples.ep_cache_miss_rate[-1]')
if (( $(echo "$cache_miss_rate > $CACHE_MISS_THRESHOLD" | bc -l) )); then
log_message "WARNING" "High cache miss rate in bucket $bucket: ${cache_miss_rate}%"
optimize_bucket_cache "$bucket"
fi
# Check disk queue
local disk_queue=$(echo "$stats" | jq -r '.op.samples.ep_queue_size[-1]')
if [ "$disk_queue" -gt 1000000 ]; then
log_message "WARNING" "Large disk queue in bucket $bucket: $disk_queue items"
recommend_action "PERFORMANCE" "Bucket $bucket has disk write backlog. Consider increasing writers or I/O capacity"
fi
# Check operation latency
check_operation_latency "$bucket" "$stats"
done
}

# Check operation latency
check_operation_latency() {
local bucket=$1
local stats=$2
# Get operation timings
local get_latency=$(echo "$stats" | jq -r '.op.samples.get_cmd_latency[-1]' 2>/dev/null)
local set_latency=$(echo "$stats" | jq -r '.op.samples.set_cmd_latency[-1]' 2>/dev/null)
if [ -n "$get_latency" ] && [ "$get_latency" != "null" ]; then
if (( $(echo "$get_latency > 1000" | bc -l) )); then
log_message "WARNING" "High GET latency in bucket $bucket: ${get_latency}μs"
recommend_action "LATENCY" "Consider: 1) Adding more nodes, 2) Optimizing queries, 3) Adding indexes"
fi
fi
}

# Optimize bucket cache
optimize_bucket_cache() {
local bucket=$1
log_message "INFO" "Optimizing cache for bucket $bucket"
# Get current bucket configuration
local bucket_info=$(cb_api_call "/pools/default/buckets/$bucket")
local ram_quota=$(echo "$bucket_info" | jq -r '.quota.ram')
local item_count=$(echo "$bucket_info" | jq -r '.basicStats.itemCount')
# Calculate optimal memory
local bytes_per_item=500 # Approximate
local optimal_ram=$((item_count * bytes_per_item / 1048576)) # Convert to MB
if [ "$optimal_ram" -gt "$ram_quota" ]; then
log_message "INFO" "Bucket $bucket needs more RAM. Current: ${ram_quota}MB, Recommended: ${optimal_ram}MB"
recommend_action "CACHE" "Increase RAM quota for bucket $bucket to ${optimal_ram}MB"
fi
}

# Check indexes
check_indexes() {
log_message "INFO" "Checking indexes..."
# Get index status
local indexes=$(cb_api_call "/indexStatus")
echo "$indexes" | jq -r '.indexes[]' | while read -r index_data; do
local index_name=$(echo "$index_data" | jq -r '.index')
local status=$(echo "$index_data" | jq -r '.status')
local progress=$(echo "$index_data" | jq -r '.progress')
if [ "$status" != "Ready" ]; then
log_message "WARNING" "Index $index_name is not ready: $status ($progress%)"
if [ "$status" = "Error" ]; then
rebuild_index "$index_name"
fi
fi
# Check fragmentation
check_index_fragmentation "$index_name"
done
}

# Check index fragmentation
check_index_fragmentation() {
local index=$1
local stats=$(cb_api_call "/pools/default/buckets/@index/stats")
local fragmentation=$(echo "$stats" | jq -r ".op.samples.index_fragmentation_${index}[-1]" 2>/dev/null)
if [ -n "$fragmentation" ] && [ "$fragmentation" != "null" ]; then
if (( $(echo "$fragmentation > $INDEX_FRAGMENTATION_THRESHOLD" | bc -l) )); then
log_message "WARNING" "High fragmentation in index $index: ${fragmentation}%"
rebuild_index "$index"
fi
fi
}

# Rebuild index
rebuild_index() {
local index=$1
log_message "INFO" "Rebuilding index $index"
# This would typically use N1QL
local query="ALTER INDEX \`$index\` REBUILD"
cb_api_call "/query/service" "POST" "statement=$query"
recommend_action "INDEX" "Index $index is being rebuilt due to issues"
}

# Check query service
check_query_service() {
local node=$1
log_message "INFO" "Checking query service on $node"
# Check active queries
local active_requests=$(cb_api_call "/pools/default/tasks" | jq -c '.tasks[] | select(.type == "n1ql")')
if [ -n "$active_requests" ]; then
echo "$active_requests" | while read -r request; do
local duration=$(echo "$request" | jq -r '.runtime')
if [ "$duration" -gt 60000 ]; then # More than 60 seconds
log_message "WARNING" "Long-running query detected: ${duration}ms"
analyze_slow_query "$request"
fi
done
fi
}

# Analyze slow queries
analyze_slow_query() {
local query_info=$1
log_message "INFO" "Analyzing slow query"
# Get query plan
local statement=$(echo "$query_info" | jq -r '.statement')
# Check for missing indexes
if echo "$statement" | grep -qi "WHERE\|JOIN" && ! echo "$statement" | grep -qi "USE INDEX"; then
recommend_action "QUERY" "Query may benefit from indexes. Review execution plan."
fi
# Check for full collection scans
if echo "$statement" | grep -qi "SELECT \*"; then
recommend_action "QUERY" "Avoid SELECT * queries. Specify required fields only."
fi
}

# Check XDCR (Cross Data Center Replication)
check_xdcr() {
log_message "INFO" "Checking XDCR status..."
local xdcr_tasks=$(cb_api_call "/pools/default/remoteClusters")
if [ "$(echo "$xdcr_tasks" | jq '. | length')" -gt 0 ]; then
echo "$xdcr_tasks" | jq -r '.[]' | while read -r remote; do
local name=$(echo "$remote" | jq -r '.name')
local hostname=$(echo "$remote" | jq -r '.hostname')
# Check connectivity
if ! nc -z "$hostname" 8091 2>/dev/null; then
log_message "ERROR" "XDCR remote cluster $name unreachable at $hostname"
recommend_action "XDCR" "Check network connectivity to remote cluster $name"
fi
# Check replication status
check_xdcr_replication "$name"
done
fi
}

# Check XDCR replication status
check_xdcr_replication() {
local remote=$1
local replications=$(cb_api_call "/pools/default/tasks" | jq -c '.tasks[] | select(.type == "xdcr")')
echo "$replications" | while read -r repl; do
local status=$(echo "$repl" | jq -r '.status')
local errors=$(echo "$repl" | jq -r '.errors')
if [ "$status" = "error" ] || [ "$errors" -gt 0 ]; then
log_message "ERROR" "XDCR replication to $remote has errors"
auto_heal_xdcr "$remote"
fi
done
}

# Auto-heal XDCR issues
auto_heal_xdcr() {
local remote=$1
log_message "INFO" "Attempting to auto-heal XDCR to $remote"
# Restart XDCR replication
cb_cli xdcr-replicate --pause --xdcr-replicator="$remote"
sleep 5
cb_cli xdcr-replicate --resume --xdcr-replicator="$remote"
recommend_action "XDCR" "XDCR replication to $remote was restarted. Monitor for improvements."
}

# Monitor rebalance
monitor_rebalance() {
local start_time=$(date +%s)
while true; do
local rebalance_status=$(cb_api_call "/pools/default/rebalanceProgress")
local status=$(echo "$rebalance_status" | jq -r '.status')
if [ "$status" = "none" ]; then
log_message "INFO" "Rebalance completed successfully"
break
fi
# Average the per-node progress values (keys in the response look like "ns_1@<host>")
local progress=$(echo "$rebalance_status" | jq -r '[to_entries[] | select(.key != "status") | .value.progress // 0] | if length > 0 then (add / length * 100) else 0 end')
log_message "INFO" "Rebalance progress: ${progress}%"
# Check timeout
local current_time=$(date +%s)
local elapsed=$((current_time - start_time))
if [ "$elapsed" -gt "$REBALANCE_TIMEOUT" ]; then
log_message "ERROR" "Rebalance timeout after ${elapsed} seconds"
recommend_action "REBALANCE" "Rebalance is taking too long. Check cluster resources."
break
fi
sleep 30
done
}

# Check backup status
check_backup_status() {
log_message "INFO" "Checking backup status..."
# Check for backup repository
if [ -d "/opt/couchbase/backup" ]; then
local latest_backup=$(find /opt/couchbase/backup -type d -name "20*" | sort -r | head -1)
if [ -n "$latest_backup" ]; then
local backup_age=$(find "$latest_backup" -maxdepth 0 -mmin +1440 2>/dev/null)
if [ -n "$backup_age" ]; then
log_message "WARNING" "Latest backup is more than 24 hours old"
recommend_action "BACKUP" "Schedule regular backups to prevent data loss"
fi
else
log_message "WARNING" "No backups found"
recommend_action "BACKUP" "Configure and schedule regular backups immediately"
fi
fi
}

# Recommend actions based on issues
recommend_action() {
local category=$1
local recommendation=$2
echo "[$(date '+%Y-%m-%d %H:%M:%S')] [$category] RECOMMENDATION: $recommendation" >> "${LOG_DIR}/recommendations.log"
print_color "$YELLOW" "📌 RECOMMENDATION [$category]: $recommendation"
# Add to report
echo "<div class='recommendation $category'>$recommendation</div>" >> "$REPORT_FILE"
}

# Generate health report
generate_health_report() {
log_message "INFO" "Generating health report..."
cat > "$REPORT_FILE" <<EOF
<!DOCTYPE html>
<html>
<head>
<title>Couchbase Health Report - $(date '+%Y-%m-%d %H:%M:%S')</title>
<style>
body { font-family: Arial, sans-serif; margin: 20px; }
h1 { color: #333; }
.status-ok { color: green; }
.status-warning { color: orange; }
.status-error { color: red; }
.metric { margin: 10px 0; padding: 10px; border-left: 3px solid #ddd; }
.recommendation { background: #fffbdd; padding: 10px; margin: 10px 0; border-left: 3px solid #ffa500; }
table { border-collapse: collapse; width: 100%; }
th, td { border: 1px solid #ddd; padding: 8px; text-align: left; }
th { background: #f2f2f2; }
</style>
</head>
<body>
<h1>Couchbase Cluster Health Report</h1>
<p>Generated: $(date '+%Y-%m-%d %H:%M:%S')</p>
<p>Cluster: $CLUSTER_NAME</p>
EOF
# Add cluster overview
local cluster_info=$(cb_api_call "/pools/default")
local node_count=$(echo "$cluster_info" | jq '.nodes | length')
local balanced=$(echo "$cluster_info" | jq -r '.balanced')
cat >> "$REPORT_FILE" <<EOF
<h2>Cluster Overview</h2>
<table>
<tr><th>Metric</th><th>Value</th><th>Status</th></tr>
<tr><td>Total Nodes</td><td>$node_count</td><td class="status-ok">OK</td></tr>
<tr><td>Balanced</td><td>$balanced</td><td class="$([ "$balanced" = "true" ] && echo "status-ok" || echo "status-warning")">$([ "$balanced" = "true" ] && echo "OK" || echo "NEEDS REBALANCE")</td></tr>
</table>
EOF
# Add node details
cat >> "$REPORT_FILE" <<EOF
<h2>Node Status</h2>
<table>
<tr><th>Hostname</th><th>Status</th><th>Services</th><th>CPU %</th><th>Memory %</th><th>Disk %</th></tr>
EOF
echo "$cluster_info" | jq -r '.nodes[]' | while read -r node_data; do
local hostname=$(echo "$node_data" | jq -r '.hostname')
local status=$(echo "$node_data" | jq -r '.status')
local services=$(echo "$node_data" | jq -r '.services | join(", ")')
local cpu=$(echo "$node_data" | jq -r '.systemStats.cpu_utilization_rate // 0')
local mem_used=$(echo "$node_data" | jq -r '.systemStats.mem_actual_used // 0')
local mem_total=$(echo "$node_data" | jq -r '.systemStats.mem_total // 1')
local mem_percent=$(echo "scale=2; ($mem_used / $mem_total) * 100" | bc)
cat >> "$REPORT_FILE" <<EOF
<tr>
<td>$hostname</td>
<td class="$([ "$status" = "healthy" ] && echo "status-ok" || echo "status-error")">$status</td>
<td>$services</td>
<td>$cpu%</td>
<td>$mem_percent%</td>
<td>N/A</td>
</tr>
EOF
done
cat >> "$REPORT_FILE" <<EOF
</table>
<h2>Recommendations</h2>
$(cat "${LOG_DIR}/recommendations.log" 2>/dev/null | tail -20 | sed 's/^/<p>/;s/$/<\/p>/')
</body>
</html>
EOF
log_message "INFO" "Health report generated: $REPORT_FILE"
}

# Main monitoring loop
main_monitoring_loop() {
print_color "$GREEN" "=========================================="
print_color "$GREEN" "Couchbase Intelligent Monitor v1.0"
print_color "$GREEN" "Cluster: $CLUSTER_NAME"
print_color "$GREEN" "=========================================="
while true; do
print_color "$BLUE" "\n[$(date '+%Y-%m-%d %H:%M:%S')] Starting monitoring cycle..."
# Core health checks
check_cluster_health
check_memory_usage
check_disk_usage
check_bucket_performance
check_indexes
check_query_service
check_xdcr
check_backup_status
# Generate report every hour
if [ $(($(date +%s) % 3600)) -lt "$MONITOR_INTERVAL" ]; then
generate_health_report
fi
print_color "$GREEN" "Monitoring cycle complete. Next check in ${MONITOR_INTERVAL} seconds..."
sleep "$MONITOR_INTERVAL"
done
}

# Signal handlers
trap 'log_message "INFO" "Monitor stopped by user"; exit 0' SIGINT SIGTERM

# Check prerequisites
check_prerequisites() {
# Check for required tools
for tool in curl jq bc nc; do
if ! command -v $tool &> /dev/null; then
print_color "$RED" "ERROR: Required tool '$tool' is not installed"
exit 1
fi
done
# Check Couchbase CLI
if [ ! -f "/opt/couchbase/bin/couchbase-cli" ]; then
print_color "$YELLOW" "WARNING: Couchbase CLI not found at default location"
fi
# Test connection
if ! cb_api_call "/pools" > /dev/null 2>&1; then
print_color "$RED" "ERROR: Cannot connect to Couchbase at ${COUCHBASE_HOST}:${COUCHBASE_PORT}"
print_color "$YELLOW" "Please check connection settings and credentials"
exit 1
fi
}

# Start monitoring
log_message "INFO" "Starting Couchbase Intelligent Monitor"
check_prerequisites
main_monitoring_loop