Tuesday, October 14, 2014

Couchbase Issues & Errors – Overcoming Them

By Raj Kiran Mattewada

This article describes some common Couchbase errors and issues encountered in production environments and offers practical remedies and best practices gleaned from real-world experience. Please test any commands or steps in your own environment before applying them in production.

Issue #1: “Join completion call failed” / HTTP 500 error during node addition

Symptom / Error Message

Attention – Join completion call failed.  
Got HTTP status 500 from REST call post to 
http://hostname.domain.com:8091/completeJoin  
Body: ["Unexpected server error, request logged."]

Root Cause (RCA)
This often happens when the node being added still has buckets configured or residual data files left over from a previous installation; the leftover metadata conflicts with the cluster join and the completeJoin call fails. Misconfiguration on the joining node can cause the same conflict.

Solution / Recovery Steps

⚠️ Caution: These steps may be destructive. Always back up data and proceed carefully.

  1. Stop Couchbase on the node you’re trying to add:

    service couchbase-server stop
    service couchbase-server status
    
  2. Ensure that critical ports are free (e.g., 8091, 8092, 11211, 11210). If any are held by other processes, identify and kill them:

    netstat -anp | grep 8091
    netstat -anp | grep 8092
    netstat -anp | grep 11211
    netstat -anp | grep 11210
    
  3. Move or remove existing data directories—this removes leftover or conflicting bucket metadata:

    cd /data/couchbase  
    mv data data.OLD  
    mv index index.OLD  
    
  4. Recreate the directories fresh:

    mkdir data index  
    chown couchbase:couchbase data index  
    
  5. Re-verify that the required ports are no longer in use.

  6. Start Couchbase again:

    service couchbase-server start
    service couchbase-server status
    
  7. From an existing cluster node, add the new node to the appropriate server group via the web UI or the CLI (a CLI sketch follows this list).
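
If you prefer the CLI route for step 7, a minimal sketch with couchbase-cli is shown below. The hostnames, credentials, and group name are placeholders, and flags such as --group-name vary between Couchbase versions, so check couchbase-cli server-add --help on your installation first.

    # Run from any host that has couchbase-cli installed.
    # CLUSTER, NEW_NODE, and the Administrator/password credentials are placeholders.
    CLUSTER=existing-node.domain.com:8091
    NEW_NODE=new-node.domain.com:8091

    # Add the freshly wiped node to the cluster (optionally into a server group).
    /opt/couchbase/bin/couchbase-cli server-add -c "$CLUSTER" \
      -u Administrator -p password \
      --server-add="$NEW_NODE" \
      --server-add-username=Administrator \
      --server-add-password=password \
      --group-name="Group 1"    # omit if your CLI version lacks this flag

    # Rebalance so the new node starts taking vBuckets.
    /opt/couchbase/bin/couchbase-cli rebalance -c "$CLUSTER" \
      -u Administrator -p password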


Issue #2: Auto-failover alerts / memcached / moxi exit statuses

Symptoms / Error Messages

  • “Couchbase Server alert: auto_failover_node”

  • “Port server memcached on node ‘babysitter_of_ns_1@127.0.0.1’ exited with status 0.”

  • “Port server moxi on node … exited with status 134.”

Scenarios & Explanations

  • Scenario 1a: the memcached process on a node was shut down gracefully (exit status 0). Recommended action: restart Couchbase on that node and reintroduce it to the cluster.

  • Scenario 1b: a network glitch or an overly aggressive auto-failover timeout caused a false detection of failure. Recommended action: increase the auto-failover timeout threshold and ensure network stability.

  • Scenario 2: moxi (the proxy component) exited with status 134 (abort). Recommended action: investigate memory or configuration issues and check the system logs for the root cause.

Notes / Best Practices

  • Be careful when setting the auto-failover timeout too low (e.g., 30 seconds); transient network hiccups may trigger unnecessary failovers (see the sketch after this list).

  • Monitor the “babysitter” process, which supervises the per-node services (memcached, moxi, etc.) and restarts them when they exit.

  • Always review Couchbase logs (e.g. ns_server.log) and system logs (e.g. /var/log/messages) to catch the underlying issue that caused the process to exit unexpectedly.

  • In many cases, simply restarting the failed node and rebalancing is sufficient once the root cause (memory, network, or CPU) is addressed.
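
As a concrete example of raising the timeout mentioned above, the auto-failover settings can be adjusted through the REST API. The hostname, credentials, and the 120-second value below are placeholders; the same change can also be made in the web UI under Settings > Auto-Failover.

    # Check the current auto-failover settings.
    curl -s -u Administrator:password \
      http://cluster-node.domain.com:8091/settings/autoFailover

    # Raise the timeout to 120 seconds so short network hiccups
    # do not trigger an unnecessary failover (the minimum allowed is 30).
    curl -s -u Administrator:password -X POST \
      http://cluster-node.domain.com:8091/settings/autoFailover \
      -d enabled=true -d timeout=120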


Additional Tips & Enhancements (based on experience)

  1. Pre-check cluster state before upgrades or adding nodes

    • Ensure the cluster is healthy (cbstats, the ns_server REST API)

    • No rebalance is in progress

    • Monitor resident ratio, memory usage, swap usage

  2. Use backup snapshots before major operations

    • Ensure you have snapshots or a data export so you can roll back if needed.

  3. Always stagger upgrades / additions

    • Avoid performing multiple cluster changes at once; sequence them with time gaps to detect early failures.

  4. Automate diagnostics collection

    • Use cbcollect_info to collect logs and state (see the first sketch after this list)

    • Automate via scripts or orchestration (Ansible)

    • Include logs, config files, cluster stats, alerts

  5. Tune memory settings carefully

    • Monitor eviction thresholds, memcached quotas, and disk swap behavior

    • Avoid overcommitting memory across all services (data, index, query, search)

  6. Monitor XDCR lag proactively

    • If replication lag creeps up, use heartbeat-based alerts or custom scripts

    • If the built-in stats are not enough, raise enhancement requests (e.g., heartbeat-style lag signals) with Couchbase support

  7. Scripting for node operations

    • Script repetitive tasks like failover, rebalance, add-node, remove-node

    • Wrap safety checks around them to prevent destructive mistakes (see the second sketch after this list)
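
For tip #4, here is a minimal per-node diagnostics sketch. The /data/diag output directory and the install path are assumptions; cbcollect_info must run locally on each node, typically as root.

    #!/bin/bash
    # Collect Couchbase diagnostics on this node into a timestamped zip.
    set -e
    OUT=/data/diag/cbcollect_$(hostname -s)_$(date +%Y%m%d_%H%M%S).zip
    mkdir -p /data/diag
    /opt/couchbase/bin/cbcollect_info "$OUT"
    echo "Diagnostics written to $OUT"

Run the same script on every node (for example, via Ansible) so that support gets a complete picture of the cluster.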

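For tip #7, a rough sketch of a rebalance wrapper with safety checks. The cluster address, credentials, and the jq dependency are assumptions, and output formats differ between versions, so treat this as a starting point rather than a drop-in script.

    #!/bin/bash
    # Rebalance only if every node is healthy and no rebalance is already running.
    set -e
    CLUSTER=cluster-node.domain.com:8091
    AUTH="Administrator:password"

    # Safety check 1: all nodes must report status "healthy".
    unhealthy=$(curl -s -u "$AUTH" "http://$CLUSTER/pools/default" \
      | jq -r '.nodes[] | select(.status != "healthy") | .hostname')
    if [ -n "$unhealthy" ]; then
      echo "Aborting: unhealthy nodes found: $unhealthy" >&2
      exit 1
    fi

    # Safety check 2: do not start a second rebalance.
    status=$(curl -s -u "$AUTH" "http://$CLUSTER/pools/default/rebalanceProgress" \
      | jq -r '.status')
    if [ "$status" != "none" ]; then
      echo "Aborting: a rebalance is already in progress (status: $status)" >&2
      exit 1
    fi

    /opt/couchbase/bin/couchbase-cli rebalance -c "$CLUSTER" \
      -u Administrator -p password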