Couchbase Issues & Errors – Overcoming Them
By Raj Kiran Mattewada
This article describes some common Couchbase errors and issues encountered in production environments and offers practical remedies and best practices gleaned from real-world experience. Please test any commands or steps in your own environment before applying them in production.
Issue #1: “Join completion call failed” / HTTP 500 error during node addition
Symptom / Error Message
Attention – Join completion call failed.
Got HTTP status 500 from REST call post to
http://hostname.domain.com:8091/completeJoin
Body: ["Unexpected server error, request logged."]
Root Cause (RCA)
This often happens when the node being added already has one or more buckets configured, or when residual data files remain from a previous installation. In some cases, misconfiguration or leftover metadata on the joining node causes the conflict.
Solution / Recovery Steps
⚠️ Caution: These steps may be destructive. Always back up data and proceed carefully.
1. Stop Couchbase on the node you're trying to add (depending on the package, the service may be registered as couchbase-server):

   service couchbase stop
   service couchbase status

2. Ensure that the critical ports are free (e.g., 8091, 8092, 11211, 11210). If any are held by other processes, identify and kill them:

   netstat -a | grep 8091
   netstat -a | grep 8092
   netstat -a | grep 11211
   netstat -a | grep 11210

3. Move or remove the existing data directories to clear leftover or conflicting bucket metadata:

   cd /data/couchbase
   mv data data.OLD
   mv index index.OLD

4. Recreate the directories fresh:

   mkdir data index
   chown couchbase:couchbase data index

5. Re-verify that the required ports are no longer in use.

6. Start Couchbase again:

   service couchbase start
   service couchbase status

7. From the master cluster node, add the new node to the appropriate server group via the UI or the CLI (a sketch of the CLI route follows below).
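For the CLI route, a minimal sketch using couchbase-cli is shown below. The hostnames, credentials, and group name are placeholders for illustration; adjust them to your cluster, and note that flag names can vary slightly between Couchbase Server versions.

    # Add the new node to a specific server group (run from any existing cluster node)
    couchbase-cli server-add -c master-node.domain.com:8091 \
      -u Administrator -p 'admin-password' \
      --server-add new-node.domain.com:8091 \
      --server-add-username Administrator \
      --server-add-password 'admin-password' \
      --group-name "Group 1"

    # Rebalance so the new node starts taking data
    couchbase-cli rebalance -c master-node.domain.com:8091 \
      -u Administrator -p 'admin-password'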
Issue #2: Auto-failover alerts / memcached / moxi exit statuses
Symptoms / Error Messages
- "Couchbase Server alert: auto_failover_node"
- "Port server memcached on node 'babysitter_of_ns_1@127.0.0.1' exited with status 0."
- "Port server moxi on node … exited with status 134."
Scenarios & Explanations
| Scenario | Description / Cause | Recommended Action |
|---|---|---|
| 1a | Memcached process on a node was gracefully shut down (exit status 0) | Restart Couchbase on that node and reintroduce it to the cluster |
| 1b | Network glitch or overly aggressive auto-failover timeout caused false detection of failure | Increase auto-failover timeout thresholds; ensure network stability |
| 2 | Moxi (a proxy component) exit status 134 (abort) | Investigate memory or configuration issues; check system logs for root cause |
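If false failovers trace back to an overly aggressive timeout (scenario 1b), the threshold can be raised with couchbase-cli. The sketch below assumes an Administrator account and a 120-second timeout; the permitted minimum and exact flag names depend on your Couchbase Server version, so verify against couchbase-cli setting-autofailover --help.

    # Raise the auto-failover timeout to reduce false positives from transient network blips
    couchbase-cli setting-autofailover -c cluster-node.domain.com:8091 \
      -u Administrator -p 'admin-password' \
      --enable-auto-failover 1 \
      --auto-failover-timeout 120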
Notes / Best Practices
- Be careful when setting the auto-failover timeout too low (e.g., 30 seconds); transient network hiccups may trigger unnecessary failovers. The couchbase-cli example above shows how to raise it.
- Monitor the "babysitter" process (ns_babysitter), which supervises the per-node Couchbase services and restarts them if they exit.
- Always review the Couchbase logs (e.g., ns_server.log) and system logs (e.g., /var/log/messages) to catch the underlying issue that caused the process to exit unexpectedly (see the example after this list).
- In many cases, simply restarting the failed node and rebalancing is sufficient once the root cause (memory, network, CPU) is addressed.
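A quick way to hunt for the triggering event is sketched below. The log directory shown (/opt/couchbase/var/lib/couchbase/logs) is the default for Linux package installs; adjust the paths and search terms to your environment.

    # Look for process exits and failover activity in the Couchbase logs
    grep -iE "exited with status|auto_failover" /opt/couchbase/var/lib/couchbase/logs/ns_server.log | tail -50

    # Cross-check the OS logs around the same timestamps (OOM killer, network flaps, etc.)
    grep -iE "oom|killed process|link is down" /var/log/messages | tail -50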
Additional Tips & Enhancements (based on experience)
- Pre-check cluster state before upgrades or adding nodes (see the sketch below)
  - Ensure the cluster is healthy (cbstats, ns_server REST APIs)
  - Confirm no rebalance is in progress
  - Monitor resident ratio, memory usage, and swap usage
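A minimal pre-check sketch using the ns_server REST API and couchbase-cli is shown below. It assumes jq is installed and uses placeholder credentials; every node should report status "healthy" before you proceed.

    #!/bin/bash
    # Quick health gate before an upgrade or node addition
    CLUSTER="cluster-node.domain.com:8091"
    AUTH="Administrator:admin-password"

    # 1) Every node should report status "healthy" and membership "active"
    curl -s -u "$AUTH" "http://$CLUSTER/pools/default" \
      | jq -r '.nodes[] | "\(.hostname)  \(.status)  \(.clusterMembership)"'

    # 2) No rebalance should be running
    couchbase-cli rebalance-status -c "$CLUSTER" -u Administrator -p 'admin-password'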
- Use backup snapshots before major operations (see the sketch below)
  - Ensure you have snapshots or a data export so you can roll back if needed.
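On Enterprise Edition, cbbackupmgr covers this; Community and older releases ship cbbackup instead. A minimal sketch, assuming a local archive path of /backup and placeholder credentials:

    # One-time: create a backup repository in the archive
    /opt/couchbase/bin/cbbackupmgr config --archive /backup --repo pre-change

    # Take a full backup of the cluster before the change
    /opt/couchbase/bin/cbbackupmgr backup --archive /backup --repo pre-change \
      --cluster couchbase://cluster-node.domain.com \
      --username Administrator --password 'admin-password'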
- Always stagger upgrades / additions
  - Avoid performing multiple cluster changes at once; sequence them with time gaps to detect failures early.
- Automate diagnostics collection (see the sketch below)
  - Use cbcollect_info to collect logs and state
  - Automate via scripts or orchestration (e.g., Ansible)
  - Include logs, config files, cluster stats, and alerts
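cbcollect_info runs per node and writes a single zip archive; a cluster-wide collection can also be triggered via couchbase-cli. A minimal sketch, assuming the default Linux install path and a timestamped output name (verify flag names against your version):

    # Collect a full diagnostics bundle from this node
    /opt/couchbase/bin/cbcollect_info /tmp/cbcollect_$(hostname)_$(date +%Y%m%d_%H%M).zip

    # Or trigger collection on every node from any one of them
    couchbase-cli collect-logs-start -c cluster-node.domain.com:8091 \
      -u Administrator -p 'admin-password' --all-nodes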
- Tune memory settings carefully (see the sketch below)
  - Monitor eviction thresholds, memcached quotas, and disk swap behavior
  - Avoid overcommitting memory across all services (data, index, query, search)
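Service RAM quotas can be checked and adjusted with couchbase-cli. The values below are placeholders; size them so the sum of the service quotas leaves headroom for the OS, and note that the available flags depend on the Couchbase Server version.

    # Inspect current node memory and per-service quotas
    couchbase-cli server-info -c cluster-node.domain.com:8091 \
      -u Administrator -p 'admin-password'

    # Adjust the data and index service RAM quotas (values in MB)
    couchbase-cli setting-cluster -c cluster-node.domain.com:8091 \
      -u Administrator -p 'admin-password' \
      --cluster-ramsize 8192 --cluster-index-ramsize 2048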
- Monitor XDCR lag proactively (see the sketch below)
  - If replication lag creeps up, use heartbeat-based alerts or custom scripts
  - In some environments, suggest enhancements (heartbeat signals) to Couchbase support
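One way to script this is to poll the cluster task list, which includes running XDCR replications and their outstanding mutation count. The sketch below is assumption-heavy: the /pools/default/tasks endpoint and the changesLeft/status field names are assumptions here, so inspect the raw JSON from your own cluster before wiring this into alerting.

    # List XDCR replications with their backlog from the cluster task list
    curl -s -u Administrator:'admin-password' \
      "http://cluster-node.domain.com:8091/pools/default/tasks" \
      | jq -r '.[] | select(.type == "xdcr") | "\(.source) -> \(.target)  changesLeft=\(.changesLeft)  status=\(.status)"'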
- Scripting for node operations (see the sketch below)
  - Script repetitive tasks like failover, rebalance, add-node, and remove-node
  - Wrap them in safety checks to prevent destructive mistakes
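As an example of wrapping a destructive operation in safety checks, the sketch below gates a graceful failover behind a rebalance check and an explicit confirmation prompt. Hostnames and credentials are placeholders; couchbase-cli failover and rebalance are the underlying commands, and output formats and flags may vary slightly by version.

    #!/bin/bash
    # Usage: ./safe-failover.sh <node-to-failover>:8091
    set -euo pipefail
    CLUSTER="cluster-node.domain.com:8091"
    USER="Administrator"
    PASS="admin-password"
    TARGET="$1"

    # Safety check 1: refuse to act while a rebalance is running
    # (rebalance-status output format varies by version; "running" is a simple heuristic)
    if couchbase-cli rebalance-status -c "$CLUSTER" -u "$USER" -p "$PASS" | grep -qi "running"; then
      echo "A rebalance is in progress; aborting." >&2
      exit 1
    fi

    # Safety check 2: require an explicit confirmation
    read -rp "Gracefully fail over $TARGET? Type 'yes' to continue: " answer
    [ "$answer" = "yes" ] || { echo "Aborted."; exit 1; }

    # Graceful failover, then rebalance the remaining nodes
    couchbase-cli failover -c "$CLUSTER" -u "$USER" -p "$PASS" --server-failover "$TARGET"
    couchbase-cli rebalance -c "$CLUSTER" -u "$USER" -p "$PASS"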
