Tuesday, October 14, 2014

Couchbase Issues & Errors – Overcoming Them

By Raj Kiran Mattewada

This article describes some common Couchbase errors and issues encountered in production environments and offers practical remedies and best practices gleaned from real-world experience. Please test any commands or steps in your own environment before applying them in production.

Issue #1: “Join completion call failed” / HTTP 500 error during node addition

Symptom / Error Message

Attention – Join completion call failed.  
Got HTTP status 500 from REST call post to 
http://hostname.domain.com:8091/completeJoin  
Body: ["Unexpected server error, request logged."]

Root Cause (RCA)
This often happens when the node being added still has buckets configured or residual data files left over from a previous installation; the leftover metadata conflicts with the cluster join and the completeJoin call fails. Misconfiguration on the joining node can cause the same conflict.

Solution / Recovery Steps

⚠️ Caution: These steps may be destructive. Always back up data and proceed carefully.

  1. Stop Couchbase on the node you’re trying to add:

    service couchbase-server stop
    service couchbase-server status
    
  2. Ensure that critical ports are free (e.g., 8091, 8092, 11211, 11210). If any are held by other processes, identify and kill them:

    netstat -anp | grep 8091
    netstat -anp | grep 8092
    netstat -anp | grep 11211
    netstat -anp | grep 11210
    
  3. Move or remove existing data directories—this removes leftover or conflicting bucket metadata:

    cd /data/couchbase  
    mv data data.OLD  
    mv index index.OLD  
    
  4. Recreate the directories fresh:

    mkdir data index  
    chown couchbase:couchbase data index  
    
  5. Re-verify that the required ports are no longer in use.

  6. Start Couchbase again:

    service couchbase-server start
    service couchbase-server status
    
  7. From an existing cluster node, add the new node to the appropriate server group via the web UI or the CLI (a CLI sketch follows this list).
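
If you prefer the CLI route for step 7, a minimal sketch with couchbase-cli is shown below. The hostnames, credentials, and group name are placeholders, and flags such as --group-name vary between Couchbase versions, so check couchbase-cli server-add --help on your installation first.

    # Run from any host that has couchbase-cli installed.
    # CLUSTER, NEW_NODE, and the Administrator/password credentials are placeholders.
    CLUSTER=existing-node.domain.com:8091
    NEW_NODE=new-node.domain.com:8091

    # Add the freshly wiped node to the cluster (optionally into a server group).
    /opt/couchbase/bin/couchbase-cli server-add -c "$CLUSTER" \
      -u Administrator -p password \
      --server-add="$NEW_NODE" \
      --server-add-username=Administrator \
      --server-add-password=password \
      --group-name="Group 1"    # omit if your CLI version lacks this flag

    # Rebalance so the new node starts taking vBuckets.
    /opt/couchbase/bin/couchbase-cli rebalance -c "$CLUSTER" \
      -u Administrator -p password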


Issue #2: Auto-failover alerts / memcached / moxi exit statuses

Symptoms / Error Messages

  • “Couchbase Server alert: auto_failover_node”

  • “Port server memcached on node ‘babysitter_of_ns_1@127.0.0.1’ exited with status 0.”

  • “Port server moxi on node … exited with status 134.”

Scenarios & Explanations

  • Scenario 1a: the memcached process on a node was shut down gracefully (exit status 0). Recommended action: restart Couchbase on that node and reintroduce it to the cluster.

  • Scenario 1b: a network glitch or an overly aggressive auto-failover timeout caused a false detection of failure. Recommended action: increase the auto-failover timeout threshold and ensure network stability.

  • Scenario 2: moxi (the proxy component) exited with status 134 (abort). Recommended action: investigate memory or configuration issues and check the system logs for the root cause.

Notes / Best Practices

  • Be careful when setting the auto-failover timeout too low (e.g., 30 seconds); transient network hiccups may trigger unnecessary failovers (see the sketch after this list).

  • Monitor the “babysitter” process, which supervises the per-node services (memcached, moxi, etc.) and restarts them when they exit.

  • Always review Couchbase logs (e.g. ns_server.log) and system logs (e.g. /var/log/messages) to catch the underlying issue that caused the process to exit unexpectedly.

  • In many cases, simply restarting the failed node and rebalancing is sufficient once the root cause (memory, network, or CPU) is addressed.
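
As a concrete example of raising the timeout mentioned above, the auto-failover settings can be adjusted through the REST API. The hostname, credentials, and the 120-second value below are placeholders; the same change can also be made in the web UI under Settings > Auto-Failover.

    # Check the current auto-failover settings.
    curl -s -u Administrator:password \
      http://cluster-node.domain.com:8091/settings/autoFailover

    # Raise the timeout to 120 seconds so short network hiccups
    # do not trigger an unnecessary failover (the minimum allowed is 30).
    curl -s -u Administrator:password -X POST \
      http://cluster-node.domain.com:8091/settings/autoFailover \
      -d enabled=true -d timeout=120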


Additional Tips & Enhancements (based on experience)

  1. Pre-check cluster state before upgrades or adding nodes

    • Ensure the cluster is healthy (cbstats, the ns_server REST API)

    • No rebalance is in progress

    • Monitor resident ratio, memory usage, swap usage

  2. Use backup snapshots before major operations

    • Ensure you have snapshots or a data export so you can roll back if needed.

  3. Always stagger upgrades / additions

    • Avoid performing multiple cluster changes at once; sequence them with time gaps to detect early failures.

  4. Automate diagnostics collection

    • Use cbcollect_info to collect logs and state (see the first sketch after this list)

    • Automate via scripts or orchestration (Ansible)

    • Include logs, config files, cluster stats, alerts

  5. Tune memory settings carefully

    • Monitor eviction thresholds, memcached quotas, and disk swap behavior

    • Avoid overcommitting memory across all services (data, index, query, search)

  6. Monitor XDCR lag proactively

    • If replication lag creeps up, use heartbeat-based alerts or custom scripts

    • If the built-in stats are not enough, raise enhancement requests (e.g., heartbeat-style lag signals) with Couchbase support

  7. Scripting for node operations

    • Script repetitive tasks like failover, rebalance, add-node, remove-node

    • Wrap safety checks around them to prevent destructive mistakes (see the second sketch after this list)
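
For tip #4, here is a minimal per-node diagnostics sketch. The /data/diag output directory and the install path are assumptions; cbcollect_info must run locally on each node, typically as root.

    #!/bin/bash
    # Collect Couchbase diagnostics on this node into a timestamped zip.
    set -e
    OUT=/data/diag/cbcollect_$(hostname -s)_$(date +%Y%m%d_%H%M%S).zip
    mkdir -p /data/diag
    /opt/couchbase/bin/cbcollect_info "$OUT"
    echo "Diagnostics written to $OUT"

Run the same script on every node (for example, via Ansible) so that support gets a complete picture of the cluster.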

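For tip #7, a rough sketch of a rebalance wrapper with safety checks. The cluster address, credentials, and the jq dependency are assumptions, and output formats differ between versions, so treat this as a starting point rather than a drop-in script.

    #!/bin/bash
    # Rebalance only if every node is healthy and no rebalance is already running.
    set -e
    CLUSTER=cluster-node.domain.com:8091
    AUTH="Administrator:password"

    # Safety check 1: all nodes must report status "healthy".
    unhealthy=$(curl -s -u "$AUTH" "http://$CLUSTER/pools/default" \
      | jq -r '.nodes[] | select(.status != "healthy") | .hostname')
    if [ -n "$unhealthy" ]; then
      echo "Aborting: unhealthy nodes found: $unhealthy" >&2
      exit 1
    fi

    # Safety check 2: do not start a second rebalance.
    status=$(curl -s -u "$AUTH" "http://$CLUSTER/pools/default/rebalanceProgress" \
      | jq -r '.status')
    if [ "$status" != "none" ]; then
      echo "Aborting: a rebalance is already in progress (status: $status)" >&2
      exit 1
    fi

    /opt/couchbase/bin/couchbase-cli rebalance -c "$CLUSTER" \
      -u Administrator -p password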