1. Architecture Overview
Couchbase Cluster Components
-
Data Service (KV): Handles read/write operations and manages the key-value store.
-
Index Service: Manages GSI (Global Secondary Indexes) to support N1QL queries.
-
Query Service: Processes N1QL queries using indexes or views.
-
Search Service: Handles Full-Text Search.
-
Analytics Service: Performs parallel analytics workloads.
-
Eventing Service: Real-time event-driven processing.
🕸️ Typical Production Deployment (2015-era Reference)
2. Kernel & OS-Level Tuning
Memory and Swapping
-
Disable Transparent Huge Pages (THP):
-
Disable Swap:
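The two items above can be sketched as a small shell helper. This is a minimal sketch, assuming the standard sysfs location for THP on mainline kernels; the function names are illustrative, and the swappiness value of 0 is an assumption (some guides prefer 1). The apply step needs root and is deliberately not called automatically.

```shell
#!/bin/sh
THP_DIR=/sys/kernel/mm/transparent_hugepage   # standard sysfs path (assumption: mainline kernel)

# Print the active mode from a sysfs value such as "always madvise [never]".
thp_active_mode() {
    printf '%s\n' "$1" | sed -n 's/.*\[\([a-z+_]*\)\].*/\1/p'
}

# Report the current state without changing anything.
report_memory_tuning() {
    for f in enabled defrag; do
        if [ -r "$THP_DIR/$f" ]; then
            echo "THP $f: $(thp_active_mode "$(cat "$THP_DIR/$f")")"
        fi
    done
    if [ -r /proc/sys/vm/swappiness ]; then
        echo "vm.swappiness: $(cat /proc/sys/vm/swappiness)"
    fi
    return 0
}

# Apply the settings until the next reboot (run as root).
apply_memory_tuning() {
    echo never > "$THP_DIR/enabled"
    echo never > "$THP_DIR/defrag"
    swapoff -a                  # turn off all active swap devices
    sysctl -w vm.swappiness=0   # assumption: 0; adjust to your policy
}

report_memory_tuning
```

These changes do not survive a reboot. To make them permanent, the THP writes would go into an init script or systemd unit, and swap entries would be removed from /etc/fstab.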
File Descriptors
-
Increase limits in /etc/security/limits.conf:
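As an illustration, the helper below prints candidate limits.conf lines and checks the current shell's open-file limit. The user name couchbase and the value 70000 are assumptions made for this sketch, not figures from this guide; confirm the limit recommended for your Couchbase version.

```shell
#!/bin/sh
CB_OS_USER=couchbase   # assumption: the service account name
CB_NOFILE=70000        # assumption: illustrative target value

# Print the lines to append to /etc/security/limits.conf.
limits_conf_lines() {
    printf '%s soft nofile %s\n' "$CB_OS_USER" "$CB_NOFILE"
    printf '%s hard nofile %s\n' "$CB_OS_USER" "$CB_NOFILE"
}

# Succeed if the current shell's open-file limit is at least the given target.
nofile_ok() {
    current=$(ulimit -n)
    [ "$current" = "unlimited" ] || [ "$current" -ge "$1" ]
}

limits_conf_lines
nofile_ok "$CB_NOFILE" || echo "open-file limit $(ulimit -n) is below $CB_NOFILE" >&2
```

Note that limits.conf applies to PAM login sessions. A service started by systemd takes its limit from LimitNOFILE in the unit file instead.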
Network Tuning
-
Use low-latency network settings:
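The guide does not name the specific settings, so the sketch below only shows how a set of targets could be compared against the running kernel. The two sysctl keys and their values are hypothetical examples chosen for illustration, not recommendations from this document.

```shell
#!/bin/sh
# Hypothetical targets: keys and values are illustrative only.
TARGETS="net.core.somaxconn=1024
net.ipv4.tcp_slow_start_after_idle=0"

# Map a sysctl key to its /proc/sys path (dots become slashes).
sysctl_path() {
    printf '/proc/sys/%s\n' "$(printf '%s' "$1" | tr . /)"
}

# Print one line per target: key, current value, wanted value.
report_sysctl() {
    printf '%s\n' "$TARGETS" | while IFS='=' read -r key want; do
        path=$(sysctl_path "$key")
        if [ -r "$path" ]; then have=$(cat "$path"); else have="unavailable"; fi
        echo "$key current=$have wanted=$want"
    done
}

report_sysctl
```

Once the targets are agreed, they can be applied with sysctl -w and persisted in a file under /etc/sysctl.d/.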
3. Best Practices (Operational)
3.1 Storage
-
Use ext4 or XFS file systems with noatime flag.
-
Place index and data on separate physical volumes for IO separation.
-
Prefer RAID 10 for low-latency workloads.
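A quick way to verify the file system and noatime recommendations is to read /proc/mounts. This is a minimal sketch; the mount points /data and /index are hypothetical and should be replaced with the volumes actually used for data and index.

```shell
#!/bin/sh
# Succeed if a comma-separated mount option string contains noatime.
has_noatime() {
    case ",$1," in
        *,noatime,*) return 0 ;;
        *) return 1 ;;
    esac
}

# Print "fstype options" for a mount point, using the /proc/mounts fields
# (device, mount point, type, options, ...).
mount_info() {
    awk -v mp="$1" '$2 == mp { print $3, $4 }' /proc/mounts 2>/dev/null
}

# Hypothetical mount points for the data and index volumes.
for mp in /data /index; do
    info=$(mount_info "$mp")
    if [ -z "$info" ]; then echo "$mp: not a mount point"; continue; fi
    fstype=${info%% *}
    opts=${info#* }
    if has_noatime "$opts"; then
        echo "$mp: $fstype, noatime set"
    else
        echo "$mp: $fstype, noatime missing"
    fi
done
```

If the two paths report the same device, data and index are sharing a volume and the IO separation described above is not in effect.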
3.2 Memory
-
Ensure working set fits in memory.
-
Monitor ephemeral buckets and eviction policies (value-only vs full).
-
Monitor the resident ratio; healthy clusters typically keep it above 95%.
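The resident ratio can be read per bucket from cbstats, as sketched below. This assumes the vb_active_perc_mem_resident stat in the output of cbstats ... all and the default install path; the credentials are placeholders, and check_bucket is an illustrative name.

```shell
#!/bin/sh
CBSTATS=/opt/couchbase/bin/cbstats
THRESHOLD=95                        # the healthy-cluster figure quoted above
CB_USER=${CB_USER:-Administrator}   # placeholder credentials
CB_PASS=${CB_PASS:-password}

# Extract the active resident ratio from `cbstats ... all` output on stdin.
resident_ratio() {
    awk -F: '$1 ~ /vb_active_perc_mem_resident/ { gsub(/[ \t]/, "", $2); print $2; exit }'
}

# Succeed if the ratio (percent, fraction ignored) meets the threshold.
ratio_ok() { [ "${1%%.*}" -ge "$2" ]; }

# usage: check_bucket BUCKET
check_bucket() {
    ratio=$("$CBSTATS" localhost:11210 -u "$CB_USER" -p "$CB_PASS" -b "$1" all | resident_ratio)
    if [ -z "$ratio" ]; then echo "$1: stat not found" >&2; return 2; fi
    if ratio_ok "$ratio" "$THRESHOLD"; then
        echo "$1: resident ratio ${ratio}% ok"
    else
        echo "$1: resident ratio ${ratio}% is below ${THRESHOLD}%"
        return 1
    fi
}
```

Calling check_bucket for each bucket from cron and alerting on a non-zero exit status is one simple way to wire this into monitoring.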
3.3 Compaction
-
Schedule automatic compaction for off-peak hours.
-
Monitor compaction queues in cbcollect_info output or the Couchbase Admin UI.
3.4 Backup
-
Use cbbackupmgr for scheduled backups.
-
Store backups off-node and test recovery monthly.
-
Automate with crontab/Ansible and validate integrity.
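A scheduled backup along these lines might look like the following sketch. The archive path, repository name, cluster address, and credentials are all hypothetical placeholders. The script prints the commands by default and only executes them when RUN=1 is set, so it can be reviewed safely first.

```shell
#!/bin/sh
CBM=/opt/couchbase/bin/cbbackupmgr
ARCHIVE=/backups/couchbase          # hypothetical off-node mount
REPO=nightly                        # hypothetical repository name
CLUSTER=couchbase://127.0.0.1       # hypothetical cluster address
CB_USER=${CB_USER:-Administrator}   # placeholder credentials
CB_PASS=${CB_PASS:-password}

# Print the command instead of running it unless RUN=1 is set.
run() { if [ "${RUN:-0}" = "1" ]; then "$@"; else echo "$*"; fi; }

backup_once() {
    # Create the repository on first use only.
    if [ ! -d "$ARCHIVE/$REPO" ]; then
        run "$CBM" config --archive "$ARCHIVE" --repo "$REPO" || return 1
    fi
    run "$CBM" backup --archive "$ARCHIVE" --repo "$REPO" \
        --cluster "$CLUSTER" --username "$CB_USER" --password "$CB_PASS"
}

backup_once

# Example crontab entry (script path is hypothetical), daily at 02:00:
# 0 2 * * * RUN=1 /usr/local/bin/cb_backup.sh >> /var/log/cb_backup.log 2>&1
```

A successful backup command does not prove the data is restorable. The monthly recovery test mentioned above, restoring into a scratch cluster, remains the real integrity check.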
4. Error Handling & Troubleshooting
4.1 Node Join Failure (HTTP 500)
Resolution:
-
Clear /data/couchbase/, then recreate the data and index folders.
-
Ensure required ports (8091, 8092, 11210, etc.) are not in use.
-
Use netstat to validate before restarting services.
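The port check can be scripted as below. This is a minimal sketch that parses netstat -tln output and falls back to ss -tln where netstat is not installed; the port list contains only the ports named above and should be extended for the services you run.

```shell
#!/bin/sh
PORTS="8091 8092 11210"   # ports named in this section; extend as needed

# Read `netstat -tln` or `ss -tln` output on stdin; succeed if PORT is listening.
port_in_use() {
    awk -v p="$1" '
        { for (i = 1; i <= NF; i++) if ($i ~ (":" p "$")) found = 1 }
        END { exit !found }'
}

check_ports() {
    listing=$(netstat -tln 2>/dev/null || ss -tln 2>/dev/null)
    for p in $PORTS; do
        if printf '%s\n' "$listing" | port_in_use "$p"; then
            echo "port $p: in use"
        else
            echo "port $p: free"
        fi
    done
}

check_ports
```

Run this while Couchbase is stopped: any port reported as in use at that point is held by another process and needs to be freed before the node can rejoin.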
4.2 Memcached or Moxi Exit (status 134 or 0)
Diagnosis:
-
Check ns_server.log and babysitter logs.
-
Look for memory exhaustion or kernel panic signatures.
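To help with the diagnosis, note that an exit status above 128 conventionally means the process was killed by signal (status minus 128), so 134 corresponds to signal 6 (SIGABRT on Linux), while 0 is a clean exit. The sketch below decodes the status and searches a system log for OOM-killer messages; the default log path is taken from section 4.3 and may differ on your distribution.

```shell
#!/bin/sh
# Return the signal number encoded in an exit status, or 0 if none.
status_signal_number() {
    if [ "$1" -gt 128 ]; then echo $(( $1 - 128 )); else echo 0; fi
}

# Search a log for kernel OOM-killer messages.
# usage: oom_events [LOGFILE]
oom_events() {
    grep -i -E 'out of memory|oom-killer' "${1:-/var/log/messages}"
}

sig=$(status_signal_number 134)
echo "status 134 -> signal $sig ($(kill -l "$sig"))"
```

If oom_events returns lines naming memcached around the time of the exit, memory exhaustion is the likely cause rather than a Couchbase defect.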
4.3 Auto-failover Recurrence
Fixes:
-
Increase auto-failover timeout to 60s or more.
-
Check for NIC or MTU mismatches between nodes and review system logs (/var/log/messages).
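Both fixes can be sketched in shell. The timeout change uses couchbase-cli setting-autofailover; the cluster address and credentials are placeholders, and commands are only printed unless RUN=1 is set. The MTU listing reads sysfs so the output can be compared across nodes.

```shell
#!/bin/sh
CLI=/opt/couchbase/bin/couchbase-cli
CB_USER=${CB_USER:-Administrator}   # placeholder credentials
CB_PASS=${CB_PASS:-password}

# Print the command instead of running it unless RUN=1 is set.
run() { if [ "${RUN:-0}" = "1" ]; then "$@"; else echo "$*"; fi; }

# usage: set_failover_timeout HOST:PORT SECONDS
set_failover_timeout() {
    run "$CLI" setting-autofailover -c "$1" -u "$CB_USER" -p "$CB_PASS" \
        --enable-auto-failover 1 --auto-failover-timeout "$2"
}

# Print "interface mtu" for every network interface on this node.
list_mtus() {
    for d in /sys/class/net/*; do
        if [ -r "$d/mtu" ]; then echo "${d##*/} $(cat "$d/mtu")"; fi
    done
    return 0
}

set_failover_timeout 127.0.0.1:8091 60
list_mtus
```

Running list_mtus on every node and diffing the results is a quick way to spot a node whose cluster-facing interface has a different MTU from its peers.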
5. Upgrades & Version Management
Upgrade Path
-
Follow supported paths:
2.2 → 3.0.3 → 4.1.0 → 5.x+
Use Swap Rebalance for Seamless Upgrade
-
Script the following:
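The original script is not reproduced here; the following is a minimal sketch of a swap rebalance using couchbase-cli, under the assumption that the new node already runs the target version. Adding one node and removing one node in the same rebalance is what makes it a swap. Host names and credentials are placeholders, flag spelling may differ on older releases, and commands are only printed unless RUN=1 is set.

```shell
#!/bin/sh
CLI=/opt/couchbase/bin/couchbase-cli
CB_USER=${CB_USER:-Administrator}   # placeholder credentials
CB_PASS=${CB_PASS:-password}

# Print the command instead of running it unless RUN=1 is set.
run() { if [ "${RUN:-0}" = "1" ]; then "$@"; else echo "$*"; fi; }

# usage: swap_rebalance CLUSTER_NODE OLD_NODE NEW_NODE   (each as host:port)
swap_rebalance() {
    # Step 1: add the upgraded node to the cluster (no data moves yet).
    run "$CLI" server-add -c "$1" -u "$CB_USER" -p "$CB_PASS" \
        --server-add "$3" \
        --server-add-username "$CB_USER" --server-add-password "$CB_PASS" || return 1
    # Step 2: rebalance while removing the old node, so data moves old -> new.
    run "$CLI" rebalance -c "$1" -u "$CB_USER" -p "$CB_PASS" --server-remove "$2"
}

# Dry run with hypothetical host names.
swap_rebalance cb1.example.com:8091 cb-old.example.com:8091 cb-new.example.com:8091
```

Repeating this one node at a time keeps cluster capacity constant throughout the upgrade.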
6. Cross Data Center Replication (XDCR)
-
Set up bidirectional or unidirectional replication.
-
Secure using SSL for XDCR.
-
Monitor lag using:
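The original command is missing from this copy. One option, sketched here as an assumption, is to poll the changes_left statistic for a replication through the REST API on port 8091. The URL layout below follows the per-replication stats path used by several Couchbase releases and should be verified against your version; the remote cluster UUID can be looked up under /pools/default/remoteClusters. All names and credentials are placeholders.

```shell
#!/bin/sh
CB_USER=${CB_USER:-Administrator}   # placeholder credentials
CB_PASS=${CB_PASS:-password}

# Build the stats URL for one replication.
# usage: xdcr_stat_url HOST SOURCE_BUCKET REMOTE_CLUSTER_UUID DEST_BUCKET
xdcr_stat_url() {
    printf 'http://%s:8091/pools/default/buckets/%s/stats/replications%%2F%s%%2F%s%%2F%s%%2Fchanges_left\n' \
        "$1" "$2" "$3" "$2" "$4"
}

# Fetch the raw JSON for changes_left (not called automatically).
fetch_changes_left() {
    curl -s -u "$CB_USER:$CB_PASS" "$(xdcr_stat_url "$@")"
}

# Show the URL that would be queried, using hypothetical identifiers.
xdcr_stat_url localhost orders 0123abcd orders_dr
```

A changes_left value that keeps growing indicates the target cluster is falling behind; a value that hovers near zero indicates replication is keeping up.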
XDCR Best Practices
-
Avoid XDCR loops (clusters replicating back to each other).
-
Monitor for large checkpoint backlogs.
-
Enable compression (if supported by version).
7. Monitoring & Automation
Tools
-
cbcollect_info: Cluster diagnostics.
-
cbstats: Node-level stats.
-
Custom scripts: monitor resident ratio, replication lag, memory usage.
Automation Ideas
-
Rebalancing triggers (after node recovery)
-
Scheduled backup + integrity validation
-
Alerting: memory, disk, XDCR lag thresholds
8. Sample Scripts Snippet
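The original snippets are not included in this copy. As a stand-in, here is a minimal sketch of the disk-threshold alert mentioned under Automation Ideas. The 85% threshold is a hypothetical value, the checked path is the data directory listed in Appendix B, and the alert action is just a message on stderr.

```shell
#!/bin/sh
DATA_PATH=/opt/couchbase/var/lib/couchbase/data
DISK_THRESHOLD=85   # hypothetical alert threshold, in percent

# Read `df -P` output on stdin and print the use percentage without the % sign.
parse_df_pct() {
    awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
}

# Succeed if VALUE is at or above LIMIT.
over_threshold() { [ "$1" -ge "$2" ]; }

check_disk() {
    pct=$(df -P "$1" 2>/dev/null | parse_df_pct)
    if [ -z "$pct" ]; then echo "cannot read usage for $1" >&2; return 2; fi
    if over_threshold "$pct" "$DISK_THRESHOLD"; then
        echo "ALERT: $1 is ${pct}% full (threshold ${DISK_THRESHOLD}%)" >&2
        return 1
    fi
    echo "$1: ${pct}% used"
}

check_disk "$DATA_PATH" || true
```

The same pattern (parse one number, compare to a threshold, exit non-zero) applies to the memory and XDCR lag alerts, with the parse step swapped for the relevant stat.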
9. Final Points to Remember
-
Avoid SELECT *; write targeted queries with covering indexes.
-
Use partial indexes (WHERE type = 'txn') to reduce index bloat.
-
Monitor rebalance queue length during cluster changes.
-
Avoid over-committing memory across data and index services.
-
Tune Linux kernel to avoid memory fragmentation and swap usage.
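The first two points can be illustrated together: a partial index restricted to type = 'txn' that also covers a targeted query. In this sketch the bucket name, field names, and index name are all hypothetical. The statements are printed by default and sent through cbq only when RUN=1 is set.

```shell
#!/bin/sh
CBQ=/opt/couchbase/bin/cbq
BUCKET=app_data                     # hypothetical bucket name
CB_USER=${CB_USER:-Administrator}   # placeholder credentials
CB_PASS=${CB_PASS:-password}

# Partial index: only documents with type = 'txn' are indexed.
index_statement() {
    cat <<EOF
CREATE INDEX idx_txn_date_amount ON \`$BUCKET\`(txn_date, amount) WHERE type = 'txn';
EOF
}

# Targeted query: selects only indexed fields, so the index can cover it.
query_statement() {
    cat <<EOF
SELECT txn_date, amount FROM \`$BUCKET\` WHERE type = 'txn' AND txn_date >= '2024-01-01';
EOF
}

# Print the statement, or execute it through cbq when RUN=1 is set.
run_n1ql() {
    if [ "${RUN:-0}" = "1" ]; then
        "$CBQ" -e http://localhost:8093 -u "$CB_USER" -p "$CB_PASS" -s "$1"
    else
        echo "$1"
    fi
}

run_n1ql "$(index_statement)"
run_n1ql "$(query_statement)"
```

Prefixing the query with EXPLAIN shows whether the plan uses the index as a covering index or falls back to fetching full documents.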
Appendix
A. Relevant Logs to Monitor
-
/opt/couchbase/var/lib/couchbase/logs/ns_server.log
-
babysitter.log, memcached.log, stats.log
B. Key Directories
-
/opt/couchbase/var/lib/couchbase/data/
-
/opt/couchbase/var/lib/couchbase/index/