Monday, January 5, 2015

Points to remember in Couchbase

1. Architecture Overview

 Couchbase Cluster Components

  • Data Service (KV): Handles read/write operations and manages the key-value store.

  • Index Service: Manages GSI (Global Secondary Indexes) to support N1QL queries.

  • Query Service: Processes N1QL queries using indexes or views.

  • Search Service: Handles Full-Text Search.

  • Analytics Service: Performs parallel analytics workloads.

  • Eventing Service: Real-time event-driven processing.

 Typical Production Deployment (2015-era Reference)

+------------------------+
| Load Balancer          |
+-----------+------------+
            |
+-----------v------------+
|   Couchbase Query/API  |  ← 3 Nodes (Query, Index)
+-----------+------------+
            |
+-----------v------------+
|   Couchbase Data Tier  |  ← 5-9 Nodes (KV + Index)
+-----------+------------+
            |
+-----------v------------+
| XDCR/Remote Clusters   |
+------------------------+

2. Kernel & OS-Level Tuning

 Memory and Swapping

  • Disable Transparent Huge Pages (THP):

    echo never > /sys/kernel/mm/transparent_hugepage/enabled
    echo never > /sys/kernel/mm/transparent_hugepage/defrag

  • Disable Swap:

    swapoff -a
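
Writes to /sys do not persist across a reboot, so it can help to verify the THP state rather than assume it. The following is a minimal sketch, assuming the kernel reports the active mode in square brackets (for example "always madvise [never]"); the function names are illustrative and not part of any Couchbase tooling.

```shell
# Print the active THP mode. The kernel marks it with square brackets,
# e.g. "always madvise [never]" -> "never".
thp_active_mode() {
    printf '%s\n' "$1" | sed -n 's/.*\[\([a-z+_-]*\)\].*/\1/p'
}

# Return 0 only if both the "enabled" and "defrag" settings are "never".
# $1 = THP sysfs directory (defaults to the standard location).
thp_disabled() {
    thp_dir="${1:-/sys/kernel/mm/transparent_hugepage}"
    for f in enabled defrag; do
        [ -r "$thp_dir/$f" ] || return 1
        [ "$(thp_active_mode "$(cat "$thp_dir/$f")")" = "never" ] || return 1
    done
}
```

A startup or monitoring script could call `thp_disabled || echo "THP is still enabled"` after boot.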

 File Descriptors

  • Increase limits in /etc/security/limits.conf:

    couchbase soft nofile 100000
    couchbase hard nofile 100000
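
Entries in limits.conf are applied to login sessions, so a service started by the init system may not inherit them. A rough way to confirm what a running process actually received is to read its /proc/<pid>/limits file. The sketch below assumes the standard Linux layout of that file; the helper names are hypothetical.

```shell
# Extract the soft "Max open files" limit from /proc/<pid>/limits content on stdin.
# Line layout: "Max open files  <soft>  <hard>  files", so the soft limit is field 4.
soft_nofile_limit() {
    awk '/^Max open files/ { print $4 }'
}

# Return 0 if the soft limit in the given limits file meets the minimum.
# $1 = path to a limits file, $2 = required minimum (default 100000).
nofile_ok() {
    limit=$(soft_nofile_limit < "$1") || return 1
    [ "$limit" = "unlimited" ] && return 0
    [ -n "$limit" ] && [ "$limit" -ge "${2:-100000}" ]
}
```

For example, `nofile_ok /proc/<pid>/limits` with the PID of the Couchbase memcached process reports whether the raised limit took effect.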

 Network Tuning

  • Use low-latency network settings:

    sysctl -w net.core.somaxconn=4096
    sysctl -w net.ipv4.tcp_tw_reuse=1
    sysctl -w net.ipv4.ip_local_port_range="10240 65535"
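
Values set with sysctl -w are lost on reboot. One way to keep them is a drop-in file under /etc/sysctl.d, sketched below; the file name 90-couchbase.conf is a hypothetical choice, and the values are the ones listed above.

```shell
# Write the three network settings to a sysctl drop-in file.
# $1 = target file (the default name is an assumption, any *.conf in that directory works).
write_couchbase_sysctl() {
    conf="${1:-/etc/sysctl.d/90-couchbase.conf}"
    cat > "$conf" <<'EOF'
net.core.somaxconn = 4096
net.ipv4.tcp_tw_reuse = 1
net.ipv4.ip_local_port_range = 10240 65535
EOF
}
```

After writing the file as root, `sysctl -p /etc/sysctl.d/90-couchbase.conf` loads it immediately.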

3. Best Practices (Operational)

3.1 Storage

  • Use ext4 or XFS file systems mounted with the noatime option.

  • Place index and data on separate physical volumes for IO separation.

  • Prefer RAID 10 for low-latency workloads.

3.2 Memory

  • Ensure working set fits in memory.

  • Monitor ephemeral buckets and eviction policies (value-only vs full).

  • Monitor the resident ratio; in a healthy cluster it should stay above 95%.
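
The resident ratio check can be scripted against cbstats output. This is a sketch only: it assumes the stat appears as a line of the form "vb_active_perc_mem_resident: <percent>" in the "all" stats group, which should be confirmed on the version in use. The function names are illustrative.

```shell
# Read cbstats "all" output on stdin and print the active resident ratio.
# Assumption: the stat is reported as "vb_active_perc_mem_resident: <percent>".
active_resident_ratio() {
    awk -F: '$1 ~ /vb_active_perc_mem_resident/ { gsub(/[[:space:]]/, "", $2); print $2 }'
}

# Return 0 if the ratio found on stdin is at or above the threshold.
# $1 = minimum percent (default 95). Returns 1 if the stat is missing.
resident_ratio_ok() {
    active_resident_ratio | awk -v min="${1:-95}" '
        NF { ok = ($1 + 0 >= min + 0) }
        END { exit !ok }'
}
```

A cron job could pipe the cbstats output for each bucket into `resident_ratio_ok` and raise an alert on a non-zero exit status.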

3.3 Compaction

  • Schedule off-peak automatic compaction.

  • Monitor compaction queues in the cbcollect_info output or the Couchbase Admin UI.

3.4 Backup

  • Use cbbackupmgr for scheduled backups.

  • Store backups off-node and test recovery monthly.

  • Automate with crontab/Ansible and validate integrity.

4. Error Handling & Troubleshooting

4.1 Node Join Failure (HTTP 500)

Resolution:

  • Clear /data/couchbase/, then recreate the data and index folders.

  • Ensure required ports (8091, 8092, 11210, etc.) are not in use.

  • Use netstat to validate before restarting services.
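
The port check can be scripted so it runs the same way on every node. The sketch below reads "netstat -tln" (or "ss -tln") output from stdin and prints the requested ports that already have a listener; it assumes the local address is the fourth column, which holds for both tools in their default output. The function name is hypothetical.

```shell
# Print which of the given ports already have a TCP listener.
# Input: "netstat -tln" or "ss -tln" output on stdin. Arguments: port numbers.
ports_in_use() {
    awk -v want="$*" '
        BEGIN { n = split(want, p, " "); for (i = 1; i <= n; i++) wanted[p[i]] = 1 }
        /LISTEN/ {
            # The port is whatever follows the last colon of the local address.
            m = split($4, a, ":")
            if (a[m] in wanted) busy[a[m]] = 1
        }
        END { for (i = 1; i <= n; i++) if (p[i] in busy) print p[i] }
    '
}
```

With Couchbase stopped, `netstat -tln | ports_in_use 8091 8092 11210` should print nothing; any port it prints is held by another process.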

4.2 Memcached or Moxi Exit (status 134 or 0)

Diagnosis:

  • Check ns_server.log and babysitter logs.

  • Look for memory exhaustion or kernel panic signatures.
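
When reading these exit statuses it helps to decode them first. Assuming the logged status follows the usual shell convention, values above 128 mean the process was killed by signal (status - 128), so 134 corresponds to signal 6 (SIGABRT), which is what abort() raises. The helper below is an illustrative sketch, not a Couchbase tool.

```shell
# Translate a shell-style exit status into a short description.
# Statuses above 128 conventionally mean "killed by signal (status - 128)".
describe_exit_status() {
    status="$1"
    if [ "$status" -gt 128 ]; then
        sig=$((status - 128))
        echo "killed by signal $sig (SIG$(kill -l "$sig"))"
    elif [ "$status" -eq 0 ]; then
        echo "exited without an error code"
    else
        echo "exited with error code $status"
    fi
}
```

Running `describe_exit_status 134` reports signal 6, which points toward an abort inside the process rather than an external kill.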

4.3 Auto-failover Recurrence

Fixes:

  • Increase auto-failover timeout to 60s or more.

  • Check for NIC or MTU mismatches between nodes and review system logs (/var/log/messages).

5. Upgrades & Version Management

 Upgrade Path

  • Follow supported paths:
    2.2 → 3.0.3 → 4.1.0 → 5.x+

 Use Swap Rebalance for Seamless Upgrade

  • Script the following:

    couchbase-cli rebalance \
      --cluster 127.0.0.1:8091 \
      --user admin --password pass \
      --server-remove oldnode \
      --server-add newnode

6. Cross Data Center Replication (XDCR)

  • Set up bidirectional or unidirectional replication.

  • Secure using SSL for XDCR.

  • List active replications and their status (a first step when checking lag) using:

    couchbase-cli xdcr-replicate --list

XDCR Best Practices

  • Avoid unintended XDCR loops (for example, chains of clusters that replicate back to their origin); set up bidirectional replication deliberately.

  • Monitor for large checkpoint backlogs.

  • Enable compression (if supported by version).


7. Monitoring & Automation

Tools

  • cbcollect_info: Cluster diagnostics.

  • cbstats: Node-level stats.

  • Custom scripts: monitor resident ratio, replication lag, memory usage.

Automation Ideas

  • Rebalancing triggers (after node recovery)

  • Scheduled backup + integrity validation

  • Alerting: memory, disk, XDCR lag thresholds
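
The alerting idea can start as a single threshold helper that a cron job calls for each metric. This is a minimal sketch; the metric names and limits in the usage line are hypothetical, and how the values are collected is left to the custom scripts mentioned above.

```shell
# Emit an alert line when a metric crosses its threshold.
# Usage: check_threshold <name> <value> <gt|lt> <limit>
# Prints "OK ..." and returns 0, or prints "ALERT ..." and returns 1.
check_threshold() {
    awk -v name="$1" -v v="$2" -v op="$3" -v lim="$4" 'BEGIN {
        bad = (op == "gt") ? (v + 0 > lim + 0) : (v + 0 < lim + 0)
        if (bad) {
            printf "ALERT %s=%s (limit %s %s)\n", name, v, op, lim
            exit 1
        }
        printf "OK %s=%s\n", name, v
    }'
}
```

For example, `check_threshold disk_used_pct 91 gt 85 || notify` would call a site-specific `notify` command when disk usage exceeds 85%.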


8. Sample Scripts Snippet

#!/bin/bash
# Monitor XDCR lag
CLUSTER="127.0.0.1"
curl -s -u admin:password \
  "http://${CLUSTER}:8091/pools/default/tasks" \
  | jq '.[] | select(.type=="xdcr") | {source, target, replication_lag}'

9. Final Points to Remember

  • Avoid SELECT * — write targeted queries with covering indexes.

  • Use partial indexes (WHERE type = 'txn') to reduce index bloat.

  • Monitor rebalance queue length during cluster changes.

  • Avoid over-committing memory across data and index services.

  • Tune Linux kernel to avoid memory fragmentation and swap usage.
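
To make the first two points concrete, the sketch below prints a partial index and a query it is intended to cover. The bucket name `app` and the fields txn_date and amount are hypothetical; whether the query is actually covered depends on the server version, so confirm with EXPLAIN and look for a "covers" entry in the plan.

```shell
# Print example N1QL statements: a partial index restricted to type = 'txn',
# and a targeted query that selects only indexed fields instead of SELECT *.
# Bucket and field names are illustrative assumptions.
n1ql_examples() {
    cat <<'EOF'
CREATE INDEX idx_txn_date_amount ON `app`(txn_date, amount) WHERE type = 'txn';
SELECT txn_date, amount FROM `app` WHERE type = 'txn' AND txn_date >= '2015-01-01';
EOF
}
```

The printed statements can be pasted into the cbq shell or the query UI; the index stays small because documents of other types are never indexed.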


Appendix

A. Relevant Logs to Monitor

  • /opt/couchbase/var/lib/couchbase/logs/ns_server.log

  • babysitter.log, memcached.log, stats.log

B. Key Directories

  • /opt/couchbase/var/lib/couchbase/data/

  • /opt/couchbase/var/lib/couchbase/index/