Is there hope when your Couchbase cluster is stuck in compacting buckets?
Well, to be anticlimactic: no.
Scope
This seems to be at least a Couchbase 3.x problem. So far I haven't experienced it with Couchbase 4. For both versions I only know the so-called community edition. As for the frequency: whether Couchbase 3 gets stuck on bucket compaction is probabilistic. In the setups I've run so far it happens about once every half a year. This might be load-dependent, though: I have never had the issue on some "smaller" clusters, so I think it is.
The Symptoms
If you do not explicitly monitor the compaction status, you will probably notice the problem when the disks on some nodes run full. When compaction no longer works, Couchbase disk fragmentation keeps growing and eventually fills your disks. In the admin GUI you will see a constant "Compacting..." indicator in the top right. In normal operation compaction never takes more than a few minutes to finish (again depending on your usage).
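To catch this before the disks fill up, you could poll the cluster for compaction tasks and alert when one stays active far longer than the usual few minutes. The following is a minimal Python sketch, not a tested monitoring plugin. It assumes the REST endpoint /pools/default/tasks on the admin port returns a JSON list whose entries carry a "type" of "bucket_compaction" and a "bucket" name; check the field names against your Couchbase version. The names fetch_tasks, compacting_buckets and CompactionWatch are made up for this example.

```python
import base64
import json
import time
import urllib.request


def fetch_tasks(base_url, user, password):
    """Fetch the task list from one node (assumed endpoint: /pools/default/tasks)."""
    req = urllib.request.Request(base_url.rstrip("/") + "/pools/default/tasks")
    token = base64.b64encode("{}:{}".format(user, password).encode()).decode()
    req.add_header("Authorization", "Basic " + token)
    with urllib.request.urlopen(req, timeout=10) as resp:
        return json.load(resp)


def compacting_buckets(tasks):
    """Return the names of buckets that currently have a compaction task."""
    # Assumption: compaction tasks look like {"type": "bucket_compaction", "bucket": "..."}
    return {
        t["bucket"]
        for t in tasks
        if t.get("type") == "bucket_compaction" and "bucket" in t
    }


class CompactionWatch:
    """Remember when each bucket was first seen compacting and flag long runners."""

    def __init__(self, max_seconds):
        self.max_seconds = max_seconds
        self.first_seen = {}

    def update(self, tasks, now=None):
        now = time.time() if now is None else now
        active = compacting_buckets(tasks)
        # Forget buckets whose compaction has finished since the last poll.
        self.first_seen = {b: t for b, t in self.first_seen.items() if b in active}
        for bucket in active:
            self.first_seen.setdefault(bucket, now)
        # Report buckets that have been compacting longer than the threshold.
        return sorted(
            b for b, t in self.first_seen.items() if now - t > self.max_seconds
        )
```

Called from a cron job or monitoring agent every minute, something like watch.update(fetch_tasks("http://127.0.0.1:8091", user, password)) would return the buckets that have been compacting for longer than max_seconds. Pick the threshold generously, since normal compaction time depends on your usage.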
Things that do not work...
- Removing nodes: in this cluster state you cannot remove nodes anymore. It seems the compaction operation is locking the cluster. So disconnecting the nodes with full disks won't work, and it wouldn't help either.
- Restarting the cluster: whether you reboot, restart all instances in sequence, or take the entire cluster down and start it again, it won't help, as the compaction issue is persistent (see root cause below).
- Removing load: also doesn't help. The cluster doesn't recover even when it receives no requests anymore.
What does help...
- Reinstall your cluster: Yeah!
- Stopping traffic + flushing buckets: If you can afford the downtime / cold cache, stop all traffic, flush the buckets causing the problem and re-enable traffic.
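
The flush step can be scripted once traffic is stopped. Below is a minimal Python sketch, assuming the REST call POST /pools/default/buckets/<bucket>/controller/doFlush on the admin port and assuming flush is enabled in the bucket settings. The helper names build_flush_request and flush_buckets are made up for this example. Be aware that a flush deletes all data in the bucket.

```python
import base64
import urllib.parse
import urllib.request


def build_flush_request(base_url, bucket, user, password):
    """Build (but do not send) the flush request for one bucket."""
    # Assumed endpoint: POST /pools/default/buckets/<bucket>/controller/doFlush
    url = "{}/pools/default/buckets/{}/controller/doFlush".format(
        base_url.rstrip("/"), urllib.parse.quote(bucket, safe="")
    )
    req = urllib.request.Request(url, data=b"", method="POST")
    token = base64.b64encode("{}:{}".format(user, password).encode()).decode()
    req.add_header("Authorization", "Basic " + token)
    return req


def flush_buckets(base_url, buckets, user, password):
    """Flush the given buckets one by one. This wipes their data."""
    for bucket in buckets:
        req = build_flush_request(base_url, bucket, user, password)
        # urlopen raises urllib.error.HTTPError if the node rejects the request,
        # for example when flush is not enabled for the bucket.
        with urllib.request.urlopen(req, timeout=60) as resp:
            print(bucket, resp.status)
```

Run it only against the buckets that are stuck, for example flush_buckets("http://127.0.0.1:8091", ["sessions"], user, password), where "sessions" is a placeholder bucket name, and re-enable traffic afterwards.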