I have an Apache Ignite 2.14 cluster of 3 nodes running on Kubernetes. All my caches have one backup copy.
After enabling persistence on the default data region a couple of months ago, I started getting the exception CacheInvalidStateException: Failed to execute the cache operation (all partition owners have left the grid, partition data has been lost)
whenever one or two nodes restarted, either as a result of a deployment or for some other reason.
It was worrying, but I learned to fix it by running control.sh --cache reset_lost_partitions cacheName.
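For context, this is how I understand the message "all partition owners have left the grid": with one backup, each partition has two owners out of my three nodes, so two nodes being down at the same time can leave some partitions with no owner. The toy model below is only a sketch of that reasoning. The class name, the method names, and the round-robin placement are my own assumptions for illustration; Ignite's real affinity function assigns partitions differently.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Toy model (not Ignite code): which partitions lose every owner when some nodes are down. */
final class PartitionLossModel {

    /** Owners of a partition under a simple round-robin placement (an assumption, not Ignite's affinity). */
    static Set<Integer> owners(int partition, int nodes, int backups) {
        Set<Integer> result = new HashSet<>();
        // One primary copy plus `backups` backup copies, placed on consecutive nodes.
        for (int copy = 0; copy <= backups; copy++) {
            result.add((partition + copy) % nodes);
        }
        return result;
    }

    /** Partitions whose owners are all contained in the set of nodes that are down. */
    static List<Integer> lostPartitions(int partitions, int nodes, int backups, Set<Integer> downNodes) {
        List<Integer> lost = new ArrayList<>();
        for (int p = 0; p < partitions; p++) {
            if (downNodes.containsAll(owners(p, nodes, backups))) {
                lost.add(p);
            }
        }
        return lost;
    }

    public static void main(String[] args) {
        int partitions = 1024; // Ignite's default partition count
        int nodes = 3;         // my cluster size
        int backups = 1;       // one backup copy per cache

        System.out.println("Lost with one node down:  "
            + lostPartitions(partitions, nodes, backups, Set.of(0)).size());
        System.out.println("Lost with two nodes down: "
            + lostPartitions(partitions, nodes, backups, Set.of(0, 1)).size());
    }
}
```

In this model a single node going down never loses a partition, while two nodes going down together lose every partition whose two copies sit on exactly those nodes.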
This time, after two nodes restarted due to some transient failure, I started getting an error that I couldn't fix by running that command:
Caused by: class org.apache.ignite.internal.processors.cache.CacheInvalidStateException: Failed to execute the cache operation (all partition owners have left the grid, partition data has been lost) [cacheName=ignite-sys-atomic-cache@default-ds-group, partition=985, key=UserKeyCacheObjectImpl [part=985, val=GridCacheInternalKeyImpl [name=alias, grpName=default-ds-group], hasValBytes=true]]
    at org.apache.ignite.internal.processors.cache.distributed.dht.GridDhtTopologyFutureAdapter.validateKey(GridDhtTopologyFutureAdapter.java:214)
It looks like this time the issue involves a system cache, ignite-sys-atomic-cache@default-ds-group. I guess it is related to the AtomicSequence object that I use in the application to generate IDs. The error occurs exactly when I try to use AtomicLong.
The questions are:
- Why might this happen?
- Is it possible to fix it without destroying the cluster and reloading all the data from scratch (which would take a day or two)?
- How can I prevent similar issues in the future?
Thank you in advance!
P.S. The GridGain Portal reports the following error: Cache [default-ds-group] has zero partition copies.