We have set up two KeyDB servers with Active Replica, as described here: https://docs.keydb.dev/docs/active-rep/
We use HAProxy to route traffic to the correct server, so the current situation is:
keydb 01 - 10.0.0.7
keydb 02 - 10.0.0.8
We want to update and reboot keydb 01. We have put it into maintenance mode in HAProxy and all connections have drained, so the server is no longer in use and all live connections go to keydb 02.
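For context, a simplified sketch of our HAProxy backend (the backend/server names and the health check shown here are illustrative placeholders, not our exact config):

# illustrative backend; names, password and check details are placeholders
backend keydb_back
    mode tcp
    option tcp-check
    tcp-check send AUTH\ <keydb_auth>\r\n
    tcp-check expect string +OK
    tcp-check send PING\r\n
    tcp-check expect string +PONG
    server keydb01 10.0.0.7:6379 check inter 1s fall 3 rise 2
    server keydb02 10.0.0.8:6379 check inter 1s fall 3 rise 2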
Now when keydb 01 comes back up again, it asks keydb 02 for a full DB sync. After this is done, we see that keydb 02 also asks keydb 01 for a full DB sync! This causes keydb 02 to go into a LOADING state, while it is the only live server in HAProxy.
So for a short period of time NO KeyDB server is live, which results in errors like: LOADING Redis is loading the dataset in memory
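To illustrate, this is what a client sees during that window (example session with keydb-cli, not copied from our application logs):

$ keydb-cli -h 10.0.0.8 -p 6379 -a <keydb_auth> PING
(error) LOADING Redis is loading the dataset in memory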
The whole idea of this active replica setup is that it is robust, failsafe and provides high availability. In this situation, however, every time a server goes down there is a short moment of complete downtime, which is unacceptable for our setup.
Are we doing something wrong? Is this by design? Or do we have to reconfigure something?
We have tested with both disk-based and diskless syncs; it makes no difference. Our configuration (from an Ansible playbook, so the formatting can look a bit odd; a rough rendering of the resulting keydb.conf follows the list):
bind: "127.0.0.1 {{ my_private_ips[inventory_hostname] }}"
requirepass: "{{ keydb_auth }}"
masterauth: "{{ keydb_auth }}"
replicaof: "xxx 6379"
client-output-buffer-limit:
- normal 0 0 0
- replica 1024mb 256mb 60
- pubsub 32mb 8mb 60
repl-diskless-sync: yes
port: 6379
maxmemory: 3000m
active-replica: yes
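For readability, on keydb 01 the playbook above renders to roughly the following keydb.conf (replicaof points at keydb 02, as the log below confirms; the password is a placeholder). On keydb 02 only the bind address and the replicaof target (10.0.0.7) differ:

bind 127.0.0.1 10.0.0.7
port 6379
requirepass <keydb_auth>
masterauth <keydb_auth>
replicaof 10.0.0.8 6379
active-replica yes
repl-diskless-sync yes
maxmemory 3000m
client-output-buffer-limit normal 0 0 0
client-output-buffer-limit replica 1024mb 256mb 60
client-output-buffer-limit pubsub 32mb 8mb 60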
We are running Ubuntu 20.04.2 LTS with KeyDB version 6.0.16.
Here is the KeyDB log from keydb 01, which goes down for maintenance and comes back up again.
3812319:1439:C 18 Jun 2021 09:08:36.048 * DB saved on disk
3812319:1439:C 18 Jun 2021 09:08:36.054 * RDB: 3 MB of memory used by copy-on-write
958:1439:S 18 Jun 2021 09:08:36.094 * Background saving terminated with success
958:signal-handler (1624000133) Received SIGTERM scheduling shutdown...
958:signal-handler (1624000133) Received SIGTERM scheduling shutdown...
958:1439:S 18 Jun 2021 09:08:53.194 # User requested shutdown...
958:1439:S 18 Jun 2021 09:08:53.194 # systemd supervision requested, but NOTIFY_SOCKET not found
958:1439:S 18 Jun 2021 09:08:53.194 * Saving the final RDB snapshot before exiting.
958:1439:S 18 Jun 2021 09:08:53.194 # systemd supervision requested, but NOTIFY_SOCKET not found
958:1439:S 18 Jun 2021 09:08:54.159 * DB saved on disk
958:1439:S 18 Jun 2021 09:08:54.159 * Removing the pid file.
958:1439:S 18 Jun 2021 09:08:54.159 # KeyDB is now ready to exit, bye bye...
874:874:C 18 Jun 2021 09:09:04.528 * Before turning into a replica, using my own master parameters to synthesize a cached master: I may be able to synchronize with the new master with just a partial transfer.
874:874:C 18 Jun 2021 09:09:04.531 * Notice: "active-replica yes" implies "replica-read-only no"
874:874:C 18 Jun 2021 09:09:04.531 # oO0OoO0OoO0Oo KeyDB is starting oO0OoO0OoO0Oo
874:874:C 18 Jun 2021 09:09:04.531 # KeyDB version=6.0.16, bits=64, commit=00000000, modified=0, pid=874, just started
874:874:C 18 Jun 2021 09:09:04.531 # Configuration loaded
874:874:C 18 Jun 2021 09:09:04.531 # WARNING supervised by systemd - you MUST set appropriate values for TimeoutStartSec and TimeoutStopSec in your service unit.
874:874:C 18 Jun 2021 09:09:04.531 # systemd supervision requested, but NOTIFY_SOCKET not found
KeyDB 6.0.16 (00000000/0) 64 bit
Running in standalone mode
Port: 6379
PID: 957
Join the KeyDB community! https://community.keydb.dev/
957:874:S 18 Jun 2021 09:09:04.781 # Server initialized
957:874:S 18 Jun 2021 09:09:04.781 # WARNING overcommit_memory is set to 0! Background save may fail under low memory condition. To fix this issue add 'vm.overcommit_memory = 1' to /etc/sysctl.conf and then reboot or run the command 'sysctl vm.overcommit_memory=1' for this to take effect.
957:874:S 18 Jun 2021 09:09:04.781 # WARNING you have Transparent Huge Pages (THP) support enabled in your kernel. This will create latency and memory usage issues with KeyDB. To fix this issue run the command 'echo never > /sys/kernel/mm/transparent_hugepage/enabled' as root, and add it to your /etc/rc.local in order to retain the setting after a reboot. KeyDB must be restarted after THP is disabled.
957:874:S 18 Jun 2021 09:09:04.789 * Loading RDB produced by version 6.0.16
957:874:S 18 Jun 2021 09:09:04.789 * RDB age 11 seconds
957:874:S 18 Jun 2021 09:09:04.789 * RDB memory usage when created 100.16 Mb
957:874:S 18 Jun 2021 09:09:05.563 * DB loaded from disk: 0.778 seconds
957:874:S 18 Jun 2021 09:09:05.563 * Before turning into a replica, using my own master parameters to synthesize a cached master: I may be able to synchronize with the new master with just a partial transfer.
957:874:S 18 Jun 2021 09:09:05.563 # systemd supervision requested, but NOTIFY_SOCKET not found
957:1345:S 18 Jun 2021 09:09:05.563 Thread 1 alive.
957:1344:S 18 Jun 2021 09:09:05.564 Thread 0 alive.
957:1344:S 18 Jun 2021 09:09:05.564 * Connecting to MASTER 10.0.0.8:6379
957:1344:S 18 Jun 2021 09:09:05.564 * MASTER <-> REPLICA sync started
957:1344:S 18 Jun 2021 09:09:05.566 * Non blocking connect for SYNC fired the event.
957:1344:S 18 Jun 2021 09:09:05.567 * Master replied to PING, replication can continue...
957:1344:S 18 Jun 2021 09:09:05.571 * Replica 10.0.0.8:6379 asks for synchronization
957:1344:S 18 Jun 2021 09:09:05.571 * Full resync requested by replica 10.0.0.8:6379
957:1344:S 18 Jun 2021 09:09:05.571 * Replication backlog created, my new replication IDs are '60759691a4adca42cced2fd1895a470ab7b9eebb' and '0000000000000000000000000000000000000000'
957:1344:S 18 Jun 2021 09:09:05.571 * Delay next BGSAVE for diskless SYNC
957:1344:S 18 Jun 2021 09:09:05.573 * Partial resynchronization not possible (no cached master)
957:1344:S 18 Jun 2021 09:09:11.521 * Full resync from master: 3eb1532df27d6815eea8a420c7c72b04e03618dc:516117186158
957:1344:S 18 Jun 2021 09:09:11.521 * Discarding previously cached master state.
957:1344:S 18 Jun 2021 09:09:11.530 * MASTER <-> REPLICA sync: receiving streamed RDB from master with EOF to disk
957:1344:S 18 Jun 2021 09:09:11.603 * Starting BGSAVE for SYNC with target: replicas sockets
957:1344:S 18 Jun 2021 09:09:11.606 * Background RDB transfer started by pid 2412
2412:1344:C 18 Jun 2021 09:09:12.488 * RDB: 1 MB of memory used by copy-on-write
957:1344:S 18 Jun 2021 09:09:12.488 # Diskless rdb transfer, done reading from pipe, 1 replicas still up.
957:1344:S 18 Jun 2021 09:09:12.507 * Background RDB transfer terminated with success
957:1344:S 18 Jun 2021 09:09:12.507 * Streamed RDB transfer with replica 10.0.0.8:6379 succeeded (socket). Waiting for REPLCONF ACK from slave to enable streaming
957:1344:S 18 Jun 2021 09:09:12.872 * Synchronization with replica 10.0.0.8:6379 succeeded
957:1344:S 18 Jun 2021 09:10:12.799 # I/O error trying to sync with MASTER: connection lost
957:1344:S 18 Jun 2021 09:10:13.429 * Connecting to MASTER 10.0.0.8:6379
957:1344:S 18 Jun 2021 09:10:13.429 * MASTER <-> REPLICA sync started
957:1344:S 18 Jun 2021 09:10:13.429 * Non blocking connect for SYNC fired the event.
957:1344:S 18 Jun 2021 09:10:13.429 * Master replied to PING, replication can continue...
957:1344:S 18 Jun 2021 09:10:13.430 * Partial resynchronization not possible (no cached master)
957:1344:S 18 Jun 2021 09:10:19.943 * Full resync from master: 349f88c33a6dc9d67a3cb4d623727d9f1047033d:516121643991
957:1344:S 18 Jun 2021 09:10:19.952 * MASTER <-> REPLICA sync: receiving streamed RDB from master with EOF to disk
957:1344:S 18 Jun 2021 09:10:20.856 * MASTER <-> REPLICA sync: Loading DB in memory
957:1344:S 18 Jun 2021 09:10:20.856 * Loading RDB produced by version 6.0.16
957:1344:S 18 Jun 2021 09:10:20.856 * RDB age 1 seconds
957:1344:S 18 Jun 2021 09:10:20.856 * RDB memory usage when created 102.08 Mb
957:1344:S 18 Jun 2021 09:10:21.431 * MASTER <-> REPLICA sync: Finished with success
957:1344:S 18 Jun 2021 09:10:21.431 # systemd supervision requested, but NOTIFY_SOCKET not found
957:1344:S 18 Jun 2021 09:10:21.431 # systemd supervision requested, but NOTIFY_SOCKET not found
957:1344:S 18 Jun 2021 09:11:03.197 * 10000 changes in 60 seconds. Saving...
957:1344:S 18 Jun 2021 09:11:03.200 * Background saving started by pid 4845
4845:1344:C 18 Jun 2021 09:11:04.015 * DB saved on disk
4845:1344:C 18 Jun 2021 09:11:04.018 * RDB: 2 MB of memory used by copy-on-write
957:1344:S 18 Jun 2021 09:11:04.104 * Background saving terminated with success
957:1344:S 18 Jun 2021 09:12:05.071 * 10000 changes in 60 seconds. Saving...
957:1344:S 18 Jun 2021 09:12:05.075 * Background saving started by pid 4870
4870:1344:C 18 Jun 2021 09:12:05.963 * DB saved on disk
Here is the keydb 02 log. This server stayed "up", but was briefly unavailable:
960:1388:S 18 Jun 2021 09:08:29.066 * Background saving started by pid 3814844
3814844:1388:C 18 Jun 2021 09:08:29.896 * DB saved on disk
3814844:1388:C 18 Jun 2021 09:08:29.901 * RDB: 3 MB of memory used by copy-on-write
960:1388:S 18 Jun 2021 09:08:29.974 * Background saving terminated with success
960:1388:S 18 Jun 2021 09:08:54.176 # Connection with master lost.
960:1388:S 18 Jun 2021 09:08:54.176 * Caching the disconnected master state.
960:1388:S 18 Jun 2021 09:08:54.177 # Connection with replica client id #11 lost.
960:1388:S 18 Jun 2021 09:08:54.957 * Connecting to MASTER 10.0.0.7:6379
960:1388:S 18 Jun 2021 09:08:54.957 * MASTER <-> REPLICA sync started
960:1388:S 18 Jun 2021 09:09:02.073 # Error condition on socket for SYNC: Resource temporarily unavailable
960:1388:S 18 Jun 2021 09:09:02.988 * Connecting to MASTER 10.0.0.7:6379
960:1388:S 18 Jun 2021 09:09:02.988 * MASTER <-> REPLICA sync started
960:1388:S 18 Jun 2021 09:09:02.988 # Error condition on socket for SYNC: Operation now in progress
960:1388:S 18 Jun 2021 09:09:03.998 * Connecting to MASTER 10.0.0.7:6379
960:1388:S 18 Jun 2021 09:09:03.998 * MASTER <-> REPLICA sync started
960:1388:S 18 Jun 2021 09:09:03.999 * Non blocking connect for SYNC fired the event.
960:1388:S 18 Jun 2021 09:09:04.068 * Master replied to PING, replication can continue...
960:1388:S 18 Jun 2021 09:09:04.084 * Partial resynchronization not possible (no cached master)
960:1388:S 18 Jun 2021 09:09:04.087 * Replica 10.0.0.7:6379 asks for synchronization
960:1388:S 18 Jun 2021 09:09:04.087 * Full resync requested by replica 10.0.0.7:6379
960:1388:S 18 Jun 2021 09:09:04.087 * Delay next BGSAVE for diskless SYNC
960:1388:S 18 Jun 2021 09:09:10.035 * Starting BGSAVE for SYNC with target: replicas sockets
960:1388:S 18 Jun 2021 09:09:10.042 * Background RDB transfer started by pid 3814859
960:1388:S 18 Jun 2021 09:09:10.117 * Full resync from master: 640f9e48636076058722301cc52ea8a21bc8e450:453896462998
960:1388:S 18 Jun 2021 09:09:10.118 * Discarding previously cached master state.
960:1388:S 18 Jun 2021 09:09:10.122 * MASTER <-> REPLICA sync: receiving streamed RDB from master with EOF to disk
960:1388:S 18 Jun 2021 09:09:10.999 * MASTER <-> REPLICA sync: Loading DB in memory
960:1388:S 18 Jun 2021 09:09:11.000 * Replica is about to load the RDB file received from the master, but there is a pending RDB child running. Killing process 3814859 and removing its temp file to avoid any race
3814859:signal-handler (1624000150) Received SIGUSR1 in child, exiting now.
960:1388:S 18 Jun 2021 09:09:11.000 * Loading RDB produced by version 6.0.16
960:1388:S 18 Jun 2021 09:09:11.000 * RDB age 0 seconds
960:1388:S 18 Jun 2021 09:09:11.000 * RDB memory usage when created 95.02 Mb
960:1388:S 18 Jun 2021 09:09:11.018 # Diskless rdb transfer, done reading from pipe, 1 replicas still up.
960:1388:S 18 Jun 2021 09:09:11.019 # Background transfer terminated by signal 10
960:1388:S 18 Jun 2021 09:09:11.019 * Streamed RDB transfer with replica 10.0.0.7:6379 succeeded (socket). Waiting for REPLCONF ACK from slave to enable streaming
960:1388:S 18 Jun 2021 09:09:11.386 * MASTER <-> REPLICA sync: Finished with success
960:1388:S 18 Jun 2021 09:09:11.386 # systemd supervision requested, but NOTIFY_SOCKET not found
960:1388:S 18 Jun 2021 09:09:11.386 # systemd supervision requested, but NOTIFY_SOCKET not found
960:1388:S 18 Jun 2021 09:09:30.096 * 10000 changes in 60 seconds. Saving...
960:1388:S 18 Jun 2021 09:09:30.101 * Background saving started by pid 3814875
3814875:1388:C 18 Jun 2021 09:09:30.970 * DB saved on disk
3814875:1388:C 18 Jun 2021 09:09:30.975 * RDB: 5 MB of memory used by copy-on-write
960:1388:S 18 Jun 2021 09:09:31.022 * Background saving terminated with success
960:1388:S 18 Jun 2021 09:10:12.795 # Disconnecting timedout replica: 10.0.0.7:6379
960:1388:S 18 Jun 2021 09:10:12.795 # Connection with replica 10.0.0.7:6379 lost.
960:1389:S 18 Jun 2021 09:10:13.429 * Replica 10.0.0.7:6379 asks for synchronization
960:1389:S 18 Jun 2021 09:10:13.429 * Full resync requested by replica 10.0.0.7:6379
960:1389:S 18 Jun 2021 09:10:13.429 * Delay next BGSAVE for diskless SYNC
960:1388:S 18 Jun 2021 09:10:19.941 * Starting BGSAVE for SYNC with target: replicas sockets
960:1388:S 18 Jun 2021 09:10:19.948 * Background RDB transfer started by pid 3815442
3815442:1388:C 18 Jun 2021 09:10:20.860 * RDB: 5 MB of memory used by copy-on-write
960:1388:S 18 Jun 2021 09:10:20.860 # Diskless rdb transfer, done reading from pipe, 1 replicas still up.
960:1388:S 18 Jun 2021 09:10:20.958 * Background RDB transfer terminated with success
960:1388:S 18 Jun 2021 09:10:20.958 * Streamed RDB transfer with replica 10.0.0.7:6379 succeeded (socket). Waiting for REPLCONF ACK from slave to enable streaming
960:1389:S 18 Jun 2021 09:10:22.034 * Synchronization with replica 10.0.0.7:6379 succeeded
960:1388:S 18 Jun 2021 09:10:32.013 * 10000 changes in 60 seconds. Saving...
960:1388:S 18 Jun 2021 09:10:32.018 * Background saving started by pid 3816278
3816278:1388:C 18 Jun 2021 09:10:32.845 * DB saved on disk
3816278:1388:C 18 Jun 2021 09:10:32.850 * RDB: 4 MB of memory used by copy-on-write
960:1388:S 18 Jun 2021 09:10:32.921 * Background saving terminated with success
960:1388:S 18 Jun 2021 09:11:33.101 * 10000 changes in 60 seconds. Saving...