We have a Hopsworks 2.3 installation with a RonDB 21.04 cluster that has this topology:
- 2 server nodes:
- 2 server nodes
After a few days of running, we encounter this error, which causes the NDB node to restart:
2022-09-14 11:41:36 [MgmtSrvr] INFO – Node 2 disconnected in recv with errnum: 104 in state: 0
2022-09-14 11:41:36 [MgmtSrvr] ALERT – Node 1: Node 2 Disconnected
2022-09-14 11:41:36 [MgmtSrvr] INFO – Node 1: Communication to Node 2 closed
2022-09-14 11:41:36 [MgmtSrvr] ALERT – Node 1: Network partitioning - arbitration required
2022-09-14 11:41:36 [MgmtSrvr] INFO – Node 1: President restarts arbitration thread [state=7]
2022-09-14 11:41:36 [MgmtSrvr] ALERT – Node 49: Node 2 Disconnected
2022-09-14 11:41:36 [MgmtSrvr] INFO – Node 2 clear m_started, disconnected
2022-09-14 11:41:36 [MgmtSrvr] ALERT – Node 1: Arbitration won - positive reply from node 49
2022-09-14 11:41:36 [MgmtSrvr] INFO – Node 1: NR Status: node=2,OLD=Restart completed,NEW=Node failed, fail handling ongoing
2022-09-14 11:41:36 [MgmtSrvr] ALERT – Node 2: Forced node shutdown completed. Initiated by signal 11. Caused by error 6000: ‘Error OS signal received(Internal error, programming error or missing error message, please report a bug). Temporary error, restart node’.
Is this issue a bug, or is there some configuration change that would solve the problem?
Thank you for your support.
Signal 11 is a Segmentation fault, so definitely looks like a bug.
To analyse what happened I would need the data node log and
the trace files. Those should be found in
/srv/hops/mysql-cluster I think.
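If the cluster log is long, a small script can help locate the forced shutdowns before digging into the trace files. The following is only a minimal sketch: the regular expression is derived from the single ALERT line quoted in this thread, and the name find_forced_shutdowns is made up for illustration, so other RonDB versions may need a different pattern.

```python
import re

# Pattern based only on the ALERT line quoted in this thread; the exact
# message format in other RonDB versions is an assumption.
SHUTDOWN_RE = re.compile(
    r"Node (?P<node>\d+): Forced node shutdown completed\."
    r"(?: Initiated by signal (?P<signal>\d+)\.)?"
    r"(?: Caused by error (?P<error>\d+))?"
)


def find_forced_shutdowns(lines):
    """Return one dict per log line that reports a forced node shutdown."""
    events = []
    for line in lines:
        match = SHUTDOWN_RE.search(line)
        if match is None:
            continue
        signal = match.group("signal")
        error = match.group("error")
        events.append({
            # The first 19 characters hold "YYYY-MM-DD HH:MM:SS".
            "timestamp": line[:19],
            "node": int(match.group("node")),
            "signal": int(signal) if signal else None,
            "error": int(error) if error else None,
        })
    return events
```

Feeding it the lines of the cluster log (for example via open(path).readlines(), with a path of your choosing) gives a list with the time, node id, signal and error code of each forced shutdown, which shows whether the failure always hits the same node.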
I think the segfault is the trigger for all the log messages you
displayed above. Also good to know which version you are
running to see if you hit a bug which is already fixed in a newer
As you suggested, I have opened a GitHub issue and uploaded the log files:
Thanks for your support.
A first look indicates that you hit bug HOPSWORKS-2652 or HOPSWORKS-2651. They were fixed in RonDB 21.04.1 and caused somewhat random failures in ordered index queries.
An update to 21.04.8 is what I would suggest; this version has been very stable and no serious issues
have been found in it so far. I will look one more time before I update the issue on GitHub.
Here is a reference to the bugs in the release notes of 21.04.1:
Hi @mikaelronstrom, the 220.127.116.11 update is too new, and there isn't any Hopsworks DAL that supports it in your repo:
I read the documentation and would update to 18.104.22.168, but RonDB is used by Hopsworks 2.3; is there any documentation that explains how this update can be made?
Thank you for your support.
I’ll look into this tomorrow and point you to the docs about upgrades and will see if I can ensure
that 21.04.8 DAL is uploaded. The 21.04.6 DAL can also be used against 21.04.8 in the data
nodes. There are no bug fixes in the DAL parts in 21.04.7 and 21.04.8.