Issue RonDB 21.04

g.boccia · September 14, 2022, 3:04pm

Hi all,
We have a Hopsworks 2.3 installation with a RonDB 21.04 cluster that has this topology:

2 server nodes:
Mysql Mgm
Mysqld
2 server nodes:
Mysql NDB

After a few days of running, we encounter this error, which causes the NDB node to restart:
2022-09-14 11:41:36 [MgmtSrvr] INFO – Node 2 disconnected in recv with errnum: 104 in state: 0
2022-09-14 11:41:36 [MgmtSrvr] ALERT – Node 1: Node 2 Disconnected
2022-09-14 11:41:36 [MgmtSrvr] INFO – Node 1: Communication to Node 2 closed
2022-09-14 11:41:36 [MgmtSrvr] ALERT – Node 1: Network partitioning - arbitration required
2022-09-14 11:41:36 [MgmtSrvr] INFO – Node 1: President restarts arbitration thread [state=7]
2022-09-14 11:41:36 [MgmtSrvr] ALERT – Node 49: Node 2 Disconnected
2022-09-14 11:41:36 [MgmtSrvr] INFO – Node 2 clear m_started, disconnected
2022-09-14 11:41:36 [MgmtSrvr] ALERT – Node 1: Arbitration won - positive reply from node 49
2022-09-14 11:41:36 [MgmtSrvr] INFO – Node 1: NR Status: node=2,OLD=Restart completed,NEW=Node failed, fail handling ongoing
2022-09-14 11:41:36 [MgmtSrvr] ALERT – Node 2: Forced node shutdown completed. Initiated by signal 11. Caused by error 6000: ‘Error OS signal received(Internal error, programming error or missing error message, please report a bug). Temporary error, restart node’.

Is this issue a bug, or is there some configuration change that would solve the problem?

Thank you for your support.

mikaelronstrom · September 15, 2022, 10:04am

Signal 11 is a Segmentation fault, so definitely looks like a bug.
To analyse what happened I would need the data node log and
the trace files. Those should be found in
/srv/hops/mysql-cluster I think.

I think the segfault is the trigger for all the log messages you
displayed above. Also good to know which version you are
running to see if you hit a bug which is already fixed in a newer
21.04 version.
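
To pull the relevant event out of a long cluster log before attaching it to a report, something like the following could help. It is a minimal sketch that assumes the line format shown in the excerpt above; the names `ForcedShutdown` and `find_forced_shutdowns` are made up for illustration and are not part of RonDB.

```python
import re
from typing import Iterable, List, NamedTuple, Optional

# Matches management-server lines such as:
#   2022-09-14 11:41:36 [MgmtSrvr] ALERT – Node 2: Forced node shutdown completed.
#   Initiated by signal 11. Caused by error 6000: ...
# The separator after the level is matched loosely because it may appear
# as "--" or as a dash character, depending on how the log was copied.
SHUTDOWN_RE = re.compile(
    r"^(?P<timestamp>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) "
    r"\[MgmtSrvr\] (?P<level>\w+)\s+\S+\s+"
    r"Node (?P<node>\d+): Forced node shutdown completed\."
    r"(?: Initiated by signal (?P<signal>\d+)\.)?"
    r"(?: Caused by error (?P<error>\d+))?"
)


class ForcedShutdown(NamedTuple):
    timestamp: str
    node_id: int
    signal: Optional[int]  # OS signal number, if the line reports one
    error: Optional[int]   # error code, if the line reports one


def find_forced_shutdowns(lines: Iterable[str]) -> List[ForcedShutdown]:
    """Return one record per 'Forced node shutdown completed' line."""
    events = []
    for line in lines:
        match = SHUTDOWN_RE.match(line.strip())
        if match is None:
            continue
        sig = match.group("signal")
        err = match.group("error")
        events.append(
            ForcedShutdown(
                timestamp=match.group("timestamp"),
                node_id=int(match.group("node")),
                signal=int(sig) if sig is not None else None,
                error=int(err) if err is not None else None,
            )
        )
    return events
```

Feeding it the lines of the cluster log (for example `find_forced_shutdowns(open(path))`, with `path` pointing at wherever your log is stored) gives the timestamp, node id, signal and error code of each forced shutdown.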

mikaelronstrom · September 15, 2022, 12:08pm

An easy way to upload crash files would be to create an issue in the GitHub tree at
logicalclocks/rondb.

g.boccia · September 16, 2022, 1:44pm

As you suggested, I have opened a GitHub issue and uploaded the log files:

Thanks for your support.

Gianluca

mikaelronstrom · September 16, 2022, 3:15pm

A first look indicates that you hit one of the bugs HOPSWORKS-2652 or HOPSWORKS-2651. They were fixed in RonDB 21.04.1 and caused somewhat random failures in ordered index queries.
I would suggest updating to 21.04.8; this version has been very stable and no serious issues
have been found in it so far. I will look one more time before I update the issue in GitHub.
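
Whether a given installation already contains a fix comes down to comparing dotted version strings. A minimal sketch, assuming versions are written as plain dot-separated numbers such as 21.04.8 (the helpers `parse_version` and `includes_fix` are hypothetical names, not a RonDB tool):

```python
from typing import Tuple


def parse_version(text: str) -> Tuple[int, ...]:
    """Turn '21.04.8' into (21, 4, 8) so versions compare numerically."""
    return tuple(int(part) for part in text.strip().split("."))


def includes_fix(running: str, fixed_in: str) -> bool:
    """True if the running version is at or after the release with the fix."""
    return parse_version(running) >= parse_version(fixed_in)
```

For example, `includes_fix("21.04.8", "21.04.1")` is true, while `includes_fix("21.04.0", "21.04.1")` is false. Note that a bare "21.04" without a patch number compares as older than 21.04.1 in this sketch, so the full version string is needed for a meaningful answer.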

Here is a reference to the bugs in the release notes of 21.04.1:

g.boccia · September 19, 2022, 8:47am

Hi @mikaelronstrom, the 21.04.8 update is too new, and there isn’t any Hopsworks DAL that supports it in your repo:
https://repo.hops.works/master/

I read the documentation and would like to update to 21.04.6, but RonDB is used by Hopsworks 2.3. Is there any documentation that explains how this update can be made?

Thank you for support.

mikaelronstrom · September 19, 2022, 9:53pm

I’ll look into this tomorrow and point you to the docs about upgrades, and I will see if I can ensure
that the 21.04.8 DAL is uploaded. The 21.04.6 DAL can also be used against 21.04.8 in the data
nodes. There are no bug fixes in the DAL parts in 21.04.7 and 21.04.8.

mikaelronstrom · October 15, 2022, 12:33am

This took a bit more time than expected; I am working on a new section in the documentation
about upgrades and downgrades of RonDB. It will be released in conjunction with the new
RonDB releases here in October.

mikaelronstrom · November 1, 2022, 4:28pm

Added a new chapter now to the RonDB docs: