Got error 4009 'Cluster Failure' from NDBCLUSTER

Hi @mikaelronstrom,

We have successfully set up a RonDB cluster and are also able to write data to it.

Our current setup includes the following:

2x Data Nodes (r5.4xlarge)
1x Management Node (r5.xlarge)
2x MySQL Nodes (r5.4xlarge)

Whenever we connect to the MySQL nodes (via JDBC, com.mysql.cj.jdbc.Driver) to run SELECT queries, we encounter the following error:

Got error 4009 'Cluster Failure' from NDBCLUSTER

There is no related information in the cluster logs, management server logs, MySQL logs, or data node logs.

We are not sure why we are getting this error or how we can debug it.

The error message is also not very descriptive :thinking:

Normally 4009 means that the cluster has failed completely.
So I am a bit surprised that you see no logs.
Most likely your client cannot contact the cluster data nodes
due to some firewall issue.

I usually configure data nodes to use port 11860 as ServerPort.
The default setting is still to use a random ServerPort.
I gather that I should change the default setting to 11860.

A quick course on ports in RonDB:
The MySQL server port defaults to 3306.
API nodes and MySQL Servers connect to a dynamic port number on
each data node. This ensures that we will always find a free port, but
it is not very nice in the context of firewalls.
The management server is connected to through port 1186.

By setting ServerPort in config.ini on the data nodes, one ensures
that each data node only listens on this port. Thus, if this is set
to 11860, one needs to open ports 3306, 1186, and 11860 in
the firewall (also 33060 if using the X API in the MySQL Server).
In managed RonDB we always set ServerPort to 11860,
and in some cases 2202.
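One quick way to rule out a firewall or connectivity problem is a plain TCP connect test against each of these ports from the client machine. A minimal sketch (the hostnames you pass in are placeholders, not values from this thread):

```python
import socket

# Ports discussed above: MySQL (3306), MGM server (1186),
# data node ServerPort (11860); add 33060 if the X API is used.
RONDB_PORTS = {"mysql": 3306, "mgmt": 1186, "data_node": 11860}

def can_connect(host, port, timeout=3.0):
    """Return True if a plain TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

Running `can_connect` against each node's address for its expected port quickly shows whether the client can reach the data nodes at all, independently of MySQL or the NDB API.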

A bit more on 4009: it is an error code that comes entirely from
the MySQL server. It means the MySQL Server (or API node) isn't able
to connect to any data node. The assumption is that the cluster
is down, hence the error code “Cluster Failure”. But as mentioned,
it can also be a connectivity issue; there is no way to tell which
it is from the MySQL Server.

I use the same JDBC connection to write data, which works fine :thinking:

Also, there is no firewall present at the moment that could be blocking these ports (3306, 1186, 11860).

I am able to query the data randomly via the MySQL cli…

As there are no logs, it is difficult to understand what is going on.

We already have a setup that uses HikariCP to create connections to MySQL and run queries. We just reused the same setup and replaced the MySQL endpoint with the one set up for RonDB…
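For reference, in a HikariCP setup only the JDBC URL should need to change to point at the RonDB MySQL endpoint; a hypothetical properties fragment (host, schema, and credentials below are placeholders, not values from this thread):

```properties
# Placeholder values for illustration only
jdbcUrl=jdbc:mysql://rondb-mysql-host:3306/test
driverClassName=com.mysql.cj.jdbc.Driver
username=app_user
password=change_me
maximumPoolSize=10
```

Since the pool only speaks the MySQL wire protocol to the MySQL Server, nothing NDB-specific appears on the client side.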

Also, ServerPort is set to 11860:

[ndbd default]
NoOfReplicas=3
DataDir=/usr/local/ndb_data
ServerPort=11860
HeartbeatIntervalDbDb=1000
ODirect=1
IndexStatAutoCreate=1
IndexStatAutoUpdate=1
StringMemory=4

Sounds very weird; if you can write data, you should be able to read, as it
uses the same connection.
The cluster log should at a minimum contain entries for connections being set up
and torn down. This log resides on the node with the management server
and is usually called something like ndb_65_cluster.log.

Not familiar with HikariCP, but if it works with normal MySQL, it should also
work with RonDB.

2021-10-04 22:47:34 [MgmtSrvr] INFO     -- Node 1: Local checkpoint 16251 started. Keep GCI = 138327 oldest restorable GCI = 138328
2021-10-04 22:47:35 [MgmtSrvr] INFO     -- Node 1: LDM(3): Completed LCP, #frags = 19 #records = 15169, #bytes = 718960
2021-10-04 22:47:35 [MgmtSrvr] INFO     -- Node 1: LDM(4): Completed LCP, #frags = 19 #records = 14902, #bytes = 708896
2021-10-04 22:47:35 [MgmtSrvr] INFO     -- Node 1: LDM(1): Completed LCP, #frags = 20 #records = 30243, #bytes = 1429456
2021-10-04 22:47:35 [MgmtSrvr] INFO     -- Node 1: LDM(2): Completed LCP, #frags = 20 #records = 30418, #bytes = 1453952
2021-10-04 22:47:37 [MgmtSrvr] INFO     -- Node 2: LDM(3): Completed LCP, #frags = 19 #records = 15169, #bytes = 718960
2021-10-04 22:47:37 [MgmtSrvr] INFO     -- Node 2: LDM(4): Completed LCP, #frags = 19 #records = 15643, #bytes = 744464
2021-10-04 22:47:37 [MgmtSrvr] INFO     -- Node 2: LDM(2): Completed LCP, #frags = 20 #records = 30418, #bytes = 1453952
2021-10-04 22:47:37 [MgmtSrvr] INFO     -- Node 2: LDM(1): Completed LCP, #frags = 20 #records = 31008, #bytes = 1466176
2021-10-04 22:47:37 [MgmtSrvr] INFO     -- Node 1: Local checkpoint 16251 completed
2021-10-05 07:55:29 [MgmtSrvr] INFO     -- Node 1: Local checkpoint 16252 started. Keep GCI = 148639 oldest restorable GCI = 138337
2021-10-05 07:55:31 [MgmtSrvr] INFO     -- Node 1: LDM(1): Completed LCP, #frags = 20 #records = 0, #bytes = 0
2021-10-05 07:55:31 [MgmtSrvr] INFO     -- Node 1: LDM(4): Completed LCP, #frags = 19 #records = 0, #bytes = 0
2021-10-05 07:55:31 [MgmtSrvr] INFO     -- Node 1: LDM(2): Completed LCP, #frags = 20 #records = 0, #bytes = 0
2021-10-05 07:55:31 [MgmtSrvr] INFO     -- Node 1: LDM(3): Completed LCP, #frags = 19 #records = 0, #bytes = 0
2021-10-05 07:55:33 [MgmtSrvr] INFO     -- Node 2: LDM(3): Completed LCP, #frags = 19 #records = 0, #bytes = 0
2021-10-05 07:55:33 [MgmtSrvr] INFO     -- Node 2: LDM(2): Completed LCP, #frags = 20 #records = 0, #bytes = 0
2021-10-05 07:55:33 [MgmtSrvr] INFO     -- Node 2: LDM(4): Completed LCP, #frags = 19 #records = 0, #bytes = 0
2021-10-05 07:55:33 [MgmtSrvr] INFO     -- Node 2: LDM(1): Completed LCP, #frags = 20 #records = 0, #bytes = 0

The cluster logs just have the above checkpoint entries and nothing more :thinking:

Can you try some of the RonDB tools, e.g. ndb_waiter, just to see if you can get an API node to connect to the cluster? It seems like at least the data nodes and the MGM server are up and running as they should be.
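For example, something like this (the connect-string host is a placeholder); ndb_waiter connects as an API node and waits until the data nodes report the started state:

```shell
# mgm_host is a placeholder for the management server address
ndb_waiter --ndb-connectstring=mgm_host --timeout=60
```

If this connects and returns promptly, the NDB API path from that machine to the data nodes is working.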

2021-10-05 14:20:02 [MgmtSrvr] INFO     -- Sending ALLOC_NODE_ID_REQ to node 1
2021-10-05 14:20:02 [MgmtSrvr] INFO     -- Alloc node id 231 succeeded
2021-10-05 14:20:02 [MgmtSrvr] INFO     -- Nodeid 231 allocated for API at 127.0.0.1
2021-10-05 14:20:02 [MgmtSrvr] INFO     -- Node 231: ndb_desc
2021-10-05 14:20:02 [MgmtSrvr] INFO     -- Node 1: Node 231 Connected
2021-10-05 14:20:02 [MgmtSrvr] INFO     -- Node 2: Node 231 Connected
2021-10-05 14:20:02 [MgmtSrvr] INFO     -- Node 1: Node 231: API RonDB-21.04.1
2021-10-05 14:20:02 [MgmtSrvr] INFO     -- Node 2: Node 231: API RonDB-21.04.1
2021-10-05 14:20:05 [MgmtSrvr] ALERT    -- Node 1: Node 231 Disconnected
2021-10-05 14:20:05 [MgmtSrvr] INFO     -- Node 1: Communication to Node 231 closed
2021-10-05 14:20:05 [MgmtSrvr] INFO     -- Node 2: Communication to Node 231 closed
2021-10-05 14:20:05 [MgmtSrvr] ALERT    -- Node 2: Node 231 Disconnected

I was able to connect an API node to the cluster as well.

An easy way to set the connect string in this case is to set the environment variable NDB_CONNECTSTRING to point to the MGM server host, e.g.:
export NDB_CONNECTSTRING=mgm_host

OK, then the issue must be in how the MySQL Server is set up.

You need something like this in my.cnf:
[mysqld]
ndbcluster
ndb-cluster-connection-pool-nodeids=67
ndb-connectstring=192.168.1.100

I ran the following command to set up and run MySQL:

./mysqld --ndbcluster --ndb-connectstring=xxx.xxx.xxx.xxx --datadir=/var/lib/mysql/data --log-error=/home/centos/rondb-21.04.1-linux-glibc2.17-x86_64/log/mysql_err.out --ndb-cluster-connection-pool=2 --ndb-cluster-connection-pool-nodeids=67,68 --socket=/tmp/mysql.sock

Looks OK. What do you find in mysql_err.out?
It should have some info that indicates that it succeeded in connecting to RonDB.

2021-10-05T14:29:25.052872Z 0 [Warning] [MY-011070] [Server] 'Disabling symbolic links using --skip-symbolic-links (or equivalent) is the default. Consider not using this option as it' is deprecated and will be removed in a future release.
2021-10-05T14:29:25.052969Z 0 [System] [MY-010116] [Server] /home/centos/rondb-21.04.1-linux-glibc2.17-x86_64/bin/mysqld (mysqld 21.04.1-cluster) starting as process 23347
2021-10-05T14:29:25.061818Z 1 [System] [MY-013576] [InnoDB] InnoDB initialization has started.
2021-10-05T14:29:25.335077Z 1 [System] [MY-013577] [InnoDB] InnoDB initialization has ended.
2021-10-05T14:29:26.064959Z 0 [System] [MY-011323] [Server] X Plugin ready for connections. Bind-address: '::' port: 33060, socket: /tmp/mysqlx.sock
2021-10-05T14:29:26.081384Z 0 [System] [MY-010229] [Server] Starting XA crash recovery...
2021-10-05T14:29:26.087476Z 0 [System] [MY-010232] [Server] XA crash recovery finished.
2021-10-05T14:29:26.160899Z 0 [Warning] [MY-010068] [Server] CA certificate ca.pem is self signed.
2021-10-05T14:29:26.161146Z 0 [System] [MY-013602] [Server] Channel mysql_main configured to support TLS. Encrypted connections are now supported for this channel.
2021-10-05T14:29:26.195237Z 0 [System] [MY-010931] [Server] /home/centos/rondb-21.04.1-linux-glibc2.17-x86_64/bin/mysqld: ready for connections. Version: '21.04.1-cluster'  socket: '/tmp/mysql.sock'  port: 3306  Source distribution.
2021-10-05 14:29:26 [NdbApi] INFO     -- Flushing incomplete GCI:s < 176424/8
2021-10-05 14:29:26 [NdbApi] INFO     -- Flushing incomplete GCI:s < 176424/8

This is all of the log in the mysql_err.out file… :thinking:

Did you perform the initialisation step for mysqld before starting it as above?

Yes, I did run it with the --initialize parameter as well.

Hmm, very weird; not much in the logs, as you mentioned.
Can you use the mysql client and see if you can perform the following
simple commands:
CREATE DATABASE test;
USE test;
CREATE TABLE t1 (a INT PRIMARY KEY) ENGINE=NDB;
INSERT INTO t1 VALUES (1);
SELECT * FROM t1;
DROP TABLE t1;

and see what it says