ERROR 1777 (HY000): Partition memsqldb:0 has no master instance - singlestore

I am using community edition of memsql. I got this error while i was running a query today. So i just restarted my cluster and got this error solved.
memsql-ops cluster-restart
But what happened and what should i do in future to avoid this error ?
NOTE
I donot want to buy the Enterprise edition.
Question
Is this problem of Availability ?

I got this error when experimenting with performance.
VM had 24 CPUs and 25 nodes: 1 Master Agg, 24 Leaf nodes
Reduced VM to 4 CPUs and restarted cluster.
All the leaves did not recover.
All except 4 recovered in < 5 minutes.
20 minutes later, 4 leaf nodes still were not connected.
From MySQL/MemSQL prompt:
use db;
show partitions;
I notice some partitions with ordinal from 0-71 for me have null instead Host, Port, Role defined for a given partition.
In memsql ops UI http://server:9000 > Settings > Config > Manual Cluster Control I checked "ENABLE MANUAL CONTROL" while I tried to run various commands with no real benefit.
Then 15 minutes later, I unchecked the box, Memsql-ops tried attaching all the leaf nodes again and was finally successful.
Perhaps a cluster restart would have done the same thing.

This happened because a leaf in your cluster has failed a health check heartbeat for some reason (loss of network connectivity, hardware failure, OS issue, machine overloaded, out of memory, etc.) and its partitions are no longer accessible to query. MemSQL Community Edition only supports redundancy 1 so there are no other copies of the data on the failed leaf node in your cluster (thus the error about missing a partition of data - MemSQL can't complete a query that needs to read data on any partitions on the problem leaf).
Given that a restart repaired things, the most likely answer is that linux "out of memory" killed you: MemSQL Linux OOM killer docs
You can also check the tracelog on the leaf that ran into issues to see if there is any clue there about what happened (It's usually at /var/lib/memsql/leaf_3306/tracelogs/memsql.log)
-Adam

I too have faced this error, that was because some of the slave ordinals had no corresponding masters. My error message looked like:
ERROR 1772 (HY000) at line 1: Leaf Error (10.0.0.112:3306): Partition database `<db_name>_0` can't be promoted to master because it is provisioning replication
My memsql> SHOW PARTITIONS; command returned the following.
So what approach I followed was to remove each of such cases (where the role was either Slave or NULL).
DROP PARTITION <db_name>:4 ON "10.0.0.193":3306;
..
DROP PARTITION <db_name>:46 ON "10.0.0.193":3306;
And then created a new partition with each of the dropped partition.
CREATE PARTITION <db_name>:4 ON "10.0.0.193":3306;
..
CREATE PARTITION <db_name>:46 ON "10.0.0.193":3306;
And this was the result of memsql> SHOW PARTITIONS; after that.
You can refer to the MemSQL Documentation regarding partitions, here if the above steps doesn't seem to solve your problem.

I was hitting the same problem. Using the following command in the master node, solved the problem:
REBALANCE PARTITIONS ON db_name
Optionally you can force it using FORCE:
REBALANCE PARTITIONS ON db_name FORCE
And to see the list of operations when rebalancing is going to execute, use above command with EXPLAIN:
EXPLAIN REBALANCE PARTITIONS ON db_name [FORCE]

Related

Reaper failed to run repair on Cassandra nodes

After Reaper failed to run repair on 18 nodes of Cassandra cluster, I ran a full repair of each node to fix the failed repair issue, after the full repair, Reaper executed successfully, but after a few days again the Reaper failed to run, I can see the following error in system.log
ERROR [RMI TCP Connection(33673)-10.196.83.241] 2021-09-01 09:01:18,005 RepairRunnable.java:276 - Repair session 81540931-0b20-11ec-a7fa-8d6977dd3c87 for range [(-606604147644314041,-98440495518284645], (-3131564913406859309,-3010160047914391044]] failed with error Terminate session is called
java.io.IOException: Terminate session is called
at org.apache.cassandra.service.ActiveRepairService.terminateSessions(ActiveRepairService.java:191) ~[apache-cassandra-3.11.0.jar:3.11.0]
INFO [Native-Transport-Requests-2] 2021-09-01 09:02:52,020 Message.java:619 - Unexpected exception during request; channel = [id: 0x1e99a957, L:/10.196.18.230:9042 ! R:/10.254.252.33:62100]
io.netty.channel.unix.Errors$NativeIoException: readAddress() failed: Connection timed out
in nodetool tpstats I can see some pending tasks
Pool Name Active Pending
ReadStage 0 0
Repair#18 3 90
ValidationExecutor 3 3
Also in nodetool compactionstats there are 4 pending tasks:
-bash-4.2$ nodetool compactionstats
pending tasks: 4
- Main.visit: 1
- Main.post: 1
- Main.stream: 2
My question is why even after a full repair, reaper is still failing? and what is the root cause of pending repair?
PS: version of Reaper is 2.2.3, not sure if it is a bug in Reaper!
You most likely don't have enough segments in your Reaper repair definition, or the default timeout (30 mins) is too low for your repair.
Segments (and the associated repair session) get terminated when they reach the timeout, in order to avoid stuck repairs. When tuned inappropriately, this can give the behavior you're observing.
Nodetool doesn't set a timeout on repairs, which explains why it passes there. The good news is that nothing will prevent repair from passing with Reaper once tuned correctly.
We're currently working on adaptive repairs to have Reaper deal with this situation automatically, but in the meantime you'll need to deal with this manually.
Check the list of segments in the UI and apply the following rule:
If you have less than 20% of segments failing, double the timeout by adjusting the hangingRepairTimeoutMins value in the config yaml.
If you have more than 20% of segments failing, double the number of segments.
Once repair passes at least twice, check the maximum duration of segments and further tune the number of segments to have them last at most 15 mins.
Assuming you're not running Cassandra 4.0 yet, now that you ran repair through nodetool, you have sstables which are marked as repaired like incremental repair would. This will create a problem as Reaper's repairs don't mark sstables as repaired and you now have two different sstables pools (repaired and unrepaired), which cannot be compacted together.
You'll need to use the sstablerepairedset tool to mark all sstables as unrepaired to put all sstables back in the same pool. Please read the documentation to learn how to achieve this.
There could be a number of things taking place such as Reaper can't connect to the nodes via JMX (for whatever reason). It isn't possible to diagnose the problem with the limited information you've provided.
You'll need to check the Reaper logs for clues on the root cause.
As a side note, this isn't related to repairs and is a client/driver/app connecting to the node on the CQL port:
INFO [Native-Transport-Requests-2] 2021-09-01 09:02:52,020 Message.java:619 - Unexpected exception during request; channel = [id: 0x1e99a957, L:/10.196.18.230:9042 ! R:/10.254.252.33:62100]
io.netty.channel.unix.Errors$NativeIoException: readAddress() failed: Connection timed out
Cheers!

Updating a galera-cluster to 10.3.15 with MDEV-17458

I just updated a mariadb/galera-cluster to db version 10.3.15. It won't work correctly without at least 2 nodes up, but trying to start any node past the 1st runs into strange error messages, like: .
0 [Warning] WSREP: SST position can't be set in past. Requested: 0, Current: 14422308.
0 [Warning] WSREP: Can't continue.
This bug may be related:
https://jira.mariadb.org/browse/MDEV-17458?attachmentViewMode=list
However, I notice one peculiarity: the requested state is 0, quite possibly because it's lost somewhere along the way, or because I'm experiencing an entirely different issue.
I also know what it should be: the value that it thinks is 'current'.
In other words, reality is the exact opposite of what this node thinks is true: the 'current' should be 0, the 'requested' should be 14422308.
In a related issue:
https://jira.mariadb.org/browse/MDEV-19193
someone off-hand comments about deleting some files in order to start from a pristine case, but isn't exactly clear what exactly to do where.
I do not mind starting from the data on one node, ignoring everything on the other nodes and copying everything over.
I tried deleting the following file(s) from the offending nodes. (I believe the data directory they're mentioning is /var/lib/mysql/ on most linux systems):
galera.cache
ib_logfile0
ib_logfile1
This has no effect.
Someone over at this question: Unable to complete SST transfer due to "WSREP: SST position can't be set in past." error suggests changing the SST number on the node that's still OK. But that won't work: I can only start that node if I use the 'galera_new_cluster' script, which resets its SST number to '-1', no matter what it was. If I start it normally, I get an error like this:
[ERROR] WSREP: wsrep::connect(gcomm://<IP1>,<IP2>,<IP3>,...) failed: 7
In other words, there's not enough other nodes online to join the cluster. So in order to change the SST on the primary node, another node needs to be online, but in order to start up the other node, I need to change the SST on the primary? Catch-22, won't work.
It's nice that they fixed the bug, but how do I fix my now broken cluster?
One more question I've asked myself is this: Does this 'SST number' of 14422308 originate from the node that's trying to re-join the cluster, or is it retrieved from the cluster? Apparently, the second thing is true, for even completely reinstalling the secondary node from scratch and trying to re-join the cluster with it will not solve the problem. The exact same error message stays.
Somehow, the cluster appears to have gotten confused as to its own state. The JOINER nodes in each synchronization step think they have a more advanced state than the DONOR nodes.
The solution to this problem is to trick the cluster; to force it to recognize some node as 'more advanced'.
Suppose we can identify one node that has complete cluster data. Denote this to be the '1st node'. Pick one node to be the 2nd, one to be the 3rd, etc. (These choices can be at-random).
Then, stop mysql on all nodes. Edit the configuration file for the cluster and change the value for 'wsrep_cluster_address' on each node. It should be the following:
+------+---------------------------+
| Node | wsrep_cluster_address |
+------+---------------------------+
| 1 | gcomm:// |
| 2 | gcomm://<IP1>,<IP2> |
| 3 | gcomm://<IP1>,<IP2>,<IP3> |
+------+---------------------------+
(The pattern continues like this for the fourth and any further nodes in the cluster).
Now remove all cached data from the nodes other than the first. These are the files:
ib_logfile*
grastate.dat
gvwstate.dat
galera.cache
situated in the data dir of the mysql installation. (Example; /var/lib/mysql/ on debian systems).
Then edit the "grastate.dat" file on node #1. In our example, the most advanced state the cluster has yet seen is 14422308. Thus set it to 14422309 (or: old state + 1). Also set safe_to_bootstrap to 0 on all nodes (so we don't accidentally try to bootstrap and lose our seqno, running into the same bug again).
Now start mysql on node #1 (example, via systemd: systemctl start mysql).
Once it's running, do the same on node #2. Wait for all the data to transfer (this may take a while depending on the inter-node connection speed and the size of the database in question), then repeat for node 3, and any further nodes.
Afterwards, restore the value for wsrep_cluster_address in every configuration to what it should be (which is equal to the value for the last node).

Elasticsearch cluster size/architecture

I've been trying to setup an elasticsearch cluster for processing some log data from some 3D printers .
we are having more than 850K documents generated each day for 20 machines . each of them has it own index .
Right now we have the data of 16 months with make it about 410M records to index in each of the elasticsearch index .
we are processing the data from CSV files with spark and writing to an elasticsearch cluster with 3 machines each one of them has 16GB of RAM and 16 CPU cores .
but each time we reach about 10-14M doc/index we are getting a network error .
Job aborted due to stage failure: Task 173 in stage 9.0 failed 4 times, most recent failure: Lost task 173.3 in stage 9.0 (TID 17160, wn21-xxxxxxx.ax.internal.cloudapp.net, executor 3): org.elasticsearch.hadoop.rest.EsHadoopNoNodesLeftException: Connection error (check network and/or proxy settings)- all nodes failed; tried [[X.X.X.X:9200]]
I'm sure this is not a network error it's just elasticsearch cannot handle more indexing requests .
To solve this , I've tried to tweak many elasticsearch parameters such as : refresh_interval to speed up the indexation and get rid of the error but nothing worked . after monitoring the cluster we think that we should scale it up.
we also tried to tune the elasticsearch spark connector but with no result .
So I'm looking for a right way to choose the cluster size ? is there any guidelines on how to choose your cluster size ? any highlights will be helpful .
NB : we are interested mainly in indexing data since we have only one query or two to run on data to get some metrics .
I would start by trying to split up the indices by month (or even day) and then search across an index pattern. Example: sensor_data_2018.01, sensor_data_2018.02, sensor_data_2018.03 etc. And search with an index pattern of sensor_data_*
Some things which will impact what cluster size you need will be:
How many documents
Average size of each document
How many messages/second are being indexed
Disk IO speed
I think your cluster should be good enough to handle that amount of data. We have a cluster with 3 nodes (8CPU / 61GB RAM each), ~670 indices, ~3 billion documents, ~3TB data and have only had indexing problems when the indexing rate exceeds 30,000 documents/second. Even then only the indexing of a few documents will fail and can be successfully retried after a short delay. Our implementation is also very indexing heavy with minimal actual searching.
I would check the elastic search server logs and see if you can find a more detailed error message. Possible look for RejectedExecutionException's. Also check the cluster health and node stats when you start to receive the failures which might shed some more light on whats occurring. If possible implement a re-try and backoff when failures start to occur to give ES time to catch up to the load.
Hope that helps a bit!
This is a network error, saying the data node is ... lost. Maybe a crash, you can check the elasticsearch logs to see whats going on.
The most important thing to understand with elasticsearch4Hadoop is how work is parallelized:
1 Spark partition by 1 elasticsearch shard
The important thing is sharding, this is how you load-balance the work with elasticsearch. Also, refresh_interval must be > 30 secondes, and, you should disable replication when indexing, this is very basic configuration tuning, I am sure you can find many advises about that on documentation.
With Spark, you can check on web UI (port 4040) how the work is split into tasks and partitions, this help a lot. Also, you can monitor the network bandwidth between Spark and ES, and es node stats.

Cassandra throwing NoHostAvailableException after 5 minutes of high IOPS run

I'm using datastax cassandra 2.1 driver and performing read/write operations at the rate of ~8000 IOPS. I've used pooling options to configure my session and am using separate session for read and write each of which connect to a different node in the cluster as contact point.
This works fine for say 5 mins but after that I get a lot of exceptions like :
Failed with: com.datastax.driver.core.exceptions.NoHostAvailableException: All host(s) tried for query failed (tried: /10.0.1.123:9042 (com.datastax.driver.core.TransportException: [/10.0.1.123:9042] Connection has been closed), /10.0.1.56:9042 (com.datastax.driver.core.exceptions.DriverException: Timeout while trying to acquire available connection (you may want to increase the driver number of per-host connections)))
Can anyone help me out here on what could be the problem?
The exception asks me to increase number of connections per host but how high a value can I set for this parameter ?
Also I'm not able to set CoreConnectionsPerHost beyond 2 as it throws me exception saying 2 is the max.
This is how I'm creating each read / write session.
PoolingOptions poolingOpts = new PoolingOptions();
poolingOpts.setCoreConnectionsPerHost(HostDistance.REMOTE, 2);
poolingOpts.setMaxConnectionsPerHost(HostDistance.REMOTE, 200);
poolingOpts.setMaxSimultaneousRequestsPerConnectionThreshold(HostDistance.REMOTE, 128);
poolingOpts.setMinSimultaneousRequestsPerConnectionThreshold(HostDistance.REMOTE, 2);
cluster = Cluster
.builder()
.withPoolingOptions( poolingOpts )
.addContactPoint(ip)
.withRetryPolicy( DowngradingConsistencyRetryPolicy.INSTANCE )
.withReconnectionPolicy( new ConstantReconnectionPolicy( 100L ) ).build();
Session s = cluster.connect(keySpace);
Your problem might not actually be in your code or the way you are connecting. If you say the problem is happening after a few minutes then it could simply be that your cluster is becoming overloaded trying to process the ingestion of data and cannot keep up. The typical sign of this is when you start seeing JVM garbage collection "GC" messages in the cassandra system.log file, too many small ones batched together of large ones on their own can mean that incoming clients are not responded to causing this kind of scenario. Verify that you do not have too many of these event showing up in your logs first before you start to look at your code. Here's a good example of a large GC event:
INFO [ScheduledTasks:1] 2014-05-15 23:19:49,678 GCInspector.java (line 116) GC for ConcurrentMarkSweep: 2896 ms for 2 collections, 310563800 used; max is 8375238656
When connecting to a cluster there are some recommendations, one of which is only have one Cluster object per real cluster. As per the article I've linked below (apologies if you already studied this):
Use one cluster instance per (physical) cluster (per application lifetime)
Use at most one session instance per keyspace, or use a single Session and explicitly specify the keyspace in your queries
If you execute a statement more than once, consider using a prepared statement
You can reduce the number of network roundtrips and also have atomic operations by using batches
http://www.datastax.com/documentation/developer/java-driver/2.1/java-driver/fourSimpleRules.html
As you are doing a high number of reads I'd most definitely recommend using setFetchSize also if its applicable to your code
http://www.datastax.com/documentation/developer/java-driver/2.1/common/drivers/reference/cqlStatements.html
http://www.datastax.com/documentation/developer/java-driver/2.1/java-driver/reference/queryBuilderOverview.html
For reference heres the connection options in case you find it useful
http://www.datastax.com/documentation/developer/java-driver/2.1/common/drivers/reference/connectionsOptions_c.html
Hope this helps.

In Cassandra 1.2 - CQL 3 is it possible to abort a secondary index build?

Been using a 6GB dataset with each source record being ~1KB in length when I accidentally added an index on a column that I am pretty sure has a 100% cardinality.
Tried dropping the index from cqlsh but by that point the two node cluster had gone into a run away death spiral with loadavg surpassing 20 on each node and cqlsh hung on the drop command for 30 minutes. Since this was just a test setup, I shut-down and destroyed the cluster and restarted.
This is a fairly disconcerting problem as it makes me fear a scenario where a junior developer is on a production cluster and they set an index on a similar high cardinality column. I scanned through the documentation and looked at the options in nodetool but there didn't seem to be anything along the lines of "abort job or abort building index".
Test environment:
2x m1.xlarge EC2 instances with 2 Raid 0 ephemeral disks
Dataset was 6GB, 1KB per record.
My question in summary: Is it possible to abort the process of building a secondary index AND or possible to stop/postpone running builds (indexing, compaction) for a later date.
nodetool -h node_address stop index_build
See: http://www.datastax.com/docs/1.2/references/nodetool#nodetool-stop

Resources