Error while saving the Spark RDD to Cassandra - apache-spark

We are trying to save an RDD with close to 4 billion rows to Cassandra. Some of the data gets persisted, but for some partitions we see these errors in the Spark logs.
We have already set these two properties for the Cassandra connector. Is there some other optimization we need to do? Also, what are the recommended settings for the reader? We have left them at their defaults.
spark.cassandra.output.batch.size.rows=1
spark.cassandra.output.concurrent.writes=1
We are running spark-1.1.0 and spark-cassandra-connector-java_2.10 v 2.1.0
15/01/08 05:32:44 ERROR QueryExecutor: Failed to execute: com.datastax.driver.core.BoundStatement#3f480b4e
com.datastax.driver.core.exceptions.NoHostAvailableException: All host(s) tried for query failed (tried: /10.87.33.133:9042 (com.datastax.driver.core.exceptions.DriverException: Timed out waiting for server response))
at com.datastax.driver.core.RequestHandler.sendRequest(RequestHandler.java:108)
at com.datastax.driver.core.RequestHandler$1.run(RequestHandler.java:179)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
Thanks
Ankur

I've seen something similar in my four-node cluster. It seemed that if I specified EVERY Cassandra node name in the Spark settings, it worked; if I only specified the seeds (two of the four nodes were seeds), I got exactly the same issue. I haven't followed up on it, since specifying all four gets the job done (but I intend to at some point). I'm using hostnames, not IPs, for the seed values, and hostnames in the Spark Cassandra settings as well. I did hear it could be due to some Akka DNS issues. Maybe try using IP addresses throughout, or specifying all hosts; the latter has been working flawlessly for me.
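For what it's worth, here is a minimal sketch of what that looks like on the Spark side; the hostnames are placeholders for your own nodes, and the assumption is that spark.cassandra.connection.host accepts a comma-separated list of contact points:

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;

public class AllHostsExample {
    public static void main(String[] args) {
        // List every Cassandra node, not only the seeds; the hostnames below are placeholders.
        SparkConf conf = new SparkConf()
                .setAppName("all-hosts-example")
                .set("spark.cassandra.connection.host",
                     "cass-node1,cass-node2,cass-node3,cass-node4");
        JavaSparkContext sc = new JavaSparkContext(conf);
        // ... build and save RDDs as usual ...
        sc.stop();
    }
}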

I realized I was actually running the application with spark.cassandra.output.concurrent.writes=2. I changed it to 1 and there were no more exceptions. The exceptions occurred because Spark was producing data at a much higher rate than our Cassandra cluster could write, so changing the setting to 1 worked for us.
Thanks!!
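For reference, a minimal sketch of how these writer settings fit into a connector job using the Java API; the keyspace, table and bean class are hypothetical placeholders, and the exact API may differ slightly between connector versions:

import static com.datastax.spark.connector.japi.CassandraJavaUtil.javaFunctions;
import static com.datastax.spark.connector.japi.CassandraJavaUtil.mapToRow;

import java.io.Serializable;
import java.util.Arrays;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class CassandraWriteExample {
    // Hypothetical bean matching a table such as: CREATE TABLE my_ks.my_table (id bigint PRIMARY KEY, value text)
    public static class MyRow implements Serializable {
        private Long id;
        private String value;
        public MyRow() {}
        public MyRow(Long id, String value) { this.id = id; this.value = value; }
        public Long getId() { return id; }
        public void setId(Long id) { this.id = id; }
        public String getValue() { return value; }
        public void setValue(String value) { this.value = value; }
    }

    public static void main(String[] args) {
        SparkConf conf = new SparkConf()
                .setAppName("cassandra-bulk-write")
                .set("spark.cassandra.connection.host", "10.87.33.133")  // contact point(s)
                .set("spark.cassandra.output.batch.size.rows", "1")      // one row per batch
                .set("spark.cassandra.output.concurrent.writes", "1");   // throttle concurrent writes per task

        JavaSparkContext sc = new JavaSparkContext(conf);

        // Tiny stand-in for the real multi-billion-row RDD
        JavaRDD<MyRow> rows = sc.parallelize(Arrays.asList(
                new MyRow(1L, "a"), new MyRow(2L, "b")));

        javaFunctions(rows)
                .writerBuilder("my_ks", "my_table", mapToRow(MyRow.class))
                .saveToCassandra();

        sc.stop();
    }
}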

Related

Apache Ignite 2.9.0 Recover from lost partitions

We have set up an Apache Ignite 2.9.0 cluster with native persistence, running on Kubernetes in Azure with 4 nodes. To update some cache configuration, we restarted all the Ignite nodes. After the restart, running any SQL query on one particular table results in a restart of 2 Ignite nodes, after which we see a lost-partitions exception.
If we restart all nodes to recover from the lost partitions, everything is fine until we run any SQL query on that table, after which 2 nodes restart and we get the lost-partitions exception again.
Is there any way we can recover from the lost partitions and overcome this problem? We would also like to understand why it is occurring; we could not find any logs related to it.
When all owners of a partition have left the grid, the partition is considered lost; you can think of this as a special internal marker. Depending on the PartitionLossPolicy, Ignite may ignore this fact and allow cache operations, or disallow them to protect data consistency.
If you use native persistence, then most likely there was no physical data loss, and all you need to do is tell Ignite that you are aware of the situation, that all data is in place, and that it is safe to remove the "lost" mark from the partitions.
I think the simplest way to handle this is to use the control script from within a pod:
control.sh --cache reset_lost_partitions cacheName1,cacheName2,...
More details:
https://ignite.apache.org/docs/latest/configuring-caches/partition-loss-policy#handling-partition-loss
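If running the control script inside a pod is awkward, the same reset can also be done programmatically from a client node; a minimal sketch (the cache names and configuration path are placeholders):

import java.util.Arrays;
import org.apache.ignite.Ignite;
import org.apache.ignite.Ignition;

public class ResetLostPartitions {
    public static void main(String[] args) {
        // Join the cluster as a client node using your existing Spring XML configuration (path is a placeholder).
        Ignition.setClientMode(true);
        try (Ignite ignite = Ignition.start("config/ignite-client.xml")) {
            // Tell Ignite the data is intact and clear the "lost" marker on these caches.
            ignite.resetLostPartitions(Arrays.asList("cacheName1", "cacheName2"));
        }
    }
}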

Cassandra NoHostAvailableException : All host(s) tried for query failed

We are running a Java microservice that uses a multi-node Cassandra cluster. While writing data, we see the following error randomly from different nodes:
com.datastax.driver.core.exceptions.NoHostAvailableException: All host(s) tried for query failed
We have already verified that all nodes are up and running and are reachable from each other within the cluster.
Any pointers are highly appreciated.
Thanks.
The above error means the driver was unable to connect to any of the hosts for a query. The following could be the reasons (the sketch after this list shows how to pull the per-host error details out of the exception):
Cassandra node down - which you have verified is not the case for you.
Congestion due to high traffic, which causes nodes to be marked down.
Intermittent network connectivity issues between the client and the nodes, which force the driver to mark hosts down.
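To narrow down which of these applies, it helps to log the per-host errors carried by the exception. A small sketch against the driver shown in the stack trace (the contact point and query are placeholders):

import java.net.InetSocketAddress;
import java.util.Map;
import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.Session;
import com.datastax.driver.core.exceptions.NoHostAvailableException;

public class DiagnoseNoHostAvailable {
    public static void main(String[] args) {
        // Contact point is a placeholder for one of your cluster nodes.
        try (Cluster cluster = Cluster.builder().addContactPoint("10.0.0.1").build();
             Session session = cluster.connect()) {
            session.execute("SELECT release_version FROM system.local");
        } catch (NoHostAvailableException e) {
            // The exception carries the last error seen for each host the driver tried,
            // which usually tells you whether it was a timeout, a refused connection, etc.
            for (Map.Entry<InetSocketAddress, Throwable> entry : e.getErrors().entrySet()) {
                System.err.println(entry.getKey() + " -> " + entry.getValue());
            }
        }
    }
}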

DataStax DevCenter cannot create new connection in Windows

When I try to test a new connection, it returns an error:
The specified host(s) could not be reached.
All host(s) tried for query failed (tried: /host_ip:9042 (com.datastax.driver.core.TransportException: [/host_ip:9042] Cannot connect))
[/host_ip:9042] Cannot connect
In my Windows firewall I have already created a rule for DevCenter that allows it to communicate with the remote Cassandra server. I have no access to the Cassandra server, but it is configured correctly, which means the problem is somewhere on my local computer.
This type of thing typically happens when the host crashes unexpectedly, resulting in corruption of the sstables or commitlog files.
This is why it is really important to use replication: when you get into this situation, you can run nodetool repair to rebuild the corrupted tables from the data on the other nodes.
If you are not fortunate enough to have replication configured, then you are in for some data loss. Clear the suspect file from \data\commitlogs, cry a little and restart the node.
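Independently of the server-side state, since the question is about whether the problem is on the local machine, a quick way to confirm that the node is reachable on port 9042 at all is a minimal standalone check with the DataStax Java driver; host_ip stays a placeholder, as in the error above:

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.Session;

public class ReachabilityCheck {
    public static void main(String[] args) {
        // Replace "host_ip" with the actual node address from the DevCenter error.
        try (Cluster cluster = Cluster.builder()
                .addContactPoint("host_ip")
                .withPort(9042)
                .build();
             Session session = cluster.connect()) {
            System.out.println("Connected to cluster: " + cluster.getMetadata().getClusterName());
        } catch (Exception e) {
            // A TransportException here means the native protocol port is not reachable
            // from this machine (firewall, routing, or rpc_address binding on the server).
            e.printStackTrace();
        }
    }
}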

Migrate Datastax Enterprise Cassandra to Apache Cassandra or Datastax Community?

I have a large but simple Cassandra database on a DataStax Enterprise 4.6 cluster. The license renewal is prohibitive for this very simple use case, and I am trying to migrate to either straight Apache Cassandra or the DataStax Community version. First, is it possible to do an inline update?
I have altered all the keyspaces to remove the "EverywhereStrategy" replication strategy, but I still get an error that the DSC version of Cassandra I'm trying to get to join the cluster doesn't support it. I'm using matching Cassandra versions (2.0.16), and most other things seem to be close.
java.lang.RuntimeException: org.apache.cassandra.exceptions.ConfigurationException: Unable to find replication strategy class 'org.apache.cassandra.locator.EverywhereStrategy'
If it's not possible to do an inline upgrade, what would be the best strategy for migrating a decent-sized (30-node, 150 TB) cluster?
To make this work you have to strip out any DSE-specific features that you may have on any of your tables.
This meant I had to change the replication strategy on the dse_system keyspace from EverywhereStrategy to SimpleStrategy with RF=3 (or almost anything; after the conversion you can drop this keyspace). The error message was:
java.lang.RuntimeException: org.apache.cassandra.exceptions.ConfigurationException: Unable to find replication strategy class 'org.apache.cassandra.locator.EverywhereStrategy'
I also had to drop the unused CFS keyspaces. We never used the Hadoop/CFS integration, so we had nothing in these keyspaces anyway. I didn't capture the error for this one.
We did have a Solr index on a table we were testing on this cluster about a year ago, so I had to drop that column family. The error message was:
java.lang.RuntimeException: java.lang.ClassNotFoundException: com.datastax.bdp.search.solr.Cql3SolrSecondaryIndex
There may be other incompatibilities to remove if you use other DataStax Enterprise features, but this was enough for me to get the migration working.
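For completeness, the schema changes described above can also be executed from a small driver program rather than cqlsh; the contact point, the Solr-indexed table name, and the replication factor are placeholders to adjust to your cluster:

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.Session;

public class StripDseFeatures {
    public static void main(String[] args) {
        // Contact point is a placeholder for one of the DSE nodes.
        try (Cluster cluster = Cluster.builder().addContactPoint("10.0.0.1").build();
             Session session = cluster.connect()) {
            // Move dse_system off EverywhereStrategy so plain Apache/DSC nodes can join.
            session.execute("ALTER KEYSPACE dse_system WITH replication = "
                    + "{'class': 'SimpleStrategy', 'replication_factor': 3}");
            // Drop the column family that carried the Solr secondary index
            // (my_ks.solr_indexed_table is a placeholder for the actual table).
            session.execute("DROP TABLE IF EXISTS my_ks.solr_indexed_table");
        }
    }
}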
dse-core.jar contains the EverywhereStrategy class.
We solved this problem by doing the following:
Replace everything except the above JAR so the nodes can come up fine. Once all nodes are migrated to OSS, drop the dse_system keyspace (which uses this replication strategy), delete the JAR, and restart the nodes one by one.

Cassandra hangs on arbitrary commands

We're hosting a Cassandra 2.0.2 cluster on AWS. We've recently started upgrading from normal to SSD drives by bootstrapping new nodes and decommissioning old ones. It went fairly well, aside from two nodes hanging forever on decommission. Now, after the 6 new nodes are operational, we noticed that some of our old tools, which use phpcassa, stopped working. Nothing has changed with the security groups, all TCP/UDP ports are open, telnet can connect via 9160, and cqlsh can 'use' a keyspace and select data. However, 'describe cluster' fails, and in cassandra-cli 'show keyspaces' also fails - and by fail, I mean it never returns to the prompt and never returns any results. The queries work perfectly from the new nodes, but even the old nodes waiting to be decommissioned cannot perform them. The production system, also using phpcassa, does normal data requests and works fine.
All Cassandra nodes have the same config, the same version, and were installed from the same package. All nodes were recently restarted due to a seed node change.
Versions:
Connected to ### at ####.compute-1.amazonaws.com:9160.
[cqlsh 4.1.0 | Cassandra 2.0.2 | CQL spec 3.1.1 | Thrift protocol 19.38.0]
I've run out of ideas. Any hints would be greatly appreciated.
Update:
After a bit of random investigation, here's a more detailed description.
If I cassandra-cli to any machine locally and do "show keyspaces", it works.
If I cassandra-cli to a remote machine and do "show keyspaces", it hangs indefinitely.
If I cqlsh to a remote cassandra, and do a describe keyspaces, it hangs. ctrl+c, repeat the same query, it instantly responds.
If I cqlsh to a local cassandra, and do a describe keyspaces, it works.
If I cqlsh to a local cassandra and do a select * from Keyspace limit x, it will return data up to a certain limit. I was able to return data with limit 760; limit 761 would fail.
If I set consistency to ALL and run the same select, it hangs.
If I run a trace, different machines return the data, though sometimes source_elapsed is "null".
Not to forget, applications querying the cluster sometimes do get results, after several attempts.
Update 2
Further experimenting led to failed bootstrapping of two nodes: one hung on bootstrap for 4 days and eventually failed, possibly due to a rolling restart, and the other simply failed after 2 days. Repairs wouldn't function and produced "Stream failed" errors, as well as "Exception in thread Thread[StorageServiceShutdownHook,5,main] java.lang.NullPointerException". Also, after executing repair, we started getting "Read an invalid frame size of 0. Are you using tframedtransport on the client side?".
Solution
Switching rpc_server_type from hsha to sync made all the problems go away. We had worked with hsha for months without issues before this.
In case someone else stumbles on this:
http://planetcassandra.org/blog/post/hsha-thrift-server-corruption-cassandra-2-0-2-5/
In cassandra.yaml:
Switch rpc_server_type from hsha to sync.
