Spark Cassandra Issue with KeySpace Replication - apache-spark

I have created a table in Cassandra with the commands below:
CREATE KEYSPACE test WITH REPLICATION = { 'class' :
'NetworkTopologyStrategy', 'dc1' : 3 } AND DURABLE_WRITES = false;
use test;
create table demo(id int primary key, name text);
Once the table was created successfully, I ran the code below to write data into Cassandra from Spark,
but I am facing the error below.
Spark code snippet:
import com.datastax.spark.connector._
import org.apache.spark.SparkContext._
import org.apache.spark.SparkConf
import com.datastax.spark.connector.cql._
val connectorToClusterOne = CassandraConnector(sc.getConf
  .set("spark.cassandra.connection.host", "xx.xx.xx.xx")
  .set("spark.cassandra.auth.username", "xxxxxxx")
  .set("spark.cassandra.auth.password", "xxxxxxx"))
// K/V: parse each line of the input file into a demo row
case class demo(id: Int, name: String)
val data = sc.textFile("/home/ubuntu/test.txt").map(_.split(",")).map(p => demo(p(0).toInt, p(1)))
implicit val c = connectorToClusterOne
data.saveToCassandra("test","demo")
Below is the error description:
Error while computing token map for keyspace test with datacenter dc1: could not achieve replication factor 3 (found 0 replicas only), check your keyspace replication settings.
Could anyone suggest what the possible reason for this might be?

This error usually means either that the request is not being directed at the correct cluster, or that the datacenter does not exist or is misnamed.
To make sure you are connecting to the correct cluster, double-check the connection host used by your Spark application.
To check the datacenter, run nodetool status and verify that the datacenter you referenced in the keyspace definition exists and that its name contains no extraneous whitespace (datacenter names are case-sensitive).
Lastly, it is possible that all the nodes in the datacenter are down, so check that as well.
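For example, if nodetool status reports the datacenter under a different name than the keyspace references, the keyspace can be realigned with an ALTER statement. A minimal sketch, where 'DC1' is a stand-in for whatever name nodetool actually reports:

```
-- Hypothetical: nodetool status reports the datacenter as "DC1", not "dc1"
ALTER KEYSPACE test
  WITH REPLICATION = { 'class' : 'NetworkTopologyStrategy', 'DC1' : 3 };
```

After the change, re-run nodetool status and confirm the name in the keyspace matches the reported datacenter exactly.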

Related

Not enough replicas available for query at consistency LOCAL_ONE (1 required but only 0 alive)

I am running spark-cassandra-connector and hitting a weird issue:
I run the spark-shell as:
bin/spark-shell --packages datastax:spark-cassandra-connector:2.0.0-M2-s_2.11
Then I run the following commands:
import com.datastax.spark.connector._
val rdd = sc.cassandraTable("test_spark", "test")
println(rdd.first)
# CassandraRow{id: 2, name: john, age: 29}
The problem is that take(1) works:
rdd.take(1).foreach(println)
# CassandraRow{id: 2, name: john, age: 29}
but take(2) gives an error:
rdd.take(2).foreach(println)
# Caused by: com.datastax.driver.core.exceptions.UnavailableException: Not enough replicas available for query at consistency LOCAL_ONE (1 required but only 0 alive)
# at com.datastax.driver.core.exceptions.UnavailableException.copy(UnavailableException.java:128)
# at com.datastax.driver.core.Responses$Error.asException(Responses.java:114)
# at com.datastax.driver.core.RequestHandler$SpeculativeExecution.onSet(RequestHandler.java:467)
# at com.datastax.driver.core.Connection$Dispatcher.channelRead0(Connection.java:1012)
# at com.datastax.driver.core.Connection$Dispatcher.channelRead0(Connection.java:935)
# at io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:105)
And the following command just hangs:
println(rdd.count)
My Cassandra keyspace seems to have the right replication factor:
describe test_spark;
CREATE KEYSPACE test_spark WITH replication = {'class': 'SimpleStrategy', 'replication_factor': '3'} AND durable_writes = true;
How to fix both the above errors?
I assume you hit the known issue with SimpleStrategy in a multi-DC cluster when using LOCAL_ONE (the Spark connector's default) consistency. The driver will look for a replica in the local DC to send the request to, but there's a chance that all the replicas for a token range live in a different DC, so the requirement can't be met. (CASSANDRA-12053)
If you change the consistency level (spark.cassandra.input.consistency.level) to ONE, I think it will be resolved. You should also seriously consider using NetworkTopologyStrategy instead.
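As a sketch, the connector's consistency levels can be set when launching the Spark shell via its spark.cassandra.* properties (shown here alongside the same --packages coordinate used above):

```
bin/spark-shell --packages datastax:spark-cassandra-connector:2.0.0-M2-s_2.11 \
  --conf spark.cassandra.input.consistency.level=ONE \
  --conf spark.cassandra.output.consistency.level=ONE
```

The same keys can also be set on the SparkConf programmatically before the SparkContext is created.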

Cassandra cluster simple query error

I'm learning Cassandra and I have a problem. I have a cluster with two machines (node A and node B). On one machine I can create new users and keyspaces, and on the other I can use those users and keyspaces. But if I create a new table on either machine (inside Cassandra, in a keyspace), I can't see the new table with a simple query statement like SELECT * FROM table or SELECT * FROM keyspace.table. Cassandra displays this error: "ServerError: <ErrorMessage code=0000 [Server error] message='java.lang.AssertionError'>"
If I use nodetool status on node A (node + seed), it displays an error:
java.lang.RuntimeException: No nodes present in the cluster. Has this node finished starting up?
but if I use nodetool status on node B (node only), it displays one node: node B.
Keyspace statement:
CREATE KEYSPACE demo WITH REPLICATION = { 'class' : 'SimpleStrategy', 'replication_factor' : 1 };
Cassandra 3.2 is installed on Debian.
What can I do? Any ideas? I can't fix it.

NetworkTopologyStrategy on single cassandra node

I have created a keyspace in Cassandra, once using NetworkTopologyStrategy and another time using SimpleStrategy, with the following syntax:
Keyspace definition:
CREATE KEYSPACE cw WITH REPLICATION = { 'class' : 'NetworkTopologyStrategy', 'datacenter16' : 1 };
CREATE KEYSPACE cw WITH REPLICATION = {'class' : 'SimpleStrategy', 'replication_factor' : 1}
Output of bin/nodetool ring :
Datacenter: 16
==========
Address       Rack  Status  State   Load      Owns     Token
172.16.4.196  4     Up      Normal  35.92 KB  100.00%  0
When I create a table in the NetworkTopologyStrategy keyspace and run a select * query on it, it returns the following error:
Unable to complete request: one or more nodes were unavailable
whereas it works fine in the SimpleStrategy keyspace. Why is that? Can't we use NetworkTopologyStrategy on a single-node Cassandra cluster?
While everyone else is right about the snitch, note that your datacenter name is actually '16': your nodetool ring output shows Datacenter: 16, so that is the name NetworkTopologyStrategy has to reference.
Try this:
CREATE KEYSPACE cw WITH REPLICATION = { 'class' : 'NetworkTopologyStrategy', '16' : 1 };
By default Cassandra is configured to use SimpleSnitch.
SimpleSnitch does not recognize datacenter and rack information, and hence can only be used with SimpleStrategy.
To change the snitch, you have to edit the following in cassandra.yaml:
endpoint_snitch: CHANGE THIS TO WHATEVER YOU WANT
You then also have to change the corresponding properties file to define datacenters and racks.
You have to define a network-aware snitch in order to use NetworkTopologyStrategy. See this document for more information: http://docs.datastax.com/en/cassandra/2.1/cassandra/architecture/architectureSnitchPFSnitch_t.html
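As a sketch, with GossipingPropertyFileSnitch (one common network-aware choice), each node's datacenter and rack come from cassandra-rackdc.properties; the names below are placeholders and must match what the keyspace's replication map references:

```
# cassandra.yaml
endpoint_snitch: GossipingPropertyFileSnitch

# cassandra-rackdc.properties (set on each node)
dc=datacenter16
rack=rack1
```

A rolling restart of the nodes is needed for the snitch change to take effect.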

NoHostAvailable exception after shutting down nodes of one datacenter

I have a Cassandra ring across two datacenters; below is the keyspace definition. When I bring down all the nodes of the local datacenter (aws), I expect the DataStax driver to query the remote nodes, but instead I get a NoHostAvailableException. Please help.
Keyspace definition as below
CREATE KEYSPACE IF NOT EXISTS mystore_stress WITH replication = {
'class':'NetworkTopologyStrategy',
'sol':2,
'aws':2
};
My session is created as follows:
public Session getSession() {
    final Cluster cluster = Cluster.builder().addContactPoints(contactPoints)
        .withRetryPolicy(DefaultRetryPolicy.INSTANCE)
        .withLoadBalancingPolicy(new TokenAwarePolicy(new DCAwareRoundRobinPolicy("aws", 1)))
        .withReconnectionPolicy(new ExponentialReconnectionPolicy(RECONNECTION_BASE_DELAY, RECONNECTION_MAX_DELAY))
        .withSocketOptions(
            new SocketOptions()
                .setConnectTimeoutMillis(CONNECT_TIMEOUT_MILLIS)
                .setReadTimeoutMillis(READ_TIMEOUT_MILLIS))
        .withPort(PORT)
        .build();
    return cluster.connect();
}
I suspect you need to review the consistency level on your query. If it is QUORUM, then you should ideally have 3 replicas available. You can find some examples of read consistency here:
http://www.datastax.com/documentation/cassandra/2.0/cassandra/dml/dmlClientRequestsRead.html
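Another thing worth checking: by default, DCAwareRoundRobinPolicy does not send LOCAL_* consistency requests to remote hosts at all, which matches the NoHostAvailableException once the whole 'aws' DC is down. A sketch of the 3.x builder form that opts in to remote failover, assuming the same contactPoints variable as above:

```java
// Allow up to 2 hosts per remote DC, and permit remote hosts to serve
// LOCAL_* consistency requests when the local DC ("aws") is unreachable.
final LoadBalancingPolicy policy = new TokenAwarePolicy(
    DCAwareRoundRobinPolicy.builder()
        .withLocalDc("aws")
        .withUsedHostsPerRemoteDc(2)
        .allowRemoteDCsForLocalConsistencyLevel()
        .build());

final Cluster cluster = Cluster.builder()
    .addContactPoints(contactPoints)
    .withLoadBalancingPolicy(policy)
    .build();
```

Note that falling back to a remote DC silently changes your latency and durability characteristics, so this should be a deliberate choice.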

How does a Spark worker distribute load in a Cassandra cluster?

I am trying to understand how Cassandra and Spark work together, especially when
the data is distributed across nodes.
I have a Cassandra + Spark setup with a two-node cluster using DSE.
The schema is
CREATE KEYSPACE foo WITH replication = {'class': 'SimpleStrategy','replication_factor':1}
CREATE TABLE bar (
customer text,
start timestamp,
offset bigint,
data blob,
PRIMARY KEY ((customer, start), offset)
)
I populated the table with a huge set of test data, and later figured out which keys
lie on which nodes with the help of the nodetool getendpoints command.
For example, in my case a particular customer's data with date '2014-05-25' is on
node1 and '2014-05-26' is on node2.
When I run the following query from spark shell, I see that spark worker on
node1 is running the task during mapPartitions stage.
csc.setKeyspace("foo")
val query = "SELECT cl_ap_mac_address FROM bar WHERE customer='test' AND start IN ('2014-05-25')"
val srdd = csc.sql(query)
srdd.count()
and for the following query, spark worker on node2 is running the task.
csc.setKeyspace("foo")
val query = "SELECT cl_ap_mac_address FROM bar WHERE customer='test' AND start IN ('2014-05-26')"
val srdd = csc.sql(query)
srdd.count()
But when I give both the dates only one node worker is getting utilized.
csc.setKeyspace("foo")
val query = "SELECT cl_ap_mac_address FROM bar WHERE customer='test' AND start IN ('2014-05-25', '2014-05-26')"
val srdd = csc.sql(query)
srdd.count()
I was expecting this to use both nodes in parallel during the
mapPartitions stage. Am I missing something?
I think you are trying to understand the interaction between Spark and Cassandra, as well as the data distribution in Cassandra.
Basically, from the Spark application a request will be made to one of the Cassandra nodes, which acts as the coordinator for that particular client request. More details can be found here.
Further data partitioning and replication are taken care of by Cassandra itself.
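One way to see both nodes utilized, as a sketch under the same schema: issue the single-day queries separately and combine the results, so each query's partition can be scheduled on the node that owns it (unionAll is assumed available on the RDD type csc.sql returns here):

```scala
csc.setKeyspace("foo")
// Each single-partition query is routed to the node owning that partition key.
val d1 = csc.sql("SELECT cl_ap_mac_address FROM bar WHERE customer='test' AND start IN ('2014-05-25')")
val d2 = csc.sql("SELECT cl_ap_mac_address FROM bar WHERE customer='test' AND start IN ('2014-05-26')")
val total = d1.unionAll(d2).count()  // the two scans can run in parallel
```

With the combined IN query, by contrast, the whole statement goes to a single coordinator, which is consistent with only one worker being busy.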