I want to optimize an RDD join with Cassandra via Spark. I am trying to read data and join it with Cassandra data.
I was trying to use the DataStax Spark Cassandra Connector for this, but it gives me an error: Invalid row size: 6 instead of 4. Here are the details.
import com.datastax.spark.connector._
import com.datastax.spark.connector.cql.CassandraConnector
val ip15M = sqlContext.read.parquet("/home/hadoop/work/data").toDF();
ip15M.dtypes
res8: Array[(String, String)] = Array((key1,StringType), (key2,StringType), (key3,StringType),
  (column1,StringType), (fact1,StringType), (fact2,StringType))
val joinWithRDD = ip15M.rdd.joinWithCassandraTable("key","tabl1").on(SomeColumns("key1","key2","key3","column1"))
joinWithRDD.take(10).foreach(println)
I have the following Cassandra Table:
CREATE TABLE key.tabl1 (
key1 text,
key2 text,
key3 text,
column1 text,
value1 text,
value2 text,
PRIMARY KEY ((key1, key2, key3), column1)
) WITH CLUSTERING ORDER BY (column1 ASC)
AND bloom_filter_fp_chance = 0.01
AND caching = {'keys': 'ALL', 'rows_per_partition': 'NONE'}
AND comment = ''
AND compaction = {'class': 'org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy'}
AND compression = {'sstable_compression': 'org.apache.cassandra.io.compress.LZ4Compressor'}
AND dclocal_read_repair_chance = 0.1
AND default_time_to_live = 0
AND gc_grace_seconds = 864000
AND max_index_interval = 2048
AND memtable_flush_period_in_ms = 0
AND min_index_interval = 128
AND read_repair_chance = 0.0
AND speculative_retry = '99p';
I am getting the error below:
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 9.0 failed 4 times, most recent failure: Lost task 0.3 in stage 9.0 (TID 332, mr25p01if-ingx03030701.mr.if.apple.com, executor 146): java.lang.IllegalArgumentException: requirement failed: Invalid row size: 6 instead of 4.
I believe the error occurs because the RDD has 6 columns while my Cassandra table has only 4 primary key columns. I need the fact columns in the RDD since I need to update the values based on the join. I am not sure how to resolve this issue.
I tried running with and without the .on, but I still get the same error. From what I can see, the .on refers to the Cassandra-side columns, not the RDD's.
Let me know if any other inputs are needed
Update: if I create an RDD using parallelize, the join seems to work. It seems that when I read a file and convert it to an RDD, it loses the schema.
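For reference, here is the kind of mapping I am considering as a workaround (just a sketch; the field names come from the dtypes output above, and I am assuming the connector can pick the four join columns out of a case class by field name instead of writing the whole 6-column row positionally):
// Hypothetical case class mirroring the DataFrame columns listed above
case class IpRow(key1: String, key2: String, key3: String,
                 column1: String, fact1: String, fact2: String)
// Map each Spark SQL Row to the case class, then join on the key columns only
val ipRowRdd = ip15M.rdd.map(r => IpRow(
  r.getString(0), r.getString(1), r.getString(2),
  r.getString(3), r.getString(4), r.getString(5)))
val joinedRdd = ipRowRdd.joinWithCassandraTable("key", "tabl1")
  .on(SomeColumns("key1", "key2", "key3", "column1"))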
Any help is appreciated
Related
I need to select a distinct count from a table in Cassandra.
As I understand it, a direct distinct count is not supported in Cassandra, and neither are nested queries like in an RDBMS.
select count(*) from (select distinct key_part_one from stackoverflow_composite) as count;
SyntaxException: line 1:21 no viable alternative at input '(' (select count(*) from [(]...)
What are the ways to get it? Can I get it directly from Cassandra, or do I need to use add-on tools/languages?
Below is my CREATE TABLE statement.
CREATE TABLE nishant_ana.ais_profile_table (
profile_key text,
profile_id text,
last_update_day date,
last_transaction_timestamp timestamp,
last_update_insertion_timestamp timeuuid,
profile_data blob,
PRIMARY KEY ((profile_key, profile_id), last_update_day)
) WITH CLUSTERING ORDER BY (last_update_day DESC)
AND bloom_filter_fp_chance = 0.01
AND caching = {'keys': 'ALL', 'rows_per_partition': 'NONE'}
AND comment = ''
AND compaction = {'class': 'org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy', 'max_threshold': '32', 'min_threshold': '4'}
AND compression = {'chunk_length_in_kb': '64', 'class': 'org.apache.cassandra.io.compress.LZ4Compressor'}
AND crc_check_chance = 1.0
AND dclocal_read_repair_chance = 0.1
AND default_time_to_live = 0
AND gc_grace_seconds = 864000
AND max_index_interval = 2048
AND memtable_flush_period_in_ms = 0
AND min_index_interval = 128
AND read_repair_chance = 0.0
AND speculative_retry = '99PERCENTILE';
I have just started using Cassandra.
From Cassandra itself you can only do the select distinct partition_key from ....
If you need something like this, you can use Spark + the Spark Cassandra Connector. It will work, but don't expect real-time answers, as it needs to read the necessary data from all nodes and then calculate the answer.
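A minimal sketch of what that could look like in Scala with the Spark Cassandra Connector, assuming the keyspace and table from the question (it counts the distinct (profile_key, profile_id) partition keys):
import com.datastax.spark.connector._
// Read only the partition key columns, then count the distinct pairs in Spark
val distinctProfiles = sc
  .cassandraTable("nishant_ana", "ais_profile_table")
  .select("profile_key", "profile_id")
  .map(r => (r.getString("profile_key"), r.getString("profile_id")))
  .distinct()
  .count()
println(distinctProfiles)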
I am using a Cassandra database for capturing and saving simple network sniffer data, but because the table has more than 20M rows, I am unable to run any aggregate function such as sum or count.
Following is my table schema:
CREATE TABLE db.uinfo (
id timeuuid,
created timestamp,
dst_ip text,
dst_mac text,
dst_port int,
protocol int,
src_ip text,
src_mac text,
src_port int,
PRIMARY KEY (id, created)
) WITH CLUSTERING ORDER BY (created ASC)
AND bloom_filter_fp_chance = 0.01
AND caching = {'keys': 'ALL', 'rows_per_partition': 'NONE'}
AND comment = ''
AND compaction = {'class': 'org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy', 'max_threshold': '32', 'min_threshold': '4'}
AND compression = {'chunk_length_in_kb': '64', 'class': 'org.apache.cassandra.io.compress.LZ4Compressor'}
AND crc_check_chance = 1.0
AND dclocal_read_repair_chance = 0.1
AND default_time_to_live = 0
AND gc_grace_seconds = 864000
AND max_index_interval = 2048
AND memtable_flush_period_in_ms = 0
AND min_index_interval = 128
AND read_repair_chance = 0.0
AND speculative_retry = '99PERCENTILE';
Now, when I run the query (with or without a limit):
select src_ip, sum(data) as total from db.uinfo;
It throws me the following error:
OperationTimedOut: errors={'127.0.0.1': 'Client request timeout. See Session.execute[_async](timeout)'}, last_host=127.0.0.1
Any chance any of you good people could help me do this? I have tried changing the timeouts in cqlshrc and cassandra.yaml, respectively. I have even tried starting cqlsh using:
cqlsh --connect-timeout=120 --request-timeout=120
I am using [cqlsh 5.0.1 | Cassandra 3.11.4 | CQL spec 3.4.4 | Native protocol v4]
These kinds of queries won't work with Cassandra when it holds relatively big data: they require scanning the whole database and reading all of the data in it. Cassandra is great when you know the partition that you want to hit, so the query is sent only to the individual servers where it can be processed very efficiently. Aggregation functions therefore work best only within a partition.
If you need this kind of query, the common suggestion is to use Spark to read the data in parallel and perform the aggregations. You can do this using the Spark Cassandra Connector, but it will be slower than usual queries: maybe dozens of seconds, or even minutes, depending on the size of the data, the hardware for the Spark jobs, etc.
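For illustration, a rough Scala sketch of that kind of Spark aggregation against the schema above (the data column from the query does not appear in the schema shown, so this simply counts rows per src_ip):
import com.datastax.spark.connector._
// Spark reads the table's token ranges from all nodes in parallel,
// then aggregates; here: number of rows per src_ip
val rowsPerSrcIp = sc
  .cassandraTable("db", "uinfo")
  .select("src_ip")
  .map(r => (r.getString("src_ip"), 1L))
  .reduceByKey(_ + _)
rowsPerSrcIp.take(20).foreach(println)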
If you need this kind of query performed very often, then you need to look at other technologies, but it's hard to say which will perform well in such a situation.
I'm using the spark-cassandra-connector to read data from Cassandra and process it in Spark.
There are 2 billion rows in a Cassandra table.
The schema is:
CREATE TABLE my_keyspace.testtable (
id text,
code text,
orgcode text,
number bigint,
branchnumber int,
price decimal,
content text,
PRIMARY KEY ((id, code), orgcode, number, branchnumber)
) WITH CLUSTERING ORDER BY (orgcode ASC, number ASC, branchnumber ASC)
AND bloom_filter_fp_chance = 0.1
AND caching = {'keys': 'ALL', 'rows_per_partition': 'NONE'}
AND comment = ''
AND compaction = {'class': 'org.apache.cassandra.db.compaction.LeveledCompactionStrategy'}
AND compression = {'chunk_length_in_kb': '64', 'class': 'org.apache.cassandra.io.compress.DeflateCompressor'}
AND crc_check_chance = 1.0
AND dclocal_read_repair_chance = 0.1
AND default_time_to_live = 0
I use the following code to count distinct ids:
import static com.datastax.spark.connector.japi.CassandraJavaUtil.javaFunctions;
import static com.datastax.spark.connector.japi.CassandraJavaUtil.mapColumnTo;
JavaSparkContext jsc = new JavaSparkContext(conf);
// Select only the id column, map it to String, then count distinct values in Spark
JavaRDD<String> idRDD = javaFunctions(jsc).cassandraTable("my_keyspace", "testtable", mapColumnTo(String.class)).select("id");
idRDD.distinct().count();
But I got the following error:
17/12/13 11:32:21 WARN TaskSetManager: Lost task 27.0 in stage 0.0 (TID 27, 10.240.0.30, executor 1): java.io.IOException: Exception during execution of SELECT "id" FROM "my_keyspace"."testtable" WHERE token("id", "code") > ? AND token("id", "code") <= ? ALLOW FILTERING: Cassandra failure during read query at consistency LOCAL_ONE (1 responses were required but only 0 replica responded, 1 failed)
at com.datastax.spark.connector.rdd.CassandraTableScanRDD.com$datastax$spark$connector$rdd$CassandraTableScanRDD$$fetchTokenRange(CassandraTableScanRDD.scala:3
at com.datastax.spark.connector.rdd.CassandraTableScanRDD$$anonfun$17.apply(CassandraTableScanRDD.scala:367)
at com.datastax.spark.connector.rdd.CassandraTableScanRDD$$anonfun$17.apply(CassandraTableScanRDD.scala:367)
at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434)
at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
Caused by: com.datastax.driver.core.exceptions.ReadFailureException: Cassandra failure during read query at consistency LOCAL_ONE (1 responses were required but only 0 replica responded, 1 failed)
at com.datastax.driver.core.Responses$Error$1.decode(Responses.java:76)
All tasks in Spark failed, and after a while Cassandra itself failed with an OutOfMemoryError.
There are 5 nodes on GCE for this test. I have tried increasing memory and request_timeout, but it didn't help.
Environment:
spark: 2.1.2
cassandra: 3.10
spark-cassandra-connector: 2.0.6
I'm having trouble deleting a keyspace.
The keyspace in question has 4 tables similar to this one:
CREATE KEYSPACE demo WITH replication = {'class': 'SimpleStrategy', 'replication_factor': '1'} AND durable_writes = false;
CREATE TABLE demo.t1 (
t11 int,
t12 timeuuid,
t13 int,
t14 text,
t15 text,
t16 boolean,
t17 boolean,
t18 int,
t19 timeuuid,
t110 text,
PRIMARY KEY (t11, t12)
) WITH CLUSTERING ORDER BY (t13 DESC)
AND bloom_filter_fp_chance = 0.01
AND caching = {'keys': 'ALL', 'rows_per_partition': 'NONE'}
AND comment = ''
AND compaction = {'class': 'org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy', 'max_threshold': '32', 'min_threshold': '4'}
AND compression = {'chunk_length_in_kb': '64', 'class': 'org.apache.cassandra.io.compress.LZ4Compressor'}
AND crc_check_chance = 1.0
AND dclocal_read_repair_chance = 0.1
AND default_time_to_live = 0
AND gc_grace_seconds = 864000
AND max_index_interval = 2048
AND memtable_flush_period_in_ms = 0
AND min_index_interval = 128
AND read_repair_chance = 0.0
AND speculative_retry = '99PERCENTILE';
CREATE INDEX t1_idx ON demo.t1 (t14);
CREATE INDEX t1_deleted_idx ON demo.t2 (t15);
When I try to drop the keyspace using the following code:
Session session = cluster.connect();
PreparedStatement prepared = session.prepare("drop keyspace if exists " + schemaName);
BoundStatement bound = prepared.bind();
session.execute(bound);
Then the query times out (or takes over 10 seconds to execute), even when the tables are empty:
com.datastax.driver.core.exceptions.OperationTimedOutException: [/192.168.0.1:9042] Timed out waiting for server response
at com.datastax.driver.core.exceptions.OperationTimedOutException.copy(OperationTimedOutException.java:44)
at com.datastax.driver.core.exceptions.OperationTimedOutException.copy(OperationTimedOutException.java:26)
at com.datastax.driver.core.DriverThrowables.propagateCause(DriverThrowables.java:37)
at com.datastax.driver.core.DefaultResultSetFuture.getUninterruptibly(DefaultResultSetFuture.java:245)
at com.datastax.driver.core.AbstractSession.execute(AbstractSession.java:64)
at com.datastax.driver.core.AbstractSession.execute(AbstractSession.java:39)
I tried this on multiple machines and the result was the same. I'm using Cassandra 3.9. A similar thing happens using cqlsh. I know that I can increase the read timeout in the cassandra.yaml file, but how can I make the drop itself faster? Another thing: if I make two consecutive requests, the first one times out and the second one goes through quickly.
Try running it with an increased timeout:
cqlsh --request-timeout=3600 (in seconds, default is 10 seconds)
There should also be a corresponding setting at the driver level. Review the timeout section at this link:
http://docs.datastax.com/en/developer/java-driver/3.1/manual/socket_options/
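For example, with the Java driver 3.x the read timeout can be raised through SocketOptions when building the Cluster (a sketch only; the contact point and the 60-second value are placeholders):
import com.datastax.driver.core.{Cluster, SocketOptions}
// Raise the driver-side read timeout before issuing the DROP KEYSPACE statement
val cluster = Cluster.builder()
  .addContactPoint("192.168.0.1")
  .withSocketOptions(new SocketOptions().setReadTimeoutMillis(60000))
  .build()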
Increasing the timeout just hides the issue and is usually a bad idea. Have a look at this answer: https://stackoverflow.com/a/16618888/7413631
I'm running Cassandra version 3.3 on a fairly beefy machine. I would like to try out the row cache, so I have allocated 2 GB of RAM for row caching and configured the target tables to cache a number of their rows.
If I run a query on a very small table (under 1 MB) twice with tracing on, I see a cache hit on the second query. However, when I run a query on a large table (34 GB) I only get cache misses, and I see this message after every cache miss:
Fetching data but not populating cache as query does not query from the start of the partition
What does this mean? Do I need a bigger row cache to be able to handle a 34 GB table with 90 million keys?
Taking a look at the row cache source code on GitHub, I see that clusteringIndexFilter().isHeadFilter() must be evaluating to false in this case. Is this a function of my partitions being too big?
My schema is:
CREATE TABLE ap.account (
email text PRIMARY KEY,
added_at timestamp,
data map<int, int>
) WITH bloom_filter_fp_chance = 0.01
AND caching = {'keys': 'ALL', 'rows_per_partition': '100000'}
AND comment = ''
AND compaction = {'class': 'org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy', 'max_threshold': '32', 'min_threshold': '4'}
AND compression = {'chunk_length_in_kb': '64', 'class': 'org.apache.cassandra.io.compress.LZ4Compressor'}
AND crc_check_chance = 1.0
AND dclocal_read_repair_chance = 0.1
AND default_time_to_live = 0
AND gc_grace_seconds = 864000
AND max_index_interval = 2048
AND memtable_flush_period_in_ms = 0
AND min_index_interval = 128
AND read_repair_chance = 0.0
AND speculative_retry = '99PERCENTILE';
The query is simply SELECT * FROM account WHERE email='sample#test.com'