Cassandra full table scan using Spark Performance - apache-spark

I have a requirement of scanning a table which contains 100 million record in Production. The search will be made on the first clustering key. The requirement is to find the unique partition keys where first clustering key is matching a condition. The table looks like the following -
employeeid, companyname , lastdateloggedin, floorvisted, swipetimestamp
Partition Key - employeeid
Clustering Key - companyname , lastdateloggedin
I would like to get select distinct(employeeid),company, swipetimestamp where companyname = 'XYZ'. This is an SQL representation of what i would like to fetch from the table.
SparkConf conf = new SparkConf().set("spark.cassandra.connection.enabled", "true")
.set("spark.cassandra.auth.username", "XXXXXXXXXX")
.set("spark.cassandra.auth.password", "XXXXXXXXX")
.set("spark.cassandra.connection.host", "hostname")
.set("spark.cassandra.connection.port", "29042")
.set("spark.cassandra.connection.factory", ConnectionFactory.class)
.set("spark.cassandra.connection.cluster_name", "ZZZZ")
.set("spark.cassandra.connection.application_name", "ABC")
.set("spark.cassandra.connection.local_dc", "DC1")
.set("spark.cassandra.connection.cachedClusterFile", "/tmp/xyz/test.json")
.set("spark.cassandra.connection.ssl.enabled", "true")
.set("spark.cassandra.input.fetch.size_in_rows","10000") //
.set("spark.driver.allowMultipleContexts","true")
.set("spark.cassandra.connection.ssl.trustStore.path", "sampleabc-spark-util/src/main/resources/x.jks")
.set("spark.cassandra.connection.ssl.trustStore.password", "cassandrasam");
CassandraJavaRDD<CassandraRow> ctable = javaFunctions(jsc).cassandraTable("keyspacename", "employeedetails").
select("employeeid", "companyname","swipetimestamp").where("companyname= ?","XYZ");
List<CassandraRow> cassandraRows = ctable.distinct().collect();
This code run in non production with close 5 million data. Since it is production i would like to approach this query with caution. Questions -
What are the config that should be present in my SparkConf ?
Will the spark job ever bring down the db because of the large table ?
Running that job might starve threads to cassandra at that moment ?

I would recommend to use Dataframe API instead of the RDDs - theoretically, SCC may do more optimizations for that API. If you have condition on the first clustering column, then this condition should be pushed by SCC down to Cassandra and filtering will happen there. You can check that by using .expalin on the dataframe, and checking that you have rules marked with * in the PushedFilters part.
Regarding config - use default version of the spark.cassandra.input.fetch.size_in_rows - if you have too high value, then you can have a higher chance of getting timeouts. You can still bring down the nodes with default value, as SCC is reading with LOCAL_ONE, and that overload single nodes. Sometimes, reading with LOCAL_QUORUM is faster because it won't overload individual nodes too much, and won't restart the tasks that are reading data.
And I recommend to make sure that you're using latest Spark Cassandra Connector - 2.5.0 - it has a lot of new optimizations and new functionality...

Related

How can I implement pyspark Cassandra "keybased" connector?

I am using Spark 2.4.7 and I have implemented normal pyspark cassandra connector, but there is a use case where I need to implement key based connector, I am not getting useful blogs/tutorials around it, Someone please help me with it.
I have tried normal pyspark-cassandra connector and it is working good.
Now I am expecting to implement keybased connector, which I am unable to find.
Normally Cassandra Loads entire table but I want not to load entire table but run a query on source and get the required data
By keybased I want to get data using some keys i.e. using where condition like
Select *
From <table_name>
Where <column_name>!=0
should run on source and load those data only which satisfies this condition.
To have this functionality you need to understand how both Spark & Cassandra works separately & together:
When you do spark.read, Spark doesn't load all data - it just fetches metadata, like, table structure, column names & types, partitioning schema, etc.
When you perform query with condition (where or filter), Spark Cassandra Connector tries to perform so-called predicate pushdown - convert Spark SQL query into corresponding CQL query, but it really depends on the condition. And if it's not possible, then it goes through all data, and perform filtering on the Spark side. For example, if you have condition on the column that is partition key - then it will be converted into CQL expression SELECT ... FROM table where pk = XXX. Similarly, there are some optimizations for queries on the clustering columns - Spark will need to go through all partitions, but it's still will be more optimized as it may filter data based on the clustering columns. Use a link above to understand what conditions could be pushed down into Cassandra and which aren't. The rule of thumb is - if you can execute query in CQLSH without ALLOW FILTERING, then it will be pushed down.
In your specific example, you're using inequality predicate (<> or !=) that isn't supported by Cassandra, so Spark Cassandra connector will need to go through all data, and filtering will happen on the Spark side.

Looking up about 40k records out 150 million records in Cassandra in every job run?

I am building a near real time/ microbatch data application with Cassandra as the lookup store. Each incremental run has ~40K records, while the Cassandra table has about 150 million records. In each run, I need to lookup the id field and get some attributes from Cassandra. These lookups can be random (not any time/ region/ country dependency), so there is no clear partitioning scheme.
How should I try to partition the Cassandra table to ensure decent/ good performance (for microbatches running every 15-30 mins)?
Apart from partitioning, any other tips?
joinWithCassandraTable and leftJoinWithCassandraTable functions were specifically designed for efficient data lookup in Cassandra from Spark jobs. It performs fetching of data by primary or partition key, and because it's executed by multiple executors in parallel, it could be fast (although ~40K could still take time, but it depends on size of your Cassandra and Spark clusters). See the SCC's documentation for detailed information how to use it - but remember, that these functions are available only in RDD API. The DataStax's version of connector has support for so-called "DirectJoin" - efficient joins with Cassandra in the DataFrame API.
Regarding partitioning - it depends on how do you perform lookup - you have 1 record in Cassandra matching one record in Spark? If yes, then just use this ID as primary key (it's equal to partition key in this case).

Date partition size 10GB read efficiently

We are using Cassandra DataStax 6.0 and Spark enabled. We have 10GB of data coming every day. All queries are based on date. We have one huge table with 40 columns. We are planning to generate reports using Spark. What is the best way to setup this data. Since we keep getting data every day and save data for around 1 year in one table.
We tried to use different partition but most of our keys are based on date.
No code just need suggestion
Our query should be fast enough. We have 256GB Ram with 9 nodes. 44 core CPU.
Having the data organized in the daily partitions isn't very good design - in this case, only RF nodes will be active during the day writing the data, and then at the time of the report generation.
Because you'll be accessing that data only from Spark, you can use following approach - have some bucket field as partition key, for example, with uniformly generated random number, and timestamp as a clustering column, and maybe another uuid column for uniqueness guarantee of records, something like this:
create table test.sdtest (
b int,
ts timestamp,
uid uuid,
v1 int,
primary key(b, ts, uid));
Where maximum value for generatio of b should be selected to have not too very big and not very small partitions, so we can effectively read them.
And then we can run Spark code like this:
import org.apache.spark.sql.cassandra._
val data = spark.read.cassandraFormat("sdtest", "test").load()
val filtered = data.filter("ts >= cast('2019-03-10T00:00:00+0000' as timestamp) AND ts < cast('2019-03-11T00:00:00+0000' as timestamp)")
The trick here is that we distribute data across the nodes by using the random partition key, so the all nodes will handle the load during writing the data and during the report generation.
If we look into physical plan for that Spark code (formatted for readability):
== Physical Plan ==
*Scan org.apache.spark.sql.cassandra.CassandraSourceRelation [b#23,ts#24,v1#25]
PushedFilters: [*GreaterThanOrEqual(ts,2019-03-10 00:00:00.0),
*LessThan(ts,2019-03-11 00:00:00.0)], ReadSchema: struct<b:int,ts:timestamp,v1:int>
We can see that both conditions will be pushed to DSE on the CQL level - this means, that Spark won't load all data into memory and filter them, but instead all filtering will happen in Cassandra, and only necessary data will be returned back. And because we're spreading requests between multiple nodes, the reading could be faster (need to test) than reading one giant partition. Another benefit of this design, is that it will be easy to perform deletion of the old data using Spark, with something like this:
val toDel = sc.cassandraTable("test", "sdtest").where("ts < '2019-08-10T00:00:00+0000'")
toDel.deleteFromCassandra("test", "sdtest", keyColumns = SomeColumns("b", "ts"))
In this case, Spark will perform very effective range/row deletion that will generate less tombstones.
P.S. it's recommended to use DSE's version of the Spark connector as it may have more optimizations.
P.P.S. theoretically, we can merge ts and uid into one timeuuid column, but I'm not sure that it will work with Dataframes.

Ignite Spark Dataframe slow performance

I was trying to improve the performance of some existing spark dataframe by adding ignite on top of it. Following code is how we currently read dataframe
val df = sparksession.read.parquet(path).cache()
I managed to save and load spark dataframe from ignite by the example here: https://apacheignite-fs.readme.io/docs/ignite-data-frame. Following code is how I do it now with ignite
val df = spark.read()
.format(IgniteDataFrameSettings.FORMAT_IGNITE()) //Data source
.option(IgniteDataFrameSettings.OPTION_TABLE(), "person") //Table to read.
.option(IgniteDataFrameSettings.OPTION_CONFIG_FILE(), CONFIG) //Ignite config.
.load();
df.createOrReplaceTempView("person");
SQL Query(like select a, b, c from table where x) on ignite dataframe is working but the performance is much slower than spark alone(i.e without ignite, query spark DF directly), an SQL query often take 5 to 30 seconds, and it's common to be 2 or 3 times slower spark alone. I noticed many data(100MB+) are exchanged between ignite container and spark container for every query. Query with same "where" but smaller result is processed faster. Overall I feel ignite dataframe support seems to be a simple wrapper on top of spark. Hence most of the case it is slower than spark alone. Is my understanding correct?
Also by following the code example when the cache is created in ignite it automatically has a name like "SQL_PUBLIC_name_of_table_in_spark". So I could't change any cache configuration in xml (Because I need to specify cache name in xml/code to configure it and ignite will complain it already exists) Is this expected?
Thanks
First of all, it doesn't seem that your test is fair. In the first case you prefetch Parquet data, cache it locally in Spark, and only then execute the query. In case of Ignite DF you don't use caching, so data is fetched during query execution. Typically you will not be able to cache all your data, so performance with Parquet will go down significantly once some of the data needs to be fetched during execution.
However, with Ignite you can use indexing to improve the performance. For this particular case, you should create index on the x field to avoid scanning all the data every time query is executed. Here is the information on how to create an index: https://apacheignite-sql.readme.io/docs/create-index

FiloDB + Spark Streaming Data Loss

I'm using FiloDB 0.4 with Cassandra 2.2.5 column and meta store and trying to insert data into it using Spark Streaming 1.6.1 + Jobserver 0.6.2. I use the following code to insert data:
messages.foreachRDD(parseAndSaveToFiloDb)
private static Function<JavaPairRDD<String, String>, Void> parseAndSaveToFiloDb = initialRdd -> {
final List<RowWithSchema> parsedMessages = parseMessages(initialRdd.collect());
final JavaRDD<Row> rdd = javaSparkContext.parallelize(createRows(parsedMessages));
final DataFrame dataFrame = sqlContext.createDataFrame(rdd, generateSchema(rawMessages);
dataFrame.write().format("filodb.spark")
.option("database", keyspace)
.option("dataset", dataset)
.option("row_keys", rowKeys)
.option("partition_keys", partitionKeys)
.option("segment_key", segmentKey)
.mode(saveMode).save();
return null;
};
Segment key is ":string /0", row key is set to column which is unique for each row and partition key is set to column which is const for all rows. In other words all my test data set goes to single segment on single partition. When I'm using single one-node Spark then everything works fine and I get all data inserted but when I'm running two separate one-node Sparks(not as a cluster) at the same time then I get lost about 30-60% of data even if I send messages one by one with several seconds as interval.
I checked that dataFrame.write() is executed for each message so the issue happens after this line.
When I'm setting segment key to column which is unique for each row then all data reaches Cassandra/FiloDB.
Please suggest me solutions for scenario with 2 separate sparks.
#psyduck, this is most likely because data for each partition can only be ingested on one node at a time -- for the 0.4 version. So to stick with the current version, you would need to partition your data into multiple partitions and then ensure each worker only gets one partition. The easiest way to achieve the above is to sort your data by partition key.
I would highly encourage you to move to the latest version though - master (Spark 2.x / Scala 2.11) or spark1.6 branch (spark 1.6 / Scala 2.10). The latest version has many changes that are not in 0.4 that would solve your problem:
Using Akka Cluster to automatically route your data to the right ingestion node. In this case with the same model your data would all go to the right node and ensure no data loss
TimeUUID-based chunkID, so even in case multiple workers (in case of a split brain) somehow write to the same partition, data loss is avoided
A new "segment less" data model so you don't need to define any segment keys, more efficient for both reads and writes
Feel free to reach out on our mailing list, https://groups.google.com/forum/#!forum/filodb-discuss

Resources