How to use accumulators with Spark 2.3.1 API - apache-spark

I am using spark-sql_2.11-2.3.1 with Cassandra 3.x.
I need to provide a validation feature backed by a table with the following columns:
column_family_name text,
oracle_count bigint,
cassandra_count bigint,
create_timestamp timestamp,
last_update_timestamp timestamp,
update_user text
To populate cassandra_count I need to count the records that were successfully inserted, and for that I want to make use of a Spark accumulator. But unfortunately I am not able to find the required API samples for the spark-sql_2.11-2.3.1 version.
Below is my snippet for saving to Cassandra:
o_model_df.write.format("org.apache.spark.sql.cassandra")
.options(Map( "table" -> columnFamilyName, "keyspace" -> keyspace ))
.mode(SaveMode.Append)
.save()
How can I increment an accumulator for each row that is successfully saved into Cassandra?
Any help would be highly appreciated.

Spark's accumulators are usually used in the transformations that you write; don't expect the Spark Cassandra Connector to provide something like that for you.
But overall, if your job finishes without an error, it means that the data was written correctly into the database.
If you want to check how many rows really ended up in the database, you need to count the data in the database - you can use the cassandraCount method of the Spark Cassandra Connector. The main reason for that: your DataFrame may contain multiple rows that map onto a single Cassandra row (for example, if you defined the primary key incorrectly, so multiple rows share it).
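For illustration, here is a minimal sketch that reuses the names from the question (o_model_df, columnFamilyName, keyspace; a SparkSession named spark is assumed). It counts rows with an accumulator inside a transformation and then takes the authoritative count directly from Cassandra with cassandraCount. Note that accumulators updated inside transformations may over-count if tasks are retried:
import com.datastax.spark.connector._
import org.apache.spark.sql.SaveMode
import org.apache.spark.sql.catalyst.encoders.RowEncoder

val insertedRows = spark.sparkContext.longAccumulator("insertedRows")

// Counts rows as Spark processes them - this is not a confirmation from Cassandra
val counted_df = o_model_df.map { row => insertedRows.add(1L); row }(RowEncoder(o_model_df.schema))

counted_df.write.format("org.apache.spark.sql.cassandra")
.options(Map("table" -> columnFamilyName, "keyspace" -> keyspace))
.mode(SaveMode.Append)
.save()

// Count taken directly from Cassandra after the write completes
val cassandraCount = spark.sparkContext.cassandraTable(keyspace, columnFamilyName).cassandraCount()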

Related

How can I implement pyspark Cassandra "keybased" connector?

I am using Spark 2.4.7 and I have implemented the normal PySpark Cassandra connector, but there is a use case where I need to implement a key-based connector. I am not finding useful blogs/tutorials around it, so could someone please help me with it?
I have tried the normal pyspark-cassandra connector and it works well.
Now I want to implement a key-based connector, which I am unable to find.
Normally the connector loads the entire Cassandra table, but I don't want to load the entire table; I want to run a query on the source and load only the required data.
By key-based I mean getting data using some keys, i.e. using a where condition like
Select *
From <table_name>
Where <column_name> != 0
which should run on the source and load only the data that satisfies this condition.
To get this functionality you need to understand how Spark & Cassandra work, both separately & together:
When you do spark.read, Spark doesn't load all the data - it just fetches metadata, such as the table structure, column names & types, partitioning schema, etc.
When you perform a query with a condition (where or filter), the Spark Cassandra Connector tries to perform so-called predicate pushdown - converting the Spark SQL query into a corresponding CQL query - but whether it can really depends on the condition. If that's not possible, it goes through all the data and performs the filtering on the Spark side. For example, if you have a condition on a column that is the partition key, it will be converted into the CQL expression SELECT ... FROM table WHERE pk = XXX. Similarly, there are some optimizations for queries on the clustering columns - Spark will still need to go through all partitions, but it will be more efficient as it can filter data based on the clustering columns. See the Spark Cassandra Connector documentation on predicate pushdown to understand which conditions can be pushed down into Cassandra and which can't. The rule of thumb is: if you can execute the query in CQLSH without ALLOW FILTERING, then it will be pushed down.
In your specific example, you're using an inequality predicate (<> or !=), which isn't supported by Cassandra, so the Spark Cassandra Connector will need to go through all the data and the filtering will happen on the Spark side.
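As a quick way to see what gets pushed down, here is a minimal sketch (in Scala; the table and column names are hypothetical) that compares the plans of a pushable and a non-pushable filter:
import org.apache.spark.sql.cassandra._

val df = spark.read.cassandraFormat("my_table", "my_keyspace").load()

// Equality on the partition key can be translated into CQL and pushed down
df.filter("pk_column = 'some_value'").explain()

// Inequality (!=) isn't supported by Cassandra, so filtering stays on the Spark side
df.filter("some_column != 0").explain()

// In the output, predicates pushed to Cassandra appear under PushedFilters marked with *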

Cassandra full table scan using Spark Performance

I have a requirement to scan a table which contains 100 million records in production. The search will be made on the first clustering key. The requirement is to find the unique partition keys where the first clustering key matches a condition. The table looks like the following -
employeeid, companyname, lastdateloggedin, floorvisted, swipetimestamp
Partition Key - employeeid
Clustering Key - companyname, lastdateloggedin
I would like to get select distinct(employeeid), companyname, swipetimestamp where companyname = 'XYZ'. This is an SQL representation of what I would like to fetch from the table.
SparkConf conf = new SparkConf().set("spark.cassandra.connection.enabled", "true")
.set("spark.cassandra.auth.username", "XXXXXXXXXX")
.set("spark.cassandra.auth.password", "XXXXXXXXX")
.set("spark.cassandra.connection.host", "hostname")
.set("spark.cassandra.connection.port", "29042")
.set("spark.cassandra.connection.factory", ConnectionFactory.class)
.set("spark.cassandra.connection.cluster_name", "ZZZZ")
.set("spark.cassandra.connection.application_name", "ABC")
.set("spark.cassandra.connection.local_dc", "DC1")
.set("spark.cassandra.connection.cachedClusterFile", "/tmp/xyz/test.json")
.set("spark.cassandra.connection.ssl.enabled", "true")
.set("spark.cassandra.input.fetch.size_in_rows","10000") //
.set("spark.driver.allowMultipleContexts","true")
.set("spark.cassandra.connection.ssl.trustStore.path", "sampleabc-spark-util/src/main/resources/x.jks")
.set("spark.cassandra.connection.ssl.trustStore.password", "cassandrasam");
CassandraJavaRDD<CassandraRow> ctable = javaFunctions(jsc).cassandraTable("keyspacename", "employeedetails").
select("employeeid", "companyname","swipetimestamp").where("companyname= ?","XYZ");
List<CassandraRow> cassandraRows = ctable.distinct().collect();
This code ran in non-production with close to 5 million records. Since the target is production, I would like to approach this query with caution. Questions -
What configuration should be present in my SparkConf?
Will the Spark job ever bring down the database because of the large table?
Could running that job starve Cassandra of threads at that moment?
I would recommend using the Dataframe API instead of the RDDs - theoretically, SCC may do more optimizations for that API. If you have a condition on the first clustering column, then this condition should be pushed down by SCC to Cassandra and the filtering will happen there. You can check that by calling .explain on the dataframe and verifying that you have rules marked with * in the PushedFilters part.
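For example, a minimal sketch (in Scala, using the table and columns from the question; a SparkSession named spark is assumed) of the Dataframe version of that query:
import org.apache.spark.sql.cassandra._

val df = spark.read
.cassandraFormat("employeedetails", "keyspacename")
.load()
.filter("companyname = 'XYZ'")
.select("employeeid", "companyname", "swipetimestamp")
.distinct()

// The companyname condition should show up under PushedFilters marked with *
df.explain()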
Regarding config - use the default value for spark.cassandra.input.fetch.size_in_rows: if the value is too high, you have a higher chance of getting timeouts. You can still bring down the nodes even with the default value, as SCC reads with LOCAL_ONE, and that can overload single nodes. Sometimes reading with LOCAL_QUORUM is faster because it doesn't overload individual nodes too much, and it won't restart the tasks that are reading data.
And I recommend making sure that you're using the latest Spark Cassandra Connector - 2.5.0 - it has a lot of new optimizations and new functionality.

Spark-Cassandra connector: how to change collections write behavior

In Java, I have a Spark dataset (Spark Structured Streaming) with a column of type java.util.ArrayList<Short> and I want to write the dataset in a Cassandra table which has a corresponding list<smallint>.
Each time I write the row in Cassandra it updates an existing row, and I want to customize the write behavior of the list in order to control whether
the written list will overwrite the existing list or
the content of the written list will be appended to the content of the list already saved in Cassandra
I found in the spark-cassandra-connector source code a class CollectionBehavior which is extended by both CollectionAppend and CollectionOverwrite. It seems to be exactly what I am looking for, but I didn't find a way to use it while writing to Cassandra.
The dataset is written to Cassandra using:
dataset.write()
.format("org.apache.spark.sql.cassandra")
.option("table", table)
.option("keyspace", keyspace)
.mode(SaveMode.Append)
.save();
Is it possible to change this behavior?
To save to Cassandra collections while setting the save mode for the collection, use the RDD API; the Dataset API seems to be missing this so far. So converting the dataset to an RDD and using the RDD methods to save to Cassandra should give you the behaviour you want.
https://github.com/datastax/spark-cassandra-connector/blob/master/doc/5_saving.md
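For example, a minimal sketch in Scala (the keyspace, table, and column names are hypothetical) of the append behaviour described in the saving documentation above:
import com.datastax.spark.connector._

// An RDD of (primary key, values to add to the list<smallint> column)
val rdd = spark.sparkContext.parallelize(Seq(("key1", List(1.toShort, 2.toShort))))

// "listcol" append tells the connector to append to the existing collection instead of
// overwriting it; overwrite, prepend, add and remove are also available.
rdd.saveToCassandra("my_keyspace", "my_table", SomeColumns("id", "listcol" append))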

Date partition size 10GB read efficiently

We are using Cassandra DataStax 6.0 with Spark enabled. We have 10GB of data coming in every day. All queries are based on date. We have one huge table with 40 columns. We are planning to generate reports using Spark. What is the best way to set up this data, given that we keep getting data every day and store data for around 1 year in one table?
We tried to use different partitions, but most of our keys are based on date.
No code, I just need a suggestion.
Our queries should be fast enough. We have 9 nodes with 256GB RAM and 44-core CPUs.
Having the data organized in daily partitions isn't a very good design - in this case, only RF nodes will be active during the day writing the data, and then again at the time of report generation.
Because you'll be accessing that data only from Spark, you can use the following approach - have some bucket field as the partition key, for example a uniformly generated random number, with a timestamp as a clustering column and maybe another uuid column to guarantee uniqueness of records, something like this:
create table test.sdtest (
b int,
ts timestamp,
uid uuid,
v1 int,
primary key(b, ts, uid));
The maximum value used when generating b should be selected so that partitions are neither too big nor too small, allowing us to read them effectively.
And then we can run Spark code like this:
import org.apache.spark.sql.cassandra._
val data = spark.read.cassandraFormat("sdtest", "test").load()
val filtered = data.filter("ts >= cast('2019-03-10T00:00:00+0000' as timestamp) AND ts < cast('2019-03-11T00:00:00+0000' as timestamp)")
The trick here is that we distribute data across the nodes by using the random partition key, so all the nodes handle the load both while writing the data and while generating the report.
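For the write side, here is a minimal sketch of how b and uid could be populated before saving; the bucket range 0-99 and the name incoming_df are assumptions to be adapted to your data volume:
import org.apache.spark.sql.functions._

// Spread writes across all nodes by assigning each record a random bucket;
// expr("uuid()") assumes a Spark version that provides the uuid() SQL function,
// otherwise generate the uid with a UDF.
val withBucket = incoming_df
.withColumn("b", (rand() * 100).cast("int"))
.withColumn("uid", expr("uuid()"))

withBucket.write.format("org.apache.spark.sql.cassandra")
.options(Map("table" -> "sdtest", "keyspace" -> "test"))
.mode("append")
.save()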
If we look into the physical plan for the read code above (formatted for readability):
== Physical Plan ==
*Scan org.apache.spark.sql.cassandra.CassandraSourceRelation [b#23,ts#24,v1#25]
PushedFilters: [*GreaterThanOrEqual(ts,2019-03-10 00:00:00.0),
*LessThan(ts,2019-03-11 00:00:00.0)], ReadSchema: struct<b:int,ts:timestamp,v1:int>
We can see that both conditions will be pushed to DSE at the CQL level - this means that Spark won't load all the data into memory and filter it; instead, all the filtering will happen in Cassandra and only the necessary data will be returned. And because we're spreading requests between multiple nodes, the reading could be faster (this needs testing) than reading one giant partition. Another benefit of this design is that it's easy to delete old data using Spark, with something like this:
val toDel = sc.cassandraTable("test", "sdtest").where("ts < '2019-08-10T00:00:00+0000'")
toDel.deleteFromCassandra("test", "sdtest", keyColumns = SomeColumns("b", "ts"))
In this case, Spark will perform a very efficient range/row deletion that generates fewer tombstones.
P.S. It's recommended to use DSE's version of the Spark connector as it may have more optimizations.
P.P.S. Theoretically, we could merge ts and uid into one timeuuid column, but I'm not sure that it will work with Dataframes.

How to parallelize RDD work when using the Cassandra Spark connector for data aggregation?

Here is the sample scenario: we have real-time data records in Cassandra, and we want to aggregate the data over different time ranges. I wrote code like the following:
val timeRanges = getTimeRanges(report)
timeRanges.foreach { timeRange =>
val (timestampStart, timestampEnd) = timeRange
val query = _sc.get.cassandraTable(report.keyspace, utilities.Helper.makeStringValid(report.scope)).
where(s"TIMESTAMP > ?", timestampStart).
where(s"VALID_TIMESTAMP <= ?", timestampEnd)
......do the aggregation work....
}
The issue with this code is that the aggregation work for each time range does not run in parallel. My question is: how can I parallelize the aggregation work, given that an RDD can't be used inside another RDD or a Future? Is there any way to parallelize the work, or can't we use the Spark connector here?
Use the joinWithCassandraTable function. This allows you to use the data from one RDD to access C* and pull records, just like in your example.
https://github.com/datastax/spark-cassandra-connector/blob/master/doc/2_loading.md#performing-efficient-joins-with-cassandra-tables-since-12
joinWithCassandraTable utilizes the Java driver to execute a single query for every partition required by the source RDD, so no unneeded data will be requested or serialized. This means a join between any RDD and a Cassandra table can be performed without doing a full table scan. When performed between two Cassandra tables which share the same partition key, this will not require movement of data between machines. In all cases this method will use the source RDD's partitioning and placement for data locality.
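A minimal sketch (in Scala, with hypothetical keyspace, table, and key names; sc is the SparkContext) of what that looks like:
import com.datastax.spark.connector._

// The case class field must match the partition key column of the target table
case class ReportKey(sensor_id: String)

val keys = sc.parallelize(Seq(ReportKey("s1"), ReportKey("s2")))

// One query per key - no full table scan, and no RDD nested inside another RDD
val joined = keys.joinWithCassandraTable("my_keyspace", "readings")
// joined is an RDD of (ReportKey, CassandraRow) pairs that can be aggregated further
joined.take(10).foreach(println)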
Finally, we used union to combine the resulting RDDs, which makes the work run in parallel.
