Spark RDD write to Cassandra - apache-spark

I have a Cassandra table with the below schema:
ColumnA Primary Key
ColumnB Clustering Key
ColumnC
ColumnD
Now, I have a Spark RDD with columns ordered as
RDD[ColumnC, ColumnA, ColumnB, ColumnD]
When I write to the Cassandra table, I need to make sure the ordering is correct, so I have to specify the column ordering using SomeColumns:
rdd.saveToCassandra(keyspace, table, SomeColumns("ColumnA", "ColumnB", "ColumnC", "ColumnD"))
Is there any way I can pass all the column names as a list instead? I am asking because I have around 140 columns in my target table and cannot list all of them in SomeColumns, so I am looking for a cleaner approach.
PS: I cannot write it from a DataFrame; I am looking only for a solution based on RDDs.

You can use the following syntax to expand a sequence into a list of arguments:
SomeColumns(names_as_sequence: _*)
Update:
If you have a sequence of column names as strings, then you need to do:
SomeColumns(names_as_string_seq.map(x => x.as(x)): _*)
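As a concrete sketch (assuming the usual import com.datastax.spark.connector._ is in scope and that rdd, keyspace and table are the values from the question), the column list can be built programmatically instead of typing out all 140 names:
import com.datastax.spark.connector._
// Hard-coded here for illustration; in practice the names could come from table metadata or a config file.
val columnNames: Seq[String] = Seq("ColumnA", "ColumnB", "ColumnC", "ColumnD")
// ColumnName wraps a plain column name as a ColumnRef, so mapping over the names
// and expanding with `: _*` fills SomeColumns' varargs parameter in the right order.
rdd.saveToCassandra(keyspace, table, SomeColumns(columnNames.map(ColumnName(_)): _*))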

Related

Query a Cassandra table according to fields that are not part of the partition key

I have a Cassandra table where a few columns are defined as the clustering key, but I also need to be able to filter on data in the other columns.
So let's say my table consists of the columns A, B, C, D, E, F.
Columns A, B are the clustering key, but I need the WHERE part to include values in E, or F, or both E and F,
so something like
SELECT * FROM My_Table WHERE A='x' AND B='y' AND E='t' AND F='g'
Cassandra will only allow this with the ALLOW FILTERING option, which of course is not good.
What are my options?
It isn't easy to answer your question since you haven't posted the table schema.
If the columns E and F are not clustering keys, an option for you is to index those columns. However, indexes have their own pros and cons depending on the data types and/or data you are storing.
For more info, see When to use an index in Cassandra. Cheers!

PySpark - A more efficient method to count common elements

I have two dataframes, say dfA and dfB.
I want to take their intersection and then count the number of unique user_ids in that intersection.
I've tried the following which is very slow and it crashes a lot:
dfA.join(broadcast(dfB), ['user_id'], how='inner').select('user_id').dropDuplicates().count()
I need to run many such lines in order to get a plot.
How can I perform such a query in an efficient way?
As described in the question, the only relevant part of each dataframe is the user_id column (you describe that you join on user_id and afterwards use only the user_id field).
The source of the performance problem is joining two big dataframes when you need only the distinct values of one column in each dataframe.
In order to improve the performance I'd do the following:
Create two small DFs which hold only the user_id column of each dataframe.
This will dramatically reduce the size of each dataframe, as it will hold only one column (the only relevant one):
dfAuserid = dfA.select("user_id")
dfBuserid = dfB.select("user_id")
Get the distinct values of each dataframe (note: distinct() is equivalent to dropDuplicates()).
This will dramatically reduce the size of each dataframe, as each new dataframe will hold only the distinct values of the user_id column:
dfAuseridDist = dfA.select("user_id").distinct()
dfBuseridDist = dfB.select("user_id").distinct()
Perform the join on the above two minimalist dataframes in order to get the unique values in the intersection
I think you can select the necessary columns first and perform the join afterwards. It should also be beneficial to move the dropDuplicates before the join, since then you get rid of user_ids that appear multiple times in one of the dataframes.
The resulting query could look like:
dfA.select("user_id").join(broadcast(dfB.select("user_id")), ['user_id'], how='inner')\
.select('user_id').dropDuplicates().count()
OR:
dfA.select("user_id").dropDuplicates(["user_id",]).join(broadcast(dfB.select("user_id")\
.dropDuplicates(["user_id",])), ['user_id'], how='inner').select('user_id').count()
OR the version with distinct should work as well.
dfA.select("user_id").distinct().join(broadcast(dfB.select("user_id").distinct()),\
['user_id'], how='inner').select('user_id').count()

Cassandra CQL column slice and read path confusion

I am wondering how column slicing in a CQL WHERE clause affects read performance. Does Cassandra have some optimization that is able to fetch only the specific columns with the value, or does it have to retrieve all the columns of a row and check them one after another? E.g., I have a primary key of (key1, key2), where key2 is the clustering key. I only want to find columns that match a certain key2, say value2.
Cassandra saves the data as cells - each value for a key+column is a cell. If you save several values for the key at once, they will be placed together in the same file. Also, since Cassandra writes to SSTables, you can have several values saved for the same key+column cell in different files, and Cassandra will read all of them and return the last written one, until compaction or repair occurs and the irrelevant values are deleted.
Good article about deletes/reads/tombstones:
http://thelastpickle.com/blog/2016/07/27/about-deletes-and-tombstones.html

RDD JoinWithCassandraTable Join Columns

RDD1.joinWithCassandraTable("keyspace", "Tableabc", SomeColumns("lines"), SomeColumns("col1", "col2", "col3"))
Above is the syntax to join RDD1 with a table in Cassandra, where col1, col2, col3 are the columns used to join with RDD1.
I have a requirement as below.
Tableabc has a column named "lines" which is of datatype "list".
This lines column has 4 fields, like below:
lines:{cola: 22,colb: hello,colc: sri,cold: 123}
Basically it's a JSON object.
Now, if you look at my syntax above, I used SomeColumns("lines"). I am able to get the output as an RDD like below:
(RDD1columns,CassandraRow{lines: [{cola: 22,colb: hello,colc: sri,cold:123}]})
But what I need is to select only "cola"; I don't need all the fields from lines.
Can anyone please help me with this?
JoinWithCassandraTable invokes a direct join on the database and can only be used when joining on partition keys. If you would like to join on a portion of the partition key, you will need to do a full scan.
This means it is impossible to do a JWCT join on any collection columns.
Also, how is lines a list when it has key:value pairs?
If you only want the "cola" value from lines, you can just do a map:
RDD1.joinWithCassandraTable("keyspace", "Tableabc", SomeColumns("lines")).map(_._2.get[Map[String, AnyRef]]("lines").get("cola"))
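If lines actually is a list of UDT values rather than a map (the printed row above suggests a list), a hedged sketch along these lines could keep only the cola field of each element; the column and field types here are assumptions, not taken from your schema:
import com.datastax.spark.connector._
// Assumption: "lines" is list<frozen<some_udt>> and RDD1 is keyed on Tableabc's partition key.
val colaValues = RDD1
  .joinWithCassandraTable("keyspace", "Tableabc", SomeColumns("lines"))
  .map { case (_, row) =>
    // Each element of "lines" is a UDTValue; keep only its "cola" field (assumed to be an int).
    row.get[List[UDTValue]]("lines").map(_.getInt("cola"))
  }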

How is the Spark select-explode idiom implemented?

Assume we have a DataFrame with a string column, col1, and an array column, col2. I was wondering what happens behind the scenes in the Spark operation:
df.select('col1', explode('col2'))
It seems that select takes a sequence of Column objects as input, and explode returns a Column so the types match. But the column returned by explode('col2') is logically of different length than col1, so I was wondering how select knows to "sync" them when constructing its output DataFrame. I tried looking at the Column class for clues but couldn't really find anything.
The answer is simple - there is no such data structure as a Column. While Spark SQL uses columnar storage for caching and can leverage the data layout for some low-level operations, columns are just descriptions of data and transformations, not data containers. So, simplifying things a bit, explode is yet another flatMap on the Dataset[Row].
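To make that concrete, here is a small illustrative sketch (assuming a SparkSession named spark with spark.implicits._ imported; the case class and sample data are made up) showing the select/explode idiom next to a flatMap that yields the same rows:
import org.apache.spark.sql.functions.explode
// Illustrative schema matching the question's col1 (string) and col2 (array).
case class Record(col1: String, col2: Seq[Int])
val df = Seq(Record("a", Seq(1, 2)), Record("b", Seq(3))).toDF()
// The idiom from the question: one output row per array element, with col1 repeated alongside it.
val exploded = df.select($"col1", explode($"col2"))
// Conceptually the same thing as a flatMap over the Dataset: each row expands into one row per element.
val viaFlatMap = df.as[Record].flatMap(r => r.col2.map(v => (r.col1, v)))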
