RDD JoinWithCassandraTable Join Columns - apache-spark

RDD1.joinWithCassandraTable("keyspace", "Tableabc", SomeColumns("lines"), SomeColumns("col1", "col2", "col3"))
Above is the syntax to join RDD1 with a table in Cassandra, where col1, col2, col3 are the columns used to join with RDD1.
I have a requirement as below.
Tableabc has a column named "lines", which is of datatype "list".
This lines column has 4 fields, like below:
lines:{cola: 22,colb: hello,colc: sri,cold: 123}
Basically it's a JSON object.
Now, if you see my syntax above, I used SomeColumns("lines"), and I am able to get the output as an RDD like below:
(RDD1columns,CassandraRow{lines: [{cola: 22,colb: hello,colc: sri,cold:123}]})
But what I need is to select only "cola"; I don't need all the fields from lines.
Can anyone please help me with this?

JoinWithCassandraTable invokes a direct join on the database and can only be used when joining on Partition Keys. If you would like to join on a portion of the partition key you will need to do a full scan.
This means it is impossible to do a JWCT on any collection columns.
Also, how is lines a list when it has key:value pairs?
If you only want the "cola" value from lines, you can just do a map:
RDD1.joinWithCassandraTable("keyspace", "Tableabc").map(_._2.get[Map[String, AnyRef]]("lines").get("cola"))
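Since the sample output shows lines coming back as a list containing a single map, here is a minimal sketch that also handles that shape (keyspace and table names are taken from the question; treating the list elements as Map[String, AnyRef] is an assumption and may need adjusting if lines is actually a UDT):
import com.datastax.spark.connector._   // brings joinWithCassandraTable into scope

val colaOnly = RDD1
  .joinWithCassandraTable("keyspace", "Tableabc")
  .map { case (_, row) =>
    // "lines" arrives as a collection; pull only the "cola" entry out of each element
    row.getList[Map[String, AnyRef]]("lines").flatMap(_.get("cola"))
  }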

Related

Spark RDD write to Cassandra

I have the below Cassandra table schema.
ColumnA Primary Key
ColumnB Clustering Key
ColumnC
ColumnD
Now, I have a Spark RDD with columns ordered as
RDD[ColumnC, ColumnA, ColumnB, ColumnD]
So, when writing to the Cassandra table, I need to make sure the ordering is correct, and I have to specify the column ordering using SomeColumns:
rdd.saveToCassandra(keyspace, table, SomeColumns("ColumnA", "ColumnB", "ColumnC", "ColumnD"))
Is there any way I can pass all the column names as a list instead? I am asking because I have around 140 columns in my target table and cannot give all the names as part of SomeColumns, so I am looking for a cleaner approach.
PS: I cannot write it from a DataFrame; I am looking only for a solution based on RDDs.
You can use the following syntax to expand a sequence into a list of arguments:
SomeColumns(names_as_sequence: _*)
Update:
If you have a sequence of column names as strings, then you need to do:
SomeColumns(names_as_string_seq.map(x => x.as(x)): _*)
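For example, assuming the column names are available as a plain Seq[String] (the names below are placeholders for the ~140 real column names, e.g. loaded from a config file):
import com.datastax.spark.connector._   // SomeColumns plus the implicit string-to-column conversions

// Placeholder list: in practice this holds every target column name, ordered to match the RDD's fields.
val columnNames: Seq[String] = Seq("ColumnA", "ColumnB", "ColumnC", "ColumnD")

// Expand the sequence into the varargs expected by SomeColumns.
rdd.saveToCassandra(keyspace, table, SomeColumns(columnNames.map(name => name.as(name)): _*))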

PySpark - A more efficient method to count common elements

I have two dataframes, say dfA and dfB.
I want to take their intersection and then count the number of unique user_ids in that intersection.
I've tried the following, which is very slow and crashes a lot:
dfA.join(broadcast(dfB), ['user_id'], how='inner').select('user_id').dropDuplicates().count()
I need to run many such lines in order to get a plot.
How can I perform such a query in an efficient way?
As described in the question, the only relevant part of the dataframes is the user_id column (in your question you describe that you join on user_id and afterwards use only the user_id field).
The source of the performance problem is joining two big dataframes when you need only the distinct values of one column in each dataframe.
In order to improve the performance I'd do the following:
Create two small DFs which will hold only the user_id column of each dataframe.
This will dramatically reduce the size of each dataframe, as it will hold only one column (the only relevant one).
dfAuserid = dfA.select("user_id")
dfBuserid = dfB.select("user_id")
Get the distinct values of each dataframe (note: distinct() is equivalent to dropDuplicates()).
This will dramatically reduce the size of each dataframe, as each new dataframe will hold only the distinct values of the user_id column.
dfAuseridDist = dfA.select("user_id").distinct()
dfBuseridDist = dfB.select("user_id").distinct()
Perform the join on the above two minimal dataframes in order to get the unique values in the intersection.
I think you can select the necessary columns first and perform the join afterwards. It should also be beneficial to move the dropDuplicates before the join as well, since then you get rid of user_ids that appear multiple times in one of the dataframes.
The resulting query could look like:
dfA.select("user_id").join(broadcast(dfB.select("user_id")), ['user_id'], how='inner')\
.select('user_id').dropDuplicates().count()
OR:
dfA.select("user_id").dropDuplicates(["user_id",]).join(broadcast(dfB.select("user_id")\
.dropDuplicates(["user_id",])), ['user_id'], how='inner').select('user_id').count()
OR the version with distinct should work as well.
dfA.select("user_id").distinct().join(broadcast(dfB.select("user_id").distinct()),\
['user_id'], how='inner').select('user_id').count()

Field identification, during except() operation in Spark

The except() function in Spark compares two dataframes and returns the non-matching records from the first dataframe.
However, I would also like to track the details of the fields which are not matching. How can I do this in Spark? Please help.
As mentioned, except() will give you complete row mismatches. So I would suggest using a leftanti join instead of except(), with a join key or keys as the condition; you can take the primary or composite key. This will give you the row mismatches w.r.t. those keys. Then you need to write one more query for rows where the keys matched (i.e. the intersection) but there are mismatches in the other columns: write an inner join on the keys and check table1.colA != table2.colA, and so on for all fields, in a case condition, as sketched below.
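A minimal sketch of that two-step approach, assuming two dataframes df1 and df2 joined on a key column id, with data columns colA and colB (all of these names are illustrative):
import org.apache.spark.sql.functions.{col, lit, when}

// Step 1: rows of df1 whose key does not exist in df2 at all (whole-row mismatch).
val missingRows = df1.join(df2, Seq("id"), "leftanti")

// Step 2: keys present in both dataframes but with differing values in some field.
// The when/otherwise columns report exactly which fields disagree.
val fieldMismatches = df1.alias("t1")
  .join(df2.alias("t2"), col("t1.id") === col("t2.id"), "inner")
  .where(col("t1.colA") =!= col("t2.colA") || col("t1.colB") =!= col("t2.colB"))
  .select(
    col("t1.id"),
    when(col("t1.colA") =!= col("t2.colA"), lit("mismatch")).otherwise(lit("ok")).as("colA_check"),
    when(col("t1.colB") =!= col("t2.colB"), lit("mismatch")).otherwise(lit("ok")).as("colB_check")
  )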

Pyspark: filter DataFrame where column value equals some value in list of Row objects

I have a list of pyspark.sql.Row objects as follows:
[Row(artist=1255340), Row(artist=942), Row(artist=378), Row(artist=1180), Row(artist=813)]
From a DataFrame having schema (id, name), I want to filter rows where id equals some artist in the given list of Rows. What is the correct way to go about it?
To clarify further, I want to do something like: select row from dataframe where row.id is in list_of_row_objects
The main question is how big list_of_row_objects is. If it is small, then the link provided by #Karthik Ravindra will do.
If it is big, then you can instead use a dataframe_of_row_objects: do an inner join between your dataframe and dataframe_of_row_objects, matching the artist column in dataframe_of_row_objects against the id column in your original dataframe. This would basically remove any id not in dataframe_of_row_objects.
Of course, using a join is slower, but it is more flexible. For lists which are not small but are still small enough to fit into memory, you can use the broadcast hint to still get better performance.
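To make both options concrete, a small sketch (variable and column names follow the question; an active SparkSession named spark is assumed, and where the cutoff for "small" lies is a judgment call):
from pyspark.sql.functions import broadcast, col

# Option 1: small list -- collect the artist ids locally and use isin().
artist_ids = [row.artist for row in list_of_row_objects]
small_result = df.filter(col("id").isin(artist_ids))

# Option 2: large list -- turn the rows into a dataframe and join.
dataframe_of_row_objects = spark.createDataFrame(list_of_row_objects)
big_result = df.join(
    broadcast(dataframe_of_row_objects),   # broadcast hint: fine as long as it fits in memory
    df.id == dataframe_of_row_objects.artist,
    "inner",
).select(df["id"], df["name"])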

How is the Spark select-explode idiom implemented?

Assume we have a DataFrame with a string column, col1, and an array column, col2. I was wondering what happens behind the scenes in the Spark operation:
df.select('col1', explode('col2'))
It seems that select takes a sequence of Column objects as input, and explode returns a Column so the types match. But the column returned by explode('col2') is logically of different length than col1, so I was wondering how select knows to "sync" them when constructing its output DataFrame. I tried looking at the Column class for clues but couldn't really find anything.
The answer is simple: there is no such data structure as a Column. While Spark SQL uses columnar storage for caching and can leverage data layout for some low-level operations, columns are just descriptions of data and transformations, not data containers. So, simplifying things a bit, explode is yet another flatMap on the Dataset[Row].
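To make the flatMap analogy concrete, a rough PySpark sketch (df, col1, col2 as in the question; this illustrates the semantics only, not the actual implementation, which goes through a Catalyst Generate operator):
from pyspark.sql import Row
from pyspark.sql.functions import explode

# The built-in way: each element of col2 becomes its own row, col1 is repeated.
exploded = df.select("col1", explode("col2").alias("col"))

# Roughly the same thing expressed as a flatMap over rows: one input Row fans out
# into zero or more output Rows, so no "syncing" of column lengths is needed.
flat_mapped = df.rdd.flatMap(
    lambda row: [Row(col1=row.col1, col=item) for item in row.col2]
).toDF()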
