Remove duplicates without a shuffle in Spark - apache-spark

I have a Cassandra table XYX with columns (
id uuid,
insert timestamp,
header text)
where id and insert form the composite primary key.
I'm using DataFrames, and in my spark shell I'm fetching the id and header columns. I want distinct rows based on the id and header columns.
I'm seeing a lot of shuffles, which should not be the case since the Spark Cassandra connector ensures that all rows for a given Cassandra partition are in the same Spark partition.
After fetching I'm using dropDuplicates to get the distinct records.
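For reference, the flow described above looks roughly like this (a minimal PySpark sketch; the keyspace name "ks" is a placeholder and "xyx" stands in for the table from the question):
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
# Read only the id and header columns from the Cassandra table.
# "ks" is a placeholder keyspace; "xyx" is the table named in the question.
df = (spark.read
      .format("org.apache.spark.sql.cassandra")
      .options(keyspace="ks", table="xyx")
      .load()
      .select("id", "header"))
# dropDuplicates on (id, header) is what introduces the shuffle being asked about.
distinct_df = df.dropDuplicates(["id", "header"])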

The Spark DataFrame API does not support custom partitioners yet, so the connector cannot expose the C* partitioner to the DataFrame engine.
The RDD API, on the other hand, does support custom partitioners, so you can load your data into an RDD and then convert it to a DataFrame.
Here is the connector documentation on C* partitioner usage: https://github.com/datastax/spark-cassandra-connector/blob/master/doc/16_partitioning.md
The keyBy() function allows you to define the key columns to use for grouping.
Here is a working example. It is not short, so I expect someone could improve it:
// load the data into an RDD and define a grouping key
val rdd = sc.cassandraTable[(String, String)]("test", "test")
  .select("id" as "_1", "header" as "_2")
  .keyBy[Tuple1[String]]("id")
// check that the partitioner is CassandraPartitioner, so the groupByKey below does not need a shuffle
rdd.partitioner
// call distinct within each group, flatten, and build a two-column DataFrame
val df = rdd.groupByKey.flatMap { case (key, group) => group.toSeq.distinct }
  .toDF("id", "header")

Related

Difference between Spark dataframe writer partitionBy vs delta create table partition by

What is the exact difference between the Spark DataFrame writer's partitionBy and a Delta/Hive CREATE TABLE ... PARTITIONED BY? Which one will be faster, and why?
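For concreteness, the two forms being compared look roughly like this (a minimal PySpark sketch; the table and column names are made up, and plain Parquet is used so the snippet runs without the Delta package; with Delta you would use .format("delta") or USING DELTA instead):
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
# A tiny illustrative DataFrame; event_id/event_date are made-up columns.
df = spark.createDataFrame(
    [("e1", "2023-01-01"), ("e2", "2023-01-02")],
    ["event_id", "event_date"],
)
# Form 1: partitioning declared by the DataFrameWriter at write time.
df.write \
    .partitionBy("event_date") \
    .format("parquet") \
    .mode("overwrite") \
    .saveAsTable("events_by_writer")
# Form 2: partitioning declared in the table DDL, then populated by inserts.
spark.sql("""
    CREATE TABLE IF NOT EXISTS events_by_ddl (event_id STRING, event_date STRING)
    USING PARQUET
    PARTITIONED BY (event_date)
""")
df.write.mode("append").insertInto("events_by_ddl")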

Spark join: grouping of records having same value for a particular column in the same partition

We have two Hive tables which are read in Spark and joined using a join key, let's call it user_id.
Then we write this joined dataset to S3 and register it in Hive as a third table, so that subsequent tasks can use the joined dataset.
One of the other columns in the joined dataset is called keychain_id.
We want to group all the user records belonging to the same keychain_id in the same partition, in order to avoid shuffles later.
So, can I do a repartition("keychain_id") before writing to S3 and registering it in Hive, and when I read the same data back from this third table, will it still have the same partition grouping (all users belonging to the same keychain_id in the same partition)? I am trying to avoid doing a repartition("keychain_id") every time I read from this third table.
Can you please clarify? If there is no guarantee that it will retain the same partition grouping while reading, is there another efficient way this can be done, other than caching?
If there is no data skew in keychain_id (skew would lead to very different partition file sizes), you can write with partitionBy:
df.write \
    .partitionBy("keychain_id") \
    .mode("overwrite") \
    .format("parquet") \
    .saveAsTable("testing")
Update:
In order to retain the grouping of user records having the same keychain_id in the same DataFrame partition, you could repartition beforehand, on the number of distinct ids and/or the column itself:
from pyspark.sql import functions as F
n = df.select(F.col("keychain_id")).distinct().count()
df.repartition(n, F.col("keychain_id")) \
    .write \
    .partitionBy("keychain_id") \
    .mode("overwrite") \
    .format("parquet") \
    .saveAsTable("testing")
or
df.repartition(n) \
    .write \
    .partitionBy("keychain_id") \
    .mode("overwrite") \
    .format("parquet") \
    .saveAsTable("testing")
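As a usage note, once the table is written with partitionBy("keychain_id"), a query that filters on keychain_id only has to scan the matching directory. A small sketch, assuming the table name "testing" from above and a made-up key value:
from pyspark.sql import SparkSession, functions as F
spark = SparkSession.builder.getOrCreate()
# Filtering on the partition column lets Spark prune directories, so only the
# files for that keychain_id are read; "k-123" is a made-up example value.
users = spark.read.table("testing").where(F.col("keychain_id") == "k-123")
users.explain()  # the physical plan should show PartitionFilters on keychain_id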

Spark: Record count mismatch

I am quite confused because I am facing a weird situation.
My Spark application reads data from an Oracle database and loads it into a DataFrame using this instruction:
private val df = spark.read.jdbc(
  url = [the jdbc url],
  table = "(" + [the query] + ") qry",
  properties = [the oracle driver]
)
Then I save the number of records in this DataFrame in a variable:
val records = df.count()
Then I create a Hive table ([my table]) with the DataFrame schema, and I dump the content of the DataFrame into it:
df.write
.mode(SaveMode.Append)
.insertInto([my hive db].[my table])
Well, here is where I am lost: when I perform a select count(*) on the Hive table the DataFrame is loaded into, "sometimes" there are a few more records in Hive than in the records variable.
Can you think of what could be the source of this mismatch?
*Related to the possible duplicate: my question is different. I am not counting my DataFrame many times and getting different values. I count the records on my DataFrame once, I dump the DataFrame into Hive, and I count the records in the Hive table, and sometimes there are more in Hive than in my count.*
Thank you very much in advance.

Transfer time series data from PySpark to Cassandra

I have a Spark cluster and a Cassandra cluster. In PySpark I read a CSV file and then transform it into an RDD. I then go through every row in my RDD and use a mapper and reducer function. I end up getting the following output (I've made this list short for demonstration purposes):
[(u'20170115', u'JM', u'COP'), (u'20170115', u'JM', u'GOV'), (u'20170115', u'BM', u'REB'), (u'20170115', u'OC', u'POL'), (u'20170114', u'BA', u'EDU')]
I want to go through each row in the array above and store each tuple in one table in Cassandra. I want the unique key to be the date. Now I know that I can turn this array into a DataFrame and then store it in Cassandra (https://github.com/datastax/spark-cassandra-connector/blob/master/doc/15_python.md#saving-a-dataframe-in-python-to-cassandra). If I turn the list into a DataFrame and then store it in Cassandra, will Cassandra still be able to handle it? I guess I'm not fully understanding how Cassandra stores values. In my array the dates are repeated, but the other values are different.
What is the best way for me to store the data above in Cassandra? Is there a way for me to store data directly from Spark to Cassandra using Python?
Earlier versions of DSE 4.x supported RDDs, but the current connector for DSE and open source Cassandra is "limited to DataFrame only operations"; see the connector documentation section "PySpark with Data Frames".
You stated "I want the unique key to be the date". I assume you mean the partition key, since the date is not unique in your example. It's OK to use the date as the partition key (assuming the partitions will not be too large), but your primary key needs to be unique.
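A minimal sketch of the DataFrame route, assuming a keyspace "ks" and a table "events" already exist with the date as the partition key and the other two columns as clustering columns (so the full primary key stays unique even though dates repeat); the keyspace, table, and column names here are made up for illustration:
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
rows = [
    ("20170115", "JM", "COP"),
    ("20170115", "JM", "GOV"),
    ("20170114", "BA", "EDU"),
]
# Turn the reduced tuples into a DataFrame with named columns.
df = spark.createDataFrame(rows, ["event_date", "source", "category"])
# Write through the Spark Cassandra connector; this requires the connector
# package on the classpath and the keyspace/table to exist already.
df.write \
    .format("org.apache.spark.sql.cassandra") \
    .options(keyspace="ks", table="events") \
    .mode("append") \
    .save()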

Spark-Hive partitioning

The Hive table was created with 4 buckets:
CREATE TABLE IF NOT EXISTS hourlysuspect ( cells int, sms_in int) partitioned by (traffic_date_hour string) stored as ORC into 4 buckets
The following lines in the spark code insert data into this table
hourlies.write.partitionBy("traffic_date_hour").insertInto("hourly_suspect")
and in the spark-defaults.conf, the number of parallel processes is 128
spark.default.parallelism=128
The problem is that when the inserts happen into the Hive table, it has 128 partitions instead of 4 buckets.
defaultParallelism cannot be reduced to 4, as that leads to a very slow system. I have also tried the DataFrame.coalesce method, but that makes the inserts too slow.
Is there any other way to force the number of buckets to be 4 when the data is inserted into the table?
As of today (Spark 2.2.0), Spark does not natively support writing to bucketed Hive tables via spark-sql. When creating a bucketed table, there should be a CLUSTERED BY clause on one of the columns from the table schema; I don't see that in the CREATE TABLE statement you posted. Assuming it does exist and you know the clustering column, you could use the
.bucketBy([colName])
method of the DataFrameWriter API.
More details for Spark 2.0+: https://spark.apache.org/docs/2.0.0/api/java/org/apache/spark/sql/DataFrameWriter.html
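A hedged sketch of that route, shown in PySpark (bucketBy is available on the Python DataFrameWriter from Spark 2.3), assuming "cells" is the clustering column; note that saveAsTable produces a Spark-bucketed table rather than a Hive-style bucketed one:
from pyspark.sql import SparkSession
spark = SparkSession.builder.enableHiveSupport().getOrCreate()
# A tiny stand-in for the hourlies DataFrame, matching the table's columns.
hourlies = spark.createDataFrame(
    [(10, 5, "2016010112"), (7, 3, "2016010113")],
    ["cells", "sms_in", "traffic_date_hour"],
)
# Write a partitioned, 4-bucket, Spark-managed table in place of the Hive DDL one.
# The bucketing column "cells" is only an assumption; use whichever column the
# table should be clustered by.
hourlies.write \
    .partitionBy("traffic_date_hour") \
    .bucketBy(4, "cells") \
    .sortBy("cells") \
    .format("orc") \
    .mode("overwrite") \
    .saveAsTable("hourlysuspect")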
