What is the most efficient way to append incremental updates in Spark SQL in Scala?
I have an archived employee dataframe E1 with primary key EmpId.
I also have the latest employee dataframe and want to write only the updated, new, and deleted rows back to the archival dataframe.
For example:
Employee archived:
EmpId  EmpName
1      Tom
2      Harry
Employee recent:
EmpId  EmpName
2      Harry Lewis
3      Hermoine
Difference should return:
EmpId  EmpName      deleted
1      Tom          yes
2      Harry Lewis  no
3      Hermoine     no
If you only wanted to find updated or new rows, you could use except; however, since deleted rows should also be present, it is a bit more complicated. Assuming E1 is the archived employee dataframe and E2 is the recent one, you can use a full outer join in Scala as follows:
import org.apache.spark.sql.functions.{coalesce, when}
import spark.implicits._  // for the $-notation

val diff = E1.withColumnRenamed("EmpName", "EmpNameOld")
  .join(E2, Seq("EmpId"), "fullouter")
  .where($"EmpName".isNull || $"EmpNameOld".isNull || $"EmpName" =!= $"EmpNameOld")  // keep only changed, new, or deleted rows
  .withColumn("deleted", when($"EmpName".isNull, "yes").otherwise("no"))
  .withColumn("EmpName", coalesce($"EmpName", $"EmpNameOld"))  // for deleted rows, fall back to the archived name
  .drop("EmpNameOld")
This will give you the desired result, containing updated, new, and deleted rows.
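To sanity-check it, here is a minimal, self-contained sketch that builds the example dataframes from the question; the local SparkSession is only an assumption for running it standalone:
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("archive-diff").master("local[*]").getOrCreate()
import spark.implicits._

// the example data from the question
val E1 = Seq((1, "Tom"), (2, "Harry")).toDF("EmpId", "EmpName")             // archived
val E2 = Seq((2, "Harry Lewis"), (3, "Hermoine")).toDF("EmpId", "EmpName")  // recent

// running the snippet above on these frames then gives:
diff.show()
// +-----+-----------+-------+
// |EmpId|    EmpName|deleted|
// +-----+-----------+-------+
// |    1|        Tom|    yes|
// |    2|Harry Lewis|     no|
// |    3|   Hermoine|     no|
// +-----+-----------+-------+
// (row order may vary)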
Related
I have one table called A which has a column called id. I have another table called B which has a column called a_id that refers to table A's id column (a foreign key). I want to join these two tables on this column, but both tables are quite large (table A is ~1 TB and table B is ~4.2 TB). So my plan is to reduce shuffling as much as possible to increase efficiency, and for that I am trying to partition the dataframes so that rows with the same value in these two columns end up in the same partition, as in the following example.
Example tables:
A:
id category
1 tennis
2 cricket
B:
name a_id
Roger 1
Novak 1
Lara 2
Ricky 2
What I want to achieve after partitioning:
Partition 1:
A:
id category
1 tennis
B:
name a_id
Roger 1
Novak 1
Partition 2:
A:
id category
2 cricket
B:
name a_id
Lara 2
Ricky 2
I was trying to achieve this by repartitioning A on column id and repartitioning B on column a_id after loading the tables, roughly as in the sketch below. I checked this on a small test dataset and it seemed to produce what I want.
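A minimal Scala sketch of that repartition-then-join attempt (table and column names are from the example; the partition count and SparkSession are assumptions for illustration):
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("copartitioned-join").getOrCreate()
import spark.implicits._

val A = spark.table("A")  // ~1 TB
val B = spark.table("B")  // ~4.2 TB

// hash-partition both sides on the join key with the same number of partitions
val numPartitions = 2000  // assumption: tune to the data size and cluster
val aPart = A.repartition(numPartitions, $"id")
val bPart = B.repartition(numPartitions, $"a_id")

val joined = aPart.join(bPart, aPart("id") === bPart("a_id"))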
But on the full dataset the Spark job does a lot of shuffling during the join and gets stuck. So, what is the most efficient way of doing this?
I have the following table:
DEST_COUNTRY_NAME ORIGIN_COUNTRY_NAME count
United States Romania 15
United States Croatia 1
United States Ireland 344
Egypt United States 15
The table is represented as a Dataset.
scala> dataDS
res187: org.apache.spark.sql.Dataset[FlightData] = [DEST_COUNTRY_NAME: string, ORIGIN_COUNTRY_NAME: string ... 1 more field]
I want to sort the table based on the count column and see only the count column. I have done it, but in 2 steps:
1- I first sort to get a sorted DS: dataDS.sort(col("count").desc)
2- then select on that DS: (dataDS.sort(col("count").desc)).select(col("count")).show();
The above feels like an embedded SQL query to me. In SQL, however, I can do the same thing without a nested query:
select * from flight_data_2015 ORDER BY count ASC
Is there a better way for me to both sort and select without creating a new Dataset?
There is nothing wrong with:
(dataDS.sort(col("count").desc)).select(col("count")).show();
It is the right thing to do and has no negative performance implications beyond the intrinsic cost of the sort itself.
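If it helps to see it, here is a small sketch reusing dataDS and col from the question: the chained calls describe a single query plan, not a nested query, and Catalyst optimizes them as one unit.
import org.apache.spark.sql.functions.col

// explain() prints one optimized plan for the whole chain
dataDS.sort(col("count").desc)
  .select(col("count"))
  .explain()

// the calls can also be flipped without changing the result,
// since the sort column is the one being selected
dataDS.select(col("count"))
  .orderBy(col("count").desc)
  .show()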
Use it freely and don't worry about it anymore.
I'm looking for an "already implemented alternative" to append a unique ID to a Spark dataset.
My scenario:
I have an incremental job that runs each day processing a batch of information. In this job, I create a dimension table of something and assign unique IDs to each row using monotonically_increasing_id(). On the next day, I want to append some rows to that something table and want to generate unique IDs for those rows.
Example:
day 1:
something_table
uniqueId name
100001 A
100002 B
day 2:
something_table
uniqueId name
100001 A
100002 B
100003 C -- new data that must be created on day 2
Code snippet for day 1:
import org.apache.spark.sql.Dataset
import org.apache.spark.sql.functions.monotonically_increasing_id
import spark.implicits._

case class BasicSomething(name: String)
case class SomethingTable(uniqueId: Long, name: String)  // field renamed from id to match the uniqueId column

val ds: Dataset[BasicSomething] = spark.createDataset(Seq(BasicSomething("A"), BasicSomething("B")))
ds.withColumn("uniqueId", monotonically_increasing_id())
  .as[SomethingTable]
  .write.csv("something")
I have no idea how to keep state for monotonically_increasing_id() so that, on the next day, it knows the existing ids in something_table.
You can always get the last (highest) uniqueId of the dataset you have created. Thus you can add that uniqueId to monotonically_increasing_id() to create the new uniqueIds:
ds.withColumn("uniqueId", monotonically_increasing_id() + lastUniqueId + 1)  // + 1 so the new ids start above the previous maximum
I have a large and complex DataFrame with nested structures in Spark 2.1.0 (pySpark) and I want to add an ID column to it. The way I did it was to add a column like this:
df= df.selectExpr('*','row_number() OVER (PARTITION BY File ORDER BY NULL) AS ID')
So it goes e.g. from this:
File A B
a.txt valA1 [valB11,valB12]
a.txt valA2 [valB21,valB22]
to this:
File A B ID
a.txt valA1 [valB11,valB12] 1
a.txt valA2 [valB21,valB22] 2
After I add this column, I don't immediately trigger a materialization in Spark, but I first branch the DataFrame to a new variable:
dfOutput = df.select('A','ID')
with only columns A and ID and I write dfOutput to Hive, so I get e.g. Table1:
A ID
valA1 1
valA2 2
So far so good. Then I continue using df for further transformations, namely I explode some of the nested arrays in the columns and drop the original, like this:
df = df.withColumn('Bexpl',explode('B')).drop('B')
and I get this:
File A Bexpl ID
a.txt valA1 valB11 1
a.txt valA1 valB12 1
a.txt valA2 valB21 2
a.txt valA2 valB22 2
and output other tables from it, sometimes after creating a second ID column since there are more rows from the exploded arrays. E.g. I create Table2:
df= df.selectExpr('*','row_number() OVER (PARTITION BY File ORDER BY NULL) AS ID2')
to get:
File A Bexpl ID ID2
a.txt valA1 valB11 1 1
a.txt valA1 valB12 1 2
a.txt valA2 valB21 2 3
a.txt valA2 valB22 2 4
and output as earlier:
dfOutput2 = df.select('Bexpl','ID','ID2')
to get:
Bexpl ID ID2
valB11 1 1
valB12 1 2
valB21 2 3
valB22 2 4
I would expect that the values of the first ID column remain the same and match the data for each row from the point that this column was created. This would allow me to keep a relation between Table1 created from dfOutput and subsequent tables from df, like dfOutput2 and the resulting Table2.
The problem is that ID and ID2 are not as they should be in the example above, but mixed up, and I'm trying to find out why. My guess is that the values of the first ID column are not deterministic because df is not materialized before branching to dfOutput. So when the data is actually materialized when saving the table from dfOutput, the rows are shuffled and IDs are different from the data that is materialized on a later point from df, as in dfOutput2. I am however not sure, so my questions are:
Is my assumption correct, that IDs are generated differently for the different branches although I add the ID column before branching?
Would materializing the DataFrame before branching to dfOutput, e.g. through df.cache().count(), ensure a fixed ID column which I can later branch however I want from df, so that I can use this as a checkpoint?
If not, how can I solve this?
I would appreciate any help, or at least a quick confirmation, because I can't test it properly: Spark would shuffle the data only if it doesn't have enough memory, and reaching that point would mean loading a lot of data, which takes a long time and may still produce coincidentally good results (I already tried with smaller datasets).
I don't fully understand the problem, but let me address some points:
With a window definition such as:
(PARTITION BY ... ORDER BY NULL)
the order of values within the window is effectively random. There is no reason for it to be stable.
Spark would shuffle the data only if it doesn't have enough memory,
Spark will shuffle every time the execution plan requires it (grouping, window functions). You are confusing this with disk spills, which don't induce randomness.
Would materializing the DataFrame before branching to dfOutput, e.g. through df.cache().count()
No. You cannot depend on caching for correctness.
I have two files, customer and sales, like below.
Customer:
cu_id name region city state
1 Rahul ME Vizag AP
2 Raghu SE HYD TS
3 Rohith ME BNLR KA
Sales:
sa_id sales country
2 100000 IND
3 230000 USA
4 240000 UK
Both files are tab (\t) delimited.
I want to join both files based on cu_id from customer and sa_id from sales, using PySpark without Spark SQL / DataFrames.
Your help is very much appreciated.
You can definitely use the join methods that Spark offers for working with RDDs.
You can do something like:
# key each row by its first tab-separated field (cu_id / sa_id) and keep the rest as the value
customerRDD = sc.textFile("customers.tsv").map(lambda row: (row.split('\t')[0], "\t".join(row.split('\t')[1:])))
salesRDD = sc.textFile("sales.tsv").map(lambda row: (row.split('\t')[0], "\t".join(row.split('\t')[1:])))

# inner join on the key, yielding (key, (customer_fields, sales_fields)) pairs
joinedRDD = customerRDD.join(salesRDD)
And you will get a new RDD that contains only the joined records from the customer and sales files.