How to join two data files in HDFS with Spark? - apache-spark

I have two datasets which are already partitioned using the same partitioner and stored in HDFS. These datasets are the output of two different Spark jobs over which we have no control. Now I want to join these two datasets to produce different information.
Example:
Data Set 1:
ORDER_ID CUSTOMER_ID ITEMS
OD1 C1 1,2,3 -> partition 0
OD2 C2 3,4,5 -> partition 0
OD3 C4 1,2,3 -> partition 1
OD4 C3 1,3 -> partition 1
Data Set 2:
ORDER_ID CUSTOMER_ID REFUND_ITEMS
OD1 C1 1 -> partition 0
OD2 C2 5 -> partition 0
OD3 C4 2,3 -> partition 1
OD4 C3 3 -> partition 1
Options are:
1) Create two RDDs from the datasets and join them.
2) Create one RDD using one of the datasets.
-> For each partition in the RDD, get the actual partition id, i.e. OD1 -> 0, OD3 -> 1 (using some custom logic)
-> Load the data of that partition of dataset 2 from HDFS
-> Iterate over both datasets and produce the combined result.
For option 2, I don't know how to read a specific file from HDFS in the Spark executor. (I have the full URI of the file's location.)

You can try creating two dataframes and joining them using SQL. Please find the code below.
// For implicit conversions from RDDs to DataFrames
import spark.implicits._
// Same layout for both files; for file2 the third field holds REFUND_ITEMS
case class struc_dataset(ORDER_ID: String, CUSTOMER_ID: String, ITEMS: String)
//Read file1
val File1DF = spark.sparkContext
  .textFile("temp/src/file1.txt")
  .map(_.split("\t"))
  .map(attributes => struc_dataset(attributes(0), attributes(1), attributes(2)))
  .toDF()
//Register as Temp view - Dataset1
File1DF.createOrReplaceTempView("Dataset1")
//Read file2
val File2DF = spark.sparkContext
  .textFile("temp/src/file2.txt")
  .map(_.split("\t"))
  .map(attributes => struc_dataset(attributes(0), attributes(1), attributes(2)))
  .toDF()
//Register as Temp view - Dataset2
File2DF.createOrReplaceTempView("Dataset2")
// SQL statement to create the final dataframe (JOIN)
val finalDF = spark.sql("SELECT * FROM Dataset1 ds1 JOIN Dataset2 ds2 ON ds1.ORDER_ID = ds2.ORDER_ID AND ds1.CUSTOMER_ID = ds2.CUSTOMER_ID")
finalDF.show()
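For option 2 specifically, a specific HDFS file can be opened from inside an executor with the Hadoop FileSystem API. Below is only a rough sketch; the paths, the part-file naming and the combine step are placeholders, not taken from the question.
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}
import scala.io.Source
val ds1 = spark.sparkContext.textFile("hdfs:///data/dataset1") // hypothetical location of dataset 1
val combined = ds1.mapPartitionsWithIndex { (partitionId, rows) =>
  // open the matching dataset 2 file for this partition directly on the executor
  val path = new Path(s"hdfs:///data/dataset2/part-$partitionId") // hypothetical naming scheme
  val fs = FileSystem.get(path.toUri, new Configuration())
  val ds2Rows = Source.fromInputStream(fs.open(path)).getLines().toList
  // ...combine `rows` with `ds2Rows` using whatever join logic is needed...
  rows.map(r => (r, ds2Rows.size)) // placeholder combine step
}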

Related

Try to avoid shuffle by manual control of table read per executor

I have:
really huge (let's say 100s of TB) Iceberg table B which is partitioned by main_col, truncate[N, stamp]
small table S with columns main_col, stamp_as_key
I want to get a dataframe (actually table) with logic:
b = spark.read.table(B)
s = spark.read.table(S)
df = b.join(F.broadcast(s), (b.main_col == s.main_col) & (b.stamp >= s.stamp_as_key - W0) & (b.stamp <= s.stamp_as_key + W0))
df = df.groupby('main_col', 'stamp_as_key').agg(make_some_transformations)
I want to avoid a shuffle when reading table B. Iceberg has metadata tables describing all the parquet files in a table and their content. What should be possible to do (a rough sketch follows the list):
read only the metadata table of table B
join it with table S
repartition by the expected columns
collect the s3 paths of the real B data files
read these files from the executors independently.
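A rough Scala sketch of these steps (kept in Scala to match the other snippets on this page; the metadata table name, catalog layout and partition field names are assumptions, and the stamp/W0 pruning is left out):
import org.apache.spark.sql.functions._
import spark.implicits._
// 1. read only the Iceberg files metadata table of B (exposed as <table>.files)
val bFiles = spark.read.table("db.B.files")
  .select(col("file_path"), col("partition.main_col").as("main_col")) // assumed partition field name
// 2. join it with the small table S to keep only the partitions that can match
val s = spark.read.table("db.S").select("main_col").distinct()
val wantedPaths = bFiles
  .join(broadcast(s), Seq("main_col"))
  .select("file_path")
  .as[String]
  .collect()
// 3. read just those data files; the pruning itself needs no shuffle of B,
//    although the later join/groupBy can still introduce one
val bPruned = spark.read.parquet(wantedPaths: _*)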
Is there a better way to make this work? Also, I can change the schema of table B if needed, but main_col should stay as the first partition column.
One more question: suppose I have such a dataframe and I saved it as a table. I need to join such tables efficiently. Am I correct that this is also impossible to do without a shuffle using classic Spark code?

How is the number of partitions for an inner join calculated in Spark?

We have two dataframes, df_A and df_B.
df_A.rdd.getNumPartitions() # => 9
df_B.rdd.getNumPartitions() # => 160
df_A.createOrReplaceTempView('table_A')
df_B.createOrReplaceTempView('table_B')
After creating the joined dataframe via Spark SQL,
df_C = spark.sql("""
select *
from table_A inner join table_B on (...)
""")
df_C.rdd.getNumPartitions() # => 160
How does Spark calculate and use the partition counts of the two joined dataframes?
Shouldn't the number of partitions of the joined dataframe be 9 * 160 = 1440?
By default, Spark uses 200 partitions when shuffling data for joins or aggregations. You can change the value of spark.sql.shuffle.partitions to increase or decrease the number of partitions used in the join; in no case is the result the product of the two inputs' partition counts.
https://spark.apache.org/docs/latest/sql-performance-tuning.html
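For example (a quick check in a Scala shell; the table and join key names are placeholders, and with adaptive query execution the final count can still be coalesced):
// number of shuffle partitions used by joins and aggregations (default 200)
spark.conf.set("spark.sql.shuffle.partitions", "400")
val dfC = spark.sql("select * from table_A a inner join table_B b on a.id = b.id")
dfC.rdd.getNumPartitions // follows spark.sql.shuffle.partitions (400 here), not 9 * 160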

Spark filter pushdown with multiple values in subquery

I have a small table adm with one column x that contains only 10 rows. Now I want to filter another table big that is partitioned by y with the values from adm using partition pruning.
While here
select * from big b
where b.y = ( select max(a.x) from adm a)
the partition filter pushdown works, but unfortunately this:
select * from big b
where b.y IN (select a.x from adm a )
results in a broadcast join between a and b
How can the subquery be pushed down as a partition filter even when I use IN?
This is happening because the result of your subquery is itself an RDD, so Spark deals with it in a truly distributed fashion -- via broadcast and join -- as it would with any other column, not necessarily a partition column.
To work around this, you will need to execute the subquery separately, collect the result and format it into a value usable in an IN clause.
scala> import org.apache.spark.sql.Encoders
scala> import scala.collection.JavaConverters._ // for .asScala
scala> import spark.implicits._ // encoder for the .map below
scala> val ax = spark.sql("select a.x from adm a")
scala> val inclause = ax.as(Encoders.STRING).map(x => "'"+x+"'").collectAsList().asScala.mkString(",")
scala> spark.sql("select * from big b where b.y IN (" + inclause + ")")
(This assumes x and y are strings.)

Updating static source based on Kafka Stream using Spark Streaming?

I am using spark-sql 2.4.1 with Java 8.
I have a scenario where I have some metadata in dataset1, which is loaded from an HDFS Parquet file.
And I have another dataset2 which is read from a Kafka Stream.
For each record from dataset2, I need to check whether its columnX value is present in dataset1.
If it is there in dataset1, then I need to replace the columnX value with the column1 value from dataset1.
Else
I need to increment max(column1) from dataset1 by one, use that as the new value, and store it in dataset1.
You can see some sample data here:
https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/1165111237342523/3447405230020171/7035720262824085/latest.html
How can this be done in Spark?
Example:
val df1 = Seq(
("20359045","2263"),
("8476349","3280"),
("60886923","2860"),
("204831453","50330"),
("6487533","48236"),
("583633","46067")
).toDF("company_id_external","company_id")
val df2 = Seq(
("60886923","Chengdu Fuma Food Co,.Ltd"), //company_id_external match found in df1
("608815923","Australia Deloraine Dairy Pty Ltd "),
("59322769","Consalac B.V."),
("583633","Boso oil and fat Co., Ltd. ") //company_id_external match found in df1
).toDF("company_id_external","companyName")
If match found in df1
Here only two records of df2 have a "company_id_external" match in df1,
i.e. 60886923 & 583633 (first and last record).
For these records of df2
i.e. ("60886923","Chengdu Fuma Food Co,.Ltd") becomes ==> ("2860","Chengdu Fuma Food Co,.Ltd")
("583633","Boso oil and fat Co., Ltd. ") becomes ==> ("46067","Boso oil and fat Co., Ltd. ")
Else match not found in df1
For the other two records of df2 there is no "company_id_external" match in df1, so I need to generate a company_id and add it to df1,
i.e. ("608815923","Australia Deloraine Dairy Pty Ltd "),
("59322769","Consalac B.V.")
company_id generation logic
new company_id = max(company_id) of df1 + 1
From the above, the max is 50330; + 1 => 50331, so add this record to df1, i.e. ("608815923","50331")
Do the same for the other one, i.e. add this record to df1: ("59322769","50332")
So now
df1 = Seq(
("20359045","2263"),
("8476349","3280"),
("60886923","2860"),
("204831453","50330"),
("6487533","48236"),
("583633","46067"),
("608815923","50331"),
("59322769","50332")
).toDF("company_id_external","company_id")

Partitioning the data while reading from hive/hdfs based on column values in Spark

I have 2 Spark dataframes that I read from Hive using the sqlContext. Let's call these dataframes df1 and df2. The data in both dataframes is sorted on a column called PolicyNumber at the Hive level. PolicyNumber also happens to be the primary key for both dataframes. Below are sample values for both dataframes, although in reality both my dataframes are huge and spread across 5 executors as 5 partitions. For simplicity's sake, I will assume that each partition has one record.
Sample df1
PolicyNumber FirstName
1 A
2 B
3 C
4 D
5 E
Sample df2
PolicyNumber PremiumAmount
1 450
2 890
3 345
4 563
5 2341
Now, I want to join df1 and df2 on PolicyNumber column. I can run the below piece of code and get my required output.
df1.join(df2, df1.PolicyNumber == df2.PolicyNumber)
Now, I want to avoid as much shuffle as possible to make this join efficient. So to avoid shuffle, while reading from Hive, I want to partition df1 based on the values of the PolicyNumber column in such a way that the row with PolicyNumber 1 will go to Executor 1, the row with PolicyNumber 2 will go to Executor 2, the row with PolicyNumber 3 will go to Executor 3 and so on. And I want to partition df2 in the exact same way I did for df1 as well.
This way, Executor 1 will now have the row from df1 with PolicyNumber=1 and also the row from df2 with PolicyNumber=1 as well.
Similarly, Executor 2 will have the row from df1 with PolicyNumber=2 and also the row from df2 with PolicyNumber=2, and so on.
This way, there will not be any shuffle required as now, the data is local to that executor.
My question is, is there a way to control the partitions at this granularity? And if yes, how do I do it?
Unfortunately there is no direct control over which data ends up on each executor. However, while reading data into each dataframe, use CLUSTER BY on the join column, which helps distribute the sorted data to the right executors.
ex:
df1 = sqlContext.sql("SELECT * FROM TABLE1 CLUSTER BY JOIN_COLUMN")
df2 = sqlContext.sql("SELECT * FROM TABLE2 CLUSTER BY JOIN_COLUMN")
hope it helps.
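A sketch of that suggestion applied to the tables in the question (Scala; the Hive table names are placeholders). CLUSTER BY is DISTRIBUTE BY plus SORT BY on the key:
val policies = spark.sql("SELECT * FROM policy_names CLUSTER BY PolicyNumber")    // hypothetical table behind df1
val premiums = spark.sql("SELECT * FROM policy_premiums CLUSTER BY PolicyNumber") // hypothetical table behind df2
val joined = policies.join(premiums, Seq("PolicyNumber"))
// note: CLUSTER BY itself is a shuffle when the dataframes are materialized, so this arranges
// the data rather than removing the exchange entirely; avoiding the shuffle altogether
// generally needs both tables written bucketed and sorted (bucketBy) on PolicyNumber.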
