Updating static source based on Kafka Stream using Spark Streaming? - apache-spark

I am using spark-sql 2.4.1 with Java 8.
I have a scenario where I have some metadata in dataset1, which is loaded from an HDFS Parquet file.
And I have another dataset2 which is read from a Kafka stream.
For each record of dataset2 I need to check whether its columnX value
is present in dataset1.
If it is there in dataset1, then I need to replace the columnX value with the column1 value of dataset1.
Else
I need to increment max(column1) of dataset1 by one, use that as the new value, and store it in dataset1.
Some sample data you can see here:
https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/1165111237342523/3447405230020171/7035720262824085/latest.html
How can this be done in Spark?
Example:
val df1 = Seq(
  ("20359045","2263"),
  ("8476349","3280"),
  ("60886923","2860"),
  ("204831453","50330"),
  ("6487533","48236"),
  ("583633","46067")
).toDF("company_id_external","company_id")

val df2 = Seq(
  ("60886923","Chengdu Fuma Food Co,.Ltd"), // company_id_external match found in df1
  ("608815923","Australia Deloraine Dairy Pty Ltd"),
  ("59322769","Consalac B.V."),
  ("583633","Boso oil and fat Co., Ltd. ") // company_id_external match found in df1
).toDF("company_id_external","companyName")
If a match is found in df1:
Here only two records of df1 have a "company_id_external" matching df2's "company_id_external",
i.e. 60886923 & 583633 (the first and last records of df2).
For these records of df2:
i.e. ("60886923","Chengdu Fuma Food Co,.Ltd") becomes ==> ("2860","Chengdu Fuma Food Co,.Ltd")
("583633","Boso oil and fat Co., Ltd. ") becomes ==> ("46067","Boso oil and fat Co., Ltd. ")
Else (no match found in df1):
For the other two records of df2 there is no "company_id_external" match in df1, so I need to generate a company_id and add it to df1,
i.e. ("608815923","Australia Deloraine Dairy Pty Ltd"),
("59322769","Consalac B.V.")
company_id generation logic:
new company_id = max(company_id) of df1 + 1
From the above, the max is 50330; + 1 => 50331, so add this record to df1, i.e. ("608815923","50331").
Do the same for the other one, i.e. add this record to df1: ("59322769","50332").
**So now**
df1 = Seq(
  ("20359045","2263"),
  ("8476349","3280"),
  ("60886923","2860"),
  ("204831453","50330"),
  ("6487533","48236"),
  ("583633","46067"),
  ("608815923","50331"),
  ("59322769","50332")
).toDF("company_id_external","company_id")
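For illustration, here is a minimal batch sketch of that lookup/replace/append logic against the df1/df2 above. It is only a sketch: the streaming case would additionally need something like foreachBatch to persist the updated df1 back to HDFS, which is not shown here.

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

// 1) Matches: replace df2's company_id_external with df1's company_id
val matched = df2.join(df1, Seq("company_id_external"))
  .select(col("company_id").as("company_id_external"), col("companyName"))

// 2) Non-matches: assign new ids starting at max(company_id) + 1
val maxId = df1.agg(max(col("company_id").cast("long"))).head.getLong(0)
val newIds = df2.join(df1, Seq("company_id_external"), "left_anti")
  .withColumn("company_id",
    (lit(maxId) + row_number().over(Window.orderBy("company_id_external"))).cast("string"))

// 3) Append the newly generated ids back to df1
val updatedDf1 = df1.union(newIds.select("company_id_external", "company_id"))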

Related

How to join 2 dataframes in Spark which are already partitioned on the same column, without shuffles?

I have 2 df's
df1:
columns: col1, col2, col3
partitioned on col1
no of partitions: 120000
df2:
columns: col1, col2, col3
partitioned on col1
no of partitions: 80000
Now I want to join df1 and df2 on (df1.col1 = df2.col1 and df1.col2 = df2.col2) without much shuffling.
I tried the join but it is taking a lot of time...
How do I do it? Can anyone help?
IMO you can try to use a broadcast join if one of your datasets is small (let's say a few hundred MB) - in this case the smaller dataset will be broadcast and you will skip the shuffle, as in the sketch below.
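For reference, a minimal sketch of the broadcast hint, assuming df2 is the smaller side and using the join keys from the question:

import org.apache.spark.sql.functions.broadcast

// Explicitly mark df2 for broadcast so the join is executed map-side, without shuffling df1
val joined = df1.join(broadcast(df2), Seq("col1", "col2"))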
Without a broadcast hint, Catalyst is probably going to pick SMJ (sort-merge join), and during this join algorithm the data needs to be repartitioned by the join key and then sorted. I prepared a quick example:
import org.apache.spark.sql.functions._
import spark.implicits._ // needed for .toDF outside spark-shell
spark.conf.set("spark.sql.shuffle.partitions", "10")
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", -1)
val data = Seq(("test", 3),("test", 3), ("test2", 5), ("test3", 7), ("test55", 86))
val data2 = Seq(("test", 3),("test", 3), ("test2", 5), ("test3", 6), ("test33", 76))
val df = data.toDF("Name", "Value").repartition(5, col("Name"))
df.show
val df2 = data2.toDF("Name", "Value").repartition(5, col("Name"))
df2.show
df.join(df2, Seq("Name", "Value")).show
autoBroadcastJoinThreshold is set to -1 to disable broadcast joins.
sql.shuffle.partitions is set to 10 to show that the join is going to use this value during repartitioning.
I repartitioned the dfs before the join into 5 partitions and called an action to be sure that they are partitioned by the same column before the join.
And in the SQL tab I can see that Spark is repartitioning the data again.
If you can't broadcast and your join is taking a lot of time, you may check whether you have some skew.
You may read this blog post by Dima Statz to find more information about skew in joins.

Optimize Join of two large pyspark dataframes

I have two large pyspark dataframes df1 and df2 containing GBs of data.
The columns in first dataframe are id1, col1.
The columns in second dataframe are id2, col2.
The dataframes have equal number of rows.
Also all values of id1 and id2 are unique.
Also all values of id1 correspond to exactly one value id2.
For example, the first few entries of df1 and df2 are as follows:
df1:
id1 | col1
12 | john
23 | chris
35 | david
df2:
id2 | col2
23 | lewis
35 | boon
12 | cena
So I need to join the two dataframes on key id1 and id2.
df = df1.join(df2, df1.id1 == df2.id2)
I am afraid this may suffer from shuffling.
How can I optimize the join operation for this special case?
To avoid shuffling at join time, repartition the data based on your id column.
The repartition operation will also do a full shuffle, but it will pay off for further joins if there is more than one.
df1 = df1.repartition('id1')
df2 = df2.repartition('id2')
Another way to avoid shuffles at join time is to leverage bucketing (see the sketch below this explanation).
Save both dataframes using the bucketBy clause on the id; later, when you read the dataframes back, rows with the same id will reside on the same executors, hence avoiding the shuffle.
But to leverage the benefit of bucketing, you need a Hive metastore, as the bucketing information is stored in it.
Also this involves the additional steps of creating the bucketed tables and then reading them back.
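A minimal Scala sketch of the bucketing approach (the PySpark DataFrameWriter API is analogous); the table names and the bucket count of 200 are placeholders:

// Write both dataframes bucketed (and sorted) by their join keys; bucket counts must match.
// This requires a Hive metastore, since the bucketing metadata is kept there.
df1.write.bucketBy(200, "id1").sortBy("id1").saveAsTable("df1_bucketed")
df2.write.bucketBy(200, "id2").sortBy("id2").saveAsTable("df2_bucketed")

// Reading the bucketed tables back lets Spark plan the join without an exchange on either side
val b1 = spark.table("df1_bucketed")
val b2 = spark.table("df2_bucketed")
val joined = b1.join(b2, b1("id1") === b2("id2"))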

Can I specify column names when creating a DataFrame?

My data is in a CSV file. The file doesn't have a header row:
United States Romania 15
United States Croatia 1
United States Ireland 344
Egypt United States 15
If I read it, Spark creates names for the columns automatically.
scala> val data = spark.read.csv("./data/flight-data/csv/2015-summary.csv")
data: org.apache.spark.sql.DataFrame = [_c0: string, _c1: string ... 1 more field]
Is it possible to provide my own names for the columns when reading the file, if I don't want to use _c0, _c1? E.g., I want Spark to use DEST, ORIG and count as the column names. I don't want to add a header row to the CSV to do this.
Yes you can. There is a way: you can use the toDF function of the dataframe.
val data = spark.read.csv("./data/flight-data/csv/2015-summary.csv").toDF("DEST", "ORIG", "count")
It's better to define the schema (StructType) first, then load the CSV data using that schema.
Here is how to define schema:
import org.apache.spark.sql.types._
val schema = StructType(Array(
  StructField("DEST", StringType, true),
  StructField("ORIG", StringType, true),
  StructField("count", IntegerType, true)
))
Load the dataframe:
val df = spark.read.schema(schema).csv("./data/flight-data/csv/2015-summary.csv")
Hopefully it'll help you.

How to merge edits from one dataframe into another dataframe in Spark?

I have a dataframe df1 with 150 columns and many rows. I also have a dataframe df2 with the same schema but very few rows, containing edits that should be applied to df1 (there's a key column id to identify which row to update). df2 has only the columns with updates populated; the rest of the columns are null. What I want to do is update the rows in df1 with the corresponding rows from dataframe df2 in the following way:
if a column in df2 is null, it should not cause any changes in df1
if a column in df2 contains a tilde "~", it should result in nullifying that column in df1
otherwise the value in column in df1 should get replaced with the value from df2
How can I best do it? Can it be done in a generic way without listing all the columns but rather iterating over them? Can it be done using dataframe API or do I need to switch to RDDs?
(Of course by updating dataframe df1 I mean creating a new, updated dataframe.)
Example
Let's say the schema is: id:Int, name:String, age: Int.
df1 is:
1,"Greg",18
2,"Kate",25
3,"Chris",30
df2 is:
1,"Gregory",null
2,~,26
The updated dataframe should look like this:
1,"Gregory",18
2,null,26
3,"Chris",30
You can also use case/when or coalesce with a full outer join to merge the two dataframes; see the link below for an explanation, and a short sketch after it.
Spark incremental loading overwrite old record
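A hypothetical sketch of the coalesce approach against the example schema (id, name, age); note the "~" marker only really makes sense for string columns:

import org.apache.spark.sql.functions._

// For every non-key column: take df2's value when present, translate "~" to null,
// and fall back to df1's value when df2 has no edit (null or no matching row).
val dataCols = df1.columns.filterNot(_ == "id")
val merged = df1.as("a").join(df2.as("b"), Seq("id"), "left_outer")
  .select((col("id") +: dataCols.map { c =>
    when(col(s"b.$c") === "~", lit(null))
      .otherwise(coalesce(col(s"b.$c"), col(s"a.$c")))
      .as(c)
  }): _*)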
I figured out how to do it with an intermediate conversion to RDD. First, create a map idsToEdits where keys are row ids and values are maps of column numbers to values (only the non-null ones).
val idsToEdits = df2.rdd.map { row =>
  (row(0),
    row.getValuesMap[AnyVal](row.schema.fieldNames.filterNot(colName => row.isNullAt(row.fieldIndex(colName))))
      .map { case (k, v) => (row.fieldIndex(k), if (v == "~") null else v) })
}.collectAsMap()
Broadcast that map and define an editRow function that updates a row.
val idsToEditsBr=sc.broadcast(idsToEdits)
import org.apache.spark.sql.Row
val editRow: Row => Row = { row =>
  idsToEditsBr
    .value
    .get(row(0))
    .map { edits =>
      Row.fromSeq(edits.foldLeft(row.toSeq) { case (rowSeq, (idx, newValue)) =>
        rowSeq.updated(idx, newValue)
      })
    }
    .getOrElse(row)
}
Finally, use that function on RDD derived from df1 and convert back to a dataframe.
val updatedDF=spark.createDataFrame(df1.rdd.map(editRow),df1.schema)
It sounds like your question is how to perform this without explicitly naming all the columns, so I will assume you have some "doLogic" udf function or dataframe functions to perform your logic after joining.
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types.StringType

val cols = df1.schema.filterNot(x => x.name == "id").map { x =>
  if (x.dataType == StringType) {
    doLogicUdf(col(x.name), col(x.name + "2"))
  } else {
    when(col(x.name + "2").isNotNull, col(x.name + "2")).otherwise(col(x.name))
  }
} :+ col("id")

val df2Renamed = df2.select(df2.columns.map(x => col(x).alias(x + "2")): _*)
df1.join(df2Renamed, col("id") === col("id2"), "inner").select(cols: _*)

join two data files in HDFS with Spark?

I have two datasets which are already partitioned using the same partitioner and stored in HDFS. These datasets are the output of two different Spark jobs over which we don't have control. Now I want to join these two datasets to produce different information.
Example:
Data Set 1:
ORDER_ID CUSTOMER_ID ITEMS
OD1 C1 1,2,3 -> partition 0
OD2 C2 3,4,5 -> partition 0
OD3 C4 1,2,3 -> partition 1
OD4 C3 1,3 -> partition 1
Data Set 2:
ORDER_ID CUSTOMER_ID REFUND_ITEMS
OD1 C1 1 -> partition 0
OD2 C2 5 -> partition 0
OD3 C4 2,3 -> partition 1
OD4 C3 3 -> partition 1
Options are:
1) Create two RDDs from the datasets and join them.
2) Create one RDD using one of the dataset.
-> For each partition in the RDD, get the actual partition id, i.e. OD1 -> 0, OD3 -> 1 (using some custom logic)
-> Load the data from HDFS for that partition of dataset 2
-> Iterate over both datasets and produce the combined result.
For option 2 I don't know how to read a specific file from HDFS in the Spark executor. (I have the full URI for the location of the file.)
You can try creating 2 dataframes and joining them using SQL. Please find the code below.
// For implicit conversions from RDDs to DataFrames
import spark.implicits._

case class struc_dataset(ORDER_ID: String, CUSTOMER_ID: String, ITEMS: String)
case class struc_dataset2(ORDER_ID: String, CUSTOMER_ID: String, REFUND_ITEMS: String)

// Read file1 (tab-separated: ORDER_ID, CUSTOMER_ID, ITEMS)
val File1DF = spark.sparkContext
  .textFile("temp/src/file1.txt")
  .map(_.split("\t"))
  .map(attributes => struc_dataset(attributes(0), attributes(1), attributes(2))).toDF()

// Register as temp view - Dataset1
File1DF.createOrReplaceTempView("Dataset1")

// Read file2 (tab-separated: ORDER_ID, CUSTOMER_ID, REFUND_ITEMS)
val File2DF = spark.sparkContext
  .textFile("temp/src/file2.txt")
  .map(_.split("\t"))
  .map(attributes => struc_dataset2(attributes(0), attributes(1), attributes(2))).toDF()

// Register as temp view - Dataset2
File2DF.createOrReplaceTempView("Dataset2")

// SQL statement to create the final dataframe (JOIN)
val finalDF = spark.sql("""
  SELECT * FROM Dataset1 ds1
  JOIN Dataset2 ds2
    ON ds1.ORDER_ID = ds2.ORDER_ID AND ds1.CUSTOMER_ID = ds2.CUSTOMER_ID
""")
finalDF.show()
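On the question's option 2 (reading a specific file from HDFS inside an executor), here is a hedged sketch using the Hadoop FileSystem API. The paths and the "part-NNNNN" naming scheme are assumptions, and mapping a partition index to the right file still needs the custom logic mentioned in the question:

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}
import scala.io.Source

// Read dataset 1 as an RDD of lines (path assumed)
val dataset1 = spark.sparkContext.textFile("hdfs:///data/dataset1")

// For each partition, open the corresponding dataset-2 file directly via the FileSystem API
val combined = dataset1.mapPartitionsWithIndex { (partitionId, rows) =>
  val path = new Path(f"hdfs:///data/dataset2/part-$partitionId%05d") // assumed layout
  val fs = path.getFileSystem(new Configuration())
  val refunds = Source.fromInputStream(fs.open(path)).getLines().toList
  // ... merge `rows` with `refunds` by ORDER_ID here ...
  rows // placeholder: return the merged records instead
}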
