Spark SQL - data comparison - apache-spark

What would be the best approach to compare two csv files (millions of rows) with same schema with a primary key column and print out the differences. For example ,
CSV1
Id name zip
1 name1 07112
2 name2 07234
3 name3 10290
CSV2
Id name zip
1 name1 07112
2 name21 07234
4 name4 10290
Comparing modified file CSV2 with the original data CSV1,
Output should be
Id name zip
2 name21 07234 Modified
3 name3 10290 Deleted
4 name4 10290 Added
New to Spark SQL, I am thinking of importing data into Hive tables and then run Spark SQL to identify the changes.
1) Is there any row modified method available to identify whether a row has modified instead of comparing values in each column?
2) Are there any better approach available to implement either using Spark or other HDFS tools?
Appreciate the feedback

Many approaches exist; this is one that can have things done in parallel:
import org.apache.spark.sql.functions._
import sqlContext.implicits._
val origDF = sc.parallelize(Seq(
("1", "a", "b"),
("2", "c", "d"),
("3", "e", "f")
)).toDF("k", "v1", "v2")
val newDF = sc.parallelize(Seq(
("1", "a", "b"),
("2", "c2", "d"),
("4", "g", "h")
)).toDF("k", "v1", "v2")
val df1 = origDF.except(newDF) // if k not exists in df2, then deleted
//df1.show(false)
val df2 = newDF.except(origDF) // if k not exists in df1, then added
//df2.show(false)
// if no occurrence in both dfs, then the same
// if k exists in both, then k in df2 = modified
df1.createOrReplaceTempView("df1")
df2.createOrReplaceTempView("df2")
val df3 = spark.sql("""SELECT df1.k, df1.v1, df1.v2, "deleted" as operation
FROM df1
WHERE NOT EXISTS (SELECT df2.k
FROM df2
WHERE df2.k = df1.k)
UNION
SELECT df2.k, df2.v1, df2.v2, "added" as operation
FROM df2
WHERE NOT EXISTS (SELECT df1.k
FROM df1
WHERE df1.k = df2.k)
UNION
SELECT df2.k, df2.v1, df2.v2, "modified" as operation
FROM df2
WHERE EXISTS (SELECT df1.k
FROM df1
WHERE df1.k = df2.k)
""")
df3.show(false)
returns:
+---+---+---+---------+
|k |v1 |v2 |operation|
+---+---+---+---------+
|4 |g |h |added |
|2 |c2 |d |modified |
|3 |e |f |deleted |
+---+---+---+---------+
Not so hard, no standard utility.

Related

Join Two Dataframes while keeping the same format

Hello I have two Dataframes:
df1 with columns: first_name, last_name, id, location, phone_number.
df2 with columns: last_name, id, location, employer.
I am trying to create a new dataset that displays only the columns in df1 that returns only the rows where the last_name, and id is present in df2. So I decided that a inner join on the two tables. The issue is that join appends the columns from df2 to the end of df1 so my resulting df is much larger than I need. I only care about the columns in df1.
My join was: df1.join(df2, df1.col("last_name").equalTo(df2.col("last_name").and(df1.col("id").equalTo(df2.col("id")), "inner");
The problem with this is I got a new table of: first_name, last_name, id, location, phone_number, employer. Where id and last_name was ambiguous.
Is there any way to keep the same table format of df1 after the join? (Without dropping individual columns, because I am using with a large table with about 30 columns).
You can use the the join method of a Dataframe with the following function signature (from the API docs):
def join(right: Dataset[_], usingColumns: Seq[String]): DataFrame
This will only keep 1 column of the joining columns, removing your ambiguity problem.
After that, you can just select the columns of df1 dynamically, by using df1.columns. In total, it would look something like this:
import spark.implicits._
val df1 = Seq(
("joe", "shmoe", 1, "x", 123),
("jack", "johnson", 2, "y", 456)
).toDF("first_name", "last_name", "id", "location", "phone_number")
df1.show
+----------+---------+---+--------+------------+
|first_name|last_name| id|location|phone_number|
+----------+---------+---+--------+------------+
| joe| shmoe| 1| x| 123|
| jack| johnson| 2| y| 456|
+----------+---------+---+--------+------------+
val df2 = Seq(
("shmoe", 1, "x", "someCoolGuy"),
("otherName", 3, "z", "employer2")
).toDF("last_name", "id", "location", "employer")
df2.show
+---------+---+--------+-----------+
|last_name| id|location| employer|
+---------+---+--------+-----------+
| shmoe| 1| x|someCoolGuy|
|otherName| 3| z| employer2|
+---------+---+--------+-----------+
val output = df1
.join(df2.select("last_name", "id"), Seq("last_name", "id")) // only selecting interesting columns of df2 for the join
.select(df1.columns.head, df1.columns.tail: _*)
output.show
+----------+---------+---+--------+------------+
|first_name|last_name| id|location|phone_number|
+----------+---------+---+--------+------------+
| joe| shmoe| 1| x| 123|
+----------+---------+---+--------+------------+
Hope this helps!

Finding the duplicate using Row Number in Pyspark

I have written an SQL query which actually finds the duplicate elevation from the table along with other unique columns. Here is my query. I want to convert that into pyspark.
dup_df = spark.sql('''
SELECT g.pbkey,
g.lon,
g.lat,
g.elevation
FROM DATA AS g
INNER JOIN
(SELECT elevation,
COUNT(elevation) AS NumOccurrences
FROM DATA
GROUP BY elevation
HAVING (COUNT(elevation) > 1)) AS a ON (a.elevation = g.elevation)
''')
On Scala can be implemented with Window, guess, can be converted to Python:
val data = Seq(1, 2, 3, 4, 5, 7, 3).toDF("elevation")
val elevationWindow = Window.partitionBy("elevation")
data
.withColumn("elevationCount", count("elevation").over(elevationWindow))
.where($"elevationCount" > 1)
.drop("elevationCount")
Output is:
+---------+
|elevation|
+---------+
|3 |
|3 |
+---------+

How to create an unique autogenerated Id column in a spark dataframe

I have a dataframe where I have to generate a unique Id in one of the columns. This id has to be generated with an offset.
Because , I need to persist this dataframe with the autogenerated id , now if new data comes in the autogenerated id should not collide with the existing ones.
I checked the monotonically increasing function but it does not accept any offset .
This is what I tried :
df=df.coalesce(1);
df = df.withColumn(inputCol,functions.monotonically_increasing_id());
But is there a way to make the monotonically_increasing_id() start from a starting offset ?
You can simply add to it to provide a minimum value for the id. Note that it is not guaranteed the values will start from the minimum value
.withColumn("id", monotonically_increasing_id + 123)
Explanation: Operator + is overloaded for columns https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/Column.scala#L642
Or if you don't want to restrict your program into one only partition with df.coalesce(1) you can use zipWithIndex which starts with index = 0 as next:
lines = [["a1", "a2", "a3"],
["b1", "b2", "b3"],
["c1", "c2", "c3"]]
cols = ["c1", "c2", "c3"]
df = spark.createDataFrame(lines, cols)
start_indx = 10
df = df.rdd.zipWithIndex() \
.map(lambda (r, indx): (indx + start_indx, r[0], r[1], r[2])) \
.toDF(["id", "c1", "c2", "c3"])
df.show(10, False)
In this case I set the start_index = 10. And this will be the output:
+---+---+---+---+
|id |c1 |c2 |c3 |
+---+---+---+---+
|10 |a1 |a2 |a3 |
|11 |b1 |b2 |b3 |
|12 |c1 |c2 |c3 |
+---+---+---+---+
You could add a rownumber to your columns and then add that to the maximum existing identity column, or your offset. Once it is set drop the rownumber attribute.
from pyspark.sql import functions as sf
from pyspark.sql.window import Window
# Could also grab the exist max ID value
seed_value = 123
df = df.withColumn("row_number", sf.rowNumber().over(Window.partitionBy(sf.col("natural_key")).orderBy(sf.col("anything"))))
df = df.withColumn("id", sf.col("row_number")+seed_value)
Remember to drop the row_number attribute.

pyspark AnalysisException: "Reference '<COLUMN>' is ambiguous [duplicate]

I have two dataframes with the following columns:
df1.columns
// Array(ts, id, X1, X2)
and
df2.columns
// Array(ts, id, Y1, Y2)
After I do
val df_combined = df1.join(df2, Seq(ts,id))
I end up with the following columns: Array(ts, id, X1, X2, ts, id, Y1, Y2). I could expect that the common columns would be dropped. Is there something that additional that needs to be done?
The simple answer (from the Databricks FAQ on this matter) is to perform the join where the joined columns are expressed as an array of strings (or one string) instead of a predicate.
Below is an example adapted from the Databricks FAQ but with two join columns in order to answer the original poster's question.
Here is the left dataframe:
val llist = Seq(("bob", "b", "2015-01-13", 4), ("alice", "a", "2015-04-23",10))
val left = llist.toDF("firstname","lastname","date","duration")
left.show()
/*
+---------+--------+----------+--------+
|firstname|lastname| date|duration|
+---------+--------+----------+--------+
| bob| b|2015-01-13| 4|
| alice| a|2015-04-23| 10|
+---------+--------+----------+--------+
*/
Here is the right dataframe:
val right = Seq(("alice", "a", 100),("bob", "b", 23)).toDF("firstname","lastname","upload")
right.show()
/*
+---------+--------+------+
|firstname|lastname|upload|
+---------+--------+------+
| alice| a| 100|
| bob| b| 23|
+---------+--------+------+
*/
Here is an incorrect solution, where the join columns are defined as the predicate left("firstname")===right("firstname") && left("lastname")===right("lastname").
The incorrect result is that the firstname and lastname columns are duplicated in the joined data frame:
left.join(right, left("firstname")===right("firstname") &&
left("lastname")===right("lastname")).show
/*
+---------+--------+----------+--------+---------+--------+------+
|firstname|lastname| date|duration|firstname|lastname|upload|
+---------+--------+----------+--------+---------+--------+------+
| bob| b|2015-01-13| 4| bob| b| 23|
| alice| a|2015-04-23| 10| alice| a| 100|
+---------+--------+----------+--------+---------+--------+------+
*/
The correct solution is to define the join columns as an array of strings Seq("firstname", "lastname"). The output data frame does not have duplicated columns:
left.join(right, Seq("firstname", "lastname")).show
/*
+---------+--------+----------+--------+------+
|firstname|lastname| date|duration|upload|
+---------+--------+----------+--------+------+
| bob| b|2015-01-13| 4| 23|
| alice| a|2015-04-23| 10| 100|
+---------+--------+----------+--------+------+
*/
This is an expected behavior. DataFrame.join method is equivalent to SQL join like this
SELECT * FROM a JOIN b ON joinExprs
If you want to ignore duplicate columns just drop them or select columns of interest afterwards. If you want to disambiguate you can use access these using parent DataFrames:
val a: DataFrame = ???
val b: DataFrame = ???
val joinExprs: Column = ???
a.join(b, joinExprs).select(a("id"), b("foo"))
// drop equivalent
a.alias("a").join(b.alias("b"), joinExprs).drop(b("id")).drop(a("foo"))
or use aliases:
// As for now aliases don't work with drop
a.alias("a").join(b.alias("b"), joinExprs).select($"a.id", $"b.foo")
For equi-joins there exist a special shortcut syntax which takes either a sequence of strings:
val usingColumns: Seq[String] = ???
a.join(b, usingColumns)
or as single string
val usingColumn: String = ???
a.join(b, usingColumn)
which keep only one copy of columns used in a join condition.
I have been stuck with this for a while, and only recently I came up with a solution what is quite easy.
Say a is
scala> val a = Seq(("a", 1), ("b", 2)).toDF("key", "vala")
a: org.apache.spark.sql.DataFrame = [key: string, vala: int]
scala> a.show
+---+----+
|key|vala|
+---+----+
| a| 1|
| b| 2|
+---+----+
and
scala> val b = Seq(("a", 1)).toDF("key", "valb")
b: org.apache.spark.sql.DataFrame = [key: string, valb: int]
scala> b.show
+---+----+
|key|valb|
+---+----+
| a| 1|
+---+----+
and I can do this to select only the value in dataframe a:
scala> a.join(b, a("key") === b("key"), "left").select(a.columns.map(a(_)) : _*).show
+---+----+
|key|vala|
+---+----+
| a| 1|
| b| 2|
+---+----+
You can simply use this
df1.join(df2, Seq("ts","id"),"TYPE-OF-JOIN")
Here TYPE-OF-JOIN can be
left
right
inner
fullouter
For example, I have two dataframes like this:
// df1
word count1
w1 10
w2 15
w3 20
// df2
word count2
w1 100
w2 150
w5 200
If you do fullouter join then the result looks like this
df1.join(df2, Seq("word"),"fullouter").show()
word count1 count2
w1 10 100
w2 15 150
w3 20 null
w5 null 200
try this,
val df_combined = df1.join(df2, df1("ts") === df2("ts") && df1("id") === df2("id")).drop(df2("ts")).drop(df2("id"))
This is a normal behavior from SQL, what I am doing for this:
Drop or Rename source columns
Do the join
Drop renamed column if any
Here I am replacing "fullname" column:
Some code in Java:
this
.sqlContext
.read()
.parquet(String.format("hdfs:///user/blablacar/data/year=%d/month=%d/day=%d", year, month, day))
.drop("fullname")
.registerTempTable("data_original");
this
.sqlContext
.read()
.parquet(String.format("hdfs:///user/blablacar/data_v2/year=%d/month=%d/day=%d", year, month, day))
.registerTempTable("data_v2");
this
.sqlContext
.sql(etlQuery)
.repartition(1)
.write()
.mode(SaveMode.Overwrite)
.parquet(outputPath);
Where the query is:
SELECT
d.*,
concat_ws('_', product_name, product_module, name) AS fullname
FROM
{table_source} d
LEFT OUTER JOIN
{table_updates} u ON u.id = d.id
This is something you can do only with Spark I believe (drop column from list), very very helpful!
Inner Join is default join in spark, Below is simple syntax for it.
leftDF.join(rightDF,"Common Col Nam")
For Other join you can follow the below syntax
leftDF.join(rightDF,Seq("Common Columns comma seperated","join type")
If columns Name are not common then
leftDF.join(rightDF,leftDF.col("x")===rightDF.col("y),"join type")
Best practice is to make column name different in both the DF before joining them and drop accordingly.
df1.columns =[id, age, income]
df2.column=[id, age_group]
df1.join(df2, on=df1.id== df2.id,how='inner').write.saveAsTable('table_name')
will return an error while error for duplicate columns
Try this instead try this:
df2_id_renamed = df2.withColumnRenamed('id','id_2')
df1.join(df2_id_renamed, on=df1.id== df2_id_renamed.id_2,how='inner').drop('id_2')
If anyone is using spark-SQL and wants to achieve the same thing then you can use USING clause in join query.
val spark = SparkSession.builder().master("local[*]").getOrCreate()
spark.sparkContext.setLogLevel("ERROR")
import spark.implicits._
val df1 = List((1, 4, 3), (5, 2, 4), (7, 4, 5)).toDF("c1", "c2", "C3")
val df2 = List((1, 4, 3), (5, 2, 4), (7, 4, 10)).toDF("c1", "c2", "C4")
df1.createOrReplaceTempView("table1")
df2.createOrReplaceTempView("table2")
spark.sql("select * from table1 inner join table2 using (c1, c2)").show(false)
/*
+---+---+---+---+
|c1 |c2 |C3 |C4 |
+---+---+---+---+
|1 |4 |3 |3 |
|5 |2 |4 |4 |
|7 |4 |5 |10 |
+---+---+---+---+
*/
After I've joined multiple tables together, I run them through a simple function to rename columns in the DF if it encounters duplicates. Alternatively, you could drop these duplicate columns too.
Where Names is a table with columns ['Id', 'Name', 'DateId', 'Description'] and Dates is a table with columns ['Id', 'Date', 'Description'], the columns Id and Description will be duplicated after being joined.
Names = sparkSession.sql("SELECT * FROM Names")
Dates = sparkSession.sql("SELECT * FROM Dates")
NamesAndDates = Names.join(Dates, Names.DateId == Dates.Id, "inner")
NamesAndDates = deDupeDfCols(NamesAndDates, '_')
NamesAndDates.saveAsTable("...", format="parquet", mode="overwrite", path="...")
Where deDupeDfCols is defined as:
def deDupeDfCols(df, separator=''):
newcols = []
for col in df.columns:
if col not in newcols:
newcols.append(col)
else:
for i in range(2, 1000):
if (col + separator + str(i)) not in newcols:
newcols.append(col + separator + str(i))
break
return df.toDF(*newcols)
The resulting data frame will contain columns ['Id', 'Name', 'DateId', 'Description', 'Id2', 'Date', 'Description2'].
Apologies this answer is in Python - I'm not familiar with Scala, but this was the question that came up when I Googled this problem and I'm sure Scala code isn't too different.

Merge two data frame with few different columns

I want to merge several DataFrames having few different columns.
Suppose ,
DataFrame A has 3 columns: Column_1, Column_2, Column 3
DataFrame B has 3 columns: Column_1, Columns_2, Column_4
DataFrame C has 3 Columns: Column_1, Column_2, Column_5
I want to merge these DataFrames such that I get a DataFrame like :
Column_1, Column_2, Column_3, Column_4 Column_5
number of DataFrames may increase. Is there any way to get this merge ? such that for a particular Column_1 Column_2 combination i get the values for other three columns in same row, and if for a particular combination of Column_1 Column_2 there is no data in some Columns then it should show null there.
DataFrame A:
Column_1 Column_2 Column_3
1 x abc
2 y def
DataFrame B:
Column_1 Column_2 Column_4
1 x xyz
2 y www
3 z sdf
The merge of A and B :
Column_1 Column_2 Column_3 Column_4
1 x abc xyz
2 y def www
3 z null sdf
If I understand your question correctly, you'll be needing to perform an outer join using a sequence of columns as keys.
I have used the data provided in your question to illustrate how it is done with an example :
scala> val df1 = Seq((1,"x","abc"),(2,"y","def")).toDF("Column_1","Column_2","Column_3")
// df1: org.apache.spark.sql.DataFrame = [Column_1: int, Column_2: string, Column_3: string]
scala> val df2 = Seq((1,"x","xyz"),(2,"y","www"),(3,"z","sdf")).toDF("Column_1","Column_2","Column_4")
// df2: org.apache.spark.sql.DataFrame = [Column_1: int, Column_2: string, Column_4: string]
scala> val df3 = df1.join(df2, Seq("Column_1","Column_2"), "outer")
// df3: org.apache.spark.sql.DataFrame = [Column_1: int, Column_2: string, Column_3: string, Column_4: string]
scala> df3.show
// +--------+--------+--------+--------+
// |Column_1|Column_2|Column_3|Column_4|
// +--------+--------+--------+--------+
// | 1| x| abc| xyz|
// | 2| y| def| www|
// | 3| z| null| sdf|
// +--------+--------+--------+--------+
This is called an equi-join with another DataFrame using the given columns.
It is different from other join functions, the join columns will only appear once in the output, i.e. similar to SQL's JOIN USING syntax.
Note
Outer equi-joins are available since Spark 1.6.
First use following codes for all three data frames, so that SQL queries can be implemented on dataframes
DF1.createOrReplaceTempView("df1view")
DF2.createOrReplaceTempView("df2view")
DF3.createOrReplaceTempView("df3view")
then use this join command to merge
val intermediateDF = spark.sql("SELECT a.column1, a.column2, a.column3, b.column4 FROM df1view a leftjoin df2view b on a.column1 = b.column1 and a.column2 = b.column2")`
intermediateDF.createOrReplaceTempView("imDFview")
val resultDF = spark.sql("SELECT a.column1, a.column2, a.column3, a.column4, b.column5 FROM imDFview a leftjoin df3view b on a.column1 = b.column1 and a.column2 = b.column2")
these join can also be done together in one join, also since you want all values of column1 and column2,you can use full outer join instead of left join

Resources