How to Merge DataFrames in Apache Spark/Hive and then increment version - apache-spark

We receive daily files from an external system and store them in Hive.
We want to enable versioning on this data. (col1, col2) is a composite key, so if we receive the same key combination from a file it should be stored in Hive with a new version. The latest data coming from a file should get the highest version number. How can we do this in Spark?
file df
+----+----+-----+-------------------+-------+
|col1|col2|value|                 ts|version|
+----+----+-----+-------------------+-------+
|   A|   B|  777|2019-01-01 00:00:00|      1|
|   K|   D|  228|2019-01-01 00:00:00|      1|
|   G|   G|  241|2019-01-01 00:00:00|      1|
+----+----+-----+-------------------+-------+
We don't receive a version from the external system, but if one is needed for comparison it will always be 1.
hive df
+----+----+-----+-------------------+-------+
|col1|col2|value|                 ts|version|
+----+----+-----+-------------------+-------+
|   A|   B|  999|2018-01-01 00:00:00|      1|
|   A|   B|  888|2018-01-02 00:00:00|      2|
|   B|   C|  133|2018-01-03 00:00:00|      1|
|   G|   G|  231|2018-01-01 00:00:00|      1|
+----+----+-----+-------------------+-------+
After merge
+----+----+-----+-------------------+-----------+
|col1|col2|value|                 ts|new_version|
+----+----+-----+-------------------+-----------+
|   B|   C|  133|2018-01-03 00:00:00|          1|
|   K|   D|  228|2019-01-01 00:00:00|          1|
|   A|   B|  999|2018-01-01 00:00:00|          1|
|   A|   B|  888|2018-01-02 00:00:00|          2|
|   A|   B|  777|2019-01-01 00:00:00|          3|
|   G|   G|  231|2018-01-01 00:00:00|          1|
|   G|   G|  241|2019-01-01 00:00:00|          2|
+----+----+-----+-------------------+-----------+

existing main hive table:
INSERT INTO TABLE test_dev_db.test_1 VALUES
('A','B',124,1),
('A','B',123,2),
('B','C',133,1),
('G','G',231,1);
Suppose you have loaded the data below from the file:
INSERT INTO TABLE test_dev_db.test_2 VALUES
('A','B',222,1),
('K','D',228,1),
('G','G',241,1);
here is your query:
WITH CTE AS (
  SELECT col1, col2, value, version FROM test_dev_db.test_1
  UNION
  SELECT col1, col2, value, version FROM test_dev_db.test_2
)
INSERT OVERWRITE TABLE test_dev_db.test_1
SELECT a.col1, a.col2, a.value,
       -- ordering by the partition columns leaves the numbering arbitrary within a key;
       -- on the real table, order by the load timestamp (ts) so the newest rows get the highest version
       row_number() over (partition by a.col1, a.col2 order by a.col1, a.col2) as new_version
FROM CTE a;
hive> select * from test_dev_db.test_1;
OK
A B 123 1
A B 124 2
A B 222 3
B C 133 1
G G 231 1
G G 241 2
K D 228 1
for Spark:
Create your dataframes by reading from the file and the Hive table, then union them:
from pyspark.sql.functions import row_number, lit
from pyspark.sql.window import Window

uniondf = df1.union(df2)   # unionAll() is a deprecated alias of union()

# ordering by a constant literal gives an arbitrary row order within each key;
# order by the ts column instead if the newest rows must get the highest version
w = Window.partitionBy('col1', 'col2').orderBy(lit('A'))
newdf = uniondf.withColumn("new_version", row_number().over(w)).drop('version')
>>> newdf.show()
+----+----+-----+-----------+
|col1|col2|value|new_version|
+----+----+-----+-----------+
| B| C| 133| 1|
| K| D| 228| 1|
| A| B| 124| 1|
| A| B| 123| 2|
| A| B| 222| 3|
| G| G| 231| 1|
| G| G| 241| 2|
+----+----+-----+-----------+
Saving it to Hive:
newdf.write.format("orc").mode("overwrite").saveAsTable('test_dev_db.new_test_1')   # the "header" option is a CSV option and has no effect for ORC
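If the file rows must always end up with the highest version per key, as in the expected "After merge" output above, a minimal sketch along these lines should work. df_hive and df_file are assumed names for the Hive and file dataframes with the columns shown in the question; the window is ordered by ts so the newest row per composite key gets the biggest number.
from pyspark.sql.functions import row_number
from pyspark.sql.window import Window

# union existing Hive data with the newly loaded file data;
# both dataframes are assumed to have col1, col2, value and ts
merged = df_hive.select('col1', 'col2', 'value', 'ts') \
    .union(df_file.select('col1', 'col2', 'value', 'ts'))

# number rows per composite key by timestamp: the newest row gets the highest version
w = Window.partitionBy('col1', 'col2').orderBy('ts')
result = merged.withColumn('new_version', row_number().over(w))
result.show()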

Related

Conditions in Spark window function

I have a dataframe like
+---+---+---+---+
| q| w| e| r|
+---+---+---+---+
| a| 1| 20| y|
| a| 2| 22| z|
| b| 3| 10| y|
| b| 4| 12| y|
+---+---+---+---+
I want to mark the row with the minimum e among the rows where r = z. If there are no rows with r = z, I want the row with the minimum e overall, even if r = y.
Essentially, something like
+---+---+---+---+---+
| q| w| e| r| t|
+---+---+---+---+---+
| a| 1| 20| y| 0|
| a| 2| 22| z| 1|
| b| 3| 10| y| 1|
| b| 4| 12| y| 0|
+---+---+---+---+---+
I can do it using a number of joins, but that would be too expensive.
So I was looking for a window-based solution.
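A minimal sketch of the sample dataframe used in the answers below, assuming a SparkSession named spark (the construction is not shown in the original post):
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [('a', 1, 20, 'y'), ('a', 2, 22, 'z'), ('b', 3, 10, 'y'), ('b', 4, 12, 'y')],
    ['q', 'w', 'e', 'r']
)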
You can calculate the minimum per group once for rows with r = z and then for all rows within a group. The first non-null value can then be compared to e:
from pyspark.sql import functions as F
from pyspark.sql import Window
df = ...
w = Window.partitionBy("q")
#When ordering is not defined, an unbounded window frame is used by default.
df.withColumn("min_e_with_r_eq_z", F.expr("min(case when r='z' then e else null end)").over(w)) \
.withColumn("min_e_overall", F.min("e").over(w)) \
.withColumn("t", F.coalesce("min_e_with_r_eq_z","min_e_overall") == F.col("e")) \
.orderBy("w") \
.show()
Output:
+---+---+---+---+-----------------+-------------+-----+
| q| w| e| r|min_e_with_r_eq_z|min_e_overall| t|
+---+---+---+---+-----------------+-------------+-----+
| a| 1| 20| y| 22| 20|false|
| a| 2| 22| z| 22| 20| true|
| b| 3| 10| y| null| 10| true|
| b| 4| 12| y| null| 10|false|
+---+---+---+---+-----------------+-------------+-----+
Note: I assume that q is the grouping column for the window.
You can assign row numbers based on whether r = z and the value of column e:
from pyspark.sql import functions as F, Window
df2 = df.withColumn(
    't',
    F.when(
        F.row_number().over(
            Window.partitionBy('q')
                  .orderBy((F.col('r') == 'z').desc(), 'e')
        ) == 1,
        1
    ).otherwise(0)
)
df2.show()
+---+---+---+---+---+
| q| w| e| r| t|
+---+---+---+---+---+
| a| 2| 22| z| 1|
| a| 1| 20| y| 0|
| b| 3| 10| y| 1|
| b| 4| 12| y| 0|
+---+---+---+---+---+
Adding the Spark/Scala version of @werner's accepted answer:
val w = Window.partitionBy("q")
df.withColumn("min_e_with_r_eq_z", min(when($"r" === "z", $"e").otherwise(null)).over(w))
.withColumn("min_e_overall", min("e").over(w))
.withColumn("t", coalesce($"min_e_with_r_eq_z", $"min_e_overall") === $"e")
.orderBy("w")
.show()

Check if a column is consecutive with groupby in pyspark

I have a pyspark dataframe that looks like this:
import pandas as pd
foo = pd.DataFrame({'group': ['a','a','a','b','b','c','c','c'], 'value': [1,2,3,4,5,2,4,5]})
I would like to create a new binary column is_consecutive that indicates if the values in the value column are consecutive by group.
The output should look like this:
foo = pd.DataFrame({'group': ['a','a','a','b','b','c','c','c'], 'value': [1,2,3,4,5,2,4,5],
'is_consecutive': [1,1,1,1,1,0,0,0]})
How could I do that in pyspark?
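The answers below work on a Spark DataFrame; a minimal conversion sketch, assuming an active SparkSession named spark:
# convert the pandas DataFrame from the question into a Spark DataFrame
df = spark.createDataFrame(foo)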
You can use lag to compare values with the previous row and check if they are consecutive, then use min to determine whether all rows are consecutive in a given group.
from pyspark.sql import functions as F, Window
df2 = df.withColumn(
    'consecutive',
    F.coalesce(
        (F.col('value') - F.lag('value').over(Window.partitionBy('group').orderBy('value'))) == 1,
        F.lit(True)
    ).cast('int')
).withColumn(
    'all_consecutive',
    F.min('consecutive').over(Window.partitionBy('group'))
)
df2.show()
+-----+-----+-----------+---------------+
|group|value|consecutive|all_consecutive|
+-----+-----+-----------+---------------+
| c| 2| 1| 0|
| c| 4| 0| 0|
| c| 5| 1| 0|
| b| 4| 1| 1|
| b| 5| 1| 1|
| a| 1| 1| 1|
| a| 2| 1| 1|
| a| 3| 1| 1|
+-----+-----+-----------+---------------+
You can use lead, subtract the current value from it, and then take the max over the window; finally, return 0 if that max is greater than 1, else return 1:
from pyspark.sql import functions as F, Window

w = Window.partitionBy("group").orderBy(F.monotonically_increasing_id())
# take the max over the whole partition (un-ordered window), not a running max
w_all = Window.partitionBy("group")
(foo.withColumn("Diff", F.lead("value").over(w) - F.col("value"))
    .withColumn("is_consecutive", F.when(F.max("Diff").over(w_all) > 1, 0).otherwise(1))
    .drop("Diff")).show()
+-----+-----+--------------+
|group|value|is_consecutive|
+-----+-----+--------------+
| a| 1| 1|
| a| 2| 1|
| a| 3| 1|
| b| 4| 1|
| b| 5| 1|
| c| 2| 0|
| c| 4| 0|
| c| 5| 0|
+-----+-----+--------------+
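As an alternative sketch (not from the original answers), the same per-group flag can be computed with a plain aggregation joined back onto the rows, assuming df is the Spark DataFrame from above: the values of a group are consecutive exactly when they are all distinct and max - min equals count - 1.
from pyspark.sql import functions as F

# 1 when the group's values are all distinct and form a gap-free range, else 0
agg = df.groupBy('group').agg(
    ((F.max('value') - F.min('value') == F.count('value') - 1) &
     (F.countDistinct('value') == F.count('value'))).cast('int').alias('is_consecutive')
)
result = df.join(agg, 'group')   # replicate the group-level flag onto every row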

Drop function doesn't work properly after joining same columns of Dataframe

I am facing this issue while joining two DataFrames A and B.
For example:
c = df_a.join(df_b, [df_a.col1 == df_b.col1], how="left").drop(df_b.col1)
When I try to drop the duplicate column as above, this query does not drop the col1 of df_b. Instead, when I try to drop col1 of df_a, it is able to drop the col1 of df_a.
Could anyone please explain this?
Note: I tried the same in my project, which has more than 200 columns, and it shows the same problem. Sometimes this drop function works properly if we have a few columns, but not if we have more.
Drop function not working after left outer join in pyspark
A function to drop duplicate columns after a merge:
def dropDupeDfCols(df):
    newcols = []
    dupcols = []
    for i in range(len(df.columns)):
        if df.columns[i] not in newcols:
            newcols.append(df.columns[i])
        else:
            dupcols.append(i)
    # rename every column to its positional index so duplicates become distinguishable
    df = df.toDF(*[str(i) for i in range(len(df.columns))])
    for dupcol in dupcols:
        df = df.drop(str(dupcol))
    return df.toDF(*newcols)
I faced some similar issues recently. Let me show them below with your case.
I am creating two dataframes with the same data:
scala> val df_a = Seq((1, 2, "as"), (2,3,"ds"), (3,4,"ew"), (4, 1, "re"), (3,1,"ht")).toDF("a", "b", "c")
df_a: org.apache.spark.sql.DataFrame = [a: int, b: int ... 1 more field]
scala> val df_b = Seq((1, 2, "as"), (2,3,"ds"), (3,4,"ew"), (4, 1, "re"), (3,1,"ht")).toDF("a", "b", "c")
df_b: org.apache.spark.sql.DataFrame = [a: int, b: int ... 1 more field]
Joining them
scala> val df = df_a.join(df_b, df_a("b") === df_b("a"), "leftouter")
df: org.apache.spark.sql.DataFrame = [a: int, b: int ... 4 more fields]
scala> df.show
+---+---+---+---+---+---+
| a| b| c| a| b| c|
+---+---+---+---+---+---+
| 1| 2| as| 2| 3| ds|
| 2| 3| ds| 3| 1| ht|
| 2| 3| ds| 3| 4| ew|
| 3| 4| ew| 4| 1| re|
| 4| 1| re| 1| 2| as|
| 3| 1| ht| 1| 2| as|
+---+---+---+---+---+---+
Let's drop a column that is not present in the above dataframe, for example "z" (a hypothetical column name that does not exist in df):
scala> df.drop("z").show
+---+---+---+---+---+---+
| a| b| c| a| b| c|
+---+---+---+---+---+---+
| 1| 2| as| 2| 3| ds|
| 2| 3| ds| 3| 1| ht|
| 2| 3| ds| 3| 4| ew|
| 3| 4| ew| 4| 1| re|
| 4| 1| re| 1| 2| as|
| 3| 1| ht| 1| 2| as|
+---+---+---+---+---+---+
Ideally we would expect Spark to throw an error, but it executes successfully.
Now, if you drop a column from the above dataframe
scala> df.drop("a").show
+---+---+---+---+
| b| c| b| c|
+---+---+---+---+
| 2| as| 3| ds|
| 3| ds| 1| ht|
| 3| ds| 4| ew|
| 4| ew| 1| re|
| 1| re| 2| as|
| 1| ht| 2| as|
+---+---+---+---+
It drops every column with the provided column name from the dataframe.
If you want to drop specific columns, it should be done as below:
scala> df.drop(df_a("a")).show()
+---+---+---+---+---+
| b| c| a| b| c|
+---+---+---+---+---+
| 2| as| 2| 3| ds|
| 3| ds| 3| 1| ht|
| 3| ds| 3| 4| ew|
| 4| ew| 4| 1| re|
| 1| re| 1| 2| as|
| 1| ht| 1| 2| as|
+---+---+---+---+---+
I don't think Spark accepts the input as given by you (see below):
scala> df.drop(df_a.a).show()
<console>:30: error: value a is not a member of org.apache.spark.sql.DataFrame
df.drop(df_a.a).show()
^
scala> df.drop(df_a."a").show()
<console>:1: error: identifier expected but string literal found.
df.drop(df_a."a").show()
^
If you provide the input to drop as below, it executes but has no impact:
scala> df.drop("df_a.a").show
+---+---+---+---+---+---+
| a| b| c| a| b| c|
+---+---+---+---+---+---+
| 1| 2| as| 2| 3| ds|
| 2| 3| ds| 3| 1| ht|
| 2| 3| ds| 3| 4| ew|
| 3| 4| ew| 4| 1| re|
| 4| 1| re| 1| 2| as|
| 3| 1| ht| 1| 2| as|
+---+---+---+---+---+---+
The reason is that Spark interprets "df_a.a" as a nested column. Since that column is not present, it should ideally have thrown an error, but as explained above, it just executes.
Hope this helps..!!!
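In PySpark, a sketch of two ways to avoid the ambiguity up front, assuming df_a and df_b both contain a column named col1: join on the column name so that Spark keeps a single copy of the join key, or select the wanted columns qualified by their source dataframe.
# option 1: join on the column name (a string or list of names);
# the result then contains a single col1 column instead of two
c = df_a.join(df_b, on="col1", how="left")

# option 2: keep the expression join but select columns qualified by their source dataframe
joined = df_a.join(df_b, df_a.col1 == df_b.col1, how="left")
c = joined.select(df_a["*"], *[df_b[name] for name in df_b.columns if name != "col1"])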

PySpark - How to set the default value for pyspark.sql.functions.lag to a value within the current row?

How does one set the default value for pyspark.sql.functions.lag to a value within the current row?
For example, given:
from pyspark.sql import functions as F
from pyspark.sql.functions import col
from pyspark.sql.window import Window

testInput = [(1, 'a'), (2, 'c'), (3, 'e'), (1, 'a'), (1, 'b'), (1, 'b')]
columns = ['Col A', 'Col B']
df = sc.parallelize(testInput).toDF(columns)
df.show()

windowSpecification = Window.partitionBy(col('Col A')).orderBy(col('Col B'))
changedRows = col('Col B') != F.lag(col('Col B'), 1).over(windowSpecification)
df.select(col('Col A'), col('Col B'), changedRows.alias('New Col C')).show()
which outputs:
+-----+-----+
|Col A|Col B|
+-----+-----+
| 1| a|
| 2| c|
| 3| e|
| 1| a|
| 1| b|
| 1| b|
+-----+-----+
+-----+-----+---------+
|Col A|Col B|New Col C|
+-----+-----+---------+
| 1| a| null|
| 1| a| false|
| 1| b| true|
| 1| b| false|
| 3| e| null|
| 2| c| null|
+-----+-----+---------+
I would like the output to look like:
+-----+-----+---------+
|Col A|Col B|New Col C|
+-----+-----+---------+
| 1| a| false|
| 1| a| false|
| 1| b| true|
| 1| b| false|
| 3| e| false|
| 2| c| false|
+-----+-----+---------+
My current workaround is to add a second lag call to the changedRows, like so:
changedRows = (col('Col B') != F.lag(col('Col B'), 1).over(windowSpecification)) & F.lag(col('Col B'), 1).over(windowSpecification).isNotNull()
but this does not look clean to me.
I would like to do something like
changedRows = col('Col B') != F.lag(col('Col B'), 1, col('Col B')).over(windowSpecification)
but I get the error TypeError: 'Column' object is not callable.
You can use column values as parameters if you use pyspark.sql.functions.expr. In your case, make the following modification to changedRows:
changedRows = F.expr(
"`Col B` != lag(`Col B`, 1, `Col B`) over (PARTITION BY `Col A` ORDER BY `Col B`)"
)
df.select('Col A', 'Col B', changedRows.alias('New Col C')).show()
#+-----+-----+---------+
#|Col A|Col B|New Col C|
#+-----+-----+---------+
#| 1| a| false|
#| 1| a| false|
#| 1| b| true|
#| 1| b| false|
#| 3| e| false|
#| 2| c| false|
#+-----+-----+---------+
You have to refer to the column names in back ticks because of the space.
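As an alternative sketch that stays in the DataFrame API (not part of the original answer), coalescing the lag with the current column gives the same "default to the current row" behaviour:
# fall back to the current row's value when lag() returns null (first row of each partition)
changedRows = col('Col B') != F.coalesce(
    F.lag(col('Col B'), 1).over(windowSpecification),
    col('Col B')
)
df.select(col('Col A'), col('Col B'), changedRows.alias('New Col C')).show()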

pyspark : fetch common data from dataframe when comparing values of given columns

I have two pyspark dataframes like this.
data_frame A
+-----+---+
|name1|id1|
+-----+---+
|    a|  3|
|    b|  5|
|    c|  7|
+-----+---+
data_frame B
+-----+---+
|name2|id2|
+-----+---+
|    a| 13|
|    b| 15|
|    c| 17|
|    d|  6|
|    e|  0|
|    f|  3|
+-----+---+
I want to fetch the contents of dataframe B where the values of name1 (from df A) and name2 (from df B) match, as shown below.
o/p dataframe
+-----+---+
|name2|id2|
+-----+---+
|    a| 13|
|    b| 15|
|    c| 17|
+-----+---+
I want to avoid computationally expensive methods such as collect() etc.
How can this be done in Apache Spark?
df1.join(df2, df1.name1 == df2.name2).select(df2['*'])   # df2['*'] keeps only dataframe B's columns
OR (using SQL):
df1.registerTempTable("tableA")
df2.registerTempTable("tableB")
result = sqlContext.sql("select b.name2, b.id2 from tableA a join tableB b on a.name1 = b.name2")
result.show()
+-----+---+
|name2|id2|
+-----+---+
|    a| 13|
|    b| 15|
|    c| 17|
+-----+---+
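A left semi join is another way to get the same result; it never brings in columns from the other side, so there is nothing to de-duplicate:
# keep only the rows of df2 whose name2 has a matching name1 in df1
df2.join(df1, df2.name2 == df1.name1, 'leftsemi').show()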
