Pyspark join with functions and difference between timestamps - apache-spark

I am trying to join 2 tables of user events. I want to join table_a with table_b by user_id (uid) and only when the difference between the timestamps is smaller than 5s (5000ms).
Here is what I am doing:
table_a = (
    table_a
    .join(
        table_b,
        table_a.uid == table_b.uid
        & abs(table_b.b_timestamp - table_a.a_timestamp) < 5000
        & table_a.a_timestamp.isNotNull(),
        how='left'
    )
)
I am getting 2 errors:
Error 1)
ValueError: Cannot convert column into bool: please use '&' for 'and', '|' for 'or', '~' for 'not' when building DataFrame boolean expressions.
Error 2, when I remove the 2nd condition from the join and leave only the 1st and 3rd:
org.apache.spark.sql.AnalysisException: cannot resolve '(`uid` AND (`a_timestamp` IS NOT NULL))' due to data type mismatch: differing types in '(`uid` AND (`a_timestamp` IS NOT NULL))' (string and boolean).;;
Any help is much appreciated!

You just need parentheses around each filtering condition. For example, the following works:
df1 = spark.createDataFrame([
    (1, 20),
    (1, 21),
    (1, 25),
    (1, 30),
    (2, 21),
], ['id', 'val'])
df2 = spark.createDataFrame([
    (1, 21),
    (2, 30),
], ['id', 'val'])
df1.join(
    df2,
    (df1.id == df2.id)
    & (abs(df1.val - df2.val) < 5)
).show()
# +---+---+---+---+
# | id|val| id|val|
# +---+---+---+---+
# | 1| 20| 1| 21|
# | 1| 21| 1| 21|
# | 1| 25| 1| 21|
# +---+---+---+---+
But without parens:
df1.join(
    df2,
    df1.id == df2.id
    & abs(df1.val - df2.val) < 5
).show()
# ValueError: Cannot convert column into bool: please use '&' for 'and', '|' for 'or', '~' for 'not' when building DataFrame boolean expressions.
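Applied to the original join, the fix is just parentheses around each condition. A minimal sketch, assuming a_timestamp and b_timestamp are numeric epoch values in milliseconds:
from pyspark.sql import functions as F

# Each condition wrapped in parentheses; F.abs works column-wise.
table_a = (
    table_a
    .join(
        table_b,
        (table_a.uid == table_b.uid)
        & (F.abs(table_b.b_timestamp - table_a.a_timestamp) < 5000)
        & (table_a.a_timestamp.isNotNull()),
        how='left'
    )
)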

Related

Drop rows containing specific value in PySpark dataframe

I have a pyspark dataframe like:
A B C
1 NA 9
4 2 5
6 4 2
5 1 NA
I want to delete the rows which contain the value "NA"; in this case, the first and the last row. How can I implement this using Python and Spark?
Update based on comment:
Looking for a solution that removes rows that have the string: NA in any of the many columns.
Just use a dataframe filter expression:
l = [('1', 'NA', '9'),
     ('4', '2', '5'),
     ('6', '4', '2'),
     ('5', 'NA', '1')]
df = spark.createDataFrame(l, ['A', 'B', 'C'])
# The following command requires that the checked columns are strings!
df = df.filter((df.A != 'NA') & (df.B != 'NA') & (df.C != 'NA'))
df.show()
+---+---+---+
| A| B| C|
+---+---+---+
| 4| 2| 5|
| 6| 4| 2|
+---+---+---+
@bluephantom: In case you have hundreds of columns, just generate a string expression via list comprehension:
# In my example, all columns need to be checked
listOfRelevantStringColumns = df.columns
expr = ' and '.join('(%s != "NA")' % col_name for col_name in listOfRelevantStringColumns)
df.filter(expr).show()
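As a sketch of an equivalent approach, the same condition can be built from Column objects instead of a SQL string:
from functools import reduce
from pyspark.sql.functions import col

# AND together one (column != 'NA') predicate per column.
cond = reduce(lambda a, b: a & b,
              [col(c) != 'NA' for c in listOfRelevantStringColumns])
df.filter(cond).show()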
In case you want to remove the row:
df = df.filter((df.A != 'NA') | (df.B != 'NA'))
But sometimes you need to replace the missing value with the mean (for a numeric column) or the most frequent value (for a categorical column). For that, you add a column with the same name that replaces the original column, e.g. "A":
from pyspark.sql.functions import mean, when, lit
# The mean has to be computed first; an aggregate cannot be used directly inside withColumn.
mean_a = df.filter(df.A != "NA").select(mean(df.A.cast("double"))).first()[0]
df = df.withColumn("A", when(df.A == "NA", lit(mean_a)).otherwise(df.A))
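A similar sketch for the categorical case, here assuming column "B" and the placeholder string "NA": pick the most frequent non-placeholder value and substitute it.
from pyspark.sql.functions import when, lit, desc

# Most frequent value of B, ignoring the "NA" placeholder.
most_frequent_b = (df.filter(df.B != "NA")
                     .groupBy("B").count()
                     .orderBy(desc("count"))
                     .first()["B"])
df = df.withColumn("B", when(df.B == "NA", lit(most_frequent_b)).otherwise(df.B))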
In Scala I would have done this differently, but I got to this using pyspark. It is not my favourite answer, which is down to my more limited pyspark knowledge; things seem easier in Scala. Unlike with an array, there is no global match against all columns that can stop as soon as one is found, but the approach is dynamic in terms of the number of columns.
The assumption is that the data does not contain ~~ anywhere; I could have split to an array but decided not to here. None is used instead of NA.
from pyspark.sql import functions as f
data = [(1, None, 4, None),
        (2, 'c', 3, 'd'),
        (None, None, None, None),
        (3, None, None, 'z')]
df = spark.createDataFrame(data, ['k', 'v1', 'v2', 'v3'])
columns = df.columns
columns_Count = len(df.columns)
# colCompare is String
df2 = df.select(df['*'], f.concat_ws('~~', *columns).alias('colCompare') )
df3 = df2.filter(f.size(f.split(f.col("colCompare"), r"~~")) == columns_Count).drop("colCompare")
df3.show()
returns:
+---+---+---+---+
| k| v1| v2| v3|
+---+---+---+---+
| 2| c| 3| d|
+---+---+---+---+
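As a side note, because this example uses genuine nulls (None) rather than the string "NA", the built-in na.drop gives the same result:
# Drop any row containing at least one null.
df.na.drop(how='any').show()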

pyspark replace multiple values with null in dataframe

I have a dataframe (df) and within the dataframe I have a column user_id
df = sc.parallelize([(1, "not_set"),
                     (2, "user_001"),
                     (3, "user_002"),
                     (4, "n/a"),
                     (5, "N/A"),
                     (6, "userid_not_set"),
                     (7, "user_003"),
                     (8, "user_004")]).toDF(["key", "user_id"])
df:
+---+--------------+
|key| user_id|
+---+--------------+
| 1| not_set|
| 2| user_001|
| 3| user_002|
| 4| n/a|
| 5| N/A|
| 6|userid_not_set|
| 7| user_003|
| 8| user_004|
+---+--------------+
I would like to replace the following values: not_set, n/a, N/A and userid_not_set with null.
It would be good if I could add any new values to a list and they too could be changed.
I am currently using a CASE statement within spark.sql to perform this and would like to change it to pyspark.
None inside the when() function corresponds to null. If you wish to fill in anything other than null, you have to put that value in its place.
from pyspark.sql.functions import col, when
df = df.withColumn(
    "user_id",
    when(
        col("user_id").isin('not_set', 'n/a', 'N/A', 'userid_not_set'),
        None
    ).otherwise(col("user_id"))
)
df.show()
+---+--------+
|key| user_id|
+---+--------+
| 1| null|
| 2|user_001|
| 3|user_002|
| 4| null|
| 5| null|
| 6| null|
| 7|user_003|
| 8|user_004|
+---+--------+
You can use the in-built when function, which is the equivalent of a case expression.
from pyspark.sql import functions as f
df.select(df.key, f.when(df.user_id.isin(['not_set', 'n/a', 'N/A', 'userid_not_set']), None).otherwise(df.user_id)).show()
Also the values needed can be stored in a list and be referenced.
val_list = ['not_set', 'n/a', 'N/A', 'userid_not_set']
df.select(df.key, f.when(df.user_id.isin(val_list), None).otherwise(df.user_id)).show()
Please find below a few approaches. I am assuming that all legitimate user IDs start with "user_". Please try the code below.
from pyspark.sql.functions import *
df.withColumn(
    "user_id",
    when(col("user_id").startswith("user_"), col("user_id")).otherwise(None)
).show()
Another one:
cond = """case when user_id in ('not_set', 'n/a', 'N/A', 'userid_not_set') then null
else user_id
end"""
df.withColumn("ID", expr(cond)).show()
Another one:
cond = """case when user_id like 'user_%' then user_id
else null
end"""
df.withColumn("ID", expr(cond)).show()
Another one:
df.withColumn(
    "user_id",
    when(col("user_id").rlike("user_"), col("user_id")).otherwise(None)
).show()

Pyspark: reshape data without aggregation

I want to reshape my data from 4x3 to 2x2 in pyspark without aggregating. My current output is the following:
columns = ['FAULTY', 'value_HIGH', 'count']
vals = [
    (1, 0, 141),
    (0, 0, 140),
    (1, 1, 21),
    (0, 1, 12)
]
What I want is a contingency table with the second column as two new binary columns (value_HIGH_1, value_HIGH_0) and the values from the count column - meaning:
columns = ['FAULTY', 'value_HIGH_1', 'value_HIGH_0']
vals = [
    (1, 21, 141),
    (0, 12, 140)
]
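For reference, the input DataFrame used in the answers below can be built from the first columns/vals pair above (assuming an active SparkSession named spark):
# Input data from the top of the question.
columns = ['FAULTY', 'value_HIGH', 'count']
vals = [(1, 0, 141), (0, 0, 140), (1, 1, 21), (0, 1, 12)]
df = spark.createDataFrame(vals, columns)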
You can use pivot with a fake maximum aggregation (since you have only one element for each group):
import pyspark.sql.functions as F
df.groupBy('FAULTY').pivot('value_HIGH').agg(F.max('count')).selectExpr(
    'FAULTY', '`1` as value_high_1', '`0` as value_high_0'
).show()
+------+------------+------------+
|FAULTY|value_high_1|value_high_0|
+------+------------+------------+
| 0| 12| 140|
| 1| 21| 141|
+------+------------+------------+
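As a side note, you can also pass the pivot values explicitly, which pins the column order and skips the extra pass Spark would otherwise make to infer them. A sketch with the same df and F as above:
# Explicit pivot values: the resulting columns are named '1' and '0'.
df.groupBy('FAULTY').pivot('value_HIGH', [1, 0]).agg(F.max('count')) \
    .withColumnRenamed('1', 'value_high_1') \
    .withColumnRenamed('0', 'value_high_0') \
    .show()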
Using groupBy and pivot is the natural way to do this, but if you want to avoid any aggregation you can achieve it with a filter and a join:
import pyspark.sql.functions as f
df.where("value_HIGH = 1").select("FAULTY", f.col("count").alias("value_HIGH_1"))\
.join(
df.where("value_HIGH = 0").select("FAULTY", f.col("count").alias("value_HIGH_1")),
on="FAULTY"
)\
.show()
#+------+------------+------------+
#|FAULTY|value_HIGH_1|value_HIGH_0|
#+------+------------+------------+
#| 0| 12| 140|
#| 1| 21| 141|
#+------+------------+------------+

pyspark AnalysisException: "Reference '<COLUMN>' is ambiguous [duplicate]

I have two dataframes with the following columns:
df1.columns
// Array(ts, id, X1, X2)
and
df2.columns
// Array(ts, id, Y1, Y2)
After I do
val df_combined = df1.join(df2, df1("ts") === df2("ts") && df1("id") === df2("id"))
I end up with the following columns: Array(ts, id, X1, X2, ts, id, Y1, Y2). I would have expected the common columns to be dropped. Is there something additional that needs to be done?
The simple answer (from the Databricks FAQ on this matter) is to perform the join where the joined columns are expressed as an array of strings (or one string) instead of a predicate.
Below is an example adapted from the Databricks FAQ but with two join columns in order to answer the original poster's question.
Here is the left dataframe:
val llist = Seq(("bob", "b", "2015-01-13", 4), ("alice", "a", "2015-04-23",10))
val left = llist.toDF("firstname","lastname","date","duration")
left.show()
/*
+---------+--------+----------+--------+
|firstname|lastname| date|duration|
+---------+--------+----------+--------+
| bob| b|2015-01-13| 4|
| alice| a|2015-04-23| 10|
+---------+--------+----------+--------+
*/
Here is the right dataframe:
val right = Seq(("alice", "a", 100),("bob", "b", 23)).toDF("firstname","lastname","upload")
right.show()
/*
+---------+--------+------+
|firstname|lastname|upload|
+---------+--------+------+
| alice| a| 100|
| bob| b| 23|
+---------+--------+------+
*/
Here is an incorrect solution, where the join columns are defined as the predicate left("firstname")===right("firstname") && left("lastname")===right("lastname").
The incorrect result is that the firstname and lastname columns are duplicated in the joined data frame:
left.join(right, left("firstname")===right("firstname") &&
left("lastname")===right("lastname")).show
/*
+---------+--------+----------+--------+---------+--------+------+
|firstname|lastname| date|duration|firstname|lastname|upload|
+---------+--------+----------+--------+---------+--------+------+
| bob| b|2015-01-13| 4| bob| b| 23|
| alice| a|2015-04-23| 10| alice| a| 100|
+---------+--------+----------+--------+---------+--------+------+
*/
The correct solution is to define the join columns as an array of strings Seq("firstname", "lastname"). The output data frame does not have duplicated columns:
left.join(right, Seq("firstname", "lastname")).show
/*
+---------+--------+----------+--------+------+
|firstname|lastname| date|duration|upload|
+---------+--------+----------+--------+------+
| bob| b|2015-01-13| 4| 23|
| alice| a|2015-04-23| 10| 100|
+---------+--------+----------+--------+------+
*/
This is expected behavior. The DataFrame.join method is equivalent to a SQL join like this:
SELECT * FROM a JOIN b ON joinExprs
If you want to ignore duplicate columns, just drop them or select the columns of interest afterwards. If you want to disambiguate, you can access these columns using the parent DataFrames:
val a: DataFrame = ???
val b: DataFrame = ???
val joinExprs: Column = ???
a.join(b, joinExprs).select(a("id"), b("foo"))
// drop equivalent
a.alias("a").join(b.alias("b"), joinExprs).drop(b("id")).drop(a("foo"))
or use aliases:
// As of now, aliases don't work with drop
a.alias("a").join(b.alias("b"), joinExprs).select($"a.id", $"b.foo")
For equi-joins there exists a special shortcut syntax which takes either a sequence of strings:
val usingColumns: Seq[String] = ???
a.join(b, usingColumns)
or a single string:
val usingColumn: String = ???
a.join(b, usingColumn)
which keeps only one copy of the columns used in the join condition.
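For completeness, since the question is tagged pyspark: the same shortcut in PySpark is a list of column names (a sketch, with df1 and df2 being the question's dataframes):
# Keeps a single copy of the ts and id columns.
df_combined = df1.join(df2, ["ts", "id"], "inner")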
I had been stuck with this for a while, and only recently did I come up with a solution that is quite easy.
Say a is
scala> val a = Seq(("a", 1), ("b", 2)).toDF("key", "vala")
a: org.apache.spark.sql.DataFrame = [key: string, vala: int]
scala> a.show
+---+----+
|key|vala|
+---+----+
| a| 1|
| b| 2|
+---+----+
and
scala> val b = Seq(("a", 1)).toDF("key", "valb")
b: org.apache.spark.sql.DataFrame = [key: string, valb: int]
scala> b.show
+---+----+
|key|valb|
+---+----+
| a| 1|
+---+----+
and I can do this to select only the columns from dataframe a:
scala> a.join(b, a("key") === b("key"), "left").select(a.columns.map(a(_)) : _*).show
+---+----+
|key|vala|
+---+----+
| a| 1|
| b| 2|
+---+----+
You can simply use this:
df1.join(df2, Seq("ts","id"),"TYPE-OF-JOIN")
Here TYPE-OF-JOIN can be
left
right
inner
fullouter
For example, I have two dataframes like this:
// df1
word count1
w1 10
w2 15
w3 20
// df2
word count2
w1 100
w2 150
w5 200
If you do a fullouter join, then the result looks like this:
df1.join(df2, Seq("word"),"fullouter").show()
word count1 count2
w1 10 100
w2 15 150
w3 20 null
w5 null 200
Try this:
val df_combined = df1.join(df2, df1("ts") === df2("ts") && df1("id") === df2("id")).drop(df2("ts")).drop(df2("id"))
This is normal behavior for SQL. What I do in this case is:
Drop or rename the source columns
Do the join
Drop the renamed column, if any
Here I am replacing the "fullname" column.
Some code in Java:
this
.sqlContext
.read()
.parquet(String.format("hdfs:///user/blablacar/data/year=%d/month=%d/day=%d", year, month, day))
.drop("fullname")
.registerTempTable("data_original");
this
.sqlContext
.read()
.parquet(String.format("hdfs:///user/blablacar/data_v2/year=%d/month=%d/day=%d", year, month, day))
.registerTempTable("data_v2");
this
.sqlContext
.sql(etlQuery)
.repartition(1)
.write()
.mode(SaveMode.Overwrite)
.parquet(outputPath);
Where the query is:
SELECT
d.*,
concat_ws('_', product_name, product_module, name) AS fullname
FROM
{table_source} d
LEFT OUTER JOIN
{table_updates} u ON u.id = d.id
Dropping a column from a list like this is something I believe you can only do with Spark, and it is very helpful!
Inner join is the default join in Spark. Below is the simple syntax for it:
leftDF.join(rightDF, "commonColName")
For other joins you can follow the below syntax:
leftDF.join(rightDF, Seq("common", "columns"), "joinType")
If the column names are not common, then:
leftDF.join(rightDF, leftDF.col("x") === rightDF.col("y"), "joinType")
Best practice is to make the column names different in both DataFrames before joining them, and to drop accordingly.
df1.columns = [id, age, income]
df2.columns = [id, age_group]
df1.join(df2, on=df1.id == df2.id, how='inner').write.saveAsTable('table_name')
will return an error for the duplicate columns. Try this instead:
df2_id_renamed = df2.withColumnRenamed('id', 'id_2')
df1.join(df2_id_renamed, on=df1.id == df2_id_renamed.id_2, how='inner').drop('id_2')
If anyone is using Spark SQL and wants to achieve the same thing, you can use the USING clause in the join query.
val spark = SparkSession.builder().master("local[*]").getOrCreate()
spark.sparkContext.setLogLevel("ERROR")
import spark.implicits._
val df1 = List((1, 4, 3), (5, 2, 4), (7, 4, 5)).toDF("c1", "c2", "C3")
val df2 = List((1, 4, 3), (5, 2, 4), (7, 4, 10)).toDF("c1", "c2", "C4")
df1.createOrReplaceTempView("table1")
df2.createOrReplaceTempView("table2")
spark.sql("select * from table1 inner join table2 using (c1, c2)").show(false)
/*
+---+---+---+---+
|c1 |c2 |C3 |C4 |
+---+---+---+---+
|1 |4 |3 |3 |
|5 |2 |4 |4 |
|7 |4 |5 |10 |
+---+---+---+---+
*/
After I've joined multiple tables together, I run them through a simple function to rename columns in the DF if it encounters duplicates. Alternatively, you could drop these duplicate columns too.
Where Names is a table with columns ['Id', 'Name', 'DateId', 'Description'] and Dates is a table with columns ['Id', 'Date', 'Description'], the columns Id and Description will be duplicated after being joined.
Names = sparkSession.sql("SELECT * FROM Names")
Dates = sparkSession.sql("SELECT * FROM Dates")
NamesAndDates = Names.join(Dates, Names.DateId == Dates.Id, "inner")
NamesAndDates = deDupeDfCols(NamesAndDates, '_')
NamesAndDates.saveAsTable("...", format="parquet", mode="overwrite", path="...")
Where deDupeDfCols is defined as:
def deDupeDfCols(df, separator=''):
    newcols = []
    for col in df.columns:
        if col not in newcols:
            newcols.append(col)
        else:
            for i in range(2, 1000):
                if (col + separator + str(i)) not in newcols:
                    newcols.append(col + separator + str(i))
                    break
    return df.toDF(*newcols)
The resulting data frame will contain columns ['Id', 'Name', 'DateId', 'Description', 'Id2', 'Date', 'Description2'].
Apologies this answer is in Python - I'm not familiar with Scala, but this was the question that came up when I Googled this problem and I'm sure Scala code isn't too different.

Creating a column based upon a list and column in Pyspark

I have a pyspark DataFrame, say df1, with multiple columns.
I also have a list, say, l = ['a','b','c','d'], and these values are a subset of the values present in one of the columns of the DataFrame.
Now, I would like to do something like this:
df2 = df1.withColumn('new_column', expr("case when col_1 in l then 'yes' else 'no' end"))
But this is throwing the following error:
failure: "(" expected but identifier l found.
Any idea how to resolve this error or any better way of doing it?
You can achieve that with the isin function of the Column object:
df1 = sqlContext.createDataFrame([('a', 1), ('b', 2), ('c', 3)], ('col1', 'col2'))
l = ['a', 'b']
from pyspark.sql.functions import *
df2 = df1.withColumn('new_column', when(col('col1').isin(l), 'yes').otherwise('no'))
df2.show()
+----+----+----------+
|col1|col2|new_column|
+----+----+----------+
| a| 1| yes|
| b| 2| yes|
| c| 3| no|
+----+----+----------+
Note: For Spark < 1.5, use inSet instead of isin.
Reference: pyspark.sql.Column documentation
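If you do want to stay with the expr/SQL route from the question, a sketch of one way is to inline the Python list into the SQL string, since l is a Python variable and is not visible to Spark SQL (using the column name col1 from the example above):
from pyspark.sql.functions import expr

# Build "'a', 'b'" from the Python list and splice it into the CASE expression.
in_list = ', '.join("'%s'" % v for v in l)
df2 = df1.withColumn('new_column',
                     expr("case when col1 in (%s) then 'yes' else 'no' end" % in_list))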
