I'm using PySpark on Spark 2.x for this.
I have two SQL DataFrames, df1 and df2. df1 is a union of multiple small DataFrames with the same column names.
df1 = (
    df1_1.union(df1_2)
    .union(df1_3)
    .union(df1_4)
    .union(df1_5)
    .union(df1_6)
    .union(df1_7)
    .distinct()
)
df2 does not have the same column names.
What I'm trying to achieve is to create a new column and fill it with one of two values depending on a condition: if the value in a column of df1 contains an element of a column of df2, write A, otherwise write B.
So I tried something like this:
df1 = df1.withColumn(
    "new_col",
    when(df1["ColA"].substr(0, 4).contains(df2["ColA_a"]), "A").otherwise(
        "B"
    ),
)
All fields are strings.
I also tried using isin, but the error is the same.
Note: substr(0, 4) is there because I only need the first 4 characters of df1["ColA"] to match df2["ColA_a"].
py4j.protocol.Py4JJavaError: An error occurred while calling o660.select. :
org.apache.spark.sql.AnalysisException: Resolved attribute(s) ColA_a#444 missing from
ColA#438,ColB#439 in operator !Project [Contains(ColA#438, ColA_a#444) AS contains(ColA, ColA_a)#451].;;
Solutions I found online and tried:
Cloning the DataFrames
Collecting the df and creating a new df from it (this loses Spark's performance, which is very sad)
Renaming columns to have the same name, or different names (ambiguous naming?)
EDIT:
Here is some sample input and output, as requested:
df1
+-----+-----+-----+
| Col1| ColA| ColB|
+-----+-----+-----+
|value|3062x|value|
|value|2156x|value|
|value|3059x|value|
|value|3044x|value|
|value|2661x|value|
|value|2400x|value|
|value|1907x|value|
|value|4384x|value|
|value|4427x|value|
|value|2091x|value|
+-----+-----+-----+
df2
+------+------+
|ColA_a|ColB_b|
+------+------+
| 2156| GMVT7|
| 2156| JQL71|
| 2156| JZDSQ|
| 2050| GX8PH|
| 2050| G67CV|
| 2050| JFFF7|
| 2031| GCT5C|
| 2170| JN0LB|
| 2129| J2PRG|
| 2091| G87WT|
+------+------+
output
+-----+-----+-----+-------+
| Col1| ColA| ColB|new_col|
+-----+-----+-----+-------+
|value|3062x|value| B |
|value|2156x|value| A |
|value|3059x|value| B |
|value|3044x|value| B |
|value|2661x|value| B |
|value|2400x|value| B |
|value|1907x|value| B |
|value|4384x|value| B |
|value|4427x|value| B |
|value|2091x|value| A |
+-----+-----+-----+-------+
You can use an rlike join to determine whether the value exists in the other column:
df1 = sqlContext.createDataFrame([
    ('value', 3062, 'value'),
    ('value', 2156, 'value'),
    ('value', 3059, 'value'),
    ('value', 3044, 'value'),
    ('value', 2661, 'value'),
    ('value', 2400, 'value'),
    ('value', 1907, 'value'),
    ('value', 4384, 'value'),
    ('value', 4427, 'value'),
    ('value', 2091, 'value'),
], schema=['Col1', 'ColA', 'ColB'])
df2 = sqlContext.createDataFrame([
    (2156, 'GMVT7'),
    (2156, 'JQL71'),
    (2156, 'JZDSQ'),
    (2050, 'GX8PH'),
    (2050, 'G67CV'),
    (2050, 'JFFF7'),
    (2031, 'GCT5C'),
    (2170, 'JN0LB'),
    (2129, 'J2PRG'),
    (2091, 'G87WT'),
], schema=['ColA_a', 'ColB_b'])
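from pyspark.sql import functions as F

# Left join on a regex match; rows without a match get a null ColA_a.
df_join = df1.join(df2.select('ColA_a').distinct(), F.expr("ColA rlike ColA_a"), how='left')
df_fin = df_join.withColumn("new_col", F.when(F.col('ColA_a').isNull(), 'B').otherwise('A'))
df_fin.show()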
+-----+----+-----+------+-------+
| Col1|ColA| ColB|ColA_a|new_col|
+-----+----+-----+------+-------+
|value|3062|value| null| B|
|value|2156|value| 2156| A|
|value|3059|value| null| B|
|value|3044|value| null| B|
|value|2661|value| null| B|
|value|2400|value| null| B|
|value|1907|value| null| B|
|value|4384|value| null| B|
|value|4427|value| null| B|
|value|2091|value| 2091| A|
+-----+----+-----+------+-------+
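If you prefer not to use an rlike join, you can use the isin() method in the join condition instead. Note that isin() tests for exact equality, so it only matches when ColA holds exactly the value of ColA_a.
df_join = df1.join(df2.select('ColA_a').distinct(), F.col('ColA').isin(F.col('ColA_a')), how='left')
df_fin = df_join.withColumn("new_col", F.when(F.col('ColA_a').isNull(), 'B').otherwise('A'))
For the sample data above, the results are the same.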
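One caveat is worth illustrating: rlike is a regular-expression search, so it matches the df2 value anywhere inside ColA, whereas the question only wants the first four characters compared. The plain-Python sketch below needs no Spark session; the sample values and helper names are made up for illustration and only model the two matching rules.

```python
import re

codes = ["2156", "2091"]  # stands in for the distinct values of df2.ColA_a

def flag_by_search(value, codes):
    """Model `ColA rlike ColA_a`: the code may appear anywhere in the value."""
    return "A" if any(re.search(code, value) for code in codes) else "B"

def flag_by_prefix(value, codes):
    """Model the question's intent: compare only the first four characters."""
    return "A" if value[:4] in set(codes) else "B"

# Hypothetical ColA values; the last two contain a code but not as a prefix.
samples = ["2156x", "3062x", "x2156", "12091"]
for value in samples:
    print(value, flag_by_search(value, codes), flag_by_prefix(value, codes))
```

The two rules agree on "2156x" and "3062x" but disagree on "x2156" and "12091". If the prefix rule is what you need, an equality join condition along the lines of F.substring('ColA', 1, 4) == F.col('ColA_a') is one way to express it in Spark, assuming ColA is a string column.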
I have three DataFrames, df1 (EMPLOYEE_INFO), df2 (DEPARTMENT_INFO) and df3 (COMPANY_INFO), and I want to update a column in df1 by joining all three. The column is named FLAG_DEPARTMENT, and I need to set FLAG_DEPARTMENT='POLITICS'. In SQL the query would look like this:
UPDATE [COMPANY_INFO] INNER JOIN ([DEPARTMENT_INFO]
INNER JOIN [EMPLOYEE_INFO] ON [DEPARTMENT_INFO].DEPT_ID = [EMPLOYEE_INFO].DEPT_ID)
ON [COMPANY_INFO].[COMPANY_DEPT_ID] = [DEPARTMENT_INFO].[DEP_COMPANYID]
SET EMPLOYEE_INFO.FLAG_DEPARTMENT = "POLITICS";
If the values in the join columns of these three tables match, I need to set FLAG_DEPARTMENT='POLITICS' in my EMPLOYEE_INFO table.
How can I achieve the same thing in PySpark? I have just started learning PySpark and don't have much in-depth knowledge yet.
You can use a chain of joins with a select on top of it.
Suppose that you have the following pyspark DataFrames:
employee_df
+---------+-------+
| Name|dept_id|
+---------+-------+
| John| dept_a|
| Liù| dept_b|
| Luke| dept_a|
| Michail| dept_a|
| Noe| dept_e|
|Shinchaku| dept_c|
| Vlad| dept_e|
+---------+-------+
department_df
+-------+----------+------------+
|dept_id|company_id| description|
+-------+----------+------------+
| dept_a| company1|Department A|
| dept_b| company2|Department B|
| dept_c| company5|Department C|
| dept_d| company3|Department D|
+-------+----------+------------+
company_df
+----------+-----------+
|company_id|description|
+----------+-----------+
| company1| Company 1|
| company2| Company 2|
| company3| Company 3|
| company4| Company 4|
+----------+-----------+
Then you can run the following code to add the flag_department column to your employee_df:
from pyspark.sql import functions as F

employee_df = (
    employee_df.alias('a')
    .join(
        department_df.alias('b'),
        on='dept_id',
        how='left',
    )
    .join(
        company_df.alias('c'),
        on=F.col('b.company_id') == F.col('c.company_id'),
        how='left',
    )
    .select(
        *[F.col(f'a.{c}') for c in employee_df.columns],
        F.when(
            F.col('b.dept_id').isNotNull() & F.col('c.company_id').isNotNull(),
            F.lit('POLITICS')
        ).alias('flag_department')
    )
)
The new employee_df will be:
+---------+-------+---------------+
| Name|dept_id|flag_department|
+---------+-------+---------------+
| John| dept_a| POLITICS|
| Liù| dept_b| POLITICS|
| Luke| dept_a| POLITICS|
| Michail| dept_a| POLITICS|
| Noe| dept_e| null|
|Shinchaku| dept_c| null|
| Vlad| dept_e| null|
+---------+-------+---------------+
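To make the rule explicit: a row gets POLITICS only when its dept_id is found in department_df and that department's company_id is found in company_df. The sketch below models the same rule in plain Python on the sample data above. The dict and set stand in for the DataFrames, and the function name is hypothetical; it is an illustration of the logic, not Spark code.

```python
employees = [
    ("John", "dept_a"), ("Liù", "dept_b"), ("Luke", "dept_a"),
    ("Michail", "dept_a"), ("Noe", "dept_e"),
    ("Shinchaku", "dept_c"), ("Vlad", "dept_e"),
]
# department_df reduced to dept_id -> company_id
dept_to_company = {
    "dept_a": "company1", "dept_b": "company2",
    "dept_c": "company5", "dept_d": "company3",
}
# company_df reduced to its company_id values
companies = {"company1", "company2", "company3", "company4"}

def flag_department(dept_id):
    """Return 'POLITICS' only if both lookups succeed, else None (null)."""
    company_id = dept_to_company.get(dept_id)
    if company_id is not None and company_id in companies:
        return "POLITICS"
    return None

flags = [(name, dept, flag_department(dept)) for name, dept in employees]
for row in flags:
    print(row)
```

Shinchaku's dept_c maps to company5, which is missing from company_df, so the flag stays null even though the department itself exists, which matches the table above.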
I have two columns in my DataFrame, name1 and name2.
I want to combine them and count the occurrences of each name (without null values!).
df = spark.createDataFrame([
    ["Luc Krier", "Jeanny Thorn"],
    ["Jeanny Thorn", "Ben Weller"],
    ["Teddy E Beecher", "Luc Krier"],
    ["Philippe Schauss", "Jeanny Thorn"],
    ["Meindert I Tholen", "Liam Muller"],
    ["Meindert I Tholen", ""]
]).toDF("name1", "name2")
Desired result:
+------------------------------+
|name |Occurrence |
+------------------------------+
|Luc Krier |2 |
|Jeanny Thorn |3 |
|Teddy E Beecher |1 |
|Philippe Schauss |1 |
|Meindert I Tholen |2 |
|Liam Muller |1 |
|Ben Weller |1 |
+------------------------------+
How can I achieve this?
You can use explode with the array function to merge the columns into one, then simply group by and count, like this:
from pyspark.sql.functions import col, array, explode, count

df.select(explode(array("name1", "name2")).alias("name")) \
    .filter("nullif(name, '') is not null") \
    .groupBy("name") \
    .agg(count("*").alias("Occurrence")) \
    .show()
#+-----------------+----------+
#| name|Occurrence|
#+-----------------+----------+
#|Meindert I Tholen| 2|
#| Jeanny Thorn| 3|
#| Luc Krier| 2|
#| Teddy E Beecher| 1|
#|Philippe Schauss| 1|
#| Ben Weller| 1|
#| Liam Muller| 1|
#+-----------------+----------+
Another way is to select each column, union then group by and count:
df.select(col("name1").alias("name")).union(df.select(col("name2").alias("name"))) \
    .filter("nullif(name, '') is not null") \
    .groupBy("name") \
    .agg(count("name").alias("Occurrence")) \
    .show()
Many fancy answers out there, but the easiest solution should be to do a union and then aggregate the count:
df2 = (df.select('name1')
       .union(df.select('name2'))
       .filter("name1 != ''")
       .groupBy('name1')
       .count()
       .toDF('name', 'Occurrence')
       )
df2.show()
+-----------------+----------+
| name|Occurrence|
+-----------------+----------+
|Meindert I Tholen| 2|
| Jeanny Thorn| 3|
| Luc Krier| 2|
| Teddy E Beecher| 1|
|Philippe Schauss| 1|
| Ben Weller| 1|
| Liam Muller| 1|
+-----------------+----------+
There are better ways to do it, but one naive way is as follows:
from collections import Counter
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("OccurenceCount").getOrCreate()
df = spark.createDataFrame([
    ["Luc Krier", "Jeanny Thorn"],
    ["Jeanny Thorn", "Ben Weller"],
    ["Teddy E Beecher", "Luc Krier"],
    ["Philippe Schauss", "Jeanny Thorn"],
    ["Meindert I Tholen", "Liam Muller"],
    ["Meindert I Tholen", ""]
]).toDF("name1", "name2")
counter_dict = dict(Counter(df.select("name1", "name2").rdd.flatMap(lambda x: x).collect()))
counter_list = list(map(list, counter_dict.items()))
frequency_df = spark.createDataFrame(counter_list, ["name", "Occurrence"])
frequency_df.show()
Output (note that the empty string is still counted here):
+-----------------+----------+
| name|Occurrence|
+-----------------+----------+
| | 1|
| Liam Muller| 1|
| Teddy E Beecher| 1|
| Ben Weller| 1|
| Jeanny Thorn| 3|
| Luc Krier| 2|
|Philippe Schauss| 1|
|Meindert I Tholen| 2|
+-----------------+----------+
Does this work?
# Groupby & count both dataframes individually to reduce size.
df_name1 = (df.groupby(['name1']).count()
            .withColumnRenamed('name1', 'name')
            .withColumnRenamed('count', 'count1'))
df_name2 = (df.groupby(['name2']).count()
            .withColumnRenamed('name2', 'name')
            .withColumnRenamed('count', 'count2'))
# Join the two dataframes containing frequency counts
# Any null value in the 'count' column can be correctly interpreted as zero.
df_count = (df_name1.join(df_name2, on=['name'], how='outer')
            .fillna(0, subset=['count1', 'count2']))
# Sum the two counts and drop the useless columns
df_count = (df_count.withColumn('count', df_count['count1'] + df_count['count2'])
            .drop('count1').drop('count2').dropna(subset=['name']))
# (Optional) While any rows with a null name have been removed, rows with an
# empty string ("") for a name are still there. We can drop the empty name
# rows like this.
df_count = df_count[df_count['name'] != '']
df_count.show()
# +-----------------+-----+
# | name|count|
# +-----------------+-----+
# |Meindert I Tholen| 2|
# | Jeanny Thorn| 3|
# | Luc Krier| 2|
# | Teddy E Beecher| 1|
# |Philippe Schauss| 1|
# | Ben Weller| 1|
# | Liam Muller| 1|
# +-----------------+-----+
You can get the required output in Scala as follows:
import org.apache.spark.sql.functions._
val df = Seq(
  ("Luc Krier", "Jeanny Thorn"),
  ("Jeanny Thorn", "Ben Weller"),
  ("Teddy E Beecher", "Luc Krier"),
  ("Philippe Schauss", "Jeanny Thorn"),
  ("Meindert I Tholen", "Liam Muller"),
  ("Meindert I Tholen", "")
).toDF("name1", "name2")
// Count non-empty names per column (=!= replaces the deprecated !== operator).
val df1 = df.filter($"name1".isNotNull).filter($"name1" =!= "")
  .groupBy("name1").agg(count("name1").as("count1"))
val df2 = df.filter($"name2".isNotNull).filter($"name2" =!= "")
  .groupBy("name2").agg(count("name2").as("count2"))
// Full outer join, treat a missing count as zero, then add the two counts.
val newdf = df1.join(df2, $"name1" === $"name2", "outer")
  .withColumn("count1", when($"count1".isNull, 0).otherwise($"count1"))
  .withColumn("count2", when($"count2".isNull, 0).otherwise($"count2"))
  .withColumn("Count", $"count1" + $"count2")
// Take the name from whichever side of the join is present.
val finalDF = newdf.withColumn("name", when($"name1".isNull, $"name2")
  .when($"name2".isNull, $"name1").otherwise($"name1")).select("name", "Count")
display(finalDF)
You can see the final output in the image below:
I have a case where I may have null values in the column that needs to be summed up in a group.
If I encounter a null in a group, I want the sum of that group to be null. But PySpark by default seems to ignore the null rows and sum-up the rest of the non-null values.
For example:
dataframe = dataframe.groupBy('dataframe.product', 'dataframe.price') \
.agg(f.sum('price'))
Expected output is:
But I am getting:
The sum function returns NULL only if all values in that column are null; otherwise, nulls are simply ignored.
You can use conditional aggregation: if count(price) == count(*), there are no nulls and sum(price) is returned; otherwise, null is returned:
from pyspark.sql import functions as F
df.groupby("product").agg(
    F.when(F.count("price") == F.count("*"), F.sum("price")).alias("sum_price")
).show()
#+-------+---------+
#|product|sum_price|
#+-------+---------+
#| B| 200|
#| C| null|
#| A| 250|
#+-------+---------+
From Spark 3.0 onward, you can also use the any function:
df.groupby("product").agg(
    F.when(~F.expr("any(price is null)"), F.sum("price")).alias("sum_price")
).show()
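The difference between the default behaviour and the conditional aggregation can be modelled in plain Python, with None standing in for null. The prices below are invented sample data chosen to reproduce the sums shown above (the question's actual input was not included), and both function names are hypothetical.

```python
def spark_like_sum(values):
    """Default SUM semantics: nulls are skipped; an all-null group gives null."""
    present = [v for v in values if v is not None]
    return sum(present) if present else None

def strict_sum(values):
    """Conditional aggregation: return the sum only if count(price) == count(*)."""
    non_null_count = sum(1 for v in values if v is not None)
    if non_null_count != len(values):
        return None
    return spark_like_sum(values)

# Invented input: product C contains one null price.
prices = {"A": [100, 150], "B": [200], "C": [100, None]}
for product, values in prices.items():
    print(product, spark_like_sum(values), strict_sum(values))
```

For product C the default sum silently returns 100, while the strict version returns None, which is the behaviour the question asks for.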
You can replace nulls with NaNs using coalesce:
df2 = df.groupBy('product').agg(
    F.sum(
        F.coalesce(F.col('price'), F.lit(float('nan')))
    ).alias('sum(price)')
).orderBy('product')
df2.show()
+-------+----------+
|product|sum(price)|
+-------+----------+
| A| 250.0|
| B| 200.0|
| C| NaN|
+-------+----------+
If you want to keep integer type, you can convert NaNs back to nulls using nanvl:
df2 = df.groupBy('product').agg(
    F.nanvl(
        F.sum(
            F.coalesce(F.col('price'), F.lit(float('nan')))
        ),
        F.lit(None)
    ).cast('int').alias('sum(price)')
).orderBy('product')
df2.show()
+-------+----------+
|product|sum(price)|
+-------+----------+
| A| 250|
| B| 200|
| C| null|
+-------+----------+
I have data like:
id,ts_start,ts_end,foo_start,foo_end
1,1,2,f_s,f_e
2,3,4,foo,bar
3,3,6,foo,f_e
I.e. a single record with all the start and end information aggregated.
Using a flat map, these could be transformed to
id,ts,foo
1,1,f_s
1,2,f_e
How can I do the same using the optimized SQL DSL with explode or maybe pivot?
Edit:
Obviously, I do not want to read in the data twice and union the results.
Or is that the only option if I do not want to use flatMap + serde + custom code?
Given:
val df = Seq(
  (1, 1, 2, "f_s", "f_e"),
  (2, 3, 4, "foo", "bar"),
  (3, 3, 6, "foo", "f_e")
).toDF("id", "ts_start", "ts_end", "foo_start", "foo_end")
you can do:
df
  .select($"id",
    explode(
      array(
        struct($"ts_start".as("ts"), $"foo_start".as("foo")),
        struct($"ts_end".as("ts"), $"foo_end".as("foo"))
      )
    ).as("tmp")
  )
  .select(
    $"id",
    $"tmp.*"
  )
  .show()
which gives:
+---+---+---+
| id| ts|foo|
+---+---+---+
| 1| 1|f_s|
| 1| 2|f_e|
| 2| 3|foo|
| 2| 4|bar|
| 3| 3|foo|
| 3| 6|f_e|
+---+---+---+
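For comparison, the flat map mentioned in the question can be sketched in plain Python: each input record emits two output records, which mirrors what exploding an array of two structs does. The tuple layout and the function name are illustrative assumptions, not part of the original code.

```python
rows = [
    (1, 1, 2, "f_s", "f_e"),
    (2, 3, 4, "foo", "bar"),
    (3, 3, 6, "foo", "f_e"),
]

def split_record(row):
    """Turn one (id, ts_start, ts_end, foo_start, foo_end) record into two."""
    id_, ts_start, ts_end, foo_start, foo_end = row
    return [(id_, ts_start, foo_start), (id_, ts_end, foo_end)]

# Flat map: apply split_record to every row and flatten the results.
flat = [out for row in rows for out in split_record(row)]
for record in flat:
    print(record)
```

The resulting (id, ts, foo) records line up with the table above; the DataFrame version has the advantage of staying inside Spark's optimized SQL engine rather than running custom code per row.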
I have two dataframes that I need to join on one column, keeping only the rows from the first dataframe whose id is also present in the same column of the second dataframe:
df1:
id a b
2 1 1
3 0.5 1
4 1 2
5 2 1
df2:
id c d
2 fs a
5 fa f
Desired output:
df:
id a b
2 1 1
5 2 1
I have tried df1.join(df2("id"), "left"), but it gives me the error: 'DataFrame' object is not callable.
df2("id") is not valid Python syntax for selecting columns; you need either df2[["id"]] or df2.select("id"). For your example, you can do:
df1.join(df2.select("id"), "id").show()
+---+---+---+
| id| a| b|
+---+---+---+
| 5|2.0| 1|
| 2|1.0| 1|
+---+---+---+
or:
df1.join(df2[["id"]], "id").show()
+---+---+---+
| id| a| b|
+---+---+---+
| 5|2.0| 1|
| 2|1.0| 1|
+---+---+---+
If you only need to check whether id exists in df2 and do not need any df2 columns in your output, isin() can be a simpler solution (this is similar to EXISTS and IN in SQL). Note that the approach below collects the ids to the driver, so it suits a df2 that is small.
df1 = spark.createDataFrame([(2,1,1) ,(3,5,1,),(4,1,2),(5,2,1)], "id: Int, a : Int , b : Int")
df2 = spark.createDataFrame([(2,'fs','a') ,(5,'fa','f')], ['id','c','d'])
Collect df2.id into a list and pass it to isin() on df1:
from pyspark.sql.functions import col
df2_list = df2.select('id').rdd.map(lambda row : row[0]).collect()
df1.where(col('id').isin(df2_list)).show()
#+---+---+---+
#| id| a| b|
#+---+---+---+
#| 2| 1| 1|
#| 5| 2| 1|
#+---+---+---+
It is recommended to use isin() if:
You don't need to return data from the reference dataframe/table
You have duplicates in the reference dataframe/table (a JOIN can produce duplicate rows if values are repeated)
You just want to check for the existence of a particular value
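The point about duplicates can be seen in a small plain-Python model. The reference ids below are hypothetical, with 5 repeated on purpose; the list comprehension imitates an inner join and the set lookup imitates isin().

```python
df1_rows = [(2, 1, 1), (3, 5, 1), (4, 1, 2), (5, 2, 1)]  # (id, a, b)
ref_ids = [2, 5, 5]  # hypothetical reference ids; 5 appears twice

# Inner-join model: one output row per matching (row, reference id) pair.
joined = [row for row in df1_rows for ref in ref_ids if row[0] == ref]

# isin() model: a pure membership test, so each df1 row appears at most once.
ref_set = set(ref_ids)
filtered = [row for row in df1_rows if row[0] in ref_set]

print(joined)
print(filtered)
```

The join model returns the row with id 5 twice, while the membership test returns it once. When df2 is too large to collect, a left semi join (how='left_semi' in DataFrame.join) may be worth considering, since it also filters on existence without duplicating rows.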