UDF lookup mapping a pyspark dataframe column - apache-spark

I have a pyspark.sql.dataframe.DataFrame object df which contains a continent and a country code.
I also have a dictionary of dictionaries, dicts, which contains the lookup values for each column.
import pyspark.sql.functions as F
import pyspark.sql.types as T

df = sc.parallelize([('A1','JP'),('A1','CH'),('A2','CA'),
                     ('A2','US')]).toDF(['Continent','Country'])

dicts = sc.broadcast(dict([('Country', dict([
                               ('US', 'USA'),
                               ('JP', 'Japan'),
                               ('CA', 'Canada'),
                               ('CH', 'China')
                           ])),
                           ('Continent', dict([
                               ('A1', 'Asia'),
                               ('A2', 'America')]))
                          ]))
+---------+-------+
|Continent|Country|
+---------+-------+
| A1| JP|
| A1| CH|
| A2| CA|
| A2| US|
+---------+-------+
I want to replace both Country and Continent with their lookup values. This is what I have tried:
preprocess_request = F.udf(lambda colname, key:
                           dicts.value[colname].get[key],
                           T.StringType())

df.withColumn('Continent', preprocess_request('Continent', F.col('Continent')))\
  .withColumn('Country', preprocess_request('Country', F.col('Country')))\
  .display()
but it gives me an error saying the object is not subscriptable.
What I expect is exactly this:
+---------+-------+
|Continent|Country|
+---------+-------+
| Asia| Japan|
| Asia| China|
| America| Canada|
| America| USA|
+---------+-------+

There is a problem with the arguments to your function: when you specify 'Continent' it's treated as a column name, not a fixed value, so when your UDF is called the value of that column is passed, not the word Continent. (The "not subscriptable" error itself comes from dicts.value[colname].get[key] - get is a method and must be called as get(key), not indexed with brackets.) To fix this, you need to wrap Continent and Country into F.lit:
preprocess_request = F.udf(lambda colname, key:
                           dicts.value.get(colname, {}).get(key),
                           T.StringType())

df.withColumn('Continent', preprocess_request(F.lit('Continent'), F.col('Continent')))\
  .withColumn('Country', preprocess_request(F.lit('Country'), F.col('Country')))\
  .display()
With that, it gives the correct result:
+---------+-------+
|Continent|Country|
+---------+-------+
| Asia| Japan|
| Asia| China|
| America| Canada|
| America| USA|
+---------+-------+
But you really don't need a UDF for this, as UDFs are slow due to serialization overhead. It can be much faster if you use the native PySpark APIs and represent the dictionaries as Spark map literals. Something like this:
continents = F.expr("map('A1','Asia', 'A2','America')")
countries = F.expr("map('US', 'USA', 'JP', 'Japan', 'CA', 'Canada', 'CH', 'China')")

df.withColumn('Continent', continents[F.col('Continent')])\
  .withColumn('Country', countries[F.col('Country')])\
  .show()
gives you the same answer, but should be much faster:
+---------+-------+
|Continent|Country|
+---------+-------+
| Asia| Japan|
| Asia| China|
| America| Canada|
| America| USA|
+---------+-------+
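If you'd rather not hard-code the mappings inside expr, you can also build the same map columns from the Python dictionaries with create_map and itertools.chain. A small sketch, using plain (un-broadcast) dicts with the same content as in the question:
from itertools import chain
import pyspark.sql.functions as F

# plain Python lookup dictionaries, same content as in the question
country_map = {'US': 'USA', 'JP': 'Japan', 'CA': 'Canada', 'CH': 'China'}
continent_map = {'A1': 'Asia', 'A2': 'America'}

# create_map takes alternating key/value literal columns
continents = F.create_map(*[F.lit(x) for x in chain(*continent_map.items())])
countries = F.create_map(*[F.lit(x) for x in chain(*country_map.items())])

df.withColumn('Continent', continents[F.col('Continent')])\
  .withColumn('Country', countries[F.col('Country')])\
  .show()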

I would use a pandas udf instead of a plain udf. pandas udfs are vectorized.
Option 1
from typing import Iterator
import pandas as pd

def map_dict(iterator: Iterator[pd.DataFrame]) -> Iterator[pd.DataFrame]:
    for pdf in iterator:
        Continent = pdf.Continent
        Country = pdf.Country
        yield pdf.assign(Continent=Continent.map(dicts.value['Continent']),
                         Country=Country.map(dicts.value['Country']))

df.mapInPandas(map_dict, schema=df.schema).show()
Option 2
Please note though this is likely to incur a shuffle.
import pandas as pd

def map_dict(pdf: pd.DataFrame) -> pd.DataFrame:
    Continent = pdf.Continent
    Country = pdf.Country
    return pdf.assign(Continent=Continent.map(dicts.value['Continent']),
                      Country=Country.map(dicts.value['Country']))

df.groupby("Continent", "Country").applyInPandas(map_dict, schema=df.schema).show()
+---------+-------+
|Continent|Country|
+---------+-------+
| Asia| China|
| Asia| Japan|
| America| Canada|
| America| USA|
+---------+-------+

Related

How to update dataframe column value while joinining with other dataframe in pyspark?

I have 3 DataFrames, df1 (EMPLOYEE_INFO), df2 (DEPARTMENT_INFO) and df3 (COMPANY_INFO), and I want to update a column which is in df1 by joining all three dataframes. The name of the column is FLAG_DEPARTMENT, which is in df1. I need to set FLAG_DEPARTMENT='POLITICS'. In SQL the query would look like this:
UPDATE [COMPANY_INFO] INNER JOIN ([DEPARTMENT_INFO]
INNER JOIN [EMPLOYEE_INFO] ON [DEPARTMENT_INFO].DEPT_ID = [EMPLOYEE_INFO].DEPT_ID)
ON [COMPANY_INFO].[COMPANY_DEPT_ID] = [DEPARTMENT_INFO].[DEP_COMPANYID]
SET EMPLOYEE_INFO.FLAG_DEPARTMENT = "POLITICS";
If the values in the columns of these three tables match, I need to set FLAG_DEPARTMENT='POLITICS' in my EMPLOYEE_INFO table.
How can I achieve the same thing in pyspark? I have just started learning pyspark and don't have much in-depth knowledge yet.
You can use a chain of joins with a select on top of it.
Suppose that you have the following pyspark DataFrames:
employee_df
+---------+-------+
| Name|dept_id|
+---------+-------+
| John| dept_a|
| Liù| dept_b|
| Luke| dept_a|
| Michail| dept_a|
| Noe| dept_e|
|Shinchaku| dept_c|
| Vlad| dept_e|
+---------+-------+
department_df
+-------+----------+------------+
|dept_id|company_id| description|
+-------+----------+------------+
| dept_a| company1|Department A|
| dept_b| company2|Department B|
| dept_c| company5|Department C|
| dept_d| company3|Department D|
+-------+----------+------------+
company_df
+----------+-----------+
|company_id|description|
+----------+-----------+
| company1| Company 1|
| company2| Company 2|
| company3| Company 3|
| company4| Company 4|
+----------+-----------+
Then you can run the following code to add the flag_department column to your employee_df:
from pyspark.sql import functions as F

employee_df = (
    employee_df.alias('a')
    .join(
        department_df.alias('b'),
        on='dept_id',
        how='left',
    )
    .join(
        company_df.alias('c'),
        on=F.col('b.company_id') == F.col('c.company_id'),
        how='left',
    )
    .select(
        *[F.col(f'a.{c}') for c in employee_df.columns],
        F.when(
            F.col('b.dept_id').isNotNull() & F.col('c.company_id').isNotNull(),
            F.lit('POLITICS')
        ).alias('flag_department')
    )
)
The new employee_df will be:
+---------+-------+---------------+
| Name|dept_id|flag_department|
+---------+-------+---------------+
| John| dept_a| POLITICS|
| Liù| dept_b| POLITICS|
| Luke| dept_a| POLITICS|
| Michail| dept_a| POLITICS|
| Noe| dept_e| null|
|Shinchaku| dept_c| null|
| Vlad| dept_e| null|
+---------+-------+---------------+
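If employee_df already carries a FLAG_DEPARTMENT column and you only want to overwrite it when all the joins match (closer to the UPDATE semantics in the question), you can add an otherwise clause. A sketch, assuming that column is already present:
from pyspark.sql import functions as F

employee_df = (
    employee_df.alias('a')
    .join(department_df.alias('b'), on='dept_id', how='left')
    .join(
        company_df.alias('c'),
        on=F.col('b.company_id') == F.col('c.company_id'),
        how='left',
    )
    .select(
        # keep every original column except the flag we are about to rebuild
        *[F.col(f'a.{c}') for c in employee_df.columns if c != 'FLAG_DEPARTMENT'],
        F.when(
            F.col('b.dept_id').isNotNull() & F.col('c.company_id').isNotNull(),
            F.lit('POLITICS')
        ).otherwise(F.col('a.FLAG_DEPARTMENT')).alias('FLAG_DEPARTMENT')
    )
)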

Explode multiple array columns in pyspark [duplicate]

I have a dataframe which consists of lists in its columns, similar to the following. The lengths of the lists are not the same in all columns.
Name Age Subjects Grades
[Bob] [16] [Maths,Physics,Chemistry] [A,B,C]
I want to explode the dataframe in such a way that I get the following output:
Name Age Subjects Grades
Bob 16 Maths A
Bob 16 Physics B
Bob 16 Chemistry C
How can I achieve this?
PySpark added an arrays_zip function in 2.4, which eliminates the need for a Python UDF to zip the arrays.
import pyspark.sql.functions as F
from pyspark.sql.types import *

df = sql.createDataFrame(
    [(['Bob'], [16], ['Maths','Physics','Chemistry'], ['A','B','C'])],
    ['Name','Age','Subjects', 'Grades'])

df = df.withColumn("new", F.arrays_zip("Subjects", "Grades"))\
       .withColumn("new", F.explode("new"))\
       .select("Name", "Age", F.col("new.Subjects").alias("Subjects"), F.col("new.Grades").alias("Grades"))
df.show()
+-----+----+---------+------+
| Name| Age| Subjects|Grades|
+-----+----+---------+------+
|[Bob]|[16]| Maths| A|
|[Bob]|[16]| Physics| B|
|[Bob]|[16]|Chemistry| C|
+-----+----+---------+------+
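If you also need Name and Age as plain values rather than single-element arrays, you can take the first element of each. A small follow-up on the result above:
# pull the single element out of the Name and Age arrays
df.select(F.col("Name").getItem(0).alias("Name"),
          F.col("Age").getItem(0).alias("Age"),
          "Subjects", "Grades").show()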
This works,
import pyspark.sql.functions as F
from pyspark.sql.types import *

df = sql.createDataFrame(
    [(['Bob'], [16], ['Maths','Physics','Chemistry'], ['A','B','C'])],
    ['Name','Age','Subjects', 'Grades'])
df.show()
+-----+----+--------------------+---------+
| Name| Age| Subjects| Grades|
+-----+----+--------------------+---------+
|[Bob]|[16]|[Maths, Physics, ...|[A, B, C]|
+-----+----+--------------------+---------+
Use a udf with zip. The columns that need to be exploded have to be merged before exploding.
combine = F.udf(lambda x, y: list(zip(x, y)),
                ArrayType(StructType([StructField("subs", StringType()),
                                      StructField("grades", StringType())])))

df = df.withColumn("new", combine("Subjects", "Grades"))\
       .withColumn("new", F.explode("new"))\
       .select("Name", "Age", F.col("new.subs").alias("Subjects"), F.col("new.grades").alias("Grades"))
df.show()
+-----+----+---------+------+
| Name| Age| Subjects|Grades|
+-----+----+---------+------+
|[Bob]|[16]| Maths| A|
|[Bob]|[16]| Physics| B|
|[Bob]|[16]|Chemistry| C|
+-----+----+---------+------+
Arriving late to the party :-)
The simplest way to go is to use inline, which doesn't have a Python API but is supported by selectExpr.
df.selectExpr('Name[0] as Name','Age[0] as Age','inline(arrays_zip(Subjects,Grades))').show()
+----+---+---------+------+
|Name|Age| Subjects|Grades|
+----+---+---------+------+
| Bob| 16| Maths| A|
| Bob| 16| Physics| B|
| Bob| 16|Chemistry| C|
+----+---+---------+------+
Have you tried this? (It assumes Subjects is a comma-separated string rather than an array.)
from pyspark.sql.functions import col, explode, split
df.select(explode(split(col("Subjects"), ",")).alias("Subjects")).show()
You can convert the data frame to an RDD. For an RDD you can use a flatMap function to separate the Subjects.
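A minimal sketch of that approach, assuming df is the one-row example from the question (so Name and Age are single-element lists):
# zip Subjects with Grades inside each row and emit one output row per pair
exploded = df.rdd.flatMap(
    lambda row: [(row.Name[0], row.Age[0], subject, grade)
                 for subject, grade in zip(row.Subjects, row.Grades)]
).toDF(['Name', 'Age', 'Subjects', 'Grades'])

exploded.show()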
A copy/paste function if you need to repeat this quickly and easily across a large number of columns in a dataset:
from pyspark.sql import functions as f

cols = ["word", "stem", "pos", "ner"]

def explode_cols(data, cols):
    data = data.withColumn('exp_combo', f.arrays_zip(*cols))
    data = data.withColumn('exp_combo', f.explode('exp_combo'))
    for col in cols:
        data = data.withColumn(col, f.col('exp_combo.' + col))
    return data.drop(f.col('exp_combo'))

result = explode_cols(data, cols)
You're welcome :)
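Applied to the DataFrame from this question, the call would look something like this (Name and Age stay as single-element arrays):
# explode the two array columns that line up with each other
result = explode_cols(df, ["Subjects", "Grades"])
result.show()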
When exploding multiple columns, the above solution only works when the arrays have the same length. If they do not, it is better to explode them separately and take distinct values each time.
import pyspark.sql.functions as F

df = sql.createDataFrame(
    [(['Bob'], [16], ['Maths','Physics','Chemistry'], ['A','B','C'])],
    ['Name','Age','Subjects', 'Grades'])

df = df.withColumn('Subjects', F.explode('Subjects')).select('Name', 'Age', 'Subjects', 'Grades').distinct()
df = df.withColumn('Grades', F.explode('Grades')).select('Name', 'Age', 'Subjects', 'Grades').distinct()
df.show()
+----+---+---------+------+
|Name|Age| Subjects|Grades|
+----+---+---------+------+
| Bob| 16| Maths| A|
| Bob| 16| Physics| B|
| Bob| 16|Chemistry| C|
+----+---+---------+------+
Thanks #nasty for saving the day.
Just small tweaks to get the code working.
from pyspark.sql.functions import arrays_zip, col, explode

def explode_cols(df, cl):
    df = df.withColumn('exp_combo', arrays_zip(*cl))
    df = df.withColumn('exp_combo', explode('exp_combo'))
    for colm in cl:
        df = df.withColumn(colm, col('exp_combo.' + colm))
    return df.drop(col('exp_combo'))

Conditional replacement of values in pyspark dataframe

I have the spark dataframe below:
+----------+-------------+--------------+------------+----------+-------------------+
| part| company| country| city| price| date|
+----------+-------------+--------------+------------+----------+-------------------+
| 52125-136| Brainsphere| null| Braga| 493.94€|2016-05-10 11:13:43|
| 70253-307|Chatterbridge| Spain| Barcelona| 969.29€|2016-05-10 13:06:30|
| 50563-113| Kanoodle| Japan| Niihama| ¥72909.95|2016-05-10 13:11:57|
|52380-1102| Flipstorm| France| Nanterre| 794.84€|2016-05-10 13:19:12|
| 54473-578| Twitterbeat| France| Annecy| 167.48€|2016-05-10 15:09:46|
| 76335-006| Ntags| Portugal| Lisbon| 373.07€|2016-05-10 15:20:22|
| 49999-737| Buzzbean| Germany| Düsseldorf| 861.2€|2016-05-10 15:21:51|
| 68233-011| Flipstorm| Greece| Athens| 512.89€|2016-05-10 15:22:03|
| 36800-952| Eimbee| France| Amiens| 219.74€|2016-05-10 21:22:46|
| 16714-295| Teklist| null| Arnhem| 624.4€|2016-05-10 21:57:15|
| 42254-213| Thoughtmix| Portugal| Amadora| 257.99€|2016-05-10 22:01:04|
From these columns, only the country column has null values. What I want to do is fill the null values with the country that corresponds to the city column on the right. The dataframe is big, and there are rows where Braga (for example) has its country filled in and other rows where it does not.
So, how can I fill those null values in the country column based on the city column, while still taking advantage of Spark's parallel computation?
You can use a window function for that; with ignorenulls=True, first picks a non-null country from any other row with the same city:
from pyspark.sql import functions as F, Window

df.withColumn(
    "country",
    F.coalesce(
        F.col("country"),
        F.first("country", ignorenulls=True).over(Window.partitionBy("city")),
    ),
).show()
Use the coalesce function in Spark to get the first non-null value from a list of columns.
Example:
df.show()
#+--------+---------+
#| country| city|
#+--------+---------+
#| null| Braga|
#| Spain|Barcelona|
#| null| Arnhem|
#|portugal| Amadora|
#+--------+---------+
from pyspark.sql.functions import *
df.withColumn("country",coalesce(col("country"),col("city"))).show()
#+--------+---------+
#| country| city|
#+--------+---------+
#| Braga| Braga|
#| Spain|Barcelona|
#| Arnhem| Arnhem|
#|portugal| Amadora|
#+--------+---------+
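Another option is to derive a city-to-country lookup from the rows where country is not null and join it back, so the null is replaced by the actual country instead of the city name. A sketch (the lookup name and country_lookup column here are just illustrative):
from pyspark.sql import functions as F

# distinct city -> country pairs taken from the rows that already have a country
city_country = (df.where(F.col("country").isNotNull())
                  .select("city", F.col("country").alias("country_lookup"))
                  .distinct())

# left-join the lookup back and keep the existing value when it is present
df = (df.join(city_country, on="city", how="left")
        .withColumn("country", F.coalesce(F.col("country"), F.col("country_lookup")))
        .drop("country_lookup"))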

How to add column with alternate values in PySpark dataframe?

I have the following sample dataframe
df = spark.createDataFrame([('start','end'), ('start1','end1')] ,["start", "end"])
and I want to explode the values in each row and associate alternating 1-0 values in the generated rows. This way I can identify the start/end entries in each row.
I am able to achieve the desired result this way
from pyspark.sql.window import Window
from pyspark.sql import functions as fn
from pyspark.sql.functions import lit

w = Window().orderBy(lit('A'))
df = (df.withColumn('start_end', fn.array('start', 'end'))
        .withColumn('date', fn.explode('start_end'))
        .withColumn('row_num', fn.row_number().over(w)))
df = (df.withColumn('is_start', fn.when(fn.col('row_num') % 2 == 0, 0).otherwise(1))
        .select('date', 'is_start'))
which gives
| date | is_start |
|--------|----------|
| start | 1 |
| end | 0 |
| start1 | 1 |
| end1 | 0 |
but it seems overly complicated for such a simple task.
Is there any better/cleaner way without using UDFs?
You can use pyspark.sql.functions.posexplode along with pyspark.sql.functions.array.
First create an array out of your start and end columns, then explode this with the position:
from pyspark.sql.functions import array, posexplode
df.select(posexplode(array("end", "start")).alias("is_start", "date")).show()
#+--------+------+
#|is_start| date|
#+--------+------+
#| 0| end|
#| 1| start|
#| 0| end1|
#| 1|start1|
#+--------+------+
You can try union:
from pyspark.sql import functions as F

df = spark.createDataFrame([('start','end'), ('start1','end1')], ["start", "end"])
df = df.withColumn('startv', F.lit(1))
df = df.withColumn('endv', F.lit(0))
df = df.select(['start', 'startv']).union(df.select(['end', 'endv']))
df.show()
+------+------+
| start|startv|
+------+------+
| start| 1|
|start1| 1|
| end| 0|
| end1| 0|
+------+------+
You can rename the columns and re-order the rows from here, as sketched below.
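A small follow-up sketch for the renaming step (re-ordering back into the original start/end pairs would additionally need an explicit sort key, which the unioned frame does not carry):
# rename the unioned columns to match the expected output
result = (df.withColumnRenamed('start', 'date')
            .withColumnRenamed('startv', 'is_start'))
result.show()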
I had a similar situation in my use case. I had a huge dataset (~50 GB), and any self join or heavy transformation resulted in higher memory pressure and unstable execution.
I went one level lower, to the RDD, and used flatMap. This uses a map-side transformation, which is cost effective in terms of shuffle, CPU and memory.
df = spark.createDataFrame([('start','end'), ('start1','end1')] ,["start", "end"])
df.show()
+------+----+
| start| end|
+------+----+
| start| end|
|start1|end1|
+------+----+
final_df = df.rdd.flatMap(lambda row: [(row.start, 1), (row.end, 0)]).toDF(['date', 'is_start'])
final_df.show()
+------+--------+
| date|is_start|
+------+--------+
| start| 1|
| end| 0|
|start1| 1|
| end1| 0|
+------+--------+

Concat multiple columns of a dataframe using pyspark

Suppose I have a list of columns, for example:
col_list = ['col1','col2']
df = spark.read.json(path_to_file)
print(df.columns)
# ['col1','col2','col3']
I need to create a new column by concatenating col1 and col2. I don't want to hard-code the column names while concatenating, but need to pick them from the list.
How can I do this?
You can use pyspark.sql.functions.concat() to concatenate as many columns as you specify in your list. Keep on passing them as arguments.
from pyspark.sql.functions import concat
# Creating an example DataFrame
values = [('A1',11,'A3','A4'),('B1',22,'B3','B4'),('C1',33,'C3','C4')]
df = sqlContext.createDataFrame(values,['col1','col2','col3','col4'])
df.show()
+----+----+----+----+
|col1|col2|col3|col4|
+----+----+----+----+
| A1| 11| A3| A4|
| B1| 22| B3| B4|
| C1| 33| C3| C4|
+----+----+----+----+
In the concat() function, you pass all the columns you need to concatenate - like concat('col1','col2'). If you have a list, you can unpack it using *, so concat(*['col1','col2']) is equivalent to concat('col1','col2').
col_list = ['col1','col2']
df = df.withColumn('concatenated_cols',concat(*col_list))
df.show()
+----+----+----+----+-----------------+
|col1|col2|col3|col4|concatenated_cols|
+----+----+----+----+-----------------+
| A1| 11| A3| A4| A111|
| B1| 22| B3| B4| B122|
| C1| 33| C3| C4| C133|
+----+----+----+----+-----------------+
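If you want a separator between the values, concat_ws works the same way, with the separator as the first argument. A small variation on the example above; unlike concat, concat_ws also skips nulls instead of turning the whole result into null:
from pyspark.sql.functions import concat_ws

# same unpacking trick, with '_' inserted between the concatenated values
df = df.withColumn('concatenated_cols', concat_ws('_', *col_list))
df.show()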
