Updating a column in pyspark dependent on the column's current value - apache-spark

Let's say we are given a DataFrame:
+-----+-----+-----+
| x| y| z|
+-----+-----+-----+
| 3| 5| 9|
| 2| 4| 6|
+-----+-----+-----+
I want to multiply the value in column z by the value in column y wherever column z equals 6.
This post shows the solution I am aiming for, using the code:
from pyspark.sql import functions as F
df = df.withColumn('z',
                   F.when(df['z'] == 6, df['z'] * df['y'])
                    .otherwise(df['z']))
The problem is that df['z'] and df['y'] are recognized as Column objects and casting them won't work...
How can I do this correctly?

from pyspark.sql import functions as F
from pyspark.sql.types import LongType
df = df.withColumn('new_col',
                   F.when(df.z == 6,
                          df.z.cast(LongType()) * df.y.cast(LongType())
                   ).otherwise(df.z))
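If you want to overwrite z itself rather than add a new_col (which is what the question asks for), a minimal variant of the same pattern, assuming the same df as above, would be:
from pyspark.sql import functions as F
from pyspark.sql.types import LongType

# Overwrite z in place: multiply by y only where z == 6, otherwise keep z as is.
df = df.withColumn('z',
                   F.when(df.z == 6,
                          df.z.cast(LongType()) * df.y.cast(LongType())
                   ).otherwise(df.z))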

Related

Grouping in pySpark Dataframes

I am using spark dataframes.
The task is to calculate and display, in descending order, the number of cities per country, grouped by country and region.
Initial data:
from pyspark.sql.functions import col
from pyspark.sql.functions import count
df = spark.read.json("/content/world-cities.json")
df.printSchema()
df.show()
Desired result: [screenshot in the original post]
I can only get grouping by the country column.
How do I add grouping by the second column, subcountry?
df.groupBy(col('country')).agg(count("*").alias("cnt"))\
.orderBy(col('cnt').desc())\
.show()
If I understand you correctly, you just need to add the second column to your groupBy:
import pyspark.sql.functions as F
x = [("USA","usa-subcountry", "usa-city"),("USA","usa-subcountry", "usa-city-2"),("USA","usa-subcountry-2", "usa-city"), ("Argentina","argentina-subcountry", "argentina-city")]
df = spark.createDataFrame(x, schema=['country', 'subcountry', 'city'])
df.groupBy(F.col('country'), F.col('subcountry')).agg(F.count("*").alias("cnt"))\
.orderBy(F.col('cnt').desc())\
.show()
Output is:
+---------+--------------------+---+
| country| subcountry|cnt|
+---------+--------------------+---+
| USA| usa-subcountry| 2|
| USA| usa-subcountry-2| 1|
|Argentina|argentina-subcountry| 1|
+---------+--------------------+---+
Edit: another try based on comment:
import pyspark.sql.functions as F
x = [("USA","usa-subcountry", "usa-city"),
("USA","usa-subcountry", "usa-city-2"),
("USA","usa-subcountry", "usa-city-3"),
("USA","usa-subcountry-2", "usa-city"),
("Argentina","argentina-subcountry", "argentina-city"),
("Argentina","argentina-subcountry-2", "argentina-city-2"),
("UK","UK-subcountry", "UK-city-1")]
df = spark.createDataFrame(x, schema=['country', 'subcountry', 'city'])
df.groupBy(F.col('country'), F.col('subcountry')).agg(F.count("*").alias("city_count"))\
.groupBy(F.col('country')).agg(F.count("*").alias("subcountry_count"), F.sum('city_count').alias("city_count"))\
.orderBy(F.col('city_count').desc())\
.show()
output:
+---------+----------------+----------+
| country|subcountry_count|city_count|
+---------+----------------+----------+
| USA| 2| 4|
|Argentina| 2| 2|
| UK| 1| 1|
+---------+----------------+----------+
I am assuming that cities and subcountries are unique; if not, you may consider using countDistinct instead of count.
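For example, a sketch of the same aggregation with countDistinct (assuming the country, subcountry and city column names from the example above), so repeated city names are only counted once per country:
import pyspark.sql.functions as F

# Count distinct subcountries and distinct cities per country.
df.groupBy(F.col('country'))\
  .agg(F.countDistinct('subcountry').alias("subcountry_count"),
       F.countDistinct('city').alias("city_count"))\
  .orderBy(F.col('city_count').desc())\
  .show()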

Spark dataframe foreachPartition: sum the elements using pyspark

I am trying to partition a Spark dataframe and sum the elements in each partition using pyspark, but I am unable to do this inside the called function "sumByHour". Basically, I am unable to access the dataframe columns inside "sumByHour".
I am partitioning by the "hour" column and trying to sum the elements within each "hour" partition. The expected output is 6, 15, 24 for hours 0, 1, 2 respectively. I tried the code below with no luck.
from pyspark.sql.functions import *
from pyspark.sql.types import *
import pandas as pd
def sumByHour(ip):
    print(ip)

pandasDF = pd.DataFrame({'hour': [0, 0, 0, 1, 1, 1, 2, 2, 2],
                         'numlist': [1, 2, 3, 4, 5, 6, 7, 8, 9]})
myschema = StructType(
    [StructField('hour', IntegerType(), False),
     StructField('numlist', IntegerType(), False)]
)
myDf = spark.createDataFrame(pandasDF, schema=myschema)
mydf = myDf.repartition(3, "hour")
myDf.foreachPartition(sumByHour)
I am able to solve this with "window.partitionBy". But I want to know if it can be solved by "foreachPartition".
Thanks in Advance,
Sri
Thanks for the code sample; it made this easy. Here's a really simple example that modifies your sumByHour code:
def sumByHour(ip):
    mySum = 0
    myPartition = ""
    for x in ip:
        mySum += x.numlist
        myPartition = x.hour
    myString = '{}_{}'.format(mySum, myPartition)
    print(myString)

mydf = myDf.repartition(5, "hour")  # wait, 5? I wanted 3!!!
You get almost the expected result:
>>> mydf.foreachPartition(sumByHour)
0_
0_
24_2
6_0
15_1
>>>
You might ask why partition by 5 and not 3? It turns out the hash formula used with 3 partitions sends hours 0 and 1 to the same partition and leaves one partition empty (bad luck). So this will work, but you only want to use it on data that fits into memory per partition.
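As a quick sanity check (not part of the original answer), you can make the collision visible with spark_partition_id, which reports the partition each row lands in:
from pyspark.sql.functions import spark_partition_id

# Show which physical partition each hour ends up in after repartitioning by hour.
myDf.repartition(3, "hour")\
    .withColumn("partition_id", spark_partition_id())\
    .select("hour", "partition_id")\
    .distinct()\
    .show()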
You can use a Window to do that and add the sumByHour as a new column.
from pyspark.sql import functions, Window
w = Window.partitionBy("hour")
myDf = myDf.withColumn("sumByHour", functions.sum("numlist").over(w))
myDf.show()
+----+-------+---------+
|hour|numlist|sumByHour|
+----+-------+---------+
| 1| 4| 15|
| 1| 5| 15|
| 1| 6| 15|
| 2| 7| 24|
| 2| 8| 24|
| 2| 9| 24|
| 0| 1| 6|
| 0| 2| 6|
| 0| 3| 6|
+----+-------+---------+
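If you only need the three totals themselves (6, 15, 24) rather than a per-row column, a plain groupBy aggregation is a simpler sketch of the same idea:
from pyspark.sql import functions

# One row per hour with the summed numlist values.
myDf.groupBy("hour")\
    .agg(functions.sum("numlist").alias("sumByHour"))\
    .orderBy("hour")\
    .show()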

Explode 2 columns (2 lists) at the same time in pyspark [duplicate]

I have a dataframe which has one row, and several columns. Some of the columns are single values, and others are lists. All list columns are the same length. I want to split each list column into a separate row, while keeping any non-list column as is.
Sample DF:
from pyspark import Row
from pyspark.sql import SQLContext
from pyspark.sql.functions import explode
sqlc = SQLContext(sc)
df = sqlc.createDataFrame([Row(a=1, b=[1,2,3],c=[7,8,9], d='foo')])
# +---+---------+---------+---+
# | a| b| c| d|
# +---+---------+---------+---+
# | 1|[1, 2, 3]|[7, 8, 9]|foo|
# +---+---------+---------+---+
What I want:
+---+---+----+------+
| a| b| c | d |
+---+---+----+------+
| 1| 1| 7 | foo |
| 1| 2| 8 | foo |
| 1| 3| 9 | foo |
+---+---+----+------+
If I only had one list column, this would be easy by just doing an explode:
df_exploded = df.withColumn('b', explode('b'))
# >>> df_exploded.show()
# +---+---+---------+---+
# | a| b| c| d|
# +---+---+---------+---+
# | 1| 1|[7, 8, 9]|foo|
# | 1| 2|[7, 8, 9]|foo|
# | 1| 3|[7, 8, 9]|foo|
# +---+---+---------+---+
However, if I try to also explode the c column, I end up with a dataframe with a length the square of what I want:
df_exploded_again = df_exploded.withColumn('c', explode('c'))
# >>> df_exploded_again.show()
# +---+---+---+---+
# | a| b| c| d|
# +---+---+---+---+
# | 1| 1| 7|foo|
# | 1| 1| 8|foo|
# | 1| 1| 9|foo|
# | 1| 2| 7|foo|
# | 1| 2| 8|foo|
# | 1| 2| 9|foo|
# | 1| 3| 7|foo|
# | 1| 3| 8|foo|
# | 1| 3| 9|foo|
# +---+---+---+---+
What I want is, for each list column, to take the nth element of the array and add it to a new row. I've tried mapping an explode across all columns in the dataframe, but that doesn't seem to work either:
df_split = df.rdd.map(lambda col: df.withColumn(col, explode(col))).toDF()
Spark >= 2.4
You can replace the zip_ udf with the arrays_zip function:
from pyspark.sql.functions import arrays_zip, col, explode
(df
    .withColumn("tmp", arrays_zip("b", "c"))
    .withColumn("tmp", explode("tmp"))
    .select("a", col("tmp.b"), col("tmp.c"), "d"))
Spark < 2.4
With DataFrames and UDF:
from pyspark.sql.types import ArrayType, StructType, StructField, IntegerType
from pyspark.sql.functions import col, udf, explode
zip_ = udf(
    lambda x, y: list(zip(x, y)),
    ArrayType(StructType([
        # Adjust types to reflect data types
        StructField("first", IntegerType()),
        StructField("second", IntegerType())
    ]))
)

(df
    .withColumn("tmp", zip_("b", "c"))
    # UDF output cannot be directly passed to explode
    .withColumn("tmp", explode("tmp"))
    .select("a", col("tmp.first").alias("b"), col("tmp.second").alias("c"), "d"))
With RDDs:
(df
    .rdd
    .flatMap(lambda row: [(row.a, b, c, row.d) for b, c in zip(row.b, row.c)])
    .toDF(["a", "b", "c", "d"]))
Both solutions are inefficient due to Python communication overhead. If data size is fixed you can do something like this:
from functools import reduce
from pyspark.sql import DataFrame
# Length of array
n = 3
# For legacy Python you'll need a separate function
# in place of method accessor
reduce(
    DataFrame.unionAll,
    (df.select("a", col("b").getItem(i), col("c").getItem(i), "d")
     for i in range(n))
).toDF("a", "b", "c", "d")
or even:
from pyspark.sql.functions import array, struct
# SQL level zip of arrays of known size
# followed by explode
tmp = explode(array(*[
    struct(col("b").getItem(i).alias("b"), col("c").getItem(i).alias("c"))
    for i in range(n)
]))

(df
    .withColumn("tmp", tmp)
    .select("a", col("tmp").getItem("b"), col("tmp").getItem("c"), "d"))
This should be significantly faster compared to UDF or RDD. Generalized to support an arbitrary number of columns:
# This uses keyword only arguments
# If you use legacy Python you'll have to change signature
# Body of the function can stay the same
def zip_and_explode(*colnames, n):
    return explode(array(*[
        struct(*[col(c).getItem(i).alias(c) for c in colnames])
        for i in range(n)
    ]))
df.withColumn("tmp", zip_and_explode("b", "c", n=3))
You'd need to use flatMap, not map, as you want to make multiple output rows out of each input row.
from pyspark.sql import Row
def dualExplode(r):
    rowDict = r.asDict()
    bList = rowDict.pop('b')
    cList = rowDict.pop('c')
    for b, c in zip(bList, cList):
        newDict = dict(rowDict)
        newDict['b'] = b
        newDict['c'] = c
        yield Row(**newDict)
df_split = sqlContext.createDataFrame(df.rdd.flatMap(dualExplode))
One-liner (for Spark >= 2.4.0):
df.withColumn("bc", arrays_zip("b", "c"))\
  .select("a", explode("bc").alias("tbc"))\
  .select("a", col("tbc.b"), "tbc.c").show()
Import required:
from pyspark.sql.functions import arrays_zip, col, explode
Steps -
Create a column bc which is an arrays_zip of columns b and c
Explode bc to get a struct tbc
Select the required columns a, b and c (all exploded as required).
Output:
> df.withColumn("bc", arrays_zip("b","c")).select("a", explode("bc").alias("tbc")).select("a", "tbc.b", col("tbc.c")).show()
+---+---+---+
| a| b| c|
+---+---+---+
| 1| 1| 7|
| 1| 2| 8|
| 1| 3| 9|
+---+---+---+

Change a column's values in dataframe pyspark

I have 2 dataframes in Spark, train and test. Both have a categorical column, say Product_ID. What I want to do is put a -1 value for those categories which are in test but not present in train.
So for that I first found the distinct categories for that column in p_not_in_test, but I am not able to proceed further. How can I do that?
p_not_in_test = test.select('Product_ID').subtract(train.select('Product_ID'))
p_not_in_test = p_not_in_test.distinct()
Regards
Here's a reproducible example. First we create dummy data:
test = sc.parallelize([("ID1", 1, 5), ("ID2", 2, 4),
                       ("ID3", 5, 8), ("ID4", 9, 0),
                       ("ID5", 0, 3)]).toDF(["PRODUCT_ID", "val1", "val2"])
train = sc.parallelize([("ID1", 4, 7), ("ID3", 1, 4),
                        ("ID5", 9, 2)]).toDF(["PRODUCT_ID", "val1", "val2"])
Now we need to extend your definition of p_not_in_test so we get a list as an output:
p_not_in_test = (test.select('PRODUCT_ID')
                     .subtract(train.select('PRODUCT_ID'))
                     .rdd.map(lambda x: x[0]).collect())
Finally, we can create a udf that will add "-1" in front of each ID that's not present in train.
from pyspark.sql.types import StringType
from pyspark.sql.functions import udf
addString = udf(lambda x: '-1 ' + x if x in p_not_in_test else x, StringType())
test.withColumn("NEW_ID",addString(test["PRODUCT_ID"])).show()
+----------+----+----+------+
|PRODUCT_ID|val1|val2|NEW_ID|
+----------+----+----+------+
| ID1| 1| 5| ID1|
| ID2| 2| 4|-1 ID2|
| ID3| 5| 8| ID3|
| ID4| 9| 0|-1 ID4|
| ID5| 0| 3| ID5|
+----------+----+----+------+
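As an alternative sketch (my own variation, not part of the original answer), the same flagging can be done with a left join instead of collecting the IDs to the driver, which scales better when the set of missing IDs is large:
from pyspark.sql import functions as F

# Hypothetical join-based variant: flag IDs missing from train without collect().
train_ids = train.select("PRODUCT_ID").distinct().withColumn("in_train", F.lit(1))

result = (test.join(train_ids, on="PRODUCT_ID", how="left")
              .withColumn("NEW_ID",
                          F.when(F.col("in_train").isNull(),
                                 F.concat(F.lit("-1 "), F.col("PRODUCT_ID")))
                           .otherwise(F.col("PRODUCT_ID")))
              .drop("in_train"))
result.show()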
