Change a column's values in a dataframe (PySpark) - apache-spark

I have two dataframes in Spark, train and test. Both have a categorical column, say Product_ID. What I want to do is put a -1 value on those categories which are present in test but not in train.
To do that I first found the distinct categories for that column, in p_not_in_test, but I am not able to proceed further. How can I do that?
p_not_in_test = test.select('Product_ID').subtract(train.select('Product_ID'))
p_not_in_test = p_not_in_test.distinct()
Regards

Here's a reproducible example. First we create dummy data:
test = sc.parallelize([("ID1", 1, 5), ("ID2", 2, 4),
                       ("ID3", 5, 8), ("ID4", 9, 0),
                       ("ID5", 0, 3)]).toDF(["PRODUCT_ID", "val1", "val2"])
train = sc.parallelize([("ID1", 4, 7), ("ID3", 1, 4),
                        ("ID5", 9, 2)]).toDF(["PRODUCT_ID", "val1", "val2"])
Now we need to extend your definition of p_not_in_test so we get a list as an output:
p_not_in_test = (test.select('PRODUCT_ID')
                     .subtract(train.select('PRODUCT_ID'))
                     .rdd.map(lambda x: x[0]).collect())
Finally, we can create a UDF that prepends "-1 " to each ID that is not present in train.
from pyspark.sql.types import StringType
from pyspark.sql.functions import udf
addString = udf(lambda x: '-1 ' + x if x in p_not_in_test else x, StringType())
test.withColumn("NEW_ID",addString(test["PRODUCT_ID"])).show()
+----------+----+----+------+
|PRODUCT_ID|val1|val2|NEW_ID|
+----------+----+----+------+
| ID1| 1| 5| ID1|
| ID2| 2| 4|-1 ID2|
| ID3| 5| 8| ID3|
| ID4| 9| 0|-1 ID4|
| ID5| 0| 3| ID5|
+----------+----+----+------+
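For reference, the same marking can also be done without a Python UDF by combining the built-in isin, when and concat functions. This is a minimal sketch of my own, not part of the original answer, and it assumes the same p_not_in_test list collected above:
import pyspark.sql.functions as F

# Sketch: prepend "-1 " to IDs that appear in test but not in train,
# reusing the p_not_in_test list collected earlier.
test.withColumn(
    "NEW_ID",
    F.when(F.col("PRODUCT_ID").isin(p_not_in_test),
           F.concat(F.lit("-1 "), F.col("PRODUCT_ID")))
     .otherwise(F.col("PRODUCT_ID"))
).show()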

Related

Grouping in PySpark Dataframes

I am using Spark dataframes.
The task is to calculate and display, in descending order, the number of cities in each country, grouped by country and region.
Initial data:
from pyspark.sql.functions import col
from pyspark.sql.functions import count
df = spark.read.json("/content/world-cities.json")
df.printSchema()
df.show()
Desired result: (screenshot in the original post, showing the city counts grouped by country and subcountry in descending order)
I only get grouping by the country column. How do I add grouping by the second column, subcountry?
df.groupBy(col('country')).agg(count("*").alias("cnt"))\
.orderBy(col('cnt').desc())\
.show()
If I understand you correctly, you just need to add the second column to your groupBy:
import pyspark.sql.functions as F

x = [("USA", "usa-subcountry", "usa-city"),
     ("USA", "usa-subcountry", "usa-city-2"),
     ("USA", "usa-subcountry-2", "usa-city"),
     ("Argentina", "argentina-subcountry", "argentina-city")]
df = spark.createDataFrame(x, schema=['country', 'subcountry', 'city'])

df.groupBy(F.col('country'), F.col('subcountry')).agg(F.count("*").alias("cnt"))\
  .orderBy(F.col('cnt').desc())\
  .show()
Output is:
+---------+--------------------+---+
| country| subcountry|cnt|
+---------+--------------------+---+
| USA| usa-subcountry| 2|
| USA| usa-subcountry-2| 1|
|Argentina|argentina-subcountry| 1|
+---------+--------------------+---+
Edit: another attempt, based on the comment:
import pyspark.sql.functions as F
x = [("USA","usa-subcountry", "usa-city"),
("USA","usa-subcountry", "usa-city-2"),
("USA","usa-subcountry", "usa-city-3"),
("USA","usa-subcountry-2", "usa-city"),
("Argentina","argentina-subcountry", "argentina-city"),
("Argentina","argentina-subcountry-2", "argentina-city-2"),
("UK","UK-subcountry", "UK-city-1")]
df = spark.createDataFrame(x, schema=['country', 'subcountry', 'city'])
df.groupBy(F.col('country'), F.col('subcountry')).agg(F.count("*").alias("city_count"))\
.groupBy(F.col('country')).agg(F.count("*").alias("subcountry_count"), F.sum('city_count').alias("city_count"))\
.orderBy(F.col('city_count').desc())\
.show()
output:
+---------+----------------+----------+
| country|subcountry_count|city_count|
+---------+----------------+----------+
| USA| 2| 4|
|Argentina| 2| 2|
| UK| 1| 1|
+---------+----------------+----------+
I am assuming that cities and subcountries are unique; if not, you may want to use countDistinct instead of count.
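For illustration, a sketch of that countDistinct variant (my addition, not part of the original answer), using the same df as in the edit above:
import pyspark.sql.functions as F

# Count distinct subcountries and distinct cities per country,
# which is robust to duplicate rows.
df.groupBy(F.col('country'))\
  .agg(F.countDistinct('subcountry').alias('subcountry_count'),
       F.countDistinct('city').alias('city_count'))\
  .orderBy(F.col('city_count').desc())\
  .show()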

Explode multiple array columns in pyspark [duplicate]

I have a dataframe whose columns contain lists, similar to the following. The lists in the different columns are not all the same length.
Name Age Subjects Grades
[Bob] [16] [Maths,Physics,Chemistry] [A,B,C]
I want to explode the dataframe in such a way that i get the following output-
Name Age Subjects Grades
Bob 16 Maths A
Bob 16 Physics B
Bob 16 Chemistry C
How can I achieve this?
PySpark has added an arrays_zip function in 2.4, which eliminates the need for a Python UDF to zip the arrays.
import pyspark.sql.functions as F
from pyspark.sql.types import *
df = spark.createDataFrame(
    [(['Bob'], [16], ['Maths', 'Physics', 'Chemistry'], ['A', 'B', 'C'])],
    ['Name', 'Age', 'Subjects', 'Grades'])

df = df.withColumn("new", F.arrays_zip("Subjects", "Grades"))\
       .withColumn("new", F.explode("new"))\
       .select("Name", "Age", F.col("new.Subjects").alias("Subjects"), F.col("new.Grades").alias("Grades"))
df.show()
+-----+----+---------+------+
| Name| Age| Subjects|Grades|
+-----+----+---------+------+
|[Bob]|[16]| Maths| A|
|[Bob]|[16]| Physics| B|
|[Bob]|[16]|Chemistry| C|
+-----+----+---------+------+
This works:
import pyspark.sql.functions as F
from pyspark.sql.types import *
df = spark.createDataFrame(
    [(['Bob'], [16], ['Maths', 'Physics', 'Chemistry'], ['A', 'B', 'C'])],
    ['Name', 'Age', 'Subjects', 'Grades'])
df.show()
+-----+----+--------------------+---------+
| Name| Age| Subjects| Grades|
+-----+----+--------------------+---------+
|[Bob]|[16]|[Maths, Physics, ...|[A, B, C]|
+-----+----+--------------------+---------+
Use a UDF with zip. The columns that need to be exploded have to be merged before exploding.
combine = F.udf(lambda x, y: list(zip(x, y)),
                ArrayType(StructType([StructField("subs", StringType()),
                                      StructField("grades", StringType())])))

df = df.withColumn("new", combine("Subjects", "Grades"))\
       .withColumn("new", F.explode("new"))\
       .select("Name", "Age", F.col("new.subs").alias("Subjects"), F.col("new.grades").alias("Grades"))
df.show()
+-----+----+---------+------+
| Name| Age| Subjects|Grades|
+-----+----+---------+------+
|[Bob]|[16]| Maths| A|
|[Bob]|[16]| Physics| B|
|[Bob]|[16]|Chemistry| C|
+-----+----+---------+------+
Arriving late to the party :-)
The simplest way to go is to use inline, which doesn't have a Python API but is supported by selectExpr.
df.selectExpr('Name[0] as Name','Age[0] as Age','inline(arrays_zip(Subjects,Grades))').show()
+----+---+---------+------+
|Name|Age| Subjects|Grades|
+----+---+---------+------+
| Bob| 16| Maths| A|
| Bob| 16| Physics| B|
| Bob| 16|Chemistry| C|
+----+---+---------+------+
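As a side note (my addition, not part of the original answer): newer PySpark releases, 3.4 and later if I recall correctly, also expose inline directly in the Python API, so a sketch like the following should be equivalent:
import pyspark.sql.functions as F

# Assumes PySpark 3.4+, where pyspark.sql.functions.inline is available.
df.select(F.col('Name')[0].alias('Name'),
          F.col('Age')[0].alias('Age'),
          F.inline(F.arrays_zip('Subjects', 'Grades'))).show()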
Have you tried this? Subjects is already an array column, so it only needs explode (no split):
from pyspark.sql.functions import col, explode

df.select(explode(col("Subjects")).alias("Subjects")).show()
You can convert the dataframe to an RDD. On an RDD you can use a flatMap function to separate the Subjects.
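That answer gives no code, so here is a minimal sketch of the RDD route (my addition; it assumes the original df with array columns as created above, and zips Subjects with Grades inside the flatMap):
# Sketch: emit one (Name, Age, Subject, Grade) record per zipped pair.
exploded = (df.rdd
              .flatMap(lambda row: [(row.Name[0], row.Age[0], s, g)
                                    for s, g in zip(row.Subjects, row.Grades)])
              .toDF(['Name', 'Age', 'Subjects', 'Grades']))
exploded.show()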
A copy/paste function, if you need to repeat this quickly and easily across a large number of columns in a dataset:
import pyspark.sql.functions as f

cols = ["word", "stem", "pos", "ner"]

def explode_cols(data, cols):
    # Zip the array columns, explode the zipped column, then unpack
    # each struct field back into a column of the original name.
    data = data.withColumn('exp_combo', f.arrays_zip(*cols))
    data = data.withColumn('exp_combo', f.explode('exp_combo'))
    for col in cols:
        data = data.withColumn(col, f.col('exp_combo.' + col))
    return data.drop(f.col('exp_combo'))

result = explode_cols(data, cols)
You're welcome :)
When exploding multiple columns, the above solution comes in handy only when the arrays have the same length; when they don't, it is better to explode them separately and take distinct values each time.
df = spark.createDataFrame(
    [(['Bob'], [16], ['Maths', 'Physics', 'Chemistry'], ['A', 'B', 'C'])],
    ['Name', 'Age', 'Subjects', 'Grades'])

df = df.withColumn('Subjects', F.explode('Subjects')).select('Name', 'Age', 'Subjects', 'Grades').distinct()
df = df.withColumn('Grades', F.explode('Grades')).select('Name', 'Age', 'Subjects', 'Grades').distinct()
df.show()
+----+---+---------+------+
|Name|Age| Subjects|Grades|
+----+---+---------+------+
| Bob| 16| Maths| A|
| Bob| 16| Physics| B|
| Bob| 16|Chemistry| C|
+----+---+---------+------+
Thanks @nasty for saving the day.
Just small tweaks to get the code working:
from pyspark.sql.functions import arrays_zip, explode, col

def explode_cols(df, cl):
    df = df.withColumn('exp_combo', arrays_zip(*cl))
    df = df.withColumn('exp_combo', explode('exp_combo'))
    for colm in cl:
        final_col = 'exp_combo.' + colm
        df = df.withColumn(colm, col(final_col))
    return df.drop(col('exp_combo'))

Spark dataframe foreachPartition: sum the elements using pyspark

I am trying to partition a Spark dataframe and sum the elements in each partition using PySpark, but I am unable to do this inside the called function sumByHour. Basically, I am unable to access the dataframe columns inside sumByHour.
I am partitioning by the hour column and trying to sum the elements within each hour partition, so the expected output is 6, 15, 24 for hours 0, 1, 2 respectively. I tried the below with no luck.
from pyspark.sql.functions import *
from pyspark.sql.types import *
import pandas as pd
def sumByHour(ip):
    print(ip)

pandasDF = pd.DataFrame({'hour': [0, 0, 0, 1, 1, 1, 2, 2, 2], 'numlist': [1, 2, 3, 4, 5, 6, 7, 8, 9]})

myschema = StructType(
    [StructField('hour', IntegerType(), False),
     StructField('numlist', IntegerType(), False)]
)
myDf = spark.createDataFrame(pandasDF, schema=myschema)
mydf = myDf.repartition(3, "hour")
myDf.foreachPartition(sumByHour)
I am able to solve this with Window.partitionBy, but I want to know if it can be solved with foreachPartition.
Thanks in advance,
Sri
Thanks for the code sample, it made this easy. Here's a really simple example that modifies your sumByHour code:
def sumByHour(ip):
    mySum = 0
    myPartition = ""
    for x in ip:
        mySum += x.numlist
        myPartition = x.hour
    myString = '{}_{}'.format(mySum, myPartition)
    print(myString)

mydf = myDf.repartition(5, "hour")  # wait, 5? I wanted 3!!!
You get almost the expected result:
>>> mydf.foreachPartition(sumByHour)
0_
0_
24_2
6_0
15_1
>>>
You might ask why partition by 5 and not 3? It turns out the hash formula used with 3 partitions puts hours 0 and 1 into the same partition and leaves one partition empty (bad luck). So this will work, but you only want to use it when each partition fits into memory.
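If you want the per-partition sums back on the driver instead of printed on the executors, a mapPartitions variant is one option. This is a sketch of my own (not from the original answers), assuming the same myDf as above:
def sumPartition(ip):
    # Yield a single (hour, sum) pair for each non-empty partition.
    mySum = 0
    myPartition = None
    for x in ip:
        mySum += x.numlist
        myPartition = x.hour
    if myPartition is not None:
        yield (myPartition, mySum)

result = myDf.repartition(5, "hour").rdd.mapPartitions(sumPartition).collect()
print(result)  # e.g. [(2, 24), (0, 6), (1, 15)], ordering depends on the partitioning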
You can use a Window to do that and add the sumByHour as a new column.
from pyspark.sql import functions, Window
w = Window.partitionBy("hour")
myDf = myDf.withColumn("sumByHour", functions.sum("numlist").over(w))
myDf.show()
+----+-------+---------+
|hour|numlist|sumByHour|
+----+-------+---------+
| 1| 4| 15|
| 1| 5| 15|
| 1| 6| 15|
| 2| 7| 24|
| 2| 8| 24|
| 2| 9| 24|
| 0| 1| 6|
| 0| 2| 6|
| 0| 3| 6|
+----+-------+---------+

Get the distinct elements of a column grouped by another column on a PySpark Dataframe

I have a PySpark DF of ids and purchases which I'm trying to transform for use with FP-growth.
Currently I have multiple rows for a given id, with each row relating to a single purchase.
I'd like to transform this dataframe into a form with two columns: one for id (with a single row per id) and a second column containing a list of distinct purchases for that id.
I've tried to use a User Defined Function (UDF) to map the distinct purchases onto the distinct ids, but I get "py4j.Py4JException: Method __getstate__([]) does not exist". Thanks to @Mithril
I see that "You can't use sparkSession object, spark.DataFrame object or other Spark distributed objects in udf and pandas_udf, because they cannot be pickled."
So I've implemented the TERRIBLE approach below (which works but is not scalable):
import pandas as pd
import pyspark.sql.functions as f

# Lets create some fake transactions
customers = [1, 2, 3, 1, 1]
purschases = ['cake', 'tea', 'beer', 'fruit', 'cake']

# Lets create a spark DF to capture the transactions
transactions = zip(customers, purschases)
spk_df_1 = spark.createDataFrame(list(transactions), ["id", "item"])

# Lets have a look at the resulting spark dataframe
spk_df_1.show()

# Register the dataframe as a temp view so it can be queried with SQL
spk_df_1.createOrReplaceTempView("TBLdf")

# Lets capture the ids and the list of their distinct purschases in a
# list of tuples
nums1 = []

# for each distinct id lets get the list of their distinct purschases
for id in spark.sql("SELECT distinct(id) FROM TBLdf").rdd.map(lambda row: row[0]).collect():
    purschase = spk_df_1.filter(f.col("id") == id).select("item").distinct().rdd.map(lambda row: row[0]).collect()
    nums1.append((id, purschase))

# Lets see what our list of transaction tuples looks like
print(nums1)
print("\n")

# lets turn the list of transaction tuples into a pandas dataframe
df_pd = pd.DataFrame(nums1)

# Finally lets turn our pandas dataframe into a pyspark Dataframe
df2 = spark.createDataFrame(df_pd)
df2.show()
Output:
+---+-----+
| id| item|
+---+-----+
| 1| cake|
| 2| tea|
| 3| beer|
| 1|fruit|
| 1| cake|
+---+-----+
[(1, ['fruit', 'cake']), (3, ['beer']), (2, ['tea'])]
+---+-------------+
| 0| 1|
+---+-------------+
| 1|[fruit, cake]|
| 3| [beer]|
| 2| [tea]|
+---+-------------+
If anybody has any suggestions I'd greatly appreciate it.
That is a task for collect_set, which creates a set of items without duplicates:
import pyspark.sql.functions as F
#Lets create some fake transactions
customers = [1,2,3,1,1]
purschases = ['cake','tea','beer','fruit','cake']
# Lets create a spark DF to capture the transactions
transactions = zip(customers,purschases)
spk_df_1 = spark.createDataFrame(list(transactions) , ["id", "item"])
spk_df_1.show()
spk_df_1.groupby('id').agg(F.collect_set('item')).show()
Output:
+---+-----+
| id| item|
+---+-----+
| 1| cake|
| 2| tea|
| 3| beer|
| 1|fruit|
| 1| cake|
+---+-----+
+---+-----------------+
| id|collect_set(item)|
+---+-----------------+
| 1| [fruit, cake]|
| 3| [beer]|
| 2| [tea]|
+---+-----------------+
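Since the question mentions FP-growth, here is a rough sketch (my addition, not part of the answer) of how the collect_set output could be fed into pyspark.ml.fpm.FPGrowth; the minSupport and minConfidence values are arbitrary placeholders:
import pyspark.sql.functions as F
from pyspark.ml.fpm import FPGrowth

# Alias the aggregated column to "items", the default itemsCol for FPGrowth.
baskets = spk_df_1.groupby('id').agg(F.collect_set('item').alias('items'))

fp = FPGrowth(itemsCol='items', minSupport=0.2, minConfidence=0.5)
model = fp.fit(baskets)
model.freqItemsets.show()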

Updating a column in pyspark dependent on the column's current value

Let's say we are given a DataFrame
+-----+-----+-----+
|    x|    y|    z|
+-----+-----+-----+
|    3|    5|    9|
|    2|    4|    6|
+-----+-----+-----+
I want to multiply the values in the z column by the value in the y column wherever the z column equals 6.
This post shows the solution I am aiming for, using the code:
from pyspark.sql import functions as F

df = df.withColumn('z',
                   F.when(df['z'] == 6, df['z'] * df['y']).
                   otherwise(df['z']))
The problem is that df['z'] and df['y'] are recognized as Column objects and casting them won't work...
How can I do this correctly?
from pyspark.sql import functions as F
from pyspark.sql.types import LongType

df = df.withColumn('new_col',
                   F.when(df.z == 6,
                          (df.z.cast(LongType()) * df.y.cast(LongType()))
                   ).otherwise(df.z)
)

Resources