I have a PySpark DataFrame with two columns:
+---+----+
| Id|Rank|
+---+----+
|  a|   5|
|  b|   7|
|  c|   8|
|  d|   1|
+---+----+
For each row, I want to replace the Id column with "other" if the Rank column is larger than 5.
In pseudocode:
for row in df:
    if row.Rank > 5:
        replace(row.Id, "other")
The result should look like this:
+-----+----+
|   Id|Rank|
+-----+----+
|    a|   5|
|other|   7|
|other|   8|
|    d|   1|
+-----+----+
Any clue how to achieve this? Thanks!!!
To create this Dataframe:
df = spark.createDataFrame([('a', 5), ('b', 7), ('c', 8), ('d', 1)], ['Id', 'Rank'])
You can use when and otherwise like this:
from pyspark.sql.functions import when, col

df \
    .withColumn('Id_New', when(df.Rank <= 5, df.Id).otherwise('other')) \
    .drop(df.Id) \
    .select(col('Id_New').alias('Id'), col('Rank')) \
    .show()
This gives the following output:
+-----+----+
|   Id|Rank|
+-----+----+
|    a|   5|
|other|   7|
|other|   8|
|    d|   1|
+-----+----+
Starting with @Pushkr's solution, couldn't you just use the following?
from pyspark.sql.functions import when

df.withColumn('Id', when(df.Rank <= 5, df.Id).otherwise('other')).show()
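As an aside, the same logic can also be expressed as a SQL expression (an equivalent sketch of the when/otherwise approach above):
from pyspark.sql.functions import expr

# CASE WHEN form of the same replacement; produces the same result as when/otherwise.
df.withColumn('Id', expr("CASE WHEN Rank <= 5 THEN Id ELSE 'other' END")).show()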
Related
I have a PySpark dataframe:
df = spark.createDataFrame([
    ("u1", 0),
    ("u2", 0),
    ("u3", 1),
    ("u4", 2),
    ("u5", 3),
    ("u6", 2)],
    ['user_id', 'medals'])
df.show()
Output:
+-------+------+
|user_id|medals|
+-------+------+
|     u1|     0|
|     u2|     0|
|     u3|     1|
|     u4|     2|
|     u5|     3|
|     u6|     2|
+-------+------+
I want to get the distribution of the medals column across all users. If there are n unique values in the medals column, I want n columns in the output dataframe, each holding the number of users who received that many medals.
The output for the data given above should look like this:
+--------+--------+--------+--------+
|medals_0|medals_1|medals_2|medals_3|
+--------+--------+--------+--------+
|       2|       1|       2|       1|
+--------+--------+--------+--------+
How do I achieve this?
It's a simple pivot:
df.groupBy().pivot("medals").count().show()
+---+---+---+---+
|  0|  1|  2|  3|
+---+---+---+---+
|  2|  1|  2|  1|
+---+---+---+---+
If you want the cosmetic touch of adding the word medals to the column names, you can do this:
medals_df = df.groupBy().pivot("medals").count()
for col in medals_df.columns:
    medals_df = medals_df.withColumnRenamed(col, "medals_{}".format(col))
medals_df.show()
+--------+--------+--------+--------+
|medals_0|medals_1|medals_2|medals_3|
+--------+--------+--------+--------+
|       2|       1|       2|       1|
+--------+--------+--------+--------+
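Alternatively, the renaming loop can be collapsed into a single select (a rough sketch of the same renaming):
from pyspark.sql import functions as F

# Rename every pivoted column in one pass instead of repeated withColumnRenamed calls.
medals_df = df.groupBy().pivot("medals").count()
medals_df.select([F.col(c).alias("medals_{}".format(c)) for c in medals_df.columns]).show()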
I need some help with a query. Say I have a dataframe like this:
+------+------+
|userid|songid|
+------+------+
|     1|     a|
|     1|     b|
|     1|     c|
|     2|     a|
|     2|     d|
|     3|     c|
|     4|     e|
|     4|     d|
|     5|     b|
+------+------+
I want to return a dataframe that has userid pairs with at least one songid in common. For the dataframe above it would look like this:
+------+--------+
|userid|friendid|
+------+--------+
|     1|       2|
|     1|       3|
|     1|       5|
|     2|       4|
+------+--------+
How can I do this?
One simple way is to use a self-join:
from pyspark.sql.functions import col, array, sort_array

data = [(1, 'a'), (1, 'b'), (1, 'c'),
        (2, 'a'), (2, 'd'), (3, 'c'),
        (4, 'e'), (4, 'd'), (5, 'b')]
df = spark.createDataFrame(data, ["userid", "songid"])

# join on songid = songid and userid different
join_condition = (col("u1.songid") == col("u2.songid")) & (col("u1.userid") != col("u2.userid"))
df.alias("u1").join(df.alias("u2"), join_condition, "inner") \
    .select(sort_array(array(col("u1.userid"), col("u2.userid"))).alias("pairs")) \
    .distinct() \
    .select(col("pairs").getItem(0).alias("userid"), col("pairs").getItem(1).alias("friendid")) \
    .show()
+------+--------+
|userid|friendid|
+------+--------+
|     1|       3|
|     1|       5|
|     2|       4|
|     1|       2|
+------+--------+
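A small variant (rough sketch): requiring u1.userid < u2.userid in the join condition already keeps each pair only once, so the array/sort_array step can be dropped; distinct() is still needed because two users may share more than one song.
from pyspark.sql.functions import col

df.alias("u1").join(df.alias("u2"),
                    (col("u1.songid") == col("u2.songid")) &
                    (col("u1.userid") < col("u2.userid"))) \
    .select(col("u1.userid").alias("userid"), col("u2.userid").alias("friendid")) \
    .distinct() \
    .show()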
I am trying to output a dataframe containing only the columns that turn out to have different values after comparing two dataframes, and I am having trouble finding an approach.
Code:
df_a = sql_context.createDataFrame(
    [("a", 3, "apple", "bear", "carrot"), ("b", 5, "orange", "lion", "cabbage"),
     ("c", 7, "pears", "tiger", "onion"), ("c", 8, "jackfruit", "elephant", "raddish"),
     ("c", 9, "watermelon", "giraffe", "tomato")],
    ["name", "id", "fruit", "animal", "veggie"])
df_b = sql_context.createDataFrame(
    [("a", 3, "apple", "bear", "carrot"), ("b", 5, "orange", "lion", "cabbage"),
     ("c", 7, "banana", "tiger", "onion"), ("c", 8, "jackfruit", "camel", "raddish")],
    ["name", "id", "fruit", "animal", "veggie"])
df_a = df_a.alias('df_a')
df_b = df_b.alias('df_b')
df = df_a.join(df_b, (df_a.id == df_b.id) & (df_a.name == df_b.name), 'leftanti').select('df_a.*').show()
I am trying to match on the keys (id, name) between dataframe 1 and dataframe 2.
Dataframe 1:
+----+---+----------+--------+-------+
|name| id|     fruit|  animal| veggie|
+----+---+----------+--------+-------+
|   a|  3|     apple|    bear| carrot|
|   b|  5|    orange|    lion|cabbage|
|   c|  7|     pears|   tiger|  onion|
|   c|  8| jackfruit|elephant|raddish|
|   c|  9|watermelon| giraffe| tomato|
+----+---+----------+--------+-------+
Dataframe 2:
+----+---+---------+------+-------+
|name| id|    fruit|animal| veggie|
+----+---+---------+------+-------+
|   a|  3|    apple|  bear| carrot|
|   b|  5|   orange|  lion|cabbage|
|   c|  7|   banana| tiger|  onion|
|   c|  8|jackfruit| camel|raddish|
+----+---+---------+------+-------+
Expected dataframe
+----+---+----------+--------+
|name| id|     fruit|  animal|
+----+---+----------+--------+
|   c|  7|     pears|   tiger|
|   c|  8| jackfruit|elephant|
|   c|  9|watermelon| giraffe|
+----+---+----------+--------+
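One possible approach (a rough, untested sketch reusing the column names above) is to inner-join the two dataframes on the keys, count mismatches per column in a single pass, and keep only the columns that differ somewhere:
from pyspark.sql import functions as F

keys = ["name", "id"]
value_cols = [c for c in df_a.columns if c not in keys]

# Join the rows that exist in both dataframes on the key columns.
joined = df_a.alias("a").join(df_b.alias("b"), keys, "inner")

# For each non-key column, count how many joined rows disagree.
diff_counts = joined.select([
    F.sum((F.col("a." + c) != F.col("b." + c)).cast("int")).alias(c)
    for c in value_cols
]).first()

# Keep only the columns that differ at least once (here: fruit and animal, but not veggie).
changed_cols = [c for c in value_cols if diff_counts[c]]

# Filtering down to just the changed or unmatched rows would be an additional step.
df_a.select(keys + changed_cols).show()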
I have two Spark dataframes loaded from CSV, of the form:
mapping_fields (the df with mapped names):
new_name old_name
A aa
B bb
C cc
and
aa bb cc dd
1 2 3 43
12 21 4 37
to be transformed into :
A B C D
1 2 3
12 21 4
Since dd didn't have any mapping in the original table, column D should contain all null values.
How can I do this without converting mapping_fields into a dictionary and checking each name individually? (That would mean collecting mapping_fields to the driver, which somewhat contradicts my use case of handling all the datasets in a distributed way.)
Thanks!
With melt borrowed from here you could:
from pyspark.sql import functions as f

mapping_fields = spark.createDataFrame(
    [("A", "aa"), ("B", "bb"), ("C", "cc")],
    ("new_name", "old_name"))
df = spark.createDataFrame(
    [(1, 2, 3, 43), (12, 21, 4, 37)],
    ("aa", "bb", "cc", "dd"))

(melt(df.withColumn("id", f.monotonically_increasing_id()),
      id_vars=["id"], value_vars=df.columns, var_name="old_name")
    .join(mapping_fields, ["old_name"], "left_outer")
    .withColumn("value", f.when(f.col("new_name").isNotNull(), f.col("value")))
    .withColumn("new_name", f.coalesce("new_name", f.upper(f.col("old_name"))))
    .groupBy("id")
    .pivot("new_name")
    .agg(f.first("value"))
    .drop("id")
    .show())
+---+---+---+----+
|  A|  B|  C|  DD|
+---+---+---+----+
|  1|  2|  3|null|
| 12| 21|  4|null|
+---+---+---+----+
but nothing in your description justifies this. Because the number of columns is fairly limited, I'd rather:
mapping = dict(
    mapping_fields
    .filter(f.col("old_name").isin(df.columns))
    .select("old_name", "new_name").collect())

df.select([
    (f.lit(None).cast(t) if c not in mapping else f.col(c)).alias(mapping.get(c, c.upper()))
    for (c, t) in df.dtypes]).show()
+---+---+---+----+
|  A|  B|  C|  DD|
+---+---+---+----+
|  1|  2|  3|null|
| 12| 21|  4|null|
+---+---+---+----+
At the end of the day you should use distributed processing when it provides performance or scalability improvements. Here it would do the opposite and make your code overly complicated.
To ignore no-matches:
(melt(df.withColumn("id", f.monotonically_increasing_id()),
      id_vars=["id"], value_vars=df.columns, var_name="old_name")
    .join(mapping_fields, ["old_name"])
    .groupBy("id")
    .pivot("new_name")
    .agg(f.first("value"))
    .drop("id")
    .show())
or
df.select([
    f.col(c).alias(mapping.get(c))
    for (c, t) in df.dtypes if c in mapping])
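For reference, the melt helper used above is not defined in the answer; a minimal sketch of such a wide-to-long helper (assuming the value columns share a compatible type) could look like this:
from pyspark.sql.functions import array, col, explode, lit, struct

def melt(df, id_vars, value_vars, var_name="variable", value_name="value"):
    """Convert a DataFrame from wide to long format, analogous to pandas.melt."""
    # Build an array of (column name, column value) structs, one per value column.
    vars_and_vals = array(*(
        struct(lit(c).alias(var_name), col(c).alias(value_name))
        for c in value_vars))
    # Explode the array so each (id, variable, value) combination becomes its own row.
    tmp = df.withColumn("_vars_and_vals", explode(vars_and_vals))
    cols = id_vars + [col("_vars_and_vals")[x].alias(x) for x in (var_name, value_name)]
    return tmp.select(*cols)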
I tried with a simple for loop; hope this helps too.
from pyspark.sql import functions as F
l1 = [('A','aa'),('B','bb'),('C','cc')]
l2 = [(1,2,3,43),(12,21,4,37)]
df1 = spark.createDataFrame(l1,['new_name','old_name'])
df2 = spark.createDataFrame(l2,['aa','bb','cc','dd'])
>>> df1.show()
+--------+--------+
|new_name|old_name|
+--------+--------+
|       A|      aa|
|       B|      bb|
|       C|      cc|
+--------+--------+
>>> df2.show()
+---+---+---+---+
| aa| bb| cc| dd|
+---+---+---+---+
|  1|  2|  3| 43|
| 12| 21|  4| 37|
+---+---+---+---+
When you need the missing columns filled with null values:
>>> cols = df2.columns
>>> for i in cols:
...     val = df1.where(df1['old_name'] == i).first()
...     if val is not None:
...         df2 = df2.withColumnRenamed(i, val['new_name'])
...     else:
...         df2 = df2.withColumn(i, F.lit(None))
>>> df2.show()
+---+---+---+----+
|  A|  B|  C|  dd|
+---+---+---+----+
|  1|  2|  3|null|
| 12| 21|  4|null|
+---+---+---+----+
When we need only the mapped columns, change the else part:
...     else:
...         df2 = df2.drop(i)
>>> df2.show()
+---+---+---+
|  A|  B|  C|
+---+---+---+
|  1|  2|  3|
| 12| 21|  4|
+---+---+---+
Note that this overwrites the original df2 dataframe along the way, though.
I used the first and last functions to get the first and last values of a column, but neither of them works the way I expected. I referred to the answer by @zero323, but I am still confused by both. My code looks like this:
from pyspark.sql import functions as F
from pyspark.sql import Window

df = spark.sparkContext.parallelize([
    ("a", None), ("a", 1), ("a", -1), ("b", 3), ("b", 1)]).toDF(["k", "v"])
w = Window().partitionBy("k").orderBy('k', 'v')
df.select(F.col("k"), F.last("v", True).over(w).alias('v')).show()
the result:
+---+----+
|  k|   v|
+---+----+
|  b|   1|
|  b|   3|
|  a|null|
|  a|  -1|
|  a|   1|
+---+----+
I expected it to look like this:
+---+----+
|  k|   v|
+---+----+
|  b|   3|
|  b|   3|
|  a|   1|
|  a|   1|
|  a|   1|
+---+----+
because this is what df looks like when ordered by 'k' and 'v':
df.orderBy('k','v').show()
+---+----+
|  k|   v|
+---+----+
|  a|null|
|  a|  -1|
|  a|   1|
|  b|   1|
|  b|   3|
+---+----+
Additionally, I tried another way to test this kind of problem:
df.orderBy('k','v').groupBy('k').agg(F.first('v')).show()
I found that its results can differ from run to run. Has anyone had the same experience? I would like to use both functions in my project, but these results seem inconclusive.
Try inverting the sort order using .desc(); then first() will give the desired output.
w2 = Window().partitionBy("k").orderBy(df.v.desc())
df.select(F.col("k"), F.first("v", True).over(w2).alias('v')).show()
Outputs:
+---+---+
|  k|  v|
+---+---+
|  b|  3|
|  b|  3|
|  a|  1|
|  a|  1|
|  a|  1|
+---+---+
You should also be careful about partitionBy vs. orderBy. Since you are partitioning by 'k', all of the values of k in any given window are the same. Sorting by 'k' does nothing.
The last function is not really the opposite of first in terms of which item from the window it returns. It returns the last non-null value it has seen so far, as it progresses through the ordered rows.
To compare their effects, here is a dataframe with both function/ordering combinations. Notice how in column 'last_w2', the null value has been replaced by -1.
df = spark.sparkContext.parallelize([
    ("a", None), ("a", 1), ("a", -1), ("b", 3), ("b", 1)]).toDF(["k", "v"])

# Create two windows for comparison.
w = Window().partitionBy("k").orderBy('v')
w2 = Window().partitionBy("k").orderBy(df.v.desc())

df.select('k', 'v',
          F.first("v", True).over(w).alias('first_w1'),
          F.last("v", True).over(w).alias('last_w1'),
          F.first("v", True).over(w2).alias('first_w2'),
          F.last("v", True).over(w2).alias('last_w2')
).show()
Output:
+---+----+--------+-------+--------+-------+
|  k|   v|first_w1|last_w1|first_w2|last_w2|
+---+----+--------+-------+--------+-------+
|  b|   1|       1|      1|       3|      1|
|  b|   3|       1|      3|       3|      3|
|  a|null|    null|   null|       1|     -1|
|  a|  -1|      -1|     -1|       1|     -1|
|  a|   1|      -1|      1|       1|      1|
+---+----+--------+-------+--------+-------+
Have a look at Question 47130030.
The issue is not with the last() function but with the window frame, which by default includes only rows up to the current one.
Using
w = Window().partitionBy("k").orderBy('k', 'v').rowsBetween(Window.unboundedPreceding, Window.unboundedFollowing)
will yield the expected results for both first() and last().
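Putting that together (a rough sketch reusing the df from the question), the unbounded frame lets first() and last() see the whole partition:
from pyspark.sql import functions as F
from pyspark.sql import Window

w = (Window.partitionBy("k").orderBy('v')
     .rowsBetween(Window.unboundedPreceding, Window.unboundedFollowing))

# With a full-partition frame, every row in a partition gets the same first/last value.
df.select('k', 'v',
          F.first("v", True).over(w).alias('first_v'),
          F.last("v", True).over(w).alias('last_v')).show()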