PySpark: when column value starts with x, write as y

Hi folks, I'm augmenting my DF and was wondering if you could lend a helping hand.
df = df.withColumn(('COUNTRY'), when(col("COUNTRY").startsWith("US"), "US").otherwise("null"))
What I am trying to achieve: where a column value starts with US, such as US_Rules_Forever, rewrite the value as simply US. All other values should be set to null.
Input:
ID  COUNTRY
1   US_RULES
2   US_SANDWICH
3   USA_CLICKING
4   GLOBAL_CHICKEN_SANDWICH

Desired output:
ID  COUNTRY
1   US
2   US
3   US
4   null

According to the docs, it should be startswith, not startsWith; the w should not be capitalized.
from pyspark.sql.functions import col, when
df2 = df.withColumn('COUNTRY', when(col("COUNTRY").startswith("US"), "US"))
df2.show()
+---+-------+
| ID|COUNTRY|
+---+-------+
| 1| US|
| 2| US|
| 3| US|
| 4| null|
+---+-------+

mck was right - it was a syntax issue. Posting this for fellow devs:
df = df.withColumn('COUNTRY', when(col("COUNTRY").startswith("US"), "US").otherwise("null"))
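One detail worth noting (not part of the original answers): .otherwise("null") writes the literal string "null", not a real null. To get an actual null like the output above, either omit .otherwise() or pass lit(None), for example:

from pyspark.sql.functions import col, when, lit

# non-US rows get a true null instead of the string "null"
df = df.withColumn('COUNTRY', when(col("COUNTRY").startswith("US"), "US").otherwise(lit(None)))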

Related

How can I make a column pair with respect to a group?

I have a dataframe with an id column that defines groups. For each id I want to pair its elements in the following way:
title id
sal 1
summer 1
fada 1
row 2
winter 2
gole 2
jack 3
noway 3
Output:
title id pair
sal 1 None
summer 1 summer,sal
fada 1 fada,summer
row 2 None
winter 2 winter, row
gole 2 gole,winter
jack 3 None
noway 3 noway,jack
As you can see in the output, within each id group we pair each element with the element above it, starting from the last element of the group. Since the first element of a group does not have a pair, I put None. I should also mention that this can be done in pandas with the following code, but I need PySpark code since my data is big.
df=data.assign(pair=data.groupby('id')['title'].apply(lambda x: x.str.cat(x.shift(1),sep=',')))
I can't emphasise this enough: a Spark dataframe is an unordered collection of rows, so saying something like "the element above it" is undefined without a column to order by. You can fake an ordering using F.monotonically_increasing_id(), but I'm not sure if that's what you wanted.
from pyspark.sql import functions as F, Window
w = Window.partitionBy('id').orderBy(F.monotonically_increasing_id())
df2 = df.withColumn(
    'pair',
    F.when(
        F.lag('title').over(w).isNotNull(),
        F.concat_ws(',', 'title', F.lag('title').over(w))
    )
)
df2.show()
+------+---+-----------+
| title| id| pair|
+------+---+-----------+
| sal| 1| null|
|summer| 1| summer,sal|
| fada| 1|fada,summer|
| jack| 3| null|
| noway| 3| noway,jack|
| row| 2| null|
|winter| 2| winter,row|
| gole| 2|gole,winter|
+------+---+-----------+
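A hedged follow-up (not from the original answer): if your data does have a real ordering column, use it in the window instead of the fake ordering, since monotonically_increasing_id() only reflects read/partition order. The column name below is a hypothetical example.

from pyspark.sql import functions as F, Window

# 'event_time' is a hypothetical column that defines the true order within each id
w = Window.partitionBy('id').orderBy('event_time')
df2 = df.withColumn(
    'pair',
    F.when(
        F.lag('title').over(w).isNotNull(),
        F.concat_ws(',', 'title', F.lag('title').over(w))
    )
)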

How to add a new column at a specific position in an existing DataFrame in Spark

I have a dataframe consisting of 5 columns. I need to add a new column at the 3rd position. How can I achieve this in Spark?
df.show()
+---------+--------+---+----------+--------+
|last_name|position|age|salary_inc| segment|
+---------+--------+---+----------+--------+
| george| IT| 10| 2313| one|
| jhon| non-it| 21| 34344| null|
| mark| IT| 11| 16161| third|
| spencer| it| 31| 2322| null|
| spencer| non-it| 41| 2322|Valuable|
+---------+--------+---+----------+--------+
Add new_column at position 3
+---------+--------+-----------+---+----------+--------+
|last_name|position|new_column |age|salary_inc| segment|
+---------+--------+-----------+---+----------+--------+
Can you please help me with this?
(
    df.withColumn("new_column", ...)
      .select("last_name",
              "position",
              "new_column",
              ...)
      .show()
)
Where the first ellipsis indicates what you're creating in your new column called "new_column"; for example, lit(1) would give you the literal (constant) 1 of type IntegerType. The second ellipsis indicates the remaining columns in the order you wish to select.
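A minimal filled-in sketch, assuming a constant column as a placeholder (swap in whatever expression you actually need); the remaining column names are taken from the example dataframe above:

from pyspark.sql.functions import lit

(
    df.withColumn("new_column", lit(1))   # placeholder: constant 1 of type IntegerType
      .select("last_name",
              "position",
              "new_column",               # new column now sits at the 3rd position
              "age",
              "salary_inc",
              "segment")
      .show()
)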

PySpark - Returning a value into a dataframe, if a value occurs on another dataframe where two fields match

Sorry for the vague title, I can't think of a better way to put it. I understand a bit of Python and have some experience with pandas dataframes, but recently I have been tasked to look at something involving Spark and I'm struggling to get my head around it.
I suppose the best way to explain this is with a small example. Imagine I have dataframe A:
id | Name |
--------------
1 | Random |
2 | Random |
3 | Random |
As well as dataframe B:
id | Fruit |
-------------
1 | Pear |
2 | Pear |
2 | Apple |
2 | Banana |
3 | Pear |
3 | Banana |
Now what I'm trying to do is match dataframe A with B (based on id), and iterate through the Fruit column in dataframe B. If a value comes up (say Banana), I want to add it as a column to dataframe A. It could be a simple count (every time Banana comes up, add 1 to a column), or just a flag if it comes up at least once. So for example, an output could look like this:
id | Name | Banana
---------------------
1 | Random | 0
2 | Random | 1
3 | Random | 1
My issue is iterating through Spark dataframes, and how I can connect the two if the match does occur. I was trying to do something to this effect:
def fruit(input):
    fruits = {"Banana": "B"}
    return fruits[input]

fruits = df.withColumn("Output", fruit("Fruit"))
But it's not really working. Any ideas? Apologies in advance, my experience with Spark is very limited.
Hope this helps!
#sample data
A = sc.parallelize([(1,"Random"), (2,"Random"), (3,"Random")]).toDF(["id", "Name"])
B = sc.parallelize([(1,"Pear"), (2,"Pear"), (2,"Apple"), (2,"Banana"), (3,"Pear"), (3,"Banana")]).toDF(["id", "Fruit"])
df_temp = A.join(B, A.id==B.id, 'inner').drop(B.id)
df = df_temp.groupby(df_temp.id, df_temp.Name).\
    pivot("Fruit").\
    count().\
    na.fill(0)
df.show()
Output is
+---+------+-----+------+----+
| id| Name|Apple|Banana|Pear|
+---+------+-----+------+----+
| 1|Random| 0| 0| 1|
| 3|Random| 0| 1| 1|
| 2|Random| 1| 1| 1|
+---+------+-----+------+----+
Edit note: In case you are only interested in a few fruits, then:
from pyspark.sql.functions import col
#list of fruits you are interested in
fruit_list = ["Pear", "Banana"]
df = df_temp.\
    filter(col('fruit').isin(fruit_list)).\
    groupby(df_temp.id, df_temp.Name).\
    pivot("Fruit").\
    count().\
    na.fill(0)
df.show()
+---+------+------+----+
| id| Name|Banana|Pear|
+---+------+------+----+
| 1|Random| 0| 1|
| 3|Random| 1| 1|
| 2|Random| 1| 1|
+---+------+------+----+

Join two dataframes in pyspark by one column

I have two dataframes that I need to join by one column, keeping just the rows from the first dataframe whose id is contained in the same column of the second dataframe:
df1:
id a b
2 1 1
3 0.5 1
4 1 2
5 2 1
df2:
id c d
2 fs a
5 fa f
Desired output:
df:
id a b
2 1 1
5 2 1
I have tried df1.join(df2("id"), "left"), but it gives me the error: 'DataFrame' object is not callable.
df2("id") is not valid Python syntax for selecting columns; you'd either need df2[["id"]] or use select: df2.select("id"). For your example, you can do:
df1.join(df2.select("id"), "id").show()
+---+---+---+
| id| a| b|
+---+---+---+
| 5|2.0| 1|
| 2|1.0| 1|
+---+---+---+
or:
df1.join(df2[["id"]], "id").show()
+---+---+---+
| id| a| b|
+---+---+---+
| 5|2.0| 1|
| 2|1.0| 1|
+---+---+---+
If you need to check whether id exists in df2 and don't need any columns from df2 in your output, then isin() is the more efficient solution (it is similar to EXISTS and IN in SQL).
df1 = spark.createDataFrame([(2,1,1) ,(3,5,1,),(4,1,2),(5,2,1)], "id: Int, a : Int , b : Int")
df2 = spark.createDataFrame([(2,'fs','a') ,(5,'fa','f')], ['id','c','d'])
Collect df2.id as a list and pass it to isin() on df1:
from pyspark.sql.functions import col
df2_list = df2.select('id').rdd.map(lambda row : row[0]).collect()
df1.where(col('id').isin(df2_list)).show()
#+---+---+---+
#| id| a| b|
#+---+---+---+
#| 2| 1| 1|
#| 5| 2| 1|
#+---+---+---+
It is recommended to use isin() if:
You don't need to return data from the reference dataframe/table
You have duplicates in the reference dataframe/table (a JOIN can produce duplicate rows if values are repeated)
You just want to check the existence of a particular value
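If collecting df2.id to the driver is a concern (for example, when df2 is large), a left-semi join is a sketch of an alternative that expresses the same EXISTS-style check without a collect; it is not part of the original answer:

# keeps only df1 rows whose id exists in df2; df2 columns are never returned,
# and each df1 row appears at most once even if df2 has duplicate ids
df1.join(df2, "id", "leftsemi").show()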

PySpark: Subtract Dataframe Ignoring Some Columns

I want to perform a subtract between 2 dataframes in PySpark. The challenge is that I have to ignore some columns while subtracting, but the end dataframe should have all the columns, including the ignored ones.
Here is an example:
from pyspark.sql import Row

userLeft = sc.parallelize([
    Row(id=u'1',
        first_name=u'Steve',
        last_name=u'Kent',
        email=u's.kent#email.com',
        date1=u'2017-02-08'),
    Row(id=u'2',
        first_name=u'Margaret',
        last_name=u'Peace',
        email=u'marge.peace#email.com',
        date1=u'2017-02-09'),
    Row(id=u'3',
        first_name=None,
        last_name=u'hh',
        email=u'marge.hh#email.com',
        date1=u'2017-02-10')
]).toDF()

userRight = sc.parallelize([
    Row(id=u'2',
        first_name=u'Margaret',
        last_name=u'Peace',
        email=u'marge.peace#email.com',
        date1=u'2017-02-11'),
    Row(id=u'3',
        first_name=None,
        last_name=u'hh',
        email=u'marge.hh#email.com',
        date1=u'2017-02-12')
]).toDF()
Expected:
ActiveDF = userLeft.subtract(userRight)   # but ignore the "date1" column while subtracting
The end result should look something like this, including the "date1" column:
+----------+--------------------+----------+---+---------+
| date1| email|first_name| id|last_name|
+----------+--------------------+----------+---+---------+
|2017-02-08| s.kent#email.com| Steve| 1| Kent|
+----------+--------------------+----------+---+---------+
Seems you need an anti-join:
userLeft.join(userRight, ["id"], "leftanti").show()
+----------+----------------+----------+---+---------+
| date1| email|first_name| id|last_name|
+----------+----------------+----------+---+---------+
|2017-02-08|s.kent#email.com| Steve| 1| Kent|
+----------+----------------+----------+---+---------+
You can also use a full join and only keep null values:
import pyspark.sql.functions as psf

userLeft.join(
    userRight,
    [c for c in userLeft.columns if c != "date1"],
    "full"
).filter(psf.isnull(userLeft.date1) | psf.isnull(userRight.date1)).show()
+------------------+----------+---+---------+----------+----------+
| email|first_name| id|last_name| date1| date1|
+------------------+----------+---+---------+----------+----------+
|marge.hh#email.com| null| 3| hh|2017-02-10| null|
|marge.hh#email.com| null| 3| hh| null|2017-02-12|
| s.kent#email.com| Steve| 1| Kent|2017-02-08| null|
+------------------+----------+---+---------+----------+----------+
If you want to use joins, whether leftanti or full, you'll need to find default values for the nulls in your joining columns (I think we discussed it in a previous thread).
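For example, a hedged sketch of that idea, filling nulls in the join columns with a placeholder; "__NULL__" is an arbitrary sentinel assumed not to occur in the real data:

# null == null is never true in an equi-join, so give nulls a sentinel value first
join_cols = [c for c in userLeft.columns if c != "date1"]

left_filled = userLeft.fillna("__NULL__", subset=join_cols)
right_filled = userRight.fillna("__NULL__", subset=join_cols)

left_filled.join(right_filled, join_cols, "leftanti").show()
# note: rows that had nulls will show the sentinel in the result; map it back if needed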
You can also just drop the column that bothers you, subtract, and join:
df = userLeft.drop("date1").subtract(userRight.drop("date1"))
userLeft.join(df, df.columns).show()
+----------------+----------+---+---------+----------+
| email|first_name| id|last_name| date1|
+----------------+----------+---+---------+----------+
|s.kent#email.com| Steve| 1| Kent|2017-02-08|
+----------------+----------+---+---------+----------+
