How can I make a column pair with respect to a group? - apache-spark

I have a dataframe and an id column as a group. For each id I want to pair its elements in the following way:
title   id
sal     1
summer  1
fada    1
row     2
winter  2
gole    2
jack    3
noway   3
Output:
title   id  pair
sal     1   None
summer  1   summer,sal
fada    1   fada,summer
row     2   None
winter  2   winter,row
gole    2   gole,winter
jack    3   None
noway   3   noway,jack
As you can see in the output, starting from the last element of each id group we pair each element with the element above it. Since the first element of the group does not have a pair, I put None. I should also mention that this can be done in pandas with the following code, but I need PySpark code since my data is big.
df=data.assign(pair=data.groupby('id')['title'].apply(lambda x: x.str.cat(x.shift(1),sep=',')))

I can't emphasise enough that a Spark dataframe is an unordered collection of rows, so saying something like "the element above it" is undefined without a column to order by. You can fake an ordering using F.monotonically_increasing_id(), but I'm not sure if that's what you want.
from pyspark.sql import functions as F, Window

w = Window.partitionBy('id').orderBy(F.monotonically_increasing_id())

df2 = df.withColumn(
    'pair',
    F.when(
        F.lag('title').over(w).isNotNull(),
        F.concat_ws(',', 'title', F.lag('title').over(w))
    )
)
df2.show()
+------+---+-----------+
| title| id| pair|
+------+---+-----------+
| sal| 1| null|
|summer| 1| summer,sal|
| fada| 1|fada,summer|
| jack| 3| null|
| noway| 3| noway,jack|
| row| 2| null|
|winter| 2| winter,row|
| gole| 2|gole,winter|
+------+---+-----------+
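If your data has a real ordering column (for example a timestamp or row index), it is safer to order the window by that column instead of faking one. A minimal sketch, assuming a hypothetical ordering column named order_col:

from pyspark.sql import functions as F, Window

# Sketch: order the window by an explicit column (order_col is hypothetical)
# instead of relying on monotonically_increasing_id().
w = Window.partitionBy('id').orderBy('order_col')

df2 = df.withColumn(
    'pair',
    F.when(
        F.lag('title').over(w).isNotNull(),
        F.concat_ws(',', 'title', F.lag('title').over(w))
    )
)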

Related

transform a spark dataframe based on the value of a column

Suppose I have the dataframe below:
s.id | s.first | s.last | s.age | d.first | d.last | d.age | UPDATED_FIELDS
1    | AAA     | BBB    | 10    | AAA__   | BBB    | 10    | ["first"]
2    | CCC     | DDD    | 20    | CCC__   | DDD    | 21    | ["first", "age"]
I want to transform it to the format below, so for each of the UPDATED_FIELDS in the first dataframe, I want to create a new row in my second dataframe.
id | field | s_value | d_value
1  | first | AAA     | AAA_
2  | first | CCC     | CCC_
2  | age   | 20      | 21
I feel like I need to create a new dataframe, but I couldn't get it working.
This isn't straightforward because after you explode "UPDATED_FIELDS" to create the "field" column, you have the following dataframe:
+----+-------+------+-----+-------+------+-----+--------------+-----+
|s.id|s.first|s.last|s.age|d.first|d.last|d.age|UPDATED_FIELDS|field|
+----+-------+------+-----+-------+------+-----+--------------+-----+
| 1| AAA| BBB| 10| AAA__| BBB| 10| [first]|first|
| 2| CCC| DDD| 20| CCC__| DDD| 21| [first, age]|first|
| 2| CCC| DDD| 20| CCC__| DDD| 21| [first, age]| age|
+----+-------+------+-----+-------+------+-----+--------------+-----+
The new columns s_value and d_value depend on using the literal from the exploded column "field" to tell us which column's value to use. It would be nice to do something like:
df.withColumn(
    "field", F.explode("UPDATED_FIELDS")
).withColumn(
    "s_value", F.col(f"s.{F.col('field')}")
)
... but this will result in an error because {F.col('field')} cannot be interpreted as a literal.
Update: based on this helpful answer, you can in fact use a string literal as a column lookup. First perform the explode on its own so that you can access df['field'], then use F.coalesce(*[F.when(...)]) to pick the value from whichever column name matches the string in df['field']:
df = df.withColumn(
    "field", F.explode("UPDATED_FIELDS")
)

# obtain list of all unique strings in "field"
base_cols = df.select(F.collect_set("field").alias("column")).first()["column"]

df.withColumn(
    "s_value", F.coalesce(*[F.when(df['field'] == c, df[f"`s.{c}`"]) for c in base_cols])
).withColumn(
    "d_value", F.coalesce(*[F.when(df['field'] == c, df[f"`d.{c}`"]) for c in base_cols])
).select(
    F.col("`s.id`").alias("id"), "field", "s_value", "d_value"
).show()
+---+-----+-------+-------+
| id|field|s_value|d_value|
+---+-----+-------+-------+
| 1|first| AAA| AAA__|
| 2|first| CCC| CCC__|
| 2| age| 20| 21|
+---+-----+-------+-------+
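To see why the coalesce-over-when trick works, here is a tiny self-contained sketch (the demo dataframe and its column names are made up): for each row exactly one F.when(...) condition matches and returns that column's value, the non-matching branches return NULL, and F.coalesce keeps the first non-NULL value.

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Hypothetical demo: pick col_a or col_b depending on the string in "which"
demo = spark.createDataFrame([("a", 1, 10), ("b", 2, 20)], ["which", "col_a", "col_b"])

demo.withColumn(
    "picked",
    F.coalesce(*[F.when(F.col("which") == c, F.col(f"col_{c}")) for c in ["a", "b"]])
).show()
# "picked" is 1 for the first row (from col_a) and 20 for the second (from col_b)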

Pyspark, when Column value starts with x, write as y

Hi folks, I'm augmenting my DF and was wondering if you can lend a helping hand.
df = df.withColumn(('COUNTRY'), when(col("COUNTRY").startsWith("US"), "US").otherwise("null"))
What I am trying to achieve is: where a column value starts with US, such as US_Rules_Forever, rewrite the value simply as US. Other values should be set to null.
ID  COUNTRY
1   US_RULES
2   US_SANDWICH
3   USA_CLICKING
4   GLOBAL_CHICKEN_SANDWICH

ID  COUNTRY
1   US
2   US
3   US
4   null
According to the docs, it should be startswith, not startsWith. w should not be capitalized.
df2 = df.withColumn('COUNTRY', when(col("COUNTRY").startswith("US"), "US"))
df2.show()
+---+-------+
| ID|COUNTRY|
+---+-------+
| 1| US|
| 2| US|
| 3| US|
| 4| null|
+---+-------+
mck was right - it was a syntax issue. Posting this for fellow devs:
df = df.withColumn(('COUNTRY'), when(col("COUNTRY").startswith("US"), "US").otherwise("null"))
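Note that .otherwise("null") writes the literal string "null"; if you want a real NULL (so that later isNull() checks or joins behave as expected), either drop the .otherwise(...) entirely or pass F.lit(None). A minimal sketch:

from pyspark.sql import functions as F

# F.lit(None) (or omitting .otherwise) leaves a true NULL rather than the string "null"
df2 = df.withColumn(
    "COUNTRY",
    F.when(F.col("COUNTRY").startswith("US"), "US").otherwise(F.lit(None)),
)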

PySpark - Returning a value into a dataframe, if a value occurs on another dataframe where two fields match

Sorry for the vague title, I can't think of a better way to put it. I understand a bit of Python and have some experience with Pandas dataframes, but recently I have been tasked to look at something involving Spark and I'm struggling to get my head around it.
I suppose the best way to explain this is with a small example. Imagine I have dataframe A:
id | Name |
--------------
1 | Random |
2 | Random |
3 | Random |
As well as dataframe B:
id | Fruit |
-------------
1 | Pear |
2 | Pear |
2 | Apple |
2 | Banana |
3 | Pear |
3 | Banana |
Now what I'm trying to do is match dataframe A with B (based on id matching) and iterate through the Fruit column in dataframe B. If a value comes up (say Banana), I want to add it as a column to the dataframe. It could be a simple sum (every time Banana comes up, add 1 to a column), or just a flag for whether it comes up at least once. So for example, an output could look like this:
id | Name | Banana
---------------------
1 | Random | 0
2 | Random | 1
3 | Random | 1
My issue is iterating through Spark dataframes, and how I can connect the two if the match does occur. I was trying to do something to this effect:
def fruit(input):
    fruits = {"Banana": "B"}
    return fruits[input]

fruits = df.withColumn("Output", fruit("Fruit"))
But it's not really working. Any ideas? Apologies in advance, my experience with Spark is very limited.
Hope this helps!
# sample data
A = sc.parallelize([(1, "Random"), (2, "Random"), (3, "Random")]).toDF(["id", "Name"])
B = sc.parallelize([(1, "Pear"), (2, "Pear"), (2, "Apple"), (2, "Banana"), (3, "Pear"), (3, "Banana")]).toDF(["id", "Fruit"])

df_temp = A.join(B, A.id == B.id, 'inner').drop(B.id)

df = df_temp.groupby(df_temp.id, df_temp.Name).\
    pivot("Fruit").\
    count().\
    na.fill(0)
df.show()
Output is
+---+------+-----+------+----+
| id| Name|Apple|Banana|Pear|
+---+------+-----+------+----+
| 1|Random| 0| 0| 1|
| 3|Random| 0| 1| 1|
| 2|Random| 1| 1| 1|
+---+------+-----+------+----+
Edit note: in case you are only interested in a few fruits, then:
from pyspark.sql.functions import col

# list of fruits you are interested in
fruit_list = ["Pear", "Banana"]

df = df_temp.\
    filter(col('fruit').isin(fruit_list)).\
    groupby(df_temp.id, df_temp.Name).\
    pivot("Fruit").\
    count().\
    na.fill(0)
df.show()
+---+------+------+----+
| id| Name|Banana|Pear|
+---+------+------+----+
| 1|Random| 0| 1|
| 3|Random| 1| 1|
| 2|Random| 1| 1|
+---+------+------+----+
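If you only need a 0/1 indicator for a single fruit, as in the example output in the question, a conditional aggregation after the join avoids the pivot entirely. A minimal sketch, assuming the same A and B dataframes as above:

from pyspark.sql import functions as F

# 1 if the id has at least one Banana row in B, otherwise 0
df_banana = (
    A.join(B, "id", "left")
     .groupBy("id", "Name")
     .agg(F.max(F.when(F.col("Fruit") == "Banana", 1).otherwise(0)).alias("Banana"))
)
df_banana.show()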

Join two dataframes in pyspark by one column

I have two dataframes that I need to join by one column, taking just the rows from the first dataframe whose id is contained in the same column of the second dataframe:
df1:
id  a    b
2   1    1
3   0.5  1
4   1    2
5   2    1

df2:
id  c   d
2   fs  a
5   fa  f

Desired output:
df:
id  a  b
2   1  1
5   2  1
I have tried df1.join(df2("id"), "left"), but it gives me the error: 'DataFrame' object is not callable.
df2("id") is not a valid python syntax for selecting columns, you'd either need df2[["id"]] or use select df2.select("id"); For your example, you can do:
df1.join(df2.select("id"), "id").show()
+---+---+---+
| id| a| b|
+---+---+---+
| 5|2.0| 1|
| 2|1.0| 1|
+---+---+---+
or:
df1.join(df2[["id"]], "id").show()
+---+---+---+
| id| a| b|
+---+---+---+
| 5|2.0| 1|
| 2|1.0| 1|
+---+---+---+
If you need to check whether id exists in df2 and don't need any columns from df2 in your output, then isin() is a more efficient solution (this is similar to EXISTS and IN in SQL).
df1 = spark.createDataFrame([(2,1,1) ,(3,5,1,),(4,1,2),(5,2,1)], "id: Int, a : Int , b : Int")
df2 = spark.createDataFrame([(2,'fs','a') ,(5,'fa','f')], ['id','c','d'])
Create a list from df2.id and pass it to df1's isin():
from pyspark.sql.functions import col
df2_list = df2.select('id').rdd.map(lambda row : row[0]).collect()
df1.where(col('id').isin(df2_list)).show()
#+---+---+---+
#| id| a| b|
#+---+---+---+
#| 2| 1| 1|
#| 5| 2| 1|
#+---+---+---+
It is recommended to use isin() if:
You don't need to return data from the reference dataframe/table
You have duplicates in the reference dataframe/table (a JOIN can cause duplicate rows if values are repeated)
You just want to check the existence of a particular value
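Another common pattern is a left semi join, which keeps only the df1 rows whose id exists in df2 without collecting the ids to the driver the way the isin() approach does. A minimal sketch with the same df1 and df2:

# Left semi join: returns only df1's columns, for ids that also appear in df2
df1.join(df2, "id", "left_semi").show()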

how to index categorical features in another way when using spark ml

The VectorIndexer in Spark indexes categorical features according to the frequency of the values. But I want to index the categorical features in a different way.
For example, with a dataset as below, "a", "b", "c" will be indexed as 0, 1, 2 if I use the VectorIndexer in Spark. But I want to index them according to the label.
There are 4 rows indexed as 1, and among them 3 rows have feature 'a' and 1 row has feature 'c'. So here I will index 'a' as 0, 'c' as 1 and 'b' as 2.
Is there any convenient way to implement this?
label|feature
-----------------
1 | a
1 | c
0 | a
0 | b
1 | a
0 | b
0 | b
0 | c
1 | a
If I understand your question correctly, you are looking to replicate the behaviour of StringIndexer() on grouped data. To do so (in PySpark), we first define a udf that will operate on a list column containing all the values per group. Note that elements with equal counts will be ordered arbitrarily.
from collections import Counter
from pyspark.sql.functions import udf
from pyspark.sql.types import ArrayType, IntegerType

def encoder(col):
    # Generate count per letter
    x = Counter(col)
    # Create a dictionary mapping each letter to its rank
    ranking = {pair[0]: rank
               for rank, pair in enumerate(x.most_common())}
    # Use the dictionary to replace letters by rank
    new_list = [ranking[i] for i in col]
    return new_list

encoder_udf = udf(encoder, ArrayType(IntegerType()))
Now we can aggregate the feature column into a list per label group using collect_list(), and apply our udf row-wise:
from pyspark.sql.functions import collect_list, explode

df1 = (df.groupBy("label")
         .agg(collect_list("feature")
              .alias("features"))
         .withColumn("index",
                     encoder_udf("features")))
Finally, you can explode the index column to get the encoded values instead of the letters:
df1.select("label", explode(df1.index).alias("index")).show()
+-----+-----+
|label|index|
+-----+-----+
| 0| 1|
| 0| 0|
| 0| 0|
| 0| 0|
| 0| 2|
| 1| 0|
| 1| 1|
| 1| 0|
| 1| 0|
+-----+-----+
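A UDF-free alternative, sketched under the same assumptions rather than taken from the answer above, is to count each (label, feature) pair, rank the features within each label by descending count with a window function, and join the 0-based rank back onto the original rows:

from pyspark.sql import functions as F, Window

# Count each (label, feature) pair, rank features within a label by count
# (ties broken by feature name for determinism), then join the rank back.
counts = df.groupBy("label", "feature").count()
w = Window.partitionBy("label").orderBy(F.desc("count"), "feature")
ranks = counts.withColumn("index", F.row_number().over(w) - 1)

df.join(ranks.select("label", "feature", "index"), ["label", "feature"]).show()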
