Create new cols by comparing values from old cols - apache-spark

Input dataframe:
Item  L  W  H
I1    3  5  8
I2    2  1  2
I3    6  9  1
I4    7  3  4
The output dataframe should be as below. Create 3 new columns, L_n, W_n and H_n, by comparing the values of the L, W and H cols: L_n is the longest dimension, W_n the medium and H_n the shortest.
Item  L  W  H  L_n  W_n  H_n
I1    3  5  8  8    5    3
I2    2  1  2  2    2    1
I3    6  9  1  9    6    1
I4    7  3  4  7    4    3

I suggest creating an array (array), sorting it (array_sort) and selecting elements one-by-one (element_at).
from pyspark.sql import functions as F

df = spark.createDataFrame(
    [('I1', 3, 5, 8),
     ('I2', 2, 1, 2),
     ('I3', 6, 9, 1),
     ('I4', 7, 3, 4)],
    ['Item', 'L', 'W', 'H']
)

# Sort the three dimensions ascending, then pick them from the end.
arr = F.array_sort(F.array('L', 'W', 'H'))
df = df.select(
    '*',
    F.element_at(arr, 3).alias('L_n'),  # largest
    F.element_at(arr, 2).alias('W_n'),  # middle
    F.element_at(arr, 1).alias('H_n'),  # smallest
)
df.show()
# +----+---+---+---+---+---+---+
# |Item| L| W| H|L_n|W_n|H_n|
# +----+---+---+---+---+---+---+
# | I1| 3| 5| 8| 8| 5| 3|
# | I2| 2| 1| 2| 2| 2| 1|
# | I3| 6| 9| 1| 9| 6| 1|
# | I4| 7| 3| 4| 7| 4| 3|
# +----+---+---+---+---+---+---+
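A variant for exactly three columns, not shown above: the same result can be obtained with greatest and least, deriving the middle value by subtraction. A minimal sketch, assuming the original Item, L, W, H columns (df_alt is just an illustrative name):
from pyspark.sql import functions as F

df_alt = df.select(
    'Item', 'L', 'W', 'H',
    F.greatest('L', 'W', 'H').alias('L_n'),   # longest
    (F.col('L') + F.col('W') + F.col('H')
     - F.greatest('L', 'W', 'H')
     - F.least('L', 'W', 'H')).alias('W_n'),  # middle = sum - max - min
    F.least('L', 'W', 'H').alias('H_n'),      # shortest
)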

Related

How to filter a dataframe with a specific condition in Spark

I have the following DF
Cod Category N
1 A 1
1 A 2
1 A 3
1 B 1
1 B 2
1 B 3
1 B 4
1 B 5
2 D 1
3 Z 1
3 Z 2
3 Z 3
3 Z 4
I need to filter this DF so that, when a category has a row with N > 3, all rows for that category are retrieved. My expected output, to simplify the example:
Cod Category N
1 B 1
1 B 2
1 B 3
1 B 4
1 B 5
3 Z 1
3 Z 2
3 Z 3
3 Z 4
How can I implement this type of filter? I tried to use window functions to generate another column with a flag indicating which rows to keep, but with no success.
You can use a window to associate to each row the maximum N present in its category. Then just apply your condition to this new column to filter the categories.
from pyspark.sql import functions as F
from pyspark.sql.window import Window

w = Window.partitionBy("Cod", "Category")
df = df.withColumn("max_N_in_category", F.max("N").over(w))

N = 3
df = df \
    .filter(F.col("max_N_in_category") > N) \
    .drop("max_N_in_category")
Data
from pyspark.sql.functions import collect_list, expr
from pyspark.sql.window import Window

df = spark.createDataFrame([(1, 'A', 1),
                            (1, 'A', 2),
                            (1, 'A', 3),
                            (1, 'B', 1),
                            (1, 'B', 2),
                            (1, 'B', 3),
                            (1, 'B', 4),
                            (1, 'B', 5),
                            (2, 'D', 1),
                            (3, 'Z', 1),
                            (3, 'Z', 2)],
                           ('Cod', 'Category', 'N'))

new = (df.withColumn('check', collect_list('N').over(Window.partitionBy('Cod', 'Category')))  # collect the N values of each group into column check
         .where(expr("exists(check, c -> array_contains(check, 3))"))  # keep only the groups whose array contains 3
         .drop('check')  # drop the helper column
      ).show()
outcome
+---+--------+---+
|Cod|Category| N|
+---+--------+---+
| 1| A| 1|
| 1| A| 2|
| 1| A| 3|
| 1| B| 1|
| 1| B| 2|
| 1| B| 3|
| 1| B| 4|
| 1| B| 5|
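Note that the exists predicate above keeps the groups whose collected N values contain the value 3. If the condition should be strictly N > 3, as stated in the question, the lambda can compare the elements themselves. A minimal sketch reusing the same helper column (new_strict is an illustrative name):
new_strict = (df.withColumn('check', collect_list('N').over(Window.partitionBy('Cod', 'Category')))
                .where(expr("exists(check, c -> c > 3)"))  # keep groups with at least one N greater than 3
                .drop('check'))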
If you don't want to use window functions, it can also be done with groupBy and a filter using isin() (note that the inner groupBy(...).collect() brings the aggregated categories back to the driver):
df.filter(df.Category.isin([x.Category for x in df.groupBy("Category").max("N").collect() if x["max(N)"] > 3]))
[Out]:
+---+--------+---+
|Cod|Category| N|
+---+--------+---+
| 1| B| 1|
| 1| B| 2|
| 1| B| 3|
| 1| B| 4|
| 1| B| 5|
| 3| Z| 1|
| 3| Z| 2|
| 3| Z| 3|
| 3| Z| 4|
+---+--------+---+

add rows in pyspark dataframe and adjust the column sequence accordingly

We have a dataframe like below, say DF1:
col_name    col_seq  Hash_enc_ind
abc         1        0
first_name  2        1
last_name   3        1
full_name   4        1
XYZ         5        0
sal         6        1
AAA         7        0
Now I want to expand each row where Hash_enc_ind = 1 into 2 rows and adjust col_seq accordingly, so that the output would be like
DF1:
col_name      col_seq  Hash_enc_ind
abc           1        0
first_name_h  2        1
first_name_e  3        1
last_name_h   4        1
last_name_e   5        1
full_name_h   6        1
full_name_e   7        1
XYZ           8        0
sal_h         9        1
sal_e         10       1
AAA           11       0
You can explode an array constructed with a case when expression:
import pyspark.sql.functions as F

df2 = df.withColumn(
    'col_name',
    F.expr("explode(transform(case when Hash_enc_ind = 1 then array('_h', '_e') else array('') end, x -> col_name || x))")
)
df2.show()
+------------+-------+------------+
| col_name|col_seq|Hash_enc_ind|
+------------+-------+------------+
| abc| 1| 0|
|first_name_h| 2| 1|
|first_name_e| 2| 1|
| last_name_h| 3| 1|
| last_name_e| 3| 1|
| full_name_h| 4| 1|
| full_name_e| 4| 1|
| XYZ| 5| 0|
| sal_h| 6| 1|
| sal_e| 6| 1|
| AAA| 7| 0|
+------------+-------+------------+
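The expected output in the question also renumbers col_seq after the explode, which the snippet above leaves duplicated. A minimal sketch of one way to re-derive it with row_number, assuming the '_h' row should precede the '_e' row (a descending sort on col_name happens to give that order here); df3 is an illustrative name:
from pyspark.sql import Window

df3 = df2.withColumn(
    'col_seq',
    F.row_number().over(Window.orderBy('col_seq', F.desc('col_name')))  # global ordering; fine for a small lookup table
)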

Mapping key and list of values to key value using pyspark

I have a dataset which consists of two columns, C1 and C2. The columns are associated by a many-to-many relation.
What I would like to do is find, for each C2 value, the C1 value which has the most associations with C2 values overall.
For example:
C1 | C2
1 | 2
1 | 5
1 | 9
2 | 9
2 | 8
We can see here that 1 is matched to 3 values of C2 while 2 is matched to 2, so I would like as output:
Out1 |Out2| matches
2 | 1 | 3
5 | 1 | 3
9 | 1 | 3 (1 wins because 3>2)
8 | 2 | 2
What I have done so far is:
dataset = sc.textFile("...").\
    map(lambda line: (line.split(",")[0], list(line.split(",")[1]))).\
    reduceByKey(lambda x, y: x + y)
What this does is, for each C1 value, gather all the C2 matches; the count of this list is our desired matches column. What I would like now is to somehow use each value in this list as a new key and have a mapping like:
(Key, Value_list[value1, value2, ...]) --> (value1, Key), (value2, Key), ...
How could this be done using Spark? Any advice would be really helpful.
Thanks in advance!
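A minimal RDD sketch of the mapping described above, assuming dataset is the pair RDD built in the snippet (expanded and best are illustrative names):
# expand each (C1, [C2 values]) pair into (C2, (C1, match_count)) records
expanded = dataset.flatMap(lambda kv: [(c2, (kv[0], len(kv[1]))) for c2 in kv[1]])
# keep, per C2, the C1 with the largest match count
best = expanded.reduceByKey(lambda a, b: a if a[1] >= b[1] else b)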
The dataframe API is perhaps easier for this kind of task. You can group by C1, get the count, then group by C2, and get the value of C1 that corresponds to the highest number of matches.
import pyspark.sql.functions as F

df = spark.read.csv('file.csv', header=True, inferSchema=True)

df2 = (df.groupBy('C1')
       .count()                                 # number of matches per C1
       .join(df, 'C1')
       .groupBy(F.col('C2').alias('Out1'))
       .agg(
           # max over a struct keeps the row with the highest count,
           # carrying the corresponding C1 along with it
           F.max(
               F.struct(F.col('count').alias('matches'), F.col('C1').alias('Out2'))
           ).alias('c')
       )
       .select('Out1', 'c.Out2', 'c.matches')
       .orderBy('Out1')
      )
df2.show()
+----+----+-------+
|Out1|Out2|matches|
+----+----+-------+
| 2| 1| 3|
| 5| 1| 3|
| 8| 2| 2|
| 9| 1| 3|
+----+----+-------+
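If you are on Spark 3.3 or later (an assumption; the question doesn't state a version), the pick-the-C1-with-the-highest-count step could also be written with max_by instead of the struct trick. A sketch (df3 is an illustrative name):
df3 = (df.groupBy('C1').count()
         .join(df, 'C1')
         .groupBy(F.col('C2').alias('Out1'))
         .agg(F.max_by('C1', 'count').alias('Out2'),  # C1 with the highest count
              F.max('count').alias('matches'))
         .orderBy('Out1'))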
We can get the desired result easily using the dataframe API.
from pyspark.sql import SparkSession
import pyspark.sql.functions as fun
from pyspark.sql.window import Window

spark = SparkSession.builder.master("local[*]").getOrCreate()

# preparing a sample dataframe
data = [(1, 2), (1, 5), (1, 9), (2, 9), (2, 8)]
schema = ["c1", "c2"]
df = spark.createDataFrame(data, schema)

output = df.withColumn("matches", fun.count("c1").over(Window.partitionBy("c1"))) \
    .groupby(fun.col("c2").alias("out1")) \
    .agg(fun.first(fun.col("c1")).alias("out2"), fun.max("matches").alias("matches"))
output.show()
# output
+----+----+-------+
|out1|out2|matches|
+----+----+-------+
| 9| 1| 3|
| 5| 1| 3|
| 8| 2| 2|
| 2| 1| 3|
+----+----+-------+

In pyspark, group based on a variable field, and add a counter for particular values (which resets when variable changes)

Create a spark dataframe from a pandas dataframe
import pandas as pd
df = pd.DataFrame({"b": ['A','A','A','A','B', 'B','B','C','C','D','D', 'D','D','D','D','D','D','D','D','D'],"Sno": [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20],"a": [3,-4,2, -1, -3, -1,-7,-6, 1, 1, -1, 1,4,5,-3,2,3,4, -1, -2]})
df2=spark.createDataFrame(df)
Next I use the window partition on the field 'b'
from pyspark.sql import window
win_spec = (window.Window.partitionBy(['b']).orderBy("Sno").rowsBetween(window.Window.unboundedPreceding, 0))
I added a field pos_neg based on the sign of the values, and created a lambda function
df2 = df2.withColumn("pos_neg",col("a") <0)
pos_neg_func =udf(lambda x: ((x) & (x != x.shift())).cumsum())
I tried creating a new column which is a counter for negative values, but within variable 'b', so the counter restarts when the value in 'b' changes. Also, if there are consecutive negative values, they should be treated as a single run and the counter changes by 1:
df3 = (df2.select('pos_neg',pos_neg_func('pos_neg').alias('val')))
I want the output as,
b Sno a val val_2
0 A 1 3 False 0
1 A 2 -4 True 1
2 A 3 2 False 1
3 A 4 -1 True 2
4 B 5 -3 True 1
5 B 6 -1 True 1
6 B 7 -7 True 1
7 C 8 -6 True 1
8 C 9 1 False 1
9 D 10 1 False 0
10 D 11 -1 True 1
11 D 12 1 False 1
12 D 13 4 False 1
13 D 14 5 False 1
14 D 15 -3 True 2
15 D 16 2 False 2
16 D 17 3 False 2
17 D 18 4 False 2
18 D 19 -1 True 3
19 D 20 -2 True 3
In Python (pandas) a simple function like the following works:
df['val'] = df.groupby('b')['pos_neg'].transform(lambda x: ((x) & (x != x.shift())).cumsum())
josh-friedlander provided support with the above code.
PySpark doesn't have a shift function, but you could work with the lag window function, which gives you the row before the current row. The first window (called w) sets the value of the val column to 1 if the value of the pos_neg column is True and the value of the previous pos_neg is False, and to 0 otherwise.
With the second window (called w2) we calculate the cumulative sum to get your desired result.
import pandas as pd
import pyspark.sql.functions as F
from pyspark.sql import Window

df = pd.DataFrame({"b": ['A','A','A','A','B', 'B','B','C','C','D','D', 'D','D','D','D','D','D','D','D','D'],"Sno": [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20],"a": [3,-4,2, -1, -3, -1,-7,-6, 1, 1, -1, 1,4,5,-3,2,3,4, -1, -2]})
df2 = spark.createDataFrame(df)

w = Window.partitionBy('b').orderBy('Sno')
w2 = Window.partitionBy('b').orderBy('Sno').rowsBetween(Window.unboundedPreceding, 0)

df2 = df2.withColumn("pos_neg", F.col("a") < 0)
# 1 at the start of each run of negative values, 0 otherwise
df2 = df2.withColumn('val', F.when((df2.pos_neg == True) & (F.lag('pos_neg', default=False).over(w) == False), 1).otherwise(0))
# cumulative sum of the run-start flags within each group
df2 = df2.withColumn('val', F.sum('val').over(w2))
df2.show()
Output:
+---+---+---+-------+---+
|Sno| a| b|pos_neg|val|
+---+---+---+-------+---+
| 5| -3| B| true| 1|
| 6| -1| B| true| 1|
| 7| -7| B| true| 1|
| 10| 1| D| false| 0|
| 11| -1| D| true| 1|
| 12| 1| D| false| 1|
| 13| 4| D| false| 1|
| 14| 5| D| false| 1|
| 15| -3| D| true| 2|
| 16| 2| D| false| 2|
| 17| 3| D| false| 2|
| 18| 4| D| false| 2|
| 19| -1| D| true| 3|
| 20| -2| D| true| 3|
| 8| -6| C| true| 1|
| 9| 1| C| false| 1|
| 1| 3| A| false| 0|
| 2| -4| A| true| 1|
| 3| 2| A| false| 1|
| 4| -1| A| true| 2|
+---+---+---+-------+---+
You may wonder why it was necessary to have a column which allows us to order the dataset. Let me try to explain this with an example. The following data was read by pandas and got an index assigned (left column). You want to count the occurrences of True in pos_neg, without counting consecutive True's more than once. This logic leads to the val_2 column as shown below:
b Sno a pos_neg val_2
0 A 1 3 False 0
1 A 2 -4 True 1
2 A 3 2 False 1
3 A 4 -1 True 2
4 A 5 -5 True 2
...but it depends on the index it got from pandas (the order of the rows). When you change the order of the rows (and the corresponding pandas index) you will get a different result when you apply your logic to the same rows, just because the order is different:
b Sno a pos_neg val_2
0 A 1 3 False 0
1 A 3 2 False 0
2 A 2 -4 True 1
3 A 4 -1 True 1
4 A 5 -5 True 1
You see that the order of the rows is important. You might now wonder why PySpark doesn't create an index like pandas does. That is because Spark keeps your data in several partitions which are distributed across your cluster, and, depending on your data source, it may even read the data in a distributed way. An index therefore can't be added while the data is being read. You can add one after the data has been read with the monotonically_increasing_id function, but your data could already have a different order compared to your data source due to the read process.
Your Sno column avoids this problem and guarantees that you will always get the same result for the same data (deterministic).
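A minimal sketch of adding such a surrogate index after the read, using the monotonically_increasing_id function mentioned above (row_id is an illustrative name); the generated ids are increasing but not consecutive, and they still reflect whatever order the partitions happened to be read in:
import pyspark.sql.functions as F
df_indexed = df2.withColumn("row_id", F.monotonically_increasing_id())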

Update column with a where clause in Pyspark

How do I update a column in a PySpark dataframe with a where clause?
This is similar to this SQL operation:
UPDATE table1 SET alpha1 = x WHERE alpha2 < 6;
where alpha1 and alpha2 are columns of table1.
For example, I have a dataframe table1 with the values below:
table1
alpha1 alpha2
3 7
4 5
5 4
6 8
dataframe table1 after the update:
alpha1 alpha2
3 7
x 5
x 4
6 8
How can I do this with a PySpark dataframe?
You are looking for the when function:
df = spark.createDataFrame([("3",7),("4",5),("5",4),("6",8)],["alpha1", "alpha2"])
df.show()
>>> +------+------+
>>> |alpha1|alpha2|
>>> +------+------+
>>> | 3| 7|
>>> | 4| 5|
>>> | 5| 4|
>>> | 6| 8|
>>> +------+------+
df2 = df.withColumn("alpha1", pyspark.sql.functions.when(df["alpha2"] < 6, "x").otherwise(df["alpha1"]))
df2.show()
>>>+------+------+
>>>|alpha1|alpha2|
>>>+------+------+
>>>| 3| 7|
>>>| x| 5|
>>>| x| 4|
>>>| 6| 8|
>>>+------+------+
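For comparison, a sketch of the same update written as a SQL expression, mirroring the UPDATE statement in the question; it assumes the same df as above, and df3 is an illustrative name:
import pyspark.sql.functions as F
df3 = df.withColumn("alpha1", F.expr("CASE WHEN alpha2 < 6 THEN 'x' ELSE alpha1 END"))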
