I have the following dataframe:
| Timestamp | info |
+-------------------+----------+
|2016-01-01 17:54:30| 8 |
|2016-02-01 12:16:18| 2 |
|2016-03-01 12:17:57| 1 |
|2016-04-01 10:05:21| 2 |
|2016-05-11 18:58:25| 7 |
|2016-06-11 11:18:29| 6 |
|2016-07-01 12:05:21| 3 |
|2016-08-11 11:58:25| 2 |
|2016-09-11 15:18:29| 9 |
I would like to create a new column named count which counts, over a window of (-2, 0) (the current row and the previous two), how many values are > 5. In the first two rows, where the full window is not available, I would put 0.
The resulting table should be:
| Timestamp | info | count |
+-------------------+----------+----------+
|2016-01-01 17:54:30| 8 | 0 |
|2016-02-01 12:16:18| 2 | 0 |
|2016-03-01 12:17:57| 1 | 1 |
|2016-04-01 10:05:21| 2 | 0 |
|2016-05-11 18:58:25| 7 | 1 |
|2016-06-11 11:18:29| 6 | 2 |
|2016-07-01 12:05:21| 3 | 2 |
|2016-08-11 11:58:25| 2 | 1 |
|2016-09-11 15:18:29| 9 | 1 |
I tried to do this but it didn't work:
w = Window.orderBy('Timestamp').rowsBetween(-2, 0)
df_input = df_input.withColumn("count", F.when((F.count("info").over(w) > 5), F.count("info").over(w) > 5).otherwise(0))
The following works if you don't mind the calculation also being performed over the (incomplete) windows of the first 2 rows.
w = Window.orderBy('Timestamp').rowsBetween(-2, 0)
df_input = df_input.withColumn('count', F.count(F.when(F.col('info') > 5, 1)).over(w))
df_input.show()
# +-------------------+----+-----+
# | Timestamp|info|count|
# +-------------------+----+-----+
# |2016-01-01 17:54:30| 8| 1|
# |2016-02-01 12:16:18| 2| 1|
# |2016-03-01 12:17:57| 1| 1|
# |2016-04-01 10:05:21| 2| 0|
# |2016-05-11 18:58:25| 7| 1|
# |2016-06-11 11:18:29| 6| 2|
# |2016-07-01 12:05:21| 3| 2|
# |2016-08-11 11:58:25| 2| 1|
# |2016-09-11 15:18:29| 9| 1|
# +-------------------+----+-----+
If you need the first 2 rows to be 0 without changing the window, you can use this when condition:
w = Window.orderBy('Timestamp').rowsBetween(-2, 0)
df_input = df_input.withColumn(
    'count',
    F.when(
        F.size(F.collect_list('info').over(w)) == 3,
        F.count(F.when(F.col('info') > 5, 1)).over(w)
    ).otherwise(0)
)
df_input.show()
# +-------------------+----+-----+
# | Timestamp|info|count|
# +-------------------+----+-----+
# |2016-01-01 17:54:30| 8| 0|
# |2016-02-01 12:16:18| 2| 0|
# |2016-03-01 12:17:57| 1| 1|
# |2016-04-01 10:05:21| 2| 0|
# |2016-05-11 18:58:25| 7| 1|
# |2016-06-11 11:18:29| 6| 2|
# |2016-07-01 12:05:21| 3| 2|
# |2016-08-11 11:58:25| 2| 1|
# |2016-09-11 15:18:29| 9| 1|
# +-------------------+----+-----+
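As a possible lighter variant of the frame-size check above (my own sketch, not part of the original answer), you can count the rows in the frame directly instead of materializing an array with collect_list:

w = Window.orderBy('Timestamp').rowsBetween(-2, 0)

# F.count(F.lit(1)) counts all rows in the frame, so the frame is
# complete exactly when the count equals 3
df_input = df_input.withColumn(
    'count',
    F.when(
        F.count(F.lit(1)).over(w) == 3,
        F.count(F.when(F.col('info') > 5, 1)).over(w)
    ).otherwise(0)
)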
We have a dataframe like the one below, say DF1:
col_name    col_seq  Hash_enc_ind
abc         1        0
first_name  2        1
last_name   3        1
full_name   4        1
XYZ         5        0
sal         6        1
AAA         7        0
Now I want to add 2 rows for each row where Hash_enc_ind = 1 and adjust col_seq accordingly, so that the output would look like:
DF1:

col_name      col_seq  Hash_enc_ind
abc           1        0
first_name_h  2        1
first_name_e  3        1
last_name_h   4        1
last_name_e   5        1
full_name_h   6        1
full_name_e   7        1
XYZ           8        0
sal_h         9        1
sal_e         10       1
AAA           11       0
You can explode an array constructed using a case when expression:
import pyspark.sql.functions as F
df2 = df.withColumn(
    'col_name',
    # for flagged rows, build [col_name || '_h', col_name || '_e'];
    # otherwise keep the original name, then explode the array
    F.expr("explode(transform(case when Hash_enc_ind = 1 then array('_h', '_e') else array('') end, x -> col_name || x))")
)
df2.show()
+------------+-------+------------+
| col_name|col_seq|Hash_enc_ind|
+------------+-------+------------+
| abc| 1| 0|
|first_name_h| 2| 1|
|first_name_e| 2| 1|
| last_name_h| 3| 1|
| last_name_e| 3| 1|
| full_name_h| 4| 1|
| full_name_e| 4| 1|
| XYZ| 5| 0|
| sal_h| 6| 1|
| sal_e| 6| 1|
| AAA| 7| 0|
+------------+-------+------------+
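Note that this keeps the original col_seq on the exploded rows. If you also want col_seq regenerated as a gap-free sequence like in your desired output, one possible follow-up (my addition, assuming the old sequence plus _h-before-_e defines the target order) is to overwrite it with row_number:

from pyspark.sql import Window

# order by the old sequence, then by name descending so *_h sorts before *_e;
# a window without partitionBy pulls everything into one partition, which is
# acceptable for a small metadata table like this
df3 = df2.withColumn(
    'col_seq',
    F.row_number().over(Window.orderBy('col_seq', F.desc('col_name')))
)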
I have the following Spark DataFrame:
id   month  column_1  column_2
A    1      100       0
A    2      200       1
A    3      800       2
A    4      1500      3
A    5      1200      0
A    6      1600      1
A    7      2500      2
A    8      2800      3
A    9      3000      4
I would like to create a new column, let's call it 'dif_column1', based on a dynamic lag given by column_2. The desired output would be:
id   month  column_1  column_2  dif_column1
A    1      100       0         0
A    2      200       1         100
A    3      800       2         700
A    4      1500      3         1400
A    5      1200      0         0
A    6      1600      1         400
A    7      2500      2         1300
A    8      2800      3         1600
A    9      3000      4         1800
I have tried to use the lag function, but apparently lag only accepts a constant integer offset, not a column, so this does not work:
w = Window.partitionBy("id")
sdf = sdf.withColumn("dif_column1", F.col("column_1") - F.lag("column_1",F.col("column_2")).over(w))
You can add a row number column, and do a self join based on the row number and the lag defined in column_2:
from pyspark.sql import functions as F, Window
w = Window.partitionBy("id").orderBy("month")
df1 = df.withColumn('rn', F.row_number().over(w))
df2 = df1.alias('t1').join(
    df1.alias('t2'),
    F.expr('(t1.id = t2.id) and (t1.rn = t2.rn + t1.column_2)'),
    'left'
).selectExpr(
    't1.*',
    't1.column_1 - t2.column_1 as dif_column1'
).drop('rn')
df2.show()
+---+-----+--------+--------+-----------+
| id|month|column_1|column_2|dif_column1|
+---+-----+--------+--------+-----------+
| A| 1| 100| 0| 0|
| A| 2| 200| 1| 100|
| A| 3| 800| 2| 700|
| A| 4| 1500| 3| 1400|
| A| 5| 1200| 0| 0|
| A| 6| 1600| 1| 400|
| A| 7| 2500| 2| 1300|
| A| 8| 2800| 3| 1600|
| A| 9| 3000| 4| 1800|
+---+-----+--------+--------+-----------+
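As an alternative that avoids the self-join (my own sketch, assuming Spark 2.4+ for element_at and non-ANSI mode, where an out-of-range index yields NULL rather than an error), you can collect the running history of column_1 and index into it from the end:

from pyspark.sql import functions as F, Window

# running history of column_1 up to and including the current row
w = Window.partitionBy('id').orderBy('month').rowsBetween(Window.unboundedPreceding, 0)

df2 = (
    df.withColumn('hist', F.collect_list('column_1').over(w))
      .withColumn(
          'dif_column1',
          # element_at with a negative index counts from the end of the
          # array: -1 is the current row, -(column_2 + 1) is column_2 rows back
          F.col('column_1') - F.expr('element_at(hist, -(cast(column_2 as int) + 1))')
      )
      .drop('hist')
)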
Create a Spark dataframe from a pandas dataframe:
import pandas as pd
df = pd.DataFrame({
    "b": ['A','A','A','A','B','B','B','C','C','D','D','D','D','D','D','D','D','D','D','D'],
    "Sno": [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20],
    "a": [3,-4,2,-1,-3,-1,-7,-6,1,1,-1,1,4,5,-3,2,3,4,-1,-2]
})
df2=spark.createDataFrame(df)
Next I use a window partitioned on the field 'b':
from pyspark.sql import window
win_spec = (window.Window.partitionBy(['b']).orderBy("Sno").rowsBetween(window.Window.unboundedPreceding, 0))
I add a field pos_neg based on the sign of the values, and create a lambda function:
df2 = df2.withColumn("pos_neg",col("a") <0)
pos_neg_func =udf(lambda x: ((x) & (x != x.shift())).cumsum())
I tried creating a new column which is a counter for negative values, but within variable 'b', so the counter restarts when the field in 'b' changes. Also, consecutive negative values should be treated as a single run, i.e. the counter only changes by 1 for the whole run:
df3 = (df2.select('pos_neg',pos_neg_func('pos_neg').alias('val')))
I want the output as,
b Sno a val val_2
0 A 1 3 False 0
1 A 2 -4 True 1
2 A 3 2 False 1
3 A 4 -1 True 2
4 B 5 -3 True 1
5 B 6 -1 True 1
6 B 7 -7 True 1
7 C 8 -6 True 1
8 C 9 1 False 1
9 D 10 1 False 0
10 D 11 -1 True 1
11 D 12 1 False 1
12 D 13 4 False 1
13 D 14 5 False 1
14 D 15 -3 True 2
15 D 16 2 False 2
16 D 17 3 False 2
17 D 18 4 False 2
18 D 19 -1 True 3
19 D 20 -2 True 3
In pandas, a simple transform like the following works:
df['val'] = df.groupby('b')['pos_neg'].transform(lambda x: ((x) & (x != x.shift())).cumsum())
(josh-friedlander provided support with the above code.)
PySpark doesn't have a shift function, but you can work with the lag window function, which gives you the row before the current row. The first window (called w) sets the value of the val column to 1 if the value of the pos_neg column is True and the previous pos_neg is False, and to 0 otherwise. With the second window (called w2) we calculate the cumulative sum of val to get your desired output:
import pandas as pd
import pyspark.sql.functions as F
from pyspark.sql import Window
df = pd.DataFrame({
    "b": ['A','A','A','A','B','B','B','C','C','D','D','D','D','D','D','D','D','D','D','D'],
    "Sno": [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20],
    "a": [3,-4,2,-1,-3,-1,-7,-6,1,1,-1,1,4,5,-3,2,3,4,-1,-2]
})
df2=spark.createDataFrame(df)
w = Window.partitionBy('b').orderBy('Sno')
w2 = Window.partitionBy('b').orderBy('Sno').rowsBetween(Window.unboundedPreceding, 0)

# flag negative values
df2 = df2.withColumn('pos_neg', F.col('a') < 0)
# 1 marks the start of a new run of negatives (previous row was not negative)
df2 = df2.withColumn('val', F.when((df2.pos_neg == True) & (F.lag('pos_neg', default=False).over(w) == False), 1).otherwise(0))
# cumulative sum of the run starts gives the counter
df2 = df2.withColumn('val', F.sum('val').over(w2))
df2.show()
Output:
+---+---+---+-------+---+
|Sno| a| b|pos_neg|val|
+---+---+---+-------+---+
| 5| -3| B| true| 1|
| 6| -1| B| true| 1|
| 7| -7| B| true| 1|
| 10| 1| D| false| 0|
| 11| -1| D| true| 1|
| 12| 1| D| false| 1|
| 13| 4| D| false| 1|
| 14| 5| D| false| 1|
| 15| -3| D| true| 2|
| 16| 2| D| false| 2|
| 17| 3| D| false| 2|
| 18| 4| D| false| 2|
| 19| -1| D| true| 3|
| 20| -2| D| true| 3|
| 8| -6| C| true| 1|
| 9| 1| C| false| 1|
| 1| 3| A| false| 0|
| 2| -4| A| true| 1|
| 3| 2| A| false| 1|
| 4| -1| A| true| 2|
+---+---+---+-------+---+
You may wonder why it was necessary to have a column which allows us to order the dataset. Let me try to explain this with an example. The following data was read by pandas and got an index assigned (left column). You want to count the occurrences of True in pos_neg without counting consecutive True's twice. This logic leads to the val_2 column as shown below:
b Sno a pos_neg val_2
0 A 1 3 False 0
1 A 2 -4 True 1
2 A 3 2 False 1
3 A 4 -1 True 2
4 A 5 -5 True 2
...but it depends on the index it got from pandas (the order of the rows). When you change the order of the rows (and the corresponding pandas index), you will get a different result when you apply your logic to the same rows, just because the order is different:
b Sno a pos_neg val_2
0 A 1 3 False 0
1 A 3 2 False 0
2 A 2 -4 True 1
3 A 4 -1 True 1
4 A 5 -5 True 1
You see that the order of the rows is important. You might now wonder why PySpark doesn't create an index like pandas does. That is because Spark keeps your data in several partitions which are distributed across your cluster, and, depending on your data source, it may even read the data in a distributed way. An index therefore can't be added while the data is being read. You can add one after the data has been read with the monotonically_increasing_id function, but your data could already have a different order compared to your data source due to the read process.
Your Sno column avoids this problem and guarantees that you will always get the same result for the same data (it is deterministic).
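For completeness, a minimal sketch of adding such an id after reading (my addition; note that monotonically_increasing_id yields increasing but non-consecutive values, so it is only safe for ordering, not as a gap-free index):

from pyspark.sql import functions as F

# ids are monotonically increasing within the dataframe, but not
# consecutive, and they reflect the order Spark read the data in
df_with_id = df2.withColumn('row_id', F.monotonically_increasing_id())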
I have a PySpark dataframe(input_dataframe) which looks like below:
id   col1  col2  col3  col4  col_check
101  1     0     1     1     -1
102  0     1     1     0     -1
103  1     1     0     1     -1
104  0     0     1     1     -1
I want a PySpark function (update_col_check) which updates the column col_check of this dataframe. I will pass one column name as an argument to this function. The function should check if the value of that column is 1, and if so set col_check to that column name. Let us say I am passing col2 as an argument to this function:
output_dataframe = update_col_check(input_dataframe, 'col2')
So, my output_dataframe should look like below:
id   col1  col2  col3  col4  col_check
101  1     0     1     1     -1
102  0     1     1     0     col2
103  1     1     0     1     col2
104  0     0     1     1     -1
Can I achieve this using PySpark? Any help will be appreciated.
You can do this fairly straightforwardly with the functions when and otherwise:
from pyspark.sql.functions import when, lit
def update_col_check(df, col_name):
    # where the given column equals 1, stamp its name into col_check;
    # otherwise keep the existing col_check value
    return df.withColumn('col_check', when(df[col_name] == 1, lit(col_name)).otherwise(df['col_check']))
update_col_check(df, 'col1').show()
+---+----+----+----+----+---------+
| id|col1|col2|col3|col4|col_check|
+---+----+----+----+----+---------+
|101| 1| 0| 1| 1| col1|
|102| 0| 1| 1| 0| -1|
|103| 1| 1| 0| 1| col1|
|104| 0| 0| 1| 1| -1|
+---+----+----+----+----+---------+
update_col_check(df, 'col2').show()
+---+----+----+----+----+---------+
| id|col1|col2|col3|col4|col_check|
+---+----+----+----+----+---------+
|101| 1| 0| 1| 1| -1|
|102| 0| 1| 1| 0| col2|
|103| 1| 1| 0| 1| col2|
|104| 0| 0| 1| 1| -1|
+---+----+----+----+----+---------+
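If you need to run the check over several columns in one pass, you can fold the function over a list of names (my addition, not part of the original answer; note that later columns overwrite earlier matches, so the order of cols_to_check matters):

from functools import reduce

# hypothetical column order; the last matching column wins
cols_to_check = ['col1', 'col2', 'col3', 'col4']

# reduce feeds the accumulated dataframe and the next column name
# into update_col_check, starting from df
df_checked = reduce(update_col_check, cols_to_check, df)
df_checked.show()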