Subtract column values of two rows based on Dense Rank - apache-spark

I have a dataframe which has the below data:

srl_no   created_on           completed_on         prev_completed_on    time_from_last  Dense_Rank
XXXXXX1  2020-10-09T08:52:25  2020-10-09T08:57:45  null                 null            1
XXXXXX1  2020-10-09T09:04:32  2020-10-09T09:06:37  2020-10-09T08:57:45  407             2
XXXXXX1  2020-10-09T09:10:10  2020-10-09T09:12:17  2020-10-09T09:06:37  213             3
XXXXXX1  2020-10-09T09:10:10  2020-10-09T09:12:17  2020-10-09T09:12:17  -127            3
I want to subtract prev_completed_on from created_on to get time_from_last. However, because the last two rows have the same created_on and completed_on, the result for the last row comes out negative. In that scenario I need to subtract the value from the previous dense rank instead, i.e. subtract based on the dense_rank column.
So in the scenario above, for the 4th row I need to subtract the completed_on value of the 2nd row from the created_on value of the 4th row.
Code for the above
from pyspark.sql import functions as f
from pyspark.sql.window import Window
from pyspark.sql.types import TimestampType, LongType

df = spark.createDataFrame(
    [
        ('XXXXXX1','2020-10-09T08:52:25','2020-10-09T08:57:45'), # create your data here, be consistent in the types.
        ('XXXXXX1','2020-10-09T09:04:32','2020-10-09T09:06:37'),
        ('XXXXXX1','2020-10-09T09:10:10','2020-10-09T09:12:17'),
        ('XXXXXX1','2020-10-09T09:10:10','2020-10-09T09:12:17'),
    ],
    ['srl_no', 'created_on', 'completed_on'] # add your column labels here
)
df = df.withColumn('created_on', f.col('created_on').cast(TimestampType()))
df = df.withColumn('completed_on', f.col('completed_on').cast(TimestampType()))

partition_cols = ["srl_no"]
window_clause = Window.partitionBy(partition_cols).orderBy(f.col('completed_on').asc())

# previous completed_on per partition
df1 = df.withColumn('prev_completed_on', f.lag(f.col("completed_on"))
        .over(window_clause).cast(TimestampType()))
# dense rank over the same window
df1 = df1.withColumn('dense_rank', f.dense_rank().over(window_clause))
# difference in seconds
df1 = df1.withColumn("time_from_last",
        f.col("created_on").cast(LongType()) - f.col("prev_completed_on").cast(LongType()))
Expected output:

srl_no   created_on           completed_on         prev_completed_on    time_from_last  Dense_Rank
XXXXXX1  2020-10-09T08:52:25  2020-10-09T08:57:45  null                 null            1
XXXXXX1  2020-10-09T09:04:32  2020-10-09T09:06:37  2020-10-09T08:57:45  407             2
XXXXXX1  2020-10-09T09:10:10  2020-10-09T09:12:17  2020-10-09T09:06:37  213             3
XXXXXX1  2020-10-09T09:10:10  2020-10-09T09:12:17  2020-10-09T09:12:17  **213**         3

The trick here is to use a groupby to get the minimum prev_completed_on per srl_no and dense_rank. Joining that back to the prepared data frame gives the required result.
from pyspark.sql import functions as F
from pyspark.sql import types as T
from pyspark.sql.window import Window

df = spark.createDataFrame(
    [
        ('XXXXXX1','2020-10-09T08:52:25','2020-10-09T08:57:45'), # create your data here, be consistent in the types.
        ('XXXXXX1','2020-10-09T09:04:32','2020-10-09T09:06:37'),
        ('XXXXXX1','2020-10-09T09:10:10','2020-10-09T09:12:17'),
        ('XXXXXX1','2020-10-09T09:10:10','2020-10-09T09:12:17'),
    ],
    ['srl_no', 'created_on', 'completed_on'] # add your column labels here
)
df = df.withColumn('created_on', F.col('created_on').cast(T.TimestampType()))
df = df.withColumn('completed_on', F.col('completed_on').cast(T.TimestampType()))

partition_cols = ["srl_no"]
window_clause = Window.partitionBy(partition_cols).orderBy(F.col('completed_on').asc())

# previous completed_on and dense rank over the same window
df_with_rank = df.withColumn('prev_completed_on', F.lag(F.col("completed_on"))
                .over(window_clause).cast(T.TimestampType()))
df_with_rank = df_with_rank.withColumn('dense_rank', F.dense_rank().over(window_clause))

# minimum prev_completed_on per (srl_no, dense_rank)
dense_rank = df_with_rank.groupby("srl_no", "dense_rank") \
    .agg(F.min('prev_completed_on').alias('prev_completed_on'))

# replace prev_completed_on with the per-rank minimum
df_with_rank = df_with_rank.drop('prev_completed_on')
df_with_rank = df_with_rank.join(dense_rank, ["srl_no", "dense_rank"], 'left')
df_with_rank.show()
Output:
+-------+----------+-------------------+-------------------+-------------------+
| srl_no|dense_rank|         created_on|       completed_on|  prev_completed_on|
+-------+----------+-------------------+-------------------+-------------------+
|XXXXXX1|         1|2020-10-09 08:52:25|2020-10-09 08:57:45|               null|
|XXXXXX1|         2|2020-10-09 09:04:32|2020-10-09 09:06:37|2020-10-09 08:57:45|
|XXXXXX1|         3|2020-10-09 09:10:10|2020-10-09 09:12:17|2020-10-09 09:06:37|
|XXXXXX1|         3|2020-10-09 09:10:10|2020-10-09 09:12:17|2020-10-09 09:06:37|
+-------+----------+-------------------+-------------------+-------------------+
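To also get the time_from_last column from the question on top of this joined frame, one more step is needed. A minimal sketch, not part of the original answer, reusing the F and T aliases and df_with_rank from above and casting the timestamps to seconds the same way the question does:

# difference in seconds between created_on and the per-rank minimum prev_completed_on
df_with_rank = df_with_rank.withColumn(
    "time_from_last",
    F.col("created_on").cast(T.LongType()) - F.col("prev_completed_on").cast(T.LongType())
)
df_with_rank.show()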

Related

How to transpose a Pandas DataFrame and name new columns?

I have a simple Pandas DataFrame with 3 columns. I am trying to transpose it and then rename the columns of the new dataframe, and I am having a bit of trouble.
import pandas as pd

df = pd.DataFrame({'TotalInvoicedPrice': [123],
                   'TotalProductCost': [18],
                   'ShippingCost': [5]})
I tried using
df = df.T
which transposes the DataFrame into:
TotalInvoicedPrice,123
TotalProductCost,18
ShippingCost,5
So now I have to add the column names "Metrics" and "Values" to this data frame.
I tried using
df.columns["Metrics","Values"]
but I'm getting errors.
What I need to get is a DataFrame that looks like:
Metrics Values
0 TotalInvoicedPrice 123
1 TotalProductCost 18
2 ShippingCost 5
Let's reset the index, then set the column labels:
df.T.reset_index().set_axis(['Metrics', 'Values'], axis=1)
Metrics Values
0 TotalInvoicedPrice 123
1 TotalProductCost 18
2 ShippingCost 5
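As an aside on the attempt in the question: df.columns["Metrics","Values"] only tries to index the existing column Index, which raises an error; to relabel the columns you assign to df.columns instead. A small sketch of that route, using the same data as above:

import pandas as pd

df = pd.DataFrame({'TotalInvoicedPrice': [123],
                   'TotalProductCost': [18],
                   'ShippingCost': [5]})

out = df.T.reset_index()             # transpose, old column names become a regular column
out.columns = ['Metrics', 'Values']  # assign the new labels rather than indexing df.columns
print(out)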
Maybe you can avoid the transpose operation (a little performance overhead):
#YOUR DATAFRAME
df = pd.DataFrame({'TotalInvoicedPrice': [123],
'TotalProductCost': [18],
'ShippingCost': [5]})
#FORM THE LISTS FROM YOUR COLUMNS AND FIRST ROW VALUES
l1 = df.columns.values.tolist()
l2 = df.iloc[0].tolist()
#CREATE A DATA FRAME.
df2 = pd.DataFrame(list(zip(l1, l2)),columns = ['Metrics', 'Values'])
print(df2)

Pandas - DataFrame manipulation

I have a CSV which has data arranged in a different manner.
The data set is given below:
import pandas as pd

data = [[12, 'abc#xyz.com', 'NaN', 'NaN'],
        [12, 'abc#xyz.com', 'NaN', 'NaN'],
        ['NaN', 'NaN', 'x', 'y'],
        ['NaN', 'NaN', 'a', 'b'],
        ['13', 'qwer#123.com', 'NaN', 'NaN'],
        ['NaN', 'NaN', 'x', 'r']]
df = pd.DataFrame(data, columns=['id', 'email', 'notes_key', 'notes_value'])
df
Ideally the third and fourth columns should have the same id as the first column.
The columns notes_key and notes_value represent a key:value pair, i.e. the key is in notes_key and its corresponding value is in notes_value.
I have to manipulate the dataframe so that the output turns out as:
data = [[12, 'abc#xyz.com', 'x', 'y'], [12, 'abc#xyz.com', 'a', 'b']]
df = pd.DataFrame(data, columns=['id', 'email', 'notes_key', 'notes_value'])
I tried dropping the null values.
You can forward fill the missing values in id and email, and then remove the rows that have missing values in both notes_key and notes_value:
import numpy as np

# if necessary, turn the 'NaN' strings into real missing values
df = df.replace('NaN', np.nan)

df[['id','email']] = df[['id','email']].ffill()
df = df.dropna(subset=['notes_key','notes_value'], how='all')
print (df)
id email notes_key notes_value
2 12 abc#xyz.com x y
3 12 abc#xyz.com a b
5 13 qwer#123.com x r
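If you also want the clean 0-based index implied by the desired output in the question, a reset_index afterwards does it. A small follow-up sketch, not part of the original answer:

df = df.reset_index(drop=True)
print(df)
# the index is now 0, 1, 2 instead of 2, 3, 5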

How to make a new dataframe that contains records from a particular column of a dataframe?

new_df (columns district, AD1, AD2, AD3):

district   AD1            AD2        AD3
Alappuzha  Kottayam       Ernakulam  Pattanamtitta
Ernakulam  Kottayam       Alappuzha  Thrissur
Idukki     Pattanamtitta  Kottayam   Alappuzha
Use:
# create a counter column per district
df['g'] = df.groupby('District1').cumcount().add(1)
# filter out the first row per group
df = df[df['g'].gt(1)].copy()
# pivot the remaining rows into AD columns
df = df.pivot(index='District1', columns='g', values='District2').add_prefix('AD')
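Since the original frame is not shown in the question, here is a self-contained sketch of the same cumcount-plus-pivot idea on made-up data; the District1/District2 column names come from the answer above, and the sample pairs are invented purely for illustration:

import pandas as pd

# hypothetical long-format data: one row per (district, paired district) pair
df = pd.DataFrame({
    'District1': ['Alappuzha', 'Alappuzha', 'Alappuzha',
                  'Ernakulam', 'Ernakulam', 'Ernakulam'],
    'District2': ['Kottayam', 'Ernakulam', 'Pattanamtitta',
                  'Kottayam', 'Alappuzha', 'Thrissur'],
})

# number the paired districts within each district: 1, 2, 3, ...
df['g'] = df.groupby('District1').cumcount().add(1)

# spread them into columns AD1, AD2, AD3
wide = df.pivot(index='District1', columns='g', values='District2').add_prefix('AD')
wide.columns.name = None
new_df = wide.reset_index().rename(columns={'District1': 'district'})
print(new_df)
# one row per district with columns: district, AD1, AD2, AD3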

Problem filtering a pyspark dataframe if a value contains ">" or "<"

My data frame has a value column in which some values contain > or <, and I want to filter those rows out.
This is my code:
from pyspark.sql.functions import col

df1 = df.filter((col("value").contains('>') | col("value").contains('<')))
df2 = df.filter(~(col("value").contains('>') | col("value").contains('<')))
print(df.count())
print(df1.count())
print(df2.count())
My result:
3900000
202
3600000
My expectation is:
df.count() = df1.count() + df2.count()
But it is not. What is the problem here?
This is certainly caused by null values in the value column.
df.count() counts all the rows in the DataFrame, nulls included. But when you use contains in a filter, rows where value is null are skipped: the condition evaluates to null rather than true or false, so those rows pass neither the filter nor its negation.
Example:
data = [("value1_>", ), ("value2_>", ), ("value3_<",), ("value4",), (None,)]
df = spark.createDataFrame(data, ['value'])
df1 = df.filter((col("value").contains('>') | col("value").contains('<')))
df2 = df.filter(~(col("value").contains('>') | col("value").contains('<')))
print(df.count())
print(df1.count())
print(df2.count())
#5
#3
#1
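If you want the two subsets to add up to the full count, one option (a sketch, not from the original answer) is to keep the null rows explicitly in the second filter:

from pyspark.sql.functions import col

# keep rows without '>' or '<' and also the rows where value is null,
# so that df1.count() + df2.count() == df.count()
df2 = df.filter(
    ~(col("value").contains('>') | col("value").contains('<'))
    | col("value").isNull()
)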

Split rows with the same ID into different columns in Python

I have a dataframe with repeated values of the same id number, and I want to split the repeated rows into columns.
import pandas as pd

data = [[10450015, 4.4], [16690019, 4.1], [16690019, 4.0], [16510069, 3.7]]
df = pd.DataFrame(data, columns=['id', 'k'])
print(df)
The resulting dataframe would have columns n_k (where n counts the repeated rows of an id). Each repeated id gets an individual column, and when an id has no repeated rows it gets a 0 in the new column.
data_merged = {'id': [10450015, 16690019, 16510069], '1_k': [4.4, 4.1, 3.7], '2_k': [0, 4.0, 0]}
print(data_merged)
Try assigning an occurrence counter per id using DataFrame.assign and groupby.cumcount, then reshape with DataFrame.pivot_table. Finally use a list comprehension to rename the columns:
df_new = (df.assign(col=df.groupby('id').cumcount().add(1))
            .pivot_table(index='id', columns='col', values='k', fill_value=0))
df_new.columns = [f"{x}_k" for x in df_new.columns]
print(df_new)
          1_k  2_k
id                
10450015  4.4  0.0
16510069  3.7  0.0
16690019  4.1  4.0
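If you want id back as a regular column, matching the data_merged dict in the question, a reset_index on the result does it. A small follow-up sketch, not part of the original answer:

df_flat = df_new.reset_index()
print(df_flat.to_dict(orient='list'))
# {'id': [10450015, 16510069, 16690019], '1_k': [4.4, 3.7, 4.1], '2_k': [0.0, 0.0, 4.0]}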
