Calculating averages considering two columns [duplicate] - python-3.x

I have the following pandas dataframe:
| id | LocTime |ZPos | XPos
datetime |
2017-01-02 00:14:39 |20421902611| 12531245409231| 0 | -6
2017-01-02 00:14:40 |30453291020| 28332479673070| 0 | -2
I want to convert the datetime index into a column of the data frame.
I tried df.reset_index(level=['datetime']), but the result does not change.
Any ideas?

You need to assign the output back, or use the inplace=True parameter:
df = df.reset_index()
# or, alternatively:
df.reset_index(inplace=True)
print (df)
datetime id LocalTime ZPosition XPosition
0 2017-01-02 00:14:39 10453190861 1483312478909238 0 -9
1 2017-01-02 00:14:40 10453191020 1483312479673076 0 -8
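As a minimal sketch of why the original attempt appeared to do nothing: reset_index returns a new DataFrame and leaves the original untouched unless the result is assigned back. The small frame below is illustrative only and is not the asker's data.

```python
import pandas as pd

# Illustrative frame with a named datetime index (values are made up)
df = pd.DataFrame(
    {"id": [1, 2]},
    index=pd.to_datetime(["2017-01-02 00:14:39", "2017-01-02 00:14:40"]),
)
df.index.name = "datetime"

# The return value is discarded here, so df itself is unchanged
df.reset_index()
columns_before = list(df.columns)

# Assigning the result back turns the index into a regular column
df = df.reset_index()
columns_after = list(df.columns)

print(columns_before)
print(columns_after)
```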

Related

How to solve the ValueError: Unstacked DataFrame is too big, causing int32 overflow in python?

I have a dataframe in dynamic format for each ID
df:
ID |Start Date|End date |claim_no|claim_type|Admission_date|Discharge_date|Claim_amt|Approved_amt
10 |01-Apr-20 |31-Mar-21| 1123 |CSHLESS | 23-Aug-2020 | 25-Aug-2020 | 25406 | 19351
10 |01-Apr-20 |31-Mar-21| 1212 |POSTHOSP | 30-Aug-2020 | 01-Sep-2020 | 4209 | 3964
10 |01-Apr-20 |31-Mar-21| 1680 |CSHLESS | 18-Mar-2021 | 23-Mar-2021 | 18002 | 0
11 |12-Dec-20 |11-Dec-21| 1503 |CSHLESS | 12-Jan-2021 | 15-Jan-2021 | 76137 | 50286
11 |12-Dec-20 |11-Dec-21| 1505 |CSHLESS | 05-Jan-2021 | 07-Jan-2021 | 30000 | 0
Based on the ID column, I am trying to convert all the dynamic variables into a static format so that I have a single row for each ID.
Columns such as ID, Start Date and End date are static, and the rest of the columns are dynamic for each ID.
In order to achieve the output below:
ID |Start Date|End date |claim_no_1|claim_type_1|Admission_date_1|Discharge_date_1|Claim_amt_1|Approved_amt_1|claim_no_2|claim_type_2|Admission_date_2|Discharge_date_2|Claim_amt_2|Approved_amt_2|claim_no_3|claim_type_3|Admission_date_3|Discharge_date_3|Claim_amt_3|Approved_amt_3
10 |01-Apr-20 |31-Mar-21| 1123 |CSHLESS | 23-Aug-2020 | 25-Aug-2020 | 25406 | 19351 | 1212 |POSTHOSP | 30-Aug-2020 | 01-Sep-2020 | 4209 | 3964 | 1680 |CSHLESS | 18-Mar-2021 | 23-Mar-2021 | 18002 | 0
I am using the code below:
# Index columns
idx = ['ID', 'Start Date', 'End date']
# Sequential counter to identify unique rows per index columns
cols = df.groupby(idx).cumcount() + 1
# Reshape using stack and unstack
df_out = df.set_index([*idx, cols]).stack().unstack([-2, -1])
# Flatten the multiindex columns
df_out.columns = df_out.columns.map('{0[1]}_{0[0]}'.format)
But it throws a ValueError: Unstacked DataFrame is too big, causing int32 overflow
Try this:
# Index columns (very similar to your code)
idx = ['ID', 'Start Date', 'End date']
# Sequential counter to identify unique rows per index columns
df['nrow'] = df.groupby(idx)['claim_no'].transform('rank')
df['nrow'] = df['nrow'].astype(int).astype(str)
Use melt and pivot instead of stack and unstack. These functions give you better control over the columns:
df1 = pd.melt(df, id_vars=['nrow', *idx],
              value_vars=['claim_no', 'claim_type', 'Admission_date',
                          'Discharge_date', 'Claim_amt', 'Approved_amt'],
              value_name='var')
df2 = df1.pivot(index=[*idx],
                columns=['variable', 'nrow'], values='var')
df2.columns = ['_'.join(col).rstrip('_') for col in df2.columns.values]
print(df2)
claim_no_1 claim_no_2 claim_no_3 claim_type_1 claim_type_2 claim_type_3 Admission_date_1 Admission_date_2 Admission_date_3 Discharge_date_1 Discharge_date_2 Discharge_date_3 Claim_amt_1 Claim_amt_2 Claim_amt_3 Approved_amt_1 Approved_amt_2 Approved_amt_3
ID Start Date End date
10 01-Apr-20 31-Mar-21 1123 1212 1680 CSHLESS POSTHOSP CSHLESS 23-Aug-2020 30-Aug-2020 18-Mar-2021 25-Aug-2020 01-Sep-2020 23-Mar-2021 25406 4209 18002 19351 3964 0
11 12-Dec-20 11-Dec-21 1503 1505 NaN CSHLESS CSHLESS NaN 12-Jan-2021 05-Jan-2021 NaN 15-Jan-2021 07-Jan-2021 NaN 76137 30000 NaN 50286 0 NaN
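A compact variant of the same reshape, as a sketch only. It numbers the claims with cumcount instead of rank, because rank averages tied values (two equal claim_no values in one ID would both get 1.5, both become 1 after astype(int), and pivot would then fail on duplicate entries). The frame below reuses a subset of the question's columns. Whether this avoids the int32 overflow on the full data is not something the sketch demonstrates.

```python
import pandas as pd

# Subset of the question's data: two of the dynamic columns only
df = pd.DataFrame({
    "ID": [10, 10, 10, 11, 11],
    "Start Date": ["01-Apr-20"] * 3 + ["12-Dec-20"] * 2,
    "End date": ["31-Mar-21"] * 3 + ["11-Dec-21"] * 2,
    "claim_no": [1123, 1212, 1680, 1503, 1505],
    "Claim_amt": [25406, 4209, 18002, 76137, 30000],
})

idx = ["ID", "Start Date", "End date"]

# Position of each claim within its ID (1, 2, 3, ...), unaffected by ties
df["nrow"] = (df.groupby(idx).cumcount() + 1).astype(str)

# Wide format: one row per ID, one column per (variable, position) pair
wide = df.pivot(index=idx, columns="nrow", values=["claim_no", "Claim_amt"])

# Flatten the two-level column index into names such as claim_no_1
wide.columns = [f"{name}_{n}" for name, n in wide.columns]
wide = wide.reset_index()

print(wide)
```

IDs with fewer claims than the maximum get NaN in the unused columns, as in the output above.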

Filter DataFrame to delete duplicate values in pyspark

I have the following dataframe
date | value | ID
--------------------------------------
2021-12-06 15:00:00 25 1
2021-12-06 15:15:00 35 1
2021-11-30 00:00:00 20 2
2021-11-25 00:00:00 10 2
I want to join this DF with another one like this:
idUser | Name | Gender
-------------------
1 John M
2 Anne F
My expected output is:
ID | Name | Gender | Value
---------------------------
1 John M 35
2 Anne F 20
What I need is to get only the most recent value from the first dataframe and join only this value with my second dataframe. However, my Spark script is joining both values:
My code:
df = df1.select(
col("date"),
col("value"),
col("ID"),
).orderBy(
col("ID").asc(),
col("date").desc(),
).groupBy(
col("ID"), col("date").cast(StringType()).substr(0,10).alias("date")
).agg (
max(col("value")).alias("value")
)
final_df = df2.join(
df,
(col("idUser") == col("ID")),
how="left"
)
When I perform this join (the column formatting is omitted in this post), I get the following output:
ID | Name | Gender | Value
---------------------------
1 John M 35
2 Anne F 20
2 Anne F 10
I use substr to remove the hours and minutes so that I filter only by date. But when the same ID appears on different days, my output df has both values instead of only the most recent one. How can I fix this?
Note: I'm using only pyspark functions to do this (I do not want to use spark.sql(...)).
You can use a window with the row_number function in PySpark:
from pyspark.sql.window import Window
from pyspark.sql.functions import col, row_number
# Number the rows within each ID, newest date first
windowSpec = Window.partitionBy("ID").orderBy(col("date").desc())
df1_latest_val = df1.withColumn("row_number", row_number().over(windowSpec)).filter(
    col("row_number") == 1
)
The output of df1_latest_val will look something like this:
date | value | ID | row_number |
-----------------------------------------------------
2021-12-06 15:15:00 35 1 1
2021-11-30 00:00:00 20 2 1
Now you have a dataframe with only the latest value per ID, which you can join directly with the other table.

How to create this year_month sales and previous year_month sales in two different columns?

I need to create two columns from transaction-level data: one for this year-month's sales and one for the previous year-month's sales.
Data format:-
Date | bill amount
2019-07-22 | 500
2019-07-25 | 200
2020-11-15 | 100
2020-11-06 | 900
2020-12-09 | 50
2020-12-21 | 600
Required format:-
Year_month |This month Sales | Prev month sales
2019_07 | 700 | -
2020_11 | 1000 | -
2020_12 | 650 | 1000
The relatively tricky bit is to figure out what the previous month is. We do it by finding the beginning of the month for each date and then rolling back by one month. Note that this also handles the January -> December of the previous year case.
We start by importing some useful modules and creating a sample dataframe:
import pandas as pd
from io import StringIO
from datetime import datetime
from dateutil.relativedelta import relativedelta
data = StringIO(
"""
date|amount
2019-07-22|500
2019-07-25|200
2020-11-15|100
2020-11-06|900
2020-12-09|50
2020-12-21|600
""")
df = pd.read_csv(data,sep='|')
df['date'] = pd.to_datetime(df['date'])
df
we get
date amount
0 2019-07-22 500
1 2019-07-25 200
2 2020-11-15 100
3 2020-11-06 900
4 2020-12-09 50
5 2020-12-21 600
Then we figure out the month start and the previous month start using datetime utilities
df['month_start'] = df['date'].apply(lambda d:datetime(year = d.year, month = d.month, day = 1))
df['prev_month_start'] = df['month_start'].apply(lambda d:d+relativedelta(months = -1))
Then we summarize monthly sales using groupby on month start
ms_df = df.drop(columns = 'date').groupby('month_start').agg({'prev_month_start':'first','amount':'sum'}).reset_index()
ms_df
so we get
month_start prev_month_start amount
0 2019-07-01 2019-06-01 700
1 2020-11-01 2020-10-01 1000
2 2020-12-01 2020-11-01 650
Then we join (merge) ms_df on itself by mapping 'prev_month_start' to 'month_start'
ms_df2 = ms_df.merge(ms_df, left_on='prev_month_start', right_on='month_start', how = 'left', suffixes = ('','_prev'))
We are more or less there but now make it pretty by getting rid of superfluous columns, adding labels, etc
ms_df2['label'] = ms_df2['month_start'].dt.strftime('%Y_%m')
ms_df2 = ms_df2.drop(columns = ['month_start','prev_month_start','month_start_prev','prev_month_start_prev'])
columns = ['label','amount','amount_prev']
ms_df2 = ms_df2[columns]
and we get
| | label | amount | amount_prev |
|---:|--------:|---------:|--------------:|
| 0 | 2019_07 | 700 | nan |
| 1 | 2020_11 | 1000 | nan |
| 2 | 2020_12 | 650 | 1000 |
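The same previous-month lookup can also be sketched with monthly periods instead of explicit month-start dates. This is an alternative to the merge above, not the answer's own method. It relies on the fact that subtracting 1 from a monthly PeriodIndex moves each entry back one month, including across a year boundary.

```python
import pandas as pd

df = pd.DataFrame({
    "date": pd.to_datetime([
        "2019-07-22", "2019-07-25", "2020-11-15",
        "2020-11-06", "2020-12-09", "2020-12-21",
    ]),
    "amount": [500, 200, 100, 900, 50, 600],
})

# Total sales per calendar month, indexed by monthly Period
monthly = df.groupby(df["date"].dt.to_period("M"))["amount"].sum()

# Look up each month's predecessor; months with no sales come back as NaN
prev = monthly.reindex(monthly.index - 1)

result = pd.DataFrame({
    "label": monthly.index.strftime("%Y_%m"),
    "amount": monthly.to_numpy(),
    "amount_prev": prev.to_numpy(),
})

print(result)
```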
Using @piterbarg's data, we can use resample, combined with shift and concat, to get your desired data:
import pandas as pd
from io import StringIO
data = StringIO(
"""
date|amount
2019-07-22|500
2019-07-25|200
2020-11-15|100
2020-11-06|900
2020-12-09|50
2020-12-21|600
"""
)
df = pd.read_csv(data, sep="|", parse_dates=["date"])
df
date amount
0 2019-07-22 500
1 2019-07-25 200
2 2020-11-15 100
3 2020-11-06 900
4 2020-12-09 50
5 2020-12-21 600
Get the sum for current sales:
data = df.resample(on="date", rule="1M").amount.sum().rename("This_month")
data
date
2019-07-31 700
2019-08-31 0
2019-09-30 0
2019-10-31 0
2019-11-30 0
2019-12-31 0
2020-01-31 0
2020-02-29 0
2020-03-31 0
2020-04-30 0
2020-05-31 0
2020-06-30 0
2020-07-31 0
2020-08-31 0
2020-09-30 0
2020-10-31 0
2020-11-30 1000
2020-12-31 650
Freq: M, Name: This_month, dtype: int64
Now, we can shift the month to get values for previous month, and drop rows that have 0 as total sales to get your final output:
(pd.concat([data, data.shift().rename("previous_month")], axis=1)
.query("This_month!=0")
.fillna(0))
This_month previous_month
date
2019-07-31 700 0.0
2020-11-30 1000 0.0
2020-12-31 650 1000.0

Converting timedeltas to integers for consecutive time points in pandas

Suppose I have the dataframe
import pandas as pd
df = pd.DataFrame({"Time": ['2010-01-01', '2010-01-02', '2010-01-03', '2010-01-04']})
print(df)
Time
0 2010-01-01
1 2010-01-02
2 2010-01-03
3 2010-01-04
If I want to calculate the time from the lowest time point for each time in the dataframe, I can use the apply function like
df['Time'] = pd.to_datetime(df['Time'])
df.sort_values('Time', inplace = True)
df['Time'] = df['Time'].apply(lambda x: (x - df['Time'].iloc[0]).days)
print(df)
Time
0 0
1 1
2 2
3 3
Is there a function in Pandas that does this already?
I recommend not using apply. Subtract the first value from the whole column and take the days with the .dt accessor:
(df.Time-df.Time.iloc[0]).dt.days
0 0
1 1
2 2
3 3
Name: Time, dtype: int64
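The iloc[0] version above gives the offset from the lowest time point only if the column is already sorted. A small sketch of a variant that subtracts the minimum instead, so row order does not matter. The unsorted sample values are chosen here for illustration.

```python
import pandas as pd

# Same dates as in the question, deliberately out of order
df = pd.DataFrame({"Time": ["2010-01-03", "2010-01-01", "2010-01-04", "2010-01-02"]})
df["Time"] = pd.to_datetime(df["Time"])

# Vectorised subtraction yields timedeltas; .dt.days converts them to integers
days = (df["Time"] - df["Time"].min()).dt.days

print(days.tolist())
```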

Aggregating past and current values(monthly data) of Target column using pandas

I have a dataframe like the one below in pandas:
EMP_ID| Date| Target_GWP
1 | Jan-2017| 100
2 | Jan-2017| 300
1 | Feb-2017| 500
2 | Feb-2017| 200
and I need my output to be in the form below.
EMP_ID| Date| Target_GWP | past_Target_GWP
1 | Feb-2017| 600 |100
2 | Feb-2017| 500 |300
Basically, I have monthly data coming in from Excel. I want to aggregate Target_GWP for each EMP_ID against the latest (current) month, and create a backup column in the pandas dataframe for the past month's Target_GWP. So how do I back up the past month's Target_GWP and add it to the current month's Target_GWP?
Any leads on this would be appreciated.
Use:
#convert to datetime
df['Date'] = pd.to_datetime(df['Date'])
#sorting and get last 2 rows
df = df.sort_values(['EMP_ID','Date']).groupby('EMP_ID').tail(2)
#aggregation
df = df.groupby('EMP_ID', as_index=False).agg({'Date':'last', 'Target_GWP':['sum','first']})
df.columns = ['EMP_ID','Date','Target_GWP','past_Target_GWP']
print (df)
EMP_ID Date Target_GWP past_Target_GWP
0 1 2017-02-01 600 100
1 2 2017-02-01 500 300
Or, if you need the latest value of Target_GWP instead of the sum, use last:
df = df.groupby('EMP_ID', as_index=False).agg({'Date':'last', 'Target_GWP':['last','first']})
df.columns = ['EMP_ID','Date','Target_GWP','past_Target_GWP']
print (df)
EMP_ID Date Target_GWP past_Target_GWP
0 1 2017-02-01 500 100
1 2 2017-02-01 200 300
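The sum variant can also be sketched with named aggregation, which sets the output column names directly and avoids renaming the columns afterwards. The explicit format="%b-%Y" is an assumption that every date looks like Jan-2017.

```python
import pandas as pd

df = pd.DataFrame({
    "EMP_ID": [1, 2, 1, 2],
    "Date": ["Jan-2017", "Jan-2017", "Feb-2017", "Feb-2017"],
    "Target_GWP": [100, 300, 500, 200],
})

# Assumes every date string follows the abbreviated-month-and-year pattern
df["Date"] = pd.to_datetime(df["Date"], format="%b-%Y")

out = (
    df.sort_values(["EMP_ID", "Date"])
      .groupby("EMP_ID")
      .tail(2)                                       # keep the last two months per employee
      .groupby("EMP_ID", as_index=False)
      .agg(
          Date=("Date", "last"),                     # current month
          Target_GWP=("Target_GWP", "sum"),          # past + current
          past_Target_GWP=("Target_GWP", "first"),   # past month only
      )
)

print(out)
```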
