I work with customers' consumption data, and sometimes a customer has no consumption recorded for a month or more. The first consumption reported after such a gap needs to be broken down across those months.
Example:
df = pd.DataFrame({'customerId':[1,1,1,1,1,1,1,2,2,2,2,2,2,2],
'month':['2021-10-01','2021-11-01','2021-12-01','2022-01-01','2022-02-01','2022-03-01','2022-04-01','2021-10-01','2021-11-01','2021-12-01','2022-01-01','2022-02-01','2022-03-01','2022-04-01'],
'consumption':[100,130,0,0,400,140,105,500,0,0,0,0,0,3300]})
bfill() just repeats the same value, not the mean (value / (count of nulls + 1)).
Desired result:
'c':[100,130,133,133,133,140,105,500,550,550,550,550,550,550]
You can try something like this:
df = pd.DataFrame({'customerId':[1,1,1,1,1,1,1,2,2,2,2,2,2,2],
'month':['2021-10-01','2021-11-01','2021-12-01','2022-01-01','2022-02-01','2022-03-01','2022-04-01','2021-10-01','2021-11-01','2021-12-01','2022-01-01','2022-02-01','2022-03-01','2022-04-01'],
'consumption':[100,130,0,0,400,140,105,500,0,0,0,0,0,3300]})
df['grp'] = df['consumption'].ne(0)[::-1].cumsum()
df['c'] = df.groupby(['customerId', 'grp'])['consumption'].transform('mean')
df
Output:
customerId month consumption grp c
0 1 2021-10-01 100 7 100.000000
1 1 2021-11-01 130 6 130.000000
2 1 2021-12-01 0 5 133.333333
3 1 2022-01-01 0 5 133.333333
4 1 2022-02-01 400 5 133.333333
5 1 2022-03-01 140 4 140.000000
6 1 2022-04-01 105 3 105.000000
7 2 2021-10-01 500 2 500.000000
8 2 2021-11-01 0 1 550.000000
9 2 2021-12-01 0 1 550.000000
10 2 2022-01-01 0 1 550.000000
11 2 2022-02-01 0 1 550.000000
12 2 2022-03-01 0 1 550.000000
13 2 2022-04-01 3300 1 550.000000
Details:
Create a group key by checking for non-zero values, then take the cumulative sum in reverse order
so that each run of zeroes is grouped with the next non-zero value.
Group by customerId and that key, and transform with 'mean' to spread the non-zero
value evenly across the zeroes.
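For illustration, here is the reversed cumulative sum on a tiny Series (a minimal sketch, not part of the answer above):
import pandas as pd

s = pd.Series([100, 0, 0, 400])
grp = s.ne(0)[::-1].cumsum()                       # cumsum computed back to front
print(grp.tolist())                                # [2, 1, 1, 1] -> the zeroes share a group with 400
print(s.groupby(grp).transform('mean').tolist())   # [100.0, 133.33..., 133.33..., 133.33...]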
I am a python beginner.
I have the following pandas DataFrame, with only two columns: "Time" and "Input".
I want to loop over the "Input" column with a window size w = 3 (three consecutive values). For every window, if all the items within it are 1s, the first item should stay 1 and the remaining values should be changed to 0.
index Time Input
0 11 0
1 22 0
2 33 0
3 44 1
4 55 1
5 66 1
6 77 0
7 88 0
8 99 0
9 1010 0
10 1111 1
11 1212 1
12 1313 1
13 1414 0
14 1515 0
My intended output is as follows
index Time Input What_I_got What_I_Want
0 11 0 0 0
1 22 0 0 0
2 33 0 0 0
3 44 1 1 1
4 55 1 1 0
5 66 1 1 0
6 77 1 1 1
7 88 1 0 0
8 99 1 0 0
9 1010 0 0 0
10 1111 1 1 1
11 1212 1 0 0
12 1313 1 0 0
13 1414 0 0 0
14 1515 0 0 0
What should I do to get the desired output? Am I missing something in my code?
import pandas as pd
import re
# join the Input column into one string of 0s and 1s, replace every
# (non-overlapping) '111' with '100', then split back into a Series
pd.Series(list(re.sub('111', '100', ''.join(df.Input.astype(str))))).astype(int)
Out[23]:
0 0
1 0
2 0
3 1
4 0
5 0
6 1
7 0
8 0
9 0
10 1
11 0
12 0
13 0
14 0
dtype: int32
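To store the result back in the DataFrame, you could assign it to a new column (a small follow-up sketch; the column name What_I_Want is taken from the question):
df['What_I_Want'] = pd.Series(
    list(re.sub('111', '100', ''.join(df.Input.astype(str)))),
    index=df.index,
).astype(int)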
I have a time series dataset of products, given below:
date product price amount
11/17/2019 A 10 20
11/19/2019 A 15 20
11/24/2019 A 20 30
12/01/2019 C 40 50
12/05/2019 C 45 35
This data has missing days ("MM/dd/YYYY") between the start and end date for each product. I am trying to fill in the missing dates with zero rows, converting the previous table into the table given below:
date product price amount
11/17/2019 A 10 20
11/18/2019 A 0 0
11/19/2019 A 15 20
11/20/2019 A 0 0
11/21/2019 A 0 0
11/22/2019 A 0 0
11/23/2019 A 0 0
11/24/2019 A 20 30
12/01/2019 C 40 50
12/02/2019 C 0 0
12/03/2019 C 0 0
12/04/2019 C 0 0
12/05/2019 C 45 35
To get this conversion, I used the code:
import pandas as pd
import numpy as np
data=pd.read_csv("test.txt", sep="\t", parse_dates=['date'])
data=data.set_index(["date", "product"])
start=data.first_valid_index()[0]
end=data.last_valid_index()[0]
df=data.set_index("date").reindex(pd.date_range(start,end, freq="1D"), fill_values=0)
However, the code gives an error. Is there any way to do this conversion efficiently?
If you need to add 0 for the missing datetimes for each product separately, use a custom function in GroupBy.apply with DataFrame.reindex over the range from the minimal to the maximal datetime:
df = pd.read_csv("test.txt", sep="\t", parse_dates=['date'])
f = lambda x: x.reindex(pd.date_range(x.index.min(),
x.index.max(), name='date'), fill_value=0)
df = (df.set_index('date')
.groupby('product')
.apply(f)
.drop('product', axis=1)
.reset_index())
print (df)
product date price amount
0 A 2019-11-17 10 20
1 A 2019-11-18 0 0
2 A 2019-11-19 15 20
3 A 2019-11-20 0 0
4 A 2019-11-21 0 0
5 A 2019-11-22 0 0
6 A 2019-11-23 0 0
7 A 2019-11-24 20 30
8 C 2019-12-01 40 50
9 C 2019-12-02 0 0
10 C 2019-12-03 0 0
11 C 2019-12-04 0 0
12 C 2019-12-05 45 35
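If you want the same layout as in the question (date first, formatted as MM/dd/YYYY), a small follow-up like this should work (column names assumed from the question):
df = df[['date', 'product', 'price', 'amount']]
df['date'] = df['date'].dt.strftime('%m/%d/%Y')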
One option is to use the complete function from pyjanitor to expose the missing rows per group:
#pip install git+https://github.com/pyjanitor-devs/pyjanitor.git
import pandas as pd
import janitor
# build the dates to be applied per group
dates = dict(date = lambda df: pd.date_range(df.min(), df.max(), freq='1D'))
df.complete(dates, by='product', sort = True).fillna(0, downcast='infer')
date product price amount
0 2019-11-17 00:00:00 A 10 20
1 2019-11-18 00:00:00 A 0 0
2 2019-11-19 00:00:00 A 15 20
3 2019-11-20 00:00:00 A 0 0
4 2019-11-21 00:00:00 A 0 0
5 2019-11-22 00:00:00 A 0 0
6 2019-11-23 00:00:00 A 0 0
7 2019-11-24 00:00:00 A 20 30
8 2019-12-01 00:00:00 C 40 50
9 2019-12-02 00:00:00 C 0 0
10 2019-12-03 00:00:00 C 0 0
11 2019-12-04 00:00:00 C 0 0
12 2019-12-05 00:00:00 C 45 35
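This assumes df was already loaded with the date column parsed as datetimes, for example with the same read as in the previous answer:
df = pd.read_csv("test.txt", sep="\t", parse_dates=['date'])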
There's an easier method for this case:
from datetime import timedelta

#create the full date range, then build a DataFrame from it
#if needed, you can expand the range a bit using timedelta()
alldates = pd.DataFrame(pd.date_range(data.index.min()-timedelta(1), data.index.max()+timedelta(4), freq="1D", name="newdate"))
#make 'newdate' the index, and you no longer need it as a column
alldates.index = alldates.newdate
alldates.drop(columns="newdate", inplace=True)
#now, join the tables; missing dates in the original table will be filled with NaN
data = alldates.join(data)
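To match the zero-filled output in the question, the NaNs introduced by the join can then be replaced (a small sketch; column names assumed from the question):
data[['price', 'amount']] = data[['price', 'amount']].fillna(0)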
I have a df as shown below
df:
ID Limit N_30 N_31_90 N_91_180 N_180_365
1 500 60 15 30 1
2 300 0 15 5 10
3 800 0 0 10 6
4 100 0 0 0 370
5 600 0 6 5 10
6 800 0 0 15 6
7 500 10 10 30 9
8 200 0 0 0 0
About the data
ID - customer ID
Limit - Limit
N_30 - Number of transactions in the last 30 days.
N_31_90 - Number of transactions in the last 31 to 90 days.
N_91_180 - Number of transactions in the last 91 to 180 days.
N_180_365 - Number of transactions in the last 181 to 365 days.
From the above df I would like to extract a column called Recency.
Explanation:
if df['N_30'] != 0, then Recency = (30/df['N_30'])
elif df['N_31_90'] != 0 then Recency = 30 + (60/df['N_31_90'])
elif df['N_91_180'] != 0 then Recency = 90 + (90/df['N_91_180'])
elif df['N_180_365'] != 0 then Recency = 180 + (185/df['N_180_365'])
else Recency = 730
Expected output:
ID Limit N_30 N_31_90 N_91_180 N_180_365 Recency
1 500 60 15 30 1 (30/60) = 0.5
2 300 0 15 5 10 30+(60/15) = 34
3 800 0 0 10 6 90+(90/10) = 99
4 100 0 0 0 370 180+(185/370) = 180.5
5 600 0 6 5 10 30+(60/6) = 40
6 800 0 0 15 6 90+(90/15) = 96
7 500 10 10 30 9 30/10 = 3
8 200 0 0 0 0 730
IIUC, using boolean masking with bfill:
pd.set_option("use_inf_as_na", True)    # treat the inf from division by zero as NaN
df2 = df.filter(like="N_")              # just the N_* columns
# add 30/60/90/180 for each zero bucket, then add the first non-null ratio (left to right)
df["Recency"] = (df2.eq(0) * [30, 60, 90, 180]).sum(1) + ([30, 60, 90, 185] / df2).bfill(1).iloc[:, 0]
print(df)
Output:
ID Limit N_30 N_31_90 N_91_180 N_180_365 Recency
0 1 500 60 15 30 1 0.5
1 2 300 0 15 5 10 34.0
2 3 800 0 0 10 6 99.0
3 4 100 0 0 0 370 180.5
4 5 600 0 6 5 10 40.0
5 6 800 0 0 15 6 96.0
6 7 500 10 10 30 9 3.0
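Note that the all-zero row (ID 8) comes out as NaN with this expression, since every ratio is null; to get the 730 fallback from the question you could fill it afterwards (a small addition, not shown in the output above):
df['Recency'] = df['Recency'].fillna(730)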
A flag column in a pandas DataFrame is populated with 1s and 0s.
The problem is to identify runs of continuous 1s.
Let t be the threshold number of days.
There are two types of transformations required:
i) If there are more than t 1s together, turn the (t+1)th 1 onwards to 0.
ii) If there are more than t 1s together, turn all of the 1s to 0.
My approach is to create 2 columns called result1 and result2, and filter using these columns:
Please see image here
I have not been able to come up with anything yet, so I am not posting any code.
A nudge or hint in the right direction would be appreciated.
Use:
import numpy as np

#mask of rows equal to 0
m = df['Value'].eq(0)
#get the cumulative sum of the mask and keep only the rows equal to 1
g = m.cumsum()[~m]
#set by condition - 0, or a running counter per group
df['Result1'] = np.where(m, 0, df.groupby(g).cumcount().add(1))
#get the maximum per group with transform for a new Series
df['Result2'] = np.where(m, 0, df.groupby(g)['Result1'].transform('max')).astype(int)
print (df)
Value Result1 Result2
0 1 1 1
1 0 0 0
2 0 0 0
3 1 1 2
4 1 2 2
5 0 0 0
6 1 1 4
7 1 2 4
8 1 3 4
9 1 4 4
10 0 0 0
11 0 0 0
12 1 1 1
13 0 0 0
14 1 1 1
15 0 0 0
16 0 0 0
17 1 1 6
18 1 2 6
19 1 3 6
20 1 4 6
21 1 5 6
22 1 6 6
23 0 0 0
24 1 1 1
25 0 0 0
26 0 0 0
27 1 1 1
28 0 0 0
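With these helper columns, the two transformations from the question can be expressed as simple conditions (a sketch, assuming a threshold t and the Value column above):
t = 3
# i) keep at most t consecutive 1s: zero out the (t+1)th 1 onwards in each run
df['Flag_i'] = np.where(df['Result1'] > t, 0, df['Value'])
# ii) if a run is longer than t, zero out every 1 in that run
df['Flag_ii'] = np.where(df['Result2'] > t, 0, df['Value'])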