How to recalculate DataFrame column values based on a condition dict (pandas, Python)

Let's say I have the following DataFrame:
   A      B
0  aa    4.32
1  aa    7.00
2  bb    8.00
3  dd   74.00
4  cc   30.00
5  bb    2.00
And let's say I have the following dict, whose keys determine the condition for column A and whose values determine the multiplier for column B:
dict1 = {'aa': -1, 'bb': 2}
All I want is to multiply the values in column B by the values from dict1 wherever the value in column A equals a dict1 key.
So the output should be:
   A      B
0  aa   -4.32
1  aa   -7.00
2  bb   16.00
3  dd   74.00
4  cc   30.00
5  bb    4.00
Thanks

Use pd.Series.map: values of A with no key in dict1 map to NaN, and fillna(1) turns those into a neutral multiplier of 1.
print(df["A"].map(dict1).fillna(1) * df["B"])
0 -4.32
1 -7.00
2 16.00
3 74.00
4 30.00
5 4.00
dtype: float64
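If you want the result written back into the DataFrame rather than printed, a minimal sketch (same df and dict1 as above):
# rows whose A value has no key in dict1 keep their original B (multiplier 1)
df["B"] = df["B"] * df["A"].map(dict1).fillna(1)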

Related

Fill 0s with mean values of previous and next in a Pandas DataFrame

Fill the 0s with the following conditions:
If there is one 0, do a simple average between the data before and after the 0. In this scenario the 0 is replaced by the mean of A and B.
If there are two consecutive 0s, fill the first with the data from the previous period and average the data before and after the second: the first 0 is replaced by A and the second by the mean of A and B.
If there are three consecutive 0s, replace the first and second 0 with A and the third by the mean of A and B.
Ticker is an identifier and is common for every block (it can be ignored). The entire table is 1000 rows long, and in no case do consecutive 0s exceed 3. I am unable to manage scenarios 2 and 3.
ID  asset
AA  34861000
AA  1607498
AA  0
AA  3530000000
AA  3333000000
AA  3179000000
AA  4053000000
AA  4520000000
AB  15250209
AB  0
AB  14691049
AB  0
AB  5044421
CC  5609212
CC  0
CC  0
CC  3673639
CC  132484747
CC  0
CC  0
CC  0
CC  141652646
You can use interpolate per group, on the reversed Series, with a limit of 1: reversing makes the limit apply to the last NaN of each run (which gets the interpolated value), and bfill then fills the remaining NaNs with the value that precedes the run:
df['asset'] = (df
    .assign(asset=df['asset'].replace(0, float('nan')))
    .groupby('ID')['asset']
    .transform(lambda s: s[::-1].interpolate(limit=1).bfill())
)
output:
ID asset
0 AA 3.486100e+07
1 AA 1.607498e+06
2 AA 1.765804e+09
3 AA 3.530000e+09
4 AA 3.333000e+09
5 AA 3.179000e+09
6 AA 4.053000e+09
7 AA 4.520000e+09
8 AB 1.525021e+07
9 AB 1.497063e+07
10 AB 1.469105e+07
11 AB 9.867735e+06
12 AB 5.044421e+06
13 CC 5.609212e+06
14 CC 5.609212e+06
15 CC 4.318830e+06
16 CC 3.673639e+06
17 CC 1.324847e+08 # X
18 CC 1.324847e+08 # filled X
19 CC 1.324847e+08 # filled X
20 CC 1.393607e+08 # interpolated between X and Y
21 CC 1.416526e+08 # Y
OK, compiling the answer here with help from @jezrael and @mozway (mask and m are boolean masks defined in the comment thread and not reproduced here):
df['asset'] = df['asset'].replace(0, float('nan'))
df.loc[mask, 'asset'] = df.loc[mask | ~m, 'asset'].groupby(df['ID']).transform(lambda x: x.interpolate())
df = df.ffill()
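For reference, here is a self-contained sketch that implements the three rules literally (fill_zeros is a hypothetical helper name; it assumes a run of zeros never starts a group). Note it gives the exact (A+B)/2 the question asks for, which differs slightly from the interpolated value in the answer above:
import pandas as pd

def fill_zeros(s):
    # within one ID group: every zero in a run except the last gets the
    # previous real value (A); the last zero of a run gets the mean of A
    # and the next real value (B)
    s = s.replace(0, float('nan'))
    prev = s.ffill()                          # A for every position in a gap
    nxt = s.bfill()                           # B for every position in a gap
    last = s.isna() & s.shift(-1).notna()     # last NaN of each run
    out = s.copy()
    out[s.isna()] = prev[s.isna()]            # earlier zeros -> A
    out[last] = (prev[last] + nxt[last]) / 2  # final zero -> (A+B)/2
    return out

df['asset'] = df.groupby('ID')['asset'].transform(fill_zeros)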

Select two or more consecutive rows based on a criteria using python

I have a data set like this:
user time city cookie index
A 2019-01-01 11.00 NYC 123456 1
A 2019-01-01 11.12 CA 234567 2
A 2019-01-01 11.18 TX 234567 3
B 2019-01-02 12.19 WA 456789 4
B 2019-01-02 12.21 FL 456789 5
B 2019-01-02 12.31 VT 987654 6
B 2019-01-02 12.50 DC 157890 7
A 2019-01-03 09:12 CA 123456 8
A 2019-01-03 09:27 NYC 345678 9
A 2019-01-03 09:34 TX 123456 10
A 2019-01-04 09:40 CA 234567 11
In this data set I want to compare and select two or more consecutive rows which fit the following criteria:
The user should be the same
The time difference should be less than 15 minutes
The cookie should be different
So if I apply the filter I should get the following data:
user time city cookie index
A 2019-01-01 11.00 NYC 123456 1
A 2019-01-01 11.12 CA 234567 2
B 2019-01-02 12.21 FL 456789 5
B 2019-01-02 12.31 VT 987654 6
A 2019-01-03 09:12 CA 123456 8
A 2019-01-03 09:27 NYC 345678 9
A 2019-01-03 09:34 TX 123456 10
So, in the above, comparing the first two rows (index 1 and 2) satisfies all the conditions. The next pair (index 2 and 3) has the same cookie, index 3 and 4 have different users, 5 and 6 are selected and displayed, and 6 and 7 have a time difference of more than 15 minutes. 8, 9 and 10 fit the criteria, but 11 doesn't, as its date is 24 hours apart.
How can I solve this using a pandas DataFrame? All help is appreciated.
What I have tried:
I tried creating flags using shift():
cookiediff = pd.DataFrame(df.Cookie == df.Cookie.shift())
cookiediff.columns = ['Cookiediffs']
timediff = pd.DataFrame(pd.to_datetime(df.time) - pd.to_datetime(df.time.shift()))
timediff.columns = ['timediff']
mask = df.user != df.user.shift(1)
timediff.timediff[mask] = np.nan
cookiediff['Cookiediffs'][mask] = np.nan
This will do the trick:
import numpy as np
import pandas as pd

# the time column mixes ':' and '.' as delimiters - normalize it per your sample data
df["time"] = df["time"].str.replace(":", ".", regex=False)
df["time"] = pd.to_datetime(df["time"], format="%Y-%m-%d %H.%M")

# minutes between each row and its previous/next neighbour; abs() keeps the
# row-vs-next difference from being a trivially negative pass
minutes_prev = df["time"].sub(df["time"].shift()).abs().dt.total_seconds().div(60)
minutes_next = df["time"].sub(df["time"].shift(-1)).abs().dt.total_seconds().div(60)

# le(15) because the expected output keeps the exact 15-minute gap (09:12 -> 09:27)
cond_ = np.logical_or(
    minutes_prev.le(15)
    & df["user"].eq(df["user"].shift())
    & df["cookie"].ne(df["cookie"].shift()),
    minutes_next.le(15)
    & df["user"].eq(df["user"].shift(-1))
    & df["cookie"].ne(df["cookie"].shift(-1)),
)
res = df.loc[cond_]
A few points: you need to ensure your time column is datetime in order to make the 15-minute condition verifiable. Then the final filter (cond_) is obtained by comparing each row against the previous one, checking all 3 conditions, OR doing the same against the next one (otherwise you would get all the consecutive matching rows except the first of each run).
Outputs:
user time city cookie index
0 A 2019-01-01 11:00:00 NYC 123456 1
1 A 2019-01-01 11:12:00 CA 234567 2
4 B 2019-01-02 12:21:00 FL 456789 5
5 B 2019-01-02 12:31:00 VT 987654 6
7 A 2019-01-03 09:12:00 CA 123456 8
8 A 2019-01-03 09:27:00 NYC 345678 9
9 A 2019-01-03 09:34:00 TX 123456 10
You could use regular expressions to isolate the fields. With named groups and groupdict(), re.search() turns each line into a dictionary of field values, so you can iterate through the dataset keeping two dictionaries, the current one and the last one, and compare their values line by line.
So, something like (note the raw string; s is the current line):
import re
c_dict = re.search(r'(?P<user>\w) +(?P<time>\d{4}-\d{2}-\d{2} \d{2}\.\d{2}) +(?P<city>\w+) +(?P<cookie>\d{6}) +(?P<index>\d+)', s).groupdict()
for each line of your dataset. For the first line, this creates the dictionary {'user': 'A', 'time': '2019-01-01 11.00', 'city': 'NYC', 'cookie': '123456', 'index': '1'}. With the fields isolated, you can easily compare them to the values stored from the previous line.
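A minimal sketch of that loop (the file name data.txt and the [.:] delimiter tweak are assumptions, since the sample mixes '.' and ':' in the time field):
import re
from datetime import datetime

pattern = re.compile(
    r'(?P<user>\w+) +(?P<time>\d{4}-\d{2}-\d{2} \d{2}[.:]\d{2}) +'
    r'(?P<city>\w+) +(?P<cookie>\d{6}) +(?P<index>\d+)')

last = None
with open('data.txt') as fh:
    for line in fh:
        m = pattern.search(line)
        if m is None:
            continue                      # skip the header and blank lines
        cur = m.groupdict()
        if last is not None and cur['user'] == last['user'] \
                and cur['cookie'] != last['cookie']:
            t_cur = datetime.strptime(cur['time'].replace(':', '.'), '%Y-%m-%d %H.%M')
            t_last = datetime.strptime(last['time'].replace(':', '.'), '%Y-%m-%d %H.%M')
            if abs((t_cur - t_last).total_seconds()) <= 15 * 60:
                print(last, cur)          # the pair satisfies all three criteria
        last = cur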

Rearrange columns in DataFrame

Having a DataFrame structured as follows:
country A B C D
0 Albany 5.2 4.7 253.75 4
1 China 7.5 3.4 280.72 3
2 Portugal 4.6 7.5 320.00 6
3 France 8.4 3.6 144.00 3
4 Greece 2.1 10.0 331.00 6
I wanted to get something like this:
cost A B
country C D C D
Albany 2.05 4 1.85 4
China 2.67 3 1.21 3
Portugal 1.44 6 2.34 6
France 5.83 3 2.50 3
Greece 0.63 6 3.02 6
I mean, get the columns A and B as headers over C and D, keeping D the same with its constant value, and computing in C the percentage of the header column relative to the original C. Example for Albany:
value C in A: (5.2/253.75)*100 = 2.05
value C in B: (4.7/253.75)*100 = 1.85
Is there any way to do it?
Thanks!
You can divide multiple columns, here A and B, by DataFrame.div, then reindex with DataFrame.reindex against a MultiIndex created by MultiIndex.from_product, and finally set the D columns from the original using MultiIndex slicers:
cols = ['A','B']
mux = pd.MultiIndex.from_product([cols, ['C', 'D']])
df1 = df[cols].div(df['C'], axis=0).mul(100).reindex(mux, axis=1, level=0)
idx = pd.IndexSlice
df1.loc[:, idx[:, 'D']] = df[['D'] * len(cols)].to_numpy()
# pandas below 0.24
# df1.loc[:, idx[:, 'D']] = df[['D'] * len(cols)].values
print (df1)
A B
C D C D
0 2.049261 4 1.852217 4
1 2.671701 3 1.211171 3
2 1.437500 6 2.343750 6
3 5.833333 3 2.500000 3
4 0.634441 6 3.021148 6
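An equivalent sketch that builds each top-level block explicitly and lets pd.concat create the MultiIndex (same df as above; just an alternative construction, not the answer's method):
import pandas as pd

# one C/D block per top-level label; concat with a dict turns the keys
# into the first column level (A, B)
blocks = {c: pd.DataFrame({'C': df[c].div(df['C']).mul(100), 'D': df['D']})
          for c in ['A', 'B']}
df1 = pd.concat(blocks, axis=1)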

How to group by value for certain time period

I had a DataFrame like below:
Item Date Count
a 6/1/2018 1
b 6/1/2018 2
c 6/1/2018 3
a 12/1/2018 3
b 12/1/2018 4
c 12/1/2018 1
a 1/1/2019 2
b 1/1/2019 3
c 1/1/2019 2
I would like to get the sum of Count per Item with the specified duration from 7/1/2018 to 6/1/2019. For this case, the expected output will be:
Item TotalCount
a 5
b 7
c 3
We can use query with Series.between and chain that with GroupBy.sum (this assumes Date is already datetime; otherwise convert it first with pd.to_datetime):
df.query('Date.between("07-01-2018", "06-01-2019")').groupby('Item')['Count'].sum()
Output
Item
a 5
b 7
c 3
Name: Count, dtype: int64
To match your exact output, use reset_index:
df.query('Date.between("07-01-2018", "06-01-2019")').groupby('Item')['Count'].sum()\
    .reset_index(name='TotalCount')
Output
  Item  TotalCount
0    a           5
1    b           7
2    c           3
Here is one with .loc[] using lambda:
#df.Date=pd.to_datetime(df.Date)
(df.loc[lambda x: x.Date.between("07-01-2018", "06-01-2019")]
.groupby('Item',as_index=False)['Count'].sum())
Item Count
0 a 5
1 b 7
2 c 3
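For completeness, the same filter as a plain boolean mask with an explicit datetime conversion up front (the format string is an assumption based on the sample dates):
import pandas as pd

df['Date'] = pd.to_datetime(df['Date'], format='%m/%d/%Y')
mask = df['Date'].between('2018-07-01', '2019-06-01')
out = (df.loc[mask]
         .groupby('Item', as_index=False)['Count'].sum()
         .rename(columns={'Count': 'TotalCount'}))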

Pandas: Calculate Median of Group over Columns

Given the following data frame:
import pandas as pd
df = pd.DataFrame({'COL1': ['A', 'A', 'A', 'A', 'B', 'B'],
                   'COL2': ['AA', 'AA', 'BB', 'BB', 'BB', 'BB'],
                   'COL3': [2, 3, 4, 5, 4, 2],
                   'COL4': [0, 1, 2, 3, 4, 2]})
df
COL1 COL2 COL3 COL4
0 A AA 2 0
1 A AA 3 1
2 A BB 4 2
3 A BB 5 3
4 B BB 4 4
5 B BB 2 2
I would like, as efficiently as possible (i.e. via groupby and lambda x or better), to find the median of columns 3 and 4 for each distinct group of columns 1 and 2.
The desired result is as follows:
COL1 COL2 COL3 COL4 MEDIAN
0 A AA 2 0 1.5
1 A AA 3 1 1.5
2 A BB 4 2 3.5
3 A BB 5 3 3.5
4 B BB 4 4 3
5 B BB 2 2 3
Thanks in advance!
You already had the idea: group by COL1 and COL2 and calculate the median over both value columns.
import numpy as np
m = df.groupby(['COL1', 'COL2'])[['COL3', 'COL4']].apply(np.median)
m.name = 'MEDIAN'
print(df.join(m, on=['COL1', 'COL2']))
COL1 COL2 COL3 COL4 MEDIAN
0 A AA 2 0 1.5
1 A AA 3 1 1.5
2 A BB 4 2 3.5
3 A BB 5 3 3.5
4 B BB 4 4 3.0
5 B BB 2 2 3.0
Or, if you only need the per-group medians (computed per column rather than over both, and not broadcast back onto the rows):
df.groupby(['COL1', 'COL2']).median()[['COL3', 'COL4']]
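A stack-based sketch that reproduces the single MEDIAN column over both value columns (an alternative to the np.median apply above):
# stack COL3/COL4 into one long Series, take the median per (COL1, COL2)
# group, then broadcast it back onto the original rows with join
m = (df.set_index(['COL1', 'COL2'])[['COL3', 'COL4']]
       .stack()
       .groupby(level=['COL1', 'COL2'])
       .median()
       .rename('MEDIAN'))
df = df.join(m, on=['COL1', 'COL2'])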
