Filter rows of a dataframe based on certain conditions in pandas - python-3.x

While handling a dataframe in pandas, I got some unexpected cells consisting of values like:
E_no         E_name
6654-0984    Elvin-Johnson
430          Fred
663/547/900  Banty/Shon/Crio
87           Arif
546          Zerin
322,76       Chris,Deory
In some rows, more than one E_name and E_no have been assigned, although each cell is supposed to hold a single employee.
My data consists of the columns E_no and E_name; both need to be separated into different rows.
What I want is:
E_no  E_name
6654  Elvin
0984  Johnson
430   Fred
663   Banty
547   Shon
900   Crio
87    Arif
546   Zerin
322   Chris
76    Deory
Separate those values and put them in different rows.
Please help me do this so that I can proceed further; it would also be really helpful if someone could explain the logic, i.e. how to think about this problem.
Thanks in advance.
Let me know if you face any difficulty understanding the problem.

Similar to Divyaansh's solution. Just use split, explode and merge.
import pandas as pd

df = pd.DataFrame({'E_no': ['6654-0984', '430', '663/547/900', '87', '546', '322,76'],
                   'E_name': ['Elvin-Johnson', 'Fred', 'Banty/Shon/Crio', 'Arif', 'Zerin', 'Chris,Deory']})
# split on any of the delimiters and explode each column
# ('-' is listed first in the character class so it is taken literally)
x = df['E_no'].str.split(r'[-,/]').explode().reset_index(drop=True)
y = df['E_name'].str.split(r'[-,/]').explode().reset_index(drop=True)
# merge both columns together on their reset index
df2 = pd.merge(x, y, left_index=True, right_index=True)
# print the modified dataframe
print(df2)
For reference, the original and the modified dataframe:
Original Dataframe:
E_no E_name
0 6654-0984 Elvin-Johnson
1 430 Fred
2 663/547/900 Banty/Shon/Crio
3 87 Arif
4 546 Zerin
5 322,76 Chris,Deory
Modified Dataframe:
E_no E_name
0 6654 Elvin
1 0984 Johnson
2 430 Fred
3 663 Banty
4 547 Shon
5 900 Crio
6 87 Arif
7 546 Zerin
8 322 Chris
9 76 Deory
Alternatively, you can create a new dataframe from the values in x and y.
x = df['E_no'].str.split(r'[-,/]').explode().reset_index(drop=True)
y = df['E_name'].str.split(r'[-,/]').explode().reset_index(drop=True)
# create a new dataframe with the values from x and y
df3 = pd.DataFrame({'E_no': x, 'E_name': y})
print (df3)
Same result as before.
Or this:
# explode each column, keeping the original index as a column
x = df['E_no'].str.split(r'[-,/]').explode().reset_index()
y = df['E_name'].str.split(r'[-,/]').explode().reset_index()
# create a new dataframe with the new values from x and y
df3 = pd.DataFrame({'E_no': x['E_no'], 'E_name': y['E_name']})
print (df3)
Or you can do:
# explode each column
x = df['E_no'].str.split(r'[-,/]').explode().reset_index(drop=True)
y = df['E_name'].str.split(r'[-,/]').explode().reset_index(drop=True)
# stack the two Series as rows, then transpose
df4 = pd.DataFrame([x, y]).T
print (df4)
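On recent pandas (1.3+, where DataFrame.explode accepts a list of columns), the split-and-explode steps can be collapsed into a single chain. A minimal sketch, assuming every row has the same number of parts in both columns (true for this data):
# split both columns on any delimiter, then explode them in lockstep
df5 = (df.apply(lambda col: col.str.split(r'[-,/]'))
         .explode(['E_no', 'E_name'])
         .reset_index(drop=True))
print(df5)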

Split, flatten, recombine, rename:
a = [item for sublist in df.E_no.str.split(r'\W').tolist() for item in sublist]
b = [item for sublist in df.E_name.str.split(r'\W').tolist() for item in sublist]
df2 = pd.DataFrame(list(zip(a, b)), columns=df.columns)
Output:
E_no E_name
0 6654 Elvin
1 0984 Johnson
2 430 Fred
3 663 Banty
4 547 Shon
5 900 Crio
6 87 Arif
7 546 Zerin
8 322 Chris
9 76 Deory

You can provide multiple delimiters in the read_csv arguments as a regex separator (this requires the python engine):
pd.read_csv("filepath", sep='-|/|,| ', engine='python')
This is the best I can offer right now without the data tables.
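A minimal runnable sketch of the idea, using io.StringIO in place of a real file; note that each row must split into the same number of fields for this to work, and the column names here are hypothetical:
import io
import pandas as pd

# '-', ',' and '|' all act as field separators
data = io.StringIO("6654-0984|Elvin-Johnson\n322,76|Chris,Deory")
df = pd.read_csv(data, sep=r'-|,|\|', engine='python',
                 names=['E_no1', 'E_no2', 'E_name1', 'E_name2'])
print(df)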

I think this is actually rather tricky. Here's a solution in which we use the E_no column to build a column of regexes that we will then use to split the two original columns into parts. Finally we construct a new DataFrame from those parts. This method ensures that the second column's format matches the first's.
import re
import pandas as pd

df = pd.DataFrame.from_records(
    [
        {"E_no": "6654-0984", "E_name": "Elvin-Johnson"},
        {"E_no": "430", "E_name": "Fred"},
        {"E_no": "663/547/900", "E_name": "Banty/Shon/Crio"},
        {"E_no": "87", "E_name": "Arif"},
        {"E_no": "546", "E_name": "Zerin"},
        {"E_no": "322,76", "E_name": "Chris,Deory"},
        {"E_no": "888+88", "E_name": "FIRST+SEC|OND"},
        {"E_no": "999|99", "E_name": "TH,IRD|FOURTH"},
    ]
)

def get_pattern(e_no, delimiters=None):
    if delimiters is None:
        delimiters = "-/,|+"
    # alternation of the escaped delimiters, e.g. \-|/|,|\||\+
    delimiters = "|".join(re.escape(d) for d in delimiters)
    non_match_delims = f"(?:(?!{delimiters}).)*"
    # the sequence of delimiters actually present in this E_no
    delim_parts = re.findall(f"{non_match_delims}({delimiters})", e_no)
    pattern_parts = []
    for delim_part in delim_parts:
        delim = re.escape(delim_part)
        pattern_parts.append(f"((?:(?!{delim}).)*)")
        pattern_parts.append(delim)
    pattern_parts.append("(.*)")
    return "".join(pattern_parts)

def extract_items(row, delimiters=None):
    pattern = get_pattern(row["E_no"], delimiters)
    nos = re.search(pattern, row["E_no"]).groups()
    names = re.search(pattern, row["E_name"]).groups()
    return (nos, names)

nos, names = map(
    lambda L: [e for tup in L for e in tup],
    zip(*df.apply(extract_items, axis=1))
)
print(pd.DataFrame({"E_no": nos, "E_names": names}))
E_no E_names
0 6654 Elvin
1 0984 Johnson
2 430 Fred
3 663 Banty
4 547 Shon
5 900 Crio
6 87 Arif
7 546 Zerin
8 322 Chris
9 76 Deory
10 888 FIRST
11 88 SEC|OND
12 999 TH,IRD
13 99 FOURTH

Here is my approach:
1) Replace - and / with a comma (,) in both columns.
2) Split each field on the comma (,), expand it, then stack it.
3) Reset the index of each resulting data frame.
4) Finally, merge the two DFs into one finalDF.
a={'E_no':['6654-0984','430','663/547/900','87','546','322,76'],
'E_name':['Elvin-Johnson','Fred','Banty/Shon/Crio','Arif','Zerin','Chris,Deory']}
df = pd.DataFrame(a)
df1 = df['E_no'].str.replace('-|/', ',', regex=True).str.split(',', expand=True).stack().reset_index()
df2 = df['E_name'].str.replace('-|/', ',', regex=True).str.split(',', expand=True).stack().reset_index()
df1.drop(['level_0', 'level_1'], axis=1, inplace=True)
df1.rename(columns={0: 'E_no'}, inplace=True)
df2.drop(['level_0', 'level_1'], axis=1, inplace=True)
df2.rename(columns={0: 'E_name'}, inplace=True)
finalDF = pd.merge(df1, df2, left_index=True, right_index=True)
Output: the same modified dataframe as shown in the solutions above.

Related

How to split dataframe by column value condition, pandas

I want to split a dataframe into different lists based on a column value condition.
Here is a dataframe example.
df=pd.DataFrame({'flag_1':[1,2,3,1,2,500,498,495,1,1,1,1,1,500,440,430,2,3,4,4],'dd':[1,1,1,7,7,7,8,8,8,1,1,1,7,7,7,8,8,8,5,7]})
The desired output, df_out, is:
df_out=pd.DataFrame({'flag_1':[500,498,495,500,440,430],'dd':[7,8,8,7,7,8]})
Try this:
grp = (df['flag_1']<500).cumsum()
pd.concat({n: g[1:] for n, g in tuple(df.groupby(grp)) if len(g) > 1}, ignore_index=True)
Output:
   flag_1  dd
0     500   7
1     498   8
2     495   8
3     500   7
4     440   7
5     430   8
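To see why this works: (df['flag_1']<500).cumsum() increments on every small row, so each run of values >= 500 inherits the group label of the small row just before it, and g[1:] then drops that leading small row from each group. You can inspect the labels directly:
# show the group label next to each row
print(df.assign(grp=(df['flag_1'] < 500).cumsum()))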

test/train splits in pycaret using a column for grouping rows that should be in the same split

My dataset contains a column with group information; I need to split by these groups in such a way that rows belonging to the same group are never divided between train and test but are sent as a whole to one of the splits, using PyCaret.
A 10-row sample for clarification:
group_id measure1 measure2 measure3
1 3455 3425 345
1 6455 825 945
1 6444 225 145
2 23 34 233
2 623 22 888
3 3455 3425 345
3 6155 525 645
3 6434 325 845
4 93 345 233
4 693 222 808
Every unique group_id should be sent in full to one of the splits, like this (using an 80/20 ratio):
TRAIN SET:
group_id measure1 measure2 measure3
1 3455 3425 345
1 6455 825 945
1 6444 225 145
3 3455 3425 345
3 6155 525 645
3 6434 325 845
4 93 345 233
4 693 222 808
TEST SET:
group_id measure1 measure2 measure3
2 23 34 233
2 623 22 888
You can try the following per the documentation
https://pycaret.readthedocs.io/en/latest/api/classification.html
fold_strategy = "groupkfold"
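A minimal sketch of that, assuming the classification module and that your target column is named target (check the fold_groups parameter against your installed PyCaret version); note this keeps groups intact within the cross-validation folds:
from pycaret.classification import setup

exp = setup(data=df, target='target',
            fold_strategy='groupkfold',  # group-aware CV folds
            fold_groups='group_id')      # column defining the groups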
One solution could look like this:
import numpy as np
import pandas as pd
from itertools import combinations
def is_possible_sum(numbers, n):
    for r in range(len(numbers)):
        for combo in combinations(numbers, r + 1):
            if sum(combo) == n:
                return combo
    print('Desired split not possible')
    raise ArithmeticError

def train_test_split(table: pd.DataFrame, train_fraction: float, col_identifier: str):
    train_ids = []
    occurrences = table[col_identifier].value_counts().to_dict()
    required = sum(occurrences.values()) * train_fraction
    lengths = is_possible_sum(occurrences.values(), required)
    for i in lengths:
        for key, value in occurrences.items():
            if value == i:
                train_ids.append(key)
                del occurrences[key]  # prevents the same ID from being selected twice
                break
    train = table[table[col_identifier].isin(train_ids)]
    test = table[~table[col_identifier].isin(train_ids)]
    return train, test

if __name__ == '__main__':
    df = pd.DataFrame()
    df['Group_ID'] = np.array([1, 1, 1, 2, 2, 3, 3, 3, 4, 4])
    df['Measurement'] = np.random.random(10)
    train_part, test_part = train_test_split(df, 0.8, 'Group_ID')
Some remarks:
This is probably the least elegant way to do it...
It uses an ungodly amount of for loops and is probably slow for larger dataframes. It also doesn't randomize the split.
Much of this is because the dictionary mapping group_id to the count of samples with that group_id can't simply be inverted, as some entries might be ambiguous. You could probably do this with numpy arrays as well, but I doubt the overall structure would be much different.
First function taken from here: How to check if a sum is possible in array?
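If a scikit-learn dependency is acceptable, GroupShuffleSplit does a similar group-aware split with far less code, and randomizes which groups land on each side (a sketch, not part of the original answer):
import numpy as np
import pandas as pd
from sklearn.model_selection import GroupShuffleSplit

df = pd.DataFrame({'Group_ID': [1, 1, 1, 2, 2, 3, 3, 3, 4, 4],
                   'Measurement': np.random.random(10)})

# train_size is the fraction of *groups* (not rows) sent to train;
# whole groups always stay together
gss = GroupShuffleSplit(n_splits=1, train_size=0.8, random_state=42)
train_idx, test_idx = next(gss.split(df, groups=df['Group_ID']))
train_part, test_part = df.iloc[train_idx], df.iloc[test_idx]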

Groupby and calculate count and means based on multiple conditions in Pandas

For the given dataframe as follows:
id|address|sell_price|market_price|status|start_date|end_date
1|7552 Atlantic Lane|1170787.3|1463484.12|finished|2019/8/2|2019/10/1
1|7552 Atlantic Lane|1137782.02|1422227.52|finished|2019/8/2|2019/10/1
2|888 Foster Street|1066708.28|1333385.35|finished|2019/8/2|2019/10/1
2|888 Foster Street|1871757.05|1416757.05|finished|2019/10/14|2019/10/15
2|888 Foster Street|NaN|763744.52|current|2019/10/12|2019/10/13
3|5 Pawnee Avenue|NaN|928366.2|current|2019/10/10|2019/10/11
3|5 Pawnee Avenue|NaN|2025924.16|current|2019/10/10|2019/10/11
3|5 Pawnee Avenue|NaN|4000000|forward|2019/10/9|2019/10/10
3|5 Pawnee Avenue|2236138.9|1788938.9|finished|2019/10/8|2019/10/9
4|916 W. Mill Pond St.|2811026.73|1992026.73|finished|2019/9/30|2019/10/1
4|916 W. Mill Pond St.|13664803.02|10914803.02|finished|2019/9/30|2019/10/1
4|916 W. Mill Pond St.|3234636.64|1956636.64|finished|2019/9/30|2019/10/1
5|68 Henry Drive|2699959.92|NaN|failed|2019/10/8|2019/10/9
5|68 Henry Drive|5830725.66|NaN|failed|2019/10/8|2019/10/9
5|68 Henry Drive|2668401.36|1903401.36|finished|2019/12/8|2019/12/9
#copy above data and run below code to reproduce dataframe
df = pd.read_clipboard(sep='|')
I would like to groupby id and address and calculate mean_ratio and result_count based on the following conditions:
mean_ratio: group by id and address and calculate the mean of the ratio for rows that meet the following conditions: status is finished and start_date falls within 2019-09 to 2019-10
result_count: group by id and address and count the rows that meet the following conditions: status is either finished or failed, and start_date falls within 2019-09 to 2019-10
The desired output will like this:
id address mean_ratio result_count
0 1 7552 Atlantic Lane NaN 0
1 2 888 Foster Street 1.32 1
2 3 5 Pawnee Avenue 1.25 1
3 4 916 W. Mill Pond St. 1.44 3
4 5 68 Henry Drive NaN 2
I have tried so far:
# convert date
df[['start_date', 'end_date']] = df[['start_date', 'end_date']].apply(lambda x: pd.to_datetime(x, format = '%Y/%m/%d'))
# calculate ratio
df['ratio'] = round(df['sell_price']/df['market_price'], 2)
To filter start_date within the range 2019-09 to 2019-10:
L = [pd.Period('2019-09'), pd.Period('2019-10')]
c = ['start_date']
df = df[np.logical_or.reduce([df[x].dt.to_period('m').isin(L) for x in c])]
To filter rows whose status is finished or failed, I use:
mask = df['status'].str.contains('finished|failed')
df[mask]
But I don't know how to combine these to get the final result. Thanks for your help in advance.
I think you need GroupBy.agg, but because some rows are excluded entirely (like id=1), add them back with DataFrame.join against all unique id/address pairs in df2, and finally replace the missing values in the result_count column:
df2 = df[['id','address']].drop_duplicates()
print (df2)
id address
0 1 7552 Atlantic Lane
2 2 888 Foster Street
5 3 5 Pawnee Avenue
9 4 916 W. Mill Pond St.
12 5 68 Henry Drive
df[['start_date', 'end_date']] = df[['start_date', 'end_date']].apply(lambda x: pd.to_datetime(x, format = '%Y/%m/%d'))
df['ratio'] = round(df['sell_price']/df['market_price'], 2)
L = [pd.Period('2019-09'), pd.Period('2019-10')]
c = ['start_date']
mask = df['status'].str.contains('finished|failed')
mask1 = np.logical_or.reduce([df[x].dt.to_period('m').isin(L) for x in c])
df = df[mask1 & mask]
df1 = df.groupby(['id', 'address']).agg(mean_ratio=('ratio','mean'),
result_count=('ratio','size'))
df1 = df2.join(df1, on=['id','address']).fillna({'result_count': 0})
print (df1)
id address mean_ratio result_count
0 1 7552 Atlantic Lane NaN 0.0
2 2 888 Foster Street 1.320000 1.0
5 3 5 Pawnee Avenue 1.250000 1.0
9 4 916 W. Mill Pond St. 1.436667 3.0
12 5 68 Henry Drive NaN 2.0
Some helpers
def mean_ratio(idf):
    # filter rows: start_date in Sep-Oct 2019 and a non-null ratio
    idf = idf[
        (idf['start_date'].between('2019-09-01', '2019-10-31')) &
        (idf['mean_ratio'].notnull())]
    return np.round(idf['mean_ratio'].mean(), 2)

def result_count(idf):
    idf = idf[
        (idf['status'].isin(['finished', 'failed'])) &
        (idf['start_date'].between('2019-09-01', '2019-10-31'))]
    return idf.shape[0]

# we can calculate `mean_ratio` beforehand
df['mean_ratio'] = df['sell_price'] / df['market_price']
df = df.astype({'start_date': 'datetime64[ns]', 'end_date': 'datetime64[ns]'})

# group the df
g = df.groupby(['id', 'address'])
mean_ratio = g.apply(lambda idf: mean_ratio(idf)).to_frame('mean_ratio')
result_count = g.apply(lambda idf: result_count(idf)).to_frame('result_count')

# final result
pd.concat((mean_ratio, result_count), axis=1)

Iterate over rows in a data frame create a new column then adding more columns based on the new column

I have a data frame as below:
Date Quantity
2019-04-25 100
2019-04-26 148
2019-04-27 124
The output I need is to take the quantity difference between two consecutive dates, average it over 24 hours, and create 23 columns in which each hourly column adds the hourly difference to the previous one, such as below:
Date Quantity Hour-1 Hour-2 ....Hour-23
2019-04-25 100 102 104 .... 146
2019-04-26 148 147 146 .... 123
2019-04-27 124
I'm trying to iterate with a loop, but it's not working. My code is as below:
for i in df.index:
    diff = (df.get_value(i+1, 'Quantity') - df.get_value(i, 'Quantity'))/24
    for j in range(24):
        df[i, [1+j]] = df[i, [j]]*(1+diff)
I did some research but I have not found how to create columns like above iteratively. I hope you could help me. Thank you in advance.
IIUC, using resample and interpolate, then pivoting the output:
s = df.set_index('Date').resample('1 H').interpolate()
s = pd.pivot_table(s, index=s.index.date,
                   columns=s.groupby(s.index.date).cumcount(),
                   values='Quantity', aggfunc='mean')
s
Out[93]:
0 1 2 3 ... 20 21 22 23
2019-04-25 100.0 102.0 104.0 106.0 ... 140.0 142.0 144.0 146.0
2019-04-26 148.0 147.0 146.0 145.0 ... 128.0 127.0 126.0 125.0
2019-04-27 124.0 NaN NaN NaN ... NaN NaN NaN NaN
[3 rows x 24 columns]
If I have understood the question correctly, a for-loop approach:
list_of_values = []
for i, row in df.iterrows():
    if i < len(df) - 1:  # every row except the last has a successor
        qty = row['Quantity']
        qty_2 = df.at[i+1, 'Quantity']
        diff = (qty_2 - qty)/24
        list_of_values.append(diff)
    else:
        list_of_values.append(0)
df['diff'] = list_of_values
Output:
Date Quantity diff
2019-04-25 100 2
2019-04-26 148 -1
2019-04-27 124 0
Now create the required columns, i.e.:
df['Hour-1'] = df['Quantity'] + df['diff']
df['Hour-2'] = df['Quantity'] + 2*df['diff']
...and so on, up to Hour-23.
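A loop version of the same idea, building on the diff column created above:
# build Hour-1 ... Hour-23 in one pass
for h in range(1, 24):
    df[f'Hour-{h}'] = df['Quantity'] + h * df['diff']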
There are other approaches which will work way better.

pandas df merge avoid duplicate column names

The question: when merging two dfs that both have a column called A, the result will have A_x and A_y. I am wondering how to keep A from one df and discard the other, so that I don't have to rename A_x back to A after the merge.
Just filter your dataframe columns before merging.
df1 = pd.DataFrame({'Key':np.arange(12),'A':np.random.randint(0,100,12),'C':list('ABCD')*3})
df2 = pd.DataFrame({'Key':np.arange(12),'A':np.random.randint(100,1000,12),'C':list('ABCD')*3})
df1.merge(df2[['Key','A']], on='Key')
Output: (Note: C is not duplicated)
A_x C Key A_y
0 60 A 0 440
1 65 B 1 731
2 76 C 2 596
3 67 D 3 580
4 44 A 4 477
5 51 B 5 524
6 7 C 6 572
7 88 D 7 984
8 70 A 8 862
9 13 B 9 158
10 28 C 10 593
11 63 D 11 177
It depends on whether you need to append the columns with duplicated names to the final merged DataFrame. If you do, add the suffixes parameter to merge:
print (df1.merge(df2, on='Key', suffixes=('', '_')))
If not, use @Scott Boston's solution.
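If you always want to keep the overlapping columns from df1, you can also drop the shared non-key columns from df2 programmatically before merging (a sketch, assuming Key is the join key):
# every column present in both frames, except the join key
shared = df1.columns.intersection(df2.columns).difference(['Key'])
merged = df1.merge(df2.drop(columns=shared), on='Key')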
