How to gather and rename the same columns from multiple Pandas dataframes - python-3.x

I have many dataframes with the same structure: the same number of rows and the same column names.
How can I gather all the columns with the same name, renamed with a suffix, into a single new dataframe?
df1 = pd.DataFrame({'Name':['Tom', 'nick', 'krish', 'jack'], 'Age':[20, 21, 19, 18]})
df2 = pd.DataFrame({'Name':['Wendy', 'Frank', 'krish', 'Lucy'], 'Age':[20, 21, 19, 18]})
print(df1)
print(df2)
I want:
df3 = pd.DataFrame({'Name1':['Wendy', 'Frank', 'krish', 'Lucy'], 'Name2':['Tom', 'nick', 'krish', 'jack']})
print(df3)
Output:
df1:
Name Age
0 Tom 20
1 nick 21
2 krish 19
3 jack 18
df2:
Name Age
0 Wendy 20
1 Frank 21
2 krish 19
3 Lucy 18
df3:
Name1 Name2
0 Wendy Tom
1 Frank nick
2 krish krish
3 Lucy jack

df1 = df1.drop(columns='Age')
df2 = df2.drop(columns='Age')
# suffixes are needed because both frames still have a 'Name' column
df3 = df2.join(df1, lsuffix='1', rsuffix='2')

You can concat the two DataFrames together along axis=1 in a list comprehension, using .add_suffix with enumerate to append the numbers to the column names.
pd.concat([df[['Name']].add_suffix(str(i+1)) for i, df in enumerate([df2, df1])], axis=1)
Name1 Name2
0 Wendy Tom
1 Frank nick
2 krish krish
3 Lucy jack
Or, if you want to do this for many similar columns at once, concat with keys to create a MultiIndex on the columns, then collapse the MultiIndex by joining the column names in a list comprehension.
import numpy as np

l = [df2, df1]
df3 = pd.concat(l, axis=1, keys=np.arange(len(l))+1)
df3.columns = [f'{y}{x}' for x, y in df3.columns]
# Name1 Age1 Name2 Age2
#0 Wendy 20 Tom 20
#1 Frank 21 nick 21
#2 krish 19 krish 19
#3 Lucy 18 jack 18
df3.filter(like='Name')
Name1 Name2
0 Wendy Tom
1 Frank nick
2 krish krish
3 Lucy jack
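A side note on filter: like matches any substring, so if other columns could also contain 'Name' you can tighten the selection with a regex instead, e.g. df3.filter(regex=r'^Name\d+$').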

Related

How to get max min value in pandas dataframe

Hi, I've got a data frame like this:
import pandas as pd
data = [(1,"tom", 23),
(1,"nick", 12),
(1,"jim",13),
(2,"tom", 44),
(2,"nick", 56),
(2,"jim",77),
(3, "tom", 88),
(3, "nick", 10),
(3, "jim", 13),
]
df = pd.DataFrame(data, columns=['class', 'Name', 'number'])
The output of this dataframe:
class Name number
0 1 tom 23
1 1 nick 12
2 1 jim 13
3 2 tom 44
4 2 nick 56
5 2 jim 77
6 3 tom 88
7 3 nick 10
8 3 jim 13
How can I find the name with the maximum number in class 1, and then get the numbers for that same name in the other classes? The result should look like this:
[name =tom, class=1, number =23]
[name =tom, class=2, number =44]
[name =tom, class=3, number =88]
Thank you very much for helping me!
Find the name first from class 1, and then filter:
name = df.Name.loc[df[df['class'] == 1].number.idxmax()]
df[df.Name == name]
# class Name number
#0 1 tom 23
#3 2 tom 44
#6 3 tom 88
Try this.
idx = df.groupby(['class'])['number'].transform('max') == df['number']
df[idx]
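Note that this variant keeps the row with the maximum number in each class regardless of name, so on this data it differs from the first answer:
# class Name number
#0 1 tom 23
#5 2 jim 77
#6 3 tom 88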

Want To Collect The Same string of header

I have a sheet whose header looks like this:
'''
+--------------+------------------+----------------+--------------+---------------+
| usa_alaska | usa_california | france_paris | italy_roma | france_lyon |
|--------------+------------------+----------------+--------------+---------------|
+--------------+------------------+----------------+--------------+---------------+
'''
df = pd.DataFrame([], columns = 'usa_alaska usa_california france_paris italy_roma france_lyon'.split())
I want to separate the headers by country and region so that when I call france, I get paris and lyon as columns.
Create a MultiIndex from your column names:
Suppose this dataframe:
>>> df
usa_alaska usa_california france_paris italy_roma france_lyon
0 1 2 3 4 5
df.columns = df.columns.str.split('_', expand=True)
df = df.sort_index(axis=1)
Output
>>> df
france italy usa
lyon paris roma alaska california
0 5 3 4 1 2
>>> df['france']
lyon paris
0 5 3
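A follow-up usage note: with the MultiIndex in place you can also pull a single column with a tuple key:
>>> df[('france', 'paris')]
0    3
Name: (france, paris), dtype: int64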

Pandas dataframe not correct format for groupby, what is wrong?

I am trying to sum all columns based on the value of the first, but groupby.sum is unexpectedly not working.
Here is a minimal example:
import pandas as pd
data = [['Alex',10, 11],['Bob',12, 10],['Clarke',13, 9], ['Clarke',1, 1]]
df = pd.DataFrame(data,columns=['Name','points1', 'points2'])
print(df)
df.groupby('Name').sum()
print(df)
I get this:
Name points1 points2
0 Alex 10 11
1 Bob 12 10
2 Clarke 13 9
3 Clarke 1 1
And not this:
Name points1 points2
0 Alex 10 11
1 Bob 12 10
2 Clarke 14 10
From what I understand, the dataframe is not in the right format for pandas to perform the groupby. I would like to understand what is wrong with it, because this is just a toy example, but I have the same problem with a real dataset.
The real data I'm trying to read is the Johns Hopkins University Covid-19 dataset:
https://github.com/CSSEGISandData/COVID-19/tree/master/csse_covid_19_data/csse_covid_19_time_series
You forgot to assign the output of the aggregation to a variable; aggregation does not work in place. So in your solution, print(df) before and after the groupby returned the same original DataFrame.
df1 = df.groupby('Name', as_index=False).sum()
print (df1)
Name points1 points2
0 Alex 10 11
1 Bob 12 10
2 Clarke 14 10
Or you can assign back to the same variable df:
df = df.groupby('Name', as_index=False).sum()
print (df)
Name points1 points2
0 Alex 10 11
1 Bob 12 10
2 Clarke 14 10
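Applied to the linked JHU time-series data the fix is the same. A minimal sketch, assuming the global confirmed file keeps its time_series_covid19_confirmed_global.csv name and its Country/Region column:
import pandas as pd

url = ('https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/'
       'csse_covid_19_data/csse_covid_19_time_series/'
       'time_series_covid19_confirmed_global.csv')
df = pd.read_csv(url)
# assign the result -- groupby().sum() returns a new DataFrame
totals = df.groupby('Country/Region').sum(numeric_only=True)
print(totals.head())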

Groupby and calculate count and means based on multiple conditions in Pandas

For the given dataframe as follows:
id|address|sell_price|market_price|status|start_date|end_date
1|7552 Atlantic Lane|1170787.3|1463484.12|finished|2019/8/2|2019/10/1
1|7552 Atlantic Lane|1137782.02|1422227.52|finished|2019/8/2|2019/10/1
2|888 Foster Street|1066708.28|1333385.35|finished|2019/8/2|2019/10/1
2|888 Foster Street|1871757.05|1416757.05|finished|2019/10/14|2019/10/15
2|888 Foster Street|NaN|763744.52|current|2019/10/12|2019/10/13
3|5 Pawnee Avenue|NaN|928366.2|current|2019/10/10|2019/10/11
3|5 Pawnee Avenue|NaN|2025924.16|current|2019/10/10|2019/10/11
3|5 Pawnee Avenue|NaN|4000000|forward|2019/10/9|2019/10/10
3|5 Pawnee Avenue|2236138.9|1788938.9|finished|2019/10/8|2019/10/9
4|916 W. Mill Pond St.|2811026.73|1992026.73|finished|2019/9/30|2019/10/1
4|916 W. Mill Pond St.|13664803.02|10914803.02|finished|2019/9/30|2019/10/1
4|916 W. Mill Pond St.|3234636.64|1956636.64|finished|2019/9/30|2019/10/1
5|68 Henry Drive|2699959.92|NaN|failed|2019/10/8|2019/10/9
5|68 Henry Drive|5830725.66|NaN|failed|2019/10/8|2019/10/9
5|68 Henry Drive|2668401.36|1903401.36|finished|2019/12/8|2019/12/9
#copy above data and run below code to reproduce dataframe
df = pd.read_clipboard(sep='|')
I would like to group by id and address and calculate mean_ratio and result_count based on the following conditions:
mean_ratio: grouped by id and address, the mean of ratio over the rows where status is finished and start_date falls in 2019-09 or 2019-10
result_count: grouped by id and address, the count of rows where status is either finished or failed and start_date falls in 2019-09 or 2019-10
The desired output will like this:
id address mean_ratio result_count
0 1 7552 Atlantic Lane NaN 0
1 2 888 Foster Street 1.32 1
2 3 5 Pawnee Avenue 1.25 1
3 4 916 W. Mill Pond St. 1.44 3
4 5 68 Henry Drive NaN 2
I have tried so far:
# convert date
df[['start_date', 'end_date']] = df[['start_date', 'end_date']].apply(lambda x: pd.to_datetime(x, format = '%Y/%m/%d'))
# calculate ratio
df['ratio'] = round(df['sell_price']/df['market_price'], 2)
To filter start_date to the 2019-09 to 2019-10 range:
import numpy as np

L = [pd.Period('2019-09'), pd.Period('2019-10')]
c = ['start_date']
df = df[np.logical_or.reduce([df[x].dt.to_period('m').isin(L) for x in c])]
To filter rows whose status is finished or failed, I use:
mask = df['status'].str.contains('finished|failed')
df[mask]
But I don't know how to combine these to get the final result. Thanks for your help in advance.
I think you need GroupBy.agg, but because some groups (like id=1) are excluded entirely by the filters, add them back with DataFrame.join against all unique id/address pairs in df2, and finally replace the missing values in the result_count column:
df2 = df[['id','address']].drop_duplicates()
print (df2)
id address
0 1 7552 Atlantic Lane
2 2 888 Foster Street
5 3 5 Pawnee Avenue
9 4 916 W. Mill Pond St.
12 5 68 Henry Drive
import numpy as np

df[['start_date', 'end_date']] = df[['start_date', 'end_date']].apply(lambda x: pd.to_datetime(x, format='%Y/%m/%d'))
df['ratio'] = round(df['sell_price']/df['market_price'], 2)
L = [pd.Period('2019-09'), pd.Period('2019-10')]
c = ['start_date']
mask = df['status'].str.contains('finished|failed')
mask1 = np.logical_or.reduce([df[x].dt.to_period('m').isin(L) for x in c])
df = df[mask1 & mask]
df1 = df.groupby(['id', 'address']).agg(mean_ratio=('ratio', 'mean'),
                                        result_count=('ratio', 'size'))
df1 = df2.join(df1, on=['id','address']).fillna({'result_count': 0})
print (df1)
id address mean_ratio result_count
0 1 7552 Atlantic Lane NaN 0.0
2 2 888 Foster Street 1.320000 1.0
5 3 5 Pawnee Avenue 1.250000 1.0
9 4 916 W. Mill Pond St. 1.436667 3.0
12 5 68 Henry Drive NaN 2.0
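If you want result_count as an integer and a clean 0..n index like the desired output, a small follow-up (not in the original answer) does it:
df1['result_count'] = df1['result_count'].astype(int)
df1 = df1.reset_index(drop=True)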
Another approach, with some helper functions:
import numpy as np
import pandas as pd

def mean_ratio(idf):
    # keep rows in the date window that have a computable ratio
    idf = idf[
        (idf['start_date'].between('2019-09-01', '2019-10-31')) &
        (idf['mean_ratio'].notnull())]
    return np.round(idf['mean_ratio'].mean(), 2)

def result_count(idf):
    idf = idf[
        (idf['status'].isin(['finished', 'failed'])) &
        (idf['start_date'].between('2019-09-01', '2019-10-31'))]
    return idf.shape[0]

# We can calculate `mean_ratio` beforehand
df['mean_ratio'] = df['sell_price'] / df['market_price']
df = df.astype({'start_date': 'datetime64[ns]', 'end_date': 'datetime64[ns]'})

# Group the df
g = df.groupby(['id', 'address'])
mean_ratio = g.apply(lambda idf: mean_ratio(idf)).to_frame('mean_ratio')
result_count = g.apply(lambda idf: result_count(idf)).to_frame('result_count')

# Final result
pd.concat((mean_ratio, result_count), axis=1)
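On the sample data this reproduces the desired table, indexed by the groupby keys:
# mean_ratio result_count
#id address
#1 7552 Atlantic Lane NaN 0
#2 888 Foster Street 1.32 1
#3 5 Pawnee Avenue 1.25 1
#4 916 W. Mill Pond St. 1.44 3
#5 68 Henry Drive NaN 2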

How to compare row by row in a dataframe

I have a data frame that has a name and the URL ID of the name. For example:
Abc 123
Abc.com 123
Def 345
Pqr 123
PQR.com 123
Here, due to a data extraction error, different names sometimes share the same ID. I want to clean the table so that if the names are genuinely different but the ID is the same, the records remain as they are; if the names are similar and the ID is also the same, they should be collapsed to one name.
To be clear, the expected output should be:
Abc.com 123
Abc.com 123
Def 345
PQR.com 123
PQR.com 123
That is, the last pair was a data entry error and both were really the same name (the first word of the string was the same), so they are collapsed to one name based on the ID.
But the first and second records, even though their ID matched the last ones', had names that did not match and were completely different.
I am not able to work out how to achieve this.
Any guidance is appreciated. Thanks in advance.
Note: The size of the dataset is almost 16 million of such records.
The idea is to use the fuzzy matching library fuzzywuzzy to get a ratio for every combination of names: cross join with DataFrame.merge, remove rows with the same name in both columns with DataFrame.query, and add a new column with the lengths of the names via Series.str.len:
from fuzzywuzzy import fuzz
df1 = df.merge(df, on='ID').query('Name_x != Name_y')
df1['ratio'] = df1.apply(lambda x: fuzz.ratio(x['Name_x'], x['Name_y']), axis=1)
df1['len'] = df1['Name_x'].str.len()
print (df1)
Name_x ID Name_y ratio len
1 Abc 123 BCD 0 3
2 BCD 123 Abc 0 3
6 Pqr 789 PQR.com 20 3
7 PQR.com 789 Pqr 20 7
Then filter rows by a threshold with boolean indexing. Next it is necessary to choose which value to keep; one possible solution is to take the longer text. So use DataFrameGroupBy.idxmax with DataFrame.loc, and then DataFrame.set_index to get a Series:
N = 15
df2 = df1[df1['ratio'].gt(N)]
s = df2.loc[df2.groupby('ID')['len'].idxmax()].set_index('ID')['Name_x']
print (s)
ID
789 PQR.com
Name: Name_x, dtype: object
Finally, Series.map by ID and replace the non-matched values with the originals using Series.fillna:
df['Name'] = df['ID'].map(s).fillna(df['Name'])
print (df)
Name ID
0 Abc 123
1 BCD 123
2 Def 345
3 PQR.com 789
4 PQR.com 789
EDIT: If there are multiple valid strings per ID, it is more complicated:
print (df)
Name ID
0 Air Ordnance 1578013421
1 Air-Ordnance.com 1578013421
2 Garreett 1578013421
3 Garrett 1578013421
First get fuzz.ratio as in the solution above:
from fuzzywuzzy import fuzz
df1 = df.merge(df, on='ID').query('Name_x != Name_y')
df1['ratio'] = df1.apply(lambda x: fuzz.ratio(x['Name_x'], x['Name_y']), axis=1)
print (df1)
Name_x ID Name_y ratio
1 Air Ordnance 1578013421 Air-Ordnance.com 79
2 Air Ordnance 1578013421 Garreett 30
3 Air Ordnance 1578013421 Garrett 32
4 Air-Ordnance.com 1578013421 Air Ordnance 79
6 Air-Ordnance.com 1578013421 Garreett 25
7 Air-Ordnance.com 1578013421 Garrett 26
8 Garreett 1578013421 Air Ordnance 30
9 Garreett 1578013421 Air-Ordnance.com 25
11 Garreett 1578013421 Garrett 93
12 Garrett 1578013421 Air Ordnance 32
13 Garrett 1578013421 Air-Ordnance.com 26
14 Garrett 1578013421 Garreett 93
Then filter by threshold:
N = 50
df2 = df1[df1['ratio'].gt(N)]
print (df2)
Name_x ID Name_y ratio
1 Air Ordnance 1578013421 Air-Ordnance.com 79
4 Air-Ordnance.com 1578013421 Air Ordnance 79
11 Garreett 1578013421 Garrett 93
14 Garrett 1578013421 Garreett 93
But for more precision it is necessary to specify which strings are valid in the list L, then filter by it:
L = ['Air-Ordnance.com','Garrett']
df2 = df2.loc[df2['Name_x'].isin(L),['Name_x','Name_y','ID']].rename(columns={'Name_y':'Name'})
print (df2)
Name_x Name ID
4 Air-Ordnance.com Air Ordnance 1578013421
14 Garrett Garreett 1578013421
Finally, merge back to the original with a left join and replace the missing values:
df = df.merge(df2, on=['Name','ID'], how='left')
df['Name'] = df.pop('Name_x').fillna(df['Name'])
print (df)
Name ID
0 Air-Ordnance.com 1578013421
1 Air-Ordnance.com 1578013421
2 Garrett 1578013421
3 Garrett 1578013421
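One practical hedge for 16 million rows: the cross join above is quadratic in the size of each ID group. A minimal per-group sketch (my own addition, assuming a single canonical spelling dominates each ID) compares every name only against the longest one in its group instead of materializing all pairs:
from fuzzywuzzy import fuzz

def canonicalize(names, threshold=50):
    # take the longest name in the group as the canonical candidate,
    # then map every sufficiently similar name onto it
    longest = max(names, key=len)
    return names.map(lambda n: longest if fuzz.ratio(n, longest) > threshold else n)

df['Name'] = df.groupby('ID')['Name'].transform(canonicalize)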
