I have a sheet whose header looks like this:
'''
+--------------+------------------+----------------+--------------+---------------+
| usa_alaska | usa_california | france_paris | italy_roma | france_lyon |
|--------------+------------------+----------------+--------------+---------------|
+--------------+------------------+----------------+--------------+---------------+
'''
df = pd.DataFrame([], columns='usa_alaska usa_california france_paris italy_roma france_lyon'.split())
I want to split the headers into country and region so that when I select france, I get paris and lyon as columns.
Create a MultiIndex from your column names:
Suppose this dataframe:
>>> df
usa_alaska usa_california france_paris italy_roma france_lyon
0 1 2 3 4 5
df.columns = df.columns.str.split('_', expand=True)
df = df.sort_index(axis=1)
Output:
>>> df
france italy usa
lyon paris roma alaska california
0 5 3 4 1 2
>>> df['france']
   lyon  paris
0     5      3
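For reference, a minimal self-contained sketch of the whole transformation (the sample values are assumed):
import pandas as pd
df = pd.DataFrame([[1, 2, 3, 4, 5]],
                  columns='usa_alaska usa_california france_paris italy_roma france_lyon'.split())
# Split each column name on '_' into a (country, region) MultiIndex
df.columns = df.columns.str.split('_', expand=True)
df = df.sort_index(axis=1)
print(df['france'])  # only the lyon/paris columns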
Suppose I have two data frames, DF1 and DF2,
no1 quantity no2
abc 3 123
pqr 5 NaN
and
no1 serial
abc 10
pqr 20
I want to create the following output DF3 and DF4
no1 quantity
abc 3
123 3
pqr 5
and
no1 serial
abc 10
123 10
pqr 20
Kindly help me create DF3. My idea was to repeat the rows of DF1 where no2 is not NaN and then drop the no2 column. DF4 can be created with pd.merge, but the serial number of 123 must come out as 10, which is the part I cannot get right.
For df3 you can use the append(), to_frame() and assign() methods:
df3 = df1['no1'].append(df1['no2']).to_frame(name='no1').assign(quantity=df1['quantity']).reset_index(drop=True).dropna()
Output of df3:
no1 quantity
0 abc 3
1 pqr 5
2 123.0 3
For df4 you can use the merge(), groupby() and ffill() methods:
df4 = df3.merge(df2, on='no1', how='left').groupby('quantity').ffill()
Output of df4:
no1 serial
0 abc 10.0
1 pqr 20.0
2 123.0 10.0
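Note that Series.append was deprecated in pandas 1.4 and removed in pandas 2.0; an equivalent sketch with pd.concat (same df1 as above) would be:
import pandas as pd
# Stack no1 and no2 into one column; assign aligns quantity on the repeated index
df3 = (pd.concat([df1['no1'], df1['no2']])
         .to_frame(name='no1')
         .assign(quantity=df1['quantity'])
         .reset_index(drop=True)
         .dropna())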
I have many dataframes with the same structure: the same number of rows and the same column names.
How can I gather all the columns with the same name, with the name replaced, into a single new dataframe?
df1 = pd.DataFrame({'Name':['Tom', 'nick', 'krish', 'jack'], 'Age':[20, 21, 19, 18]})
df2 = pd.DataFrame({'Name':['Wendy', 'Frank', 'krish', 'Lucy'], 'Age':[20, 21, 19, 18]})
print(df1)
print(df2)
I want:
df3 = pd.DataFrame({'Name1':['Wendy', 'Frank', 'krish', 'Lucy'], 'Name2':['Tom', 'nick', 'krish', 'jack']})
print(df3)
Output:
df1:
Name Age
0 Tom 20
1 nick 21
2 krish 19
3 jack 18
df2:
Name Age
0 Wendy 20
1 Frank 21
2 krish 19
3 Lucy 18
df3:
Name1 Name2
0 Wendy Tom
1 Frank nick
2 krish krish
3 Lucy jack
df1 = df1.drop(columns='Age')
df2 = df2.drop(columns='Age')
df3 = df2.join(df1, lsuffix='1', rsuffix='2')
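Without the suffixes, join raises a ValueError because both frames still share the Name column; putting df2 on the left makes its names become Name1, as the desired output requires.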
You can concat the two DataFrames together along axis=1 in a list comprehension. Use .add_suffix with enumerate to get the numbers appended to the column names.
pd.concat([d[['Name']].add_suffix(str(i + 1)) for i, d in enumerate([df2, df1])], axis=1)
Name1 Name2
0 Wendy Tom
1 Frank nick
2 krish krish
3 Lucy jack
Or, if you want to do this for many similar columns at once, concat with keys to create a MultiIndex on the columns, then collapse the MultiIndex by joining the column names in a list comprehension.
l = [df2, df1]
df3 = pd.concat(l, axis=1, keys=range(1, len(l) + 1))
df3.columns = [f'{y}{x}' for x,y in df3.columns]
# Name1 Age1 Name2 Age2
#0 Wendy 20 Tom 20
#1 Frank 21 nick 21
#2 krish 19 krish 19
#3 Lucy 18 jack 18
df3.filter(like='Name')
Name1 Name2
0 Wendy Tom
1 Frank nick
2 krish krish
3 Lucy jack
I have the two dataframes below and I'd like to merge them to bring Id into df1. However, using merge I cannot get the Id when a name occurs more than once. df2 has unique names; df1 and df2 differ in rows and columns. My code is below:
df1: Name Region
0 P Asia
1 Q Eur
2 R Africa
3 S NA
4 R Africa
5 R Africa
6 S NA
df2: Name Id
0 P 1234
1 Q 1244
2 R 1233
3 S 1111
code:
x = df1.assign(temp1=df1.groupby('Name').cumcount())
y = df2.assign(temp1=df2.groupby('Name').cumcount())
xy = x.merge(y, on=['Name', 'temp1'], how='left').drop(columns=['temp1'])
the output is:
df1: Name Region Id
0 P Asia 1234
1 Q Eur 1244
2 R Africa 1233
3 S NA 1111
4 R Africa NaN
5 R Africa NaN
6 S NA NaN
How do I get the Id filled in for all of these duplicate names?
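Since df2 already has exactly one row per Name, the cumcount de-duplication is what suppresses the matches for the repeated rows; a plain many-to-one left merge broadcasts the Id to every duplicate. A minimal sketch (data assumed from the question):
import pandas as pd
df1 = pd.DataFrame({'Name': ['P', 'Q', 'R', 'S', 'R', 'R', 'S'],
                    'Region': ['Asia', 'Eur', 'Africa', 'NA', 'Africa', 'Africa', 'NA']})
df2 = pd.DataFrame({'Name': ['P', 'Q', 'R', 'S'],
                    'Id': [1234, 1244, 1233, 1111]})
# df2 is unique on Name, so every duplicate row in df1 receives the same Id
xy = df1.merge(df2, on='Name', how='left')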
I am new to Pandas and I have a dataset that looks something like this.
s_name Time p_name qty
A 12/01/2019 ABC 1
A 12/01/2019 ABC 1
A 12/01/2019 DEF 2
A 12/01/2019 DEF 2
A 12/01/2019 FGH 0
B 13/02/2019 ABC 3
B 13/02/2019 DEF 1
B 13/02/2019 DEF 1
B 13/03/2019 ABC 3
B 13/03/2019 FGH 0
I am trying to group by s_name and find the sum of qty for each unique p_name in a month, but only display the p_name with the top two quantities. Below is an example of what I want the final output to look like.
s_name Time p_name qty
A 01 DEF 4
A 01 ABC 2
B 02 ABC 3
B 02 DEF 2
B 03 ABC 3
B 03 FGH 0
Do you have any ideas? I have been stuck on this for quite a while, so any help is appreciated.
Create a month column using dt, then group by s_name and month and apply a function to each group: group it by p_name, sum qty, sort_values descending, and take the first two rows with head:
df.Time = pd.to_datetime(df.Time, format='%d/%m/%Y')
df['month'] = df.Time.dt.month
df_f = df.groupby(['s_name', 'month']).apply(
    lambda g: g.groupby('p_name').qty.sum()
               .sort_values(ascending=False)
               .head(2)
).reset_index()
df_f
# s_name month p_name qty
# 0 A 1 DEF 4
# 1 A 1 ABC 2
# 2 B 2 ABC 3
# 3 B 2 DEF 2
# 4 B 3 ABC 3
# 5 B 3 FGH 0
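An apply-free variant of the same idea (a sketch, assuming the month column created above): aggregate first, then sort and keep the top two rows per group with groupby().head:
out = (df.groupby(['s_name', 'month', 'p_name'], as_index=False).qty.sum()
         .sort_values(['s_name', 'month', 'qty'], ascending=[True, True, False])
         .groupby(['s_name', 'month'])
         .head(2))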
I am new to Pandas myself. I am going to attempt to answer your question.
See this code.
from io import StringIO
import pandas as pd
columns = "s_name Time p_name qty"
# Create dataframe from text.
df = pd.read_csv(
    StringIO(
        f"""{columns}
A 12/01/2019 ABC 1
A 12/01/2019 ABC 1
A 12/01/2019 DEF 2
A 12/01/2019 DEF 2
A 12/01/2019 FGH 0
B 13/02/2019 ABC 3
B 13/02/2019 DEF 1
B 13/02/2019 DEF 1
B 13/03/2019 ABC 3
B 13/03/2019 FGH 0"""
    ),
    sep=" ",
)
S_NAME, TIME, P_NAME, QTY = columns.split()
MONTH = "month"
# Convert the TIME col to datetime types.
df.Time = pd.to_datetime(df.Time, dayfirst=True)
# Create a month column with zfilled strings.
df[MONTH] = df.Time.apply(lambda x: str(x.month).zfill(2))
# Group
group = df.groupby(by=[S_NAME, P_NAME, MONTH])
gdf = (
    group[[QTY]].sum()  # sum only qty; avoids trying to sum the datetime Time column
    .sort_values(by=[S_NAME, MONTH, QTY], ascending=False)
    .reset_index()
)
gdf.groupby([S_NAME, MONTH]).head(2).sort_values(by=[S_NAME, MONTH]).reset_index(drop=True)
Is this the result you expected?
Given the following dataframe:
id|address|sell_price|market_price|status|start_date|end_date
1|7552 Atlantic Lane|1170787.3|1463484.12|finished|2019/8/2|2019/10/1
1|7552 Atlantic Lane|1137782.02|1422227.52|finished|2019/8/2|2019/10/1
2|888 Foster Street|1066708.28|1333385.35|finished|2019/8/2|2019/10/1
2|888 Foster Street|1871757.05|1416757.05|finished|2019/10/14|2019/10/15
2|888 Foster Street|NaN|763744.52|current|2019/10/12|2019/10/13
3|5 Pawnee Avenue|NaN|928366.2|current|2019/10/10|2019/10/11
3|5 Pawnee Avenue|NaN|2025924.16|current|2019/10/10|2019/10/11
3|5 Pawnee Avenue|NaN|4000000|forward|2019/10/9|2019/10/10
3|5 Pawnee Avenue|2236138.9|1788938.9|finished|2019/10/8|2019/10/9
4|916 W. Mill Pond St.|2811026.73|1992026.73|finished|2019/9/30|2019/10/1
4|916 W. Mill Pond St.|13664803.02|10914803.02|finished|2019/9/30|2019/10/1
4|916 W. Mill Pond St.|3234636.64|1956636.64|finished|2019/9/30|2019/10/1
5|68 Henry Drive|2699959.92|NaN|failed|2019/10/8|2019/10/9
5|68 Henry Drive|5830725.66|NaN|failed|2019/10/8|2019/10/9
5|68 Henry Drive|2668401.36|1903401.36|finished|2019/12/8|2019/12/9
#copy above data and run below code to reproduce dataframe
df = pd.read_clipboard(sep='|')
I would like to group by id and address and calculate mean_ratio and result_count based on the following conditions:
mean_ratio: grouped by id and address, the mean of the ratio over rows whose status is finished and whose start_date falls in 2019-09 or 2019-10
result_count: grouped by id and address, the count of rows whose status is either finished or failed and whose start_date falls in 2019-09 or 2019-10
The desired output will like this:
id address mean_ratio result_count
0 1 7552 Atlantic Lane NaN 0
1 2 888 Foster Street 1.32 1
2 3 5 Pawnee Avenue 1.25 1
3 4 916 W. Mill Pond St. 1.44 3
4 5 68 Henry Drive NaN 2
I have tried so far:
# convert date
df[['start_date', 'end_date']] = df[['start_date', 'end_date']].apply(lambda x: pd.to_datetime(x, format = '%Y/%m/%d'))
# calculate ratio
df['ratio'] = round(df['sell_price']/df['market_price'], 2)
To filter rows whose start_date falls within 2019-09 or 2019-10:
L = [pd.Period('2019-09'), pd.Period('2019-10')]
c = ['start_date']
df = df[np.logical_or.reduce([df[x].dt.to_period('m').isin(L) for x in c])]
To filter rows whose status is finished or failed, I use:
mask = df['status'].str.contains('finished|failed')
df[mask]
But I don't know how to combine these pieces to get the final result. Thanks for your help in advance.
I think you need GroupBy.agg, but because some groups are excluded entirely (like id=1), add them back with DataFrame.join against all unique id/address pairs stored in df2, and finally replace the missing values in the result_count column:
df2 = df[['id','address']].drop_duplicates()
print (df2)
id address
0 1 7552 Atlantic Lane
2 2 888 Foster Street
5 3 5 Pawnee Avenue
9 4 916 W. Mill Pond St.
12 5 68 Henry Drive
df[['start_date', 'end_date']] = df[['start_date', 'end_date']].apply(lambda x: pd.to_datetime(x, format = '%Y/%m/%d'))
df['ratio'] = round(df['sell_price']/df['market_price'], 2)
L = [pd.Period('2019-09'), pd.Period('2019-10')]
c = ['start_date']
mask = df['status'].str.contains('finished|failed')
mask1 = np.logical_or.reduce([df[x].dt.to_period('m').isin(L) for x in c])
df = df[mask1 & mask]
df1 = df.groupby(['id', 'address']).agg(mean_ratio=('ratio','mean'),
result_count=('ratio','size'))
df1 = df2.join(df1, on=['id','address']).fillna({'result_count': 0})
print (df1)
id address mean_ratio result_count
0 1 7552 Atlantic Lane NaN 0.0
2 2 888 Foster Street 1.320000 1.0
5 3 5 Pawnee Avenue 1.250000 1.0
9 4 916 W. Mill Pond St. 1.436667 3.0
12 5 68 Henry Drive NaN 2.0
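Since the join introduces NaN, result_count comes back as float; if an integer count is preferred, a final cast (an optional touch, not in the original) restores it:
df1 = df1.astype({'result_count': 'int64'})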
Some helpers
import numpy as np
import pandas as pd

def mean_ratio(idf):
    # keep rows in Sep/Oct 2019 that have a computable ratio
    idf = idf[
        (idf['start_date'].between('2019-09-01', '2019-10-31')) &
        (idf['mean_ratio'].notnull())
    ]
    return np.round(idf['mean_ratio'].mean(), 2)

def result_count(idf):
    # count finished/failed rows in Sep/Oct 2019
    idf = idf[
        (idf['status'].isin(['finished', 'failed'])) &
        (idf['start_date'].between('2019-09-01', '2019-10-31'))
    ]
    return idf.shape[0]
# We can calculate `mean_ratio` beforehand
df['mean_ratio'] = df['sell_price'] / df['market_price']
df = df.astype({'start_date': 'datetime64[ns]', 'end_date': 'datetime64[ns]'})
# Group the df
g = df.groupby(['id', 'address'])
mean_ratio = g.apply(mean_ratio).to_frame('mean_ratio')
result_count = g.apply(result_count).to_frame('result_count')
# Final result
pd.concat((mean_ratio, result_count), axis=1)
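If the goal is to match the layout of the desired output, a final reset_index (an optional finishing touch, not part of the original answer) flattens the group keys back into columns:
result = pd.concat((mean_ratio, result_count), axis=1).reset_index()
print(result)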