I searched and found a lot of similar questions, but none of them seemed to cover my case.
I have a pandas DataFrame, a joined table consisting of products and the countries in which they are sold.
It's 3000 rows and 50 columns in size.
I'm uploading a photo (only part of the df) of the current situation and of the expected result I want to achieve.
I want to pivot the 'Country name' column into separate columns, grouped by the 'Product code name'. Please note that the new country columns are not limited to a certain number of countries (some products have 3, some 40).
Thank you!
Use .cumcount() to count the number of countries that a product has.
Then use .pivot() to get your dataframe in the right shape:
import pandas as pd

df = pd.DataFrame({
    'Country': ['NL', 'Poland', 'Spain', 'Sweden', 'China', 'Egypt'],
    'Product Code': ['123', '123', '115', '115', '117', '118'],
    'Product Name': ['X', 'X', 'Y', 'Y', 'Z', 'W'],
})

df['cumcount'] = df.groupby(['Product Code', 'Product Name'])['Country'].cumcount() + 1

df_pivot = df.pivot(
    index=['Product Code', 'Product Name'],
    columns='cumcount',
    values='Country',
).add_prefix('country_')
Resulting dataframe:
cumcount country_1 country_2
Product Code Product Name
115 Y Spain Sweden
117 Z China NaN
118 W Egypt NaN
123 X NL Poland
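If a flat table is preferred over the MultiIndex result above, resetting the index is enough. A small sketch reusing the same sample frame:

```python
import pandas as pd

df = pd.DataFrame({
    'Country': ['NL', 'Poland', 'Spain', 'Sweden', 'China', 'Egypt'],
    'Product Code': ['123', '123', '115', '115', '117', '118'],
    'Product Name': ['X', 'X', 'Y', 'Y', 'Z', 'W'],
})
df['cumcount'] = df.groupby(['Product Code', 'Product Name'])['Country'].cumcount() + 1

flat = (df.pivot(index=['Product Code', 'Product Name'],
                 columns='cumcount',
                 values='Country')
          .add_prefix('country_')
          .reset_index())
flat.columns.name = None  # drop the leftover 'cumcount' axis name
print(flat)
```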
Try this:
df_out = df.set_index(['Product code',
                       'Product name',
                       df.groupby('Product code').cumcount() + 1]).unstack()
df_out.columns = [f'Country_{j}' for _, j in df_out.columns]
df_out.reset_index()
Output:
Product code Product name Country_1 Country_2 Country_3
0 AAA115 Y Sweden China NaN
1 AAA117 Z Egypt Greece NaN
2 AAA118 W France Italy NaN
3 AAA123 X Netherlands Poland Spain
Details:
Reshape the dataframe with set_index and unstack, using cumcount to create the country columns. Then flatten the MultiIndex header using a list comprehension.
I have a dataset with duplicates, triplicates and more, and I want to keep only one record of each unique id, merging the data. For example:
id  name  address  age  city
1   Alex  123,blv
1   Alex           13
3   Alex           24   Florida
1   Alex                Miami
Merging data using the id field:
Output:
id name address age city
1 Alex 123,blv 13 Miami
3 Alex 24 Florida
I've changed the code from this answer a bit.
Code to create the initial dataframe:
import pandas as pd
import numpy as np
d = {'id': [1, 1, 3, 1],
     'name': ["Alex", "Alex", "Alex", "Alex"],
     'address': ["123,blv", None, None, None],
     'age': [None, 13, 24, None],
     'city': [None, None, "Florida", "Miami"]
     }
df = pd.DataFrame(data=d, index=d["id"])
print(df)
Output:
id name address age city
1 1 Alex 123,blv NaN None
1 1 Alex None 13.0 None
3 3 Alex None 24.0 Florida
1 1 Alex None NaN Miami
Aggregation code:
def get_notnull(x):
    # return the non-null values of the group, or NaN if there are none
    if x.notnull().any():
        return x[x.notnull()]
    else:
        return np.nan

aggregation_functions = {'name': 'first',
                         'address': get_notnull,
                         'age': get_notnull,
                         'city': get_notnull
                         }
df = df.groupby(df['id']).aggregate(aggregation_functions)
print(df)
Output:
name address age city
id
1 Alex 123,blv 13.0 Miami
3 Alex NaN 24.0 Florida
(
    df
    .reset_index(drop=True)   # set a unique index for each record
    .drop('id', axis=1)       # exclude the 'id' column from processing
    .groupby(df['id'])        # group by 'id'
    .agg(
        # return the first non-NA/None value for each column
        lambda s: s.get(s.first_valid_index())
    )
    .reset_index()            # get back the 'id' value for each record
)
P.S. As an option:
df.replace([None, ''], pd.NA).groupby('id').first().reset_index()
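The reason the .first() option works: GroupBy.first() returns the first non-null value per column within each group, which is exactly the "merge the duplicate rows" behaviour wanted here. A minimal demo on the same data:

```python
import pandas as pd

df = pd.DataFrame({
    'id':      [1, 1, 3, 1],
    'name':    ['Alex', 'Alex', 'Alex', 'Alex'],
    'address': ['123,blv', None, None, None],
    'age':     [None, 13, 24, None],
    'city':    [None, None, 'Florida', 'Miami'],
})
# GroupBy.first() skips nulls, so each column keeps its first real value
merged = df.groupby('id', as_index=False).first()
print(merged)
```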
I have an exercise in which I need to turn two or more rows into one row if they have the same data in three columns.
import pandas as pd

substances = pd.DataFrame({'id': ['id_1', 'id_1', 'id_1', 'id_2', 'id_3'],
                           'part': ['1', '1', '2', '2', '3'],
                           'sub': ['paracetamolum', 'paracetamolum', 'ibuprofenum', 'dienogestum', 'etynyloestradiol'],
                           'strength': ['150', '50', '50', '20', '30'],
                           'unit': ['mg', 'mg', 'mg', 'mg', 'mcg'],
                           'other irrelevant columns for this task': ['sth1', 'sth2', 'sth3', 'sth4', 'sth5']
                           })
Now, provided that id, part and sub are the same, I am supposed to merge the rows into one, so the end result is:
id    part  strength  substance         unit
id_1  1     150 # 50  paracetamolum     mg
id_1  2     50        ibuprofenum       mg
id_2  2     20        dienogestum       mg
id_3  3     30        etynyloestradiol  mcg
The issue is that I have a problem joining these rows into one row so that the possible strengths show like '150 # 50'. I have tried something like this, but it is not going great:
substances = substances.groupby('id', 'part', 'sub', 'strength').id.apply(lambda x: str(substances['strength']) + ' # ' + str(next(substances['strength'])))
df = substances.groupby(['id', 'part', 'sub', 'unit']).agg({'strength': ' # '.join}).reset_index()
df = df[['id','part','strength', 'sub','unit']]
print(df)
output:
id part strength sub unit
0 id_1 1 150 # 50 paracetamolum mg
1 id_1 2 50 ibuprofenum mg
2 id_2 2 20 dienogestum mg
3 id_3 3 30 etynyloestradiol mcg
There are 8 companies in total and around 30-40 countries. I need to get a dataframe from which I can tell the total number of employees in each company by country.
Sounds like you want to use pandas' groupby feature. I'm not sure what type of data you have or what result you want, so here are some toy examples:
import pandas as pd

df = pd.DataFrame({'company': ["A", "A", "B"], 'country': ["USA", "USA", "USA"], 'employees': [10, 20, 50]})
dfg = df.groupby(['company', 'country'], as_index=False)['employees'].sum()
print(dfg)
# company country employees
# 0 A USA 30
# 1 B USA 50
df = pd.DataFrame({'company': ["A", "A", "A"], 'country': ["USA", "USA", "Japan"], 'employees': ['Art', 'Bob', 'Chris']})
dfg = df.groupby(['company', 'country'], as_index=False)['employees'].count()
print(dfg)
# company country employees
# 0 A Japan 1
# 1 A USA 2
I have a dataframe as follows
id Domain City
1 DM Pune
2 VS Delhi
I want to create a new column which will contain a tuple of the column values id & Domain,
e.g
id Domain City New_Col
1 DM Pune (1,DM)
2 VS Delhi (2,VS)
I know I can create it easily using apply & lambda as follows:
df['New_Col'] = df.apply(lambda r:tuple(r[bkeys]),axis=1) ##here bkeys = ['id','Domain']
However, this takes a lot of time for larger dataframes with > 100k records. Hence I want to use np.where like this:
df['New_Col'] = np.where(True, tuple(df[bkeys]), '')
But this doesn't work, it gives values like: ('id','Domain')
Any suggestions?
Try this:
df.assign(new_col = df[['id','Domain']].agg(tuple, axis=1))
Output:
id Domain City new_col
0 1 DM Pune (1, DM)
1 2 VS Delhi (2, VS)
Something or other is giving people a wrong idea of what np.where does. I've seen similar errors in other questions.
Let's make your dataframe:
In [2]: import pandas as pd
In [3]: df = pd.DataFrame([[1,'DM','Pune'],[2,'VS','Delhi']],columns=['id','Domain','City'])
In [4]: df
Out[4]:
id Domain City
0 1 DM Pune
1 2 VS Delhi
Your apply expression:
In [5]: bkeys = ['id','Domain']
In [6]: df.apply(lambda r:tuple(r[bkeys]),axis=1)
Out[6]:
0 (1, DM)
1 (2, VS)
dtype: object
What's happening here? apply is iterating over the rows of df; r is one row.
So the first row:
In [9]: df.iloc[0]
Out[9]:
id 1
Domain DM
City Pune
Name: 0, dtype: object
index with bkeys:
In [10]: df.iloc[0][bkeys]
Out[10]:
id 1
Domain DM
Name: 0, dtype: object
and make a tuple from that:
In [11]: tuple(df.iloc[0][bkeys])
Out[11]: (1, 'DM')
But what do we get when indexing the whole dataframe:
In [12]: df[bkeys]
Out[12]:
id Domain
0 1 DM
1 2 VS
In [15]: tuple(df[bkeys])
Out[15]: ('id', 'Domain')
np.where is a function; it is not an iterator. The interpreter evaluates each of its arguments, and passes them to the function.
In [16]: np.where(True, tuple(df[bkeys]), '')
Out[16]: array(['id', 'Domain'], dtype='<U6')
This is what you tried to assign to the new column.
In [17]: df
Out[17]:
id Domain City New_Col
0 1 DM Pune id
1 2 VS Delhi Domain
This assignment only works because the tuple has 2 elements, and df has 2 rows. Otherwise you'd get an error.
np.where is not a magical way of speeding up a dataframe apply. It's a way of creating an array of values which, if the right size, can be assigned to a dataframe column (series).
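For contrast, a quick sketch of what np.where is actually good at: picking element-wise between two alternatives based on a boolean condition (the derived column here is purely illustrative):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[1, 'DM', 'Pune'], [2, 'VS', 'Delhi']],
                  columns=['id', 'Domain', 'City'])
# np.where evaluates the condition element-wise and picks from the
# two alternatives; the result is an array the same length as df.
df['is_dm'] = np.where(df['Domain'] == 'DM', 'yes', 'no')
print(df)
```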
We could create a numpy array from the selected columns:
In [31]: df[bkeys].to_numpy()
Out[31]:
array([[1, 'DM'],
[2, 'VS']], dtype=object)
and from that get a list of lists, and assign that to a new column:
In [32]: df[bkeys].to_numpy().tolist()
Out[32]: [[1, 'DM'], [2, 'VS']]
In [33]: df['New_Col'] = _
In [34]: df
Out[34]:
id Domain City New_Col
0 1 DM Pune [1, DM]
1 2 VS Delhi [2, VS]
If you really want tuples, the sublists will have to be converted:
In [35]: [tuple(i) for i in df[bkeys].to_numpy().tolist()]
Out[35]: [(1, 'DM'), (2, 'VS')]
Another way of making a list of tuples (this works because array records convert to tuples):
In [42]: df[bkeys].to_records(index=False).tolist()
Out[42]: [(1, 'DM'), (2, 'VS')]
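A further alternative, not used above but a common idiom, is to zip the columns directly, which also avoids the row-wise apply:

```python
import pandas as pd

df = pd.DataFrame([[1, 'DM', 'Pune'], [2, 'VS', 'Delhi']],
                  columns=['id', 'Domain', 'City'])
bkeys = ['id', 'Domain']
# zip pairs up the column values without calling a Python
# function once per row the way apply does
df['New_Col'] = list(zip(*(df[k] for k in bkeys)))
print(df)
```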
I have a column called Users in a dataframe which doesn't have a uniform format. I am doing a data cleanup project, as the data looks unreadable.
company Users
A [{"Name":"Martin","Email":"name_1#email.com","EmpType":"Full"},{"Name":"Rick","Email":"name_2#email.com","Dept":"HR"}]
B [{"Name":"John","Email":"name_2#email.com","EmpType":"Full","Dept":"Sales" }]
I used the code below, which broke the dataframe down as follows:
df2 = df
df2 = df2.join(df['Users_config'].str.split('},{', expand=True).add_prefix('Users'))
company Users0 Users1
A "Name":"Martin","Email":"name_1#email.com","EmpType":"Full" "Name":"Rick","Email":"name_2#email.com","Dept":"HR"
B "Name":"John","Email":"name_2#email.com","EmpType":"Full","Dept":"Sales"
Further breaking down the above df on "," using the same code, I got this output:
Company Users01 Users02 Users03 Users10 Users11 Users12
1 "Name":"Martin" "Email":"name_1#email.com" "EmpType":"Full" "Name":"Rick" "Email":"name_2#email.com" "Dept":"HR"
2 "Name":"John" "Email":"name_2#email.com" "EmpType":"Full" "Dept":"Sales"
As this dataframe looks messy, I want to get the output below. I feel the best way to name the columns is to use the key from "Name":"Martin" itself; if we hardcode names using df.rename, the column names will get mismatched.
Company Name_1 Email_1 EmpType_1 Dept_1 Name_2 Email_2 Dept_2
1 Martin name_1#email.com Full Rick name_2#email.com "HR"
2 John name_2#email.com" Full Sales
Is there any way I can get the above output from the original dataframe?
Use:
import ast

df['Users'] = df['Users'].apply(ast.literal_eval)
d = df.explode('Users').reset_index(drop=True)
d = d.join(pd.DataFrame(d.pop('Users').tolist()))
d = d.set_index(['company', d.groupby('company').cumcount().add(1).astype(str)]).unstack()
d.columns = d.columns.map('_'.join)
Details:
First we use ast.literal_eval to evaluate the strings in the Users column, then use DataFrame.explode on the Users column to create a dataframe d.
print(d)
company Users
0 A {'Name': 'Martin', 'Email': 'name_1#email.com', 'EmpType': 'Full'}
1 A {'Name': 'Rick', 'Email': 'name_2#email.com', 'Dept': 'HR'}
2 B {'Name': 'John', 'Email': 'name_2#email.com', 'EmpType': 'Full', 'Dept': 'Sales'}
Create a new dataframe from the Users column in d and use DataFrame.join to join this new dataframe with d.
print(d)
company Name Email EmpType Dept
0 A Martin name_1#email.com Full NaN
1 A Rick name_2#email.com NaN HR
2 B John name_2#email.com Full Sales
Use DataFrame.groupby on column company then use groupby.cumcount to create a counter for each group, then use DataFrame.set_index to set the index of d as company + counter. Then use DataFrame.unstack to reshape the dataframe creating MultiIndex columns.
print(d)
Name Email EmpType Dept
1 2 1 2 1 2 1 2
company
A Martin Rick name_1#email.com name_2#email.com Full NaN NaN HR
B John NaN name_2#email.com NaN Full NaN Sales NaN
Finally use map along with .join to flatten the MultiIndex columns.
print(d)
Name_1 Name_2 Email_1 Email_2 EmpType_1 EmpType_2 Dept_1 Dept_2
company
A Martin Rick name_1#email.com name_2#email.com Full NaN NaN HR
B John NaN name_2#email.com NaN Full NaN Sales NaN
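Putting the whole answer together as a self-contained sketch (the sample frame is reconstructed from the question, so the exact values are illustrative):

```python
import ast
import pandas as pd

df = pd.DataFrame({
    'company': ['A', 'B'],
    'Users': [
        '[{"Name":"Martin","Email":"name_1#email.com","EmpType":"Full"},'
        '{"Name":"Rick","Email":"name_2#email.com","Dept":"HR"}]',
        '[{"Name":"John","Email":"name_2#email.com","EmpType":"Full","Dept":"Sales"}]',
    ],
})

# parse the literal strings into lists of dicts, one list per company
df['Users'] = df['Users'].apply(ast.literal_eval)
# one row per user, then expand each dict into its own columns
d = df.explode('Users').reset_index(drop=True)
d = d.join(pd.DataFrame(d.pop('Users').tolist()))
# number the users within each company and pivot them into columns
d = d.set_index(['company', d.groupby('company').cumcount().add(1).astype(str)]).unstack()
d.columns = d.columns.map('_'.join)
print(d.reset_index())
```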