Got a tricky situation. I tried my best via pivot and other methods but gave up. Please help if possible.
For each medicine column, I'd like to find the row where the value is 1 and replace it with the Date from that row.
After that mapping, the 'Date' field is no longer needed, so I am OK with deleting it.
My sample dataset:
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'Patient': ['John', 'John', 'John', 'Smith', 'Smith', 'Smith'],
                    'Date': [20200101, 20200102, 20200105, 20220101, 20220102, 20220105],
                    'Ibrufen': [np.nan, np.nan, 1, np.nan, np.nan, 1],
                    'Tylenol': [1, np.nan, np.nan, 1, np.nan, np.nan],
                    })
My desired output:
df2 = pd.DataFrame({'Patient': ['John', 'Smith'],
                    'Ibrufen': ['20200105', '20220105'],
                    'Tylenol': ['20200101', '20220101'],
                    'Steroid': ['20200102', '20220102'],
                    })
A possible solution, based on the idea of first creating an auxiliary column containing, for each row, the corresponding medicine (a row where neither Ibrufen nor Tylenol is 1 is taken to be Steroid):
df1['aux'] = df1.apply(lambda x:
                       'Ibrufen' if x['Ibrufen'] == 1 else
                       'Tylenol' if x['Tylenol'] == 1 else
                       'Steroid', axis=1)
(df1.pivot(index='Patient', columns='aux', values='Date')
.reset_index()
.rename_axis(None, axis=1))
Output:
Patient Ibrufen Steroid Tylenol
0 John 20200105 20200102 20200101
1 Smith 20220105 20220102 20220101
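As a side note, the row-wise apply can be replaced with the vectorized np.select; a minimal sketch, equivalent to the lambda above:

df1['aux'] = np.select(
    [df1['Ibrufen'] == 1, df1['Tylenol'] == 1],  # conditions checked in order
    ['Ibrufen', 'Tylenol'],                      # value picked for the first matching condition
    default='Steroid')                           # fallback when neither column is 1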
I have the DataFrame t below:
import pandas as pd
t = pd.DataFrame(data=[['AFG', 'Afghanistan', 38928341],
                       ['CHE', 'Switzerland', 8654618],
                       ['SMR', 'San Marino', 33938]],
                 columns=['iso_code', 'location', 'population'])
g = t.groupby('location')
g.size()
I can see that each group has only one record, which is expected.
However, if I run the code below, it doesn't raise any error message:
g.first(10)
It shows:
population
location
Afghanistan 38928341
San Marino 33938
Switzerland 8654618
My understanding is that first(n) for a group returns the nth record of that group, but each of my location groups has only one record - so how did pandas give me that result?
Thanks
I think you're looking for g.nth(10).
g.first(10) is NOT doing what you think it is. The first (optional) parameter of first is numeric_only and takes a boolean, so you're actually running g.first(numeric_only=True) as bool(10) evaluates to True.
After reading the comments from mozway and Henry Ecker/sammywemmy, I finally got it.
t = pd.DataFrame(data=[['AFG', 'Afghanistan', 38928341, 'A1'],
                       ['CHE', 'Switzerland', 8654618, 'C1'],
                       ['SMR', 'San Marino', 33938, 'S1'],
                       ['AFG', 'Afghanistan', 38928342, 'A2'],
                       ['AFG', 'Afghanistan', 38928343, 'A3']],
                 columns=['iso_code', 'location', 'population', 'code'])
g = t.groupby('location')
Then
g.nth(0)
g.nth(1)
g.first(True)
g.first(False)
g.first(min_count=2)

show the difference.
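Briefly, what each call does (a sketch of the behavior; the exact output shape varies across pandas versions):

g.nth(0)              # the row at position 0 within each group; no aggregation
g.nth(1)              # only groups with at least two rows (Afghanistan) appear
g.first(True)         # positional numeric_only=True: non-numeric columns are dropped
g.first(False)        # numeric_only=False: all columns are kept
g.first(min_count=2)  # NaN wherever a group has fewer than 2 non-null values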
I have the following DataFrame:
import pandas as pd

data = {'Customer_ID': ['123', '2', '1010', '123'],
        'Date_Create': ['12/08/2010', '04/10/1998', '27/05/2010', '12/08/2010'],
        'Purchase': [1, 1, 0, 1]}

df = pd.DataFrame(data, columns=['Customer_ID', 'Date_Create', 'Purchase'])
I want to perform this query:
df_2 = df[['Customer_ID','Date_Create','Purchase']].groupby(['Customer_ID'],
as_index=False).sum().sort_values(by='Purchase', ascending=False)
The objective of this query is to sum all purchases (a boolean field) and output a DataFrame with three columns: 'Customer_ID', 'Date_Create', 'Purchase'.
The problem is that the field Date_Create is not in the query: it contains duplicates, since the creation date of the account does not change.
How can I solve it?
Thanks
If I'm understanding it correctly and your source data has some duplicates, there's a function specifically for this: DataFrame.drop_duplicates().
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.drop_duplicates.html
To only consider some columns in the duplicate check, use subset:
df2 = df.drop_duplicates(subset=['Customer_ID','Date_Create'])
You can add the column Date_Create to the groupby if it holds the same value per Customer_ID:
(df.groupby(['Customer_ID','Date_Create'], as_index=False)['Purchase']
.sum()
.sort_values(by='Purchase', ascending=False))
If not, use some aggregation function - e.g. GroupBy.first for the first date per group:
(df.groupby('Customer_ID')
.agg(Purchase = ('Purchase', 'sum'), Date_Create= ('Date_Create', 'first'))
.reset_index()
.sort_values(by='Purchase', ascending=False))
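For the sample data above, either approach should yield something like this (rows ordered by Purchase descending; index details may differ):

  Customer_ID Date_Create  Purchase
          123  12/08/2010         2
            2  04/10/1998         1
         1010  27/05/2010         0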
I quite often write a function to return different dataframes based on the parameters I enter. Here's an example dataframe:
import numpy as np
import pandas as pd

np.random.seed(1111)
df = pd.DataFrame({
    'Category': np.random.choice(['Group A', 'Group B', 'Group C', 'Group D'], 10000),
    'Sub-Category': np.random.choice(['X', 'Y', 'Z'], 10000),
    'Sub-Category-2': np.random.choice(['G', 'F', 'I'], 10000),
    'Product': np.random.choice(['Product 1', 'Product 2', 'Product 3'], 10000),
    'Units_Sold': np.random.randint(1, 100, size=10000),
    'Dollars_Sold': np.random.randint(100, 1000, size=10000),
    # pd.util.testing is deprecated in newer pandas; it generates random strings here
    'Customer': np.random.choice(pd.util.testing.rands_array(10, 25, dtype='str'), 10000),
    'Date': np.random.choice(pd.date_range('1/1/2016', '12/31/2018', freq='M'), 10000)})
I then created a function to perform sub-totals for me like this:
def some_fun(DF1, agg_column, myList=[], *args):
    # For each i, replace the trailing grouping keys with the literal
    # '[Total]', aggregate, and concatenate: this yields one subtotal
    # level per pass, plus the fully detailed rows on the last pass.
    y = pd.concat(
        [DF1.assign(**{x: '[Total]' for x in myList[i:]})
            .groupby(myList)
            .agg(sumz=(agg_column, 'sum'))
         for i in range(1, len(myList) + 1)]
    ).sort_index().unstack(0)
    return y
I then write out lists that I'll pass as arguments to the function:
list_one = [pd.Grouper(key='Date',freq='A'),'Category','Product']
list_two = [pd.Grouper(key='Date',freq='A'),'Category','Sub-Category','Sub-Category-2']
list_three = [pd.Grouper(key='Date',freq='A'),'Sub-Category','Product']
I then have to run each list through my function creating new dataframes:
df1 = some_fun(df,'Units_Sold',list_one)
df2 = some_fun(df,'Dollars_Sold',list_two)
df3 = some_fun(df,'Units_Sold',list_three)
I then use a function to write each of these dataframes to an Excel worksheet. This is just an example - I perform this same exercise 10+ times.
My question: is there a better way to perform this task than writing out df1, df2, df3 with the function information applied? Should I be looking at using a dictionary or some other data type to do this more pythonically with a function?
A dictionary would be my first choice:
variations = [('Units_Sold', list_one), ('Dollars_Sold', list_two),
              ..., ('Title', some_list)]
df_variations = {}
for i, v in enumerate(variations):
    name = v[0]
    data = v[1]
    df_variations[i] = some_fun(df, name, data)
You might further consider setting the keys to unique / helpful titles for the variations, going beyond something like 'Units_Sold', which isn't unique in your case.
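For example (hypothetical key names, just to illustrate the idea):

df_variations['units_by_category_product'] = some_fun(df, 'Units_Sold', list_one)
df_variations['dollars_by_subcategories'] = some_fun(df, 'Dollars_Sold', list_two)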
IIUC, as Thomas has suggested, we can use a dictionary to parse through your data; with some minor modifications to your function, the dictionary can hold all the required data and be passed through to your function.
The idea is to pass two kinds of keys: the list of columns and the arguments to your pd.Grouper call.
data_dict = {
"Units_Sold": {"key": "Date", "freq": "A"},
"Dollars_Sold": {"key": "Date", "freq": "A"},
"col_list_1": ["Category", "Product"],
"col_list_2": ["Category", "Sub-Category", "Sub-Category-2"],
"col_list_3": ["Sub-Category", "Product"],
}
def some_fun(dataframe, agg_col, dictionary, column_list, *args):
key = dictionary[agg_col]["key"]
frequency = dictionary[agg_col]["freq"]
myList = [pd.Grouper(key=key, freq=frequency), *dictionary[column_list]]
y = (
pd.concat(
[
dataframe.assign(**{x: "[Total]" for x in myList[i:]})
.groupby(myList)
.agg(sumz=(agg_col, "sum"))
for i in range(1, len(myList) + 1)
]
)
.sort_index()
.unstack(0)
)
return y
Test.
df1 = some_fun(df,'Units_Sold',data_dict,'col_list_3')
print(df1)
sumz
Date 2016-12-31 2017-12-31 2018-12-31
Sub-Category Product
X Product 1 18308 17839 18776
Product 2 18067 19309 18077
Product 3 17943 19121 17675
[Total] 54318 56269 54528
Y Product 1 20699 18593 18103
Product 2 18642 19712 17122
Product 3 17701 19263 20123
[Total] 57042 57568 55348
Z Product 1 19077 17401 19138
Product 2 17207 21434 18817
Product 3 18405 17300 17462
[Total] 54689 56135 55417
[Total] [Total] 166049 169972 165293
As you want to automate the writing of the 10+ worksheets, we can again do that with a dictionary call over your function:
matches = {'Units_Sold': ['col_list_1','col_list_3'],
'Dollars_Sold' : ['col_list_2']}
Then a simple for loop writes all the frames to separate sheets of a single Excel file; change this to match your required behavior.
writer = pd.ExcelWriter('finished_excel_file.xlsx')
for key,value in matches.items():
for items in value:
dataframe = some_fun(df,key,data_dict,items)
dataframe.to_excel(writer,f'{key}_{items}')
writer.save()
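A small aside, not part of the original answer: in newer pandas versions writer.save() is deprecated (and later removed), so using ExcelWriter as a context manager is the safer pattern; it closes the file automatically:

with pd.ExcelWriter('finished_excel_file.xlsx') as writer:
    for key, value in matches.items():
        for items in value:
            some_fun(df, key, data_dict, items).to_excel(writer, sheet_name=f'{key}_{items}')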
I have two dataframes and I would like to create a new one, which will include all the unique columns of the two source dataframes and the aggregation of the common columns.
(The two sample DataFrames and the desired result were shown as images in the original post.)
All the column indexes should match in order to be aggregated.
I have written the following code:
df_all = pd.DataFrame  # note: this assigns the class itself, not an instance; pd.DataFrame() was probably intended
for dfColumn in df_1:
if dfColumn in df_2.columns:
df_all[dfColumn] = df_1.loc[:, dfColumn].add(df_2.loc[:, dfColumn])
else:
df_all[dfColumn] = df_1[dfColumn]
for dfColumn in df_2:
if dfColumn not in df_all.columns:
df_all[dfColumn] = df_2[dfColumn]
However, I get an error on the following line:
df_all[dfColumn] = df_1.loc[:, dfColumn].add(df_2.loc[:, dfColumn])
when I am trying to assign the value to df_all[dfColumn]
All the different possibilities you have with Python drive me crazy, but I cannot find one that makes this work.
Thanks for your help and time.
Actually, I fixed it with just the following:
df_all = pd.concat([df_1, df_2], axis=1)
df_all = df_all.groupby(level=[0, 1, 2], axis=1).sum()
Is there a way to replace level=[0, 1, 2] with something like level=df_all.columns.levels ?
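One way to avoid hard-coding the level numbers (a sketch; note that groupby with axis=1 is deprecated in newer pandas, matching the code above) is to derive the list from the index itself:

df_all = df_all.groupby(level=list(range(df_all.columns.nlevels)), axis=1).sum()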
I am attempting to create a dataframe where the first column is a list of tokens and where additional columns of information can be added. However pandas will not allow a list of tokens to be added as one column.
So the code looks as below:
import numpy as np
import pandas as pd

array1 = ['two', 'sample', 'statistical', 'inferences', 'includes']
array2 = ['references', 'please', 'see', 'next', 'page', 'the', 'material', 'of', 'these']
array3 = ['time', 'student', 'interest', 'and', 'lecturer', 'preference', 'other', 'topics']
## initialise list of token rows (avoid shadowing the built-in name `list`)
token_lists = []
token_lists.append(array1)
token_lists.append(array2)
token_lists.append(array3)

## create DataFrame
numberOfRows = len(token_lists)
df = pd.DataFrame(index=np.arange(0, numberOfRows), columns=('data', 'diversity'))
df.iloc[0] = token_lists[0]
the error message reads
ValueError: cannot copy sequence with size 6 to array axis with dimension 2
Any insight into how I can better achieve creating a dataframe and updating columns would be appreciated.
Thanks
OK, so the answer was fairly simple; posting it for posterity.
When adding a list as a cell value, I needed to include the column name and position, so the code looks like below.
df.data[0] = array1
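A possibly cleaner alternative (a sketch, not part of the original answer) is to build the column of lists directly at construction time, which avoids the row-size check entirely:

df = pd.DataFrame({'data': [array1, array2, array3],  # one list per row
                   'diversity': np.nan})              # scalar broadcasts to every row
df.at[0, 'data']  # ['two', 'sample', 'statistical', 'inferences', 'includes']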