I have an exercise in which I need to turn several rows into one row if they have the same data in three columns.
import pandas as pd

substances = pd.DataFrame({'id': ['id_1', 'id_1', 'id_1', 'id_2', 'id_3'],
                           'part': ['1', '1', '2', '2', '3'],
                           'sub': ['paracetamolum', 'paracetamolum', 'ibuprofenum', 'dienogestum', 'etynyloestradiol'],
                           'strength': ['150', '50', '50', '20', '30'],
                           'unit': ['mg', 'mg', 'mg', 'mg', 'mcg'],
                           'other irrelevant columns for this task': ['sth1', 'sth2', 'sth3', 'sth4', 'sth5']
                           })
Now, provided that id, part and substance are the same, I am supposed to merge the rows into one, so the end result is:
id    part  strength    substance         unit
id_1  1     '150 # 50'  paracetamolum     mg
id_1  2     50          ibuprofenum       mg
id_2  2     20          dienogestum       mg
id_3  3     30          etynyloestradiol  mcg
The issue is that I am having trouble joining these rows into one row to show the possible strengths like '150 # 50'. I have tried something like this, but it is not going well:
substances = substances.groupby('id', 'part', 'sub', 'strength').id.apply(lambda x: str(substances['strength']) + ' # ' + str(next(substances['strength'])))
# group on the columns that must match and join the strength values with ' # '
df = substances.groupby(['id','part','sub','unit']).agg({'strength':' # '.join}).reset_index()
# restore the desired column order
df = df[['id','part','strength', 'sub','unit']]
print(df)
output:
id part strength sub unit
0 id_1 1 150 # 50 paracetamolum mg
1 id_1 2 50 ibuprofenum mg
2 id_2 2 20 dienogestum mg
3 id_3 3 30 etynyloestradiol mcg
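Note that ' # '.join works here only because strength is stored as strings. If the column were numeric, a cast would be needed first; a minimal sketch (my variant, not part of the answer above):
# assumption: strength holds numbers, so cast to str before joining
df = (substances
      .assign(strength=substances['strength'].astype(str))
      .groupby(['id', 'part', 'sub', 'unit'], as_index=False)
      .agg({'strength': ' # '.join})
      [['id', 'part', 'strength', 'sub', 'unit']])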
I have a DataFrame like this:
id  Description   Price      Unit
1   Test Only     1254       12
2   Data test     Fresher    4
3   Sample        3569       1
4   Sample Onces  Code test
5   Sample        245        2
I want to move the value from the Price column into the Description column whenever it is not an integer, and leave Price as NaN in that row. There is no specific word to match; the only rule is that if the Price column holds a non-integer value, that string moves to the Description column.
I have already tried pandas replace and concat, but they didn't work.
Desired output is like this:
id  Description  Price  Unit
1   Test Only    1254   12
2   Fresher             4
3   Sample       3569   1
4   Code test
5   Sample       245    2
This should work:
import numpy as np
import pandas as pd

# data
df = pd.DataFrame({'id': [1, 2, 3, 4, 5],
                   'Description': ['Test Only', 'Data test', 'Sample', 'Sample Onces', 'Sample'],
                   'Price': ['1254', 'Fresher', '3569', 'Code test', '245'],
                   'Unit': [12, 4, 1, np.nan, 2]})
# convert the Price column to numeric, coercing errors to NaN
price = pd.to_numeric(df.Price, errors='coerce')
# for rows where Price is not numeric, replace Description with those values
df.Description = df.Description.mask(price.isna(), df.Price)
# assign the numeric prices back to the Price column
df.Price = price
df
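For reference, running the snippet above should display roughly this (a quick check; Price and Unit become floats because of the NaN values):
   id Description   Price  Unit
0   1   Test Only  1254.0  12.0
1   2     Fresher     NaN   4.0
2   3      Sample  3569.0   1.0
3   4   Code test     NaN   NaN
4   5      Sample   245.0   2.0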
Use:
#convert values to numeric
price = pd.to_numeric(df['Price'], errors='coerce')
#test for missing values
m = price.isna()
#shift only the matched rows
df.loc[m, ['Description','Price']] = df.loc[m, ['Description','Price']].shift(-1, axis=1)
print (df)
   id Description Price  Unit
0   1   Test Only  1254  12.0
1   2     Fresher   NaN   4.0
2   3      Sample  3569   1.0
3   4   Code test   NaN   NaN
4   5      Sample   245   2.0
If you need numeric values in the output Price column:
df = df.assign(Price=price)
print (df)
   id Description   Price  Unit
0   1   Test Only  1254.0  12.0
1   2     Fresher     NaN   4.0
2   3      Sample  3569.0   1.0
3   4   Code test     NaN   NaN
4   5      Sample   245.0   2.0
I define a pandas DataFrame containing several deposit/withdrawal rows for different owners. I want to add a total row for each owner, totalizing the deposits/withdrawals as well as the yield amounts generated by each capital amount.
Here's my code:
import pandas as pd

OWNER = 'OWNER'
DEPWITHDR = 'DEP/WITHDR'
DATEFROM = 'DATE FROM'
DATETO = 'DATE TO'
CAPITAL = 'CAPITAL'
YIELD = 'YIELD AMT'
TOTAL = 'TOTAL'

df = pd.DataFrame({
    OWNER: 2*['JOE'] + 3*['ROB'],
    DEPWITHDR: [10000, 20000, 4000, 20000, -8000],
    CAPITAL: [10000, 30000, 4000, 24000, 16000],
    DATEFROM: ['2021-01-01', '2021-01-02', '2021-01-01', '2021-01-03', '2021-01-04'],
    DATETO: ['2021-01-01', '2021-01-05', '2021-01-02', '2021-01-03', '2021-01-05'],
    YIELD: [100, 1200, 80, 240, 320]
})
print('SOURCE DATAFRAME\n')
print(df)
print()

newDf = pd.DataFrame(columns=[OWNER, DEPWITHDR, CAPITAL, DATEFROM, DATETO, YIELD])
currentOwner = df.loc[1, OWNER]
# using groupby to compute the two column totals
dfTotal = df.groupby([OWNER]).agg({DEPWITHDR: 'sum', YIELD: 'sum'}).reset_index()
totalIndex = 0
# deactivating SettingWithCopyWarning caused by totalRow[OWNER] += ' total'
pd.set_option('mode.chained_assignment', None)

for index, row in df.iterrows():
    if currentOwner == row[OWNER]:
        newDf = newDf.append({OWNER: row[OWNER],
                              DEPWITHDR: row[DEPWITHDR],
                              CAPITAL: row[CAPITAL],
                              DATEFROM: row[DATEFROM],
                              DATETO: row[DATETO],
                              YIELD: row[YIELD]}, ignore_index=True)
    else:
        totalRow = dfTotal.loc[totalIndex]
        totalRow[OWNER] += ' total'
        newDf = newDf.append(totalRow, ignore_index=True)
        totalIndex += 1
        newDf = newDf.append({OWNER: '',
                              DEPWITHDR: '',
                              CAPITAL: '',
                              DATEFROM: '',
                              DATETO: '',
                              YIELD: ''}, ignore_index=True)
        newDf = newDf.append({OWNER: row[OWNER],
                              DEPWITHDR: row[DEPWITHDR],
                              CAPITAL: row[CAPITAL],
                              DATEFROM: row[DATEFROM],
                              DATETO: row[DATETO],
                              YIELD: row[YIELD]}, ignore_index=True)
        currentOwner = row[OWNER]

totalRow = dfTotal.loc[totalIndex]
totalRow[OWNER] += ' total'
newDf = newDf.append(totalRow, ignore_index=True)

print('TARGET DATAFRAME\n')
print(newDf.fillna(''))
My question is: what is a better, more pandas-friendly way to obtain the desired result?
You can use groupby and concat:
df_total = pd.concat((
    df,
    df.replace({o: o + ' total' for o in df[OWNER].unique()})
      .groupby(OWNER)
      .agg({DEPWITHDR: sum, YIELD: sum})
      .reset_index()
)).fillna('').reset_index().sort_values([OWNER, DATEFROM, DATETO])
In detail:
df.replace({o: o + ' total' for o in df[OWNER].unique()}): replace each occurrence of every owner's name with the name itself plus the string ' total' (e.g., 'JOE' -> 'JOE total'), so that the result of the groupby will have those values in the OWNER column.
groupby(OWNER).agg({DEPWITHDR: sum, YIELD: sum}): get the sum of the DEPWITHDR and YIELD columns per owner.
pd.concat(...).fillna('').reset_index().sort_values([OWNER, DATEFROM, DATETO]): concatenate the original DataFrame with the totals, then sort rows by OWNER, then DATEFROM, then DATETO, so that the total row of each owner is placed after the rows belonging to that owner (because it ends with ' total') and, moreover, the rows are chronologically sorted by DATEFROM and DATETO.
Here is df_total:
   index      OWNER DEP/WITHDR CAPITAL   DATE FROM     DATE TO YIELD AMT
0      0        JOE      10000   10000  2021-01-01  2021-01-01       100
1      1        JOE      20000   30000  2021-01-02  2021-01-05      1200
5      0  JOE total      30000                                      1300
2      2        ROB       4000    4000  2021-01-01  2021-01-02        80
3      3        ROB      20000   24000  2021-01-03  2021-01-03       240
4      4        ROB      -8000   16000  2021-01-04  2021-01-05       320
6      1  ROB total      16000                                       640
IMHO, I'd create a separate DataFrame for each owner, with only his/her data, plus a summary DataFrame with the totals per owner. But maybe, for your use case, this is the best solution.
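A minimal sketch of that alternative (my illustration, not code from the question), reusing the column-name constants defined above:
per_owner = {owner: grp for owner, grp in df.groupby(OWNER)}  # one DataFrame per owner
summary = df.groupby(OWNER, as_index=False).agg({DEPWITHDR: 'sum', YIELD: 'sum'})
print(per_owner['JOE'])  # only JOE's rows
print(summary)           # one totals row per owner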
I searched and indeed found a lot of similar questions, but none of them seemed to cover my case.
I have a pandas DataFrame which is a joined table consisting of products and the countries in which they are sold.
It is 3000 rows and 50 columns in size.
I'm attaching a photo (only part of the df) of the current situation and the expected result I want to achieve.
I want to transpose the 'Country name' column into columns, grouped by the 'Product code'/'Product name' pair. Please note that the new country columns are not limited to a certain number of countries (some products have 3, some 40).
Thank you!
Use .cumcount() to count the number of countries that a product has.
Then use .pivot() to get your dataframe in the right shape:
df = pd.DataFrame({
    'Country': ['NL', 'Poland', 'Spain', 'Sweden', 'China', 'Egypt'],
    'Product Code': ['123', '123', '115', '115', '117', '118'],
    'Product Name': ['X', 'X', 'Y', 'Y', 'Z', 'W'],
})
df['cumcount'] = df.groupby(['Product Code', 'Product Name'])['Country'].cumcount() + 1
df_pivot = df.pivot(
    index=['Product Code', 'Product Name'],
    columns='cumcount',
    values='Country',
).add_prefix('country_')
Resulting dataframe:
cumcount                  country_1 country_2
Product Code Product Name
115          Y                Spain    Sweden
117          Z                China       NaN
118          W                Egypt       NaN
123          X                   NL    Poland
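Note that passing a list of columns to index in DataFrame.pivot requires a reasonably recent pandas (1.1+, if I recall correctly). To get a flat frame back afterwards (a small follow-up to the snippet above):
df_pivot = df_pivot.reset_index()  # 'Product Code'/'Product Name' become ordinary columns again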
Try this:
df_out = df.set_index(['Product code',
                       'Product name',
                       df.groupby('Product code').cumcount() + 1]).unstack()
df_out.columns = [f'Country_{j}' for _, j in df_out.columns]
df_out.reset_index()
Output:
Product code Product name Country_1 Country_2 Country_3
0 AAA115 Y Sweden China NaN
1 AAA117 Z Egypt Greece NaN
2 AAA118 W France Italy NaN
3 AAA123 X Netherlands Poland Spain
Details:
Reshape the dataframe with set_index and unstack, using cumcount to create the country columns. Then flatten the MultiIndex header using a list comprehension.
If 'Days' is greater than e.g. 10 and the corresponding 'Year' is a leap year, then reduce 'Days' by 1, only in that particular row. I tried some operations but couldn't do it. I am new to pandas. I appreciate any help.
sample data:
import pandas as pd

data = [['1', '2005'], ['2', '2006'], ['3', '2008'], ['50', '2009'], ['70', '2008']]
df = pd.DataFrame(data, columns=['Days','Year'])
I want 'Days' in row 5 to become 69 (2008 is a leap year) and everything else to remain the same.
In [98]: import calendar

In [99]: data = [['1', '2005'], ['2', '2006'], ['3', '2008'], ['50', '2009'], ['70', '2008']]
    ...: df = pd.DataFrame(data, columns=['Days', 'Year'])

In [100]: df = df.astype(int)

In [102]: df["New_Days"] = df.apply(lambda x: x["Days"] - 1 if (x["Days"] > 10 and calendar.isleap(x["Year"])) else x["Days"], axis=1)

In [103]: df
Out[103]:
   Days  Year  New_Days
0     1  2005         1
1     2  2006         2
2     3  2008         3
3    50  2009        50
4    70  2008        69
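For larger frames, a vectorized version of the same rule avoids the row-wise apply; a sketch (my alternative, not from the answer above):
import calendar
# rows with Days above the threshold in a leap year get reduced by 1
mask = (df['Days'] > 10) & df['Year'].map(calendar.isleap)
df['New_Days'] = df['Days'] - mask.astype(int)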
I have a data frame like this:
import pandas

d = {'name': ['john', 'john', 'john', 'Tim', 'Tim', 'Tim', 'Bob', 'Bob'],
     'Prod': ['101', '102', '101', '501', '505', '301', '302', '302'],
     'Qty': ['5', '4', '1', '3', '5', '4', '1', '3']}
df = pandas.DataFrame(data=d)
What I want to do is create a new id variable. Whenever a name (say, john) appears for the first time, this id will be equal to 1; for other occurrences of the same name (john), this id variable will be 0. This will be done for all the other names in the data. How do I go about doing that?
NOTE: If someone knows SAS, there you can sort your data by name and then use first.name:
"if first.variable = 1 then id = 1"
For the first occurrence of a name, first.name = 1; for any repeat occurrence of the same name, first.name = 0. I am trying to replicate the same in Python.
So far I have tried pandas groupby with first() and also numpy.where(), but couldn't make either work. Any fresh perspective will be appreciated.
You can use cumcount:
s=df.groupby(['Prod','name']).cumcount().add(1)
df['counter']=s.mask(s.gt(1),0)
df
Out[1417]:
Prod Qty name counter
0 101 5 john 1
1 102 4 john 1
2 101 1 john 0
3 501 3 Tim 1
4 505 5 Tim 1
5 301 4 Tim 1
6 302 1 Bob 1
7 302 3 Bob 0
Update:
s=df.groupby(['name']).cumcount().add(1).le(1).astype(int)
s
Out[1421]:
0 1
1 0
2 0
3 1
4 0
5 0
6 1
7 0
dtype: int32
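To attach that series as the flag column (a small follow-up, assuming the s from the Update above):
df['counter'] = s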
A faster option:
df.loc[df.name.drop_duplicates().index,'counter']=1
df.fillna(0)
Out[1430]:
Prod Qty name counter
0 101 5 john 1.0
1 102 4 john 0.0
2 101 1 john 0.0
3 501 3 Tim 1.0
4 505 5 Tim 0.0
5 301 4 Tim 0.0
6 302 1 Bob 1.0
7 302 3 Bob 0.0
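The same first-occurrence flag can also be written with duplicated, arguably the closest translation of SAS's first.name (a one-liner sketch, not part of the answer above):
df['counter'] = (~df['name'].duplicated()).astype(int)  # 1 on the first occurrence of each name, 0 on repeats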
We can just work directly with your dictionary d and loop through it to create the new entry.
d = {'name': ['john', 'john', 'john', 'Tim', 'Tim', 'Tim', 'Bob', 'Bob'],
     'Prod': ['101', '102', '101', '501', '505', '301', '302', '302'],
     'Qty': ['5', '4', '1', '3', '5', '4', '1', '3']}

names = set()  # store names that have already appeared
id = []
for i in d['name']:
    if i in names:  # if it has appeared, add 0
        id.append(0)
    else:
        id.append(1)  # add 1 and note that it has appeared
        names.add(i)
d['id'] = id  # add the new entry to your dictionary
df = pandas.DataFrame(data=d)
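For reference, the resulting frame should look like this (a quick check):
print(df)
   name Prod Qty  id
0  john  101   5   1
1  john  102   4   0
2  john  101   1   0
3   Tim  501   3   1
4   Tim  505   5   0
5   Tim  301   4   0
6   Bob  302   1   1
7   Bob  302   3   0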