How do we add dataframes with same id? - python-3.x

I'm a beginner learning data science. While working through the pandas material I ran into a task where I can't understand what is going wrong. Let me explain the problem.
I have three data frames:
gold = pd.DataFrame({'Country': ['USA', 'France', 'Russia'],
                     'Medals': [15, 13, 9]})
silver = pd.DataFrame({'Country': ['USA', 'Germany', 'Russia'],
                       'Medals': [29, 20, 16]})
bronze = pd.DataFrame({'Country': ['France', 'USA', 'UK'],
                       'Medals': [40, 28, 27]})
Here, I need to add up all the medals per country: the medal totals in one column and the country in another. When I added the frames the result was showing NaN, and even after filling the NaN values with zero I'm unable to get the desired output.
Code:
gold.set_index('Country', inplace = True)
silver.set_index('Country',inplace = True)
bronze.set_index('Country', inplace = True)
Total = silver.add(gold,fill_value = 0)
Total = bronze.add(silver,fill_value = 0)
Total = gold + silver + bronze
print(Total)
Actual Output:
         Medals
Country
France      NaN
Germany     NaN
Russia      NaN
UK          NaN
USA        72.0
Expected:
         Medals
Country
USA        72.0
France     53.0
UK         27.0
Russia     25.0
Germany    20.0
Let me know what is wrong.

Just do concat with groupby sum
pd.concat([gold,silver,bronze]).groupby('Country').sum()
Out[1306]:
         Medals
Country
France       53
Germany      20
Russia       25
UK           27
USA          72
Fixing your code
Your first two Total assignments are immediately overwritten, and the final Total = gold + silver + bronze uses plain +, which aligns on the index and returns NaN wherever a country is missing from any operand (only USA appears in all three frames, hence the lone 72.0). Chain the adds instead:
silver.add(gold, fill_value=0).add(bronze, fill_value=0)
if we expect floating point:
pd.concat([gold,silver,bronze]).groupby('Country').sum().astype(float)
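For completeness, a minimal sketch (using the question's gold and silver frames after their set_index('Country') calls) of why plain + produced NaN while .add with fill_value=0 does not:
print(gold + silver)                   # France and Germany come out NaN:
                                       # each is missing from one of the two frames
print(gold.add(silver, fill_value=0))  # fill_value=0 treats a missing country as 0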

# For a video solution of the code, copy-paste the following link into your browser:
# https://youtu.be/p0cnApQDotA
import numpy as np
import pandas as pd
# Defining the three dataframes indicating the gold, silver, and bronze medal counts
# of different countries
gold = pd.DataFrame({'Country': ['USA', 'France', 'Russia'],
                     'Medals': [15, 13, 9]})
silver = pd.DataFrame({'Country': ['USA', 'Germany', 'Russia'],
                       'Medals': [29, 20, 16]})
bronze = pd.DataFrame({'Country': ['France', 'USA', 'UK'],
                       'Medals': [40, 28, 27]})
# Set the index of the dataframes to 'Country' so that you can get the countrywise
# medal count
gold.set_index('Country', inplace = True)
silver.set_index('Country', inplace = True)
bronze.set_index('Country', inplace = True)
# Add the three dataframes and set the fill_value argument to zero to avoid getting
# NaN values
total = gold.add(silver, fill_value = 0).add(bronze, fill_value = 0)
# Sort the resultant dataframe in a descending order
total = total.sort_values(by = 'Medals', ascending = False)
# Print the sorted dataframe
print(total)

Related

Pandas: Merging rows into one

I have the following table:
Name  Age  Data_1  Data_2
Tom   10   Test
Tom   10           Foo
Anne  20           Bar
How can I merge these rows to get this output:
Name  Age  Data_1  Data_2
Tom   10   Test    Foo
Anne  20           Bar
I tried this code (and some related variants: agg, groupby on other fields, et cetera):
import pandas as pd
data = [['tom', 10, 'Test', ''], ['tom', 10, 1, 'Foo'], ['Anne', 20, '', 'Bar']]
df = pd.DataFrame(data, columns=['Name', 'Age', 'Data_1', 'Data_2'])
df = df.groupby("Name").sum()
print(df)
But I only get something like this:
        c2
Name
Anne   Foo
Tom    Bar
Just a groupby and a sum will do.
df.groupby(['Name','Age']).sum().reset_index()
   Name  Age Data_1 Data_2
0  Anne   20           Bar
1   tom   10   Test    Foo
Use this if the empty cells are NaN:
(df.set_index(['Name', 'Age'])
   .stack()
   .groupby(level=[0, 1, 2])
   .apply(''.join)
   .unstack()
   .reset_index()
)
Otherwise, add this line df.replace('', np.nan, inplace=True) before the code above.
# Output
   Name  Age Data_1 Data_2
0  Anne   20    NaN    Bar
1   Tom   10   Test    Foo
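Putting the two pieces together, a minimal end-to-end sketch; note the stray 1 in the question's sample data is replaced by an empty string here, since ''.join needs strings:
import numpy as np
import pandas as pd

data = [['tom', 10, 'Test', ''], ['tom', 10, '', 'Foo'], ['Anne', 20, '', 'Bar']]
df = pd.DataFrame(data, columns=['Name', 'Age', 'Data_1', 'Data_2'])

merged = (df.replace('', np.nan)         # empty strings -> NaN
            .set_index(['Name', 'Age'])
            .stack()                     # drops the NaN cells
            .groupby(level=[0, 1, 2])
            .apply(''.join)              # collapse the duplicate rows per cell
            .unstack()
            .reset_index())
print(merged)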

How to properly create a pandas dataframe with the given data?

I have the following experiment data:
experiment1_device = 'Dev2'
experiment1_values = [1, 2, 3, 4]
experiment1_timestamps = [1, 2, 3, 6]
experiment1_task = 'Oil Level'
experiment1_user = 'Sean'
experiment2_device = 'Dev1'
experiment2_values = [5, 6, 7, 8, 9]
experiment2_timestamps = [1, 2, 3, 4, 6]
experiment2_task = 'Ventilation'
experiment2_user = 'Martin'
experiment3_device = 'Dev1'
experiment3_values = [10, 11, 12, 13, 14]
experiment3_timestamps = [1, 2, 3, 4, 6]
experiment3_task = 'Ventilation'
experiment3_user = 'Sean'
Each experiment consists of:
A user who conducted the experiment
The task the user had to do
The device the user was using
A series of timestamps ...
and a series of data values which have been read at a certain timestamp, so len(experimentx_values) == len(experimentx_timestamps)
The data is currently given in the above format, so to speak as single variables, but I could change this format if needed. For example, it might be better to put everything in a dict or so.
The expected output format I would like to achieve is the following:
Timestamp, User and Task should be a MultiIndex, whereas the device is supposed to be the column name, and empty (grey) cells should just contain NaN.
I tried multiple approaches with pd.DataFrame.from_records but couldn't get the desired output format.
Any help is highly appreciated!
Since the data are stored in all of those different variables, it will be a lot of writing. However, you should try to store the results of each experiment in a DataFrame (directly from whatever outputs those values) and hold all of those DataFrames in a list, which will cut down on the variables you have floating around (see the sketch after the output below).
Given your variables, construct DataFrames as follows:
df1 = pd.DataFrame({'Timestamp': experiment1_timestamps,
                    'User': experiment1_user,
                    'Task': experiment1_task,
                    experiment1_device: experiment1_values})
df2 = pd.DataFrame({'Timestamp': experiment2_timestamps,
                    'User': experiment2_user,
                    'Task': experiment2_task,
                    experiment2_device: experiment2_values})
df3 = pd.DataFrame({'Timestamp': experiment3_timestamps,
                    'User': experiment3_user,
                    'Task': experiment3_task,
                    experiment3_device: experiment3_values})
Now join them together into a single DataFrame, setting the index to the columns you want. It looks like your output wants the cartesian product of all possibilities, so we'll reindex to get the fully NaN rows back in:
df = pd.concat([df1, df2, df3]).set_index(['Timestamp', 'User', 'Task']).sort_index()
idx = pd.MultiIndex.from_product([df.index.get_level_values(i).unique()
                                  for i in range(df.index.nlevels)])
df = df.reindex(idx)
                              Dev2  Dev1
Timestamp User   Task
1         Martin Ventilation   NaN   5.0
                 Oil Level     NaN   NaN
          Sean   Ventilation   NaN  10.0
                 Oil Level     1.0   NaN
2         Martin Ventilation   NaN   6.0
                 Oil Level     NaN   NaN
          Sean   Ventilation   NaN  11.0
                 Oil Level     2.0   NaN
3         Martin Ventilation   NaN   7.0
                 Oil Level     NaN   NaN
          Sean   Ventilation   NaN  12.0
                 Oil Level     3.0   NaN
4         Martin Ventilation   NaN   8.0
                 Oil Level     NaN   NaN
          Sean   Ventilation   NaN  13.0
                 Oil Level     NaN   NaN
6         Martin Ventilation   NaN   9.0
                 Oil Level     NaN   NaN
          Sean   Ventilation   NaN  14.0
                 Oil Level     4.0   NaN
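As promised above, a minimal sketch of the list-of-experiments pattern; the experiments list of dicts is a hypothetical container standing in for however your pipeline emits results:
experiments = [
    {'device': 'Dev2', 'values': [1, 2, 3, 4], 'timestamps': [1, 2, 3, 6],
     'task': 'Oil Level', 'user': 'Sean'},
    {'device': 'Dev1', 'values': [5, 6, 7, 8, 9], 'timestamps': [1, 2, 3, 4, 6],
     'task': 'Ventilation', 'user': 'Martin'},
    {'device': 'Dev1', 'values': [10, 11, 12, 13, 14], 'timestamps': [1, 2, 3, 4, 6],
     'task': 'Ventilation', 'user': 'Sean'},
]
frames = [pd.DataFrame({'Timestamp': e['timestamps'],
                        'User': e['user'],
                        'Task': e['task'],
                        e['device']: e['values']})
          for e in experiments]
df = pd.concat(frames).set_index(['Timestamp', 'User', 'Task']).sort_index()
From here the reindex trick above applies unchanged.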
Maybe there is a simpler way, but you can create sub-dataframes:
import pandas as pd
df1 = pd.DataFrame(data=[[1, 2, 3, 4],[1, 2, 3, 6]]).T
df1.columns = ['values','timestamps']
df1['dev'] = 'Dev2'
df1['user'] = 'Sean'
df1['task'] = 'Oil Level'
df2 = pd.DataFrame(data=[[5, 6, 7, 8, 9],[1, 2, 3, 4, 6]]).T
df2.columns = ['values','timestamps']
df2['dev'] = 'Dev1'
df2['user'] = 'Martin'
df2['task'] = 'Ventilation'
df3 = pd.DataFrame(data=[[10, 11, 12, 13, 14],[1, 2, 3, 4, 6]]).T
df3.columns = ['values','timestamps']
df3['dev'] = 'Dev1'
df3['user'] = 'Sean'
df3['task'] = 'Ventilation'
merge them into one:
df = df1.merge(df2, on=['timestamps','dev','user','task','values'], how='outer')
df = df.merge(df3, on=['timestamps','dev','user','task','values'], how='outer')
and create the aggregation:
piv = df.groupby(['timestamps','user','task','dev']).sum()
piv = piv.unstack() # for the columns dev1, dev2
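The groupby plus unstack pair can also be collapsed into a single pivot_table call; a sketch, assuming the merged df from above:
piv = df.pivot_table(index=['timestamps', 'user', 'task'],
                     columns='dev', values='values', aggfunc='sum')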

Can't assign value to cell in multiindex dataframe (assigning to copy / slice of df?)

I am trying to assign a value (the mean of values in another column) to a cell in a multi-index Pandas dataframe, over which I iterate to calculate means over a moving window in a different column. But when I try to assign the value, it doesn't change.
I am not used to working with multi-indexes and have solved several other problems but this one has me stumped for now...
Toy code that reproduces the problem:
tuples = [
    ('AFG', 1963), ('AFG', 1964), ('AFG', 1965), ('AFG', 1966), ('AFG', 1967), ('AFG', 1968),
    ('BRA', 1963), ('BRA', 1964), ('BRA', 1965), ('BRA', 1966), ('BRA', 1967), ('BRA', 1968)
]
index = pd.MultiIndex.from_tuples(tuples)
values = [[12, None], [0, None],
          [12, None], [0, 4],
          [12, 5], [0, 4],
          [12, 2], [0, 4],
          [12, 2], [0, 4],
          [1, 4], [7, 1]]
df = pd.DataFrame(values, columns=['Oil', 'Pop'], index=index)
lag = -2
lead = 0
indicator = 'Pop'
new_indicator = 'Mean_pop'
df[new_indicator] = np.nan
df
Gives:
          Oil  Pop  Mean_pop
AFG 1963   12  NaN       NaN
    1964    0  NaN       NaN
    1965   12  NaN       NaN
    1966    0  4.0       NaN
    1967   12  5.0       NaN
    1968    0  4.0       NaN
BRA 1963   12  2.0       NaN
    1964    0  4.0       NaN
    1965   12  2.0       NaN
    1966    0  4.0       NaN
    1967    1  4.0       NaN
    1968    7  1.0       NaN
Then to iterate over the df:
for country, country_df in df.groupby(level=0):
    oldestyear = country_df[indicator].first_valid_index()[1]
    latestyear = country_df[indicator].last_valid_index()[1]
    for t in range(oldestyear, latestyear+1):
        print(country, oldestyear, latestyear, t)
        print(" For", country, ", calculate mean over", t+lag, "to", t+lead,
              "and add to row for year", t)
        dftt = country_df.loc[(country, t+lag):(country, t+lead)]
        print(dftt[indicator])
        mean = dftt[indicator].mean(axis=0)
        print("mean for", indicator, "in", country, "during", t+lag, "to", t+lead, "is", mean)
        df.loc[country, t][new_indicator] = mean
Diagnostic output not pasted, but df looks the same after iterating over it and I get the following warning on some iterations:
A value is trying to be set on a copy of a slice from a DataFrame
See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
if sys.path[0] == '':
Any pointers will be greatly appreciated.
I think it is as easy as setting the last line to:
df.loc[(country, t), new_indicator] = mean
The original df.loc[country, t][new_indicator] = mean is chained indexing: the first .loc lookup can return a copy, so the assignment lands on that copy instead of on df, which is exactly what the SettingWithCopy warning is telling you. A single .loc call with the full (row, column) key writes into df directly.
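For what it's worth, the moving-window mean itself can be computed without the explicit loops; a sketch, assuming one row per consecutive year within each country (edge behavior at the first and last valid years differs slightly from the loop):
# a trailing window of 3 rows covers t-2 .. t, matching lag = -2, lead = 0
df[new_indicator] = (df.groupby(level=0)[indicator]
                       .rolling(window=3, min_periods=1)
                       .mean()
                       .droplevel(0))  # drop the duplicated country level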

Manipulate values in pandas DataFrame columns based on matching IDs from another DataFrame

I have two dataframes like the following examples:
import pandas as pd
import numpy as np
df = pd.DataFrame({'a': ['20', '50', '100'], 'b': [1, np.nan, 1],
                   'c': [np.nan, 1, 1]})
df_id = pd.DataFrame({'b': ['50', '4954', '93920', '20'],
                      'c': ['123', '100', '6', np.nan]})
print(df)
     a    b    c
0   20  1.0  NaN
1   50  NaN  1.0
2  100  1.0  1.0
print(df_id)
       b    c
0     50  123
1   4954  100
2  93920    6
3     20  NaN
For each identifier in df['a'], I want to null the value in df['b'] if there is no matching identifier in any row in df_id['b']. I want to do the same for column df['c'].
My desired result is as follows:
result = pd.DataFrame({'a': ['20', '50', '100'], 'b': [1, np.nan, np.nan],
                       'c': [np.nan, np.nan, 1]})
print(result)
     a    b    c
0   20  1.0  NaN
1   50  NaN  NaN   # df_id['c'] did not contain '50'
2  100  NaN  1.0   # df_id['b'] did not contain '100'
My attempt to do this is here:
for i, letter in enumerate(['b','c']):
    df[letter] = (df.apply(lambda x: x[letter] if x['a']
                  .isin(df_id[letter].tolist()) else np.nan, axis = 1))
The error I get:
AttributeError: ("'str' object has no attribute 'isin'", 'occurred at index 0')
This is in Python 3.5.2, Pandas version 20.1
You can solve your problem using this instead:
for letter in ['b','c']:  # dropped the enumerate since the index isn't needed here; add it back if the rest of your code uses it
    df[letter] = df.apply(lambda row: row[letter] if row['a'] in df_id[letter].tolist() else np.nan, axis=1)
Just replace isin with in.
The problem is that when you use apply on df, x will represent df rows, so when you select x['a'] you're actually selecting one element.
However, isin is applicable for series or list-like structures which raises the error so instead we just use in to check if that element is in the list.
Hope that was helpful. If you have any questions please ask.
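As a side note, the row-wise apply can be avoided entirely; a vectorized sketch of the same logic, using the df and df_id from the question:
for letter in ['b', 'c']:
    # keep the value only where 'a' has a match in df_id[letter], otherwise NaN
    df[letter] = df[letter].where(df['a'].isin(df_id[letter]))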
Adapting a hard-to-find answer from Pandas New Column Calculation Based on Existing Columns Values:
for i, letter in enumerate(['b','c']):
    mask = df['a'].isin(df_id[letter])
    name = letter + '_new'
    # for some reason, df[letter] = df.loc[mask, letter] does not work
    df.loc[mask, name] = df.loc[mask, letter]
    df[letter] = df[name]
    del df[name]
This isn't pretty, but seems to work.
If you have a bigger DataFrame and performance is important to you, you can first build a mask DataFrame and then apply it to your dataframe.
First create the mask:
mask = df_id.apply(lambda x: df['a'].isin(x))
       b      c
0   True  False
1   True  False
2  False   True
This can be applied to the original dataframe:
df.iloc[:,1:] = df.iloc[:,1:].mask(~mask, np.nan)
     a    b    c
0   20  1.0  NaN
1   50  NaN  NaN
2  100  NaN  1.0

Split out if > value, divide, add value to column - Python/Pandas

import pandas as pd
df = pd.DataFrame([['Dog', 10, 6], ['Cat', 7 ,5]], columns=('Name','Amount','Day'))
Name  Amount  Day
Dog       10    6
Cat        7    5
I would like to make the DataFrame look like the following:
Name  Amount  Day
Dog1     6.0    6
Dog2     2.5    7
Dog3     1.5    8
Cat        7    5
First step: For any Amount > 8, split into 3 different rows, with new name of 'Name1', 'Name2','Name3'
Second step:
For Dog1, 60% of Amount, Day = Day.
For Dog2, 25% of Amount, Day = Day + 1.
For Dog3, 15% of Amount, Day = Day + 2.
Keep Cat the same because Cat Amount < 8
Any ideas? Any help would be appreciated.
df = pd.DataFrame([['Dog', 10, 6], ['Cat', 7, 5]], columns=('Name', 'Amount', 'Day'))

template = pd.DataFrame([
    ['1', .6, 0],
    ['2', .25, 1],
    ['3', .15, 2]
], columns=df.columns)

def apply_template(r, t):
    t = t.copy()
    t['Name'] = t['Name'].radd(r['Name'])  # prepend the row's name: '1' -> 'Dog1'
    t['Amount'] *= r['Amount']             # turn each ratio into a share of the amount
    t['Day'] += r['Day']                   # turn each offset into a shifted day
    return t

pd.concat([apply_template(r, template) for _, r in df.query('Amount > 8').iterrows()]
          + [df.query('Amount <= 8')], ignore_index=True)
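Running this on the sample frame should yield something like the following; Dog is split 60/25/15 across days 6 to 8, and Cat passes through untouched:
   Name  Amount  Day
0  Dog1     6.0    6
1  Dog2     2.5    7
2  Dog3     1.5    8
3   Cat     7.0    5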
