Python: create new column and copy value from other row which is a swap of current row - python-3.x

I have a dataframe which has 3 columns:
import pandas as pd
d = {'A': ['left', 'right', 'east', 'west', 'south', 'north'], 'B': ['right', 'left', 'west', 'east', 'north', 'south'], 'VALUE': [0, 1, 2, 3, 4, 5]}
df = pd.DataFrame(d)
Dataframe looks like this:
A B VALUE
left right 0
right left 1
east west 2
west east 3
south north 4
north south 5
I am trying to create a new column VALUE_2 which should contain the value from the swapped row in the same Dataframe.
E.g. the swap of (left, right) is (right, left): its VALUE is 1, so VALUE_2 for (left, right) should be 1 and VALUE_2 for (right, left) should be 0, like this:
A B VALUE VALUE_2
left right 0 1
right left 1 0
east west 2 3
west east 3 2
south north 4 5
north south 5 4
I tried:
for row_num, record in df.iterrows():
    A = df['A'][row_num]
    B = df['B'][row_num]
    if pd.Series([record['A'] == B, record['B'] == A]).all():
        df['VALUE_2'] = df['VALUE']
I'm stuck here; inputs will be highly appreciated.

Use Series.map with a Series created by set_index:
df['VALUE_2'] = df['A'].map(df.set_index('B')['VALUE'])
print (df)
A B VALUE VALUE_2
0 left right 0 1
1 right left 1 0
2 east west 2 3
3 west east 3 2
4 south north 4 5
5 north south 5 4
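
The same lookup can also be written with a plain dict; a minimal sketch, assuming each value of A appears exactly once in column B so the mapping is unambiguous:
# Build a swap lookup from column B to VALUE, then map column A through it.
# Assumption: every A value occurs exactly once in B.
lookup = dict(zip(df['B'], df['VALUE']))
df['VALUE_2'] = df['A'].map(lookup)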

Just a more verbose answer:
import pandas as pd
d = {'A': ['left', 'right', 'east', 'west', 'south', 'north'], 'B': ['right', 'left', 'west', 'east', 'north', 'south'], 'VALUE': [0, 1, 2, 3, 4, 5]}
df = pd.DataFrame(d)
pdf = pd.DataFrame([])
for idx, item in df.iterrows():
    # Find the position of the row whose B equals this row's A
    indx = list(df['B']).index(str(df['A'][idx]))
    # Collect the VALUE of that swapped row
    # (DataFrame.append was removed in pandas 2.0, so pd.concat is used here)
    pdf = pd.concat([pdf, pd.DataFrame({'VALUE_2': df['VALUE'].iloc[indx]}, index=[0])],
                    ignore_index=True)
print(pdf)
data = pd.concat([df, pdf], axis=1)
print(data)
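
For reference, data then matches the result of the map approach above:
#        A      B  VALUE  VALUE_2
# 0   left  right      0        1
# 1  right   left      1        0
# 2   east   west      2        3
# 3   west   east      3        2
# 4  south  north      4        5
# 5  north  south      5        4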

Related

How can I sum columns in a DataFrame for each date in time series data

Here's the example:
df = pd.DataFrame({'Country': ['United States', 'China', 'Italy', 'spain'],
                   '2020-01-01': [0, 2, 1, 0],
                   '2020-01-02': [1, 0, 1, 2],
                   '2020-01-03': [0, 3, 2, 0]})
df
I want to sum the values of the columns by date so that each next column holds the accumulated value, which means 2020-01-02 gets a new value of (2020-01-01 + 2020-01-02), and so on.
Convert the Country column to the index with DataFrame.set_index and use DataFrame.cumsum across rows with axis=1:
df = df.set_index('Country').cumsum(axis=1)
print (df)
2020-01-01 2020-01-02 2020-01-03
Country
United States 0 1 1
China 2 2 5
Italy 1 2 4
spain 0 2 2
Or select all columns except the first with DataFrame.iloc before cumsum:
df.iloc[:, 1:] = df.iloc[:, 1:].cumsum(axis=1)
print (df)
Country 2020-01-01 2020-01-02 2020-01-03
0 United States 0 1 1
1 China 2 2 5
2 Italy 1 2 4
3 spain 0 2 2
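
Both variants accumulate left to right, so they assume the date columns are already in chronological order. A minimal sketch that sorts them first (assumption: every non-Country column parses as a date):
# Sort the date columns chronologically before accumulating
df = df.set_index('Country')
df = df[sorted(df.columns, key=pd.to_datetime)]
df = df.cumsum(axis=1)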

Find out the percentage of duplicates

I have the following data:
id date A Area Price Hol
0 1 2019-01-01 No 80 200 No
1 2 2019-01-02 Yes 100 300 Yes
2 3 2019-01-03 Yes 100 300 Yes
3 4 2019-01-04 No 50 100 No
4 5 2019-01-05 No 20 50 No
5 1 2019-01-01 No 80 200 No
I want to find duplicates (rows with the same id).
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
df = pd.DataFrame({'id': [1, 2, 3, 4, 5, 1],
                   'date': ['2019-01-01', '2019-01-02', '2019-01-03', '2019-01-04',
                            '2019-01-05', '2019-01-01'],
                   'A': ['No', 'Yes', 'Yes', 'No', 'No', 'No'],
                   'Area': [80, 100, 100, 50, 20, 80],
                   'Price': [200, 300, 300, 100, 50, 200],
                   'Hol': ['No', 'Yes', 'Yes', 'No', 'No', 'No']})
df['date'] = pd.to_datetime(df['date'])
fig, ax = plt.subplots(figsize=(15, 7))
df.groupby(['A', 'Area', 'Price', 'Hol'])['id'].value_counts().plot(ax=ax)
I can see that I have one duplicate (for id 1 , all the entries are the same)
Now, I want to find out what percentage those duplicates represent in the whole dataset.
I can't find a way to express this, since I am already using value_counts() in order to find the duplicates and I can't do something like:
df.groupby(['A', 'Area', 'Price', 'Hol'])['id'].value_counts().size()
percentage = (test / test.groupby(level=0).sum()) * 100
I believe you need DataFrame.duplicated with Series.value_counts:
percentage = df.duplicated(keep=False).value_counts(normalize=True) * 100
print (percentage)
False 66.666667
True 33.333333
dtype: float64
Is duplicated what you need?
df.duplicated(keep=False).mean()
Out[107]: 0.3333333333333333
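
If you want just the duplicate share as one number, a small sketch combining both ideas:
# Fraction of rows that belong to a fully duplicated group, as a percentage
pct_dupes = df.duplicated(keep=False).mean() * 100
print(f"{pct_dupes:.1f}% of the rows are duplicates")  # 33.3% here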

How do we add dataframes with the same id?

I'm a beginner in data science learning. I've gone through the pandas topic and found a task where I'm unable to understand what is wrong. Let me explain the problem.
I have three data frames:
gold = pd.DataFrame({'Country': ['USA', 'France', 'Russia'],
                     'Medals': [15, 13, 9]})
silver = pd.DataFrame({'Country': ['USA', 'Germany', 'Russia'],
                       'Medals': [29, 20, 16]})
bronze = pd.DataFrame({'Country': ['France', 'USA', 'UK'],
                       'Medals': [40, 28, 27]})
Here, I need to add all the medals into one column, with the country in another. When I added them it was showing NaN, so I filled the NaN with zero values, but I'm still unable to get the desired output.
Code:
gold.set_index('Country', inplace = True)
silver.set_index('Country',inplace = True)
bronze.set_index('Country', inplace = True)
Total = silver.add(gold,fill_value = 0)
Total = bronze.add(silver,fill_value = 0)
Total = gold + silver + bronze
print(Total)
Actual Output:
Medals
Country
France NaN
Germany NaN
Russia NaN
UK NaN
USA 72.0
Expected:
Medals
Country
USA 72.0
France 53.0
UK 27.0
Russia 25.0
Germany 20.0
Let me know what is wrong.
Just do concat with groupby and sum:
pd.concat([gold,silver,bronze]).groupby('Country').sum()
Out[1306]:
Medals
Country
France 53
Germany 20
Russia 25
UK 27
USA 72
Fixing your code: each assignment to Total overwrites the previous one, and the plain + in the last line aligns on the index and yields NaN for any country missing from one of the frames. Chain add with fill_value=0 instead:
silver.add(gold, fill_value=0).add(bronze, fill_value=0)
If we expect floating point output:
pd.concat([gold, silver, bronze]).groupby('Country').sum().astype(float)
# For a video solution of the code, copy-paste the following link into your browser:
# https://youtu.be/p0cnApQDotA
import numpy as np
import pandas as pd
# Defining the three dataframes indicating the gold, silver, and bronze medal counts
# of different countries
gold = pd.DataFrame({'Country': ['USA', 'France', 'Russia'],
                     'Medals': [15, 13, 9]})
silver = pd.DataFrame({'Country': ['USA', 'Germany', 'Russia'],
                       'Medals': [29, 20, 16]})
bronze = pd.DataFrame({'Country': ['France', 'USA', 'UK'],
                       'Medals': [40, 28, 27]})
# Set the index of the dataframes to 'Country' so that you can get the countrywise
# medal count
gold.set_index('Country', inplace = True)
silver.set_index('Country', inplace = True)
bronze.set_index('Country', inplace = True)
# Add the three dataframes and set the fill_value argument to zero to avoid getting
# NaN values
total = gold.add(silver, fill_value = 0).add(bronze, fill_value = 0)
# Sort the resultant dataframe in a descending order
total = total.sort_values(by = 'Medals', ascending = False)
# Print the sorted dataframe
print(total)
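
Running this prints the totals sorted in descending order, matching the expected output above:
#          Medals
# Country
# USA        72.0
# France     53.0
# UK         27.0
# Russia     25.0
# Germany    20.0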

Create two Dataframes based on series membership in Pandas

I'm a beginner, I can't seem to find an exact answer to this.
I have two dataframes, the first has localized economic data (df1):
(index) (index) 2000 2010 Diff
State Region
NY NYC 1000 1100 100
NY Upstate 200 270 70
NY Long_Island 1700 1800 100
IL Chicago 300 500 200
IL South 50 35 15
IL Suburbs 800 650 -150
The second has a list of state and regions, (df2):
index State Region
0 NY NYC
1 NY Long_Island
2 IL Chicago
Ultimately what I'm trying to do is run a t-test on the Diff column between the state and regions in df2 vs all the other ones in df1 that are not included in df2. However, I haven't managed to divide the groups yet so I can't run the test.
My latest attempt (of many) looks like this:
df1['Region', 'State'].isin(df2['Region', 'State'])
I've tried pd.merge too but can't seem to get it to work. I think it's because of the multi-level indexing but I still don't know how to get the state/regions that are not in df2.
It seems you need the difference of the MultiIndexes and then selection with DataFrame.loc:
print (df1.index)
MultiIndex(levels=[['IL', 'NY'], ['Chicago', 'Long_Island',
'NYC', 'South', 'Suburbs', 'Upstate']],
labels=[[1, 1, 1, 0, 0, 0], [2, 5, 1, 0, 3, 4]],
names=['State', 'Region'])
print (df2.index)
Int64Index([0, 1, 2], dtype='int64', name='index')
print (df1.index.names)
['State', 'Region']
#create index from both columns
df2 = df2.set_index(df1.index.names)
which is the same as:
#df2 = df2.set_index(['State','Region'])
mux = df1.index.difference(df2.index)
print (mux)
MultiIndex(levels=[['IL', 'NY'], ['South', 'Suburbs', 'Upstate']],
labels=[[0, 0, 1], [0, 1, 2]],
names=['State', 'Region'],
sortorder=0)
print (df1.loc[mux])
2000 2010 Diff
State Region
IL South 50 35 15
Suburbs 800 650 -150
NY Upstate 200 270 70
All together:
df2 = df2.set_index(df1.index.names)
df = df1.loc[df1.index.difference(df2.index)]
print (df)
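
From there, the t-test the question ultimately asks about could look like the following; a sketch, assuming SciPy is available and that Welch's unequal-variance test is acceptable:
from scipy import stats

# df2 is already indexed by ['State', 'Region'] from the snippet above
in_group = df1.loc[df1.index.intersection(df2.index), 'Diff']
out_group = df1.loc[df1.index.difference(df2.index), 'Diff']

# Welch's t-test (equal_var=False) on the two Diff samples
t_stat, p_value = stats.ttest_ind(in_group, out_group, equal_var=False)
print(t_stat, p_value)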

Split out if > value, divide, add value to column - Python/Pandas

import pandas as pd
df = pd.DataFrame([['Dog', 10, 6], ['Cat', 7, 5]], columns=('Name', 'Amount', 'Day'))
Name Amount Day
Dog 10 6
Cat 7 5
I would like to make the DataFrame look like the following:
Name Amount Day
Dog1 6 6
Dog2 2.5 7
Dog3 1.5 8
Cat 7 5
First step: for any Amount > 8, split the row into 3 rows with new names 'Name1', 'Name2', 'Name3'.
Second step:
For Dog1, 60% of Amount, Day = Day.
For Dog2, 25% of Amount, Day = Day + 1.
For Dog3, 15% of Amount, Day = Day + 2.
Keep Cat the same because Cat's Amount < 8.
Any ideas? Any help would be appreciated.
df = pd.DataFrame([['Dog', 10, 6], ['Cat', 7, 5]], columns=('Name', 'Amount', 'Day'))
template = pd.DataFrame([
    ['1', .6, 0],
    ['2', .25, 1],
    ['3', .15, 2]
], columns=df.columns)

def apply_template(r, t):
    # Work on a copy so the shared template is not mutated
    t = t.copy()
    # Prefix with the original name: 'Dog' + '1' -> 'Dog1'
    t['Name'] = t['Name'].radd(r['Name'])
    # Split the amount 60% / 25% / 15%
    t['Amount'] *= r['Amount']
    # Shift the day by 0, 1, 2
    t['Day'] += r['Day']
    return t

# Expand every row with Amount > 8 through the template and keep the rest as-is
# (DataFrame.append was removed in pandas 2.0, so pd.concat is used throughout).
pd.concat([apply_template(r, template) for _, r in df.query('Amount > 8').iterrows()]
          + [df.query('Amount <= 8')], ignore_index=True)
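
The combined frame then matches the target shown in the question (modulo row order and dtypes):
#    Name  Amount  Day
# 0  Dog1     6.0    6
# 1  Dog2     2.5    7
# 2  Dog3     1.5    8
# 3   Cat     7.0    5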
