Python: create new column and copy value from other row which is a swap of current row - python-3.x

I have a dataframe which has 3 columns:
import pandas as pd
d = {'A': ['left', 'right', 'east', 'west', 'south', 'north'], 'B': ['right', 'left', 'west', 'east', 'north', 'south'], 'VALUE': [0, 1, 2, 3, 4, 5]}
df = pd.DataFrame(d)
Dataframe looks like this:
A B VALUE
left right 0
right left 1
east west 2
west east 3
south north 4
north south 5
I am trying to create a new column VALUE_2 which should contain the value from the swapped row in the same Dataframe.
E.g. the swap of (left, right) is (right, left): its VALUE is 1, so VALUE_2 for (left, right) should be 1 and VALUE_2 for (right, left) should be 0, like this:
A B VALUE VALUE_2
left right 0 1
right left 1 0
east west 2 3
west east 3 2
south north 4 5
north south 5 4
I tried:
for row_num, record in df.iterrows():
    A = df['A'][row_num]
    B = df['B'][row_num]
    if pd.Series([record['A'] == B, record['B'] == A]).all():
        df['VALUE_2'] = df['VALUE']
I'm stuck here; inputs will be highly appreciated.

Use Series.map with a Series created by set_index:
df['VALUE_2'] = df['A'].map(df.set_index('B')['VALUE'])
print (df)
A B VALUE VALUE_2
0 left right 0 1
1 right left 1 0
2 east west 2 3
3 west east 3 2
4 south north 4 5
5 north south 5 4
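
The same lookup can also be written with a plain dict; a minimal sketch, assuming each value of A appears exactly once in column B so the mapping is unambiguous:
# Build a swap lookup from column B to VALUE, then map column A through it.
# Assumption: every A value occurs exactly once in B.
lookup = dict(zip(df['B'], df['VALUE']))
df['VALUE_2'] = df['A'].map(lookup)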

Just a more verbose answer:
import pandas as pd
d = {'A': ['left', 'right', 'east', 'west', 'south', 'north'], 'B': ['right', 'left', 'west', 'east', 'north', 'south'], 'VALUE': [0, 1, 2, 3, 4, 5]}
df = pd.DataFrame(d)
pdf = pd.DataFrame([])
for idx, item in df.iterrows():
    # Find the position of the row whose B equals this row's A
    indx = list(df['B']).index(str(df['A'][idx]))
    # Collect the VALUE of that swapped row
    # (DataFrame.append was removed in pandas 2.0, so pd.concat is used here)
    pdf = pd.concat([pdf, pd.DataFrame({'VALUE_2': df['VALUE'].iloc[indx]}, index=[0])],
                    ignore_index=True)
print(pdf)
data = pd.concat([df, pdf], axis=1)
print(data)
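
For reference, data then matches the result of the map approach above:
#        A      B  VALUE  VALUE_2
# 0   left  right      0        1
# 1  right   left      1        0
# 2   east   west      2        3
# 3   west   east      3        2
# 4  south  north      4        5
# 5  north  south      5        4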

Related

How can I sum columns in a DataFrame for each date in time series data

Here's the example:
df = pd.DataFrame({'Country': ['United States', 'China', 'Italy', 'spain'],
                   '2020-01-01': [0, 2, 1, 0],
                   '2020-01-02': [1, 0, 1, 2],
                   '2020-01-03': [0, 3, 2, 0]})
df
I want to sum the values of the columns by date so that each next column holds the accumulated value, which means 2020-01-02 gets a new value of (2020-01-01 + 2020-01-02), and so on.
Convert the Country column to the index with DataFrame.set_index and use DataFrame.cumsum across rows with axis=1:
df = df.set_index('Country').cumsum(axis=1)
print (df)
2020-01-01 2020-01-02 2020-01-03
Country
United States 0 1 1
China 2 2 5
Italy 1 2 4
spain 0 2 2
Or select all columns except the first with DataFrame.iloc before cumsum:
df.iloc[:, 1:] = df.iloc[:, 1:].cumsum(axis=1)
print (df)
Country 2020-01-01 2020-01-02 2020-01-03
0 United States 0 1 1
1 China 2 2 5
2 Italy 1 2 4
3 spain 0 2 2
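
Both variants accumulate left to right, so they assume the date columns are already in chronological order. A minimal sketch that sorts them first (assumption: every non-Country column parses as a date):
# Sort the date columns chronologically before accumulating
df = df.set_index('Country')
df = df[sorted(df.columns, key=pd.to_datetime)]
df = df.cumsum(axis=1)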

Find out the percentage of duplicates

I have the following data:
id date A Area Price Hol
0 1 2019-01-01 No 80 200 No
1 2 2019-01-02 Yes 100 300 Yes
2 3 2019-01-03 Yes 100 300 Yes
3 4 2019-01-04 No 50 100 No
4 5 2019-01-05 No 20 50 No
5 1 2019-01-01 No 80 200 No
I want to find duplicates (rows with the same id).
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
df = pd.DataFrame({'id': [1, 2, 3, 4, 5, 1],
                   'date': ['2019-01-01', '2019-01-02', '2019-01-03', '2019-01-04',
                            '2019-01-05', '2019-01-01'],
                   'A': ['No', 'Yes', 'Yes', 'No', 'No', 'No'],
                   'Area': [80, 100, 100, 50, 20, 80],
                   'Price': [200, 300, 300, 100, 50, 200],
                   'Hol': ['No', 'Yes', 'Yes', 'No', 'No', 'No']})
df['date'] = pd.to_datetime(df['date'])
fig, ax = plt.subplots(figsize=(15, 7))
df.groupby(['A', 'Area', 'Price', 'Hol'])['id'].value_counts().plot(ax=ax)
I can see that I have one duplicate (for id 1 , all the entries are the same)
Now, I want to find out what percentage those duplicates represent in the whole dataset.
I can't find a way to express this, since I am already using value_counts() in order to find the duplicates and I can't do something like:
df.groupby(['A', 'Area', 'Price', 'Hol'])['id'].value_counts().size()
percentage = (test / test.groupby(level=0).sum()) * 100
I believe you need DataFrame.duplicated with Series.value_counts:
percentage = df.duplicated(keep=False).value_counts(normalize=True) * 100
print (percentage)
False 66.666667
True 33.333333
dtype: float64
Is duplicated what you need?
df.duplicated(keep=False).mean()
Out[107]: 0.3333333333333333
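
If you want just the duplicate share as one number, a small sketch combining both ideas:
# Fraction of rows that belong to a fully duplicated group, as a percentage
pct_dupes = df.duplicated(keep=False).mean() * 100
print(f"{pct_dupes:.1f}% of the rows are duplicates")  # 33.3% here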

How do we add dataframes with the same id?

I'm a beginner in data science learning. I've gone through the pandas topic and found a task where I'm unable to understand what is wrong. Let me explain the problem.
I have three data frames:
gold = pd.DataFrame({'Country': ['USA', 'France', 'Russia'],
                     'Medals': [15, 13, 9]})
silver = pd.DataFrame({'Country': ['USA', 'Germany', 'Russia'],
                       'Medals': [29, 20, 16]})
bronze = pd.DataFrame({'Country': ['France', 'USA', 'UK'],
                       'Medals': [40, 28, 27]})
Here, I need to add all the medals into one column, with the country in another. When I added them it was showing NaN, so I filled the NaN with zero values, but I'm still unable to get the desired output.
Code:
gold.set_index('Country', inplace = True)
silver.set_index('Country',inplace = True)
bronze.set_index('Country', inplace = True)
Total = silver.add(gold,fill_value = 0)
Total = bronze.add(silver,fill_value = 0)
Total = gold + silver + bronze
print(Total)
Actual Output:
Medals
Country
France NaN
Germany NaN
Russia NaN
UK NaN
USA 72.0
Expected:
Medals
Country
USA 72.0
France 53.0
UK 27.0
Russia 25.0
Germany 20.0
Let me know what is wrong.
Just do concat with groupby and sum:
pd.concat([gold,silver,bronze]).groupby('Country').sum()
Out[1306]:
Medals
Country
France 53
Germany 20
Russia 25
UK 27
USA 72
Fixing your code: each assignment to Total overwrites the previous one, and the plain + in the last line aligns on the index and yields NaN for any country missing from one of the frames. Chain add with fill_value=0 instead:
silver.add(gold, fill_value=0).add(bronze, fill_value=0)
If we expect floating point output:
pd.concat([gold, silver, bronze]).groupby('Country').sum().astype(float)
# For a video solution of the code, copy-paste the following link into your browser:
# https://youtu.be/p0cnApQDotA
import numpy as np
import pandas as pd
# Defining the three dataframes indicating the gold, silver, and bronze medal counts
# of different countries
gold = pd.DataFrame({'Country': ['USA', 'France', 'Russia'],
                     'Medals': [15, 13, 9]})
silver = pd.DataFrame({'Country': ['USA', 'Germany', 'Russia'],
                       'Medals': [29, 20, 16]})
bronze = pd.DataFrame({'Country': ['France', 'USA', 'UK'],
                       'Medals': [40, 28, 27]})
# Set the index of the dataframes to 'Country' so that you can get the countrywise
# medal count
gold.set_index('Country', inplace = True)
silver.set_index('Country', inplace = True)
bronze.set_index('Country', inplace = True)
# Add the three dataframes and set the fill_value argument to zero to avoid getting
# NaN values
total = gold.add(silver, fill_value = 0).add(bronze, fill_value = 0)
# Sort the resultant dataframe in a descending order
total = total.sort_values(by = 'Medals', ascending = False)
# Print the sorted dataframe
print(total)
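
Running this prints the totals sorted in descending order, matching the expected output above:
#          Medals
# Country
# USA        72.0
# France     53.0
# UK         27.0
# Russia     25.0
# Germany    20.0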

Create two Dataframes based on series membership in Pandas

I'm a beginner, I can't seem to find an exact answer to this.
I have two dataframes, the first has localized economic data (df1):
(index) (index) 2000 2010 Diff
State Region
NY NYC 1000 1100 100
NY Upstate 200 270 70
NY Long_Island 1700 1800 100
IL Chicago 300 500 200
IL South 50 35 15
IL Suburbs 800 650 -150
The second has a list of state and regions, (df2):
index State Region
0 NY NYC
1 NY Long_Island
2 IL Chicago
Ultimately what I'm trying to do is run a t-test on the Diff column between the state and regions in df2 vs all the other ones in df1 that are not included in df2. However, I haven't managed to divide the groups yet so I can't run the test.
My latest attempt (of many) looks like this:
df1['Region', 'State'].isin(df2['Region', 'State'])
I've tried pd.merge too but can't seem to get it to work. I think it's because of the multi-level indexing but I still don't know how to get the state/regions that are not in df2.
It seems you need the difference of the MultiIndexes and then selection with DataFrame.loc:
print (df1.index)
MultiIndex(levels=[['IL', 'NY'], ['Chicago', 'Long_Island',
'NYC', 'South', 'Suburbs', 'Upstate']],
labels=[[1, 1, 1, 0, 0, 0], [2, 5, 1, 0, 3, 4]],
names=['State', 'Region'])
print (df2.index)
Int64Index([0, 1, 2], dtype='int64', name='index')
print (df1.index.names)
['State', 'Region']
#create index from both columns
df2 = df2.set_index(df1.index.names)
which is the same as:
#df2 = df2.set_index(['State','Region'])
mux = df1.index.difference(df2.index)
print (mux)
MultiIndex(levels=[['IL', 'NY'], ['South', 'Suburbs', 'Upstate']],
labels=[[0, 0, 1], [0, 1, 2]],
names=['State', 'Region'],
sortorder=0)
print (df1.loc[mux])
2000 2010 Diff
State Region
IL South 50 35 15
Suburbs 800 650 -150
NY Upstate 200 270 70
All together:
df2 = df2.set_index(df1.index.names)
df = df1.loc[df1.index.difference(df2.index)]
print (df)
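
From there, the t-test the question ultimately asks about could look like the following; a sketch, assuming SciPy is available and that Welch's unequal-variance test is acceptable:
from scipy import stats

# df2 is already indexed by ['State', 'Region'] from the snippet above
in_group = df1.loc[df1.index.intersection(df2.index), 'Diff']
out_group = df1.loc[df1.index.difference(df2.index), 'Diff']

# Welch's t-test (equal_var=False) on the two Diff samples
t_stat, p_value = stats.ttest_ind(in_group, out_group, equal_var=False)
print(t_stat, p_value)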

Split out if > value, divide, add value to column - Python/Pandas

import pandas as pd
df = pd.DataFrame([['Dog', 10, 6], ['Cat', 7, 5]], columns=('Name', 'Amount', 'Day'))
Name Amount Day
Dog 10 6
Cat 7 5
I would like to make the DataFrame look like the following:
Name Amount Day
Dog1 6 6
Dog2 2.5 7
Dog3 1.5 8
Cat 7 5
First step: for any Amount > 8, split the row into 3 rows with new names 'Name1', 'Name2', 'Name3'.
Second step:
For Dog1, 60% of Amount, Day = Day.
For Dog2, 25% of Amount, Day = Day + 1.
For Dog3, 15% of Amount, Day = Day + 2.
Keep Cat the same because Cat's Amount < 8.
Any ideas? Any help would be appreciated.
df = pd.DataFrame([['Dog', 10, 6], ['Cat', 7, 5]], columns=('Name', 'Amount', 'Day'))
template = pd.DataFrame([
    ['1', .6, 0],
    ['2', .25, 1],
    ['3', .15, 2]
], columns=df.columns)

def apply_template(r, t):
    # Work on a copy so the shared template is not mutated
    t = t.copy()
    # Prefix with the original name: 'Dog' + '1' -> 'Dog1'
    t['Name'] = t['Name'].radd(r['Name'])
    # Split the amount 60% / 25% / 15%
    t['Amount'] *= r['Amount']
    # Shift the day by 0, 1, 2
    t['Day'] += r['Day']
    return t

# Expand every row with Amount > 8 through the template and keep the rest as-is
# (DataFrame.append was removed in pandas 2.0, so pd.concat is used throughout).
pd.concat([apply_template(r, template) for _, r in df.query('Amount > 8').iterrows()]
          + [df.query('Amount <= 8')], ignore_index=True)
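
The combined frame then matches the target shown in the question (modulo row order and dtypes):
#    Name  Amount  Day
# 0  Dog1     6.0    6
# 1  Dog2     2.5    7
# 2  Dog3     1.5    8
# 3   Cat     7.0    5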
