Pandas: How to build a column based on another column which is indexed by another one? - python-3.x

I have this dataframe presented below. I tried a solution below, but I am not sure if this is a good solution.
import pandas as pd
def creatingDataFrame():
raw_data = {'code': [1, 2, 3, 2 , 3, 3],
'Region': ['A', 'A', 'C', 'B' , 'A', 'B'],
'var-A': [2,4,6,4,6,6],
'var-B': [20, 30, 40 , 50, 10, 20],
'var-C': [3, 4 , 5, 1, 2, 3]}
df = pd.DataFrame(raw_data, columns = ['code', 'Region','var-A', 'var-B', 'var-C'])
return df
if __name__=="__main__":
df=creatingDataFrame()
df['var']=np.where(df['Region']=='A',1.0,0.0)*df['var-A']+np.where(df['Region']=='B',1.0,0.0)*df['var-B']+np.where(df['Region']=='C',1.0,0.0)*df['var-C']
I want the variable var assumes values of column 'var-A', 'var-B' or 'var-C' depending on the region provided by region 'Region'.
The result must be
df['var']
Out[50]:
0 2.0
1 4.0
2 5.0
3 50.0
4 6.0
5 20.0
Name: var, dtype: float64

You can try with lookup
df.columns=df.columns.str.split('-').str[-1]
df
Out[255]:
code Region A B C
0 1 A 2 20 3
1 2 A 4 30 4
2 3 C 6 40 5
3 2 B 4 50 1
4 3 A 6 10 2
5 3 B 6 20 3
df.lookup(df.index,df.Region)
Out[256]: array([ 2, 4, 5, 50, 6, 20], dtype=int64)
#df['var']=df.lookup(df.index,df.Region)

Related

Removing rows from DataFrame based on different conditions applied to subset of a data

Here is the Dataframe I am working with:
You can create it using the snippet:
my_dict = {'id': [1,2,1,2,1,2,1,2,1,2,3,1,3, 3],
'category':['a', 'a', 'b', 'b', 'b', 'b', 'a', 'a', 'b', 'b', 'b', 'a', 'a', 'a'],
'value' : [1, 12, 34, 12, 12 ,34, 12, 35, 34, 45, 65, 55, 34, 25]
}
x = pd.DataFrame(my_dict)
x
I want to filter IDs based on the condition: for category a, the count of values should be 2 and for category b, the count of values should be 3. Therefore, I would remove id 1 from category a and id 3 from category b from my original dataset x.
I can write the code for individual categories and start removing id's manually by using the code:
x.query('category == "a"').groupby('id').value.count().loc[lambda x: x != 2]
x.query('category == "b"').groupby('id').value.count().loc[lambda x: x != 3]
But, I don't want to do it manually since there are multiple categories. Is there a better way of doing it by considering all the categories at once and remove id's based on the condition listed in a list/dictionary?
If need filter MultiIndex Series - s by dictionary use Index.get_level_values with Series.map and get equal values per groups in boolean indexing:
s = x.groupby(['category','id']).value.count()
d = {'a': 2, 'b': 3}
print (s[s.eq(s.index.get_level_values(0).map(d))])
category id
a 2 2
3 2
b 1 3
2 3
Name: value, dtype: int64
If need filter original DataFrame:
s = x.groupby(['category','id'])['value'].transform('count')
print (s)
0 3
1 2
2 3
3 3
4 3
5 3
6 3
7 2
8 3
9 3
10 1
11 3
12 2
13 2
Name: value, dtype: int64
d = {'a': 2, 'b': 3}
print (x[s.eq(x['category'].map(d))])
id category value
1 2 a 12
2 1 b 34
3 2 b 12
4 1 b 12
5 2 b 34
7 2 a 35
8 1 b 34
9 2 b 45
12 3 a 34
13 3 a 25

Pandas create a new column based on existing column with chain rule

I have below code
import pandas as pd
import numpy as np
df = pd.DataFrame({"A":[12, 4, 5, 3, 1],"B":[7, 2, 54, 3, None],"C":[20, 16, 11, 3, 8],"D":[14, 3, None, 2, 6]})
df['A1'] = np.where(df['A'] > 10, 10, np.where(df['A'] < 3, 3, df['A']))
While this is okay, I want create the final dataframe (i.e. 2nd line of code) using chain rule from the first line. I want to achieve this to increase readability.
Could you please help how can I achieve this?
You can use clip here:
df.assign(A1=df['A'].clip(upper=10,lower=3))
A B C D A1
0 12 7.0 20 14.0 10
1 4 2.0 16 3.0 4
2 5 54.0 11 NaN 5
3 3 3.0 3 2.0 3
4 1 NaN 8 6.0 3
If you really need to do this in one line (note that I dont find this readable)
pd.DataFrame({"A":[12, 4, 5, 3, 1],
"B":[7, 2, 54, 3, None],
"C":[20, 16, 11, 3, 8],
"D":[14, 3, None, 2, 6]}).assign(A1=lambda x:x['A'].clip(upper=10,lower=3))
You could use np.select() like the following. It makes the conditions and choices very readable.
conditions = [df['A'] > 10,
df['A'] < 3]
choices = [10,3]
df['A2'] = np.select(conditions, choices, default = df['A'])
print(df)
A B C D A1
0 12 7.0 20 14.0 10
1 4 2.0 16 3.0 4
2 5 54.0 11 NaN 5
3 3 3.0 3 2.0 3
4 1 NaN 8 6.0 3

How to aggregate n previous rows as list in Pandas DataFrame?

As the title says:
a = pd.DataFrame([1,2,3,4,5,6,7,8,9,10])
Having a dataframe with 10 values we want to aggregate say last 5 rows and put them as list into a new column:
>>> a new_col
0
0 1
1 2
2 3
3 4
4 5 [1,2,3,4,5]
5 6 [2,3,4,5,6]
6 7 [3,4,5,6,7]
7 8 [4,5,6,7,8]
8 9 [5,6,7,8,9]
9 10 [6,7,8,9,10]
How?
Due to how rolling windows are implemented, you won't be able to aggregate the results as you expect, but we still can reach your desired result by iterating each window and storing the values as a list of values:
>>> new_col_values = [
window.to_list() if len(window) == 5 else None
for window in df["column"].rolling(5)
]
>>> df["new_col"] = new_col_values
>>> df
column new_col
0 1 None
1 2 None
2 3 None
3 4 None
4 5 [1, 2, 3, 4, 5]
5 6 [2, 3, 4, 5, 6]
6 7 [3, 4, 5, 6, 7]
7 8 [4, 5, 6, 7, 8]
8 9 [5, 6, 7, 8, 9]
9 10 [6, 7, 8, 9, 10]

Function on column from dictionary

I have a df like this:
df = pd.DataFrame({'A': [3, 1, 2, 3],
'B': [5, 6, 7, 8]})
A B
0 3 5
1 1 6
2 2 7
3 3 8
And I have a dictionary like this:
{'A': 1, 'B': 2}
Is there a simple way to performa function (eg. divide) on df values based on the values from the dictionary?
Example, all values in column A is divided by 1, and all values in column B is divided by 2?
For me working division by dictionary, because keys of dict matching columns names:
d = {'A': 1, 'B': 2}
df1 = df.div(d)
Or:
df1 = df / d
print(df1)
A B
0 3.0 2.5
1 1.0 3.0
2 2.0 3.5
3 3.0 4.0
If you want to do it using for loop you can try this
df = pd.DataFrame({'A': [3, 1,2, 3, 4],
'B': [5, 6, 7, 8, 9]})
dict={'A': 1, 'B': 2}
final_dict={}
for col in df.columns:
if col in dict.keys():
for item in dict.keys():
if col==item:
lists=[i/dict[item] for i in df[col]]
final_dict[col]=lists
df=pd.DataFrame(final_dict)

Efficient evaluation of weighted average variable in a Pandas Dataframe

Please, considere the dataframe df generated below:
import pandas as pd
def creatingDataFrame():
raw_data = {'code': [1, 2, 3, 2 , 3, 3],
'var1': [10, 20, 30, 20 , 30, 30],
'var2': [2,4,6,4,6,6],
'price': [20, 30, 40 , 50, 10, 20],
'sells': [3, 4 , 5, 1, 2, 3]}
df = pd.DataFrame(raw_data, columns = ['code', 'var1','var2', 'price', 'sells'])
return df
if __name__=="__main__":
df=creatingDataFrame()
setCode=set(df['code'])
listDF=[]
for code in setCode:
dfCode=df[df['code'] == code].copy()
print(dfCode)
lenDfCode=len(dfCode)
if(lenDfCode==1):
theData={'code': [dfCode['code'].iloc[0]],
'var1': [dfCode['var1'].iloc[0]],
'var2': [dfCode['var2'].iloc[0]],
'averagePrice': [dfCode['price'].iloc[0]],
'totalSells': [dfCode['sells'].iloc[0]]
}
else:
dfCode['price*sells']=dfCode['price']*dfCode['sells']
sumSells=np.sum(dfCode['sells'])
sumProducts=np.sum(dfCode['price*sells'])
dfCode['totalSells']=sumSells
av=sumProducts/sumSells
dfCode['averagePrice']=av
theData={'code': [dfCode['code'].iloc[0]],
'var1': [dfCode['var1'].iloc[0]],
'var2': [dfCode['var2'].iloc[0]],
'averagePrice': [dfCode['averagePrice'].iloc[0]],
'totalSells': [dfCode['totalSells'].iloc[0]]
}
dfPart=pd.DataFrame(theData, columns = ['code', 'var1','var2', 'averagePrice','totalSells'])
listDF.append(dfPart)
newDF = pd.concat(listDF)
print(newDF)
I have this dataframe
code var1 var2 price sells
0 1 10 2 20 3
1 2 20 4 30 4
2 3 30 6 40 5
3 2 20 4 50 1
4 3 30 6 10 2
5 3 30 6 20 3
I want to generate the following dataframe:
code var1 var2 averagePrice totalSells
0 1 10 2 20.0 3
0 2 20 4 34.0 5
0 3 30 6 28.0 10
Note that this dataframe is created from the first by evaluating the average price and total sells for each code. Furthermore, var1 and var2 are the same for each code. The python code above does that, but I know that it is inefficient. I believe that a desired solution can be done using groupby, but I am not able to generate it.
It is different , apply with pd.Series
df.groupby(['code','var1','var2']).apply(lambda x : pd.Series({'averagePrice': sum(x['sells']*x['price'])/sum(x['sells']),'totalSells':sum(x['sells'])})).reset_index()
Out[366]:
code var1 var2 averagePrice totalSells
0 1 10 2 20.0 3.0
1 2 20 4 34.0 5.0
2 3 30 6 28.0 10.0

Resources