Efficient evaluation of weighted average variable in a Pandas Dataframe - python-3.x

Please consider the dataframe df generated below:
import numpy as np
import pandas as pd

def creatingDataFrame():
    raw_data = {'code': [1, 2, 3, 2, 3, 3],
                'var1': [10, 20, 30, 20, 30, 30],
                'var2': [2, 4, 6, 4, 6, 6],
                'price': [20, 30, 40, 50, 10, 20],
                'sells': [3, 4, 5, 1, 2, 3]}
    df = pd.DataFrame(raw_data, columns=['code', 'var1', 'var2', 'price', 'sells'])
    return df

if __name__ == "__main__":
    df = creatingDataFrame()
    setCode = set(df['code'])
    listDF = []
    # Build one single-row dataframe per code, then concatenate them.
    for code in setCode:
        dfCode = df[df['code'] == code].copy()
        print(dfCode)
        lenDfCode = len(dfCode)
        if lenDfCode == 1:
            # Only one row for this code: the average is the price itself.
            theData = {'code': [dfCode['code'].iloc[0]],
                       'var1': [dfCode['var1'].iloc[0]],
                       'var2': [dfCode['var2'].iloc[0]],
                       'averagePrice': [dfCode['price'].iloc[0]],
                       'totalSells': [dfCode['sells'].iloc[0]]
                       }
        else:
            # Weighted average: sum(price * sells) / sum(sells).
            dfCode['price*sells'] = dfCode['price'] * dfCode['sells']
            sumSells = np.sum(dfCode['sells'])
            sumProducts = np.sum(dfCode['price*sells'])
            dfCode['totalSells'] = sumSells
            av = sumProducts / sumSells
            dfCode['averagePrice'] = av
            theData = {'code': [dfCode['code'].iloc[0]],
                       'var1': [dfCode['var1'].iloc[0]],
                       'var2': [dfCode['var2'].iloc[0]],
                       'averagePrice': [dfCode['averagePrice'].iloc[0]],
                       'totalSells': [dfCode['totalSells'].iloc[0]]
                       }
        dfPart = pd.DataFrame(theData, columns=['code', 'var1', 'var2', 'averagePrice', 'totalSells'])
        listDF.append(dfPart)
    newDF = pd.concat(listDF)
    print(newDF)
I have this dataframe:
code var1 var2 price sells
0 1 10 2 20 3
1 2 20 4 30 4
2 3 30 6 40 5
3 2 20 4 50 1
4 3 30 6 10 2
5 3 30 6 20 3
I want to generate the following dataframe:
code var1 var2 averagePrice totalSells
0 1 10 2 20.0 3
0 2 20 4 34.0 5
0 3 30 6 28.0 10
Note that this dataframe is created from the first one by evaluating the weighted average price (weighted by sells) and the total sells for each code. Furthermore, var1 and var2 are the same for all rows with a given code. The Python code above does that, but I know it is inefficient. I believe a better solution can be written using groupby, but I have not been able to produce it.

A different approach: group by code, var1 and var2, then apply a function that returns a pd.Series for each group:
df.groupby(['code', 'var1', 'var2']).apply(
    lambda x: pd.Series({
        'averagePrice': sum(x['sells'] * x['price']) / sum(x['sells']),
        'totalSells': sum(x['sells'])
    })
).reset_index()
Out[366]:
code var1 var2 averagePrice totalSells
0 1 10 2 20.0 3.0
1 2 20 4 34.0 5.0
2 3 30 6 28.0 10.0
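
As a possible alternative to apply, the same weighted average can be sketched with plain aggregations: precompute price * sells, sum it per group, and divide by the summed sells. The helper column name revenue is an assumption introduced here for illustration; it does not appear in the question. This sketch assumes a pandas version that supports named aggregation (0.25 or later).

```python
import pandas as pd

df = pd.DataFrame({
    "code": [1, 2, 3, 2, 3, 3],
    "var1": [10, 20, 30, 20, 30, 30],
    "var2": [2, 4, 6, 4, 6, 6],
    "price": [20, 30, 40, 50, 10, 20],
    "sells": [3, 4, 5, 1, 2, 3],
})

# "revenue" is a hypothetical helper column holding price * sells per row.
out = (
    df.assign(revenue=df["price"] * df["sells"])
      .groupby(["code", "var1", "var2"], as_index=False)
      .agg(revenue=("revenue", "sum"), totalSells=("sells", "sum"))
)

# Weighted average price = sum(price * sells) / sum(sells) within each group.
out["averagePrice"] = out["revenue"] / out["totalSells"]
out = out[["code", "var1", "var2", "averagePrice", "totalSells"]]
print(out)
```

Because only built-in sums are used, totalSells stays an integer column here instead of being converted to float.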

Related

Range mapping in Python

MAPPER DATAFRAME
col_data = {'p0_tsize_qbin_':[1, 2, 3, 4, 5] ,
'p0_tsize_min':[0.0, 7.0499999999999545, 16.149999999999977, 32.65000000000009, 76.79999999999973] ,
'p0_tsize_max':[7.0, 16.100000000000023, 32.64999999999998, 76.75, 6759.850000000006]}
map_df = pd.DataFrame(col_data, columns = ['p0_tsize_qbin_', 'p0_tsize_min','p0_tsize_max'])
map_df
The dataframe above is map_df: the second and third columns (p0_tsize_min, p0_tsize_max) define a range, and the first column (p0_tsize_qbin_) is the value to map into the new dataframe.
MAIN DATAFRAME
raw_data = {
'id': ['1', '2', '2', '3', '3','1', '2', '2', '3', '3','1', '2', '2', '3', '3'],
'val' : [3, 56, 78, 11, 5000,37, 756, 78, 49, 21,9, 4, 14, 75, 31,]}
df = pd.DataFrame(raw_data, columns = ['id', 'val','p0_tsize_qbin_mapped'])
df
EXPECTED OUTPUT
For each val in df, find the row of map_df whose range from p0_tsize_min to p0_tsize_max contains it, and return that row's p0_tsize_qbin_ value.
For example, val = 3 in df lies between p0_tsize_min and p0_tsize_max of the row where p0_tsize_qbin_ == 1, so 1 is returned.
Try using pd.cut()
bins = map_df['p0_tsize_min'].tolist() + [map_df['p0_tsize_max'].max()]
labels = map_df['p0_tsize_qbin_'].tolist()
df.assign(p0_tsize_qbin_mapped = pd.cut(df['val'],bins = bins,labels = labels))
or
bins = pd.IntervalIndex.from_arrays(map_df['p0_tsize_min'],map_df['p0_tsize_max'])
map_df.loc[bins.get_indexer(df['val'].tolist()),'p0_tsize_qbin_'].to_numpy()
Output:
id val p0_tsize_qbin_mapped
0 1 3 1
1 2 56 4
2 2 78 5
3 3 11 2
4 3 5000 5
5 1 37 4
6 2 756 5
7 2 78 5
8 3 49 4
9 3 21 3
10 1 9 2
11 2 4 1
12 2 14 2
13 3 75 4
14 3 31 3
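
For reference, the pd.cut approach above can be put together as a self-contained sketch using the data from the question. The final astype(int) conversion is an addition made here so the mapped column holds plain integers rather than a categorical; it is not part of the original answer.

```python
import pandas as pd

map_df = pd.DataFrame({
    "p0_tsize_qbin_": [1, 2, 3, 4, 5],
    "p0_tsize_min": [0.0, 7.0499999999999545, 16.149999999999977,
                     32.65000000000009, 76.79999999999973],
    "p0_tsize_max": [7.0, 16.100000000000023, 32.64999999999998,
                     76.75, 6759.850000000006],
})
df = pd.DataFrame({
    "id": ["1", "2", "2", "3", "3", "1", "2", "2", "3", "3", "1", "2", "2", "3", "3"],
    "val": [3, 56, 78, 11, 5000, 37, 756, 78, 49, 21, 9, 4, 14, 75, 31],
})

# Bin edges: every lower bound plus the largest upper bound.
bins = map_df["p0_tsize_min"].tolist() + [map_df["p0_tsize_max"].max()]
labels = map_df["p0_tsize_qbin_"].tolist()

# pd.cut returns a categorical; convert to int (assumes every val falls in a bin).
df["p0_tsize_qbin_mapped"] = pd.cut(df["val"], bins=bins, labels=labels).astype(int)
print(df)
```

With the default right=True, each bin excludes its left edge, so a val of exactly 0 or a val above the largest maximum would come back as NaN, and the astype(int) step would then fail.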

Replace data type in dataFrame

How can I replace values of a given type in a dataframe? In this sample I want to replace all strings with 0 or NaN. I tried:
df.replace(str, 0, inplace=True)
or
df.replace({str: 0}, inplace=True)
but neither of these works on this dataframe:
0 1 2
0 NaN 1 'b'
1 2 3 'c'
2 4 'd' 5
3 10 20 30
The code below visits every cell in the dataframe and replaces it with 0 if it is NaN or a string:
import pandas as pd
import numpy as np

df = pd.DataFrame({'A': [0, 1, 2, 3, np.nan],
                   'B': [np.nan, 6, 7, 8, 9],
                   'C': ['a', 10, 500, 'd', 'e']})
print("before >>> \n", df)

def replace_nan_and_strings(cell_value):
    # Return 0 for missing values and strings; keep everything else.
    if pd.isnull(cell_value) or isinstance(cell_value, str):
        return 0
    else:
        return cell_value

# applymap applies the function element-wise (renamed to DataFrame.map in pandas 2.1).
new_df = df.applymap(replace_nan_and_strings)
print("after >>> \n", new_df)
Try this:
df = df.replace('[a-zA-Z]', 0, regex=True)
This is how I tested it:
'''
0 1 2
0 NaN 1 'b'
1 2 3 'c'
2 4 'd' 5
3 10 20 30
'''
import pandas as pd
df = pd.read_clipboard()
df = df.replace('[a-zA-Z]', 0, regex=True)
print(df)
Output:
0 1 2
0 NaN 1 0
1 2.0 3 0
2 4.0 0 5
3 10.0 20 30
New scenario as requested in the comments below:
Input:
'''
0 '1' 2
0 NaN 1 'b'
1 2 3 'c'
2 '4' 'd' 5
3 10 20 30
'''
Output:
0 '1' 2
0 NaN 1 0
1 2 3 0
2 '4' 0 5
3 10 20 30
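
Another way to express "replace by type" is to build a boolean mask of the cells that hold str objects and pass it to DataFrame.mask. This is a minimal sketch under the assumption that only genuine Python strings should be replaced and NaN should be left alone; the names is_str and cleaned are illustrative and not from the original posts.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "A": [0, 1, 2, 3, np.nan],
    "B": [np.nan, 6, 7, 8, 9],
    "C": ["a", 10, 500, "d", "e"],
})

# True wherever a cell holds a Python str, checked column by column.
is_str = df.apply(lambda col: col.map(lambda v: isinstance(v, str)))

# mask() replaces the cells where the condition is True and keeps the rest.
cleaned = df.mask(is_str, 0)
print(cleaned)
```

Unlike the regex approach, this checks the type of each cell rather than its characters, so a string such as '4' would also be replaced.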

Get value from another dataframe column based on condition

I have a dataframe like below:
>>> df1
a b
0 [1, 2, 3] 10
1 [4, 5, 6] 20
2 [7, 8] 30
and another like:
>>> df2
a
0 1
1 2
2 3
3 4
4 5
I need to create column 'c' in df2 from column 'b' of df1 wherever the value in column 'a' of df2 appears in column 'a' of df1. Each entry in column 'a' of df1 is a list.
I have tried to implement from following url, but got nothing so far:
https://medium.com/#Imaadmkhan1/using-pandas-to-create-a-conditional-column-by-selecting-multiple-columns-in-two-different-b50886fabb7d
The expected result is:
>>> df2
a c
0 1 10
1 2 10
2 3 10
3 4 20
4 5 20
Use Series.map by flattening values from df1 to dictionary:
d = {c: b for a, b in zip(df1['a'], df1['b']) for c in a}
print (d)
{1: 10, 2: 10, 3: 10, 4: 20, 5: 20, 6: 20, 7: 30, 8: 30}
df2['new'] = df2['a'].map(d)
print (df2)
a new
0 1 10
1 2 10
2 3 10
3 4 20
4 5 20
EDIT: I think the problem is that column a mixes lists with plain integers. The solution is to test each value with if/else while building the dictionary:
d = {}
for a, b in zip(df1['a'], df1['b']):
    if isinstance(a, list):
        # Map every element of the list to b.
        for c in a:
            d[c] = b
    else:
        # Plain scalar: map it directly.
        d[a] = b
df2['new'] = df2['a'].map(d)
Use:
m = pd.DataFrame({'a': np.concatenate(df1.a.values), 'b': df1.b.repeat(df1.a.str.len())})
df2.merge(m, on='a')
a b
0 1 10
1 2 10
2 3 10
3 4 20
4 5 20
First we unnest the list df1 to rows, then we merge them on column a:
df1 = df1.set_index('b').a.apply(pd.Series).stack().reset_index(level=0).rename(columns={0:'a'})
print(df1, '\n')
df_final = df2.merge(df1, on='a')
print(df_final)
b a
0 10 1.0
1 10 2.0
2 10 3.0
0 20 4.0
1 20 5.0
2 20 6.0
0 30 7.0
1 30 8.0
a b
0 1 10
1 2 10
2 3 10
3 4 20
4 5 20
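
On pandas 0.25 or later, the unnesting step can likely be written more directly with DataFrame.explode. The sketch below is an alternative to the answers above, not their original method; the name lookup is illustrative, and the cast to int64 is needed because explode leaves an object column that cannot be merged with an integer column.

```python
import pandas as pd

df1 = pd.DataFrame({"a": [[1, 2, 3], [4, 5, 6], [7, 8]], "b": [10, 20, 30]})
df2 = pd.DataFrame({"a": [1, 2, 3, 4, 5]})

# One row per list element; cast back to int so the merge keys have the same dtype.
lookup = df1.explode("a").astype({"a": "int64"})

# Left merge keeps every row of df2; rename b to the requested column name c.
result = df2.merge(lookup, on="a", how="left").rename(columns={"b": "c"})
print(result)
```

The left merge keeps rows of df2 that have no match in df1 (with NaN in c), whereas the default inner merge would drop them.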

Pandas: How to build a column based on another column which is indexed by another one?

I have this dataframe presented below. I tried a solution below, but I am not sure if this is a good solution.
import numpy as np
import pandas as pd

def creatingDataFrame():
    raw_data = {'code': [1, 2, 3, 2, 3, 3],
                'Region': ['A', 'A', 'C', 'B', 'A', 'B'],
                'var-A': [2, 4, 6, 4, 6, 6],
                'var-B': [20, 30, 40, 50, 10, 20],
                'var-C': [3, 4, 5, 1, 2, 3]}
    df = pd.DataFrame(raw_data, columns=['code', 'Region', 'var-A', 'var-B', 'var-C'])
    return df

if __name__ == "__main__":
    df = creatingDataFrame()
    # Multiply each var-X column by a 0/1 indicator for its region and add the results.
    df['var'] = (np.where(df['Region'] == 'A', 1.0, 0.0) * df['var-A']
                 + np.where(df['Region'] == 'B', 1.0, 0.0) * df['var-B']
                 + np.where(df['Region'] == 'C', 1.0, 0.0) * df['var-C'])
I want the variable var to take its value from column 'var-A', 'var-B' or 'var-C', depending on the region given in column 'Region'.
The result must be
df['var']
Out[50]:
0 2.0
1 4.0
2 5.0
3 50.0
4 6.0
5 20.0
Name: var, dtype: float64
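You can try with lookup (note that DataFrame.lookup was deprecated in pandas 1.2 and removed in 2.0, so this works only on older versions):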
df.columns=df.columns.str.split('-').str[-1]
df
Out[255]:
code Region A B C
0 1 A 2 20 3
1 2 A 4 30 4
2 3 C 6 40 5
3 2 B 4 50 1
4 3 A 6 10 2
5 3 B 6 20 3
df.lookup(df.index,df.Region)
Out[256]: array([ 2, 4, 5, 50, 6, 20], dtype=int64)
#df['var']=df.lookup(df.index,df.Region)
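
On pandas versions without DataFrame.lookup, the same row-wise selection can be sketched with NumPy integer indexing: translate each Region into a column position, then pick one value per row. The names values and col_idx are introduced here for illustration only.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "code": [1, 2, 3, 2, 3, 3],
    "Region": ["A", "A", "C", "B", "A", "B"],
    "var-A": [2, 4, 6, 4, 6, 6],
    "var-B": [20, 30, 40, 50, 10, 20],
    "var-C": [3, 4, 5, 1, 2, 3],
})

# Restrict to the candidate columns so the underlying array stays numeric.
values = df[["var-A", "var-B", "var-C"]]

# Position of the column named "var-<Region>" for each row.
col_idx = values.columns.get_indexer(("var-" + df["Region"]).tolist())

# Pick one element per row: row i, column col_idx[i].
df["var"] = values.to_numpy()[np.arange(len(df)), col_idx]
print(df["var"])
```

A Region with no matching column would give a position of -1 and silently select the last column, so it may be worth checking that col_idx contains no -1 first.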

Pandas: Random integer between values in two columns

How can I create a new column that holds a random integer between the values of two other columns in each row?
Example df:
import pandas as pd
import numpy as np
data = pd.DataFrame({'start': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
'end': [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]})
data = data.iloc[:, [1, 0]]
Now I am trying something like this:
data['rand_between'] = data.apply(lambda x: np.random.randint(data.start, data.end))
or
data['rand_between'] = np.random.randint(data.start, data.end)
But it doesn't work, of course, because data.start is a Series, not a number.
How can I use numpy.random with data from columns as a vectorized operation?
You are close. You need to specify axis=1 to process the data by rows, and change data.start/end to x.start/x.end so that you work with scalars:
data['rand_between'] = data.apply(lambda x: np.random.randint(x.start, x.end), axis=1)
Another possible solution:
data['rand_between'] = [np.random.randint(s, e) for s,e in zip(data['start'], data['end'])]
print (data)
start end rand_between
0 1 10 8
1 2 20 3
2 3 30 23
3 4 40 35
4 5 50 30
5 6 60 28
6 7 70 60
7 8 80 14
8 9 90 85
9 10 100 83
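If you want to truly vectorize this, you can generate a random number in [0, 1) and scale it to your min/max numbers. Note that because of the + 1, this version can return the end value itself, unlike np.random.randint, whose upper bound is exclusive: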
(
data['start'] + np.random.rand(len(data)) * (data['end'] - data['start'] + 1)
).astype('int')
Out:
0 1
1 18
2 18
3 35
4 22
5 27
6 35
7 23
8 33
9 81
dtype: int64
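
A further vectorized option, sketched here on the assumption that a recent NumPy (1.17 or later) is available, is Generator.integers, which accepts arrays for both bounds and draws one integer per row. The seed value 0 is arbitrary and used only to make the example reproducible.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame({
    "start": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    "end": [10, 20, 30, 40, 50, 60, 70, 80, 90, 100],
})

# Seeded generator so the draw is reproducible; the seed itself is arbitrary.
rng = np.random.default_rng(0)

# low and high are arrays of equal length, so one integer is drawn per row.
# high is exclusive by default, matching np.random.randint.
data["rand_between"] = rng.integers(data["start"].to_numpy(), data["end"].to_numpy())
print(data)
```

Passing endpoint=True to integers would make the upper bound inclusive.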
