Range mapping in Python - python-3.x

MAPPER DATAFRAME
import pandas as pd
col_data = {'p0_tsize_qbin_':[1, 2, 3, 4, 5] ,
'p0_tsize_min':[0.0, 7.0499999999999545, 16.149999999999977, 32.65000000000009, 76.79999999999973] ,
'p0_tsize_max':[7.0, 16.100000000000023, 32.64999999999998, 76.75, 6759.850000000006]}
map_df = pd.DataFrame(col_data, columns = ['p0_tsize_qbin_', 'p0_tsize_min','p0_tsize_max'])
map_df
In the data frame map_df above, columns 2 and 3 (p0_tsize_min and p0_tsize_max) give the range, and column 1 (p0_tsize_qbin_) is the value to map onto the new data frame.
MAIN DATAFRAME
raw_data = {
'id': ['1', '2', '2', '3', '3','1', '2', '2', '3', '3','1', '2', '2', '3', '3'],
'val' : [3, 56, 78, 11, 5000,37, 756, 78, 49, 21,9, 4, 14, 75, 31,]}
df = pd.DataFrame(raw_data, columns = ['id', 'val','p0_tsize_qbin_mapped'])
df
EXPECTED OUTPUT
For each val in df, find the row of map_df whose range from p0_tsize_min to p0_tsize_max contains it, and return that row's p0_tsize_qbin_ value.
For example, val = 3 in df lies in the range where p0_tsize_qbin_ == 1, so 1 is returned.

Try using pd.cut()
bins = map_df['p0_tsize_min'].tolist() + [map_df['p0_tsize_max'].max()]
labels = map_df['p0_tsize_qbin_'].tolist()
df.assign(p0_tsize_qbin_mapped = pd.cut(df['val'],bins = bins,labels = labels))
or
bins = pd.IntervalIndex.from_arrays(map_df['p0_tsize_min'],map_df['p0_tsize_max'])
map_df.loc[bins.get_indexer(df['val'].tolist()),'p0_tsize_qbin_'].to_numpy()
Output:
id val p0_tsize_qbin_mapped
0 1 3 1
1 2 56 4
2 2 78 5
3 3 11 2
4 3 5000 5
5 1 37 4
6 2 756 5
7 2 78 5
8 3 49 4
9 3 21 3
10 1 9 2
11 2 4 1
12 2 14 2
13 3 75 4
14 3 31 3
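One caveat with the IntervalIndex variant: the ranges in map_df leave small gaps (for example between 7.0 and 7.05), and get_indexer returns -1 for a value that falls inside no interval. The sketch below is only an illustration under assumptions: it uses rounded bin edges and a made-up vals series, and it masks the unmatched positions so they come out as NaN.

```python
import pandas as pd

# Rounded copy of the question's mapper table (assumption: edges simplified for readability)
map_df = pd.DataFrame({
    "p0_tsize_qbin_": [1, 2, 3, 4, 5],
    "p0_tsize_min": [0.0, 7.05, 16.15, 32.65, 76.8],
    "p0_tsize_max": [7.0, 16.1, 32.65, 76.75, 6759.85],
})

# Hypothetical values: 7.02 sits in a gap, 10000 is above the last range
vals = pd.Series([3, 7.02, 56, 10000])

# Intervals are closed on the right by default: (min, max]
intervals = pd.IntervalIndex.from_arrays(map_df["p0_tsize_min"], map_df["p0_tsize_max"])

# Position of the containing interval for each value, or -1 if none contains it
pos = intervals.get_indexer(vals)

labels = map_df["p0_tsize_qbin_"].to_numpy()
# labels[pos] would wrap around to the last label for -1, so blank those rows afterwards
mapped = pd.Series(labels[pos], index=vals.index).where(pos >= 0)
print(mapped)
```

Because NaN is introduced, the result is a float column; whether unmatched values should instead be dropped or assigned to the nearest bin depends on the data.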

Related

Removing rows from DataFrame based on different conditions applied to subset of a data

Here is the Dataframe I am working with:
You can create it using the snippet:
my_dict = {'id': [1,2,1,2,1,2,1,2,1,2,3,1,3, 3],
'category':['a', 'a', 'b', 'b', 'b', 'b', 'a', 'a', 'b', 'b', 'b', 'a', 'a', 'a'],
'value' : [1, 12, 34, 12, 12 ,34, 12, 35, 34, 45, 65, 55, 34, 25]
}
x = pd.DataFrame(my_dict)
x
I want to filter IDs based on the condition: for category a, the count of values should be 2 and for category b, the count of values should be 3. Therefore, I would remove id 1 from category a and id 3 from category b from my original dataset x.
I can write the code for individual categories and start removing id's manually by using the code:
x.query('category == "a"').groupby('id').value.count().loc[lambda x: x != 2]
x.query('category == "b"').groupby('id').value.count().loc[lambda x: x != 3]
But I don't want to do it manually, since there are multiple categories. Is there a better way that considers all the categories at once and removes ids based on conditions listed in a list or dictionary?
If you need to filter the MultiIndex Series s by a dictionary, use Index.get_level_values with map, then keep the groups whose count equals the mapped value via boolean indexing:
s = x.groupby(['category','id']).value.count()
d = {'a': 2, 'b': 3}
print (s[s.eq(s.index.get_level_values(0).map(d))])
category  id
a         2     2
          3     2
b         1     3
          2     3
Name: value, dtype: int64
If you need to filter the original DataFrame instead:
s = x.groupby(['category','id'])['value'].transform('count')
print (s)
0 3
1 2
2 3
3 3
4 3
5 3
6 3
7 2
8 3
9 3
10 1
11 3
12 2
13 2
Name: value, dtype: int64
d = {'a': 2, 'b': 3}
print (x[s.eq(x['category'].map(d))])
id category value
1 2 a 12
2 1 b 34
3 2 b 12
4 1 b 12
5 2 b 34
7 2 a 35
8 1 b 34
9 2 b 45
12 3 a 34
13 3 a 25
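The transform-and-map idea above can be wrapped in a small helper so the required counts live in one dictionary. This is a minimal sketch, not part of the original answer; the function name filter_by_group_count and its parameters are hypothetical.

```python
import pandas as pd

x = pd.DataFrame({
    "id": [1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 3, 1, 3, 3],
    "category": ["a", "a", "b", "b", "b", "b", "a", "a", "b", "b", "b", "a", "a", "a"],
    "value": [1, 12, 34, 12, 12, 34, 12, 35, 34, 45, 65, 55, 34, 25],
})

def filter_by_group_count(frame, required):
    """Keep rows whose (category, id) group size equals required[category]."""
    # Size of each (category, id) group, repeated on every row of that group
    sizes = frame.groupby(["category", "id"])["value"].transform("count")
    # Required size for each row's category; NaN if the category is not listed
    wanted = frame["category"].map(required)
    return frame[sizes.eq(wanted)]

kept = filter_by_group_count(x, {"a": 2, "b": 3})
print(kept)
```

Note that a category missing from the dictionary maps to NaN, so all of its rows are dropped; add an entry for every category you want to keep.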

Pandas: apply list of functions on columns, one function per column

Setting: for a dataframe with 10 columns I have a list of 10 functions which I wish to apply in a function1(column1), function2(column2), ..., function10(column10) fashion. I have looked into pandas.DataFrame.apply and pandas.DataFrame.transform but they seem to broadcast and apply each function on all possible columns.
IIUC, with zip and a for loop:
Example:
import pandas as pd
def function1(x):
    return x + 1
def function2(x):
    return x * 2
def function3(x):
    return x**2
df = pd.DataFrame({'A': [1, 2, 3], 'B': [1, 2, 3], 'C': [1, 2, 3]})
functions = [function1, function2, function3]
print(df)
# A B C
# 0 1 1 1
# 1 2 2 2
# 2 3 3 3
for col, func in zip(df, functions):
    df[col] = df[col].apply(func)
print(df)
# A B C
# 0 2 2 1
# 1 3 4 4
# 2 4 6 9
You could do something like:
# list containing one function per column, in column order
fun_list = []
# assume df is your dataframe
for i, fun in enumerate(fun_list):
    df.iloc[:, i] = fun(df.iloc[:, i])
You can also map your N functions over each row with a lambda that returns a Series holding one operation per column. For example:
import pandas as pd
matrix = [(22, 34, 23), (33, 31, 11), (44, 16, 21), (55, 32, 22), (66, 33, 27),
(77, 35, 11)]
df = pd.DataFrame(matrix, columns=list('xyz'), index=list('abcdef'))
Will produce:
x y z
a 22 34 23
b 33 31 11
c 44 16 21
d 55 32 22
e 66 33 27
f 77 35 11
and then:
res_df = df.apply(lambda row: pd.Series([row.iloc[0] + 1, row.iloc[1] + 2, row.iloc[2] + 3]), axis=1)
will give you:
0 1 2
a 23 36 26
b 34 33 14
c 45 18 24
d 56 34 25
e 67 35 30
f 78 37 14
You can simply apply a function to a specific column:
df['x'] = df['x'].apply(lambda x: x*2)
Similar to @Chris Adams's answer, but this makes a copy of the dataframe using a dictionary comprehension and zip.
def function1(x):
    return x + 1
def function2(x):
    return x * 2
def function3(x):
    return x**2
df = pd.DataFrame({'A': [1, 2, 3], 'B': [1, 2, 3], 'C': [1, 2, 3]})
functions = [function1, function2, function3]
print(df)
# A B C
# 0 1 1 1
# 1 2 2 2
# 2 3 3 3
df_2 = pd.DataFrame({col: func(df[col]) for col, func in zip(df, functions)})
print(df_2)
# A B C
# 0 2 2 1
# 1 3 4 4
# 2 4 6 9
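As a possible alternative to the explicit loop, DataFrame.transform accepts a dictionary that maps column labels to functions, in which case each function is applied only to its own column. The sketch below reuses the example functions from the answers above; the names col_to_func and df_3 are only illustrative.

```python
import pandas as pd

def function1(x):
    return x + 1

def function2(x):
    return x * 2

def function3(x):
    return x ** 2

df = pd.DataFrame({"A": [1, 2, 3], "B": [1, 2, 3], "C": [1, 2, 3]})
functions = [function1, function2, function3]

# Pair each column label with its function; the list order must match the column order
col_to_func = dict(zip(df.columns, functions))

# With a dict, transform applies each function only to the column it is keyed by
df_3 = df.transform(col_to_func)
print(df_3)
```

Like the dictionary comprehension, this returns a new DataFrame and leaves df untouched.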

How to use two for loops to copy a list variable into a location variable in a dataframe?

I have a dataframe with two columns, locStuff and data. Someone kindly showed me how to address a location range in the df so that a change is applied by the integer stored in locStuff rather than by the dataframe index. That works fine, but now I cannot see how to change the data values of that location range using a list of values.
import pandas as pd
INDEX = list(range(1, 11))
LOCATIONS = [3, 10, 6, 2, 9, 1, 7, 5, 8, 4]
DATA = [94, 43, 85, 10, 81, 57, 88, 11, 35, 86]
# Make dataframe
DF = pd.DataFrame(LOCATIONS, columns=['locStuff'], index=INDEX)
DF['data'] = pd.Series(DATA, index=INDEX)
# Location and new value inputs
LOC_TO_CHANGE = 8
NEW_LOC_VALUE = 999
NEW_LOC_VALUE = [999,666,333]
LOC_RANGE = list(range(3, 6))
DF.iloc[3:6, 1] = ('%03d' % NEW_LOC_VALUE)
print(DF)
# I tried both of these separately
for i in NEW_LOC_VALUE:
    for j in LOC_RANGE:
        DF.iloc[j, 1] = ('%03d' % NEW_LOC_VALUE[i])
print(DF)
i=0
while i<len(NEW_LOC_VALUE):
    for j in LOC_RANGE:
        DF.iloc[j, 1] = ('%03d' % NEW_LOC_VALUE[i])
    i=+1
print(DF)
Neither of these works.
I know how to do this using loops or list comprehensions for an empty list but no idea how to adapt what I have above for a DataFrame.
Expected behaviour would be:
locStuff data
1 3 999
2 10 43
3 6 85
4 2 10
5 9 81
6 1 57
7 7 88
8 5 333
9 8 35
10 4 666
Try setting locStuff as index, assign values, and reset_index:
DF.set_index('locStuff', inplace=True)
DF.loc[LOC_RANGE, 'data'] = NEW_LOC_VALUE
DF.reset_index(inplace=True)
Output:
locStuff data
0 3 999
1 10 43
2 6 85
3 2 10
4 9 81
5 1 57
6 7 88
7 5 333
8 8 35
9 4 666
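If the original 1 to 10 index should be preserved, one option is to skip set_index and map locations to their new values through a dictionary. This is a sketch of a different technique from the answer above, and the name replacements is hypothetical.

```python
import pandas as pd

INDEX = list(range(1, 11))
LOCATIONS = [3, 10, 6, 2, 9, 1, 7, 5, 8, 4]
DATA = [94, 43, 85, 10, 81, 57, 88, 11, 35, 86]
DF = pd.DataFrame({"locStuff": LOCATIONS, "data": DATA}, index=INDEX)

LOC_RANGE = list(range(3, 6))      # locations 3, 4, 5
NEW_LOC_VALUE = [999, 666, 333]

# Location -> new value lookup: {3: 999, 4: 666, 5: 333}
replacements = dict(zip(LOC_RANGE, NEW_LOC_VALUE))

# Locations missing from the lookup map to NaN and fall back to their old data
DF["data"] = DF["locStuff"].map(replacements).fillna(DF["data"]).astype(int)
print(DF)
```

The zip pairing assumes LOC_RANGE and NEW_LOC_VALUE have the same length and are listed in matching order.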

Pandas: How to build a column based on another column which is indexed by another one?

I have this dataframe presented below. I tried a solution below, but I am not sure if this is a good solution.
import numpy as np
import pandas as pd
def creatingDataFrame():
    raw_data = {'code': [1, 2, 3, 2 , 3, 3],
                'Region': ['A', 'A', 'C', 'B' , 'A', 'B'],
                'var-A': [2,4,6,4,6,6],
                'var-B': [20, 30, 40 , 50, 10, 20],
                'var-C': [3, 4 , 5, 1, 2, 3]}
    df = pd.DataFrame(raw_data, columns = ['code', 'Region','var-A', 'var-B', 'var-C'])
    return df
if __name__=="__main__":
    df=creatingDataFrame()
    df['var']=np.where(df['Region']=='A',1.0,0.0)*df['var-A']+np.where(df['Region']=='B',1.0,0.0)*df['var-B']+np.where(df['Region']=='C',1.0,0.0)*df['var-C']
I want the variable var to take its value from column 'var-A', 'var-B' or 'var-C', depending on the region given in column 'Region'.
The result must be
df['var']
Out[50]:
0 2.0
1 4.0
2 5.0
3 50.0
4 6.0
5 20.0
Name: var, dtype: float64
You can try DataFrame.lookup (deprecated in pandas 1.2 and removed in 2.0, so this needs an older pandas):
df.columns=df.columns.str.split('-').str[-1]
df
Out[255]:
code Region A B C
0 1 A 2 20 3
1 2 A 4 30 4
2 3 C 6 40 5
3 2 B 4 50 1
4 3 A 6 10 2
5 3 B 6 20 3
df.lookup(df.index,df.Region)
Out[256]: array([ 2, 4, 5, 50, 6, 20], dtype=int64)
#df['var']=df.lookup(df.index,df.Region)
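On pandas versions without DataFrame.lookup, the same row-wise pick can be sketched with pd.factorize and NumPy integer indexing. This is an illustrative alternative, not the original answer's code; it works on the original 'var-A' style column names, and the variable names wanted_cols, codes, uniques and values are hypothetical.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "code": [1, 2, 3, 2, 3, 3],
    "Region": ["A", "A", "C", "B", "A", "B"],
    "var-A": [2, 4, 6, 4, 6, 6],
    "var-B": [20, 30, 40, 50, 10, 20],
    "var-C": [3, 4, 5, 1, 2, 3],
})

# Column to read from for each row, e.g. Region "A" -> "var-A"
wanted_cols = "var-" + df["Region"]

# codes[i] is the position of row i's column name within uniques
codes, uniques = pd.factorize(wanted_cols)

# Reorder the columns to match uniques, then pick exactly one cell per row
values = df.reindex(columns=uniques).to_numpy()
df["var"] = values[np.arange(len(df)), codes]
print(df["var"].tolist())
```

Unlike the np.where sum in the question, this keeps the integer dtype and does not need one term per region.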

Efficient evaluation of weighted average variable in a Pandas Dataframe

Please consider the dataframe df generated below:
import numpy as np
import pandas as pd
def creatingDataFrame():
    raw_data = {'code': [1, 2, 3, 2 , 3, 3],
                'var1': [10, 20, 30, 20 , 30, 30],
                'var2': [2,4,6,4,6,6],
                'price': [20, 30, 40 , 50, 10, 20],
                'sells': [3, 4 , 5, 1, 2, 3]}
    df = pd.DataFrame(raw_data, columns = ['code', 'var1','var2', 'price', 'sells'])
    return df
if __name__=="__main__":
    df=creatingDataFrame()
    setCode=set(df['code'])
    listDF=[]
    for code in setCode:
        dfCode=df[df['code'] == code].copy()
        print(dfCode)
        lenDfCode=len(dfCode)
        if(lenDfCode==1):
            # a single row: its price is already the average
            theData={'code': [dfCode['code'].iloc[0]],
                     'var1': [dfCode['var1'].iloc[0]],
                     'var2': [dfCode['var2'].iloc[0]],
                     'averagePrice': [dfCode['price'].iloc[0]],
                     'totalSells': [dfCode['sells'].iloc[0]]
                     }
        else:
            # several rows: weight each price by its number of sells
            dfCode['price*sells']=dfCode['price']*dfCode['sells']
            sumSells=np.sum(dfCode['sells'])
            sumProducts=np.sum(dfCode['price*sells'])
            dfCode['totalSells']=sumSells
            av=sumProducts/sumSells
            dfCode['averagePrice']=av
            theData={'code': [dfCode['code'].iloc[0]],
                     'var1': [dfCode['var1'].iloc[0]],
                     'var2': [dfCode['var2'].iloc[0]],
                     'averagePrice': [dfCode['averagePrice'].iloc[0]],
                     'totalSells': [dfCode['totalSells'].iloc[0]]
                     }
        dfPart=pd.DataFrame(theData, columns = ['code', 'var1','var2', 'averagePrice','totalSells'])
        listDF.append(dfPart)
    newDF = pd.concat(listDF)
    print(newDF)
I have this dataframe
code var1 var2 price sells
0 1 10 2 20 3
1 2 20 4 30 4
2 3 30 6 40 5
3 2 20 4 50 1
4 3 30 6 10 2
5 3 30 6 20 3
I want to generate the following dataframe:
code var1 var2 averagePrice totalSells
0 1 10 2 20.0 3
0 2 20 4 34.0 5
0 3 30 6 28.0 10
Note that this dataframe is created from the first by evaluating the average price and total sells for each code. Furthermore, var1 and var2 are the same for each code. The python code above does that, but I know that it is inefficient. I believe that a desired solution can be done using groupby, but I am not able to generate it.
A different approach: use groupby with apply and return a pd.Series from the lambda:
df.groupby(['code','var1','var2']).apply(lambda x : pd.Series({'averagePrice': sum(x['sells']*x['price'])/sum(x['sells']),'totalSells':sum(x['sells'])})).reset_index()
Out[366]:
code var1 var2 averagePrice totalSells
0 1 10 2 20.0 3.0
1 2 20 4 34.0 5.0
2 3 30 6 28.0 10.0
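The weighted average can also be computed without apply, by summing price * sells and sells per group and dividing afterwards. The following is a sketch of that alternative using named aggregation; the helper column revenue and the result name out are hypothetical.

```python
import pandas as pd

df = pd.DataFrame({
    "code": [1, 2, 3, 2, 3, 3],
    "var1": [10, 20, 30, 20, 30, 30],
    "var2": [2, 4, 6, 4, 6, 6],
    "price": [20, 30, 40, 50, 10, 20],
    "sells": [3, 4, 5, 1, 2, 3],
})

out = (
    df.assign(revenue=df["price"] * df["sells"])   # per-row price * sells
      .groupby(["code", "var1", "var2"], as_index=False)
      .agg(revenue=("revenue", "sum"), totalSells=("sells", "sum"))
)

# Weighted average price = sum(price * sells) / sum(sells) within each group
out["averagePrice"] = out["revenue"] / out["totalSells"]
out = out[["code", "var1", "var2", "averagePrice", "totalSells"]]
print(out)
```

Here totalSells stays an integer column, and the aggregation uses built-in sums rather than a Python-level lambda per group.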
