define range in pandas column based on defined input from list - python-3.x

I have one data frame in which I need to apply a range to one column, based on a list provided.
I am able to achieve the result using fixed values, but the input values will be dynamic, in a list format, and the range will be based on that input.
My data frame looks like below:
import pandas as pd
rangelist=[90,70,50]
data = {'Result': [75,85,95,45,76,8,10,44,22,65,35,67]}
sampledf=pd.DataFrame(data)
rangelist is my list; from it I need to create ranges like 100-90, 90-70 & 70-50. These ranges may differ from time to time. Until now I have been achieving the result using the function below.
def cat(value):
    cat = ''
    if (value > 90):
        cat = '90-100'
    if (value < 90 and value > 70):
        cat = '90-70'
    else:
        cat = '< 50'
    return cat
sampledf['category']=sampledf['Result'].apply(cat)
How can I pass dynamic values to the function "cat" based on the range list? I will be grateful if someone can help me achieve the result below.
    Result category
0       75    90-70
1       85    90-70
2       95     < 50
3       45     < 50
4       76    90-70
5        8     < 50
6       10     < 50
7       44     < 50
8       22     < 50
9       65     < 50
10      35     < 50
11      67     < 50

I would recommend pd.cut for this:
import numpy as np

sampledf['Category'] = pd.cut(sampledf['Result'],
                              [-np.inf] + sorted(rangelist) + [np.inf])
Output:
    Result      Category
0       75  (70.0, 90.0]
1       85  (70.0, 90.0]
2       95   (90.0, inf]
3       45  (-inf, 50.0]
4       76  (70.0, 90.0]
5        8  (-inf, 50.0]
6       10  (-inf, 50.0]
7       44  (-inf, 50.0]
8       22  (-inf, 50.0]
9       65  (50.0, 70.0]
10      35  (-inf, 50.0]
11      67  (50.0, 70.0]
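If you want labels in the same "90-70" style as the question rather than Interval objects, you can build them dynamically from the sorted edges and pass them via the labels argument. A minimal sketch, assuming the '< low' / 'lo-hi' / '> high' label format is acceptable (adapt it as needed):

import numpy as np
import pandas as pd

rangelist = [90, 70, 50]
sampledf = pd.DataFrame({'Result': [75, 85, 95, 45, 76, 8, 10, 44, 22, 65, 35, 67]})

# Bin edges: [-inf, 50, 70, 90, inf]
edges = [-np.inf] + sorted(rangelist) + [np.inf]

# One label per bin: open-ended labels for the two outermost bins,
# "lo-hi" for every bin in between.
inner = ['{}-{}'.format(lo, hi) for lo, hi in zip(edges[1:-2], edges[2:-1])]
labels = ['< {}'.format(edges[1])] + inner + ['> {}'.format(edges[-2])]

sampledf['category'] = pd.cut(sampledf['Result'], edges, labels=labels)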

import numpy as np

# breaks must be sorted in descending order for the lookups below
breaks = pd.Series([100, 90, 75, 50, 45, 20, 0])
# index of the first break that each Result is greater than or equal to
sampledf["ind"] = sampledf.Result.apply(lambda x: np.where(x >= breaks)[0][0])
# pair that break with the next-higher one to form the bracket
sampledf["category"] = sampledf.ind.apply(lambda i: (breaks[i], breaks[i - 1]))
sampledf
#     Result  ind  category
# 0       75    2  (75, 90)
# 1       85    2  (75, 90)
# 2       95    1  (90, 100)
# 3       45    4  (45, 50)
# 4       76    2  (75, 90)
# 5        8    6  (0, 20)
# 6       10    6  (0, 20)
# 7       44    5  (20, 45)
# 8       22    5  (20, 45)
# 9       65    3  (50, 75)
# 10      35    5  (20, 45)
# 11      67    3  (50, 75)

Related

pandas dataframe slicing to a subset from row #y1 to row #y2

I can't see the forest for the trees right now:
I have a Pandas dataframe:
import pandas as pd
df = pd.DataFrame({'UTCs': [32776, 32777, 32778, 32779, 32780, 32781, 32782, 32783],
                   'Temperature': [5, 7, 7, 9, 12, 9, 9, 4],
                   'Humidity': [50, 50, 48, 47, 46, 47, 48, 52],
                   'pressure': [998, 998, 999, 999, 999, 999, 1000, 1000]})
print(df)
    UTCs  Temperature  Humidity  pressure
0  32776            5        50       998
1  32777            7        50       998
2  32778            7        48       999
3  32779            9        47       999
4  32780           12        46       999
5  32781            9        47       999
6  32782            9        48      1000
7  32783            4        52      1000
Now I want to create a subset of all dataset columns for UTCs between 32778 and 32782
I can choose a subset with:
df_sub=df.iloc[2:7,:]
print(df_sub)
    UTCs  Temperature  Humidity  pressure
2  32778            7        48       999
3  32779            9        47       999
4  32780           12        46       999
5  32781            9        47       999
6  32782            9        48      1000
But how can I do that with a condition like 'choose rows between UTCs=32778 and UTCs=32782'?
Something like
df_sub = df.iloc[df[df.UTCs == 32778] : df[df.UTCs == 32783], : ]
does not work.
Any hint for me?
Use between for boolean indexing:
df_sub = df[df['UTCs'].between(32778, 32783, inclusive='left')]
Output:
    UTCs  Temperature  Humidity  pressure
2  32778            7        48       999
3  32779            9        47       999
4  32780           12        46       999
5  32781            9        47       999
6  32782            9        48      1000
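If your pandas version predates the string form of inclusive (between accepted only a boolean before pandas 1.3), an explicit boolean mask is equivalent:

df_sub = df[(df['UTCs'] >= 32778) & (df['UTCs'] <= 32782)]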

Creating a list from series of pandas

I'm trying to create a list from 3 different series, which will be of the shape "({A} {B} {C})", where A denotes the 1st element from series 1, B the 1st element from series 2, and C the 1st element from series 3; this way it should create a list containing 600 elements.
List 1       List 2       List 3
u_p0     1   v_p0     2   w_p0     7
u_p1    21   v_p1    11   w_p1    45
u_p2    32   v_p2    25   w_p2    32
u_p3    45   v_p3    76   w_p3    49
...          ...          ...
u_p599  56   v_p599  78   w_p599  98
Now I want the output list as follows
(1 2 7)
(21 11 45)
(32 25 32)
(45 76 49)
.....
These are the 3 series I created from a dataframe
r1=turb_1.iloc[qw1] #List1
r2=turb_1.iloc[qw2] #List2
r3=turb_1.iloc[qw3] #List3
For the output I think Python's formatted-string method will be useful, but I'm not quite sure how to proceed.
turb_3= ["({A} {B} {C})".format(A=i,B=j,C=k) for i in r1 for j in r2 for k in r3]
Any kind of help will be useful.
Use pandas.DataFrame.itertuples with str.format:
# Sample data
print(df)
   col1  col2  col3
0     1     2     7
1    21    11    45
2    32    25    32
3    45    76    49
fmt = "({} {} {})"
[fmt.format(*tup) for tup in df[["col1", "col2", "col3"]].itertuples(False, None)]
Output:
['(1 2 7)', '(21 11 45)', '(32 25 32)', '(45 76 49)']
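Note that the comprehension in the question nests three independent for clauses, which yields the Cartesian product of the three series (600 ** 3 strings) rather than pairing them element-wise. If r1, r2 and r3 are the three series, zip walks them in lockstep instead; a sketch, assuming all three have the same length:

turb_3 = ["({} {} {})".format(a, b, c) for a, b, c in zip(r1, r2, r3)]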

Python 3 script uses too much memory

As homework for my IT lessons I need to write a script which finds the highest power of 4 contained in a modified input number, but I can use only 8 MB of RAM. I used a logarithmic function for this, so my code looks like this:
from math import log, floor

n = int(input())
numbers = []
for i in range(0, n):
    numbers.append(floor(int(input()) / 10))
for i in numbers:
    print(4 ** floor(log(i, 4)))
But I checked this script on my PC and it uses more than 8MB!
Partition of a set of 74690 objects. Total size = 8423721 bytes.
 Index  Count   %     Size   %  Cumulative   %  Kind (class / dict of class)
     0  23305  31  2100404  25     2100404  25  str
     1  19322  26  1450248  17     3550652  42  tuple
     2   5017   7   724648   9     4275300  51  types.CodeType
     3   9953  13   716915   9     4992215  59  bytes
     4    742   1   632536   8     5624751  67  type
     5   4618   6   628048   7     6252799  74  function
     6    742   1   405720   5     6658519  79  dict of type
     7    187   0   323112   4     6981631  83  dict of module
     8    612   1   278720   3     7260351  86  dict (no owner)
     9     63   0   107296   1     7367647  87  set
<197 more rows. Type e.g. '_.more' to view.>
On my phone, however, this script uses only 2.5MB:
Partition of a set of 35586 objects. Total size = 2435735 bytes.
 Index  Count   %    Size   %  Cumulative   %  Kind (class / dict of class)
     0   9831  28  649462  27      649462  27  str
     1   9014  25  365572  15     1015034  42  tuple
     2   4669  13  261232  11     1276266  52  bytes
     3   2357   7  198684   8     1474950  61  types.CodeType
     4    436   1  166276   7     1641226  67  type
     5   2156   6  155232   6     1796458  74  function
     6    436   1  130836   5     1927294  79  dict of type
     7     93   0   87384   4     2014678  83  dict of module
     8    237   1   62280   3     2076958  85  dict (no owner)
     9   1091   3   48004   2     2124962  87  types.WrapperDescriptorType
<115 more rows. Type e.g. '_.more' to view.>
I tried changing list to tuple, but it didn't make any difference.
Is there any possibility to decrease/limit RAM usage?
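One way to shrink the footprint is to avoid building the numbers list at all and to drop the math import, printing each answer as its input is read. A sketch, assuming the grader accepts interleaved input and output:

n = int(input())
for _ in range(n):
    i = int(input()) // 10
    # Find the highest power of 4 not exceeding i; integer arithmetic
    # also sidesteps floating-point imprecision in log(i, 4).
    p = 1
    while p * 4 <= i:
        p *= 4
    print(p)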

DataFrameGroupby.agg NamedAgg on same column errors out on custom function, but works on built-in function

Setup
np.random.seed(0)
df = pd.DataFrame(zip([1, 1, 2, 2, 2, 3, 7, 7, 9, 10],
                      *np.random.randint(1, 100, 20).reshape(-1, 10)),
                  columns=['A', 'B', 'C'])
Out[127]:
    A   B   C
0   1  45  71
1   1  48  89
2   2  65  89
3   2  68  13
4   2  68  59
5   3  10  66
6   7  84  40
7   7  22  88
8   9  37  47
9  10  88  89
f = lambda x: x.max()
NamedAgg on built-in function works fine
df.groupby('A').agg(B_min=('B', 'min'), B_max=('B', 'max'), C_max=('C', 'max'))
Out[133]:
    B_min  B_max  C_max
A
1      45     48     89
2      65     68     89
3      10     10     66
7      22     84     88
9      37     37     47
10     88     88     89
NamedAgg on custom function f errors out
df.groupby('A').agg(B_min=('B', 'min'), B_max=('B', f), C_max=('C', 'max'))
KeyError: "[('B', '<lambda>')] not in index"
Is there any explanation for this error? Is this error an intentional restriction?
The issue is caused by _mangle_lambda_list, which gets called at some point during the aggregation. There seems to be a mismatch: the resulting aggregation gets renamed, but the ordered list of output columns that is later used to reindex the result does not get changed. Since that function specifically checks for com.get_callable_name(aggfunc) == "<lambda>", any name other than '<lambda>' will work without issue:
Sample data
import pandas as pd
import numpy as np
np.random.seed(0)
df = pd.DataFrame(zip([1, 1, 2, 2, 2, 3, 7, 7, 9, 10],
                      *np.random.randint(1, 100, 20).reshape(-1, 10)),
                  columns=['A', 'B', 'C'])
f = lambda x: x.max()
kwargs = {'B_min': ('B', 'min'), 'B_max':('B', f), 'C_max':('C', 'max')}
Here are the most relevant major steps that get called when you aggregate, and we can see where the KeyError comes from.
func, columns, order = pd.core.groupby.generic._normalize_keyword_aggregation(kwargs)
print(order)
#[('B', 'min'), ('B', '<lambda>'), ('C', 'max')]
func = pd.core.groupby.generic._maybe_mangle_lambdas(func)
df.groupby('A')._aggregate(func)
#          B                 C
#        min  <lambda_0>   max   # the _0 suffix ruins indexing with ('B', '<lambda>')
#A
#1        45          48    89
#2        65          68    89
#3        10          10    66
#7        22          84    88
#9        37          37    47
#10       88          88    89
Because _mangle_lambda_list is only called when there are multiple aggregations for the same column, you can get away with the '<lambda>' name, so long as it is the only aggregation for that column.
df.groupby('A').agg(A_min=('A', 'min'), B_max=('B', f))
#    A_min  B_max
#A
#1       1     48
#2       2     68
#3       3     10
#7       7     84
#9       9     37
#10     10     88
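Following that logic, giving the callable any name other than '<lambda>' works even with multiple aggregations on the same column; a sketch using a plain def, whose __name__ ('f_max') is what get_callable_name reports:

def f_max(x):
    return x.max()

df.groupby('A').agg(B_min=('B', 'min'), B_max=('B', f_max), C_max=('C', 'max'))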

Pandas: Random integer between values in two columns

How can I create a new column that contains a random integer between the values of two other columns in each particular row?
Example df:
import pandas as pd
import numpy as np
data = pd.DataFrame({'start': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
                     'end': [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]})
data = data.iloc[:, [1, 0]]
Now I am trying something like this:
data['rand_between'] = data.apply(lambda x: np.random.randint(data.start, data.end))
or
data['rand_between'] = np.random.randint(data.start, data.end)
But it doesn't work, of course, because data.start is a Series, not a number.
How can I use numpy.random with data from columns as a vectorized operation?
You are close; you need to specify axis=1 to process the data by rows, and to change data.start/end to x.start/end so you work with scalars:
data['rand_between'] = data.apply(lambda x: np.random.randint(x.start, x.end), axis=1)
Another possible solution:
data['rand_between'] = [np.random.randint(s, e) for s,e in zip(data['start'], data['end'])]
print (data)
   start  end  rand_between
0      1   10             8
1      2   20             3
2      3   30            23
3      4   40            35
4      5   50            30
5      6   60            28
6      7   70            60
7      8   80            14
8      9   90            85
9     10  100            83
If you want to truly vectorize this, you can generate a random number between 0 and 1 and scale it to your start/end range (note that the + 1 makes the end value inclusive, unlike np.random.randint):
(
    data['start'] + np.random.rand(len(data)) * (data['end'] - data['start'] + 1)
).astype('int')
Out:
0     1
1    18
2    18
3    35
4    22
5    27
6    35
7    23
8    33
9    81
dtype: int64
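On a recent numpy (>= 1.17) the newer Generator API accepts array-like low/high and broadcasts them, giving a truly vectorized integer draw; a sketch, noting that integers excludes the upper bound by default, just like np.random.randint:

rng = np.random.default_rng()
data['rand_between'] = rng.integers(data['start'], data['end'])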
