How can I speed this up with Numpy? - python-3.x

So I've been struggling with this for two days and finally managed to make it work, but I wonder if there is a way to speed it up, since I have a lot of data to process.
The goal is, for each column of my DataFrame, to compute an incremental sum down the column (elt(n-1) + elt(n)), take the absolute value at each step, and compare it to the previous absolute value so that, at the last element of the window, I end up with the maximum. I thought simply using a rolling sum or a plain column sum would work, but somehow I can't get it right. These maxima are calculated over a rolling window of 2000 lines (so for element n I take the elements from line n until line n+2000, and so on). In the end, I will have a DataFrame with the length of the original one minus 2000 elements.
About the speed: this takes around 1 minute to complete for all 4 columns (and this is for a relatively small file of only around 5000 elements; most of them would be 4 times bigger).
Ideally, I'd like to massively speed up what is inside the "for pulse in range(2000):" loop, but if I can speed up the entire code that's also fine.
I'm not sure exactly how I could use a list comprehension for this. I checked the numpy accumulate() function and rolling(), but they don't give me what I want.
edit1: indents.
edit2: here is an example of the first 10 lines of input and output, for the first column only (to keep it readable). The thing is that you need a minimum of 2000 lines of input to obtain the first item in the results, so I'm not sure it's really useful here.
Input :
-2.1477511E-12
2.0970403E-12
2.0731764E-12
1.7241669E-12
1.2260080E-12
7.3381503E-13
8.2330457E-13
-9.2472616E-13
-1.1275693E-12
-1.3184806E-12
Output:
2.25436311E-10
2.28640040E-10
2.27405083E-10
2.25331907E-10
2.23607740E-10
2.22381732E-10
2.21647917E-10
2.20824612E-10
2.21749338E-10
2.22876908E-10
Here's my code:
ys_integral_check_reduced = ys_integral_check[['A', 'B', 'C', 'D']]
for col in ys_integral_check_reduced.columns:
    pulse = 0
    i = 0
    while (ys_integral_check_reduced.loc[i+1999, col] != 0 and i < len(ys_integral_check_reduced)-2000):
        cur = 0
        max = 0
        for pulse in range(2000):
            cur = cur + ys_integral_check_reduced.loc[i+pulse, col]
            if abs(cur) > max:
                max = abs(cur)
            pulse = pulse+1
        ys_integral_check_reduced_final.loc[i, col] = max
        i = i+1
print(ys_integral_check_reduced_final)

If I understood correctly, I created a toy example (WINDOW size of 3).
import pandas as pd

WINDOW = 3
ys_integral_check = pd.DataFrame({'A': [1, 2, -5, -6, 1, -10, -1, -10, 7, 4, 5, 6],
                                  'B': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]})
ys_integral_check['C'] = -ys_integral_check['B']
Which looks like this:
A B C
0 1 1 -1
1 2 2 -2
2 -5 3 -3
3 -6 4 -4
4 1 5 -5
5 -10 6 -6
6 -1 7 -7
7 -10 8 -8
8 7 9 -9
9 4 10 -10
10 5 11 -11
11 6 12 -12
Your solution gives:
ys_integral_check_reduced_final = pd.DataFrame(columns=['A', 'B', 'C'])
ys_integral_check_reduced = ys_integral_check[['A', 'B', 'C']]
for col in ys_integral_check_reduced.columns:
    pulse = 0
    i = 0
    while (ys_integral_check_reduced.loc[i+WINDOW-1, col] != 0 and i < len(ys_integral_check_reduced)-WINDOW):
        cur = 0
        max = 0
        for pulse in range(WINDOW):
            cur = cur + ys_integral_check_reduced.loc[i+pulse, col]
            if abs(cur) > max:
                max = abs(cur)
            pulse = pulse+1
        ys_integral_check_reduced_final.loc[i, col] = max
        i = i+1
print(ys_integral_check_reduced_final)
A B C
0 3 6 6
1 9 9 9
2 11 12 12
3 15 15 15
4 10 18 18
5 21 21 21
6 11 24 24
7 10 27 27
8 16 30 30
Here is a variant using Pandas and Rolling.apply():
ys_integral_check_reduced_final = (ys_integral_check[['A', 'B', 'C']]
                                   .rolling(WINDOW)
                                   .apply(lambda w: w.cumsum().abs().max())
                                   .dropna()
                                   .reset_index(drop=True))
Which gives:
A B C
0 3.0 6.0 6.0
1 9.0 9.0 9.0
2 11.0 12.0 12.0
3 15.0 15.0 15.0
4 10.0 18.0 18.0
5 21.0 21.0 21.0
6 11.0 24.0 24.0
7 10.0 27.0 27.0
8 16.0 30.0 30.0
9 15.0 33.0 33.0
There is an extra row, because I believe your solution skips a possible window at the end.
I tested it on a random DataFrame with 100'000 rows and 3 columns and a window size of 2000 and it took 18 seconds to process:
import time
import numpy as np
WINDOW = 2000
DF_SIZE = 100000
test_df = pd.DataFrame(np.random.random((DF_SIZE, 3)), columns=list('ABC'))
t0 = time.time()
test_df.rolling(WINDOW).apply(lambda w: w.cumsum().abs().max()).dropna().reset_index(drop=True)
t1 = time.time()
print(t1-t0) # 18.102170944213867
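If 18 seconds is still too slow, a fully vectorized NumPy variant is possible. The sketch below is my own assumption (it needs NumPy >= 1.20 for sliding_window_view, and the intermediate cumsum array costs roughly rows × WINDOW floats of memory per column), but it computes the same max(|cumsum|) per window without a Python-level lambda:
import numpy as np
import pandas as pd
from numpy.lib.stride_tricks import sliding_window_view

def rolling_abs_cumsum_max(df, window):
    """For each column, take max(|cumulative sum|) over every rolling window."""
    out = {}
    for col in df.columns:
        # one row per window position, shape (n - window + 1, window)
        windows = sliding_window_view(df[col].to_numpy(), window)
        # cumulative sum inside each window, then the largest absolute value
        out[col] = np.abs(np.cumsum(windows, axis=1)).max(axis=1)
    return pd.DataFrame(out)

print(rolling_abs_cumsum_max(test_df, WINDOW))  # should match the rolling().apply() result above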

Related

Pandas Min and Max Across Rows

I have a dataframe that looks like the one below. I want to get the min and max value per city, along with which products were ordered the min and max number of times in that city. Please help.
db.min(axis=0) - min value for each column
db.min(axis=1) - min value for each row
Use DataFrame.min and DataFrame.max:
DataFrame.min(axis=None, skipna=None, level=None, numeric_only=None, **kwargs)
DataFrame.max(axis=None, skipna=None, level=None, numeric_only=None, **kwargs)
matrix = [(22, 16, 23),
          (33, 50, 11),
          (44, 34, 11),
          (55, 35, 60),
          (66, 36, 13)]

dfObj = pd.DataFrame(matrix, index=list('abcde'), columns=list('xyz'))
    x   y   z
a  22  16  23
b  33  50  11
c  44  34  11
d  55  35  60
e  66  36  13
Get a series containing the minimum value of each row
minValuesObj = dfObj.min(axis=1)
print('minimum value in each row : ')
print(minValuesObj)
output
minimum value in each row :
a    16
b    11
c    11
d    35
e    13
dtype: int64
MMT Marathi, based on the answers provided by Danil and Sutharp777, you should be able to get to your answer. However, I see you have questions for them. Not sure if you are looking for a column to be created that has the min/max value for each row.
Here's the full dataframe with the solution. I am merely compiling the answers they have already given
import pandas as pd

d = [['20in Monitor',2,2,1,2,2,2,2,2,2],
     ['27in 4k Gaming Monitor',2,1,2,2,1,2,2,2,2],
     ['27in FHD Monitor',2,2,2,2,2,2,2,2,2],
     ['34in Ultrawide Monitor',2,1,2,2,2,2,2,2,2],
     ['AA Batteries (4-pack)',5,5,6,7,6,6,6,6,5],
     ['AAA Batteries (4-pack)',7,7,8,8,9,7,8,9,7],
     ['Apple Airpods Headphones',2,2,3,2,2,2,2,2,2],
     ['Bose SoundSport Headphones',2,2,2,2,3,2,2,3,2],
     ['Flatscreen TV',2,1,2,2,2,2,2,2,2]]
c = ['Product','Atlanta','Austin','Boston','Dallas','Los Angeles',
     'New York City','Portland','San Francisco','Seattle']

df = pd.DataFrame(d, columns=c)
df['min_value'] = df.min(axis=1)
df['max_value'] = df.max(axis=1)
print(df)
The output of this will be:
Product Atlanta Austin ... Seattle min_value max_value
0 20in Monitor 2 2 ... 2 1 2
1 27in 4k Gaming Monitor 2 1 ... 2 1 2
2 27in FHD Monitor 2 2 ... 2 2 2
3 34in Ultrawide Monitor 2 1 ... 2 1 2
4 AA Batteries (4-pack) 5 5 ... 5 5 7
5 AAA Batteries (4-pack) 7 7 ... 7 7 9
6 Apple Airpods Headphones 2 2 ... 2 2 3
7 Bose SoundSport Headphones 2 2 ... 2 2 3
8 Flatscreen TV 2 1 ... 2 1 2
If you want the min and max of each column, then you can do this:
print ('min of each column :', df.min(axis=0).to_list()[1:])
print ('max of each column :', df.max(axis=0).to_list()[1:])
This will give you:
min of each column : [2, 1, 1, 2, 1, 2, 2, 2, 2, 1, 2]
max of each column : [7, 7, 8, 8, 9, 7, 8, 9, 7, 7, 9]
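To also answer the second half of the question (which product is behind each min/max), a small sketch of my own, assuming 'Product' can serve as the index, uses DataFrame.idxmin and DataFrame.idxmax:
# set Product as the index so idxmin/idxmax return product names per city
per_city = df.set_index('Product').drop(columns=['min_value', 'max_value'])
print(per_city.idxmin())  # product ordered least in each city
print(per_city.idxmax())  # product ordered most in each city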

Locate dataframe rows where values are outside bounds specified for each column

I have a dataframe with k columns and n rows, k ~= 10, n ~= 1000. I have a (2, k) array representing bounds on values for each column, e.g.:
# For 5 columns
bounds = ([0.1, 1, 0.1, 5, 10],
          [10, 1000, 1, 1000, 50])
# Example df
a b c d e
0 5 3 0.3 17 12
1 12 50 0.5 2 31
2 9 982 0.2 321 21
3 1 3 1.2 92 48
# Expected output with bounds given above
a b c d e
0 5 3 0.3 17 12
2 9 982 0.2 321 21
Crucially, the bounds on each column are different.
I would like to identify and exclude all rows of the dataframe where any column value falls outside the bounds for that respective column, preferably using array operations rather than iterating over the dataframe. The best I can think of so far involves iterating over the columns (which isn't too bad but still seems less than ideal):
for i in range(len(df.columns)):
    col = df.columns[i]
    df = df.query(f'{bounds[0][i]} < {col} < {bounds[1][i]}')
Is there a better way to do this? Or alternatively, to select only the rows where all column values are within the respective bounds?
One way using pandas.DataFrame.apply with pandas.Series.between:
bounds = dict(zip(df.columns, zip(*bounds)))
new_df = df[~df.apply(lambda x: ~x.between(*bounds[x.name])).any(axis=1)]
print(new_df)
Output:
a b c d e
0 5 3 0.3 17 12
2 9 982 0.2 321 21
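A fully array-based alternative (a sketch on my part, assuming all columns are numeric and keeping the strict inequalities from the question) broadcasts the whole values array against the bounds in one shot:
import numpy as np

# the original bounds tuple from the question (not the dict built above)
lower = np.array([0.1, 1, 0.1, 5, 10])
upper = np.array([10, 1000, 1, 1000, 50])

vals = df.to_numpy()
# each column is compared against its own lower/upper bound via broadcasting
mask = ((vals > lower) & (vals < upper)).all(axis=1)
print(df[mask])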

Replace a column value with number using pandas

For the following dataset, I can replace column 1 with the numeric value easily.
df['1'].replace(['A', 'B', 'C', 'D'], [0, 1, 2, 3], inplace=True)
But if I have 3600 or more distinct values in a column, how can I replace them with numeric values without writing out every value by hand?
Please let me know. I don't understand how to do that. If anybody has any solution please share with me.
Thanks in advance.
import pandas as pd
df = pd.DataFrame({1: ['A', 'B', 'C', 'C', 'D', 'A'],
                   2: [0.6, 0.9, 5, 4, 7, 1],
                   3: [0.3, 1, 0.7, 8, 2, 4]})
print(df)
1 2 3
0 A 0.6 0.3
1 B 0.9 1.0
2 C 5.0 0.7
3 C 4.0 8.0
4 D 7.0 2.0
5 A 1.0 4.0
np.where makes it easy.
import numpy as np
df[1] = np.where(df[1]=="A", "0",
        np.where(df[1]=="B", "1",
        np.where(df[1]=="C", "2",
        np.where(df[1]=="D", "3", np.nan))))
print(df)
1 2 3
0 0 0.6 0.3
1 1 0.9 1.0
2 2 5.0 0.7
3 2 4.0 8.0
4 3 7.0 2.0
5 0 1.0 4.0
But if you have a lot of categories, you might want to think about other ways.
import string
upper=list(string.ascii_uppercase)
a=pd.DataFrame({'Alp':upper})
print(a)
Alp
0 A
1 B
2 C
3 D
4 E
5 F
6 G
7 H
8 I
9 J
.
.
19 T
20 U
21 V
22 W
23 X
24 Y
25 Z
for k in np.arange(0, 26):
    a = a.replace(to_replace=upper[k], value=k)
print(a)
Alp
0 0
1 1
2 2
3 3
4 4
5 5
6 6
7 7
8 8
9 9
.
.
.
21 21
22 22
23 23
24 24
25 25
If there are many values to replace, you can use factorize:
df[1] = pd.factorize(df[1])[0] + 1
print (df)
1 2 3
0 1 0.6 0.3
1 2 0.9 1.0
2 3 5.0 0.7
3 3 4.0 8.0
4 4 7.0 2.0
5 1 1.0 4.0
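Note that factorize also returns the unique labels, which is handy if you later need to map the codes back; a small sketch of my own, run on the original (un-replaced) column:
codes, uniques = pd.factorize(df[1])
print(uniques)         # unique values in first-seen order, e.g. ['A', 'B', 'C', 'D']
print(uniques[codes])  # reconstructs the original column from the codes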
You could do something like
df.loc[df['1'] == 'A', '1'] = 0
df.loc[df['1'] == 'B', '1'] = 1

### Or

keys = df['1'].unique().tolist()
i = 0
for key in keys:
    df.loc[df['1'] == key, '1'] = i
    i = i + 1
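A compact variant of that loop (a sketch of mine, not from the answer above) builds the mapping once with a dict comprehension and Series.map:
# map each unique value to its position in first-seen order
mapping = {val: i for i, val in enumerate(df[1].unique())}
df[1] = df[1].map(mapping)
print(df)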

How To Average Top Or Bottom 'n' Values In Python

How do you calculate the average of the top or bottom 'n' values in Python? In the example below, column c2 is the average of the top 2 values of c1 over the last 4 days.
c0 c1 c2
1 2 na
2 2 na
3 3 na
4 5 4
5 6 5.5
6 7 6.5
7 5 6.5
8 4 6.5
9 5 6
10 5 5
Sort the list, get the sum of the last n numbers and divide it by n:
def avg_of_top_n(l, n):
    return sum(sorted(l)[-n:]) / n

l = [2, 2, 3, 5, 6, 7, 5, 4, 5, 5]
for i in range(4, 11):
    print(avg_of_top_n(l[i - 4: i], 2))
This outputs:
4.0
5.5
6.5
6.5
6.5
6.0
5.0
You could first sort the list of values, then slice off n of them (the last n for the top, or the first n for the bottom). Then average them by dividing the sum of the slice by the number of values in it.
n = 2
new_list = [1, 5, 4, 3, 2, 6]
new_list.sort()
top_vals = new_list[-n:]  # last n after sorting = top n (use new_list[:n] for the bottom n)
top_vals_avg = sum(top_vals) / float(len(top_vals))
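Since the question is framed around a DataFrame column, a pandas sketch (my assumption: the daily values sit in a column named 'c1') can compute c2 directly with rolling():
import pandas as pd

df = pd.DataFrame({'c1': [2, 2, 3, 5, 6, 7, 5, 4, 5, 5]})
# for each 4-day window, average the 2 largest values
df['c2'] = df['c1'].rolling(4).apply(lambda w: w.nlargest(2).mean())
print(df)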

How to expand Python Pandas Dataframe in linearly spaced increments

Beginner question:
I have a pandas dataframe that looks like this:
x1 y1 x2 y2
0 0 2 2
10 10 12 12
and I want to expand that dataframe by half units along the x and y coordinates to look like this:
x1 y1 x2 y2 Interpolated_X Interpolated_Y
0 0 2 2 0 0
0 0 2 2 0.5 0.5
0 0 2 2 1 1
0 0 2 2 1.5 1.5
0 0 2 2 2 2
10 10 12 12 10 10
10 10 12 12 10.5 10.5
10 10 12 12 11 11
10 10 12 12 11.5 11.5
10 10 12 12 12 12
Any help would be much appreciated.
The cleanest way I know to expand rows like this is through groupby.apply. It may be faster to use something like itertuples, but the code gets a little more complicated (keep that in mind if your dataset is larger).
Group by the index, which sends each row to the apply function (your index has to be unique for each row; if it's not, just run reset_index first). The apply function can return a DataFrame, so a single row can be expanded into multiple rows.
Caveat: your x2-x1 and y2-y1 distances must be the same, or this won't work.
import pandas as pd
import numpy as np

def expand(row):
    row = row.iloc[0]  # apply passes a one-row DataFrame, so grab that single row
    xdistance = (row.x2 - row.x1)
    ydistance = (row.y2 - row.y1)
    xsteps = np.arange(row.x1, row.x2 + .5, .5)  # create the step arrays
    ysteps = np.arange(row.y1, row.y2 + .5, .5)
    return (pd.DataFrame([row] * len(xsteps))  # lists expand by multiplication: [val] * 3 = [val, val, val]
            .assign(int_x=xsteps, int_y=ysteps))

(df.groupby(df.index)                  # "group" on each row
   .apply(expand)                      # send each row to the expand function
   .reset_index(level=1, drop=True))   # groupby adds an extra index level we don't want
starting df
x1 y1 x2 y2
0 0 2 2
10 10 12 12
ending df
x1 y1 x2 y2 int_x int_y
0 0 0 2 2 0.0 0.0
0 0 0 2 2 0.5 0.5
0 0 0 2 2 1.0 1.0
0 0 0 2 2 1.5 1.5
0 0 0 2 2 2.0 2.0
1 10 10 12 12 10.0 10.0
1 10 10 12 12 10.5 10.5
1 10 10 12 12 11.0 11.0
1 10 10 12 12 11.5 11.5
1 10 10 12 12 12.0 12.0
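The itertuples route mentioned at the top of this answer might look roughly like the sketch below (my rough version, building one small frame per row and concatenating at the end):
frames = []
for row in df.itertuples(index=False):
    xsteps = np.arange(row.x1, row.x2 + .5, .5)
    ysteps = np.arange(row.y1, row.y2 + .5, .5)
    part = pd.DataFrame([row._asdict()] * len(xsteps))  # repeat the original row
    part['int_x'] = xsteps
    part['int_y'] = ysteps
    frames.append(part)
print(pd.concat(frames, ignore_index=True))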
