How do you calculate the average of the top or bottom n values in Python? In the example below, column c2 holds the average of the top 2 values of c1 over the last 4 days.
c0 c1 c2
1 2 na
2 2 na
3 3 na
4 5 4
5 6 5.5
6 7 6.5
7 5 6.5
8 4 6.5
9 5 6
10 5 5
Sort the list, get the sum of the last n numbers and divide it by n:
def avg_of_top_n(l, n):
    return sum(sorted(l)[-n:]) / n

l = [2, 2, 3, 5, 6, 7, 5, 4, 5, 5]
for i in range(4, 11):
    print(avg_of_top_n(l[i - 4: i], 2))
This outputs:
4.0
5.5
6.5
6.5
6.5
6.0
5.0
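If the data lives in a pandas column like c1 in the question, the same idea drops into rolling().apply(); a minimal sketch, assuming the column names from the question's example:
import numpy as np
import pandas as pd

df = pd.DataFrame({'c1': [2, 2, 3, 5, 6, 7, 5, 4, 5, 5]})
# average of the top 2 values in each trailing 4-row window;
# the first 3 rows have no complete window and come out as NaN ("na" in the question)
df['c2'] = df['c1'].rolling(4).apply(lambda a: np.sort(a)[-2:].mean(), raw=True)
print(df)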
You could first sort the list of values in descending order, then grab the first n values into a new list. Then average them by dividing the sum of that list by the number of values in it:
n = 2
new_list = [1, 5, 4, 3, 2, 6]
new_list.sort(reverse=True)  # descending, so the largest values come first
top_vals = new_list[:n]
top_vals_avg = sum(top_vals) / float(len(top_vals))
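Printing the result for this sample list:
print(top_vals_avg)  # 5.5: the average of the two largest values, 6 and 5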
So I've been struggling with this for 2 days now and finally managed to make it work, but I wonder if there is a way to speed this up, since I have a loooot of data to process.
The goal here is, for each line of every column of my DataFrame, to compute an incremental sum (elt(n-1) + elt(n)), take the absolute value, and compare each local absolute value to the previous one so that, at the last element of the column, I obtain the max value. I thought simply using a rolling sum or a simple column sum would work, but somehow I can't make it. Those maxima are calculated over 2000 lines, rolling (so for element n I take the elements from line n until line n+2000, etc.). In the end, I will have a DataFrame with the length of the original DataFrame, minus 2000 elements.
About the speed: this takes around 1 minute to complete for all 4 columns (and this is for a relatively small file of around 5000 elements only; most of them would be 4 times bigger).
Ideally, I'd like to massively speed up what is inside the "for pulse in range(2000):" loop, but if I can speed up the entire code that's also fine.
I'm not sure exactly how I could use a list comprehension for this. I checked the numpy accumulate() function and rolling(), but they don't give me what I want.
edit1: indents.
edit2: here is an example of the first 10 lines of input and output, for the first column only (to keep it less busy here). The thing is that you need a minimum of 2000 lines of input to obtain the first item in the results, so I'm not sure it's really useful here.
Input :
-2.1477511E-12
2.0970403E-12
2.0731764E-12
1.7241669E-12
1.2260080E-12
7.3381503E-13
8.2330457E-13
-9.2472616E-13
-1.1275693E-12
-1.3184806E-12
Output:
2.25436311E-10
2.28640040E-10
2.27405083E-10
2.25331907E-10
2.23607740E-10
2.22381732E-10
2.21647917E-10
2.20824612E-10
2.21749338E-10
2.22876908E-10
Here's my code:
ys_integral_check_reduced = ys_integral_check[['A', 'B', 'C', 'D']]
for col in ys_integral_check_reduced.columns:
    pulse = 0
    i = 0
    while (ys_integral_check_reduced.loc[i+1999, col] != 0 and i < len(ys_integral_check_reduced)-2000):
        cur = 0
        max = 0
        for pulse in range(2000):
            cur = cur + ys_integral_check_reduced.loc[i+pulse, col]
            if abs(cur) > max:
                max = abs(cur)
            pulse = pulse+1
        ys_integral_check_reduced_final.loc[i, col] = max
        i = i+1
print(ys_integral_check_reduced_final)
If I understood correctly, I created a toy example (WINDOW size of 3).
import pandas as pd
WINDOW = 3
ys_integral_check = pd.DataFrame({'A': [1, 2, -5, -6, 1, -10, -1, -10, 7, 4, 5, 6],
                                  'B': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]})
ys_integral_check['C'] = -ys_integral_check['B']
Which looks like this:
A B C
0 1 1 -1
1 2 2 -2
2 -5 3 -3
3 -6 4 -4
4 1 5 -5
5 -10 6 -6
6 -1 7 -7
7 -10 8 -8
8 7 9 -9
9 4 10 -10
10 5 11 -11
11 6 12 -12
Your solution gives:
ys_integral_check_reduced_final = pd.DataFrame(columns=['A', 'B', 'C'])
ys_integral_check_reduced = ys_integral_check[['A', 'B', 'C']]
for col in ys_integral_check_reduced.columns:
    pulse = 0
    i = 0
    while (ys_integral_check_reduced.loc[i+WINDOW-1, col] != 0 and i < len(ys_integral_check_reduced)-WINDOW):
        cur = 0
        max = 0
        for pulse in range(WINDOW):
            cur = cur + ys_integral_check_reduced.loc[i+pulse, col]
            if abs(cur) > max:
                max = abs(cur)
            pulse = pulse+1
        ys_integral_check_reduced_final.loc[i, col] = max
        i = i+1
print(ys_integral_check_reduced_final)
A B C
0 3 6 6
1 9 9 9
2 11 12 12
3 15 15 15
4 10 18 18
5 21 21 21
6 11 24 24
7 10 27 27
8 16 30 30
Here is a variant using Pandas and Rolling.apply():
ys_integral_check_reduced_final = (ys_integral_check[['A', 'B', 'C']]
                                   .rolling(WINDOW)
                                   .apply(lambda w: w.cumsum().abs().max())
                                   .dropna()
                                   .reset_index(drop=True))
Which gives:
A B C
0 3.0 6.0 6.0
1 9.0 9.0 9.0
2 11.0 12.0 12.0
3 15.0 15.0 15.0
4 10.0 18.0 18.0
5 21.0 21.0 21.0
6 11.0 24.0 24.0
7 10.0 27.0 27.0
8 16.0 30.0 30.0
9 15.0 33.0 33.0
There is an extra row, because I believe your solution skips a possible window at the end.
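If you want to match your output exactly, that extra last row can be dropped, for example:
ys_integral_check_reduced_final = ys_integral_check_reduced_final.iloc[:-1]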
I tested it on a random DataFrame with 100'000 rows and 3 columns and a window size of 2000 and it took 18 seconds to process:
import time
import numpy as np
WINDOW = 2000
DF_SIZE = 100000
test_df = pd.DataFrame(np.random.random((DF_SIZE, 3)), columns=list('ABC'))
t0 = time.time()
test_df.rolling(WINDOW).apply(lambda w: w.cumsum().abs().max()).dropna().reset_index(drop=True)
t1 = time.time()
print(t1-t0) # 18.102170944213867
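If that is still too slow, the same per-window quantity can be computed entirely in NumPy. A sketch, assuming NumPy >= 1.20 for sliding_window_view; note that the cumsum materializes an (n - WINDOW + 1) x WINDOW array (about 1.5 GB for this test case), so very long series may need to be processed in chunks:
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def rolling_abs_cumsum_max(values, window):
    # one row per rolling window (a strided view, no copy yet)
    windows = sliding_window_view(values, window)
    # cumulative sum inside each window, then the largest absolute value per window
    return np.abs(windows.cumsum(axis=1)).max(axis=1)

result = rolling_abs_cumsum_max(test_df['A'].to_numpy(), WINDOW)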
I have a dataframe that looks like below. I want to get the min and max value per city, along with which products were ordered the least and the most in that city. Please help.
Dataframe (posted as an image in the original question; the last answer below reconstructs it as code).
db.min(axis=0) - min value for each column
db.min(axis=1) - min value for each row
use DataFrame.min and DataFrame.max
DataFrame.min(axis=None, skipna=None, level=None, numeric_only=None, **kwargs)
DataFrame.max(axis=None, skipna=None, level=None, numeric_only=None, **kwargs)
import pandas as pd

matrix = [(22, 16, 23),
          (33, 50, 11),
          (44, 34, 11),
          (55, 35, 60),
          (66, 36, 13)
          ]
dfObj = pd.DataFrame(matrix, index=list('abcde'), columns=list('xyz'))
    x   y   z
a  22  16  23
b  33  50  11
c  44  34  11
d  55  35  60
e  66  36  13
Get a series containing the minimum value of each row
minValuesObj = dfObj.min(axis=1)
print('minimum value in each row : ')
print(minValuesObj)
output
minimum value in each row :
a    16
b    11
c    11
d    35
e    13
dtype: int64
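The same pattern gives the row maxima, and DataFrame.idxmin/idxmax report which column produced each value; a short sketch on the same dfObj:
maxValuesObj = dfObj.max(axis=1)      # maximum value in each row
minColumnsObj = dfObj.idxmin(axis=1)  # name of the column holding each row's minimum
print(maxValuesObj)
print(minColumnsObj)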
MMT Marathi, based on the answers provided by Danil and Sutharp777, you should be able to get to your answer. However, I see you have questions for them. Not sure if you are looking for a column to be created that has the min/max value for each row.
Here's the full dataframe with the solution. I am merely compiling the answers they have already given
import pandas as pd
d = [['20in Monitor',2,2,1,2,2,2,2,2,2],
['27in 4k Gaming Monitor',2,1,2,2,1,2,2,2,2],
['27in FHD Monitor',2,2,2,2,2,2,2,2,2],
['34in Ultrawide Monitor',2,1,2,2,2,2,2,2,2],
['AA Batteries (4-pack)',5,5,6,7,6,6,6,6,5],
['AAA Batteries (4-pack)',7,7,8,8,9,7,8,9,7],
['Apple Airpods Headphones',2,2,3,2,2,2,2,2,2],
['Bose SoundSport Headphones',2,2,2,2,3,2,2,3,2],
['Flatscreen TV',2,1,2,2,2,2,2,2,2]]
c = ['Product','Atlanta','Austin','Boston','Dallas','Los Angeles',
'New York City','Portland','San Francisco','Seattle']
df = pd.DataFrame(d,columns=c)
df['min_value'] = df.min(axis=1, numeric_only=True)  # skip the non-numeric Product column
df['max_value'] = df.max(axis=1, numeric_only=True)
print (df)
The output of this will be:
Product Atlanta Austin ... Seattle min_value max_value
0 20in Monitor 2 2 ... 2 1 2
1 27in 4k Gaming Monitor 2 1 ... 2 1 2
2 27in FHD Monitor 2 2 ... 2 2 2
3 34in Ultrawide Monitor 2 1 ... 2 1 2
4 AA Batteries (4-pack) 5 5 ... 5 5 7
5 AAA Batteries (4-pack) 7 7 ... 7 7 9
6 Apple Airpods Headphones 2 2 ... 2 2 3
7 Bose SoundSport Headphones 2 2 ... 2 2 3
8 Flatscreen TV 2 1 ... 2 1 2
If you want the min and max of each column, then you can do this:
print ('min of each column :', df.min(axis=0).to_list()[1:])
print ('max of each column :', df.max(axis=0).to_list()[1:])
This will give you:
min of each column : [2, 1, 1, 2, 1, 2, 2, 2, 2, 1, 2]
max of each column : [7, 7, 8, 8, 9, 7, 8, 9, 7, 7, 9]
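The part of the question about which products were ordered the least and the most per city isn't covered above; DataFrame.idxmin/idxmax can answer it. A sketch reusing the df and the column list c from the code above:
per_city = df.set_index('Product')[c[1:]]  # numeric city columns only, indexed by product
print(per_city.idxmin())  # product with the fewest orders in each city
print(per_city.idxmax())  # product with the most orders in each city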
I have a data set like this:-
S.No.,Year of birth,year of death
1, 1, 5
2, 3, 6
3, 2, -
4, 5, 7
I need to calculate the population alive in each year, like this:
year population
1 1
2 2
3 3
4 3
5 4
6 3
7 2
8 1
How can I solve it in pandas? I am not very good with pandas.
Any help would be appreciated.
First it is necessary to choose a maximum year for 'year of death' when it does not exist; in this solution 8 is used.
Then convert the values of 'year of death' to numeric and replace the missing values by this year. The first solution uses the difference between the birth and death columns with Index.repeat and GroupBy.cumcount; for counting, Series.value_counts is used:
#if need working with years
#today_year = pd.to_datetime('now').year
today_year = 8
df['year of death'] = pd.to_numeric(df['year of death'], errors='coerce').fillna(today_year)
df = df.loc[df.index.repeat(df['year of death'].add(1).sub(df['Year of birth']).astype(int))]
df['Year of birth'] += df.groupby(level=0).cumcount()
df1 = (df['Year of birth'].value_counts()
.sort_index()
.rename_axis('year')
.reset_index(name='population'))
print (df1)
year population
0 1 1
1 2 2
2 3 3
3 4 3
4 5 4
5 6 3
6 7 2
7 8 1
Another solution uses a list comprehension with range to repeat the years:
#if need working with years
#today_year = pd.to_datetime('now').year
today_year = 8
s = pd.to_numeric(df['year of death'], errors='coerce').fillna(today_year)
L = [x for b, e in zip(df['Year of birth'], s) for x in range(b, int(e) + 1)]
df1 = (pd.Series(L).value_counts()
.sort_index()
.rename_axis('year')
.reset_index(name='population'))
print (df1)
year population
0 1 1
1 2 2
2 3 3
3 4 3
4 5 4
5 6 3
6 7 2
7 8 1
Similar to before, only a Counter is used to build the dictionary for the final DataFrame:
from collections import Counter
#if need working with years
#today_year = pd.to_datetime('now').year
today_year = 8
s = pd.to_numeric(df['year of death'], errors='coerce').fillna(today_year)
d = Counter([x for b, e in zip(df['Year of birth'], s) for x in range(b, int(e) + 1)])
print (d)
Counter({5: 4, 3: 3, 4: 3, 6: 3, 2: 2, 7: 2, 1: 1, 8: 1})
df1 = pd.DataFrame({'year':list(d.keys()),
'population':list(d.values())})
print (df1)
year population
0 1 1
1 2 2
2 3 3
3 4 3
4 5 4
5 6 3
6 7 2
7 8 1
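For very large inputs, a hedged event-based alternative (not one of the solutions above) avoids repeating rows altogether: add +1 at each birth year, subtract 1 in the year after each death, and take a cumulative sum:
births = df['Year of birth'].value_counts()
deaths = pd.to_numeric(df['year of death'], errors='coerce').fillna(today_year).astype(int)
ends = (deaths + 1).value_counts()

years = pd.RangeIndex(1, today_year + 1, name='year')
# net population change per year, then the running total of people alive
delta = births.reindex(years, fill_value=0) - ends.reindex(years, fill_value=0)
df1 = delta.cumsum().reset_index(name='population')
print (df1)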
This question already has an answer here: Numpy intersect1d with array with matrix as elements.
I'm currently trying to compare two matrices and return the matching rows in an "intersection matrix" in Python. Both matrices hold numerical data, and I'm trying to return the rows of their common entries (I have also tried just creating a matrix with matching positional entries along the first column and then creating an accompanying tuple). These matrices do not necessarily have the same dimensions.
Let's say I have two matrices of matching column length but arbitrary row length (they can be very large, with different numbers of rows):
first matrix:   second matrix:
23  3  4  5     23  3  4  5
12  6  7  8     45  7  8  9
45  7  8  9     34  5  6  7
67  4  5  6      3  5  6  7
I'd like to create a matrix whose rows are the "intersection"; for this low-dimensional example, that would be:
23 3 4 5
45 7 8 9
perhaps it looks like this though:
first matrix:   second matrix:
1  2  3  4      2   4  6  7
2  4  6  7      4  10  6  9
4  6  7  8      5   6  7  8
5  6  7  8
in which case we only want:
2 4 6 7
5 6 7 8
I've tried things of this nature:
def compare(x):
    # This is a matrix I created with another function - purely numerical data of arbitrary size with fixed column length D
    y = n_c(data_cleaner(x))
    # this is a second matrix that I'd like to compare it to; note that the sizes are probably not the same, but the column lengths are
    z = data_cleaner(x)
    # I initialized an array that would hold the matching values
    compare = []
    # create nested for loop that will check a single index in one matrix over all entries in the second matrix over iteration
    for i in range(len(y)):
        for j in range(len(z)):
            if y[0][i] == z[0][i]:
                # I want the row or the n-tuple (shown here) of those columns with the matching first indexes as shown above
                c_vec = ([0][i], [15][i], [24][i], [0][25], [0][26])
                compare.append(c_vec)
            else:
                pass
    return compare

compare(c_i_w)
Sadly, I'm running into some errors. Specifically, it seems that I'm telling Python to improperly reference values.
Consider the arrays a and b
a = np.array([
[23, 3, 4, 5],
[12, 6, 7, 8],
[45, 7, 8, 9],
[67, 4, 5, 6]
])
b = np.array([
[23, 3, 4, 5],
[45, 7, 8, 9],
[34, 5, 6, 7],
[ 3, 5, 6, 7]
])
print(a)
[[23 3 4 5]
[12 6 7 8]
[45 7 8 9]
[67 4 5 6]]
print(b)
[[23 3 4 5]
[45 7 8 9]
[34 5 6 7]
[ 3 5 6 7]]
Then we can broadcast and get an array of equal rows with
x = (a[:, None] == b).all(-1)
print(x)
[[ True False False False]
[False False False False]
[False True False False]
[False False False False]]
Using np.where we can identify the indices
i, j = np.where(x)
Show which rows of a
print(a[i])
[[23 3 4 5]
[45 7 8 9]]
And which rows of b
print(b[j])
[[23 3 4 5]
[45 7 8 9]]
They are the same! That's good. That's what we wanted.
We can put the results into a pandas dataframe with a MultiIndex with row number from a in the first level and row number from b in the second level.
pd.DataFrame(a[i], [i, j])
0 1 2 3
0 0 23 3 4 5
2 1 45 7 8 9
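For large arrays, another common route (the one behind the duplicate link above) is to view each row as a single opaque element so that np.intersect1d compares whole rows at once; a sketch, assuming both arrays are contiguous with the same dtype and number of columns:
import numpy as np

def intersect_rows(a, b):
    # reinterpret each row as one fixed-size void element
    row_dtype = np.dtype((np.void, a.dtype.itemsize * a.shape[1]))
    common = np.intersect1d(np.ascontiguousarray(a).view(row_dtype),
                            np.ascontiguousarray(b).view(row_dtype))
    # turn the void elements back into ordinary rows
    return common.view(a.dtype).reshape(-1, a.shape[1])

print(intersect_rows(a, b))
# [[23  3  4  5]
#  [45  7  8  9]]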
I have a df and some of the columns contain numbers; I calculate mean, std, median etc. on these columns using df.mean(0) and so on.
How can I put these summary statistics in a list? One list for mean, one for median, etc.
I think you can use Series.tolist, because the output of your functions is a Series:
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3],
                   'B': [4, 5, 6],
                   'C': [7, 8, 9],
                   'D': [1, 3, 5],
                   'E': [5, 3, 6],
                   'F': [7, 4, 3]})
print (df)
A B C D E F
0 1 4 7 1 5 7
1 2 5 8 3 3 4
2 3 6 9 5 6 3
# sum(0) and std(0) are the same as sum() and std(), because axis=0 is the default
L1 = df.sum().tolist()
L2 = df.std().tolist()
print (L1)
print (L2)
[6, 15, 24, 9, 14, 14]
[1.0, 1.0, 1.0, 2.0, 1.5275252316519465, 2.0816659994661326]
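If you need several statistics at once, one option (a sketch, not the only way) is DataFrame.agg, which returns one row per statistic that you can then pull out as lists:
stats = df.agg(['mean', 'median', 'std'])
mean_list = stats.loc['mean'].tolist()
median_list = stats.loc['median'].tolist()
print (mean_list)
print (median_list)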