I have a dataframe that looks like the one below. I want to get the min and max value per city, along with which products had the min and max orders for that city. Please help.
Dataframe
db.min(axis=0) - min value for each column
db.min(axis=1) - min value for each row
Use DataFrame.min and DataFrame.max:
DataFrame.min(axis=None, skipna=None, level=None, numeric_only=None, **kwargs)
DataFrame.max(axis=None, skipna=None, level=None, numeric_only=None, **kwargs)
import pandas as pd

matrix = [(22, 16, 23),
          (33, 50, 11),
          (44, 34, 11),
          (55, 35, 60),
          (66, 36, 13)
          ]
# Rows are labelled a..e, columns x..z
dfObj = pd.DataFrame(matrix, index=list('abcde'), columns=list('xyz'))
    x   y   z
a  22  16  23
b  33  50  11
c  44  34  11
d  55  35  60
e  66  36  13
Get a Series containing the minimum value of each row:
minValuesObj = dfObj.min(axis=1)
print('minimum value in each row : ')
print(minValuesObj)
output
minimum value in each row :
a    16
b    11
c    11
d    35
e    13
dtype: int64
MMT Marathi, based on the answers provided by Danil and Sutharp777, you should be able to get to your answer. However, I see you have questions for them. Not sure if you are looking for a column to be created that has the min/max value for each row.
Here's the full dataframe with the solution. I am merely compiling the answers they have already given.
import pandas as pd
d = [['20in Monitor',2,2,1,2,2,2,2,2,2],
     ['27in 4k Gaming Monitor',2,1,2,2,1,2,2,2,2],
     ['27in FHD Monitor',2,2,2,2,2,2,2,2,2],
     ['34in Ultrawide Monitor',2,1,2,2,2,2,2,2,2],
     ['AA Batteries (4-pack)',5,5,6,7,6,6,6,6,5],
     ['AAA Batteries (4-pack)',7,7,8,8,9,7,8,9,7],
     ['Apple Airpods Headphones',2,2,3,2,2,2,2,2,2],
     ['Bose SoundSport Headphones',2,2,2,2,3,2,2,3,2],
     ['Flatscreen TV',2,1,2,2,2,2,2,2,2]]
c = ['Product','Atlanta','Austin','Boston','Dallas','Los Angeles',
     'New York City','Portland','San Francisco','Seattle']
df = pd.DataFrame(d,columns=c)
# Restrict the row-wise min/max to the numeric city columns;
# including the string 'Product' column raises a TypeError in recent pandas.
cities = c[1:]
df['min_value'] = df[cities].min(axis=1)
df['max_value'] = df[cities].max(axis=1)
print (df)
The output of this will be:
Product Atlanta Austin ... Seattle min_value max_value
0 20in Monitor 2 2 ... 2 1 2
1 27in 4k Gaming Monitor 2 1 ... 2 1 2
2 27in FHD Monitor 2 2 ... 2 2 2
3 34in Ultrawide Monitor 2 1 ... 2 1 2
4 AA Batteries (4-pack) 5 5 ... 5 5 7
5 AAA Batteries (4-pack) 7 7 ... 7 7 9
6 Apple Airpods Headphones 2 2 ... 2 2 3
7 Bose SoundSport Headphones 2 2 ... 2 2 3
8 Flatscreen TV 2 1 ... 2 1 2
If you want the min and max of each column, then you can do this:
print ('min of each column :', df.min(axis=0).to_list()[1:])
print ('max of each column :', df.max(axis=0).to_list()[1:])
This will give you the following (the last two entries come from the min_value and max_value columns added above):
min of each column : [2, 1, 1, 2, 1, 2, 2, 2, 2, 1, 2]
max of each column : [7, 7, 8, 8, 9, 7, 8, 9, 7, 7, 9]
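The question also asks which product holds the min and max in each city. A minimal sketch of one way to get that, assuming a small made-up table with the same layout as above (the data and the `counts` and `summary` names are illustrative only). Note that `idxmin` and `idxmax` return the first product when several tie.

```python
import pandas as pd

# Hypothetical miniature of the table above: three products, two cities.
df = pd.DataFrame({
    'Product': ['20in Monitor', 'AA Batteries (4-pack)', 'Flatscreen TV'],
    'Atlanta': [2, 5, 3],
    'Austin': [2, 5, 1],
})

# Index by product so idxmin/idxmax return product names per city column.
counts = df.set_index('Product')
summary = pd.DataFrame({
    'min_value': counts.min(),
    'min_product': counts.idxmin(),
    'max_value': counts.max(),
    'max_product': counts.idxmax(),
})
print(summary)
```

Each row of `summary` is one city, so the four values the question asks for sit side by side.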
I've been struggling with this for two days and finally got it working, but I wonder whether it can be sped up, since I have a lot of data to process.
The goal: for every column of my DataFrame, I compute an incremental sum (elt(n-1) + elt(n)) over a window, take its absolute value at each step, and compare it to the previous one so that, at the last element of the window, I have the max absolute value. I thought a rolling sum or a plain column sum would work, but I can't get either to do this. The maxima are calculated over a rolling window of 2000 lines (so for element n, I take the 2000 lines starting at line n, and so on). In the end, I will have a DataFrame with the length of the original one minus 2000 elements.
As for speed, this takes around 1 minute to complete for all 4 columns, and that is for a relatively small file of only around 5000 elements; most files would be 4 times bigger.
Ideally, I'd like to massively speed up what is inside the "for pulse in range(2000):" loop, but speeding up the entire code is fine too.
I'm not sure how I could use a list comprehension here. I looked at NumPy's accumulate() and at rolling(), but neither gives me what I want.
edit1: indents.
edit2: here is an example of the first 10 lines of input and output, for the first column only (to keep it less busy here). Note that you need at least 2000 lines of input to obtain the first item of the results, so I'm not sure this is really useful.
Input :
-2.1477511E-12
2.0970403E-12
2.0731764E-12
1.7241669E-12
1.2260080E-12
7.3381503E-13
8.2330457E-13
-9.2472616E-13
-1.1275693E-12
-1.3184806E-12
Output:
2.25436311E-10
2.28640040E-10
2.27405083E-10
2.25331907E-10
2.23607740E-10
2.22381732E-10
2.21647917E-10
2.20824612E-10
2.21749338E-10
2.22876908E-10
Here's my code:
ys_integral_check_reduced = ys_integral_check[['A', 'B', 'C', 'D']]
for col in ys_integral_check_reduced.columns:
    pulse=0
    i=0
    while (ys_integral_check_reduced.loc[i+1999,col] != 0 and i<len(ys_integral_check_reduced)-2000):
        cur = 0
        max = 0
        for pulse in range(2000):
            cur = cur + ys_integral_check_reduced.loc[i+pulse, col]
            if abs(cur) > max:
                max = abs(cur)
            pulse = pulse+1
        ys_integral_check_reduced_final.loc[i, col] = max
        i = i+1
print(ys_integral_check_reduced_final)
If I understood the question correctly, this toy example (with a WINDOW size of 3) reproduces the problem:
import pandas as pd
WINDOW = 3
ys_integral_check = pd.DataFrame({'A':[1, 2, -5, -6, 1, -10, -1, -10, 7, 4, 5, 6],
'B':[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]})
ys_integral_check['C'] = -ys_integral_check['B']
Which looks like this:
A B C
0 1 1 -1
1 2 2 -2
2 -5 3 -3
3 -6 4 -4
4 1 5 -5
5 -10 6 -6
6 -1 7 -7
7 -10 8 -8
8 7 9 -9
9 4 10 -10
10 5 11 -11
11 6 12 -12
Your solution gives:
ys_integral_check_reduced_final = pd.DataFrame(columns=['A', 'B', 'C'])
ys_integral_check_reduced = ys_integral_check[['A', 'B', 'C']]
for col in ys_integral_check_reduced.columns:
    pulse=0
    i=0
    while (ys_integral_check_reduced.loc[i+WINDOW-1,col] != 0 and i<len(ys_integral_check_reduced)-WINDOW):
        cur = 0
        max = 0
        for pulse in range(WINDOW):
            cur = cur + ys_integral_check_reduced.loc[i+pulse, col]
            if abs(cur) > max:
                max = abs(cur)
            pulse = pulse+1
        ys_integral_check_reduced_final.loc[i, col] = max
        i = i+1
print(ys_integral_check_reduced_final)
A B C
0 3 6 6
1 9 9 9
2 11 12 12
3 15 15 15
4 10 18 18
5 21 21 21
6 11 24 24
7 10 27 27
8 16 30 30
Here is a variant using Pandas and Rolling.apply():
ys_integral_check_reduced_final = ys_integral_check[['A', 'B', 'C']].rolling(WINDOW).apply(lambda w: w.cumsum().abs().max()).dropna().reset_index(drop=True)
Which gives:
A B C
0 3.0 6.0 6.0
1 9.0 9.0 9.0
2 11.0 12.0 12.0
3 15.0 15.0 15.0
4 10.0 18.0 18.0
5 21.0 21.0 21.0
6 11.0 24.0 24.0
7 10.0 27.0 27.0
8 16.0 30.0 30.0
9 15.0 33.0 33.0
There is an extra row, because I believe your solution skips a possible window at the end.
I tested it on a random DataFrame with 100'000 rows and 3 columns and a window size of 2000 and it took 18 seconds to process:
import time
import numpy as np
WINDOW = 2000
DF_SIZE = 100000
test_df = pd.DataFrame(np.random.random((DF_SIZE, 3)), columns=list('ABC'))
t0 = time.time()
test_df.rolling(WINDOW).apply(lambda w: w.cumsum().abs().max()).dropna().reset_index(drop=True)
t1 = time.time()
print(t1-t0) # 18.102170944213867
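If the rolling apply is still too slow, one possible alternative (not part of the answer above) avoids the Python-level lambda entirely. Every running sum inside a window is a difference of two prefix sums, so the largest absolute running sum follows from the rolling max and rolling min of the column's cumulative sum. The sketch below assumes the data contain no NaN; `rolling_max_abs_cumsum` and `toy` are hypothetical names, and with floating-point input the results may differ from the loop version by rounding error.

```python
import numpy as np
import pandas as pd

def rolling_max_abs_cumsum(df, window):
    """Largest |running sum| inside each forward window of `window` rows."""
    prefix = df.cumsum()                    # prefix[j] = sum of rows 0..j
    base = prefix.shift(1, fill_value=0)    # sum of all rows before the window start
    # Highest / lowest prefix sum inside the window starting at each row
    hi = prefix.rolling(window).max().shift(-(window - 1))
    lo = prefix.rolling(window).min().shift(-(window - 1))
    up = (hi - base).to_numpy()             # largest positive running sum
    down = (base - lo).to_numpy()           # largest negative running sum, as a magnitude
    out = pd.DataFrame(np.maximum(up, down), columns=df.columns)
    # Rows without a full window ahead of them are NaN and get dropped
    return out.dropna().reset_index(drop=True)

# Same toy data as in the answer above, window size 3
toy = pd.DataFrame({'A': [1, 2, -5, -6, 1, -10, -1, -10, 7, 4, 5, 6],
                    'B': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]})
result = rolling_max_abs_cumsum(toy, 3)
print(result)
```

On the toy data this reproduces the ten rows of the Rolling.apply() variant. It relies only on pandas' built-in rolling max and min, so no Python function is called per window.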
I have the code below:
import pandas as pd
import numpy as np
df = pd.DataFrame({"A":[12, 4, 5, 3, 1],"B":[7, 2, 54, 3, None],"C":[20, 16, 11, 3, 8],"D":[14, 3, None, 2, 6]})
df['A1'] = np.where(df['A'] > 10, 10, np.where(df['A'] < 3, 3, df['A']))
While this works, I want to create the final dataframe (i.e. the second line of code) by method chaining from the first line. I want to do this to improve readability.
Could you please help me achieve this?
You can use clip here:
df.assign(A1=df['A'].clip(upper=10,lower=3))
A B C D A1
0 12 7.0 20 14.0 10
1 4 2.0 16 3.0 4
2 5 54.0 11 NaN 5
3 3 3.0 3 2.0 3
4 1 NaN 8 6.0 3
If you really need to do this in one expression (note that I don't find this readable):
pd.DataFrame({"A":[12, 4, 5, 3, 1],
"B":[7, 2, 54, 3, None],
"C":[20, 16, 11, 3, 8],
"D":[14, 3, None, 2, 6]}).assign(A1=lambda x:x['A'].clip(upper=10,lower=3))
You could use np.select() like the following. It makes the conditions and choices very readable.
conditions = [df['A'] > 10,
df['A'] < 3]
choices = [10,3]
df['A2'] = np.select(conditions, choices, default = df['A'])
print(df)
    A     B   C     D  A1  A2
0  12   7.0  20  14.0  10  10
1   4   2.0  16   3.0   4   4
2   5  54.0  11   NaN   5   5
3   3   3.0   3   2.0   3   3
4   1   NaN   8   6.0   3   3
I apply pd.Series(pred).value_counts() and get this output:
0 2084
-1 15
1 13
3 10
4 7
6 4
11 3
8 3
2 3
9 2
7 2
5 2
10 2
dtype: int64
When I create a list, I get only the second column:
c_list = list(pd.Series(pred).value_counts())
Out:
[2084, 15, 13, 10, 7, 4, 3, 3, 3, 2, 2, 2, 2]
How do I ultimately get a dataframe that looks like this, including a new column for each class's size as a percentage of the total?
df=
[class , size ,relative_size]
0 2084 , x%
-1 15 , y%
1 13 , etc.
3 10
4 7
6 4
11 3
8 3
2 3
9 2
7 2
5 2
10 2
You are very nearly there. I'm typing this blind, since you didn't provide a sample input:
df = pd.Series(pred).value_counts().to_frame().reset_index()
df.columns = ['class', 'size']
df['relative_size'] = df['size'] / df['size'].sum()
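Since the question asks for a percentage while the line above yields a fraction between 0 and 1, here is a small self-contained sketch that scales by 100. The `pred` list is made up for illustration; the real `pred` from the question is not shown.

```python
import pandas as pd

# Hypothetical predictions standing in for the asker's `pred`
pred = [0, 0, 0, 1, 1, -1]

counts = pd.Series(pred).value_counts()   # sorted by count, descending
df = pd.DataFrame({'class': counts.index, 'size': counts.to_numpy()})
# Share of each class as a percentage of all predictions
df['relative_size'] = 100 * df['size'] / df['size'].sum()
print(df)
```

Building the frame from `counts.index` and `counts.to_numpy()` sidesteps the column names that `reset_index()` produces, which differ between pandas versions.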
How do you calculate the average of the top or bottom 'n' values in Python? In the example below, column c2 holds the average of the top 2 values of c1 over the last 4 days.
c0 c1 c2
1 2 na
2 2 na
3 3 na
4 5 4
5 6 5.5
6 7 6.5
7 5 6.5
8 4 6.5
9 5 6
10 5 5
Sort the list, get the sum of the last n numbers and divide it by n:
def avg_of_top_n(l, n):
    return sum(sorted(l)[-n:]) / n

l = [2, 2, 3, 5, 6, 7, 5, 4, 5, 5]
for i in range(4, 11):
    print(avg_of_top_n(l[i - 4: i], 2))
This outputs:
4.0
5.5
6.5
6.5
6.5
6.0
5.0
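You could first sort the list of values, then take the first n values into a new list, and average them by dividing the sum of that list by the number of values in it. With an ascending sort, as below, the first n values are the bottom n; for the top n, take new_list[-n:] instead.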
n = 2
new_list = [1,5,4,3,2,6]
new_list.sort()
top_vals = new_list[:n]
top_vals_avg = sum(top_vals) / float(len(top_vals))
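If the values live in a DataFrame column, as in the question's table, the same sort-and-average idea can be applied per rolling window. This is a minimal sketch assuming the column is named `c1` as in the question; `WINDOW` and `N` are illustrative names.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'c1': [2, 2, 3, 5, 6, 7, 5, 4, 5, 5]})
WINDOW, N = 4, 2   # average of the top 2 values over the last 4 rows

# raw=True passes each window as a NumPy array; sort it and average the last N
df['c2'] = df['c1'].rolling(WINDOW).apply(lambda w: np.sort(w)[-N:].mean(), raw=True)
print(df)
```

The first `WINDOW - 1` rows are NaN because a full window is not yet available, matching the "na" entries in the question. Using `[:N]` instead of `[-N:]` gives the bottom-N average.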
I have a df in which some of the columns contain numbers, and I calculate the mean, std, median, etc. of these columns using df.mean(0) and similar calls.
How can I put these summary statistics into lists? One list for the mean, one for the median, and so on.
I think you can use Series.tolist, because the output of these functions is a Series:
df = pd.DataFrame({'A':[1,2,3],
'B':[4,5,6],
'C':[7,8,9],
'D':[1,3,5],
'E':[5,3,6],
'F':[7,4,3]})
print (df)
A B C D E F
0 1 4 7 1 5 7
1 2 5 8 3 3 4
2 3 6 9 5 6 3
# sum(0) and std(0) are the same as sum() and std(), because axis=0 is the default
L1 = df.sum().tolist()
L2 = df.std().tolist()
print (L1)
print (L2)
[6, 15, 24, 9, 14, 14]
[1.0, 1.0, 1.0, 2.0, 1.5275252316519465, 2.0816659994661326]
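To collect several statistics at once, one option is `DataFrame.agg` with a list of function names, then pulling each row out as a list. A minimal sketch on the same sample data; the names `stats`, `means`, `medians` and `stds` are illustrative.

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3],
                   'B': [4, 5, 6],
                   'C': [7, 8, 9],
                   'D': [1, 3, 5],
                   'E': [5, 3, 6],
                   'F': [7, 4, 3]})

# One row per statistic, one column per original column
stats = df.agg(['mean', 'median', 'std'])
means = stats.loc['mean'].tolist()
medians = stats.loc['median'].tolist()
stds = stats.loc['std'].tolist()
print(means)
print(medians)
print(stds)
```

This computes all three statistics in one call, and each list follows the column order of `df`.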