Apply a for loop with multiple GroupBy and nlargest - python-3.x

I'm trying to create a function that will iterate through the date_id column in a dataframe, create a group for each 6 sequential values, sum the values in the 'value' column, and also return the max value from each group of six along with the result.
date_id item_id value
0 1828 32 1180727.00
1 1828 43 944937.00
2 1828 40 806681.00
3 1828 42 721810.02
4 1828 36 567950.00
5 1828 45 545306.38
6 1828 26 480506.00
7 1828 53 375788.00
8 1828 37 236000.00
9 1828 38 234780.00
10 1828 21 208998.47
11 1828 41 135000.00
12 1797 39 63420.00
13 1828 28 24410.00
14 1462 52 0.00
15 1493 16 0.00
16 1493 17 0.00
17 1493 18 0.00
18 1493 15 0.00
19 1462 53 0.00
20 1462 47 0.00
21 1462 51 0.00
22 1462 50 0.00
23 1462 49 0.00
24 1462 45 0.00
The desired output for each item_id would be
date_id item_id value
0 max value from each date group 36 sum of all values in each date grouping
I have tried using a lambda
df_rslt = df.groupby('date_id')['value'].apply(lambda grp: grp.nlargest(6).sum())
but then quickly realized this will only return one result.
I then tried something like this in a for loop, but it got nowhere
grp_data = df.groupby(['date_id', 'item_id']).aggregate({'value': np.sum})
df_rslt = (grp_data.groupby('date_id')
                   .apply(lambda x: x.nlargest(6, 'value'))
                   .reset_index(level=0, drop=True))

From this
iterate through a date_id column in a dataframe and create a group for each 6 sequential values
I believe you need to identify a block first, then groupby on those blocks along with the date:
blocks = df.groupby('date_id').cumcount()//6
df.groupby(['date_id', blocks], sort=False)['value'].agg(['sum','max'])
Output:
sum max
date_id
1828 0 4767411.40 1180727.0
1 1671072.47 480506.0
1797 0 63420.00 63420.0
1828 2 24410.00 24410.0
1462 0 0.00 0.0
1493 0 0.00 0.0
1462 1 0.00 0.0
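For reference, a minimal self-contained sketch of the same approach (naming the block level is my own addition, not part of the original answer):

# assumes df is the frame shown in the question
blocks = df.groupby('date_id').cumcount() // 6            # 0 for the first 6 rows of each date, 1 for the next 6, ...
out = df.groupby(['date_id', blocks], sort=False)['value'].agg(['sum', 'max'])
out.index.names = ['date_id', 'block']                    # give the unnamed block level a readable name
print(out.reset_index())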

Related

Create new column in data frame by interpolating another column within a particular date range - Pandas

I have a df as shown below; the data looks like this:
Date y
0 2020-06-14 127
1 2020-06-15 216
2 2020-06-16 4
3 2020-06-17 90
4 2020-06-18 82
5 2020-06-19 70
6 2020-06-20 59
7 2020-06-21 48
8 2020-06-22 23
9 2020-06-23 25
10 2020-06-24 24
11 2020-06-25 22
12 2020-06-26 19
13 2020-06-27 10
14 2020-06-28 18
15 2020-06-29 157
16 2020-06-30 16
17 2020-07-01 14
18 2020-07-02 343
The code to create the data frame.
# Create a dummy dataframe
import pandas as pd
import numpy as np

y0 = [127, 216, 4, 90, 82, 70, 59, 48, 23, 25, 24, 22, 19, 10, 18, 157, 16, 14, 343]

def initial_forecast(data):
    data['y'] = y0
    return data

# Initial date dataframe
df_dummy = pd.DataFrame({'Date': pd.date_range('2020-06-14', periods=19, freq='1D')})

# Dates
start_date = df_dummy.Date.iloc[1]
print(start_date)
end_date = df_dummy.Date.iloc[17]
print(end_date)

# Adding y0 in the dataframe
df_dummy = initial_forecast(df_dummy)
df_dummy
From the above, I would like to interpolate the data over a particular date range: linear interpolation between 2020-06-17 and 2020-06-27.
That is, from 2020-06-17 to 2020-06-27 the 'y' value changes from 90 to 10 in 10 steps, so on average it decreases by 8 at each step: (90 - 10) / 10 (number of steps) = 8 per step.
The expected output:
Date y y_new
0 2020-06-14 127 127
1 2020-06-15 216 216
2 2020-06-16 4 4
3 2020-06-17 90 90
4 2020-06-18 82 82
5 2020-06-19 70 74
6 2020-06-20 59 66
7 2020-06-21 48 58
8 2020-06-22 23 50
9 2020-06-23 25 42
10 2020-06-24 24 34
11 2020-06-25 22 26
12 2020-06-26 19 18
13 2020-06-27 10 10
14 2020-06-28 18 18
15 2020-06-29 157 157
16 2020-06-30 16 16
17 2020-07-01 14 14
18 2020-07-02 343 343
Note: over the remaining date range, the y_new value should be the same as the y value.
I tried the code below, but it does not give the desired output:
# Function
def df_interpolate(df, start_date, end_date):
    df["Date"] = pd.to_datetime(df["Date"])
    df.loc[(df['Date'] >= start_date) & (df['Date'] <= end_date), 'y_new'] = np.nan
    df['y_new'] = df['y'].interpolate().round()
    return df

df1 = df_interpolate(df_dummy, '2020-06-17', '2020-06-27')
With some tweaks your function works: use np.where to create the new column, drop the = from your conditionals (so the endpoint values 90 and 10 are kept as anchors for the interpolation), and cast to int as per your expected output.
def df_interpolate(df, start_date, end_date):
    df["Date"] = pd.to_datetime(df["Date"])
    df['y_new'] = np.where((df['Date'] > start_date) & (df['Date'] < end_date), np.nan, df['y'])
    df['y_new'] = df['y_new'].interpolate().round().astype(int)
    return df
Date y y_new
0 2020-06-14 127 127
1 2020-06-15 216 216
2 2020-06-16 4 4
3 2020-06-17 90 90
4 2020-06-18 82 82
5 2020-06-19 70 74
6 2020-06-20 59 66
7 2020-06-21 48 58
8 2020-06-22 23 50
9 2020-06-23 25 42
10 2020-06-24 24 34
11 2020-06-25 22 26
12 2020-06-26 19 18
13 2020-06-27 10 10
14 2020-06-28 18 18
15 2020-06-29 157 157
16 2020-06-30 16 16
17 2020-07-01 14 14
18 2020-07-02 343 343
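The function is called the same way as in the question, e.g. df1 = df_interpolate(df_dummy, '2020-06-17', '2020-06-27'). As an aside not taken from the original answer, here is a sketch of the same idea using Series.mask instead of np.where, assuming df_dummy built as above:

# Blank out the interior of the range, then interpolate linearly between the
# surviving endpoints (90 on 2020-06-17 and 10 on 2020-06-27).
inside = (df_dummy['Date'] > '2020-06-17') & (df_dummy['Date'] < '2020-06-27')
df_dummy['y_new'] = df_dummy['y'].mask(inside).interpolate().round().astype(int)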

Function against pandas column not generating expected output

I am trying to min-max scale a single column in a dataframe.
I am following this: Writing Min-Max scaler function
My code:
import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.randint(0, 100, size=(100, 4)), columns=list('ABCD'))
print(df, '\n')
y = df['A'].values
def func(x):
    return [round((i - min(x)) / (max(x) - min(x)), 2) for i in x]

df['E'] = func(y)
print(df)
df['E'] is just df['A'] / 100.
Not sure what I am missing, but my result is incorrect.
IIUC, are you trying to do something like this?
import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.randint(0, 100, size=(100, 4)), columns=list('ABCD'))
print(df, '\n')
def func(x):
    return [round((i - min(x)) / (max(x) - min(x)), 2) for i in x]
df_out = df.apply(func).add_prefix('Norm_')
print(df_out)
print(df.join(df_out))
Output:
A B C D
0 91 59 44 5
1 85 44 57 17
2 6 65 37 46
3 40 50 3 40
4 73 58 47 53
.. .. .. .. ..
95 94 76 22 66
96 70 99 40 59
97 96 84 85 24
98 43 51 59 60
99 31 5 55 89
[100 rows x 4 columns]
Norm_A Norm_B Norm_C Norm_D
0 0.93 0.60 0.44 0.05
1 0.87 0.44 0.58 0.17
2 0.06 0.66 0.37 0.47
3 0.41 0.51 0.03 0.41
4 0.74 0.59 0.47 0.54
.. ... ... ... ...
95 0.96 0.77 0.22 0.67
96 0.71 1.00 0.40 0.60
97 0.98 0.85 0.86 0.24
98 0.44 0.52 0.60 0.61
99 0.32 0.05 0.56 0.91
[100 rows x 4 columns]
A B C D Norm_A Norm_B Norm_C Norm_D
0 91 59 44 5 0.93 0.60 0.44 0.05
1 85 44 57 17 0.87 0.44 0.58 0.17
2 6 65 37 46 0.06 0.66 0.37 0.47
3 40 50 3 40 0.41 0.51 0.03 0.41
4 73 58 47 53 0.74 0.59 0.47 0.54
.. .. .. .. .. ... ... ... ...
95 94 76 22 66 0.96 0.77 0.22 0.67
96 70 99 40 59 0.71 1.00 0.40 0.60
97 96 84 85 24 0.98 0.85 0.86 0.24
98 43 51 59 60 0.44 0.52 0.60 0.61
99 31 5 55 89 0.32 0.05 0.56 0.91
[100 rows x 8 columns]
Also consider that using apply() with a function is typically quite inefficient. Try to use vectorized operations whenever you can...
This is a more efficient expression to normalize each column according to the minimum and maximum for that column:
min = df.min() # per column
max = df.max() # per column
df.join(np.round((df - min) / (max - min), 2).add_prefix('Norm_'))
That's much faster than using apply() on a function. For your sample DataFrame:
%timeit df.join(np.round((df - df.min()) / (df.max() - df.min()), 2).add_prefix('Norm_'))
9.89 ms ± 102 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
While the version with apply takes about 4x longer:
%timeit df.join(df.apply(func).add_prefix('Norm_'))
45.8 ms ± 1.16 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
But this difference grows quickly with the size of the DataFrame. For example, with a DataFrame of size 1,000 x 26 I get
37.2 ms ± 269 µs for the vectorized version versus 19.5 s ± 1.82 s for the version using apply: around 500x faster!
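As a further aside, not part of the original answers: scikit-learn's MinMaxScaler performs the same column-wise scaling; a minimal sketch, assuming df is the 100 x 4 frame from the question:

from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()
# fit_transform returns a NumPy array, so wrap it back into a DataFrame
norm = pd.DataFrame(scaler.fit_transform(df), columns=df.columns, index=df.index)
df_out = df.join(norm.round(2).add_prefix('Norm_'))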
Not sure what you are after. Your max and minimum are nearly known because of the number range:
df.loc[:, 'A':'D'].apply(lambda x: x.agg({'min', 'max'}))
And if all you need is for df['E'] to be df['A'] / 100, why not:
df['E'] = df['A'] / 100
y = df['E'].values
y
Please don't mark me down, I'm just trying to get some clarity.

How to update a DataFrame with another DataFrame by its index?

I have an original data frame as below:
0 0
1 6.19
2 6.19
3 16.19
4 16.19
5 179.33
6 179.33
7 179.33
8 179.33
9 179.33
10 179.33
11 0
12 179.33
13 179.33
14 0
15 0
16 0
17 0
18 0
19 11.49
20 11.49
21 7.15
22 7.15
23 16.19
24 16.19
25 17.85
26 17.85
And the second Data Frame is:
2 3.19
4 16.19
6 179.33
8 179.33
10 179.33
13 179.33
20 11.49
22 7.15
24 16.19
26 17.85
You can see the first column is the index, and I want the values to be updated according to the second data frame.
For example:
0 0
1 6.19
2 6.19
Since my second data frame has 3.19 at index 2, my expected output should be like:
0 0
1 6.19
2 3.19
How do I achieve that? BTW, I have tried to do it like this:
for i in df.index:
    new = df2.aa[i]
    if new:
        df.loc['aa'][i] = new
But there should be a better way to do this in pandas, please help me.
Use Series.combine_first or Series.update:
df1['col'] = df2['col'].combine_first(df1['col'])
Or:
df1['col'].update(df2['col'])
print (df1)
col
0 0.00
1 6.19
2 3.19
3 16.19
4 16.19
5 179.33
6 179.33
7 179.33
8 179.33
9 179.33
10 179.33
11 0.00
12 179.33
13 179.33
14 0.00
15 0.00
16 0.00
17 0.00
18 0.00
19 11.49
20 11.49
21 7.15
22 7.15
23 16.19
24 16.19
25 17.85
26 17.85
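A small self-contained sketch, not from the original answer, illustrating the difference between the two options on a toy frame (the column name col is assumed, as above): update modifies df1 in place, while combine_first returns a new Series that takes df2's values where its index has them and falls back to df1 elsewhere.

import pandas as pd

df1 = pd.DataFrame({'col': [0.00, 6.19, 6.19, 16.19]})     # original values, index 0-3
df2 = pd.DataFrame({'col': [3.19]}, index=[2])             # replacement value at index 2

result = df2['col'].combine_first(df1['col'])              # new Series with 3.19 at index 2
df1['col'].update(df2['col'])                              # or: modify df1 in place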

Concatenate two dataframes by column

I have 2 dataframes. The first dataframe contains a range of years with the count filled with 0:
year count
0 1890 0
1 1891 0
2 1892 0
3 1893 0
4 1894 0
5 1895 0
6 1896 0
7 1897 0
8 1898 0
9 1899 0
10 1900 0
11 1901 0
12 1902 0
13 1903 0
14 1904 0
15 1905 0
16 1906 0
17 1907 0
18 1908 0
19 1909 0
20 1910 0
21 1911 0
22 1912 0
23 1913 0
24 1914 0
25 1915 0
26 1916 0
27 1917 0
28 1918 0
29 1919 0
.. ... ...
90 1980 0
91 1981 0
92 1982 0
93 1983 0
94 1984 0
95 1985 0
96 1986 0
97 1987 0
98 1988 0
99 1989 0
100 1990 0
101 1991 0
102 1992 0
103 1993 0
104 1994 0
105 1995 0
106 1996 0
107 1997 0
108 1998 0
109 1999 0
110 2000 0
111 2001 0
112 2002 0
113 2003 0
114 2004 0
115 2005 0
116 2006 0
117 2007 0
118 2008 0
119 2009 0
[120 rows x 2 columns]
The second dataframe has similar columns but covers a smaller number of years, with the count filled in:
year count
0 1970 1
1 1957 7
2 1947 19
3 1987 12
4 1979 7
5 1940 1
6 1950 19
7 1972 4
8 1954 15
9 1976 15
10 2006 3
11 1963 16
12 1980 6
13 1956 13
14 1967 5
15 1893 1
16 1985 5
17 1964 6
18 1949 11
19 1945 15
20 1948 16
21 1959 16
22 1958 12
23 1929 1
24 1965 12
25 1969 15
26 1946 12
27 1961 1
28 1988 1
29 1918 1
30 1999 3
31 1986 3
32 1981 2
33 1960 2
34 1974 4
35 1953 9
36 1968 11
37 1916 2
38 1955 5
39 1978 1
40 2003 1
41 1982 4
42 1984 3
43 1966 4
44 1983 3
45 1962 3
46 1952 4
47 1992 2
48 1973 4
49 1993 10
50 1975 2
51 1900 1
52 1991 1
53 1907 1
54 1977 4
55 1908 1
56 1998 2
57 1997 3
58 1895 1
I want to create a third dataframe df3. For each row, if the year appears in both df1 and df2, then df3["count"] = df2["count"], else df3["count"] = df1["count"].
I tried to use join to do this:
df_new = df2.join(df1, on='year', how='left')
df_new['count'] = df_new['count'].fillna(0)
print(df_new)
But got an error:
ValueError: columns overlap but no suffix specified: Index(['year'], dtype='object')
I found the solution to this error (Pandas join issue: columns overlap but no suffix specified). But after I run the code with those changes:
df_new = df2.join(df1, on='year', how='left', lsuffix='_left', rsuffix='_right')
df_new['count'] = df_new['count'].fillna(0)
print(df_new)
the output is not what I want:
count year
0 NaN 1890
1 NaN 1891
2 NaN 1892
3 NaN 1893
4 NaN 1894
5 NaN 1895
6 NaN 1896
7 NaN 1897
8 NaN 1898
9 NaN 1899
10 NaN 1900
11 NaN 1901
12 NaN 1902
13 NaN 1903
14 NaN 1904
15 NaN 1905
16 NaN 1906
17 NaN 1907
18 NaN 1908
19 NaN 1909
20 NaN 1910
21 NaN 1911
22 NaN 1912
23 NaN 1913
24 NaN 1914
25 NaN 1915
26 NaN 1916
27 NaN 1917
28 NaN 1918
29 NaN 1919
.. ... ...
29 1.0 1918
30 3.0 1999
31 3.0 1986
32 2.0 1981
33 2.0 1960
34 4.0 1974
35 9.0 1953
36 11.0 1968
37 2.0 1916
38 5.0 1955
39 1.0 1978
40 1.0 2003
41 4.0 1982
42 3.0 1984
43 4.0 1966
44 3.0 1983
45 3.0 1962
46 4.0 1952
47 2.0 1992
48 4.0 1973
49 10.0 1993
50 2.0 1975
51 1.0 1900
52 1.0 1991
53 1.0 1907
54 4.0 1977
55 1.0 1908
56 2.0 1998
57 3.0 1997
58 1.0 1895
[179 rows x 2 columns]
Desired output is:
year count
0 1890 0
1 1891 0
2 1892 0
3 1893 1
4 1894 0
5 1895 1
6 1896 0
7 1897 0
8 1898 0
9 1899 0
10 1900 1
11 1901 0
12 1902 0
13 1903 0
14 1904 0
15 1905 0
16 1906 0
17 1907 1
18 1908 1
19 1909 0
20 1910 0
21 1911 0
22 1912 0
23 1913 0
24 1914 0
25 1915 0
26 1916 2
27 1917 0
28 1918 1
29 1919 0
.. ... ...
90 1980 6
91 1981 2
92 1982 4
93 1983 3
94 1984 3
95 1985 5
96 1986 3
97 1987 12
98 1988 1
99 1989 0
100 1990 0
101 1991 1
102 1992 2
103 1993 10
104 1994 0
105 1995 0
106 1996 0
107 1997 3
108 1998 2
109 1999 3
110 2000 0
111 2001 0
112 2002 0
113 2003 1
114 2004 0
115 2005 0
116 2006 3
117 2007 0
118 2008 0
119 2009 0
[120 rows x 2 columns]
The issue is that you should set year as the index. In addition, if you don't want to lose data, you should join with how='outer' instead of 'left'.
This is my code:
df = pd.DataFrame({
    "year": np.random.randint(1850, 2000, size=(100,)),
    "qty": np.random.randint(0, 10, size=(100,)),
})
df2 = pd.DataFrame({
    "year": np.random.randint(1850, 2000, size=(100,)),
    "qty": np.random.randint(0, 10, size=(100,)),
})
df = df.set_index("year")
df2 = df2.set_index("year")
df3 = df.join(df2["qty"], how="outer", lsuffix='_left', rsuffix='_right')
df3 = df3.fillna(0)
At this step you have 2 columns with values from df1 or df2. I don't quite follow your merge rule. You said:
if df1["qty"] == df2["qty"] => df3["qty"] = df2["qty"]
if df1["qty"] != df2["qty"] => df3["qty"] = df1["qty"]
That means you always want df1["qty"], since when they are equal df1["qty"] == df2["qty"]. Am I right?
Just in case, if you want code you can adjust, you can use apply as follows:
def foo(x1, x2):
    if x1 == x2:
        return x2
    else:
        return x1

df3["count"] = df3.apply(lambda row: foo(row["qty_left"], row["qty_right"]), axis=1)
df3.drop(["qty_left", "qty_right"], axis=1, inplace=True)
I hope it helps,
Nicolas
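For reference, a minimal sketch not taken from the answer above that produces the desired 120-row output directly, assuming df1 is the full year range with zeros and df2 the sparse counts: map df2's counts onto df1's years and keep df1's zeros where there is no match.

df3 = df1.copy()
# look up each year in df2; years with no match stay NaN and fall back to df1's count
df3['count'] = (df3['year'].map(df2.set_index('year')['count'])
                           .fillna(df3['count'])
                           .astype(int))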

Optimal class weight parameter for the following SVC?

Hello, I am working with sklearn to build a classifier. I have the following distribution of labels:
label : 0 frequency : 119
label : 1 frequency : 1615
label : 2 frequency : 197
label : 3 frequency : 70
label : 4 frequency : 203
label : 5 frequency : 137
label : 6 frequency : 18
label : 7 frequency : 142
label : 8 frequency : 15
label : 9 frequency : 182
label : 10 frequency : 986
label : 12 frequency : 73
label : 13 frequency : 27
label : 14 frequency : 81
label : 15 frequency : 168
label : 18 frequency : 107
label : 21 frequency : 125
label : 22 frequency : 172
label : 23 frequency : 3870
label : 25 frequency : 2321
label : 26 frequency : 25
label : 27 frequency : 314
label : 28 frequency : 76
label : 29 frequency : 116
One thing that clearly stands out is that I am working with an unbalanced data set: I have many samples for classes 25, 23, 1 and 10. I am getting bad results after training, as follows:
precision recall f1-score support
0 0.00 0.00 0.00 31
1 0.61 0.23 0.34 528
2 0.00 0.00 0.00 70
3 0.67 0.06 0.11 32
4 0.00 0.00 0.00 62
5 0.78 0.82 0.80 39
6 0.00 0.00 0.00 3
7 0.00 0.00 0.00 46
8 0.00 0.00 0.00 5
9 0.00 0.00 0.00 62
10 0.14 0.01 0.02 313
12 0.00 0.00 0.00 30
13 0.31 0.57 0.40 7
14 0.00 0.00 0.00 35
15 0.00 0.00 0.00 56
18 0.00 0.00 0.00 35
21 0.00 0.00 0.00 39
22 0.00 0.00 0.00 66
23 0.41 0.74 0.53 1278
25 0.28 0.39 0.33 758
26 0.50 0.25 0.33 8
27 0.29 0.02 0.03 115
28 1.00 0.61 0.76 23
29 0.00 0.00 0.00 42
avg / total 0.33 0.39 0.32 3683
I am getting many zeros and the SVC is not able to learn several of the classes. The hyperparameters that I am using are the following:
from sklearn import svm
clf2= svm.SVC(kernel='linear')
In order to overcome this issue I built a dictionary with a weight for each class as follows:
weight = {}
for i, v in enumerate(uniqLabels):
    weight[v] = labels_cluster.count(uniqLabels[i]) / len(labels_cluster)
for i, v in weight.items():
    print(i, v)
print(weight)
These are the resulting numbers; I am just taking the number of elements with a given label divided by the total number of elements in the label set, so these numbers sum to 1:
0 0.010664037996236221
1 0.14472622994892015
2 0.01765391164082803
3 0.006272963527197778
4 0.018191594228873554
5 0.012277085760372793
6 0.0016130477641365713
7 0.012725154583744062
8 0.0013442064701138096
9 0.01630970517071422
10 0.0883591719688144
12 0.0065418048212205395
13 0.002419571646204857
14 0.007258714938614571
15 0.015055112465274667
18 0.009588672820145173
21 0.011201720584281746
22 0.015413567523971682
23 0.34680526928936284
25 0.20799354780894344
26 0.0022403441168563493
27 0.028138722107715744
28 0.006810646115243301
29 0.01039519670221346
Trying again with this dictionary of weights as follows:
from sklearn import svm
clf2= svm.SVC(kernel='linear',class_weight=weight)
I got:
precision recall f1-score support
0 0.00 0.00 0.00 31
1 0.90 0.19 0.31 528
2 0.00 0.00 0.00 70
3 0.00 0.00 0.00 32
4 0.00 0.00 0.00 62
5 0.00 0.00 0.00 39
6 0.00 0.00 0.00 3
7 0.00 0.00 0.00 46
8 0.00 0.00 0.00 5
9 0.00 0.00 0.00 62
10 0.00 0.00 0.00 313
12 0.00 0.00 0.00 30
13 0.00 0.00 0.00 7
14 0.00 0.00 0.00 35
15 0.00 0.00 0.00 56
18 0.00 0.00 0.00 35
21 0.00 0.00 0.00 39
22 0.00 0.00 0.00 66
23 0.36 0.99 0.52 1278
25 0.46 0.01 0.02 758
26 0.00 0.00 0.00 8
27 0.00 0.00 0.00 115
28 0.00 0.00 0.00 23
29 0.00 0.00 0.00 42
avg / total 0.35 0.37 0.23 3683
Since I am not getting good results, I would really appreciate suggestions on how to automatically adjust the weight of each class and express that in the SVC. I don't have much experience dealing with unbalanced problems, so all suggestions are welcome.
It seems that you are doing the opposite of what you should be doing. In particular, you want to put higher weights on the smaller classes, so that the classifier is penalized more heavily during training on those classes. A good place to start would be setting class_weight="balanced".
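A minimal sketch of that suggestion, assuming labels_cluster is the label list from the question: with class_weight='balanced', sklearn weights each class inversely proportional to its frequency, and compute_class_weight shows the weights it derives.

import numpy as np
from sklearn import svm
from sklearn.utils.class_weight import compute_class_weight

clf2 = svm.SVC(kernel='linear', class_weight='balanced')

# Equivalent weights computed explicitly: n_samples / (n_classes * count_per_class)
classes = np.unique(labels_cluster)
weights = compute_class_weight(class_weight='balanced', classes=classes, y=labels_cluster)
print(dict(zip(classes, weights)))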
