Function against pandas column not generating expected output - python-3.x

I am trying to min-max scale a single column in a dataframe.
I am following this: Writing Min-Max scaler function
My code:
import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.randint(0, 100, size=(100, 4)), columns=list('ABCD'))
print(df, '\n')
y = df['A'].values
def func(x):
    return [round((i - min(x)) / (max(x) - min(x)), 2) for i in x]
df['E'] = func(y)
print(df)
df['E'] is just df['A'] / 100.
Not sure what I am missing, but my result is incorrect.

IIUC, are you trying to do something like this?
import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.randint(0, 100, size=(100, 4)), columns=list('ABCD'))
print(df, '\n')
def func(x):
    return [round((i - min(x)) / (max(x) - min(x)), 2) for i in x]
df_out = df.apply(func).add_prefix('Norm_')
print(df_out)
print(df.join(df_out))
Output:
A B C D
0 91 59 44 5
1 85 44 57 17
2 6 65 37 46
3 40 50 3 40
4 73 58 47 53
.. .. .. .. ..
95 94 76 22 66
96 70 99 40 59
97 96 84 85 24
98 43 51 59 60
99 31 5 55 89
[100 rows x 4 columns]
Norm_A Norm_B Norm_C Norm_D
0 0.93 0.60 0.44 0.05
1 0.87 0.44 0.58 0.17
2 0.06 0.66 0.37 0.47
3 0.41 0.51 0.03 0.41
4 0.74 0.59 0.47 0.54
.. ... ... ... ...
95 0.96 0.77 0.22 0.67
96 0.71 1.00 0.40 0.60
97 0.98 0.85 0.86 0.24
98 0.44 0.52 0.60 0.61
99 0.32 0.05 0.56 0.91
[100 rows x 4 columns]
A B C D Norm_A Norm_B Norm_C Norm_D
0 91 59 44 5 0.93 0.60 0.44 0.05
1 85 44 57 17 0.87 0.44 0.58 0.17
2 6 65 37 46 0.06 0.66 0.37 0.47
3 40 50 3 40 0.41 0.51 0.03 0.41
4 73 58 47 53 0.74 0.59 0.47 0.54
.. .. .. .. .. ... ... ... ...
95 94 76 22 66 0.96 0.77 0.22 0.67
96 70 99 40 59 0.71 1.00 0.40 0.60
97 96 84 85 24 0.98 0.85 0.86 0.24
98 43 51 59 60 0.44 0.52 0.60 0.61
99 31 5 55 89 0.32 0.05 0.56 0.91
[100 rows x 8 columns]

Also consider that using apply() with a function is typically quite inefficient. Try to use vectorized operations whenever you can...
This is a more efficient expression to normalize each column according to the minimum and maximum for that column:
col_min = df.min()  # per column
col_max = df.max()  # per column
df.join(np.round((df - col_min) / (col_max - col_min), 2).add_prefix('Norm_'))
That's much faster than using apply() on a function. For your sample DataFrame:
%timeit df.join(np.round((df - df.min()) / (df.max() - df.min()), 2).add_prefix('Norm_'))
9.89 ms ± 102 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
While the version with apply takes about 4x longer:
%timeit df.join(df.apply(func).add_prefix('Norm_'))
45.8 ms ± 1.16 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
But this difference grows quickly with the size of the DataFrame. For example, with a 1,000 x 26 DataFrame I get 37.2 ms ± 269 µs for the vectorized version versus 19.5 s ± 1.82 s for the version using apply: around 500x faster!
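For reference, the larger frame for those timings can be built along these lines (a sketch; the exact benchmark frame isn't shown above, so the 1,000 x 26 shape is taken from the text and the column names are an assumption):
import string
import numpy as np
import pandas as pd

# Hypothetical 1,000 x 26 benchmark frame; the 26 uppercase letters serve as column names
big_df = pd.DataFrame(np.random.randint(0, 100, size=(1000, 26)),
                      columns=list(string.ascii_uppercase))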

Not sure what you are after. Your maximum and minimum are nearly known in advance because of the number range.
df.loc[:, 'A':'D'].apply(lambda x: x.agg(['min', 'max']))
And if all you need is for df['E'] to be df['A'] / 100, why not:
df['E'] = df['A'] / 100
y = df['E'].values
y
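To spell out that hint with a quick check (a sketch, assuming the df and the E column from the question): randint(0, 100) over 100 rows almost certainly produces a minimum of 0 and a maximum of 99, so (A - min) / (max - min) is roughly A / 99, which is why E looks like A / 100.
# min is very likely 0 and max very likely 99 in a 100-row sample of randint(0, 100)
print(df['A'].min(), df['A'].max())           # typically 0 and 99 (or very close)
print((df['E'] - df['A'] / 100).abs().max())  # small difference, hence the confusion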
Please don't mark me down, I am just trying to get some clarity.

Related

convert txt tables to python dictionary

I have multiple text files which contain some tables; most of the tables are of these two types. I want a way to convert these tables into Python dictionaries.
precision recall f1-score support
BASE_RENT_ANNUAL 0.53 0.57 0.55 1408
BASE_RENT_MONTHLY 0.65 0.54 0.59 3904
BASE_RENT_PSF 0.68 0.59 0.63 1248
RENT_INCREMENT_MONTHLY 0.63 0.44 0.52 7530
SECURITY_DEPOSIT_AMOUNT 0.88 0.89 0.88 3557
micro avg 0.69 0.58 0.63 17647
macro avg 0.67 0.61 0.63 17647
weighted avg 0.68 0.58 0.62 17647
Hard Evaluation Metrics
--------------------------------------------------
Reading predictions from /mnt/c/Users/Aleksandra/mlbuddy/python/bilstm/training/test_predictions.txt...
Nb tokens in test set: 957800
Reading training data from /mnt/c/Users/Aleksandra/mlbuddy/python/bilstm/corpus/train.txt...
Nb tokens in training set: 211153
Strict mode: OFF
---------------------------------------------------------------------
Test tokens Nb tokens Nb words Nb errors Token error rate
---------------------------------------------------------------------
all 957800 5408 39333 0.0411
---------------------------------------------------------------------
unseen-I 704 19 704 1.0000
unseen-O 59870 1724 10208 0.1705
unseen-all 60574 1743 10912 0.1801
---------------------------------------------------------------------
diff-I 13952 70 13952 1.0000
diff-O 5285 121 4645 0.8789
diff-etype 0 0 0 0.0000
diff-all 19237 191 18597 0.9667
---------------------------------------------------------------------
all-unseen+diff 79811 1934 29509 0.3697
---------------------------------------------------------------------
Avg TER on unseen and diff: 0.5734
I have tried this in my code to convert the second table to a dictionary, but it is not working as expected:
from itertools import dropwhile, takewhile
with open("idm.txt") as f:
    dp = dropwhile(lambda x: not x.startswith("-"), f)
    next(dp)  # skip the ---- line
    names = next(dp).split()  # get header names
    next(f)  # skip the next ----- line
    out = []
    for line in takewhile(lambda x: not x.startswith("-"), f):
        a, b = line.rsplit(None, 1)
        out.append(dict(zip(names, a.split(None, 7) + [b])))
Expected output:
{BASE_RENT_ANNUAL: {precision:0.53,recall:0.57,f1-score:0.55,support:1408},
BASE_RENT_MONTHLY: {...}, ..
}
Not the same approach, but the following could be a starting point toward your solution:
txt = ''' precision recall f1-score support
BASE_RENT_ANNUAL 0.53 0.57 0.55 1408
BASE_RENT_MONTHLY 0.65 0.54 0.59 3904
BASE_RENT_PSF 0.68 0.59 0.63 1248
RENT_INCREMENT_MONTHLY 0.63 0.44 0.52 7530
SECURITY_DEPOSIT_AMOUNT 0.88 0.89 0.88 3557
micro avg 0.69 0.58 0.63 17647
macro avg 0.67 0.61 0.63 17647
weighted avg 0.68 0.58 0.62 17647
Hard Evaluation Metrics
--------------------------------------------------
Reading predictions from /mnt/c/Users/Aleksandra/mlbuddy/python/bilstm/training/test_predictions.txt...
Nb tokens in test set: 957800
Reading training data from /mnt/c/Users/Aleksandra/mlbuddy/python/bilstm/corpus/train.txt...
Nb tokens in training set: 211153
Strict mode: OFF
---------------------------------------------------------------------
Test tokens Nb tokens Nb words Nb errors Token error rate
---------------------------------------------------------------------
all 957800 5408 39333 0.0411
---------------------------------------------------------------------
unseen-I 704 19 704 1.0000
unseen-O 59870 1724 10208 0.1705
unseen-all 60574 1743 10912 0.1801
---------------------------------------------------------------------
diff-I 13952 70 13952 1.0000
diff-O 5285 121 4645 0.8789
diff-etype 0 0 0 0.0000
diff-all 19237 191 18597 0.9667
---------------------------------------------------------------------
all-unseen+diff 79811 1934 29509 0.3697
---------------------------------------------------------------------
Avg TER on unseen and diff: 0.5734'''
lst1 = [x.split() for x in txt.split('\n') if x]
lst2 = [(x[0], x[1:]) for x in lst1 if (not x[0].startswith('-') and x[0] == x[0].upper())]
dico = dict(lst2)
dico2 = {}
for k in dico:
    dico2[k] = {'precision': dico[k][0], 'recall': dico[k][1], 'f1-score': dico[k][2], 'support': dico[k][3]}
print(dico2)
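Note that the values in dico2 are still strings; if numbers are needed, a small follow-up (not part of the original answer) could cast them:
# Optional: cast the metric strings to floats
dico3 = {label: {metric: float(value) for metric, value in metrics.items()}
         for label, metrics in dico2.items()}
print(dico3)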

Determine the trend of the market over time using Pandas + python3

print(df)
Index Time left Type Price
0 1797.0 4.00 0.83
1 1789.0 4.00 0.83
2 1781.0 4.00 0.83
3 1757.0 4.00 0.83
4 1445.0 4.00 0.83
5 1413.0 NaN NaN
6 1397.0 NaN NaN
7 1389.0 4.00 0.66
8 1381.0 4.00 0.66
9 349.0 4.00 0.66
10 1325.0 4.00 0.66
11 1317.0 4.00 0.61
12 301.0 4.00 0.62
13 1293.0 4.00 0.65
14 1285.0 4.00 0.56
15 1261.0 4.00 0.56
16 1245.0 4.00 0.56
17 1237.0 4.00 0.57
18 1213.0 4.00 0.51
19 1197.0 4.00 0.52
20 1021.0 3.75 0.86
21 933.0 3.75 0.86
22 813.0 3.75 0.85
23 797.0 3.75 0.85
24 781.0 3.75 0.81
25 525.0 3.75 0.82
26 509.0 3.75 0.82
27 269.0 3.75 0.83
28 181.0 3.75 0.82
29 165.0 3.75 0.82
30 157.0 3.75 0.82
31 37.0 3.75 0.83
We can see the Price going down according to Time left within each Type; the value difference:
df = df.set_index("Type")
a = (df.max(level='Type')["Price"] - df.min(level='Type')["Price"]).sum()
print(a) # value difference: 0.37
How can pandas determine whether the market is going up or down over Time left (for example, return up=1, down=0)?
My idea:
index1 = (["Type"].max & ["Price"].max)
index2 = (["Type"].min & ["Price"].max)
if index1 > index2:
    return 0  # market down
else:
    return 1  # market up
Or is there a better solution? Please help me convert the idea into code; I have just learned pandas.
You can find indices of min and max with idxmin and idxmax, then compare them:
# get indices of min and max:
z = df.set_index('Time left').groupby('Type')['Price'].agg(['idxmin', 'idxmax'])
# if index of max < index of min, then it's increasing
# (since it's `Time left`, not `Time`, so it's all reversed)
z = z['idxmax'].lt(z['idxmin']).astype(int).to_frame('is_increasing')
z
Output:
is_increasing
Type
3.75 0
4.00 0
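A different sketch, under the assumption that "up" simply means the latest observed price (smallest Time left) is higher than the earliest one:
# Sort chronologically (largest 'Time left' first), then compare first vs last price per Type
chron = df.sort_values('Time left', ascending=False)
grouped = chron.groupby('Type')['Price']
trend = (grouped.last() > grouped.first()).astype(int).rename('is_increasing')
print(trend)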

Apply a for loop with multiple GroupBy and nlargest

I'm trying to create a function that will iterate through the date_id column in a dataframe, create a group for every 6 sequential values, and sum the values in the 'value' column, then also return the max value from each group of six along with the result.
date_id item_id value
0 1828 32 1180727.00
1 1828 43 944937.00
2 1828 40 806681.00
3 1828 42 721810.02
4 1828 36 567950.00
5 1828 45 545306.38
6 1828 26 480506.00
7 1828 53 375788.00
8 1828 37 236000.00
9 1828 38 234780.00
10 1828 21 208998.47
11 1828 41 135000.00
12 1797 39 63420.00
13 1828 28 24410.00
14 1462 52 0.00
15 1493 16 0.00
16 1493 17 0.00
17 1493 18 0.00
18 1493 15 0.00
19 1462 53 0.00
20 1462 47 0.00
21 1462 51 0.00
22 1462 50 0.00
23 1462 49 0.00
24 1462 45 0.00
The desired output for each item_id would be
date_id item_id value
0 max value from each date group 36 sum of all values in each date grouping
I have tried using a lambda
df_rslt = df.groupby('date_id')['value'].apply(lambda grp: grp.nlargest(6).sum())
but then quickly realized this will only return one result.
I then tried something like this in a for loop, but it got nowhere
grp_data = df.groupby(['date_id', 'item_id']).aggregate({'value': np.sum})
df_rslt = (grp_data.groupby('date_id')
           .apply(lambda x: x.nlargest(6, 'value'))
           .reset_index(level=0, drop=True))
From this:
"iterate through a date_id column in a dataframe and create a group for each 6 sequential values"
I believe you need to identify a block first, then groupby on those blocks along with the date:
blocks = df.groupby('date_id').cumcount()//6
df.groupby(['date_id', blocks], sort=False)['value'].agg(['sum','max'])
Output:
sum max
date_id
1828 0 4767411.40 1180727.0
1 1671072.47 480506.0
1797 0 63420.00 63420.0
1828 2 24410.00 24410.0
1462 0 0.00 0.0
1493 0 0.00 0.0
1462 1 0.00 0.0
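If the item_id that carries each block maximum is also wanted (the question's desired output mentions item_id), one possible extension of the same idea (a sketch, assuming pandas as pd and the df above):
# For each (date_id, block of 6), report the sum, the max, and the item_id at the max
blocks = df.groupby('date_id').cumcount() // 6
out = (df.groupby(['date_id', blocks], sort=False)
         .apply(lambda g: pd.Series({'sum': g['value'].sum(),
                                     'max': g['value'].max(),
                                     'item_id_of_max': g.loc[g['value'].idxmax(), 'item_id']})))
print(out)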

How to take all combinations of a pandas dataframe (choosing 2 at a time) and make a new dataframe with each two-row combination in a single row?

I have an input dataframe as shown. Taking two rows at a time, there are 4C2 combinations. I want the output to be saved in a dataframe as shown in the output dataframe. In the output dataframe, for each possible combination, the columns of the two rows appear side by side.
Input df
A B
0 0.5 12
1 0.7 16
2 0.9 20
3 0.11 24
Output df
combination A B A' B'
(0,1) 0.5 12 0.7 16
(0,2) 0.5 12 0.9 20
.................................
.................................
Method 1
Create an artificial key column, then merge the df to itself:
df['key'] = 1
df.merge(df, on='key',suffixes=["", "'"]).reset_index(drop=True).drop('key', axis=1)
A B A' B'
0 0.50 12 0.50 12
1 0.50 12 0.70 16
2 0.50 12 0.90 20
3 0.50 12 0.11 24
4 0.70 16 0.50 12
5 0.70 16 0.70 16
6 0.70 16 0.90 20
7 0.70 16 0.11 24
8 0.90 20 0.50 12
9 0.90 20 0.70 16
10 0.90 20 0.90 20
11 0.90 20 0.11 24
12 0.11 24 0.50 12
13 0.11 24 0.70 16
14 0.11 24 0.90 20
15 0.11 24 0.11 24
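Note that Method 1 yields all 16 ordered pairs, including self-pairs; if only the 6 distinct unordered combinations from the question are wanted, one way to trim the cross join (a sketch, assuming pandas >= 1.2 for how='cross') is:
import numpy as np

# Cross join, then keep only pairs (i, j) with i < j, labelled as in the question
cross = df.merge(df, how='cross', suffixes=["", "'"])
i, j = np.divmod(np.arange(len(cross)), len(df))   # original row indices behind each pair
mask = i < j
pairs = cross[mask].copy()
pairs.index = [f"({a},{b})" for a, b in zip(i[mask], j[mask])]
print(pairs)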
Method 2
First prepare a dataframe with all the possible combinations, then merge the original dataframe onto it to get the combinations side by side:
idx = [x for x in range(len(df))] * len(df)
idx.sort()
df2 = pd.concat([df]*len(df))
df2.index = idx
df.merge(df2, left_index=True, right_index=True, suffixes=["", "'"]).reset_index(drop=True)
A B A' B'
0 0.50 12 0.50 12
1 0.50 12 0.70 16
2 0.50 12 0.90 20
3 0.50 12 0.11 24
4 0.70 16 0.50 12
5 0.70 16 0.70 16
6 0.70 16 0.90 20
7 0.70 16 0.11 24
8 0.90 20 0.50 12
9 0.90 20 0.70 16
10 0.90 20 0.90 20
11 0.90 20 0.11 24
12 0.11 24 0.50 12
13 0.11 24 0.70 16
14 0.11 24 0.90 20
15 0.11 24 0.11 24
Let's use itertools.combinations:
from itertools import combinations
pd.concat([df.loc[[i, j]]
             .unstack()
             .set_axis(["A", "A'", "B", "B'"], axis=0, inplace=False)
             .to_frame(name=(i, j)).T
           for i, j in combinations(df.index, 2)])
Output dataframe with multiindex:
A A' B B'
0 1 0.5 0.70 12.0 16.0
2 0.5 0.90 12.0 20.0
3 0.5 0.11 12.0 24.0
1 2 0.7 0.90 16.0 20.0
3 0.7 0.11 16.0 24.0
2 3 0.9 0.11 20.0 24.0
Or with string labels as the index:
pd.concat([df.loc[[i, j]]
             .unstack()
             .set_axis(["A", "A'", "B", "B'"], axis=0, inplace=False)
             .to_frame(name='(' + str(i) + ',' + str(j) + ')').T
           for i, j in combinations(df.index, 2)])
Output:
A A' B B'
(0,1) 0.5 0.70 12.0 16.0
(0,2) 0.5 0.90 12.0 20.0
(0,3) 0.5 0.11 12.0 24.0
(1,2) 0.7 0.90 16.0 20.0
(1,3) 0.7 0.11 16.0 24.0
(2,3) 0.9 0.11 20.0 24.0

Why am I getting a "too many indexers" error?

cars_df = pd.DataFrame((car.iloc[:[1,3,4,6]].values), columns = ['mpg', 'dip', 'hp', 'wt'])
car_t = car.iloc[:9].values
target_names = [0,1]
car_df['group'] = pd.series(car_t, dtypre='category')
sb.pairplot(cars_df)
I have tried using .iloc(axis=0)[xxxx] and turning the slice into a list and a tuple; no dice. Any thoughts? I am trying to make a scatter plot from a lynda.com video, but in the video the host uses .ix, which is deprecated, so I am using .iloc[].
car is a dataframe; here are a few lines of its data:
"Car_name","mpg","cyl","disp","hp","drat","wt","qsec","vs","am","gear","carb"
"Mazda RX4",21,6,160,110,3.9,2.62,16.46,0,1,4,4
"Mazda RX4 Wag",21,6,160,110,3.9,2.875,17.02,0,1,4,4
"Datsun 710",22.8,4,108,93,3.85,2.32,18.61,1,1,4,1
"Hornet 4 Drive",21.4,6,258,110,3.08,3.215,19.44,1,0,3,1
"Hornet Sportabout",18.7,8,360,175,3.15,3.44,17.02,0,0,3,2
"Valiant",18.1,6,225,105,2.76,3.46,20.22,1,0,3,1
"Duster 360",14.3,8,360,245,3.21,3.57,15.84,0,0,3,4
"Merc 240D",24.4,4,146.7,62,3.69,3.19,20,1,0,4,2
"Merc 230",22.8,4,140.8,95,3.92,3.15,22.9,1,0,4,2
"Merc 280",19.2,6,167.6,123,3.92,3.44,18.3,1,0,4,4
"Merc 280C",17.8,6,167.6,123,3.92,3.44,18.9,1,0,4,4
"Merc 450SE",16.4,8,275.8,180,3.07,4.07,17.4,0,0,3,3
I think you want to select multiple columns by iloc:
cars_df = car.iloc[:, [1,3,4,6]]
print (cars_df)
mpg disp hp wt
0 21.0 160.0 110 2.620
1 21.0 160.0 110 2.875
2 22.8 108.0 93 2.320
3 21.4 258.0 110 3.215
4 18.7 360.0 175 3.440
5 18.1 225.0 105 3.460
6 14.3 360.0 245 3.570
7 24.4 146.7 62 3.190
8 22.8 140.8 95 3.150
9 19.2 167.6 123 3.440
10 17.8 167.6 123 3.440
11 16.4 275.8 180 4.070
sb.pairplot(cars_df)
Not 100% sure about the rest of the code, but it seems you need:
# select also the 9th column
cars_df = car.iloc[:, [1,3,4,6,9]]
# rename the 9th column
cars_df = cars_df.rename(columns={'am':'group'})
# convert it to categorical
cars_df['group'] = pd.Categorical(cars_df['group'])
print (cars_df)
mpg disp hp wt group
0 21.0 160.0 110 2.620 1
1 21.0 160.0 110 2.875 1
2 22.8 108.0 93 2.320 1
3 21.4 258.0 110 3.215 0
4 18.7 360.0 175 3.440 0
5 18.1 225.0 105 3.460 0
6 14.3 360.0 245 3.570 0
7 24.4 146.7 62 3.190 0
8 22.8 140.8 95 3.150 0
9 19.2 167.6 123 3.440 0
10 17.8 167.6 123 3.440 0
11 16.4 275.8 180 4.070 0
# add parameter hue for different levels of a categorical variable
sb.pairplot(cars_df, hue='group')
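For completeness, a minimal end-to-end sketch (assuming the sample rows above are saved as 'cars.csv'; the filename and the seaborn alias are just examples):
import pandas as pd
import seaborn as sb

car = pd.read_csv('cars.csv')                # the mtcars-style sample shown above
cars_df = car.iloc[:, [1, 3, 4, 6, 9]].rename(columns={'am': 'group'})
cars_df['group'] = pd.Categorical(cars_df['group'])
sb.pairplot(cars_df, hue='group')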
