I have multiple text files that contain tables; most of the tables are of these two types. I want a way to convert these tables into Python dictionaries.
precision recall f1-score support
BASE_RENT_ANNUAL 0.53 0.57 0.55 1408
BASE_RENT_MONTHLY 0.65 0.54 0.59 3904
BASE_RENT_PSF 0.68 0.59 0.63 1248
RENT_INCREMENT_MONTHLY 0.63 0.44 0.52 7530
SECURITY_DEPOSIT_AMOUNT 0.88 0.89 0.88 3557
micro avg 0.69 0.58 0.63 17647
macro avg 0.67 0.61 0.63 17647
weighted avg 0.68 0.58 0.62 17647
Hard Evaluation Metrics
--------------------------------------------------
Reading predictions from /mnt/c/Users/Aleksandra/mlbuddy/python/bilstm/training/test_predictions.txt...
Nb tokens in test set: 957800
Reading training data from /mnt/c/Users/Aleksandra/mlbuddy/python/bilstm/corpus/train.txt...
Nb tokens in training set: 211153
Strict mode: OFF
---------------------------------------------------------------------
Test tokens Nb tokens Nb words Nb errors Token error rate
---------------------------------------------------------------------
all 957800 5408 39333 0.0411
---------------------------------------------------------------------
unseen-I 704 19 704 1.0000
unseen-O 59870 1724 10208 0.1705
unseen-all 60574 1743 10912 0.1801
---------------------------------------------------------------------
diff-I 13952 70 13952 1.0000
diff-O 5285 121 4645 0.8789
diff-etype 0 0 0 0.0000
diff-all 19237 191 18597 0.9667
---------------------------------------------------------------------
all-unseen+diff 79811 1934 29509 0.3697
---------------------------------------------------------------------
Avg TER on unseen and diff: 0.5734
I have tried the following code to convert the second table to a dictionary, but it is not working as expected:
from itertools import dropwhile, takewhile

with open("idm.txt") as f:
    dp = dropwhile(lambda x: not x.startswith("-"), f)
    next(dp)  # skip ----
    names = next(dp).split()  # get header names
    next(f)  # skip -----
    out = []
    for line in takewhile(lambda x: not x.startswith("-"), f):
        a, b = line.rsplit(None, 1)
        out.append(dict(zip(names, a.split(None, 7) + [b])))
Expected output:
{'BASE_RENT_ANNUAL': {'precision': 0.53, 'recall': 0.57, 'f1-score': 0.55, 'support': 1408},
 'BASE_RENT_MONTHLY': {...},
 ...}
This is not the same approach, but the following could be a starting point toward your actual solution:
txt = ''' precision recall f1-score support
BASE_RENT_ANNUAL 0.53 0.57 0.55 1408
BASE_RENT_MONTHLY 0.65 0.54 0.59 3904
BASE_RENT_PSF 0.68 0.59 0.63 1248
RENT_INCREMENT_MONTHLY 0.63 0.44 0.52 7530
SECURITY_DEPOSIT_AMOUNT 0.88 0.89 0.88 3557
micro avg 0.69 0.58 0.63 17647
macro avg 0.67 0.61 0.63 17647
weighted avg 0.68 0.58 0.62 17647
Hard Evaluation Metrics
--------------------------------------------------
Reading predictions from /mnt/c/Users/Aleksandra/mlbuddy/python/bilstm/training/test_predictions.txt...
Nb tokens in test set: 957800
Reading training data from /mnt/c/Users/Aleksandra/mlbuddy/python/bilstm/corpus/train.txt...
Nb tokens in training set: 211153
Strict mode: OFF
---------------------------------------------------------------------
Test tokens Nb tokens Nb words Nb errors Token error rate
---------------------------------------------------------------------
all 957800 5408 39333 0.0411
---------------------------------------------------------------------
unseen-I 704 19 704 1.0000
unseen-O 59870 1724 10208 0.1705
unseen-all 60574 1743 10912 0.1801
---------------------------------------------------------------------
diff-I 13952 70 13952 1.0000
diff-O 5285 121 4645 0.8789
diff-etype 0 0 0 0.0000
diff-all 19237 191 18597 0.9667
---------------------------------------------------------------------
all-unseen+diff 79811 1934 29509 0.3697
---------------------------------------------------------------------
Avg TER on unseen and diff: 0.5734'''
lst1 = [x.split() for x in txt.split('\n') if x]
lst2 = [(x[0], x[1:]) for x in lst1 if (not x[0].startswith('-') and x[0] == x[0].upper())]
dico = dict(lst2)
dico2 = {}
for k in dico:
    dico2[k] = {'precision': dico[k][0], 'recall': dico[k][1], 'f1-score': dico[k][2], 'support': dico[k][3]}
print(dico2)
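As a small hypothetical refinement (not part of the answer above): the snippet leaves every metric as a string, so casting to float makes the result comparable and sortable. The `txt` sample here is trimmed from the question:

```python
txt = '''          precision  recall  f1-score  support
BASE_RENT_ANNUAL       0.53    0.57      0.55     1408
BASE_RENT_MONTHLY      0.65    0.54      0.59     3904'''

# split each non-empty line into whitespace-separated fields
rows = [line.split() for line in txt.splitlines() if line.strip()]
header = rows[0]  # ['precision', 'recall', 'f1-score', 'support']
result = {
    r[0]: {name: float(v) for name, v in zip(header, r[1:])}
    for r in rows[1:]
    if r[0] == r[0].upper()  # keep only the ALL-CAPS label rows
}
print(result)
```

Note that `support` also comes back as a float (e.g. `1408.0`); cast that field with `int()` if you need integers.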
print(df)
Index Time left Type Price
0 1797.0 4.00 0.83
1 1789.0 4.00 0.83
2 1781.0 4.00 0.83
3 1757.0 4.00 0.83
4 1445.0 4.00 0.83
5 1413.0 NaN NaN
6 1397.0 NaN NaN
7 1389.0 4.00 0.66
8 1381.0 4.00 0.66
9 349.0 4.00 0.66
10 1325.0 4.00 0.66
11 1317.0 4.00 0.61
12 301.0 4.00 0.62
13 1293.0 4.00 0.65
14 1285.0 4.00 0.56
15 1261.0 4.00 0.56
16 1245.0 4.00 0.56
17 1237.0 4.00 0.57
18 1213.0 4.00 0.51
19 1197.0 4.00 0.52
20 1021.0 3.75 0.86
21 933.0 3.75 0.86
22 813.0 3.75 0.85
23 797.0 3.75 0.85
24 781.0 3.75 0.81
25 525.0 3.75 0.82
26 509.0 3.75 0.82
27 269.0 3.75 0.83
28 181.0 3.75 0.82
29 165.0 3.75 0.82
30 157.0 3.75 0.82
31 37.0 3.75 0.83
We can see that Price goes down according to Time left within each Type; the value difference:
df = df.set_index("Type")
a = (df.max(level='Type')["Price"] - df.min(level='Type')["Price"]).sum()
print(a) # value difference: 0.37
How can pandas tell whether a market is going up or down over Time left (for example, returning up=1, down=0)?
My idea:
index1 = (["Type"].max & ["Price"].max)
index2 = (["Type"].min & ["Price"].max)
if index1 > index2:
    return 0  # market down
else:
    return 1  # market up
Or is there a better solution? Please help me convert this idea into code; I have only just learned pandas.
You can find indices of min and max with idxmin and idxmax, then compare them:
# get indices of min and max:
z = df.set_index('Time left').groupby('Type')['Price'].agg(['idxmin', 'idxmax'])
# if index of max < index of min, then it's increasing
# (since it's `Time left`, not `Time`, so it's all reversed)
z = z['idxmax'].lt(z['idxmin']).astype(int).to_frame('is_increasing')
z
Output:
      is_increasing
Type
3.75              0
4.00              0
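To sanity-check the approach above, here is a minimal runnable sketch on made-up data (only a few rows, with the same column names as in the question):

```python
import pandas as pd

df = pd.DataFrame({
    'Time left': [1797.0, 1197.0, 1021.0, 37.0],
    'Type':      [4.00,   4.00,   3.75,   3.75],
    'Price':     [0.83,   0.52,   0.86,   0.83],
})

# per Type, find the 'Time left' values at which Price is lowest / highest
z = df.set_index('Time left').groupby('Type')['Price'].agg(['idxmin', 'idxmax'])
# the max price sits at the larger 'Time left' (earlier in real time) for
# both types here, so neither market is increasing
is_increasing = z['idxmax'].lt(z['idxmin']).astype(int)
print(is_increasing)
```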
I'm trying to create a function that iterates through a date_id column in a dataframe, creates a group for every 6 sequential values, sums the values in the 'value' column, and also returns the max value from each group of six along with the result.
date_id item_id value
0 1828 32 1180727.00
1 1828 43 944937.00
2 1828 40 806681.00
3 1828 42 721810.02
4 1828 36 567950.00
5 1828 45 545306.38
6 1828 26 480506.00
7 1828 53 375788.00
8 1828 37 236000.00
9 1828 38 234780.00
10 1828 21 208998.47
11 1828 41 135000.00
12 1797 39 63420.00
13 1828 28 24410.00
14 1462 52 0.00
15 1493 16 0.00
16 1493 17 0.00
17 1493 18 0.00
18 1493 15 0.00
19 1462 53 0.00
20 1462 47 0.00
21 1462 51 0.00
22 1462 50 0.00
23 1462 49 0.00
24 1462 45 0.00
The desired output for each item_id would be
date_id item_id value
0 max value from each date group 36 sum of all values in each date grouping
I have tried using a lambda
df_rslt = df.groupby('date_id')['value'].apply(lambda grp: grp.nlargest(6).sum())
but then quickly realized this will only return one result.
I then tried something like this in a for loop, but it got nowhere:
grp_data = (df.groupby(['date_id', 'item_id'])
              .aggregate({'value': np.sum}))
df_rslt = (grp_data.groupby('date_id')
                   .apply(lambda x: x.nlargest(6, 'value'))
                   .reset_index(level=0, drop=True))
From this
iterate through a date_id column in a dataframe and create a group for each 6 sequential values
I believe you need to identify a block first, then groupby on those blocks along with the date:
blocks = df.groupby('date_id').cumcount()//6
df.groupby(['date_id', blocks], sort=False)['value'].agg(['sum','max'])
Output:
                  sum        max
date_id
1828    0  4767411.40  1180727.0
        1  1671072.47   480506.0
1797    0    63420.00    63420.0
1828    2    24410.00    24410.0
1462    0        0.00        0.0
1493    0        0.00        0.0
1462    1        0.00        0.0
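A minimal reproduction of the blocking step, on made-up data small enough to check by hand (eight rows for one date, one for another):

```python
import pandas as pd

df = pd.DataFrame({
    'date_id': [1828] * 8 + [1797],
    'value':   [100, 90, 80, 70, 60, 50, 40, 30, 20],
})

# cumcount numbers rows 0, 1, 2, ... within each date_id; integer-dividing
# by 6 turns that into block 0 for the first six rows, block 1 for the
# next six, and so on
blocks = df.groupby('date_id').cumcount() // 6
out = df.groupby(['date_id', blocks], sort=False)['value'].agg(['sum', 'max'])
print(out)
```

The first six 1828 rows sum to 450 (max 100), the remaining two form block 1 with sum 70 (max 40), and 1797 gets its own block 0.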
I have an input dataframe as shown. Taking two rows at a time, there are 4C2 combinations. I want the output saved in a dataframe, as shown in the output dataframe below, where for each possible combination the columns of the two rows sit side by side.
Input df
A B
0 0.5 12
1 0.7 16
2 0.9 20
3 0.11 24
Output df
combination A B A' B'
(0,1) 0.5 12 0.7 16
(0,2) 0.5 12 0.9 20
.................................
.................................
Method 1
Create an artificial key column, then merge the df to itself:
df['key'] = 1
df.merge(df, on='key',suffixes=["", "'"]).reset_index(drop=True).drop('key', axis=1)
A B A' B'
0 0.50 12 0.50 12
1 0.50 12 0.70 16
2 0.50 12 0.90 20
3 0.50 12 0.11 24
4 0.70 16 0.50 12
5 0.70 16 0.70 16
6 0.70 16 0.90 20
7 0.70 16 0.11 24
8 0.90 20 0.50 12
9 0.90 20 0.70 16
10 0.90 20 0.90 20
11 0.90 20 0.11 24
12 0.11 24 0.50 12
13 0.11 24 0.70 16
14 0.11 24 0.90 20
15 0.11 24 0.11 24
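As a side note (an assumption about your environment, not part of the answer above): if your pandas version is 1.2 or later, the artificial key column can be avoided entirely with a cross merge. A sketch on the same data:

```python
import pandas as pd

df = pd.DataFrame({'A': [0.5, 0.7, 0.9, 0.11], 'B': [12, 16, 20, 24]})

# how='cross' (pandas >= 1.2) produces the full Cartesian product directly,
# no dummy key needed; overlapping columns get the given suffixes
out = df.merge(df, how='cross', suffixes=["", "'"])
print(out)
```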
Method 2
First prepare a dataframe with all the possible combinations, then we merge our original dataframe to get the combinations side by side:
idx = [x for x in range(len(df))] * len(df)
idx.sort()
df2 = pd.concat([df]*len(df))
df2.index = idx
df.merge(df2, left_index=True, right_index=True, suffixes=["", "'"]).reset_index(drop=True)
A B A' B'
0 0.50 12 0.50 12
1 0.50 12 0.70 16
2 0.50 12 0.90 20
3 0.50 12 0.11 24
4 0.70 16 0.50 12
5 0.70 16 0.70 16
6 0.70 16 0.90 20
7 0.70 16 0.11 24
8 0.90 20 0.50 12
9 0.90 20 0.70 16
10 0.90 20 0.90 20
11 0.90 20 0.11 24
12 0.11 24 0.50 12
13 0.11 24 0.70 16
14 0.11 24 0.90 20
15 0.11 24 0.11 24
Let's use itertools.combinations:
from itertools import combinations
pd.concat([df.loc[[i, j]]
             .unstack()
             .set_axis(["A", "A'", "B", "B'"], axis=0, inplace=False)
             .to_frame(name=(i, j)).T
           for i, j in combinations(df.index, 2)])
Output dataframe with multiindex:
A A' B B'
0 1 0.5 0.70 12.0 16.0
2 0.5 0.90 12.0 20.0
3 0.5 0.11 12.0 24.0
1 2 0.7 0.90 16.0 20.0
3 0.7 0.11 16.0 24.0
2 3 0.9 0.11 20.0 24.0
Or, with a string index:
pd.concat([df.loc[[i, j]]
             .unstack()
             .set_axis(["A", "A'", "B", "B'"], axis=0, inplace=False)
             .to_frame(name='(' + str(i) + ',' + str(j) + ')').T
           for i, j in combinations(df.index, 2)])
Output:
A A' B B'
(0,1) 0.5 0.70 12.0 16.0
(0,2) 0.5 0.90 12.0 20.0
(0,3) 0.5 0.11 12.0 24.0
(1,2) 0.7 0.90 16.0 20.0
(1,3) 0.7 0.11 16.0 24.0
(2,3) 0.9 0.11 20.0 24.0
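A simpler variant (a sketch of my own, not taken from the answers above) builds the 4C2 rows directly from itertools.combinations, without unstack:

```python
from itertools import combinations

import pandas as pd

df = pd.DataFrame({'A': [0.5, 0.7, 0.9, 0.11], 'B': [12, 16, 20, 24]})

# one output row per unordered pair (i, j): row i's columns, then row j's
rows = {
    f'({i},{j})': [df.at[i, 'A'], df.at[i, 'B'], df.at[j, 'A'], df.at[j, 'B']]
    for i, j in combinations(df.index, 2)
}
out = pd.DataFrame.from_dict(rows, orient='index', columns=["A", "B", "A'", "B'"])
print(out)
```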
cars_df = pd.DataFrame((car.iloc[:[1,3,4,6]].values), columns = ['mpg', 'dip', 'hp', 'wt'])
car_t = car.iloc[:9].values
target_names = [0,1]
car_df['group'] = pd.series(car_t, dtypre='category')
sb.pairplot(cars_df)
I have tried using .iloc(axis=0)[xxxx] and turning the slice into a list and a tuple, with no luck. Any thoughts? I am trying to make a scatter plot following a lynda.com video, but the host uses .ix, which is deprecated, so I am using .iloc[].
car is a dataframe; a few lines of the data:
"Car_name","mpg","cyl","disp","hp","drat","wt","qsec","vs","am","gear","carb"
"Mazda RX4",21,6,160,110,3.9,2.62,16.46,0,1,4,4
"Mazda RX4 Wag",21,6,160,110,3.9,2.875,17.02,0,1,4,4
"Datsun 710",22.8,4,108,93,3.85,2.32,18.61,1,1,4,1
"Hornet 4 Drive",21.4,6,258,110,3.08,3.215,19.44,1,0,3,1
"Hornet Sportabout",18.7,8,360,175,3.15,3.44,17.02,0,0,3,2
"Valiant",18.1,6,225,105,2.76,3.46,20.22,1,0,3,1
"Duster 360",14.3,8,360,245,3.21,3.57,15.84,0,0,3,4
"Merc 240D",24.4,4,146.7,62,3.69,3.19,20,1,0,4,2
"Merc 230",22.8,4,140.8,95,3.92,3.15,22.9,1,0,4,2
"Merc 280",19.2,6,167.6,123,3.92,3.44,18.3,1,0,4,4
"Merc 280C",17.8,6,167.6,123,3.92,3.44,18.9,1,0,4,4
"Merc 450SE",16.4,8,275.8,180,3.07,4.07,17.4,0,0,3,3
I think you want to select multiple columns by position with iloc:
cars_df = car.iloc[:, [1,3,4,6]]
print (cars_df)
mpg disp hp wt
0 21.0 160.0 110 2.620
1 21.0 160.0 110 2.875
2 22.8 108.0 93 2.320
3 21.4 258.0 110 3.215
4 18.7 360.0 175 3.440
5 18.1 225.0 105 3.460
6 14.3 360.0 245 3.570
7 24.4 146.7 62 3.190
8 22.8 140.8 95 3.150
9 19.2 167.6 123 3.440
10 17.8 167.6 123 3.440
11 16.4 275.8 180 4.070
sb.pairplot(cars_df)
I'm not 100% sure about the rest of your code, but it seems you need:
# also select the 9th column ('am')
cars_df = car.iloc[:, [1, 3, 4, 6, 9]]
# rename the 9th column
cars_df = cars_df.rename(columns={'am': 'group'})
# convert it to categorical
cars_df['group'] = pd.Categorical(cars_df['group'])
print (cars_df)
mpg disp hp wt group
0 21.0 160.0 110 2.620 1
1 21.0 160.0 110 2.875 1
2 22.8 108.0 93 2.320 1
3 21.4 258.0 110 3.215 0
4 18.7 360.0 175 3.440 0
5 18.1 225.0 105 3.460 0
6 14.3 360.0 245 3.570 0
7 24.4 146.7 62 3.190 0
8 22.8 140.8 95 3.150 0
9 19.2 167.6 123 3.440 0
10 17.8 167.6 123 3.440 0
11 16.4 275.8 180 4.070 0
# add parameter hue for different levels of a categorical variable
sb.pairplot(cars_df, hue='group')
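For a self-contained way to reproduce the selection, you could build the dataframe from the CSV lines in the question (trimmed to two rows here) and apply the same iloc call; sb.pairplot is left out so the sketch runs without seaborn:

```python
import io

import pandas as pd

csv = '''"Car_name","mpg","cyl","disp","hp","drat","wt","qsec","vs","am","gear","carb"
"Mazda RX4",21,6,160,110,3.9,2.62,16.46,0,1,4,4
"Datsun 710",22.8,4,108,93,3.85,2.32,18.61,1,1,4,1'''

car = pd.read_csv(io.StringIO(csv))
# iloc[:, [...]] selects columns by position: mpg, disp, hp, wt
# are at positions 1, 3, 4, 6 (Car_name is position 0)
cars_df = car.iloc[:, [1, 3, 4, 6]]
print(cars_df)
```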