LSTM named entity recognition model - shapes are incompatible or logits/labels have different dimensions - TensorFlow 2.9 - Keras

I am working on an NLP LSTM named entity extraction model but I am running into the errors detailed below. I am running this code in a Jupyter notebook.
TensorFlow version: 2.9
Both input and output sequences are of length 50.
input sentence : [123 88 170 221 132 52 105 32 211 91 126 211 24 221 134 154 221 162
215 80 144 101 61 136 68 133 40 200 133 40 218 131 139 199 124 74
184 92 213 185 221 221 221 221 221 221 221 221 221 221]
output sentence label: [ 7 7 7 7 0 7 6 2 7 5 1 7 7 7 7 7 7 7 7 10 7 7 7 7
3 8 7 3 8 7 7 7 7 7 7 7 7 6 2 7 7 7 7 7 7 7 7 7
7 7]
I added up to 5 layers to train the model. Here is the model:
model = tf.keras.Sequential([
    tf.keras.layers.Embedding(num_words, 50, input_length=50),
    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(64, return_sequences=True)),
    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(32)),
    tf.keras.layers.Dropout(0.5),
    tf.keras.layers.Dense(64, activation='relu'),
    tf.keras.layers.Dense(num_tags, activation='softmax')
])
If I use the loss function "categorical_crossentropy", I get this error:
ValueError: Shapes (None, 50) and (None, 11) are incompatible
If I use the loss function "sparse_categorical_crossentropy", I get this error:
logits and labels must have the same first dimension, got logits shape [13,11] and labels shape [650]
[[{{node sparse_categorical_crossentropy/SparseSoftmaxCrossEntropyWithLogits/SparseSoftmaxCrossEntropyWithLogits}}]]
I tried adding an input shape as the first layer, but still no luck:
tf.keras.layers.Input(shape=(max_len,))
Can anyone help me solve this? I have tried different approaches with no luck.
Here is the model summary:
 Layer (type)                       Output Shape        Param #
=================================================================
 embedding_18 (Embedding)           (None, 50, 50)      11100
 bidirectional_35 (Bidirectional)   (None, 50, 128)     58880
 bidirectional_36 (Bidirectional)   (None, 64)          41216
 dropout_17 (Dropout)               (None, 64)          0
 dense_35 (Dense)                   (None, 64)          4160
 dense_36 (Dense)                   (None, 11)          715
=================================================================
Total params: 116,071
Trainable params: 116,071
Non-trainable params: 0
_________________________________________________________________

I think the problem is in your last layers. As written, the second LSTM drops the time dimension (it has no return_sequences=True), so after the Dense layers the model emits a single vector of 'num_tags' (11) scores per sentence, while your labels carry one tag for each of the 50 steps. That is exactly the (None, 50) vs (None, 11) clash, and with sparse labels the batch of 13 sentences is flattened to 13 × 50 = 650 labels against only 13 logit rows.
You want 'num_tags' outputs at each step of the sequence, not only at the end. Keep return_sequences=True on both LSTMs and wrap the Dense layers in TimeDistributed:
tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(32, return_sequences=True)),
tf.keras.layers.TimeDistributed(tf.keras.layers.Dense(64, activation='relu')),
tf.keras.layers.TimeDistributed(tf.keras.layers.Dense(num_tags, activation='softmax'))
Then you can use the "sparse_categorical_crossentropy" loss function, since your labels are ints.
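Putting it together, a minimal sketch of the corrected model (assuming num_words and num_tags are defined as in the question and sequences are padded to length 50):

import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Embedding(num_words, 50, input_length=50),
    # return_sequences=True on every LSTM keeps the (batch, 50, features) shape
    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(64, return_sequences=True)),
    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(32, return_sequences=True)),
    tf.keras.layers.Dropout(0.5),
    tf.keras.layers.TimeDistributed(tf.keras.layers.Dense(64, activation='relu')),
    # a softmax over the tags at every one of the 50 steps
    tf.keras.layers.TimeDistributed(tf.keras.layers.Dense(num_tags, activation='softmax'))
])
model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])

The output shape becomes (None, 50, num_tags), which lines up with integer label sequences of shape (None, 50).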
See this tutorial as an example:
https://towardsdatascience.com/named-entity-recognition-ner-using-keras-bidirectional-lstm-28cd3f301f54

Related

Retaining bad_lines identified by pandas in the output file instead of skipping those lines

I have to convert text files into CSVs after processing the contents of the text file as a pandas dataframe.
Below is the code I am using; out_txt is my input text file and out_csv is my output CSV file.
df = pd.read_csv(out_txt, sep='\s', header=None, on_bad_lines='warn', encoding = "ANSI")
df = df.replace(r'[^\w\s]|_]/()|~"{}="', '', regex=True)
df.to_csv(out_csv, header=None)
If "on_bad_lines = 'warn'" is not decalred the csv files are not created. But if i use this condition those bad lines are getting skipped (obviously) with the warning
Skipping line 6: Expected 8 fields in line 7, saw 9. Error could possibly be due to quotes being ignored when a multi-char delimiter is used.
I would like to retain these bad lines in the csv. I have highlighted the bad lines detected in the below image (my input text file).
Below is the contents of the text file which is getting saved. In this content i would like to remove characters like #, &, (, ).
75062 220 8 6 110 220 250 <1
75063 260 5 2 584 878 950 <1
75064 810 <2 <2 456 598 3700 <1
75065 115 5 2 96 74 5000 <1
75066 976 <5 2 5 68 4200 <1
75067 22 210 4 348 140 4050 <1
75068 674 5 4 - 54 1130 3850 <1
75069 414 5 y) 446 6.6% 2350 <1
75070 458 <5 <2 548 82 3100 <1
75071 4050 <5 2 780 6430 3150 <1
75072 115 <7 <1 64 5.8% 4050 °#&4«x<i1
75073 456 <7 4 46 44 3900 <1
75074 376 <7 <2 348 3.8% 2150 <1
75075 378 <6 y) 30 40 2000 <1
I would read whole lines and split on whitespace later with str.split, rather than letting read_csv parse the fields:
df = (
    pd.read_csv(out_txt, header=None, encoding='ANSI')
    .replace(r'[^\w\s]|_]/()|~"{}="', '', regex=True)
    .squeeze().str.split(expand=True)
)
Another variant (skipping everything that comes in between the numbers):
df = (
    pd.read_csv(out_txt, header=None, encoding='ANSI')
    [0].str.findall(r"\b(\d+)\b")            # findall returns a list of digit runs per line
    .str.join(' ').str.split(expand=True)    # join each list back, then expand to columns
)
Output:
print(df)
0 1 2 3 4 5 6 7
0 375020 1060 115 38 440 350 7800 1
1 375021 920 80 26 310 290 5000 1
2 375022 1240 110 28 460 430 5900 1
3 375023 830 150 80 650 860 6200 1
4 375024 185 175 96 800 1020 2400 1
5 375025 680 370 88 1700 1220 172 1
6 375026 550 290 72 2250 1460 835 2
7 375027 390 120 60 1620 1240 158 1
8 375028 630 180 76 820 1360 180 1
9 375029 460 280 66 380 790 3600 1
10 375030 660 260 62 11180 1040 300 1
11 375031 530 200 84 1360 1060 555 1
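As a quick self-contained check of the findall variant (using a hypothetical two-line sample in place of out_txt):

import io
import pandas as pd

# two sample lines modeled on the question's file
sample = io.StringIO("75062 220 8 6 110 220 250 <1\n75063 260 5 2 584 878 950 <1\n")
df = (pd.read_csv(sample, header=None)
        [0].str.findall(r"\b(\d+)\b")            # pull out the digit runs per line
        .str.join(' ').str.split(expand=True))   # lists -> strings -> columns
print(df)
#        0    1  2  3    4    5    6  7
# 0  75062  220  8  6  110  220  250  1
# 1  75063  260  5  2  584  878  950  1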

column multiplication based on a mapping

I have the following two dataframes. The first one maps each node to an area number and the maximum electric load of that node.
bus = pd.DataFrame(data={'Node':[101, 102, 103, 104, 105], 'Area':[1, 1, 2, 2, 3], 'Load':[10, 15, 12, 20, 25]})
which gives us:
Node Area Load
0 101 1 10
1 102 1 15
2 103 2 12
3 104 2 20
4 105 3 25
The second dataframe shows the total electric load of each area over a time period (from hour 0 to 5). The column names are the areas (matching the column Area in dataframe bus).
load = pd.DataFrame(data={1:[20, 18, 17, 19, 22, 25], 2:[23, 25,24, 27, 30, 32], 3:[10, 14, 19, 25, 22, 20]})
which gives us:
1 2 3
0 20 23 10
1 18 25 14
2 17 24 19
3 19 27 25
4 22 30 22
5 25 32 20
I would like to have a dataframe that shows the electric load of each bus over the 6 hours.
Assumption: the share of the load over time is the same as the share of the maximum load shown in bus; e.g., bus 101 carries 10/(10+15) = 0.4 of the electric load of area 1, so to calculate its hourly load, 0.4 should be multiplied by the column corresponding to area 1 in load.
The desired output should be of the following format:
101 102 103 104 105
0 8 12 8.625 14.375 10
1 7.2 10.8 9.375 15.625 14
2 6.8 10.2 9 15 19
3 7.6 11.4 10.125 16.875 25
4 8.8 13.2 11.25 18.75 22
5 10 15 12 20 20
For column 101, we have 0.4 multiplied by column 1 of load.
Any help is greatly appreciated.
One option is to divide Load by the per-area sum, pivot so the columns line up with load, spread the shares into all rows, then multiply on the matching levels:
(bus.assign(Load = bus.Load.div(bus.groupby('Area').Load.transform('sum')))
    .pivot(columns=['Area', 'Node'], values='Load')
    .reindex(load.index)
    .ffill()   # spread the shares into all rows
    .bfill()
    .mul(load, level=0)
    .droplevel(0, axis=1)
    .rename_axis(columns=None)
)
101 102 103 104 105
0 8.0 12.0 8.625 14.375 10.0
1 7.2 10.8 9.375 15.625 14.0
2 6.8 10.2 9.000 15.000 19.0
3 7.6 11.4 10.125 16.875 25.0
4 8.8 13.2 11.250 18.750 22.0
5 10.0 15.0 12.000 20.000 20.0
You can calculate the ratio in bus, transpose load, merge the two, and multiply the ratio by the load. Here goes:
bus['area_sum'] = bus.groupby('Area')['Load'].transform('sum')
bus['node_ratio'] = bus['Load'] / bus['area_sum']
full_data = bus.merge(load.T.reset_index(), left_on='Area', right_on='index')
result = pd.DataFrame([full_data['node_ratio'] * full_data[x] for x in range(6)])
result.columns = full_data['Node'].values
result:
   101   102     103     104  105
0    8    12   8.625  14.375   10
1  7.2  10.8   9.375  15.625   14
2  6.8  10.2       9      15   19
3  7.6  11.4  10.125  16.875   25
4  8.8  13.2   11.25   18.75   22
5   10    15      12      20   20
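For comparison, a compact alternative sketch that builds each node's hourly series directly from its area share (assuming bus and load are defined as in the question):

import pandas as pd

# each node's share of its area's maximum load, indexed by node
shares = bus.set_index('Node').assign(
    share=lambda d: d['Load'] / d.groupby('Area')['Load'].transform('sum'))
# each node's column is its share times the hourly load of its area
result = pd.DataFrame({node: load[row.Area] * row.share
                       for node, row in shares.iterrows()})

This trades the pivot/reindex reshaping for an explicit dict comprehension over the nodes, which can be easier to follow when the node count is small.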

How to map two dataframes on a condition when they have different numbers of rows

I have two dataframes that need to be mapped (or joined?) based on some condition. These are the dataframes:
df_1
img_names img_array
0 1_rel 253
1 1_rel_right 255
2 1_rel_top 250
3 4_rel 180
4 4_rel_right 182
5 4_rel_top 189
6 7_rel 217
7 7_rel_right 183
8 7_rel_top 196
df_2
List_No time
0 1 38
1 4 23
2 7 32
After mapping I would like to get the following dataframe:
df_3
img_names img_array List_No time
0 1_rel 253 1 38
1 1_rel_right 255 1 38
2 1_rel_top 250 1 38
3 4_rel 180 4 23
4 4_rel_right 182 4 23
5 4_rel_top 189 4 23
6 7_rel 217 7 32
7 7_rel_right 183 7 32
8 7_rel_top 196 7 32
Basically, each row of df_2 is repeated 3 times to match the number of rows in df_1, and the mapping (if we can call it that) is keyed on the leading number in each row of df_1's img_names column. The values in img_names may differ, but each always starts with a number (1, 4, 7 in this case) followed by an underscore, etc. So I need to split off that number in each row and match it against List_No.
I hope the example above is clear.
Thank you.
Looks like you can just extract the digit parts and merge:
df_1['List_No'] = df_1['img_names'].str.split('_').str[0].astype(int)
df_3 = df_1.merge(df_2, on='List_No')
Output:
img_names img_array List_No time
0 1_rel 253 1 38
1 1_rel_right 255 1 38
2 1_rel_top 250 1 38
3 4_rel 180 4 23
4 4_rel_right 182 4 23
5 4_rel_top 189 4 23
6 7_rel 217 7 32
7 7_rel_right 183 7 32
8 7_rel_top 196 7 32
An alternative to @QuangHoang's answer (which I believe you should pick, as it is more robust). This uses the map method and assumes every List_No extracted from df_1 is present in df_2:
df_1.assign(
    List_No=df_1.img_names.str.extract(r"(\d+)", expand=False).astype(int),
    time=lambda x: x.List_No.map(df_2.set_index("List_No")["time"]),
)
img_names img_array List_No time
0 1_rel 253 1 38
1 1_rel_right 255 1 38
2 1_rel_top 250 1 38
3 4_rel 180 4 23
4 4_rel_right 182 4 23
5 4_rel_top 189 4 23
6 7_rel 217 7 32
7 7_rel_right 183 7 32
8 7_rel_top 196 7 32

How do I compare values in two dataframe in an efficient way

I am new to Python, pandas and Stack Overflow, so I will appreciate any help. I have two pandas dataframes: the first one is in ascending order (values from 0 to 100 in steps of 0.1); the second one has 26000 values from 2.3 to 38.5, in no order, with some values repeated. What I am trying to do is, for each value in the first dataframe, find how many values in the second dataframe are less than or equal to it, in an efficient way.
My code below does it in 45 seconds, but I'd like it to run in around 10.
Thanks in advance.
Code:
import numpy as np

def get_CDF2(df1, df2):
    x = df1                    # the first dataframe is already sorted ascending
    y = np.sort(df2, axis=0)   # sort the second dataframe's values ascending
    df_res = []                # keep the results here
    yi = iter(y)               # iterator to move over y
    yindex = 0
    flag = 0                   # when set to 1, no more comparisons are done
    y_val = next(yi)
    for value in x:
        if flag >= 1:
            df_res.append(largest_ind)   # append the count of y_val smaller than value
        else:
            # search through y for the index of the first item bigger than value
            while y_val <= value and yindex < len(y) - 1:
                y_val = next(yi)   # point at the next value in df2
                yindex += 1        # track how many y_val are smaller than value
            # if we iterate through all of df2 and every value is smaller, then,
            # since df1 is ascending, every remaining value of df1 gets the same
            # count; set the flag so we stop comparing
            if yindex == len(y) - 1 and y_val <= float(value):
                flag = 1
                largest_ind = yindex + 1
                df_res.append(largest_ind)   # append the count of y_val smaller than value
            else:
                df_res.append(yindex)        # append the count of y_val smaller than value
    return df_res
df1:
0. , 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8,
0.9, 1. , 1.1, 1.2, 1.3, 1.4, 1.5, 1.6, 1.7,
1.8, 1.9, 2. , 2.1, 2.2, 2.3, 2.4, 2.5, 2.6,
2.7, 2.8, 2.9, 3. , 3.1, 3.2, 3.3, 3.4, 3.5,
3.6, 3.7, 3.8, 3.9, 4. , 4.1, 4.2, 4.3, 4.4,
4.5, 4.6, 4.7, 4.8, 4.9, 5. , 5.1, 5.2, 5.3,
5.4, 5.5, 5.6, 5.7, 5.8, 5.9, 6. , 6.1, 6.2,
6.3, 6.4, 6.5, 6.6, 6.7, 6.8, 6.9, 7. , 7.1,
7.2, 7.3, 7.4, 7.5, 7.6, 7.7, 7.8, 7.9, 8. ,
8.1, 8.2, 8.3, 8.4, 8.5, 8.6, 8.7, 8.8, 8.9,
9. , 9.1, 9.2, 9.3, 9.4, 9.5, 9.6, 9.7, 9.8,
9.9, 10. , 10.1, 10.2, 10.3, 10.4, 10.5, 10.6, 10.7,
10.8, 10.9, 11. , 11.1, 11.2, 11.3, 11.4, 11.5, 11.6,
11.7, 11.8, 11.9, 12. , 12.1, 12.2, 12.3, 12.4, 12.5,
12.6, 12.7, 12.8, 12.9, 13. , 13.1, 13.2, 13.3, 13.4,
13.5, 13.6, 13.7, 13.8, 13.9, 14. , 14.1, 14.2, 14.3,
14.4, 14.5, 14.6, 14.7, 14.8, 14.9, 15. , 15.1, 15.2,
15.3, 15.4, 15.5, 15.6, 15.7, 15.8, 15.9, 16. , 16.1,
16.2, 16.3, 16.4, 16.5, 16.6, 16.7, 16.8, 16.9, 17. ,
17.1, 17.2, 17.3, 17.4, 17.5, 17.6, 17.7, 17.8, 17.9,
18. , 18.1, 18.2, 18.3, 18.4, 18.5, 18.6, 18.7, 18.8,
18.9, 19. , 19.1, 19.2, 19.3, 19.4, 19.5, 19.6, 19.7,
19.8, 19.9, 20. , 20.1, 20.2, 20.3, 20.4, 20.5, 20.6,
20.7, 20.8, 20.9, 21. , 21.1, 21.2, 21.3, 21.4, 21.5,
21.6, 21.7, 21.8, 21.9, 22. , 22.1, 22.2, 22.3, 22.4,
22.5, 22.6, 22.7, 22.8, 22.9, 23. , 23.1, 23.2, 23.3,
23.4, 23.5, 23.6, 23.7, 23.8, 23.9, 24. , 24.1, 24.2,
24.3, 24.4, 24.5, 24.6, 24.7, 24.8, 24.9, 25. , 25.1,
25.2, 25.3, 25.4, 25.5, 25.6, 25.7, 25.8, 25.9, 26. ,
26.1, 26.2, 26.3, 26.4, 26.5, 26.6, 26.7, 26.8, 26.9,
27. , 27.1, 27.2, 27.3, 27.4, 27.5, 27.6, 27.7, 27.8,
27.9, 28. , 28.1, 28.2, 28.3, 28.4, 28.5, 28.6, 28.7,
28.8, 28.9, 29. , 29.1, 29.2, 29.3, 29.4, 29.5, 29.6
df2:
0 12.993
1 12.054
2 21.957
3 10.917
4 33.890
5 10.597
6 22.911
7 7.431
8 10.437
9 19.165
10 12.169
11 14.847
12 10.093
13 10.795
14 14.419
15 27.199
16 15.045
17 12.764
18 7.766
19 18.066
20 10.254
21 16.922
22 7.011
23 10.322
24 11.619
25 25.719
26 18.142
27 14.557
28 26.367
29 13.443
30 17.318
31 10.971
32 6.073
33 20.050
34 11.863
35 25.619
36 18.326
37 30.830
38 13.130
39 11.734
40 14.457
41 22.659
42 16.479
43 17.845
44 23.712
45 16.670
46 10.322
47 16.250
48 20.920
49 17.479
50 15.526
51 15.732
52 19.836
53 10.513
54 24.818
55 10.933
56 14.785
57 25.253
58 15.732
59 14.290
60 23.979
61 24.788
62 12.420
63 21.324
64 9.658
65 24.307
66 17.601
67 12.352
68 18.089
69 23.353
70 12.718
71 18.707
72 9.147
73 17.494
74 8.743
75 22.407
76 16.227
77 15.396
78 16.807
79 26.733
80 14.084
81 19.516
82 15.106
83 21.187
84 13.008
85 13.618
86 16.266
87 19.706
88 6.591
89 14.999
90 16.449
91 18.883
92 15.243
93 15.976
94 18.242
95 16.662
96 6.691
97 16.952
98 25.940
99 23.018
100 29.365
101 14.564
102 15.625
103 9.727
104 7.652
105 12.726
106 7.263
107 19.943
108 17.540
109 7.469
110 10.360
111 17.898
112 20.393
113 7.011
114 15.999
115 12.985
116 16.624
117 18.753
118 12.520
119 13.488
120 17.959
121 16.433
122 14.518
123 12.909
124 19.752
125 9.277
126 25.566
127 19.272
128 10.360
129 22.148
130 20.294
131 18.402
132 17.631
133 17.341
134 13.672
135 19.600
136 20.653
137 15.999
138 15.480
139 30.655
140 15.426
141 16.067
142 29.838
143 13.099
144 12.184
145 15.693
146 26.031
147 16.052
148 8.087
149 16.754
150 17.029
151 16.601
152 9.956
153 20.363
154 11.215
155 15.106
156 13.809
157 23.178
158 21.484
159 13.359
160 31.860
161 14.564
162 19.737
163 19.424
164 29.556
165 15.678
166 22.148
167 28.389
168 21.309
169 22.262
170 11.314
171 8.018
172 24.551
173 14.740
174 15.716
175 24.269
176 20.042
177 15.968
178 11.337
179 27.618
180 22.522
181 19.066
182 9.323
183 20.622
184 13.092
185 15.464
186 21.171
187 11.604
188 19.050
189 15.823
190 33.859
191 15.106
192 13.549
193 17.296
194 13.740
195 12.054
196 10.955
197 21.164
198 14.427
199 9.719
200 12.176
201 9.742
202 21.278
203 20.515
204 18.265
205 9.666
206 13.870
207 15.968
208 13.313
209 16.517
210 18.417
211 15.419
212 20.523
213 15.655
214 26.977
215 13.084
216 31.349
217 29.854
218 13.008
219 11.306
220 22.384
221 20.798
222 17.433
223 12.916
224 11.284
225 20.248
226 9.803
227 10.376
228 9.315
229 14.976
230 16.327
231 9.590
232 16.830
233 23.979
234 11.558
235 13.183
236 18.776
237 20.416
238 9.163
239 10.345
240 28.252
241 22.888
242 20.538
243 6.912
244 24.040
245 8.682
246 31.929
247 14.908
248 19.195
249 17.112
250 18.379
251 15.869
252 13.794
253 14.129
254 12.458
255 10.795
256 25.291
257 26.382
258 20.881
Try this. It will add a column called check to df1, containing for each value in df1 the count of values in df2 that are <= that value.
df1['check'] = df1[0].apply(lambda x: df2[df2[0] <= x].size)
You may have to replace the [0] with the names of the first columns in your dataframes.
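If apply is still too slow on 26000 rows, a vectorized sketch with numpy.searchsorted should cut the runtime dramatically (assuming, as above, both frames hold one numeric column named 0): sort df2 once, then take each threshold's right-side insertion index, which equals the count of values <= that threshold.

import numpy as np

y = np.sort(df2[0].to_numpy())   # sort df2's values once
# side='right' gives, for each threshold, the number of sorted values <= it
df1['check'] = np.searchsorted(y, df1[0].to_numpy(), side='right')

This replaces the per-row scan with one sort plus a binary search per threshold.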

Excel - Concatenate cells based on another column

I have 3 columns that I want to group hierarchically.
Column delivery_order_tli_id is the parent.
Column delivery_order_hanwa_id is the child of the parent.
Column coil_ids is the concatenation of coil_id values per child of a parent.
This is the data.
FLAT DATA
delivery_order_tli_id delivery_order_hanwa_id coil_id
1 1 108
1 1 114
1 1 116
1 1 120
1 1 123
1 1 130
1 1 163
2 1 113
2 1 115
2 1 117
2 1 119
2 1 129
2 1 131
2 1 161
3 3 171
3 221 2880
3 221 2881
3 221 2887
3 221 2889
3 221 2890
4 4 236
4 4 237
4 4 238
4 4 239
4 4 244
4 4 245
4 5 246
4 4 253
I want to merge it into this:
MERGE DATA (RESULT)
delivery_order_tli_id delivery_order_hanwa_id coil_ids
1 1 108, 114, 116, 120, 123, 130, 163
2 1 113, 115, 117, 119, 129, 131, 161
3 3 171
3 221 2880, 2881, 2887, 2889, 2890
4 4 236, 237, 238, 239, 244, 245, 253
4 5 246
Please advise.
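In Excel itself this kind of grouping is usually done with TEXTJOIN/IF array formulas, but since the rest of this page uses pandas, here is a minimal pandas sketch of the same merge (assuming the flat data is loaded into a dataframe df with the three columns shown):

import pandas as pd

result = (df.groupby(['delivery_order_tli_id', 'delivery_order_hanwa_id'])['coil_id']
            .apply(lambda s: ', '.join(s.astype(str)))   # concatenate the coil ids per group
            .reset_index(name='coil_ids'))

Each (tli_id, hanwa_id) pair becomes one row whose coil_ids cell is the comma-separated list of its coil_id values, matching the MERGE DATA result above.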
