Failure when training an NE model with CRF++ 0.58 on Linux

When I use CRF++ 0.58 to train an NE model, the program fails with this message:
"reading training data:tagger.cpp(399) [feature_index_->buildFeatures(this)] 0.00s"
Development environment:
Red Hat Linux 6.5, gcc 5.0, CRF++ 0.58
Feature template:
template
dataset:
Boson_train.txt
Boson_test.txt
The first column is the word, the second column is the POS tag, and the third column is the NER tag.
The problem:
When I try to train the NER model, I run "crf_learn -f 3 -c 4.0 template Boson_train crf_model" and get
this message: "reading training data:tagger.cpp(399) [feature_index_->buildFeatures(this)] 0.00s". I don't know
C++, so I can't trace the problem in the source.
What I have tried:
1. Changing the encoding of the dataset. I used Notepad++ to convert it from "utf-8 with no BOM" to "utf-8". It didn't work.
2. Changing the delimiter from '\t' to ' ' (space). It didn't work.
3. I thought the template might be wrong, so I tested with crf++0.58/example/seg/template. It worked. But that template
is simple, so I then used /example/JapaneseNE/template, which is closer to my feature template. It didn't work. Then I checked
the JapaneseNE example itself, and it works well. So I am confused. Can anyone help?
template
U00:%x[-2,0]
U01:%x[-1,0]
U02:%x[0,0]
U03:%x[1,0]
U04:%x[2,0]
U05:%x[-2,0]/%x[-1,0]/%x[0,0]
U06:%x[-1,0]/%x[0,0]/%x[1,0]
U07:%x[0,0]/%x[1,0]/%x[2,0]
U08:%x[-1,0]/%x[0,0]
U09:%x[0,0]/%x[1,0]
U10:%x[-2,1]/%x[0,1]
U11:%x[-2,1]/%x[1,1]
U11:%x[-1,1]/%x[0,1]
U12:%x[0,0]/%x[0,1]
U13:%x[0,1]/%x[1,1]
U14:%x[0,1]/%x[2,1]
U15:%x[-1,0]/%x[0,1]
U16:%x[-1,0]/%x[-1,1]
U17:%x[1,0]/%x[1,1]
U18:%x[1,0]/%x[1,1]
U19:%x[2,0]/%x[2,1]
U20:%x[-1,2]
U21:%x[-2,2]
U22:%x[0,1]/%x[-1,2]
U23:%x[0,1]/%x[-2,2]
U24:%x[0,0]/%x[-1,2]
U25:%x[0,0]/%x[-2,2]
U26:%x[-1,2]/%x[-2,2]/%x[0,1]
U27:%x[-2,2]/%x[0,1]/%x[1,1]
U28:%x[-1,1]/%x[-1,2]/%x[0,1]
U29:%x[-1,2]/%x[0,0]/%x[0,1]
Boson_train
浙江 ns B_product_name
在线 b I_product_name
杭州 ns I_product_name
4 m B_time
月 m I_time
25 m I_time
日 m I_time
讯 ng Out
( x Out
记者 n Out
x Out
x B_person_name
施宇翔 nr I_person_name
x Out
通讯员 n B_person_name
x Out
方英 nr B_person_name
) x Out
毒贩 n Out
很 zg Out
“ x Out
时髦 nr Out
” x Out
, x Out
用 p Out
微信 vn B_product_name
交易 n Out
毒品 n Out
。 x Out
没 v Out
料想 v Out
警方 n B_person_name
也 d Out

You were debugging in the right direction. The issue is indeed with your template file.
Your training data has 3 columns (column 0: word, column 1: POS tag, column 2: NER tag).
The last column is the answer tag, and you cannot use it as a feature, but your template file refers to it (i.e., column 2) in many feature definitions (see U20 to U29). Training should work after you remove or correct these.
Hope this helps :)
You can also check out these video tutorials for a better understanding of template files and training NER with CRF++:
1) https://youtu.be/GJHeTvDkIaE
2) https://youtu.be/Ur5umC4BwN4
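The column check described in this answer can be sketched in a few lines of Python. This is only an illustration, not part of CRF++: the function name bad_template_lines is hypothetical, and the regular expression assumes the template uses only %x[row,col] macros like the one posted in the question.

```python
import re

# Matches CRF++ macros such as %x[-1,2]; captures the row offset and column index.
MACRO = re.compile(r"%x\[(-?\d+),(\d+)\]")

def bad_template_lines(template_text, n_columns):
    """Return (feature_id, column) pairs that reference a non-feature column.

    n_columns counts every column in the training file, including the final
    answer-tag column, so valid feature columns are 0 .. n_columns - 2.
    """
    max_feature_col = n_columns - 2
    problems = []
    for line in template_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        feature_id = line.split(":", 1)[0]
        for _row, col in MACRO.findall(line):
            if int(col) > max_feature_col:
                problems.append((feature_id, int(col)))
    return problems

# A few lines taken from the template in the question (3-column training file).
sample = "U02:%x[0,0]\nU12:%x[0,0]/%x[0,1]\nU20:%x[-1,2]\nU22:%x[0,1]/%x[-1,2]\n"
print(bad_template_lines(sample, 3))
```

Run against the full template, a check like this should flag every definition from U20 to U29, since each of them refers to column 2.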

Related

How can I take the outer product of string vectors in J?

I'm trying to replicate the outer product notation in APL:
∘.,⍨ 'x1' 'y1' 'z1' 'x2' 'y2' 'z2' 'x3' 'y3' 'z3'
which yields
x1x1 x1y1 x1z1 x1x2 x1y2 x1z2 x1x3 x1y3 x1z3
y1x1 y1y1 y1z1 y1x2 y1y2 y1z2 y1x3 y1y3 y1z3
z1x1 z1y1 z1z1 z1x2 z1y2 z1z2 z1x3 z1y3 z1z3
x2x1 x2y1 x2z1 x2x2 x2y2 x2z2 x2x3 x2y3 x2z3
y2x1 y2y1 y2z1 y2x2 y2y2 y2z2 y2x3 y2y3 y2z3
z2x1 z2y1 z2z1 z2x2 z2y2 z2z2 z2x3 z2y3 z2z3
x3x1 x3y1 x3z1 x3x2 x3y2 x3z2 x3x3 x3y3 x3z3
y3x1 y3y1 y3z1 y3x2 y3y2 y3z2 y3x3 y3y3 y3z3
z3x1 z3y1 z3z1 z3x2 z3y2 z3z2 z3x3 z3y3 z3z3
But I can't figure out how to do something similar in J. I found this Cartesian product in J post that I thought would be similar enough, but I just can't seem to translate it to an array of strings from an array of numbers.
Adapting Dan Bron's answer therein and applying it to a simpler example
6 6 $ , > { 2 # < 'abc'
gives
aaabac
babbbc
cacbcc
aaabac
babbbc
cacbcc
which is almost what I want, but I don't know how to generalize it to use 2-letter (or more) strings instead of single ones in a similar fashion. I also don't know how to format those results with spaces between the pairs like the APL output, so it may not be the right path either.
Similarly, I tried adapting Michael Berry's answer from that thread to get
9 36 $ ,,"1/ ~ 9 2 $ 'x1y1z1x2y2z2x3y3z3'
which gives
x1x1x1y1x1z1x1x2x1y2x1z2x1x3x1y3x1z3
y1x1y1y1y1z1y1x2y1y2y1z2y1x3y1y3y1z3
z1x1z1y1z1z1z1x2z1y2z1z2z1x3z1y3z1z3
x2x1x2y1x2z1x2x2x2y2x2z2x2x3x2y3x2z3
y2x1y2y1y2z1y2x2y2y2y2z2y2x3y2y3y2z3
z2x1z2y1z2z1z2x2z2y2z2z2z2x3z2y3z2z3
x3x1x3y1x3z1x3x2x3y2x3z2x3x3x3y3x3z3
y3x1y3y1y3z1y3x2y3y2y3z2y3x3y3y3y3z3
z3x1z3y1z3z1z3x2z3y2z3z2z3x3z3y3z3z3
Again, this is almost what I want, and this one handled the multiple characters, but there are still no spaces between them and the command is getting farther from the simplicity of the APL version.
I can get the same results a bit more cleanly with ravel items
,. ,"1/ ~ 9 2 $ 'x1y1z1x2y2z2x3y3z3'
I've been going through the J primer and exploring parts that look relevant in the dictionary, but I'm still very new, so I apologize if this is a dumb question. I feel like the rank conjunction operator should be able to help me here, but I had a hard time following its explanation in the primer. I played with ": to try to format the strings to have trailing spaces, but I also couldn't figure that out. The fact that this was so easy in APL also makes me think I'm doing something very wrong in J to be having this much trouble.
After reading more of the primer I got something that looks like what I want with
,. 9 1 $ ' ' ,."2 ,"1/~ [ ;._2 'x1 y1 z1 x2 y2 z2 x3 y3 z3 '
but this is still way more complicated than the APL version, so I'm still hoping there is an actually elegant and concise way to do this.
I think the only thing I can add to what you have already pointed out is that, to keep the strings separate as components, you need to box them.
<@,"1/~ 9 2 $ 'x1y1z1x2y2z2x3y3z3'
+----+----+----+----+----+----+----+----+----+
|x1x1|x1y1|x1z1|x1x2|x1y2|x1z2|x1x3|x1y3|x1z3|
+----+----+----+----+----+----+----+----+----+
|y1x1|y1y1|y1z1|y1x2|y1y2|y1z2|y1x3|y1y3|y1z3|
+----+----+----+----+----+----+----+----+----+
|z1x1|z1y1|z1z1|z1x2|z1y2|z1z2|z1x3|z1y3|z1z3|
+----+----+----+----+----+----+----+----+----+
|x2x1|x2y1|x2z1|x2x2|x2y2|x2z2|x2x3|x2y3|x2z3|
+----+----+----+----+----+----+----+----+----+
|y2x1|y2y1|y2z1|y2x2|y2y2|y2z2|y2x3|y2y3|y2z3|
+----+----+----+----+----+----+----+----+----+
|z2x1|z2y1|z2z1|z2x2|z2y2|z2z2|z2x3|z2y3|z2z3|
+----+----+----+----+----+----+----+----+----+
|x3x1|x3y1|x3z1|x3x2|x3y2|x3z2|x3x3|x3y3|x3z3|
+----+----+----+----+----+----+----+----+----+
|y3x1|y3y1|y3z1|y3x2|y3y2|y3z2|y3x3|y3y3|y3z3|
+----+----+----+----+----+----+----+----+----+
|z3x1|z3y1|z3z1|z3x2|z3y2|z3z2|z3x3|z3y3|z3z3|
+----+----+----+----+----+----+----+----+----+
If you want to get rid of the boxes and insert spaces instead, you will no longer have the character items separately; you will have long strings with the spaces as part of the result.
It is a very good question, because it requires understanding that character strings in J are vectors. I suppose that, technically, what you are looking for is the following, which has shape 9 9 4, but it won't look the way you expect.
,"1/~ 9 2 $ 'x1y1z1x2y2z2x3y3z3'
x1x1
x1y1
x1z1
x1x2
x1y2
x1z2
x1x3
x1y3
x1z3
y1x1
y1y1
y1z1
y1x2
y1y2
y1z2
y1x3
y1y3
y1z3
z1x1
z1y1
z1z1
z1x2
z1y2
z1z2
z1x3
z1y3
z1z3
x2x1
x2y1
x2z1
x2x2
x2y2
x2z2
x2x3
x2y3
x2z3
y2x1
y2y1
y2z1
y2x2
y2y2
y2z2
y2x3
y2y3
y2z3
z2x1
z2y1
z2z1
z2x2
z2y2
z2z2
z2x3
z2y3
z2z3
x3x1
x3y1
x3z1
x3x2
x3y2
x3z2
x3x3
x3y3
x3z3
y3x1
y3y1
y3z1
y3x2
y3y2
y3z2
y3x3
y3y3
y3z3
z3x1
z3y1
z3z1
z3x2
z3y2
z3z2
z3x3
z3y3
z3z3
$ ,"1/~ 9 2 $ 'x1y1z1x2y2z2x3y3z3'
9 9 4
You could also take the boxes and convert them to symbols, which might be closer to what you want, although they do have the backtick indicator as part of their representation.
s:@<@,"1/~ 9 2 $ 'x1y1z1x2y2z2x3y3z3'
`x1x1 `x1y1 `x1z1 `x1x2 `x1y2 `x1z2 `x1x3 `x1y3 `x1z3
`y1x1 `y1y1 `y1z1 `y1x2 `y1y2 `y1z2 `y1x3 `y1y3 `y1z3
`z1x1 `z1y1 `z1z1 `z1x2 `z1y2 `z1z2 `z1x3 `z1y3 `z1z3
`x2x1 `x2y1 `x2z1 `x2x2 `x2y2 `x2z2 `x2x3 `x2y3 `x2z3
`y2x1 `y2y1 `y2z1 `y2x2 `y2y2 `y2z2 `y2x3 `y2y3 `y2z3
`z2x1 `z2y1 `z2z1 `z2x2 `z2y2 `z2z2 `z2x3 `z2y3 `z2z3
`x3x1 `x3y1 `x3z1 `x3x2 `x3y2 `x3z2 `x3x3 `x3y3 `x3z3
`y3x1 `y3y1 `y3z1 `y3x2 `y3y2 `y3z2 `y3x3 `y3y3 `y3z3
`z3x1 `z3y1 `z3z1 `z3x2 `z3y2 `z3z2 `z3x3 `z3y3 `z3z3
I'd say the closest direct analogue of the APL expression is to keep each string boxed:
,&.>/~ 'x1';'y1';'z1';'x2';'y2';'z2';'x3';'y3';'z3'
┌────┬────┬────┬────┬────┬────┬────┬────┬────┐
│x1x1│x1y1│x1z1│x1x2│x1y2│x1z2│x1x3│x1y3│x1z3│
├────┼────┼────┼────┼────┼────┼────┼────┼────┤
│y1x1│y1y1│y1z1│y1x2│y1y2│y1z2│y1x3│y1y3│y1z3│
├────┼────┼────┼────┼────┼────┼────┼────┼────┤
│z1x1│z1y1│z1z1│z1x2│z1y2│z1z2│z1x3│z1y3│z1z3│
├────┼────┼────┼────┼────┼────┼────┼────┼────┤
│x2x1│x2y1│x2z1│x2x2│x2y2│x2z2│x2x3│x2y3│x2z3│
├────┼────┼────┼────┼────┼────┼────┼────┼────┤
│y2x1│y2y1│y2z1│y2x2│y2y2│y2z2│y2x3│y2y3│y2z3│
├────┼────┼────┼────┼────┼────┼────┼────┼────┤
│z2x1│z2y1│z2z1│z2x2│z2y2│z2z2│z2x3│z2y3│z2z3│
├────┼────┼────┼────┼────┼────┼────┼────┼────┤
│x3x1│x3y1│x3z1│x3x2│x3y2│x3z2│x3x3│x3y3│x3z3│
├────┼────┼────┼────┼────┼────┼────┼────┼────┤
│y3x1│y3y1│y3z1│y3x2│y3y2│y3z2│y3x3│y3y3│y3z3│
├────┼────┼────┼────┼────┼────┼────┼────┼────┤
│z3x1│z3y1│z3z1│z3x2│z3y2│z3z2│z3x3│z3y3│z3z3│
└────┴────┴────┴────┴────┴────┴────┴────┴────┘
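For comparison only, the same table can be sketched in Python, where a list of strings already keeps each string as a separate item, much as boxing does in J. The helper name outer_concat is hypothetical and is not taken from either answer.

```python
def outer_concat(items):
    """Build the table whose cell (i, j) is items[i] followed by items[j]."""
    return [[a + b for b in items] for a in items]

names = ["x1", "y1", "z1", "x2", "y2", "z2", "x3", "y3", "z3"]
table = outer_concat(names)

# Join each row with spaces to mimic the layout of the APL output.
for row in table:
    print(" ".join(row))
```

The spaces appear only when the rows are printed; the table itself still holds the 4-character strings as separate items.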

kdb/q: How to apply a string manipulation function to a vector of strings to output a vector of strings?

Thanks in advance for the help. I am new to kdb/q, coming from a Python and C++ background.
Just a simple syntax question: I have a string with fields and their corresponding values
pp_str: "field_1:abc field_2:xyz field_3:kdb"
I wrote an atomic (scalar) function to extract the value of a given field.
get_field_value: {[field; pp_str] pp_fields: " " vs pp_str; pid_field: pp_fields[where like[pp_fields; field,":*"]]; start_i: (pid_field[0] ss ":")[0] + 1; end_i: count pid_field[0]; indices: start_i + til (end_i - start_i); pid_field[0][indices]}
show get_field_value["field_1"; pp_str]
"abc"
show get_field_value["field_3"; pp_str]
"kdb"
Now how do I generalize this so that if I input a vector of fields, I get a vector of values? I want to input ("field_1"; "field_2"; "field_3") and output ("abc"; "xyz"; "kdb"). I tried multiple approaches (below) but I just don't understand kdb/q's syntax well enough to vectorize my function:
/ Attempt 1 - Fail
get_field_value[enlist ("field_1"; "field_2"); pp_str]
/ Attempt 2 - Fail
get_field_value[; pp_str] /. enlist ("field_1"; "field_3")
/ Attempt 3 - Fail
fields: ("field_1"; "field_2")
get_field_value[fields; pp_str]
To run your function once per field, project it on pp_str and use each over the list of fields:
q)get_field_value[;pp_str]each("field_1";"field_3")
"abc"
"kdb"
Kdb actually has built-in functionality to handle this: https://code.kx.com/q/ref/file-text/#key-value-pairs
q){@[;x](!/)"S: "0:y}[`field_1;pp_str]
"abc"
q)
q){@[;x](!/)"S: "0:y}[`field_1`field_3;pp_str]
"abc"
"kdb"
I think this might be the syntax you're looking for.
q)get_field_value[; pp_str]each("field_1";"field_2")
"abc"
"xyz"

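Since the asker comes from a Python background, the key-value approach from the answers above corresponds roughly to the following Python sketch. The helper names parse_fields and get_field_values are hypothetical, and the code assumes fields are separated by single spaces and that values contain no spaces.

```python
def parse_fields(pp_str):
    """Split space-separated 'name:value' tokens into a dict."""
    return dict(token.split(":", 1) for token in pp_str.split())

def get_field_values(fields, pp_str):
    """Look up several field names at once, preserving their order."""
    lookup = parse_fields(pp_str)
    return [lookup[name] for name in fields]

pp_str = "field_1:abc field_2:xyz field_3:kdb"
print(get_field_values(["field_1", "field_2", "field_3"], pp_str))
```

Building the dictionary once and indexing it with the whole list follows the same idea as the q one-liner: parse into keys and values, then index with the list of keys.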
linearK error in seq.default(): 'to' cannot be NA, NaN or infinite

I am trying to compute linearK estimates for a small linnet object, following chapter 17 of the CRC spatstat book, and when I call the linearK function, spatstat throws an error. I have documented the process in the comments of the R code below. The error is:
Error in seq.default(from = 0, to = right, length.out = npos + 1L) : 'to' cannot be NA, NaN or infinite
I do not understand how to resolve this. I am following this process:
# I have point data for each day of the week
# d1 is district 1 of the city.
# I did the step below because otherwise the result had class tbl
d1_data=lapply(split(d1, d1$openDatefactor),as.data.frame)
# I previously created a linnet and divided it into districts of the city
d1_linnet = districts_linnet[["d1"]]
# I create a point pattern for each day
d1_ppp = lapply(d1_data, function(x) as.ppp(x, W=Window(d1_linnet)))
plot(d1_ppp[[1]], which.marks="type")
# I then convert the point pattern to a point pattern on a linear network
d1_lpp <- as.lpp(d1_ppp[[1]], L=d1_linnet, W=Window(d1_linnet))
d1_lpp
Point pattern on linear network
3 points
15 columns of marks: ‘status’, ‘number_of_’, ‘zip’, ‘ward’,
‘police_dis’, ‘community_’, ‘type’, ‘days’, ‘NAME’,
‘DISTRICT’, ‘openDatefactor’, ‘OpenDate’, ‘coseDatefactor’,
‘closeDate’ and ‘instance’
Linear network with 4286 vertices and 6183 lines
Enclosing window: polygonal boundary
enclosing rectangle: [441140.9, 448217.7] x [4640080, 4652557] units
# the errors start from plotting this lpp object
plot(d1_lpp)
"show.all" is not a graphical parameter
Error in plot.window(...) : need finite 'xlim' values
coords(d1_lpp)
x y seg tp
441649.2 4649853 5426 0.5774863
445716.9 4648692 5250 0.5435492
444724.6 4646320 677 0.9189631
3 rows
Consequently, I also get an error from linearK(d1_lpp):
Error in seq.default(from = 0, to = right, length.out = npos + 1L) : 'to' cannot be NA, NaN or infinite
I suspect the lpp object is the problem, but I find it hard to interpret the errors or see how to resolve them. Could someone please guide me?
Thanks
I can confirm there is a bug in plot.lpp when trying to plot the marked point pattern on the linear network. That will hopefully be fixed soon. You can plot the unmarked point pattern using
plot(unmark(d1_lpp))
I cannot reproduce the problem with linearK. Which version of spatstat are you running? In the development version on my laptop, spatstat_1.51-0.073, everything works. There have been changes to this code recently, so updating to the development version is likely to solve this (see https://github.com/spatstat/spatstat).

Calculate the average of Spearman correlation

I have 2 columns A and B which contain the Spearman's correlation values as follows:
0.127272727 -0.260606061
-0.090909091 -0.224242424
0.345454545 0.745454545
0.478787879 0.660606061
-0.345454545 -0.333333333
0.151515152 -0.127272727
0.478787879 0.660606061
-0.321212121 -0.284848485
0.284848485 0.515151515
0.36969697 -0.139393939
-0.284848485 0.272727273
How can I calculate the average of the correlation values in each of these 2 columns in Excel or MATLAB? I found a close answer at this link: https://stats.stackexchange.com/questions/8019/averaging-correlation-values
The main point is that we cannot use a plain mean or average in this case, as explained in the link. A nice method is proposed there, but I don't know how to implement it in Excel or MATLAB.
Following the second answer of the link you provided, which is the most general case, you can calculate the average Spearman's rho in Matlab as follows:
M = [0.127272727 -0.260606061;
-0.090909091 -0.224242424;
0.345454545 0.745454545;
0.478787879 0.660606061;
-0.345454545 -0.333333333;
0.151515152 -0.127272727;
0.478787879 0.660606061;
-0.321212121 -0.284848485;
0.284848485 0.515151515;
0.36969697 -0.139393939;
-0.284848485 0.272727273];
z = atanh(M);
meanRho = tanh(mean(z));
As you can see it gives mean values of
meanRho =
0.1165 0.1796
whereas the simple mean is quite close:
mean(M)
ans =
0.1085 0.1350
Edit: the atanh/tanh step used above is Fisher's z-transformation.
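The same calculation can be sketched in plain Python using only the standard library. The function name fisher_mean is hypothetical; the data are the two columns from the question.

```python
import math

def fisher_mean(rhos):
    """Average correlations via Fisher's z: atanh each value, take the mean, tanh back."""
    z_values = [math.atanh(r) for r in rhos]
    return math.tanh(sum(z_values) / len(z_values))

col_a = [0.127272727, -0.090909091, 0.345454545, 0.478787879, -0.345454545,
         0.151515152, 0.478787879, -0.321212121, 0.284848485, 0.36969697,
         -0.284848485]
col_b = [-0.260606061, -0.224242424, 0.745454545, 0.660606061, -0.333333333,
         -0.127272727, 0.660606061, -0.284848485, 0.515151515, -0.139393939,
         0.272727273]

print(round(fisher_mean(col_a), 4), round(fisher_mean(col_b), 4))
```

The results should agree with the meanRho values reported by the MATLAB code above, up to rounding.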
In MATLAB, define a matrix with these values and use the mean function as follows:
%define a matrix M
M = [0.127272727 -0.260606061;
-0.090909091 -0.224242424;
0.345454545 0.745454545;
0.478787879 0.660606061;
-0.345454545 -0.333333333;
0.151515152 -0.127272727;
0.478787879 0.660606061;
-0.321212121 -0.284848485;
0.284848485 0.515151515;
0.36969697 -0.139393939;
-0.284848485 0.272727273];
%calculates the mean of each column
meanVals = mean(M);
Result
meanVals =
0.1085 0.1350
It is also possible to calculate the total mean and the mean of each row as follows:
meanVals = mean(M(:)); %total mean over all elements
meanVals = mean(M,2); %mean of each row

Reading a specific txt file and re-arrange it to a given format

Below is the output of a chemical analysis instrument. I need to rearrange and sort it so that the percentage figure for each element goes below its name. My question is: how do I read this file word by word? How can I select, for instance, word number 12?
txt file format:
Header_1 Date Time Method_Name (Filter_Name) Calc_Mode Heat No. Quality Anal. Code Sample ID C Si Mn P S Cr Mo Ni Al Co Cu Nb Ti V W Pb Sn As Bi Ca Sb Se B Zn N Fe Place Code Work Phase
Single 13.01.13 09:51:10 Fe-10 Test AutoResult 12A 00001.040 00000.437 00000.292 00000.023 00000.007 00001.505 00000.263 00000.081 00000.012 00000.014 00000.110 00000.155 00000.040 00000.098 00000.015 00000.014 00000.013 00000.012 00000.002 00000.001 00000.016 00000.014 00000.005 00000.001 00000.016 00095.813
To find word 12, read the line character by character until you have seen 11 instances of whatever is being used to separate words (which you have not specified); what follows, until the next such separator, will be the 12th word.
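As a minimal sketch of that idea, assuming the separator is whitespace (the question does not say), Python's str.split does the separator counting for you. The function name nth_word is hypothetical, and the sample line is a shortened copy of the data row from the question.

```python
def nth_word(line, n):
    """Return the n-th whitespace-separated word of line, counting from 1."""
    words = line.split()  # splits on any run of spaces or tabs
    return words[n - 1]

sample = ("Single 13.01.13 09:51:10 Fe-10 Test AutoResult 12A "
          "00001.040 00000.437 00000.292 00000.023 00000.007 00001.505")
print(nth_word(sample, 12))
```

If the file is actually tab-delimited with empty fields, splitting on "\t" instead would keep the empty columns in position, which matters when lining values up under the header names.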

Resources