Can't Input Tab Delimited file to Stanford Classifier - nlp

I'm having a problem inputting tab delimited files into the stanford classifier.
Although I was able to successfully walk through all the included stanford tutorials, including the newsgroup tutorial, when I try to input my own training and test data it doesn't load properly.
At first I thought the problem was that I was saving the data into a tab delimited file using an Excel spreadsheet and it was some kind of encoding issue.
But then I got exactly the same results when I did the following. First I literally typed the demo data below into gedit, making sure to use a tab between the politics/sports class and the ensuing text:
politics Obama today announced a new immigration policy.
sports The NBA all-star game was last weekend.
politics Both parties are eyeing the next midterm elections.
politics Congress votes tomorrow on electoral reforms.
sports The Lakers lost again last night, 102-100.
politics The Supreme Court will rule on gay marriage this spring.
sports The Red Sox report to spring training in two weeks.
sports Messi set a world record for goals in a calendar year in 2012.
politics The Senate will vote on a new budget proposal next week.
politics The President declared on Friday that he will veto any budget that doesn't include revenue increases.
I saved that as myproject/demo-train.txt and a similar file as myproject/demo-test.txt.
I then ran the following:
java -mx1800m -cp stanford-classifier.jar edu.stanford.nlp.classify.ColumnDataClassifier
-trainFile myproject/demo-train.txt -testFile myproject/demo-test.txt
The good news: this actually ran without throwing any errors.
The bad news: since it doesn't extract any features, it can't actually estimate a real model and the probability defaults to 1/n for each item, where n is the number of classes.
So then I ran the same command but with two basic options specified:
java -mx1800m -cp stanford-classifier.jar edu.stanford.nlp.classify.ColumnDataClassifier
-trainFile myproject/demo-train.txt -testFile myproject/demo-test.txt -2.useSplitWords =2.splitWordsRegexp "\s+"
That yielded:
Exception in thread "main" java.lang.RuntimeException: Training dataset could not be processed
at edu.stanford.nlp.classify.ColumnDataClassifier.readDataset(ColumnDataClassifier.java:402)
at edu.stanford.nlp.classify.ColumnDataClassifier.readTrainingExamples (ColumnDataClassifier.java:317)
at edu.stanford.nlp.classify.ColumnDataClassifier.trainClassifier(ColumnDataClassifier.java:1652)
at edu.stanford.nlp.classify.ColumnDataClassifier.main(ColumnDataClassifier.java:1628)
Caused by: java.lang.ArrayIndexOutOfBoundsException: 2
at edu.stanford.nlp.classify.ColumnDataClassifier.makeDatum(ColumnDataClassifier.java:670)
at edu.stanford.nlp.classify.ColumnDataClassifier.makeDatumFromLine(ColumnDataClassifier.java:267)
at edu.stanford.nlp.classify.ColumnDataClassifier.makeDatum(ColumnDataClassifier.java:396)
... 3 more
These are exactly the same results I get when I used the real data I saved from Excel.
Even more though, I don't know how to make sense of the ArrayIndexOutOfBoundsException. When I used readline in python to print out the raw strings for both the demo files I created and the tutorial files that worked, nothing about the formatting seemed different. So I don't know why this exception would be raised with one set of files but not the other.
Finally, one other quirk. At one point I thought maybe line breaks were the problem. So I deleted all line breaks from the demo files while preserving tab breaks and ran the same command:
java -mx1800m -cp stanford-classifier.jar edu.stanford.nlp.classify.ColumnDataClassifier
-trainFile myproject/demo-train.txt -testFile myproject/demo-test.txt -2.useSplitWords =2.splitWordsRegexp "\s+"
Surprisingly, this time no java exceptions are thrown. But again, it's worthless: it treats the entire file as one observation, and can't properly fit a model as a result.
I've spent 8 hours on this now and have exhausted everything I can think of. I'm new to Java but I don't think that should be an issue here -- according to Stanford's API documentation for ColumnDataClassifier, all that's required is a tab delimited file.
Any help would be MUCH appreciated.
One last note: I've run these same commands with the same files on both Windows and Ubuntu, and the results are the same in each.

Use a properties file. In the example Stanford classifier example
trainFile=20news-bydate-devtrain-stanford-classifier.txt
testFile=20news-bydate-devtest-stanford-classifier.txt
2.useSplitWords=true
2.splitWordsTokenizerRegexp=[\\p{L}][\\p{L}0-9]*|(?:\\$ ?)?[0-9]+(?:\\.[0-9]{2})?%?|\\s+|[\\x80-\\uFFFD]|.
2.splitWordsIgnoreRegexp=\\s+
The number 2 at the start of lines 3, 4 and 5 signifies the column in your tsv file. So in your case you would use
trainFile=20news-bydate-devtrain-stanford-classifier.txt
testFile=20news-bydate-devtest-stanford-classifier.txt
1.useSplitWords=true
1.splitWordsTokenizerRegexp=[\\p{L}][\\p{L}0-9]*|(?:\\$ ?)?[0-9]+(?:\\.[0-9]{2})?%?|\\s+|[\\x80-\\uFFFD]|.
1.splitWordsIgnoreRegexp=\\s+
or if you want to run with command line arguments
java -mx1800m -cp stanford-classifier.jar edu.stanford.nlp.classify.ColumnDataClassifier -trainFile myproject/demo-train.txt -testFile myproject/demo-test.txt -1.useSplitWords =1.splitWordsRegexp "\s+"

I've faced the same error as you.
Pay attention on tabs in the text you are classifying.
Caused by: java.lang.ArrayIndexOutOfBoundsException: 2
This means, that at some point classifier expects array of 3 elements, after it splits the string with tabs.
I've run a method, that counts amount of tabs in each line, and if at some line you have not two of them - here is an error.

Related

Table values moved after being added to old QGIS version: behaves normally on current version

I am teaching a man the basics of QGis for a project he needs to do at his work. He has very little computer knowledge and would like to standardise the work as much as possible (specific workarounds would complicate it too much for him). His QGis version is 3.16 "Hannover" and as this is a work laptop he does not have permission to download a newer version.
We have been having problem with one specific table. The first few rows are below, written exactly as they are originally.
Baum-Nr. Baumart BHD Alter Y X Biotopbaum Klassifizierung Bemerkungen
1 Buche 86 120 49.1356 11.0488 A Altbaum Freistellen !!!
2 Kiefer 45 100 49.13561 11.04883 Hlb,Bs,Th Höhlenbaum
3 Kiefer 32 100 49.13571 11.04579 Hlb,Sw,Th Höhlenbaum
4 Kiefer 74 120 49.13513 11.0495 A Altbaum
After adding it from Excel to QGis through "add vector layer", the header "Klassifizierung" becomes one of the coordinates and I believe one of the columns are switched (unfortunately, I can't remember specifics. This is a small side job and I haven't had time to look into it for days. I should have taken a photo, but this isn't possible anymore). We have attempted to copy the column into a new Excel document and transferring it to QGis again, and this time the headers were shoved one cell to the right such that "Y" was placed over "X" and "Biotopbaum" over "Klassifizierung", for example.
I could not find a way to fix the import problem in his laptop. He e-mailed me the problematic table and I opened it successfully in my QGis 3.26 "Buenos Aires".
I believe this may be a problem with his QGis version, but it is curious that we only encountered it with this one table. All other tables we have worked with have the same headers and the same kind of data on their respective rows.
Is this a known problem, or have other people faced similar situations? Could someone explain what could be causing it? Would there be a way to fix it such that we can successfully import the table without having to edit it in QGis? This is not a solution the man would accept.
Thank you in advance.
Remove the commas in the Biotopbaum field or replace them with a less common delimiter. In fact, remove all punctionation (e.g., Baum-Nr. >> remove the period ".").
Also save the table into a csv format and try to import.

compare documents using most similar method

I am able to build the model using the built-in lee_background corpus. But when I try to compare using most_similar method, I get an error.
lee_train_file = '/opt/conda/lib/python3.6/site-packages/gensim/test/test_data/lee_background.cor'
train_corpus=list()
with open(lee_train_file) as f:
for i, line in enumerate(f):
train_corpus.append(gensim.models.doc2vec.TaggedDocument(gensim.utils.simple_preprocess(line), [i]))
model = gensim.models.doc2vec.Doc2Vec(vector_size=48, min_count=2, epochs=40)
model.build_vocab(train_corpus)
model.wv.vocab['penalty'].count
model.train(train_corpus, total_examples=model.corpus_count, epochs=model.epochs)
line="""
dummy text here...
"""
inferred_vector=model.infer_vector(gensim.utils.simple_preprocess(line) )
model.docvecs.most_similar(inferred_vector, topn=3)
I tried this with list(inferred_vector) but still getting an error.
TypeError: 'numpy.float32' object is not iterable
I am trying to compare the dummy text with the corpus and find if the entry already exist in the data file.
Update:
Instead of list(inferred_vector) I need to use [inferred_vector]. This has solved my problem. But ever-time I run this code, I get different similar documents. How is this possible?
line="""
The national executive of the strife-torn Democrats last night appointed little-known West Australian senator Brian Greig
as interim leader--a shock move likely to provoke further conflict between the party's senators and its organisation.
In a move to reassert control over the party's seven senators, the national executive last night rejected Aden Ridgeway's
bid to become interim leader, in favour of Senator John, a supporter of deposed leader Natasha Stott Despoja and an outspoken
gay rights activist.
"""
inferred_vector=model.infer_vector(gensim.utils.simple_preprocess(line))
model.docvecs.most_similar([inferred_vector], topn=5)
Sometimes I get this list and the list keeps changing everytime I run the code even if there is no change in the model.
[(151, 0.5980586409568787),
(74, 0.5736572742462158),
(106, 0.5714541077613831),
(249, 0.5695925951004028),
(209, 0.5642371773719788)]
[(249, 0.5727256536483765),
(151, 0.5725511312484741),
(74, 0.5711895823478699),
(106, 0.5583171248435974),
(292, 0.5491517782211304)]
As a matter of fact, the first line in training corpus is 99% similar to this line because only 1 word is changed. Surprisingly the document_id 1 is nowhere in the top 5 list.
The dummy line should be selected from lee_background.cor and not from lee.cor
The model text will match with training corpus and not with test corpus.

batch file extract numbers from text file with little information

So This is related to my other two posts. Im dealing with extracting text from a text file and analyzing it and I've run into some problems. For A while I've been using a method that sets all the text between two other strings as a variable, but here is the situation I have. I need to extract the speed (numbers) from the below string: "etc...,query":{"ping":47855},"cmts":...etc. The problem is that the text cmts sometimes changes to something else so really I need to extract all the numbers from this:
,query":{"ping":47855},"
One more thing that makes this difficult is that the characters }," Are all over the file. Thank you for helping me! -Lucas EDG Programmer.
Here's the full file:
{"_id":53291,"ip":"158.69.22.95","domain":"jectile.com","port":25565,"url":"","date_add":1453897770,"status":1,"scan":1,"uptime":99.53,"last_update":1485436105,"geo":{"country":"US","country_name":"United States","city":"Lake Forest"},"info":{"name":" Jectile | jectile.com [1.8-1.11]\n Shoota (Call of Duty) \/ Zambies (Zombie Survival)","type":"FML","version":"1.10","plugins":[],"players":18,"max_players":420,"players_list":[],"map":"world","software":"BungeeCord 1.8.x, 1.9.x, 1.10.x, 1.11.x","avg_player_day":24.458333,"avg_load_day":5.8234,"platform":"MINECRAFT","icon":true},"counter":{"online":47871,"offline":228,"players":{"date":"2017-01-26","total":0},"last_offline":0,"query":{"ping":47855},"cmts":1},"rating":{"main":19.24,"difference":-0.64,"content_up":0.15,"K":0},"last":{"offline":1485415702,"online":1485436105},"chart":{"14:30":14,"14:40":16,"14:50":15,"15:00":18,"15:10":12,"15:20":13,"15:30":9,"15:40":9,"15:50":11,"16:00":12,"16:10":11,"16:20":11,"16:30":18,"16:40":25,"16:50":23,"17:00":27,"17:10":27,"17:20":23,"17:30":24,"17:40":26,"17:50":33,"18:00":31,"18:10":31,"18:20":32,"18:30":37,"18:40":38,"18:50":39,"19:00":38,"19:10":34,"19:20":33,"19:30":40,"19:40":36,"19:50":37,"20:00":38,"20:10":36,"20:20":38,"20:30":37,"20:40":37,"20:50":37,"21:00":34,"21:10":32,"21:20":33,"21:30":33,"21:40":29,"21:50":28,"22:00":26,"22:10":21,"22:20":24,"22:30":29,"22:40":22,"22:50":23,"23:00":27,"23:10":24,"23:20":26,"23:30":25,"23:40":28,"23:50":27,"00:00":32,"00:10":29,"00:20":33,"00:30":32,"00:40":31,"00:50":33,"01:00":40,"01:10":40,"01:20":40,"01:30":41,"01:40":45,"01:50":48,"02:00":43,"02:10":45,"02:20":46,"02:30":46,"02:40":43,"02:50":42,"03:00":39,"03:10":36,"03:20":44,"03:30":34,"03:40":0,"03:50":32,"04:00":35,"04:10":35,"04:20":33,"04:30":43,"04:40":37,"04:50":26,"05:00":31,"05:10":31,"05:20":27,"05:30":25,"05:40":26,"05:50":18,"06:00":13,"06:10":15,"06:20":17,"06:30":18,"06:40":17,"06:50":15,"07:00":16,"07:10":17,"07:20":16,"07:30":16,"07:40":18,"07:50":19,"08:00":14,"08:10":12,"08:20":12,"08:30":13,"08:40":17,"08:50":20,"09:00":18,"09:10":0,"09:20":0,"09:30":27,"09:40":18,"09:50":20,"10:00":15,"10:10":13,"10:20":12,"10:30":10,"10:40":10,"10:50":11,"11:00":13,"11:10":13,"11:20":16,"11:30":19,"11:40":17,"11:50":13,"12:00":10,"12:10":11,"12:20":12,"12:30":16,"12:40":15,"12:50":16,"13:00":14,"13:10":10,"13:20":13,"13:30":16,"13:40":16,"13:50":17,"14:00":20,"14:10":16,"14:20":16},"query":"ping","max_stat":{"max_online":{"date":1470764061,"players":129}},"status_query":"ok"}
By the way, the reason things change is because it looks at info from different servers
Very similar to ther answer I gave you to your first question:
#Echo Off
Set/P var=<some.json
Set var=%var:*:{"ping":=%
Set var=%var:},=&:%
Echo=%var%
Timeout -1

2 Sequential Transactions, setting Detail Number (Revit API / Python)

Currently, I made a tool to rename view numbers (“Detail Number”) on a sheet based on their location on the sheet. Where this is breaking is the transactions. Im trying to do two transactions sequentially in Revit Python Shell. I also did this originally in dynamo, and that had a similar fail , so I know its something to do with transactions.
Transaction #1: Add a suffix (“-x”) to each detail number to ensure the new numbers won’t conflict (1 will be 1-x, 4 will be 4-x, etc)
Transaction #2: Change detail numbers with calculated new number based on viewport location (1-x will be 3, 4-x will be 2, etc)
Better visual explanation here: https://www.docdroid.net/EP1K9Di/161115-viewport-diagram-.pdf.html
Py File here: http://pastebin.com/7PyWA0gV
Attached is the python file, but essentially what im trying to do is:
# <---- Make unique numbers
t = Transaction(doc, 'Rename Detail Numbers')
t.Start()
for i, viewport in enumerate(viewports):
setParam(viewport, "Detail Number",getParam(viewport,"Detail Number")+"x")
t.Commit()
# <---- Do the thang
t2 = Transaction(doc, 'Rename Detail Numbers')
t2.Start()
for i, viewport in enumerate(viewports):
setParam(viewport, "Detail Number",detailViewNumberData[i])
t2.Commit()
Attached is py file
As I explained in my answer to your comment in the Revit API discussion forum, the behaviour you describe may well be caused by a need to regenerate between the transactions. The first modification does something, and the model needs to be regenerated before the modifications take full effect and are reflected in the parameter values that you query in the second transaction. You are accessing stale data. The Building Coder provides all the nitty gritty details and numerous examples on the need to regenerate.
Summary of this entire thread including both problems addressed:
http://thebuildingcoder.typepad.com/blog/2016/12/need-for-regen-and-parameter-display-name-confusion.html
So this issue actually had nothing to do with transactions or doc regeneration. I discovered (with some help :) ), that the problem lied in how I was setting/getting the parameter. "Detail Number", like a lot of parameters, has duplicate versions that share the same descriptive param Name in a viewport element.
Apparently the reason for this might be legacy issues, though im not sure. Thus, when I was trying to get/set detail number, it was somehow grabbing the incorrect read-only parameter occasionally, one that is called "VIEWER_DETAIL_NUMBER" as its builtIn Enumeration. The correct one is called "VIEWPORT_DETAIL_NUMBER". This was happening because I was trying to get the param just by passing the descriptive param name "Detail Number".Revising how i get/set parameters via builtIn enum resolved this issue. See images below.
Please see pdf for visual explanation: https://www.docdroid.net/WbAHBGj/161206-detail-number.pdf.html

Resolving an error in using the Crf++ toolkit

To all those who have had experience with using the crf++ toolkit (refer: http://crfpp.sourceforge.net/)
Please find the error message which pops up on trying to execute the CRF++ training program:
CRF++: Yet Another CRF Tool Kit
Copyright (C) 2005-2009 Taku Kudo, All rights reserved.
encoder.cpp(280) [feature_index.open(templfile, trainfile)] feature_index.cpp(86) [max_size == size] inconsistent column size: 21 20 train.data
I'm not sure how to interpret the error message.
There are 20 features in my training file and the 21st token is the class value.
I have created the Crf++ template file as per the instructions on the site.
It looks like a training data format issue, make sure the number of columns are consistent across all sentences.
I got this error today, and found that crf++ toolkit just set tab character(\t) to default column separator whereas my train data file using one white space lead to error.
Some points to check:
1. Check if you have a new line after each sentence
2. Check if your columnar values does not contain any sp
Error suggests that your number of columns in rows are not same among all. Your maximum number of columns are 21 and that should be consistent through out the training file but crf_learn finds it 20 somewhere in your train.data training file. So locate such row and remove/repair it.

Resources