Compare documents using the most_similar method - NLP

I am able to build the model using the built-in lee_background corpus, but when I try to compare documents using the most_similar method, I get an error.
import gensim

lee_train_file = '/opt/conda/lib/python3.6/site-packages/gensim/test/test_data/lee_background.cor'
train_corpus = list()
with open(lee_train_file) as f:
    for i, line in enumerate(f):
        train_corpus.append(gensim.models.doc2vec.TaggedDocument(gensim.utils.simple_preprocess(line), [i]))
model = gensim.models.doc2vec.Doc2Vec(vector_size=48, min_count=2, epochs=40)
model.build_vocab(train_corpus)
model.wv.vocab['penalty'].count
model.train(train_corpus, total_examples=model.corpus_count, epochs=model.epochs)
line="""
dummy text here...
"""
inferred_vector = model.infer_vector(gensim.utils.simple_preprocess(line))
model.docvecs.most_similar(inferred_vector, topn=3)
I tried this with list(inferred_vector) but I still get an error.
TypeError: 'numpy.float32' object is not iterable
I am trying to compare the dummy text with the corpus and find out whether the entry already exists in the data file.
Update:
Instead of list(inferred_vector) I need to use [inferred_vector]. This solved my problem. But every time I run this code, I get different similar documents. How is this possible?
line="""
The national executive of the strife-torn Democrats last night appointed little-known West Australian senator Brian Greig
as interim leader--a shock move likely to provoke further conflict between the party's senators and its organisation.
In a move to reassert control over the party's seven senators, the national executive last night rejected Aden Ridgeway's
bid to become interim leader, in favour of Senator John, a supporter of deposed leader Natasha Stott Despoja and an outspoken
gay rights activist.
"""
inferred_vector=model.infer_vector(gensim.utils.simple_preprocess(line))
model.docvecs.most_similar([inferred_vector], topn=5)
Sometimes I get this list, and the list keeps changing every time I run the code, even if there is no change in the model.
[(151, 0.5980586409568787),
(74, 0.5736572742462158),
(106, 0.5714541077613831),
(249, 0.5695925951004028),
(209, 0.5642371773719788)]
[(249, 0.5727256536483765),
(151, 0.5725511312484741),
(74, 0.5711895823478699),
(106, 0.5583171248435974),
(292, 0.5491517782211304)]
As a matter of fact, the first line in the training corpus is 99% similar to this line because only one word is changed. Surprisingly, document_id 1 is nowhere in the top 5 list.

The dummy line should be selected from lee_background.cor and not from lee.cor.
The model text will match the training corpus, not the test corpus.
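A note on the changing results: infer_vector starts from a random initialization and only runs a limited number of inference passes, so repeated calls on the same text return slightly different vectors and therefore slightly different rankings. Below is a minimal sketch of one way to reduce (not eliminate) that variance, assuming the model trained above; the epochs keyword is the newer gensim name for this parameter (older releases call it steps).
import numpy as np

tokens = gensim.utils.simple_preprocess(line)
# Run inference several times with more passes and average the results
# for a more stable query vector.
vectors = [model.infer_vector(tokens, epochs=200) for _ in range(5)]
stable_vector = np.mean(vectors, axis=0)
model.docvecs.most_similar([stable_vector], topn=5)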


How to remove a feature from a product?

So I am still learning this amazingly complex system folks call Hybris, or SAP Commerce, along with lots of other names :) I ran into a problem and am looking to learn how to get out of it. I have added four new classification attributes (Utilization, Fit, Material, Function). When I went to add them to the products, I added a space between them and the numeric code that came after them:
$feature1=#Utilization, 445 [$clAttrModifiers]; # Style
$feature2=#Fit, 446 [$clAttrModifiers]; # Colour
$feature3=#Material, 447 [$clAttrModifiers]; # Connections
$feature4=#Function, 448 [$clAttrModifiers]; # Function
INSERT_UPDATE Product;code[unique=true];$feature1;$feature2;$feature3;$feature4;$catalogVersion;
;300413166;my;feature;has;a space
The problem is I want to take the space out, as seen in the following code:
$feature1=#Utilization,445 [$clAttrModifiers];# Style
$feature2=#Fit,446 [$clAttrModifiers];# Colour
$feature3=#Material,447 [$clAttrModifiers];# Connections
$feature4=#Function,448 [$clAttrModifiers];# Function
INSERT_UPDATE Product;code[unique=true];$feature1;$feature2;$feature3;$feature4;$catalogVersion;
;300413166;Bottom;Loose;Yam type;Sportswear
When I run both of these scripts together, I get 8 features:
So how do I remove the four features that have spaces in them?
How do I go about actually removing the first set of features?
The ImpEx below removes the ClassAttributeAssignment records, which also removes the ProductFeature entries assigned to all products. However, you need to find the correct classification category code (e.g. clasificationCategory1) that the attribute (e.g. Utilization, 445) belongs to. The classification category is the grouping/header you will find in the Attributes section.
$classificationCatalog=ElectronicsClassification
$classificationSystemVersion=systemVersion(catalog(id[default=$classificationCatalog]),version[default='1.0'])[unique=true,default=$classificationCatalog:1.0]
$classificationCatalogVersion=catalogversion(catalog(id[default=$classificationCatalog]),version[default='1.0'])[unique=true]
$class=classificationClass(code,$classificationCatalogVersion)[unique=true]
$attribute=classificationAttribute(code,$classificationSystemVersion)[unique=true]
REMOVE ClassAttributeAssignment[batchmode=true];$class;$attribute
;clasificationCategory1;Utilization, 445
;clasificationCategory1;Fit, 446
You can also remove the ProductFeature entries (for all products) like this, but it doesn't remove the ClassAttributeAssignment.
REMOVE ProductFeature[batchmode=true];qualifier[unique=true]
;ElectronicsClassification/1.0/clasificationCategory1.Utilization, 445
Other Reference:
Classification System API: https://help.sap.com/viewer/d0224eca81e249cb821f2cdf45a82ace/1905/en-US/8b7ad17c86691014aa0ee2d228c56dd1.html

2 Sequential Transactions, setting Detail Number (Revit API / Python)

Currently, I made a tool to rename view numbers (“Detail Number”) on a sheet based on their location on the sheet. Where this is breaking is the transactions. I'm trying to do two transactions sequentially in Revit Python Shell. I also did this originally in Dynamo, and that had a similar failure, so I know it's something to do with transactions.
Transaction #1: Add a suffix (“-x”) to each detail number to ensure the new numbers won’t conflict (1 will be 1-x, 4 will be 4-x, etc)
Transaction #2: Change detail numbers with calculated new number based on viewport location (1-x will be 3, 4-x will be 2, etc)
Better visual explanation here: https://www.docdroid.net/EP1K9Di/161115-viewport-diagram-.pdf.html
Py File here: http://pastebin.com/7PyWA0gV
Attached is the Python file, but essentially what I'm trying to do is:
# <---- Make unique numbers
t = Transaction(doc, 'Rename Detail Numbers')
t.Start()
for i, viewport in enumerate(viewports):
    setParam(viewport, "Detail Number", getParam(viewport, "Detail Number") + "x")
t.Commit()

# <---- Do the thang
t2 = Transaction(doc, 'Rename Detail Numbers')
t2.Start()
for i, viewport in enumerate(viewports):
    setParam(viewport, "Detail Number", detailViewNumberData[i])
t2.Commit()
As I explained in my answer to your comment in the Revit API discussion forum, the behaviour you describe may well be caused by a need to regenerate between the transactions. The first modification does something, and the model needs to be regenerated before the modifications take full effect and are reflected in the parameter values that you query in the second transaction. You are accessing stale data. The Building Coder provides all the nitty gritty details and numerous examples on the need to regenerate.
Summary of this entire thread including both problems addressed:
http://thebuildingcoder.typepad.com/blog/2016/12/need-for-regen-and-parameter-display-name-confusion.html
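A minimal sketch of that suggestion, reusing the doc, viewports, setParam and getParam names from the question (Document.Regenerate() must be called while a transaction is open):
t = Transaction(doc, 'Rename Detail Numbers')
t.Start()
for viewport in viewports:
    setParam(viewport, "Detail Number", getParam(viewport, "Detail Number") + "x")
doc.Regenerate()  # make the modifications visible before any further parameter reads
# ... parameter reads from here on should reflect the changes made above ...
t.Commit()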
So this issue actually had nothing to do with transactions or document regeneration. I discovered (with some help :) ) that the problem lay in how I was setting/getting the parameter. "Detail Number", like a lot of parameters, has duplicate versions that share the same descriptive parameter name on a viewport element.
Apparently the reason for this might be legacy issues, though I'm not sure. Thus, when I was trying to get/set the detail number, it was occasionally grabbing the incorrect read-only parameter, the one whose built-in enumeration is "VIEWER_DETAIL_NUMBER". The correct one is "VIEWPORT_DETAIL_NUMBER". This was happening because I was trying to get the parameter just by passing the descriptive name "Detail Number". Revising how I get/set parameters via the built-in enum resolved this issue. See images below.
Please see pdf for visual explanation: https://www.docdroid.net/WbAHBGj/161206-detail-number.pdf.html
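For illustration, a minimal sketch of the enum-based lookup described above (assuming Revit Python Shell with the Revit API namespace available; the helper names are made up):
from Autodesk.Revit.DB import BuiltInParameter

def get_detail_number(viewport):
    # Look up by built-in enum rather than the descriptive name "Detail Number",
    # which also matches the read-only VIEWER_DETAIL_NUMBER parameter.
    return viewport.get_Parameter(BuiltInParameter.VIEWPORT_DETAIL_NUMBER).AsString()

def set_detail_number(viewport, value):
    viewport.get_Parameter(BuiltInParameter.VIEWPORT_DETAIL_NUMBER).Set(value)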

How to use the entitymentions annotator in Stanford CoreNLP?

I am trying the newest version of Stanford CoreNLP. When I extract location or organisation names, I see that every single word is tagged with the annotation. So, if the entity is "NEW YORK TIMES", then it is getting recorded as three different entities: "NEW", "YORK" and "TIMES". I find that the newest CoreNLP has an "entitymentions" annotator. I think this annotator may help me solve this problem. However, there is no usage instruction or example for this annotator. Could anyone give me more info about this new feature?
Take a look at the mentions annotation key. This should be attached to a sentence, and contain a list of CoreMaps corresponding to each mention. So, there should be a CoreMap in there that corresponds to the mention of "New York Times".
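If you are calling CoreNLP over its HTTP server rather than from Java, here is a minimal Python sketch of reading those mentions from the JSON output (this assumes a server is already running on localhost:9000 and that your CoreNLP build includes the entitymentions annotator):
import json
import requests

text = "The New York Times is a newspaper."
props = {"annotators": "tokenize,ssplit,pos,lemma,ner,entitymentions",
         "outputFormat": "json"}
resp = requests.post("http://localhost:9000/",
                     params={"properties": json.dumps(props)},
                     data=text.encode("utf-8"))
doc = resp.json()
for sentence in doc["sentences"]:
    for mention in sentence.get("entitymentions", []):
        # Each mention spans the whole entity, e.g. "New York Times" / ORGANIZATION
        print(mention["text"], mention["ner"])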
I guess no annotator will annotate NEW YORK TIMES as a single entity unless you train the model with such a dataset.
The Stanford NER and POS taggers are trained on certain datasets, and based on those they annotate the entities. (I saw that there are some text dictionary lists of people, locations and organizations in the Stanford source files; these decide which entities get annotated.)
A trained dataset can annotate Newyork as an entity; if you want to annotate NEW YORK TIMES as an entity, then you have to train with such datasets.
I tested with these annotators: tokenize,ssplit,pos,lemma,ner,parse,dcoref.
Query: New York Times is really nice.
Result : [Text=New CharacterOffsetBegin=0 CharacterOffsetEnd=3 PartOfSpeech=NNP Lemma=New NamedEntityTag=ORGANIZATION] [Text=York CharacterOffsetBegin=4 CharacterOffsetEnd=8 PartOfSpeech=NNP Lemma=York NamedEntityTag=ORGANIZATION] [Text=Times CharacterOffsetBegin=9 CharacterOffsetEnd=14 PartOfSpeech=NNP Lemma=Times NamedEntityTag=ORGANIZATION] [Text=is CharacterOffsetBegin=15 CharacterOffsetEnd=17 PartOfSpeech=VBZ Lemma=be NamedEntityTag=O] [Text=really CharacterOffsetBegin=18 CharacterOffsetEnd=24 PartOfSpeech=RB Lemma=really NamedEntityTag=O] [Text=nice CharacterOffsetBegin=25 CharacterOffsetEnd=29 PartOfSpeech=JJ Lemma=nice NamedEntityTag=O] [Text=. CharacterOffsetBegin=29 CharacterOffsetEnd=30 PartOfSpeech=. Lemma=. NamedEntityTag=O]
Query: Newyork times
Result : [Text=Newyork CharacterOffsetBegin=0 CharacterOffsetEnd=7 PartOfSpeech=NNP Lemma=Newyork NamedEntityTag=LOCATION] [Text=times CharacterOffsetBegin=8 CharacterOffsetEnd=13 PartOfSpeech=NNS Lemma=time NamedEntityTag=O]
Try
Integer entityMentionIndex = coreLabel.get(CoreAnnotations.EntityMentionIndexAnnotation.class);
If you try it with the string "New York Times newspaper is distributed in California", you can see that the entityMentionIndex is 0 (zero) for each of the words New, York and Times. That means that if the index is the same, those words belong to a single entity.

Python - CSV Module, Getting Information From a File

Here is the situation:
The first problem I'm having is with obtaining information from a CSV file. The purpose of the code I'm writing is to get a bunch of information on ZCTAs (zip codes) for a number of different cohorts (there are six currently being used, but the code is meant to be flexible to handle any number of cohorts). One file contains the population, by cohort, for each ZCTA. Another file has the number of 'cases' (cases of cancer observed) for each cohort, for each ZCTA. Another file has the crude rate for each cohort for the state of Iowa (the focus of this research), that is, the rate at which one can 'expect' to see people who have cancer in a population, by cohort. There are a couple of other files, but these are the focus, as this is where my issue is exhibited.
What my code does, initially, is read the population file and get the population of each cohort by ZCTA. Each ZCTA, and its information, is stored in a list, which is then stored in a list of lists (nested) containing all of the ZCTAs. The code then gets the crude rate. Then, the crude rate is multiplied by the appropriate cohort for each ZCTA and summed with all of the other cohorts within each ZCTA, to get the total number of people we can EXPECT to see having cancer for each ZCTA. The population is also summed up. This information is stored in another list, as well as a list containing all of the ZCTAs. This information will be the focus (the list of all of the ZCTAs, each of which contains the total population and the total number of expected cases).
So, the problem is that I then need to take this newly acquired list and get the number of OBSERVED cases, for each cohort, sum those together, append it to the appropriate ZCTA and write it to a new file. I have code implemented that does this fine, EXCEPT that the bottom 22 or so ZCTAs don't get the number of observed cases. I don't know if it is the code, or what, but it works for all of the other 906, but doesn't get the bottom 22.
The reader will find sample data for the files I've discussed (the observed case file, and the output file) at: Gist
Here is the code I'm using:
import csv

expectedcsv = open('ExpectedCases.csv', 'w', newline='')
expectedwriter = csv.writer(expectedcsv, delimiter=',')
expectedHeader = ['zcta', 'expected', 'pop', 'observed']
for zcta in zctaPop:
    caseCounter = 0
    thecasescsv = open('NewCaseFile.csv', 'r', newline='')
    thecasesreader = csv.reader(thecasescsv, delimiter=',')
    for case in thecasesreader:
        if case[0] == zcta[0]:
            for i in range(3, len(case)):
                caseCounter += int(case[i])
    zcta.append(caseCounter)
    expectedwriter.writerow(zcta)
expectedcsv.close()
thecasescsv.close()
Something else I would also like to bring up is that later on in the code, the actual purpose of all of this is to create an SMR filter for each grid point. The grid points are somewhat arbitrary; they have been placed (via coordinates) over the entire state of Iowa. The SMR is the number of observed cases divided by the number of expected cases. The threshold, that is, how many expected cases for a particular filter, is set by the user. So, if a user wants a filter created on 150 expected cases (for each grid point), the code goes through each ZCTA, summing up the expected cases until greater than 150 are found. The distance to this last ZCTA is the 'radius' of the filter.
To do this, I built a distance matrix (the distance from each grid point to every ZCTA) and then sorted it, nearest to furthest. Because of the size of the file (2300 x 930), I have to read this file line by line and get all of the information from other files. So, starting with the nearest ZCTA, I get the population, expected cases, and observed cases (the problem with this file was discussed above) and add each of these to its respective counter (one for population, one for observed and one for expected). Then it goes to the next closest ZCTA and does the same, until the threshold is exceeded.
The problem here is that I couldn't use the CSV module to read these files, as I was already reading from another file and the index would be lost. So I had to use a plain file read(), which then required some interesting use of maketrans and .translate. I'm not sure it's efficient or works well. Everything seems to be fine, but without the above problem being fixed, it's impossible to tell. I have included the code below, but was wondering if anybody had any better ideas/suggestions?
expectedCSV = open('ExpectedCases.csv', 'r', newline='')
table = str.maketrans('\r', ' ')
content = expectedCSV.read()
expectedCSV.close()
content = content.translate(table)
content = content.split(sep='\n')
newContent = []
for item in content:
    newContent.append(item.split(sep=','))
content = ' '
for item in newContent:
    if item[0] == currentZcta:
        expectedTotal += float(item[1])
        totalPop += float(item[2])
        totalObservedCount += float(item[3])
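One possible simplification, sketched below: read ExpectedCases.csv once with the csv module into a dictionary keyed by ZCTA, then do plain lookups while walking the sorted distance matrix. The names sorted_zctas and threshold are hypothetical stand-ins for the structures described above.
import csv

# Load the expected-cases file once, indexed by ZCTA.
expected_by_zcta = {}
with open('ExpectedCases.csv', newline='') as f:
    for row in csv.reader(f):
        if row:
            expected_by_zcta[row[0]] = row

# Accumulate nearest-to-furthest until the expected-case threshold is exceeded.
totalPop = expectedTotal = totalObservedCount = 0.0
for zcta_id, distance in sorted_zctas:
    row = expected_by_zcta[zcta_id]
    expectedTotal += float(row[1])
    totalPop += float(row[2])
    totalObservedCount += float(row[3])
    radius = distance  # distance to the last ZCTA added so far
    if expectedTotal > threshold:
        break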
Also, I couldn't figure out how to color the methods blue and the variables red, as some of the more awesome users of this site do. I would be very much interested in learning how to do that for future posts.
If anybody needs more info or anything clarified to help answer/formulate a solution, please, by all means, ask! Thanks for taking the time to read!
So, I ended up "solving" this by computing the observed along with the expected and population, by opening the file for each ZCTA computed. This did not really solve the issue I was dealing with, but rather found a way around it. I'm somewhat disappointed that more people didn't view and/or respond to this. If someone comes up with an answer to the actual problem, by all means, post it here. -Mike

Can't Input Tab Delimited file to Stanford Classifier

I'm having a problem inputting tab-delimited files into the Stanford Classifier.
Although I was able to successfully walk through all the included stanford tutorials, including the newsgroup tutorial, when I try to input my own training and test data it doesn't load properly.
At first I thought the problem was that I was saving the data into a tab delimited file using an Excel spreadsheet and it was some kind of encoding issue.
But then I got exactly the same results when I did the following. First I literally typed the demo data below into gedit, making sure to use a tab between the politics/sports class and the ensuing text:
politics Obama today announced a new immigration policy.
sports The NBA all-star game was last weekend.
politics Both parties are eyeing the next midterm elections.
politics Congress votes tomorrow on electoral reforms.
sports The Lakers lost again last night, 102-100.
politics The Supreme Court will rule on gay marriage this spring.
sports The Red Sox report to spring training in two weeks.
sports Messi set a world record for goals in a calendar year in 2012.
politics The Senate will vote on a new budget proposal next week.
politics The President declared on Friday that he will veto any budget that doesn't include revenue increases.
I saved that as myproject/demo-train.txt and a similar file as myproject/demo-test.txt.
I then ran the following:
java -mx1800m -cp stanford-classifier.jar edu.stanford.nlp.classify.ColumnDataClassifier
-trainFile myproject/demo-train.txt -testFile myproject/demo-test.txt
The good news: this actually ran without throwing any errors.
The bad news: since it doesn't extract any features, it can't actually estimate a real model and the probability defaults to 1/n for each item, where n is the number of classes.
So then I ran the same command but with two basic options specified:
java -mx1800m -cp stanford-classifier.jar edu.stanford.nlp.classify.ColumnDataClassifier
-trainFile myproject/demo-train.txt -testFile myproject/demo-test.txt -2.useSplitWords =2.splitWordsRegexp "\s+"
That yielded:
Exception in thread "main" java.lang.RuntimeException: Training dataset could not be processed
at edu.stanford.nlp.classify.ColumnDataClassifier.readDataset(ColumnDataClassifier.java:402)
at edu.stanford.nlp.classify.ColumnDataClassifier.readTrainingExamples (ColumnDataClassifier.java:317)
at edu.stanford.nlp.classify.ColumnDataClassifier.trainClassifier(ColumnDataClassifier.java:1652)
at edu.stanford.nlp.classify.ColumnDataClassifier.main(ColumnDataClassifier.java:1628)
Caused by: java.lang.ArrayIndexOutOfBoundsException: 2
at edu.stanford.nlp.classify.ColumnDataClassifier.makeDatum(ColumnDataClassifier.java:670)
at edu.stanford.nlp.classify.ColumnDataClassifier.makeDatumFromLine(ColumnDataClassifier.java:267)
at edu.stanford.nlp.classify.ColumnDataClassifier.makeDatum(ColumnDataClassifier.java:396)
... 3 more
These are exactly the same results I get when I used the real data I saved from Excel.
What's more, I don't know how to make sense of the ArrayIndexOutOfBoundsException. When I used readline in Python to print out the raw strings for both the demo files I created and the tutorial files that worked, nothing about the formatting seemed different. So I don't know why this exception would be raised with one set of files but not the other.
Finally, one other quirk. At one point I thought maybe line breaks were the problem. So I deleted all line breaks from the demo files while preserving tab breaks and ran the same command:
java -mx1800m -cp stanford-classifier.jar edu.stanford.nlp.classify.ColumnDataClassifier
-trainFile myproject/demo-train.txt -testFile myproject/demo-test.txt -2.useSplitWords =2.splitWordsRegexp "\s+"
Surprisingly, this time no java exceptions are thrown. But again, it's worthless: it treats the entire file as one observation, and can't properly fit a model as a result.
I've spent 8 hours on this now and have exhausted everything I can think of. I'm new to Java but I don't think that should be an issue here -- according to Stanford's API documentation for ColumnDataClassifier, all that's required is a tab delimited file.
Any help would be MUCH appreciated.
One last note: I've run these same commands with the same files on both Windows and Ubuntu, and the results are the same in each.
Use a properties file. The example Stanford Classifier properties file contains:
trainFile=20news-bydate-devtrain-stanford-classifier.txt
testFile=20news-bydate-devtest-stanford-classifier.txt
2.useSplitWords=true
2.splitWordsTokenizerRegexp=[\\p{L}][\\p{L}0-9]*|(?:\\$ ?)?[0-9]+(?:\\.[0-9]{2})?%?|\\s+|[\\x80-\\uFFFD]|.
2.splitWordsIgnoreRegexp=\\s+
The number 2 at the start of lines 3, 4 and 5 signifies the column in your tsv file. So in your case you would use
trainFile=20news-bydate-devtrain-stanford-classifier.txt
testFile=20news-bydate-devtest-stanford-classifier.txt
1.useSplitWords=true
1.splitWordsTokenizerRegexp=[\\p{L}][\\p{L}0-9]*|(?:\\$ ?)?[0-9]+(?:\\.[0-9]{2})?%?|\\s+|[\\x80-\\uFFFD]|.
1.splitWordsIgnoreRegexp=\\s+
or if you want to run with command line arguments
java -mx1800m -cp stanford-classifier.jar edu.stanford.nlp.classify.ColumnDataClassifier -trainFile myproject/demo-train.txt -testFile myproject/demo-test.txt -1.useSplitWords =1.splitWordsRegexp "\s+"
I've faced the same error as you.
Pay attention to the tabs in the text you are classifying.
Caused by: java.lang.ArrayIndexOutOfBoundsException: 2
This means that at some point the classifier expects an array of 3 elements after it splits the string on tabs.
I ran a method that counts the number of tabs in each line; if some line does not have two of them, that is where the error is.
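A small Python sketch of that check (the expected tab count depends on your layout: a class<TAB>text file has one tab per line, while the -2.* options in the question assume a third column, i.e. two tabs per line):
EXPECTED_TABS = 1  # adjust to match your column layout
with open('myproject/demo-train.txt', encoding='utf-8') as f:
    for lineno, line in enumerate(f, start=1):
        tabs = line.rstrip('\n').count('\t')
        if tabs != EXPECTED_TABS:
            print('line {}: found {} tab(s)'.format(lineno, tabs))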
