Making a Datum from a Data Set, Stanford NLP - nlp

I was trying to run through examples for the Stanford NLP Classifier and had a question about classifying a new data set. I see that the ".test" file contains the "goldClass" which is the right answer as well as the String which is supposed to be tested.
The example test set has the following format:
<label> <string>
<label> <String>
...
....
This makes sense for evaluating a model once it has been created from a hand-classified data set. But once a model is created, how do I classify a completely new data set? I no longer have the associated labels. I just have the new set of strings whose class I want to know.
But to classify them, I will have to create a Datum object. To create a datum object, I will need to use makeDatumFromLine(), which requires a TSV line... WHY does this have to be TSV? What is the use of specifying a goldClass when classifying new data?
I hope my question was clear..

"Why does it have to be TSV?"
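Because it does: makeDatumFromLine() reads its input as tab-separated columns, the same format as the training and test files. My solution to get the class of an unknown line is to leave the gold-class column empty by starting the line with a tab: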
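// cdc is the ColumnDataClassifier and cl is the classifier trained from it.
// The leading "\t" leaves the gold-class column empty.
Datum<String, String> example = cdc.makeDatumFromLine("\tEverybody here needs food badly!");
System.out.println("Example data-> " + cl.classOf(example));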
Seems to work fine. I don't know if this is the best approach, but it works for me.
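If the new strings are being prepared outside Java, the same empty-gold-class trick can be applied when writing the input file. The following is only a sketch in Python: the helper name unlabeled_tsv_line is hypothetical, and it assumes the gold class is the first column and the text the second, as in the example .test file.

```python
def unlabeled_tsv_line(text):
    """Build a two-column TSV line whose gold-class column is empty."""
    # Collapse tabs and newlines inside the text, since they would
    # otherwise be read as extra columns or extra rows.
    cleaned = " ".join(text.split())
    # The leading tab is the empty first (gold-class) column.
    return "\t" + cleaned


if __name__ == "__main__":
    for s in ["Everybody here needs food badly!", "Some other\tnew string"]:
        print(repr(unlabeled_tsv_line(s)))
```

Each resulting line can then be passed to makeDatumFromLine() exactly as in the answer above.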

Related

Need Help creating class hierarchy in Python

I have a hierarchy of data that I would like to build using classes instead of hard-coding it. The structure is like so:
Unit (has name, abbreviation, subsystems[5 different types of subsystems])
Subsystem ( has type, block diagram(photo), ParameterModel[20 different sets of parameterModels])
ParameterModel (30 or so parameters that will have [parameter name, value, units, and model index])
I'm not sure how to do this using classes, but what I have made kind of work so far is creating nested dictionaries:
{'Unit': {'Unit1': {'Subsystem': {'Generator': {'Parameter': {'Name': 'param1', 'Value': 1, 'Units': 'seconds'}}}}}}
like this, but with 10-15 units, 5-6 subsystems, and 30 or so parameters per subsystem. I know using dictionaries is not the best way to go about it, but I cannot figure out the class structure or where to start building it.
I want to be able to create, read, update, and delete parameters in a tkinter GUI that I have built, as well as export/import these system parameters and do calculations on them. I can handle the calculations and the import/export, but I need to create classes that will build out this structure and let me reference each individual unit/subsystem/parameter/value/etc.
I know that's a lot, but any advice? I've been looking into the factory and abstract factory patterns hoping to figure out how to create the code structure, but to no avail. I have experience with MATLAB, Visual Basic, C++, and various Arduino projects, so I know most basic programming, but this class structure is something I cannot figure out how to do in an abstract way without hard-coding each parameter with giant names like Unit1_Generator_parameterName_parameter = ____, and I really don't want to do that.
Thanks,
-A
EDIT: Here is one way I've done the implementation using a dictionary, but I would like to do this using a class that can take a list and make a bunch of empty attributes, and have those be editable/callable generally, like setParamValue(unit, subsystem, param), where I can pass the unit, the subsystem, and then the parameter such as 'Td', and then be able to change the value of the key/value pair within this hierarchy.
def create_keys(names):
    # Build a dict with one empty (None) slot per name.
    return {key: None for key in names}

unit_list = ['FL','ES','NN','SF','CC','HD','ND','TH']  # unit abbreviations
sub_list = ['Gen','Gov','Exc','PSS','Rel','BlkD']
params_GENROU = ["T'do","T''do","T'qo","T''qo",'H','D','Xd','Xq',"Xd'","Xq'","X''d=X''q",'Xl','S(1.0)','S(1.2)','Ra']  # parameter names

# Named "units" rather than "dict" to avoid shadowing the built-in type.
units = create_keys(unit_list)
for key in units:
    units[key] = create_keys(sub_list)
    units[key]['Gen'] = create_keys(params_GENROU)
and inside each units[unit]['Gen'][param_name] there should be a dict containing the value, the units (seconds, degrees, etc.), a description, and CON (basically an index for another program we use).
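One way the Unit / Subsystem / Parameter hierarchy described above could be expressed with classes is sketched below. This is only an illustration, not a definitive design: the class names, the Plant container, and the get_param / set_param_value methods are hypothetical names chosen to mirror the setParamValue(unit, subsystem, param) call the question asks for.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class Parameter:
    name: str
    value: Optional[float] = None
    units: str = ""              # e.g. "seconds", "degrees"
    description: str = ""
    con: Optional[int] = None    # index used by the other program


@dataclass
class Subsystem:
    kind: str                    # e.g. "Gen", "Gov", "Exc"
    parameters: Dict[str, Parameter] = field(default_factory=dict)


@dataclass
class Unit:
    abbreviation: str
    subsystems: Dict[str, Subsystem] = field(default_factory=dict)


class Plant:
    """Holds every unit and gives one access path to any parameter."""

    def __init__(self, unit_names, subsystem_params):
        # subsystem_params maps a subsystem kind to its parameter names.
        self.units = {
            u: Unit(u, {
                kind: Subsystem(kind, {n: Parameter(n) for n in names})
                for kind, names in subsystem_params.items()
            })
            for u in unit_names
        }

    def get_param(self, unit, subsystem, param):
        return self.units[unit].subsystems[subsystem].parameters[param]

    def set_param_value(self, unit, subsystem, param, value):
        self.get_param(unit, subsystem, param).value = value


if __name__ == "__main__":
    plant = Plant(["FL", "ES"], {"Gen": ["T'do", "H"], "Gov": []})
    plant.set_param_value("FL", "Gen", "H", 3.5)
    print(plant.get_param("FL", "Gen", "H"))
```

Because every level is a plain dict keyed by name, the structure stays as easy to loop over and export as the nested dictionaries, while each parameter carries its own value, units, description, and CON fields.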

How to Correctly Extend An Enum

I am trying to extend the geo.StreetSuffix enum to include some more possible values. It currently doesn't have a value for Greene which is a valid street suffix. This is what my concept looks like:
enum (StreetSuffix) {
description (Street Suffix)
extends(geo.StreetSuffix)
symbol (Greene)
}
This is a training sample:
[g:Evaluate:prompt] (19)[v:geo.StreetNumber] (Fake Hills)[v:geo.StreetName] (Lane)[v:StreetSuffix:Lane]
When I do this, though, the training files give me the following error:
Confusion Points: Match(es) on : "Lane".
and the language recognition no longer works for that value. Am I doing something wrong here, is there a bug, or is this not how enum inheritance is supposed to work?
I am happy to write my own enum which would be a copy of geo.StreetSuffix but it seems like a waste if I could just extend it and add some of my own values.
Unfortunately, you'd have to copy everything from the old vocab file (which you don't have access to).
Note
If you extend a type into another capsule, a new vocabulary file must still be created. Vocabulary is never inherited, even if you use extends or add role-of to a model.
https://bixbydevelopers.com/dev/docs/dev-guide/developers/training.vocabulary#adding-vocabulary
That being said, you can file a ticket with support to add Greene and any other missing values you might come across...

How to view and interpret the output of lda model using gensim

I am able to create the LDA model and save it. Now I am trying to load the model and pass it a new document:
lda = LdaModel.load('..\\models\\lda_v0.1.model')
doc_lda = lda[new_doc_term_matrix]
print(doc_lda )
On printing doc_lda I get the object itself: <gensim.interfaces.TransformedCorpus object at 0x000000F82E4BB630>
However, I want to get the topic words associated with it. What method do I have to use? I was referring to this.
Not sure if this is still relevant, but have you tried get_document_topics()? Though I assume that would only work if you've updated your LDA model using update().
I don't think there is anything wrong with your code - the "Usage example" from the documentation link you posted uses doc2bow which returns a sparse vector - I don't know what new_doc_term_matrix consists of, but I'll assume it worked fine.
You might want to look at this stackoverflow question: you are trying to print an "object", which isn't printable as such. The data you want is somewhere inside the object, and that data is printable.
Alternatively, you can also use your IDE's capabilities - the Variable explorer in Spyder, for example - to click yourself into the objects and get the info you need.
For more info on similarity analysis with gensim, see this tutorial.
Use this on the model itself (the TransformedCorpus returned by lda[...] has no print_topics method):
lda.print_topics(-1)
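To get the topic words for each new document rather than for the whole model, the transformed corpus can be iterated. The sketch below assumes the default gensim behaviour, where each item of lda[corpus] is a list of (topic_id, probability) pairs and lda.show_topic(topic_id, topn=...) returns (word, probability) pairs. The helper name topic_words_per_document is hypothetical.

```python
def topic_words_per_document(lda, transformed, topn=5):
    """For each document, return a list of (topic_id, probability, [words])."""
    results = []
    # Iterating the lazily evaluated corpus is what runs the inference.
    for doc_topics in transformed:
        described = []
        for topic_id, prob in doc_topics:
            # Keep only the words, dropping their per-topic weights.
            words = [word for word, _ in lda.show_topic(topic_id, topn=topn)]
            described.append((topic_id, prob, words))
        results.append(described)
    return results


# Intended use with the names from the question (not executed here):
#   for doc in topic_words_per_document(lda, lda[new_doc_term_matrix]):
#       print(doc)
```

The function only relies on show_topic and on iteration, so it can be tried out with a small stand-in object before running it against a real model.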

Mallet, how to use ExpGain and GradientGain method to construct a FeatureSelector

I want to test the accuracy of a text classifier built with Mallet. There are four feature selection methods available (FeatureCounts, InfoGain, ExpGain and GradientGain).
I want to know how to use ExpGain and GradientGain.
E.g.:
FeatureSelector fselector = new FeatureSelector(new FeatureCounts.Factory(), numOfFeature);
Each of the classes you mentioned is a subclass of RankedFeatureVector. They apply different rules to generate a score for each feature. You can then construct a new FeatureSelection object by passing the RankedFeatureVector and the number of features you want to keep.
This API page shows how to use FeatureSelection objects to train classifiers.

Convert string to sentence case but leave acronyms untouched

This is not a language-specific question.
I have a string in ALL CAPS. This string comes in from a separate source and for some reason is always in all caps.
I've been given the task of making the string a little more reader-friendly so I decided to just slap a sentence case converter method on it using simple regex.
The thing is, there are a lot of acronyms used in this string and I would like to keep them unaffected. Things like country codes(US, CA, JP, FR, etc...), or airport codes(LAX, LGA) and sometimes many others.
Now I'm guessing I would first need a list of the acronyms in a database or something, of all the possible airport codes, country codes and a list of commonly used acronyms like ETA, COD, etc...
Once I have this database created, how can I apply it to the string in question?? How can I prevent the word "us" being changed to US and vice-versa?? What I basically wanna know is, how do I take what's in the DB and apply all the necessary changes to the string?
Remember, I get the original string in ALL CAPS so there's no way to differentiate.
Any ideas would be greatly appreciated!!
Thanks!!!
Something close to this can be done with ActiveSupport::Inflector, which provides the titleize method (which does the work for String.titleize).
First, define your own inflections in an initializer.
# config/initializers/inflections.rb
ActiveSupport::Inflector.inflections do |inflect|
  inflect.acronym 'US'
end
Restart your app to pick up the change. Now titleize knows how to handle "US". Fire up a Rails console to check it out:
> "us".titleize
=> "US"
Next, check out the source code for titleize. Once you understand it, reopen the Inflector class in an initializer and define your own method that doesn't capitalize the first letter of each word. Call it something nifty, like decapitalize.
module ActiveSupport::Inflector
  def decapitalize(word)
    humanize(underscore(word)) # you may enhance this a bit
  end
end

class String
  def decapitalize
    ActiveSupport::Inflector.decapitalize(self)
  end
end
Caveats and Limitations
You may need to tweak the code, but I think it's close.
Here are some sentences this solution won't handle very well:
> "US STATES VISITED BY US".titleize
=> "US States Visited By US"
> "COLUMBIA (CO) EXPORTS ARE PROCESSED BY ACME BUILDING CO.".decapitalize
=> "Columbia (CO) exports are processed by acme building CO."
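Since the question is not language-specific, the same lookup idea can be sketched without Rails. The Python below is only an illustration: the ACRONYMS set is a hypothetical stand-in for the database the question describes, and the function shares the limitation shown above, because it cannot tell the pronoun "us" from the country code "US".

```python
import re

# Hypothetical acronym list; in practice this would be loaded from the
# database of country codes, airport codes, and common acronyms.
ACRONYMS = {"US", "CA", "JP", "FR", "LAX", "LGA", "ETA", "COD"}


def sentence_case(text, acronyms=ACRONYMS):
    """Lower-case an ALL CAPS string, keep known acronyms, capitalize sentence starts."""
    def fix_word(match):
        word = match.group(0)
        # Words found in the acronym set stay upper-case.
        return word if word in acronyms else word.lower()

    lowered = re.sub(r"[A-Za-z]+", fix_word, text.upper())
    # Capitalize the first letter of the string and of each new sentence.
    return re.sub(
        r"(^|[.!?]\s+)([a-z])",
        lambda m: m.group(1) + m.group(2).upper(),
        lowered,
    )


if __name__ == "__main__":
    print(sentence_case("FLIGHT FROM LAX TO LGA. ETA IS UNKNOWN."))
    print(sentence_case("US STATES VISITED BY US"))
```

Resolving cases like the second one would need context beyond a word list, for example part-of-speech information.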

Resources