I train a topic model with Mallet, and I want to serialize it for later use. I ran it on two test documents, and then deserialized it and ran the loaded model on the same documents, and the results were completely different.
Is there anything wrong with the way I'm saving/loading the model (code attached)?
Thanks!
List<Pipe> pipeList = initPipeList();
// Begin by importing documents from text to feature sequences
InstanceList instances = new InstanceList(new SerialPipes(pipeList));
for (String document : documents) {
    Instance inst = new Instance(document, "", "", "");
    instances.addThruPipe(inst);
}
ParallelTopicModel model = new ParallelTopicModel(numTopics, alpha_t * numTopics, beta_w);
model.addInstances(instances);
model.setNumThreads(numThreads);
model.setNumIterations(numIterations);
model.estimate();
printProbabilities(model, "doc 1"); // I replaced the contents of the docs due to copyright issues
printProbabilities(model, "doc 2");
model.write(new File("model.bin"));
model = ParallelTopicModel.read(new File("model.bin"));
printProbabilities(model, "doc 1");
printProbabilities(model, "doc 2");
Definition of printProbabilities():
public void printProbabilities(ParallelTopicModel model, String doc) {
    List<Pipe> pipeList = initPipeList();
    InstanceList instances = new InstanceList(new SerialPipes(pipeList));
    instances.addThruPipe(new Instance(doc, "", "", ""));
    double[] probabilities = model.getInferencer().getSampledDistribution(instances.get(0), 10, 1, 5);
    for (int i = 0; i < probabilities.length; i++) {
        double probability = probabilities[i];
        if (probability > 0.01) {
            System.out.println("Topic " + i + ", probability: " + probability);
        }
    }
}
You have to use the same pipe for training and for classification. During training, the pipe's data alphabet gets updated with each training instance. Constructing a fresh SerialPipes(pipeList) does not reproduce that pipe, because its data alphabet is empty. Save/load the pipe (or the InstanceList containing it) along with the model, and use that pipe to add test instances.
Separately, when you don't fix a random seed, every run of Mallet gives you a different topic model (with the topic numbers permuted, some topics slightly different, other topics very different). Fix the random seed to get replicable topics.
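A minimal sketch of both fixes (assuming Mallet's standard API; file names are arbitrary): serialize the training InstanceList so its pipe, with the populated alphabets, can be reused at inference time, and fix the random seed before estimating.

```java
import java.io.File;
import cc.mallet.pipe.SerialPipes;
import cc.mallet.topics.ParallelTopicModel;
import cc.mallet.types.Instance;
import cc.mallet.types.InstanceList;

// --- Training time ---
InstanceList instances = new InstanceList(new SerialPipes(pipeList));
// ... addThruPipe(...) each training document as before ...
ParallelTopicModel model = new ParallelTopicModel(numTopics, alpha_t * numTopics, beta_w);
model.setRandomSeed(42);                    // replicable topics across runs
model.addInstances(instances);
model.estimate();
model.write(new File("model.bin"));
instances.save(new File("instances.bin"));  // the pipe and its alphabets travel with it

// --- Inference time ---
ParallelTopicModel loaded = ParallelTopicModel.read(new File("model.bin"));
InstanceList training = InstanceList.load(new File("instances.bin"));
// Reuse the training pipe so test instances share the same data alphabet
InstanceList testing = new InstanceList(training.getPipe());
testing.addThruPipe(new Instance("doc 1", "", "", ""));
double[] probs = loaded.getInferencer().getSampledDistribution(testing.get(0), 10, 1, 5);
```

With the shared pipe, the feature indices of the test document line up with the model's topic-word counts, so the two inference runs should now agree.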
I'm trying to dynamically reference Excel sheets or tables within the .dat for a Mixed Integer Problem in Vehicle Routing that I'm trying to solve in CPLEX (OPL).
The setup is a: .mod = model, .dat = data and a MS Excel spreadsheet
I have a 2-dimensional array with customer demand data, read from an Excel range (for coding convenience I did not format the Excel data as a table yet).
The decision variable in .mod looks like this:
dvar boolean x[vertices][vertices][scenarios]
in .dat:
vertices from SheetRead (data, "Table!vertices");
and
scenarios from SheetRead (data, "dont know how to yet"); // this might not be needed
Without the scenario index, everything is fine.
But as the customer demand changes in this model, I'd like to account for that by changing the data source reference.
Now what I'd like to do is one of 2 things:
Either:
Change the spreadsheet in Excel so that depending on the scenario I get something like that in .dat:
scenario = 1:
vertices from SheetRead (data, "table-scenario-1!vertices");
scenario = 2:
vertices from SheetRead (data, "table-scenario-2!vertices");
so changing the spreadsheet for new base data,
or:
Change the range within the same spreadsheet:
scenario = 1:
vertices from SheetRead (data, "table!vertices-1");
scenario = 2:
vertices from SheetRead (data, "table!vertices-2");
either way would be fine.
Knowing that 3D tables in Excel are built by grouping multiple sheets of 2D tables, the more natural approach seems to be to have vertices always reference the same range on every sheet, switching the sheet depending on the scenario, but I just don't know how to do that.
Thanks for the advice.
Unfortunately, the arguments to SheetConnection must be a string literal or an Id (see the OPL grammar in the user manual), and similarly for SheetRead. This means you cannot have dynamic sources for a sheet connection.
As we discussed in the comments, one option is to add an additional index to all data: the scenario. Then always read the data for all scenarios and in the .mod file select what you want to actually use.
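A sketch of that scenario-indexed approach (the sheet name, range names, and `nbScenarios` here are my own placeholders, not from your workbook): stack the demand for all scenarios into one 2D range, read it once, and index it by scenario in the model.

```
// .mod (sketch)
int nbScenarios = ...;
range scenarios = 1..nbScenarios;
{string} vertices = ...;
float demand[vertices][scenarios] = ...;

// .dat (sketch; one range holding all scenarios side by side)
SheetConnection data("myData.xlsx");
vertices from SheetRead(data, "Table!vertices");
demand from SheetRead(data, "Table!demandAllScenarios");
```

The model then selects `demand[v][s]` for the scenario `s` it is currently solving, with no need to change the sheet reference at all.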
At https://www.ibm.com/developerworks/community/forums/html/topic?id=5af4d332-2a97-4250-bc06-76595eef1ab0&ps=25 I shared an example where you can set a dynamic name for the Excel file. The same way, you could have a dynamic range; the trick is to use flow control.
sub.mod
float maxOfx = 2;
string fileName = ...;
dvar float x;
maximize x;
subject to {
    x <= maxOfx;
}
execute {
    writeln("filename= ", fileName);
}
and then the main model is
main {
    var source = new IloOplModelSource("sub.mod");
    var cplex = new IloCplex();
    var def = new IloOplModelDefinition(source);
    for (var k = 11; k <= 20; k++) {
        var opl = new IloOplModel(def, cplex);
        var data2 = new IloOplDataElements();
        data2.fileName = "file" + k;
        opl.addDataSource(data2);
        opl.generate();
        if (cplex.solve()) {
            writeln("OBJ = " + cplex.getObjValue());
        } else {
            writeln("No solution");
        }
        opl.postProcess();
        opl.end();
    }
}
I am trying to create a wall with 2 layers, where each layer's material is different. When I try to set the CompoundStructure for the wall, I get an exception that the CompoundStructure is not valid.
CompoundStructure cStructure = CompoundStructure.CreateSimpleCompoundStructure(clayer);
wallType.SetCompoundStructure(cStructure);
Can anyone tell me how I can create compound structure for layers with different materials?
First of all, solve your task manually through the end user interface and verify that it works at all.
Then, use RevitLookup and other database exploration tools to examine the results in the BIM elements, their properties and relationships.
Once you have done that, you will have a good idea of how to address the task programmatically, and you will have confidence that it will work as expected:
How to research to find a Revit API solution
Intimate Revit database exploration with the Python Shell
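That said, here is a rough sketch of the programmatic side (C#, Revit API; the layer widths, functions, and material variables are placeholders, and error handling is omitted). The structure has to be created from valid layers and assigned inside a transaction; a zero layer width or an invalid material ElementId is a common cause of the "not valid" exception:

```csharp
// Sketch only: widths, functions and materials are assumed placeholders.
var layers = new List<CompoundStructureLayer>
{
    new CompoundStructureLayer(0.1, MaterialFunctionAssignment.Finish1, material1.Id),
    new CompoundStructureLayer(0.4, MaterialFunctionAssignment.Structure, material2.Id)
};
CompoundStructure cs = CompoundStructure.CreateSimpleCompoundStructure(layers);
using (Transaction t = new Transaction(doc, "Set compound structure"))
{
    t.Start();
    wallType.SetCompoundStructure(cs);
    t.Commit();
}
```

Comparing the CompoundStructure produced this way against one created through the user interface (via RevitLookup, as suggested above) is the quickest way to spot what the API call is rejecting.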
newWallMaterial = wallMaterial.Duplicate("newCreatedMaterial");
newWallmaterial2 = wallMaterial.Duplicate("NewCreatedMAterial2");
//roofMaterial3 = roofMaterial2.Duplicate("NewCreatedMAterial3");
bool usr = newWallMaterial.UseRenderAppearanceForShading;
//newWallMaterial.Color = BuiltInTypeParam.materialCol;
foreach (Layers layer in layers)
{
    if (layer.layerId == 0)
    {
        c = new CompoundStructureLayer(layer.width, layer.materialAssignement, newWallMaterial.Id);
        newWallMaterial.Color = color;
        clayer.Add(c);
    }
    if (layer.layerId == 1)
    {
        c1 = new CompoundStructureLayer(layer.width, layer.materialAssignement, newWallmaterial2.Id);
        newWallmaterial2.Color = color;
        clayer.Add(c1);
    }
}
I am annotating and analyzing a series of text files.
The pipeline.annotate method becomes increasingly slow each time it reads a file. Eventually, I get an OutOfMemoryError.
Pipeline is initialized ONCE:
protected void initializeNlp()
{
    Log.getLogger().debug("Starting Stanford NLP");
    // creates a StanfordCoreNLP object with POS tagging, lemmatization,
    // NER, and parsing
    Properties props = new Properties();
    props.put("annotators", "tokenize, ssplit, pos, lemma, ner, regexner, depparse, natlog, openie");
    props.put("regexner.mapping", namedEntityPropertiesPath);
    pipeline = new StanfordCoreNLP(props);
    Log.getLogger().debug("\n\n\nStarted Stanford NLP Successfully\n\n\n");
}
I then process each file using same instance of pipeline (as recommended elsewhere on SO and by Stanford).
public void processFile(Path file)
{
    try
    {
        Instant start = Instant.now();
        // cleanString holds the file's text, prepared elsewhere
        Annotation document = new Annotation(cleanString);
        Log.getLogger().info("ANNOTATE");
        pipeline.annotate(document);
        Long millis = Duration.between(start, Instant.now()).toMillis();
        Log.getLogger().info("Annotation Duration in millis: " + millis);
        AnalyzedFile af = AnalyzedFileFactory.getAnalyzedFile(AnalyzedFileFactory.GENERIC_JOB_POST, file);
        processSentences(af, document);
        Log.getLogger().info("\n\n\nFile Processing Complete\n\n\n\n\n");
        Long millis1 = Duration.between(start, Instant.now()).toMillis();
        Log.getLogger().info("Total Duration in millis: " + millis1);
        allFiles.put(file.toUri().toString(), af);
    }
    catch (Exception e)
    {
        Log.getLogger().debug(e.getMessage(), e);
    }
}
To be clear, I expect the problem is with my configuration. However, I am certain that the stall and memory issues occur at the pipeline.annotate(document) call.
I dispose of all references to Stanford-NLP objects other than pipeline (e.g., CoreLabel) after processing each file. That is, I do not keep references to any Stanford objects in my code beyond the method level.
Any tips or guidance would be deeply appreciated.
OK, that last sentence of the question made me go double-check. The answer is that I WAS keeping a reference to CoreMaps in one of my own classes. In other words, I was keeping in memory all the Trees, Tokens and other analyses for every sentence in my corpus.
In short, keep StanfordNLP CoreMaps for a given number of sentences and then dispose of them.
(I expect a hard-core computational linguist would say there is rarely any need to keep a CoreMap once it has been analyzed, but I have to declare my neophyte status here.)
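A minimal sketch of that "keep N, then dispose" idea (plain Java; the class name and window size are my own invention): hold analyzed results in a bounded deque, so that once you exceed the window the oldest CoreMaps lose their last reference and become garbage-collectable.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Keeps only the most recent 'capacity' items; older ones are dropped
// so they can be garbage-collected (e.g., analyzed CoreMaps).
class BoundedBuffer<T> {
    private final int capacity;
    private final Deque<T> items = new ArrayDeque<>();

    BoundedBuffer(int capacity) {
        this.capacity = capacity;
    }

    void add(T item) {
        if (items.size() == capacity) {
            items.removeFirst(); // evict the oldest reference
        }
        items.addLast(item);
    }

    int size() {
        return items.size();
    }

    boolean contains(T item) {
        return items.contains(item);
    }
}
```

Storing sentence analyses in a structure like this, instead of an ever-growing list, keeps the heap bounded no matter how large the corpus is.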
I am new to Stanford NLP and NER and am trying to train a custom classifier with data sets of currencies and countries.
My training data in training-data-currency.tsv looks like -
USD CURRENCY
GBP CURRENCY
And, training data in training-data-countries.tsv looks like -
USA COUNTRY
UK COUNTRY
And, classifiers properties look like -
trainFileList = classifiers/training-data-currency.tsv,classifiers/training-data-countries.tsv
ner.model=classifiers/english.conll.4class.distsim.crf.ser.gz,classifiers/english.muc.7class.distsim.crf.ser.gz,classifiers/english.all.3class.distsim.crf.ser.gz
serializeTo = classifiers/my-classification-model.ser.gz
map = word=0,answer=1
useClassFeature=true
useWord=true
useNGrams=true
#no ngrams will be included that do not contain either the
#beginning or end of the word
noMidNGrams=true
useDisjunctive=true
maxNGramLeng=6
usePrev=true
useNext=true
useSequences=true
usePrevSequences=true
maxLeft=1
#the next 4 deal with word shape features
useTypeSeqs=true
useTypeSeqs2=true
useTypeySequences=true
wordShape=chris2useLC
Java code to find the categories is -
LinkedHashMap<String, LinkedHashSet<String>> map = new LinkedHashMap<String, LinkedHashSet<String>>();
NERClassifierCombiner classifier = null;
try {
    classifier = new NERClassifierCombiner(true, true,
        "C:\\Users\\perso\\Downloads\\stanford-ner-2015-04-20\\stanford-ner-2015-04-20\\classifiers\\my-classification-model.ser.gz");
} catch (IOException e) {
    e.printStackTrace();
}
List<List<CoreLabel>> classify = classifier.classify("Zambia");
for (List<CoreLabel> coreLabels : classify) {
    for (CoreLabel coreLabel : coreLabels) {
        String word = coreLabel.word();
        String category = coreLabel.get(CoreAnnotations.AnswerAnnotation.class);
        if (!"O".equals(category)) {
            if (map.containsKey(category)) {
                map.get(category).add(word);
            } else {
                LinkedHashSet<String> temp = new LinkedHashSet<String>();
                temp.add(word);
                map.put(category, temp);
            }
            System.out.println(word + ":" + category);
        }
    }
}
When I run the above code with input "USD" or "UK", I get the expected result, "CURRENCY" or "COUNTRY". But when I input something like "Russia", the return value is "CURRENCY", which comes from the first training file in the properties. I expected 'O' to be returned for values that are not present in my training data.
How can I achieve this behavior? Any pointers where I am going wrong would be really helpful.
Hi I'll try to help out!
So it sounds to me like you have a list of strings that should be called "CURRENCY", and you have a list of strings that should be called "COUNTRY", etc...
And you want something to tag strings based off of your list. So when you see "RUSSIA", you want it to be tagged "COUNTRY", when you see "USD", you want it to be tagged "CURRENCY".
I think these tools will be more helpful for you (particularly the first one):
http://nlp.stanford.edu/software/regexner/
http://nlp.stanford.edu/software/tokensregex.shtml
The NERClassifierCombiner is designed to train on large volumes of tagged sentences and look at a variety of features including the capitalization and the surrounding words to make a guess about a given word's NER label.
But it sounds to me in your case you just want to explicitly tag certain sequences based off of your pre-defined list. So I would explore the links I provided above.
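For example, a RegexNER mapping file for your use case might look like the following (tab-separated pattern and tag; this is a sketch of the format, extended with whatever entries you need), plugged in via the regexner.mapping property just like in a standard pipeline setup:

```
USD	CURRENCY
GBP	CURRENCY
USA	COUNTRY
UK	COUNTRY
Russia	COUNTRY
```

A token that matches no rule keeps the background 'O' tag, which is exactly the behavior you were expecting from the trained classifier.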
Please let me know if you need any more help and I will be happy to follow up!
So the problem I'm trying to tackle is the following:
I need a data source that emits messages at a certain frequency
There are N neural nets that need to process each message individually
The outputs from all neural nets are aggregated and only when all N outputs for each message are collected, should a message be declared fully processed
At the end I should measure the time it took for a message to be fully processed (the time between when it was emitted and when all N neural net outputs for that message have been collected)
I'm curious how one would approach such a task using Spark Streaming.
My current implementation uses 3 types of components: a custom receiver and two classes that implement Function, one for the neural nets, one for the end aggregator.
In broad strokes, my application is built as follows:
JavaReceiverInputDStream<...> rndLists = jssc.receiverStream(new JavaRandomReceiver(...));
Function<JavaRDD<...>, Void> aggregator = new JavaSyncBarrier(numberOfNets);
for (int i = 0; i < numberOfNets; i++) {
    rndLists.map(new NeuralNetMapper(neuralNetConfig)).foreachRDD(aggregator);
}
The main problem I'm having with this, though, is that it runs faster in local mode than when submitted to a 4-node cluster.
Is my implementation wrong to begin with, or is something else happening here?
There's also a full post here http://apache-spark-user-list.1001560.n3.nabble.com/Developing-a-spark-streaming-application-td12893.html with more details regarding the implementation of each of the three components mentioned previously.
It seems there might be a lot of repetitive instantiation and serialization of objects. The latter might be hurting your performance in a cluster.
You should try instantiating your neural networks only once. You will have to ensure that they are serializable. You should use flatMap instead of multiple maps + union. Something along these lines:
// Initialize the neural nets first
List<NeuralNetMapper> neuralNetMappers = new ArrayList<>(numberOfNets);
for (int i = 0; i < numberOfNets; i++) {
    neuralNetMappers.add(new NeuralNetMapper(neuralNetConfig));
}
// Then create a DStream applying all of them
JavaDStream<Result> neuralNetResults = rndLists.flatMap(new FlatMapFunction<Item, Result>() {
    @Override
    public Iterable<Result> call(Item item) {
        List<Result> results = new ArrayList<>(numberOfNets);
        for (int i = 0; i < numberOfNets; i++) {
            results.add(neuralNetMappers.get(i).doYourNeuralNetStuff(item));
        }
        return results;
    }
});
// The aggregation stuff
neuralNetResults.foreachRDD(aggregator);
If you can afford to initialize the networks this way, you can save quite a lot of time. Also, the union stuff you included in your linked posts seems unnecessary and is penalizing your performance: a flatMap will do.
Finally, to further tune your performance in the cluster, you can use the Kryo serializer.
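A sketch of that last suggestion (assuming the standard Spark Java API; the app name and batch interval are arbitrary, and NeuralNetMapper is your class): switch the serializer in the SparkConf and register the classes that get shipped to executors.

```java
import org.apache.spark.SparkConf;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaStreamingContext;

SparkConf conf = new SparkConf()
    .setAppName("neural-net-streaming")
    // Kryo is typically faster and more compact than Java serialization
    .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer");
// Registering classes avoids writing full class names into each serialized record
conf.registerKryoClasses(new Class<?>[]{ NeuralNetMapper.class });
JavaStreamingContext jssc = new JavaStreamingContext(conf, Durations.seconds(1));
```

Since the networks are serialized once per task when shipped to the cluster, a cheaper serializer directly reduces the per-batch overhead you are seeing.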