Dealing with integer-valued features for CRF in mallet - nlp

I am just starting to use the SimpleTagger class in mallet. My impression is that it expects binary features. The model that I want to implement has positive integer-valued features and I wonder how to implement this in mallet. Also, I heard that non-binary features need to be normalized if the model is to make sense. I would appreciate any suggestions on how to do this.
ps. yes, I know that there is a dedicated mallet mail list but I am waiting for nearly a day already to get my subscription approved to be able to post there. I'm simply in a hurry.

Well it's 6 years later now. If you're not in a hurry anymore, you could check out the Java API to create your instances. A minimal example:
private static Instance createInstance(LabelAlphabet labelAlphabet) {
    // observations and labels should be equal size for linear chain CRFs
    TokenSequence observations = new TokenSequence();
    LabelSequence labels = new LabelSequence(labelAlphabet, 1); // capacity: one label per token below
    observations.add(createToken());
    labels.add("idk, some target or something");
    return new Instance(
            observations,
            labels,
            "myInstance",
            null
    );
}
private static Token createToken() {
    Token token = new Token("exampleToken");
    // Note: properties are not used for computing (I think)
    token.setProperty("SOME_PROPERTY", "hello");
    // Any old double value, keyed by a feature name
    token.setFeatureValue("SOME_FEATURE", 666.0);
    // etc for more features ...
    return token;
}
public static void main(String[] args) {
    // Note the first arg is false to denote we *do not* deal with binary features
    InstanceList instanceList = new InstanceList(new TokenSequence2FeatureVectorSequence(false, false));
    LabelAlphabet labelAlphabet = new LabelAlphabet();
    // Converts our tokens to feature vectors
    instanceList.addThruPipe(createInstance(labelAlphabet));
}
Or, if you want to keep using SimpleTagger, just define binary features like HAS_1_LETTER, HAS_2_LETTER, etc, though this seems tedious.
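For example, a small sketch of how an integer-valued count could be binarized into such feature names before writing the SimpleTagger input (the cumulative HAS_n_LETTER scheme is just one option):
// Binarize an integer-valued feature into cumulative binary feature names,
// e.g. a count of 3 yields HAS_1_LETTER HAS_2_LETTER HAS_3_LETTER,
// which preserves the ordering information of the original integer.
private static List<String> binarizeLetterCount(int letterCount) {
    List<String> features = new ArrayList<>();
    for (int i = 1; i <= letterCount; i++) {
        features.add("HAS_" + i + "_LETTER");
    }
    return features;
}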

Related

What code instrument should be added to register each http event in MeterRegistry with a specific tag & minute value? Event requests are in the millions

I need to analyse one http event value which should not be greater than 30 minutes, and 95% of events should fall into this bucket. If that fails, an alert should be sent.
My first concern is to get the right metrics in /actuator/prometheus
Steps I took:
In every http request event, I receive one integer value called eventMinute.
Using Micrometer's MeterRegistry, I tried the code below:
// MeterRegistry meterRegistry ...
meterRegistry.summary("MINUTES_ANALYSIS", tags);
where the tag EVENT_MINUTE receives some integer value in each http event.
But this way it floods the metrics, because with millions of events each distinct minute value creates its own time series.
Please guide me; I am a beginner at this. Thanks!!
The simplest solution (which I would recommend you start with) would be to just create 2 counters:
int theThing = getTheThing(); // however you obtain the value for this request
if (theThing > 30) {
    meterRegistry.counter("my.request.counter.abovethreshold").increment();
}
meterRegistry.counter("my.request.counter.total").increment();
You would increment the counter that matches your threshold and another that tracks all requests (or reuse another meter that does that for you).
Then it is simple to set up a chart or alarm:
(my_request_counter_total - my_request_counter_abovethreshold) / my_request_counter_total < .95
i.e. alert when fewer than 95% of requests stay within the threshold.
(I didn't test the code. It might need a tiny bit of tweaking)
You'll be able to do a similar thing with a DistributionSummary by setting SLOs (I'm not familiar enough with them to offer one here), but start with something simple first; if it is sufficient, you won't need the extra complexity.
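For reference, an untested sketch of that DistributionSummary variant (the metric name and 30-minute threshold are just illustrative):
// A single 30-minute SLO bucket: Prometheus then exposes le="30.0" and le="+Inf"
// buckets, from which the within-threshold ratio can be computed.
DistributionSummary eventMinutes = DistributionSummary.builder("event.minutes")
        .serviceLevelObjectives(30.0)
        .register(meterRegistry);
eventMinutes.record(eventMinute);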
There are a couple of ways to solve this problem.
1. Here is a function which receives tags, a metric name and a value:
public void createOrUpdateHistogram(String metricName, Map<String, String> stringTags, double numericValue)
{
    // convert the plain string map into Micrometer tags
    List<Tag> tags = stringTags.entrySet().stream()
            .map(e -> Tag.of(e.getKey(), e.getValue()))
            .collect(Collectors.toList());
    DistributionSummary.builder(metricName)
            .tags(tags)
            // can enforce an SLO here if required
            .publishPercentileHistogram()
            .minimumExpectedValue(1.0D) // choose this based on how you want your distribution
            .maximumExpectedValue(30.0D)
            .register(this.meterRegistry)
            .record(numericValue);
}
It then produces metrics like
delta_bucket{mode="CURRENT",le="30.0",} 11.0
delta_bucket{mode="CURRENT",le="+Inf",} 11.0
Since the +Inf bucket also includes everything at or below 30, subtract the le="30.0" bucket from the le="+Inf" bucket (delta_bucket{le="+Inf"} - delta_bucket{le="30.0"}) to get the count of events above the threshold.
Another way could be:
public void createOrUpdateHistogram(String metricName, Map<String, String> stringTags, double numericValue)
{
    // convert the plain string map into Micrometer tags
    List<Tag> tags = stringTags.entrySet().stream()
            .map(e -> Tag.of(e.getKey(), e.getValue()))
            .collect(Collectors.toList());
    Timer.builder(metricName)
            .tags(tags)
            .publishPercentiles(0.5D, 0.95D)
            .publishPercentileHistogram()
            .serviceLevelObjectives(Duration.ofMinutes(30L))
            .minimumExpectedValue(Duration.ofMinutes(30L))
            .maximumExpectedValue(Duration.ofMinutes(30L))
            .register(this.meterRegistry)
            .record((long) numericValue, TimeUnit.MINUTES);
}
It will only have two le buckets, the given time and +Inf.
This can be changed based on your requirements, and it also gives you quantiles.
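Called from the request handler, the helper might be used like this (metric and tag names are illustrative):
Map<String, String> tags = new HashMap<>();
tags.put("mode", "CURRENT");
createOrUpdateHistogram("event.minutes", tags, eventMinute);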

Stanford NLP: OutOfMemoryError

I am annotating and analyzing a series of text files.
The pipeline.annotate method becomes increasingly slow each time it reads a file. Eventually, I get an OutOfMemoryError.
Pipeline is initialized ONCE:
protected void initializeNlp()
{
    Log.getLogger().debug("Starting Stanford NLP");
    // creates a StanfordCoreNLP object with POS tagging, lemmatization,
    // NER, regex NER, dependency parsing, natural logic and Open IE
    Properties props = new Properties();
    props.put("annotators", "tokenize, ssplit, pos, lemma, ner, regexner, depparse, natlog, openie");
    props.put("regexner.mapping", namedEntityPropertiesPath);
    pipeline = new StanfordCoreNLP(props);
    Log.getLogger().debug("\n\n\nStarted Stanford NLP Successfully\n\n\n");
}
I then process each file using the same instance of the pipeline (as recommended elsewhere on SO and by Stanford).
public void processFile(Path file)
{
    try
    {
        Instant start = Instant.now();
        // cleanString is the file's text, read and cleaned elsewhere
        Annotation document = new Annotation(cleanString);
        Log.getLogger().info("ANNOTATE");
        pipeline.annotate(document);
        Long millis = Duration.between(start, Instant.now()).toMillis();
        Log.getLogger().info("Annotation Duration in millis: " + millis);
        AnalyzedFile af = AnalyzedFileFactory.getAnalyzedFile(AnalyzedFileFactory.GENERIC_JOB_POST, file);
        processSentences(af, document);
        Log.getLogger().info("\n\n\nFile Processing Complete\n\n\n\n\n");
        Long millis1 = Duration.between(start, Instant.now()).toMillis();
        Log.getLogger().info("Total Duration in millis: " + millis1);
        allFiles.put(file.toUri().toString(), af);
    }
    catch (Exception e)
    {
        Log.getLogger().debug(e.getMessage(), e);
    }
}
To be clear, I expect the problem is with my configuration. However, I am certain that the stall and memory issues occur at the pipeline.annotate(document) call.
I dispose of all references to Stanford-NLP objects other than pipeline (e.g., CoreLabel) after processing each file. That is, I do not keep references to any Stanford objects in my code beyond the method level.
Any tips or guidance would be deeply appreciated
OK, that last sentence of the question made me go double check. The answer is that I WAS keeping a reference to a CoreMap in one of my own classes. In other words, I was keeping in memory all the Trees, Tokens and other analyses for every sentence in my corpus.
In short, keep StanfordNLP CoreMaps for a given number of sentences and then dispose of them.
(I expect a hard core computational linguist would say there is rarely any need to keep a CoreMap once it has been analyzed, but I have to declare my neophyte status here)
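For example, a sketch of what "keep only what you need" could look like inside processSentences (addLemmas is a hypothetical method on AnalyzedFile that stores plain Strings):
for (CoreMap sentence : document.get(CoreAnnotations.SentencesAnnotation.class)) {
    List<String> lemmas = new ArrayList<>();
    for (CoreLabel token : sentence.get(CoreAnnotations.TokensAnnotation.class)) {
        lemmas.add(token.get(CoreAnnotations.LemmaAnnotation.class));
    }
    af.addLemmas(lemmas); // keep Strings, not CoreMaps or CoreLabels
}
// no reference to the Annotation or its CoreMaps survives beyond this point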

Using a lookup table with Google Guava

I am new to Java. I have a requirement to hold a lookup table in memory (abbreviations and their expansions). I was thinking of using a Java HashMap, but I want to know if that really is the best approach.
Also, are there any equivalent libraries in Google Guava for the same requirement?
I want it to be optimized and very efficient w.r.t. time and memory.
Using Maps
Maps are indeed fine for this, as used below.
Apparently, it's a bit early for you to care that much about performance or memory consumption though, and we can't really help you if we don't have more context on the actual use case.
In Pure Java
final Map<String, String> lookup = new HashMap<>();
lookup.put("IANAL", "I Ain't A Lawyer");
lookup.put("IMHO", "In My Humble Opinion");
Note that there are several implementations of the Map interface, or that you can write your own.
Using Google Guava
If you want an immutable map:
final Map<String, String> lookup = ImmutableMap.<String, String>builder()
.put("IANAL", "I Ain't A Lawyer")
.put("IMHO", "In My Humble Opinion")
.build();
Retrieving Data
Then to use it to lookup an abbreviation:
// retrieval:
if (lookup.containsKey("IMHO")) {
final String value = lookup.get("IMHO");
/* do stuff */
}
Using Enums
I was speaking of alternatives...
If you know at coding time what the key/value pairs will be, you may very well be better off using a Java enum:
enum Abbreviations {
    IANAL("I Ain't A Lawyer"),
    IMHO("In My Humble Opinion");

    private final String value;

    private Abbreviations(final String value) {
        this.value = value;
    }

    public String getValue() {
        return value;
    }
}
You can then look up values directly, i.e. either by doing this:
Abbreviations.IMHO.getValue()
Or by using:
Abbreviations.valueOf("IMHO").getValue()
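Note that Abbreviations.valueOf(...) throws an IllegalArgumentException for a name that isn't defined, so if the abbreviation comes from free-form input you might add a small lookup helper inside the enum, along these lines (a sketch, using java.util.HashMap):
private static final Map<String, Abbreviations> BY_NAME = new HashMap<>();
static {
    for (Abbreviations a : values()) {
        BY_NAME.put(a.name(), a);
    }
}
public static Abbreviations fromString(String name) {
    return BY_NAME.get(name); // null instead of an exception for unknown names
}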
Considering where you seem to be in your learning process, I'd recommend you follow the links and read through the Java tutorial and implement the examples.

What's the mapping from Type.getSort() to the local and stack arrays in visitFrame(...)?

I need to adapt my code to the stricter Java 7 verifier and have to add visitFrame calls in my MethodNode (I'm using the tree api). I could not find any information on how Type maps to the Object[]s used in visitFrame, so please help me out here...
This is what I have so far:
private Object getFrameType(Type type) {
    switch (type.getSort()) {
        case Type.BOOLEAN:
        case Type.CHAR:
        case Type.BYTE:
        case Type.SHORT:
        case Type.INT:
            return Opcodes.INTEGER;
        case Type.LONG:
            return Opcodes.LONG;
        case Type.FLOAT:
            return Opcodes.FLOAT;
        case Type.DOUBLE:
            return Opcodes.DOUBLE;
        case Type.OBJECT:
        case Type.ARRAY:
            return type.getInternalName();
    }
    throw new RuntimeException(type + " can not be converted to frame type");
}
What I'd like to know is: what are Type.VOID and Type.METHOD?
When do I need Opcodes.TOP, Opcodes.NULL and Opcodes.UNINITIALIZED_THIS?
I'm guessing UNINITIALIZED_THIS is only used in the constructor and I can probably ignore VOID and METHOD, but I'm not sure and I don't have the slightest idea what TOP is.
If I understood your need correctly, you could just let ASM calculate the frames for you. This will probably slow down the class generation a bit, but certainly worth a try.
When creating the ClassWriter, just add COMPUTE_FRAMES to the flags argument of the constructor, e.g.
new ClassWriter(ClassWriter.COMPUTE_FRAMES);
Similarly, if you are transforming a class, the ClassReader can be asked to expand the frames like:
ClassReader cr = ...;
ClassNode cn = new ClassNode(ASM4);
cr.accept(cn, ClassReader.EXPAND_FRAMES);
The former option has the benefit that you can forget about the frames (and "maxs") altogether while the latter option might require you to patch the frames yourself depending on what kind of transformation you do.
The examples are for ASM version 4, but these features have been supported at least since version 3 - the parameters are just passed a bit differently.
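Putting the two together, a transformation round trip might look roughly like this (a sketch; originalBytes and the actual transformation are placeholders):
ClassReader cr = new ClassReader(originalBytes);
ClassNode cn = new ClassNode(Opcodes.ASM4);
cr.accept(cn, ClassReader.EXPAND_FRAMES);
// ... transform cn here, e.g. edit its MethodNodes ...
// COMPUTE_FRAMES recalculates the stack map frames (and maxs), so the
// transformed methods do not need hand-written visitFrame calls
ClassWriter cw = new ClassWriter(ClassWriter.COMPUTE_FRAMES);
cn.accept(cw);
byte[] transformedBytes = cw.toByteArray();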

Using *.resx files to store string value pairs

I have an application that requires mappings between string values, so essentially a container that can hold key-value pairs. Instead of using a dictionary or a name-value collection I used a resource file that I access programmatically in my code. I understand resource files are used in localization scenarios for multi-language implementations and the like. However I like their strongly typed nature, which ensures that if a key is renamed or removed the application no longer compiles.
However I would like to know if there are any important cons of using a *.resx file for simple key-value pair storage instead of using a more traditional programmatic type.
There are two cons which I can think of off the top of my head:
it requires an I/O operation to read a key/value pair, which may result in a significant performance decrease,
if you let the standard .NET logic resolve resource loading, it will always try to find the file corresponding to the CultureInfo.CurrentUICulture property; this could be problematic if you decide that you actually want to have multiple resx files (i.e. one per language); this could result in even further performance degradation.
BTW, couldn't you just create a helper class or structure containing properties, like this:
public static class GlobalConstants
{
    private const int _SomeInt = 42;
    private const string _SomeString = "Ultimate answer";

    public static int SomeInt
    {
        get
        {
            return _SomeInt;
        }
    }

    public static string SomeString
    {
        get
        {
            return _SomeString;
        }
    }
}
You can then access these properties exactly the same way as resource files (I am assuming that you're used to this style):
textBox1.Text = GlobalConstants.SomeString;
textBox1.Top = GlobalConstants.SomeInt;
Maybe it is not the best thing to do, but I firmly believe this is still better than using a resource file for that...
