I just started using OpenNLP to recognize names. I am using the model (en-ner-person.bin) that comes with OpenNLP. I noticed that while it recognizes US, UK, and European names, it fails to recognize Indian or Japanese names. My questions are: (1) are there models already available that I can use to recognize foreign names? (2) If not, I believe I will need to train new models. In that case, is there a corpus available that I can use?
You can build your own model from your data using an OpenNLP addon called modelbuilder-addon. It's brand new, so if you try it you may be the first to do so other than me, but it works well for me.
You feed it the following:
a list of "known entities" via a file where each line is a name
a list of sentences from YOUR data via a file where each line is a sentence
(optionally) a blacklist to remove false positives
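For example, the two input files might look like this (all names and sentences here are made up for illustration):

names.txt - one known entity per line:

Rahul Sharma
Aiko Tanaka

sentences.txt - one sentence per line:

Rahul Sharma joined the Mumbai office last year.
Aiko Tanaka will present the results on Friday.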
You can check out the addon here:
https://svn.apache.org/repos/asf/opennlp/addons/modelbuilder-addon
You can use this to get started:
import java.io.File;

import opennlp.addons.modelbuilder.DefaultModelBuilderUtil;

public class ModelBuilderAddonUse {

    public static void main(String[] args) {
        File fileOfSentences = new File("path to your sentence file");
        File fileOfNames = new File("path to your file of person names");
        File blackListFile = new File("path to your blacklist file");
        File modelOutFile = new File("path to where the model will be saved");
        File annotatedSentencesOutFile = new File("path to where the annotated sentences will be saved");
        DefaultModelBuilderUtil.generateModel(fileOfSentences, fileOfNames, blackListFile,
                modelOutFile, annotatedSentencesOutFile, "person", 3);
    }
}
The idea is that your known entities (common names in your data) are used to create annotations, those annotations are used to train a model, and the model is then used to find more names and create more annotations, and so on, for as many passes as the "iterations" parameter specifies. You should run it, check your results, add any undesirable hits to the blacklist file, and then run the training again. I've used this and gotten pretty good results. If you find problems with it, put in a ticket at OpenNLP.
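Once you have a model, you can load it for recognition with the standard OpenNLP name finder API. A minimal sketch (the file path and the names in the sample sentence are placeholder examples):

import java.io.FileInputStream;
import java.io.InputStream;

import opennlp.tools.namefind.NameFinderME;
import opennlp.tools.namefind.TokenNameFinderModel;
import opennlp.tools.util.Span;

public class UseGeneratedModel {

    public static void main(String[] args) throws Exception {
        try (InputStream modelIn = new FileInputStream("path to the model saved above")) {
            TokenNameFinderModel model = new TokenNameFinderModel(modelIn);
            NameFinderME finder = new NameFinderME(model);
            // One pre-tokenized sentence; the names are made-up examples
            String[] tokens = {"Rahul", "Sharma", "met", "Aiko", "Tanaka", "in", "Mumbai", "."};
            Span[] names = finder.find(tokens);
            for (Span name : names) {
                System.out.println(name); // token span tagged as "person"
            }
            finder.clearAdaptiveData(); // reset document-level context between documents
        }
    }
}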
I need to create docx files based on templates. The template should contain placeholders, and I should be able to fill the placeholders from Java. Is it possible to do this? If so, please suggest a good and efficient way to do it.
A little late for the original question, but if anyone else needs to dynamically create docx documents from templates, you might want to have a look at the DocxStamper Java library which I created on top of docx4j.
It allows you to use the Spring Expression Language (SpEL) in docx templates, and you can create a document from a template with a couple of lines like this:
MyData data = ...;          // your own POJO containing the data
InputStream template = ...; // InputStream to the template file
OutputStream out = ...;     // OutputStream to the resulting document

DocxStamper stamper = new DocxStamperConfiguration().build();
stamper.stamp(template, data, out);
out.close();
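In the template document itself you put SpEL expressions that are evaluated against the data object. A minimal sketch, assuming the template text contains a placeholder like ${name}:

// Hypothetical POJO backing a template that contains the text ${name};
// docx-stamper evaluates the expression with this object as the SpEL root,
// so ${name} resolves through getName().
public class MyData {

    private final String name;

    public MyData(String name) {
        this.name = name;
    }

    public String getName() {
        return name;
    }
}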
As discussed elsewhere before, there are 3 basic approaches:
BEST: content control data binding
cheap/cheerful: Variable replacement (i.e. magic strings on the document surface), but brittle (the split run problem)
LEGACY: MERGEFIELD with or without other field codes.
Docx4j supports all three approaches, but we generally recommend content control databinding, since it aligns with Microsoft's direction (as best can be ascertained), and is most powerful.
You'll want to consider the technical skills of your template authors.
See https://github.com/centic9/poi-mail-merge for a simple "Variable replacement" method. It does not work if one replacement-string has multiple formats applied, but does work well for simple cases where the template is carefully crafted.
Basically it reads the template and data from CSV or an Excel file and then merges it into multiple result files, one for each line of data.
It works on the DOCX XML content directly, so it does not make full use of Apache POI's XWPF support. This way, formatting and other things from the template are used as expected, without needing full support for everything in Apache POI (whose DOCX support is still part of the "scratchpad" component, as it is not considered fully done yet).
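For context, the "split run problem" mentioned above is easy to see in a naive variable-replacement sketch using plain Apache POI XWPF (my own illustration, not code from poi-mail-merge; the file names and the ${name} placeholder are assumptions). It only works when each placeholder sits inside a single run:

import java.io.FileInputStream;
import java.io.FileOutputStream;

import org.apache.poi.xwpf.usermodel.XWPFDocument;
import org.apache.poi.xwpf.usermodel.XWPFParagraph;
import org.apache.poi.xwpf.usermodel.XWPFRun;

public class NaiveReplace {

    public static void main(String[] args) throws Exception {
        try (XWPFDocument doc = new XWPFDocument(new FileInputStream("template.docx"))) {
            for (XWPFParagraph paragraph : doc.getParagraphs()) {
                for (XWPFRun run : paragraph.getRuns()) {
                    String text = run.getText(0);
                    // Fails if Word has split "${name}" across several runs,
                    // e.g. because of formatting changes or spell-check markers.
                    if (text != null && text.contains("${name}")) {
                        run.setText(text.replace("${name}", "John Doe"), 0);
                    }
                }
            }
            try (FileOutputStream out = new FileOutputStream("result.docx")) {
                doc.write(out);
            }
        }
    }
}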
You can use a Word template with the following LINQ Reporting syntax to achieve your requirements using Aspose.Words for Java.
<< tag_name [expression] -switch1 -switch2 ...>>
A tag body typically consists of the following elements:
A tag name
An expression surrounded by brackets
A set of switches available for the tag, each of which is preceded by the "-" character
Assume that you have the Sender class defined in your application as follows:
public class Sender {

    public Sender(String name, String message) {
        _name = name;
        _message = message;
    }

    public String getName() {
        return _name;
    }

    public String getMessage() {
        return _message;
    }

    private String _name;
    private String _message;
}
To produce a report containing a message from a concrete sender on their behalf, you can use a template document with the following content.
<<[s.getName()]>> says: "<<[s.getMessage()]>>."
To build a report from the template, you can use the following source code.
Document doc = new Document(getMyDir() + "temp_HelloWorld.docx");
Sender sender = new Sender("LINQ Reporting Engine", "Hello World");
ReportingEngine engine = new ReportingEngine();
engine.buildReport(doc, sender, "s");
doc.save(getMyDir() + "out.docx");
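Given the template and data above, the generated out.docx should then contain: LINQ Reporting Engine says: "Hello World".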
I work with Aspose as a Developer Evangelist.
In the Cucumber framework, is there a way I can get the currently executing feature file's name, or even better its folder path, in the step definition file?
My project is implemented in Java and I'm using IntelliJ IDEA. I've already tried using a @Before hook, which lets me fetch the Scenario instance, but I can't find a way to get the feature file info.
The only solution I could come up with was to mention the feature file's name in the feature file's title, and then in the @Before hook get this title using scenario.getId().split(";")[0].
I then parse this title to fetch the feature file name and store it in a variable, which I can later use in the @After hook to pass to a custom formatter that parses my feature file and saves its data in the database.
Long story short: you are not supposed to. Ask yourself what you are really trying to achieve, and why you need the path in the first place. Is it because of some external file? Do you really need an external file, or can the content be sufficiently represented in your feature files? If you really need an external file, why not have it as a resource? And so on.
Not supposed to?
The reason you would want it is for traceability and explainability.
Very helpful for debugging, too.
Especially when you have more than 20 Gherkin files (with up to 200 steps) and more than 20 step definition files.
I put one of these at the top of every Java step definition file:
@Before
public void printScenarioName(Scenario scenario) {
    this.scenario = scenario;
    this.featureName = CukeUtils.getFeatureName(scenario);
    String result = "@Before:\n*************Setting Feature: " + this.featureName +
            "\n*************Setting Scenario: " + this.scenario.getName();
    log.info(result);
}
where in CukeUtils I have defined:
public static String getFeatureName(Scenario scenario) {
    String featureName = "";
    System.out.println("scenario.getId(): " + scenario.getId());
    // Usually the scenario Id is a doctored version of the lines following
    // the Feature: and the Scenario: keywords.
    // E.g.: scenario.getId(): a-long-(20-minute)-non-invasive-smoke-test-that-
    // confirms-that-i-can-login-to-area51-via-the-nasa-portal;as-a-superuser-i-
    // must-be-able-to-login-to-area51-via-the-nasa-portal-so-that-i-can-access-
    // all-the-secret-files
    String rawFeatureName = scenario.getId().split(";")[0]
            .replace("-i-", "-I-").replace("-", " ");
    featureName = featureName + rawFeatureName.substring(0, 1).toUpperCase() +
            rawFeatureName.substring(1).replace("nasa", "NASA");
    return featureName;
}
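Note: in more recent Cucumber-JVM versions, the Scenario object exposes the feature file's location directly, so the id parsing above may be unnecessary. A minimal sketch (whether getUri() is available depends on your Cucumber version):

import io.cucumber.java.Before;
import io.cucumber.java.Scenario;

public class Hooks {

    @Before
    public void captureFeaturePath(Scenario scenario) {
        // e.g. "classpath:features/login.feature" or a file: URI
        String featurePath = scenario.getUri().toString();
        System.out.println("Running feature: " + featurePath);
    }
}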
I'd like a (platform-independent) way to list all classes in a package.
A possible way would be to get a list of all classes known to Haxe, then filter through it.
I made a macro to help with just this. It's in the compiletime haxelib.
haxelib install compiletime
And then add it to your build (hxml file):
-lib compiletime
And then use it to import your classes:
// All classes in this package, including sub packages.
var testcases = CompileTime.getAllClasses("my.package");
// All classes in this package, not including sub packages.
var testcases = CompileTime.getAllClasses("my.package",false);
// All classes that extend (or implement) TestCase
var testcases = CompileTime.getAllClasses(TestCase);
// All classes in a package that extend TestCase
var testcases = CompileTime.getAllClasses("my.package",TestCase);
And then you can iterate over them:
for ( testClass in testcases ) {
var test = Type.createInstance( testClass, [] );
}
Please note: if you never have "import some.package.SomeClass" in your code, that class will never be included, because of dead code elimination. So if you want to make sure a class gets included, even if you never explicitly use it in your code, you can do something like this:
CompileTime.importPackage( "mygame.levels" );
CompileTime.getAllClasses( "mygame.levels", GameLevel );
How it works
CompileTime.getAllClasses is a macro, and what it does is:
Wait until compilation is finished, when all of the types/classes in the app are known.
Go through each one, and see if it is in the specified package.
See also if it extends (or implements) the specified class/interface.
If it does, add the class name to some special metadata - #classLists(...) metadata on the CompileTimeClassList file, containing the names of all the matching classes.
At runtime, use the metadata, together with Type.resolveClass(...), to create a list of all matching types.
This is one way to store and retrieve the information: https://gist.github.com/back2dos/c9410ed3ed476ecc1007
Beyond that, you could use haxe -xml to get the type information you want, transform it as needed (use the parser from haxe.rtti to handle the data), and embed the JSON-encoded result with -resource theinfo.json (accessed through haxe.Resource).
As a side note: chances are you'll be better off not having any automation and just adding the classes to an array manually. Imagine you have somepackage.ClassA, somepackage.ClassB, ...; then you can do
import somepackage.*;
//...
static var CLASSES:Array<Class<Dynamic>> = [ClassA, ClassB, ...];
This gives you more flexibility: you can always add third-party classes, which may not necessarily be in the same package, and you can also choose not to use a class without having to delete it.
I would like to compare two XSD schemas, A and B, to determine that all instance documents valid against schema A would also be valid against schema B. I hope to use this to prove that even though schemas A and B are "different", they are effectively the same. An example of a difference this should not trigger: Schema A uses named types while Schema B declares all of its elements inline.
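For instance, a (made-up) pair like the following is textually quite different, yet accepts exactly the same instance documents:

Schema A, using a named type:

<xs:element name="person" type="PersonType"/>
<xs:complexType name="PersonType">
  <xs:sequence>
    <xs:element name="name" type="xs:string"/>
  </xs:sequence>
</xs:complexType>

Schema B, declaring the type inline:

<xs:element name="person">
  <xs:complexType>
    <xs:sequence>
      <xs:element name="name" type="xs:string"/>
    </xs:sequence>
  </xs:complexType>
</xs:element>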
I have found lots of people talking about "smart" diff-type tools, but these would claim the two files are different because they have different text, even though the resulting structure is the same. I found some references to XSOM, but I'm not sure whether it will help.
Any thoughts on how to proceed?
Membrane SOA Model - Java API for WSDL and XML Schema
package sample.schema;

import java.util.List;

import com.predic8.schema.Schema;
import com.predic8.schema.SchemaParser;
import com.predic8.schema.diff.SchemaDiffGenerator;
import com.predic8.soamodel.Difference;

public class CompareSchema {

    public static void main(String[] args) {
        compare();
    }

    private static void compare() {
        SchemaParser parser = new SchemaParser();
        Schema schema1 = parser.parse("resources/diff/1/common.xsd");
        Schema schema2 = parser.parse("resources/diff/2/common.xsd");
        SchemaDiffGenerator diffGen = new SchemaDiffGenerator(schema1, schema2);
        List<Difference> lst = diffGen.compare();
        for (Difference diff : lst) {
            dumpDiff(diff, "");
        }
    }

    private static void dumpDiff(Difference diff, String level) {
        System.out.println(level + diff.getDescription());
        for (Difference localDiff : diff.getDiffs()) {
            dumpDiff(localDiff, level + "  ");
        }
    }
}
After executing it, you get the output shown below: a list of the differences between the two schema documents.
ComplexType PersonType has changed:
  Sequence has changed:
    Element id has changed:
      The type of element id has changed from xsd:string to tns:IdentifierType.
http://www.service-repository.com/ offers an online XML Schema Version Comparator tool that displays a report of the differences between two XSDs, which appears to be produced with the Membrane SOA Model.
My approach to this was to canonicalize the representation of the XML Schema.
Unfortunately, I can also tell you that, unlike canonicalization of XML documents (used, for example, to calculate a digital signature), it is neither simple nor standardized.
So basically, you have to transform both XML Schemas to a "canonical form" - whatever form the tool you build or use thinks that is - and then do the compare.
My approach was to create an XML Schema set (possibly more than one file, if you have multiple namespaces) for each root element I needed, since I found it easier to compare XSDs authored in the Russian Doll style, starting from the PSVI model.
I then used options such as automatically matching substitution group members, coupled with replacing substitution groups with a choice; removing "superfluous" XML Schema sequences; collapsing single-option choices; moving minOccurs/maxOccurs around for single-item compositors; etc.
Depending on the features of the XSD-aware comparison tool you use or build, you might also have to rearrange particles under compositors such as xsd:choice or xsd:all, etc.
Anyway, what I learned from all of it was that it is extremely hard to build a tool that works nicely with all the "cool" XSD features out there... One test case I remember fondly was dealing with various xsd:any content.
I do wonder though if things have changed since...
When I enabled code contracts on my WPF control project I ran into a problem with an auto generated file which was created at compile time (XamlNamespace.GeneratedInternalTypeHelper). Note, the generated file is called GeneratedInternalTypeHelper.g.cs and is not the same as the GeneratedInternalTypeHelper.g.i.cs which there are several obsolete blog posts about.
I'm not exactly sure what its purpose is, but I am assuming it is important for some internal reflection to resolve XAML. The problem is that it does not have code contracts, nor is the code contract system smart enough to recognize it as an auto generated file. This leads to a bunch of errors from the static checker.
I tried searching for a solution to this problem, but it seems like nobody is developing WPF controls and using code contracts. I did come across an interesting attribute, ContractVerificationAttribute, which takes a boolean value to set whether the assembly or class is to be verified. This allows you to decorate a class as not verified. Sadly, the GeneratedInternalTypeHelper is regenerated with every compile, so it is not possible to exclude just this one class. The inverse scenario is possible, though: decorate the assembly as not verified and then opt in for every class, as shown below.
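A minimal sketch of that opt-out/opt-in arrangement (the namespace and type names are placeholders):

using System.Diagnostics.Contracts;

// Opt the whole assembly out of static verification, so the regenerated
// GeneratedInternalTypeHelper is never checked...
[assembly: ContractVerification(false)]

namespace MyCompany.Controls
{
    // ...then opt back in explicitly for each class you do control.
    [ContractVerification(true)]
    public class MyControl
    {
        public string Format(string input)
        {
            Contract.Requires(input != null);
            return input.Trim();
        }
    }
}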
To mitigate this obvious hack, I wanted to create a test that would at least verify that the exposed classes have contract verification enabled, to ensure my own classes were being verified:
[Fact]
public void AllAssemblyTypesAreDecoratedWithContractVerificationTrue()
{
    var assembly = typeof(someType).Assembly;
    var exposedTypes = assembly.GetTypes()
        .Where(t => !string.IsNullOrWhiteSpace(t.Namespace)
                    && t.Namespace.StartsWith("MyNamespace")
                    && !t.Name.StartsWith("<>"));
    var areAnyNotContractVerified = exposedTypes.Any(t =>
    {
        var verificationAttribute = t
            .GetCustomAttributes(typeof(ContractVerificationAttribute), true)
            .OfType<ContractVerificationAttribute>();
        // A type counts as "not verified" when the attribute is missing
        // or its Value is false.
        return !verificationAttribute.Any() || !verificationAttribute.First().Value;
    });
    Assert.False(areAnyNotContractVerified);
}
As you can see, it takes all classes in the controls assembly and finds the ones from the company namespace which are not auto-generated anonymous types (<>WeirdClassName).
(I also need to exclude Resources and settings, but I hope you get the idea).
I'm not loving the solution since there are ways of avoiding contract verification, but currently it's the best I can come up with. If anyone has a better solution, please let me know.
So you can treat this class exactly like you would treat any other "3rd party" class or library. I'm sure certain assumptions would hold when interacting with this generated class, so at the interaction points, decorate your own code with Contract.Assume(result != null) or similar.
var result = new GennedClass().GetSomeValue();
Contract.Assume(result != null);
What this does is translate into an assertion that is checked at run time, but it allows the static analyzer to reason about the rest of the code that you do control.