Using UIMA and Stanford CoreNLP together

UIMA and Stanford CoreNLP both produce their output at the end of a pipeline of operations: for example, if we want to do POS tagging, the input text is first tokenized and then the POS tagger is run.
I want to use UIMA's tokenization and feed those tokens into the POS tagger of Stanford CoreNLP. But the Stanford CoreNLP POS tagger requires its own tokenizer to run before it.
So, is it possible to use the different APIs in the same pipeline or not?
Is it possible to use the UIMA tokenizer and Stanford CoreNLP together?

The typical approach to combining analysis steps from different tool chains (e.g. OpenNLP, Stanford CoreNLP, etc.) in UIMA is to wrap each of them as a UIMA analysis engine. The analysis engine serves as an adapter between the UIMA data structure (the CAS) and the data structures used by the individual tools (e.g. the OpenNLP POS tagger or the CoreNLP parser). At the level of UIMA, these components can then be combined into pipelines.
There are various collections of UIMA components that wrap such tool chains, e.g. ClearTK, DKPro Core, or U-Compare.
The following example combines the OpenNLP segmenter (tokenizer/sentence splitter) and the Stanford CoreNLP parser (which also creates the POS tags in this example). The example is implemented as a Groovy script that uses the uimaFIT API to create and run a pipeline from components of the DKPro Core collection.
#!/usr/bin/env groovy
@Grab(group='de.tudarmstadt.ukp.dkpro.core',
      module='de.tudarmstadt.ukp.dkpro.core.opennlp-asl',
      version='1.5.0')
@Grab(group='de.tudarmstadt.ukp.dkpro.core',
      module='de.tudarmstadt.ukp.dkpro.core.stanfordnlp-gpl',
      version='1.5.0')
import static org.apache.uima.fit.pipeline.SimplePipeline.*;
import static org.apache.uima.fit.util.JCasUtil.*;
import static org.apache.uima.fit.factory.AnalysisEngineFactory.*;
import org.apache.uima.fit.factory.JCasFactory;
import de.tudarmstadt.ukp.dkpro.core.opennlp.*;
import de.tudarmstadt.ukp.dkpro.core.stanfordnlp.*;
import de.tudarmstadt.ukp.dkpro.core.api.segmentation.type.*;
import de.tudarmstadt.ukp.dkpro.core.api.syntax.type.*;
def jcas = JCasFactory.createJCas();
jcas.documentText = "This is a test";
jcas.documentLanguage = "en";
runPipeline(jcas,
    createEngineDescription(OpenNlpSegmenter),
    createEngineDescription(StanfordParser,
        StanfordParser.PARAM_WRITE_PENN_TREE, true));
select(jcas, Token).each { println "${it.coveredText} ${it.pos.posValue}" }
select(jcas, PennTree).each { println it.pennTree }
Its output (after a lot of logging output) should look like this:
This DT
is VBZ
a DT
test NN
(ROOT
  (S
    (NP (DT This))
    (VP (VBZ is)
      (NP (DT a) (NN test)))))
I gave the Groovy script as an example because it works out of the box. A Java program would look quite similar, but would typically use e.g. Maven or Ivy to obtain the required libraries.
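For illustration, a rough Java equivalent of the script could look like the following sketch (same DKPro Core components and parameters as in the Groovy script above; the concrete Maven/Ivy dependency setup is assumed):
import static org.apache.uima.fit.factory.AnalysisEngineFactory.createEngineDescription;
import static org.apache.uima.fit.pipeline.SimplePipeline.runPipeline;
import static org.apache.uima.fit.util.JCasUtil.select;

import org.apache.uima.fit.factory.JCasFactory;
import org.apache.uima.jcas.JCas;

import de.tudarmstadt.ukp.dkpro.core.opennlp.OpenNlpSegmenter;
import de.tudarmstadt.ukp.dkpro.core.stanfordnlp.StanfordParser;
import de.tudarmstadt.ukp.dkpro.core.api.segmentation.type.Token;
import de.tudarmstadt.ukp.dkpro.core.api.syntax.type.PennTree;

public class UimaStanfordPipeline {
    public static void main(String[] args) throws Exception {
        JCas jcas = JCasFactory.createJCas();
        jcas.setDocumentText("This is a test");
        jcas.setDocumentLanguage("en");

        // OpenNLP does the segmentation, the Stanford parser adds POS tags and parse trees
        runPipeline(jcas,
                createEngineDescription(OpenNlpSegmenter.class),
                createEngineDescription(StanfordParser.class,
                        StanfordParser.PARAM_WRITE_PENN_TREE, true));

        for (Token token : select(jcas, Token.class)) {
            System.out.println(token.getCoveredText() + " " + token.getPos().getPosValue());
        }
        for (PennTree tree : select(jcas, PennTree.class)) {
            System.out.println(tree.getPennTree());
        }
    }
}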
In case you want to try the script and need more information on installing Groovy and on potential trouble-shooting, you can find more information here.
Disclosure: I am working on the DKPro Core and Apache UIMA uimaFIT projects.

There are at least two ways to handle this if you want to use CoreNLP as the pipeline.
1. Force CoreNLP to ignore the requirements.
Properties props = new Properties();
props.put("enforceRequirements", "false");
props.put("annotators", "pos");
This will get rid of the "missing requirements" error. However, the POSTaggerAnnotator in CoreNLP expects the tokens to be CoreLabel objects and the sentences to be CoreMap objects (instantiated as ArrayCoreMap), so you'll have to convert them, for example as sketched below.
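A minimal sketch of that conversion could look like this (the ExternalToken type and its accessors are placeholders for whatever your UIMA tokenizer produces):
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

import edu.stanford.nlp.ling.CoreAnnotations;
import edu.stanford.nlp.ling.CoreLabel;
import edu.stanford.nlp.pipeline.Annotation;
import edu.stanford.nlp.util.ArrayCoreMap;
import edu.stanford.nlp.util.CoreMap;

public class TokenConversion {

    // ExternalToken is a placeholder for your UIMA token annotation type
    public static Annotation toCoreNlpAnnotation(String documentText, List<ExternalToken> uimaTokens) {
        List<CoreLabel> coreLabels = new ArrayList<>();
        for (ExternalToken t : uimaTokens) {
            CoreLabel label = new CoreLabel();
            label.setWord(t.getCoveredText());     // token text
            label.setBeginPosition(t.getBegin());  // character offsets in the document
            label.setEndPosition(t.getEnd());
            coreLabels.add(label);
        }

        // wrap all tokens into a single "sentence" CoreMap and attach both to the Annotation
        CoreMap sentence = new ArrayCoreMap();
        sentence.set(CoreAnnotations.TokensAnnotation.class, coreLabels);

        Annotation annotation = new Annotation(documentText);
        annotation.set(CoreAnnotations.TokensAnnotation.class, coreLabels);
        annotation.set(CoreAnnotations.SentencesAnnotation.class, Collections.singletonList(sentence));
        return annotation;
    }
}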
2. Add custom annotators to the pipeline.
The CoreMaps/CoreLabels are maps with classes as keys so you'll need a class/key for your custom annotation:
public class CustomAnnotations {

    // this class will act as a key
    public static class UIMATokensAnnotation
            implements CoreAnnotation<List<CoreLabel>> {

        // getType() defines/restricts the type of the value associated with this key
        public Class<List<CoreLabel>> getType() {
            return ErasureUtils.<Class<List<CoreLabel>>> uncheckedCast(List.class);
        }
    }
}
You will also need an annotator class:
public class UIMATokensAnnotator implements Annotator {

    // this constructor signature is expected by StanfordCoreNLP.class
    public UIMATokensAnnotator(String name, Properties props) {
        // initialize whatever you need
    }

    @Override
    public void annotate(Annotation annotation) {
        // run the UIMA tokenization and convert its output to CoreLabels
        List<CoreLabel> tokens = ...;
        annotation.set(CustomAnnotations.UIMATokensAnnotation.class, tokens);
    }

    @Override
    public Set<Requirement> requirementsSatisfied() {
        return Collections.singleton(TOKENIZE_REQUIREMENT);
    }

    @Override
    public Set<Requirement> requires() {
        return Collections.emptySet();
    }
}
finally:
props.put("customAnnotatorClass.UIMAtokenize", "UIMATokensAnnotator")
props.put("annotators", "UIMAtokenize, ssplit, pos")
The UIMA/OpenNLP/etc. sentence annotation can be added as a custom annotator in a similar fashion.
Check out http://nlp.stanford.edu/software/corenlp-faq.shtml#custom for a condensed version of option #2.

Related

Spring Integration support for the Normalizer EIP

Spring Integration here. I was expecting to see a normalize(...) method on the IntegrationFlow DSL and was surprised to find there wasn't one (like .route(...) or .aggregate(...), etc.).
In fact, after some digging on Google and in the Spring Integration docs, I can't seem to find any built-in support for the Normalizer EIP. So I've taken a crack at my own:
public class Normalizer extends AbstractTransformer {

    private Class<?> targetClass;
    private GenericConverter genericConverter;

    public Normalizer(Class<?> targetClass, GenericConverter genericConverter) {
        Optional<GenericConverter.ConvertiblePair> maybePair = genericConverter.getConvertibleTypes().stream()
                .filter(convertiblePair -> !convertiblePair.getTargetType().equals(targetClass))
                .findAny();
        assert(maybePair.isEmpty());
        this.targetClass = targetClass;
        this.genericConverter = genericConverter;
    }

    @Override
    protected Object doTransform(Message<?> message) {
        Object inbound = message.getPayload();
        return genericConverter.convert(inbound, TypeDescriptor.forObject(inbound), TypeDescriptor.valueOf(targetClass));
    }
}
The idea is that Spring already provides the GenericConverter SPI for converting multiple source types to one or more target types. We just need a specialized flavor of it that has the same target type for all convertible pairings. So here we extend AbstractTransformer and pass it one of these GenericConverters to use. During initialization we just verify that all the possible convertible pairs convert to the same targetClass specified for the Normalizer.
The idea is I could instantiate it like so:
@Bean
public Normalizer fizzNormalizer(GenericConverter fizzConverter) {
    return new Normalizer(Fizz.class, fizzConverter);
}
And then put it in a flow:
IntegrationFlow someFlow = IntegrationFlows.from(someChannel())
        .transform(fizzNormalizer())
        // other components
        .get();
While I believe this will work, before I start using it too heavily I want to make sure I'm not overlooking anything in the Spring Integration framework that will accomplish/satisfy the Normalizer EIP for me. No point in trying to reinvent the wheel and all that jazz. Thanks for any insight.
If you take a closer look into that EI pattern, then you see:
Use a Normalizer to route each message type through a custom Message Translator so that the resulting messages match a common format.
The crucial part of this pattern is that it is a composed one, with a router as the input endpoint and a set of transformers, one for each inbound message type.
Since this kind of component is data-model dependent, and moreover the routing and transforming logic may differ from use case to use case, it is really hard to provide an out-of-the-box, single configurable component.
Therefore you need to investigate what type of routing you need in order to choose a proper one for the input: https://docs.spring.io/spring-integration/docs/current/reference/html/message-routing.html#router
Then for every routed type you need to implement a respective transformer to produce the canonical data model.
All of this can then be wrapped behind a @MessagingGateway API to hide the normalizer behind the so-called pattern implementation.
That's what I would do to follow that EI pattern recommendations.
However, if your use case is as simple as converting from one type to another, then yes, you can rely on the ConversionService. You register your custom Converter: https://docs.spring.io/spring-integration/docs/current/reference/html/endpoint.html#payload-type-conversion. And then just use the .convert(Class) API from IntegrationFlowDefinition.
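For that simple case, a minimal sketch could look like this (reusing the Fizz type from the question; the Buzz source type, its getValue() accessor, and the channel name are made up for illustration):
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.core.convert.converter.Converter;
import org.springframework.integration.config.IntegrationConverter;
import org.springframework.integration.dsl.IntegrationFlow;
import org.springframework.integration.dsl.IntegrationFlows;

@Configuration
public class NormalizeConfig {

    // register a converter with Spring Integration's conversion service;
    // Buzz and its getValue() accessor are placeholder source types for this sketch
    @Bean
    @IntegrationConverter
    public Converter<Buzz, Fizz> buzzToFizzConverter() {
        return buzz -> new Fizz(buzz.getValue());
    }

    @Bean
    public IntegrationFlow normalizeFlow() {
        return IntegrationFlows.from("inputChannel")
                .convert(Fizz.class) // every inbound payload is converted to the canonical Fizz type
                .handle(message -> System.out.println(message.getPayload()))
                .get();
    }
}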
But again: since there is no easy way to cover all possible domain use cases, we cannot provide an out-of-the-box Normalizer implementation.

ArchUnit: how to test for imports of specific classes outside of current package?

To externalize UI strings we use the "Messages-class" approach as supported e.g. in Eclipse and other IDEs. This approach requires that each package in which UI strings are needed contains a class "Messages" offering a static method String getString(key) via which one obtains the actual string to display to the user. The strings are internally fetched using Java's resources mechanism for i18n.
Especially after some refactoring, we again and again end up with accidental imports of a Messages class from a different package.
Thus I would like to create an ArchUnit rule that checks whether we only access classes called "Messages" from the very same package, i.e. each import of a class x.y.z.Messages is an error if the package x.y.z is not the package of the current class (i.e. the class that contains the import).
I got as far as this:
@ArchTest
void preventReferencesToMessagesOutsideCurrentPackage(JavaClasses classes) {
    ArchRule rule;
    rule = ArchRuleDefinition.noClasses()
            .should().accessClassesThat().haveNameMatching("Messages")
            .???
            ;
    rule.check(classes);
}
but now I got stuck at the ???.
How can one phrase a condition "and the referenced/imported class "Messages" is not in the same package as this class"?
I somehow got lost with all these ArchUnit methods, none of which seems to fit here or lend itself to composing said condition. Probably I just can't see the forest for all the trees.
Any suggestions or guidance, anyone?
You need to operate on instances of JavaAccess to validate the dependencies. JavaAccess provides information about the caller and the target such that you can validate the access dynamically depending on the package name of both classes.
DescribedPredicate<JavaAccess<?>> isForeignMessageClassPredicate =
new DescribedPredicate<JavaAccess<?>>("target is a foreign message class") {
#Override
public boolean apply(JavaAccess<?> access) {
JavaClass targetClass = access.getTarget().getOwner();
if ("Message".equals(targetClass.getSimpleName())) {
JavaClass callerClass = access.getOwner().getOwner();
return !targetClass.getPackageName().equals(callerClass.getPackageName());
}
return false;
}
};
ArchRule rule =
noClasses().should().accessTargetWhere(isForeignMessageClassPredicate);
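Plugged into the test method from the question, this could look roughly like the following sketch (the .because(...) reason text is just an example):
@ArchTest
void preventReferencesToMessagesOutsideCurrentPackage(JavaClasses classes) {
    ArchRule rule = ArchRuleDefinition.noClasses()
            .should().accessTargetWhere(isForeignMessageClassPredicate)
            .because("Messages classes must only be accessed from their own package");
    rule.check(classes);
}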

Groovy star import and usage of "partial" packages

To my surprise I have learned today that the following works just fine in Groovy:
import java.util.concurrent.*
def atomicBool = new atomic.AtomicBoolean(true)
i.e. after the star import, I can use a 'partial' package to refer to java.util.concurrent.atomic.AtomicBoolean.
Obviously, the same is not valid in Java:
import java.util.concurrent.*;
public class Sample {
public static void main(String[] args) {
// compiler error: 'atomic' cannot be resolved to a type
new atomic.AtomicBoolean(true);
}
}
So it seems that Groovy's idea of a package is similar to C++ (or C#) namespaces in this regard.
Question to the Groovy experts:
Is this by design or is it a (potentially unintended) side effect of the way the interpreter treats star imports?
If it is by design, can you point me to a section in the documentation or language specification where this behavior is documented? (There is no mention of this in the documentation on Star Import and neither in the language spec as far as I can tell or at least I couldn't find anything.)
Based on the Groovy source code, this behavior seems to be intended. Before we dig deeper into Groovy internals, you have to be aware of one thing - Groovy compiles to bytecode that can be represented by valid Java code. It means that Groovy code like the one from your example actually compiles to something like this (without static compilation and type-checking transformations):
import groovy.lang.Binding;
import groovy.lang.Script;
import java.util.concurrent.atomic.AtomicBoolean;
import org.codehaus.groovy.runtime.InvokerHelper;
import org.codehaus.groovy.runtime.ScriptBytecodeAdapter;
import org.codehaus.groovy.runtime.callsite.CallSite;
public class test extends Script {
    public test() {
        CallSite[] var1 = $getCallSiteArray();
    }

    public test(Binding context) {
        CallSite[] var2 = $getCallSiteArray();
        super(context);
    }

    public static void main(String... args) {
        CallSite[] var1 = $getCallSiteArray();
        var1[0].call(InvokerHelper.class, test.class, args);
    }

    public Object run() {
        CallSite[] var1 = $getCallSiteArray();
        AtomicBoolean atomicBool = (AtomicBoolean)ScriptBytecodeAdapter.castToType(var1[1].callConstructor(AtomicBoolean.class, true), AtomicBoolean.class);
        return var1[2].callCurrent(this, atomicBool);
    }
}
As you can see, this Java class uses the full java.util.concurrent.atomic.AtomicBoolean import, and this is what Groovy actually transforms your input source code into.
How does it happen?
As you may know, Groovy builds an Abstract Syntax Tree (AST) from the input source file, iterates over all nodes (like expressions, variable definitions, method calls etc.) and applies transformations. Groovy uses a class called ResolveVisitor that is designed to resolve types. When Groovy compiles your code it finds the ConstructorCallExpression:
new atomic.AtomicBoolean(true)
It sees that the expected type of the object you are trying to create is atomic.AtomicBoolean, so ResolveVisitor starts resolving the type by calling resolveOrFail(type, cce); at line 1131.
It tries several resolving strategies that fail until it reaches the method resolveFromModule at line 695. What happens here is that it iterates over all star imports (the single java.util.concurrent.* in your case), concatenates the star import with the type name, and checks whether the qualified name created from this concatenation is a valid class. Luckily, in your case it is.
When the type gets resolved, Groovy replaces the initial type with this resolved, valid type name in the abstract syntax tree. After this operation your input code looks more like this:
import java.util.concurrent.*
java.util.concurrent.atomic.AtomicBoolean atomicBool = new java.util.concurrent.atomic.AtomicBoolean(true)
This is what the compiler eventually gets. Of course, the fully qualified name gets replaced with an import (this is what the Java compiler does with qualified names).
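In plain Java terms, the lookup described above works roughly like the following illustrative sketch (not Groovy's actual implementation):
import java.util.List;

class StarImportResolver {

    // For each star import ("java.util.concurrent." etc.), prepend the package prefix to the
    // partial name ("atomic.AtomicBoolean") and check whether the result is a loadable class.
    static Class<?> resolveAgainstStarImports(String partialName, List<String> starImportPackages) {
        for (String pkg : starImportPackages) {
            String candidate = pkg + partialName;   // e.g. "java.util.concurrent.atomic.AtomicBoolean"
            try {
                return Class.forName(candidate);    // resolved: the AST node is replaced with this type
            } catch (ClassNotFoundException e) {
                // not found under this star import, try the next one
            }
        }
        return null; // unresolved
    }

    public static void main(String[] args) {
        System.out.println(resolveAgainstStarImports("atomic.AtomicBoolean", List.of("java.util.concurrent.")));
        // prints: class java.util.concurrent.atomic.AtomicBoolean
    }
}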
Was this "feature" introduced by designed?
I can't tell you that. However we can read from the source code, that this happens on purpose and type resolving like this is implemented with intention.
Why it is not documented?
I guess no one actually recommend using imports that way. Groovy is very powerful and you are able to do many things in a many different ways, but it doesn't mean you should do so. Star imports are pretty controversial, because using star imports over explicit imports makes your code more error-prone due to possible class import conflicts. But if you want to know exact answer to this kind of questions you would have to ask Groovy language designers and core developers - they may give you straight answer without any doubts.

Large vocabulary speech recognition in sphinx4

As far as I know, sphinx4 requires a grammar to identify words. Is there any way to get input without using grammar rules - that is, words that are not in the grammar - somewhat like dictation, where it writes down whatever I say?
As far as I know, sphinx4 requires a grammar to identify words.
No, sphinx4 supports large vocabulary speech recognition.
Is there any way to get input without using grammar rules - that is, words that are not in the grammar - somewhat like dictation, where it writes down whatever I say? Or maybe an algorithm to check it?
You need to update to the sphinx4-5prealpha version.
You can check the transcriber demo for an example of a large vocabulary speech recognition setup.
The code should look like this:
package com.example;

import java.io.File;
import java.io.FileInputStream;
import java.io.InputStream;

import edu.cmu.sphinx.api.Configuration;
import edu.cmu.sphinx.api.SpeechResult;
import edu.cmu.sphinx.api.LiveSpeechRecognizer;

public class TranscriberDemo {

    public static void main(String[] args) throws Exception {
        Configuration configuration = new Configuration();
        configuration.setAcousticModelPath("resource:/edu/cmu/sphinx/models/en-us/en-us");
        configuration.setDictionaryPath("resource:/edu/cmu/sphinx/models/en-us/cmudict-en-us.dict");
        configuration.setLanguageModelPath("resource:/edu/cmu/sphinx/models/en-us/en-us.lm.bin");

        LiveSpeechRecognizer recognizer = new LiveSpeechRecognizer(configuration);
        recognizer.startRecognition(true);

        SpeechResult result;
        while ((result = recognizer.getResult()) != null) {
            System.out.format("Hypothesis: %s\n", result.getHypothesis());
        }
        recognizer.stopRecognition();
    }
}

How to compare 2 xsd schema files for equivalent functionality

I would like to compare two XSD schemas A and B to determine that all instance documents valid against schema A would also be valid against schema B. I hope to use this to prove that even though schemas A and B are "different" they are effectively the same. An example of a difference this should not flag would be schema A using named types while schema B declares all of its elements inline.
I have found lots of people talking about "smart" diff-type tools, but these would claim the two files are different because they have different text, even though the resulting structure is the same. I found some references to XSOM but I'm not sure if that will help or not.
Any thoughts on how to proceed?
Membrane SOA Model - Java API for WSDL and XML Schema
package sample.schema;

import java.util.List;
import com.predic8.schema.Schema;
import com.predic8.schema.SchemaParser;
import com.predic8.schema.diff.SchemaDiffGenerator;
import com.predic8.soamodel.Difference;

public class CompareSchema {

    public static void main(String[] args) {
        compare();
    }

    private static void compare() {
        SchemaParser parser = new SchemaParser();
        Schema schema1 = parser.parse("resources/diff/1/common.xsd");
        Schema schema2 = parser.parse("resources/diff/2/common.xsd");

        SchemaDiffGenerator diffGen = new SchemaDiffGenerator(schema1, schema2);
        List<Difference> lst = diffGen.compare();
        for (Difference diff : lst) {
            dumpDiff(diff, "");
        }
    }

    private static void dumpDiff(Difference diff, String level) {
        System.out.println(level + diff.getDescription());
        for (Difference localDiff : diff.getDiffs()) {
            dumpDiff(localDiff, level + "  ");
        }
    }
}
After executing it you get the output shown below. It is a list of differences between the two schema documents.
ComplexType PersonType has changed:
  Sequence has changed:
    Element id has changed:
      The type of element id has changed from xsd:string to tns:IdentifierType.
http://www.service-repository.com/ offers an online XML Schema Version Comparator tool that displays a report of the differences between two XSDs, which appears to be produced from the Membrane SOA Model.
My approach to this was to canonicalize the representation of the XML Schema.
Unfortunately, I can also tell you that, unlike canonicalization of XML documents (used, as an example, to calculate a digital signature), it is not that simple or even standardized.
So basically, you have to transform both XML Schemas to a "canonical form" - whatever the tool you build or use thinks that form is, and then do the compare.
My approach was to create an XML Schema set (could be more than one file if you have more namespaces) for each root element I needed, since I found it easier to compare XSDs authored using the Russian Doll style, starting from the PSVI model.
I then used options such as auto matching substitution group members coupled with replacement of substitution groups with a choice; removal of "superfluous" XML Schema sequences, collapsing of single option choices or moving minOccurs/maxOccurs around for single item compositors, etc.
Depending on the features of your XSD-aware comparison tool, or the one you settle on building, you might also have to rearrange particles under compositors such as xsd:choice or xsd:all, etc.
Anyway, what I learned after all of it is that it is extremely hard to build a tool that works nicely for all the "cool" XSD features out there... One test case I remember fondly was dealing with various xsd:any content.
I do wonder though if things have changed since...
