Large vocabulary speech recognition in sphinx4 - cmusphinx

As far as I know, sphinx4 requires a grammar to identify words. Is there any way to get the input without using grammar rules, i.e. to recognize words that are not in the grammar, somewhat like dictation where it writes down whatever I say?

As far as I know, sphinx4 requires a grammar to identify the words.
No, sphinx4 supports large vocabulary speech recognition.
Is there any way to get the input without using grammar rules, i.e. to recognize words that are not in the grammar, somewhat like dictation where it writes down whatever I say? Or maybe an algorithm to check it?
You need to update to the sphinx4-5prealpha version.
You can check the transcriber demo for an example of a large vocabulary speech recognition setup.
The code should look like this:
package com.example;

import edu.cmu.sphinx.api.Configuration;
import edu.cmu.sphinx.api.LiveSpeechRecognizer;
import edu.cmu.sphinx.api.SpeechResult;

public class TranscriberDemo {

    public static void main(String[] args) throws Exception {
        Configuration configuration = new Configuration();

        // Generic US English acoustic model, dictionary and language model shipped with sphinx4-data.
        configuration.setAcousticModelPath("resource:/edu/cmu/sphinx/models/en-us/en-us");
        configuration.setDictionaryPath("resource:/edu/cmu/sphinx/models/en-us/cmudict-en-us.dict");
        configuration.setLanguageModelPath("resource:/edu/cmu/sphinx/models/en-us/en-us.lm.bin");

        // Recognize continuously from the microphone; true clears previously cached data.
        LiveSpeechRecognizer recognizer = new LiveSpeechRecognizer(configuration);
        recognizer.startRecognition(true);

        SpeechResult result;
        while ((result = recognizer.getResult()) != null) {
            System.out.format("Hypothesis: %s%n", result.getHypothesis());
        }
        recognizer.stopRecognition();
    }
}
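If you want to transcribe a pre-recorded audio file instead of the microphone, the transcriber demo uses StreamSpeechRecognizer with the same configuration. A minimal sketch, where speech.wav is a placeholder for your own recording (the default model expects 16 kHz, 16-bit mono PCM audio):

package com.example;

import java.io.FileInputStream;
import java.io.InputStream;

import edu.cmu.sphinx.api.Configuration;
import edu.cmu.sphinx.api.SpeechResult;
import edu.cmu.sphinx.api.StreamSpeechRecognizer;

public class FileTranscriberDemo {

    public static void main(String[] args) throws Exception {
        Configuration configuration = new Configuration();
        configuration.setAcousticModelPath("resource:/edu/cmu/sphinx/models/en-us/en-us");
        configuration.setDictionaryPath("resource:/edu/cmu/sphinx/models/en-us/cmudict-en-us.dict");
        configuration.setLanguageModelPath("resource:/edu/cmu/sphinx/models/en-us/en-us.lm.bin");

        // Decode the whole file and print each recognized utterance.
        StreamSpeechRecognizer recognizer = new StreamSpeechRecognizer(configuration);
        try (InputStream stream = new FileInputStream("speech.wav")) {
            recognizer.startRecognition(stream);
            SpeechResult result;
            while ((result = recognizer.getResult()) != null) {
                System.out.format("Hypothesis: %s%n", result.getHypothesis());
            }
            recognizer.stopRecognition();
        }
    }
}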

Related

Tenant Not available exception despite the use of the workaround

I have implemented GetAllBusinessPartnersCommand and also customized the code in the BusinessPartnerServlet. When I try to call the application with the customized code, I always get this error.
Code of GetAllBusinessPartnersCommand:
package com.sap.cloud.s4hana.examples.commands;

import java.util.Collections;
import java.util.List;
import org.slf4j.Logger;
import com.sap.cloud.s4hana.examples.BusinessPartnerServlet;
import com.sap.cloud.sdk.cloudplatform.logging.CloudLoggerFactory;
import com.sap.cloud.sdk.frameworks.hystrix.HystrixUtil;
import com.sap.cloud.sdk.s4hana.connectivity.ErpCommand;
import com.sap.cloud.sdk.s4hana.datamodel.odata.helper.Order;
import com.sap.cloud.sdk.s4hana.datamodel.odata.namespaces.businesspartner.BusinessPartner;
import com.sap.cloud.sdk.s4hana.datamodel.odata.services.DefaultBusinessPartnerService;

public class GetAllBusinessPartnersCommand extends ErpCommand<List<BusinessPartner>> {

    private static final Logger logger = CloudLoggerFactory.getLogger(BusinessPartnerServlet.class);

    public static final String CATEGORY_PERSON = "1";

    public GetAllBusinessPartnersCommand() {
        super(HystrixUtil.getDefaultErpCommandSetter(
                GetAllBusinessPartnersCommand.class,
                HystrixUtil.getDefaultErpCommandProperties().withExecutionTimeoutInMilliseconds(10000)));
    }

    @Override
    protected List<BusinessPartner> run() throws Exception {
        return new DefaultBusinessPartnerService().getAllBusinessPartner()
                .select(BusinessPartner.BUSINESS_PARTNER,
                        BusinessPartner.LAST_NAME,
                        BusinessPartner.FIRST_NAME,
                        BusinessPartner.IS_MALE,
                        BusinessPartner.IS_FEMALE,
                        BusinessPartner.CREATION_DATE)
                .filter(BusinessPartner.BUSINESS_PARTNER_CATEGORY.eq(CATEGORY_PERSON))
                .orderBy(BusinessPartner.LAST_NAME, Order.ASC)
                .execute();
    }

    @Override
    protected List<BusinessPartner> getFallback() {
        logger.warn("Fallback called because of exception:",
                getExecutionException());
        return Collections.emptyList();
    }
}
In the following you can see the commands and the offered workaround for the problem, setting ALLOW_MOCKED_AUTH_HEADER=true. Before testing I checked that all variables are set correctly and set ALLOW_MOCKED_AUTH_HEADER=true again, because I had set it too early before.
After these steps I build my project like I always do and get the error from above when I call the service. I also read this post, where someone had the same problem and used the mentioned workaround, but it doesn't work for me and I have no clue why: TenantNotAvailableException, when trying to call business partner from s4 CF SDK
Screenshots: the error when calling the page, starting the mock server, and setting the variables plus the workaround check.
Unfortunately, SAP Cloud SDK 2.x is no longer supported. Please use our migration guide to move to the latest version, 3.43.0. You can find my blog post on this topic, too. If you face any issues, please create another Stack Overflow question (with the tag sap-cloud-sdk) or comment on the blog post. For more details, please see our documentation.

ArchUnit: how to test for imports of specific classes outside of current package?

To externalize UI strings we use the "Messages-class" approach as supported e.g. in Eclipse and other IDEs. This approach requires that each package in which UI strings are needed contains a class "Messages" that offers a static method String getString(key) via which one obtains the actual string to display to the user. The strings are internally accessed/fetched using Java's resources mechanism for i18n.
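For context, a minimal sketch of what such a per-package Messages class typically looks like (a plain ResourceBundle-based variant; the package and bundle names are illustrative):

package x.y.z; // every package that needs UI strings has its own copy of this class

import java.util.MissingResourceException;
import java.util.ResourceBundle;

public final class Messages {

    // loads x/y/z/messages.properties via Java's i18n resources mechanism
    private static final ResourceBundle BUNDLE = ResourceBundle.getBundle("x.y.z.messages");

    private Messages() {
    }

    public static String getString(String key) {
        try {
            return BUNDLE.getString(key);
        } catch (MissingResourceException e) {
            return '!' + key + '!';
        }
    }
}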
Especially after some refactoring, we again and again end up with accidental imports of a class Messages from a different package.
Thus I would like to create an ArchUnit rule that checks whether we only access classes called "Messages" from the very same package, i.e. each import of a class x.y.z.Messages is an error if the package x.y.z is not the same package as the current class (the class that contains the import).
I got as far as this:
@ArchTest
void preventReferencesToMessagesOutsideCurrentPackage(JavaClasses classes) {
    ArchRule rule;
    rule = ArchRuleDefinition.noClasses()
            .should().accessClassesThat().haveNameMatching("Messages")
            .???
            ;
    rule.check(classes);
}
but now I am stuck at the ???.
How can one phrase a condition "and the referenced/imported class "Messages" is not in the same package as this class"?
I somehow got lost among all these ArchUnit methods, none of which seems to fit here or lend itself to composing said condition. Probably I just can't see the forest for the trees.
Any suggestion or guidance, anyone?
You need to operate on instances of JavaAccess to validate the dependencies. JavaAccess provides information about the caller and the target, so you can validate each access dynamically depending on the package names of both classes.
DescribedPredicate<JavaAccess<?>> isForeignMessageClassPredicate =
        new DescribedPredicate<JavaAccess<?>>("target is a foreign Messages class") {
            @Override
            public boolean apply(JavaAccess<?> access) {
                JavaClass targetClass = access.getTarget().getOwner();
                if ("Messages".equals(targetClass.getSimpleName())) {
                    JavaClass callerClass = access.getOwner().getOwner();
                    return !targetClass.getPackageName().equals(callerClass.getPackageName());
                }
                return false;
            }
        };

ArchRule rule =
        noClasses().should().accessTargetWhere(isForeignMessageClassPredicate);
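Plugged into the test method from the question, this could look roughly as follows (assuming the predicate defined above is in scope, e.g. as a constant of the test class):

@ArchTest
void preventReferencesToMessagesOutsideCurrentPackage(JavaClasses classes) {
    ArchRule rule = ArchRuleDefinition.noClasses()
            .should().accessTargetWhere(isForeignMessageClassPredicate)
            .because("a Messages class must only be used from its own package");
    rule.check(classes);
}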

Groovy star import and usage of "partial" packages

To my surprise I have learned today that the following works just fine in Groovy:
import java.util.concurrent.*
def atomicBool = new atomic.AtomicBoolean(true)
i.e. after the star import, I can use a 'partial' package to refer to java.util.concurrent.atomic.AtomicBoolean.
Obviously, the same is not valid in Java:
import java.util.concurrent.*;

public class Sample {
    public static void main(String[] args) {
        // compiler error: 'atomic' cannot be resolved to a type
        new atomic.AtomicBoolean(true);
    }
}
So it seems that Groovy's idea of a package is similar to C++ (or C#) namespaces in this regard.
Question to the Groovy experts:
Is this by design or is it a (potentially unintended) side effect of the way the interpreter treats star imports?
If it is by design, can you point me to a section in the documentation or language specification where this behavior is documented? (There is no mention of this in the documentation on Star Import and neither in the language spec as far as I can tell or at least I couldn't find anything.)
Based on the Groovy source code, this behavior seems to be intended. Before we dig deeper into Groovy internals, you have to be aware of one thing - Groovy compiles to bytecode that can be represented by valid Java code. It means that Groovy code like the one from your example actually compiles to something like this (without the static compilation and type checking transformations):
import groovy.lang.Binding;
import groovy.lang.Script;
import java.util.concurrent.atomic.AtomicBoolean;
import org.codehaus.groovy.runtime.InvokerHelper;
import org.codehaus.groovy.runtime.ScriptBytecodeAdapter;
import org.codehaus.groovy.runtime.callsite.CallSite;

public class test extends Script {

    public test() {
        CallSite[] var1 = $getCallSiteArray();
    }

    public test(Binding context) {
        CallSite[] var2 = $getCallSiteArray();
        super(context);
    }

    public static void main(String... args) {
        CallSite[] var1 = $getCallSiteArray();
        var1[0].call(InvokerHelper.class, test.class, args);
    }

    public Object run() {
        CallSite[] var1 = $getCallSiteArray();
        AtomicBoolean atomicBool = (AtomicBoolean)ScriptBytecodeAdapter.castToType(var1[1].callConstructor(AtomicBoolean.class, true), AtomicBoolean.class);
        return var1[2].callCurrent(this, atomicBool);
    }
}
As you can see, this Java class uses the full java.util.concurrent.atomic.AtomicBoolean import, and this is actually what Groovy transforms your input source code into.
How does it happen?
As you may know, Groovy builds an Abstract Syntax Tree (AST) from the input source file, iterates over all its nodes (like expressions, variable definitions, method calls etc.) and applies transformations. Groovy uses a class called ResolveVisitor that is designed to resolve types. When Groovy compiles your code it finds the ConstructorCallExpression:
new atomic.AtomicBoolean(true)
It sees that the expected type of the object you are trying to create is atomic.AtomicBoolean, so ResolveVisitor starts resolving the type by calling resolveOrFail(type, cce); at line 1131.
It tries several resolving strategies that fail until it reaches the method resolveFromModule at line 695. What happens here is that it iterates over all star imports (a single java.util.concurrent.* in your case), concatenates each star import with the type name, and checks whether the qualified name created from this concatenation is a valid class. Luckily, in your case it is.
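A simplified illustration of that lookup idea (this is just a sketch of the concatenation check, not the actual ResolveVisitor code):

public class StarImportResolutionSketch {
    public static void main(String[] args) {
        // The partially qualified name used in the script and the packages of all star imports.
        String partialName = "atomic.AtomicBoolean";
        String[] starImports = { "java.util.concurrent." };

        for (String starImport : starImports) {
            // "java.util.concurrent." + "atomic.AtomicBoolean" -> "java.util.concurrent.atomic.AtomicBoolean"
            String candidate = starImport + partialName;
            try {
                Class<?> resolved = Class.forName(candidate);
                System.out.println("Resolved to " + resolved.getName());
                break;
            } catch (ClassNotFoundException e) {
                // not resolvable under this star import, try the next one
            }
        }
    }
}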
When the type gets resolved, Groovy replaces the initial type with the resolved valid type name in the abstract syntax tree. After this operation your input code looks more like this:
import java.util.concurrent.*
java.util.concurrent.atomic.AtomicBoolean atomicBool = new java.util.concurrent.atomic.AtomicBoolean(true)
This is what the compiler eventually gets. Of course, the fully qualified name gets replaced with an import (this is what the Java compiler does with qualified names).
Was this "feature" introduced by design?
I can't tell you that. However, we can read from the source code that this happens on purpose and that this kind of type resolving is implemented intentionally.
Why is it not documented?
I guess no one actually recommends using imports that way. Groovy is very powerful and you are able to do many things in many different ways, but it doesn't mean you should. Star imports are pretty controversial, because using star imports instead of explicit imports makes your code more error-prone due to possible class import conflicts. But if you want an exact answer to this kind of question, you would have to ask the Groovy language designers and core developers - they may give you a straight answer without any doubt.
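A classic Java illustration of such a conflict, where two star imports both provide a type called List:

import java.awt.*;   // contains java.awt.List
import java.util.*;  // contains java.util.List

public class StarImportConflict {
    public static void main(String[] args) {
        // Does not compile: "reference to List is ambiguous"
        // List<String> names = new ArrayList<>();

        // The ambiguity has to be resolved with a fully qualified name instead:
        java.util.List<String> names = new java.util.ArrayList<>();
        names.add("example");
        System.out.println(names);
    }
}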

Using UIMA, Stanford Core NLP together

Both UIMA and Stanford CoreNLP produce their output through a pipeline of operations: for example, if we want to do POS tagging, the input text is first tokenized and then the POS tagging is applied.
I want to use the tokenization from UIMA and feed those tokens into the POS tagger of Stanford CoreNLP. However, the Stanford CoreNLP POS tagger requires the tokenizer to be run before it.
So, is it possible to use different APIs in the same pipeline or not?
Is it possible to use the UIMA tokenizer and Stanford CoreNLP together?
The typical approach to combine analysis steps from different tool chains (e.g. OpenNLP, Stanford CoreNLP, etc.) in UIMA would be to wrap each of them as a UIMA analysis engine. The analysis engine serves as an adapter between the UIMA data structure (the CAS) and the data structures used by the individual tools (e.g. the OpenNLP POS tagger or the CoreNLP parser). At the level of UIMA, these components can then be combined into pipelines.
There are various collections of UIMA components that wrap such tool chains, e.g. ClearTK, DKPro Core, or U-Compare.
The following example combines the OpenNLP segmenter (tokenizer/sentence splitter) and the Stanford CoreNLP parser (which also creates the POS tags in the present example). The example is implemented as a Groovy script employing the uimaFIT API to create and run a pipeline using components from the DKPro Core collection.
#!/usr/bin/env groovy
@Grab(group='de.tudarmstadt.ukp.dkpro.core',
      module='de.tudarmstadt.ukp.dkpro.core.opennlp-asl',
      version='1.5.0')
@Grab(group='de.tudarmstadt.ukp.dkpro.core',
      module='de.tudarmstadt.ukp.dkpro.core.stanfordnlp-gpl',
      version='1.5.0')
import static org.apache.uima.fit.pipeline.SimplePipeline.*;
import static org.apache.uima.fit.util.JCasUtil.*;
import static org.apache.uima.fit.factory.AnalysisEngineFactory.*;
import org.apache.uima.fit.factory.JCasFactory;
import de.tudarmstadt.ukp.dkpro.core.opennlp.*;
import de.tudarmstadt.ukp.dkpro.core.stanfordnlp.*;
import de.tudarmstadt.ukp.dkpro.core.api.segmentation.type.*;
import de.tudarmstadt.ukp.dkpro.core.api.syntax.type.*;

def jcas = JCasFactory.createJCas();
jcas.documentText = "This is a test";
jcas.documentLanguage = "en";

runPipeline(jcas,
    createEngineDescription(OpenNlpSegmenter),
    createEngineDescription(StanfordParser,
        StanfordParser.PARAM_WRITE_PENN_TREE, true));

select(jcas, Token).each { println "${it.coveredText} ${it.pos.posValue}" }
select(jcas, PennTree).each { println it.pennTree }
Its output (after a lot of logging output) should look like this:
This DT
is VBZ
a DT
test NN
(ROOT
(S
(NP (DT This))
(VP (VBZ is)
(NP (DT a) (NN test)))))
I gave the Groovy script as an example because it works out of the box. A Java program would look quite similar, but would typically use e.g. Maven or Ivy to obtain the required libraries.
In case you want to try the script and need more information on installing Groovy and on potential trouble-shooting, you can find more information here.
Disclosure: I am working on the DKPro Core and Apache UIMA uimaFIT projects.
There are at least two ways to handle this if you want to use CoreNLP as the pipeline.
1. Force CoreNLP to ignore the requirements.
Properties props = new Properties();
props.put("enforceRequirements", "false");
props.put("annotators", "pos");
This will get rid of the "missing requirements" error. However, the POSTaggerAnnotator in CoreNLP expects the tokens to be CoreLabel objects and the sentences to be CoreMap objects (instantiated as ArrayCoreMap), so you'll have to convert them.
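A rough sketch of that conversion, assuming you already have the token texts and character offsets produced by the UIMA tokenizer (the helper class and method names here are illustrative, not part of either API):

import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

import edu.stanford.nlp.ling.CoreAnnotations;
import edu.stanford.nlp.ling.CoreLabel;
import edu.stanford.nlp.pipeline.Annotation;
import edu.stanford.nlp.util.ArrayCoreMap;
import edu.stanford.nlp.util.CoreMap;

public class UimaToCoreNlpConverter {

    // Builds a single-sentence CoreNLP Annotation from token texts and offsets
    // that came out of the UIMA tokenizer.
    public static Annotation convert(String documentText,
                                     List<String> tokenTexts,
                                     List<Integer> begins,
                                     List<Integer> ends) {
        List<CoreLabel> tokens = new ArrayList<>();
        for (int i = 0; i < tokenTexts.size(); i++) {
            CoreLabel token = new CoreLabel();
            token.setWord(tokenTexts.get(i));
            token.setOriginalText(tokenTexts.get(i));
            token.setBeginPosition(begins.get(i));
            token.setEndPosition(ends.get(i));
            tokens.add(token);
        }

        // one ArrayCoreMap per sentence; here everything goes into a single sentence
        CoreMap sentence = new ArrayCoreMap();
        sentence.set(CoreAnnotations.TextAnnotation.class, documentText);
        sentence.set(CoreAnnotations.TokensAnnotation.class, tokens);

        Annotation annotation = new Annotation(documentText);
        annotation.set(CoreAnnotations.TokensAnnotation.class, tokens);
        annotation.set(CoreAnnotations.SentencesAnnotation.class,
                Collections.<CoreMap>singletonList(sentence));
        return annotation;
    }
}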
2. Add custom annotators to the pipeline.
The CoreMaps/CoreLabels are maps with classes as keys so you'll need a class/key for your custom annotation:
public class CustomAnnotations {

    // this class will act as a key
    public static class UIMATokensAnnotation
            implements CoreAnnotation<List<CoreLabel>> {

        // getType() defines/restricts the type of the value associated with this key
        public Class<List<CoreLabel>> getType() {
            return ErasureUtils.<Class<List<CoreLabel>>> uncheckedCast(List.class);
        }
    }
}
You will also need an annotator class:
public class UIMATokensAnnotator implements Annotator {

    // this constructor signature is expected by StanfordCoreNLP.class
    public UIMATokensAnnotator(String name, Properties props) {
        // initialize whatever you need
    }

    @Override
    public void annotate(Annotation annotation) {
        List<CoreLabel> tokens = // run the UIMA tokenization and convert its output to CoreLabels
        annotation.set(CustomAnnotations.UIMATokensAnnotation.class, tokens);
    }

    @Override
    public Set<Requirement> requirementsSatisfied() {
        return Collections.singleton(TOKENIZE_REQUIREMENT);
    }

    @Override
    public Set<Requirement> requires() {
        return Collections.emptySet();
    }
}
Finally:
props.put("customAnnotatorClass.UIMAtokenize", "UIMATokensAnnotator");
props.put("annotators", "UIMAtokenize, ssplit, pos");
The UIMA/OpenNLP/etc. sentence annotation can be added as a custom annotator in a similar fashion.
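Putting it together, constructing and running the pipeline configured this way would look roughly like the following sketch (it assumes UIMATokensAnnotator is on the classpath; use its fully qualified class name if it lives in a package):

import java.util.Properties;

import edu.stanford.nlp.pipeline.Annotation;
import edu.stanford.nlp.pipeline.StanfordCoreNLP;

public class CustomAnnotatorPipeline {
    public static void main(String[] args) {
        Properties props = new Properties();
        // register the custom annotator under the name "UIMAtokenize" and use it
        // in place of the usual "tokenize" step
        props.put("customAnnotatorClass.UIMAtokenize", "UIMATokensAnnotator");
        props.put("annotators", "UIMAtokenize, ssplit, pos");

        StanfordCoreNLP pipeline = new StanfordCoreNLP(props);

        Annotation annotation = new Annotation("This is a test.");
        pipeline.annotate(annotation);
        // the POS tags can afterwards be read from the CoreLabels stored in the annotation
    }
}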
Check out http://nlp.stanford.edu/software/corenlp-faq.shtml#custom for a condensed version of option #2.

OpenNLP: foreign names do not get recognized

I just started using OpenNLP to recognize names. I am using the model (en-ner-person.bin) that comes with OpenNLP. I noticed that while it recognizes US, UK, and European names, it fails to recognize Indian or Japanese names. My questions are: (1) are there already models available that I can use to recognize foreign names? (2) If not, then I believe I will need to train new models. In that case, is there a corpus available that I can use?
You can build your own model from your own data using an OpenNLP addon called modelbuilder-addon. It is brand new, but it works for me; if you try it you may be the first one to do so other than me.
You feed it the following:
a list of "known entities" via a file where each line is a name
a list of sentences from YOUR data via file where each line is a sentence
(optionally) a blacklist to remove false positives
you can checkout the addon here
https://svn.apache.org/repos/asf/opennlp/addons/modelbuilder-addon
you can use this to get started
import java.io.File;
import opennlp.addons.modelbuilder.DefaultModelBuilderUtil;

public class ModelBuilderAddonUse {

    public static void main(String[] args) {
        File fileOfSentences = new File("path to your sentence file");
        File fileOfNames = new File("path to your file of person names");
        File blackListFile = new File("path to your blacklist file");
        File modelOutFile = new File("path to where the model will be saved");
        File annotatedSentencesOutFile = new File("path to where the annotated sentences will be saved");
        DefaultModelBuilderUtil.generateModel(fileOfSentences, fileOfNames, blackListFile,
                modelOutFile, annotatedSentencesOutFile, "person", 3);
    }
}
The idea is that your known entities (common names in your data) are used to create annotations, those annotations are used to generate a model, the model is then used to find more names and create more annotations, and so on; the tool will repeat this as per the "iterations" parameter. You should run it, check your results, add any undesirable hits to the blacklist file, and then run the training again. I've used this and got pretty good results. If you find problems with it, file a ticket at OpenNLP.
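Once the model file has been generated, you can use it for name finding through the regular OpenNLP API. A minimal sketch, with the model path and the pre-tokenized sentence as placeholders:

import java.io.FileInputStream;
import java.io.InputStream;

import opennlp.tools.namefind.NameFinderME;
import opennlp.tools.namefind.TokenNameFinderModel;
import opennlp.tools.util.Span;

public class UseGeneratedModel {
    public static void main(String[] args) throws Exception {
        // load the model written by the modelbuilder addon
        try (InputStream modelIn = new FileInputStream("path to the generated model")) {
            TokenNameFinderModel model = new TokenNameFinderModel(modelIn);
            NameFinderME nameFinder = new NameFinderME(model);

            // the name finder works on pre-tokenized sentences
            String[] tokens = { "Rajesh", "Kumar", "travelled", "to", "Tokyo", "." };
            Span[] names = nameFinder.find(tokens);
            for (String name : Span.spansToStrings(names, tokens)) {
                System.out.println(name);
            }

            // clear adaptive data between documents
            nameFinder.clearAdaptiveData();
        }
    }
}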
