What do I do if sphinx is completely inaccurate? - cmusphinx

Edit for clarity: The output I get from sphinx is not even close to the actual words in my sound file. What do I need to do to make it more accurate?
Here's the file I'm trying to get a transcript from. It should be at a sample rate of 8K.
Acoustic model I'm using: en-us-8khz.tar.gz
Dictionary: cmudict-0.6d (as referenced in the code below).
Language model: US English Generic
The speech in the file is "What should you do if you experience a problem with your iPod", but
as output, I get:
<s> <sil> well how how [um] [cough] [um] old [cough] [noise] [cough] <sil> [noise]
[um] <sil> [um] <sil> [uh] [cough] [noise] [cough] [um]
Here's my code:
package com.test.sphinxtest;

import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;

import edu.cmu.sphinx.api.Configuration;
import edu.cmu.sphinx.api.LiveSpeechRecognizer;
import edu.cmu.sphinx.api.SpeechResult;
import edu.cmu.sphinx.api.StreamSpeechRecognizer;

public class App
{
    public static void main(String[] args)
    {
        Configuration configuration = new Configuration();
        configuration.setAcousticModelPath("models/acousticmodel/en-us-8khz");
        configuration.setDictionaryPath("dictionary/cmudict-0.6d");
        configuration.setLanguageModelPath("models/languagemodel/en-us.lm");
        configuration.setSampleRate(8000);

        try {
            StreamSpeechRecognizer recognizer = new StreamSpeechRecognizer(configuration);
            recognizer.startRecognition(new FileInputStream("speech/speech8k.wav"));
            System.out.println("Starting recognition");
            SpeechResult result = recognizer.getResult();
            System.out.println("Stopping recognition");
            recognizer.stopRecognition();

            System.out.println("number of words " + result.getWords().size());
            for (int i = 0; i < result.getWords().size(); i++) {
                System.out.println(result.getWords().get(i).getWord());
            }
        } catch (IOException e) {
            // TODO Auto-generated catch block
            e.printStackTrace();
        }
    }
}

I looked at this page:
http://cmusphinx.sourceforge.net/wiki/faq#qwhy_my_accuracy_is_poor
And I saw that I had to set my audio file to be single-channel.
After I did that, I got a reasonable output.
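For anyone hitting the same problem: the channel count and sample rate of the WAV file can be checked directly from Java before decoding. A small sketch using the standard javax.sound.sampled API (the class name is arbitrary; the path matches the code above):

import java.io.File;
import javax.sound.sampled.AudioFormat;
import javax.sound.sampled.AudioSystem;

public class CheckWav {
    public static void main(String[] args) throws Exception {
        AudioFormat fmt = AudioSystem.getAudioFileFormat(new File("speech/speech8k.wav")).getFormat();
        // The 8 kHz acoustic model expects mono (1 channel) 16-bit audio at 8000 Hz
        System.out.println("channels:        " + fmt.getChannels());
        System.out.println("sample rate:     " + fmt.getSampleRate() + " Hz");
        System.out.println("bits per sample: " + fmt.getSampleSizeInBits());
    }
}

If the printed values do not match what the acoustic model expects, convert the file before feeding it to the recognizer.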

Related

Spark is creating too many threads when reading ML model

I have a multilabel text classification problem that I tried to resolve using the binary relevance method, by creating one binary classifier per label.
After the training phase, I have to load 10000 classifier models with Spark to run the classification phase on all my documents.
But for an unknown reason, it becomes very slow when I try to load more than 1000 models: Spark creates a new thread each time, which progressively slows down the process, and I don't know why.
Here is the minimal code which illustrates my problem.
package entrepot.spark;

import java.io.FileNotFoundException;
import java.io.IOException;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.spark.ml.classification.MultilayerPerceptronClassificationModel;
import org.apache.spark.sql.SparkSession;

public class maintest {
    public static void main(String[] args) throws FileNotFoundException, IllegalArgumentException, IOException {
        try (SparkSession spark = SparkSession.builder().appName("test").getOrCreate()) {
            // Listing directories to get the list of labels
            Set<String> labels = new HashSet<>();
            FileStatus[] filesstatus = FileSystem.get(spark.sparkContext().hadoopConfiguration())
                    .listStatus(new Path("C:\\Users\\*\\Desktop\\model\\"));
            for (int i = 0; i < filesstatus.length; i++) {
                if (filesstatus[i].isDirectory()) {
                    labels.add(filesstatus[i].getPath().getName());
                }
            }

            List<MultilayerPerceptronClassificationModel> models = new ArrayList<>();
            // Here is the problem
            for (String label : labels) {
                System.out.println(label);
                MultilayerPerceptronClassificationModel model = MultilayerPerceptronClassificationModel
                        .load("C:\\Users\\*\\Desktop\\model\\" + label + "\\CL\\");
                models.add(model);
            }
            System.out.println("done");
        }
    }
}
I'm running the program on Windows, with Spark 2.1.1 and Hadoop 2.7.3, using the following command line:
.\bin\spark-submit^
--class entrepot.spark.maintest^
--master local[*]^
/C:/Users/*/eclipse-workspace/spark/target/spark-0.0.1-SNAPSHOT.jar
To download a small, repetitive sample of one of my label models, here is the link: we.tl/T50s9UffYV (why can't I post a simple link??)
PS: Even though the models are serializable, I couldn't save and load everything at once using a Java collection and an object stream, because I get a Scala conversion error. Instead, I'm using the save/load static methods from MLlib on each model, which results in hundreds of thousands of files.

Is there any API available to get how many slots of a specific browser are available in the Selenium Grid?

I would like to get the number of tests running with a specific browser in the Selenium Grid.
I have looked at the existing API, where I can get the slot count, which is the sum of all available slots across all browsers.
ex: curl -X GET http://localhost:4444/grid/api/hub/ -d '{"configuration":["slotCounts"]}'
Output will be: {"success":true,"slotCounts":{"free":178,"total":196}}
Is there any API available to get how many slots are free for a specific browser (say, Chrome)?
The other option that comes to mind is to parse the output of the console page,
curl -X GET http://localhost:4444/grid/console
which returns the full page, so I would need to parse HTML such as:
<img src='/grid/resources/org/openqa/grid/images/chrome.png' width='16' height='16' class='busy' title='POST - /session/8802ebae-10cb-480d-bbbd-5e7edd7ee7b2/execute executed.' />
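For reference, that scraping fallback would look roughly like this (a sketch, assuming the jsoup HTML parser, which is not part of Selenium; the selector is based on the img markup above):

// Count busy and total Chrome slots by scraping the grid console page
org.jsoup.nodes.Document console = org.jsoup.Jsoup.connect("http://localhost:4444/grid/console").get();
int busyChrome = console.select("img[src*=chrome].busy").size();
int totalChrome = console.select("img[src*=chrome]").size();
System.out.println("chrome slots busy/total: " + busyChrome + "/" + totalChrome);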
No. Currently there's no such API available out there in the Selenium Grid that can do this for you.
You would need to build a custom servlet which when invoked can extract and provide this information for you.
Your Hub servlet could look like this:
import org.openqa.grid.internal.ProxySet;
import org.openqa.grid.internal.Registry;
import org.openqa.grid.internal.RemoteProxy;
import org.openqa.grid.web.servlet.RegistryBasedServlet;
import org.openqa.selenium.remote.CapabilityType;

import javax.servlet.ServletException;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;
import java.io.IOException;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.Iterator;
import java.util.List;
import java.util.Map;

public class SimpleServlet extends RegistryBasedServlet {

    public SimpleServlet(Registry registry) {
        super(registry);
    }

    @Override
    protected void doGet(HttpServletRequest req, HttpServletResponse resp) throws ServletException, IOException {
        ProxySet proxySet = getRegistry().getAllProxies();
        Iterator<RemoteProxy> iterator = proxySet.iterator();
        Map<String, List<String>> returnValue = new HashMap<>();
        while (iterator.hasNext()) {
            RemoteProxy each = iterator.next();
            each.getTestSlots().forEach(slot -> {
                String browser = (String) slot.getCapabilities().get(CapabilityType.BROWSER_NAME);
                String machineIp = each.getRemoteHost().getHost();
                List<String> machines = returnValue.get(browser);
                if (machines == null) {
                    machines = new ArrayList<>();
                }
                machines.add(machineIp);
                returnValue.put(browser, machines);
            });
        }
        // Write logic to have the Map returned back as perhaps a JSON payload
    }
}
You can refer to the Selenium documentation here to learn how to inject servlets to the Hub or the node.
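The final comment in doGet could be filled in along these lines (a sketch, not part of the original answer; it assumes a JSON library such as org.json is available on the hub's classpath):

// Serialize the browser -> machine map as a JSON response
resp.setContentType("application/json");
resp.setStatus(HttpServletResponse.SC_OK);
resp.getWriter().write(new org.json.JSONObject(returnValue).toString());
resp.getWriter().flush();

The servlet can then be registered when starting the hub via the -servlets option, after which it is typically reachable under /grid/admin/SimpleServlet.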

Operation APPEND failed with HTTP500?

package org.apache.spark.examples.kafkaToflink;

import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.io.PrintStream;
import java.nio.charset.StandardCharsets;
import java.util.Properties;

import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.sink.RichSinkFunction;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer010;
import org.apache.flink.streaming.util.serialization.SimpleStringSchema;

import com.microsoft.azure.datalake.store.ADLException;
import com.microsoft.azure.datalake.store.ADLFileOutputStream;
import com.microsoft.azure.datalake.store.ADLStoreClient;
import com.microsoft.azure.datalake.store.IfExists;
import com.microsoft.azure.datalake.store.oauth2.AccessTokenProvider;
import com.microsoft.azure.datalake.store.oauth2.ClientCredsTokenProvider;

import scala.util.parsing.combinator.testing.Str;

public class App {

    public static void main(String[] args) throws Exception {
        final StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        Properties properties = new Properties();
        properties.setProperty("bootstrap.servers", "192.168.1.72:9092");
        properties.setProperty("group.id", "test");
        DataStream<String> stream = env.addSource(
                new FlinkKafkaConsumer010<String>("tenant", new SimpleStringSchema(), properties), "Kafka_Source");
        stream.addSink(new ADLSink()).name("Custom_Sink").setParallelism(128);
        env.execute("App");
    }
}

class ADLSink<String> extends RichSinkFunction<String> {

    private java.lang.String clientId = "***********";
    private java.lang.String authTokenEndpoint = "***************";
    private java.lang.String clientKey = "*****************";
    private java.lang.String accountFQDN = "****************";
    private java.lang.String filename = "/Bitfinex/ETHBTC/ORDERBOOK/ORDERBOOK.json";

    @Override
    public void invoke(String value) {
        AccessTokenProvider provider = new ClientCredsTokenProvider(authTokenEndpoint, clientId, clientKey);
        ADLStoreClient client = ADLStoreClient.createClient(accountFQDN, provider);
        try {
            client.setPermission(filename, "744");
            ADLFileOutputStream stream = client.getAppendStream(filename);
            System.out.println(value);
            stream.write(value.toString().getBytes());
            stream.close();
        } catch (ADLException e) {
            System.out.println(e.requestId);
        } catch (Exception e) {
            System.out.println(e.getMessage());
            System.out.println(e.getCause());
        }
    }
}
I am continuously appending to a file in Azure Data Lake Store using a while loop, but sometimes it gives this error, Operation APPEND failed with HTTP500, either right at the start or sometimes after 10 minutes. I am using Java.
Anubhav, Azure Data Lake streams are single-writer streams - i.e., you cannot write to the same stream from multiple threads, unless you do some form of synchronization between these threads. This is because each write specifies the offset it is writing to, and with multiple threads, the offsets are not consistent.
You seem to be writing from multiple threads (the .setParallelism(128) call in your code).
In your case, you have two choices:
Write to a different file in each thread. I do not know your use case, but we have found that for a lot of cases that is the natural use of different threads: to write to different files. (A sketch of this option follows below.)
If it is important to have all the threads write to the same file, then you will need to refactor the sink a little bit so that all the instances have a reference to the same ADLFileOutputStream, and you will need to make sure the calls to write() and close() are synchronized.
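A sketch of the first option (not from the original answer): each parallel sink instance derives its own file name from Flink's subtask index, so no two instances append to the same stream. It reuses the imports and placeholder credentials from the question's code, plus org.apache.flink.configuration.Configuration:

class PerTaskADLSink extends RichSinkFunction<String> {

    private String clientId = "***********";
    private String authTokenEndpoint = "***************";
    private String clientKey = "*****************";
    private String accountFQDN = "****************";
    private transient ADLStoreClient client;
    private transient String filename;

    @Override
    public void open(Configuration parameters) {
        AccessTokenProvider provider = new ClientCredsTokenProvider(authTokenEndpoint, clientId, clientKey);
        client = ADLStoreClient.createClient(accountFQDN, provider);
        // One file per parallel subtask (ORDERBOOK-0.json ... ORDERBOOK-127.json),
        // so each append stream has exactly one writer.
        filename = "/Bitfinex/ETHBTC/ORDERBOOK/ORDERBOOK-"
                + getRuntimeContext().getIndexOfThisSubtask() + ".json";
    }

    @Override
    public void invoke(String value) throws Exception {
        // Append and close; only this subtask ever touches this file
        try (ADLFileOutputStream out = client.getAppendStream(filename)) {
            out.write(value.getBytes(StandardCharsets.UTF_8));
        }
    }
}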
Now, there is one more issue here: the error you got should have been an HTTP 4xx error (indicating a lease conflict, since ADLFileOutputStream acquires a lease), rather than HTTP 500, which says there was a server-side problem. To troubleshoot that, I will need to know your account name and time of access. That info is not safe to share on StackOverflow, so please open a support ticket for that and reference this SO question, so the issue eventually gets routed to me.

streaming.StreamingContext: Error starting the context, marking it as stopped [Spark Streaming]

I was trying to run a sample Spark Streaming program, but I get this error:
16/06/02 15:25:42 ERROR streaming.StreamingContext: Error starting the context, marking it as stopped
java.lang.IllegalArgumentException: requirement failed: No output operations registered, so nothing to execute
at scala.Predef$.require(Predef.scala:233)
at org.apache.spark.streaming.DStreamGraph.validate(DStreamGraph.scala:161)
at org.apache.spark.streaming.StreamingContext.validate(StreamingContext.scala:542)
at org.apache.spark.streaming.StreamingContext.liftedTree1$1(StreamingContext.scala:601)
at org.apache.spark.streaming.StreamingContext.start(StreamingContext.scala:600)
at org.apache.spark.streaming.api.java.JavaStreamingContext.start(JavaStreamingContext.scala:624)
at com.streams.spark_consumer.SparkConsumer.main(SparkConsumer.java:56)
Exception in thread "main" java.lang.IllegalArgumentException: requirement failed: No output operations registered, so nothing to execute
at scala.Predef$.require(Predef.scala:233)
at org.apache.spark.streaming.DStreamGraph.validate(DStreamGraph.scala:161)
at org.apache.spark.streaming.StreamingContext.validate(StreamingContext.scala:542)
at org.apache.spark.streaming.StreamingContext.liftedTree1$1(StreamingContext.scala:601)
at org.apache.spark.streaming.StreamingContext.start(StreamingContext.scala:600)
at org.apache.spark.streaming.api.java.JavaStreamingContext.start(JavaStreamingContext.scala:624)
at com.streams.spark_consumer.SparkConsumer.main(SparkConsumer.java:56)
My code is given below. I know there are a few unused imports; I was doing something else and got the same error, so I modified that code to run the sample program given on the Spark Streaming website:
package com.streams.spark_consumer;

import java.util.HashMap;
import java.util.HashSet;
import java.util.Arrays;
import java.util.Iterator;
import java.util.Map;
import java.util.Set;
import java.util.regex.Pattern;

import scala.Tuple2;
import kafka.serializer.StringDecoder;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.function.*;
import org.apache.spark.streaming.api.java.*;
import org.apache.spark.streaming.kafka.KafkaUtils;
import org.apache.spark.streaming.Durations;
import org.apache.spark.api.java.JavaSparkContext;

public class SparkConsumer {

    private static final Pattern SPACE = Pattern.compile(" ");

    public static void main(String[] args) throws Exception {
        System.out.println("Han chal raha hai"); // just to know if this part of the code is executed
        SparkConf conf = new SparkConf().setMaster("local[2]").setAppName("NetworkWordCount");
        JavaStreamingContext jssc = new JavaStreamingContext(conf, Durations.seconds(1));
        System.out.println("Han bola na chal raha hau chutiye 1"); // just to know if this part of the code is executed

        JavaReceiverInputDStream<String> lines = jssc.socketTextStream("localhost", 9999);

        JavaDStream<String> words = lines.flatMap(
                new FlatMapFunction<String, String>() {
                    public Iterable<String> call(String x) {
                        return Arrays.asList(x.split(" "));
                    }
                });

        JavaPairDStream<String, Integer> pairs = words.mapToPair(
                new PairFunction<String, String, Integer>() {
                    public Tuple2<String, Integer> call(String s) {
                        return new Tuple2<String, Integer>(s, 1);
                    }
                });

        JavaPairDStream<String, Integer> wordCounts = pairs.reduceByKey(
                new Function2<Integer, Integer, Integer>() {
                    public Integer call(Integer i1, Integer i2) {
                        return i1 + i2;
                    }
                });

        jssc.start();
        jssc.awaitTermination();
    }
}
Can anybody help me out with this?
I am using a local master; even so, I have tried starting and stopping a master (and slaves as well). I didn't know why that might help, but I have already tried it just in case.
According to the Spark documentation:
Since the output operations actually allow the transformed data to be consumed by external systems, they trigger the actual execution of all the DStream transformations (similar to actions for RDDs).
So use any of the output operations after your transformations; a minimal example follows the list below.
print()
foreachRDD(func)
saveAsObjectFiles(prefix, [suffix])
saveAsTextFiles(prefix, [suffix])
saveAsHadoopFiles(prefix, [suffix])
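For the code in the question, the smallest change is to register one such output operation, e.g. print(), on the final DStream before starting the context:

// Registering an output operation gives the DStream graph something to execute
wordCounts.print();

jssc.start();
jssc.awaitTermination();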

ANTLR4 doesn't correctly recognize Unicode characters

I have a very simple grammar which tries to match 'é' to the token E_CODE.
I've tested it using the TestRig tool (with the -tokens option), but the parser can't correctly match it.
My input file was encoded in UTF-8 without BOM, and I used ANTLR version 4.4.
Could somebody else also check this? I got this output on my console:
line 1:0 token recognition error at: 'Ă'
grammar Unicode;
stat:EOF;
E_CODE: '\u00E9' | 'é';
I tested the grammar:
grammar Unicode;
stat: E_CODE* EOF;
E_CODE: '\u00E9' | 'é';
as follows:
UnicodeLexer lexer = new UnicodeLexer(new ANTLRInputStream("\u00E9é"));
UnicodeParser parser = new UnicodeParser(new CommonTokenStream(lexer));
System.out.println(parser.stat().getText());
and the following got printed to my console:
éé<EOF>
Tested with 4.2 and 4.3 (4.4 isn't in Maven Central yet).
EDIT
Looking at the source I see TestRig takes an optional -encoding param. Have you tried setting it?
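If you are driving the lexer from your own code rather than through TestRig, the encoding can also be passed explicitly when reading the file. A small sketch using the ANTLRFileStream constructor that takes an encoding (the file name is just an example):

// Read the input file as UTF-8 explicitly instead of relying on the platform default
UnicodeLexer lexer = new UnicodeLexer(new ANTLRFileStream("input.txt", "UTF-8"));
UnicodeParser parser = new UnicodeParser(new CommonTokenStream(lexer));
System.out.println(parser.stat().getText());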
This is not an answer but a large comment.
I just hit a snag with Unicode, so I thought I would test this. It turned out I had wrongly encoded the input file, but here is the test code; everything is default and it works extremely well in ANTLR 4.10.1. Maybe it's of some use:
grammar LetterNumbers;
text: WORD*;
WS: [ \t\r\n]+ -> skip ; // toss out whitespace
// The letters that return Character.LETTER_NUMBER to Character.getType(ch)
// The list: https://www.compart.com/en/unicode/category/Nl
// Roman Numerals are the best known here
WORD: LETTER_NUMBER+;
LETTER_NUMBER:
[\u16ee-\u16f0]|[\u2160-\u2182]|[\u2185-\u2188]
|'\u3007'
|[\u3021-\u3029]|[\u3038-\u303a]|[\ua6e6-\ua6ef];
And the JUnit5 test that goes with that:
package antlerization.minitest;

import antlrgen.minitest.LetterNumbersBaseListener;
import antlrgen.minitest.LetterNumbersLexer;
import antlrgen.minitest.LetterNumbersParser;

import org.antlr.v4.runtime.Lexer;
import org.antlr.v4.runtime.tree.TerminalNode;
import org.junit.jupiter.api.Test;
import org.antlr.v4.runtime.CharStreams;
import org.antlr.v4.runtime.CommonTokenStream;
import org.antlr.v4.runtime.tree.ParseTree;
import org.antlr.v4.runtime.tree.ParseTreeWalker;

import java.util.LinkedList;
import java.util.List;

import static org.hamcrest.MatcherAssert.assertThat;
import static org.hamcrest.Matchers.*;

public class MiniTest {

    static class WordCollector extends LetterNumbersBaseListener {
        public final List<String> collected = new LinkedList<>();

        @Override
        public void exitText(LetterNumbersParser.TextContext ctx) {
            for (TerminalNode tn : ctx.getTokens(LetterNumbersLexer.WORD)) {
                collected.add(tn.getText());
            }
        }
    }

    private static ParseTree stringToParseTree(String inString) {
        Lexer lexer = new LetterNumbersLexer(CharStreams.fromString(inString));
        CommonTokenStream tokens = new CommonTokenStream(lexer);
        // "text" is the root of the grammar tree;
        // this returns a subclass of ParseTree: LetterNumbersParser.TextContext
        return (new LetterNumbersParser(tokens)).text();
    }

    private static List<String> collectWords(ParseTree parseTree) {
        WordCollector wc = new WordCollector();
        (new ParseTreeWalker()).walk(wc, parseTree);
        return wc.collected;
    }

    private static String joinForTest(List<String> list) {
        return String.join(",", list);
    }

    private static String stringInToStringOut(String parseThis) {
        return joinForTest(collectWords(stringToParseTree(parseThis)));
    }

    @Test
    void unicodeCharsOneWord() {
        String res = stringInToStringOut("ⅣⅢⅤⅢ");
        assertThat(res, equalTo("ⅣⅢⅤⅢ"));
    }

    @Test
    void escapesOneWord() {
        String res = stringInToStringOut("\u2163\u2162\u2164\u2162");
        assertThat(res, equalTo("ⅣⅢⅤⅢ"));
    }

    @Test
    void unicodeCharsMultipleWords() {
        String res = stringInToStringOut("ⅠⅡⅢ ⅣⅤⅥ ⅦⅧⅨ ⅩⅪⅫ ⅬⅭⅮⅯ");
        assertThat(res, equalTo("ⅠⅡⅢ,ⅣⅤⅥ,ⅦⅧⅨ,ⅩⅪⅫ,ⅬⅭⅮⅯ"));
    }

    @Test
    void unicodeCharsLetters() {
        String res = stringInToStringOut("Ⅰ Ⅱ Ⅲ \n Ⅳ Ⅴ Ⅵ \n Ⅶ Ⅷ Ⅸ \n Ⅹ Ⅺ Ⅻ \n Ⅼ Ⅽ Ⅾ Ⅿ");
        assertThat(res, equalTo("Ⅰ,Ⅱ,Ⅲ,Ⅳ,Ⅴ,Ⅵ,Ⅶ,Ⅷ,Ⅸ,Ⅹ,Ⅺ,Ⅻ,Ⅼ,Ⅽ,Ⅾ,Ⅿ"));
    }
}
Your grammar file is not saved in UTF-8 format.
UTF-8 is the default encoding that ANTLR accepts for input grammar files, according to Terence Parr's book.
