Language model/set does not contain </s> - cmusphinx

I am developing an ASR system using PocketSphinx, and I have followed every step on this page. When I run pocketsphinx_continuous I get the following error:
ERROR: "ngram_search.c", line 221: Language model/set does not contain </s>, recognition will fail
My language model contains the <s> and </s> tags though.
My language model is as follows:
This is an ARPA-format language model file, generated by CMU Sphinx
\data\
ngram 1=3
ngram 2=1
ngram 3=1
\1-grams:
-0.4770 <s>Alif</s> -0.3010
-0.4770 <s>Baa</s> 0.0000
-0.4770 <s>Jeem</s> 0.0000
\2-grams:
-0.1761 <s>Alif</s> <s>Baa</s> -0.1249
\3-grams:
-0.3010 <s>Alif</s> <s>Baa</s> <s>Jeem</s>
\end\
The corpus file from which this was made is:
<s> Alif </s>
<s> Baa </s>
<s> Jeem </s>
Assistance in resolving this issue is highly appreciated.

When you prepared the corpus you didn't have spaces between <s> and Alif, so LM training counted <s>Alif</s> as a single word. You should have spaces there, and a proper language model should look like this:
\data\
ngram 1=5
ngram 2=6
ngram 3=0
\1-grams:
-0.3010 </s> 0.0000
-99.0000 <s> -7.3814
-0.7782 Alif -99.0000
-0.7782 Baa -99.0000
-0.7782 Jeem -99.0000
\2-grams:
-0.4771 <s> Alif 0.0000
-0.4771 <s> Baa 0.0000
-0.4771 <s> Jeem 0.0000
0.0000 Alif </s> 0.0000
0.0000 Baa </s> 0.0000
0.0000 Jeem </s> 0.0000
\3-grams:
\end\
This correct LM has a separate entry for </s>.
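The tokenization difference is easy to see directly: a whitespace split of the unspaced corpus line yields a single token, markers and all. A quick illustration in Python:

```python
# Corpus line without spaces: the trainer sees one "word",
# sentence markers included
bad = "<s>Alif</s>".split()     # ['<s>Alif</s>']

# With spaces, <s> and </s> become separate tokens, so the
# trainer can count them as real sentence-boundary words
good = "<s> Alif </s>".split()  # ['<s>', 'Alif', '</s>']
```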


Python - A dictionary of dataframes: merge all the dataframes to one big dataframe

I have a dictionary of dataframes. It consists of around 50 dfs, but for the simplicity of demonstration, let's say I only have 2.
This is the dictionary:
It's a lot of weather parameters for a specific location, over several days.
dict_df = { "Location_1" : [ temp_max temp_min precip_mm \
date
2012-05-16 31.370001 15.050000 0.0000
2012-05-17 30.559999 16.780001 0.0000
2012-05-18 32.529999 17.040001 0.0000
2012-05-19 32.860001 19.190001 0.0000
2012-05-20 33.340000 18.580000 0.0000
2012-05-21 27.430000 17.450001 18.5245
2012-05-22 26.730000 13.800000 0.0000
2012-05-23 29.340000 13.300000 0.0000
2012-05-24 32.779999 19.500000 0.0000
2012-05-25 32.919998 22.830000 0.0000
solar_energy_w_h_per_m2 rel_humidity_max_% rel_humidity_min_%
date
2012-05-16 7677.530273 83.779999 24.580000
2012-05-17 7488.292969 78.629997 25.270000
2012-05-18 6644.316895 83.879997 26.900000
2012-05-19 7523.830078 83.709999 33.230000
2012-05-20 6840.391113 90.139999 33.930000
2012-05-21 5472.107910 93.139999 43.490002
2012-05-22 8293.391602 87.540001 28.680000
2012-05-23 8351.654297 91.379997 25.240000
2012-05-24 8176.128418 69.089996 35.290001
2012-05-25 6369.352539 76.449997 40.139999 ],
"Location_2" : [temp_max_cels temp_min_cels precip_amount_mm \
date
2012-05-16 31.370001 15.050000 0.0000
2012-05-17 30.559999 16.780001 0.0000
2012-05-18 32.529999 17.040001 0.0000
2012-05-19 32.860001 19.190001 0.0000
2012-05-20 33.340000 18.580000 0.0000
2012-05-21 27.430000 17.450001 18.5245
2012-05-22 26.730000 13.800000 0.0000
2012-05-23 29.340000 13.300000 0.0000
2012-05-24 32.779999 19.500000 0.0000
2012-05-25 32.919998 22.830000 0.0000
solar_energy_w_h_per_m2 rel_humidity_max_% rel_humidity_min_% \
date
2012-05-16 7677.530273 83.779999 24.580000
2012-05-17 7488.292969 78.629997 25.270000
2012-05-18 6644.316895 83.879997 26.900000
2012-05-19 7523.830078 83.709999 33.230000
2012-05-20 6840.391113 90.139999 33.930000
2012-05-21 5472.107910 93.139999 43.490002
2012-05-22 8293.391602 87.540001 28.680000
2012-05-23 8351.654297 91.379997 25.240000
2012-05-24 8176.128418 69.089996 35.290001
2012-05-25 6369.352539 76.449997 40.139999]}
And it goes like that for 50 locations or so.
I want to merge all the dataframes in the dictionary into one big dataframe. They may not have exactly the same dates, but they all have the same number and type of columns.
I hope it's clear. I really appreciate any help you can provide.
You can just do
df = pd.concat(dict_df).reset_index(level=1,drop=True)
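A minimal reproduction with two tiny frames (a hypothetical subset of the weather columns) shows what this line does: the dict keys become an outer index level, and reset_index(level=1, drop=True) then discards the per-frame date index, leaving the location names as the index:

```python
import pandas as pd

# Two tiny frames standing in for the ~50 weather dataframes
df1 = pd.DataFrame({"temp_max": [31.37, 30.56]},
                   index=pd.Index(["2012-05-16", "2012-05-17"], name="date"))
df2 = pd.DataFrame({"temp_max": [32.53, 32.86]},
                   index=pd.Index(["2012-05-18", "2012-05-19"], name="date"))
dict_df = {"Location_1": df1, "Location_2": df2}

# concat stacks the frames with the dict keys as the outer index level;
# reset_index(level=1, drop=True) then drops the date level
df = pd.concat(dict_df).reset_index(level=1, drop=True)
```

If you would rather keep the dates and turn the location into an ordinary column, reset_index(level=0) instead of level=1 with drop=True would do that.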

How do I improve my NLP model to classify 4 different mental illnesses?

I have a CSV dataset containing 2 columns: one is the label, which indicates the patient's type of mental illness, and the other is that user's Reddit posts from a certain time period.
These are the total number of patients in each group of illness:
control: 3000
depression: 2118
bipolar: 1062
ptsd: 330
schizophrenia: 148
For starters, I tried binary classification between my depression and bipolar patients. I used TF-IDF vectors and fed them into 2 different classifiers: MultinomialNB and SVM.
here is a sample of the code:
using MultinomialNB:
text_clf = Pipeline([('vect', CountVectorizer()),
                     ('tfidf', TfidfTransformer()),
                     ('clf', MultinomialNB())])
text_clf = text_clf.fit(x_train, y_train)
using SVM:
text_clf_svm = Pipeline([('vect', CountVectorizer()),
                         ('tfidf', TfidfTransformer()),
                         ('clf-svm', SGDClassifier(loss='hinge', penalty='l2',
                                                   alpha=1e-3, random_state=42))])
text_clf_svm = text_clf_svm.fit(x_train, y_train)
These are my results:
              precision    recall  f1-score   support

     bipolar       0.00      0.00      0.00       304
  depression       0.68      1.00      0.81       650

    accuracy                           0.68       954
   macro avg       0.34      0.50      0.41       954
weighted avg       0.46      0.68      0.55       954
The problem is that the models are simply predicting every patient as belonging to the class with the larger sample, in this case depression. I have tried using BERT as well, but I get the same accuracy. I have read papers that use the LIWC lexicon; its categories include variables that characterize linguistic style as well as psychological aspects of language.
I don't understand whether what I am doing is correct, or whether there is a better way to classify with NLP; if so, please enlighten me.
Thanks in advance to anybody who reads such a long post and shares their ideas!
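One common first step for class imbalance like this is class weighting, so the loss penalizes mistakes on the rare class more heavily. A minimal sketch of the SGD pipeline above with scikit-learn's class_weight option (the toy posts and labels here are invented for illustration):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import Pipeline

# Same idea as the pipeline above; TfidfVectorizer combines
# CountVectorizer + TfidfTransformer, and class_weight="balanced"
# reweights each class inversely to its frequency
text_clf_svm = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf-svm", SGDClassifier(loss="hinge", penalty="l2", alpha=1e-3,
                              class_weight="balanced", random_state=42)),
])

# Toy, imbalanced example (hypothetical posts and labels)
x_train = ["feeling low and empty", "cannot sleep, racing thoughts",
           "so sad today", "mood swings all week",
           "everything is hopeless", "numb again"]
y_train = ["depression", "bipolar", "depression", "bipolar",
           "depression", "depression"]
text_clf_svm.fit(x_train, y_train)
```

Whether this alone fixes the all-one-class predictions depends on the data; undersampling the majority class or reporting macro-F1 instead of accuracy are other common moves.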

Split one column into multiple columns by multiple delimiters in Pandas

Given a dataframe as follows:
player score
0 Sergio Agüero Forward — Manchester City 209.98
1 Eden Hazard Midfield — Chelsea 274.04
2 Alexis Sánchez Forward — Arsenal 223.86
3 Yaya Touré Midfield — Manchester City 197.91
4 Angel María Midfield — Manchester United 132.23
How could I split player into three new columns: name, position and team?
player score name position team
0 Sergio Agüero Forward — Manchester City 209.98 Sergio Forward Manchester City
1 Eden Hazard Midfield — Chelsea 274.04 Eden Midfield Chelsea
2 Alexis Sánchez Forward — Arsenal 223.86 Alexis Forward Arsenal
3 Yaya Touré Midfield — Manchester City 197.91 Yaya Midfield Manchester City
4 Angel María Midfield — Manchester United 132.23 Angel Midfield Manchester United
I have considered splitting it into two columns with df[['name_position', 'team']] = df['player'].str.split(pat=' — ', expand=True), then splitting name_position into name and position. But is there a better solution?
Many thanks.
You can use str.extract as well if you want to do it in one go:
print(df["player"].str.extract(r"(?P<name>.*?)\s.*?\s(?P<position>[A-Za-z]+)\s—\s(?P<team>.*)"))
name position team
0 Sergio Forward Manchester City
1 Eden Midfield Chelsea
2 Alexis Forward Arsenal
3 Yaya Midfield Manchester City
4 Angel Midfield Manchester United
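For reference, a self-contained version of this str.extract approach on a small sample of the data (note the — in the pattern is the same em-dash character as in the data):

```python
import pandas as pd

df = pd.DataFrame({
    "player": ["Sergio Agüero Forward — Manchester City",
               "Eden Hazard Midfield — Chelsea"],
    "score": [209.98, 274.04],
})

# Named groups in the pattern become the new column names,
# so the extracted frame can be joined straight back on
df = df.join(df["player"].str.extract(
    r"(?P<name>.*?)\s.*?\s(?P<position>[A-Za-z]+)\s—\s(?P<team>.*)"))
```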
You can split a Python string on whitespace with str.split(). This breaks the text into 'words', and you can then pick out the ones you need:
parts = "Sergio Agüero Forward — Manchester City".split()
name = parts[0]
position = parts[2]
team = " ".join(parts[4:])
For more complex patterns, you can use regex, which is a powerful string pattern finding tool.
Hope this helped :)

Genomics Analysis after Blast

This is what my data looks like; these are results obtained after blastp.
I have tried this command, but it is not producing my desired output:
cat out_uniprot-proteome%3AUP000001415.fasta |grep -P -B5 'Identities = \d+/\d+\s\(([7-9]\d|100)%'
Query= sp|Q835H3|MUTS2_ENTFA Endonuclease MutS2 OS=Enterococcus faecalis
(strain ATCC 700802 / V583) OX=226185 GN=mutS2 PE=3 SV=1
Length=788
Score E
Sequences producing significant alignments: (Bits) Value
sp|O15457|MSH4_HUMAN MutS protein homolog 4 OS=Homo sapiens OX=... 109 8e-24
tr|A8K1E1|A8K1E1_HUMAN cDNA FLJ75589, highly similar to Homo sa... 107 4e-23
sp|P20585|MSH3_HUMAN DNA mismatch repair protein Msh3 OS=Homo s... 107 4e-23
tr|B4DSB9|B4DSB9_HUMAN cDNA FLJ51069, highly similar to DNA mis... 102 1e-21
tr|B4DL39|B4DL39_HUMAN cDNA FLJ57316, highly similar to DNA mis... 102 1e-21
tr|A0A2R8YFH0|A0A2R8YFH0_HUMAN DNA mismatch repair protein OS=H... 101 3e-21
tr|A0A2R8Y6P0|A0A2R8Y6P0_HUMAN DNA mismatch repair protein OS=H... 101 3e-21
tr|B4DN49|B4DN49_HUMAN DNA mismatch repair protein OS=Homo sapi... 101 3e-21
tr|E9PHA6|E9PHA6_HUMAN DNA mismatch repair protein OS=Homo sapi... 101 3e-21
sp|P43246|MSH2_HUMAN DNA mismatch repair protein Msh2 OS=Homo s... 101 3e-21
tr|Q53GS1|Q53GS1_HUMAN DNA mismatch repair protein (Fragment) O... 101 3e-21
tr|A0A2R8YG02|A0A2R8YG02_HUMAN DNA mismatch repair protein OS=H... 101 3e-21
tr|Q53FK0|Q53FK0_HUMAN DNA mismatch repair protein (Fragment) O... 100 6e-21
tr|B4DZX3|B4DZX3_HUMAN cDNA FLJ54211, highly similar to MutS pr... 90.1 5e-18
tr|A0A0G2JJ70|A0A0G2JJ70_HUMAN MSH5-SAPCD1 readthrough (NMD can... 89.7 5e-18
tr|A2ABF0|A2ABF0_HUMAN cDNA FLJ39914 fis, clone SPLEN2018732, h... 89.7 5e-18
tr|Q9UFG2|Q9UFG2_HUMAN Uncharacterized protein DKFZp434C1615 (F... 87.0 6e-18
tr|H0YF11|H0YF11_HUMAN MSH5-SAPCD1 readthrough (NMD candidate) ... 87.0 6e-18
> sp|O15457|MSH4_HUMAN MutS protein homolog 4 OS=Homo sapiens OX=9606
GN=MSH4 PE=1 SV=2
Length=936
Score = 109 bits (273), Expect = 8e-24, Method: Compositional matrix adjust.
Identities = 71/228 (31%), Positives = 118/228 (52%), Gaps = 8/228 (4%)
> tr|Q0QEN7|Q0QEN7_HUMAN ATP synthase subunit beta (Fragment) OS=Homo
sapiens OX=9606 GN=ATP5B PE=2 SV=1
Length=445
Score = 590 bits (1522), Expect = 0.0, Method: Compositional matrix adjust.
Identities = 300/448 (67%), Positives = 357/448 (80%), Gaps = 12/448 (3%)
--
Query 423 SYVPVAETVRGFKEILEGKHDNLPEEAF 450
VP+ ET++GF++IL G++D+LPE+AF
Sbjct 416 KLVPLKETIKGFQQILAGEYDHLPEQAF 443
> tr|H0YH81|H0YH81_HUMAN ATP synthase subunit beta (Fragment) OS=Homo
sapiens OX=9606 GN=ATP5F1B PE=1 SV=1
Length=362
Score = 459 bits (1182), Expect = 1e-158, Method: Compositional matrix adjust.
Identities = 228/327 (70%), Positives = 265/327 (81%), Gaps = 7/327 (2%)
--
Query 342 DPLASSSSALAPEIVGEEHYEVATEVQ 368
DPL S+S + P IVG EHY+VA VQ
Sbjct 336 DPLDSTSRIMDPNIVGSEHYDVARGVQ 362
> tr|F8W0P7|F8W0P7_HUMAN ATP synthase subunit beta, mitochondrial
(Fragment) OS=Homo sapiens OX=9606 GN=ATP5F1B PE=1 SV=2
Length=270
Score = 281 bits (720), Expect = 1e-90, Method: Compositional matrix adjust.
Identities = 137/168 (82%), Positives = 151/168 (90%), Gaps = 6/168 (4%)
--
Query 265 LGRMPSAVGYQPTLATEMGQLQERITSTKKGSITSIQAIYVPADDYTD 312
LGR+PSAVGYQPTLAT+MG +QERIT+TKKGSITS+QAIYVPADD TD
Sbjct 223 LGRIPSAVGYQPTLATDMGTMQERITTTKKGSITSVQAIYVPADDLTD 270
The output I want is:
Query= sp|Q835H3|MUTS2_ENTFA Endonuclease MutS2 OS=Enterococcus faecalis
(strain ATCC 700802 / V583) OX=226185 GN=mutS2 PE=3 SV=1
> tr|H0YH81|H0YH81_HUMAN ATP synthase subunit beta (Fragment) OS=Homo
sapiens OX=9606 GN=ATP5F1B PE=1 SV=1
Length=362
Score = 459 bits (1182), Expect = 1e-158, Method: Compositional matrix adjust.
Identities = 228/327 (70%), Positives = 265/327 (81%), Gaps = 7/327 (2%)
> tr|F8W0P7|F8W0P7_HUMAN ATP synthase subunit beta, mitochondrial
(Fragment) OS=Homo sapiens OX=9606 GN=ATP5F1B PE=1 SV=2
Length=270
Score = 281 bits (720), Expect = 1e-90, Method: Compositional matrix adjust.
Identities = 137/168 (82%), Positives = 151/168 (90%), Gaps = 6/168 (4%)
I want the Query line for each hit whose Identities percentage is 70% or greater.
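grep struggles here because the Query= line can be far above the matching Identities line. One way to get that output is to remember the most recent Query= line while scanning hit blocks, and keep a block only when its Identities percentage reaches the threshold. A minimal Python sketch (filter_hits is my name; it assumes the text BLAST layout shown above and keeps only the first line of each Query= header):

```python
import re

def filter_hits(path, min_identity=70):
    """Return (query_line, hit_block) pairs for hits whose
    Identities percentage is >= min_identity."""
    query = None     # most recent "Query=" line seen
    block = []       # lines of the hit block being collected
    results = []
    with open(path) as fh:
        for line in fh:
            line = line.rstrip("\n")
            if line.startswith("Query="):
                query = line
            elif line.startswith(">"):
                block = [line]               # start of a new hit
            elif block:
                block.append(line)
                m = re.search(r"Identities = \d+/\d+ \((\d+)%\)", line)
                if m:                        # block is complete
                    if int(m.group(1)) >= min_identity:
                        results.append((query, block))
                    block = []
    return results
```

For example, filter_hits("blast_output.txt") would return only the H0YH81 and F8W0P7 hits from the report above, each paired with the MUTS2_ENTFA Query line.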

grep and egrep selecting numbers

I have to find all entries of people whose zip code has “22” in it. NOTE: this should not include something like Mike Keneally whose street address includes “22”.
Here are some samples of data:
Bianca Jones, 612 Charles Blvd, Louisville, KY 40228
Frank V. Zappa, 6221 Hot Rats Blvd, Los Angeles, CA 90125
George Duke, San Diego, CA 93241
Ruth Underwood, Mariemont, OH 42522
Here is the command I have so far, but I don't know why it's not working.
egrep '.*[A-Z][A-Z]\s*[0-9]+[22][0-9]+$' names.txt
Guessing this is your sample names.txt:
Bianca Jones, 612 Charles Blvd, Louisville, KY 40228
Frank V. Zappa, 6221 Hot Rats Blvd, Los Angeles, CA 90125
George Duke, San Diego, CA 93241
Ruth Underwood, Mariemont, OH 42522
egrep '.*[A-Z][A-Z]\s*[0-9]+[22][0-9]+$' names.txt
Your regex matches any line satisfying all of these conditions:
[A-Z][A-Z] has two consecutive upper case characters
\s* zero or more space characters
[0-9]+ one or more digit character
[22] a character class containing only 2, so it matches a single 2 (the repeated 2 is redundant)
[0-9]+$ one or more digit characters at the end of the line
To get lines satisfying your requirement:
zip code has “22” in it
you can do it this way:
egrep '[A-Z]{2}\s+[0-9]*22' names.txt
If the zip code is always the last field, you can use this awk:
awk '$NF~/22/' file
Bianca Jones, 612 Charles Blvd, Louisville, KY 40228
Ruth Underwood, Mariemont, OH 42522
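The same pattern can be checked interactively with Python's re module; a quick sketch over the sample lines above:

```python
import re

lines = [
    "Bianca Jones, 612 Charles Blvd, Louisville, KY 40228",
    "Frank V. Zappa, 6221 Hot Rats Blvd, Los Angeles, CA 90125",
    "George Duke, San Diego, CA 93241",
    "Ruth Underwood, Mariemont, OH 42522",
]

# Anchoring on the two-letter state code keeps a "22" in a
# street address (6221 Hot Rats Blvd) from matching
pattern = re.compile(r"[A-Z]{2}\s+[0-9]*22")
matches = [line for line in lines if pattern.search(line)]
# matches keeps only the Bianca Jones and Ruth Underwood lines
```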
