CMU Sphinx and time indexed words - speech-to-text

I have run the Pocketsphinx Python example, and now I want to run speech recognition on a 60-second English WAV file and get as output
- the English transcription, AND
- the second at which each word was spoken.
I do not know where to start researching to get this output. Could anyone please point me in the right direction?

OK. Open-source tools like Kaldi offer this automatically:
https://americanarchivepb.wordpress.com/2017/12/04/dockerized-kaldi-speech-to-text-tool/

You need recognition with forced alignment. Here is an example for pocketsphinx:
pocketsphinx_continuous \
    -infile with.wav \
    -jsgf with-word.jsgf \
    -dict words.dict \
    -backtrace yes \
    -fsgusefiller no \
    -bestpath no \
    > with-word.txt 2>&1
Output:
==> with-word.txt <==
INFO: fsg_search.c(869): fsg 0.05 CPU 0.051 xRT
INFO: fsg_search.c(871): fsg 0.09 wall 0.084 xRT
INFO: pocketsphinx.c(1171): sil with sil (-2607)
word start end pprob ascr lscr lback
sil 3 77 1.000 -1602 0 1
with 78 102 1.000 -845 0 1
sil 103 107 1.000 -160 0 1
INFO: fsg_search.c(265): TOTAL fsg 0.05 CPU 0.051 xRT
INFO: fsg_search.c(268): TOTAL fsg 0.09 wall 0.085 xRT
sil with sil
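Since the question mentions the Python bindings: the decoder exposes the same per-word timings through its segment iterator. The start/end columns in the output above are frame indices, at 100 frames per second by default, so frame 78 is 0.78 s. A minimal sketch, assuming the classic pocketsphinx Python API (the model paths are placeholders, and attribute names can differ between pocketsphinx releases):

```python
FRAME_RATE = 100  # PocketSphinx reports timings as frame indices, 100 per second by default


def frames_to_seconds(frame, frame_rate=FRAME_RATE):
    """Convert a decoder frame index to seconds."""
    return frame / frame_rate


def word_timings(wav_path, hmm_dir, dict_path):
    """Decode a mono 16 kHz WAV file and yield (word, start_sec, end_sec).

    The paths are placeholders; point them at your own model files.
    """
    from pocketsphinx import Decoder  # pip install pocketsphinx

    config = Decoder.default_config()
    config.set_string("-hmm", hmm_dir)
    config.set_string("-dict", dict_path)
    decoder = Decoder(config)

    decoder.start_utt()
    with open(wav_path, "rb") as f:
        f.read(44)  # skip the RIFF/WAV header
        decoder.process_raw(f.read(), False, True)
    decoder.end_utt()

    for seg in decoder.seg():  # one segment per recognized word (plus silences)
        yield (seg.word,
               frames_to_seconds(seg.start_frame),
               frames_to_seconds(seg.end_frame))
```

For a 60-second file this yields each word together with its start and end time in seconds.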
For CMU Sphinx 4 you need the SpeechAligner class from the Sphinx API. Here you'll find an implementation of a simple aligner tool.
./align.sh sample.wav sample.txt 2>/dev/null
Output:
"it's","IH T S","false","0.0","170","200"
"a","AH","false","-5540774.0","200","390"
"crowd","K R AW D","false","-1.13934288E8","850","1300"
"in","IH N","false","-1.95127088E8","1300","1470"
"two","T UW","false","-2.23176048E8","1470","1700"
"distinct","D IH S T IH NG K T","false","-2.6345264E8","1700","2230"
"ways","W EY Z","false","-3.58427808E8","2230","2730"
"the","DH AH","false","-4.72551168E8","2920","3100"
"fruit","F R UW T","false","-5.24233504E8","3220","3530"
"of","AH V","false","-5.79971456E8","3530","3640"
"a","AH","false","-5.99515456E8","3640","3760"
"figg","F IH G","false","-6.2017152E8","3760","4060"
"tree","T R IY","false","-6.72126656E8","4060","4490"
"is","IH Z","false","-7.4763744E8","4490","4570"
"apple","AE P AH L","false","-7.73581184E8","4630","5040"
"shaped","SH EY P T","false","-8.44424704E8","5040","5340"

Related

How to display days in a different language?

I downloaded a Conky script from gnome-look which shows the day in English, and I would like to change the day's language to Arabic. The script is:
background yes
update_interval 1
cpu_avg_samples 2
net_avg_samples 2
temperature_unit celsius
double_buffer yes
no_buffers yes
text_buffer_size 2048
alignment bottom_left
gap_x 20
gap_y 7000
minimum_size 550 550
maximum_width 550
own_window yes
own_window_type normal
own_window_transparent yes
own_window_hints undecorated,below,sticky,skip_taskbar,skip_pager
own_window_argb_visual yes
own_window_argb_value 0
border_inner_margin 0
border_outer_margin 0
draw_shades no
draw_outline no
draw_borders no
draw_graph_borders no
default_shade_color 112422
override_utf8_locale yes
use_xft yes
xftfont Feena Casual:size=10
xftalpha 1
uppercase yes
default_color D6D5D4
#E87E3C
own_window_colour 000000
TEXT
${font NotoSansArabic:size=15}${color D6D5D4}${LANG=AR_EG.UTF-8 time %A}\
${font Anurati:size=45}${color D6D5D4}${time %A}#${color yellow}\
${font Finlandica:size=25}${color D6D5D4}${time %H:%M}#${color yellow}\
${font Finlandica:size=25}${color D6D5D4}${time %d %B}#${color yellow}\
${font Anurati:size=25}${color D6D5D4}${time %d %B %Y}#${color yellow}\
I've added this line of code hoping it would work:
${font Noto Sans Arabic:size=15}${color D6D5D4}${LANG=AR_EG.UTF-8 time %A}\
However, it doesn't work as expected. Any idea how to edit this line of code?
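One likely issue: Conky's ${time} object does not accept an inline LANG= assignment like that, so the added line is parsed as an unknown variable. A common workaround (a sketch, untested here) is to shell out to date with the locale set, via execi; this assumes the ar_EG.UTF-8 locale has been generated on the system (note the lowercase language code):

```
${font Noto Sans Arabic:size=15}${color D6D5D4}${execi 3600 LANG=ar_EG.UTF-8 date +%A}\
```

execi caches the command output, here for 3600 seconds, which is fine for a value that changes once a day.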

Tabula-py for borderless table extraction

Can anyone please suggest me how to extract tabular data from a PDF using python/java program for the below borderless table present in a pdf file?
This table might be a difficult one for tabula. How about using guess=False, stream=True?
Update: As of tabula-py 1.0.3, guess and stream should work together. No need to set guess=False to use stream or lattice option.
I solved this problem via tabula-py
conda install tabula-py
and
>>> import tabula
>>> area = [70, 30, 750, 570] # Seems to have to be done manually
>>> page2 = tabula.read_pdf("nar_2021_editorial-2.pdf", guess=False, lattice=False,
stream=True, multiple_tables=False, area=area, pages="all",
) # `tabula` doc explains params very well
>>> page2
and I got this result
> 'pages' argument isn't specified.Will extract only from page 1 by default. [
> ShortTitle Text \ 0
> Arena3Dweb 3D visualisation of multilayered networks 1
> Aviator Monitoring the availability of web services 2
> b2bTools Predictions for protein biophysical features and 3
> NaN their conservation 4
> BENZ WS Four-level Enzyme Commission (EC) number ..
> ... ... 68
> miRTargetLink2 miRNA target gene and target pathway
> 69 NaN networks
> 70 mmCSM-PPI Effects of multiple point mutations on
> 71 NaN protein-protein interactions
> 72 ModFOLD8 Quality estimates for 3D protein models
>
>
> URL 0 http://bib.fleming.gr/Arena3D 1
> https://www.ccb.uni-saarland.de/aviator 2
> https://bio2byte.be/b2btools/ 3
> NaN 4 https://benzdb.biocomp.unibo.it/ ..
> ... 68 https://www.ccb.uni-saarland.de/mirtargetlink2 69
> NaN 70 http://biosig.unimelb.edu.au/mmcsm ppi 71
> NaN 72 https://www.reading.ac.uk/bioinf/ModFOLD/ [73
> rows x 3 columns]]
This is an iterable object (a list of DataFrames), so you can loop over it with for df in page2: ...
Hope it helps.
Tabula-py borderless table extraction:
tabula-py has a stream option which, when True, detects tables based on the gaps between columns.
from tabula import convert_into
src_pdf = r"src_path"
des_csv = r"des_path"
convert_into(src_pdf, des_csv, guess=False, lattice=False, stream=True, pages="all")

How to handle such errors?

companies = pd.read_csv("http://www.richard-muir.com/data/public/csv/CompaniesRevenueEmployees.csv", index_col = 0)
companies.head()
I'm getting this error; please suggest what approaches should be tried:
'utf-8' codec can't decode byte 0xb7 in position 7
Try encoding as 'latin1' (this worked on macOS):
companies = pd.read_csv("http://www.richard-muir.com/data/public/csv/CompaniesRevenueEmployees.csv",
index_col=0,
encoding='latin1')
Downloading the file and opening it in Notepad++ shows it is ANSI-encoded. If you are on a Windows system, this should fix it:
import pandas as pd
url = "http://www.richard-muir.com/data/public/csv/CompaniesRevenueEmployees.csv"
companies = pd.read_csv(url, index_col = 0, encoding='ansi')
print(companies)
If you are not on Windows, you need to research how to convert ANSI-encoded text to something you can read.
See: https://docs.python.org/3/library/codecs.html#standard-encodings
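More generally, when the encoding is unknown you can probe a few candidates before handing the file to pandas. An illustrative stdlib-only sketch (libraries such as chardet or charset-normalizer are more robust options):

```python
def sniff_encoding(raw: bytes, candidates=("utf-8", "cp1252", "latin1")) -> str:
    """Return the first candidate encoding that decodes raw without error.

    'latin1' maps every byte value to a character, so keeping it last
    guarantees a result -- at the cost of possibly mis-decoding characters.
    """
    for enc in candidates:
        try:
            raw.decode(enc)
            return enc
        except UnicodeDecodeError:
            continue
    raise ValueError("no candidate encoding worked")
```

You could read the first few kilobytes of the downloaded file, call sniff_encoding on them, and pass the result to pd.read_csv(..., encoding=...).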
Output:
Name Industry \
0 Walmart Retail
1 Sinopec Group Oil and gas
2 China National Petroleum Corporation Oil and gas
... ... ...
47 Hewlett Packard Enterprise Electronics
48 Tata Group Conglomerate
Revenue (USD billions) Employees
0 482 2200000
1 455 358571
2 428 1636532
... ... ...
47 111 302000
48 108 600000

user defined feature in CRF++

I tried to add more features to my CRF++ template, following How can I tell CRF++ classifier that a word x is capitalized or understanding punctuations?
training sample
The DT 0 1 0 1 B-MISC
Oxford NNP 0 1 0 1 I-MISC
Companion NNP 0 1 0 1 I-MISC
to TO 0 0 0 0 I-MISC
Philosophy NNP 0 1 0 1 I-MISC
feature template
# Unigram
U00:%x[-2,0]
U01:%x[-1,0]
U02:%x[0,0]
U03:%x[1,0]
U04:%x[2,0]
U05:%x[-1,0]/%x[0,0]
U06:%x[0,0]/%x[1,0]
U07:%x[-2,0]/%x[-1,0]/%x[0,0]
#shape feature
U08:%x[-2,2]
U09:%x[-1,2]
U10:%x[0,2]
U11:%x[1,2]
U12:%x[2,2]
B
The training phase is OK, but I get no output with crf_test:
tilney@ubuntu:/data/wikipedia/en$ crf_test -m validation_model test.data
tilney@ubuntu:/data/wikipedia/en$
Everything works fine if I ignore the shape features above. Where did I go wrong?
I figured this out; it was a problem with my test data. I thought every feature would be taken from the trained model, so my test data had only two columns (word and tag). It turns out the test file must have exactly the same format as the training data!
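In other words, the extra feature columns have to be regenerated for the test data as well. A hypothetical featurizer sketch; the exact meaning of the four binary columns in the training sample above is not documented here, so the shape tests used below (all-caps, initial capital, all-punctuation, contains-digit) are illustrative assumptions, not the original feature definitions:

```python
import string


def featurize(word, pos, label="O"):
    """Emit one CRF++ data line: word, POS tag, four shape features, label.

    The binary columns here are illustrative guesses; substitute whatever
    featurizer produced the training columns.
    """
    feats = [
        "1" if word.isupper() else "0",                               # ALL-CAPS
        "1" if word[:1].isupper() else "0",                           # Initial capital
        "1" if word and all(c in string.punctuation for c in word) else "0",
        "1" if any(c.isdigit() for c in word) else "0",
    ]
    return " ".join([word, pos, *feats, label])
```

Run every test sentence through the same featurizer as the training data before calling crf_test, so both files have identical columns.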

Correlation analysis between stock prices

Let us consider the following stock prices taken from finance.yahoo.com:
Date Open High Low Close Volume Adj Close
3/4/2013 23.15 23.84 23.03 23.67 30908300 23.3
2/25/2013 23.5 23.53 22.81 23.19 40710800 22.83
2/19/2013 23.42 23.75 23.12 23.39 38743400 23.03
2/11/2013 22.49 23.55 22.35 23.29 46448500 22.74
2/4/2013 22.41 22.62 22.27 22.5 34498100 21.97
1/28/2013 22.44 22.64 22.18 22.62 39634900 22.09
1/22/2013 22.18 22.31 21.75 22.29 47826300 21.77
1/14/2013 21.18 22.19 21.01 22.04 54826000 21.52
1/7/2013 21.16 21.24 20.68 21.13 35304100 20.63
12/31/2012 20.29 21.54 20.26 21.2 45796500 20.7
12/24/2012 20.79 20.96 20.42 20.44 28597100 19.96
12/17/2012 21.69 21.95 20.56 20.88 70719700 20.39
12/10/2012 21.43 21.95 21.36 21.62 39455500 20.92
12/3/2012 21.18 21.48 20.71 21.46 35913000 20.77
11/26/2012 20.88 21.36 20.5 21.13 36203100 20.45
11/19/2012 20.41 21.04 20.37 21.04 35401500 20.36
11/12/2012 21.04 21.14 19.87 20.15 45095400 19.5
11/5/2012 21.2 21.78 20.7 21 37812800 20.32
11/2/2012 21.53 21.68 21.26 21.31 47475200 20.62
And I want to compute a correlation matrix between, for example, the Close and Volume variables. I used the correlation function from Excel's Data Analysis ToolPak, but I got only a one-sided matrix, like this:
Close Volume
Close 1
Volume -0.117267345 1
It does not show me the correlation coefficients above the main diagonal. Why? Maybe because it is symmetric?
The correlation matrix is necessarily symmetric, so in your case the element above the diagonal is also -0.117267345. If you check the documentation for the CORREL function and look at the defining equation, you can see that it is symmetric with respect to exchanging X and Y.
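The symmetry is easy to verify from the defining equation, since every term treats X and Y interchangeably. A quick stdlib-only Python sketch, illustrative only:

```python
def pearson(xs, ys):
    """Pearson correlation coefficient; symmetric in its two arguments."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)
```

pearson(xs, ys) and pearson(ys, xs) always return the same value, which is why Excel only fills in one triangle of the matrix.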
