I was installing the NEologd dictionary for Japanese tokenization. While compiling the NEologd dictionary .csv file (mecab-user-dict-seed.20200910.csv) to a .dic file using mecab-dict-index, cmd printed a success message, but I could not find the .dic file in my MeCab directory. Does anyone know how to fix this issue?
Below is the success message from cmd.
`PS C:\Program Files (x86)\MeCab> ./bin/mecab-dict-index -d "C:\Program Files (x86)\MeCab\dic\ipadic" -u neologd.dic -f utf-8 -t utf-8 ./mecab-user-dict-seed.20200910.csv
reading ./mecab-user-dict-seed.20200910.csv ... 3224584
emitting double-array: 100% |###########################################|
done!`
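One hedged guess (not verified on this setup): mecab-dict-index seems to write the user dictionary to the path passed to -u, resolved against the current directory when it is relative, so the file above should land in C:\Program Files (x86)\MeCab rather than in the dic folder. Passing an absolute path to -u makes the output location explicit, for example (the output path below is only illustrative):
./bin/mecab-dict-index -d "C:\Program Files (x86)\MeCab\dic\ipadic" -u "C:\temp\neologd.dic" -f utf-8 -t utf-8 ./mecab-user-dict-seed.20200910.csv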
Related
So, I am writing a speech recognition program. To do that I downloaded 400MB of data from TIMIT. When I intended to read the WAV files (I tried two libraries) as follows:
import scipy.io.wavfile as wavfile
import wave
# first attempt: scipy
(fs, x) = wavfile.read('../data/TIMIT/TRAIN/DR1/FCJF0/SA1.WAV')
# second attempt: the standard-library wave module
w = wave.open('../data/TIMIT/TRAIN/DR1/FCJF0/SA1.WAV')
In both cases the problem is that the WAV file format says 'NIST' while it must be 'RIFF'. (I also read something about sph, but the NIST files I downloaded are .wav, not .sph.)
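For context, a standard RIFF WAV file starts with the ASCII bytes 'RIFF', while the TIMIT files start with 'NIST' (the NIST SPHERE header). A quick way to confirm which one you have (path reused from above):
# Peek at the first four bytes: b'RIFF' means a standard WAV,
# b'NIST' means a NIST SPHERE file as shipped with TIMIT.
with open('../data/TIMIT/TRAIN/DR1/FCJF0/SA1.WAV', 'rb') as f:
    print(f.read(4))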
I then downloaded SoX from http://sox.sourceforge.net/
I added the path to my environment variables so that cmd recognizes sox, but I can't really figure out how to use it correctly.
What I need now is a script or something to make sox convert EVERY WAV file from NIST to RIFF format under a given folder and its subfolders.
EDIT:
In the question "reading a WAV file from TIMIT database in python" I found an answer that worked for me:
Running sph2pipe -f wav input.wav output.wav
What I need is a script or something that goes through a folder and all its subfolders and applies that command to every .wav file.
Since forfiles is a Windows command, here is a solution for unix.
Just cd to the upper folder and type:
find . -name '*.WAV' | parallel -P20 sox {} '{.}.wav'
You need to have parallel and sox installed, though; on macOS you can get both via brew install. Hope this helps.
OK, I finally got it. Go to the upper folder and run this command:
forfiles /s /m *.wav /c "cmd /c sph2pipe -f wav @file @fnameRIFF.wav"
This command searches for every file and makes it readable for the Python libs. Hope it helps!
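If a single cross-platform script is preferred instead of forfiles, here is a rough Python sketch of the same idea (it assumes sph2pipe is on the PATH; the root folder and output naming are illustrative):
import os
import subprocess

root = '../data/TIMIT'  # illustrative root folder

# Walk every subfolder and convert each NIST-format .WAV to a RIFF .wav,
# mirroring the forfiles command above (output named like SA1RIFF.wav).
for dirpath, dirnames, filenames in os.walk(root):
    for name in filenames:
        if name.lower().endswith('.wav'):
            src = os.path.join(dirpath, name)
            dst = os.path.splitext(src)[0] + 'RIFF.wav'
            subprocess.check_call(['sph2pipe', '-f', 'wav', src, dst])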
I have a large TSV file (30GB) that I have to load into Google BigQuery. So I split the file into smaller chunks, gzipped the chunk files, and moved them to Google Cloud Storage. After that I called the Google BigQuery API to load the data from GCS, but I am facing the following encoding error.
file_data.part_0022.gz: Error detected while parsing row starting at position: 0. Error: Bad character (ASCII 0) encountered. (error code: invalid)
I am using the following Unix commands in my Python code for the split and gzip tasks.
cmd = [
    "split",
    "-l",
    "300000",
    "-d",
    "-a",
    "4",
    "%s%s" % (<my-dir>, file_name),
    "%s/%s.part_" % (<my temp dir>, file_prefix)
]
code = subprocess.check_call(cmd)
cmd = 'gzip %s%s/%s.part*' % (<my temp dir>, file_prefix, file_prefix)
logging.info("Running shell command: %s" % cmd)
code = subprocess.Popen(cmd, shell=True)
code.communicate()
The files are successfully split and gzipped (file_data.part_0001.gz, file_data.part_0002.gz, etc.), but when I try to load them into BigQuery it throws the above error. I understand it is an encoding issue.
Is there any way to re-encode the files during the split and gzip operations? Or do we need to use a Python file object to read line by line, do the Unicode encoding, and write it to a new gzip file (the Pythonic way)?
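For reference, the Pythonic way I have in mind would look roughly like this (an untested sketch; file names and the chunk size are illustrative, and it assumes the source is meant to be UTF-8 with stray bad bytes):
import gzip

src_path = 'file_data.tsv'   # illustrative input path
chunk_size = 300000          # lines per chunk, matching the split command above

# Read the source line by line, dropping undecodable bytes and NUL characters,
# and write UTF-8 encoded gzip chunks.
with open(src_path, 'r', encoding='utf-8', errors='ignore') as src:
    part, out = 0, None
    for i, line in enumerate(src):
        if i % chunk_size == 0:
            if out:
                out.close()
            out = gzip.open('file_data.part_%04d.gz' % part, 'wt', encoding='utf-8')
            part += 1
        out.write(line.replace('\x00', ''))
    if out:
        out.close()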
Reason:
Error: Bad character (ASCII 0) encountered
This clearly states that you have a Unicode (UTF-16) tab character there which cannot be decoded.
The BigQuery service only supports UTF-8 and latin1 text encodings, so the file is supposed to be UTF-8 encoded.
Solution (I haven't tested it): use the -a or --ascii flag with the gzip command. It should then be decoded OK by BigQuery.
I'm trying to decompress an ~8GB .zip file piped from a curl command. Everything I have tried gets interrupted at <1GB and returns a message:
... has more than one entry--rest ignored
I've tried funzip, gunzip, gzip -d, zcat, ... also with different arguments; all end with the above message.
The datafile is public, so it's easy to repro the issue:
curl -L https://archive.org/download/nycTaxiTripData2013/faredata2013.zip | funzip > datafile
Are you sure the mentioned archive deflates to a single file? If it extracts to multiple files, you unfortunately cannot unzip it on the fly.
Zip is a container as well as a compression format, and when streaming it you don't know where the next file begins. You'll have to download the whole file and then unzip it.
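For example, a sketch of the download-then-extract approach (the output directory name is illustrative):
curl -L -o faredata2013.zip https://archive.org/download/nycTaxiTripData2013/faredata2013.zip
unzip faredata2013.zip -d faredata2013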
I downloaded some files with the extension .tar.bz2. I was able to untar these into folders containing .bz2 files. These should decompress to HDF5 files (the metadata said they were HDF5), but they decompress into files with no extensions. I have tried the following, but it didn't work:
untar("File.tar.bz2")
#Read lines of one of the files from the unzipped file
readLines(bzfile("File1.bz2"))
[1] "‰HDF" "\032"
library (rhdf5)
#Explore just as a bzip2 file
bzfile("File1.bz2")
description "File1.bz2"
class "bzfile"
mode "rb"
text "text"
opened "closed"
can read "yes"
can write "yes"
#Try to read as hdf5 using rhdf5 library
h5ls(bzfile("File1.bz2"))
Error in h5checktypeOrOpenLoc(). Argument neither of class H5IdComponent nor a character.
Is there some sort of encoding I need to do? What am I missing? What should I do?
I have a bunch of Arabic, English, and Russian files which are encoded in UTF-8. Trying to process these files using a Perl script, I get this error:
Malformed UTF-8 character (fatal)
Manually checking the content of these files, I found some strange characters in them.
Now I'm looking for a way to automatically remove these characters from the files.
Is there any way to do it?
This command:
iconv -f utf-8 -t utf-8 -c file.txt
will clean up your UTF-8 file, skipping all the invalid characters (the cleaned output goes to standard output, so redirect it to a new file).
-f is the source format
-t the target format
-c skips any invalid sequence
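If doing the cleanup from a script is preferable to shelling out to iconv, a rough Python equivalent (just a sketch; the file names are illustrative):
# Re-read the file as UTF-8, silently dropping any invalid byte sequences,
# then write the cleaned text to a new file.
with open('file.txt', 'r', encoding='utf-8', errors='ignore') as f:
    cleaned = f.read()
with open('file_clean.txt', 'w', encoding='utf-8') as f:
    f.write(cleaned)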
Your method must read byte by byte and fully understand the byte-wise construction of characters. The simplest approach is to use an editor which will read anything but only output UTF-8 characters. TextPad is one choice.
iconv can do it:
iconv -f cp1252 foo.txt
None of the methods here or on any other similar questions worked for me.
In the end what worked was simply opening the file in Sublime Text 2. Go to File > Reopen with Encoding > UTF-8. Copy the entire content of the file into a new file and save it.
This may not be the expected solution, but I'm putting it out here in case it helps anyone, since I had been struggling with this for hours.