OpenNLP POSTagger output from command line

I want to use OpenNLP to tokenize Thai words. I downloaded OpenNLP and the Thai tokenizer model and ran the following:
./bin/opennlp POSTagger -lang th -model thai.tok.bin < sentence.txt > output.txt
I put the downloaded thai.tok.bin in the directory I run the command from. sentence.txt contains this text: กินอะไรยังนาย. However, the output I got contains only this text:
Usage: opennlp POSTagger model < sentences
Execution time: 0.000 seconds
I'm pretty new to OpenNLP; please let me know if anyone knows how to get output from it.

The models from your link are outdated. First you need some manual steps to convert the model.
Download the file thai.tok.bin.gz and extract it to an empty folder. Rename the extracted file thai.tok.bin to token.model.
In the same folder, create a file named manifest.properties with the following contents:
Manifest-Version=1.0
Language=th
OpenNLP-Version=1.5.0
Component-Name=TokenizerME
useAlphaNumericOptimization=false
Now you can zip the files, if you are using Linux you can use this command: zip thai.tok.bin token.model manifest.properties
Try your model:
sh bin/opennlp TokenizerME ~/Downloads/thai-token.bin/thai.tok.bin < thai_sentence.txt
Loading Tokenizer model ... done (0,097s)
กินอะไร ยังนาย
Average: 333,3 sent/s
Total: 1 sent
Runtime: 0.003s
Execution time: 0,108 seconds
Now that you have the updated tokenizer, you can do the same with the POS tagger model.
Download the file thai.tag.bin.gz and extract it to an empty folder. Rename the extracted file thai.tag.bin to pos.model.
In the same folder, create a file named manifest.properties with the following contents:
Manifest-Version=1.0
Language=th
OpenNLP-Version=1.5.0
Component-Name=POSTaggerME
Now you can zip the files, if you are using Linux you can use this command: zip thai.pos.bin pos.model manifest.properties
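The manual steps above (extract, rename, write a manifest, zip) can also be scripted. A minimal sketch using only the Python standard library; the file and component names simply follow the instructions above:

```python
import gzip
import shutil
import zipfile

def repackage_model(gz_path, out_zip, model_name, component, lang='th'):
    """Convert an old-format OpenNLP model (.bin.gz) into a 1.5-style package."""
    # 1. Extract the gzipped model under the name OpenNLP 1.5 expects.
    with gzip.open(gz_path, 'rb') as src, open(model_name, 'wb') as dst:
        shutil.copyfileobj(src, dst)
    # 2. Write the manifest describing the component.
    with open('manifest.properties', 'w') as f:
        f.write('Manifest-Version=1.0\n'
                'Language=%s\n'
                'OpenNLP-Version=1.5.0\n'
                'Component-Name=%s\n' % (lang, component))
    # 3. Zip both files into the new model package.
    with zipfile.ZipFile(out_zip, 'w') as zf:
        zf.write(model_name)
        zf.write('manifest.properties')

# e.g. repackage_model('thai.tag.bin.gz', 'thai.pos.bin', 'pos.model', 'POSTaggerME')
```

The same function covers the tokenizer case with `repackage_model('thai.tok.bin.gz', 'thai.tok.bin', 'token.model', 'TokenizerME')`, plus the extra `useAlphaNumericOptimization=false` manifest line noted earlier.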
Finally, we can try the two models combined:
sh bin/opennlp TokenizerME ~/Downloads/thai-token.bin/thai.tok.bin < thai_sentence.txt > thai_tokens.txt
sh bin/opennlp POSTagger ~/Downloads/pt-pos-maxent/thai.pos.bin < thai_tokens.txt
The result is:
กินอะไร_VACT ยังนาย_NCMN
Please, let me know if this is the expected result.

Related

Execute a subprocess that takes an input file and write the output to a file

I am using a third-party C++ program to generate intermediate results for the Python program that I am working on. The terminal command that I use looks as follows, and it works fine.
./ukb/src/ukb_wsd --ppr_w2w -K ukb/scripts/wn30g.bin -D ukb/scripts/wn30_dict.txt ../data/glass_ukb_input2.txt > ../data/glass_ukb_output2w2.txt
If I break it down into smaller pieces:
./ukb/src/ukb_wsd - executable program
--ppr_w2w - one of the options/switches
-K ukb/scripts/wn30g.bin - parameter K indicates that the next item is a file (network file)
-D ukb/scripts/wn30_dict.txt - parameter D indicates that the next item is a file (dictionary file)
../data/glass_ukb_input2.txt - input file
> - shell command to write the output to a file
../data/glass_ukb_output2w2.txt - output file
The above works fine for one instance. I am trying to do this for around 70000 items (input files), so I found a way using the subprocess module in Python. The body of the Python function that I created looks like this:
with open('../data/glass_ukb_input2.txt', 'r') as input, open('../data/glass_ukb_output2w2w_subproc.txt', 'w') as output:
    subprocess.run(['./ukb/src/ukb_wsd', '--ppr_w2w', '-K', 'ukb/scripts/wn30g.bin', '-D', 'ukb/scripts/wn30_dict.txt'],
                   stdin=input,
                   stdout=output)
When I execute the function, it gives an error as follows:
...
STDOUT = subprocess.STDOUT
AttributeError: module 'subprocess' has no attribute 'STDOUT'
Can anyone shed some light on solving this problem?
EDIT
The error was due to a file named subprocess.py in the source directory, which masked Python's subprocess module. Once it was removed, the error went away.
But the program could not find the input file given via stdin. I think it has to do with there being 3 input files. Is there a way to provide more than one input file?
EDIT 2
This problem is now solved with the current approach:
subprocess.run('./ukb/src/ukb_wsd --ppr_w2w -K ukb/scripts/wn30g.bin -D ukb/scripts/wn30_dict.txt ../data/glass_ukb_input2.txt > ../data/glass_ukb_output2w2w_subproc.txt',shell=True)
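This can also be done without shell=True by passing the input file as a positional argument, the same way the working shell command does (which would also explain why feeding it via stdin had no effect - an inference from the command line above, not something I can verify against ukb_wsd itself), and redirecting only stdout. A sketch:

```python
import subprocess

def build_cmd(in_path):
    # Argument list mirroring the working shell command; the input file is a
    # positional argument, not stdin.
    return ['./ukb/src/ukb_wsd', '--ppr_w2w',
            '-K', 'ukb/scripts/wn30g.bin',
            '-D', 'ukb/scripts/wn30_dict.txt',
            in_path]

def run_ukb(in_path, out_path):
    # Only stdout needs redirecting; no shell=True required.
    with open(out_path, 'w') as out:
        subprocess.run(build_cmd(in_path), stdout=out, check=True)
```

For the ~70000 inputs, `run_ukb` can then be called in a plain loop over the input paths, avoiding a shell invocation per file.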

why does the -f switch of kaggle API download not recognize the file name passed to it as a string

I want to extract a subset of image files from the Kaggle dataset 'hpa-single-cell-image-classification'. I tried to use the Kaggle API.
Using the command below, when I download an individual image, it downloads fine:
!kaggle competitions download -c hpa-single-cell-image-classification -f /train/5c27f04c-bb99-11e8-b2b9-ac1f6b6435d0_blue.png
but when I try to pass it through a loop (kaggle_img_names.csv contains the names of the images):
with open('kaggle_img_names.csv','r') as fh:
    data=fh.readlines()
data=[item.strip() for item in data]
data=data[1:10]
for file in data:
    print(file)
    !kaggle competitions download -c hpa-single-cell-image-classification -f file
it shows 404 - file not found.
I have realized that with quotes around the file name, the API also says file not found:
!kaggle competitions download -c hpa-single-cell-image-classification -f '/train/5c27f04c-bb99-11e8-b2b9-ac1f6b6435d0_blue.png'
How do I pass the name of the file to the API such that the API processes it? More than downloading the images, I want to know why the -f switch of the API does not recognize the string object (file name) passed to it. What is the type of the object passed to the -f switch? Is it something other than a string?
Thanks in advance !
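The 404 in the loop above most likely comes from passing the literal word file: inside an IPython/Jupyter cell, a `!` shell line does not see Python variables unless they are wrapped in braces, e.g. `-f {file}`. A minimal sketch of building the command from the variable's value (the sample name is the one from the question, standing in for a row of kaggle_img_names.csv):

```python
# Hypothetical sample name standing in for the rows of kaggle_img_names.csv.
names = ['/train/5c27f04c-bb99-11e8-b2b9-ac1f6b6435d0_blue.png']

for name in names:
    # In a notebook this would be:
    #   !kaggle competitions download -c hpa-single-cell-image-classification -f {name}
    # The braces make IPython substitute the variable's value; writing
    # "-f file" passes the literal four characters f-i-l-e to the API.
    cmd = ('kaggle competitions download '
           '-c hpa-single-cell-image-classification -f ' + name)
    print(cmd)
```

So the object passed to -f is an ordinary string; the problem is that without `{}` interpolation the string being passed is "file" itself.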

Unzipping image directory in Google Colab doesn't unzip entire contents

I'm trying to unzip a directory of 75,000 images for training a CNN. When unzipping using,
!unzip -uq "/content/BDD_gt_8045.zip" -d "/content/drive/My Drive/Ground_Truth"
not all images unzip. Only about 5,000 do, I believe. I tried doing it several times, but then I end up with some duplicates. Is there a limit to the number of images I can unzip?
I'm currently stuck on how else I'm meant to get all files into my drive to train the model.
Colab's default 'unzip' binary doesn't work as expected; it seems to cancel the unzipping automatically after a few cycles. Run the latest version of 7z and you are good to go.
# To extract with full paths
!7z x <filename.zip>
# To extract all the files in the same folder (ignore paths)
!7z e <filename.zip>
# To specify output directory, use '-o'
!7z x <filename.zip> -o '/content/drive/My Drive/Datasets/FashionMNIST'
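As an alternative when 7z is not available, Python's standard zipfile module reads the archive's full central directory, so no entries should be silently skipped. A self-contained sketch with a small stand-in archive (the demo file names are illustrative, not from the question):

```python
import os
import zipfile

# Build a small zip to stand in for the real archive (assumption: the
# extraction step works the same for any zip).
os.makedirs('demo_src', exist_ok=True)
for i in range(3):
    with open('demo_src/img_%d.png' % i, 'wb') as f:
        f.write(b'\x89PNG demo bytes')
with zipfile.ZipFile('demo.zip', 'w') as zf:
    for i in range(3):
        zf.write('demo_src/img_%d.png' % i)

# Extract every entry listed in the central directory.
with zipfile.ZipFile('demo.zip') as zf:
    zf.extractall('demo_out')

print(sorted(os.listdir('demo_out/demo_src')))
```

In Colab the target directory would be the mounted Drive path, e.g. `zf.extractall('/content/drive/My Drive/Ground_Truth')`.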

How to write UD Pipe tagger output to file?

I have been using UD Pipe to train and tag data in the Hindi Language.
I run the tagger using
udpipe --tag model.output hi-ud-test.conllu
which works perfectly fine and displays the output in command line. How do I write this output in a file?
Simply use the command
udpipe --tag model.output hi-ud-test.conllu > output.txt
This will write your output to a text file called output.txt instead of displaying it in the terminal.

Using chef how to update a file after reading input from another file?

I am using Linux commands to generate a file a.txt. Now I need to read the first word of a.txt and update an existing file called b.txt: I will search for the word "/fill" in b.txt and replace it with the word read from a.txt. Below is the code:
bash 'example' do
  code <<-EOH
    cat ex.txt >> a.txt
  EOH
end

test = /#{'cat /a.txt'}/
file_names = ['/b.txt']
file_names.each do |file_name|
  text = File.read(file_name)
  new_contents = text.gsub(/fill, test)
  puts new_contents
  File.open(file_name, "w") { |file| file.puts new_contents }
end
With the Linux command cat ex.txt >> a.txt I append the content of ex.txt to a.txt.
After this I want to read a.txt with test = /#{'cat /a.txt'}/. Say a.txt contains the word "azure".
Now in b.txt I want to search for the word "/fill" and replace it with the content read in step 2 from a.txt, i.e. azure.
The problem is that instead of replacing /fill with azure, /fill is getting replaced with cat /a.txt.
I hope it's clear now. Can you please help here?
It is a bit hard to follow what you actually want to achieve. Your code has a few issues. General advice:
Put your Ruby code inside a ruby_block resource so that it is executed during Chef's convergence phase.
Use Chef::Util::FileEdit if you want to edit files that are not entirely managed by Chef (see this question for more inspiration).
In case you really want to write the complete file using Chef, use a file resource and specify the content based on what you've read using File.read.
As said, Ruby code outside of a ruby_block is executed in the compile phase (which precedes the convergence phase). If this is too early (because the source file isn't there yet), you can use a lazy block for lazy evaluation:
file "b.txt" do
  content lazy { File.read .. }
end
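Stripped of the Chef DSL (in a real recipe this logic would sit inside a ruby_block so it runs at converge time), a sketch of the read-and-replace itself; the file contents here are stand-ins for the real a.txt and b.txt:

```ruby
# Stand-in files so the sketch is runnable; the real paths come from the recipe.
File.write('a.txt', "azure\n")
File.write('b.txt', "provider = /fill\n")

# Read the first word of a.txt. This is the fix for the original bug:
# /#{'cat /a.txt'}/ interpolates the literal string 'cat /a.txt' into a
# regex -- it never runs the command or reads the file.
first_word = File.read('a.txt').split.first

# Replace the literal token "/fill"; a string pattern avoids regex escaping.
File.write('b.txt', File.read('b.txt').gsub('/fill', first_word))

puts File.read('b.txt')   # prints: provider = azure
```

The same first_word value could also feed the file resource above via `content lazy { ... }`.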
