programmatically access IME - nlp

Is there a way to access a Japanese or Chinese IME either from the command line or from Python? I have Linux/OS X/Win8 boxes, so whichever system exposes the most easily accessible API is fine.
I'm experimenting with building a Japanese kana-kanji conversion algorithm and would like to establish a baseline using existing tools. I also have some collections of kana I would like to process.
Preferably I would like something along the lines of
$ ime JP "きしゃのきしゃがきしゃできしゃした"
貴社の記者が汽車で帰社した
I've looked at anthy, mozc and dbus on Linux but can't find any way to interact with them via the terminal or scripting (such as Python).

Anthy provides a cli tool
Personally, I prefer google's IME / mozc for better results, but perhaps this helps.
The source for anthy (sourceforge, file anthy-9100h.tar.gz) includes a simple cli program for testing. Download the source file, extract it, run
./configure && make
Enter the directory test which contains the binary anthy. By default, it reads from test.txt and uses EUC_JP encoding.
Simple test:
Input file test.txt
*にほんごにゅうりょく
*もももすももももものうち。
Run (using iconv to convert to UTF-8):
./anthy --all | iconv -f EUC-JP -t UTF-8
Output:
1:(にほんごにゅうりょく)
|にほんご|にゅうりょく
にほんご(日本語:(1,1000,N,72089)2500,001 ,にほんご:(N,0,-)2 ,ニホンゴ:(N,0,-)1 ,):
にゅうりょく(入力:(1,1000,N,62394)2500,001 ,にゅうりょく:(N,0,-)2 ,ニュウリョク:(N,0,-)1 ,):
2:(もももすももももものうち。)
|ももも|すももも|もものうち|。
ももも(桃も:(,1000,Ny,72089)225,279 ,ももも:(N,1000,Ny,72089)220,773 ,モモも:(,1000,Ny,72089)205,004 ,腿も:(,1000,Ny,72089)204,722 ,股も:(,1000,Ny,72089)146,431 ,モモモ:(N,0,-)1 ,):
すももも(すももも:(N,1000,Ny,72089)202,751 ,スモモも:(,1000,Ny,72089)168,959 ,李も:(,1000,Ny,72089)168,677 ,スモモモ:(N,0,-)1 ,):
もものうち(桃のうち:(,1000,N,655)2,047 ,もものうち:(N,1000,N,655)2,006 ,モモのうち:(,1000,N,655)1,863 ,腿のうち:(,1000,N,655)1,861 ,股のうち:(,1000,N,655)1,331 ,モモノウチ:(N,0,-)1 ,):
。(。:(1N,100,N,70203)57,040 ,.:(1,100,N,70203)52,653 ,.:(1,100,N,70203)3,840 ,):
You can uncomment some printf statements in the source files test/main.c and src-main/context.c to make the output more readable/parsable, eg:
1 にほんごにゅうりょく
にほんご 日本語
にゅうりょく 入力
2 もももすももももものうち。
ももも 桃も
すももも すももも
もものうち 桃のうち
。 。
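As an alternative to patching the C sources, the default --all output can be post-processed in the shell. The following is only a sketch: the helper name anthy_best_guess is made up, and it assumes the output format shown above, with the first candidate listed for each segment being the preferred one.

```shell
# Hypothetical helper: reads `anthy --all` output (already converted to UTF-8)
# on stdin and prints one line per sentence, built from the FIRST candidate
# of every segment. Assumes the output format shown in the sample above.
anthy_best_guess() {
    awk '
        /^[0-9]+:\(/ {              # "1:(reading)" starts a new sentence
            if (out != "") print out
            out = ""
            next
        }
        /^\|/ { next }              # "|seg|seg" segmentation line: skip
        {
            lparen = index($0, "(")
            if (lparen == 0) next
            rest = substr($0, lparen + 1)
            colon = index(rest, ":")
            # Text between "(" and the first ":" is the first candidate.
            if (colon > 1) out = out substr(rest, 1, colon - 1)
        }
        END { if (out != "") print out }
    '
}
```

It would be used as ./anthy --all | iconv -f EUC-JP -t UTF-8 | anthy_best_guess, which gets close to the one-line-in, one-line-out interface asked for in the question.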

Related

Is it possible to display a file's contents and delete that file in the same command?

I'm trying to display the output of an AWS lambda that is being captured in a temporary text file, and I want to remove that file as I display its contents. Right now I'm doing:
... && cat output.json && rm output.json
Is there a clever way to combine those last two commands into one command? My goal is to make the full combined command string as short as possible.
If you can control the name of the temporary text file, and the file is not used by other code, you can pass "/dev/stdout" as the name of the output file.
Regarding portability, see the Stack Exchange question "how portable ... /dev/stdout".
POSIX 7 says they are extensions.
Base Definitions,
Section 2.1.1 Requirements:
The system may provide non-standard extensions. These are features not required by POSIX.1-2008 and may include, but are not limited to:
[...]
• Additional character special files with special properties (for example,  /dev/stdin, /dev/stdout,  and  /dev/stderr)
Using /dev/tty instead, which POSIX does require, would force the output onto the current terminal. That makes it impossible to pipe the output of the whole command into another program (or a log file), or to run the command when no terminal is attached (cron jobs or other automation tools).
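A minimal sketch of the /dev/stdout idea. The function make_report is a hypothetical stand-in for any program that insists on writing to a named output file; it is not part of the original command.

```shell
# Hypothetical producer that only knows how to write to a named output file.
make_report() {
    printf '{"status": "ok"}\n' > "$1"
}

# Passing /dev/stdout as the "file name" sends the data straight to stdout,
# so there is no temporary file left to display and delete afterwards.
make_report /dev/stdout
```

Because the data goes to standard output rather than to the terminal, it can still be piped or redirected, which is the advantage over /dev/tty described above.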
No, you cannot easily remove the lines of a file while displaying them. It would be highly inefficient as it would require removing characters from the beginning of a file each time you read a line. Current filesystems are pretty good at truncating lines at the end of a file, but not at the beginning.
A simple but extremely slow method would look like this:
while [ -s output.json ]
do
    head -n 1 output.json
    sed -i 1d output.json
done
While this algorithm is plain and simple, you should know that each time you remove the first line with sed -i 1d it will copy the whole content of the file but the first line into a temporary file, resulting in approximately 0.5*n² lines written in total (where n is the number of lines in your file).
In theory you could avoid this by doing something like this:
while [ -s output.json ]
do
    line=$(head -n 1 output.json)
    printf -- '%s\n' "$line"
    # Collapse the range holding the line just printed, plus its newline
    fallocate -c -o 0 -l $((${#line}+1)) output.json
done
But this does not account for variable newline characters (namely DOS-formatted newlines) and fallocate does not always work on xfs, among other issues.
Since you are trying to consume a file alongside its creation without leaving a trace of its existence on disk, you are essentially asking for a pipe functionality. In my opinion you should look into how your output.json file is produced and hopefully you can pipe it to a script of your own.
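If the producer cannot be changed to write to a pipe, the two commands can at least be wrapped once so that the command line stays short. This is a sketch; catrm is a made-up name, not an existing utility.

```shell
# Hypothetical helper: print a file, then delete it only if printing succeeded.
catrm() {
    cat -- "$1" && rm -- "$1"
}
```

With the function defined in the shell profile, the tail of the original command becomes ... && catrm output.json.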

Unicode character not visible while doing cat

I have a CSV file generated by a Windows system. The file is then moved to Linux. The Linux environment is NAME="Red Hat Enterprise Linux Server", VERSION="7.3 (Maipo)", ID="rhel".
When I use the vi editor, all characters are visible. For example, one line reads: "Sarah--bitte nicht löschen".
But when I cat the file, I get something like "Sarah--bitte nicht l▒schen".
This file is consumed by a DataStage application, and these Unicode characters show up as "?" in DataStage. Since cat is not showing the character properly, I believe the issue is on the Linux server. Any help is appreciated.
vi reads the file using the encoding given by its fenc (fileencoding) setting and displays the content according to your locale setting (the $LANG environment variable). If fenc differs from LANG, vi converts between the two.
cat does no such conversion; it always outputs the exact byte stream unchanged.
Your terminal then renders the output of both vi and cat using the locale setting of your local PC.
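Since cat does no conversion, one option is to convert the file once before it is displayed or consumed. The sketch below assumes the Windows-generated file is encoded as Windows-1252 and that the Linux side expects UTF-8; neither is stated in the question, so check the real encoding first (for example with file -i). The function name to_utf8 is made up.

```shell
# Assumption: input is Windows-1252, desired output is UTF-8.
to_utf8() {
    iconv -f WINDOWS-1252 -t UTF-8
}

# Demo: byte 0xF6 (octal 366) is "ö" in Windows-1252.
printf 'l\366schen\n' | to_utf8
```

For a real file this would be to_utf8 < input.csv > input.utf8.csv, where both file names are placeholders.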

From Mac iOS to Linux

I have a binary CDR file and a Perl script that decodes it.
Now I use Mac, but next week I will start to use Linux.
I've never used Linux before.
Currently I decode my files in the terminal with this command:
cat 201301101536_00240349.cdr | ./huawei2text.pl > ~/Desktop/201301101536_00240349.txt &&
cat ~/Desktop/201301101536_00240349.txt | tr "," "\n" > ~/Desktop/out_201301101536_00240349.txt
With this command I decode the file with the Perl decoder and write the result to a txt file, then replace each "," with "\n" (a new line) and save that to a new txt file.
Now my question: does the same command do all of this on Linux?
Thank you in advance
In short: yes.
The only caveat would be significantly different Perl versions on your Mac and your Linux distribution, but I doubt that will be an issue.

Print contents of a PDF to the command line

I'm looking for a command-line program that will print out the text of a PDF file, just like cat for a text file.
I've found pdftotext, and that would be workable, but I'd prefer something that replicates the cat functionality because I want to pipe to grep. Thanks!
On the man pages for pdftotext, I found this:
pdftotext [options] [PDF-file [text-file]]
Description
Pdftotext converts Portable Document Format (PDF) files to plain text.
Pdftotext reads the PDF file, PDF-file, and writes a text file, text-file. If text-file is not specified, pdftotext converts file.pdf to file.txt. If text-file is '-', the text is sent to stdout.
Thus to output to stdout in order to pipe to grep use this:
pdftotext mydoc.pdf - | grep mysearchterm
You could also try this: https://github.com/luochen1990/nodejs-easy-pdf-parser
It is an npm package, so you need to install nodejs (and npm) to use it.
It can be used as a command line tool:
npm install -g easy-pdf-parser
pdf2text test.pdf > test.txt
This tool sorts text lines by their y coordinates, so it works well in most cases. It also handles Unicode correctly and is cross-platform (by comparison, mingw64's pdftotext loses Unicode characters on Windows).

help - change diff symbol "<", "|" or ">" to a desired one?

I use the diff -w command to create a side-by-side comparison diff file (instead of parallel).
I then view the output in vi over an SSH terminal.
The changes are indicated by "<", "|" or ">".
Since the files I am viewing are C source code, navigating between changes using these symbols alone is difficult, because they also occur in the code itself.
How can I change these default symbols to ones of my choosing?
Kindly help. Thanks.
Instead of viewing the output of diff -w in vim, you can use vim's built-in diff:
vim -d file1 file2
This opens vim in a vertical split with both files open and the differences marked in the code. It works in a terminal too.
According to my version of diff (2.8.1 from the GNU diffutils by the FSF), -w ignores all whitespace (it is -W that sets the width of the output); the -y option produces the side-by-side comparison. Since -w alone does not produce side-by-side output, you may have an alias in your terminal profile or in the global terminal profile that maps diff to diff -y.
I say all this because all options to change the symbols ("<", "|", and ">") conflict with the -y option. If you can live without side-by-side, you have the option of two other included output styles or defining your own. The two output styles are -c (context) and -u (unified). (For more information on what they do see the diff Wikipedia page. For more information on the options see the diff man page.)
A more in depth fix would be to use the following options:
diff --old-group-format="(deleted)---" \
--new-group-format="(added)---" \
--changed-group-format="(updated)---" \
--unchanged-group-format="(nodiff)---" \
old_file.c new_file.c
Now the old file's lines that are not present in the new file are represented by (deleted)---
The new file's lines that are not present in the old file are represented by (added)---
Lines that have been changed are represented by (updated)---
Lines common to both files are represented by (nodiff)---
Since you seem to do this often enough, you have the option of making it an alias in your terminal profile or writing a small shell script to handle it. For more options, see the manual's section on options and specifically see the section on line group formats for information on what you can put between the quotes in the format definitions.
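Such a script could look like the sketch below. It assumes GNU diff and adds the %<, %> and %= placeholders, which insert the lines from the old file, the lines from the new file, and the lines common to both, so that each marker is followed by the lines it refers to. The function name annotated_diff and the "(to)---" separator are made up for illustration, and unchanged lines are passed through without a marker here.

```shell
# Hypothetical wrapper around GNU diff group formats.
# %< = lines from the old file, %> = lines from the new file, %= = common lines.
annotated_diff() {
    diff \
        --old-group-format='(deleted)---
%<' \
        --new-group-format='(added)---
%>' \
        --changed-group-format='(updated)---
%<(to)---
%>' \
        --unchanged-group-format='%=' \
        "$1" "$2"
}
```

It is called as annotated_diff old_file.c new_file.c. As with plain diff, the exit status is 1 when the files differ, which matters if it is used inside a script that stops on errors.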
Of course, if you must have side-by-side, try Nathan Fellman's idea above. Otherwise, there's the option of using a dedicated GUI tool for it such as Kompare.
