Annotating a Corpus (SyntaxNet) - nlp

I downloaded and installed SyntaxNet following the official SyntaxNet documentation on GitHub. Following the documentation (annotating a corpus), I tried to have SyntaxNet read a .conll file named wj.conll and write the results to wj-tagged.conll, but I could not. My questions are:
Does SyntaxNet always read .conll files (not .txt files)? I got a bit confused, as I knew SyntaxNet reads .conll files for the training and testing process, but I suspect it is necessary to convert a .txt file to a .conll file in order to get its part-of-speech tags and dependency parses.
How can I make SyntaxNet read from files? (I tried all the ways explained in the GitHub documentation on SyntaxNet and it didn't work for me.)

Add these declaration lines to "context.pbtxt" at the end of the file. Here "inp" and "out" are text files present in the root directory of syntaxnet.
input {
  name: 'inp_file'
  record_format: 'english-text'
  Part {
    file_pattern: 'inp'
  }
}
input {
  name: 'out_file'
  record_format: 'english-text'
  Part {
    file_pattern: 'out'
  }
}
Add the sentences you want tagged to the "inp" file, and specify the two files in the shell the next time you run SyntaxNet using the --input and --output flags.
Just to help you a bit more, here is an example shell command.
bazel-bin/syntaxnet/parser_eval \
--input inp_file \
--output stdout-conll \
--model_path syntaxnet/models/parsey_mcparseface/tagger-params \
--task_context syntaxnet/models/parsey_mcparseface/context.pbtxt \
--hidden_layer_sizes 64 \
--arg_prefix brain_tagger \
--graph_builder structured \
--slim_model \
--batch_size 1024 | bazel-bin/syntaxnet/parser_eval \
--input stdin-conll \
--output out_file \
--hidden_layer_sizes 512,512 \
--arg_prefix brain_parser \
--graph_builder structured \
--task_context syntaxnet/models/parsey_mcparseface/context.pbtxt \
--model_path syntaxnet/models/parsey_mcparseface/parser-params \
--slim_model --batch_size 1024
In the above script, the output (POS tags) of the first shell command is used as the input to the second shell command, the two commands being joined by the pipe "|".

Just a quick tip if you want to save the output of demo.sh to a .txt file:
try echo "open file X with application Y" | ./demo.sh > output.txt
This writes the sentence tree to output.txt in the current directory.
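If you want to annotate a whole plain-text file this way rather than a single echoed sentence, you can also pipe the file through demo.sh. A minimal sketch, assuming a file named wj.txt with one sentence per line (both file names are placeholders):
# feed an entire text file through the demo pipeline and save the parses
cat wj.txt | ./demo.sh > wj-tagged.txt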

Related

HaplotypeCaller provides more variants than expected

I used HaplotypeCaller with its standard command line for variant calling on a WES picard.sorted.MarkedDup.bam file with GATK 4.2.6.1.
Apparently everything worked well and I received a standard .vcf file, but the number of identified variants is far too high for a WES result: close to one million variants for one sample!
Did I do something wrong?
What solution do you recommend?
Any help would be appreciated.
The command line I used was as follows:
gatk --java-options -Xmx8g HaplotypeCaller \
  -R $refFile \
  -I ${base}.picard.sorted.markedDup.bam \
  --dont-use-soft-clipped-bases -stand-call-conf 20.0 \
  --emit-ref-confidence GVCF \
  -O ${base}.rrrrealigned.vcf
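Worth noting: with --emit-ref-confidence GVCF the output is a GVCF, which contains reference-confidence records for non-variant positions as well as variant sites, so a raw record count will be much larger than the number of actual variants. A sketch of the usual joint-genotyping follow-up step, reusing the variables above (the output file name is an assumption):
# GVCF -> final VCF: joint genotyping collapses the reference-confidence records
gatk GenotypeGVCFs \
  -R $refFile \
  -V ${base}.rrrrealigned.vcf \
  -O ${base}.genotyped.vcf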

Get All AWS Lambda Functions and Their Tags and Output to CSV

My bash script retrieves Lambda functions and their tags.
It runs OK and does what it needs to do; however, I need the output written to a .txt or .csv file in a readable format.
Below is the script I have:
#!/bin/bash
while read -r name; do
aws lambda list-functions | jq -r ".Functions[].FunctionArn" | xargs -I {} aws lambda list-tags --resource {} --query '{"{}":Tags}' --output text
done
Below is what a returned value looks like after the script runs:
ARN:AWS:LAMBDA:EU-WEST-1:1939999:FUNCTION:example-lambda EXX dev example-lambda False release-1.1.9 False True
I need all the returned items lined up neatly in a .txt or .csv file. Any help would be appreciated.
I would recommend using the resourcegroupstaggingapi API to solve this problem. This API allows you to get all resources of a specific type together with their tags.
To get all your Lambda functions for your default region and their tags you can run the following command:
aws resourcegroupstaggingapi get-resources --resource-type-filters "lambda"
The output of this command can now be parsed with jq. The great thing about jq is that you can manipulate the output to be CSV.
To get CSV output with two columns (ARN, Tags) you can run the following command:
aws \
resourcegroupstaggingapi \
get-resources \
--resource-type-filters "lambda" \
| jq -r '.ResourceTagMappingList[] | [.ResourceARN, ((.Tags | map([.Key, .Value] | join("="))) | join(","))] | @csv'
The advantage of this approach is that it makes only a single HTTP call, which keeps it relatively fast. The disadvantage is that you only get the ARN and the tags.
As shimo mentioned in a comment on the question, one way to save the output of a command to a file is the > operator.
The > operator replaces the existing content of the file with the output of the command. If you want to collect the output of multiple commands in the same file, use the >> operator, which appends instead.
You can also use a pipe and the tee command: the output is then printed to your screen as well as written to a file (tee -a appends to the end of an existing file).
I found this tutorial helpful. Based on what you've written, you could pipe the output of your command into a CSV file, or concatenate the results into an array and then write it out with a newline character at the end of each line.
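Putting the pieces together, a minimal sketch (untested; the output file name lambda-tags.csv is an assumption) that writes one ARN-plus-tags row per function to a CSV file while still echoing it to the screen:
# list ARNs and tags for all Lambda functions, emit CSV, display and save it
aws resourcegroupstaggingapi get-resources \
  --resource-type-filters "lambda" \
  | jq -r '.ResourceTagMappingList[] | [.ResourceARN, ((.Tags | map([.Key, .Value] | join("="))) | join(","))] | @csv' \
  | tee lambda-tags.csv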

Target for make gives "nothing to be done"

I have an issue with "make" (Oh, the horror!).
We're trying to migrate some COBOL code from Windows to Linux. The compiler and such are from Micro Focus. Under Windows the code is developed with Micro Focus Net Express. Linux has Micro Focus Server Express as the equivalent. The programs are compiled and linked using "make" scripts.
So much for the background.
The problem is a "make" script that doesn't want to compile and link an executable under Linux. The targets look like this:
# HP INIT-Daten laden
#
datLoad$O: \
	$(UI)/defretrn.cpy \
	$(UI)/e12sy00s.cpy \
	$(UI)/e12sy005.cpy \
	$(UI)/e12sy006.cpy \
	$(UI)/e12sy010.cpy \
	$(UI)/e12sy013.cpy \
	$(UI)/e12sy050.cpy \
	$(UI)/e12db001.cpy \
	$(UI)/e12db050.cpy \
	$(UI)/evlg.cpy \
	$(UI)/deffehl.cpy \
	datLoad.xcbl $(FRC)
#	#echo "dollar-O is \"$O\" in $@"
datLoad$X: $(LIBDSQL) datLoad$O \
	$(LP)/evlg$O $(LP)/alock$O
	$(LCOB) -o $(@:$X=) -e $(@:$X=) $(LCOBFLAGS) \
		-d e12db001 -d e12db003 -d e12db012 \
		-d e12sy005 -d e12sy006 -d e12sy009 \
		-d e12sy010 -d e12sy012 -d e12sy013 \
		-d e12sy050 \
		-I EvLgSetCategory $(LP)/evlg$O \
		-I ALckSetDebug $(LP)/alock$O \
		$(LIBEXEEXT) "$(LIBXSQL)"
	if [ -f $B/$@ -a ! -w $B/$@ ] ; then rm -f $B/$@ ; fi
	cp $@ $B
To put this into context, $O = ".o" (i.e. an object-file extension). $(LCOB) is the link command. $X = ".exe" (an executable; just forget about the extension, we'll fix that in due course). All the other stuff relates to paths, so it is not relevant to the issue at hand, and yes, they have all been checked and verified.
Ultimately, I am trying to get "make" to resolve a target called "datLoad.o".
Included is a second "make" script containing the following:
COBFLAGS = -cx # create object file
GNTFLAGS = -ug # create .gnt file
SOFLAGS = -z # create
LCOB = cob
...
.cbl$O:
	$(CCOB) $(COBFLAGS) $*.cbl, $*$O, NUL, NUL
	if [ -f $(LP)/$*$O -a ! -w $(LP)/$*$O ] ; then rm -f $(LP)/$*$O ; fi
	cp $*$O $(LP)
The relevant part is the target, which resolves to ".cbl.o:". Yes, that's the shorthand version, and I don't really like it, but I did not write this script. I'm assured that it really means *.o: *.cbl, and other similar constructs in the script do work correctly.
With a simple "make" I get a link error:
In function `cbi_entry_point': (.data+0x384): undefined reference to `datLoad'
/tmp/cobwZipkt/%cob0.o: In function `main': (.text+0x28): undefined reference to `datLoad'
make: *** [datLoad.exe] Error 1
That means datLoad.o was not created. If I do create it explicitly with:
cob -cx datload
Then "make" still gives the same error as above. Weird! However, what I really cannot understand is the response I get from "make datLoad.o" when the target does not exist:
make: Nothing to be done for `datLoad.o'.
I assumed (heaven help me) that the target "datLoad.o" would try to create the required target file if that file does not already exist. Am I going mad?
Sorry if this seems a bit obscure, I'm not sure how to phrase it better. If anybody has an idea what might be going on, I'd be really grateful...
Thank you Mad Scientist, your tip was correct.
The included .mk contained a .SUFFIXES rule. The problem was that $O was not being used consistently: $O was originally set to ".obj" for Windows, while under Linux it is ".o". However, the .SUFFIXES rule had ".obj" hard-coded into it, so of course the ".o" targets were not being recognised. I replaced the hard-coded suffix with the $O variable and it now works.
Achim
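To make the .SUFFIXES point concrete, here is a minimal sketch with made-up file names showing that a double-suffix rule only fires when both suffixes are declared (note that the recipe line must begin with a TAB):
cat > Makefile.demo <<'EOF'
.SUFFIXES: .cbl .o
.cbl.o:
	@echo compiling $< to $@
EOF
touch datLoad.cbl
make -f Makefile.demo datLoad.o   # prints: compiling datLoad.cbl to datLoad.o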

Export vim syntax highlighting to latex

I would like to exploit vim's syntax highlighting capabilities to highlight code (in any language) in LaTeX (using the xcolor package). I therefore wonder whether it is possible to have a vim script export vim's internal information about the highlighted text in the buffer. It would be sufficient to know the start, end, and color of each highlighted entity; generating the LaTeX code, or other output such as HTML, would then be straightforward.
You can use my formatvim plugin: it can export to the latex-xcolor format with
:Format format latex-xcolor
If you are not happy with the result (it is completely untested and I almost never used this option), feel free to send patches. The dictionary with the format specification can be seen here; everything you need to create your own format is in the documentation.
Note: if you need to export to any other language, all you need to do is write a specification for it in terms of my plugin. Here is code that will add a latex-xcolor-clone format to my plugin:
scriptencoding utf-8
execute frawor#Setup('0.0', {'plugin/format': '3.0'})
let s:texescape=
\'substitute('.
\ 'substitute(###, ''\v[\\\[\]{}&$_\^%#]'', '.
\ '''\=''''\char''''.char2nr(submatch(0))."{}"'', '.
\ '"g"),'.
\'" ", ''\\enskip{}'', "g")'
let s:texstylestart=
\'((#inverse#)?'.
\ '(''\colorbox[HTML]{''.'.
\ '((#fgcolor#!=#"")?'.
\ '(toupper(#fgcolor#[1:])):'.
\ '(toupper(#_fgcolor#[1:])))."}{".'.
\ '''\textcolor[HTML]{''.'.
\ '((#bgcolor#!=#"")?'.
\ '(toupper(#bgcolor#[1:])):'.
\ '(toupper(#_bgcolor#[1:])))."}{"):'.
\ '(((#bgcolor#!=#"")?'.
\ '(''\colorbox[HTML]{''.toupper(#bgcolor#[1:])."}{"):'.
\ '("")).'.
\ '''\textcolor[HTML]{''.'.
\ '((#fgcolor#!=#"")?'.
\ '(toupper(#fgcolor#[1:])):'.
\ '(toupper(#_fgcolor#[1:])))."}{"))'
let s:texstyleend=
\'repeat("}", '.
\ '((#inverse#)?'.
\ '(2):'.
\ '((#bgcolor#!=#"")+1)))'
let s:format={
\'begin': '\documentclass[a4paper,12pt]{article}'.
\ '\usepackage[utf8]{inputenc}'.
\ '\usepackage[HTML]{xcolor}'.
\ '\pagecolor[HTML]{%''toupper(#_bgcolor#[1:])''%}'.
\ '\color[HTML]{%''toupper(#_fgcolor#[1:])''%}'.
\ '\begin{document}{\ttfamily\noindent',
\'line': '%>'.s:texstylestart.".".
\ s:texescape.".".
\ s:texstyleend,
\'lineend': '\\',
\'end': '}\end{document}',
\'strescape': s:texescape,
\}
call s:_f.format.add('latex-xcolor-clone', s:format)
The :TOhtml command is built into Vim. It, rather obviously, generates HTML rather than LaTeX, though.
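If HTML is acceptable as an intermediate format, :TOhtml can also be driven non-interactively. A minimal sketch following the recipe in Vim's 2html documentation, where code.py stands in for your own file (the result is written next to it as code.py.html):
# run vim in batch mode and export the syntax-highlighted buffer to HTML
vim -E -s -c 'syntax on' -c 'runtime syntax/2html.vim' -c 'wqa' code.py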

Synchronize shell script execution

A modified version of a shell script converts an audio file from FLAC to MP3 format. The computer has a quad-core CPU. The script is run using:
./flac2mp3.sh $(find flac -type f)
This converts the FLAC files in the flac directory (no spaces in file names) to MP3 files in the mp3 directory (at the same level as flac). If the destination MP3 file already exists, the script skips the file.
The problem is that sometimes two instances of the script check for the existence of the same MP3 file at nearly the same time, resulting in mangled MP3 files.
How would you run the script multiple times (i.e., once per core), without having to specify a different file set on each command-line, and without overwriting work?
Update - Minimal Race Condition
The script uses the following locking mechanism:
# Convert FLAC to MP3 using tags from flac file.
#
if [ ! -e $FLAC.lock ]; then
touch $FLAC.lock
flac -dc "$FLAC" | lame${lame_opts} \
--tt "$TITLE" \
--tn "$TRACKNUMBER" \
--tg "$GENRE" \
--ty "$DATE" \
--ta "$ARTIST" \
--tl "$ALBUM" \
--add-id3v2 \
- "$MP3"
rm $FLAC.lock
fi;
However, this still leaves a race condition.
The "lockfile" command provides what you're trying to do for shell scripts without the race condition. The command was written by the procmail folks specifically for this sort of purpose and is available on most BSD/Linux systems (as procmail is available for most environments).
Your test becomes something like this:
lockfile -r 3 $FLAC.lock
if test $? -eq 0 ; then
flac -dc "$FLAC" | lame${lame_opts} \
--tt "$TITLE" \
--tn "$TRACKNUMBER" \
--tg "$GENRE" \
--ty "$DATE" \
--ta "$ARTIST" \
--tl "$ALBUM" \
--add-id3v2 \
- "$MP3"
fi
rm -f $FLAC.lock
Alternatively, you could make lockfile keep retrying indefinitely so you don't need to test the return code, and instead can test for the output file for determining whether to run flac.
If you don't have lockfile and cannot install it (in any of its versions - there are several implementations) a robust and portable atomic mutex is mkdir.
If the directory you attempt to create already exists, mkdir will fail, so you can check for that; when creation succeeds, you have a guarantee that no other cooperating process is in the critical section at the same time as your code.
if mkdir "$FLAC.lockdir"; then
# you now have the exclusive lock
: critical section
: code goes here
rmdir "$FLAC.lockdir"
else
: nothing? to skip this file
# or maybe sleep 1 and loop back and try again
fi
For completeness, maybe also look for flock if you are on a set of platforms where that is reliably made available and need a performant alternative to lockfile.
You could implement locking of FLAC files that it's working on. Something like:
if (not flac locked)
    lock flac
    do work
else
    continue to next flac
Send output to a temporary file with a unique name, then rename the file to the desired name.
flac -dc "$FLAC" | lame${lame_opts} \
--tt "$TITLE" \
--tn "$TRACKNUMBER" \
--tg "$GENRE" \
--ty "$DATE" \
--ta "$ARTIST" \
--tl "$ALBUM" \
--add-id3v2 \
- "$MP3.$$"
mv "$MP3.$$" "$MP3"
If a race condition leaks through your file-locking system every once in a while, the final output will still be the result of a single process, since the rename into place is atomic on the same filesystem.
To lock a file, a process can create a companion file with the same name plus a .lock extension.
Before starting the encoding, check for the existence of the .lock file, and optionally make sure the date on the lockfile isn't too old (in case the process dies). If it does not exist, create it before the encoding starts, and remove it after the encoding is complete.
You can also flock the file, but this only really works in C, where you call flock() while writing to the file, then close and unlock it. For a shell script, you are probably calling another utility to do the writing of the file.
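That said, on Linux the util-linux package ships a flock(1) command-line wrapper that makes the same idea usable from a shell script. A minimal sketch (untested, with the lame tag options elided for brevity):
# hold an exclusive, non-blocking lock on a per-file lockfile for the encode
(
  flock -n 9 || exit 1    # another process holds the lock: skip this file
  flac -dc "$FLAC" | lame${lame_opts} - "$MP3"
) 9>"$FLAC.lock"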
How about writing a Makefile?
ALL_FLAC = $(wildcard *.flac)
ALL_MP3 = $(patsubst %.flac, %.mp3, $(ALL_FLAC))
all: $(ALL_MP3)
%.mp3: %.flac
	$(FLAC) ...
Then do
$ make -j4 all
In bash it's possible to set the noclobber option to avoid overwriting files:
help set | egrep 'noclobber|-C'
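With noclobber set, the redirection itself becomes an atomic check-and-create, so it can double as a lock. A minimal sketch (untested, again with the lame tag options elided):
# with noclobber, '>' fails if the file already exists (atomic O_EXCL create)
set -o noclobber
if { > "$MP3.lock"; } 2>/dev/null; then
  flac -dc "$FLAC" | lame${lame_opts} - "$MP3"
  rm -f "$MP3.lock"
fi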
Use a tool like FLOM (Free LOck Manager) and simply serialize your command as below:
flom -- flac ....
