How to create a union of FSTs from an FST archive (FAR)? - nlp

I currently have a (natural language) corpus, and these are the steps already taken:
Generated the symbol table after concatenating the corpus into one big file:
$ ngramsymbols <corpus.txt >corpus.syms
Given this symbol table, converted the corpus to a binary FST archive (FAR):
$ farcompilestrings -symbols=corpus.syms -keep_symbols=1 corpus.txt > corpus.far
I want to take the union of all the FSTs in the FAR, and compute the highest-weight path from start state to final state. To test from shell, this is what I did:
$ farextract corpus.far # generates fst files corpus.txt-01, corpus.txt-02, ...
$ fstarcsort --sort_type=olabel corpus.txt-01 1.fst
$ fstarcsort --sort_type=ilabel corpus.txt-02 2.fst
$ fstunion 1.fst 2.fst 12.fst
But I keep running into the following error:
WARNING: CompatSymbols: first symbol table present but second missing
ERROR: Union: input/output symbol tables of 1st argument do not match input/output symbol tables of 2nd argument
This error, of course, persists if I try to run a binary operation without sorting the FSTs first.
I think I am not sorting the FSTs correctly, or ... I have completely misunderstood how to use the symbol tables. Any idea why the union (or any other binary operation, for that matter) is failing like this?

When you extract the components from the FAR archive, the symbol table is attached only to the first FST in the archive. When combining FSTs, the symbol tables embedded in the individual FSTs need to match each other. For example, the union operation needs the input symbols to be the same across all components, and likewise the output symbols. Composition needs the output symbols of the left machine to match the input symbols of the right machine.
You can clear symbols from an FST using the fstsymbols command:
fstsymbols --clear_isymbols --clear_osymbols with-syms.fst > no-syms.fst
Removing the symbols from corpus.txt-01 should solve this problem. Alternatively, you can compile the FAR file without the -keep_symbols flag.
For the union command you don't need to sort the arcs of the component machines before combining them; however, you would normally need to sort them before composing them.
If your text corpus is large, you might find it much quicker to construct the unioned FST directly from the text file using the C++ interface or other bindings such as pyfst.

Related

How to convert model.tflite to model.cc and model.h on Windows 10

I have created a TensorFlow Lite .tflite model which I plan to use on a microcontroller. However, this file must be converted to a C source file, i.e, a TensorFlow Lite for Microcontrollers model. TensorFlow documentation provides a simple way to convert to a C array with the unix command xxd. I am using Windows 10 and do not have access to the unix command and there are no alternative Windows methods documented. After searching superuser, I saw that xxd for Windows now exists. I downloaded the command and ran it on my .tflite model. The results were different than the hello world example.
First, the hello world example model.h file has a comment that says it was "Automatically created from a TensorFlow Lite flatbuffer using the command: xxd -i model.tflite > model.cc". When I ran the command, model.h was not "automatically created".
Second, comparing the model.cc file from the hello world example, with the model.cc file that I generated, they are quite different and I'm not sure how to interpret this (I'm not referring to the differences in the actual array). Again, in the example model.cc file, it states that it was "automatically created" using the xxd command. Line 28 in the example is alignas(8) const unsigned char g_model[] = { and line 237 is const int g_model_len = 2488;. In comparison, the equivalent lines in the file I generated are unsigned char _________g_model[] = { and unsigned int _________g_model_len = 4009981;
While I am not a C expert, I am not sure how to interpret the differences in the files and if I have generated the model.cc file incorrectly. I would greatly appreciate any insight or guidance here on how to properly generate both the model.h and model.cc files from the original model.tflite file.
After doing some experiments, I think this is why you are getting differences:
xxd replaces any non-letter/non-digit character of the path to the input file by an underscore ('_'). Apparently you called xxd with a path for the input file that has 9 such leading characters, perhaps something like "../../../g.model". The syntax of C allows only letters (a to z, A to Z), digits (0 to 9) and underscore as characters of objects' names, and the names need to start with a non-digit. This is the only "manipulation" xxd does to the name of an input file.
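The renaming described above can be sketched in a few lines of Python. This is only an approximation of what xxd -i does to the file name, not its actual source; the function name xxd_identifier is made up for illustration, and the "__" prefix for names starting with a digit is an assumption about xxd's behaviour.

```python
def xxd_identifier(path: str) -> str:
    """Approximate the C identifier that `xxd -i` derives from an input path."""
    # Every character that is not an ASCII letter or digit becomes '_'.
    name = "".join(c if (c.isascii() and c.isalnum()) else "_" for c in path)
    # A C identifier must not start with a digit (assumed "__" prefix).
    if name[:1].isdigit():
        name = "__" + name
    return name


if __name__ == "__main__":
    # Hypothetical path from the answer above: nine leading non-alphanumerics.
    print(xxd_identifier("../../../g.model"))
```

For the guessed path "../../../g.model" this yields nine underscores followed by g_model, which is the shape of the name seen in the generated file.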
Since xxd knows nothing about TensorFlow, it could not have generated the copyright notice. Taking this as an indication, every other difference must have been inserted by other means by the TensorFlow authors, despite the statement "Automatically created from a TensorFlow Lite flatbuffer ...". This could have been done manually or by a script; unfortunately, I did not find any hint in some quick research on their repository. Apparently the statement refers only to the data values.
So you need to edit your result:
Add any comment you see fit.
Add alignas(8) (a standard specifier since C++11) to the array, if your compiler supports it.
Add the keyword const to the array and the length variable. This tells the compiler to prohibit any write access, and it will probably place the data in read-only memory.
Rename array and length variables to g_model and g_model_len, respectively. Most probably TensorFlow expects these names.
Copy "model.cc" to "model.h", and then edit it further, as the example demonstrates.
Don't be bothered by different values; they differ because the contents of the model files differ. The length variable is especially simple to check: it must have exactly the same value as the size of the input file.
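If you would rather not depend on xxd on Windows at all, the edits listed above can be produced directly by a small script. The following is a minimal sketch, not the method the TensorFlow authors used: the function name render_c_array and the 12-bytes-per-line layout are illustrative choices, and the names g_model and g_model_len are taken from the example file quoted in the question.

```python
import sys
from pathlib import Path


def render_c_array(data: bytes, name: str = "g_model", per_line: int = 12) -> str:
    """Render bytes as a C array definition in the style of the example model.cc."""
    rows = []
    for start in range(0, len(data), per_line):
        chunk = data[start:start + per_line]
        rows.append("  " + ", ".join(f"0x{b:02x}" for b in chunk))
    body = ",\n".join(rows)
    return (
        f"alignas(8) const unsigned char {name}[] = {{\n"
        f"{body}\n"
        f"}};\n"
        # The length is the exact size of the input file in bytes.
        f"const int {name}_len = {len(data)};\n"
    )


if __name__ == "__main__" and len(sys.argv) == 3:
    # Usage (hypothetical script name): python make_model_cc.py model.tflite model.cc
    source, target = Path(sys.argv[1]), Path(sys.argv[2])
    target.write_text(render_c_array(source.read_bytes()))
```

The header file and any comments still have to be written by hand, since they are not derived from the model data.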
EDIT:
On line 28 which is this text alignas(8) const unsigned char as shown in the example converted model. When I attempt to convert a model (whether it's my custom model or the "hello_world.tflite" example model) the text that would be on line 28 is unsigned char (any other text on that line is not in question). How is line 28 edited & explained?
Concerning the "how": I firmly believe that the authors of TensorFlow literally used an editor (an IDE or a stand-alone program like Notepad++ or Geany) and edited the line, or used some script to automate this.
The reason for alignas(8) is most probably that TensorFlow expects the data with an alignment of 8 bytes, for example because it casts the byte array to a structure that contains values of 8 bytes width.
The insertion of const will also commonly locate the model in read-only memory, which is preferable on most microcontrollers. If it were left out, the model's data would not only be writable but would also be located in precious RAM.
On line 237, the text specifically is const int. When I attempt to convert a model (whether it's my custom model or the "hello_world.tflite" example model) the text that would be on line 237 is unsigned int (any other text on that line is not in question). Why are these two lines different in these specific places? It makes me believe that xxd on Windows is not functioning the same?
Again, I firmly believe this was edited manually or by a script. TensorFlow might expect this variable to be of data type int, but any xxd I tried (Windows and Linux) generates unsigned int. I don't think that your specific version of xxd functions differently on Windows.
For const the same thoughts apply as above.
Finally, when I attempt to convert the example model "hello_world.tflite" file using the xxd for windows utility, my resulting array doesn't match the example "hello_world.cc" file. I would expect the array values to be identical if the xxd worked. The last question is how to generate the "model.h" and "model.cc" files on Windows.
Did you note that the model you link is in another branch of the repository?
If I use the branch on GitHub as in your link to "hello_world.cc", I find in "../train/README.md" this archive hello_world_2020_12_28.zip. I unpacked it and ran xxd on the included "model.tflite". The result's data match the included "model.cc" in the archive. But it does not match the data of "hello_world.cc" in the same branch that you linked. The difference is already there.
My conclusion is that the example result was not generated from the example model. This happens because developers sometimes don't pay enough attention to what they commit. Yes, it's unfortunate, as it irritates and frustrates beginners like you.
But, as I wrote, don't let this give you headaches. Try the simple example, and use the documentation as instructions on the process. Treat the differences in specific data as a quirk. You will encounter such things time after time when working with others' projects. It is quite normal.

Get elements of a list inside a list

When working with lists in Haskell, I can simply load my file into ghci and type head list or last list to get the information that I need. But if I have a list of lists, let's say list = [[1,2,3],[4,5,6]], how can I get the first element (head) of the first list (in this case, 1), or the last element of the second list (in this case, 6), and so on?
If all you need is the first or last element, concat will flatten the list for you.
There is an indexing function (!!), so for your examples: head . (!!0) and last . (!!1). If your question is more general, then please elaborate. Indexing is not great because it throws an error if you attempt to index past the end of the list, so we usually try to work around it, e.g. by saying "I want to do the same thing to every element of the list, so I don't really need the index" (map function) or "if I really do need the index, then don't use it directly" (zip [0..], or use of e.g. a record data type).
Also, Hoogle is your friend if you've not met it before. If you can break down your functions into simple ones you think might be standard, then searching for their type signatures is usually a good place to start. For example, search Hoogle for [a] -> Int -> a. Even if you don't find exactly what you want, you can often find something similar, and browsing its module or source code can turn up something helpful.

Fortran90 cray writing unformatted array using "*"

I have a program, written in Fortran 90, that is writing an array to a file, but for some reason it is using an asterisk to represent multiple columns:
8*9, 4, 2*9, 4
later on reading from the file I am getting I/O errors:
lib-4190 : UNRECOVERABLE library error
A numeric input field contains an invalid character.
Encountered during a list-directed READ from unit 10 Fortran unit 10 is connected to a sequential formatted text file:
Does anyone have any idea why this is happening, and whether there is a flag to feed to the compiler to prevent it? I'm using the Cray Fortran compiler, and the write statement looks like this:
write (lun,*) nsf_species(bundle%species(1:bundle%n_prim))
Update:
The line reading in the data file looks like:
read (lun,*) Info(ifile)%alpha_i(1:size)
I have checked to ensure that it is this line that is causing the problem.
This compression of list-directed output is a very useful feature of the Cray Compilation Environment when writing out large amounts of data. This compressed output will, however, not be read in correctly, as you point out (which is less useful).
You can modify this behaviour, not using a compiler flag but by using the "assign" command.
Consider this sample code:
PROGRAM test
IMPLICIT NONE
INTEGER, PARAMETER :: u = 10  ! unit number (undefined in the original snippet)
OPEN(UNIT=u,FILE="f1",FORM="FORMATTED",STATUS="UNKNOWN")
WRITE(u,*) 0,0,0
CLOSE(u)
OPEN(UNIT=u,FILE="f2",FORM="FORMATTED",STATUS="UNKNOWN")
WRITE(u,*) 0,0,0
CLOSE(u)
END PROGRAM test
We first build with CCE and execute. Files f1 and f2 both contain the compressed output form:
$ ftn -o test.x test.F90
$ ./test.x
$ cat f1
3*0
$ cat f2
3*0
Now we will use "assign" to modify the format in file f2. First we need to define a filename to hold the assign information:
$ export FILENV=my_filenenv
Now we use assign to switch off the compressed output for file f2:
$ assign -y on f:f2
Now we rerun the experiment (without needing to recompile):
$ ./test.x
$ cat f1
3*0
$ cat f2
0, 0, 0
There are options to do this for all files, for certain filename patterns or many other cases.
There are other things that assign can do. See "man assign" with PrgEnv-cray loaded for more details.
The write statement is using list directed formatting (it is still a formatted output statement - "formatted" means "formatted such that a human can read it")- as specified by the * inside the parenthesised part of the statement. The rules for list directed output give a great deal of freedom to the compiler. Typically, if you actually care about the details of the output, you should provide an explicit format.
One of the rules that does apply is that the resulting output should generally be suitable for list directed input. But there are some rather surprising rules for what is permitted as input for list directed formatting. One such feature is that you can specify in the input text a repeat count for input values using the syntax repeat*value.
The compiler has noticed that there are repeat values in the output, so it has used this repeat count feature.
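To see what the repeat*value notation stands for, here is a minimal Python sketch that expands such a record into the plain list of values. It is only an illustration of the syntax, not a reimplementation of Fortran list directed input: the function name expand_repeats is made up, and the sketch assumes integer fields separated by commas or blanks, ignoring null values and quoted strings.

```python
from typing import List


def expand_repeats(record: str) -> List[int]:
    """Expand list-directed repeat counts such as '8*9' into individual integers."""
    values: List[int] = []
    # Fields may be separated by commas and/or blanks.
    for field in record.replace(",", " ").split():
        if "*" in field:
            count, value = field.split("*", 1)
            values.extend([int(value)] * int(count))
        else:
            values.append(int(field))
    return values


if __name__ == "__main__":
    # The record shown in the question.
    print(expand_repeats("8*9, 4, 2*9, 4"))
```

Applied to the record from the question, this gives twelve values: eight 9s, a 4, two 9s, and a final 4.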
I don't know why you get an error message when reading the file under list directed input - as the line you show is a valid input line for list directed input. Make sure that the line causing the error is actually the line that you show.
A simple workaround solution would be to change the write statement so that it does not use the compressed format. e.g. change to:
write (lun,'(*(I5))') nsf_species(bundle%species(1:bundle%n_prim))
The '*' allows an arbitrary number of repeats of the specified format and should suppress the compressed output format.
However, if the compiler writes output in compressed format, then it should be able to read it back in the same compressed format. Hopefully the helpdesk will be able to get to the root of why that does not work.

how to plot a graph using haskell graphViz

I'm planning to draw a graph using Haskell graphViz. I'm new to haskell, so this is quite difficult for me. Can someone show me a simple example ? I need a very simple example actually, so that I can understand it and use it in the scenario I'm working on
I get the above error on trying to install chart-cairo. I saw some examples on the internet and all of them requires chart-cairo. any idea how to resolve it ?
EDIT:
The output that I get after executing the code given by https://stackoverflow.com/users/2827654/jamshidh
(This addresses your original question, described in the title, and doesn't go into the problems installing chart-cairo or chart, etc., which really should be spun out into separate questions.)
The fgl package, which graphviz depends on, includes some example graphs in the module Data.Graph.Inductive.Example that can be used to get you up and running. You can see the list of included graphs at http://hackage.haskell.org/package/fgl-5.3/docs/Data-Graph-Inductive-Example.html. I will use one called clr479.
Once you have a graph, you can convert it to an internal structure representing the dot format using graphToDot. Note that you will need to supply some parameters, which are described in http://hackage.haskell.org/package/graphviz-2999.11.0.0/docs/Data-GraphViz.html. Just to get up and running, I will use the supplied nonClusteredParams.
let graphInDotFormat = graphToDot nonClusteredParams clr479
Then, you will need to convert this to text suitable for input to the dot program. You can do this with renderDot . toDot
let outputText = renderDot $ toDot graphInDotFormat
and, as usual, you need to convert the Text to a String to use putStrLn (don't just use show, as it will include quotes and escape sequences, which dot will not understand)
putStrLn $ unpack outputText
Putting this all together, the final program createDotFile.hs would be
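import Data.Graph.Inductive.Example (clr479)
import Data.GraphViz (graphToDot, nonClusteredParams)
import Data.GraphViz.Printing (renderDot, toDot)
import Data.Text.Lazy (unpack)
-- Render the example graph clr479 as DOT text on stdout.
main :: IO ()
main = putStrLn $ unpack $ renderDot $ toDot $ graphToDot nonClusteredParams clr479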
Compile using ghc createDotFile.hs (remember to cabal install the required packages, as well as graphviz itself if you want to do anything with the output). On the command line, you can now pipe the output of this program to dot, which will convert it to a common image format. For instance, here I convert to SVG:
./createDotFile | dot -Tsvg > graph.svg
which on my linux box can be viewed by typing
eog graph.svg
Edit-
To clarify, the output of the Haskell program needs to be provided as input to GraphViz. The MSI file to install GraphViz on Windows is available at http://www.graphviz.org/Download_windows.php.

Racket: extracting field ids from structures

I want to see if I can map Racket structure fields to columns in a DB.
I've figured out how to extract accessor functions from structures in PLT scheme using the fourth return value of:
(struct-type-info)
However the returned procedure indexes into the struct using an integer. Is there some way that I can find out what the field names were at point of definition? Looking at the documentation it seems like this information is "forgotten" after the structure is defined and exists only via the generated-accessor functions: (<id>-<field-id> s).
So I can think of two possible solutions:
Search the namespace symbols for ones that start with my struct name (yuk);
Define a custom define-struct macro that captures the ordered sequence of field-names inside some hash that is keyed by struct name (eek).
I think something along the lines of 2. is the right approach (define-struct has a LOT of knobs and many don't make sense for this) but instead of making a hash, just make your macro expand into functions that manipulate the database directly. And the syntax/struct library can help you do the parsing of the define-struct form.
The answer depends on what you want to do with this information. The thing is that it's not kept in the runtime -- it's just like bindings in functions which do not exist at runtime. But they do exist at the syntax level (= compile-time). For example, this silly example will show you the value that is kept at the syntax level that contains the structure shape:
> (define-struct foo (x y))
> (define-syntax x (begin (syntax-local-value #'foo) 1))
> (define-syntax x (begin (printf ">>> ~s\n" (syntax-local-value #'foo)) 1))
>>> #<checked-struct-info>
It's not showing much, of course, but this should be a good start (you can look for struct-info in the docs and in the code). But this might not be what you're looking for, since this information exists only at the syntax level. If you want something that is there at runtime, then perhaps you're better off using alists or hash tables?
UPDATE (I've skimmed too quickly over your question before):
To map a struct into a DB table row, you'll need more things defined: at least the DB and the fields it stands for, and possibly an open DB connection to store values into or read values from. So it looks to me like the best way to do that is via a macro anyway -- this macro would expand to a use of define-struct with everything else that you'd need to keep around.
