How to store binary data in a Lua string

I needed to create a custom file format with embedded meta information. Instead of whipping up my own format I decided to just use Lua.
texture
{
format=GL_LUMINANCE_ALPHA;
type=GL_UNSIGNED_BYTE;
width=256;
height=128;
pixels=[[
<binary-data-here>]];
}
texture is a function that takes a table as its sole argument. It then looks up the various parameters by name in the table and forwards the call on to a C++ routine. Nothing out of the ordinary I hope.
Occasionally the files fail to parse with the following error:
my_file.lua:8: unexpected symbol near ']'
What's going on here?
Is there a better way to store binary data in Lua?
Update
It turns out that storing binary data in a Lua string is non-trivial, but it is possible if you take care with three sequences.
Long-format-string-literals cannot have an embedded closing-long-bracket (]], ]=], etc).
This one is pretty obvious.
Long-format-string-literals cannot end with something like ]== which would match the chosen closing-long-bracket.
This one is more subtle. Luckily the script will fail to compile if done wrong.
The data cannot embed \n or \r.
Lua's built-in line-end processing messes these up. This problem is much more subtle. The script will compile fine but it will yield the wrong data: CR (0x0D) becomes LF (0x0A), CR LF (0x0D 0x0A) becomes LF (0x0A), etc.
To get around these limitations I split the binary data on \r and \n, pick a long bracket that works for each piece, and finally emit Lua that concatenates the parts back together. I used a script that does this for me.
input: XXXX\nXX]]XX\r\nXX]]XX]=
texture
{
--other fields omitted
pixels= '' ..
[[XXXX]] ..
'\n' ..
[=[XX]]XX]=] ..
'\r\n' ..
[==[XX]]XX]=]==];
}
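The splitting approach described above could be scripted roughly as follows. This is a minimal Python sketch, not the script the author used; the function names long_bracket and lua_string_expr are hypothetical. It assumes the data is split on runs of \r and \n, that each remaining piece gets the shortest long bracket that cannot close early, and that the newline runs are emitted as quoted escapes.

```python
import re


def long_bracket(chunk: bytes) -> bytes:
    """Wrap chunk in the shortest Lua long bracket that cannot close early."""
    level = 0
    while True:
        closer = b"]" + b"=" * level + b"]"
        # The closer must first appear exactly where we append it: not inside
        # the chunk, and not straddling the chunk's tail (e.g. chunk ends "]=").
        if (chunk + closer).find(closer) == len(chunk):
            return b"[" + b"=" * level + b"[" + chunk + closer
        level += 1


def lua_string_expr(data: bytes) -> bytes:
    """Build a Lua expression that concatenates back to the original bytes."""
    parts = [b"''"]
    # The capturing group keeps the newline runs as separate pieces.
    for piece in re.split(rb"([\r\n]+)", data):
        if not piece:
            continue
        if piece[0] in b"\r\n":
            escaped = piece.replace(b"\r", b"\\r").replace(b"\n", b"\\n")
            parts.append(b"'" + escaped + b"'")
        else:
            parts.append(long_bracket(piece))
    return b" ..\n".join(parts)


if __name__ == "__main__":
    print(lua_string_expr(b"XXXX\nXX]]XX\r\nXX]]XX]=").decode("latin1"))
```

The generated text would need to be written with the file opened in binary mode ("wb") so that the writer does not alter any bytes. The sketch only covers the three sequences listed above; it does not treat any other byte specially.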

Lua is able to encode most characters in long bracket format including nulls. However, Lua opens the script file in text mode and this causes some problems. On my Windows system the following characters have problems:
Char code(s)     Problem
--------------   -------------------------------
13 (CR)          Is translated to 10 (LF)
13 10 (CR LF)    Is translated to 10 (LF)
26 (EOF)         Causes "unfinished long string near '<eof>'"
If you are not using Windows, then these may not cause problems, but there may be different text-mode based problems.
I was only able to produce the error you received by encoding multiple close brackets:
a=[[
]]] --> a.lua:2: unexpected symbol near ']'
But, this was easily fixed with the following:
a=[==[
]]==]

The binary data needs to be encoded into printable characters. The simplest method for decoding purposes would be to use C-like escape sequences for all bytes. For example, hex bytes 13 41 42 1E would be encoded as '\19\65\66\30'. Of course, then the encoded data is three to four times larger than the source binary.
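As a rough illustration of the escape-sequence approach, the following Python sketch produces a quoted Lua string literal from raw bytes. The function name lua_escaped_literal is hypothetical. It always emits three decimal digits per byte, an assumption made here so that an escape can never be misread when the next byte happens to encode a digit; the cost is that the output is always four times the size of the input.

```python
def lua_escaped_literal(data: bytes) -> str:
    """Return a single-quoted Lua literal using \\ddd decimal escapes."""
    # Three digits per byte keeps every escape unambiguous.
    return "'" + "".join(f"\\{byte:03d}" for byte in data) + "'"


if __name__ == "__main__":
    # Hex bytes 13 41 42 1E from the example above.
    print(lua_escaped_literal(bytes([0x13, 0x41, 0x42, 0x1E])))
```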
Alternatively, you could use something like Base64, but that would have to be decoded at runtime instead of relying on the Lua interpreter. Personally, I'd probably go the Base64 route. There are Lua examples of Base64 encoding and decoding.
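For the Base64 route, the file generator side might look like the sketch below. It is only an assumption about how such a file could be produced: the function name lua_base64_field and the field name pixels (borrowed from the question) are illustrative, and the decoding step at load time, in Lua or in the C++ routine, is not shown.

```python
import base64


def lua_base64_field(data: bytes, field: str = "pixels") -> str:
    """Emit a Lua table field holding the data as Base64 text."""
    # The standard Base64 alphabet contains no ']' and no line breaks,
    # so a plain [[...]] long bracket is safe here.
    encoded = base64.b64encode(data).decode("ascii")
    return f"{field}=[[{encoded}]];"


if __name__ == "__main__":
    print(lua_base64_field(b"\r\n\x1a]]"))
```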
Another alternative would be to have two files. Use a well-defined image format file (e.g. TGA) that is pointed to by a separate Lua script with the additional metadata. If you don't want two files to move around then they could be combined in an archive.

Related

How to convert model.tflite to model.cc and model.h on Windows 10

I have created a TensorFlow Lite .tflite model which I plan to use on a microcontroller. However, this file must be converted to a C source file, i.e, a TensorFlow Lite for Microcontrollers model. TensorFlow documentation provides a simple way to convert to a C array with the unix command xxd. I am using Windows 10 and do not have access to the unix command and there are no alternative Windows methods documented. After searching superuser, I saw that xxd for Windows now exists. I downloaded the command and ran it on my .tflite model. The results were different than the hello world example.
First, the hello world example model.h file has a comment that says it was "Automatically created from a TensorFlow Lite flatbuffer using the command: xxd -i model.tflite > model.cc". When I ran the command, model.h was not "automatically created".
Second, comparing the model.cc file from the hello world example, with the model.cc file that I generated, they are quite different and I'm not sure how to interpret this (I'm not referring to the differences in the actual array). Again, in the example model.cc file, it states that it was "automatically created" using the xxd command. Line 28 in the example is alignas(8) const unsigned char g_model[] = { and line 237 is const int g_model_len = 2488;. In comparison, the equivalent lines in the file I generated are unsigned char _________g_model[] = { and unsigned int _________g_model_len = 4009981;
While I am not a C expert, I am not sure how to interpret the differences in the files and if I have generated the model.cc file incorrectly. I would greatly appreciate any insight or guidance here on how to properly generate both the model.h and model.cc files from the original model.tflite file.

After doing some experiments, I think this is why you are getting differences:
xxd replaces any non-letter/non-digit character of the path to the input file by an underscore ('_'). Apparently you called xxd with a path for the input file that has 9 such leading characters, perhaps something like "../../../g.model". The syntax of C allows only letters (a to z, A to Z), digits (0 to 9) and underscore as characters of objects' names, and the names need to start with a non-digit. This is the only "manipulation" xxd does to the name of an input file.
Since xxd knows nothing about TensorFlow, it could not have generated the copyright notice. Taking this as an indication, any other difference must have been inserted by other means by the TensorFlow authors, despite the statement "Automatically created from a TensorFlow Lite flatbuffer ...". This could have been done manually or by a script; unfortunately, I did not find any hint in some quick research on their repository. Apparently the statement refers just to the data values.
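The naming rule described above can be imitated in a few lines of Python. This is only a sketch of the observed behaviour (every character of the path that is not a letter or digit becomes an underscore), not xxd's actual source, and the function name xxd_identifier is hypothetical.

```python
import re


def xxd_identifier(path: str) -> str:
    """Mimic the observed rule: non-alphanumeric characters become '_'."""
    return re.sub(r"[^A-Za-z0-9]", "_", path)


if __name__ == "__main__":
    # Nine leading characters ("../../../") turn into nine underscores.
    print(xxd_identifier("../../../g.model") + "[]")
    print(xxd_identifier("../../../g.model") + "_len")
```

Running xxd from the directory that contains the file, with a short file name, would therefore give shorter variable names.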
So you need to edit your result:
Add any comment you see fit.
Add alignas(8) to the array, if your compiler supports it (it is standard since C++11).
Add the keyword const to the array and the length variable. This tells the compiler to prohibit any write access, and it will probably place the data in read-only memory.
Rename array and length variables to g_model and g_model_len, respectively. Most probably TensorFlow expects these names.
Copy "model.cc" into "model.h", and then apply further edits, as the example demonstrates.
Don't be bothered by different values. Different contents of the model's file are the reason. It's especially simple to check the length variable, it has to have exactly the same value as the size of the input file.
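If xxd is not available, or to avoid the manual edits, the steps above could be automated with a short script. The following Python sketch is one possible way, not the method the TensorFlow authors used. The names g_model and g_model_len, the alignas(8) qualifier and the const int length come from the example quoted in the question; the function names and the header contents are assumptions.

```python
def to_c_array(data: bytes, name: str = "g_model") -> str:
    """Render data as a C++ source fragment in the style of `xxd -i`."""
    rows = []
    for start in range(0, len(data), 12):  # 12 bytes per row
        row = ", ".join(f"0x{byte:02x}" for byte in data[start:start + 12])
        rows.append("  " + row)
    body = ",\n".join(rows)
    return (
        f"alignas(8) const unsigned char {name}[] = {{\n"
        f"{body}\n"
        f"}};\n"
        f"const int {name}_len = {len(data)};\n"
    )


def to_header(name: str = "g_model") -> str:
    """Matching declarations for a header file (assumed layout)."""
    return (
        f"extern const unsigned char {name}[];\n"
        f"extern const int {name}_len;\n"
    )


if __name__ == "__main__":
    print(to_header(), end="")
    print(to_c_array(b"\x01\x02\xff"), end="")
```

To use it on a real model, read the file in binary mode, for example with open("model.tflite", "rb"), and write the two returned strings to "model.cc" and "model.h". The length printed in the output should equal the size of the input file.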
EDIT:
On line 28 which is this text alignas(8) const unsigned char as shown in the example converted model. When I attempt to convert a model (whether it's my custom model or the "hello_world.tflite" example model) the text that would be on line 28 is unsigned char (any other text on that line is not in question). How is line 28 edited & explained?
Concerning the "how": I firmly believe that the authors of TensorFlow literally used an editor (an IDE or a stand-alone program like Notepad++ or Geany) and edited the line, or used some script to automate this.
The reason for alignas(8) is most probably that TensorFlow expects the data with an alignment of 8 bytes, for example because it casts the byte array to a structure that contains values of 8 bytes width.
The insertion of const will also commonly locate the model in read-only memory, which is preferable on most microcontrollers. If it were left out, the model's data would not only be writable, but would also be located in precious RAM.
On line 237, the text specifically is const int. When I attempt to convert a model (whether it's my custom model or the "hello_world.tflite" example model) the text that would be on line 237 is unsigned int (any other text on that line is not in question). Why are these two lines different in these specific places? It makes me believe that xxd on Windows is not functioning the same?
Again, I firmly believe this was edited manually or by a script. TensorFlow might expect this variable to be of data type int, but any xxd I tried (Windows and Linux) generates unsigned int. I don't think that your specific version of xxd functions differently on Windows.
For const the same thoughts apply as above.
Finally, when I attempt to convert the example model "hello_world.tflite" file using the xxd for windows utility, my resulting array doesn't match the example "hello_world.cc" file. I would expect the array values to be identical if the xxd worked. The last question is how to generate the "model.h" and "model.cc" files on Windows.
Did you note that the model you link is in another branch of the repository?
If I use the branch on GitHub as in your link to "hello_world.cc", I find in "../train/README.md" this archive hello_world_2020_12_28.zip. I unpacked it and ran xxd on the included "model.tflite". The result's data match the included "model.cc" in the archive. But it does not match the data of "hello_world.cc" in the same branch that you linked. The difference is already there.
My conclusion is that the example result was not generated from the example model. This happens, since developers sometimes don't pay enough attention to what they commit. Yes, it's unfortunate, as it irritates and frustrates beginners like you.
But, as I wrote, don't let this give you headaches. Try the simple example, and use the documentation as instructions on the process. Look at the differences in specific data as a quirk. You will encounter such things time after time when working with others' projects. It is quite normal.

python translate bytecode to utf-8 using a variable

I have the following problem:
From a SQL Server database I am reading data using python module pypyodbc and ODBC Driver 13 for SQL Server and writing to txt files.
Database contains all kinds of special characters and they read as:
'PR\xc3\x86KVAL'
The '\xc3\x86' part is bytecode and should be interpreted that way. The other characters should be interpreted as shown. UTF8 would translate '\xc3\x86' to Æ.
If I type the value in b'PR\xc3\x86KVAL' , python recognizes it as bytecode and I can translate it to PRÆKVAL. See below:
s = b'PR\xc3\x86KVAL'
print(s)
bb = s.decode('utf-8')
print(bb)
The problem is that I don't know how I can turn 'PR\xc3\x86KVAL' to be recognized as a bytecode object.
I want the value that has to be decoded to be a variable so that all data from database can flow through it.
I also tried ast.literal_eval(r"b'PR\xc3\x86KVAL'"), but variables won't work in this way.

Since you start out with PR\xc3\x86KVAL as a text string and decode indeed expects a raw byte sequence, you need to convert the text string into a bytes object. But when converting from one "encoding" value to another, Python needs to know what encoding it is starting with!
The easiest way to do so is explicitly encoding the string, using an encoding that does not change the special characters. You must be careful, because it is very well possible that a character code might be translated to something else, destroying their meaning.
You can see that with a simple example: attempting to tell Python this should be plain ASCII fails, for an obvious reason.
>>> s = 'PR\xc3\x86KVAL'.encode('ascii')
UnicodeEncodeError: 'ascii' codec can't encode characters in position 2-3: ordinal not in range(128)
Even though there are more than 1,000 questions on Stack Overflow about this, the reason for the failure should be easy to understand. All an encoder/decoder pair does is translate each character from 'source' to 'destination'. This can only work if the character in question actually exists in both the 'source' and 'destination' encodings. Suppose you want to translate a Greek character β to a Russian б, then the source must be able to decode the Greek character (because that is what you entered it in) and the destination must be able to encode the Russian character.
So you must be careful to choose an encoding which does not change the character \x86 in your input string into Ж (which it would do when using cp866, for example).
Fortunately, as quoted from https://stackoverflow.com/a/2617930/2564301, there is an encoding that does not mess up things:
Pass data.decode('latin1') to the codec. latin1 maps bytes 0-255 to Unicode characters 0-255, which is kinda elegant.
and so this should work:
>>> s = 'PR\xc3\x86KVAL'.encode('latin1')
>>> print(s)
b'PR\xc3\x86KVAL'
Now s is a properly encoded byte object, so you can decode it at will:
>>> bb = s.decode('utf-8')
>>> print(bb)
PRÆKVAL
Done!
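Since the question asks for something that works on a variable, the two steps can be wrapped in a small helper. This is a minimal sketch; the function name reinterpret_as_utf8 is hypothetical, and it assumes every value read from the database has the same problem (UTF-8 bytes that were read as one character per byte).

```python
def reinterpret_as_utf8(text: str) -> str:
    """Turn a str holding one character per UTF-8 byte into the real text."""
    # latin1 maps characters 0-255 back to bytes 0-255 unchanged.
    raw = text.encode("latin1")
    return raw.decode("utf-8")


if __name__ == "__main__":
    value = 'PR\xc3\x86KVAL'  # as read from the database
    print(reinterpret_as_utf8(value))
```

A value that contains characters above U+00FF would make the encode step raise UnicodeEncodeError, and bytes that are not valid UTF-8 would make the decode step raise UnicodeDecodeError.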

Converting bits to string (data)

I have file which contains some data (text copied and pasted from the "What You Will Learn" portion of this PDF). Firstly, I have converted the contents in the file to bits successfully. However, when I try to convert it back to the original format, some of the characters are not correctly converted, as shown below:
Cisco has
developed the Cisco Open Network Environment (ONE)
architecture as a multifaceted approach to network
programmability delivered across three pillars:
??)É¥ Í?Ð½??ÁÁ±¥?Ñ¥½¸ÁÉ½É?µµ¥¹?¥¹Ñ?É???Ì?¡A%Ì¤?)?áÁ½Í??¥É?Ñ±ä½¸ÍÝ¥Ñ¡?Ì?¹É½ÕÑ?ÉÌÑ¼?Õµ?¹Ð?)?á¥ÍÑ¥¹?=Á?¹±½ÜÍÁ?¥?¥?Ñ¥½¹Ì* ¤&öGV7F?öâ×&VG?÷VäfÆ÷r6öçG&öÆÆW"æB÷VäfÆ÷r ¦vVçG0¨?HÝZ]HÙ??ÙXÝÈÈ[]?\??\X[Ý?\?^\Ë?\X[?Ù\?XÙ\Ë[??\ÛÝ\?ÙHÜ?Ú\Ý?][Û?Ø\X?[]Y\È[?H?]HÙ[
As you can see here some characters are converted successfully, others are not.
My code is below:
file = open("test.txt",'r')
myfile = ''.join(map(str,file))
l = []
for i in myfile:
    asc11 = ord(i)
    b = "{0:08b}".format(asc11)
    l.extend(int(y) for y in b)
string_bin = ''.join(map(str,l))
mydata = ''.join(chr(int(string_bin[i:i+8], 2)) for i in range(0,len(string_bin), 8))
print(mydata)
What is wrong with my code? What do I need to change to make it work properly?

What's Going On?
You are running into an encoding issue because some characters in the PDF are non-ASCII characters. For example, the bullet points are U+2022, which requires 3 bytes of storage in UTF-8.
When Python reads from your file, it doesn't know what encoding you used to write that data. Thus it reads bytes from the file and uses a character encoding to translate them into strs, which are stored using Python's own internal Unicode format. (This differs from Python 2, where open() returned raw bytes stored in a str which you could then manually decode to unicode.)
Thus, in Python 3, open() accepts a named encoding parameter, for example open("test.txt",'r', encoding='ascii'). Because you don't specify the encoding when you call open(), you end up using your system's default encoding. For instance, on my laptop, the default encoding is CP1252 (the Windows variant of Latin-1). Yours may differ.
Whatever encoding Python uses to interpret your file, it then internally uses its own Unicode format to store your string. This means that your string may internally use multi-byte characters even if the original encoding did not. For example, my laptop uses CP1252 to interpret U+2022 as â€¢ which is internally stored as U+00E2, U+20AC and U+00A2 -- € is stored using a multi-byte character even though it was just one byte in the original file.
Let's assume your computer is sane and uses UTF-8 by default (this explanation is similar for many multi-byte characters). When you reach a bullet point, it is stored as U+2022. When you call ord('\u2022') the result is 8226. When you then call "{0:08b}".format(8226) this returns "10000000100010". That's a 14-character string. Your parsing code assumes all of the ordinals will generate 8-character strings. Because of this, the "binary" output becomes misaligned. This means that when you then parse the binary string in 8-character segments, it gets thrown off and starts interpreting things as control characters and all sorts of foreign-language characters.
If you call open(..., encoding='ascii'), Python will actually throw an exception because it reads non-valid ASCII characters.
Possible Solutions
I'm not sure why exactly you are converting the input string into the representation that you are using. It's not binary, as your question title would suggest. Rather, you've converted the data into a textual representation of its binary encoding.
Technically speaking, when you store encoded text to a file, it's stored using a binary representation. Python, and any text editor, has to decode those bytes into its internal character representation before it can display them as text. Thus, calling open("test.txt", "r", encoding="utf-8") reads the binary data out of your text file and converts it into Python's internal Unicode format. Similarly, calling myfile.encode('utf-8') will return the UTF-8 encoded bytes which can then be written to a file, network socket, etc.
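One way to keep the 8-bit grouping from the question is to work on the encoded bytes rather than on code points, since every byte fits in exactly 8 bits. The following is a minimal sketch under that assumption; the function names utf8_text_to_bits and bits_to_utf8_text are hypothetical.

```python
def utf8_text_to_bits(text: str) -> str:
    """Encode to UTF-8 first, then write each byte as 8 binary digits."""
    return "".join(f"{byte:08b}" for byte in text.encode("utf-8"))


def bits_to_utf8_text(bits: str) -> str:
    """Rebuild the bytes from 8-digit groups, then decode them as UTF-8."""
    data = bytes(int(bits[i:i + 8], 2) for i in range(0, len(bits), 8))
    return data.decode("utf-8")


if __name__ == "__main__":
    sample = "\u2022 pillars"
    bits = utf8_text_to_bits(sample)
    print(len(bits), bits_to_utf8_text(bits) == sample)
```

Here the bullet point takes 24 binary digits (three bytes), so the 8-digit chunks stay aligned.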
If, however, you do need to use a format similar to what you are currently using, first, I still recommend you specify an encoding when you call open() (I recommend UTF-8). Then you can consider these options:
Detect and omit non-ASCII characters. They will have an ordinal >= 128.
Mimic UTF-16 or UTF-32 and output multi-byte output for all characters. For example, use "{0:032b}".format(asc11) and then parse the result in 32-character chunks. It's memory and storage inefficient, but it will preserve multi-byte characters.
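The fixed-width option could look like the sketch below. It is an illustration only: the names WIDTH, text_to_fixed_bits and fixed_bits_to_text are hypothetical, and 32 digits per character is chosen because it is wide enough for any Unicode code point.

```python
WIDTH = 32  # binary digits per character


def text_to_fixed_bits(text: str) -> str:
    """Write every code point as exactly WIDTH binary digits."""
    return "".join(format(ord(ch), f"0{WIDTH}b") for ch in text)


def fixed_bits_to_text(bits: str) -> str:
    """Parse the digit string back in WIDTH-sized chunks."""
    return "".join(
        chr(int(bits[i:i + WIDTH], 2)) for i in range(0, len(bits), WIDTH)
    )


if __name__ == "__main__":
    sample = "a\u2022b"
    bits = text_to_fixed_bits(sample)
    print(len(bits), fixed_bits_to_text(bits) == sample)
```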
Regardless, I highly recommend reading the Dive Into Python 3 chapter about strings.

VC6 /r/n and Write works; Visual Studio 2013 does not work

the following code
if(!cfile.Open(fileName, CFile::modeCreate | CFile::modeReadWrite)){
return;
}
ggg.Format(_T("0 \r\n"));
cfile.Write(ggg, ggg.GetLength());
ggg.Format(_T("SECTION \r\n"));
cfile.Write(ggg, ggg.GetLength());
produces the following:
0 SECTI
clearly this is wrong: (a) \r\n is ignored, and (b) the word SECTION is cut off.
Can someone please tell me what I am doing wrong?
The same code without _T() in VC6 produces the correct results.
Thank you
a.

Apparently, you are building a Unicode build; CString (presumably that's what ggg is) holds a sequence of wchar_t characters, each two bytes large. ggg.GetLength() is the length of the string in characters.
However, CFile::Write takes the length in bytes, not in characters. You are passing half the number of bytes actually taken by the string, so only half the number of characters gets written.
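The arithmetic can be checked outside of MFC. The Python sketch below only simulates the effect and does not call any MFC API: it assumes the string is stored as UTF-16 little-endian (two bytes per character, as with wchar_t on Windows) and that only as many bytes as there are characters get written. The function name simulate_write is hypothetical.

```python
def simulate_write(text: str, byte_count: int) -> str:
    """Keep only the first byte_count bytes of the UTF-16 representation."""
    data = text.encode("utf-16-le")   # 2 bytes per character here
    written = data[:byte_count]       # what a byte-counted write would keep
    return written.decode("utf-16-le")


if __name__ == "__main__":
    lines = ["0 \r\n", "SECTION \r\n"]
    # Passing the character count where a byte count is expected:
    output = "".join(simulate_write(line, len(line)) for line in lines)
    print(repr(output))  # prints '0 SECTI'
```

The first string loses its "\r\n" and the second is cut after "SECTI", which matches the output shown in the question.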

Have you considered changing lines like:
cfile.Write(ggg, ggg.GetLength());
to:
cfile.Write(ggg, ggg.GetLength() * sizeof(TCHAR));
Write needs the number of bytes (not characters). Since each character in a Unicode build is 2 bytes wide, you need to account for that. sizeof(TCHAR) is the number of bytes each character takes in a given build: 1 for an ANSI build and 2 for a Unicode build. Multiply that by the string length and the number of bytes should be correct.
Information on TCHAR can be found on MSDN documentation here. In particular it is defined as:
The _TCHAR data type is defined conditionally in Tchar.h. If the symbol _UNICODE is defined for your build, _TCHAR is defined as wchar_t; otherwise, for single-byte and MBCS builds, it is defined as char. (wchar_t, the basic Unicode wide-character data type, is the 16-bit counterpart to an 8-bit signed char.)
TCHAR and _TCHAR in your usage should be synonymous. However I believe these days Microsoft recommends including <tchar.h> and using _TCHAR. What I can't tell you is if _TCHAR existed on VC 6.0.
If using the method above - if you build using Unicode your output files will be in Unicode. If you build for Ansi it will be output as 8bit ASCII.
Want CFile.write to output Ascii no matter what? Read on...
If you want all text written to the file as 8bit ASCII you are going to have to use one of the macros for conversion. In particular CT2A. More on the macros can be found in this MSDN article. Each macro can be broken up by name, however CT2A says convert the Generic character string (equivalent to W when _UNICODE is defined, equivalent to A otherwise) to Ascii per the chart at the link. So no matter whether using Unicode or Ascii it would output Ascii. Your code would look something like:
ggg.Format(_T("0 \r\n"));
cfile.Write(CT2A(ggg), ggg.GetLength());
ggg.Format(_T("SECTION \r\n"));
cfile.Write(CT2A(ggg), ggg.GetLength());
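Since the macro converts everything to single-byte characters, CString's GetLength() will suffice as long as the text is plain ASCII. If the text contains characters that convert to more than one byte, the character count and the byte count would no longer match.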

How to convert between bytes and strings in Python 3?

This is a Python 101 type question, but it had me baffled for a while when I tried to use a package that seemed to convert my string input into bytes.
As you will see below I found the answer for myself, but I felt it was worth recording here because of the time it took me to unearth what was going on. It seems to be generic to Python 3, so I have not referred to the original package I was playing with; it does not seem to be an error (just that the particular package had a .tostring() method that was clearly not producing what I understood as a string...)
My test program goes like this:
import mangler # spoof package
stringThing = """
<Doc>
<Greeting>Hello World</Greeting>
<Greeting>你好</Greeting>
</Doc>
"""
# print out the input
print('This is the string input:')
print(stringThing)
# now make the string into bytes
bytesThing = mangler.tostring(stringThing) # pseudo-code again
# now print it out
print('\nThis is the bytes output:')
print(bytesThing)
The output from this code gives this:
This is the string input:
<Doc>
<Greeting>Hello World</Greeting>
<Greeting>你好</Greeting>
</Doc>
This is the bytes output:
b'\n<Doc>\n <Greeting>Hello World</Greeting>\n <Greeting>\xe4\xbd\xa0\xe5\xa5\xbd</Greeting>\n</Doc>\n'
So, there is a need to be able to convert between bytes and strings, to avoid ending up with non-ascii characters being turned into gobbledegook.

The 'mangler' in the above code sample was doing the equivalent of this:
bytesThing = stringThing.encode(encoding='UTF-8')
There are other ways to write this (notably using bytes(stringThing, encoding='UTF-8')), but the above syntax makes it obvious what is going on, and also what to do to recover the string:
newStringThing = bytesThing.decode(encoding='UTF-8')
When we do this, the original string is recovered.
Note, using str(bytesThing) just transcribes all the gobbledegook without converting it back into Unicode, unless you specifically request UTF-8, viz., str(bytesThing, encoding='UTF-8'). No error is reported if the encoding is not specified.
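The difference between the two str() calls can be seen in a short, self-contained example. The variable names below are illustrative only.

```python
text = "你好"
data = text.encode(encoding="UTF-8")

# Round trip: decode() recovers the original string.
recovered = data.decode(encoding="UTF-8")

# Without an encoding, str() returns the repr-style text, escapes included.
as_repr = str(data)

# With an encoding, str() decodes just like bytes.decode().
as_text = str(data, encoding="UTF-8")

if __name__ == "__main__":
    print(recovered == text)
    print(as_repr)
    print(as_text)
```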

In Python 3, the bytes() constructor accepts the same encoding argument as encode().
str1 = b'hello world'
str2 = bytes("hello world", encoding="UTF-8")
print(str1 == str2) # Returns True
I didn't read anything about this in the docs, but perhaps I wasn't looking in the right place. This way you can explicitly turn strings into byte streams and have it more readable than using encode and decode, and without having to prefix b in front of quotes.

This is a Python 101 type question,
It's a simple question but one where the answer is not so simple.
In Python 3, a "bytes" object represents a sequence of bytes, while a "string" object represents a sequence of Unicode code points.
To convert from "bytes" to "string" and from "string" back to "bytes" you use the bytes.decode and str.encode functions. These functions take two parameters, an encoding and an error handling policy.
Sadly there are an awful lot of cases where sequences of bytes are used to represent text, but it is not necessarily well-defined what encoding is being used. Take for example filenames on Unix-like systems: as far as the kernel is concerned they are a sequence of bytes with a handful of special values. On most modern distros most filenames will be UTF-8, but there is no guarantee that all filenames will be.
If you want to write robust software then you need to think carefully about those parameters. You need to think carefully about what encoding the bytes are supposed to be in and how you will handle the case where they turn out not to be a valid sequence of bytes for the encoding you thought they should be in. Python defaults to UTF-8 and erroring out on any byte sequence that is not valid UTF-8.
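The effect of the error handling policy can be sketched with a byte sequence that is not valid UTF-8. The sample bytes and the helper name decode_or_replace are made up for illustration; "strict", "replace" and "surrogateescape" are standard Python error handlers.

```python
raw = b"caf\xe9"  # Latin-1 bytes for "café"; not valid UTF-8


def decode_or_replace(data: bytes, encoding: str = "utf-8") -> str:
    """Try a strict decode first, then fall back to replacement characters."""
    try:
        return data.decode(encoding, errors="strict")
    except UnicodeDecodeError:
        # Undecodable bytes become U+FFFD, so information is lost.
        return data.decode(encoding, errors="replace")


# surrogateescape keeps the offending bytes so they can be restored later.
escaped = raw.decode("utf-8", errors="surrogateescape")
restored = escaped.encode("utf-8", errors="surrogateescape")

if __name__ == "__main__":
    print(decode_or_replace(raw))
    print(restored == raw)
```

Which policy is appropriate depends on whether the bytes must survive a round trip (as with filenames) or only need to be displayed.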
print(bytesThing)
Python uses "repr" as a fallback conversion to string. repr attempts to produce python code that will recreate the object. In the case of a bytes object this means among other things escaping bytes outside the printable ascii range.

TRY THIS:
StringVariable=ByteVariable.decode('UTF-8','ignore')
TO TEST TYPE:
print(type(StringVariable))
Here StringVariable is a string and ByteVariable is a bytes object. These names are not related to the variables in the question.
