Reversing Bytes and cross compatible binary parsing in Nim

Reversing Bytes and cross compatible binary parsing in Nim - nim-lang

I've started taking a look at Nim for hobby game modding purposes.
Intro
Yet, I found it difficult to work with Nim compared to C when it comes to machine-specific low-level memory layout and would like to know if Nim actually has better support here.
I need to control byte order and be able to de/serialize arbitrary Plain-Old-Datatype objects to binary custom file formats. I didn't directly find a Nim library which allows flexible storage options like representing enum and pointers with Big-Endian 32-bit. Or maybe I just don't know how to use the feature.
std/marshal : just JSON, i.e. no efficient, flexible nor binary format but cross-compatible
nim-serialization : seems like being made for human readable formats
nesm : flexible cross-compatibility? (It has some options and has a good interface)
flatty : no flexible cross-compatibility, no byte order?
msgpack4nim : no flexible cross-compatibility, byte order?
bingo : ?
Flexible cross-compatibility means, it must be able to de/serialize fields independently of Nim's ABI but with customization options.
Maybe "Kaitai Struct" is more what I look for, a file parser with experimental Nim support.
TL;DR
As a workaround for a serialization library I tried myself at a recursive "member fields reverser" that makes use of std/endians which is almost sufficient.
But I didn't succeed with implementing byte reversal of arbitrarily long objects in Nim. Not practically relevant but I still wonder if Nim has a solution.
I found reverse() and reversed() from std/algorithm but I need a byte array to reverse it and turn it back into the original object type. In C++ there would be reinterprete_cast, in C there is void*-cast, in D there is a void[] cast (D allows defining array slices from pointers) but I couldn't get it working with Nim.
I tried cast[ptr array[value.sizeof, byte]](unsafeAddr value)[] but I can't assign it to a new variable. Maybe there was a different problem.
How to "byte reverse" arbitrary long Plain-Old-Datatype objects?
How to serialize to binary files with byte order, member field size, pointer as file "offset - start offset"? Are there bitfield options in Nim?

It is indeed possible to use algorithm.reverse and the appropriate cast invocation to reverse bytes in-place:
import std/[algorithm,strutils,strformat]
type
LittleEnd{.packed.} = object
a: int8
b: int16
c: int32
BigEnd{.packed.} = object
c: int32
b: int16
a: int8
## just so we can see what's going on:
proc `$`(b: LittleEnd):string = &"(a:0x{b.a.toHex}, b:0x{b.b.toHex}, c:0x{b.c.toHex})"
proc `$`(l:BigEnd):string = &"(c:0x{l.c.toHex}, b:0x{l.b.toHex}, a:0x{l.a.toHex})"
var lit = LittleEnd(a: 0x12, b:0x3456, c: 0x789a_bcde)
echo lit # (a:0x12, b:0x3456, c:0x789ABCDE)
var big:BigEnd
copyMem(big.addr,lit.addr,sizeof(lit))
# here's the reinterpret_cast you were looking for:
cast[var array[sizeof(big),byte]](big.addr).reverse
echo big # (c:0xDEBC9A78, b:0x5634, a:0x12)
for C-style bitfields there is also the {.bitsize.} pragma
but using it causes Nim to lose sizeof information, and of course bitfields wont be reversed within bytes
import std/[algorithm,strutils,strformat]
type
LittleNib{.packed.} = object
a{.bitsize: 4}: int8
b{.bitsize: 12}: int16
c{.bitsize: 20}: int32
d{.bitsize: 28}: int32
BigNib{.packed.} = object
d{.bitsize: 28}: int32
c{.bitsize: 20}: int32
b{.bitsize: 12}: int16
a{.bitsize: 4}: int8
const nibsize = 8
proc `$`(b: LittleNib):string = &"(a:0x{b.a.toHex(1)}, b:0x{b.b.toHex(3)}, c:0x{b.c.toHex(5)}, d:0x{b.d.toHex(7)})"
proc `$`(l:BigNib):string = &"(d:0x{l.d.toHex(7)}, c:0x{l.c.toHex(5)}, b:0x{l.b.toHex(3)}, a:0x{l.a.toHex(1)})"
var lit = LitNib(a: 0x1,b:0x234, c:0x56789, d: 0x0abcdef)
echo lit # (a:0x1, b:0x234, c:0x56789, d:0x0ABCDEF)
var big:BigNib
copyMem(big.addr,lit.addr,nibsize)
cast[var array[nibsize,byte]](big.addr).reverse
echo big # (d:0x5DEBC0A, c:0x8967F, b:0x123, a:0x4)
It's less than optimal to copy the bytes over, then rearrange them with reverse, anyway, so you might just want to copy the bytes over in a loop. Here's a proc that can swap the endianness of any object, (including ones for which sizeof is not known at compiletime):
template asBytes[T](x:var T):ptr UncheckedArray[byte] =
cast[ptr UncheckedArray[byte]](x.addr)
proc swapEndian[T,U](src:var T,dst:var U) =
assert sizeof(src) == sizeof(dst)
let len = sizeof(src)
for i in 0..<len:
dst.asBytes[len - i - 1] = src.asBytes[i]

Bit fields are supported in Nim as a set of enums:
type
MyFlag* {.size: sizeof(cint).} = enum
A
B
C
D
MyFlags = set[MyFlag]
proc toNum(f: MyFlags): int = cast[cint](f)
proc toFlags(v: int): MyFlags = cast[MyFlags](v)
assert toNum({}) == 0
assert toNum({A}) == 1
assert toNum({D}) == 8
assert toNum({A, C}) == 5
assert toFlags(0) == {}
assert toFlags(7) == {A, B, C}
For arbitrary bit operations you have the bitops module, and for endianness conversions you have the endians module. But you already know about the endians module, so it's not clear what problem you are trying to solve with the so called byte reversal. Usually you have an integer, so you first convert the integer to byte endian format, for instance, then save that. And when you read back, convert from byte endian format and you have the int. The endianness procs should be dealing with reversal or not of bytes, so why do you need to do one yourself? In any case, you can follow the source hyperlink of the documentation and see how the endian procs are implemented. This can give you an idea of how to cast values in case you need to do some yourself.
Since you know C maybe the last resort would be to write a few serialization functions and call them from Nim, or directly embed them using the emit pragma. However this looks like the least cross platform and pain free option.
Can't answer anything about generic data structure serialization libraries. I stray away from them because they tend to require hand holding imposing certain limitations on your code and depending on the feature set, a simple refactoring (changing field order in your POD) may destroy the binary compatibility of the generated output without you noticing it until runtime. So you end up spending additional time writing unit tests to verify that the black box you brought in to save you some time behaves as you want (and keeps doing so across refactorings and version upgrades!).

Related

How to convert data type if Variant.Type is known?

How do I convert the data type if I know the Variant.Type from typeof()?
for example:
var a=5;
var b=6.9;
type_cast(b,typeof(a)); # this makes b an int type value

How do I convert the data type if I know the Variant.Type from typeof()?
You can't. GDScript does not have generics/type templates, so beyond simple type inference, there is no way to specify a type without knowing the type.
Thus, any workaround to cast the value to a type only known at runtime would have to be declared to return Variant, because there is no way to specify the type.
Furthermore, to store the result on a variable, how do you declare the variable if you don't know the type?
Let us have a look at variable declarations. If you do not specify a type, you get a Variant.
For example in this code, a is a Variant that happens to have an int value:
var a = 5
In this other example a is an int:
var a:int = 5
This is also an int:
var a := 5
In this case the variable is typed according to what you are using to initialized, that is the type is inferred.
You may think you can use that like this:
var a = 5
var b := a
Well, no. That is an error. "The variable type can't be inferred". As far as Godot is concerned a does not have a type in this example.
I'm storing data in a json file: { variable:[ typeof(variable), variable_value ] } I added typeof() because for example I store an int but when I reassign it from the file it gets converted to float (one of many other examples)
It is true that JSON is not good at storing Godot types. Which is why many authors do not recommend using JSON to save state.
Now, be aware that we can't get a variable with the right type as explained above. Instead we should try to get a Variant of the right type.
If you cannot change the serialization format, then you are going to need one big match statement. Something like this:
match type:
TYPE_NIL:
return null
TYPE_BOOL:
return bool(value)
TYPE_INT:
return int(value)
TYPE_REAL:
return float(value)
TYPE_STRING:
return str(value)
Those are not all the types that a Variant can hold, but I think it would do for JSON.
Now, if you can change the serialization format, then I will suggest to use str2var and var2str.
For example:
var2str(Vector2(1, 10))
Will return a String value "Vector2( 1, 10 )". And if you do:
str2var("Vector2( 1, 10 )")
You get a Variant with a Vector2 with 1 for the x, and 10 for the y.
This way you can always store Strings, in a human readable format, that Godot can parse. And if you want to do that for whole objects, or you want to put them in a JSON structure, that is up to you.
By the way, you might also be interested in ResourceFormatSaver and ResourceFormatLoader.

What exactly are strings in Nim?

From what I understand, strings in Nim are basically a mutable sequence of bytes and that they are copied on assignment.
Given that, I assumed that sizeof would tell me (like len) the number of bytes, but instead it always gives 8 on my 64-bit machine, so it seems to be holding a pointer.
Given that, I have the following questions...
What was the motivation behind copy on assignment? Is it because they're mutable?
Is there ever a time when it isn't copied when assigned? (I assume non-var function parameters don't copy. Anything else?)
Are they optimized such that they only actually get copied if/when they're mutated?
Is there any significant difference between a string and a sequence, or can the answers to the above questions be equally applied to all sequences?
Anything else in general worth noting?
Thank you!

The definition of strings actually is in system.nim, just under another name:
type
TGenericSeq {.compilerproc, pure, inheritable.} = object
len, reserved: int
PGenericSeq {.exportc.} = ptr TGenericSeq
UncheckedCharArray {.unchecked.} = array[0..ArrayDummySize, char]
# len and space without counting the terminating zero:
NimStringDesc {.compilerproc, final.} = object of TGenericSeq
data: UncheckedCharArray
NimString = ptr NimStringDesc
So a string is a raw pointer to an object with a len, reserved and data field. The procs for strings are defined in sysstr.nim.
The semantics of string assignments have been chosen to be the same as for all value types (not ref or ptr) in Nim by default, so you can assume that assignments create a copy. When a copy is unneccessary, the compiler can leave it out, but I'm not sure how much that is happening so far. Passing strings into a proc doesn't copy them. There is no optimization that prevents string copies until they are mutated. Sequences behave in the same way.
You can change the default assignment behaviour of strings and seqs by marking them as shallow, then no copy is done on assignment:
var s = "foo"
shallow s

The last challenge of Bytewiser: Array Buffers

The challenge is like that:
Array Buffers are the backend for Typed Arrays. Whereas Buffer in node
is both the raw bytes as well as the encoding/view, Array Buffers are
only raw bytes and you have to create a Typed Array on top of an Array
Buffer in order to access the data.
When you create a new Typed Array and don't give it an Array Buffer to
be a view on top of it will create it's own new Array Buffer instead.
For this challenge, take the integer from process.argv[2] and write it
as the first element in a single element Uint32Array. Then create a
Uint16Array from the Array Buffer of the Uint32Array and log out to
the console the JSON stringified version of the Uint16Array.
Bonus: try to explain the relevance of the integer from
process.argv[2], or explain why the Uint16Array has the particular
values that it does.
The solution given by the author is like that:
var num = +process.argv[2]
var ui32 = new Uint32Array(1)
ui32[0] = num
var ui16 = new Uint16Array(ui32.buffer)
console.log(JSON.stringify(ui16))
I don't understand what does the plus sign in the first line means? And I can't understand the logic of this block of code either.
Thank you a lot if you can solve my puzzle.

Typed arrays are often used in context of asm.js, a strongly typed subset of JavaScript that is highly optimisable. "strongly typed" and "subset of JavaScript" are contradictory requirements, since JavaScript does not natively distinguish integers and floats, and has no way to declare them. Thus, asm.js adopts the convention that the following no-ops on integers and floats respectively serve as declarations and casts:
n|0 is n for every integer
+n is n for every float
Thus,
var num = +process.argv[2]
would be the equivalent of the following line in C or Java:
float num = process.argv[2]
declaring a floating point variable num.
It is puzzling though, I would have expected the following, given the requirement for integers:
var num = (process.argv[2])|0
Even in normal JavaScript though they can have uses, because they will also convert string representations of integers or floats respectively into numbers.

Calling a C Function with C-style Strings

I'm struggling to call mktemp in D:
import core.sys.posix.stdlib;
import std.string: toStringz;
auto name = "alpha";
auto tmp = mktemp(name.toStringz);
but I can't figure out how to use it so DMD complains:
/home/per/Work/justd/fs.d(1042): Error: function core.sys.posix.stdlib.mktemp (char*) is not callable using argument types (immutable(char)*)
How do I create a mutable zero-terminated C-style string?
I think I've read somewhere that string literals (const or immutable) are implicitly convertible to zero (null)-terminated strings.

For this specific problem:
This is because mktemp needs to write to the string. From mktemp(3):
The last six characters of template must be XXXXXX and these are replaced with a string that makes the filename unique. Since it will be modified, template must not be a string constant, but should be declared as a character array.
So what you want to do here is use a char[] instead of a string. I'd go with:
import std.stdio;
void main() {
import core.sys.posix.stdlib;
// we'll use a little mutable buffer defined right here
char[255] tempBuffer;
string name = "alphaXXXXXX"; // last six X's are required by mktemp
tempBuffer[0 .. name.length] = name[]; // copy the name into the mutable buffer
tempBuffer[name.length] = 0; // make sure it is zero terminated yourself
auto tmp = mktemp(tempBuffer.ptr);
import std.conv;
writeln(to!string(tmp));
}
In general, creating a mutable string can be done in one of two ways: one is to .dup something, or the other is to use a stack buffer like I did above.
toStringz doesn't care if the input data is mutable, it always returns immutable (apparently...). But it is easy to do it yourself:
auto c_str = ("foo".dup ~ "\0").ptr;
That's how you do it, .dup makes a mutable copy, and appending the zero terminator yourself ensures it is there.
string name = "alphaXXXXXX"; // last six X's are required by mktemp
auto tmp = mktemp((name.dup ~ "\0").ptr);

In addition to Adam's great answer, there's also std.utf.toUTFz, in which case you can do
void main()
{
import core.sys.posix.stdlib;
import std.conv, std.stdio, std.utf;
auto name = toUTFz!(char*)("alphaXXXXXX");
auto tmp = mktemp(name);
writeln(to!string(tmp));
}
std.utf.toUTFz is std.string.toStringz's more capable cousin as it will generate null-terminated UTF-8, UTF-16, and UTF-32 strings (as opposed to just UTF-8) as well as any level of constness. The downside is that it's more verbose for cases where you just want immutable(char)*, because you have to specify the return type.
However, if efficiency is a concern, Adam's solution is likely better simply because it avoids having to allocate the C-string that you pass to mktemp on the heap. toUTFz is shorter though, so if you don't care about the efficiency cost of allocating the C-string on the heap (and most programs probably won't), then toUTFz is arguably better. It depends on the requirements of your particular program.

How do I rearrange bytes in a bytearray in Python?

I am new to python and doing embedded-related work (most of my programming experience has been using C).
I am reading a four-byte float into a bytearray from a serial port but instead of the normal little-endian order DCBA, it is encoded as CDAB. Or it may be encoded as BADC. (where A is most significant byte and D is LSB). What is the correct way of swapping bytes around in a bytearray?
For example, I have
tmp=bytearray(pack("f",3.14))
I want to be able to arbitrarily arrange the bytes in tmp, and then unpack() it back into a float.
Things like this seem essential when doing anything related to embedded systems, but either I am googling it wrong or no clear answer is out there (yet!).
edit:
sure, I can do this:
from struct import *
def modswap(num):
tmp=bytearray(pack("f",num))
res=bytearray()
res.append(tmp[2])
res.append(tmp[3])
res.append(tmp[0])
res.append(tmp[1])
return unpack('f',res)
def main():
print(modswap(3.14))
but there has to be a better way...
Ideally, I would like to be able to slice and re-concatenate as I please, or even replace a slice at a time if possible.

You can swizzle in one step:
from struct import pack,unpack
def modswap(num):
tmp=bytearray(pack("f",num))
tmp[0],tmp[1],tmp[2],tmp[3] = tmp[2],tmp[3],tmp[0],tmp[1]
return unpack('f',tmp)
You can modify a slice of a byte array:
>>> data = bytearray(b'0123456789')
>>> data[3:7] = data[5],data[6],data[3],data[4]
>>> data
bytearray(b'0125634789')

I came across the same problem and the closest answer is in this thread.
In Python3 the solution is:
b''.join((tmp[2:4],tmp[0:2]))

Develop Reference

node.js excel linux python-3.x azure haskell apache-spark rust .htaccess string