What exactly are strings in Nim? - string

From what I understand, strings in Nim are basically a mutable sequence of bytes and that they are copied on assignment.
Given that, I assumed that sizeof would tell me (like len) the number of bytes, but instead it always gives 8 on my 64-bit machine, so it seems to be holding a pointer.
Given that, I have the following questions...
What was the motivation behind copy on assignment? Is it because they're mutable?
Is there ever a time when it isn't copied when assigned? (I assume non-var function parameters don't copy. Anything else?)
Are they optimized such that they only actually get copied if/when they're mutated?
Is there any significant difference between a string and a sequence, or can the answers to the above questions be equally applied to all sequences?
Anything else in general worth noting?
Thank you!

The definition of strings actually is in system.nim, just under another name:
type
TGenericSeq {.compilerproc, pure, inheritable.} = object
len, reserved: int
PGenericSeq {.exportc.} = ptr TGenericSeq
UncheckedCharArray {.unchecked.} = array[0..ArrayDummySize, char]
# len and space without counting the terminating zero:
NimStringDesc {.compilerproc, final.} = object of TGenericSeq
data: UncheckedCharArray
NimString = ptr NimStringDesc
So a string is a raw pointer to an object with a len, reserved and data field. The procs for strings are defined in sysstr.nim.
The semantics of string assignments have been chosen to be the same as for all value types (not ref or ptr) in Nim by default, so you can assume that assignments create a copy. When a copy is unneccessary, the compiler can leave it out, but I'm not sure how much that is happening so far. Passing strings into a proc doesn't copy them. There is no optimization that prevents string copies until they are mutated. Sequences behave in the same way.
You can change the default assignment behaviour of strings and seqs by marking them as shallow, then no copy is done on assignment:
var s = "foo"
shallow s

Related

Reversing Bytes and cross compatible binary parsing in Nim

I've started taking a look at Nim for hobby game modding purposes.
Intro
Yet, I found it difficult to work with Nim compared to C when it comes to machine-specific low-level memory layout and would like to know if Nim actually has better support here.
I need to control byte order and be able to de/serialize arbitrary Plain-Old-Datatype objects to binary custom file formats. I didn't directly find a Nim library which allows flexible storage options like representing enum and pointers with Big-Endian 32-bit. Or maybe I just don't know how to use the feature.
std/marshal : just JSON, i.e. no efficient, flexible nor binary format but cross-compatible
nim-serialization : seems like being made for human readable formats
nesm : flexible cross-compatibility? (It has some options and has a good interface)
flatty : no flexible cross-compatibility, no byte order?
msgpack4nim : no flexible cross-compatibility, byte order?
bingo : ?
Flexible cross-compatibility means, it must be able to de/serialize fields independently of Nim's ABI but with customization options.
Maybe "Kaitai Struct" is more what I look for, a file parser with experimental Nim support.
TL;DR
As a workaround for a serialization library I tried myself at a recursive "member fields reverser" that makes use of std/endians which is almost sufficient.
But I didn't succeed with implementing byte reversal of arbitrarily long objects in Nim. Not practically relevant but I still wonder if Nim has a solution.
I found reverse() and reversed() from std/algorithm but I need a byte array to reverse it and turn it back into the original object type. In C++ there would be reinterprete_cast, in C there is void*-cast, in D there is a void[] cast (D allows defining array slices from pointers) but I couldn't get it working with Nim.
I tried cast[ptr array[value.sizeof, byte]](unsafeAddr value)[] but I can't assign it to a new variable. Maybe there was a different problem.
How to "byte reverse" arbitrary long Plain-Old-Datatype objects?
How to serialize to binary files with byte order, member field size, pointer as file "offset - start offset"? Are there bitfield options in Nim?
It is indeed possible to use algorithm.reverse and the appropriate cast invocation to reverse bytes in-place:
import std/[algorithm,strutils,strformat]
type
LittleEnd{.packed.} = object
a: int8
b: int16
c: int32
BigEnd{.packed.} = object
c: int32
b: int16
a: int8
## just so we can see what's going on:
proc `$`(b: LittleEnd):string = &"(a:0x{b.a.toHex}, b:0x{b.b.toHex}, c:0x{b.c.toHex})"
proc `$`(l:BigEnd):string = &"(c:0x{l.c.toHex}, b:0x{l.b.toHex}, a:0x{l.a.toHex})"
var lit = LittleEnd(a: 0x12, b:0x3456, c: 0x789a_bcde)
echo lit # (a:0x12, b:0x3456, c:0x789ABCDE)
var big:BigEnd
copyMem(big.addr,lit.addr,sizeof(lit))
# here's the reinterpret_cast you were looking for:
cast[var array[sizeof(big),byte]](big.addr).reverse
echo big # (c:0xDEBC9A78, b:0x5634, a:0x12)
for C-style bitfields there is also the {.bitsize.} pragma
but using it causes Nim to lose sizeof information, and of course bitfields wont be reversed within bytes
import std/[algorithm,strutils,strformat]
type
LittleNib{.packed.} = object
a{.bitsize: 4}: int8
b{.bitsize: 12}: int16
c{.bitsize: 20}: int32
d{.bitsize: 28}: int32
BigNib{.packed.} = object
d{.bitsize: 28}: int32
c{.bitsize: 20}: int32
b{.bitsize: 12}: int16
a{.bitsize: 4}: int8
const nibsize = 8
proc `$`(b: LittleNib):string = &"(a:0x{b.a.toHex(1)}, b:0x{b.b.toHex(3)}, c:0x{b.c.toHex(5)}, d:0x{b.d.toHex(7)})"
proc `$`(l:BigNib):string = &"(d:0x{l.d.toHex(7)}, c:0x{l.c.toHex(5)}, b:0x{l.b.toHex(3)}, a:0x{l.a.toHex(1)})"
var lit = LitNib(a: 0x1,b:0x234, c:0x56789, d: 0x0abcdef)
echo lit # (a:0x1, b:0x234, c:0x56789, d:0x0ABCDEF)
var big:BigNib
copyMem(big.addr,lit.addr,nibsize)
cast[var array[nibsize,byte]](big.addr).reverse
echo big # (d:0x5DEBC0A, c:0x8967F, b:0x123, a:0x4)
It's less than optimal to copy the bytes over, then rearrange them with reverse, anyway, so you might just want to copy the bytes over in a loop. Here's a proc that can swap the endianness of any object, (including ones for which sizeof is not known at compiletime):
template asBytes[T](x:var T):ptr UncheckedArray[byte] =
cast[ptr UncheckedArray[byte]](x.addr)
proc swapEndian[T,U](src:var T,dst:var U) =
assert sizeof(src) == sizeof(dst)
let len = sizeof(src)
for i in 0..<len:
dst.asBytes[len - i - 1] = src.asBytes[i]
Bit fields are supported in Nim as a set of enums:
type
MyFlag* {.size: sizeof(cint).} = enum
A
B
C
D
MyFlags = set[MyFlag]
proc toNum(f: MyFlags): int = cast[cint](f)
proc toFlags(v: int): MyFlags = cast[MyFlags](v)
assert toNum({}) == 0
assert toNum({A}) == 1
assert toNum({D}) == 8
assert toNum({A, C}) == 5
assert toFlags(0) == {}
assert toFlags(7) == {A, B, C}
For arbitrary bit operations you have the bitops module, and for endianness conversions you have the endians module. But you already know about the endians module, so it's not clear what problem you are trying to solve with the so called byte reversal. Usually you have an integer, so you first convert the integer to byte endian format, for instance, then save that. And when you read back, convert from byte endian format and you have the int. The endianness procs should be dealing with reversal or not of bytes, so why do you need to do one yourself? In any case, you can follow the source hyperlink of the documentation and see how the endian procs are implemented. This can give you an idea of how to cast values in case you need to do some yourself.
Since you know C maybe the last resort would be to write a few serialization functions and call them from Nim, or directly embed them using the emit pragma. However this looks like the least cross platform and pain free option.
Can't answer anything about generic data structure serialization libraries. I stray away from them because they tend to require hand holding imposing certain limitations on your code and depending on the feature set, a simple refactoring (changing field order in your POD) may destroy the binary compatibility of the generated output without you noticing it until runtime. So you end up spending additional time writing unit tests to verify that the black box you brought in to save you some time behaves as you want (and keeps doing so across refactorings and version upgrades!).

I confuse reference operator and references, dereference operator and pointer

As I know, & is called 'reference operator' which means 'address of'. So, its role is storing address in any variable. For example, 'a=&b;'. But I know another meaning which is 'references'. As you know, references is a alias of a variable. So, in my result, & has two meaning according to position. If 'a=&b;', & means 'address of b'. If 'int &a = b;', & means 'alias of another variable'.
As I know, * is called 'dereference operator'. But It is like &, It has two meaning according to position. If 'int *a = &b', * means 'pointer variable'. If 'a=*b', * means 'dereference variable'.
Are they right????
P.S. I'm foriegner. So I'm poor at English. Sorry... my poor English.
Hi as I understand you are having confusion with the concepts of pointers and references. Let me try and break it down for you :
When we use either of these in a variable declaration, think of it as specifying the datatype of that variable.
For example,
int *a; creates a pointer variable 'a', which can hold the address of(in other words, point to) another integer variable.
Similarly int & a = b; creates a reference variable 'a' which refers to b(in other words 'a' is simply an alias to integer b).
Now it might look as they are same, infact they both serve similar functionality, but here are some differences:
A pointer has memory allocated for it to hold the address of another variable to which it points, whereas references don't actually store the address of the variable they are referencing.
Another difference would be that a pointer need not be initialized when it is declared, but a reference must be initialized when it is declared, otherwise it throws an error. ie; int * a; is ok whereas int & a; throws error.
Now for the operators, try and see them not at all associated with pointers or references(Although thats where we use them the most).
Reference operator(&) simply returns you the address of its operand variable. Whereas a dereferencing operator(*) simply assumes that the value of its argument is an address, and returns the value stored at that address.
Hope this helped you. Here are some useful references(no pun intended):
https://www.ntu.edu.sg/home/ehchua/programming/cpp/cp4_PointerReference.html
How does a C++ reference look, memory-wise?

Fortran CHARACTER FUNCTION without defined size [duplicate]

I am writing the following simple routine:
program scratch
character*4 :: word
word = 'hell'
print *, concat(word)
end program scratch
function concat(x)
character*(*) x
concat = x // 'plus stuff'
end function concat
The program should be taking the string 'hell' and concatenating to it the string 'plus stuff'. I would like the function to be able to take in any length string (I am planning to use the word 'heaven' as well) and concatenate to it the string 'plus stuff'.
Currently, when I run this on Visual Studio 2012 I get the following error:
Error 1 error #6303: The assignment operation or the binary
expression operation is invalid for the data types of the two
operands. D:\aboufira\Desktop\TEMP\Visual
Studio\test\logicalfunction\scratch.f90 9
This error is for the following line:
concat = x // 'plus stuff'
It is not apparent to me why the two operands are not compatible. I have set them both to be strings. Why will they not concatenate?
High Performance Mark's comment tells you about why the compiler complains: implicit typing.
The result of the function concat is implicitly typed because you haven't declared its type otherwise. Although x // 'plus stuff' is the correct way to concatenate character variables, you're attempting to assign that new character object to a (implictly) real function result.
Which leads to the question: "just how do I declare the function result to be a character?". Answer: much as you would any other character variable:
character(len=length) concat
[note that I use character(len=...) rather than character*.... I'll come on to exactly why later, but I'll also point out that the form character*4 is obsolete according to current Fortran, and may eventually be deleted entirely.]
The tricky part is: what is the length it should be declared as?
When declaring the length of a character function result which we don't know ahead of time there are two1 approaches:
an automatic character object;
a deferred length character object.
In the case of this function, we know that the length of the result is 10 longer than the input. We can declare
character(len=LEN(x)+10) concat
To do this we cannot use the form character*(LEN(x)+10).
In a more general case, deferred length:
character(len=:), allocatable :: concat ! Deferred length, will be defined on allocation
where later
concat = x//'plus stuff' ! Using automatic allocation on intrinsic assignment
Using these forms adds the requirement that the function concat has an explicit interface in the main program. You'll find much about that in other questions and resources. Providing an explicit interface will also remove the problem that, in the main program, concat also implicitly has a real result.
To stress:
program
implicit none
character(len=[something]) concat
print *, concat('hell')
end program
will not work for concat having result of the "length unknown at compile time" forms. Ideally the function will be an internal one, or one accessed from a module.
1 There is a third: assumed length function result. Anyone who wants to know about this could read this separate question. Everyone else should pretend this doesn't exist. Just like the writers of the Fortran standard.

Calling a C Function with C-style Strings

I'm struggling to call mktemp in D:
import core.sys.posix.stdlib;
import std.string: toStringz;
auto name = "alpha";
auto tmp = mktemp(name.toStringz);
but I can't figure out how to use it so DMD complains:
/home/per/Work/justd/fs.d(1042): Error: function core.sys.posix.stdlib.mktemp (char*) is not callable using argument types (immutable(char)*)
How do I create a mutable zero-terminated C-style string?
I think I've read somewhere that string literals (const or immutable) are implicitly convertible to zero (null)-terminated strings.
For this specific problem:
This is because mktemp needs to write to the string. From mktemp(3):
The last six characters of template must be XXXXXX and these are replaced with a string that makes the filename unique. Since it will be modified, template must not be a string constant, but should be declared as a character array.
So what you want to do here is use a char[] instead of a string. I'd go with:
import std.stdio;
void main() {
import core.sys.posix.stdlib;
// we'll use a little mutable buffer defined right here
char[255] tempBuffer;
string name = "alphaXXXXXX"; // last six X's are required by mktemp
tempBuffer[0 .. name.length] = name[]; // copy the name into the mutable buffer
tempBuffer[name.length] = 0; // make sure it is zero terminated yourself
auto tmp = mktemp(tempBuffer.ptr);
import std.conv;
writeln(to!string(tmp));
}
In general, creating a mutable string can be done in one of two ways: one is to .dup something, or the other is to use a stack buffer like I did above.
toStringz doesn't care if the input data is mutable, it always returns immutable (apparently...). But it is easy to do it yourself:
auto c_str = ("foo".dup ~ "\0").ptr;
That's how you do it, .dup makes a mutable copy, and appending the zero terminator yourself ensures it is there.
string name = "alphaXXXXXX"; // last six X's are required by mktemp
auto tmp = mktemp((name.dup ~ "\0").ptr);
In addition to Adam's great answer, there's also std.utf.toUTFz, in which case you can do
void main()
{
import core.sys.posix.stdlib;
import std.conv, std.stdio, std.utf;
auto name = toUTFz!(char*)("alphaXXXXXX");
auto tmp = mktemp(name);
writeln(to!string(tmp));
}
std.utf.toUTFz is std.string.toStringz's more capable cousin as it will generate null-terminated UTF-8, UTF-16, and UTF-32 strings (as opposed to just UTF-8) as well as any level of constness. The downside is that it's more verbose for cases where you just want immutable(char)*, because you have to specify the return type.
However, if efficiency is a concern, Adam's solution is likely better simply because it avoids having to allocate the C-string that you pass to mktemp on the heap. toUTFz is shorter though, so if you don't care about the efficiency cost of allocating the C-string on the heap (and most programs probably won't), then toUTFz is arguably better. It depends on the requirements of your particular program.

Are numbers, bools or nils garbage collected in Lua?

This article implies that all types beside numbers, bools and nil are garbage collected.
The field gc is used for the other values (strings, tables, functions, heavy userdata, and threads), which are those subject to garbage collection.
Would this mean under certain circumstances that overusing these non-gc types might result in memory leaks?
In Lua, you have actually 2 kinds of types: Ones which are always passed by value, and ones passed by reference ( as per chapter 2.1 in the Lua Manual ).
The ones you cite are all of the "passed-by-value" type, hence they are directly stored in a variable.
If you delete the variable, the value will be gone instantly.
So it will not start leaking memory, unless, of course, you keep generating new variables containing new values. But in that case it's your own fault ;).
In the article you linked to they write down the C code that shows how values are represented:
/*You can also find this in lobject.h in the Lua source*/
/*I paraphrased a bit to remove some macro magic*/
/*unions in C store one of the values at a time*/
union Value {
GCObject *gc; /* collectable objects */
void *p; /* light userdata */
int b; /* booleans */
lua_CFunction f; /* light C functions */
numfield /* numbers */
};
typedef union Value Value;
/*the _tt tagtells what kind of value is actually stored in the union*/
struct lua_TObject {
int _tt;
Value value_;
};
As you can see in here, booleans and numbers are stored directly in the TObject struct. Since they are not "heap-allocated" it means that they can never "leak" and therefore garbage collecting them would have made no sense.
One interesting to note, however, is that the garbage collector does not collect references created to things on the C side of things (userdata and C C functions). These need to be manually managed from the C-side of things but that is sort of to be expected since in that case you are writing C instead of Lua.

Resources