Nim: work with read-only memory mapped files - nim-lang

I've only just started with Nim, so this is possibly a simple question. We need to do many lookups into data that is stored in files. Some of these files are too large to load into memory, hence the mmapped approach. I'm able to mmap the file by means of memfiles and end up with either a pointer or a MemSlice. The file and the memory region are read-only, and hence have a fixed size. I was hoping to access the data as immutable fixed-size byte and char arrays without copying it, leveraging all the existing functionality available to seqs, arrays, strings etc. All the MemSlice / string methods copy the data, which is fair, but not what I want (and, in my use case, not what I need).
I understand that array, string etc. types have a pointer to the data and a len field, but I couldn't find a way to create them from a pointer and a len. I assume it has something to do with ownership and refs to memory that may outlive my slice.
let mm = memfiles.open(...)
let myImmutableFixedSizeArr = ?? # cast[ptr array[fsize, char]](mm.mem) doesn't compile as fsize needs to be const. Neither could I find something like let x: [char] = array_from(mm.mem, fsize)
let myImmutableFixedSizeString = mm[20, 30].to_fixed_size_immutable_string # Create something that is string like so that I can use all the existing string methods.
UPDATE: I did find https://forum.nim-lang.org/t/4680#29226 which explains how to use openArray, but openArray is only allowed as a function argument and, if I'm not mistaken, it doesn't behave like a normal array.
Thanks for your help

It is not possible to convert a raw char array in memory (ptr UncheckedArray[char]) to a string without copying, only to an openArray[char] (or cstring).
So it won't be possible to use procs that expect a string, only those that accept openArray[T] or openArray[char].
Happily, an openArray[T] behaves exactly like a seq[T] when sent to a proc.
({.experimental: "views".} does let you assign an openArray[T] to a local variable, but it's not anywhere near ready for production.)
You can use the memSlices iterator to loop over delimited chunks in a MemFile without copying:
import memfiles
template toOpenArray(ms: MemSlice, T: typedesc = byte): openArray[T] =
  ## template because openArray isn't a valid return type yet
  toOpenArray(cast[ptr UncheckedArray[T]](ms.data), 0, (ms.size div sizeof(T)) - 1)
func process(slice: openArray[char]) =
  ## your code here but e.g.
  ## count number of A's
  var nA: int
  for ch in slice.items:
    if ch == 'A': inc nA
  debugEcho nA
let mm = memfiles.open("file.txt")
for slice in mm.memSlices:
  process slice.toOpenArray(char)
Or, to work with some char array represented in the middle of the file, you can use pointer arithmetic.
import memfiles
template extractImpl(typ, pntr, offset): untyped =
  # reinterpret the address `pntr + offset` as a value of type `typ`
  cast[typ](cast[ByteAddress](pntr) + offset)
template checkFileLen(memfile, len, offset) =
  if offset + len > memfile.size:
    raise newException(IndexDefect, "file too short")
func extract*(mm: MemFile, T: typedesc, offset: Natural): ptr T =
  # bounds-check against the size of T, not T itself
  checkFileLen(mm, sizeof(T), offset)
  result = extractImpl(ptr T, mm.mem, offset)
func extract*[U](mm: MemFile, T: typedesc[ptr U], offset: Natural): T =
  # no length is known here, so the caller must check bounds
  extractImpl(T, mm.mem, offset)
let mm = memfiles.open("file.txt")
# to extract a string whose length is known at compile time:
let mystring_offset = 3
const mystring_len = 10
type MyStringT = array[mystring_len, char]
let myString: ptr MyStringT = mm.extract(MyStringT, mystring_offset)
process myString[]
# to extract a dynamic length string:
let size_offset = 14
let string_offset = 18
let sz: ptr int32 = mm.extract(int32, size_offset)
let str: ptr UncheckedArray[char] = mm.extract(ptr UncheckedArray[char], string_offset)
checkFileLen(mm, sz[], string_offset)
process str.toOpenArray(0, sz[] - 1)

Related

Inserting an item into CComboBoxEx and ReleaseBuffer

Code:
m_cbReminderInterval.ResetContent();
for (int i = 1; i <= m_iMaxReminderInterval; i++)
{
    COMBOBOXEXITEM cmbItem = {};
    CString strNumber;
    strNumber.Format(_T("%d"), i);
    cmbItem.mask = CBEIF_TEXT;
    cmbItem.iItem = static_cast<INT_PTR>(i) - 1;
    cmbItem.pszText = strNumber.GetBuffer(_MAX_PATH);
    strNumber.ReleaseBuffer(); // TODO: When should I release the buffer - NOW or AFTER the InsertItem call?
    m_cbReminderInterval.InsertItem(&cmbItem);
}
My question is:
Is it better to use GetString instead of GetBuffer in this context? The only issue I see is that pszText is LPWSTR whereas GetString returns LPCWSTR. If I should continue to use GetBuffer then when should it actually be released? Before or after the InsertItem call?
There's a common pattern in the Windows API you'll see over and over again: Structures that are less const-correct than what would appear to be possible. Undoubtedly, some of them are oversights, but not this one: COMBOBOXEXITEM is used both to insert and query an item's data.
This is hinted at, in part, in the documentation for the pszText member:
A pointer to a character buffer that contains or receives the item's text. If text information is being retrieved, this member must be set to the address of a character buffer that will receive the text.
The second part of the contract is omitted from the documentation, sadly. When setting an item's text, the control makes a copy of the string passed in, and neither takes ownership of the pointed-to data, nor modifies it. In other words: When using the COMBOBOXEXITEM structure to insert an item, all pointers can be assumed to point to const.
Following that, it is perfectly valid to pass the pointer received from GetString():
for (int i = 1; i <= m_iMaxReminderInterval; i++)
{
    COMBOBOXEXITEM cmbItem = {};
    CString strNumber;
    strNumber.Format(_T("%d"), i);
    cmbItem.mask = CBEIF_TEXT;
    cmbItem.iItem = static_cast<INT_PTR>(i) - 1;
    cmbItem.pszText = const_cast<TCHAR*>(strNumber.GetString());
    m_cbReminderInterval.InsertItem(&cmbItem);
}
According to CSimpleStringT::GetBuffer:
If you use the pointer returned by GetBuffer to change the string contents, you must call ReleaseBuffer before you use any other CSimpleStringT member methods.
You are not modifying the string, so you don't need to call ReleaseBuffer.
But as you said, it's better to use GetString, at least you indicate your intent to NOT modify it.

String memory usage in Golang

I was optimizing code that used a map[string]string where the values of the map were only ever "A" or "B". So I thought a map[string]bool was obviously much better, as the map holds around 50 million elements.
var a = "a"
var a2 = "Why This ultra long string take the same amount of space in memory as 'a'"
var b = true
var c = map[string]string{} // initialized, because writing to a nil map panics
var d = map[string]bool{}
c["t"] = "A"
d["t"] = true
fmt.Printf("a: %T, %d\n", a, unsafe.Sizeof(a))
fmt.Printf("a2: %T, %d\n", a2, unsafe.Sizeof(a2))
fmt.Printf("b: %T, %d\n", b, unsafe.Sizeof(b))
fmt.Printf("c: %T, %d\n", c, unsafe.Sizeof(c))
fmt.Printf("d: %T, %d\n", d, unsafe.Sizeof(d))
fmt.Printf("c2: %T, %d\n", c, unsafe.Sizeof(c["t"]))
fmt.Printf("d2: %T, %d\n", d, unsafe.Sizeof(d["t"]))
And the result was:
a: string, 8
a2: string, 8
b: bool, 1
c: map[string]string, 4
d: map[string]bool, 4
c2: map[string]string, 8
d2: map[string]bool, 1
While testing I found something weird: why does a2, with a really long string, use 8 bytes, the same as a, which has only one letter?
unsafe.Sizeof() does not recursively go into data structures, it just reports the "shallow" size of the value passed. Quoting from its doc:
The size does not include any memory possibly referenced by x. For instance, if x is a slice, Sizeof returns the size of the slice descriptor, not the size of the memory referenced by the slice.
Maps in Go are implemented as pointers, so unsafe.Sizeof(somemap) will report the size of that pointer.
Strings in Go are just headers containing a pointer and a length. See reflect.StringHeader:
type StringHeader struct {
Data uintptr
Len int
}
So unsafe.Sizeof(somestring) will report the size of the above struct, which is independent of the length of the string value (which is the value of the Len field).
To get the actual memory requirement of a map ("deeply"), see How much memory do golang maps reserve? and also How to get memory size of variable in Go?
Go stores the UTF-8 encoded byte sequences of string values in memory. The builtin function len() reports the byte-length of a string, so basically the memory required to store a string value in memory is:
var str string = "some string"
stringSize := len(str) + int(unsafe.Sizeof(str))
Also don't forget that a string value may be constructed by slicing another, bigger string, and thus even if the original string is no longer referenced (and thus no longer needed), the bigger backing array will still be required to be kept in memory for the smaller string slice.
For example:
s := "some loooooooong string"
s2 := s[:2]
Here, even though the memory requirement for s2 would be len(s2) + unsafe.Sizeof(s2) = 2 + unsafe.Sizeof(s2), the whole backing array of s will still be retained.

Swift 3 how to store a struct in a "Data" object

What is the "right" way to stuff an arbitrary, odd sized struct into a swift 3 Data object ?
I think that I have got there, but it seems horribly convoluted for what, from prior experience, was no more than
dataObject.append(&structInstance, sizeof(structInstance))
My case is as follows:
The structure of interest:
public struct CutEntry {
var itemA : UInt64
var itemB : UInt32
}
I have an array of these things that I want to stuff into a data object, in a specific manner as the data object becomes a file which is eventually read by a different application on a different architecture.
The function to put them into a Data object
open func encodeCutsData() -> Data
{
    var data = Data()
    for entry in cutsArray
    {
        // big-endian conversion, stored in a var just so that you can take its address
        var entryCopy = CutEntry(itemA: entry.itemA.bigEndian, itemB: entry.itemB.bigEndian)
        // step 1 get the address of the item as an UnsafePointer
        let d2 = withUnsafePointer(to: &entryCopy) { return $0}
        // step 2 cast it to a raw pointer
        let d3 = UnsafeRawPointer(d2)
        // step 3 create a temp data object
        let d4 = Data(bytes:d3, count: MemoryLayout<CutEntry>.size )
        // step 4 add the temp to main data object
        data.append(d4)
    }
    return data
}
Earlier when we only had NSMutableData it was
let item = NSMutableData()
for entry in cutsArray
{
    var entryCopy = CutEntry(itemA: entry.itemA.bigEndian, itemB: entry.itemB.bigEndian)
    item.append(&entryCopy, length: MemoryLayout<CutEntry>.size)
}
I've spent a few hours searching for examples of manipulating struct and Data objects. I thought that I was close when I found references to UnsafeBufferPointer. That blew up in my face when I discovered that the "buffer" bit uses core memory alignment (which can be useful) and it was stuffing 16 bytes into the data object instead of the expected 12.
I am quite prepared to say that I have missed the blindingly obvious bit of RTFM somewhere. Can anyone offer a cleaner solution? Or has Swift really gone backwards here?
If I could find a way of getting a pointer to the item as a UInt8 pointer, that would remove a couple of lines, but that looks just as difficult.
Checking the reference for Data, I can find two things which may be useful for you:
init(bytes: UnsafeRawPointer, count: Int)
func append(Data)
You can write something like this:
var data = Data()
for entry in cutsArray {
    var entryCopy = CutEntry(itemA: entry.itemA.bigEndian, itemB: entry.itemB.bigEndian)
    data.append(Data(bytes: &entryCopy, count: MemoryLayout<CutEntry>.size))
}

Node.js buffer string serialization

I want to serialize a buffer to a string without any overhead (one character for one byte) and be able to unserialize it into a buffer again.
var b = new Buffer (4) ;
var s = b.toString() ;
var b2 = new Buffer (s)
Produces the same results only for values below 128. I want to use the whole scope of 0-255.
I know I can write it in a loop with String.fromCharCode() in serializing and String.charCodeAt() in deserializing, but I'm looking for some native module implementation if there is any.
You can use the 'latin1' encoding, but you should generally try to avoid it because converting a Buffer to a binary string has some extra computational overhead.
Example:
var b = Buffer.alloc(4);
var s = b.toString('latin1');
var b2 = Buffer.from(s, 'latin1');

Sizeof struct in Go

I'm having a look at Go, which looks quite promising.
I am trying to figure out how to get the size of a Go struct, for example something like
type Coord3d struct {
X, Y, Z int64
}
Of course I know that it's 24 bytes, but I'd like to know it programmatically.
Do you have any ideas how to do this ?
Roger already showed how to use the Sizeof function from the unsafe package. Make sure you read this before relying on the value returned by the function:
The size does not include any memory possibly referenced by x. For instance, if x is a slice, Sizeof returns the size of the slice descriptor, not the size of the memory referenced by the slice.
In addition to this I wanted to explain how you can easily calculate the size of any struct using a couple of simple rules. And then how to verify your intuition using a helpful service.
The size depends on the types of the fields and on their order in the struct (because different padding will be used). This means that two structs with the same fields can have different sizes.
For example, this struct has a size of 32 bytes (on a 64-bit platform):
struct {
	a bool
	b string
	c bool
}
and a slight modification has a size of 24 bytes (25% smaller, just from a more compact ordering of fields):
struct {
	a bool
	c bool
	b string
}
In the second example, c moves into the padding that previously followed a, so one run of padding disappears. An alignment can be 1, 2, 4, or 8 bytes. Padding is the unused space inserted so that the next field starts at a multiple of its alignment (basically wasted space).
Knowing this rule and remembering that (sizes for a 64-bit platform):
bool, int8, uint8 - 1 byte
int16, uint16 - 2 bytes
int32, uint32, float32 - 4 bytes
int64, uint64, float64, pointer - 8 bytes
string - 16 bytes (two 8-byte words)
any slice - 24 bytes (three 8-byte words), so []bool and [][][]string have the same size (do not forget to reread the citation I added at the beginning)
an array of length n - n times the size of its element type
Armed with the knowledge of padding, alignment and sizes in bytes, you can quickly figure out how to improve your struct (but still it makes sense to verify your intuition using the service).
import "unsafe"
// INotifyInfo describes an inotify event.
type INotifyInfo struct {
	Wd     int32  // Watch descriptor
	Mask   uint32 // Watch mask
	Cookie uint32 // Cookie to synchronize two events
	Len    uint32 // Length (including NULs) of name
}
func doSomething() {
	var info INotifyInfo
	// unsafe.Sizeof is a compile-time constant expression, so it can initialize a const
	const infoSize = unsafe.Sizeof(info)
	...
}
NOTE: The OP is mistaken. The unsafe.Sizeof does return 24 on the example Coord3d struct. See comment below.
binary.Size (called binary.TotalSize in the pre-1.0 releases this answer was written for) is also an option, but note there's a slight difference in behavior between it and unsafe.Sizeof: binary.Size includes the size of the contents of a slice of fixed-size values, while unsafe.Sizeof only returns the size of the top-level descriptor. Here's an example of how to use it.
package main
import (
	"encoding/binary"
	"fmt"
)
type T struct {
	a uint32
	b int8
}
func main() {
	var t T
	// number of bytes binary.Write would produce for t
	s := binary.Size(t)
	fmt.Println(s)
}
This is subject to change but last I looked there is an outstanding compiler bug (bug260.go) related to structure alignment. The end result is that packing a structure might not give the expected results. That was for compiler 6g version 5383 release.2010-04-27 release. It may not be affecting your results, but it's something to be aware of.
UPDATE: The only bug left in go test suite is bug260.go, mentioned above, as of release 2010-05-04.
Hotei
To avoid having to initialize a structure, you can use a nil pointer to Coord3d; unsafe.Sizeof only looks at the static type of its argument and never dereferences it:
package main
import (
	"fmt"
	"unsafe"
)
type Coord3d struct {
	X, Y, Z int64
}
func main() {
	var dummy *Coord3d
	fmt.Printf("sizeof(Coord3d) = %d\n", unsafe.Sizeof(*dummy))
}
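import (
	"bytes"
	"encoding/gob"
)
// getRealSizeOf returns the number of bytes v occupies once gob-encoded.
// This is the serialized size (including gob's type information), not the in-memory size.
func getRealSizeOf(v interface{}) (int, error) {
	b := new(bytes.Buffer)
	if err := gob.NewEncoder(b).Encode(v); err != nil {
		return 0, err
	}
	return b.Len(), nil
}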
