Preface: I needed to figure out the structure of a binary grid_data_file. From the Fortran routines I figured that the first record consists of 57 bytes and has information in the following order.
No. of the file :: integer*4
File name :: char*16
file status :: char*3 (i.e. new, old, tmp)
.... so forth (rest is clear from write statement in the program)
Now for the testing I wrote a simple program as follows: (I haven't included all the parameters)
Program testIO
implicit none
integer :: x, nclat, nclon
character :: y, z
real :: lat_gap, lon_gap, north_lat, west_lat
integer :: gridtype
open(11, file='filename', access='direct', form='unformatted', recl=200)
read(11, rec=1) x,y,z,lat_gap,lon_gap, north_lat,west_lat, nclat, nclon, gridtype
write(*,*) x,y,z,lat_gap,lon_gap, north_lat,west_lat, nclat, nclon, gridtype
close(11)
END
To my surprise, when I change the declaration part to
integer*4 :: x, nclat, nclon
character*16 :: y
character*3 :: z
real*4 :: lat_gap, lon_gap, north_lat, west_lat
integer*2 :: gridtype
It gives me SOME correct information, albeit not all! I can't understand this. It would help me to improve my Fortran knowledge if someone explains this phenomenon.
Moreover, I can't use ACCESS=stream due to machine being old and not supported, so I conclude that above is the only possibility to figure out the file structure.
From your replies and what others have commented, I think your problem might be a misunderstanding of what a Fortran "record" is:
You say that you have a binary file where each entry (you said record, but more on that later) is 57 bytes.
The problem is that a "record" in Fortran I/O is not what you would expect if you come from a C (or really any other) background. See the following document from Intel, which gives a good explanation of the different access modes:
https://software.intel.com/sites/products/documentation/hpc/composerxe/en-us/2011Update/fortran/lin/bldaps_for/common/bldaps_rectypes.htm
In short, it has extra data (a header) describing the data in each entry.
Moreover, I can't use ACCESS=stream due to machine being old and not supported, so I conclude that above is the only possibility to figure out the file structure. Any guidance would be a big help!
If you can't use stream, AFAIK there is really no simple and painless way to read binary files with no record information.
A possible solution which requires a C compiler is to do IO in a C function that you call from Fortran, "minimal" example:
main.f90:
program main
integer, parameter :: dp = selected_real_kind(15)
character(len=*), parameter :: filename = 'test.bin'
real(dp) :: val
call read_bin(filename, val)
print*, 'Read: ', val
end program
read.c:
#include <string.h>
#include <stdio.h>
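/* gfortran passes the hidden string length as a trailing argument
   (an int in older releases, size_t in newer ones). */
void read_bin_(const char *fname, double *ret, unsigned int len)
{
char buf[256];
if (len >= sizeof buf) len = sizeof buf - 1; // do not overflow buf
printf("len = %u\n", len);
memcpy(buf, fname, len);
buf[len] = '\0'; // Fortran strings are not 0-terminated
FILE* fh = fopen(buf, "rb");
if (fh == NULL) { perror(buf); return; }
if (fread(ret, sizeof(double), 1, fh) != 1) fprintf(stderr, "short read\n");
fclose(fh);
}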
Note that there is an extra parameter needed in the end and some string manipulation because of the way Fortran handles strings, which differs from C.
write.c:
#include <stdio.h>
int main() {
double d = 1.234;
FILE* fh = fopen("test.bin", "wb");
fwrite(&d, sizeof(double), 1, fh);
fclose(fh);
}
Compilation instructions:
gcc -o write write.c
gcc -c -g read.c
gfortran -g -o readbin main.f90 read.o
Create binary file with ./write, then see how the Fortran code can read it back with ./readbin.
This can be extended for different data types to basically emulate access=stream. In the end, if you can recompile the original Fortran code to output the data file differently, this will be the easiest solution, as this one is pretty much a crude hack.
Lastly, a tip for getting into unknown data formats: the tool od is your friend; check its manpage. It can directly convert binary representations into a variety of native data types. Try it with the above example (the z adds the character representation in the right-hand column; not very useful here, but in general it is):
od -t fDz test.bin
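Once the raw bytes of the first record are in memory (for example, read by a C routine like the one above), the leading fields can be picked out by byte offset. The following is only a minimal sketch: it assumes the layout described in the question (a 4-byte integer, a 16-character name, a 3-character status) and that the file was written with the same byte order as the reading machine. The names grid_header and parse_grid_header are hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical view of the first three fields of the record:
   integer*4, character*16, character*3 (layout taken from the question). */
struct grid_header {
    int32_t file_no;
    char    name[17];   /* 16 characters plus a terminating '\0' */
    char    status[4];  /* 3 characters plus a terminating '\0' */
};

/* Copy the fields out of a raw record buffer by byte offset.
   ASSUMPTION: the file uses the host's byte order. */
void parse_grid_header(const unsigned char *rec, struct grid_header *h)
{
    assert(rec != NULL && h != NULL);
    memcpy(&h->file_no, rec, 4);        /* bytes 0..3   */
    memcpy(h->name, rec + 4, 16);       /* bytes 4..19  */
    h->name[16] = '\0';
    memcpy(h->status, rec + 20, 3);     /* bytes 20..22 */
    h->status[3] = '\0';
}
```

The remaining real*4 and integer*2 fields would follow the same pattern with float and int16_t at their respective offsets.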
I've started taking a look at Nim for hobby game modding purposes.
Intro
Yet, I found it difficult to work with Nim compared to C when it comes to machine-specific low-level memory layout and would like to know if Nim actually has better support here.
I need to control byte order and be able to de/serialize arbitrary Plain-Old-Datatype objects to binary custom file formats. I didn't directly find a Nim library which allows flexible storage options like representing enum and pointers with Big-Endian 32-bit. Or maybe I just don't know how to use the feature.
std/marshal : just JSON, i.e. no efficient, flexible nor binary format but cross-compatible
nim-serialization : seems like being made for human readable formats
nesm : flexible cross-compatibility? (It has some options and has a good interface)
flatty : no flexible cross-compatibility, no byte order?
msgpack4nim : no flexible cross-compatibility, byte order?
bingo : ?
Flexible cross-compatibility means, it must be able to de/serialize fields independently of Nim's ABI but with customization options.
Maybe "Kaitai Struct" is more what I look for, a file parser with experimental Nim support.
TL;DR
As a workaround for a serialization library, I tried my hand at a recursive "member fields reverser" that makes use of std/endians, which is almost sufficient.
But I didn't succeed with implementing byte reversal of arbitrarily long objects in Nim. Not practically relevant but I still wonder if Nim has a solution.
I found reverse() and reversed() in std/algorithm, but I need a byte array to reverse and then have to turn it back into the original object type. In C++ there would be reinterpret_cast, in C there is the void* cast, in D there is a void[] cast (D allows defining array slices from pointers), but I couldn't get it working in Nim.
I tried cast[ptr array[value.sizeof, byte]](unsafeAddr value)[] but I can't assign it to a new variable. Maybe there was a different problem.
How to "byte reverse" arbitrary long Plain-Old-Datatype objects?
How to serialize to binary files with byte order, member field size, pointer as file "offset - start offset"? Are there bitfield options in Nim?
It is indeed possible to use algorithm.reverse and the appropriate cast invocation to reverse bytes in-place:
import std/[algorithm,strutils,strformat]
type
  LittleEnd{.packed.} = object
    a: int8
    b: int16
    c: int32
  BigEnd{.packed.} = object
    c: int32
    b: int16
    a: int8
## just so we can see what's going on:
proc `$`(b: LittleEnd):string = &"(a:0x{b.a.toHex}, b:0x{b.b.toHex}, c:0x{b.c.toHex})"
proc `$`(l:BigEnd):string = &"(c:0x{l.c.toHex}, b:0x{l.b.toHex}, a:0x{l.a.toHex})"
var lit = LittleEnd(a: 0x12, b:0x3456, c: 0x789a_bcde)
echo lit # (a:0x12, b:0x3456, c:0x789ABCDE)
var big:BigEnd
copyMem(big.addr,lit.addr,sizeof(lit))
# here's the reinterpret_cast you were looking for:
cast[var array[sizeof(big),byte]](big.addr).reverse
echo big # (c:0xDEBC9A78, b:0x5634, a:0x12)
For C-style bitfields there is also the {.bitsize.} pragma, but using it causes Nim to lose sizeof information, and of course bitfields won't be reversed within bytes:
import std/[algorithm,strutils,strformat]
type
  LittleNib{.packed.} = object
    a{.bitsize: 4}: int8
    b{.bitsize: 12}: int16
    c{.bitsize: 20}: int32
    d{.bitsize: 28}: int32
  BigNib{.packed.} = object
    d{.bitsize: 28}: int32
    c{.bitsize: 20}: int32
    b{.bitsize: 12}: int16
    a{.bitsize: 4}: int8
# sizeof is not available with bitsize fields, so the size is given by hand
const nibsize = 8
proc `$`(b: LittleNib):string = &"(a:0x{b.a.toHex(1)}, b:0x{b.b.toHex(3)}, c:0x{b.c.toHex(5)}, d:0x{b.d.toHex(7)})"
proc `$`(l:BigNib):string = &"(d:0x{l.d.toHex(7)}, c:0x{l.c.toHex(5)}, b:0x{l.b.toHex(3)}, a:0x{l.a.toHex(1)})"
var lit = LittleNib(a: 0x1,b:0x234, c:0x56789, d: 0x0abcdef)
echo lit # (a:0x1, b:0x234, c:0x56789, d:0x0ABCDEF)
var big:BigNib
copyMem(big.addr,lit.addr,nibsize)
cast[var array[nibsize,byte]](big.addr).reverse
echo big # (d:0x5DEBC0A, c:0x8967F, b:0x123, a:0x4)
Copying the bytes over and then rearranging them with reverse is less than optimal anyway, so you might just want to copy the bytes over in reverse order in a loop. Here's a proc that can swap the endianness of any object (including ones for which sizeof is not known at compile time):
template asBytes[T](x:var T):ptr UncheckedArray[byte] =
  cast[ptr UncheckedArray[byte]](x.addr)
proc swapEndian[T,U](src:var T,dst:var U) =
  assert sizeof(src) == sizeof(dst)
  let len = sizeof(src)
  for i in 0..<len:
    dst.asBytes[len - i - 1] = src.asBytes[i]
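For comparison with the C void* cast mentioned in the question, the same whole-object reversal can be sketched in C as follows. This is only an illustration; reverse_bytes is a hypothetical helper name, not part of any library.

```c
#include <assert.h>
#include <stddef.h>

/* Reverse the n bytes starting at p in place.
   Any object may be accessed through an unsigned char pointer. */
void reverse_bytes(void *p, size_t n)
{
    unsigned char *b = p;
    assert(p != NULL || n == 0);
    for (size_t i = 0; i < n / 2; i++) {
        unsigned char tmp = b[i];
        b[i] = b[n - 1 - i];
        b[n - 1 - i] = tmp;
    }
}
```

As with the Nim version, reversing a whole object also reverses the order of its fields, which is why the BigEnd type above declares them in the opposite order.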
Bit fields are supported in Nim as a set of enums:
type
  MyFlag* {.size: sizeof(cint).} = enum
    A
    B
    C
    D
  MyFlags = set[MyFlag]
proc toNum(f: MyFlags): int = cast[cint](f)
proc toFlags(v: int): MyFlags = cast[MyFlags](v)
assert toNum({}) == 0
assert toNum({A}) == 1
assert toNum({D}) == 8
assert toNum({A, C}) == 5
assert toFlags(0) == {}
assert toFlags(7) == {A, B, C}
For arbitrary bit operations you have the bitops module, and for endianness conversions you have the endians module. But you already know about the endians module, so it's not clear what problem you are trying to solve with the so-called byte reversal. Usually you have an integer, so you first convert the integer to, for instance, big endian format, then save that. When you read it back, you convert from big endian format and you have the int. The endianness procs already take care of reversing the bytes or not, so why do you need to do it yourself? In any case, you can follow the source hyperlink in the documentation and see how the endian procs are implemented. This can give you an idea of how to cast values in case you need to do some yourself.
Since you know C maybe the last resort would be to write a few serialization functions and call them from Nim, or directly embed them using the emit pragma. However this looks like the least cross platform and pain free option.
Can't answer anything about generic data structure serialization libraries. I stray away from them because they tend to require hand holding imposing certain limitations on your code and depending on the feature set, a simple refactoring (changing field order in your POD) may destroy the binary compatibility of the generated output without you noticing it until runtime. So you end up spending additional time writing unit tests to verify that the black box you brought in to save you some time behaves as you want (and keeps doing so across refactorings and version upgrades!).
I want to do something like this (C sample):
char str[] = "Hello\x90\x90\xcc\x00";
How?
The equivalent of this in Fortran would be:
character(*), parameter :: str = "Hello"//char(144)//char(144)//char(204)//char(0)
I made this a named (PARAMETER) constant here, but the expression for initializing would be the same in a normal assignment context. Standard Fortran doesn't allow the use of hex constants (such as Z'90') as an argument to CHAR, though many compilers support that as an extension.
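To check the decimal codes used in the Fortran expression against the C literal from the question, a small sketch can read the bytes back as unsigned values. The helper byte_at is a hypothetical name used only for this illustration.

```c
#include <assert.h>
#include <stddef.h>

/* The C literal from the question. sizeof counts the explicit \x00
   and the implicit terminator, so the array holds 10 bytes. */
static const char str[] = "Hello\x90\x90\xcc\x00";

/* Return byte i of str as an unsigned value (0..255). */
unsigned byte_at(size_t i)
{
    assert(i < sizeof str);
    return (unsigned char)str[i];
}
```

Here byte_at(5) and byte_at(6) give 144, byte_at(7) gives 204 and byte_at(8) gives 0, matching the char() arguments. The C array has 10 bytes because C appends its own terminator, while the Fortran constant has 9 characters.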
My program reads an “unformatted” file in Fortran. Among other things, this file contains an array that my program does not need, but which can get quite large. I would like to skip this array.
If this is the program writing the data:
program write
real :: useless(10), useful=42
open(123, file='foo', form='unformatted')
write(123) size(useless)
write(123) useless
write(123) useful
end program write
Then this works for reading:
program read
integer :: n
real, allocatable :: useless(:)
real :: useful
open(123, file='foo', form='unformatted')
read(123) n
allocate(useless(n))
read(123) useless
read(123) useful
print*, useful
end program read
But I would like to avoid allocating the “useless” array. I discovered this
program read2
integer :: n, i
real :: useless
real :: useful
open(123, file='foo', form='unformatted')
read(123) n
do i=1,n
read(123) useless
end do
read(123) useful
print*, useful
end program read2
does not work (because of the record lengths being written to the file [EDIT, see francescalus' answer]).
It is not an option to change the format of the file.
It isn't a sin to read fewer file storage units than are in the record.
program read
real :: useful
open(123, file='foo', form='unformatted')
read(123)
read(123)
read(123) useful
print*, useful
end program read
Each "empty" read still advances the record for a file connected for sequential access.
As a further comment: the second attempt doesn't fail "because of the record lengths". It fails because of the attempt to read separate records. Examples of the significance of this difference can be found in many SO posts.
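To illustrate what skipping a record means at the byte level, here is a sketch in C. It rests on an assumption that is not part of the Fortran standard: that each record of the sequential unformatted file is framed by a 4-byte length before and after the payload, in host byte order, which is gfortran's default. Other compilers or options may use a different framing. The name skip_record is hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Skip one record of a sequential unformatted file.
   ASSUMPTION: each record is framed by a 4-byte length before and after
   the payload (gfortran's default); other compilers may differ.
   Returns the payload length, or -1 on error or end of file. */
long skip_record(FILE *fh)
{
    int32_t head, tail;
    assert(fh != NULL);
    if (fread(&head, sizeof head, 1, fh) != 1 || head < 0)
        return -1;
    if (fseek(fh, (long)head, SEEK_CUR) != 0)   /* jump over the payload */
        return -1;
    if (fread(&tail, sizeof tail, 1, fh) != 1 || tail != head)
        return -1;
    return (long)head;
}
```

Calling skip_record twice on the file from the question would leave the stream positioned at the record holding useful, much like the two empty read statements do in Fortran.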
Francescalus has shown how to skip over an entire record. If a record contains some data that should be skipped as well as some data to be read, you can read a dummy variable repeatedly to skip the unwanted data. Below is a program demonstrating this.
program write
real :: dummy,useless(10), useful=42
integer, parameter :: outu = 20, inu = 21
character (len=*), parameter :: fname = "foo"
integer :: n
call random_seed()
call random_number(useless)
open(outu, file=fname, form='unformatted', action = "write")
write(outu) size(useless)
write(outu) useless
write(outu) useful
close(outu)
open(inu, file=fname, form='unformatted', action = "read")
read (inu) n
read (inu) (dummy,i=1,n)
read (inu) useful
write (*,*) "useful = ",useful
end program write
I just started my path with Erlang and I'm facing a problem I can't find a solution for:
I wrote a method that takes a domain expressed as a binary string, e.g. <<"www.404pagenotfound.com">>, and converts it to the domain format required by the DNS protocol: <<3,"www",15,"404pagenotfound",3,"com">>.
Here is the code (I have rewritten it many times in different ways):
domainbyte(Bin) ->
if byte_size(Bin) > 0 ->
Res = binary:split(Bin, <<".">>),
[Chunk|[RestList]] = Res,
ChunkSize = byte_size(Chunk),
if length(RestList) > 0 ->
Rest = domainbyte(RestList), %% <- Got "bad argument" here!
<<ChunkSize/binary,Chunk,Rest>>;
true ->
<<ChunkSize/binary,Chunk>>
end
end.
Thx in advance for any clues.
PS.
Thanks to the comments I've found the error in the code above:
if length(RestList) > 0 -> %% here RestList is binary data, so length throws a "bad argument" error.
I've rewritten the method this way, but still with no luck:
NOTE: I was able to fix the code below. The problem is that if you have a binary chunk and want to use it inside another binary, you must specify /binary on it, which was not obvious to me.
For example, consider this small snippet:
TT = <<"com">>,
SS = <<3, TT, 0>> %% <- you get error: bad argument
It must be fixed this way:
TT = <<"com">>,
SS = <<3, TT/binary, 0>>
domainbyte(Bin) ->
if byte_size(Bin) > 0 ->
Res = binary:split(Bin, <<".">>),
if length(Res) > 1 ->
[Chunk|[RestList]] = Res,
ChunkSize = byte_size(Chunk),
Rest = domainbyte(RestList),
<<ChunkSize,Chunk,Rest>>;
true ->
[Chunk] = Res,
ChunkSize = byte_size(Chunk),
<<ChunkSize,Chunk>>
end
end.
MdP
I think the easiest solution is to define the function using a binary comprehension:
domainbyte(Bin) ->
Chunks = binary:split(Bin, <<".">>, [global]), %A list of all the chunks
<< <<byte_size(C),C/binary>> || C <- Chunks >>. %Build output binary
It might be slightly faster to build the output binary as a list of segments in a separate function and then put them together with iolist_to_binary/1. Note that if a '.' occurs at the very start or end of the binary, this code will treat that as an empty segment of length 0. If these should be discarded, then you need to add the option trim to binary:split/3. Note also that the size will occupy only one byte.
@Alnitak has the separate function but builds the binary one segment at a time, so it is not more efficient than the binary comprehension, which does the same thing.
N.B. that if you have a binary segment Chunk/binary when constructing a binary this means that Chunk IS a binary, not that it should become one. Binaries are flat structures, think byte arrays, so everything becomes a binary. Or rather the binary.
EDIT: Though I see I missed the 0 which should be at the end. That is left as an exercise to the reader.
EDIT: Being in teaching mode, apart from the constructing binaries, a key to writing good Erlang code is understanding pattern matching. You use a bit but could do it more:
domainbyte(Bin) ->
case binary:split(Bin, <<".">>) of
[Chunk,Rest] -> %There was a '.'
RestBin = domainbyte(Rest),
Size = byte_size(Chunk),
<<Size,Chunk/binary,RestBin/binary>>;
[Chunk] -> %This was the last chunk
Size = byte_size(Chunk),
<<Size,Chunk/binary,0>> %Add terminating 0
end.
This is basically doing the same as your code but we are using pattern matching to select a clause, not only to pull apart a known structure. Pattern matching is the basic method for control, not just in case as here but also in functions and receive. This results in if being used quite sparingly.
Enough from me for now.
I'm a bit rusty on Erlang, but I think your problem is that RestList is an array of chunks from the output of binary:split.
So when you chuck it recursively back into domainbyte it's in the wrong format.
Also - don't forget the terminating NUL byte to represent the root label!
FWIW, here's my working version:
label([]) ->
<< 0 >>;
label([H|T]) ->
D = label(H),
P = label(T),
<< D/binary, P/binary>>;
label(A) ->
L = byte_size(A),
<< <<L>>/binary, A/binary>>.
domainbyte(A) ->
Res = binary:split(A, <<".">>, [global, trim]),
label(Res).
It correctly adds a trailing NUL byte, and trims any extra trailing dots.
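For readers more at home in C, the length-prefixed label encoding discussed in this thread can be sketched as below. This is only an illustration under stated assumptions: dns_encode is a hypothetical name, and the 63-byte label limit comes from the DNS specification rather than from the Erlang code above.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Encode a dotted name such as "www.example.com" into DNS wire format:
   each label is preceded by its length, and a 0 byte ends the name.
   Returns the number of bytes written, or 0 if a label is empty or longer
   than 63 bytes, or if out (capacity cap) is too small. */
size_t dns_encode(const char *name, unsigned char *out, size_t cap)
{
    size_t pos = 0;
    assert(name != NULL && out != NULL);
    while (*name != '\0') {
        size_t len = strcspn(name, ".");      /* length of the next label */
        if (len == 0 || len > 63 || pos + 1 + len + 1 > cap)
            return 0;
        out[pos++] = (unsigned char)len;
        memcpy(out + pos, name, len);
        pos += len;
        name += len;
        if (*name == '.')
            name++;                           /* step over the separator */
    }
    if (pos + 1 > cap)
        return 0;
    out[pos++] = 0;                           /* root label */
    return pos;
}
```

Like the trim option in the Erlang version, a single trailing dot is accepted and ignored, while an empty label in the middle of the name is rejected.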
The concat of binaries should be written like this:
<< <<ChunkSize>>/binary, Chunk/binary, Rest/binary>>
% and
<< <<ChunkSize>>/binary, Chunk/binary>>
Consider this simple "benchmark":
n :: Int
n = 1000
main = do
  print $ length [(a,b,c) | a<-[1..n],b<-[1..n],c<-[1..n],a^2+b^2==c^2]
and appropriate C version:
#include <stdio.h>
int main(void)
{
int a,b,c, N=1000;
int cnt = 0;
for (a=1;a<=N;a++)
for (b=1;b<=N;b++)
for (c=1;c<=N;c++)
if (a*a+b*b==c*c) cnt++;
printf("%d\n", cnt);
}
Compilation:
Haskell version is compiled as: ghc -O2 triangle.hs (ghc 7.4.1)
C version is compiled as: gcc -O2 -o triangle-c triangle.c (gcc 4.6.3)
Run times:
Haskell: 4.308s real
C: 1.145s real
Is it expected, even for such a simple and presumably easily optimizable program, that Haskell is almost 4 times slower? Where does Haskell waste time?
The Haskell version is wasting time allocating boxed integers and tuples.
You can verify this by for example running the haskell program with the flags +RTS -s. For me the outputted statistics include:
80,371,600 bytes allocated in the heap
A straightforward encoding of the C version is faster since the compiler can use unboxed integers and skip allocating tuples:
n :: Int
n = 1000
main = do
  print $ f n
f :: Int -> Int
f max = go 0 1 1 1
  where go cnt a b c
          | a > max = cnt
          | b > max = go cnt (a+1) 1 1
          | c > max = go cnt a (b+1) 1
          | a^2+b^2==c^2 = go (cnt+1) a b (c+1)
          | otherwise = go cnt a b (c+1)
See:
51,728 bytes allocated in the heap
The running time of this version is 1.920s vs. 1.212s for the C version.
I don't know how much your "bench" is relevant.
I agree that the list-comprehension syntax is "nice" to use, but if you want to compare the performances of the two languages, you should maybe compare them on a fairer test.
I mean, creating a list of possibly a lot of elements and then calculating its length is nothing like incrementing a counter in a triple loop.
So maybe Haskell has some nice optimizations which detect what you are doing and never create the list, but I wouldn't write code relying on that, and you probably shouldn't either.
I don't think you would code your program like that if you needed to count rapidly, so why do it for this bench?
Haskell can be optimized quite well — but you need the proper techniques, and you need to know what you're doing.
This list comprehension syntax is elegant, yet wasteful. You should read the appropriate chapter of RealWorldHaskell in order to find out more about your profiling opportunities. In this exact case, you create a lot of list spines and boxed Ints for no good reason at all. See here:
You should definitely do something about that. EDIT: @opqdonut just posted a good answer on how to make this faster.
Just remember next time to profile your application before comparing any benchmarks. Haskell makes it easy to write idiomatic code, but it also hides a lot of complexity.