Doing file operations with 64-bit addresses in C + MinGW32 - cygwin

I'm trying to read in a 24 GB XML file in C, but it won't work. I'm printing out the current position using ftell() as I read it in, but once it gets to a big enough number, it goes back to a small number and starts over, never even getting 20% through the file. I assume this is a problem with the range of the variable that's used to store the position (long), which can go up to about 4,000,000,000 according to http://msdn.microsoft.com/en-us/library/s3f49ktz(VS.80).aspx, while my file is 25,000,000,000 bytes in size. A long long should work, but how would I change what my compiler(Cygwin/mingw32) uses or get it to have fopen64?

The ftell() function typically returns an unsigned long, which only goes up to 232 bytes (4 GB) on 32-bit systems. So you can't get the file offset for a 24 GB file to fit into a 32-bit long.
You may have the ftell64() function available, or the standard fgetpos() function may return a larger offset to you.

You might try using the OS provided file functions CreateFile and ReadFile. According to the File Pointers topic, the position is stored as a 64bit value.

Unless you can use a 64-bit method as suggested by Loadmaster, I think you will have to break the file up.
This resource seems to suggest it is possible using _telli64(). I can't test this though, as I don't use mingw.

I don't know of any way to do this in one file, a bit of a hack but if splitting the file up properly isn't a real option, you could write a few functions that temp split the file, one that uses ftell() to move through the file and swaps ftell() to a new file when its reaching the split point, then another that stitches the files back together before exiting. An absolutely botched up approach, but if no better solution comes to light it could be a way to get the job done.

I found the answer. Instead of using fopen, fseek, fread, fwrite... I'm using _open, lseeki64, read, write. And I am able to write and seek in > 4GB files.
Edit: It seems the latter functions are about 6x slower than the former ones. I'll give the bounty anyone who can explain that.
Edit: Oh, I learned here that read() and friends are unbuffered. What is the difference between read() and fread()?

Even if the ftell() in the Microsoft C library returns a 32-bit value and thus obviously will return bogus values once you reach 2 GB, just reading the file should still work fine. Or do you need to seek around in the file, too? For that you need _ftelli64() and _fseeki64().
Note that unlike some Unix systems, you don't need any special flag when opening the file to indicate that it is in some "64-bit mode". The underlying Win32 API handles large files just fine.

Related

How do I seek for holes and data in a sparse file in golang [duplicate]

I want to copy files from one place to another and the problem is I deal with a lot of sparse files.
Is there any (easy) way of copying sparse files without becoming huge at the destination?
My basic code:
out, err := os.Create(bricks[0] + "/" + fileName)
in, err := os.Open(event.Name)
io.Copy(out, in)
Some background theory
Note that io.Copy() pipes raw bytes – which is sort of understandable once you consider that it pipes data from an io.Reader to an io.Writer which provide Read([]byte) and Write([]byte), correspondingly.
As such, io.Copy() is able to deal with absolutely any source providing
bytes and absolutely any sink consuming them.
On the other hand, the location of the holes in a file is a "side-channel" information which "classic" syscalls such as read(2) hide from their users.
io.Copy() is not able to convey such side-channel information in any way.
IOW, initially, file sparseness was an idea to just have efficient storage of the data behind the user's back.
So, no, there's no way io.Copy() could deal with sparse files in itself.
What to do about it
You'd need to go one level deeper and implement all this using the syscall package and some manual tinkering.
To work with holes, you should use the SEEK_HOLE and SEEK_DATA special values for the lseek(2) syscall which are, while formally non-standard, are supported by all major platforms.
Unfortunately, the support for those "whence" positions is not present
neither in the stock syscall package (as of Go 1.8.1)
nor in the golang.org/x/sys tree.
But fear not, there are two easy steps:
First, the stock syscall.Seek() is actually mapped to lseek(2)
on the relevant platforms.
Next, you'd need to figure out the correct values for SEEK_HOLE and
SEEK_DATA for the platforms you need to support.
Note that they are free to be different between different platforms!
Say, on my Linux system I can do simple
$ grep -E 'SEEK_(HOLE|DATA)' </usr/include/unistd.h
# define SEEK_DATA 3 /* Seek to next data. */
# define SEEK_HOLE 4 /* Seek to next hole. */
…to figure out the values for these symbols.
Now, say, you create a Linux-specific file in your package
containing something like
// +build linux
const (
SEEK_DATA = 3
SEEK_HOLE = 4
)
and then use these values with the syscall.Seek().
The file descriptor to pass to syscall.Seek() and friends
can be obtained from an opened file using the Fd() method
of os.File values.
The pattern to use when reading is to detect regions containing data, and read the data from them – see this for one example.
Note that this deals with reading sparse files; but if you'd want to actually transfer them as sparse – that is, with keeping this property of them, – the situation is more complicated: it appears to be even less portable, so some research and experimentation is due.
On Linux, it appears you could try to use fallocate(2) with
FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE to try to punch a hole at the
end of the file you're writing to; if that legitimately fails
(with syscall.EOPNOTSUPP), you just shovel as many zeroed blocks to the destination file as covered by the hole you're reading – in the hope
the OS will do the right thing and will convert them to a hole by itself.
Note that some filesystems do not support holes at all – as a concept.
One example is the filesystems in the FAT family.
What I'm leading you to is that inability of creating a sparse file might
actually be a property of the target filesystem in your case.
You might find Go issue #13548 "archive/tar: add support for writing tar containing sparse files" to be of interest.
One more note: you might also consider checking whether the destination directory to copy a source file resides in the same filesystem as the source file, and if this holds true, use the syscall.Rename() (on POSIX systems)
or os.Rename() to just move the file across different directories w/o
actually copying its data.
You don't need to resort to syscalls.
package main
import "os"
func main() {
f, _ := os.Create("/tmp/sparse.dat")
f.Write([]byte("start"))
f.Seek(1024*1024*10, 0)
f.Write([]byte("end"))
}
Then you'll see:
$ ls -l /tmp/sparse.dat
-rw-rw-r-- 1 soren soren 10485763 Jun 25 14:29 /tmp/sparse.dat
$ du /tmp/sparse.dat
8 /tmp/sparse.dat
It's true you can't use io.Copy as is. Instead you need to implement an alternative to io.Copy which reads a chunk from the src, checks if it's all '\0'. If it is, just dst.Seek(len(chunk), os.SEEK_CUR) to skip past that part in dst. That particular implementation is left as an exercise to the reader :)

What is a quick way to check if file contents are null?

I have a rather large file (32 GB) which is an image of an SD card, created using dd.
I suspected that the file is empty (i.e. filled with the null byte \x00) starting from a certain point.
I checked this using python in the following way (where f is an open file handle with the cursor at the last position I could find data at):
for i in xrange(512):
if set(f.read(64*1048576))!=set(['\x00']):
print i
break
This worked well (in fact it revealed some data at the very end of the image), but took >9 minutes.
Has anyone got a better way to do this? There must be a much faster way, I'm sure, but cannot think of one.
Looking at a guide about memory buffers in python here I suspected that the comparator itself was the issue. In most non-typed languages memory copies are not very obvious despite being a killer for performance.
In this case, as Oded R. established, creating a buffer from read and comparing the result with a previously prepared nul filled one is much more efficient.
size = 512
data = bytearray(size)
cmp = bytearray(size)
And when reading:
f = open(FILENAME, 'rb')
f.readinto(data)
Two things that need to be taken into account is:
The size of the compared buffers should be equal, but comparing bigger buffers should be faster until some point (I would expect memory fragmentation to be the main limit)
The last buffer may not be the same size, reading the file into the prepared buffer will keep the tailing zeroes where we want them.
Here the comparison of the two buffers will be quick and there will be no attempts of casting the bytes to string (which we don't need) and since we reuse the same memory all the time, the garbage collector won't have much work either... :)

How does the bittorrent assemble the missing pieces?

I use BitTorrent and sometimes encounter files that do not have seed(missing pieces).
At that time, we sometimes force the file transfer to end and try to open the incompleted files (for example, an image file).
If we are lucky, may be able to see the downloaded image even if some parts are lost.
I would like to artificially reproduce this situation, and here's how I tried:
1) spliting a bmp image file of about 1 megabyte into 16 kilobytes by the Linux split command,
2) and then make just one of the divided files 0 kilobytes.
3) after that, rejoin all the files with the cat command.
However, in this case, unlike the torrent's "lost pieces" situation, the file becomes completely corrupt and can not be read.
Theoretically it does not seem like anything special, but what's wrong? And how can I achieve what I want?
I would appreciate your help.
Use dd:
dd if=/dev/zero of=image.jpg bs=1 conv=notrunc seek=X count=Y
being X the offset in the file you want to erase and Y the number of bytes.
About the corruption, it depends on the type of file, the piece you are losing and the program you are using to read it.
For instance, JPG files use a variable bit-length encoding, meaning that just losing one bit may corrupt all the file from that point on. But just for that, there can be resyncronization points where the bitstream is reset, so from that point on, the file will look ok. But those resync points are optional when writing the file, and not every reader honor them in case of corruption...
And anyway, losing part of the headers will make the file totally unreadable.

Write unformatted (binary data) to stdout

I want to write unformatted (binary) data to STDOUT in a Fortran 90 program. I am using AIX Unix and unfortunately it won't let me open unit 6 as "unformatted". I thought I would try and open /dev/stdout instead under a different unit number, but /dev/stdout does not exist in AIX (although this method worked under Linux).
Basically, I want to pipe my programs output directly into another program, thus avoiding having an intermediate file, a bit like gzip -c does. Is there some other way I can achieve this, considering the two problems I have encountered above?
I would try to convert the data by TRANSFER() to a long character and print it with nonadvancing i/o. The problem will be your processors' limit for the record length. If it is too short you will end up having an unexpected end of record sign somewhere. Also your processor may not write the unprintable characters the way you would like.
i.e., something like
character(len=max_length) :: buffer
buffer = transfer(data,buffer)
write(*,'(a)',advance='no') trim(buffer)
The largest problem I see in the unprintable characters. See also A suprise with non-advancing I/O
---EDIT---
Another possibility, try to use file /proc/self/fd/1 or /dev/fd/1
test:
open(11,file='/proc/self/fd/1',access='stream',action='write')
write(11) 11
write(11) 1.1
close(11)
end
This is more of a comment/addition to #VladimirF than a new answer, but I can't add those yet. You can first inquire about the location of the preconnected I/O units and then open the unformatted connection:
character(1024) :: stdout
inquire(6, name = stdout)
open(11, file = stdout, access = 'stream', action = 'write')
This is probably the most convenient way, but it uses stream access, a Fortran 2003 feature. Without this, you can only use sequential access (which adds header data to each record) or direct access (which does not add headers but requires a fixed record length).

Windows console application with gets() ROP exploit

I'm trying (for learning purposes) to take advantage of gets() function vulnerability using return-oriented programming (ROP) technique. The target program is a Windows console application that in some point asks for some input, and then uses gets() to store the input in the local 80 characters long array.
I created a file that contains 80 'a' characters in the beginning + some extra characters + 0x5da06c48 address for overwriting the old EIP pointer.
I'm opening the file in text editor and copy-pasting the content into the console as input. I've used IDA Pro (or OllyDbg) to set a breakpoint right after the return from the gets() function and noticed that the address was corrupted - it was set to 0x3fa03f48 (two 3f substitutions).
I've tried other addresses as well - part of them works well, but most of the times the address is being corrupted (sometimes characters missing or substituted, sometimes truncated).
How to get over this problem? Any suggestion will be highly appreciated!
Copy-Pasting binary data is hit-and-miss. Have you tried feeding the input into your test program directly from the file using input redirection?
First of all keep track of the Endianness of your platform. If you think your bits are in the right order but you are still getting malformed input, it might be that your shell/text editor isn't binary safe. You are better off writing an exploit for this flaw in a scripting language such as Python, using the Subprocess library which allows you to write data directly to an arbitrary process's stdin pipe.

Resources