What is a quick way to check if file contents are null?

What is a quick way to check if file contents are null? - string

I have a rather large file (32 GB) which is an image of an SD card, created using dd.
I suspected that the file is empty (i.e. filled with the null byte \x00) starting from a certain point.
I checked this using python in the following way (where f is an open file handle with the cursor at the last position I could find data at):
for i in xrange(512):
if set(f.read(64*1048576))!=set(['\x00']):
print i
break
This worked well (in fact it revealed some data at the very end of the image), but took >9 minutes.
Has anyone got a better way to do this? There must be a much faster way, I'm sure, but cannot think of one.

Looking at a guide about memory buffers in python here I suspected that the comparator itself was the issue. In most non-typed languages memory copies are not very obvious despite being a killer for performance.
In this case, as Oded R. established, creating a buffer from read and comparing the result with a previously prepared nul filled one is much more efficient.
size = 512
data = bytearray(size)
cmp = bytearray(size)
And when reading:
f = open(FILENAME, 'rb')
f.readinto(data)
Two things that need to be taken into account is:
The size of the compared buffers should be equal, but comparing bigger buffers should be faster until some point (I would expect memory fragmentation to be the main limit)
The last buffer may not be the same size, reading the file into the prepared buffer will keep the tailing zeroes where we want them.
Here the comparison of the two buffers will be quick and there will be no attempts of casting the bytes to string (which we don't need) and since we reuse the same memory all the time, the garbage collector won't have much work either... :)

Related

Slow Buffer.concat

When I read a 16MB file in pieces of 64Kb, and do Buffer.concat on each piece, the latter proves to be incredibly slow, takes a whole 4s to go through the lot.
Is there a better way to concatenate a buffer in Node.js?
Node.js version used: 7.10.0, under Windows 10 (both are 64-bit).
This question is asked while researching the following issue: https://github.com/brianc/node-postgres/issues/1286, which affects a large audience.
The PostgreSQL driver reads large bytea columns in chunks of 64Kb, and then concatenates them. We found out that calling Buffer.concat is the culprit behind a huge loss of performance in such examples.

Rather than concatenating every time (which creates a new buffer each time), just keep an array of all of your buffers and concat at the end.
Buffer.concat() can take a whole list of buffers. Then it's done in one operation. https://nodejs.org/api/buffer.html#buffer_class_method_buffer_concat_list_totallength

If you read from a file and know the size of that file, then you can pre-allocate the final buffer. Then each time you get a chunk of data, you can simply write it to that large 16Mb buffer.
// use the "unsafe" version to avoid clearing 16Mb for nothing
let buf = Buffer.allocUnsafe(file_size)
let pos = 0
file.on('data', (chunk) => {
buf.fill(chunk, pos, pos + chunk.length)
pos += chunk.length
})
if(pos != file_size) throw new Error('Ooops! something went wrong.')
The main difference with #Brad's code sample is that you're going to use 16Mb + size of one chunk (roughly) instead of 32Mb + size of one chunk.
Also, each chunk has a header, various pointers, etc. so you are not unlikely to use 33Mb or even 34Mb... that's a lot more RAM. The amount of RAM copied is otherwise the same. That being said, it could be that Node starts reading the next chunk while you copy so it could make it transparent. When done in one large chunk in the 'end' event, you're going to have to wait for the contact() to complete while doing nothing else in parallel.
In case you are receiving an HTTP POST and are reading it. Remember that you get a Content-Length parameter so you also have the length in that case and can pre-allocate the entire buffer before reading the data.

VimL: Get extra KB from function that outputs file size

Right now I'm creating a plugin of sorts for Vim, it's meant to simply have all kinds of utility functions to put in your statusline, here's the link: https://github.com/Greduan/vim-usefulstatusline
Right now I have this function: https://github.com/Greduan/vim-usefulstatusline/blob/master/autoload/usefulstatusline_filesize.vim
It simply outputs the file size from bytes to megabytes. Now, currently if the file size reaches 1MB for example it outputs 1MB, this is fine, but I would also like for it to output the amount of bytes or KB extra that it has.
From example, instead of outputting 1MB it would output 1MB-367KB, see what I mean? It would output the biggest size, and then the remainder of the size that follows it. It's hard to explain.
So how would I modify the current function(s) to output the size this way?
Thanks for your help! Any of it is appreciated. :)

Who needs this? I doubt it would be convenient to anyone (especially when having small remainders like 1MB + 3KB), using 1.367MB is much better. I see in your code that you don’t have either MB (1000*1000 B) or MiB (1024*1024 B), 1000*1024 bytes is very strange. Also, don’t use getfsize, it is wrong for any non-file buffer you constantly see in plugins. Use line2byte(line('$')+1)-1.
For 1.367MB you can just rewrite humanize_bytes function in VimL if you are fine with depending on +float feature.
Using integer arithmetic you can get the remainder with
let kbytes_remainder = kbytes % 1000
And do change to either MiB/KiB (M/K is a common shortcut used in ls. Without B) or MB/KB.

How to partially read from a TStringStream, free the read data from the stream and keep the rest (the unread data)?

What I want to do: lets suppose I have a TStringStream that just read a string with 100 characters. If I call .ReadString(50), I will get the first 50 characters of this stream and its cursor is going to be placed on the position 51.
My question is: how do I toss the characters 1 to 50 in this stream in a fast and clean way? I want to read the rest (51 to 100) later.
Thanks in advance.

You cannot do what you are hoping to do. The string stream's data is a Delphi string which is stored as a single memory block. Memory blocks are atomic, they cannot be split. You cannot free some part of a memory block.
If you really need to return memory to the memory manager then you should create a new string with the already processed data removed. You can then re-create your string stream with this new input and destroy the previous string stream.
Having said that, it's hard to see that doing much other than increasing your memory fragmentation. If the sizes of memory involved are large enough, and if the string stream persists for long enough, then this just might be a sensible approach. Otherwise it sounds like an attempt to optimise that actually would hinder performance.
Perhaps some class other than string stream could be more appropriate but it's very hard to advise without knowing more details.

You can't do this. If you really need to do this, you should write your own class that implements the stream-interface and which would let you process some data a little bit at a time and free whatever you want to free. Note that you would only be able to go through the data once, since you've now deleted your data. That is, seeking to the beginning again would become impossible, and your current stream "position" would be a lie.
In short, sounds like you're confused.

If I understand correctly you which to skip forward in the stream?
You can do:
Str.Position := Str.Position + 50;
Or like this:
Str.Seek(50,TSeekOrigin.soCurrent);

Doing file operations with 64-bit addresses in C + MinGW32

I'm trying to read in a 24 GB XML file in C, but it won't work. I'm printing out the current position using ftell() as I read it in, but once it gets to a big enough number, it goes back to a small number and starts over, never even getting 20% through the file. I assume this is a problem with the range of the variable that's used to store the position (long), which can go up to about 4,000,000,000 according to http://msdn.microsoft.com/en-us/library/s3f49ktz(VS.80).aspx, while my file is 25,000,000,000 bytes in size. A long long should work, but how would I change what my compiler(Cygwin/mingw32) uses or get it to have fopen64?

The ftell() function typically returns an unsigned long, which only goes up to 232 bytes (4 GB) on 32-bit systems. So you can't get the file offset for a 24 GB file to fit into a 32-bit long.
You may have the ftell64() function available, or the standard fgetpos() function may return a larger offset to you.

You might try using the OS provided file functions CreateFile and ReadFile. According to the File Pointers topic, the position is stored as a 64bit value.

Unless you can use a 64-bit method as suggested by Loadmaster, I think you will have to break the file up.
This resource seems to suggest it is possible using _telli64(). I can't test this though, as I don't use mingw.

I don't know of any way to do this in one file, a bit of a hack but if splitting the file up properly isn't a real option, you could write a few functions that temp split the file, one that uses ftell() to move through the file and swaps ftell() to a new file when its reaching the split point, then another that stitches the files back together before exiting. An absolutely botched up approach, but if no better solution comes to light it could be a way to get the job done.

I found the answer. Instead of using fopen, fseek, fread, fwrite... I'm using _open, lseeki64, read, write. And I am able to write and seek in > 4GB files.
Edit: It seems the latter functions are about 6x slower than the former ones. I'll give the bounty anyone who can explain that.
Edit: Oh, I learned here that read() and friends are unbuffered. What is the difference between read() and fread()?

Even if the ftell() in the Microsoft C library returns a 32-bit value and thus obviously will return bogus values once you reach 2 GB, just reading the file should still work fine. Or do you need to seek around in the file, too? For that you need _ftelli64() and _fseeki64().
Note that unlike some Unix systems, you don't need any special flag when opening the file to indicate that it is in some "64-bit mode". The underlying Win32 API handles large files just fine.

Purpose of ibs/obs/bs in dd

I have a script that creates file system in a file on a linux machine. I see that to create the file system, it uses 'dd' with bs=x option, reads from /dev/zero and writes to a file. I think usually specifying ibs/obs/bs is useful to read from real hardware devices as one has specific block size constraints. In this case however, as it is reading from virtual device and writing to a file, I don't see any point behind using 'bs=x bytes' option. Is my understanding wrong here?
(Just in case if it helps, this file system is later on used to boot a qemu vm)

To understand block sizes, you have to be familiar with tape drives. If you're not interested in tape drives - for example, you don't think you're ever going to use one - then you can go back to sleep now.
Remember the tape drives from films in the 60s, 70s, maybe even 80s? The ones where the reel went spinning around, and so on? Not your Exabyte or even QIC - quarter-inch cartridge - tapes; your good old fashioned reel-to-reel half-inch tape drives? On those, block size mattered.
The data on a tape was written in blocks. Each block was separated from the next by an inter-record gap.
----+-------+-----+-------+-----+----
... | block | IRG | block | IRG | ...
----+-------+-----+-------+-----+----
Depending on the tape drive hardware and software, there were a variety of problems that could happen. For example, if the tape was written with a block size of 5120 bytes and you read the tape with a block size of 512 bytes, then the tape drive might read the first block, return you 512 bytes of it, and then discard the remaining data; the next read would start on the next block. Conversely, if the tape was written with a block size of 512 bytes and you requested blocks of 5120 bytes, you would get short reads; each read would return just 512 bytes, and if your software wasn't paying attention, you'd be reading garbage. There was also the issue that the tape drive had to get up to speed to read the block, and then slow down. The ASCII art suggests that the IRG was smaller than the data blocks; that was not necessarily the case. And it took time to read one block, overshoot the IRG, rewind backwards to get to the next block, and start forwards again. And if the tape drive didn't have the memory to buffer data - the cheaper ones did not - then you could seriously affect your tape drive performance.
War story: work prepared on newer machine with a slightly more modern tape drive. I wrote a tape using tar without a sensible block size (so it defaulted to 512 bytes). It was a large bit of software - must have been, oh, less than 100 MB in total (a long time ago, in other words). The tape wrote nicely because the machine was modern enough, and it took just a few seconds to do so. But, I had to get the material off the tape on a machine with an older tape drive, one that did not have any on-board buffer. So, it read the material, 512 bytes at a time, and the reel rocked forward, reading one block, and then rocked back all but maybe half an inch, and then read forwards to get to the next block, and then rocked back, and ... well, you could see it doing this, and since it took appreciable portions of a second to read each 512 byte block, the total time taken was horrendous. My flight was due to leave...and I needed to get that data across too. (It was long enough ago, and in a land far enough away, that last minute changes to flights weren't much of an option either.) To cut a long story short, it did get read - but if I'd used a sensible block size (such as 5120 bytes instead of the default of 512), I would have been done much, much quicker and with much less danger of missing the plane (but I did actually catch the plane, with maybe 20 minutes to spare, despite a taxi ride across Paris in the rush hour).
With more modern tape drives, there was enough memory on the drive to do buffering and getting a tape drive to stream - write continuously without reversing - was feasible. It used to be that I'd use a block size like 256 KB to get QIC tapes to stream. I've not done much with tape drives recently - let's see, not this millennium and not much for a few years before that, either; certainly not much since CD and DVD became the software distribution mechanisms of choice (when electronic download wasn't used).
But the block size really did matter in the old days. And dd provided good support for it. You could even transfer data from a tape drive that was written with, say, 4 KB block to another that you wanted to write with, say, 16 KB blocks, by specifying the ibs (input block size) separately from the obs (output block size). Darned useful!
Also, the count parameter is in terms of the (input) block size. It was useful to say 'dd bs=1024 count=1024 if=/dev/zero of=/my/file/of/zeroes' to copy 1 MB of zeroes around. Or to copy 1 MB of a file.
The importance of dd is vastly diminished; it was an essential part of the armoury for anybody who worked with tape drives a decade or more ago.

The block size is the number of bytes that are read and written at a time. Presumably there is a count= option, and that is specified in units of the block size. If there is a skip= or seek= option, those will also be in block size units. However if you are reading and writing a regular file, and there are no disk errors, then the block size doesn't really matter as long as you can scale those parameters accordingly and they are still integers. However certain sizes may be more efficient than others.

For reading from /dev/zero, it doesn't matter. ibs/obs/bs specify how many bytes will be read at a time. It's helpful to choose a number based on the way bytes are read/written in the operating system. For instance, Linux usually reads from a hard drive in 4096 byte chunks. If you have at least some idea about how the underlying hardware reads/writes, it might be a good idea to specify ibs/obs/bs. By the way, if you specify bs, it will override whatever you specify for ibs and obs.

In addition to the great answer by Jonathan Leffler, keep in mind that the bs= option isn't always a substitute for using both ibs= and obs=, in particular for the old [ugly] days of tape drives.
The bs= option reserves the right for dd to write the data as soon as it's read. This can cause you to no longer have identically sized blocks on the output. Here is GNU's take on this, but the behavior dates back as far as I can remember (80's):
(bs=) Set both input and output block sizes to bytes. This makes dd read and write bytes per block, overriding any ‘ibs’ and ‘obs’ settings. In addition, if no data-transforming conv option is specified, input is copied to the output as soon as it’s read, even if it is smaller than the block size.
For instance, back in the QIC days on an old Sun system, if you did this:
tar cvf /dev/rst0c /bla
It would work, but cause an enormous amount of back and forth thrashing while the drive wrote a small block, and tried to backup and read to reposition itself properly for the next write.
If you swapped this with:
tar cvf - /bla | dd ibs=16K obs=16K of=/dev/rst0c
You'd get the QIC drive writing much larger chunks and not thrashing quite so much.
However, if you made the mistake of this:
tar cvf - /bla | dd bs=16K of=/dev/rst0c
You'd run the risk of having precisely the same thrashing you had before depending upon how much data was available at the time of each read.
Specifying both ibs= and obs= precludes this from happening.

Develop Reference

node.js excel linux python-3.x azure haskell apache-spark rust .htaccess string