I have a chunk of fairly random binary data. I want to find where that chunk exists in a file, how many times it occurs, and at what byte (or sector) offsets. Any ideas on how to do that?
Thanks,
Justin
I believe that no existing command does exactly what you want. If your chunk is small and your file fits in memory, it's easy to write your own: just scan through the file contents, comparing the chunk at each position with memcmp (strncmp would stop at the first NUL byte, which random binary data is likely to contain).
If your file is very large but still fits in your address space, you can do the same thing with mmap.
If your chunk is not small, you'll probably be better off using the Boyer-Moore algorithm instead of strncmp. This is still not too much work since there are already implementations out there that you can use.
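For what it's worth, here is a minimal sketch of the straightforward scan (not Boyer-Moore), assuming both the chunk and the searched file fit in the address space; the file names on the command line and the 512-byte sector size are just placeholders:

/* Naive scan: mmap the file and memcmp the chunk at every byte offset. */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    if (argc != 3) { fprintf(stderr, "usage: %s chunk file\n", argv[0]); return 1; }

    /* Read the chunk (needle) into memory. */
    FILE *nf = fopen(argv[1], "rb");
    if (!nf) { perror("fopen chunk"); return 1; }
    fseek(nf, 0, SEEK_END);
    long nlen = ftell(nf);
    rewind(nf);
    unsigned char *needle = malloc(nlen);
    if (fread(needle, 1, nlen, nf) != (size_t)nlen) { perror("fread"); return 1; }
    fclose(nf);

    /* mmap the file to search (haystack). */
    int fd = open(argv[2], O_RDONLY);
    if (fd < 0) { perror("open file"); return 1; }
    struct stat st;
    fstat(fd, &st);
    unsigned char *hay = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    if (hay == MAP_FAILED) { perror("mmap"); return 1; }

    /* memcmp (not strncmp) because the data may contain NUL bytes. */
    long count = 0;
    for (off_t off = 0; off + nlen <= st.st_size; off++) {
        if (memcmp(hay + off, needle, nlen) == 0) {
            printf("match at byte offset %lld (512-byte sector %lld)\n",
                   (long long)off, (long long)(off / 512));
            count++;
        }
    }
    printf("%ld match(es)\n", count);

    munmap(hay, st.st_size);
    close(fd);
    free(needle);
    return 0;
}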
I would recommend X-Ways WinHex for that. I find myself using it quite often to search arbitrary data on hard disk drives or large disk image files.
You can do some of this with grep
This outputs each matching line, prefixed with the byte offset at which that line starts (add --only-matching to get the offset of each match itself instead):
grep --text --byte-offset 'ls' /bin/ls
Add a --count parameter to get the number of matching lines.
I'm looking for, essentially, the ext4 equivalent of mremap().
I have a big mmap()'d file that I'm allocating arrays in, and the arrays need to grow. So I want to make the first array larger at its current location, and budge all the other arrays along in the file and the address space to make room.
If this was just anonymous memory, I could use mremap() to budge over whole pages in constant time, as long as I'm inserting a whole number of memory pages. But this is a disk-backed file, so the data needs to move in the file as well as in memory.
I don't actually want to read and then rewrite whole blocks of data to and from the physical disk. I want the data to stay on disk in the physical sectors it is in, and to induce the filesystem to adjust the file metadata to insert new sectors where I need the extra space. If I have to keep my inserts to some multiple of a filesystem-dependent disk sector size, that's fine. If I end up having to copy O(N) sector or extent references around to make room for the inserted extent, that's fine. I just don't want to have 2 gigabytes move from and back to the disk in order to insert a block in the middle of a 4 gigabyte file.
How do I accomplish an efficient insert by manipulating file metadata? Is a general API for this actually exposed in Linux? Or one that works if the filesystem happens to be e.g. ext4? Will a write() call given a source address in the memory-mapped file reduce to the sort of efficient shift I want under the right circumstances?
Is there a C or C++ API function with the semantics "copy bytes from here to there and leave the source with an undefined value" that I should be calling in case this optimization gets added to the standard library and the kernel in the future?
I've considered just always allocating new pages at the end of the file, and mapping them at the right place in memory. But then I would need to work out some way to reconstruct that series of mappings when I reload the file. Also, shrinking the data structure would be a nontrivial problem. At that point, I would be writing a database page manager.
I think I actually may have figured it out.
I went looking for "linux make a file sparse", which led me to this answer on Unix & Linux Stack Exchange mentioning the fallocate command-line tool. The fallocate tool has a --dig-holes option, which turns parts of a file that could be represented by holes into actual holes.
I then went looking for "fallocate dig holes" to find out how that works, and I got the fallocate man page. I noticed it also offers a way to insert a hole of some size:
-i, --insert-range
Insert a hole of length bytes from offset, shifting existing
data.
If a command line tool can do it, Linux can do it, so I dug into the source code for fallocate, which you can find on Github:
case 'i':
mode |= FALLOC_FL_INSERT_RANGE;
break;
It looks like the fallocate tool accomplishes a cheap hole insert (and a move of all the other file data) by calling the fallocate() Linux-specific function with the FALLOC_FL_INSERT_RANGE flag, added in Linux 4.1. This flag won't work on all filesystems, but it does work on ext4 and it does exactly what I want: adjust the file metadata to efficiently free up some space in the file's offset space at a certain point.
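For reference, a minimal, untested sketch of calling fallocate() with that flag directly from C; the file name, offset, and length below are placeholders, and both offset and length must be multiples of the filesystem block size:

/* Insert a block-aligned hole into an existing file, shifting later data.
 * Requires Linux >= 4.1 and a supporting filesystem (e.g. ext4). */
#define _GNU_SOURCE
#include <fcntl.h>
#include <linux/falloc.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    int fd = open("data.bin", O_RDWR);      /* placeholder file name */
    if (fd < 0) { perror("open"); return 1; }

    off_t offset = 1 * 4096;    /* where to open up space (placeholder)  */
    off_t length = 16 * 4096;   /* how much space to insert (placeholder) */

    if (fallocate(fd, FALLOC_FL_INSERT_RANGE, offset, length) < 0) {
        perror("fallocate(FALLOC_FL_INSERT_RANGE)");
        return 1;
    }
    /* Everything that was at or after `offset` is now shifted up by
     * `length` bytes; the new range reads back as zeroes. */
    close(fd);
    return 0;
}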
It's not immediately clear to me how this interacts with currently memory-mapped pages, but I think I can work with this.
I have a 1.4 GB text file named test.txt and I want to search a string inside the file.
I'd like to know why vim search (vim test.txt, then type /targetText to search the string) performs much slower than cat test.txt | grep targetText?
On my machine, the vim search takes several minutes to complete, while cat test.txt | grep targetText takes only a few seconds.
Vim is an editor. It tries to load the whole file into memory so that you can edit it. Vim can edit huge files, but it is not optimized for them.
On the other hand, cat and grep do not need to read the whole file into memory.
BTW, you can just run grep targetText test.txt directly, without using cat.
If targetText is short, the delay is most likely caused by the many loads from disk necessary to search through the whole text. Note that Vim is an interactive tool and is not designed for fast processing of gigabytes. Of course, if we knew in advance that the pattern match lies many, many megabytes downstream from the current screen, we could read huge pieces from disk and get fast that way. But in real life Vim doesn't know how much data is worth reading at once, because if we expect the pattern to be found a short distance away, say three lines below (agreed, a much more likely situation), then there is absolutely no reason to read a huge amount of data from disk; it would be a useless waste of time and bandwidth. Since Vim doesn't know a priori how much data to read at once, it uses a trade-off which turns out not to be optimal in your case.
On the opposite side, a pipeline like "cat | ..." happily operates on very large pieces of data, limited only by the memory available to the process (ideally, having once opened the file, it reads data non-stop and sends it down the pipeline), because cat "knows" that the whole file content is needed and there is no reason to read it in small pages.
Thus, although grep and cat pull in the same amount of data, the latter seeks on disk far fewer times, which results in a dramatic efficiency increase.
If a prefix character combination of our pattern is very frequent in the file being scanned, we may also benefit from grep's search technique, which is based on the Aho–Corasick string matching algorithm.
If I have a big file containing many zeros, how can I efficiently turn it into a sparse file?
Is the only possibility to read the whole file (including all zeroes, which may partially be stored sparse already) and to rewrite it to a new file, using seek to skip over the zero areas?
Or is there a possibility to make this in an existing file (e.g. File.setSparse(long start, long end))?
I'm looking for a solution in Java or some Linux commands, Filesystem will be ext3 or similar.
A lot's changed in 8 years.
Fallocate
fallocate -d filename can be used to punch holes in existing files. From the fallocate(1) man page:
-d, --dig-holes
Detect and dig holes. This makes the file sparse in-place,
without using extra disk space. The minimum size of the hole
depends on filesystem I/O block size (usually 4096 bytes).
Also, when using this option, --keep-size is implied. If no
range is specified by --offset and --length, then the entire
file is analyzed for holes.
You can think of this option as doing a "cp --sparse" and then
renaming the destination file to the original, without the
need for extra disk space.
See --punch-hole for a list of supported filesystems.
(That list:)
Supported for XFS (since Linux 2.6.38), ext4 (since Linux
3.0), Btrfs (since Linux 3.7) and tmpfs (since Linux 3.5).
tmpfs being on that list is the one I find most interesting. The filesystem itself is efficient enough to only consume as much RAM as it needs to store its contents, but making the contents sparse can potentially increase that efficiency even further.
GNU cp
Additionally, somewhere along the way GNU cp gained an understanding of sparse files. Quoting the cp(1) man page regarding its default mode, --sparse=auto:
sparse SOURCE files are detected by a crude heuristic and the corresponding DEST file is made sparse as well.
But there's also --sparse=always, which activates the file-copy equivalent of what fallocate -d does in-place:
Specify --sparse=always to create a sparse DEST file whenever the SOURCE file contains a long enough sequence of zero bytes.
I've finally been able to retire my tar cpSf - SOURCE | (cd DESTDIR && tar xpSf -) one-liner, which for 20 years was my graybeard way of copying sparse files with their sparseness preserved.
Some filesystems on Linux / UNIX have the ability to "punch holes" into an existing file. See:
LKML posting about the feature
UNIX file truncation FAQ (search for F_FREESP)
It's not very portable and not done the same way across the board; as of right now, I believe Java's IO libraries do not provide an interface for this.
If hole punching is available either via fcntl(F_FREESP) or via any other mechanism, it should be significantly faster than a copy/seek loop.
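One such mechanism on Linux is fallocate() with FALLOC_FL_PUNCH_HOLE. A rough C sketch follows (the question asks for Java, so this would have to be reached via JNI or an external tool; the file name and range are placeholders):

/* Deallocate a range of an existing file so it reads back as zeroes
 * without consuming disk space. */
#define _GNU_SOURCE
#include <fcntl.h>
#include <linux/falloc.h>
#include <stdio.h>
#include <unistd.h>

/* Punch a hole of `len` bytes at `off` in the file `fd`.  KEEP_SIZE is
 * required so the file length stays unchanged. */
static int punch_hole(int fd, off_t off, off_t len)
{
    if (fallocate(fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE, off, len) < 0) {
        perror("fallocate(FALLOC_FL_PUNCH_HOLE)");
        return -1;
    }
    return 0;
}

int main(void)
{
    int fd = open("big.bin", O_RDWR);          /* placeholder file name */
    if (fd < 0) { perror("open"); return 1; }
    int rc = punch_hole(fd, 4096, 1024 * 1024); /* placeholder range */
    close(fd);
    return rc ? 1 : 0;
}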
I think you would be better off pre-allocating the whole file and maintaining a table/BitSet of the pages/sections which are occupied.
Making a file sparse would result in those sections being fragmented if they were ever re-used. Perhaps saving a few TB of disk space is not worth the performance hit of a highly fragmented file.
You can use truncate -s <size> <filename> in a Linux terminal to create a sparse file that has
only metadata.
NOTE -- the size is given in bytes.
According to this article, it seems there is currently no easy solution, except for using FIEMAP ioctl. However, I don't know how you can make "non sparse" zero blocks into "sparse" ones.
Suppose I have a console program that outputs trace debug lines on stdout, that I want to run on a server.
And then I do:
./serverapp > someoutputfile
If I need to see how the program's doing, I would just log into the server and do:
tail -f someoutputfile
However, understandably over time, someoutputfile would become pretty big.
Is there a way to make it so that someoutputfile is limited to a certain size, and only the most recent parts of it?
I mean, the hard way would be to make a custom script/program that cycles the output between different files, but that seems like overkill.
You can truncate the log file. One way to do this is to type:
>someoutputfile
at the shell command-line. It's a redirect with no output and it will erase all the contents of the file.
The tricky bit here is that any program writing to that file will continue to write into the file at its last output position. So the file will immediately gain a "hole" from 0 to X bytes, where X is the output position.
In most Linux file systems these holes result in sparse files, which don't actually use the space in the hole. So the file may contain many gigabytes of 0's at the beginning but only use 500 KB on disk.
Another way to do fast logging is to memory map a file on disk of fixed size: 16 MB for example. Then the logging writes into a memory pointer which wraps around when it reaches the size limit. It then continues to write at the front of the file. It's a good idea to have some kind of write position marker. I use <====>, for example. I find this method to be ridiculously fast and great for debug logging.
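A rough sketch of that scheme in C, assuming the 16 MB size and <====> marker mentioned above; the function names are made up for illustration and error handling is simplified:

#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

#define LOG_SIZE (16 * 1024 * 1024)   /* fixed 16 MB log file */

static char  *log_base;               /* start of the mapped region     */
static size_t log_pos;                /* current write position (wraps) */

static void put_bytes(const char *p, size_t n)
{
    for (size_t i = 0; i < n; i++) {  /* byte-wise, so wrapping is trivial */
        log_base[log_pos] = p[i];
        log_pos = (log_pos + 1) % LOG_SIZE;
    }
}

int log_open(const char *path)
{
    int fd = open(path, O_RDWR | O_CREAT, 0644);
    if (fd < 0)
        return -1;
    if (ftruncate(fd, LOG_SIZE) < 0) { close(fd); return -1; }  /* fix the size up front */
    log_base = mmap(NULL, LOG_SIZE, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    close(fd);                        /* the mapping stays valid after close */
    return log_base == MAP_FAILED ? -1 : 0;
}

void log_write(const char *msg)
{
    put_bytes(msg, strlen(msg));
    put_bytes("\n", 1);
    /* Write a position marker after the newest data so a reader can find
     * the write head, then rewind so the next message overwrites it. */
    size_t saved = log_pos;
    put_bytes("<====>", 6);
    log_pos = saved;
}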
I haven't used it myself, but it gets good reviews here on SO; try logrotate.
A more general discussion of managing output files may show you that a custom script/solution is not out of the question ;-) : Problem with Bash output redirection
I hope this helps.
I'm trying to optimize handling of large datasets using mmap. A dataset is in the gigabyte range. The idea was to mmap the whole file into memory, allowing multiple processes to work on the dataset concurrently (read-only). It isn't working as expected though.
As a simple test I simply mmap the file (using perl's Sys::Mmap module, using the "mmap" sub which I believe maps directly to the underlying C function) and have the process sleep. When doing this, the code spends more than a minute before it returns from the mmap call, despite this test doing nothing - not even a read - from the mmap'ed file.
Guessing, I thought maybe Linux required the whole file to be read when first mmap'ed, so after the file had been mapped in the first process (while it was sleeping), I invoked a simple test in another process which tried to read the first few megabytes of the file.
Surprisingly, it seems the second process also spends a lot of time before returning from the mmap call, about the same time as mmap'ing the file the first time.
I've made sure that MAP_SHARED is being used and that the process that mapped the file the first time is still active (that it has not terminated, and that the mmap hasn't been unmapped).
I expected a mmapped file would allow me to give multiple worker processes effective random access to the large file, but if every mmap call requires reading the whole file first, it's a bit harder. I haven't tested using long-running processes to see if access is fast after the first delay, but I expected using MAP_SHARED and another separate process would be sufficient.
My theory was that mmap would return more or less immediately, and that linux would load the blocks more or less on-demand, but the behaviour I am seeing is the opposite, indicating it requires reading through the whole file on each call to mmap.
Any idea what I'm doing wrong, or if I've completely misunderstood how mmap is supposed to work?
Ok, found the problem. As suspected, neither Linux nor Perl was to blame. To open and access the file I do something like this:
#!/usr/bin/perl
# Create 1 GB file if you do not have one:
# dd if=/dev/urandom of=test.bin bs=1048576 count=1000
use strict; use warnings;
use Sys::Mmap;
open (my $fh, "<test.bin")
|| die "open: $!";
my $t = time;
print STDERR "mmapping.. ";
mmap (my $mh, 0, PROT_READ, MAP_SHARED, $fh)
|| die "mmap: $!";
my $str = unpack ("A1024", substr ($mh, 0, 1024));
print STDERR " ", time-$t, " seconds\nsleeping..";
sleep (60*60);
If you test that code, there are no delays like those I found in my original code, and after creating the minimal sample (always do that, right!) the reason suddenly became obvious.
The error was that, in my code, I treated the $mh scalar as a handle, something lightweight that can be moved around easily (read: passed by value). It turns out it's actually a GB-long string, definitely not something you want to move around without creating an explicit reference (Perl lingo for a "pointer"/handle value). So if you need to store it in a hash or similar, make sure you store \$mh, and dereference it when you need to use it, like ${$hash->{mh}}, typically as the first parameter to substr or similar.
If you have a relatively recent version of Perl, you shouldn't be using Sys::Mmap. You should be using PerlIO's mmap layer.
Can you post the code you are using?
On 32-bit systems the address space for mmap()s is rather limited (and varies from OS to OS). Be aware of that if you're using multi-gigabyte files and you are only testing on a 64-bit system. (I would have preferred to write this in a comment but I don't have enough reputation points yet)
One thing that can help performance is the use of madvise(2), probably most easily done via Inline::C. madvise lets you tell the kernel what your access pattern will be like (e.g. sequential, random, etc.).
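For illustration, a small C sketch of what Inline::C would end up wrapping (test.bin is the file from the question's example; the choice of MADV_SEQUENTIAL is just one possibility):

/* Tell the kernel the expected access pattern so it can tune read-ahead. */
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int main(void)
{
    int fd = open("test.bin", O_RDONLY);
    if (fd < 0) { perror("open"); return 1; }
    struct stat st;
    fstat(fd, &st);

    void *p = mmap(NULL, st.st_size, PROT_READ, MAP_SHARED, fd, 0);
    if (p == MAP_FAILED) { perror("mmap"); return 1; }

    /* MADV_SEQUENTIAL: aggressive read-ahead, pages may be dropped soon
     * after use.  Use MADV_RANDOM instead to disable read-ahead for
     * random-access workloads. */
    if (madvise(p, st.st_size, MADV_SEQUENTIAL) < 0)
        perror("madvise");

    /* ... scan the mapping here ... */

    munmap(p, st.st_size);
    close(fd);
    return 0;
}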
If I may plug my own module: I'd advise using File::Map instead of Sys::Mmap. It's much easier to use, and is less crash-prone than Sys::Mmap.
That does sound surprising. Why not try a pure C version?
Or try your code on a different OS/perl version.
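A sketch of such a pure C test, mirroring the Perl one (it assumes the same test.bin file and times only the mmap() call itself; on Linux it should return almost immediately):

#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <time.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    const char *path = argc > 1 ? argv[1] : "test.bin";
    int fd = open(path, O_RDONLY);
    if (fd < 0) { perror("open"); return 1; }
    struct stat st;
    fstat(fd, &st);

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    void *p = mmap(NULL, st.st_size, PROT_READ, MAP_SHARED, fd, 0);
    clock_gettime(CLOCK_MONOTONIC, &t1);
    if (p == MAP_FAILED) { perror("mmap"); return 1; }

    double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
    printf("mmap of %lld bytes took %.6f s\n", (long long)st.st_size, secs);

    sleep(60);                 /* keep the mapping alive, as in the Perl test */
    munmap(p, st.st_size);
    close(fd);
    return 0;
}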
See Wide Finder for Perl performance with mmap. But there is one big pitfall: if your dataset lives on a classical HDD and you read from multiple processes, you can easily fall into random-access patterns, and your I/O can drop to unacceptable levels (20 to 40 times slower).
Ok, here's another update. Using Sys::Mmap or PerlIO's ":mmap" attribute both work fine in perl, but only up to 2 GB files (the magic 32-bit limit). Once the file is larger than 2 GB, the following problems appear:
Using Sys::Mmap and substr for accessing the file, it seems that substr only accepts a 32 bit int for the position parameter, even on systems where perl supports 64 bit. There's at least one bug posted about it:
#62646: Maximum string length with substr
Using open(my $fh, "<:mmap", "bigfile.bin"), once the file is larger than 2 GB, it seems perl will either hang or insist on reading the whole file on the first read (not sure which; I never ran it long enough to see if it completed), leading to dead-slow performance.
I haven't found any workaround to either of these, and I'm currently stuck with slow (non-mmap'ed) file operations for working on these files. Unless I find a workaround, I may have to implement the processing in C or in some other language that supports mmap'ing huge files better.
Your access to that file had better be truly random to justify a full mmap. If your usage isn't evenly distributed, you're probably better off doing a seek, reading into a freshly malloc'ed area, processing that, freeing it, and repeating. And work with chunks that are multiples of 4 KB, say 64 KB or so.
I once benchmarked a lot of string pattern matching algorithms. mmap'ing the entire file was slow and pointless. Reading into a static 32 KB-ish buffer was better, but still not particularly good. Reading into a freshly malloc'ed chunk, processing it, and then letting it go allows the kernel to work wonders under the hood. The difference in speed was enormous, but then again pattern matching is very fast complexity-wise, and more emphasis must be put on handling efficiency than is perhaps usually needed.
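For illustration, a rough C sketch of that read-process-free loop with 64 KB chunks; process_chunk() is a placeholder, and handling of matches that span chunk boundaries is omitted:

#include <stdio.h>
#include <stdlib.h>

#define CHUNK (64 * 1024)

/* Placeholder: pattern matching or other processing would go here. */
static void process_chunk(const unsigned char *buf, size_t len)
{
    (void)buf; (void)len;
}

int main(int argc, char **argv)
{
    if (argc != 2) { fprintf(stderr, "usage: %s file\n", argv[0]); return 1; }
    FILE *f = fopen(argv[1], "rb");
    if (!f) { perror("fopen"); return 1; }

    for (;;) {
        unsigned char *buf = malloc(CHUNK);   /* fresh buffer each round */
        size_t n = fread(buf, 1, CHUNK, f);
        if (n == 0) { free(buf); break; }
        process_chunk(buf, n);
        free(buf);                            /* let the allocator and kernel reclaim it */
    }
    fclose(f);
    return 0;
}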