I have a 1.4 GB text file named test.txt and I want to search for a string inside it.
I'd like to know why searching in vim (vim test.txt, then typing /targetText) is much slower than cat test.txt | grep targetText.
On my machine, the vim search takes several minutes to complete, while cat test.txt | grep targetText finishes in a few seconds.
Vim is an editor. It tries to load the file into memory so that you can edit it. Vim can edit huge files, but it is not optimized for them.
cat and grep, on the other hand, do not need to read the whole file into memory.
By the way, you can just run grep targetText test.txt without using cat.
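For example, using the names from the question (and -F, since the target is a literal string rather than a regular expression):
# search the file directly; no cat or pipe needed
grep targetText test.txt
# -F treats the pattern as a fixed string instead of a regular expression,
# which is usually at least as fast for literal text
grep -F targetText test.txt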
If targetText is short, the delay is most likely caused by the large number of reads from disk needed to search through the whole text. Keep in mind that Vim is an interactive tool and is not designed for fast processing of gigabytes. Of course, if we knew in advance that our match lies many, many megabytes past the current screen, we could read huge chunks from disk and get there quickly. But in real life Vim doesn't know how much data is worth reading at once: if we expect the pattern to be found a short distance away, say three lines below (admittedly a much more likely situation), there is no reason to read huge amounts of data from disk; it would be a useless waste of time and bandwidth. Since Vim doesn't know a priori how much data to read at once, it uses a trade-off, which happens not to be optimal in your case.
On the other side, a "cat | ..." pipeline happily operates on very large chunks of data, limited only by the memory available to the process; ideally, once it has located the file, it reads the data non-stop and sends it down the pipeline. cat "knows" that the whole file content is needed, so there is no reason to read it in small pages.
Thus, although Vim and cat pull the same amount of data off the disk, the latter seeks far fewer times, which results in a dramatic increase in efficiency.
If a prefix of our pattern occurs very frequently in the file being scanned, we may also see an advantage from grep's search technique, which is based on the Aho–Corasick string-matching algorithm.
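Where that kind of multi-pattern matching really pays off is when several fixed strings have to be located in a single pass; a rough sketch (patterns.txt and otherText are made up for the example):
# put the literal strings to look for into a file, one per line
printf '%s\n' targetText otherText > patterns.txt
# -F: fixed strings, -f: read the patterns from a file; grep matches
# all of them in one pass over test.txt
grep -F -f patterns.txt test.txt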
Related
I have a lot of files on my computer (who doesn't?).
They are split across several hard drives.
I realized a long time ago that find takes a whole lot of time scanning a whole hard disk: minutes per drive, and for all drives it might take over an hour.
That is why I got used to running du -ba / >> ~/du."$(date +%F)" on a regular basis. Then I would just run grep 'WHATEVER' ~/du | sed 's#^ \+[0-9]\+ ##' | xargs -d\\n command
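Spelled out, that workflow is roughly the following (WHATEVER and some-command are placeholders, and the sed expression assumes du's usual "size, then path" output):
# 1. dump every path on the filesystem, prefixed by its size in bytes
du -ba / >> ~/du."$(date +%F)"
# 2. later: pick matching paths out of the dumps (-h suppresses grep's
#    file-name prefix), strip the leading size column, and hand the
#    paths to some command, one per line
grep -h 'WHATEVER' ~/du.* | sed 's#^[0-9]\+[[:space:]]\+##' | xargs -d '\n' some-command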
I understand why that is faster than find.
Now I have set up a MySQL database that holds a complete, refreshable index of all files. Directories form a simple tree with just a foreign key to the parent row (a foreign key that references not a foreign table but its own table's primary key in a different row — usually called a self-referencing foreign key).
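A minimal sketch of such a self-referencing table (the database, table and column names are made up purely for illustration):
# illustrative schema only; "fileindex" is a hypothetical database name
mysql fileindex <<'SQL'
CREATE TABLE node (
    id        BIGINT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
    parent_id BIGINT UNSIGNED NULL,   -- NULL for the root of a drive
    name      VARCHAR(255) NOT NULL,
    is_dir    BOOLEAN NOT NULL DEFAULT FALSE,
    -- the "foreign key to the parent row": it points back at this
    -- same table's primary key
    FOREIGN KEY (parent_id) REFERENCES node (id)
) ENGINE=InnoDB;
SQL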
Although it is just as complex, it is still much faster than going through the filesystems.
Why is that? Am I missing some tools that could search the TOC faster than the normal POSIX calls to the kernel?
How long should it take to print all files of a hard drive to stdout, without a DB or text-file cache?
I'm trying to find a more efficient way of reading file contents on Linux without using the cat command, especially for larger files, since in those cases cat drives up the memory and CPU usage on the server.
One thing that comes to mind is using grep -v "character-set-which-is-unlikely-in-the-file" filename
But using a different character set every time and hoping it does not appear in the file wouldn't be efficient.
Any other thoughts?
If you just want to read through the file, so it gets cached, the simplest way is perhaps this:
cat filename > /dev/null
Note that you don't need to show the data on screen to read it from disk. That command reads the file, and ignores the content by dumping it in /dev/null, but it still reads all the data.
If the CPU load goes up, that is probably a good thing, meaning that the computer is working hard, and will be finished sooner rather than later. If it crashes, though, there is something else wrong.
If you have some specific reason not to use the "cat" command, you can try "dd" instead, but it is more complicated to write and will not be faster:
dd if=filename of=/dev/null bs=1M
Addendum:
This inspired me to run some tests. On my particular computer both "cat" and "dd" took 24.27-24.31 seconds to read a large file on a mechanical disk when it wasn't already cached, and 0.39-0.40 seconds when it was cached. (Three tests of each case, with very little variability.)
Both these programs contain code to write the data, even if it is dumped to /dev/null, so one could expect that a program specifically written to just read would be slightly faster, but I got the same times when I tried that.
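For reference, a rough way to reproduce that kind of measurement yourself (dropping the page cache needs root, and the file name is a placeholder):
# flush dirty data and drop the page cache so the first read hits the disk
sync; echo 3 > /proc/sys/vm/drop_caches
# uncached read
time cat bigfile > /dev/null
# the second run is served from the page cache and should be much faster
time cat bigfile > /dev/null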
I have a Linux box whose resource utilization I need to monitor every hour. By resources, I mean mainly CPU, memory and network. I am using atop for the CPU and memory and nethogs for the network utilization monitoring. I am thinking of redirecting the reports to text files and sending them to my email, but the initial startup screen of atop shows all statistics since boot, which makes the text look messy. Is there a way to make atop skip the initial statistics?
I would suggest you use something other than atop. There are many other tools, like top, free -m, etc., for your CPU, memory and network statistics. The only disadvantage would be that you would have to write them independently.
I landed on your question as I was looking for just that. SeaLion actually works well for this purpose, and you wouldn't need to store the output in files. It's all presented on a timeline, so you can just "Jump to" whenever you want to check your data. You don't even have to record the data manually.
I suppose this is all you need.
Having the same problem right now, I came up with
atop -PCPU,NET,DSK | sed -n -e '/SEP/,$p'
The -P.. option instructs atop to show only the requested information, so roll your own. The important bit is the sed command, which skips lines until the first line containing SEP is found; this effectively skips the first block of data containing the summary since boot time.
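If the goal from the question is an hourly emailed report, one rough way to wire this up (the schedule, sample length and address are placeholders, and mail must be configured on the box):
# hypothetical crontab entry: take two 30-second samples, drop the
# since-boot block with the sed trick above, and mail the rest
0 * * * * atop -PCPU,NET,DSK 30 2 | sed -n -e '/SEP/,$p' | mail -s "hourly atop report" admin@example.com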
I am not sure, but I think you can't, because atop produces statistics over an interval. On the initial run there is no previous sample, so atop produces stats from boot up to the current point, but you can easily use, for example, awk to parse the output:
atop 1 2 | awk '/ seconds elapsed$/ {output=1;} {if (output) print}'
This is the simplest way to solve the problem with atop, but there are tons of other tools that are probably better suited for this job.
Suppose I have a console program that outputs trace/debug lines on stdout and that I want to run on a server.
And then I do:
./serverapp > someoutputfile
If I need to see how the program's doing, I would just log into the server and do:
tail -f someoutputfile
However, understandably over time, someoutputfile would become pretty big.
Is there a way to make it so that someoutputfile is limited to a certain size, and only the most recent parts of it?
I mean, the hard way would be to make a custom script/program that cycles the output between different files, but that seems like overkill.
You can truncate the log file. One way to do this is to type:
>someoutputfile
at the shell command-line. It's a redirect with no output and it will erase all the contents of the file.
The tricky bit here is that any program writing to that file will continue to write into the file at its last output position. So the file will immediately gain a "hole" from 0 to X bytes, where X is the output position.
In most Linux file systems these holes result in sparse files, which don't actually use the space in the hole. So the file may contain many gigabytes of 0's at the beginning but only use 500 KB on disk.
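Once the writer has produced more output after such a truncation, you can see the hole with standard tools (same file name as above):
# apparent (logical) size, including the hole left by the truncation
ls -l someoutputfile
# blocks actually allocated on disk, which can be far smaller
du -h someoutputfile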
Another way to do fast logging is to memory-map a fixed-size file on disk, 16 MB for example. The logging code then writes through a memory pointer that wraps around when it reaches the size limit and continues writing at the front of the file. It's a good idea to have some kind of write-position marker; I use <====>, for example. I find this method ridiculously fast and great for debug logging.
I haven't used it myself, but it gets good reviews here on SO: try logrotate.
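For the redirected-stdout case in the question, a sketch of a logrotate rule might look like this (the path and limits are placeholders; copytruncate matters because the running program keeps the file open):
cat > /etc/logrotate.d/serverapp <<'EOF'
/path/to/someoutputfile {
    # rotate once the file exceeds 50 MB and keep 5 compressed copies
    size 50M
    rotate 5
    compress
    # copy the file aside and truncate it in place, so the running
    # process can keep writing to the same open file descriptor
    copytruncate
}
EOF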
A more general discussion of managing output files may show you that a custom script/solution is not out of the question ;-) : Problem with Bash output redirection
I hope this helps.
I have a chunk of fairly random binary data. I want to find where that chunk exists in a file, how many times it occurs, and at what byte (or sector) offsets. Any ideas on how to do that?
Thanks,
Justin
I believe that no existing command does exactly what you want. If your chunk is small and your file fits in memory, it's easy to write your own. Just scan through the file contents, applying strncmp at each position.
If your file is very large but still fits in your address space, you can do the same thing with mmap.
If your chunk is not small, you'll probably be better off using the Boyer-Moore algorithm instead of strncmp. This is still not too much work since there are already implementations out there that you can use.
I would recommend X-Ways WinHex for that. I find myself using it quite often to search arbitrary data on hard disk drives or large disk image files.
You can do some of this with grep
This outputs lines with the byte offset
grep --text --byte-offset 'ls' /bin/ls
Add a --count parameter to get the total number of matching lines.
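If the chunk is arbitrary bytes rather than printable text, GNU grep built with PCRE support can also take hex escapes; a rough sketch, using the ELF magic number purely as an example pattern:
# -o prints each match on its own line, -b prefixes it with its byte offset,
# -a treats the binary file as text, -P enables \xNN escapes (needs GNU grep
# with PCRE support); note that grep works line by line, so a chunk that
# itself contains a newline byte won't be found this way
grep -oabP '\x7f\x45\x4c\x46' /bin/ls
# count the occurrences
grep -oabP '\x7f\x45\x4c\x46' /bin/ls | wc -l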