Read contents of large files without using "cat" command in linux


I'm looking for a more efficient way to read file contents in Linux without using the "cat" command, especially for large files, because in such cases cat drives up memory and CPU usage on the server.
One thing that comes to mind is grep -v "character-set-which-is-unlikely-in-the-file" filename
But picking a different character set every time, and hoping it does not appear in the file, would not be efficient.
Any other thoughts?

If you just want to read through the file, so it gets cached, the simplest way is perhaps this:
cat filename > /dev/null
Note that you don't need to show the data on screen to read it from disk. That command reads the file, and ignores the content by dumping it in /dev/null, but it still reads all the data.
If the CPU load goes up, that is probably a good thing, meaning that the computer is working hard, and will be finished sooner rather than later. If it crashes, though, there is something else wrong.
If you have some specific reason not to use the "cat" command, you can try "dd" instead, but it is more complicated to write and will not be faster:
dd if=filename of=/dev/null bs=1M
Addendum:
This inspired me to run some tests. On my particular computer both "cat" and "dd" took 24.27-24.31 seconds to read a large file on a mechanical disk when it wasn't already cached, and 0.39-0.40 seconds when it was cached. (Three tests of each case, with very little variability.)
Both these programs contain code to write the data, even if it is dumped to /dev/null, so one could expect that a program specifically written to just read would be slightly faster, but I got the same times when I tried that.
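A minimal sketch of the two read commands side by side, assuming GNU coreutils. The temporary file and its 8 MiB size are placeholders chosen for illustration; for a real measurement you would point the commands at your own large file.

```shell
# Stand-in file (8 MiB of zeros); substitute your own large file.
testfile=$(mktemp)
dd if=/dev/zero of="$testfile" bs=1M count=8 2>/dev/null

# Both commands read every byte and throw the data away.
cat "$testfile" > /dev/null
cat_status=$?
dd if="$testfile" of=/dev/null bs=1M 2>/dev/null
dd_status=$?

# Sanity check: the file really is 8 MiB.
file_bytes=$(( $(wc -c < "$testfile") ))
echo "file is $file_bytes bytes; cat exit=$cat_status, dd exit=$dd_status"
rm -f "$testfile"
```

To time either command, prefix it with time in an interactive shell. To repeat the uncached case, the page cache can be dropped as root with sync; echo 3 > /proc/sys/vm/drop_caches.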

Related

Why vim search is much slower than "cat fileName | grep targetText"?

I have a 1.4 GB text file named test.txt and I want to search a string inside the file.
I'd like to know why a vim search (vim test.txt, then typing /targetText to search for the string) is so much slower than cat test.txt | grep targetText.
On my machine, the vim search takes several minutes to complete, while cat test.txt | grep targetText takes only a few seconds.

Vim is an editor. It tries to load the file into memory so that you can edit it. Vim can edit huge files, but it is not optimized for them.
On the other hand, cat and grep do not need to read the whole file into memory.
BTW, you can just run grep targetText test.txt without using cat.
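A small sketch of calling grep on the file directly, assuming GNU grep. The sample file and its contents are made up for illustration and stand in for test.txt.

```shell
# Small stand-in for test.txt; the contents are invented for this example.
sample=$(mktemp)
printf 'alpha\ntargetText here\nbeta\nanother targetText\n' > "$sample"

# grep opens the file itself; no cat and no pipe are needed.
matches=$(grep -c 'targetText' "$sample")

# -F treats the pattern as a fixed string; -m 1 stops after the first hit.
first=$(grep -F -m 1 'targetText' "$sample")

echo "$matches matching lines; first match: $first"
rm -f "$sample"
```

Stopping at the first match with -m 1 is the closest equivalent to an interactive search, since grep then does not scan the rest of the file.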

If targetText is short, the delay is probably caused by the many separate reads from disk that are needed to search through the whole text. Note that Vim is an interactive tool and is not designed for fast processing of gigabytes. If Vim knew in advance that the match lay many megabytes beyond the current screen, it could read huge pieces from disk at once and be fast. In practice, though, Vim does not know how much data is worth reading in one go: if the pattern is expected a short distance away, say three lines below (a far more common situation), there is no reason to read huge amounts of data from disk; that would only waste time and bandwidth. Since Vim cannot know a priori how much to read at once, it settles on a trade-off that happens not to be optimal in your case.
By contrast, a "cat | ..." pipeline works through the file sequentially in large chunks (ideally, once it has found the file, it reads continuously and feeds the pipeline), because cat "knows" that the whole file is needed and there is no reason to read it in small pages.
Thus, although Vim and the cat pipeline read the same amount of data, the pipeline seeks on disk far less often, which results in a dramatic gain in efficiency.
If a prefix of the pattern occurs very frequently in the file being scanned, grep may also gain an advantage from its search technique, which is based on string matching algorithms such as Aho–Corasick.

Disk read/write performance in Linux

I wanted to check the read/write performance of my disk. I am executing the below command to write into a file
time dd if=/dev/zero of=/home/test.txt bs=2k count=32k;
which gives about 400MB/s
To check the read performance I executed the commands below, with and without the 'of' parameter. There is a huge difference between the results:
time dd if=/home/test.txt of=/dev/zero bs=2k (gives about 2.8GB/s)
time dd if=/home/test.txt bs=2k (9MB/s)
I read that "of=/dev/zero" is used to read data from some temp file while creating the file.
But why is it required when checking read performance, and why is there such a huge difference in speed with and without "of=/dev/zero"?

/dev/zero is a special file. Its contents come from a device driver. All write operations on /dev/zero are guaranteed to succeed.
Without of specified, dd prints to stdout. The terminal then has to format and print the data it receives, so the terminal you're using is very likely the bottleneck rather than your drive.
Also, if stands for input file; likewise, of means output file.
Edit:
Writing to /dev/zero can have unexpected results. I wouldn't say this is an accurate way of measuring read performance.

Your read operations are served from the HDD's and the filesystem's caches. Try the iflag=direct flag for the read test (and oflag=direct for the write test).
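A hedged sketch of a read test along these lines, assuming GNU dd. The temporary file is a placeholder; a real test would use a file on the disk being measured. The data goes to /dev/null, the conventional sink for discarded output, and iflag=direct asks the kernel to bypass the page cache. Some filesystems reject direct I/O, so the sketch falls back to a normal cached read.

```shell
# Stand-in test file (4 MiB); use a file on the disk you want to measure.
testfile=$(mktemp)
dd if=/dev/zero of="$testfile" bs=1M count=4 conv=fsync 2>/dev/null

# Read the file back and discard the data.
# First try a direct (uncached) read, then fall back to a cached read.
if dd if="$testfile" of=/dev/null bs=1M iflag=direct 2>/dev/null; then
    read_mode="direct"
elif dd if="$testfile" of=/dev/null bs=1M 2>/dev/null; then
    read_mode="cached"
else
    read_mode="failed"
fi

echo "read test ran in $read_mode mode"
rm -f "$testfile"
```

Remove the 2>/dev/null redirections to see the throughput line that dd prints on stderr. A result in cached mode mostly measures memory speed, not the disk.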

How to create a log file that "pop_front"s?

Suppose I have a console program that outputs trace debug lines on stdout, that I want to run on a server.
And then I do:
./serverapp > someoutputfile
If I need to see how the program's doing, I would just log into the server and do:
tail -f someoutputfile
However, understandably over time, someoutputfile would become pretty big.
Is there a way to make it so that someoutputfile is limited to a certain size, and only the most recent parts of it?
I mean, the hard way would be to make a custom script/program that cycles the output between different files, but that seems like overkill.

You can truncate the log file. One way to do this is to type:
>someoutputfile
at the shell command-line. It's a redirect with no output and it will erase all the contents of the file.
The tricky bit here is that any program writing to that file will continue to write into the file at its last output position. So the file will immediately gain a "hole" from 0 to X bytes, where X is the output position.
In most Linux file systems these holes result in sparse files, which don't actually use the space in the hole. So the file may contain many gigabytes of 0's at the beginning but only use 500 KB on disk.
Another way to do fast logging is to memory map a file on disk of fixed size: 16 MB for example. Then the logging writes into a memory pointer which wraps around when it reaches the size limit. It then continues to write at the front of the file. It's a good idea to have some kind of write position marker. I use <====>, for example. I find this method to be ridiculously fast and great for debug logging.
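The hole left by truncating a file that a writer still has open can be reproduced with a short experiment. This is a sketch with temporary files standing in for someoutputfile; it also shows that a writer opened in append mode (>>) does not leave a hole, because every write then goes to the current end of the file.

```shell
log_plain=$(mktemp)
log_append=$(mktemp)

# A writer opened with ">" keeps its own file offset.
exec 3>"$log_plain"
echo "first" >&3      # 6 bytes written, offset is now 6
: > "$log_plain"      # truncate the file from outside the writer
echo "second" >&3     # written at offset 6, leaving a 6-byte hole
exec 3>&-

# A writer opened with ">>" (O_APPEND) always writes at the current end.
exec 4>>"$log_append"
echo "first" >&4
: > "$log_append"     # truncate again
echo "second" >&4     # written at offset 0, no hole
exec 4>&-

plain_size=$(( $(wc -c < "$log_plain") ))
append_size=$(( $(wc -c < "$log_append") ))
echo "plain: $plain_size bytes, append: $append_size bytes"
rm -f "$log_plain" "$log_append"
```

The plain file ends up 13 bytes long (6 zero bytes followed by "second"), the appended one 7 bytes. Starting the server as ./serverapp >> someoutputfile therefore lets you truncate the log without creating a hole.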

I haven't used it myself, but it gets good reviews here on SO: try logrotate.
A more general discussion of managing output files may show you that a custom script/solution is not out of the question ;-) : Problem with Bash output redirection
I hope this helps.

Can I get a faster output pipe than /dev/null?

I am running a huge task [automated translation scripted with perl + database etc.] to run for about 2 weeks non-stop. While thinking how to speed it up I saw that the translator outputs everything (all translated sentences, all info on the way) to STDOUT all the time. This makes it work visibly slower when I get the output on the console.
I obviously piped the output to /dev/null, but then I thought "could there be something even faster?" It's so much output that it'd really make a difference.
And that's the question I'm asking you, because as far as I know there is nothing faster... (But I'm far from being a guru, having used Linux on a daily basis for only the last 3 years.)

Output to /dev/null is implemented in the kernel, which is pretty bloody fast. The output pipe isn't your problem now, it's the time it takes to build the strings that are getting sent to /dev/null. I would recommend you go through the program and comment out (or guard with if $be_verbose) all the lines that are useless print statements. I'm pretty sure that'll give you a noticeable speedup.

I'm able (via dd) to dump 20 gigabytes of data per second down /dev/null. This is not your bottleneck :-p
Pretty much the only way to make it faster is to not generate the data in the first place - remove the logging statements entirely. The cost of producing all the log messages likely exceeds the cost of throwing them away quite a bit.
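A quick way to check how fast /dev/null is on a given machine, sketched with GNU dd. The 64 MiB transfer size is an arbitrary choice for illustration, and the figure reported will differ from machine to machine.

```shell
# Push 64 MiB of zeros into /dev/null. dd reports the achieved
# throughput on stderr, which is captured here.
report=$(dd if=/dev/zero of=/dev/null bs=1M count=64 2>&1)
dd_status=$?

# The last line of the report holds the byte count, time, and rate.
echo "$report" | tail -n 1
```

If the rate shown is far above what the translator produces, the sink is not the bottleneck and the time is being spent generating the output.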

Unrelated to Perl and standard output, but there is the null_blk block device, which is even faster than /dev/null. Basically, it is bounded by syscall performance, and with large blocks it can saturate the memory bus.

How do I transparently compress/decompress a file as a program writes to/reads from it?

I have a program that reads and writes very large text files. However, because of the format of these files (they are ASCII representations of what should have been binary data), these files are actually very easily compressed. For example, some of these files are over 10GB in size, but gzip achieves 95% compression.
I can't modify the program but disk space is precious, so I need to set up a way that it can read and write these files while they're being transparently compressed and decompressed.
The program can only read and write files, so as far as I understand, I need to set up a named pipe for both input and output. Some people are suggesting a compressed filesystem instead, which seems like it would work, too. How do I make either work?
Technical information: I'm on a modern Linux. The program reads a separate input and output file. It reads through the input file in order, though twice. It writes the output file in order.

Check out zlibc: http://zlibc.linux.lu/.
Also, if FUSE is an option (i.e. the kernel is not too old), consider: compFUSEd http://www.biggerbytes.be/

Named pipes won't give you full-duplex operation, so it will be a little more complicated if you need to provide just one filename.
Do you know whether your application needs to seek through the file?
Does your application work with stdin and stdout?
Maybe a solution is to create a mini compressed file system that contains only a directory with your files.
Since you have separate input and output file you can do the following :
mkfifo readfifo
mkfifo writefifo
zcat yourinputfile > readfifo &
gzip < writefifo > youroutputfile &
Then launch your program, with readfifo as its input file and writefifo as its output file.
Now, you will probably get into trouble with reading the input twice in order: as soon as zcat has finished writing the input, your program sees end-of-file on the FIFO, and a second pass will block or fail because nothing is feeding the pipe any more.
The proper solution is probably to use a compressed file system like compFUSEd, because then you don't have to worry about unsupported operations like seek.
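The FIFO recipe above can be tried end to end with a stand-in program. This is a sketch under stated assumptions: tr plays the role of the real program, the file names are hypothetical, and gzip reads the output FIFO on stdin. It covers a single sequential pass only.

```shell
workdir=$(mktemp -d)

# A compressed stand-in input file; the contents are made up.
printf 'hello\nworld\n' | gzip > "$workdir/input.gz"

mkfifo "$workdir/readfifo" "$workdir/writefifo"

# Decompress into one FIFO, compress whatever arrives on the other.
gzip -dc "$workdir/input.gz" > "$workdir/readfifo" &
gzip -c < "$workdir/writefifo" > "$workdir/output.gz" &

# Stand-in for the real program: reads one "file", writes another.
tr 'a-z' 'A-Z' < "$workdir/readfifo" > "$workdir/writefifo"

# Wait for both gzip processes to finish before inspecting the result.
wait
result=$(gzip -dc "$workdir/output.gz")
echo "$result"
rm -rf "$workdir"
```

For a program that reads its input twice, the decompressing side would have to be started again before the second pass, which is why a compressed file system is the more robust choice here.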

btrfs:
https://btrfs.wiki.kernel.org/index.php/Main_Page
provides support for pretty fast "automatic transparent compression/decompression" these days, and is present (though marked experimental) in newer kernels.

FUSE options:
http://apps.sourceforge.net/mediawiki/fuse/index.php?title=CompressedFileSystems

Which language are you using?
If you are using Java, take a look at the GZIPInputStream and GZIPOutputStream classes in the API doc.
If you are using C/C++, zlibc is probably the best way to go about it.
