Bash Pipe Handling - linux

Does anyone know how bash handles sending data through pipes?
cat file.txt | tail -20
Does this command print all the contents of file.txt into a buffer, which is then read by tail? Or does this command, say, print the contents of file.txt line by line, and then pause at each line for tail to process, and then ask for more data?
The reason I ask is that I'm writing a program on an embedded device that basically performs a sequence of operations on some chunk of data, where the output of one operation is sent off as the input of the next operation. I would like to know how Linux (bash) handles this, so please give me a general answer, not specifically what happens when I run "cat file.txt | tail -20".
EDIT: Shog9 pointed out a relevant Wikipedia article. It didn't lead me directly to the answer, but it helped me find this: http://en.wikipedia.org/wiki/Pipeline_%28Unix%29#Implementation which did have the information I was looking for.
I'm sorry for not making myself clear. Of course you're using a pipe and of course you're using stdin and stdout of the respective parts of the command. I had assumed that was too obvious to state.
What I'm asking is how this is handled/implemented. Since both programs cannot run at once, how does data get from the first program's stdout to the second program's stdin? What happens if the first program generates data significantly faster than the second program can consume it? Does the system just run the first command until it either terminates or its stdout buffer is full, then move on to the next program, and so on in a loop until no more data is left to be processed, or is there a more complicated mechanism?

I decided to write a slightly more detailed explanation.
The "magic" here lies in the operating system. Both programs do start up at roughly the same time, and run at the same time (the operating system assigns them slices of time on the processor to run) as every other simultaneously running process on your computer (including the terminal application and the kernel). So, before any data gets passed, the processes are doing whatever initialization necessary. In your example, tail is parsing the '-20' argument and cat is parsing the 'file.txt' argument and opening the file. At some point tail will get to the point where it needs input and it will tell the operating system that it is waiting for input. At some other point (either before or after, it doesn't matter) cat will start passing data to the operating system using stdout. This goes into a buffer in the operating system. The next time tail gets a time slice on the processor after some data has been put into the buffer by cat, it will retrieve some amount of that data (or all of it) which leaves the buffer on the operating system. When the buffer is empty, at some point tail will have to wait for cat to output more data. If cat is outputting data much faster than tail is handling it, the buffer will expand. cat will eventually be done outputting data, but tail will still be processing, so cat will close and tail will process all remaining data in the buffer. The operating system will signal tail when their is no more incoming data with an EOF. Tail will process the remaining data. In this case, tail is probably just receiving all the data into a circular buffer of 20 lines, and when it is signalled by the operating system that there is no more incoming data, it then dumps the last twenty lines to its own stdout, which just gets displayed in the terminal. Since tail is a much simpler program than cat, it will likely spend most of the time waiting for cat to put data into the buffer.
On a system with multiple processors, the two programs will not just share alternating time slices on the same processor core; they will likely run at the same time on separate cores.
To get into a little more detail: if you open some kind of process monitor (operating system specific), like 'top' on Linux, you will see a whole list of running processes, most of which are effectively using 0% of the processor. Most applications, unless they are crunching data, spend most of their time doing nothing. This is good, because it allows other processes to have unfettered access to the processor according to their needs.

A process ends up idle in basically three ways. It can reach a sleep(n)-style instruction, which tells the kernel not to give it another time slice until some amount of time has passed. More commonly, a program needs to wait for something from another program, like tail waiting for more data to enter the buffer; in that case the operating system wakes the process up when more data is available. Lastly, the kernel can preempt a process in the middle of execution, giving some processor time slices to other processes.

cat and tail are simple programs. In this example, tail spends most of its time waiting for more data in the buffer, and cat spends most of its time waiting for the operating system to retrieve data from the hard drive. The bottleneck is the speed (or slowness) of the physical medium the file is stored on. The perceptible delay you might detect the first time you run this command is the time it takes the read heads on the disk drive to seek to the position of 'file.txt'. If you run the command a second time, the operating system will likely have the contents of file.txt cached in memory, and you will probably see no perceptible delay (unless file.txt is very large, or the file is no longer cached).
Most operations you do on your computer are I/O bound, which is to say that you are usually waiting for data to come from your hard drive, from a network device, etc.

Shog9 already referenced the Wikipedia article, but the implementation section has the details you want. The basic implementation is a bounded buffer.
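On Linux the bound is even observable: since kernel 2.6.35 you can query (and resize) a pipe's capacity with fcntl. A small sketch, assuming a reasonably recent Linux and glibc:

    #define _GNU_SOURCE        /* F_GETPIPE_SZ is Linux-specific */
    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    int main(void)
    {
        int fds[2];
        if (pipe(fds) != 0)
            return 1;

        /* Typically prints 65536. A writer that gets this far ahead of
           the reader simply blocks until the reader catches up. */
        printf("pipe capacity: %d bytes\n", fcntl(fds[0], F_GETPIPE_SZ));

        close(fds[0]);
        close(fds[1]);
        return 0;
    }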

cat will just print the data to standard output, which happens to be redirected to the standard input of tail. This can be seen in the man page of bash.
In other words, there is no pausing going on: tail is just reading from standard input and cat is just writing to standard output.

Related

Read contents of large files without using "cat" command in linux

I'm looking for a more efficient way of reading file contents in Linux without using the "cat" command, especially for larger files, since in such cases cat drives up the memory and CPU usage on the server.
One thing that comes to mind is using grep -v "character-set-which-is-unlikely-in-the-file" filename
But picking a different character set every time and hoping it doesn't appear in the file wouldn't be efficient.
Any other thoughts?
If you just want to read through the file, so it gets cached, the simplest way is perhaps this:
cat filename > /dev/null
Note that you don't need to show the data on screen to read it from disk. That command reads the file and discards the content by dumping it in /dev/null, but it still reads all the data.
If the CPU load goes up, that is probably a good thing, meaning that the computer is working hard, and will be finished sooner rather than later. If it crashes, though, there is something else wrong.
If you have some specific reason not to use the "cat" command, you can try "dd" instead, but it is more complicated to write and will not be faster:
dd if=filename of=/dev/null bs=1M
Addendum:
This inspired me to run some tests. On my particular computer both "cat" and "dd" took 24.27-24.31 seconds to read a large file on a mechanical disk when it wasn't already cached, and 0.39-0.40 seconds when it was cached. (Three tests of each case, with very little variability.)
Both these programs contain code to write the data, even if it is dumped to /dev/null, so one could expect that a program specifically written to just read would be slightly faster, but I got the same times when I tried that.
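For what it's worth, the 'program specifically written to just read' boils down to a loop like this (the buffer size is arbitrary):

    #include <fcntl.h>
    #include <unistd.h>

    int main(int argc, char **argv)
    {
        if (argc != 2)
            return 1;

        int fd = open(argv[1], O_RDONLY);
        if (fd < 0)
            return 1;

        /* Read the whole file and discard it: enough to pull it into
           the page cache, with no write to /dev/null in the loop. */
        char buf[64 * 1024];
        while (read(fd, buf, sizeof buf) > 0)
            ;

        close(fd);
        return 0;
    }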

File contents lost after power outage

I am using a C++ ofstream to write a log file on Linux. When I monitor the file with tail -f I can see the contents being populated correctly. But if a power outage happens and I check the file again after the power cycle, the last couple of lines of records are gone. With hexdump I can see those records turned into null characters ('\0') instead. I tried flush() and the std::endl manipulator, and they don't help.
Is it true that what tail showed me was not actually written to disk, and was still just in a buffer? Was the inode table not updated before the power outage? I can accept that, but I don't understand why the records turned into null characters if they were never written to the file.
Btw, I tried Google's glog and got the same results (a bunch of null characters at the end). I also tried zlog, a C library, and found it only lost the last records but didn't replace them with null characters.
Well, when you have a power outage and then start the system again, the Linux kernel replays the filesystem journal to detect and correct inconsistencies between what was in memory and what had reached disk when the system crashed. Normally this means redoing and committing every operation that was complete by the time of the crash, but undoing (and erasing) any data that had not been committed.
Linux (and other un*x kernels, like FreeBSD) has a facility called ordered data writes, which forces metadata (like block pointers in inodes, or directory entries) to be updated only after the actual data they point to has been written to disk, so inconsistencies are kept to a minimum. I don't know the actual Linux implementation, but in FreeBSD, for example, what you describe (a block of zeros in a file instead of the data you wrote) is impossible with the FreeBSD kernel (well, you can do it on purpose, but not accidentally). The most probable explanation is that the filesystem journaled only the block/metadata information and not the file contents, or that it updated the file size before the data behind it had actually reached the disk. This shouldn't happen, as it's an already-solved problem.
The other question is why data you had already written, and could see on the screen, doesn't appear after the system crash. You have probably heard of something called delayed write, which allows the kernel to save on disk operations on busy systems by not writing data immediately, but waiting a while so updates can be coalesced in in-core buffers before they go to disk. Disk writes are forced out after some delay anyway; in Linux that delay is about 5 seconds (from memory; it's been a long time since I checked, and I'm in doubt between 5 and 30 seconds), so you can lose roughly your last five seconds of output at most.
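If you need individual log lines to survive a power cut, flushing the language-level buffer (flush(), std::endl) is not enough: the data must also be forced out of the kernel's page cache with fsync(). A minimal C sketch of the idea (with std::ofstream you would need access to the underlying file descriptor to do the same; the file name is a placeholder):

    #include <fcntl.h>
    #include <string.h>
    #include <unistd.h>

    int main(void)
    {
        int fd = open("app.log", O_WRONLY | O_CREAT | O_APPEND, 0644);
        if (fd < 0)
            return 1;

        const char *line = "something happened\n";
        /* write() only hands the data to the kernel's page cache... */
        if (write(fd, line, strlen(line)) < 0)
            return 1;
        /* ...fsync() forces the data (and metadata) onto the disk. Without
           it, a power cut inside the write-back window loses the line. */
        if (fsync(fd) != 0)
            return 1;

        close(fd);
        return 0;
    }

Note that calling fsync() on every line is expensive; syncing once per second, or only for important records, is a common compromise.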

Does writing data to stdout in linux occupy disk space?

I want to know whether writing data to stdout on a terminal in Linux occupies disk space. I tried to look up information about stdout and tty in the man pages, but found no answer to this question.
Thanks for any tip.
I have a router on which I have installed OpenWrt, but the total storage of the router is 2MB and now there is only 12KB left. I want to run a bash script that prints some information on the terminal in the OpenWrt system, so I want to know whether printing data to stdout could occupy disk space.
stdout is not a specific file or location (disk, memory, etc.); it is a concept. Usually, when talking about stdout in Linux applications, we mean a file-like object (i.e., something supporting write, flush, etc.).
Output can be redirected to a real file on disk, in which case writing to stdout can and will use up disk space.
Unless the output is being redirected (through a pipe to e.g. tee, or to a file with the > shell redirection operator), then no, output will not go to a file system. Output to a console (or terminal emulator) is slow as it is; passing it through the file system would make it even slower, and might not allow the disks to spin down if there's a lot of output.
It may, however, end up in the swap space, if the OS thinks the process should be swapped out before the output is actually written.
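One way to convince yourself that stdout is not a place but just file descriptor 1, pointing wherever the caller aimed it: a small C sketch that asks whether the descriptor is currently a terminal. Run it plain, then with > out.txt, to see both branches:

    #include <stdio.h>
    #include <unistd.h>

    int main(void)
    {
        /* stdout is just file descriptor 1; what it refers to (terminal,
           pipe, regular file) depends on how the process was started. */
        if (isatty(STDOUT_FILENO))
            fprintf(stderr, "stdout is a terminal - nothing goes to disk\n");
        else
            fprintf(stderr, "stdout is redirected (file, pipe, ...)\n");
        return 0;
    }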

How to create a log file that "pop_front"s?

Suppose I have a console program that outputs trace debug lines on stdout, that I want to run on a server.
And then I do:
./serverapp > someoutputfile
If I need to see how the program's doing, I would just log into the server and do:
tail -f someoutputfile
However, understandably over time, someoutputfile would become pretty big.
Is there a way to make it so that someoutputfile is limited to a certain size and contains only the most recent output?
I mean, the hard way would be to make a custom script/program that cycles the output between different files, but that seems like overkill.
You can truncate the log file. One way to do this is to type:
>someoutputfile
at the shell command line. It's a redirection with no command producing output, and it will erase all the contents of the file.
The tricky bit here is that any program writing to that file will continue to write into the file at its last output position. So the file will immediately gain a "hole" from 0 to X bytes, where X is the output position.
In most Linux file systems these holes result in sparse files, which don't actually use the space in the hole. So the file may contain many gigabytes of 0's at the beginning but only use 500 KB on disk.
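You can see the difference yourself: a file's apparent length (st_size) and the space it actually occupies (st_blocks, counted in 512-byte units) are reported separately by stat(). A small sketch (the file name is just a placeholder):

    #include <stdio.h>
    #include <sys/stat.h>

    int main(void)
    {
        struct stat st;
        if (stat("someoutputfile", &st) != 0)
            return 1;

        /* For a sparse file the on-disk figure is much smaller
           than the apparent size. */
        printf("apparent size: %lld bytes\n", (long long)st.st_size);
        printf("on disk:       %lld bytes\n", (long long)st.st_blocks * 512LL);
        return 0;
    }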
Another way to do fast logging is to memory map a file on disk of fixed size: 16 MB for example. Then the logging writes into a memory pointer which wraps around when it reaches the size limit. It then continues to write at the front of the file. It's a good idea to have some kind of write position marker. I use <====>, for example. I find this method to be ridiculously fast and great for debug logging.
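Here is a minimal sketch of that scheme, assuming a single writer and a pre-sized file (the name and the 16 MB size are arbitrary):

    #include <fcntl.h>
    #include <string.h>
    #include <sys/mman.h>
    #include <unistd.h>

    #define LOG_SIZE (16 * 1024 * 1024)    /* fixed-size 16 MB log */

    static char  *logbuf;
    static size_t logpos;

    /* Append a message, wrapping to the front at the size limit. */
    static void log_write(const char *msg)
    {
        for (size_t i = 0, len = strlen(msg); i < len; i++) {
            logbuf[logpos] = msg[i];
            logpos = (logpos + 1) % LOG_SIZE;
        }
        /* Drop a marker so the current write position is easy to spot. */
        if (logpos + 6 <= LOG_SIZE)
            memcpy(logbuf + logpos, "<====>", 6);
    }

    int main(void)
    {
        int fd = open("debug.log", O_RDWR | O_CREAT, 0644);
        if (fd < 0 || ftruncate(fd, LOG_SIZE) != 0)
            return 1;

        logbuf = mmap(NULL, LOG_SIZE, PROT_READ | PROT_WRITE,
                      MAP_SHARED, fd, 0);
        if (logbuf == MAP_FAILED)
            return 1;

        log_write("hello from the ring log\n");

        munmap(logbuf, LOG_SIZE);   /* flushes dirty pages back to the file */
        close(fd);
        return 0;
    }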
I haven't used it myself, but it gets good reviews here on SO: try logrotate.
A more general discussion of managing output files may show you that a custom script/solution is not out of the question ;-) : Problem with Bash output redirection
I hope this helps.

How to only allow certain text to be printed on the Emacs console?

This question may not be related only to Emacs, but to any development environment that uses a console for its debugging process. Here is the problem: I use Eshell to run the application we are developing. It's a J2ME application, and for debugging we just use System.out.println(). Now, suppose I want to allow only text that starts with Eko: to be displayed in the console (interactively). Is that possible?
I installed Cygwin in my Windows environment and tried to grep the output like this:
run | grep Eko:
It did filter so that only lines beginning with Eko: appeared, but it's not interactive: the output is held back until the application quits. Well, that's useless anyway.
Is it possible to do this without touching the application code itself?
I tagged this linux as well, because maybe the Linux folks know the answer.
Many thanks!
The short: Try adding --line-buffered to your grep command.
The long: I assume that your application flushes its output stream with every System.out.println(), and that grep has the lines available to read immediately, but chooses to buffer its output until it has saved up 'enough' to make a write worthwhile. (This is typically 4k or 8k of data, which could be several hundred lines, depending upon your line length.)
This buffering makes great sense when the output is another program in the pipeline; reducing needless context switches is a great way to improve program throughput.
But if your printing is slow enough that it doesn't fill the buffer quickly enough for 'real time' output, then switching to line-buffered output might fix it.
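What --line-buffered asks grep to do is, in effect, flush stdout at every newline instead of every few kilobytes. In C the equivalent switch looks like this sketch:

    #include <stdio.h>

    int main(void)
    {
        /* When stdout is a pipe, stdio defaults to full buffering
           (~4-8 kB). Line buffering flushes at each newline instead. */
        setvbuf(stdout, NULL, _IOLBF, BUFSIZ);

        printf("Eko: this line reaches the next program immediately\n");
        return 0;
    }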
