Linux piping behaviour

OK, this is merely to satisfy my curiosity and that of anyone else who might have a similar question. Please bear with the ignorance and the lengthy question, as it's partially a case of "I don't know what I don't know".
Section 1
Assume a fileToFollow.txt that has some arbitrary content.
row1
row2
row3
Executing tail fileToFollow.txt | cat yields the contents of the file as expected.
Executing tail -f fileToFollow.txt | cat will keep outputting anything that is written to fileToFollow.txt.
I imagined piping as taking the output of one program and feeding it into the input of another (so, for example, if cat were a C program, it would be able to access that input through its main() arguments).
Question 1: Is that what's going on here? Is cat just called every time tail has an output?
Section 2
I decided to throw grep into the mix setting up the following:
tail -f fileToFollow.txt | grep "whatever" | cat
Clearly cat is not needed here, as grep itself outputs to the terminal anyway. But given the idea that piping feeds the output of one program into the input of another, I assumed it would still work. However, in this case no output is displayed in the terminal.
The following of course works fine:
tail -f fileToFollow.txt | grep "whatever"
I have a hunch I am a little bit confused as to how piping might actually work, and why the cases I presented aren't behaving as I would expect them to.
Any kind of enlightenment is welcome. Thanks a bunch for taking the time.

Answer to question 1: No, cat is running as a process all the time, but it's blocked reading stdin when there is nothing available. When the process before it in the pipeline (tail) writes new bytes to the pipe, the read call returns and cat can process the new data. After that it reads again and blocks until new data is available.
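You can see this for yourself while the pipeline runs; a rough sketch, assuming an interactive shell with job control and a procps-style ps:
tail -f fileToFollow.txt | cat &
ps -o pid,stat,comm -C tail,cat
# both processes show state S (sleeping): tail waits for new data in the file,
# and the single cat process is blocked in read() on the pipe
kill %1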

When you pipe into a program, the stdout of the source will usually switch into fully buffered mode (see man setvbuf), which means that a certain amount of data (2 KiB or 4 KiB or so) must be generated before it is handed to write(2).
Writing to a tty uses line-buffered mode instead, so the buffer is flushed after every \n.
There is a stdbuf tool to modify this behavior.
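That is exactly what happens in the Section 2 pipeline: grep's stdout is a pipe (it goes to cat), not a tty, so its matches sit in a block buffer instead of appearing immediately. A sketch of two common workarounds, assuming GNU grep and coreutils' stdbuf are available:
# ask grep itself to flush after every matching line
tail -f fileToFollow.txt | grep --line-buffered "whatever" | cat
# or force line buffering on any stdio-based filter from the outside
tail -f fileToFollow.txt | stdbuf -oL grep "whatever" | cat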

Related

How can I select() (i.e., simultaneously read from) standard input *and* a file in bash?

I have a program that accepts input on one FIFO and emits output to another FIFO. I want to write a small script to control this program. The script needs to listen both to standard input (so I can input commands to adjust things in real time) and the program's output FIFO (so it can respond to events happening there as well).
Essentially my control program needs to select between standard input and a file (my FIFO).
I like figuring out how to develop simple and elegant bash-based solutions to complex problems, and after a little head-scratching I remembered that tail -f will happily select on multiple files and tell you when one of them changes in real time, so I initially tried
tail -f <(od -An -vtd1 -w1) <(cat fifo)
to read both standard input (I'd previously run stty icanon min 1; this od invocation shows each stdin character on a separate line alongside its ASCII code, and is great for escape sequence parsing) and my FIFO. This failed epically (as does cat <(cat)): od gets run here as a backgrounded task, so it doesn't get access to the controlling TTY, and fails with a cryptic "I/O error" that was explained incredibly well here.
So now I'm a bit stumped. I realize that I can use any scripting language like Perl/Python/Ruby/Tcl to solve this; my compsci/engineering question is whether/how I might be able to solve this using (Linux) shell scripting.
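One pure-shell sketch (assuming bash 4+ for fractional read timeouts; handle_command and handle_event are hypothetical placeholders, and this polls rather than truly select()ing): open the FIFO on its own file descriptor and alternate short read timeouts between it and stdin.
exec 3< fifo                                # the program's output FIFO
while true; do
    if IFS= read -r -t 0.1 cmd; then        # a command typed on stdin
        handle_command "$cmd"               # hypothetical handler
    fi
    if IFS= read -r -t 0.1 -u 3 event; then # a line from the FIFO
        handle_event "$event"               # hypothetical handler
    fi
done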

Streaming split

I am trying to split the output of a program into smaller files. This is a long-running program that prints its output to stderr and I'd like to capture the logs in a series of smaller files rather than in one gigantic file. So what I have is:
program 2>&1 | split -l100 &
... but to my dismay I found that split doesn't actually write any files out to disk until its input ends. What I want is a tool that copies its input to the output files as it arrives, without waiting for the source stream to end, since waiting is unnecessary in my case. I've also tried split's -u option, but it doesn't seem to work unless you also choose the -n option, and that option doesn't really apply in my case because the number of generated files could be arbitrarily high. Is there a Unix tool that might let me do this?
Barmar's suggestion to add a call to fflush() after every iteration in the awk script worked for me. This was preferable to calling close() on each file when it's done, since that would only flush once each file is full, while I wanted line-buffered behavior. I also had to configure the output pipe to be line-buffered, so the command in the end looks like this:
stdbuf -oL -eL command 2>&1 | awk -v number=1 '++i>1000 {++number; i=0} {print > ("file" number); fflush("file" number)}'

Better way to output to both console and output file than tee?

What I need to display is a log that refreshes periodically. It's a block of about 10 lines of text. I'm using | tee and it works right now. However, the performance is not satisfying: it waits a while and then outputs several blocks of text from multiple refreshes (especially when the program just starts, it takes quite a while before anything is displayed on the console; the first time I saw this, I thought the program was hanging). In addition, it breaks randomly in the middle of the last block, so it's quite ugly to present.
Is there a way to improve this? (Maybe output less each time and switch between output file and console more frequently?)
Solved by flushing stdout after printing each block. Credit to Kenneth L!
https://superuser.com/questions/889019/bash-better-way-to-output-to-both-console-and-output-file-than-tee
Assuming you can monitor the log as a file directly [update: turned out not to be the case]:
The usual way of monitoring a [log] file for new lines is to use tail -f, which - from what I can tell - prints new data added to the log file as it is being added, without buffering.
Similarly, tee passes data it receives via stdin on without buffering.
Thus, you should be able to combine the two:
tail -f logFile | tee newLogEntriesFile
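If the producing program cannot be changed to flush, and it uses ordinary C stdio buffering, forcing line buffering from the outside may also help; a sketch with placeholder names:
stdbuf -oL ./refreshing_program | tee console.log
# -oL makes the program's stdout line-buffered, so tee receives (and prints)
# each block of lines as soon as it is written, instead of only when a full
# block buffer fills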

Performance of sort command in unix

I am writing a custom apache log parser for my company and I noticed a performance issue that I can't explain. I have a text file log.txt with size 1.2GB.
The command: sort log.txt is up to 3 sec slower than the command: cat log.txt | sort
Does anybody know why this is happening?
cat file | sort is a Useless Use of Cat. The purpose of cat is to concatenate (or "catenate") files. If it's only one file, concatenating it with nothing at all is a waste of time, and costs you a process.
It shouldn't take longer. Are you sure your timings are right?
Please post the output of:
time sort file
and
time cat file | sort
You need to run the commands a few times and get the average.
Instead of worrying about the performance of sort, you should change your logging:
Eliminate unnecessarily verbose output to your log.
Periodically roll the log (based on either date or size).
...fix the errors outputting to the log. ;)
Also, are you sure cat is reading the entire file? It may have a read buffer etc.
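To take the page cache out of the picture when timing (a Linux-specific sketch that needs root to drop the cache; otherwise whichever command runs second wins simply because the file is already cached in RAM):
sync; echo 3 | sudo tee /proc/sys/vm/drop_caches > /dev/null
time sort log.txt > /dev/null
sync; echo 3 | sudo tee /proc/sys/vm/drop_caches > /dev/null
time cat log.txt | sort > /dev/null
# repeat each pair several times and compare the averages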

File output redirection in Linux

I have two programs, A and B. I can't change program A - I can only run it with some parameters - but I have written B myself, and I can modify it the way I like.
Program A runs for a long time (20-40 hours) and during that time it produces output to a file, whose size increases constantly and can be huge by the end of the run (100-200 GB). Program B then reads the file and calculates some stuff. The special property of the file is that its content is not correlated: I can divide the file in half and run the calculations on each part independently, so I don't need to store all the data at once: I can calculate on the first part, then throw it away, calculate on the second one, and so on.
The problem is that I don't have enough space to store such a big file. I wonder if it is possible to somehow pipe the output of A to B without storing all the data at once and without creating huge files. Is it possible to do something like that?
Thank you in advance, this is crucial for me now, Roman.
If program A supports it, simply pipe.
A | B
Otherwise, use a fifo.
mkfifo /tmp/fifo
ls -la > /tmp/fifo &
cat /tmp/fifo
EDIT: Adjust buffer sizes with ulimit -p and then:
cat /tmp/fifo | B
It is possible to pipe the output of one program into another.
Read here to learn the syntax and the know-how of Unix pipelining.
You can use socat, which can take stdout and feed it to the network, and take data from the network and feed it to stdin.
Named or unnamed pipes have the problem of a small (4 KiB?) buffer, which means too many process context switches if you are writing multiple gigabytes.
Or, if you are adventurous enough, you can LD_PRELOAD a shared object into process A and trap the open/write calls to do whatever you like.
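For the original A/B setup, the FIFO variant could look roughly like this, assuming A accepts the output file path as a parameter (the path and the --output flag below are placeholders for whatever A actually takes):
mkfifo /tmp/a_output
A --output /tmp/a_output &    # placeholder flag: point A's output "file" at the FIFO
B < /tmp/a_output             # B consumes the data as A produces it; nothing large ever hits the disk
rm /tmp/a_output
When B falls behind, A's writes simply block until the pipe buffer drains, so no more than one pipe buffer's worth of data is ever held in memory.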
