Performance of the sort command in Unix/Linux

I am writing a custom Apache log parser for my company, and I noticed a performance issue that I can't explain. I have a text file, log.txt, that is 1.2 GB in size.
The command sort log.txt is up to 3 seconds slower than the command cat log.txt | sort.
Does anybody know why this is happening?

cat file | sort is a Useless Use of Cat. The purpose of cat is to concatenate (or "catenate") files. If it's only one file, concatenating it with nothing at all is a waste of time, and costs you a process.
It shouldn't take longer. Are you sure your timings are right?
Please post the output of:
time sort file
and
time cat file | sort
You need to run the commands a few times and get the average.
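For example, a repeatable comparison could look like this (a sketch; the redirects to /dev/null are my addition so terminal output doesn't skew the numbers, and repeated runs keep the page cache in a comparable state):
for i in 1 2 3; do
  time sort log.txt > /dev/null
  time cat log.txt | sort > /dev/null
done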

Instead of worrying about the performance of sort, you should change your logging:
Eliminate unnecessarily verbose output to your log.
Periodically roll the log (based on either date or size).
...fix the errors outputting to the log. ;)
Also, are you sure cat is reading the entire file? It may be using a read buffer, etc.

Related

Linux piping behaviour

OK, this is merely to satisfy my curiosity and that of anyone else who might have a similar question. Please bear with the ignorance and the lengthy question, as it's partially a case of "I don't know what I don't know".
Section 1
Assume a fileToFollow.txt that has some arbitrary content.
row1
row2
row3
Executing tail fileToFollow.txt | cat yields the contents of the file as expected.
Executing tail -f fileToFollow.txt | cat will keep outputting anything that is written into fileToFollow.txt.
I imagined piping as getting the output of one program and feeding it into the input of another (so, for example, if cat were a C program, it would be able to access that input through the main() arguments).
Question 1: Is that what's going on here? Is cat just called every time tail has output?
Section 2
I decided to throw grep into the mix setting up the following:
tail -f fileToFollow.txt | grep "whatever" | cat
Clearly cat is not needed here, as grep itself outputs to the terminal anyway. But given the idea that piping is output from one program to input in another, I assumed it would still show the output. However, in this case no output is displayed in the terminal.
The following of course works fine:
tail -f fileToFollow.txt | grep "whatever"
I have a hunch I am a little bit confused as to how piping might actually work and why the cases I presented aren't behaving as I would expect them to.
Any kind of enlightenment is welcome. Thanks a bunch for taking the time.
Answer to question 1: No, cat is running as a process all the time but it's blocked reading stdin when there is nothing available. When the process before it in the pipeline (tail) writes new bytes to the pipe, the read call will return and cat is able to process the new data. After that it will read again and block until new data is available.
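A quick way to see this for yourself (a sketch, assuming a Linux system with strace and pgrep available; attaching may require root depending on ptrace restrictions):
tail -f fileToFollow.txt | cat &
strace -p "$(pgrep -nx cat)"   # shows cat sitting in a blocking read(0, ...) until tail writes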
When you pipe into a program, the stdout of the source will usually switch into fully buffered mode (see setvbuf(3)), which means that a certain amount of data (2 KiB or 4 KiB or so) must accumulate before it is handed to write(2).
Writing to a tty uses line-buffered mode instead, so the buffer is flushed after each \n.
There is a stdbuf tool to modify this behavior.
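Applied to the pipeline from Section 2, forcing grep's stdout into line-buffered mode should make matches reach cat (and the terminal) immediately. A sketch using either grep's own option or stdbuf:
tail -f fileToFollow.txt | grep --line-buffered "whatever" | cat
tail -f fileToFollow.txt | stdbuf -oL grep "whatever" | cat   # same idea via stdbuf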

Streaming split

I am trying to split the output of a program into smaller files. This is a long-running program that prints its output to stderr and I'd like to capture the logs in a series of smaller files rather than in one gigantic file. So what I have is:
program 2>&1 | split -l100 &
... but to my dismay I found that the split tool doesn't actually write any files out to disk until the input stream ends. What I want is a tool that copies its input to the output files in a streaming manner, without waiting for the source stream to end, which is unnecessary in my case. I've also tried the -u option of split, but it doesn't seem to work unless you also choose the -n option, and that option doesn't really apply in my case because the number of generated files could be arbitrarily high. Is there a Unix tool that might let me do this?
Barmar's suggestion to add a call to fflush() after every iteration in the awk script worked for me. This was preferable to calling close() on each file when it's done, since that would only flush once each file is full, while I wanted line-buffered behavior. I also had to configure the output pipe to be line-buffered, so the command in the end looks like this:
stdbuf -oL -eL command 2>&1 | awk -v number=1 '++i>1000 {++number; i=0} {print > ("file" number); fflush("file" number)}'
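A quick way to check that the output really streams (a sketch; the while loop is just a hypothetical slow producer standing in for the real long-running program):
( while true; do echo "log line $(date)" >&2; sleep 1; done ) 2>&1 |
  awk -v number=1 '++i>1000 {++number; i=0} {print > ("file" number); fflush("file" number)}' &
tail -F file1   # -F retries until file1 exists; lines should appear about once per second. kill %1 when done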

Better way to output to both console and output file than tee?

What I need to display is a log that refreshes periodically: a block of about 10 lines of text. I'm using | tee and it works right now. However, the performance is unsatisfying. It waits a while and then outputs several blocks of text from multiple refreshes (especially when the program has just started, it takes quite a while before anything is displayed on the console; the first time I saw this, I thought the program was hanging). In addition, it breaks randomly in the middle of the last block, so it's quite ugly to present.
Is there a way to improve this? (Maybe output less each time and switch between output file and console more frequently?)
Solved by flushing stdout after printing each block. Credit to Kenneth L!
https://superuser.com/questions/889019/bash-better-way-to-output-to-both-console-and-output-file-than-tee
Assuming you can monitor the log as a file directly [update: turned out not to be the case]:
The usual way of monitoring a [log] file for new lines is to use tail -f, which - from what I can tell - prints new data added to the log file as it is being added, without buffering.
Similarly, tee passes the data it receives via stdin on to its outputs without buffering.
Thus, you should be able to combine the two:
tail -f logFile | tee newLogEntriesFile
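If changing the program to flush its stdout after each block is not an option, a hedged alternative for the piped case (assuming the program relies on stdio's default buffering; stdbuf has no effect on programs that set their own buffering or are setuid) is to force line buffering before tee:
stdbuf -oL program | tee logFile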

Is it possible to display the progress of a sort in linux?

My job involves a lot of sorting fields from very large files. I usually do this with the sort command in bash. Unfortunately, when I start a sort I am never really sure how long it is going to take. Should I wait a second for the results to appear, or should I start working on something else while it runs?
Is there any possible way to get an idea of how far along a sort has progressed or how fast it is working?
$ cut -d , -f 3 VERY_BIG_FILE | sort -du > output
No, GNU sort does not do progress reporting.
However, if you are using sort just to remove duplicates and you don't actually care about the ordering, then there's a more scalable way of doing that:
awk '! a[$0]++'
This writes out the first occurrence of a line as soon as it's been seen, which can give you an idea of the progress.
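Applied to the pipeline from the question, a sketch (assuming the ordering of the output really doesn't matter):
$ cut -d , -f 3 VERY_BIG_FILE | awk '! a[$0]++' > output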
You might want to give pv a try; it should give you a pretty good idea of what is going on in your pipe in terms of throughput.
Example (untested) injecting pv before and after the sort command to get an idea of the throughput:
$ cut -d , -f 3 VERY_BIG_FILE | pv -cN cut | sort -du | pv -cN sort > output
EDIT: I missed the -u in your sort command, so calculating lines first to be able to get a percentage output is void. Removed that part from my answer.
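Since pv can also read the input file itself, and then knows the total size, putting it first gives an actual percentage and ETA for the read stage (a sketch; note that sort buffers everything before writing, so this tracks how much input has been consumed, not when the sorted output will appear):
$ pv VERY_BIG_FILE | cut -d , -f 3 | sort -du > output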
You can execute your sort in the background; you will get your prompt back and can do other jobs:
$ sort ...... &   # & means run in background
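A small usage sketch (assuming an interactive shell such as bash; input and output are placeholder file names): start the sort in the background, keep working, and collect the result when you are ready:
$ sort -du input > output &
$ jobs        # list background jobs and their status
$ wait %1     # block until job 1 (the sort) finishes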

File output redirection in Linux

I have two programs, A and B. I can't change program A; I can only run it with some parameters. But I have written B myself, and I can modify it any way I like.
Program A runs for a long time (20-40 hours), and during that time it produces output to a file, so the file's size increases constantly and can be huge at the end of the run (100-200 GB or so). Program B then reads the file and calculates some stuff. A special property of the file is that its contents are not correlated: I can divide the file in half and run the calculations on each part independently, so I don't need to store all the data at once. I can calculate on the first part, throw it away, calculate on the second one, and so on.
The problem is that I don't have enough space to store such big files. I wonder if it is possible to somehow pipe the output of A into B without storing all the data at once and without creating huge files. Is it possible to do something like that?
Thank you in advance, this is crucial for me now, Roman.
If program A supports it, simply pipe.
A | B
Otherwise, use a fifo.
mkfifo /tmp/fifo
ls -la > /tmp/fifo &
cat /tmp/fifo
EDIT: You can check the pipe buffer size with ulimit -p, and then:
cat /tmp/fifo | B
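A sketch of the fifo approach applied to the two programs, assuming A's output path can be pointed at the fifo via one of its parameters (the path /tmp/a_output and the --output flag are hypothetical placeholders):
mkfifo /tmp/a_output
A --output /tmp/a_output &   # hypothetical flag; use whatever parameter tells A where to write
B < /tmp/a_output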
It is possible to pipe the output of one program into another.
Read here for the syntax and know-how of Unix pipelines.
You can use socat, which can take stdout and feed it to the network, and take data from the network and feed it to stdin (a sketch follows below).
Named or unnamed pipes have the problem of a small buffer (historically 4 KiB; 64 KiB on modern Linux), which means a lot of process context switches if you are writing multiple gigabytes.
Or, if you are adventurous enough, you can LD_PRELOAD a shared object into process A and trap the open/write calls to do whatever you need.
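A hedged sketch of the socat idea, streaming A's output to B over TCP (the port 9999 and the host name receiver-host are placeholders):
socat TCP-LISTEN:9999,reuseaddr - | B      # run where B should consume the data
A | socat - TCP:receiver-host:9999         # run where A produces the data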

Resources