unix: how does a "./process | sort" work?

To debug some map/reduce jobs I often test them using a simple unix command that basically reads
cat data/* | mapper | sort | reduce > out
Now everything works just fine, but I'm wondering what really happens in the mapper | sort part of the pipeline.
More precisely:
Does anyone know how sort loads the RAM/CPU?
Is the sort command sorting data on the fly, or does it wait for the map job to finish? (Note that the mapper writes to STDOUT and does not wait for the end of the computation to output data.)
With quite a large amount of input data, the RAM does not seem to get loaded as much as I would expect (I rather observe peaks of CPU, but I'm not measuring this very precisely). Is it possible for the process to use less RAM than the amount of data it outputs?
Thanks for your answers :)

In Linux, sort uses a merge sort algorithm (see http://en.wikipedia.org/wiki/Sort_(Unix)). A merge sort can spill intermediate runs to temporary files on disk, and sort does exactly that, so the process only needs a bounded amount of RAM (you can control how much via the --buffer-size option). Note that sort can read and pre-sort chunks while the mapper is still running, but it cannot emit its first line of output until it has read the last line of its input.
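For reference, here is a sketch of how you might cap sort's memory and point its spill files at a fast disk (these are GNU sort options; mapper and reduce are the commands from the question):

# Limit sort to roughly 1 GiB of RAM and spill temporary merge runs to /tmp.
# --buffer-size (-S) and --temporary-directory (-T) are GNU coreutils sort options.
cat data/* | mapper | sort --buffer-size=1G --temporary-directory=/tmp | reduce > out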

Related

Why are file systems so much slower than a database?

I have a lot of files on my computer (who doesn't?).
They are split between hard drives.
I realized a long time ago that find takes a whole lot of time to scan a whole hard disk: minutes, and for all drives it might take over an hour.
That is why I got used to running du -ba / >> ~/du."$(date +%F)" on a regular basis. Then I would just grep 'WHATEVER' ~/du.* | sed 's#^ \+[0-9]\+ ##' | xargs -d\\n command
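Spelled out a bit, the same workflow could look like this (ls -ld stands in for whatever command you want to run on the matches, and cut -f2- replaces the sed since du's output is "size<TAB>path"):

# Snapshot every path on the root file system, tagged by date.
du -ba / >> ~/du."$(date +%F)"
# Later: look paths up in the snapshots instead of walking the disks again.
grep 'WHATEVER' ~/du.* | cut -f2- | xargs -d'\n' ls -ld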
I understand why that is faster than find.
Now I have set up a MySQL database that holds a complete, refreshable index of all files. Directories form a simple tree, with just a foreign key pointing to the parent row (or whatever you call a foreign key that references not a foreign table but its own table's primary key in another row).
Although it is just as complex, it is still much faster than going through the file systems.
Why is that? Am I missing some tools that could search the TOC faster than the normal POSIX calls to the kernel?
How long should it take to print all files of a hard drive to stdout, without a DB or text-file cache?

Why is vim search much slower than "cat fileName | grep targetText"?

I have a 1.4 GB text file named test.txt and I want to search a string inside the file.
I'd like to know why the vim search (vim test.txt, then typing /targetText to search for the string) performs much slower than cat test.txt | grep targetText.
On my machine, the vim search takes several minutes to complete, while cat test.txt | grep targetText takes only a few seconds.
Vim is an editor. It will try to load the file into memory so that you can edit it. Vim can edit huge files, but it is not optimized for that.
On the other hand, cat and grep do not need to hold the whole file in memory.
BTW, you can just do grep targetText test.txt without using cat.
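A quick way to compare the two yourself (using the file from the question):

# Search the file directly; the extra cat process is unnecessary.
time grep targetText test.txt
time cat test.txt | grep targetText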
If targetText is short, the delay is probably caused by the numerous loads from disk that are necessary to search through the whole text. We should note that vim is an interactive tool and is not designed for fast processing of gigabytes. Of course, if we knew in advance that the match lies many, many megabytes downstream from the current screen, we could read huge pieces from disk and get there quickly. But in real life Vim doesn't know how much data is worth reading at once, because if we expect the pattern to be found within a short distance, say three lines below (agreed, that is the much more common situation), then we have absolutely no reason to read huge amounts of data from disk; it would be a useless waste of time and bandwidth. Since Vim doesn't know a priori how much data to read at once, it uses a trade-off, which happens not to be optimal in your case.
On the other side, a "cat | ..." pipeline happily operates on very large chunks of data, limited only by the memory available to the process (ideally, once it has located the file, it reads data non-stop and feeds it into the pipeline), because cat "knows" that the whole file content is needed and there is no reason to read it in small pages.
Thus, although the vim search and the cat | grep pipeline pull the same amount of data off the disk, the pipeline performs far fewer disk seeks, which results in a dramatic efficiency increase.
If a prefix of our pattern is very frequent in the file being scanned, we may also see an efficiency advantage from grep's search technique, which is based on the Aho–Corasick string matching algorithm.

How to make atop exclude the statistics since boot?

I have a Linux box whose resource utilization I need to monitor every hour. By resources I mean mainly CPU, memory and network. I am using atop for CPU and memory, and nethogs for network utilization monitoring. I am thinking of redirecting the reports to text files and sending them to my email, but the initial startup screen of atop shows all statistics since boot, which makes the text look messy. So, is there a way to make atop skip the initial statistics?
I would suggest using something other than atop. There are many other tools like top, free -m, etc. for your CPU, memory and network statistics. The only disadvantage is that you would have to run them independently.
Landed on your question as I was looking for just that. SeaLion actually works well for this purpose, and you wouldn't need to store the data in files either. It's all presented on a timeline, so you can just "Jump to" whenever you want to check your data. You don't even have to record the data manually.
I suppose this is all you need.
Having the same problem right now, I came up with
atop -PCPU,NET,DSK | sed -n -e '/SEP/,$p'
The -P... option instructs atop to show only the requested information, so roll your own list. The important bit is the sed, which skips lines until the first line containing SEP is found; this effectively skips over the first block of data containing the summary since boot time.
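If you then want the hourly e-mailed report described in the question, a crontab entry along these lines could work (the 60-second sample length, mailx and the recipient address are only illustrative assumptions):

# Every hour: take one 60-second atop sample in parseable form, drop the
# since-boot block with sed, and mail the result.
0 * * * * atop -PCPU,NET,DSK 60 2 | sed -n -e '/SEP/,$p' | mailx -s "hourly atop report" admin@example.com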
I am not sure, but I think you can't, because atop produces statistics over some interval. On the initial run there is no previous point, so atop produces stats from boot up to the current point. But you can easily use, for example, awk to parse the output:
atop 1 2 | awk '/ seconds elapsed$/ {output=1;} {if (output) print}'
This is the simplest way to solve the problem with atop, but there are tons of other tools that are probably better suited for this job.

Linux: get amount of memory swapped in/out over a time period

Is there an (easy?) way to get the amount of data moved to/from swap over a certain time? Maybe either integrated over all processes and time, or integrated over specific processes and time?
Story: I have a machine which tends to swap. However, I do not know if swap is 'actively' used, i.e. whether it is constantly swapping, or whether, say, just the shared libraries that are not really used get swapped out after some time while the 'active' memory usage ends up happening in RAM.
Thus, I am looking for a way to reassure myself that the swap usage may not be serious...
Cheers and thanks for ideas,
Thomas
This can be done relatively easily via SystemTap (if you know the kernel MM subsystem).
You need to know the names of the functions which do swap-in/swap-out, create corresponding probes, and keep two counters that are incremented from the probes. Finally, you need a timer which fires every N seconds, dumps the current counters and resets them.
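A rough sketch of what that could look like (assuming a kernel where the swap paths are still called swap_readpage and swap_writepage; the function names differ between kernel versions, so check yours):

# Count calls to the (assumed) swap-in/swap-out kernel functions and
# dump/reset the counters every 10 seconds.
stap -e '
global swapin, swapout
probe kernel.function("swap_readpage")  { swapin++ }
probe kernel.function("swap_writepage") { swapout++ }
probe timer.s(10) {
    printf("pages swapped in: %d, swapped out: %d\n", swapin, swapout)
    swapin = 0
    swapout = 0
}
'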
Here is my temporary solution to get the overall number of pages swapped in/out between two calls, using vmstat:
#!/bin/sh
# Reports the number of pages swapped in/out since the previous call.
# Meant to be sourced, so that SWAPPEDIN/SWAPPEDOUT survive between calls.
OLDSWAPPEDIN=${SWAPPEDIN:-0}
OLDSWAPPEDOUT=${SWAPPEDOUT:-0}
# "vmstat -s" prints "... pages swapped in" and "... pages swapped out";
# the unquoted echo joins the two lines, so the counts are fields 1 and 5.
PAGEINOUT=$(vmstat -s | grep swapped)
SWAPPEDIN=`echo $PAGEINOUT | awk '{print $1}'`
SWAPPEDOUT=`echo $PAGEINOUT | awk '{print $5}'`
SWAPPEDINDIFF=`expr $SWAPPEDIN - $OLDSWAPPEDIN`
SWAPPEDOUTDIFF=`expr $SWAPPEDOUT - $OLDSWAPPEDOUT`
I tried to avoid temporary files for storing the variables (so either sourcing it, or creating the variables at login, would be necessary).
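One possible way to use it, assuming the snippet above is saved as swapdiff.sh (the file name is just an example):

# Source it once to initialise the counters, then again later to get the delta.
. ./swapdiff.sh
sleep 600
. ./swapdiff.sh
echo "pages swapped in since last call:  $SWAPPEDINDIFF"
echo "pages swapped out since last call: $SWAPPEDOUTDIFF"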

Bash Pipe Handling

Does anyone know how bash handles sending data through pipes?
cat file.txt | tail -20
Does this command print all the contents of file.txt into a buffer, which is then read by tail? Or does this command, say, print the contents of file.txt line by line, and then pause at each line for tail to process, and then ask for more data?
The reason I ask is that I'm writing a program on an embedded device that basically performs a sequence of operations on some chunk of data, where the output of one operation is sent off as the input of the next operation. I would like to know how Linux (bash) handles this, so please give me a general answer, not specifically what happens when I run "cat file.txt | tail -20".
EDIT: Shog9 pointed out a relevant Wikipedia article; it didn't lead me directly to the answer, but it helped me find this: http://en.wikipedia.org/wiki/Pipeline_%28Unix%29#Implementation which did have the information I was looking for.
I'm sorry for not making myself clear. Of course you're using a pipe and of course you're using stdin and stdout of the respective parts of the command. I had assumed that was too obvious to state.
What I'm asking is how this is handled/implemented. Since both programs cannot run at once, how is data sent from one program's stdout to the other's stdin? What happens if the first program generates data significantly faster than the second program consumes it? Does the system just run the first command until either it has terminated or its stdout buffer is full, then move on to the next program, and so on in a loop until no more data is left to process, or is there a more complicated mechanism?
I decided to write a slightly more detailed explanation.
The "magic" here lies in the operating system. Both programs do start up at roughly the same time, and run at the same time (the operating system assigns them slices of time on the processor to run) as every other simultaneously running process on your computer (including the terminal application and the kernel). So, before any data gets passed, the processes are doing whatever initialization necessary. In your example, tail is parsing the '-20' argument and cat is parsing the 'file.txt' argument and opening the file. At some point tail will get to the point where it needs input and it will tell the operating system that it is waiting for input. At some other point (either before or after, it doesn't matter) cat will start passing data to the operating system using stdout. This goes into a buffer in the operating system. The next time tail gets a time slice on the processor after some data has been put into the buffer by cat, it will retrieve some amount of that data (or all of it) which leaves the buffer on the operating system. When the buffer is empty, at some point tail will have to wait for cat to output more data. If cat is outputting data much faster than tail is handling it, the buffer will expand. cat will eventually be done outputting data, but tail will still be processing, so cat will close and tail will process all remaining data in the buffer. The operating system will signal tail when their is no more incoming data with an EOF. Tail will process the remaining data. In this case, tail is probably just receiving all the data into a circular buffer of 20 lines, and when it is signalled by the operating system that there is no more incoming data, it then dumps the last twenty lines to its own stdout, which just gets displayed in the terminal. Since tail is a much simpler program than cat, it will likely spend most of the time waiting for cat to put data into the buffer.
On a system with multiple processors, the two programs will not just be sharing alternating time slices on the same processor core, but likely running at the same time on separate cores.
To get into a little more detail: if you open some kind of process monitor (operating-system specific) like 'top' in Linux, you will see a whole list of running processes, most of which are effectively using 0% of the processor. Most applications, unless they are crunching data, spend most of their time doing nothing. This is good, because it allows other processes to have unfettered access to the processor according to their needs. This is accomplished in basically three ways. A process can reach a sleep(n)-style instruction, where it basically tells the kernel to wait n milliseconds before giving it another time slice to work with. Most commonly, a program needs to wait for something from another program, like 'tail' waiting for more data to enter the buffer; in this case the operating system will wake up the process when more data is available. Lastly, the kernel can preempt a process in the middle of execution, giving some of its processor time slices to other processes.

'cat' and 'tail' are simple programs. In this example, tail spends most of its time waiting for more data in the buffer, and cat spends most of its time waiting for the operating system to retrieve data from the hard drive. The bottleneck is the speed (or slowness) of the physical medium that the file is stored on. The perceptible delay you might notice when you run this command for the first time is the time it takes for the read heads of the disk drive to seek to the position of 'file.txt' on the hard drive. If you run the command a second time, the operating system will likely have the contents of file.txt cached in memory, and you will not see any perceptible delay (unless file.txt is very large, or no longer cached).
Most operations you do on your computer are I/O bound, which is to say that you are usually waiting for data to come from your hard drive, from a network device, etc.
Shog9 already referenced the Wikipedia article, but the implementation section has the details you want. The basic implementation is a bounded buffer.
cat will just print the data to standard output, which happens to be redirected to the standard input of tail. This can be seen in the man page of bash.
In other words, there is no pausing going on; tail is just reading from standard input and cat is just writing to standard output.
