Memory usage when reading from STDIN - Linux

If I have a text file (textfile) with lines of text, and a Perl script (perlscript) of
#!/usr/bin/perl
use strict;
use warnings;
my $var;
$var .= $_ while (<>);
print $var;
And the following command is run in a terminal:
cat ./textfile | ./perlscript | ./perlscript | ./perlscript
If I run the above on a 1 KB text file then, other than the program stack etc., have I used 4 KB of memory? Or, once each script has pulled from STDIN, is that memory freed, so that I would only use 1 KB?
To word the question another way: is copying from STDIN into a variable effectively neutral in memory usage, or does it double memory consumption?

You've already got a good answer, but I wasn't satisfied with my guess, so I decided to test my assumptions.
I made a simple C++ program called streamstream that just takes STDIN and writes it to STDOUT in 1024-byte chunks. It looks like this:
#include <stdio.h>

int main()
{
    const int BUF_SIZE = 1024;
    unsigned char* buf = new unsigned char[BUF_SIZE];

    size_t read = fread(buf, 1, BUF_SIZE, stdin);
    while (read > 0)
    {
        fwrite(buf, 1, read, stdout);
        read = fread(buf, 1, BUF_SIZE, stdin);
    }

    delete[] buf;   // array delete to match new[]
}
To test how the program uses memory, I ran it with valgrind while piping the output from one to another as follows:
cat onetwoeightk | valgrind --tool=massif ./streamstream | valgrind --tool=massif ./streamstream | valgrind --tool=massif ./streamstream | hexdump
...where onetwoeightk is just a 128KB file of random bytes. Then I used the ms_print tool on the massif output to aid in interpretation. Obviously there is the overhead of the program itself and its heap, but it starts at about 80KB and never grows beyond that, because it's sipping STDIN just one kilobyte at a time.
The data is passed from process to process 1 kilobyte at a time. Our overall memory usage will peak at 1 kilobyte * the number of instances of the program handling the stream.
Now let's do what your Perl program is doing: I'll read the whole stream (growing my buffer each time) and then write it all to STDOUT. Then I'll check the valgrind output again.
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main()
{
    const int BUF_INCREMENT = 1024;
    unsigned char* inbuf = (unsigned char*)malloc(BUF_INCREMENT);
    unsigned char* buf = NULL;
    unsigned int bufsize = 0;

    size_t read = fread(inbuf, 1, BUF_INCREMENT, stdin);
    while (read > 0)
    {
        bufsize += read;
        buf = (unsigned char*)realloc(buf, bufsize);
        memcpy(buf + bufsize - read, inbuf, read);
        read = fread(inbuf, 1, BUF_INCREMENT, stdin);
    }

    fwrite(buf, 1, bufsize, stdout);
    free(inbuf);
    free(buf);
}
Unsurprisingly, memory usage climbs to over 128 kilobytes over the execution of the program.
KB
137.0^ :#
| ::#
| ::::#
| :#:::#
| :::#:::#
| :::::#:::#
| :#:::::#:::#
| :#:#:::::#:::#
| ::#:#:::::#:::#
| :#::#:#:::::#:::#
| :#:#::#:#:::::#:::#
| #::#:#::#:#:::::#:::#
| ::#::#:#::#:#:::::#:::#
| :#::#::#:#::#:#:::::#:::#
| #:#::#::#:#::#:#:::::#:::#
| ::#:#::#::#:#::#:#:::::#:::#
| :#::#:#::#::#:#::#:#:::::#:::#
| #::#::#:#::#::#:#::#:#:::::#:::#
| ::#::#::#:#::#::#:#::#:#:::::#:::#
| ::::#::#::#:#::#::#:#::#:#:::::#:::#
0 +----------------------------------------------------------------------->ki
0 210.9
But the question is, what is the total memory usage due to this approach? I can't find a good tool for measuring the memory footprint over time of a set of interacting processes. ps doesn't seem accurate enough here, even when I insert a bunch of sleeps. But we can work it out: the 128KB buffer is only freed at the end of program execution, after the stream is written. But while the stream is being written, another instance of the program builds its own 128KB buffer. So we know our memory usage will climb to 2x 128KB. But it won't rise to 3x or 4x 128KB by chaining more instances of our program, as our instances free their memory and close as soon as they are done writing to STDOUT.

More like 2 KB, but a 1 KB file isn't a very good example as your read buffer is probably bigger than that. Let's make the file 1 GB instead. Then your peak memory usage would probably be around 2 GB plus some overhead. cat uses negligible memory, just shuffling its input to its output. The first perl process has to read all of that input and store it in $var, using 1 GB (plus a little bit). Then it starts writing it to the second one, which will store it into its own private $var, also using 1 GB (plus a little bit), so we're up to 2 GB. When the first perl process finishes writing, it exits, which closes its stdout, causing the second perl process to get EOF on stdin, which is what makes the while (<>) loop terminate and the second perl process start writing. At this point the third perl process starts reading and storing into its own $var, using another 1 GB, but the first one is gone, so we're still in the neighborhood of 2 GB. Then the second perl process ends, the third starts writing to stdout, and then it exits itself.

Related

How to read stdout from a sub process in bash in real time

I have a simple C++ program that counts from 0 to 10 with an increment every 1 second. When the value is incremented, it is written to stdout. This program intentionally uses printf rather than std::cout.
I want to call this program from a bash script, and perform some function (eg echo) on the value when it is written to stdout.
However, my script waits for the program to terminate, and then processes all the values at the same time.
C++ prog:
#include <stdio.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    int ctr = 0;
    for (int i = 0; i < 10; ++i)
    {
        printf("%i\n", ctr++);
        sleep(1);
    }
    return 0;
}
Bash script:
#!/bin/bash
for c in $(./script-test)
do
echo $c
done
Is there another way to read the output of my program that will access it in real time, rather than waiting for the process to terminate?
Note: the C++ program is a demo sample - the actual program I am using also uses printf, but I am not able to make changes to this code, hence the solution needs to be in the bash script.
Many thanks,
Stuart
As you correctly observed, $(command) waits for the entire output of command, splits that output, and only after that does the for loop start.
To read output as soon as it is available, use while read:
./script-test | while IFS= read -r line; do
    echo "do stuff with $line"
done
or, if you need to access variables from inside the loop afterwards, and your system supports <():
while IFS= read -r line; do
    echo "do stuff with $line"
done < <(./script-test)
# do more stuff that depends on variables set inside the loop
You might have more luck using a pipe:
#!/bin/bash
./script-test | while IFS= read -r c; do
    echo "$c"
done

How does Shell implement pipe programmatically?

I understand how I/O redirection works in Unix/Linux, and I know the shell uses this feature to pipeline programs with a special type of file - an anonymous pipe. But I'd like to know the details of how the shell implements it programmatically. I'm interested not only in the system calls involved, but also in the whole picture.
For example, with ls | sort, how does the shell perform I/O redirection for ls and sort?
The whole picture is complex and the best way to understand is to study a small shell. For a limited picture, here goes. Before doing anything, the shell parses the whole command line so it knows exactly how to chain processes. Let's say it encounters proc1 | proc2.
It sets up a pipe. Long story short, whatever is written into thepipe[1] can be read back out of thepipe[0]:
int thepipe[2];
pipe(thepipe);
It forks the first process and changes the direction of its stdout before exec
dup2 (thepipe[1], STDOUT_FILENO);
It execs the new program which is blissfully unaware of redirections and just writes to stdout like a well-behaved process
It forks the second process and changes the source of its stdin before exec
dup2 (thepipe[0], STDIN_FILENO);
It execs the new program, which is unaware its input comes from another program
Like I said, this is a limited picture. In a real picture the shell daisy-chains these in a loop and also remembers to close pipe ends at opportune moments.
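To make the steps above concrete, here is a minimal sketch (not taken from any real shell; error handling mostly omitted, and it assumes ls and sort can be found on PATH) of how a shell could wire up ls | sort:
#include <stdio.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/wait.h>

int main(void)
{
    int thepipe[2];
    if (pipe(thepipe) == -1) { perror("pipe"); return 1; }

    pid_t left = fork();
    if (left == 0) {                      /* child 1: ls */
        dup2(thepipe[1], STDOUT_FILENO);  /* stdout now points at the write end */
        close(thepipe[0]);
        close(thepipe[1]);
        execlp("ls", "ls", (char *)NULL);
        perror("execlp ls");              /* only reached if exec fails */
        _exit(127);
    }

    pid_t right = fork();
    if (right == 0) {                     /* child 2: sort */
        dup2(thepipe[0], STDIN_FILENO);   /* stdin now points at the read end */
        close(thepipe[0]);
        close(thepipe[1]);
        execlp("sort", "sort", (char *)NULL);
        perror("execlp sort");
        _exit(127);
    }

    /* Parent: close both ends so the reader sees EOF, then reap the children. */
    close(thepipe[0]);
    close(thepipe[1]);
    waitpid(left, NULL, 0);
    waitpid(right, NULL, 0);
    return 0;
}
The parent closing both ends matters: sort only sees EOF (and therefore starts emitting its sorted output) once every copy of the write end has been closed.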
This is a sample program from the book Operating System Concepts by Silberschatz.
The program is self-explanatory if you know the concepts of fork() and related calls. Hope this helps! (If you still want an explanation, I can explain it!)
Obviously some changes (in what is done after fork(), etc.) should be made to this program if you want to make it work like
ls | sort
#include <unistd.h>
#include <stdio.h>
#include <string.h>
#include <sys/types.h>

#define BUFFER_SIZE 25
#define READ_END 0
#define WRITE_END 1

int main(void)
{
    char write_msg[BUFFER_SIZE] = "Greetings";
    char read_msg[BUFFER_SIZE];
    int fd[2];
    pid_t pid;

    /* create the pipe */
    if (pipe(fd) == -1) {
        fprintf(stderr, "Pipe failed");
        return 1;
    }

    /* fork a child process */
    pid = fork();
    if (pid < 0) { /* error occurred */
        fprintf(stderr, "Fork Failed");
        return 1;
    }

    if (pid > 0) { /* parent process */
        /* close the unused end of the pipe */
        close(fd[READ_END]);
        /* write to the pipe */
        write(fd[WRITE_END], write_msg, strlen(write_msg) + 1);
        /* close the write end of the pipe */
        close(fd[WRITE_END]);
    }
    else { /* child process */
        /* close the unused end of the pipe */
        close(fd[WRITE_END]);
        /* read from the pipe */
        read(fd[READ_END], read_msg, BUFFER_SIZE);
        printf("read %s", read_msg);
        /* close the read end of the pipe */
        close(fd[READ_END]);
    }
    return 0;
}

"quick select" (or similar) implementation on Linux? (instead of sort|uniq -c|sort -rn|head -$N)

PROBLEM: I frequently need to see which "patterns" repeat most often within the last day of specific logs. Like for a small subset of Tomcat logs here:
GET /app1/public/pkg_e/v3/555413242345562/account/stats 401 954 5
GET /app1/public/pkg_e/v3/555412562561928/account/stats 200 954 97
GET /app1/secure/pkg_e/v3/555416251626403/ex/items/ 200 517 18
GET /app1/secure/pkg_e/v3/555412564516032/ex/cycle/items 200 32839 50
DELETE /app1/internal/pkg_e/v3/accounts/555411543532089/devices/bbbbbbbb-cccc-2000-dddd-43a8eabcdaa0 404 - 1
GET /app1/secure/pkg_e/v3/555412465246556/sessions 200 947 40
GET /app1/public/pkg_e/v3/555416264256223/account/stats 401 954 4
GET /app2/provisioning/v3/555412562561928/devices 200 1643 65
...
If I wish to find out the most-frequently-used URLs (along with method and retcode) - I'll do:
[root@srv112:~]$ N=6;cat test|awk '{print $1" "$2" ("$3")"}'\
|sed 's/[0-9a-f-]\+ (/%GUID% (/;s/\/[0-9]\{4,\}\//\/%USERNAME%\//'\
|sort|uniq -c|sort -rn|head -$N
4 GET /app1/public/pkg_e/v3/%USERNAME%/account/stats (401)
2 GET /app1/secure/pkg_e/v3/%USERNAME%/devices (200)
2 GET /app1/public/pkg_e/v3/%USERNAME%/account/stats (200)
2 DELETE /app1/internal/pkg_e/v3/accounts/%USERNAME%/devices/%GUID% (404)
1 POST /app2/servlet/handler (200)
1 POST /app1/servlet/handler (200)
If I wish to find out the most frequent usernames from the same file - I'll do:
[root@srv112:~]$ N=4;cat test|grep -Po '(?<=\/)[0-9]{4,}(?=\/)'\
|sort|uniq -c|sort -rn|head -$N
9 555412562561928
2 555411543532089
1 555417257243373
1 555416264256223
The above works quite fine on small data sets, but for larger sets of input the performance (complexity) of sort|uniq -c|sort -rn|head -$N becomes unbearable (we're talking about ~100 servers, ~250 log files per server, ~1 million lines per log file).
ATTEMPT TO SOLVE: the |sort|uniq -c part can easily be replaced with an awk one-liner, turning it into:
|awk '{S[$0]+=1}END{for(i in S)print S[i]"\t"i}'|sort -rn|head -$N
but I failed to find a standard/simple and memory-efficient implementation of the "quickselect" algorithm (discussed here) to optimize the |sort -rn|head -$N part.
I was looking for GNU binaries, RPMs, awk one-liners or some easily compilable ANSI C code which I could carry/spread across datacenters, to turn:
3 tasty oranges
225 magic balls
17 happy dolls
15 misty clouds
93 juicy melons
55 rusty ideas
...
into (given N=3):
225 magic balls
93 juicy melons
55 rusty ideas
I could probably grab sample Java code and port it to the above stdin format (by the way, I was surprised by the lack of .quickselect(...) in core Java), but the need to deploy a Java runtime everywhere isn't appealing.
I could maybe grab a sample (array-based) C snippet too, adapt it to the above stdin format, then test and fix leaks etc. for a while. Or even implement it from scratch in awk.
BUT(!) this simple need is likely faced by more than 1% of people on a regular basis, so there should be a standard (pre-tested) implementation of it out there??
Hopefully I'm just using the wrong keywords to look it up...
OTHER OBSTACLES: I also faced a couple of issues working around this for large data sets:
The log files are located on NFS-mounted volumes of ~100 servers, so it made sense to parallelize and split the work into smaller chunks.
The awk '{S[$0]+=1}... above requires memory - I'm seeing it die whenever it eats up 16GB (despite having 48GB of free RAM and plenty of swap... maybe some Linux limit I overlooked).
My current solution (still in progress) is not reliable and not optimal; it looks like this:
find /logs/mount/srv*/tomcat/2013-09-24/ -type f -name "*_22:*"|\
# TODO: reorder 'find' output to round-robin through srv1 srv2 ...
# to help 'parallel' work with multiple servers at once
parallel -P20 $"zgrep -Po '[my pattern-grep regexp]' {}\
|awk '{S[\$0]+=1}
END{for(i in S)if(S[i]>4)print \"count: \"S[i]\"\\n\"i}'"|\
# I throw away patterns met less than 5 times per log file
# in hope those won't pop on top of result list anyway - bogus
# but helps to address 16GB-mem problem for 'awk' below
awk '{if("count:"==$1){C=$2}else{S[$0]+=C}}
END{for(i in S)if(S[i]>99)print S[i]"\t"i}'|\
# I also skip all patterns which are met less than 100 times
# the hope that these won't be on top of the list is quite reliable
sort -rn|head -$N
# above line is the inefficient one I strive to address
I'm not sure if writing your own little tool is acceptable to you, but you can easily write a small tool to replace the |sort|uniq -c|sort -rn|head -$N part with |sort|quickselect $N. The benefit of the tool is that it reads the output from the first sort only once, line by line, without keeping much data in memory. In fact, it only needs memory to hold the current line and the top $N lines, which are then printed.
Here's the source quickselect.cpp:
#include <iostream>
#include <string>
#include <map>
#include <cstdlib>
#include <cassert>

typedef std::multimap< std::size_t, std::string, std::greater< std::size_t > > winner_t;
winner_t winner;
std::size_t max;

// Keep only the `max` lines with the highest counts.
void insert( int count, const std::string& line )
{
    winner.insert( winner_t::value_type( count, line ) );
    if( winner.size() > max )
        winner.erase( --winner.end() );
}

int main( int argc, char** argv )
{
    assert( argc == 2 );
    max = std::atol( argv[1] );
    assert( max > 0 );

    std::string current, last;
    std::size_t count = 0;
    while( std::getline( std::cin, current ) ) {
        if( current != last ) {
            if( count ) insert( count, last );   // flush the previous group
            count = 1;
            last = current;
        }
        else ++count;
    }
    if( count ) insert( count, last );           // flush the final group

    for( winner_t::iterator it = winner.begin(); it != winner.end(); ++it )
        std::cout << it->first << " " << it->second << std::endl;
}
to be compiled with:
g++ -O3 quickselect.cpp -o quickselect
Yes, I do realize you were asking for out-of-the-box solutions, but I don't know anything that would be equally efficient. And the above is so simple, there is hardly any margin for errors (given you don't mess up the single numeric command line parameter :)

Detect if pid is zombie on Linux

We can detect whether a process is a zombie via the shell command line:
ps ef -o pid,stat | grep <pid> | grep Z
To get that info in our C/C++ programs we use popen(), but we would like to avoid using popen(). Is there a way to get the same result without spawning additional processes?
We are using Linux 2.6.32-279.5.2.el6.x86_64.
You need to use the proc(5) filesystem. Access to files inside it (e.g. /proc/1234/stat ...) is really fast (it does not involve any physical I/O).
You probably want the third field of /proc/1234/stat (which is readable by everyone, but you should read it sequentially, since it is unseekable). If that field is Z, then the process with pid 1234 is a zombie.
No need to fork a process (e.g. with popen or system); in C you might write:
pid_t somepid;
// put the process pid you are interested in into somepid
bool iszombie = false;

// open the /proc/*/stat file
char pbuf[32];
snprintf(pbuf, sizeof(pbuf), "/proc/%d/stat", (int) somepid);
FILE* fpstat = fopen(pbuf, "r");
if (!fpstat) { perror(pbuf); exit(EXIT_FAILURE); }
{
    int rpid = 0; char rcmd[32]; char rstatc = 0;
    fscanf(fpstat, "%d %30s %c", &rpid, rcmd, &rstatc);
    iszombie = (rstatc == 'Z');
}
fclose(fpstat);
Consider also procps and libproc; see also this answer.
(You could also read the second line of /proc/1234/status but this is probably harder to parse in C or C++ code)
BTW, I find that the stat file in /proc/ has a weird format: if your executable happens to contain both spaces and parentheses in its name (which is disgusting, but permitted), parsing the /proc/*/stat file becomes tricky.
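One way around that quirk, sketched below as an assumption rather than a library recipe: read the whole line, find the last ')' (only the comm field is parenthesised), and take the first non-space character after it as the state. The proc_state() helper here is hypothetical, not part of procps or any other library:
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Return the state character from /proc/<pid>/stat, or 0 on error.
   The comm field may contain spaces and parentheses, so instead of
   scanning fields left to right we skip past the LAST ')'. */
static char proc_state(int pid)
{
    char path[64], line[4096];
    snprintf(path, sizeof(path), "/proc/%d/stat", pid);
    FILE *f = fopen(path, "r");
    if (!f) return 0;
    if (!fgets(line, sizeof(line), f)) { fclose(f); return 0; }
    fclose(f);

    char *p = strrchr(line, ')');   /* end of "pid (comm...)" */
    if (!p) return 0;
    p++;
    while (*p == ' ') p++;
    return *p;                      /* 'Z' means zombie */
}

int main(int argc, char **argv)
{
    int pid = (argc > 1) ? atoi(argv[1]) : 1;
    char st = proc_state(pid);
    printf("pid %d state: %c (zombie: %s)\n", pid, st ? st : '?',
           st == 'Z' ? "yes" : "no");
    return 0;
}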

Is there a command to write random garbage bytes into a file?

I am now running some tests of my application against corrupted files, but I found it is hard to find test files.
So I'm wondering whether there are existing tools that can write random/garbage bytes into a file of some format.
Basically, I need this tool to:
It writes random garbage bytes into the file.
It does not need to know the format of the file; just writing random bytes is OK for me.
It is best to write at random positions of the target file.
Batch processing is also a bonus.
Thanks.
The /dev/urandom pseudo-device, along with dd, can do this for you:
dd if=/dev/urandom of=newfile bs=1M count=10
This will create a file newfile of size 10M.
The /dev/random device will often block if there is not sufficient randomness built up; urandom will not block. If you're using the randomness for crypto-grade stuff, you may want to steer clear of urandom. For anything else, it should be sufficient and most likely faster.
If you want to corrupt just bits of your file (not the whole file), you can simply use the C-style random functions. Just use rand() to figure out an offset and a length n, then call it n more times to grab random bytes to overwrite that part of your file with.
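If compiling a small helper is acceptable, a minimal sketch of that idea could look like the following. The corrupt() function is hypothetical (not an existing tool), does only coarse error handling, seeds rand() with the current time, and mangles up to 16 bytes in each file named on the command line:
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

/* Overwrite up to `maxlen` bytes at a random offset of `path` with garbage. */
static int corrupt(const char *path, long maxlen)
{
    FILE *f = fopen(path, "r+b");
    if (!f) { perror(path); return 1; }

    fseek(f, 0, SEEK_END);
    long size = ftell(f);
    long len = 1 + rand() % maxlen;
    if (len > size) len = size;            /* don't run past a short file */
    long pos = rand() % (size - len + 1);  /* random start offset */

    fseek(f, pos, SEEK_SET);
    for (long i = 0; i < len; i++)
        fputc(rand() % 256, f);

    fclose(f);
    printf("%s: corrupted %ld bytes at offset %ld\n", path, len, pos);
    return 0;
}

int main(int argc, char **argv)
{
    srand((unsigned) time(NULL));
    for (int i = 1; i < argc; i++)          /* batch mode: one file per argument */
        corrupt(argv[i], 16);
    return 0;
}
Batch processing then falls out for free, e.g. running it over *.dat under whatever name you compile it as.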
The following Perl script shows how this can be done (without having to worry about compiling C code):
use strict;
use warnings;
sub corrupt ($$$$) {
    # Get parameters, names should be self-explanatory.
    my $filespec = shift;
    my $mincount = shift;
    my $maxcount = shift;
    my $charset  = shift;

    # Work out position and size of corruption.
    my @fstat = stat ($filespec);
    my $size = $fstat[7];
    my $count = $mincount + int (rand ($maxcount + 1 - $mincount));
    my $pos = 0;
    if ($count >= $size) {
        $count = $size;
    } else {
        $pos = int (rand ($size - $count));
    }

    # Output for debugging purposes.
    my $last = $pos + $count - 1;
    print "'$filespec', $size bytes, corrupting $pos through $last\n";

    # Open file, seek to position, corrupt and close.
    open (my $fh, "+<$filespec") || die "Can't open $filespec: $!";
    seek ($fh, $pos, 0);
    while ($count-- > 0) {
        my $newval = substr ($charset, int (rand (length ($charset))), 1);
        print $fh $newval;
    }
    close ($fh);
}
# Test harness.
system ("echo =========="); #DEBUG
system ("cp base-testfile testfile"); #DEBUG
system ("cat testfile"); #DEBUG
system ("echo =========="); #DEBUG
corrupt ("testfile", 8, 16, "ABCDEFGHIJKLMNOPQRSTUVWXYZ ");
system ("echo =========="); #DEBUG
system ("cat testfile"); #DEBUG
system ("echo =========="); #DEBUG
It consists of the corrupt function that you call with a file name, minimum and maximum corruption size and a character set to draw the corruption from. The bit at the bottom is just unit testing code. Below is some sample output where you can see that a section of the file has been corrupted:
==========
this is a file with nothing in it except for lowercase
letters (and spaces and punctuation and newlines).
that will make it easy to detect corruptions from the
test program since the character range there is from
uppercase a through z.
i have to make it big enough so that the random stuff
will work nicely, which is why i am waffling on a bit.
==========
'testfile', 344 bytes, corrupting 122 through 135
==========
this is a file with nothing in it except for lowercase
letters (and spaces and punctuation and newlines).
that will make iFHCGZF VJ GZDYct corruptions from the
test program since the character range there is from
uppercase a through z.
i have to make it big enough so that the random stuff
will work nicely, which is why i am waffling on a bit.
==========
It's tested at a basic level, but you may find there are edge cases which need to be taken care of. Do with it what you will.
Just for completeness, here's another way to do it:
shred -s 10 - > my-file
This writes 10 random bytes to stdout and redirects them to a file. shred is usually used for destroying (safely overwriting) data, but it can be used to create new random files too.
So if you already have a file that you want to fill with random data, use this:
shred my-existing-file
You could read from /dev/random:
# generate a 50MB file named `random.stuff` filled with random stuff ...
dd if=/dev/random of=random.stuff bs=1000000 count=50
You can also specify the size in a human-readable way:
# generate just 2MB ...
dd if=/dev/random of=random.stuff bs=1M count=2
You can also use cat and head. Both are usually installed.
# write 1024 random bytes to my-file-to-override
cat /dev/urandom | head -c 1024 > my-file-to-override
