merge sorted files, with minimal buffering - linux

I have two log files whose lines are prefixed with a sortable timestamp.
I'd like to see them, in order, while the processes generating the log files are still running. This is a pretty faithful simulation of the situation:
slow() {
    # print stdout at 30bps
    exec pv -qL 30
}
timestamp() {
    # prefix stdin with a sortable timestamp
    exec tai64n
}
# Simulate two slowly-running batch jobs:
seq 000 099 | slow | timestamp > seq.1 &
seq1=$!
seq 100 199 | slow | timestamp > seq.2 &
seq2=$!
# I'd like to see the combined output of those two logs, in timestamp-sorted order
try1() {
    # this shows me the output as soon as it's available,
    # but it's badly interleaved and not necessarily in order
    tail -f seq.1 --pid=$seq1 &
    tail -f seq.2 --pid=$seq2 &
}
try2() {
    # this gives the correct output,
    # but outputs nothing till both jobs have stopped
    sort -sm <(tail -f seq.1 --pid=$seq1) <(tail -f seq.2 --pid=$seq2)
}
try2
wait

The solution using tee (to write the files while standard output still goes to the console) won't work, because tee introduces unwanted latency and doesn't solve the problem anyway. Similarly, I couldn't get solutions to work using tail -f -s 0.01 (which raises the polling rate to 100/s), and/or some kind of call like split --filter='sort -sm' to sort small batches.
I also don't have tai64n, so my test code actually used this functionally identical perl code:
tai64n() {
    perl -MTime::HiRes=time -pe '
        printf "\@4%015x%x%n", split(/\./,time), $c; print 0 x(25-$c) . " "'
}
After failing to solve this with sh and bash, I exercised my standard failover, perl:
slow() {
    # print stdout at 30bps
    pv -qL 30
}
tai64n_and_tee() {
    # prefix stdin with a sortable timestamp and copy to given file
    perl -MTime::HiRes=time -e '
        $_ = shift;
        open(TEE, "> $_") or die $!;
        while (<>) {
            $_ = sprintf("\@4%015x%x%n", split(/\./,time), $c) . 0 x(25-$c) . " $_";
            print TEE $_;
            print $_;
        }
    ' "$1"
}
# Simulate two slowly-running batch jobs:
seq 000 099 | slow | tai64n_and_tee seq.1 &
seq 100 199 | slow | tai64n_and_tee seq.2 &
wait
This was convenient for me because I had already used perl for the timestamp. I failed to do this with perl acting as tai64n and a separate perl call to act as tee, but it might work with the real tai64n.
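For reference, the untried variant with the real tai64n would presumably look like this (untested sketch; it assumes daemontools' tai64n is on the PATH):
seq 000 099 | slow | tai64n | tee seq.1 &
seq 100 199 | slow | tai64n | tee seq.2 &
wait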

Related

get full current crontab line, and create unique result email headers (from, subject, etc) per line

I have a giant crontab with many scripts on a linux machine. I need to be able to a) change the subject and/or from of cronjob result emails, because the default is unreadably long. b) Do so via a centralized solution. c) Only require minimal changes to the crontab itself.
For example this crontab line:
0 */3 * * * /path/to/script1 | /path/to/script2 | /path/to/script3
Creates this email subject:
Cron <cronuser@myserver> /path/to/script1 | /path/to/script2 | /path/to/script3
Which in my inbox gets cut off somewhere in the path to script1. (And many lines in the crontab are significantly longer.)
Options I've tried:
Piping to mail and setting subject etc per line (The -E preserves cron's default only-send-on-output behavior):
0 */3 * * * /path/to/script1 | /path/to/script2 | /path/to/script3 2>&1 | mail -E -s "test subject" -S from="Cron Script2 <cronuser@myserver.com>" recipient@myserver.com
This "works", but I want to centralize my changes in one place and minimize how much I add to each cron line, for readability.
Using shell : (noop) command, which shows up first in the subject (note that the space after the : is important!):
0 */3 * * * : Descriptive Words; /path/to/script1 | /path/to/script2 | /path/to/script3
Unfortunately there's still too much unnecessary text that cron puts before "Descriptive Words" on the subject line, so this is still unusable.
What I want to create:
Something generic, like this:
0 */3 * * * /path/to/script1 | /path/to/script2 | /path/to/script3 2>&1 | coolmailer.pl
coolmailer.pl would build the subject line by getting the commands on that cron line, stripping out the paths and arguments, and emailing me this (optionally only if any of the scripts failed):
SUBJECT: script1 | script2 | script3
FROM: Cron Script3 <noreply@myserver.com>
(actual results of the command /path/to/script1 | /path/to/script2 | /path/to/script3)
As a bonus, I'd also love the subject line to say whether any of the earlier commands (script1 or script2) failed.
This has turned out to be... way more complicated than I expected.
Challenges:
1. Find a way for a pipeline member (coolmailer) to know the other members of the pipeline.
There's an ingenious method for 1 using lsof at how-do-you-determine-the-actual-command-that-is-piping-into-you, but it also sometimes finds commands started by the scripts in my pipeline (i.e. if script1 forks processes or makes system calls, those show up too whenever they take long enough to complete). Ditto for the method using process groups at the same link.
2. Find a way for a pipeline member (coolmailer) to know the results of the other members of the pipeline. (I realize this may not be possible at all, but the lsof hack gives me hope.)
Any better way? Does the fact that I'm running from cron buy me anything? Part of me wants to combine the lsof strategy with grepping through crontab -l results, but that just seems too kludgy and prone to errors.
Caveats:
I can have changes made to my account, but I can't make changes that would affect all users. I.e. if there's a way to change cron's mailing format server-wide, that doesn't help.
I can't realistically update every script called to handle emailing its own results, even if that's probably the "right" way.
I know about the mail -s -E -S options, but would prefer to have a single place to change things. Also, I really want to find a way to get the pipeline.
Language used now for "coolmailer" is Perl, but I'll try anything
My first attempt:
(which works, except it often also shows other commands started inside my scripts, which means it doesn't work)
#!/usr/bin/perl -w
my $pgid=`ps -o pgid= -p $$`;
my $lsofout = `/usr/sbin/lsof -g $pgid`;
my @otherpids = `echo "$lsofout" | awk '\$5 == "1w" { print \$2 }'`;
my @longcmds;
my @shortcmds;
foreach my $pid (@otherpids) {
    chomp($pid);
    if (my $cmd = `ps -o cmd= -p $pid 2>/dev/null`) {
        chomp($cmd);
        push @longcmds, $cmd;
        next;
    }
}
my $cmdline = join (' | ',@longcmds);
foreach my $cmd (@longcmds) {
    $cmd =~ s/(\/\S+\/)(\S+)/$2/g;
    push @shortcmds, $cmd;
}
my $subj = join('|',@shortcmds);
print "SUBJ:$subj\n";
print "CMDLINE: $cmdline\n";
# and now do some mail stuff
And the final version, based on a suggestion by Jhnc:
#!/usr/bin/perl -w
# cronmgr.pl -- understand cron emails for once
# usage: 0 */3 * * * cronmgr.pl cd blah\; /path/to/script1 \| /path/to/script2 \| /path/to/script3
# note that ; | & in any cronmgr.pl line must be backslashed to run!
use strict;
use IPC::Cmd qw[can_run run run_forked];
my $CMDLINE = join(' ',@ARGV);
my( $success, $error_message, $full_buf, $stdout_buf, $stderr_buf ) =
    run( command => $CMDLINE, verbose => 0 ); # verbose = 0 means don't output normally, capture all output
my ($stdout, $stderr);
$stdout = join "", @$stdout_buf;
$stderr = join "", @$stderr_buf;
my $emailsubject;
if( $success ) {
    if ($stdout eq '' && $stderr eq '') { # if there's no output, don't send any email!
        exit;
    }
} else {
    print "CMD FAIL!\n$error_message\nSTDERR:\n$stderr";
    $emailsubject = "FAIL:$error_message";
}
# etc etc
(Edited for clarity re goals and why options attempted so far aren't sufficient.)
determining the pipeline
If you always invoke coolmailer.pl with a unique argument then you can simply grep it from your list of cronjobs:
#!/usr/bin/perl -wT
$ENV{PATH} = '/sensible:/path';
my ($pipeline) = grep /\|\s+$0\s+$ARGV[0]/, `crontab -l`;
$pipeline ||= "oops";
# ... mung $pipeline ...
# ... do mail stuff ...
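For example (the tag here is purely illustrative), the crontab entry might look like:
0 */3 * * * /path/to/script1 | /path/to/script2 | /path/to/script3 2>&1 | /path/to/coolmailer.pl nightly-report
and the grep above then finds that line by matching the pipe, the script path ($0) and the tag ($ARGV[0]).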
checking pipe failure
If you rewrite your cronjob entries from:
/path/to/script1 | /path/to/script2 | /path/to/script3 2>&1 | coolmailer.pl
to:
coolermailer /path/to/script1 \| /path/to/script2 \| /path/to/script3
then you could construct the pipeline manually and have control over pipe member status information. (This also gives you the pipeline directly, although you then have to construct it before it will run.)
For example, with a bash implementation, you might make use of eval and PIPESTATUS. With Perl, you might use results() from IPC::Run.
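A minimal bash sketch of the eval/PIPESTATUS idea (every name here is illustrative and the recipient address is a placeholder; it assumes the cron line passes the pipeline with the pipes backslashed, as above):
#!/bin/bash
# sketch: rebuild the pipeline from the arguments, run it, then inspect each member
pipeline="$*"                               # e.g. "/path/to/script1 | /path/to/script2"
tmp=$(mktemp)
eval "$pipeline" > "$tmp" 2>&1              # run the reconstructed pipeline
status=("${PIPESTATUS[@]}")                 # one exit code per pipe member
subject=$(printf '%s' "$pipeline" | sed 's![^| ]*/!!g')   # strip paths for the subject
for s in "${status[@]}"; do
    if [ "$s" -ne 0 ]; then subject="FAIL($s): $subject"; break; fi
done
if [ -s "$tmp" ]; then
    mail -s "$subject" recipient@example.com < "$tmp"     # mail only when there is output
fi
rm -f "$tmp"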

Using Net::OpenSSH tail the message file and grep

I am using Net::OpenSSH
my $ssh = Net::OpenSSH->new("$linux_machine_host")
Using the SSH object, a few commands are executed multiple times over N hours.
At times I need to look for any error messages, such as Timeout, in the /var/adm/messages file.
My suggestion:
$ssh->capture2("echo START >> /var/adm/messages");
$ssh->capture2("some command which will be run in background for n hours");
$ssh->capture2("echo END >> /var/adm/messages");
Then read all lines between START and END and grep for the required error message.
$ssh->capture2("grep -A 100000 "START" /var/adm/messages | grep -B 100000 END");`
Without writing START and END into the messages file, can I tail the /var/adm/messages file at some point and capture any new messages appearing afterwards?
Are there any Net::OpenSSH methods which would capture new lines and write them into a file?
You can read the messages file via SFTP (see Net::SFTP::Foreign):
# untested!
use Net::SFTP::Foreign::Constants qw(:flags);
...
my $sftp = $ssh->sftp;
# open the messages file creating it if it doesn't exist
# and move to the end:
my $fh = $sftp->open("/var/adm/messages",
                     SSH2_FXF_READ|SSH2_FXF_CREAT)
    or die $sftp->error;
seek($fh, 0, 2);
$ssh->capture2("some command which...");
# look for the size of /var/adm/messages now so that we
# can ignore any lines that may be appended while we are
# reading it:
my $end = (stat $fh)[7];
# and finally read any lines added since we opened it:
my @msg;
while (1) {
    my $pos = tell $fh;
    last if $pos < 0 or $pos >= $end;
    my $line = <$fh>;
    last unless defined $line;
    push @msg, $line;
}
Note that you are not taking into account that the messages file may be rotated. Handling that would require more convoluted approaches.

Bash: Split stdout from multiple concurrent commands into columns

I am running multiple commands in a bash script using single ampersands like so:
commandA & commandB & commandC
They each have their own stdout output but they are all mixed together and flood the console in an incoherent mess.
I'm wondering if there is an easy way to pipe their outputs into their own columns... using the column command or something similar, i.e. something like:
commandA | column -1 & commandB | column -2 & commandC | column -3
New to this kind of thing, but from initial digging it seems something like pr might be the ticket? or the column command...?
Regrettably answering my own question.
None of the supplied solutions were exactly what I was looking for. So I developed my own command line utility: multiview. Maybe others will benefit?
It works by piping processes' stdout/stderr to a command interface and then by launching a "viewer" to see their outputs in columns:
fooProcess | multiview -s & \
barProcess | multiview -s & \
bazProcess | multiview -s & \
multiview
This will display a neatly organized column view of their outputs. You can name each process as well by adding a string after the -s flag:
fooProcess | multiview -s "foo" & \
barProcess | multiview -s "bar" & \
bazProcess | multiview -s "baz" & \
multiview
There are a few other options, but that's the gist of it.
Hope this helps!
pr is a solution, but not a perfect one. Consider this, which uses process substitution (<(command) syntax):
pr -m -t <(while true; do echo 12; sleep 1; done) \
<(while true; do echo 34; sleep 2; done)
This produces a marching column of the following:
12 34
12 34
12 34
12 34
Though this trivially provides the output you want, the columns do not advance individually—they advance together when all files have provided the same output. This is tricky, because in theory the first column should produce twice as much output as the second one.
You may want to investigate invoking tmux or screen in a tiled mode to allow the columns to scroll separately. A terminal multiplexer will provide the necessary machinery to buffer output and scroll it independently, which is important when showing output side-by-side without allowing excessive output from commandB to scroll commandA and commandC off-screen. Remember that scrolling each column separately will require a lot of screen redrawing, and the only way to avoid screen redraws is to have all three columns produce output simultaneously.
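For example, a rough tmux one-liner along these lines (commandA/B/C being the placeholders from the question) gives each command its own pane that scrolls independently:
tmux new-session -d 'commandA' \; \
     split-window -h 'commandB' \; \
     split-window -h 'commandC' \; \
     select-layout even-horizontal \; \
     attach
Note that by default a pane closes when its command exits; enable the remain-on-exit option if you want the output to stick around afterwards.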
As a last-ditch solution, consider piping each output to a command that indents each column by a different number of characters (a sketch follows the sample output below):
this is something that commandA outputs and is
and here is something that commandB outputs
interleaved with the other output, but visually
you might have an easier time distinguishing one
here is something that commandC outputs
which is also interleaved with the others
from the other
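A minimal sketch of that indentation approach, using sed to give each stream its own offset (commandA/B/C are again the placeholders from the question; the widths are arbitrary):
commandA 2>&1 &
commandB 2>&1 | sed "s/^/$(printf '%40s' '')/" &
commandC 2>&1 | sed "s/^/$(printf '%80s' '')/" &
wait
Lines are still printed in arrival order, so the streams interleave exactly as in the example above; the offsets only make it easier to see which command produced which line.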
This script prints three stacked sections and a timer, each section containing the output from a single script.
Comment on anything you don't understand and I'll add explanations to my answer as needed.
Hope this helps :)
#!/bin/bash
#Script by jidder

count=0
Elapsed=0

control_c()
{
    tput rmcup
    rm tail.tmp
    rm tail2.tmp
    rm tail3.tmp
    stty sane
    exit 1
}

Draw()
{
    tput clear
    echo "SCRIPT 1 Elapsed time =$Elapsed seconds"
    echo "------------------------------------------------------------------------------------------------------------------------------------------------------"
    tail -n10 tail.tmp
    tput cup 25 0
    echo "Script 2 "
    echo "------------------------------------------------------------------------------------------------------------------------------------------------------"
    tail -n10 tail2.tmp
    tput cup 50 0
    echo "Script 3 "
    echo "------------------------------------------------------------------------------------------------------------------------------------------------------"
    tail -n10 tail3.tmp
}

Timer()
{
    if [[ $count -eq 10 ]]; then
        Draw
        ((Elapsed = Elapsed + 1))
        count=0
    fi
}

main()
{
    stty -icanon time 0 min 0
    tput smcup
    Draw
    count=0
    keypress=''
    MYSCRIPT1.sh > tail.tmp &
    MYSCRIPT2.sh > tail2.tmp &
    MYSCRIPT3.sh > tail3.tmp &
    while [ "$keypress" != "q" ]; do
        sleep 0.1
        read keypress
        (( count = count + 2 ))
        Timer
    done
    stty sane
    tput rmcup
    rm tail.tmp
    rm tail2.tmp
    rm tail3.tmp
    echo "Thanks for using this script."
    exit 0
}

# install the Ctrl-C handler before starting, so interrupts clean up properly
trap control_c SIGINT
main

SED command inside a loop

Hello: I have a lot of files called test-MR3000-1.nt to test-MR4000-1.nt, where the number in the name changes by 100 (i.e. I have 11 files),
$ ls test-MR*
test-MR3000-1.nt test-MR3300-1.nt test-MR3600-1.nt test-MR3900-1.nt
test-MR3100-1.nt test-MR3400-1.nt test-MR3700-1.nt test-MR4000-1.nt
test-MR3200-1.nt test-MR3500-1.nt test-MR3800-1.nt
and also a file called resonancia.kumac which in a couple of lines contains the string XXXX.
$ head resonancia.kumac
close 0
hist/delete 0
vect/delete *
h/file 1 test-MRXXXX-1.nt
sigma MR=XXXX
I want to execute a bash script which substitutes the string XXXX in that file with each of the numbers obtained from the command ls *MR* | cut -b 8-11.
I found a post with some suggestions and tried my own code:
for i in `ls *MR* | cut -b 8-11`; do
    sed -e "s/XXXX/$i/" resonancia.kumac >> proof.kumac
done
however, in the substitution the numbers come out surrounded by single quotes (e.g. '3000').
Q: What should I do to avoid the single quotes around the numbers? Thank you.
This is a reproducer for the environment described:
for ((i=3000; i<=4000; i+=100)); do
    touch test-MR${i}-1.nt
done
cat >resonancia.kumac <<'EOF'
close 0
hist/delete 0
vect/delete *
h/file 1 test-MRXXXX-1.nt
sigma MR=XXXX
EOF
This is a script which will run inside that environment:
content="$(<resonancia.kumac)"
for f in *MR*; do
    substring=${f:7:4}
    echo "${content//XXXX/$substring}"
done >proof.kumac
...and the output looks like so:
close 0
hist/delete 0
vect/delete *
h/file 1 test-MR3000-1.nt
sigma MR=3000
There are no quotes anywhere in this output; the problem described is not reproduced.
or if it could be perl:
#!/usr/bin/perl
@ls = glob('*MR*');
open (FILE, 'resonancia.kumac') || die("not good\n");
@cont = <FILE>;
$f = shift(@ls);
$f =~ /test-MR([0-9]*)-1\.nt/;
$nr = $1;
@out = ();
foreach $l (@cont){
    if($l =~ s/XXXX/$nr/){
        $f = shift(@ls);
        $f =~ /test-MR([0-9]*)-1\.nt/;
        $nr = $1;
    }
    push @out, $l;
}
close FILE;
open(FILE, '>resonancia.kumac') || die("not good\n");
print FILE @out;
That would replace the first XXXX with the number from the first filename, which seemed to be what the question asked before it was edited.

How to delete older contents of file that is being continuously written to?

I have a simulation running and expect it to go on for at least 10 more hours. I have directed the console output to a .txt file using
(binary) > out.txt
This out.txt is becoming huge. I do not need most of the contents of this file. How can I delete the older parts of this file without harming the writing process? The content that will be written towards the end of the simulation is what is important to me.
As Carl mentioned in the comments, you cannot really do this on an actively written log file. However, if the initial data is not relevant to you, you can do the following (though beware that you will lose all existing data):
> out.txt
For the future, you can use a utility called logrotate(8).
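As an illustration, a logrotate rule for this case could look roughly like the following (the path and thresholds are made up; copytruncate matters here because the simulation keeps the file open while writing):
# /etc/logrotate.d/simulation (illustrative)
/path/to/out.txt {
    size 100M
    rotate 3
    copytruncate
    missingok
}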
You could use tail to only store the end of the file:
# Say you want to save the last 100 lines
your_binary | tail -n 100 > out.txt
This assumes that the output ends at some point.
Saw your comments - the file is 10 GB now. Try using sed -i to reduce its size so that it will work with the other tools; if you want to completely erase it, use :> logfile.
Tools can only cope with a file as big as their buffer; otherwise the data has to be streamed. Something like split won't work on a 4 GB file (I don't know if they have made a code adjustment for this; it's been a long time since I had to work with a file that big).
Two suggestions:
1. There were a few methods I could think of, like using split, but almost all of them involve creating a separate, reduced file from the log and renaming it or redirecting to it. Use split to break the log into smaller logs (split -l 100 ...) and just redirect the program output to the most recent log found using ls -1. This seems to work fine.
2. I also tried a second method to truncate the top 10 lines of the same file:
Kaizen ~/shell_prac
$ cat zcntr.sh
## test truncate a log file
##set -xv
:> zcntr.log ;

## fxn
cntr_log()
{
    limit=$1 ;
    start=0 ;
    while [ $start -lt $limit ]
    do
        echo "count is $start" >> zcntr.log ;   ## generate a continuous log
        start=$(($start + 1));
        sleep 1;
        cnt=$(($start % 10)) ;
        if [ $cnt -eq 0 ]   ## check to truncate the top 10 lines using sed
        then
            echo "truncate at $start " >> zcntr.log ;
            sed -i "1,10d" zcntr.log ;
        fi
    done ;
}

## main cntrlr
echo "enter a limit" ;
read lmt ;
cntr_log $lmt ;
This seems to work. I tested it with a counter printing up to the value 25.
Output:
Kaizen ~/shell_prac
$ cat zcntr.log
count is 19
truncate at 20
count is 20
count is 21
count is 22
count is 23
count is 24
I think either of the two will help. Let me know if there is something else on your mind!
Truncate file with cat
cat /dev/null > out.txt
