How to swap the even and odd lines via bash script? - string

In the incoming string stream from the standard input, swap the even and odd lines.
I've tried to do it like this, but reading from file and $i -lt $a.count aren't working:
$a= gc test.txt
for($i=0;$i -lt $a.count;$i++)
{
if($i%2)
{
$a[$i-1]
}
else
{
$a[$i+1]
}
}
Please, help me to get this working

Suggesting one line awk script:
awk '!(NR%2){print$0;print r}NR%2{r=$0}' input.txt
awk script explanation
!(NR % 2){ # if row number divide by 2 witout reminder
print $0; # print current row
print evenRow; # print saved row
}
(NR % 2){ # if row number divided by 2 with reminder
evenRow = $0; # save current row in variable
}

There's already a good awk-based answer to the question here, and there are a few decent-looking sed and awk solutions in Swap every 2 lines in a file. However, the shell solution in Swap every 2 lines in a file is almost comically incompetent. The code below is an attempt at a functionally correct pure Bash solution. Bash is very slow, so it is practical to use only on small files (maybe up to 10 thousand lines).
#! /bin/bash -p
idx=0 lines=()
while IFS= read -r 'lines[idx]' || [[ -n ${lines[idx]} ]]; do
(( idx == 1 )) && printf '%s\n%s\n' "${lines[1]}" "${lines[0]}"
idx=$(( (idx+1)%2 ))
done
(( idx == 1 )) && printf '%s\n' "${lines[0]}"
The lines array is used to hold two consecutive lines.
IFS= prevents whitespace being stripped as lines are read.
The idx variable cycles through 0, 1, 0, 1, ... (idx=$(( (idx+1)%2 ))) so reading to lines[idx] cycles through putting input lines at indexes 0 and 1 in the lines array.
The read function returns non-zero status immediately if it encounters an unterminated final line in the input. That could cause the loop to terminate before processing the last line, thus losing it in the output. The || [[ -n ${lines[idx]} ]] avoids that by checking if the read actually read some input. (Fortunately, there's no such thing as an unterminated empty line at the end of a file.)
printf is used instead of echo to output the lines because echo doesn't work reliably for arbitrary strings. (For instance, a line containing just -n would get lost completely by echo "$line".) See Why is printf better than echo?.
The question doesn't say what to do if the input has an odd number of lines. This code ((( idx == 1 )) && printf '%s\n' "${lines[0]}") just passes the last line through, after the swapped lines. Other reasonable options would be to drop the last line or print it preceded by an extra blank line. The code can easily be modified to do one of those if desired.
The code is Shellcheck-clean.

Related

Extract orders and match to trades from two files

I have two attached files (orders1.txt and trades1.txt) I need to write a Bash script (possibly awk?) to extract orders and match them to trades.
The output should produce a report that prints comma separated values containing “ClientID, OrderID, Price, Volume”.
In addition to this for each client, I need to print the total volume and turnover (turnover is the subtotal of price * volume on each trade).
Can someone please help me with a bash script that will do the above using the attached files?
Any help would be greatly appreciated
orders1.txt
Entry Time, Client ID, Security ID, Order ID
25455410,DOLR,XGXUa,DOLR1435804437
25455410,XFKD,BUP3d,XFKD4746464646
25455413,QOXA,AIDl,QOXA7176202067
25455415,QOXA,IRUXb,QOXA6580494597
25455417,YXKH,OBWQs,YXKH4575139017
25455420,JBDX,BKNs,JBDX6760353333
25455428,DOLR,AOAb,DOLR9093170513
25455429,JBDX,QMP1Sh,JBDX2756804453
25455431,QOXA,QIP1Sh,QOXA6563975285
25455434,QOXA,XMUp,QOXA5569701531
25455437,XFKD,QLOJc,XFKD8793976660
25455438,YXKH,MRPp,YXKH2329856527
25455442,JBDX,YBPu,JBDX0100506066
25455450,QOXA,BUPYd,QOXA5832015401
25455451,QOXA,SIOQz,QOXA3909507967
25455451,DOLR,KID1Sh,DOLR2262067037
25455454,DOLR,JJHi,DOLR9923665017
25455461,YXKH,KBAPBa,YXKH2637373848
25455466,DOLR,EPYp,DOLR8639062962
25455468,DOLR,UQXKz,DOLR4349482234
25455474,JBDX,EFNs,JBDX7268036859
25455481,QOXA,XCB1Sh,QOXA4105943392
25455486,YXKH,XBAFp,YXKH0242733672
25455493,JBDX,BIF1Sh,JBDX2840241688
25455500,DOLR,QSOYp,DOLR6265839896
25455503,YXKH,IIYz,YXKH8505951163
25455504,YXKH,ZOIXp,YXKH2185348861
25455513,YXKH,MBOOp,YXKH4095442568
25455515,JBDX,P35p,JBDX9945514579
25455524,QOXA,YXOKz,QOXA1900595629
25455528,JBDX,XEQl,JBDX0126452783
25455528,XFKD,FJJMp,XFKD4392227425
25455535,QOXA,EZIp,QOXA4277118682
25455543,QOXA,YBPFa,QOXA6510879584
25455551,JBDX,EAMp,JBDX8924251479
25455552,QOXA,JXIQp,QOXA4360008399
25455554,DOLR,LISXPh,DOLR1853653280
25455557,XFKD,LOX14p,XFKD1759342196
25455558,JBDX,YXYb,JBDX8177118129
25455567,YXKH,MZQKl,YXKH6485420018
25455569,JBDX,ZPIMz,JBDX2010952336
25455573,JBDX,COPe,JBDX1612537068
25455582,JBDX,HFKAp,JBDX2409813753
25455589,QOXA,XFKm,QOXA9692126523
25455593,XFKD,OFYp,XFKD8556940415
25455601,XFKD,FKQLb,XFKD4861992028
25455606,JBDX,RIASp,JBDX0262502677
25455608,DOLR,HRKKz,DOLR1739013513
25455615,DOLR,ZZXp,DOLR6727725911
25455623,JBDX,CKQPp,JBDX2587184235
25455630,YXKH,ZLQQp,YXKH6492126889
25455632,QOXA,ORPz,QOXA3594333316
25455640,XFKD,HPIXSh,XFKD6780729432
25455648,QOXA,ABOJe,QOXA6661411952
25455654,XFKD,YLIp,XFKD6374702721
25455654,DOLR,BCFp,DOLR8012564477
25455658,JBDX,ZMDKz,JBDX6885176695
25455665,JBDX,CBOe,JBDX8942732453
25455670,JBDX,FRHMl,JBDX5424320405
25455679,DOLR,YFJm,DOLR8212353717
25455680,XFKD,XAFp,XFKD4132890550
25455681,YXKH,PBIBOp,YXKH6106504736
25455684,DOLR,IFDu,DOLR8034515043
25455687,JBDX,JACe,JBDX8243949318
25455688,JBDX,ZFZKz,JBDX0866225752
25455693,QOXA,XOBm,QOXA5011416607
25455694,QOXA,IDQe,QOXA7608439570
25455698,JBDX,YBIDb,JBDX8727773702
25455705,YXKH,MXOp,YXKH7747780955
25455710,YXKH,PBZRYs,YXKH7353828884
25455719,QOXA,QFDb,QOXA2477859437
25455720,XFKD,PZARp,XFKD4995735686
25455722,JBDX,ZLKKb,JBDX3564523161
25455730,XFKD,QFH1Sh,XFKD6181225566
25455733,JBDX,KWVJYc,JBDX7013108210
25455733,YXKH,ZQI1Sh,YXKH7095815077
25455739,YXKH,XIJp,YXKH0497248757
25455739,YXKH,ZXJp,YXKH5848658513
25455747,JBDX,XASd,JBDX4986246117
25455751,XFKD,XQIKz,XFKD5919379575
25455760,JBDX,IBXPb,JBDX8168710376
25455763,XFKD,EVAOi,XFKD8175209012
25455765,XFKD,JXKp,XFKD2750952933
25455773,XFKD,PTBAXs,XFKD8139382011
25455778,QOXA,XJp,QOXA8227838196
25455783,QOXA,CYBIp,QOXA2072297264
25455792,JBDX,PZI1Sh,JBDX7022115629
25455792,XFKD,XIKQl,XFKD6434550362
25455792,DOLR,YKPm,DOLR6394606248
25455796,QOXA,JXOXPh,QOXA9672544909
25455797,YXKH,YIWm,YXKH5946342983
25455803,YXKH,JZEm,YXKH5317189370
25455810,QOXA,OBMFz,QOXA0985316706
25455810,QOXA,DAJPp,QOXA6105975858
25455810,JBDX,FBBJl,JBDX1316207043
25455819,XFKD,YXKm,XFKD6946276671
25455821,YXKH,UIAUs,YXKH6010226371
25455828,DOLR,PTJXs,DOLR1387517499
25455836,DOLR,DCEi,DOLR3854078054
25455845,YXKH,NYQe,YXKH3727923537
25455853,XFKD,TAEc,XFKD5377097556
25455858,XFKD,LMBOXo,XFKD4452678489
25455858,XFKD,AIQXp,XFKD5727938304
trades1.txt
# The first 8 characters is execution time in microseconds since midnight
# The next 14 characters is the order ID
# The next 8 characters is the zero padded price
# The next 8 characters is the zero padded volume
25455416QOXA6580494597 0000013800001856
25455428JBDX6760353333 0000007000002458
25455434DOLR9093170513 0000000400003832
25455435QOXA6563975285 0000034700009428
25455449QOXA5569701531 0000007500009023
25455447YXKH2329856527 0000038300009947
25455451QOXA5832015401 0000039900006432
25455454QOXA3909507967 0000026900001847
25455456DOLR2262067037 0000034700002732
25455471YXKH2637373848 0000010900006105
25455480DOLR8639062962 0000027500001975
25455488JBDX7268036859 0000005200004986
25455505JBDX2840241688 0000037900002029
25455521YXKH4095442568 0000046400002150
25455515JBDX9945514579 0000040800005904
25455535QOXA1900595629 0000015200006866
25455533JBDX0126452783 0000001700006615
25455542XFKD4392227425 0000035500009948
25455570XFKD1759342196 0000025700007816
25455574JBDX8177118129 0000022400000427
25455567YXKH6485420018 0000039000008327
25455573JBDX1612537068 0000013700001422
25455584JBDX2409813753 0000016600003588
25455603XFKD4861992028 0000017600004552
25455611JBDX0262502677 0000007900003235
25455625JBDX2587184235 0000024300006723
25455658XFKD6374702721 0000046400009451
25455673JBDX6885176695 0000010900009258
25455671JBDX5424320405 0000005400003618
25455679DOLR8212353717 0000041100003633
25455697QOXA5011416607 0000018800007376
25455696QOXA7608439570 0000013000007463
25455716YXKH7747780955 0000037000006357
25455719QOXA2477859437 0000039300009840
25455723XFKD4995735686 0000045500009858
25455727JBDX3564523161 0000021300000639
25455742YXKH7095815077 0000023000003945
25455739YXKH5848658513 0000042700002084
25455766XFKD5919379575 0000022200003603
25455777XFKD8175209012 0000033300006350
25455788XFKD8139382011 0000034500007461
25455793QOXA8227838196 0000011600007081
25455784QOXA2072297264 0000017000004429
25455800XFKD6434550362 0000030000002409
25455801QOXA9672544909 0000039600001033
25455815QOXA6105975858 0000034800008373
25455814JBDX1316207043 0000026500005237
25455831YXKH6010226371 0000011400004945
25455838DOLR1387517499 0000046200006129
25455847YXKH3727923537 0000037400008061
25455873XFKD5727938304 0000048700007298
I have the following script:
'''
#!/bin/bash
declare -A volumes
declare -A turnovers
declare -A orders
# Read the first file, remembering for each order the client id
while read -r line
do
# Jump over comments
if [[ ${line:0:1} == "#" ]] ; then continue; fi;
details=($(echo $line | tr ',' " "))
order_id=${details[3]}
client_id=${details[1]}
orders[$order_id]=$client_id
done < $1
echo "ClientID,OrderID,Price,Volume"
while read -r line
do
# Jump over comments
if [[ ${line:0:1} == "#" ]] ; then continue; fi;
order_id=$(echo ${line:8:20} | tr -d '[:space:]')
client_id=${orders[$order_id]}
price=${line:28:8}
volume=${line: -8}
echo "$client_id,$order_id,$price,$volume"
price=$(echo $price | awk '{printf "%d", $0}')
volume=$(echo $volume | awk '{printf "%d", $0}')
order_turnover=$(($price*$volume))
old_turnover=${turnovers[$client_id]}
[[ -z "$old_turnover" ]] && old_turnover=0
total_turnover=$(($old_turnover+$order_turnover))
turnovers[$client_id]=$total_turnover
old_volumes=${volumes[$client_id]}
[[ -z "$old_volumes" ]] && old_volumes=0
total_volume=$((old_volumes+volume))
volumes[$client_id]=$total_volume
done < $2
echo "ClientID,Volume,Turnover"
for client_id in ${!volumes[#]}
do
volume=${volumes[$client_id]}
turnover=${turnovers[$client_id]}
echo "$client_id,$volume,$turnover"
done
Can anyone think of anything more elegant?
Thanks in advance
C
Assumption 1: the two files are ordered, so line x represents an action that is older than x+1. If not, then further work is needed.
The assumption makes our work easier. Let's first change the delimiter of traders into a comma:
sed -i 's/ /,/g' traders.txt
This will be done in place for sake of simplicity. So, you now have traders which is comma separated, as is orders. This is the Assumption 2.
Keep working on traders: split all columns and add titles1. More on the reasons why in a moment.
gawk -i inplace -v INPLACE_SUFFIX=.bak 'BEGINFILE{FS=",";OFS=",";print "execution time,order ID,price,volume";}{print substr($1,1,8),substr($1,9),substr($2,1,9),substr($2,9)}' traders.txt
Ugly but works. Now let's process your data using the following awk script:
BEGIN {
FS=","
OFS=","
}
{
if (1 == NR) {
getline line < TRADERS # consume title line
print "Client ID,Order ID,Price,Volume,Turnover"; # consume title line. Remove print to forget it
getline line < TRADERS # reads first data line
split(line, transaction, ",")
next
}
if (transaction[2] == $4) {
print $2, $4, transaction[3], transaction[4], transaction[3]*transaction[4]
getline line < TRADERS # reads new data line
split(line, transaction, ",")
}
}
called by:
gawk -f script -v TRADERS=traders.txt orders.txt
And there you have it. Some caveats:
check the numbers, as implicit gawk number conversion might not be correct with zero-padded numbers. There is a fix for that in case;
getline might explode if we run out of lines from traders. I haven't put any check, that's up to you
no control over timestamps. Match is based on Order ID.
Output file:
Client ID,Order ID,Price,Volume,Turnover
QOXA,QOXA6580494597,000001380,00001856,2561280
JBDX,JBDX6760353333,000000700,00002458,1720600
DOLR,DOLR9093170513,000000040,00003832,153280
QOXA,QOXA6563975285,000003470,00009428,32715160
QOXA,QOXA5569701531,000000750,00009023,6767250
YXKH,YXKH2329856527,000003830,00009947,38097010
QOXA,QOXA5832015401,000003990,00006432,25663680
QOXA,QOXA3909507967,000002690,00001847,4968430
DOLR,DOLR2262067037,000003470,00002732,9480040
YXKH,YXKH2637373848,000001090,00006105,6654450
DOLR,DOLR8639062962,000002750,00001975,5431250
JBDX,JBDX7268036859,000000520,00004986,2592720
JBDX,JBDX2840241688,000003790,00002029,7689910
YXKH,YXKH4095442568,000004640,00002150,9976000
JBDX,JBDX9945514579,000004080,00005904,24088320
QOXA,QOXA1900595629,000001520,00006866,10436320
JBDX,JBDX0126452783,000000170,00006615,1124550
XFKD,XFKD4392227425,000003550,00009948,35315400
XFKD,XFKD1759342196,000002570,00007816,20087120
JBDX,JBDX8177118129,000002240,00000427,956480
YXKH,YXKH6485420018,000003900,00008327,32475300
JBDX,JBDX1612537068,000001370,00001422,1948140
JBDX,JBDX2409813753,000001660,00003588,5956080
XFKD,XFKD4861992028,000001760,00004552,8011520
JBDX,JBDX0262502677,000000790,00003235,2555650
JBDX,JBDX2587184235,000002430,00006723,16336890
XFKD,XFKD6374702721,000004640,00009451,43852640
JBDX,JBDX6885176695,000001090,00009258,10091220
JBDX,JBDX5424320405,000000540,00003618,1953720
DOLR,DOLR8212353717,000004110,00003633,14931630
QOXA,QOXA5011416607,000001880,00007376,13866880
QOXA,QOXA7608439570,000001300,00007463,9701900
YXKH,YXKH7747780955,000003700,00006357,23520900
QOXA,QOXA2477859437,000003930,00009840,38671200
XFKD,XFKD4995735686,000004550,00009858,44853900
JBDX,JBDX3564523161,000002130,00000639,1361070
YXKH,YXKH7095815077,000002300,00003945,9073500
YXKH,YXKH5848658513,000004270,00002084,8898680
XFKD,XFKD5919379575,000002220,00003603,7998660
XFKD,XFKD8175209012,000003330,00006350,21145500
XFKD,XFKD8139382011,000003450,00007461,25740450
QOXA,QOXA8227838196,000001160,00007081,8213960
QOXA,QOXA2072297264,000001700,00004429,7529300
XFKD,XFKD6434550362,000003000,00002409,7227000
QOXA,QOXA9672544909,000003960,00001033,4090680
QOXA,QOXA6105975858,000003480,00008373,29138040
JBDX,JBDX1316207043,000002650,00005237,13878050
YXKH,YXKH6010226371,000001140,00004945,5637300
DOLR,DOLR1387517499,000004620,00006129,28315980
YXKH,YXKH3727923537,000003740,00008061,30148140
XFKD,XFKD5727938304,000004870,00007298,35541260
1: requires gawk 4.1.0 or higher

Print a file in multiple columns based on delimiter

This seems like a simple task, but using duckduckgo I wasn't able to find a way to properly do what I'm trying to.
The main question is: How do I split the output of a command in linux or bash into multiple columns using a delimeter?
I have a file that looks like this: (this is just a simplified example)
-----------------------------------
Some data
that varies in line length
-----------------------------------
-----------------------------------
More data that is seperated
by a new line and dashes
-----------------------------------
And so on. Everytime data gets written to the file, it's enclosed in a line of dashes, seperated by an empty line from the last block. Line-length of the data varies. What I want is basically a tool or way using bash to split the file into multiple columns like this:
----------------------------------- -----------------------------------
Some data More data that is seperated
that varies in line length by a new line and dashes
----------------------------------- -----------------------------------
Each column should take 50% of the screen, no centering (as in alignment) needed. The file has to be split per-block. Splitting the file in the middle or something like that won't work. I basically want block 1 go to the left column, block 2 to the right, 3 to the left again, 4 right, and so on. The file gets updated constantly and updates should be written to the screen right away. (Currently I'm using tail -f)
Since this sounds like a rather common question I would welcome a general approach to this instead of a specific answer that works only for my case so people coming from search engines looking for a way to have a two column layout in bash get some information too. I tried column and pr, both don't work as desired. (I elaborated on this in the comments)
Edit: To be clear, I am looking for a general approach on this. Going through a file, getting data between the delimiter, putting it to column A, getting the next one putting it to column B, and so on.
The question is tagged as Perl so here is a possible Perl answer:
#!/usr/bin/env perl
use strict;
use warnings;
my $is_col1 = 1;
my $in_block = 0;
my #col1;
while (<DATA>) {
chomp;
if (/^\s*-+\s*$/ ... /^\s*-+\s*$/) {
$in_block = 1;
if ($is_col1) {
push #col1, $_;
}
else {
printf "%-40s%-40s\n", shift #col1 // '', $_;
}
}
else {
if ($in_block) {
$in_block = ! $in_block;
$is_col1 = ! $is_col1;
print "\n" if $is_col1; # line separating blocks
}
}
}
print join("\n", #col1), "\n\n" if #col1;
__DATA__
-----------------------------------
Some data
that varies in line length
-----------------------------------
-----------------------------------
More data that is seperated
by a new line and dashes
with a longer column2
-----------------------------------
-----------------------------------
The odd last column
-----------------------------------
Output:
----------------------------------- -----------------------------------
Some data More data that is seperated
that varies in line length by a new line and dashes
----------------------------------- with a longer column2
-----------------------------------
-----------------------------------
The odd last column
-----------------------------------
This script is getting max width of current terminal and splitting it in 2, then printing records split by RS="\n\n" separator, print the first found and placing the cursor at the first line/last column of it to write the next record.
#!/bin/bash
tput clear
# get half current terminal width
twidth=$(($(tput cols)/2))
tail -n 100 -f test.txt | stdbuf -i0 -o0 gawk -v twidth=$twidth 'BEGIN{ RS="\n\n"; FS=OFS="\n"; oldNF=0 } {
sep="-----------------------------------"
pad=" "
printf "%-" twidth "s", $0
getline
for(i = 1; i <= NF; i++){
# move cursor to first line, last column of previous record
print "\033[" oldNF ";" twidth "f" $i
oldNF+=1
}
}'
Here's a simpler version
gawk 'BEGIN{ RS="[-]+\n\n"; FS="\n" } {
sep="-----------------------------------"
le=$2
lo=$3
getline
printf "%-40s %-40s\n", sep,sep
printf "%-40s %-40s\n", le,$2
printf "%-40s %-40s\n", lo,$3
printf "%-40s %-40s\n\n", sep,sep
}' test.txt
Output
----------------------------------- -----------------------------------
Some data More data that is seperated
that varies in line length by a new line and dashes
----------------------------------- -----------------------------------
----------------------------------- -----------------------------------
Some data More data that is seperated
that varies in line length by a new line and dashes
----------------------------------- -----------------------------------
Assuming file contains uniform blocks of five lines each, using paste, sed, and printf:
c=$((COLUMNS/2))
paste -d'#' <(sed -n 'p;n;p;n;p;n;p;n;p;n;n;n;n;n' file) \
<(sed -n 'n;n;n;n;n;p;n;p;n;p;n;p;n;p' file) |
sed 's/.*/"&"/;s/#/" "/' |
xargs -L 1 printf "%-${c}s %-${c}s\n"
Problem with OP spec
The OP reports that the block lengths may vary, and should be separated by a fixed number of lines. Even numbered blocks go in Column A, odd numbered blocks in Column B.
That creates a tail -f problem then. Suppose the block lengths of the source input begin with 1000 lines, then one line, 1000, 1, 1000, 1, etc. So Column A gets all the 1000 line blocks, and Column B gets all the one line blocks. Let's say the blocks in the output are separated by 1 line each. So one block in Column A lines up with 500 blocks in Column B. So for a terminal with scrolling output, that means before we can output the first block in Column A, we have to wait for 1000 blocks of input. To output the third block in Column A, (just below the first block), we have to wait for 2000 blocks of input.
If the blocks are added to the input file relatively slowly, with a one second delay between blocks, then it will take three seconds for the block 3 to appear in the input file, but it will take 33 minutes for block 3 to be displayed in the output file.
Alright, since apprently there is no clean way to do this I came up with my own solution. It's a bit messy and requires GNU screen to be installed, but it works. Any amount of lines within or around the blocks, 50% of the screen automatically resizing and each column prints independantly from each other with a fixed amount of newlines between them. Also automatic updates every x seconds. (120 in my example)
#!/bin/bash
screen -S testscr -X layout save default
screen -S testscr -X split -v
screen -S testscr -X screen tail -f /tmp/testscr1.txt
screen -S testscr -X focus
screen -S testscr -X screen tail -f /tmp/testscr2.txt
while : ; do
echo "" > /tmp/testscr1.txt
echo "" > /tmp/testscr2.txt
cfile=1 # current column
ctype=0 # start or end of block
while read; do
if [[ $REPLY == "------------------------------------------------------------" ]]; then
if [[ $ctype -eq 0 ]]; then
ctype=1
else
if [[ $cfile -eq 1 ]]; then
echo "${REPLY}" >> /tmp/testscr1.txt
echo "" >> /tmp/testscr1.txt
echo "" >> /tmp/testscr1.txt
cfile=2
else
echo "${REPLY}" >> /tmp/testscr2.txt
echo "" >> /tmp/testscr2.txt
echo "" >> /tmp/testscr2.txt
cfile=1
fi
ctype=0
fi
fi
if [[ $ctype -eq 1 ]]; then
if [[ $cfile -eq 1 ]]; then
echo "${REPLY}" >> /tmp/testscr1.txt
else
echo "${REPLY}" >> /tmp/testscr2.txt
fi
fi
done < "$1"
sleep 120
done
First, start a screen session with screen -S testscr then, either within or outside the session, execute the script above. This will split the screen vertically using 50% per column and execute tail -f on both columns, afterwards it will go through the input file and write block by block to each tmp. file in the desired way. Since it's in an infinite while loop it's essentially automatically updating the shown output every x seconds (here 120).

Filter a very large, numerically sorted CSV file based on a minimum/maximum value using Linux?

I'm trying to output lines of a CSV file which is quite large. In the past I have tried different things and ultimately come to find that Linux's command line interface (sed, awk, grep, etc) is the fastest way to handle these types of files.
I have a CSV file like this:
1,rand1,rand2
4,randx,randy,
6,randz,randq,
...
1001,randy,randi,
1030,rando,randn,
1030,randz,randc,
1036,randp,randu
...
1230994,randm,randn,
1230995,randz,randl,
1231869,rande,randf
Although the first column is numerically increasing, the space between each number varies randomly. I need to be able to output all lines that have a value between X and Y in their first column.
Something like:
sed ./csv -min --col1 1000 -max --col1 1400
which would output all the lines that have a first column value between 1000 and 1400.
The lines are different enough that in a >5 GB file there might only be ~5 duplicates, so it wouldn't be a big deal if it counted the duplicates only once -- but it would be a big deal if it threw an error due to a duplicate line.
I may not know whether particular line values exist (e.g. 1000 is a rough estimate and should not be assumed to exist as a first column value).
Optimizations matter when it comes to large files; the following awk command:
is parameterized (uses variables to define the range boundaries)
performs only a single comparison for records that come before the range.
exits as soon as the last record of interest has been found.
awk -F, -v from=1000 -v to=1400 '$1 < from { next } $1 > to { exit } 1' ./csv
Because awk performs numerical comparison (with input fields that look like numbers), the range boundaries needn't match field values precisely.
You can easily do this with awk, though it won't take full advantage of the file being sorted:
awk -F , '$1 > 1400 { exit(0); } $1 >= 1000 { print }' file.csv
If you know that the numbers are increasing and unique, you can use addresses like this:
sed '/^1000,/,/^1400,/!d' infile.csv
which does not print any line that is outside of the lines between the one that matches /^1000,/ and the one that matches /^1400,/.
Notice that this doesn't work if 1000 or 1400 don't actually exist as values, i.e., it wouldn't print anything at all in that case.
In any case, as demonstrated by the answers by mklement0 and that other guy, awk is a the better choice here.
Here's a bash-version of the script:
#! /bin/bash
fname="$1"
start_nr="$2"
end_nr="$3"
while IFS=, read -r nr rest || [[ -n $nr && -n $rest ]]; do
if (( $nr < $start_nr )); then continue;
elif (( $nr > $end_nr )); then break; fi
printf "%s,%s\n" "$nr" "$rest"
done < "$fname"
Which you would then call script.sh foo.csv 1000 2000
The script will start printing when the number is large enough and then immediately stops when the number gets above the limit.

Average values across multiple files

I am trying to write a shell script to average several identically formatted files with names file1, file2, file3 and so on.
In each file, the data is in a table of a format like for example 4 columns and 5 rows of data. Let's assume file1, file2 and file3 are in the same directory. What I want to do is to create an average file, which has the same format as file1/file2/file3 where it should have the average of the each element from the table. For example,
{(Element in row 1, column 1 in file1)+
(Element in row 1, column 1 in file2)+
(Element in row 1, column 1 in file3)} >>
(Element in row 1, column 1 in average file)
Likewise, I need to do it for each element in the table, the average file has the same number of elements as the file1, file2, file3.
I tried to write a shell script, but it doesn't work. What I want is to read the files in a loop and grep the same element from each file, add them and average them over the number of files and finally write to a similar file format. This is what I tried to write:
#!/bin/bash
s=0
for i in {1..5..1} do
for j in {1..4..1} do
for f in m* do
a=$(awk 'FNR == i {print $j}' $f)
echo $a
s=$s+$a
echo $f
done
avg=$s/3
echo $avg > output
done
done
This is a rather inefficient way of going about it: for every single number you're trying to extract, you process one of the input files completely – even though you only have three files, you process 60!
Also, mixing Bash and awk in this way is a massive antipattern. This here is a great Q&A explaining why.
A few more remarks:
For brace expansion, the default step size is 1, so {1..4..1} is the same as {1..4}.
Awk has no clue what i and j are. As far as it is concerned, those were never defined. If you really wanted to get your shell variables into awk, you could do
a=$(awk -v i="$i" -v j="$j" 'FNR == i { print $j }' $f)
but the approach is not sound anyway.
Shell arithmetic does not work like s=$s+$a or avg=$s/3 – these are just concatenating strings. To have the shell do calculations for you, you'd need arithmetic expansion:
s=$(( s + a ))
or, a little shorter,
(( s += a ))
and
avg=$(( s / 3 ))
Notice that you don't need the $ signs in an arithmetic context.
echo $avg > output would print every number on a separate line, which is probably not what you want.
Indentation matters! If not for the machine, then for human readers.
A Bash solution
This solves the problem using just Bash. It is hard coded to three files, but flexible in the number of lines and elements per line. There are no checks to make sure that the number of elements is the same for all lines and files.
Notice that Bash is not fast at that kind of thing and should only be used for small files, if at all. Also, is uses integer arithmetic, so the "average" of 3 and 4 would become 3.
I've added comments to explain what happens.
#!/bin/bash
# Read a line from the first file into array arr1
while read -a arr1; do
# Read a line from the second file at file descriptor 3 into array arr2
read -a arr2 <&3
# Read a line from the third file at file descriptor 4 into array arr3
read -a arr3 <&4
# Loop over elements
for (( i = 0; i < ${#arr1[#]}; ++i )); do
# Calculate average of element across files, assign to res array
res[i]=$(( (arr1[i] + arr2[i] + arr3[i]) / 3 ))
done
# Print res array
echo "${res[#]}"
# Read from files supplied as arguments
# Input for the second and third file is redirected to file descriptors 3 and 4
# to enable looping over multiple files concurrently
done < "$1" 3< "$2" 4< "$3"
This has to be called like
./bashsolution file1 file2 file3
and output can be redirected as desired.
An awk solution
This is a solution in pure awk. It's a bit more flexible in that it takes the average of however many files are supplied as arguments; it should also be faster than the Bash solution by about an order of magnitude.
#!/usr/bin/awk -f
# Count number of files: increment on the first line of each new file
FNR == 1 { ++nfiles }
{
# (Pseudo) 2D array summing up fields across files
for (i = 1; i <= NF; ++i) {
values[FNR, i] += $i
}
}
END {
# Loop over lines of array with sums
for (i = 1; i <= FNR; ++i) {
# Loop over fields of current line in array of sums
for (j = 1; j <= NF; ++j) {
# Build record with averages
$j = values[i, j]/nfiles
}
print
}
}
It has to be called like
./awksolution file1 file2 file3
and, as mentioned, there is no limit to the number of files to average over.

Use grep to remove words from dictionary whose roots are already present

I am trying to write a random passphrase generator. I have a dictionary with a bunch of words and I would like to remove words whose root is already in the dictionary, so that a dictionary that looks like:
ablaze
able
abler
ablest
abloom
ably
would end up with only
ablaze
able
abloom
ably
because abler and ablest contain able which was previously used.
I would prefer to do this with grep so that I can learn more about how that works. I am capable of writing a program in c or python that will do this.
If the list is sorted so that shorter strings always precede longer strings, you might be able to get fairly good performance out of a simple Awk script.
awk '$1~r && p in k { next } { k[$1]++; print; r= "^" $1; p=$1 }' words
If the current word matches the prefix regex r (defined in a moment) and the prefix p (ditto) is in the list of seen keys, skip. Otherwise, add the current word to the prefix keys, print the current line, create a regex which matches the current word at beginning of line (this is now the prefix regex r) and also remember the prefix string in p.
If all the similar strings are always adjacent (as they would be if you sort the file lexically), you could do away with k and p entirely too, I guess.
awk 'NR>1 && $1~r { next } { print; r="^" $1 }' words
This is based on the assumption that the input file is sorted. In that case, when looking up each word, all matches after the first one can be safely skipped (because they will correspond to "the same word with a different suffix").
#/bin/bash
input=$1
while read -r word ; do
# ignore short words
if [ ${#word} -lt 4 ] ; then continue; fi
# output this line
echo $word
# skip next lines that start with $word as prefix
skip=$(grep -c -E -e "^${word}" $input)
for ((i=1; i<$skip; i++)) ; do read -r word ; done
done <$input
Call as ./filter.sh input > output
This takes somewhat less than 2 minutes on all words of 4 or more letters found in my /usr/share/dict/american-english dictionary. The algorithm is O(n²), and therefore unsuitable for large files.
However, you can speed things up a lot if you avoid using grep at all. This version takes only 4 seconds to do the job (because it does not need to scan the whole file almost once per word). Since it performs a single pass over the input, its complexity is O(n):
#/bin/bash
input=$1
while true ; do
# use already-read word, or fail if cannot read new
if [ -n "$next" ] ; then word=$next; unset next;
elif ! read -r word ; then break; fi
# ignore short words
if [ ${#word} -lt 4 ] ; then continue; fi
# output this word
echo ${word}
# skip words that start with $word as prefix
while read -r next ; do
unique=${next#$word}
if [ ${#next} -eq ${#unique} ] ; then break; fi
done
done <$input
Supposing you want to start with words that share the same first four (up to ten) letters, you could do something like this:
cp /usr/share/dict/words words
str="...."
for num in 4 5 6 7 8 9 10; do
for word in `grep "^$str$" words`; do
grep -v "^$word." words > words.tmp
mv words.tmp words
done
str=".$str"
done
You wouldn't want to start with 1 letter, unless 'a' is not in your dictionary, etc.
Try this BASH script:
a=()
while read -r w; do
[[ ${#a[#]} -eq 0 ]] && a+=("$w") && continue
grep -qvf <(printf "^%s\n" "${a[#]}") <<< "$w" && a+=("$w")
done < file
printf "%s\n" "${a[#]}"
ablaze
able
abloom
ably
It seems like you want to group adverbs together. Some adverbs, including those that can also be adjectives, use er and est to form comparisons:
able, abler, ablest
fast, faster, fastest
soon, sooner, soonest
easy, easier, easiest
This procedure is know as stemming in natural language processing, and can be achieved using a stemmer or lemmatizer. there are popular implementations in python's NLTK module but the problem is not completely solved. The best out the box stemmer is the snowball stemmer but it does not stem adverbs to their root.
import nltk
initial = '''
ablaze
able
abler
ablest
abloom
ably
fast
faster
fastest
'''.splitlines()
snowball = nltk.stem.snowball.SnowballStemmer("english")
stemmed = [snowball.stem(word) for word in initial]
print set(stemmed)
output...
set(['', u'abli', u'faster', u'abl', u'fast', u'abler', u'abloom', u'ablest', u'fastest', u'ablaz'])
the other option is to use a regex stemmer but this has its own difficulties I'm afraid.
patterns = "er$|est$"
regex_stemmer = nltk.stem.RegexpStemmer(patterns, 4)
stemmed = [regex_stemmer.stem(word) for word in initial]
print set(stemmed)
output...
set(['', 'abloom', 'able', 'abl', 'fast', 'ably', 'ablaze'])
If you just want to weed out some of the words, this gross command will work. Note that it'll throw out some legit words like best, but it's dead simple. It assumes you have a test.txt file with one word per line
egrep -v "er$|est$" test.txt >> results.txt
egrep is the same as grep -E. -v means throw out matching lines. x|y means if x or y match, and $ means end of line, so you'd be looking for words that end in er or est

Resources