optimizing awk command for large file - linux
I have these functions to process a 2 GB text file. I'm splitting it into 6 parts for simultaneous processing, but it is still taking 4+ hours.
What else can I try to make the script faster?
A bit of detail:
I feed my input CSV into a while loop to be read line by line.
I grab the values of five fields from the CSV line in the read2col function.
The awk in my mainf function takes the values from read2col and does some arithmetic. I round the result to 2 decimal places and then print the line to a text file.
Sample data:
"111","2018-08-24","01:21","ZZ","AAA","BBB","0","","","ZZ","ZZ111","ZZ110","2018-10-12","07:00","2018-10-12","08:05","2018-10-19","06:30","2018-10-19","09:35","ZZZZ","ZZZZ","A","B","146.00","222.26","76.26","EEE","abc","100.50","45.50","0","E","ESSENTIAL","ESSENTIAL","4","4","7","125","125"
Script:
read2col()
{
    is_one_way=$(echo "$line"| awk -F'","' '{print $7}')
    price_outbound=$(echo "$line"| awk -F'","' '{print $30}')
    price_exc=$(echo "$line"| awk -F'","' '{print $25}')
    tax=$(echo "$line"| awk -F'","' '{print $27}')
    price_inc=$(echo "$line"| awk -F'","' '{print $26}')
}
#################################################
#for each line in the csv
mainf()
{
    cd $infarepath

    while read -r line; do
        #read the value of csv fields into variables
        read2col
        if [[ $is_one_way == 0 ]]; then
            if [[ $price_outbound > 0 ]]; then
                #calculate price inc and print the entire line to txt file
                echo $line | awk -v CONVFMT='%.2f' -v pout=$price_outbound -v tax=$tax -F'","' 'BEGIN {OFS = FS} {$25=pout;$26=(pout+(tax / 2)); print}' >>"$csvsplitfile".tmp
            else
                #divide price exc and inc by 2 if price outbound is not greater than 0
                echo $line | awk -v CONVFMT='%.2f' -v pexc=$price_exc -v pinc=$price_inc -F'","' 'BEGIN {OFS = FS} {$25=(pexc / 2);$26=(pinc /2); print}' >>"$csvsplitfile".tmp
            fi
        else
            echo $line >>"$csvsplitfile".tmp
        fi
    done < $csvsplitfile
}
The first thing you should do is stop invoking six awk subshells for every single line of input. Let's do some quick, back-of-the-envelope calculations.
Assuming your input lines are about 292 characters (as per your example), a 2 GB file will consist of a little over 7.3 million lines. That means you are starting and stopping a whopping forty-four million processes.
And, while Linux admirably handles fork and exec as efficiently as possible, it's not without cost:
pax$ time for i in {1..44000000} ; do true ; done
real 1m0.946s
In addition, bash hasn't really been optimised for this sort of processing; its design leads to sub-optimal behaviour for this specific use case. For details on this, see this excellent answer over on one of our sister sites.
An analysis of the two methods of file processing (one program reading an entire file versus bash reading it a line at a time, with each line containing just hello) is shown below. The two commands used to get the timings were:
time ( cat somefile >/dev/null )
time ( while read -r x ; do echo $x >/dev/null ; done <somefile )
For varying file sizes (user+sys time, averaged over a few runs), it's quite interesting:
# of lines    cat-method    while-method
----------    ----------    ------------
     1,000        0.375s          0.031s
    10,000        0.391s          0.234s
   100,000        0.406s          1.994s
 1,000,000        0.391s         19.844s
10,000,000        0.375s        205.583s
44,000,000        0.453s        889.402s
From this, it appears that while the read-loop method can hold its own for smaller data sets, it really does not scale well.
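If you want to reproduce numbers like these on your own box, here is a rough sketch of the kind of loop used; the file name somefile matches the commands above, and the line counts are just placeholders you can adjust:
#!/bin/bash
# Compare "one program reads the whole file" against "bash reads it line
# by line" for a few sizes. Each test file has just "hello" on every line.
for n in 1000 10000 100000 1000000; do
    yes hello | head -n "$n" > somefile
    echo "== $n lines =="
    time ( cat somefile >/dev/null )
    time ( while read -r x ; do echo $x >/dev/null ; done <somefile )
done
rm -f somefile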
Since awk itself can do calculations and formatted output, processing the file with one single awk script, rather than your bash/multi-awk-per-line combination, makes the cost of creating all those processes, and the line-by-line delays, go away.
This script would be a good first attempt; let's call it prog.awk:
BEGIN {
    FMT = "%.2f"
    OFS = FS
}

{
    # pull out the fields we care about
    isOneWay      = $7
    priceOutbound = $30
    priceExc      = $25
    tax           = $27
    priceInc      = $26

    if (isOneWay == 0) {
        if (priceOutbound > 0) {
            # use the outbound price; the inc price is outbound plus half the tax
            $25 = sprintf(FMT, priceOutbound)
            $26 = sprintf(FMT, priceOutbound + tax / 2)
        } else {
            # otherwise just halve the exc and inc prices
            $25 = sprintf(FMT, priceExc / 2)
            $26 = sprintf(FMT, priceInc / 2)
        }
    }

    print
}
You just run that single awk script with:
awk -F'","' -f prog.awk data.txt
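If you still want to spread the work across several CPUs the way your original six-way split did, run one copy of prog.awk per chunk rather than several awks per line. A minimal sketch, assuming GNU split (for the -n l/6 syntax) and placeholder names data.txt and part_:
#!/bin/bash
# Split into 6 pieces without breaking lines, run prog.awk on each piece
# in the background, then stitch the results back together in order.
split -n l/6 -d data.txt part_

for chunk in part_0[0-5]; do
    awk -F'","' -f prog.awk "$chunk" > "$chunk.out" &
done
wait

cat part_0[0-5].out > data_processed.txt
rm -f part_0[0-5] part_0[0-5].out
That said, a single awk pass over a 2 GB file typically finishes in minutes rather than hours, so time the simple one-process version first before adding the parallel machinery back in.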
With the test data you provided, here's the before and after; the only fields that change are 25 and 26 (the "146.00","222.26" pair in the input line):
"111","2018-08-24","01:21","ZZ","AAA","BBB","0","","","ZZ","ZZ111","ZZ110","2018-10-12","07:00","2018-10-12","08:05","2018-10-19","06:30","2018-10-19","09:35","ZZZZ","ZZZZ","A","B","146.00","222.26","76.26","EEE","abc","100.50","45.50","0","E","ESSENTIAL","ESSENTIAL","4","4","7","125","125"
"111","2018-08-24","01:21","ZZ","AAA","BBB","0","","","ZZ","ZZ111","ZZ110","2018-10-12","07:00","2018-10-12","08:05","2018-10-19","06:30","2018-10-19","09:35","ZZZZ","ZZZZ","A","B","100.50","138.63","76.26","EEE","abc","100.50","45.50","0","E","ESSENTIAL","ESSENTIAL","4","4","7","125","125"
Related
separate columns of a text file
Extract orders and match to trades from two files
Print a row of 16 lines evenly side by side (column)
How to select random lines from a file
awk - how to "re-awk" the output?