How to select random lines from a file - linux

I have a text file containing "10 hundreds" (roughly a thousand) lines of different lengths. Now I want to select N lines randomly, save them in another file, and remove them from the original file.
I've found some answers to this question, but most of them use a simple idea: sort the file randomly and select the first or last N lines. Unfortunately this idea doesn't work for me, because I want to preserve the order of the lines.
I tried this piece of code, but it's very slow and takes hours.
FILEsrc=$1;
FILEtrg=$2;
MaxLines=$3;
let LineIndex=1;
while [ "$LineIndex" -le "$MaxLines" ]
do
    # count number of lines
    NUM=$(wc -l $FILEsrc | sed 's/[ \r\t].*$//g');
    let X=(${RANDOM} % ${NUM} + 1);
    echo $X;
    sed -n ${X}p ${FILEsrc} >> $FILEtrg;    # write selected line into target file
    sed -i -e ${X}d ${FILEsrc};             # remove selected line from source file
    LineIndex=`expr $LineIndex + 1`;
done
I found this line to be the most time-consuming one in the code:
sed -i -e ${X}d ${FILEsrc};
Is there any way to overcome this problem and make the code faster?
Since I'm in a hurry, may I ask you to send me complete C/C++ code for doing this?

A simple O(n) algorithm is described in:
http://en.wikipedia.org/wiki/Reservoir_sampling
array R[k];    // result
integer i, j;

// fill the reservoir array
for each i in 1 to k do
    R[i] := S[i]
done;

// replace elements with gradually decreasing probability
for each i in k+1 to length(S) do
    j := random(1, i);    // important: inclusive range
    if j <= k then
        R[j] := S[i]
    fi
done
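For instance, a minimal awk sketch of that algorithm applied directly to the question's $FILEsrc (k is the sample size; note that it neither preserves the original line order nor removes the sampled lines, so it only covers the selection step):
awk -v k=10 '
BEGIN { srand() }
NR <= k { R[NR] = $0; next }                  # fill the reservoir with the first k lines
{ j = int(rand() * NR) + 1                    # random integer in 1..NR, inclusive
  if (j <= k) R[j] = $0 }                     # keep line NR with probability k/NR
END { for (i = 1; i <= k; i++) print R[i] }
' "$FILEsrc"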

Generate all your offsets, then make a single pass through the file. Assuming you have the desired number of offsets in a file named offsets (one number per line), you can generate a single sed script like this:
sed "s!.*!&{w $FILEtrg\nd;}!" offsets
The output is a sed script which you can save to a temporary file, or (if your sed dialect supports it) pipe to a second sed instance:
... | sed -i -f - "$FILEsrc"
Generating the offsets file is left as an exercise.
Given that you have the Linux tag, this should work right off the bat. The default sed on some other platforms may not understand \n and/or accept -f - to read the script from standard input.
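As for the offsets file itself, one possible way to generate it (assuming GNU shuf is available; it picks $MaxLines distinct line numbers):
shuf -i 1-"$(wc -l < "$FILEsrc")" -n "$MaxLines" > offsets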
Here is a complete script, updated to use shuf (thanks @Thor!) to avoid possible duplicate random numbers.
#!/bin/sh
FILEsrc=$1
FILEtrg=$2
MaxLines=$3
# Add a line number to each input line
nl -ba "$FILEsrc" |
# Rearrange lines
shuf |
# Pick out the line number from the first $MaxLines ones into sed script
sed "1,${MaxLines}s!^ *\([1-9][0-9]*\).*!\1{w $FILEtrg\nd;}!;t;D;q" |
# Run the generated sed script on the original input file
sed -i -f - "$FILEsrc"

[I've updated each solution to remove selected lines from the input, but I'm not positive the awk is correct. I'm partial to the bash solution myself, so I'm not going to spend any time debugging it. Feel free to edit any mistakes.]
Here's a simple awk script (the probabilities are simpler to manage with floating point numbers, which don't mix well with bash):
tmp=$(mktemp /tmp/XXXXXXXX)
awk -v total=$(wc -l < "$FILEsrc") -v maxLines=$MaxLines '
    BEGIN { srand(); }
    maxLines==0 { exit; }
    {
        if (rand() < maxLines/total--) {
            print; maxLines--;
        } else {
            print $0 > "/dev/fd/3"
        }
    }' "$FILEsrc" > "$FILEtrg" 3> "$tmp"
mv "$tmp" "$FILEsrc"
As you print a line to the output, you decrement maxLines to decrease the probability of choosing further lines. But as you consume the input, you decrease total to increase the probability. In the extreme, the probability hits zero when maxLines does, so you can stop processing the input. In the other extreme, the probability hits 1 once total is less than or equal to maxLines, and you'll be accepting all further lines.
Here's the same algorithm, implemented in (almost) pure bash using integer arithmetic:
FILEsrc=$1
FILEtrg=$2
MaxLines=$3
tmp=$(mktemp /tmp/XXXXXXXX)
total=$(wc -l < "$FILEsrc")
while IFS= read -r line && (( MaxLines > 0 )); do
    (( MaxLines * 32768 > RANDOM * total-- )) || { printf '%s\n' "$line" >&3; continue; }
    (( MaxLines-- ))
    printf '%s\n' "$line"
done < "$FILEsrc" > "$FILEtrg" 3> "$tmp"
mv "$tmp" "$FILEsrc"

Here's a complete Go program:
package main

import (
    "bufio"
    "fmt"
    "log"
    "math/rand"
    "os"
    "sort"
    "time"
)

func main() {
    N := 10
    rand.Seed(time.Now().UTC().UnixNano())
    f, err := os.Open(os.Args[1]) // open the file
    if err != nil {               // and tell the user if the file wasn't found or readable
        log.Fatal(err)
    }
    r := bufio.NewReader(f)
    var lines []string // this will contain all the lines of the file
    for {
        if line, err := r.ReadString('\n'); err == nil {
            lines = append(lines, line)
        } else {
            break
        }
    }
    nums := make([]int, N) // creates the array of desired line indexes
    for i := range nums {  // fills the array with random numbers (lower than the number of lines)
        nums[i] = rand.Intn(len(lines))
    }
    sort.Ints(nums) // sorts this array
    for _, n := range nums { // let's print the lines
        fmt.Println(lines[n])
    }
}
Provided you put the Go file in a directory named randomlines in your GOPATH, you may build it like this:
go build randomlines
And then call it like this:
./randomlines "path_to_my_file"
This will print N (here 10) random lines from your file, without changing their order. Of course it's near-instantaneous even with big files.

Here's an interesting two-pass option with coreutils, sed and awk:
n=5
total=$(wc -l < infile)
seq 1 $total | shuf | head -n $n \
| sed 's/^/NR == /; $! s/$/ ||/' \
| tr '\n' ' ' \
| sed 's/.*/ & { print >> "rndlines" }\n!( &) { print >> "leftover" }/' \
| awk -f - infile
A list of random numbers is passed to sed, which generates an awk script. If awk were removed from the pipeline above, this would be the output:
NR == 14 || NR == 1 || NR == 11 || NR == 20 || NR == 21 { print >> "rndlines" }
!( NR == 14 || NR == 1 || NR == 11 || NR == 20 || NR == 21 ) { print >> "leftover" }
So the random lines are saved in rndlines and the rest in leftover.

Mentioned "10 hundreds" lines should sort quite quickly, so this is a nice case for the Decorate, Sort, Undecorate pattern. It actually creates two new files, removing lines from the original one can be simulated by renaming.
Note: head and tail cannot be used instead of awk, because they close the file descriptor after given number of lines, making tee exit thus causing missing data in the .rest file.
FILE=input.txt
SAMPLE=10
SEP=$'\t'
<"$FILE" nl -s "$SEP" -nln -w1 |
sort -R |
tee \
>(awk "NR > $SAMPLE" | sort -t"$SEP" -k1n,1 | cut -d"$SEP" -f2- > $FILE.rest) \
>(awk "NR <= $SAMPLE" | sort -t"$SEP" -k1n,1 | cut -d"$SEP" -f2- > $FILE.sample) \
>/dev/null
# check the results
wc -l $FILE*
# 'remove' the lines, if needed
mv $FILE.rest $FILE

This might work for you (GNU sed, sort and seq):
n=10
seq 1 $(sed '$=;d' input_file) |
sort -R |
sed "${n}q" |
sed 's/.*/&{w output_file\nd}/' |
sed -i -f - input_file
Where $n is the number of lines to extract.


separate columns of a text file

Hi experts, I have a big text file that contains many columns. Now I want to extract each column into a separate text file, adding two strings at the top of each.
Suppose I have an input file like this:
2 3 4 5 6
3 4 5 6 7
2 3 4 5 6
1 2 2 2 2
Then I need to extract each column into a separate text file with two strings on the top:
file1.txt file2.txt .... filen.txt
s=5 s=5
r=9 r=9
2 3
3 4
2 3
1 2
I tried the script below, but it does not work properly. I need help from the experts. Thanks in advance.
#!/bin/sh
for i in $(seq 1 1 5)
do
echo $i
awk '{print $i}' inp_file > file_$i
done
Could you please try the following, written and tested with the shown samples in GNU awk. It doesn't use the close() function because your sample shows only 5 columns in Input_file. It also creates 2 awk variables (named var1 and var2) which are printed before the actual column values are written to each output file.
awk -v var1="s=5" -v var2="r=9" '
{
count++
for(i=1;i<=NF;i++){
outputFile="file"i".txt"
if(count==1){
print (var1 ORS var2) > (outputFile)
}
print $i > (outputFile)
}
}
' Input_file
In case you have more than 5 columns, it's better to close the output files as you go using awk's close() function (to avoid a "too many open files" error); use this then:
awk -v var1="s=5" -v var2="r=9" '
{
count++
for(i=1;i<=NF;i++){
outputFile="file"i".txt"
if(count==1){
print (var1 ORS var2) > (outputFile)
}
print $i >> (outputFile)
}
close(outputFile)
}
' Input_file
Pretty simple to do in one pass through the file with awk using its output redirection:
awk 'NR==1 { for (n = 1; n <= NF; n++) print "s=5\nr=9" > ("file_" n) }
{ for (n = 1; n <= NF; n++) print $n > ("file_" n) }' inp_file
With GNU awk to internally handle more than a dozen or so simultaneously open files:
NR == 1 {
    for (i=1; i<=NF; i++) {
        out[i] = "file" i ".txt"
        print "s=5" ORS "r=9" > out[i]
    }
}
{
    for (i=1; i<=NF; i++) {
        print $i > out[i]
    }
}
or with any awk just close them as you go:
NR == 1 {
    for (i=1; i<=NF; i++) {
        out[i] = "file" i ".txt"
        print "s=5" ORS "r=9" > out[i]
        close(out[i])
    }
}
{
    for (i=1; i<=NF; i++) {
        print $i >> out[i]
        close(out[i])
    }
}
split -nr/$(wc -w <(head -1 input) | cut -d' ' -f1) -t' ' --additional-suffix=".txt" -a4 --numeric-suffix=1 --filter "cat <(echo -e 's=5 r=9') - | tr ' ' '\n' >\$FILE" <(tr -s '\n' ' ' <input) file
This uses the nifty split command in a unique way to rearrange the columns. Hopefully it's faster than awk, although after spending a considerable amount of time coding it, testing it, and writing it up, I find that it may not be scalable enough for you since it requires a process per column, and many systems are limited in user processes (check ulimit -u). I submit it though because it may have some limited learning usefulness, to you or to a reader down the line.
Decoding:
split -- Divide a file up into subfiles. Normally this is by lines or by size but we're tweaking it to use columns.
-nr/$(...) -- Use round-robin output: Sort records (in our case, matrix cells) into the appropriate number of bins in a round-robin fashion. This is the key to making this work. The part in parens means, count (wc) the number of words (-w) in the first line (<(head -1 input)) of the input and discard the filename (cut -d' ' -f1), and insert the output into the command line.
-t' ' -- Use a single space as a record delimiter. This breaks the matrix cells into records for split to split on.
--additional-suffix=".txt" -- Append .txt to output files.
-a4 -- Use four-digit numbers; you probably won't get 1,000 files out of it but just in case ...
--numeric-suffix=1 -- Add a numeric suffix (normally it's a letter combination) and start at 1. This is pretty pedantic but it matches the example. If you have a huge number of columns, adjust the -a value to whatever length you need.
--filter ... -- Pipe each file through a shell command.
Shell command:
cat -- Concatenate the next two arguments.
<(echo -e 's=5 r=9') -- This means execute the echo command and use its output as the input to cat. We use a space instead of a newline to separate because we're converting spaces to newlines eventually and it is shorter and clearer to read.
- -- Read standard input as an argument to cat -- this is the binned data.
| tr ' ' '\n' -- Convert spaces between records to newlines, per the desired output example.
>\$FILE -- Write to the output file, which is stored in $FILE (but we have to quote it so the shell doesn't interpret it in the initial command).
Shell command over -- rest of split arguments:
<(tr -s '\n' ' ' < input) -- Use, as input to split, the example input file but convert newlines to spaces because we don't need them and we need a consistent record separator. The -s means only output one space between each record (just in case we got multiple ones on input).
file -- This is the prefix to the output filenames. The output in my example would be file0001.txt, file0002.txt, ..., file0005.txt.
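To sanity-check the result against the sample input, you could dump each generated file afterwards (the glob assumes the default prefix and suffixes used above):
for f in file0*.txt; do printf '== %s ==\n' "$f"; cat "$f"; done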

Extract orders and match to trades from two files

I have two attached files (orders1.txt and trades1.txt). I need to write a Bash script (possibly awk?) to extract orders and match them to trades.
The output should produce a report that prints comma separated values containing “ClientID, OrderID, Price, Volume”.
In addition to this for each client, I need to print the total volume and turnover (turnover is the subtotal of price * volume on each trade).
Can someone please help me with a bash script that will do the above using the attached files?
Any help would be greatly appreciated
orders1.txt
Entry Time, Client ID, Security ID, Order ID
25455410,DOLR,XGXUa,DOLR1435804437
25455410,XFKD,BUP3d,XFKD4746464646
25455413,QOXA,AIDl,QOXA7176202067
25455415,QOXA,IRUXb,QOXA6580494597
25455417,YXKH,OBWQs,YXKH4575139017
25455420,JBDX,BKNs,JBDX6760353333
25455428,DOLR,AOAb,DOLR9093170513
25455429,JBDX,QMP1Sh,JBDX2756804453
25455431,QOXA,QIP1Sh,QOXA6563975285
25455434,QOXA,XMUp,QOXA5569701531
25455437,XFKD,QLOJc,XFKD8793976660
25455438,YXKH,MRPp,YXKH2329856527
25455442,JBDX,YBPu,JBDX0100506066
25455450,QOXA,BUPYd,QOXA5832015401
25455451,QOXA,SIOQz,QOXA3909507967
25455451,DOLR,KID1Sh,DOLR2262067037
25455454,DOLR,JJHi,DOLR9923665017
25455461,YXKH,KBAPBa,YXKH2637373848
25455466,DOLR,EPYp,DOLR8639062962
25455468,DOLR,UQXKz,DOLR4349482234
25455474,JBDX,EFNs,JBDX7268036859
25455481,QOXA,XCB1Sh,QOXA4105943392
25455486,YXKH,XBAFp,YXKH0242733672
25455493,JBDX,BIF1Sh,JBDX2840241688
25455500,DOLR,QSOYp,DOLR6265839896
25455503,YXKH,IIYz,YXKH8505951163
25455504,YXKH,ZOIXp,YXKH2185348861
25455513,YXKH,MBOOp,YXKH4095442568
25455515,JBDX,P35p,JBDX9945514579
25455524,QOXA,YXOKz,QOXA1900595629
25455528,JBDX,XEQl,JBDX0126452783
25455528,XFKD,FJJMp,XFKD4392227425
25455535,QOXA,EZIp,QOXA4277118682
25455543,QOXA,YBPFa,QOXA6510879584
25455551,JBDX,EAMp,JBDX8924251479
25455552,QOXA,JXIQp,QOXA4360008399
25455554,DOLR,LISXPh,DOLR1853653280
25455557,XFKD,LOX14p,XFKD1759342196
25455558,JBDX,YXYb,JBDX8177118129
25455567,YXKH,MZQKl,YXKH6485420018
25455569,JBDX,ZPIMz,JBDX2010952336
25455573,JBDX,COPe,JBDX1612537068
25455582,JBDX,HFKAp,JBDX2409813753
25455589,QOXA,XFKm,QOXA9692126523
25455593,XFKD,OFYp,XFKD8556940415
25455601,XFKD,FKQLb,XFKD4861992028
25455606,JBDX,RIASp,JBDX0262502677
25455608,DOLR,HRKKz,DOLR1739013513
25455615,DOLR,ZZXp,DOLR6727725911
25455623,JBDX,CKQPp,JBDX2587184235
25455630,YXKH,ZLQQp,YXKH6492126889
25455632,QOXA,ORPz,QOXA3594333316
25455640,XFKD,HPIXSh,XFKD6780729432
25455648,QOXA,ABOJe,QOXA6661411952
25455654,XFKD,YLIp,XFKD6374702721
25455654,DOLR,BCFp,DOLR8012564477
25455658,JBDX,ZMDKz,JBDX6885176695
25455665,JBDX,CBOe,JBDX8942732453
25455670,JBDX,FRHMl,JBDX5424320405
25455679,DOLR,YFJm,DOLR8212353717
25455680,XFKD,XAFp,XFKD4132890550
25455681,YXKH,PBIBOp,YXKH6106504736
25455684,DOLR,IFDu,DOLR8034515043
25455687,JBDX,JACe,JBDX8243949318
25455688,JBDX,ZFZKz,JBDX0866225752
25455693,QOXA,XOBm,QOXA5011416607
25455694,QOXA,IDQe,QOXA7608439570
25455698,JBDX,YBIDb,JBDX8727773702
25455705,YXKH,MXOp,YXKH7747780955
25455710,YXKH,PBZRYs,YXKH7353828884
25455719,QOXA,QFDb,QOXA2477859437
25455720,XFKD,PZARp,XFKD4995735686
25455722,JBDX,ZLKKb,JBDX3564523161
25455730,XFKD,QFH1Sh,XFKD6181225566
25455733,JBDX,KWVJYc,JBDX7013108210
25455733,YXKH,ZQI1Sh,YXKH7095815077
25455739,YXKH,XIJp,YXKH0497248757
25455739,YXKH,ZXJp,YXKH5848658513
25455747,JBDX,XASd,JBDX4986246117
25455751,XFKD,XQIKz,XFKD5919379575
25455760,JBDX,IBXPb,JBDX8168710376
25455763,XFKD,EVAOi,XFKD8175209012
25455765,XFKD,JXKp,XFKD2750952933
25455773,XFKD,PTBAXs,XFKD8139382011
25455778,QOXA,XJp,QOXA8227838196
25455783,QOXA,CYBIp,QOXA2072297264
25455792,JBDX,PZI1Sh,JBDX7022115629
25455792,XFKD,XIKQl,XFKD6434550362
25455792,DOLR,YKPm,DOLR6394606248
25455796,QOXA,JXOXPh,QOXA9672544909
25455797,YXKH,YIWm,YXKH5946342983
25455803,YXKH,JZEm,YXKH5317189370
25455810,QOXA,OBMFz,QOXA0985316706
25455810,QOXA,DAJPp,QOXA6105975858
25455810,JBDX,FBBJl,JBDX1316207043
25455819,XFKD,YXKm,XFKD6946276671
25455821,YXKH,UIAUs,YXKH6010226371
25455828,DOLR,PTJXs,DOLR1387517499
25455836,DOLR,DCEi,DOLR3854078054
25455845,YXKH,NYQe,YXKH3727923537
25455853,XFKD,TAEc,XFKD5377097556
25455858,XFKD,LMBOXo,XFKD4452678489
25455858,XFKD,AIQXp,XFKD5727938304
trades1.txt
# The first 8 characters is execution time in microseconds since midnight
# The next 14 characters is the order ID
# The next 8 characters is the zero padded price
# The next 8 characters is the zero padded volume
25455416QOXA6580494597 0000013800001856
25455428JBDX6760353333 0000007000002458
25455434DOLR9093170513 0000000400003832
25455435QOXA6563975285 0000034700009428
25455449QOXA5569701531 0000007500009023
25455447YXKH2329856527 0000038300009947
25455451QOXA5832015401 0000039900006432
25455454QOXA3909507967 0000026900001847
25455456DOLR2262067037 0000034700002732
25455471YXKH2637373848 0000010900006105
25455480DOLR8639062962 0000027500001975
25455488JBDX7268036859 0000005200004986
25455505JBDX2840241688 0000037900002029
25455521YXKH4095442568 0000046400002150
25455515JBDX9945514579 0000040800005904
25455535QOXA1900595629 0000015200006866
25455533JBDX0126452783 0000001700006615
25455542XFKD4392227425 0000035500009948
25455570XFKD1759342196 0000025700007816
25455574JBDX8177118129 0000022400000427
25455567YXKH6485420018 0000039000008327
25455573JBDX1612537068 0000013700001422
25455584JBDX2409813753 0000016600003588
25455603XFKD4861992028 0000017600004552
25455611JBDX0262502677 0000007900003235
25455625JBDX2587184235 0000024300006723
25455658XFKD6374702721 0000046400009451
25455673JBDX6885176695 0000010900009258
25455671JBDX5424320405 0000005400003618
25455679DOLR8212353717 0000041100003633
25455697QOXA5011416607 0000018800007376
25455696QOXA7608439570 0000013000007463
25455716YXKH7747780955 0000037000006357
25455719QOXA2477859437 0000039300009840
25455723XFKD4995735686 0000045500009858
25455727JBDX3564523161 0000021300000639
25455742YXKH7095815077 0000023000003945
25455739YXKH5848658513 0000042700002084
25455766XFKD5919379575 0000022200003603
25455777XFKD8175209012 0000033300006350
25455788XFKD8139382011 0000034500007461
25455793QOXA8227838196 0000011600007081
25455784QOXA2072297264 0000017000004429
25455800XFKD6434550362 0000030000002409
25455801QOXA9672544909 0000039600001033
25455815QOXA6105975858 0000034800008373
25455814JBDX1316207043 0000026500005237
25455831YXKH6010226371 0000011400004945
25455838DOLR1387517499 0000046200006129
25455847YXKH3727923537 0000037400008061
25455873XFKD5727938304 0000048700007298
I have the following script:
#!/bin/bash
declare -A volumes
declare -A turnovers
declare -A orders
# Read the first file, remembering for each order the client id
while read -r line
do
    # Jump over comments
    if [[ ${line:0:1} == "#" ]] ; then continue; fi;
    details=($(echo $line | tr ',' " "))
    order_id=${details[3]}
    client_id=${details[1]}
    orders[$order_id]=$client_id
done < $1
echo "ClientID,OrderID,Price,Volume"
while read -r line
do
    # Jump over comments
    if [[ ${line:0:1} == "#" ]] ; then continue; fi;
    order_id=$(echo ${line:8:20} | tr -d '[:space:]')
    client_id=${orders[$order_id]}
    price=${line:28:8}
    volume=${line: -8}
    echo "$client_id,$order_id,$price,$volume"
    price=$(echo $price | awk '{printf "%d", $0}')
    volume=$(echo $volume | awk '{printf "%d", $0}')
    order_turnover=$(($price*$volume))
    old_turnover=${turnovers[$client_id]}
    [[ -z "$old_turnover" ]] && old_turnover=0
    total_turnover=$(($old_turnover+$order_turnover))
    turnovers[$client_id]=$total_turnover
    old_volumes=${volumes[$client_id]}
    [[ -z "$old_volumes" ]] && old_volumes=0
    total_volume=$((old_volumes+volume))
    volumes[$client_id]=$total_volume
done < $2
echo "ClientID,Volume,Turnover"
for client_id in "${!volumes[@]}"
do
    volume=${volumes[$client_id]}
    turnover=${turnovers[$client_id]}
    echo "$client_id,$volume,$turnover"
done
Can anyone think of anything more elegant?
Thanks in advance
C
Assumption 1: the two files are ordered, so line x represents an action that is older than x+1. If not, then further work is needed.
The assumption makes our work easier. Let's first change the delimiter of traders into a comma:
sed -i 's/ /,/g' traders.txt
This will be done in place for the sake of simplicity. So you now have traders, which is comma-separated, as is orders. This is Assumption 2.
Keep working on traders: split all columns and add titles [1]. More on the reasons why in a moment.
gawk -i inplace -v INPLACE_SUFFIX=.bak 'BEGINFILE{FS=",";OFS=",";print "execution time,order ID,price,volume";}{print substr($1,1,8),substr($1,9),substr($2,1,9),substr($2,9)}' traders.txt
Ugly but works. Now let's process your data using the following awk script:
BEGIN {
    FS=","
    OFS=","
}
{
    if (1 == NR) {
        getline line < TRADERS    # consume the title line of traders
        print "Client ID,Order ID,Price,Volume,Turnover"    # print the output header; remove this print if you don't want it
        getline line < TRADERS    # read the first data line
        split(line, transaction, ",")
        next
    }
    if (transaction[2] == $4) {
        print $2, $4, transaction[3], transaction[4], transaction[3]*transaction[4]
        getline line < TRADERS    # read a new data line
        split(line, transaction, ",")
    }
}
called by:
gawk -f script -v TRADERS=traders.txt orders.txt
And there you have it. Some caveats:
check the numbers, as gawk's implicit number conversion might not handle the zero-padded values the way you expect; there is a fix for that if needed (see the quick check after this list);
getline might misbehave if we run out of lines from traders. I haven't put in any check; that's up to you;
no control over timestamps; the match is based on Order ID alone.
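On the first point, awk coerces a zero-padded string to a decimal number once you force a numeric context, for example by adding 0 (leading zeros are not treated as octal in awk arithmetic); a quick check:
awk 'BEGIN { n = "00000042"; print n, n + 0, n + 7 }'
# prints: 00000042 42 49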
Output file:
Client ID,Order ID,Price,Volume,Turnover
QOXA,QOXA6580494597,000001380,00001856,2561280
JBDX,JBDX6760353333,000000700,00002458,1720600
DOLR,DOLR9093170513,000000040,00003832,153280
QOXA,QOXA6563975285,000003470,00009428,32715160
QOXA,QOXA5569701531,000000750,00009023,6767250
YXKH,YXKH2329856527,000003830,00009947,38097010
QOXA,QOXA5832015401,000003990,00006432,25663680
QOXA,QOXA3909507967,000002690,00001847,4968430
DOLR,DOLR2262067037,000003470,00002732,9480040
YXKH,YXKH2637373848,000001090,00006105,6654450
DOLR,DOLR8639062962,000002750,00001975,5431250
JBDX,JBDX7268036859,000000520,00004986,2592720
JBDX,JBDX2840241688,000003790,00002029,7689910
YXKH,YXKH4095442568,000004640,00002150,9976000
JBDX,JBDX9945514579,000004080,00005904,24088320
QOXA,QOXA1900595629,000001520,00006866,10436320
JBDX,JBDX0126452783,000000170,00006615,1124550
XFKD,XFKD4392227425,000003550,00009948,35315400
XFKD,XFKD1759342196,000002570,00007816,20087120
JBDX,JBDX8177118129,000002240,00000427,956480
YXKH,YXKH6485420018,000003900,00008327,32475300
JBDX,JBDX1612537068,000001370,00001422,1948140
JBDX,JBDX2409813753,000001660,00003588,5956080
XFKD,XFKD4861992028,000001760,00004552,8011520
JBDX,JBDX0262502677,000000790,00003235,2555650
JBDX,JBDX2587184235,000002430,00006723,16336890
XFKD,XFKD6374702721,000004640,00009451,43852640
JBDX,JBDX6885176695,000001090,00009258,10091220
JBDX,JBDX5424320405,000000540,00003618,1953720
DOLR,DOLR8212353717,000004110,00003633,14931630
QOXA,QOXA5011416607,000001880,00007376,13866880
QOXA,QOXA7608439570,000001300,00007463,9701900
YXKH,YXKH7747780955,000003700,00006357,23520900
QOXA,QOXA2477859437,000003930,00009840,38671200
XFKD,XFKD4995735686,000004550,00009858,44853900
JBDX,JBDX3564523161,000002130,00000639,1361070
YXKH,YXKH7095815077,000002300,00003945,9073500
YXKH,YXKH5848658513,000004270,00002084,8898680
XFKD,XFKD5919379575,000002220,00003603,7998660
XFKD,XFKD8175209012,000003330,00006350,21145500
XFKD,XFKD8139382011,000003450,00007461,25740450
QOXA,QOXA8227838196,000001160,00007081,8213960
QOXA,QOXA2072297264,000001700,00004429,7529300
XFKD,XFKD6434550362,000003000,00002409,7227000
QOXA,QOXA9672544909,000003960,00001033,4090680
QOXA,QOXA6105975858,000003480,00008373,29138040
JBDX,JBDX1316207043,000002650,00005237,13878050
YXKH,YXKH6010226371,000001140,00004945,5637300
DOLR,DOLR1387517499,000004620,00006129,28315980
YXKH,YXKH3727923537,000003740,00008061,30148140
XFKD,XFKD5727938304,000004870,00007298,35541260
[1]: requires gawk 4.1.0 or higher

optimizing awk command for large file

I have these functions to process a 2GB text file. I'm splitting it into 6 parts for simultaneous processing but it is still taking 4+ hours.
What else can I try to make the script faster?
A bit of detail:
I feed my input CSV into a while loop to be read line by line.
I grab the values of 4 fields from the CSV line in the read2col function.
The awk in my mainf function takes the values from read2col and does some arithmetic calculations. I round the result to 2 decimal places, then print the line to a text file.
Sample data:
"111","2018-08-24","01:21","ZZ","AAA","BBB","0","","","ZZ","ZZ111","ZZ110","2018-10-12","07:00","2018-10-12","08:05","2018-10-19","06:30","2018-10-19","09:35","ZZZZ","ZZZZ","A","B","146.00","222.26","76.26","EEE","abc","100.50","45.50","0","E","ESSENTIAL","ESSENTIAL","4","4","7","125","125"
Script:
read2col()
{
    is_one_way=$(echo "$line" | awk -F'","' '{print $7}')
    price_outbound=$(echo "$line" | awk -F'","' '{print $30}')
    price_exc=$(echo "$line" | awk -F'","' '{print $25}')
    tax=$(echo "$line" | awk -F'","' '{print $27}')
    price_inc=$(echo "$line" | awk -F'","' '{print $26}')
}

#################################################
# for each line in the csv
mainf()
{
    cd $infarepath

    while read -r line; do
        # read the value of csv fields into variables
        read2col

        if [[ $is_one_way == 0 ]]; then
            if [[ $price_outbound > 0 ]]; then
                # calculate price inc and print the entire line to txt file
                echo $line | awk -v CONVFMT='%.2f' -v pout=$price_outbound -v tax=$tax -F'","' 'BEGIN {OFS = FS} {$25=pout;$26=(pout+(tax / 2)); print}' >> "$csvsplitfile".tmp
            else
                # divide price exc and inc by 2 if price outbound is not greater than 0
                echo $line | awk -v CONVFMT='%.2f' -v pexc=$price_exc -v pinc=$price_inc -F'","' 'BEGIN {OFS = FS} {$25=(pexc / 2);$26=(pinc /2); print}' >> "$csvsplitfile".tmp
            fi
        else
            echo $line >> "$csvsplitfile".tmp
        fi
    done < $csvsplitfile
}
The first thing you should do is stop invoking six subshells for running awk for every single line of input. Let's do some quick, back-of-the-envelope calculations.
Assuming your input lines are about 292 characters (as per your example), a 2G file will consist of a little over 7.3 million lines. That means you are starting and stopping a whopping forty-four million processes.
And, while Linux admirably handles fork and exec as efficiently as possible, it's not without cost:
pax$ time for i in {1..44000000} ; do true ; done
real 1m0.946s
In addition, bash hasn't really been optimised for this sort of processing; its design leads to sub-optimal behaviour for this specific use case. For details on this, see this excellent answer over on one of our sister sites.
An analysis of the two methods of file processing (one program reading an entire file (each line has just hello on it), and bash reading it a line at a time) is shown below. The two commands used to get the timings were:
time ( cat somefile >/dev/null )
time ( while read -r x ; do echo $x >/dev/null ; done <somefile )
For varying file sizes (user+sys time, averaged over a few runs), it's quite interesting:
# of lines cat-method while-method
---------- ---------- ------------
1,000 0.375s 0.031s
10,000 0.391s 0.234s
100,000 0.406s 1.994s
1,000,000 0.391s 19.844s
10,000,000 0.375s 205.583s
44,000,000 0.453s 889.402s
From this, it appears that, while the while-read method can hold its own for smaller data sets, it really does not scale well.
Since awk itself has ways to do calculations and formatted output, processing the file with one single awk script, rather than your bash/multi-awk-per-line combination, will make the cost of creating all those processes and line-based delays go away.
This script would be a good first attempt, let's call it prog.awk:
BEGIN {
    FMT = "%.2f"
    OFS = FS
}
{
    isOneWay = $7
    priceOutbound = $30
    priceExc = $25
    tax = $27
    priceInc = $26

    if (isOneWay == 0) {
        if (priceOutbound > 0) {
            $25 = sprintf(FMT, priceOutbound)
            $26 = sprintf(FMT, priceOutbound + tax / 2)
        } else {
            $25 = sprintf(FMT, priceExc / 2)
            $26 = sprintf(FMT, priceInc / 2)
        }
    }
    print
}
You just run that single awk script with:
awk -F'","' -f prog.awk data.txt
With the test data you provided, here's the before and after, with markers for field numbers 25 and 26:
<-25-> <-26->
"111","2018-08-24","01:21","ZZ","AAA","BBB","0","","","ZZ","ZZ111","ZZ110","2018-10-12","07:00","2018-10-12","08:05","2018-10-19","06:30","2018-10-19","09:35","ZZZZ","ZZZZ","A","B","146.00","222.26","76.26","EEE","abc","100.50","45.50","0","E","ESSENTIAL","ESSENTIAL","4","4","7","125","125"
"111","2018-08-24","01:21","ZZ","AAA","BBB","0","","","ZZ","ZZ111","ZZ110","2018-10-12","07:00","2018-10-12","08:05","2018-10-19","06:30","2018-10-19","09:35","ZZZZ","ZZZZ","A","B","100.50","138.63","76.26","EEE","abc","100.50","45.50","0","E","ESSENTIAL","ESSENTIAL","4","4","7","125","125"

Count total number of pattern between two pattern (using sed if possible) in Linux

I have to count all the '=' characters between two patterns, i.e. between '{' and '}'.
Sample:
{
100="1";
101="2";
102="3";
};
{
104="1,2,3";
};
{
105="1,2,3";
};
Expected Output:
3
1
1
A very cryptic perl answer:
perl -nE 's/\{(.*?)\}/ say ($1 =~ tr{=}{=}) /ge'
The tr function returns the number of characters transliterated.
With the new requirements, we can make a couple of small changes:
perl -0777 -nE 's/\{(.*?)\}/ say ($1 =~ tr{=}{=}) /ges'
-0777 reads the entire file/stream into a single string
the s flag to the s/// function allows . to handle newlines like a plain character.
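As a quick check, feeding the question's sample (saved here as a hypothetical blocks.txt) to the slurping version prints one count per {...} block:
$ perl -0777 -nE 's/\{(.*?)\}/ say ($1 =~ tr{=}{=}) /ges' blocks.txt
3
1
1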
Perl to the rescue:
perl -lne '$c = 0; $c += ("$1" =~ tr/=//) while /\{(.*?)\}/g; print $c' < input
-n reads the input line by line
-l adds a newline to each print
/\{(.*?)\}/g is a regular expression. The ? makes the asterisk non-greedy (frugal), i.e. it matches the shortest possible string.
The (...) parentheses create a capture group, referred to as $1.
tr is normally used to transliterate (i.e. replace one character by another), but here it just counts the number of equal signs.
+= adds the number to $c.
Awk is here too
grep -o '{[^}]\+}'|awk -v FS='=' '{print NF-1}'
example
echo '{100="1";101="2";102="3";};
{104="1,2,3";};
{105="1,2,3";};'|grep -o '{[^}]\+}'|awk -v FS='=' '{print NF-1}'
output
3
1
1
First some test input (a line with = signs both outside the curly brackets and inside the content, one line without brackets, and one with only the two brackets):
echo '== {100="1";101="2";102="3=3=3=3";} =;
a=b
{c=d}
{}'
Handle line without brackets (put a dummy char so you will not end up with an empty string)
sed -e 's/^[^{]*$/x/'
Handle line without equal sign (put a dummy char so you will not end up with an empty string)
sed -e 's/{[^=]*}/x/'
Remove stuff outside the brackets
sed -e 's/.*{\(.*\)}/\1/'
Remove stuff inside the double quotes (do not count fields there)
sed -e 's/"[^"]*"//g'
Use @repzero's method to count equal signs
awk -F "=" '{print NF-1}'
Combine stuff
echo -e '{100="1";101="2";102="3";};\na=b\n{c=d}\n{}' |
sed -e 's/^[^{]*$/x/' -e 's/{[^=]*}/x/' -e 's/.*{\(.*\)}/\1/' -e 's/"[^"]*"//g' |
awk -F "=" '{print NF-1}'
The ugly temporary field x and the {} replacement can be handled inside awk instead:
echo -e '= {100="1";101="2=2=2=2";102="3";};\na=b\n{c=d}\n{}' |
sed -e 's/^[^{]*$//' -e 's/.*{\(.*\)}/\1/' -e 's/"[^"]*"//g' |
awk -F "=" '{if (NF>0) c=NF-1; else c=0; print c}'
or shorter
echo -e '= {100="1";101="2=2=2=2";102="3";};\na=b\n{c=d}\n{}' |
sed -e 's/^[^{]*$//' -e 's/.*{\(.*\)}/\1/' -e 's/"[^"]*"//g' |
awk -F "=" '{print (NF>0) ? NF-1 : 0; }'
No harder sed than done ... in.
Restricting this answer to the environment as tagged, namely:
linux shell unix sed wc
will actually not require the use of wc (or awk, perl, or any other app.).
Though echo is used, a file source can easily exclude its use.
As for bash, it is the shell.
The actual environment used is documented at the end.
NB. Exploitation of GNU specific extensions has been used for brevity
but appropriately annotated to make a more generic implementation.
Also brace bracketed { text } will not include braces in the text.
It is implicit that such braces should be present as {} pairs but
the text src. dangling brace does not directly violate this tenet.
This is a foray into the world of `sed`'ing to gain some fluency in its use for other purposes.
The ideas expounded upon here are used to cross-pollinate another SO problem solution in order
to acquire more familiarity with vetting vagaries of vernacular version variances. Consequently
this pedantic exercise hopefully helps with the pedagogy of others beyond personal edification.
To test easily, at least in the environment noted below, judiciously highlight the appropriate
code section, carefully excluding a dangling pipe |, and then, to a CLI command line interface
drag & drop, copy & paste or use middle click to enter the code.
The other SO problem: linux - Is it possible to do simple arithmetic in sed addresses?
# _______________________________ always needed ________________________________
echo -e '\n
\n = = = {\n } = = = each = is outside the braces
\na\nb\n { } so therefore are not counted
\nc\n { = = = = = = = } while the ones here do count
{\n100="1";\n101="2";\n102="3";\n};
\n {\n104="1,2,3";\n};
a\nb\nc\n {\n105="1,2,3";\n};
{ dangling brace ignored junk = = = \n' |
# _____________ prepatory conditioning needed for final solutions _____________
sed ' s/{/\n{\n/g;
s/}/\n}\n/g; ' | # guarantee but one brace to a line
sed -n '/{/ h; # so sed addressing can "work" here
/{/,/}/ H; # use hHold buffer for only { ... }
/}/ { x; s/[^=]*//g; p } ' | # then make each {} set a line of =
# ____ stop code hi-lite selection in ^--^ here include quote not pipe ____
# ____ outputs the following exclusive of the shell " # " comment quotes _____
#
#
# =======
# ===
# =
# =
# _________________________________________________________________________
# ____________________________ "simple" GNU solution ____________________________
sed -e '/^$/ { s//0/;b }; # handle null data as 0 case: next!
s/=/\n/g; # to easily count an = make it a nl
s/\n$//g; # echo adds an extra nl - delete it
s/.*/echo "&" | sed -n $=/; # sed = command w/ $ counts last nl
e ' # who knew only GNU say you ah phoo
# 0
# 0
# 7
# 3
# 1
# 1
# _________________________________________________________________________
# ________________________ generic incomplete "solution" ________________________
sed -e '/^$/ { s//echo 0/;b }; # handle null data as 0 case: next!
s/=$//g; # echo adds an extra nl - delete it
s/=/\\\\n/g; # to easily count an = make it a nl
s/.*/echo -e & | sed -n $=/; '
# _______________________________________________________________________________
The paradigm used for the algorithm is instigated by the prolegomena study below.
The idea is to isolate groups of = signs between { } braces for counting.
These are found and each group is put on a separate line with ALL other adorning characters removed.
It is noted that sed can easily "count", actually enumerate, nl or \n line ends via =.
The first "solution" uses these sed commands:
print
branch w/o label starts a new cycle
h/Hold for filling this sed buffer
exchange to swap the hold and pattern buffers
= to enumerate the current sed input line
substitute s/.../.../; with global flag s/.../.../g;
and most particularly the GNU specific
evaluate (or execute; I can not remember the actual mnemonic, but they are effectively synonymous)
The GNU specific execute command is avoided in the generic code. It does not print the answer but
instead produces code that will print the answer. Run it to observe. To fully automate this, many
mechanisms can be used, not the least of which is the sed write command to put these lines in a
shell file to be executed, or even to embed the output in bash evaluation parentheses $( ) etc.
Note also that various sed example scripts can "count" and these too can be used efficaciously.
The interested reader can entertain these other pursuits.
prolegomena:
concept from counting # of lines between braces
sed -n '/{/=;/}/=;'
to
sed -n '/}/=;/{/=;' |
sed -n 'h;n;G;s/\n/ - /;
2s/^/ Between sets of {} \n the nl # count is\n /;
2!s/^/ /;
p'
testing "done in":
linuxuser@ubuntu:~$ lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description: Ubuntu 18.04.2 LTS
Release: 18.04
Codename: bionic
linuxuser@ubuntu:~$ sed --version -----> sed (GNU sed) 4.4
And for giggles an awk-only alternative:
echo '{
> 100="1";
> 101="2";
> 102="3";
> };
> {
> 104="1,2,3";
> };
> {
> 105="1,2,3";
> };' | awk 'BEGIN{RS="\n};";FS="\n"}{c=gsub(/=/,""); if(NF>2){print c}}'
3
1
1

Bash - Swap Values in Column

I have some CSV/tabular data in a file, like so:
1,7,3,2
8,3,8,0
4,9,5,3
8,5,7,3
5,6,1,9
(They're not always numbers, just random comma-separated values. Single-digit numbers are easier for an example, though.)
I want to shuffle a random 40% of any of the columns. As an example, say the 3rd one. So perhaps 3 and 1 get swapped with each other. Now the third column is:
1 << Came from the last position
8
5
7
3 << Came from the first position
I am trying to do this in place in a file from within a bash script that I am working on, and I am not having much luck. I keep wandering down some pretty crazy and fruitless grep rabbit holes that leave me thinking that I'm going the wrong way (the constant failure is what's tipping me off).
I tagged this question with a litany of things because I'm not entirely sure which tool(s) I should even be using for this.
Edit: I'm probably going to end up accepting Rubens' answer, however wacky it is, because it directly contains the swapping concept (which I guess I could have emphasized more in the original question), and it allows me to specify a percentage of the column for swapping. It also happens to work, which is always a plus.
For someone who doesn't need this, and just wants a basic shuffle, Jim Garrison's answer also works (I tested it).
A word of warning, however, on Rubens' solution. I took this:
for (i = 1; i <= NF; ++i) {
delim = (i != NF) ? "," : "";
...
}
printf "\n";
removed the printf "\n"; and moved the newline character up like this:
for (i = 1; i <= NF; ++i) {
delim = (i != NF) ? "," : "\n";
...
}
because just having "" on the else case was causing awk to write broken characters at the end of each line (\00). At one point, it even managed to replace my entire file with Chinese characters. Although, honestly, this probably involved me doing something extra stupid on top of this problem.
This will work for a specifically designated column, but should be enough to point you in the right direction. This works on modern bash shells including Cygwin:
paste -d, <(cut -d, -f1-2 test.dat) <(cut -d, -f3 test.dat|shuf) <(cut -d, -f4- test.dat)
The operative feature is "process substitution".
The paste command joins files horizontally, and the three pieces are split from the original file via cut, with the second piece (the column to be randomized) run through the shuf command to reorder the lines. Here's the output from running it a couple of times:
$ cat test.dat
1,7,3,2
8,3,8,0
4,9,5,3
8,5,7,3
5,6,1,9
$ paste -d, <(cut -d, -f1-2 test.dat) <(cut -d, -f3 test.dat|shuf) <(cut -d, -f4- test.dat)
1,7,1,2
8,3,8,0
4,9,7,3
8,5,3,3
5,6,5,9
$ paste -d, <(cut -d, -f1-2 test.dat) <(cut -d, -f3 test.dat|shuf) <(cut -d, -f4- test.dat)
1,7,8,2
8,3,1,0
4,9,3,3
8,5,7,3
5,6,5,9
Algorithm:
create a vector with n pairs, from 1 to number of lines, and the respective value in the line (for the selected column), and then sort it randomly;
find how many lines should be randomized: num_random = percentage * num_lines / 100;
select the first num_random entries from your randomized vector;
you may sort the selected lines randomly, but it should be already randomly sorted;
printing output:
i = 0
for num_line, value in column; do
    if num_line not in random_vector:
        print value;                 # printing non-randomized value
    else:
        print random_vector[i];      # randomized entry
        i++;
done
Implementation:
#! /bin/bash

infile=$1
col=$2

n_lines=$(wc -l < ${infile})
prob=$(bc <<< "$3 * ${n_lines} / 100")

# Selected lines
tmp=$(tempfile)
paste -d ',' <(seq 1 ${n_lines}) <(cut -d ',' -f ${col} ${infile}) \
    | sort -R | head -n ${prob} > ${tmp}

# Rewriting file
awk -v "col=$col" -F "," '
    (NR == FNR) {id[$1] = $2; next}
    (FNR == 1) {
        i = c = 1;
        for (v in id) {value[i] = id[v]; ++i;}
    }
    {
        for (i = 1; i <= NF; ++i) {
            delim = (i != NF) ? "," : "";
            if (i != col) {printf "%s%c", $i, delim; continue;}
            if (FNR in id) {printf "%s%c", value[c], delim; c++;}
            else {printf "%s%c", $i, delim;}
        }
        printf "\n";
    }
' ${tmp} ${infile}

rm ${tmp}
If you want something close to in-place editing, you may pipe the output back into the input file using sponge (from moreutils).
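For example (a sketch, assuming sponge is installed):
./script.sh infile 3 40 | sponge infile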
Execution:
To execute, simply use:
$ ./script.sh <inpath> <column> <percentage>
As in:
$ ./script.sh infile 3 40
1,7,3,2
8,3,8,0
4,9,1,3
8,5,7,3
5,6,5,9
Conclusion:
This allows you to select the column, randomly sort a percentage of entries in that column, and replace the new column in the original file.
This script stands as proof, like no other, that not only is shell scripting extremely entertaining, but also that there are cases where it definitely should not be used. (:
I'd use a two-pass approach that starts by getting a count of the number of lines and reading the file into an array, then uses awk's rand() function to generate random numbers to identify the lines you'll change, and rand() again to determine which pairs of those lines you will swap, swapping the array elements before printing. Something like this PSEUDO-CODE, rough algorithm (a fleshed-out sketch follows it):
awk -F, -v pct=40 -v col=3 '
NR == FNR {
    array[++totNumLines] = $0
    next
}
FNR == 1 {
    pctNumLines = totNumLines * pct / 100
    srand()
    for (i=1; i<=(pctNumLines / 2); i++) {
        oldLineNr = rand() * some factor to produce a line number that's in the 1 to totNumLines range but is not already recorded as processed in the "swapped" array.
        newLineNr = ditto plus must not equal oldLineNr
        swap field $col between array[oldLineNr] and array[newLineNr]
        swapped[oldLineNr]
        swapped[newLineNr]
    }
    next
}
{ print array[FNR] }
' "$file" "$file" > tmp &&
mv tmp "$file"
