I have a file with unknown number of lines(but even number of lines). I want to print them side by side based on total number of lines in that file. For example, I have a file with 16 lines like below:
asdljsdbfajhsdbflakjsdff235
asjhbasdjbfajskdfasdbajsdx3
asjhbasdjbfajs23kdfb235ajds
asjhbasdjbfajskdfbaj456fd3v
asjhbasdjb6589fajskdfbaj235
asjhbasdjbfajs54kdfbaj2f879
asjhbasdjbfajskdfbajxdfgsdh
asjhbasdf3709ddjbfajskdfbaj
100
100
150
125
trh77rnv9vnd9dfnmdcnksosdmn
220
225
sdkjNSDfasd89asdg12asdf6asdf
So now i want to print them side by side. as they have 16 lines in total, I am trying to get the results 8:8 like below
asdljsdbfajhsdbflakjsdff235 100
asjhbasdjbfajskdfasdbajsdx3 100
asjhbasdjbfajs23kdfb235ajds 150
asjhbasdjbfajskdfbaj456fd3v 125
asjhbasdjb6589fajskdfbaj235 trh77rnv9vnd9dfnmdcnksosdmn
asjhbasdjbfajs54kdfbaj2f879 220
asjhbasdjbfajskdfbajxdfgsdh 225
asjhbasdf3709ddjbfajskdfbaj sdkjNSDfasd89asdg12asdf6asdf
paste command did not work for me exactly, (paste - - - - - - - -< file1) nor the awk command that I used awk '{printf "%s" (NR%2==0?RS:FS),$1}'
Note: The number of lines in a file dynamic. The only known thing in my scenario is, they are even number all the time.
If you have the memory to hash the whole file ("max" below):
$ awk '{
a[NR]=$0 # hash all the records
}
END { # after hashing
mid=int(NR/2) # compute the midpoint, int in case NR is uneven
for(i=1;i<=mid;i++) # iterate from start to midpoint
print a[i],a[mid+i] # output
}' file
If you have the memory to hash half of the file ("mid"):
$ awk '
NR==FNR { # on 1st pass hash second half of records
if(FNR>1) { # we dont need the 1st record ever
a[FNR]=$0 # hash record
if(FNR%2) # if odd record
delete a[int(FNR/2)+1] # remove one from the past
}
next
}
FNR==1 { # on the start of 2nd pass
if(NR%2==0) # if record count is uneven
exit # exit as there is always even count of them
offset=int((NR-1)/2) # compute offset to the beginning of hash
}
FNR<=offset { # only process the 1st half of records
print $0,a[offset+FNR] # output one from file, one from hash
next
}
{ # once 1st half of 2nd pass is finished
exit # just exit
}' file file # notice filename twice
And finally if you have awk compiled into a worms brain (ie. not so much memory, "min"):
$ awk '
NR==FNR { # just get the NR of 1st pass
next
}
FNR==1 {
mid=(NR-1)/2 # get the midpoint
file=FILENAME # filename for getline
while(++i<=mid && (getline line < file)>0); # jump getline to mid
}
{
if((getline line < file)>0) # getline read from mid+FNR
print $0,line # output
}' file file # notice filename twice
Standard disclaimer on getline and no real error control implemented.
Performance:
I seq 1 100000000 > file and tested how the above solutions performed. Output was > /dev/null but writing it to a file lasted around 2 s longer. max performance is so-so as the mem print was 88 % of my 16 GB so it might have swapped. Well, I killed all the browsers and shaved off 7 seconds for the real time of max.
+------------------+-----------+-----------+
| which | | |
| min | mid | max |
+------------------+-----------+-----------+
| time | | |
| real 1m7.027s | 1m30.146s | 0m48.405s |
| user 1m6.387s | 1m27.314 | 0m43.801s |
| sys 0m0.641s | 0m2.820s | 0m4.505s |
+------------------+-----------+-----------+
| mem | | |
| 3 MB | 6.8 GB | 13.5 GB |
+------------------+-----------+-----------+
Update:
I tested #DavidC.Rankin's and #EdMorton's solutions and they ran, respectively:
real 0m41.455s
user 0m39.086s
sys 0m2.369s
and
real 0m39.577s
user 0m37.037s
sys 0m2.541s
Mem print was about the same as my mid had. It pays to use the wc, it seems.
$ pr -2t file
asdljsdbfajhsdbflakjsdff235 100
asjhbasdjbfajskdfasdbajsdx3 100
asjhbasdjbfajs23kdfb235ajds 150
asjhbasdjbfajskdfbaj456fd3v 125
asjhbasdjb6589fajskdfbaj235 trh77rnv9vnd9dfnmdcnksosdmn
asjhbasdjbfajs54kdfbaj2f879 220
asjhbasdjbfajskdfbajxdfgsdh 225
asjhbasdf3709ddjbfajskdfbaj sdkjNSDfasd89asdg12asdf6asdf
if you want just one space between columns, change to
$ pr -2ts' ' file
You can also do it with awk simply by storing the first-half of the lines in an array and then concatenating the second half to the end, e.g.
awk -v nlines=$(wc -l < file) -v j=0 'FNR<=nlines/2{a[++i]=$0; next} j<i{print a[++j],$1}' file
Example Use/Output
With your data in file, then
$ awk -v nlines=$(wc -l < file) -v j=0 'FNR<=nlines/2{a[++i]=$0; next} j<i{print a[++j],$1}' file
asdljsdbfajhsdbflakjsdff235 100
asjhbasdjbfajskdfasdbajsdx3 100
asjhbasdjbfajs23kdfb235ajds 150
asjhbasdjbfajskdfbaj456fd3v 125
asjhbasdjb6589fajskdfbaj235 trh77rnv9vnd9dfnmdcnksosdmn
asjhbasdjbfajs54kdfbaj2f879 220
asjhbasdjbfajskdfbajxdfgsdh 225
asjhbasdf3709ddjbfajskdfbaj sdkjNSDfasd89asdg12asdf6asdf
Extract the first half of the file and the last half of the file and merge the lines:
paste <(head -n $(($(wc -l <file.txt)/2)) file.txt) <(tail -n $(($(wc -l <file.txt)/2)) file.txt)
You can use columns utility from autogen:
columns -c2 --by-columns file.txt
You can use column, but the count of columns is calculated in a strange way from the count of columns of your terminal. So assuming your lines have 28 characters, you also can:
column -c $((28*2+8)) file.txt
I do not want to solve this, but if I were you:
wc -l file.txt
gives number of lines
echo $(($(wc -l < file.txt)/2))
gives a half
head -n $(($(wc -l < file.txt)/2)) file.txt > first.txt
tail -n $(($(wc -l < file.txt)/2)) file.txt > last.txt
create file with first half and last half of the original file. Now you can merge those files together side by side as it was described here .
Here is my take on it using the bash shell wc(1) and ed(1)
#!/usr/bin/env bash
array=()
file=$1
total=$(wc -l < "$file")
half=$(( total / 2 ))
plus1=$(( half + 1 ))
for ((m=1;m<=half;m++)); do
array+=("${plus1}m$m" "${m}"'s/$/ /' "${m}"',+1j')
done
After all of that if just want to print the output to stdout. Add the line below to the script.
printf '%s\n' "${array[#]}" ,p Q | ed -s "$file"
If you want to write the changes directly to the file itself, Use this code instead below the script.
printf '%s\n' "${array[#]}" w | ed -s "$file"
Here is an example.
printf '%s\n' {1..10} > file.txt
Now running the script against that file.
./myscript file.txt
Output
1 6
2 7
3 8
4 9
5 10
Or using bash4+ feature mapfile aka readarray
Save the file in an array named array.
mapfile -t array < file.txt
Separate the files.
left=("${array[#]::((${#array[#]} / 2))}") right=("${array[#]:((${#array[#]} / 2 ))}")
loop and print side-by-side
for i in "${!left[#]}"; do
printf '%s %s\n' "${left[i]}" "${right[i]}"
done
What you said The only known thing in my scenario is, they are even number all the time. That solution should work.
Related
I'm studying bash scripting and I'm stuck fixing an exercise of this site: https://ryanstutorials.net/bash-scripting-tutorial/bash-variables.php#activities
The task is to write a bash script to output a random word from a dictionary whose length is equal to the number supplied as the first command line argument.
My idea was to create a sub-dictionary, assign each word a number line, select a random number from those lines and filter the output, which worked for a similar simpler script, but not for this.
This is the code I used:
6 DIC='/usr/share/dict/words'
7 SUBDIC=$( egrep '^.{'$1'}$' $DIC )
8
9 MAX=$( $SUBDIC | wc -l )
10 RANDRANGE=$((1 + RANDOM % $MAX))
11
12 RWORD=$(nl "$SUBDIC" | grep "\b$RANDRANGE\b" | awk '{print $2}')
13
14 echo "Random generated word from $DIC which is $1 characters long:"
15 echo $RWORD
and this is the error I get using as input "21":
bash script.sh 21
script.sh: line 9: counterintelligence's: command not found
script.sh: line 10: 1 + RANDOM % 0: division by 0 (error token is "0")
nl: 'counterintelligence'\''s'$'\n''electroencephalograms'$'\n''electroencephalograph': No such file or directory
Random generated word from /usr/share/dict/words which is 21 characters long:
I tried in bash to split the code in smaller pieces obtaining no error (input=21):
egrep '^.{'21'}$' /usr/share/dict/words | wc -l
3
but once in the script line 9 and 10 give error.
Where do you think is the error?
problems
SUBDIC=$( egrep '^.{'$1'}$' $DIC ) will store all words of the given length in the SUBDIC variable, so it's content is now something like foo bar baz.
MAX=$( $SUBDIC | ... ) will try to run the command foo bar baz which is obviously bogus; it should be more like MAX=$(echo $SUBDIC | ... )
MAX=$( ... | wc -l ) will count the lines; when using the above mentioned echo $SUBDIC you will have multiple words, but all in one line...
RWORD=$(nl "$SUBDIC" | ...) same problem as above: there's only one line (also note #armali's answer that nl requires a file or stdin)
RWORD=$(... | grep "\b$RANDRANGE\b" | ...) might match the dictionary entry catch 22
likely RWORD=$(... | awk '{print $2}') won't handle lines containing spaces
a simple solution
doing a "random sort" over the all the possible words and taking the first line, should be sufficient:
egrep "^.{$1}$" "${DIC}" | sort -R | head -1
MAX=$( $SUBDIC | wc -l ) - A pipe is used for connecting a command's output, while $SUBDIC isn't a command; an appropriate syntax is MAX=$( <<<$SUBDIC wc -l ).
nl "$SUBDIC" - The argument to nl has to be a filename, which "$SUBDIC" isn't; an appropriate syntax is nl <<<"$SUBDIC".
This code will do it. My test dictionary of words is in file file. It's a good idea to get all words of a given length first but put them in an array not in var. And then get a random index and echo it.
dic=( $(sed -n "/^.\{$1\}$/p" file) )
ind=$((0 + RANDOM % ${#dic[#]}))
echo ${dic[$ind]}
I am also doing this activity and I create one simple solution.
I create the script.
#!/bin/bash
awk "NR==$1 {print}" /usr/share/dict/words
Here if you want a random string then you have to run the script as per the below command from the terminal.
./script.sh $RANDOM
If you want the print any specific number string then you can run as per the below command from the terminal.
./script.sh 465
cat /usr/share/dict/american-english | head -n $RANDOM | tail -n 1
$RANDOM - Returns a different random number each time is it referred to.
this simple line outputs random word from the mentioned dictionary.
Otherwise as umläute mentined you can do:
cat /usr/share/dict/american-english | sort -R | head -1
(Similar to How to interleave lines from two text files but for a single input. Also similar to Sort lines by group and column but interleaving or randomizing versus sorting.)
I have a set of systems and tasks in two columns, SYSTEM,TASK:
alpha,90198500
alpha,93082105
alpha,30184438
beta,21700055
beta,33452909
beta,40850198
beta,82645731
gamma,64910850
I want to distribute the tasks to each system in a balanced way. The ideal case where each system has the same number of tasks would be round-robin, one alpha then one beta then one gamma and repeat until finished.
I get the whole list of tasks + systems at once, so I don't need to keep any state
The list of systems is not static, on the order of N=100
The total number of tasks is variable, on the order of N=500
The number of tasks for each system is not guaranteed to be equal
Hard / absolute interleaving isn't required, as long as there aren't two of the same system twice in a row
The same task may show up more than once, but not for the same system
Input format / delimiter can be changed
I can solve this well enough with some fancy scripting to split the data into multiple files (grep ^alpha, input > alpha.txt etc) and then recombine them with paste or similar, but I'd like to use a single command or set of pipes to run it without intermediate files or a proper scripting language. Just using sort -R gets me 95% of the way there, but I end up with 2 tasks for the same system in a row almost every time, and sometimes 3 or more depending on the initial distribution.
edit:
To clarify, any output should not have the same system on two lines in a row. All system,task pairs must be preserved, you can't move a task from one system to another - that'd make this really easy!
One of several possible sample outputs:
beta,40850198
alpha,90198500
beta,82645731
alpha,93082105
gamma,64910850
beta,21700055
alpha,30184438
beta,33452909
We start with by answering the underlying theoretical problem. The problem is not as simple as it seems. Feel free to implement a script based on this answer.
The blocks formatted as quotes are not quotes. I just wanted to highlight them to improve navigation in this rather long answer.
Theoretical Problem
Given a finite set of letters L with frequencies f : L→ℕ0, find a sequence of letters such that every letter ℓ appears exactly f(ℓ) times and adjacent elements of the sequence are always different.
Example
L = {a,b,c} with f(a)=4, f(b)=2, f(c)=1
ababaca, acababa, and abacaba are all valid solutions.
aaaabbc is invalid – Some adjacent elements are equal, for instance aa or bb.
ababac is invalid – The letter a appears 3 times, but its frequency is f(a)=4
cababac is invalid – The letter c appears 2 times, but its frequency is f(c)=1
Solution
The following approach produces a valid sequence if and only if there exists a solution.
Sort the letters by their frequencies.
For ease of notation we assume, without loss of generality, that f(a) ≥ f(b) ≥ f(c) ≥ ... ≥ 0.
Note: There exists a solution if and only if f(a) ≤ 1 + ∑ℓ≠a f(ℓ).
Write down a sequence s of f(a) many a.
Add the remaining letters into a FIFO working list, that is:
(Don't add any a)
First add f(b) many b
Then f(c) many c
and so on
Iterate from left to right over the sequence s and insert after each element a letter from the working list. Repeat this step until the working list is empty.
Example
L = {a,b,c,d} with f(a)=5, f(b)=5, f(c)=4, f(d)=2
The letters are already sorted by their frequencies.
s = aaaaa
workinglist = bbbbbccccdd. The leftmost entry is the first one.
We iterate from left to right. The places where we insert letters from the working list are marked with an _ underscore.
s = a_a_a_a_a_ workinglist = bbbbbccccdd
s = aba_a_a_a_ workinglist = bbbbccccdd
s = ababa_a_a_ workinglist = bbbccccdd
...
s = ababababab workinglist = ccccdd
⚠️ We reached the end of sequence s. We repeat step 4.
s = a_b_a_b_a_b_a_b_a_b_ workinglist = ccccdd
s = acb_a_b_a_b_a_b_a_b_ workinglist = cccdd
...
s = acbcacb_a_b_a_b_a_b_ workinglist = cdd
s = acbcacbca_b_a_b_a_b_ workinglist = dd
s = acbcacbcadb_a_b_a_b_ workinglist = d
s = acbcacbcadbda_b_a_b_ workinglist =
⚠️ The working list is empty. We stop.
The final sequence is acbcacbcadbdabab.
Implementation In Bash
Here is a bash implementation of the proposed approach that works with your input format. Instead of using a working list each line is labeled with a binary floating point number specifying the position of that line in the final sequence. Then the lines are sorted by their labels. That way we don't have to use explicit loops. Intermediate results are stored in variables. No files are created.
#! /bin/bash
inputFile="$1" # replace $1 by your input file or call "./thisScript yourFile"
inputBySys="$(sort "$inputFile")"
sysFreqBySys="$(cut -d, -f1 <<< "$inputBySys" | uniq -c | sed 's/^ *//;s/ /,/')"
inputBySysFreq="$(join -t, -1 2 -2 1 <(echo "$sysFreqBySys") <(echo "$inputBySys") | sort -t, -k2,2nr -k1,1)"
maxFreq="$(head -n1 <<< "$inputBySysFreq" | cut -d, -f2)"
lineCount="$(wc -l <<< "$inputBySysFreq")"
increment="$(awk '{l=log($1/$2)/log(2); l=int(l)-(int(l)>l); print 2^l}' <<< "$maxFreq $lineCount")"
seq="$({ echo obase=2; seq 0 "$increment" "$maxFreq" | head -n-1; } | bc |
awk -F. '{sub(/0*$/,"",$2); print 0+$1 "," $2 "," length($2)}' |
sort -snt, -k3,3 -k2,2 | head -n "$lineCount")"
paste -d, <(echo "$seq") <(echo "$inputBySysFreq") | sort -nt, -k1,1 -k2,2 | cut -d, -f4,6
This solution could fail for very long input files due to the limited precision of floating point numbers in seq and awk.
Well, this is what I've come up with:
args=()
while IFS=' ' read -r _ name; do
# add a file redirection with grepped certain SYSTEM only for later eval
args+=("<(grep '^$name,' file)")
done < <(
# extract SYSTEM only
<file cut -d, -f1 |
#sort with the count
sort | uniq -c | sort -nr
)
# this is actually safe, because we control all arguments
eval paste -d "'\\n'" "${args[#]}" |
# paste will insert empty lines when the list ended - remove them
sed '/^$/d'
First, I extract and sort the SYSTEM names in the order which occurs the most often to be first. So for the input example we get:
4 beta
3 alpha
1 gamme
Then for each such name I add the proper string <(grep '...' file) to arguments list witch will be later evalulated.
Then I evalulate the call to paste <(grep ...) <(grep ...) <(grep ...) ... with newline as the paste's delimeter. I remove empty lines with simple sed call.
The output for the input provided:
beta,21700055
alpha,90198500
gamma,64910850
beta,33452909
alpha,93082105
beta,40850198
alpha,30184438
beta,82645731
Converted to a fancy oneliner, with substituting the while read with command substitution and sed. Got safe with inputfile naming with printf "%q" "$inputfile" and double quoting inside sed regex.
inputfile="file"
fieldsep=","
eval paste -d '"\\n"' "$(
cut -d "$fieldsep" -f1 "$inputfile" |
sort | uniq -c | sort -nr |
sed 's/^[[:space:]]*[0-9]\+[[:space:]]*\(.*\)$/<(grep '\''^\1'"$fieldsep"\'' "'"$(printf "%q" "$inputfile")"'")/' |
tr '\n' ' '
)" |
sed '/^$/d'
inputfile="inputfile"
fieldsep=","
# remember SYSTEMS with it's occurrence counts
counts=$(cut -d "$fieldsep" -f1 "$inputfile" | sort | uniq -c)
# remember last outputted system name
lastsys=''
# until there are any systems with counts
while ((${#counts})); do
# get the most occurrented system with it's count from counts
IFS=' ' read -r cnt sys < <(
# if lastsys is empty, don't do anything, if not, filter it out
if [ -n "$lastsys" ]; then
grep -v " $lastsys$";
else
cat;
# ha suprise - counts is here!
# probably would be way more readable with just `printf "%s" "$counts" |`
fi <<<"$counts" |
# with the most occurence
sort -n | tail -n1
)
if [ -z "$cnt" ]; then
echo "ERROR: constructing output is not possible! There have to be duplicate system lines!" >&2
exit 1
fi
# update counts - decrement the count of this system, or remove it if count is 1
counts=$(
# remove current system from counts
<<<"$counts" grep -v " $sys$"
# if the count of the system is 1, don't add it back - it's count is now 0
if ((cnt > 1)); then
# decrement count and add the line with system to counts
printf "%s" "$((cnt - 1)) $sys"
fi
)
# finally print output
printf "%s\n" "$sys"
# and remember last system
lastsys="$sys"
done |
{
# get system names only in `system` - using cached counts variable
# for each system name open a grep for that name from the input file
# with asigned file descritpro
# The file descriptor list is saved in an array `fds`
fds=()
systems=""
while IFS=' ' read -r _ sys; do
exec {fd}< <(grep "^$sys," "$inputfile")
fds+=("$fd")
systems+="$sys"$'\n'
done <<<"$counts"
# for each line in input
while IFS='' read -r sys; do
# get the position inside systems list of that system decremented by 1
# this will be the underlying filesystem for filtering that system out of input
fds_idx=$(<<<"$systems" grep -n "$sys" | cut -d: -f1)
fds_idx=$((fds_idx - 1))
# read one line from that file descriptor
# I wonder is `sed 1p` would be faster
IFS='' read -r -u "${fds[$fds_idx]}" line
# output that line
printf "%s\n" "$line"
done
}
To accommodate for strange input values this script implements somewhat simple but hardy in bash statemachine.
The variable counts stores SYSTEM names with their're occurrence count. So from the example input it will be
4 alpha
3 beta
1 gamma
Now - we output the SYSTEM name with the biggest occurrence count that is also different from the last outputted SYSTEM name. We decrement it's occurrence count. If the count is equal to zero, it is removed from the list. We remember the last outputted SYSTEM name. We repeat this process until all occurrence counts reach zero, so the list is empty. For the example input this will output:
beta
alpha
beta
alpha
beta
alpha
beta
gamma
Now, we need to join that list with the job names. We can't use join as the input is not sorted and we don't want to change the ordering. So what I do, I get only SYSTEM names in system. Then for each system I open a different file descriptor with filtered only that SYSTEM name from the input file. All the file descriptors are stored in an array. Then for each SYSTEM name from the input, I find the file descriptor that filters that SYSTEM name from the input file and read exactly one line from the file descriptor. This works like an array of file positions each file position associated / filtering specified SYSTEM name.
beta,21700055
alpha,90198500
beta,33452909
alpha,93082105
beta,40850198
alpha,30184438
beta,82645731
gamma,64910850
The script was done so for the input in the form of:
alpha,90198500
alpha,93082105
alpha,30184438
beta,21700055
gamma,64910850
the script outputs correctly:
alpha,90198500
gamma,64910850
alpha,93082105
beta,21700055
alpha,30184438
I think this algorithm will mostly always print correct output, but the ordering is so that the least common SYSTEMs will be outputted last, which may be not optimal.
Tested manually with some custom tests and checker on paiza.io.
inputfile="inputfile"
in=( 1 2 1 5 )
cat <<EOF > "$inputfile"
$(seq ${in[0]} | sed 's/^/A,/' )
$(seq ${in[1]} | sed 's/^/B,/' )
$(seq ${in[2]} | sed 's/^/C,/' )
$(seq ${in[3]} | sed 's/^/D,/' )
EOF
sed -i -e '/^$/d' "$inputfile"
inputfile="inputfile"
fieldsep=","
# remember SYSTEMS with it's occurrence counts
counts=$(cut -d "$fieldsep" -f1 "$inputfile" | sort | uniq -c)
# I think this holds true
# The SYSTEM with the most count should be lower than the sum of all others
# remember last outputted system name
lastsys=''
# until there are any systems with counts
while ((${#counts})); do
# get the most occurrented system with it's count from counts
IFS=' ' read -r cnt sys < <(
# if lastsys is empty, don't do anything, if not, filter it out
if [ -n "$lastsys" ]; then
grep -v " $lastsys$";
else
cat;
# ha suprise - counts is here!
# probably would be way more readable with just `printf "%s" "$counts" |`
fi <<<"$counts" |
# with the most occurence
sort -n | tail -n1
)
if [ -z "$cnt" ]; then
echo "ERROR: constructing output is not possible! There have to be duplicate system lines!" >&2
exit 1
fi
# update counts - decrement the count of this system, or remove it if count is 1
counts=$(
# remove current system from counts
<<<"$counts" grep -v " $sys$"
# if the count of the system is 1, don't add it back - it's count is now 0
if ((cnt > 1)); then
# decrement count and add the line with system to counts
printf "%s" "$((cnt - 1)) $sys"
fi
)
# finally print output
printf "%s\n" "$sys"
# and remember last system
lastsys="$sys"
done |
{
# get system names only in `system` - using cached counts variable
# for each system name open a grep for that name from the input file
# with asigned file descritpro
# The file descriptor list is saved in an array `fds`
fds=()
systems=""
while IFS=' ' read -r _ sys; do
exec {fd}< <(grep "^$sys," "$inputfile")
fds+=("$fd")
systems+="$sys"$'\n'
done <<<"$counts"
# for each line in input
while IFS='' read -r sys; do
# get the position inside systems list of that system decremented by 1
# this will be the underlying filesystem for filtering that system out of input
fds_idx=$(<<<"$systems" grep -n "$sys" | cut -d: -f1)
fds_idx=$((fds_idx - 1))
# read one line from that file descriptor
# I wonder is `sed 1p` would be faster
IFS='' read -r -u "${fds[$fds_idx]}" line
# output that line
printf "%s\n" "$line"
done
} |
{
# check if the output is correct
output=$(cat)
# output should have same lines as inputfile
if ! cmp <(sort "$inputfile") <(<<<"$output" sort); then
echo "Output does not match input!" >&2
exit 1
fi
# two consecutive lines can't have the same system
lastsys=""
<<<"$output" cut -d, -f1 |
while IFS= read -r sys; do
if [ -n "$lastsys" -a "$lastsys" = "$sys" ]; then
echo "Same systems found on two consecutive lines!" >&2
exit 1
fi
lastsys="$sys"
done
# all ok
echo "all ok!"
echo -------------
printf "%s\n" "$output"
}
exit
I have these functions to process a 2GB text file. I'm splitting it into 6 parts for simultaneous processing but it is still taking 4+ hours.
What else can I try make the script faster?
A bit of details:
I feed my input csv into a while loop to be read line by line.
I grabbed the values from the csv line from 4 fields in the read2col function
The awk in my mainf function takes the values from read2col and do some arithmetic calculation. I'm rounding the result to 2 decimal places. Then, print the line to a text file.
Sample data:
"111","2018-08-24","01:21","ZZ","AAA","BBB","0","","","ZZ","ZZ111","ZZ110","2018-10-12","07:00","2018-10-12","08:05","2018-10-19","06:30","2018-10-19","09:35","ZZZZ","ZZZZ","A","B","146.00","222.26","76.26","EEE","abc","100.50","45.50","0","E","ESSENTIAL","ESSENTIAL","4","4","7","125","125"
Script:
read2col()
{
is_one_way=$(echo "$line"| awk -F'","' '{print $7}')
price_outbound=$(echo "$line"| awk -F'","' '{print $30}')
price_exc=$(echo "$line"| awk -F'","' '{print $25}')
tax=$(echo "$line"| awk -F'","' '{print $27}')
price_inc=$(echo "$line"| awk -F'","' '{print $26}')
}
#################################################
#for each line in the csv
mainf()
{
cd $infarepath
while read -r line; do
#read the value of csv fields into variables
read2col
if [[ $is_one_way == 0 ]]; then
if [[ $price_outbound > 0 ]]; then
#calculate price inc and print the entire line to txt file
echo $line | awk -v CONVFMT='%.2f' -v pout=$price_outbound -v tax=$tax -F'","' 'BEGIN {OFS = FS} {$25=pout;$26=(pout+(tax / 2)); print}' >>"$csvsplitfile".tmp
else
#divide price ecx and inc by 2 if price outbound is not greater than 0
echo $line | awk -v CONVFMT='%.2f' -v pexc=$price_exc -v pinc=$price_inc -F'","' 'BEGIN {OFS = FS} {$25=(pexc / 2);$26=(pinc /2); print}' >>"$csvsplitfile".tmp
fi
else
echo $line >>"$csvsplitfile".tmp
fi
done < $csvsplitfile
}
The first thing you should do is stop invoking six subshells for running awk for every single line of input. Let's do some quick, back-of-the-envelope calculations.
Assuming your input lines are about 292 characters (as per you example), a 2G file will consist of a little over 7.3 million lines. That means you are starting and stopping a whopping forty-four million processes.
And, while Linux admirably handles fork and exec as efficiently as possible, it's not without cost:
pax$ time for i in {1..44000000} ; do true ; done
real 1m0.946s
In addition, bash hasn't really been optimised for this sort of processing, its design leads to sub-optimal behaviour for this specific use case. For details on this, see this excellent answer over on one of our sister sites.
An analysis of the two methods of file processing (one program reading an entire file (each line has just hello on it), and bash reading it a line at a time) is shown below. The two commands used to get the timings were:
time ( cat somefile >/dev/null )
time ( while read -r x ; do echo $x >/dev/null ; done <somefile )
For varying file sizes (user+sys time, averaged over a few runs), it's quite interesting:
# of lines cat-method while-method
---------- ---------- ------------
1,000 0.375s 0.031s
10,000 0.391s 0.234s
100,000 0.406s 1.994s
1,000,000 0.391s 19.844s
10,000,000 0.375s 205.583s
44,000,000 0.453s 889.402s
From this, it appears that the while method can hold its own for smaller data sets, it really does not scale well.
Since awk itself has ways to do calculations and formatted output, processing the file with one single awk script, rather than your bash/multi-awk-per-line combination, will make the cost of creating all those processes and line-based delays go away.
This script would be a good first attempt, let's call it prog.awk:
BEGIN {
FMT = "%.2f"
OFS = FS
}
{
isOneWay=$7
priceOutbound=$30
priceExc=$25
tax=$27
priceInc=$26
if (isOneWay == 0) {
if (priceOutbound > 0) {
$25 = sprintf(FMT, priceOutbound)
$26 = sprintf(FMT, priceOutbound + tax / 2)
} else {
$25 = sprintf(FMT, priceExc / 2)
$26 = sprintf(FMT, priceInc / 2)
}
}
print
}
You just run that single awk script with:
awk -F'","' -f prog.awk data.txt
With the test data you provided, here's the before and after, with markers for field numbers 25 and 26:
<-25-> <-26->
"111","2018-08-24","01:21","ZZ","AAA","BBB","0","","","ZZ","ZZ111","ZZ110","2018-10-12","07:00","2018-10-12","08:05","2018-10-19","06:30","2018-10-19","09:35","ZZZZ","ZZZZ","A","B","146.00","222.26","76.26","EEE","abc","100.50","45.50","0","E","ESSENTIAL","ESSENTIAL","4","4","7","125","125"
"111","2018-08-24","01:21","ZZ","AAA","BBB","0","","","ZZ","ZZ111","ZZ110","2018-10-12","07:00","2018-10-12","08:05","2018-10-19","06:30","2018-10-19","09:35","ZZZZ","ZZZZ","A","B","100.50","138.63","76.26","EEE","abc","100.50","45.50","0","E","ESSENTIAL","ESSENTIAL","4","4","7","125","125"
I have two data files. One is having 1600 rows and the other one is having 2 million rows(tab delimited files). I need to vlookup between these two files. Please see below example for the expected output and kindly let me know if it's possible. I've tried using awk, but couldn't get the expected result.
File 1(small file)
BC1 10 100
BC2 20 200
BC3 30 300
File 2(large file)
BC1 XYZ
BC2 ABC
BC3 DEF
Expected Output:
BC1 10 100 XYZ
BC2 20 200 ABC
BC3 30 300 DEF
I also tried the join command. It is taking forever to complete. Please help me find a solution. Thanks
Commands for your output:
awk '{print $1}' *file | sort | uniq -d > out.txt
for i in $(cat out.txt)
do
grep "$i" large_file >> temp.txt
done
sort -g -t 1 temp.txt > out1.txt
sort -g -t 1 out.txt > out2.txt
paste out1.txt out2.txt | awk '{print $1 $2 $3 $5}'
Commands for Vlookup
Store 1st and 2nd column in file1 file2 respectively
cat file1 file2 | sort | uniq -d ### for records which are present in both files
cat file1 file2 | sort | uniq -u ### for records which are unique and not present in bulk file
This awk script will scan line by line each file, and will try to match the number in the BC column. Once matched, it will print all the columns.
If one of the files does not contain one of the numbers, it will be skipped in both files and will search for the next one. It will loop until one of the files ends.
The script also accepts any number of columns per file and any number of files, as long as the first column is BC and a number.
This awk script assumes that the files are ordered from minor to major number in the BC column (like in your example). Otherwise it will not work.
To execute the script, run this command:
awk -f vlookup.awk smallfile bigfile
The vlookup.awk file will have this content:
BEGIN {files=1;lines=0;maxlines=0;filelines[1]=0;
#Number of columns for SoD, PRN, reference file
col_bc=1;
#Initialize variables
bc_now=0;
new_bc=0;
end_of_process=0;
aux="";
text_result="";
}
{
if(FILENAME!=ARGV[1])exit;
no_bc=0;
new_bc=0;
#Save number of columns
NFields[1]=NF;
#Copy reference file data
for(j=0;j<=NF;j++)
{
file[1,j]=$j;
}
#Read lines from file
for(i=2;i<ARGC;i++)
{
ret=getline < ARGV[i];
if(ret==0) exit; #END OF FILE reached
#Copy columns to file variable
for(j=0;j<=NF;j++)
{
file[i,j]=$j;
}
#Save number of columns
NFields[i]=NF;
}
#Check that all files are in the same number
for(i=1;i<ARGC;i++)
{
bc[i]=file[i,col_bc];
bc[i]=sub("BC","",file[i,col_bc]);
if(bc[i]>bc_now) {bc_now=bc[i];new_bc=1;}
}
#One or more files have a new number
if (new_bc==1)
{
for(i=1;i<ARGC;i++)
{
while(bc_now!=file[i,col_bc])
{
#Read next line from file
if(i==1) ret=getline; #File 1 is the reference file
else ret=getline < ARGV[i];
if(ret==0) exit; #END OF FILE reached
#Copy columns to file variable
for(j=0;j<=NF;j++)
{
file[i,j]=$j;
}
#Save number of columns
NFields[i]=NF;
#Check if in current file data has gone to next number
if(file[i,col_bc]>bc_now)
{
no_bc=1;
break;
}
#No more data lines to compare, end of comparison
if(FILENAME!=ARGV[1])
{
exit;
}
}
#If the number is not in a file, the process to realign must be restarted to the next number available (Exit for loop)
if (no_bc==1) {break;}
}
#If the number is not in a file, the process to realign must be restarted to the next number available (Continue while loop)
if (no_bc==1) {next;}
}
#Number is aligned
for(i=1;i<ARGC;i++)
{
for(j=2;j<=NFields[i];j++) {
#Join colums in text_result variable
aux=sprintf("%s %s",text_result,file[i,j]);
text_result=sprintf("%s",aux);
}
}
printf("BC%d%s\n",bc_now,text_result)
#Reset text variables
aux="";
text_result="";
}
I also tried the join command. It is taking forever to complete.
Please help me find a solution.
It's improbable that you'll find a solution (scripted or not) that's faster than the compiled join command. If you can't wait for join to complete, you need more powerful hardware.
Is there a way in Linux to ask for the Head or Tail but with an additional offset of records to ignore.
For example if the file example.lst contains the following:
row01
row02
row03
row04
row05
And I use head -n3 example.lst I can get rows 1 - 3 but what if I want it to skip the first row and get rows 2 - 4?
I ask because some commands have a header which may not be desirable within the search results. For example du -h ~ --max-depth 1 | sort -rh will return the directory size of all folders within the home directory sorted in descending order but will append the current directory to the top of the result set (i.e. ~).
The Head and Tail man pages don't seem to have any offset parameter so maybe there is some kind of range command where the required lines can be specified: e.g. range 2-10 or something?
From man tail:
-n, --lines=K
output the last K lines, instead of the last 10;
or use -n +K to output lines starting with the Kth
You can therefore use ... | tail -n +2 | head -n 3 to get 3 lines starting from line 2.
Non-head/tail methods include sed -n "2,4p" and awk "NR >= 2 && NR <= 4".
To get the rows between 2 and 4 (both inclusive), you can use:
head -n4 example.lst | tail -n+2
or
head -n4 example.lst | tail -n3
It took make a lot of time to end-up with this solution which, seems to be the only one that covered all usecases (so far):
command | tee full.log | stdbuf -i0 -o0 -e0 awk -v offset=${MAX_LINES:-200} \
'{
if (NR <= offset) print;
else {
a[NR] = $0;
delete a[NR-offset];
printf "." > "/dev/stderr"
}
}
END {
print "" > "/dev/stderr";
for(i=NR-offset+1 > offset ? NR-offset+1: offset+1 ;i<=NR;i++)
{ print a[i]}
}'
Feature list:
live output for head (obviously that for tail is not possible)
no use of external files
progressbar on stderr, one dot for each line after the MAX_LINES, very useful for long running tasks.
avoids possible incorrect logging order due to buffering (stdbuf)
sed -n 2,4p somefile.txt
#fill