Shell program - determine average word length in a file - linux

I am trying to write a shell program to determine the average word length in a file. I'm assuming I need to use wc and expr somehow. Guidance in the right direction would be great!

Assuming your file is ASCII and wc can indeed read it...
chars=$(wc -c < inputfile)
words=$(wc -w < inputfile)
Then a simple
avg_word_size=$(( chars / words ))
will calculate a (truncated) integer. But it will be "more wrong" than just the rounding error: you'll have included all whitespace characters in your average word size as well. And I assume you want to be more precise...
The following will give you some increased precision by calculating the rounded integer from a number that is multiplied by 100:
_100x_avg_word_size=$(( chars * 100 / words ))
Now we can use that for telling the world:
echo "Average word size is: ${avg_word_size}.${_100x_avg_word_size: -2:2}"
To refine further, we could assume that words are separated by exactly one whitespace character:
chars=$(wc -c < inputfile)
words=$(wc -w < inputfile)
avg_word_size=$(( (chars - (words - 1)) / words ))
_100x_avg_word_size=$(( (chars - (words - 1)) * 100 / words ))
echo "Average word size is: ${avg_word_size}.${_100x_avg_word_size: -2:2}"
Now it's your job to try and include the concept of 'lines' into your computations... :-)
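For instance, here is a minimal sketch (mine, not part of the answer above) that treats each newline as the separator of the last word on its line:
chars=$(wc -c < inputfile)
words=$(wc -w < inputfile)
# Assumption: each of the $words words is followed by exactly one separator
# (a space between words, or the newline ending its line), so the characters
# belonging to words are chars - words.
word_chars=$(( chars - words ))
_100x=$(( word_chars * 100 / words ))
printf 'Average word size is: %d.%02d\n' $(( _100x / 100 )) $(( _100x % 100 ))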

Update: to show clearly (hopefully) the difference between wc and this method, and to fix a "too-many-newlines" bug; also added finer control of apostrophes in word endings.
If you want to consider a word as being a bash word, then using wc alone is fine.
However if you want to consider a word as word in a spoken/written language, then you can't use wc for the word parsing.
E.g. wc considers the following to contain 1 word (of average size 112.00), whereas the script below shows it to contain 19 words (of average size 4.58):
"/home/axiom/zap_notes/apps/eng-hin-devnag-itrans/Platt's_Urdu_and_classical_Hindi_to_English_-_preface5.doc't"
Using Kurt's script, the following line is shown to contain 7 words (of average size 8.14), whereas the script presented below shows it to contain 7 words (of average size 4.43)... बे = 2 chars
"बे = {Platts} ... —be-ḵẖẉabī, s.f. Sleeplessness:"
So, if wc is your flavour, good, and if not, something like this may suit:
# Cater for special situation words: eg 's and 't
# Convert each group of anything which isn't a "character" (including '_') into a newline.
# Then, convert each CHARACTER which isn't a newline into a BYTE (not character!).
# This leaves one 'word' per line, each 'word' being made up of the same BYTE ('x').
#
# Without any options, wc prints newline, word, and byte counts (in that order),
# so we can capture all 3 values in a bash array
#
# Use `awk` as a floating point calculator (bash can only do integer arithmetic)
file="inputfile"   # the file to analyse
count=($(sed "s/\>'s\([[:punct:]]\|$\)/\1/g # ignore apostrophe-s ('s) word endings
    s/'t\>/xt/g # consider words ending in apostrophe-t ('t) as base word + 2 characters
    s/[_[:digit:][:blank:][:punct:][:cntrl:]]\+/\n/g
    s/^\n*//; s/\n*$//; s/[^\n]/x/g" "$file" | wc))
echo "chars / word average:" \
$(awk -vnl=${count[0]} -vch=${count[2]} 'BEGIN{ printf( "%.2f\n", (ch-nl)/nl ) }')


How do I get AWK to rearrange and manipulate text in a file into two output files depending on conditions?

I'm trying to find an efficient way to split, then recombine, text from one file into two separate files. There's a lot going on, like removing the decimal point, reversing the sign (+ becomes - and - becomes +) in the amount field, and padding. For example:
INPUT file input.txt:
(this first line is there just to give character positions more easily instead of counting; it's not present in the input file)
1234567890123456789012345678901234567890123456789012345678901234567890123456789012345
123456789XXPPPPPPPPPP NNNNNN#1404.58 #0.00 0 1
987654321YYQQQQQQQQQQ NNNNNN#-97.73 #-97.73 1 1
777777777XXGGGGGGGGGG NNNNNN#115.92 #115.92 0 0
888888888YYHHHHHHHHHH NNNNNN#3.24 #3.24 1 0
Any line that contains a "1" as the 85th character above goes to one file, say OutputA.txt, rearranged like this:
PPPPPPPPPP~~NNNNNN123456789XX~~~-0000140458-0000000000
QQQQQQQQQQ~~NNNNNN987654321YY~~~+0000009773+0000009773
Likewise, any line that contains a "0" as the 85th character goes to another file, OutputB.txt, rearranged like this:
GGGGGGGGGG~~NNNNNN777777777XX~~~-0000011592-0000011592
HHHHHHHHHH~~NNNNNN888888888YY~~~-0000000324-0000000324
It seems so complicated, but if I could just grab each portion of the input lines as different variables and then write them out in a different order, with the amount right-aligned and padded with 0s, splitting them into different files depending on the last column... I'm not sure how to put all these things together in one go.
I tried printing each line into a different file depending on whether the 85th character is a 1 or a 0, and then creating variables, say the first through 11th characters as varA and the next 10 as varB etc., but it gets complex quickly because I need to change + to - and - to +, then pad with zeros and change the spacing. It gets a bit mad. This should be possible with one script, but I just can't put all the pieces together.
I've looked for tutorials, but nothing seems to cover grabbing based on a condition whilst at the same time padding, rearranging, splitting etc.
Many thanks in advance
split
Use GNU AWK's ability to print to a file. Consider the following simple example:
seq 20 | awk '$1%2==1{print $0 > "fileodd.txt"}$1%2==0{print $0 > "fileeven.txt"}'
which reads the output of seq 20 (numbers from 1 to 20, inclusive, each on a separate line) and puts the odd numbers into fileodd.txt and the even numbers into fileeven.txt
recombine text
Use substr and string concatenation for that task. Consider the following simple example: say you have file.txt with MM-DD-YYYY dates like so
01-29-2022
01-30-2022
01-31-2022
but you want YYYY-MM-DD then you could do that by
awk '{print substr($0,7,4) "-" substr($0,1,2) "-" substr($0,4,2)}' file.txt
which gives output
2022-01-29
2022-01-30
2022-01-31
substr's arguments are: the string ($0 is the whole line), the start position, and the length; a space is the concatenation operator.
removing the decimal point
Use gsub with the second argument set to an empty string to delete unwanted characters, but keep in mind that . has a special meaning in regular expressions. Consider the following simple example: let file.txt content be
100.15
200.30
300.45
then
awk '{gsub(/[.]/,"");print}' file.txt
gives output
10015
20030
30045
Observe that /[.]/, not /./, is used, and that gsub changes the record in place.
reversing the sign(...)padding
Multiply by -1, then use sprintf with a suitable modifier. Consider the following example: let file.txt content be
1
-10
100
then
awk '{print "Reversed value is " sprintf("%+05d",-1*$1)}' file.txt
gives output
Reversed value is -0001
Reversed value is +0010
Reversed value is -0100
Explanation: % is the place where the value will be inserted; + prefixes the value with - or +; 05 pads with leading zeros to a width of 5 characters; d assumes the value is an integer. sprintf returns the formatted string, which can be concatenated with other strings as shown above.
(tested in GNU Awk 5.0.1)
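Putting those pieces together for the original task: the following is only a sketch of mine, not a tested solution. It assumes the fields are whitespace-separated as in the sample, that the first field is the 11-character ID directly followed by the 10-character name, that the amounts follow # in the second and third fields, and that the fifth field is the 0/1 selector standing in for the "85th character":
awk '
function fmt(amt,    sign) {
    # reverse the sign: negative amounts get "+", everything else "-"
    sign = (substr(amt, 1, 1) == "-") ? "+" : "-"
    gsub(/[-.]/, "", amt)                  # drop the decimal point and any "-"
    return sign sprintf("%010d", amt)      # left-pad the digits with zeros
}
{
    split($2, a, "#")                      # a[1]=NNNNNN, a[2]=first amount
    split($3, b, "#")                      # b[2]=second amount
    out = substr($1, 12, 10) "~~" a[1] substr($1, 1, 11) "~~~" fmt(a[2]) fmt(b[2])
    print out > (($5 == "1") ? "OutputA.txt" : "OutputB.txt")
}' input.txt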
You can use jq for this task:
#!/bin/bash
INPUT='
123456789XXPPPPPPPPPP NNNNNN#1404.58 #0.00 0 1
987654321YYQQQQQQQQQQ NNNNNN#-97.73 #-97.73 1 1
777777777XXGGGGGGGGGG NNNNNN#115.92 #115.92 0 0
888888888YYHHHHHHHHHH NNNNNN#3.24 #3.24 1 0
'
convert() {
    jq -rR --arg lineSelector "$1" '
        def transformNumber($len):
            tonumber |                                    # convert string to number
            (if . < 0 then "+" else "-" end) as $sign |   # store inverted sign
            if . < 0 then 0 - . else . end |              # abs(number)
            . * 100 |                                     # number * 100
            tostring |                                    # convert number back to string
            $sign + "0" * ($len - length) + .;            # pad with leading zeros

        # Main program
        split(" ") |                                      # split each line by space
        map(select(length > 0)) |                         # remove empty entries
        select(.[4] == $lineSelector) |                   # keep only lines with the selected value in the last column
        # generate output                                 # example for first line
        .[0][11:21] +                                     # PPPPPPPPPP
        "~~" +                                            # ~~
        (.[1] | split("#")[0]) +                          # NNNNNN
        .[0][0:11] +                                      # 123456789XX
        "~~~" +                                           # ~~~
        (.[1] | split("#")[1] | transformNumber(10)) +    # -0000140458
        (.[2] | split("#")[1] | transformNumber(10))      # -0000000000
    ' <<< "$2"
}
convert 0 "$INPUT" # or convert 1 "$INPUT"
Output for 0
GGGGGGGGGG~~NNNNNN777777777XX~~~-0000011592-0000011592
HHHHHHHHHH~~NNNNNN888888888YY~~~-0000000324-0000000324
Output for 1
PPPPPPPPPP~~NNNNNN123456789XX~~~-0000140458-0000000000
QQQQQQQQQQ~~NNNNNN987654321YY~~~+0000009773+0000009773

Convert carriage return (\r) to actual overwrite

Questions
Is there a way to convert the carriage returns to actual overwrite in a string so that 000000000000\r1010 is transformed to 101000000000?
Context
1. Initial objective:
Having a number x (between 0 and 255) in base 10, I want to convert this number to base 2, add trailing zeros to get a 12-digit binary representation, generate 12 different numbers (each of them made of the first n digits in base 2, with n between 1 and 12) and print the base-10 representation of these 12 numbers.
2. Example:
With x = 10
Base 2 is 1010
With trailing zeros 101000000000
Extract the 12 "leading" numbers: 1, 10, 101, 1010, 10100, 101000, ...
Convert to base 10: 1, 2, 5, 10, 20, 40, ...
3. What I have done (it does not work):
x=10
x_base2="$(echo "obase=2;ibase=10;${x}" | bc)"
x_base2_padded="$(printf '%012d\r%s' 0 "${x_base2}")"
for i in {1..12}
do
    t=$(echo ${x_base2_padded:0:${i}})
    echo "obase=10;ibase=2;${t}" | bc
done
4. Why it does not work
Because the variable x_base2_padded contains the whole sequence 000000000000\r1010. This can be confirmed using hexdump for instance. In the for loop, when I extract the first 12 characters, I only get zeros.
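For instance (my own demonstration):
$ printf '%012d\r%s' 0 1010 | hexdump -c
0000000   0   0   0   0   0   0   0   0   0   0   0   0  \r   1   0   1
0000010   0
0000011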
5. Alternatives
I know I can find alternatives by literally adding zeros to the variable as follows:
x_base2=1010
x_base2_padded="$(printf '%s%0.*d' "${x_base2}" $((12-${#x_base2})) 0)"
Or by padding with zeros using printf and rev
x_base2=1010
x_base2_padded="$(printf '%012s' "$(printf "${x_base2}" | rev)" | rev)"
Although these alternatives solve my problem for now and let me continue my work, they don't really answer my question.
Related issue
The same problem may be observed in different contexts. For instance if one tries to concatenate multiple strings containing carriage returns. The result may be hard to predict.
str=$'bar\rfoo'
echo "${str}"
echo "${str}${str}"
echo "${str}${str}${str}"
echo "${str}${str}${str}${str}"
echo "${str}${str}${str}${str}${str}"
The first echo will output foo. Although you might expect the other echos to output foofoo, foofoofoo, and so on, they all output foobar.
The following function overwrite transforms its argument such that after each carriage return \r the beginning of the string is actually overwritten:
overwrite() {
    local segment result=
    while IFS= read -rd $'\r' segment; do
        result="$segment${result:${#segment}}"
    done < <(printf '%s\r' "$@")
    printf %s "$result"
}
Example
$ overwrite $'abcdef\r0123\rxy'
xy23ef
Note that the printed string really is xy23ef. By contrast, echo $'abcdef\r0123\rxy' only seems to print the same string: it still prints the \r characters, which your terminal then interprets so that the result merely looks the same. You can confirm this with hexdump:
$ echo $'abcdef\r0123\rxy' | hexdump -c
0000000   a   b   c   d   e   f  \r   0   1   2   3  \r   x   y  \n
000000f
$ overwrite $'abcdef\r0123\rxy' | hexdump -c
0000000   x   y   2   3   e   f
0000006
The function overwrite also supports overwriting by arguments instead of \r-delimited segments:
$ overwrite abcdef 0123 xy
xy23ef
To convert variables in place, use a command substitution: myvar=$(overwrite "$myvar")
With awk, you'd set the field delimiter to \r and iterate through fields printing only the visible portions of them.
awk -F'\r' '{
    offset = 1
    for (i = NF; i > 0; i--) {
        if (offset <= length($i)) {
            printf "%s", substr($i, offset)
            offset = length($i) + 1
        }
    }
    print ""
}'
This is indeed too long to put into a command substitution, so you'd better wrap it in a function and pipe the lines to be resolved to it.
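For example, a possible wrapper (the name resolve_cr is mine):
resolve_cr() {
    awk -F'\r' '{
        offset = 1
        for (i = NF; i > 0; i--) {
            if (offset <= length($i)) {
                printf "%s", substr($i, offset)
                offset = length($i) + 1
            }
        }
        print ""
    }'
}
printf '%s\n' $'000000000000\r1010' | resolve_cr    # prints 101000000000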
To answer the specific question, how to convert 000000000000\r1010 to 101000000000, refer to Socowi's answer.
However, I wouldn't introduce the carriage return in the first place and solve the problem like this:
#!/usr/bin/env bash
x=$1
# Start with 12 zeroes
var='000000000000'
# Convert input to binary
binary=$(bc <<< "obase = 2; $x")
# Rightpad with zeroes: ${#binary} is the number of characters in $binary,
# and ${var:x} removes the first x characters from $var
var=$binary${var:${#binary}}
# Print 12 substrings, convert to decimal: ${var:0:i} extracts the first
# i characters from $var, and $((x#$var)) interprets $var in base x
for ((i = 1; i <= ${#var}; ++i)); do
    echo "$((2#${var:0:i}))"
done
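Assuming the script is saved as leading.sh (the name is mine) and made executable, a sample run:
$ ./leading.sh 10
1
2
5
10
20
40
80
160
320
640
1280
2560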

Interleave lines sorted by a column

(Similar to How to interleave lines from two text files but for a single input. Also similar to Sort lines by group and column but interleaving or randomizing versus sorting.)
I have a set of systems and tasks in two columns, SYSTEM,TASK:
alpha,90198500
alpha,93082105
alpha,30184438
beta,21700055
beta,33452909
beta,40850198
beta,82645731
gamma,64910850
I want to distribute the tasks to each system in a balanced way. The ideal case where each system has the same number of tasks would be round-robin, one alpha then one beta then one gamma and repeat until finished.
I get the whole list of tasks + systems at once, so I don't need to keep any state
The list of systems is not static, on the order of N=100
The total number of tasks is variable, on the order of N=500
The number of tasks for each system is not guaranteed to be equal
Hard / absolute interleaving isn't required, as long as the same system doesn't appear on two lines in a row
The same task may show up more than once, but not for the same system
Input format / delimiter can be changed
I can solve this well enough with some fancy scripting to split the data into multiple files (grep ^alpha, input > alpha.txt etc) and then recombine them with paste or similar, but I'd like to use a single command or set of pipes to run it without intermediate files or a proper scripting language. Just using sort -R gets me 95% of the way there, but I end up with 2 tasks for the same system in a row almost every time, and sometimes 3 or more depending on the initial distribution.
edit:
To clarify, any output should not have the same system on two lines in a row. All system,task pairs must be preserved, you can't move a task from one system to another - that'd make this really easy!
One of several possible sample outputs:
beta,40850198
alpha,90198500
beta,82645731
alpha,93082105
gamma,64910850
beta,21700055
alpha,30184438
beta,33452909
We start by answering the underlying theoretical problem. The problem is not as simple as it seems. Feel free to implement a script based on this answer.
The blocks formatted as quotes are not quotes. I just wanted to highlight them to improve navigation in this rather long answer.
Theoretical Problem
Given a finite set of letters L with frequencies f : L→ℕ₀, find a sequence of letters such that every letter ℓ appears exactly f(ℓ) times and adjacent elements of the sequence are always different.
Example
L = {a,b,c} with f(a)=4, f(b)=2, f(c)=1
ababaca, acababa, and abacaba are all valid solutions.
aaaabbc is invalid – Some adjacent elements are equal, for instance aa or bb.
ababac is invalid – The letter a appears 3 times, but its frequency is f(a)=4
cababac is invalid – The letter c appears 2 times, but its frequency is f(c)=1
Solution
The following approach produces a valid sequence if and only if there exists a solution.
Sort the letters by their frequencies.
For ease of notation we assume, without loss of generality, that f(a) ≥ f(b) ≥ f(c) ≥ ... ≥ 0.
Note: There exists a solution if and only if f(a) ≤ 1 + ∑ℓ≠a f(ℓ).
Write down a sequence s of f(a) many a.
Add the remaining letters into a FIFO working list, that is:
(Don't add any a)
First add f(b) many b
Then f(c) many c
and so on
Iterate from left to right over the sequence s and insert after each element a letter from the working list. Repeat this step until the working list is empty.
Example
L = {a,b,c,d} with f(a)=5, f(b)=5, f(c)=4, f(d)=2
The letters are already sorted by their frequencies.
s = aaaaa
workinglist = bbbbbccccdd. The leftmost entry is the first one.
We iterate from left to right. The places where we insert letters from the working list are marked with an _ underscore.
s = a_a_a_a_a_ workinglist = bbbbbccccdd
s = aba_a_a_a_ workinglist = bbbbccccdd
s = ababa_a_a_ workinglist = bbbccccdd
...
s = ababababab workinglist = ccccdd
⚠️ We reached the end of sequence s. We repeat step 4.
s = a_b_a_b_a_b_a_b_a_b_ workinglist = ccccdd
s = acb_a_b_a_b_a_b_a_b_ workinglist = cccdd
...
s = acbcacb_a_b_a_b_a_b_ workinglist = cdd
s = acbcacbca_b_a_b_a_b_ workinglist = dd
s = acbcacbcadb_a_b_a_b_ workinglist = d
s = acbcacbcadbda_b_a_b_ workinglist =
⚠️ The working list is empty. We stop.
The final sequence is acbcacbcadbdabab.
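For reference, here is a direct bash sketch of this working-list procedure (mine; the implementation below takes a different, label-based route). It reads "letter count" pairs, sorted by count descending, from stdin:
#!/usr/bin/env bash
# Build the initial sequence from the most frequent letter.
read -r letter count
s=()
for ((i = 0; i < count; ++i)); do s+=("$letter"); done
# Queue the remaining letters into the FIFO working list.
work=()
while read -r letter count; do
    for ((i = 0; i < count; ++i)); do work+=("$letter"); done
done
# Insert one working-list letter after each element; repeat until the list is empty.
while ((${#work[@]})); do
    out=()
    for e in "${s[@]}"; do
        out+=("$e")
        if ((${#work[@]})); then
            out+=("${work[0]}")
            work=("${work[@]:1}")
        fi
    done
    s=("${out[@]}")
done
printf '%s' "${s[@]}"; echo
Feeding it printf 'a 5\nb 5\nc 4\nd 2\n' reproduces acbcacbcadbdabab from the example above.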
Implementation In Bash
Here is a bash implementation of the proposed approach that works with your input format. Instead of using a working list each line is labeled with a binary floating point number specifying the position of that line in the final sequence. Then the lines are sorted by their labels. That way we don't have to use explicit loops. Intermediate results are stored in variables. No files are created.
#! /bin/bash
inputFile="$1" # replace $1 by your input file or call "./thisScript yourFile"
inputBySys="$(sort "$inputFile")"
sysFreqBySys="$(cut -d, -f1 <<< "$inputBySys" | uniq -c | sed 's/^ *//;s/ /,/')"
inputBySysFreq="$(join -t, -1 2 -2 1 <(echo "$sysFreqBySys") <(echo "$inputBySys") | sort -t, -k2,2nr -k1,1)"
maxFreq="$(head -n1 <<< "$inputBySysFreq" | cut -d, -f2)"
lineCount="$(wc -l <<< "$inputBySysFreq")"
increment="$(awk '{l=log($1/$2)/log(2); l=int(l)-(int(l)>l); print 2^l}' <<< "$maxFreq $lineCount")"
seq="$({ echo obase=2; seq 0 "$increment" "$maxFreq" | head -n-1; } | bc |
    awk -F. '{sub(/0*$/,"",$2); print 0+$1 "," $2 "," length($2)}' |
    sort -snt, -k3,3 -k2,2 | head -n "$lineCount")"
paste -d, <(echo "$seq") <(echo "$inputBySysFreq") | sort -nt, -k1,1 -k2,2 | cut -d, -f4,6
This solution could fail for very long input files due to the limited precision of floating point numbers in seq and awk.
Well, this is what I've come up with:
args=()
while IFS=' ' read -r _ name; do
    # add a file redirection with a grep for that SYSTEM only, for later eval
    args+=("<(grep '^$name,' file)")
done < <(
    # extract SYSTEM only
    <file cut -d, -f1 |
    # sort with the count
    sort | uniq -c | sort -nr
)
# this is actually safe, because we control all arguments
eval paste -d "'\\n'" "${args[@]}" |
# paste will insert empty lines when a list has ended - remove them
sed '/^$/d'
First, I extract and sort the SYSTEM names so that the name which occurs most often comes first. So for the example input we get:
4 beta
3 alpha
1 gamma
Then for each such name I add the proper string <(grep '...' file) to the arguments list, which will later be evaluated.
Then I evaluate the call paste <(grep ...) <(grep ...) <(grep ...) ... with a newline as paste's delimiter. I remove the empty lines with a simple sed call.
The output for the input provided:
beta,21700055
alpha,90198500
gamma,64910850
beta,33452909
alpha,93082105
beta,40850198
alpha,30184438
beta,82645731
Converted to a fancy one-liner, substituting the while read loop with a command substitution and sed. Made safe against odd input file names with printf "%q" "$inputfile" and double quoting inside the sed regex.
inputfile="file"
fieldsep=","
eval paste -d '"\\n"' "$(
    cut -d "$fieldsep" -f1 "$inputfile" |
    sort | uniq -c | sort -nr |
    sed 's/^[[:space:]]*[0-9]\+[[:space:]]*\(.*\)$/<(grep '\''^\1'"$fieldsep"\'' "'"$(printf "%q" "$inputfile")"'")/' |
    tr '\n' ' '
)" |
sed '/^$/d'
inputfile="inputfile"
fieldsep=","
# remember SYSTEMs with their occurrence counts
counts=$(cut -d "$fieldsep" -f1 "$inputfile" | sort | uniq -c)
# remember the last outputted system name
lastsys=''
# while there are any systems with counts left
while ((${#counts})); do
    # get the most frequent system with its count from counts
    IFS=' ' read -r cnt sys < <(
        # if lastsys is empty, don't do anything; if not, filter it out
        if [ -n "$lastsys" ]; then
            grep -v " $lastsys$";
        else
            cat;
        # ha, surprise - counts is here!
        # probably would be way more readable with just `printf "%s" "$counts" |`
        fi <<<"$counts" |
        # pick the one with the most occurrences
        sort -n | tail -n1
    )
    if [ -z "$cnt" ]; then
        echo "ERROR: constructing output is not possible! There have to be duplicate system lines!" >&2
        exit 1
    fi
    # update counts - decrement the count of this system, or remove it if the count is 1
    counts=$(
        # remove the current system from counts
        <<<"$counts" grep -v " $sys$"
        # if the count of the system is 1, don't add it back - its count is now 0
        if ((cnt > 1)); then
            # decrement the count and add the line with the system back to counts
            printf "%s" "$((cnt - 1)) $sys"
        fi
    )
    # finally print output
    printf "%s\n" "$sys"
    # and remember the last system
    lastsys="$sys"
done |
{
    # get system names only, in `systems` - using the cached counts variable
    # for each system name open a grep for that name from the input file
    # with an assigned file descriptor
    # The file descriptor list is saved in the array `fds`
    fds=()
    systems=""
    while IFS=' ' read -r _ sys; do
        exec {fd}< <(grep "^$sys," "$inputfile")
        fds+=("$fd")
        systems+="$sys"$'\n'
    done <<<"$counts"
    # for each line in input
    while IFS='' read -r sys; do
        # get the position of that system inside the systems list, decremented by 1
        # this is the index of the file descriptor filtering that system from the input
        fds_idx=$(<<<"$systems" grep -n "$sys" | cut -d: -f1)
        fds_idx=$((fds_idx - 1))
        # read one line from that file descriptor
        # I wonder if `sed 1p` would be faster
        IFS='' read -r -u "${fds[$fds_idx]}" line
        # output that line
        printf "%s\n" "$line"
    done
}
To accommodate strange input values, this script implements a somewhat simple but hardy state machine in bash.
The variable counts stores SYSTEM names with their occurrence counts. So from the example input it will be
3 alpha
4 beta
1 gamma
Now we output the SYSTEM name with the biggest occurrence count that is also different from the last outputted SYSTEM name. We decrement its occurrence count. If the count reaches zero, the name is removed from the list. We remember the last outputted SYSTEM name. We repeat this process until all occurrence counts reach zero, so the list is empty. For the example input this will output:
beta
alpha
beta
alpha
beta
alpha
beta
gamma
Now we need to join that list with the task names. We can't use join, as the input is not sorted and we don't want to change the ordering. So what I do is collect just the SYSTEM names in systems. Then for each system I open a separate file descriptor that filters only that SYSTEM name from the input file. All the file descriptors are stored in an array. Then, for each SYSTEM name from the generated list, I find the file descriptor that filters that SYSTEM name and read exactly one line from it. This works like an array of file positions, each position associated with (and filtering) one SYSTEM name.
beta,21700055
alpha,90198500
beta,33452909
alpha,93082105
beta,40850198
alpha,30184438
beta,82645731
gamma,64910850
The script was written so that for input of the form:
alpha,90198500
alpha,93082105
alpha,30184438
beta,21700055
gamma,64910850
the script outputs correctly:
alpha,90198500
gamma,64910850
alpha,93082105
beta,21700055
alpha,30184438
I think this algorithm will almost always print correct output, but the ordering is such that the least common SYSTEMs are output last, which may not be optimal.
Tested manually with some custom tests and a checker on paiza.io.
inputfile="inputfile"
in=( 1 2 1 5 )
cat <<EOF > "$inputfile"
$(seq ${in[0]} | sed 's/^/A,/' )
$(seq ${in[1]} | sed 's/^/B,/' )
$(seq ${in[2]} | sed 's/^/C,/' )
$(seq ${in[3]} | sed 's/^/D,/' )
EOF
sed -i -e '/^$/d' "$inputfile"
inputfile="inputfile"
fieldsep=","
# remember SYSTEMs with their occurrence counts
counts=$(cut -d "$fieldsep" -f1 "$inputfile" | sort | uniq -c)
# I think this has to hold true:
# the count of the most frequent SYSTEM must not be greater than the sum of all the others + 1
# remember the last outputted system name
lastsys=''
# while there are any systems with counts left
while ((${#counts})); do
    # get the most frequent system with its count from counts
    IFS=' ' read -r cnt sys < <(
        # if lastsys is empty, don't do anything; if not, filter it out
        if [ -n "$lastsys" ]; then
            grep -v " $lastsys$";
        else
            cat;
        # ha, surprise - counts is here!
        # probably would be way more readable with just `printf "%s" "$counts" |`
        fi <<<"$counts" |
        # pick the one with the most occurrences
        sort -n | tail -n1
    )
    if [ -z "$cnt" ]; then
        echo "ERROR: constructing output is not possible! There have to be duplicate system lines!" >&2
        exit 1
    fi
    # update counts - decrement the count of this system, or remove it if the count is 1
    counts=$(
        # remove the current system from counts
        <<<"$counts" grep -v " $sys$"
        # if the count of the system is 1, don't add it back - its count is now 0
        if ((cnt > 1)); then
            # decrement the count and add the line with the system back to counts
            printf "%s" "$((cnt - 1)) $sys"
        fi
    )
    # finally print output
    printf "%s\n" "$sys"
    # and remember the last system
    lastsys="$sys"
done |
{
    # get system names only, in `systems` - using the cached counts variable
    # for each system name open a grep for that name from the input file
    # with an assigned file descriptor
    # The file descriptor list is saved in the array `fds`
    fds=()
    systems=""
    while IFS=' ' read -r _ sys; do
        exec {fd}< <(grep "^$sys," "$inputfile")
        fds+=("$fd")
        systems+="$sys"$'\n'
    done <<<"$counts"
    # for each line in input
    while IFS='' read -r sys; do
        # get the position of that system inside the systems list, decremented by 1
        # this is the index of the file descriptor filtering that system from the input
        fds_idx=$(<<<"$systems" grep -n "$sys" | cut -d: -f1)
        fds_idx=$((fds_idx - 1))
        # read one line from that file descriptor
        # I wonder if `sed 1p` would be faster
        IFS='' read -r -u "${fds[$fds_idx]}" line
        # output that line
        printf "%s\n" "$line"
    done
} |
{
    # check if the output is correct
    output=$(cat)
    # output should have the same lines as inputfile
    if ! cmp <(sort "$inputfile") <(<<<"$output" sort); then
        echo "Output does not match input!" >&2
        exit 1
    fi
    # two consecutive lines can't have the same system
    lastsys=""
    <<<"$output" cut -d, -f1 |
    while IFS= read -r sys; do
        if [ -n "$lastsys" -a "$lastsys" = "$sys" ]; then
            echo "Same systems found on two consecutive lines!" >&2
            exit 1
        fi
        lastsys="$sys"
    done
    # all ok
    echo "all ok!"
    echo -------------
    printf "%s\n" "$output"
}
exit

How to split words in bash

Good evening, People
Currently I have an array called inputArray which stores a 7-line input file line by line. I have a word which is 70000($s0); how do I split the word so that 70000 and ($s0) are separate?
I looked at an answer which is already on this website, but I couldn't understand it. The answer I looked at was:
s='1000($s3)'
IFS='()' read a b <<< "$s"
echo -e "a=<$a>\nb=<$b>"
giving the output a=<1000> b=<$s3>
Let me give this a shot.
In certain circumstances, the shell will perform "word splitting", where a string of text is broken up into words. The word boundaries are defined by the IFS variable. The default value of IFS is: space, tab, newline. When a string is to be split into words, any sequence of characters from this set is removed to extract the words.
In your example, the set of characters that delimit words is ( and ). So the words in that string that are bounded by the IFS set of characters are 1000 and $s3.
What is <<< "$s"? This is a here-string. It's used to send a string to some command's standard input. It's like doing
echo "$s" | read a b
except that form doesn't work as expected in bash, because the read runs in a subshell and the variables it sets vanish when the pipeline ends. read a b <<< "$s" works well.
Now, what are the circumstances where word splitting occurs? One is when a variable is unquoted. A demo:
IFS='()'
echo "$s" | wc # 1 line, 1 word and 10 characters
echo $s | wc # 1 line, 2 words and 9 characters
The read command also splits a string into words, in order to assign words to the named variables. The variable a gets the first word, and b gets all the rest.
The command, broken down:
IFS='()' read a b <<< "$s"
#^^^^^^^                    1
#        ^^^^^^^^           3
#                 ^^^^^^^^  2
1. only for the duration of the read command, assign the variable IFS the value ()
2. send the string "$s" to read's stdin
3. from stdin, use $IFS to split the input into words: assign the first word to variable a and the rest of the string to variable b. Trailing characters from $IFS at the end of the string are discarded.
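Applied to your 70000($s0) example:
s='70000($s0)'
IFS='()' read a b <<< "$s"
echo -e "a=<$a>\nb=<$b>"
giving a=<70000> and b=<$s0>.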
Documentation:
Word splitting
Here strings
Simple command execution, describing why this assignment of IFS is only in effect for the duration of the read command.
read command
Hope that helps.

elif conditional statement not working

I have this file as:
The number is %d0The number is %d1The number is %d2The number is %d3The number is %d4The number is %d5The number is %d6The...
The number is %d67The number is %d68The number is %d69The number is %d70The number is %d71The number is %d72The....
The number is %d117The number is %d118The number is %d119The number is %d120The number is %d121The number is %d122
I want to pad it like:
The number is %d0 The number is %d1 The number is %d2 The number is %d3 The number is %d4 The number is %d5 The number is %d6
The number is %d63 The number is %d64 The number is %d65 The number is %d66 The number is %d67 The number is %d68 The number is %d69
d118The number is %d119The number is %d120The number is %d121The number is %d122The number is %d123The number is %d124The
Please tell me how to do it with a shell script.
I am working on Linux
Edit:
This single command pipeline should do what you want:
sed 's/\(d[0-9]\+\)/\1   /g;s/\(d[0-9 ]\{3\}\) */\1/g' test2.txt >test3.txt
#                     ^^^ three spaces here
Explanation:
For each sequence of digits following a "d", add three spaces after it. (I'll use "X" to represent spaces.)
d1 becomes d1XXX
d10 becomes d10XXX
d100 becomes d100XXX
Now (the part after the semicolon), capture every "d" and the next three characters, which must be digits or spaces, and output them, but not any spaces beyond.
d1XXX becomes d1XX
d10XXX becomes d10X
d100XXX becomes d100
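For example, running the command over a shortened sample line (my own test data):
$ printf '%s\n' 'The number is %d0The number is %d12The number is %d123' | sed 's/\(d[0-9]\+\)/\1   /g;s/\(d[0-9 ]\{3\}\) */\1/g'
The number is %d0  The number is %d12 The number is %d123
Each "d" plus its digits now occupies exactly four characters, which is what lines up the columns.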
If you want to wrap the lines as you seem to show in your sample data, then do this instead:
sed 's/\(d[0-9]\+\)/\1   /g;s/\(d[0-9 ]\{3\}\) */\1/g' test2.txt | fold -w 133 >test3.txt
You may need to adjust the argument of the fold command to make it come out right.
There's no need for if, grep, loops, etc.
Original answer:
First of all, you really need to say which shell you're using, but since you have elif and fi, I'm assuming it's Bourne-derived.
Based on that assumption, your script makes no sense.
The parentheses for the if and elif are unnecessary. In this context, they create a subshell which serves no purpose.
The sed commands in the if and elif say "if the pattern is found, copy the hold space (it's empty, by the way) to the pattern space and output it", and output all other lines as well.
The first sed command will always return true, so the elif will never be executed; sed returns true unless there's an error.
This may be what you intended:
if grep -Eqs 'd[0-9]([^0-9]|$)' test2.txt; then
    sed 's/\(d[0-9]\)\([^0-9]\|$\)/\1 \2/g' test2.txt >test3.txt
elif grep -Eqs 'd[0-9][0-9]([^0-9]|$)' test2.txt; then
    sed 's/\(d[0-9][0-9]\)\([^0-9]\|$\)/\1 \2/g' test2.txt >test3.txt
else
    cat test2.txt >test3.txt
fi
But I wonder if all that could be replaced by something like this one-liner:
sed 's/\(d[0-9][0-9]\?\)\([^0-9]\|$\)/\1 \2/g' test2.txt >test3.txt
Since I don't know what test2.txt looks like, part of this is only guessing.
