I have a string that has duplicate words. I would like to display only the unique words. The string is:
variable="alpha bravo charlie alpha delta echo charlie"
I know several tools that can do this together. This is what I figured out:
echo $variable | tr " " "\n" | sort -u | tr "\n" " "
What is a more effective way to do this?
Use a Bash Substitution Expansion
The following shell parameter expansion will substitute spaces with newlines, and then pass the results into the sort utility to return only the unique words.
$ echo -e "${variable// /\\n}" | sort -u
alpha
bravo
charlie
delta
echo
This has the side-effect of sorting your words, because sort -u has to sort its input to find duplicates (and uniq only detects duplicates on adjacent lines). If that's not what you want, I also posted a Ruby solution that preserves the original word order.
Rejoining Words
If, as one commenter pointed out, you're trying to reassemble your unique words back into a single line, you can use command substitution to do this. For example:
$ echo $(echo -e "${variable// /\\n}" | sort -u)
alpha bravo charlie delta echo
The lack of quotes around the command substitution is intentional. If you quote it, the newlines will be preserved because Bash won't do word-splitting. Unquoted, the shell splits the output into words and echo joins them with single spaces, so the result comes back as a single line, however unintuitive that may seem.
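To make the quoting difference concrete, here is a small sketch; the variable names sorted, unquoted and quoted are just for illustration:

```shell
variable="alpha bravo charlie alpha delta echo charlie"

# One word per line, sorted and de-duplicated.
sorted=$(printf '%s\n' $variable | sort -u)

# Unquoted: the shell word-splits $sorted and echo joins the words with spaces.
unquoted=$(echo $sorted)

# Quoted: no word-splitting, so the embedded newlines survive.
quoted=$(echo "$sorted")

printf 'unquoted: %s\n' "$unquoted"
printf 'quoted:\n%s\n' "$quoted"
```

The unquoted form prints everything on one line, while the quoted form keeps one word per line.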
You may use xargs:
echo "$variable" | xargs -n 1 | sort -u | xargs
Note: This solution assumes that all unique words should be output in the order they're encountered in the input. By contrast, the OP's own solution attempt outputs a sorted list of unique words.
A simple Awk-only solution (POSIX-compliant) that is efficient by avoiding a pipeline (which invariably involves subshells).
awk -v RS=' ' '{ if (!seen[$1]++) { printf "%s%s",sep,$1; sep=" " } }' <<<"$variable"
# The above prints without a trailing \n, as in the OP's own solution.
# To add a trailing newline, append `END { print }` to the end
# of the Awk script.
Note how $variable is double-quoted to prevent it from accidental shell expansions, notably pathname expansion (globbing), and how it is provided to Awk via a here-string (<<<).
-v RS=' ' tells Awk to split the input into records by a single space.
Note that the last word will have the input line's trailing newline included, which is why we don't use $0 - the entire record - but $1, the record's first field, which has the newline stripped due to Awk's default field-splitting behavior.
seen[$1]++ is a common Awk idiom that either creates an entry for $1, the input word, in associative array seen, if it doesn't exist yet, or increments its occurrence count.
!seen[$1]++ therefore only returns true for the first occurrence of a given word (where seen[$1] is implicitly zero/the empty string; the ++ is a post-increment, and therefore doesn't take effect until after the condition is evaluated)
{printf "%s%s",sep,$1; sep=" "} prints the word at hand $1, preceded by separator sep, which is implicitly the empty string for the first word, but a single space for subsequent words, due to setting sep to " " immediately after.
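If the post-increment part of the idiom is hard to picture, this small trace may help. It is only a sketch with made-up input (the letters a, b, c); the variable names trace and firsts are mine:

```shell
# Print each word next to the value seen[$1]++ evaluates to.
# The first time a word shows up the value is 0, so !seen[$1]++ is true.
trace=$(printf '%s\n' a b a c b a | awk '{ print $1, seen[$1]++ }')
printf '%s\n' "$trace"

# Using the idiom as a bare pattern keeps only first occurrences.
firsts=$(printf '%s\n' a b a c b a | awk '!seen[$1]++')
printf '%s\n' "$firsts"
```

The trace shows 0 next to the first a, b and c and growing counts afterwards; the second command prints a, b, c, one per line.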
Here's a more flexible variant that handles any run of whitespace between input words; it works with GNU Awk and Mawk[1]:
awk -v RS='[[:space:]]+' '{if (!seen[$0]++){printf "%s%s",sep,$0; sep=" "}}' <<<"$variable"
-v RS='[[:space:]]+' tells Awk to split the input into records by any mix of spaces, tabs, and newlines.
[1] Unfortunately, BSD/OSX Awk (in strict compliance with the POSIX spec), doesn't support using regular expressions or even multi-character literals as RS, the input record separator.
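If you need the "any run of whitespace" behavior on an Awk without regex RS support, one possible workaround is to normalize the whitespace with tr first and let Awk see one word per line. This is a sketch, not part of the solution above; it assumes your tr pads the second set with its last character, which GNU and BSD tr both do:

```shell
# Sample input with double spaces, a tab and a newline between words.
variable=$(printf 'alpha  bravo\tcharlie\nalpha delta')

# tr turns every run of whitespace into a single newline;
# NF skips any empty record caused by leading whitespace.
unique=$(printf '%s\n' "$variable" |
    tr -s '[:space:]' '\n' |
    awk 'NF && !seen[$0]++ { printf "%s%s", sep, $0; sep = " " } END { print "" }')

printf '%s\n' "$unique"
```

This gives up the "no pipeline" property, but it keeps the input order and only uses the default record separator.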
Preserve Input Order with a Ruby One-Liner
I posted a Bash-specific answer already, but if you want to return only unique words while preserving the word order of the original string, then you can use the following Ruby one-liner:
$ echo "$variable" | ruby -ne 'puts $_.split.uniq'
alpha
bravo
charlie
delta
echo
This will split the input string on whitespace, and then return unique elements from the resulting array.
Unlike the sort or uniq utilities, Ruby doesn't need the words to be sorted to detect duplicates. This may be a better solution if you don't want your results to be sorted, although given your input sample it makes no practical difference for the posted example.
Rejoining Words
If, as one commenter pointed out, you're then trying to reassemble the words back into a single line after deduplication, you can do that too. For that, we just append the Array#join method:
$ echo "$variable" | ruby -ne 'puts $_.split.uniq.join(" ")'
alpha bravo charlie delta echo
You can use awk:
$ echo "$variable" | awk '{for(i=1;i<=NF;i++){if (!seen[$i]++) printf "%s ", $i}}'
alpha bravo charlie delta echo
If you do not want the trailing space and want a trailing newline, you can do:
$ echo "$variable" | awk '{for(i=1;i<=NF;i++){if (!seen[$i]++) j = (j=="" ? $i : j" "$i)}} END{print j}'
alpha bravo charlie delta echo
Using associative arrays in BASH 4+ you can simplify this:
variable="alpha bravo charlie alpha delta echo charlie"
# declare an associative array
declare -A unq
# read sentence into an indexed array
read -ra arr <<< "$variable"
# iterate each word and populate associative array with word as key
for w in "${arr[@]}"; do
unq["$w"]=1
done
# print unique results
printf "%s\n" "${!unq[@]}"
delta
bravo
echo
alpha
charlie
## if you want results in same order as original string
for w in "${arr[@]}"; do
[[ ${unq[$w]} ]] && echo "$w" && unset 'unq[$w]'
done
alpha
bravo
charlie
delta
echo
pure, ugly bash:
for x in $variable; do
if [ "$(eval echo $(echo \$un__$x))" = "" ]; then
echo -n "$x "
eval un__$x=1
__usv="$__usv un__$x"
fi
done
unset $__usv
I have a file consisting of multiple rows like this
10|EQU000000001|12345678|3456||EOMCO042|EOMCO042|31DEC2018|16:51:17|31DEC2018|SHOP NO.5,6,7 RUNWAL GRCHEMBUR MHIN|0000000010000.00|6761857316|508998|6011|GL
I have to split and replace the column 11 into 4 different columns using the count of character.
This is the 11th column containing extra spaces also.
SHOP NO.5,6,7 RUNWAL GRCHEMBUR MHIN
This is what I have done:
ls *.txt *.TXT| while read line
do
subName="$(cut -d'.' -f1 <<<"$line")"
awk -F"|" '{ "echo -n "$11" | cut -c1-23" | getline ton;
"echo -n "$11" | cut -c24-36" | getline city;
"echo -n "$11" | cut -c37-38" | getline state;
"echo -n "$11" | cut -c39-40" | getline country;
$11=ton"|"city"|"state"|"country; print $0
}' OFS="|" $line > $subName$output
done
But when I echo the 11th column, it trims the extra spaces, which leads to a mismatch in the character count. Is there any way to echo without trimming spaces?
Actual output
10|EQU000000001|12345678|3456||EOMCO042|EOMCO042|31DEC2018|16:51:17|31DEC2018|SHOP NO.5,6,7 RUNWAL GR|CHEMBUR MHIN|||0000000010000.00|6761857316|508998|6011|GL
Expected Output
10|EQU000000001|12345678|3456||EOMCO042|EOMCO042|31DEC2018|16:51:17|31DEC2018|SHOP NO.5,6,7 RUNWAL GR|CHEMBUR|MH|IN|0000000010000.00|6761857316|508998|6011|GL
The least annoying way to code this that I've found so far is:
perl -F'\|' -lane '$F[10] = join "|", unpack "a23 A13 a2 a2", $F[10]; print join "|", @F'
It's fairly straightforward:
Iterate over lines of input; split each line on | and put the fields in @F.
For the 11th field ($F[10]), split it into fixed-width subfields using unpack (and trim trailing spaces from the second field (A instead of a)).
Reassemble subfields by joining with |.
Reassemble the whole line by joining with | and printing it.
I haven't benchmarked it in any way, but it's likely much faster than the original code that spawns multiple shell and cut processes per input line because it's all done in one process.
A complete solution would wrap it in a shell loop:
for file in *.txt *.TXT; do
outfile="${file%.*}$output"
perl -F'\|' -lane '...' "$file" > "$outfile"
done
Or if you don't need to trim the .txt part (and you don't have too many files to fit on the command line):
perl -i.out -F'\|' -lane '...' *.txt *.TXT
Note that -i edits each file in place: the output replaces foo.txt itself, and the original content is kept as a backup in foo.txt.out.
A pure-bash implementation of all this logic
#!/usr/bin/env bash
shopt -s nocaseglob extglob
for f in *.txt; do
subName=${f%.*}
while IFS='|' read -r -a fields; do
location=${fields[10]}
ton=${location:0:23}; ton=${ton%%+([[:space:]])}
city=${location:23:13}; city=${city%%+([[:space:]])}
state=${location:36:2}
country=${location:38:2}
fields[10]="$ton|$city|$state|$country"
printf -v out '%s|' "${fields[@]}"
printf '%s\n' "${out:0:$(( ${#out} - 1 ))}"
done <"$f" >"$subName.out"
done
It's slower (if I did this well, by about a factor of 10) than pure awk would be, but much faster than the awk/shell combination proposed in the question.
Going into the constructs used:
All the ${varname%...} and related constructs are parameter expansion. The specific ${varname%pattern} construct removes the shortest possible match for pattern from the end of the value in varname, or the longest match if % is replaced with %%.
Using extglob enables extended globbing syntax, such as +([[:space:]]), which is equivalent to the regex syntax [[:space:]]+.
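For comparison, here is a rough sketch of what the awk-only approach alluded to above might look like, using substr instead of spawning cut. The sample line and the field11 variable are made up for illustration (the real file has 16 columns); the widths 23/13/2/2 come from the cut calls in the question:

```shell
# Build a 40-character 11th field with the fixed widths 23/13/2/2.
field11=$(printf '%-23s%-13s%-2s%-2s' 'SHOP NO.5,6,7 RUNWAL GR' 'CHEMBUR' 'MH' 'IN')
line="1|2|3|4|5|6|7|8|9|10|$field11|12"

result=$(printf '%s\n' "$line" | awk -F'|' -v OFS='|' '{
    f = $11
    city = substr(f, 24, 13)
    sub(/ +$/, "", city)    # trim the padding, like A13 in the Perl unpack
    # Assigning to $11 makes awk rebuild the record with OFS.
    $11 = substr(f, 1, 23) OFS city OFS substr(f, 37, 2) OFS substr(f, 39, 2)
    print
}')

printf '%s\n' "$result"
```

Because the field never passes through an unquoted shell expansion, the padding spaces are still there when substr counts characters.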
I want to delete duplicate strings from a String. Example:
A="Dog Cat Horse Dog Dog Cat"
The string A should look like this:
A="Dog Cat Horse"
How can I write a Shell script for that?
You could use this,
echo "a a b b c c" | tr ' ' '\n' | sort | uniq | tr '\n' ' ' | sed -e 's/[[:space:]]*$//'
If order is not important, you can use an associative array:
declare -A uniq
for k in $A ; do uniq[$k]=1 ; done
echo "${!uniq[@]}"
(Safely) split the string on blanks, creating an array with each word:†
read -r -d '' -a words < <(printf '%s\0' "$A")
Loop on the fields of the array, storing the words into an associative array; if the word was already seen, ignore it
declare -A Aseen
Aunique=()
for w in "${words[@]}"; do
[[ ${Aseen[$w]} ]] && continue
Aunique+=( "$w" )
Aseen[$w]=x
done
You can print the Aunique array to standard output:
printf '%s\n' "${Aunique[@]}"
which yields:
Dog
Cat
Horse
or create a new string with it
Anew="${Aunique[*]}"
printf '%s\n' "$Anew"
which yields:
Dog Cat Horse
or join the array with a separator, e.g., with the character ,:‡
IFS=, eval 'Asep="${Aunique[*]}"'
printf '%s\n' "$Asep"
which yields:
Dog,Cat,Horse
All these use Bash≥4 features. If you're stuck on older Bash versions, there are workarounds but it won't be as safe and nice and easy…
Note. This method will not sort the string: the words remain in the original order, only with the duplicates removed.
†This is the canonical (and safe!) way to split a string on space characters (or, more generally on the characters contained in the special variable IFS, which has default value space-tab-newline). Don't use horrors like words=( $A ): it's subject to filename expansion (globbing). Another method widely encountered is read -r -a words <<< "$A"; this is fine (i.e., safe), but will not handle newlines in A.
‡The use of eval here is 100% safe (because of the single quotes); it's actually the canonical way to join the elements of an array in Bash (or to join the positional parameters in POSIX shells).
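If you'd rather not see eval at all, the same joining trick can be wrapped in a small function whose body runs in a subshell, so the IFS change cannot leak. This is only a sketch; join_by is a name I made up, and it uses "$*" on the positional parameters, so it also works in plain POSIX sh:

```shell
# join_by SEP WORD...  prints the words joined by the first character of SEP.
# The parentheses make the body a subshell, so IFS is restored automatically.
join_by() (
    IFS=$1
    shift
    printf '%s\n' "$*"
)

joined=$(join_by , Dog Cat Horse)
printf '%s\n' "$joined"
```

With the Bash array from above you would call it as join_by , "${Aunique[@]}".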
With gawk:
awk -v RS="[ \n]" -v ORS=" " '!($0 in a){print;a[$0]}' <(echo $A)
How do I go about finding the one word that is not repeated in a string in bash? I'd like to know if there is a "native" bash way of doing this, or if I need to use another command line utility (like awk,sed,grep,...).
For instance, var1="thrice once twice twice thrice";. I need something that will split out the word 'once' since it only occurs once (i.e., no duplicates).
You could use sort, uniq after splitting the string by whitespace:
tr ' ' '\n' <<< "$var1" | sort | uniq -u
This would produce once for your input.
(If the input contains punctuation, you might want to remove it before anything else in order to avoid unexpected results.)
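As a sketch of that punctuation caveat (the sample string with a comma and a period is made up), you could strip punctuation with tr before splitting:

```shell
var1="thrice, once twice twice. thrice"

# Without the first tr, "thrice," and "thrice" would count as different words.
result=$(printf '%s\n' "$var1" |
    tr -d '[:punct:]' |
    tr -s ' ' '\n' |
    sort | uniq -u)

printf '%s\n' "$result"
```

Note that this removes all punctuation, including characters inside words such as apostrophes or dots, which may or may not be what you want.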
@devnull's answer is the better choice (both for simplicity and probably performance), but if you're looking for a bash-only solution:
Caveats:
Uses associative arrays, which are only available in bash 4 or higher:
Using a literal * in the input word list won't work (other glob-like strings are OK, however).
Deals correctly with multi-line input and input with multiple whitespace chars. between words.
# Define the input word list.
# Bonus: multi-line input with multiple inter-word spaces.
var1=$'thrice once twice twice thrice\ntwice again'
# Declare associative array.
declare -A wordCounts
# Read all words and count the occurrence of each.
while read -r w; do
[[ -n $w ]] && (( wordCounts[$w]+=1 ))
done <<<"${var1// /$'\n'}" # split input list into lines for easy parsing
# Output result.
# Note that the output list will NOT automatically be sorted, because the keys of an
# associative array are not 'naturally sorted'; hence piping to `sort`.
echo "Words that only occur once in '$var1':"
echo "---"
for w in "${!wordCounts[@]}"; do
(( wordCounts[$w] == 1 )) && echo "$w"
done | sort
# Expected output:
# again
# once
Just for fun, awk:
awk '{
for (i=1; i<=NF; i++) c[$i]++
for (word in c) if (c[word]==1) print word
}' <<< "$var1"
once
I have this variable:
A="Some variable has value abc.123"
I need to extract this value i.e abc.123. Is this possible in bash?
Simplest is
echo "$A" | awk '{print $NF}'
Edit: explanation of how this works...
awk breaks the input into different fields, using whitespace as the separator by default. Hardcoding 5 in place of NF prints out the 5th field in the input:
echo "$A" | awk '{print $5}'
NF is a built-in awk variable that gives the total number of fields in the current record. The following returns the number 5 because there are 5 fields in the string "Some variable has value abc.123":
echo "$A" | awk '{print NF}'
Combining $ with NF outputs the last field in the string, no matter how many fields your string contains.
Yes; this:
A="Some variable has value abc.123"
echo "${A##* }"
will print this:
abc.123
(The ${parameter##word} notation is explained in §3.5.3 "Shell Parameter Expansion" of the Bash Reference Manual.)
Some examples using parameter expansion
A="Some variable has value abc.123"
echo "${A##* }"
abc.123
Shortest match on " " space, from the end
echo "${A% *}"
Some variable has value
Shortest match on . dot, from the end
echo "${A%.*}"
Some variable has value abc
Longest match on " " space, from the end
echo "${A%% *}"
Some
Read more Shell-Parameter-Expansion
The documentation is a bit painful to read, so I've summarised it in a simpler way.
Note that the '*' needs to swap places with the ' ' depending on whether you use # or %. (The * is just a wildcard, so you may need to take off your "regex hat" while reading.)
${A% *} - remove shortest trailing * (strip the last word)
${A%% *} - remove longest trailing * (strip the last words)
${A#* } - remove shortest leading * (strip the first word)
${A##* } - remove longest leading * (strip the first words)
Of course a "word" here may contain any character that isn't a literal space.
You might commonly use this syntax to trim filenames:
${A##*/} removes all containing folders, if any, from the start of the path, e.g.
/usr/bin/git -> git
/usr/bin/ -> (empty string)
${A%/*} removes the last file/folder/trailing slash, if any, from the end:
/usr/bin/git -> /usr/bin
/usr/bin/ -> /usr/bin
${A%.*} removes the last extension, if any (just be wary of things like my.path/noext):
archive.tar.gz -> archive.tar
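Here is the summary above as something you can paste into a terminal. The variable names (last, first, path and so on) are only for illustration:

```shell
A="Some variable has value abc.123"

last=${A##* }        # longest leading "* " removed: keeps the last word
first=${A%% *}       # longest trailing " *" removed: keeps the first word
all_but_last=${A% *}     # shortest trailing " *" removed
all_but_first=${A#* }    # shortest leading "* " removed

path=/usr/bin/git
base=${path##*/}     # strip the containing folders
dir=${path%/*}       # strip the last path component

file=archive.tar.gz
stem=${file%.*}      # strip only the last extension

printf '%s\n' "$last" "$first" "$all_but_last" "$all_but_first" "$base" "$dir" "$stem"
```

All of these are plain POSIX parameter expansions, so they work in sh as well as in Bash.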
How do you know where the value begins? If it always starts at the 5th word, you could use e.g.:
B=$(echo "$A" | cut -d ' ' -f 5-)
This uses the cut command to slice out part of the line, using a simple space as the word delimiter.
As pointed out by Zedfoxus here. A very clean method that works on all Unix-based systems. Besides, you don't need to know the exact position of the substring.
A="Some variable has value abc.123"
echo "$A" | rev | cut -d ' ' -f 1 | rev
# abc.123
More ways to do this:
(Run each of these commands in your terminal to test this live.)
For all answers below, start by typing this in your terminal:
A="Some variable has value abc.123"
The array example (#3 below) is a really useful pattern, and depending on what you are trying to do, sometimes the best.
1. with awk, as the main answer shows
echo "$A" | awk '{print $NF}'
2. with grep:
echo "$A" | grep -o '[^ ]*$'
the -o says to only retain the matching portion of the string
the [^ ] part says "don't match spaces"; ie: "not the space char"
the * means: "match 0 or more instances of the preceding match pattern (which is [^ ])", and the $ means "match the end of the line." So, this matches the last word after the last space through to the end of the line; ie: abc.123 in this case.
3. via regular bash "indexed" arrays and array indexing
Convert A to an array, with elements being separated by the characters in IFS (the Internal Field Separator), which by default are space, tab, and newline:
Option 1 (will "break in mysterious ways", as @tripleee put it in a comment here, if the string stored in the A variable contains certain special shell characters, so Option 2 below is recommended instead!):
# Capture space-separated words as separate elements in array A_array
A_array=($A)
Option 2 [RECOMMENDED!]. Use the read command, as I explain in my answer here, and as is recommended by the bash shellcheck static code analyzer tool for shell scripts, in ShellCheck rule SC2206, here.
# Capture space-separated words as separate elements in array A_array, using
# a "herestring".
# See my answer here: https://stackoverflow.com/a/71575442/4561887
IFS=" " read -r -d '' -a A_array <<< "$A"
Then, print only the last element in the array:
# Print only the last element via bash array right-hand-side indexing syntax
echo "${A_array[-1]}" # last element only
Output:
abc.123
Going further:
What makes this pattern so useful too is that it allows you to easily do the opposite too!: obtain all words except the last one, like this:
array_len="${#A_array[@]}"
array_len_minus_one=$((array_len - 1))
echo "${A_array[@]:0:$array_len_minus_one}"
Output:
Some variable has value
For more on the ${array[@]:start:length} array slicing syntax above, see my answer here: Unix & Linux: Bash: slice of positional parameters, and for more info. on the bash "Arithmetic Expansion" syntax, see here:
https://www.gnu.org/savannah-checkouts/gnu/bash/manual/bash.html#Arithmetic-Expansion
https://www.gnu.org/savannah-checkouts/gnu/bash/manual/bash.html#Shell-Arithmetic
You can use a Bash regex:
A="Some variable has value abc.123"
[[ $A =~ [[:blank:]]([^[:blank:]]+)$ ]] && echo "${BASH_REMATCH[1]}" || echo "no match"
Prints:
abc.123
That works with any [:blank:] delimiter in the current locale (usually [ \t]). If you want to be more specific:
A="Some variable has value abc.123"
pat='[ ]([^ ]+)$'
[[ $A =~ $pat ]] && echo "${BASH_REMATCH[1]}" || echo "no match"
echo "Some variable has value abc.123"| perl -nE'say $1 if /(\S+)$/'
I have a string formatted as below
Walk Off the Earth - Somebody That I Used to Know
[playing] #36/37 1:04/4:05 (26%)
volume: n/a repeat: off random: on single: off consume: off
Now, from the above string I need to extract 36 from #36/37.
First thing I did was to extract #36/37 from second line using
echo "above mentioned string" | awk 'NR==2 {print $2}'
Now, I want to extract 36 from the above extracted part for that I did
echo "#36/37" | sed -e 's/\//#/g' | awk -F "#" '{print $2}'
which gave me 36 as my output.
But I feel that using both sed and awk just to extract text from #36/37 is a bit of an overkill. So, is there a better or shorter way to achieve this?
Split the field on the pound and slash characters into an array and retrieve the required element.
awk 'NR==2 {split($2, arr, "[#/]"); print arr[2]}'
This answer takes advantage of bash's built-in extended regular-expression syntax using the =~ test operator. (I say test, but don't expect it to work with the test command. It only works with the [[ keyword.)
mini:~ michael$ cat foo
Walk Off the Earth - Somebody That I Used to Know
[playing] #36/37 1:04/4:05 (26%)
volume: n/a repeat: off random: on single: off consume: off
mini:~ michael$ [[ $(<foo) =~ \#[[:digit:]]{2} ]] && echo "${BASH_REMATCH[0]#\#}"
36
When you boil it down, this is simply a regular expression that matches a pound sign followed by two digits and saves the match in the zeroth element of the BASH_REMATCH array; the ${BASH_REMATCH[0]#\#} expansion then strips the leading pound sign.
One way using sed assuming infile has the content of the question. In second line match any characters until #, then save any numbers in group 1, and substitute the complete line with this group \1. The -n switch avoids print anything unless indicated with a p instruction in the code.
sed -ne '2 { s/^[^#]*#\([0-9]*\).*$/\1/; p; q }' infile
Output:
36
This might work for you:
sed 's/.*#\([0-9]*\)\/[0-9]*.*/\1/p;d' file
36
sed -n '2s/.*\#\([0-9]*\)\/.*/\1/p'
This suppresses everything but the second line, then echoes the digits between # and /.
input | while read playing numbers rest
do
if [[ $playing = "[playing]" ]]; then
t="${numbers:1}"
echo "${t%/*}"
fi
done
Bash default split is by whitespace, so what you get in the second field (numbers) is just that numbers. The rest is the use of bash parameter expansion operators to get at the portion of interest: remove the first character and remove the suffix starting with "/"
This would solve your problem.
awk -F'[#/]' 'NR==2{print $2}'
I've written a script which outputs the string between the first and last character. To solve your problem, you can use the following commands combined with this script.
echo '[playing] #36/37 1:04/4:05 (26%)' | cut -d' ' -f2 | ./cut_between.sh -f '#' -l '/'
You can download this script on GitHub.
You can do it without any external program with BASH-internal string operations like this:
string="[playing] #36/37 1:04/4:05 (26%)"
part=${string##*#};number=${part%%/*}
echo "$number"