Unique emails in a string

I have a string with emails, some duplicated. For example:
"aaa@company.com,bbb@company.com,aaa@company.com,bbb@company.com,ccc@company.com"
I would like the string to contain only unique emails, comma-separated. The result should be:
"aaa@company.com,bbb@company.com,ccc@company.com"
Any easy way to do this?
P.S. The emails vary, and I don't know what they will contain.

How about this:
echo "aaa@company.com,bbb@company.com,aaa@company.com,bbb@company.com,ccc@company.com" |
tr ',' '\n' |
sort |
uniq |
tr '\n' ',' |
sed -e 's/,$//'
I convert the separating commas into newlines so that I can then use tools (like sort, uniq, and grep) that work with lines.
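If sorted order is acceptable, the sort and uniq steps can be collapsed into sort -u, and paste can do the rejoin without the trailing-comma cleanup; a minimal sketch of the same pipeline:

```shell
# Split on commas, de-duplicate with sort -u, and rejoin in one pass
echo "aaa@company.com,bbb@company.com,aaa@company.com,ccc@company.com" |
tr ',' '\n' |
sort -u |
paste -sd ',' -
```

paste -s joins all lines into one, using ',' as the delimiter, so there is no trailing separator left to strip.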

Using awk and process substitution, without resorting to sort and other tools:
awk -v ORS="," '!seen[$1]++' < <(echo "aaa@company.com,bbb@company.com,aaa@company.com,bbb@company.com,ccc@company.com" | tr ',' '\n')
aaa@company.com,bbb@company.com,ccc@company.com
Another way, using pure bash to avoid tr completely:
# Read into a bash array with ',' as the field separator; read -a reads into an array
IFS=',' read -ra myArray <<< "aaa@company.com,bbb@company.com,aaa@company.com,bbb@company.com,ccc@company.com"
# Print the array elements one per line and feed them to awk
awk -v ORS="," '!seen[$1]++' < <(printf '%s\n' "${myArray[@]}")
aaa@company.com,bbb@company.com,ccc@company.com
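The seen[] idea also works in a single awk invocation that splits on commas itself and preserves the original order; a sketch, with no tr or process substitution needed:

```shell
echo "aaa@company.com,bbb@company.com,aaa@company.com,ccc@company.com" |
awk -F',' '{
    out = ""
    for (i = 1; i <= NF; i++)        # walk the comma-separated fields
        if (!seen[$i]++)             # keep only the first occurrence of each
            out = out (out ? "," : "") $i
    print out
}'
```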

With perl:
$ s="aaa@company.com,bbb@company.com,aaa@company.com,bbb@company.com,ccc@company.com"
$ echo "$s" | perl -MList::MoreUtils=uniq -F, -le 'print join ",", uniq(@F)'
aaa@company.com,bbb@company.com,ccc@company.com

Getting the strings into an array:
IFS=',' read -r -a lst <<< "aaa@company.com,bbb@company.com,aaa@company.com,bbb@company.com,ccc@company.com"
Sorting and filtering (the array is printed one element per line; note that an IFS prefix on sort would not affect the expansion of "${lst[*]}"):
printf '%s\n' "${lst[@]}" | sort | uniq
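To turn the sorted, de-duplicated lines back into one comma-separated string, the result can be rejoined, for example with paste; a sketch (printf '%s\n' on the array prints one element per line regardless of IFS):

```shell
# Read the comma-separated string into an array, de-duplicate, and rejoin
IFS=',' read -r -a lst <<< "aaa@company.com,bbb@company.com,aaa@company.com,ccc@company.com"
printf '%s\n' "${lst[@]}" | sort -u | paste -sd ',' -
```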

String split and extract the last field in bash

I have a text file FILENAME. I want to split the first comma-separated field at - and extract the last element from each line. Here "$(echo $line | cut -d, -f1 | cut -d- -f4)" alone is not giving me the right result.
FILENAME:
TWEH-201902_Pau_EX_21-1195060301,15cef8a046fe449081d6fa061b5b45cb.final.cram
TWEH-201902_Pau_EX_22-1195060302,25037f17ba7143c78e4c5a475ee98e25.final.cram
TWEH-201902_Pau_T-1383-1195060311,267364a6767240afab2b646deec17a34.final.cram
code I tried:
while read line; do \
DNA="$(echo $line | cut -d, -f1 | cut -d- -f4)";
echo $DNA
done < ${FILENAME}
Result I want
1195060301
1195060302
1195060311
Would you please try the following:
while IFS=, read -r f1 _; do # set field separator to ",", assigns f1 to the 1st field and _ to the rest
dna=${f1##*-} # removes everything before the rightmost "-" from "$f1"
echo "$dna"
done < "$FILENAME"
Well, I had to do it with two lines of code. Maybe someone has a better approach.
while read line; do
DNA="$(echo $line | cut -d, -f1 | rev)"
DNA="$(echo $DNA | cut -d- -f1 | rev)"
echo $DNA
done < ${FILENAME}
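Since the loop body is just a chain of cut and rev, the while loop can be dropped entirely and the whole file processed in one pipeline; a sketch of the same double-rev trick, with FILENAME as in the question:

```shell
# Take the first comma-separated column, reverse each line so the last "-"
# field comes first, cut it out, then reverse back
cut -d, -f1 "$FILENAME" | rev | cut -d- -f1 | rev
```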
I do not know the constraints on your input file, but if what you are looking for is a 10-digit number, and there is only ever one such number per line, this should do nicely:
grep -Eo '[0-9]{10,}' input.txt
1195060301
1195060302
1195060311
This essentially says: show me all runs of 10 or more digits in this file.
input.txt
TWEH-201902_Pau_EX_21-1195060301,15cef8a046fe449081d6fa061b5b45cb.final.cram
TWEH-201902_Pau_EX_22-1195060302,25037f17ba7143c78e4c5a475ee98e25.final.cram
TWEH-201902_Pau_T-1383-1195060311,267364a6767240afab2b646deec17a34.final.cram
A sed approach:
sed -nE 's/.*-([[:digit:]]+),.*/\1/p' input_file
sed options:
-n: Do not print the whole file back; print only lines where /p applies.
-E: Use Extended Regex, without needing to escape its grammar.
The sed Extended Regex:
's/.*-([[:digit:]]+),.*/\1/p': Capture one or more digits into group 1, preceded by anything and a dash and followed by a comma and anything, then print only the captured group.
Using awk:
awk -F[,] '{ split($1,arr,"-");print arr[length(arr)] }' FILENAME
Using , as a separator, take the first delimited "piece" of data and further split it into an arr using - as the delimiter and awk's split function. We then print the last index of arr.

Linux script with 'and' and 'or' operator

I have a file where I need to replace a special character,
thorn (þ), with a tab and save it back. This works just fine with the code below.
#Skip the header line, and translate the thorn column separators (octal 376) to tabs
#and if there were any actual tabs in the raw file, translate them to something harmless - let's say a divide sign, octal 362
cat $input | tr '\11' '\362' | tr '\376' '\11' | tail -n +2 > $outputfile
Input file
1.header1þheader2þheader3
2.Thisþisþaþsample,input,thornþfile
3.forþtestingþscript
Output file
2.This is a sample,input,thorn file
3.for testing script
Notice the comma not getting replaced, which is what we need.
However, I need to tune the code so that when thorn is not present, comma is treated as the delimiter instead and replaced with tab. The problem is that when I use an 'or' condition, a file that does contain thorn is not saved, because it does not meet the comma criterion.
cat $input | tr '\11' '\362' | tr '\376' '\11' || tr ',' '\11' | tail -n 2 > $outputfile
I am using a double pipe because when thorn is present I cannot replace ','.
Basically I am trying to figure out how to combine 'and' and 'or' in a Linux script, but it's not working, as below.
cat $input | ((tr '\11' '\367' | tr '\376' '\11') || (tr ',' '\11')) & tail -n +2 > newfile.csv
cat Input_file
1.header1þheader2þheader3
2.Thisþisþaþsample,input,thornþfile
3.forþthornþdelimited
4.for,comma,delimited
5.forþboth,thorn,andþcomma,delimted
For me, the thorn character þ in the above Input_File is represented by the two characters \303\276, and passing the above entries through this perl one-liner will produce the result that OP wanted:
cat Input_file | perl -ne 'if (s/\303\276/\t/g) {print} elsif (s/,/\t/g) {print} else {print}'
1.header1 header2 header3
2.This is a sample,input,thorn file
3.for thorn delimited
4.for comma delimited
5.for both,thorn,and comma,delimted
I tried to work out the resolution myself and it is working just fine.
The code is attached below; however, if anybody can spot an issue or suggest a better resolution, that would be highly appreciated.
input='inputFile.csv'
if grep -q $'\376' $input; then
cat $input | tr '\11' '\367' | tr '\376' '\11'
elif grep -q ',' $input; then
cat $input | tr ',' '\11'
fi

Add suffix to comma-separated strings in bash ecosystem

Is there a way of transforming a comma-delimited variable to add a suffix to each token using standard GNU tools? e.g.
VARIABLE="aaa,bbb,ccc"
suffix="-foo"
Expected output: "aaa-foo,bbb-foo,ccc-foo"
Additionally, if I have only one token, the transformation should behave in the same way,
e.g. aaa -> aaa-foo
echo "aaa,bbb,ccc" | sed -E 's/([^,]+)/\1-foo/g'
It captures groups of characters that are not "," and appends -foo to each.
With variables:
suffix="-foo"; VARIABLE="aaa,bbb,ccc"; echo ${VARIABLE} | sed -E "s/([^,]+)/\1${suffix}/g"
echo $VARIABLE | tr "," "\n" | awk '{print $1"-foo"}' | paste -sd "," -
explanation:
put each token on single line
tr "," "\n"
append "-foo" to each token
awk '{print $1"-foo"}'
join back up with the original comma
paste -sd "," -
Try:
answer=$(echo $VARIABLE | sed "s/,/-foo,/g" | sed "s/$/-foo/")
If you need to have the suffix as a variable, then try:
answer=$(echo $VARIABLE | sed "s/,/${suffix},/g" | sed "s/$/${suffix}/")
I don't have access to a Unix box at the moment to prove this works.
The following:
s="aaa,bbb,ccc"
IFS=,
a=( $s )
mapfile -t b < <(printf '%s-foo\n' "${a[@]}")
should give us:
$ declare -p b
declare -a b=([0]="aaa-foo" [1]="bbb-foo" [2]="ccc-foo")
From there, you can reconstruct the original format in a number of ways...
IFS=, eval 'JOINED="${b[*]}"'
Or if you don't like using eval, perhaps:
d=""; o=""
for x in "${b[@]}"; do
printf -v o '%s%s%s' "$o" "$d" "$x"
d=,
done
... which will put the complete modified string in $o.
With bash Parameter Expansion
var='aaa,bbb,ccc';[ -n "$var" ] && printf "%s\n" "${var//,/-foo,}-foo"
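This expansion also satisfies the single-token requirement from the question: a string with no commas passes through the substitution unchanged and still gets the final -foo appended.

```shell
# Single token: the //,/ substitution finds nothing to replace
var='aaa'
printf '%s\n' "${var//,/-foo,}-foo"   # prints aaa-foo
```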

Count number of patterns with a single command

I'd like to count the number of occurrences in a string. For example, in this string:
'apache2|ntpd'
there are 2 different strings separated by the | character.
Another example:
'apache2|ntpd|authd|freeradius'
In this case there are 4 different strings separated by the | character.
Would you know a shell or perl command that could simply count this for me?
You can use the awk command as below:
echo "apache2|ntpd" | awk -F'|' '{print NF}'
-F'|' sets the field separator;
NF means Number of Fields
Example:
user@host:/tmp$ echo 'apache2|ntpd|authd|freeradius' | awk -F'|' '{print NF}'
4
You can also use this:
user@host:/tmp$ echo "apache2|ntpd" | tr '|' ' ' | wc -w
2
user@host:/tmp$ echo 'apache2|ntpd|authd|freeradius' | tr '|' ' ' | wc -w
4
tr '|' ' ' : translate | to space
wc -w : print the word count
If there are spaces in the string, wc -w does not give the correct result, so translate to newlines instead:
user@host:/tmp$ echo 'apac he2|ntpd' | tr '|' ' ' | wc -w
3 --> not correct
user@host:/tmp$ echo 'apac he2|ntpd' | tr '|' '\n' | wc -l
2
tr '|' '\n' : translate | to newline
wc -l : count the lines
You can do this within bash alone, without calling external languages like awk or external programs like grep and tr.
data='apache2|ntpd|authd|freeradius'
res=${data//[!|]/}
num_strings=$(( ${#res} + 1 ))
echo $num_strings
Let me explain.
res=${data//[!|]/} removes all characters that are not (that's the !) pipes (|).
${#res} gives the length of the resulting string.
num_strings=$(( ${#res} + 1 )) adds one to the number of pipes to get the number of fields.
It's that simple.
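One edge case worth checking: with a single token and no | at all, the stripped string is empty and the count still comes out right:

```shell
data='apache2'
res=${data//[!|]/}        # no pipes in the input, so res is empty
echo $(( ${#res} + 1 ))   # prints 1
```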
Another pure bash technique using positional-parameters
$ userString="apache2|ntpd|authd|freeradius"
$ printf "%s\n" $(IFS=\|; set -- $userString; printf "%s\n" "$#")
4
Thanks to cdarke's suggestion from the comments, the above command can directly store the count in a variable:
$ printf -v count "%d" $(IFS=\|; set -- $userString; printf "%s\n" "$#")
$ printf "%d\n" "$count"
4
With wc and parameter expansion:
$ data='apache2|ntpd|authd|freeradius'
$ wc -w <<< ${data//|/ }
4
Using parameter expansion, all pipes are replaced with spaces. The result string is passed to wc -w for word count.
As @gniourf_gniourf mentioned, this works with what look like process names, but will fail if the strings contain spaces.
You can do this with grep as well-
echo "apache2|ntpd|authd|freeradius" | grep -o "|" | wc -l
Output-
3
That output is the number of pipes.
To get the number of commands-
var=$(echo "apache2|ntpd|authd|freeradius" | grep -o "|" | wc -l)
echo $((var + 1))
Output -
4
You could use awk to count the occurrences of the delimiter and add 1:
$ awk '{print gsub(/\|/,"")+1}' <(echo "apache2|ntpd|authd|freeradius")
4
Maybe this will help you.
IN="apache2|ntpd"
mails=$(echo $IN | tr "|" "\n")
for addr in $mails
do
echo "> [$addr]"
done
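If a count is wanted rather than a listing, the same loop can increment a counter instead of echoing each token (like the loop above, this breaks if tokens contain spaces):

```shell
IN="apache2|ntpd|authd|freeradius"
count=0
for addr in $(echo "$IN" | tr "|" "\n"); do
    count=$((count + 1))      # one increment per token
done
echo "$count"   # prints 4
```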

bash, extract string from text file with space delimiter

I have a text files with a line like this in them:
MC exp. sig-250-0 events & $0.98 \pm 0.15$ & $3.57 \pm 0.23$ \\
sig-250-0 is something that can change from file to file (but I always know what it is for each file). There are lines before and after this, but the string "MC exp. sig-250-0 events" is unique in the file.
For a particular file, is there a good way to extract the second number 3.57 in the above example using bash?
use awk for this:
awk '/MC exp. sig-250-0/ {print $10}' your.txt
Note that this will print $3.57 with the leading $; if you don't want that, pipe the output to tr:
awk '/MC exp. sig-250-0/ {print $10}' your.txt | tr -d '$'
In comments you wrote that you need to call it in a script like this:
while read p ; do
echo $p,awk '/MC exp. sig-$p/ {print $10}' filename | tr -d '$'
done < grid.txt
Note that you need command substitution $() for the awk pipe, like this:
echo "$p",$(awk '/MC exp. sig-$p/ {print $10}' filename | tr -d '$')
If you want to pass a shell variable to the awk pattern, use -v and match with $0 ~ p (a /p/ regex would match the literal letter p, not the variable):
awk -v p="MC exp. sig-$p" '$0 ~ p {print $10}' a.txt | tr -d '$'
More sample lines would have been nice, but I guess you would like a simple awk solution.
awk '{print $N}' $file
If you don't tell awk which field separator to use, it defaults to whitespace. Now you just have to count the fields to find the one you want. In your case that is field 10.
awk '{print $10}' file.txt
$3.57
Don't want the $?
Pipe your awk result to cut:
awk '{print $10}' foo | cut -d'$' -f2
-d uses the $ as the field separator and -f selects the second field.
If you know you always have the same number of fields, then
#!/bin/bash
file=$1
key=$2
while read -ra f; do
if [[ "${f[0]} ${f[1]} ${f[2]} ${f[3]}" == "MC exp. $key events" ]]; then
echo ${f[9]}
fi
done < "$file"
