sed: remove whole words containing a character class - linux

I'd like to remove any word which contains a non-alphabetic character from a text file, e.g.
"ok 0bad ba1d bad3 4bad4 5bad5bad5"
should become
"ok"
I've tried using
echo "ok 0bad ba1d bad3 4bad4 5bad5bad5" | sed 's/\b[a-zA-Z]*[^a-zA-Z]\+[a-zA-Z]*\b/ /g'

The following sed command does the job:
sed 's/[[:space:]]*[[:alpha:]]*[^[:space:][:alpha:]][^[:space:]]*//g'
It removes all words containing at least one non-alphabetic character. It is better to use POSIX character classes like [:alpha:], because for instance they won't consider the French name "François" as being faulty (i.e. containing a non-alphabetic character).
Explanation
We remove all patterns starting with an arbitrary number of spaces followed by an arbitrary (possibly nil) number of alphabetic characters, followed by at least one non-space and non-alphabetic character, and then glob to the end of the word (i.e. until the next space). Please note that you may want to swap [:space:] for [:blank:], see this page for a detailed explanation of the difference between these two POSIX classes.
Test
$ echo "ok 0bad ba1d bad3 4bad4 5bad5bad5" | sed 's/[[:space:]]*[[:alpha:]]*[^[:space:][:alpha:]][^[:space:]]*//g'
ok
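One edge case worth checking: when the very first word on a line is removed, the blanks in front of the next kept word stay behind. A minimal sketch, assuming a second substitution that trims leading whitespace is acceptable (the function name strip_nonalpha_words is made up for this example):

```shell
# Hypothetical helper: drop every word that contains a non-alphabetic
# character, then trim the whitespace left at the start of the line.
strip_nonalpha_words() {
    sed -e 's/[[:space:]]*[[:alpha:]]*[^[:space:][:alpha:]][^[:space:]]*//g' \
        -e 's/^[[:space:]]*//'
}

# Without the second expression this would print " ok" (leading space).
printf '%s\n' '0bad ok bad3' | strip_nonalpha_words
```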

Using awk:
s="ok 0bad ba1d bad3 4bad4 5bad5bad5"
awk '{ofs=""; for (i=1; i<=NF; i++) if ($i ~ /^[[:alpha:]]+$/)
{printf "%s%s", ofs, $i; ofs=OFS} print ""}' <<< "$s"
ok
This awk command loops through all fields and prints each one that matches the regex /^[[:alpha:]]+$/. The ofs variable starts out empty and is set to OFS after the first match, so the printed words are separated by OFS with no leading separator; the final print "" ends the line.
Using grep + tr together:
s="ok 0bad ba1d bad3 4bad4 5bad5bad5"
r=$(grep -o '[^ ]\+' <<< "$s"|grep '^[[:alpha:]]\+$'|tr '\n' ' ')
echo "$r"
ok
The first grep -o breaks the string into individual words. The second grep keeps only the words made up entirely of alphabetic characters. Finally, tr translates \n to space.
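Because tr turns every newline into a space, the result above ends with a trailing space. A possible variant, as a sketch only, joins the lines with paste instead; the function name alpha_words is an assumption for illustration, and it expects words separated by plain spaces:

```shell
# Hypothetical helper: print the purely alphabetic words of "$1",
# joined by single spaces and without a trailing space.
alpha_words() {
    printf '%s\n' "$1" |
        tr -s ' ' '\n' |                  # one word per line
        grep -x '[[:alpha:]]\{1,\}' |     # keep lines that are all letters
        paste -s -d ' ' -                 # join the lines with a space
}

alpha_words 'ok 0bad ba1d good'
```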

If you're not concerned about losing different numbers of spaces between each word, you could use something like this in Perl:
perl -ane 'print join(" ", grep { !/[^[:alpha:]]/ } @F), "\n"'
The -a switch enables auto-split mode, which splits the text on any number of spaces and stores the fields in the array @F. grep keeps only the elements of that array that contain no non-alphabetic characters. The resulting array is joined on a single space.

This might work for you (GNU sed):
sed -r 's/\b([[:alpha:]]+\b ?)|\S+\b ?/\1/g;s/ $//' file
This uses a back reference within alternation to save the required string.

st="ok 0bad ba1d bad3 4bad4 5bad5bad5"
for word in $st;
do
if [[ $word =~ ^[a-zA-Z]+$ ]];
then
echo $word;
fi;
done

Related

Remove special characters from 2nd column of a file

I have a file s.csv
a,b+ -.,c
aa,bb ().,c._c
I want to remove all special characters from 2nd column (file separated by comma)
cat s.csv | tr -dc '[:alnum:]\n\r' | tr '[:upper:]' '[:lower:]'
The above code also removes special characters from the 3rd column.
awk -F, '{print $2}' s.csv | tr -dc '[:alnum:]\n\r' | tr '[:upper:]' '[:lower:]'
This code prints only the 2nd column.
Any idea how I can remove the special characters from the 2nd column and still print all columns?
Required output should be
a,b,c
aa,bb,c._c
Remove all (from second field)
characters that are not upper case letters [^A-Z
or lower case letters a-z
or digits 0-9]
from second field $2
fields are with "," separated -F ','
keep the separator in output OFS=FS
$ awk -F ',' 'BEGIN{OFS=FS}{gsub(/[^A-Za-z0-9]/,"",$2); print}' s.csv
# test
$ awk -F ',' 'BEGIN{OFS=FS}{gsub(/[^A-Za-z0-9]/,"",$2); print}' <<<'aa,bb ().,c._c'
aa,bb,c._c
As @Léa Gris mentioned below
Don't forget to set the locale to C or [^A-Za-z0-9] is gonna be
interpreted unexpectedly in non-western European alphabets. Prepend
awk invocation with
LC_ALL=C
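Put together, the locale advice could look like the sketch below. The wrapper name clean_second_field is only an illustration; the awk program is the one from this answer, and LC_ALL=C applies to that single awk invocation only:

```shell
# Hypothetical wrapper: strip everything except ASCII letters and digits
# from the 2nd comma-separated field, with ranges interpreted in the C locale.
clean_second_field() {
    LC_ALL=C awk -F ',' 'BEGIN{OFS=FS}{gsub(/[^A-Za-z0-9]/,"",$2); print}' "$@"
}

printf '%s\n' 'a,b+ -.,c' 'aa,bb ().,c._c' | clean_second_field
```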
You can use the [:alpha:] character class in awk, here on the second field, and remove the characters that aren't alphabetic with the gsub() function:
awk 'BEGIN{OFS=FS=","} {gsub(/[^[:alpha:]]+/, "", $2)} 1' file
a,b,c
aa,bb,c._c
If you need another set of characters, see this answer by Ed Morton:
https://stackoverflow.com/questions/56481541/how-can-you-tell-which-characters-are-in-which-character-classes
and see "which characters are in which character classes"
Use this Perl one-liner:
perl -F',' -lane '$F[1] =~ s{[\W_]+}{}g; @F = map { lc } @F; print join ",", @F; ' in_file > out_file
The Perl one-liner uses these command line flags:
-e : Tells Perl to look for code in-line, instead of in a file.
-n : Loop over the input one line at a time, assigning it to $_ by default.
-l : Strip the input line separator ("\n" on *NIX by default) before executing the code in-line, and append it when printing.
-a : Split $_ into array @F on whitespace or on the regex specified in -F option.
-F',' : Split into @F on comma, rather than on whitespace.
s{[\W_]+}{} : Replace 1 or more occurrences of \W (non-word character) or underscore with nothing.
The regex uses these modifiers:
/g : Match the pattern repeatedly.
SEE ALSO:
perldoc perlrun: how to execute the Perl interpreter: command line switches
perldoc perlre: Perl regular expressions (regexes)
perldoc perlre: Perl regular expressions (regexes): Quantifiers; Character Classes and other Special Escapes; Assertions; Capture groups
perldoc perlrequick: Perl regular expressions quick start
You don't have to alter the locale just to do it: by using octal escapes instead of letters, the regex engine treats them as ASCII instead of being overly clever. I even intentionally set the locale to Belgian French to illustrate:
CODE
echo 'a,b+ -.,c
aa,bb ().,c._c' | {m,g}awk '
gsub("[^\\060-\\071\\101-\\132\\141-\\172]+","",$(!_+!_))^_' \
OFS=',' FS=','
OUTPUT
a,b,c
aa,bb,c._c
SHOWCASE LOCALE=C isn't needed
LANG="fr_BE.UTF8" gawk -e '
BEGIN { for(_=8*4;_<8^4;_++) { printf("%c",_) } } ' |
LANG="fr_BE.UTF8" gawk -p- -e '
gsub("[^\\060-\\071\\101-\\132\\141-\\172]+","",$-_)^_' OFS=',' FS=','
——————————
0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz
# profile gawk, créé Sun May 29 05:58:26 2022
# Règle(s)
1 (gsub("[^\\060-\\071\\101-\\132\\141-\\172]+", "", $-_)) ^ _ { # 1
1 print
}

How to check if a value contains characters in bash

I have values such as
B146XYZ,
G638XYZ,
G488xBC
I have to write a bash script that, when it sees a comma, removes the comma and adds 7 spaces. If it sees a comma and a space, or just a space (no punctuation), it also has to add 7 spaces, so that all values have a fixed length.
if [[ $row = *','* ]]
then
first="${row%%,*}"
echo "${first} "
fi
I tried this, but I can't work out how to add conditions for the remaining criteria. I'm especially struggling with single-value cases such as G488xBC.
What about just:
sed -E 's/[, ]+/ /g' file
Or something like this will print a padded table, so long as no field is longer than 13 characters:
awk -F '[,[:space:]]+' \
'{
for (i=1; i<NF; i++) {
printf("%-14s", $i)
}
print $NF
}'
Or the same thing in pure bash:
while IFS=$', \t' read -ra vals; do
last=$((${#vals[@]} - 1))
for ((i=0; i<last; i++)); do
printf "%-14s" "${vals[i]}"
done
printf '%s\n' "${vals[last]}"
done
newrow="${row//,/ }"
VALUES=`echo $VALUES | sed 's/,/ /g' | xargs`
The sed command will replace the comma with a single space.
The xargs will consolidate any number of whitespaces into a single space.
With that, your values are in a string separated by single spaces, instead of by commas and an unknown amount of whitespace.
From there you can use for i in $VALUES; do printf "$i\t"; done
Using the tab character like above will give you aligned output in case your values may be different in length.
But if your values are always same length then you can make it a bit more simple by doing
VALUES=`echo $VALUES | sed 's/,/ /g' | xargs | sed 's/ /       /g'`
echo "$VALUES"
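For a single value, the padding can also be left to printf. This is only a sketch: it assumes every value is 7 characters long, so that a 14-column field (the same width as the %-14s used earlier) yields 7 trailing spaces, and the name pad_value is made up for the example:

```shell
# Hypothetical helper: cut the value at the first comma or space,
# then left-justify it in a 14-column field.
pad_value() {
    value=${1%%[, ]*}            # "B146XYZ," -> "B146XYZ"; "G488xBC" is unchanged
    printf '%-14s\n' "$value"
}

pad_value 'B146XYZ,'
pad_value 'G488xBC'
```

Values without a comma or space, such as G488xBC, pass through the expansion unchanged and are padded the same way.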

How to extract strings between nth and mth occurence of a certain character in linux bash?

File1 contains:
a:b:c:d:any words here:e:f:G
where "any words here" can be a single word, two words, three words, and so on.
I want to get the string between the 4th ":" and the 5th ":". So, that will be "any words here".
My initial idea was to replace ":" with a space and then use awk to print, but since the string I want to extract can be composed of multiple words, that will not work reliably.
The cut command lets you split a line on a delimiter and extract the required fields from it.
In your example,
> echo 'a:b:c:d:any words here:e:f:G' |cut -f 5 -d:
should give you
any words here
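For the general case in the title, the text between the nth and mth occurrence of the delimiter is fields n+1 through m, which cut accepts as a range. A small sketch (the function name between and its argument order are assumptions for illustration):

```shell
# Hypothetical helper: between DELIM N M
# Prints the text between the Nth and Mth occurrence of DELIM on each input line.
between() {
    delim=$1 n=$2 m=$3
    cut -d "$delim" -f "$((n + 1))-$m"
}

printf '%s\n' 'a:b:c:d:any words here:e:f:G' | between : 4 5
```

If m is more than n+1, the delimiters that fall inside the range are kept in the output.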
With awk
$ echo 'a:b:c:d:any words here:e:f:G' | awk -F: '{print $5}'
any words here
Or by creating an array with IFS changed to :
$ IFS=: words=( $(echo 'a:b:c:d:any words here:e:f:G') ); echo ${words[4]}
any words here
If it is just 1 line input you can use the bash regex. It is more of a pain if you want to return more than 1 field, but for 1 field it is easy enough:
f=3
[[ "1:2:3:4:5:6:7:8" =~ (^([^:]*:){$f,$f})([^:]*)(:|$) ]]
echo "${BASH_REMATCH[3]}"
4

Find words containing 20 vowels grep

I found many similar questions but most of them ask for vowels in a row which is easy. I want to find words that contain 20 vowels not in a row using grep.
I originally thought grep -Ei [aeiou]{20} would do it but that seems to search only for 20 vowels in a row
Use a regular expression that searches for 20 vowels separated by any quantity of consonants.
grep -Ei "[aeiou][b-df-hj-np-tv-z]*[aeiou][b-df-hj-np-tv-z]*\
[aeiou][b-df-hj-np-tv-z]*[aeiou][b-df-hj-np-tv-z]*[aeiou][b-df-hj-np-tv-z]*\
[aeiou][b-df-hj-np-tv-z]*[aeiou][b-df-hj-np-tv-z]*[aeiou][b-df-hj-np-tv-z]*\
[aeiou][b-df-hj-np-tv-z]*[aeiou][b-df-hj-np-tv-z]*[aeiou][b-df-hj-np-tv-z]*\
[aeiou][b-df-hj-np-tv-z]*[aeiou][b-df-hj-np-tv-z]*[aeiou][b-df-hj-np-tv-z]*\
[aeiou][b-df-hj-np-tv-z]*[aeiou][b-df-hj-np-tv-z]*[aeiou][b-df-hj-np-tv-z]*\
[aeiou][b-df-hj-np-tv-z]*[aeiou][b-df-hj-np-tv-z]*[aeiou][b-df-hj-np-tv-z]*"
The backslash is just informing the shell that the expression continues on the next line. It is not part of the regex itself.
If you understand that part, you can shorten it considerably using groups. This regexp is the same as above, but uses a group in parentheses with repetition.
grep -Ei "([aeiou][b-df-hj-np-tv-z]*){20}"
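Since the grouped pattern is not anchored, it should match any line that contains at least that many vowels with only consonants between them, not exactly that many. A sketch with a smaller count makes this easy to check by hand (the wrapper name at_least_vowels is hypothetical):

```shell
# Hypothetical wrapper: print lines containing a run of at least N vowels
# that are separated only by consonants.
at_least_vowels() {
    n=$1
    shift
    grep -Ei "([aeiou][b-df-hj-np-tv-z]*){$n}" "$@"
}

# banana has 3 vowels, kiwi has 2, audio has 4
printf '%s\n' banana kiwi audio | at_least_vowels 3
```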
I don't believe that's a problem that calls for just a regex. Here's a programmatic approach. We redefine the field separator to the empty string; each character is a field. We iterate over the line; if a character is a vowel we increment a counter. If, at the end of the string, the count is 20, we print it:
cat nicks.awk
BEGIN{
FS=""
}
{
c=0;
for( i=1;i<=NF;i=i+1 ){
if ($i ~ /[aeiou]/ ){
c=c+1;
}
};
if(c==20){
print $0
}
}
And this is what it does ... it only prints back the one string that has 20 vowels.
echo "contributorNickSequestionsfoundcontainingvowelsgrcep" | awk -f nicks.awk
echo "contributorNickSeoquestionsfoundcontainingvowelsgrcep" | awk -f nicks.awk
contributorNickSeoquestionsfoundcontainingvowelsgrcep
echo "contributorNickSaeoquestionsfoundcontainingvowelsgrcep" | awk -f nicks.awk
If all you really need is to find 20 vowels in a line then that's just:
awk '{x=tolower($0)} gsub(/[aeiou]/,"&",x)==20' file
or with grep:
grep -Ei '^[^aeiou]*([aeiou][^aeiou]*){20}$' file
To find words (assuming each is space separated) there's many options including this with GNU awk:
awk -v RS='\\s+' -v IGNORECASE=1 'gsub(/[aeiou]/,"&")==20' file
or this with any awk:
awk '{for (i=1;i<=NF;i++) {x=tolower($i); if (gsub(/[aeiou]/,"&",x)==20) print $i} }' file
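The same per-word count can be sketched without awk by deleting everything except vowels and counting what is left. The function name words_with_vowels is an assumption, words are taken to be whitespace-separated, and only the ASCII vowels are counted:

```shell
# Hypothetical helper: print each whitespace-separated word from stdin
# that contains exactly N vowels.
words_with_vowels() {
    n=$1
    tr -s '[:space:]' '\n' | while IFS= read -r word; do
        # keep only the vowels, then count the remaining bytes
        count=$(( $(printf '%s' "$word" | tr -cd 'aeiouAEIOU' | wc -c) ))
        if [ "$count" -eq "$n" ]; then
            printf '%s\n' "$word"
        fi
    done
}

printf '%s\n' 'banana kiwi' 'audio apple' | words_with_vowels 3
```

This spawns two processes per word, so the awk versions are likely the better choice for large files.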

echo without trimming the space in awk command

I have a file consisting of multiple rows like this
10|EQU000000001|12345678|3456||EOMCO042|EOMCO042|31DEC2018|16:51:17|31DEC2018|SHOP NO.5,6,7 RUNWAL GRCHEMBUR MHIN|0000000010000.00|6761857316|508998|6011|GL
I have to split column 11 into 4 separate columns, based on character counts, and replace it with the result.
This is the 11th column, which also contains extra spaces.
SHOP NO.5,6,7 RUNWAL GRCHEMBUR MHIN
This is what I have done
ls *.txt *.TXT| while read line
do
subName="$(cut -d'.' -f1 <<<"$line")"
awk -F"|" '{ "echo -n "$11" | cut -c1-23" | getline ton;
"echo -n "$11" | cut -c24-36" | getline city;
"echo -n "$11" | cut -c37-38" | getline state;
"echo -n "$11" | cut -c39-40" | getline country;
$11=ton"|"city"|"state"|"country; print $0
}' OFS="|" $line > $subName$output
done
But when echoing the 11th column, it's trimming the extra spaces, which leads to a mismatch in the character counts. Is there any way to echo without trimming spaces?
Actual output
10|EQU000000001|12345678|3456||EOMCO042|EOMCO042|31DEC2018|16:51:17|31DEC2018|SHOP NO.5,6,7 RUNWAL GR|CHEMBUR MHIN|||0000000010000.00|6761857316|508998|6011|GL
Expected Output
10|EQU000000001|12345678|3456||EOMCO042|EOMCO042|31DEC2018|16:51:17|31DEC2018|SHOP NO.5,6,7 RUNWAL GR|CHEMBUR|MH|IN|0000000010000.00|6761857316|508998|6011|GL
The least annoying way to code this that I've found so far is:
perl -F'\|' -lane '$F[10] = join "|", unpack "a23 A13 a2 a2", $F[10]; print join "|", @F'
It's fairly straightforward:
Iterate over lines of input; split each line on | and put the fields in @F.
For the 11th field ($F[10]), split it into fixed-width subfields using unpack (and trim trailing spaces from the second field (A instead of a)).
Reassemble subfields by joining with |.
Reassemble the whole line by joining with | and printing it.
I haven't benchmarked it in any way, but it's likely much faster than the original code that spawns multiple shell and cut processes per input line because it's all done in one process.
A complete solution would wrap it in a shell loop:
for file in *.txt *.TXT; do
outfile="${file%.*}$output"
perl -F'\|' -lane '...' "$file" > "$outfile"
done
Or if you don't need to trim the .txt part (and you don't have too many files to fit on the command line):
perl -i.out -F'\|' -lane '...' *.txt *.TXT
This simply places the output for each input file foo.txt in foo.txt.out.
A pure-bash implementation of all this logic
#!/usr/bin/env bash
shopt -s nocaseglob extglob
for f in *.txt; do
subName=${f%.*}
while IFS='|' read -r -a fields; do
location=${fields[10]}
ton=${location:0:23}; ton=${ton%%+([[:space:]])}
city=${location:23:13}; city=${city%%+([[:space:]])}
state=${location:36:2}
country=${location:38:2}
fields[10]="$ton|$city|$state|$country"
printf -v out '%s|' "${fields[@]}"
printf '%s\n' "${out:0:$(( ${#out} - 1 ))}"
done <"$f" >"$subName.out"
done
It's slower (if I did this well, by about a factor of 10) than pure awk would be, but much faster than the awk/shell combination proposed in the question.
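For comparison, a pure-awk variant might look like the sketch below. It is not taken from the answers above: the widths come from the cut ranges in the question (1-23, 24-36, 37-38, 39-40), only the city is trimmed (as in the Perl version), and the wrapper name split_field11 is made up:

```shell
# Hypothetical wrapper: split the 11th |-separated field into four
# fixed-width subfields using awk's substr(), with no external commands.
split_field11() {
    awk 'BEGIN { FS = OFS = "|" }
    {
        loc = $11
        ton = substr(loc, 1, 23)
        city = substr(loc, 24, 13); sub(/ +$/, "", city)   # trim trailing spaces
        state = substr(loc, 37, 2)
        country = substr(loc, 39, 2)
        $11 = ton OFS city OFS state OFS country
        print
    }' "$@"
}
```

Because awk reads the field directly from the record, no shell ever re-splits it, so the embedded spaces and the character offsets are preserved.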
Going into the constructs used:
All the ${varname%...} and related constructs are parameter expansion. The specific ${varname%pattern} construct removes the shortest possible match for pattern from the value in varname, or the longest match if % is replaced with %%.
Using extglob enables extended globbing syntax, such as +([[:space:]]), which is equivalent to the regex syntax [[:space:]]+.
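The difference between % and %% is easiest to see on a value where the pattern can match in more than one place. A small sketch with a made-up file name:

```shell
file='report.final.txt'
short=${file%.*}     # shortest match of ".*" removed from the end
long=${file%%.*}     # longest match of ".*" removed from the end
printf '%s\n' "$short" "$long"
```

This part needs no extglob; only the +(...) pattern used for trimming trailing whitespace does.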
