I have a file s.csv
a,b+ -.,c
aa,bb ().,c._c
I want to remove all special characters from the 2nd column (the file is comma-separated).
cat s.csv | tr -dc '[:alnum:]\n\r' | tr '[:upper:]' '[:lower:]'
The above code removes special characters from the 3rd column as well.
awk -F, '{print $2}' s.csv | tr -dc '[:alnum:]\n\r' | tr '[:upper:]' '[:lower:]'
This code only prints the 2nd column.
Any idea how I can remove special characters from the 2nd column and print all columns?
Required output should be
a,b,c
aa,bb,c._c
Remove all characters that are not upper case letters A-Z, lower case letters a-z or digits 0-9: gsub(/[^A-Za-z0-9]/,"",...)
from the second field: $2
fields are separated with ",": -F ','
keep the separator in the output: OFS=FS
$ awk -F ',' 'BEGIN{OFS=FS}{gsub(/[^A-Za-z0-9]/,"",$2); print}' s.csv
# test
$ awk -F ',' 'BEGIN{OFS=FS}{gsub(/[^A-Za-z0-9]/,"",$2); print}' <<<'aa,bb ().,c._c'
aa,bb,c._c
As @Léa Gris mentioned below:
Don't forget to set the locale to C, or [^A-Za-z0-9] is going to be
interpreted unexpectedly in non-Western-European alphabets. Prepend the
awk invocation with
LC_ALL=C
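For example, the invocation from above with the locale pinned would be:
LC_ALL=C awk -F ',' 'BEGIN{OFS=FS}{gsub(/[^A-Za-z0-9]/,"",$2); print}' s.csv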
You can use the [:alpha:] character class with awk, here applied to the second field, removing the characters that aren't alphabetic with the gsub() function:
awk 'BEGIN{OFS=FS=","} {gsub(/[^[:alpha:]]+/, "", $2)} 1' file
a,b,c
aa,bb,c._c
If you need another set of characters, you can see this answer by Ed Morton on "which characters are in which character classes":
https://stackoverflow.com/questions/56481541/how-can-you-tell-which-characters-are-in-which-character-classes
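If you want to check for yourself which ASCII characters a class matches on your system, a small sketch along these lines (looping over the printable ASCII range) can show you:
awk 'BEGIN{ for (i = 32; i < 127; i++) { c = sprintf("%c", i); if (c ~ /[[:alpha:]]/) printf "%s", c } print "" }'
In a typical locale this prints the 52 ASCII letters.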
Use this Perl one-liner:
perl -F',' -lane '$F[1] =~ s{[\W_]+}{}g; @F = map { lc } @F; print join ",", @F; ' in_file > out_file
The Perl one-liner uses these command line flags:
-e : Tells Perl to look for code in-line, instead of in a file.
-n : Loop over the input one line at a time, assigning it to $_ by default.
-l : Strip the input line separator ("\n" on *NIX by default) before executing the code in-line, and append it when printing.
-a : Split $_ into array @F on whitespace or on the regex specified in -F option.
-F',' : Split into @F on comma, rather than on whitespace.
s{[\W_]+}{} : Replace 1 or more occurrences of \W (non-word character) or underscore with nothing.
The regex uses these modifiers:
/g : Match the pattern repeatedly.
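As a quick check against the sample s.csv from the question (dropping the output redirection so the result is printed; note that map { lc } also lowercases everything, which makes no difference on this sample):
$ perl -F',' -lane '$F[1] =~ s{[\W_]+}{}g; @F = map { lc } @F; print join ",", @F;' s.csv
a,b,c
aa,bb,c._c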
SEE ALSO:
perldoc perlrun: how to execute the Perl interpreter: command line switches
perldoc perlre: Perl regular expressions (regexes)
perldoc perlre: Perl regular expressions (regexes): Quantifiers; Character Classes and other Special Escapes; Assertions; Capture groups
perldoc perlrequick: Perl regular expressions quick start
You don't have to alter the locale just to do this: by using octal escapes instead of letters, the regex engine treats them as plain ASCII instead of being overly clever. I even intentionally set the locale to Belgian French to illustrate:
CODE
echo 'a,b+ -.,c
aa,bb ().,c._c' | {m,g}awk '
gsub("[^\\060-\\071\\101-\\132\\141-\\172]+","",$(!_+!_))^_' \
OFS=',' FS=','
OUTPUT
a,b,c
aa,bb,c._c
SHOWCASE: the C locale isn't needed
LANG="fr_BE.UTF8" gawk -e '
BEGIN { for(_=8*4;_<8^4;_++) { printf("%c",_) } } ' |
LANG="fr_BE.UTF8" gawk -p- -e '
gsub("[^\\060-\\071\\101-\\132\\141-\\172]+","",$-_)^_' OFS=',' FS=','
——————————
0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz
# profile gawk, créé Sun May 29 05:58:26 2022
# Règle(s)
1 (gsub("[^\\060-\\071\\101-\\132\\141-\\172]+", "", $-_)) ^ _ { # 1
1 print
}
I have multiple *.csv files that, concatenated with cat, look like:
#sample,time,N
SPH-01-HG00186-1_R1_001,8.33386,93
SPH-01-HG00266-1_R1_001,7.41229,93
SPH-01-HG00274-1_R1_001,7.63903,93
SPH-01-HG00276-1_R1_001,7.94798,93
SPH-01-HG00403-1_R1_001,7.99299,93
SPH-01-HG00404-1_R1_001,8.38001,93
And I am trying to wrangle the concatenated csv file into:
#sample,time,N
HG00186,8.33386,93
HG00266,7.41229,93
HG00274,7.63903,93
HG00276,7.94798,93
HG00403,7.99299,93
HG00404,8.38001,93
I did:
for i in $(ls *csv); do line=$(cat ${i} | grep -v "#" | cut -d'-' -f3); sed 's/*${line}*/${line}/g'; done
Yet no result showed up... Any advice on how to do this? Thanks.
With awk and the logic of splitting each line by , and then splitting the first field by -:
awk -v FS=',' -v OFS=',' 'NR > 1 { split($1,w,"-"); $1 = w[3] } 1' file.csv
With sed and a robust regex that cannot possibly modify the other fields:
sed -E 's/^([^,-]*-){2}([^,-]*)[^,]*/\2/' file.csv
# or
sed -E 's/^(([^,-]*)-){3}[^,]*/\2/' file.csv
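A quick check of the first variant against one of the sample lines:
$ sed -E 's/^([^,-]*-){2}([^,-]*)[^,]*/\2/' <<< 'SPH-01-HG00186-1_R1_001,8.33386,93'
HG00186,8.33386,93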
Use this Perl one-liner:
perl -i -pe 's{.*?-.*?-(.*?)-.*?,}{$1,}' *.csv
The Perl one-liner uses these command line flags:
-e : Tells Perl to look for code in-line, instead of in a file.
-p : Loop over the input one line at a time, assigning it to $_ by default. Add print $_ after each loop iteration.
-i.bak : Edit input files in-place (overwrite the input file). Before overwriting, save a backup copy of the original file by appending to its name the extension .bak (you can omit .bak, to avoid creating any backup files).
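To preview the substitution without rewriting any files, you can drop -i and feed a sample line (a quick check):
$ perl -pe 's{.*?-.*?-(.*?)-.*?,}{$1,}' <<< 'SPH-01-HG00186-1_R1_001,8.33386,93'
HG00186,8.33386,93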
SEE ALSO:
perldoc perlrun: how to execute the Perl interpreter: command line switches
perldoc perlre: Perl regular expressions (regexes)
perldoc perlre: Perl regular expressions (regexes): Quantifiers; Character Classes and other Special Escapes; Assertions; Capture groups
perldoc perlrequick: Perl regular expressions quick start
You can use
sed -E 's/^[^-]+-[0-9]+-([^-]+)[^,]+/\1/' file > newfile
Details:
-E - enabling the POSIX ERE regex flavor
^[^-]+-[0-9]+-([^-]+)[^,]+ - the regex pattern that searches for
^ - start of string
[^-]+ - one or more non-hyphen chars
- - a hyphen
[0-9]+ - one or more digits
- - a hyphen
([^-]+) - Group 1: one or more non-hyphens
[^,]+ - one or more non-comma chars
\1 - replace the match with Group 1 value.
See the demo:
#!/bin/bash
s='SPH-01-HG00186-1_R1_001,8.33386,93
SPH-01-HG00266-1_R1_001,7.41229,93
SPH-01-HG00274-1_R1_001,7.63903,93
SPH-01-HG00276-1_R1_001,7.94798,93
SPH-01-HG00403-1_R1_001,7.99299,93
SPH-01-HG00404-1_R1_001,8.38001,93'
sed -E 's/^[^-]+-[0-9]+-([^-]+)[^,]+/\1/' <<< "$s"
Output:
HG00186,8.33386,93
HG00266,7.41229,93
HG00274,7.63903,93
HG00276,7.94798,93
HG00403,7.99299,93
HG00404,8.38001,93
You can mangle text using bash parameter expansion, without resorting to external tools like awk and sed:
IFS=","
while read -r -a line; do
x="${line[0]%-*}"
x="${x##*-}"
printf "%s,%s,%s\n" "$x" "${line[1]}" "${line[2]}"
done < input.txt
Or you could do it with simple awk, as others have done.
awk '{print $3,$5,$6}' FS='[-,]' OFS=, < input.txt
If you need to use cut at any price, then I suggest the following solution. Let file.txt content be
#sample,time,N
SPH-01-HG00186-1_R1_001,8.33386,93
SPH-01-HG00266-1_R1_001,7.41229,93
SPH-01-HG00274-1_R1_001,7.63903,93
SPH-01-HG00276-1_R1_001,7.94798,93
SPH-01-HG00403-1_R1_001,7.99299,93
SPH-01-HG00404-1_R1_001,8.38001,93
then
head -1 file.txt && tail -6 file.txt | tr '-' ',' | cut --delimiter=',' --fields=3,5,6
gives output
#sample,time,N
HG00186,8.33386,93
HG00266,7.41229,93
HG00274,7.63903,93
HG00276,7.94798,93
HG00403,7.99299,93
HG00404,8.38001,93
Explanation: output the 1st line as-is using head, then feed the last 6 lines into tr to replace each - with a ,, and finally use cut with the , delimiter to select the desired fields.
Another terse awk approach: the FS regex matches either the leading prefix up to and including the second - (e.g. SPH-01-) or the part from the next - up to the first , (e.g. -1_R1_001), so both pieces become field separators; NF++ forces the record to be rebuilt with the empty OFS, which drops those separators, and is always true, so every line is printed:
{m,n,g}awk NF++ FS='^[^-]+-[^-]+-|-[^,]+' OFS=
Output:
#sample,time,N
HG00186,8.33386,93
HG00266,7.41229,93
HG00274,7.63903,93
HG00276,7.94798,93
HG00403,7.99299,93
HG00404,8.38001,93
I want to extract the timeTaken values from the following line:
<some other log data> Exception, Curl1-Time: 0.258315s. Curl2-Time: 3.9092588424683s Exiting.
I am using the following command with grep and awk:
grep -Po "Exception, Curl1-Time: \K(\d+.\d*)s. Curl2-Time: (\d+.\d+)" app.log | awk '{print $1 + $3}'
This outputs: 4.167565
Can this be done in a smarter way, maybe using sed or any other bash tool?
Is it OK to ignore the trailing "s." in the time-taken values, as the result of the addition is correct?
You already use PCRE. Why not use Perl itself?
perl -lne 'print $1 + $2
if /Exception, Curl1-Time: ([\d.]+)s\. Curl2-Time: ([\d.]+)/
' < input
If you have GNU's grep, then you can execute:
var="<some other log data> Exception, Curl1-Time: 0.258315s. Curl2-Time: 3.9092588424683s Exiting."
grep -Eo '[[:digit:]]+\.[[:digit:]]+s?' <<< "$var"
Or you can use awk and stay POSIX:
var="<some other log data> Exception, Curl1-Time: 0.258315s. Curl2-Time: 3.9092588424683s Exiting."
awk '{ while (match($0, /[[:digit:]]+\.[[:digit:]]+s?/)) { print substr($0, RSTART, RLENGTH); $0 = substr($0, RSTART + RLENGTH) } }' <<< "$var"
As you can see, both commands use the regex [[:digit:]]+\.[[:digit:]]+s? to match a pattern of one or more digits, a dot, one or more digits and an optional 's'.
GNU's grep uses the -o option to extract the matching regex pattern.
The awk version uses its match and substr functions to match and extract the relevant data.
After a regex match, RSTART and RLENGTH are set, and we can use them to calculate the start position and length for substr.
RLENGTH is the length of the substring matched by the match function.
RSTART is the start-index in characters of the substring matched by the match function.
see section Built-in Functions for String Manipulation
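If you also want the sum, as in the original grep | awk pipeline, the same match()/substr() loop can accumulate the values. A minimal POSIX awk sketch, reusing $var from above:
$ awk '{ s = 0; while (match($0, /[[:digit:]]+\.[[:digit:]]+/)) { s += substr($0, RSTART, RLENGTH); $0 = substr($0, RSTART + RLENGTH) } print s }' <<< "$var"
4.16757
(awk's default OFMT rounds the printed sum to six significant digits)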
sed -n 's/.*Curl1-Time: \([0-9]\.[0-9]*\)s.*\([0-9]\.[0-9]*\)s.*$/\1 \2/p' filename | awk '{print ($1+$2);}'
The regex ".*Curl1-Time: \([0-9]\.[0-9]*\)s.*\([0-9]\.[0-9]*\)s.*$" does the matching; the patterns within the escaped parentheses are the number-matching groups.
The entire line is replaced with the two captured groups, i.e. the output of sed will be the two numbers with a space between them, e.g. 0.258315 3.9092588424683.
awk parses the sed output with the default space delimiter, sums the two numbers and prints the result.
I wonder if there is a command in Linux that can help me to find a line that begins with "*" and contains the special character "|"
for example
* Date | Auteurs
Simply use:
grep -ne '^\*.*|' "${filename}"
Or if you want to use sed:
sed -n '/^\*.*|/{=;p}' "${filename}" | sed '{N;s/\n/:/}'
Or the (GNU) awk equivalent (which requires escaping the pipe with a backslash):
awk '/^\*.*\|/' "${filename}"
Where:
^ : start of the line
\*: a literal *
.*: zero or more generic char (not newline)
| : a literal pipe
NB: "${filename}": i've assumed you're using the command in a script with the target file passed in a double quoted variable as "${filename}". In the shell simply use the actual name of the file (or the path to it).
UPDATE (line numbers)
Modify the above commands to also obtain the line numbers of the matched lines. With grep it is as simple as adding the -n switch:
grep -ne '^\*.*|' "${filename}"
We obtain an output like this:
81806:* Date | Auteurs
To obtain exactly the same output from sed and awk we have to complicate the commands a little bit:
awk '/^\*.*\|/{print NR ":" $0}' "${filename}"
# the = prints the line number and p prints the actual match, but they end up on two separate lines, hence the second sed call to join them
sed -n '/^\*.*|/{=;p}' "${filename}" | sed '{N;s/\n/:/}'
I'd like to remove any word which contains a non-alpha character from a text file, e.g.
"ok 0bad ba1d bad3 4bad4 5bad5bad5"
should become
"ok"
I've tried using
echo "ok 0bad ba1d bad3 4bad4 5bad5bad5" | sed 's/\b[a-zA-Z]*[^a-zA-Z]\+[a-zA-Z]*\b/ /g'
The following sed command does the job:
sed 's/[[:space:]]*[[:alpha:]]*[^[:space:][:alpha:]][^[:space:]]*//g'
It removes all words containing at least one non-alphabetic character. It is better to use POSIX character classes like [:alpha:], because for instance they won't consider the French name "François" as being faulty (i.e. containing a non-alphabetic character).
Explanation
We remove all patterns starting with an arbitrary number of spaces followed by an arbitrary (possibly nil) number of alphabetic characters, followed by at least one non-space and non-alphabetic character, and then glob to the end of the word (i.e. until the next space). Please note that you may want to swap [:space:] for [:blank:], see this page for a detailed explanation of the difference between these two POSIX classes.
Test
$ echo "ok 0bad ba1d bad3 4bad4 5bad5bad5" | sed 's/[[:space:]]*[[:alpha:]]*[^[:space:][:alpha:]][^[:space:]]*//g'
ok
Using awk:
s="ok 0bad ba1d bad3 4bad4 5bad5bad5"
awk '{ofs=""; for (i=1; i<=NF; i++) if ($i ~ /^[[:alpha:]]+$/)
{printf "%s%s", ofs, $i; ofs=OFS} print ""}' <<< "$s"
ok
This awk command loops through all fields and, if a field matches the regex /^[[:alpha:]]+$/, writes it to standard output. The ofs variable starts out empty and is set to OFS after the first printed field, so matched words are separated by a single OFS; the final print "" adds the trailing newline.
Using grep + tr together:
s="ok 0bad ba1d bad3 4bad4 5bad5bad5"
r=$(grep -o '[^ ]\+' <<< "$s"|grep '^[[:alpha:]]\+$'|tr '\n' ' ')
echo "$r"
ok
The first grep -o breaks the string into individual words. The 2nd grep keeps only the words consisting entirely of alphabetic characters. And finally tr translates \n to a space.
If you're not concerned about losing different numbers of spaces between each word, you could use something like this in Perl:
perl -ane 'print join(" ", grep { !/[^[:alpha:]]/ } @F), "\n"'
The -a switch enables auto-split mode, which splits the text on any number of spaces and stores the fields in the array @F. grep filters out the elements of that array that contain any non-alphabetical characters. The resulting array is joined on a single space.
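A quick test with the sample string from the question:
$ echo "ok 0bad ba1d bad3 4bad4 5bad5bad5" | perl -ane 'print join(" ", grep { !/[^[:alpha:]]/ } @F), "\n"'
ok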
This might work for you (GNU sed):
sed -r 's/\b([[:alpha:]]+\b ?)|\S+\b ?/\1/g;s/ $//' file
This uses a back reference within alternation to save the required string.
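Testing with the sample string from the question:
$ echo "ok 0bad ba1d bad3 4bad4 5bad5bad5" | sed -r 's/\b([[:alpha:]]+\b ?)|\S+\b ?/\1/g;s/ $//'
ok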
st="ok 0bad ba1d bad3 4bad4 5bad5bad5"
for word in $st;
do
if [[ $word =~ ^[a-zA-Z]+$ ]];
then
echo $word;
fi;
done
If I run the command cat file | grep pattern, I get many lines of output. How do you concatenate all lines into one line, effectively replacing each "\n" with "\" " (end with " followed by space)?
cat file | grep pattern | xargs sed s/\n/ /g
isn't working for me.
Use tr '\n' ' ' to translate all newline characters to spaces:
$ grep pattern file | tr '\n' ' '
Note: grep reads files, cat concatenates files. Don't cat file | grep!
Edit:
tr can only handle single character translations. You could use awk to change the output record separator like:
$ grep pattern file | awk '{print}' ORS='" '
This would transform:
one
two
three
to:
one" two" three"
Piping output to xargs will concatenate each line of output to a single line with spaces:
grep pattern file | xargs
Or any command, e.g. ls | xargs. The default limit of xargs output is ~4096 characters, but it can be increased with e.g. xargs -s 8192.
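For example, joining a few lines of output with single spaces:
$ seq 1 3 | xargs
1 2 3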
In bash, echo without quotes removes newlines, tabs and multiple spaces:
echo $(cat file)
This could be what you want
cat file | grep pattern | paste -sd' '
As to your edit, I'm not sure what it means, perhaps this?
cat file | grep pattern | paste -sd'~' | sed -e 's/~/" "/g'
(this assumes that ~ does not occur in file)
This is an example which produces output separated by commas. You can replace the comma by whatever separator you need.
cat <<EOD | xargs | sed 's/ /,/g'
> 1
> 2
> 3
> 4
> 5
> EOD
produces:
1,2,3,4,5
The fastest and easiest ways I know to solve this problem:
When we want to replace the new line character \n with the space:
xargs < file
xargs has its own limits on the number of characters per line and the number of all characters combined, but we can increase them. Details can be found by running this command: xargs --show-limits, and of course in the manual: man xargs
When we want to replace one character with another exactly one character:
tr '\n' ' ' < file
When we want to replace one character with many characters:
tr '\n' '~' < file | sed s/~/many_characters/g
First, we replace the newline characters \n with tildes ~ (or choose another unique character not present in the text), and then we replace the tilde characters with any other characters (many_characters), and we do it for each tilde (flag g).
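For instance, to get the quote-plus-space separator asked about in the question:
$ printf 'one\ntwo\nthree\n' | tr '\n' '~' | sed 's/~/" /g'
one" two" three"
Note that the last newline is replaced as well, so the result ends with a trailing separator and no final newline.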
Here is another simple method using awk:
# cat > file.txt
a
b
c
# cat file.txt | awk '{ printf("%s ", $0) }'
a b c
Also, if your file has columns, this gives an easy way to concatenate only certain columns:
# cat > cols.txt
a b c
d e f
# cat cols.txt | awk '{ printf("%s ", $2) }'
b e
I like the xargs solution, but if it's important to not collapse spaces, then one might instead do:
sed ':b;N;$!bb;s/\n/ /g'
That will replace newlines for spaces, without substituting the last line terminator like tr '\n' ' ' would.
This also allows you to use other joining strings besides a space, like a comma, etc, something that xargs cannot do:
$ seq 1 5 | sed ':b;N;$!bb;s/\n/,/g'
1,2,3,4,5
Here is the method using ex editor (part of Vim):
Join all lines and print to the standard output:
$ ex +%j +%p -scq! file
Join all lines in-place (in the file):
$ ex +%j -scwq file
Note: This will concatenate all lines inside the file itself!
Probably the best way to do it is to use the awk tool, which will generate the output on one line:
$ awk ' /pattern/ {print}' ORS=' ' /path/to/file
It will merge all lines into one, with a space delimiter.
paste -sd'~' was giving an error.
Here's what worked for me on a Mac using bash:
cat file | grep pattern | paste -d' ' -s -
From man paste:
-d list Use one or more of the provided characters to replace the newline characters instead of the default tab. The characters
in list are used circularly, i.e., when list is exhausted the first character from list is reused. This continues until
a line from the last input file (in default operation) or the last line in each file (using the -s option) is displayed,
at which time paste begins selecting characters from the beginning of list again.
The following special characters can also be used in list:
\n newline character
\t tab character
\\ backslash character
\0 Empty string (not a null character).
Any other character preceded by a backslash is equivalent to the character itself.
-s Concatenate all of the lines of each separate input file in command line order. The newline character of every line
except the last line in each input file is replaced with the tab character, unless otherwise specified by the -d option.
If ‘-’ is specified for one or more of the input files, the standard input is used; standard input is read one line at a time,
circularly, for each instance of ‘-’.
On Red Hat Linux I just use echo:
echo $(cat /some/file/name)
This gives me all records of a file on just one line.