Parse string using grep, sed or awk


I have a string that looks like this
807001S:S6S11ABB23668732CC1DD1496851208.807262EE7482
I need output like this:
S:S6S11,07001,23668732,1,1496851208,807262,7482
I need the string with the column separated like this:
S:S6 + the next 3 characters;
In this case S:S6S11 this works:
echo 807001S:S6S11ABB23668732CC1DD1496851208.807262EE7482 |
grep -P -o 'S:S6.{1,3}'
Output:
S:S6S11
This gets me close, getting just the numbers
echo 807001S:S6S11ABB23668732CC1DD1496851208.807262EE7482 |
grep -o '[0-9]\+' | tr '\n' ','
Output:
807001,6,11,23668732,1,1496851208,807262,7482,
How can I get S:S6S11 in the beginning of my output and avoid 6,11 after that?
If this can be done better with sed or awk I don't mind.
Edit - clarification of structure
The rest of the string is:
LETTERS NUMBERS
BB 23668732
CC 1
DD 1496851208.807262
EE 7482
I need just the numbers but they have to correspond to the letters.

awk to the rescue!
$ echo "807001S:S6S11ABB23668732CC1DD1496851208.807262EE7482" |
awk '{pre=gensub(".*(S:S6...).*","\\1","g"); ## extract prefix
sub(/./,","); ## replace first char with comma
gsub(/[^0-9]+/,","); ## replace non-numeric values with comma
print pre $0}' ## print prefix and replaced line
S:S6S11,07001,6,11,23668732,1,1496851208,807262,7482

... or sed:
$ echo "807001S:S6S11ABB23668732CC1DD1496851208.807262EE7482" | sed -re 's/^.([0-9]+)(S:S6...)ABB([0-9]+)CC([0-9]+)DD([0-9]+)\.([0-9]+)EE([0-9]*)$/\2,\1,\3,\4,\5,\6,\7/'
S:S6S11,07001,23668732,1,1496851208,807262,7482
That is, if your line format is fixed.

If you use GNU awk, you can simplify the task by defining RS as the desired pattern, e.g.:
parse.awk
BEGIN { RS = "S:S6...|\n" }
# Start of the string
RT != "\n" {
sub(".", ",") # Replace first char by a comma
pst = $0 # Remember the rest of the string
pre = RT # Remember the S:S6 pattern
}
# End of string
RT == "\n" {
gsub("[A-Z.]+", ",") # Replace letters and dots by commas
print pre pst $0 # Print the final result
}
Run it like this, for example:
s=807001S:S6S11ABB23668732CC1DD1496851208.807262EE7482
gawk -f parse.awk <<<$s
Output:
S:S6S11,07001,23668732,1,1496851208,807262,7482

Here is one way you could do it with sed:
parse.sed
h # Duplicate string to hold space
s/.*(S:S6...).*/\1/ # Extract the desired pattern
x # Swap hold and pattern space
s/S:S6...// # Remove pattern (still in hold space)
s/[A-Z.]+/,/g # Replace letters and dots with commas
s/./,/ # Replace first char with comma
G # Append hold space content
s/([^\n]+)\n(.*)/\2\1/ # Rearrange to match desired output
Run it like this:
s=807001S:S6S11ABB23668732CC1DD1496851208.807262EE7482
sed -Ef parse.sed <<<$s
Output:
S:S6S11,07001,23668732,1,1496851208,807262,7482

It sounds like this MAY be what you're really trying to do:
$ awk -F'[A-Z]{2,}|[.]' -v OFS=',' '{$1=substr($1,7) OFS substr($1,2,5)}1' file
S:S6S11,07001,23668732,1,1496851208,807262,7482
but your requirements for how and what to match where are very unclear and just one sample input line doesn't help much.
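The field-splitting one-liner above can be hard to read, and `{2,}` interval expressions are not supported by every awk (some older mawk releases, for instance). Here is a minimal sketch of the same idea, written with `[A-Z][A-Z]+` instead. The function name `parse_record` is hypothetical, and it assumes, as the sample suggests, that the record always starts with six characters before the `S:S6...` prefix.

```shell
# Hypothetical helper: split on runs of 2+ capital letters or a dot,
# then rebuild the first field as "<prefix>,<leading digits minus the first>".
parse_record() {
    printf '%s\n' "$1" | awk -F'[A-Z][A-Z]+|[.]' -v OFS=',' '{
        prefix = substr($1, 7)      # e.g. "S:S6S11" (assumes 6 leading characters)
        lead   = substr($1, 2, 5)   # e.g. "07001"  (leading digits without the first)
        $1 = prefix OFS lead        # assigning a field rebuilds $0 using OFS
        print
    }'
}

parse_record 807001S:S6S11ABB23668732CC1DD1496851208.807262EE7482
# prints: S:S6S11,07001,23668732,1,1496851208,807262,7482
```

Single capital letters such as the `S` in `S6S11` are not treated as separators, because the separator pattern requires at least two letters in a row.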

Related

Substitute all characters between two strings by char 'X' using sed

In a Bash script, I am trying to replace, in place in a file, the characters between two given strings with 'X'. I have a bunch of string pairs, and the replacement should happen between each pair.
In the code below, the first string of each pair is declared in the cpi_list array. The second string of the pair is always %26, &, or the end of the line.
This is what I am doing.
# list of "first" or "start" string
declare -a cpi_list=('%26Name%3d' '%26Pwd%3d')
# This is the "end" string
myAnd=\%26
newfile="inputlog.txt"
for item in "${cpi_list[@]}";
do
sed -i -e :a -e "s/\($item[X]*\)[^X]\(.*"$myAnd"\)/\1X\2/;ta" $newfile;
done
The input
CPI.%26Name%3dJASON%26Pwd%3dBOTTLE%26Name%3dCOTT
CPI.%26Name%3dVoorhees&machete
I want to make it
CPI.%26Name%3dXXXXX%26Pwd%3dXXXXXX%26Name%3dXXXX
CPI.%26Name%3dXXXXXXXX&machete
PS: The last item must also change, from %26Name%3dCOTT to %26Name%3dXXXX, even though there is no closing %26, because the end point is either %26 or the END OF THE LINE.
But somehow it is not working.

This will work in any awk called from any shell in any UNIX installation:
$ cat tst.awk
BEGIN {
begs = "%26Name%3d|%26Pwd%3d"
ends = "%26|&"
}
{
head = ""
tail = $0
while( match(tail, begs) ) {
tgtStart = RSTART + RLENGTH
tgt = substr(tail,tgtStart)
if ( match(tgt, ends) ) {
tgt = substr(tgt,1,RSTART-1)
}
gsub(/./,"X",tgt)
head = head substr(tail,1,tgtStart-1) tgt
tail = substr(tail,tgtStart+length(tgt))
}
$0 = head tail
print
}
$ cat file
CPI.%26Name%3dJASON%26Pwd%3dBOTTLE%26Name%3dCOTT
CPI.%26Name%3dVoorhees&machete
$ awk -f tst.awk file
CPI.%26Name%3dXXXXX%26Pwd%3dXXXXXX%26Name%3dXXXX
CPI.%26Name%3dXXXXXXXX&machete
Just like with a sed substitution, any regexp metacharacter in the beg and end strings would need to be escaped, or we'd have to use a loop with index()s instead of match() so that we'd do string matching instead of regexp matching.
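The index()-based alternative mentioned above could look roughly like the sketch below. It is not the answer's own code: the function names `mask_values` and `first_hit` and the global `hitlen` are made up for illustration, and the start and end markers are hard-coded from the question. Because index() compares literal strings, the markers need no escaping.

```shell
# Hypothetical sketch: mask the text between literal start/end markers with X.
mask_values() {
    awk '
    # Position of the earliest literal occurrence in str of any of the n
    # strings in list[] (0 if none); the global hitlen gets its length.
    function first_hit(str, list, n,    i, p, best) {
        best = 0; hitlen = 0
        for (i = 1; i <= n; i++) {
            p = index(str, list[i])
            if (p > 0 && (best == 0 || p < best)) { best = p; hitlen = length(list[i]) }
        }
        return best
    }
    BEGIN {
        nb = split("%26Name%3d %26Pwd%3d", begs, " ")   # start markers
        ne = split("%26 &", ends, " ")                  # end markers
    }
    {
        head = ""; tail = $0
        while ((pos = first_hit(tail, begs, nb)) > 0) {
            start = pos + hitlen                 # first character to mask
            tgt = substr(tail, start)
            if ((e = first_hit(tgt, ends, ne)) > 0) tgt = substr(tgt, 1, e - 1)
            len = length(tgt)                    # no end marker: mask to end of line
            gsub(/./, "X", tgt)
            head = head substr(tail, 1, start - 1) tgt
            tail = substr(tail, start + len)
        }
        print head tail
    }' "$@"
}

printf '%s\n' 'CPI.%26Name%3dJASON%26Pwd%3dBOTTLE%26Name%3dCOTT' \
              'CPI.%26Name%3dVoorhees&machete' | mask_values
# prints:
# CPI.%26Name%3dXXXXX%26Pwd%3dXXXXXX%26Name%3dXXXX
# CPI.%26Name%3dXXXXXXXX&machete
```

With no file arguments the function reads standard input; pass file names to process files instead.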

You can avoid %26 doing this:
a='CPI.%26Name%3dJASON%26Pwd%3dBOTTLE%26Name%3dCOTT'
echo "$a" |sed -E ':a;s/(%3dX*)([^%X]|%[013-9a-f][0-9a-f]|%2[0-5789a-f])/\1X/g;ta;'
Note that each encoded character %xx counts for one X.

It is not pretty but you can use perl:
$ s1="CPI.%26Name%3dJASON%26Pwd%3dBOTTLE%26Name%3dCOTT"
$ echo "$s1" | perl -lne 'if (/(?:^.*%26Name%3d)(.*)(?:%26Pwd%3d)(?:.*%26Name%3d)(.*)((?:%26Pwd%3d)|(?:$))/) {
$i1=$-[1];
$l1=$+[1]-$-[1];
$i2=$-[2];
$l2=$+[2]-$-[2];
substr($_, $i1, $l1, "X"x$l1);
substr($_, $i2, $l2, "X"x$l2);
print;
}'
CPI.%26Name%3dXXXXX%26Pwd%3dBOTTLE%26Name%3dXXXX
That is for two pairs like the example. N pairs in a line will be a slight modification.

Count total number of pattern between two pattern (using sed if possible) in Linux

I have to count all '=' characters between two patterns, i.e. '{' and '}'.
Sample:
{
100="1";
101="2";
102="3";
};
{
104="1,2,3";
};
{
105="1,2,3";
};
Expected Output:
3
1
1

A very cryptic perl answer:
perl -nE 's/\{(.*?)\}/ say ($1 =~ tr{=}{=}) /ge'
The tr function returns the number of characters transliterated.
With the new requirements, we can make a couple of small changes:
perl -0777 -nE 's/\{(.*?)\}/ say ($1 =~ tr{=}{=}) /ges'
-0777 reads the entire file/stream into a single string
the s flag to the s/// function allows . to handle newlines like a plain character.

Perl to the rescue:
perl -lne '$c = 0; $c += ("$1" =~ tr/=//) while /\{(.*?)\}/g; print $c' < input
-n reads the input line by line
-l adds a newline to each print
/\{(.*?)\}/g is a regular expression. The ? makes the asterisk frugal, i.e. matching the shortest possible string.
The (...) parentheses create a capture group, referred to as $1.
tr is normally used to transliterate (i.e. replace one character by another), but here it just counts the number of equal signs.
+= adds the number to $c.

Awk is here too
grep -o '{[^}]\+}'|awk -v FS='=' '{print NF-1}'
example
echo '{100="1";101="2";102="3";};
{104="1,2,3";};
{105="1,2,3";};'|grep -o '{[^}]\+}'|awk -v FS='=' '{print NF-1}'
output
3
1
1

First some test input (a line with a = outside the curly brackets and inside the content, one without brackets and one with only 2 brackets)
echo '== {100="1";101="2";102="3=3=3=3";} =;
a=b
{c=d}
{}'
Handle line without brackets (put a dummy char so you will not end up with an empty string)
sed -e 's/^[^{]*$/x/'
Handle line without equal sign (put a dummy char so you will not end up with an empty string)
sed -e 's/{[^=]*}/x/'
Remove stuff outside the brackets
sed -e 's/.*{\(.*\)}/\1/'
Remove stuff inside the double quotes (do not count fields there)
sed -e 's/"[^"]*"//g'
Use @repzero's method to count equal signs
awk -F "=" '{print NF-1}'
Combine stuff
echo -e '{100="1";101="2";102="3";};\na=b\n{c=d}\n{}' |
sed -e 's/^[^{]*$/x/' -e 's/{[^=]*}/x/' -e 's/.*{\(.*\)}/\1/' -e 's/"[^"]*"//g' |
awk -F "=" '{print NF-1}'
The ugly temp fields x and replacing {} can be solved inside awk:
echo -e '= {100="1";101="2=2=2=2";102="3";};\na=b\n{c=d}\n{}' |
sed -e 's/^[^{]*$//' -e 's/.*{\(.*\)}/\1/' -e 's/"[^"]*"//g' |
awk -F "=" '{if (NF>0) c=NF-1; else c=0; print c}'
or shorter
echo -e '= {100="1";101="2=2=2=2";102="3";};\na=b\n{c=d}\n{}' |
sed -e 's/^[^{]*$//' -e 's/.*{\(.*\)}/\1/' -e 's/"[^"]*"//g' |
awk -F "=" '{print (NF>0) ? NF-1 : 0; }'

No harder sed than done ... in.
Restricting this answer to the environment as tagged, namely:
linux shell unix sed wc
will actually not require the use of wc (or awk, perl, or any other app.).
Though echo is used, a file source can easily exclude its use.
As for bash, it is the shell.
The actual environment used is documented at the end.
NB. Exploitation of GNU specific extensions has been used for brevity
but appropriately annotated to make a more generic implementation.
Also brace bracketed { text } will not include braces in the text.
It is implicit that such braces should be present as {} pairs but
the text src. dangling brace does not directly violate this tenet.
This is a foray into the world of `sed`'ng to gain some fluency in its use for other purposes.
The ideas expounded upon here are used to cross pollinate another SO problem solution in order
to acquire more familiarity with vetting vagaries of vernacular version variances. Consequently
this pedantic exercise hopefully helps with the pedagogy of others beyond personal edification.
To test easily, at least in the environment noted below, judiciously highlight the appropriate
code section, carefully excluding a dangling pipe |, and then, to a CLI command line interface
drag & drop, copy & paste or use middle click to enter the code.
The other SO problem. linux - Is it possible to do simple arithmetic in sed addresses?
# _______________________________ always needed ________________________________
echo -e '\n
\n = = = {\n } = = = each = is outside the braces
\na\nb\n { } so therefore are not counted
\nc\n { = = = = = = = } while the ones here do count
{\n100="1";\n101="2";\n102="3";\n};
\n {\n104="1,2,3";\n};
a\nb\nc\n {\n105="1,2,3";\n};
{ dangling brace ignored junk = = = \n' |
# ____________ preparatory conditioning needed for final solutions ____________
sed ' s/{/\n{\n/g;
s/}/\n}\n/g; ' | # guarantee but one brace to a line
sed -n '/{/ h; # so sed addressing can "work" here
/{/,/}/ H; # use hHold buffer for only { ... }
/}/ { x; s/[^=]*//g; p } ' | # then make each {} set a line of =
# ____ stop code hi-lite selection in ^--^ here include quote not pipe ____
# ____ outputs the following exclusive of the shell " # " comment quotes _____
#
#
# =======
# ===
# =
# =
# _________________________________________________________________________
# ____________________________ "simple" GNU solution ____________________________
sed -e '/^$/ { s//0/;b }; # handle null data as 0 case: next!
s/=/\n/g; # to easily count an = make it a nl
s/\n$//g; # echo adds an extra nl - delete it
s/.*/echo "&" | sed -n $=/; # sed = command w/ $ counts last nl
e ' # who knew only GNU say you ah phoo
# 0
# 0
# 7
# 3
# 1
# 1
# _________________________________________________________________________
# ________________________ generic incomplete "solution" ________________________
sed -e '/^$/ { s//echo 0/;b }; # handle null data as 0 case: next!
s/=$//g; # echo adds an extra nl - delete it
s/=/\\\\n/g; # to easily count an = make it a nl
s/.*/echo -e & | sed -n $=/; '
# _______________________________________________________________________________
The paradigm used for the algorithm is instigated by the prolegomena study below.
The idea is to isolate groups of = signs between { } braces for counting.
These are found and each group is put on a separate line with ALL other adorning characters removed.
It is noted that sed can easily "count", actually enumerate, nl or \n line ends via =.
The first "solution" uses these sed commands:
print
branch w/o label starts a new cycle
h/Hold for filling this sed buffer
exchange to swap the hold and pattern buffers
= to enumerate the current sed input line
substitute s/.../.../; with global flag s/.../.../g;
and most particularly the GNU specific
evaluate (execute can not remember the actual mnemonic but irrelevantly synonymous)
The GNU specific execute command is avoided in the generic code. It does not print the answer but
instead produces code that will print the answer. Run it to observe. To fully automate this, many
mechanisms can be used not the least of which is the sed write command to put these lines in a
shell file to be executed or even embed the output in bash evaluation parentheses $( ) etc.
Note also that various sed example scripts can "count" and these too can be used efficaciously.
The interested reader can entertain these other pursuits.
prolegomena:
concept from counting # of lines between braces
sed -n '/{/=;/}/=;'
to
sed -n '/}/=;/{/=;' |
sed -n 'h;n;G;s/\n/ - /;
2s/^/ Between sets of {} \n the nl # count is\n /;
2!s/^/ /;
p'
testing "done in":
linuxuser@ubuntu:~$ lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description: Ubuntu 18.04.2 LTS
Release: 18.04
Codename: bionic
linuxuser@ubuntu:~$ sed --version -----> sed (GNU sed) 4.4

And for giggles an awk-only alternative:
echo '{
> 100="1";
> 101="2";
> 102="3";
> };
> {
> 104="1,2,3";
> };
> {
> 105="1,2,3";
> };' | awk 'BEGIN{RS="\n};";FS="\n"}{c=gsub(/=/,""); if(NF>2){print c}}'
3
1
1
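For comparison, here is a sketch that does not rely on `};` as a record terminator: it walks the input character by character and tracks whether it is inside a brace pair. The function name `count_eq_in_braces` is hypothetical. It assumes braces are not nested, and it counts every `=` inside the braces, including any inside quoted values.

```shell
# Hypothetical helper: print the number of "=" found inside each {...} pair,
# whether the pair sits on one line or spans several.
count_eq_in_braces() {
    awk '{
        n = length($0)
        for (i = 1; i <= n; i++) {
            ch = substr($0, i, 1)
            if (ch == "{")      { inside = 1; count = 0 }             # open a group
            else if (ch == "}") { if (inside) print count; inside = 0 } # close and report
            else if (ch == "=" && inside) count++
        }
    }' "$@"
}

printf '{\n100="1";\n101="2";\n102="3";\n};\n{\n104="1,2,3";\n};\n{\n105="1,2,3";\n};\n' |
    count_eq_in_braces
# prints:
# 3
# 1
# 1
```

An empty pair `{}` reports 0, and a dangling `{` with no closing brace reports nothing.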

How to filter the required content from a string in linux?

I had a string like:-
sometext sometext BASEDIR=/someword/someword/someword/1342.32 sometext sometext.
Could someone tell me how to extract the number 1342.32 from the above string in Linux?

$ echo "sometext BASEDIR=/someword/1342.32 sometext." |
sed "s/[^0-9.]//g"
> 1342.32.
The sed command searches for anything not in the set "0123456789" or ".", and replaces it with nothing (deletes it). It does this in global mode, so it doesn't stop on the first match.
This is enough if you're just trying to read it. If you're trying to feed the number into another command and need a real number, you will need to clean it up:
$ ... | cut -f 1-2 -d "."
> 1342.32
cut splits the input on the delimiter, then selects fields 1 and 2 (numbered from one). So "1.2.3.4" would return "1.2".

If sometext is always delimited from the surrounding fields by a white space, try this
awk '{for (i=1;i<=NF;i++) {if ($i ~ /BASEDIR/) {print i,$i}}}' log.txt |
awk -F/ '{for (i=1;i<=NF;i++) {if ($i ~ /^[0-9][0-9.]*$/) {print $i}}}'
The code snippet above assumes that your data is contained in a file called log.txt and organised in records (in the awk sense, i.e. one per line). The second awk splits the matching field on / and prints the component made up of digits and dots.

This works also if digits appear in sometext before BASEDIR as well as if the input has additional lines:
sed -n 's,.*BASEDIR=\(/\w*\)*/\([0-9.]*\).*,\2,p'
-n do not output lines without BASEDIR…
\(/\w*\)* group of / and someword, repeated
\([0-9.]*\) group of repeated digit or decimal point
\2 replacement of everything matched (the entire line) with the 2nd group
p print the result
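The `\w` escape used above is a GNU extension. As a rough sketch of the same capture-group idea using only basic bracket expressions, the hypothetical function `basedir_number` below assumes the BASEDIR value contains no spaces and ends in a path component that starts with a digit.

```shell
# Hypothetical helper: print the trailing numeric path component of BASEDIR=...
# Lines without a matching BASEDIR are skipped (-n plus the p flag).
basedir_number() {
    sed -n 's,.*BASEDIR=[^ ]*/\([0-9][0-9.]*\).*,\1,p' "$@"
}

printf '%s\n' 'sometext sometext BASEDIR=/someword/someword/someword/1342.32 sometext sometext.' |
    basedir_number
# prints: 1342.32
```

`[^ ]*/` is greedy, so it consumes the path up to the last slash that is still followed by a digit, leaving only the final component for the capture group.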

Paste corresponding characters from multiple lines together

I'm writing a Linux command that pastes corresponding characters from multiple lines together. For example: I want to change these lines
A---
-B--
---C
--D-
to this:
A----B-----D--C-
So far, I've made this:
cat sanger.a sanger.c sanger.g sanger.t | cut -c 1
This does the trick for only the first column, but it has to work for all the columns.
Is there anyone who can help?
EDIT: This is a better example. I want this:
SUGAR
HONEY
CANDY
to become
SHC UOA GNN AED RYY (without spaces)

Awk way for updated spec
awk -vFS= '{for(i=1;i<=NF;i++)a[i]=a[i]$i}
END{for(i=1;i<=NF;i++)printf "%s",a[i];print ""}' file
Output
A----B-----D--C-
SHCUOAGNNAEDRYY
P.S. For a large file this will use a lot of memory.
A terrible way that avoids awk; you also need to know the number of columns beforehand:
for i in {1..4};do cut -c $i test | tr -d "\n" ; done;echo
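The awk one-liner in this answer splits each line into characters with an empty FS, which is not defined by POSIX, and it takes the column count from the last line only. The sketch below is one possible variant that uses substr() and tracks the longest line instead. The function name `interleave_columns` is hypothetical, and like the original it keeps one string per column in memory.

```shell
# Hypothetical helper: append the i-th character of every line to column i,
# then print all columns back to back.
interleave_columns() {
    awk '{
        n = length($0)
        if (n > max) max = n                  # remember the longest line
        for (i = 1; i <= n; i++) col[i] = col[i] substr($0, i, 1)
    }
    END {
        for (i = 1; i <= max; i++) printf "%s", col[i]
        print ""
    }' "$@"
}

printf 'SUGAR\nHONEY\nCANDY\n' | interleave_columns
# prints: SHCUOAGNNAEDRYY
```

Shorter lines simply contribute nothing to the columns they do not reach.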

Here's a solution without awk or sed, assuming the file is named f:
paste -s -d "" <(for i in $(seq 1 $(wc -L < f)); do cut -c $i f; done)
wc -L is a GNUism which returns the length of the longest line in the input file, which might not work depending on your version/locale. You could instead find the longest line by doing something like:
awk '{if (length > x) {x = length}} END {print x}' f
Then using this value in the seq command instead of the above command substitution.

All right, time for some sed insanity! :D
Disclaimer: If this is for something serious, use something less brittle than this. awk comes to mind. Unless you feel confident enough in your sed abilities to maintain this lunacy.
cat file1 file2 etc | sed -n '1h; 1!H; $ { :loop; g; s/$/\n/; s/\([^\n]\)[^\n]*\n/\1/g; p; g; s/^.//; s/\n./\n/g; h; /[^\n]/ b loop }' | tr -d '\n'; echo
This comes in three parts: Say you have a file foo.txt
12345
67890
abcde
fghij
then
cat foo.txt | sed -n '1h; 1!H; $ { :loop; g; s/$/\n/; s/\([^\n]\)[^\n]*\n/\1/g; p; g; s/^.//; s/\n./\n/g; h; /[^\n]/ b loop }'
produces
16af
27bg
38ch
49di
50ej
After that, tr -d '\n' deletes the newlines, and ;echo adds one at the end.
The heart of this madness is the sed code, which is
1h
1!H
$ {
:loop
g
s/$/\n/
s/\([^\n]\)[^\n]*\n/\1/g
p
g
s/^.//
s/\n./\n/g
h
/[^\n]/ b loop
}
This first follows the basic pattern
1h # if this is the first line, put it in the hold buffer
1!H # if it is not the first line, append it to the hold buffer
$ { # if this is the last line,
do stuff # do stuff. The whole input is in the hold buffer here.
}
which assembles all input in the hold buffer before working on it. Once the whole input is in the hold buffer, this happens:
:loop
g # copy the hold buffer to the pattern space
s/$/\n/ # put a newline at the end
s/\([^\n]\)[^\n]*\n/\1/g # replace every line with only its first character
p # print that
g # get the hold buffer again
s/^.// # remove the first character from the first line
s/\n./\n/g # remove the first character from all other lines
h # put that back in the hold buffer
/[^\n]/ b loop # if there's something left other than newlines, loop
And there you have it. I might just have summoned Cthulhu.

how to replace substring in a file according to specific pattern without programming

Suppose I have a file whose format should be:
number, string1, [string2], ....
Here string1 should not contain ',' because ',' separates the columns.
For some reason, string1 now contains ',' characters,
so we need to replace them with another symbol, such as '-':
1,aaa,bbb,ccc,[x,y,z],eee,fff,ggg
2,q,w,[x],f,g
3,z,[y],g,h
4,zzz,xxx,ccc,vvv,[z],g,h
....
should be revised to :
1,aaa-bbb-ccc,[x,y,z],eee,fff,ggg
2,q-w,[x],f,g
3,z,[y],g,h
4,zzz-xxx-ccc-vvv,[z],g,h
....
What's the best way to do this without programming? I mean using just awk, sed, or vim rather than shell programming, Python, C++, etc.
Thanks

$ awk -F, 'BEGIN{OFS=FS} {two=$0;sub($1 FS,"",two);sub(/,[[].*/,"",two);gsub(/,/,"-",two); rest=$0;sub(/^[^[]*/,"",rest); print $1,two,rest}' input.txt
1,aaa-bbb-ccc,[x,y,z],eee,fff,ggg
2,q-w,[x],f,g
3,z,[y],g,h
4,zzz-xxx-ccc-vvv,[z],g,h
$
Let's break out the awk script for easier commenting.
$ awk -F, '
BEGIN { OFS=FS }
{
two=$0; # Second field is based on the line...
sub($1 FS,"",two); # Remove the first field,
sub(/,[[].*/,"",two); # Remove everything from the [ onwards,
gsub(/,/,"-",two); # Replace commas in whatever remains.
rest=$0; # Last part of the line, after "two"
sub(/^[^[]*/,"",rest); # Strip everything up to the [
print $1,two,rest; # Print it.
}
' input.txt
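The same three-part split can also be done with index() and substr(), which avoids treating the first field as a regular expression. This is only a sketch under the question's assumptions: the first comma ends the number column and the first `,[` starts the bracketed column. The function name `join_string1` is hypothetical.

```shell
# Hypothetical helper: replace the commas between the first comma and the
# first ",[" with "-", leaving the rest of the line untouched.
join_string1() {
    awk '{
        c = index($0, ",")     # end of the number column
        b = index($0, ",[")    # start of the bracketed column
        if (c == 0 || b <= c) { print; next }   # nothing to rewrite
        mid = substr($0, c + 1, b - c - 1)      # the string1 part
        gsub(/,/, "-", mid)
        print substr($0, 1, c) mid substr($0, b)
    }' "$@"
}

printf '%s\n' '1,aaa,bbb,ccc,[x,y,z],eee,fff,ggg' '3,z,[y],g,h' | join_string1
# prints:
# 1,aaa-bbb-ccc,[x,y,z],eee,fff,ggg
# 3,z,[y],g,h
```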

a little long, but you can use sed like this:
sed ':loop; s/\([0-9]\+,.*\)\([^,]*\),\([^,]*\)\(.*,\[\)/\1\2-\3\4/; t loop' \
input_file
slightly shorter one:
sed ':loop; s/\([0-9]*,[^\[,]*\),\([^\[,]*,\[\)/\1-\2/; t loop' input_file
description for the second one:
loop while there are matches # :loop;
1) find numbers followed by a comma, # \([0-9]*,
followed by anything not comma or '[', # [^\[,]*\)
2) find comma # ,
3) find anything not ',' or '[' # \([^\[,]*
4) followed by a ',' and '[' # ,\[\)/
5) replace the whole thing with
match of step 1 and '-' and matches
from steps 3-4 # /\1-\2/;
end loop
# t loop

This might work for you (GNU sed):
sed -e 's/,\[/\n&/;h;s/\n.*//;s/,/-/2g;G;s/\n.*\n//' file
