Replace substring of a string of characters with other characters in awk - linux

I have a file which contains a very long string of characters and I would like to replace a substring of it with Ns. Example:
test
ABCDABCDABCD
I would like to replace a substring of it with the letter N, using awk and sed: all the characters from index 5 to 8, so the run of Ns is 4 characters long.
Expected output
ABCDNNNNABCD
I tried something like this:
awk '{ v=substr($0,5,4); sed -i "s/$v/N/g";print substr($0,1,4)""v""substr($0,9,12)}' test
However, this command gives this output:
ABCDABCDABC
No substitution was made.
I would like the code to take the index where the substitution starts (here 5) and the length of the substitution (here 4) as plain numbers, so I can just change them to start at another position or use a different length. In reality I have a string with thousands of letters and I want to replace hundreds of characters, so substituting by pattern does not work in my case.

You want to use awk and sed? Seems like you actually want one of:
$ echo ABCDABCDABCD | perl -pe 'substr($_,4,4)="NNNN"'
ABCDNNNNABCD
$ echo ABCDABCDABCD | perl -pe 'substr($_,4,4)="N"x4'
ABCDNNNNABCD
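Both rely on Perl's substr being usable as an lvalue: assigning to it splices the replacement into $_ in place (offsets are 0-based, so 4 means "start at the 5th character"). The four-argument form of substr does the same replacement without the assignment; a sketch equivalent to the above:
$ echo ABCDABCDABCD | perl -pe 'substr($_,4,4,"N"x4)'
ABCDNNNNABCD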

$ echo 'ABCDABCDABCD' |
awk -v b=5 -v e=8 '{
t=substr($0,b,e-b+1); gsub(/./,"N",t); print substr($0,1,b-1) t substr($0,e+1)
}'
ABCDNNNNABCD

With GNU awk, gensub can replace just the second occurrence of the pattern instead:
echo ABCDABCDABCD | awk '{$0=gensub(/ABCD/,"NNNN",2)}1'
ABCDNNNNABCD
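If you really do want sed, the start position and length can be expressed as repetition counts rather than as a literal pattern. A sketch (the first {4} is how many leading characters to keep, the second {4} is how many characters to overwrite; the run of Ns is generated with printf so it scales to hundreds of characters):
$ echo ABCDABCDABCD | sed -E "s/^(.{4}).{4}/\1$(printf 'N%.0s' {1..4})/"
ABCDNNNNABCD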

Related

Linux shell script to pad numbers after an underscore in all file names [duplicate]

I have long lists as follows:
D6N
T69TN
K70R
M184V
T215FEG
The result must be like this:
D006N
T069TN
K070R
M184V
T215FEG
I'm new to bash. I tried approaches based on splitting it into columns and reformatting; however, the positions and lengths of the 2nd and 3rd putative columns are not fixed.
Thank you for any help!
You can do this using awk, using the built-in match function:
awk 'match($0, /[0-9]+/) { printf "%s%03d%s\n",
    substr($0, 1, RSTART - 1), substr($0, RSTART, RLENGTH), substr($0, RSTART + RLENGTH) }' file
When match is successful, it sets two variables RSTART and RLENGTH, which can be used to extract substrings. The middle substring is formatted using %03d, to pad with leading zeros.
Any lines not matching the pattern won't be printed.
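To see exactly what match sets, run it on one of the sample lines (awk string positions are 1-based):
$ echo 'T69TN' | awk 'match($0, /[0-9]+/) { print RSTART, RLENGTH }'
2 2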
Another option using perl:
perl -pe 's/\d{1,3}/sprintf("%03d", $&)/eg' file
This replaces any sequence of one to three digits with a zero-padded three digit number. In this version, all lines are printed.
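The e modifier is what makes this work: the replacement is evaluated as Perl code, so sprintf runs once per match. Since %03d only pads up to three digits, numbers that are already three digits long pass through unchanged:
$ echo 'T215FEG' | perl -pe 's/\d{1,3}/sprintf("%03d", $&)/eg'
T215FEG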
It would be a little longer with sed's regular expressions, but here it is in Perl:
echo "D6N" | perl -pe 's/(\D)(\d)(\D)/${1}0$2$3/g; s/(\D)(\d\d)(\D)/${1}0$2$3/g;'
It zero-pads 1- and 2-digit numbers surrounded by non-digits, using a simple trick: first pad 1-digit numbers with one zero (so they become 2-digit numbers), then pad 2-digit numbers with another zero.
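Running the two substitutions separately on a 2-digit example shows the trick in action; the first pass does not touch it, the second adds the single missing zero:
$ echo 'K70R' | perl -pe 's/(\D)(\d)(\D)/${1}0$2$3/g'
K70R
$ echo 'K70R' | perl -pe 's/(\D)(\d\d)(\D)/${1}0$2$3/g'
K070R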
Another sed based implementation:
$ cat testfile
D6N
T69TN
K70R
M184V
T215FEG
$ sed -r 's/[0-9]+/00&/g; s/0?0?([0-9]{3})/\1/g' testfile
D006N
T069TN
K070R
M184V
T215FEG
Logic: unconditionally prefix two zeros to every number, then strip leading zeros until the number is 3 digits long.
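Tracing the two substitutions on a 3-digit sample shows why nothing is over-padded: the prefix step always adds two zeros, and the strip step removes exactly enough to leave three digits:
$ echo 'M184V' | sed -r 's/[0-9]+/00&/g'
M00184V
$ echo 'M184V' | sed -r 's/[0-9]+/00&/g; s/0?0?([0-9]{3})/\1/g'
M184V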
This gnu awk can also get the job done:
awk -v RS='[0-9]+' 'RT{print $0 sprintf("%03d", RT); next} 1' ORS= file
D006N
T069TN
K070R
M184V
T215FEG
AFAIK, there is no simple pure-Bash solution for this. Therefore, I'd prefer Perl, because Perl expressions are brief, and Perl is ubiquitous.
s='D6N
T69TN
K70R
M184V
T215FEG'
echo "$s" | perl -ne '/^(\D*)(\d{1,2})(\D*)$/m and printf "%s%03s%s", $1, $2, $3 or print'
With Bash regexes:
#!/bin/bash
re='([[:alpha:]]*)([[:digit:]]*)([[:alpha:]]*)'
while IFS= read -r line; do
    [[ $line =~ $re ]]
    printf "%s%03d%s\n" "${BASH_REMATCH[1]}" "${BASH_REMATCH[2]}" "${BASH_REMATCH[3]}"
done < infile
This matches each line with a regex and captures the three groups: letters, digits, letters. The printf format string makes sure that the digit group is zero padded if it is shorter than three digits.
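A quick interactive check of what the regex captures on one of the sample lines (the three captured groups are printed space-separated here):
$ re='([[:alpha:]]*)([[:digit:]]*)([[:alpha:]]*)'
$ [[ 'T69TN' =~ $re ]] && printf '%s %s %s\n' "${BASH_REMATCH[@]:1:3}"
T 69 TN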

Find words containing 20 vowels grep

I found many similar questions, but most of them ask for vowels in a row, which is easy. I want to find words that contain 20 vowels not in a row, using grep.
I originally thought grep -Ei [aeiou]{20} would do it but that seems to search only for 20 vowels in a row
Use a regular expression that searches for 20 vowels separated by any quantity of consonants.
grep -Ei "[aeiou][b-df-hj-np-tv-z]*[aeiou][b-df-hj-np-tv-z]*\
[aeiou][b-df-hj-np-tv-z]*[aeiou][b-df-hj-np-tv-z]*[aeiou][b-df-hj-np-tv-z]*\
[aeiou][b-df-hj-np-tv-z]*[aeiou][b-df-hj-np-tv-z]*[aeiou][b-df-hj-np-tv-z]*\
[aeiou][b-df-hj-np-tv-z]*[aeiou][b-df-hj-np-tv-z]*[aeiou][b-df-hj-np-tv-z]*\
[aeiou][b-df-hj-np-tv-z]*[aeiou][b-df-hj-np-tv-z]*[aeiou][b-df-hj-np-tv-z]*\
[aeiou][b-df-hj-np-tv-z]*[aeiou][b-df-hj-np-tv-z]*[aeiou][b-df-hj-np-tv-z]*\
[aeiou][b-df-hj-np-tv-z]*[aeiou][b-df-hj-np-tv-z]*[aeiou][b-df-hj-np-tv-z]*"
The backslash is just informing the shell that the expression continues on the next line. It is not part of the regex itself.
If you understand that part, you can shorten it considerably using groups. This regexp is the same as above, but uses parenthesized groups with repetition.
grep -Ei "([aeiou][b-df-hj-np-tv-z]*){20}"
I don't believe that's a problem that calls for just a regex. Here's a programmatic approach: we redefine the field separator to the empty string, so each character is a field. We iterate over the line; if a character is a vowel, we increment a counter. If, at the end of the string, the count is 20, we print the line:
cat nicks.awk
# FS="" makes each character its own field (a common extension; not strictly POSIX)
BEGIN {
    FS = ""
}
{
    c = 0
    for (i = 1; i <= NF; i++) {
        if ($i ~ /[aeiou]/) {
            c++
        }
    }
    if (c == 20) {
        print $0
    }
}
And this is what it does: it only prints back the one string that has exactly 20 vowels.
echo "contributorNickSequestionsfoundcontainingvowelsgrcep" | awk -f nicks.awk
echo "contributorNickSeoquestionsfoundcontainingvowelsgrcep" | awk -f nicks.awk
contributorNickSeoquestionsfoundcontainingvowelsgrcep
echo "contributorNickSaeoquestionsfoundcontainingvowelsgrcep" | awk -f nicks.awk
If all you really need is to find 20 vowels in a line then that's just:
awk '{x=tolower($0)} gsub(/[aeiou]/,"&",x)==20' file
or with grep:
grep -Ei '^[^aeiou]*([aeiou][^aeiou]*){20}$' file
To find words (assuming each is space separated) there are many options, including this with GNU awk:
awk -v RS='\\s+' -v IGNORECASE=1 'gsub(/[aeiou]/,"&")==20' file
or this with any awk:
awk '{for (i=1;i<=NF;i++) {x=tolower($i); if (gsub(/[aeiou]/,"&",x)==20) print $i} }' file
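These gsub-based commands work because gsub returns the number of substitutions it made; replacing every vowel with itself (&) therefore counts the vowels without changing the string. The counting part in isolation:
$ echo 'queueing' | awk '{ print gsub(/[aeiou]/, "&") }'
5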

echo without trimming the space in awk command

I have a file consisting of multiple rows like this
10|EQU000000001|12345678|3456||EOMCO042|EOMCO042|31DEC2018|16:51:17|31DEC2018|SHOP NO.5,6,7 RUNWAL GRCHEMBUR MHIN|0000000010000.00|6761857316|508998|6011|GL
I have to split column 11 into 4 different columns based on character counts.
This is the 11th column; it also contains extra spaces:
SHOP NO.5,6,7 RUNWAL GRCHEMBUR MHIN
This is what I have done:
ls *.txt *.TXT| while read line
do
subName="$(cut -d'.' -f1 <<<"$line")"
awk -F"|" '{ "echo -n "$11" | cut -c1-23" | getline ton;
"echo -n "$11" | cut -c24-36" | getline city;
"echo -n "$11" | cut -c37-38" | getline state;
"echo -n "$11" | cut -c39-40" | getline country;
$11=ton"|"city"|"state"|"country; print $0
}' OFS="|" $line > $subName$output
done
But while echoing the 11th column, it's trimming the extra spaces, which leads to a mismatch in the character count. Is there any way to echo without trimming spaces?
Actual output
10|EQU000000001|12345678|3456||EOMCO042|EOMCO042|31DEC2018|16:51:17|31DEC2018|SHOP NO.5,6,7 RUNWAL GR|CHEMBUR MHIN|||0000000010000.00|6761857316|508998|6011|GL
Expected Output
10|EQU000000001|12345678|3456||EOMCO042|EOMCO042|31DEC2018|16:51:17|31DEC2018|SHOP NO.5,6,7 RUNWAL GR|CHEMBUR|MH|IN|0000000010000.00|6761857316|508998|6011|GL
The least annoying way to code this that I've found so far is:
perl -F'\|' -lane '$F[10] = join "|", unpack "a23 A13 a2 a2", $F[10]; print join "|", @F'
It's fairly straightforward:
Iterate over lines of input; split each line on | and put the fields in @F.
For the 11th field ($F[10]), split it into fixed-width subfields using unpack; the A (instead of a) in the template trims trailing spaces from the second subfield.
Reassemble subfields by joining with |.
Reassemble the whole line by joining with | and printing it.
I haven't benchmarked it in any way, but it's likely much faster than the original code that spawns multiple shell and cut processes per input line because it's all done in one process.
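The unpack template is the part doing the fixed-width split, and it can be tested on its own. A sketch (the sample string is reconstructed here with the trailing padding the fixed widths imply, since the question's display collapsed the runs of spaces):
$ perl -E 'say join "|", unpack "a23 A13 a2 a2", "SHOP NO.5,6,7 RUNWAL GRCHEMBUR      MHIN"'
SHOP NO.5,6,7 RUNWAL GR|CHEMBUR|MH|IN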
A complete solution would wrap it in a shell loop:
for file in *.txt *.TXT; do
outfile="${file%.*}$output"
perl -F'\|' -lane '...' "$file" > "$outfile"
done
Or if you don't need to trim the .txt part (and you don't have too many files to fit on the command line):
perl -i.out -F'\|' -lane '...' *.txt *.TXT
This simply places the output for each input file foo.txt in foo.txt.out.
A pure-bash implementation of all this logic:
#!/usr/bin/env bash
shopt -s nocaseglob extglob
for f in *.txt; do
    subName=${f%.*}
    while IFS='|' read -r -a fields; do
        location=${fields[10]}
        ton=${location:0:23};   ton=${ton%%+([[:space:]])}
        city=${location:23:13}; city=${city%%+([[:space:]])}
        state=${location:36:2}
        country=${location:38:2}
        fields[10]="$ton|$city|$state|$country"
        printf -v out '%s|' "${fields[@]}"
        printf '%s\n' "${out:0:$(( ${#out} - 1 ))}"
    done <"$f" >"$subName.out"
done
It's slower (if I did this well, by about a factor of 10) than pure awk would be, but much faster than the awk/shell combination proposed in the question.
Going into the constructs used:
All the ${varname%...} and related constructs are parameter expansion. The specific ${varname%pattern} construct removes the shortest possible match for pattern from the value in varname, or the longest match if % is replaced with %%.
Using extglob enables extended globbing syntax, such as +([[:space:]]), which is equivalent to the regex syntax [[:space:]]+.
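To see just the trailing-space trim in isolation (the variable and its padded contents are made up for this demo):
$ shopt -s extglob
$ city='CHEMBUR      '
$ echo "[${city%%+([[:space:]])}]"
[CHEMBUR]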

Extract multiple floating numbers from a line

I want to extract timeTaken values from following line:
<some other log data> Exception, Curl1-Time: 0.258315s. Curl2-Time: 3.9092588424683s Exiting.
I am using following command with grep and awk:
grep -Po "Exception, Curl1-Time: \K(\d+.\d*)s. Curl2-Time: (\d+.\d+)" app.log | awk '{print $1 + $3}'
This outputs: 4.167565
Can this be done in a smarter way, maybe using sed or any other bash tool?
Is it OK to ignore the trailing "s." in the time-taken values, given that the result of the addition is correct?
You already use PCRE. Why not use Perl itself?
perl -lne 'print $1 + $2
if /Exception, Curl1-Time: ([\d.]+)s\. Curl2-Time: ([\d.]+)/
' < input
If you have GNU grep, then you can execute:
var="<some other log data> Exception, Curl1-Time: 0.258315s. Curl2-Time: 3.9092588424683s Exiting."
grep -Eo '[[:digit:]]+\.[[:digit:]]+s?' <<< "$var"
Or you can use awk and stay POSIX:
var="<some other log data> Exception, Curl1-Time: 0.258315s. Curl2-Time: 3.9092588424683s Exiting."
awk '{ while (match($0, /[[:digit:]]+\.[[:digit:]]+s?/)) { print substr($0, RSTART, RLENGTH); $0 = substr($0, RSTART + RLENGTH) } }' <<< "$var"
As you can see, both commands use the regex [[:digit:]]+\.[[:digit:]]+s? to match a pattern of one or more digits, a dot, one or more digits and an optional 's'.
GNU grep uses the -o option to extract the matching regex pattern.
The awk version uses its match and substr functions, to match and extract relevant data.
After a regex match, RSTART and RLENGTH are set, and we can use them as the start position and length arguments for substr.
RLENGTH is the length of the substring matched by the match function.
RSTART is the start-index in characters of the substring matched by the match function.
See the gawk manual section "Built-in Functions for String Manipulation".
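Since the original goal was the sum of the two times, the same match loop can accumulate instead of print, keeping everything in one awk process. A sketch reusing $var from above (awk coerces the matched substrings to numbers when adding):
$ awk '{
    s = 0; t = $0
    while (match(t, /[[:digit:]]+\.[[:digit:]]+/)) {
      s += substr(t, RSTART, RLENGTH)   # "0.258315" etc. coerced to a number
      t = substr(t, RSTART + RLENGTH)
    }
    print s
  }' <<< "$var"
4.16757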
sed -n 's/.*Curl1-Time: \([0-9]\.[0-9]*\)s.*\([0-9]\.[0-9]*\)s.*$/\1 \2/p' filename | awk '{print ($1+$2);}'
Regex pattern matching: .*Curl1-Time: \([0-9]\.[0-9]*\)s.*\([0-9]\.[0-9]*\)s.*$ ---> the patterns within the escaped parentheses are the number-matching parts of the regex.
The entire line is replaced with the two captured numbers, i.e. the output of sed is the two numbers with a space between them, e.g. 0.258315 3.9092588424683.
awk then parses the sed output with its default space delimiter, sums the two fields, and prints the result.

How to concatenate multiple lines of output to one line?

If I run the command cat file | grep pattern, I get many lines of output. How do you concatenate all lines into one line, effectively replacing each "\n" with "\" " (end with " followed by space)?
cat file | grep pattern | xargs sed s/\n/ /g
isn't working for me.
Use tr '\n' ' ' to translate all newline characters to spaces:
$ grep pattern file | tr '\n' ' '
Note: grep reads files, cat concatenates files. Don't cat file | grep!
Edit:
tr can only handle single character translations. You could use awk to change the output record separator like:
$ grep pattern file | awk '{print}' ORS='" '
This would transform:
one
two
three
to:
one" two" three"
Piping output to xargs will concatenate each line of output to a single line with spaces:
grep pattern file | xargs
Or any command, e.g. ls | xargs. The default limit of xargs output is ~4096 characters, but it can be increased with e.g. xargs -s 8192.
In bash, echo without quotes removes carriage returns, tabs and multiple spaces:
echo $(cat file)
This could be what you want:
cat file | grep pattern | paste -sd' '
As to your edit, I'm not sure what it means, perhaps this?
cat file | grep pattern | paste -sd'~' | sed -e 's/~/" "/g'
(this assumes that ~ does not occur in file)
This is an example which produces output separated by commas. You can replace the comma by whatever separator you need.
cat <<EOD | xargs | sed 's/ /,/g'
> 1
> 2
> 3
> 4
> 5
> EOD
produces:
1,2,3,4,5
The fastest and easiest ways I know to solve this problem:
When we want to replace the new line character \n with the space:
xargs < file
xargs has its own limits on the number of characters per line and on the number of all characters combined, but we can increase them. Details can be found by running this command: xargs --show-limits, and of course in the manual: man xargs
When we want to replace one character with another exactly one character:
tr '\n' ' ' < file
When we want to replace one character with many characters:
tr '\n' '~' < file | sed s/~/many_characters/g
First, we replace the newline characters \n with tildes ~ (or choose another unique character not present in the text), and then we replace each tilde with any other characters (many_characters); the g flag does this for every tilde.
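For example, with a multi-character separator (the input here deliberately has no trailing newline; otherwise the final newline would be replaced too):
$ printf '1\n2\n3' | tr '\n' '~' | sed 's/~/ and /g'
1 and 2 and 3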
Here is another simple method using awk:
# cat > file.txt
a
b
c
# cat file.txt | awk '{ printf("%s ", $0) }'
a b c
Also, if your file has columns, this gives an easy way to concatenate only certain columns:
# cat > cols.txt
a b c
d e f
# cat cols.txt | awk '{ printf("%s ", $2) }'
b e
I like the xargs solution, but if it's important to not collapse spaces, then one might instead do:
sed ':b;N;$!bb;s/\n/ /g'
That will replace newlines for spaces, without substituting the last line terminator like tr '\n' ' ' would.
This also allows you to use other joining strings besides a space, like a comma, etc, something that xargs cannot do:
$ seq 1 5 | sed ':b;N;$!bb;s/\n/,/g'
1,2,3,4,5
Here is the method using ex editor (part of Vim):
Join all lines and print to the standard output:
$ ex +%j +%p -scq! file
Join all lines in-place (in the file):
$ ex +%j -scwq file
Note: This will concatenate all lines inside the file itself!
Probably the best way to do it is using the awk tool, which will generate the output as one line:
$ awk ' /pattern/ {print}' ORS=' ' /path/to/file
It will merge all lines into one, separated by spaces.
paste -sd'~' was giving an error for me.
Here's what worked for me on a Mac using bash:
cat file | grep pattern | paste -d' ' -s -
From man paste:
-d list  Use one or more of the provided characters to replace the newline characters instead of the default tab. The characters in list are used circularly, i.e., when list is exhausted the first character from list is reused. This continues until a line from the last input file (in default operation) or the last line in each file (using the -s option) is displayed, at which time paste begins selecting characters from the beginning of list again.

         The following special characters can also be used in list:
         \n    newline character
         \t    tab character
         \\    backslash character
         \0    empty string (not a null character)
         Any other character preceded by a backslash is equivalent to the character itself.

-s       Concatenate all of the lines of each separate input file in command line order. The newline character of every line except the last line in each input file is replaced with the tab character, unless otherwise specified by the -d option.

If '-' is specified for one or more of the input files, the standard input is used; standard input is read one line at a time, circularly, for each instance of '-'.
On Red Hat Linux I just use echo:
echo $(cat /some/file/name)
This gives me all records of a file on just one line.
