Finding character location of all instances of a string in bash - string

I'm trying to find the location of all instances of a string in a particular file; however, the code I'm currently running only returns the location of the first instance and then stops there. Here is what I'm currently running:
str=$(cat temp1.txt)
tmp="${str%%<C>*}"
if [ "$tmp" != "$str" ]; then
echo ${#tmp}
fi
The file is only one line of string and I would display it but the format questions need to be in won't allow me to add the proper amount of spaces between each character.

I am not sure of many details of your requirements, however this is an awk one-liner:
awk -vRS='<C>' '{printf("%u:",a+=length($0));a+=length(RS)}END{print ""}' temp1.txt
Let’s test it with an actual line of input:
$ awk -vRS='<C>' \
'{printf("%u:",a+=length($0));a+=length(RS)}END{print ""}' \
<<<" <C> <C> "
4:14:20:
This means: 4 bytes precede the first <C>, 14 bytes precede the second <C> (counting the three bytes of the first <C>), and the whole line is 20 bytes long (including the final newline). The numbers are therefore zero-based offsets.
Is this what you want?
Explanation
We set (-v) the record separator (RS) to <C>. POSIX leaves a multi-character RS unspecified; GNU awk and mawk treat it as a regular expression, which is what this one-liner relies on. Then we keep a variable a with the count of all bytes processed so far. For each “line” (i.e., each <C>-separated substring) we add the length of the current line to a, printf it with a suitable format "%u:", and increase a by the length of the separator which ended the current line. Since nothing printed so far included a newline, at the END we print an empty string, which is an idiom to output a final newline.

Have a look at what is basically the same question, asked here.
In particular, your question may be answered for multiple instances by user
JRFerguson's response, which uses perl.
EDIT: I found another solution that might just do the trick here. (The main question and response post is found here.)
I changed the shell from ksh to bash, changed the searched string to include multiple <C>'s to better demonstrate an answer to the question, and named the script "tester":
#!/bin/bash
printf '%s\n' '<C>abc<C>xyz<C>123456<C>zzz<C>' | awk -v s="$1" '
{ d = ""
for(i = 1; x = index(substr($0, i), s); i = i + x + length(s) - 1) {
printf("%s%d", d, i + x - 1)
d = ":"
}
print ""
}'
This is how I ran it:
$ tester '<C>'
1:7:13:22:28
I haven't figured the code out (I like to know why it works) but it seems to work! It would be nice to get an explanation and an elegant way to feed your string into this script. Cheers.
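To sketch why the loop works and how to feed it a file: index() returns the position of the needle inside the substring that starts at the current search position (0 if absent), so adding the search position minus one gives the absolute position, and the next search resumes just past the match. The function name find_all and the variable names below are made up for this illustration; the needle is passed through the environment so awk treats it as plain text.

```shell
#!/bin/bash
# find_all NEEDLE [FILE...]: for each input line, print the 1-based position
# of every occurrence of NEEDLE, separated by colons.
find_all() {
  local needle=$1; shift
  needle="$needle" awk '
    BEGIN {
      s = ENVIRON["needle"]
      if (s == "") exit 1            # an empty needle would never advance
    }
    {
      out = ""
      pos = 1                        # where the next search starts
      while ((x = index(substr($0, pos), s)) > 0) {
        hit = pos + x - 1            # absolute position of this match
        out = out (out == "" ? "" : ":") hit
        pos = hit + length(s)        # resume right after the match
      }
      print out
    }' "$@"
}
```

With this in place, something like find_all '<C>' temp1.txt should report the positions for the one-line file from the question; subtract 1 from each number if zero-based offsets are wanted.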

Related

AWK - string containing required fields

I thought it would be easy to define a string such as "1 2 3" and use it within AWK (GAWK) to extract the required fields, how wrong I have been.
I have tried creating AWK arrays, BASH arrays, splitting, string substitution etc, but could not find any method to use the resulting 'chunks' (ie the column/field numbers) in a print statement.
I believe Akshay Hegde has provided an excellent solution with the get_cols function, here
but it was over 8 years ago, and I am really struggling to work out 'how it works', namely, what this is doing;
s = length(s) ? s OFS $(C[i]) : $(C[i])
I am unable to post a comment asking for clarification due to my lack of reputation (and it is an old post).
Is someone able to explain how the solution works?
NB I don't think I need the sub, as I am using the following to clean up (replace all non-numeric characters with a comma, i.e. a separator, and sort numerically)
Columns=$(echo $Input_string | sed 's/[^0-9]\+/,/g')
Columns=$(echo $Columns | xargs -n1 | sort -n | xargs)
(using this string, the awk would be executed using awk -v cols=$Columns -f test.awk infile in the given solution)
Given the informative answer from @Ed Morton, with a nice worked example, I have attempted to remove the need for a function (and also an additional awk program file). The intention is to have this within a shell script, and I would rather it be self-contained, but I also wanted to investigate further 'how it works'.
Fields="1 2 3"
echo $Fields | awk -F "," '{n=split($0,Column," "); for(i=1;i<=n;i++) s = length(s) ? s OFS $(Column[i]) : $(Column[i])}END{print "s="s " arr1="Column[1]" arr2="Column[2]" arr3="Column[3]}'
The results have surprised me (taking note of my Comment to Ed)
s=1 2 3 arr1=1 arr2=2 arr3=3
The above clearly shows the split has worked into the array, but I thought s would include $ for each ternary operator concatenation, ie "$1 $2 $3"
Moreso, I was hoping to append the actual file to the above command, which I have found allows me to use echo $string | awk '{program}' file.name
NB it is a little insulting that my question has been marked as -1 indicating little research effort, as I have spent days trying to work this out.
Taking all the information above, I think s results in "1 2 3", but the print doesn't accept this in the same way as it does as it is called from a function, simply trying to 'print 1 2 3' in relation to the file, which seems to be how all my efforts have ended up.
This really confuses me, as Ed's 'diagonal' example works from command line, indicating that concept of 'print s' is absolutely fine when used with a file name input.
Can anyone suggest how this (example below) can work?
I don't know if using echo pipe and appending the file name is strictly allowed, but it appears to work (?!?!?!)
(failed result)
echo $Fields | awk -F "," '{n=split($0,Column," "); for(i=1;i<=n;i++) s = length(s) ? s OFS $(Column[i]) : $(Column[i])}END{print s}' myfile.txt
This appears to go through myfile.txt and output all lines containing many comma separated values, ie the whole file (I haven't included the values, just for illustration only)
,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
what this is doing; s = length(s) ? s OFS $(C[i]) : $(C[i])
You have encountered the ternary operator, which has the following syntax
condition ? valueiftrue : valueiffalse
The length function, when given a single argument, returns the number of characters in it. In GNU AWK the integer 0 is considered false and all other integers are considered true, so in this case length(s) is a "not empty" check. When s is not empty, it is concatenated with the output field separator (OFS, a space by default) and the value of the C[i]-th field, and the result is assigned to s. When s is empty (which includes not being initialized yet, as GNU AWK treats an unset variable as the empty string), s is simply set to the value of the C[i]-th field. Used repeatedly, this builds a string of values separated by OFS. Consider the following simple example: let's say you want to get the diagonal of a 2D matrix stored in file.txt with the following content
1 2 3 4 5
6 7 8 9 10
11 12 13 14 15
16 17 18 19 20
21 22 23 24 25
then you might do
awk '{s = length(s) ? s OFS $(NR) : $(NR)}END{print s}' file.txt
which will get output
1 7 13 19 25
Explanation: NR is the number of the current record (row), so for the 1st row $(NR) is the 1st field, for the 2nd row it is the 2nd field, for the 3rd row it is the 3rd field, and so on
(tested in GNU Awk 5.0.1)
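As a sketch of how the same idiom can pick the columns named in a shell variable while reading a real file: the column list is handed over with -v rather than on standard input, because awk does not read standard input once a file operand is given (which is likely why the echo-pipe attempt in the question ended up splitting the lines of myfile.txt instead). The function name print_cols is invented for this example, and it assumes whitespace-separated data; for comma-separated data one would add -F, to the awk call.

```shell
#!/bin/bash
# print_cols "COLS" [FILE...]: print the listed fields of every input line.
print_cols() {
  local cols=$1; shift
  awk -v cols="$cols" '
    BEGIN { n = split(cols, C, /[^0-9]+/) }   # "1 2 3" or "1,2,3" into C[1..n]
    {
      s = ""                                   # start a fresh output line
      for (i = 1; i <= n; i++) {
        if (C[i] == "") continue               # skip empty pieces at the edges
        s = length(s) ? s OFS $(C[i]) : $(C[i])
      }
      print s
    }' "$@"
}
```

Note that $(C[i]) means "the field whose number is stored in C[i]"; the string built in s therefore contains field values, never the text "$1 $2 $3".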

BASH - Extract Data from String

I have a log that returns thousands of lines of data, I want to extract a few values from that.
In the log there is only one line containing the unique unit reference so I can grep for that using:
grep "unit=Central-C152" logfile.txt
That produces a line of output similar to the following:
a3cd23e,85d58f5,53f534abef7e7,unit=Central-C152,locale=32325687-8595-9856-1236-12546975,11="School",1="Mr Green",2="Qual",3="SWE",8="report",5="channel",7="reset",6="velum"
The format of the line may change in that the order of the values won't always be in the same position.
I'm trying to work out how to get the value of 2 and 7 in to separate variables.
I had thought about cut on , or = but as the values aren't in a set order I couldn't work out that best way to do it.
I'm trying to get:
var state=value of 2 without quotes
var mode=value of 7 without quotes
Can anyone advise on the best way to do this ?
Thanks
Could you please try the following to set the variables' values.
state=$(awk '/unit=Central-C152/ && match($0,/2=\"[^"]*/){print substr($0,RSTART+3,RLENGTH-3)}' Input_file)
mode=$(awk '/unit=Central-C152/ && match($0,/7=\"[^"]*/){print substr($0,RSTART+3,RLENGTH-3)}' Input_file)
You could print them too by doing the following.
echo "$state"
echo "$mode"
Explanation: Adding explanation of command too now.
awk ' ##Starting awk program here.
/unit=Central-C152/ && match($0,/2=\"[^"]*/){ ##Checking condition if a line has string (unit=Central-C152) and using match using REGEX to check from 2 to till "
print substr($0,RSTART+3,RLENGTH-3) ##Printing substring starting from RSTART+3 till RLENGTH-3 characters.
}
' Input_file ##Mentioning Input_file name here.
You are probably better off doing all of the processing in Awk.
awk -F, '/unit=Central-C152/ {
for(i=1;i<=NF;++i)
if($i ~ /^[27]="/) {
b[++k] = $i
sub(/^[27]="/, "", b[k])
sub(/"$/, "", b[k])
gsub(/\\/, "", b[k])
}
print "state " b[1] ", mode " b[2]
}' logfile.txt
This presupposes that the fields always occur in the same order (2 before 7). Maybe you need to change or disable the gsub to remove backslashes in the values.
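If the order of the pairs cannot be relied on, one possible variation is to index the values by their key instead of by the order in which they were found. This is only a sketch under the same assumption as above (no commas inside the quoted values); the function name state_mode and the array name val are made up here.

```shell
#!/bin/bash
# state_mode [FILE...]: print the values keyed 2 and 7 from the matching line,
# whatever order the key="value" pairs appear in.
state_mode() {
  awk -F, '/unit=Central-C152/ {
    split("", val)                 # clear values left over from an earlier line
    for (i = 1; i <= NF; ++i) {
      eq = index($i, "=")          # position of the first "=" in this field
      if (eq == 0) continue        # plain field such as a3cd23e
      key = substr($i, 1, eq - 1)
      v = substr($i, eq + 1)
      gsub(/^"|"$/, "", v)         # drop the surrounding double quotes
      val[key] = v
    }
    print "state " val[2] ", mode " val[7]
  }' "$@"
}
```

Called as state_mode logfile.txt, this should print the same "state ..., mode ..." line regardless of whether 2 comes before or after 7.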
If you want to do more than print the values, refactoring whatever Bash code you have into Awk is often a better approach than doing this processing in Bash.
Assuming you already have the line in a variable such as with:
line="$(grep 'unit=Central-C152' logfile.txt | head -1)"
You can then simply use the built-in parameter substitution features of bash:
f2=${line#*2=\"} ; f2=${f2%%\"*} ; echo "${f2}"
f7=${line#*7=\"} ; f7=${f7%%\"*} ; echo "${f7}"
The first command on each line strips off the first part of the line up to and including the <field-number>=". The second command then strips everything beyond (and including) the first remaining quote. The third, of course, simply echoes the value.
When I run those commands against your input line, I see:
Qual
reset
which is, from what I can see, what you were after.
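One caveat with the pattern *2=\" is that it would also match a key such as 12=" if that happened to appear earlier in the line. A hedged variation is to anchor on the preceding comma as well; this assumes the pairs are comma-separated and that the wanted key is never the very first field. The helper name get_field is invented for this sketch.

```shell
#!/bin/bash
# get_field LINE KEY: print the value of KEY="value" found in LINE.
# Returns non-zero when the key is not present.
get_field() {
  local line=$1 key=$2 rest
  rest=${line#*,"$key"=\"}         # drop everything up to and including ,KEY="
  if [ "$rest" = "$line" ]; then   # nothing was removed: key not found
    return 1
  fi
  printf '%s\n' "${rest%%\"*}"     # keep the text before the closing quote
}
```

Usage would then look like state=$(get_field "$line" 2) and mode=$(get_field "$line" 7).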

bash script and awk to sort a file

so I have a project for uni, and I can't get through the first exercise. Here is my problem:
I have a file, and I want to select some data inside of it and 'display' it in another file. But the data I'm looking for is a little bit scattered in the file, so I need several awk commands in my script to get them.
Query= fig|1240086.14.peg.1
Length=76
Score E
Sequences producing significant alignments: (Bits) Value
fig|198628.19.peg.2053 140 3e-42
> fig|198628.19.peg.2053
Length=553
Here on the picture, you can see that there are 2 types of 'Length=', and I only want to 'catch' the "Length=" that are just after a "Query=".
I have to use awk so I tried this :
awk '{if(/^$/ && $(NR+1)/^Length=/) {split($(NR+1), b, "="); print b[2]}}'
but it doesn't work... does anyone have an idea?
You need to understand how Awk works. It reads a line, evaluates the script, then starts over, reading one line at a time. So there is no way to say "the next line contains this". What you can do is "if this line contains, then remember this until ..."
awk '/Query=/ { q=1; next } /Length/ && q { print } /./ { q=0 }' file
This sets the flag q to 1 (true) when we see Query= and then skips to the next line. If we see Length and we recently saw Query= then q will be 1, and so we print. In other cases, set q back to "not recently seen" on any non-empty line. (I put in the non-empty condition to allow for empty lines anywhere without affecting the overall logic.)
awk solution:
awk '/^Length=/ && r~/^Query/{ sub(/^[^=]+=/,""); printf "%s ",$0 }
NF{ r=$0 }END{ print "" }' file
NF{ r=$0 } - capture the whole non-empty line
/^Length=/ && r~/^Query/ - on encountering Length line having previous line started with Query(ensured by r~/^Query/)
It sounds like this is what you want for the first part of your question:
$ awk -F'=' '!NF{next} f && ($1=="Length"){print $2} {f=($1=="Query")}' file
76
but idk what the second part is about since there's no "data" lines in your input and only 1 valid output from your sample input best I can tell.

replace every nth occurrence of a pattern using awk [duplicate]

This question already has answers here:
Printing with sed or awk a line following a matching pattern
(9 answers)
Closed 6 years ago.
I'm trying to replace every nth occurrence of a string in a text file.
background:
I have a huge bibtex file (called in.bib) containing hundreds of entries beginning with "#". But every entry has a different amount of lines. I want to write a string (e.g. "#") right before every (let's say) 6th occurrence of "#" so, in a second step, I can use csplit to split the huge file at "#" into files containing 5 entries each.
The problem is to find and replace every fifth "#".
Since I need it repeatedly, the suggested answer in "Printing with sed or awk a line following a matching pattern" won't do the job. Again, I am not looking for just one matching place but for many of them.
What I have so far:
awk '/^#/ && v++%5 {sub(/^#/, "\n#\n#")} {print > "out.bib"}' in.bib
replaces the 2nd through 5th occurrences (and no more).
(btw, I found and adapted this solution here: "Sed replace every nth occurrence". Initially, it was meant to replace every second occurrence, which it does.)
And, second:
awk -v p="#" -v n="5" '$0~p{i++}i==n{sub(/^#/, "\n#\n#")}{print > "out.bib"}' in.bib
replaces exactly the 5th occurrence and nothing else.
(adapted from the solution here: "Display only the n'th match of grep")
What I need (and not able to write) is imho a loop. Would a for loop do the job? Something like:
for (i = 1; i <= 200; i * 5)
<find "#"> and <replace with "\n#\n#">
then print
The material I have looks like this:
#article{karamanic_jedno_2007,
title = {Jedno Kosova, Dva Srbije},
journal = {Ulaznica: Journal for Culture, Art and Social Issues},
author = {Karamanic, Slobodan},
year = {2007}
}
#inproceedings{blome_eigene_2008,
title = {Das Eigene, das Andere und ihre Vermischung. Zur Rolle von Sexualität und Reproduktion im Rassendiskurs des 19. Jahrhunderts},
comment = {Rest of lines snippet off here for usability -- as in following entries. All original entries may have a different amount of lines.}
}
#book{doring_inter-agency_2008,
title = {Inter-agency coordination in United Nations peacebuilding}
}
#book{reckwitz_subjekt_2008,
address = {Bielefeld},
title = {Subjekt}
}
What I want is every sixth entry looking like this:
#
#book{reckwitz_subjekt_2008,
address = {Bielefeld},
title = {Subjekt}
}
Thanks for your help.
Your code is almost right; I modified it.
To replace every nth occurrence, you need a modulo expression.
Written with explicit parentheses for clarity, the condition you need is ((i % n) == 0)
awk -v p="#" -v n="5" ' $0~p { i++ } ((i%n)==0) { sub(/^#/, "\n#\n#") }{ print }' in.bib > out.bib
You can do the splitting in awk easily in one step (RT is specific to GNU awk):
awk -v RS='#' 'NR==1{next} (NR-1)%5==1{c++} {print RT $0 > FILENAME"."c}' file
will create file.1, file.2, etc with 5 records each, where the record is defined by the delimiter #.
Instead of doing this in multiple steps with multiple tools, just do something like:
awk '/#/ && (++v%5)==1{out="out"++c} {print > out}' file
Untested since you didn't provide any sample input/output.
If you don't have GNU awk and your input file is huge you'll need to add a close(out) right before the out=... to avoid having too many files open simultaneously.
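A sketch of that last suggestion with the close() call in place, using only POSIX awk features. The function name split_every and its arguments are invented for the illustration, and it assumes (as in the question) that each entry starts with a line beginning with "#".

```shell
#!/bin/bash
# split_every N PREFIX FILE: write FILE into PREFIX1, PREFIX2, ... so that
# each output file holds N entries; an entry starts at a line beginning with "#".
split_every() {
  awk -v n="$1" -v prefix="$2" '
    /^#/ && (seen++ % n) == 0 {    # first entry of a new group
      if (out != "") close(out)    # keep at most one output file open
      out = prefix (++c)
    }
    out != "" { print > out }      # lines before the first entry are skipped
  ' "$3"
}
```

For the bibtex case this might be run as split_every 5 out in.bib, producing out1, out2, and so on. Because the prefix goes through -v, backslashes in it would be interpreted as escape sequences, so a plain path is assumed.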

add text after keyword in bash / shell

I am in the middle of a migration for PTR records from MSoft and I am adjusting the zonefiles for my needs. I have already prepared the zone files so they look like the following:
snapo@jump:~/mike/10$ cat 21.128
102 [AGE:3630582] 1200 PTR host1.domain.company.local.
69 [AGE:3630774] 1200 PTR host2.domain.compan2.local.
[AGE:3630762] 1200 PTR host2.domain.company.local.
80 [AGE:3630774] 1200 PTR hostXX.domain.company.local.
so I have the filename as variable x and I want to achieve the output of the text file to be like this with awk (because I don't think that there is another way in bash). Please no php/python/perl answers, because the script will need to run on different systems and the only language that is supposed to be installed is bash.
Because this is a merge from multiple PTR zones to one, I would have to edit the zone file to look like this:
102.21.128 [AGE:3630582] 1200 PTR host1.domain.company.local.
69.21.128 [AGE:3630774] 1200 PTR host2.domain.compan2.local.
21.128 [AGE:3630762] 1200 PTR host2.domain.company.local.
80.21.128 [AGE:3630774] 1200 PTR hostXX.domain.company.local.
It is also possible that there is no number in the first column (it is "empty"); then the text should be added without a dot in front. Do you have an awk sample or any other sample (cut, grep, head, tail, sed)?
Command should replace the strings in the existing file or with a pipe in the output file > editedtextfile.txt or similar.
With sed:
sed 's/^[^[:space:]]\+/&.21.128/' filename
Treating the input as plain text has the advantage of keeping the formatting intact.
For the edited question, this can be expanded to
sed 's/^[^[:space:]]\+/&.21.128/; s/^[[:space:]]/21.128&/' filename
Addendum: If you don't want to repeat the inserted data in the code, then
sed 's/^[^[:space:]]*/&\n21.128/; s/^\n//; s/\n/./' filename
is another approach that uses a little more trickery: It inserts a marker before the new data, removes the marker if there is nothing before it and otherwise replaces it with a dot.
Addendum 2: Using shell variables with sed code is a little tricky and potentially dangerous (because of code injection). If the variable comes from a trustworthy source and is known to not contain any metacharacters, then it is possible to write
sed "s/^[^[:space:]]*/&\n$variable/; s/^\n//; s/\n/./" filename
as @triplee points out in the comments. If $variable contains slashes but no other metacharacters, and there is a character it is known not to contain, then it is possible to use a different delimiter for the s command:
sed "s#^[^[:space:]]*#&\n$variable#; s/^\n//; s/\n/./" filename
(if it is known that $variable does not contain the character #).
If none of this is the case, deeper magic is required. For example, if $variable is known to be a single line (I suspect that this is the case because otherwise the transformation makes little sense), then it is possible to write
(echo "$variable"; cat filename) | sed '1 { h; d; }; s/^[^[:space:]]*/&\n/; G; s/\(.*\n\)\(.*\)\n\(.*\)/\1\3\2/; s/^\n//; s/\n/./'
This feeds the variable to sed as first line of the input, and then works as follows:
1 { h; d; } # first line: hold, don't print
s/^[^[:space:]]*/&\n/ # after that: Insert marker as before
G # fetch variable from the hold buffer
s/\(.*\n\)\(.*\)\n\(.*\)/\1\3\2/ # move it to the right place
s/^\n// # rest as before.
s/\n/./
However, at this point you may want to consider using awk instead, which has better facilities to deal with shell variables (that is to say, you can use them without treating them as code):
awk -v var="$variable" '{ n = match($0, /[ \t]/); print substr($0, 1, n - 1) (n <= 1 ? "" : ".") var substr($0, n) }' filename
The -v var="$variable" makes a variable var known to the awk code that has the value of $variable, and the awk code then works as follows:
{
# find the first space or tab in the line (0 if none)
# (I would use [[:space:]] here, but there are commonly shipped versions
# of mawk that don't understand POSIX character classes, so for portability
# I resort to [ \t])
n = match($0, /[ \t]/)
# assemble output line accordingly and print it.
print substr($0, 1, n - 1) (n <= 1 ? "" : ".") var substr($0, n)
}
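One detail worth keeping in mind: awk processes backslash escape sequences in a value assigned with -v, so a value containing a backslash would be altered. If that matters, a possible variation is to pass the value through the environment and read it from ENVIRON, which is taken literally. The wrapper below is only a sketch; the function name prefix_zone and the environment variable name zone are made up for this example.

```shell
#!/bin/bash
# prefix_zone SUFFIX [FILE...]: append .SUFFIX to the first column, or put
# SUFFIX alone in front when the line starts with a space or tab.
prefix_zone() {
  local zone=$1; shift
  zone="$zone" awk '{
    n = match($0, /[ \t]/)         # first space or tab in the line, 0 if none
    print substr($0, 1, n - 1) (n <= 1 ? "" : ".") ENVIRON["zone"] substr($0, n)
  }' "$@"
}
```

For the zone file in the question this would be called as prefix_zone 21.128 21.128 > editedtextfile.txt.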
awk -F" " '{print $1".21.128\t" $2"\t"$3"\t"$4"\t"$5}' $1
