I'm trying to replace every nth occurrence of a string in a text file.
Background:
I have a huge bibtex file (called in.bib) containing hundreds of entries beginning with "#", but every entry has a different number of lines. I want to write a string (e.g. "#") right before every (let's say) 6th occurrence of "#" so that, in a second step, I can use csplit to split the huge file at "#" into files containing 5 entries each.
The problem is to find and replace every fifth "#".
Since I need it repeatedly, the suggested answer in Printing with sed or awk a line following a matching pattern won't do the job. Again, I'm not looking for just one match but for many of them.
What I have so far:
awk '/^#/ && v++%5 {sub(/^#/, "\n#\n#")} {print > "out.bib"}' in.bib
replaces the 2nd through 5th occurrences (and no more).
(btw, I found and adopted this solution here: "Sed replace every nth occurrence". Initially, it was meant to replace every second occurrence, which it does.)
And, second:
awk -v p="#" -v n="5" '$0~p{i++}i==n{sub(/^#/, "\n#\n#")}{print > "out.bib"}' in.bib
replaces exactly the 5th occurrence and nothing else.
(adopted solution from here: "Display only the n'th match of grep".)
What I need (and am not able to write) is, imho, a loop. Would a for loop do the job? Something like:
for (i = 1; i <= 200; i += 5)
<find "#"> and <replace with "\n#\n#">
then print
The material I have looks like this:
#article{karamanic_jedno_2007,
title = {Jedno Kosova, Dva Srbije},
journal = {Ulaznica: Journal for Culture, Art and Social Issues},
author = {Karamanic, Slobodan},
year = {2007}
}
#inproceedings{blome_eigene_2008,
title = {Das Eigene, das Andere und ihre Vermischung. Zur Rolle von Sexualität und Reproduktion im Rassendiskurs des 19. Jahrhunderts},
comment = {Rest of lines snipped off here for usability, as in the following entries. All original entries may have a different number of lines.}
}
#book{doring_inter-agency_2008,
title = {Inter-agency coordination in United Nations peacebuilding}
}
#book{reckwitz_subjekt_2008,
address = {Bielefeld},
title = {Subjekt}
}
What I want is every sixth entry looking like this:
#
#book{reckwitz_subjekt_2008,
address = {Bielefeld},
title = {Subjekt}
}
Thanks for your help.
Your code is almost right; I modified it.
To replace every nth occurrence, you need a modulo expression.
So, for better understanding, written with brackets, you need an expression like ((i % n) == 0):
awk -v p="#" -v n="5" ' $0~p { i++ } ((i%n)==0) { sub(/^#/, "\n#\n#") }{ print }' in.bib > out.bib
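As a minimal sketch (using hypothetical one-line entries a through f instead of real bibtex records), the modulo condition fires on the 5th "#" line and inserts the marker before it:

```shell
# Six hypothetical one-line entries; the 5th one gets the "\n#\n#" prefix
printf '#a{}\n#b{}\n#c{}\n#d{}\n#e{}\n#f{}\n' > in.bib
awk -v p="#" -v n="5" '$0~p{i++} ((i%n)==0){sub(/^#/, "\n#\n#")} {print}' in.bib > out.bib
grep -c '^#$' out.bib   # 1 -- exactly one marker line, inserted before entry #e
```
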
You can do the splitting in awk easily in one step:
awk -v RS='#' 'NR==1{next} (NR-1)%5==1{c++} {print RT $0 > FILENAME"."c}' file
will create file.1, file.2, etc., with 5 records each, where a record is defined by the delimiter #. (Note that RT is a GNU awk extension.)
Instead of doing this in multiple steps with multiple tools, just do something like:
awk '/#/ && (++v%5)==1{out="out"++c} {print > out}' file
Untested since you didn't provide any sample input/output.
If you don't have GNU awk and your input file is huge you'll need to add a close(out) right before the out=... to avoid having too many files open simultaneously.
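As a rough check (again with hypothetical one-line entries, so each "#" line is a whole record), the counter opens a new output file at entries 1, 6, 11, and so on:

```shell
# Six hypothetical one-line entries: out1 gets entries 1-5, out2 gets entry 6
printf '#a{}\n#b{}\n#c{}\n#d{}\n#e{}\n#f{}\n' > file
awk '/#/ && (++v%5)==1{out="out"++c} {print > out}' file
wc -l out1 out2   # out1 has 5 lines, out2 has 1
```
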
Related
I have a long list of URLs stored in a text file which I will go through and download. But before doing this I want to remove the duplicate URLs from the list. One thing to note is that some of the URLs look different but in fact lead to the same page. The unique elements in the URL (aside from the domain and path) are the first 2 parameters in the query string. So, for example, my text file would look like this:
https://www.example.com/page1.html?id=12345&key=dnks93jd&user=399494&group=23
https://www.example.com/page1.html?id=15645&key=fkldf032&user=250643&group=12
https://www.example.com/page1.html?id=26327&key=xkd9c03n&user=399494&group=15
https://www.example.com/page1.html?id=12345&key=dnks93jd&user=454665&group=12
If a unique URL is defined up to the second query-string parameter (key), then lines 1 and 4 are duplicates. I would like to completely remove the duplicate lines, not even keeping one. In the example above, lines 2 and 3 would remain and lines 1 and 4 would be deleted.
How can I achieve this using basic command line tools?
To shorten the code from the other answer:
awk -F\& 'FNR == NR { url[$1,$2]++; next } url[$1,$2] == 1' urls.txt urls.txt
Using awk:
$ awk -F'[?&]' 'FNR == NR { url[$1,$2,$3]++; next } url[$1,$2,$3] == 1' urls.txt urls.txt
https://www.example.com/page1.html?id=15645&key=fkldf032&user=250643&group=12
https://www.example.com/page1.html?id=26327&key=xkd9c03n&user=399494&group=15
Reads the file twice; first time to keep a count of how many times the bits you're interested in occur, the second time to print only those that showed up once.
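Running the two-pass one-liner on the four sample URLs shows the effect: lines 1 and 4 share id and key, so both are dropped and only lines 2 and 3 survive:

```shell
cat > urls.txt <<'EOF'
https://www.example.com/page1.html?id=12345&key=dnks93jd&user=399494&group=23
https://www.example.com/page1.html?id=15645&key=fkldf032&user=250643&group=12
https://www.example.com/page1.html?id=26327&key=xkd9c03n&user=399494&group=15
https://www.example.com/page1.html?id=12345&key=dnks93jd&user=454665&group=12
EOF
# First pass counts each (path, id, key) triple; second pass prints singletons
awk -F'[?&]' 'FNR == NR { url[$1,$2,$3]++; next } url[$1,$2,$3] == 1' urls.txt urls.txt
```
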
I tried to use awk '{$0 = tolower($0);gsub(/a|an|is|the/, "", $0);}' words.txt
but it also replaced "a" inside words like "day". I only want to delete the word "a".
for example:
input: The day is sunny the the the Sunny is is
expected output: day sunny
Using GNU awk and built-in variable RT:
$ echo this is a test and nothing more |
awk '
BEGIN {
RS="[ \n]+"
a["a"]
a["an"]
a["is"]
a["the"]
}
!(tolower($0) in a) {
printf "%s%s",$0, RT
}'
this test and nothing more
However, post some sample data with expected output for more specific answers.
You need to define a word boundary to eliminate partial matches:
$ echo "This is a sunny day, that is it." |
awk '{$0=tolower($0); gsub(/\y(is|it|a|this)\y/,"")}1'
will print
sunny day, that .
you can eliminate punctuation signs as well by either adding them to field delimiters or to the gsub words.
The following awk may help with the same.
Case 1: Considering you want to remove only the words a, the, and is here; you could edit my code and add more words too, as per your need.
awk '{
for(i=1;i<=NF;i++){
if(tolower($i)=="a" || tolower($i)=="the" || tolower($i)=="is"){
$i=""
}
};
}
1' Input_file
Case 2: In case you want to remove the words a, the, and is and you also want to remove duplicate fields from lines, then the following may help (this comes from seeing your example output shown in the comments above):
awk '{
for(i=1;i<=NF;i++){
if(tolower($i)=="a" || tolower($i)=="the" || tolower($i)=="is" || ++a[tolower($i)]>1){
$i=""
}
};
}
1' Input_file
NOTE: Since I am nullifying the fields, I am assuming that you are fine with a little extra spacing within the lines.
You need an expression where the word is delimited by something (you need to decide what delimits your words; for example, do numbers delimit the word or are they part of it, as in a4?). So the expression could be, for example, /[^[:alnum:]](a|an|is|the)[^[:alnum:]]/.
Note however that these expressions will match the word AND the delimiters. Use capture feature to deal with this problem.
It looks like your "words.txt" contains just one word per line, so the expression should be delimited by the beginning and end of the line, like /^a$/.
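If the file really does hold one word per line, a minimal sketch of that whole-line approach (with a hypothetical stopword set) could be:

```shell
# Drop lines that are exactly one of the stopwords, case-insensitively
printf 'The\nday\nis\nsunny\nthe\n' |
awk 'tolower($0) !~ /^(a|an|is|the)$/'
# prints: day, sunny
```
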
I'm trying to find the location of all instances of a string in a particular file; however, the code I'm currently running only returns the location of the first instance and then stops there. Here is what I'm currently running:
str=$(cat temp1.txt)
tmp="${str%%<C>*}"
if [ "$tmp" != "$str" ]; then
echo ${#tmp}
fi
The file is only one line of text. I would display it, but the format that questions need to be in won't allow me to add the proper number of spaces between each character.
I am not sure about many details of your requirements; however, here is an awk one-liner:
awk -vRS='<C>' '{printf("%u:",a+=length($0));a+=length(RS)}END{print ""}' temp1.txt
Let’s test it with an actual line of input:
$ awk -vRS='<C>' \
'{printf("%u:",a+=length($0));a+=length(RS)}END{print ""}' \
<<<" <C> <C> "
4:14:20:
This means: the first <C> is at byte 4, the second <C> is at byte 14 (including the three bytes of the first <C>), and the whole line is 20 bytes long (including final newline).
Is this what you want?
Explanation
We set (-v) record separator (RS) as <C>. Then we keep a variable a with the count of all bytes processed so far. For each “line” (i.e., <C>-separated substrings) we add the length of the current line to a, printf it with a suitable format "%u:", and increase a by the length of the separator which ended the current line. Since no printing so far included newlines, at the END we print an empty string, which is an idiom to output a final newline.
Look at basically the same question asked here.
In particular, your question may be answered for multiple instances thanks to user
JRFerguson's response using Perl.
EDIT: I found another solution that might just do the trick here. (The main question and response post is found here.)
I changed the shell from ksh to bash, changed the searched string to include multiple <C>'s to better demonstrate an answer to the question, and named it "tester":
#!/bin/bash
printf '%s\n' '<C>abc<C>xyz<C>123456<C>zzz<C>' | awk -v s="$1" '
{ d = ""
for(i = 1; x = index(substr($0, i), s); i = i + x + length(s) - 1) {
printf("%s%d", d, i + x - 1)
d = ":"
}
print ""
}'
This is how I ran it:
$ tester '<C>'
1:7:13:22:28
I haven't figured the code out (I like to know why it works), but it seems to work! It would be nice to get an explanation and an elegant way to feed your string into this script. Cheers.
I want to add the symbol " >>" at the end of the 1st line, then the 5th line, and so on: 1, 5, 9, 13, 17, .... I was searching the web and went through the article below, but I'm unable to achieve it. Please help.
How can I append text below the specific number of lines in sed?
retentive
good at remembering
The child was very sharp, and her memory was extremely retentive.
— Rowlands, Effie Adelaide
unconscionable
greatly exceeding bounds of reason or moderation
For generations in the New York City public schools, this has become the norm with devastating consequences rooted in unconscionable levels of student failure.
— New York Times (Nov 4, 2011)
The output should look like this:
retentive >>
good at remembering
The child was very sharp, and her memory was extremely retentive.
— Rowlands, Effie Adelaide
unconscionable >>
greatly exceeding bounds of reason or moderation
For generations in the New York City public schools, this has become the norm with devastating consequences rooted in unconscionable levels of student failure.
— New York Times (Nov 4, 2011)
You can do it with awk:
awk '{if ((NR-1) % 5) {print $0} else {print $0 " >>"}}'
We check if line number minus 1 is a multiple of 5 and if it is we output the line followed by a >>, otherwise, we just output the line.
Note: The above code outputs the suffix every 5 lines, because that's what is needed for your example to work.
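As a quick sanity check with a hypothetical 5-line cycle (four content lines plus a blank separator, standing in for the dictionary entries), lines 1 and 6 get the suffix:

```shell
printf 'w1\nd1\ns1\nq1\n\nw2\nd2\ns2\nq2\n' |
awk '{if ((NR-1) % 5) {print $0} else {print $0 " >>"}}'
# w1 and w2 come out as "w1 >>" and "w2 >>"; the rest are unchanged
```
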
You can do it multiple ways. sed is kind of odd when it comes to selecting lines but it's doable. E.g.:
sed:
sed -i -e 's/$/ >>/;n;n;n;n' file
You can do it also as perl one-liner:
perl -pi.bak -e 's/(.*)/$1 >>/ if not (( $. - 1 ) % 5)' file
You're thinking about this wrong. You should append to the end of the first line of every paragraph, don't worry about how many lines there happen to be in any given paragraph. That's just:
$ awk -v RS= -v ORS='\n\n' '{sub(/\n/," >>&")}1' file
retentive >>
good at remembering
The child was very sharp, and her memory was extremely retentive.
— Rowlands, Effie Adelaide
unconscionable >>
greatly exceeding bounds of reason or moderation
For generations in the New York City public schools, this has become the norm with devastating consequences rooted in unconscionable levels of student failure.
— New York Times (Nov 4, 2011)
This might work for you (GNU sed):
sed -i '1~4s/$/ >>/' file
There's a couple more:
$ awk 'NR%5==1 && sub(/$/,">>>") || 1 ' foo
$ awk '$0=$0(NR%5==1?">>>":"")' foo
Here is a non-numeric way in Awk. This works if we have an Awk that supports the RS variable being more than one character long. We break the data into records based on the blank line separation: "\n\n". Inside these records, we break fields on newlines. Thus $1 is the word, $2 is the definition, $3 is the quote and $4 is the source:
awk 'BEGIN {OFS=FS="\n";ORS=RS="\n\n"} $1=$1" >>"'
We use the same output separators as input separators. Our only pattern/action step is then to edit $1 so that it has >> on it. The default action is { print }, which is what we want: print each record. So we can omit it.
Shorter: initialize RS from the concatenation of FS.
awk 'BEGIN {OFS=FS="\n";ORS=RS=FS FS} $1=$1" >>"'
This is nicely expressive: it says that the format uses two consecutive field separators to separate records.
What if we use a flag, initially reset, which is reset on every blank line? This solution still doesn't depend on a hard-coded number, just the blank line separation. The rule fires on the first line, because C evaluates to zero, and then after every blank line, because we reset C to zero:
awk 'C++?1:$0=$0" >>";!NF{C=0}'
Shorter version of accepted Awk solution:
awk '(NR-1)%5?1:$0=$0" >>"'
We can use a ternary conditional expression cond ? then : else as a pattern, leaving the action empty so that it defaults to {print}, which of course means {print $0}. If the zero-based record number is not congruent to 0, modulo 5, then we produce 1 to trigger the print action. Otherwise we evaluate $0=$0" >>" to add the required suffix to the record. The result of this expression is also Boolean true, which triggers the print action.
Shave off one more character: we don't have to subtract 1 from NR and then test for congruence to zero. Basically whenever the 1-based record number is congruent to 1, modulo 5, then we want to add the >> suffix:
awk 'NR%5==1?$0=$0" >>":1'
Though we have to add ==1 (+3 chars), we win because we can drop two parentheses and -1 (-4 chars).
We can do better (with some assumptions): Instead of editing $0, what we can do is create a second field which contains >> by assigning to the parameter $2. The implicit print action will print this, offset by a space:
awk 'NR%5==1?$2=">>":1'
But this only works when the definition line contains one word. If any of the words in this dictionary are compound nouns (separated by space, not hyphenated), this fails. If we try to repair this flaw, we are sadly brought back to the same length:
awk 'NR%5==1?$++NF=">>":1'
Slight variation on the approach: Instead of trying to tack >> onto the record or last field, why don't we conditionally install >>\n as ORS, the output record separator?
awk 'ORS=(NR%5==1?" >>\n":"\n")'
Not the tersest, but worth mentioning. It shows how we can dynamically play with some of these variables from record to record.
Different way for testing NR == 1 (mod 5): namely, regexp!
awk 'NR~/[16]$/?$0=$0" >>":1'
Again, not tersest, but seems worth mentioning. We can treat NR as a string representing the integer as decimal digits. If it ends with 1 or 6 then it is congruent to 1, mod 5. Obviously, not easy to modify to other moduli, not to mention computationally disgusting.
I have a file with almost 5*(10^6) lines of integer numbers. So, my file is big enough.
The question is all about extracting specific lines, filtering them by a condition.
For example, I'd like to:
Extract the first N lines without reading the entire file.
Extract the lines with numbers less than or equal to X (or >=, <=, <, >).
Extract the lines satisfying a condition on a number (a math predicate).
Is there a clever way to perform these tasks (using sed or awk or cat or head)?
Thanks in advance.
To extract the first $NUMBER lines,
head -n $NUMBER filename
Assuming every line contains just a number (although it will also work if the first token is one), 2 can be solved like this:
awk '$1 >= 1234 && $1 < 5678' filename
And keeping in spirit with that, 3 is just the extension
awk 'condition' filename
It would have helped if you had specified what condition is supposed to be, though. This way, you'll have to read the awk documentation to find out how to code it. Again, the number will be represented by $1.
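For instance, taking a hypothetical divisibility test as the condition, point 3 might look like this:

```shell
# Keep only lines whose number is divisible by 7 (hypothetical condition)
printf '3\n14\n21\n8\n' > numbers.txt
awk '$1 % 7 == 0' numbers.txt
# prints: 14 and 21
```
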
I don't think I can explain anything about the head call, it's really just what it says on the tin. As for the awk lines: awk, like sed, works linewise. awk fetches lines in a loop and applies your code to each line. This code takes the form
condition1 { action1 }
condition2 { action2 }
# and so forth
For every line awk fetches, the conditions are checked in the order they appear, and the associated action to each condition is performed if the condition is true. It would, for example, have been possible to extract the first $NUMBER lines of a file with awk like this:
awk -v number="$NUMBER" '1 { print } NR == number { exit }' filename
where 1 is synonymous with true (like in C) and NR is the line number. The -v command line option initializes the awk variable number to $NUMBER. If no action is specified, the default action is { print }, which prints the whole line. So
awk 'condition' filename
is shorthand for
awk 'condition { print }' filename
...which prints every line where the condition holds.