Hello, let's say I have a file such as:
$OUT some text
some text
some text
$OUT
$OUT
$OUT
How can I use sed to replace the 3 consecutive $OUT lines with "replace-thing" and get:
$OUT some text
some text
some text
replace-thing
With sed:
sed -n '1h; 1!H; ${g; s/\$OUT\n\$OUT\n\$OUT/replace-thing/g; p;}' file
GNU sed does not require the semicolon after p.
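Applied to the sample file from the question, this prints:
$OUT some text
some text
some text
replace-thing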
With commentary:
sed -n ' # without printing every line:
# next 2 lines read the entire file into memory
1h # line 1, store current line in the hold space
1!H # not line 1, append a newline and current line to hold space
# now do the search-and-replace on the file contents
${ # on the last line:
g # replace pattern space with contents of hold space
s/\$OUT\n\$OUT\n\$OUT/replace-thing/g # do replacement
p # and print the revised contents
}
' file
This is the main reason I only use sed for very simple things: once you start using the lesser-used commands, you need extensive commentary to understand the program.
Note the commented version does not work with the BSD-derived sed on macOS -- the comments break it, but removing them is OK.
In plain bash:
pattern=$'$OUT\n$OUT\n$OUT' # using ANSI-C quotes
contents=$(< file)
echo "${contents//$pattern/replace-thing}"
And the perl one-liner:
perl -0777 -pe 's/\$OUT(\n\$OUT){2}/replace-thing/g' file
For this particular task, I recommend using awk instead (hope that's an option too).
Update: to replace every run of 3 consecutive $OUT lines, use (thanks to @thanasisp and @glenn jackman):
cat input.txt | awk '
BEGIN {
    i = 0
    p = "$OUT"          # pattern to match
    n = 3               # number of consecutive matches
    r = "replace-thing"
}
$0 == p {
    if (++i == n) {     # a full run of n: emit the replacement once
        print(r)
        i = 0
    }
    next
}
{
    while (i > 0) {     # flush a shorter run unchanged
        print(p)
        --i
    }
    print($0)
}
END {
    while (i > 0) {     # flush a trailing shorter run
        print(p)
        --i
    }
}'
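For the question's input.txt this prints:
$OUT some text
some text
some text
replace-thing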
If you just want to replace the 3rd $OUT occurrence overall, use the following (note this version matches with a regex, so the $ must be escaped, unlike the literal string comparison above, and any line containing $OUT counts as a usage):
cat input.txt | awk '
BEGIN {
    i = 0
    p = "\\$OUT"        # pattern to match (a regex, hence the escaping)
    n = 3               # Nth match
    r = "replace-thing"
}
$0 ~ p {
    ++i
    if(i == n){
        print(r)
    }
}
i != n || $0 !~ p {
    print($0)
}'
This might work for you (GNU sed):
sed -E ':a;N;s/[^\n]*/&/3;Ta;s/^(\$OUT\n?){3}$/replace-thing/;P;D' file
Gather up 3 lines in the pattern space, and if those 3 lines are each $OUT, replace them with replace-thing. Otherwise, print/delete the first line and repeat.
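For the sample file this yields:
$OUT some text
some text
some text
replace-thing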
Related
I am looking for a way to filter a (~12 GB) largefile.txt with long strings in each line for each of the words (one per line) in a queryfile.txt. But afterwards, instead of outputting/saving the whole line that each query word is found in, I'd like to save only that query word and a second word which I only know the start of (e.g. "ABC") and which I know for certain is in the same line the first word was found in.
For example, if queryfile.txt has the words:
this
next
And largefile.txt has the lines:
this is the first line with an ABCword # contents of first line will be saved
and there is an ABCword2 in this one as well # contents of 2nd line will be saved
and the next line has an ABCword2 too # contents of this line will be saved as well
third line has an ABCword3 # contents of this line won't
(Notice that the largefile.txt always has a word starting with ABC included in every line. It's also impossible for one of the query words to start with "ABC")
The save file should look similar to:
this ABCword1
this ABCword2
next ABCword2
So far I've looked into other similar posts' suggestions, namely combining grep and awk, with commands similar to:
LC_ALL=C grep -f queryfile.txt largefile.txt | awk -F"," '$2~/ABC/' > results.txt
The problem is that not only is the query word not being saved, but -F"," '$2~/ABC/' doesn't seem to be the right command for fetching words beginning with 'ABC' either.
I also found ways of using only awk, but still haven't managed to adapt the code to save the second word as well, instead of the whole line:
awk 'FNR==NR{A[$1]=$1;next} ($1 in A){print}' queryfile.txt largefile.txt > results.txt
2nd attempt based on updated sample input/output in question:
$ cat tst.awk
FNR==NR { words[$1]; next }
{
queryWord = otherWord = ""
for (i=1; i<=NF; i++) {
if ( $i in words ) {
queryWord = $i
}
else if ( $i ~ /^ABC/ ) {
otherWord = $i
}
}
if ( (queryWord != "") && (otherWord != "") ) {
print queryWord, otherWord
}
}
$ awk -f tst.awk queryfile.txt largefile.txt
this ABCword
next ABCword2
Original answer:
This MAY be what you're trying to do (untested):
awk '
FNR==NR { word2lgth[$1] = length($1); next }
($1 in word2lgth) && (match(substr($0,word2lgth[$1]+1),/ ABC[[:alnum:]_]+/) ) {
print substr($0,1,word2lgth[$1]+1+RSTART+RLENGTH)
}
' queryfile.txt largefile.txt > results.txt
Given:
cat large_file
this is the first line with an ABCword
and the next line has an ABCword2 too CRABCAKE
third line has an ABCword3
ABCword4 and this is behind
cat query_file
this
next
(The comments you had on each line of large_file have been removed; otherwise ABCword3 would print, since 'this' appears in the comment.)
You can actually do this entirely with GNU sed and tr manipulation of the query file:
pat=$(gsed -E 's/^(.+)$/\\b\1\\b/' query_file | tr '\n' '|' | gsed 's/|$//')
gsed -nE "s/.*(${pat}).*(\<ABC[a-zA-Z0-9]*).*/\1 \2/p; s/.*(\<ABC[a-zA-Z0-9]*).*(${pat}).*/\1 \2/p" large_file
Prints:
this ABCword
next ABCword2
ABCword4 this
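The first command turns the query words into a single word-bounded alternation; for the sample query_file the generated pattern should be:
$ echo "$pat"
\bthis\b|\bnext\b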
This one assumes your queryfile has more entries than there are words on a line in the largefile. Also, it does not consider your comments as comments but processes them as regular data, and therefore if cut and pasted, the third record is a match too.
$ awk '
NR==FNR { # process queryfile
a[$0] # hash those query words
next
}
{ # process largefile
for(i=1;i<=NF && !(f1 && f2);i++) # iterate until both words found
if(!f1 && ($i in a)) # f1 holds the matching query word
f1=$i
else if(!f2 && ($i~/^ABC/)) # f2 holds the ABC starting word
f2=$i
if(f1 && f2) # if both were found
print f1,f2 # output them
f1=f2=""
}' queryfile largefile
Using sed in a while loop
$ cat queryfile.txt
this
next
$ cat largefile.txt
this is the first line with an ABCword # contents of this line will be saved
and the next line has an ABCword2 too # contents of this line will be saved as well
third line has an ABCword3 # contents of this line won't
$ while read -r line; do sed -n "s/.*\($line\).*\(ABC[^ ]*\).*/\1 \2/p" largefile.txt; done < queryfile.txt
this ABCword
next ABCword2
I have to count all '=' between two patterns, i.e. '{' and '}'.
Sample:
{
100="1";
101="2";
102="3";
};
{
104="1,2,3";
};
{
105="1,2,3";
};
Expected Output:
3
1
1
A very cryptic perl answer:
perl -nE 's/\{(.*?)\}/ say ($1 =~ tr{=}{=}) /ge'
The tr function returns the number of characters transliterated.
With the new requirements, we can make a couple of small changes:
perl -0777 -nE 's/\{(.*?)\}/ say ($1 =~ tr{=}{=}) /ges'
-0777 reads the entire file/stream into a single string
the s flag to the s/// operator allows . to match newlines like a plain character.
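Assuming the question's sample is saved in file, the slurp version handles the multi-line blocks and should print:
$ perl -0777 -nE 's/\{(.*?)\}/ say ($1 =~ tr{=}{=}) /ges' file
3
1
1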
Perl to the rescue:
perl -lne '$c = 0; $c += ("$1" =~ tr/=//) while /\{(.*?)\}/g; print $c' < input
-n reads the input line by line
-l adds a newline to each print
/\{(.*?)\}/g is a regular expression. The ? makes the asterisk frugal, i.e. matching the shortest possible string.
The (...) parentheses create a capture group, referred to as $1.
tr is normally used to transliterate (i.e. replace one character by another), but here it just counts the number of equal signs.
+= adds the number to $c.
Awk is here too
grep -o '{[^}]\+}'|awk -v FS='=' '{print NF-1}'
Example:
echo '{100="1";101="2";102="3";};
{104="1,2,3";};
{105="1,2,3";};'|grep -o '{[^}]\+}'|awk -v FS='=' '{print NF-1}'
Output:
3
1
1
First, some test input (a line with = signs outside the curly brackets and inside the content, a line without brackets, and one with only the two brackets):
echo '== {100="1";101="2";102="3=3=3=3";} =;
a=b
{c=d}
{}'
Handle line without brackets (put a dummy char so you will not end up with an empty string)
sed -e 's/^[^{]*$/x/'
Handle line without equal sign (put a dummy char so you will not end up with an empty string)
sed -e 's/{[^=]*}/x/'
Remove stuff outside the brackets
sed -e 's/.*{\(.*\)}/\1/'
Remove stuff inside the double quotes (do not count fields there)
sed -e 's/"[^"]*"//g'
Use @repzero's method to count equal signs
awk -F "=" '{print NF-1}'
Combine stuff
echo -e '{100="1";101="2";102="3";};\na=b\n{c=d}\n{}' |
sed -e 's/^[^{]*$/x/' -e 's/{[^=]*}/x/' -e 's/.*{\(.*\)}/\1/' -e 's/"[^"]*"//g' |
awk -F "=" '{print NF-1}'
The ugly temp field x and the {} replacement can instead be handled inside awk:
echo -e '= {100="1";101="2=2=2=2";102="3";};\na=b\n{c=d}\n{}' |
sed -e 's/^[^{]*$//' -e 's/.*{\(.*\)}/\1/' -e 's/"[^"]*"//g' |
awk -F "=" '{if (NF>0) c=NF-1; else c=0; print c}'
or shorter
echo -e '= {100="1";101="2=2=2=2";102="3";};\na=b\n{c=d}\n{}' |
sed -e 's/^[^{]*$//' -e 's/.*{\(.*\)}/\1/' -e 's/"[^"]*"//g' |
awk -F "=" '{print (NF>0) ? NF-1 : 0; }'
No harder sed than done ... in.
Restricting this answer to the environment as tagged, namely:
linux shell unix sed wc
will actually not require the use of wc (or awk, perl, or any other app.).
Though echo is used, a file source can easily exclude its use.
As for bash, it is the shell.
The actual environment used is documented at the end.
NB. Exploitation of GNU specific extensions has been used for brevity
but appropriately annotated to make a more generic implementation.
Also brace bracketed { text } will not include braces in the text.
It is implicit that such braces should be present as {} pairs but
the text src. dangling brace does not directly violate this tenet.
This is a foray into the world of `sed`'ng to gain some fluency in its use for other purposes.
The ideas expounded upon here are used to cross pollinate another SO problem solution in order
to acquire more familiarity with vetting vagaries of vernacular version variances. Consequently
this pedantic exercise hopefully helps with the pedagogy of others beyond personal edification.
To test easily, at least in the environment noted below, judiciously highlight the appropriate
code section, carefully excluding any dangling pipe |, and then drag & drop, copy & paste, or
middle-click paste it onto the command line.
The other SO problem: linux - Is it possible to do simple arithmetic in sed addresses?
# _______________________________ always needed ________________________________
echo -e '\n
\n = = = {\n } = = = each = is outside the braces
\na\nb\n { } so therefore are not counted
\nc\n { = = = = = = = } while the ones here do count
{\n100="1";\n101="2";\n102="3";\n};
\n {\n104="1,2,3";\n};
a\nb\nc\n {\n105="1,2,3";\n};
{ dangling brace ignored junk = = = \n' |
# _____________ preparatory conditioning needed for final solutions _____________
sed ' s/{/\n{\n/g;
s/}/\n}\n/g; ' | # guarantee but one brace to a line
sed -n '/{/ h; # so sed addressing can "work" here
/{/,/}/ H; # use hHold buffer for only { ... }
/}/ { x; s/[^=]*//g; p } ' | # then make each {} set a line of =
# ____ stop code hi-lite selection in ^--^ here include quote not pipe ____
# ____ outputs the following exclusive of the shell " # " comment quotes _____
#
#
# =======
# ===
# =
# =
# _________________________________________________________________________
# ____________________________ "simple" GNU solution ____________________________
sed -e '/^$/ { s//0/;b }; # handle null data as 0 case: next!
s/=/\n/g; # to easily count an = make it a nl
s/\n$//g; # echo adds an extra nl - delete it
s/.*/echo "&" | sed -n $=/; # sed = command w/ $ counts last nl
e ' # who knew only GNU say you ah phoo
# 0
# 0
# 7
# 3
# 1
# 1
# _________________________________________________________________________
# ________________________ generic incomplete "solution" ________________________
sed -e '/^$/ { s//echo 0/;b }; # handle null data as 0 case: next!
s/=$//g; # echo adds an extra nl - delete it
s/=/\\\\n/g; # to easily count an = make it a nl
s/.*/echo -e & | sed -n $=/; '
# _______________________________________________________________________________
The paradigm used for the algorithm is instigated by the prolegomena study below.
The idea is to isolate groups of = signs between { } braces for counting.
These are found and each group is put on a separate line with ALL other adorning characters removed.
It is noted that sed can easily "count", actually enumerate, nl or \n line ends via =.
The first "solution" uses these sed commands:
print
branch w/o label starts a new cycle
h/Hold for filling this sed buffer
exchange to swap the hold and pattern buffers
= to enumerate the current sed input line
substitute s/.../.../; with global flag s/.../.../g;
and most particularly the GNU specific
evaluate (GNU sed's e command; "execute" is an equally apt reading)
The GNU specific execute command is avoided in the generic code. It does not print the answer but
instead produces code that will print the answer. Run it to observe. To fully automate this, many
mechanisms can be used not the least of which is the sed write command to put these lines in a
shell file to be executed or even embed the output in bash evaluation parentheses $( ) etc.
Note also that various sed example scripts can "count" and these too can be used efficaciously.
The interested reader can entertain these other pursuits.
prolegomena:
concept from counting # of lines between braces
sed -n '/{/=;/}/=;'
to
sed -n '/}/=;/{/=;' |
sed -n 'h;n;G;s/\n/ - /;
2s/^/ Between sets of {} \n the nl # count is\n /;
2!s/^/ /;
p'
testing "done in":
linuxuser@ubuntu:~$ lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description: Ubuntu 18.04.2 LTS
Release: 18.04
Codename: bionic
linuxuser@ubuntu:~$ sed --version -----> sed (GNU sed) 4.4
And for giggles an awk-only alternative:
echo '{
> 100="1";
> 101="2";
> 102="3";
> };
> {
> 104="1,2,3";
> };
> {
> 105="1,2,3";
> };' | awk 'BEGIN{RS="\n};";FS="\n"}{c=gsub(/=/,""); if(NF>2){print c}}'
3
1
1
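Here RS="\n};" makes each {...} block a single record, gsub(/=/,"") returns the number of = signs it replaced, and the NF>2 test skips the empty trailing record.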
I have a column
1
1
1
2
2
2
I would like to insert a blank line when the value in the column changes:
1
1
1
<- blank line
2
2
2
I would recommend using awk:
awk -v i=1 'NR>1 && $i!=p { print "" }{ p=$i } 1' file
On any line after the first, if value of the "i"th column is different to the previous value, print a blank line. Always set the value of p. The 1 at the end evaluates to true, which means that awk prints the line. i can be set to the column number of your choice.
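For the sample column above, this produces:
1
1
1

2
2
2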
while read L; do [[ "$L" != "$PL" && "$PL" != "" ]] && echo; echo "$L"; PL="$L"; done < file
awk(1) seems like the obvious answer to this problem:
#!/usr/bin/awk -f
BEGIN { prev = "" }
/./ {
if (prev != "" && prev != $1) print ""
print
prev = $1
}
You can also do this with SED:
sed '{N;s/^\(.*\)\n\1$/\1\n\1/;tx;P;s/^.*\n/\n/;P;D;:x;P;D}'
The long version with explanations is:
sed '{
N # read second line; (terminate if there are no more lines)
s/^\(.*\)\n\1$/\1\n\1/ # try to replace two identical lines with themselves
tx # if replacement succeeded then goto label x
P # print the first line
s/^.*\n/\n/ # replace first line by empty line
P # print this empty line
D # delete empty line and proceed with input
:x # label x
P # print first line
D # delete first line and proceed with input
}'
One thing I like about using (GNU) SED (though it is not clear from your question whether this matters to you) is that you can easily apply changes in place with the -i switch, e.g.
sed -i '{N;s/^\(.*\)\n\1$/\1\n\1/;tx;P;s/^.*\n/\n/;P;D;:x;P;D}' FILE
You could use the getline function in awk to compare the current line against the following line:
awk '{f=$1; print; getline}f != $1{print ""}1' file
Note that this consumes input two lines at a time, so it only spots a change that falls inside a pair; a change that lands exactly on a pair boundary (e.g. the column 1 1 2 2) is missed.
I have a text file containing 10 hundreds of lines, with different lengths. Now I want to select N lines randomly, save them in another file, and remove them from the original file.
I've found some answers to this question, but most of them use a simple idea: sort the file and select the first or last N lines. Unfortunately this idea doesn't work for me, because I want to preserve the order of lines.
I tried this piece of code, but it's very slow and takes hours.
FILEsrc=$1;
FILEtrg=$2;
MaxLines=$3;
let LineIndex=1;
while [ "$LineIndex" -le "$MaxLines" ]
do
# count number of lines
NUM=$(wc -l $FILEsrc | sed 's/[ \r\t].*$//g');
let X=(${RANDOM} % ${NUM} + 1);
echo $X;
sed -n ${X}p ${FILEsrc}>>$FILEtrg; #write selected line into target file
sed -i -e ${X}d ${FILEsrc}; #remove selected line from source file
LineIndex=`expr $LineIndex + 1`;
done
I found this line the most time consuming one in the code:
sed -i -e ${X}d ${FILEsrc};
is there any way to overcome this problem and make the code faster?
Since I'm in a hurry, may I ask you to send me complete C/C++ code for doing this?
A simple O(n) algorithm is described in:
http://en.wikipedia.org/wiki/Reservoir_sampling
array R[k]; // result
integer i, j;
// fill the reservoir array
for each i in 1 to k do
R[i] := S[i]
done;
// replace elements with gradually decreasing probability
for each i in k+1 to length(S) do
j := random(1, i); // important: inclusive range
if j <= k then
R[j] := S[i]
fi
done
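A minimal awk sketch of the same algorithm (my own illustration, not from the linked article; k is the sample size, and note the reservoir is printed in replacement order, not file order):
awk -v k=5 '
BEGIN { srand() }
NR <= k { R[NR] = $0; next }   # fill the reservoir with the first k lines
{
    j = int(rand() * NR) + 1   # random index in 1..NR (inclusive)
    if (j <= k) R[j] = $0      # keep line NR with probability k/NR
}
END { for (n = 1; n <= k; n++) print R[n] }' file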
Generate all your offsets, then make a single pass through the file. Assuming you have the desired number of offsets in offsets (one number per line) you can generate a single sed script like this:
sed "s!.*!&{w $FILEtrg\nd;}!" offsets
The output is a sed script which you can save to a temporary file, or (if your sed dialect supports it) pipe to a second sed instance:
... | sed -i -f - "$FILEsrc"
Generating the offsets file is left as an exercise.
Given that you have the Linux tag, this should work right off the bat. The default sed on some other platforms may not understand \n and/or accept -f - to read the script from standard input.
Here is a complete script, updated to use shuf (thanks @Thor!) to avoid possible duplicate random numbers.
#!/bin/sh
FILEsrc=$1
FILEtrg=$2
MaxLines=$3
# Add a line number to each input line
nl -ba "$FILEsrc" |
# Rearrange lines
shuf |
# Pick out the line number from the first $MaxLines ones into sed script
sed "1,${MaxLines}s!^ *\([1-9][0-9]*\).*!\1{w $FILEtrg\nd;}!;t;D;q" |
# Run the generated sed script on the original input file
sed -i -f - "$FILEsrc"
[I've updated each solution to remove selected lines from the input, but I'm not positive the awk is correct. I'm partial to the bash solution myself, so I'm not going to spend any time debugging it. Feel free to edit any mistakes.]
Here's a simple awk script (the probabilities are simpler to manage with floating point numbers, which don't mix well with bash):
tmp=$(mktemp /tmp/XXXXXXXX)
awk -v total=$(wc -l < "$FILEsrc") -v maxLines=$MaxLines '
BEGIN { srand(); }
maxLines==0 { exit; }
{ if (rand() < maxLines/total--) {
print; maxLines--;
} else {
print $0 > "/dev/fd/3"
}
}' "$FILEsrc" > "$FILEtrg" 3> $tmp
mv $tmp "$FILEsrc"
As you print a line to the output, you decrement maxLines to decrease the probability of choosing further lines. But as you consume the input, you decrease total to increase the probability. In the extreme, the probability hits zero when maxLines does, so you can stop processing the input. In the other extreme, the probability hits 1 once total is less than or equal to maxLines, and you'll be accepting all further lines.
Here's the same algorithm, implemented in (almost) pure bash using integer arithmetic:
FILEsrc=$1
FILEtrg=$2
MaxLines=$3
tmp=$(mktemp /tmp/XXXXXXXX)
total=$(wc -l < "$FILEsrc")
while read -r line && (( MaxLines > 0 )); do
(( MaxLines * 32768 > RANDOM * total-- )) || { printf '%s\n' "$line" >&3; continue; }
(( MaxLines-- ))
printf '%s\n' "$line"
done < "$FILEsrc" > "$FILEtrg" 3> $tmp
mv $tmp "$FILEsrc"
Here's a complete Go program:
package main
import (
"bufio"
"fmt"
"log"
"math/rand"
"os"
"sort"
"time"
)
func main() {
N := 10
rand.Seed(time.Now().UTC().UnixNano())
f, err := os.Open(os.Args[1]) // open the file
if err != nil { // and tell the user if the file wasn't found or readable
log.Fatal(err)
}
r := bufio.NewReader(f)
var lines []string // this will contain all the lines of the file
for {
if line, err := r.ReadString('\n'); err == nil {
lines = append(lines, line)
} else {
break
}
}
nums := make([]int, N) // creates the array of desired line indexes
for i := range nums { // fills the array with random numbers (lower than the number of lines)
nums[i] = rand.Intn(len(lines))
}
sort.Ints(nums) // sorts this array
for _, n := range nums { // let's print the line
fmt.Println(lines[n])
}
}
Provided you put the Go file in a directory named randomlines in your GOPATH, you may build it like this:
go build randomlines
And then call it like this:
./randomlines "path_to_my_file"
This will print N (here 10) random lines in your files, but without changing the order. Of course it's near instantaneous even with big files.
Here's an interesting two-pass option with coreutils, sed and awk:
n=5
total=$(wc -l < infile)
seq 1 $total | shuf | head -n $n \
| sed 's/^/NR == /; $! s/$/ ||/' \
| tr '\n' ' ' \
| sed 's/.*/ & { print >> "rndlines" }\n!( &) { print >> "leftover" }/' \
| awk -f - infile
A list of random numbers is passed to sed, which generates an awk script. If awk were removed from the pipeline above, this would be the output:
NR == 14 || NR == 1 || NR == 11 || NR == 20 || NR == 21  { print >> "rndlines" }
!( NR == 14 || NR == 1 || NR == 11 || NR == 20 || NR == 21 ) { print >> "leftover" }
So the random lines are saved in rndlines and the rest in leftover.
Mentioned "10 hundreds" lines should sort quite quickly, so this is a nice case for the Decorate, Sort, Undecorate pattern. It actually creates two new files, removing lines from the original one can be simulated by renaming.
Note: head and tail cannot be used instead of awk, because they close the file descriptor after the given number of lines, making tee exit and thus causing missing data in the .rest file.
FILE=input.txt
SAMPLE=10
SEP=$'\t'
<$FILE nl -s "$SEP" -nln -w1 |
sort -R |
tee \
>(awk "NR > $SAMPLE" | sort -t"$SEP" -k1n,1 | cut -d"$SEP" -f2- > $FILE.rest) \
>(awk "NR <= $SAMPLE" | sort -t"$SEP" -k1n,1 | cut -d"$SEP" -f2- > $FILE.sample) \
>/dev/null
# check the results
wc -l $FILE*
# 'remove' the lines, if needed
mv $FILE.rest $FILE
This might work for you (GNU sed, sort and seq):
n=10
seq 1 $(sed '$=;d' input_file) |
sort -R |
sed "${n}q" |
sed 's/.*/&{w output_file\nd}/' |
sed -i -f - input_file
Where $n is the number of lines to extract.
How would I delete the 6 lines starting from every instance of a word I see?
I think this sed command will do what you want:
sed '/bar/,+5d' input.txt
It removes any line containing the text bar plus the five following lines.
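A quick way to check it (GNU sed; bar plus the five following lines vanish):
$ printf '%s\n' a bar 1 2 3 4 5 b | sed '/bar/,+5d'
a
b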
Run it as above to preview the output. When you know it is working correctly, use the switch --in-place=.backup to actually perform the change.
This simple perl script will remove every line that contains the word "DELETE6" plus the 5 following lines (6 total). It also saves the previous version of the file in FILENAME.bak. To run the script:
perl script.pl FILE_TO_CHANGE
#!/usr/bin/perl
use strict;
use warnings;
my $remove_count = 6;
my $word = "DELETE6";
local $^I = ".bak";
my $delete_count = 0;
while (<>) {
$delete_count = $remove_count if /$word/;
print if $delete_count <= 0;
$delete_count--;
}
HTH
perl -i.bak -n -e '$n ++; $n = -5 if /foo/; print if $n > 0' data.txt
perl -ne 'print unless (/my_word/ and $n = 1) .. ++$n == 7'
Note that if my_word occurs in the skipped-over lines, the counter will not be reset.
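(The .. here is Perl's scalar flip-flop operator: it turns true on a line matching /my_word/, which simultaneously sets $n to 1, and stays true through the line where ++$n == 7, so the matching line and the five after it are suppressed.)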