How to use grep's invert flag "-v" when I do not have a file but a long string that is just one line? - linux

Supposed I have
echo "The first part. The second part. The third part."
and want to remove The first part and The third part to get:
The second part.
I tried:
echo "The first part. The second part. The third part." | grep -v -e "The first part." -e "The third part."
but the inverting flag appears to work only for files with multiple lines. How can I do it for a single string?

Use sed instead:
echo "The first part. The second part. The third part." \
| sed -e 's/[[:space:]]*The first part\.[[:space:]]*//g' \
-e 's/[[:space:]]*The third part\.[[:space:]]*//g'
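which should print exactly the part you want to keep:
The second part.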

grep works line by line and is more of a select-lines-which-satisfy-a-condition tool. The task you want to implement is more remove-substrings-from-input. That falls in the area of substitution, not selection, and the best tool for it is sed:
sed 's/string_to_get_rid_of//g' file
Of course it is possible that your input is structured in records and you want to remove all records which contain a particular word; then there is another option. Assume that your input is split into records which are delimited by a unique character, e.g. the full-stop character (.). Then it is better to use awk. Awk allows you to redefine its record separator from a newline (the default) to anything you want by setting RS and ORS (the latter for the output):
awk 'BEGIN{RS=ORS="."}/string_that_should_not_appear/{next}1' file
Assume you have a file with the content:
foo.bar.baz.qux
quux.quuz.corge
If we want to remove all the records which contain qux, we do:
awk 'BEGIN{RS=ORS="."}/qux/{next}1' file
which returns
foo.bar.baz.quuz.corge.
Notice that the record containing "qux" also contained a newline, and that an extra ORS is added at the end. Alternatively, you might get
foo.bar.baz.quuz.corge
.
which is due to the POSIX convention that text files should end with a newline.
In the case of the OP, it would read:
awk 'BEGIN{RS=ORS="."}/The first part/{next}/The third part/{next}1' file
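Since the OP actually has a single string rather than a file, the same program can be fed through a pipe (a quick adaptation; as discussed above, a leading space and a trailing record separator may remain in the output):
echo "The first part. The second part. The third part." \
| awk 'BEGIN{RS=ORS="."}/The first part/{next}/The third part/{next}1'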

Related

Extract text from each line from a multiple-line text file based on a condition, Linux

I have a txt file with only one column, in which each line represents a different fastq.gz file from a sequencing output. See an example below:
36108-ABZG339L_S237_L001_R1_001.fastq.gz
36108-ABZG339L_S237_L001_R2_001.fastq.gz
36108-ABZGM_S7_L001_R1_001.fastq.gz
36108-ABZGM_S7_L001_R2_001.fastq.gz
First of all, I would like to convert the first "-" symbol to underscore "_".
I achieved that through the following command:
sed 's/[-]/_/Ig' inputfile.txt > outputfile.txt
Then the outputfile.txt is:
36108_ABZG339L_S237_L001_R1_001.fastq.gz
36108_ABZG339L_S237_L001_R2_001.fastq.gz
36108_ABZGM_S7_L001_R1_001.fastq.gz
36108_ABZGM_S7_L001_R2_001.fastq.gz
Afterwards, I would like to extract into a new txt file only the text between the first and second underscore, so:
ABZG339L
ABZG339L
ABZGM
ABZGM
How can I achieve this? I tried with sed and awk but I cannot work it out.
Thanks in advance for your aid,
Magí
1st solution: To get your expected sample output you need not first substitute - with _ and then print; we can use the power of awk here to create multiple field separators and then print the needed value accordingly.
awk -F'-|_' '{print $2}' Input_file
Explanation: In simple terms, the above awk program makes _ and - field separators for the whole Input_file, then prints the 2nd field/column.
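If you have already converted the dashes to underscores, as in the first step of the question, a single separator is enough (a small variation on the above, using the outputfile.txt from the question):
awk -F'_' '{print $2}' outputfile.txt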
2nd solution: A sed solution, using sed's back-reference capability.
sed -E 's/^[^-]*-([^_]*).*/\1/' Input_file
Explanation: sed's -E option enables ERE (extended regular expressions). The program matches from the start of the line up to the 1st occurrence of -, then captures everything up to the next _ into the 1st back reference (a temporary location in memory, retrieved later while performing the substitution), and then matches the rest of the line. During the substitution, the whole line is replaced with only the captured value, giving the desired result.
3rd solution: Using GNU grep. The -oP options enable the PCRE regex engine. The pattern matches everything from the start of the line up to and including the first -, forgets that part of the match with PCRE's \K escape, and then matches everything up to the next _ and prints it.
grep -oP '^.*?-\K[^_]*' Input_file
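If each name always contains exactly one - before the underscore-separated fields, plain cut would also do the job (a simple alternative, not part of the solutions above):
cut -d'-' -f2 Input_file | cut -d'_' -f1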

Linux remove whitespace first line

I have the file virt.txt which contains:
0302 000000 23071SOCIETY 117
0602 000000000000000001 PAYMENT BANK
I want to remove the 3 whitespace characters in columns 6 to 8 of the first line only.
I do:
sed '1s/[[:blank:]]+[[:blank:]]+[[:blank:]]//6' virt.txt
but it does not work (KO).
Please help.
In a basic regular expression (which sed uses by default), + is a literal plus sign, so your pattern actually looks for blanks separated by literal + characters; and even read as an extended regex, it would consume all the available blanks from a sequence of three or more (in a quite inefficient way) and replace only the sixth such match. Either way, your first input line does not contain a sixth match, so the command did nothing. But you can in fact use sed to do exactly what you say you want:
sed '1s/^\(.....\)   /\1/' virt.txt
(or for convenience, if you have sed -E or the variant sed -r which works on some platforms, but neither of these is standard):
sed -E '1s/^(.{5}) {3}/\1/' virt.txt # -E is not portable
The parentheses capture the first five characters into a back reference, and we then use the first back reference \1 as the replacement string, effectively replacing only the text which matched outside the parentheses.
If your sed supports the -i option, you can use that to modify the file directly; but this is also not standard, so the most portable solution is to write the result to a new file, then move it back on top of the original file if you want to replace it.
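A minimal sketch of that portable write-then-move approach (the temporary file name is arbitrary):
sed '1s/^\(.....\)   /\1/' virt.txt > virt.txt.new && mv virt.txt.new virt.txt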
sed is convenient if you are familiar with it, but as you are clearly not, perhaps a better approach would be to use a different language, ideally one which, unlike sed, is not write-only for many users.
If you know the three characters will always be spaces, just do a static replacement.
awk 'NR==1 { $0 = substr($0, 1, 5) substr($0, 9) } 1' virt.txt
On the first line (NR is the current input line number) replace the input line $0 with a catenation of the substrings on both sides of the part you want to cut.
For a simple replacement like that, you can also use basic Unix text manipulation utilities, though it's rather inefficient and inelegant:
head -n 1 virt.txt | cut -c1-5,9- >newfile.txt
tail -n +2 virt.txt >>newfile.txt
If you need to check that the three characters are spaces, the Awk script only needs a minor tweak.
awk 'NR==1 && /^.{5} {3}/ { $0 = substr($0, 1, 5) substr($0, 9) } 1' virt.txt
You should vaguely recognize the regex from above. Awk is less succinct, but as a consequence also quite a lot more readable, than sed.

Using sed to obtain pattern range through multiple files in a directory

I was wondering if it was possible to use the sed command to find a range between 2 patterns (in this case, dates) and output these lines in the range to a new file.
Right now, I am just looking at one file and getting lines within my time range of the file FileMoverTransfer.log. However, after a certain time period, these logs are moved to new log files with a suffix such as FileMoverTransfer.log-20180404-xxxxxx.gz. Here is my current code:
sed -n '/^'$start_date'/,/^'$end_date'/p;/^'$end_date'/q' FileMoverTransfer.log >> /public/FileMoverRoot/logs/intervalFMT.log
I also tried the following, but it doesn't work, as sed doesn't seem to look through all of the files in the directory starting with FileMoverTransfer.log:
sed -n '/^'$start_date'/,/^'$end_date'/p;/^'$end_date'/q' FileMoverTransfer.log* >> /public/FileMoverRoot/logs/intervalFMT.log
Any help would be greatly appreciated. Thanks!
The range operator only operates within a single file, so you can't use it if the start is in one file and the end is in another file.
You can use cat to concatenate all the files, and pipe this to sed:
cat FileMoverTransfer.log* | sed -n "/^$start_date/,/^$end_date/p;/^$end_date/q" >> /public/FileMoverRoot/logs/intervalFMT.log
And instead of quoting and unquoting the sed command, you can use double quotes so that the variables will be expanded inside it. This will also prevent problems if the variables contain whitespace.
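For example, with hypothetical ISO-style dates at the start of each log line (the actual date format is not given in the question):
start_date=2018-04-01
end_date=2018-04-04
cat FileMoverTransfer.log* | sed -n "/^$start_date/,/^$end_date/p;/^$end_date/q" >> /public/FileMoverRoot/logs/intervalFMT.log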
awk solution
As the OP confirmed that an awk solution would be acceptable, I post it.
(gunzip -c FileMoverTransfer.log-*.gz; cat FileMoverTransfer.log ) \
|awk -v st="$start_date" -v en="$end_date" '$1>=st&&$1<=en{print;next}$1>en{exit}'\
>/public/FileMoverRoot/logs/intervalFMT.log
This solution is functionally almost identical to Barmar’s sed solution, with the difference that his solution, like the OP’s, will print and quit at the first record matching the end date, while mine will print all lines matching the end date and quit at the first record past the end date, without printing it.
Some remarks:
The OP didn't specify the date format. I suppose it is a format compatible with ordinary string order, otherwise some conversion function should be used.
The files FileMoverTransfer.log-*.gz must be named in such a way that their alphabetical ordering corresponds to the chronological order (which is probably the case.)
I suppose that the dates are separated from the rest of the line by whitespace. If they aren’t, you have to supply the -F option to awk. E.g., if the dates are separated by -, you must write awk -F- ...
awk is much faster than sed in this case, because awk simply looks for the separator (whitespace or whatever was supplied with -F) while sed performs a regexp match.
There is no concept of range in my code, only date comparison. The only place where I suppose that the lines are ordered is when I say $1>en{exit}, that is exit when a line is newer than the end date. If you remove that final pattern and its action, the code will run through the whole input, but you could drop the requirement that the files be ordered.

Prefix search names to output in bash

I have a simple egrep command searching for multiple strings in a text file which outputs either null or a value. Below is the command and the output.
cat Output.txt|egrep -i "abc|def|efg"|cut -d ':' -f 2
The output is:
xxx
(null)
yyy
Now, I am trying to prefix my search texts to the output, like below.
abc:xxx
def:
efg:yyy
Any help on the code to achieve this or where to start would be appreciated.
-Abhi
Since I do not know exactly your input file content (it is not specified properly in the question), I will make some hypotheses in order to answer your question.
Case 1: the patterns you are looking for are always located in the same column
If it is the case, the answer is quite straightforward:
$ cat grep_file.in
abc:xxx:uvw
def:::
efg:yyy:toto
xyz:lol:hey
$ egrep -i "abc|def|efg" grep_file.in | cut -d':' -f1,2
abc:xxx
def:
efg:yyy
After the grep, just use cut with the two columns that you are looking for (here, columns 1 and 2).
REMARK:
Do not cat the file and then pipe it into grep; grep can read the file itself, so the extra cat just adds another process and copies all the data through a pipe. It might not matter much on small files, but you will feel the difference on 10GB files, for example!
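Applied to the command from the question, that simply means letting grep open Output.txt itself:
egrep -i "abc|def|efg" Output.txt | cut -d ':' -f 2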
Case 2: the patterns you are looking for are NOT located in the same column
In this case it is a bit more tricky, but not impossible. There are many ways of doing, here I will detail the awk way:
$ cat grep_file2.in
abc:xxx:uvw
::def:
efg:yyy:toto
xyz:lol:hey
If your input file is in this format; with your pattern that could be located anywhere:
$ awk 'BEGIN{FS=":";ORS=FS}{tmp=0;for(i=1;i<=NF;i++){tmp=match($i,/abc|def|efg/);if(tmp){print $i;break}}if(tmp){printf "%s\n", $2}}' grep_file2.in
abc:xxx
def:
efg:yyy
Explanations:
FS=":";ORS=FS define your input/output field separator at : Then on each line you define a test variable that will become true when you reach your pattern, you loop on all the fields of the line until you reach it if it is the case you print it, break the loop and print the second field + an EOL char.
If you do not meet your pattern you do nothing.
If you prefer the sed way, you can use the following command:
$ sed -n '/abc\|def\|efg/{h;s/.*\(abc\|def\|efg\).*/\1:/;x;s/^[^:]*:\([^:]*\):.*/\1/;H;x;s/\n//p}' grep_file2.in
abc:xxx
def:
efg:yyy
Explanations:
/abc\|def\|efg/{} is used to filter the lines that contain one of the patterns provided; the instructions in the block are then executed on those lines. h;s/.*\(abc\|def\|efg\).*/\1:/; saves the line in the hold space and replaces the line with whichever of the 3 patterns matched, followed by :. x;s/^[^:]*:\([^:]*\):.*/\1/; exchanges the pattern and hold spaces and extracts the 2nd column element. Last but not least, H;x;s/\n//p regroups both extracted elements on 1 line and prints it.
Try this:
$ egrep -io "(abc|def|efg):[^:]*" file
It will print the match and the next token after the delimiter.
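Against the sample grep_file2.in shown above, that should print:
abc:xxx
def:
efg:yyy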
If we can assume that there are only two fields, that abc etc will always match in the first field, and that getting the last match on a line which contains multiple matches is acceptable, a very simple sed script could work.
sed -n 's/^[^:]*\(abc\|def\|efg\)[^:]*:\([^:]*\)/\1:\2/p' file
If other but similar conditions apply (e.g. there are three fields or more but we don't care about matches in the first two) the required modifications are trivial. If not, you really need to clarify your question.

How can I remove a doubled section of a string?

I'm having trouble with data manipulation in a txt file. My file currently looks like this:
HG02239 -23.42333333
NA06985NA06985 -20.125
NA06991NA06991 -20.92
This shows some of my tab-delimited data. Half the entries are in the correct seven-character (letterletternumbernumbernumbernumbernumber) format, but some are doubled up. I want to go into the second column (the first column is empty for a reason!) and remove the repeats in the string so it would read
HG02239 -23.42333333
NA06985 -20.125
NA06991 -20.92
I can't work out how to do this with sed/awk on a per column basis. I feel like I should be able to write a regex, but because the data is a repeat, I don't want to lose the first half of the string; and I can't work out how to cut on a specific column, or I would just delete the 7th character. Any help much appreciated!
Solution
You can solve this with a backreference. For example, using GNU sed:
$ cat << EOF | sed --regexp-extended 's/(.{7})\1/\1/'
HG02239 -23.42333333
NA06985NA06985 -20.125
NA06991NA06991 -20.92
EOF
HG02239 -23.42333333
NA06985 -20.125
NA06991 -20.92
If you aren't using GNU sed, you may need to escape the capture groups. In addition, you can tune the regular expression if you need a more accurate character match.
Explanation
The cat pipeline is just a here-document to make it easy to display and test the code. You can call sed directly on your file, or use the -i flag to perform an in-place edit when you're comfortable with the results.
The sed script does the following:
It stores any group of 7 consecutive characters in a capture group using an "interval expression" (the number in the curly braces).
The \1 is a backreference that matches the first capture group.
The match looks for "a capture group followed by a copy of the capture group."
The substitution replaces the match with a single copy of the capture group.
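As noted above, a sed without -E/--regexp-extended may need the group and the interval escaped; a rough BRE equivalent (untested sketch) would be:
sed 's/\(.\{7\}\)\1/\1/' file.txt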
One way, using awk:
awk '{ print substr($1, 1, 7), $2 }' file.txt
Output:
HG02239 -23.42333333
NA06985 -20.125
NA06991 -20.92
You could use something like this:
sed -i 's|\([A-Z]\{2\}[0-9]\{5\}\)[A-Z0-9]*\s*\(.*\)|\1 \2|g' <your-file>
