Iterate over grep output in multi-line chunks - linux

I can't figure out how to iterate by match from grep output when using -A or -B flags (returning multiple lines for a single match). I'm not opposed to using awk for this, but this will be used to check multiple large files so the shorter execution time, the better.
I figured I could achieve this by using --group-separator flag of grep and setting IFS to the same character, but this doesn't seem to be the case.
$ grep --group-separator=# -B2 'locked' thisisatest | while IFS=# read -r match; do
echo "a"
echo "${match}"
done
a
this is a test file asdflksdklsdklsdkl.txt
a
this is meaningless
a
this is locked
a
a
this is a test file basdflksdklsdklsdkl.txt
a
this is meaningless
a
this is locked
$ cat thisisatest
this is a test file asdflksdklsdklsdkl.txt
this is meaningless
this is locked
j
j
j
j
j
this is a test file basdflksdklsdklsdkl.txt
this is meaningless
this is locked
I expect "a" to be printed before and after the entirety of a match such as:
a
this is a test file asdflksdklsdklsdkl.txt
this is meaningless
this is locked
a

Changing $IFS controls how read splits a line into fields. You're only having it fill out one field at a time ($match), so the field delimiter doesn't matter. Change the line delimiter instead: use read -d.
grep ... | while read -d# -r match; do

Related

Removing content existing in another file in bash [duplicate]

This question already has answers here:
Print lines in one file matching patterns in another file
(5 answers)
Closed 4 years ago.
I am attempting to clean one file1.txt that contains always the same lines using file2.txt that contains a list of IP addresses I want to remove.
The working script I have written I believe can be enhanced somehow to be faster in execution.
My script:
#!/bin/bash
IFS=$'\n'
for i in $(cat file1.txt); do
for j in $(cat file2); do
echo ${i} | grep -v ${j}
done
done
I have tested the script with the following data set:
Amount of lines in file1.txt = 10,000
Amount of lines in file2.txt = 3
Scrit execution time:
real 0m31.236s
user 0m0.820s
sys 0m6.816s
file1.txt content:
I3fSgGYBCBKtvxTb9EMz,1.1.2.3,45,This IP belongs to office space,1539760501,https://myoffice.com
I3fSgGYBCBKtvxTb9EMz,1.2.2.3,45,This IP belongs to office space,1539760502,https://myoffice.com
I3fSgGYBCBKtvxTb9EMz,1.3.2.3,45,This IP belongs to office space,1539760503,https://myoffice.com
I3fSgGYBCBKtvxTb9EMz,1.4.2.3,45,This IP belongs to office space,1539760504,https://myoffice.com
I3fSgGYBCBKtvxTb9EMz,1.5.2.3,45,This IP belongs to office space,1539760505,https://myoffice.com
... lots of other lines in the same format
I3fSgGYBCBKtvxTb9EMz,4.1.2.3,45,This IP belongs to office space,1539760501,https://myoffice.com
file2.txt content:
1.1.2.3
1.2.2.3
... lots of other IPs here
1.2.3.9
How can I improve those timings?
I am confident that the files will grow over time. In my case I will run the script every hour from cron, therefore I would like to improve here.
You want to get rid of all lines in file1.txt that contains substrings which match file2.txt. grep to the rescue
grep -vFwf file2.txt file1.txt
The -w is need to avoid that 11.11.11.11 matches 111.11.11.111
-F, --fixed-strings, --fixed-regexp Interpret PATTERN as a list of fixed strings, separated by newlines, any of which is to be matched. (-F is specified by POSIX, --fixed-regexp is an obsoleted alias, please do not use it in new scripts.)
-f FILE, --file=FILE Obtain patterns from FILE, one per line. The empty file contains zero patterns and therefore matches nothing. (-f is specified by POSIX.)
-w, --word-regexp Select only those lines containing matches that form whole words. The test is that the matching substring must either be at the beginning of the line or preceded by a non-word constituent character. Similarly, it must be either at the end of the line or followed by a non-word constituent character. Word-constituent characters are letters, digits, and the underscore.
source: man grep
On a further note, here are a couple of pointers for your script:
Don't use for loops to read files (http://mywiki.wooledge.org/DontReadLinesWithFor).
Don't use cat (See How can I read a file (data stream, variable) line-by-line (and/or field-by-field)?)
Use quotes! (See Bash and Quotes)
This allows us to rewrite it as:
#!/bin/bash
while IFS=$'\n' read -r i; do
while IFS=$'\n' read -r j; do
echo "$i" | grep -v "$j"
done < file2
done < file1
Now the problem is that you read file2 N times. Where N is the number of lines of file1. This is not really efficient. And luckily grep has the solution for us (see top).

Bash: Read in file, edit line, output to new file

I am new to linux and new to scripting. I am working in a linux environment using bash. I need to do the following things:
1. read a txt file line by line
2. delete the first line
3. remove the middle part of each line after the first
4. copy the changes to a new txt file
Each line after the first has three sections, the first always ends in .pdf and the third always begins with R0 but the middle section has no consistency.
Example of 2 lines in the file:
R01234567_High Transcript_01234567.pdf High School Transcript R01234567
R01891023_Application_01891023127.pdf Application R01891023
Here is what I have so far. I'm just reading the file, printing it to screen and copying it to another file.
#! /bin/bash
cd /usr/local/bin;
#echo "list of files:";
#ls;
for index in *.txt;
do echo "file: ${index}";
echo "reading..."
exec<${index}
value=0
while read line
do
#value='expr ${value} +1';
echo ${line};
done
echo "read done for ${index}";
cp ${index} /usr/local/bin/test2;
echo "file ${index} moved to test2";
done
So my question is, how can I delete the middle bit of each line, after .pdf but before the R0...?
Using sed:
sed 's/^\(.*\.pdf\).*\(R0.*\)$/\1 \2/g' file.txt
This will remove everything between .pdf and R0 and replace it with single space.
Result for your example:
R01234567_High Transcript_01234567.pdf R01234567
R01891023_Application_01891023127.pdf R01891023
The Hard, Unreliable Way
It's a bit verbose, and much less terse and efficient than what would make sense if we knew that the fields were separated by tab literals, but the following loop does this processing in pure native bash with no external tools:
shopt -s extglob
while IFS= read -r line; do
[[ $line = *".pdf"*R0* ]] || continue # ignore lines that don't fit our format
filename=${line%%.pdf*}.pdf
id=R0${line##*R0}
printf '%s\t%s\n' "$filename" "$id"
done
${line%%.pdf*} returns everything before the first .pdf in the line; ${line%%.pdf*}.pdf then appends .pdf to that content.
Similarly, ${line##*R0} expands to everything after the last R0; R0${line##*R0} thus expands to the final field starting with R0 (presuming that that's the only instance of that string in the file).
The Easy Way (Using Tab Delimiters)
If cat -t file (on MacOS) or cat -A file (on Linux) shows ^I sequences between the fields (but not within the fields), use the following instead:
while IFS=$'\t' read -r filename title id; do
printf '%s\t%s\n' "$filename" "$id"
done
This reads the three tab separated fields into variables named filename, title and id, and emits the filename and id fields.
Updated answer assuming tab delim
Since there is a tab delimiter, then this is a cinch for awk. Borrowing from my originally deleted answer and #geek1011 deleted answer:
awk -F"\t" '{print $1, $NF}' infile.txt
Here awk splits each record in your file by tab, then prints the first field $1 and the last field $NF where NF is the built in awk variable for the record's Number of Fields; by prepending a dollar sign, it says "The value of the last field in the record".
Original answer assuming space delimiter
Leaving this here in case someone has space delimited nonsense like I originally assumed.
You can use awk instead of using bash to read through the file:
awk 'NR>1{for(i=1; $i!~/pdf/; ++i) firstRec=firstRec" "$i} NR>1{print firstRec,$i,$NF}' yourfile.txt
awk reads files line by line and processes each record it comes across. Fields are delimited automatically by white space. The first field is $1, the second is $2 and so on. awk has built in variables; here we use NF which is the Number of Fields contained in the record, and NR which is the record number currently being processed.
This script does the following:
If the record number is greater than 1 (not the header) then
Loop through each field (separated by white space here) until we find a field that has "pdf" in it ($i!~/pdf/). Store everything we find up until that field in a variable called firstRec separated by a space (firstRec=firstRec" "$i).
print out the firstRec, then print out whatever field we stopped iterating on (the one that contains "pdf") which is $i, and finally print out the last field in the record, which is $NF (print firstRec,$i,$NF)
You can direct this to another file:
awk 'NR>1{for(i=1; $i!~/pdf/; ++i) firstRec=firstRec" "$i} NR>1{print firstRec,$i,$NF}' yourfile.txt > outfile.txt
sed may be a cleaner way of going here since, if your pdf file has more than one space separating characters, then you will lose the multiple spaces.
You can use sed on each line like that:
line="R01234567_High Transcript_01234567.pdf High School Transcript R01234567"
echo "$line" | sed 's/\.pdf.*R0/\.pdf R0/'
# output
R01234567_High Transcript_01234567.pdf R01234567
This replace anything between .pdf and R0 with a spacebar.
It doesn't deal with some edge cases but it simple and clear

Delete lines from a file matching first 2 fields from a second file in shell script

Suppose I have setA.txt:
a|b|0.1
c|d|0.2
b|a|0.3
and I also have setB.txt:
c|d|200
a|b|100
Now I want to delete from setA.txt lines that have the same first 2 fields with setB.txt, so the output should be:
b|a|0.3
I tried:
comm -23 <(sort setA.txt) <(sort setB.txt)
But the equality is defined for whole line, so it won't work. How can I do this?
$ awk -F\| 'FNR==NR{seen[$1,$2]=1;next;} !seen[$1,$2]' setB.txt setA.txt
b|a|0.3
This reads through setB.txt just once, extracts the needed information from it, and then reads through setA.txt while deciding which lines to print.
How it works
-F\|
This sets the field separator to a vertical bar, |.
FNR==NR{seen[$1,$2]=1;next;}
FNR is the number of lines read so far from the current file and NR is the total number of lines read. Thus, when FNR==NR, we are reading the first file, setB.txt. If so, set the value of associative array seen to true, 1, for the key consisting of fields one and two. Lastly, skip the rest of the commands and start over on the next line.
!seen[$1,$2]
If we get to this command, we are working on the second file, setA.txt. Since ! means negation, the condition is true if seen[$1,$2] is false which means that this combination of fields one and two was not in setB.txt. If so, then the default action is performed which is to print the line.
This should work:
sed -n 's#\(^[^|]*|[^|]*\)|.*#/^\1/d#p' setB.txt |sed -f- setA.txt
How this works:
sed -n 's#\(^[^|]*|[^|]*\)|.*#/^\1/d#p'
generates an output:
/^c|d/d
/^a|b/d
which is then used as a sed script for the next sed after the pipe and outputs:
b|a|0.3
(IFS=$'|'; cat setA.txt | while read x y z; do grep -q -P "\Q$x|$y|\E" setB.txt || echo "$x|$y|$z"; done; )
explanation: grep -q means only test if grep can find the regexp, but do not output, -P means use Perl syntax, so that the | is matched as is because the \Q..\E struct.
IFS=$'|' will make bash to use | instead of the spaces (SPC, TAB, etc.) as token separator.

Count the number of occurrences in a string. Linux

Okay so what I am trying to figure out is how do I count the number of periods in a string and then cut everything up to that point but minus 2. Meaning like this:
string="aaa.bbb.ccc.ddd.google.com"
number_of_periods="5"
number_of_periods=`expr $number_of_periods-2`
string=`echo $string | cut -d"." -f$number_of_periods`
echo $string
result: "aaa.bbb.ccc.ddd"
The way that I was thinking of doing it was sending the string to a text file and then just greping for the number of times like this:
grep -c "." infile
The reason I don't want to do that is because I want to avoid creating another text file for I do not have permission to do so. It would also be simpler for the code I am trying to build right now.
EDIT
I don't think I made it clear but I want to make finding the number of periods more dynamic because the address I will be looking at will change as the script moves forward.
If you don't need to count the dots, but just remove the penultimate dot and everything afterwards, you can use Bash's built-in string manuipulation.
${string%substring}
Deletes shortest match of $substring from back of $string.
Example:
$ string="aaa.bbb.ccc.ddd.google.com"
$ echo ${string%.*.*}
aaa.bbb.ccc.ddd
Nice and simple and no need for sed, awk or cut!
What about this:
echo "aaa.bbb.ccc.ddd.google.com"|awk 'BEGIN{FS=OFS="."}{NF=NF-2}1'
(further shortened by helpful comment from #steve)
gives:
aaa.bbb.ccc.ddd
The awk command:
awk 'BEGIN{FS=OFS="."}{NF=NF-2}1'
works by separating the input line into fields (FS) by ., then joining them as output (OFS) with ., but the number of fields (NF) has been reduced by 2. The final 1 in the command is responsible for the print.
This will reduce a given input line by eliminating the last two period separated items.
This approach is "shell-agnostic" :)
Perhaps this will help:
#!/bin/sh
input="aaa.bbb.ccc.ddd.google.com"
number_of_fields=$(echo $input | tr "." "\n" | wc -l)
interesting_fields=$(($number_of_fields-2))
echo $input | cut -d. -f-${interesting_fields}
grep -o "\." <<<"aaa.bbb.ccc.ddd.google.com" | wc -l
5

Egrep acts strange with -f option

I've got a strangely acting egrep -f.
Example:
$ egrep -f ~/tmp/tmpgrep2 orig_20_L_A_20090228.txt | wc -l
3
$ for lines in `cat ~/tmp/tmpgrep2` ; do egrep $lines orig_20_L_A_20090228.txt ; done | wc -l
12
Could someone give me a hint what could be the problem?
No, the files did not changed between executions. The expected answer for the egrep line count is 12.
UPDATE on file contents: the searched file contains cca 13000 lines, each of them are 500 char long, the pattern file contains 12 lines, each of them are 24 char long. The pattern always (and only) occurs on a fixed position in the seached file (26-49).
UPDATE on pattern contents: every pattern from tmpgrep2 are a 24 char long number.
If the search patterns are found on the same lines, then you can get the result you see:
Suppose you look for:
abc
def
ghi
jkl
and the data file is:
abcdefghijklmnoprstuvwxzy
then the one-time command will print 1 and the loop will print 4.
Could it be that the lines read contain something that the shell is expanding/substituting for you, in the second version? Then that doesn't get done by grep when it reads the patterns itself, thus leading to a different sent of patterns being matched.
I'm not totally sure if the shell is doing any expansion on the variable value in an invocation like that, but it's an idea at least.
EDIT: Nope, it doesn't seem to do any substitutions. But it could be quoting issue, if your patterns contain whitespace the for loop will step through each token, not through each line. Take a look at the read bash builtin.
Do you have any duplicates in ~/tmp/tmpgrep2? Egrep will only use the dupes one time, but your loop will use each occurrence.
Get rid of dupes by doing something like this:
$ for lines in `sort < ~/tmp/tmpgrep2 | uniq` ; do egrep $lines orig_20_L_A_20090228.txt ; done | wc -l
I second #unwind.
Why don't you run without wc -l and see what each search is finding?
And maybe:
for lines in `cat ~/tmp/tmpgrep2` ; do echo $lines ; done
Just to see now the shell is handling $lines?
The others have already come up with most of the things I would look at. The next thing I would check is the environment variable GREP_OPTIONS, or whatever it is called on your machine. I've gotten the strangest error messages or behaviors when using a command line argument that interfered with the environment settings.

Resources