grep a large list against a large file - linux

I am currently trying to grep a large list of IDs (~5000) against an even larger CSV file (3,000,000 lines).
I want all the CSV lines that contain an ID from the ID file.
My naive approach was:
cat the_ids.txt | while read line
do
cat huge.csv | grep $line >> output_file
done
But this takes forever!
Are there more efficient approaches to this problem?

Try
grep -f the_ids.txt huge.csv
Additionally, since your patterns seem to be fixed strings, supplying the -F option might speed up grep.
-F, --fixed-strings
Interpret PATTERN as a list of fixed strings, separated by
newlines, any of which is to be matched. (-F is specified by
POSIX.)
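Putting the two together (assuming the ids really are plain strings rather than regexes), the call might look like:
grep -F -f the_ids.txt huge.csv > output_file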

Use grep -f for this:
grep -f the_ids.txt huge.csv > output_file
From man grep:
-f FILE, --file=FILE
Obtain patterns from FILE, one per line. The empty file contains zero
patterns, and therefore matches nothing. (-f is specified by POSIX.)
If you provide some sample input maybe we can even improve the grep condition a little more.
Test
$ cat ids
11
23
55
$ cat huge.csv
hello this is 11 but
nothing else here
and here 23
bye
$ grep -f ids huge.csv
hello this is 11 but
and here 23

grep -f filter.txt data.txt gets unruly when filter.txt is larger than a couple of thousand lines and hence isn't the best choice for such a situation. Even while using grep -f, we need to keep a few things in mind (a combined invocation is sketched after this list):
use the -x option if there is a need to match the entire line in the second file
use -F if the first file has strings, not patterns
use -w to prevent partial matches while not using the -x option
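For example, one possible combination with fixed strings and whole-word matching (swap -w for -x if you need entire-line matches) would be:
grep -Fwf filter.txt data.txt > matching.txt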
This post has a great discussion on this topic (grep -f on large files):
Fastest way to find lines of a file from another larger file in Bash
And this post talks about grep -vf:
grep -vf too slow with large files
In summary, the best way to handle grep -f on large files is:
Matching entire line:
awk 'FNR==NR {hash[$0]; next} $0 in hash' filter.txt data.txt > matching.txt
Matching a particular field in the second file (using ',' delimiter and field 2 in this example):
awk -F, 'FNR==NR {hash[$1]; next} $2 in hash' filter.txt data.txt > matching.txt
and for grep -vf:
Matching entire line:
awk 'FNR==NR {hash[$0]; next} !($0 in hash)' filter.txt data.txt > not_matching.txt
Matching a particular field in the second file (using ',' delimiter and field 2 in this example):
awk -F, 'FNR==NR {hash[$0]; next} !($2 in hash)' filter.txt data.txt > not_matching.txt
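Applied back to the original question, and assuming the id is the first comma-separated field of huge.csv (an assumption; adjust the field number to your layout), the field-matching variant would be:
awk -F, 'FNR==NR {hash[$0]; next} $1 in hash' the_ids.txt huge.csv > output_file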

You may get a significant search speedup with ugrep to match the strings in the_ids.txt in your large huge.csv file:
ugrep -F -f the_ids.txt huge.csv
This works with GNU grep too, but I expect ugrep to run several times faster.


How to clean the output and print the desired information with less CPU usage

I have a 20GB log file that contains lots of fields; field (column) number 2 contains numbers. I use the command below to print only column 2:
zcat /path to file location/$date*/logfile_*.dat.zip | awk '/Read:ROP/' | nawk -F "=" '{print $2}'
The result of this command is:
"93711994166", Key
Since I want only the number, I append the below command to my original command to clean the output:
| awk -F, '{print $1}' | sed 's/"//g'
The result is:
93711994166
My final purpose is to print only numbers having a length other than 11 digits, therefore I append the following to my final command:
-vE '^.{11}$'
So my final command is:
zcat /path to file location/$date*/logfile_*.dat.zip | awk '/Read:ROP/' | nawk -F "=" '{print $2}' | awk -F, '{print $1}' | sed 's/"//g' | grep -vE '^.{11}$' >/tmp/$file
This command takes a long time to execute and also causes high CPU usage. I want to achieve the following:
print all numbers with a length not equal to 11 digits
print all numbers that do not start with 93 (regardless of their length)
a clean, effective command that is not CPU or memory costly
I have another requirement, which is to also print the numbers that do not start with 93.
Note:
the log file contains lots of different lines, but I use awk '/Read:ROP/' to work on the output below and extract the numbers
Read:ROP (CustomerId="93700001865", Key=1, ActiveEndDate=2025-01-19 20:12:22, FirstCallDate=2018-01-08 12:30:30, IsFirstCallPassed=true, IsLocked=false, LTH={Data=["1|MOC|07.07.2020 09:18:58|48000.0|119||OnPeakAccountID|480|19250||", "1|RECHARGE|04.07.2020 10:18:32|-4500.0|0|0", "1|RECHARGE|04.07.2020 10:18:59|-4500.0|0|0"], Index=0}, LanguageID=2, LastKnownPeriod="Active", LastRechargeAmount=4500, LastRechargeDate=2020-07-04 10:18:59, VoucherRchFraudCounter=0, c_BlockPAYG=true, s_PackageKeyCounter=13, s_OfferId="xyz", OnPeakAccountID_FU={Balance=18850});
20GB log file [...] zcat
Using zcat on 20GB log files is quite expensive. Check top when running your command line above.
It might be worth keeping the data from the first filtering step:
zcat /path to file location/$date*/logfile_*.dat.zip | awk '/Read:ROP/' > filter_data.out
and work with the filtered data. I assume here that this awk step can remove the majority of the data.
Bonus points: This step can be parallelized by running the zcat [...] |awk [...] pipe file-by-file, and you only need to do this once for each file.
The other steps don't look particularly expensive unless there are a lot of data lines left even after filtering.
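For what it's worth, the extraction and both numeric filters can be folded into a single awk pass, which avoids several extra processes per pipeline. A rough sketch, assuming the CustomerId is always the first quoted value on each Read:ROP line (as in the sample above):
zcat /path to file location/$date*/logfile_*.dat.zip | awk -F'"' '/Read:ROP/ {n=$2; if (length(n) != 11 || n !~ /^93/) print n}' > /tmp/$file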
sed '/.*Read:ROP.*([^=]*="\([^"]*\)".*/!d; s//\1/'
/.../ - match regex
.*Read:ROP.* - match Read:ROP followed by anything with anything in front, ie. awk '/Read:ROP/'
([^=]*=" - match a (, followed by anything except =, then a =, then a ", ie. nawk -F "=" '{print $2}'
\([^"]*\) - match everythjing inside qoutes. I guess [0-9] would be fine also
".* - delete rest of line
! - if the line doesn't match the regex
d - remove the line
s - substitute
// - reuse the regex in /.../
\1 - substitute for first backreference, ie. for \([^"]*\)
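If the numeric requirements from the question also need to be applied, one way (a sketch) is to drop every value that is an 11-digit number starting with 93, since those are the only ones to exclude:
zcat /path to file location/$date*/logfile_*.dat.zip | sed '/.*Read:ROP.*([^=]*="\([^"]*\)".*/!d; s//\1/' | grep -vE '^93[0-9]{9}$' > /tmp/$file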

bash: awk print within print

I need to grep some pattern, and then I need to print some output within that. Currently I am using the command below, which works fine, but I would like to eliminate the multiple pipes and use a single awk command to achieve the same output. Is there a way to do it using awk?
root#Server1 # cat file
Jenny:Mon,Tue,Wed:Morning
David:Thu,Fri,Sat:Evening
root#Server1 # awk '/Jenny/ {print $0}' file | awk -F ":" '{ print $2 }' | awk -F "," '{ print $1 }'
Mon
I want to get this output using a single awk command. Any help?
You can try something like:
awk -F: '/Jenny/ {split($2,a,","); print a[1]}' file
Try this
awk -F'[:,]+' '/Jenny/{print $2}' file.txt
It uses multiple field-separator characters inside the [ ].
The + means one or more, since the -F value is treated as a regex.
For this particular job, I find grep to be slightly more robust.
Unless your company has a policy not to hire people named Eve.
(Try it out if you don't understand.)
grep -oP '^[^:]*Jenny[^:]*:\K[^,:]+' file
Or to do a whole-word match:
grep -oP '^[^:]*\bJenny\b[^:]*:\K[^,:]+' file
Or when you are confident that "Jenny" is the full name:
grep -oP '^Jenny:\K[^,:]+' file
Output:
Mon
Explanation:
The stuff up until \K speaks for itself: it selects the line(s) with the desired name.
[^,:]+ captures the day of week (in this case Mon).
\K cuts off everything preceding Mon.
-o cuts off anything following Mon.
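A comparable awk one-liner that is equally strict, assuming the name is the entire first field, would be:
awk -F'[:,]' '$1 == "Jenny" {print $2}' file
Like the last grep above, this only matches when the whole first field is Jenny, so a name hiding inside another column (or inside "Evening") cannot trigger a false match.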

How to remove lines contained in file 1 from file 2 if in file 2 they are prefixed?

I have the following situation:
source.txt
ID1:email1#domain1.com
ID2:email2#domain2.com
ID3:email3#domain3.com
...
IDs are numeric strings, e.g. 1234, 23412, 897... (one or more digits).
exclude.txt
emailX#domainX.com
emailY#domainY.com
emailZ#domainZ.com
...
i.e. only emails, no IDs.
I want to remove all lines from source.txt which contain emails listed in exclude.txt, preserving the ID:email pairs for the lines which are not removed.
How can I do that with linux command line tools (or simple bash script if needed)?
You can do it easily with awk:
awk -F":" 'NR==FNR{a[$1];next}(!($2 in a))' exclude.txt source.txt
Alternative with grep:
grep -v -F -f exclude.txt source.txt
Use the grep variant with care: without -F it does regex matching, and even with -F the patterns can match anywhere in the line, so you might also need to add the -w option (word matching) to avoid partial matches.
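With some made-up sample data (hypothetical values, just for illustration), the awk version behaves like this:
$ cat source.txt
1234:alice#domain1.com
5678:bob#domain2.com
$ cat exclude.txt
bob#domain2.com
$ awk -F":" 'NR==FNR{a[$1];next}(!($2 in a))' exclude.txt source.txt
1234:alice#domain1.com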

zcat file not working for gzip file

I have a .gz file which I need to merge and do other manipulations with (without decompressing it to disk), but I am having trouble just using zcat or gzip -dc with awk, for example when I pass the output to less -S like this:
awk '{print $1}' <(gzip -dc file.gz) | less -S
I get the incorrect column printed. When I use just less -S to view the file, only the last few columns are printed. So I thought it was a problem with the delimiter, but I have tried importing some lines in R (the file is too big to import whole), and it seems to be space-delimited, since all the columns show up when I do this:
x=read.table("file.gz", header=T, nrows=100)
But how do I read the lines correctly to use this file with zcat?
Thank you so much for your help!
If you want the whole line to be printed, try $0.
awk '{print $0}' <(gzip -dc file.gz) | less -S
If you want specific columns to be printed, use -F to specify the field separator. For example, if you want the first field of ':'-separated fields from each line (like in /etc/passwd), try this command.
awk -F':' '{print $1}' <(gzip -dc passwd.gz) |less -S
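If the delimiter is still in doubt, one quick sanity check is to count the whitespace-separated fields in the first few lines:
zcat file.gz | head -n 5 | awk '{print NF}'
If every line reports the same field count, awk's default whitespace splitting is probably fine and $1, $2, ... will refer to the columns you expect.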

How to grep within a grep

I have a bunch of massive text files, about 100MB each.
I want to grep to find entries that have 'INDIANA JONES' in it:
$ grep -ir 'INDIANA JONES' ./
Then, I would like to find the entries where there is the word PORTUGAL within 5,000 characters of the INDIANA JONES term. How would I do this?
# in pseudocode
grep -ir 'INDIANA JONES' ./ | grep 'PORTUGAL' within 5000 char
Use grep's -o flag to output up to 5000 characters surrounding the match, then search those characters for the second string. For example:
grep -ioE ".{0,5000}INDIANA JONES.{0,5000}" file.txt | grep "PORTUGAL"
If you need the original match, add the -n flag to the second grep and pipe into:
cut -f1 -d: > line_numbers.txt
then you could use awk to print those lines:
awk 'FNR==NR { a[$0]; next } FNR in a' line_numbers.txt file.txt
To avoid the temporary file, this could be written like:
awk 'FNR==NR { a[$0]; next } FNR in a' <(grep -ioE ".{0,5000}INDIANA JONES.{0,5000}" file.txt | grep -n "PORTUGAL" | cut -f1 -d:) file.txt
For multiple files, use find and a bash loop:
find . -type f | while read -r i; do
awk 'FNR==NR { a[$0]; next } FNR in a' <(grep -ioE ".{0,5000}INDIANA JONES.{0,5000}" "$i" | grep -n "PORTUGAL" | cut -f1 -d:) "$i"
done
One way to deal with this is with gawk. You could set the record separator to either INDIANA JONES or PORTUGAL and then perform a length check on the record (after stripping newlines, assuming newlines do not count towards the limit of 5000). You may have to resort to find to run this recursively within a directory:
awk -v RS='INDIANA JONES|PORTUGAL' '{a = $0;
gsub("\n", "", a)};
((RT ~ /IND/ && prevRT ~/POR/) || (RT ~ /POR/ && prevRT ~/IND/)) && length(a) < 5000{found=1};
{prevRT=RT};
END{if (found) print FILENAME}' file.txt
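To run it over a directory tree, one option (assuming the awk program above, the part between the single quotes, is saved to a file named proximity.awk here purely for the sake of the sketch) is:
find . -type f -exec gawk -f proximity.awk {} \;
Running one gawk per file also keeps the found flag and FILENAME scoped to a single file.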
Consider installing ack-grep.
sudo apt-get install ack-grep
ack-grep is a more powerful version of grep.
There's no trivial solution to your question (that I can think of) outside of a full batch script, but you can use the -A and -B flags on ack-grep to specify a number of trailing or leading lines to output, respectively.
This may not be a number of chars, but is a step further in that direction.
While this may not be a solution, it might give you some idea as to how to do this. Lookup filters like ack, awk, sed, etc. and see if you can find one with a flag for this kind of behaviour.
The ack-grep manual:
http://manpages.ubuntu.com/manpages/hardy/man1/ack-grep.1p.html
EDIT:
I think the sad news is that what you might think you're looking for is something like:
grep "\(INDIANA JONES\).\{1,5000\}PORTUGAL" filename
The problem is that, even on a small file, this query takes an impossibly long time.
I got it to work with a smaller number; it's a size problem.
For such a large set of files, you'll need to do this in more than one step.
A Solution:
The only solution I know of is the leading and trailing output from ack-grep.
Step 1: how long are your lines?
If you knew how many lines out you had to go (and you can estimate or calculate this a few ways), you could grep the output of the first grep. You need an upper bound on how many lines can hold 5000 characters: if a line averages 100 characters, 50+ lines should cover you; if it averages 10, you'll need 500+. You can guess or pick a high range if you like, but that's up to you and your data.
With that (say you needed 100 lines to cover 5000 characters), call:
ack-grep -ira "PORTUGAL" -A 100 -B 100 filename
and
ack-grep -ira "INDIANA JONES" -A 100 -B 100 filename
replace the 100s with what you need.
Step 2: parse the output
You'll need to take the matches that ack-grep returns and search within these sub-ranges for the other term.
Look for INDIANA JONES in the output of the first (PORTUGAL) ack-grep, and look for PORTUGAL in the second set of matches.
This should take a bit more work, likely involving a bash script (I might see if I can get one working this week), but it solves your massive-data problem, by breaking it down into more manageable chunks.
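A minimal version of step 2, assuming 100 lines is enough to cover 5000 characters, could be as simple as piping one ack-grep's context window into a plain grep for the other term:
ack-grep -ira "PORTUGAL" -A 100 -B 100 filename | grep -i "INDIANA JONES"
Any output means both terms occur within roughly that many lines of each other; swap the two terms to check the other direction.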
grep 'INDIANA JONES' . -iR -l | while read filename; do head -c 5000 "$filename" | grep -n PORTUGAL -H --label="$filename" ; done
This works as follows:
grep 'INDIANA JONES' . -iR -l. Search recursively (-R) in all files in or below the current directory, case-insensitively (-i), and only print the names of the files that match (-l), not any content.
| while read filename; do ...|...|...; done for each line of input, store it in variable $filename and execute the pipeline.
Now, for each file that matched 'INDIANA JONES', we do
head -c 5000 "$filename" - extract the first 5000 characters
grep ... - search for PORTUGAL. Print the file name (-H); because the input comes from a pipe, grep would report "(standard input)", so we set the name to show with --label="$filename". Print line numbers too (-n).
