Find files from a folder when a specific word appears on at least a specific number of lines - linux

How can I find the files from a folder where a specific word appears on more than 3 lines? I tried using recursive grep for finding that word and then using -c to count the number of lines where the word appears.

This command will recursively list the files in the current directory where word appears on more than 3 lines, along with the match count for each file:
grep -c -r 'word' . | grep -v -e ':[0123]$' | sort -n -t: -k2
The final sort is not necessary if you don't want the results sorted, but I'd say it's convenient.
The first command in the pipeline (grep -c -r 'word' .) recursively finds every file in the current directory that contains word and reports the count of matching lines for each file (note that -c counts lines that match, not total occurrences, which is exactly what the question asks for). The intermediate grep discards every count that is 0, 1, 2 or 3, so only counts greater than 3 remain (this is because -v in grep(1) inverts the sense of matching to select non-matching lines). The final sort step orders the list by the count for each file; it sets the field delimiter to : and instructs sort(1) to sort numerically using the 2nd field (the count) as the sort key.
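Since the lines-versus-occurrences distinction matters here, a quick illustration with a made-up two-line file (demo.txt is hypothetical): -c counts matching lines, whereas -o piped to wc -l counts individual occurrences:
$ printf 'word word\nword\n' > demo.txt
$ grep -c 'word' demo.txt
2
$ grep -o 'word' demo.txt | wc -l
3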
Here's a sample output from some tests I ran:
./file1:4
./dir1/dir2/file3:5
./dir1/file2:8
If you just want the filenames without the match counts, you can use sed(1) to discard the :count portions:
grep -m 4 -c -r 'word' . | grep -v -e ':[0123]$' | sed -r 's/:[0-9]+$//'
As noted in the comments, if the exact match count is not important, we can optimize the first grep with -m 4, which stops reading each file after 4 matching lines (enough to establish that the count exceeds 3).
UPDATE
The solution above works fine for small thresholds, but it does not scale to larger ones: a bracket expression like :[0123]$ only rejects single-digit counts, so filtering on, say, more than 25 lines would need a much hairier regex. If you want to filter on an arbitrary number, you can use awk(1) (and in fact it ends up much cleaner), like so:
grep -c -r 'word' . | awk -F: '$2 > 10'
The -F: argument to awk(1) is necessary; it instructs awk(1) to separate fields by : rather than the default (runs of spaces and tabs). This solution generalizes to any threshold.
Again, if the match count doesn't matter and all you want is a list of the filenames, do this instead:
grep -c -r 'word' . | awk -F: '$2 > 10 { print $1 }'
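If you do this often, the word and threshold can be made parameters of a small shell function; here's a sketch (the name files_with_word_on_more_lines is made up for illustration):
files_with_word_on_more_lines() {
    local word=$1 min=$2
    # list files where $word appears on more than $min lines
    grep -c -r -- "$word" . | awk -F: -v min="$min" '$2 > min { print $1 }'
}
For example, files_with_word_on_more_lines word 3 reproduces the filename listing above. Note that a filename containing : would confuse the -F: field splitting.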

Related

Count specific pattern inside text

I have a huge file and I want to use a shell command to count the number of occurrences of the word 'new' in the file.
I tried to use wc and grep, but I only get the number of lines that contain the pattern.
From @Fravadona's suggestion:
grep -ow new file.txt | wc -l
-o means "print only the matches, one per line"
-w means "only match if it's a full word", avoiding matches on e.g. newOrder
wc -l counts the number of lines grep output
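To illustrate with a made-up file (the contents below are hypothetical), note how -w keeps the new inside newOrder and renewal from being counted:
$ printf 'new newOrder\nrenewal new\n' > file.txt
$ grep -o new file.txt | wc -l
4
$ grep -ow new file.txt | wc -l
2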

Loop through each column in a CSV file and exporting distinct values to a file

I have a CSV file with columns A-O. 500k rows. In Bash I would like to loop through each column, get distinct values and output them to a file:
sort -k1 -n -t, -o CROWN.csv CROWN.csv && cat CROWN.csv | cut -f1 -d , | uniq > EMPLOYEEID.csv
sort -k2 -n -t, -o CROWN.csv CROWN.csv && cat CROWN.csv | cut -f2 -d , | uniq > SORTNAME.csv
This works, but to me is very manual and not really scalable if there were like 100 columns.
The code sorts the file in place on the given column, then that column is extracted and passed to uniq to get the distinct values, which are written out.
NB: The first row has the header information.
The above code works, but I'm looking to streamline it somewhat.
Assuming headers can be used as file names for each column:
head -1 test.csv | \
tr "," "\n" | \
sed "s/ /_/g" | \
nl -ba -s$'\t' | \
while IFS=$'\t' read -r field name; do
    cut -f$field -d',' test.csv | \
        tail -n +2 | sort -u > "${name}.csv"
done
Explanation:
head - reads the first line
tr - replaces each , with a newline
sed - replaces whitespace with _ for cleaner file names (tr would work too, and you could then fold it into the previous step, but for more complex transforms use sed)
nl - prefixes each line with its number, i.e. the field number
-ba - number all lines
-s$'\t' - set the separator to tab (not strictly necessary, since it is the default, but kept for clarity's sake)
while - reads through the field number/name pairs
cut - selects the field
tail - removes the header line; not all tail implementations have this option, in which case you can replace it with sed
sort -u - sorts and removes duplicates
> "${name}.csv" - saves into the file named after the column
Note: this assumes that there are no , characters inside the fields; otherwise you will need to use a CSV parser.
Doing all the columns in a single pass is much more efficient than rescanning the entire input file for each column.
awk -F, 'NR==1 { ncols = split($0, cols, /,/); next }
    { for (i = 1; i <= ncols; ++i)
        if (!seen[i ":" $i]++)
            print $i >> (cols[i] ".csv") }' CROWN.csv
(The seen array is post-incremented so each value is printed only once per column, and the redirection target is parenthesized, which some awks require.)
If this is going to be part of a bigger task, maybe split the input file into several temporary files with fewer columns than the number of open file handles permitted on your system, rather than fix this script to handle an arbitrary number of columns.
You can inspect this limit with ulimit -n; on some systems you can increase it, either by tweaking the system configuration or, in the worst case, by recompiling the kernel. (Your question doesn't identify your platform, but this should be easy enough to google.)
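If you do hit the limit, a common workaround (at the cost of repeatedly reopening files) is to have awk close each output file after writing to it; a sketch of the same script with that change (GNU awk already multiplexes descriptors internally, so this mainly matters for other awk implementations):
awk -F, 'NR==1 { ncols = split($0, cols, /,/); next }
    { for (i = 1; i <= ncols; ++i)
        if (!seen[i ":" $i]++) {
            f = cols[i] ".csv"
            print $i >> f   # ">>" so earlier writes to f are preserved
            close(f)        # release the descriptor before the next value
        } }' CROWN.csv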
Addendum: I created a quick and dirty timing comparison of these answers at https://ideone.com/dnFj41; I encourage you to fork it and experiment with different shapes of input data. With an input file of 100 columns and (probably) no duplication in the columns -- but only a few hundred rows -- I got the following results:
0.001s Baseline test -- simply copy input file to an identical output file
0.242s tripleee -- this single-pass AWK script
0.561s Sorin -- multiple passes using simple shell script
2.154s Mihir -- multiple passes using AWK
Unfortunately, Carmen's answer could not be tested, because I did not have permissions to install Text::CSV_XS on Ideone.
An earlier version of this answer contained a Python attempt, but I was too lazy to finish debugging it. It's still there in the edit history if you are curious.

How to display number of times a word repeated after a common pattern

I have a file which has N lines, for example:
This/is/workshop/1
This/is/workshop/2
This/is/workshop/3
This/is/workshop/4
This/is/workshop/5
How can I get the result below using the uniq command:
This/is/workshop/ =5
Okay, so there are a couple of tools you can use here. Familiarize yourself with grep, cut, and uniq. My process for doing something like this may not be ideal, but given your original question I'll tailor it to the lines in the file you've given.
First, grep the file for the relevant strings. Then pass the result through to cut, selecting the fields you want by specifying the delimiter and the field numbers. Lastly, pipe it through to uniq to count it.
Example:
Contents of file.txt
This/is/workshop/1
This/is/workshop/2
This/is/workshop/3
This/is/workshop/4
This/is/workshop/5
Use grep, cut and uniq
$ grep "This/is/workshop/" file.txt | cut -d/ -f1-3 | uniq -c
5 This/is/workshop
To specify the delimiter in cut, you use the -d flag with the delimiter of your choice. Each field is the text between delimiters, numbered starting at 1; for this we want the first three (-f1-3). Then just pipe it through to uniq -c to get the count you are after.
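If the output must be in exactly the requested format (This/is/workshop/ =5, with the trailing slash and equals sign), here is a short awk sketch that builds the key and counts in a single pass:
awk -F/ '{ cnt[$1 "/" $2 "/" $3 "/"]++ }
    END { for (k in cnt) print k " =" cnt[k] }' file.txt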

Recursively grep unique pattern in different files

Sorry, the title is not very clear.
So let's say I'm grepping recursively for urls like this:
grep -ERo '(http|https)://[^/"]+' /folder
and in folder there are several files containing the same url. My goal is to output this url only once. I tried piping the grep to | uniq or sort -u but that doesn't help.
example result:
/www/tmpl/button.tpl.php:http://www.w3.org
/www/tmpl/header.tpl.php:http://www.w3.org
/www/tmpl/main.tpl.php:http://www.w3.org
/www/tmpl/master.tpl.php:http://www.w3.org
/www/tmpl/progress.tpl.php:http://www.w3.org
If you only want the address and not the file it was found in, grep has the -h option to suppress filename output; the list can then be piped to sort -u to make sure every address appears only once:
$ grep -hERo 'https?://[^/"]+' folder/ | sort -u
http://www.w3.org
If you don't want the https?:// part, you can use Perl regular expressions (-P instead of -E) with \K, which discards everything matched so far from the output (in effect a variable-length look-behind):
$ grep -hPRo 'https?://\K[^/"]+' folder/ | sort -u
www.w3.org
If the structure of the output is always:
/some/path/to/file.php:http://www.someurl.org
you can use the cut command:
cut -d ':' -f 2- should work. Basically, it cuts each line into fields separated by a delimiter (here ":") and selects the 2nd and following fields (-f 2-), rejoining them with the same delimiter so the :// inside the URL survives.
After that, sort the result and use uniq (or simply sort -u) to filter out the duplicates; plain uniq alone only collapses adjacent identical lines.
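Put together, that suggestion might look like this:
grep -ERo 'https?://[^/"]+' /folder | cut -d ':' -f 2- | sort -u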
Pipe to Awk:
grep -ERo 'https?://[^/"]+' /folder |
awk -F: '!a[substr($0, length($1) + 1)]++'
The basic Awk idiom !a[key]++ is true the first time we see key, and forever false after that. Extracting the URL (or a reasonable approximation) into the key requires a bit of additional trickery: with -F: the first field is the filename, so substr($0, length($1) + 1) is everything from the first colon onward, i.e. the :URL part.
This prints the whole input line if the key is one we have not seen before, i.e. it will print the file name and the URL for the first occurrence of each URL from the grep output.
Doing the whole thing in Awk should not be too hard, either.
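For instance, here is a sketch of an all-Awk version driven by find (illustrative only; note that -exec ... {} + may split very large file lists across several awk invocations, in which case deduplication is only per batch):
find /folder -type f -exec awk '{
    while (match($0, /https?:\/\/[^\/"]+/)) {
        url = substr($0, RSTART, RLENGTH)
        if (!seen[url]++) print url        # print each URL only once
        $0 = substr($0, RSTART + RLENGTH)  # continue after this match
    }
}' {} +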

How to grep within a grep

I have a bunch of massive text files, about 100MB each.
I want to grep to find entries that have 'INDIANA JONES' in it:
$ grep -ir 'INDIANA JONES' ./
Then, I would like to find the entries where there is the word PORTUGAL within 5,000 characters of the INDIANA JONES term. How would I do this?
# in pseudocode
grep -ir 'INDIANA JONES' ./ | grep 'PORTUGAL' within 5000 char
Use grep's -o flag to output up to 5000 characters on either side of the match, then search those characters for the second string. Note the {0,5000} bounds: a bare {5000} would require exactly 5000 characters on each side and fail near the start or end of a line (grep matches within a single line, so this assumes the entries are long lines). For example:
grep -ioE ".{0,5000}INDIANA JONES.{0,5000}" file.txt | grep "PORTUGAL"
If you need the original lines, put the -n flag on the first grep instead (there the line numbers refer to file.txt; on the second grep they would count lines of the pipe) and pipe into:
cut -f1 -d: > line_numbers.txt
then you could use awk to print those lines:
awk 'FNR==NR { a[$0]; next } FNR in a' line_numbers.txt file.txt
To avoid the temporary file, this could be written like:
awk 'FNR==NR { a[$0]; next } FNR in a' <(grep -nioE ".{0,5000}INDIANA JONES.{0,5000}" file.txt | grep "PORTUGAL" | cut -f1 -d:) file.txt
For multiple files, use find with a read loop (a for loop over $(find ...) would break on filenames containing spaces):
find . -type f | while IFS= read -r i; do
    awk 'FNR==NR { a[$0]; next } FNR in a' <(grep -nioE ".{0,5000}INDIANA JONES.{0,5000}" "$i" | grep "PORTUGAL" | cut -f1 -d:) "$i"
done
One way to deal with this is with gawk. You could set the record separator to either INDIANA JONES or PORTUGAL and then perform a length check on the record (after stripping newlines, assuming newlines do not count towards the limit of 5000). You may have to resort to find to run this recursively within a directory; see the sketch after the program.
gawk -v RS='INDIANA JONES|PORTUGAL' '
    { a = $0; gsub("\n", "", a) }
    ((RT ~ /IND/ && prevRT ~ /POR/) || (RT ~ /POR/ && prevRT ~ /IND/)) && length(a) < 5000 { found = 1 }
    { prevRT = RT }
    END { if (found) print FILENAME }' file.txt
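As noted, find can drive this recursively; a sketch (proximity.awk is a hypothetical file holding the program above, minus the -v RS=... part):
find /folder -type f -exec gawk -v RS='INDIANA JONES|PORTUGAL' -f proximity.awk {} \;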
Consider installing ack-grep.
sudo apt-get install ack-grep
ack-grep is a more powerful version of grep.
There's no trivial solution to your question (that I can think of) outside of a full batch script, but you can use the -A and -B flags on ack-grep to specify a number of trailing or leading lines to output, respectively.
This may not be a number of chars, but it is a step further in that direction.
While this may not be a solution, it might give you some idea as to how to do this. Look up filters like ack, awk, sed, etc. and see if you can find one with a flag for this kind of behaviour.
The ack-grep manual:
http://manpages.ubuntu.com/manpages/hardy/man1/ack-grep.1p.html
EDIT:
I think the sad news is that what you might think you're looking for is something like:
grep "\(INDIANA JONES\).\{1,5000\}PORTUGAL" filename
The problem is that, even on a small file, this query takes a hopeless amount of time. I did get it to work with a smaller distance; at 5000 it's a size problem.
For such a large set of files, you'll need to do this in more than one step.
A Solution:
The only solution I know of is the leading and trailing output from ack-grep.
Step 1: how long are your lines?
If you knew how many lines out you had to go (and you can estimate or calculate this a few ways), you'd be able to grep the output of the first grep. Depending on what's in your file, you should be able to get a decent upper bound on how many lines add up to 5000 chars (if a line averages 100 chars, 50+ lines should cover you; if it averages 10 chars, you'll need 500+).
You've got to determine the maximum number of lines that could amount to 5000 chars. You can guess or pick a generous figure if you like, but that's up to you. It's your data.
With that, call: (if you needed 100 lines for 5000 chars)
ack-grep -ira "PORTUGAL" -A 100 -B 100 filename
and
ack-grep -ira "INDIANA JONES" -A 100 -B 100 filename
replace the 100s with what you need.
Step 2: parse the output
you'll need to take the context blocks that ack-grep returns and parse them, searching again for the other term within these sub-ranges.
Look for INDIANA JONES in the first PORTUGAL ack-grep match output, and look for PORTUGAL in the second set of matches.
This should take a bit more work, likely involving a bash script (I might see if I can get one working this week), but it solves your massive-data problem, by breaking it down into more manageable chunks.
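A rough sketch of that second step under the assumptions above (100 context lines as the guessed window; it simply searches each term's context blocks for the other term):
ack-grep -ia "INDIANA JONES" -A 100 -B 100 filename | grep -i "PORTUGAL"
ack-grep -ia "PORTUGAL" -A 100 -B 100 filename | grep -i "INDIANA JONES"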
grep 'INDIANA JONES' . -iRl | while IFS= read -r filename; do head -c 5000 "$filename" | grep -n PORTUGAL -H --label="$filename"; done
This works as follows:
grep 'INDIANA JONES' . -iRl - search all files in or below the current directory, case-insensitively (-i), and only print the names of the files that match (-l), not their content.
| while IFS= read -r filename; do ...; done - for each line of input, store it in the variable $filename and execute the pipeline.
Now, for each file that matched 'INDIANA JONES', we do:
head -c 5000 "$filename" - extract the first 5000 characters (note this inspects the start of the file, not a 5000-character window around each match).
grep -n PORTUGAL -H --label="$filename" - search for PORTUGAL, printing line numbers (-n) and a filename prefix (-H); since grep is reading from a pipe, --label="$filename" tells it which name to display.
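If you need the window around each match instead of the start of the file, a sketch reusing the earlier windowed grep (the same single-line caveat applies):
grep 'INDIANA JONES' . -iRl | while IFS= read -r filename; do
    # print the filename if PORTUGAL occurs within 5000 chars of a match
    grep -ioE ".{0,5000}INDIANA JONES.{0,5000}" "$filename" | grep -iq "PORTUGAL" && echo "$filename"
done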
