How to display the number of times a word is repeated after a common pattern - Linux

I have a file which has N number of lines, for example:
This/is/workshop/1
This/is/workshop/2
This/is/workshop/3
This/is/workshop/4
This/is/workshop/5
How can I get the result below using the uniq command:
This/is/workshop/ =5

Okay, so there are a couple of tools you can utilize here. Familiarize yourself with grep, cut, and uniq. My process for doing something like this may not be ideal, but given your original question I'll try to tailor it to the lines in the file you've given.
First you'll want to grep the file for the relevant strings. Then you can pass the output through cut, selecting the fields you want to keep by specifying the delimiter and the field numbers. Lastly, you can pipe this through uniq to count it.
Example:
Contents of file.txt
This/is/workshop/1
This/is/workshop/2
This/is/workshop/3
This/is/workshop/4
This/is/workshop/5
Use grep, cut and uniq
$ grep "This/is/workshop/" file.txt | cut -d/ -f1-3 | uniq -c
5 This/is/workshop
To specify the delimiter in cut, you use the -d flag followed by the delimiter you want. Each field is what exists between delimiters, numbered starting at 1. Here, we want the first three. Then just pipe it through uniq -c to get the count you are after.
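Note that uniq only counts adjacent duplicates, so this works here because the matching lines are already grouped together. If they were scattered throughout the file, you would sort before counting; a minimal variant, assuming the same file.txt:
$ grep "This/is/workshop/" file.txt | cut -d/ -f1-3 | sort | uniq -c
5 This/is/workshop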

Related

Difference between using the uniq command with sort or without it in Linux

When I use uniq -u data.txt it lists the whole file, and when I use sort data.txt | uniq -u it omits repeated lines. Why does this happen?
The uniq man page says that -u, --unique only prints unique lines. I don't understand why I need to use a pipe to get the correct output.
uniq removes adjacent duplicates. If you want to omit duplicates that are not adjacent, you'll have to sort the data first.
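A quick illustration, assuming data.txt contains the three lines a, b, a:
$ uniq -u data.txt
a
b
a
$ sort data.txt | uniq -u
b
Unsorted, no two duplicates sit next to each other, so every line looks unique to uniq; sorting first brings the two a lines together so they can be dropped.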

Recursively grep unique pattern in different files

Sorry, the title is not very clear.
Let's say I'm grepping recursively for URLs like this:
grep -ERo '(http|https)://[^/"]+' /folder
and in the folder there are several files containing the same URL. My goal is to output this URL only once. I tried piping the grep to | uniq or sort -u, but that doesn't help.
example result:
/www/tmpl/button.tpl.php:http://www.w3.org
/www/tmpl/header.tpl.php:http://www.w3.org
/www/tmpl/main.tpl.php:http://www.w3.org
/www/tmpl/master.tpl.php:http://www.w3.org
/www/tmpl/progress.tpl.php:http://www.w3.org
If you only want the address and never the file it was found in, grep has the option -h to suppress file names in the output; the list can then be piped to sort -u to make sure every address appears only once:
$ grep -hERo 'https?://[^/"]+' folder/ | sort -u
http://www.w3.org
If you don't want the https?:// part, you can use Perl regular expressions (-P instead of -E) with variable length look-behind (\K):
$ grep -hPRo 'https?://\K[^/"]+' folder/ | sort -u
www.w3.org
If the structure of the output is always:
/some/path/to/file.php:http://www.someurl.org
you can use the cut command:
cut -d ':' -f 2- should work. Basically, it cuts each line into fields separated by a delimiter (here ":") and you select the 2nd and following fields (-f 2-).
After that, you can filter out duplicates; as explained above, uniq only drops adjacent duplicates, so sort first (or use sort -u).
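Putting it together with the grep invocation from the question, a sketch (cut reassembles fields 2 onward with the delimiter, so the : inside http:// is preserved):
$ grep -ERo 'https?://[^/"]+' /folder | cut -d ':' -f 2- | sort -u
http://www.w3.org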
Pipe to Awk:
grep -ERo 'https?://[^/"]+' /folder |
awk -F: '!a[substr($0,length($1)+1)]++'
The basic Awk idiom !a[key]++ is true the first time we see key, and forever false after that. Extracting the URL (or a reasonable approximation) into the key requires a bit of additional trickery.
This prints the whole input line if the key is one we have not seen before, i.e. it will print the file name and the URL for the first occurrence of each URL from the grep output.
Doing the whole thing in Awk should not be too hard, either.
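For instance, here is a minimal sketch that pairs find (for the recursion) with awk's match() (for the extraction), reusing the same regular expression; note that if find splits a huge file list across several awk invocations, deduplication is only per invocation:
find /folder -type f -exec awk \
'match($0, /https?:\/\/[^\/"]+/) { u = substr($0, RSTART, RLENGTH); if (!seen[u]++) print FILENAME ":" u }' {} +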

Find files in a folder where a specific word appears on at least a specific number of lines

How can I find the files in a folder where a specific word appears on more than 3 lines? I tried using recursive grep to find that word and then -c to count the number of lines where it appears.
This command will recursively list the files in the current directory where word appears on more than 3 lines, along with the match count for each file:
grep -c -r 'word' . | grep -v -e ':[0123]$' | sort -n -t: -k2
The final sort is not necessary if you don't want the results sorted, but I'd say it's convenient.
The first command in the pipeline (grep -c -r 'word' .) recursively finds every file under the current directory that contains word, and counts the matching lines for each file. The intermediate grep discards every count that is 0, 1, 2 or 3, so you only get counts greater than 3 (-v in grep(1) inverts the sense of matching to select non-matching lines). The final sort step orders the list by the count for each file; it sets the field delimiter to : and instructs sort(1) to sort numerically using the 2nd field (the count) as the sort key.
Here's a sample output from some tests I ran:
./file1:4
./dir1/dir2/file3:5
./dir1/file2:8
If you just want the filenames without the match counts, you can use sed(1) to discard the :count portions:
grep -m 4 -c -r 'word' . | grep -v -e ':[0123]$' | sed -r 's/:[0-9]+$//'
As noted in the comments, if the match count is not important, we can optimize the first grep with -m 4, which makes it stop reading a file after 4 matching lines.
UPDATE
The solution above works well enough for small numbers, but it does not scale to arbitrary thresholds. If you want to filter on any number, you can use awk(1) instead (and it ends up being much cleaner), like so:
grep -c -r 'word' . | awk -F: '$2 > 10'
The -F: argument to awk(1) is necessary; it instructs awk(1) to separate fields by : rather than the default (whitespace and tab). This solution generalizes well to any number.
Again, if the match count doesn't matter and all you want is a list of the filenames, do this instead:
grep -c -r 'word' . | awk -F: '$2 > 10 { print $1 }'
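The -m optimization from above combines with this as well; a sketch using -m 11, since all we need to know is whether the count exceeds 10:
grep -m 11 -c -r 'word' . | awk -F: '$2 > 10 { print $1 }'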

How to make grep stop searching in each file after N lines?

It's best to describe the use case with a hypothetical example:
Searching for useful header info in a big collection of stored email (each email in a separate file), e.g. gathering stats on the top mail client apps used.
Normally with grep you can specify -m to stop at the first match, but what if an email does not contain X-Mailer or whatever it is we are looking for in a header? It will scan through the whole email. Since most headers are <50 lines, performance could be increased by telling grep to search only the first 50 lines of any file. I could not find a way to do that.
I don't know if it would be faster but you could do this with awk:
awk 'FNR > 50 {nextfile} /match me/ {print; nextfile}' *.mail
will print the first line matching match me if it appears in the first 50 lines of each file. (nextfile, available in GNU awk and most current implementations, moves on to the next input file; a plain exit would stop after the first file when awk is given several. If you wanted to print the filename as well, grep style, change print; to print FILENAME ":" $0;)
awk doesn't have any equivalent to grep's -r flag, but if you need to recursively scan directories, you can use find with -exec:
find /base/dir -iname '*.mail' \
-exec awk 'FNR > 50 {nextfile} /match me/ {print FILENAME ":" $0; nextfile}' {} +
You could solve this problem by piping head -n50 through grep but that would undoubtedly be slower since you'd have to start two new processes (one head and one grep) for each file. You could do it with just one head and one grep but then you'd lose the ability to stop matching a file as soon as you find the magic line, and it would be awkward to label the lines with the filename.
You can do something like this:
head -50 <mailfile> | grep <your keyword>
Try this command:
for i in *
do
head -n 50 "$i" | grep -H --label="$i" pattern
done
output:
1.txt: aaaaaaaa pattern aaaaaaaa
2.txt: bbbb pattern bbbbb
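If the mail files live in nested directories, the same per-file head | grep trick can be driven by find; a sketch, assuming GNU grep for --label:
find . -iname '*.mail' -exec sh -c 'head -n 50 "$1" | grep -H --label="$1" pattern' sh {} \;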
ls *.txt | xargs head -n <N> | grep 'your_string'

grep and sed command

I have a truckload of files with SQL commands in them, and I have been asked to extract all database table names from the files.
How can I use grep and sed to parse the files and create a list of the unique table names in a text file, one per line?
The names all seem to start with "db_", which is handy!
What would be the best way to use grep and sed together to pull the table names out?
This will search for lines containing the table names. The output of this will quickly reveal if a more selective search is needed:
grep "\<db_[a-zA-Z0-9_]*" *.sql
Once the proper search is sorted out, remove all other characters from lines with tablenames:
grep "\<db_[a-zA-Z0-9_]*" *.sql | sed 's/.*\(\<db_[a-zA-Z0-9_]*\).*/\1/'
Once that's running, add a sort and remove duplicates:
grep "\<db_[a-zA-Z0-9_]*" *.sql | sed 's/.*\(\<db_[a-zA-Z0-9_]*\).*/\1/' | sort | uniq
You just need grep:
grep -owE "db_[a-zA-Z0-9_]+" file | sort -u
Or awk:
awk '{for(i=1;i<=NF;i++)if($i~/^db_[a-zA-Z0-9]+/){print $i} }' file
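The awk variant prints every occurrence in order, so to match the unique, one-per-line output of the grep version, pipe it through sort -u as well:
awk '{for(i=1;i<=NF;i++)if($i~/^db_[a-zA-Z0-9]+/){print $i} }' file | sort -u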
