Recursively grep unique pattern in different files - linux

Sorry, the title is not very clear.
So let's say I'm grepping recursively for urls like this:
grep -ERo '(http|https)://[^/"]+' /folder
and several files in the folder contain the same URL. My goal is to output each URL only once. I tried piping the grep output to uniq or sort -u, but that doesn't help.
example result:
/www/tmpl/button.tpl.php:http://www.w3.org
/www/tmpl/header.tpl.php:http://www.w3.org
/www/tmpl/main.tpl.php:http://www.w3.org
/www/tmpl/master.tpl.php:http://www.w3.org
/www/tmpl/progress.tpl.php:http://www.w3.org

If you only want the address and never the file where it was found in, there is a grep option -h to suppress file output; the list can then be piped to sort -u to make sure every address appears only once:
$ grep -hERo 'https?://[^/"]+' folder/ | sort -u
http://www.w3.org
If you don't want the https?:// part, you can use Perl-compatible regular expressions (-P instead of -E) with \K, which acts like a variable-length look-behind by dropping everything matched so far from the output:
$ grep -hPRo 'https?://\K[^/"]+' folder/ | sort -u
www.w3.org
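A minimal sketch of the first pipeline on throwaway data, in case it helps to see it end to end. The temporary directory, the file names, and the example.com URL are made up for illustration, and GNU grep is assumed for -R and -o.

```shell
# Hypothetical sample data: two files that mention the same host.
dir=$(mktemp -d)
printf '<a href="http://www.w3.org/1999/xhtml">x</a>\n' > "$dir/button.tpl.php"
printf '<a href="http://www.w3.org/TR/">y</a>\n<a href="https://example.com/a">z</a>\n' > "$dir/header.tpl.php"

# -h drops the file-name prefix, so identical URLs become identical lines
# that sort -u can collapse into one.
urls=$(grep -hERo 'https?://[^/"]+' "$dir" | sort -u)
printf '%s\n' "$urls"

rm -rf "$dir"
```

With the prefix gone, the two www.w3.org hits collapse into a single line next to the example.com one.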

If the structure of the output is always:
/some/path/to/file.php:http://www.someurl.org
you can use the command cut :
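cut -d ':' -f 2- should work. It splits each line into fields separated by a delimiter (here ":") and keeps the second and all following fields (-f 2-), so the colon inside http:// is preserved.
After that, use sort -u to drop the duplicates (uniq on its own only removes adjacent duplicate lines).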
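As a rough sketch, the cut approach can be tried on lines shaped like the grep output. The paths and the example.org URL below are invented sample data.

```shell
# Sample lines in the "file:url" shape that grep -R prints (made-up paths).
grep_output='/www/tmpl/button.tpl.php:http://www.w3.org
/www/tmpl/header.tpl.php:http://www.w3.org
/www/tmpl/main.tpl.php:http://example.org'

# -f 2- keeps everything after the first colon, so "http://" stays intact;
# sort -u then removes duplicates wherever they occur in the list.
unique_urls=$(printf '%s\n' "$grep_output" | cut -d ':' -f 2- | sort -u)
printf '%s\n' "$unique_urls"
```

This relies on the file names themselves containing no colon.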

Pipe to Awk:
grep -ERo 'https?://[^/"]+' /folder |
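awk -F: '!a[substr($0,length($1)+2)]++'
The basic Awk idiom !a[key]++ is true the first time we see key, and forever false after that. Extracting the URL (or a reasonable approximation) into the key requires a bit of additional trickery: the URL contains colons itself, so instead of using $2 we take everything after the file name and the colon that follows it, substr($0,length($1)+2).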
This prints the whole input line if the key is one we have not seen before, i.e. it will print the file name and the URL for the first occurrence of each URL from the grep output.
Doing the whole thing in Awk should not be too hard, either.
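One possible shape for an all-Awk version, offered as a sketch rather than the answerer's intended solution. It assumes find is used for the recursion, and the sample files are made up. match() with RSTART and RLENGTH is standard Awk.

```shell
# Hypothetical sample tree.
dir=$(mktemp -d)
printf 'see http://www.w3.org/a and http://www.w3.org/b\n' > "$dir/one.txt"
printf 'also "https://example.com/x"\n' > "$dir/two.txt"

unique=$(find "$dir" -type f -exec awk '
  {
    line = $0
    # Walk through every URL on the line, not just the first one.
    while (match(line, "https?://[^/\"]+")) {
      url = substr(line, RSTART, RLENGTH)
      if (!seen[url]++) print url    # print only the first sighting
      line = substr(line, RSTART + RLENGTH)
    }
  }' {} +)
printf '%s\n' "$unique"

rm -rf "$dir"
```

The seen array only lives for one awk run, so if find has to split a very long file list across several invocations, a final sort -u would still be needed.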

Related

How to find a substring from some text in a file and store it in a bash variable?

I have a file named config.txt which has following data:
ABC_PATH=xxx/xxx
IMAGE=docker.name.net:3000/apache:1.8.109.1
NAMESPACE=xxx
Now I am running a shell script in which I want to store 1.8.109.1 (this value may differ, rest will remain same) in a variable, maybe using sed, awk or any other linux tool.
How can I achieve that?
The following will work.
ver="$(cat config.txt | grep apache: | cut -d: -f3)"
grep apache: will find the line that has the text 'apache:' in it.
-d specifies the field delimiter. In this case : is set as the delimiter.
-f selects fields by number (counting from 1) after the line has been split on :
Thus, -f3 selects the third field, which here is the version string.
The version info is now captured in the variable $ver
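A sketch of an alternative that avoids counting colons, assuming the version is always whatever follows the last colon on the IMAGE line. The temporary file stands in for config.txt.

```shell
# Recreate the sample config from the question in a temporary file.
config=$(mktemp)
cat > "$config" <<'EOF'
ABC_PATH=xxx/xxx
IMAGE=docker.name.net:3000/apache:1.8.109.1
NAMESPACE=xxx
EOF

# Take the IMAGE line, then strip everything up to and including the last colon.
image_line=$(grep '^IMAGE=' "$config")
ver=${image_line##*:}
printf '%s\n' "$ver"   # prints 1.8.109.1

rm -f "$config"
```

The ${var##*:} expansion is plain POSIX shell, so no extra process is needed for the cut step.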
I think this should work:
cat config.txt | grep apache: | cut -d: -f3

How to display number of times a word repeated after a common pattern

I have a file which has N lines.
For example
This/is/workshop/1
This/is/workshop/2
This/is/workshop/3
This/is/workshop/4
This/is/workshop/5
How to get the below result using uniq command:
This/is/workshop/ =5
Okay so there are a couple tools you can utilize here. Familiarize yourself with grep, cut, and uniq. My process for doing something like this may not be ideal, but given your original question I'll try to tailor the process to the lines in the file you've given.
First you'll want to grep the file for the relevant strings. Then you can pass it through to cut, declaring the fields you want to include by specifying the delimiter and also the number of fields. Lastly, you can pipe this through to uniq to count it.
Example:
Contents of file.txt
This/is/workshop/1
This/is/workshop/2
This/is/workshop/3
This/is/workshop/4
This/is/workshop/5
Use grep, cut and uniq
$ grep "This/is/workshop/" file.txt | cut -d/ -f1-3 | uniq -c
5 This/is/workshop
To specify the delimiter in cut, you use the -d flag and the delimiter you want to use. Each field is what exists between delimiters, starting at 1. For this, we want the first three. Then just pipe it through to uniq to get the count you are after.
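To get closer to the exact "This/is/workshop/ =5" layout asked for, one option is to post-process the uniq -c output with awk. This is a sketch that assumes the paths contain no whitespace; the variable names are illustrative.

```shell
input='This/is/workshop/1
This/is/workshop/2
This/is/workshop/3
This/is/workshop/4
This/is/workshop/5'

# Strip the part after the last slash, count identical prefixes,
# then swap the columns into "prefix =count".
summary=$(printf '%s\n' "$input" \
  | sed 's|[^/]*$||' \
  | sort \
  | uniq -c \
  | awk '{ print $2 " =" $1 }')
printf '%s\n' "$summary"   # prints This/is/workshop/ =5
```

The sort before uniq -c matters once the file holds more than one prefix, because uniq only counts adjacent lines.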

How to replace Pipe with a new line in Linux?

Please, accept my apologies, if this question was asked before. I am new and do not know how to do it. I have a file containing the data like this:
name=1|surname=2|phone=3|email=4
phone=5|surname=6|name=7|email=8
surname=9|phone=10|email=11|name=12
phone=13|email=14|name=15|surname=6
I would like to have a file like this:
name=1
name=7
name=12
name=15
Thanks in advance!
Say names.txt is your file, then use something like :
cat names.txt | tr "|" "\n" | grep "^name="
tr transforms | to newlines
grep filters for the lines with name
And here is a one command solution with GNU awk:
awk -v RS="[|\n]" '/^name=/' names.txt
the -v RS="[|\n]" sets the record separator to | or newline
the /^name=/ filters for records starting with name= (and implicitly prints them)
I would go for the solution of @Lars, but I wanted to test this with "lookbehind".
With grep you can get the matches only with grep -o, but the following line will also find surname:
grep -o "name=[0-9]*" names.txt
You can fix this a little by looking for the character before name (start of line with ^ or |). The alternation needs extended regular expressions, hence -E:
grep -Eo "(^|\|)name=[0-9]*" names.txt
What a fix! Now you get the right names, but sometimes with an extra |.
With \K (and grep option -P) you can tell grep to use something for the matching but skip it during output.
grep -oP "(^|\|)\Kname=[0-9]*" names.txt
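The three variants can be compared side by side on two of the sample lines. This sketch assumes GNU grep built with PCRE support for -P; the variable names are illustrative.

```shell
sample='name=1|surname=2|phone=3|email=4
phone=5|surname=6|name=7|email=8'

# Plain -o also picks up the tail of "surname=...".
plain=$(printf '%s\n' "$sample" | grep -o 'name=[0-9]*')

# Anchoring on start-of-line or "|" fixes that but keeps the "|" in the output.
anchored=$(printf '%s\n' "$sample" | grep -Eo '(^|\|)name=[0-9]*')

# \K discards what was matched before it, so only name=... is printed.
with_k=$(printf '%s\n' "$sample" | grep -oP '(^|\|)\Kname=[0-9]*')

printf 'plain:\n%s\nanchored:\n%s\nwith \\K:\n%s\n' "$plain" "$anchored" "$with_k"
```

Only the last variant yields name=1 and name=7 without stray characters.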

How to make grep to stop searching in each file after N lines?

It's best to describe the use by a hypothetical example:
Searching for some useful header info in a big collection of email storage (each email in a separate file). e.g. doing stats of top mail client apps used.
Normally you can pass -m 1 to grep to stop at the first match, but what if an email does not contain X-Mailer or whatever header we are looking for? grep will scan through the whole email. Since most headers are under 50 lines, performance could be improved by telling grep to search only the first 50 lines of each file. I could not find a way to do that.
I don't know if it would be faster but you could do this with awk:
awk 'FNR>50{nextfile} /match me/{print; nextfile}' *.mail
will print, for each file, the first line matching match me if it appears in the first 50 lines. nextfile (supported by GNU awk and most current awk implementations) skips to the next input file; a plain exit would stop awk entirely after the first file. (If you wanted to print the filename as well, grep style, change print; to print FILENAME ":" $0;)
awk doesn't have any equivalent to grep's -r flag, but if you need to recursively scan directories, you can use find with -exec:
find /base/dir -iname '*.mail' \
-exec awk 'FNR>50{nextfile} /match me/{print FILENAME ":" $0; nextfile}' {} +
You could solve this problem by piping head -n50 through grep but that would undoubtedly be slower since you'd have to start two new processes (one head and one grep) for each file. You could do it with just one head and one grep but then you'd lose the ability to stop matching a file as soon as you find the magic line, and it would be awkward to label the lines with the filename.
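awk's exit ends the whole run rather than just the current file, so a portable way to apply the cap per file is one awk call per file (awks that support nextfile can avoid the extra processes). A sketch on made-up mail files, where the header names and contents are illustrative only:

```shell
dir=$(mktemp -d)
# First file: the header appears within the first 50 lines.
printf 'From: a@example.com\nX-Mailer: DemoMail 1.0\n\nbody\n' > "$dir/a.mail"
# Second file: the only X-Mailer text sits on line 60, past the cap.
{ i=1; while [ "$i" -lt 60 ]; do echo "filler $i"; i=$((i + 1)); done
  echo 'X-Mailer: TooLate 2.0'; } > "$dir/b.mail"

# One awk per file: give up after line 50, stop at the first hit.
hits=$(for f in "$dir"/*.mail; do
  awk 'FNR > 50 { exit } /^X-Mailer:/ { print FILENAME ":" $0; exit }' "$f"
done)
printf '%s\n' "$hits"

rm -rf "$dir"
```

Only a.mail is reported; b.mail is abandoned at line 51 without reading the rest.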
You can do something like this:
head -n 50 <mailfile> | grep <your keyword>
Try this command:
for i in *
do
head -n 50 "$i" | grep -H --label="$i" pattern
done
output:
1.txt: aaaaaaaa pattern aaaaaaaa
2.txt: bbbb pattern bbbbb
ls *.txt | xargs head -<N lines>| grep 'your_string'

How to extract distinct part of a string from a file in linux

I'm using the following command to extract distinct urls that contain .com extension and may contain .us or whatever country extension.
grep '\.com' source.txt -m 700 | uniq | sed -e 's/www.//' > dest.txt
The problem is that it extracts several URLs from the same domain, which I don't want. Example:
abc.yahoo.com
efg.yahoo.com
I only need yahoo.com. How can I extract distinct domain names only, using grep or any other command?
Maybe something like this?
egrep -io '[a-z0-9\-]+\.[a-z]{2,3}(\.[a-z]{2})?' source.txt
Have you tried using awk instead of sed, specifying "." as the delimiter and printing only the last two fields?
awk -F "." '{ print $(NF-1)"."$NF }'
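A small sketch of that awk idea with deduplication added, assuming the input holds one host name per line. The host list is sample data.

```shell
hosts='abc.yahoo.com
efg.yahoo.com
www.example.com
example.com'

# Keep only the last two dot-separated labels, then drop repeats.
domains=$(printf '%s\n' "$hosts" \
  | awk -F '.' 'NF >= 2 { print $(NF-1) "." $NF }' \
  | sort -u)
printf '%s\n' "$domains"
```

Taking the last two labels is only an approximation: a host such as abc.yahoo.co.uk would be reduced to co.uk, so country-code suffixes need extra handling.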
Perhaps something like this should help:
egrep -o '[^.]*\.com' file

Resources