How to extract distinct part of a string from a file in Linux

I'm using the following command to extract distinct URLs that contain the .com extension and may also contain a country extension such as .us:
grep '\.com' source.txt -m 700 | uniq | sed -e 's/www.//' > dest.txt
The problem is that it extracts URLs in the same domain, which I don't want. For example:
abc.yahoo.com
efg.yahoo.com
I only need yahoo.com. How can I use grep, or any other command, to extract distinct domain names only?

Maybe something like this?
egrep -io '[a-z0-9\-]+\.[a-z]{2,3}(\.[a-z]{2})?' source.txt

Have you tried using awk instead of sed, specifying "." as the delimiter and printing only the last two fields?
awk -F "." '{ print $(NF-1)"."$NF }'
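As a quick sketch on hypothetical sample data from the question, piped through sort -u to keep one line per distinct domain (note this would also reduce a .co.uk address to just co.uk):

```shell
# Hypothetical sample of the URLs from the question; awk keeps only the
# last two dot-separated fields, sort -u leaves one line per domain.
result=$(printf '%s\n' 'abc.yahoo.com' 'efg.yahoo.com' \
  | awk -F "." '{ print $(NF-1)"."$NF }' | sort -u)
echo "$result"
```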

Perhaps something like this should help:
egrep -o '[^.]*\.com' file
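A quick check against the question's sample input (hypothetical data); since [^.]* cannot cross a dot, the match starts right after the subdomain:

```shell
# Hypothetical input; grep -o prints only the matched part of each line.
result=$(printf '%s\n' 'abc.yahoo.com' 'efg.yahoo.com' \
  | grep -Eo '[^.]*\.com' | sort -u)
echo "$result"
```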

Related

Use regex in grep while using two files

I know that you can use regex in grep and use patterns from a file to search another file. But can you combine these two options?
For example, from the file the patterns come from (with the -f option to read patterns from a file), I only want to use the first column to search the second file.
I tried this:
grep -E '^(*)\b' -f file_1 file_2 > file_3
To grep the first column from file_1 with the * wildcard, but it is not working. Any ideas?
Grep doesn't use wildcards for patterns; it uses regular expressions, so (*) makes little sense.
If you want to extract the first column from a file, use cut -f1 or awk '{print $1}' (or sed, or perl, or whatever to extract it), then pipe the result to grep, using the special filename - (i.e. standard input) as the pattern file:
cut -f1 file1 | grep -f- file_2 > file_3
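For example, with two small hypothetical files (file_1 is tab-separated, and its first column holds the patterns):

```shell
# file_1: pattern<TAB>extra-column; file_2: the text to search.
printf 'foo\t1\nbar\t2\n' > file_1
printf 'foo appears here\nbaz does not\n' > file_2
# cut's default delimiter is TAB; "-" makes grep read its patterns from stdin.
result=$(cut -f1 file_1 | grep -f- file_2)
echo "$result"
```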

Recursively grep unique pattern in different files

Sorry, the title is not very clear.
So let's say I'm grepping recursively for URLs like this:
grep -ERo '(http|https)://[^/"]+' /folder
and in the folder there are several files containing the same URL. My goal is to output this URL only once. I tried piping the grep output to | uniq or sort -u, but that doesn't help.
example result:
/www/tmpl/button.tpl.php:http://www.w3.org
/www/tmpl/header.tpl.php:http://www.w3.org
/www/tmpl/main.tpl.php:http://www.w3.org
/www/tmpl/master.tpl.php:http://www.w3.org
/www/tmpl/progress.tpl.php:http://www.w3.org
If you only want the address and never the file it was found in, there is a grep option -h to suppress filename output; the list can then be piped to sort -u to make sure every address appears only once:
$ grep -hERo 'https?://[^/"]+' folder/ | sort -u
http://www.w3.org
If you don't want the https?:// part, you can use Perl regular expressions (-P instead of -E) with variable length look-behind (\K):
$ grep -hPRo 'https?://\K[^/"]+' folder/ | sort -u
www.w3.org
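A runnable sketch of the same pipeline with hypothetical files (this needs GNU grep, since -P enables Perl-compatible regular expressions):

```shell
# Two hypothetical template files containing the same host.
dir=$(mktemp -d)
echo 'see "http://www.w3.org/TR" for details' > "$dir/a.php"
echo 'styles at "http://www.w3.org/1999/xhtml"' > "$dir/b.php"
# -h hides filenames, \K drops the scheme from the match, sort -u de-duplicates.
result=$(grep -hPRo 'https?://\K[^/"]+' "$dir" | sort -u)
echo "$result"
```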
If the structure of the output is always:
/some/path/to/file.php:http://www.someurl.org
you can use the cut command:
cut -d ':' -f 2- should work. It cuts each line into fields separated by a delimiter (here ":") and selects the 2nd and following fields (-f 2-), so the colon inside "://" is preserved.
After that, you can use sort -u to filter out duplicates (plain uniq only removes adjacent duplicates).
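For example, on hypothetical grep output, using sort -u so that identical URLs coming from different (non-adjacent) files are still collapsed:

```shell
# Hypothetical file:URL lines as produced by the recursive grep.
result=$(printf '%s\n' '/www/a.php:http://www.w3.org' '/www/b.php:http://www.w3.org' \
  | cut -d ':' -f 2- | sort -u)
echo "$result"
```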
Pipe to Awk:
grep -ERo 'https?://[^/"]+' /folder |
awk -F: '!a[substr($0, length($1) + 2)]++'
The basic Awk idiom !a[key]++ is true the first time we see key, and forever false after that. Extracting the URL into the key takes a bit of additional trickery: substr skips the first length($1)+1 characters, i.e. the filename and the colon.
This prints the whole input line if the key is one we have not seen before, i.e. it will print the file name and the URL for the first occurrence of each URL from the grep output.
Doing the whole thing in Awk should not be too hard, either.
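The idiom in isolation, on hypothetical input lines (the key is everything after the filename and the colon):

```shell
# Prints a line only the first time its URL part has been seen.
result=$(printf '%s\n' 'a.php:http://x.org' 'b.php:http://x.org' 'c.php:http://y.org' \
  | awk -F: '!seen[substr($0, length($1) + 2)]++')
echo "$result"
```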

How to replace Pipe with a new line in Linux?

Please accept my apologies if this question has been asked before. I am new and do not know how to search for it. I have a file containing data like this:
name=1|surname=2|phone=3|email=4
phone=5|surname=6|name=7|email=8
surname=9|phone=10|email=11|name=12
phone=13|email=14|name=15|surname=6
I would like to have a file like this:
name=1
name=7
name=12
name=15
Thanks in advance!
Say names.txt is your file, then use something like:
cat names.txt | tr "|" "\n" | grep "^name="
tr transforms | to newlines
grep filters for the lines with name
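Put together on the question's data, supplied inline instead of via a file:

```shell
# Each |-separated field becomes its own line; grep keeps the name= ones.
result=$(printf '%s\n' 'name=1|surname=2|phone=3|email=4' 'phone=5|surname=6|name=7|email=8' \
  | tr '|' '\n' | grep '^name=')
echo "$result"
```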
And here is a one command solution with GNU awk:
awk -v RS="[|\n]" '/^name=/' names.txt
the -v RS="[|\n]" sets the record separator to | or newline
the /^name=/ filters for records starting with name= (and implicitly prints them)
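The same data run through the awk version (a regex record separator is a GNU awk feature, though mawk accepts it too):

```shell
# Every field between "|" (or newline) is its own record; /^name=/ prints matches.
result=$(printf '%s\n' 'name=1|surname=2|phone=3|email=4' 'phone=5|surname=6|name=7|email=8' \
  | awk -v RS='[|\n]' '/^name=/')
echo "$result"
```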
I would go for the solution of @Lars, but I wanted to test this with "lookbehind".
With grep you can get the matches only with grep -o, but the following line will also find surname:
grep -o "name=[0-9]*" names.txt
You can fix this a little by looking for the character before name (start of line with ^ or |).
grep -o "(^|\|)name=[0-9]*" names.txt
Not quite a fix: now you get the right names, but sometimes with an extra |.
With \K (and grep option -P) you can tell grep to use something for the matching but skip it during output.
grep -oP "(^|\|)\Kname=[0-9]*" names.txt
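Checking the look-behind variant on the same hypothetical lines (GNU grep's -P is required for \K):

```shell
# (^|\|) must match, but \K excludes it from what -o prints;
# "surname=" is skipped because "name=" there is preceded by "r".
result=$(printf '%s\n' 'name=1|surname=2|phone=3|email=4' 'phone=13|email=14|name=15|surname=6' \
  | grep -oP '(^|\|)\Kname=[0-9]*')
echo "$result"
```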

manipulate directories using cut or awk?

I am processing some directories in Linux and trying to manipulate the file names. Here's my case:
grep 'string' | awk '{print $2$3}'
and I get the following
dir1/another-directory/even-another-directory/file1.jpeg
dir1/another-directory/even-another-directory/fiiile2.jpeg
dir1/another-directory/even-another-directory/filee4.jpeg
dir1/another-directory/even-another-directory/fileee1.jpeg
I am trying to take the last part of these paths (anything after the last slash), so that I get a list like this, in a CSV file maybe:
file1.jpeg
fiiile2.jpeg
filee4.jpeg
fileee1.jpeg
Would awk or cut be able to do that? I know this is a very basic question, but I couldn't find anything related online so far.
Thanks,
do everything with awk:
awk '/string/{x=$2$3;sub(/.*\//,"",x);print x}'
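A self-contained sketch with a made-up input line in the question's shape (the word "string", then a path split across two columns):

```shell
# sub(/.*\//, "", x) strips everything up to and including the last slash.
result=$(printf 'string dir1/another-directory/ file1.jpeg\n' \
  | awk '/string/{x=$2$3; sub(/.*\//, "", x); print x}')
echo "$result"
```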
Drop the grep, do the match in awk instead, and use xargs to call basename on each file to strip the leading directories:
awk '/string/{print $2$3}' | xargs -n1 basename
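The same made-up input through the basename variant:

```shell
# awk prints the joined path; basename strips the leading directories.
result=$(printf 'string dir1/another-directory/ file1.jpeg\n' \
  | awk '/string/{print $2$3}' | xargs -n1 basename)
echo "$result"
```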

How to trim specific text with grep

I need to trim some text with grep. I have tried various other methods and haven't had much luck. For example:
C:\Users\Admin\Documents\report2011.docx: My Report 2011
C:\Users\Admin\Documents\newposter.docx: Dinner Party Poster 08
How would it be possible to trim the text, removing the last ":" and all characters after it?
E.g. so the output would be like:
C:\Users\Admin\Documents\report2011.docx
C:\Users\Admin\Documents\newposter.docx
use awk?
awk -F: '{print $1":"$2}' inputFile > outFile
you can use grep
(note that -o returns only the matching text)
grep -oe "^C:[^:]*" inputFile > outFile
That is pretty simple to do with grep -o:
$ grep -o '^C:[^:]*' input
C:\Users\Admin\Documents\report2011.docx
C:\Users\Admin\Documents\newposter.docx
If you can have other drives just replace C by .:
$ grep -o '^.:[^:]*' input
If a line can start with something other than a drive name, you can match either a drive name at the beginning of the line or nothing at all:
$ grep -o '^\(.:\|\)[^:]*' input
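Tried on the question's two lines, written inline as a hypothetical input:

```shell
# Single quotes keep the backslashes literal; [^:]* stops at the next colon.
result=$(printf '%s\n' \
  'C:\Users\Admin\Documents\report2011.docx: My Report 2011' \
  'C:\Users\Admin\Documents\newposter.docx: Dinner Party Poster 08' \
  | grep -o '^C:[^:]*')
echo "$result"
```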
cat inputFile | cut -f1,2 -d":"
The -d specifies your delimiter, in this case ":". The -f1,2 means you want the first and second fields.
The first part doesn't necessarily have to be cat inputFile, it's just whatever it takes to get the text that you referred to. The key part being cut -f1,2 -d":"
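For example, with the input supplied inline instead of via cat inputFile:

```shell
# Fields split on ":"; keeping fields 1 and 2 restores the drive colon.
result=$(printf '%s\n' 'C:\Users\Admin\Documents\report2011.docx: My Report 2011' \
  | cut -f1,2 -d ':')
echo "$result"
```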
Your text looks like the output of grep. If what you're asking is how to print filenames matching a pattern, use the GNU grep option --files-with-matches.
You can use this as well for your example (note that tr -d ":" removes every colon, including the one after the drive letter):
grep -E -o "^C\S+" inputFile | tr -d ":"
or, equivalently:
egrep -o "^C\S+" inputFile | tr -d ":"
\S here matches any non-space character
