grep and sed command - string

I have a truckload of files with SQL commands in them, and I have been asked to extract all database table names from the files.
How can I use grep and sed to parse the files and create a list of the unique table names in a text file, one per line?
The table names all seem to start with "db_", which is handy!
What would be the best way to use grep and sed together to pull the table names out?

This will search for lines containing the table names. The output of this will quickly reveal if a more selective search is needed:
grep "\<db_[a-zA-Z0-9_]*" *.sql
Once the proper search is sorted out, remove all other characters from lines with table names:
grep "\<db_[a-zA-Z0-9_]*" *.sql | sed 's/.*\(\<db_[a-zA-Z0-9_]*\).*/\1/'
Once that's running, add on a sort and remove duplicates:
grep "\<db_[a-zA-Z0-9_]*" *.sql | sed 's/.*\(\<db_[a-zA-Z0-9_]*\).*/\1/' | sort | uniq
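As a quick sanity check on a made-up SQL line (GNU sed assumed, for the \< word boundary), the sed step extracts just the table name; note it keeps only one db_ name per input line, so a line that joins several tables contributes only one of them:
$ echo "INSERT INTO db_orders (id) VALUES (1);" | sed 's/.*\(\<db_[a-zA-Z0-9_]*\).*/\1/'
db_orders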

you just need grep
grep -owE "db_[a-zA-Z0-9_]+" file | sort -u
or awk
awk '{for(i=1;i<=NF;i++)if($i~/^db_[a-zA-Z0-9_]+/){print $i}}' file
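One nice property of the -o approach: it prints every match on its own line, so a made-up line that mentions two tables yields both names (the sed pipeline above keeps only one per line):
$ echo "SELECT * FROM db_orders JOIN db_customers ON 1=1;" | grep -owE "db_[a-zA-Z0-9_]+"
db_orders
db_customers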

Related

How to display number of times a word repeated after a common pattern

I have a file which has N number of lines.
For example:
This/is/workshop/1
This/is/workshop/2
This/is/workshop/3
This/is/workshop/4
This/is/workshop/5
How do I get the below result using the uniq command:
This/is/workshop/ =5
Okay, so there are a couple of tools you can use here. Familiarize yourself with grep, cut, and uniq. My process for doing something like this may not be ideal, but given your original question I'll try to tailor it to the lines in the file you've given.
First you'll want to grep the file for the relevant strings. Then you can pass the result through cut, selecting the fields you want by specifying the delimiter and the field numbers. Lastly, you can pipe this through uniq -c to count.
Example:
Contents of file.txt
This/is/workshop/1
This/is/workshop/2
This/is/workshop/3
This/is/workshop/4
This/is/workshop/5
Use grep, cut and uniq
$ grep "This/is/workshop/" file.txt | cut -d/ -f1-3 | uniq -c
5 This/is/workshop
To specify the delimiter in cut, use the -d flag followed by the delimiter character. Each field is what exists between delimiters, counted from 1; here we want the first three (-f1-3). Then just pipe it through to uniq -c to get the count you are after.
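If you need the exact "This/is/workshop/ =5" format from the question, one way (just a sketch) is to reshape the uniq -c output with awk:
$ grep "This/is/workshop/" file.txt | cut -d/ -f1-3 | uniq -c | awk '{print $2"/ ="$1}'
This/is/workshop/ =5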

How to find the files (pwd of the file) which have a particular word below a particular word in directories and subdirectories in Linux

I have 200 folders. Each folder has multiple shell and SQL files, and my requirement is to grep/find all the directories and files which contain the below:
Insert into dbname.table_name
Select
I want to know all the files (the pwd of each file) having insert into ${dbname}.{table_name} followed by select, which is on the next line. The db name and table name are the same for all.
You could use grep -r -i -A1 "insert.into" | grep -i -B1 select
-r will grep on all files in the current directory and recursively in all subdirectories.
-A1 prints one line After the matching line,
-B1 prints one line Before the matching line.
So the first grep above will print all lines matching insert.into plus the next; the second grep will keep only those pairs that have a select on their second line.
(-i to ignore case)
You may then append | grep -i insert.into | cut -d: -f1 | sort -u to get only the file names.
Note this makes some assumptions:
options -A/-B exist only in GNU grep (as on Linux), not in plain Unixes like HP-UX.
if you have lines containing both insert.into and select, you'll get some funky output.
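Putting those pieces together (GNU grep assumed), the whole thing from this answer might look like the following; the final grep keeps only the matching insert lines, whose filename prefix uses ":" (context lines use "-"), which is why cut -d: -f1 can recover the file names:
grep -r -i -A1 "insert.into" | grep -i -B1 select | grep -i "insert.into" | cut -d: -f1 | sort -u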

Use regex in grep while using two files

I know that you can use regex in grep and use patterns from a file to search another file. But can you combine these two options?
For example, from the file the patterns come from (given with the -f option to read patterns from a file), I only want to use the first column to search the second file.
I tried this:
grep -E '^(*)\b' -f file_1 file_2 > file_3
To grep the first column from file_1 with the * wildcard, but it is not working. Any ideas?
grep doesn't use wildcards for patterns; it uses regular expressions, so (*) makes little sense.
If you want to extract the first column from a file, use cut -f1 or awk '{print $1}' (or sed or perl or whatever to extract it), then pipe to grep, using the special - (i.e. standard input) as the pattern file:
cut -f1 file_1 | grep -f- file_2 > file_3
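One caveat (not from the original answer): cut assumes tab-delimited input by default, so if the columns in file_1 are separated by arbitrary whitespace, the awk form is the safer choice; you may also want grep's -F (fixed strings) and -w (whole words) so the extracted column is not re-interpreted as a regular expression:
awk '{print $1}' file_1 | grep -Fwf- file_2 > file_3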

Recursively grep unique pattern in different files

Sorry, the title is not very clear.
So let's say I'm grepping recursively for URLs like this:
grep -ERo '(http|https)://[^/"]+' /folder
and in the folder there are several files containing the same URL. My goal is to output this URL only once. I tried piping the grep to | uniq or sort -u but that doesn't help.
example result:
/www/tmpl/button.tpl.php:http://www.w3.org
/www/tmpl/header.tpl.php:http://www.w3.org
/www/tmpl/main.tpl.php:http://www.w3.org
/www/tmpl/master.tpl.php:http://www.w3.org
/www/tmpl/progress.tpl.php:http://www.w3.org
If you only want the address and never the file it was found in, there is a grep option -h to suppress the file name in the output; the list can then be piped to sort -u to make sure every address appears only once:
$ grep -hERo 'https?://[^/"]+' folder/ | sort -u
http://www.w3.org
If you don't want the https?:// part, you can use Perl-compatible regular expressions (-P instead of -E) with \K, which excludes everything matched before it from the output (in effect a variable-length look-behind):
$ grep -hPRo 'https?://\K[^/"]+' folder/ | sort -u
www.w3.org
If the structure of the output is always:
/some/path/to/file.php:http://www.someurl.org
you can use the command cut:
cut -d ':' -f 2- should work. Basically, it cuts each line into fields separated by a delimiter (here ":") and you select the 2nd and following fields (-f 2-).
After that, you can use uniq to filter.
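Put together, that might look like the following; sort -u is used rather than plain uniq because identical URLs coming from different files are not guaranteed to be adjacent in general:
$ grep -ERo '(http|https)://[^/"]+' /folder | cut -d: -f2- | sort -u
http://www.w3.org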
Pipe to Awk:
grep -ERo 'https?://[^/"]+' /folder |
awk -F: '!a[substr($0,length($1)+1)]++'
The basic Awk idiom !a[key]++ is true the first time we see key, and forever false after that. Extracting the URL (or a reasonable approximation) into the key requires a bit of additional trickery.
This prints the whole input line if the key is one we have not seen before, i.e. it will print the file name and the URL for the first occurrence of each URL from the grep output.
Doing the whole thing in Awk should not be too hard, either.
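For instance, a rough sketch of that all-Awk idea (file traversal still delegated to find; a POSIX awk is assumed, and the trailing sort -u also guards against find splitting the file list across several awk invocations):
find /folder -type f -exec awk '{
    line = $0
    # pull every URL-shaped substring out of the current line
    while (match(line, "https?://[^/\"]+")) {
        url = substr(line, RSTART, RLENGTH)
        if (!seen[url]++) print url    # print each URL only once per awk run
        line = substr(line, RSTART + RLENGTH)
    }
}' {} + | sort -u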

How to extract a distinct part of a string from a file in Linux

I'm using the following command to extract distinct URLs that contain the .com extension and may also contain .us or whatever country extension.
grep '\.com' source.txt -m 700 | uniq | sed -e 's/www.//' > dest.txt
The problem is that it extracts URLs in the same domain, which I don't want. Ex:
abc.yahoo.com
efg.yahoo.com
I only need yahoo.com. How can I, using grep or any other command, extract distinct domain names only?
Maybe something like this?
egrep -io '[a-z0-9\-]+\.[a-z]{2,3}(\.[a-z]{2})?' source.txt
Have you tried using awk instead of sed, specifying "." as the delimiter and only printing out the last two fields?
awk -F "." '{ print $(NF-1)"."$NF }'
Perhaps something like this should help:
egrep -o '[^.]*\.com' file
