Extract multiple substrings in bash

I have a page exported from a wiki and I would like to find all the links on that page using bash. All the links are of the form [wiki:<page_name>]. I have a script that does:
...
# First search for the links to the pages
search=`grep '\[wiki:' pages/*`
# Check if our search turned up anything
if [ -n "$search" ]; then
# Now, we want to cut out the page name and find unique listings
uniquePages=`echo "$search" | cut -d'[' -f 2 | cut -d']' -f 1 | cut -d':' -f2 | cut -d' ' -f 1 | sort -u`
....
However, when a grep result contains multiple [wiki: entries, it only pulls the last one and not any of the others. For example, if $search is:
Before starting the configuration, all the required libraries must be installed to be detected by Cmake. If you have missed this step, see the [wiki:CT/Checklist/Libraries "Libr By pressing [t] you can switch to advanced mode screen with more details. The 5 pages are available [wiki:CT/Checklist/Cmake/advanced_mode here]. To obtain information about ea - '''Installation of Cantera''': If Cantera has not been correctly installed or if you do not have sourced the setup file '''~/setup_cantera''' you should receive the following message. Refer to the [wiki:CT/FormulationCantera "Cantera installation"] page to fix this problem. You can set the Cantera options to OFF if you plan to use built-in transport, thermodynamics and chemistry.
then it only returns CT/FormulationCantera and doesn't give me any of the other links. I know this is due to using cut, so I need a replacement for the $uniquePages line.
Does anybody have any suggestions in bash? It can use sed or perl if needed, but I'm hoping for a one-liner to extract a list of page names if at all possible.

egrep -o '\[wiki:[^]]*]' pages/* | sed 's/\[wiki://;s/]//' | sort -u
Update: to remove everything after a space without using cut:
egrep -o '\[wiki:[^]]*]' pages/* | sed 's/\[wiki://;s/]//;s/ .*//' | sort -u
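If your grep supports Perl-compatible regular expressions (-P), the same extraction can be done without sed, and it drops straight into the original uniquePages assignment. A minimal sketch, assuming the pages/ directory from the question:
uniquePages=$(grep -ohP '\[wiki:\K[^] ]+' pages/* | sort -u)
Here -o prints only the matched text, -h suppresses the file names, and \K discards the [wiki: prefix from the match; the character class stops at the first space or closing bracket, so only the page name is kept.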

Related

How to find a substring from some text in a file and store it in a bash variable?

I have a file named config.txt which has following data:
ABC_PATH=xxx/xxx
IMAGE=docker.name.net:3000/apache:1.8.109.1
NAMESPACE=xxx
Now I am running a shell script in which I want to store 1.8.109.1 (this value may differ, the rest will remain the same) in a variable, maybe using sed, awk or any other Linux tool.
How can I achieve that?
The following will work.
ver="$(cat config.txt | grep apache: | cut -d: -f3)"
grep apache: will find the line that has the text 'apache:' in it.
-d specifies what delimiters to use. In this case : is set as the delimiter.
-f is used to select the specific field (array index, starting at 1) of the resulting list obtained after delimiting by :
Thus, -f3 selects the 3rd field of the delimited list.
The version info is now captured in the variable $ver
I think this should work:
cat config.txt | grep apache: | cut -d: -f3
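If you'd rather skip the separate cat and grep processes, a single awk invocation can do both the matching and the field extraction. A sketch, assuming the version is always the last colon-separated field of the IMAGE= line:
ver=$(awk -F: '/^IMAGE=/ { print $NF }' config.txt)
echo "$ver"    # 1.8.109.1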

How to take text between "/" with Awk / cut? [duplicate]

This question already has answers here:
shell script to extract text from a variable separated by forward slashes
(3 answers)
Closed 4 years ago.
I have this command in a script:
find /home/* -type d -name dev-env 2>&1 | grep -v 'Permiso' >&2 > findPath.txt
this gives me this back:
/home/user/project/dev-env
I need to extract the second component between the "/" characters (user) so I can save it in a variable later. I cannot find a way to pick up just the "user" text.
Using cut:
echo "/home/user/project/dev-env" | cut -d'/' -f3
Result:
user
This tells cut to use / as the delimiter and return the 3rd field. (The 1st field is blank/empty, the 2nd field is home.)
Using awk:
echo "/home/user/project/dev-env" | awk -F/ '{print $3}'
This tells awk to use / as the field separator and print the 3rd field.
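If the path is already in a shell variable, a pure-bash alternative (a sketch, using no external processes) is to let read split it on /:
IFS=/ read -r _ _ user _ <<< "/home/user/project/dev-env"
echo "$user"    # prints: user
Because the path starts with /, the first field is empty, the second is home, and the third lands in the user variable.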
Assuming that the path resulting from the grep is always an absolute path:
second_component=$(find .... -type d -name dev-env 2>&1 | grep -v 'Permiso' | cut -d / -f 3)
However, your approach suffers from several other problems:
You use /home/* as the starting point for find. This will only work if there is exactly one subdirectory below /home, which is not a very likely scenario.
Even then, it works only if grep results in exactly one line. This is a semantic problem: what if you get more than one line, and which one are you interested in? Assuming you know you are always interested in the first line, you can solve this by piping the result through head -n 1.
Next, you redirect the stderr from find to stdout, which means that any error from find goes unnoticed; you just get some weird result. It would be better to let any error message from find be displayed and instead evaluate the exit codes of find and grep; a sketch of a more defensive version follows.
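A sketch of that more defensive version, keeping find's error messages visible, taking only the first match, and failing explicitly when nothing is found:
path=$(find /home -type d -name dev-env | head -n 1)
if [ -n "$path" ]; then
    second_component=$(printf '%s\n' "$path" | cut -d / -f 3)
    echo "$second_component"
else
    echo "no dev-env directory found" >&2
    exit 1
fi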
... | cut -d/ -f3
"Third field, as cut by slash delimiter"

Recursively grep unique pattern in different files

Sorry, the title is not very clear.
So let's say I'm grepping recursively for urls like this:
grep -ERo '(http|https)://[^/"]+' /folder
and in the folder there are several files containing the same URL. My goal is to output this URL only once. I tried piping the grep output to uniq or sort -u, but that doesn't help.
Example result:
/www/tmpl/button.tpl.php:http://www.w3.org
/www/tmpl/header.tpl.php:http://www.w3.org
/www/tmpl/main.tpl.php:http://www.w3.org
/www/tmpl/master.tpl.php:http://www.w3.org
/www/tmpl/progress.tpl.php:http://www.w3.org
If you only want the address and not the file it was found in, grep has a -h option to suppress the file names in the output; the list can then be piped to sort -u to make sure every address appears only once:
$ grep -hERo 'https?://[^/"]+' folder/ | sort -u
http://www.w3.org
If you don't want the https?:// part, you can use Perl regular expressions (-P instead of -E) with variable length look-behind (\K):
$ grep -hPRo 'https?://\K[^/"]+' folder/ | sort -u
www.w3.org
If the structure of the output is always:
/some/path/to/file.php:http://www.someurl.org
you can use the cut command:
cut -d ':' -f 2- should work. Basically, it cuts each line into fields separated by a delimiter (here ":") and you select the 2nd and following fields (-f 2-)
After that, you can pipe through sort -u (or sort | uniq) to filter out the duplicates; a bare uniq only removes adjacent duplicate lines.
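Putting the two together, a sketch of the full pipeline (using the /folder path from the question):
grep -ERo 'https?://[^/"]+' /folder | cut -d ':' -f 2- | sort -u
sort -u is used instead of a bare uniq so the duplicates do not have to be adjacent.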
Pipe to Awk:
grep -ERo 'https?://[^/"]+' /folder |
awk -F: '!a[substr($0, length($1) + 2)]++'
The basic Awk idiom !a[key]++ is true the first time we see key, and forever false after that. Extracting the URL (or a reasonable approximation) into the key requires a bit of additional trickery.
This prints the whole input line if the key is one we have not seen before, i.e. it will print the file name and the URL for the first occurrence of each URL from the grep output.
Doing the whole thing in Awk should not be too hard, either.
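For instance, a sketch of an Awk-only version, relying on find only to supply the file list and on Awk's match() to pull every URL out of each line (deduplication happens across all files passed to a single awk invocation):
find /folder -type f -exec awk '{
    line = $0
    # repeatedly extract http(s)://... matches from the current line
    while (match(line, /https?:\/\/[^\/"]+/)) {
        url = substr(line, RSTART, RLENGTH)
        if (!seen[url]++) print url
        line = substr(line, RSTART + RLENGTH)
    }
}' {} +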

How to get grep -m1 to work in OSX

I have a script which I used perfectly fine in Linux, but now that I've switched over to Mac, the script still runs but has slightly different behavior.
This is a script for tallying student attendance at departmental functions. We use a portable barcode scanner to scan their IDs, and then save all scans in one CSV file per date.
I used grep -m1 $ID csvfolder/* | wc -l in the past to get a count of how many files their ID shows up in. The -m1 is necessary to make sure they don't get "extra credit" for repeatedly scanning in at the same event.
However, when I use this same command on a Mac, grep exits when it has found the first match in the first file. So if the student shows up in 4 files, wc -l still returns 1.
How can I (without installing the GNU versions) emulate this feature?
I don't have Mac OS X handy to test it with, but the following is POSIX-standard, AFAIK:
grep -l "$ID" csvfolder/* | wc -l
grep -l will print the name of each file that contains a match, and wc -l then counts those file names. It works the same with GNU grep.
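For the attendance tally described in the question, the whole pipeline might look like this (a sketch, reusing the $ID variable and csvfolder layout from the original script):
count=$(grep -l "$ID" csvfolder/* | wc -l)
echo "$ID attended $count events"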
You could alternatively use awk for this task, printing each matching file name at most once:
awk -v id="$ID" 'FNR == 1 { found = 0 } !found && $0 ~ id { print FILENAME; found = 1 }' csvfolder/* | wc -l

Get view of a client in perforce through command line

We can get the information about a client using
p4 client -o <clientname>
but it gives a lot of information. Is there any way to get only the view of the client using the command line?
You can use p4's -z tag option to get annotated output useful for scripting. From there, you can extract the lines that start with ... View using grep and cut:
p4 -z tag client -o | grep -E '^[.]{3} View' | cut -d ' ' -f 3-
(And if you're using Windows, you can obtain grep and cut implementations from UnxUtils.)
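For example, to capture just the view mapping of a specific client into a shell variable (a sketch; my_client is a placeholder client name):
view=$(p4 -z tag client -o my_client | grep -E '^[.]{3} View' | cut -d ' ' -f 3-)
printf '%s\n' "$view"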
