I have a JSON file that is downloaded using curl. It contains some information about a Confluence page. I want to extract only 3 parts of that downloaded information, namely the page id, status, and title.
I have written a bash script for this, and my problem is that I am not sure how to pass multiple variables to the grep command.
id=id #hardcoded
status=status #hardcoded
echo Enter title you are looking for: #taking input from user here
read title_name
echo
echo
echo Here are details
curl -u username:password -sX GET "http://X.X.X.X:8090/rest/api/content?type=page&start=0&limit=200" | python -mjson.tool | grep -Eai "$title_name"|$id|$status"
Aside from a typo (you have an unbalanced quote - please always check the syntax for correctness before posting), the basic idea of your approach would work, in that
grep -Eai "$title_name|$id|$status"
would select those text lines which contain the content of one of the variables title_name, id or status.
However, it is a pretty fragile solution. I don't know what the actual content of those variables might be, but for instance, if title_name were set to X.Z, it would also match lines containing the string XYZ, since the dot matches any character. Similarly, if title_name contained, say, a lone [ or (, grep would complain about an unmatched parenthesis error.
If you want the strings to be matched literally and not taken as regular expressions, it is better to write those patterns into a file (one pattern per line) and use
grep -F -f patternfile
for searching. Of course, since you are using bash, you can also use process substitution if you prefer not to use an explicit temporary file.
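For instance, a minimal sketch of that process-substitution variant, assuming the three variables from your script are already set:

grep -F -f <(printf '%s\n' "$title_name" "$id" "$status")

Here printf emits one pattern per line, and <(...) lets grep -f read that list as if it were a file, so no temporary file is needed.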
Related
I have a simple egrep command searching for multiple strings in a text file which outputs either null or a value. Below is the command and the output.
cat Output.txt|egrep -i "abc|def|efg"|cut -d ':' -f 2
The output is:
xxx
(null)
yyy
Now, I am trying to prefix my search texts to the output like below.
abc:xxx
def:
efg:yyy
Any help on the code to achieve this or where to start would be appreciated.
-Abhi
Since I do not know your exact input file content (it is not specified properly in the question), I will make some hypotheses in order to answer your question.
Case 1: the patterns you are looking for are always located in the same column
If it is the case, the answer is quite straightforward:
$ cat grep_file.in
abc:xxx:uvw
def:::
efg:yyy:toto
xyz:lol:hey
$ egrep -i "abc|def|efg" grep_file.in | cut -d':' -f1,2
abc:xxx
def:
efg:yyy
After the grep, just use cut with the two columns that you are looking for (here they are 1 and 2).
REMARK:
Do not cat the file, pipe it and then grep it, since this does the work twice!!! Your grep command will already read the file, so do not read it twice; it might not matter much on small files, but you will feel the difference on 10 GB files, for example!
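Applied to the command from the question, that means (keeping the rest of the pipeline unchanged):

egrep -i "abc|def|efg" Output.txt | cut -d ':' -f 2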
Case 2: the patterns you are looking for are NOT located in the same column
In this case it is a bit more tricky, but not impossible. There are many ways of doing it; here I will detail the awk way:
$ cat grep_file2.in
abc:xxx:uvw
::def:
efg:yyy:toto
xyz:lol:hey
If your input file is in this format, with your pattern possibly located anywhere:
$ awk 'BEGIN{FS=":";ORS=FS}{tmp=0;for(i=1;i<=NF;i++){tmp=match($i,/abc|def|efg/);if(tmp){print $i;break}}if(tmp){printf "%s\n", $2}}' grep_file2.in
abc:xxx
def:
efg:yyy
Explanations:
FS=":";ORS=FS define your input/output field separator at : Then on each line you define a test variable that will become true when you reach your pattern, you loop on all the fields of the line until you reach it if it is the case you print it, break the loop and print the second field + an EOL char.
If you do not meet your pattern you do nothing.
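For readability, here is the same awk program laid out over multiple lines (functionally identical to the one-liner above):

awk 'BEGIN { FS = ":"; ORS = FS }          # ":" as input and output separator
{
    tmp = 0
    for (i = 1; i <= NF; i++) {            # scan every field of the line
        tmp = match($i, /abc|def|efg/)
        if (tmp) { print $i; break }       # print the matched field
    }
    if (tmp) printf "%s\n", $2             # then the 2nd field plus a newline
}' grep_file2.in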
If you prefer the sed way, you can use the following command:
$ sed -n '/abc\|def\|efg/{h;s/.*\(abc\|def\|efg\).*/\1:/;x;s/^[^:]*:\([^:]*\):.*/\1/;H;x;s/\n//p}' grep_file2.in
abc:xxx
def:
efg:yyy
Explanations:
/abc\|def\|efg/{} is used to filter the lines that contain one of the patterns provided; the instructions in the block are then executed. h;s/.*\(abc\|def\|efg\).*/\1:/; saves the line in the hold space and replaces the line with the matched pattern followed by a colon. x;s/^[^:]*:\([^:]*\):.*/\1/; is used to exchange the pattern and hold space and extract the 2nd column element. Last but not least, H;x;s/\n//p is used to regroup both extracted elements on one line and print it.
Try this:
$ egrep -io "(abc|def|efg):[^:]*" file
This will print the match and the next token after the delimiter.
If we can assume that there are only two fields, that abc etc will always match in the first field, and that getting the last match on a line which contains multiple matches is acceptable, a very simple sed script could work.
sed -n 's/^[^:]*\(abc\|def\|efg\)[^:]*:\([^:]*\)/\1:\2/p' file
If other but similar conditions apply (e.g. there are three fields or more but we don't care about matches in the first two) the required modifications are trivial. If not, you really need to clarify your question.
I have thousands of files in a directory, and each file contains a number of variable definitions starting with the keyword DEFINE and ending with a semicolon (;). I want to copy all the occurrences of the data between these keywords (inclusive) into a target file.
Example: Below is the content of the text file:
/* This code is for lookup */
DEFINE variable as a1 expr= extract (n123f1 using brach, code);
END.
Now, from the above content, I just want to copy the section starting with DEFINE and ending with ; into a target file, i.e. the output should be:
DEFINE variable as a1 expr= extract (n123f1 using brach, code);
This needs to be done for thousands of scripts with multiple occurrences. Please help out.
Thanks a lot, the provided code works, but only to a limited extent: it works when the whole statement is on a single line, but the data is not guaranteed to be on one single line; it can be spread over multiple lines like below:
/* This code is for lookup */
DEFINE variable as a1 expr= if branchno > 55
then
extract (n123f1 using brach, code)
else
branchno = null
;
END.
The code follows the above fashion. I need to capture all the data between DEFINE and the semicolon (;); after every DEFINE there will be an ending semicolon ; - this is the pattern.
It sounds like you want grep(1):
grep '^DEFINE.*;$' input > output
Try using grep. Let's say you have files with the extension .txt in the present directory:
grep -ho 'DEFINE.*;' *.txt > outfile
Output:
DEFINE variable as a1 expr= extract (n123f1 using brach, code);
Short Description
-o will give you only the matching string rather than the whole line, in case the line also contains something else you want to omit.
-h will suppress file names before the matching results.
Read the man page of grep by typing man grep in your terminal.
EDIT
If you want the capability to search across multiple lines, you can use pcregrep with the -M option:
pcregrep -M 'DEFINE.*?(\n|.)*?;' *.txt > outfile
Works fine on my system. Check man pcregrep for more details
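If pcregrep is not available, a rough alternative is to let awk split the input into ;-terminated records (a sketch that assumes ; occurs only at the end of a DEFINE block):

awk 'BEGIN{RS=";"} { i = index($0, "DEFINE"); if (i) print substr($0, i) ";" }' *.txt > outfile

Each record that mentions DEFINE is trimmed so it starts at the keyword, and the semicolon that was consumed as the record separator is printed back.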
One can make a simple solution using sed. With my version:
sed -n -e '/^DEFINE/{:a p;/;$/!{n;ba}}' your-file
Option -n prevents sed from printing every line; then, each time a line begins with DEFINE, print the line (command p) and enter a loop: until you find a line ending with ;, grab the next line and loop back to the print command. When exiting the loop, you do nothing.
It looks a bit dirty; it seems that the sed15 version has a shorter (and more straightforward) way to achieve this in one line:
sed -n -e '/^DEFINE/,/;$/p' your-file
Indeed, only for this version of sed are both patterns treated; for other versions of sed, like mine under Cygwin, the range patterns must be on separate lines to work properly.
One last thing to remember: it does not handle nested patterned ranges, i.e. it stops printing at the first encountered end-pattern even if multiple start-patterns have been matched. Prefer something with awk if this is a feature you are looking for.
I want to collect user names from member-list pages like this:
http://www.marksdailyapple.com/forum/memberslist/
I want to get every username from all the pages,
and I want to do this on Linux, with bash.
Where should I start? Could anyone give me some tips?
This is what my Xidel was made for:
xidel http://www.marksdailyapple.com/forum/memberslist/ -e 'a.username' -f '(//a[@rel="Next"])[1]'
With that simple line it will parse the pages with a proper HTML parser, use CSS selectors to find all links with names, use XPath to find the next page, and repeat until all pages are processed.
You can also write it using only css selectors:
xidel http://www.marksdailyapple.com/forum/memberslist/ -e 'a.username' -f 'div#pagination_top span.prev_next a'
Or pattern matching. There you basically just copy the html elements you want to find from the page source and replace the text content with {.}:
xidel http://www.marksdailyapple.com/forum/memberslist/ -e '<a class="username">{.}</a>*' -f '<a rel="next">{.}</a>'
First you should use wget to get all the username pages. You will have to use some options (check the man page for wget) to make it follow the right links, and ideally not follow any of the uninteresting links (or failing that, you can just ignore the uninteresting links afterwards).
Then, despite the fact that Stackoverflow tells you not to use regular expressions to parse HTML, you should use regular expressions to parse HTML, because it's only a homework assignment, right?
If it's not a homework assignment, you've not chosen the best tool for the job.
As Robin suggests, you should really do this kind of stuff within a programming language with a decent HTML parser. You can always use command-line tools to do various tasks; however, in this case I probably would have chosen perl.
If you really want to try to do it with command-line tools, I would suggest curl, grep, sort and sed.
I always find it easier when I have something to play with, so here's something to get you started.
I would not use this kind of code to produce anything useful, though; it's just so you can get some ideas.
The member pages seem to be xxx://xxx.xxx/index1.html, where the 1 indicates the page number. Therefore the first thing I would do is extract the number of the last member page. When I have that, I know which URLs I want to feed curl with.
Every username is in an element of the class "username"; with that information we can use grep to get the relevant data.
#!/bin/bash
number_of_pages=2
curl http://www.marksdailyapple.com/forum/memberslist/index[1-${number_of_pages}].html --silent | egrep 'class="username">.*</a>' -o | sed 's/.*>\(.*\)<\/a>/\1/' | sort
The idea here is to give curl the addresses in the format index[1-XXXX].html, which will make curl traverse all the pages. We then grep for the username class and pass the result to sed to extract the relevant data (the username). The produced "username list" is then passed to sort to get the usernames sorted. I always like sorted things ;)
Big notes, though:
You should really be doing this in another way. Again, I recommend perl for this kind of task.
There is no error checking, validation of usernames, etc. If you are going to use this in some sort of production, there are no shortcuts; do it right. Try to read up on how to parse webpages in different programming languages.
On purpose, I declared number_of_pages as two. You'll have to figure out a way by yourself to get the number of the last member page. There were a lot of pages, though, and I imagine it would take some time to iterate through them.
Hope that helps!
I used this bash script to go through all the pages:
#!/bin/bash
IFS=$'\n'
url="http://www.marksdailyapple.com/forum/memberslist/"
content=$(curl --silent -L ${url} 2>/dev/null | col -b)
pages=$(echo ${content} | sed -n '/Last Page/s/^.*index\([0-9]\+\).*/\1/p' | head -1)
for page in $(seq ${pages}); do
IFS=
content=$(curl --silent -L ${url}index${page}.html 2>/dev/null | col -b)
patterns=$(echo ${content} | sed -n 's/^.*class="username">\([^<]*\)<.*$/\1/gp')
IFS=$'\n' users=(${patterns})
for user in "${users[@]}"; do
echo "user=${user}."
done
done
I was asked today to list all image file references in our project to help remove/fix dead references.
All our image names in the source files are enclosed in either simple or double quotes ('image.png' or "image.png").
To extract those, I thought of using grep, sed and other tools like that, but so far I have failed to come up with something effective.
I can currently list all lines that contain image names by grepping for the image file extensions (.png, .gif, and so on), but that also brings up lines completely unrelated to my search. My attempt with sed wasn't working in cases where there were several strings per line.
I could probably filter out the list by myself, but hey: this is Linux! So there has to be a tool for the job.
How would you do that?
You should be able to extract the file names with something like this:
grep -Eo "['\"][^'\"]*\.(gif|png)['\"]"
The option -o causes grep to list only the matches instead of the whole line. Use tr to remove the quotes:
grep -Eo "['\"][^'\"]*\.(gif|png)['\"]" | tr -d "\"'"
I'm trying to write a script to log in to a Drupal website automagically to put it into maintenance mode. Here's what I have so far, and the grep gives me back the line I want.
curl http://www.drupalwebsite.org/?q=user | grep '<input type="hidden" name="form_build_id" id="form-[a-zA-Z0-9]*" value="form-[a-zA-Z0-9]*" />'
Now I'm kind of a Linux newbie, and I'm using Cygwin with BASH. How would I then pipe the output and use a command to get the value of the id attribute from the output that grep generated? I'll be using this substring later to do another curl request to actually submit the login.
I was looking at using expr, but I don't really understand how I would tell expr "oh hey, this stdin data I want you to manipulate in this way". It seems like the only way I could do this would be by saving off the grep output in a variable and then feeding the variable to expr.
Use sed to trim the results you get from your grep, i.e.
edit: added a myID variable; use any name you like.
myID=$(
curl http://www.drupalwebsite.org/?q=user \
| grep '<input type="hidden" name="form_build_id" id="form-[a-zA-Z0-9]*" value="form-[a-zA-Z0-9]*" />' \
| sed 's/^.* id="//;s/" value=.*$//'
)
#use ${myID} later in script
printf "myID=${myID}\n"
The first part removes the 'front' part of the string (everything up to and including id="), while the 2nd part removes everything from " value= onward.
Note that you can chain together multiple substitute actions in sed by separating them with ';'.
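As a quick illustration of that chaining, with a made-up input line:

echo 'name="x" id="form-abc123" value="form-xyz" />' | sed 's/^.* id="//;s/" value=.*$//'

prints form-abc123.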
edit2
Also, once you're using sed, there's no reason to use grep; try this:
myID=$(
curl http://www.drupalwebsite.org/?q=user \
| sed -n '\#<input type="hidden" name="form_build_id" id="form-[a-zA-Z0-9]*" value="form-[a-zA-Z0-9]*" />#{
s#^.* id="##
s#" value=.*$##p
}'
)
(It's a good habit to get into, removing unnecessary processes. It may not matter in this case, but if you get to where you are writing code that will be executed thousands of times in an hour, then having an extra grep when you don't need it creates thousands of extra processes that don't need to be created.)
You may have to escape the < and > chars like \< \> or, worst case, [<] [>].
I'm using the '#' as the regex replacement separator now, to avoid having to escape any '/' chars in the search-target string, and I continue using it in the whole example just to be consistent. For the context address, sed has to be told that you're using a non-standard separator, hence the leading \# at the front of the address pattern.
The -n means "don't default-print each line of input", and because of that we have to add the 'p' at the end, which means print the current buffer.
Finally, I'm not sure about your regular expression, particularly the -[a-zA-Z0-9]*: this means zero or more of the previous character (or character class, in this case). Typically, people wanting at least one alphanumeric character will use -[a-zA-Z0-9][a-zA-Z0-9]* or [[:alnum:]][[:alnum:]]*, but I don't know your data well enough to say for sure.
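As a side note, tools that support extended regular expressions (e.g. grep -E) let you write that "one or more" requirement more compactly as form-[a-zA-Z0-9]+ or form-[[:alnum:]]+.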
I hope this helps.
You could use grep again with the -o option, possibly with two consecutive greps to also filter out the surrounding id="..." part.
-o, --only-matching
Print only the matched (non-empty) parts of a matching line,
with each such part on a separate output line.
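For example, a rough sketch of that two-grep approach, reusing the hidden-input markup from the question:

curl http://www.drupalwebsite.org/?q=user \
| grep -o 'id="form-[a-zA-Z0-9]*"' \
| grep -o 'form-[a-zA-Z0-9]*'

The first -o grep keeps only the id="form-..." attribute, and the second keeps just the form-... token inside it.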