How to extract data (user name) from webpage - linux

I want to collect user names from member-list pages like this:
http://www.marksdailyapple.com/forum/memberslist/
I want to get every username from all the pages,
and I want to do this in Linux, with bash.
Where should I start? Could anyone give me some tips?

This is what my Xidel was made for:
xidel http://www.marksdailyapple.com/forum/memberslist/ -e 'a.username' -f '(//a[@rel="Next"])[1]'
With that simple line it will parse the pages with a proper HTML parser, use CSS selectors to find all links with names, use XPath to find the next page, and repeat until all pages are processed.
You can also write it using only css selectors:
xidel http://www.marksdailyapple.com/forum/memberslist/ -e 'a.username' -f 'div#pagination_top span.prev_next a'
Or use pattern matching. There you basically just copy the HTML elements you want to find from the page source and replace the text content with {.}:
xidel http://www.marksdailyapple.com/forum/memberslist/ -e '<a class="username">{.}</a>*' -f '<a rel="next">{.}</a>'

First you should use wget to get all the username pages. You will have to use some options (check the man page for wget) to make it follow the right links, and ideally not follow any of the uninteresting links (or failing that, you can just ignore the uninteresting links afterwards).
Then, despite the fact that Stack Overflow tells you not to use regular expressions to parse HTML, you should use regular expressions to parse HTML, because it's only a homework assignment, right?
If it's not a homework assignment, you've not chosen the best tool for the job.
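For what it's worth, a rough sketch of that approach (untested; the wget options and the index*.html page naming are assumptions based on the other answers here):
wget --recursive --level=1 --no-parent --accept 'index*.html' \
     http://www.marksdailyapple.com/forum/memberslist/
# then pull the usernames out of the mirrored pages with a regex
grep -rho 'class="username">[^<]*' www.marksdailyapple.com/ \
    | sed 's/.*>//' | sort -u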

As Robin suggests, you should really do this kind of stuff within a programming language that has a decent HTML parser. You can always use command-line tools for various tasks, but in this case I probably would have chosen Perl.
If you really want to try to do it with command-line tools, I would suggest curl, grep, sort and sed.
I always find it easier when I have something to play with, so here's something to get you started.
I would not use this kind of code to produce something useful, though; it's just to give you some ideas.
The member pages seem to be xxx://xxx.xxx/index1.html, where the 1 indicates the page number. Therefore the first thing I would do is extract the number of the last member page. Once I have that, I know which URLs I want to feed to curl.
Every username is in an element of class "username"; with that information we can use grep to get the relevant data.
#!/bin/bash
number_of_pages=2
curl http://www.marksdailyapple.com/forum/memberslist/index[1-${number_of_pages}].html --silent \
    | egrep 'class="username">.*</a>' -o \
    | sed 's/.*>\(.*\)<\/a>/\1/' \
    | sort
The idea here is to give curl the addresses in the format index[1-XXXX].html; that will make curl traverse all the pages. We then grep for the username class, pass the result to sed to extract the relevant data (the username), and finally pass the produced username list to sort to get the usernames sorted. I always like sorted things ;)
Some big notes, though:
You should really be doing this in another way. Again, I recommend Perl for these kinds of tasks.
There is no error checking, validation of usernames, etc. If you are going to use this in some sort of production, there are no shortcuts: do it right. Try to read up on how to parse webpages in different programming languages.
On purpose, I set number_of_pages to two. You'll have to figure out by yourself how to get the number of the last member page. There were a lot of pages, though, and I imagine it would take some time to iterate through them.
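For example, one possible way to get that number (a sketch, not tested against the live site) is to pull the highest indexN.html number out of the pagination links on the first page:
number_of_pages=$(curl --silent http://www.marksdailyapple.com/forum/memberslist/ \
    | grep -o 'index[0-9]\+\.html' | grep -o '[0-9]\+' | sort -n | tail -1)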
Hope that helps !

I used this bash script to go through all the pages:
#!/bin/bash
IFS=$'\n'
url="http://www.marksdailyapple.com/forum/memberslist/"
# Fetch the first page and pull the number of the last page out of the
# "Last Page" pagination link.
content=$(curl --silent -L ${url} 2>/dev/null | col -b)
pages=$(echo ${content} | sed -n '/Last Page/s/^.*index\([0-9]\+\).*/\1/p' | head -1)
for page in $(seq ${pages}); do
    IFS=
    content=$(curl --silent -L "${url}index${page}.html" 2>/dev/null | col -b)
    # Keep only the text of every element with class="username".
    patterns=$(echo "${content}" | sed -n 's/^.*class="username">\([^<]*\)<.*$/\1/gp')
    IFS=$'\n' users=(${patterns})
    for user in "${users[@]}"; do
        echo "user=${user}."
    done
done

Related

How to pass multiple variables in grep

I have a JSON file that is downloaded using curl. It has some information about a Confluence page. I want to extract only 3 parts of that downloaded information: the page id, status and title.
I have written a bash script for this, and my constraint is that I am not sure how to pass multiple variables in the grep command.
id=id #hardcoded
status=status #hardcoded
echo Enter title you are looking for: #taking input from user here
read title_name
echo
echo
echo Here are details
curl -u username:password -sX GET "http://X.X.X.X:8090/rest/api/content?type=page&start=0&limit=200" | python -mjson.tool | grep -Eai "$title_name"|$id|$status"
Aside from a typo (you have an unbalanced quote; please always check the syntax for correctness before posting something), the basic idea of your approach would work, in that
grep -Eai "$title_name|$id|$status"
would select those text lines which contain the content of one of the variables title_name, id or status.
However, it is a pretty fragile solution. I don't know what the actual content of those variables can be, but, for instance, if title_name were set to X.Z, it would also match lines containing the string XYZ, since the dot matches any character. Similarly, if title_name contained, say, a lone [ or (, grep would complain about an unmatched-parenthesis error.
If you want the strings to be matched literally and not taken as regular expressions, it is better to write those patterns into a file (one pattern per line) and use
grep -F -f patternfile
for searching. Of course, since you are using bash, you can also use process substitution if you prefer not using an explicit temporary file.
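For example, a minimal sketch of that approach with process substitution, reusing the curl call and variable names from the question:
curl -u username:password -sX GET "http://X.X.X.X:8090/rest/api/content?type=page&start=0&limit=200" \
    | python -mjson.tool \
    | grep -F -f <(printf '%s\n' "$title_name" "$id" "$status")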

Find and Replace Incrementally Across Multiple Files - Bash

I apologize in advance if this belongs on Super User; I always have a hard time discerning whether these bash scripting questions are better placed here or there. Currently I know how to find and replace strings in multiple files, and how to find and replace strings within a single file incrementally (from searching for a solution to this issue), but how to combine them eludes me.
Here's the explanation:
I have a few hundred files, each in sets of two: a data file (.data) and a message file (.data.ms).
These files are linked via a key value unique to each set of two that looks like: ab.cdefghi
Here's what I want to do:
Step through each .data file and do the following:
Find:
MessageKey ab.cdefghi
Replace:
MessageKey xx.aaa0001
MessageKey xx.aaa0002
...
MessageKey xx.aaa0010
etc.
Incrementing by 1 every time I get to a new file.
Clarifications:
For reference, there is only one instance of "MessageKey" in every file.
The paired files have the same name, only their extensions differ, so I could simply step through all .data files and then all .data.ms files and use whatever incremental solution on both and they'd match fine, don't need anything too fancy to edit two files in tandem or anything.
For all intents and purposes whatever currently appears on the line after each MessageKey is garbage and I am completely throwing it out and replacing it with xx.aaa####
String length does matter, so I need xx.aaa0009, xx.aaa0010, not xx.aaa0009, xx.aaa00010
I'm using cygwin.
I would approach this by creating a mapping from old key to new and dumping that into a temp file.
grep MessageKey *.data \
| sort -u \
| awk '{ printf("%s:xx.aaa%04d\n", $1, ++i); }' \
> /tmp/key_mapping
From there I would confirm that the file looks right before applying the mapping to the files with sed.
cat /tmp/key_mapping \
    | while read old new; do
        sed -i -e "s:MessageKey $old:MessageKey $new:" *;
      done
This will probably work for you, but it's neither elegant nor efficient. This is how I would do it if I were only going to run it once. If I were going to run this regularly and efficiency mattered, I would probably write a quick Python script.
@Carl.Anderson got me started on the right track, and after a little tweaking I ended up implementing his solution with some syntax tweaks.
First of all, this solution only works if all of your files are located in the same directory. I'm sure anyone with even slightly more experience with UNIX than me could modify this to work recursively, but here goes:
First I ran:
-hr "MessageKey" . | sort -u | awk '{ printf("%s:xx.aaa%04d\n", $2, ++i); }' > MessageKey
This command was used to create a find and replace map file called "MessageKey."
The contents of which looked like:
In.Rtilyd1:aa.xxx0087
In.Rzueei1:aa.xxx0088
In.Sfricf1:aa.xxx0089
In.Slooac1:aa.xxx0090
etc...
Then I ran:
cat MessageKey | while IFS=: read old new; do sed -i -e "s/MessageKey $old/MessageKey $new/" *Data ; done
I had to use IFS=: (alternatively, I could have found and replaced all the : characters in the map file with a space, but the former seemed easier).
Anyway, in the end this worked! Thanks Carl for pointing me in the right direction.

Different results in grep results when using --color=always option

I ran into this problem with grep and would like to know if it's a bug or not. The reproducible scenario is a file with the contents:
string
string-
and save it as 'file'. The goal is to use grep with --color=always to output 'string' while excluding 'string-'. Without --color, the following works as expected:
$ grep string file | grep -v string-
but using --color outputs both instances:
$ grep --color=always string file | grep -v string-
I experimented with several variations but it seems --color breaks the expected behavior. Is this a bug or am I misunderstanding something? My assumption is that passing --color should have no effect on the outcome.
@Jake Gould's answer provides a great analysis of what actually happens, but let me try to phrase it differently:
--color=always uses ANSI escape codes for coloring.
In other words: --color=always by design ALTERS its output, because it must add the requisite escape sequences to achieve coloring.
Never use --color=always, unless you know the output is expected to contain ANSI escape sequences - typically, for human eyeballs on a terminal.
If you're not sure how the output will be processed, use --color=auto, which - I believe - causes grep to apply coloring only if its stdout is connected to a terminal.
In a given pipeline, it typically only makes sense to apply --color=auto (or --color=always) to a grep command that is the LAST command in the pipeline.
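Applied to the example from the question, either of these keeps the filtering intact (a sketch of the advice above):
grep --color=auto string file | grep -v string-                    # auto turns coloring off when stdout is a pipe
grep string file | grep -v string- | grep --color=always string    # color only in the last command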
When you use --color, grep adds ANSI (I believe?) color codes. So your text, which looks like this:
string
string-
Will actually look like this in terms of pure, unprocessed ASCII text:
^[[01;31m^[[Kstring^[[m^[[K
^[[01;31m^[[Kstring^[[m^[[K-
There is some nice info provided in this question thread, including this great answer.
My assumption is that passing --color should have no effect on the outcome.
Nope. The purpose of grep, as with most Unix/Linux tools, is to provide one basic, simple service & do it well. And that service is to search a plain-text (key here) input file based on a pattern & return the output. The --color option is a small nod to the fact that we are humans, & staring at screens of uncolored text all day can drive you nuts. Color coding makes work easier.
So color coding with ANSI is usually considered a final step in a process. It's not the job of grep to assume that if it comes across ANSI codes in its input it should ignore them. Perhaps a case could be made to add a --decolor option to grep, but I doubt that is a feature worth the effort.
grep is a base-level plain-text parsing tool. Nothing more & nothing less.
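If you really must post-process colored output, you can strip the escape sequences yourself before filtering; this is a common approximation with GNU sed, not a grep feature:
grep --color=always string file | sed 's/\x1b\[[0-9;]*[mK]//g' | grep -v string-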

Search a string in a file, then print the lines that start with that string

I have an assignment and I have no idea when it comes to managing files, reading and writing. Here's my main problem:
I have a script that manages an address book. At the moment the menu is finished and functions are being used, but I don't know how to search or write to a file.
The first "option" gives the user the option (duh!) to search the address book by contact name. The pattern I want to use is something along the lines of "name:address:email:phone", letting the user put spaces in the name and address but not in the email or phone, and only numbers in the last one. I believe I could achieve this with regular expressions, which I understand a bit from Java lessons.
How can I do this, then? I know grep may be useful, but I don't know the parameters even after reading the man pages. Parsing line by line could be done with for line in $(file), but I'm still not sure.
If you're allowed to use grep, then you can probably use awk as well, and that's what I would prefer for most parts of your assignment.
Looking up a contact by name:
awk -v name="Anton Kovalenko" -F: '$1==name' "$file"
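Building on that, a sketch that also prints the fields with labels (it assumes the name:address:email:phone layout from your question and that $file holds the address book):
awk -F: -v name="Anton Kovalenko" \
    '$1 == name { printf "Name: %s\nAddress: %s\nEmail: %s\nPhone: %s\n", $1, $2, $3, $4 }' \
    "$file"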
Here's one way to do it:
grep "^something" $file | while read line
do
echo $line; #do whatever you want with your $line here
done
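Applied to your address book, that pattern could look like this (a sketch; it assumes the name:address:email:phone format and a case-insensitive search on the first field):
read -r -p "Name to search for: " name
grep -i "^$name:" "$file" | while IFS= read -r line
do
    echo "$line"    # or format the fields however you like
done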

What is the easiest way of extracting strings from source files?

I was asked today to list all image files referenced in our project to help remove/fix dead references.
All our image names in the source files are enclosed in either single or double quotes ('image.png' or "image.png").
To extract those I thought of using grep, sed and other tools like that, but so far I have failed to come up with something effective.
I can currently list all lines that contain image names by grepping for the image file extensions (.png, .gif, and so on), but that also brings up lines completely unrelated to my search. My attempt with sed wasn't working when there were several strings per line.
I could probably filter out the list by myself, but hey: this is Linux! So there has to be a tool for the job.
How would you do that?
You should be able to extract the file names with something like this:
grep -Eo "['\"][^'\"]*\.(gif|png)['\"]"
The option -o causes grep to list only the matches instead of the whole line. Use tr to remove the quotes:
grep -Eo "['\"][^'\"]*\.(gif|png)['\"]" | tr -d "\"'"
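For example, to list every referenced image once across the whole source tree (a sketch; adjust the directory and the extension list to your project):
grep -REoh "['\"][^'\"]*\.(gif|png)['\"]" . | tr -d "\"'" | sort -u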
