How to extract emails from a list of urls? - linux

I have a website with a list of URLs. I need a command that would follow each link, visit the site, and find an email address.
I need this command to do that for every URL and to search every page on each site.
Is there any way I could achieve this?
Thank you !
Here is the website with the url list : http://cabm.net/nos-membres

I would encourage you to retrieve email addresses from databases.
That said, it is possible to achieve the desired effect by retrieving the set of HTML pages from each site (i.e. downloading them) and then parsing the resulting text for strings in email format:
-whitespace-string@string.string-whitespace-
Results would be stored in an array of email addresses.
There isn't a single command that does this, but it is possible to write a program that achieves the same.
Such a program also raises ethical questions about whether or not you're allowed to do that, but that's a discussion for another time ;)
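For illustration, here is a minimal PHP sketch of that approach, assuming allow_url_fopen is enabled; the URLs and the email pattern are simplified placeholders, and a real crawler would also follow each site's internal links (contact pages etc.) and respect robots.txt:
<?php
// Minimal sketch: fetch each URL and collect anything that looks like an email.
// Assumes PHP CLI with allow_url_fopen enabled.

$urls = [
    'http://example.com/',        // replace with the member sites from the list
    'http://example.org/contact',
];

$emails = [];
foreach ($urls as $url) {
    $html = @file_get_contents($url);
    if ($html === false) {
        continue; // site unreachable, skip it
    }
    // Simplified email pattern: string@string.string with no whitespace
    if (preg_match_all('/[\w.+-]+@[\w-]+(?:\.[\w-]+)+/', $html, $matches)) {
        $emails = array_merge($emails, $matches[0]);
    }
}

print_r(array_unique($emails)); // the resulting array of email addresses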

Related

Disable google indexing website telephone numbers

I was presented with the task of hiding telephone numbers from Google. What that means is: we want to display them on the website and have them clickable (href="tel:...") but ensure Google does not index them and does NOT display them in the search results.
Does anyone know of any effective technique?
I was thinking of writing a VueJS component which mixes the given number with some alpha characters, but this would only work for the presentation / label; the tel:... would still have to contain a valid telephone number and I'm not sure Google wouldn't pick it up from the href attribute.
I think the best approach is just to hide it from bots; maybe you can use something like VueIfBot
<vue-if-bot>
This will not be visible for bots
</vue-if-bot>
Or, as an alternative, just check the user agent, for example in PHP:
function _bot_detected() {
    // Treat the request as a bot if the user agent matches common crawler keywords
    return (
        isset($_SERVER['HTTP_USER_AGENT'])
        && preg_match('/bot|crawl|slurp|spider|mediapartners/i', $_SERVER['HTTP_USER_AGENT'])
    );
}
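For instance, one possible way to use that check in a template, so the real tel: link is only rendered for human visitors (the phone number and the fallback text are just placeholders):
<?php if (!_bot_detected()): ?>
    <!-- real, clickable number only for non-bot visitors -->
    <a href="tel:+15551234567">+1 555 123 4567</a>
<?php else: ?>
    <!-- bots get a harmless placeholder instead of the number -->
    <span>Call us for details</span>
<?php endif; ?>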
If you can't get the user agent but still want to check whether the visitor is a search engine crawler, you can check the user's IP address; here is a list of IP Addresses of Search Engine Spiders.
And finally, after you have successfully hidden your data, you can test it with User-Agent Switcher.

Friendly URLs when using a Record ID for dynamic content

I've read a bit on the matter of friendly urls and I'm a little unsure as to what is better.
I currently have my website using a structure of http://www.domain.com/page.php?id=2
I am using the record ID to determine the content of the page. My record IDs are numeric and increment for new pages added. The content of existing pages can change completely over time but still uses the same record ID (this is a CMS, so the client may do this).
The way I understand it I have two options for friendly urls:
http://www.domain.com/page/2
http://www.domain.com/some-text-describing-the-page
Now because I identify the content by the record id, I would assume the first option would make more sense.
My client seems to want option two.
After some reading I found two conflicting points.
Tim Berners-Lee (the architect of the WWW) states that you want a URI which has the potential to remain the same 2 months, 2 years, 200 years from now. So you DO NOT want to use a page title or something similar for your pages. If you change your page's content you are either forced to change the content and leave the URI alone, or change the URI and be stuck with dangling links. You can read his article here (http://www.w3.org/Provider/Style/URI).
However, a number of other people on the internet (with no known authority to me) clearly state that you need a descriptive yet short URI for the best SEO value. From what I read, this is mostly for the purpose of backlinks and having keywords in the anchor text, since people often just use the link itself as the anchor text. So having keywords in the link itself helps search engines know what the link is about without a custom title.
It seems to me the difference has to do with long term VS short term.
Am I grasping this correctly?
If I am to use a slug-style URI as defined by the user, do I just have to let my user type whatever they want into a field and check it against the current database to see whether it already exists? If so, am I supposed to anticipate static links by running a query for the known record ID and then using the result to generate the URL, which would just be rewritten back to the format http://www.domain.com/page.php?id=2?
It seems to me that would be a lot of extra overhead.
I would suggest something in the middle of those two:
http://www.domain.com/page/2/some-text-describing-the-page
or without page:
http://www.domain.com/2/some-text-describing-the-page
You can still get the page ID from the URL, and there is a title as well! And what's even more important, you're still able to serve the correct content even when the page title changes later.
So think about a situation like this: a user creates a page, it receives Id=3 and its title is My great title. From that information the URL is generated, e.g. http://www.domain.com/page/3/my-great-title. After 2 months the user changes the title to This title is better than the last one!. The URL changes as well, to http://www.domain.com/page/3/this-title-is-better-than-the-last-one. However, there is still 3 within the URL, so you're able to show the right content! You can also check whether the rest of the URL is current, and redirect (a 301 would be best) to the new one to let search engines know that the URL changed.
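A rough sketch of that scheme in PHP, assuming the web server rewrites /page/{id}/{slug} to page.php, and where get_page() and slugify() are hypothetical helpers you would provide:
<?php
// Hypothetical front controller for /page/{id}/{slug}.
// Assumes the web server rewrites that path to page.php?id=...&slug=...

$id   = (int) ($_GET['id'] ?? 0);
$slug = $_GET['slug'] ?? '';

$page = get_page($id); // look the record up by its numeric id (hypothetical helper)
if ($page === null) {
    http_response_code(404);
    exit;
}

$currentSlug = slugify($page['title']); // slug built from the current title (hypothetical helper)
if ($slug !== $currentSlug) {
    // The title changed since the link was made: 301 to the canonical URL
    header('Location: /page/' . $id . '/' . $currentSlug, true, 301);
    exit;
}

echo $page['content'];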

How to place search query in the URL?

With a lot of search engines, you can find the string you are searching for in the URL.
However, http://drugcompare.destinationrx.com/Home.aspx does not let me do this. When I search something, the resulting URL is http://drugcompare.destinationrx.com/DrugCompare.aspx no matter what.
Is there any way I can find out whether I can search the website by adding something to the end of the URL, like "?query=searchstring" instead of using the form provided on the page? Basically I need a unique URL.
The website you pointed at uses POST to send the data for its search query, which means you won't be able to see it in or append it to the URL bar. The reason for that is either security, or that the search query it generates is a complex object or is too long and does not fit in a URL. Websites such as search engines use GET; with those you can append your search query to the URL by following the syntax they generate.
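For illustration only, the difference looks like this on the form side (the action URL and field name are made up); a site that uses the POST variant cannot be queried just by editing the URL:
<!-- method="get": the query ends up in the URL, e.g. /search?query=aspirin -->
<form method="get" action="/search">
    <input type="text" name="query">
    <button type="submit">Search</button>
</form>

<!-- method="post": the same data travels in the request body, so the URL stays /search -->
<form method="post" action="/search">
    <input type="text" name="query">
    <button type="submit">Search</button>
</form>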

Best non-interactive approach to enter a string into a formular field and get the resulting text

On a website to which I have access, there are some input fields. In the sixth field I need to enter an input string from a list of 10000 strings; a new page then appears, for which I would just need to count the number of lines. Finally I would like to get a table with two columns: input string and number of resulting lines. Since I would otherwise have to manually enter the info for all 10000 different strings, I wonder what the best approach is to enter a string into a generic form field and get the resulting text. I have heard about curl, but I am not sure whether it is the easiest approach.
P.S.
Example of the interactive way: I type some string of words into Google search and then I get a new page with the search results. Since I previously entered my Google username and password, the results will probably be filtered according to my profile.
Example of the non-interactive way: a script somehow supplies my user information and search query, and saves the search results to a text file. Imagine the same idea but for a more complicated website like this.
What you want to do is send an HTTP POST with specific data. This can be done with any proper HTTP client code, such as libcurl (or the pycurl binding, or even the curl command-line tool). In the response to the POST you probably get a redirect and then the results, or you need to do a separate request for the results; then you're done and go back to do the next POST. Repeat until all POSTs are done.
What you may need to take into account is that you may have to deal with cookies and possibly follow a redirect from the POST. A good approach is to record a "manual session" as done with a browser (use Firebug or LiveHTTPHeaders etc.) and then use that recording to help you repeat the same thing with an HTTP client.
A decent tutorial to get some starting up details on this kind of work can be found here: http://curl.haxx.se/docs/httpscripting.html
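As a rough sketch of what that could look like with PHP's curl extension (the endpoint URL, the field name, and the file names here are assumptions; record a browser session as suggested above to find the real values, and add any extra form fields the site expects):
<?php
// Rough sketch using PHP's curl extension.

$strings = file('strings.txt', FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES);

$ch = curl_init();
curl_setopt_array($ch, [
    CURLOPT_URL            => 'https://example.com/form-handler', // assumed endpoint
    CURLOPT_POST           => true,
    CURLOPT_RETURNTRANSFER => true,
    CURLOPT_FOLLOWLOCATION => true,          // follow the redirect after the POST
    CURLOPT_COOKIEJAR      => 'cookies.txt', // keep the session between requests
    CURLOPT_COOKIEFILE     => 'cookies.txt',
]);

foreach ($strings as $s) {
    curl_setopt($ch, CURLOPT_POSTFIELDS, http_build_query(['field6' => $s])); // assumed field name
    $body  = curl_exec($ch);
    $lines = ($body === false) ? 0 : substr_count($body, "\n") + 1;
    echo $s . "\t" . $lines . "\n"; // two-column output: input string, line count
}
curl_close($ch);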
You could also use JMeter to run all the POSTs. You can use the CSV input to supply the 10000 strings, then save the results as XML and extract the necessary data.

How can I make clean search urls?

If I have a search with a lot of different options, the URL becomes very long and looks very bad. Is there any way to make the URLs look better? Using POST for the search would keep the URLs clean, but then people couldn't share search URLs.
Try doing an advanced search with many options on Google: the URL is long and not especially human-readable. I really don't think that's a problem; I don't think many people read URLs often. If you expect people to share search results, then show a button on the search results page that will generate a tinyURL-style shortened URL for that particular query.
A POST is meant for something that changes server state (e.g. a database update) and really shouldn't be used for a search.
You can encode all of your search criteria into something like a hash and then have a single parameter in your querystring that has that value:
http://www.mysearch.com/?query=2esd32d2csg3fasfdlkjSDDFdskjsEWFsDFFR39fdf
I'm not sure exactly how you'd encode everything, but it wouldn't be too difficult.
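One possible sketch of that idea, packing the criteria into a single URL-safe token (a reversible encoding rather than a true hash); the parameter name q and the criteria keys are only illustrative:
<?php
// base64url-encode a JSON payload so the whole search fits in one parameter.

function encode_search(array $criteria): string {
    return rtrim(strtr(base64_encode(json_encode($criteria)), '+/', '-_'), '=');
}

function decode_search(string $token): array {
    $json = base64_decode(strtr($token, '-_', '+/'));
    return json_decode($json, true) ?: [];
}

$url = '/search?q=' . encode_search(['term' => 'laptops', 'max_price' => 800, 'sort' => 'price']);
// $url now looks something like /search?q=eyJ0ZXJt...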
Do the different options actually need to be in the URL? For example, a quick search from my Firefox search window gives a URL like:
http://www.google.com/search?q=search&ie=utf-8&oe=utf-8&aq=t&rls=org.mozilla:en-US:official&client=firefox-a
If I'm sending the link to anyone, I habitually cut off everything after q=search. Why not have the URL be the bare minimum that you need to send the link to someone (or bookmark), and make the rest as invisible POST variables?
