I have to fetch website details the way a search engine does. I need the site's description, its link, and some other info about it, which I will store in my DB. Are there any libraries available for doing this? Please note that I can crawl a whole webpage, but I only need the information in the form that search engines extract.
Thanks,
Karthik
Which language? APIs and bindings exist for reading webpage content. Do you realize the scale of the task if you wish to create a new 'search engine'? Your question is so generic that there's not a lot of advice to give, other than:
Respect robots.txt
Don't hammer the server with requests, or you'll soon get your IP blocked by sensible sysadmins.
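If all you need is the title and description a search engine would show, a minimal sketch along these lines may be enough (assuming Node.js 18+ for the built-in fetch and the third-party cheerio library; the URL is only a placeholder):

// Minimal sketch: fetch a page and pull out the bits a search engine would show.
// Assumes Node.js 18+ (built-in fetch) and the third-party "cheerio" package.
const cheerio = require('cheerio');

async function fetchPageSummary(url) {
  const res = await fetch(url, { headers: { 'User-Agent': 'MyCrawler/0.1' } });
  const html = await res.text();
  const $ = cheerio.load(html);

  return {
    url,
    title: $('title').first().text().trim(),
    // The snippet search engines show usually comes from this meta tag (or Open Graph).
    description:
      $('meta[name="description"]').attr('content') ||
      $('meta[property="og:description"]').attr('content') ||
      '',
  };
}

// Example usage (placeholder URL):
fetchPageSummary('https://example.com').then(console.log).catch(console.error);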
I'm relatively new to webdev, and essentially I'm trying to create functionality that allows users to create their own event pages (similar to Facebook Events, Meetup etc.).
My question is to do with the surrounding architecture of this, namely:
Given each event page will have its own url, do you need a separate file for each event in the backend?
Or is there some clever way of templating and directing traffic? (Sorry if this is vague; I'm trying to work out whether there is a better way of doing this.)
I notice that most of the FB/Meetup events all have their own URLs, does this mean that they all have separate files in the backend?
I'm using Node.js for the backend, btw.
I've been Googling around but haven't been able to figure it out, I think I might be using the wrong wording... so even a point in the direction of the right wording would be much appreciated!
Thanks y'all
Generally this is handled with a database-driven solution. The "pages" are dynamic, and the URL contains a parameter (see #Konrad's comment) that is used to look the page up in a database; the dynamic content is then loaded into a single template, which handles the complexity of making each page seem unique.
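As a rough illustration (assuming Express; the route, slugs, and in-memory lookup are placeholders for your real routing and database), one handler and one template can serve every event URL:

// One route serves every event page: the slug in the URL is just a lookup key,
// so a single handler serves all events - no per-event file on the backend.
// Sketch only; the in-memory "events" object stands in for a real database query.
const express = require('express');
const app = express();

const events = {
  'summer-meetup': { title: 'Summer Meetup', description: 'An example event.' },
};

app.get('/events/:slug', (req, res) => {
  const event = events[req.params.slug]; // in a real app: a DB lookup by slug
  if (!event) return res.status(404).send('Event not found');
  // Render one template with the event's data instead of a hand-written page.
  res.send(`<h1>${event.title}</h1><p>${event.description}</p>`);
});

app.listen(3000);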
So, I have a website where I need to insert a lot of products from our suppliers' websites, and I would like to automate it.
I was reading this question, which would solve most of the "front-end" part; on the backend I then only need to get the link of the page and use a basic web-crawler library to search that page for the given selector.
However, I'm missing the part where I inject a JS library into the supplier's website in order to allow others to select the element with the mouse, rather than finding it using something like the Chrome console.
I was thinking about fetching the source page on the server side and then returning it, but all the resources (CSS, JS, images) would be blocked by CSRF protection, so here's my question:
Are there ways to do this in any way?
Edit:
OK, maybe it's not so clear what I want to achieve... here's how things should go:
Someone decides to add a new supplier: they set the CSS path for the product name, the CSS path for the product description, and so on (this is the part where I'd appreciate suggestions), and then they only need to insert the product links and the server will fetch the information using the CSS paths entered earlier.
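To illustrate the server-side part only, here is a hedged sketch (assuming Node.js 18+ and cheerio; the URL and CSS selectors are made-up examples of what a user might have saved for a supplier):

// Sketch: given a product URL and the CSS paths a user saved for this supplier,
// fetch the page server-side and extract the fields. Selectors are hypothetical.
const cheerio = require('cheerio');

async function scrapeProduct(url, selectors) {
  const res = await fetch(url); // Node.js 18+ built-in fetch
  const html = await res.text();
  const $ = cheerio.load(html);

  return {
    name: $(selectors.name).first().text().trim(),
    description: $(selectors.description).first().text().trim(),
    price: $(selectors.price).first().text().trim(),
  };
}

// Example usage with made-up selectors stored for one supplier:
scrapeProduct('https://supplier.example.com/product/123', {
  name: 'h1.product-title',
  description: 'div.product-description',
  price: 'span.price',
}).then(console.log).catch(console.error);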
I have a very interesting requirement that I am not too sure of the answer. I am turning to Stack Overflow in the hope that someone is able to share their experiences and propose a solution.
Setup
I have a front-facing website that is powered by Ghost, running in a standard MEAN-stack environment, with all traffic handled via CloudFlare.
Problem
I have recently become aware that I am receiving a large number of requests in the CloudFlare dashboard that do not appear in my Google Analytics. I am aware that some people may have JS disabled; however, we are talking about an orders-of-magnitude difference between the two. I would very much like to know why.
Hypothesis
I suspect that someone is port scanning or attempting to find vulnerabilities in my platform. Or it could be a simple case of links going astray. Either way, I am not sure.
Solutions
This is the part I am not sure about. What would be the best approach to record and retain HTTP requests? One consideration is to use Morgan to stream requests into a .log file and review it at a later date. However, I wonder if there is a more elegant solution.
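Roughly, what I have in mind with Morgan would look like this (a sketch assuming an Express app; access.log is an arbitrary path):

// Sketch of the Morgan idea: append every HTTP request to a log file for later review.
// Assumes an Express app; "access.log" is an arbitrary path.
const express = require('express');
const morgan = require('morgan');
const fs = require('fs');
const path = require('path');

const app = express();

// Open the log file in append mode and stream request lines into it.
const accessLogStream = fs.createWriteStream(path.join(__dirname, 'access.log'), { flags: 'a' });
app.use(morgan('combined', { stream: accessLogStream }));

app.get('/', (req, res) => res.send('ok'));
app.listen(3000);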
I welcome any thoughts you may have.
Thanks
Google Analytics is a fair bit more conservative than Cloudflare. One reason, as you mentioned, is that Cloudflare has access to the raw HTTP logs, instead of having to use JavaScript to identify page views. As Cloudflare only counts HTTP requests, port scanning would not be recorded as a hit.
However, even with bots accounted for, Cloudflare may still record views that Google Analytics can't, for example AJAX content requests. As the Google Analytics beacon is only run once, when the page is loaded, Google Analytics records this as one view, while Cloudflare sees it as 2 HTTP requests in its raw logs.
For details, please see the following blog post; it goes into detail about how Google Analytics and Cloudflare Analytics can differ: Understanding Analytics: When Is a Page View Not a Page View?
Google Webmaster Tools offers several methods to verify ownership of a website: meta tags, DNS records, linking to a Google Analytics account, or uploading an HTML file to the server. My website has already been verified through the HTML file method, but I'd like to make my verification with Google more resilient (yes, they do actually recommend more than one method of verification). I don't want to make our usage of Google any more public than it already is, so adding meta tags is out of the picture, as is using a Google Analytics account, since we don't utilize that for visitor reporting.
This brings up my original question, if I choose to add a DNS record in the form of the following:
TXT Record google-site-verification=xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
How would adding this TXT record affect site loading times and overall performance, especially for new visitors who must perform a fresh DNS lookup? Substantial or marginal?
Most likely marginal, but we're pinching pennies here and trying to squeeze every last bit of optimization out of our server box. Any feedback and/or your own speed tests would be more than welcome!
Typical users will never see the TXT records; they only request A (or AAAA) records to access your services. People interested in the TXT record need to ask for it explicitly:
dig TXT your.fqdn.com
So there is no effect on site load times.
How can I check whether a certain page is being accessed by a crawler, or by a script that fires continuous requests?
I need to make sure that the site is only being accessed from a web browser.
Thanks.
This question is a great place to start:
Detecting 'stealth' web-crawlers
Original post:
It would take a bit of engineering to build a solution for this.
I can think of three things to look for right off the bat (a rough sketch follows the list):
One, the user agent. If the spider is Google or Bing or anything else reputable, it will identify itself.
Two, if the spider is malicious, it will most likely emulate the headers of a normal browser. Fingerprint it: if it claims to be IE, use JavaScript to check for an ActiveX object.
Three, take note of what it's accessing and how regularly. If the content takes the average human X seconds to view, you can use that as a starting point when trying to determine whether it's humanly possible to consume the data that fast. This is tricky; you'll most likely have to rely on cookies, since an IP can be shared by multiple users.
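A rough sketch of checks one and three (assuming a Node.js/Express app; the bot pattern and the timing threshold are arbitrary examples, not tuned values):

// Rough sketch: flag requests whose user agent looks like a bot, and requests
// that arrive faster than a human plausibly could. Pattern/threshold are arbitrary.
const express = require('express');
const app = express();

const BOT_PATTERN = /bot|crawler|spider|crawling/i; // matches Googlebot, Bingbot, etc.
const MIN_SECONDS_BETWEEN_PAGES = 2; // the "X seconds" from point three - tune for your content
const lastSeen = new Map(); // keyed by IP here; cookies would be more reliable

app.use((req, res, next) => {
  const ua = req.get('User-Agent') || '';
  const now = Date.now();
  const previous = lastSeen.get(req.ip);
  lastSeen.set(req.ip, now);

  const looksLikeBot = BOT_PATTERN.test(ua) || ua === '';
  const tooFast = previous !== undefined && (now - previous) / 1000 < MIN_SECONDS_BETWEEN_PAGES;

  if (looksLikeBot || tooFast) {
    // Log it, serve a lighter page, show a CAPTCHA, etc. - whatever fits your site.
    console.log(`Possible automated client: ip=${req.ip} ua="${ua}"`);
  }
  next();
});

app.get('/', (req, res) => res.send('hello'));
app.listen(3000);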
You can use the robots.txt file to block access to crawlers, or you can use JavaScript to detect the browser agent and switch based on that. If I understood correctly, the first option is more appropriate, so:
User-agent: *
Disallow: /
Save that as robots.txt at the site root, and no well-behaved automated system should crawl your site.
I had a similar issue in my web application: I created some bulky data in the database for each user that browsed the site, and crawlers were causing loads of useless data to be created. However, I didn't want to deny access to crawlers, because I wanted my site indexed and found; I just wanted to avoid creating the useless data and to reduce the time taken to crawl.
I solved the problem in the following ways:
First, I used the HttpBrowserCapabilities.Crawler property from the .NET Framework (available since 2.0), which indicates whether the browser is a search-engine web crawler. You can access it from anywhere in the code:
ASP.NET C# code behind:
bool isCrawler = HttpContext.Current.Request.Browser.Crawler;
ASP.NET HTML:
Is crawler? = <%=HttpContext.Current.Request.Browser.Crawler %>
ASP.NET Javascript:
<script type="text/javascript">
var isCrawler = <%=HttpContext.Current.Request.Browser.Crawler.ToString().ToLower() %>
</script>
The problem with this approach is that it is not 100% reliable against unidentified or masked crawlers, but it may be useful in your case.
After that, I had to find a way to distinguish between automated robots (crawlers, screen scrapers, etc.) and humans, and I realised that the solution required some kind of interactivity, such as clicking a button. Some crawlers do process JavaScript, and it is very likely they would fire the onclick event of a button element, but not that of a non-interactive element such as a div. The following is the HTML/JavaScript code I used in my web application www.so-much-to-do.com to implement this feature:
<div
class="all rndCorner"
style="cursor:pointer;border:3;border-style:groove;text-align:center;font-size:medium;font-weight:bold"
onclick="$TodoApp.$AddSampleTree()">
Please click here to create your own set of sample tasks to do
</div>
This approach has been working impeccably so far, although crawlers could be made even more clever, maybe after reading this article :D