What database does Apache Nutch use for storing URLs?

I tried to look into its dependencies (see here), but I failed to figure out what it uses for storing URLs and tracking the progress of the crawl. Judging by the tutorial requirements (see here), it doesn't need any third-party system, like an SQL database.
So what does it use?
Thanks for any suggestion!

Nutch 1.x stores its data in Hadoop MapFiles and SequenceFiles. Apache Nutch is a batch-based crawler, and the data is
either write-once/read-many, as for the segments created and filled in every crawl cycle,
or rewritten when new data is added: the "CrawlDb", which holds the URLs and status information (fetch status and date, signature/checksum, score, metadata). A rough sketch of reading the CrawlDb directly follows at the end of this answer.
Nutch 2.x (retired) put all data into a single "web table" - with scale-up and distribution delegated to big data stores (HBase, etc.) via Apache Gora.
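For illustration, here is a rough sketch of reading a Nutch 1.x CrawlDb directly with the Hadoop APIs. The crawl path and part directory name are assumptions (list crawldb/current/ to find yours); each CrawlDb part is a MapFile whose data file is a SequenceFile of Text (URL) keys and CrawlDatum values.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;
import org.apache.nutch.crawl.CrawlDatum;
import org.apache.nutch.util.NutchConfiguration;

public class CrawlDbScan {
  public static void main(String[] args) throws Exception {
    Configuration conf = NutchConfiguration.create();
    FileSystem fs = FileSystem.get(conf);
    // Assumed location of one CrawlDb part; real crawls may have several parts.
    Path data = new Path("crawl/crawldb/current/part-00000/data");
    SequenceFile.Reader reader = new SequenceFile.Reader(fs, data, conf);
    try {
      Text url = new Text();
      CrawlDatum datum = new CrawlDatum();
      // Each entry is a URL plus its fetch status, score and metadata.
      while (reader.next(url, datum)) {
        System.out.println(url + "\t"
            + CrawlDatum.getStatusName(datum.getStatus())
            + "\tscore=" + datum.getScore());
      }
    } finally {
      reader.close();
    }
  }
}
```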

Related

Azure CDN "Ignore query strings" purpose

I know what the difference is between Azure CDN query string modes, and I have read a helpful example of query string modes, but...
I don't understand the purpose of "Ignore query strings" or how it can be useful on a real dynamic website.
For example, suppose we have a product purchase website with a URL similar to www.myweb.com/products?id=3
If we use "Ignore query strings"... Does this mean that if a user later requests product 4 (www.myweb.com/products?id=4), they will receive the page for product 3?
I think I'm not understanding Azure CDN correctly. I see Azure CDN as a CDN for dynamic content; however, it seems Azure CDN is only used for static content, as this article explains:
Standard content delivery network (CDN) capability includes the ability to cache files closer to end users to speed up delivery of static files.
Is this correct? Any help or example on the subject is welcome.
Yes. If you have selected the "Ignore query strings" query string caching behavior (this is the default), then in your case, after the initial request to www.myweb.com/products?id=3, that POP server will serve the same cached content for subsequent requests, no matter the query string value, until its cache period expires.
As for the second question, a CDN is all about serving static files. To my understanding, what the article refers to is dynamic site acceleration, which is a bunch of techniques to optimize the delivery performance of dynamic websites' content. Unlike static websites, dynamic websites load their assets (static files, e.g. images, JS, CSS, HTML) dynamically based on user behavior.
Now that I understand it better, I will answer my own question:
Azure CDN - Used to cache static content, even on dynamic web pages.
For the example in the question, all products must download the same JavaScript and CSS content; Azure CDN is used for those types of files. A real example using "Ignore query strings":
User A requests www.myweb.com/products?id=3: jquery-versionX.js and mystyles.css are not cached yet, so the origin server is requested and the user receives them.
User B requests www.myweb.com/products?id=4: since we are using "Ignore query strings", jquery-versionX.js and mystyles.css are already cached, so they are served to the user without requesting them from the origin server again.
User C requests www.myweb.com/products?id=3: since we are using "Ignore query strings", jquery-versionX.js and mystyles.css are already cached, so they are served to the user without requesting them from the origin server again.
Redis or similar - Used to cache dynamic content (queries to databases, for example).
For the example in the question, all the products have different information, which is obtained by doing a database query. We can store those query results or JSON objects in a Redis cache. A real example:
User A requests www.myweb.com/products?id=3: product 3 is not cached, so it is requested from the server and received by the user.
User B requests www.myweb.com/products?id=4: product 4 is not cached, so it is requested from the server and received by the user.
User C requests www.myweb.com/products?id=3: product 3 is now cached, so the server is not requested and the user receives it from the cache.
Summary:
Both methods can be used simultaneously: Azure CDN for static content, and Redis or similar for dynamic content. A minimal Redis caching sketch follows below.
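To make the dynamic-content side concrete, here is a minimal cache-aside sketch with Redis using the Jedis client. The product lookup, key format and TTL are hypothetical, just to illustrate the pattern.

```java
import redis.clients.jedis.Jedis;

public class ProductCache {
  private final Jedis jedis = new Jedis("localhost", 6379); // assumed Redis host/port

  // Hypothetical lookup: try the cache first, fall back to the database,
  // then store the result with a TTL so later requests skip the database.
  public String getProductJson(int productId) {
    String key = "product:" + productId;
    String cached = jedis.get(key);
    if (cached != null) {
      return cached;                           // cache hit: no database query
    }
    String json = loadProductFromDb(productId); // cache miss: query the database
    jedis.setex(key, 300, json);                // cache for 5 minutes
    return json;
  }

  private String loadProductFromDb(int productId) {
    // Placeholder for the real database query.
    return "{\"id\": " + productId + ", \"name\": \"Product " + productId + "\"}";
  }
}
```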

Spark results accessible through API

We would really like some input on how the results of a Spark query can be made accessible to a web application. Given how widely Spark is used in the industry, I would have thought this part would have lots of answers/tutorials, but I didn't find anything.
Here are a few options that come to mind:
Spark results are saved in another DB (perhaps a traditional one), and a request for a query returns the new table name, which is then accessed through a paginated query. That seems doable, although a bit convoluted, as we need to handle the completion of the query.
Spark results are pumped into a messaging queue, from which a socket-server-like connection is made.
What confuses me is that other connectors to Spark, like those for Tableau, which use something like JDBC, should have all the data (not just the top 500 rows that we typically get via Livy or other REST interfaces to Spark). How do those connectors get all the data through a single connection?
Can someone with expertise help clarify this?
The standard way I think would be to use Livy, as you mention. Since it's a REST API you wouldn't expect to get a JSON response containing the full result (could be gigabytes of data, after all).
Rather, you'd use pagination with ?from=500 and issue multiple requests to get the number of rows you need. A web application would only need to display or visualize a small part of the data at a time anyway.
But from what you mentioned in your comment to Raphael Roth, you didn't mean to call this API directly from the web app (with good reason). So you'll have an API layer that is called by the web app and which then invokes Spark. But in this case, you can still use Livy+pagination to achieve what you want, unless you specifically need to have the full result available. If you do need the full results generated on the backend, you could design the Spark queries so they materialize the result (ideally to cloud storage) and then all you need is to have your API layer access the storage where Spark writes the results.
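As a rough sketch of the pagination idea, here is how the web app (or its API layer) could page through results with Java's built-in HTTP client. The endpoint URL and the from/size parameters are assumptions for illustration, not a documented Livy contract.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class PagedResultClient {
  public static void main(String[] args) throws Exception {
    HttpClient client = HttpClient.newHttpClient();
    // Hypothetical endpoint exposed by the API layer in front of Spark/Livy.
    String base = "http://api.example.com/queries/42/rows";
    int pageSize = 500;

    for (int from = 0; ; from += pageSize) {
      HttpRequest request = HttpRequest.newBuilder(
              URI.create(base + "?from=" + from + "&size=" + pageSize))
          .GET()
          .build();
      HttpResponse<String> response =
          client.send(request, HttpResponse.BodyHandlers.ofString());
      String body = response.body();

      // Stop when the page is empty; the exact condition depends on the
      // response format of your API layer.
      if (body.isEmpty() || "[]".equals(body)) {
        break;
      }
      System.out.println("rows " + from + ".." + (from + pageSize) + ": "
          + body.length() + " bytes");
    }
  }
}
```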

Pass metadata along with seed URLs with the Nutch 1.X REST API

I'm currently trying to include the seed URL in the data indexed for each URL in my search backend (currently Elasticsearch).
I've seen in this previous question that metadata can be passed with each seed, which could suit my needs. However, I'm using the REST API to create my seed list, and it seems that metadata isn't allowed in the seedUrls parameter.
Has anybody tried to do this with the REST API?
Is there another way to achieve this?
I thought I could write a custom IndexingFilter to add the seed URL to the NutchDocument to be indexed, but at that point the seed URL is no longer available, from what I've seen.
Thanks in advance!
At the moment the REST API doesn't seem to support handling associated metadata. I believe this wouldn't require a great effort to accomplish: basically, we just need to handle the JSON payload, customize the corresponding SeedUrl entity to hold the metadata, and of course customize the writeToSeedFile method.
Your approach of writing an IndexingFilter wouldn't work, though. The seed URLs are injected at the very beginning of the crawl life cycle, and indexing filters are only responsible for choosing what gets indexed into your storage.

Using CrawlDbReader to read Nutch Crawl Data

I am using Nutch 1.4 to implement a focused crawler. Can anyone tell me how to use the Nutch CrawlDbReader, LinkDbReader and SegmentReader APIs in my JSP program so that I can create a custom UI for my project?
Specifically, I need to issue commands like readdb, readseg, etc. against the crawl data and get the output through a browser.
Is there something special about these APIs that makes this more than a "pass data from server to client" issue?
You can use the APIs to get the data. Just look at how they are used by nutch.sh and how the main() methods are built, and do something similar. Then pass the data to the client either as XML, JSON, or any other format.
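As a sketch of that advice, here is roughly what a "readdb &lt;crawldb&gt; -url &lt;url&gt;" style lookup could look like in the backend of a JSP/servlet app, with the result serialized to JSON for the browser. The crawl path and part directory name are assumptions; a real implementation would check every part-* directory, as CrawlDbReader does, and use a proper JSON library.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.io.MapFile;
import org.apache.hadoop.io.Text;
import org.apache.nutch.crawl.CrawlDatum;
import org.apache.nutch.util.NutchConfiguration;

public class CrawlDbLookup {
  // Keyed lookup of one URL in a CrawlDb MapFile part directory.
  public static String lookupAsJson(String crawlDbPart, String url) throws Exception {
    Configuration conf = NutchConfiguration.create();
    FileSystem fs = FileSystem.get(conf);
    MapFile.Reader reader = new MapFile.Reader(fs, crawlDbPart, conf);
    try {
      CrawlDatum datum = new CrawlDatum();
      if (reader.get(new Text(url), datum) == null) {
        return "{\"url\": \"" + url + "\", \"found\": false}";
      }
      // Hand-rolled JSON just for the sketch.
      return "{\"url\": \"" + url + "\","
          + " \"status\": \"" + CrawlDatum.getStatusName(datum.getStatus()) + "\","
          + " \"fetchTime\": " + datum.getFetchTime() + ","
          + " \"score\": " + datum.getScore() + "}";
    } finally {
      reader.close();
    }
  }

  public static void main(String[] args) throws Exception {
    // Assumed part directory; adjust to your crawl layout.
    System.out.println(lookupAsJson("crawl/crawldb/current/part-00000",
                                    "http://example.com/"));
  }
}
```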

Fetching only website details as a search engine does

I have to fetch website details as a search engine does. I need the description of the site, its link, and some other info about it, and I will store these in my DB. Are there any libraries available for doing this? Please remember that I can crawl a whole webpage, but I only need the information in the format crawled by search engines.
Thanks,
Karthik
Which language? APIs and bindings exist for reading webpage content (a Java example with jsoup is sketched below). Do you realize the scale of the task if you wish to create a new 'search engine'? Your question is so generic, there's not a lot of advice that can be given, other than:
Respect robots.txt
Don't hammer the server with requests, you'll soon get your IP blocked by sensible sysadmins.
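If Java is an option, a library like jsoup can fetch a page and extract the title and meta description, which is roughly the snippet information search engines show. A minimal sketch (the URL and user agent string are placeholders):

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class PageSummary {
  public static void main(String[] args) throws Exception {
    String url = "https://example.com/"; // placeholder URL
    Document doc = Jsoup.connect(url)
        .userAgent("MyCrawler/1.0")      // identify your crawler politely
        .get();

    String title = doc.title();
    // The meta description is what search engines often show as the snippet.
    String description = doc.select("meta[name=description]").attr("content");

    System.out.println("link: " + url);
    System.out.println("title: " + title);
    System.out.println("description: " + description);
  }
}
```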
