Is it possible to have different fetch intervals in Nutch?

Is it possible to use a different fetch interval for each URL that I have listed, or for a group of URLs?
If not, is there a command that I can use to fetch a URL whenever I want (that way I could use a cron job or a daemon)?

If the fetch interval is set for a seed URL (i.e. one defined in the seed file), you can use the metadata portion of the inject step (https://github.com/apache/nutch/blob/master/src/java/org/apache/nutch/crawl/Injector.java#L69-L72) to control how your seed links will be fetched. The discovered links will still have their own scheduling, but perhaps you can write something that propagates nutch.fetchInterval or nutch.fetchInterval.fixed to the outlinks of your seeds so that all the links on the same host share the same fetch interval (or apply your own algorithm).
That said, you can also write your own custom fetch schedule (similar to the ones bundled with Nutch: mime-type/default/adaptive) that implements your custom logic.
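As an illustration, per-URL metadata in the seed file is written as key=value pairs separated from the URL by tabs (shown here as whitespace), with intervals given in seconds; the URLs below are made up:

    http://example.com/news/     nutch.fetchInterval=86400
    http://example.com/archive/  nutch.fetchInterval.fixed=2592000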

Related

Trouble with GET request to Copy Data from REST API to Data Lake

I will provide some context: my pipeline makes a GET Request to a REST API (Auth type: OAuth2 Client Credential) in order to import data to the Data Lake (ADLSGen2) in parquet file format. Later, a Stored Procedure creates a View which includes every file in a predefined directory.
I intend to request data from the API on an hourly basis (or maybe every 30 minutes) in order to get the information for the previous hour. The thing is: the response contains almost 36 million records per hour.
In the body of the response there is no reference to the number or the total of pages. There is only data (keys and values).
On the other hand, the Headers include "first-page" and "next-page" (this one appears only if there are further pages in the response, but also makes no reference to the total of pages).
I was wondering if there are any useful suggestions to make my Copy Data activity work differently. Right now, and because of what I mentioned above, the pagination rule is set to RFC5988. I would like my requested data to be partitioned in some way.
Also, I was wondering if there is another way to approach this issue (like using another activity, for example).
Thanks!
Mateo
You need to replace the Header placeholder with your header name (Link).
Or you can reference the header directly as dynamic content.
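For reference, RFC 5988 ("web linking") pagination is driven by a Link response header of roughly this shape (the URLs and rel values below are illustrative):

    Link: <https://api.example.com/records?page=2>; rel="next", <https://api.example.com/records?page=1>; rel="first"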

What is the best practice for storing rarely modified database values in NodeJS?

I've got a Node app that works with Salesforce for a few different things. One of the features is letting users fill in a form and push it to Salesforce.
The form has a dropdown list, so I query Salesforce to get the list of available dropdown items and make them available to my form via res.locals. Currently I'm getting these values via some middleware and storing them in the user's session: if the session value is set, I use it; if not, I query Salesforce and pull the values in.
This works, but it means every user's session data in Mongo holds a whole bunch of picklist values (they are the same for all users). I very rarely change the values on the Salesforce side of things, so I'm wondering if there is a "proper" way of storing these values in my app.
I could pull them into a Mongo collection and trigger a manual refresh whenever they change. I could expire them in Mongo, but realistically, if they do need to change, it's because someone needs the new values immediately, so I'm not sure that makes the most sense...
Is storing them in everyone's session the best way to tackle this, or is there something else I should be doing?
To answer your question quickly: you could add them to a singleton object (instead of session data, which is per user), though I'm not sure how you would manage their lifetime (i.e. pull them again when they change). A singleton can be implemented as a simple module that you require and that returns a plain object...
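A minimal sketch of such a module, assuming CommonJS and a loader function standing in for your existing Salesforce query:

    // picklist-cache.js -- module-level (singleton) cache.
    // Node caches the module after the first require(), so all requests share `cached`.
    let cached = null;

    async function getPicklistValues(loadFromSalesforce) {
      if (!cached) {
        cached = await loadFromSalesforce();   // hit Salesforce only on a cold cache
      }
      return cached;
    }

    function invalidate() {
      cached = null;                           // call this when the values change in Salesforce
    }

    module.exports = { getPicklistValues, invalidate };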
But if I was to do something like this, I would go about doing it differently:
I would create an API endpoint that returns your list data (possibly giving it query parameters to return different lists).
If you can afford the data being outdated for a short period of time, you can write your API so that it returns cacheable responses (HTTP caching headers with a short lifetime).
If your data has to be fresh in real time, then your API should return an ETag header in its response. The ETag basically acts like a checksum for your data; a good checksum would be the latest "last updated" date across all the records in the collection. Upon receiving a request, you check whether it carries the "If-None-Match" header, which holds the checksum the client already has. At that point you do a "lite" call to your database to pull just the checksum: if it matches, you return HTTP 304 (Not Modified); otherwise you pull the full data and return it (alongside the new checksum in the response ETag). Basically you are letting the browser do the caching (a sketch of this flow follows below)...
Note that you can also combine the two caching approaches above and use them together.
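A minimal sketch of the ETag flow with Express; the endpoint path and the two stub functions are made up and stand in for your real Salesforce/Mongo access code:

    const express = require('express');
    const app = express();

    // Stubs standing in for your real data access:
    async function getPicklistChecksum() { return '"2024-05-01T10:00:00Z"'; }  // cheap "last updated" lookup
    async function getPicklistValues() { return ['Option A', 'Option B']; }    // full (expensive) query

    app.get('/api/picklist', async (req, res) => {
      const checksum = await getPicklistChecksum();
      if (req.get('If-None-Match') === checksum) {
        return res.status(304).end();            // client copy is still valid, nothing else to send
      }
      res.set('ETag', checksum);                 // new checksum travels back with the data
      res.json(await getPicklistValues());
    });

    app.listen(3000);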
More resources on this here:
https://devcenter.heroku.com/articles/increasing-application-performance-with-http-cache-headers
https://developers.facebook.com/docs/marketing-api/etags

Apache Nutch: Get list of URLs and not content from the entire web

I'm very new to Apache Nutch. My goal is to start from a list of seed URLs and extract as many URLs (and sub-URLs) as I can within a size limit (say no more than 1 million URLs or less than 1 TB of data) using Nutch. I do not need the content of the pages, I only need to save the URLs. Is there any way to do this? Is Nutch the right tool?
Yes, you could use Nutch for this purpose; essentially Nutch does all of what you want.
You need to parse the fetched HTML either way (in order to discover new links and, of course, repeat the process). One way to go would be to dump the LinkDB that Nutch keeps into a file using the readlinkdb command. Or you could use the indexer-links plugin that is available for Nutch 1.x to index your inlinks/outlinks into Solr/ES.
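If you only need the URLs themselves, a plain dump is often enough. A sketch, assuming a Nutch 1.x install and a crawl directory named crawl/:

    bin/nutch readdb crawl/crawldb -dump crawldb-dump      # every known URL with its crawl status
    bin/nutch readlinkdb crawl/linkdb -dump linkdb-dump    # inlink information per URL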
In Nutch you control how many URLs you want to process per round, but this is only loosely related to the amount of fetched data, so you'll need to decide when to stop.

Elasticsearch how to check for a status of a bulk indexing request?

I am bulk indexing into Elasticsearch docs containing country shapes (files here), based on the cshapes dataset.
The geoshapes have a lot of points in "geometry":{"type":"MultiPolygon", and the bulk request takes a long time to complete (and sometimes does not complete, which is a separate and already reported problem).
Since the client times out (I use the official ES Node.js client), I would like to have a way to check the status of the bulk request without having to use enormous timeout values.
What I would like is a status such as active/running, completed or aborted. I guess that just querying a single doc in the batch would not tell me whether the request has been aborted.
Is this possible?
I'm not sure if this is exactly what you're looking for, but may be helpful. Whenever I'm curious about what my cluster is doing, I check out the tasks API.
The tasks API shows you all of the tasks that are currently running on your cluster. It will give you information about individual tasks, such as the task ID, start time, and running time. Here's the command:
curl -XGET http://localhost:9200/_tasks?group_by=parents | python -m json.tool
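If you only care about bulk operations, the same API can be narrowed down with its standard actions and detailed parameters (host is illustrative):

    curl -XGET "http://localhost:9200/_tasks?actions=*bulk*&detailed=true" | python -m json.tool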
Elasticsearch doesn't provide a way to check the status of an ongoing bulk request (documentation reference here).
First, check that your request succeeds with a smaller input, so you know there is no problem with the way you are making the request. Second, try dividing the data into smaller chunks and calling the Bulk API on them in parallel.
You can also try with a higher request_timeout value, but I guess that is something you don't want to do.
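As a sketch of the "smaller chunks in parallel" suggestion: the official Node.js client (assuming @elastic/elasticsearch 7.7 or later) ships a bulk helper that splits the data into flushes and keeps a few of them in flight; the index name and option values below are made up:

    const { Client } = require('@elastic/elasticsearch');
    const client = new Client({ node: 'http://localhost:9200' });

    async function indexCountryShapes(docs) {
      return client.helpers.bulk({
        datasource: docs,                 // any array/iterable/stream of documents
        flushBytes: 1000000,              // many smaller requests instead of one huge one
        concurrency: 3,                   // chunks sent in parallel
        onDocument() {
          return { index: { _index: 'cshapes' } };
        },
        onDrop(failure) {
          console.error('failed to index:', failure);
        },
      });
    }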
Just a side note on why your requests might take so long (unless you are simply indexing too many documents in a single bulk run): if you have configured your own precision for geo shapes, make sure you also configure distance_error_pct; otherwise no error is assumed, resulting in documents with a huge number of terms that take a long time to index.

Generating pages that require complex calculations and data manipulation

What's the best approach for generating a page that is the result of complex calculations/data manipulation/API calls (e.g. 5 minutes per page)? Obviously I can't do the calculation within my Rails web request.
A scheduled task can produce some data, but where should I store it? Should I store it in a postgres table? Should I store it in a document oriented database? Should I store it in memory? Should I generate an html?
I have the feeling of being second-level ignorant about the subject. Is there a well known set of tools to deal with this kind of architectural problem?
Thanks.
I would suggest the following approach:
1. Once you receive the initial request:
When you receive the first request with the input for the calculation, start processing in a separate thread/background job and return some token/unique identifier for the request.
2. Store the result:
Then start the calculation and store the result in memory using some tool like memcached.
3. Poll for the result:
Then the request that fetches the result should keep polling with the generated token/unique request identifier. As Adriano said, you can use AJAX for that (I am assuming the requests come from a web browser); a browser-side sketch follows below.
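A minimal client-side sketch of the polling step; the /calculations/... endpoint and the 200/202 convention are assumptions to adapt to your Rails routes:

    // Poll for the result using the token returned by the initial request.
    async function pollForResult(token, intervalMs = 5000) {
      while (true) {
        const res = await fetch(`/calculations/${token}`);
        if (res.status === 200) {
          return res.json();                 // calculation finished, render the page with this data
        }
        // e.g. 202 Accepted: still running, wait and try again
        await new Promise((resolve) => setTimeout(resolve, intervalMs));
      }
    }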
