Suppose I have configured db.fetch.interval.default to have value 10. Furthermore, suppose I have successfully crawled a website (e.g. http://example.com). At this point, all the URLs in the crawldb will have a fetch interval of 10 days.
The problem: I want to change the fetch interval for one particular URL, say for http://example.com/daily-news/. I want to edit the crawldb to change the fetch interval for http://example.com/daily-news/ to 2 days instead of 10. How can I edit the crawldb?
The CrawlDb is a Hadoop map file which is not supposed to be edited by hand. However, the Nutch "inject" command provides an option -overwrite which allows you to overwrite existing entries and set a custom fetch interval. The URL file should contain (tab-separated):
http://myUrl/ <tab> nutch.fetchInterval=custom_interval_in_sec
For more details please check the command-line help shown by bin/nutch inject. You can then verify the overwritten record using bin/nutch readdb <crawldb> -url <myUrl>. Please also note that the fetch status of the overwritten record is lost, i.e. it is reset to "injected".
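For example, here is a minimal sketch of the workflow; the seed directory (urls/) and CrawlDb path (crawl/crawldb) are placeholders. A seed file urls/seed.txt containing
http://example.com/daily-news/ <tab> nutch.fetchInterval=172800
(172800 seconds = 2 days) can then be injected over the existing record and checked:
bin/nutch inject crawl/crawldb urls/ -overwrite
bin/nutch readdb crawl/crawldb -url http://example.com/daily-news/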
I have a page with this URL: example.com/someurl.
For some reason, I set up a 301 redirect from it to example.com/some-url.
But I would like to make sure that the historical data for the first URL will be associated with my new URL.
Otherwise, if I export my data, I'll have two entries, which looks like two different pages, when actually it is only one page.
Thank you :)
Alas, this doesn't work too well. Your best chance would be to use a virtual pageview and pass the old URL to the pageview tracking call on the new page:
ga("send","pageview","/some-url");
Alternatively, you can create a filter in your GA view settings that rewrites the new URL to the old URL.
This will work for newly incoming hits. Data that has already been collected will not change, and you cannot really consolidate it after the fact.
You can use a filter in your view settings:
Filter Type: Search and Replace
Field: Request URI
Search String: /some-url
Replace String: /someurl
That way, new hits are merged with the historical data.
I am trying to get the recent posts from a particular location, using this URL:
https://api.instagram.com/v1/media/search?lat=34.0500&lng=-118.2500&distance=50&MAX_ID=max_id&access_token=XXXX
So when I use this URL for the first time, I get 20 results. I take the max ID from the list of 20 results and modify my URL.
But when I use the modified URL, I obtain the same result as the first one.
How do I go about solving this?
Contrary to what I thought, the media search endpoint doesn't return a pagination object. Sorry. It also doesn't support the min_id/max_id parameters, which is why you are having problems.
If you want to get different data, you are going to have to use the time-based request parameter min_timestamp. However, it looks like that parameter doesn't work for this endpoint either (even though the documentation says it is supported). Indeed, a quick search on the internet reveals it might be a long-standing bug with the API.
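If you do want to experiment with it anyway, the time-based request would look something like this (min_timestamp is a Unix timestamp in seconds; the value below is only a placeholder):
https://api.instagram.com/v1/media/search?lat=34.0500&lng=-118.2500&distance=50&min_timestamp=1388534400&access_token=XXXX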
There are several websites within the cruise industry that I would like to scrape.
Examples:
http://www.silversea.com/cruise/cruise-results/?page_num=1
http://www.seabourn.com/find-luxury-cruise-vacation/FindCruises.action?cfVer=2&destCode=&durationCode=&dateCode=&shipCodeSearch=&portCode=
In some scenarios, like the first one shown, the results page follows a pattern: ?page_num=1...17. However, the number of results will vary over time.
In the second scenario, the URL does not change with pagination.
At the end of the day, what I'd like to do is to get the results for each website into a single file.
Q1: Is there any alternative to setting up 17 scrapers for scenario 1 and then actively watching as the results grow/shrink over time?
Q2: I'm completely stumped about how to scrape content in the second scenario.
Q1- The free tool from import.io does not have the ability to actively watch the data change over time. What you could do is have the data Bulk Extracted by the Extractor (with 17 pages this would be really fast) and added to a database. After each entry to the database, the entries could be de-duplicated or marked as unique. You could do this manually in Excel or programmatically.
Their Enterprise (data as a service) could do this for you.
Q2- If there is not a unique URL for each page, the only tool that will paginate the pages for you is the Connector.
I would recommend building an extractor to get the pagination. The result of this extractor will be a list of links, each link corresponding to a page.
This way, every time you run your application and the number of pages changes, you will always get all the pages.
After that, make a call for each page to get the data you want.
Extractor 1: Get pages -- Input: The first URL
Extractor 2: Get items (data) -- Input: The result from Extractor 1
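Outside of any particular tool, the same two-step pattern (first collect the page links, then extract the data from each page) can be sketched in plain Python for scenario 1; the requests/BeautifulSoup calls are standard, but the pagination handling and the CSS selectors below are assumptions you would have to adapt to the real page:

import requests
from bs4 import BeautifulSoup

BASE = "http://www.silversea.com/cruise/cruise-results/"   # scenario 1

def get_page_numbers():
    # "Extractor 1": read the first results page and work out how many pages exist.
    # The a.page-number selector is hypothetical -- inspect the real page for the
    # element that holds the page links.
    html = BeautifulSoup(requests.get(BASE, params={"page_num": 1}).text, "html.parser")
    nums = [int(a.text) for a in html.select("a.page-number") if a.text.isdigit()]
    return range(1, max(nums, default=1) + 1)

def get_items(page_num):
    # "Extractor 2": pull the result rows from one page (selector again hypothetical).
    html = BeautifulSoup(requests.get(BASE, params={"page_num": page_num}).text, "html.parser")
    return [row.get_text(strip=True) for row in html.select("div.cruise-result")]

all_items = []
for n in get_page_numbers():
    all_items.extend(get_items(n))   # one call per page, so the page count can change freely

with open("results.txt", "w") as f:  # everything ends up in a single file, as asked
    f.write("\n".join(all_items))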
I would like to retrieve users with repositories that contain a README file that contains text that is matched by a string passed in the query. Is this possible using the GitHub API?
In addition, I would like to include location and language in the query.
thanks.
This is not straightforward with the API available right now. However, you can still use the API to get what you want.
Be warned that there are over 10 million repositories on GitHub, so it will take a long time. Since you can only retrieve a list of 100 repositories per query, you need to use pagination, which means more than 100,000 requests to get all the repositories. A user is limited to 5,000 requests per hour, after which you are "banned" for another hour, so this will take more than 40 hours if you're using just one user's credentials.
Steps:
1. Get the JSON with all the repositories (https://developer.github.com/v3/repos/#list-all-public-repositories).
2. Use pagination to fetch 100 objects per query (https://developer.github.com/v3/#link-header).
3. Decode the JSON and retrieve the list of repositories.
4. For each repository, get the repository url object from the JSON, which gives you the link to the repository.
5. Now you need to get the README contents. There are two ways:
a) Use the GitHub API: take the repo url and send a GET request to https://api.github.com/repos/:owner/:repo/readme (https://developer.github.com/v3/repos/contents/#get-the-readme), then either decode the file (it is encoded using Base64) or follow the html property of the JSON, e.g. "html": "https://github.com/pengwynn/octokit/blob/master/README.md". If there is no README, you will get a 404 Not Found code, so you can easily proceed to the next repository.
b) Build the URL for the README yourself from step 4, which gives you e.g. https://api.github.com/repos/octocat/Hello-World, and parse and transform it into https://github.com/octocat/Hello-World/README.MD; however this would be more complicated when there is no README.
6. Search through the file for your specific text, and record whether you have found it.
7. Iterate until you have gone through all the repositories.
Advanced things - if you plan on running this more often, I strongly recommend using conditional requests (caching): https://developer.github.com/v3/#conditional-requests. You basically store the date and time of your query and use it later to check whether anything has changed in a repository. This eliminates many of the subsequent queries if you need up-to-date information. You will still have to retrieve the whole list of repositories, though (but then you only search through the updated repositories).
Of course, to make it faster, you can parallelize the algorithm: retrieve 100 repositories, then start retrieving the next 100 while you check whether the first 100 contain a README and whether that README has what you are searching for, and so on. This will most certainly make things faster. You will need some sort of buffer, as you do not know which part finishes first (getting the repository list, or searching through it).
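To make steps 1-7 concrete, here is a minimal Python sketch using the requests library (ignoring the caching and parallelism above); the search text and the access token are placeholders, and error handling is deliberately minimal:

import base64
import requests

SEARCH_TEXT = "your search text"            # placeholder: whatever you are looking for
session = requests.Session()
session.headers["Authorization"] = "token YOUR_TOKEN"   # placeholder token, for the 5,000/hour limit

url = "https://api.github.com/repositories"  # steps 1-2: pages of (up to 100) public repositories
while url:
    resp = session.get(url)
    for repo in resp.json():                           # step 3: decode the JSON list
        r = session.get(repo["url"] + "/readme")       # steps 4-5a: repository url object + /readme
        if r.status_code == 404:                       # no README: skip to the next repository
            continue
        text = base64.b64decode(r.json()["content"]).decode("utf-8", errors="replace")
        if SEARCH_TEXT.lower() in text.lower():        # step 6: search the decoded README
            print(repo["full_name"])
    url = resp.links.get("next", {}).get("url")        # step 7: follow the Link header to the next page

The loop re-reads resp.links after each page, so pagination ends naturally once GitHub stops sending a next link.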
Hope it helps.
I'm getting started with Nutch (trunk version) and I'm going around in circles in the code without finding something that seems like it should be obvious.
I want to extract the resource path of every URL crawled (e.g. https://stackoverflow.com/questions/ask ===> /questions/ask), with two goals:
1. Post the information as an additional field to a Solr instance. I have solved this by writing an IndexingFilter plugin, and it works perfectly.
2. Dump this information as metadata when the following command is run: bin/nutch readdb <crawldb> -dump <output_dir>
It's this second point where I'm stuck. Reading the documentation and other examples, it seems I have to use CrawlDatum, but I don't know which class I have to modify in order to show this information when a dump is made.
Does anyone know where I would need to make changes to achieve this?
Any help would be appreciated!