I want to write my own HTML parser plugin for Nutch.
I am doing focused crawling by generating only the outlinks that fall under a specific XPath.
In my use case, I want to fetch different data from the HTML pages depending on the current depth of the crawl. So I need to know the current depth in the HtmlParser plugin for each piece of content I am parsing.
Is this possible with Nutch? I see that CrawlDatum does not carry crawl-depth information.
I was thinking of keeping a map of this information in a separate data structure.
Does anybody have a better idea?
Thanks
Crawl.java has a NutchConfiguration object, which is passed in when all the components are initialized. I set a crawl-depth property on it before creating a new Fetcher:
conf.setInt("crawl.depth", i+1);
new Fetcher(conf).fetch(segs[0], threads,
    org.apache.nutch.fetcher.Fetcher.isParsing(conf)); // fetch it
The HtmlParser plugin can access it as below:
LOG.info("Current depth: " + getConf().getInt("crawl.depth", -1));
This approach doesn't force me to break the MapReduce flow.
Thanks
Nayn
With Nutch, "depth" represents the number of generate/fetch/update cycles run successively. Per example, if you are at depth 4, it means your are in the fourth cycle. When you say that you want to go no further than depth 10, it means that you want to stop after 10 cycles.
Within each cycle, the number or previous cycles run before it (the "depth") is unknown. That information is not kept.
If you have your own version of Crawl.java, you could keep track of the current "depth" and pass that information to your HTML parser plugin.
I’m new to MODX, so I don’t know if this is possible or not.
My TV, in this case [[*myTV]], outputs the following:
<data value='www.mylink.com'>Description</data>
Is there a way to display only the data value on the front end? In this case I just want to display the URL.
My recommendation would be to keep the data (in this case the URL) and the HTML separate; that might help your situation. If the TV only contains the URL itself, it becomes much easier to deal with the output of the TV using output modifiers. As an example, if [[*myTV]] contains www.mylink.com for a particular resource and you want the original output in your question, you could do something like:
[[*myTV:default=`<data value='[[*myTV]]'>Description</data>`]]
You can also nest TVs within output modifiers, so if for example you had a corresponding [[*description]] TV that describes the URL in [[*myTV]], you could use:
[[*myTV:default=`<data value='[[*myTV]]'>[[*description]]</data>`]]
TL;DR: Storing the entire output in the TV and extracting text from within it is much more difficult than storing the individual components of that output in separate TVs and bringing them together for output when needed.
The longer version: In any situation where you store both data and HTML in a TV (which is not advisable in the vast majority of cases), you will likely end up duplicating your data across your project. If you then decide to change the HTML at some point in the future, you have to go into each and every TV field and change it, which is the opposite of what a CMS is supposed to do - i.e. make content management easier!
If you do happen to find a use case for storing TVs along with their HTML formatting, that is a job best left to MODX chunks: you write the HTML that wraps your TVs in one spot within MODX and, instead of duplicating that code everywhere, you reference the chunk like so: [[$chunk]].
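For example, a chunk (called linkData here purely for illustration) could hold the markup with property placeholders:
<data value='[[+url]]'>[[+description]]</data>
and then be called with properties wherever the output is needed:
[[$linkData? &url=`[[*myTV]]` &description=`[[*description]]`]]
That way the HTML lives in one place and only the data is stored per resource.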
I am using the Google Translate API, and my content exceeds the maximum size Google allows for a single translation request. I want to translate an HTML section that is far larger than that limit. How can I break my HTML into pieces so that I can send multiple requests while keeping the overall HTML structure valid?
Also, I am using Node.js as the server-side language.
Any other ideas on how to achieve this?
Use a parser like jsdom to transform your HTML content into a DOM structure.
Then, use the translate API on the contents of the text nodes in the DOM structure and put the translated text back in place to get the full translated page.
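Here is a rough sketch of that approach (assuming jsdom is installed, and a hypothetical translate(text) helper that wraps your Google Translate API call and returns a Promise of the translated string):
const { JSDOM } = require("jsdom");

async function translateHtml(html) {
  const dom = new JSDOM(html);
  const document = dom.window.document;
  // Walk only the text nodes; tags and attributes are never sent to the API.
  const walker = document.createTreeWalker(document.body, dom.window.NodeFilter.SHOW_TEXT);
  let node;
  while ((node = walker.nextNode())) {
    if (node.nodeValue.trim().length > 0) {
      // Each text node is translated on its own, so every request stays small.
      node.nodeValue = await translate(node.nodeValue);
    }
  }
  return dom.serialize(); // full page with the original markup intact
}
Since each request only carries a single text node's content, the surrounding markup always stays valid.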
If you need it, you could also try to find and translate any relevant text outside of text nodes, like alt- or title-attributes.
If you care about performance, you could try to translate bigger subtrees of the DOM structure at once, but then you would have to be careful not to run into the size limit again.
There are several websites within the cruise industry that I would like to scrape.
Examples:
http://www.silversea.com/cruise/cruise-results/?page_num=1
http://www.seabourn.com/find-luxury-cruise-vacation/FindCruises.action?cfVer=2&destCode=&durationCode=&dateCode=&shipCodeSearch=&portCode=
In some scenarios, like the first one shown, the results page follows a pattern - ?page_num=1...17. However, the number of results will vary over time.
In the second scenario, the URL does not change with pagination.
At the end of the day, what I'd like to do is to get the results for each website into a single file.
Q1: Is there any alternative to setting 17 scrapers for scenario 1 and then actively watching as results grow/shrink over time?
Q2: I'm completely stumped about how to scrape content from the second scenario.
Q1 - The free tool from import.io does not have the ability to actively watch the data change over time. What you could do is have the data bulk-extracted by the Extractor (with 17 pages this would be really fast) and added to a database. After each entry is added to the database, the entries could be de-duped or marked as unique. You could do this manually in Excel or programmatically.
Their Enterprise (data as a service) could do this for you.
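If you go the programmatic route, a minimal sketch in Node.js (assuming each extracted row carries a url field that can act as its unique key) might look like:
function dedupeByUrl(rows) {
  const seen = new Map();
  for (const row of rows) {
    if (!seen.has(row.url)) {
      seen.set(row.url, row); // keep the first occurrence of each URL
    }
  }
  return Array.from(seen.values());
}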
Q2- If there is not a unique URL for each page, the only tool that will paginate the pages for you is the Connector.
I would recommend building an extractor to handle the pagination. The result of this extractor will be a list of links, each link corresponding to a page.
This way, every time you run your application and the number of pages changes, you will always get all the pages.
After that, make a call for each page to get the data you want.
Extractor 1: Get pages -- Input: The first URL
Extractor 2: Get items (data) -- Input: The result from Extractor 1
Currently, we're running Revolution 2.2. On site_content, we have some tags that are used for crawling Twitter. I want to start tracking the number of results for each tag as results come in, to determine which tags don't return many results, etc.
So I was thinking that I should create a new table (twitter_data), and have a foreign key that will link it to the search tag ID, which is stored in site_content.
What is the best path to accomplish this? Should I create my table then run the reverse schema tool, outlined here?
http://rtfm.modx.com/display/revolution20/Reverse+Engineer+xPDO+Classes+from+Existing+Database+Table#ReverseEngineerxPDOClassesfromExistingDatabaseTable-CreatingaMySQLtable
I also found this, but not sure if this is what I should be looking into:
http://rtfm.modx.com/display/revolution20/Using+Custom+Database+Tables+in+your+3rd+Party+Components
Probably not - if you can avoid modifying the core MODX schema, do so. An external table may be your best option, but it requires a fair bit of work.
Though if you can explain what you mean by 'tags' a little better [HTML tags? snippets? content tags? not sure what you mean], there may be other options. For example, one of our clients wanted to count page hits [and didn't want to use Google to do it], so all we did was create a template variable bound to each page they wanted to count and then update the appropriate variable with a plugin written to fire on the onpageload or onpagerender event [I don't remember exactly which, or what it was called].
Basically, you may be able to do this by writing a plugin rather than trying to extend anything or add snippets/chunks.
Does Apache Nutch support sitemaps?
Or how can I implement it myself? How should I use the priority field - should it be multiplied into the boost field?
Not that I'm aware of.
Depending on the behaviour you expect, there are multiple possible implementations. Can you be more specific?
For instance:
+ you can make it so that newly submitted sitemaps are 'injected' with a high score so they will get crawled earlier. For this, just add an inject command before starting a new crawl/fetch/index cycle
+ you can create a scoring plug-in which will boost URLs found in a sitemap...
But you cannot define recrawl periods at the URL level, as the sitemap would indicate. Nutch has a built-in function which will recrawl URLs that change more often, and vice versa. However, you could decide to boost the score of URLs with a frequent refresh rate, so that they get crawled earlier...
I guess they support it now. I found it at this link:
https://wiki.apache.org/nutch/SitemapFeature