Add metadata to Crawldb dump - nutch

I'm starting with Nutch (trunk version) and I'm going around in circles in the code without finding something that seems like it should be obvious.
I want to extract the resource path of every URL crawled (e.g. https://stackoverflow.com/questions/ask ===> /questions/ask), with two goals:
1. Post the information as an additional field to a Solr instance. I have solved this by writing an IndexingFilter plugin, and it works perfectly.
2. Dump this information as metadata when the following command is run: bin/nutch readdb crawldb -dump <output_dir>
This second point is where I'm stuck. Reading the documentation and other examples, it seems I have to use CrawlDatum, but I don't know which class I have to modify so that this information is shown when a dump is made.
Does anyone know where to make changes in order to achieve this?
Any help would be appreciated!
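For reference, here is a minimal sketch of the CrawlDatum side of this, assuming Nutch 1.x (trunk): CrawlDatum keeps its metadata in a Hadoop MapWritable, and whatever is stored there is printed under "Metadata:" by CrawlDatum.toString(), which is what readdb -dump writes out. The class name, the "resourcePath" key and the helper below are illustrative only; where you call it from (for example a custom ScoringFilter, so the value survives the updatedb step) depends on your crawl cycle and is not decided by this sketch.

import org.apache.hadoop.io.MapWritable;
import org.apache.hadoop.io.Text;
import org.apache.nutch.crawl.CrawlDatum;

public class ResourcePathMetadataSketch {

  // Hypothetical helper: store the URL's path as CrawlDatum metadata.
  public static void addResourcePath(String url, CrawlDatum datum) throws Exception {
    String path = new java.net.URL(url).getPath();   // e.g. "/questions/ask"

    MapWritable meta = datum.getMetaData();
    if (meta == null) {
      meta = new MapWritable();
    }
    meta.put(new Text("resourcePath"), new Text(path));
    datum.setMetaData(meta);
    // Entries stored this way show up under "Metadata:" when the crawldb is dumped.
  }

  public static void main(String[] args) throws Exception {
    CrawlDatum datum = new CrawlDatum(CrawlDatum.STATUS_DB_UNFETCHED, 0);
    addResourcePath("https://stackoverflow.com/questions/ask", datum);
    System.out.println(datum);   // prints the same representation the dump uses
  }
}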

Related

Shopware Product Export Feed for Doofinder plugin

I have a catalog set up under Shopware, and I have installed the Doofinder plugin for search. Now I need to provide the feed URLs to Doofinder, and the feeds are set up as well. But one of the feeds does not generate its XML correctly; it is trying to export around 18k records, while another feed exports around 74k records.
Can someone please give me some pointers on the probable cause and solution? I am a newbie to Shopware.
First off, which version of Shopware are you using? (5.6.7 or 6.2.2?)
For now I assume you are using 5.6.7 (or anything along the 5.6 line).
How do you generate the feeds?
Have you checked the logs in var/log/? Maybe there is a fairly obvious error there.

Scrapy not extracting data from a certain xpath

I'm trying to extract some data from an amazon product page.
What I'm looking for is getting the images from the products. For example:
https://www.amazon.com/gp/product/B072L7PVNQ?pf_rd_p=1581d9f4-062f-453c-b69e-0f3e00ba2652&pf_rd_r=48QP07X56PTH002QVCPM&th=1&psc=1
By using the XPath
//script[contains(., "ImageBlockATF")]/text()
I get the part of the source code that contains the URLs, but two options pop up in the Chrome XPath helper.
By trying things out with XPaths I ended up using this:
//*[contains(@type, "text/javascript") and contains(.,"ImageBlockATF") and not(contains(.,"jQuery"))]
Which gives me exclusively the data I need.
The problem I'm having is that for certain products (it can happen between two different pairs of shoes) sometimes I can extract the data and other times nothing comes out. I extract by doing:
imagenesString = response.xpath('//*[contains(@type, "text/javascript") and contains(.,"ImageBlockATF") and not(contains(.,"jQuery"))]').extract()
If I use the Chrome XPath helper, the data always appears with the XPath above, but in the program itself it sometimes appears and sometimes doesn't. I know the script the spider receives can differ from the one that appears on the site, but I'm struggling with this one because sometimes it works and sometimes it does not. Any ideas on what could be going on?
I think I found your problem: it's a captcha.
Follow these steps to reproduce:
1. run scrapy shell
scrapy shell "https://www.amazon.com/gp/product/B072L7PVNQ?pf_rd_p=1581d9f4-062f-453c-b69e-0f3e00ba2652&pf_rd_r=48QP07X56PTH002QVCPM&th=1&psc=1"
2. view the response the way Scrapy sees it
view(response)
When executing this I sometimes got a captcha.
Hope this points you in the right direction.
Cheers

How to test an RDFa parser?

I'm trying to find a way to check whether my RDFa parser (written in Node.js) is working.
I have an RDFa parser that should print all triples found in a file or URL (with RDFa syntax).
As far as I know, there are test suites for RDFa parsing (http://rdfa.info/test-suite/rdfa1.1/html5/manifest), but I'm not sure how to use them.
Is there a good web page where this is described, or can anyone help me in another way?
There should be some information at the rdfa.info/tests site. Basically, you need a service that will accept a GET request, where the "uri" query parameter points to the input file. The service then parses the file, and returns some other form of RDF, typically N-Triples. More information on the Github page: https://github.com/rdfa/rdfa-website/blob/master/README.md
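If it helps to see the shape of that service, here is a rough Java sketch of the contract only: the harness issues a GET with a uri query parameter, and the endpoint replies with the triples it extracted, typically as N-Triples. Your parser is in Node.js, so this is purely illustrative; parseToNTriples is a hypothetical stand-in for your parser, and the port and the crude single-parameter query parsing are assumptions.

import com.sun.net.httpserver.HttpServer;
import java.io.OutputStream;
import java.net.InetSocketAddress;
import java.net.URLDecoder;
import java.nio.charset.StandardCharsets;

public class RdfaTestEndpoint {

  // Hypothetical: fetch the document at `uri`, run your RDFa parser, return N-Triples.
  static String parseToNTriples(String uri) {
    return "<" + uri + "> <http://example.org/stub> \"not a real parser\" .\n";
  }

  public static void main(String[] args) throws Exception {
    HttpServer server = HttpServer.create(new InetSocketAddress(8080), 0);
    server.createContext("/", exchange -> {
      // The test suite calls e.g. GET /?uri=<url-of-the-test-document>
      String query = exchange.getRequestURI().getQuery();   // "uri=http%3A%2F%2F..."
      String uri = URLDecoder.decode(query.substring(query.indexOf('=') + 1),
                                     StandardCharsets.UTF_8);
      byte[] body = parseToNTriples(uri).getBytes(StandardCharsets.UTF_8);
      exchange.getResponseHeaders().set("Content-Type", "application/n-triples");
      exchange.sendResponseHeaders(200, body.length);
      try (OutputStream os = exchange.getResponseBody()) {
        os.write(body);
      }
    });
    server.start();
  }
}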

Cloudant skip parameter not working

I'm trying to do paging of results from a Cloudant database.
I've tried using bookmark, but the fact that the final page of results still has a bookmark is a problem for me, since it means that apps using the database can't tell if there's a 'next page' or not without requesting it.
Instead, I've tried using skip with a URL like this:
https://samdutton.cloudant.com/mydb/_design/mydesigndoc/_search/mysearch?q=foo:bar&skip=10
However, this isn't working: I always get the first page of results.
Am I doing something wrong, or should this work?
Skipping results isn't supported in search. See the search documentation for the full list of supported options.
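Given that, one workaround for detecting the last page is to page with limit plus bookmark and stop as soon as a page comes back with fewer than limit rows. A rough sketch, assuming Java 11's HttpClient and the org.json library, with the URL and query taken from the question and authentication omitted:

import java.net.URI;
import java.net.URLEncoder;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;
import org.json.JSONArray;
import org.json.JSONObject;

public class CloudantSearchPager {
  public static void main(String[] args) throws Exception {
    String base = "https://samdutton.cloudant.com/mydb/_design/mydesigndoc/_search/mysearch";
    String query = "foo:bar";
    int limit = 10;

    HttpClient client = HttpClient.newHttpClient();   // credentials omitted for brevity
    String bookmark = null;

    while (true) {
      String url = base + "?q=" + URLEncoder.encode(query, StandardCharsets.UTF_8)
          + "&limit=" + limit
          + (bookmark == null ? "" : "&bookmark=" + URLEncoder.encode(bookmark, StandardCharsets.UTF_8));

      HttpRequest request = HttpRequest.newBuilder(URI.create(url)).GET().build();
      HttpResponse<String> response = client.send(request, HttpResponse.BodyHandlers.ofString());

      JSONObject body = new JSONObject(response.body());
      JSONArray rows = body.getJSONArray("rows");
      for (int i = 0; i < rows.length(); i++) {
        System.out.println(rows.getJSONObject(i).getString("id"));
      }

      // Fewer rows than `limit` means this was the last page; the bookmark on the
      // final page still exists but would only return an empty result set.
      if (rows.length() < limit) {
        break;
      }
      bookmark = body.getString("bookmark");
    }
  }
}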

How to omit JavaScript and comments using nutch crawl?

I am a newbie at this, trying to use Nutch 1.2 to fetch a site. I'm using only a Linux console to work with Nutch, as I don't need anything else. My command looks like this:
bin/nutch crawl urls -dir crawled -depth 3
where the folder urls is where I keep my links, and I do get results in the folder crawled.
When I want to see the results, I type: bin/nutch readseg -dump crawled/segments/20110401113805 /home/nutch/dumpfiles
This works fine, but I get a lot of broken links.
Now, I do not want Nutch to follow JavaScript links, only regular links, could anyone give me a hint/help on how to do that?
I've tried editing conf/crawl-urlfilter.txt with no results; I might have typed the wrong thing!
Any help appreciated!
Beware that there are two different filter files: one for the one-step crawl command and the other for the step-by-step commands.
For the rest, just build a regex that matches the URLs you want to skip, add a minus sign in front of it, and you should be done.
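As an illustration only (the patterns are examples to adapt, not defaults): the one-step crawl command reads conf/crawl-urlfilter.txt, while the step-by-step tools read conf/regex-urlfilter.txt. Entries along these lines would skip .js resources and javascript: pseudo-links:

# skip URLs ending in .js (the suffix skip-list shipped with Nutch usually covers this already)
-(?i)\.js$
# skip javascript: pseudo-URLs
-^javascript:
# accept everything else (or restrict to your domain, as in the default file); keep an accept rule last
+.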
