nutch replace parsed content before indexing

I am using nutch 1.15. I want to replace some string of the parsed content before getting indexed.
Is there a way to write the regexp and replace the contents?
Example:
Content : "This is the crawled page"
I want to replace "page" with string "content"

Since you want to do the replacement in the content (the parsed text), you can write a custom IndexingFilter similar to https://github.com/apache/nutch/tree/master/src/plugin/index-replace that manipulates the data before sending it to your storage.
That plugin works only on the metadata fields, but it should provide a good overview of how to build your own.
There is also something similar that you can do on the Solr side; for instance, take a look at this blog post
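The replacement logic itself is independent of Nutch, so here is a minimal sketch of it in Python using the example from the question. The `RULES` table and the `replace_content` name are made up for illustration; in a real plugin the same regex substitution would be done in Java inside the custom indexing filter, on the parsed text, before the document is handed to the indexer.

```python
import re

# Hypothetical rule table: (compiled pattern, replacement string).
# This is an illustration only, not Nutch configuration syntax.
RULES = [
    (re.compile(r"\bpage\b"), "content"),
]


def replace_content(text, rules=RULES):
    """Apply each regex rule in order and return the rewritten text."""
    for pattern, replacement in rules:
        text = pattern.sub(replacement, text)
    return text


print(replace_content("This is the crawled page"))
# prints: This is the crawled content
```

The `\b` word boundaries keep the rule from rewriting words that merely contain "page", such as "homepage".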

Related

Kentico - Content Only page & smart search result

I have a custom page type (Content Only) for Locations. Then I have a landing page (/company/locations/) with a repeater to list all locations and their details. Things work well so far. Now, after adding the smart search, I notice that if I search for a location name like "san francisco", the landing page doesn't show up in the search results, but the content-only page shows up with a URL like /company/locations/san-francisco. The thing is, this URL results in a 404 since that page doesn't really exist. What should I do? Should I re-create the page type and change it to a regular page instead of content only before it's too late? Or is there a way to make the individual location URL (/company/locations/san-francisco) work, considering we can't specify a page template to go with a content-only page type? Thanks!

There are multiple types of Search indexes in Kentico.
"Pages" scans the data of a document, such as any webparts+properties, editable text, form data, etc. It does NOT scan the rendered page, though, so it doesn't catch any Repeaters (which is what you're using).
"Page Crawler" will literally load the page, and scan all the content in the page. This will catch Repeaters and dynamic content like that.
Knowing this, you have a couple options.
Use Pages, then Modify the Smart Search Result and add some transformation logic to say something like the below
The Link
Use Page Crawler, tell it specifically to only index the /company/locations.
Use Page Crawler, and also a custom smart search indexer so you can exclude the header/footer or other areas out of the content (it's a bit more advanced)

If you don't want that URL to show then simply exclude those page types from that search index. But if you want them to specifically show, then create a detail or selected transformation for that /company/locations repeater to display when someone navigates to it from the search. This will also be good for google and other search indexes if you plan to have specifics for each location.

Nutch 2 exclude content-type image from crawling

The problem is that there can be images not with the specific image extensions. For example Nutch2 was crawling a page ending with .ashx but was still an image.
Is there a way to exclude images using an HTML header filter:content-type: images/* or something equivalent but not based on a url pattern (regex-urlfilter.txt)?

You can achieve this by writing a plugin that implements the URLFilter interface.
In the String filter(String urlString) method, you can check whether the URL has an ambiguous extension, then validate it further by getting its HTTP header values from the server: if its content type is an image, return null; otherwise return the URL. But I doubt that would be a very efficient method, since many otherwise useless HTTP calls would be generated for this validation alone.
Another option is to just let it be; Nutch is not going to parse and/or index the image anyway.
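To make the header check concrete, here is a rough Python sketch of the decision described above. All function names are hypothetical, and this is not the Nutch API; an actual plugin would implement the same logic in Java. It follows the contract from the answer: return None to reject a URL, or the URL itself to keep it.

```python
import urllib.request


def is_image_content_type(header_value):
    """Return True when a Content-Type header value names an image type."""
    if not header_value:
        return False
    # Drop parameters such as "; charset=utf-8" before comparing.
    media_type = header_value.split(";", 1)[0].strip().lower()
    return media_type.startswith("image/")


def fetch_content_type(url, timeout=5):
    """Issue a HEAD request and return the Content-Type header, or None."""
    request = urllib.request.Request(url, method="HEAD")
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.headers.get("Content-Type")


def filter_url(url, fetch=fetch_content_type):
    """Return None to reject image URLs, otherwise return the URL unchanged."""
    return None if is_image_content_type(fetch(url)) else url
```

Each call to `filter_url` costs one extra HTTP round trip, which is the inefficiency the answer warns about. The `fetch` parameter exists only so the decision can be exercised without network access, for example `filter_url("http://example.com/a.ashx", fetch=lambda u: "image/png")`.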

Getting thumbnails in OpenSearchServer search results

I need an alternative to Google Custom Search for a website I look after, it has to be something that will crawl a website, index it, allow fiddling of priorities, and then allow search queries via REST or something similar and return XML or JSON etc. It needs to run on a Windows Server instance.
So, I'm up and running with http://www.opensearchserver.com/ and it seems to do the trick, but can't, for the life of me, work out how to get thumbnail images in the results? I've searched the documentation and read everything I could, but can't find out how to do this (or how to get my head around it).
I'm crawling standard web pages and they all have thumbnail meta data, which I'm assuming should be able to be parsed somehow for results and included in the JSON results?
Any pointers at all would be very helpful, thanks!

I figured this out. In case anyone else is struggling, here's how I did it. The answer is in the documentation, it's just not that simple.
Read: http://www.opensearchserver.com/documentation/faq/crawling/how_to_extract_specific_information_from_web_pages.md - it contains the method
Assume you set up a 'web crawler' index.
Assuming you're using a meta thumbnail like this:
<meta name="thumbnail" content="http://my_cdn.com/news/images/29637.jpg">
Go into Schema / Fields. Add a new field called 'thumbnail' with index no, store yes, vector no, analyser Text, copy of blank. Save that.
Now go to Schema / Parser list and edit the HTML parser. Go to 'field mapping' and add a new regex for the thumbnail in the HTML. We map from 'htmlSource' to 'thumbnail' with the matching regex.
My imperfect regex (that works though) is:
htmlSource -> linked in: thumbnail -> captured by:
(?s)<meta name="thumbnail" content="(.*?)">
Now SAVE this and go to crawl/manual crawl, enter a url that has a thumbnail and then check if the field now appears in the list below when it's read. If not check your regex, and check you actually saved the HTML Parser changes.
To get the thumbnail in your results, simply add the field name to the JSON you send with the query:
"returnedFields": [
"url",
"thumbnail"
],
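Before saving the field mapping, the regex from the steps above can be checked offline. This is a small sketch with Python's `re` module against the sample meta tag; OpenSearchServer applies the pattern with its own regex engine, so treat this only as a sanity check of the capture group.

```python
import re

# The same pattern as in the field mapping above.
THUMBNAIL_RE = re.compile(r'(?s)<meta name="thumbnail" content="(.*?)">')

# A made-up page fragment built around the sample tag from this answer.
html_source = (
    "<head>"
    '<meta name="thumbnail" content="http://my_cdn.com/news/images/29637.jpg">'
    "</head>"
)

match = THUMBNAIL_RE.search(html_source)
thumbnail = match.group(1) if match else None
print(thumbnail)
# prints: http://my_cdn.com/news/images/29637.jpg
```

The pattern is strict about attribute order and the closing `">`, so a tag written as `content="..." />` or with the attributes swapped will not yield a clean capture. That is probably what makes it "imperfect".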

How to place search query in the URL?

With a lot of search engines, you can find the string you are searching in the URL.
However, http://drugcompare.destinationrx.com/Home.aspx does not let me do this. When I search something, the resulting URL is http://drugcompare.destinationrx.com/DrugCompare.aspx no matter what.
Is there any way I can find out whether I can search the website by adding something to the end of the URL, like "?query=searchstring" instead of using the form provided on the page? Basically I need a unique URL.

The website you pointed to uses POST to send the data for its search query, which means you won't be able to see it in the URL bar or append it there. The reason is either security, or that the search query it generates is a complex object or too long to fit in a URL. Websites such as search engines use GET; with those you can append your search query to the URL by following the syntax they generate.
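For a site that does accept GET, the query string is just URL-encoded key/value pairs appended after a `?`. Here is a minimal sketch using a placeholder host and a made-up `query` parameter name; the real parameter name depends on the site and has to be read from its search form or its result URLs.

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def build_search_url(base_url, **params):
    """Append URL-encoded parameters to base_url as a query string."""
    return f"{base_url}?{urlencode(params)}"


# example.com and the "query" parameter are placeholders, not a real search API.
url = build_search_url("https://example.com/search", query="aspirin 81mg")
print(url)
# prints: https://example.com/search?query=aspirin+81mg

# Reading the parameters back out of such a URL:
print(parse_qs(urlsplit(url).query))
# prints: {'query': ['aspirin 81mg']}
```

With POST, the same key/value pairs travel in the request body instead, which is why they never appear in the address bar.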

google cse- rendering search results

I'm using Google CSE on my website and I want to have the search results display differently than the standard method. I've found this:
http://code.google.com/apis/customsearch/docs/snippets.html
I'm a little confused on the steps on how to style the results to my liking. I know that I have to create the structured data in my pages first (ie Pagemaps).
What does the second step mean though
"Fetch that structured data in the search results for your Custom Search Engine.
The Custom Search server can return the search results, along with the structured data, in XML or JSON format. "
And for the third step, do I just copy the code provided in the Custom Search Element?
Thanks in advance

"Fetch that structured data in the search results for your Custom Search Engine. The Custom Search server can return the search results, along with the structured data, in XML or JSON format. "
You don't need to fetch them yourself; I guess that step refers to indexing. You can force Google to re-index your site or upload a Pagemap directly through their service: https://developers.google.com/custom-search/docs/structured_data#pagemaphttp
After that you just request the data from the results URL (this example asks for XML via output=xml):
https://www.google.com/cse?cx=[CSEID]&q=animal&output=xml&sort=myprivate12345-document-rating&pgmpk=myprivate12345
And for the third step, do I just copy the code provided in the Custom Search Element?
If you plan to use JavaScript, it's best to request the results in JSON. After that it is an object in your code, and you can style the hell out of it or do other things with it.
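As a rough illustration of "it is an object in your code", here is a Python sketch that parses a hand-written sample response and pulls out the fields you would style. The sample data is invented, and the `items` / `title` / `link` / `pagemap` keys are my assumption about the JSON layout, so check them against a real response from your engine.

```python
import json

# Invented sample; a real response contains many more fields.
SAMPLE_RESPONSE = """
{
  "items": [
    {
      "title": "Red panda",
      "link": "https://example.com/animals/red-panda",
      "pagemap": {"document": [{"rating": "4.5"}]}
    }
  ]
}
"""


def extract_results(raw_json):
    """Return a list of dicts with the title, link and pagemap of each hit."""
    data = json.loads(raw_json)
    return [
        {
            "title": item.get("title"),
            "link": item.get("link"),
            "pagemap": item.get("pagemap", {}),
        }
        for item in data.get("items", [])
    ]


for result in extract_results(SAMPLE_RESPONSE):
    print(result["title"], result["link"], result["pagemap"])
```

From there, each result can be rendered with whatever markup and styling you like, including the structured data carried in the pagemap.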
