I'm currently building a search engine for a website's content (it searches only within that one site). However, I'm thinking of building the index on the staging server, something like this:
1. I stage my code at www.staging_server.com
2. I index the pages at www.staging_server.com
3. I copy the code from www.staging_server.com to www.production_server.com
4. I copy the index to www.production_server.com???
The problem with step 4 is that the URLs in the index created in step 2 are of the form www.staging_server.com/index, www.staging_server.com/whatever, www.staging_server.com/anything, but what I need is www.production_server.com/index, www.production_server.com/whatever, www.production_server.com/anything.
I'm wondering whether the URLs in the index can be changed programmatically, and if so, how?
Note: I'm a Nutch beginner, so please be merciful.
If you are only working with the index after the crawl, you can open the index with a Lucene IndexReader and add new records with an IndexModifier. You can page through each document, create a copy of the document with the new URL, and add the new document back to the index. You will need to delete the original document if you do not wish it to persist in the index.
Lucene does not support updating a record in place; an update is the deletion of the old record followed by the insertion of a new one.
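A minimal sketch of that loop, written here in C# with Lucene.NET to match the other code in this collection (the Java Lucene API is near-identical). The index path and the assumption that the url field is stored and indexed untokenized are mine, not from the question. Note also that only stored fields survive the Document round-trip, so any indexed-but-unstored fields would be lost:

using System.IO;
using Lucene.Net.Analysis.Standard;
using Lucene.Net.Documents;
using Lucene.Net.Index;
using Lucene.Net.Store;

var dir = FSDirectory.Open(new DirectoryInfo("crawl/index")); // assumed index location
using (var reader = IndexReader.Open(dir, true)) // read-only, point-in-time snapshot
using (var writer = new IndexWriter(dir, new StandardAnalyzer(Lucene.Net.Util.Version.LUCENE_30),
    false, IndexWriter.MaxFieldLength.UNLIMITED))
{
    for (int i = 0; i < reader.MaxDoc; i++)
    {
        if (reader.IsDeleted(i)) continue;

        Document doc = reader.Document(i); // only stored fields come back here
        string oldUrl = doc.Get("url");
        if (oldUrl == null) continue;

        string newUrl = oldUrl.Replace("www.staging_server.com", "www.production_server.com");
        doc.RemoveField("url");
        doc.Add(new Field("url", newUrl, Field.Store.YES, Field.Index.NOT_ANALYZED));

        // Lucene has no in-place update: this deletes the old document by term
        // and inserts the rewritten copy.
        writer.UpdateDocument(new Term("url", oldUrl), doc);
    }
    writer.Optimize(); // merge away the deletions
}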
How can I get the formatted URL from Sitecore Lucene search? I created a custom index whose root is set to /sitecore/content/websitename/home.
When the search results are retrieved, the URL comes back as https://hostname/websitename/home/sample.aspx. I would like it to be https://hostname/sample.aspx. Is there any setting in the index config that needs to be updated?
In sites.config I already have rootPath="/sitecore/content/websitename" startItem="/home".
You can get the URL in two ways:
1. For each result from your index, fetch the item and get the URL with the LinkManager, as you normally would for any item. This does mean that you need to fetch the items, which will be a performance hit.
2. Create a computed field in your index that includes the URL (a sketch follows below). In the computed field, make sure the correct link is being generated; if it isn't, check your URL options and possibly the Rendering.SiteResolving setting (set it to true). Verify the results with a debugger (or with Luke, to inspect the index). Remember that if you include the URL in the index, you will need to update additional items when an item is renamed (or even when its display name changes, if that is used in the URL): all children of the changed item have their URLs changed as well at that point.
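A sketch of the computed-field option, assuming Sitecore 7+ ContentSearch and the pre-9.x LinkManager API (the class name is a placeholder; you would register it under the fields hint="raw:AddComputedIndexField" node of your index configuration):

using Sitecore.ContentSearch;
using Sitecore.ContentSearch.ComputedFields;
using Sitecore.Links;

public class ItemUrl : IComputedIndexField
{
    public string FieldName { get; set; }
    public string ReturnType { get; set; }

    public object ComputeFieldValue(IIndexable indexable)
    {
        var indexableItem = indexable as SitecoreIndexableItem;
        if (indexableItem == null || indexableItem.Item == null)
            return null;

        // Resolve the owning site so the link honours rootPath/startItem
        // rather than being built from the full content path.
        var options = LinkManager.GetDefaultUrlOptions();
        options.SiteResolving = true;

        return LinkManager.GetItemUrl(indexableItem.Item, options);
    }
}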
I'm using OpenSearchServer to provide search functionality on a web site. I want to crawl all pages on the site for links to follow but I want to exclude some pages from the index. I can't work out how to do this.
Specifically the website includes a shop that has its own product search and I am keeping this search for products and categories. The product pages have URLs like http://www.thesite/p/123 so I don't want to include any page like this in the search results. However some product pages reference background info pages and I want these to be included in the search index.
The problem I have is that my filters have no effect on the results: they don't filter out the /p/ and /c/ pages, and if I untick the negative box I get no results at all, so it seems to be either the contents of the field or the filter criteria that is causing the problem. Here is what I have tried.
First, I added a negative filter with url:"http://www.thesite/p/*" to the default query (called search) in the Query > Filter tab on the index,
but it seems that wildcards are not supported in query filters, although they are supported in Crawler > Exclusion list filters.
Next, I added a new field called urlShop in Schema > Fields and populated it using an analyzer configured with the Whitespace Tokenizer and a regular expression (http://www.thesite/(c|p)/). When I use the Test button, it generates two tokens for my test URL http://www.thesite/p/123:
http://www.thesite/p/
p
I'd hoped to be able to use the first token in a Query > Filter to exclude all the shop results, and optionally to use the p (for product) or c (for category) token if I need to search the product pages at some point in the future.
The urlShop field in the schema is set up as follows:
Indexed: yes
Stored: no (because I don't need the field back, just want to be able to filter on it)
TermVector: No
Analyzer: urlShop
Copy of: url
I've added urlShop:"http://www.thesite/p/" to Query > Filters with the negative box ticked.
This seems to have no effect on the results when I use the default renderer.
To see whether the filter affects the returned results at all, I unticked the negative box, and then I get no results in the default renderer. This leads me to believe that the urlShop field is not being populated, but I'm not sure how to check this directly.
I would like to know whether there is an easier way to do this but if my approach makes sense in the context of OpenSearchServer please can you help me identify what's wrong?
The website is running under IIS, and OpenSearchServer will be configured on the same server, running in Tomcat.
Finally figured this out...
Go to query and hit edit for your configured query. Then go to the filters tab. Add a query filter like this:
urlExact:"http://myurltoexclude*"
Check the "negative" box. Click add.
Now make sure to click the tiny little "save" button on the right-hand side. This is the part I missed. The URLs are still in the database and still get crawled, but at least they aren't returned in the results.
This seems like a pretty simple thing but I can't find any discussions that really explain how to do it.
I'm building a scraper with MongoDB and Node.js. It runs once daily, scraping several hundred URLs and recording them to the database. Example:
The scraper goes to this Google image search page for "stack overflow"
The scraper gets the top 100 links from this page
A record of the link's URL, img src, page title, and domain name is saved to MongoDB.
Here's what I'm trying to achieve:
If the image is no longer in the 100 scraped links, I want to delete it from the database.
If the image is still in the 100 scraped links but details have changed (e.g. a new page title), I want to find the MongoDB record and update it.
If the image doesn't already exist, I want to create a new record.
The bit I'm having trouble with is deleting entries that haven't been scraped. What's the best way to achieve this?
So far my code successfully checks whether entries exist and updates them. It's deleting records that are no longer relevant that I'm having trouble with. My code so far is here:
http://pastebin.com/35cXcXzk
You either need to timestamp items (and update the timestamp on every scrape) and periodically delete items which haven't been updated in a while, or you need to associate items with a particular query. In the latter case, you would gather all of the items previously associated with the query and mark them off as the new results come in. Any items not marked off the list at the end need to be deleted. A sketch of the second approach follows.
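A minimal sketch of that mark-and-sweep idea, written here with the MongoDB .NET driver to match the other code in this collection (collection, field, and type names are my own; the Node.js driver's updateOne/deleteMany calls mirror these one for one):

using MongoDB.Bson;
using MongoDB.Driver;

var client = new MongoClient("mongodb://localhost:27017"); // assumed connection string
var images = client.GetDatabase("scraper").GetCollection<BsonDocument>("images");

// The ~100 results of this run, produced by the scraper (placeholder data here).
var scrapedLinks = new[]
{
    new ScrapedLink("http://example.com/a", "http://example.com/a.png", "Page A", "example.com"),
};

var runId = ObjectId.GenerateNewId(); // unique stamp for this scrape run

foreach (var link in scrapedLinks)
{
    var filter = Builders<BsonDocument>.Filter.Eq("url", link.Url);
    var update = Builders<BsonDocument>.Update
        .Set("imgSrc", link.ImgSrc)  // refresh details that may have changed
        .Set("title", link.Title)
        .Set("domain", link.Domain)
        .Set("runId", runId);        // mark as seen in this run
    images.UpdateOne(filter, update, new UpdateOptions { IsUpsert = true });
}

// Anything without this run's stamp was not among the scraped links: delete it.
images.DeleteMany(Builders<BsonDocument>.Filter.Ne("runId", runId));

record ScrapedLink(string Url, string ImgSrc, string Title, string Domain);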
Another possibility is to use the new TTL index option in MongoDB 2.4, which lets you set a time-to-live on documents:
http://docs.mongodb.org/manual/tutorial/expire-data/
This lets the server expire documents over time instead of you having to perform big, expensive remove runs.
Another optimization is to use the power-of-2 allocation option for collections, to avoid the heavy memory fragmentation that write/remove cycles create:
http://docs.mongodb.org/manual/reference/command/collMod/
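Both suggestions, sketched with the .NET driver for concreteness (database, collection, and field names are assumptions; in the 2.4-era mongo shell the equivalents are db.images.ensureIndex({scrapedAt: 1}, {expireAfterSeconds: 86400}) and db.runCommand({collMod: "images", usePowerOf2Sizes: true})):

using System;
using MongoDB.Bson;
using MongoDB.Driver;

var db = new MongoClient("mongodb://localhost:27017").GetDatabase("scraper");
var images = db.GetCollection<BsonDocument>("images");

// TTL index: the server's background thread deletes any document whose
// scrapedAt stamp (a BSON date, refreshed on every scrape) is older than 24h.
images.Indexes.CreateOne(new CreateIndexModel<BsonDocument>(
    Builders<BsonDocument>.IndexKeys.Ascending("scrapedAt"),
    new CreateIndexOptions { ExpireAfter = TimeSpan.FromHours(24) }));

// Power-of-2 record allocation (a 2.4-era MMAPv1 setting; later storage
// engines manage allocation themselves).
db.RunCommand<BsonDocument>(new BsonDocument
{
    { "collMod", "images" },
    { "usePowerOf2Sizes", true }
});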
I have added a Web Address rule to a search scope, setting the Folder url to the following in order to search within a single list in a site collection:
http://svrmosstest3/sites/asmtportal/Lists/SearchList
I added this scope to the search dropdown, and the search works fine, returning results from that list only, except that it returns one extra item: an entry for the list itself, which is
http://svrmosstest3/sites/asmtportal/Lists/SearchList/AllItems.aspx
because this will always fall under the Rule URL.
Is there any other method to create a search scope that searches only through the items of a single SharePoint list in a site collection? Any pointers from SharePoint experts would be appreciated.
You will most likely need to add a crawl rule to exclude that one page from being added to the index (which will then prevent it from being included in your search scope).
Throughout my SharePoint site, I have several document repositories that are tied to primary keys from an external database. I have added custom columns to the document library metadata fields so that we know which SharePoint documents correspond to which table entries. As a requirement, we need document uploads to have these fields populated automatically. For instance, I'd like to have the following URL:
./Upload.aspx?ClassID=2&SystemID=63
so that when you upload any document to this library, the ClassID and SystemID values are automatically written to the corresponding ClassID and SystemID columns of the document library.
Is there any quick or easy way to do this, or will I have to completely rewrite the Upload.aspx script from scratch?
I think the only way to go is to create your own Upload.aspx page. Read more here.
Unfortunately, it looks like going custom is the only option for now. Here are some tips on how to code the submission page.
There is a corresponding entry that describes how to add a document to a document library here:
How do you upload a file to a document library in sharepoint?
Likewise, once you have a document library file handler, you can alter its metadata column values using this method:
http://www.davehunter.co.uk/Blog/Lists/Posts/Post.aspx?List=f0e16a1a-6fa9-4130-bcab-baeb97ccc4ff&ID=109
Essentially it's just
SPFile.Item["ColumnName"] = "Value";
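Putting those two pieces together, the core of a custom upload page's code-behind might look like this sketch (the library name, the FileUploadControl, and the button handler are placeholders of mine; the ClassID/SystemID names come from the question):

using System;
using System.IO;
using Microsoft.SharePoint;

public partial class Upload : System.Web.UI.Page
{
    // Hypothetical handler for the page's upload button; FileUploadControl
    // is an <asp:FileUpload> assumed to be declared in the page markup.
    protected void UploadButton_Click(object sender, EventArgs e)
    {
        SPWeb web = SPContext.Current.Web;
        SPFolder library = web.GetFolder("Shared Documents"); // assumed library name

        using (Stream stream = FileUploadControl.PostedFile.InputStream)
        {
            SPFile file = library.Files.Add(
                Path.GetFileName(FileUploadControl.FileName), stream, true);

            // Stamp the metadata columns from the query string, then persist.
            SPListItem item = file.Item;
            item["ClassID"] = Request.QueryString["ClassID"];
            item["SystemID"] = Request.QueryString["SystemID"];
            item.Update();
        }
    }
}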