Nutch and sitemap.xml

Does Apache Nutch support sitemaps?
Or how can I implement support myself? And how can I use the priority field; should it be multiplied into the boost field?

Not that I'm aware of.
Depending on the behaviour you expect, there are multiple possible implementations; can you be more specific?
For instance:
+ You can arrange for newly submitted sitemaps to be 'injected' with a high score so their URLs get crawled earlier. For this, just add an inject command before starting a new crawl/fetch/index cycle (see the sketch after this list).
+ You can create a scoring plugin which will boost URLs found in sitemaps...
But you cannot define recrawl periods at the URL level, as a sitemap would indicate. Nutch has a built-in function which recrawls URLs that change frequently more often, and vice versa. However, you could decide to boost the score of URLs with a frequent refresh rate so that they get crawled earlier...
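If you go the inject route, the sitemap's priority values map naturally onto injected scores. Below is a minimal, stand-alone sketch (not part of Nutch itself) that turns a plain sitemap.xml into a Nutch seed file using the per-URL nutch.score metadata key understood by the 1.x Injector; the multiplier of 10 and the command-line arguments are arbitrary choices for illustration:

import java.io.File;
import java.io.PrintWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

public class SitemapToSeeds {
  public static void main(String[] args) throws Exception {
    // args[0] = path to sitemap.xml, args[1] = seed file to write
    Document doc = DocumentBuilderFactory.newInstance()
        .newDocumentBuilder().parse(new File(args[0]));
    try (PrintWriter out = new PrintWriter(args[1])) {
      NodeList urls = doc.getElementsByTagName("url");
      for (int i = 0; i < urls.getLength(); i++) {
        Element url = (Element) urls.item(i);
        String loc = url.getElementsByTagName("loc").item(0).getTextContent().trim();
        NodeList prio = url.getElementsByTagName("priority");
        double priority = prio.getLength() > 0
            ? Double.parseDouble(prio.item(0).getTextContent().trim())
            : 0.5; // default priority per the sitemap protocol
        // Higher sitemap priority -> higher injected score -> crawled earlier.
        out.println(loc + "\tnutch.score=" + (priority * 10));
      }
    }
  }
}

The resulting seed file can then be fed to the usual inject step (bin/nutch inject <crawldb> <seed dir>) before the next generate/fetch cycle.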

I guess they support it now; I found it at this link:
https://wiki.apache.org/nutch/SitemapFeature

How to show only pages, not different content items?

We are using Liferay as a classic CMS, meaning that we compose pages using web content articles. There is an issue with Liferay's internal search I could not yet find a proper answer for:
Because web content articles are pretty much only building blocks for pages, we don't want the search to show them as distinct items. The user should only get a list of pages that contain their search keywords, including all the articles put onto those pages.
At the moment we can see two different approaches and both come with certain problems we could not solve yet:
Idea 1
We modify the journal indexer and try to obtain all URLs of the pages (how?) on which the article has been placed. Then we add them to the document to be indexed. In the search results we can then access the URLs and collect them. In the end we make sure every URL is only shown once.
Idea 2
At some point Liferay renders the entire page before sending it to the browser. If we somehow could put an indexer there, we could index the entire page. We then could limit the search to the special "page documents". Getting the fully rendered page would be the main issue here, because either we would have to run a crawler to frequently trigger this indexing or we would need to find a way to trigger page rendering from within an indexer or something like that.
I have been carrying this problem around for quite a while now and still could not find an idea good enough to spend time trying it out. If anyone of you has some input on those two ideas or maybe an entirely different approach, I would be extremely grateful.
I'll just answer myself, because by now we have found a suitable solution to our problem:
In addition to the default search portlet, there is also a "Web Content Search Portlet" shipped with Liferay. It seems to have been part of Liferay for quite a while now, but it's somewhat hard to find, because there is hardly any documentation for it (I only found the Liferay wiki page, which doesn't really say anything at all). It searches only within web content articles and shows links to the pages rather than just a link to an isolated view of the article. It has far fewer configuration options than the default search portlet, however; pretty much all it allows you to change is whether articles actually have to be placed on at least one page to show up in the results.
So there is no need for any kind of custom indexer or any other "hack"...all we need to do is use the correct portlet. We will only need to write a hook that changes the appearance of the result page.
What you ask is interesting, but your ideas go in the wrong direction.
Idea 2 in particular is problematic because you cannot do indexing work while a page is being rendered; think about the performance impact alone.
In Liferay, pages and assets are not directly linked: pages contain portlets, and portlets display assets (web content and more).
Liferay's indexing scans the content of the assets themselves, not the rendered output of those assets. Also think about permissions: the same page can display different content depending on the user who is looking at it.
bye

Modx - Extend site_content - Add new table

Currently, we're running Revolution 2.2. In site_content, we have some tags that are run for crawling Twitter. I want to start tracking the number of results for each tag as results come in, to determine which tags don't return many results, etc.
So I was thinking that I should create a new table (twitter_data) with a foreign key that links it to the search tag ID, which is stored in site_content.
What is the best path to accomplish this? Should I create my table then run the reverse schema tool, outlined here?
http://rtfm.modx.com/display/revolution20/Reverse+Engineer+xPDO+Classes+from+Existing+Database+Table#ReverseEngineerxPDOClassesfromExistingDatabaseTable-CreatingaMySQLtable
I also found this, but I'm not sure if it is what I should be looking into:
http://rtfm.modx.com/display/revolution20/Using+Custom+Database+Tables+in+your+3rd+Party+Components
Probably not. If you can avoid modifying the core MODX schema, do so; an external table may be your best option, but it requires a fair bit of work.
Though if you can explain what you mean by 'tags' a little better [HTML tags? snippets? content tags? not sure what you mean], there may be other options. For example, one of our clients wanted to count page hits [and didn't want to use Google to do it], so all we did was create a template variable bound to each page they wanted to count and then update the appropriate variable with a plugin that fires on the onpageload or onpagerender event. [I don't remember exactly which it was or what it was called.]
Basically, you may be able to do this by writing a plugin rather than trying to extend anything or add snippets/chunks.

What's the best way to use multiple languages on a website?

I was wondering what would be the best way to build a multi-language, template-based website. Say I want to offer my website in English and German; there are several possible methods. My interest is mainly SEO, so which way would be best for search engines?
The first way I often see is using a different directory for each language, for example www.example.com for English and www.example.com/de/ for the German translation. The disadvantage of this is that when a file is changed, it has to be changed in every directory manually. And wouldn't the two directories be considered duplicate content by search engines?
The second way I know of is using a GET parameter like www.example.com?lang=de and then setting a cookie. But this way search engines probably won't even find the different languages.
So is there another way, or which one is the best?
I worked on internationalised websites until this year. The advice we always got from SEO gurus was to distinguish languages based on the URL - so, www.example.com/en and www.example.com/de.
I think this is also better for users; if I bookmark a page in German, then when I come back to it, I get a page in German even if my cookies have expired. Similarly, I can do things like post the URL on Facebook, and have my German-speaking friends click on it and get the site in German.
Note that if your site serves multiple countries, you should handle those along with language - so, you might have example.com/de-DE, example.com/en-GB, example.com/en-IE, etc.
However, this should not involve duplication. Instead, you should set your application up to process the URL, extract the locale information, and then forward the request internally to a locale-independent page. So, a request for example.com/de-DE/info and a request for example.com/en-IE/info should both be passed to /info.jsp (or, I'm guessing, info.php in your case). That page should then be coded to emit text in the appropriate language, using a page-level localisation mechanism.
Things are a bit trickier if you want the URLs themselves to be localised (e.g. example.org/de-DE/anmelden vs example.org/en-IE/sign-in). However, the same principle applies: extract the locale, then forward to a common page. The difference is that more sophistication is needed to figure out which page the URL refers to; you will need a mapping from the natural-language path in the URL to the page filename.
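To make the "forward internally" part concrete, here is a minimal sketch of a servlet filter, in Java since the answer mentions info.jsp. The class name, the "locale" request attribute and the example path mapping are made up for illustration, and it assumes Servlet 4.0+ so that the default init()/destroy() implementations can be relied on:

import java.io.IOException;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import javax.servlet.Filter;
import javax.servlet.FilterChain;
import javax.servlet.ServletException;
import javax.servlet.ServletRequest;
import javax.servlet.ServletResponse;
import javax.servlet.http.HttpServletRequest;

public class LocaleFilter implements Filter {

  // Mapping from localised path segments to the locale-independent page name,
  // only needed if the URLs themselves are localised (last paragraph above).
  private static final Map<String, String> LOCALIZED_PATHS = new HashMap<>();
  static {
    LOCALIZED_PATHS.put("anmelden", "sign-in");
  }

  @Override
  public void doFilter(ServletRequest req, ServletResponse res, FilterChain chain)
      throws IOException, ServletException {
    HttpServletRequest request = (HttpServletRequest) req;
    String path = request.getRequestURI().substring(request.getContextPath().length());
    String[] parts = path.split("/", 3);               // "/de-DE/info" -> ["", "de-DE", "info"]
    if (parts.length == 3 && parts[1].matches("[a-z]{2}(-[A-Z]{2})?")) {
      Locale locale = Locale.forLanguageTag(parts[1]);
      request.setAttribute("locale", locale);          // the page reads this to pick its texts
      String page = LOCALIZED_PATHS.getOrDefault(parts[2], parts[2]);
      request.getRequestDispatcher("/" + page + ".jsp").forward(req, res);
      return;
    }
    chain.doFilter(req, res);                          // no locale prefix: leave the request alone
  }
}

The same idea works in PHP or any other stack: strip the locale prefix, remember the locale, and dispatch to one shared page.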

TYPO3: How to count page impressions on every page with an extension

I need to count the page impressions of every page on a TYPO3 site and store them in the database.
So I think I need an extension that is called on every page impression and increases an 'impressions' column for the specific page in the database.
I'm new to TYPO3 and new to extension development as well. Is there a way to include an Extbase extension on every page so that some PHP script gets called?
(Update)
I want to add more information:
I don't need a counter that counts all page impressions globally; the counter needs to be page-related, so it makes sense to extend TYPO3's pages table. Another requirement is that the extension should be built with Extbase.
I'm new to TYPO3 and new to extension development as well. Is there a way to include an Extbase extension on every page so that some PHP script gets called?
Once your plugin is configured, you can include it on any page with page.1234 < plugin.tx_yourextension_pi1; 1234 determines the position on the page.
The script should be a USER_INT object so that it is not cached (mind you, this will cost a lot of performance, as @norwebian already pointed out).
As you don't want to output anything, make sure the controller stays empty as well.
Did you do a quick search in the extension repository? Trying a search for "page counter" reveals four relevant extensions.
"Sys_stat" is the closest thing to an "official" solution, it is really just enabling a few settings already existent. It has been reported to fill up the database with too much data, though.
"Generic Visitor Counter" would be my favourite, I believe (if I was going for a page counter at all), it is recently updated and seems simple enough.
You should really consider a proper stats extension, though. Both ics_awstats and ke_stats have been in my toolset.
YMMV. Be aware that if your site is popular, stats gathering quickly gets out of hand. On the other hand, if you go for a simple counter, including uncached extensions will cost performance.
I am not sure I have really understood what you want and need. After all, page impressions are not the same as page views, though I couldn't say offhand how you would tell them apart on the page itself. So am I right in assuming that you mean page views?
If yes: I would take the following approach:
A separate, autonomous extension with a JavaScript file for asynchronously calling an API, and a table for storing page views / page impressions.
Each page globally includes a JavaScript that initializes itself.
Once the DOM is ready, it sends a call to an AJAX API endpoint with the URL of the page as a parameter.
The endpoint takes only the URL.
For each unique URL, a record including a counter is created or updated (a sketch of such an endpoint follows at the end of this answer).
Extending the pages table doesn't make sense to me. What do you do with a website that consists of news overviews, news details, press and blog sections, a dealer search and a store with product pages?
I would keep the statistics table standalone.
If you expand the table a bit and add date and time, rather than a simple increment of hits, you can even identify the hottest pages of the week, the month, etc.
--
My approach won't increase/delay page load time much, if at all, and will have little noticeable impact even on heavily requested websites.
With the AJAX endpoint, it's then up to you how you deploy it and how much of the CMS framework you want to load.
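Since the endpoint's job is independent of the CMS, here is a minimal, language-neutral sketch of the upsert it has to perform, shown in plain Java/JDBC purely for illustration; the table name page_hits, the connection details and the MySQL ON DUPLICATE KEY syntax are assumptions, not anything TYPO3 ships with:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;

public class HitCounter {
  // Assumes a table page_hits(url, day, hits) with a unique key on (url, day).
  public static void count(String pageUrl) throws Exception {
    try (Connection con = DriverManager.getConnection(
             "jdbc:mysql://localhost/stats", "user", "secret");
         PreparedStatement ps = con.prepareStatement(
             "INSERT INTO page_hits (url, day, hits) VALUES (?, CURRENT_DATE, 1) "
             + "ON DUPLICATE KEY UPDATE hits = hits + 1")) {
      ps.setString(1, pageUrl);
      ps.executeUpdate();
    }
  }
}

Counting per URL and per day is what later lets you pick out the hottest pages of a week or a month with a simple GROUP BY.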

Nutch : get current crawl depth in the plugin

I want to write my own HTML parser plugin for Nutch.
I am doing focused crawling by generating only the outlinks that fall under a specific XPath.
In my use case, I want to extract different data from the HTML pages depending on the current depth of the crawl, so I need to know the current depth in the HtmlParser plugin for each piece of content I am parsing.
Is this possible with Nutch? I see that CrawlDatum does not carry crawl-depth information.
I was thinking of keeping a map of this information in another data structure.
Does anybody have a better idea?
Thanks
Crawl.java has a NutchConfiguration object, which is passed in while initializing all the components. I set a property for the crawl depth before creating a new Fetcher:
conf.setInt("crawl.depth", i + 1);  // i is the index of the current generate/fetch/update cycle in Crawl.java's loop
new Fetcher(conf).fetch(segs[0], threads,
    org.apache.nutch.fetcher.Fetcher.isParsing(conf)); // fetch it
The HtmlParser plugin can access it as below:
LOG.info("Current depth: " + getConf().getInt("crawl.depth", -1));
This doesn't force me to break the MapReduce flow.
Thanks
Nayn
With Nutch, "depth" represents the number of generate/fetch/update cycles run successively. For example, if you are at depth 4, it means you are in the fourth cycle. When you say that you want to go no further than depth 10, it means that you want to stop after 10 cycles.
Within each cycle, the number of previous cycles run before it (the "depth") is unknown; that information is not kept.
If you have your own version of Crawl.java, you could keep track of the current "depth" and pass that information to your HTML parser plugin.
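To make the plugin side concrete, here is a minimal sketch of a depth-aware parse filter. It is written against the Nutch 1.x HtmlParseFilter extension point; treat the package names and the filter() signature as assumptions to be checked against your Nutch version, and note that "crawl.depth" is the custom property set in the modified Crawl.java above, not a standard Nutch setting:

import org.apache.hadoop.conf.Configuration;
import org.apache.nutch.parse.HTMLMetaTags;
import org.apache.nutch.parse.HtmlParseFilter;
import org.apache.nutch.parse.ParseResult;
import org.apache.nutch.protocol.Content;
import org.w3c.dom.DocumentFragment;

public class DepthAwareParseFilter implements HtmlParseFilter {

  private Configuration conf;

  public ParseResult filter(Content content, ParseResult parseResult,
                            HTMLMetaTags metaTags, DocumentFragment doc) {
    int depth = conf.getInt("crawl.depth", -1);  // -1 means Crawl.java did not set the property
    if (depth == 1) {
      // e.g. extract overview-page fields / XPath expressions here
    } else {
      // e.g. extract detail-page fields / XPath expressions here
    }
    return parseResult;
  }

  public void setConf(Configuration conf) { this.conf = conf; }

  public Configuration getConf() { return conf; }
}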
