Crawling different sites with Nutch 1.8

I am using Nutch 1.8 to crawl information from sites that have different patterns for the same field. I wrote a plugin for each of those sites, but when I start Nutch, only the first plugin matches against all the sites; the others are ignored as if they did not exist.
Is there a way to make a plugin that does not match a site get skipped, so the next one is checked, and so on until the right plugin for that site is found?

It's not clear why you are getting this. Are you writing an HTMLParseFilter? What you could do is exit the parse method early if the current document's URL does not match a given pattern, or alternatively pass some metadata from the seeds which you could use to determine which HTMLParseFilter implementation to use.
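For the first option, a rough, untested sketch written against the Nutch 1.x HtmlParseFilter interface could look like the following (the class name and URL pattern are made-up placeholders):

import java.util.regex.Pattern;

import org.apache.hadoop.conf.Configuration;
import org.apache.nutch.parse.HTMLMetaTags;
import org.apache.nutch.parse.HtmlParseFilter;
import org.apache.nutch.parse.ParseResult;
import org.apache.nutch.protocol.Content;
import org.w3c.dom.DocumentFragment;

public class SiteAParseFilter implements HtmlParseFilter {

    // Hypothetical pattern for the one site this plugin is responsible for.
    private static final Pattern SITE_PATTERN =
        Pattern.compile("^https?://(www\\.)?site-a\\.example/.*");

    private Configuration conf;

    @Override
    public ParseResult filter(Content content, ParseResult parseResult,
                              HTMLMetaTags metaTags, DocumentFragment doc) {
        // Not "our" site: return the ParseResult untouched so the other filters get their turn.
        if (!SITE_PATTERN.matcher(content.getUrl()).matches()) {
            return parseResult;
        }
        // Site-specific extraction goes here, e.g. walking `doc` and adding
        // fields to the parse metadata.
        return parseResult;
    }

    @Override
    public void setConf(Configuration conf) {
        this.conf = conf;
    }

    @Override
    public Configuration getConf() {
        return conf;
    }
}

If each per-site plugin bails out the same way for URLs it does not own, they should stop stepping on each other.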
BTW you'd get a more relevant audience by posting on the Nutch user list (see http://nutch.apache.org/mailing_lists.html)

Related

How can I get the tags of Stack Overflow questions for the Solr index?

Recently I used nutch-1.11 and solr-4.10.4 to set up a crawler. I can crawl data with the sequential Nutch commands, but now my problem is how to fetch specific data, such as the tags of Stack Overflow questions, so that I can use that data in Solr indexing for my own purposes. I tried to configure and modify "local/conf/nutch-site", but it doesn't work for me. I'm new to Nutch!
Nutch fetches URLs, so what you could do is point it to a page which contains all the links to the questions with that tag.
For example
https://stackoverflow.com/questions/tagged/nutch?sort=newest, this page contains links to all questions having Nutch as the tag. Crawling 2 or more rounds will then make Nutch fetch all the outlinks from this page.
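To get the tags themselves into the Solr index, one common route is a custom indexing filter plugin. Below is a rough, untested sketch against the Nutch 1.x IndexingFilter interface; it assumes an earlier parse filter has already stored the scraped tags in the parse metadata under a made-up key "tags", and that the Solr schema has a matching multi-valued "tags" field:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.Text;
import org.apache.nutch.crawl.CrawlDatum;
import org.apache.nutch.crawl.Inlinks;
import org.apache.nutch.indexer.IndexingException;
import org.apache.nutch.indexer.IndexingFilter;
import org.apache.nutch.indexer.NutchDocument;
import org.apache.nutch.parse.Parse;

public class TagIndexingFilter implements IndexingFilter {

    private Configuration conf;

    @Override
    public NutchDocument filter(NutchDocument doc, Parse parse, Text url,
                                CrawlDatum datum, Inlinks inlinks) throws IndexingException {
        // Values written earlier by a parse filter that scraped the tag elements from the page.
        String[] tags = parse.getData().getParseMeta().getValues("tags");
        for (String tag : tags) {
            doc.add("tags", tag); // becomes a multi-valued "tags" field in Solr
        }
        return doc;
    }

    @Override
    public void setConf(Configuration conf) {
        this.conf = conf;
    }

    @Override
    public Configuration getConf() {
        return conf;
    }
}

The plugin also has to be listed in the plugin.includes property of nutch-site.xml before Nutch picks it up, which may be what the nutch-site changes were missing.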

Search result: How to show only pages, not different content items?

We are using Liferay as a classic CMS meaning that we compose pages using web content articles. There is an issue with Liferay's internal search I could not yet find a proper answer for:
Because web content articles are pretty much only building blocks for pages, we don't want the search to show them as distinct items. The user should only get a list of pages that contain their search keywords, including the articles placed on those pages.
At the moment we can see two different approaches and both come with certain problems we could not solve yet:
Idea 1
We modify the journal indexer and try to obtain all URLs of the pages on which the article has been placed (how?). Then we add them to the document to be indexed. In the search results we can then access the URLs and collect them. In the end we make sure every URL is only shown once.
Idea 2
At some point Liferay renders the entire page before sending it to the browser. If we somehow could put an indexer there, we could index the entire page. We then could limit the search to the special "page documents". Getting the fully rendered page would be the main issue here, because either we would have to run a crawler to frequently trigger this indexing or we would need to find a way to trigger page rendering from within an indexer or something like that.
I have been carrying this problem around for quite a while now and still could not find an idea good enough to spend time trying it out. If anyone of you has some input on those two ideas or maybe an entirely different approach, I would be extremely grateful.
I'll just answer myself, because by now we found a suitable solution to solve our problem:
In addition to the default search portlet there is also a "Web Content Search Portlet" shipped with Liferay. It seems to have been part of Liferay for quite a while now, but it's somewhat hard to find, because there is hardly any documentation for it (I only found the Liferay wiki page, which isn't really anything at all). It searches only within web content articles and shows links to the pages rather than just a link to an isolated view of the article. It has far fewer configuration options than the default search portlet, however. Pretty much all it allows you to change is whether articles actually have to be placed on at least one page to show up in the results.
So there is no need for any kind of custom indexer or any other "hack"...all we need to do is use the correct portlet. We will only need to write a hook that changes the appearance of the result page.
What you ask is interesting, but your ideas are going in the wrong direction.
Idea 2 in particular is wrong because you cannot do indexing work while a page is being rendered. Think about the performance alone.
In Liferay, pages and assets are not directly linked: pages have portlets, and portlets display assets (web content and more).
Liferay's indexing refers to and scans the assets' content, not the rendered output of the assets. Think about permissions: the same page can display different content depending on the user who is looking.
bye

SEO search result indentation (google)

I want my website to have indentation in Google search results.
After looking at many websites for reference, I found this one website, "www.traveloka.com".
Inside the website, I can't find any meta keywords stuff.
But the website is well indented.
My questions are:
- are meta keywords really needed to have Google indent my search results?
- if yes, why is the website www.traveloka.com well indented without meta keywords?
- if no, what matters then, besides having the pages link to each other with hrefs?
UPDATE:
While doing SEO, I found this website:
chlooe.com
It reports SEO advice and which things should be changed, etc.
I'll follow the instructions there. Any thoughts?
If by indentation you mean ... it's called sublinks.
Meta tags are no longer important for most search engines. They now rank pages according to content, so use strong keywords in your site's content to get a better ranking.
Having a specific page title helps a lot too.
As for the meta tags, personally, I like to leave them in, but they are no longer mandatory.
The Google site links are generated automatically by Google depending on your content.
Here are a few tips:
1) Have a sitemap.xml in your website. This will tell the crawlers which pages are available on your site. To generate a sitemap.xml, I use http://www.xml-sitemaps.com/
2) Submit that sitemap to google webmaster tools.
3) Use clean URLs. For example www.mydomain.com/contact, .../about-us, .../portfolio, etc. These help search engines separate the content and create sub links depending on the most important content.
4) Most important of all, get traffic on your website... no traffic = poor ranking.
This is not a full tutorial but just some tips. Search for "google sub links" to learn more.
Hope this helps
https://support.google.com/webmasters/answer/47334?hl=en

Searching Single Pages with Dynamic Content

I have a slight problem I have been trying to address for a client I have been working with. We have 4 sets of single pages that load content from a database using PHP, based on a GET string that is provided. The generated pages are well optimized for SEO, have alt tags for images, and contain content that we need to be able to search using a search feature.
Now I had assumed (and everyone knows what assuming gets you) that these pages would by default be searchable by the Concrete5 built-in search feature. But it doesn't work. If I search for a word that I know is definitely on one of these pages, even multiple times, no results are found.
How can I make Concrete5 search these pages? If it's not doable by default or with a plugin, can someone please offer some advice on how to fix this? This is an important feature and must be completed.
EDIT: See my comment below. I still need some help or direction here, as CSE isn't much of an option.
EDIT2: It may be viable for me to install a crawler and a custom search engine to address my problems. I was thinking of spider. Any other suggestions on that or other options are much appreciated!
Unfortunately C5 doesn't provide a way to do this -- the only way to tap into the search index is with blocks. And even if you created a phony block just to pass content from the single_page through to the search index, there's no way to say that some content is from one URL while other content is from another URL (which you'd need to do, since your single_page controller is handling many different URLs).
I don't know of a way to achieve what you want to do (and it appears that nobody else does either -- http://www.concrete5.org/community/forums/customizing_c5/make-content-in-single-pages-searchable/ ), other than building your own internal search engine.
EDIT: I just did some digging, and thought that perhaps you could manually insert records into the PageSearchIndex table and specify the searchable content and the desired path there -- but this won't work because it relies on one cID (collection id, a.k.a. page id) per entry -- so you'd only be able to insert one record for the top-level single_page path.
I think the simplest solution here would be to create your own searching infrastructure for your single_pages (like some kind of function in the controller that would return an array of page paths and searchable content for each one), then override the search block and perform an additional search of your single_page -- then combine the results on the search results page there. Or just use google site search for your site, which will actually crawl the pages and hence find your various single_page urls: https://www.google.com/cse/
Best of luck.
I have not tested this, but maybe you can put a function getSearchableContent() in the single page's controller, like you do for blocks. This would return the string to be searched. It would look something like this:
function getSearchableContent() {
    $searchstring = '';
    // ... compose $searchstring depending on the queried content.
    return $searchstring;
}
But I don't know if this works for dynamic content. If not, I'd look into C5's search index core classes and try to extend them for your project.

How do I move old content down in the search engine rankings?

There is some precedent for search-engine-ranking-related questions on StackOverflow, so please don't close this question. It's programming-related to the extent that HTML META tags can be called "programming".
Here's the problem:
We make FogBugz, the software project planning and bug tracking suite.
Either we did a great job with our old documentation or a crummy job with our new documentation, but for most of the popular searches on FogBugz terms, documentation for our old versions comes up.
Here's an example. For context, our current FogBugz version is FogBugz 7. The top two results for that search are for FogBugz 5, which is positively ancient.
As best I can tell, there are several options for getting these results out of the top slots, but each has problems:
A NOINDEX tag, but what happens if someone is actually searching for help on an old version?
Finding the incoming links to the old documentation and placing a NOFOLLOW on them to deprive the old docs of PageRank. Problem here is that it's really fiddly to find the links to the content, rather than changing the content itself.
The unavailable_after tag, which is just a time-delayed NOINDEX, with the same problem of removal rather than demotion.
I just want these old documentation versions to stop competing with our current versions, without being completely unavailable.
An approach I used in the past (3 years ago)
Change the URLs of your old documentation, and change your own links to point to the new URLs, e.g. abc.com/docs/fogzbugz/v5/xyz becomes abc.com/docs/fogzbugz/ancient/v5/xyz
For the old URLs, implement a 301 redirection to your new v7 content, e.g. a request to abc.com/docs/fogzbugz/v5/GettingStarted.html is redirected to abc.com/docs/fogzbugz/v7/GettingStarted.html (see the sketch below).
In this way, existing links from external sites will take browsers to the latest documentation, and inform indexing robots that the page has moved.
Google will find the new links to your old documentation by indexing your site, but there will be no external links, thus reducing page rank.
Google will also find the new links to your new documentation, and as more sites link to it, its page rank will increase and so take priority.
This worked for me on a small scale (100 or so pages) site, and visitor attempts to view the old content rapidly dropped off.
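To make the 301 step concrete, here is a hedged sketch that assumes, purely for illustration, that the documentation is served by a Java servlet container; the filter class is made up, and the same rule can just as easily be expressed as an Apache or IIS rewrite:

import java.io.IOException;

import javax.servlet.Filter;
import javax.servlet.FilterChain;
import javax.servlet.FilterConfig;
import javax.servlet.ServletException;
import javax.servlet.ServletRequest;
import javax.servlet.ServletResponse;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;

public class OldDocsRedirectFilter implements Filter {

    @Override
    public void doFilter(ServletRequest req, ServletResponse res, FilterChain chain)
            throws IOException, ServletException {
        HttpServletRequest request = (HttpServletRequest) req;
        HttpServletResponse response = (HttpServletResponse) res;
        String uri = request.getRequestURI();

        // Old v5 doc URL -> equivalent v7 URL, announced as a permanent (301) move.
        if (uri.startsWith("/docs/fogzbugz/v5/")) {
            response.setStatus(HttpServletResponse.SC_MOVED_PERMANENTLY);
            response.setHeader("Location", uri.replaceFirst("/v5/", "/v7/"));
            return;
        }
        chain.doFilter(req, res);
    }

    @Override
    public void init(FilterConfig filterConfig) {
    }

    @Override
    public void destroy() {
    }
}

The 301 status is what tells search engines the move is permanent, so the old URLs gradually drop out of the index in favour of the v7 pages.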
If a user does land on a v5 page, how about the MSDN approach of explicitly stating the version that the page describes, and providing links to the equivalent topic in the v6 and v7 docs?
I would suggest that external links to older versions get redirected to the latest version - with some sort of note that if you really needed version 5 the link is here.
I think a lot of the problem comes down to the fact that search engines give a page a high rank if a lot of people are linking to it. Unless you can get all the people linking to your old documentation to link to your new documentation instead, you are going to have a problem with the older documents being rated artificially high. To overcome this, you might need to change the way you handle documentation pages. One good way would be to always show the newest information on a particular topic, and only by clicking a link on the page do you get to the older versions. Optimally, this would be the same page, with a different parameter to state which version you want the documentation for.
What about trying the MSDN approach? You assign a version tag to your pages. When this page is displayed, its version number is displayed as well. Users will be able to see immediately that this information is deprecated.
You may need to write some stubs for new version pages, like "This problem has been resolved in the current version", so that users don't think you haven't done anything in 5 years. It's some writing work and some interlinking, but it's doable for a limited number of problematic pages.
