Block parts of a web page from being indexed - Nutch

I crawled a website. The pages share a lot of common content, such as drop-down menus and navigation. How can I prevent this content from being indexed?

Not sure if you still need to do this, but in case you do, you could try the blacklist_whitelist plug-in, which can be found at https://issues.apache.org/jira/browse/NUTCH-585.
The plug-in lets you list the elements you want to either block or allow, but not both.
For example:
<property>
<name>parser.html.blacklist</name>
<value>noscript,div,#footer</value>
<description>
A comma-delimited list of css like tags to identify the elements which should
NOT be parsed. Use this to tell the HTML parser to ignore the given elements, e.g. site navigation.
It is allowed to only specify the element type (required), and optional its class name ('.')
or ID ('#'). More complex expressions will not be parsed.
Valid examples: div.header,span,p#test,div#main,ul,div.footercol
Invalid expressions: div#head#part1,#footer,.inner#post
Note that the elements and their children will be silently ignored by the parser,
so verify the indexed content with Luke to confirm results.
Use either 'parser.html.blacklist' or 'parser.html.whitelist', but not both of them at once. If so,
only the whitelist is used.
</description>
</property>

I have been working with the Nutch codebase for the past 2 years and, as far as I have seen, this isn't possible. Once the content enters the Nutch segments, you can't strip off parts such as drop-down menus and navigation and keep only the required content.
If you or anyone else knows how to do it (without modifying the code, of course), please share.

Related

Wayfinder IncludeDocs parameter in Modx is breaking the snippet

I'm quite stuck on an unexpected problem. I'm trying to use Wayfinder to generate a sitemap for a project. The output of the navigation items is as expected, but I need to include a number of documents in addition to the primary navigation elements.
To do this, I have used the includeDocs parameter.
[[Wayfinder? &startId=`0` &includeDocs=`17,18,19,20`]]
When I do this, I get no output at all. Remove includeDocs and I get the standard nav (expected). Use the param and the output is completely empty.
No idea what I'm doing wrong or what (if any) other setting must be defined in order to make this work.

The includeDocs parameter is very misleading. It should rather be named "onlyIncludeDocs" or "restrictTo", since that is what it does. It also requires the docs you include to be directly accessible from your startId, or else to have their entire path "included".
I would suggest you create weblink resources directly under your startId, and link them to the resources you want to include. That way Wayfinder will pick them up by default. (Note that you may need to handle this in your rowTpl for Wayfinder, since a weblink stores the actual link in its content field.)
If you also want to include the children of the IDs you specify, you would probably be better off slightly revising your resource structure.

Drupal, Solr & Facet Api - Persistent facet links in blocks

I need to produce facet blocks from two vocabularies on my site. I am using Views and a patched version of Views Infinite Scroll to generate the search page from my search index, and I have tweaked everything I could in the facet display settings to get the requested results, to no avail.
I do not need keyword searches. I need to show all taxonomy terms in each facet at all times and to be able to select a single criterion at a time from each vocabulary. So, never more than one selection at a time from each facet block.
You may ask why I am using Solr to store data and generate the search page if I do not need keyword search and am going against the way Solr facets natively work. The reason is performance: I use Solr to store and serve the results, and I have even gone as far as pushing rendered nodes to the index with the help of the somewhat obscure search_api_solr_view_modes module.
I could take two separate routes:
Create a custom block, load all the taxonomy terms, alter the output of the term link to point to the view and provide the TID for the View. The active filter data could be obtained from the view arguments. I know how to do that but feel it is the wrong way to go about it: if I am working with Solr, I should be using a facet, not a custom block.
Build a custom Facet block that has this exact behaviour. After reading a lot of documentation, I got kind of discouraged about the possibility of doing this simply, without having to develop a Facet plugin, which is kind of out of my league.
Any advice is appreciated.
Here is a screenshot of the interface I have to produce.
http://imageshack.com/a/img834/9836/kr0i.png
Each taxonomy term has to be persistent, i.e., produce a link even if there are no nodes indexed under this term.
Selecting a term in one of the vocabularies will deselect previously selected terms.
Clicking on the x next to a term will remove it from the active search criteria.

Have a look at this: https://drupal.org/project/ajax_facets. This might get you to where you need to be, sans your infinite scroll. There is a YouTube video that goes with it: http://www.youtube.com/watch?v=pBj3OkXLyWs
I'd appreciate hearing whether it works, as I haven't tried it myself.

Searching Single Pages with Dynamic Content

I have a slight problem I have been trying to address for a client I have been working with. We have 4 sets of single pages that are loading content from a database using PHP based upon a get string that is provided. These pages that are generated are optimized well for SEO and have alt tags for images and Content that we need to be able to search using a search feature.
Now I had assumed (and everyone knows what assuming gets you) that these pages would by default be searchable with the Concrete5 built-in search feature. But it doesn't work. If I search for a word that I know is definitely on one of these pages, even multiple times, no results are found.
How can I make Concrete5 search these pages? If it's not doable by default or with a plugin, can someone please offer some advice on how to fix this? This is an important feature and must be completed.
EDIT: See my comment below. I still need some help or direction here, as CSE isn't much of an option.
EDIT2: It may be viable for me to install a crawler and a custom search engine to address my problems. I was thinking of spider. Any other suggestions on that or other options are much appreciated!

Unfortunately C5 doesn't provide a way to do this -- the only way to tap into the search index is with blocks. And even if you created a phony block just to pass content from the single_page through to the search index, there's no way to say that some content is from one URL while other content is from another URL (which you'd need to do since your single_page controller is handling many different URLs).
I don't know of a way to achieve what you want to do (and it appears that nobody else does either -- http://www.concrete5.org/community/forums/customizing_c5/make-content-in-single-pages-searchable/ ), other than building your own internal search engine.
EDIT: I just did some digging, and thought that perhaps you could manually insert records into the PageSearchIndex table and specify the searchable content and the desired path there -- but this won't work because it relies on one cID (collection id, a.k.a. page id) per entry -- so you'd only be able to insert one record for the top-level single_page path.
I think the simplest solution here would be to create your own searching infrastructure for your single_pages (like some kind of function in the controller that would return an array of page paths and searchable content for each one), then override the search block and perform an additional search of your single_pages -- then combine the results on the search results page there. Or just use Google Site Search for your site, which will actually crawl the pages and hence find your various single_page URLs: https://www.google.com/cse/
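
As a rough illustration of the "own searching infrastructure" idea, here is a small Python sketch of the data flow only. It is not Concrete5 code (that would be PHP), and every name and path in it is hypothetical: get_single_page_entries stands in for the controller function that returns page paths with their searchable content, and builtin_results stands in for whatever the regular search block already found.

```python
def get_single_page_entries():
    """Hypothetical stand-in for the controller function: path -> searchable text."""
    return {
        "/products/view?id=1": "Red widget with steel frame",
        "/products/view?id=2": "Blue gadget, image alt text: gadget photo",
    }


def search_single_pages(query, entries):
    """Case-insensitive substring search over the dynamic pages."""
    needle = query.lower()
    return [path for path, text in entries.items() if needle in text.lower()]


def combined_search(query, builtin_results, entries):
    """Append dynamic-page hits to the built-in results, skipping duplicates."""
    extra = search_single_pages(query, entries)
    return builtin_results + [path for path in extra if path not in builtin_results]


entries = get_single_page_entries()
print(combined_search("widget", ["/about"], entries))
```

A plain substring match is the simplest possible choice here; ranking, stemming and result excerpts would all need extra work.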
Best of luck.

I have not tested this, but maybe you can put a function getSearchableContent() in the single page's controller like you do for blocks. This would return the string to be searched. It would look something like this:
function getSearchableContent() {
    // ... compose $searchstring depending on the queried content.
    return $searchstring;
}
But I don't know if this works for dynamic content. If not, I'd look into C5's search index core classes and try to extend them for your project.

Dividing long content to subpages

I need to divide long content into sub-pages.
Rule for dividing: Heading1 (H1)
CMS: MODX Evolution
As far as I know, there is nothing in MODX to use for this kind of problem.
I will probably have to do this manually anyway, but I would still like to know if there is a way to do this in MODX Evo / Revo.
Edit:
I need to do this in MODX; the sub-pages have to be actual subpages, and the original page becomes a container.
Navigation will be done with Wayfinder.
Edit2:
All done.. manually. Question still open, though.

This is not possible out of the box and I don't know of any extra that achieves what you want. You would have to write a plugin that runs every time you save a resource and splits up the content, creates/deletes sibling resources as needed, etc. That sounds like a lot of work for what you want to achieve.
I suggest you have a look at the MIGX extra. It provides you with a TV that can store an indefinite number of distinct TV content sets. Have a look at the documentation and Mark Hamstra's tutorial (with screenshots) to see how it is done. You should define one MIGX entry to consist of a text field for the <h1> and a rich text field for the content of the "subpage".
Afterwards, you can use form customization to hide the original content field and display your MIGX TV instead.
I think this is a much easier way to achieve what you want, and I can't think of any way in which you would benefit from actual subpages.
Edit: Sorry, I just realized that you were asking about Evolution, not Revolution. My solution would work in Revo, but I don't think there's something like MIGX for Evo. Sorry, my mistake.

Not 'out of the box'; you will have to run your content through a snippet to parse it into separate divs or something that you can run some JavaScript on to possibly 'tab' the content.
If you need to show the 'subpages' in your navigation, you will probably have to use the getResources extra to parse your content ~ which will be very expensive on resource usage.
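
The parsing step itself, splitting the content at each H1, can be sketched in a few lines. This is a Python illustration only, not a MODX snippet or plugin (those would be PHP), and it does not create any resources. The function name split_on_h1 is made up, and the regex approach assumes reasonably clean markup with non-nested <h1> elements; anything before the first <h1> is ignored.

```python
import re

# Matches one <h1 ...>title</h1> element, case-insensitively, across line breaks.
H1_PATTERN = re.compile(r"<h1\b[^>]*>(.*?)</h1>", re.IGNORECASE | re.DOTALL)


def split_on_h1(content):
    """Return a list of (title, body) pairs, one per <h1> section."""
    matches = list(H1_PATTERN.finditer(content))
    sections = []
    for i, match in enumerate(matches):
        # The body runs up to the next <h1>, or to the end of the content.
        body_end = matches[i + 1].start() if i + 1 < len(matches) else len(content)
        title = re.sub(r"<[^>]+>", "", match.group(1)).strip()  # drop inline tags
        body = content[match.end():body_end].strip()
        sections.append((title, body))
    return sections


content = ("<h1>Intro</h1><p>First part.</p>"
           "<h1 class='x'>Details</h1><p>Second part.</p>")
for title, body in split_on_h1(content):
    print(title, "->", body)
```

Each (title, body) pair would then map to one tab, div, or child resource, depending on which of the approaches in this thread is used.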

You can (depending on how you're using the tree) just create actual sub resources under the parent resource, using Ditto or Wayfinder to build navigation for it.
If you can't use the tree like that (though from your description I think you can), you could also set up a number of template variables ("content1", "content2", "content3" etc) and show that with a simple snippet or so.

What bad things can a user do in a browser without the script tag?

I have an entry form where the user can type arbitrary HTML. What do I need to filter out besides script tags? Here's what I do:
userInput.replace(/<(script)/gi, "&lt;$1");
but the sanitizer of WMD (used here on SO) manages a white list of tags, and filters out (blanks) all other tags. Why?
I don't like white lists because I don't want to prevent the user from entering arbitrary tags if she so chooses; but I can use a more extensive black list, besides 'script', if needed. What do I need as a black list?

Short answer: anything they can do with the script tag.
The script tag is not required to run JavaScript. Script can also be placed in almost every HTML tag. Script can appear in a number of places in addition to the script tag including, but not limited to, src and href attributes that are used for URLs, event handlers, and the style attribute.
The ability for a user to put unwanted script into your page is a security vulnerability known as cross-site scripting. Read around this topic and read the XSS prevention cheat sheet.
You may not want to let users add HTML to your pages. If you need this feature, consider other formats such as Markdown, which allows you to disable the use of any embedded HTML; another, less secure option is to use a filtering library that tries to remove all script, such as HTMLPurifier. If you choose the filtering option, be sure to subscribe to announcements of new releases, and always go back to your project to install the bug-fixed releases of the filter as new exploits are found and worked around.
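
A short Python sketch of why a script-only black list is not enough. The blacklist_script function is an assumed Python port of a filter that neutralizes only script tags, and the payloads are standard textbook examples of script hidden in an event handler and in an href attribute. Escaping everything with the standard library's html.escape is shown as the blunt alternative: it blocks all markup, so it only fits when users do not need to enter HTML at all.

```python
import html
import re


def blacklist_script(user_input):
    """Neutralize only <script tags (assumed port of a script-only black list)."""
    return re.sub(r"<(script)", r"&lt;\1", user_input, flags=re.IGNORECASE)


# Script that runs without any script tag.
PAYLOADS = [
    '<img src="x" onerror="alert(1)">',         # event handler attribute
    '<a href="javascript:alert(1)">click</a>',  # javascript: URL in href
]

for payload in PAYLOADS:
    filtered = blacklist_script(payload)
    print("black list left it unchanged:", filtered == payload)
    print("escaped:", html.escape(payload))
```

Both payloads pass through the black list untouched, which is the gap a white list of tags and attributes is meant to close.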
