I use Apache Nutch 2.3.1 with Elasticsearch 1.7 for crawling and indexing respectively. After completing all the necessary procedures, the final content of the parsed page includes both the header and footer, which sometimes leads to slightly irrelevant search results.
I would like to configure Nutch to exclude the header and footer of a page from the content. There are some open issues in Nutch's JIRA but all seem to refer to the Nutch 1.x branch. In addition, I have enabled the boilerpipe plugin but I did not notice any change in the quality of the content.
Is there any plugin or another way to perform more precise parsing?
You could also use NUTCH-1870, which uses XPath to extract specific portions of the document, but it was also developed for Nutch 1.x. To be honest, although the Nutch 2.x branch is in active development (and has improved a lot over time), the 1.x versions are still more feature rich, and a lot of the new contributions are focused on the 1.x branch.
I'm guessing these plugins wouldn't be too hard to port to Nutch 2.x, and we welcome every contribution.
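The XPath idea can be tried outside Nutch with nothing but the JDK. The following is a minimal sketch, not the NUTCH-1870 code and not a Nutch API: the class name XPathContentFilter, the method extractContent and the selector in BOILERPLATE_XPATH are hypothetical, and it assumes the page is already well-formed XHTML (real-world pages would first need an HTML-tolerant parser). It takes the inverse approach to extracting a portion: it removes the nodes that match the header/footer selector and keeps the remaining text.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class XPathContentFilter {

    // Hypothetical selector: which elements count as header/footer is site-specific.
    public static final String BOILERPLATE_XPATH =
        "//header | //footer | //*[@id='header'] | //*[@id='footer']";

    /** Returns the page text after removing every node matched by xpathToRemove. */
    public static String extractContent(String xhtml, String xpathToRemove) throws Exception {
        // Assumption: the input is well-formed XML, so the JDK parser can read it.
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        Document doc = builder.parse(new InputSource(new StringReader(xhtml)));

        XPath xpath = XPathFactory.newInstance().newXPath();
        NodeList matches = (NodeList) xpath.evaluate(xpathToRemove, doc, XPathConstants.NODESET);

        // Detach each matched node from its parent.
        for (int i = 0; i < matches.getLength(); i++) {
            Node node = matches.item(i);
            if (node.getParentNode() != null) {
                node.getParentNode().removeChild(node);
            }
        }

        // Collapse whitespace in what is left of the document.
        return doc.getDocumentElement().getTextContent().replaceAll("\\s+", " ").trim();
    }

    public static void main(String[] args) throws Exception {
        String page = "<html><body><div id='header'>Site menu</div>"
            + "<p>Actual article text.</p><footer>Copyright</footer></body></html>";
        System.out.println(extractContent(page, BOILERPLATE_XPATH));
    }
}
```

In a ported plugin the selector would presumably come from configuration rather than a constant, since header and footer markup differs from site to site.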
I am planning to upgrade my company's intranet from Liferay 6.0.6 CE to 6.2 CE. I have done some research on it, but I am still confused about the API part. Will my custom portlets need only recompilation, or will they need a complete rewrite? I am also concerned about my theme and exts. I have a lot of customization in my exts and my theme. What would be the best way to move ahead?
I also have an NFS file server and a Solr search server configured with my current deployment. I need suggestions on those too.
I've heard recently that the Migration Tool (6.1 to 6.2) now also supports themes. It won't be pixel perfect though. Check what it can do for you.
There have been some APIs that changed. Contrary to the comments given to your question, I'd say "It depends": I don't know how much of Liferay's API you use or if you just add functionality on top. You'll have to find out for yourself. The migration tool might help you.
The things that have changed the most are: Themes (using Bootstrap, as of 6.2) and Document Library (now including ImageGallery, which was still available in 6.0). Migration of data should be smooth if you follow the documented upgrade path. Migration of your portlets and plugins will definitely require a recompile (within the new Plugins SDK or with updated Maven dependencies) and probably adaptation to some changed API calls. I've seen instances where this was simple, but I've also seen hard cases.
As there have been no more updates for 6.0 CE for quite a while, I recommend upgrading though (unlike #FeinesFabi in the comment). If you want to have a long-term stable platform that you don't need to maintain yourself, EE would be the way to go (supported for ~7 years after release).
For ext changes, you'll have to be aware that there are no guarantees: Ext allows you to change the inner implementation of Liferay, and that's what nobody strives to keep stable, even in minor updates. If you're using ext, you'll always have to be aware of incompatible changes. Ext allows you to keep your changes out of the official source code, so they're well isolated. It doesn't say anything about the underlying implementation being stable. With great power (ext) comes great responsibility. Keep your ext as small as possible: whatever you can do outside of ext should be done outside, using the public API.
The basic upgrade path (for Liferay itself, not your plugins) is quite well documented in the User's Guide.
I am currently developing an internal application for our company with the following requirements.
Rich GUI, but only basic HTML-like components will be used, so the component list is not a deciding factor.
Fuzzy requirements which might change frequently during the implementation stage, and tight turnaround times are the norm. Thus I am looking for drag-and-drop design.
The application will rarely be used (max once in a month) and the user base will not exceed 20.
Time is a critical factor, and thus I do not want to spend time on configuring and troubleshooting the framework. I will go for an easy-to-integrate solution.
I did some brief research and decided on JSF with ICEfaces, but I am now confused about the version. If I go with 1.8, I get drag-and-drop designing (NetBeans 6.5), but I will be stuck with JSF 1.2.
If I choose ICEfaces 2.0, I will have to design the UI manually, which might take more time.
Any suggestions on which version to choose?
If you can, go with the newest stable version. There are a lot of reasons to use JSF2.
You can have drag-and-drop with ICEfaces 2 too, at least in Eclipse (see the wiki). IDE integration is also available for NetBeans; the release notes mention a palette (I'm not familiar with NetBeans, but it may be what you're looking for).
I'm puzzled by build and run errors that mislead me. From them, I can't quite figure out what the distinction is between the various JavaServer Pages Standard Tag Library (JSTL) JARs. For instance, I see:
jstl.jar (in Apache Tomcat)
jstl-1.2.jar (in Tomahawk examples)
jstl-impl.jar (in GlassFish)
In times past, I've used the following from javax.servlet.jsp.jstl (which I recently recovered and have stored privately against disaster):
jstl-api-1.2.jar
jstl-impl-1.2.jar
These latter are the only ones I seem to be able to use reliably in doing JavaServer Faces (JSF) work.
There's no wiki statement I've found that contrasts these different JARs. Yeah, I know their ages are different. I wonder, for instance, if jstl.jar isn't supposed to be a modern, definitive, both in one (api and impl) and I'm just using the wrong JSF libraries (myfaces-api-1.2.8.jar, for instance) to go with it?
My purpose is to establish a definitive set of JARs for doing Facelet work using either MyFaces or RichFaces, the two I know best.
Thanks to anyone who can shed some light and best practice on this.
If your target servlet container has JSTL built in, then you do not need to have any of these JARs in your /WEB-INF/lib. Full-fledged Java EE containers like GlassFish and JBoss AS have it built in.
If your target servlet container does not have it built in (Tomcat, etc.), or you want to cover as many servlet containers as possible, then you need to pick the newest JSTL version which matches the Servlet API version as declared by your web.xml.
For more detail about JSTL version differences and where to download them, see our own JSTL tag wiki page. It's the same page you get when you hover over the jstl tag below your question and click info.
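The rule of matching the JSTL version to the Servlet API version declared in web.xml can be written down as a small lookup. This is only a sketch: the class JstlVersionPicker and the method jstlFor are hypothetical names, and the pairing it encodes (Servlet 2.3 with JSTL 1.0, Servlet 2.4 with JSTL 1.1, Servlet 2.5 or newer with JSTL 1.2) is the commonly cited one and should be checked against the JSTL wiki page mentioned above.

```java
public class JstlVersionPicker {

    /**
     * Returns the newest JSTL version usable with the given Servlet API version,
     * i.e. the version attribute declared on the web-app root element of web.xml.
     */
    public static String jstlFor(int servletMajor, int servletMinor) {
        int version = servletMajor * 10 + servletMinor; // e.g. Servlet 2.5 becomes 25
        if (version >= 25) {
            return "1.2";
        } else if (version == 24) {
            return "1.1";
        } else if (version == 23) {
            return "1.0";
        }
        throw new IllegalArgumentException(
            "No JSTL release known for Servlet " + servletMajor + "." + servletMinor);
    }

    public static void main(String[] args) {
        System.out.println("Servlet 2.5 -> JSTL " + jstlFor(2, 5));
        System.out.println("Servlet 2.4 -> JSTL " + jstlFor(2, 4));
    }
}
```

On a container that already bundles JSTL, none of this applies, since no JSTL JAR belongs in /WEB-INF/lib there.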
I am developing a Java EE based web application. We have very limited time to come up with an alpha version and are trying to decide on a web framework to use. It has to be something easy to learn but powerful. Standard JSP/Servlet is not an option here due to the development time it takes. I would appreciate it if anyone could advise. Current options are Wicket and GWT. (JSF is also an option.)
Wicket is component-based and comes with a bunch of standard components (like pagination, auto-complete, data grids, form handling etc.). If you want to create a standard panel (with the possibility for easy re-use), just create your HTML fragment to use as a template (with wicket:id attributes wherever you want to bind dynamic content or sub-components) and a corresponding Java file. Furthermore, you can attach specific CSS and JS files.
In my opinion, Wicket development is good value (functionality) for money. And you get a lot of built-in AJAX functionality without even writing (not reading) any JS. E.g., change the model for a component, attach the component to an AjaxRequestTarget and the panel is automagically repainted via DOM manipulation.
For a quick overview and intro I recommend Wicket in Action by Dashorst & Hillenius. (And don't miss out on other great resources.)
Everything depends on your application. I have no experience with Wicket and not much with JSF, but I have a lot of experience with GWT.
GWT is good if your application has to be mostly dynamic. In GWT you can change everything on the page without even calling the server. GWT is compiled to JavaScript. On the other hand, if you have a big project, it is quite frustrating when your application takes a few minutes to start in development, because there is a lot of code to compile to JavaScript. My opinion: it is not good for big projects.
If you don't need to change your pages so much client-side, I would use JSF2 (or Wicket, if I knew it).
Have a look at this comparison of Wicket and GWT, this may help you decide for yourself:
Wicket and GWT compared with code
I want to hide a SharePoint web that has been deprecated (via custom means) due to the release of a newer version, whether by making it invisible in the sites and workspaces or via some special archiving function provided by SharePoint. Basically, I do not want users to be able to see the deprecated site.
I was wondering what the options are for doing so, either programmatically or via SharePoint utils/interfaces?
Thanks.
UPDATE:
The scenario where I want to hide the web from the users (e.g. Webv1.0 when Web2.0 is available) is a bit like this: I have version 2.0 of Software X downloaded and installed, and it has converted all of my data into the version 2.0 format so it will be compatible with new features. As a user, I would not want to use Software X version 1.0 anymore since it is now old. Of course I would want a backup copy of my data from version 1.0, but I probably don't want to be confused by having a link here that takes me to version 1.0 of the software (and from a developer's point of view, it would be extra, unnecessary work to make version 1.0 viewable/editable in version 2.0).
I thought of the idea of using security to only allow admins to see everything, but I want to explore other options first e.g. whether it is possible to make the link to the old site disappear programmatically.
Thanks.
Could you just remove all access to that site (by breaking security inheritance) and allow only admins to access it?
Colin's answer sounds like the way to go. Alternatively, you can inject a little bit of JavaScript that automatically redirects the user to the new version of the site.
You can add JavaScript using a Content Editor Web Part (one page at a time) or by using the free SharePoint Infuser (all pages in one go).