Nutch Parsing plugin and redirects - nutch

I am using Nutch 2.0. I've created a plugin for parsing HTML that implements Parser, and it works just fine.
The problem is that I also need to "parse" pages that generate redirects (301, 300) in order to get the URL and the HTTP code. My plugin ignores the redirected pages.
Any ideas how I can obtain this information, maybe with another extension point?

I've implemented the Protocol extension point, and now I can save the redirects and load times to the database.

Related

Getting redirected URL in python 3

I want to get the address of a page after redirect. I have the following code
from urllib.request import urlopen
url = 'https://simple.wikipedia.org/wiki/Gcd'
print(urlopen(url).geturl())
But it doesn't work: it prints https://simple.wikipedia.org/wiki/Gcd, while it should print https://simple.wikipedia.org/wiki/Greatest_common_divisor.
So, what is the problem with it?
There is actually no problem. The URL you get when opening https://simple.wikipedia.org/wiki/Gcd is exactly that URL. The only way for the URL to change would be a redirect, and if you look at the response from that URL, you can see that it returns just a 200 status code. So there is no redirect.
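To check this yourself, here is a minimal sketch that returns both the status code and the final URL. The helper name final_url_and_status is made up for illustration. Note that urlopen follows HTTP redirects on its own, so the status you see is that of the last response in the chain.

```python
from urllib.request import urlopen


def final_url_and_status(url):
    """Return (status, final_url) for `url`.

    urlopen follows HTTP redirects automatically, so `status` belongs to the
    last response and `final_url` is where the request ended up.
    The function name is hypothetical, chosen for this example.
    """
    with urlopen(url) as response:
        return response.status, response.geturl()
```

Calling final_url_and_status('https://simple.wikipedia.org/wiki/Gcd') should, per the explanation above, give back the same Gcd URL, because the server answers that request directly instead of redirecting it.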
However, when you open the URL in the browser, the URL does get changed to https://simple.wikipedia.org/wiki/Greatest_common_divisor. How does this happen when there is no redirect?
This is actually a new MediaWiki feature that rewrites the URL in the browser using the History API. It simply replaces the URL that is displayed in the browser—but without actually making a new request or being a true HTTP redirect.
It’s a functionality that only works in modern browsers with JavaScript enabled. Otherwise, you would stay on the Gcd URL which is also the behavior from older versions of MediaWiki.
You can learn more about this new MediaWiki feature in the Phabricator task T37045.
As for your “problem” with it, you should consider communicating with MediaWiki using the MediaWiki API which will also tell you when a page is a redirect.
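As a rough sketch of the API route: a query with action=query and the redirects parameter asks MediaWiki to resolve redirects and report them in the response. The helper names and the User-Agent string below are invented for this example, and the endpoint is assumed to be the usual /w/api.php of the wiki in question.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

# Assumed endpoint for the wiki used in the question.
API_URL = "https://simple.wikipedia.org/w/api.php"


def build_query_url(title, api_url=API_URL):
    """Build a query URL that asks the API to resolve redirects for `title`."""
    params = {"action": "query", "titles": title, "redirects": 1, "format": "json"}
    return api_url + "?" + urlencode(params)


def redirect_target(payload, title):
    """Return the title `title` redirects to, or None if no redirect was reported.

    `payload` is the decoded JSON response. The API may normalize titles
    (for example capitalization), in which case "from" holds the normalized
    form rather than what was typed.
    """
    for entry in payload.get("query", {}).get("redirects", []):
        if entry.get("from") == title:
            return entry.get("to")
    return None


def resolve(title):
    """Fetch the API response for `title` and return its redirect target, if any."""
    # Hypothetical User-Agent; a descriptive one is generally expected by wikis.
    request = Request(build_query_url(title),
                      headers={"User-Agent": "redirect-check-example/0.1"})
    with urlopen(request) as response:
        payload = json.load(response)
    return redirect_target(payload, title)
```

With this, resolve('Gcd') would return the target page title when the wiki treats Gcd as a redirect, and None otherwise.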

Prevent displaying PHP File through URL with htaccess

I would like to make it impossible to open a PHP file directly through a URL but keep it accessible through jQuery. Right now it is possible to enter this URL in the browser:
http://domain.com/php/member.php
I would like to prevent that. If someone types this into the browser's address bar, then I would like to redirect them to http://domain.com with htaccess. But it must still be possible to send variables from my own website to the PHP file with jQuery.
Thanks :)
If you don't want it to be accessed directly, put it outside the web root and use a PHP script to interact with it. jQuery/JavaScript runs on the client side: if it can access the file, then the client will be able to as well. You would be better off using PHP to send and receive info for jQuery and hiding that file outside the web root so there is no direct access.
You can also use an htaccess 301 redirect to send domain.com/php/member.php to domain.com.

How to use htaccess for content negotiation?

For content to stay available through the same link even if the file extension has changed, URIs shouldn't change. So I decided to use content negotiation and htaccess to achieve this. I searched the web, but all I found was how to implement this specifically for PHP. My site has not only PHP but also HTML, images and JavaScript files.
How can I use content negotiation with just htaccess?

How to create addressable pages using JSF

Using the current version of Java EE, how do you create addressable web pages using JavaServer Faces (JSF)? That is, how do you create JSF pages that have a clean URL, so that the page for the person entity with ID 1234 might be http://www.example.com/person/1234? It's clear enough to me how to service a clean URL using the Java API for RESTful web services (JAX-RS), but not how to do so for a JSF page, or how to combine the two.
A previous question I found suggests that doing so is not actually possible. Is that really so?
Use a URL rewriting solution like PrettyFaces. Under the covers it is basically a simple Filter that forwards requests from the pretty URL to the ugly URL and redirects requests from the ugly URL to the pretty URL, based on an XML mapping file.
Related questions:
Bookmarkable URL in JSF application - Trying to use Spring Webflow and JSF . Any suggestions?
How to rewrite the URL
How do I configure JSF url mappings without file extensions?

How to detect which content is not secured on a mixed content SSL page?

I've added an SSL certificate to an existing site, and now IE shows a mixed content warning. The problem is that I don't know which non-secure content IE is warning me about. It's a simple HTML page with a few Flash movies, a few images, and a loaded CSS and JS file.
How can I find out what the non-secured content is?
Edit:
I found the culprit: it's the JS file AC_RunActiveContent.js, used to display Flash movies. Does anyone have an idea how to prevent the SSL mixed content warning when using AC_RunActiveContent.js?
This means that something is requesting content using the http protocol specifically, or you have an absolute path to an image or other content that begins with http instead of https.
A few tips: Use relative paths everywhere you can. If you must use an absolute path, and it's to a server you own, use https. If you're loading stuff from off your site, you're probably stuck with the mixed-content warning.
This also goes for your scripts. Check the JS and the CSS template and make sure they're not the guilty parties. If they are, change them to use relative paths or to request items via https instead of http (assuming you're positive that the server they reference supports https; if it doesn't, you're stuck).
There are a few other details, this might be helpful.
OK, so here is the solution for my particular problem. It was the codebase value in my code that needed to be https as well (I didn't think it would trigger the warning, as my Flash movies were displaying correctly, oh well):
AC_FL_RunContent( 'codebase','https://download.macromedia.com/pub/shoc...
Link to Adobe info on this: Security Information error in Internet Explorer
I use the Firefox console -- it reports the http resources it blocks from fetching on a mixed content page.
Search your source for "http:" only. Another great tool to help you out is Fiddler, with which you can see what's getting downloaded when your page is requested.
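The "search your source" step can be automated roughly like this. It is only a sketch: the class and function names are invented, the list of URL-carrying attributes is an assumption, and it looks only at HTML attributes, so URLs passed inside JavaScript (such as a codebase argument to a script call) or inside CSS still need a plain text search.

```python
from html.parser import HTMLParser

# Attributes assumed to carry resource URLs; extend as needed.
URL_ATTRS = {"src", "href", "codebase", "data", "action"}


class InsecureRefFinder(HTMLParser):
    """Collect (tag, attribute, url) triples whose URL starts with http://."""

    def __init__(self):
        super().__init__()
        self.found = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            # Plain links are navigation, not content loaded into the page.
            return
        for name, value in attrs:
            if name in URL_ATTRS and value and value.lower().startswith("http://"):
                self.found.append((tag, name, value))


def find_insecure_refs(html):
    """Return every http:// resource reference found in the HTML attributes."""
    finder = InsecureRefFinder()
    finder.feed(html)
    finder.close()
    return finder.found
```

Feeding it the saved page source gives a list of tags to inspect first; anything it reports is requested over plain http and is a candidate for the warning.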
