How to crawl only HTML in Nutch? - nutch

Is it possible to crawl/fetch only plain HTML pages via Nutch (i.e. no pictures, video, Flash, Excel, exe, PDF or Word files)?
How can I check the Content-Type of a page and fetch only text/html pages via Nutch?

Edit conf/regex-urlfilter.txt:
Set the file suffixes to ignore:
-\.(jpg|gif|zip|ico)$
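If you also want to skip the other formats mentioned in the question (video, Flash, Excel, exe, PDF, Word), you can extend the suffix list. The selection below is only an illustration; adjust the extensions to your needs:
-\.(jpg|jpeg|png|gif|ico|zip|exe|pdf|doc|docx|xls|xlsx|ppt|pptx|swf|flv|mp3|mp4|avi|wmv)$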

Related

How can we convert an HTML file to a SharePoint 2010 site page

How can we convert an HTML file to SharePoint, and how can we upload and apply the CSS and jQuery files to the SharePoint site? Please help me with this.
HTML files cannot be converted to a SharePoint page in SharePoint 2010. If the page has styles, it is advisable to upload them (including the .css and .js files) to a document library and reference them from the uploaded HTML page. Alternatively, you can use that HTML on an article page you create: add a Content Editor web part and paste the HTML markup, with its .css and .js references, into it.
Hope this helps.
Regards,
Arjun

How to compress html pages using SetOutputFilter DEFLATE

I am not able to get compressed HTML pages in my browser even though I am 100% sure mod_deflate is activated on my server.
My .htaccess file has this code snippet:
<IfModule mod_deflate.c>
<Files *.html>
SetOutputFilter DEFLATE
</Files>
</IfModule>
An uncompressed excerpt of my content is:
<div>
	<div>
		Content
	</div>
</div>
With the htaccess code I am using, I would expect to get the output below in my browser (no space and no tabs at the beginning of each line):
<div>
<div>
Content
</div>
</div>
Is there something wrong with the code I am using in the .htaccess file?
Is keeping all the tabs in front of each HTML line after compression the normal behavior of mod_deflate?
If so, would you recommend that I replace tabs with spaces in my HTML code to get the desired effect?
Thanks for your insights on this.
For the DEFLATE output filter to compress the content:
Your content should be at least 120 bytes; compressing smaller responses can actually increase the output size.
The HTTP client making the request should support gzip/deflate encoding.
Most modern web browsers support gzip encoding and automatically decompress the gzipped content for you, so what you are seeing with a web browser's View Page Source option is not the compressed content. To verify that your browser received compressed content, press F12, select the Network tab and your requested page. If the response header has Content-Encoding: gzip, you can be sure the compression worked.
In Firefox, you can remove support for gzip,deflate by going to about:config and emptying the value for network.http.accept-encoding. Now with no support for gzip, Firefox will receive uncompressed content from your Apache server.
Alternatively, if you want to see the compressed content, you can use a client that does not automatically decompress it for you.
You can use curl for this (it will not decompress unless you pass the --compressed option):
curl -H "Accept-Encoding: gzip,deflate" http://example.com/page.html > page.gz
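To quickly confirm whether compression actually happened, you can also just inspect the response headers; http://example.com/page.html is a placeholder for your own URL:
curl -s -D - -o /dev/null -H "Accept-Encoding: gzip,deflate" http://example.com/page.html | grep -i content-encoding
If this prints Content-Encoding: gzip (or deflate), mod_deflate is compressing the response.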

How to get the crawled pages content and corresponding URL in nutch?

I want to get the content crawled by Nutch in a text file. I have used the readseg command but the output is not fruitful.
Is there some plugin which can get Nutch to crawl and store the URL and content in a text file?
With Nutch 1, you can do something like:
./bin/nutch readseg -get out-crawl/segments/20160823085007/ "https://en.wikipedia.org/wiki/Canon" -nofetch -nogenerate -noparse -noparsedata -noparsetext > Canon.html
It still comes with a few lines to get rid of at the beginning of the file.
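If you want the URLs and content of a whole segment rather than a single page, readseg can also dump a segment to text. A minimal sketch, reusing the example segment path above and writing to a hypothetical dump-dir output directory:
./bin/nutch readseg -dump out-crawl/segments/20160823085007/ dump-dir -nofetch -nogenerate -noparse -noparsedata -noparsetext
The output directory will then contain a plain-text dump listing each URL together with its fetched content.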
You can modify the fetch job of Nutch to get the URLs and the page content belonging to them during the crawling process. In the source file (src/java/org/apache/nutch/fetcher/FetcherReducer.java):
case ProtocolStatusCodes.SUCCESS: // got a page
    String url = TableUtil.reverseUrl(fit.url); // the URL
    String pageContent = Bytes.toString(ByteBuffer.wrap(content.getContent())); // page content belonging to that URL
    output(fit, content, status, CrawlStatus.STATUS_FETCHED);
    break;
Hope this helps,
Le Quoc Do

What is the htaccess equivalent of <meta http-equiv="Content-Type" content="text/html; charset=utf-8">

What is the .htaccess equivalent to <meta http-equiv="Content-Type" content="text/html; charset=utf-8">? YSlow says I should put this in my .htaccess. I'm on an Apache server.
OK, seen here, I think I have an answer. Which code is appropriate though? I only have .html extensions on my site. http://www.askapache.com/htaccess/setting-charset-in-htaccess.html
AddCharset UTF-8 .html
vs
AddType 'text/html; charset=UTF-8' html
vs
AddDefaultCharset UTF-8
vs
Content-Type: text/html; charset=UTF-8
The first one, AddCharset, tells the server that files ending in .html should be said to be encoded in UTF-8.
The second gives the full Content-Type for HTML files, including both the MIME type and charset. This shouldn't be necessary, since Apache should already be configured to serve .html files as text/html.
The third, AddDefaultCharset, sets the default character set for all file types, not just HTML. So, for instance, text documents, XML documents, stylesheets, and the like will be served with a UTF-8 character set listed. This is what I would recommend; you should be saving all of your documents in UTF-8 by default anyhow, and so even if all of your documents are HTML now, this will keep the correct character set configured for other types of files if you add them later.
The last is not an Apache configuration; it's the actual header that should be sent along with your documents if you set one of the above options. You can check the headers that were sent in Firebug on Firefox, or various developer tools that other browsers offer. You should always have a Content-Type: header, and if your text is encoded in UTF-8, it should always specify charset=UTF-8.
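If you prefer the command line over browser tools, a quick way to check the header is with curl; the URL below is just a placeholder:
curl -s -I http://example.com/page.html | grep -i content-type
With one of the directives above in place, this should print something like Content-Type: text/html; charset=UTF-8.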
Note that the meta tag is not required if you set the charset appropriately via the headers. It is still nice to have the meta tag if you are going to view the files locally, without a web server; in that case, there is nothing to set the header, so the browser needs to fall back to the meta tag. But for this purpose, you can use the shorter and simpler meta tag: <meta charset=utf-8>. This abbreviated form was formally introduced in HTML5, but browsers have actually supported it for much longer, and it's compatible with all modern browsers, even back to IE 6.
Another possibility is the rewrite engine (in this case, matching no-extension URLs):
RewriteEngine on
RewriteRule ^([^.]*)$ $1 [type=text/html]

How to modify Sharepoint filetype icons depending on parts of the filename?

We have a SharePoint Document library, where we store html files with links to external files. Samples:
mypicture.jpg.html
mywordfile.docx.html
mypdffile.pdf.html
and so on. Now by default all files show up with the HTML icon, referenced in the DOCICON.XML file. That is of course correct; as the .html extension shows, it is an HTML file. But we want the files to have different icons, based on their original file type.
Is there a way to automatically change the Icon
during rendering or
when we save the file to the library (via SharePoint API)?
Any other approaches?
Why not use a little jQuery to change the icon during rendering? Each doc in your library should be contained in:
<td class="ms-vb-icon"><a tabindex=...><img ... src="/_layouts/images/ichtm.gif"></a></td>
I think you can slurp those into an array, assign a new var that's just the href stripped of the path/filename and the .html suffix, and use that to replace htm in the src attribute, as sketched below.
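A rough sketch of that idea (untested; it assumes the markup above and SharePoint's usual ic<extension>.gif icon naming, e.g. icdocx.gif):
$('td.ms-vb-icon img').each(function () {
    // e.g. "/Docs/mywordfile.docx.html" -> "docx"
    var href = $(this).closest('td').find('a').attr('href') || '';
    var original = href.replace(/\.html$/i, '').split('.').pop().toLowerCase();
    if (original && original !== 'html') {
        // swap the generic HTML icon for the original file type's icon
        $(this).attr('src', '/_layouts/images/ic' + original + '.gif');
    }
});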
Could you not just edit the DOCICON.xml to add the ".jpg.html" and ".docx.html" extensions in?
For a full listing of icon files see all "ic*.gif" files in the TEMPLATE\IMAGES directory under the 12 hive. Unfortunately, this will not solve your problem, but this is where you can change it based on the extension, if you so choose.
Note that a blog I wrote a while back has a different focus, but does discuss where the icons come from: http://wiki.threewill.com/display/is/2007/10/14/External+Link+for+Editing+a+SharePoint+Document.
