Heritrix 3.2.x, how to read content from WARC files?

Using Heritrix 3.2.x, I have crawled a website. Now I want to read the HTML content from the WARC files it created. Can anyone help?
I tried using the Python warc tool and the Java-based warc-tools.jar.

To get an idea of what a WARC file consists of, just open it in a text editor. For a graphical view, you need a tool such as webarchiveplayer, pywb, or openwayback.
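
If the goal is to browse the crawled pages rather than parse them, pywb's collection manager is one quick way in. These commands follow the pywb workflow as I recall it, with my-coll and the WARC path as placeholders, so check them against the documentation for the pywb version you install:

pip install pywb
wb-manager init my-coll
wb-manager add my-coll /path/to/crawl.warc.gz
wayback

wayback then serves the collection locally (port 8080 by default), and you can browse the captured pages in a browser.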

Have you tried programming a reader using JWAT, or using the JWAT Tools command line?
jwattools.cmd extract path.to.warc(.gz)
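
If you would rather read the records in your own code, here is a rough, untested sketch against the JWAT API as I understand it (WarcReaderFactory, WarcRecord and the public header fields); double-check the names against the JWAT javadoc:

import java.io.FileInputStream;
import java.io.InputStream;

import org.jwat.warc.WarcReader;
import org.jwat.warc.WarcReaderFactory;
import org.jwat.warc.WarcRecord;

public class WarcList {
    public static void main(String[] args) throws Exception {
        try (InputStream in = new FileInputStream(args[0])) {      // .warc or .warc.gz
            WarcReader reader = WarcReaderFactory.getReader(in);   // detects gzip itself
            WarcRecord record;
            while ((record = reader.getNextRecord()) != null) {
                // "response" records carry the crawled HTTP responses, i.e. the HTML
                System.out.println(record.header.warcTypeStr + "\t"
                        + record.header.warcTargetUriStr);
                // record.getPayload().getInputStream() exposes the payload bytes
                // (the payload can be null for records without one, so check first)
            }
            reader.close();
        }
    }
}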

Related

How can I use python to edit docx and/or doc file tags on a windows system?

I have a folder with a large number of .doc and .docx files. I would like to develop a Python script to edit the tags of each file so I can find a file in the folder using its tags, thus making my life a little easier.
I am unsure of how to even start and was hoping someone could point me to a library or provide some sample code to help me get started.
I am not sure if the file extension matters, because this seems to be a Windows property (right-click file > Properties > Details > Tags > type in tags), but if the extension matters I can change all the files to .docx.
The python-docx package provides methods to access most of the metadata in a Word file. The class docx.opc.coreprops.CoreProperties in particular allows you to modify author, category, etc. I didn't see tags mentioned, but if you do some more research I'm sure you can find it.
docx.opc.coreprops.CoreProperties.keywords can be used to update doc file tags.

Crawler reading a pdf

I am trying to create a crawler that can read a PDF and extract certain information from it (to save in a database).
However, I am unsure which method or tool to use.
My initial thought was to use PhantomJS, but after reading a lot it doesn't seem to have the capabilities. If I wanted to use PhantomJS, I would have to download the PDF, convert it into an HTML page, and then crawl it with PhantomJS, which seems like a tedious task that should be doable faster.
So my question is, how can I read a PDF from an online source and gather these pieces of information?
If you are not limited in terms of programming language, consider using iText.
It can easily extract all the text from a given PDF document. It also offers utility methods to look for regular expressions within a file, giving you back the exact location (coordinates) and the matching text.
iText is available for both C# and Java.
File inputFile = new File("");  // path to the source PDF
PdfDocument pdfDocument = new PdfDocument(new PdfReader(inputFile));
String content = PdfTextExtractor.getTextFromPage(pdfDocument.getPage(1));  // text of page 1
pdfDocument.close();
Have a look at the website to learn more.
http://developers.itextpdf.com/content/itext-7-examples/itext-7-content-extraction-and-redaction

nutch parse custom xml with tika using xpath

I am new to Nutch and am using Nutch 1.7.
I am looking for ways to parse custom XML files based on XPath and store the data. I did see the xml_parser plugin, but that has been suspended since Tika took over.
How do I configure the Tika parser embedded in Nutch 1.7 to parse the URL content based on XPath? I have searched the Nutch documentation and wiki, but there is not much information there.
Tika tries to parse and extract the content, which fails because of the custom format, but I want to store the XML with tags based on the XPath. Where should I put the XPath info in the Nutch conf? Or do I have to override the Tika parser?
Any hints in the right direction are much appreciated.
Thanks.
I don't think you can easily do this with Tika, but you can use these custom plugins to parse XML files based on XPath:
https://github.com/BayanGroup/nutch-custom-search
http://www.atlantbh.com/precise-data-extraction-with-apache-nutch/

extracting information about outlinks of a url in nutch

I am quite new to Nutch. I have crawled a site successfully using Nutch 1.2 and extracted a segment dump with the readseg command, but the issue is that the dump contains a lot of information other than URLs and outlinks; if I want to analyse it, a manual approach has to be adopted.
It would be really great if there were a utility or plugin that exports each link with its outlinks in a machine-readable format like CSV or SQL.
Please suggest one.
Generally you will have to write your own application to do this. You can provide additional flags to remove unnecessary data.
http://wiki.apache.org/nutch/bin/nutch%20readseg
Check there to see which flags can be used to reduce the data.
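For example, here is a hedged sketch using the flags from the Nutch 1.x readseg usage as I remember them (<segment_dir> and <output_dir> are placeholders); it keeps only the parse data, which is where the outlinks live:

bin/nutch readseg -dump <segment_dir> <output_dir> -nocontent -nofetch -nogenerate -noparse -noparsetext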
Alternatively, writing your own application using the Hadoop FS library would be better, extracting the information directly and programmatically.
http://wiki.apache.org/hadoop/SequenceFile
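
If you go that route, here is a hedged, untested sketch that reads a segment's parse_data directly and prints URL,outlink pairs as CSV. The ParseData and Outlink classes are from Nutch 1.x as I recall them, and the part-00000/data path assumes the default single-part segment layout, so verify both against your Nutch and Hadoop versions:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;
import org.apache.nutch.parse.Outlink;
import org.apache.nutch.parse.ParseData;

public class OutlinkDump {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        // the MapFile "data" files inside parse_data are ordinary SequenceFiles
        Path data = new Path(args[0], "parse_data/part-00000/data");
        SequenceFile.Reader reader = new SequenceFile.Reader(fs, data, conf);
        Text url = new Text();
        ParseData parseData = new ParseData();
        while (reader.next(url, parseData)) {
            for (Outlink outlink : parseData.getOutlinks()) {
                System.out.println(url + "," + outlink.getToUrl());
            }
        }
        reader.close();
    }
}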

Lucene 4.2.0 index pdf

I am using example source code from the Lucene 4.2.0 demo API:
http://lucene.apache.org/core/4_2_0/demo/overview-summary.html
I run IndexFiles.java to create an index from a directory of rtf, pdf, doc, and docx files. I then run SearchFiles.java and notice several instances where my searches fail, i.e., they do not return a document that contains the word I searched for.
I suspect it has to do with Lucene 4.2.0 not being able to correctly index non-.txt files without additional customization.
Question: Can the IndexFiles.java source code (Lucene 4.2.0) correctly index pdf, doc, docx files as it is written in the provided link? Does anyone have examples or references on how to code that functionality?
Thank You
No, it can't. IndexFiles is a demo, an example for you to learn from, but not really designed for production use. If you take a look at the code, you'll see it just uses a FileInputStream (wrapped in an InputStreamReader, wrapped in a BufferedReader). Generally, Lucene won't handle parsing different file formats for you (except its own index files, of course). How to parse a file to provide meaningful content to Lucene is up to you to define.
Apache Tika might be a good place to look for this functionality; a simple example of using Tika with Lucene is sketched below.
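This is a rough, untested sketch using the Tika facade class to extract plain text and Lucene 4.2's IndexWriter to index it; the field names "contents" and "path" are placeholders of my own:

import java.io.File;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;
import org.apache.tika.Tika;

public class TikaIndexer {
    public static void main(String[] args) throws Exception {
        Tika tika = new Tika();  // detects the file type and picks a parser for it
        IndexWriterConfig config = new IndexWriterConfig(
                Version.LUCENE_42, new StandardAnalyzer(Version.LUCENE_42));
        try (IndexWriter writer = new IndexWriter(FSDirectory.open(new File(args[0])), config)) {
            for (int i = 1; i < args.length; i++) {
                File file = new File(args[i]);
                Document doc = new Document();
                doc.add(new StringField("path", file.getPath(), Field.Store.YES));
                // parseToString extracts the body of the file as plain text
                doc.add(new TextField("contents", tika.parseToString(file), Field.Store.NO));
                writer.addDocument(doc);
            }
        }
    }
}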
You might also consider using Solr.
