Is there any way to get the HTML content of each web page in Nutch while crawling?
Yes, you can actually export the content of the crawled segments. It is not straightforward, but it works well for me. First, create a Java project with the following code:
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;
import org.apache.nutch.protocol.Content;
import org.apache.nutch.util.NutchConfiguration;
import java.io.File;
import java.io.FileOutputStream;
public class NutchSegmentOutputParser {
public static void main(String[] args) {
if (args.length != 2) {
System.out.println("usage: NutchSegmentOutputParser segmentdir outputdir");
return;
}
try {
Configuration conf = NutchConfiguration.create();
FileSystem fs = FileSystem.get(conf);
String segment = args[0];
File outDir = new File(args[1]);
if (!outDir.exists()) {
if (outDir.mkdir()) {
System.out.println("Creating output dir " + outDir.getAbsolutePath());
}
}
Path file = new Path(segment, Content.DIR_NAME + "/part-00000/data");
SequenceFile.Reader reader = new SequenceFile.Reader(fs, file, conf);
Text key = new Text();
Content content = new Content();
while (reader.next(key, content)) {
String filename = key.toString().replaceFirst("http://", "").replaceAll("/", "___").trim();
File f = new File(outDir.getCanonicalPath() + "/" + filename);
FileOutputStream fos = new FileOutputStream(f);
fos.write(content.getContent());
fos.close();
System.out.println(f.getAbsolutePath());
}
reader.close();
fs.close();
} catch (Exception e) {
e.printStackTrace();
}
}
}
I recommend using Maven; add the following dependencies:
<dependency>
<groupId>org.apache.nutch</groupId>
<artifactId>nutch</artifactId>
<version>1.5.1</version>
</dependency>
<dependency>
<groupId>org.apache.hadoop</groupId>
<artifactId>hadoop-common</artifactId>
<version>0.23.1</version>
</dependency>
and create a jar package (e.g. NutchSegmentOutputParser.jar).
You need Hadoop to be installed on your machine. Then run:
$/hadoop-dir/bin/hadoop --config \
NutchSegmentOutputParser.jar:~/.m2/repository/org/apache/nutch/nutch/1.5.1/nutch-1.5.1.jar \
NutchSegmentOutputParser nutch-crawled-dir/2012xxxxxxxxx/ outdir
where nutch-crawled-dir/2012xxxxxxxxx/ is the crawled segment directory you want to extract content from (the code reads Content.DIR_NAME + "/part-00000/data" beneath it) and outdir is the output directory. The output file names are generated from the URI: the leading "http://" is removed and each slash is replaced by "___".
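To make the naming rule concrete, here is a small sketch that mirrors the expression used in the exporter code above. The class name, method name and example URLs are made up for illustration; only the replaceFirst/replaceAll chain comes from the original code.

```java
public class SegmentFileNames {

    // Same transformation the exporter applies to each record key:
    // drop the first "http://" and turn every "/" into "___".
    public static String toFileName(String url) {
        return url.replaceFirst("http://", "").replaceAll("/", "___").trim();
    }

    public static void main(String[] args) {
        // prints example.com___docs___index.html
        System.out.println(toFileName("http://example.com/docs/index.html"));

        // "https://" is not stripped, so both of its slashes are replaced too:
        // prints https:______example.com___a
        System.out.println(toFileName("https://example.com/a"));
    }
}
```

Note that only plain http:// prefixes are removed, so URLs with another scheme keep the scheme (including the colon) in the file name.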
Hope it helps.
Try this:
public ParseResult filter(Content content, ParseResult parseResult, HTMLMetaTags
metaTags, DocumentFragment doc)
{
Parse parse = parseResult.get(content.getUrl());
LOG.info("parse.getText: " +parse.getText());
return parseResult;
}
Then check the content in hadoop.log.
It's super basic.
public ParseResult getParse(Content content) {
LOG.info("getContent: " + new String(content.getContent()));
// ... the rest of your parser goes here and returns the ParseResult as usual
}
The Content object has a method getContent(), which returns a byte array. Just have Java create a new String() from that byte array, and you've got the raw HTML of whatever Nutch fetched.
I'm using Nutch 1.9
Here's the JavaDoc on org.apache.nutch.protocol.Content
https://nutch.apache.org/apidocs/apidocs-1.2/org/apache/nutch/protocol/Content.html#getContent()
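One caveat with new String(byte[]) is that it decodes with the platform default charset. Below is a minimal sketch of decoding with an explicit charset instead. It assumes you already know the page encoding (UTF-8 here); the class name and sample markup are hypothetical, and the byte array stands in for what content.getContent() would return.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class RawHtmlDecoding {

    // Decode fetched bytes with a charset chosen by the caller
    // instead of relying on the platform default.
    public static String decode(byte[] raw, Charset charset) {
        return new String(raw, charset);
    }

    public static void main(String[] args) {
        // Stand-in for content.getContent(): UTF-8 bytes with non-ASCII letters.
        String html = "<html><body>Gr\u00fc\u00dfe</body></html>";
        byte[] raw = html.getBytes(StandardCharsets.UTF_8);

        // Decoding with the matching charset restores the original text.
        System.out.println(decode(raw, StandardCharsets.UTF_8).equals(html));

        // Decoding with a different charset garbles the non-ASCII characters.
        System.out.println(decode(raw, StandardCharsets.ISO_8859_1).equals(html));
    }
}
```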
Yes, there is a way. Have a look at cache.jsp to see how it displays the cached data.
I'm writing code to convert SVGs to PNGs:
package com.example;
import java.io.*;
import java.nio.file.Paths;
import org.apache.batik.transcoder.image.PNGTranscoder;
import org.apache.batik.transcoder.SVGAbstractTranscoder;
import org.apache.batik.transcoder.TranscoderInput;
import org.apache.batik.transcoder.TranscoderOutput;
public class Main {
public static void main(String [] args) throws Exception {
// read the input SVG document into TranscoderInput
String svgURI = Paths.get(args[0]).toUri().toURL().toString();
TranscoderInput input = new TranscoderInput(svgURI);
// define OutputStream to PNG Image and attach to TranscoderOutput
OutputStream ostream = new FileOutputStream("out.png");
TranscoderOutput output = new TranscoderOutput(ostream);
// create a PNG transcoder
PNGTranscoder t = new PNGTranscoder();
// set the transcoding hints
t.addTranscodingHint(SVGAbstractTranscoder.KEY_HEIGHT, new Float(600));
t.addTranscodingHint(SVGAbstractTranscoder.KEY_WIDTH, new Float(600));
// convert and write output
t.transcode(input, output);
// flush and close the stream then exit
ostream.flush();
ostream.close();
}
}
I get the following exception when running it with a variety of SVGs:
Exception in thread "main" org.apache.batik.transcoder.TranscoderException: null
Enclosed Exception:
Could not write PNG file because no WriteAdapter is availble
at org.apache.batik.transcoder.image.ImageTranscoder.transcode(ImageTranscoder.java:132)
at org.apache.batik.transcoder.XMLAbstractTranscoder.transcode(XMLAbstractTranscoder.java:142)
at org.apache.batik.transcoder.SVGAbstractTranscoder.transcode(SVGAbstractTranscoder.java:156)
at com.example.Main.main(Main.java:26)
Batik version (reported by Maven):
version=1.9
groupId=org.apache.xmlgraphics
artifactId=batik-transcoder
I get the same error with Batik 1.7.
Suggestions?
The problem was solved by Peter Coppens on the xmlgraphics-batik-users mailing list. The problem is that the Maven repository for Batik 1.9 is missing a dependency, which can be addressed by adding to pom.xml:
<dependency>
<groupId>org.apache.xmlgraphics</groupId>
<artifactId>batik-codec</artifactId>
<version>1.9</version>
</dependency>
The cryptic exception disappears and the code functions as expected with this addition. This was reported as a bug for Batik 1.7 (https://bz.apache.org/bugzilla/show_bug.cgi?id=44682).
I'm trying to load a properties file in a JSF application I'm working on, though I can't manage to reference the file.
package com.nivis.util;
import java.io.IOException;
import java.io.InputStream;
import java.util.Properties;
public class PropHandler {
String result = "";
InputStream inputStream;
public void loadProp() {
try {
inputStream = this.getClass().getResourceAsStream("prop.properties");
if (inputStream == null) {
System.err.println("===== Did not load =====");
} else {
System.err.println("===== Loaded =====");
}
} catch (Exception ex) {
ex.printStackTrace();
} finally {
try {
inputStream.close();
} catch (IOException ex) {
ex.printStackTrace();
}
}
}
public static void main(String[] args) {
PropHandler ph = new PropHandler();
ph.loadProp();
}
}
The file is located in the same package as the class, and according to the examples I've found while searching, that should work. I've also tried putting the file in every conceivable place in the application and referencing it as best I can, but it does not work.
(These are only some of the folders where I tried putting the file.)
What am I doing wrong?
Optimally I'd like to have it in the same folder that I use for the msg.properties file.
As this answer elaborates, com/nivis/prop.properties should be the right way to reference the file nested in your resources folder.
But because you are calling getResourceAsStream on the class rather than on a ClassLoader obtained via ClassLoader classloader = Thread.currentThread().getContextClassLoader();, you have to use an absolute path starting with "/", resulting in /com/nivis/prop.properties.
Try something like this:
ClassLoader classloader = Thread.currentThread().getContextClassLoader();
InputStream is = classloader.getResourceAsStream("prop.properties");
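The two answers above rely on different lookup rules, so a small sketch may help. Class.getResourceAsStream resolves a name without a leading "/" relative to the package of the class, and treats a name with a leading "/" as absolute; ClassLoader.getResourceAsStream always takes a name relative to the classpath root, written without a leading "/". The class and method names below are made up, and the demo looks up java/lang/Object.class only because that resource exists in any JVM; substitute your own class and properties file.

```java
import java.io.IOException;
import java.io.InputStream;

public class ResourceLookupDemo {

    // Lookup through a Class: relative names are resolved against the
    // package of 'anchor', names starting with "/" are absolute.
    public static boolean foundViaClass(Class<?> anchor, String name) throws IOException {
        try (InputStream in = anchor.getResourceAsStream(name)) {
            return in != null; // null means the resource was not found
        }
    }

    // Lookup through the context ClassLoader: names are always relative
    // to the classpath root and carry no leading "/".
    public static boolean foundViaLoader(String name) throws IOException {
        ClassLoader loader = Thread.currentThread().getContextClassLoader();
        try (InputStream in = loader.getResourceAsStream(name)) {
            return in != null;
        }
    }

    public static void main(String[] args) throws IOException {
        // Relative to the java.lang package of Object
        System.out.println(foundViaClass(Object.class, "Object.class"));
        // Absolute form of the same resource
        System.out.println(foundViaClass(Object.class, "/java/lang/Object.class"));
        // ClassLoader form: full path, no leading slash
        System.out.println(foundViaLoader("java/lang/Object.class"));
    }
}
```

In both cases a null return value means the resource was not found, so it is worth checking for null before reading from or closing the stream.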
I want to create a Word document that uses different languages. In particular, I have a two-language original text where the language changes between English and German for each paragraph. This is what I tried:
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import org.apache.poi.xwpf.usermodel.XWPFDocument;
import org.apache.poi.xwpf.usermodel.XWPFParagraph;
import org.apache.poi.xwpf.usermodel.XWPFRun;
import org.apache.poi.xwpf.usermodel.XWPFStyles;
public class DocxCreator {
public static void createDocument(File docxOutput) throws IOException {
XWPFDocument doc = new XWPFDocument();
XWPFStyles docStyles = doc.createStyles();
docStyles.setSpellingLanguage("de-DE");
{
XWPFParagraph para = doc.createParagraph();
XWPFRun run = para.createRun();
run.setLanguage("de-DE"); // XXX: this method does not exist
para.setText("Deutsch");
}
{
XWPFParagraph para = doc.createParagraph();
XWPFRun paraRun = para.createRun();
para.setStyle("en-US");
paraRun.setText("English");
}
/* XXX: How do I add the style "en-US" to the document and set its language to "en-US"? */
/* XXX: How do I enable global grammar and spell checking? */
try (FileOutputStream fos = new FileOutputStream(docxOutput)) {
doc.write(fos);
}
}
public static void main(String[] args) throws IOException {
createDocument(new File("multilang.docx"));
}
}
I do not think this is currently supported by POI.
Generally, the language of the text is specified on the XWPFRun (XWPF) / CharacterRun (HWPF) level.
For HWPF (old binary *.doc format) there exists at least a method CharacterRun.getLanguageCode() - but no respective setter.
For XWPF (new *.docx format) I do not see such a thing at all.
The language codes are the same for *.doc and *.docx. A list is available here.
I just started developing a simple BlackBerry app which shows a text sequence in a RichTextField on a MainScreen. When I define the String directly in the source code, I have no problem displaying it. But if I try to read it from a .txt file located in the res folder, I get a NullPointerException.
The code below is what I have done so far.
package mypackage;
import java.io.IOException;
import java.io.InputStream;
import net.rim.device.api.io.IOUtilities;
import net.rim.device.api.ui.component.RichTextField;
import net.rim.device.api.ui.container.MainScreen;
public final class MyScreen extends MainScreen{
String str = readFile("Testfile.txt");
public MyScreen(){
setTitle("Read Files");
add(new RichTextField(str));
}
public String readFile(String filename){
InputStream is = this.getClass().getResourceAsStream("/"+filename);
try {
byte[] filebytes = IOUtilities.streamToBytes(is);
is.close();
return new String(filebytes);
}
catch (IOException e){
System.out.println(e.getMessage());
}
return "";
}
}
I found parts of this code in this forum, but my problem is that I don't understand when I have to open a connection and when to close it.
And when do I need a buffer?
And why do I have to convert an InputStream to a byte[] and then the byte[] to a String?
All I need is one method where I can pass in the file name and get back a String object with the text that is in my .txt file.
And of course the method should release its resources...
package mypackage;
import java.io.IOException;
import java.io.InputStream;
import net.rim.device.api.io.IOUtilities;
import net.rim.device.api.ui.component.RichTextField;
import net.rim.device.api.ui.container.MainScreen;
public final class MyScreen extends MainScreen {
public MyScreen() throws IOException {
setTitle("Read Files");
add(new RichTextField(readFileToString("Testfile.txt")));
}
public String readFileToString(String path) throws IOException {
InputStream is = getClass().getResourceAsStream("/"+path);
byte[] content = IOUtilities.streamToBytes(is);
is.close();
return new String(content);
}
}
Yes!!! I found a way to solve my problem.
I don't know why my previous code didn't work but this one works...
The only thing I've changed is that I've added throws IOException instead of surrounding the code with a try-catch block...
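On the open questions from the original post, here is a sketch in plain Java SE (not the BlackBerry API) of what a helper like IOUtilities.streamToBytes presumably does for you: a stream only yields bytes, so they are collected through a buffer and then decoded into a String. It also checks for null, because getResourceAsStream returns null when the resource is not found, which is one common cause of a NullPointerException in code like this. The class and method names are hypothetical, and UTF-8 is an assumed encoding.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

public class StreamToStringSketch {

    // Reads the whole stream, closes it, and decodes the bytes as UTF-8.
    public static String readAll(InputStream in) throws IOException {
        if (in == null) {
            // This is what getResourceAsStream hands back for a missing resource.
            throw new IOException("resource not found");
        }
        try (InputStream source = in) { // closed automatically, even on error
            ByteArrayOutputStream buffer = new ByteArrayOutputStream();
            byte[] chunk = new byte[1024];
            int n;
            while ((n = source.read(chunk)) != -1) {
                buffer.write(chunk, 0, n); // collect the bytes chunk by chunk
            }
            // Bytes carry no encoding information; the charset is chosen here.
            return new String(buffer.toByteArray(), StandardCharsets.UTF_8);
        }
    }

    public static void main(String[] args) throws IOException {
        byte[] demo = "Hello from Testfile.txt".getBytes(StandardCharsets.UTF_8);
        System.out.println(readAll(new ByteArrayInputStream(demo)));
    }
}
```

With a real resource the call would look like readAll(getClass().getResourceAsStream("/Testfile.txt")), using the same path convention as the code above.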
Following this question, I thought of creating a simple IDE for Groovy and Java. The code is reproduced here for easy reference:
import groovy.swing.SwingBuilder
import java.awt.BorderLayout as BL
import static javax.swing.JFrame.EXIT_ON_CLOSE
import org.fife.ui.rsyntaxtextarea.*
RSyntaxTextArea textArea = new RSyntaxTextArea()
textArea.syntaxEditingStyle = SyntaxConstants.SYNTAX_STYLE_JAVA
swing = new SwingBuilder()
frame = swing.frame(title:"test", defaultCloseOperation:EXIT_ON_CLOSE, size:[600,400], show:true ) {
borderLayout()
panel( constraints:BL.CENTER ) {
borderLayout()
scrollPane( constraints:BL.CENTER ) {
widget textArea
}
}
}
Now I have all the code entered by the user in textArea, which is an instance of RSyntaxTextArea. How should I compile the code written by the user? Is there a class for this purpose, or some other way of doing it in Groovy?
Thanks in advance.
If you look in the src/main/groovy/ui folder of the source download for Groovy, you'll see the code which makes the groovyConsole work.
If you look inside the ConsoleSupport class, you'll see the way the console does it:
protected Object evaluate(String text) {
String name = "Script" + counter++;
try {
return getShell().evaluate(text, name);
}
catch (Exception e) {
handleException(text, e);
return null;
}
}
where getShell() is:
public GroovyShell getShell() {
if (shell == null) {
shell = new GroovyShell();
}
return shell;
}
So it returns the existing GroovyShell if there is one, or creates a new one otherwise.