Nutch crawling with Solr, job failed for depth >= 2

I am trying to run the Nutch crawler on my local machine and want to index the retrieved data using Solr.
I am using apache-nutch-1.9 and solr-4.10.1.
Both are installed and seem to run fine for depth = 1.
I get the following error when depth = 2:
bin/crawl urls/ crawl http://localhost:8983/solr 2
.....
Indexer: java.io.IOException: Job failed!
at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1357)
at org.apache.nutch.indexer.IndexingJob.index(IndexingJob.java:114)
at org.apache.nutch.indexer.IndexingJob.run(IndexingJob.java:176)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
at org.apache.nutch.indexer.IndexingJob.main(IndexingJob.java:186)

Related

PySpark3 - Reading XML files

I'm trying to read an XML file in my PySpark3 Jupyter notebook (running in Azure).
I have this code:
df = spark.read.load("wasb:///data/test/Sample Data.xml")
However, I keep getting the error java.io.IOException: Could not read footer for file:
An error occurred while calling o616.load.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 10.0 failed 4 times, most recent failure: Lost task 0.3 in stage 10.0 (TID 43, wn2-xxxx.cloudapp.net, executor 2): java.io.IOException: Could not read footer for file: FileStatus{path=wasb://xxxx.blob.core.windows.net/data/test/Sample Data.xml; isDirectory=false; length=6947; replication=0; blocksize=0; modification_time=0; access_time=0; owner=; group=; permission=rw-rw-rw-; isSymlink=false}
I know it's reaching the file, since the length in the error matches the XML file size, but I'm stuck after that.
Any ideas?
Thanks.
Please refer to the two blog posts below; I think they can answer your question completely.
Azure Blob Storage with Pyspark
Reading JSON, CSV and XML files efficiently in Apache Spark
The code looks like this:
from pyspark.sql import SparkSession

session = SparkSession.builder.getOrCreate()
session.conf.set(
    "fs.azure.account.key.<storage-account-name>.blob.core.windows.net",
    "<your-storage-account-access-key>"
)
# OR a SAS token for a container:
# session.conf.set(
#     "fs.azure.sas.<container-name>.blob.core.windows.net",
#     "<sas-token>"
# )
# your Sample Data.xml file is in the virtual directory `data/test`
df = session.read.format("com.databricks.spark.xml") \
    .options(rowTag="book") \
    .load("wasbs://<container-name>@<storage-account-name>.blob.core.windows.net/data/test/")
If you are using Azure Databricks, I think the code will work as expected; otherwise, you may need to install the com.databricks.spark.xml library in your Apache Spark cluster.
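If you are not on Databricks, one way to pull the package in is to declare it when you build the SparkSession yourself (this only takes effect before the underlying SparkContext has started). Here is a minimal sketch, assuming your cluster can resolve Maven artifacts; the coordinate and version below are just an example, so match them to your Scala/Spark build.

from pyspark.sql import SparkSession

# Declare the spark-xml package so Spark downloads it from Maven at startup.
# The coordinate/version is an assumption -- pick the artifact that matches
# your Scala and Spark versions.
session = (
    SparkSession.builder
    .appName("xml-read")
    .config("spark.jars.packages", "com.databricks:spark-xml_2.11:0.5.0")
    .getOrCreate()
)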
Hope it helps.

Why does Spark job fail on Mesos with "hadoop: not found"?

I use Spark 1.6.1, Hadoop 2.6.4 and Mesos 0.28 on Debian 8.
While trying to submit a job via spark-submit to a Mesos cluster, a slave fails with the following in the stderr log:
I0427 22:35:39.626055 48258 fetcher.cpp:424] Fetcher Info: {"cache_directory":"\/tmp\/mesos\/fetch\/slaves\/ad642fcf-9951-42ad-8f86-cc4f5a5cb408-S0\/hduser","items":[{"action":"BYP$
I0427 22:35:39.628031 48258 fetcher.cpp:379] Fetching URI 'hdfs://xxxxxxxxx:54310/sources/spark/SimpleEventCounter.jar'
I0427 22:35:39.628057 48258 fetcher.cpp:250] Fetching directly into the sandbox directory
I0427 22:35:39.628078 48258 fetcher.cpp:187] Fetching URI 'hdfs://xxxxxxx:54310/sources/spark/SimpleEventCounter.jar'
E0427 22:35:39.629243 48258 shell.hpp:93] Command 'hadoop version 2>&1' failed; this is the output:
sh: 1: hadoop: not found
Failed to fetch 'hdfs://xxxxxxx:54310/sources/spark/SimpleEventCounter.jar': Failed to create HDFS client: Failed to execute 'hadoop version 2>&1'; the command was e$
Failed to synchronize with slave (it's probably exited)
My JAR file contains the Hadoop 2.6 binaries.
The path to the Spark executor/binary is an hdfs:// link.
My jobs don't appear in the Frameworks tab, but they do appear in the driver with the status 'queued', and they just sit there until I shut down the spark-mesos-dispatcher.sh service.
I was seeing a very similar error and figured out that my problem was that hadoop_home wasn't set on the Mesos agent.
On each mesos-slave I added the following line to /etc/default/mesos-slave (the path may be different on your install): MESOS_hadoop_home="/path/to/my/hadoop/install/folder/"
EDIT: Hadoop has to be installed on each slave; /path/to/my/hadoop/install/folder is a local path.

Spark job submitted from local machine to remote cluster can't see data on remote server

The post may seem a bit long, but I am providing all the specific details to help readers understand what I am trying to achieve and what I have already done, yet I am still running into an issue.
I am trying to submit a Spark job to a remote cluster from Eclipse running locally on a Windows 7 machine, but I am running into an issue with finding the input path to the data on the cluster nodes. I followed the suggestion made in this forum to configure the SparkContext as follows, where I set spark.driver.host to the IP address of the Windows machine.
SparkConf sparkConf = new SparkConf().setAppName("Count Lines")
        .set("spark.driver.host", "9.1.194.199")   // IP address of the Windows 7 machine
        .set("spark.driver.port", "51910")
        .set("spark.fileserver.port", "51811")
        .set("spark.broadcast.port", "51812")
        .set("spark.replClassServer.port", "51813")
        .set("spark.blockManager.port", "51814")
        .setMaster("spark://master.aa.bb.com:7077"); // master hostname
I also had to set HADOOP_HOME to c:\winutils in Eclipse to be able to run this code on Windows.
Then I set the path to the data, which exists on all the nodes of the Spark cluster, as follows:
String topDir = "/data07/html/test";
JavaRDD<String> lines = sc.textFile(topDir+"/*");
However, I get the following error:
5319 [main] INFO org.apache.spark.SparkContext - Created broadcast 0 from textFile at CountLines2.java:65
Exception in thread "main" org.apache.hadoop.mapred.InvalidInputException: Input Pattern file:/data07/html/test/* matches 0 files
at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:251)
at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:270)
at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:201)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:205)
Now, considering that running the code inside Eclipse needed a local Hadoop installation (i.e., setting HADOOP_HOME to c:\winutils), I modified the code to use a data path that exists locally on the Windows machine. With that modification, the program went a bit further and launched tasks on all the nodes of the cluster, but later failed with a different path-related error.
105926 [task-result-getter-2] INFO org.apache.spark.scheduler.TaskSetManager - Lost task 15.2 in stage 0.0 (TID 162) on executor master.aa.bb.com: java.lang.IllegalArgumentException (java.net.URISyntaxException: Relative path in absolute URI: C:%5Cdata%5CMedicalSieve%5Crepositories%5Craw%5CMedscape%5Cclinical/*) [duplicate 162]
Exception in thread "main" org.apache.spark.SparkException: Job aborted due to stage failure: Task 44 in stage 0.0 failed 4 times, most recent failure: Lost task 44.3 in stage 0.0 (TID 148, aalim03.almaden.ibm.com): java.lang.IllegalArgumentException: java.net.URISyntaxException: Relative path in absolute URI: C:%5Cdata%5Chtml%5Ctest/*
at org.apache.hadoop.fs.Path.initialize(Path.java:206)
at org.apache.hadoop.fs.Path.<init>(Path.java:172)
at org.apache.hadoop.util.StringUtils.stringToPath(StringUtils.java:241)
As a rule of thumb, every input you use should be accessible from every node (both the workers and the driver). It can be a path on a local file system that is present on each machine, files on some DFS, or an external resource.
The only situation where data is shipped directly from the driver is when you use a ParallelCollectionRDD via parallelize / makeRDD.
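To make the distinction concrete, here is a minimal sketch (in PySpark for brevity; the same idea applies to the Java API). The HDFS URI is a placeholder, not your actual namenode address.

from pyspark import SparkConf, SparkContext

conf = SparkConf().setAppName("input-locality-sketch")
sc = SparkContext(conf=conf)

# parallelize: the collection lives on the driver and Spark ships it to the
# executors, so no shared storage is needed for this input.
shipped = sc.parallelize(["line one", "line two", "line three"])
print(shipped.count())

# textFile: the path must be readable from the cluster as well, so it should
# point at shared storage (e.g. HDFS) rather than a driver-local directory.
# The URI below is a placeholder for your cluster's namenode.
lines = sc.textFile("hdfs://namenode:8020/data07/html/test/*")
print(lines.count())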

Nutch 2 with Cassandra as storage is not crawling data properly

I am using Nutch 2.x with Cassandra as storage. Currently I am crawling only one website, and the data is getting loaded into Cassandra in byte format.
When I use the readdb command in Nutch, I don't get any useful crawl data.
Below are the details of the different files and the output I am getting:
========== command to run crawler =====================
bin/crawl urls/ crawlDir/ http://localhost:8983/solr/ 3
======================== seed.txt data ==========================
http://www.ft.com
=== Output of readdb command to read data from cassandra webpage.f table======
~/Documents/Softwares/apache-nutch-2.3/runtime/local$ bin/nutch readdb -dump data -content
~/Documents/Softwares/apache-nutch-2.3/runtime/local/data$ cat part-r-00000
http://www.ft.com/ key: com.ft.www:http/
baseUrl: null
status: 4 (status_redir_temp)
fetchTime: 1426888912463
prevFetchTime: 1424296904936
fetchInterval: 2592000
retriesSinceFetch: 0
modifiedTime: 0
prevModifiedTime: 0
protocolStatus: (null)
parseStatus: (null)
title: null
score: 1.0
marker _injmrk_ : y
marker dist : 0
reprUrl: null
batchId: 1424296906-20007
metadata _csh_ :
===============content of regex-urlfilter.txt ======================
# skip file: ftp: and mailto: urls
-^(file|ftp|mailto):
# skip image and other suffixes we can't yet parse
# for a more extensive coverage use the urlfilter-suffix plugin
-\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|CSS|sit|SIT|eps|EPS|wmf|WMF|zip|ZIP|ppt|PPT|mpg|MPG|xls|XLS|gz|GZ|rpm|RPM|tgz|TGZ|mov|MOV|exe|EXE|jpeg|JPEG|bmp|BMP|js|JS)$
# skip URLs containing certain characters as probable queries, etc.
-[?*!#=]
# skip URLs with slash-delimited segment that repeats 3+ times, to break loops
-.*(/[^/]+)/[^/]+\1/[^/]+\1/
# accept anything else
+.
===========content of log file which is bothering me ======================
2015-02-18 13:57:51,253 ERROR store.CassandraStore -
2015-02-18 13:57:51,253 ERROR store.CassandraStore - [Ljava.lang.StackTraceElement;@653e3e90
2015-02-18 14:01:45,537 INFO connection.CassandraHostRetryService - Downed Host Retry service started with queue size -1 and retry delay 10s
Please let me know if you need more information.
Can someone please help me?
Thanks in advance.
-Sumant
I just started using Nutch and Cassandra today. I am not receiving the same errors in my log file during a crawl.
Did you double check your nutch-site.xml and gora.properties settings? This is how I currently have my files configured.
nutch-site.xml
<configuration>
  <property>
    <name>http.agent.name</name>
    <value>My Spider</value>
  </property>
  <property>
    <name>storage.data.store.class</name>
    <value>org.apache.gora.cassandra.store.CassandraStore</value>
    <description>Default class for storing data</description>
  </property>
</configuration>
gora.properties
#############################
# CassandraStore properties #
#############################
gora.datastore.default=org.apache.gora.cassandra.store.CassandraStore
gora.cassandrastore.servers=localhost:9160

Nutch job failing when sending data to Solr

I've been trying various things, to no avail. My configuration of Nutch/Solr is based on this:
http://ubuntuforums.org/showthread.php?t=1532230
Now that I have Nutch and Solr up and running, I would like to use Solr to index the crawl data. Nutch successfully crawls the domain I specified but fails when I run the command to send that data to Solr. Here's the command:
bin/nutch solrindex http://solr:8181/solr/ crawl/crawldb crawl/linkdb crawl/segments/*
Here's the output:
Indexer: starting at 2013-09-12 10:34:43
Indexer: deleting gone documents: false
Indexer: URL filtering: false
Indexer: URL normalizing: false
Active IndexWriters :
SOLRIndexWriter
solr.server.url : URL of the SOLR instance (mandatory)
solr.commit.size : buffer size when sending to SOLR (default 1000)
solr.mapping.file : name of the mapping file for fields (default solrindex-mapping.xml)
solr.auth : use authentication (default false)
solr.auth.username : use authentication (default false)
solr.auth : username for authentication
solr.auth.password : password for authentication
Indexer: org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: file:/usr/share/apache-nutch-1.7/crawl/linkdb/crawl_fetch
Input path does not exist: file:/usr/share/apache-nutch-1.7/crawl/linkdb/crawl_parse
Input path does not exist: file:/usr/share/apache-nutch-1.7/crawl/linkdb/parse_data
Input path does not exist: file:/usr/share/apache-nutch-1.7/crawl/linkdb/parse_text
at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:197)
at org.apache.hadoop.mapred.SequenceFileInputFormat.listStatus(SequenceFileInputFormat.java:40)
at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:208)
at org.apache.hadoop.mapred.JobClient.writeOldSplits(JobClient.java:1081)
at org.apache.hadoop.mapred.JobClient.writeSplits(JobClient.java:1073)
at org.apache.hadoop.mapred.JobClient.access$700(JobClient.java:179)
at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:983)
at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:936)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1190)
at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:936)
at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:910)
at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1353)
at org.apache.nutch.indexer.IndexingJob.index(IndexingJob.java:123)
at org.apache.nutch.indexer.IndexingJob.run(IndexingJob.java:185)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
at org.apache.nutch.indexer.IndexingJob.main(IndexingJob.java:195)
I've also tried another command after much Googling:
bin/nutch solrindex http://solr:8181/solr/ crawl/crawldb -linkdb crawl/linkdb crawl/segments/*
With this output:
Indexer: starting at 2013-09-12 10:45:51
Indexer: deleting gone documents: false
Indexer: URL filtering: false
Indexer: URL normalizing: false
Active IndexWriters :
SOLRIndexWriter
solr.server.url : URL of the SOLR instance (mandatory)
solr.commit.size : buffer size when sending to SOLR (default 1000)
solr.mapping.file : name of the mapping file for fields (default solrindex-mapping.xml)
solr.auth : use authentication (default false)
solr.auth.username : use authentication (default false)
solr.auth : username for authentication
solr.auth.password : password for authentication
Indexer: java.io.IOException: Job failed!
at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1357)
at org.apache.nutch.indexer.IndexingJob.index(IndexingJob.java:123)
at org.apache.nutch.indexer.IndexingJob.run(IndexingJob.java:185)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
at org.apache.nutch.indexer.IndexingJob.main(IndexingJob.java:195)
Does anyone have any ideas on how to overcome these errors?
I was experiencing the same error on a fresh Solr 5.2.1 and Nutch 1.10:
2015-07-30 20:56:23,015 WARN mapred.LocalJobRunner - job_local_0001
org.apache.solr.common.SolrException: Not Found
Not Found
request: http://127.0.0.1:8983/solr/update?wt=javabin&version=2
So I created a collection (or core; I am not an expert in Solr):
bin/solr create -c demo
And changed the URL in the Nutch indexing command:
bin/nutch solrindex http://127.0.0.1:8983/solr/demo crawl/crawldb -linkdb crawl/linkdb crawl/segments/*
I know the question is rather old, but maybe this will help somebody...
Did you check the Solr log to see the reason for the error? I once had the same problem with Nutch, and the Solr log showed the message "unknown field 'host'". After I modified Solr's schema.xml, the problem vanished.
