Step by step running Apache Nutch 2.2.1

I have configured plugin.folders in nutch-default.xml, but when I run Nutch from Eclipse/NetBeans with:
Main class: org.apache.nutch.crawl.InjectorJob
Arguments: /MY_DATA_SOURCE/HR_PROJECTS/JSearch/Apache_Nutch/RELEASE/release-2.2.1/urls
VM Options: -Dhadoop.log.dir=logs -Dhadoop.log.file=hadoop.log
I get the errors below:
cd /MY_DATA_SOURCE/HR_PROJECTS/JSearch/Apache_Nutch/RELEASE/release-2.2.1; JAVA_HOME=/Library/Java/JavaVirtualMachines/jdk1.7.0_25.jdk/Contents/Home "/Applications/NetBeans/NetBeans 7.3.app/Contents/Resources/NetBeans/java/maven/bin/mvn" "-Dexec.args=-Dhadoop.log.dir=logs -Dhadoop.log.file=hadoop.log -classpath %classpath org.apache.nutch.crawl.InjectorJob /MY_DATA_SOURCE/HR_PROJECTS/JSearch/Apache_Nutch/RELEASE/release-2.2.1/urls" -Dexec.executable=/Library/Java/JavaVirtualMachines/jdk1.7.0_25.jdk/Contents/Home/bin/java process-classes org.codehaus.mojo:exec-maven-plugin:1.2.1:exec
Scanning for projects...
------------------------------------------------------------------------
Building Apache Nutch 2.2.1
------------------------------------------------------------------------
[resources:resources]
[debug] execute contextualize
Using platform encoding (US-ASCII actually) to copy filtered resources, i.e. build is platform dependent!
skip non existing resourceDirectory /MY_DATA_SOURCE/HR_PROJECTS/JSearch/Apache_Nutch/RELEASE/release-2.2.1/src/main/resources
[compiler:compile]
Nothing to compile - all classes are up to date
[exec:exec]
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/Users/hung/.m2/repository/org/slf4j/slf4j-log4j12/1.6.1/slf4j-log4j12-1.6.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/Users/hung/.m2/repository/org/slf4j/slf4j-jdk14/1.6.1/slf4j-jdk14-1.6.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/Users/hung/.m2/repository/org/slf4j/slf4j-simple/1.6.1/slf4j-simple-1.6.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
13/07/06 08:55:18 INFO crawl.InjectorJob: InjectorJob: starting at 2013-07-06 08:55:18
13/07/06 08:55:18 INFO crawl.InjectorJob: InjectorJob: Injecting urlDir: /MY_DATA_SOURCE/HR_PROJECTS/JSearch/Apache_Nutch/RELEASE/release-2.2.1/urls
2013-07-06 08:55:18.420 java[1206:1c03] Unable to load realm info from SCDynamicStore
13/07/06 08:55:18 WARN store.DataStoreFactory: gora.properties not found, properties will be empty.
13/07/06 08:55:18 WARN store.DataStoreFactory: gora.properties not found, properties will be empty.
13/07/06 08:55:19 INFO crawl.InjectorJob: InjectorJob: Using class org.apache.gora.sql.store.SqlStore as the Gora storage class.
13/07/06 08:55:19 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
13/07/06 08:55:19 WARN mapred.JobClient: No job jar file set. User classes may not be found. See JobConf(Class) or JobConf#setJar(String).
13/07/06 08:55:19 INFO input.FileInputFormat: Total input paths to process : 1
13/07/06 08:55:19 WARN snappy.LoadSnappy: Snappy native library not loaded
13/07/06 08:55:19 INFO mapred.JobClient: Running job: job_local226390157_0001
13/07/06 08:55:19 INFO mapred.LocalJobRunner: Waiting for map tasks
13/07/06 08:55:19 INFO mapred.LocalJobRunner: Starting task: attempt_local226390157_0001_m_000000_0
13/07/06 08:55:19 INFO mapred.Task: Using ResourceCalculatorPlugin : null
13/07/06 08:55:19 INFO mapred.MapTask: Processing split: file:/MY_DATA_SOURCE/HR_PROJECTS/JSearch/Apache_Nutch/RELEASE/release-2.2.1/urls/seed.txt:0+20
13/07/06 08:55:19 WARN store.DataStoreFactory: gora.properties not found, properties will be empty.
13/07/06 08:55:19 INFO mapreduce.GoraRecordWriter: gora.buffer.write.limit = 10000
13/07/06 08:55:19 INFO mapred.LocalJobRunner: Map task executor complete.
13/07/06 08:55:19 WARN mapred.FileOutputCommitter: Output path is null in cleanup
13/07/06 08:55:19 WARN mapred.LocalJobRunner: job_local226390157_0001
java.lang.Exception: java.lang.IllegalArgumentException: plugin.folders is not defined
at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:354)
Caused by: java.lang.IllegalArgumentException: plugin.folders is not defined
at org.apache.nutch.plugin.PluginManifestParser.parsePluginFolder(PluginManifestParser.java:78)
at org.apache.nutch.plugin.PluginRepository.<init>(PluginRepository.java:69)
at org.apache.nutch.plugin.PluginRepository.get(PluginRepository.java:97)
at org.apache.nutch.net.URLNormalizers.<init>(URLNormalizers.java:117)
at org.apache.nutch.crawl.InjectorJob$UrlMapper.setup(InjectorJob.java:99)
at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:142)
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:764)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:364)
at org.apache.hadoop.mapred.LocalJobRunner$Job$MapTaskRunnable.run(LocalJobRunner.java:223)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334)
at java.util.concurrent.FutureTask.run(FutureTask.java:166)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:724)
13/07/06 08:55:20 INFO mapred.JobClient: map 0% reduce 0%
13/07/06 08:55:20 INFO mapred.JobClient: Job complete: job_local226390157_0001
13/07/06 08:55:20 INFO mapred.JobClient: Counters: 0
13/07/06 08:55:20 ERROR crawl.InjectorJob: InjectorJob: java.lang.RuntimeException: job failed: name=inject /MY_DATA_SOURCE/HR_PROJECTS/JSearch/Apache_Nutch/RELEASE/release-2.2.1/urls, jobid=job_local226390157_0001
at org.apache.nutch.util.NutchJob.waitForCompletion(NutchJob.java:54)
at org.apache.nutch.crawl.InjectorJob.run(InjectorJob.java:233)
at org.apache.nutch.crawl.InjectorJob.inject(InjectorJob.java:251)
at org.apache.nutch.crawl.InjectorJob.run(InjectorJob.java:273)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
at org.apache.nutch.crawl.InjectorJob.main(InjectorJob.java:282)
------------------------------------------------------------------------
BUILD FAILURE
------------------------------------------------------------------------
Total time: 6.572s
Finished at: Sat Jul 06 08:55:20 ICT 2013
Final Memory: 11M/236M
------------------------------------------------------------------------
Failed to execute goal org.codehaus.mojo:exec-maven-plugin:1.2.1:exec (default-cli) on project nutch: Command execution failed. Process exited with an error: 255 (Exit value: 255) -> [Help 1]
To see the full stack trace of the errors, re-run Maven with the -e switch.
Re-run Maven using the -X switch to enable full debug logging.
For more information about the errors and possible solutions, please read the following articles:
[Help 1] http://cwiki.apache.org/confluence/display/MAVEN/MojoExecutionException

The error message clearly indicates the problem (and where to look for a solution):
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/Users/hung/.m2/repository/org/slf4j/slf4j-log4j12/1.6.1/slf4j-log4j12-1.6.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/Users/hung/.m2/repository/org/slf4j/slf4j-jdk14/1.6.1/slf4j-jdk14-1.6.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/Users/hung/.m2/repository/org/slf4j/slf4j-simple/1.6.1/slf4j-simple-1.6.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
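If it is not obvious which dependencies drag in the extra bindings, Maven's dependency tree shows where each one comes from; this is a minimal sketch, assuming the project is built with Maven as in the log above:

# List every org.slf4j artifact on the classpath and the dependency that pulls it in
mvn dependency:tree -Dincludes=org.slf4j

Once the offending dependencies are known, keeping a single binding (for example slf4j-log4j12) and excluding or removing the others (slf4j-jdk14, slf4j-simple) leaves one StaticLoggerBinder on the classpath, which is what the SLF4J page linked above recommends.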

Related

Start PySpark in Jupyter notebook on EMR 6.5

I am trying to start a PySpark job using the Amazon EMR JupyterHub feature, with the following code:
from pyspark.sql import SparkSession

spark = SparkSession \
    .builder \
    .appName("My App") \
    .getOrCreate()
But in the end, I always get:
The code failed because of a fatal error:
Session 0 unexpectedly reached final status 'dead'. See logs:
stdout:
stderr:
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/usr/lib/spark/jars/slf4j-log4j12-1.7.30.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/usr/share/aws/emr/emrfs/lib/slf4j-log4j12-1.7.12.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/usr/share/aws/redshift/jdbc/redshift-jdbc42-1.2.37.1061.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
WARNING: An illegal reflective access operation has occurred
WARNING: Illegal reflective access by org.apache.spark.unsafe.Platform (file:/usr/lib/spark/jars/spark-unsafe_2.12-3.1.2-amzn-1.jar) to constructor java.nio.DirectByteBuffer(long,int)
WARNING: Please consider reporting this to the maintainers of org.apache.spark.unsafe.Platform
WARNING: Use --illegal-access=warn to enable warnings of further illegal reflective access operations
WARNING: All illegal access operations will be denied in a future release
22/05/04 13:21:11 INFO RSCDriver: Connecting to: ip-10-42-255-42.eu-west-1.compute.internal:10000
22/05/04 13:21:11 INFO RSCDriver: Starting RPC server...
22/05/04 13:21:11 INFO RpcServer: Connected to the port 10001
22/05/04 13:21:11 WARN RSCConf: Your hostname, ip-10-42-255-42.eu-west-1.compute.internal, resolves to a loopback address, but we couldn't find any external IP address!
22/05/04 13:21:11 WARN RSCConf: Set livy.rsc.rpc.server.address if you need to bind to another address.
Exception in thread "main" java.lang.IncompatibleClassChangeError: Inconsistent constant pool data in classfile for class org/apache/livy/shaded/json4s/DefaultFormats. Method 'java.text.SimpleDateFormat $anonfun$df$1(org.apache.livy.shaded.json4s.DefaultFormats)' at index 156 is CONSTANT_MethodRef and should be CONSTANT_InterfaceMethodRef
at org.apache.livy.shaded.json4s.DefaultFormats.$init$(Formats.scala:318)
at org.apache.livy.shaded.json4s.DefaultFormats$.<init>(Formats.scala:296)
at org.apache.livy.shaded.json4s.DefaultFormats$.<clinit>(Formats.scala)
at org.apache.livy.repl.Session.<init>(Session.scala:66)
at org.apache.livy.repl.ReplDriver.initializeSparkEntries(ReplDriver.scala:43)
at org.apache.livy.rsc.driver.RSCDriver.run(RSCDriver.java:337)
at org.apache.livy.rsc.driver.RSCDriverBootstrapper.main(RSCDriverBootstrapper.java:93)
at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.base/java.lang.reflect.Method.invoke(Method.java:566)
at org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52)
at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:959)
at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:180)
at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:203)
at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:90)
at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:1047)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:1056)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
22/05/04 13:21:11 INFO ShutdownHookManager: Shutdown hook called
22/05/04 13:21:11 INFO ShutdownHookManager: Deleting directory /mnt/tmp/spark-2804f6ee-21f1-4773-98dc-8b3e3bd1924a
It seems the Livy version is clashing with the Livy version embedded in the Apache shaded jar, so I tried to override the jar with a fat jar that contains all the Spark jars I usually use, imported with the following configuration:
%%configure -f
{
  "conf": {
    "spark.jars": "s3://mybucket/myfatjar.jar"
  }
}
But without any effect.

Gremlin console and Spark UI not responding when performing an OLAP query with JanusGraph and Apache Spark

I have a graph in JanusGraph (v0.5.3) containing around 2 million vertices and 20 million edges. I'm running an OLAP query that is a modified version of the lowest_common_ancestor recipe (query added below).
The query takes too long (more than 1 hour), I start seeing "Managed memory leak detected" warnings, and then the Spark web UI stops responding, so I can't debug any further.
I also see "Lost executor driver on localhost: Executor heartbeat timed out" warnings about 30 minutes after the job starts, but the query still hasn't finished after an hour. I was hoping Spark and Hadoop would make queries faster, but this is very slow, and I'm unable to profile the query or follow its progress in the Spark web UI.
Note: I have installed Hadoop (3.2.2) and I'm using Apache Spark (2.4.0), which I assume came with the JanusGraph distribution since I don't remember installing it. The JanusGraph docs say v0.5.3 is compatible with Spark 2.2.x, so I'm not sure whether Spark compatibility is the issue.
Below is how I'm running the query from the bin/gremlin.sh console.
graph = GraphFactory.open('conf/hadoop-graph/read-cql.properties')
g = graph.traversal().withComputer(SparkGraphComputer)
// OLAP query
input = [2437272, 4956336]
g.V().has(id, within(input)).
aggregate('input').hasId(input.head()).
repeat(__.in('has_word')).emit().as('x').
select('input').unfold().has(id, within(input.tail())).
repeat(__.in('has_word')).emit(where(eq('x'))).
group().
by(select('x')).
by(path().count(local).fold()).
unfold().filter(select(values).count(local).is(input.tail().size())).
order().
by(select(values).unfold().sum()).
select(keys).limit(5).elementMap()
I'm assuming Hadoop is configured fine:
user@xyz-WS:~/Downloads/janusgraph-full-0.5.3$ bin/gremlin.sh
\,,,/
(o o)
-----oOOo-(3)-oOOo-----
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/home/rrmerugu/Downloads/janusgraph-full-0.5.3/lib/slf4j-log4j12-1.7.12.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/home/rrmerugu/Downloads/janusgraph-full-0.5.3/lib/logback-classic-1.1.3.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
plugin activated: janusgraph.imports
plugin activated: tinkerpop.server
plugin activated: tinkerpop.utilities
00:56:59 WARN org.apache.hadoop.util.NativeCodeLoader - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
plugin activated: tinkerpop.hadoop
plugin activated: tinkerpop.spark
plugin activated: tinkerpop.tinkergraph
gremlin>
gremlin> hdfs
==>storage[DFS[DFSClient[clientName=DFSClient_NONMAPREDUCE_1845984070_1, ugi=rrmerugu (auth:SIMPLE)]]]
conf/hadoop-graph/read-cql.properties is described below:
#
# Hadoop Graph Configuration
#
gremlin.graph=org.apache.tinkerpop.gremlin.hadoop.structure.HadoopGraph
gremlin.hadoop.graphReader=org.janusgraph.hadoop.formats.cql.CqlInputFormat
gremlin.hadoop.graphWriter=org.apache.hadoop.mapreduce.lib.output.NullOutputFormat
gremlin.hadoop.jarsInDistributedCache=true
gremlin.hadoop.inputLocation=none
gremlin.hadoop.outputLocation=output
gremlin.spark.persistContext=true
gremlin.spark.persistStorageLevel=DISK_ONLY #MEMORY_AND_DISK
#
# JanusGraph Cassandra InputFormat configuration
#
# These properties defines the connection properties which were used while write data to JanusGraph.
janusgraphmr.ioformat.conf.storage.backend=cql
# This specifies the hostname & port for Cassandra data store.
janusgraphmr.ioformat.conf.storage.hostname=127.0.0.1
janusgraphmr.ioformat.conf.storage.port=9042
# This specifies the keyspace where data is stored.
janusgraphmr.ioformat.conf.storage.cql.keyspace=janusgraph
# This defines the indexing backend configuration used while writing data to JanusGraph.
janusgraphmr.ioformat.conf.index.search.backend=elasticsearch
janusgraphmr.ioformat.conf.index.search.hostname=127.0.0.1
# Use the appropriate properties for the backend when using a different storage backend (HBase) or indexing backend (Solr).
#
# Apache Cassandra InputFormat configuration
#
cassandra.input.partitioner.class=org.apache.cassandra.dht.Murmur3Partitioner
cassandra.input.widerows=true
#
# SparkGraphComputer Configuration
#
spark.master=local[8]
spark.executor.memory=12g
spark.driver.memory=12g
spark.serializer=org.apache.spark.serializer.KryoSerializer
spark.kryo.registrator=org.janusgraph.hadoop.serialize.JanusGraphKryoRegistrator
spark.executor.memoryOverhead=1g
spark.driver.memoryOverhead=1g
spark.network.timeout=600s
spark.executor.heartbeatInterval=119s
spark.io.compression.codec=snappy
Gremlin console output:
rrmerugu@Code-WS:~/Downloads/janusgraph-full-0.5.3$ bin/gremlin.sh
\,,,/
(o o)
-----oOOo-(3)-oOOo-----
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/home/rrmerugu/Downloads/janusgraph-full-0.5.3/lib/slf4j-log4j12-1.7.12.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/home/rrmerugu/Downloads/janusgraph-full-0.5.3/lib/logback-classic-1.1.3.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
plugin activated: janusgraph.imports
plugin activated: tinkerpop.server
plugin activated: tinkerpop.utilities
07:57:25 WARN org.apache.hadoop.util.NativeCodeLoader - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
plugin activated: tinkerpop.hadoop
plugin activated: tinkerpop.spark
plugin activated: tinkerpop.tinkergraph
gremlin> graph = GraphFactory.open('conf/hadoop-graph/read-cql.properties')
==>hadoopgraph[cqlinputformat->nulloutputformat]
gremlin> g = graph.traversal().withComputer(SparkGraphComputer)
==>graphtraversalsource[hadoopgraph[cqlinputformat->nulloutputformat], sparkgraphcomputer]
gremlin> input = [2437272, 4956336]
==>2437272
==>4956336
gremlin> g.V().has(id, within(input)).
......1> aggregate('input').hasId(input.head()).
......2> repeat(__.in('has_word')).emit().as('x').
......3> select('input').unfold().has(id, within(input.tail())).
......4> repeat(__.in('has_word')).emit(where(eq('x'))).
......5> group().
......6> by(select('x')).
......7> by(path().count(local).fold()).
......8> unfold().filter(select(values).count(local).is(input.tail().size())).
......9> order().
.....10> by(select(values).unfold().sum()).
.....11> select(keys).limit(5).elementMap()
07:58:05 WARN org.apache.tinkerpop.gremlin.spark.process.computer.SparkGraphComputer - class org.apache.hadoop.mapreduce.lib.output.NullOutputFormat does not implement PersistResultGraphAware and thus, persistence options are unknown -- assuming all options are possible
07:58:06 WARN org.apache.spark.util.Utils - Your hostname, Code-WS resolves to a loopback address: 127.0.1.1; using 192.168.0.10 instead (on interface enp7s0)
07:58:06 WARN org.apache.spark.util.Utils - Set SPARK_LOCAL_IP if you need to bind to another address
08:01:59 WARN org.apache.spark.executor.Executor - Managed memory leak detected; size = 40472352 bytes, TID = 1633
08:02:03 WARN org.apache.spark.executor.Executor - Managed memory leak detected; size = 41050016 bytes, TID = 1683
08:11:26 WARN org.apache.spark.rpc.netty.NettyRpcEnv - Ignored failure: java.util.concurrent.TimeoutException: Cannot receive any reply from 192.168.0.10:43701 in 119 seconds
08:11:29 WARN org.apache.spark.executor.Executor - Issue communicating with driver in heartbeater
org.apache.spark.rpc.RpcTimeoutException: Futures timed out after [119 seconds]. This timeout is controlled by spark.executor.heartbeatInterval
at org.apache.spark.rpc.RpcTimeout.org$apache$spark$rpc$RpcTimeout$$createRpcTimeoutException(RpcTimeout.scala:47)
at org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:62)
at org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:58)
at scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:36)
at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:76)
at org.apache.spark.rpc.RpcEndpointRef.askSync(RpcEndpointRef.scala:92)
at org.apache.spark.executor.Executor.org$apache$spark$executor$Executor$$reportHeartBeat(Executor.scala:835)
at org.apache.spark.executor.Executor$$anon$2$$anonfun$run$1.apply$mcV$sp(Executor.scala:864)
at org.apache.spark.executor.Executor$$anon$2$$anonfun$run$1.apply(Executor.scala:864)
at org.apache.spark.executor.Executor$$anon$2$$anonfun$run$1.apply(Executor.scala:864)
at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1945)
at org.apache.spark.executor.Executor$$anon$2.run(Executor.scala:864)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.util.concurrent.TimeoutException: Futures timed out after [119 seconds]
at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219)
at scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223)
at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:220)
at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:75)
... 14 more
08:21:13 WARN org.apache.spark.rpc.netty.NettyRpcEnv - Ignored failure: java.util.concurrent.TimeoutException: Cannot receive any reply from 192.168.0.10:43701 in 119 seconds
08:21:37 WARN org.apache.spark.executor.Executor - Issue communicating with driver in heartbeater
org.apache.spark.rpc.RpcTimeoutException: Futures timed out after [119 seconds]. This timeout is controlled by spark.executor.heartbeatInterval
at org.apache.spark.rpc.RpcTimeout.org$apache$spark$rpc$RpcTimeout$$createRpcTimeoutException(RpcTimeout.scala:47)
at org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:62)
at org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:58)
at scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:36)
at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:76)
at org.apache.spark.rpc.RpcEndpointRef.askSync(RpcEndpointRef.scala:92)
at org.apache.spark.executor.Executor.org$apache$spark$executor$Executor$$reportHeartBeat(Executor.scala:835)
at org.apache.spark.executor.Executor$$anon$2$$anonfun$run$1.apply$mcV$sp(Executor.scala:864)
at org.apache.spark.executor.Executor$$anon$2$$anonfun$run$1.apply(Executor.scala:864)
at org.apache.spark.executor.Executor$$anon$2$$anonfun$run$1.apply(Executor.scala:864)
at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1945)
at org.apache.spark.executor.Executor$$anon$2.run(Executor.scala:864)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.util.concurrent.TimeoutException: Futures timed out after [119 seconds]
at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219)
at scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223)
at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:220)
at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:75)
... 14 more
08:32:07 WARN org.apache.spark.executor.Executor - Issue communicating with driver in heartbeater
08:35:18 WARN org.apache.spark.HeartbeatReceiver - Removing executor driver with no recent heartbeats: 614252 ms exceeds timeout 600000 ms
Exception in thread "dispatcher-event-loop-7" java.lang.OutOfMemoryError: Java heap space
08:44:53 ERROR org.apache.spark.util.Utils - Uncaught exception in thread driver-heartbeater
08:57:17 WARN org.spark_project.jetty.io.ManagedSelector -
java.lang.OutOfMemoryError: Java heap space
08:58:56 ERROR org.apache.spark.util.Utils - uncaught error in thread Spark Context Cleaner, stopping SparkContext
java.lang.OutOfMemoryError: Java heap space
08:58:56 ERROR org.apache.spark.util.Utils - throw uncaught fatal error in thread Spark Context Cleaner
java.lang.OutOfMemoryError: Java heap space
Exception in thread "Spark Context Cleaner" java.lang.OutOfMemoryError: Java heap space
08:58:56 ERROR org.apache.spark.executor.Executor - Exception in task 91.0 in stage 16.0 (TID 1890)
java.lang.OutOfMemoryError: Java heap space
08:58:56 ERROR org.apache.spark.util.SparkUncaughtExceptionHandler - Uncaught exception in thread Thread[Executor task launch worker for task 1890,5,main]
java.lang.OutOfMemoryError: Java heap space
08:58:56 WARN org.apache.spark.scheduler.TaskSetManager - Lost task 91.0 in stage 16.0 (TID 1890, localhost, executor driver): java.lang.OutOfMemoryError: Java heap space
08:58:56 ERROR org.apache.spark.scheduler.TaskSetManager - Task 91 in stage 16.0 failed 1 times; aborting job
org.apache.spark.SparkException: Job aborted due to stage failure: Task 91 in stage 16.0 failed 1 times, most recent failure: Lost task 91.0 in stage 16.0 (TID 1890, localhost, executor driver): java.lang.OutOfMemoryError: Java heap space
Driver stacktrace:
Type ':help' or ':h' for help.
System specs: I'm running this on Ubuntu with a 6-core i7 processor, 32 GB RAM, and a 1 TB SSD.
What I want to find answers for:
Am I missing something? Any hints on why this query takes so long and ends in these errors would be appreciated.
Is there a better way to profile this query and understand why it is so slow? The console and Spark web UI stop responding once the memory leak errors appear, so I have no way to debug this.
UPDATE
I have updated the log above with the OOM errors that the program eventually exited with.

java.lang.ClassCastException: org.apache.hadoop.conf.Configuration cannot be cast to org.apache.hadoop.yarn.conf.YarnConfiguration

I am running a Spark application using YARN in Cloudera.
Spark version: 2.1
I get the following error:
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/data/yarn/nm/filecache/13/jars/slf4j-log4j12-1.7.16.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/opt/cloudera/parcels/CDH-5.10.2-1.cdh5.10.2.p0.5/jars/slf4j-log4j12-1.7.5.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
18/04/14 22:20:57 INFO util.SignalUtils: Registered signal handler for TERM
18/04/14 22:20:57 INFO util.SignalUtils: Registered signal handler for HUP
18/04/14 22:20:57 INFO util.SignalUtils: Registered signal handler for INT
Exception in thread "main" java.lang.ClassCastException: org.apache.hadoop.conf.Configuration cannot be cast to org.apache.hadoop.yarn.conf.YarnConfiguration
at org.apache.spark.deploy.yarn.ApplicationMaster.<init>(ApplicationMaster.scala:60)
at org.apache.spark.deploy.yarn.ApplicationMaster$$anonfun$main$1.apply$mcV$sp(ApplicationMaster.scala:764)
at org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:67)
at org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:66)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1656)
at org.apache.spark.deploy.SparkHadoopUtil.runAsSparkUser(SparkHadoopUtil.scala:66)
at org.apache.spark.deploy.yarn.ApplicationMaster$.main(ApplicationMaster.scala:763)
at org.apache.spark.deploy.yarn.ApplicationMaster.main(ApplicationMaster.scala)
I managed to solve it by verifying that the Spark version configured in the SPARK_HOME variable matches the Hadoop version installed in Cloudera.
From https://spark.apache.org/downloads.html you can download the build suitable for your Hadoop version.
The Hadoop version in Cloudera can be found with:
$ hadoop version
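To compare the two sides, you can also look at which Hadoop client libraries the Spark build pointed to by SPARK_HOME ships; a rough sketch, assuming a pre-built Spark 2.x layout where those jars live under jars/ (the exact path may differ in your installation):

# Hadoop client jars bundled with the Spark build (path is an assumption)
ls "$SPARK_HOME"/jars/hadoop-*.jar

The version suffix on those jars (and on the pre-built package name, e.g. spark-2.1.x-bin-hadoop2.6) should line up with what hadoop version reports.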
I encountered the same issue while trying to start a Spark job using the YARN REST API.
The reason was that the environment variable SPARK_YARN_MODE was missing. After adding this env var, everything works fine:
export SPARK_YARN_MODE=true

SAP HANA VORA - Spark Controller issue

I am trying to install and start the SAP HANA Spark Controller on VORA 1.2 using Ambari.
However, when I start the Spark Controller, I get the exception below.
Kindly help here...
[hanaes@ip-172-30-2-218 bin]$ ./hanaes start
Starting HANA Spark Controller ...
Class path is /usr/sap/spark/controller/bin/../conf:/usr/hdp/2.3.4.7-4/hadoop/conf:/etc/hive/conf:../*:../lib/*:../lib/external/*:/usr/hdp/2.3.4.7-4/hadoop/*:/usr/hdp/2.3.4.7-4/hadoop/lib/*:/usr/hdp/2.3.4.7-4/hadoop-hdfs/*:/usr/hdp/2.3.4.7-4/hadoop-hdfs/lib/*
STARTED
[hanaes@ip-172-30-2-218 bin]$ clear
[hanaes@ip-172-30-2-218 bin]$ tail -1000f /var/log/hanaes/hana_controller.log
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/usr/sap/spark/controller/lib/external/spark-assembly-1.5.2.2.3.4.7-4-hadoop2.7.1.2.3.4.7-4.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/usr/hdp/2.3.4.7-4/hadoop/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
16/05/09 12:04:43 INFO HanaESConfig: Loaded HANA Extended Store Configuration
Found Spark Libraries. Proceeding with Current Class Path
16/05/09 12:04:44 INFO Server: Starting Spark Controller
16/05/09 12:04:52 ERROR SparkContext: Error initializing SparkContext.
org.apache.spark.SparkException: Yarn application has already ended! It might have been killed or unable to launch application master.
at org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.waitForApplication(YarnClientSchedulerBackend.scala:125)
at org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.start(YarnClientSchedulerBackend.scala:65)
at org.apache.spark.scheduler.TaskSchedulerImpl.start(TaskSchedulerImpl.scala:144)
at org.apache.spark.SparkContext.<init>(SparkContext.scala:523)
at com.sap.hana.spark.network.CommandRouter.initializeHanaContext(CommandRouter.scala:125)
at com.sap.hana.spark.network.CommandRouter.<init>(CommandRouter.scala:38)
at com.sap.hana.spark.network.Server$$anonfun$1.apply(Server.scala:96)
at com.sap.hana.spark.network.Server$$anonfun$1.apply(Server.scala:96)
at akka.actor.TypedCreatorFunctionConsumer.produce(Props.scala:343)
at akka.actor.Props.newActor(Props.scala:252)
at akka.actor.ActorCell.newActor(ActorCell.scala:552)
at akka.actor.ActorCell.create(ActorCell.scala:578)
at akka.actor.ActorCell.invokeAll$1(ActorCell.scala:456)
at akka.actor.ActorCell.systemInvoke(ActorCell.scala:478)
at akka.dispatch.Mailbox.processAllSystemMessages(Mailbox.scala:263)
at akka.dispatch.Mailbox.run(Mailbox.scala:219)
at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:397)
at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
16/05/09 12:04:52 ERROR Utils: Uncaught exception in thread SAPHanaSpark-akka.actor.default-dispatcher-2
java.lang.NullPointerException
at org.apache.spark.network.netty.NettyBlockTransferService.close(NettyBlockTransferService.scala:152)
at org.apache.spark.storage.BlockManager.stop(BlockManager.scala:1228)
at org.apache.spark.SparkEnv.stop(SparkEnv.scala:100)
at org.apache.spark.SparkContext$$anonfun$stop$12.apply$mcV$sp(SparkContext.scala:1749)
at org.apache.spark.util.Utils$.tryLogNonFatalError(Utils.scala:1185)
at org.apache.spark.SparkContext.stop(SparkContext.scala:1748)
at org.apache.spark.SparkContext.<init>(SparkContext.scala:593)
at com.sap.hana.spark.network.CommandRouter.initializeHanaContext(CommandRouter.scala:125)
at com.sap.hana.spark.network.CommandRouter.<init>(CommandRouter.scala:38)
at com.sap.hana.spark.network.Server$$anonfun$1.apply(Server.scala:96)
at com.sap.hana.spark.network.Server$$anonfun$1.apply(Server.scala:96)
at akka.actor.TypedCreatorFunctionConsumer.produce(Props.scala:343)
at akka.actor.Props.newActor(Props.scala:252)
at akka.actor.ActorCell.newActor(ActorCell.scala:552)
at akka.actor.ActorCell.create(ActorCell.scala:578)
at akka.actor.ActorCell.invokeAll$1(ActorCell.scala:456)
at akka.actor.ActorCell.systemInvoke(ActorCell.scala:478)
at akka.dispatch.Mailbox.processAllSystemMessages(Mailbox.scala:263)
at akka.dispatch.Mailbox.run(Mailbox.scala:219)
at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:397)
at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
The Spark Controller log indicates an issue with YARN; you need to check the YARN log for the failed Spark Controller job:
Ambari -> Yarn -> Quick Links -> Resource Manager UI -> find the failed Spark Controller job -> click on the application ID on the left -> click on 'logs'
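If the Resource Manager UI is not convenient, the same application logs can usually be pulled from the command line with the YARN CLI; a sketch, where <application_id> is a placeholder for the ID of the failed Spark Controller job taken from the Resource Manager:

# Fetch the aggregated YARN logs for the failed application
yarn logs -applicationId <application_id>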

SPARK_RPC_CLIENT_CONNECT_TIMEOUT when running Hive on Spark in YARN cluster mode

I am using HDP 2.3 and trying to use Spark (1.3.1) as the execution engine for running Hive queries.
The spark-assembly jar is also available in the hive/lib folder.
I am able to run the query with spark.master set to local, but I face the issue below when using spark.master=yarn-cluster.
Command run:
hive -e "set hive.execution.engine=spark; set spark.master=yarn-cluster; select count(*) from db_name.table_name;"
Output:
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/usr/hdp/2.3.0.0-2557/hadoop/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/usr/hdp/2.3.0.0-2557/hive/lib/spark-assembly-1.3.1-hadoop2.6.0.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/root/downloads/machine/spark/lib/spark-assembly-1.3.1-hadoop2.6.0.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
WARNING: Use "yarn jar" to launch YARN applications.
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/usr/hdp/2.3.0.0-2557/hadoop/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/usr/hdp/2.3.0.0-2557/hive/lib/spark-assembly-1.3.1-hadoop2.6.0.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/root/downloads/machine/spark/lib/spark-assembly-1.3.1-hadoop2.6.0.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
Logging initialized using configuration in file:/etc/hive/2.3.0.0-2557/0/hive-log4j.properties
Query ID = root_20150909201120_a67d5ca3-36df-43fe-894a-3645585eec7a
Total jobs = 1
Launching Job 1 out of 1
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
set mapreduce.job.reduces=<number>
Failed to execute spark task, with exception 'org.apache.hadoop.hive.ql.metadata.HiveException(Failed to create spark client.)'
FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.spark.SparkTask
YARN log of the application:
15/09/09 19:42:27 INFO yarn.ApplicationMaster: Waiting for spark context initialization ...
15/09/09 19:42:27 INFO client.RemoteDriver: Connecting to: sandbox.hortonworks.com:59941
15/09/09 19:42:27 ERROR yarn.ApplicationMaster: User class threw exception: SPARK_RPC_CLIENT_CONNECT_TIMEOUT
java.lang.NoSuchFieldError: SPARK_RPC_CLIENT_CONNECT_TIMEOUT
at org.apache.hive.spark.client.rpc.RpcConfiguration.<clinit>(RpcConfiguration.java:46)
at org.apache.hive.spark.client.RemoteDriver.<init>(RemoteDriver.java:146)
at org.apache.hive.spark.client.RemoteDriver.main(RemoteDriver.java:556)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:480)
15/09/09 19:42:27 INFO yarn.ApplicationMaster: Final app status: FAILED, exitCode: 15, (reason: User class threw exception: SPARK_RPC_CLIENT_CONNECT_TIMEOUT)
15/09/09 19:42:37 ERROR yarn.ApplicationMaster: SparkContext did not initialize after waiting for 100000 ms. Please check earlier log output for errors. Failing the application.
15/09/09 19:42:37 INFO yarn.ApplicationMaster: Unregistering ApplicationMaster with FAILED (diag message: User class threw exception: SPARK_RPC_CLIENT_CONNECT_TIMEOUT)
15/09/09 19:42:37 INFO yarn.ApplicationMaster: Deleting staging directory .sparkStaging/application_1441817597849_0008
Any help on debugging the issue is much appreciated.
I don't think queries can be executed in yarn-cluster mode.
You can run interactive queries in local and yarn-client modes only.
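A quick way to check that is to rerun the question's command with the master switched to yarn-client; a sketch reusing the same placeholder database and table names from the question:

hive -e "set hive.execution.engine=spark; set spark.master=yarn-client; select count(*) from db_name.table_name;"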
