Start PySpark in Jupyter notebook on EMR 6.5 - apache-spark

I am trying to start a PySpark job using the Amazon EMR JupyterHub feature, with the following code:
from pyspark.sql import SparkSession

spark = SparkSession \
    .builder \
    .appName("My App") \
    .getOrCreate()
But in the end, I always get:
The code failed because of a fatal error:
Session 0 unexpectedly reached final status 'dead'. See logs:
stdout:
stderr:
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/usr/lib/spark/jars/slf4j-log4j12-1.7.30.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/usr/share/aws/emr/emrfs/lib/slf4j-log4j12-1.7.12.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/usr/share/aws/redshift/jdbc/redshift-jdbc42-1.2.37.1061.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
WARNING: An illegal reflective access operation has occurred
WARNING: Illegal reflective access by org.apache.spark.unsafe.Platform (file:/usr/lib/spark/jars/spark-unsafe_2.12-3.1.2-amzn-1.jar) to constructor java.nio.DirectByteBuffer(long,int)
WARNING: Please consider reporting this to the maintainers of org.apache.spark.unsafe.Platform
WARNING: Use --illegal-access=warn to enable warnings of further illegal reflective access operations
WARNING: All illegal access operations will be denied in a future release
22/05/04 13:21:11 INFO RSCDriver: Connecting to: ip-10-42-255-42.eu-west-1.compute.internal:10000
22/05/04 13:21:11 INFO RSCDriver: Starting RPC server...
22/05/04 13:21:11 INFO RpcServer: Connected to the port 10001
22/05/04 13:21:11 WARN RSCConf: Your hostname, ip-10-42-255-42.eu-west-1.compute.internal, resolves to a loopback address, but we couldn't find any external IP address!
22/05/04 13:21:11 WARN RSCConf: Set livy.rsc.rpc.server.address if you need to bind to another address.
Exception in thread "main" java.lang.IncompatibleClassChangeError: Inconsistent constant pool data in classfile for class org/apache/livy/shaded/json4s/DefaultFormats. Method 'java.text.SimpleDateFormat $anonfun$df$1(org.apache.livy.shaded.json4s.DefaultFormats)' at index 156 is CONSTANT_MethodRef and should be CONSTANT_InterfaceMethodRef
at org.apache.livy.shaded.json4s.DefaultFormats.$init$(Formats.scala:318)
at org.apache.livy.shaded.json4s.DefaultFormats$.<init>(Formats.scala:296)
at org.apache.livy.shaded.json4s.DefaultFormats$.<clinit>(Formats.scala)
at org.apache.livy.repl.Session.<init>(Session.scala:66)
at org.apache.livy.repl.ReplDriver.initializeSparkEntries(ReplDriver.scala:43)
at org.apache.livy.rsc.driver.RSCDriver.run(RSCDriver.java:337)
at org.apache.livy.rsc.driver.RSCDriverBootstrapper.main(RSCDriverBootstrapper.java:93)
at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.base/java.lang.reflect.Method.invoke(Method.java:566)
at org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52)
at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:959)
at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:180)
at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:203)
at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:90)
at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:1047)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:1056)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
22/05/04 13:21:11 INFO ShutdownHookManager: Shutdown hook called
22/05/04 13:21:11 INFO ShutdownHookManager: Deleting directory /mnt/tmp/spark-2804f6ee-21f1-4773-98dc-8b3e3bd1924a
It seems the Livy version is clashing with the Livy version embedded in the shaded jar, so I tried to override it with a fat jar containing all the Spark jars I usually use, and used the following config to import it:
%%configure -f
{
    "conf": {
        "spark.jars": "s3://mybucket/myfatjar.jar"
    }
}
But this had no effect.
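For completeness, a sketch of a variant I could also try: passing the jar to the Livy session itself via the top-level "jars" field of %%configure (on EMR an s3:// path should be usable there; the bucket path is the same placeholder as above, and I have not confirmed that this resolves the json4s clash):
%%configure -f
{
    "jars": ["s3://mybucket/myfatjar.jar"]
}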

Related

Apache spark master server not starting. Caused by: java.lang.reflect.InaccessibleObjectException

I have installed Apache Spark by following these instructions. When I get to step 5 and have to execute start-master.sh in the terminal, I get the following output:
21/09/25 12:41:33 WARN Utils: Your hostname, petar-X580VD resolves to a loopback address: 127.0.1.1; using 192.168.0.105 instead (on interface wlp3s0)
21/09/25 12:41:33 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
Exception in thread "main" java.lang.ExceptionInInitializerError
at org.apache.spark.unsafe.array.ByteArrayMethods.<clinit>(ByteArrayMethods.java:54)
at org.apache.spark.internal.config.package$.<init>(package.scala:1095)
at org.apache.spark.internal.config.package$.<clinit>(package.scala)
at org.apache.spark.deploy.SparkSubmitArguments.$anonfun$loadEnvironmentArguments$3(SparkSubmitArguments.scala:157)
at scala.Option.orElse(Option.scala:447)
at org.apache.spark.deploy.SparkSubmitArguments.loadEnvironmentArguments(SparkSubmitArguments.scala:157)
at org.apache.spark.deploy.SparkSubmitArguments.<init>(SparkSubmitArguments.scala:115)
at org.apache.spark.deploy.SparkSubmit$$anon$2$$anon$3.<init>(SparkSubmit.scala:1022)
at org.apache.spark.deploy.SparkSubmit$$anon$2.parseArguments(SparkSubmit.scala:1022)
at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:85)
at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:1039)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:1048)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: java.lang.reflect.InaccessibleObjectException: Unable to make private java.nio.DirectByteBuffer(long,int) accessible: module java.base does not "opens java.nio" to unnamed module #4434095f
at java.base/java.lang.reflect.AccessibleObject.checkCanSetAccessible(AccessibleObject.java:357)
at java.base/java.lang.reflect.AccessibleObject.checkCanSetAccessible(AccessibleObject.java:297)
at java.base/java.lang.reflect.Constructor.checkCanSetAccessible(Constructor.java:188)
at java.base/java.lang.reflect.Constructor.setAccessible(Constructor.java:181)
at org.apache.spark.unsafe.Platform.<clinit>(Platform.java:56)
... 13 more
I don't know how to fix this.
As @werner suggested in the comments, changing to Java version 11 fixed the problem.
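For reference, a minimal sketch of what the switch might look like on a Debian/Ubuntu machine (the package name and JAVA_HOME path are assumptions; adjust them to your distribution and JDK install location):
# Install JDK 11 and make it the default (Debian/Ubuntu example)
sudo apt-get install -y openjdk-11-jdk
sudo update-alternatives --config java   # pick the Java 11 entry
# Point Spark at the same JDK before starting the master
export JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64
export PATH="$JAVA_HOME/bin:$PATH"
$SPARK_HOME/sbin/start-master.sh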

java.lang.ClassCastException: org.apache.hadoop.conf.Configuration cannot be cast to org.apache.hadoop.yarn.conf.YarnConfiguration

I am running a Spark application using YARN in Cloudera.
Spark version: 2.1
I get the following error:
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/data/yarn/nm/filecache/13/jars/slf4j-log4j12-1.7.16.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/opt/cloudera/parcels/CDH-5.10.2-1.cdh5.10.2.p0.5/jars/slf4j-log4j12-1.7.5.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
18/04/14 22:20:57 INFO util.SignalUtils: Registered signal handler for TERM
18/04/14 22:20:57 INFO util.SignalUtils: Registered signal handler for HUP
18/04/14 22:20:57 INFO util.SignalUtils: Registered signal handler for INT
Exception in thread "main" java.lang.ClassCastException: org.apache.hadoop.conf.Configuration cannot be cast to org.apache.hadoop.yarn.conf.YarnConfiguration
at org.apache.spark.deploy.yarn.ApplicationMaster.<init>(ApplicationMaster.scala:60)
at org.apache.spark.deploy.yarn.ApplicationMaster$$anonfun$main$1.apply$mcV$sp(ApplicationMaster.scala:764)
at org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:67)
at org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:66)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1656)
at org.apache.spark.deploy.SparkHadoopUtil.runAsSparkUser(SparkHadoopUtil.scala:66)
at org.apache.spark.deploy.yarn.ApplicationMaster$.main(ApplicationMaster.scala:763)
at org.apache.spark.deploy.yarn.ApplicationMaster.main(ApplicationMaster.scala)
I managed to solve it by verifying that the Spark version configured in the SPARK_HOME variable matches the Hadoop version installed in Cloudera.
From https://spark.apache.org/downloads.html you can download the build suitable for your Hadoop version.
The Hadoop version in Cloudera can be found with:
$ hadoop version
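As an illustrative sketch only (CDH 5.10, as seen in the trace above, is based on Hadoop 2.6, and the archive URL assumes Spark 2.1.0 built for Hadoop 2.6; adjust both to your cluster):
# Check the Hadoop version shipped by the cluster
hadoop version
# Download a Spark build that matches it (example: Spark 2.1.0 for Hadoop 2.6)
wget https://archive.apache.org/dist/spark/spark-2.1.0/spark-2.1.0-bin-hadoop2.6.tgz
tar -xzf spark-2.1.0-bin-hadoop2.6.tgz
# Point SPARK_HOME at the matching distribution
export SPARK_HOME=$PWD/spark-2.1.0-bin-hadoop2.6
export PATH="$SPARK_HOME/bin:$PATH"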
I encountered the same issue while trying to start a Spark job using the YARN REST API.
The reason was that the environment variable SPARK_YARN_MODE was missing. After adding this env var, everything worked fine:
export SPARK_YARN_MODE=true

JMeter 3.3 connect Spark 2.2.1 error: "Cannot create PoolableConnectionFactory (Method not supported)"

Using this list of jars, I can successfully connect SQuirreL SQL to Spark 2.2.1:
commons-logging-1.1.3.jar
hadoop-common-2.7.3.jar
hive-exec-1.2.1.spark2.jar
hive-jdbc-1.2.1.spark2.jar
hive-metastore-1.2.1.spark2.jar
http-client-1.0.4.jar
httpclient-4.5.2.jar
httpcore-4.4.4.jar
libfb303-0.9.3.jar
libthrift-0.9.3.jar
log4j-1.2.17.jar
slf4j-api-1.7.16.jar
slf4j-log4j12-1.7.16.jar
spark-hive-thriftserver_2.11-2.2.1.jar
spark-hive_2.11-2.2.1.jar
spark-network-common_2.11-2.2.1.jar
I think the above jars are more than necessary. But when trying to connect JMeter 3.3 to the same Spark 2.2.1 Thrift Server with them, I get the error message below:
Cannot create PoolableConnectionFactory (Method not supported)
The full response in JMeter is:
Thread Name: test 1-1
Sample Start: 2018-04-03 13:34:43 CST
Load time: 511
Connect Time: 510
Latency: 0
Size in bytes: 62
Sent bytes:0
Headers size in bytes: 0
Body size in bytes: 62
Sample Count: 1
Error Count: 1
Data type ("text"|"bin"|""): text
Response code: null 0
Response message: java.sql.SQLException: Cannot create PoolableConnectionFactory (Method not supported)
Response headers:
SampleResult fields:
ContentType: text/plain
DataEncoding: UTF-8
I also tried the newer Hive JDBC driver 2.3.0, but it clearly does not work with Spark 2.2.1, either from beeline or from anything else, including JMeter.
The error message when using beeline with Hive JDBC driver 2.3.0 is:
$ beeline -u jdbc:hive2://<hostip>:10000/tpch_sf100_orc -n rxxxds
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/opt/apache-hive-2.3.0-bin/lib/log4j-slf4j-impl-2.6.2.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/opt/apache-tez-0.9.0-bin/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/opt/hadoop-2.9.0/share/hadoop/common/lib/slf4j-log4j12-1.7.25.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.apache.logging.slf4j.Log4jLoggerFactory]
Connecting to jdbc:hive2://<hostip>:10000/tpch_sf100_orc
18/04/03 13:41:58 [main]: ERROR jdbc.HiveConnection: Error opening session
org.apache.thrift.TApplicationException: Required field 'client_protocol' is unset! Struct:TOpenSessionReq(client_protocol:null, configuration:{set:hiveconf:hive.server2.thrift.resultset.default.fetch.size=1000, use:database=tpch_sf100_orc})
at org.apache.thrift.TApplicationException.read(TApplicationException.java:111) ~[hive-exec-2.3.0.jar:2.3.0]
at org.apache.thrift.TServiceClient.receiveBase(TServiceClient.java:79) ~[hive-exec-2.3.0.jar:2.3.0]
at org.apache.hive.service.rpc.thrift.TCLIService$Client.recv_OpenSession(TCLIService.java:168) ~[hive-exec-2.3.0.jar:2.3.0]
at org.apache.hive.service.rpc.thrift.TCLIService$Client.OpenSession(TCLIService.java:155) ~[hive-exec-2.3.0.jar:2.3.0]
at org.apache.hive.jdbc.HiveConnection.openSession(HiveConnection.java:680) [hive-jdbc-2.3.0.jar:2.3.0]
at org.apache.hive.jdbc.HiveConnection.<init>(HiveConnection.java:200) [hive-jdbc-2.3.0.jar:2.3.0]
at org.apache.hive.jdbc.HiveDriver.connect(HiveDriver.java:107) [hive-jdbc-2.3.0.jar:2.3.0]
at java.sql.DriverManager.getConnection(DriverManager.java:664) [?:1.8.0_112]
at java.sql.DriverManager.getConnection(DriverManager.java:208) [?:1.8.0_112]
at org.apache.hive.beeline.DatabaseConnection.connect(DatabaseConnection.java:145) [hive-beeline-2.3.0.jar:2.3.0]
at org.apache.hive.beeline.DatabaseConnection.getConnection(DatabaseConnection.java:209) [hive-beeline-2.3.0.jar:2.3.0]
at org.apache.hive.beeline.Commands.connect(Commands.java:1641) [hive-beeline-2.3.0.jar:2.3.0]
at org.apache.hive.beeline.Commands.connect(Commands.java:1536) [hive-beeline-2.3.0.jar:2.3.0]
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) ~[?:1.8.0_112]
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) ~[?:1.8.0_112]
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) ~[?:1.8.0_112]
at java.lang.reflect.Method.invoke(Method.java:498) ~[?:1.8.0_112]
at org.apache.hive.beeline.ReflectiveCommandHandler.execute(ReflectiveCommandHandler.java:56) [hive-beeline-2.3.0.jar:2.3.0]
at org.apache.hive.beeline.BeeLine.execCommandWithPrefix(BeeLine.java:1274) [hive-beeline-2.3.0.jar:2.3.0]
at org.apache.hive.beeline.BeeLine.dispatch(BeeLine.java:1313) [hive-beeline-2.3.0.jar:2.3.0]
at org.apache.hive.beeline.BeeLine.connectUsingArgs(BeeLine.java:867) [hive-beeline-2.3.0.jar:2.3.0]
at org.apache.hive.beeline.BeeLine.initArgs(BeeLine.java:776) [hive-beeline-2.3.0.jar:2.3.0]
at org.apache.hive.beeline.BeeLine.begin(BeeLine.java:1010) [hive-beeline-2.3.0.jar:2.3.0]
at org.apache.hive.beeline.BeeLine.mainWithInputRedirection(BeeLine.java:519) [hive-beeline-2.3.0.jar:2.3.0]
at org.apache.hive.beeline.BeeLine.main(BeeLine.java:501) [hive-beeline-2.3.0.jar:2.3.0]
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) ~[?:1.8.0_112]
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) ~[?:1.8.0_112]
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) ~[?:1.8.0_112]
at java.lang.reflect.Method.invoke(Method.java:498) ~[?:1.8.0_112]
at org.apache.hadoop.util.RunJar.run(RunJar.java:239) [hadoop-common-2.9.0.jar:?]
at org.apache.hadoop.util.RunJar.main(RunJar.java:153) [hadoop-common-2.9.0.jar:?]
18/04/03 13:41:58 [main]: WARN jdbc.HiveConnection: Failed to connect to <hostip>:10000
Error: Could not open client transport with JDBC Uri: jdbc:hive2://<hostip>:10000/tpch_sf100_orc: Could not establish connection to jdbc:hive2://<hostip>:10000/tpch_sf100_orc: Required field 'client_protocol' is unset! Struct:TOpenSessionReq(client_protocol:null, configuration:{set:hiveconf:hive.server2.thrift.resultset.default.fetch.size=1000, use:database=tpch_sf100_orc}) (state=08S01,code=0)
Beeline version 2.3.0 by Apache Hive
What else can be done to connect JMeter to Spark?
Most probably this is due to a client/server library mismatch; you need to use the same versions as on the server.
So identify which version of Hive is running on your server, download the appropriate hive-jdbc jar and all its dependencies (you can retrieve them using, for example, the Maven Dependency Plugin), and replace the ones on the JMeter classpath with the correct versions.
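A rough sketch of that retrieval step, using a throwaway pom.xml and the Maven Dependency Plugin's copy-dependencies goal (the hive-jdbc version and the JMeter lib path below are placeholders; match them to your server and installation):
# Throwaway POM listing only the JDBC driver; Maven resolves its transitive deps
cat > hive-jdbc-deps-pom.xml <<'EOF'
<project xmlns="http://maven.apache.org/POM/4.0.0">
  <modelVersion>4.0.0</modelVersion>
  <groupId>local</groupId>
  <artifactId>hive-jdbc-deps</artifactId>
  <version>1.0</version>
  <dependencies>
    <dependency>
      <groupId>org.apache.hive</groupId>
      <artifactId>hive-jdbc</artifactId>
      <version>1.2.1</version> <!-- placeholder: match the server's Hive version -->
    </dependency>
  </dependencies>
</project>
EOF
# Copy the driver and all transitive dependencies into JMeter's lib directory
mvn -f hive-jdbc-deps-pom.xml dependency:copy-dependencies \
    -DoutputDirectory=/path/to/apache-jmeter-4.0/lib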
It might be better to use a "clean" JMeter installation for this to avoid possible jar hell, so it is a good time to upgrade to JMeter 4.0.
You could also build your own "JDBC sampler" and add/adapt just a few lines of code from the original to make it work. I had to do that for Hive and Phoenix support 4 years ago and it worked; it required a timeout method that was not implemented back then.
Here are the sources:
https://github.com/csalperwyck/JMeterJDBCSamplerWithOutTimeOut

SPARK_RPC_CLIENT_CONNECT_TIMEOUT in running Hive On Spark - YARN Cluster mode

I am using HDP 2.3 and trying to use Spark (1.3.1) as the execution engine for running Hive queries.
The spark-assembly jar is also available in the hive/lib folder.
I am able to run the query with spark.master=local but face the issue below when using spark.master=yarn-cluster.
Command run:
hive -e "set hive.execution.engine=spark; set
spark.master=yarn-cluster; select count(*) from db_name.table_name;"
Output:
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/usr/hdp/2.3.0.0-2557/hadoop/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/usr/hdp/2.3.0.0-2557/hive/lib/spark-assembly-1.3.1-hadoop2.6.0.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/root/downloads/machine/spark/lib/spark-assembly-1.3.1-hadoop2.6.0.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
WARNING: Use "yarn jar" to launch YARN applications.
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/usr/hdp/2.3.0.0-2557/hadoop/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/usr/hdp/2.3.0.0-2557/hive/lib/spark-assembly-1.3.1-hadoop2.6.0.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/root/downloads/machine/spark/lib/spark-assembly-1.3.1-hadoop2.6.0.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
Logging initialized using configuration in file:/etc/hive/2.3.0.0-2557/0/hive-log4j.properties
Query ID = root_20150909201120_a67d5ca3-36df-43fe-894a-3645585eec7a
Total jobs = 1
Launching Job 1 out of 1
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
set mapreduce.job.reduces=<number>
Failed to execute spark task, with exception 'org.apache.hadoop.hive.ql.metadata.HiveException(Failed to create spark client.)'
FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.spark.SparkTask
YARN log of the application:
15/09/09 19:42:27 INFO yarn.ApplicationMaster: Waiting for spark context initialization ...
15/09/09 19:42:27 INFO client.RemoteDriver: Connecting to: sandbox.hortonworks.com:59941
15/09/09 19:42:27 ERROR yarn.ApplicationMaster: User class threw exception: SPARK_RPC_CLIENT_CONNECT_TIMEOUT
java.lang.NoSuchFieldError: SPARK_RPC_CLIENT_CONNECT_TIMEOUT
at org.apache.hive.spark.client.rpc.RpcConfiguration.<clinit>(RpcConfiguration.java:46)
at org.apache.hive.spark.client.RemoteDriver.<init>(RemoteDriver.java:146)
at org.apache.hive.spark.client.RemoteDriver.main(RemoteDriver.java:556)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:480)
15/09/09 19:42:27 INFO yarn.ApplicationMaster: Final app status: FAILED, exitCode: 15, (reason: User class threw exception: SPARK_RPC_CLIENT_CONNECT_TIMEOUT)
15/09/09 19:42:37 ERROR yarn.ApplicationMaster: SparkContext did not initialize after waiting for 100000 ms. Please check earlier log output for errors. Failing the application.
15/09/09 19:42:37 INFO yarn.ApplicationMaster: Unregistering ApplicationMaster with FAILED (diag message: User class threw exception: SPARK_RPC_CLIENT_CONNECT_TIMEOUT)
15/09/09 19:42:37 INFO yarn.ApplicationMaster: Deleting staging directory .sparkStaging/application_1441817597849_0008
Any help on debugging the issue is much appreciated.
I don't think queries can be executed in yarn-cluster mode.
You can run interactive queries in local and yarn-client mode only.
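For example, the query from the question could be re-run with spark.master set to yarn-client instead (db_name.table_name remains the original placeholder):
hive -e "set hive.execution.engine=spark; set spark.master=yarn-client; select count(*) from db_name.table_name;"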

Step by step running apache Nutch 2.2.1

I have configured plugin.folders in nutch-default.xml, but when I run Nutch via Eclipse & NetBeans with:
Main class: org.apache.nutch.crawl.InjectorJob
Arguments: /MY_DATA_SOURCE/HR_PROJECTS/JSearch/Apache_Nutch/RELEASE/release-2.2.1/urls
VM Options: -Dhadoop.log.dir=logs -Dhadoop.log.file=hadoop.log
I get errors like the ones below:
cd /MY_DATA_SOURCE/HR_PROJECTS/JSearch/Apache_Nutch/RELEASE/release-2.2.1; JAVA_HOME=/Library/Java/JavaVirtualMachines/jdk1.7.0_25.jdk/Contents/Home "/Applications/NetBeans/NetBeans 7.3.app/Contents/Resources/NetBeans/java/maven/bin/mvn" "-Dexec.args=-Dhadoop.log.dir=logs -Dhadoop.log.file=hadoop.log -classpath %classpath org.apache.nutch.crawl.InjectorJob /MY_DATA_SOURCE/HR_PROJECTS/JSearch/Apache_Nutch/RELEASE/release-2.2.1/urls" -Dexec.executable=/Library/Java/JavaVirtualMachines/jdk1.7.0_25.jdk/Contents/Home/bin/java process-classes org.codehaus.mojo:exec-maven-plugin:1.2.1:exec
Scanning for projects...
------------------------------------------------------------------------
Building Apache Nutch 2.2.1
------------------------------------------------------------------------
[resources:resources]
[debug] execute contextualize
Using platform encoding (US-ASCII actually) to copy filtered resources, i.e. build is platform dependent!
skip non existing resourceDirectory /MY_DATA_SOURCE/HR_PROJECTS/JSearch/Apache_Nutch/RELEASE/release-2.2.1/src/main/resources
[compiler:compile]
Nothing to compile - all classes are up to date
[exec:exec]
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/Users/hung/.m2/repository/org/slf4j/slf4j-log4j12/1.6.1/slf4j-log4j12-1.6.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/Users/hung/.m2/repository/org/slf4j/slf4j-jdk14/1.6.1/slf4j-jdk14-1.6.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/Users/hung/.m2/repository/org/slf4j/slf4j-simple/1.6.1/slf4j-simple-1.6.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
13/07/06 08:55:18 INFO crawl.InjectorJob: InjectorJob: starting at 2013-07-06 08:55:18
13/07/06 08:55:18 INFO crawl.InjectorJob: InjectorJob: Injecting urlDir: /MY_DATA_SOURCE/HR_PROJECTS/JSearch/Apache_Nutch/RELEASE/release-2.2.1/urls
2013-07-06 08:55:18.420 java[1206:1c03] Unable to load realm info from SCDynamicStore
13/07/06 08:55:18 WARN store.DataStoreFactory: gora.properties not found, properties will be empty.
13/07/06 08:55:18 WARN store.DataStoreFactory: gora.properties not found, properties will be empty.
13/07/06 08:55:19 INFO crawl.InjectorJob: InjectorJob: Using class org.apache.gora.sql.store.SqlStore as the Gora storage class.
13/07/06 08:55:19 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
13/07/06 08:55:19 WARN mapred.JobClient: No job jar file set. User classes may not be found. See JobConf(Class) or JobConf#setJar(String).
13/07/06 08:55:19 INFO input.FileInputFormat: Total input paths to process : 1
13/07/06 08:55:19 WARN snappy.LoadSnappy: Snappy native library not loaded
13/07/06 08:55:19 INFO mapred.JobClient: Running job: job_local226390157_0001
13/07/06 08:55:19 INFO mapred.LocalJobRunner: Waiting for map tasks
13/07/06 08:55:19 INFO mapred.LocalJobRunner: Starting task: attempt_local226390157_0001_m_000000_0
13/07/06 08:55:19 INFO mapred.Task: Using ResourceCalculatorPlugin : null
13/07/06 08:55:19 INFO mapred.MapTask: Processing split: file:/MY_DATA_SOURCE/HR_PROJECTS/JSearch/Apache_Nutch/RELEASE/release-2.2.1/urls/seed.txt:0+20
13/07/06 08:55:19 WARN store.DataStoreFactory: gora.properties not found, properties will be empty.
13/07/06 08:55:19 INFO mapreduce.GoraRecordWriter: gora.buffer.write.limit = 10000
13/07/06 08:55:19 INFO mapred.LocalJobRunner: Map task executor complete.
13/07/06 08:55:19 WARN mapred.FileOutputCommitter: Output path is null in cleanup
13/07/06 08:55:19 WARN mapred.LocalJobRunner: job_local226390157_0001
java.lang.Exception: java.lang.IllegalArgumentException: plugin.folders is not defined
at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:354)
Caused by: java.lang.IllegalArgumentException: plugin.folders is not defined
at org.apache.nutch.plugin.PluginManifestParser.parsePluginFolder(PluginManifestParser.java:78)
at org.apache.nutch.plugin.PluginRepository.<init>(PluginRepository.java:69)
at org.apache.nutch.plugin.PluginRepository.get(PluginRepository.java:97)
at org.apache.nutch.net.URLNormalizers.<init>(URLNormalizers.java:117)
at org.apache.nutch.crawl.InjectorJob$UrlMapper.setup(InjectorJob.java:99)
at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:142)
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:764)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:364)
at org.apache.hadoop.mapred.LocalJobRunner$Job$MapTaskRunnable.run(LocalJobRunner.java:223)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334)
at java.util.concurrent.FutureTask.run(FutureTask.java:166)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:724)
13/07/06 08:55:20 INFO mapred.JobClient: map 0% reduce 0%
13/07/06 08:55:20 INFO mapred.JobClient: Job complete: job_local226390157_0001
13/07/06 08:55:20 INFO mapred.JobClient: Counters: 0
13/07/06 08:55:20 ERROR crawl.InjectorJob: InjectorJob: java.lang.RuntimeException: job failed: name=inject /MY_DATA_SOURCE/HR_PROJECTS/JSearch/Apache_Nutch/RELEASE/release-2.2.1/urls, jobid=job_local226390157_0001
at org.apache.nutch.util.NutchJob.waitForCompletion(NutchJob.java:54)
at org.apache.nutch.crawl.InjectorJob.run(InjectorJob.java:233)
at org.apache.nutch.crawl.InjectorJob.inject(InjectorJob.java:251)
at org.apache.nutch.crawl.InjectorJob.run(InjectorJob.java:273)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
at org.apache.nutch.crawl.InjectorJob.main(InjectorJob.java:282)
------------------------------------------------------------------------
BUILD FAILURE
------------------------------------------------------------------------
Total time: 6.572s
Finished at: Sat Jul 06 08:55:20 ICT 2013
Final Memory: 11M/236M
------------------------------------------------------------------------
Failed to execute goal org.codehaus.mojo:exec-maven-plugin:1.2.1:exec (default-cli) on project nutch: Command execution failed. Process exited with an error: 255 (Exit value: 255) -> [Help 1]
To see the full stack trace of the errors, re-run Maven with the -e switch.
Re-run Maven using the -X switch to enable full debug logging.
For more information about the errors and possible solutions, please read the following articles:
[Help 1] http://cwiki.apache.org/confluence/display/MAVEN/MojoExecutionException
The error message clearly indicates the problem (and where to look for a solution):
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/Users/hung/.m2/repository/org/slf4j/slf4j-log4j12/1.6.1/slf4j-log4j12-1.6.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/Users/hung/.m2/repository/org/slf4j/slf4j-jdk14/1.6.1/slf4j-jdk14-1.6.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/Users/hung/.m2/repository/org/slf4j/slf4j-simple/1.6.1/slf4j-simple-1.6.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
