pyspark 2.0 throws AlreadyExistsException(message:Database default already exists) when interacting with hive - apache-spark

I've just upgraded to Spark 2.0.0 from 1.3.1 and wrote some simple code to interact with Hive (1.2.1) using Spark SQL. I've put hive-site.xml into the Spark conf directory, and I get the expected results from the SQL, BUT it throws a weird AlreadyExistsException(message:Database default already exists). How do I ignore this?
Code:
from pyspark.sql import SparkSession

ss = SparkSession.builder.appName("test").master("local") \
    .config("spark.ui.port", "4041") \
    .enableHiveSupport() \
    .getOrCreate()
ss.sparkContext.setLogLevel("INFO")
ss.sql("show tables").show()
Log:
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel).
16/08/08 19:41:22 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
16/08/08 19:41:24 INFO execution.SparkSqlParser: Parsing command: show tables
16/08/08 19:41:25 INFO hive.HiveUtils: Initializing HiveMetastoreConnection version 1.2.1 using Spark classes.
16/08/08 19:41:26 INFO metastore.HiveMetaStore: 0: Opening raw store with implemenation class:org.apache.hadoop.hive.metastore.ObjectStore
16/08/08 19:41:26 INFO metastore.ObjectStore: ObjectStore, initialize called
16/08/08 19:41:26 INFO DataNucleus.Persistence: Property hive.metastore.integral.jdo.pushdown unknown - will be ignored
16/08/08 19:41:26 INFO DataNucleus.Persistence: Property datanucleus.cache.level2 unknown - will be ignored
16/08/08 19:41:26 INFO metastore.ObjectStore: Setting MetaStore object pin classes with hive.metastore.cache.pinobjtypes="Table,StorageDescriptor,SerDeInfo,Partition,Database,Type,FieldSchema,Order"
16/08/08 19:41:27 INFO DataNucleus.Datastore: The class "org.apache.hadoop.hive.metastore.model.MFieldSchema" is tagged as "embedded-only" so does not have its own datastore table.
16/08/08 19:41:27 INFO DataNucleus.Datastore: The class "org.apache.hadoop.hive.metastore.model.MOrder" is tagged as "embedded-only" so does not have its own datastore table.
16/08/08 19:41:27 INFO DataNucleus.Datastore: The class "org.apache.hadoop.hive.metastore.model.MFieldSchema" is tagged as "embedded-only" so does not have its own datastore table.
16/08/08 19:41:27 INFO DataNucleus.Datastore: The class "org.apache.hadoop.hive.metastore.model.MOrder" is tagged as "embedded-only" so does not have its own datastore table.
16/08/08 19:41:27 INFO DataNucleus.Query: Reading in results for query "org.datanucleus.store.rdbms.query.SQLQuery#0" since the connection used is closing
16/08/08 19:41:27 INFO metastore.MetaStoreDirectSql: Using direct SQL, underlying DB is MYSQL
16/08/08 19:41:27 INFO metastore.ObjectStore: Initialized ObjectStore
16/08/08 19:41:27 INFO metastore.HiveMetaStore: Added admin role in metastore
16/08/08 19:41:27 INFO metastore.HiveMetaStore: Added public role in metastore
16/08/08 19:41:27 INFO metastore.HiveMetaStore: No user is added in admin role, since config is empty
16/08/08 19:41:27 INFO metastore.HiveMetaStore: 0: get_all_databases
16/08/08 19:41:27 INFO HiveMetaStore.audit: ugi=felix ip=unknown-ip-addr cmd=get_all_databases
16/08/08 19:41:28 INFO metastore.HiveMetaStore: 0: get_functions: db=default pat=*
16/08/08 19:41:28 INFO HiveMetaStore.audit: ugi=felix ip=unknown-ip-addr cmd=get_functions: db=default pat=*
16/08/08 19:41:28 INFO DataNucleus.Datastore: The class "org.apache.hadoop.hive.metastore.model.MResourceUri" is tagged as "embedded-only" so does not have its own datastore table.
16/08/08 19:41:28 INFO session.SessionState: Created local directory: /usr/local/Cellar/hive/1.2.1/libexec/conf/tmp/3fbc3578-fdeb-40a9-8469-7c851cb3733c_resources
16/08/08 19:41:28 INFO session.SessionState: Created HDFS directory: /tmp/hive/felix/3fbc3578-fdeb-40a9-8469-7c851cb3733c
16/08/08 19:41:28 INFO session.SessionState: Created local directory: /usr/local/Cellar/hive/1.2.1/libexec/conf/tmp/felix/3fbc3578-fdeb-40a9-8469-7c851cb3733c
16/08/08 19:41:28 INFO session.SessionState: Created HDFS directory: /tmp/hive/felix/3fbc3578-fdeb-40a9-8469-7c851cb3733c/_tmp_space.db
16/08/08 19:41:28 INFO client.HiveClientImpl: Warehouse location for Hive client (version 1.2.1) is /user/hive/warehouse
16/08/08 19:41:28 INFO session.SessionState: Created local directory: /usr/local/Cellar/hive/1.2.1/libexec/conf/tmp/8eaa63ec-9710-499f-bd50-6625bf4459f5_resources
16/08/08 19:41:28 INFO session.SessionState: Created HDFS directory: /tmp/hive/felix/8eaa63ec-9710-499f-bd50-6625bf4459f5
16/08/08 19:41:28 INFO session.SessionState: Created local directory: /usr/local/Cellar/hive/1.2.1/libexec/conf/tmp/felix/8eaa63ec-9710-499f-bd50-6625bf4459f5
16/08/08 19:41:28 INFO session.SessionState: Created HDFS directory: /tmp/hive/felix/8eaa63ec-9710-499f-bd50-6625bf4459f5/_tmp_space.db
16/08/08 19:41:28 INFO client.HiveClientImpl: Warehouse location for Hive client (version 1.2.1) is /user/hive/warehouse
16/08/08 19:41:28 INFO metastore.HiveMetaStore: 0: create_database: Database(name:default, description:default database, locationUri:hdfs://localhost:9900/user/hive/warehouse, parameters:{})
16/08/08 19:41:28 INFO HiveMetaStore.audit: ugi=felix ip=unknown-ip-addr cmd=create_database: Database(name:default, description:default database, locationUri:hdfs://localhost:9900/user/hive/warehouse, parameters:{})
16/08/08 19:41:28 ERROR metastore.RetryingHMSHandler: AlreadyExistsException(message:Database default already exists)
at org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.create_database(HiveMetaStore.java:891)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:497)
at org.apache.hadoop.hive.metastore.RetryingHMSHandler.invoke(RetryingHMSHandler.java:107)
at com.sun.proxy.$Proxy22.create_database(Unknown Source)
at org.apache.hadoop.hive.metastore.HiveMetaStoreClient.createDatabase(HiveMetaStoreClient.java:644)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:497)
at org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.invoke(RetryingMetaStoreClient.java:156)
at com.sun.proxy.$Proxy23.createDatabase(Unknown Source)
at org.apache.hadoop.hive.ql.metadata.Hive.createDatabase(Hive.java:306)
at org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$createDatabase$1.apply$mcV$sp(HiveClientImpl.scala:291)
at org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$createDatabase$1.apply(HiveClientImpl.scala:291)
at org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$createDatabase$1.apply(HiveClientImpl.scala:291)
at org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$withHiveState$1.apply(HiveClientImpl.scala:262)
at org.apache.spark.sql.hive.client.HiveClientImpl.liftedTree1$1(HiveClientImpl.scala:209)
at org.apache.spark.sql.hive.client.HiveClientImpl.retryLocked(HiveClientImpl.scala:208)
at org.apache.spark.sql.hive.client.HiveClientImpl.withHiveState(HiveClientImpl.scala:251)
at org.apache.spark.sql.hive.client.HiveClientImpl.createDatabase(HiveClientImpl.scala:290)
at org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$createDatabase$1.apply$mcV$sp(HiveExternalCatalog.scala:99)
at org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$createDatabase$1.apply(HiveExternalCatalog.scala:99)
at org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$createDatabase$1.apply(HiveExternalCatalog.scala:99)
at org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:72)
at org.apache.spark.sql.hive.HiveExternalCatalog.createDatabase(HiveExternalCatalog.scala:98)
at org.apache.spark.sql.catalyst.catalog.SessionCatalog.createDatabase(SessionCatalog.scala:147)
at org.apache.spark.sql.catalyst.catalog.SessionCatalog.<init>(SessionCatalog.scala:89)
at org.apache.spark.sql.hive.HiveSessionCatalog.<init>(HiveSessionCatalog.scala:51)
at org.apache.spark.sql.hive.HiveSessionState.catalog$lzycompute(HiveSessionState.scala:49)
at org.apache.spark.sql.hive.HiveSessionState.catalog(HiveSessionState.scala:48)
at org.apache.spark.sql.hive.HiveSessionState$$anon$1.<init>(HiveSessionState.scala:63)
at org.apache.spark.sql.hive.HiveSessionState.analyzer$lzycompute(HiveSessionState.scala:63)
at org.apache.spark.sql.hive.HiveSessionState.analyzer(HiveSessionState.scala:62)
at org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:49)
at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:64)
at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:582)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:497)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:237)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:280)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:128)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:211)
at java.lang.Thread.run(Thread.java:745)
16/08/08 19:41:28 INFO metastore.HiveMetaStore: 0: get_database: default
16/08/08 19:41:28 INFO HiveMetaStore.audit: ugi=felix ip=unknown-ip-addr cmd=get_database: default
16/08/08 19:41:28 INFO metastore.HiveMetaStore: 0: get_database: default
16/08/08 19:41:28 INFO HiveMetaStore.audit: ugi=felix ip=unknown-ip-addr cmd=get_database: default
16/08/08 19:41:28 INFO metastore.HiveMetaStore: 0: get_tables: db=default pat=*
16/08/08 19:41:28 INFO HiveMetaStore.audit: ugi=felix ip=unknown-ip-addr cmd=get_tables: db=default pat=*
16/08/08 19:41:28 INFO spark.SparkContext: Starting job: showString at NativeMethodAccessorImpl.java:-2
16/08/08 19:41:28 INFO scheduler.DAGScheduler: Got job 0 (showString at NativeMethodAccessorImpl.java:-2) with 1 output partitions
16/08/08 19:41:28 INFO scheduler.DAGScheduler: Final stage: ResultStage 0 (showString at NativeMethodAccessorImpl.java:-2)
16/08/08 19:41:28 INFO scheduler.DAGScheduler: Parents of final stage: List()
16/08/08 19:41:28 INFO scheduler.DAGScheduler: Missing parents: List()
16/08/08 19:41:28 INFO scheduler.DAGScheduler: Submitting ResultStage 0 (MapPartitionsRDD[2] at showString at NativeMethodAccessorImpl.java:-2), which has no missing parents
16/08/08 19:41:28 INFO memory.MemoryStore: Block broadcast_0 stored as values in memory (estimated size 3.9 KB, free 366.3 MB)
16/08/08 19:41:29 INFO memory.MemoryStore: Block broadcast_0_piece0 stored as bytes in memory (estimated size 2.4 KB, free 366.3 MB)
16/08/08 19:41:29 INFO storage.BlockManagerInfo: Added broadcast_0_piece0 in memory on 172.68.80.25:58224 (size: 2.4 KB, free: 366.3 MB)
16/08/08 19:41:29 INFO spark.SparkContext: Created broadcast 0 from broadcast at DAGScheduler.scala:1012
16/08/08 19:41:29 INFO scheduler.DAGScheduler: Submitting 1 missing tasks from ResultStage 0 (MapPartitionsRDD[2] at showString at NativeMethodAccessorImpl.java:-2)
16/08/08 19:41:29 INFO scheduler.TaskSchedulerImpl: Adding task set 0.0 with 1 tasks
16/08/08 19:41:29 INFO scheduler.TaskSetManager: Starting task 0.0 in stage 0.0 (TID 0, localhost, partition 0, PROCESS_LOCAL, 5827 bytes)
16/08/08 19:41:29 INFO executor.Executor: Running task 0.0 in stage 0.0 (TID 0)
16/08/08 19:41:29 INFO codegen.CodeGenerator: Code generated in 152.42807 ms
16/08/08 19:41:29 INFO executor.Executor: Finished task 0.0 in stage 0.0 (TID 0). 1279 bytes result sent to driver
16/08/08 19:41:29 INFO scheduler.TaskSetManager: Finished task 0.0 in stage 0.0 (TID 0) in 275 ms on localhost (1/1)
16/08/08 19:41:29 INFO scheduler.TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks have all completed, from pool
16/08/08 19:41:29 INFO scheduler.DAGScheduler: ResultStage 0 (showString at NativeMethodAccessorImpl.java:-2) finished in 0.288 s
16/08/08 19:41:29 INFO scheduler.DAGScheduler: Job 0 finished: showString at NativeMethodAccessorImpl.java:-2, took 0.538913 s
16/08/08 19:41:29 INFO codegen.CodeGenerator: Code generated in 13.588415 ms
+-------------------+-----------+
| tableName|isTemporary|
+-------------------+-----------+
| app_visit_log| false|
| cms_article| false|
| p4| false|
| p_bak| false|
+-------------------+-----------+
16/08/08 19:41:29 INFO spark.SparkContext: Invoking stop() from shutdown hook
PS: everything works fine when I test the same in Java.
Any help will be highly appreciated.

As the log shows, this message does not mean anything bad happened: on startup Spark's Hive catalog issues a create_database call for the default database, and the metastore simply reports that it already exists. The exception is caught internally and your query still succeeds, as the table listing shows; ideally these exception logs would not be displayed at all when the default database already exists.
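If you just want to silence the message, one possible workaround (a sketch, assuming the stock log4j 1.2 setup with a conf/log4j.properties file in your Spark install) is to raise the log level of the handler that prints it:

# conf/log4j.properties - hide the harmless AlreadyExistsException trace
log4j.logger.org.apache.hadoop.hive.metastore.RetryingHMSHandler=FATAL

This only hides the log line; the metastore behaviour is unchanged.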

Assuming you don't have an existing Hive warehouse yet, try setting the following in spark-defaults.conf and restarting the Spark master:
spark.sql.warehouse.dir=file:///usr/lib/spark/..... (spark install dir)
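The same setting can also be applied per application when building the session; a minimal sketch (the warehouse path here is just an example):

from pyspark.sql import SparkSession

ss = SparkSession.builder.appName("test").master("local") \
    .config("spark.sql.warehouse.dir", "file:///tmp/spark-warehouse") \
    .enableHiveSupport() \
    .getOrCreate()

Note that spark.sql.warehouse.dir is only honoured if it is set before the first SparkSession is created.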

Related

Permission denied error when setting up local Spark instance and running pyspark

I am setting up a local Spark instance on Windows to use with PySpark as described in this guide (but with spark-3.0.0 / hadoop 2.7 instead): https://phoenixnap.com/kb/install-spark-on-windows-10.
I can start up Spark with:
C:\Spark\spark-3.0.0-bin-hadoop2.7\bin>spark-shell.cmd
and connect to it with http://localhost:4040/ in my browser (I see the Spark GUI).
But when I run the Python pyspark example with
C:\Spark\spark-3.0.0-bin-hadoop2.7\examples>run-example SparkPi
it throws a Permission Denied error, as in this trace:
21/03/08 10:51:03 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
21/03/08 10:51:04 INFO SparkContext: Running Spark version 3.0.0
21/03/08 10:51:04 INFO ResourceUtils: ==============================================================
21/03/08 10:51:04 INFO ResourceUtils: Resources for spark.driver:
21/03/08 10:51:04 INFO ResourceUtils: ==============================================================
21/03/08 10:51:04 INFO SparkContext: Submitted application: Spark Pi
21/03/08 10:51:04 INFO SecurityManager: Changing view acls to: #####
21/03/08 10:51:04 INFO SecurityManager: Changing modify acls to: #####
21/03/08 10:51:04 INFO SecurityManager: Changing view acls groups to:
21/03/08 10:51:04 INFO SecurityManager: Changing modify acls groups to:
21/03/08 10:51:04 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(#####); groups with view permissions: Set(); users with modify permissions: Set(#####); groups with modify permissions: Set()
21/03/08 10:51:05 INFO Utils: Successfully started service 'sparkDriver' on port 63213.
21/03/08 10:51:05 INFO SparkEnv: Registering MapOutputTracker
21/03/08 10:51:05 INFO SparkEnv: Registering BlockManagerMaster
21/03/08 10:51:05 INFO BlockManagerMasterEndpoint: Using org.apache.spark.storage.DefaultTopologyMapper for getting topology information
21/03/08 10:51:05 INFO BlockManagerMasterEndpoint: BlockManagerMasterEndpoint up
21/03/08 10:51:05 INFO SparkEnv: Registering BlockManagerMasterHeartbeat
21/03/08 10:51:05 INFO DiskBlockManager: Created local directory at C:\Users\#####\AppData\Local\Temp\blockmgr-dce03954-27a7-484d-8e54-f552b21433f7
21/03/08 10:51:05 INFO MemoryStore: MemoryStore started with capacity 366.3 MiB
21/03/08 10:51:05 INFO SparkEnv: Registering OutputCommitCoordinator
21/03/08 10:51:05 INFO Utils: Successfully started service 'SparkUI' on port 4040.
21/03/08 10:51:05 INFO SparkUI: Bound SparkUI to 0.0.0.0, and started at http://WORKSTATION.DOMAIN.EXT:4040
21/03/08 10:51:05 INFO SparkContext: Added JAR file:///C:/Spark/spark-3.0.0-bin-hadoop2.7/examples/jars/scopt_2.12-3.7.1.jar at spark://WORKSTATION.DOMAIN.EXT:63213/jars/scopt_2.12-3.7.1.jar with timestamp 1615197065578
21/03/08 10:51:05 INFO SparkContext: Added JAR file:///C:/Spark/spark-3.0.0-bin-hadoop2.7/examples/jars/spark-examples_2.12-3.0.0.jar at spark://WORKSTATION.DOMAIN.EXT:63213/jars/spark-examples_2.12-3.0.0.jar with timestamp 1615197065579
21/03/08 10:51:05 INFO Executor: Starting executor ID driver on host WORKSTATION.DOMAIN.EXT
21/03/08 10:51:05 INFO Utils: Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 63260.
21/03/08 10:51:05 INFO NettyBlockTransferService: Server created on WORKSTATION.DOMAIN.EXT:63260
21/03/08 10:51:05 INFO BlockManager: Using org.apache.spark.storage.RandomBlockReplicationPolicy for block replication policy
21/03/08 10:51:05 INFO BlockManagerMaster: Registering BlockManager BlockManagerId(driver, NLLR4000250910.solon.prd, 63260, None)
21/03/08 10:51:05 INFO BlockManagerMasterEndpoint: Registering block manager NLLR4000250910.solon.prd:63260 with 366.3 MiB RAM, BlockManagerId(driver, WORKSTATION.DOMAIN.EXT, 63260, None)
21/03/08 10:51:05 INFO BlockManagerMaster: Registered BlockManager BlockManagerId(driver, NLLR4000250910.solon.prd, 63260, None)
21/03/08 10:51:05 INFO BlockManager: Initialized BlockManager: BlockManagerId(driver, NLLR4000250910.solon.prd, 63260, None)
21/03/08 10:51:06 INFO SparkContext: Starting job: reduce at SparkPi.scala:38
21/03/08 10:51:06 INFO DAGScheduler: Got job 0 (reduce at SparkPi.scala:38) with 2 output partitions
21/03/08 10:51:06 INFO DAGScheduler: Final stage: ResultStage 0 (reduce at SparkPi.scala:38)
21/03/08 10:51:06 INFO DAGScheduler: Parents of final stage: List()
21/03/08 10:51:06 INFO DAGScheduler: Missing parents: List()
21/03/08 10:51:06 INFO DAGScheduler: Submitting ResultStage 0 (MapPartitionsRDD[1] at map at SparkPi.scala:34), which has no missing parents
21/03/08 10:51:06 INFO MemoryStore: Block broadcast_0 stored as values in memory (estimated size 3.1 KiB, free 366.3 MiB)
21/03/08 10:51:06 INFO MemoryStore: Block broadcast_0_piece0 stored as bytes in memory (estimated size 1816.0 B, free 366.3 MiB)
21/03/08 10:51:06 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on WORKSTATION.DOMAIN.EXT:63260 (size: 1816.0 B, free: 366.3 MiB)
21/03/08 10:51:06 INFO SparkContext: Created broadcast 0 from broadcast at DAGScheduler.scala:1200
21/03/08 10:51:06 INFO DAGScheduler: Submitting 2 missing tasks from ResultStage 0 (MapPartitionsRDD[1] at map at SparkPi.scala:34) (first 15 tasks are for partitions Vector(0, 1))
21/03/08 10:51:06 INFO TaskSchedulerImpl: Adding task set 0.0 with 2 tasks
21/03/08 10:51:06 INFO TaskSetManager: Starting task 0.0 in stage 0.0 (TID 0, WORKSTATION.DOMAIN.EXT, executor driver, partition 0, PROCESS_LOCAL, 7393 bytes)
21/03/08 10:51:06 INFO TaskSetManager: Starting task 1.0 in stage 0.0 (TID 1, WORKSTATION.DOMAIN.EXT, executor driver, partition 1, PROCESS_LOCAL, 7393 bytes)
21/03/08 10:51:06 INFO Executor: Running task 1.0 in stage 0.0 (TID 1)
21/03/08 10:51:06 INFO Executor: Running task 0.0 in stage 0.0 (TID 0)
21/03/08 10:51:06 INFO Executor: Fetching spark://WORKSTATION.DOMAIN.EXT:63213/jars/spark-examples_2.12-3.0.0.jar with timestamp 1615197065579
21/03/08 10:51:06 ERROR Utils: Aborting task
java.io.IOException: Failed to connect to WORKSTATION.DOMAIN.EXT/192.168.#.#:63213
at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:253)
at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:195)
at org.apache.spark.rpc.netty.NettyRpcEnv.downloadClient(NettyRpcEnv.scala:392)
at org.apache.spark.rpc.netty.NettyRpcEnv.$anonfun$openChannel$4(NettyRpcEnv.scala:360)
at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1411)
at org.apache.spark.rpc.netty.NettyRpcEnv.openChannel(NettyRpcEnv.scala:359)
at org.apache.spark.util.Utils$.doFetchFile(Utils.scala:719)
at org.apache.spark.util.Utils$.fetchFile(Utils.scala:535)
at org.apache.spark.executor.Executor.$anonfun$updateDependencies$7(Executor.scala:869)
at org.apache.spark.executor.Executor.$anonfun$updateDependencies$7$adapted(Executor.scala:860)
at scala.collection.TraversableLike$WithFilter.$anonfun$foreach$1(TraversableLike.scala:877)
at scala.collection.mutable.HashMap.$anonfun$foreach$1(HashMap.scala:149)
at scala.collection.mutable.HashTable.foreachEntry(HashTable.scala:237)
at scala.collection.mutable.HashTable.foreachEntry$(HashTable.scala:230)
at scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:44)
at scala.collection.mutable.HashMap.foreach(HashMap.scala:149)
at scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:876)
at org.apache.spark.executor.Executor.org$apache$spark$executor$Executor$$updateDependencies(Executor.scala:860)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:404)
at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
at java.lang.Thread.run(Unknown Source)
Caused by: io.netty.channel.AbstractChannel$AnnotatedSocketException: Permission denied: no further information: WORKSTATION.DOMAIN.EXT/192.168.#.#:63213
Caused by: java.net.SocketException: Permission denied: no further information
at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
at sun.nio.ch.SocketChannelImpl.finishConnect(Unknown Source)
at io.netty.channel.socket.nio.NioSocketChannel.doFinishConnect(NioSocketChannel.java:330)
at io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:334)
at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:702)
at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:650)
at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:576)
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:493)
at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:989)
at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
at java.lang.Thread.run(Unknown Source)
[snip]
When running it on a different machine with seemingly the same config where it works fine, I get this trace on the part where the Exception is thrown on the other trace:
[snip]
21/03/08 08:00:22 INFO Executor: Running task 0.0 in stage 0.0 (TID 0)
21/03/08 08:00:22 INFO Executor: Running task 1.0 in stage 0.0 (TID 1)
21/03/08 08:00:22 INFO Executor: Fetching spark://WORKSTATION.DOMAIN.EXT:63646/jars/spark-examples_2.12-3.0.0.jar with timestamp 1615186820489
21/03/08 08:00:22 INFO TransportClientFactory: Successfully created connection to WORKSTATION.DOMAIN.EXT/10.121.#.#:63646 after 86 ms (0 ms spent in bootstraps)
21/03/08 08:00:22 INFO Utils: Fetching spark://WORKSTATION.DOMAIN.EXT:63646/jars/spark-examples_2.12-3.0.0.jar to C:\Users\#####\AppData\Local\Temp\spark-54a13d9f-9064-4f34-ba81-af49b18d9a0c\userFiles-24c3eabc-02a4-4aca-8abb-424431c6442f\fetchFileTemp5258763437798623210.tmp
21/03/08 08:00:24 INFO Executor: Adding file:/C:/Users/#####/AppData/Local/Temp/spark-54a13d9f-9064-4f34-ba81-af49b18d9a0c/userFiles-24c3eabc-02a4-4aca-8abb-424431c6442f/spark-examples_2.12-3.0.0.jar to class loader
[snip]
At first it seemed like a firewall issue, but adding the executing java.exe as an exception to the firewall didn't solve it.
Does anyone know what I should try next to get this issue resolved?
Finally I was able to solve it by setting SPARK_LOCAL_IP to localhost in my environment variables: go to your Windows environment variables and set SPARK_LOCAL_IP=localhost.
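As a per-script alternative, here is a sketch of the same fix from Python (the variable must be set before PySpark launches the JVM; localhost is the value that worked above):

import os
os.environ["SPARK_LOCAL_IP"] = "localhost"  # set before the SparkSession/JVM starts

from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("local-test").master("local[*]").getOrCreate()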

FileNotFoundException on submitting Spark Jobs to remote

I've created an environment with 3 Docker containers: one for Airflow using the puckel/docker-airflow image (with Spark and Hadoop additionally installed), and two that act as the Spark master and worker (built from the gettyimages/spark Docker image). All 3 containers are connected via a bridge network, so they can communicate with each other.
What I'm trying to do next is submit a Spark job from the Airflow container to the Spark cluster (master).
As an initial example, I'm using the wordcount sample script. I created a sample.txt file in the Airflow container at path usr/local/airflow/sample.txt. I bashed into the Airflow container and used the command below to run wordcount.py against the Spark master, at the IP I found after inspecting the bridge network.
spark-submit --master spark://ipaddress:7077 --files usr/local/airflow/sample.txt /opt/spark-2.4.1/examples/src/main/python/wordcount.py sample.txt
After submitting the script, I can see from the logs that a connection is established with the master (from the Airflow container) and that the file specified by --files is copied to the master and worker, but then it errors out saying,
java.io.FileNotFoundException: File file:/usr/local/airflow/sample.txt does not exist
As per my understanding (which could be wrong), when we specify files to copy with --files, they can be accessed directly via the file name (sample.txt in my case). So what I'm trying to figure out is: if the job has been submitted and the file has been copied, why is it searching in the location file:/usr/local/airflow/sample.txt? How do I make it refer to the correct path?
I apologize as this question has been asked a couple of times, but I've read all the related questions on Stack Overflow and I'm still unable to resolve this. I'd really appreciate y'all's help on this.
Thanks.
The full log below,
user#machine:/usr/local/airflow# spark-submit --master spark://172.22.0.2:7077 --files sample.txt /opt/spark-2.4.1/examples/src/main/python/wordcount.py ./sample.txt
20/07/25 03:23:34 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
20/07/25 03:23:35 INFO SparkContext: Running Spark version 2.4.1
20/07/25 03:23:35 INFO SparkContext: Submitted application: PythonWordCount
20/07/25 03:23:35 INFO SecurityManager: Changing view acls to: root
20/07/25 03:23:35 INFO SecurityManager: Changing modify acls to: root
20/07/25 03:23:35 INFO SecurityManager: Changing view acls groups to:
20/07/25 03:23:35 INFO SecurityManager: Changing modify acls groups to:
20/07/25 03:23:35 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(root); groups with view permissions: Set(); users with modify permissions: Set(root); groups with modify permissions: Set()
20/07/25 03:23:35 INFO Utils: Successfully started service 'sparkDriver' on port 33457.
20/07/25 03:23:35 INFO SparkEnv: Registering MapOutputTracker
20/07/25 03:23:36 INFO SparkEnv: Registering BlockManagerMaster
20/07/25 03:23:36 INFO BlockManagerMasterEndpoint: Using org.apache.spark.storage.DefaultTopologyMapper for getting topology information
20/07/25 03:23:36 INFO BlockManagerMasterEndpoint: BlockManagerMasterEndpoint up
20/07/25 03:23:36 INFO DiskBlockManager: Created local directory at /tmp/blockmgr-dd1957de-6907-484d-a3d8-2b3b88e0c7ca
20/07/25 03:23:36 INFO MemoryStore: MemoryStore started with capacity 366.3 MB
20/07/25 03:23:36 INFO SparkEnv: Registering OutputCommitCoordinator
20/07/25 03:23:36 INFO Utils: Successfully started service 'SparkUI' on port 4040.
20/07/25 03:23:36 INFO SparkUI: Bound SparkUI to 0.0.0.0, and started at http://0508a77fcaad:4040
20/07/25 03:23:37 INFO SparkContext: Added file file:///usr/local/airflow/sample.txt at spark://0508a77fcaad:33457/files/sample.txt with timestamp 1595647417081
20/07/25 03:23:37 INFO Utils: Copying /usr/local/airflow/sample.txt to /tmp/spark-f9dfe6ee-22d7-4747-beab-9450fc1afce0/userFiles-74f8cfe4-8a19-4d2e-8fa1-1f0bd1f0ef12/sample.txt
20/07/25 03:23:37 INFO StandaloneAppClient$ClientEndpoint: Connecting to master spark://172.22.0.2:7077...
20/07/25 03:23:37 INFO TransportClientFactory: Successfully created connection to /172.22.0.2:7077 after 32 ms (0 ms spent in bootstraps)
20/07/25 03:23:38 INFO StandaloneSchedulerBackend: Connected to Spark cluster with app ID app-20200725032338-0003
20/07/25 03:23:38 INFO Utils: Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 45057.
20/07/25 03:23:38 INFO NettyBlockTransferService: Server created on 0508a77fcaad:45057
20/07/25 03:23:38 INFO BlockManager: Using org.apache.spark.storage.RandomBlockReplicationPolicy for block replication policy
20/07/25 03:23:38 INFO StandaloneAppClient$ClientEndpoint: Executor added: app-20200725032338-0003/0 on worker-20200725025003-172.22.0.4-8881 (172.22.0.4:8881) with 2 core(s)
20/07/25 03:23:38 INFO StandaloneSchedulerBackend: Granted executor ID app-20200725032338-0003/0 on hostPort 172.22.0.4:8881 with 2 core(s), 1024.0 MB RAM
20/07/25 03:23:38 INFO BlockManagerMaster: Registering BlockManager BlockManagerId(driver, 0508a77fcaad, 45057, None)
20/07/25 03:23:38 INFO BlockManagerMasterEndpoint: Registering block manager 0508a77fcaad:45057 with 366.3 MB RAM, BlockManagerId(driver, 0508a77fcaad, 45057, None)
20/07/25 03:23:38 INFO BlockManagerMaster: Registered BlockManager BlockManagerId(driver, 0508a77fcaad, 45057, None)
20/07/25 03:23:38 INFO BlockManager: Initialized BlockManager: BlockManagerId(driver, 0508a77fcaad, 45057, None)
20/07/25 03:23:38 INFO StandaloneAppClient$ClientEndpoint: Executor updated: app-20200725032338-0003/0 is now RUNNING
20/07/25 03:23:38 INFO StandaloneSchedulerBackend: SchedulerBackend is ready for scheduling beginning after reached minRegisteredResourcesRatio: 0.0
20/07/25 03:23:38 INFO SharedState: Setting hive.metastore.warehouse.dir ('null') to the value of spark.sql.warehouse.dir ('file:/usr/local/airflow/spark-warehouse').
20/07/25 03:23:38 INFO SharedState: Warehouse path is 'file:/usr/local/airflow/spark-warehouse'.
20/07/25 03:23:40 INFO StateStoreCoordinatorRef: Registered StateStoreCoordinator endpoint
20/07/25 03:23:47 INFO FileSourceStrategy: Pruning directories with:
20/07/25 03:23:47 INFO FileSourceStrategy: Post-Scan Filters:
20/07/25 03:23:47 INFO FileSourceStrategy: Output Data Schema: struct<value: string>
20/07/25 03:23:47 INFO FileSourceScanExec: Pushed Filters:
20/07/25 03:23:51 INFO CodeGenerator: Code generated in 2187.926234 ms
20/07/25 03:23:53 INFO MemoryStore: Block broadcast_0 stored as values in memory (estimated size 220.9 KB, free 366.1 MB)
20/07/25 03:23:55 INFO MemoryStore: Block broadcast_0_piece0 stored as bytes in memory (estimated size 20.8 KB, free 366.1 MB)
20/07/25 03:23:55 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on 0508a77fcaad:45057 (size: 20.8 KB, free: 366.3 MB)
20/07/25 03:23:55 INFO SparkContext: Created broadcast 0 from javaToPython at NativeMethodAccessorImpl.java:0
20/07/25 03:23:55 INFO FileSourceScanExec: Planning scan with bin packing, max size: 4194304 bytes, open cost is considered as scanning 4194304 bytes.
20/07/25 03:23:57 INFO SparkContext: Starting job: collect at /opt/spark-2.4.1/examples/src/main/python/wordcount.py:40
20/07/25 03:23:58 INFO CoarseGrainedSchedulerBackend$DriverEndpoint: Registered executor NettyRpcEndpointRef(spark-client://Executor) (172.22.0.4:59324) with ID 0
20/07/25 03:23:58 INFO DAGScheduler: Registering RDD 5 (reduceByKey at /opt/spark-2.4.1/examples/src/main/python/wordcount.py:39)
20/07/25 03:23:58 INFO DAGScheduler: Got job 0 (collect at /opt/spark-2.4.1/examples/src/main/python/wordcount.py:40) with 1 output partitions
20/07/25 03:23:58 INFO DAGScheduler: Final stage: ResultStage 1 (collect at /opt/spark-2.4.1/examples/src/main/python/wordcount.py:40)
20/07/25 03:23:58 INFO DAGScheduler: Parents of final stage: List(ShuffleMapStage 0)
20/07/25 03:23:58 INFO DAGScheduler: Missing parents: List(ShuffleMapStage 0)
20/07/25 03:23:58 INFO DAGScheduler: Submitting ShuffleMapStage 0 (PairwiseRDD[5] at reduceByKey at /opt/spark-2.4.1/examples/src/main/python/wordcount.py:39), which has no missing parents
20/07/25 03:23:58 INFO MemoryStore: Block broadcast_1 stored as values in memory (estimated size 15.2 KB, free 366.0 MB)
20/07/25 03:23:58 INFO MemoryStore: Block broadcast_1_piece0 stored as bytes in memory (estimated size 9.1 KB, free 366.0 MB)
20/07/25 03:23:58 INFO BlockManagerInfo: Added broadcast_1_piece0 in memory on 0508a77fcaad:45057 (size: 9.1 KB, free: 366.3 MB)
20/07/25 03:23:58 INFO SparkContext: Created broadcast 1 from broadcast at DAGScheduler.scala:1161
20/07/25 03:23:58 INFO DAGScheduler: Submitting 1 missing tasks from ShuffleMapStage 0 (PairwiseRDD[5] at reduceByKey at /opt/spark-2.4.1/examples/src/main/python/wordcount.py:39) (first 15 tasks are for partitions Vector(0))
20/07/25 03:23:58 INFO TaskSchedulerImpl: Adding task set 0.0 with 1 tasks
20/07/25 03:23:58 INFO BlockManagerMasterEndpoint: Registering block manager 172.22.0.4:45435 with 366.3 MB RAM, BlockManagerId(0, 172.22.0.4, 45435, None)
20/07/25 03:23:58 INFO TaskSetManager: Starting task 0.0 in stage 0.0 (TID 0, 172.22.0.4, executor 0, partition 0, PROCESS_LOCAL, 8307 bytes)
20/07/25 03:24:03 INFO BlockManagerInfo: Added broadcast_1_piece0 in memory on 172.22.0.4:45435 (size: 9.1 KB, free: 366.3 MB)
20/07/25 03:24:09 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on 172.22.0.4:45435 (size: 20.8 KB, free: 366.3 MB)
20/07/25 03:24:11 WARN TaskSetManager: Lost task 0.0 in stage 0.0 (TID 0, 172.22.0.4, executor 0): java.io.FileNotFoundException: File file:/usr/local/airflow/sample.txt does not exist
It is possible the underlying files have been updated. You can explicitly invalidate the cache in Spark by running 'REFRESH TABLE tableName' command in SQL or by recreating the Dataset/DataFrame involved.
at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.org$apache$spark$sql$execution$datasources$FileScanRDD$$anon$$readCurrentFile(FileScanRDD.scala:127)
at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:177)
at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:101)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$13$$anon$1.hasNext(WholeStageCodegenExec.scala:636)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
at org.apache.spark.api.python.SerDeUtil$AutoBatchedPickler.hasNext(SerDeUtil.scala:153)
at scala.collection.Iterator$class.foreach(Iterator.scala:891)
at org.apache.spark.api.python.SerDeUtil$AutoBatchedPickler.foreach(SerDeUtil.scala:148)
at org.apache.spark.api.python.PythonRDD$.writeIteratorToStream(PythonRDD.scala:224)
at org.apache.spark.api.python.PythonRunner$$anon$2.writeIteratorToStream(PythonRunner.scala:557)
at org.apache.spark.api.python.BasePythonRunner$WriterThread$$anonfun$run$1.apply(PythonRunner.scala:345)
at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1945)
at org.apache.spark.api.python.BasePythonRunner$WriterThread.run(PythonRunner.scala:194)
20/07/25 03:24:11 INFO TaskSetManager: Starting task 0.1 in stage 0.0 (TID 1, 172.22.0.4, executor 0, partition 0, PROCESS_LOCAL, 8307 bytes)
20/07/25 03:24:11 INFO TaskSetManager: Lost task 0.1 in stage 0.0 (TID 1) on 172.22.0.4, executor 0: java.io.FileNotFoundException (File file:/usr/local/airflow/sample.txt does not exist
It is possible the underlying files have been updated. You can explicitly invalidate the cache in Spark by running 'REFRESH TABLE tableName' command in SQL or by recreating the Dataset/DataFrame involved.) [duplicate 1]
20/07/25 03:24:11 INFO TaskSetManager: Starting task 0.2 in stage 0.0 (TID 2, 172.22.0.4, executor 0, partition 0, PROCESS_LOCAL, 8307 bytes)
20/07/25 03:24:12 INFO TaskSetManager: Lost task 0.2 in stage 0.0 (TID 2) on 172.22.0.4, executor 0: java.io.FileNotFoundException (File file:/usr/local/airflow/sample.txt does not exist
It is possible the underlying files have been updated. You can explicitly invalidate the cache in Spark by running 'REFRESH TABLE tableName' command in SQL or by recreating the Dataset/DataFrame involved.) [duplicate 2]
20/07/25 03:24:12 INFO TaskSetManager: Starting task 0.3 in stage 0.0 (TID 3, 172.22.0.4, executor 0, partition 0, PROCESS_LOCAL, 8307 bytes)
20/07/25 03:24:12 INFO TaskSetManager: Lost task 0.3 in stage 0.0 (TID 3) on 172.22.0.4, executor 0: java.io.FileNotFoundException (File file:/usr/local/airflow/sample.txt does not exist
It is possible the underlying files have been updated. You can explicitly invalidate the cache in Spark by running 'REFRESH TABLE tableName' command in SQL or by recreating the Dataset/DataFrame involved.) [duplicate 3]
20/07/25 03:24:12 ERROR TaskSetManager: Task 0 in stage 0.0 failed 4 times; aborting job
20/07/25 03:24:12 INFO TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks have all completed, from pool
20/07/25 03:24:12 INFO TaskSchedulerImpl: Cancelling stage 0
20/07/25 03:24:12 INFO TaskSchedulerImpl: Killing all running tasks in stage 0: Stage cancelled
20/07/25 03:24:12 INFO DAGScheduler: ShuffleMapStage 0 (reduceByKey at /opt/spark-2.4.1/examples/src/main/python/wordcount.py:39) failed in 13.690 s due to Job aborted due to stage failure: Task 0 in stage 0.0 failed 4 times, most recent failure: Lost task 0.3 in stage 0.0 (TID 3, 172.22.0.4, executor 0): java.io.FileNotFoundException: File file:/usr/local/airflow/sample.txt does not exist
It is possible the underlying files have been updated. You can explicitly invalidate the cache in Spark by running 'REFRESH TABLE tableName' command in SQL or by recreating the Dataset/DataFrame involved.
at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.org$apache$spark$sql$execution$datasources$FileScanRDD$$anon$$readCurrentFile(FileScanRDD.scala:127)
at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:177)
at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:101)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$13$$anon$1.hasNext(WholeStageCodegenExec.scala:636)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
at org.apache.spark.api.python.SerDeUtil$AutoBatchedPickler.hasNext(SerDeUtil.scala:153)
at scala.collection.Iterator$class.foreach(Iterator.scala:891)
at org.apache.spark.api.python.SerDeUtil$AutoBatchedPickler.foreach(SerDeUtil.scala:148)
at org.apache.spark.api.python.PythonRDD$.writeIteratorToStream(PythonRDD.scala:224)
at org.apache.spark.api.python.PythonRunner$$anon$2.writeIteratorToStream(PythonRunner.scala:557)
at org.apache.spark.api.python.BasePythonRunner$WriterThread$$anonfun$run$1.apply(PythonRunner.scala:345)
at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1945)
at org.apache.spark.api.python.BasePythonRunner$WriterThread.run(PythonRunner.scala:194)
Driver stacktrace:
20/07/25 03:24:12 INFO DAGScheduler: Job 0 failed: collect at /opt/spark-2.4.1/examples/src/main/python/wordcount.py:40, took 14.579961 s
Traceback (most recent call last):
File "/opt/spark-2.4.1/examples/src/main/python/wordcount.py", line 40, in <module>
output = counts.collect()
File "/opt/spark-2.4.1/python/lib/pyspark.zip/pyspark/rdd.py", line 816, in collect
File "/opt/spark-2.4.1/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 1257, in __call__
File "/opt/spark-2.4.1/python/lib/pyspark.zip/pyspark/sql/utils.py", line 63, in deco
File "/opt/spark-2.4.1/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py", line 328, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.collectAndServe.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 4 times, most recent failure: Lost task 0.3 in stage 0.0 (TID 3, 172.22.0.4, executor 0): java.io.FileNotFoundException: File file:/usr/local/airflow/sample.txt does not exist
It is possible the underlying files have been updated. You can explicitly invalidate the cache in Spark by running 'REFRESH TABLE tableName' command in SQL or by recreating the Dataset/DataFrame involved.
at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.org$apache$spark$sql$execution$datasources$FileScanRDD$$anon$$readCurrentFile(FileScanRDD.scala:127)
at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:177)
at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:101)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$13$$anon$1.hasNext(WholeStageCodegenExec.scala:636)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
at org.apache.spark.api.python.SerDeUtil$AutoBatchedPickler.hasNext(SerDeUtil.scala:153)
at scala.collection.Iterator$class.foreach(Iterator.scala:891)
at org.apache.spark.api.python.SerDeUtil$AutoBatchedPickler.foreach(SerDeUtil.scala:148)
at org.apache.spark.api.python.PythonRDD$.writeIteratorToStream(PythonRDD.scala:224)
at org.apache.spark.api.python.PythonRunner$$anon$2.writeIteratorToStream(PythonRunner.scala:557)
at org.apache.spark.api.python.BasePythonRunner$WriterThread$$anonfun$run$1.apply(PythonRunner.scala:345)
at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1945)
at org.apache.spark.api.python.BasePythonRunner$WriterThread.run(PythonRunner.scala:194)
Driver stacktrace:
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1889)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1877)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1876)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1876)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:926)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:926)
at scala.Option.foreach(Option.scala:257)
at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:926)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2110)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2059)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2048)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:737)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2061)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2082)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2101)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2126)
at org.apache.spark.rdd.RDD$$anonfun$collect$1.apply(RDD.scala:945)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:363)
at org.apache.spark.rdd.RDD.collect(RDD.scala:944)
at org.apache.spark.api.python.PythonRDD$.collectAndServe(PythonRDD.scala:166)
at org.apache.spark.api.python.PythonRDD.collectAndServe(PythonRDD.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:282)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:238)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.io.FileNotFoundException: File file:/usr/local/airflow/sample.txt does not exist
It is possible the underlying files have been updated. You can explicitly invalidate the cache in Spark by running 'REFRESH TABLE tableName' command in SQL or by recreating the Dataset/DataFrame involved.
at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.org$apache$spark$sql$execution$datasources$FileScanRDD$$anon$$readCurrentFile(FileScanRDD.scala:127)
at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:177)
at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:101)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$13$$anon$1.hasNext(WholeStageCodegenExec.scala:636)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
at org.apache.spark.api.python.SerDeUtil$AutoBatchedPickler.hasNext(SerDeUtil.scala:153)
at scala.collection.Iterator$class.foreach(Iterator.scala:891)
at org.apache.spark.api.python.SerDeUtil$AutoBatchedPickler.foreach(SerDeUtil.scala:148)
at org.apache.spark.api.python.PythonRDD$.writeIteratorToStream(PythonRDD.scala:224)
at org.apache.spark.api.python.PythonRunner$$anon$2.writeIteratorToStream(PythonRunner.scala:557)
at org.apache.spark.api.python.BasePythonRunner$WriterThread$$anonfun$run$1.apply(PythonRunner.scala:345)
at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1945)
at org.apache.spark.api.python.BasePythonRunner$WriterThread.run(PythonRunner.scala:194)
20/07/25 03:24:13 INFO SparkContext: Invoking stop() from shutdown hook
20/07/25 03:24:13 INFO SparkUI: Stopped Spark web UI at http://0508a77fcaad:4040
20/07/25 03:24:13 INFO StandaloneSchedulerBackend: Shutting down all executors
20/07/25 03:24:13 INFO CoarseGrainedSchedulerBackend$DriverEndpoint: Asking each executor to shut down
20/07/25 03:24:16 INFO MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!
20/07/25 03:24:16 INFO MemoryStore: MemoryStore cleared
20/07/25 03:24:16 INFO BlockManager: BlockManager stopped
20/07/25 03:24:16 INFO BlockManagerMaster: BlockManagerMaster stopped
20/07/25 03:24:16 INFO OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: OutputCommitCoordinator stopped!
20/07/25 03:24:16 INFO SparkContext: Successfully stopped SparkContext
20/07/25 03:24:16 INFO ShutdownHookManager: Shutdown hook called
20/07/25 03:24:16 INFO ShutdownHookManager: Deleting directory /tmp/spark-2dfb2222-d56c-4ee1-ab62-86e71e5e751b
20/07/25 03:24:16 INFO ShutdownHookManager: Deleting directory /tmp/spark-f9dfe6ee-22d7-4747-beab-9450fc1afce0
20/07/25 03:24:16 INFO ShutdownHookManager: Deleting directory /tmp/spark-f9dfe6ee-22d7-4747-beab-9450fc1afce0/pyspark-2ee74d07-6606-4edc-8420-fe46212c50e5
Change your spark-submit like below when submitting your spark job (add --deploy-mode cluster if you want to pass just the file name to wordcount.py):
spark-submit \
  --master spark://ipaddress:7077 \
  --deploy-mode cluster \
  --files /usr/local/airflow/sample.txt \
  /opt/spark-2.4.1/examples/src/main/python/wordcount.py sample.txt
OR
spark-submit \
  --master spark://ipaddress:7077 \
  /opt/spark-2.4.1/examples/src/main/python/wordcount.py /usr/local/airflow/sample.txt
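For reference, files shipped with --files are staged into each executor's working directory and are meant to be opened inside tasks via SparkFiles; here is a minimal PySpark sketch of that pattern (assuming the job was submitted with --files /usr/local/airflow/sample.txt):

from pyspark import SparkFiles
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("read-shipped-file").getOrCreate()
sc = spark.sparkContext

def read_words(_):
    # Inside a task, SparkFiles.get resolves this executor's local copy of the shipped file
    with open(SparkFiles.get("sample.txt")) as f:
        return f.read().split()

words = sc.parallelize([0], numSlices=1).flatMap(read_words)
print(words.count())

The stock wordcount.py instead passes its argument to spark.read.text, so the path must be visible to every node (or sit on a shared filesystem), which is why the bare sample.txt resolved to a nonexistent file:/ path on the worker.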

Connect to Hive from Apache Spark [duplicate]

This question already has answers here:
How to connect Spark SQL to remote Hive metastore (via thrift protocol) with no hive-site.xml?
(11 answers)
Closed 2 years ago.
I have a simple program that I'm running on a standalone Cloudera VM. I have created a managed table in Hive which I want to read in Apache Spark, but the initial connection to Hive is not being established. Please advise.
I'm running this program in IntelliJ and have copied hive-site.xml from /etc/hive/conf to /etc/spark/conf, but even then the Spark job is not connecting to the Hive metastore.
import org.apache.spark.SparkContext;
import org.apache.spark.sql.AnalysisException;
import org.apache.spark.sql.SQLContext;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.hive.HiveContext;

public static void main(String[] args) throws AnalysisException {
    String master = "local[*]";
    SparkSession sparkSession = SparkSession
            .builder().appName(ConnectToHive.class.getName())
            .config("spark.sql.warehouse.dir", "hdfs://quickstart.cloudera:8020/user/hive/warehouse")
            .enableHiveSupport()
            .master(master).getOrCreate();
    SparkContext context = sparkSession.sparkContext();
    context.setLogLevel("ERROR");
    SQLContext sqlCtx = sparkSession.sqlContext();
    HiveContext hiveContext = new HiveContext(sparkSession);
    hiveContext.setConf("hive.metastore.warehouse.dir", "hdfs://quickstart.cloudera:8020/user/hive/warehouse");
    hiveContext.sql("SHOW DATABASES").show();
    hiveContext.sql("SHOW TABLES").show();
    sparkSession.close();
}
The output is below, where I expect to see the employee table so that I can query it. Since I'm running standalone, the Hive metastore is in a local MySQL server.
+------------+
|databaseName|
+------------+
| default|
+------------+
+--------+---------+-----------+
|database|tableName|isTemporary|
+--------+---------+-----------+
+--------+---------+-----------+
jdbc:mysql://127.0.0.1/metastore?createDatabaseIfNotExist=true is the configuration for the Hive metastore.
hive> show databases;
OK
default
sxm
temp
Time taken: 0.019 seconds, Fetched: 3 row(s)
hive> use default;
OK
Time taken: 0.015 seconds
hive> show tables;
OK
employee
Time taken: 0.014 seconds, Fetched: 1 row(s)
hive> describe formatted employee;
OK
# col_name data_type comment
id string
firstname string
lastname string
addresses array<struct<street:string,city:string,state:string>>
# Detailed Table Information
Database: default
Owner: cloudera
CreateTime: Tue Jul 25 06:33:01 PDT 2017
LastAccessTime: UNKNOWN
Protect Mode: None
Retention: 0
Location: hdfs://quickstart.cloudera:8020/user/hive/warehouse/employee
Table Type: MANAGED_TABLE
Table Parameters:
transient_lastDdlTime 1500989581
# Storage Information
SerDe Library: org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe
InputFormat: org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat
OutputFormat: org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat
Compressed: No
Num Buckets: -1
Bucket Columns: []
Sort Columns: []
Storage Desc Params:
serialization.format 1
Time taken: 0.07 seconds, Fetched: 29 row(s)
hive>
Added Spark logs:
log4j:WARN No appenders could be found for logger (org.apache.hadoop.util.Shell).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
17/07/25 11:38:30 INFO SparkContext: Running Spark version 2.1.0
17/07/25 11:38:30 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
17/07/25 11:38:30 INFO SecurityManager: Changing view acls to: cloudera
17/07/25 11:38:30 INFO SecurityManager: Changing modify acls to: cloudera
17/07/25 11:38:30 INFO SecurityManager: Changing view acls groups to:
17/07/25 11:38:30 INFO SecurityManager: Changing modify acls groups to:
17/07/25 11:38:30 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(cloudera); groups with view permissions: Set(); users with modify permissions: Set(cloudera); groups with modify permissions: Set()
17/07/25 11:38:31 INFO Utils: Successfully started service 'sparkDriver' on port 55232.
17/07/25 11:38:31 INFO SparkEnv: Registering MapOutputTracker
17/07/25 11:38:31 INFO SparkEnv: Registering BlockManagerMaster
17/07/25 11:38:31 INFO BlockManagerMasterEndpoint: Using org.apache.spark.storage.DefaultTopologyMapper for getting topology information
17/07/25 11:38:31 INFO BlockManagerMasterEndpoint: BlockManagerMasterEndpoint up
17/07/25 11:38:31 INFO DiskBlockManager: Created local directory at /tmp/blockmgr-eb1e611f-1b88-487f-b600-3da1ff8353db
17/07/25 11:38:31 INFO MemoryStore: MemoryStore started with capacity 1909.8 MB
17/07/25 11:38:31 INFO SparkEnv: Registering OutputCommitCoordinator
17/07/25 11:38:31 INFO Utils: Successfully started service 'SparkUI' on port 4040.
17/07/25 11:38:31 INFO SparkUI: Bound SparkUI to 0.0.0.0, and started at http://10.0.2.15:4040
17/07/25 11:38:31 INFO Executor: Starting executor ID driver on host localhost
17/07/25 11:38:31 INFO Utils: Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 41433.
17/07/25 11:38:31 INFO NettyBlockTransferService: Server created on 10.0.2.15:41433
17/07/25 11:38:31 INFO BlockManager: Using org.apache.spark.storage.RandomBlockReplicationPolicy for block replication policy
17/07/25 11:38:31 INFO BlockManagerMaster: Registering BlockManager BlockManagerId(driver, 10.0.2.15, 41433, None)
17/07/25 11:38:31 INFO BlockManagerMasterEndpoint: Registering block manager 10.0.2.15:41433 with 1909.8 MB RAM, BlockManagerId(driver, 10.0.2.15, 41433, None)
17/07/25 11:38:31 INFO BlockManagerMaster: Registered BlockManager BlockManagerId(driver, 10.0.2.15, 41433, None)
17/07/25 11:38:31 INFO BlockManager: Initialized BlockManager: BlockManagerId(driver, 10.0.2.15, 41433, None)
17/07/25 11:38:32 INFO SharedState: Warehouse path is 'file:/home/cloudera/works/JsonHive/spark-warehouse/'.
17/07/25 11:38:32 INFO HiveUtils: Initializing HiveMetastoreConnection version 1.2.1 using Spark classes.
17/07/25 11:38:32 INFO deprecation: mapred.max.split.size is deprecated. Instead, use mapreduce.input.fileinputformat.split.maxsize
17/07/25 11:38:32 INFO deprecation: mapred.reduce.tasks.speculative.execution is deprecated. Instead, use mapreduce.reduce.speculative
17/07/25 11:38:32 INFO deprecation: mapred.committer.job.setup.cleanup.needed is deprecated. Instead, use mapreduce.job.committer.setup.cleanup.needed
17/07/25 11:38:32 INFO deprecation: mapred.min.split.size.per.rack is deprecated. Instead, use mapreduce.input.fileinputformat.split.minsize.per.rack
17/07/25 11:38:32 INFO deprecation: mapred.min.split.size is deprecated. Instead, use mapreduce.input.fileinputformat.split.minsize
17/07/25 11:38:32 INFO deprecation: mapred.min.split.size.per.node is deprecated. Instead, use mapreduce.input.fileinputformat.split.minsize.per.node
17/07/25 11:38:32 INFO deprecation: mapred.reduce.tasks is deprecated. Instead, use mapreduce.job.reduces
17/07/25 11:38:32 INFO deprecation: mapred.input.dir.recursive is deprecated. Instead, use mapreduce.input.fileinputformat.input.dir.recursive
17/07/25 11:38:32 INFO HiveMetaStore: 0: Opening raw store with implemenation class:org.apache.hadoop.hive.metastore.ObjectStore
17/07/25 11:38:32 INFO ObjectStore: ObjectStore, initialize called
17/07/25 11:38:32 INFO Persistence: Property hive.metastore.integral.jdo.pushdown unknown - will be ignored
17/07/25 11:38:32 INFO Persistence: Property datanucleus.cache.level2 unknown - will be ignored
17/07/25 11:38:34 INFO ObjectStore: Setting MetaStore object pin classes with hive.metastore.cache.pinobjtypes="Table,StorageDescriptor,SerDeInfo,Partition,Database,Type,FieldSchema,Order"
17/07/25 11:38:35 INFO Datastore: The class "org.apache.hadoop.hive.metastore.model.MFieldSchema" is tagged as "embedded-only" so does not have its own datastore table.
17/07/25 11:38:35 INFO Datastore: The class "org.apache.hadoop.hive.metastore.model.MOrder" is tagged as "embedded-only" so does not have its own datastore table.
17/07/25 11:38:35 INFO Datastore: The class "org.apache.hadoop.hive.metastore.model.MFieldSchema" is tagged as "embedded-only" so does not have its own datastore table.
17/07/25 11:38:35 INFO Datastore: The class "org.apache.hadoop.hive.metastore.model.MOrder" is tagged as "embedded-only" so does not have its own datastore table.
17/07/25 11:38:35 INFO Query: Reading in results for query "org.datanucleus.store.rdbms.query.SQLQuery#0" since the connection used is closing
17/07/25 11:38:35 INFO MetaStoreDirectSql: Using direct SQL, underlying DB is DERBY
17/07/25 11:38:35 INFO ObjectStore: Initialized ObjectStore
17/07/25 11:38:36 INFO HiveMetaStore: Added admin role in metastore
17/07/25 11:38:36 INFO HiveMetaStore: Added public role in metastore
17/07/25 11:38:36 INFO HiveMetaStore: No user is added in admin role, since config is empty
17/07/25 11:38:36 INFO HiveMetaStore: 0: get_all_databases
17/07/25 11:38:36 INFO audit: ugi=cloudera ip=unknown-ip-addr cmd=get_all_databases
17/07/25 11:38:36 INFO HiveMetaStore: 0: get_functions: db=default pat=*
17/07/25 11:38:36 INFO audit: ugi=cloudera ip=unknown-ip-addr cmd=get_functions: db=default pat=*
17/07/25 11:38:36 INFO Datastore: The class "org.apache.hadoop.hive.metastore.model.MResourceUri" is tagged as "embedded-only" so does not have its own datastore table.
17/07/25 11:38:36 INFO SessionState: Created local directory: /tmp/76258222-81db-4ac1-9566-1d8f05c3ecba_resources
17/07/25 11:38:36 INFO SessionState: Created HDFS directory: /tmp/hive/cloudera/76258222-81db-4ac1-9566-1d8f05c3ecba
17/07/25 11:38:36 INFO SessionState: Created local directory: /tmp/cloudera/76258222-81db-4ac1-9566-1d8f05c3ecba
17/07/25 11:38:36 INFO SessionState: Created HDFS directory: /tmp/hive/cloudera/76258222-81db-4ac1-9566-1d8f05c3ecba/_tmp_space.db
17/07/25 11:38:36 INFO HiveClientImpl: Warehouse location for Hive client (version 1.2.1) is file:/home/cloudera/works/JsonHive/spark-warehouse/
17/07/25 11:38:36 INFO HiveMetaStore: 0: get_database: default
17/07/25 11:38:36 INFO audit: ugi=cloudera ip=unknown-ip-addr cmd=get_database: default
17/07/25 11:38:36 INFO HiveMetaStore: 0: get_database: global_temp
17/07/25 11:38:36 INFO audit: ugi=cloudera ip=unknown-ip-addr cmd=get_database: global_temp
17/07/25 11:38:36 WARN ObjectStore: Failed to get database global_temp, returning NoSuchObjectException
+------------+
|databaseName|
+------------+
| default|
+------------+
+--------+---------+-----------+
|database|tableName|isTemporary|
+--------+---------+-----------+
+--------+---------+-----------+
Process finished with exit code 0
UPDATE
/usr/lib/hive/conf/hive-site.xml was not on the classpath, so the tables were not being read; after adding it to the classpath it worked fine. I had this problem because I was running from IntelliJ; in production the spark-conf folder will have a link to hive-site.xml.
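For anyone hitting the same thing from sbt or an IDE, one way to get the Hive conf directory onto the runtime classpath is a build.sbt entry along these lines (a sketch, assuming the config lives in /usr/lib/hive/conf; older sbt releases may need the file wrapped in Attributed.blank):
// build.sbt -- put the directory containing hive-site.xml on the runtime
// classpath so `sbt run` (and an IDE that imports the build) can see it
unmanagedClasspath in Runtime += file("/usr/lib/hive/conf")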
17/07/25 11:38:35 INFO MetaStoreDirectSql: Using direct SQL, underlying DB is DERBY
This is a hint that you're not connected to the remote Hive metastore (which you've configured to be backed by MySQL), and that the XML file is not correctly on your classpath.
You can also set this programmatically, without the XML file, before you create the SparkSession:
System.setProperty("hive.metastore.uris", "thrift://METASTORE:9083");
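In Spark 2.x the same setting can be passed to the session builder; a minimal sketch, where thrift://METASTORE:9083 is a placeholder for your metastore host:
import org.apache.spark.sql.SparkSession

// Sketch: configure the remote metastore URI directly on the builder
// instead of relying on hive-site.xml being found on the classpath
val spark = SparkSession.builder()
  .appName("metastore-check")
  .master("local[*]")
  .config("hive.metastore.uris", "thrift://METASTORE:9083")
  .enableHiveSupport()
  .getOrCreate()

// Should list the databases from the remote (MySQL-backed) metastore
spark.sql("show databases").show()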
How to connect to a Hive metastore programmatically in SparkSQL?

Connection to Hive context giving me null pointer exception

I am using a simple Hive context on Spark standalone.
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.commons.math.stat.descriptive.rank.Max

object App {
  def main(args: Array[String]) {
    val logFile = "src/main/resources/kv1.txt" // Should be some file on your system
    val conf = new SparkConf().setAppName("Simple Application").setMaster("local[4]")
    val sc = new SparkContext(conf)
    val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)
    sqlContext.sql("CREATE TABLE IF NOT EXISTS src (key INT, value STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' LINES TERMINATED BY '\\n'")
    sqlContext.sql(s"LOAD DATA LOCAL INPATH '$logFile' INTO TABLE src")
    // Queries are expressed in HiveQL
    sqlContext.sql("FROM src SELECT key, value").collect().foreach(println)
  }
}
I have posted a section of the log file below:
16/06/14 19:34:53 INFO ClientWrapper: Inspected Hadoop version: 2.2.0
16/06/14 19:34:53 INFO ClientWrapper: Loaded org.apache.hadoop.hive.shims.Hadoop23Shims for Hadoop version 2.2.0
16/06/14 19:34:53 INFO deprecation: mapred.max.split.size is deprecated. Instead, use mapreduce.input.fileinputformat.split.maxsize
16/06/14 19:34:53 INFO deprecation: mapred.reduce.tasks.speculative.execution is deprecated. Instead, use mapreduce.reduce.speculative
16/06/14 19:34:53 INFO deprecation: mapred.committer.job.setup.cleanup.needed is deprecated. Instead, use mapreduce.job.committer.setup.cleanup.needed
16/06/14 19:34:53 INFO deprecation: mapred.min.split.size.per.rack is deprecated. Instead, use mapreduce.input.fileinputformat.split.minsize.per.rack
16/06/14 19:34:53 INFO deprecation: mapred.min.split.size is deprecated. Instead, use mapreduce.input.fileinputformat.split.minsize
16/06/14 19:34:53 INFO deprecation: mapred.min.split.size.per.node is deprecated. Instead, use mapreduce.input.fileinputformat.split.minsize.per.node
16/06/14 19:34:53 INFO deprecation: mapred.reduce.tasks is deprecated. Instead, use mapreduce.job.reduces
16/06/14 19:34:53 INFO deprecation: mapred.input.dir.recursive is deprecated. Instead, use mapreduce.input.fileinputformat.input.dir.recursive
16/06/14 19:34:53 INFO HiveMetaStore: 0: Opening raw store with implemenation class:org.apache.hadoop.hive.metastore.ObjectStore
16/06/14 19:34:53 INFO ObjectStore: ObjectStore, initialize called
16/06/14 19:34:54 INFO Persistence: Property hive.metastore.integral.jdo.pushdown unknown - will be ignored
16/06/14 19:34:54 INFO Persistence: Property datanucleus.cache.level2 unknown - will be ignored
16/06/14 19:34:55 INFO ObjectStore: Setting MetaStore object pin classes with hive.metastore.cache.pinobjtypes="Table,StorageDescriptor,SerDeInfo,Partition,Database,Type,FieldSchema,Order"
16/06/14 19:34:56 INFO Datastore: The class "org.apache.hadoop.hive.metastore.model.MFieldSchema" is tagged as "embedded-only" so does not have its own datastore table.
16/06/14 19:34:56 INFO Datastore: The class "org.apache.hadoop.hive.metastore.model.MOrder" is tagged as "embedded-only" so does not have its own datastore table.
16/06/14 19:34:56 INFO Datastore: The class "org.apache.hadoop.hive.metastore.model.MFieldSchema" is tagged as "embedded-only" so does not have its own datastore table.
16/06/14 19:34:56 INFO Datastore: The class "org.apache.hadoop.hive.metastore.model.MOrder" is tagged as "embedded-only" so does not have its own datastore table.
16/06/14 19:34:57 INFO MetaStoreDirectSql: Using direct SQL, underlying DB is DERBY
16/06/14 19:34:57 INFO ObjectStore: Initialized ObjectStore
16/06/14 19:34:57 WARN ObjectStore: Version information not found in metastore. hive.metastore.schema.verification is not enabled so recording the schema version 1.2.0
16/06/14 19:34:57 WARN ObjectStore: Failed to get database default, returning NoSuchObjectException
16/06/14 19:34:57 WARN : Your hostname, SSA7201E-169 resolves to a loopback/non-reachable address: 140.159.218.207, but we couldn't find any external IP address!
16/06/14 19:34:57 INFO HiveMetaStore: Added admin role in metastore
16/06/14 19:34:57 INFO HiveMetaStore: Added public role in metastore
16/06/14 19:34:58 INFO HiveMetaStore: No user is added in admin role, since config is empty
16/06/14 19:34:58 INFO HiveMetaStore: 0: get_all_databases
16/06/14 19:34:58 INFO audit: ugi=s3911541 ip=unknown-ip-addr cmd=get_all_databases
16/06/14 19:34:58 INFO HiveMetaStore: 0: get_functions: db=default pat=*
16/06/14 19:34:58 INFO audit: ugi=s3911541 ip=unknown-ip-addr cmd=get_functions: db=default pat=*
16/06/14 19:34:58 INFO Datastore: The class "org.apache.hadoop.hive.metastore.model.MResourceUri" is tagged as "embedded-only" so does not have its own datastore table.
Exception in thread "main" java.lang.RuntimeException: java.lang.NullPointerException
at org.apache.hadoop.hive.ql.session.SessionState.start(SessionState.java:522)
at org.apache.spark.sql.hive.client.ClientWrapper.<init>(ClientWrapper.scala:204)
at org.apache.spark.sql.hive.client.IsolatedClientLoader.createClient(IsolatedClientLoader.scala:238)
at org.apache.spark.sql.hive.HiveContext.executionHive$lzycompute(HiveContext.scala:218)
at org.apache.spark.sql.hive.HiveContext.executionHive(HiveContext.scala:208)
at org.apache.spark.sql.hive.HiveContext.functionRegistry$lzycompute(HiveContext.scala:462)
at org.apache.spark.sql.hive.HiveContext.functionRegistry(HiveContext.scala:461)
at org.apache.spark.sql.UDFRegistration.<init>(UDFRegistration.scala:40)
at org.apache.spark.sql.SQLContext.<init>(SQLContext.scala:330)
at org.apache.spark.sql.hive.HiveContext.<init>(HiveContext.scala:90)
at org.apache.spark.sql.hive.HiveContext.<init>(HiveContext.scala:101)
at App$.main(App.scala:27)
at App.main(App.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:497)
at com.intellij.rt.execution.application.AppMain.main(AppMain.java:144)
Caused by: java.lang.NullPointerException
at java.lang.ProcessBuilder.start(ProcessBuilder.java:1012)
at org.apache.hadoop.util.Shell.runCommand(Shell.java:404)
at org.apache.hadoop.util.Shell.run(Shell.java:379)
at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:589)
at org.apache.hadoop.util.Shell.execCommand(Shell.java:678)
at org.apache.hadoop.util.Shell.execCommand(Shell.java:661)
at org.apache.hadoop.fs.FileUtil.execCommand(FileUtil.java:1097)
at org.apache.hadoop.fs.RawLocalFileSystem$RawLocalFileStatus.loadPermissionInfo(RawLocalFileSystem.java:567)
at org.apache.hadoop.fs.RawLocalFileSystem$RawLocalFileStatus.getPermission(RawLocalFileSystem.java:542)
at org.apache.hadoop.hive.ql.session.SessionState.createRootHDFSDir(SessionState.java:599)
at org.apache.hadoop.hive.ql.session.SessionState.createSessionDirs(SessionState.java:554)
at org.apache.hadoop.hive.ql.session.SessionState.start(SessionState.java:508)
... 17 more
16/06/14 19:34:58 INFO SparkContext: Invoking stop() from shutdown hook
16/06/14 19:34:58 INFO SparkUI: Stopped Spark web UI at http://140.159.218.207:4040
16/06/14 19:34:58 INFO MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!
16/06/14 19:34:58 INFO MemoryStore: MemoryStore cleared
16/06/14 19:34:58 INFO BlockManager: BlockManager stopped
16/06/14 19:34:58 INFO BlockManagerMaster: BlockManagerMaster stopped
16/06/14 19:34:58 INFO OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: OutputCommitCoordinator stopped!
16/06/14 19:34:58 INFO SparkContext: Successfully stopped SparkContext
16/06/14 19:34:58 INFO ShutdownHookManager: Shutdown hook called
16/06/14 19:34:58 INFO ShutdownHookManager: Deleting directory C:\Users\s3911541.AD.000\AppData\Local\Temp\spark-56c72d84-b79b-4cc9-aaa6-b1818303dded
16/06/14 19:34:58 INFO ShutdownHookManager: Deleting directory C:\Users\s3911541.AD.000\AppData\Local\Temp\spark-94c1e22c-1d4c-4c40-93ea-309171c93b99
16/06/14 19:34:58 INFO RemoteActorRefProvider$RemotingTerminator: Shutting down remote daemon.
16/06/14 19:34:58 INFO RemoteActorRefProvider$RemotingTerminator: Remote daemon shut down; proceeding with flushing remote transports.
16/06/14 19:34:58 INFO RemoteActorRefProvider$RemotingTerminator: Remoting shut down.
16/06/14 19:34:58 ERROR ShutdownHookManager: Exception while deleting Spark temp dir: C:\Users\s3911541.AD.000\AppData\Local\Temp\spark-94c1e22c-1d4c-4c40-93ea-309171c93b99
java.io.IOException: Failed to delete: C:\Users\s3911541.AD.000\AppData\Local\Temp\spark-94c1e22c-1d4c-4c40-93ea-309171c93b99
at org.apache.spark.util.Utils$.deleteRecursively(Utils.scala:928)
at org.apache.spark.util.ShutdownHookManager$$anonfun$1$$anonfun$apply$mcV$sp$3.apply(ShutdownHookManager.scala:65)
at org.apache.spark.util.ShutdownHookManager$$anonfun$1$$anonfun$apply$mcV$sp$3.apply(ShutdownHookManager.scala:62)
at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
at org.apache.spark.util.ShutdownHookManager$$anonfun$1.apply$mcV$sp(ShutdownHookManager.scala:62)
at org.apache.spark.util.SparkShutdownHook.run(ShutdownHookManager.scala:267)
at org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply$mcV$sp(ShutdownHookManager.scala:239)
at org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply(ShutdownHookManager.scala:239)
at org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply(ShutdownHookManager.scala:239)
at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1765)
at org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1.apply$mcV$sp(ShutdownHookManager.scala:239)
at org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1.apply(ShutdownHookManager.scala:239)
at org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1.apply(ShutdownHookManager.scala:239)
at scala.util.Try$.apply(Try.scala:161)
at org.apache.spark.util.SparkShutdownHookManager.runAll(ShutdownHookManager.scala:239)
at org.apache.spark.util.SparkShutdownHookManager$$anon$2.run(ShutdownHookManager.scala:218)
at org.apache.hadoop.util.ShutdownHookManager$1.run(ShutdownHookManager.java:54)
Process finished with exit code 1
My sbt file is as follows, and I have not compiled any jar:
name := "ScalaTest2"
version := "1.0"
scalaVersion := "2.10.4"
libraryDependencies += "org.apache.spark" %% "spark-core" % "1.6.1"
libraryDependencies += "org.apache.spark" % "spark-hive_2.10" % "1.6.1"
// http://mvnrepository.com/artifact/org.apache.spark/spark-sql_2.10
libraryDependencies += "org.apache.spark" % "spark-sql_2.10" % "1.6.1"
Can someone point me to how to solve this problem? I am sure my file path and text file are correct. If anyone has any clue, please share it.
Thank you

Error creating transactional connection factory when running a Spark-on-Hive project in IDEA

I am trying to set up a development environment for a Spark Streaming project which requires writing data into Hive.
I have a cluster with 1 master, 2 slaves and 1 development machine (coding in IntelliJ IDEA 14).
Within the spark shell, everything seems to work fine, and I am able to store data into the default database in Hive via Spark 1.5 using DataFrame.write.insertInto("testtable").
However, when I create a Scala project in IDEA and run it against the same cluster with the same settings, an error is thrown while creating the transactional connection factory for the metastore database, which is supposed to be "metastore_db" in MySQL.
Here's the hive-site.xml:
<configuration>
  <property>
    <name>hive.metastore.uris</name>
    <value>thrift://10.1.50.73:9083</value>
  </property>
  <property>
    <name>hive.metastore.warehouse.dir</name>
    <value>/user/hive/warehouse</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionURL</name>
    <value>jdbc:mysql://10.1.50.73:3306/metastore_db?createDatabaseIfNotExist=true</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionDriverName</name>
    <value>com.mysql.jdbc.Driver</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionUserName</name>
    <value>sa</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionPassword</name>
    <value>huanhuan</value>
  </property>
</configuration>
The machine on which I was running IDEA can remotely log in to MySQL and Hive to create tables, so there should be no problem with permissions.
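To rule out a classpath problem, you can also check from the driver whether hive-site.xml is visible at all; a quick sketch:
// Sketch: a null result means hive-site.xml is not on the classpath and
// Spark will silently fall back to a local Derby metastore
val hiveSiteUrl = getClass.getClassLoader.getResource("hive-site.xml")
println(s"hive-site.xml resolved to: $hiveSiteUrl")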
Here's the log4j output:
> /home/stdevelop/jdk1.7.0_79/bin/java -Didea.launcher.port=7536 -Didea.launcher.bin.path=/home/stdevelop/idea/bin -Dfile.encoding=UTF-8 -classpath /home/stdevelop/jdk1.7.0_79/jre/lib/plugin.jar:/home/stdevelop/jdk1.7.0_79/jre/lib/deploy.jar:/home/stdevelop/jdk1.7.0_79/jre/lib/jfxrt.jar:/home/stdevelop/jdk1.7.0_79/jre/lib/charsets.jar:/home/stdevelop/jdk1.7.0_79/jre/lib/javaws.jar:/home/stdevelop/jdk1.7.0_79/jre/lib/jfr.jar:/home/stdevelop/jdk1.7.0_79/jre/lib/jce.jar:/home/stdevelop/jdk1.7.0_79/jre/lib/jsse.jar:/home/stdevelop/jdk1.7.0_79/jre/lib/rt.jar:/home/stdevelop/jdk1.7.0_79/jre/lib/resources.jar:/home/stdevelop/jdk1.7.0_79/jre/lib/management-agent.jar:/home/stdevelop/jdk1.7.0_79/jre/lib/ext/zipfs.jar:/home/stdevelop/jdk1.7.0_79/jre/lib/ext/sunec.jar:/home/stdevelop/jdk1.7.0_79/jre/lib/ext/sunpkcs11.jar:/home/stdevelop/jdk1.7.0_79/jre/lib/ext/sunjce_provider.jar:/home/stdevelop/jdk1.7.0_79/jre/lib/ext/localedata.jar:/home/stdevelop/jdk1.7.0_79/jre/lib/ext/dnsns.jar:/home/stdevelop/IdeaProjects/StreamingIntoHive/target/scala-2.10/classes:/root/.sbt/boot/scala-2.10.4/lib/scala-library.jar:/home/stdevelop/SparkDll/spark-assembly-1.5.0-hadoop2.5.2.jar:/home/stdevelop/SparkDll/datanucleus-api-jdo-3.2.6.jar:/home/stdevelop/SparkDll/datanucleus-core-3.2.10.jar:/home/stdevelop/SparkDll/datanucleus-rdbms-3.2.9.jar:/home/stdevelop/SparkDll/mysql-connector-java-5.1.35-bin.jar:/home/stdevelop/idea/lib/idea_rt.jar com.intellij.rt.execution.application.AppMain StreamingIntoHive 10.1.50.68 8080
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
15/09/22 19:43:18 INFO SparkContext: Running Spark version 1.5.0
15/09/22 19:43:21 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
15/09/22 19:43:22 INFO SecurityManager: Changing view acls to: root
15/09/22 19:43:22 INFO SecurityManager: Changing modify acls to: root
15/09/22 19:43:22 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(root); users with modify permissions: Set(root)
15/09/22 19:43:26 INFO Slf4jLogger: Slf4jLogger started
15/09/22 19:43:26 INFO Remoting: Starting remoting
15/09/22 19:43:26 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://sparkDriver#10.1.50.68:58070]
15/09/22 19:43:26 INFO Utils: Successfully started service 'sparkDriver' on port 58070.
15/09/22 19:43:26 INFO SparkEnv: Registering MapOutputTracker
15/09/22 19:43:26 INFO SparkEnv: Registering BlockManagerMaster
15/09/22 19:43:26 INFO DiskBlockManager: Created local directory at /tmp/blockmgr-e7fdc896-ebd2-4faa-a9fe-e61bd93a9db4
15/09/22 19:43:26 INFO MemoryStore: MemoryStore started with capacity 797.6 MB
15/09/22 19:43:27 INFO HttpFileServer: HTTP File server directory is /tmp/spark-fb07a3ad-8077-49a8-bcaf-12254cc90282/httpd-0bb434c9-1418-49b6-a514-90e27cb80ab1
15/09/22 19:43:27 INFO HttpServer: Starting HTTP Server
15/09/22 19:43:27 INFO Utils: Successfully started service 'HTTP file server' on port 38865.
15/09/22 19:43:27 INFO SparkEnv: Registering OutputCommitCoordinator
15/09/22 19:43:29 INFO Utils: Successfully started service 'SparkUI' on port 4040.
15/09/22 19:43:29 INFO SparkUI: Started SparkUI at http://10.1.50.68:4040
15/09/22 19:43:29 INFO SparkContext: Added JAR /home/stdevelop/SparkDll/mysql-connector-java-5.1.35-bin.jar at http://10.1.50.68:38865/jars/mysql-connector-java-5.1.35-bin.jar with timestamp 1442922209496
15/09/22 19:43:29 INFO SparkContext: Added JAR /home/stdevelop/SparkDll/datanucleus-api-jdo-3.2.6.jar at http://10.1.50.68:38865/jars/datanucleus-api-jdo-3.2.6.jar with timestamp 1442922209498
15/09/22 19:43:29 INFO SparkContext: Added JAR /home/stdevelop/SparkDll/datanucleus-rdbms-3.2.9.jar at http://10.1.50.68:38865/jars/datanucleus-rdbms-3.2.9.jar with timestamp 1442922209534
15/09/22 19:43:29 INFO SparkContext: Added JAR /home/stdevelop/SparkDll/datanucleus-core-3.2.10.jar at http://10.1.50.68:38865/jars/datanucleus-core-3.2.10.jar with timestamp 1442922209564
15/09/22 19:43:30 WARN MetricsSystem: Using default name DAGScheduler for source because spark.app.id is not set.
15/09/22 19:43:30 INFO AppClient$ClientEndpoint: Connecting to master spark://10.1.50.71:7077...
15/09/22 19:43:32 INFO SparkDeploySchedulerBackend: Connected to Spark cluster with app ID app-20150922062654-0004
15/09/22 19:43:32 INFO AppClient$ClientEndpoint: Executor added: app-20150922062654-0004/0 on worker-20150921191458-10.1.50.71-44716 (10.1.50.71:44716) with 1 cores
15/09/22 19:43:32 INFO SparkDeploySchedulerBackend: Granted executor ID app-20150922062654-0004/0 on hostPort 10.1.50.71:44716 with 1 cores, 1024.0 MB RAM
15/09/22 19:43:32 INFO AppClient$ClientEndpoint: Executor added: app-20150922062654-0004/1 on worker-20150921191456-10.1.50.73-36446 (10.1.50.73:36446) with 1 cores
15/09/22 19:43:32 INFO SparkDeploySchedulerBackend: Granted executor ID app-20150922062654-0004/1 on hostPort 10.1.50.73:36446 with 1 cores, 1024.0 MB RAM
15/09/22 19:43:32 INFO AppClient$ClientEndpoint: Executor added: app-20150922062654-0004/2 on worker-20150921191456-10.1.50.72-53999 (10.1.50.72:53999) with 1 cores
15/09/22 19:43:32 INFO SparkDeploySchedulerBackend: Granted executor ID app-20150922062654-0004/2 on hostPort 10.1.50.72:53999 with 1 cores, 1024.0 MB RAM
15/09/22 19:43:32 INFO AppClient$ClientEndpoint: Executor updated: app-20150922062654-0004/1 is now LOADING
15/09/22 19:43:32 INFO AppClient$ClientEndpoint: Executor updated: app-20150922062654-0004/0 is now LOADING
15/09/22 19:43:32 INFO AppClient$ClientEndpoint: Executor updated: app-20150922062654-0004/2 is now LOADING
15/09/22 19:43:32 INFO AppClient$ClientEndpoint: Executor updated: app-20150922062654-0004/0 is now RUNNING
15/09/22 19:43:32 INFO AppClient$ClientEndpoint: Executor updated: app-20150922062654-0004/1 is now RUNNING
15/09/22 19:43:32 INFO AppClient$ClientEndpoint: Executor updated: app-20150922062654-0004/2 is now RUNNING
15/09/22 19:43:33 INFO Utils: Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 60161.
15/09/22 19:43:33 INFO NettyBlockTransferService: Server created on 60161
15/09/22 19:43:33 INFO BlockManagerMaster: Trying to register BlockManager
15/09/22 19:43:33 INFO BlockManagerMasterEndpoint: Registering block manager 10.1.50.68:60161 with 797.6 MB RAM, BlockManagerId(driver, 10.1.50.68, 60161)
15/09/22 19:43:33 INFO BlockManagerMaster: Registered BlockManager
15/09/22 19:43:34 INFO SparkDeploySchedulerBackend: SchedulerBackend is ready for scheduling beginning after reached minRegisteredResourcesRatio: 0.0
15/09/22 19:43:35 INFO SparkContext: Added JAR /home/stdevelop/Builds/streamingintohive.jar at http://10.1.50.68:38865/jars/streamingintohive.jar with timestamp 1442922215169
15/09/22 19:43:39 INFO SparkDeploySchedulerBackend: Registered executor: AkkaRpcEndpointRef(Actor[akka.tcp://sparkExecutor#10.1.50.72:40110/user/Executor#-132020084]) with ID 2
15/09/22 19:43:39 INFO SparkDeploySchedulerBackend: Registered executor: AkkaRpcEndpointRef(Actor[akka.tcp://sparkExecutor#10.1.50.71:38248/user/Executor#-1615730727]) with ID 0
15/09/22 19:43:40 INFO BlockManagerMasterEndpoint: Registering block manager 10.1.50.72:37819 with 534.5 MB RAM, BlockManagerId(2, 10.1.50.72, 37819)
15/09/22 19:43:40 INFO BlockManagerMasterEndpoint: Registering block manager 10.1.50.71:48028 with 534.5 MB RAM, BlockManagerId(0, 10.1.50.71, 48028)
15/09/22 19:43:42 INFO HiveContext: Initializing execution hive, version 1.2.1
15/09/22 19:43:42 INFO ClientWrapper: Inspected Hadoop version: 2.5.2
15/09/22 19:43:42 INFO ClientWrapper: Loaded org.apache.hadoop.hive.shims.Hadoop23Shims for Hadoop version 2.5.2
15/09/22 19:43:42 INFO SparkDeploySchedulerBackend: Registered executor: AkkaRpcEndpointRef(Actor[akka.tcp://sparkExecutor#10.1.50.73:56385/user/Executor#1871695565]) with ID 1
15/09/22 19:43:43 INFO BlockManagerMasterEndpoint: Registering block manager 10.1.50.73:43643 with 534.5 MB RAM, BlockManagerId(1, 10.1.50.73, 43643)
15/09/22 19:43:45 INFO HiveMetaStore: 0: Opening raw store with implemenation class:org.apache.hadoop.hive.metastore.ObjectStore
15/09/22 19:43:45 INFO ObjectStore: ObjectStore, initialize called
15/09/22 19:43:47 INFO Persistence: Property datanucleus.cache.level2 unknown - will be ignored
15/09/22 19:43:47 INFO Persistence: Property hive.metastore.integral.jdo.pushdown unknown - will be ignored
15/09/22 19:43:47 WARN Connection: BoneCP specified but not present in CLASSPATH (or one of dependencies)
15/09/22 19:43:48 WARN Connection: BoneCP specified but not present in CLASSPATH (or one of dependencies)
15/09/22 19:43:58 INFO ObjectStore: Setting MetaStore object pin classes with hive.metastore.cache.pinobjtypes="Table,StorageDescriptor,SerDeInfo,Partition,Database,Type,FieldSchema,Order"
15/09/22 19:44:03 INFO Datastore: The class "org.apache.hadoop.hive.metastore.model.MFieldSchema" is tagged as "embedded-only" so does not have its own datastore table.
15/09/22 19:44:03 INFO Datastore: The class "org.apache.hadoop.hive.metastore.model.MOrder" is tagged as "embedded-only" so does not have its own datastore table.
15/09/22 19:44:10 INFO Datastore: The class "org.apache.hadoop.hive.metastore.model.MFieldSchema" is tagged as "embedded-only" so does not have its own datastore table.
15/09/22 19:44:10 INFO Datastore: The class "org.apache.hadoop.hive.metastore.model.MOrder" is tagged as "embedded-only" so does not have its own datastore table.
15/09/22 19:44:12 INFO MetaStoreDirectSql: Using direct SQL, underlying DB is DERBY
15/09/22 19:44:12 INFO ObjectStore: Initialized ObjectStore
15/09/22 19:44:13 WARN ObjectStore: Version information not found in metastore. hive.metastore.schema.verification is not enabled so recording the schema version 1.2.0
15/09/22 19:44:14 WARN ObjectStore: Failed to get database default, returning NoSuchObjectException
15/09/22 19:44:15 INFO HiveMetaStore: Added admin role in metastore
15/09/22 19:44:15 INFO HiveMetaStore: Added public role in metastore
15/09/22 19:44:16 INFO HiveMetaStore: No user is added in admin role, since config is empty
15/09/22 19:44:16 INFO HiveMetaStore: 0: get_all_databases
15/09/22 19:44:16 INFO audit: ugi=root ip=unknown-ip-addr cmd=get_all_databases
15/09/22 19:44:17 INFO HiveMetaStore: 0: get_functions: db=default pat=*
15/09/22 19:44:17 INFO audit: ugi=root ip=unknown-ip-addr cmd=get_functions: db=default pat=*
15/09/22 19:44:17 INFO Datastore: The class "org.apache.hadoop.hive.metastore.model.MResourceUri" is tagged as "embedded-only" so does not have its own datastore table.
15/09/22 19:44:18 INFO SessionState: Created local directory: /tmp/9ee94679-df51-46bc-bf6f-66b19f053823_resources
15/09/22 19:44:18 INFO SessionState: Created HDFS directory: /tmp/hive/root/9ee94679-df51-46bc-bf6f-66b19f053823
15/09/22 19:44:18 INFO SessionState: Created local directory: /tmp/root/9ee94679-df51-46bc-bf6f-66b19f053823
15/09/22 19:44:18 INFO SessionState: Created HDFS directory: /tmp/hive/root/9ee94679-df51-46bc-bf6f-66b19f053823/_tmp_space.db
15/09/22 19:44:19 INFO HiveContext: default warehouse location is /user/hive/warehouse
15/09/22 19:44:19 INFO HiveContext: Initializing HiveMetastoreConnection version 1.2.1 using Spark classes.
15/09/22 19:44:19 INFO ClientWrapper: Inspected Hadoop version: 2.5.2
15/09/22 19:44:19 INFO ClientWrapper: Loaded org.apache.hadoop.hive.shims.Hadoop23Shims for Hadoop version 2.5.2
15/09/22 19:44:22 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
15/09/22 19:44:22 INFO HiveMetaStore: 0: Opening raw store with implemenation class:org.apache.hadoop.hive.metastore.ObjectStore
15/09/22 19:44:22 INFO ObjectStore: ObjectStore, initialize called
15/09/22 19:44:23 INFO Persistence: Property datanucleus.cache.level2 unknown - will be ignored
15/09/22 19:44:23 INFO Persistence: Property hive.metastore.integral.jdo.pushdown unknown - will be ignored
15/09/22 19:44:23 WARN Connection: BoneCP specified but not present in CLASSPATH (or one of dependencies)
15/09/22 19:44:25 WARN HiveMetaStore: Retrying creating default database after error: Error creating transactional connection factory
javax.jdo.JDOFatalInternalException: Error creating transactional connection factory
at org.datanucleus.api.jdo.NucleusJDOHelper.getJDOExceptionForNucleusException(NucleusJDOHelper.java:587)
at org.datanucleus.api.jdo.JDOPersistenceManagerFactory.freezeConfiguration(JDOPersistenceManagerFactory.java:788)
at org.datanucleus.api.jdo.JDOPersistenceManagerFactory.createPersistenceManagerFactory(JDOPersistenceManagerFactory.java:333)
at org.datanucleus.api.jdo.JDOPersistenceManagerFactory.getPersistenceManagerFactory(JDOPersistenceManagerFactory.java:202)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at javax.jdo.JDOHelper$16.run(JDOHelper.java:1965)
at java.security.AccessController.doPrivileged(Native Method)
at javax.jdo.JDOHelper.invoke(JDOHelper.java:1960)
at javax.jdo.JDOHelper.invokeGetPersistenceManagerFactoryOnImplementation(JDOHelper.java:1166)
at javax.jdo.JDOHelper.getPersistenceManagerFactory(JDOHelper.java:808)
at javax.jdo.JDOHelper.getPersistenceManagerFactory(JDOHelper.java:701)
at org.apache.hadoop.hive.metastore.ObjectStore.getPMF(ObjectStore.java:365)
at org.apache.hadoop.hive.metastore.ObjectStore.getPersistenceManager(ObjectStore.java:394)
at org.apache.hadoop.hive.metastore.ObjectStore.initialize(ObjectStore.java:291)
at org.apache.hadoop.hive.metastore.ObjectStore.setConf(ObjectStore.java:258)
at org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:73)
at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:133)
at org.apache.hadoop.hive.metastore.RawStoreProxy.<init>(RawStoreProxy.java:57)
at org.apache.hadoop.hive.metastore.RawStoreProxy.getProxy(RawStoreProxy.java:66)
at org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.newRawStore(HiveMetaStore.java:593)
at org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.getMS(HiveMetaStore.java:571)
at org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.createDefaultDB(HiveMetaStore.java:620)
at org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.init(HiveMetaStore.java:461)
at org.apache.hadoop.hive.metastore.RetryingHMSHandler.<init>(RetryingHMSHandler.java:66)
at org.apache.hadoop.hive.metastore.RetryingHMSHandler.getProxy(RetryingHMSHandler.java:72)
at org.apache.hadoop.hive.metastore.HiveMetaStore.newRetryingHMSHandler(HiveMetaStore.java:5762)
at org.apache.hadoop.hive.metastore.HiveMetaStoreClient.<init>(HiveMetaStoreClient.java:199)
at org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient.<init>(SessionHiveMetaStoreClient.java:74)
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
at org.apache.hadoop.hive.metastore.MetaStoreUtils.newInstance(MetaStoreUtils.java:1521)
at org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.<init>(RetryingMetaStoreClient.java:86)
at org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.getProxy(RetryingMetaStoreClient.java:132)
at org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.getProxy(RetryingMetaStoreClient.java:104)
at org.apache.hadoop.hive.ql.metadata.Hive.createMetaStoreClient(Hive.java:3005)
at org.apache.hadoop.hive.ql.metadata.Hive.getMSC(Hive.java:3024)
at org.apache.hadoop.hive.ql.metadata.Hive.getAllDatabases(Hive.java:1234)
at org.apache.hadoop.hive.ql.metadata.Hive.reloadFunctions(Hive.java:174)
at org.apache.hadoop.hive.ql.metadata.Hive.<clinit>(Hive.java:166)
at org.apache.hadoop.hive.ql.session.SessionState.start(SessionState.java:503)
at org.apache.spark.sql.hive.client.ClientWrapper.<init>(ClientWrapper.scala:171)
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
at org.apache.spark.sql.hive.client.IsolatedClientLoader.liftedTree1$1(IsolatedClientLoader.scala:183)
at org.apache.spark.sql.hive.client.IsolatedClientLoader.<init>(IsolatedClientLoader.scala:179)
at org.apache.spark.sql.hive.HiveContext.metadataHive$lzycompute(HiveContext.scala:227)
at org.apache.spark.sql.hive.HiveContext.metadataHive(HiveContext.scala:186)
at org.apache.spark.sql.hive.HiveContext.setConf(HiveContext.scala:393)
at org.apache.spark.sql.hive.HiveContext.defaultOverrides(HiveContext.scala:175)
at org.apache.spark.sql.hive.HiveContext.<init>(HiveContext.scala:178)
at StreamingIntoHive$.main(StreamingIntoHive.scala:42)
at StreamingIntoHive.main(StreamingIntoHive.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at com.intellij.rt.execution.application.AppMain.main(AppMain.java:140)
NestedThrowablesStackTrace:
java.lang.reflect.InvocationTargetException
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
at org.datanucleus.plugin.NonManagedPluginRegistry.createExecutableExtension(NonManagedPluginRegistry.java:631)
at org.datanucleus.plugin.PluginManager.createExecutableExtension(PluginManager.java:325)
at org.datanucleus.store.AbstractStoreManager.registerConnectionFactory(AbstractStoreManager.java:282)
at org.datanucleus.store.AbstractStoreManager.<init>(AbstractStoreManager.java:240)
at org.datanucleus.store.rdbms.RDBMSStoreManager.<init>(RDBMSStoreManager.java:286)
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
at org.datanucleus.plugin.NonManagedPluginRegistry.createExecutableExtension(NonManagedPluginRegistry.java:631)
at org.datanucleus.plugin.PluginManager.createExecutableExtension(PluginManager.java:301)
at org.datanucleus.NucleusContext.createStoreManagerForProperties(NucleusContext.java:1187)
at org.datanucleus.NucleusContext.initialise(NucleusContext.java:356)
at org.datanucleus.api.jdo.JDOPersistenceManagerFactory.freezeConfiguration(JDOPersistenceManagerFactory.java:775)
at org.datanucleus.api.jdo.JDOPersistenceManagerFactory.createPersistenceManagerFactory(JDOPersistenceManagerFactory.java:333)
at org.datanucleus.api.jdo.JDOPersistenceManagerFactory.getPersistenceManagerFactory(JDOPersistenceManagerFactory.java:202)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at javax.jdo.JDOHelper$16.run(JDOHelper.java:1965)
at java.security.AccessController.doPrivileged(Native Method)
at javax.jdo.JDOHelper.invoke(JDOHelper.java:1960)
at javax.jdo.JDOHelper.invokeGetPersistenceManagerFactoryOnImplementation(JDOHelper.java:1166)
at javax.jdo.JDOHelper.getPersistenceManagerFactory(JDOHelper.java:808)
at javax.jdo.JDOHelper.getPersistenceManagerFactory(JDOHelper.java:701)
at org.apache.hadoop.hive.metastore.ObjectStore.getPMF(ObjectStore.java:365)
at org.apache.hadoop.hive.metastore.ObjectStore.getPersistenceManager(ObjectStore.java:394)
at org.apache.hadoop.hive.metastore.ObjectStore.initialize(ObjectStore.java:291)
at org.apache.hadoop.hive.metastore.ObjectStore.setConf(ObjectStore.java:258)
at org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:73)
at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:133)
at org.apache.hadoop.hive.metastore.RawStoreProxy.<init>(RawStoreProxy.java:57)
at org.apache.hadoop.hive.metastore.RawStoreProxy.getProxy(RawStoreProxy.java:66)
at org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.newRawStore(HiveMetaStore.java:593)
at org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.getMS(HiveMetaStore.java:571)
at org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.createDefaultDB(HiveMetaStore.java:620)
at org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.init(HiveMetaStore.java:461)
at org.apache.hadoop.hive.metastore.RetryingHMSHandler.<init>(RetryingHMSHandler.java:66)
at org.apache.hadoop.hive.metastore.RetryingHMSHandler.getProxy(RetryingHMSHandler.java:72)
at org.apache.hadoop.hive.metastore.HiveMetaStore.newRetryingHMSHandler(HiveMetaStore.java:5762)
at org.apache.hadoop.hive.metastore.HiveMetaStoreClient.<init>(HiveMetaStoreClient.java:199)
at org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient.<init>(SessionHiveMetaStoreClient.java:74)
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
at org.apache.hadoop.hive.metastore.MetaStoreUtils.newInstance(MetaStoreUtils.java:1521)
at org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.<init>(RetryingMetaStoreClient.java:86)
at org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.getProxy(RetryingMetaStoreClient.java:132)
at org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.getProxy(RetryingMetaStoreClient.java:104)
at org.apache.hadoop.hive.ql.metadata.Hive.createMetaStoreClient(Hive.java:3005)
at org.apache.hadoop.hive.ql.metadata.Hive.getMSC(Hive.java:3024)
at org.apache.hadoop.hive.ql.metadata.Hive.getAllDatabases(Hive.java:1234)
at org.apache.hadoop.hive.ql.metadata.Hive.reloadFunctions(Hive.java:174)
at org.apache.hadoop.hive.ql.metadata.Hive.<clinit>(Hive.java:166)
at org.apache.hadoop.hive.ql.session.SessionState.start(SessionState.java:503)
at org.apache.spark.sql.hive.client.ClientWrapper.<init>(ClientWrapper.scala:171)
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
at org.apache.spark.sql.hive.client.IsolatedClientLoader.liftedTree1$1(IsolatedClientLoader.scala:183)
at org.apache.spark.sql.hive.client.IsolatedClientLoader.<init>(IsolatedClientLoader.scala:179)
at org.apache.spark.sql.hive.HiveContext.metadataHive$lzycompute(HiveContext.scala:227)
at org.apache.spark.sql.hive.HiveContext.metadataHive(HiveContext.scala:186)
at org.apache.spark.sql.hive.HiveContext.setConf(HiveContext.scala:393)
at org.apache.spark.sql.hive.HiveContext.defaultOverrides(HiveContext.scala:175)
at org.apache.spark.sql.hive.HiveContext.<init>(HiveContext.scala:178)
at StreamingIntoHive$.main(StreamingIntoHive.scala:42)
at StreamingIntoHive.main(StreamingIntoHive.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at com.intellij.rt.execution.application.AppMain.main(AppMain.java:140)
Caused by: java.lang.OutOfMemoryError: PermGen space
at java.lang.ClassLoader.defineClass1(Native Method)
at java.lang.ClassLoader.defineClass(ClassLoader.java:800)
at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142)
at java.net.URLClassLoader.defineClass(URLClassLoader.java:449)
at java.net.URLClassLoader.access$100(URLClassLoader.java:71)
at java.net.URLClassLoader$1.run(URLClassLoader.java:361)
at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
at org.apache.spark.sql.hive.client.IsolatedClientLoader$$anon$1.doLoadClass(IsolatedClientLoader.scala:165)
at org.apache.spark.sql.hive.client.IsolatedClientLoader$$anon$1.loadClass(IsolatedClientLoader.scala:153)
at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
at org.datanucleus.store.rdbms.connectionpool.DBCPConnectionPoolFactory.createConnectionPool(DBCPConnectionPoolFactory.java:59)
at org.datanucleus.store.rdbms.ConnectionFactoryImpl.generateDataSources(ConnectionFactoryImpl.java:238)
at org.datanucleus.store.rdbms.ConnectionFactoryImpl.initialiseDataSources(ConnectionFactoryImpl.java:131)
at org.datanucleus.store.rdbms.ConnectionFactoryImpl.<init>(ConnectionFactoryImpl.java:85)
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
at org.datanucleus.plugin.NonManagedPluginRegistry.createExecutableExtension(NonManagedPluginRegistry.java:631)
at org.datanucleus.plugin.PluginManager.createExecutableExtension(PluginManager.java:325)
at org.datanucleus.store.AbstractStoreManager.registerConnectionFactory(AbstractStoreManager.java:282)
at org.datanucleus.store.AbstractStoreManager.<init>(AbstractStoreManager.java:240)
………………
………………
Process finished with exit code 1
===============
Can anyone help me to find out the reason?
Thanks.
I believe adding a HiveContext may help, if one has not already been added in the code.
import org.apache.spark.sql.hive.HiveContext

val hiveContext = new HiveContext(sc)
import hiveContext.implicits._
import hiveContext.sql
A hive context adds support for finding tables in the MetaStore and writing queries using HiveQL. Users who do not have an existing Hive deployment can still create a HiveContext. When not configured by the hive-site.xml, the context automatically creates metastore_db and warehouse in the current directory. -- From spark example
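A quick way to confirm which metastore the context actually reached (a sketch; "testtable" is the table from the question and is assumed to exist in the MySQL-backed metastore):
// If this shows only "default" and a metastore_db directory appears in the
// working directory, the context fell back to embedded Derby, not MySQL
hiveContext.sql("show databases").show()
hiveContext.table("testtable").printSchema()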
The issue was resolved as follows.
I had created metastore_db manually, so if Spark connected to MySQL it would pass the direct-SQL DbType test. Since it did not pass that test, Spark fell back to Derby as the default metastore DB, as shown in the log4j output. This implies Spark was not connecting to metastore_db in MySQL even though hive-site.xml was correctly configured. I found a solution for this, which is described in the question comments. Hope it helps.
If you are using a MySQL database, specify the MySQL driver path in the spark-env.sh file.
Sample code:
# Put the MySQL JDBC driver on Spark's classpath so the metastore client
# can reach the MySQL-backed metastore_db
export SPARK_CLASSPATH="/home/${user.name}/work/apache-hive-2.0.0-bin/lib/mysql-connector-java-5.1.38-bin.jar"
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
export YARN_CONF_DIR=$HADOOP_HOME/etc/hadoop
# Log and worker directories
export SPARK_LOG_DIR=/home/hadoop/work/spark-2.0.0-bin-hadoop2.7/logs
export SPARK_WORKER_DIR=/home/hadoop/work/spark-2.0.0-bin-hadoop2.7/logs/worker_dir
More info: https://www.youtube.com/watch?v=y3E8tUFhBt0

Resources