JDBC from Informix to Spark using DataFrames

I can connect to the Informix database using a simple JDBC program, but when I try to load tables using Spark DataFrames I get an exception. Do we need to use a specific connector for an Informix-Spark connection?
Below is the stack trace of the exception:
java.sql.SQLException: System or internal error java.lang.NumberFormatException: For input string: "table_name"
at com.informix.util.IfxErrMsg.getSQLException(IfxErrMsg.java:482)
at com.informix.jdbc.IfxChar.toLong(IfxChar.java:666)
at com.informix.jdbc.IfxResultSet.getLong(IfxResultSet.java:1123)
at org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD$$anon$1.getNext(JDBCRDD.scala:411)
at org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD$$anon$1.hasNext(JDBCRDD.scala:472)
at org.apache.spark.sql.execution.datasources.DefaultWriterContainer.writeRows(WriterContainer.scala:241)
at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1$$anonfun$apply$mcV$sp$3.apply(InsertIntoHadoopFsRelation.scala:150)
at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1$$anonfun$apply$mcV$sp$3.apply(InsertIntoHadoopFsRelation.scala:150)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
at org.apache.spark.scheduler.Task.run(Task.scala:88)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)

When Spark generates its database queries, it puts the column names in double quotes. To accommodate this, you need to add the following to the JDBC connection URL:
DELIMIDENT=Y
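For example, here is a minimal PySpark read sketch; the host, port, server name, and credentials below are placeholder assumptions, not values from the question:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("informix-read").getOrCreate()

# DELIMIDENT=Y makes the Informix driver treat double-quoted identifiers
# as delimited identifiers instead of string literals.
url = "jdbc:informix-sqli://db_host:9088/db_name:INFORMIXSERVER=server_name;DELIMIDENT=Y"

df = (spark.read.format("jdbc")
      .option("url", url)
      .option("driver", "com.informix.jdbc.IfxDriver")
      .option("dbtable", "table_name")
      .option("user", "informix")
      .option("password", "password")
      .load())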

From the stack trace it seems that the connection to the Informix database succeeds.
The problem is probably with reading data from Informix: Spark calls getNext(), which calls getLong(), and getLong() receives 'table_name', which cannot be parsed as a number.
I don't know Spark. Maybe add some details (probably code) about how you use Spark.

Related

Access the Spark Hive metastore within a UDF running in the workers (Databricks)

Context
I have an operation that should be performed on some tables using PySpark.
This operation includes accessing the Spark metastore (in Databricks) to get some metadata.
Since I have plenty of tables, I'm parallelizing this operation across the cluster workers with an RDD, as you can see in the code below:
from pyspark import SparkContext

base_spark_context = SparkContext.getOrCreate()
rdd = base_spark_context.parallelize(tables_list)
# runs sync_table() for each table name on the executors
rdd.map(lambda table_name: sync_table(table_name)).collect()
The UDF sync_table() runs queries on the metastore, similar to this code line:
spark_client.session.sql("select 1")
Problem
The problem is that this SQL execution does not succeed; instead I get a metastore-related error. Traceback:
py4j.protocol.Py4JJavaError: An error occurred while calling o20.sql.
: java.lang.RuntimeException: java.lang.RuntimeException: Unable to instantiate org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient
(suppressed lines)
Caused by: java.lang.reflect.InvocationTargetException
(suppressed lines)
Caused by: javax.jdo.JDOFatalDataStoreException: Unable to open a test connection to the given database. JDBC url = jdbc:derby:;databaseName=metastore_db;create=true, username = APP. Terminating connection pool (set lazyInit to true if you expect to start your database after your app). Original Exception: ------
java.sql.SQLException: Failed to start database 'metastore_db' with class loader sun.misc.Launcher$AppClassLoader@16c0663d, see the next exception for details.
(suppressed lines)
Caused by: ERROR XSDB6: Another instance of Derby may have already booted the database /databricks/spark/work/app-20210413201900-0000/0/metastore_db.
Is there any limitation on accessing the Databricks metastore from within a worker after parallelizing the operation in this way? Or is there a way to perform such an operation?

Issue while loading an Avro dataset into Teradata with Spark Streaming

I am trying to load a dataset of Avro files into a Teradata table through Spark Streaming (JDBC). The configuration is properly set and the load succeeds to a certain extent (I can validate that rows of data have been inserted into the table), but halfway through I start getting exceptions and the load fails. The stack trace is below. Any inkling as to what might be causing this?
18/02/08 17:27:42 ERROR executor.Executor: Exception in task 2.0 in stage 0.0 (TID 0)
java.sql.BatchUpdateException: [Teradata JDBC Driver] [TeraJDBC 16.20.00.02] [Error 1154] [SQLState HY000] A failure occurred while inserting the batch of rows destined for database table "database"."table". Details of the failure can be found in the exception chain that is accessible with getNextException.
at com.teradata.jdbc.jdbc_4.util.ErrorFactory.makeBatchUpdateException(ErrorFactory.java:149)
at com.teradata.jdbc.jdbc_4.util.ErrorFactory.makeBatchUpdateException(ErrorFactory.java:133)
at com.teradata.jdbc.jdbc.fastload.FastLoadManagerPreparedStatement.executeBatch(FastLoadManagerPreparedStatement.java:2389)
at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.savePartition(JdbcUtils.scala:592)
at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$saveTable$1.apply(JdbcUtils.scala:670)
at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$saveTable$1.apply(JdbcUtils.scala:670)
at org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1$$anonfun$apply$29.apply(RDD.scala:926)
at org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1$$anonfun$apply$29.apply(RDD.scala:926)
at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1951)
at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1951)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:99)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:322)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.sql.SQLException: [Teradata JDBC Driver] [TeraJDBC 16.20.00.02] [Error 1147] [SQLState HY000] The next failure(s) in the exception chain occurred while beginning FastLoad of database table "database"."table"
at com.teradata.jdbc.jdbc_4.util.ErrorFactory.makeDriverJDBCException(ErrorFactory.java:95)
at com.teradata.jdbc.jdbc_4.util.ErrorFactory.makeDriverJDBCException(ErrorFactory.java:70)
at com.teradata.jdbc.jdbc.fastload.FastLoadManagerPreparedStatement.beginFastLoad(FastLoadManagerPreparedStatement.java:966)
at com.teradata.jdbc.jdbc.fastload.FastLoadManagerPreparedStatement.executeBatch(FastLoadManagerPreparedStatement.java:2210)
The issue stems from trying to load data in append mode into a table that already contains rows. FastLoad doesn't support that: the target table must be empty (i.e., truncated) each time the process runs. This makes FastLoad useful for staging data prior to processing it, but not for accumulating it.
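As a hedged sketch of one way to satisfy that constraint (assuming a Spark version that supports the JDBC writer's truncate option, with placeholder URL, table, and credentials): overwrite mode combined with truncate empties the table on each run instead of dropping and recreating it, so FastLoad always starts from an empty table.

(df.write.format("jdbc")
   .option("url", "jdbc:teradata://db_host/DATABASE=database,TYPE=FASTLOAD")
   .option("dbtable", "table")
   .option("user", "user")
   .option("password", "password")
   .option("truncate", "true")  # TRUNCATE the table rather than DROP/CREATE it
   .mode("overwrite")
   .save())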

Hive on Spark CDH5.7 Execution Error

I've recently updated my cluster to CDH 5.7 and I am trying to run Hive query processing on Spark.
I have configured the Hive client to use the Spark execution engine and set the Hive dependency on a Spark service in Cloudera Manager.
Via Hue, I'm running a simple SELECT query but seem to get this error every time: Error while processing statement: FAILED: Execution Error, return code 3 from org.apache.hadoop.hive.ql.exec.spark.SparkTask
Following are the logs for the same:
ERROR operation.Operation: Error running hive query:
org.apache.hive.service.cli.HiveSQLException: Error while processing statement: FAILED: Execution Error, return code 3 from org.apache.hadoop.hive.ql.exec.spark.SparkTask
at org.apache.hive.service.cli.operation.Operation.toSQLException(Operation.java:374)
at org.apache.hive.service.cli.operation.SQLOperation.runQuery(SQLOperation.java:180)
at org.apache.hive.service.cli.operation.SQLOperation.access$100(SQLOperation.java:72)
at org.apache.hive.service.cli.operation.SQLOperation$2$1.run(SQLOperation.java:232)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1693)
at org.apache.hive.service.cli.operation.SQLOperation$2.run(SQLOperation.java:245)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
at java.util.concurrent.FutureTask.run(FutureTask.java:262)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
Any help to solve this would be great!
This problem is due to an open JIRA: https://issues.apache.org/jira/browse/HIVE-11519. You should use another serialization tool.
Execution Error, return code 3 from org.apache.hadoop.hive.ql.exec.spark.SparkTask
is not the real error message; you should turn on DEBUG output using the Hive CLI, like
bin/hive --hiveconf hive.root.logger=DEBUG,console
and you will get more detailed logs. For example, here is something I got before:
16/03/17 13:55:43 [fxxxxxxxxxxxxxxxx4 main]: INFO exec.SerializationUtilities: Serializing MapWork using kryo
java.lang.NoSuchMethodError: com.fasterxml.jackson.module.scala.deser.BigDecimalDeserializer$.handledType()Ljava/lang/Class;
This is caused by dependency conflicts; see https://issues.apache.org/jira/browse/HIVE-13301 for details.

Connection error while writing into Cassandra using pyspark

I am trying to write data into Cassandra from the PySpark shell with this command:
dataframe_name.write.format("org.apache.spark.sql.cassandra").options(table="table_name",keyspace="keyspace_name").save(mode="append")
but I am getting the following error:
15/09/15 06:37:18 ERROR DAGScheduler: Failed to update accumulators for ResultTask(2, 198)
java.net.ConnectException: Connection refused
at java.net.PlainSocketImpl.socketConnect(Native Method)
at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:339)
at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:200)
at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:182)
at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
at java.net.Socket.connect(Socket.java:579)
at java.net.Socket.connect(Socket.java:528)
at java.net.Socket.<init>(Socket.java:425)
at java.net.Socket.<init>(Socket.java:208)
at org.apache.spark.api.python.PythonAccumulatorParam.openSocket(PythonRDD.scala:813)
at org.apache.spark.api.python.PythonAccumulatorParam.addInPlace(PythonRDD.scala:828)
at org.apache.spark.api.python.PythonAccumulatorParam.addInPlace(PythonRDD.scala:798)
at org.apache.spark.Accumulable.$plus$plus$eq(Accumulators.scala:80)
at org.apache.spark.Accumulators$$anonfun$add$2.apply(Accumulators.scala:342)
at org.apache.spark.Accumulators$$anonfun$add$2.apply(Accumulators.scala:337)
at scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772)
at scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:98)
at scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:98)
at scala.collection.mutable.HashTable$class.foreachEntry(HashTable.scala:226)
at scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:39)
at scala.collection.mutable.HashMap.foreach(HashMap.scala:98)
at scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:771)
at org.apache.spark.Accumulators$.add(Accumulators.scala:337)
at org.apache.spark.scheduler.DAGScheduler.updateAccumulators(DAGScheduler.scala:945)
at org.apache.spark.scheduler.DAGScheduler.handleTaskCompletion(DAGScheduler.scala:1014)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1454)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1418)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
I have tried executing the same operation from a plain Python shell on the PySpark machine, and it works fine:
from cassandra.cluster import Cluster

cluster = Cluster(['ip_of_cassandra_machine'])
session = cluster.connect('keyspace_name')
a = session.prepare("""insert into table_name(t1, t2) values('value1', 'value2')""")
session.execute(a)
This looks like a networking issue within Spark. Without the exact versions of Spark and the Spark Cassandra Connector it is hard to diagnose. My guess is that the driver is incorrectly set up for communication with the executors. Are you sure that your driver application is reachable by your executors, and vice versa?
You can always test with --master local to see whether the problem persists when networking is out of the picture.
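For instance, a hypothetical local-mode launch of the PySpark shell; the connector package coordinates and the Cassandra host are assumptions that must match your Spark/Scala versions and environment:

pyspark --master local[*] \
  --packages com.datastax.spark:spark-cassandra-connector_2.10:1.5.0 \
  --conf spark.cassandra.connection.host=ip_of_cassandra_machine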

Accessing IMap from an EntryProcessor

Can IMap or other Hazelcast distributed data structures, like AtomicLong, be accessed from within the process() method of an EntryProcessor?
I'm getting the following exception:
java.util.concurrent.ExecutionException: java.lang.IllegalThreadStateException: Thread[hz.Alcatraz-ANP-Sys-HAZLE-2.actiance.local.partition-operation.thread-5,5,Alcatraz-ANP-Sys-HAZLE-2.actiance.local] cannot make remote call: com.hazelcast.concurrent.lock.operations.LockOperation#3229190f
at java.util.concurrent.FutureTask.report(FutureTask.java:122) ~[na:1.7.0_51]
at java.util.concurrent.FutureTask.get(FutureTask.java:188) ~[na:1.7.0_51]
at com.hazelcast.executor.impl.DistributedExecutorService$CallableProcessor.run(DistributedExecutorService.java:189) ~[hazelcast-3.3_actiance.jar:3.3]
at com.hazelcast.util.executor.CachedExecutorServiceDelegate$Worker.run(CachedExecutorServiceDelegate.java:209) [hazelcast-3.3_actiance.jar:3.3]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) [na:1.7.0_51]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) [na:1.7.0_51]
at java.lang.Thread.run(Thread.java:744) [na:1.7.0_51]
at com.hazelcast.util.executor.HazelcastManagedThread.executeRun(HazelcastManagedThread.java:76) [hazelcast-3.3_actiance.jar:3.3]
at com.hazelcast.util.executor.HazelcastManagedThread.run(HazelcastManagedThread.java:92) [hazelcast-3.3_actiance.jar:3.3]
I'm using Hazelcast version 3.3.
You can access other data structures, but you need to make sure they're in the same data partition as the currently processed entry. This means you can use (for example) data affinity to pin all the data together in the same partition.
Sharing an IAtomicLong between different partitions is not possible, though.
PS: You also shouldn't mutate any data other than the currently processed entry, since doing so can end up in a deadlock between different nodes.
