Apache Spark SQL cannot SELECT Cassandra timestamp columns - apache-spark

I created Docker containers in which I installed Apache Spark 3.1.2 (Hadoop 3.2); these host a ThriftServer that is configured to access Cassandra via the spark-cassandra-connector (3.1.0). Each of these services runs in its own container, so I have 5 containers up (1x Spark master, 2x Spark worker, 1x Spark ThriftServer, 1x Cassandra), all configured via docker-compose to live on the same network.
I use the beeline client from Apache Hive (1.2.1) to query the database. Everything works fine, except for querying a Cassandra column of type timestamp, which fails with:
org.apache.spark.SparkException: Job aborted due to stage failure: Task 9 in stage 0.0 failed 4 times, most recent failure: Lost task 9.3 in stage 0.0 (TID 53) (192.168.80.5 executor 0): java.lang.ClassCastException: java.sql.Timestamp cannot be cast to java.time.Instant
I checked the Spark and spark-cassandra-connector documentation but didn't find much, except for a configuration property called spark.sql.datetime.java8API.enabled, which is described as follows:
If the configuration property is set to true, java.time.Instant and java.time.LocalDate classes of Java 8 API are used as external types for Catalyst's TimestampType and DateType. If it is set to false, java.sql.Timestamp and java.sql.Date are used for the same purpose.
I think this property could help in my case. Although the docs say the default value is false, it is always true in my case. I don't set it anywhere, and I've tried overriding it with false in the $SPARK_HOME/conf/spark-defaults.conf file and via --conf command-line parameters when starting the ThriftServer (and the master/worker instances), but the Environment tab (at localhost:4040) always shows it as true.
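For reference, if one builds the SparkSession directly instead of going through the ThriftServer scripts, the property would be set roughly like this (a sketch only; the app name and catalog name are illustrative, and the catalog class is the one shipped with spark-cassandra-connector 3.x):
import org.apache.spark.sql.SparkSession
// Sketch: force the pre-Java-8 time classes (java.sql.Timestamp / java.sql.Date) as
// Catalyst's external types. The master is expected to come from spark-submit.
val spark = SparkSession.builder()
  .appName("cassandra-timestamp-repro")
  .config("spark.sql.datetime.java8API.enabled", "false")
  .config("spark.sql.catalog.cass", "com.datastax.spark.connector.datasource.CassandraCatalog")
  .getOrCreate()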
Is there a way to make Spark convert the timestamps so that this exception doesn't occur? It would be important to do this in SQL, since I want to connect data-visualization software later on.

I found this JIRA issue, which mentions a bug in time conversion that is not fixed in 3.1.2 (3.1.3 is not released yet) but is fixed in 3.0.3.
I downgraded to Apache Spark 3.0.3 and spark-cassandra-connector 3.0.1, which seems to solve the problem for now.
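If you pull the connector into your own Spark application rather than the ThriftServer setup above, the downgraded coordinates look roughly like this in sbt (a sketch assuming Scala 2.12; only the dependency lines are shown):
// build.sbt sketch of the downgraded stack
libraryDependencies ++= Seq(
  "org.apache.spark"   %% "spark-sql"                 % "3.0.3" % "provided",
  "com.datastax.spark" %% "spark-cassandra-connector" % "3.0.1"
)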

Related

NullPointerException in Spark Thrift Server using Apache Superset or Redash

I get a NullPointerException after connecting BI tools like Redash or Superset to a Spark Thriftserver (both tools use PyHive). Apache Zeppelin works fine for queries against the STS, and I could never reproduce the error there (Zeppelin uses org.apache.hive.jdbc.HiveDriver; see the JDBC sketch after my setup below).
DB engine Error
hive error: ('Query error', 'Error running query: java.lang.NullPointerException')
This sends the STS into a state from which only a restart can bring it back; queries from all clients then fail (Zeppelin, beeline, Redash, Superset). It seems to occur mostly when the schema is fetched automatically (which doesn't quite work: the DB name is fetched correctly, but the table names are wrong). While browsing the PyHive code I encountered some incompatibilities between PyHive and the STS (like this and this). The connection between Redash/Superset and the STS works; I am able to run queries until the Thriftserver enters the broken state.
I understand why the schema refresh doesn't work (and might be able to work around it), but I don't understand why the Thriftserver enters an unrecoverable, broken state with the NullPointerException.
My setup:
Kubernetes
Delta Lake with data formatted as delta
Hive Metastore
Spark Cluster where a Spark Thriftserver is started: start-thriftserver.sh --total-executor-cores 3 --driver-memory 3G --executor-memory 1536M --hiveconf hive.server2.thrift.port 10000 --hiveconf hive.server2.thrift.max.worker.threads 2000 --hiveconf hive.server2.thrift.bind.host my-host
(I also tried spark.sql.thriftServer.incrementalCollect=false but that didn't affect anything.)
Redash / Apache Superset connected to the STS
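For reference, a minimal sketch of querying the STS over the org.apache.hive.jdbc.HiveDriver path that Zeppelin uses (host and port mirror the start command above; the query itself is just illustrative):
import java.sql.DriverManager
// Minimal sketch: talk to the Spark Thrift Server over the HiveServer2 JDBC protocol.
Class.forName("org.apache.hive.jdbc.HiveDriver")
val conn = DriverManager.getConnection("jdbc:hive2://my-host:10000/default")
val stmt = conn.createStatement()
val rs   = stmt.executeQuery("SELECT 1")
while (rs.next()) println(rs.getInt(1))
rs.close(); stmt.close(); conn.close()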

Apache Spark Cassandra dataframe load error

I have an error with a Spark-Cassandra load. Please help!
This is a known bug in the alpha version of the Spark Cassandra Connector 3.0. You need to use the 3.0.0-beta version that was released this week.
P.S. You don't need to create a SparkSession instance in Zeppelin; it's already there. You can set the Cassandra properties in the interpreter settings, or even pass them via option when reading or writing...
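For example, passing everything via options when reading (a sketch; the host, keyspace and table names are placeholders):
// Sketch: read a Cassandra table using Zeppelin's existing SparkSession (bound as "spark").
val df = spark.read
  .format("org.apache.spark.sql.cassandra")
  .option("spark.cassandra.connection.host", "cassandra-host")
  .option("keyspace", "my_keyspace")
  .option("table", "my_table")
  .load()
df.show(10)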

CDH 6.2 Hive cannot execute queries neither on Spark nor MapReduce

I'm trying to run a simple select count(*) from table query on Hive, but it fails with the following error:
FAILED: Execution Error, return code 30041 from org.apache.hadoop.hive.ql.exec.spark.SparkTask. Failed to create Spark client for Spark session 5414a8a4-5252-4ccf-b63e-2ee563f7d772_0: java.lang.ClassNotFoundException: org.apache.spark.SparkConf
This is happening since I've moved to CDH 6.2 and enabled Spark (version 2.4.0-cdh6.2.0) as the execution engine of Hive (version 2.1.1-cdh6.2.0).
My guess is that Hive is not correctly configured to launch Spark. I've tried setting the spark.home property in hive-site.xml to /opt/cloudera/parcels/CDH/lib/spark/, and setting the SPARK_HOME environment variable to the same value, but it made no difference.
A similar issue was reported here, but the solution (i.e., putting the spark-assembly.jar file in Hive's lib directory) cannot be applied, as that file is no longer built in recent Spark versions.
A previous question addressed a similar but different issue, related to memory limits on YARN.
Also, switching to MapReduce as the execution engine still fails, but with a different error:
FAILED: Execution Error, return code -101 from org.apache.hadoop.hive.ql.exec.mr.MapRedTask. org/apache/hadoop/hdfs/protocol/SystemErasureCodingPolicies
Searching Google for the latter error shows no results at all.
UPDATE: I discovered that queries do work when connecting to Hive through other tools (e.g., Beeline, Hue, Spark) and independently of the underlying execution engine (i.e., MapReduce or Spark). Thus, the error may lie within the Hive CLI, which is currently deprecated.
UPDATE 2: The same problem actually happened on Beeline and Hue with a CREATE TABLE query; I was able to execute it only with the Hive interpreter in Zeppelin.

Apache/Cloudera HUE / Livy Spark Server - InterpreterError: Fail to start interpreter

I'm at a loss at this point. I'm trying to run PySpark/SparkR on Apache HUE 4.3, using Spark 2.4 + Livy Server 0.5.0. I've followed every guide I can find, but I keep running into this issue. Basically, I can run PySpark/SparkR through the command line, but HUE, for some reason, does the following:
Ignores all Spark configuration (executor memory, cores, etc.) that I have set in multiple places (spark-defaults.conf, livy.conf and livy-client.conf)
Successfully creates sessions for both PySpark and SparkR, yet when I try to do anything (even just print(1+1)), I get InterpreterError: Fail to start interpreter
Works with Scala on HUE: Scala works, but PySpark and SparkR do not (presumably because Scala is Java-based)
I can provide any configuration needed. This is driving me absolutely insane.
I also cannot interact with PySpark through the REST API; I get the same InterpreterError. This leads me to believe the problem is in the Livy Server rather than HUE.
Figured it out. I was trying to run Spark on YARN in cluster mode; I switched to client mode and that fixed it. It must have been a missing reference/file on the cluster machines.

Getting "AssertionError("Unknown application type")" when Connecting to DSE 5.1.0 Spark

I am connecting to DSE (Spark) using this:
new SparkConf()
.setAppName(name)
.setMaster("spark://localhost:7077")
With DSE 5.0.8 (Spark 1.6.3) it works fine, but it now fails with DSE 5.1.0 with this error:
java.lang.AssertionError: Unknown application type
at org.apache.spark.deploy.master.DseSparkMaster.registerApplication(DseSparkMaster.scala:88) ~[dse-spark-5.1.0.jar:2.0.2.6]
After checking the dse-spark jar, I've come up with this:
if(rpcendpointref instanceof DseAppProxy)
And within Spark, it seems to be an RpcEndpointRef (NettyRpcEndpointRef).
How can I fix this problem?
I had a similar issue, and fixed it by following this:
https://docs.datastax.com/en/dse/5.1/dse-dev/datastax_enterprise/spark/sparkRemoteCommands.html
Then you need to run your job using dse spark-submit, without specifying any master.
Resource Manager Changes
The DSE Spark Resource Manager is different from the OSS Spark Standalone Resource Manager. The DSE method uses a different URI, "dse://", because under the hood it is actually performing a CQL-based request. This has a number of benefits over the Spark RPC, but as noted it does not match some of the submission mechanisms possible in OSS Spark (see the sketch after the links below).
There are several articles on this on the DataStax blog, as well as documentation notes:
Network Security with DSE 5.1 Spark Resource Manager
Process Security with DSE 5.1 Spark Resource Manager
Instructions on the URL Change
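Put differently, the question's SparkConf would point at the DSE resource manager rather than a spark:// master, roughly like this (a sketch; the host is a placeholder, and with dse spark-submit you normally do not set the master at all):
import org.apache.spark.SparkConf
// Sketch: DSE 5.1's resource manager is addressed with a dse:// URI instead of spark://.
val conf = new SparkConf()
  .setAppName("my-app")
  .setMaster("dse://localhost")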
Programmatic Spark Jobs
While it is still possible to launch an application using "setJars", you must also add the DSE-specific jars and config options to talk to the resource manager. In DSE 5.1.3+ there is a provided class, DseConfiguration, which can be applied to your SparkConf via DseConfiguration.enableDseSupport(conf) (or invoked via implicit) and which will set these options for you.
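A rough sketch of that programmatic route might look like the following; note that the import path for DseConfiguration is an assumption on my part (the class ships in the DSE 5.1.3+ jars), so check it against your DSE installation:
import org.apache.spark.SparkConf
// ASSUMPTION: the package below is a placeholder guess; DseConfiguration ships in the
// DSE 5.1.3+ jars and its real package must be checked in your DSE installation.
import com.datastax.bdp.spark.DseConfiguration
// Sketch: register your application jar(s) plus the DSE-specific jars, then let
// enableDseSupport fill in the resource-manager options described above.
val conf = new SparkConf()
  .setAppName("my-app")
  .setJars(Seq("target/my-app.jar"))   // illustrative path; add the DSE jars here too
val dseConf = DseConfiguration.enableDseSupport(conf)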
This is of course for advanced users only and we strongly recommend using dse spark-submit if at all possible.
I found a solution.
First of all, I think it is impossible to run a Spark job from within an application in DSE 5.1; it has to be submitted with dse spark-submit.
Once submitted, it works perfectly. To communicate with the job, I used Apache Kafka.
If you don't want to use a job, you can always go back to plain Apache Spark.
