Cannot write Spark dataframe into Cassandra table - apache-spark

I am connecting Spark on HDP 3.0 with Cassandra to write a data frame into a Cassandra table, but I am receiving the error below.
My code for writing into the Cassandra table is as follows:

HDP 3.0 is based on Hadoop 3.1.1, which uses the commons-configuration2 library instead of the commons-configuration library used by the Spark Cassandra Connector. You can start your spark-shell or spark-submit with the following to explicitly add commons-configuration:
spark-shell --packages com.datastax.spark:spark-cassandra-connector_2.11:2.3.1,commons-configuration:commons-configuration:1.10
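Once the shell starts with both packages on the classpath, a DataFrame write along the following lines should work. This is only a minimal sketch: the keyspace and table names are placeholders, the keyspace and table must already exist in Cassandra with matching columns, and spark.cassandra.connection.host must point at your cluster.
import spark.implicits._  // available inside spark-shell

// Small example frame; replace with your own data.
val df = Seq((1, "a"), (2, "b")).toDF("id", "value")

df.write
  .format("org.apache.spark.sql.cassandra")
  .options(Map("keyspace" -> "my_keyspace", "table" -> "my_table"))  // placeholder names
  .mode("append")
  .save()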

Related

How to use hive warehouse connector in HDP 2.6.5

I have a requirement to read an ACID-enabled Hive table from Spark.
Spark does not natively support reading ACID-enabled ORC files; the only option is to use Spark JDBC.
We can also use the Hive Warehouse Connector to read the files. Can someone explain the steps to read them using the Hive Warehouse Connector?
Does HWC only work on HDP 3? Kindly advise.
Spark version: 2.3.0
HDP: 2.6.5
Spark can read ORC files; check the documentation here: https://spark.apache.org/docs/2.3.0/sql-programming-guide.html#orc-files
Here is a sample of code to read an ORC file:
spark.read.format("orc").load("example.orc")
HWC is made for HDP 3, as the Hive and Spark catalogs are no longer compatible in HDP 3 (Hive is at version 3 and Spark at version 2).
See documentation on it here: https://docs.cloudera.com/HDPDocuments/HDP3/HDP-3.1.5/integrating-hive/content/hive_hivewarehouseconnector_for_handling_apache_spark_data.html
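As a rough sketch of what HWC usage looks like on HDP 3 (based on the documentation linked above): it assumes the HWC jar is on the Spark classpath, spark.sql.hive.hiveserver2.jdbc.url is configured, and acid_table is a placeholder table name.
import com.hortonworks.hwc.HiveWarehouseSession

// Build an HWC session on top of the existing SparkSession.
val hive = HiveWarehouseSession.session(spark).build()

// Queries go through HiveServer2, so ACID/managed Hive tables are readable.
val df = hive.executeQuery("SELECT * FROM acid_table")
df.show()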

How to query hdfs from a spark cluster (2.1) which is running on kubernetes?

I was trying to access HDFS files from a Spark cluster running inside Kubernetes containers.
However, I keep getting the error:
AnalysisException: 'The ORC data source must be used with Hive support enabled;'
What am I missing here?
Do you have a SparkSession created with enableHiveSupport()?
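A minimal sketch of what that looks like, assuming a Scala application (the app name and HDFS path are placeholders):
import org.apache.spark.sql.SparkSession

// Hive support is required for the ORC data source in Spark 2.1/2.2,
// which is what the AnalysisException above is complaining about.
val spark = SparkSession.builder()
  .appName("orc-on-k8s")  // placeholder name
  .enableHiveSupport()
  .getOrCreate()

spark.read.format("orc").load("hdfs:///path/to/data.orc").show()  // placeholder path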
Similar issue:
Spark can access Hive table from pyspark but not from spark-submit

Snappydata store with hive metastore from existing spark installation

I am using snappydata-1.0.1 on HDP 2.6.2 with Spark 2.1.1, and I was able to connect from an external Spark application. But when I enable Hive support by adding hive-site.xml to the Spark conf, SnappySession lists the tables from the Hive metastore instead of the SnappyData store.
SparkConf sparkConf = new SparkConf().setAppName("TEST APP");
JavaSparkContext javaSparkContext = new JavaSparkContext(sparkConf);
// Hive-enabled session; with hive-site.xml in the Spark conf this resolves tables from the Hive metastore.
SparkSession sps = new SparkSession.Builder().enableHiveSupport().getOrCreate();
// SnappySession built on the same SparkContext.
SnappySession snc = new SnappySession(new SparkSession(javaSparkContext.sc()).sparkContext());
snc.sqlContext().sql("show tables").show();
The above code gives me the list of tables in the Snappy store when hive-site.xml is not in the Spark conf; if hive-site.xml is added, it lists the tables from the Hive metastore.
Is it not possible to use the Hive metastore and the SnappyData metastore in the same application?
Can I read a Hive table into one DataFrame and a SnappyData table into another DataFrame in the same application?
Thanks in advance.
So, it isn't the Hive metastore that is the problem. You can use Hive tables and Snappy tables in the same application, e.g. copy a Hive table into a Snappy in-memory table.
But we will need to test the use of an external Hive metastore configured in hive-site.xml. Perhaps it is a bug.
You should try using the Snappy smart connector, i.e. run your Spark app using the Spark distribution in HDP and connect to the SnappyData cluster using the connector (see the docs). Here it looks like you are trying to run your Spark app using the SnappyData distribution.
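For reference, launching in smart-connector mode looks roughly like the following; the locator host/port, package coordinates, and jar name are assumptions based on the SnappyData docs, so check them against your cluster and release.
spark-submit --master yarn \
  --conf spark.snappydata.connection=locator-host:1527 \
  --packages SnappyDataInc:snappydata:1.0.1-s_2.11 \
  your-app.jar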

Pyspark and Cassandra Connection Error

I am stuck on a problem. When I run sample Cassandra connection code, importing the Cassandra connector gives an error.
I am starting the script as below (both of them gave the error):
./spark-submit --jars spark-cassandra-connector_2.11-1.6.0-M1.jar /home/beyhan/sparkCassandra.py
./spark-submit --jars spark-cassandra-connector_2.10-1.6.0.jar /home/beyhan/sparkCassandra.py
But I get the error below on:
import pyspark_cassandra
ImportError: No module named pyspark_cassandra
Which part did I do wrong?
Note: I have already installed the Cassandra database.
You are mixing up DataStax's Spark Cassandra Connector (the jar you add to spark-submit) and TargetHolding's PySpark Cassandra project (which provides the pyspark_cassandra module). The latter is deprecated, so you should probably use the Spark Cassandra Connector. Documentation for this package can be found here.
To use it, you can add the following flags to spark submit:
--conf spark.cassandra.connection.host=127.0.0.1 \
--packages com.datastax.spark:spark-cassandra-connector_2.11:2.0.0-M3
Of course, use the IP address on which Cassandra is listening, and check which connector version you need: 2.0.0-M3 is the latest version and works with Spark 2.0 and most Cassandra versions. See the compatibility table if you are using a different version of Spark. 2.10 or 2.11 is the Scala version your Spark build uses; with Spark 2.x it is 2.11 by default, before 2.x it was 2.10.
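Putting those flags together with the script path from the question, the full spark-submit call would look roughly like this (adjust the host and connector version as described above):
./spark-submit \
  --conf spark.cassandra.connection.host=127.0.0.1 \
  --packages com.datastax.spark:spark-cassandra-connector_2.11:2.0.0-M3 \
  /home/beyhan/sparkCassandra.py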
Then the nicest way to work with the connector is to use it to read dataframes, which looks like this:
sqlContext.read\
.format("org.apache.spark.sql.cassandra")\
.options(table="kv", keyspace="test")\
.load().show()
See the PySpark with DataFrames documentation for more details.

hive query execution using spark engine

I have set up Hadoop 2.7.2, Hive 2.1, Scala 2.11.8 and Spark 2.0 on an Ubuntu 16.04 system.
Hadoop, Hive and Spark are working well; I can connect to the Hive CLI and work with MapReduce without any problem.
I have to improve my Hive query performance for the ORDER BY clause.
I have to use the Hive CLI only and cannot use spark-shell.
I am trying to use Spark as the query execution engine for Hive.
Following the instructions at this link, I am setting some properties in Hive:
set hive.execution.engine=spark;
set spark.home=/usr/local/spark;
set spark.master=spark://ip:7077;
I executed the query:
select count(*) from table_name;
then it throws this exception:
failed to create spark client.
I also increased the timeout of the Hive client connection to Spark, but it did not help.
First, I recommend you use the shell and follow these steps:
spark-shell --master yarn-client --driver-memory 512m --executor-memory 512m
Then you can run:
import org.apache.spark.sql.hive.orc._
import org.apache.spark.sql._
val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
hiveContext.sql("create table myTable (myField STRING) stored as orc")
If this works, you can run other SQL queries with hiveContext.
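For instance, you can run the count query from the question against the table just created:
hiveContext.sql("select count(*) from myTable").show()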