NoHostAvailableException spark-cassandra-connector - apache-spark

I am using the SparkContext to count the rows of a Cassandra table, but I am getting the error below:
com.datastax.driver.core.exceptions.NoHostAvailableException: All host(s) tried for query failed (no host was tried)
I get this error right after the log line CassandraConnector: Connected to Cassandra cluster: cluster_name.
Below is the configuration code. I read that I need to set up the driver protocol version, but there is no option to add it in spark-cassandra-connector.
val sparkConf = new SparkConf(true).setAppName(appName)
sparkConf.set("spark.sql.orc.enabled", "true")
sparkConf.set("spark.sql.hive.convertMetastoreOrc", "false")
sparkConf.set("spark.sql.hive.metastorePartitionPruning true", "true")
sparkConf.set("spark.sql.orc.filterPushdown", "true")
sparkConf.set("spark.cassandra.connection.host", connection.hosts)
sparkConf.set("spark.cassandra.auth.username", connection.user)
sparkConf.set("spark.cassandra.auth.password", connection.password)
sparkConf.set("spark.cassandra.connection.local_dc", connection.local_dc.getOrElse(""))
sparkConf.set("spark.cassandra.input.consistency.level", connection.consistency_level)
sparkConf.set("spark.cassandra.input.reads_per_sec", Parameters.getCassandraInputReadsPerSec)
sparkConf.set("spark.cassandra.input.fetch.size_in_rows", Parameters.getCassandraInputFetchSizeInRows)
sparkConf.set("spark.cassandra.connection.timeout_ms", Parameters.getCassandraConnectionTimeoutMs)
sparkConf.set("spark.cassandra.read.timeout_ms", Parameters.getCassandraReadTimeoutMs)
sparkConf.set("spark.cassandra.connection.keep_alive_ms", Parameters.getCassandraConnectionKeepAliveMs)
val sparkSession: SparkSession = SparkSession.builder().enableHiveSupport().config(sparkConf).getOrCreate()
val sc: SparkContext = sparkSession.sparkContext
val sqlContext = sparkSession.sqlContext
sqlContext.setConf("fs.maprfs.impl", Parameters.getFsMaprfsImpl)
sqlContext.setConf("hive.exec.compress.output", Parameters.getHiveExecCompressOutput)
sqlContext.setConf("hive.merge.mapredfiles", Parameters.getHiveMergeMapredfiles)
sqlContext.setConf("hive.merge.smallfiles.avgsize", Parameters.getHiveMergeSmallfilesAvgsize)
sqlContext.setConf("hive.exec.reducers.bytes.per.reducer", Parameters.getHiveExecReducersBytesPerReducer)
sqlContext.setConf("hive.exec.dynamic.partition", Parameters.getHiveExecDynamicPartition)
sqlContext.setConf("hive.exec.dynamic.partition.mode", Parameters.getHiveExecDynamicPartitionMode)

Related

Spark reading table after connection to HiveServer2 only gives schema not data

I am trying to connect to a remote Hive cluster using the following code, and I get the table data as expected:
val spark = SparkSession
.builder()
.appName("adhocattempts")
.config("hive.metastore.uris", "thrift://<remote-host>:9083")
.enableHiveSupport()
.getOrCreate()
val seqdf = spark.sql("select * from anon_seq")
seqdf.show()
However, when I try to do this via HiveServer2, I get no data in my DataFrame. The table is backed by a sequence file. Is that the issue, given that I am actually trying to read it via JDBC?
val sparkJdbc = SparkSession.builder.appName("SparkHiveJob").getOrCreate
val sc = sparkJdbc.sparkContext
val sqlContext = sparkJdbc.sqlContext
val driverName = "org.apache.hive.jdbc.HiveDriver"
Class.forName(driverName)
val df = sparkJdbc.read
.format("jdbc")
.option("url", "jdbc:hive2://<remote-host>:10000/default")
.option("dbtable", "anon_seq")
.load()
df.show()
Can someone help me understand the purpose of using HiveServer2 with jdbc and relevant drivers in Spark2?
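As a side note, the Hive JDBC driver class can also be passed directly to the reader instead of being loaded with Class.forName; this is only a sketch reusing the placeholder host from the question:
val df = sparkJdbc.read
  .format("jdbc")
  .option("driver", "org.apache.hive.jdbc.HiveDriver") // register the driver explicitly
  .option("url", "jdbc:hive2://<remote-host>:10000/default")
  .option("dbtable", "anon_seq")
  .load()
df.show()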

Query remote Hive tables from a local Spark 2.x program

When I run a local Spark 2.x program from Eclipse, I get the error below:
Exception in thread "main" org.apache.spark.sql.AnalysisException: When hive.metastore.uris is setted, please set spark.sql.authorization.enabled and hive.security.authorization.enabled to true to enable authorization.;
Code used:
System.setProperty("hadoop.home.dir", "D:/winutils");
//kerberos releated
val ZKServerPrincipal = "zookeeper/hadoop.hadoop.com";
val ZOOKEEPER_DEFAULT_LOGIN_CONTEXT_NAME = "Client";
val ZOOKEEPER_SERVER_PRINCIPAL_KEY = "zookeeper.server.principal";
val hadoopConf: Configuration = new Configuration();
LoginUtil.setZookeeperServerPrincipal(ZOOKEEPER_SERVER_PRINCIPAL_KEY, ZKServerPrincipal);
LoginUtil.login(userPrincipal, userKeytabPath, krb5ConfPath, hadoopConf);
//creating spark session
val spark = SparkSession
  .builder()
  .appName("conenction")
  .config("spark.master", "local")
  .config("spark.sql.authorization.enabled", "true")
  .enableHiveSupport()
  .getOrCreate()
val df75 = spark.sql("select * from dbname.tablename limit 10")
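The AnalysisException itself names the two settings it expects; here is a minimal sketch of the same builder with both flags applied (whether enabling authorization is appropriate for your cluster is a separate question):
val spark = SparkSession
  .builder()
  .appName("conenction")
  .config("spark.master", "local")
  // both flags requested by the error message
  .config("spark.sql.authorization.enabled", "true")
  .config("hive.security.authorization.enabled", "true")
  .enableHiveSupport()
  .getOrCreate()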

DSE Spark Streaming: Long active batches queue

I have the following code:
val conf = new SparkConf()
.setAppName("KafkaReceiver")
.set("spark.cassandra.connection.host", "192.168.0.78")
.set("spark.cassandra.connection.keep_alive_ms", "20000")
.set("spark.executor.memory", "2g")
.set("spark.driver.memory", "4g")
.set("spark.submit.deployMode", "cluster")
.set("spark.executor.instances", "3")
.set("spark.executor.cores", "3")
.set("spark.shuffle.service.enabled", "false")
.set("spark.dynamicAllocation.enabled", "false")
.set("spark.io.compression.codec", "snappy")
.set("spark.rdd.compress", "true")
.set("spark.streaming.backpressure.enabled", "true")
.set("spark.streaming.backpressure.initialRate", "200")
.set("spark.streaming.receiver.maxRate", "500")
val sc = SparkContext.getOrCreate(conf)
val ssc = new StreamingContext(sc, Seconds(10))
val sqlContext = new SQLContext(sc)
val kafkaParams = Map[String, String](
"bootstrap.servers" -> "192.168.0.113:9092",
"group.id" -> "test-group-aditya",
"auto.offset.reset" -> "largest")
val topics = Set("random")
val kafkaStream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](ssc, kafkaParams, topics)
I'm running the code through spark-submit with the following command:
dse> bin/dse spark-submit --class test.kafkatesting /home/aditya/test.jar
I have a three-node Cassandra DSE cluster installed on different machines. Whenever I run the application, it ingests so much data that it starts building up a queue of active batches, which in turn creates a backlog and a long scheduling delay. How can I improve performance and control the queue so that a new batch is received only after the current batch finishes executing?
I found the solution by optimising the code. Instead of saving an RDD, try to create a DataFrame; saving a DataFrame to Cassandra is much faster than saving an RDD. Also, increase the number of cores and the executor memory in order to achieve good results.
Thanks,
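For illustration, a minimal sketch of saving a DataFrame to Cassandra with the connector, as the answer suggests; the keyspace and table names are placeholders, and df stands for the DataFrame built from each micro-batch:
// df is assumed to be the DataFrame produced from the current micro-batch
df.write
  .format("org.apache.spark.sql.cassandra")
  .options(Map("keyspace" -> "my_keyspace", "table" -> "my_table"))
  .mode("append")
  .save()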

How to get TimeStamp data in hive when using Spark

val sql = "select time from table"
val data = sql(sql).map(_.getTimeStamp(0).toString)
In the Hive table, the type of time is timestamp. When I run this program, it throws a NullPointerException.
val data = sqlContext.sql(query).map(_.get(0).toString)
When I change to the code above, the same exception is thrown.
Can anyone tell me how to get timestamp data from Hive using Spark?
Thanks.
If you are trying to read data from a Hive table, you should use HiveContext instead of SQLContext.
If you are using Spark 2.0, you can try the following.
val spark = SparkSession
.builder()
.appName("Spark Hive Example")
.config("spark.sql.warehouse.dir", warehouseLocation)
.enableHiveSupport()
.getOrCreate()
import spark.implicits._
import spark.sql
val df = sql("select time from table")
df.select($"time").show()
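The NullPointerException in the question is also consistent with null values in the time column, so a null-safe read is worth trying; this is just a sketch on top of the df defined above:
val times = df.collect().map { row =>
  // guard against null timestamps before calling toString
  if (row.isNullAt(0)) None else Some(row.getTimestamp(0).toString)
}
times.foreach(println)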

connect to mysql from spark

I am trying to follow the instructions mentioned here...
https://www.percona.com/blog/2016/08/17/apache-spark-makes-slow-mysql-queries-10x-faster/
and here...
https://www.percona.com/blog/2015/10/07/using-apache-spark-mysql-data-analysis/
I am using the sequenceiq/spark Docker image.
docker run -it -p 8088:8088 -p 8042:8042 -p 4040:4040 -h sandbox sequenceiq/spark:1.6.0 bash
cd /usr/local/spark/
./sbin/start-master.sh
./bin/spark-shell --driver-memory 1G --executor-memory 1g --executor-cores 1 --master local
This works as expected:
scala> sc.parallelize(1 to 1000).count()
But this shows an error:
val jdbcDF = spark.read.format("jdbc").options(
Map("url" -> "jdbc:mysql://1.2.3.4:3306/test?user=dba&password=dba123",
"dbtable" -> "ontime.ontime_part",
"fetchSize" -> "10000",
"partitionColumn" -> "yeard", "lowerBound" -> "1988", "upperBound" -> "2016", "numPartitions" -> "28"
)).load()
And here is the error:
<console>:25: error: not found: value spark
val jdbcDF = spark.read.format("jdbc").options(
How do I connect to MySQL from within spark shell?
With Spark 2.0.x, you can use DataFrameReader and DataFrameWriter.
Use SparkSession.read to access DataFrameReader and use Dataset.write to access DataFrameWriter.
Suppose you are using spark-shell.
read example
val prop=new java.util.Properties()
prop.put("user","username")
prop.put("password","yourpassword")
val url="jdbc:mysql://host:port/db_name"
val df=spark.read.jdbc(url,"table_name",prop)
df.show()
read example 2
val jdbcDF = spark.read
.format("jdbc")
.option("url", "jdbc:mysql:dbserver")
.option("dbtable", “schema.tablename")
.option("user", "username")
.option("password", "password")
.load()
(from the Spark documentation)
write example
import org.apache.spark.sql.SaveMode
val prop=new java.util.Properties()
prop.put("user","username")
prop.put("password","yourpassword")
val url="jdbc:mysql://host:port/db_name"
// df is a DataFrame containing the data you want to write
df.write.mode(SaveMode.Append).jdbc(url,"table_name",prop)
Create the Spark context first.
Make sure you have the JDBC jar files attached to your classpath.
If you are trying to read data over JDBC, use the DataFrame API instead of RDDs, as DataFrames have better performance.
Here is the syntax for reading from JDBC:
// prop is a java.util.Properties instance loaded elsewhere
SparkConf conf = new SparkConf().setAppName("app")
    .setMaster("local[2]")
    .set("spark.serializer", prop.getProperty("spark.serializer"));
JavaSparkContext sc = new JavaSparkContext(conf);
SQLContext sqlCtx = new SQLContext(sc);
Dataset<Row> df = sqlCtx.read()
    .format("jdbc")
    .option("url", "jdbc:mysql://1.2.3.4:3306/test")
    .option("driver", "com.mysql.jdbc.Driver")
    .option("dbtable", "dbtable")
    .option("user", "dbuser")
    .option("password", "dbpwd")
    .load();
It looks like spark is not defined; you should use the SQLContext to connect through the JDBC driver like this:
import org.apache.spark.sql.SQLContext
val sqlcontext = new org.apache.spark.sql.SQLContext(sc)
val dataframe_mysql = sqlcontext.read
  .format("jdbc")
  .option("url", "jdbc:mysql://Public_IP:3306/DB_NAME")
  .option("driver", "com.mysql.jdbc.Driver")
  .option("dbtable", "tblage")
  .option("user", "sqluser")
  .option("password", "sqluser")
  .load()
Later you can use sqlcontext wherever you used spark (in spark.read, etc.).
This is a common problem for those migrating to Spark 2.0.0 from the earlier versions. The Spark documentation is not very good. To solve this, you have to define a SparkSession, like this:
import org.apache.spark.sql.SparkSession
val spark = SparkSession
.builder()
.appName("Spark SQL Example")
.config("spark.some.config.option", "some-value")
.getOrCreate()
This solution is hidden in the Spark SQL, DataFrames and Datasets Guide located here. SparkSession is the new entry point to the DataFrame API; it incorporates both SQLContext and HiveContext and has some additional advantages, so there is no need to define either of those anymore. Further information about this can be found here.
Please accept this as the answer, if you find this useful.
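With spark defined as above, the options-map read from the question should then resolve; this sketch simply reuses the connection details given in the question:
val jdbcDF = spark.read.format("jdbc").options(
  Map("url" -> "jdbc:mysql://1.2.3.4:3306/test?user=dba&password=dba123",
    "dbtable" -> "ontime.ontime_part",
    "fetchSize" -> "10000",
    "partitionColumn" -> "yeard", "lowerBound" -> "1988", "upperBound" -> "2016", "numPartitions" -> "28"
  )).load()
jdbcDF.show()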
