I'm trying to come up with a generic implementation that uses Spark JDBC to read/write data from/to various JDBC-compliant databases like PostgreSQL, MySQL, Hive, etc.
My code looks something like below.
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.{SaveMode, SparkSession}

val conf = new SparkConf().setAppName("Spark Hive JDBC").setMaster("local[*]")
val sc = new SparkContext(conf)
val spark = SparkSession
.builder()
.appName("Spark Hive JDBC Example")
.getOrCreate()
val jdbcDF = spark.read
.format("jdbc")
.option("url", "jdbc:hive2://host1:10000/default")
.option("dbtable", "student1")
.option("user", "hive")
.option("password", "hive")
.option("driver", "org.apache.hadoop.hive.jdbc.HiveDriver")
.load()
jdbcDF.printSchema
jdbcDF.write
.format("jdbc")
.option("url", "jdbc:hive2://127.0.0.1:10000/default")
.option("dbtable", "student2")
.option("user", "hive")
.option("password", "hive")
.option("driver", "org.apache.hadoop.hive.jdbc.HiveDriver")
.mode(SaveMode.Overwrite)
.save()
Output:
root
|-- name: string (nullable = true)
|-- id: integer (nullable = true)
|-- dept: string (nullable = true)
The above code works seamlessly for PostgreSQL and MySQL databases, but it starts causing problems as soon as I use the Hive-related JDBC config.
First, my read was not able to fetch any data and returned empty results. After some searching, I was able to make the read work by adding the custom HiveDialect below, but I'm still facing an issue when writing data to Hive.
import java.sql.Types
import org.apache.spark.sql.jdbc.{JdbcDialect, JdbcDialects, JdbcType}
import org.apache.spark.sql.types._

case object HiveDialect extends JdbcDialect {
override def canHandle(url: String): Boolean = url.startsWith("jdbc:hive2")
override def quoteIdentifier(colName: String): String = s"`$colName`"
override def getJDBCType(dt: DataType): Option[JdbcType] = dt match {
case StringType => Option(JdbcType("STRING", Types.VARCHAR))
case _ => None
}
}
JdbcDialects.registerDialect(HiveDialect)
Error in Write:
19/11/13 10:30:14 ERROR Executor: Exception in task 0.0 in stage 1.0 (TID 1)
java.sql.SQLException: Method not supported
at org.apache.hive.jdbc.HivePreparedStatement.addBatch(HivePreparedStatement.java:75)
at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.savePartition(JdbcUtils.scala:664)
at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$saveTable$1.apply(JdbcUtils.scala:834)
at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$saveTable$1.apply(JdbcUtils.scala:834)
at org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1$$anonfun$apply$28.apply(RDD.scala:935)
at org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1$$anonfun$apply$28.apply(RDD.scala:935)
at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2101)
at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2101)
How can I execute Hive queries (read/write) from Spark against multiple remote Hive servers using Spark JDBC?
I can't use the Hive metastore URI approach since, in that case, I would be restricting myself to a single Hive server config. Also, as I mentioned earlier, I want the approach to be generic for all the database types (PostgreSQL, MySQL, Hive), so the Hive metastore URI approach won't work in my case.
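For reference, here is a minimal sketch of the kind of generic wrapper I have in mind; the case class and helper names are purely illustrative:
import org.apache.spark.sql.{DataFrame, SaveMode, SparkSession}

// Illustrative connection settings; the field names are made up for this sketch.
case class JdbcConf(url: String, table: String, user: String, password: String, driver: String)

def readJdbc(spark: SparkSession, c: JdbcConf): DataFrame =
  spark.read.format("jdbc")
    .option("url", c.url)
    .option("dbtable", c.table)
    .option("user", c.user)
    .option("password", c.password)
    .option("driver", c.driver)
    .load()

def writeJdbc(df: DataFrame, c: JdbcConf, mode: SaveMode = SaveMode.Overwrite): Unit =
  df.write.format("jdbc")
    .option("url", c.url)
    .option("dbtable", c.table)
    .option("user", c.user)
    .option("password", c.password)
    .option("driver", c.driver)
    .mode(mode)
    .save()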
Dependency Details:
Scala version: 2.11
Spark version: 2.4.3
Hive version: 2.1.1
Hive JDBC Driver Used: 2.0.1
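For completeness, the sbt dependencies would look roughly like this; the exact coordinates are my assumption based on the versions listed above:
// build.sbt (sketch)
scalaVersion := "2.11.12"
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-sql" % "2.4.3",
  "org.apache.hive"  %  "hive-jdbc" % "2.0.1"  // Hive JDBC driver for the jdbc:hive2 URL
)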
There is a lot of context, and I am not sure whether it is relevant. However, the error is straightforward enough:
Spark is not able to create the table in Hive with the data type "Text".
There is indeed no data type called Text in Hive; perhaps you are looking for one of the following:
STRING
VARCHAR
CHAR
I hope this helps, otherwise please consider reducing your question to the bare minimum (while maintaining a reproducible example) to avoid distractions.
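If the column type mapping is the culprit, one option is to extend the dialect from the question so that more Spark types map explicitly onto Hive types. The mappings below are only a sketch and an assumption on my part, not a verified fix for the addBatch error shown above:
import java.sql.Types
import org.apache.spark.sql.jdbc.{JdbcDialect, JdbcType}
import org.apache.spark.sql.types._

case object HiveDialectWithTypes extends JdbcDialect {
  override def canHandle(url: String): Boolean = url.startsWith("jdbc:hive2")
  override def quoteIdentifier(colName: String): String = s"`$colName`"
  override def getJDBCType(dt: DataType): Option[JdbcType] = dt match {
    case StringType  => Some(JdbcType("STRING", Types.VARCHAR))   // avoids the default TEXT mapping
    case IntegerType => Some(JdbcType("INT", Types.INTEGER))
    case LongType    => Some(JdbcType("BIGINT", Types.BIGINT))
    case DoubleType  => Some(JdbcType("DOUBLE", Types.DOUBLE))
    case BooleanType => Some(JdbcType("BOOLEAN", Types.BOOLEAN))
    case _           => None
  }
}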
Related
I have a Cassandra table that is created as follows (in cqlsh):
CREATE TABLE blog.session( id int PRIMARY KEY, visited text);
I write data to Cassandra and it looks like this
id | visited
1 | Url1-Url2-Url3
I then try to read it using the Spark Cassandra connector (2.5.1).
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.cassandra._ // provides cassandraFormat on DataFrameReader

val sparkSession = SparkSession.builder()
.master("local")
.appName("ReadFromCass")
.config("spark.cassandra.connection.host", "localhost")
.config("spark.cassandra.connection.port", "9042")
.getOrCreate()
import sparkSession.implicits._
val readSessions = sparkSession.sqlContext
.read
.cassandraFormat("table1", "keyspace1").load().show()
However, it seems to be unable to read the visited column, since it is a text value with dashes between the words. The error is:
org.apache.spark.unsafe.types.UTF8String is not a valid external type for schema of string
Any ideas on why Spark is unable to read this and how to fix it?
The error seemed to be caused by the version of the spark-cassandra-connector. Instead of using "2.5.1", use "3.0.0-beta".
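In sbt, that version switch would look roughly like this (coordinates as published on Maven Central; the Scala 2.12 cross-build is an assumption):
libraryDependencies += "com.datastax.spark" %% "spark-cassandra-connector" % "3.0.0-beta"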
When loading a Spark Dataset from JDBC, the data type is reflected from the database data type; for example, if the database table has the type "decimal(22,0)", it will be loaded into the Dataset as that type.
Dataset<Row> dataset = sparkSession.read()
.format("jdbc")
.option("url", url)
.option("query", "select COLUMN_1, COLUMN_2 where 1=2")
.option("user", username)
.option("password", password)
.option("driver", driverClass)
.load();
When trying to append a new row to that Dataset in Java, it crashes with an invalid data type error, as there is no exact match for that type in Java.
Am I doing something wrong?
You are probably appending a row with the wrong type; you can use this type, as shown below. Please add the rest of your code.
scala> val df = Seq("1111111111111111111111").toDF("val").select('val.cast("decimal(22,0)"))
df: org.apache.spark.sql.DataFrame = [val: decimal(22,0)]
scala> df.printSchema
root
|-- val: decimal(22,0) (nullable = true)
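To append a row that matches that schema, one option is to build the new row with the same cast and union it in; a quick sketch:
scala> val newRow = Seq("2222222222").toDF("val").select('val.cast("decimal(22,0)"))
scala> val appended = df.union(newRow)
scala> appended.printSchema   // still decimal(22,0)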
Spark 2.x here. My code:
val query = "SELECT * FROM some_big_table WHERE something > 1"
val df : DataFrame = spark.read
.option("url",
s"""jdbc:postgresql://${redshiftInfo.hostnameAndPort}/${redshiftInfo.database}?currentSchema=${redshiftInfo.schema}"""
)
.option("user", redshiftInfo.username)
.option("password", redshiftInfo.password)
.option("dbtable", query)
.load()
Produces:
Exception in thread "main" org.apache.spark.sql.AnalysisException: Unable to infer schema for Parquet. It must be specified manually.;
at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$8.apply(DataSource.scala:183)
at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$8.apply(DataSource.scala:183)
at scala.Option.getOrElse(Option.scala:121)
I'm not reading anything from a Parquet file; I'm reading from a Redshift (RDBMS) table. So why am I getting this error?
If you use the generic load function, you should include the format as well:
// Query has to be subquery
val query = "(SELECT * FROM some_big_table WHERE something > 1) as tmp"
...
.format("jdbc")
.option("dbtable", query)
.load()
Otherwise Spark assumes that you use the default format, which, in the absence of any specific configuration, is Parquet.
Also, nothing forces you to use dbtable.
spark.read.jdbc(
s"jdbc:postgresql://${hostnameAndPort}/${database}?currentSchema=${schema}",
query,
props
)
This variant is also valid.
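Here props is just a java.util.Properties carrying the credentials, e.g.:
val props = new java.util.Properties()
props.setProperty("user", redshiftInfo.username)
props.setProperty("password", redshiftInfo.password)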
And of course, with such a simple query, none of that is needed:
spark.read.jdbc(
s"jdbc:postgresql://${hostnameAndPort}/${database}?currentSchema=${schema}",
"some_big_table",
props
).where("something > 1")
will work the same way, and if you want to improve performance, you should consider parallel queries (see the sketch after the related links below):
How to improve performance for slow Spark jobs using DataFrame and JDBC connection?
Whats meaning of partitionColumn, lowerBound, upperBound, numPartitions parameters?
Spark 2.1 Hangs while reading a huge datasets
Partitioning in spark while reading from RDBMS via JDBC
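As a rough sketch of such a partitioned read, assuming some_big_table has a numeric column id with known bounds (the column name and the bounds here are assumptions):
spark.read
  .format("jdbc")
  .option("url", s"jdbc:postgresql://${hostnameAndPort}/${database}?currentSchema=${schema}")
  .option("dbtable", "some_big_table")
  .option("user", redshiftInfo.username)
  .option("password", redshiftInfo.password)
  .option("partitionColumn", "id")   // assumed numeric column
  .option("lowerBound", "1")         // assumed bounds
  .option("upperBound", "1000000")
  .option("numPartitions", "8")
  .load()
  .where("something > 1")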
Or, even better, try the Redshift connector.
I am trying to follow the instructions mentioned here...
https://www.percona.com/blog/2016/08/17/apache-spark-makes-slow-mysql-queries-10x-faster/
and here...
https://www.percona.com/blog/2015/10/07/using-apache-spark-mysql-data-analysis/
I am using the sequenceiq/spark Docker image.
docker run -it -p 8088:8088 -p 8042:8042 -p 4040:4040 -h sandbox sequenceiq/spark:1.6.0 bash
cd /usr/local/spark/
./sbin/start-master.sh
./bin/spark-shell --driver-memory 1G --executor-memory 1g --executor-cores 1 --master local
This works as expected:
scala> sc.parallelize(1 to 1000).count()
But this shows an error:
val jdbcDF = spark.read.format("jdbc").options(
Map("url" -> "jdbc:mysql://1.2.3.4:3306/test?user=dba&password=dba123",
"dbtable" -> "ontime.ontime_part",
"fetchSize" -> "10000",
"partitionColumn" -> "yeard", "lowerBound" -> "1988", "upperBound" -> "2016", "numPartitions" -> "28"
)).load()
And here is the error:
<console>:25: error: not found: value spark
val jdbcDF = spark.read.format("jdbc").options(
How do I connect to MySQL from within spark shell?
With Spark 2.0.x, you can use DataFrameReader and DataFrameWriter.
Use SparkSession.read to access a DataFrameReader and use Dataset.write to access a DataFrameWriter.
Suppose you are using spark-shell.
read example
val prop=new java.util.Properties()
prop.put("user","username")
prop.put("password","yourpassword")
val url="jdbc:mysql://host:port/db_name"
val df=spark.read.jdbc(url,"table_name",prop)
df.show()
read example 2
val jdbcDF = spark.read
.format("jdbc")
.option("url", "jdbc:mysql:dbserver")
.option("dbtable", “schema.tablename")
.option("user", "username")
.option("password", "password")
.load()
from the Spark docs
write example
import org.apache.spark.sql.SaveMode
val prop=new java.util.Properties()
prop.put("user","username")
prop.put("password","yourpassword")
val url="jdbc:mysql://host:port/db_name"
// df is a DataFrame containing the data you want to write.
df.write.mode(SaveMode.Append).jdbc(url,"table_name",prop)
Create the Spark context first.
Make sure you have the JDBC jar files attached to your classpath.
If you are trying to read data over JDBC, use the DataFrame API instead of RDDs, as DataFrames have better performance.
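For example, the MySQL driver can be pulled onto the classpath when starting the shell (the connector version below is just an example):
./bin/spark-shell --driver-memory 1G --executor-memory 1g --master local \
  --packages mysql:mysql-connector-java:5.1.39
# or point --jars at a locally downloaded mysql-connector-java jar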
Here is the syntax for reading from JDBC:
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SQLContext;

SparkConf conf = new SparkConf().setAppName("app")
        .setMaster("local[2]")
        .set("spark.serializer", prop.getProperty("spark.serializer")); // prop: a java.util.Properties loaded elsewhere
JavaSparkContext sc = new JavaSparkContext(conf);
SQLContext sqlCtx = new SQLContext(sc);
Dataset<Row> df = sqlCtx.read()          // DataFrame in Spark 1.x
        .format("jdbc")
        .option("url", "jdbc:mysql://1.2.3.4:3306/test")
        .option("driver", "com.mysql.jdbc.Driver")
        .option("dbtable", "dbtable")
        .option("user", "dbuser")
        .option("password", "dbpwd")
        .load();
It looks like spark is not defined; you should use an SQLContext to connect to the driver, like this:
import org.apache.spark.sql.SQLContext
val sqlcontext = new org.apache.spark.sql.SQLContext(sc)
val dataframe_mysql = sqlcontext.read
  .format("jdbc")
  .option("url", "jdbc:mysql://Public_IP:3306/DB_NAME")
  .option("driver", "com.mysql.jdbc.Driver")
  .option("dbtable", "tblage")
  .option("user", "sqluser")
  .option("password", "sqluser")
  .load()
Later you can use sqlcontext where you used spark (in spark.read etc.).
This is a common problem for those migrating to Spark 2.0.0 from earlier versions. The Spark documentation is not very good. To solve this, you have to define a SparkSession, like this:
import org.apache.spark.sql.SparkSession
val spark = SparkSession
.builder()
.appName("Spark SQL Example")
.config("spark.some.config.option", "some-value")
.getOrCreate()
This solution is hidden in the Spark SQL, DataFrames and Datasets Guide. SparkSession is the new entry point to the DataFrame API; it incorporates both SQLContext and HiveContext and has some additional advantages, so there is no need to define either of those anymore.
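Once spark is defined this way, the JDBC read from the question can be run as-is, for example:
val jdbcDF = spark.read.format("jdbc").options(
  Map("url" -> "jdbc:mysql://1.2.3.4:3306/test?user=dba&password=dba123",
      "dbtable" -> "ontime.ontime_part",
      "fetchSize" -> "10000")
).load()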
Please accept this as the answer, if you find this useful.
In previous versions of Spark, like 1.6.1, I was creating a Cassandra context from the Spark context:
import org.apache.spark.{Logging, SparkConf, SparkContext}
import org.apache.spark.sql.cassandra.CassandraSQLContext
//config
val conf: org.apache.spark.SparkConf = new SparkConf(true)
.set("spark.cassandra.connection.host", CassandraHost)
.setAppName(getClass.getSimpleName)
lazy val sc = new SparkContext(conf)
val cassandraSqlCtx: org.apache.spark.sql.cassandra.CassandraSQLContext = new CassandraSQLContext(sc)
//Query using Cassandra context
cassandraSqlCtx.sql("select id from table ")
But in Spark 2.0, the Spark context is replaced by the Spark session. How can I use the Cassandra context?
Short Answer: You don't. It has been deprecated and removed.
Long Answer: You don't want to. The HiveContext provides everything except for the catalogue and supports a much wider range of SQL (HQL~). In Spark 2.0 this just means you will need to manually register Cassandra tables using createOrReplaceTempView until an ExternalCatalogue is implemented.
In SQL this looks like:
spark.sql("""CREATE TEMPORARY TABLE words
|USING org.apache.spark.sql.cassandra
|OPTIONS (
| table "words",
| keyspace "test")""".stripMargin)
In the raw DataFrame API it looks like:
spark
.read
.format("org.apache.spark.sql.cassandra")
.options(Map("keyspace" -> "test", "table" -> "words"))
.load
.createOrReplaceTempView("words")
Both of these commands will register the table "words" for SQL queries.
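After either registration, the view can be queried through the session, for example:
spark.sql("SELECT * FROM words").show()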