Getting SQLException when writing dataset having string columns to teradata - apache-spark

I am getting the error shown below when trying to write a Dataset from Spark to Teradata when the Dataset contains string data:
2018-01-02 15:49:05 [pool-2-thread-2] ERROR c.i.i.t.spark2.algo.JDBCTableWriter:115 - Error in JDBC operation:
java.sql.SQLException: [Teradata Database] [TeraJDBC 15.00.00.20] [Error 3706] [SQLState 42000] Syntax error: Data Type "TEXT" does not match a Defined Type name.
at com.teradata.jdbc.jdbc_4.util.ErrorFactory.makeDatabaseSQLException(ErrorFactory.java:308)
at com.teradata.jdbc.jdbc_4.statemachine.ReceiveInitSubState.action(ReceiveInitSubState.java:109)
at com.teradata.jdbc.jdbc_4.statemachine.StatementReceiveState.subStateMachine(StatementReceiveState.java:307)
at com.teradata.jdbc.jdbc_4.statemachine.StatementReceiveState.action(StatementReceiveState.java:196)
at com.teradata.jdbc.jdbc_4.statemachine.StatementController.runBody(StatementController.java:123)
at com.teradata.jdbc.jdbc_4.statemachine.StatementController.run(StatementController.java:114)
at com.teradata.jdbc.jdbc_4.TDStatement.executeStatement(TDStatement.java:385)
at com.teradata.jdbc.jdbc_4.TDStatement.doNonPrepExecuteUpdate(TDStatement.java:602)
at com.teradata.jdbc.jdbc_4.TDStatement.executeUpdate(TDStatement.java:1109)
at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.createTable(JdbcUtils.scala:805)
at org.apache.spark.sql.execution.datasources.jdbc.JdbcRelationProvider.createRelation(JdbcRelationProvider.scala:90)
at org.apache.spark.sql.execution.datasources.DataSource.write(DataSource.scala:472)
at org.apache.spark.sql.execution.datasources.SaveIntoDataSourceCommand.run(SaveIntoDataSourceCommand.scala:48)
at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:58)
at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:56)
at org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:74)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:117)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:117)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:138)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:135)
at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:116)
at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:92)
at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:92)
How can I ensure that the data gets properly written into Teradata?
I am reading a CSV file from HDFS into a Dataset and then trying to write it to Teradata using DataFrameWriter. I am using the code below for this:
ds.write().mode("append")
.jdbc(url, tableName, props);
I am using Spark 2.2.0 and Teradata 15.00.00.07.
I get somewhat similar issues when writing to Netezza, while with DB2 the write succeeds but the string values get replaced with .
Is there any kind of option required when writing to these databases?

I was able to fix this issue by implementing a custom JdbcDialect for Teradata.
The same approach can be used to address similar issues with other data sources such as Netezza, DB2, and Hive.
To do so, you need to extend the JdbcDialect class and register it:
import java.util.HashMap;
import java.util.Map;

import org.apache.spark.sql.jdbc.JdbcDialect;
import org.apache.spark.sql.jdbc.JdbcType;
import org.apache.spark.sql.types.DataType;

import scala.Option;

public class TDDialect extends JdbcDialect {

    private static final long serialVersionUID = 1L;

    // Maps Spark SQL simple type names to Teradata-compatible JDBC type definitions.
    private static final Map<String, Option<JdbcType>> dataTypeMap = new HashMap<String, Option<JdbcType>>();

    static {
        dataTypeMap.put("int", Option.apply(JdbcType.apply("INTEGER", java.sql.Types.INTEGER)));
        dataTypeMap.put("long", Option.apply(JdbcType.apply("BIGINT", java.sql.Types.BIGINT)));
        dataTypeMap.put("double", Option.apply(JdbcType.apply("DOUBLE PRECISION", java.sql.Types.DOUBLE)));
        dataTypeMap.put("float", Option.apply(JdbcType.apply("FLOAT", java.sql.Types.FLOAT)));
        dataTypeMap.put("short", Option.apply(JdbcType.apply("SMALLINT", java.sql.Types.SMALLINT)));
        dataTypeMap.put("byte", Option.apply(JdbcType.apply("BYTEINT", java.sql.Types.TINYINT)));
        dataTypeMap.put("binary", Option.apply(JdbcType.apply("BLOB", java.sql.Types.BLOB)));
        dataTypeMap.put("timestamp", Option.apply(JdbcType.apply("TIMESTAMP", java.sql.Types.TIMESTAMP)));
        dataTypeMap.put("date", Option.apply(JdbcType.apply("DATE", java.sql.Types.DATE)));
        dataTypeMap.put("string", Option.apply(JdbcType.apply("VARCHAR(255)", java.sql.Types.VARCHAR)));
        dataTypeMap.put("boolean", Option.apply(JdbcType.apply("CHAR(1)", java.sql.Types.CHAR)));
        dataTypeMap.put("text", Option.apply(JdbcType.apply("VARCHAR(255)", java.sql.Types.VARCHAR)));
    }

    @Override
    public boolean canHandle(String url) {
        return url.startsWith("jdbc:teradata");
    }

    @Override
    public Option<JdbcType> getJDBCType(DataType dt) {
        // Fall back to Option.empty() so Spark uses its default mapping for unknown types.
        Option<JdbcType> option = dataTypeMap.get(dt.simpleString().toLowerCase());
        if (option == null) {
            option = Option.empty();
        }
        return option;
    }
}
Now register this dialect using the snippet below before calling any action on Spark:
JdbcDialects.registerDialect(new TDDialect());
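For context, here is a minimal end-to-end sketch of how the registration fits into the write path from the question; the CSV path is a hypothetical placeholder, and url, tableName and props stand in for the same values used in the question:
// Register the dialect once per JVM, before any action that touches the JDBC sink.
JdbcDialects.registerDialect(new TDDialect());

// Hypothetical input path; in the question the data comes from a CSV file on HDFS.
// 'spark' is the active SparkSession.
Dataset<Row> ds = spark.read()
        .option("header", "true")
        .csv("hdfs:///path/to/input.csv");

// Same write call as in the question; with the dialect registered, string columns
// map to VARCHAR(255) instead of TEXT, so Teradata accepts the generated CREATE TABLE.
ds.write()
        .mode("append")
        .jdbc(url, tableName, props);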
With some data sources, Hive for example, you may need to override one more method to avoid NumberFormatExceptions or similar exceptions:
@Override
public String quoteIdentifier(String colName) {
    return colName;
}
Hope this will help anyone facing similar issues.

It's working for me; can you please try it once and let me know?
Points to note:
Your Hive table must use text format for storage; it should not be ORC.
Create the schema in Teradata before writing to it from your PySpark notebook.
df = spark.sql("select * from dbname.tableName")
properties = {
    "driver": "com.teradata.jdbc.TeraDriver",
    "user": "xxxx",
    "password": "xxxxx"
}
df.write.jdbc(url='provide_url', table='dbName.tableName', properties=properties)

Related

Writing to BigQuery using Dataproc is slow with Spark BigQuery connector

We have a Spark Streaming application which reads data from Pub/Sub, applies some transformations, converts the JavaDStream to a Dataset, and then writes the results into normalized BigQuery tables.
Below is the sample code. All the normalized tables are partitioned on the CurrentTimestamp column.
Is there any parameter we can set to improve write performance?
pubSubMessageDStream
    .foreachRDD(new VoidFunction2<JavaRDD<PubSubMessageSchema>, Time>() {
        @Override
        public void call(JavaRDD<PubSubMessageSchema> v1, Time v2) throws Exception {
            Dataset<PubSubMessageSchema> pubSubDataSet = spark.createDataset(v1.rdd(), Encoders.bean(PubSubMessageSchema.class));
            ---
            ---
            ---
            for (Row payloadName : payloadNameList) {
                Dataset<Row> normalizedDS = null;
                if (payloadNameAList.contains(payloadName)) {
                    normalizedDS = dataSet.filter(col(colA).equalTo(<Value>));
                } else if (payloadNameBList.contains(payloadName)) {
                    normalizedDS = dataSet.filter(col(colA).equalTo(<Value>));
                }
                normalizedDS.selectExpr(columnsBigQuery).write().format("bigquery")
                    .option("temporaryGcsBucket", gcsBucketName)
                    .option("table", tableName)
                    .option("project", projectId)
                    .option("parentProject", parentProjectId)
                    .mode(SaveMode.Append)
                    .save();
            }
        }
    });
Writing to BigQuery requires writing to GCS first and then triggering a BigQuery load job. You can try changing the intermediateFormat to Avro and see if it affects the performance; from our tests, the better format depends on the schema and data size.
In addition, the upcoming connector version 0.19.0 has a write implementation for the DataSource v2 API for Spark 2.4, which should improve performance by 10-15%.
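For reference, here is a hedged sketch of how the intermediateFormat suggestion would look on the write from the question, assuming a connector version that supports the option (on Spark 2.4 the Avro intermediate format may also require the spark-avro package on the classpath):
// Same write as above, with the intermediate format switched to Avro via the
// connector's "intermediateFormat" option.
normalizedDS.selectExpr(columnsBigQuery).write().format("bigquery")
    .option("intermediateFormat", "avro")
    .option("temporaryGcsBucket", gcsBucketName)
    .option("table", tableName)
    .option("project", projectId)
    .option("parentProject", parentProjectId)
    .mode(SaveMode.Append)
    .save();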

Azure DataBricks Stream foreach fails with NotSerializableException

I want to continuously process the rows of a streaming Dataset (originally initiated by Kafka): based on a condition, I want to update a Redis hash. This is my code snippet (lastContacts is the result of a previous command, which is a stream of this type: org.apache.spark.sql.DataFrame = [serialNumber: string, lastModified: long]. This expands to org.apache.spark.sql.Dataset[org.apache.spark.sql.Row]):
class MyStreamProcessor extends ForeachWriter[Row] {
  override def open(partitionId: Long, version: Long): Boolean = {
    true
  }

  override def process(record: Row) = {
    val stringHashRDD = sc.parallelize(Seq(("lastContact", record(1).toString)))
    sc.toRedisHASH(stringHashRDD, record(0).toString)(redisConfig)
  }

  override def close(errorOrNull: Throwable): Unit = {}
}
val query = lastContacts
.writeStream
.foreach(new MyStreamProcessor())
.start()
query.awaitTermination()
I receive a huge stack trace, of which the relevant part (I think) is this: java.io.NotSerializableException: org.apache.spark.sql.streaming.DataStreamWriter
Could anyone explain why this exception occurs and how to avoid it? Thank you!
This question is related to the following two:
DataFrame to RDD[(String, String)] conversion
Call a function with each element a stream in Databricks
The SparkContext is not serializable.
Any implementation of ForeachWriter must be serializable because each task will get a fresh serialized-deserialized copy of the provided object. Hence, it is strongly recommended that any initialization for writing data (e.g. opening a connection or starting a transaction) is done after the open(...) method has been called, which signifies that the task is ready to generate data.
In your code, you are trying to use the Spark context within the process method:
override def process(record: Row) = {
  val stringHashRDD = sc.parallelize(Seq(("lastContact", record(1).toString)))
  sc.toRedisHASH(stringHashRDD, record(0).toString)(redisConfig)
}
To send data to Redis, you need to create your own connection, open it in the open method, and then use it in the process method.
Take a look at how to create a Redis connection pool: https://github.com/RedisLabs/spark-redis/blob/master/src/main/scala/com/redislabs/provider/redis/ConnectionPool.scala
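As a minimal sketch of that pattern (written in Java here; the same shape applies in Scala), assuming the Jedis client is on the classpath; the host, port, and hash layout are placeholders that mirror the intent of the original process method:
import org.apache.spark.sql.ForeachWriter;
import org.apache.spark.sql.Row;
import redis.clients.jedis.Jedis;

// One connection per partition/epoch: opened in open(), reused in process(), released in close().
public class RedisHashWriter extends ForeachWriter<Row> {
    private transient Jedis jedis;

    @Override
    public boolean open(long partitionId, long version) {
        jedis = new Jedis("redis-host", 6379); // hypothetical host and port
        return true;
    }

    @Override
    public void process(Row record) {
        // Mirrors the original intent: hash key = serialNumber, field = "lastContact", value = lastModified.
        jedis.hset(record.get(0).toString(), "lastContact", record.get(1).toString());
    }

    @Override
    public void close(Throwable errorOrNull) {
        if (jedis != null) {
            jedis.close();
        }
    }
}
This writer would then be passed to writeStream.foreach(...) in place of MyStreamProcessor; nothing in it references the SparkContext, so it serializes cleanly.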

Simba JDBC driver For Cloud Spanner used with Spark JDBC DataFrame reader

I am using the JDBC driver from Simba Technologies Inc. to connect to Google Cloud Spanner. It runs as expected with plain java.sql. However, when I tried to use the Simba JDBC driver with Spark's JDBC reader in order to read the query output as a DataFrame, it gives wrong output.
Here is my spanner table:
UserID UserName
1 Vaijnath
2 Ganesh
3 Rahul
Metadata:
UserID (String)
UserName (String)
I am executing the query: SELECT * FROM users
This query fetches the correct data when I use the Simba JDBC driver with plain Java SQL, but it fails to fetch the data when I use it with Spark SQL's JDBC reader.
It returns the DataFrame as:
+------+--------+
|UserID|UserName|
+------+--------+
|UserID|UserName|
|UserID|UserName|
|UserID|UserName|
+------+--------+
As you can see, it returns the correct metadata and number of rows, but each row contains the column names instead of the data.
Here is the code I am using:
import java.util.Properties
import org.apache.spark.sql.{DataFrame, SparkSession}

object spannerIn {

  val sparkSession = SparkSession
    .builder()
    .appName("Spark SQL basic example").master("local")
    .config("spark.sql.warehouse.dir", "file:///tmp")
    .config("spark.sql.shuffle.partitions", 1)
    .getOrCreate()

  val properties = new Properties()
  properties.setProperty("user", "")
  properties.setProperty("password", "")
  properties.setProperty("driver", "com.simba.cloudspanner.core.jdbc42.CloudSpanner42Driver")

  val connectionURL = "jdbc:cloudspanner://localhost;Project=abc;Instance=pqr;Database=xyz;PvtKeyPath=FilePath"
  val selectQuery = "(select * from users)"

  def main(args: Array[String]): Unit = {
    val df = createJdbcDataframe()
    df.show()
  }

  def createJdbcDataframe(): DataFrame = {
    sparkSession.read.jdbc(connectionURL, selectQuery, properties)
  }
}
My question is: can I use the Simba JDBC driver with Spark?
If yes, what extra things do I need to add?
Any help is appreciated.
This occurs because Spark by default quotes all identifiers using a double quote ("), meaning the following query is generated:
SELECT "UserID", "UserName" FROM USERS
This is interpreted by Cloud Spanner as selecting two fixed strings. It's basically the same as this in most other databases:
SELECT 'UserID', 'UserName' FROM USERS
Google Cloud Spanner uses backticks (`) for quoting identifiers, and expects this:
SELECT `UserID`, `UserName` FROM USERS
To fix this, you need to register a specific JDBC dialect for Google Cloud Spanner that uses the backtick for quoting, like this:
Class.forName("nl.topicus.jdbc.CloudSpannerDriver");

SparkSession spark = SparkSession.builder().appName("Java Spark SQL basic example")
        .config("spark.some.config.option", "some-value").master("local").getOrCreate();

String sparkURL = "jdbc:cloudspanner://localhost;Project=project-id;Instance=instance-id;Database=db;PvtKeyPath=pathToKeyFile.json";

JdbcDialects.registerDialect(new JdbcDialect() {
    private static final long serialVersionUID = 1L;

    @Override
    public boolean canHandle(String url) {
        return url.toLowerCase().startsWith("jdbc:cloudspanner:");
    }

    @Override
    public String quoteIdentifier(String column) {
        return "`" + column + "`";
    }
});

Dataset<Row> dataset = spark.read().jdbc(sparkURL, "ACCOUNT", new Properties());
dataset.show();
Please note that I have not tested the above with the Simba driver, only with this driver: https://github.com/olavloite/spanner-jdbc
I expect it should work with the Simba driver as well.

spark2.1.0 insert data into hive error

Spark version: 2.1.0
I want to insert a Dataset into Hive, partitioned by the 'dt' field, but it fails.
When using insertInto(), the error is: 'spark2.0 insertInto() can't be used together with partitionBy()'
When using saveAsTable(), the error is: 'Saving data in the Hive serde table ad.ad_industry_user_profile_incr is not supported yet. Please use the insertInto() API as an alternative.'
The core code is as follows:
rowRDD.foreachRDD(new VoidFunction<JavaRDD<Row>>() {
    @Override
    public void call(JavaRDD<Row> rowJavaRDD) throws Exception {
        Dataset<Row> profileDataFrame = hc.createDataFrame(rowJavaRDD, schema).coalesce(1);
        profileDataFrame.write().partitionBy("dt").mode(SaveMode.Append).insertInto(tableName);
        // profileDataFrame.write().partitionBy("dt").mode(SaveMode.Append).saveAsTable(tableName);
    }
});
Help me, please ~
Use profileDataFrame.write().mode(SaveMode.Append).insertInto(tableName),
without .partitionBy("dt").
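A minimal sketch of the suggested fix, under the assumption that the Hive table already exists and is partitioned by dt (insertInto resolves columns by position, so dt should generally be the last column in the DataFrame schema):
// Dynamic partition inserts into Hive may additionally require:
// spark.sql("SET hive.exec.dynamic.partition.mode=nonstrict");
Dataset<Row> profileDataFrame = hc.createDataFrame(rowJavaRDD, schema).coalesce(1);
profileDataFrame.write().mode(SaveMode.Append).insertInto(tableName);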

Reading/writing with Avro schemas AND Parquet format in SparkSQL

I'm trying to write and read Parquet files from SparkSQL. For reasons of schema evolution, I would like to use Avro schemas with my writes and reads.
My understanding is that this is possible outside of Spark (or manually within Spark) using e.g. AvroParquetWriter and Avro's Generic API. However, I would like to use SparkSQL's write() and read() methods (which work with DataFrameWriter and DataFrameReader), and which integrate well with SparkSQL (I will be writing and reading Datasets).
I can't for the life of me figure out how to do this, and am wondering if this is possible at all. The only options the SparkSQL parquet format seems to support are "compression" and "mergeSchema" -- i.e. no options for specifying an alternate schema format or alternate schema. In other words, it appears that there is no way to read/write Parquet files using Avro schemas using the SparkSQL API. But perhaps I'm just missing something?
To clarify, I also understand that this will basically just add the Avro schema to the Parquet metadata on write, and will add one more translation layer on read (Parquet format -> Avro schema -> SparkSQL internal format) but will specifically allow me to add default values for missing columns (which Avro schema supports but Parquet schema does not).
Also, I am not looking for a way to convert Avro to Parquet, or Parquet to Avro (rather a way to use them together), and I am not looking for a way to read/write plain Avro within SparkSQL (you can do this using databricks/spark-avro).
I am doing something similar. I use an Avro schema to write into Parquet files; however, I don't read them back as Avro. The same technique should work on read as well. I am not sure if this is the best way to do it, but here it is anyway:
I have AvroData.avsc, which contains the Avro schema.
val kafkaArr = KafkaUtils.createDirectStream[String, Array[Byte], StringDecoder, DefaultDecoder, Tuple2[String, Array[Byte]]](ssc, kafkaProps, fromOffsets, messageHandler)

kafkaArr.foreachRDD { (rdd, time) =>
  // Derive the Spark SQL schema from the Avro schema of the generated AvroData class.
  val schema = SchemaConverters.toSqlType(AvroData.getClassSchema).dataType.asInstanceOf[StructType]
  val ardd = rdd.mapPartitions { itr =>
    itr.map { r =>
      try {
        val cr = avroToListWithAudit(r._2, offsetSaved, loadDate, timeNow.toString)
        Row.fromSeq(cr.toArray)
      } catch {
        case e: Exception =>
          LogHandler.log.error("Exception while converting to Avro" + e.printStackTrace())
          System.exit(-1)
          Row(0) // This is just to allow the compiler to accept it; on exception, the application will exit before this point.
      }
    }
  }
}
public static List avroToListWithAudit(byte[] kfkBytes, String kfkOffset, String loaddate, String loadtime) throws IOException {
    AvroData av = getAvroData(kfkBytes);
    av.setLoaddate(loaddate);
    av.setLoadtime(loadtime);
    av.setKafkaOffset(kfkOffset);
    return avroToList(av);
}
public static List avroToList(AvroData a) throws UnsupportedEncodingException {
    List<Object> l = new ArrayList<>();
    for (Schema.Field f : a.getSchema().getFields()) {
        String field = f.name().toString();
        Object value = a.get(f.name());
        if (value == null) {
            // System.out.println("Adding null");
            l.add("");
        } else {
            switch (f.schema().getType().getName()) {
                case "union":
                    // System.out.println("Adding union");
                    l.add(value.toString());
                    break;
                default:
                    l.add(value);
                    break;
            }
        }
    }
    return l;
}
The getAvroData method needs code to construct the Avro object from the raw bytes. I am also trying to figure out a way to do that without having to specify each attribute setter explicitly, but it seems like there isn't one.
public static AvroData getAvroData(byte[] bytes) {
    AvroData av = AvroData.newBuilder().build();
    try {
        av.setAttr(String.valueOf("xyz"));
        .....
    }
}
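If the Kafka payload happens to be the plain Avro binary encoding of AvroData (and not, for example, a schema-registry wire format), a SpecificDatumReader could in principle rebuild the object without calling each setter; a hedged sketch:
import java.io.IOException;
import org.apache.avro.io.BinaryDecoder;
import org.apache.avro.io.DecoderFactory;
import org.apache.avro.specific.SpecificDatumReader;

// Deserializes raw Avro bytes directly into the generated AvroData class,
// assuming the bytes were produced with the same (or a compatible) schema.
public static AvroData getAvroData(byte[] bytes) throws IOException {
    SpecificDatumReader<AvroData> reader = new SpecificDatumReader<>(AvroData.class);
    BinaryDecoder decoder = DecoderFactory.get().binaryDecoder(bytes, null);
    return reader.read(null, decoder);
}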
Hope it helps
