spark version: 2.1.0
I want to insert a Dataset into Hive, partitioned by the 'dt' field, but it fails.
When using insertInto(), the error is: 'spark2.0 insertInto() can't be used together with partitionBy()'
When using saveAsTable(), the error is: 'Saving data in the Hive serde table ad.ad_industry_user_profile_incr is not supported yet. Please use the insertInto() API as an alternative.'
And the core code is as follows:
rowRDD.foreachRDD(new VoidFunction<JavaRDD<Row>>() {
    @Override
    public void call(JavaRDD<Row> rowJavaRDD) throws Exception {
        Dataset<Row> profileDataFrame = hc.createDataFrame(rowJavaRDD, schema).coalesce(1);
        profileDataFrame.write().partitionBy("dt").mode(SaveMode.Append).insertInto(tableName);
        // profileDataFrame.write().partitionBy("dt").mode(SaveMode.Append).saveAsTable(tableName);
    }
});
Help me, please ~
Use profileDataFrame.write().mode(SaveMode.Append).insertInto(tableName)
without .partitionBy("dt"). insertInto() writes into the table's existing partitions based on the DataFrame's columns, so an explicit partitionBy() is rejected.
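A minimal sketch of the corrected write, reusing the variable names from the question (rowRDD, hc, schema and tableName are assumed to be defined as in the original code):
rowRDD.foreachRDD(new VoidFunction<JavaRDD<Row>>() {
    @Override
    public void call(JavaRDD<Row> rowJavaRDD) throws Exception {
        Dataset<Row> profileDataFrame = hc.createDataFrame(rowJavaRDD, schema).coalesce(1);
        // insertInto() resolves columns by position and writes into the table's
        // existing partitioning, so partitionBy("dt") is dropped here.
        profileDataFrame.write().mode(SaveMode.Append).insertInto(tableName);
    }
});
Depending on the Hive setup, dynamic partition inserts may also require hive.exec.dynamic.partition=true and hive.exec.dynamic.partition.mode=nonstrict.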
We have a Spark Streaming application that reads data from Pub/Sub, applies some transformations, converts the JavaDStream to a Dataset, and then writes the results into normalized BigQuery tables.
Below is the sample code. All the normalized tables are partitioned on the CurrentTimestamp column.
Is there any parameter we can set to improve write performance?
pubSubMessageDStream
.foreachRDD(new VoidFunction2<JavaRDD<PubSubMessageSchema>, Time>() {
@Override
public void call(JavaRDD<PubSubMessageSchema> v1, Time v2) throws Exception {
Dataset<PubSubMessageSchema> pubSubDataSet = spark.createDataset(v1.rdd(), Encoders.bean(PubSubMessageSchema.class));
---
---
---
for (Row payloadName : payloadNameList) {
Dataset<Row> normalizedDS = null;
if (payloadNameAList.contains(payloadName)) {
    normalizedDS = dataSet.filter(col(colA).equalTo(<Value>));
} else if (payloadNameBList.contains(payloadName)) {
    normalizedDS = dataSet.filter(col(colA).equalTo(<Value>));
}
normalizedDS.selectExpr(columnsBigQuery).write().format("bigquery")
.option("temporaryGcsBucket", gcsBucketName)
.option("table", tableName)
.option("project", projectId)
.option("parentProject", parentProjectId)
.mode(SaveMode.Append)
.save();
}
}
});
Writing to BigQuery requires writing to GCS and then triggering a BigQuery load job. You can try changing the intermediateFormat to AVRO and see whether it affects performance; from our tests, the better format depends on the schema and data size.
In addition, the upcoming connector version 0.19.0 has a write implementation based on the DataSource v2 API for Spark 2.4, which should improve performance by 10-15%.
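For example, the intermediate format can be switched per write; a sketch based on the write call in the question (whether it helps depends on your schema and data volume):
normalizedDS.selectExpr(columnsBigQuery).write().format("bigquery")
    .option("temporaryGcsBucket", gcsBucketName)
    // write the intermediate files to GCS as Avro instead of the default Parquet;
    // some connector versions need the spark-avro package on the classpath for this
    .option("intermediateFormat", "avro")
    .option("table", tableName)
    .option("project", projectId)
    .option("parentProject", parentProjectId)
    .mode(SaveMode.Append)
    .save();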
I am currently working on a new project and chose Cassandra as our data store.
I have a use case where I store prices for material, and to accomplish it I created a list of User-Defined Types (UDTs). But unfortunately, while deserializing with the DataStax driver after querying the required data, I found that the list object is null even though there is a value for it in the database. Is it a current limitation of the Cassandra Java driver, or am I missing something?
This is what my simplified entity (table) looks like:
@PrimaryKeyColumn(name = "tenant_id", ordinal = 0, type = PrimaryKeyType.PARTITIONED)
private long tenantId;

@PrimaryKeyColumn(name = "item_id", ordinal = 1, type = PrimaryKeyType.CLUSTERED)
private String itemId;

@CassandraType(type = DataType.Name.LIST, userTypeName = "volume_scale_1")
private List<VolumeScale> volumeScale1;
}
So I am getting volumeScale1 as null after the database select query.
And this is what my UDT looks like:
In the Cassandra database:
CREATE TYPE pricingservice.volume_scale (
from_scale int,
to_scale int,
value frozen<price_value>
);
As a UDT in Java:
@UserDefinedType("volume_scale")
public class VolumeScale
{
    @CassandraType(type = DataType.Name.TEXT, userTypeName = "from_scale")
    @Column("from_scale")
    private String fromScale;

    @CassandraType(type = DataType.Name.TEXT, userTypeName = "to_scale")
    @Column("to_scale")
    private String toScale;

    @CassandraType(type = DataType.Name.UDT, userTypeName = "value")
    private PriceValue value;

    // getter and setter
}
I also tried using the Object Mapper from the Java driver itself, as per @Alex's suggestion, but got stuck at the point where creating an object using ItemPriceByMaterialMapperBuilder throws a compilation error. Is anything additional required for annotation processing, or am I missing something? Do you have any idea how to use the Mapper annotation? I also used Google AutoService to run annotation processing externally, but it didn't work.
@Mapper
// @AutoService(Processor.class)
public interface ItemPriceByMaterialMapper
// extends Processor
{
    static MapperBuilder<ItemPriceByMaterialMapper> builder(CqlSession session) {
        return new ItemPriceByMaterialMapperBuilder(session);
    }

    @DaoFactory
    ItemPriceByMaterialDao itemPriceByMaterialDao();

    // @DaoFactory
    // ItemPriceByMaterialDao itemPriceByMaterialDao(@DaoKeyspace CqlIdentifier keyspace);
}
Version used:
Java Version: 1.8
DataStax OSS java-driver-mapper-processor: 4.5.1
DataStax OSS java-driver-mapper-runtime: 4.5.1
Cassandra: 3.11.4
Spring Boot Framework: 2.2.4.RELEASE
From what I understand, you have multiple problems: if you're using Spring Data Cassandra, then you'll get an older driver (3.7.2 for Spring 2.2.6-RELEASE), and it may clash with driver 4.0.0 (which is itself too old; don't use it) that you're trying to use. Driver 4.x isn't binary compatible with previous drivers, and its support in Spring Data Cassandra will likely come only in the next major Spring release.
Instead of Spring Data, you can use the Object Mapper from the Java driver itself; it could be better optimized than the Spring version.
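If you do go the mapper route, here is a rough sketch of the driver 4.x wiring (class and method names are illustrative, not from your code; the generated XxxMapperBuilder class only exists after java-driver-mapper-processor has run as an annotation processor during compilation, which is the usual cause of 'cannot find symbol' errors on the builder):
import com.datastax.oss.driver.api.core.CqlSession;
import com.datastax.oss.driver.api.mapper.MapperBuilder;
import com.datastax.oss.driver.api.mapper.annotations.ClusteringColumn;
import com.datastax.oss.driver.api.mapper.annotations.Dao;
import com.datastax.oss.driver.api.mapper.annotations.DaoFactory;
import com.datastax.oss.driver.api.mapper.annotations.Entity;
import com.datastax.oss.driver.api.mapper.annotations.Mapper;
import com.datastax.oss.driver.api.mapper.annotations.PartitionKey;
import com.datastax.oss.driver.api.mapper.annotations.Select;

@Entity
class ItemPrice {
    @PartitionKey private long tenantId;
    @ClusteringColumn private String itemId;
    // the mapper needs a no-arg constructor plus getters and setters for every field (omitted here)
}

@Dao
interface ItemPriceDao {
    @Select
    ItemPrice findByTenantAndItem(long tenantId, String itemId); // parameters must cover the full primary key
}

@Mapper
interface InventoryMapper {
    @DaoFactory
    ItemPriceDao itemPriceDao();

    static MapperBuilder<InventoryMapper> builder(CqlSession session) {
        return new InventoryMapperBuilder(session); // generated at compile time by the processor
    }
}
Usage would then look roughly like:
CqlSession session = CqlSession.builder().withKeyspace("pricingservice").build();
ItemPriceDao dao = InventoryMapper.builder(session).build().itemPriceDao();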
I decided not to use the Object Mapper and to stay with Spring Data Cassandra on Spring 2.2.6-RELEASE. Thanks.
I am getting the error below when trying to write a Dataset from Spark to Teradata, when the Dataset contains some string data:
2018-01-02 15:49:05 [pool-2-thread-2] ERROR c.i.i.t.spark2.algo.JDBCTableWriter:115 - Error in JDBC operation:
java.sql.SQLException: [Teradata Database] [TeraJDBC 15.00.00.20] [Error 3706] [SQLState 42000] Syntax error: Data Type "TEXT" does not match a Defined Type name.
at com.teradata.jdbc.jdbc_4.util.ErrorFactory.makeDatabaseSQLException(ErrorFactory.java:308)
at com.teradata.jdbc.jdbc_4.statemachine.ReceiveInitSubState.action(ReceiveInitSubState.java:109)
at com.teradata.jdbc.jdbc_4.statemachine.StatementReceiveState.subStateMachine(StatementReceiveState.java:307)
at com.teradata.jdbc.jdbc_4.statemachine.StatementReceiveState.action(StatementReceiveState.java:196)
at com.teradata.jdbc.jdbc_4.statemachine.StatementController.runBody(StatementController.java:123)
at com.teradata.jdbc.jdbc_4.statemachine.StatementController.run(StatementController.java:114)
at com.teradata.jdbc.jdbc_4.TDStatement.executeStatement(TDStatement.java:385)
at com.teradata.jdbc.jdbc_4.TDStatement.doNonPrepExecuteUpdate(TDStatement.java:602)
at com.teradata.jdbc.jdbc_4.TDStatement.executeUpdate(TDStatement.java:1109)
at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.createTable(JdbcUtils.scala:805)
at org.apache.spark.sql.execution.datasources.jdbc.JdbcRelationProvider.createRelation(JdbcRelationProvider.scala:90)
at org.apache.spark.sql.execution.datasources.DataSource.write(DataSource.scala:472)
at org.apache.spark.sql.execution.datasources.SaveIntoDataSourceCommand.run(SaveIntoDataSourceCommand.scala:48)
at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:58)
at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:56)
at org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:74)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:117)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:117)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:138)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:135)
at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:116)
at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:92)
at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:92)
How can I ensure that the data gets properly written into Teradata?
I am reading a CSV file from HDFS into a Dataset and then trying to write it to Teradata using DataFrameWriter. I am using the code below for this:
ds.write().mode("append")
.jdbc(url, tableName, props);
I am using Spark 2.2.0 and Teradata 15.00.00.07.
I get somewhat similar issues when I try writing to Netezza, while in DB2 I am able to write, but string values are getting replaced with .
Is there any kind of option required while writing to these databases?
I was able to fix this issue by implementing a custom JdbcDialect for Teradata.
The same approach can be used to address similar issues with other data sources like Netezza, DB2, Hive, etc.
To do so, you need to extend the JdbcDialect class and register it:
public class TDDialect extends JdbcDialect {

    private static final long serialVersionUID = 1L;

    private static final Map<String, Option<JdbcType>> dataTypeMap = new HashMap<String, Option<JdbcType>>();

    static {
        dataTypeMap.put("int", Option.apply(JdbcType.apply("INTEGER", java.sql.Types.INTEGER)));
        dataTypeMap.put("long", Option.apply(JdbcType.apply("BIGINT", java.sql.Types.BIGINT)));
        dataTypeMap.put("double", Option.apply(JdbcType.apply("DOUBLE PRECISION", java.sql.Types.DOUBLE)));
        dataTypeMap.put("float", Option.apply(JdbcType.apply("FLOAT", java.sql.Types.FLOAT)));
        dataTypeMap.put("short", Option.apply(JdbcType.apply("SMALLINT", java.sql.Types.SMALLINT)));
        dataTypeMap.put("byte", Option.apply(JdbcType.apply("BYTEINT", java.sql.Types.TINYINT)));
        dataTypeMap.put("binary", Option.apply(JdbcType.apply("BLOB", java.sql.Types.BLOB)));
        dataTypeMap.put("timestamp", Option.apply(JdbcType.apply("TIMESTAMP", java.sql.Types.TIMESTAMP)));
        dataTypeMap.put("date", Option.apply(JdbcType.apply("DATE", java.sql.Types.DATE)));
        dataTypeMap.put("string", Option.apply(JdbcType.apply("VARCHAR(255)", java.sql.Types.VARCHAR)));
        dataTypeMap.put("boolean", Option.apply(JdbcType.apply("CHAR(1)", java.sql.Types.CHAR)));
        dataTypeMap.put("text", Option.apply(JdbcType.apply("VARCHAR(255)", java.sql.Types.VARCHAR)));
    }

    @Override
    public boolean canHandle(String url) {
        return url.startsWith("jdbc:teradata");
    }

    @Override
    public Option<JdbcType> getJDBCType(DataType dt) {
        Option<JdbcType> option = dataTypeMap.get(dt.simpleString().toLowerCase());
        if (option == null) {
            option = Option.empty();
        }
        return option;
    }
}
Now you can register this dialect using the code snippet below, before calling any action on Spark:
JdbcDialects.registerDialect(new TDDialect());
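Putting it together with the write from the question (a sketch; ds, url, tableName and props are the asker's variables):
// Register the Teradata dialect once, before any action triggers the JDBC write.
JdbcDialects.registerDialect(new TDDialect());

// The original write now maps Spark string columns to VARCHAR(255) instead of TEXT.
ds.write().mode("append").jdbc(url, tableName, props);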
With some data sources, like Hive for example, you may need to override one more method to avoid NumberFormatExceptions or similar exceptions:
@Override
public String quoteIdentifier(String colName) {
    return colName;
}
Hope this will help anyone facing similar issues.
It's working for me; can you please try it once and let me know?
Points to be noted:
Your Hive table must use Text format as its storage; it should not be ORC.
Create the schema in Teradata before writing to it from your PySpark notebook.
df = spark.sql("select * from dbname.tableName")
properties = {
"driver": "com.teradata.jdbc.TeraDriver",
"user": "xxxx",
"password": "xxxxx"
}
df.write.jdbc(url='provide_url',table='dbName.tableName', properties=properties)
I am using the JDBC driver from Simba Technologies Inc. to connect to Google Cloud Spanner. It runs as expected with java.sql. When I tried to use the Simba JDBC driver with Spark's JDBC reader in order to read the query output as a DataFrame, it gives wrong output.
Here is my spanner table:
UserID UserName
1 Vaijnath
2 Ganesh
3 Rahul
MetaData:
UserID(String)
UserName(String)
I am executing the query: SELECT * FROM users
This query fetches the correct data when I use the Simba JDBC driver with java.sql, but it fails to fetch the data when I use it with Spark SQL's JDBC reader.
It returns the DataFrame as
+------+--------+
|UserID|UserName|
+------+--------+
|UserID|UserName|
|UserID|UserName|
|UserID|UserName|
+------+--------+
As we can see, it returns the correct metadata and number of rows, but each row contains the column names.
Here is the code I am using:
import java.util.Properties
import org.apache.spark.sql.{DataFrame, SparkSession}
object spannerIn {
  val sparkSession = SparkSession
    .builder()
    .appName("Spark SQL basic example").master("local")
    .config("spark.sql.warehouse.dir", "file:///tmp")
    .config("spark.sql.shuffle.partitions", 1)
    .getOrCreate()

  val properties = new Properties()
  properties.setProperty("user", "")
  properties.setProperty("password", "")
  properties.setProperty("driver", "com.simba.cloudspanner.core.jdbc42.CloudSpanner42Driver")

  val connectionURL = "jdbc:cloudspanner://localhost;Project=abc;Instance=pqr;Database=xyz;PvtKeyPath=FilePath"
  val selectQuery = "(select * from users)"

  def main(args: Array[String]): Unit = {
    val df = createJdbcDataframe()
    df.show()
  }

  def createJdbcDataframe(): DataFrame = {
    sparkSession.read.jdbc(connectionURL, selectQuery, properties)
  }
}
My question is: can I use the Simba JDBC driver with Spark?
If yes, what extra things do I need to add?
Any help appreciated.
This occurs because Spark by default quotes all identifiers using double quotes ("), meaning the following query is being generated:
SELECT "UserID", "UserName" FROM USERS
This is interpreted by Cloud Spanner as selecting two fixed strings. It's basically the same as this in most other databases:
SELECT 'UserID', 'UserName' FROM USERS
Google Cloud Spanner uses backticks (`) for quoting identifiers, and expects this:
SELECT `UserID`, `UserName` FROM USERS
To fix this, you need to register a specific JDBC dialect for Google Cloud Spanner that quotes identifiers with a backtick, like this:
Class.forName("nl.topicus.jdbc.CloudSpannerDriver");
SparkSession spark = SparkSession.builder().appName("Java Spark SQL basic example")
.config("spark.some.config.option", "some-value").master("local").getOrCreate();
String sparkURL = "jdbc:cloudspanner://localhost;Project=project-id;Instance=instance-id;Database=db;PvtKeyPath=pathToKeyFile.json";
JdbcDialects.registerDialect(new JdbcDialect()
{
private static final long serialVersionUID = 1L;
@Override
public boolean canHandle(String url)
{
return url.toLowerCase().startsWith("jdbc:cloudspanner:");
}
@Override
public String quoteIdentifier(String column)
{
return "`" + column + "`";
}
});
Dataset<Row> dataset = spark.read().jdbc(sparkURL, "ACCOUNT", new Properties());
dataset.show();
Please note that I have not tested the above with the Simba driver, but only with this driver: https://github.com/olavloite/spanner-jdbc
I guess it should work with the Simba driver as well.
I'm trying to write and read Parquet files from SparkSQL. For reasons of schema evolution, I would like to use Avro schemas with my writes and reads.
My understanding is that this is possible outside of Spark (or manually within Spark) using e.g. AvroParquetWriter and Avro's Generic API. However, I would like to use SparkSQL's write() and read() methods (which work with DataFrameWriter and DataFrameReader), and which integrate well with SparkSQL (I will be writing and reading Datasets).
I can't for the life of me figure out how to do this, and am wondering if this is possible at all. The only options the SparkSQL parquet format seems to support are "compression" and "mergeSchema" -- i.e. no options for specifying an alternate schema format or alternate schema. In other words, it appears that there is no way to read/write Parquet files using Avro schemas using the SparkSQL API. But perhaps I'm just missing something?
To clarify, I also understand that this will basically just add the Avro schema to the Parquet metadata on write, and will add one more translation layer on read (Parquet format -> Avro schema -> SparkSQL internal format) but will specifically allow me to add default values for missing columns (which Avro schema supports but Parquet schema does not).
Also, I am not looking for a way to convert Avro to Parquet, or Parquet to Avro (rather a way to use them together), and I am not looking for a way to read/write plain Avro within SparkSQL (you can do this using databricks/spark-avro).
I am doing something similar. I use an Avro schema to write into Parquet files; however, I don't read them back as Avro. The same technique should work on read as well. I am not sure if this is the best way to do it, but here it is anyway:
I have AvroData.avsc, which holds the Avro schema.
val kafkaArr = KafkaUtils.createDirectStream[String, Array[Byte], StringDecoder, DefaultDecoder, Tuple2[String, Array[Byte]]](ssc, kafkaProps, fromOffsets, messageHandler)

kafkaArr.foreachRDD { (rdd, time) => {
  // Derive the Spark SQL schema from the Avro schema
  val schema = SchemaConverters.toSqlType(AvroData.getClassSchema).dataType.asInstanceOf[StructType]
  val ardd = rdd.mapPartitions { itr =>
    itr.map { r =>
      try {
        val cr = avroToListWithAudit(r._2, offsetSaved, loadDate, timeNow.toString)
        Row.fromSeq(cr.toArray)
      } catch {
        case e: Exception =>
          LogHandler.log.error("Exception while converting to Avro" + e.printStackTrace())
          System.exit(-1)
          Row(0) // just to satisfy the compiler; on exception the application exits before this point
      }
    }
  }
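  // The original snippet stops here without showing the actual Parquet write.
  // Presumably the converted row RDD is then turned into a DataFrame with the
  // Avro-derived schema and written out. A rough sketch of that missing tail
  // (assumed, not from the original answer; `spark` and `outputPath` are assumed names):
  val df = spark.createDataFrame(ardd, schema) // attach the Avro-derived schema to the row RDD
  df.write.mode("append").parquet(outputPath)  // outputPath is hypothetical
}}  // closes the foreachRDD block opened above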
public static List avroToListWithAudit(byte[] kfkBytes, String kfkOffset, String loaddate, String loadtime) throws IOException {
    AvroData av = getAvroData(kfkBytes);
    av.setLoaddate(loaddate);
    av.setLoadtime(loadtime);
    av.setKafkaOffset(kfkOffset);
    return avroToList(av);
}
public static List avroToList(AvroData a) throws UnsupportedEncodingException {
    List<Object> l = new ArrayList<>();
    for (Schema.Field f : a.getSchema().getFields()) {
        String field = f.name().toString();
        Object value = a.get(f.name());
        if (value == null) {
            // System.out.println("Adding null");
            l.add("");
        } else {
            switch (f.schema().getType().getName()) {
                case "union":
                    // System.out.println("Adding union");
                    l.add(value.toString());
                    break;
                default:
                    l.add(value);
                    break;
            }
        }
    }
    return l;
}
The getAvroData method needs code to construct the Avro object from the raw bytes. I am also trying to figure out a way to do that without having to specify each attribute setter explicitly, but it seems there isn't one.
public static AvroData getAvroData(byte[] bytes)
{
    AvroData av = AvroData.newBuilder().build();
    try {
        av.setAttr(String.valueOf("xyz"));
        .....
    }
}
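If the Kafka payload is plain binary-encoded Avro (no Confluent wire-format header), one way to avoid calling each setter by hand is Avro's SpecificDatumReader; a sketch, not from the original answer:
import java.io.IOException;
import org.apache.avro.io.BinaryDecoder;
import org.apache.avro.io.DecoderFactory;
import org.apache.avro.specific.SpecificDatumReader;

public static AvroData getAvroData(byte[] kfkBytes) throws IOException {
    // Deserialize the raw bytes directly into the generated AvroData class,
    // so no per-field setters are needed.
    SpecificDatumReader<AvroData> reader = new SpecificDatumReader<>(AvroData.class);
    BinaryDecoder decoder = DecoderFactory.get().binaryDecoder(kfkBytes, null);
    return reader.read(null, decoder);
}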
Hope it helps