We have a Spark Streaming application that reads data from Pub/Sub, applies some transformations, converts the JavaDStream to a Dataset, and then writes the results into normalized BigQuery tables.
Below is sample code. All the normalized tables are partitioned on the CurrentTimestamp column.
Is there any parameter we can set to improve write performance?
pubSubMessageDStream
    .foreachRDD(new VoidFunction2<JavaRDD<PubSubMessageSchema>, Time>() {
        @Override
        public void call(JavaRDD<PubSubMessageSchema> v1, Time v2) throws Exception {
            Dataset<PubSubMessageSchema> pubSubDataSet = spark.createDataset(v1.rdd(), Encoders.bean(PubSubMessageSchema.class));
            // ... transformations omitted ...
            for (Row payloadName : payloadNameList) {
                Dataset<Row> normalizedDS = null;
                if (payloadNameAList.contains(payloadName)) {
                    normalizedDS = dataSet.filter(col(colA).equalTo(<Value>));
                } else if (payloadNameBList.contains(payloadName)) {
                    normalizedDS = dataSet.filter(col(colA).equalTo(<Value>));
                }
                normalizedDS.selectExpr(columnsBigQuery).write().format("bigquery")
                    .option("temporaryGcsBucket", gcsBucketName)
                    .option("table", tableName)
                    .option("project", projectId)
                    .option("parentProject", parentProjectId)
                    .mode(SaveMode.Append)
                    .save();
            }
        }
    });
Writing to BigQuery requires writing the data to GCS and then triggering a BigQuery load job. You can try changing the intermediateFormat to AVRO and see whether it affects performance; in our tests the better format depended on the schema and data size.
In addition, the upcoming connector version 0.19.0 adds a write implementation for the DataSource v2 API for Spark 2.4, which should improve performance by 10-15%.
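For reference, here is a minimal Scala sketch of the same write with intermediateFormat set. The variable names mirror the question's code and are assumed to be in scope; depending on your Spark version you may also need the spark-avro package on the classpath for the Avro intermediate format.

import org.apache.spark.sql.{Dataset, Row, SaveMode}

// Sketch only: stage the intermediate files on GCS as Avro instead of the default format.
def writeToBigQuery(normalizedDS: Dataset[Row], columnsBigQuery: Seq[String],
                    gcsBucketName: String, tableName: String,
                    projectId: String, parentProjectId: String): Unit = {
  normalizedDS.selectExpr(columnsBigQuery: _*)
    .write
    .format("bigquery")
    .option("temporaryGcsBucket", gcsBucketName)
    .option("intermediateFormat", "avro") // compare against the default for your schema and data size
    .option("table", tableName)
    .option("project", projectId)
    .option("parentProject", parentProjectId)
    .mode(SaveMode.Append)
    .save()
}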
Related
We are running into a problem where, for one of our applications, we don't see any evidence of batches being processed in the Structured Streaming tab of the Spark UI.
I have written a small program (below) to reproduce the issue.
A self-contained project that lets you build the app, along with scripts that facilitate upload to AWS and details on how to run and reproduce the issue, can be found here: https://github.com/buildlackey/spark-struct-streaming-metrics-missing-on-aws (The GitHub version of the app is a slightly evolved version of what is presented below, but it illustrates the problem of Spark streaming metrics not showing up.)
The program can be run 'locally' -- on someone's laptop in local[*] mode (say, with a dockerized Kafka instance) -- or on an EMR cluster. For local mode operation, invoke the main method with 'localTest' as the first argument.
In our case, when we run on the EMR cluster, pointing to a topic where we know there are many data records (we read from 'earliest'), we see NO batches processed on the cluster, for some reason.
In the local[*] case we CAN see batches processed.
To capture evidence of this, I wrote a foreachBatch handler that simply does a toLocalIterator.asScala.toList.mkString("\n") on the Dataset of each batch and then dumps the resulting string to a file. Running locally, I see evidence of the captured records in the temporary file. However, when I run on the cluster and ssh into one of the executors, I see no such files. I also checked the master node: no files matching the pattern 'Missing'.
So batches are not triggering on the cluster. Our Kafka has plenty of data, and when running on the cluster the logs show we are churning through messages at increasing offsets:
21/12/16 05:15:21 DEBUG KafkaDataConsumer: Get spark-kafka-source-blah topic.foo.event-18 nextOffset 4596542913 requested 4596542913
21/12/16 05:15:21 DEBUG KafkaDataConsumer: Get spark-kafka-source-blah topic.foo.event-18 nextOffset 4596542914 requested 4596542914
Note: to get the logs we are using
yarn logs --applicationId <appId>
which should fetch both driver and executor logs for the entire run (once the app terminates).
Now, in the local[*] case we CAN see batches processed. The evidence is a file whose name matches the pattern 'Missing' in our tmp folder.
I am including my simple demo program below. If you can spot the issue and clue us in, I'd be very grateful!
// Please forgive the busy code.. I stripped this down from a much larger system.
import com.typesafe.scalalogging.StrictLogging
import org.apache.spark.sql.streaming.{OutputMode, Trigger}
import org.apache.spark.sql.{Dataset, SparkSession}
import java.io.File
import java.util
import scala.collection.JavaConverters.asScalaIteratorConverter
import scala.concurrent.duration.Duration
object AwsSupportCaseFailsToYieldLogs extends StrictLogging {
case class KafkaEvent(fooMsgKey: Array[Byte],
fooMsg: Array[Byte],
topic: String,
partition: String,
offset: String) extends Serializable
case class SparkSessionConfig(appName: String, master: String) {
def sessionBuilder(): SparkSession.Builder = {
val builder = SparkSession.builder
builder.master(master).appName(appName) // apply both config fields; appName was previously unused
builder
}
}
case class KafkaConfig(kafkaBootstrapServers: String, kafkaTopic: String, kafkaStartingOffsets: String)
def sessionFactory: (SparkSessionConfig) => SparkSession = {
(sparkSessionConfig) => {
sparkSessionConfig.sessionBuilder().getOrCreate()
}
}
def main(args: Array[String]): Unit = {
val (sparkSessionConfig, kafkaConfig) =
if (args.length >= 1 && args(0) == "localTest") {
getLocalTestConfiguration
} else {
getRunOnClusterConfiguration
}
val spark: SparkSession = sessionFactory(sparkSessionConfig)
spark.sparkContext.setLogLevel("ERROR")
import spark.implicits._
val dataSetOfKafkaEvent: Dataset[KafkaEvent] = spark.readStream.
format("kafka").
option("subscribe", kafkaConfig.kafkaTopic).
option("kafka.bootstrap.servers", kafkaConfig.kafkaBootstrapServers).
option("startingOffsets", kafkaConfig.kafkaStartingOffsets).
load.
select(
$"key" cast "binary",
$"value" cast "binary",
$"topic",
$"partition" cast "string",
$"offset" cast "string").map { row =>
KafkaEvent(
row.getAs[Array[Byte]](0),
row.getAs[Array[Byte]](1),
row.getAs[String](2),
row.getAs[String](3),
row.getAs[String](4))
}
val initDF = dataSetOfKafkaEvent.map { item: KafkaEvent => item.toString }
val function: (Dataset[String], Long) => Unit =
(dataSetOfString, batchId) => {
val iter: util.Iterator[String] = dataSetOfString.toLocalIterator()
val lines = iter.asScala.toList.mkString("\n")
val outfile = writeStringToTmpFile(lines)
println(s"writing to file: ${outfile.getAbsolutePath}")
logger.error(s"writing to file: ${outfile.getAbsolutePath} / $lines")
}
val trigger = Trigger.ProcessingTime(Duration("1 second"))
initDF.writeStream
.foreachBatch(function)
.trigger(trigger)
.outputMode("append")
.start
.awaitTermination()
}
private def getLocalTestConfiguration: (SparkSessionConfig, KafkaConfig) = {
val sparkSessionConfig: SparkSessionConfig =
SparkSessionConfig(master = "local[*]", appName = "dummy2")
val kafkaConfig: KafkaConfig =
KafkaConfig(
kafkaBootstrapServers = "localhost:9092",
kafkaTopic = "test-topic",
kafkaStartingOffsets = "earliest")
(sparkSessionConfig, kafkaConfig)
}
private def getRunOnClusterConfiguration = {
val sparkSessionConfig: SparkSessionConfig = SparkSessionConfig(master = "yarn", appName = "AwsSupportCase")
val kafkaConfig: KafkaConfig =
KafkaConfig(
kafkaBootstrapServers= "kafka.foo.bar.broker:9092", // TODO - change this for kafka on your EMR cluster.
kafkaTopic= "mongo.bongo.event", // TODO - change this for kafka on your EMR cluster.
kafkaStartingOffsets = "earliest")
(sparkSessionConfig, kafkaConfig)
}
def writeStringFile(string: String, file: File): File = {
java.nio.file.Files.write(java.nio.file.Paths.get(file.getAbsolutePath), string.getBytes).toFile
}
def writeStringToTmpFile(string: String, deleteOnExit: Boolean = false): File = {
val file: File = File.createTempFile("streamingConsoleMissing", "sad")
if (deleteOnExit) {
file.deleteOnExit() // mark for deletion on JVM exit rather than deleting immediately
}
writeStringFile(string, file)
}
}
I have encountered a similar issue; setting maxOffsetsPerTrigger should fix it. Actually, it isn't really an issue: all logs and metrics for a batch are only printed or shown after that batch finishes, which is why you can't see the job making progress.
If maxOffsetsPerTrigger doesn't solve it, you could try consuming from the latest offset to confirm that the processing logic is correct.
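For example, here is a hedged sketch of how the readStream in the program above could be capped. The KafkaConfig type and option names come from the question's code; the 100000 limit is an arbitrary assumption to tune.

import org.apache.spark.sql.{Dataset, Row, SparkSession}

// Sketch: cap each micro-batch so the first batch from 'earliest' finishes quickly
// and progress becomes visible in the UI and logs.
def cappedKafkaStream(spark: SparkSession, kafkaConfig: KafkaConfig): Dataset[Row] =
  spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", kafkaConfig.kafkaBootstrapServers)
    .option("subscribe", kafkaConfig.kafkaTopic)
    .option("startingOffsets", kafkaConfig.kafkaStartingOffsets)
    .option("maxOffsetsPerTrigger", 100000L) // at most ~100k offsets per trigger (tune this)
    .load()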
This is a provisional answer. One of our team members has a theory that looks pretty likely: batches ARE getting processed (this is demonstrated better by the version of the program I linked to on GitHub), but because there is so much backed up in the topic on our cluster, processing the first batch (from 'earliest') takes a very long time. So when looking at the cluster we see zero batches processed, even though there is clearly work being done. The solution is probably to use maxOffsetsPerTrigger to gate the amount of incoming traffic when starting from 'earliest' against a topic with a huge volume of data. We are working on confirming this.
I am using the JDBC driver from Simba Technologies Inc. to connect to Google Cloud Spanner. It works as expected with plain java.sql. However, when I try to use the Simba JDBC driver with Spark's JDBC reader to read the query output as a DataFrame, I get wrong output.
Here is my spanner table:
UserID UserName
1 Vaijnath
2 Ganesh
3 Rahul
MetaData:
UserID(String)
UserName(String)
I am executing the query: SELECT * FROM users
This query fetches the correct data when I use the Simba JDBC driver with plain Java SQL, but it fails to fetch the data when I use it with Spark SQL's JDBC reader.
It returns the DataFrame as
+------+--------+
|UserID|UserName|
+------+--------+
|UserID|UserName|
|UserID|UserName|
|UserID|UserName|
+------+--------+
As we can see, it returns the correct metadata and number of rows, but each row contains the column names.
Here is the code I am using:
import java.util.Properties
import org.apache.spark.sql.{DataFrame, SparkSession}
object spannerIn {
val sparkSession = SparkSession
.builder()
.appName("Spark SQL basic example").master("local")
.config("spark.sql.warehouse.dir", "file:///tmp")
.config("spark.sql.shuffle.partitions", 1)
.getOrCreate()
val properties = new Properties()
properties.setProperty("user", "")
properties.setProperty("password", "")
properties.setProperty("driver", "com.simba.cloudspanner.core.jdbc42.CloudSpanner42Driver")
val connectionURL="jdbc:cloudspanner://localhost;Project=abc;Instance=pqr;Database=xyz;PvtKeyPath=FilePath"
val selectQuery="(select * from users)"
def main(args: Array[String]): Unit = {
val df = createJdbcDataframe()
df.show()
}
def createJdbcDataframe(): DataFrame = {
sparkSession.read.jdbc(connectionURL, selectQuery, properties)
}
}
My question is: can I use the Simba JDBC driver with Spark?
If yes, what extra things do I need to add?
Any help appreciated.
This occurs because Spark by default quotes all identifiers using a double quote ("), meaning the following query is being generated:
SELECT "UserID", "UserName" FROM USERS
This is interpreted by Cloud Spanner as selecting two fixed strings. It's basically the same as this in most other databases:
SELECT 'UserID', 'UserName' FROM USERS
Google Cloud Spanner uses backticks (`) for quoting identifiers, and expects this:
SELECT `UserID`, `UserName` FROM USERS
To fix this, you need to register a specific JDBC dialect for Google Cloud Spanner that quotes identifiers with backticks, like this:
import java.util.Properties;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.jdbc.JdbcDialect;
import org.apache.spark.sql.jdbc.JdbcDialects;

Class.forName("nl.topicus.jdbc.CloudSpannerDriver");
SparkSession spark = SparkSession.builder().appName("Java Spark SQL basic example")
    .config("spark.some.config.option", "some-value").master("local").getOrCreate();
String sparkURL = "jdbc:cloudspanner://localhost;Project=project-id;Instance=instance-id;Database=db;PvtKeyPath=pathToKeyFile.json";
JdbcDialects.registerDialect(new JdbcDialect()
{
    private static final long serialVersionUID = 1L;

    @Override
    public boolean canHandle(String url)
    {
        return url.toLowerCase().startsWith("jdbc:cloudspanner:");
    }

    @Override
    public String quoteIdentifier(String column)
    {
        return "`" + column + "`";
    }
});
Dataset<Row> dataset = spark.read().jdbc(sparkURL, "ACCOUNT", new Properties());
dataset.show();
Please note that I have not tested the above with the Simba driver, but only with this driver: https://github.com/olavloite/spanner-jdbc
I guess it should work with the Simba driver as well.
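Since the question's code is in Scala, an untested Scala equivalent of the dialect registration (to be called before sparkSession.read.jdbc) would look roughly like this:

import org.apache.spark.sql.jdbc.{JdbcDialect, JdbcDialects}

// Register a dialect that quotes identifiers with backticks for cloudspanner JDBC URLs.
JdbcDialects.registerDialect(new JdbcDialect {
  override def canHandle(url: String): Boolean =
    url.toLowerCase.startsWith("jdbc:cloudspanner:")

  override def quoteIdentifier(column: String): String =
    "`" + column + "`"
})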
Spark version: 2.1.0
I want to insert a Dataset into a Hive table partitioned by the 'dt' field, but it fails.
When using insertInto(), the error is: "spark2.0 insertInto() can't be used together with partitionBy()".
When using saveAsTable(), the error is: "Saving data in the Hive serde table ad.ad_industry_user_profile_incr is not supported yet. Please use the insertInto() API as an alternative."
The core code is as follows:
rowRDD.foreachRDD(new VoidFunction<JavaRDD<Row>>() {
@Override
public void call(JavaRDD<Row> rowJavaRDD) throws Exception {
Dataset<Row> profileDataFrame = hc.createDataFrame(rowJavaRDD, schema).coalesce(1);
profileDataFrame.write().partitionBy("dt").mode(SaveMode.Append).insertInto(tableName);
// profileDataFrame.write().partitionBy("dt").mode(SaveMode.Append).saveAsTable(tableName);
}
});
Help me, please ~
Use profileDataFrame.write().mode(SaveMode.Append).insertInto(tableName) without .partitionBy("dt"). insertInto() writes into an existing table and relies on the partitioning already defined on that table, which is why it cannot be combined with partitionBy().
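A minimal sketch of the corrected write, assuming the 'dt' column is the last column of the DataFrame so that it lines up positionally with the table's PARTITIONED BY (dt) definition, since insertInto() resolves columns by position:

import org.apache.spark.sql.{DataFrame, SaveMode}

// Sketch: append into the existing partitioned Hive table; no partitionBy() needed.
def appendToHive(profileDataFrame: DataFrame, tableName: String): Unit =
  profileDataFrame.write.mode(SaveMode.Append).insertInto(tableName)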
I am trying to use the ForeachWriter interface in Spark 2.1, but I cannot use it.
It will be supported in Spark 2.2.0. To learn how to use it, I suggest you read this blog post: https://databricks.com/blog/2017/04/26/processing-data-in-apache-kafka-with-structured-streaming-in-apache-spark-2-2.html
You can try Spark 2.2.0 RC2 [1] or just wait for the final release.
Another option is to take a look at this blog if you cannot use Spark 2.2.0+:
https://databricks.com/blog/2017/04/04/real-time-end-to-end-integration-with-apache-kafka-in-apache-sparks-structured-streaming.html
It has a very simple Kafka sink and maybe that's enough for you.
[1] http://apache-spark-developers-list.1001551.n3.nabble.com/VOTE-Apache-Spark-2-2-0-RC2-td21497.html
The first thing to know is that if you are working with Spark Structured Streaming and processing streaming data, you'll have a streaming Dataset.
That said, the way to write this streaming Dataset is by calling a ForeachWriter -- you got that right.
import org.apache.spark.sql.ForeachWriter
val writer = new ForeachWriter[Commons.UserEvent] {
override def open(partitionId: Long, version: Long) = true
override def process(value: Commons.UserEvent) = {
processRow(value)
}
override def close(errorOrNull: Throwable) = {}
}
val query =
ds.writeStream.queryName("aggregateStructuredStream").outputMode("complete").foreach(writer).start
And the function that writes into the topic would look like this:
private def processRow(value: Commons.UserEvent) = {
/*
* Producer.send(topic, data)
*/
}
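For completeness, here is a hedged sketch of what processRow could look like with a plain Kafka producer. The broker address, topic name, and toString serialization are assumptions, not part of the original answer; in practice you would typically create the producer in open() and close it in close() so each partition gets its own instance.

import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

// Hypothetical producer; replace the broker and topic with your own.
lazy val producer: KafkaProducer[String, String] = {
  val props = new Properties()
  props.put("bootstrap.servers", "localhost:9092")
  props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
  props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")
  new KafkaProducer[String, String](props)
}

def processRow(value: Commons.UserEvent): Unit = {
  // Assumes UserEvent has a usable string form; swap in your own serializer as needed.
  producer.send(new ProducerRecord[String, String]("user-events", value.toString))
}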
I'm trying to write and read Parquet files from SparkSQL. For reasons of schema evolution, I would like to use Avro schemas with my writes and reads.
My understanding is that this is possible outside of Spark (or manually within Spark) using e.g. AvroParquetWriter and Avro's Generic API. However, I would like to use SparkSQL's write() and read() methods (which work with DataFrameWriter and DataFrameReader), and which integrate well with SparkSQL (I will be writing and reading Datasets).
I can't for the life of me figure out how to do this, and am wondering if this is possible at all. The only options the SparkSQL parquet format seems to support are "compression" and "mergeSchema" -- i.e. no options for specifying an alternate schema format or alternate schema. In other words, it appears that there is no way to read/write Parquet files using Avro schemas using the SparkSQL API. But perhaps I'm just missing something?
To clarify, I also understand that this will basically just add the Avro schema to the Parquet metadata on write, and will add one more translation layer on read (Parquet format -> Avro schema -> SparkSQL internal format) but will specifically allow me to add default values for missing columns (which Avro schema supports but Parquet schema does not).
Also, I am not looking for a way to convert Avro to Parquet, or Parquet to Avro (rather a way to use them together), and I am not looking for a way to read/write plain Avro within SparkSQL (you can do this using databricks/spark-avro).
I am doing something similar. I use an Avro schema to write into a Parquet file; however, I don't read it back as Avro. But the same technique should work on read as well. I am not sure if this is the best way to do it, but here it is anyway:
I have AvroData.avsc, which holds the Avro schema.
val kafkaArr = KafkaUtils.createDirectStream[String, Array[Byte], StringDecoder, DefaultDecoder, Tuple2[String, Array[Byte]]](ssc, kafkaProps, fromOffsets, messageHandler)
kafkaArr.foreachRDD { (rdd, time) =>
  // Convert the Avro schema into a Spark SQL StructType.
  val schema = SchemaConverters.toSqlType(AvroData.getClassSchema).dataType.asInstanceOf[StructType]
  val ardd = rdd.mapPartitions { itr =>
    itr.map { r =>
      try {
        val cr = avroToListWithAudit(r._2, offsetSaved, loadDate, timeNow.toString)
        Row.fromSeq(cr.toArray)
      } catch {
        case e: Exception =>
          LogHandler.log.error("Exception while converting to Avro" + e.printStackTrace())
          System.exit(-1)
          Row(0) // Just to satisfy the compiler; on exception, the application exits before this point.
      }
    }
  }
  // ... rest of the loop omitted in the original; see the sketch below.
}
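The rest of the loop is omitted in the original; presumably it turns ardd and the converted schema into a DataFrame and writes Parquet, roughly like this hedged sketch (the SparkSession and output path are assumptions):

import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{Row, SaveMode, SparkSession}
import org.apache.spark.sql.types.StructType

// Sketch of the omitted tail of foreachRDD: materialize the mapped rows with the
// Avro-derived schema and append them as Parquet. The output path is hypothetical.
def writeBatchAsParquet(spark: SparkSession, ardd: RDD[Row], schema: StructType): Unit = {
  val df = spark.createDataFrame(ardd, schema)
  df.write.mode(SaveMode.Append).parquet("/path/to/output")
}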
public static List avroToListWithAudit(byte[] kfkBytes, String kfkOffset, String loaddate, String loadtime ) throws IOException {
AvroData av = getAvroData(kfkBytes);
av.setLoaddate(loaddate);
av.setLoadtime(loadtime);
av.setKafkaOffset(kfkOffset);
return avroToList(av);
}
public static List avroToList(AvroData a) throws UnsupportedEncodingException{
List<Object> l = new ArrayList<>();
for (Schema.Field f : a.getSchema().getFields()) {
String field = f.name().toString();
Object value = a.get(f.name());
if (value == null) {
//System.out.println("Adding null");
l.add("");
}
else {
switch (f.schema().getType().getName()){
case "union"://System.out.println("Adding union");
l.add(value.toString());
break;
default:l.add(value);
break;
}
}
}
return l;
}
The getAvroData method needs to contain the code that constructs the Avro object from the raw bytes. I am also trying to figure out a way to do that without having to specify each attribute setter explicitly, but it seems like there isn't one.
public static AvroData getAvroData(byte[] kfkBytes)
{
AvroData av = AvroData.newBuilder().build();
try {
av.setAttr(String.valueOf("xyz"));
.....
}
}
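Not from the original answer, but one possible way to build the record from the raw bytes without calling each setter is Avro's SpecificDatumReader, assuming AvroData is an Avro-generated SpecificRecord and the bytes are plain binary-encoded Avro (no Confluent schema-registry header):

import org.apache.avro.io.DecoderFactory
import org.apache.avro.specific.SpecificDatumReader

// Sketch: decode binary Avro bytes directly into the generated AvroData class;
// the audit fields (load date/time, offset) can still be set afterwards as above.
def getAvroData(kfkBytes: Array[Byte]): AvroData = {
  val reader = new SpecificDatumReader[AvroData](AvroData.getClassSchema)
  val decoder = DecoderFactory.get().binaryDecoder(kfkBytes, null)
  reader.read(null, decoder)
}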
Hope it helps