Issue with partitioning SQL table data when reading from Spark - apache-spark

I have written a Scala program for loading data from an MS SQL Server and writing it to BigQuery. I execute this in a Spark cluster (Google Dataproc). My issue is that even though I have a cluster with 64 cores, and I specify the executor parameters when running the job, and I partition the data I'm reading, Spark only reads data from a single executor. When I start the job I can see all the executors firing up and on the SQL Server I can see connections from all 4 workers, but within a minute, they all shut down again, leaving only one, which then runs for over an hour before finishing.
The data set is 65 million records, and I'm trying to partition it into 60 partitions.
This is my cluster:
gcloud dataproc clusters create my-cluster \
--properties dataproc:dataproc.conscrypt.provider.enable=false,spark:spark.executor.userClassPathFirst=true,spark:spark.driver.userClassPathFirst=true \
--region europe-north1 \
--subnet my-subnet \
--master-machine-type n1-standard-4 \
--worker-machine-type n1-highmem-16 \
--master-boot-disk-size 15GB \
--worker-boot-disk-size 500GB \
--image-version 1.4 \
--master-boot-disk-type=pd-ssd \
--worker-boot-disk-type=pd-ssd \
--num-worker-local-ssds=1 \
--num-workers=4
This is how I run the job:
gcloud dataproc jobs submit spark \
--cluster my-cluster \
--region europe-north1 \
--jars gs://mybucket/mycode.jar,gs://hadoop-lib/bigquery/bigquery-connector-hadoop3-latest.jar \
--class Main \
--properties \
spark.executor.memory=19g, \
spark.executor.cores=4, \
spark.executor.instances=11 \
-- yarn
This is the code I use to read the data:
val data = sqlQuery(ss,
                    serverName,
                    portNumber,
                    databaseName,
                    userName,
                    password,
                    tableName)

writeToBigQuery(
  bqConfig,
  data,
  dataSetName,
  replaceInvalidCharactersInTableName(r.getAs[String]("TableName")),
  "WRITE_TRUNCATE")

def sqlQuery(ss: SparkSession,
             hostName: String,
             port: String,
             databaseName: String,
             user: String,
             password: String,
             query: String): DataFrame = {
  val result = ss.read.format("jdbc")
    .option("url", getJdbcUrl(hostName, port, databaseName))
    .option("dbtable", query)
    .option("user", user)
    .option("password", password)
    .option("driver", "com.microsoft.sqlserver.jdbc.SQLServerDriver")
    .option("numPartitions", 60)
    .option("partitionColumn", "entityid")
    .option("lowerBound", 1)
    .option("upperBound", 198012)
    .load()
  result
}
def writeToBigQuery(bqConf: Configuration,
                    df: DataFrame,
                    dataset: String,
                    table: String,
                    writeDisposition: String = "WRITE_APPEND"): Unit = {
  // Convert illegal characters in column names
  var legalColumnNamesDf = df
  for (col <- df.columns) {
    legalColumnNamesDf = legalColumnNamesDf.withColumnRenamed(
      col,
      col
        .replaceAll("-", "_")
        .replaceAll("\\s", "_")
        .replaceAll("æ", "ae")
        .replaceAll("ø", "oe")
        .replaceAll("å", "aa")
        .replaceAll("Æ", "AE")
        .replaceAll("Ø", "OE")
        .replaceAll("Å", "AA")
    )
  }

  val outputGcsPath = s"gs://$bucket/" + HardcodedValues.SparkTempFolderRelativePath + UUID
    .randomUUID()
    .toString
  val outputTableId = s"$projectId:$dataset.$table"

  // Apply an explicit schema to avoid the creativity of BigQuery's auto-detection
  val uniqBqConf = new Configuration(bqConf)
  BigQueryOutputConfiguration.configure(
    uniqBqConf,
    outputTableId,
    s"""{"fields":${Json(DefaultFormats).write(
      legalColumnNamesDf.schema.map(
        f =>
          Map(
            "name" -> f.name,
            "type" -> f.dataType.sql
              .replace("BIGINT", "INT")
              .replace("INT", "INT64")
              .replaceAll("DECIMAL\\(\\d+,\\d+\\)", "NUMERIC"),
            "mode" -> (if (f.nullable) "NULLABLE"
                       else "REQUIRED")
          ))
    )} }""",
    outputGcsPath,
    BigQueryFileFormat.NEWLINE_DELIMITED_JSON,
    classOf[TextOutputFormat[_, _]]
  )
  uniqBqConf.set(
    BigQueryConfiguration.OUTPUT_TABLE_WRITE_DISPOSITION_KEY,
    if (Array("WRITE_APPEND", "WRITE_TRUNCATE") contains writeDisposition)
      writeDisposition
    else "WRITE_APPEND"
  )

  // Save to BigQuery
  legalColumnNamesDf.rdd
    .map(
      row =>
        (null,
         Json(DefaultFormats).write(
           ListMap(row.schema.fieldNames.toSeq.zip(row.toSeq): _*))))
    .saveAsNewAPIHadoopDataset(uniqBqConf)
}
Any ideas would be appreciated.

If you look at the Spark UI, is there a lot of skew where one task is reading most of the data? My guess is that you're picking a poor partition key, so most of the data ends up in one partition.
This Stack Overflow answer provides a detailed explanation: What is the meaning of partitionColumn, lowerBound, upperBound, numPartitions parameters? I think your entity ids would need to be evenly distributed between 1 and 198012 for it to be a good column to partition on.
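One quick way to check is to push a histogram query down to SQL Server and see how the ids spread across the 60 ranges Spark will generate. Here is a rough sketch of that check, shown in PySpark for brevity; the JDBC options are identical from Scala, and the host, database, credentials and table name are placeholders:
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# 3300 ~= (198012 - 1) / 60, i.e. roughly the id stride Spark assigns to each
# of the 60 partitions. Only ~60 aggregated rows come back over JDBC.
bucket_query = """(SELECT (entityid - 1) / 3300 AS bucket, COUNT(*) AS cnt
                   FROM myTable
                   GROUP BY (entityid - 1) / 3300) AS buckets"""

buckets = (spark.read.format("jdbc")
           .option("url", "jdbc:sqlserver://myhost:1433;databaseName=mydb")  # placeholder
           .option("dbtable", bucket_query)
           .option("user", "user")          # placeholder
           .option("password", "password")  # placeholder
           .option("driver", "com.microsoft.sqlserver.jdbc.SQLServerDriver")
           .load())

# If a handful of buckets hold most of the 65M rows, entityid is skewed and
# most partitions will be empty while one executor does all the work.
buckets.orderBy(F.desc("cnt")).show(60, truncate=False)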

In the end I stopped telling Spark how many executors to run and just let it do dynamic allocation, and now it works. I asked for 24 partitions and it dynamically allocates 8 executors with 3 cores each, running 24 tasks in parallel.
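For reference, the settings involved in that change are the standard dynamic allocation properties. A rough sketch of them (values are illustrative only; on Dataproc the same keys can also be passed with --properties at submit time, and the external shuffle service they require is enabled there by default):
from pyspark.sql import SparkSession

# Illustrative values only - let Spark size the executor pool itself instead
# of fixing spark.executor.instances.
spark = (SparkSession.builder
         .config("spark.dynamicAllocation.enabled", "true")
         .config("spark.dynamicAllocation.minExecutors", "1")
         .config("spark.dynamicAllocation.maxExecutors", "16")
         .config("spark.executor.cores", "3")
         .config("spark.shuffle.service.enabled", "true")  # required by dynamic allocation
         .getOrCreate())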

Related

Where should I put my credentials for data streaming with Kafka in Databricks?

I have some values in Azure Key Vault (AKV).
A simple initial Google search gave me this:
username = dbutils.secrets.get(scope = "DATAAI-CEC", key = "dai-kafka-cec-api-key")
pwd = dbutils.secrets.get(scope = "DATAAI-CEC", key = "dai-kafka-cec-secret")
from kafka import KafkaConsumer
consumer = KafkaConsumer('TOPIC',
bootstrap_servers = 'SERVER:PORT',
enable_auto_commit = False,
auto_offset_reset = 'earliest',
consumer_timeout_ms = 2000,
security_protocol = 'SASL_SSL',
sasl_mechanism = 'PLAIN',
sasl_plain_username = username,
sasl_plain_password = pwd)
This works once, when the cell in Databricks runs; however, after that single run it is finished and is not listening to Kafka messages anymore, and the cluster goes to the off state after the configured idle time (in my case 30 minutes).
So it doesn't solve my problem.
My next Google search turned up this Databricks blog post (Processing Data in Apache Kafka with Structured Streaming in Apache Spark 2.2):
from pyspark.sql.types import *
from pyspark.sql.functions import from_json
from pyspark.sql.functions import *
schema = StructType() \
.add("EventHeader", StructType() \
.add("UUID", StringType()) \
.add("APPLICATION_ID", StringType())
.add("FORMAT", StringType())) \
.add("EmissionReportMessage", StructType() \
.add("reportId", StringType()) \
.add("startDate", StringType()) \
.add("endDate", StringType()) \
.add("unitOfMeasure", StringType()) \
.add("reportLanguage", StringType()) \
.add("companies", ArrayType(StructType([StructField("ccid", StringType(), True)]))))
parsed_kafka = spark \
.readStream \
.format("kafka") \
.option("kafka.bootstrap.servers", "SERVER:PORT") \
.option("subscribe", "TOPIC") \
.option("startingOffsets", "earliest") \
.load()\
.select(from_json(col("value").cast("string"), schema).alias("kafka_parsed_value"))
There are some issues
Where should I put my GenID or user/pass info?
When I run the display command, it runs, but it never stops and it never shows the result.
however, after a single run it is finished, and it is not listening to Kafka messages anymore
Given that you have enable_auto_commit = False, it should continue to work on following runs. But this isn't using Spark...
Where should I put my GenID or user/pass info
You would add SASL/SSL properties into option() parameters.
Ex. For SASL_PLAIN
option("kafka.sasl.jaas.config",
'org.apache.kafka.common.security.plain.PlainLoginModule required username="{}" password="{}";'.format(username, password))
See related question
it will never stop
Because you're running a streaming query, started with readStream, rather than a batch read.
it will never show the result
You'll need to use parsed_kafka.writeStream.format("console"), for example, somewhere (assuming you want to stick with readStream rather than display() and a batch read).
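Putting the two pieces together, a rough sketch of what the streaming read could look like with the secrets from the question wired into the Kafka client options (SERVER:PORT, TOPIC and the scope/key names are the question's placeholders, schema is the StructType defined above, and spark/dbutils are provided by the Databricks notebook):
from pyspark.sql.functions import col, from_json

# Secrets from the Databricks secret scope, as in the question.
username = dbutils.secrets.get(scope="DATAAI-CEC", key="dai-kafka-cec-api-key")
pwd = dbutils.secrets.get(scope="DATAAI-CEC", key="dai-kafka-cec-secret")

jaas = ('org.apache.kafka.common.security.plain.PlainLoginModule required '
        'username="{}" password="{}";'.format(username, pwd))

parsed_kafka = (spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "SERVER:PORT")
    .option("subscribe", "TOPIC")
    .option("startingOffsets", "earliest")
    # Kafka client security settings are passed through with the "kafka." prefix.
    .option("kafka.security.protocol", "SASL_SSL")
    .option("kafka.sasl.mechanism", "PLAIN")
    .option("kafka.sasl.jaas.config", jaas)
    .load()
    .select(from_json(col("value").cast("string"), schema).alias("kafka_parsed_value")))

# Console sink just to see output; the query keeps running until stopped.
query = parsed_kafka.writeStream.format("console").outputMode("append").start()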

In Azure Databricks, writing a PySpark dataframe to Event Hub is taking too long, as there are 3 million records in the dataframe

An Oracle database table has 3 million records. I need to read it into a dataframe, convert it to JSON format, and send it to Event Hub for downstream systems.
Below is my PySpark code to connect to and read the Oracle DB table as a dataframe:
df = spark.read \
.format("jdbc") \
.option("url", databaseurl) \
.option("query","select * from tablename") \
.option("user", loginusername) \
.option("password", password) \
.option("driver", "oracle.jdbc.driver.OracleDriver") \
.option("oracle.jdbc.timezoneAsRegion", "false") \
.load()
Then I convert the column names and values of each row into JSON (placed under a new column named body) and send it to Event Hub.
I have defined ehconf and the Event Hub connection string. Below is my write-to-Event-Hub code:
df.select("body") \
.write\
.format("eventhubs") \
.options(**ehconf) \
.save()
My PySpark code is taking 8 hours to send the 3 million records to Event Hub.
Could you please suggest how to write a PySpark dataframe to Event Hub faster?
My Event Hub is created under an Event Hubs cluster which has 1 CU of capacity.
Databricks cluster config :
mode: Standard
runtime: 10.3
worker type: Standard_D16as_v4 64GB Memory,16 cores (min workers :1, max workers:5)
driver type: Standard_D16as_v4 64GB Memory,16 cores
The problem is that the JDBC connector just uses one connection to the database by default, so most of your workers are probably idle. That is something you can confirm in Cluster Settings > Metrics > Ganglia UI.
To actually make use of all the workers, the JDBC connector needs to know how to parallelize retrieving your data. For this you need a field whose values are evenly distributed. For example, if you have a date field in your data and every date has a similar number of records, you can use it to split up the data:
df = spark.read \
.format("jdbc") \
.option("url", jdbcUrl) \
.option("dbtable", tableName) \
.option("user", jdbcUsername) \
.option("password", jdbcPassword) \
.option("numPartitions", 64) \
.option("partitionColumn", "<dateField>") \
.option("lowerBound", "2019-01-01") \
.option("upperBound", "2022-04-07") \
.load()
You have to define the field name and the min and max value of that field so that the JDBC connector can try to split the work evenly between the workers. numPartitions is the number of individual connections opened; the best value depends on the number of workers in your cluster and how many connections your data source can handle.
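If the min and max aren't known up front, they can be fetched with a small pushed-down query before the partitioned read. A rough sketch, reusing the placeholder names from the snippet above (<dateField> stays a placeholder):
# Fetch the bounds of the partition column with a tiny aggregate query first.
bounds = (spark.read.format("jdbc")
    .option("url", jdbcUrl)
    .option("dbtable", "(SELECT MIN(<dateField>) AS lo, MAX(<dateField>) AS hi FROM {}) b".format(tableName))
    .option("user", jdbcUsername)
    .option("password", jdbcPassword)
    .load()
    .first())

df = (spark.read.format("jdbc")
    .option("url", jdbcUrl)
    .option("dbtable", tableName)
    .option("user", jdbcUsername)
    .option("password", jdbcPassword)
    .option("numPartitions", 64)
    .option("partitionColumn", "<dateField>")
    .option("lowerBound", str(bounds["lo"]))   # e.g. "2019-01-01"
    .option("upperBound", str(bounds["hi"]))
    .load())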

Spark JDBC read API: Determining the number of partitions dynamically for a column of type datetime

I'm trying to read a table from an RDS MySQL instance using PySpark. It's a huge table, hence I want to parallelize the read operation by making use of the partitioning concept. The table doesn't have a numeric column to find the number of partitions. Instead, it has a timestamp column (i.e. datetime type).
I found the lower and upper bounds by retrieving the min and max values of the timestamp column. However, I'm not sure if there's a standard formula to find out the number of partitions dynamically. Here is what I'm doing currently (hardcoding the value for the numPartitions parameter):
select_sql = "SELECT {} FROM {}".format(columns, table)
partition_info = {'partition_column': 'col1',
'lower_bound': '<result of min(col1)>',
'upper_bound': '<result of max(col1)>',
'num_partitions': '10'}
read_df = spark.read.format("jdbc") \
.option("driver", driver) \
.option("url", url) \
.option("dbtable", select_sql) \
.option("user", user) \
.option("password", password) \
.option("useSSL", False) \
.option("partitionColumn", partition_info['partition_column']) \
.option("lowerBound", partition_info['lower_bound']) \
.option("upperBound", partition_info['upper_bound']) \
.option("numPartitions", partition_info['num_partitions']) \
.load()
Please suggest a solution/approach that works. Thanks.
How to set numPartitions depends on your cluster's definition. There are no right, wrong, or automatic settings here. As long as you understand the logic behind partitionColumn, lowerBound, upperBound, and numPartitions, and do plenty of benchmarking, you can decide what the right number is.
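That said, a common heuristic is to target a fixed number of rows per partition and cap the result at a small multiple of the cluster's total cores. A rough sketch (the one-million-rows target and the 3x cap are assumptions; driver, url, user, password and table are the question's variables):
# Count the rows with a small pushed-down query, then derive numPartitions.
row_count = (spark.read.format("jdbc")
    .option("driver", driver)
    .option("url", url)
    .option("dbtable", "(SELECT COUNT(*) AS cnt FROM {}) c".format(table))
    .option("user", user)
    .option("password", password)
    .load()
    .first()["cnt"])

target_rows_per_partition = 1_000_000          # assumption: roughly 1M rows per task
max_partitions = 3 * spark.sparkContext.defaultParallelism
num_partitions = max(1, min(row_count // target_rows_per_partition + 1, max_partitions))
Whatever number comes out, keep it well below the connection limit of the MySQL instance, since each partition opens its own JDBC connection.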

Predicate in Pyspark JDBC does not do a partitioned read

I am trying to read a Mysql table in PySpark using JDBC read. The tricky part here is that the table is considerably big, and therefore causes our Spark executor to crash when it does a non-partitioned vanilla read of the table.
Hence, the objective is basically that we want to do a partitioned read of the table. A couple of things that we have been trying:
We looked at the "numPartitions-partitionColumn-lowerBound-upperBound" combo. This does not work for us since our indexing key of the original table is a string, and this only works with integral types.
The other alternative that is suggested in the docs is the predicates option. This does not seem to work for us, in the sense that the number of partitions still seems to be 1, instead of the number of predicates that we are sending.
The code snippet that we are using is as follows -
input_df = self._Flow__spark.read \
.format("jdbc") \
.option("url", url) \
.option("user", config.user) \
.option("password", config.password) \
.option("driver", "com.mysql.cj.jdbc.Driver") \
.option("dbtable", "({}) as query ".format(get_route_surge_details_query(start_date, end_date))) \
.option("predicates", ["recommendation_date = '2020-11-14'",
"recommendation_date = '2020-11-15'",
"recommendation_date = '2020-11-16'",
"recommendation_date = '2020-11-17'",
]) \
.load()
It seems to be doing a full table scan (non-partitioned), whilst completely ignoring the passed predicates. Would be great to get some help on this.
Try the following:
spark_session\
.read\
.jdbc(url=url,
table= "({}) as query ".format(get_route_surge_details_query(start_date, end_date)),
predicates=["recommendation_date = '2020-11-14'",
"recommendation_date = '2020-11-15'",
"recommendation_date = '2020-11-16'",
"recommendation_date = '2020-11-17'"],
properties={
"user": config.user,
"password": config.password,
"driver": "com.mysql.cj.jdbc.Driver"
}
)
Verify the partitions by
df.rdd.getNumPartitions() # Should be 4
I found this after digging through the docs at https://spark.apache.org/docs/latest/api/python/pyspark.sql.html?highlight=jdbc#pyspark.sql.DataFrameReader.jdbc
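As a side note, the predicates list can be generated instead of hand-written. A small sketch assuming one partition per day over a contiguous date range (the dates and column name are the ones from the question):
from datetime import date, timedelta

def date_predicates(start, end, column="recommendation_date"):
    # One predicate per day; each predicate becomes its own partition/connection.
    days = (end - start).days + 1
    return ["{} = '{}'".format(column, start + timedelta(days=i)) for i in range(days)]

predicates = date_predicates(date(2020, 11, 14), date(2020, 11, 17))
# ["recommendation_date = '2020-11-14'", ..., "recommendation_date = '2020-11-17'"]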

Structured Streaming with mapGroupState causing GC and Performance Issues

In our application we are using Structured Streaming with mapGroupsWithState in combination with a read from Kafka.
After starting the application, performance is good during the initial batches; the Kafka lastProgress shows almost 65K records per second. After a few batches the performance drops to around 2,000 records per second.
In the mapGroupsWithState function, basically an update and a comparison against the value from the state store take place (code snippet provided below).
Number of Offsets from Kafka - 100000
If we look at the thread dump from one of the executors, there is nothing suspicious except blocked threads in the Spark UI.
GC stats from one of the executors are below; we didn't see much difference after GC.
Code Snippet
case class MonitoringEvent(InternalID: String, monStartTimestamp: Timestamp, EndTimestamp: Timestamp, Stream: String, ParentID: Option[String])
val df = spark.readStream.format("kafka")
.option("kafka.bootstrap.servers", Config.uatKafkaUrl)
.option("subscribe", Config.interBranchInputTopic)
.option("startingOffsets", "earliest")
.option("failOnDataLoss", "true")
.option("maxOffsetsPerTrigger", "100000")
.option("request.required.acks", "all")
.load()
.selectExpr("CAST(value AS STRING)")
val me: Dataset[MonitoringEvent] = df.select(from_json($"value", schema).as("data")).select($"data.*").as[MonitoringEvent]
val IB = me.groupByKey(x => (x.ParentID.getOrElse(x.InternalID)))
.mapGroupsWithState(GroupStateTimeout.ProcessingTimeTimeout)(IBTransformer.mappingFunctionIB _)
.flatMap(x => x)
val IBStream = IB
.select(to_json(struct($"*")).as("value"), $"InternalID".as("key"))
.writeStream
.format("kafka")
.queryName("InterBranch_Events_KafkaWriter")
.option("kafka.bootstrap.servers", Config.uatKafkaUrl)
.option("topic", Config.interBranchTopicComplete)
.option("checkpointLocation", Config.interBranchCheckPointDir)
.outputMode("update")
.start()
object IBTransformer extends Serializable {

  case class IBStateStore(InternalID: String, monStartTimestamp: Timestamp)

  def mappingFunctionIB(intrKey: String, intrValue: Iterator[MonitoringEvent], intrState: GroupState[IBStateStore]): Seq[MonitoringEvent] = {
    try {
      if (intrState.hasTimedOut) {
        intrState.remove()
        Seq.empty
      } else {
        val events = intrValue.toSeq
        if (events.map(_.Status).contains(Started)) {
          val tmp = events.filter(x => (x.Status == Started && x.InternalID == intrKey)).head
          val toStore = IBStateStore(tmp.InternalID, tmp.monStartTimestamp)
          intrState.update(toStore)
          intrState.setTimeoutDuration(1200000)
        }
        val IB = events.filter(_.ParentID.isDefined)
        if (intrState.exists && IB.nonEmpty) {
          val startEvent = intrState.get
          val IBUpdate = IB.map { x => x.copy(InternalID = startEvent.InternalID, monStartTimestamp = startEvent.monStartTimestamp) }
          IBUpdate.foreach(id => intrState.update((IBStateStore(id.InternalID, id.monStartTimestamp)))) // updates the state with new IDs
          IBUpdate
        } else {
          Seq.empty
        }
      }
    }
    catch
      .
      .
      .
  }
}
Number of executors used - 8
Executor memory - 8G
Driver memory - 8G
Java options and memory I provide in my spark-submit script:
--executor-memory 8G \
--executor-cores 8 \
--num-executors 4 \
--driver-memory 8G \
--driver-java-options "-Dsun.security.krb5.debug=true -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -Duser.timezone=UTC -Dconfig.file=configIB.conf -Dlog4j.configuration=IBprocessor.log4j.properties" \
I tried using G1GC in the Java options, but there was no improvement. The keys we hold are also fewer than the size provided, so I'm not sure where it is going wrong.
Any suggestions to improve performance and eliminate the GC issues?
