I'm new to Spark and Kafka, using pyspark (spark 2.4.8).
Assume we have a Kafka streaming source and we want to stream at least N records to our database. What is the best way to guarantee the desired number of records is written and to stop once it is reached?
I thought of counting the number of micro-batches with a global variable and limiting the number of offsets per micro-batch, but I suspect that isn't the right way to solve the problem.
My code in general:
raw_stream_df = spark.readStream \
.format("kafka") \
.option("kafka.bootstrap.servers", kafka_server) \
.option("subscribe", "topic1, topic2") \
.option("startingOffsets", "earliest") \
.option("maxOffsetsPerTrigger", offsets_number) \
.load()
...
# define schema (not relevant)
...
counter = 0
def foreach_batch_function(df, epoch_id):
global counter
counter += 1
query = streaming_df \
.writeStream \
.outputMode("append") \
.format("memory") \
.queryName("query1") \
.foreachBatch(foreach_batch_function) \
.start()
But it didn't work. I tried to stop the query after reaching a constant number of micro-batches, but the counter did not even increase.
Back to my question: what is the right way to pass a lower bound on the number of requested records and then just stop?
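For what it's worth, here is a minimal sketch of that counting idea in Scala (the structure is the same in PySpark). It assumes streamingDf is the parsed stream and writeBatchToDatabase is a hypothetical function that writes one micro-batch to the target database; since the check happens between micro-batches, the query stops after at least N records rather than exactly N:
import org.apache.spark.sql.DataFrame

val targetCount = 1000L        // N, the lower bound of records to deliver
@volatile var written = 0L     // foreachBatch runs on the driver, so a driver-side var works

def writeAndCount(batch: DataFrame, epochId: Long): Unit = {
  batch.persist()
  writeBatchToDatabase(batch)  // hypothetical sink call
  written += batch.count()     // records delivered by this micro-batch
  batch.unpersist()
}

val query = streamingDf.writeStream
  .outputMode("append")
  .foreachBatch(writeAndCount _)
  .start()

// Poll the counter between micro-batches and stop once at least N records are written.
while (query.isActive && written < targetCount) {
  query.awaitTermination(1000) // wait up to one second, then re-check
}
if (query.isActive) query.stop()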
I am trying to read all the data in a Kafka topic in batches (reading between two offset values) and load it into Spark DataFrames, without using readStream from Spark Structured Streaming.
My idea is:
I first get the total number of records in the topic by finding the maximum offset value.
I define step, namely the number of records per batch.
With a for loop I read each data batch from the Kafka topic by setting the startingOffsets and endingOffsets options.
This is my code (for a topic with a single partition) to print the count in each batch:
val maxOffsetValue = {
Process(s"kafka-run-class.sh kafka.tools.GetOffsetShell --broker-list localhost:9092 --topic topicname")
.!!
.split(":")
.last
.trim
.toInt
}
val step = 1000
for (i <- 0 until maxOffsetValue by step) {
val df: DataFrame = {
spark
.read
.format("kafka")
.option("kafka.bootstrap.servers", "localhost:9092")
.option("subscribe", "topicname")
.option("startingOffsets", s"""{"topicname":{"0":${i}}}""")
.option("endingOffsets", s"""{"topicname":{"0":${i+step}}}""")
.load()
.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
.as[(String, String)]
.select(from_json(col("value"), dataSchema) as "data")
.select("data.*")
}
println(s"i: ${i}, i+step: ${i+step}, count: ${df.count()}")
}
However, it seems that the JSON format for startingOffsets and endingOffsets is not flexible, as apparently all offset indices need to be specified for each partition, e.g. something like {"0":${i}, "1":${i}} if there are two partitions.
My questions are:
Is there a better way to achieve the same result, ideally one that can be extended directly to a multi-partition topic?
Is there a way to read the maximum offset without using a shell command?
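On the second question (and the all-partitions JSON from the first), the Kafka consumer API can report end offsets directly, so the shell command is not needed. A sketch with a hypothetical helper, endOffsetsJson, that asks the brokers for the end offset of every partition and builds the JSON string the Kafka source expects:
import java.util.Properties
import org.apache.kafka.clients.consumer.KafkaConsumer
import org.apache.kafka.common.TopicPartition
import scala.collection.JavaConverters._

def endOffsetsJson(bootstrapServers: String, topic: String): String = {
  val props = new Properties()
  props.put("bootstrap.servers", bootstrapServers)
  props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")
  props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")
  val consumer = new KafkaConsumer[String, String](props)
  try {
    // one TopicPartition per partition of the topic
    val partitions = consumer.partitionsFor(topic).asScala
      .map(p => new TopicPartition(topic, p.partition()))
    // end offset (i.e. the "maximum offset") of each partition
    val endOffsets = consumer.endOffsets(partitions.asJava).asScala
    val perPartition = endOffsets
      .map { case (tp, offset) => s""""${tp.partition()}":$offset""" }
      .mkString(",")
    s"""{"$topic":{$perPartition}}"""
  } finally {
    consumer.close()
  }
}

// e.g. endOffsetsJson("localhost:9092", "topicname") might return {"topicname":{"0":52400,"1":51730}}
The same loop as above can then step each partition from 0 up to its end offset and build the startingOffsets / endingOffsets JSON for every batch.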
We tried two MLlib transformers for consuming from Kafka: one using a Structured Streaming batch query, like here, and one using normal Kafka consumers. The normal consumers read into a list that gets converted to a DataFrame.
We created two MLlib pipelines, each starting from an empty DataFrame. The first transformer of each pipeline did the reading from Kafka, so we had one pipeline per Kafka transformer type.
We then ran the pipelines three times: the 1st run with normal Kafka consumers, the 2nd with the Spark consumer, and the 3rd with normal consumers again.
The spark config had 4 executors, with 1 core each:
spark.executor.cores": "1", "spark.executor.instances": 4,
Questions are:
A. Where does the consumption happen, on the executors or on the driver? According to the driver UI, it looks like in both cases the executors did all the work: the driver did not pass any data and 4 executors got created.
B. Why do we get a different number of executors running? In the 1st run, with normal consumers, we see 4 executors working; in the 2nd run, with the Spark Kafka connector, 1 executor; and in the 3rd run, with normal consumers again, 1 executor but 2 cores.
You'll see the driver's UI attached at the bottom.
This is the relevant code:
Normal Consumer:
// Plain Kafka consumer driven from the transformer code (consumer construction elided here)
var kafkaConsumer: KafkaConsumer[String, String] = null
val recordList = scala.collection.mutable.ListBuffer[String]()
val readMessages = () => {
  // the original snippet omits the poll; something along these lines fetches the records
  val records = kafkaConsumer.poll(java.time.Duration.ofMillis($(timeoutMs))).asScala
  for (record <- records) {
    recordList.append(record.value())
  }
}
kafkaConsumer.subscribe(util.Arrays.asList($(topic)))
readMessages()
var df = recordList.toDF   // collected locally, then turned into a single-column DataFrame ("value")
kafkaConsumer.close()
val json_schema =
df.sparkSession.read.json(df.select("value").as[String]).schema
df = df.select(from_json(col("value"), json_schema).as("json"))
df = df.select(col("json.*"))
Spark consumer:
val records = dataset
.sparkSession
.read
.format("kafka")
.option("kafka.bootstrap.servers", $(url))
.option("subscribe", $(this.topic))
.option("kafkaConsumer.pollTimeoutMs", s"${$(timeoutMs)}")
.option("startingOffsets", $(startingOffsets))
.option("endingOffsets", $(endingOffsets))
.load
.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
.as[(String, String)]
OmnixLogger.warn(uid, "Executor ID AFTER polling: " + SparkEnv.get.executorId)
val json_schema = records.sparkSession.read.json(records.select("value").as[String]).schema
var df: DataFrame = records.select(from_json(col("value"), json_schema).as("json"))
df = df.select(col("json.*"))
I have written a Scala program for loading data from an MS SQL Server and writing it to BigQuery. I execute this in a Spark cluster (Google Dataproc). My issue is that even though I have a cluster with 64 cores, and I specify the executor parameters when running the job, and I partition the data I'm reading, Spark only reads data from a single executor. When I start the job I can see all the executors firing up and on the SQL Server I can see connections from all 4 workers, but within a minute, they all shut down again, leaving only one, which then runs for over an hour before finishing.
The data set is 65 million records, and I'm trying to partition it into 60 partitions.
This is my cluster:
gcloud dataproc clusters create my-cluster \
--properties dataproc:dataproc.conscrypt.provider.enable=false,spark:spark.executor.userClassPathFirst=true,spark:spark.driver.userClassPathFirst=true \
--region europe-north1 \
--subnet my-subnet \
--master-machine-type n1-standard-4 \
--worker-machine-type n1-highmem-16 \
--master-boot-disk-size 15GB \
--worker-boot-disk-size 500GB \
--image-version 1.4 \
--master-boot-disk-type=pd-ssd \
--worker-boot-disk-type=pd-ssd \
--num-worker-local-ssds=1 \
--num-workers=4
This is how I run the job:
gcloud dataproc jobs submit spark \
--cluster my-cluster \
--region europe-north1 \
--jars gs://mybucket/mycode.jar,gs://hadoop-lib/bigquery/bigquery-connector-hadoop3-latest.jar \
--class Main \
--properties \
spark.executor.memory=19g, \
spark.executor.cores=4, \
spark.executor.instances=11 \
-- yarn
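As an aside, gcloud expects --properties as one comma-separated list passed as a single argument, without spaces; the same submit with the property list written that way would look roughly like this (same values, only the formatting differs):
gcloud dataproc jobs submit spark \
  --cluster my-cluster \
  --region europe-north1 \
  --jars gs://mybucket/mycode.jar,gs://hadoop-lib/bigquery/bigquery-connector-hadoop3-latest.jar \
  --class Main \
  --properties spark.executor.memory=19g,spark.executor.cores=4,spark.executor.instances=11 \
  -- yarn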
This is the code I use to read the data:
val data = sqlQuery(ss,
serverName,
portNumber,
databaseName,
userName,
password,
tableName)
writeToBigQuery(
bqConfig,
data,
dataSetName,
replaceInvalidCharactersInTableName(r.getAs[String]("TableName")),
"WRITE_TRUNCATE")
def sqlQuery(ss: SparkSession,
hostName: String,
port: String,
databaseName: String,
user: String,
password: String,
query: String): DataFrame = {
val result = ss.read.format("jdbc")
.option("url", getJdbcUrl(hostName, port, databaseName))
.option("dbtable", query)
.option("user", user)
.option("password", password)
.option("driver", "com.microsoft.sqlserver.jdbc.SQLServerDriver")
.option("numPartitions", 60)
.option("partitionColumn", "entityid")
.option("lowerBound", 1)
.option("upperBound", 198012).load()
result
}
def writeToBigQuery(bqConf: Configuration,
df: DataFrame,
dataset: String,
table: String,
writeDisposition: String = "WRITE_APPEND"): Unit = {
//Convert illegal characters in column names
var legalColumnNamesDf = df
for (col <- df.columns) {
legalColumnNamesDf = legalColumnNamesDf.withColumnRenamed(
col,
col
.replaceAll("-", "_")
.replaceAll("\\s", "_")
.replaceAll("æ", "ae")
.replaceAll("ø", "oe")
.replaceAll("å", "aa")
.replaceAll("Æ", "AE")
.replaceAll("Ø", "OE")
.replaceAll("Å", "AA")
)
}
val outputGcsPath = s"gs://$bucket/" + HardcodedValues.SparkTempFolderRelativePath + UUID
.randomUUID()
.toString
val outputTableId = s"$projectId:$dataset.$table"
//Apply an explicit schema to avoid the creativity of BigQuery's automatic schema detection
val uniqBqConf = new Configuration(bqConf)
BigQueryOutputConfiguration.configure(
uniqBqConf,
outputTableId,
s"""{"fields":${Json(DefaultFormats).write(
legalColumnNamesDf.schema.map(
f =>
Map(
"name" -> f.name,
"type" -> f.dataType.sql
.replace("BIGINT", "INT")
.replace("INT", "INT64")
.replaceAll("DECIMAL\\(\\d+,\\d+\\)", "NUMERIC"),
"mode" -> (if (f.nullable) "NULLABLE"
else "REQUIRED")
))
)} }""",
outputGcsPath,
BigQueryFileFormat.NEWLINE_DELIMITED_JSON,
classOf[TextOutputFormat[_, _]]
)
uniqBqConf.set(
BigQueryConfiguration.OUTPUT_TABLE_WRITE_DISPOSITION_KEY,
if (Array("WRITE_APPEND", "WRITE_TRUNCATE") contains writeDisposition)
writeDisposition
else "WRITE_APPEND"
)
//Save to BigQuery
legalColumnNamesDf.rdd
.map(
row =>
(null,
Json(DefaultFormats).write(
ListMap(row.schema.fieldNames.toSeq.zip(row.toSeq): _*))))
.saveAsNewAPIHadoopDataset(uniqBqConf)
}
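A quick diagnostic that may help narrow this down: right after sqlQuery(...), check whether the JDBC read really produced 60 partitions and how the rows are spread across them (spark_partition_id is a built-in function; note that this scans the table once, so it is for troubleshooting only):
import org.apache.spark.sql.functions.spark_partition_id

// Did the numPartitions/partitionColumn options take effect, and is the data skewed?
println("partitions: " + data.rdd.getNumPartitions)   // expected: 60
data.groupBy(spark_partition_id().as("partition")).count().show(60, false)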
Any ideas would be appreciated.
If you look at the Spark UI, is there a lot of skew where one task is reading most of the data? My guess is that you're picking a poor partition key, so most of the data ends up in one partition.
This stackoverflow answer provides a detailed explanation: What is the meaning of partitionColumn, lowerBound, upperBound, numPartitions parameters?. I think your entity ids would need to be evenly distributed between 1 and 198012 for it to be a good column to partition on.
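To make that concrete, this is roughly how Spark turns those four options into per-partition JDBC queries with the values from the question (the clause text is approximate and version-dependent):
val lowerBound = 1L
val upperBound = 198012L
val numPartitions = 60
val stride = (upperBound - lowerBound) / numPartitions   // = 3300

// The 60 JDBC queries then look roughly like:
//   WHERE entityid < 3301 OR entityid IS NULL     -- partition 0
//   WHERE entityid >= 3301 AND entityid < 6601    -- partition 1
//   ...
//   WHERE entityid >= 194701                      -- partition 59 (also takes everything above upperBound)
// If most rows share a narrow band of entityid values, most of the 65M rows land in a few partitions,
// the executors whose ranges are empty finish immediately, and one executor does all the work.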
In the end I stopped telling Spark how many executors to run and just let it do dynamic allocation, and now it works. I asked for 24 partitions and it dynamically allocates 8 executors with 3 cores each, running 24 tasks in parallel.
I have a very simple Spark + Kafka application. I'm reading from Kafka and printing to the console. I have 2 lines in the code below, i.e. a Good-Line and a Bad-Line.
Initially I process with the good line, then I switch to the bad line for a while; when I change back to the good line I expect processing to resume from where it left off. Surprisingly, it starts from the latest offset, so the output looks like this (the records from the bad-line period are missing):
1
2
3
missing
missing
7
8
9
In the code below, how can I ensure I read all the messages? I did not find a place in the code where I can control the offsets. Even if there is duplicate processing I'm fine, because I'll have a unique id in my messages.
public static void main(String[] args) throws Exception {
String brokers = "quickstart:9092";
String topics = "simple_topic_1";
String master = "local[*]";
SparkSession sparkSession = SparkSession
.builder().appName(SimpleKafkaProcessor.class.getName())
.master(master).getOrCreate();
SQLContext sqlContext = sparkSession.sqlContext();
SparkContext context = sparkSession.sparkContext();
context.setLogLevel("ERROR");
Dataset<Row> rawDataSet = sparkSession.readStream()
.format("kafka")
.option("kafka.bootstrap.servers", brokers)
//.option("enable.auto.commit", "false")
.option("auto.offset.reset", "earliest")
.option("group.id", "safe_message_landing_app_2")
.option("subscribe", topics).load();
rawDataSet.printSchema();
rawDataSet.createOrReplaceTempView("basicView");
// Good-Line
sqlContext.sql("select string(Value) as StrValue from basicView").writeStream()
// Bad-Line
//sqlContext.sql("select fieldNotFound as StrValue from basicView").writeStream()
.format("console")
.option("checkpointLocation", "cp/" + UUID.randomUUID().toString())
.trigger(ProcessingTime.create("15 seconds"))
.start()
.awaitTermination();
}
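For reference, the Kafka source in Structured Streaming manages its own consumer group, and the starting position is controlled by the startingOffsets option rather than auto.offset.reset; resuming from where a query left off requires reusing the same checkpointLocation across runs instead of a fresh random one. A minimal Scala sketch of those two changes (the checkpoint path is a placeholder):
import org.apache.spark.sql.streaming.Trigger

val raw = sparkSession.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", brokers)
  .option("subscribe", topics)
  .option("startingOffsets", "earliest")   // only consulted when no checkpoint exists yet
  .load()

raw.selectExpr("CAST(value AS STRING) AS StrValue")
  .writeStream
  .format("console")
  .option("checkpointLocation", "/tmp/checkpoints/simple_topic_1")  // fixed path, not a random UUID
  .trigger(Trigger.ProcessingTime("15 seconds"))
  .start()
  .awaitTermination()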