Can an AWS Glue Job Script handle multiple streaming sources in the same job? - apache-spark

In the process of trying to minimize the costs related to AWS Glue, I am attempting to consolidate some of the streaming jobs (which currently run as separate jobs) into one. These jobs are very similar in nature:
they batch events from a Kafka topic
apply some form of transformation
store the resulting DataFrame in an Apache Hudi table
However, a job script only seems to accept one streaming source (i.e. one Kafka topic).
After a typical setup, I have the following:
glue_context.forEachBatch(
    frame=raw_data_frame_one,
    batch_function=batch_function_one,
    options={
        "windowSize": args["MICRO_BATCH_WINDOW_SIZE"],
        "checkpointLocation": args["TempDir"] + "/" + job_name_one + "/checkpoint/",
    },
)

glue_context.forEachBatch(
    frame=raw_data_frame_two,
    batch_function=batch_function_two,
    options={
        "windowSize": args["MICRO_BATCH_WINDOW_SIZE"],
        "checkpointLocation": args["TempDir"] + "/" + job_name_two + "/checkpoint/",
    },
)
I expect both raw data frames to be streamed, but only raw_data_frame_one and batch_function_one are actually used.
I also tried submitting these workloads from different threads (with spark.scheduler.mode set to FAIR), to no avail.
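One commonly suggested workaround, since glue_context.forEachBatch appears to block on the single query it starts, is to build both streams with plain Structured Streaming, attach each batch function via foreachBatch, start both queries, and only block once at the end with awaitAnyTermination. Below is a minimal PySpark sketch of that idea; the topic names and the kafka_bootstrap variable are placeholders, and it assumes MICRO_BATCH_WINDOW_SIZE is an interval string such as "100 seconds":

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

def read_topic(topic):
    # one Kafka source per topic; kafka_bootstrap is a placeholder for your brokers
    return (spark.readStream.format("kafka")
            .option("kafka.bootstrap.servers", kafka_bootstrap)
            .option("subscribe", topic)
            .load())

# start both queries without waiting on either of them individually
query_one = (read_topic("topic_one").writeStream
             .foreachBatch(batch_function_one)
             .option("checkpointLocation", args["TempDir"] + "/" + job_name_one + "/checkpoint/")
             .trigger(processingTime=args["MICRO_BATCH_WINDOW_SIZE"])
             .start())

query_two = (read_topic("topic_two").writeStream
             .foreachBatch(batch_function_two)
             .option("checkpointLocation", args["TempDir"] + "/" + job_name_two + "/checkpoint/")
             .trigger(processingTime=args["MICRO_BATCH_WINDOW_SIZE"])
             .start())

# block only once, after every query has been started
spark.streams.awaitAnyTermination()

Whether this behaves well inside a single Glue job still depends on the job having enough workers for both queries.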

Related

Is it possible to have a single kafka stream for multiple queries in structured streaming?

I have a Spark application that has to process multiple queries in parallel using a single Kafka topic as the source.
The behavior I noticed is that each query has its own consumer (which is in its own consumer group), causing the same data to be streamed to the application multiple times (please correct me if I'm wrong), which seems very inefficient. Instead, I would like to have a single stream of data that would then be processed in parallel by Spark.
What would be the recommended way to improve performance in the scenario above? Should I focus on optimizing Kafka partitions instead of how Spark interacts with Kafka?
Any thoughts are welcome,
Thank you.
tl;dr Not possible in the current design.
A single streaming query "starts" from a sink. There can be only one sink per streaming query (I'm repeating it to myself to remember it better, as I seem to have been caught out multiple times with Spark Structured Streaming, Kafka Streams, and recently ksqlDB).
Once you have a sink (output), the streaming query can be started (on its own daemon thread).
For exactly the reason you mentioned (queries should not share data, which with the Kafka Consumer API requires their group.id values to differ), every streaming query creates a unique group ID (cf. this code and the comment in 3.3.0) so that the same records can be transformed by different streaming queries:
// Each running query should use its own group id. Otherwise, the query may be only assigned
// partial data since Kafka will assign partitions to multiple consumers having the same group
// id. Hence, we should generate a unique id for each query.
val uniqueGroupId = KafkaSourceProvider.batchUniqueGroupId(sourceOptions)
And that makes sense IMHO.
Should I focus on optimizing Kafka partitions instead of how Spark interacts with Kafka?
Guess so.
You can separate your source data frame into different stages, yes.
val df = spark.readStream.format("kafka") ...
val strDf = df.select('value.cast("string").as("string")) ...
val df1 = strDf.filter(...) // processed in "parallel"
val df2 = strDf.filter(...) // processed in "parallel"
Only the first line should be creating Kafka consumer instance(s), not the other stages, as they depend on the consumer records from the first stage.
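To make the single-stream idea concrete, the split can also be done inside one foreachBatch sink, so only one streaming query (and therefore one set of Kafka consumers) exists. A rough sketch, written in PySpark for consistency with the original question; str_df stands in for strDf above, and the filter conditions and output paths are made up:

from pyspark.sql.functions import col

def split_and_write(batch_df, batch_id):
    # both branches reuse the same micro-batch, so Kafka is only read once per trigger
    batch_df.filter(col("string").contains("a")).write.mode("append").parquet("s3://bucket/out_a/")
    batch_df.filter(col("string").contains("b")).write.mode("append").parquet("s3://bucket/out_b/")

(str_df.writeStream
    .foreachBatch(split_and_write)
    .option("checkpointLocation", "s3://bucket/checkpoints/split/")
    .start()
    .awaitTermination())

The trade-off is that the two branches run one after the other within each micro-batch, rather than as independent queries.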

How to read and write two DataFrames in parallel with Apache Spark

I am creating two DataFrames from a JDBC data source and then writing them both to an S3 bucket. The timestamps of the files written to S3 are 20 seconds apart, which tells me that the operations are not executed in parallel. The data is loaded from the same database/table, with the same number of rows, for testing purposes. How can I make both the reads and the writes execute in parallel?
The Python script runs on an AWS Glue development endpoint with 2 DPUs, standard worker type.
df1 = spark.read.format("jdbc").option("driver", driver).option("url", url).option("user", user).option("password", password).option("dbtable", query1).option("fetchSize", 50000).load()
df2 = spark.read.format("jdbc").option("driver", driver).option("url", url).option("user", user).option("password", password).option("dbtable", query2).option("fetchSize", 50000).load()
df1.write.mode("append").format("csv").option("compression", "gzip").option("timestampFormat", "yyyy.MM.dd HH:mm:ss,SSS").option("maxRecordsPerFile", 1000000).save("s3://bucket-name/test1")
df2.write.mode("append").format("csv").option("compression", "gzip").option("timestampFormat", "yyyy.MM.dd HH:mm:ss,SSS").option("maxRecordsPerFile", 1000000).save("s3://bucket-name/test2")
Enable concurrent execution of Glue jobs and run the job twice, since within a single job it is not possible to save the DataFrames in parallel, Spark being distributed processing.
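In concrete terms that usually means raising the job's maximum concurrency above 1 and then starting two runs with different arguments, for example from boto3; the job name and argument key below are hypothetical:

import boto3

glue = boto3.client("glue")

# requires the Glue job's "Max concurrency" setting to be greater than 1
for query_arg in ("query1", "query2"):
    glue.start_job_run(
        JobName="jdbc-to-s3-export",               # hypothetical job name
        Arguments={"--dbtable_query": query_arg},  # hypothetical job argument
    )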

How to ingest different spark dataframes in a single spark job

I want to write an ETL pipeline in Spark handling different input sources while using as few computing resources as possible, and I have a problem using the 'traditional' Spark ETL approach.
I have a number of streaming datasources which need to be persisted into DeltaLake tables. Each datasource is simply a folder in S3 with Avro files. Each datasource has a different schema. Each datasource should be persisted into its own DeltaLake table. Little conversion other than Avro -> Delta is needed, only enrichment with some additional fields derived from the filename.
New files are added at a moderate rate, from once a minute to once a day, depending on the datasource. I get a Kafka notification when new data lands, describing what kind of data it is and the S3 file path.
Assume there are two datasources, A and B. A is the s3://bucket/A/* files, B the s3://bucket/B/* files. Whenever a new file is added, I get a Kafka message with payload {'datasource': 'A', filename: 's3://bucket/A/file1', ... other fields}. A files should go to the Delta table s3://delta/A/, B files to s3://delta/B/.
How can I ingest them all in a single Spark application with minimal latency?
As new data is constantly coming in, this sounds like streaming. But in Spark streaming one needs to define the stream schema upfront, and I have different sources with different schemas that are not known upfront.
Spinning up a dedicated Spark application per datasource is not an option, as there are 100+ datasources with very small files arriving. Having 100+ Spark applications is a waste of money. Everything should be ingested using a single cluster of moderate size.
The only idea I have now: in the driver process, run a normal Kafka consumer; for each record, read a dataframe, enrich it with additional fields, and persist it to its Delta table. For more parallelism, consume multiple messages and run them in futures, so multiple jobs run concurrently (see the sketch after the pseudo-code below).
Some pseudo-code, in a driver process:
val consumer = KafkaConsumer(...)
consumer.foreach { record =>
  val ds = record.datasource
  val file = record.filename
  val df = spark.read.format("avro").load(file)
    .withColumn("id", lit(record.id))
  val dest = s"s3://delta/${ds}"
  df.write.format("delta").save(dest)
  consumer.commit(/* offset from record */)
}
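For the "more parallelism" variant mentioned above (consuming several notifications and running them in futures), here is a rough sketch in PySpark terms. It assumes an existing SparkSession named spark; consumer_poll_batch() is a hypothetical helper that returns one batch of Kafka messages, and the record fields mirror the payload described earlier:

import json
from concurrent.futures import ThreadPoolExecutor
from pyspark.sql.functions import lit

def ingest_one(record):
    # each task submits its own Spark job; submitting actions from several
    # threads is safe, and FAIR scheduling lets the jobs overlap
    (spark.read.format("avro").load(record["filename"])
         .withColumn("id", lit(record["id"]))
         .write.format("delta").mode("append")
         .save("s3://delta/" + record["datasource"] + "/"))

with ThreadPoolExecutor(max_workers=8) as pool:
    batch = [json.loads(m.value) for m in consumer_poll_batch()]  # hypothetical helper
    list(pool.map(ingest_one, batch))
    # commit the offsets only after every write in the batch has succeeded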
This sounds good (and a PoC shows it works), but I wonder if there are other options? Any other ideas are appreciated.
Spark runs on the Databricks platform.
Spark does not constrain you to one Spark application per datasource ingestion; you can group datasources into a couple of Spark apps, or you can go with one Spark application for all the datasources, which is a feasible approach if the Spark app has enough resources to ingest and process them all.
You can do something like:
object StreamingJobs extends SparkApp {
  // consume from Kafka Topic 1
  StreamProcess_1.runStream(spark)
  // consume from Kafka Topic 2
  StreamProcess_2.runStream(spark)
  // consume from Kafka Topic n
  StreamProcess_N.runStream(spark)
  // wait until termination
  spark.streams.awaitAnyTermination()
}
and maybe another Spark job for batch processing:
object BatchedJobs extends SparkApp {
  // consume from data source 1
  BatchedProcess_1.run(spark)
  // consume from data source 2
  BatchedProcess_2.run(spark)
  // consume from data source n
  BatchedProcess_N.run(spark)
}

Structured streaming performance and purging the parquet files

I am using Spark Structured Streaming to get streaming data from Kafka. I need to aggregate various metrics (say, 6 metrics) and write them as parquet files. I do see that there is a huge delay between metric 1 and metric 2. For example, if metric 1 was updated recently, metric 2 is one-hour-old data. How do I improve this performance so the metrics are processed in parallel?
Also, I write parquet files which should be read by another application. How do I purge old parquet data constantly? Should I have a different application for it?
Dataset<String> lines_topic = spark.readStream().format("kafka").option("kafka.bootstrap.servers", bootstrapServers)
Dataset<Row> data = lines_topic.select(functions.from_json(lines_topic.col("value"), schema).alias(topics));
data.withWatermark(---).groupBy(----).count();
query = data.writeStream().format("parquet").option("path", ---).option("truncate", "false").outputMode("append").option("checkpointLocation", checkpointFile).start();
Since each query runs independently from the others, you need to ensure you're giving each query enough resources to execute. What could be happening is that, with the default FIFO scheduler, all triggers run sequentially rather than in parallel.
Just as described here, you should set a FAIR scheduler on your SparkContext and then define new pools for each query.
// Run streaming query1 in scheduler pool1
spark.sparkContext.setLocalProperty("spark.scheduler.pool", "pool1")
df.writeStream.queryName("query1").format("parquet").start(path1)
// Run streaming query2 in scheduler pool2
spark.sparkContext.setLocalProperty("spark.scheduler.pool", "pool2")
df.writeStream.queryName("query2").format("orc").start(path2)
Also, in terms of purging old parquet files you may want to partition the data and then periodically delete old partitions as needed. Otherwise you can't just delete rows if all the data is being written to the same output path.
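As an illustration of that partition-then-purge idea, here is a sketch in PySpark (the answer's snippet is Scala, but the shape is the same). It assumes data is the aggregated streaming DataFrame from the question; the event_date column, paths, and seven-day retention window are made up:

import datetime
import boto3

# partition the streaming output by event date so old data is isolated per prefix
(data.writeStream
     .format("parquet")
     .partitionBy("event_date")
     .option("path", "s3://bucket/metrics/")
     .option("checkpointLocation", "s3://bucket/checkpoints/metrics/")
     .outputMode("append")
     .start())

# a separate housekeeping job (cron, Airflow, etc.) deletes partitions older than the retention window
cutoff = (datetime.date.today() - datetime.timedelta(days=7)).isoformat()
bucket = boto3.resource("s3").Bucket("bucket")
for obj in bucket.objects.filter(Prefix="metrics/event_date="):
    partition_date = obj.key.split("event_date=")[1].split("/")[0]
    if partition_date < cutoff:
        obj.delete()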

Run spark job in parallel and using single spark context running in local mode

I need to run some HQLs via Spark. I have a jar with a class that creates a Dataset from JSON, performs the HQL, and produces JSON. Finally, it saves that JSON to a text file on the local file system.
Spark is running in local mode.
Problem: jobs run sequentially, and every job starts its own Spark context, hence taking more time.
I want to create a single Spark context and execute the jobs in parallel.
Option 1: Queue-based model
I can create an infinitely running job that starts a Spark context and listens on a Kafka queue. The JSON data & HQL are passed as Kafka messages.
Option 2: Spark Streaming
Use Spark Streaming with Kafka to propagate the JSON data & HQL.
Or is there any other way to achieve this?
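For reference, Option 1 can stay quite small: a long-running driver that creates one SparkSession and then executes each HQL it receives from Kafka. A sketch using the kafka-python package (an assumption, as are the topic name and the message format with json_path, hql, and out_path fields):

import json
from kafka import KafkaConsumer
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .master("local[*]")
         .enableHiveSupport()
         .getOrCreate())

consumer = KafkaConsumer("hql-requests", bootstrap_servers="localhost:9092")

for message in consumer:
    request = json.loads(message.value)
    # register the incoming JSON as a temp view so the HQL can refer to it
    spark.read.json(request["json_path"]).createOrReplaceTempView("input_data")
    result = spark.sql(request["hql"])
    # write the result back out as JSON lines on the local file system
    result.toJSON().coalesce(1).saveAsTextFile(request["out_path"])

The Spark context is created exactly once, and each message becomes a job on that shared context.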
