Spark Structured Streaming source retention policy - apache-spark

Consider a continuous flow of JSON data on a Kafka topic that we want to process with Structured Streaming like this:
val df = spark
.readStream
.format("kafka")
.option("kafka.bootstrap.servers", "host1:port1,host2:port2")
.option("subscribe", "topic1")
.load()
I was wondering: if the program runs for a long time, will the df variable grow very large (in my case, something like 100 TB over a week)? Is there any configuration available to eliminate earlier data in df, or simply to dequeue the earliest rows?

In Spark the execution will not start until an action is triggered.
This concept is called Lazy Evaluation in Apache Spark.
“Transformations are lazy in nature meaning when we call some operation in RDD, it does not execute immediately”
Having said that, the load operation is a transformation, so no data will be read upon executing this line of code.
In order to kick off a streaming job you need to provide the following four logical components and call start (a sketch follows the list):
The input (Kafka, file, socket, ..)
The trigger (how often the input is checked for new data)
The result table (updated by the query after each trigger)
The output (which part of the result table is written to the sink)
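Taking the snippet from the question, a minimal sketch of how these four pieces fit together (the console sink and the 10-second trigger interval are illustrative assumptions, not part of the question):
import org.apache.spark.sql.streaming.Trigger

val query = df.selectExpr("CAST(value AS STRING)")  // the input: the Kafka source defined above
  .writeStream
  .trigger(Trigger.ProcessingTime("10 seconds"))    // the trigger: how often new data is pulled
  .outputMode("append")                             // which part of the result table is emitted
  .format("console")                                // the output sink
  .start()                                          // nothing runs until start() is called

query.awaitTermination()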
How much memory is consumed depends on what is done in the query that will be triggered. From the Spark documentation:
"Since Spark is updating the Result Table, it has full control over
updating old aggregates when there is late data, as well as cleaning
up old aggregates to limit the size of intermediate state data. Since
Spark 2.1, we have support for watermarking which allows the user to
specify the threshold of late data, and allows the engine to
accordingly clean up old state."
So you have to determine the amount of data needed to compute the result table in order to estimate the amount of memory required.
It is possible that an executor will crash with an OOM exception if you use stateful operations such as mapGroupsWithState, …
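To make the quoted behaviour concrete, here is a hedged sketch of a watermarked aggregation (the event-time column timestamp is provided by the Kafka source; the thresholds are arbitrary) whose old state Spark can clean up:
import org.apache.spark.sql.functions._

val counts = df
  .withWatermark("timestamp", "10 minutes")        // tolerate data up to 10 minutes late
  .groupBy(window(col("timestamp"), "5 minutes"))  // 5-minute tumbling windows
  .count()                                         // state older than the watermark is dropped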

Related

Processing an entire SQL table via JDBC by streaming or batching on a constrained environment

I am trying to set up a pipeline for processing entire SQL tables one by one with the initial ingestion happening through JDBC. I need to be able to use higher-level processing capabilities such as the ones available in Apache Spark or Flink and would like to use any existing capabilities rather than having to write my own, although it could be an inevitability. I need to be able to execute this pipeline on a constrained setup (potentially a single laptop). Please note that I am not talking about capturing or ingesting CDC here, I just want to batch process an existing table in a way that would not OOM a single machine.
As a trivial example, I have a table in SQL Server that's 500 GB. I want to break it down into smaller chunks that fit into the 16-32 GB of memory available on a reasonably modern laptop, apply a transformation function to each of the rows, and then forward them into a sink.
Some of the available solutions that seem close to doing what I need:
Apache Spark partitioned reads:
spark.read.format("jdbc").
.option("driver", driver)
.option("url", url)
.option("partitionColumn", id)
.option("lowerBound", min)
.option("upperBound", max)
.option("numPartitions", 10)
.option("fetchsize",1000)
.option("dbtable", query)
.option("user", "username")
.option("password", "password")
.load()
It looks like I can even repartition the datasets further after the initial read.
The problem is that in local execution mode I expect the entire table to be partitioned across multiple CPU cores, all of which will try to load their respective chunks into memory, OOMing the whole business.
Is there a way to throttle the reading jobs so that only as many execute as can fit in memory? Can I force jobs to run sequentially?
Could I perhaps partition the table into much smaller chunks, many more than there are cores, causing only a small amount to be processed at once? Wouldn't that hamper everything with endless task scheduling etc?
If I wanted to write my own source for streaming into Spark, would that alleviate my memory woes? Does something like this help me?
Does Spark's memory management kick into play here at all? Why does it need to load the entire partition into memory at once during the read?
I looked at Apache Flink as an alternative as the streaming model is perhaps a little more appropriate here. Here's what it offers in terms of JDBC:
JDBCInputFormat.buildJDBCInputFormat()
.setDrivername("com.mysql.jdbc.Driver")
.setDBUrl("jdbc:mysql://localhost/log_db")
.setUsername("username")
.setPassword("password")
.setQuery("select id, something from SOMETHING")
.setRowTypeInfo(rowTypeInfo)
.finish()
However, it seems like this is also designed for batch processing and still attempts to load everything into memory.
How would I go about wrangling Flink to stream micro-batches of SQL data for processing?
Could I potentially write my own streaming source that wraps the JDBC input format?
Is it safe to assume that OOMs do not happen with Flink unless some state/accumulators become too big?
I also saw that Kafka has JDBC connectors but it looks like it is not really possible to run it locally (i.e. same JVM) like the other streaming frameworks. Thank you all for the help!
It's true that with Flink, input formats are only intended to be used for batch processing, but that shouldn't be a problem. Flink does batch processing one event at a time, without loading everything into memory. I think what you want should just work.
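If you do want to consume the rows as a stream, here is a hedged sketch (package and class names assume the legacy flink-jdbc connector shown in the question; the row schema is illustrative) of wrapping the JDBCInputFormat so records are read and forwarded one at a time:
import org.apache.flink.api.common.typeinfo.BasicTypeInfo
import org.apache.flink.api.java.io.jdbc.JDBCInputFormat
import org.apache.flink.api.java.typeutils.RowTypeInfo
import org.apache.flink.streaming.api.scala._

val env = StreamExecutionEnvironment.getExecutionEnvironment

// Assumed schema of the query result: (id BIGINT, something VARCHAR)
val rowTypeInfo = new RowTypeInfo(BasicTypeInfo.LONG_TYPE_INFO, BasicTypeInfo.STRING_TYPE_INFO)

val jdbcInput = JDBCInputFormat.buildJDBCInputFormat()
  .setDrivername("com.mysql.jdbc.Driver")
  .setDBUrl("jdbc:mysql://localhost/log_db")
  .setUsername("username")
  .setPassword("password")
  .setQuery("select id, something from SOMETHING")
  .setRowTypeInfo(rowTypeInfo)
  .finish()

env.createInput(jdbcInput)(rowTypeInfo)  // DataStream[Row], processed record by record
  .map(row => row.toString)              // per-row transformation goes here
  .print()                               // replace with a real sink

env.execute("jdbc-to-stream-sketch")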

How to manage HDFS memory with Structured Streaming Checkpoints

I have a long-running Structured Streaming job which consumes several Kafka topics and aggregates over a sliding window. I need to understand how checkpoints are managed/cleaned up within HDFS.
Jobs run fine and I am able to recover from a failed step with no data loss; however, I can see the HDFS utilisation increasing day by day. I cannot find any documentation on how Spark manages/cleans up the checkpoints. Previously the checkpoints were stored on S3, but this turned out to be quite costly with the large number of small files being read/written.
query = formatted_stream.writeStream \
.format("kafka") \
.outputMode(output_mode) \
.option("kafka.bootstrap.servers", bootstrap_servers) \
.option("checkpointLocation", "hdfs:///path_to_checkpoints") \
.start()
From what I understand the checkpoints should be cleaned up automatically; after several days I just see my HDFS utilisation increasing linearly. How can I ensure the checkpoints are managed and HDFS does not run out of space?
The accepted answer to Spark Structured Streaming Checkpoint Cleanup says that Structured Streaming should deal with this issue, but not how, or how it can be configured.
As you can see in the code for Checkpoint.scala, the checkpointing mechanism persists the last 10 checkpoints, but that should not be a problem over a couple of days.
A usual reason for this is that the RDDs you are persisting on disk are also growing linearly with time. This may be due to some RDDs that you don't care about getting persisted.
You need to make sure that your use of Structured Streaming does not require persisting RDDs that grow over time. For example, if you want to calculate a precise count of distinct elements over a column of a Dataset, you need to know the full input data (which means persisting data that increases linearly with time, if you have a constant influx of data per batch). If you can work with an approximate count instead, you can use algorithms such as HyperLogLog++, which typically require much less memory in exchange for a loss of precision.
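For instance, a sketch of that trade-off (the Dataset ds and its timestamp/userId columns are assumptions for illustration): Spark exposes HyperLogLog++ through approx_count_distinct, which keeps a constant-size sketch per group instead of every distinct value:
import org.apache.spark.sql.functions._

// An exact distinct count would need to remember every distinct userId seen so far;
// the approximate version keeps only a small HyperLogLog++ sketch per window.
val distinctUsers = ds
  .withWatermark("timestamp", "1 hour")
  .groupBy(window(col("timestamp"), "1 hour"))
  .agg(approx_count_distinct("userId", 0.02))  // 2% relative standard deviation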
Keep in mind that if you are using Spark SQL, you may want to inspect what your optimized queries turn into, as this may be related to how Catalyst optimizes your query. If you are not, Catalyst might have optimized your query for you if you had used it.
In any case, a further thought: if the checkpoint usage is increasing with time, this should be reflected in your streaming job also consuming more RAM linearly with time, since the checkpoint is essentially a serialization of the streaming state (plus constant-size metadata). If that is the case, check SO for related questions, such as why the memory usage of a Spark worker increases with time.
Also, be mindful of which RDDs you call .persist() on (and with which storage level, so that RDDs can spill to disk and only be loaded partially into memory at a time).

How to avoid writing empty json files in Spark [duplicate]

I am reading from Kafka queue using Spark Structured Streaming. After reading from Kafka I am applying filter on the dataframe. I am saving this filtered dataframe into a parquet file. This is generating many empty parquet files. Is there any way I can stop writing an empty file?
df = spark \
.readStream \
.format("kafka") \
.option("kafka.bootstrap.servers", KafkaServer) \
.option("subscribe", KafkaTopics) \
.load()
Transaction_DF = df.selectExpr("CAST(value AS STRING)")
decompDF = Transaction_DF.select(zip_extract("value").alias("decompress"))
filterDF = decompDF.filter(.....)
query = filterDF.writeStream \
.option("path", outputpath) \
.option("checkpointLocation", RawXMLCheckpoint) \
.start()
Yes, but you would rather not do it.
The reason for many empty parquet files is that Spark SQL (the underlying infrastructure for Structured Streaming) tries to guess the number of partitions to load a dataset (with records from Kafka per batch) and does this "poorly", i.e. many partitions have no data.
When you save a partition with no data you will get an empty file.
You can use repartition or coalesce operators to set the proper number of partitions and reduce (or even completely avoid) empty files. See Dataset API.
Why would you not do it? repartition and coalesce may incur a performance hit due to the extra step of shuffling data between partitions (and possibly between nodes in your Spark cluster). That can be expensive and may not be worth it (hence my saying you would rather not do it).
You may then ask yourself how to know the right number of partitions, and that's a very good question in any Spark project. The answer is fairly simple (and obvious if you understand what Spark does and how it does it): "Know your data" so you can calculate exactly how many partitions are right.
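For example, a sketch in Scala (the partition count of 1 is arbitrary; filterDF, outputpath and RawXMLCheckpoint refer to the names in the question):
filterDF
  .coalesce(1)  // one output file per micro-batch instead of many mostly-empty ones
  .writeStream
  .format("parquet")
  .option("path", outputpath)
  .option("checkpointLocation", RawXMLCheckpoint)
  .start()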
I recommend using repartition(partitioningColumns) on the DataFrame/Dataset, and after that partitionBy(partitioningColumns) on the writeStream operation, to avoid writing empty files.
Reason:
The bottleneck, if you have a lot of data, is often Spark's read performance when there are many small (or even empty) files and no partitioning. So you should definitely make use of file/directory partitioning (which is not the same as RDD partitioning).
This is especially a problem when using AWS S3.
The partitioning columns should fit your common read queries, e.g. timestamp/day, message type/Kafka topic, ...
See also the partitionBy documentation on http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.DataFrameWriter
Partitions the output by the given columns on the file system. If specified, the output is laid out on the file system similar to Hive's partitioning scheme. As an example, when we partition a dataset by year and then month, the directory layout would look like:
year=2016/month=01/, year=2016/month=02/
Partitioning is one of the most widely used techniques to optimize physical data layout. It provides a coarse-grained index for skipping unnecessary data reads when queries have predicates on the partitioned columns. In order for partitioning to work well, the number of distinct values in each column should typically be less than tens of thousands.
This is applicable for all file-based data sources (e.g. Parquet, JSON) starting with Spark 2.1.0.
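A hedged sketch of that suggestion in Scala (the day column derived from an assumed timestamp field is purely illustrative):
import org.apache.spark.sql.functions._

filterDF
  .withColumn("day", to_date(col("timestamp")))  // assumed partitioning column
  .repartition(col("day"))                       // group rows of the same day into the same partitions
  .writeStream
  .format("parquet")
  .partitionBy("day")                            // directory layout: .../day=2017-01-01/...
  .option("path", outputpath)
  .option("checkpointLocation", RawXMLCheckpoint)
  .start()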
You can try repartitionByRange(column).
I used this while writing a dataframe to HDFS; it solved my empty-file creation issue.
If you are using YARN client mode, then setting the number of executor cores to 1 will solve the problem. This means that only one task will run at any time per executor.

Structured streaming performance and purging the parquet files

I am using Spark Structured Streaming to get streaming data from Kafka. I need to aggregate various metrics (say, 6 metrics) and write them as parquet files. I do see that there is a huge delay between metric 1 and metric 2: for example, if metric 1 was updated recently, metric 2 contains data that is one hour old. How do I improve this so that the metrics are processed in parallel?
Also, I write Parquet files which should be read by another application. How do I purge old parquet information constantly? Should I have a different application for it?
Dataset<Row> lines_topic = spark.readStream().format("kafka")
    .option("kafka.bootstrap.servers", bootstrapServers)
    .load();
Dataset<Row> data = lines_topic.select(functions.from_json(lines_topic.col("value"), schema).alias(topics));
data.withWatermark(---).groupBy(----).count();
query = data.writeStream().format("parquet")
    .option("path", ---)
    .option("truncate", "false")
    .outputMode("append")
    .option("checkpointLocation", checkpointFile)
    .start();
Since each query runs independently of the others, you need to ensure you're giving each query enough resources to execute. What could be happening is that, if you're using the default FIFO scheduler, all triggers run sequentially rather than in parallel.
Just as described here you should set a FAIR scheduler on your SparkContext and then define new pools for each query.
// Run streaming query1 in scheduler pool1
spark.sparkContext.setLocalProperty("spark.scheduler.pool", "pool1")
df.writeStream.queryName("query1").format("parquet").start(path1)
// Run streaming query2 in scheduler pool2
spark.sparkContext.setLocalProperty("spark.scheduler.pool", "pool2")
df.writeStream.queryName("query2").format("orc").start(path2)
Also, in terms of purging old parquet files you may want to partition the data and then periodically delete old partitions as needed. Otherwise you can't just delete rows if all the data is being written to the same output path.
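A hedged sketch of such a periodic cleanup as a separate maintenance job (the date=YYYY-MM-DD layout, the basePath variable and the 7-day retention are all assumptions):
import java.time.LocalDate
import org.apache.hadoop.fs.{FileSystem, Path}

val fs = FileSystem.get(spark.sparkContext.hadoopConfiguration)
val cutoff = LocalDate.now().minusDays(7)

fs.listStatus(new Path(basePath))
  .filter(_.isDirectory)
  .map(_.getPath)
  .filter { p =>
    p.getName.startsWith("date=") &&
      LocalDate.parse(p.getName.stripPrefix("date=")).isBefore(cutoff)
  }
  .foreach(p => fs.delete(p, true))  // recursively delete expired partition directories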

How can you pushdown predicates to Cassandra or limit requested data when using Pyspark / Dataframes?

For example, docs.datastax.com mentions:
table1 = sqlContext.read.format("org.apache.spark.sql.cassandra").options(table="kv", keyspace="ks").load()
and it's the only way I know, but let's say that I want to load only the last one million entries from this table. I don't want to load the whole table into memory every time, especially if this table has, for example, over 10 million entries.
Thanks!
While you can't load data faster, you can load portions of the data or terminate early. Spark DataFrames use Catalyst to optimize their underlying query plans, which enables them to take some shortcuts.
For example, calling limit will allow Spark to skip reading some portions of the underlying DataSource. This limits the amount of data read from Cassandra by cancelling tasks before they are executed.
Calling filter, or adding filters, can be used by the underlying DataSource to restrict the amount of information actually pulled from Cassandra. There are limitations on what can be pushed down, but this is all detailed in the documentation.
https://github.com/datastax/spark-cassandra-connector/blob/master/doc/14_data_frames.md#pushing-down-clauses-to-cassandra
Note that all of this is accomplished simply by making further API calls on your DataFrame once you've loaded it. For example:
import org.apache.spark.sql.functions.col

val df = sqlContext
  .read
  .format("org.apache.spark.sql.cassandra")
  .options(Map("table" -> "kv", "keyspace" -> "ks"))
  .load()
df.show(10) // Will compute only enough tasks to get 10 records and no more
df.filter(col("clusteringKey") > 5).show() // Will push the clustering-key predicate down to C*
