I have a Spark (Java) based application that scans a few tables in Hive and then, based on some condition, returns a list of tables/partitions that satisfy the condition.
I want to know how I can collect information about this scan.
For example: How much time did the application take to scan the different tables? How much memory was used while performing the scan?
I know I can use a simple stopwatch to measure the time and print it to the log, but I do not want to write these values to the logs. I want to push them to a custom Kafka producer that I have created.
1) I looked into extending SparkListener: https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.scheduler.SparkListener
But its callbacks are based on Spark-specific events (onJobStart, onJobEnd, etc.). I want to trigger my listener on business-specific events, e.g. onTableRead or onSpecificMethodCall.
2) https://github.com/groupon/spark-metrics
I had a look at this library and it is quite simple to use. The problem I am facing is that my Spark application is Java based.
The spark-metrics documentation says that metrics should be initialized as lazy vals to get correct data, and lazy val is a Scala construct.
public List<String> getFilteredPartList() {
    // Ask the helper for the partitions that satisfy the condition.
    List<String> filteredPartitions = utils.getPartNames(getTable().getDbName(), getTable().getTableName(),
            getConfiguration().getEnvCode(), partNames, getTableProperties(), getDataAccessManager(),
            true, true, false, true).getPartNames();
    return filteredPartitions;
}
I want to know the time taken to execute the above method and push it to my custom Kafka producer.
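One way to get business-level events without going through SparkListener is to wrap the call in a small timing helper that hands a metric message to whatever sink you have. Below is a minimal sketch, assuming the custom Kafka producer can be adapted to a `Consumer<String>`. `BusinessTimer`, the event name, and the JSON-like payload are hypothetical and not part of Spark or spark-metrics.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;
import java.util.function.Supplier;

// Hypothetical helper: times a block of code and reports the result to a sink.
// In the real application the sink would wrap the custom Kafka producer
// (assumption: it can accept a String message).
class BusinessTimer {
    private final Consumer<String> sink;

    BusinessTimer(Consumer<String> sink) {
        this.sink = sink;
    }

    // Runs the body, measures wall-clock time and used heap of this JVM,
    // and emits one message per call, even if the body throws.
    <T> T time(String eventName, Supplier<T> body) {
        long start = System.nanoTime();
        try {
            return body.get();
        } finally {
            long elapsedMs = (System.nanoTime() - start) / 1_000_000;
            Runtime rt = Runtime.getRuntime();
            long usedHeapBytes = rt.totalMemory() - rt.freeMemory();
            sink.accept("{\"event\":\"" + eventName + "\",\"elapsedMs\":" + elapsedMs
                    + ",\"usedHeapBytes\":" + usedHeapBytes + "}");
        }
    }
}

class BusinessTimerDemo {
    public static void main(String[] args) {
        List<String> sent = new ArrayList<>(); // stands in for the Kafka topic
        BusinessTimer timer = new BusinessTimer(sent::add);
        // In the application this would be: timer.time("onTableRead", this::getFilteredPartList)
        List<String> parts = timer.time("onTableRead", () -> List.of("p=1", "p=2"));
        System.out.println(parts + " -> " + sent);
    }
}
```

Note that this only measures the driver-side call. Because Spark transformations are lazy, the timed block has to include the action that actually triggers the scan, and the heap figure is that of the driver JVM, not of the executors.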
Related
In Spark, we have the mapPartitions function, which can be used to do some initialization for a group of entries, such as a DB operation.
Now I want to do the same thing in Flink. After some research I found that I can use a RichMapFunction for the same purpose, but it has a drawback: the initialization can only be done in the open method, which runs once at the start of a streaming job. I will explain my use case to clarify the situation.
Example: I am getting data for millions of users from Kafka, but I only want the data of some users to be persisted in the end. This list of users is dynamic and is available in a DB. I want to look up the current users every 10 minutes, so that I filter out everything else and store the data for only those users. In Spark (mapPartitions), the user lookup would run for every group, and there I had configured it to fetch the users from the DB every 10 minutes. But in Flink, with a RichMapFunction, I can do that only in the open function when my job starts.
How can I do this in Flink?
It seems that what you want to do is a stream-table join. There are multiple ways of doing that, but the easiest one here is probably the broadcast state pattern.
The idea is to define a custom source that periodically queries the data from the SQL table (or, even better, uses CDC), use that table stream as broadcast state, and connect it with the actual user stream.
Inside the ProcessFunction for the connected streams you will have access to the broadcast table data, so you can perform a lookup for every user you receive and decide what to do with it.
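To show the shape of the pattern without pulling in Flink, here is a plain-Java model of it. This is a sketch, not Flink code: `UserFilter` and its method names are hypothetical. In Flink the two methods would roughly correspond to `processBroadcastElement` (the user-list stream) and `processElement` (the Kafka events) of a `BroadcastProcessFunction`, and the set would live in broadcast state rather than in a field.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Plain-Java model of the broadcast state pattern (no Flink dependency).
class UserFilter {
    // Stands in for the broadcast state: the latest snapshot of allowed users.
    private Set<String> allowedUsers = new HashSet<>();

    // Called for every snapshot emitted by the periodic DB source.
    void onUserListUpdate(Set<String> snapshot) {
        allowedUsers = new HashSet<>(snapshot);
    }

    // Called for every event from Kafka; forwards it only if the user is allowed.
    void onEvent(String userId, String payload, List<String> out) {
        if (allowedUsers.contains(userId)) {
            out.add(userId + ":" + payload);
        }
    }
}

class UserFilterDemo {
    public static void main(String[] args) {
        UserFilter filter = new UserFilter();
        List<String> persisted = new ArrayList<>();

        filter.onUserListUpdate(Set.of("alice"));        // first DB poll
        filter.onEvent("alice", "e1", persisted);        // kept
        filter.onEvent("bob", "e2", persisted);          // dropped, bob is not in the list yet
        filter.onUserListUpdate(Set.of("alice", "bob")); // next poll, 10 minutes later
        filter.onEvent("bob", "e3", persisted);          // kept now

        System.out.println(persisted); // prints [alice:e1, bob:e3]
    }
}
```

The point of the design is that the refresh is driven by a second stream rather than by a timer inside the map function, so the user list can change at any time while the job keeps running.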
I'm using Structured Streaming in Spark but I'm struggling to understand what data is kept in memory. Currently I'm running Spark 2.4.7, whose Structured Streaming Programming Guide says:
The key idea in Structured Streaming is to treat a live data stream as a table that is being continuously appended.
I understand this to mean that Spark appends all incoming data to an unbounded table that never gets truncated, i.e. it will keep growing indefinitely.
I understand the concept and why it is good. For example, when I want to aggregate based on event time, I can use withWatermark to tell Spark which column is the event time, specify how late I am willing to receive data, and let Spark throw away everything older than that.
However, let's say I want to aggregate on something that is not event time. I have a use case where each message in Kafka contains an array of data points. I use explode_outer to create multiple rows for each message, and for these rows (within the same message) I would like to aggregate by message-id (getting max, min, avg, etc.). So my question is: will Spark keep all "old" data, since that is how Structured Streaming works, which will lead to OOM issues? And is the only way to prevent this to add a "fictional" withWatermark on, for example, the time I received the message, and include it in my groupBy as well?
And the other use case, where I do not even want to do a groupBy: I simply want to apply some transformation to each message and then pass it along; I only care about the current "batch". Will Spark in that case also keep all old messages, forcing me to do a "fictional" withWatermark along with a groupBy (including message-id in the groupBy and taking, for example, the max of all columns)?
I know I can move to the good old DStreams to eliminate my issue and simply handle each message separately, but then I lose all the good things about Structured Streaming.
Yes, for an aggregation a watermark is necessary to bound the result table, and the event-time column has to be included in the groupBy.
https://spark.apache.org/docs/2.3.2/structured-streaming-programming-guide.html#handling-late-data-and-watermarking
Is there any reason why you want to avoid that?
Watermarking is "strictly" required only if you have an aggregation or a join, so that late events are not missed in the aggregation/join (which would affect the output). It is not required for events that only need to be transformed and passed along, since late events have no effect on that output. If you want very late events to be dropped, though, you might still want to add a watermark. Some links for reference:
https://medium.com/#ivan9miller/spark-streaming-joins-and-watermarks-2cf4f60e276b
https://blog.clairvoyantsoft.com/watermarking-in-spark-structured-streaming-a1cf94a517ba
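To make the effect of a watermark on state a bit more concrete, here is a simplified plain-Java model. It is not Spark's implementation, and `WatermarkedCounter` is a hypothetical name. It assumes tumbling windows, a watermark computed as the maximum event time seen minus a fixed delay, and a watermark that only advances after each batch.

```java
import java.util.List;
import java.util.TreeMap;

// Simplified model of watermark-based state cleanup (not Spark code).
// Keeps one count per tumbling window and evicts windows that the
// watermark has passed, so the state map stays bounded.
class WatermarkedCounter {
    private final long windowSizeMs;
    private final long delayMs;
    private final TreeMap<Long, Long> countsByWindowStart = new TreeMap<>();
    private long maxEventTimeMs = Long.MIN_VALUE;
    private long watermarkMs = Long.MIN_VALUE;

    WatermarkedCounter(long windowSizeMs, long delayMs) {
        this.windowSizeMs = windowSizeMs;
        this.delayMs = delayMs;
    }

    // Processes one micro-batch of event timestamps (assumed non-negative).
    void processBatch(List<Long> eventTimesMs) {
        for (long t : eventTimesMs) {
            maxEventTimeMs = Math.max(maxEventTimeMs, t);
            long windowStart = t - Math.floorMod(t, windowSizeMs);
            // Drop events whose window has already been closed by the watermark.
            if (windowStart + windowSizeMs <= watermarkMs) {
                continue;
            }
            countsByWindowStart.merge(windowStart, 1L, Long::sum);
        }
        if (maxEventTimeMs == Long.MIN_VALUE) {
            return; // nothing seen yet, no watermark to advance
        }
        // Advance the watermark after the batch; it never moves backwards.
        watermarkMs = Math.max(watermarkMs, maxEventTimeMs - delayMs);
        // Evict every window whose end is at or before the watermark.
        countsByWindowStart.headMap(watermarkMs - windowSizeMs, true).clear();
    }

    int windowsInState() {
        return countsByWindowStart.size();
    }

    long watermarkMs() {
        return watermarkMs;
    }
}

class WatermarkedCounterDemo {
    public static void main(String[] args) {
        // 10 s windows, events may be up to 5 s late.
        WatermarkedCounter counter = new WatermarkedCounter(10_000, 5_000);
        counter.processBatch(List.of(1_000L, 2_000L, 12_000L));
        System.out.println(counter.windowsInState()); // prints 2
        counter.processBatch(List.of(26_000L));
        System.out.println(counter.windowsInState()); // prints 1, the two old windows were evicted
        counter.processBatch(List.of(3_000L));
        System.out.println(counter.windowsInState()); // prints 1, the late event was dropped
    }
}
```

Without the eviction step the map would keep one entry per window (or per message-id) forever, which is the unbounded growth the question is worried about.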
I've been mulling over how to solve a given problem in Beam and thought I'd reach out to a larger audience for some advice. At present things seem to be working only sporadically, and I was curious if someone could act as a sounding board to see if this workflow makes sense.
The primary high-level goal is to read records from Kafka that may be out of order, window them in event time according to another property found on the records, and eventually emit the contents of those windows and write them out to GCS.
The current pipeline looks roughly like the following:
val partitionedEvents = pipeline
    .apply("Read Events from Kafka",
        KafkaIO
            .read<String, Log>()
            .withBootstrapServers(options.brokerUrl)
            .withTopic(options.incomingEventsTopic)
            .withKeyDeserializer(StringDeserializer::class.java)
            .withValueDeserializerAndCoder(
                SpecificAvroDeserializer<Log>()::class.java,
                AvroCoder.of(Log::class.java)
            )
            .withReadCommitted()
            .commitOffsetsInFinalize()
            // Set the watermark to use a specific field for event time
            .withTimestampPolicyFactory { _, previousWatermark -> WatermarkPolicy(previousWatermark) }
            .withConsumerConfigUpdates(
                ImmutableMap.of<String, Any?>(
                    ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest",
                    ConsumerConfig.GROUP_ID_CONFIG, "log-processor-pipeline",
                    "schema.registry.url", options.schemaRegistryUrl
                )
            ).withoutMetadata()
    )
    .apply("Logging Incoming Logs", ParDo.of(Events.log()))
    .apply("Rekey Logs by Tenant", ParDo.of(Events.key()))
    .apply("Partition Logs by Source",
        // This is a custom function that will partition incoming records by a specific
        // datasource field
        Partition.of(dataSources.size, Events.partition<KV<String, Log>>(dataSources))
    )
dataSources.forEach { dataSource ->
    // Store a reference to the data source name to avoid serialization issues
    val sourceName = dataSource.name
    val tempDirectory = Directories.resolveTemporaryDirectory(options.output)
    // Grab all of the events for this specific partition and apply the source-specific windowing
    // strategies
    partitionedEvents[dataSource.partition]
        .apply(
            "Building Windows for $sourceName",
            SourceSpecificWindow.of<KV<String, Log>>(dataSource)
        )
        .apply("Group Windowed Logs by Key for $sourceName", GroupByKey.create())
        .apply("Log Events After Windowing for $sourceName", ParDo.of(Events.logAfterWindowing()))
        .apply(
            "Writing Windowed Logs to Files for $sourceName",
            FileIO.writeDynamic<String, KV<String, MutableIterable<Log>>>()
                .withNumShards(1)
                .by { row -> "${row.key}/${sourceName}" }
                .withDestinationCoder(StringUtf8Coder.of())
                .via(Contextful.fn(SerializableFunction { logs -> Files.stringify(logs.value) }), TextIO.sink())
                .to(options.output)
                .withNaming { partition -> Files.name(partition) }
                .withTempDirectory(tempDirectory)
        )
}
In a simpler, bulleted form, it might look like this:
- Read records from a single Kafka topic
- Key all records by their tenant
- Partition the stream by another event property
- Iterate through the known partitions from the previous step:
  - Apply custom windowing rules for each partition (related to the data source, custom window rules)
  - Group windowed items by key (tenant)
  - Write tenant-key pair groupings to GCP via FileIO
The problem is that the incoming Kafka topic contains out-of-order data across multiple tenants (e.g. events for tenant1 might be streaming in now, but then a few minutes later you’ll get them for tenant2 in the same partition, etc.). This would cause the watermark to bounce back and forth in time as each incoming record would not be guaranteed to continually increase, which sounds like it would be a problem, but I'm not certain. It certainly seems that while data is flowing through, some files are simply not being emitted at all.
The custom windowing function is extremely simple and is intended to emit a single window once the window duration and the allowed lateness have elapsed:
object SourceSpecificWindow {
    fun <T> of(dataSource: DataSource): Window<T> {
        return Window.into<T>(FixedWindows.of(dataSource.windowDuration()))
            .triggering(Never.ever())
            .withAllowedLateness(dataSource.allowedLateness(), Window.ClosingBehavior.FIRE_ALWAYS)
            .discardingFiredPanes()
    }
}
However, it seemed inconsistent since we'd see logging come out after the closing of the window, but not necessarily files being written out to GCS.
Does anything seem blatantly wrong or incorrect with this approach? The data can come in out of order within the source (i.e. right now, 2 hours ago, 5 minutes from now) and covers multiple tenants, and the aim is to ensure that one tenant that keeps up to date won't drown out tenants whose data arrives from further in the past.
Would we potentially need another Beam application or something to "split" this single stream of events into sub-streams that are each processed independently (so that each watermark progresses on its own)? Is that where a SplittableDoFn would come in? I'm running on the SparkRunner, which doesn't appear to support that, but it seems as though it'd be a valid use case.
Any advice would be greatly appreciated or even just another set of eyes. I'd be happy to provide any additional details that I could.
Environment
Currently running against SparkRunner
While this may not be the most helpful response, I'll be transparent about the end result. Eventually the logic required for this specific use case extended far beyond the built-in capabilities of Apache Beam, primarily around windowing and the governance of time.
The solution we landed on was to switch the preferred streaming technology from Apache Beam to Apache Flink, which, as you might imagine, was quite a leap. The state-centric nature of Flink allowed us to handle our use cases more easily and to define custom eviction criteria (and ordering) around windowing, at the cost of losing a layer of abstraction.
I have an application written for Spark in Scala. My application code is more or less ready and the job runs for around 10-15 minutes.
There is an additional requirement to report the status of the application while the Spark job is executing. I know that Spark evaluates lazily and that it is not good practice to pull data back to the driver program during execution. Typically, I would be interested in providing status at regular intervals.
E.g. if there are 20 functional points configured in the Spark application, I would like to report the status of each of these functional points as and when they are executed, or as their steps complete, during Spark execution.
These incoming statuses of the functional points will then be passed to a custom user interface that displays the status of the job.
Can someone give me some pointers on how this can be achieved?
There are a few things you can do on this front that I can think of.
If your job contains multiple actions, you can write a script to poll for the expected output of those actions. For example, imagine your script has 4 different DataFrame save calls. You could have your status script poll HDFS/S3 to see if the data has shown up in the expected output location yet. As another example, I have used Spark to index into ElasticSearch, and I have written status logging that polls for how many records are in the index and prints periodic progress.
Another thing I have tried before is using Accumulators to keep rough track of progress and of how much data has been written. This works OK, but it is a little arbitrary when Spark updates the visible totals with information from the executors, so I haven't found it to be too helpful for this purpose in general.
The other approach is to poll Spark's status and metrics APIs directly. You will be able to pull all of the information backing the Spark UI into your code and do with it whatever you want. It won't necessarily tell you exactly where you are in your driver code, but if you manually work out how your driver code maps to stages, you can figure that out. For reference, here is the documentation on polling the status API:
https://spark.apache.org/docs/latest/monitoring.html#rest-api
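As a rough illustration of the polling approach, here is a minimal Java sketch (Java 11+ for `java.net.http`). The `/api/v1/applications`, `/applications/[app-id]/jobs` and `/applications/[app-id]/stages` endpoints come from the monitoring documentation linked above. The `SparkStatusPoller` class, the base URL, and the application id used in the demo are assumptions to adapt to your cluster.

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

// Sketch of a client for the Spark monitoring REST API.
class SparkStatusPoller {
    private final String baseUrl;
    private final HttpClient client = HttpClient.newHttpClient();

    // baseUrl is an assumption, e.g. "http://localhost:4040" for a running driver UI.
    SparkStatusPoller(String baseUrl) {
        this.baseUrl = baseUrl.endsWith("/") ? baseUrl.substring(0, baseUrl.length() - 1) : baseUrl;
    }

    URI applicationsUri() {
        return URI.create(baseUrl + "/api/v1/applications");
    }

    URI jobsUri(String appId) {
        return URI.create(baseUrl + "/api/v1/applications/" + appId + "/jobs");
    }

    URI stagesUri(String appId) {
        return URI.create(baseUrl + "/api/v1/applications/" + appId + "/stages");
    }

    // Returns the raw JSON body; parsing is left to the JSON library of your choice.
    String fetch(URI uri) throws IOException, InterruptedException {
        HttpRequest request = HttpRequest.newBuilder(uri).GET().build();
        HttpResponse<String> response = client.send(request, HttpResponse.BodyHandlers.ofString());
        if (response.statusCode() != 200) {
            throw new IOException("HTTP " + response.statusCode() + " from " + uri);
        }
        return response.body();
    }
}

class SparkStatusPollerDemo {
    public static void main(String[] args) throws Exception {
        SparkStatusPoller poller = new SparkStatusPoller("http://localhost:4040");
        // Needs a running Spark application; "app-id" is a placeholder.
        System.out.println(poller.fetch(poller.applicationsUri()));
        System.out.println(poller.fetch(poller.jobsUri("app-id")));
    }
}
```

A status script would call `fetch` on a timer and translate job and stage states into the functional points it wants to display.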
I have created a Spark job using the Dataset API. A chain of operations is performed until the final result is produced, which is collected on HDFS.
But I also need to know how many records were read for each intermediate dataset. Let's say I apply 5 operations on a dataset (could be map, groupBy, etc.); I need to know how many records there were in each of the 5 intermediate datasets. Can anybody suggest how this can be obtained at the dataset level? I guess I can find this out at the task level (using listeners), but I am not sure how to get it at the dataset level.
Thanks
The nearest thing to metrics in the Spark documentation is Accumulators. However, their updates are only guaranteed to be applied exactly once when they are made inside actions; inside transformations, an update may be applied more than once if a task or stage is re-executed, and nothing is updated until an action triggers the computation.
You can still use count to get the current count after each operation. But keep in mind that it is an extra step like any other, so you need to decide whether the ingestion should run faster with fewer metrics or slower with all metrics.
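The counting idea can be illustrated without Spark: a counter is bumped as each record passes a step, which is what an accumulator incremented inside a map would do. The sketch below uses plain Java streams as a stand-in for the Dataset operations, and `StepCounters` is a hypothetical helper. In Spark the counters would be accumulators, with the caveat that updates made inside transformations can be applied more than once when tasks are retried.

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.LongAdder;
import java.util.stream.Collectors;

// Hypothetical helper: one named counter per pipeline step.
class StepCounters {
    private final Map<String, LongAdder> counters = new ConcurrentHashMap<>();

    // Returns the value unchanged and bumps the counter for the given step.
    <T> T tap(String step, T value) {
        counters.computeIfAbsent(step, k -> new LongAdder()).increment();
        return value;
    }

    long count(String step) {
        LongAdder adder = counters.get(step);
        return adder == null ? 0 : adder.sum();
    }
}

class StepCountersDemo {
    public static void main(String[] args) {
        StepCounters counters = new StepCounters();
        List<Integer> result = List.of(1, 2, 3, 4, 5, 6).stream()
                .map(x -> counters.tap("read", x))        // records entering the pipeline
                .filter(x -> x % 2 == 0)
                .map(x -> counters.tap("afterFilter", x)) // records surviving the filter
                .map(x -> x * 10)
                .collect(Collectors.toList());

        System.out.println(result);                        // prints [20, 40, 60]
        System.out.println(counters.count("read"));        // prints 6
        System.out.println(counters.count("afterFilter")); // prints 3
    }
}
```

Compared with calling count after each operation, this piggybacks on the single pass over the data instead of triggering an extra job per intermediate dataset.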
Now, coming back to listeners: a SparkListener can receive events about when applications, jobs, stages, and tasks start and complete, as well as other infrastructure-centric events such as drivers being added or removed, an RDD being unpersisted, or environment properties changing. All the information you can find about the health of Spark applications and the entire infrastructure is in the WebUI.
Your requirement is more of a custom implementation. Not sure if you can achieve this. Some info regarding exporting metrics is here.
All metrics which you can collect are at job start, job end, task start and task end. You can check the docs here
I hope the above info guides you towards a better solution.