Caching preprocessed data for ML in spark/pyspark - apache-spark

I would like a ML pipeline like this:
raw_data = spark.read....()
data = time_consuming_data_transformation(raw_data, preprocessing_params)
model = fit_model(data)
evaluate(model, data)
Can I cache/persist data somehow after step 2, so when I run my spark app again, the data won't have to be transformed again? Ideally, I would like the cache to be automatically invalidated when the original data or transformation code (computing graph, preprocessing_params) change.

Can I cache/persist data somehow after step 2, so when I run my spark app again, the data won't have to be transformed again?
You can of course:
data = time_consuming_data_transformation(raw_data, preprocessing_params).cache()
but if your data is not static, it is usually better to write the data to persistent storage:
time_consuming_data_transformation(raw_data, preprocessing_params).write.save(...)
data = spark.read.load(...)
It is more expensive than cache(), but it prevents hard-to-detect inconsistencies when the underlying data changes.
Ideally, I would like the cache to be automatically invalidated when the original data
No. Unless it is a streaming program (and learning on streams is not trivial), Spark doesn't monitor changes in the source.
or transformation code (computing graph, preprocessing_params) change.
It is not clear to me how these change, but it is probably not something that Spark will solve for you. You might need some event-driven or reactive components.
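Spark itself won't track this, but a common do-it-yourself workaround is to derive the storage path from a hash of preprocessing_params (plus a version string for the transformation code, if you keep one), so that changing either produces a new path and the stale output is simply never read again. A minimal PySpark sketch, reusing the names from the question; the helper and the bucket path are made up:
import hashlib
import json

def cache_path_for(preprocessing_params, base="s3://my-bucket/preprocessed"):
    # Hash the params so any change yields a different path
    key = hashlib.sha256(
        json.dumps(preprocessing_params, sort_keys=True).encode()
    ).hexdigest()[:16]
    return f"{base}/{key}"

path = cache_path_for(preprocessing_params)
try:
    data = spark.read.parquet(path)              # reuse if this version was already written
except Exception:                                # e.g. AnalysisException: path does not exist
    data = time_consuming_data_transformation(raw_data, preprocessing_params)
    data.write.parquet(path)
    data = spark.read.parquet(path)              # read back, decoupled from the long lineage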

Related

Spark structured streaming real-time aggregation

Is it possible to output aggregation data on every trigger, before the aggregation time window is over?
Context: I'm developing an application that reads data from a Kafka topic, processes the data, aggregates it over a 1-hour window, and outputs to S3. However, the Spark application understandably outputs the aggregated data to S3 only at the end of a given hour window.
The problem is that the end-users of the aggregated data in S3 can only have a semi real-time view, since they are always one hour late, waiting for the next aggregation to be output by the Spark application.
Reducing the aggregation time window to something smaller than an hour would certainly help, but would generate a lot more data.
What could be done to enable real-time aggregation, as I call it, using minimal resources?
This is an interesting one and I do have a proposal but I'm not sure if this would really fit your minimal criteria. I'll describe the solution anyway...
If the end goal is to enable users to query data in real-time (or faster analytics in other words) then one way to achieve this is to introduce a database in your architecture that can handle fast inserts/updates - either a key-value store or a column oriented database. Below is a diagram that might help you in visualising this:
If the end goal is to enable users to query data in real time (or faster analytics, in other words), then one way to achieve this is to introduce a database into your architecture that can handle fast inserts/updates - either a key-value store or a column-oriented database.
You'll obviously need to build the process that drops/deletes data partitions from the store you are streaming into, and that moves the data to S3.
This model is referred to as a tiered storage model or hierarchical storage model with sliding window pattern - Reference Article from Cloudera.
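As an illustration of the query layer, here is a rough sketch using the pyhive client against Presto; the catalog, schema and table names (hot, hive, analytics, aggregates_*) are hypothetical and assume both stores are already registered as Presto catalogs:
from pyhive import presto

conn = presto.connect(host="presto-coordinator", port=8080, username="analyst")
cursor = conn.cursor()
# One query federated across the fast store (recent data) and S3 (history)
cursor.execute("""
    SELECT event_hour, SUM(cnt) AS cnt
    FROM (
        SELECT event_hour, cnt FROM hot.analytics.aggregates_recent
        UNION ALL
        SELECT event_hour, cnt FROM hive.analytics.aggregates_history
    ) t
    GROUP BY event_hour
""")
rows = cursor.fetchall()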
Hope this helps!

How to run apache-beam in batches on bounded data?

I am trying to understand how Apache Beam works and I'm not quite sure I do. So, I want someone to tell me if my understanding is right:
Beam is a layer of abstraction over big data frameworks like Spark, Hadoop, Google Dataflow, etc. Not quite every functionality, but almost - that is the case.
Beam treats data in two forms: bounded and unbounded. Bounded is something like a .csv file; unbounded is something like a Kafka subscription. There are different I/O read methods for each. For unbounded data we need to implement windowing (attaching a timestamp to each data point) and a trigger. A batch here would be all the data points in a window until a trigger fires. For bounded datasets, however, the whole dataset is loaded into RAM (? If yes, how do I make Beam work in batches?). The output of an I/O method is a PCollection.
There are PTransforms (these are the operations I want to run on the data) that apply to each element of the PCollection. I can make these PTransforms run on a Spark or Flink cluster (this choice goes in the initial options set for the pipeline). Each PTransform emits a PCollection, and that is how we chain various PTransforms together. The end is a PCollection that can be saved to disk.
The end of the pipeline could be a save to some file system (how does this happen when I am reading a .csv in batches?).
Please point out to me any lapses in my understanding.
Beam is not the same thing as Google Cloud Dataflow; Cloud Dataflow is a runner on top of Apache Beam that executes Apache Beam pipelines. You can also run an Apache Beam job with a local runner, not on the cloud. There are plenty of different runners that you can find in the documentation: https://beam.apache.org/documentation/#available-runners
One specific aspect of Beam is that it is the same pipeline for batch and stream, and that's the purpose. You can specify --streaming as an argument to execute your pipeline in streaming mode; without it, it should execute in batch. But it mostly depends on your inputs: the data will just flow into the pipeline. And that's one important point: PCollections do not contain persistent data, just like Spark RDDs.
You can apply a PTransform to part of your data; it is not necessarily applied to all the data. All the PTransforms together form the pipeline.
It really depends on where and in what format you want your output...
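To make the batch case concrete, here is a minimal Beam pipeline in Python (file names and the word-count-style logic are purely illustrative). The runner splits the bounded source and processes it in bundles rather than loading everything into RAM in one go, and switching the runner option to SparkRunner or DataflowRunner leaves the pipeline code unchanged:
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions(runner="DirectRunner")  # or SparkRunner / DataflowRunner

with beam.Pipeline(options=options) as p:
    (p
     | "Read CSV lines" >> beam.io.ReadFromText("input.csv")      # bounded PCollection
     | "Split columns"  >> beam.Map(lambda line: line.split(","))
     | "Key by 1st col" >> beam.Map(lambda cols: (cols[0], 1))
     | "Count per key"  >> beam.CombinePerKey(sum)
     | "Format"         >> beam.Map(lambda kv: f"{kv[0]},{kv[1]}")
     | "Write"          >> beam.io.WriteToText("output"))         # sharded text files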

How to join stream data with a table updated slowly (e.g. once a day)?

In Structured Streaming, I need to join stream data with some slowly changing data. The slowly changing data gets updated daily, but not at a fixed time. The stream data, however, arrives at the seconds level. If I don't want to load the slowly changing data in each micro-batch, but still want to get the latest version of it once it gets updated, is there a way to do that?
Thanks
I'd recommend using DataStreamWriter.foreachBatch on the stream data and simply cache and unpersist the slow-changing dataset when needed. Since foreachBatch works on the driver (on a separate thread though) it should work.
A very advanced approach, in my opinion, would be to develop a custom data source that would handle the "slowly changing" part itself.
I tried this. I store the last modified time of a single file in a variable, then broadcast it, and in foreachBatch I get this time again. If they are different, I can refresh the cache. Then I found I don't have to broadcast the variable: if the variable gets its value before foreachBatch, it still keeps the original value inside foreachBatch (for local mode running in IntelliJ). The code is like:
import java.nio.file.{Files, Paths}
import io.delta.tables.DeltaTable
import org.apache.spark.sql.DataFrame

// Last modified time of the slowly changing source file
var latestModified = Files.getLastModifiedTime(Paths.get("/some_file"))
val deltaTable = DeltaTable.forPath(spark, deltaPath)
var c = deltaTable.toDF.cache()

df
  .writeStream
  // ... other writeStream options ...
  .foreachBatch { (batchDf: DataFrame, batchId: Long) =>
    val currentModifiedTime = Files.getLastModifiedTime(Paths.get("/some_file"))
    if (!currentModifiedTime.equals(latestModified)) {
      // The file changed: drop the old cache and re-cache the latest snapshot
      c.unpersist()
      c = deltaTable.toDF.cache()
      latestModified = currentModifiedTime
    }
    // ... the rest of the batch logic (e.g. join batchDf with c and write the result) ...
  }

Kappa architecture: when insert to batch/analytic serving layer happens

As you know, Kappa architecture is some kind of simplification of Lambda architecture. Kappa doesn't need a batch layer; instead, the speed layer has to guarantee computation precision and enough throughput (more parallelism/resources) for re-computation on historical data.
Still, Kappa architecture requires two serving layers when you need to do analytics based on historical data. For example, data younger than 2 weeks is stored in Redis (streaming serving layer), while all older data is stored somewhere in HBase (batch serving layer).
When (in Kappa architecture) do I have to insert data into the batch serving layer?
If the streaming layer inserts data immediately into both the batch & speed serving layers, then what about late data arrival? Or should the streaming layer back up the speed serving layer to the batch serving layer on a regular basis?
Example: let's say the source of data is Kafka, the data is processed by Spark Structured Streaming or Flink, and the sinks are Redis and HBase. When should the writes to Redis & HBase happen?
If we perform stream processing, we want to make sure that output data is first made available as a data stream. In your example that means we write to Kafka as a primary sink.
Now you have two options:
Have secondary jobs that read from that Kafka topic and write to Redis and HBase. That is the Kafka way, in that Kafka Streams does not support writing directly to any of these systems, so you set up a Kafka Connect job. These secondary jobs can then be tailored to the specific sinks, but they add additional operational overhead. (That's a bit like the backup option that you mentioned.)
With Spark and Flink you also have the option to have secondary sinks directly in your job (see the sketch below). You may add additional processing steps to transform the Kafka output into a more suitable form for the sink, but you are more limited when configuring the job. For example, in Flink you need to use the same checkpointing settings for the Kafka sink and the Redis/HBase sink. Nevertheless, if the settings work out, you just need to run one streaming job instead of two or three.
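For the second option, a rough PySpark sketch of a single job with secondary sinks via foreachBatch; the Kafka topic and the Redis/HBase connector formats and options are placeholders (spark-redis and an HBase connector would have to be on the classpath and fully configured for real use):
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("kappa-dual-sink").getOrCreate()

# Primary data stream: the Kafka topic produced by the upstream processing job
results = (spark.readStream
           .format("kafka")
           .option("kafka.bootstrap.servers", "broker:9092")
           .option("subscribe", "aggregated-results")
           .load()
           .selectExpr("CAST(key AS STRING) AS key", "CAST(value AS STRING) AS value"))

def write_to_serving_layers(batch_df, batch_id):
    # Secondary sinks inside the same Spark job; connector-specific options omitted
    (batch_df.write
     .format("org.apache.spark.sql.redis")       # assumed spark-redis format
     .option("table", "recent_results")
     .mode("append")
     .save())
    (batch_df.write
     .format("org.apache.hadoop.hbase.spark")    # assumed hbase-spark format
     .mode("append")
     .save())

(results.writeStream
 .foreachBatch(write_to_serving_layers)
 .option("checkpointLocation", "/tmp/checkpoints/serving")
 .start()
 .awaitTermination())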
Late events
Now the question is what to do with late data. The best solution is to let the framework handle that through watermarks. That is, data is only committed at all sinks once the framework is sure that no more late data will arrive. If that doesn't work out because you really need to process late events even if they arrive much, much later and still want to have temporary results, you have to use update events.
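In Spark Structured Streaming terms, this corresponds to an aggregation with a watermark written in append output mode: a window is only emitted once the watermark has passed its end, i.e. once the engine assumes no more late data for it will arrive. A minimal sketch (topic, paths and the 15-minute lateness bound are illustrative):
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("watermark-final-emission").getOrCreate()

events = (spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "broker:9092")
          .option("subscribe", "events")
          .load()
          .select(F.col("timestamp").alias("event_time"), F.col("value")))

final_counts = (events
                .withWatermark("event_time", "15 minutes")     # tolerate 15 min of lateness
                .groupBy(F.window("event_time", "1 hour"))
                .count())

(final_counts.writeStream
 .outputMode("append")        # emit each window only after the watermark passes it
 .format("parquet")
 .option("path", "/data/final-counts")
 .option("checkpointLocation", "/tmp/checkpoints/final-counts")
 .start())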
Update events
(as requested by the OP, I will add more details to the update events)
In Kafka Streams, elements are emitted through a continuous refinement mechanism by default. That means windowed aggregations emit results as soon as they have any valid data point and update those results while receiving new data. Thus, any late event is processed and yields an updated result. While this approach nicely lowers the burden on users, as they do not need to understand watermarks, it has some severe shortcomings that led the Kafka Streams developers to add Suppression in 2.1 and onward.
The main issue is that it poses quite big challenges for downstream users who have to process intermediate results, as also explained in the article about Suppression. If it's not obvious whether a result is temporary or "final" (in the sense that all expected events have been processed), then many applications are much harder to implement. In particular, windowing operations need to be replicated on the consumer side to get the "final" value.
Another issue is that the data volume is blown up. If you have a strong aggregation factor, using watermark-based emission will reduce your data volume heavily after the first operation. However, continuous refinement adds a constant volume factor, as each record triggers a new (intermediate) record for all intermediate steps.
Lastly, and particularly interesting for you, is how to offload data to external systems if you have update events. Ideally, you would offload the data continuously or periodically, with some time lag. That approach simulates the watermark-based emission again on the consumer side.
Mixing the options
It's possible to use watermarks for the initial emission and then use update events for late events. The volume is then reduced for all "on-time" events. For example, Flink offers allowed lateness to make windows trigger again for late events.
This setup makes offloading data much easier, as data only needs to be re-emitted to the external systems if a late event actually happened. The system should be tuned so that a late event is a rare case, though.

How does Spark deal with JDBC data in relation to time?

I am trying to sync my Spark database on S3 with an older Oracle database via a daily ETL Spark job. I am trying to understand just what Spark does when it connects to an RDS like Oracle to fetch data.
Does it only grab the data present at the time of Spark's request to the DB (i.e., if it fetches data from an Oracle DB at 2/2 17:00:00, will it only grab data UP to that point in time)? Essentially, will any new data or updates arriving at 2/2 17:00:01 not be obtained by the fetch?
Well, it depends. In general you have to assume that this behavior is non-deterministic, unless explicitly ensured by your application and database design.
By default Spark will fetch data every time you execute an action on the corresponding Spark dataset. It means that every execution might see different state of your database.
This behavior can be affected by multiple factors:
Explicit caching and possible cache evictions.
Implicit caching with shuffle files.
The exact set of parameters you use with the JDBC data source.
In the first two cases Spark can reuse already fetched data without going back to the original data source. The third one is much more interesting. By default Spark fetches data using a single transaction, but there are methods which enable parallel reads based on column ranges or predicates. If one of these is used, Spark will fetch data using multiple transactions, and each one can observe a different state of your database.
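For illustration, here are both parallel-read variants in PySpark; the URL, credentials, table and column names are made up. Each partition issues its own query (and transaction), which is exactly why the partitions may observe different states of a busy database:
partitioned = (spark.read.format("jdbc")
    .option("url", "jdbc:oracle:thin:@//db-host:1521/ORCL")
    .option("dbtable", "ORDERS")
    .option("user", "etl").option("password", "...")
    .option("partitionColumn", "ORDER_ID")    # numeric/date column to split on
    .option("lowerBound", "1")
    .option("upperBound", "1000000")
    .option("numPartitions", "8")             # eight concurrent reads
    .load())

by_predicates = spark.read.jdbc(
    url="jdbc:oracle:thin:@//db-host:1521/ORCL",
    table="ORDERS",
    predicates=["REGION = 'EU'", "REGION = 'US'"],   # one partition per predicate
    properties={"user": "etl", "password": "..."})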
If consistent point-in-time semantics is required you have basically two options:
Use immutable, append-only and timestamped records in your database and issue timestamp dependent queries from Spark.
Perform consistent database dumps and use these as a direct input to your Spark jobs.
While the first approach is much more powerful, it is much harder to implement if you're working with a pre-existing architecture.
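A sketch of the first option in PySpark: push a timestamp-bounded query down to the database as the JDBC "table", so every action in the job sees the same point-in-time snapshot. The table, column and cutoff value are hypothetical:
as_of = "2024-02-02 17:00:00"
snapshot_query = f"""
    (SELECT * FROM ORDERS
     WHERE UPDATED_AT <= TIMESTAMP '{as_of}') snapshot
"""

orders_snapshot = (spark.read.format("jdbc")
    .option("url", "jdbc:oracle:thin:@//db-host:1521/ORCL")
    .option("dbtable", snapshot_query)        # subquery used as the source table
    .option("user", "etl").option("password", "...")
    .load())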
