How does Spark deal with JDBC data in relation to time? - apache-spark

I am trying to sync my Spark database on S3 with an older Oracle database via a daily ETL Spark job. I am trying to understand just what Spark does when it connects to an RDS like Oracle to fetch data.
Does it only grab the data that exists at the time of Spark's request to the DB (i.e. if it fetches data from an Oracle DB at 2/2 17:00:00, it will only grab data UP to that point in time)? In other words, will any new data or updates arriving at 2/2 17:00:01 be excluded from the fetch?

Well, it depends. In general you have to assume that this behavior is non-deterministic, unless explicitly ensured by your application and database design.
By default Spark will fetch data every time you execute an action on the corresponding Spark dataset. This means that every execution might see a different state of your database.
This behavior can be affected by multiple factors:
Explicit caching and possible cache evictions.
Implicit caching with shuffle files.
The exact set of parameters you use with the JDBC data source.
In the first two cases Spark can reuse already fetched data without going back to the original data source. The third one is much more interesting. By default Spark fetches data using a single transaction, but there are methods which enable parallel reads based on column ranges or predicates. If one of these is used, Spark will fetch data using multiple transactions, and each one can observe a different state of your database.
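For example, range-based parallel reads can be enabled roughly as below (the connection details, table name, and bounds are made up for illustration); each resulting partition is fetched over its own connection and transaction:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("oracle-etl").getOrCreate()

// Hypothetical Oracle connection details and table.
val ordersDF = spark.read
  .format("jdbc")
  .option("url", "jdbc:oracle:thin:@//db-host:1521/ORCL")
  .option("dbtable", "ORDERS")
  .option("user", "etl_user")
  .option("password", "secret")
  // Splitting the read on a numeric column produces numPartitions parallel queries,
  // each running in its own transaction and possibly observing a different DB state.
  .option("partitionColumn", "ORDER_ID")
  .option("lowerBound", "1")
  .option("upperBound", "10000000")
  .option("numPartitions", "8")
  .load()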
If consistent point-in-time semantics is required you have basically two options:
Use immutable, append-only and timestamped records in your database and issue timestamp dependent queries from Spark.
Perform consistent database dumps and use these as a direct input to your Spark jobs.
While the first approach is much more powerful, it is much harder to implement if you're working with a pre-existing architecture.
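For the first option, a timestamp-bounded read could look roughly like this sketch (the table, column, cutoff, and connection details are assumptions, not part of the question):

// `spark` is the existing SparkSession; ORDERS and UPDATED_AT are hypothetical names.
val cutoff = "2018-02-02 17:00:00"

// Pushing the time filter down as a subquery keeps every partition and
// transaction bounded by the same point in time.
val snapshotDF = spark.read
  .format("jdbc")
  .option("url", "jdbc:oracle:thin:@//db-host:1521/ORCL")
  .option("user", "etl_user")
  .option("password", "secret")
  .option("dbtable",
    s"(SELECT * FROM ORDERS WHERE UPDATED_AT <= TIMESTAMP '$cutoff') snapshot")
  .load()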

Related

Best option for storage in Spark

A third party is producing a complete daily snapshot of their database table (Authors) and is storing it as a Parquet file in S3. Currently the number of records is around 55 million+. This will increase daily. There are 12 columns.
Initially I want to take this whole dataset and do some processing on the records, normalise them and then block them into groups of authors based on some specific criteria. I will then need to repeat this process daily, and filter it to only include authors that have been added or updated since the previous day.
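Roughly, that daily delta step could be sketched as follows (the snapshot paths are placeholders, and identical schemas across days are assumed):

// Compare today's snapshot of the Authors table with yesterday's to keep only
// authors that were added or changed (paths are placeholders).
val today     = spark.read.parquet("s3://third-party-bucket/authors/dt=2024-01-02/")
val yesterday = spark.read.parquet("s3://third-party-bucket/authors/dt=2024-01-01/")

// Row-level difference: rows present today that were not identical yesterday.
val addedOrUpdated = today.except(yesterday)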
I am using AWS EMR on EKS (Kubernetes) as my Spark cluster. My current thoughts are that I can save my blocks of authors on HDFS.
The main use for the blocks of data will be a separate Spark Streaming job that will then be deployed onto the same EMR cluster, and will read events from a Kafka topic and do a quick search to see which blocks of data are related to that event, and then it will do some matching (pairwise) against each item of that block.
I have two main questions:
Is using HDFS a performant and viable option for this use case?
The third party database table dump is going to be an initial goal. Later on, there will quite possibly be 10s or even 100s of other sources that I would need to do matching against. That means trillions of records that are blocked, and those blocks need to be stored somewhere. Would this option still be viable at that stage?

Spark structured streaming real-time aggregation

Is it possible to output aggregation data on every trigger, before the aggregation time window is over?
Context: I'm developing an application that reads data from a Kafka topic, processes the data, aggregates it over a 1-hour window, and outputs to S3. However, the Spark application understandably outputs the aggregation data to S3 only at the end of a given hour window.
The problem is that the end-users of the aggregated data in S3 can only have a semi real-time view, since they are always one hour late, waiting for the next aggregation to be outputted from the Spark application.
Reducing the aggregation time window to something smaller than an hour would certainly help, but would generate a lot more data.
What could be done to enable real-time aggregation, as I call it, using minimal resources?
This is an interesting one and I do have a proposal but I'm not sure if this would really fit your minimal criteria. I'll describe the solution anyway...
If the end goal is to enable users to query data in real time (faster analytics, in other words), then one way to achieve this is to introduce a database into your architecture that can handle fast inserts/updates - either a key-value store or a column-oriented database.
The idea is simple - just keep ingesting data into the first database and then keep offloading the data into S3 after a certain time i.e. either an hour or a day depending on your requirements. You could then register the metadata of both of these storage layers into a metadata layer (such as AWS Glue) - this may not always be necessary if you don't need a persistent metastore. On top of this, you could use something like Presto to query across both of these stores. This would also enable you to optimise your storage across these 2 data stores.
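As a rough sketch of the streaming leg of this (the Kafka topic, the fast store's JDBC URL, and the column names are all assumptions on my part), partial window aggregates could be pushed into the fast database on every trigger and compacted to S3 later:

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, window}

// Read the event stream (brokers and topic are hypothetical).
val events = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker:9092")
  .option("subscribe", "events")
  .load()
  .selectExpr("CAST(key AS STRING) AS key", "timestamp AS eventTime")

// The same 1-hour aggregation, but emitted incrementally.
val hourlyCounts = events
  .withWatermark("eventTime", "10 minutes")
  .groupBy(window(col("eventTime"), "1 hour"), col("key"))
  .count()

// On every trigger, append the current partial aggregates to the fast store;
// a real setup would upsert on (window, key) so the latest value wins.
val writeToFastStore: (DataFrame, Long) => Unit = (batch, batchId) =>
  batch
    .select(                                    // flatten the window struct for the JDBC sink
      col("window.start").as("window_start"),
      col("window.end").as("window_end"),
      col("key"),
      col("count"))
    .write
    .format("jdbc")
    .option("url", "jdbc:postgresql://fast-store:5432/analytics") // hypothetical store
    .option("dbtable", "hourly_counts")
    .option("user", "writer")
    .option("password", "secret")
    .mode("append")
    .save()

hourlyCounts.writeStream
  .outputMode("update")   // emit updated partial aggregates on every trigger
  .option("checkpointLocation", "s3://my-bucket/checkpoints/hourly_counts/")
  .foreachBatch(writeToFastStore)
  .start()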
You'll obviously need to build the process to drop/delete the data partitions from the store you would be streaming into, and also to move data to S3.
This model is referred to as a tiered storage model, or a hierarchical storage model with a sliding-window pattern - Reference Article from Cloudera.
Hope this helps!

How to update or even reset rows in persistent table given multiple simultaneous readers?

I have an exchangeRates table that gets updated in batch once per week. This is to be used by other batch and streaming jobs, across different clusters - thus I want to save this as a persistent, shared table for all jobs to share.
allExchangeRatesDF.write.saveAsTable("exchangeRates")
How best then (for the batch job that manages this data) to gracefully update the table contents (actually overwrite it completely) - considering the various Spark jobs as consumers of it, and particularly given its use in some 24/7 structured streaming queries?
I've checked the APIs; maybe I am missing something obvious! Very likely.
Thanks!
I think you expect some kind of transaction support from Spark, so that while a saveAsTable is in progress Spark would hold all writes until the update/reset has finished.
I think that the best way to deal with the requirement is to append new records (using insertInto) with the batch id that would denote the rows that belong to a "new table".
insertInto(tableName: String): Unit
Inserts the content of the DataFrame to the specified table. It requires that the schema of the DataFrame is the same as the schema of the table.
You'd then use the batch id to deal with the rows as if they were the only rows in the dataset.
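A minimal sketch of that idea, assuming the exchangeRates table was created with an extra batch_id column (the column name and the use of a timestamp as the id are my own choices):

import org.apache.spark.sql.functions.{col, lit, max}

// Weekly batch job: append the new rates tagged with a batch id.
val batchId = System.currentTimeMillis()

allExchangeRatesDF
  .withColumn("batch_id", lit(batchId))
  .write
  .insertInto("exchangeRates")   // appends; the DataFrame schema must match the table's

// Consumers: read only the most recent complete batch.
val rates = spark.table("exchangeRates")
val latestBatchId = rates.agg(max("batch_id")).head.getLong(0)
val currentRates = rates.where(col("batch_id") === latestBatchId)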

Use Cases for Spark

We have an application which the clients use to track their procurement cycle. We need to build a solution which will help the users to pull any column from any table in a particular subject area and they should be able to see all the rows of the result of this join of the tables from which the columns have been pulled. It needs to be similar to a Salesforce kind of reporting solution. We are looking at HDFS and Spark in Azure HDInsight to support this kind of querying capability. We would like to know if this is a valid use case for Spark. The volume of the joins of all tables can easily touch 500 million rows, which will be pulled into the Spark driver memory before being displayed to the user.
Please let me know if this is something that can be done using Spark.
As per my understanding, Spark is mostly used for batch processing. If your use case is directly user-facing, then I am doubtful about using Spark; there may be better solutions (or alternate architectures), because joining 500 million rows in real time sounds crazy!
The volume of the joins of all tables can easily touch 500 million rows which will be pulled into the Spark driver memory before being displayed to the user.
This is another thing that puzzled me. Pulling all 500 million rows into the RAM of a single Java process doesn't sound right, for obvious reasons.
Updated
Just using Spark to process huge data will not be effective for real-time solutions (like your use case). But Spark will be very effective if you pre-process your data, cache the results using some other system, and prepare views from the results that can be served to your users. This is more or less similar to the Lambda Architecture.
A Spark-on-YARN cluster to periodically process the data and generate/update the different views, a distributed storage system (preferably a columnar storage system) to cache the views, and a REST API to serve the views to users.
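As a hedged sketch of that pattern (table names, columns, and the output path are invented), a periodic Spark job could materialise the heavy joins into columnar views that the serving layer reads cheaply:

import org.apache.spark.sql.functions.sum

// Periodic batch job: pre-join and pre-aggregate the procurement tables offline.
val orders  = spark.table("procurement.orders")
val vendors = spark.table("procurement.vendors")

val spendByVendor = orders
  .join(vendors, "vendor_id")
  .groupBy("vendor_id", "vendor_name")
  .agg(sum("amount").as("total_spend"))

// Publish the view in a columnar format; a REST API or BI layer serves it to users
// without touching Spark (or the 500M-row join) at request time.
spendByVendor.write
  .mode("overwrite")
  .parquet("abfs://views@mystorageaccount.dfs.core.windows.net/spend_by_vendor/")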
Late reply to the question, but in case someone else is reading this in the future: AWS Redshift does exactly this.

What would be the proper way to tune Apache Spark for responsive web applications?

I have previously used Apache Spark for streaming applications where it does a wonderful job for ETL pipelines and predictions using Machine Learning.
However, Spark for EDA may not be as fast as one may want. For example, if you would like to do basic mathematical operations on data coming from Postgres or ElasticSearch using data frames in Spark, the time it takes to fetch the data from the host system and do the analysis is much higher than the time the equivalent SQL query takes to run on Postgres.
Even simple aggregations such as sum, average, and count can be done much faster using SQL than by doing them on top of Spark SQL.
From what I understand, this is not because of latency in fetching the data from the host system. If you call the show method on a data frame, you can quickly get the top rows of the data set. However, if you limit the response in SQL and then call collect, the time taken is huge.
This means that the data is there, but the processing being done while calling collect is what takes the time.
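For reference, the two access patterns being compared look like this (the CSV path is just a placeholder; the same pattern applies to JDBC or ElasticSearch sources):

// `spark` is the existing SparkSession.
val df = spark.read
  .option("header", "true")
  .csv("/data/events.csv")

// Returns quickly: Spark computes only enough partitions to print 20 rows.
df.show()

// The pattern the question reports as slow: even with a limit, the result is
// fully materialised and shipped to the driver as a local Array[Row].
val firstRows = df.limit(20).collect()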
Regardless of the data source (CSV file, JSON file, ElasticSearch, Parquet, etc.), the behavior remains the same.
What is the reason for this latency on collect and is there any way to reduce it to the point where it can work with responsive applications to make real-time or near real-time queries?
