Incremental Data Storage - apache-spark

I have time series daily data which I run a model on. The model runs in Spark.
I only want to run the model daily, and append the results to the historic results. It is important to have a 'merged single data source' containing historical data for the model to run successfully.
I have to use an AWS service to store the results. If I store in S3, I will end up storing backfill + 1 file per day (too many files). If I store in Redshift, it doesn't support merge + upsert, so it becomes complicated. The customer-facing data is in Redshift, so dropping the table and reloading daily is not an option.
I am not sure how to cleverly (defined as minimal cost and subsequent processing) store the incremental data without re-processing everything daily to get a single file.

S3 is still your best shot. Since your job doesn't seem to need real-time access, it's more of a rolling data set.
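To illustrate the rolling-data-set idea, here is a minimal sketch of appending one day's model output as a new date partition on S3; the bucket, paths, and the run_date column are placeholders, not the OP's actual schema:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("daily-model-append").getOrCreate()

# Today's model output (placeholder path).
daily_results = spark.read.parquet("s3://my-bucket/model-output/2020-12-05/")

(daily_results.write
    .mode("append")                 # only adds today's partition, history stays untouched
    .partitionBy("run_date")
    .parquet("s3://my-bucket/merged-results/"))  # the single 'merged' data source

Downstream jobs can then read the merged-results path as one data source, while partition pruning keeps the daily reads cheap.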
If you are worried about the number of files it generates, there are at least two things you can do:
S3 object lifecycle management
You can define your objects to be removed or transitioned to another (cheaper) storage class after x days.
More examples: https://docs.aws.amazon.com/AmazonS3/latest/dev/lifecycle-configuration-examples.html
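As a concrete (hedged) illustration, a boto3 call along these lines would transition daily result objects to a cheaper storage class after 30 days and expire them after 90; the bucket name, prefix, and day counts are placeholders:

import boto3

s3 = boto3.client("s3")
s3.put_bucket_lifecycle_configuration(
    Bucket="my-results-bucket",                          # placeholder bucket
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "tier-then-expire-daily-results",
                "Filter": {"Prefix": "daily-results/"},  # placeholder prefix
                "Status": "Enabled",
                "Transitions": [{"Days": 30, "StorageClass": "STANDARD_IA"}],
                "Expiration": {"Days": 90},
            }
        ]
    },
)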
S3 notification
Basically you can set up a listener on your S3 bucket that 'listens for' all objects matching your specified prefix and suffix and triggers other AWS services. One easy thing you can do is trigger a Lambda, do your processing, and then do whatever you like.
https://docs.aws.amazon.com/AmazonS3/latest/user-guide/enable-event-notifications.html
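For example, a minimal sketch of such a Lambda handler (the actual processing step is a placeholder) that reads the bucket and key out of the S3 event notification:

import urllib.parse

def handler(event, context):
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        key = urllib.parse.unquote_plus(record["s3"]["object"]["key"])
        # Placeholder: merge the new object into the historic data set,
        # kick off a downstream job, send a notification, etc.
        print(f"New object: s3://{bucket}/{key}")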
Use S3 as your database whenever it's possible. It's damn cheap and it's AWS's backbone.

You can also switch to an ETL tool. A very efficient one, which is open source, specialized in big data, fully automatable and easy to use, is the Pentaho Data Integrator.
It comes equipped with ready-made plugins for S3, Redshift (and others), and there is a single step to compare with previous values. From my experience it runs pretty fast. Plus it works for you during the night and sends you a morning mail saying everything went OK (or not).
Note to the moderators: this is an agnostic point of view; I could have recommended many others, but this one seems the most suited to the OP's need.

Related

Best way: how to export dynamodb table to a csv and store it in s3

We have one Lambda that will update a DynamoDB table after some operation.
Now we want to export the whole DynamoDB table into an S3 bucket in CSV format.
Is there any efficient way to do this?
I have also found the below way of exporting directly from DynamoDB to S3:
https://aws.amazon.com/blogs/aws/new-export-amazon-dynamodb-table-data-to-data-lake-amazon-s3/
But the above stores the data in JSON format, and I cannot find a way to do this efficiently for 10 GB of data.
As far as I can tell you have three "simple" options.
Option #1: Program that does a Scan
It is fairly simple to write a program that does a (parallel) scan of your table and then outputs the result as CSV. A no-bells-and-whistles version of this is about 100-150 lines of code in Python or Go (see the sketch after the note below).
Advantages:
Easy to develop
Can be run easily multiple times from local machines or CI/CD pipelines or whatever.
Disadvantages:
It will cost you a bit of money. Scanning the whole table will use up read units. Depending on how much you are reading, this might get costly fast.
Depending on the amount of data this can take a while.
Note: If you want to run this in a Lambda, then remember that Lambdas can run for a maximum of 15 minutes. So once you have more data than can be processed within those 15 minutes, you probably need to switch to Step Functions.
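A hedged, no-bells-and-whistles sketch of that program (sequential rather than parallel, table and file names are placeholders, and non-scalar attribute types aren't handled):

import csv
import boto3

def export_table_to_csv(table_name, out_path):
    table = boto3.resource("dynamodb").Table(table_name)
    with open(out_path, "w", newline="") as f:
        writer = None
        kwargs = {}
        while True:
            page = table.scan(**kwargs)                 # consumes read capacity
            for item in page["Items"]:
                if writer is None:
                    writer = csv.DictWriter(f, fieldnames=sorted(item), extrasaction="ignore")
                    writer.writeheader()
                writer.writerow(item)
            if "LastEvaluatedKey" not in page:          # no more pages to scan
                break
            kwargs["ExclusiveStartKey"] = page["LastEvaluatedKey"]

export_table_to_csv("my-table", "export.csv")           # placeholder names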
Option #2: Process a S3 backup
DynamoDB allows you to create backups of your table to S3 (as the article you linked describes). Those backups will be either in JSON or in a JSON-like AWS format. You can then write a program that converts those JSON files to CSV (see the sketch at the end of this option).
Advantages:
(A lot) cheaper than a scan
Disadvantages:
Requires more "plumbing", because you need to first create the backup, then download it from S3 to wherever you want to process it, etc.
Probably will take longer than option #1
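A hedged sketch of such a converter, assuming the export produces gzipped files with one DynamoDB-JSON record per line (the file names are placeholders and only scalar attribute types are handled):

import csv
import gzip
import json

def flatten(item):
    # Each exported attribute looks like {"S": "text"} or {"N": "123"}.
    return {name: next(iter(typed.values())) for name, typed in item.items()}

with gzip.open("export-part-0000.json.gz", "rt") as src, open("export.csv", "w", newline="") as dst:
    writer = None
    for line in src:
        row = flatten(json.loads(line)["Item"])
        if writer is None:
            writer = csv.DictWriter(dst, fieldnames=sorted(row), extrasaction="ignore")
            writer.writeheader()
        writer.writerow(row)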

Differences between BigQuery BQ.insert_rows_json and BQ.load_from_json?

I want to stream data into BigQuery and I was thinking of using Pub/Sub + Cloud Functions, since there is no transformation needed (for now, at least) and using Cloud Dataflow feels like a bit of an overkill for just inserting rows into a table. Am I correct?
The data is streamed from a GCP VM using a Python script into PubSub and it has the following format:
{'SEGMENT':'datetime':'2020-12-05 11:25:05.64684','values':(2568.025,2567.03)}
The BigQuery schema is datetime:timestamp, value_A: float, value_B: float.
My questions with all this are:
a) Do I need to push this into BigQuery as JSON/dictionary with all values as strings, or does it have to match the data types of the table?
b) What's the difference between using BQ.insert_rows_json and BQ.load_table_from_json and which one should I use for this task?
EDIT:
What I'm trying to get is actually market data for some assets, say around 28 instruments, capturing all their ticks. On an average day there are ~60k ticks per instrument, so we are talking about ~33.6M invocations per month. What is needed (for now) is to insert them into a table for further analysis. I'm currently not sure whether real streaming should be performed or loads per batch. Since the project is still in the analysis phase, I don't feel that Dataflow is needed yet, but Pub/Sub should be used since it makes it easier to scale to Dataflow when the time comes. This is my first implementation of a streaming pipeline and I'm using everything I've learned through courses and reading. Please correct me if my approach is wrong :).
What I would absolutely love to do is, for example, perform another insert into another table when the price difference between one tick and the n'th tick is, for example, 10. For this, should I use Dataflow, or is the Cloud Function approach still valid? Because this is like a trigger condition. Basically, the trigger would be something like:
if price difference >= 10:
process all these ticks
insert the results in this table
But I'm unsure how to implement this trigger.
In addition to the great answer of Marton (Pentium10)
a) You can stream JSON into BigQuery, but it must be VALID JSON; your example isn't. About the types, there is automatic coercion/conversion according to your schema. You can see this here.
b) A load job loads files from GCS or content that you put in the request. The batch is asynchronous and can take seconds or minutes. In addition, you are limited to 1,500 loads per day and per table, so 1 per minute works (1,440 minutes per day). There are several interesting aspects of the load job:
Firstly, it's free!
Your data is immediately loaded into the correct partition and immediately queryable in that partition
If the load fails, no data is inserted. So it's easy to replay a file without ending up with duplicated values.
In contrast, the streaming job inserts the data into BigQuery in real time. It's interesting when you have real-time constraints (especially for visualisation, anomaly detection, ...). But there are some downsides:
You are limited to 500k rows per second (in EU and US), 100k rows per second in other regions, and 1 GB max per second
The data isn't immediately in the partition; it sits in a buffer named UNPARTITIONED for a while, or until that buffer is full. So you have to take this specificity into account when you build and test your real-time application.
It's not free. The cheapest region is $0.05 per GB.
Now that you are aware of this, ask yourself about your use case.
If you need real time (less than 2 minutes of delay), no doubt, streaming is for you.
If you have a few GB per month, streaming is also the easiest solution, for a few dollars
If you have a huge volume of data (more than 1 GB per second), BigQuery isn't the right service; consider Bigtable (which you can query with BigQuery as a federated table)
If you have a significant volume of data (1 or 2 GB per minute) and your use case requires data freshness of a minute or more, you can consider a special design (sketched after this list):
Create a PubSub pull subscription
Create an HTTP-triggered Cloud Function (or a Cloud Run service) that pulls the subscription for 1 minute, then submits the pulled content to BigQuery as a load job (no file needed, you can post in-memory content directly to BigQuery), and then exits gracefully
Create a Cloud Scheduler job that triggers your service every minute.
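A hedged sketch of that pull-then-load service (project, subscription, and table IDs are placeholders; a real version would keep pulling in a loop for about a minute instead of doing a single pull):

import json
from google.cloud import bigquery, pubsub_v1

def pull_and_load(request):
    subscriber = pubsub_v1.SubscriberClient()
    sub_path = subscriber.subscription_path("my-project", "ticks-sub")
    response = subscriber.pull(request={"subscription": sub_path, "max_messages": 1000})

    rows = [json.loads(msg.message.data) for msg in response.received_messages]
    if rows:
        bq = bigquery.Client()
        job = bq.load_table_from_json(rows, "my-project.market.ticks")  # free batch load
        job.result()  # wait for the load job before acknowledging the messages
        subscriber.acknowledge(request={
            "subscription": sub_path,
            "ack_ids": [msg.ack_id for msg in response.received_messages],
        })
    return "loaded %d rows" % len(rows)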
Edit 1:
The cost shouldn't drive your use case.
If, for now, it's only for analytics, you can simply trigger your job once per day to pull the full subscription. With your metrics: 60k ticks * 28 instruments * 100 bytes (24 bytes + memory overhead), you have only 168 MB. You can hold this in Cloud Functions or Cloud Run memory and perform a load job.
Streaming is really important for real time!
Dataflow, in streaming mode, will cost you at least $20 per month (1 small worker of type n1-standard-1), much more than 1.5 GB of streaming inserts into BigQuery with Cloud Functions.
Finally, about your smart trigger to stream or to batch insert: it's not really possible as such; you would have to redesign the data ingestion if you change your logic. But above all, do this only if your use case requires it!!
To answer your questions:
a) You need to push to BigQuery using the formats the library accepts, usually a collection or a JSON document formatted according to the table's definition.
b) To add data to BigQuery you can Stream data or Load a file.
For your example you need to stream data, so use the 'streaming API' methods, the insert_rows* family.
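To make the two options concrete, here is a hedged sketch using the Python client (the table ID and row values are placeholders):

from google.cloud import bigquery

client = bigquery.Client()
table_id = "my-project.market.ticks"   # placeholder
rows = [{"datetime": "2020-12-05 11:25:05.64684", "value_A": 2568.025, "value_B": 2567.03}]

# Streaming API: rows are queryable almost immediately, but inserts are billed.
errors = client.insert_rows_json(table_id, rows)
if errors:
    raise RuntimeError(errors)

# Load job: free and asynchronous, but subject to the per-table daily load quota.
job = client.load_table_from_json(rows, table_id)
job.result()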

Spark: writing data to a place that is being read from without losing data

Please help me understand how I can write data to a location that is also being read from, without any issues, using EMR and S3.
So I need to read partitioned data, find old data, delete it, write new data back and I'm thinking about 2 ways here:
Read all the data, apply a filter, and write the data back with save option SaveMode.Overwrite. I see one major issue here: before writing, it will delete the files in S3, so if the EMR cluster goes down for some reason after the deletion but before the writing, all the data will be lost. I could use dynamic partition overwrite, but in such a situation that would still mean losing the data from one partition.
Same as above, but write to a temp directory, then delete the original and move everything from temp to the original location. But as this is S3 storage, it doesn't have a move operation and all files would be copied, which can be a bit pricey (I'm going to work with 200 GB of data).
Is there any other way, or am I wrong about how Spark works?
You are not wrong. The process of deleting a record from a table on EMR/Hadoop is painful in the ways you describe and more. It gets messier with failed jobs, small files, partition swapping, slow metadata operations...
There are several formats and file protocols that add transactional capability on top of a table stored in S3. The open Delta Lake (https://delta.io/) format supports transactional deletes, updates, and merge/upsert, and does so very well. You can read & delete (say for GDPR purposes) like you're describing, and you'll have a transaction log to track what you've done.
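For instance, the daily upsert from the question could look roughly like this with Delta Lake; this is only a sketch, assuming the delta-spark package is installed and configured, and the paths and the id/event_date join keys are placeholders:

from delta.tables import DeltaTable
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()  # assumes Delta Lake extensions are enabled

new_data = spark.read.parquet("s3://my-bucket/staging/2020-12-05/")   # placeholder path
target = DeltaTable.forPath(spark, "s3://my-bucket/delta/results/")   # placeholder path

(target.alias("t")
    .merge(new_data.alias("s"), "t.id = s.id AND t.event_date = s.event_date")
    .whenMatchedUpdateAll()      # overwrite old rows that match
    .whenNotMatchedInsertAll()   # append genuinely new rows
    .execute())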
On point 2, as long as you have a reasonable number of files, your costs should be modest, with data charges at ~$23/TB/mo. However, if you end up with too many small files, the API costs of listing and fetching files can add up quickly. Managed Delta (from Databricks) will help speed up many of the operations on your tables through compaction, data caching, data skipping and Z-ordering.
Disclaimer, I work for Databricks....

Spark structured streaming real-time aggregation

Is it possible to output aggregation data on every trigger, before the aggregation time window is over?
Context: I'm developing an application that reads data from a Kafka topic, processes the data, aggregates it over a 1-hour window, and outputs to S3. However, the Spark application understandably writes the aggregated data to S3 only at the end of a given hour window.
The problem is that the end users of the aggregated data in S3 can only have a semi-real-time view, since they are always one hour behind, waiting for the next aggregation to be output by the Spark application.
Reducing the aggregation time window to something smaller than an hour would certainly help, but would generate a lot more data.
What could be done to enable real-time aggregation, as I call it, using minimal resources?
This is an interesting one and I do have a proposal but I'm not sure if this would really fit your minimal criteria. I'll describe the solution anyway...
If the end goal is to enable users to query data in real time (or faster analytics, in other words), then one way to achieve this is to introduce a database into your architecture that can handle fast inserts/updates: either a key-value store or a column-oriented database. Below is a diagram that might help you visualise this:
The idea is simple: keep ingesting data into the first database and then keep offloading the data into S3 after a certain time, i.e. either an hour or a day, depending on your requirements. You could then register the metadata of both of these storage layers in a metadata layer (such as AWS Glue); this may not always be necessary if you don't need a persistent metastore. On top of this, you could use something like Presto to query across both of these stores. This would also enable you to optimise your storage across the two data stores.
You'll obviously need to build the process that drops/deletes the data partitions from the store you are streaming into, and also the one that moves data to S3.
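A hedged sketch of the ingestion half of that pattern, writing each micro-batch into a fast JDBC-accessible store while the existing hourly job keeps writing to S3; the broker, topic, connection details and table names are placeholders, and the Kafka and JDBC connectors are assumed to be on the classpath:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("tiered-ingest").getOrCreate()

events = (spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")   # placeholder
    .option("subscribe", "events")                      # placeholder topic
    .load()
    .selectExpr("CAST(value AS STRING) AS value", "timestamp"))

def write_to_fast_store(batch_df, batch_id):
    # Every micro-batch lands in the low-latency store so users can query it right away.
    (batch_df.write
        .format("jdbc")
        .option("url", "jdbc:postgresql://fast-store:5432/rt")   # placeholder connection
        .option("dbtable", "recent_events")
        .option("user", "writer").option("password", "secret")   # placeholders
        .mode("append")
        .save())

(events.writeStream
    .foreachBatch(write_to_fast_store)
    .option("checkpointLocation", "s3://my-bucket/checkpoints/fast-store/")
    .start())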
This model is referred to as a tiered storage model or hierarchical storage model with sliding window pattern - Reference Article from Cloudera.
Hope this helps!

Spark S3Guard - Skip listing S3

I'm using Spark (2.4) to process data stored on S3.
I'm trying to understand if there's a way to avoid listing the objects that I'm reading as my batch job inputs (I'm talking about ~1M).
I know about S3Guard, which stores the objects' metadata, and thought that I could use it to skip the S3 listing.
I've read this Cloudera blog:
Note that it is possible to skip querying S3 in some cases, just serving results from the Metadata Store. S3Guard has mechanisms for this but it is not yet supported in production.
I know it's quite old; is it already available in production?
As of July 2019 it is still tagged as experimental; HADOOP-14936 lists the tasks there.
The recent work has generally been on corner cases you aren't going to encounter on a daily basis, but which we know exist and can't ignore.
The specific feature you are talking about, "auth mode", relies on all clients using S3Guard and updating the tables, and on us being happy that we can handle the failure conditions for consistency.
For a managed table, I'm going to say Hadoop 3.3 will be ready to use this. For Hadoop 3.2, it's close. Really, more testing is needed.
In the meantime, if you can't reduce the number of files in S3, can you make sure you don't have a deep directory tree? It's that recursive directory scan which really suffers against it.
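For reference, a hedged sketch of what enabling S3Guard with its DynamoDB metadata store (and the still-experimental authoritative mode discussed above) looks like from a PySpark session; treat the property values as assumptions to verify against your Hadoop version:

from pyspark.sql import SparkSession

spark = (SparkSession.builder
    .appName("s3guard-metadata-store")
    .config("spark.hadoop.fs.s3a.metadatastore.impl",
            "org.apache.hadoop.fs.s3a.s3guard.DynamoDBMetadataStore")
    .config("spark.hadoop.fs.s3a.metadatastore.authoritative", "true")  # experimental auth mode
    .getOrCreate())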

Resources