Atomic data reaggregation with Apache Spark - apache-spark

We've implemented batch processing with apache spark. One batch appears once in 15 minutes and contains around 5GB of data in parquet format. We are storing data partitioned with schema
/batch=11/dt=20170102/partition=2
batch is monotonically increasing number, dt is date,partition is a number from 0 to 30 partitioned by clientid, it's needed for faster querying.
Mainly data is being searched from this structure by date and/or by clientid. From this folder we are preparing additional transformations using batch id as pointer.
For one day we get 3000 (100 batcher per day) folders and around 3000000 files inside. After some time we want to make bigger batches in order to reduce amount of folders and files stored in hdfs.
For example from
/batch=100/dt=.../partition=...
...
/batch=9999/dt=.../partition=...
we want to make
/batch=9999/dt=20170102/partition=...
/batch=9999/dt=20170103/partition=...
etc...
But problem is that users can run queries on this folder, and if we will move data between batches clients can read the same data twice or didn't read at all.
Can you suggest any appropriate solution how to tighten batches in atomic way? Or may be you can suggest better storing schema for such purposes?

Related

How to write a single file per partition with Spark?

I have a bunch of files with timeseries data from some IoT sensors. I need to aggregate data per hour per IoT device and partition the resulting files per day. Currently there is not much data and not many devices so I can handle it locally with a single laptop.
I've managed to get a single file per day instead of a file per device per day using repartition(1). But as far as I understand this is inefficient as this basically moves all the data to a single node to do all the processing. I've read that handling partitions becomes really important in order to improve performance but it seems quite low level and requires knowledge about the size of your data.
So I was wondering, is there any strategy to handle repartitioning of the data? In my case, I need a single file to improve efficiency when loading the resulting parquet files (using AWS Athena) as aggregations per hour partitioned by day for a bunch of devices results on just a few lines to read per file. Maybe I need to aggregate the files in a second step?
This is the code I currently use to test the aggregation of the data.
df
.drop("name")
.filter(!_.anyNull)
.withColumn("timestamp_seconds", from_unixtime(col("time").divide(1000000000)).cast("timestamp")) //Timestamp in nanoseconds
.withColumn("day", to_date(col("timestamp_seconds")))
.withColumn("hour", hour(col("timestamp_seconds")))
.drop("timestamp_seconds")
.groupBy("id", "day","hour").agg(avg("soc").as("avg_soc"))
.where("avg_soc != 0")
.orderBy("id", "day", "hour")
.repartition(1)
.write
.partitionBy("day")
.mode(SaveMode.Overwrite)
.csv("data/aggregate") //I'm using CSV for now in order to see the data

Delta Lake partitioning strategy for event data

I'm trying to build a system that ingests, stores and can query app event data. In the future it will be used for other tasks (ML, Analytics, etc.) hence why I think Databricks could be a good option(for now).
The main use case will be retrieving user-action events occurring in the app.
Batches of this event data will land in an S3 bucket about every 5-30 mins and Databricks Auto Loader will pick them up and store it in a Delta Table.
A typical query will be: get all events where colA = x over the last day, week, or month.
I think the typical strategy here is to partition by date. e.g:
date_trunc("day", date) # 2020-04-11T00:00:00:00.000+000
This will create 365 partitions in a year. I expect each partition to hold about 1GB of data. In addition to partitioning, I plan on using z-ordering for one of the high cardinality columns that will frequently be used in the where clause.
Is this too many partitions?
Is there a better way to partition this data?
Since I'm partitioning by day and data is coming in every 5-30 mins, is it possible to just "append" data to a days partition instead?
It's really depends on the amount of data that are coming per day and how many files should be read to answer your query. If it 10th of Gb then partition per day is ok. But you can also partition by timestamp truncated to week, and in this case you'll get only 52 partitions per year. ZOrdering will help to keep the files optimized, but if you're appending data every 5-30 minutes, you'll get with at least 24 files per day inside the partition, so you will need to run OPTIMIZE with ZOrder every night, or something like this, to decrease the number of files. Also, make sure that you're using optimized writes - although this make write operation slower, it will decrease the number of files generated (if you're planning to use ZOrdering, then it makes no sense to enable autocompaction)

Does Spark guarantee consistency when reading data from S3?

I have a Spark Job that reads data from S3. I apply some transformations and write 2 datasets back to S3. Each write action is treated as a separate job.
Question: Does Spark guarantees that I read the data each time in the same order? For example, if I apply the function:
.withColumn('id', f.monotonically_increasing_id())
Will the id column have the same values for the same records each time?
You state very little, but the following is easily testable and should serve as a guideline:
If you re-read the same files again with same content you will get the same blocks / partitions again and the same id using f.monotonically_increasing_id().
If the total number of rows differs on the successive read(s) with different partitioning applied before this function, then typically you will get different id's.
If you have more data second time round and apply coalesce(1) then the prior entries will have same id still, newer rows will have other ids. A less than realistic scenario of course.
Blocks for files at rest remain static (in general) on HDFS. So partition 0..N will be the same upon reading from rest. Otherwise zipWithIndex would not be usable either.
I would never rely on the same data being in same place when read twice unless there were no updates (you could cache as well).

Question about using parquet for time-series data

I'm exploring ways to store a high volume of data from sensors (time series data), in a way that's scalable and cost-effective.
Currently, I'm writing a CSV file for each sensor, partitioned by date, so my filesystem hierarchy looks like this:
client_id/sensor_id/year/month/day.csv
My goal is to be able to perform SQL queries on this data, (typically fetching time ranges for a specific client/sensor, performing aggregations, etc) I've tried loading it to Postgres and timescaledb, but the volume is just too large and the queries are unreasonably slow.
I am now experimenting with using Spark and Parquet files to perform these queries, but I have some questions I haven't been able to answer from my research on this topic, namely:
I am converting this data to parquet files, so I now have something like this:
client_id/sensor_id/year/month/day.parquet
But my concern is that when Spark loads the top folder containing the many Parquet files, the metadata for the rowgroup information is not as optimized as if I used one single parquet file containing all the data, partitioned by client/sensor/year/month/day. Is this true? Or is it the same to have many parquet files or a single partitioned Parquet file? I know that internally the parquet file is stored in a folder hierarchy like the one I am using, but I'm not clear on how that affects the metadata for the file.
The reason I am not able to do this is that I am continuously receiving new data, and from my understanding, I cannot append to a parquet file due to the nature that the footer metadata works. Is this correct? Right now, I simply convert the previous day's data to parquet and create a new file for each sensor of each client.
Thank you.
You can use Structured Streaming with kafka(as you are already using it) for real time processing of your data and store data in parquet format. And, yes you can append data to parquet files. Use SaveMode.Append for that such as
df.write.mode('append').parquet(path)
You can even partition your data on hourly basis.
client/sensor/year/month/day/hour which will further provide you performance improvement while querying.
You can create hour partition based on system time or timestamp column based on type of query you want to run on your data.
You can use watermaking for handling late records if you choose to partition based on timestamp column.
Hope this helps!
I could share my experience and technology stack that being used at AppsFlyer.
We have a lot of data, about 70 billion events per day.
Our time-series data for near-real-time analytics are stored in Druid and Clickhouse. Clickhouse is used to hold real-time data for the last two days; Druid (0.9) wasn't able to manage it. Druid holds the rest of our data, which populated daily via Hadoop.
Druid is a right candidate in case you don't need a row data but pre-aggregated one, on a daily or hourly basis.
I would suggest you let a chance to the Clickhouse, it lacks documentation and examples but works robust and fast.
Also, you might take a look at Apache Hudi.

Design of Spark + Parquet "database"

I've got 100G text files coming in daily, and I wish to create an efficient "database" accessible from Spark. By "database" I mean the ability to execute fast queries on the data (going back about a year), and incrementally add data each day, preferably without read locks.
Assuming I want to use Spark SQL and parquet, what's the best way to achieve this?
give up on concurrent reads/writes and append new data to the existing parquet file.
create a new parquet file for each day of data, and use the fact that Spark can load multiple parquet files to allow me to load e.g. an entire year. This effectively gives me "concurrency".
something else?
Feel free to suggest other options, but let's assume I'm using parquet for now, as from what I've read this will be helpful to many others.
My Level 0 design of this
Use partitioning by date/time (if your queries are based on date/time to avoid scanning of all data)
Use Append SaveMode where required
Run SparkSQL distributed SQL engine so that
You enable querying of the data from multiple clients/applications/users
cache the data only once across all clients/applications/users
Use just HDFS if you can to store all your Parquet files
I have very similar requirement in my system. I would say if load the whole year's data -for 100g one day that will be 36T data ,if you need to load 36TB daily ,that couldn't be fast anyway. better to save the processed daily data somewhere(such as count ,sum, distinct result) and use that to go back for whole year .

Resources