How to group by timestamp in data or data ingestion time? - apache-spark

I could see that spark streaming windowing function does the grouping only based on "when it received the data". I would like to do the grouping based on the timestamp field available in the data itself. Is it possible?
For example - The data creation timestamp is available as part of the data as 1 PM. But spark streaming received the data at 1.05 PM. So it should do the grouping based on the timestamp (1 PM) available in the data.

I would like to do the grouping based on the timestamp field available in the data itself. Is it possible?
No. Spark Streaming does not offer such a feature.
You should instead use Structured Streaming that does offer window function to group by.
Quoting Window Operations on Event Time:
Aggregations over a sliding event-time window are straightforward with Structured Streaming and are very similar to grouped aggregations. In a grouped aggregation, aggregate values (e.g. counts) are maintained for each unique value in the user-specified grouping column. In case of window-based aggregations, aggregate values are maintained for each window the event-time of a row falls into.

Related

Databricks query performance when filtering on a column correlated to the partition-column

Setting: Delta-lake, Databricks SQL compute used by powerbi.
I am wondering about the following scenario: We have a column timestamp and a derived column date (which is the date of timestamp), and we choose to partitionby date. When we query we use timestamp in the filter, not date.
My understanding is that databrikcs a priori wont connect the timestamp and the date, and seemingly wont get any advantage of the partitioning. But since the files are in fact partitioned by timestamps (implicitly), when databricks looks at the min/max timestamps of all the files, it will find that it can skip most files after all. So it seems like we can get quite a benefit of partitioning even if its on a column we dont explicitly use in the query.
Is this correct?
What is the performance cost (roughly) of having to filter away files in this way vs using the partitioning directly.
Will databricks have all the min/max information in memory, or does it have to go out and look at the files for each query?
Yes, Databricks will take implicit advantage of this partitioning through data skipping because there will be min/max statistics associated with specific data files. The min/max information will be loaded into memory from the transaction log, but it will need to make decision which files it need to hit on every query. But because everything is in memory, it shouldn't be very big performance overhead, until you have hundreds of thousands files.
One thing that you may consider - use generated column instead of explicit date column. Declare it as date GENERATED ALWAYS AS (CAST(timestampColumn AS DATE)), and partition by it. The advantage is that when you're doing a query on timestampColumn, then it should do partition filtering on the date column automatically.

Efficient reading/transforming partitioned data in delta lake

I have my data in a delta lake in ADLS and am reading it through Databricks. The data is partitioned by year and date and z ordered by storeIdNum, where there are about 10 store Id #s, each with a few million rows per date. When I read it, sometimes I am reading one date partition (~20 million rows) and sometimes I am reading in a whole month or year of data to do a batch operation. I have a 2nd much smaller table with around 75,000 rows per date that is also z ordered by storeIdNum and most of my operations involve joining the larger table of data to the smaller table on the storeIdNum (and some various other fields - like a time window, the smaller table is a roll up by hour and the other table has data points every second). When I read the tables in, I join them and do a bunch of operations (group by, window by and partition by with lag/lead/avg/dense_rank functions, etc.).
My question is: should I have the date in all of the joins, group by and partition by statements? Whenever I am reading one date of data, I always have the year and the date in the statement that reads the data as I know I only want to read from a certain partition (or a year of partitions), but is it important to also reference the partition col. in windows and group bus for efficiencies, or is this redundant? After the analysis/transformations, I am not going to overwrite/modify the data I am reading in, but instead write to a new table (likely partitioned on the same columns), in case that is a factor.
For example:
dfBig = spark.sql("SELECT YEAR, DATE, STORE_ID_NUM, UNIX_TS, BARCODE, CUSTNUM, .... FROM STORE_DATA_SECONDS WHERE YEAR = 2020 and DATE='2020-11-12'")
dfSmall = spark.sql("SELECT YEAR, DATE, STORE_ID_NUM, TS_HR, CUSTNUM, .... FROM STORE_DATA_HRS WHERE YEAR = 2020 and DATE='2020-11-12'")
Now, if I join them, do I want to include YEAR and DATE in the join, or should I just join on STORE_ID_NUM (and then any of the timestamp fields/customer Id number fields I need to join on)? I definitely need STORE_ID_NUM, but I can forego YEAR AND DATE if it is just adding another column and makes it more inefficient because it is more things to join on. I don't know how exactly it works, so I wanted to check as by foregoing the join, maybe I am making it more inefficient as I am not utilizing the partitions when doing the operations? Thank you!
The key with delta is to choose the partitioned columns very well, this could take some trial and error, if you want to optimize the performance of the response, a technique I learned was to choose a filter column with low cardinality (you know if the problem is of time series, it will be the date, on the other hand if it is about a report for all clients in that case it may be convenient to choose your city), remember that if you work with delta each partition represents a level of the file structure where its cardinality will be the number of directories.
In your case I find it good to partition by YEAR, but I would add the MONTH given the number of records that would help somewhat with the dynamic pruning of spark
Another thing you can try is to use BRADCAST JOIN if the table is very small compared to the other.
Broadcast Hash Join en Spark (ES)
Join Strategy Hints for SQL Queries
The latter link explains how dynamic pruning helps in MERGE operations.
How to improve performance of Delta Lake MERGE INTO queries using partition pruning

Using union() to combine historical and live data partitioned by different fields

I have a streaming ingest of sensor data where the data is being saved to S3 partitioned by time (year/month/day). I'm calling this the landing zone.
I then have a periodic batch process to take the latest data from the landing zone and save it into another dataset in S3 that is partitioned by another set of keys. This partitioning is for performance reasons; the users typically filter by the partition keys, so that when querying, the amount of data that needs to be retrieved from disk is minimised. I'm calling this the analytics zone.
I now have a user that needs to query data across both landing and analytics zone, I.e. so they have the latest data available.
Is union() appropriate for joining datasets that have the same columns but are partitioned by different fields? E.g.
// historical contains data up to but excluding Year=2018, Month=10, Day=1
// assetID is a partition field
historicalDF = spark.sql("SELECT * FROM historical WHERE assetID = 123")
// Year, Month and Day are partition fields
liveDF = spark.sql(
"""SELECT * FROM live
WHERE Year = 2018 AND Month = 10 AND Day = 1 AND assetID = 123""")
allDF = historicalDF.union(liveDF)
Is union() appropriate for joining datasets that have the same columns but are partitioned by different fields?
Why wouldn't it be? Once data is loaded with predicates, all opportunity for optimization through partition pruning (the main advantage of having partitioned table) is already used.
What you get is just a DataFrame as any other, with no partitioner set.Short of worrying about long lineages (and with such short pipeline, there is no reason to), union is just fine.

Does Spark's RDD.combineByKey() preserve the order of a previously sorted DataFrame?

I've done this in PySpark:
Created a DataFrame using a SELECT statement to get asset data ordered by asset serial number and then time.
Used DataFrame.map() to convert the DataFrame to an RDD.
Used RDD.combineByKey() to collate all the data for each asset, using the asset's serial number as the key.
Question: Can I be certain that the data for each asset will still be sorted in time order in the RDD resulting from the last step?
Time order is crucial for me (I need to calculate statistics over a moving time window across the data for each asset). When RDD.combineByKey() combines data from different nodes in the Spark cluster for a given key, is any order in that key's data retained? Or is the data from the different nodes combined in no particular order for a given key?
Can I be certain that the data for each asset will still be sorted in time order in the RDD resulting from the last step?
You cannot. When you apply sort across multiple dimensions (data ordered by asset serial number and then time) records for a single asset can be spread across multiple partitions. combineByKey will require a shuffle and the order in which these parts are combined is not guaranteed.
You can try with repartition and sortWithinPartitions (or its equivalent on RDDs):
df.repartition("asset").sortWithinPartitions("time")
or
df.repartition("asset").sortWithinPartitions("asset", "time")
or window functions with frame definition as follows:
w = Window.partitionBy("asset").orderBy("time")
In Spark >= 2.0 window functions can be used with UserDefinedFunctions so if you're fine with writing your own SQL extensions in Scala you can skip conversion to RDD completely.

Timeseries with Spark/Cassandra - How to find timestamps when values satisfy a condition?

I have timeseries stored in a Cassandra table, coming from several sensors. Here is the schema I use for storing data :
CREATE TABLE data_sensors (
sensor_id int,
time timestamp,
value float,
PRIMARY KEY ((sensor_id), time)
);
Values can be temperature or pressure for instance, depending on the sensor from which it is coming from.
My objective is to be able to find basic statistics (min, max, avg, std) on pressure, but only when temperature is higher than a certain value.
Here is a schema of the whole process I'd like to get.
I think it could be better if I changed the Cassandra model, at least for temperature data, to be able to filter on value. Is there another way, after importing data into a Spark RDD, to avoid altering the Cassandra table?
Then, once filtering on temperature is done, how to get the sequence of timestamps I have to use to filter pressure data? Please note that I don't have necessarily the same timestamps for temperature and pressure, that is why I think I need to have periods of time instead of a list of precise timestamps.
Thanks for your help!
It's not really a Cassandra-specific answer, but maybe you want to look at time series databases that provide SQL layer on top of NoSQL stores with support for JOINs and aggregations.
Here's an example of an ATSD SQL syntax that supports period aggregations and joins.
SELECT t1.entity, t1.datetime, min(t1.value), max(t1.value), avg(t2.value)
FROM mpstat.cpu_busy t1
JOIN meminfo.memfree t2
WHERE t1.datetime >= '2016-09-20T15:00:00Z' AND t1.datetime < '2016-09-20T15:15:00Z'
GROUP BY entity, t1.PERIOD(1 MINUTE)
HAVING max(t1.value) > 30
The query joins two metrics, filters out 1-minute rows where first metric was below the threshold and then returns a bunch of statistics for the second series.
If the two series are unevenly spaced, you can regularize the array using linear interpolation.
Disclosure: I work for Axibase that develops ATSD.

Resources