Databricks Delta table load takes a long time to load 1 record - apache-spark

Whenever the Databricks notebook runs, I try to insert 1 record into a Delta table, but this takes around 70 seconds. I am passing start_time as a variable.
val batchDf = Seq((1000, 40, start_time, null, null, status)).toDF("Key", "RunId", "Start_Time", "End_Time", "Duration", "In-progress")
batchDf.write.format("delta").mode("append").saveAsTable("t_audit")
Any idea why loading 1 record into a Delta table takes this long? I would expect this to finish in less than 5 seconds.

Databricks is horribly slow in comparison to anything that I have used in the past 30 years, but in your case it could be related to auto optimize.
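If auto optimize (optimized writes / auto compaction) is enabled on the table, each tiny append can trigger extra file-management work. As a sketch, assuming that is the culprit, you could check the table's properties and disable it for this small audit table (property names per the Databricks Delta docs):
-- Inspect the table's current properties
SHOW TBLPROPERTIES t_audit;
-- Disable auto optimize for this small audit table
ALTER TABLE t_audit SET TBLPROPERTIES (
  'delta.autoOptimize.optimizeWrite' = 'false',
  'delta.autoOptimize.autoCompact' = 'false'
);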

Related

Cassandra: Data modeling for event-based time series

I have a data modeling question. In my application I'm reading data from a few different sensors and storing it in Cassandra. The sensors generate new values at very different rates: Some every other second, some every other month.
Furthermore, the assumption is that a value stays valid until the next one is encountered. Example: Sensor 1 sent a value of 500 at 10s after EPOCH and a value of 1000 at 20s after EPOCH. The valid value for 15s after EPOCH would need to be 500.
Since some rates are going to be high and I don't want unbounded partitions, I want to apply bucketing. I'm thinking about modeling my data like this:
CREATE TABLE sensor_data (
sensor_id text,
some_timing_bucket date,
measured_at time,
value double,
PRIMARY KEY ((sensor_id, some_timing_bucket), measured_at)
) WITH CLUSTERING ORDER BY (measured_at DESC);
The usual queries the application would need to serve are "give me the data of the last 5/15 minutes/1 day", so I would choose the some_timing_bucket accordingly. Maybe even have multiple tables with different bucket sizes.
What I cannot wrap my head around is this: Consider I choose one day as bucketing interval. Now I want to retrieve the current value of a sensor that hasn't updated in ten days. There will be no partition for today, so on my application layer I would need to send nine queries that yield nothing until I have gone far enough back in time to encounter the value that is currently valid. That doesn't sound very efficient and I'd appreciate any input on how to model this.
Side note: This would not be an issue if all data for the same sensor was in the same partition: Just ask for all the points with a timestamp less than the beginning of the ranged query and limit the results to one. But that's not feasible because of the unbounded partition.
There is a much simpler way to model your data by using one-day buckets. Something like:
CREATE TABLE sensor_data_by_day (
sensor_id text,
year int,
month int,
day int,
measured_at timestamp,
value double,
PRIMARY KEY ((sensor_id, year, month, day), measured_at)
) WITH CLUSTERING ORDER BY (measured_at DESC);
If a sensor measures a data point every second, then there are at most 86,400 values in a single day (60 secs x 60 mins x 24 hrs). 86K rows per partition is still manageable.
If today is 17 August 2022 and you wanted to retrieve the data for the previous day, the query would be:
SELECT value FROM sensor_data_by_day
WHERE sensor_id = ?
AND year = 2022
AND month = 8
AND day = 16
Assuming it is currently 08:30:00 GMT on the 17th of August (1660725000000 ms since epoch), to retrieve the data for the last 15 minutes (900 secs ago or 1660724100000 ms):
SELECT value FROM sensor_data_by_day
WHERE sensor_id = ?
AND year = 2022
AND month = 8
AND day = 17
AND measured_at > 1660724100000
I think you'll find that it is easier to work with timestamps, because they provide a bit more flexibility when it comes to doing range queries. Cheers!
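For instance, a sketch of a sub-day range query against the same table (the literal bounds are made up for illustration):
SELECT value FROM sensor_data_by_day
WHERE sensor_id = ?
AND year = 2022
AND month = 8
AND day = 17
AND measured_at > '2022-08-17 08:00:00+0000'
AND measured_at <= '2022-08-17 08:30:00+0000';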
You can do this with a simpler table like this:
CREATE TABLE sensor_data (
sensor_id text,
day_number_from_1970 int,
measured_at timestamp,
value double,
PRIMARY KEY ((sensor_id, day_number_from_1970), measured_at)
) WITH CLUSTERING ORDER BY (measured_at DESC);
and you can query data like that:
SELECT value
FROM sensor_data
WHERE sensor_id = some_sensor_id
AND day_number_from_1970 = day_number
AND measured_at > start_time
AND measured_at < end_time
With a single int bucket column, you store less data on disk and still get the results you need.
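A minimal sketch of the bucket arithmetic and the query with the DataStax Python driver (the contact point, keyspace, and sensor id are assumptions):
from datetime import datetime, timezone
from cassandra.cluster import Cluster

def day_number_from_1970(ts):
    # Whole days since the Unix epoch -- this is the partition bucket.
    return int(ts.timestamp() // 86400)

cluster = Cluster(["127.0.0.1"])          # assumed contact point
session = cluster.connect("my_keyspace")  # assumed keyspace

end = datetime.now(timezone.utc)
start = end.replace(hour=0, minute=0, second=0, microsecond=0)  # midnight today

rows = session.execute(
    "SELECT value FROM sensor_data "
    "WHERE sensor_id = %s AND day_number_from_1970 = %s "
    "AND measured_at > %s AND measured_at < %s",
    ("sensor-1", day_number_from_1970(end), start, end))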

Azure Stream Analytics current day aggregation

I'm quite new to Azure Stream Analytics, but I need to push day-to-date rolling totals to Power BI (live dashboard) every time a new event arrives at the Azure Stream Analytics job. I've created the following SQL query to calculate this:
SELECT
Factory_Id,
COUNT(0) as events_count,
MAX(event_create_time) as last_event_time,
SUM(event_value) as event_value_total
INTO
[powerbi]
FROM
[eventhub] TIMESTAMP BY event_create_time
WHERE DAY(event_create_time) = DAY(System.Timestamp) and MONTH(event_create_time) = MONTH(System.Timestamp) and YEAR(event_create_time) = YEAR(System.Timestamp)
GROUP BY Factory_Id, SlidingWindow(day,1)
But this didn't give me the desired result: I get the total for the last 24 hours (not only for the current day), and sometimes a record with a bigger last_event_time has a smaller events_count than a record with a smaller last_event_time. The question is: what am I doing wrong, and how can I achieve the desired outcome?
EDIT following comment: This computes the results for the last 24h, but what's needed is the day-to-date running sum/count (from 00:00 until now). See the updated answer below.
I'm wondering if an analytics approach would work better than an aggregation here.
Instead of using a time window, you calculate and emit a record for each incoming event:
SELECT
Factory_Id,
COUNT(*) OVER (PARTITION BY Factory_Id LIMIT DURATION (hour, 24)) AS events_count,
system.timestamp() as last_event_time,
SUM(event_value) OVER (PARTITION BY Factory_Id LIMIT DURATION (hour, 24)) as event_value_total
INTO PowerBI
FROM [eventhub] TIMESTAMP BY event_create_time
The only hiccup is for events landing on the same timestamp:
{"Factory_Id" : 1, "event_create_time" : "2021-12-10T10:00:00", "event_value" : 0.1}
{"Factory_Id" : 1, "event_create_time" : "2021-12-10T10:01:00", "event_value" : 2}
{"Factory_Id" : 1, "event_create_time" : "2021-12-10T10:01:00", "event_value" : 10}
{"Factory_Id" : 1, "event_create_time" : "2021-12-10T10:02:00", "event_value" : 0.2}
You won't get a single record for that timestamp; every intermediate result is emitted:
Factory_Id | events_count | last_event_time              | event_value_total
1          | 1            | 2021-12-10T10:00:00.0000000Z | 0.1
1          | 2            | 2021-12-10T10:01:00.0000000Z | 2.1
1          | 3            | 2021-12-10T10:01:00.0000000Z | 12.1
1          | 4            | 2021-12-10T10:02:00.0000000Z | 12.2
We may want to add a step to the query to deal with it if it's an issue for your dashboard. Let me know!
EDIT following comment
This new version emits progressive results over a daily tumbling window. Every time we get a new record, we collect the last 24h of records, remove the rows from the previous day, and re-calculate the aggregates. To collect properly, we first need to make sure we only have 1 record per timestamp.
-- First we make sure we get only 1 record per timestamp, to avoid duplication in the analytics function below
WITH Collapsed AS (
SELECT
Factory_Id,
system.timestamp() as last_event_time,
COUNT(*) AS C,
SUM(event_value) AS S
FROM [input1] TIMESTAMP BY event_create_time
GROUP BY Factory_Id, system.timestamp()
),
-- Then we build an array at each timestamp, containing all records from the last 24h
Collected as (
SELECT
Factory_Id,
system.timestamp() as last_event_time,
COLLECT() OVER (PARTITION BY Factory_Id LIMIT DURATION (hour, 24)) AS all_events
FROM Collapsed
)
-- Finally we expand the array, removing the rows on the previous day, and aggregate
SELECT
C.Factory_Id,
system.timestamp() as last_event_time,
SUM(U.ArrayValue.C) AS events_count,
SUM(U.ArrayValue.S) AS event_value_total
FROM Collected AS C
CROSS APPLY GETARRAYELEMENTS(C.all_events) AS U
WHERE DAY(U.ArrayValue.last_event_time) = DAY(system.Timestamp())
GROUP BY C.Factory_Id, C.last_event_time, system.timestamp()
Let me know how it goes.

How can I count all Azure Event Hub events of the current day using Stream Analytics, refreshed every 5 minutes?

I need to count all events collected during the current day, from 0:00 to 23:59 UTC, refreshed every five minutes.
I'm using a Stream Analytics service with the current query:
SELECT Cast(pid as bigint) as PublisherID,Cast(cid as bigint) as CampaignID, Count(*) as Count
INTO
[SQLTableClicks]
FROM
[Clicks]
GROUP BY pid,cid, TumblingWindow(Day,1)
It works, but it only emits data once a day, and I need the info updated every five minutes.
I think a hopping window is what you need; it will give you a result every 5 minutes, looking a day back.
Try something like this (I didn't run it, but it should give you an idea):
With data as
(
SELECT
Cast(pid as bigint) as PublisherID,
Cast(cid as bigint) as CampaignID,
System.TimeStamp as Time
FROM
[Clicks]
)
SELECT PublisherID, CampaignID, Count(*) as Count
INTO
[SQLTableClicks]
FROM
[data]
WHERE DAY(System.TimeStamp) = DAY(Time)
GROUP BY PublisherID, CampaignID, HoppingWindow(Duration(day, 1), Hop(minute, 5))

Can I use loops with a Spark DataFrame?

My data is as shown below:
Store  ID  Amount ...
1      1   10
1      2   20
2      1   10
3      4   50
I have to create a separate directory for each store:
Store 1/accounts:
ID  Amount
1   10
2   20
Store 2/accounts:
ID  Amount
1   10
For this purpose, can I use a loop over the Spark DataFrame as below? It works on my local machine; will it be a problem on a cluster?
storecount = 1
while storecount <= 50:
    query = "SELECT * FROM Sales WHERE Store={}".format(storecount)
    df = spark.sql(query)
    df.write.format("csv").save(path)  # path would need to change per store
    storecount = storecount + 1
If I understood the problem correctly, what you really want to do is partition the dataframe.
I would suggest to do this
df.write.partitionBy("Store").mode(SaveMode.Append).csv("..")
This will write the dataframe into several partition directories, like:
store=2/
store=1/
....
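In PySpark, the language of the question, an equivalent sketch (the output path is an assumption):
df = spark.table("Sales")  # assumed source table from the question
df.write.partitionBy("Store").mode("append").format("csv").save("/output/accounts")  # hypothetical path; yields Store=1/, Store=2/, ...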
Yes, you can run a loop here, as it is not a nested operation on the data frame.
Nested operations on an RDD or DataFrame are not allowed, because the SparkContext is not serializable.
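To illustrate (a sketch): a plain driver-side loop is fine, but referencing the Spark session inside an executor-side function is not.
df = spark.table("Sales")  # assumed source table from the question
# Driver-side loop: allowed, it runs entirely on the driver.
for store in [r["Store"] for r in df.select("Store").distinct().collect()]:
    spark.sql("SELECT * FROM Sales WHERE Store={}".format(store)) \
        .write.format("csv").save("Store {}/accounts".format(store))
# Executor-side use of the session: NOT allowed, because the SparkContext
# lives only on the driver and cannot be serialized out to executors.
# df.rdd.map(lambda r: spark.sql("SELECT 1")).collect()  # would fail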

"select count(id) from table" takes up to 30 minutes to calculate in SQL Azure

I have a database in SQL Azure which is now taking between 15 and 30 minutes to do a simple:
select count(id) from mytable
The database is about 3.3 GB and the count returns approx 2,000,000, but I have tried it locally and it takes less than 5 seconds!
I have also run a:
ALTER INDEX ALL ON mytable REBUILD
On all the tables in the database.
Would appreciate if anybody could point me to some things to try to diagnose/fix this.
(Please skip to UPDATE 3 below as I now think this is the issue but I still do not understand it).
UPDATE 1:
It appears that 99% of the time is spent in a clustered index scan, as the execution plan image showed.
UPDATE 2: And this is what the statistics messages come back as when I do:
SET STATISTICS IO ON
SET STATISTICS TIME ON
select count(id) from TABLE
Statistics:
SQL Server parse and compile time:
CPU time = 0 ms, elapsed time = 0 ms.
SQL Server Execution Times:
CPU time = 0 ms, elapsed time = 0 ms.
SQL Server parse and compile time:
CPU time = 0 ms, elapsed time = 317037 ms.
SQL Server Execution Times:
CPU time = 0 ms, elapsed time = 0 ms.
SQL Server Execution Times:
CPU time = 0 ms, elapsed time = 0 ms.
(1 row(s) affected)
Table 'TABLE'. Scan count 1, logical reads 279492, physical reads 8220, read-ahead reads 256018, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
(1 row(s) affected)
SQL Server Execution Times:
CPU time = 297 ms, elapsed time = 438004 ms.
SQL Server parse and compile time:
CPU time = 0 ms, elapsed time = 0 ms.
SQL Server Execution Times:
CPU time = 0 ms, elapsed time = 0 ms.
UPDATE 3: OK - I have another theory now. The Azure portal suggests that each time I test this simple select query, it maxes out my DTU percentage at nearly 100%. I am using a Standard Azure SQL instance at performance level S1 (20 DTUs). Is it possible that this simple query is being slowed down by my DTU limit?
I realize this is old, but I had the same issue. I had a table with 2.5 million rows that I imported from an on-prem database into Azure SQL, running at the S3 level. SELECT COUNT(0) FROM Table resulted in a 5-7 minute execution time, vs milliseconds on-premise.
In Azure, index and table scans seem to be penalized tremendously in performance, so adding a 'useless' WHERE clause that forces an index seek on the clustered index helped.
In my case this performed almost identically: SELECT COUNT(0) FROM Table WHERE id > 0 matched the performance of the on-premise query.
Suggestion: try select count(*) instead: it might actually improve the response time:
http://www.sqlskills.com/blogs/paul/which-index-will-sql-server-use-to-count-all-rows/
Also, have you done an "explain plan"?
http://azure.microsoft.com/blog/2011/12/15/sql-azure-management-portal-tips-and-tricks-part-ii/
http://social.technet.microsoft.com/wiki/contents/articles/1657.gaining-performance-insight-into-windows-azure-sql-database.aspx
============ UPDATE ============
Thank you for getting the statistics.
You're doing a full table scan of 2M rows - not good :(
POSSIBLE WORKAROUND: query the row counts from the system views instead:
http://blogs.msdn.com/b/arunrakwal/archive/2012/04/09/sql-azure-list-of-tables-with-record-count.aspx
SELECT t.name, s.row_count
FROM sys.tables t
JOIN sys.dm_db_partition_stats s
ON t.object_id = s.object_id
AND t.type_desc = 'USER_TABLE'
AND t.name not like '%dss%'
AND s.index_id = 1
Quick refinement of the #FoggyDay post: if your tables are partitioned, you'll want to sum the row counts.
SELECT t.name, SUM(s.row_count) row_count
FROM sys.tables t
JOIN sys.dm_db_partition_stats s
ON t.object_id = s.object_id
AND t.type_desc = 'USER_TABLE'
AND t.name not like '%dss%'
AND s.index_id = 1
GROUP BY t.name
