I'm having some problems with the reliability of Azure SQL servers.
Sometimes a complicated query with subqueries, like the following:
SELECT DISTINCT [DeviceName], name, data.[Addr], [Signal]
FROM (
SELECT [DeviceName], [Signal],
MAX([Signal]) OVER (PARTITION BY [Addr]) AS 'MaxSignal',
[Timestamp], [Addr], [PartitionId], [EventEnqueuedUtcTime]
FROM [dbo].[mytable]
WHERE CAST([Timestamp] AS DATETIME) > DATEADD(HOUR, +2, DATEADD(MINUTE, -10, GETDATE()))
) data
LEFT JOIN [dbo].[myreftable] ON [dbo].[myreftable].[Addr] = data.[Addr]
WHERE [Signal] = [MaxSignal];
finishes almost instantly, as I would expect. At other times, simply running SELECT COUNT(*) FROM mytable
takes upwards of 30 minutes and shows a DTU usage graph like this:
Does anyone know a solution to this? Am I doing something completely wrong, or is Azure simply not there yet?
You get what you pay for. You will need to look at the top resource consumers in your system. DTU is nothing but a limit on the CPU, IO and memory available to your database.
So to troubleshoot DTU problems, I would follow the steps below.
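1.) The query below summarizes resource usage across the samples held in sys.dm_db_resource_stats (which keeps roughly the last hour of data; sys.resource_stats in the master database keeps about 14 days):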
SELECT
(COUNT(end_time) - SUM(CASE WHEN avg_cpu_percent > 80 THEN 1 ELSE 0 END) * 1.0) / COUNT(end_time) AS 'CPU Fit Percent'
,(COUNT(end_time) - SUM(CASE WHEN avg_log_write_percent > 80 THEN 1 ELSE 0 END) * 1.0) / COUNT(end_time) AS 'Log Write Fit Percent'
,(COUNT(end_time) - SUM(CASE WHEN avg_data_io_percent > 80 THEN 1 ELSE 0 END) * 1.0) / COUNT(end_time) AS 'Physical Data Read Fit Percent'
FROM sys.dm_db_resource_stats
Running the query above shows what fraction of the samples stayed at or below 80% for CPU, log writes and data IO, so you can see how consistently each resource is under pressure.
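To make the "fit percent" columns in the query above concrete, here is a minimal Python sketch of the same arithmetic. The sample values are made up for illustration, and `fit_percent` is a hypothetical helper name, not part of any Azure API.

```python
def fit_percent(samples, threshold=80.0):
    """Fraction of samples at or below the threshold.

    Mirrors (COUNT(*) - SUM(CASE WHEN x > 80 THEN 1 ELSE 0 END)) / COUNT(*).
    """
    if not samples:
        raise ValueError("no samples")
    over = sum(1 for value in samples if value > threshold)
    return (len(samples) - over) / len(samples)


# Made-up avg_cpu_percent readings, one per sampling interval.
cpu_samples = [12.0, 95.5, 40.0, 81.0, 79.9]
cpu_fit = fit_percent(cpu_samples)
print(f"CPU fit: {cpu_fit:.2f}")
```

A value close to 1 means the database rarely exceeded 80% on that resource; a low value suggests that resource is the one to investigate first.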
2.) The query below gives me an idea of resource usage over time:
SELECT start_time, end_time,
(SELECT Max(v) FROM (VALUES (avg_cpu_percent), (avg_physical_data_read_percent), (avg_log_write_percent)) AS value(v)) as [avg_DTU_percent]
FROM sys.resource_stats WHERE database_name = '<your db name>' ORDER BY end_time DESC
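The query above treats the DTU percentage of an interval as the largest of the three resource percentages. A rough Python sketch of that idea follows, which also reports which metric was the bottleneck in each interval. The rows are invented for illustration and the function names are hypothetical.

```python
def dtu_percent(row):
    """DTU percent of one interval: the largest of the resource percentages."""
    return max(row.values())


def dominant_metric(row):
    """Name of the resource that determines the DTU percent."""
    return max(row, key=row.get)


# Invented rows shaped like the columns used in the query above.
rows = [
    {"avg_cpu_percent": 35.0, "avg_physical_data_read_percent": 92.0, "avg_log_write_percent": 4.0},
    {"avg_cpu_percent": 88.0, "avg_physical_data_read_percent": 20.0, "avg_log_write_percent": 61.0},
]

summary = [(dtu_percent(row), dominant_metric(row)) for row in rows]
for dtu, metric in summary:
    print(f"{dtu:5.1f}%  limited by {metric}")
```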
Now that I have enough data to see which metric is the most resource intensive, I can follow the normal troubleshooting approach.
For example, if my CPU usage is consistently above 90% over time, I will gather the queries that consume the most CPU and try to tune them.
I have a table containing the market data of 5,000 unique stocks. Each stock has 24 records a day and each record has 1,000 fields (factors). I want to pivot the table for cross-sectional analysis. You can find my script below.
I have two questions: (1) The current script is a bit complex. Is there a simpler implementation? (2) The execution takes 521 seconds. Any way to make it faster?
1. Create table
CREATE TABLE tb
(
tradeTime DateTime,
symbol String,
factor String,
value Float64
)
ENGINE = MergeTree
PARTITION BY toYYYYMMDD(tradeTime)
ORDER BY (symbol, tradeTime)
SETTINGS index_granularity = 8192
2. Insert test data
INSERT INTO tb SELECT
tradetime,
symbol,
untuple(factor)
FROM
(
SELECT
tradetime,
symbol
FROM
(
WITH toDateTime('2022-01-01 00:00:00') AS start
SELECT arrayJoin(timeSlots(start, toUInt32((22 * 23) * 3600), 3600)) AS tradetime
)
ARRAY JOIN arrayMap(x -> concat('symbol', toString(x)), range(0, 5000)) AS symbol
)
ARRAY JOIN arrayMap(x -> (concat('factor', toString(x)), toFloat64(x) + toFloat64(0.1)), range(1, 1001)) AS factor
3. Finally, send the query
SELECT
tradeTime,
sumIf(value, factor = 'factor1') AS factor1,
sumIf(value, factor = 'factor2') AS factor2,
sumIf(value, factor = 'factor3') AS factor3,
sumIf(value, factor = 'factor4') AS factor4,
...// so many factors to list out
sumIf(value, factor = 'factor1000') AS factor1000
FROM tb
GROUP BY tradeTime,symbol
ORDER BY tradeTime,symbol ASC
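On the "so many factors to list out" part: the column list in the query above does not have to be typed by hand. A minimal sketch, assuming the factor names follow the factor1 ... factor1000 pattern used in the query and are trusted identifiers (no escaping is done here); `build_pivot_query` is a hypothetical helper that only produces the SQL text.

```python
def build_pivot_query(factor_names, table="tb"):
    """Build the sumIf-based pivot query as a string."""
    columns = ",\n".join(
        f"sumIf(value, factor = '{name}') AS {name}" for name in factor_names
    )
    return (
        "SELECT\n"
        "tradeTime,\n"
        f"{columns}\n"
        f"FROM {table}\n"
        "GROUP BY tradeTime,symbol\n"
        "ORDER BY tradeTime,symbol ASC"
    )


factor_names = [f"factor{i}" for i in range(1, 1001)]
query = build_pivot_query(factor_names)
print(query[:200])
```

This only addresses the length of the script, not the execution time.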
Have you considered building a materialized view that writes the inserts into a SummingMergeTree to solve this?
I use Spark SQL v2.4 with the SQL API. I have a SQL query that fails when I run the job in Spark, with the following error:
WARN SharedInMemoryCache: Evicting cached table partition metadata from memory due to size constraints
(spark.sql.hive.filesourcePartitionFileCacheSize = 262144000 bytes).
This may impact query planning performance.
ERROR TransportClient: Failed to send RPC RPC 8371705265602543276 to xx.xxx.xxx.xx:52790:java.nio.channels.ClosedChannelException
The issue occurs when I trigger the write command to save the output of the query to a Parquet file on S3.
The query is:
create temp view last_run_dt
as
select dt,
to_date(last_day(add_months(to_date('[${last_run_date}]','yyyy-MM-dd'), -1)), 'yyyy-MM-dd') as dt_lst_day_prv_mth
from date_dim
where dt = add_months(to_date('[${last_run_date}]','yyyy-MM-dd'), -1);
create temp view get_plcy
as
select plcy_no, cust_id
from (select
plcy_no,
cust_id,
eff_date,
row_number() over (partition by plcy_no order by eff_date desc) AS row_num
from plcy_mstr pm
cross join last_run_dt lrd
on pm.curr_pur_dt <= lrd.dt_lst_day_prv_mth
and pm.fund_type NOT IN (27, 36, 52)
and pm.fifo_time <= '2022-02-12 01:25:00'
and pm.plcy_no is not null
)
where row_num = 1;
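For readers less familiar with the two views above, here is a small Python sketch of what they compute: the last day of the month before the run date, and the most recent row per plcy_no (the row_number() ... where row_num = 1 pattern). The sample rows are invented, and ties on eff_date are resolved by keeping the first row seen, whereas row_number() would pick one arbitrarily.

```python
from datetime import date, timedelta


def last_day_of_previous_month(run_date):
    """Equivalent of last_day(add_months(run_date, -1))."""
    return run_date.replace(day=1) - timedelta(days=1)


def latest_per_policy(rows):
    """Keep the row with the greatest eff_date for each plcy_no."""
    latest = {}
    for row in rows:
        if row["plcy_no"] is None:  # mirrors "plcy_no is not null"
            continue
        current = latest.get(row["plcy_no"])
        if current is None or row["eff_date"] > current["eff_date"]:
            latest[row["plcy_no"]] = row
    return [(r["plcy_no"], r["cust_id"]) for r in latest.values()]


# Invented sample rows.
rows = [
    {"plcy_no": "P1", "cust_id": "A", "eff_date": date(2022, 1, 1)},
    {"plcy_no": "P1", "cust_id": "B", "eff_date": date(2022, 1, 15)},
    {"plcy_no": "P2", "cust_id": "C", "eff_date": date(2021, 12, 31)},
    {"plcy_no": None, "cust_id": "D", "eff_date": date(2022, 1, 20)},
]
cutoff = last_day_of_previous_month(date(2022, 3, 15))
result = latest_per_policy(rows)
print(cutoff, result)
```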
I am writing the output as:
df.coalesce(10).write.parquet('s3:/some/dir/data', mode="overwrite", compression="snappy")
The "plcy_mstr" table in the above query is a large table (500 GB), partitioned by date on the eff_dt column.
I have tried to increase the executor memory by applying the following configurations, but the job still fails.
set spark.driver.memory=20g;
set spark.executor.memory=20g;
set spark.executor.cores=3;
set spark.executor.instances=30;
set spark.memory.fraction=0.75;
set spark.driver.maxResultSize=0;
The cluster contains 20 nodes with 8 cores each and 64GB of memory.
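As a sanity check on the settings above, here is a back-of-the-envelope sketch of how many executors of that size fit on one node. It assumes the default executor memory overhead of 10% of executor memory (minimum 384 MiB) and ignores whatever the OS and the resource manager reserve on each node, so the real capacity is likely somewhat lower. `executors_per_node` is a hypothetical helper, not a Spark API.

```python
def executors_per_node(node_mem_gb, node_cores, exec_mem_gb, exec_cores,
                       overhead_fraction=0.10, min_overhead_gb=0.375):
    """Executors that fit on one node, limited by memory and by cores."""
    overhead_gb = max(exec_mem_gb * overhead_fraction, min_overhead_gb)
    container_gb = exec_mem_gb + overhead_gb  # what each executor really asks for
    by_memory = int(node_mem_gb // container_gb)
    by_cores = node_cores // exec_cores
    return min(by_memory, by_cores)


# Numbers taken from the question: 20 nodes, 8 cores and 64 GB each.
per_node = executors_per_node(node_mem_gb=64, node_cores=8, exec_mem_gb=20, exec_cores=3)
cluster_slots = per_node * 20
print(f"{per_node} executors per node, {cluster_slots} across the cluster (30 requested)")
```

Under these assumptions the 30 requested executors fit, which hints that raw memory capacity may not be the limiting factor.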
Can anyone please help me identify the issue and fix the job? Any help is appreciated.
Happy to provide more information if required.
Thanks
Some of my tables are of type REPLICATE. I would like these tables to be actually replicated (not pending) before I start querying my data. This will help me avoid data movement.
I have a script, which I found online, that runs in a loop and does a SELECT TOP 1 on all the tables that are set for replication, but sometimes the script runs for hours. It seems the server sometimes won't trigger replication even if you do a SELECT TOP 1 from foo.
How can you force SQL Data Warehouse to complete replication?
The script looks something like this:
begin
CREATE TABLE #tbl
WITH
( DISTRIBUTION = ROUND_ROBIN
)
AS
SELECT
ROW_NUMBER() OVER(
ORDER BY
(
SELECT
NULL
)) AS Sequence
, CONCAT('SELECT TOP(1) * FROM ', s.name, '.', t.[name]) AS sql_code
FROM sys.pdw_replicated_table_cache_state AS p
JOIN sys.tables AS t
ON t.object_id = p.object_id
JOIN sys.schemas AS s
ON t.schema_id = s.schema_id
WHERE p.[state] = 'NotReady';
DECLARE @nbr_statements INT=
(
SELECT
COUNT(*)
FROM #tbl
), @i INT= 1;
WHILE @i <= @nbr_statements
BEGIN
DECLARE @sql_code NVARCHAR(4000)= (SELECT
sql_code
FROM #tbl
WHERE Sequence = @i);
EXEC sp_executesql @sql_code;
SET @i+=1;
END;
DROP TABLE #tbl;
SET @i = 0;
WHILE
(
SELECT TOP (1)
p.[state]
FROM sys.pdw_replicated_table_cache_state AS p
JOIN sys.tables AS t
ON t.object_id = p.object_id
JOIN sys.schemas AS s
ON t.schema_id = s.schema_id
WHERE p.[state] = 'NotReady'
) = 'NotReady'
BEGIN
IF @i % 100 = 0
BEGIN
RAISERROR('Replication in progress' , 0, 0) WITH NOWAIT;
END;
SET @i = @i + 1;
END;
END;
END
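The second loop in the script above polls the cache state as fast as it can, with no pause and no upper bound. The sketch below is a variation on that control flow rather than a translation of the script: it issues the SELECT TOP(1) statements once, then polls with a sleep and gives up after a timeout. `get_not_ready` and `run_sql` are hypothetical callables you would back with your own database client.

```python
import time


def wait_for_replication(get_not_ready, run_sql, poll_seconds=5.0,
                         timeout_seconds=3600.0, sleep=time.sleep,
                         clock=time.monotonic):
    """Trigger builds for NotReady tables, then poll until none remain.

    Returns True when every table is ready, False on timeout.
    """
    deadline = clock() + timeout_seconds
    for schema, table in get_not_ready():
        run_sql(f"SELECT TOP(1) * FROM {schema}.{table}")
    while True:
        if not get_not_ready():
            return True
        if clock() >= deadline:
            return False
        sleep(poll_seconds)


# Fake backend for illustration: the table becomes ready on the third check.
states = [[("dbo", "dim_a")], [("dbo", "dim_a")], []]
issued = []


def fake_not_ready():
    return states.pop(0) if states else []


done = wait_for_replication(fake_not_ready, issued.append,
                            poll_seconds=0, sleep=lambda seconds: None)
print(done, issued)
```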
Henrik, if 'select top 1' doesn't trigger a replicated table build, then that would be a defect. Please file a support ticket.
Without looking at your system, it is impossible to know exactly what is going on. Here are a couple of things to look into that could be contributing to extended build times:
The replicated tables are large (size, not necessarily rows) requiring long build times.
There are a lot of secondary indexes on the replicated table requiring long build times.
Replicated table builds require staticrc20 (2 concurrency slots). If the concurrency slots are not available, the build will queue behind other running queries.
The replicated tables are constantly being modified with inserts, updates and deletes. Modifications require the table to be built again.
The best way is to run a command like this as part of the job which creates/updates the table:
select top 1 * from <table>
That will force its redistribution at the correct time, without the slow loop through the stored procedure.
I am having a syntax issue with my Stream Analytics query. Below is the query, with which I am trying to get the following fields from the events:
Vehicle Id,
Diff of previous and current fuel level (for each vehicle),
Diff of current and previous odometer value (for each vehicle).
NON-WORKING QUERY
SELECT input.vehicleId,
FUEL_DIFF = LAG(input.Payload.FuelLevel) OVER (PARTITION BY vehicleId LIMIT DURATION(minute, 1)) - input.Payload.FuelLevel,
ODO_DIFF = input.Payload.OdometerValue - LAG(input.Payload.OdometerValue) OVER (PARTITION BY input.vehicleId LIMIT DURATION(minute, 1))
from input
The following is one sample input event; the above query/job is run on a series of such events:
{
"IoTDeviceId":"DeviceId_1",
"MessageId":"03494607-3aaa-4a82-8e2e-149f1261ebbb",
"Payload":{
"TimeStamp":"2017-01-23T11:16:02.2019077-08:00",
"FuelLevel":19.9,
"OdometerValue":10002
},
"Priority":1,
"Time":"2017-01-23T11:16:02.2019077-08:00",
"VehicleId":"MyCar_1"
}
The following syntax error is thrown when the Stream Analytics job is run:
Invalid column name: 'payload'. Column with such name does not exist.
Ironically, the following query works just fine:
WORKING QUERY
SELECT input.vehicleId,
FUEL_DIFF = LAG(input.Payload.FuelLevel) OVER (PARTITION BY vehicleId LIMIT DURATION(second, 1)) - input.Payload.FuelLevel
from input
The only difference between the WORKING QUERY and the NON-WORKING QUERY is the number of LAG constructs used. The NON-WORKING QUERY has two LAG constructs, while the WORKING QUERY has just one.
I have consulted the Stream Analytics Query Language reference, but it only has basic examples. I also tried looking into multiple blogs. In addition, I have tried using the GetRecordPropertyValue() function, but no luck. Kindly suggest.
Thank you in advance!
This looks like a syntax bug indeed. Thank you for reporting - we will fix it in the upcoming updates.
Please consider using this query as a workaround:
WITH Step1 AS
(
SELECT vehicleId, Payload.FuelLevel, Payload.OdometerValue
FROM input
)
SELECT vehicleId,
FUEL_DIFF = LAG(FuelLevel) OVER (PARTITION BY vehicleId LIMIT DURATION(minute, 1)) - FuelLevel,
ODO_DIFF = OdometerValue - LAG(OdometerValue) OVER (PARTITION BY vehicleId LIMIT DURATION(minute, 1))
from Step1
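To illustrate what the workaround computes, here is a rough Python sketch of the same two steps: flatten the nested Payload first (the Step1 CTE), then compare each event with the previous event of the same vehicle, but only if that previous event is at most one minute old. The first event reuses the values from the sample input; the other two are invented. Treating an event exactly one minute old as still inside the window is an assumption of this sketch.

```python
from datetime import datetime, timedelta


def lag_diffs(events, limit=timedelta(minutes=1)):
    """Return (vehicleId, FUEL_DIFF, ODO_DIFF) per event, in arrival order.

    Events are assumed to arrive sorted by time.
    """
    previous = {}  # vehicle id -> flattened previous event
    results = []
    for event in events:
        # Step1: pull the nested Payload fields up to the top level.
        flat = {
            "vehicleId": event["VehicleId"],
            "time": event["Time"],
            "fuel": event["Payload"]["FuelLevel"],
            "odo": event["Payload"]["OdometerValue"],
        }
        prev = previous.get(flat["vehicleId"])
        if prev is not None and flat["time"] - prev["time"] <= limit:
            fuel_diff = prev["fuel"] - flat["fuel"]
            odo_diff = flat["odo"] - prev["odo"]
        else:
            fuel_diff = odo_diff = None  # LAG finds nothing inside the window
        results.append((flat["vehicleId"], fuel_diff, odo_diff))
        previous[flat["vehicleId"]] = flat
    return results


t0 = datetime(2017, 1, 23, 11, 16, 2)
events = [
    {"VehicleId": "MyCar_1", "Time": t0,
     "Payload": {"FuelLevel": 19.9, "OdometerValue": 10002}},
    {"VehicleId": "MyCar_1", "Time": t0 + timedelta(seconds=30),
     "Payload": {"FuelLevel": 19.4, "OdometerValue": 10010}},
    {"VehicleId": "MyCar_1", "Time": t0 + timedelta(minutes=5),
     "Payload": {"FuelLevel": 18.0, "OdometerValue": 10050}},
]
diffs = lag_diffs(events)
for row in diffs:
    print(row)
```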
I have a preview version of Azure SQL Data Warehouse running, which was working fine until I imported a large table (~80 GB) through BCP. Now all the tables, including the small ones, do not respond even to a simple query:
select * from <MyTable>
Queries to sys tables are still working:
select * from sys.objects
The BCP process was left running over the weekend, so any statistics update should have been done by now. Is there any way to figure out what is making this happen? Or at least what is currently running, to see if anything is blocking?
I'm using SQL Server Management Studio 2014 to connect to the Data Warehouse and execute queries.
@user5285420 - run the code below to get a good view of what's going on. You should be able to find the query easily by looking at the value in the "command" column. Can you confirm whether the BCP command still shows status = "Running" when the query steps are all complete?
select top 50
(case when requests.status = 'Completed' then 100
when progress.total_steps = 0 then 0
else 100 * progress.completed_steps / progress.total_steps end) as progress_percent,
requests.status,
requests.request_id,
sessions.login_name,
requests.start_time,
requests.end_time,
requests.total_elapsed_time,
requests.command,
errors.details,
requests.session_id,
(case when requests.resource_class is NULL then 'N/A'
else requests.resource_class end) as resource_class,
(case when resource_waits.concurrency_slots_used is NULL then 'N/A'
else cast(resource_waits.concurrency_slots_used as varchar(10)) end) as concurrency_slots_used
from sys.dm_pdw_exec_requests AS requests
join sys.dm_pdw_exec_sessions AS sessions
on (requests.session_id = sessions.session_id)
left join sys.dm_pdw_errors AS errors
on (requests.error_id = errors.error_id)
left join sys.dm_pdw_resource_waits AS resource_waits
on (requests.resource_class = resource_waits.resource_class)
outer apply (
select count (steps.request_id) as total_steps,
sum (case when steps.status = 'Complete' then 1 else 0 end ) as completed_steps
from sys.dm_pdw_request_steps steps where steps.request_id = requests.request_id
) progress
where requests.start_time >= DATEADD(hour, -24, GETDATE())
ORDER BY requests.total_elapsed_time DESC, requests.start_time DESC
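The progress_percent column in the query above boils down to a small piece of logic, sketched here in Python with made-up inputs. The function name is hypothetical; the integer division mirrors what T-SQL does with two integer operands.

```python
def progress_percent(status, total_steps, completed_steps):
    """Mirror of the CASE expression that computes progress_percent."""
    if status == "Completed":
        return 100
    if total_steps == 0:
        return 0  # avoids dividing by zero for requests with no steps yet
    return 100 * completed_steps // total_steps


# Made-up request rows: (status, total_steps, completed_steps).
requests = [("Completed", 4, 4), ("Running", 0, 0), ("Running", 8, 3)]
percents = [progress_percent(*request) for request in requests]
print(percents)
```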
Check the resource utilization, and possibly other issues, at https://portal.azure.com/
You can also run sp_who2 from SSMS to get a snapshot of which threads are active and whether there's some crazy blocking chain that's causing problems.