I have log entries in a ClickHouse database, and each log has a timestamp. I need to calculate two numbers:
Average number of logs per minute
Peak number of logs per minute
| timestamp                      | entry   |
|--------------------------------|---------|
| 2022-03-08T22:28:02.177113916Z | message |
To solve this, right now I have a simple Python script that analyzes downloaded logs, but this doesn't work with the actual amount of data, only with the small slice I am able to download.
Can I calculate these numbers just by running a query, without downloading anything locally?
Try this query:
SELECT avg(c) average_count, max(c) peak_count
FROM (
    SELECT count() c
    FROM logs
    /* WHERE timestamp >= 'time1' AND timestamp < 'time2' */
    GROUP BY toStartOfMinute(timestamp)
)
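If you still want to drive this from Python without downloading any log rows, you can send the same query to the server and fetch only the two aggregated numbers. A minimal sketch using the third-party clickhouse-driver package (host, credentials, and the table/column names are assumptions):

from clickhouse_driver import Client

# Connect to the ClickHouse server; adjust host/credentials for your setup.
client = Client(host="localhost")

# The aggregation runs entirely on the server; only two numbers come back.
rows = client.execute("""
    SELECT avg(c) AS average_count, max(c) AS peak_count
    FROM (
        SELECT count() AS c
        FROM logs
        GROUP BY toStartOfMinute(timestamp)
    )
""")
average_count, peak_count = rows[0]
print(average_count, peak_count)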
Overall, I'm trying to set up an Azure alert to email me when a computer goes down, using the Heartbeat table.
Let's say I have 5 machines in my Azure subscription, and they each report once per minute to the table called Heartbeat, so it looks something like this:
Currently, I can query "Heartbeat | where Computer == 'computer-name' | where TimeGenerated > ago(5m)" and figure out when one computer has not been reporting in the last 5 minutes and is down (thank you to this great article for that query).
I am not very experienced with any query language, so I am wondering if it is possible to have one query which can check whether ANY computer stopped sending its logs over the last 5-10 minute period and thus would be down. Azure uses KQL, or Kusto Query Language, for its queries, and there is documentation in the link above.
Thanks for the help
One option is to calculate the max report time for each computer, then filter to the ones whose max report time is older than 5 minutes ago:
let all_computers_lookback = 12h
;
let monitor_lookback = 5m
;
Heartbeat
| where TimeGenerated > ago(all_computers_lookback)
| summarize last_reported = max(TimeGenerated) by Computer
| where last_reported < ago(monitor_lookback)
Another alternative:
The first part creates an "inventory" of all computers that reported at least once in the lookback window (e.g. the last 12 hours).
The second part finds all computers that reported at least once in the monitoring window (e.g. the last 5 minutes).
The third and final part finds the difference between the two (i.e. all computers that didn't report in the last 5 minutes).
Note: if you have more than 1M computers, you can use the join operator instead of the in() operator.
let all_computers_lookback = 12h
;
let monitor_lookback = 5m
;
let all_computers =
Heartbeat
| where TimeGenerated > ago(all_computers_lookback)
| distinct Computer
;
let reporting_computers =
Heartbeat
| where TimeGenerated > ago(monitor_lookback)
| distinct Computer
;
let non_reporting_computers =
all_computers
| where Computer !in(reporting_computers)
;
non_reporting_computers
If I have a web app in Azure with Application Insights configured, is there a way to tell if there was an increase in the number of requests to a given page?
I know we can get the "Delta" of performance in a given time slice, compared to the previous period, but it doesn't seem like we can do this for requests?
For example, I'd like to answer questions like: "what pages in the last hour had the highest % increase in requests, compared to the previous period?"
Does anyone know how to do this, or can it be done via the App Insights query language?
Thanks!
I'm not sure whether it can be done using the Portal; I don't think so. But I came up with the following Kusto query:
requests
| where timestamp > ago(2h) and timestamp < ago(1h)
| summarize previousPeriod = todouble(count()) by url
| join (
    requests
    | where timestamp > ago(1h)
    | summarize lastHour = todouble(count()) by url
) on url
| project url, previousPeriod, lastHour, change = ((lastHour - previousPeriod) / previousPeriod) * 100
| order by change desc
This query shows the increase/decrease in the amount of traffic per URL; you can change count() to, for example, avg(duration) to get the increase/decrease of the average duration.
I am fetching data from Kafka topics and storing it in Delta Lake (parquet) format. I wish to find the number of messages fetched on a particular day.
My thought process: I thought of reading the directory where the data is stored in parquet format using Spark and counting the files ending in ".parquet" for a particular day. This returns a count, but I am not really sure if that's the correct way.
Is this way correct? Are there any other ways to count the number of messages fetched from a Kafka topic for a particular day (or duration)?
Messages we consume from a topic not only carry a key and value but also other information, such as a timestamp, which can be used to track the consumer flow.
Timestamp
The timestamp is set by either the broker or the producer, based on the topic configuration. If the topic's configured timestamp type is CREATE_TIME, the timestamp in the producer record is used by the broker, whereas if the topic is configured with LOG_APPEND_TIME, the timestamp is overwritten by the broker with the broker's local time while appending the record.
So if you keep the timestamp when storing the messages, you can easily track the per-day or per-hour message rate.
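For example, if the Kafka record timestamp is kept as a column when the messages are written to the Delta table, a rough PySpark sketch for counting messages per day could look like this (the table path and the kafka_timestamp column name are assumptions, not part of the original setup):

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, to_date

spark = SparkSession.builder.getOrCreate()

# Read the stored messages; 'kafka_timestamp' is assumed to hold the Kafka record timestamp.
messages = spark.read.format("delta").load("/tmp/delta/messages")

# Count messages per calendar day based on that timestamp.
per_day = (
    messages
    .groupBy(to_date(col("kafka_timestamp")).alias("day"))
    .count()
    .orderBy("day")
)
per_day.show()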
Alternatively, you can use a Kafka dashboard such as Confluent Control Center (licensed) or Grafana (free), or any other tool, to track the message flow.
In our case, while consuming and storing or processing messages, we also route the message metadata to Elasticsearch and visualize it through Kibana.
You can make use of the "time travel" capabilities that Delta Lake offers.
In your case you can do
// define location of delta table
val deltaPath = "file:///tmp/delta/table"
// travel back in time to the start and end of the day using the option 'timestampAsOf'
val countStart = spark.read.format("delta").option("timestampAsOf", "2021-04-19 00:00:00").load(deltaPath).count()
val countEnd = spark.read.format("delta").option("timestampAsOf", "2021-04-19 23:59:59").load(deltaPath).count()
// print out the number of messages stored in Delta Table within one day
println(countEnd - countStart)
See documentation on Query an older snapshot of a table (time travel).
Another way to retrieve this information, without counting the rows between two versions, is to use the Delta table history. This has several advantages: you don't read the whole dataset, and you can take updates and deletes into account as well, for example if you're doing a MERGE operation (that isn't possible by comparing .count() on different versions, because an update replaces the actual value and a delete removes the row).
For example, for plain appends, the following code will count all inserted rows written by normal append operations (for other operations, like MERGE/UPDATE/DELETE, we may need to look at other metrics):
from delta.tables import *

# Take the table history, keep the entries within the day, and sum the
# 'numOutputRows' metric reported for each operation.
df = DeltaTable.forName(spark, "ml_versioning.airbnb").history()\
    .filter("timestamp > 'begin_of_day' and timestamp < 'end_of_day'")\
    .selectExpr("cast(nvl(element_at(operationMetrics, 'numOutputRows'), '0') as long) as rows")\
    .groupBy().sum()
We have reserved various numbers of RUs per second for our various collections. I'm trying to optimize this to save money. For each response from Cosmos, we're logging the request charge property to Application Insights. I have one analytics query that returns the average number of request units per second and one that returns the maximum.
let start = datetime(2019-01-24 11:00:00);
let end = datetime(2019-01-24 21:00:00);
customMetrics
| where name == 'RequestCharge' and start < timestamp and timestamp < end
| project timestamp, value, Database=tostring(customDimensions['Database']), Collection=tostring(customDimensions['Collection'])
| make-series sum(value) default=0 on timestamp in range(start, end, 1s) by Database, Collection
| mvexpand sum_value to typeof(double), timestamp limit 36000
| summarize avg(sum_value) by Database, Collection
| order by Database asc, Collection asc
let start = datetime(2019-01-24 11:00:00);
let end = datetime(2019-01-24 21:00:00);
customMetrics
| where name == 'RequestCharge' and start <= timestamp and timestamp <= end
| project timestamp, value, Database=tostring(customDimensions['Database']), Collection=tostring(customDimensions['Collection'])
| summarize sum(value) by Database, Collection, bin(timestamp, 1s)
| summarize arg_max(sum_value, *) by Database, Collection
| order by Database asc, Collection asc
The averages are fairly low, but the maxima can be unbelievably high in some cases. An extreme example is a collection with a reservation of 1,000, an average usage of 15.59, and a maximum usage of 63,341 RU/s.
My question is: how can this be? Are my queries wrong? Is throttling not working? Or does throttling only apply over a longer period of time than a single second? I have checked for request throttling on the Azure Cosmos DB overview dashboard (response code 429), and there was none.
I have to answer myself. I found two problems:
Application Insights logs an inaccurate timestamp. I added my own timestamp as a custom dimension (a sketch of that is below), and within a certain minute I get different seconds in my custom timestamp, but the built-in timestamp is one second past the minute for many of these entries. That is why I got (false) peaks in request charge.
We did have throttling. When viewing request throttling in the portal, I have to select a specific database. If I try to view request throttling for all databases, it looks like there is none.
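Going back to the first problem, the workaround is to record your own timestamp as a custom dimension when logging the request charge. A rough Python sketch using the applicationinsights package (the instrumentation key, dimension names, and charge value are placeholders; your app may use a different SDK):

from datetime import datetime, timezone
from applicationinsights import TelemetryClient

tc = TelemetryClient("<instrumentation-key>")

# Log the request charge with a client-side timestamp as a custom dimension,
# so peaks can later be bucketed by an accurate time instead of the built-in timestamp.
tc.track_metric(
    "RequestCharge",
    3.52,  # request charge returned by Cosmos DB for one operation
    properties={
        "Database": "mydb",
        "Collection": "mycoll",
        "ClientTimestamp": datetime.now(timezone.utc).isoformat(),
    },
)
tc.flush()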
I am looking to use the Apache Cassandra database to store a time series of 1-minute OHLCV financial data for ~1000 symbols. This will need to be updated in real time as data is streamed in. All entries older than 24 hours are not needed and should be discarded.
Assuming there are 1000 symbols with entries for each minute from the past 24 hrs, the total number of entries will amount to 1000*(60*24) = 1,440,000.
I am interested in designing this database to efficiently retrieve a slice of all symbols from the past [30m, 1h, 12h, 24h] with fast query times. Ultimately, I need to retrieve the OHLCV that summarises this slice. The resulting output would be {symbol, FIRST(open), MAX(high), MIN(low), LAST(close), SUM(volume)} of the slice for each symbol. This essentially summarises the 1m OHLCV entries and creates a [30m, 1h, 12h, 24h] OHLCV from the time of the query. E.g. if I want to retrieve the past 1h OHLCV at 1:32pm, the query will give me a 1h OHLCV representing data from 12:32pm to 1:32pm.
What would be a good design to meet these requirements? I am not concerned with the database's footprint on disk. The real issue is fast query times that are light on CPU and RAM.
I have come up with a simple and naive way to store each record with clustering ordered by time:
CREATE TABLE symbols (
    symbol text,
    time timestamp,
    open double,
    high double,
    low double,
    close double,
    volume double,
    PRIMARY KEY (symbol, time)
) WITH CLUSTERING ORDER BY (time DESC);
But I am not sure how to select from this to meet my requirements. I would rather design it specifically for my query, and duplicate data if necessary.
Any suggestions will be much appreciated.
While not based on Cassandra, Axibase Time Series Database can be quite relevant to this particular use case. It supports SQL with time-series syntax extensions to aggregate data into periods of arbitrary length.
An OHLCV query for a 15-minute window might look as follows:
SELECT date_format(datetime, 'yyyy-MM-dd HH:mm:ss', 'US/Eastern') AS time,
FIRST(t_open.value) AS open,
MAX(t_high.value) AS high,
MIN(t_low.value) AS low,
LAST(t_close.value) AS close,
SUM(t_volume.value) AS volume
FROM stock.open AS t_open
JOIN stock.high AS t_high
JOIN stock.low AS t_low
JOIN stock.close AS t_close
JOIN stock.volume AS t_volume
WHERE t_open.entity = 'ibm'
AND t_open.datetime >= '2018-03-29T14:32:00Z' AND t_open.datetime < '2018-03-29T15:32:00Z'
GROUP BY PERIOD(15 MINUTE, END_TIME)
ORDER BY datetime
Note the GROUP BY PERIOD clause above which does all the work behind the scenes.
Query results:
| time | open | high | low | close | volume |
|----------------------|----------|---------|----------|---------|--------|
| 2018-03-29 10:32:00 | 151.8 | 152.14 | 151.65 | 152.14 | 85188 |
| 2018-03-29 10:47:00 | 152.18 | 152.64 | 152 | 152.64 | 88065 |
| 2018-03-29 11:02:00 | 152.641 | 153.04 | 152.641 | 152.69 | 126511 |
| 2018-03-29 11:17:00 | 152.68 | 152.75 | 152.43 | 152.51 | 104068 |
You can use a Type 4 JDBC driver, API clients or just curl to run these queries.
I'm using sample 1-minute data for the above example which you can download from Kibot as described in these compression tests.
Also, ATSD supports scheduled queries to materialize minutely data into OHLCV bars of longer duration, say for long-term retention.
Disclaimer: I work for Axibase.