Azure Data Explorer High Ingestion Latency with Streaming - azure

We are using stream ingestion from Event Hubs to Azure Data Explorer. The Documentation states the following:
The streaming ingestion operation completes in under 10 seconds, and
your data is immediately available for query after completion.
I am also aware of the limitations such as
Streaming ingestion performance and capacity scales with increased VM
and cluster sizes. The number of concurrent ingestion requests is
limited to six per core. For example, for 16 core SKUs, such as D14
and L16, the maximal supported load is 96 concurrent ingestion
requests. For two core SKUs, such as D11, the maximal supported load
is 12 concurrent ingestion requests.
But we are currently experiencing ingestion latency of 5 minutes (as shown on the Azure Metrics) and see that data is actually available for querying 10 minutes after ingestion.
Our Dev Environment is the cheapest SKU Dev(No SLA)_Standard_D11_v2 but given that we only ingest ~5000 Events per day (per metric "Events Received") in this environment this latency is very high and not usable in the streaming scenario where we need to have the data available < 1 minute for queries.
Is this the latency we have to expect from the Dev Environment or are the any tweaks we can apply in order to achieve lower latency also in those environments?
How will latency behave with a production environment like Standard_D12_v2? Do we have to expect those high numbers there as well or is there a fundamental difference in behavior between Dev/test and Production Environments in this concern?

Did you follow the two steps needed to enable the streaming ingestion for the specific table, i.e. enabling streaming ingestion on the cluster and on the table?
In general, this is not expected, the Dev/Test cluster should exhibit the same behavior as the production cluster with the expected limitations around the size and scale of the operations, if you test it with a few events and see the same latency it means that something is wrong.
If you did follow these steps, and it still does not work please open a support ticket.

Related

KUDU for JDBC replication purposes, but not for Off-loaded Analytics

Given the quote from Apache KUDU official documentation, namely: https://kudu.apache.org/overview.html
Kudu isn't designed to be an OLTP system, but if you have some subset
of data which fits in memory, it offers competitive random access
performance. We've measured 99th percentile latencies of 6ms or below
using YCSB with a uniform random access workload over a billion rows.
Being able to run low-latency online workloads on the same storage as
back-end data analytics can dramatically simplify application
architecture.
Does this statement imply that KUDU can be used for replication from a JDBC source - the simplest form possible?
Elsewhere I have used KUDU for replicating to from SAP and other COTS, so that reports could run against the KUDU tables as opposed to Hana. That was an architecture decided upon by others.
For pure replication of data, primarily for subsequent extractions from a Data Lake, for data with embellished history with a size < 1TB, this is feasible as well. Cloudera confirmed this after discussion. Even though KUDU has a columnar format and a row format would be desirable, it simply works as well.

How to estimate the Graphite(whisper) database size when I'm using it in monitoring Apache Spark

I'm gonna set up monitoring Spark application via $SPARK_HOME/conf/metrics.propetries.
And decided to use Graphite.
Is there any way to estimate the database size of Graphite especially for monitoring Spark application?
Regardless of what you are monitoring, Graphite has its own configuration about retention and rollup of metrics. It stores file (called whisper) per metric and you can use the calculator to estimate how much disk space it can take https://m30m.github.io/whisper-calculator/

Frequent Compaction of OpsCenter.rollup_state on all the nodes consuming CPU cycles

I am using Datastax Cassandra 4.8.16. With cluster of 8 DC and 5 nodes on each DC on VM's. For last couple of weeks we observed below performance issue
1) Increase drop count on VM's.
2) LOCAL_QUORUM for some write operation not achieved.
3) Frequent Compaction of OpsCenter.rollup_state and system.hints are visible in Opscenter.
Appreciate any help finding the root cause for this.
Presence of dropped mutations means that cluster is heavily overloaded. It could be increase of the main load, so it + load from OpsCenter, overloaded system - you need to look into statistics about number of requests, latencies, etc. per nodes and per tables, to see where increase happened. Please also check the I/O statistics on machines (for example, with iostat) - sizes of the queues, read/write latencies, etc.
Also it's recommended to use a dedicated OpsCenter cluster to store metrics - it could be smaller size, and doesn't require an additional license for DSE. How it said in the OpsCenter's documentation:
Important: In production environments, DataStax strongly recommends storing data in a separate DataStax Enterprise cluster.
Regarding VMs - usually it's not really recommended setup, but heavily depends on what kind of underlying hardware - number of CPUs, RAM, disk system.

planning for graphite components for big cassandra cluster monitoring

I am planning to setup a 80 nodes cassandra cluster (current version 2.1 but will upgrade to 3 in future).
I have gone though http://graphite.readthedocs.io/en/latest/tools.html which has list of tools that graphite supports.
I want to decide which tools to choose as listener and storage so that it could scale.
As a listener should i use the default carbon or should i choose graphite-ng ?
However as storage component, i am confused that whether default whisper is enough? Or should I look at ohter option (like Influxdata,cynite or some rdms db (postgres/mysql))?
As gui component i have finalized to use grafana for better visulization.
I think datadog + grafana will work fine but datadog is not opensource.So Please suggest an opensource scalable up to 100 cassandra nodes alternative.
I have 35 Cassandra nodes (different clusters) monitored without any problems with graphite + carbon + whisper + grafana. But i have to tell that re-configuring collection and aggregations windows with whisper is a pain.
There's many alternatives today for this job, you can use influxdb (+ telegraf) stack for example.
Also with datadog you don't need grafana, they're also a visualizing platform. I've worked with it some time ago, but they have some misleading names for some metrics in their plugin, and some metrics were just missing. As a pros for this platform, it's really easy to install and use.
We have a cassandra cluster of 36 nodes in production right now (we had 51 but migrated the instance type since then so we need less C* servers now), monitored using a single graphite server. We are also saving data for 30 days but in a 60s resolution. We excluded the internode metrics (e.g. open connections from a to b) because of the scaling of the metric count, but keep all other. This totals to ~510k metrics, each whisper file being ~500kb in size => ~250GB. iostat tells me, that we have write peaks to ~70k writes/s. This all is done on a single AWS i3.2xlarge instance which include 1.9TB nvme instance storage and 61GB of RAM. To fully utilize the power of the this disk type we increased the number of carbon caches. The cpu usage is very low (<20%) and so is the iowait (<1%).
I guess we could get away with a less beefy machine, but this gives us a lot of headroom for growing the cluster and we are constantly adding new servers. For the monitoring: Be prepared that AWS will terminate these machines more often than others, so backup and restore are more likely a regular operation.
I hope this little insight helped you.

Whats the best way to mirror a live Cassandra cluster for analytics tasks?

Assuming a live cluster with several DCs, whats the best way to setup some nodes that are dedicated for analytic queries?
Analytic nodes will be hosted in a separate (routed) network and must not write any data back to the production nodes. They also must not be counted against for any CL. This especially applies to EACH_QUORUM that will be used for some writes. Analytics nodes may be offline at any time.
All solutions I've looked into seem to have their own drawbacks.
1) Take snapshots on production and transfer to independent analytics cluster
Significant update delay
IO intensive either on network or disk (e.g. rsync)
Lots of duplicate data due to different replication factors (3:1 prod. vs analytics)
Mismatch in SSTable row ranges and cluster topology on analytics cluster may require to use sstableloader
2) Use write survey mode to establish read-only nodes
Not 100% sure how this could be done for setting up multiple survey nodes to cover the whole ring
Queries can only be executed against each node locally as they could not be part of a coordinated execution
3) Add regular DC dedicated for analytics
EACH_QUORUM will fail in case analytics cluster is not available
Queries on production should not be served from analytics
Would require a way to prevent users on analytics to be able to execute queries or updates on production
Any other options or existing tools that could be used?

Resources