I started exploring ADX a few days back. I imported my data from Azure SQL to ADX using an ADF pipeline, but when I query that data it takes a long time. To find a workaround I researched table data partitioning, and I am now fairly clear on partition types and tricks.
The problem is, I couldn't find any sample (Kusto syntax) that guides me in defining partitioning on ADX database tables. Can anyone please help me with this syntax?
The partition operator is probably what you are looking for:
T | partition by Col1 ( top 10 by MaxValue )
T | partition by Col1 { U | where Col2=toscalar(Col1) }
ADX doesn't currently have the notion of partitioning a table, though it may be added in the future.
That said, given the lack of technical detail provided, it's somewhat challenging to understand how you reached the conclusion that partitioning your table is required and is the appropriate solution, as opposed to the many other directions ADX does allow you to pursue.
If you would be willing to detail what actions you're performing, the characteristics of your data & schema, and which parts are performing slower than expected, that may help in providing you a more meaningful and helpful answer.
[If you aren't keen on exposing that information publicly, it's OK to open a support ticket with these details (through the Azure portal).]
(Update: this functionality has been available for a while now. Read more at https://yonileibowitz.github.io/blog-posts/data-partitioning.html)
I have a requirement but am not able to figure out how to solve it. I have datasets in the below format:
id, atime, grade
123, time1, A
241, time2, B
123, time3, C
or, if I put it in list format:
[[123,time1,A],[124,timeb,C],[123,timec,C],[143,timed,D],[423,timee,P].......]
Now my use case is to perform comparisons, aggregations and queries over multiple rows, like:
time difference between the last 2 rows where id=123
time difference between the last 2 rows where id=123 and grade=A
time difference between the first, 3rd, 5th and latest rows
all data (or the last 10 records for a particular id) should be easily accessible.
I also need to do further computation on top of this. What format should I choose for the dataset,
and what database/tools should I use?
I don't think a relational database is useful here. I am not able to solve it with Solr/Elastic; if you have any ideas, please give a brief explanation. Or any other tool (Spark, Hadoop, Cassandra), any pointers?
I am trying things out, but any help is appreciated.
Choosing the right technology depends heavily on your SLA: how much latency can your queries tolerate? What are your query types? Is your data big data or not? Is the data updatable? Do you expect late events? Do you need the historical raw data in the future, or can you use techniques like rollup? To clarify my answer: you can probably solve your problems using window functions. For example, you could store your data in any of the tools you mentioned and, by using the Presto SQL engine, query it and get your desired result. But not all of them are optimal, and these kinds of problems usually cannot be solved with a single tool; a set of tools may be needed to cover all the requirements.
tl;dr: the text below doesn't arrive at a single definitive solution; it introduces a way to think about data modeling and choosing tools.
Let me try to model the problem so we can choose a single tool. I assume your data is not updatable, you need low-latency response times, we don't expect any late events, and we face a large-volume data stream that must be kept as raw data.
Based on the first and second requirements, it's crucial to have random access (it seems you want to query on a particular ID), so solutions like Parquet or ORC files are not a good choice.
Based on the last requirement, data must be partitioned by ID. The first, second and last requirements all rely on ID as the identifying part, and there seems to be nothing like a join or a global ordering based on other fields such as time. So we can choose ID as the partition key (physical or logical) and atime as the clustering part: for each ID, events are ordered by time.
The third requirement is a bit vague. Do you want results over all data, or per ID?
For computing the first three conditions, we need a tool that supports window functions.
Based on the notes above, it seems we should choose a tool with good support for random access queries. Tools like Cassandra, Postgres, Druid, MongoDB and ElasticSearch are the ones that currently come to mind. Let's check them:
Cassandra: It's great on response time for random access queries, can handle a huge amount of data easily, and has no single point of failure. But sadly it does not support window functions. You also have to design your data model very carefully, and it seems it's not a good choice here (because of the future need for raw data). We could bypass some of these limitations by using Spark alongside Cassandra, but for now we'd prefer to avoid adding another tool to the stack.
Postgres: It's great on random access queries over indexed columns, and it supports window functions. We can shard data (horizontal partitioning) across multiple servers (and by choosing ID as the shard key we get data locality for computations). But there is a problem: ID is not unique, so we cannot choose ID as the primary key and we face some problems with random access (we could choose ID and atime together as a compound primary key, but that doesn't fully solve it).
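For instance, a rough Postgres sketch of the "time difference between the last 2 rows where id=123" requirement, assuming an events(id, atime, grade) table shaped like the sample data in the question:

    -- Assumed table: events(id int, atime timestamptz, grade text)
    SELECT atime - LAG(atime) OVER (ORDER BY atime) AS diff_from_previous
    FROM events
    WHERE id = 123              -- add AND grade = 'A' for the per-grade variant
    ORDER BY atime DESC
    LIMIT 1;                    -- latest row, i.e. the gap between the last two rows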
Druid: It's a great OLAP tool. Thanks to the way Druid stores data (segment files), with the right data model you can run analytic queries on a huge volume of data in sub-seconds. It does not support window functions, but with rollup and some other functions (like EARLIEST) we can answer our questions. However, by using rollup we lose the raw data, and we need it.
MongoDB: It supports random access queries and sharding. We can also get some form of window functionality from its computing framework, and we can define pipelines for doing aggregations. It supports capped collections, which we could use to store the last 10 events for each ID if the cardinality of the ID column is not too high. It seems this tool can cover all of our requirements.
ElasticSearch: It's great on random access, maybe the greatest. With some kinds of filter aggregations we can get a type of window function, and it can handle a large amount of data with sharding. But its query language is hard. I can imagine that the first and second questions can be answered with ES, but I can't construct the query off the top of my head; it would take time to find the right solution with it.
So it seems MongoDB and ElasticSearch can meet our requirements, but there are a lot of 'if's along the way. I don't think we can find a straightforward solution with a single tool. Maybe we should choose multiple tools and use techniques like duplicating data to find an optimal solution.
We have a dimensional model with fact tables of 100-300 GB each, stored in Parquet. We build PBI reports on top of Azure Synapse (DirectQuery) and experience performance issues when slicing/dicing and especially when calculating multiple KPIs. At the same time, the data volume is quite expensive to keep in Azure Analysis Services. Because of the number of dimensions, the fact table can't be aggregated significantly, so PBI import mode or a composite model isn't an option either.
Azure Synapse Analytics facilitates OLAP operations such as GROUP BY ROLLUP/CUBE/GROUPING SETS.
How can I benefit from Synapse's OLAP operations support?
Is it possible to pre-calculate OLAP cubes inside Synapse in order to boost PBI report performance? How?
If the answer is yes, is it recommended to pre-calculate KPIs? That would mean moving KPI definitions to the DWH OLAP cube level: is that an anti-pattern?
P.S. Using separate aggregations for each PBI visualisation is not an option; it's more of an exception than the rule. Synapse is clever enough to take advantage of a materialized view aggregation even when querying the base table, but that way you can't implement RLS, and managing that number of materialized views also looks cumbersome.
Update for @NickW
Could you please answer the following sub-questions:
Have I got it right that OLAP operations support is mainly for downstream cube providers, not for warehouse performance?
Is filling the warehouse with materialized views in order to boost performance considered common practice or an anti-pattern? I've found (see the link) that Power BI can create materialized views automatically based on query patterns. Still, I'm afraid that won't provide a stable, testable solution, and there is the RLS issue again.
Is KPI pre-calculation on the warehouse side considered a common approach or an anti-pattern? As I understand it, this is usually done on the cube provider side, but what if I don't have one?
Do you see any other options to boost performance? I can only think of reducing query parallelism by using a PBI composite model and importing all dimensions into PBI. Not sure if it'd help.
Synapse Result Set Caching and Materialized Views can both help.
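For example, result set caching is switched on at the database level; the pool name below is a placeholder, and the ALTER DATABASE statement has to be run against the master database:

    -- Run against the master database of the dedicated SQL pool
    ALTER DATABASE [MySqlPool] SET RESULT_SET_CACHING ON;

    -- It can also be toggled per session
    SET RESULT_SET_CACHING ON;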
In the future the creation and maintenance of Materialized Views will be automated.
Azure Synapse will automatically create and manage materialized views for larger Power BI Premium datasets in DirectQuery mode. The materialized views will be based on usage and query patterns. They will be automatically maintained as a self-learning, self-optimizing system. Power BI queries to Azure Synapse in DirectQuery mode will automatically use the materialized views. This feature will provide enhanced performance and user concurrency.
https://learn.microsoft.com/en-us/power-platform-release-plan/2020wave2/power-bi/synapse-integration
Power BI Aggregations can also help. If there are a lot of dimensions, select the most commonly used ones to create aggregations.
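If you go down that route, the backing aggregate table on the Synapse side could be a simple CTAS over the most commonly used dimensions; all table and column names below are invented for illustration:

    CREATE TABLE dbo.FactSales_Agg
    WITH
    (
        DISTRIBUTION = HASH(DateKey),
        CLUSTERED COLUMNSTORE INDEX
    )
    AS
    SELECT DateKey,
           ProductKey,
           COUNT_BIG(*)     AS SalesRowCount,
           SUM(SalesAmount) AS SalesAmount
    FROM dbo.FactSales
    GROUP BY DateKey, ProductKey;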
to hopefully answer some of your questions...
You can't pre-calculate OLAP cubes in Synapse; the closest you could get is creating aggregate tables and you've stated that this is not a viable solution
OLAP operations can be used in queries but don't "pre-build" anything that can be used by other queries (ignoring CTEs, sub-queries, etc.). So if you have existing queries that don't use these functions then re-writing them to use these functions might improve performance - but only for each specific query
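As a sketch of that last point, a query that currently UNIONs several GROUP BY levels together could be rewritten with ROLLUP; the table and column names are invented for the example:

    SELECT d.CalendarYear,
           d.CalendarMonth,
           SUM(f.SalesAmount) AS SalesAmount
    FROM dbo.FactSales AS f
    JOIN dbo.DimDate   AS d ON f.DateKey = d.DateKey
    GROUP BY ROLLUP (d.CalendarYear, d.CalendarMonth);
    -- one pass returns month rows, year subtotals and a grand total,
    -- but only this query benefits; nothing is persisted for other queries to reuse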
I realise that your question was about OLAP but the underlying issue is obviously performance. Given that OLAP is unlikely to be a solution to your performance issues, I'd be happy to talk about performance tuning if you want?
Update 1 - Answers to additional numbered questions
I'm not entirely sure I understand the question so this may not be an answer: the OLAP functions are there so that it is possible to write queries that use them. There can be an infinite number of reasons why people might need to write queries that use these functions
Performance is the main (only?) reason for creating materialised views. They are very effective for creating datasets that will be used frequently, e.g. when base data is at day level but lots of reports are aggregated at week/month level. As stated by another user in the comments, Synapse can manage this process automatically, but whether it can actually create aggregates that are useful for a significant proportion of your queries is obviously entirely dependent on your particular circumstances.
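As a minimal sketch of that day-to-month pattern, assuming the fact table carries a MonthKey column (all names are assumptions, not from the question):

    CREATE MATERIALIZED VIEW dbo.mvSalesByMonth
    WITH (DISTRIBUTION = HASH(MonthKey))
    AS
    SELECT f.MonthKey,
           COUNT_BIG(*)       AS RowCnt,        -- COUNT_BIG(*) is required in an aggregating MV
           SUM(f.SalesAmount) AS SalesAmount
    FROM dbo.FactSales AS f
    GROUP BY f.MonthKey;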
KPI pre-calculation. In a DW, any measures that can be calculated in advance should be (by your ETL/ELT process). For example, if you have reports that use Net Sales Amount (Gross Sales - Tax) and your source system is only providing Gross Sales and Tax amounts, then you should be calculating Net Sales as a measure when loading your fact table. Obviously there are KPIs that can't be calculated in advance (i.e. probably anything involving averages) and these need to be defined in your BI tool
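A hedged sketch of that Net Sales example as part of the load, with invented staging and fact table names:

    INSERT INTO dbo.FactSales (OrderKey, GrossSalesAmount, TaxAmount, NetSalesAmount)
    SELECT s.OrderKey,
           s.GrossSalesAmount,
           s.TaxAmount,
           s.GrossSalesAmount - s.TaxAmount AS NetSalesAmount   -- derived once, at load time
    FROM stg.Sales AS s;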
Boosting Performance: I'll cover this in the next section as it is a longer topic
Boosting Performance
Performance tuning is a massive subject - some areas are generic and some will be specific to your infrastructure; this is not going to be a comprehensive review but will highlight a few areas you might need to consider.
Bear in mind a couple of things:
There is always an absolute limit on performance - based on your infrastructure - so even in a perfectly tuned system there is always going to be a limit that may not be what you hoped to achieve. However, with modern cloud infrastructure the chances of you hitting this limit are very low
Performance costs money. If all you can afford is a Mini then regardless of how well you tune it, it is never going to be as fast as a Ferrari
Given these caveats, a few things you can look at:
Query plan. Have a look at how your queries are executing and whether there are any obvious bottlenecks you can then focus on. This link gives some further information: Monitor SQL Workloads
Scale up your Synapse SQL pool. If you throw more resources at your queries they will run quicker. Obviously this is a bit of a "blunt instrument" approach but worth trying once other tuning activities have been tried. If this does turn out to give you acceptable performance you'd need to decide if it is worth the additional cost. Scale Compute
Ensure your statistics are up to date (a short T-SQL sketch covering this and the next two points follows this list)
Check whether the distribution mechanism (Round Robin, Hash) you've used for each table is still appropriate and, on a related topic, check the skew on each table
Indexing. Adding appropriate indexes will speed up your queries, though they also have a storage implication and will slow down data loads. This article is a reasonable starting point when looking at your indexing: Synapse Table Indexing
Materialised Views. Covered previously but worth investigating. I think the automatic management of MVs may not be out yet (or is only in public preview) but may be something to consider down the line
Data Model. If you have some fairly generic facts and dimensions that support a lot of queries then you might need to look at creating additional facts/dimensions just to support specific reports. I would always (if possible) derive them from existing facts/dimensions but you can create new tables by dropping unused SKs from facts, reducing data volumes, sub-setting the columns in tables, combining tables, etc.
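To make the statistics, distribution/skew and indexing points a little more concrete, here is a minimal T-SQL sketch for a Synapse dedicated SQL pool; dbo.FactSales and CustomerKey are made-up names, not from the original question:

    -- Refresh statistics on a heavily queried table
    UPDATE STATISTICS dbo.FactSales;

    -- Check data skew across the 60 distributions for that table
    DBCC PDW_SHOWSPACEUSED("dbo.FactSales");

    -- Re-create the table with a (possibly) better distribution and index choice
    CREATE TABLE dbo.FactSales_new
    WITH
    (
        DISTRIBUTION = HASH(CustomerKey),   -- a high-cardinality, frequently joined column
        CLUSTERED COLUMNSTORE INDEX
    )
    AS
    SELECT * FROM dbo.FactSales;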
Hopefully this gives you at least a starting point for investigating your performance issues.
We are using Cassandra 3 and came up with a model based on the initial requirements. Since there have been very frequent requirement changes, this model has subsequently changed many times as well. Given these requirement and model changes, there has been no major improvement in terms of development. The team has decided to go with the BLOB data type and store the entire data in a BLOB. Can you please share the drawbacks of using a BLOB in such a scenario? Thanks in advance.
We migrated from Astyanax Cassandra 1.1 directly to CQL Cassandra 3.0, so we still have a lot of column families whose values are BLOBs.
Major issues we face right now are:
1) It's difficult to visualize data directly from the database: the biggest advantage of CQL is that it supports SQL-like queries, so logging into the cql terminal and reading results directly from there normally saves a lot of time. If you use a BLOB you can't do any of that.
2) CQL performs better when your table has a well-defined schema, instead of using a blob to store a big chunk of data together.
If you are creating a new table, I would suggest using collections for your use case. You will be able to store different types of data and performance will also be good.
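For example, a minimal CQL sketch (all names are made up, not from the original post) of a table that keeps the flexible part of each record in a collection instead of a single opaque blob:

    CREATE TABLE user_events (
        user_id    uuid,
        event_time timestamp,
        attributes map<text, text>,   -- semi-structured fields stay visible and queryable in cqlsh
        PRIMARY KEY (user_id, event_time)
    ) WITH CLUSTERING ORDER BY (event_time DESC);

Individual map entries can then be read or updated without rewriting the whole value, which a blob column can't offer.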
Nice slides comparing the performance of schemaless tables against tables with a schema and collections. You can skip to slide 26 if you just want the summary.
https://www.slideshare.net/DataStax/migration-from-thrift-to-cql-brij-bhushan-ravat-ericsson-cassandra-summit-2016
I am working with Azure Table services. What I would like to ask is whether the Azure internals care about the notion of a Table at all?
Making things fast largely depends on the partition key and row key. Tables do not look like containers or groupings of entities, because there is no limit on the number of tables you can create and the total storage size is tied to my storage account.
So is a Table just a notion to help people transition from RDBMS land, or does it serve a purpose internally? Can I build applications on a single-table design without worrying about performance? After all, if a table is just a tag, then I might as well include it as part of the partition key.
EDIT
To give an example, Table partition keys look very much like Cassandra rows, and Table rows are like Cassandra columns. It is okay to treat the storage as a big bucket of key (RowKey)-value pairs. Partition keys are a sharding mechanism. The table then just comes across as a "labeling" notion.
You can have all your entities in a single Azure table and get optimum performance by thoughtfully choosing the right partition key and row key combination, but IMHO it is always better to keep related entities in separate tables, as that is easier to manage from a developer perspective: you know which part of your application is hitting which table.
You may want to view this recorded session which discusses best practices and internals.
Windows Azure Storage: What’s Coming, Best Practices, and Internals
You can think of the combination of a table name and a partition key as the unit of performance.
You're probably on the train of thought that will re-invent FatEntities, developed by Lokad.
The concept of a "Table" has many restrictions and issues. For example, if you have a large table with 100,000 entries in a partition, you can't easily jump to entry 99,001 without iterating through each of the entries. Going backwards is impossible (you can't start at the last entry and work backwards).
I've noticed that querying a WAD table, e.g. WADLogs, is very, very slow. It can take up to 5 minutes to return 10 records.
Yes, the WAD tables are very large in our scenario. Still, I wasn't expecting it to be this slow; it takes ages to troubleshoot production issues.
Questions I have:
Could anyone please share the best way to manage the WAD tables so that queries are faster?
Is there any way to optimize the WAD tables?
Are there best practices for what should and should not be done when logging to WAD?
Are there any best practices on purging/backing up, etc.?
Thank you.
Gaurav Mantri has a post explaining how to query WAD tables in a performant manner. The bottom line is that you need to query on PartitionKey and RowKey to avoid a performance-killing table scan. The PartitionKey for the WAD tables contains the TickCount in a slightly encoded form and an appropriately constructed value can be used for range queries.
Thanks Neil for the link.
Summary:
Use the PartitionKey attribute, which is indexed by Table Storage.
Where,
PartionKey = "0"+ DateTime.UtcNow.AddDays(-1.0).Ticks
Usage for REST API Query ($filter) criteria:
PartitionKey ge '0634012319404982640'