Does Row-Level Security Always Decrease Performance in Power BI?

The "Row-level security (RLS) guidance in Power BI Desktop" article mentions the potential negative performance impacts in reports. Is there only the potential for a negative performance impact, though? Could there also be the potential for a positive performance impact?
For example, if there are 10,000 rows in a table, but a user only has access to 1,000, would Power BI run all of its queries against only those 1,000 rows? Or does it run the queries against all 10,000 rows with additional filters that decrease performance?
The Avoid using RLS section in the article talks about splitting up the model and using different workspaces instead of RLS. Here it states:
There are several advantages associated with avoiding RLS:
Improved query performance: It can result in improved performance due to fewer filters.
This makes me think that all 10,000 rows in my hypothetical example would be evaluated and thus performance would not improve, but I wanted to reach out and verify this.

RLS will impact query performance. In a typical RLS setup, the engine has to take the user information, filter the reference table, then apply that filter to the main data table and return the results.
In your example, it will have to go through all 10,000 rows, find the rows that match the filter criteria, and then return the results.
The tabular engine is optimised to some degree (seeking data via indexes rather than scanning the whole data table), but it will have some level of overhead to filter datasets affected by RLS, rather than just returning the whole dataset straight away. In your example of 10K rows, that overhead will be in the range of milliseconds, but you can use tools such as DAX Studio or Tabular Editor, or the built-in tools of Power BI Desktop, to see the effect of RLS on performance.
RLS can go through tens of millions of rows of data quite quickly, but final report performance generally depends on the cardinality of the dataset (which helps indexing of the data), the number of relationships it has to traverse, the complexity of the DAX used in the visuals, the amount of data returned to the visuals, and the number of visuals on the report page itself.

Related

Modeling an Analysis Services tabular fact table with many columns

I have a tabular Analysis Services fact table (an invoice fact table) with more than 150 attributes (degenerate dimensions) and measures.
I would like to improve the user experience when browsing the table on Azure Analysis Services.
Is it a good idea to split the table into 3 tables, each containing a set of columns and measures (the number of rows remains the same in all 3 tables)?
I don't change a model for browsing's sake. Browsing is a reporting issue. Modeling is modeling. Build a good dimensional model in a star schema. Big wide tables are bad.
I would look for anything that ought to be a dimension and pull it out, build some junk dimensions, and for sure I would move the measures to a DAX table. But if whatever is left is ugly, then I would just leave it and build a report for browsing.

Can sort() and cache() combined in Spark increase filter speed like creating an index column in SQL?

We know that in SQL an index can be created on a column if it is frequently used for filtering. Is there anything similar I can do in Spark? Let's say I have a big table T containing a column C I want to filter on. I want to filter tens of thousands of id sets on the column C. Can I sort/orderBy column C, cache the result, and then filter all the id sets against the sorted table? Will it help like indexing in SQL?
You should absolutely build the table/dataset/dataframe sorted on the id if you will query on it often. It will help predicate pushdown and in general give a boost in performance.
When executing queries in the most generic and basic manner, filtering happens very late in the process. Moving filtering to an earlier phase of query execution provides significant performance gains by eliminating non-matches earlier, and therefore saving the cost of processing them at a later stage. This group of optimizations is collectively known as predicate pushdown.
Even if you aren't sorting the data, you may want to look at storing it with 'distribute by' or 'cluster by'. These are very similar to repartition() and sortWithinPartitions(), and again they only boost performance if you query the data along the same lines you distributed it.
If you intend to re-query the data often then yes, you should cache it, but in general Spark has no indexes. (There are file formats that boost performance for specific query patterns, e.g. row-based vs. columnar.)
You should also look at the Spark-specific performance tuning options. Adaptive Query Execution is a newer feature that helps boost performance (without indexes).
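To make the sort-then-filter idea concrete, here is a minimal PySpark sketch; the paths and column names are illustrative assumptions, and the benefit comes from each output file covering a narrow range of C values so the filter can be pushed down to the scan:
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("sorted-filter-demo").getOrCreate()

df = spark.read.parquet("/data/T_raw")

# Repartition on C (like DISTRIBUTE BY) and sort within each partition
# (together like CLUSTER BY), so each output file holds a narrow range of C.
(df.repartition("C")
   .sortWithinPartitions("C")
   .write.mode("overwrite")
   .parquet("/data/T_by_C"))

sorted_df = spark.read.parquet("/data/T_by_C")

ids = [101, 202, 303]  # one of the many id sets to filter on
subset = sorted_df.filter(sorted_df["C"].isin(ids))
subset.explain()  # the Parquet scan should list the C values under PushedFilters

# If the same sorted table is filtered thousands of times, caching it keeps
# the data in memory between queries (as the question suggests).
sorted_df.cache()
sorted_df.count()  # materialise the cache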
If you are working with Hive (note that Hive has its own version of partitions):
Depending on how you will query the data, you may also want to look at partitioning:
[hive] Partitioning is mainly helpful when we need to filter our data based on specific column values. When we partition tables, subdirectories are created under the table’s data directory for each unique value of a partition column. Therefore, when we filter the data based on a specific column, Hive does not need to scan the whole table; it rather goes to the appropriate partition which improves the performance of the query. Similarly, if the table is partitioned on multiple columns, nested subdirectories are created based on the order of partition columns provided in our table definition.
Hive partitioning is not a magic bullet and will slow down querying if the pattern of accessing the data is different from the partitioning. It makes a lot of sense to partition by month if you write a lot of queries looking at monthly totals. If, on the other hand, the same table is used to look at sales of product 'x' from the beginning of time, it would actually run slower than if the table wasn't partitioned. (It's a tool in your tool shed.)
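As a small illustration of the monthly-totals example, here is a PySpark sketch that writes a table partitioned by month; the table, path, and column names are assumptions, not anything from the question:
from pyspark.sql import SparkSession, functions as F

spark = (SparkSession.builder.appName("partitioning-demo")
         .enableHiveSupport().getOrCreate())

sales = spark.read.parquet("/data/sales_raw")

# One subdirectory per month value, e.g. .../sales_partitioned/month=2023-01/
(sales.withColumn("month", F.date_format("sale_date", "yyyy-MM"))
      .write.mode("overwrite")
      .partitionBy("month")
      .saveAsTable("sales_partitioned"))

# A filter on the partition column only reads the matching subdirectory...
monthly = spark.table("sales_partitioned").filter("month = '2023-01'")

# ...whereas a from-the-beginning-of-time query on product 'x' still touches
# every partition, which is the slow case described above.
product_x = spark.table("sales_partitioned").filter("product = 'x'")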
Another Hive-specific tip:
The other thing you want to think about is keeping your table stats up to date. The cost-based optimizer uses those statistics to plan your queries, so you should make sure to keep them current (re-run after ~30% of your data has changed):
ANALYZE TABLE [db_name.]tablename [PARTITION(partcol1[=val1], partcol2[=val2], ...)]
  COMPUTE STATISTICS
  [FOR COLUMNS]      -- (Note: Hive 0.10.0 and later.)
  [CACHE METADATA]   -- (Note: Hive 2.1.0 and later.)
  [NOSCAN];
-- (Note: Fully support qualified table name since Hive 1.2.0, see HIVE-10007.)
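If you drive this from Spark rather than from Hive directly, Spark SQL has a similar ANALYZE TABLE command; a small sketch (the table, partition value, and column names are assumptions carried over from the partitioning example above):
from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

# Refresh table-level and column-level statistics so the cost-based
# optimizer has up-to-date numbers to plan with.
spark.sql("ANALYZE TABLE sales_partitioned PARTITION (month='2023-01') COMPUTE STATISTICS")
spark.sql("ANALYZE TABLE sales_partitioned COMPUTE STATISTICS FOR COLUMNS product, amount")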

Cassandra secondary index vs another table

I have a racesByID table. I also need to find the races by year. What are the pros and cons of using a secondary index on year column over creating a racesByYear table?
It depends on how many races you have per year.
Secondary indexes don't perform well if you have very low or very high cardinality; you will have performance issues in those cases. For details, check this link: https://docs.datastax.com/en/cql/3.3/cql/cql_using/useWhenIndex.html#useWhenIndex__when-no-index
In most cases a separate table will perform better; the downside is that you have to manage consistency and keep the tables in sync yourself.
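A minimal sketch of the separate-table approach using the Python driver (the contact point, keyspace, and column layout are illustrative assumptions, not taken from the question):
import uuid
from cassandra.cluster import Cluster

cluster = Cluster(["127.0.0.1"])
session = cluster.connect("racing")

# Same race data as racesByID, but partitioned by year, so "all races in a
# year" is a single-partition read rather than a secondary-index scan.
session.execute("""
    CREATE TABLE IF NOT EXISTS races_by_year (
        year    int,
        race_id uuid,
        name    text,
        PRIMARY KEY ((year), race_id)
    )
""")

# The application writes to both tables (optionally in a logged BATCH) to
# keep them consistent -- this is the "keep in sync" cost mentioned above.
insert_by_year = session.prepare(
    "INSERT INTO races_by_year (year, race_id, name) VALUES (?, ?, ?)")
session.execute(insert_by_year, (2023, uuid.uuid4(), "Monaco Grand Prix"))

for row in session.execute("SELECT * FROM races_by_year WHERE year = 2023"):
    print(row.race_id, row.name)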
I hope this helps.

Determining optimal Write Batch Size in Azure Data Factory's copy task

I was looking into optimizing some of the pipelines we have and, as part of that, also looked at the write batch size option of the copy activity. I understand how it basically works and its importance. Quoting MSDN:
Copy Activity inserts data in a series of batches. You can set the number of rows in a batch by using the writeBatchSize property. If your data has small rows, you can set the writeBatchSize property with a higher value to benefit from lower batch overhead and higher throughput. If the row size of your data is large, be careful when you increase writeBatchSize. A high value might lead to a copy failure caused by overloading the database.
But what is the correct way to determine the right value? I have prepared an Excel sheet with the table names, their sizes (in MB), and the record counts in each. What should my next steps be, considering I don't have time for extensive trial and error? Any pointers would be highly appreciated! Thanks.
If it matters, I am on P4 Azure SQL DB and pulling data from Oracle ERP tables.
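In the absence of a documented formula, one rough way to turn that spreadsheet into starting values is to derive an average row size per table and aim for a fixed number of megabytes per batch; the target of 4 MB and the clamping bounds below are assumptions to be tuned against real copy runs, not anything from the Data Factory documentation:
# Rough heuristic for an initial writeBatchSize, based on the table-size
# spreadsheet described above. TARGET_BATCH_MB is an assumed tuning knob.
TARGET_BATCH_MB = 4

def suggest_write_batch_size(table_size_mb: float, row_count: int,
                             min_rows: int = 1_000, max_rows: int = 100_000) -> int:
    """Suggest a writeBatchSize so each batch is roughly TARGET_BATCH_MB."""
    avg_row_bytes = (table_size_mb * 1024 * 1024) / row_count
    rows_per_batch = int((TARGET_BATCH_MB * 1024 * 1024) / avg_row_bytes)
    return max(min_rows, min(rows_per_batch, max_rows))

# Example: a 250 MB table with 2 million rows is ~131 bytes/row,
# which suggests about 32,000 rows per batch.
print(suggest_write_batch_size(250, 2_000_000))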

Cassandra Performance : Less rows with more columns vs more rows with less columns

We are evaluating whether we can migrate from SQL Server to Cassandra for OLAP. As per the internal storage structure, we can have wide rows. We almost always need to access data by date, and we often need to access data within a date range, as we have financial data. If we use the date as the partition key to support filtering by date, we end up having fewer rows with a huge number of columns.
Will it hamper performance if we have millions of columns for a single row key in the future, as we process millions of transactions every day?
Do we need to change the access pattern so that we have more rows with fewer columns per row?
We need some performance insight to proceed in either direction.
Using wide rows is typically fine with Cassandra; there are, however, a few things to consider:
Ensure that you don't reach the 2 billion column (cells per partition) limit in any case.
The whole wide row is stored on the same node: it needs to fit on the disk. Also, if you have some dates that are accessed more frequently than other dates (e.g. today), then you can create hotspots on the node that stores the data for that day.
Very wide rows can affect performance, however; Aaron Morton from The Last Pickle has an interesting article about this: http://thelastpickle.com/blog/2011/07/04/Cassandra-Query-Plans.html
It is somewhat old, but I believe that the concepts are still valid.
For a good table design decision, one needs to know all the typical filter conditions. If you have any other fields you typically filter on as an exact match, you could add them to the partition key as well.
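One common pattern for the hotspot and partition-size concerns above (not mentioned in the answer itself, but widely used) is to add a bucket component next to the date in the partition key; a minimal sketch with the Python driver, where the keyspace, table, bucket count, and columns are illustrative assumptions:
import datetime
from cassandra.cluster import Cluster

BUCKETS = 16  # one day's transactions are spread over 16 partitions

cluster = Cluster(["127.0.0.1"])
session = cluster.connect("finance")

session.execute("""
    CREATE TABLE IF NOT EXISTS transactions_by_day (
        day    date,
        bucket int,
        txn_id timeuuid,
        amount decimal,
        PRIMARY KEY ((day, bucket), txn_id)
    )
""")

# Reading a whole day means querying every bucket for that day; in exchange,
# each partition stays bounded in size and no single node owns a "hot" day.
stmt = session.prepare(
    "SELECT * FROM transactions_by_day WHERE day = ? AND bucket = ?")
day = datetime.date(2024, 1, 15)
rows = [r for b in range(BUCKETS) for r in session.execute(stmt, (day, b))]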