I am able to calculate these metrics for my Teradata queries from the dbc.DBQLOGTBL table: SPOOLUSAGE, CPUIMPACT, IOIMPACT, CPUSKEW, TOTALCPUTIME, TOTALIOCOUNT, PJI, UII.
If I want to identify the bad queries based on these metrics, which ones are the most important to look at, and how do I work out the optimum range of those metrics for my queries?
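There is no universal optimum range for these metrics; most sites derive cut-offs by profiling their own workload, for example by flagging the worst few percent of queries by PJI, UII, skew ratio and spool. As a starting point, a sketch like the one below can rank candidates. The PJI/UII/skew formulas are common rules of thumb rather than official definitions, and the column names (AMPCPUTime, TotalIOCount, MaxAMPCPUTime, NumOfActiveAMPs, SpoolUsage, StartTime) should be verified against your DBQL release:

SELECT  QueryID,
        SpoolUsage,
        AMPCPUTime AS TotalCPUTime,
        TotalIOCount,
        /* Product Join Indicator: CPU ms per I/O; high values suggest product joins */
        (AMPCPUTime * 1000) / NULLIFZERO(TotalIOCount) AS PJI,
        /* Unnecessary I/O Indicator: I/O per CPU ms; high values suggest large scans */
        TotalIOCount / NULLIFZERO(AMPCPUTime * 1000) AS UII,
        /* Impact CPU: cost if every AMP worked as hard as the busiest one */
        MaxAMPCPUTime * NumOfActiveAMPs AS CPUImpact,
        /* Skew ratio: busiest AMP vs. average AMP (1.0 = perfectly even) */
        MaxAMPCPUTime / NULLIFZERO(AMPCPUTime / NULLIFZERO(NumOfActiveAMPs)) AS CPUSkewRatio
FROM    dbc.DBQLogTbl
WHERE   CAST(StartTime AS DATE) = CURRENT_DATE - 1
ORDER BY PJI DESC, CPUSkewRatio DESC;

Queries that sit at the top of this ranking across several days are usually the ones worth tuning first.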
We know that in SQL an index can be created on a column if it is frequently used for filtering. Is there anything similar I can do in Spark? Let's say I have a big table T containing a column C I want to filter on. I want to filter tens of thousands of id sets on the column C. Can I sort/orderBy column C, cache the result, and then filter all the id sets with the sorted table? Will it help like indexing in SQL?
You should absolutely build the table/dataset/dataframe with a sorted id if you will query on it often. It will help predicate pushdown and, in general, give a boost in performance.
When executing queries in the most generic and basic manner, filtering happens very late in the process. Moving filtering to an earlier phase of query execution provides significant performance gains by eliminating non-matches earlier, and therefore saving the cost of processing them at a later stage. This group of optimizations is collectively known as predicate pushdown.
Even if you aren't sorting the data, you may want to look at storing it with 'distribute by' or 'cluster by'. It is very similar to repartitioning the DataFrame, and again it only boosts performance if you query the data along the same lines you distributed it; see the sketch below.
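A rough Spark SQL sketch of that idea (T and C are stand-ins for your table and filter column):

-- CLUSTER BY C = DISTRIBUTE BY C + SORT BY C: rows with the same C land in the
-- same files, so Parquet min/max statistics let Spark skip files when filtering on C
CREATE TABLE t_by_c
USING parquet
AS SELECT * FROM T CLUSTER BY C;

-- later queries that filter on C can prune most of the files
SELECT * FROM t_by_c WHERE C IN (101, 202, 303);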
If you intend to re-query the data often then yes, you should cache it, but in general there are no indexes. (There are file formats that boost performance if you have specific query needs, e.g. row-based vs. columnar.)
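A minimal sketch of caching the sorted data once and reusing it for the many id-set filters (names are placeholders):

-- materialise the sorted data in memory once, then run the id-set filters against it
CACHE TABLE t_cached AS SELECT * FROM T SORT BY C;

SELECT * FROM t_cached WHERE C IN (101, 202, 303);
SELECT * FROM t_cached WHERE C IN (404, 505);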
You should also look at the Spark-specific performance tuning options. Adaptive Query Execution is a newer feature that helps boost performance (without indexes).
If you are working with Hive (note that it has its own version of partitions): depending on how you will query the data, you may also want to look at partitioning:
Partitioning is mainly helpful when we need to filter our data based on specific column values. When we partition tables, subdirectories are created under the table's data directory for each unique value of a partition column. Therefore, when we filter the data based on a specific column, Hive does not need to scan the whole table; it rather goes to the appropriate partition, which improves the performance of the query. Similarly, if the table is partitioned on multiple columns, nested subdirectories are created based on the order of partition columns provided in our table definition.
Hive partitioning is not a magic bullet and will slow down querying if the pattern of accessing the data differs from the partitioning. It makes a lot of sense to partition by month if you write a lot of queries looking at monthly totals. If, on the other hand, the same table were used to look at sales of product 'x' from the beginning of time, it would actually run slower than if the table weren't partitioned. (It's one tool in your tool shed.)
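A hypothetical illustration of both points (table and column names are made up):

-- table partitioned by month: one subdirectory per sale_month value
CREATE TABLE sales (
  product_id INT,
  amount     DECIMAL(10,2),
  sale_ts    TIMESTAMP
)
PARTITIONED BY (sale_month STRING);

-- filters on the partition column: Hive reads only the matching subdirectory
SELECT SUM(amount) FROM sales WHERE sale_month = '2024-01';

-- ignores the partition column: every partition still has to be scanned
SELECT SUM(amount) FROM sales WHERE product_id = 42;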
Another Hive-specific tip: the other thing you want to think about is keeping your table statistics up to date. The cost-based optimizer uses those statistics to plan your queries, so make sure to refresh them (re-run after roughly 30% of your data has changed):
ANALYZE TABLE [db_name.]tablename [PARTITION(partcol1[=val1], partcol2[=val2], ...)] -- (Note: fully qualified table names are supported since Hive 1.2.0, see HIVE-10007.)
COMPUTE STATISTICS
[FOR COLUMNS] -- (Note: Hive 0.10.0 and later.)
[CACHE METADATA] -- (Note: Hive 2.1.0 and later.)
[NOSCAN];
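For example, against the hypothetical sales table above (partition value made up):

-- table-level statistics for the January partition
ANALYZE TABLE sales PARTITION(sale_month='2024-01') COMPUTE STATISTICS;

-- column-level statistics (Hive 0.10.0+), used for join ordering and selectivity estimates
ANALYZE TABLE sales PARTITION(sale_month='2024-01') COMPUTE STATISTICS FOR COLUMNS;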
The Row-level security (RLS) guidance in Power BI Desktop article mentions the potential negative performance impacts in reports. Is there only the potential for a negative performance impact, though? Could there also be the potential for a positive performance impact?
For example, if there are 10,000 rows in a table but a user only has access to 1,000, would Power BI run all of its queries against only those 1,000 rows? Or does it run the queries against the 10,000 rows with additional filters, which would decrease performance?
The Avoid using RLS section in the article talks about splitting up the model and using different workspaces instead of RLS. Here it states:
There are several advantages associated with avoiding RLS:
Improved query performance: It can result in improved performance due to fewer filters.
This makes me think that all 10,000 rows in my hypothetical example would be evaluated and thus performance would not improve, but I wanted to reach out and verify this.
RLS will impact query performance: in a typical RLS set-up it has to take the user information, filter the reference table, then filter the main data table and return the results.
In your example it will have to go through all 10,000 rows, keep the rows that match the filter criteria, and return those.
The tabular engine will be optimised to some degree (seeking data via indexes rather than scanning the whole data table), but it will have some level of overhead to filter datasets affected by RLS, rather than just returning the whole table straight away. In your example of 10K rows the difference will be in the range of milliseconds, but you can use tools such as DAX Studio, Tabular Editor, or the built-in tools of Power BI Desktop to see the effect of RLS on performance.
RLS can go through tens of millions of rows of data quite quickly, but final report performance generally depends on the cardinality of the dataset (which helps indexing of the data), the number of relationships it has to traverse, the complexity of the DAX used in the visuals, the amount of data returned to the visuals, and the number of visuals on the report page itself.
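To make the mechanics concrete, here is a rough SQL analogy of what a typical RLS rule effectively does. This is only an analogy: the tabular engine evaluates the filter against its in-memory model rather than running SQL, and Sales/UserRegion are hypothetical tables:

-- the role filter on the reference table propagates to the fact table,
-- so the 10,000-row table is filtered at query time rather than pre-reduced
SELECT s.*
FROM   Sales s
JOIN   UserRegion u
  ON   u.Region = s.Region
WHERE  u.UserName = 'user@contoso.com';   -- injected per report user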
Setting: Delta Lake, Databricks SQL compute used by Power BI.
I am wondering about the following scenario: we have a column timestamp and a derived column date (the date of the timestamp), and we choose to partition by date. When we query, we use timestamp in the filter, not date.
My understanding is that Databricks a priori won't connect the timestamp and the date, and so seemingly won't get any advantage from the partitioning. But since the files are in fact partitioned by timestamp (implicitly), when Databricks looks at the min/max timestamps of all the files, it will find that it can skip most files after all. So it seems like we can get quite a benefit from partitioning even if it's on a column we don't explicitly use in the query.
Is this correct?
What is the performance cost (roughly) of having to filter away files in this way vs. using the partitioning directly?
Will Databricks have all the min/max information in memory, or does it have to go out and look at the files for each query?
Yes, Databricks will take implicit advantage of this partitioning through data skipping, because there are min/max statistics associated with each data file. The min/max information is loaded into memory from the transaction log, but Databricks still needs to decide which files to hit on every query. Because everything is in memory, this shouldn't be a big performance overhead until you have hundreds of thousands of files.
One thing that you may consider is using a generated column instead of an explicit date column. Declare it as date GENERATED ALWAYS AS (CAST(timestampColumn AS DATE)), and partition by it. The advantage is that when you query on timestampColumn, partition filtering on the date column should happen automatically.
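A sketch of what that could look like (table and column names are placeholders):

CREATE TABLE events (
  id         BIGINT,
  event_ts   TIMESTAMP,
  event_date DATE GENERATED ALWAYS AS (CAST(event_ts AS DATE))
)
USING DELTA
PARTITIONED BY (event_date);

-- the filter is on the timestamp only, but Delta derives the matching
-- event_date range and prunes partitions automatically
SELECT * FROM events
WHERE event_ts >= '2024-01-01 00:00:00' AND event_ts < '2024-01-02 00:00:00';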
I am a beginner with Cassandra. I created a table with the details below, and when I try to perform a range search using token, I am not getting any results. Am I doing something wrong, or is it my understanding of the data model that is off?
Query: select * from test where token(header)>=2 and token(header)<=4;
The token function calculates the token from the value based on the configured partitioner. The calculated value is the hash that is used to identify the node where the data is located; it is not the data itself.
Cassandra can perform range searches on values only on clustering columns (and only for some designs), and only inside a single partition. If you need to perform a range search on an arbitrary column (including partition keys), there is DSE Search, which allows you to index the table and perform different types of search, including range searches. But take into account that it will be much slower than traditional Cassandra queries.
In your situation, you can run 3 queries in parallel (to cover values 2,3,4), like this:
select * from test where header = value;
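Concretely, assuming header is the partition key and 2, 3 and 4 are the actual header values you are after (not tokens):

select * from test where header = 2;
select * from test where header = 3;
select * from test where header = 4;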
and then combine results in your code.
I recommend taking the DS201 & DS220 courses on DataStax Academy to understand how Cassandra performs queries and how to model data to make this possible.
I have timeseries stored in a Cassandra table, coming from several sensors. Here is the schema I use for storing data :
CREATE TABLE data_sensors (
sensor_id int,
time timestamp,
value float,
PRIMARY KEY ((sensor_id), time)
);
Values can be temperature or pressure, for instance, depending on the sensor they are coming from.
My objective is to be able to find basic statistics (min, max, avg, std) on pressure, but only when temperature is higher than a certain value.
Here is a schema of the whole process I'd like to get.
I think it could be better if I changed the Cassandra model, at least for temperature data, to be able to filter on value. Is there another way, after importing data into a Spark RDD, to avoid altering the Cassandra table?
Then, once filtering on temperature is done, how do I get the sequence of timestamps to use for filtering the pressure data? Please note that I don't necessarily have the same timestamps for temperature and pressure, which is why I think I need periods of time instead of a list of precise timestamps.
Thanks for your help!
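One way to express this once the table is exposed to Spark as a view (for example via the spark-cassandra-connector) is to truncate the timestamps to a fixed period and join temperature and pressure on that period. A rough Spark SQL sketch, assuming sensor_id 1 is the temperature sensor, sensor_id 2 is the pressure sensor, 1-minute periods, and a temperature threshold of 30:

SELECT  hot.period,
        min(p.value)    AS min_pressure,
        max(p.value)    AS max_pressure,
        avg(p.value)    AS avg_pressure,
        stddev(p.value) AS std_pressure
FROM (
    -- 1-minute periods in which the temperature sensor exceeded the threshold
    SELECT DISTINCT date_trunc('minute', time) AS period
    FROM   data_sensors
    WHERE  sensor_id = 1 AND value > 30.0
) hot
JOIN data_sensors p
  ON  p.sensor_id = 2
  AND date_trunc('minute', p.time) = hot.period
GROUP BY hot.period;

Widening the period trades precision for fewer missed matches between the two unevenly spaced series.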
It's not really a Cassandra-specific answer, but maybe you want to look at time series databases that provide an SQL layer on top of NoSQL stores, with support for JOINs and aggregations.
Here's an example of ATSD SQL syntax that supports period aggregations and joins.
SELECT t1.entity, t1.datetime, min(t1.value), max(t1.value), avg(t2.value)
FROM mpstat.cpu_busy t1
JOIN meminfo.memfree t2
WHERE t1.datetime >= '2016-09-20T15:00:00Z' AND t1.datetime < '2016-09-20T15:15:00Z'
GROUP BY entity, t1.PERIOD(1 MINUTE)
HAVING max(t1.value) > 30
The query joins two metrics, filters out the 1-minute periods where the first metric was below the threshold, and then returns a set of statistics for the second series.
If the two series are unevenly spaced, you can regularize the array using linear interpolation.
Disclosure: I work for Axibase that develops ATSD.