dbms_stats (performance tuning) - database-administration

Why do we need to gather stats?
I have a table, call it ABC. Every day around 10k records are inserted into it. As far as I know, Oracle automatically gathers stats for every new record at insertion time, and indexes are rebuilt on every insertion.
So do we really need to gather stats manually?

Stats are not automatically gathered on every insert, and indexes are not rebuilt on every insert. Gathering stats is very important for good SQL performance: the cost-based optimizer (CBO) relies on them to generate an optimal execution plan. Read more about the CBO and SQL plans.
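A minimal sketch of a manual stats gather with DBMS_STATS (the schema name APP_OWNER is illustrative):

    BEGIN
      DBMS_STATS.GATHER_TABLE_STATS(
        ownname          => 'APP_OWNER',  -- illustrative schema name
        tabname          => 'ABC',
        cascade          => TRUE,         -- also gather stats on the table's indexes
        estimate_percent => DBMS_STATS.AUTO_SAMPLE_SIZE
      );
    END;
    /

The nightly automatic optimizer statistics task covers most tables; a manual run like this is mainly useful right after a large load, before the next maintenance window.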

Related

Provisioned write capacity in Cassandra

I need to capture time-series sensor data in Cassandra. The best practices for handling time-series data in DynamoDB are as follows:
Create one table per time period, provisioned with write capacity less than 1,000 write capacity units (WCUs).
Before the end of each time period, prebuild the table for the next period.
As soon as a table is no longer being written to, reduce its provisioned write capacity. Also reduce the provisioned read capacity of earlier tables as they age, and archive or delete the ones whose contents will rarely or never be needed.
Now I am wondering how I can implement the same concept in Cassandra! Is there any way to manually configure write/read capacity in Cassandra as well?
This really depends on your own requirements, which you need to discuss with development, etc.
There are several ways to handle time-series data in Cassandra:
Have one table for everything. As Chris mentioned, just include the time component in the partition key, like a day, and store data per sensor/day. If the data won't be updated and you know in advance how long it will be kept, you can set a TTL on the data and use TimeWindowCompactionStrategy. The advantage of this approach is that you have only one table and don't need to maintain multiple tables, which makes development and maintenance easier.
The same approach as you described - create a separate table per period of time, like a month, and write data into it. In this case you can efficiently drop the whole table when the data "expires". With this approach you can update data if necessary and don't need to set a TTL on the data, but it requires more work from the development and ops teams, as queries need to reach multiple tables. Also, take into account that there are limits on the number of tables in a cluster - it's recommended not to have more than 200 tables, as every table requires memory for metadata, etc. Some things, like the bloom filter, can be tuned to occupy less memory for tables that are rarely read, though.
For Cassandra, just make a single table but include some time period in the partition key (so the partitions do not grow indefinitely and get too large). There is no table maintenance, and read/write capacity depends more on the workload, schema, cluster size, etc., but shouldn't really need to be worried about except for sizing the cluster.
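A minimal CQL sketch of the single-table, time-bucketed approach (table and column names, plus the 30-day TTL, are illustrative):

    -- One partition per sensor per day; TWCS can drop whole SSTables as TTLs expire.
    CREATE TABLE sensor_data (
        sensor_id text,
        day       date,        -- the time bucket in the partition key
        ts        timestamp,
        value     double,
        PRIMARY KEY ((sensor_id, day), ts)
    ) WITH CLUSTERING ORDER BY (ts DESC)
      AND default_time_to_live = 2592000   -- 30 days; match your retention period
      AND compaction = {'class': 'TimeWindowCompactionStrategy',
                        'compaction_window_unit': 'DAYS',
                        'compaction_window_size': '1'};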

Cassandra aggregation

The Cassandra database is not very good at aggregation, which is why I decided to aggregate before writing. I am storing some data (e.g. transactions) for each user, which I aggregate by hour. That means for one user there will be only one row per hour.
Whenever I receive new data, I read the row for the current hour, aggregate it with the received data and write it back. I use this data to generate hourly reports.
This works fine with low-velocity data, but I observed considerable data loss when the velocity is very high (e.g. 100 records for 1 user in a minute). This is because reads and writes happen very fast, and because of the "delayed write" I am not reading back updated data.
I think my "aggregate before write" approach itself is wrong. I was thinking about UDFs, but I am not sure how they would impact performance.
What is the best way to store aggregated data in Cassandra?
My idea would be:
Model data in Cassandra in hour-by-hour buckets.
Store the plain data in Cassandra immediately when it arrives.
At hour X, process all the data of hour X-1 and store the aggregated result in another table.
This would let you sustain very fast incoming rates, process data only once, and store the aggregates in another table for fast reads.
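A sketch of that two-table layout in CQL (table and column names are illustrative; this assumes transaction amounts per user):

    -- Raw events: append-only, so high write velocity never needs read-before-write.
    CREATE TABLE txn_raw (
        user_id text,
        hour    timestamp,   -- truncated to the hour, forming the bucket
        ts      timeuuid,    -- unique per event
        amount  decimal,
        PRIMARY KEY ((user_id, hour), ts)
    );

    -- Hourly aggregates: written once per hour by the batch job, read by reports.
    CREATE TABLE txn_hourly (
        user_id   text,
        hour      timestamp,
        total     decimal,
        txn_count bigint,
        PRIMARY KEY (user_id, hour)
    );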
I use Cassandra to pre-aggregate as well. I have different tables for hourly, daily, weekly and monthly aggregates. I think you are probably getting data loss because you are selecting the data before your last inserts have replicated to the other nodes.
Look into the counter data type to get around this.
You may also be able to specify a higher consistency level in either the inserts or selects to ensure you're getting the most recent data.
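A hedged sketch of the counter approach (names are illustrative). Counter increments are applied server-side, so concurrent writers don't overwrite each other the way read-modify-write does:

    CREATE TABLE txn_hourly_counts (
        user_id   text,
        hour      timestamp,   -- truncated to the hour
        txn_count counter,
        PRIMARY KEY (user_id, hour)
    );

    UPDATE txn_hourly_counts
       SET txn_count = txn_count + 1
     WHERE user_id = 'user42' AND hour = '2024-05-01 10:00:00+0000';

Note that a counter table may contain only counter columns outside the primary key, and counters hold integer increments only, so they suit counts rather than sums of decimal amounts.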

Require help in creating design for cassandra data model for my requirement

I have a Job_Status table with 3 columns:
Job_ID (numeric)
Job_Time (datetime)
Machine_ID (numeric)
A few other fields containing stats (like memory and CPU utilization)
At a regular interval (say 1 minute), entries are inserted into the above table for the jobs running on each machine.
I want to design the data model in Cassandra.
My requirement is to get the list of pairs of jobs which are running at the same time on 2 or more machines.
I have created a table with Job_ID and Job_Time as the primary key, but in order to achieve the desired result I have to do a lot of parsing of the data after retrieving the records.
This takes a lot of time once the number of records reaches around 500 thousand.
This requirement calls for an operation like SQL's inner join, but I can't use SQL due to some business reasons, and an SQL query over such a huge data set also takes a lot of time, as I found when I tried it with dummy data in SQL Server.
So I require your help on the points below:
Kindly suggest an efficient data model in Cassandra for this requirement.
How can the join operation of SQL be achieved/implemented in the Cassandra database?
Kindly suggest some alternate design/algorithm. I have been stuck on this problem for a very long time.
That's a pretty broad question. As a general approach you might want to look at pairing Cassandra with Spark so that you could do the large join in parallel.
You would insert jobs into your table when they start and delete them when they complete (possibly with a TTL set on insert so that jobs that don't get deleted will auto delete after some time).
When you want to update your pairing of jobs, you'd run a Spark batch job that loads the table data into an RDD and then does a map/reduce operation on the data, or uses Spark SQL to do a SQL-style join. You'd probably then write the resulting RDD back to a Cassandra table.
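If you go the Spark SQL route, the join itself could look something like the sketch below. This assumes the Cassandra table has been registered as a temporary view named job_status through the Spark Cassandra connector; all names are illustrative:

    -- Pairs of distinct jobs observed at the same time on different machines
    SELECT a.job_id  AS job_a,
           b.job_id  AS job_b,
           a.job_time
    FROM job_status a
    JOIN job_status b
      ON  a.job_time = b.job_time
      AND a.job_id   < b.job_id        -- avoids self-pairs and duplicate pairs
    WHERE a.machine_id <> b.machine_id;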

Adding and retrieving sorted counts in Cassandra

I have a case where I need to record a user action in Cassandra, then later retrieve a sorted list of users with the highest number of that action in an arbitrary time period.
Can anyone suggest a way to store and retrieve this data in a pre-aggregated method?
Outside of Cassandra I would recommend using a stream-summary or count-min sketch structure; you would be able to solve this with much less space and have immediate results. Just update it and periodically serialize and persist it (assuming you don't need guaranteed accuracy).
In Cassandra you can keep a row per period of time, such as per hour, with a counter per user in that row, incrementing the counters on use. Then use a batch job to run through them and find the heavy hitters. You would be constrained to having the minimum queryable time be 1 hour, and it won't be particularly cheap or fast to compute, but it would work.
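A sketch of that layout in CQL (names are illustrative). Partitioning by hour lets the batch job scan a single partition per hour to rank users:

    CREATE TABLE user_action_counts (
        hour    timestamp,   -- truncated to the hour: one partition per hour
        user_id text,
        actions counter,
        PRIMARY KEY (hour, user_id)
    );

    UPDATE user_action_counts
       SET actions = actions + 1
     WHERE hour = '2024-05-01 10:00:00+0000' AND user_id = 'user42';

Cassandra won't return the rows sorted by count, so the batch job still has to read the partition and sort client-side.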
Generally it would be good to treat these as a log of operations: every time there is an event, store it, and have batch jobs do analytics against it with Hadoop or custom code. If you need it in real time, I'd recommend the above approach of keeping stream summaries in memory.

How often are SQL Server Index Usage Stats Updated and what triggers it?

There are some other questions similar to this one, but please don't confuse them with what I'm asking.
I know there's a function STATS_DATE() for finding out when the stats were updated, which is fine, but what I want to know is what triggers an update of these stats, or a cut-off.
I know there's a report for this as well.
But last week I looked at the stats on a certain server and they gave me very good information, with 4-digit counts for the main tables in this particular database.
Right now, looking at the same production server, STATS_DATE() shows they were updated last Saturday, but this server has been up for weeks without a reboot, not even a service restart. So I know the stats I'm looking at this Monday morning were accumulated basically just over the weekend.
So, I would like to know where I can change these settings so that the server keeps accumulating the index usage stats until I clear the log, or whatever storage it uses.
There are various "stats" that SQL Server maintains for tables and indexes.
Histogram statistics: These are the stats that the query optimizer uses; STATS_DATE() returns the last date/time they were updated. The threshold for automatic updating of histogram statistics is 500 rows + 20% of the table, so on a table with 100,000 rows you'd have to modify 20,500 rows before triggering a recalculation. You can't change that threshold, but you can turn off automatic statistics updating and/or manually update statistics on particular tables and indexes (see the T-SQL sketch after this list).
Usage statistics: These are found in sys.dm_db_index_usage_stats. Index usage statistics keep track of things like seeks and scans from SELECT queries. They are not persisted and get reset on a restart of SQL Server. They also get reset if the underlying index is rebuilt with "ALTER INDEX ... REBUILD", but not with "ALTER INDEX ... REORGANIZE".
Operational statistics: These are found in sys.dm_db_index_operational_stats. Operational statistics cover things such as page splits, leaf-level inserts and PAGEIOLATCH waits. These are also not persisted.
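A hedged T-SQL sketch of inspecting the first two kinds of stats (dbo.MyTable is an illustrative name):

    -- When were the histogram statistics on a table last updated?
    SELECT s.name, STATS_DATE(s.object_id, s.stats_id) AS last_updated
    FROM sys.stats AS s
    WHERE s.object_id = OBJECT_ID('dbo.MyTable');

    -- Manually refresh them, e.g. after a big load:
    UPDATE STATISTICS dbo.MyTable WITH FULLSCAN;

    -- Index usage stats (reset on instance restart or ALTER INDEX ... REBUILD):
    SELECT OBJECT_NAME(u.object_id) AS table_name,
           i.name                   AS index_name,
           u.user_seeks, u.user_scans, u.user_lookups, u.user_updates
    FROM sys.dm_db_index_usage_stats AS u
    JOIN sys.indexes AS i
      ON i.object_id = u.object_id AND i.index_id = u.index_id
    WHERE u.database_id = DB_ID();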
