Why does delete/insert result in a larger database size than using upsert? - google-cloud-spanner

I have a table with 1.6 million records. This table has multiple Interleaved tables. Periodically I receive updates. If I apply the update by deleting and then inserting, the database size is approximately 35% larger than by using UPSERT.
The database retention period is set to 1 hour. Even after the retention period has passed, the database size does not go down.
Any idea why this is?
Update: The backup size for the database is the same regardless of the update method.
Thanks

The database retention period, referred to as version_retention_period in the official documentation, is the period during which all versions of the data are guaranteed to exist.
Various background processes perform compactions and similar cleanup, and they can take anywhere from a few hours to a few days to run, depending on the load on the database. Until they run, the space occupied by deleted row versions is not reclaimed, so your observation is expected.
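For reference, the two update paths being compared look roughly like the sketch below. The Orders table and its columns are hypothetical; the upsert can be issued either as INSERT OR UPDATE DML or as an InsertOrUpdate mutation through the client libraries.
-- Delete-then-insert: the deleted versions are kept for the
-- version_retention_period and reclaimed later by background compaction.
DELETE FROM Orders WHERE OrderId = @id;
INSERT INTO Orders (OrderId, Status) VALUES (@id, @status);
-- Upsert: the same logical change as a single write.
INSERT OR UPDATE INTO Orders (OrderId, Status) VALUES (@id, @status);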

Related

Delta Lake partitioning strategy for event data

I'm trying to build a system that ingests, stores, and can query app event data. In the future it will be used for other tasks (ML, analytics, etc.), which is why I think Databricks could be a good option (for now).
The main use case will be retrieving user-action events occurring in the app.
Batches of this event data will land in an S3 bucket about every 5-30 minutes, and Databricks Auto Loader will pick them up and store them in a Delta table.
A typical query will be: get all events where colA = x over the last day, week, or month.
I think the typical strategy here is to partition by date, e.g.:
date_trunc("day", date) # 2020-04-11T00:00:00.000+0000
This will create 365 partitions in a year. I expect each partition to hold about 1GB of data. In addition to partitioning, I plan on using z-ordering for one of the high cardinality columns that will frequently be used in the where clause.
Is this too many partitions?
Is there a better way to partition this data?
Since I'm partitioning by day and data is coming in every 5-30 mins, is it possible to just "append" data to a days partition instead?
It really depends on the amount of data coming in per day and how many files have to be read to answer your query. If it's on the order of tens of GB, then partitioning per day is fine. But you can also partition by the timestamp truncated to the week, in which case you'll get only 52 partitions per year. Z-ordering will help keep the files optimized, but if you're appending data every 5-30 minutes you'll end up with dozens of files per day inside each partition, so you will need to run OPTIMIZE with ZORDER BY every night, or something like that, to decrease the number of files. Also, make sure you're using optimized writes: although this makes write operations slower, it decreases the number of files generated. (If you're planning to use Z-ordering, then it makes no sense to enable auto-compaction.)
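As a rough sketch in Databricks SQL (the table and column names app_events, event_ts, and colA are hypothetical), the daily partitioning plus the nightly OPTIMIZE job could look like this:
-- Partition by a date column derived from the event timestamp.
CREATE TABLE IF NOT EXISTS app_events (
  event_id   STRING,
  colA       STRING,
  event_ts   TIMESTAMP,
  event_date DATE GENERATED ALWAYS AS (CAST(event_ts AS DATE))
)
USING DELTA
PARTITIONED BY (event_date);
-- Typical query: event_date prunes partitions, Z-ordering helps the colA filter.
SELECT * FROM app_events
WHERE colA = 'x' AND event_date >= date_sub(current_date(), 7);
-- Nightly maintenance: compact the small files written every 5-30 minutes.
OPTIMIZE app_events ZORDER BY (colA);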

MariaDB got much faster, and I can't find the cause?

I have a concern about my MariaDB 10.4.12 database query execution time, which is getting much faster without any update to my database schema or data. While a speed-up is always welcome, I am concerned about the root cause of this speed-up, especially since I have not rolled out any changes in the last 24 hours. This specific query has sped up 60x overnight.
I have a NodeJS web application that filters a large dataset into "reporting" pages, which typically take 10-12 seconds to load. My main table has 3.5 million rows and the base query involves many joins, date comparisons, and text comparisons. There is room for fine-tuning the query, but it worked for what it was designed to do and I could live with 10 second load times. I noticed this morning, though, that my queries were executed in less than 1 second, without any recent changes on my part.
The most recent change to the application was pushed out five days ago, which affected the amount of data being pulled into this database. A separate application on the same server reaches out to a data set every 10 minutes and replicates these rows into the same database the "reporting" application communicates with. Up until this update, the query was collecting and inserting ~80,000 rows on average, taking about 8-10 seconds to fully replicate the data into this database. My change five days ago reduced the rows being inserted to ~20,000 on average.
Other clues:
phpMyAdmin still takes 10-12 seconds to run the query, while the MySQL command-line tool completes it in less than 1 second
The MariaDB temp directory was changed to a larger partition 7 days ago
The query was tested to be slow (10-12 seconds) 24 hours ago
The query is still slow on a pre-production server that runs the same application with an identical MySQL instance running (same schema and data)
My current running theory is that the ~80,000 inserts were not being executed in the time range being reported by NodeJS (8-10 seconds for the inserts), and they were instead waiting in the MariaDB temp directory until they could be fully written to the database. That would suggest that the database was constantly bogged down by these writes, and reducing the number to ~20k allowed the database to insert faster, allowing the select queries to run faster this morning.
Should I be concerned about this speed up? Could MariaDB have found a faster way to index my data? Am I going crazy?
Thank you.
Don't worry. This kind of thing can be caused by contention (multiple database clients using the database concurrently) and all sorts of other things.
(Cherish this moment. Performance usually goes the other direction.)
You can test for correctness to increase your confidence level. Check a few older and a few newer records to see if they still contain good data.
Or a full-table-scan query, something like this
SELECT COUNT(*), AVG(some_number_column), MIN(some_text_column) FROM mytable
That will take a while but it will hit every row in the table.
You probably don't need to do this, but it's a way to double-check (and to tell your boss, "I double-checked.")
10 seconds, then 1 second. That is "normal".
The first was run when none of the data was cached in RAM; the second was with all cached.
Run it a third time; it will be 1 second again.
Restart MariaDB and run it again; it will again take 10 seconds.
Walk away from the machine for a long time; don't touch the table. It might be back to 10 seconds. For this, look at the size of RAM and innodb_buffer_pool_size. Also look for big table scans that bump everything out of the cache.
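If you want to check the caching explanation, standard MariaDB status queries will show how big the buffer pool is and how many reads still hit disk:
-- Memory available to InnoDB for caching data and indexes.
SHOW VARIABLES LIKE 'innodb_buffer_pool_size';
-- Innodb_buffer_pool_reads (from disk) vs Innodb_buffer_pool_read_requests (total):
-- a low ratio means the table is being served mostly from cache.
SHOW GLOBAL STATUS LIKE 'Innodb_buffer_pool_read%';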

MemSQL for Last n Days Data

I plan to use memsql to store my last 7 days data for real time analytics using SQL.
I checked the documentation and found that there is no TTL/expiration feature in MemSQL.
Is there any such feature (in case I missed it)?
Does MemSQL fit the use case if I do a daily delete of data older than 7 days? I'm quite curious about the fragmentation.
We tried this on PostgreSQL, where we needed to run the VACUUM command, which takes a long time.
There is no TTL/expiration feature. You can do it by running delete queries. Many customer use cases do this type of thing, so yes, MemSQL does fit the use case. Fragmentation generally shouldn't be too much of a problem here; what kind of fragmentation are you concerned about?
There is no out-of-the-box TTL feature in MemSQL.
We achieved TTL by adding an additional TS column to our MemSQL rowstore table with the TIMESTAMP(6) datatype.
This gives you automatic current-timestamp insertion when you add a new row to the table.
When querying data from this table, you can apply a simple filter on this TIMESTAMP column to exclude records older than your TTL value.
https://docs.memsql.com/sql-reference/v6.7/datatypes/#time-and-date
You can always have a batch job that runs once a month to delete the older data.
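Putting those pieces together, here is a minimal sketch, assuming a hypothetical rowstore table named events, a payload column, and a 7-day TTL:
-- ts defaults to the insertion time thanks to the TIMESTAMP(6) default.
CREATE TABLE events (
  id BIGINT PRIMARY KEY,
  payload JSON,
  ts TIMESTAMP(6) NOT NULL DEFAULT CURRENT_TIMESTAMP(6)
);
-- Real-time queries only look at the TTL window.
SELECT COUNT(*) FROM events WHERE ts >= NOW() - INTERVAL 7 DAY;
-- Periodic cleanup removes rows that have aged past the TTL.
DELETE FROM events WHERE ts < NOW() - INTERVAL 7 DAY;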
We have not seen any issues due to fragmentation, but you can do the following once in a while if fragmentation is a concern for you:
MemSQL’s memory allocators can become fragmented over time (especially if a large table is shrunk dramatically by deleting data randomly). There is no command currently available that will compact them, but running ALTER TABLE ADD INDEX followed by ALTER TABLE DROP INDEX will do it.
Warning
Caution should be taken with this workaround. Plans will rebuild, and the two ALTER queries will move all rows in the table twice, so this should not be used too often.
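For reference, the workaround amounts to something like the following (the table and index names are hypothetical; heed the warning above):
-- Adding and then dropping a throwaway index rewrites the table's rows,
-- which compacts the memory allocators as a side effect.
ALTER TABLE events ADD INDEX tmp_compact_idx (id);
ALTER TABLE events DROP INDEX tmp_compact_idx;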
Reference:
https://docs.memsql.com/troubleshooting/latest/troubleshooting/

How to store data in cassandra in a temporary location while sync is in progress?

We have a MySQL server running which serves application writes. To do some batch processing, we have written two sync jobs to migrate data into a Cassandra cluster:
1. A daily sync job which transfers by updated timestamp for that day.
2. A complete sync job which transfers complete data, overriding existing ones.
Now there is a possibility that a row was deleted from MySQL; in that case, with the above approach, it will lie forever in Cassandra.
To solve that problem we have given a TTL of 15 days to every row, so eventually it will get deleted; if it was not actually deleted from MySQL, the next full sync will overwrite the TTL again.
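(For reference, each sync write carries the TTL roughly like the hypothetical statement below; 15 days is 1,296,000 seconds.)
INSERT INTO events (id, payload) VALUES (?, ?) USING TTL 1296000;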
It's working fine as far as the use case is concerned, but the issue is that during a full sync the complete data set is overwritten, SSTables are generated continuously with compactions happening all the time, load averages shoot up and cause slowness, and the backup size increases (all of which could have been avoided).
Essentially we want to replace the existing table data with the new data, but we don't want to truncate before starting the job, only after the job completes.
Is there any way by which this can be solved other than creating a new table altogether and dropping past table when data is generated?
You can look at the double-run migration strategy I presented here: http://www.slideshare.net/doanduyhai/from-rdbms-to-cassandra-without-a-hitch
It has the advantage of allowing 100% uptime and a possible rollback if things go wrong. The downside is the amount of work required in terms of releases and code.

Simple select count(id) uses 100% of Azure SQL DTUs

This started off as this question, but it now seems more appropriate to ask separately since I realised it is a DTU-related question.
Basically, running:
select count(id) from mytable
EDIT: Adding a where clause does not seem to help.
Is taking between 8 and 30 minutes to run (whereas the same query on a local copy of SQL Server takes about 4 seconds).
Below is a screenshot of the MONITOR tab in the Azure portal from when I ran this query. Note I did this after not touching the database for about a week, with Azure reporting I had only used 1% of my DTUs.
A couple of extra things:
In this particular test, the query took 08:27 (8 minutes 27 seconds) to run.
While it was running, the above chart actually showed the DTU line at 100% for a period.
The database is configured Standard Service Tier with S1 performance level.
The database is about 3.3GB and this is the largest table (the count is returning approx 2,000,000).
I appreciate it might just be my limited understanding, but if somebody could clarify whether this is really the expected behaviour (i.e. a simple count taking so long to run and maxing out my DTUs), it would be much appreciated.
From the query stats in your previous question we can see:
300ms CPU time
8000 physical reads
8:30 is about 500 sec. We are certainly not CPU bound: 300 ms of CPU over 500 sec is almost no utilization. 8000 physical reads over 500 sec works out to about 16 physical reads per second, which is far below what any physical disk can deliver. Also, the table is not fully cached, as evidenced by the presence of physical IO.
I'd say you are throttled. S1 corresponds to
934 transactions per minute
for some definition of transaction. That's about 15 transactions per second. Maybe you are hitting a limit of one physical IO per transaction?! 15 and 16 are suspiciously similar numbers.
Test this theory by upgrading the instance to a higher scale factor. You might find that SQL Azure Database cannot deliver the performance you want at an acceptable price.
You should also find that repeatedly scanning half of the table results in a fast query, because the allotted buffer pool seems to fit most of the table (just not all of it).
I had the same issue. Updating the statistics with fullscan on the table solved it:
update statistics mytable with fullscan
select count(id)
should perform a clustered index scan if one is available and it's up to date. Azure SQL should update statistics automatically, but it does not rebuild indexes automatically if they are completely out of date.
If there's a lot of INSERT/UPDATE/DELETE traffic on that table, I suggest manually rebuilding the indexes every once in a while.
For more info, see http://blogs.msdn.com/b/dilkushp/archive/2013/07/28/fragmentation-in-sql-azure.aspx and the SO post "SQL Azure and Indexes".
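A minimal sketch of that periodic maintenance in T-SQL, assuming the table is called mytable as above:
-- Refresh statistics after heavy DML, as suggested above.
UPDATE STATISTICS mytable WITH FULLSCAN;
-- Rebuild all indexes on the table to remove fragmentation.
ALTER INDEX ALL ON mytable REBUILD;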
