I have a main database that stores up to 5,000 new rows per day.
I want to have a second database that only contains the latest 30 days worth of data at any time.
Therefore I plan to set up a cron job that regularly dumps the rows older than 30 days and copies the new ones.
What would be the best design for the copying part?
Copying on the fly with MySQL alone
A MySQL export to a text file, then a MySQL import, then deleting the temporary file
A PHP script that iterates through the rows and copies them one by one
I want robustness and a minimal amount of CPU/memory usage.
The quickest and most robust way is to perform the transfer directly in MySQL. Here are the steps involved:
First, create the second table:
CREATE TABLE IF NOT EXISTS second.last30days LIKE main_table;
Next, insert the records that are 30 days old or newer:
INSERT INTO second.last30days
SELECT *
FROM main_table
WHERE created >= CURDATE() - INTERVAL 30 DAY
ORDER BY created;
Lastly, delete the records older than 30 days:
DELETE FROM second.last30days
WHERE created < CURDATE() - INTERVAL 30 DAY
ORDER BY created;
It would be advisable to not run the INSERT and DELETE statements at the same time.
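One way to keep them apart in time is to schedule each statement separately. Below is a minimal sketch using MySQL's event scheduler as an alternative to the cron job mentioned in the question; it assumes the scheduler is enabled and that, after the initial 30-day copy above, each daily run only needs to copy the previous day's rows (that incremental filter is an assumption, not part of the original answer):
-- Enable the scheduler once (requires the appropriate privilege).
SET GLOBAL event_scheduler = ON;
-- Copy yesterday's rows into the 30-day table once a day.
CREATE EVENT IF NOT EXISTS copy_new_rows
ON SCHEDULE EVERY 1 DAY STARTS CURRENT_TIMESTAMP + INTERVAL 1 HOUR
DO
  INSERT INTO second.last30days
  SELECT * FROM main_table
  WHERE created >= CURDATE() - INTERVAL 1 DAY  -- yesterday's rows only,
    AND created < CURDATE();                   -- so repeat runs never copy a row twice
-- Purge rows older than 30 days, scheduled an hour later so the two never run together.
CREATE EVENT IF NOT EXISTS purge_old_rows
ON SCHEDULE EVERY 1 DAY STARTS CURRENT_TIMESTAMP + INTERVAL 2 HOUR
DO
  DELETE FROM second.last30days
  WHERE created < CURDATE() - INTERVAL 30 DAY;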
If the databases are both hosted on the same server, just use an INSERT ... SELECT statement.
That way you minimize everything: one query, and done.
MySQL 5.0 Reference Manual, 13.2.5.1 INSERT ... SELECT Syntax
The task is to permanently record new data to a database every minute and then, occasionally, to read only the latest 24 hours of data, using Python.
The only approach I know:
create a script A that inserts into a MariaDB database table, one new row per minute, with a timestamp as a field value
create a script B that reads from the database table, using WHERE and the timestamp values
The problem is that there are two restrictions:
it is not allowed to have more than 10,000 rows in one database table
it is not allowed to delete any rows
How to fulfill the task and meet both restrictions? Are there best practices?
Thanks!
You can create a new table every X days, whenever the current one becomes full. Name each new table with its first timestamp value.
With this solution you need to create your B script in this way:
List all tables
Find the tables you are looking for
Write your SQL query over all these tables using UNION ALL, as sketched below
You can do it in a single SQL query for optimisation, or in a script using multiple queries for simplicity.
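For illustration, a rough sketch of what script B could run, assuming the tables are named with their first Unix timestamp (e.g. readings_1700000000), the data sits in a schema called mydb, and the timestamp column is called ts (these names are illustrative, not from the question):
-- Step 1: list the candidate tables via information_schema.
SELECT table_name
FROM information_schema.tables
WHERE table_schema = 'mydb'
  AND table_name LIKE 'readings_%';
-- Step 2: read the last 24 hours from the one or two tables that cover it.
SELECT * FROM readings_1699900000 WHERE ts >= NOW() - INTERVAL 24 HOUR
UNION ALL
SELECT * FROM readings_1700000000 WHERE ts >= NOW() - INTERVAL 24 HOUR
ORDER BY ts;
With one row per minute and a 10,000-row cap, each table covers roughly a week, so the last 24 hours can span at most two tables and the UNION stays small.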
Currently I have an ETL job that reads a few tables, performs certain transformations and writes them back to the daily table.
I use the following query in Spark SQL:
"INSERT INTO dbname.tablename PARTITION(year_month)
SELECT * from Spark_temp_table "
The target table into which all these records are inserted is partitioned at a year x month level. The number of records generated daily is not that large, hence I am partitioning at the year x month level.
However, when I check the partition, it has a small ~50 MB file for each day my code runs (the code has to run daily), and eventually I will end up with around 30 files in my partition totalling ~1500 MB.
I want to know if there is a way for me to create just one file (or maybe 2-3 files, as per block size restrictions) in one partition while I append my records on a daily basis.
The way I think I can do it is to read everything from the concerned partition into my Spark dataframe, append the latest records, and repartition it before writing back. How do I ensure I only read data from the concerned partition, and that only that partition is overwritten with a smaller number of files?
You can use the DISTRIBUTE BY clause to control how the records will be distributed across files inside each partition.
To have a single file per partition, you can use DISTRIBUTE BY year, month
And to have 3 files per partition, you can use DISTRIBUTE BY year, month, day % 3
The full query:
INSERT INTO dbname.tablename
PARTITION(year_month)
SELECT * FROM Spark_temp_table
DISTRIBUTE BY year, month, day % 3
We have a requirement to load the last 30 days of updated data from the table.
The potential solutions below do not allow us to do so.
select * from XYZ_TABLE where WRITETIME(lastupdated_timestamp) > (TOUNIXTIMESTAMP(now()) - 42300000);
select * from XYZ_TABLE where lastupdated_timestamp > (TOUNIXTIMESTAMP(now()) - 42300000);
The table has the following columns:
lastupdated_timestamp (with an index on this field)
lastupdated_userid (with an index on this field)
Any pointers ...
Unless your table was built with this query in mind, your query will search every partition of the database, which will become very costly once your dataset has become large and will probably result in a timeout.
To complete this query efficiently, XYZ_TABLE should have a primary key something like this:
PRIMARY KEY ((update_month, update_day), lastupdated_timestamp)
This is so Cassandra knows exactly where to go to find the data. It has month and day buckets it can quickly locate, and then you can run queries like this to find updates on a certain day:
SELECT * FROM XYZ_TABLE WHERE update_month = '07-18' AND update_day = 6;
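For illustration, a table built around that key might look like the sketch below; the bucket columns, the '07-18' month format and the clustering order are assumptions, and only the two columns mentioned in the question are included as payload:
CREATE TABLE xyz_table_by_day (
    update_month          text,        -- e.g. '07-18' for July 2018
    update_day            int,
    lastupdated_timestamp timestamp,
    lastupdated_userid    text,
    PRIMARY KEY ((update_month, update_day), lastupdated_timestamp)
) WITH CLUSTERING ORDER BY (lastupdated_timestamp DESC);
-- Updates for one day then come from a single partition:
SELECT * FROM xyz_table_by_day
WHERE update_month = '07-18' AND update_day = 6;
To cover a 30-day window, the client would issue one such query per day bucket and merge the results, which is still far cheaper than scanning every partition.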
We've set up an Azure Search Index on our Azure SQL Database of ~2.7 million records all contained in one Capture table. Every night, our data scrapers grab the latest data, truncate the Capture table, then rewrite all the latest data - most of which will be duplicates of what was just truncated, but with a small amount of new data. We don't have any feasible way of only writing new records each day, due to the large amounts of unstructured data in a couple fields of each record.
How should we best manage our index in this scenario? Running the indexer on a schedule requires you to indicate a "high watermark column." Because of the nature of our database (erase/replace once a day), we don't have any column that would apply here. Further, our Azure Search index either needs to go through a full daily erase/replace of its own, or we need some other approach so that we don't keep adding 2.7 million duplicate records to the index every day. The former likely won't work for us because it takes 4 hours minimum to index our whole database. That's 4 hours during which clients (worldwide) may not have a full dataset to query.
Can someone from Azure Search make a suggestion here?
What's the proportion of the data that actually changes every day? If that proportion is small, then you don't need to recreate the search index. Simply reset the indexer after the SQL table has been recreated, and trigger reindexing (resetting an indexer clears its high water mark state, but doesn't change the target index). Even though it may take several hours, your index is still there with the mostly full dataset. Presumably, if you update the dataset once a day, your clients can tolerate hours of latency in picking up the latest data.
I have a SQL table with approximately 80 million rows. The first name and last name columns are full-text indexed. We also have appropriate clustered and non-clustered indexes.
We have different types of queries and everything works fine regarding performance and response time, but we have a few cases where we are searching for a specific first and last name in some time period and the response time is a few minutes (normally it is 3-10 seconds).
When I look at the execution plan I see that there is an index eager spool step and it accounts for 90% of the cost. Before that step there is a full-text step. So we can't figure out why we get good response times for a million other combinations (those queries don't have the eager spool step), but with this specific combination (first and last name, both of which are very common) we have this issue.
I am using MS SQL Server 2012.