Best practices to store every minute and select from database latest 24h data only? - python-3.x

The task is to permanently record new data to a database every minute and then, occasionally, to read back only the latest 24 hours of data, using Python.
The only approach I know:
create a script A that inserts one new row per minute into a MariaDB database table, with a timestamp as a field value
create a script B that reads from the database table, filtering with WHERE on the timestamp values
The problem is that there are two restrictions:
it is not allowed to have more than 10,000 rows in one database table
it is not allowed to delete any rows
How to fulfill the task and meet both restrictions? Are there best practices?
Thanks!

You can create a new table every X days, whenever the current one fills up. Name each table after its first timestamp value.
With this solution you need to create your B script in this way:
List all tables
Find the tables you are looking for
Write your SQL query over all these tables using UNION ALL
You can do it in a single SQL query for performance, or in a script with multiple queries for simplicity.
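A minimal sketch of what scripts A and B could look like with this rotation scheme, assuming the MariaDB Connector/Python package (mariadb); the readings_<unix_timestamp> table naming, column names, and connection details are illustrative, not prescribed by the answer:

import time
import mariadb  # MariaDB Connector/Python

MAX_ROWS = 10_000  # the per-table limit from the question

def connect():
    return mariadb.connect(user="app", password="secret",
                           host="localhost", database="metrics")

# --- script A: insert one row per minute, rotating tables when the limit is hit ---
def current_table(cur):
    cur.execute("SHOW TABLES LIKE 'readings_%'")
    tables = sorted(t[0] for t in cur.fetchall())
    if tables:
        cur.execute(f"SELECT COUNT(*) FROM {tables[-1]}")
        if cur.fetchone()[0] < MAX_ROWS:
            return tables[-1]
    # start a new table named after the first timestamp it will hold
    name = f"readings_{int(time.time())}"
    cur.execute(f"CREATE TABLE {name} (ts DATETIME NOT NULL, value DOUBLE)")
    return name

def insert_reading(value):
    conn = connect()
    cur = conn.cursor()
    cur.execute(f"INSERT INTO {current_table(cur)} (ts, value) VALUES (NOW(), ?)",
                (value,))
    conn.commit()
    conn.close()

# --- script B: read the last 24 h across the newest tables with UNION ALL ---
def last_24h():
    conn = connect()
    cur = conn.cursor()
    cur.execute("SHOW TABLES LIKE 'readings_%'")
    # at one row per minute, 10,000 rows is about a week, so 24 h spans at most 2 tables
    tables = sorted(t[0] for t in cur.fetchall())[-2:]
    union = " UNION ALL ".join(
        f"SELECT ts, value FROM {t} WHERE ts >= NOW() - INTERVAL 1 DAY"
        for t in tables)
    cur.execute(union)
    rows = cur.fetchall()
    conn.close()
    return rows

Rotating on the row count rather than on a fixed number of days keeps both restrictions satisfied even if the insert cadence ever changes.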

Related

How to get all the operations done on an Oracle table imported into Hive for processing? (not just the actual data in the table, but the operations as well)

I have a table in an Oracle DB which gets many transactions (let's say around 100 million inserts, updates, or deletes in a day). I want all the transactions happening on that table to be brought into Hive for processing through Spark or Hive.
For example:
let's say a record in that Oracle table goes through an initial insert, followed by 5 updates to the same or different columns, and finally gets deleted. I want to capture all such operations for all the records in that table and import them into Hive.
We want to find records with number of operations that exceed a threshold for specific columns and pull a report on them.
Has anyone come across such a use case? Appreciate any help in achieving this.

Data loading in Azure Data Warehouse - solution required

I would like to store the data in the DWH in a consistent manner. Every week I need to load data into Azure DW from an on-prem SQL DB.
The thing is that I have a primary key in the table which I get every week. An example of the table:
I want to design the load in such a way that all 4 records get stored in the DW.
Shall I use a surrogate key, or is there some better way?
If this is staged source data I wouldn't add a surrogate key; typically you only create surrogate keys in your dimensional model.
If your data volume is growing semi-exponentially every time the process is run (unlikely) I would process it as a CTAS; otherwise I would do a
INSERT INTO dbo.table
SELECT *, SYSUTCDATETIME() AS RECORD_INSERT_DATE FROM dbo.table_external_table
So you would just insert all incoming data and add a timestamp for the insert date. Your NK and timestamp become your unique key on the table.
If your requirements involve easily returning the current version of a record, you could use a Type II SCD pattern: set an end date on the most recent version of the record, and a start date + active flag on the new version.
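For illustration, here is a hedged Python sketch of the append-plus-timestamp approach driven through pyodbc, together with one way to read back only the latest version of each natural key; the table name dbo.sales, the column NK, and the connection details are placeholders, not part of the original answer:

import pyodbc

# Connection string is illustrative; substitute your own server, database, and credentials.
conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};"
    "SERVER=myserver.database.windows.net;DATABASE=mydw;UID=loader;PWD=...")
cur = conn.cursor()

# Append the whole weekly extract, stamping each row with the load time.
cur.execute("""
    INSERT INTO dbo.sales
    SELECT *, SYSUTCDATETIME() AS RECORD_INSERT_DATE
    FROM dbo.sales_external_table
""")
conn.commit()

# Later, return only the latest version of each natural key (NK),
# using NK + insert timestamp as the effective unique key.
cur.execute("""
    SELECT *
    FROM (
        SELECT *,
               ROW_NUMBER() OVER (PARTITION BY NK
                                  ORDER BY RECORD_INSERT_DATE DESC) AS rn
        FROM dbo.sales
    ) AS v
    WHERE v.rn = 1
""")
latest_rows = cur.fetchall()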

Incremental load without date or primary key column using azure data factory

I have a source, let's say a SQL DB or an Oracle database, and I want to pull the table data into an Azure SQL database. The problem is that I don't have any date column that tracks when data is inserted, nor a primary key column. Is there any other way to perform this operation?
One way of doing it semi-incrementally is to partition the table by a fairly stable column in the source table. Then you can use a mapping data flow to compare the partitions (this can be done with row counts, aggregations, hashbytes, etc.). On each load you store the comparison output in the partition metadata somewhere, so that you can compare against it the next time you load. That way you reload only the partitions that were changed since your last load.
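As a rough sketch of that partition-comparison idea outside of mapping data flows, assuming a SQL Server source reachable via pyodbc; the partition column region, the table dbo.orders, and the JSON metadata file are all hypothetical, and the checksum expressions are SQL Server specific:

import json
import pyodbc

PARTITION_COLUMN = "region"        # hypothetical, fairly stable column
SOURCE_TABLE = "dbo.orders"        # hypothetical source table
METADATA_FILE = "partition_state.json"

def partition_fingerprints(conn):
    """Return {partition_value: [row_count, checksum]} for the source table."""
    cur = conn.cursor()
    cur.execute(f"""
        SELECT {PARTITION_COLUMN},
               COUNT(*)                  AS row_count,
               CHECKSUM_AGG(CHECKSUM(*)) AS chk
        FROM {SOURCE_TABLE}
        GROUP BY {PARTITION_COLUMN}
    """)
    return {str(r[0]): [r[1], r[2]] for r in cur.fetchall()}

def changed_partitions(current, previous):
    """Partitions whose fingerprint differs from the previously stored run."""
    return [p for p, fp in current.items() if previous.get(p) != fp]

conn = pyodbc.connect("DSN=source_db")   # illustrative connection
current = partition_fingerprints(conn)

try:
    with open(METADATA_FILE) as f:
        previous = json.load(f)
except FileNotFoundError:
    previous = {}

for p in changed_partitions(current, previous):
    print(f"reload partition {p!r}")     # here you would trigger the copy for that slice only

with open(METADATA_FILE, "w") as f:
    json.dump(current, f)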

What is the best way to move out-of-order Access records into the proper order by using a locked ID field?

I have roughly 1500 records in an Access database. I have a field ID that acts as the primary key, and as such cannot be manually changed. After looking through the original Excel sheet these records were kept in, I noticed that a few records in Excel were missing from the Access database. After going through all of them, I added the three missing records into Access.
This database stores records in date order, grouped by manufacturer. Ex. records from Manufacturer1 collected during week 1 of June '16 are all located together, and records from Manufacturer2 collected during week 2 of June '16 are stored directly afterwards. This is important for us because the data in this database often needs to be looked at visually, so keeping things in date order is essential. There is also a macro that exports the data to an Excel sheet and formats it to be easier to read, which exports the records in the order in which they are stored (by the ID field). This is a problem because the three missing records are from years past - now they are in the middle of records from 2018. The IDs they were assigned upon entry keep them in that location.
Is there a way to reliably insert these records into the database in the location where they should be? Such as shifting the values of other records' ID fields down by 3 to allow room for the missing records? I know I could probably manually move those three records to the desired location in the macro that exports to Excel, but I'd rather have a less hacky solution that could work if a similar problem happens again.
The order of data in a database is of no interest to the database - it's the relation between data that matters.
To always view your data in the order you want use the ORDER BY clause in an SQL statement. Generally you can add data to the underlying table directly through the query - unless you've got many-to-one type queries where your update would need to affect more than one record.
SELECT FieldName1, FieldName2, . . . .
FROM MyDataTable
ORDER BY Manufacturer, Date
Edit: Even here you'll be adding new records to the bottom of the dataset, but refreshing the query will move the records into the correct order.

How to retrieve a very big cassandra table and delete some unuse data from it?

I have created a Cassandra table with 20 million records. Now I want to delete expired data, determined by one non-primary-key column, but Cassandra doesn't support that operation on the column. So I tried to retrieve the table and delete the data row by row. Unfortunately, the table is too big to retrieve all at once. Since I can't delete the whole table either, how can I achieve my goal?
Your question is really: how to get the data from the table in batches (also called pagination).
You can do that by selecting different slices from your primary key: For example, if your primary key is some sort of ID, select a range of IDs each time, process the results and do whatever you want to do with them, then get the next range, and so on.
Another way, which depends on the driver you're working with, would be to use fetch_size. You can see a Python example here and a Java example here.
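A minimal sketch of the fetch_size approach with the DataStax Python driver (cassandra-driver); the keyspace my_keyspace, the table sensor_data, and the columns id and expires_at are hypothetical, and it assumes the partition key id alone is enough to address each row for the delete:

from datetime import datetime

from cassandra.cluster import Cluster
from cassandra.query import SimpleStatement

cluster = Cluster(["127.0.0.1"])          # illustrative contact point
session = cluster.connect("my_keyspace")  # hypothetical keyspace

# fetch_size makes the driver pull the result set in pages instead of all at once
query = SimpleStatement("SELECT id, expires_at FROM sensor_data", fetch_size=1000)
delete = session.prepare("DELETE FROM sensor_data WHERE id = ?")

now = datetime.utcnow()  # the driver returns timestamp columns as naive UTC datetimes
for row in session.execute(query):  # iterating transparently fetches the next page
    if row.expires_at is not None and row.expires_at < now:
        session.execute(delete, (row.id,))

cluster.shutdown()

Because only one page is held in memory at a time, this avoids pulling the whole 20-million-row table into the client at once.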
