Data loading in Azure Data Warehouse: solution required

I would like to store the data in the DWH in a consistent manner. Every week I need to load data into Azure DW from an on-prem SQL DB.
The thing is that the table I receive every week has a primary key. An example of the table:
I want to design the load in such a way that all 4 records get stored in the DW.
Shall I use a surrogate key, or is there a better way?

If this is staged source data, I wouldn't add a surrogate key; typically you only create surrogate keys in your dimensional model.
If your data volume is growing semi-exponentially every time the process runs (unlikely), I would process it as a CTAS; otherwise I would do a plain insert:
INSERT INTO dbo.table
SELECT *, SYSUTCDATETIME() AS RECORD_INSERT_DATE FROM dbo.table_external_table
So you would just insert all incoming data and add a timestamp for the insert date. Your NK (natural key) and the timestamp together become the unique key on the table.
If your requirements involve easily returning the current version of a record, you could use a Type II SCD pattern: set an end date on the most recent version of the record, and a start date plus an active flag on the new version.
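As a rough illustration of the append-with-timestamp approach, here is a minimal Python sketch. It uses an in-memory SQLite database purely as a stand-in for the warehouse, and the table name `staged`, the columns `nk` and `payload`, and both helper functions are hypothetical names chosen for the example. It shows two weekly loads of the same keys producing four stored rows, and one possible way to return the latest version per natural key.

```python
import sqlite3

def load_batch(conn, rows, load_ts):
    """Append every incoming row and stamp it with the load timestamp."""
    conn.executemany(
        "INSERT INTO staged (nk, payload, record_insert_date) VALUES (?, ?, ?)",
        [(nk, payload, load_ts) for nk, payload in rows],
    )

def current_versions(conn):
    """Return the most recently inserted row for each natural key."""
    return conn.execute(
        """
        SELECT s.nk, s.payload, s.record_insert_date
        FROM staged AS s
        JOIN (SELECT nk, MAX(record_insert_date) AS latest
              FROM staged GROUP BY nk) AS m
          ON m.nk = s.nk AND m.latest = s.record_insert_date
        ORDER BY s.nk
        """
    ).fetchall()

# SQLite is only a stand-in here; the natural key plus the insert
# timestamp form the unique key, as described in the answer.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE staged (nk INTEGER, payload TEXT, record_insert_date TEXT, "
    "PRIMARY KEY (nk, record_insert_date))"
)
load_batch(conn, [(1, "a"), (2, "b")], "2024-01-01T00:00:00")    # week 1
load_batch(conn, [(1, "a2"), (2, "b2")], "2024-01-08T00:00:00")  # week 2
```

All four rows are kept, and `current_versions(conn)` returns only the week 2 rows. A Type II SCD table would trade this query for extra bookkeeping (end date, active flag) at load time.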

Related

Best practices to store every minute and select from database latest 24h data only?

The task is to permanently record new data to a database every minute and then, occasionally, to read only the latest 24 hours of data, using Python.
The only approach I know:
create a script A that inserts into a MariaDB database table, one new row per minute, with a timestamp as a field value
create a script B that reads from the database table, using WHERE and timestamp values
The problem is that there are 2 restrictions:
it is not allowed to have more than 10,000 rows in one database table
it is not allowed to delete any rows
How can I fulfill the task and meet both restrictions? Are there best practices?
Thanks!
You can create a new table whenever the current one is full (every X days). Name each table after its first timestamp value.
With this solution, your B script needs to work like this:
List all tables
Find the tables that cover the time range you are looking for
Run your SQL query against all these tables using UNION ALL
You can do it in a single SQL query for optimisation, or in a script using multiple queries for simplicity.
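The steps above can be sketched in Python roughly as follows. This is only an illustration under several assumptions: SQLite stands in for MariaDB, timestamps are plain integers (for example minutes since some epoch), tables are named `data_<first timestamp>`, and `MAX_ROWS`, `list_tables`, `insert_reading` and `read_since` are hypothetical names invented for the example.

```python
import sqlite3

MAX_ROWS = 3  # stands in for the 10,000-row limit so the demo stays small

def list_tables(conn):
    """Return all data tables, ordered by the first timestamp in their name."""
    names = [r[0] for r in conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table'")]
    data = [n for n in names if n.startswith("data_")]
    return sorted(data, key=lambda n: int(n.split("_")[1]))

def insert_reading(conn, ts, value):
    """Script A: insert into the newest table, starting a new one when full."""
    tables = list_tables(conn)
    if tables:
        newest = tables[-1]
        count = conn.execute(f"SELECT COUNT(*) FROM {newest}").fetchone()[0]
    if not tables or count >= MAX_ROWS:
        newest = f"data_{int(ts)}"  # table is named after its first timestamp
        conn.execute(f"CREATE TABLE {newest} (ts INTEGER, value TEXT)")
    conn.execute(f"INSERT INTO {newest} (ts, value) VALUES (?, ?)", (ts, value))

def read_since(conn, since_ts):
    """Script B: query only the tables that can contain rows >= since_ts."""
    tables = list_tables(conn)
    starts = [int(t.split("_")[1]) for t in tables]
    # A table can be skipped when the next table already starts before since_ts.
    relevant = [t for i, t in enumerate(tables)
                if i + 1 == len(tables) or starts[i + 1] >= since_ts]
    if not relevant:
        return []
    union = " UNION ALL ".join(
        f"SELECT ts, value FROM {t} WHERE ts >= ?" for t in relevant)
    return conn.execute(f"{union} ORDER BY ts",
                        [since_ts] * len(relevant)).fetchall()

conn = sqlite3.connect(":memory:")
for minute in range(7):
    insert_reading(conn, minute, f"v{minute}")
```

After the loop there are three tables (`data_0`, `data_3`, `data_6`), and `read_since(conn, 4)` touches only the last two. No row is ever deleted and no table exceeds the limit.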

Incremental load without date or primary key column using azure data factory

I have a source, say a SQL DB or an Oracle database, and I want to pull the table data into an Azure SQL database. The problem is that the table has neither a date column recording when rows were inserted nor a primary key column. Is there any other way to perform this operation?
One way of doing it semi-incrementally is to partition the table by a fairly stable column in the source table; you can then use a mapping data flow to compare the partitions (this can be done with row counts, aggregations, HASHBYTES, etc.). On each load, you store the comparison output per partition as metadata somewhere, so that you can compare against it the next time you load. That way you reload only the partitions that have changed since your last load.
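To make the comparison idea concrete, here is a small Python sketch of per-partition fingerprints. It is plain Python, not an Azure Data Factory mapping data flow, and the function names, the `region` partition column and the sample rows are all assumptions made for the illustration. Each partition gets a row count plus an order-independent hash, and only partitions whose fingerprint differs from the stored one would be reloaded.

```python
import hashlib
from collections import defaultdict

def partition_fingerprints(rows, partition_col):
    """Return {partition value: (row count, order-independent hash)}."""
    digests = defaultdict(list)
    for row in rows:
        # Canonical text form of the row, independent of column order.
        canonical = repr(sorted(row.items())).encode("utf-8")
        digests[row[partition_col]].append(hashlib.sha256(canonical).hexdigest())
    result = {}
    for part, hashes in digests.items():
        # Sorting the row hashes makes the result independent of row order.
        combined = hashlib.sha256("".join(sorted(hashes)).encode("utf-8"))
        result[part] = (len(hashes), combined.hexdigest())
    return result

def changed_partitions(previous, current):
    """Partitions that are new, removed, or whose fingerprint differs."""
    return sorted(p for p in previous.keys() | current.keys()
                  if previous.get(p) != current.get(p))

# Hypothetical source data at two consecutive loads.
last_load = [{"region": "EU", "name": "a"}, {"region": "US", "name": "b"}]
this_load = [{"region": "EU", "name": "a"}, {"region": "US", "name": "b2"},
             {"region": "APAC", "name": "c"}]

stored = partition_fingerprints(last_load, "region")   # metadata kept from last load
fresh = partition_fingerprints(this_load, "region")
to_reload = changed_partitions(stored, fresh)
```

Here `to_reload` contains only `APAC` (new) and `US` (modified); the unchanged `EU` partition is skipped.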

How to find the delta difference for a table in cassandra using uuid column type

I have the following table in my Cassandra DB, and I want to find the delta using a Cassandra query. For example, if I perform any insert, update, or delete operation on the table, I should be able to show which rows were affected.
Let's say that in the first instance I insert 10 rows; if I then take the delta, the output should show only that 10 rows were inserted. Likewise, if we modify or delete some rows, those changes should be captured.
The next time we run the query, it should ideally return 0 rows, as we have not inserted, modified, or deleted anything in the meantime.
Here is the table:
CREATE TABLE datainv (
datainv_account_id uuid,
datainv_run_id uuid,
id uuid,
datainv_summary text,
json text,
number text,
PRIMARY KEY (datainv_account_id, datainv_run_id));
I have searched the internet a lot, but most of the solutions are based on timeuuid, whereas in this case I have only uuid columns. So I have not found any way to achieve the same use case using uuid.
It's not so easy to generate a diff between 2 table states in Cassandra, because you can't easily detect whether new partitions have been inserted. You can implement something based on a timeuuid or a timestamp as a clustering column; in this case you'll be able to filter out the data written since the latest change, because these values are ordered, unlike uuid values, which are completely random. But it still requires that you perform a full scan of the whole table. Plus it won't detect deletions...
Theoretically you can implement this with Spark as follows:
read all primary key values and store this data in some other table/on disk;
next time, read all primary key values and find the difference between the original set of primary keys and the new set. For example, do a full outer join and treat the presence of None on the left as an addition, and the presence of None on the right as a deletion;
store the new set of primary keys in a separate table/on disk, replacing (truncating) the previous version.
But it will consume quite a lot of resources.
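The snapshot-and-compare idea can be illustrated without Spark. The sketch below uses plain Python dictionaries in place of a Spark full outer join, so it only suits data that fits in memory. The row hash used to detect updates is an addition that is not part of the answer above, and `snapshot`, `diff` and the sample rows are hypothetical.

```python
import hashlib

KEY_COLS = ("datainv_account_id", "datainv_run_id")  # primary key of datainv

def snapshot(rows, key_cols=KEY_COLS):
    """Map each primary key tuple to a hash of the full row."""
    snap = {}
    for row in rows:
        key = tuple(row[c] for c in key_cols)
        canonical = repr(sorted(row.items())).encode("utf-8")
        snap[key] = hashlib.sha256(canonical).hexdigest()
    return snap

def diff(previous, current):
    """Compare two snapshots; equivalent in spirit to a full outer join on the key."""
    return {
        "inserted": sorted(current.keys() - previous.keys()),  # key only in new set
        "deleted": sorted(previous.keys() - current.keys()),   # key only in old set
        "updated": sorted(k for k in current.keys() & previous.keys()
                          if current[k] != previous[k]),        # same key, new content
    }

def row(account, run, number):
    return {"datainv_account_id": account, "datainv_run_id": run, "number": number}

# Hypothetical table contents at two points in time (strings stand in for uuids).
before = snapshot([row("a1", "r1", "1"), row("a1", "r2", "2"), row("a3", "r1", "9")])
after = snapshot([row("a1", "r1", "1"), row("a1", "r2", "20"), row("a2", "r1", "3")])
delta = diff(before, after)
```

`delta` reports one inserted, one deleted and one updated key, and comparing `after` with itself yields three empty lists, which matches the "0 changes on the next run" requirement.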

Cassandra data order without passing where condition

I am struggling with data order of Cassandra data. I have a table like this
tbl_data
- yymmddhh (text)
- data (text)
The partition key is 'yymmddhh'.
I am adding data like this:
'16-11-17-01', 'a'
'16-11-17-01', 'b'
'16-11-17-02', 'c'
'16-11-17-03', 'xyz'
'16-11-17-03', 'e'
'16-11-17-03', 'f'
select * from tbl_data limit 10;
I am expecting the data in the order in which I added it, but I get it back like this:
'16-11-17-03', 'f'
'16-11-17-03', 'e'
'16-11-17-01', 'a'
i.e. latest record first, or in some random order. I need the data in the same order in which I added it. I cannot figure out the default ordering of the data in my case. Also, I don't want to pass the partition key in the WHERE condition, because remembering that value is an overhead for me. Kindly suggest a solution.
I'm afraid you will struggle forever on this.
As per the comments, you can't decide the order "outside" a partition, unless you really understand what you're doing by changing the partitioner.
Please read the suggested link, and this and this SO answer, to understand why you are getting your records in this specific order (yes, they ARE ordered...).
A possible solution, however, is to add a timestamp clustering key, and change the partition key to a simpler "yymmdd":
tbl_data
- yymmdd (timestamp)
- hhmmssMMM (timestamp)
- data (text)
Now you'd store data on a day-by-day basis (that is, you need to know the day you are querying data for), and the data inside each partition (that is, each day) is sorted by the timestamp column, so for your requirements you'd store the insertion time of the record there.
Now, if you don't insert data every day, you really need to keep track of the insertion dates in another (very simple) table:
CREATE TABLE inserted_days (
yymmdd timestamp PRIMARY KEY
);
Issuing a
SELECT * FROM inserted_days
would scan the whole table, returning records in random order (from your app's point of view, so you need to sort them), but here we are talking about 365 records per year, which is nothing you need to worry about. It's easy to do and you would not incur unmanageable overhead.
HTH.
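The read path described in this answer can be modelled in a few lines of Python. This is only an in-memory model, not Cassandra driver code: the class `DayPartitionedLog` and its methods are hypothetical, a dictionary stands in for the day partitions, and a set stands in for the `inserted_days` table (which, like the real table, has no useful order of its own).

```python
from collections import defaultdict
from datetime import datetime

class DayPartitionedLog:
    """In-memory model: one partition per day, rows clustered by timestamp."""

    def __init__(self):
        self.partitions = defaultdict(list)  # day -> [(insertion time, data)]
        self.inserted_days = set()           # models the inserted_days table

    def insert(self, ts, data):
        day = ts.date()
        self.partitions[day].append((ts, data))
        # Keep each partition sorted by the clustering column (the timestamp).
        self.partitions[day].sort(key=lambda r: r[0])
        self.inserted_days.add(day)

    def read_all_in_order(self):
        out = []
        # The list of days comes back unordered, so the client sorts it,
        # then reads one day partition at a time.
        for day in sorted(self.inserted_days):
            out.extend(data for _, data in self.partitions[day])
        return out

log = DayPartitionedLog()
log.insert(datetime(2016, 11, 17, 1, 0, 0), "a")
log.insert(datetime(2016, 11, 17, 1, 0, 5), "b")
log.insert(datetime(2016, 11, 19, 3, 0, 0), "c")  # no data on the 18th
ordered = log.read_all_in_order()
```

Because the stored timestamp is the insertion time, `ordered` reproduces the insertion order even though one day has no data at all.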

Data modelling of raw data for further transformation in cassandra

I am working on a system for storing and processing time series data from a couple of plants. Every plant has a different number of raw measurement values, each of them represented as a key-value pair.
The raw data needs to be preprocessed to obtain semantics. I also need to save the raw data, because the transformation process should be configurable. Since I am new to NoSQL databases and Cassandra, I searched for resources on the web and found the weather station example (which is described similarly in other resources, too).
My requirements are similar to this example, but as an extension I need a way to store a variable number of measurement values (key-value pairs) per plant. I also know that my table model depends heavily on the queries I want to run against it. The most common queries will be:
Get all values for one key for a specific time (range) and plant.
Get all values for multiple keys for a specific time (range) and plant.
My question now is: what would a table structure look like that best fits these requirements?
I thought about something like the following, but I don't know whether it has drawbacks:
CREATE TABLE values_per_day (
plant_id text,
date text,
event_time timestamp,
key text,
value text,
PRIMARY KEY ((plant_id, date), event_time, key)
);
The recommendation for Cassandra is to start with the queries you want to perform. For each query, consider its inputs, which indicate what data you want it to return. For each query you should have a table that has the inputs to the query as its primary key. If you want to query for a range of values, that value should be a clustering key (not the partition key) of the primary key, with the other inputs forming the partition key. If you want to query for very long value ranges, consider slicing that value into buckets.
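As an illustration of bucketing, the following Python sketch splits a time range into day buckets matching the `(plant_id, date)` partition key of the `values_per_day` table from the question, and prepares one parameter tuple per bucket. The helper names and the `YYYY-MM-DD` format of the `date` column are assumptions; no driver calls are made, and the CQL text is only held in a string.

```python
from datetime import datetime, timedelta

# Range query against a single day partition of values_per_day.
CQL = ("SELECT event_time, key, value FROM values_per_day "
       "WHERE plant_id = ? AND date = ? AND event_time >= ? AND event_time <= ?")

def day_buckets(start, end):
    """Return the 'YYYY-MM-DD' bucket of every day touched by [start, end]."""
    day = start.date()
    buckets = []
    while day <= end.date():
        buckets.append(day.isoformat())
        day += timedelta(days=1)
    return buckets

def query_params(plant_id, start, end):
    """One parameter tuple per day bucket, to be bound to CQL one at a time."""
    return [(plant_id, bucket, start, end) for bucket in day_buckets(start, end)]

start = datetime(2024, 1, 30, 22, 0)
end = datetime(2024, 2, 1, 3, 0)
params = query_params("plant-1", start, end)
```

A range spanning three calendar days yields three single-partition queries, whose results the client concatenates. Filtering on `key` would then happen client-side, or in a second table clustered by `key` first, depending on which query matters most.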

Resources