What is the best means of solving the following problem:
I have a Sybase ASE database which is used as an OLTP server. There is a lot of data inserted into the database each day and as a result the 'live' tables hold only the last n days of data (n can vary from table to table).
I would like to introduce a Sybase IQ server as a Decision Support Server holding all of the previous days' data for reporting purposes.
I would like a nightly job which would "sync" the Sybase IQ tables with those in ASE, i.e. insert all new rows and update all changed rows, but NOT delete any of the rows outside of the n days that the live table represents.
All ideas welcome!!!
You have to develop an ETL (Extract Transform Load) process.
There are a lot of commercial and free ETL products, but I think the best approach in this case is to:
Create an RS ASE -> ASE replication (direct ASE -> IQ replication would perform badly)
Modify the delete function string to separate out the delete operations (so the n-day purges on the live tables are not applied to the replicate)
Periodically truncate and re-insert the IQ tables from the second ASE db (UPDATE performance is very poor in IQ) via a linked server connection, along the lines of the sketch below
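A minimal sketch of the nightly IQ refresh under that setup, assuming an IQ reporting table orders_hist and a proxy (linked server) table orders_proxy that points at the replicate ASE database; both names are illustrative:
-- Rebuild one IQ reporting table from the replicate ASE database
-- (full refresh, since updates are expensive in IQ; the replicate ASE db
-- keeps the full history because deletes are not replicated).
TRUNCATE TABLE orders_hist;

INSERT INTO orders_hist
SELECT * FROM orders_proxy;

COMMIT;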
We load data from on-prem database servers to Azure Data Lake Storage Gen2 using Azure Data Factory and Databricks, and store it as Parquet files. On every run, we get only the new and modified data since the last run and UPSERT it into the existing Parquet files using the Databricks MERGE statement.
Now we are trying to move this data from the Parquet files into Azure Synapse. Ideally, I would like to do this:
Read the incremental load data into an external table (CETAS or COPY INTO; a rough sketch follows this list).
Use the above as a staging table.
Merge the staging table with the production table.
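For the first step, a COPY INTO load into a pre-created staging table might look roughly like this; the storage URL and credential are placeholders, not values from the question:
-- Load the incremental Parquet files into an existing staging table.
-- The account name and path below are illustrative placeholders.
COPY INTO dbo.stg_DimProduct
FROM 'https://<storageaccount>.dfs.core.windows.net/datalake/dimproduct/*.parquet'
WITH (
    FILE_TYPE  = 'PARQUET',
    CREDENTIAL = (IDENTITY = 'Managed Identity')
);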
The problem is that the MERGE statement is not available in Azure Synapse. Here is the solution Microsoft suggests for incremental loads:
CREATE TABLE dbo.[DimProduct_upsert]
WITH
( DISTRIBUTION = HASH([ProductKey])
, CLUSTERED INDEX ([ProductKey])
)
AS
-- New rows and new versions of rows
SELECT s.[ProductKey]
, s.[EnglishProductName]
, s.[Color]
FROM dbo.[stg_DimProduct] AS s
UNION ALL
-- Keep rows that are not being touched
SELECT p.[ProductKey]
, p.[EnglishProductName]
, p.[Color]
FROM dbo.[DimProduct] AS p
WHERE NOT EXISTS
( SELECT *
FROM [dbo].[stg_DimProduct] s
WHERE s.[ProductKey] = p.[ProductKey]
)
;
RENAME OBJECT dbo.[DimProduct] TO [DimProduct_old];
RENAME OBJECT dbo.[DimProduct_upsert] TO [DimProduct];
Basically, this drops and re-creates the production table with CTAS. It will work fine with small dimension tables, but I'm apprehensive about large fact tables with hundreds of millions of rows and indexes. Any suggestions on the best way to do incremental loads for really large fact tables? Thanks!
Until SQL MERGE is officially supported, the recommended way forward to update target tables is to use T-SQL INSERT/UPDATE statements between the delta records and the target table, along the lines of the sketch below.
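A minimal sketch of that approach, reusing the stg_DimProduct / DimProduct names from the question and assuming your dedicated pool accepts ANSI joins in UPDATE statements:
-- Update rows that already exist in the target.
UPDATE t
SET    t.[EnglishProductName] = s.[EnglishProductName],
       t.[Color]              = s.[Color]
FROM   dbo.[DimProduct]     AS t
JOIN   dbo.[stg_DimProduct] AS s
       ON s.[ProductKey] = t.[ProductKey];

-- Insert rows that are new.
INSERT INTO dbo.[DimProduct] ([ProductKey], [EnglishProductName], [Color])
SELECT s.[ProductKey], s.[EnglishProductName], s.[Color]
FROM   dbo.[stg_DimProduct] AS s
WHERE  NOT EXISTS
       ( SELECT 1
         FROM   dbo.[DimProduct] AS p
         WHERE  p.[ProductKey] = s.[ProductKey] );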
Alternatively, you can also use Mapping Data Flows (in ADF) to emulate SCD transactions for dimensional/fact data load.
I am building a data lake to integrate multiple data sources for advanced analytics.
In the beginning, I selected HDFS as the data lake storage, but I have a requirement to handle updates and deletes in the data sources, which I have to synchronise with the data lake.
To work with the immutable nature of the data lake, I will use the LastModifiedDate from the data source to detect that a record has been updated, and insert that record into the data lake with the current date. The idea is then to select the record with max(date).
However, I am not able to understand how I will detect deleted records from the sources, and what I should do about them in the data lake.
Should I use another data store like Cassandra and execute a delete command? I am afraid that would lose the immutability property.
Can you please suggest a good practice for this situation?
1. Question - Detecting deleted records from data sources
Detecting deleted records from data sources requires that your data sources support this. It is best if deletion is only done logically, e.g. with a change flag. For some databases it is also possible to track deleted rows (see, for example, Change Data Capture for SQL Server; a minimal sketch follows). Some ETL solutions like Informatica also offer CDC (Change Data Capture) capabilities.
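A minimal sketch of the SQL Server route, assuming a hypothetical dbo.Customer source table (dbo_Customer is the default capture-instance name SQL Server generates for it):
-- Enable Change Data Capture on the database and on the source table.
EXEC sys.sp_cdc_enable_db;

EXEC sys.sp_cdc_enable_table
     @source_schema = N'dbo',
     @source_name   = N'Customer',
     @role_name     = NULL;

-- Later, pull only the deletes (__$operation = 1) from the change table.
DECLARE @from_lsn binary(10) = sys.fn_cdc_get_min_lsn('dbo_Customer');
DECLARE @to_lsn   binary(10) = sys.fn_cdc_get_max_lsn();

SELECT *
FROM   cdc.fn_cdc_get_all_changes_dbo_Customer(@from_lsn, @to_lsn, 'all')
WHERE  __$operation = 1;  -- 1 = delete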
2. Question - Handling changed data in a big data solution
There are different approaches. Of course you can use a key-value store, which adds some complexity to the overall solution. First you have to clarify whether it is also of interest to track changes and deletes. You could consider loading all data (new/changed/deleted) into daily partitions and finally building a current image (the data as it is in your data source). Also consider solutions like Databricks Delta, which address these topics without the need for an additional store. For example, you are able to do an upsert on Parquet files with Delta as follows:
MERGE INTO events
USING updates
ON events.eventId = updates.eventId
WHEN MATCHED THEN
  UPDATE SET events.data = updates.data
WHEN NOT MATCHED THEN
  INSERT (date, eventId, data) VALUES (updates.date, updates.eventId, updates.data)
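If the change feed also carries a soft-delete flag (an assumption here, named updates.deleted), the same MERGE can apply deletes in the same pass:
-- Assumes updates.deleted marks records that were removed in the source.
MERGE INTO events
USING updates
ON events.eventId = updates.eventId
WHEN MATCHED AND updates.deleted = true THEN
  DELETE
WHEN MATCHED THEN
  UPDATE SET events.data = updates.data
WHEN NOT MATCHED AND updates.deleted = false THEN
  INSERT (date, eventId, data) VALUES (updates.date, updates.eventId, updates.data)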
If your solution also requires low-latency access via a key (e.g. to support an API), then a key-value store like HBase, Cassandra, etc. would be helpful.
This is usually a constraint when creating a data lake in Hadoop: one can't just update or delete records in it. There is one approach that you can try:
When you are adding lastModifiedDate, you can also add one more column named status. If a record is deleted, mark its status as Deleted. The next time you want to query the latest active records, you will be able to filter the deleted ones out, as in the sketch below.
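A minimal sketch of that read, assuming an illustrative append-only table customer_raw keyed by customer_id:
-- Build the current image: latest version per key, excluding deleted rows.
SELECT *
FROM (
    SELECT r.*,
           ROW_NUMBER() OVER (PARTITION BY r.customer_id
                              ORDER BY r.lastModifiedDate DESC) AS rn
    FROM   customer_raw r
) latest
WHERE  latest.rn = 1
  AND  latest.status <> 'Deleted';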
You can also use Cassandra or HBase (any NoSQL database) if you are performing ACID operations on a daily basis. If not, the first approach would be your ideal choice for creating a data lake in Hadoop.
I'm new to Cassandra and the column-family database world. I have a scenario where I need to move data from one column-family database, such as ScyllaDB, to another column-family database, DataStax Cassandra. The amount of data to be transferred will be in the millions of rows, and I want this transfer to happen at a regular interval, let's say every 2 minutes. I was exploring the sstableloader option, with no luck yet. Is there any other, better approach for my scenario? Any suggestions will be highly appreciated.
(Disclaimer: I'm a ScyllaDB employee)
There are 3 ways to accomplish this:
Dual writes from the client side, with client-side timestamps, to both DBs.
Use the sstableloader tool to migrate the data from one DB to the other.
Use the nodetool refresh command to load sstables.
You can read more about migration from Cassandra to Scylla in the following doc, which also describes how to perform dual writes from the client side (option 1), with code examples, and how to use the sstableloader tool (option 2):
http://docs.scylladb.com/procedures/cassandra_to_scylla_migration_process/
For the nodetool refresh usage you may look here: http://docs.scylladb.com/nodetool-commands/refresh/
A common approach is to instrument the client to write to both databases in parallel, instead of synchronizing the two databases. This keeps the two databases in sync on every single write.
Are there any bandwidth limits or throttling for Azure SQL Data Warehouse extracts? Are there any connection string settings that optimize how fast we can extract data via a SELECT query?
From SSIS on a VM in the same Azure region as SQL DW, if I run a SELECT * query to extract millions of rows via OLE DB with the default connection string (default packet size), I see it use about 55 Mbps of bandwidth. If I add Packet Size=32767, I see it use about 125 Mbps. Is there any way to make it go faster? Are there other connection string settings to be aware of?
Incidentally, I was able to get up to about 500Mbps bandwidth coming from SQL DW if I run multiple extracts in parallel. But I can't always break one query into several parallel queries. Sometimes I just need one query to extract the data faster.
Of course Polybase CETAS (CREATE EXTERNAL TABLE AS SELECT) is much more efficient at extracting data. But that's not a good fit in all extract scenarios. For example, if I want to put Analysis Services on top of Azure SQL DW, I can't really involve a CETAS statement during cube processing so Polybase doesn't help me there.
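For reference, a CETAS export looks roughly like this; the external data source, file format, schema, and table names are illustrative and must already be defined in the data warehouse:
-- Export the result of a SELECT to external storage via Polybase.
CREATE EXTERNAL TABLE ext.FactSales_export
WITH (
    LOCATION    = '/exports/factsales/',
    DATA_SOURCE = MyBlobStorage,     -- pre-created EXTERNAL DATA SOURCE
    FILE_FORMAT = MyParquetFormat    -- pre-created EXTERNAL FILE FORMAT
)
AS
SELECT * FROM dbo.FactSales;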
At the moment your best option is to run multiple extracts in parallel, optimizing the packet size as you have described. For SSAS on top of SQLDW your best option would be to use parallel partition processing.
I've read many posts and articles comparing SQL Azure and the Table Service, and most of them say that the Table Service is more scalable than SQL Azure.
http://www.silverlight-travel.com/blog/2010/03/31/azure-table-storage-sql-azure/
http://www.intertech.com/Blog/post/Windows-Azure-Table-Storage-vs-Windows-SQL-Azure.aspx
Microsoft Azure Storage vs. Azure SQL Database
https://social.msdn.microsoft.com/Forums/en-US/windowsazure/thread/2fd79cf3-ebbb-48a2-be66-542e21c2bb4d
https://blogs.msdn.com/b/windowsazurestorage/archive/2010/05/10/windows-azure-storage-abstractions-and-their-scalability-targets.aspx
https://stackoverflow.com/questions/2711868/azure-performance
http://vermorel.com/journal/2009/9/17/table-storage-or-the-100x-cost-factor.html
Azure Tables or SQL Azure?
http://www.brentozar.com/archive/2010/01/sql-azure-frequently-asked-questions/
https://code.google.com/p/lokad-cloud/wiki/FatEntities
Sorry for http, I'm new user >_<
But the http://azurescope.cloudapp.net/BenchmarkTestCases/ benchmark shows a different picture.
My case. Using SQL Azure: one table with many inserts, about 172,000,000 per day (2,000 per second). Can I expect good performance for inserts and selects when I have 2 million records, or 9999....9 billion records, in one table?
Using Table Service: one table with some number of partitions. The number of partitions can be large, very large.
Question #1: does the Table Service have limitations or best practices for creating many, many, many partitions in one table?
Question #2: in a single partition I have a large number of small entities, like in the SQL Azure example above. Can I expect good performance for inserts and selects when I have 2 million records, or 9999 billion entities, in one partition?
I know about sharding and partitioning solutions, but this is a cloud service; isn't the cloud supposed to be powerful and do all that work without my own code?
Question #3: can anybody show me benchmarks for querying large amounts of data in SQL Azure and the Table Service?
Question #4: maybe you could suggest a better solution for my case.
Short Answer
I haven't seen lots of partitions cause Azure Tables (AZT) problems, but I don't have this volume of data.
The more items in a partition, the slower queries in that partition will be.
Sorry, no, I don't have the benchmarks.
See below
Long Answer
In your case I suspect that SQL Azure is not going to work for you, simply because of the limits on the size of a SQL Azure database. If each of the rows you're inserting is 1 KB including indexes, you will hit the 50 GB limit in about 300 days. It's true that Microsoft are talking about databases bigger than 50 GB, but they've given no time frame for that. SQL Azure also has a throughput limit that I'm unable to find at this point (I'm pretty sure it's less than what you need, though). You might be able to get around this by partitioning your data across more than one SQL Azure database.
The advantage SQL Azure does have, though, is the ability to run aggregate queries. In AZT you can't even write a select count(*) from customer without loading each customer.
AZT also has a limit of 500 transactions per second per partition, and a limit of "several thousand" per second per account.
I've found that choosing what to use for your partition key (PK) and row key (RK) depends on how you're going to query the data. If you want to access each of these items individually, simply give each row its own partition key and a constant row key. This will mean that you have lots of partitions.
For the sake of example, say the rows you were inserting were orders, and the orders belong to a customer. If it were more common for you to list orders by customer, you would have PK = CustomerId, RK = OrderId. This would mean that to find the orders for a customer you simply have to query on the partition key. To get a specific order you'd need to know both the CustomerId and the OrderId. The more orders a customer had, the slower finding any particular order would be.
If you just needed to access orders by OrderId, then you would use PK = OrderId, RK = string.Empty and put the CustomerId in another property. You can still write a query that brings back all orders for a customer, but because AZT doesn't support indexes other than on PartitionKey and RowKey, any query that doesn't use the PartitionKey (and sometimes even one that does, depending on how you write it) will cause a table scan. With the number of records you're talking about, that would be very bad.
In all of the scenarios I've encountered, having lots of partitions doesn't seem to worry AZT too much.
Another way you can partition your data in AZT that is not often mentioned is to put the data in different tables. For example, you might want to create one table for each day. If you want to run a query for last week, run the same query against the 7 different tables. If you're prepared to do a bit of work on the client end you can even run them in parallel.
Azure SQL can easily ingest that much data and more. Here's a video I recorded months ago that shows a sample (available on GitHub) demonstrating one way you can do that.
https://www.youtube.com/watch?v=vVrqa0H_rQA
Here's the full repo:
https://github.com/Azure-Samples/streaming-at-scale/tree/master/eventhubs-streamanalytics-azuresql