Is there a way to configure a Cassandra cluster with data centre splitting / NetworkTopologyStrategy / ReplicationFactor 1? Basically, I want to keep the data in its originating node but still be able to query it all from any node. The business use case is:
I have a group of customers, each a different firm with data in its own datacentres. I want to do some cross-firm data analysis without usable data leaving their premises, i.e. I can't get them all to load their data onto a central server. I am looking for a platform that allows me to deploy software to each firm so that I can do distributed comparisons of their data without them having to send me their data in bulk (much of it is prohibited for distribution). Data transferred in a non-readable wire format as part of a distributed "join" will be fine as long as I'm not replicating the data to the other customers' data centres.
Yes, you can have a replication factor of 1. However, ensuring that each item of data is on the node at a particular site requires additional work. You will need to have a customer ID as the partition key for every table, and write a custom partitioner that maps customer ID to a token for that customer. And you will have to manually configure each node to use only the one token for its customer.
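To make the token idea concrete, here is a minimal sketch of the lookup such a partitioner would perform. It is only an illustration in Python: a real Cassandra partitioner is a Java class, and the customer IDs, token values and node names below are hypothetical. The single token per node would typically be set through initial_token in cassandra.yaml.

```python
# Hypothetical assignment: each customer gets one fixed token, and the node
# deployed at that customer's site is configured to own exactly that token.
CUSTOMER_TOKENS = {"firm_a": 0, "firm_b": 1000, "firm_c": 2000}
NODE_FOR_TOKEN = {0: "node-firm-a", 1000: "node-firm-b", 2000: "node-firm-c"}


def token_for(customer_id: str) -> int:
    """Return the fixed token for a customer ID (the partition key)."""
    try:
        return CUSTOMER_TOKENS[customer_id]
    except KeyError:
        # An unknown customer must not silently land on some other site's node.
        raise ValueError(f"no token assigned to customer {customer_id!r}")


def owning_node(customer_id: str) -> str:
    """Return the node that would store every row of this customer."""
    return NODE_FOR_TOKEN[token_for(customer_id)]
```

With a replication factor of 1 and one token per node, every row whose partition key is "firm_b" would map to token 1000 and therefore stay on the node at that firm's site.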
Related
I have a use case and need help with the best available approach.
I use Azure Databricks to create data transformations and create tables in the presentation layer/gold layer. The underlying data in these tables is in an Azure Storage account.
The transformation logic runs twice daily and updates the gold layer tables.
I have several such tables in the gold layer, e.g. a table to store single customer view data.
An external application from a different system needs access to this data, i.e. the application would initiate an API call for details regarding a customer, and we would need to send back a response with the matching customer details by querying the single customer view table.
Question:
Is the Databricks SQL API the solution for this?
As it is a Spark table, I assume the response will not be quick. Is this correct, or is there a better solution for this?
Is Databricks designed for such use cases, or is a better approach to copy this table (gold layer) into an operational database such as Azure SQL DB after the transformations are done in PySpark via Databricks?
What are the cons of this approach? One would be that the Databricks cluster should be up and running all the time, i.e. use an interactive cluster. Anything else?
It's possible to use Databricks for that, although it depends heavily on the SLAs, i.e. how fast the response should be. Answering your questions in order:
There is no standalone API for executing queries and getting back results (yet). But you can create a thin wrapper using one of the drivers to work with Databricks: Python, Node.js, Go, or JDBC/ODBC.
Response time depends heavily on the size of the data, whether the data is already cached on the nodes, and other factors (partitioning of the data, data skipping, etc.). Databricks SQL Warehouses are also able to cache the results of query execution, so they won't reprocess the data if the same query was already executed.
Storing data in operational databases is also one of the approaches often used by different customers. But it depends heavily on the size of the data and other factors: if you have a huge gold layer, then SQL databases may also not be the best solution from a cost/performance perspective.
For such queries it's recommended to use Databricks SQL, which is more cost-efficient than having an always-running interactive cluster. Also, on some of the cloud platforms there is already support for serverless Databricks SQL, where the startup time is very short (seconds instead of minutes), so if your queries to the gold layer don't happen very often, you can configure the warehouses with auto-termination and pay only when they are used.
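The thin wrapper mentioned in the first point could look roughly like the sketch below. It is written against the generic Python DB-API 2.0 interface and uses an in-memory sqlite3 database as a stand-in so that it runs anywhere; with Databricks, the connection would instead come from one of the drivers listed above. The table name, column names and the get_customer helper are assumptions made for illustration, and the parameter marker ('?' here) depends on the driver.

```python
import sqlite3
from typing import Any, Optional


def get_customer(conn: Any, customer_id: str) -> Optional[dict]:
    """Look up one customer row through any DB-API 2.0 connection."""
    cur = conn.cursor()
    try:
        # Parameterized query: the customer ID is never concatenated into SQL.
        # '?' is sqlite3's marker; other drivers may use a different style.
        cur.execute(
            "SELECT customer_id, name, city FROM single_customer_view "
            "WHERE customer_id = ?",
            (customer_id,),
        )
        row = cur.fetchone()
        if row is None:
            return None
        columns = [col[0] for col in cur.description]
        return dict(zip(columns, row))
    finally:
        cur.close()


def demo_connection() -> sqlite3.Connection:
    """Stand-in for a warehouse connection, with one hypothetical row."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE single_customer_view "
        "(customer_id TEXT PRIMARY KEY, name TEXT, city TEXT)"
    )
    conn.execute(
        "INSERT INTO single_customer_view VALUES ('c-1', 'Ada', 'London')"
    )
    return conn
```

An HTTP endpoint in the external-facing service would then call get_customer and serialize the returned dict, keeping the query shape fixed and the lookup keyed on a single customer.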
What are the ramifications of using each API?
For example, if I am using the SQL API, am I sacrificing ACID, and which part of CAP am I using? How does Azure achieve horizontal scalability?
If I am using the document API or the key-value API, is the internal data layout different?
The internal storage format for the various database APIs in Cosmos DB doesn't have any bearing on ACID or CAP. The API you choose should be driven by your use case and/or your familiarity with it. For instance, both the SQL API and the MongoDB API are document databases. But if you have experience using Mongo, then it's probably a better choice than, say, Gremlin, which is a graph API; Table, which is a key/value store; or Cassandra, which is a wide-column store.
Cosmos DB provides ACID support for data within the same logical partition. With regards to CAP that depends on which consistency level/model you choose. Strong consistency trades "A" availability for "C" consistency. All the other consistency models trade consistency for availability but at varying degrees. For instance, bounded staleness defines an upper bound that data can lag in consistency by time or updates. As that boundary approaches, Cosmos will throttle new writes to allow the replication queue to catch up, ensuring the consistency guarantees are met.
Achieving horizontal scalability is a function of your partitioning strategy. For write-heavy workloads, the objective should be choosing a partition key that will allow writes to be distributed across a wide range of values. For read-heavy workloads, you want queries to be served by one or a bounded number of partitions. If it's both, consider using change feed to copy data into two or more containers as needed, provided the cost of copying the data is lower than the cost of serving it from a single container that would result in cross-partition queries. At the end of the day, choosing a good partition key requires testing under production levels of load and data.
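As a rough way to compare candidate partition keys before load testing, the sketch below counts how writes would spread across a fixed number of partitions. The hash used here (SHA-256 modulo the partition count) is an assumption for illustration only and is not the hashing Cosmos DB applies internally; the key values are made up.

```python
import hashlib
from collections import Counter


def partition_of(key: str, partitions: int) -> int:
    """Map a partition key value to a partition index (illustrative hash)."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % partitions


def write_spread(keys, partitions: int):
    """Count how many writes each partition would receive for these keys."""
    counts = Counter(partition_of(k, partitions) for k in keys)
    return [counts.get(p, 0) for p in range(partitions)]


# A low-cardinality key (two distinct values) can reach at most two partitions.
by_region = write_spread(["US" if i % 2 else "EU" for i in range(1000)], 8)

# A high-cardinality key gives the hash many values to distribute.
by_user = write_spread([f"user-{i}" for i in range(10000)], 8)
```

Comparing by_region with by_user shows why a key with only a handful of values concentrates writes, whatever the total number of partitions.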
I am trying a use case with Cosmos DB where we want to maintain one Cosmos DB but split the data into a US region and a Europe region using some partition key.
And for inserting/updating documents, the application knows which region (US/Europe) the documents go to, so is it possible to point to the right region while inserting/updating a document?
As far as I know, the Cosmos DB global distribution mechanism guarantees consistency of all replica sets.
When you create the distributed Cosmos DB account, you enable geo-redundancy.
You will see the read and write regions listed separately.
Write operations are completed in the write region and replicated to the other read regions to ensure consistency. On the client side, there is no need to point to a specific region to write data. From the perspective of consistency, the data in all regions is supposed to be the same.
For more details, you could refer to this document.
Hope it helps you.
Can you have multiple write regions?
DocumentDB has good built-in features to bring read operations closer to consumers by adding read regions to your DocumentDB account. You can read about it in the documentation: "How to setup Azure Cosmos DB global distribution using the SQL API".
My understanding based on this is that there is always only 1 write region at any given time. I would not bet my thumbs on it, but it's hinted at in documentation. For example in "Connecting to a preferred region using the SQL API":
The SDK will automatically send all writes to the current write region.
All reads will be sent to the first available region in the PreferredLocations list. If the request fails, the client will fail down the list to the next region, and so on.
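The read behaviour quoted above can be pictured with a small simulation. This is not SDK code: the SDK does this internally, and read_with_fallback and the read_from callback are hypothetical names used only to show the "fail down the list" logic.

```python
def read_with_fallback(preferred_locations, read_from):
    """Try each preferred region in order and return the first success.

    read_from is a hypothetical callable that performs the read against one
    region and raises ConnectionError when that region is unavailable.
    """
    failed = []
    for region in preferred_locations:
        try:
            return region, read_from(region)
        except ConnectionError:
            # This region failed, so move down the list to the next one.
            failed.append(region)
    raise ConnectionError(f"all preferred regions failed: {failed}")
```

Writes are not part of this loop: per the quote, they always go to the current write region regardless of the preferred list.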
What you can do:
Things get more complicated when you also want to distribute writes (especially if you care about consistency and latency). DocumentDB's own documentation suggests you implement this as a combination of multiple accounts, each of which has its own local write region and automatic distribution to a read/fallback node in other regions.
The downside is that your application would have to configure and implement reading from all accounts in your application code and merging the results. Having data well partitioned by geography could help avoid a full fan-out at times, but your DAL would still have to manage multiple stores internally.
This scenario is explained in more detail in the documentation page "Multi-master globally replicated database architectures with Azure Cosmos DB".
I would seriously consider if adding such complexity would be justified, or if distributing just the reads would suffice.
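To give a sense of what the data access layer would take on in the multi-account setup, here is a minimal sketch of a fan-out read with merging. Each account is represented by a plain callable that runs a query and returns a list of documents; those callables, the "_account" tag and the assumption that every document has an "id" field are all hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def query_all_accounts(accounts, query, regions=None):
    """Run one query against several accounts and merge the results.

    accounts maps a region name to a callable(query) -> list of dicts.
    regions optionally narrows the fan-out when the data is known to be
    partitioned by geography.
    """
    targets = {
        name: run for name, run in accounts.items()
        if regions is None or name in regions
    }
    merged = []
    with ThreadPoolExecutor(max_workers=max(1, len(targets))) as pool:
        # Query every selected account in parallel.
        futures = {name: pool.submit(run, query) for name, run in targets.items()}
        for name, future in futures.items():
            for doc in future.result():
                # Tag each document with the account it came from.
                merged.append({**doc, "_account": name})
    # Merge step: a stable global order across accounts.
    return sorted(merged, key=lambda doc: doc["id"])
```

Even this small version has to decide on ordering, error handling and which accounts to hit, which is the kind of complexity worth weighing against distributing only the reads.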
All regions for a given account have the same replicated data. If you want to separate data across regions, you'd need to split it into two accounts.
Given partition A in the US and partition B in the EU – there is very little difference if A and B were under the same account, or under different accounts… the collection/db/account are all just logical wrappers on top of the partition.
We plan to use Cassandra 3.x and we want to allow our customers to connect to Cassandra directly for exporting the data into their data warehouses.
They will connect remotely via ODBC.
Is there any way to prevent a customer from executing huge or bad SELECT statements that will result in a high load on all nodes? We use an extra data center in our replication strategy to which only customers can connect, so the live system will not be affected. But we also want to set up some workers that will run on this shadow system. The most important thing is that a connected remote client will not have any noticeable impact on other remote connections or our local worker jobs. There is a materialized view already, and I want to force customers to get data based on the primary key only (i.e. disallow usage of ALLOW FILTERING). It would also be great if one could limit the number of rows returned (e.g. 1 million) to prevent a pull of all data.
Is there a best practise for this use case?
I know of BlackRock's video on multi-tenant strategy in C*, which advises using tenant_id in the schema. That is what we're doing already, but how can I ensure security/isolation for tenants/customers connected via ODBC? Or do I have to write an API of my own which handles security?
I would recommend exposing access via an API, not via ODBC: you would have greater control over what is executed, and could enforce tenant_id and other checks, such as limits. You can try to use Cassandra's CQL parser to decompose the query and put all the required pieces back.
Theoretically, you could use Apache Calcite, for example. It has an implementation of a JDBC driver that could be used, plus there is an existing Cassandra adapter that you can modify to accomplish your task (mapping authentication to tenant_ids, etc.), but this will be quite a lot of work.
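For the API route, one simple design is to never accept raw CQL from the customer and instead build the statement server-side. The sketch below assumes a hypothetical table events_by_tenant whose primary key is (tenant_id, event_day); the table list, the build_export_query helper and the '?' placeholder style are illustrative assumptions, and the row cap of 1 million comes from the question.

```python
MAX_ROWS = 1_000_000

# Hypothetical whitelist: table name -> key columns the caller must supply
# in addition to tenant_id, which the API always adds itself.
ALLOWED_TABLES = {"events_by_tenant": ("event_day",)}


def build_export_query(tenant_id, table, keys, limit=MAX_ROWS):
    """Build a fully keyed SELECT for one tenant, with a capped LIMIT."""
    if table not in ALLOWED_TABLES:
        raise ValueError(f"table {table!r} is not exported")
    required = ALLOWED_TABLES[table]
    if set(keys) != set(required):
        raise ValueError(f"exactly these key columns are required: {required}")
    if limit < 1:
        raise ValueError("limit must be positive")
    limit = min(limit, MAX_ROWS)  # never return more than the cap
    columns = ("tenant_id",) + required
    where = " AND ".join(f"{column} = ?" for column in columns)
    # tenant_id comes from the authenticated session, not from the request,
    # and the statement never contains ALLOW FILTERING.
    cql = f"SELECT * FROM {table} WHERE {where} LIMIT {limit}"
    params = (tenant_id,) + tuple(keys[column] for column in required)
    return cql, params
```

Because every generated statement restricts the full primary key and carries a bounded LIMIT, a tenant cannot trigger a full scan or read another tenant's partition through this path.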
Hi, we are planning to use Cassandra for an ad server implementation. We have a requirement where a client can create advertisers, publishers and new ads (a typical relational requirement), as well as an interface to monitor analytical data such as ad hits, conversions, etc. We also need an interface where the client is able to apply filters based on master fields such as name, location, etc., as well as on analytical data, such as ad revenue > x, and quite a few other similar criteria.
Is it OK to use a single database like Cassandra to maintain both types of data? As Cassandra has fairly limited querying capacity on fields unless you create views and indexes, we are skeptical. If we keep two separate database products, will it complicate things and add redundancy? How do companies such as Facebook and LinkedIn handle both master and analytical data requirements? Any suggestions are appreciated. Thanks.
The typical solution in Cassandra is to have multiple datacenters: one for online transaction processing, and another for Spark analytical queries. The different datacenters allow you to query them independently, so Spark doesn't impact production. Alternatively, you can denormalize and insert into multiple tables using BATCH.
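As an illustration of the BATCH alternative, the sketch below generates the CQL text for writing the same row into several denormalized tables in one batch. The table and column names are hypothetical, and the '?' markers assume the statement is prepared by whichever driver executes it.

```python
def denormalized_batch(tables, columns):
    """Build a CQL BATCH that inserts the same columns into each table."""
    column_list = ", ".join(columns)
    placeholders = ", ".join("?" for _ in columns)
    inserts = [
        f"  INSERT INTO {table} ({column_list}) VALUES ({placeholders});"
        for table in tables
    ]
    return "\n".join(["BEGIN BATCH", *inserts, "APPLY BATCH;"])


# One query table per access pattern, e.g. lookup by ad and by advertiser.
statement = denormalized_batch(
    ["ads_by_id", "ads_by_advertiser"],
    ["ad_id", "advertiser_id", "title"],
)
print(statement)
```

Each table in the batch would have a different primary key chosen for one filter, so reads stay single-partition while the batch keeps the copies in step.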