Hi, we are planning to use Cassandra for an ad server implementation. We have a requirement where a client can create advertisers, publishers, and new ads (a fairly typical relational requirement), as well as an interface to monitor analytical data such as ad hits, conversions, etc. We also need an interface where the client can apply filters based on master fields such as name and location, as well as on analytical data, e.g. "ad revenue > x" and quite a few similar criteria.
Is it OK to use a single database like Cassandra to maintain both types of data? Since Cassandra has fairly limited querying capability on arbitrary fields unless you create views and indexes, we are skeptical. If we keep two separate database products, will that complicate things and add redundancy? How do companies such as Facebook and LinkedIn handle both master-data and analytical-data requirements? Any suggestions are appreciated. Thanks.
The typical solution in Cassandra is to have multiple datacenters: one for online transaction processing, and another for Spark analytical queries. The separate datacenters let you query them independently, so Spark doesn't impact production. Alternatively, you can denormalize and insert into multiple tables using a logged BATCH.
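As a minimal sketch with the Python driver (cassandra-driver): the datacenter names ('oltp', 'analytics'), the keyspace, and the denormalized tables are all assumptions for illustration, and the tables are assumed to already exist.

```python
# pip install cassandra-driver
import uuid

from cassandra.cluster import Cluster
from cassandra.query import BatchStatement

cluster = Cluster(["127.0.0.1"])
session = cluster.connect()

# Replicate the keyspace to a transactional DC and an analytics DC so that
# Spark jobs running against "analytics" don't compete with production traffic.
session.execute("""
    CREATE KEYSPACE IF NOT EXISTS adserver
    WITH replication = {'class': 'NetworkTopologyStrategy',
                        'oltp': 3, 'analytics': 2}
""")
session.set_keyspace("adserver")

insert_by_advertiser = session.prepare(
    "INSERT INTO ads_by_advertiser (advertiser_id, ad_id, name) VALUES (?, ?, ?)")
insert_by_location = session.prepare(
    "INSERT INTO ads_by_location (location, ad_id, name) VALUES (?, ?, ?)")

advertiser_id, ad_id = uuid.uuid4(), uuid.uuid4()

# A logged BATCH keeps the two denormalized views of the same ad consistent.
batch = BatchStatement()
batch.add(insert_by_advertiser, (advertiser_id, ad_id, "Summer sale"))
batch.add(insert_by_location, ("NYC", ad_id, "Summer sale"))
session.execute(batch)
```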
I am designing a basic ERP (Node.js/Express/PostgreSQL on the backend, Vue 3/Quasar on the front end) that will manage several businesses belonging to different clients; some clients have several businesses with multiple branches. Should I deploy a server/database instance per customer, or should I plan to load-balance and scale a single database in the future?
That is the database multi-tenancy question. Here is a nice article on that.
Personally, I would recommend schema-based multi-tenancy to start (one schema per client). Since it is a basic ERP, it's easier to manage and maintain a single DB, and you can still make client-specific changes to table design where needed.
You can run SET search_path on the PostgreSQL connection for each client to direct queries to that client's schema.
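The question's stack is Node.js, but the idea is the same in any driver. Here is a minimal Python/psycopg2 sketch, where the connection string, the schema name client_acme, and the invoices table are assumptions:

```python
import psycopg2

def connect_for_tenant(tenant_schema):
    """Open a connection whose unqualified table names resolve in one tenant's schema."""
    conn = psycopg2.connect("dbname=erp user=app")  # connection string is an assumption
    with conn.cursor() as cur:
        cur.execute("SET search_path TO %s", (tenant_schema,))
    conn.commit()
    return conn

conn = connect_for_tenant("client_acme")
with conn.cursor() as cur:
    cur.execute("SELECT count(*) FROM invoices")  # resolves to client_acme.invoices
    print(cur.fetchone())
```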
PostgreSQL was not designed for VLDB (very large databases), so you must estimate the final data volume over a 3-to-5-year horizon.
If that volume will exceed roughly 300 GB, it is preferable to give each customer their own database.
If it will stay under that, you can use SQL schemas.
Beware of the number of files: PostgreSQL creates several files per table, and too many files leads to high resource consumption. In that case it would be necessary to split your system across multiple PostgreSQL clusters.
I have a use case and need help with the best available approach.
I use Azure Databricks to run data transformations and create tables in the presentation/gold layer. The underlying data for these tables sits in an Azure Storage account.
The transformation logic runs twice daily and updates the gold-layer tables.
I have several such tables in the gold layer, e.g. a table that stores single-customer-view data.
An external application from a different system needs access to this data: the application would make an API call for details about a given customer, and the response with the matching customer details must be built by querying the single customer view table.
Questions:
Is the Databricks SQL API the solution for this?
Since it is a Spark table, I assume the response will not be quick. Is that correct, or is there a better solution?
Is Databricks designed for such use cases, or is it a better approach to copy the (gold-layer) table into an operational database such as Azure SQL DB once the transformations are done in PySpark via Databricks?
What are the cons of that approach? One is that the Databricks cluster would have to be up and running all the time, i.e. an interactive cluster. Anything else?
It's possible to use Databricks for that, although it depends heavily on the SLAs, i.e. how fast the response needs to be. Answering your questions in order:
There is no standalone API for executing queries and getting back results (yet). But you can build a thin wrapper using one of the drivers for Databricks: Python, Node.js, Go, or JDBC/ODBC.
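As a sketch of such a thin wrapper using the databricks-sql-connector Python package; the hostname, HTTP path, token, and the gold.single_customer_view table are all placeholders:

```python
# pip install databricks-sql-connector
from databricks import sql

def get_customer(customer_id):
    """Thin wrapper an API endpoint could call to query the gold-layer table."""
    with sql.connect(
        server_hostname="adb-xxxx.azuredatabricks.net",  # placeholder
        http_path="/sql/1.0/warehouses/xxxx",            # placeholder
        access_token="dapi...",                          # placeholder
    ) as conn:
        with conn.cursor() as cursor:
            cursor.execute(
                "SELECT * FROM gold.single_customer_view "
                "WHERE customer_id = %(id)s",
                {"id": customer_id},
            )
            return cursor.fetchall()
```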
Response time depends heavily on the size of the data, whether the data is already cached on the nodes, and other factors (partitioning of the data, data skipping, etc.). Databricks SQL warehouses are also able to cache query results, so they won't reprocess the data if the same query was already executed.
Storing data in an operational database is also an approach often used by customers. But it too depends heavily on the size of the data, among other factors: if you have a huge gold layer, a SQL database may not be the best solution from a cost/performance perspective either.
For such queries it's recommended to use Databricks SQL, which is more cost-efficient than keeping an interactive cluster running at all times. Also, on some cloud platforms there is already support for serverless Databricks SQL, where the startup time is very short (seconds instead of minutes), so if queries against the gold layer don't happen very often, you can configure the warehouse with auto-termination and pay only when it is used.
We want to develop a dashboard to analyze geospatial data.
This is a small example close to what we want to do: http://adilmoujahid.com/images/data-viz-talkingdata.gif
Our main concerns are about the backend technologies to be used (the front end will be D3.js, DC.js, Leaflet.js...).
Between Django and Node.js, we think we will use Node.js, because we've read that it's faster than Django for this kind of task. But we are not sure, and we are open to ideas.
As for Mongo vs. Cassandra, we are quite confused. Our data is mostly structured, so storing it in tables, as Cassandra does, would make it easy to manage, and Cassandra also seems to have better performance. However, we also have IoT device data, with lots of real-time GPS locations...
What suggestions can you give us to achieve our goal?
TL;DR summary:
Dashboard with hundreds of simultaneous users.
Stored data will be mostly structured text/numbers, but will also include images, GPS arrays, IoT sensor readings, and geographical data (vector polygons & rasters).
The databases will receive a high write load from the sensors.
Dashboard performance is very important: reading data in real time matters more than keeping it uncorrupted/secure.
Most calculations will be done in the client's browser; the server will avoid mathematical operations.
Disclaimer: I'm a DataStax employee so I'll comment on the Cassandra piece.
Cassandra is a good choice for this if your dashboard can be planned around a set of known queries. If those users will be issuing ad-hoc queries directly against the database from the dashboard, you'll want something with a little more flexibility, like Elasticsearch or (shameless plug) DataStax Search, especially if you expect the queries/database to handle some of the geospatial logic.
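To make "planned around a set of known queries" concrete, here is a sketch of a query-first table for one known dashboard query, "latest readings for a device on a given day"; the keyspace, table, and column names are assumptions:

```python
import datetime
import uuid

from cassandra.cluster import Cluster

cluster = Cluster(["127.0.0.1"])
session = cluster.connect("dashboard")  # keyspace name is an assumption

# Query-first modeling: one table shaped for one known query,
# with rows clustered newest-first inside each (device, day) partition.
session.execute("""
    CREATE TABLE IF NOT EXISTS readings_by_device_day (
        device_id uuid,
        day date,
        ts timestamp,
        lat double,
        lon double,
        value double,
        PRIMARY KEY ((device_id, day), ts)
    ) WITH CLUSTERING ORDER BY (ts DESC)
""")

device_id, day = uuid.uuid4(), datetime.date.today()
rows = session.execute(
    "SELECT ts, lat, lon, value FROM readings_by_device_day "
    "WHERE device_id = %s AND day = %s LIMIT 100",
    (device_id, day),
)
```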
JaguarDB has very strong support for geospatial data (2D and 3D). It allows you to store multiple measurements per point location, whereas other databases support only one measurement (pointm). Many complex queries, such as Voronoi polygons and convex hulls, are also supported. It is open source, distributed and sharded, and supports multi-column indexes, among other features.
Concerning PostgreSQL and Cassandra, is there much difference in RAM/CPU/disk usage between them?
Our use case does not require transactions; it will run on a single node, and we will have IoT devices writing data up to 500 times per second. However, I've read that geographical data works better with PostGIS than with Cassandra...
Given this use case, do you recommend Cassandra or PostGIS?
Is there a way to configure a Cassandra cluster with datacenter splitting / NetworkTopologyStrategy / a replication factor of 1? Basically, I want to keep each piece of data on its originating node but still be able to query it all from any node. The business use case is:
I have a group of customers, each a different firm with data in its own datacenter. I want to do some cross-firm data analysis without usable data leaving their premises, i.e. I can't get them all to load their data onto a central server. I am looking for a platform that lets me deploy software to each firm so that I can run distributed comparisons of their data without them having to send it to me in bulk (much of it is prohibited from distribution). Data transferred in a non-readable wire format as part of a distributed "join" is fine, as long as I'm not replicating the data to the other customers' datacenters.
Yes, you can have a replication factor of 1. However, ensuring that each item of data is on the node at a particular site requires additional work. You will need to have a customer ID as the partition key for every table, and write a custom partitioner that maps customer ID to a token for that customer. And you will have to manually configure each node to use only the one token for its customer.
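A word of caution: a real custom partitioner is a Java class implementing org.apache.cassandra.dht.IPartitioner and loaded by the server, not client code. The Python fragment below only illustrates the customer-ID-to-token mapping such a partitioner would implement; the firm names, key format, and token values are hypothetical:

```python
# Illustration only: a real custom partitioner is server-side Java.
# Each firm's node would be pinned to its single token via initial_token
# in cassandra.yaml, so data for that firm never leaves its node (RF 1).
FIRM_TOKENS = {              # hypothetical token assignment, one per firm
    "firm_a": -9223372036854775808,
    "firm_b": 0,
}

def token_for_partition_key(customer_id: str) -> int:
    """Map every row for a customer to its firm's fixed token."""
    firm = customer_id.split(":", 1)[0]  # assumes keys like "firm_a:cust42"
    return FIRM_TOKENS[firm]
```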
I'm planning on having my database hosted in Cloudant.
Our application is multi-tenant. We currently separate tenants based on a value in some of our tables, which will naturally translate to a value in each document. Another option is to have a database per tenant. We currently have around 100 tenants and hope to grow to 500-2000 in our best projections.
What are the pros and cons of all tenants in one DB vs. a DB per tenant?
Is there a limitation on the number of databases we can create and work with concurrently?
This is a good and involved question. There are pros and cons to both models. The main advantage to one large database is that you can analyze (search, mapreduce, etc) across all users very easily. The main advantage of one-db-per-user is that every user has their own data "sandbox", which may be nice for your SLA. Additionally, that means that the amount of data in each user database can be relatively small.
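For the db-per-tenant model, a minimal sketch with the cloudant Python library; the credentials, URL, and naming convention are assumptions:

```python
# pip install cloudant
from cloudant.client import Cloudant

client = Cloudant("USERNAME", "PASSWORD",                  # placeholders
                  url="https://USERNAME.cloudant.com", connect=True)

def db_for_tenant(tenant_id):
    """db-per-tenant: each tenant gets its own isolated database (sandbox)."""
    name = "tenant-{}".format(tenant_id)
    if name in client.all_dbs():
        return client[name]
    return client.create_database(name)

db = db_for_tenant("acme")
db.create_document({"type": "order", "total": 99.5})
```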
If you can provide more details about the data you are storing, the relational modeling, and the queries you hope to be able to do, I can probably give you a more satisfying answer.