In a typical star schema we have fact tables and dimension tables. Reading that article, it seems that Databricks suggests using Delta tables to realize the star schema. However, Delta tables do not support referential integrity - see here and here.
Do star schemas in Delta Lake support referential integrity?
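For what it's worth, since Delta does not enforce foreign keys, referential integrity between fact and dimension tables is usually validated with an explicit check rather than a constraint. A minimal sketch in PySpark, with hypothetical table and column names (fact_sales, dim_customer, customer_id):

```python
# Minimal sketch of a manual referential-integrity check on Delta tables.
# Table names (fact_sales, dim_customer) and the join key are hypothetical.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

fact = spark.read.table("fact_sales")
dim = spark.read.table("dim_customer")

# Fact rows whose customer_id has no matching dimension row ("orphans").
orphan_count = fact.join(dim, on="customer_id", how="left_anti").count()

if orphan_count > 0:
    raise ValueError(f"Referential integrity violated: {orphan_count} orphan fact rows")
```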
Related
For data modeling documentation (dimensional / ER diagram), is there any tool available that can connect to Databricks / a data lake, read the table structure directly, and also update that structure whenever columns are added to or deleted from a table?
In the process, it should not remove the relationships defined between tables whenever a column and/or table is updated (added/deleted). Version control of the model, e.g. using Git, would also be helpful.
The reason I ask is that, as I understand it, PK and FK details are not maintained for data lake / Databricks table entities. Please suggest any modeling tools that exist for this use case.
If a data lake is a repository for unstructured, semi-structured and structured data, is it physically implemented in a single DB technology? Which ones support all three types of data?
That's a broad question, but Delta Lake supports all of these data types. Of course, many things depend on the specific access patterns, but it's all doable with Delta.
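As a rough illustration of what that can look like in practice, here is a minimal sketch of landing structured (CSV), semi-structured (JSON) and unstructured (binary files) data as Delta tables; all paths are hypothetical and a Spark session with the Delta Lake extensions enabled is assumed:

```python
# Minimal sketch: landing structured, semi-structured and unstructured data in Delta.
# Paths are hypothetical; assumes a Spark session configured with Delta Lake.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Structured: CSV with a tabular layout.
csv_df = spark.read.option("header", "true").csv("/data/raw/customers.csv")
csv_df.write.format("delta").mode("overwrite").save("/delta/customers")

# Semi-structured: nested JSON; the inferred schema is stored with the table.
json_df = spark.read.json("/data/raw/events.json")
json_df.write.format("delta").mode("overwrite").save("/delta/events")

# Unstructured: binary files (images, PDFs) read as path + content bytes.
bin_df = spark.read.format("binaryFile").load("/data/raw/images/")
bin_df.write.format("delta").mode("overwrite").save("/delta/images")
```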
We are creating a new web application and intend to use CouchDB. The old web application is being rewritten and we are migrating from an RDBMS to CouchDB. I have an RDBMS schema with 10+ tables and I want to recreate the same in CouchDB. Which is the better approach to do this in CouchDB?
Options
Create a new database in CouchDB for every table in my RDBMS schema
Create only one database in CouchDB and store all RDBMS tables in it, with an explicit field called doc_type/table_type to indicate which RDBMS table/row type each document represents.
What are the pros and cons of these approaches? What is the recommended approach?
It all depends.
In general, be wary of trying to "translate" an RDBMS schema naively to CouchDB -- it rarely ends up in a happy place. A relational schema -- if designed well -- will be normalized and reliant on multi-table joins to retrieve data. In CouchDB, your data model will (probably) not be normalized nearly as much; instead, the document unit will represent either a row from a table or a row returned from a join in the relational DB.
In CouchDB there are no joins, and no atomic transactions above the document unit. When designing a data model for CouchDB, consider how data is accessed and changed. Things you need to access atomically belong in the same document.
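For example, here is a minimal sketch of such a denormalized document, written to CouchDB over its HTTP API with the Python requests library; the database name, credentials and fields are hypothetical:

```python
# Minimal sketch of a denormalized CouchDB document: an "order" with its line
# items embedded, so the whole unit can be read and updated atomically.
# Database name, credentials and fields are hypothetical.
import requests

order = {
    "_id": "order:1001",
    "doc_type": "order",           # type discriminator, discussed below
    "customer_name": "Alice",
    "items": [                     # would be a separate table + join in an RDBMS
        {"sku": "A-1", "qty": 2, "price": 9.99},
        {"sku": "B-7", "qty": 1, "price": 24.50},
    ],
}

resp = requests.put(
    "http://localhost:5984/shop/order:1001",
    json=order,
    auth=("admin", "password"),
)
resp.raise_for_status()
```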
As to many databases vs. a single database of documents with a "type" field, the single-database option allows you to easily perform map-reduce queries across your whole data set. This isn't possible if you use multiple databases, as a map-reduce view is strictly per-database. The number of databases should be dictated by the access pattern -- if you have data that is only ever accessed by a subset of your application's queries, and never needs slicing and dicing with other data, it can be housed in a separate database.
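To make the single-database option concrete, here is a minimal sketch of a map-reduce view keyed on the doc_type field, created and queried over the HTTP API; the database name, design document name and credentials are hypothetical:

```python
# Minimal sketch: a map-reduce view over a single database, keyed on the
# doc_type field, so one query can slice across all "tables".
# Database name, design document name and credentials are hypothetical.
import requests

design_doc = {
    "views": {
        "by_type": {
            "map": "function (doc) { emit(doc.doc_type, 1); }",
            "reduce": "_count",
        }
    }
}

base = "http://localhost:5984/shop"
auth = ("admin", "password")

# Create (or update) the design document holding the view.
requests.put(f"{base}/_design/app", json=design_doc, auth=auth).raise_for_status()

# Count documents per doc_type across the whole database.
counts = requests.get(f"{base}/_design/app/_view/by_type",
                      params={"group": "true"}, auth=auth)
print(counts.json())
```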
I'd also recommend checking out the new partitioned database facility in CouchDB and Cloudant.
Is there any way I can maintain table relationships in Hive, Spark SQL or Azure U-SQL? Do any of these support creating relationships like in Oracle or SQL Server? In Oracle I can get the relationships using the user_constraints table. I'm looking for something similar to that for processing in ML.
I don't believe any of the big data platforms support integrity constraints. They are designed to load data as quickly as possible, and having constraints would only slow down the import. Hope this makes sense.
None of the big data tools maintain constraints. If we just consider Hive, it doesn't even check whether the table schema is respected while you are writing data to the table, because Hive follows a schema-on-read approach.
If you want to establish any relationship while reading the data, you have to work with joins.
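As a minimal sketch of what that looks like in Spark SQL, the relationship is expressed in the query rather than stored in the metastore; table and column names here are hypothetical:

```python
# Minimal sketch: in Hive / Spark SQL the "relationship" lives in the query,
# not in the metastore. Table and column names are hypothetical.
from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

# Equivalent of an FK lookup: join orders to customers at read time.
related = spark.sql("""
    SELECT o.order_id, o.amount, c.customer_name
    FROM orders o
    JOIN customers c
      ON o.customer_id = c.customer_id
""")
related.show()
```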
I am creating a table from a CSV file in Apache Spark. Is it possible to create a table with a NOT NULL constraint or primary key / foreign key constraints?
It is not possible to define constraints on Spark tables / DataFrames. While StructFields can be defined as nullable or not, this property is not enforced at runtime. There are multiple reasons why constraints like this wouldn't be useful in practice, but the fundamental one is that Spark SQL is not a database. In general it doesn't have any control over the data sources and validates data only on read, so the only possible approach would be to fail on access.
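As a minimal sketch of that workaround, with a hypothetical file path, schema and key column: read the CSV with an explicit schema and validate the "constraints" yourself, failing the job on violation:

```python
# Minimal sketch: Spark will not enforce nullability or key constraints, so
# validate after reading. File path, schema and key column are hypothetical.
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

spark = SparkSession.builder.getOrCreate()

# nullable=False is recorded in the schema but NOT enforced at runtime.
schema = StructType([
    StructField("id", IntegerType(), nullable=False),
    StructField("name", StringType(), nullable=True),
])

df = spark.read.option("header", "true").schema(schema).csv("/data/people.csv")

# Emulate NOT NULL: fail the job if the "constraint" is violated.
if df.filter(df["id"].isNull()).count() > 0:
    raise ValueError("NOT NULL constraint on id violated")

# Emulate a primary key: ids must be unique.
if df.count() != df.select("id").distinct().count():
    raise ValueError("Primary key constraint on id violated")
```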