Table relationships in Hive, Spark SQL, and Azure U-SQL

Is there any way to maintain table relationships in Hive, Spark SQL, or Azure U-SQL? Do any of these support creating relationships the way Oracle or SQL Server do? In Oracle I can get the relationships from the user_constraints table. I am looking for something similar to that for processing in ML.

I don't believe any of the big data platforms support integrity constraints. They are designed to load data as quickly as possible, and having constraints would only slow down the import. Hope this makes sense.

None of the big data tools maintain constraints. Taking Hive as an example: it does not even check, while you are writing data to a table, whether the data matches the table schema, because Hive follows a schema-on-read approach.
If you want to establish any relationship while reading the data, you have to work with joins, as in the sketch below.
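As an illustration, here is a minimal PySpark sketch of expressing a relationship purely through a join at read time. The table and column names (orders, customers, customer_id) are hypothetical, not from the question.

```python
# Minimal PySpark sketch: in Hive/Spark the "relationship" exists only in
# the query, not as an enforced constraint. All names are hypothetical.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("join-relationship")
         .enableHiveSupport()   # assumes a Hive metastore is available
         .getOrCreate())

# Nothing prevents orders.customer_id from containing ids absent from
# customers; the join simply resolves whatever matches at read time.
result = spark.sql("""
    SELECT o.order_id, o.amount, c.customer_name
    FROM orders o
    JOIN customers c
      ON o.customer_id = c.customer_id
""")
result.show()
```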

Related

Data modeling tool to connect to Databricks or a data lake

For data modeling documentation (dimensional models / ER diagrams), is there any tool available that can connect to Databricks or a data lake, read the table structures directly, and also update the model whenever columns are added to or deleted from a table?
In the process, it should not remove relationships drawn between tables when columns and/or tables are added, deleted, or updated. Version control on the model, e.g. via Git, would also be helpful.
The reason I ask is that, as I understand it, PK and FK details are not maintained in data lake / Databricks table entities. Please suggest any modeling tools that exist for this use case.

Storing data in a BLOB in Cassandra

We are using Cassandra 3 and came up with a model based on the initial requirements. Since there have been very frequent requirement changes, this model has subsequently changed many times as well. Given these requirement and model changes, there has been no major improvement in terms of development. The team has decided to go with the BLOB data type and store the entire record in a BLOB. Can you please share the drawbacks of using a BLOB in such a scenario? Thanks in advance.
We migrated directly from Astyanax on Cassandra 1.1 to CQL on Cassandra 3.0, so we still have a lot of column families whose value is a BLOB.
Major issues we face right now are:
1) It is difficult to inspect data directly from the database. The biggest advantage of CQL is that it supports SQL-like queries, so logging into the cqlsh terminal and getting results directly from there normally saves a lot of time. If you use a BLOB you lose all of that.
2) CQL performs better when your table has a well-defined schema than when a BLOB is used to store a big chunk of data together.
If you are creating a new table, I suggest using collections for your use case: you will be able to store different types of data and performance will also be good.
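As a minimal sketch of that suggestion (keyspace, table, and column names are hypothetical, and the DataStax Python driver is assumed), a collection column keeps the data visible and queryable where a BLOB would not:

```python
# Sketch: a CQL table using a collection instead of an opaque BLOB.
# Keyspace/table/column names are hypothetical; assumes the DataStax
# Python driver (pip install cassandra-driver) and a local node.
from cassandra.cluster import Cluster

session = Cluster(["127.0.0.1"]).connect("demo_ks")  # keyspace assumed to exist

session.execute("""
    CREATE TABLE IF NOT EXISTS user_profile (
        user_id uuid PRIMARY KEY,
        name text,
        attributes map<text, text>   -- queryable in cqlsh, unlike a blob
    )
""")
```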
These slides give a nice comparison of the performance of schemaless tables and tables with a schema and collections. You can skip to slide 26 if you just want the summary.
https://www.slideshare.net/DataStax/migration-from-thrift-to-cql-brij-bhushan-ravat-ericsson-cassandra-summit-2016

Columns constraints in Spark tables

I am creating a table from a CSV file in Apache Spark. Is it possible to create the table with a NOT NULL constraint or primary key / foreign key constraints?
It is not possible to define constraints on Spark tables / DataFrames. While a StructField can be declared nullable or not, this property is not enforced at runtime. There are multiple reasons why such constraints wouldn't be useful in practice, but the fundamental one is that Spark SQL is not a database. In general it has no control over its data sources and validates data only on read, so the only possible approach would be to fail on access.
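A short PySpark sketch of that behaviour (the CSV path is hypothetical): even with nullable=False in the schema, reading a file with missing ids does not fail, because Spark relaxes nullability for file-based sources.

```python
# Sketch: nullable=False on a StructField is not an enforced NOT NULL
# constraint. The CSV path is hypothetical.
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType

spark = SparkSession.builder.appName("nullable-demo").getOrCreate()

schema = StructType([
    StructField("id", StringType(), nullable=False),  # declared "not null"
    StructField("name", StringType(), nullable=True),
])

# The read succeeds even if the file contains rows with an empty id:
# file sources silently relax nullability instead of rejecting rows.
df = spark.read.csv("/tmp/people.csv", schema=schema)
df.show()
```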

How to manage duplicated Data between different tables in Query-Driven Data Model in Cassandra?

I'm new to the Cassandra NoSQL database. I've read A Big Data Modeling Methodology for Apache Cassandra and Basic Rules of Cassandra Data Modeling, both useful articles about data modelling in Cassandra. These pages mention that data duplication is used to achieve the best performance (more writes) under a query-driven methodology. OK! So we end up with a physical diagram in which, as you can see, ave_rating is duplicated across three tables. The question is, when we want to update or insert ave_rating:
Does Cassandra itself have any tools to manage writes to all tables having this column (i.e., CRUD operations on data duplicated across tables)?
Is there any third-party tool for the issue above?
Should this issue be handled at the application level? If yes, what is the best practice?
Does Cassandra itself have any tools to manage writes to all tables having this column?
Yes, look at materialized views: http://www.doanduyhai.com/blog/?p=1930
And here too: http://www.datastax.com/dev/blog/understanding-materialized-views
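As a hedged sketch of how a materialized view (Cassandra 3.0+) lets the server maintain the duplicated column for you, assuming a hypothetical base table items with primary key item_id and the DataStax Python driver:

```python
# Sketch (Cassandra 3.0+): the server keeps the view in sync with its base
# table, so ave_rating is written once and propagated automatically.
# The base table items(item_id PRIMARY KEY, ave_rating, title) is hypothetical.
from cassandra.cluster import Cluster

session = Cluster(["127.0.0.1"]).connect("demo_ks")

session.execute("""
    CREATE MATERIALIZED VIEW IF NOT EXISTS items_by_rating AS
    SELECT item_id, ave_rating, title
    FROM items
    WHERE ave_rating IS NOT NULL AND item_id IS NOT NULL
    PRIMARY KEY (ave_rating, item_id)
""")
```

Note that a materialized view only mirrors its own base table; if the duplication spans differently shaped tables, the update logic still has to live at the application level (e.g. a logged batch writing to all three tables).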

One or two keyspaces in Cassandra for different kinds of data in a single application

In my project I use Cassandra for analytics and MySQL to store data. I see that Cassandra could be a good fit for the application data as well.
My question is: should I create a new keyspace for the application data, or should I use the keyspace that already exists and is used for the analytical data? What should I take into account when making this decision?
My stack is Python (Django) + pycassa, Cassandra 1.2.
A keyspace is simply a high-level grouping of similar column families. There are no hard and fast rules, and the most significant implications of either decision relate to the specific client library's API. Personally, I create a new keyspace when I want a separation of concerns in my data. It's somewhat analogous to creating a different database in a relational DB.
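As a sketch of that separation, shown with modern CQL syntax and the DataStax Python driver rather than pycassa (keyspace names and replication settings are illustrative only):

```python
# Sketch: separate keyspaces for analytics and application data, analogous
# to separate databases in an RDBMS. Replication settings are illustrative.
from cassandra.cluster import Cluster

session = Cluster(["127.0.0.1"]).connect()

for ks in ("analytics", "app_data"):
    session.execute(f"""
        CREATE KEYSPACE IF NOT EXISTS {ks}
        WITH replication = {{'class': 'SimpleStrategy', 'replication_factor': 3}}
    """)
```

One practical point: replication strategy and factor are set per keyspace, so separate keyspaces also let you replicate analytical and application data differently.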
