Column constraints in Spark tables - apache-spark

I am creating a table from a CSV file in Apache Spark. Is it possible to create a table with a NOT NULL constraint, or with primary key/foreign key constraints?

It is not possible to define constraints on Spark tables / DataFrames. While StructFields can be defined as nullable or not, this property is not enforced at runtime. There are multiple reasons why constraints like this wouldn't be useful in practice, but the fundamental one is that Spark SQL is not a database. In general it has no control over the data sources and validates data only on read, so the only possible approach would be to fail on access.
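
For illustration, a minimal PySpark sketch (the file name and columns are made up): a schema can declare a field as non-nullable, but the CSV reader hands back nulls anyway.

```python
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, IntegerType, StringType

spark = SparkSession.builder.appName("nullable-demo").getOrCreate()

# `name` is declared non-nullable...
schema = StructType([
    StructField("id", IntegerType(), nullable=False),
    StructField("name", StringType(), nullable=False),
])

# ...but rows with a missing name still load as null; no error is raised.
# Depending on the Spark version, printSchema() may even report the fields
# back as nullable = true, since file sources relax the flag on read.
df = spark.read.schema(schema).csv("people.csv")
df.printSchema()
df.show()
```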

Related

How do we handle altering column datatypes in Cassandra in actual scenarios?

I'm aware of the restrictions Cassandra places on modifying a column's data type once the table is created, and even on dropping and re-adding a column with the same name but a different data type.
Dropping and adding is allowed, with restrictions.
But if we talk about actual scenarios, it's not that uncommon to modify a table schema during the initial phase of a project.
Example: modifying the Name column of a User table from TEXT to a UDT (User Defined Type) that could encapsulate more information, as in the sketch below.
Coming from an RDBMS background this is very strange behaviour, and maybe someone with actual project experience on Cassandra can answer it.
How do we handle such a scenario of modifying column datatypes, and what are the best practices?
Also, is this common behaviour with other NoSQL or columnar databases?
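
For concreteness, a hedged sketch of the example above (Python driver; the keyspace and UDT definition are assumptions): Cassandra rejects an in-place type change from text to a UDT, so the change has to go through the add-a-new-column route instead.

```python
from cassandra.cluster import Cluster

session = Cluster(["127.0.0.1"]).connect("demo_ks")  # hypothetical keyspace

# The richer type the question describes.
session.execute("CREATE TYPE IF NOT EXISTS full_name (first text, last text)")

# Rewriting the existing column in place is not an option; text and a UDT are
# not compatible, and ALTER ... TYPE was removed outright in newer releases:
#   ALTER TABLE user ALTER name TYPE frozen<full_name>;  -- rejected

# What's left is add-and-backfill: new column, migrate in application code.
session.execute("ALTER TABLE user ADD name_v2 frozen<full_name>")
```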

Table relationships in Hive, Spark SQL and Azure U-SQL

Is there any way I can maintain table relationships in Hive, Spark SQL or Azure U-SQL? Do any of these support creating relationships the way Oracle or SQL Server do? In Oracle I can get the relationships from the user_constraints table. I'm looking for something similar to that for processing in ML.
I don't believe any of the big data platforms support integrity constraints. They are designed to load data as quickly as possible, and enforcing constraints would only slow down the import. Hope this makes sense.
None of the big data tools maintain constraints. Taking Hive as an example, it doesn't even check whether data being written to a table matches the table schema, because Hive follows a schema-on-read approach.
If you want to establish any relation while reading the data, you have to work with joins, as in the sketch below.
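
A short PySpark sketch with made-up customer/order data: the relationship exists only in the join you write, and a dangling foreign key surfaces only if you explicitly look for it.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("join-demo").getOrCreate()

customers = spark.createDataFrame(
    [(1, "alice"), (2, "bob")], ["customer_id", "name"])
orders = spark.createDataFrame(
    [(100, 1), (101, 3)], ["order_id", "customer_id"])  # 3 has no customer row

# The inner join silently drops order 101; nothing flagged it at write time.
orders.join(customers, "customer_id").show()

# Orphans have to be hunted down explicitly, e.g. with a left anti join.
orders.join(customers, "customer_id", "left_anti").show()
```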

Cassandra - advantages of custom type

I am planning to use a Java object as a custom type and store it in Cassandra. I am taking two data members out of the class and making them the primary key, and keeping the rest of the data members in the custom type.
Data members of my class: name, date_of_birth, occupation, last_visit, family_members, total_income
Primary key: name, date_of_birth
Cassandra custom type members: occupation, last_visit, family_members, total_income
Will the custom data type have any performance benefit for writing or reading, compared to storing the individual data members as native Cassandra data types?
Not really. Data for user defined types (UDTs) is stored in a single column in the row, and that should be a faster read than multiple individual columns. But whatever performance gain you achieve there will quickly be erased as the data is serialized for the result set. While CQL will allow you to read individual fields of the UDT if you desire, Cassandra still has to read all contents of that column regardless.
It is important to note that user defined types are not about improving performance. They're about offering the flexibility to achieve small amounts of denormalization.
And just a suggestion, but perhaps it makes more sense to make family_members a collection, with each item containing the data for one family member? A sketch of both layouts follows.
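
A hedged CQL sketch via the Python driver (the keyspace, field types and UDT names are assumptions): the asker's single-UDT plan, with the per-member collection alternative noted at the end.

```python
from cassandra.cluster import Cluster

session = Cluster(["127.0.0.1"]).connect("demo_ks")  # hypothetical keyspace

# The plan from the question: everything outside the primary key in one UDT.
session.execute("""
    CREATE TYPE IF NOT EXISTS user_profile (
        occupation text, last_visit timestamp,
        family_members int, total_income bigint)
""")
session.execute("""
    CREATE TABLE IF NOT EXISTS users (
        name text, date_of_birth date,
        profile frozen<user_profile>,
        PRIMARY KEY ((name), date_of_birth))
""")

# CQL lets you project a single field, but the whole frozen column is still
# read and deserialized under the hood:
rows = session.execute("SELECT profile.occupation FROM users")

# The suggestion above, instead: one UDT per family member in a collection,
# e.g.  family_members list<frozen<family_member>>.
```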

Does the Cassandra 2.1 stress tool support UDT and multiple tables

Our CREATE TABLE statement uses a user defined type (the kind you create with CREATE TYPE). Is this supported by the stress tool in 2.1? It doesn't look that way from what I can see in StressProfile.java.
Also, I was wondering whether there is a way to stress test multiple tables at the same time.
In my experience that is not possible. While using cassandra-stress (2.1) I also noticed that it is not only UDTs that are unsupported; the CQL map data type is not supported either.
I ended up creating one user profile for each table and dropping the map-typed columns from the tables while stressing, as sketched below.
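
A hedged sketch of that workaround (the profile file names are made up; `user profile=... ops(...)` is the stress tool's standard invocation form): run cassandra-stress once per table, each run with its own YAML profile.

```python
import subprocess

# One YAML profile per table, run back to back (names are hypothetical).
for profile in ["users_profile.yaml", "orders_profile.yaml"]:
    subprocess.run(
        ["cassandra-stress", "user", f"profile={profile}",
         "ops(insert=1)", "n=100000"],
        check=True)
```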

Cassandra CQL DataType Advantages and Disadvantages

I am looking into using Cassandra CQL 3.0 and was reading over the various datatypes provided for tables (or column families). See here for a list of the datatypes: CQL Datatypes. My question is: what are the advantages and disadvantages of the different datatypes? For example, if I am storing XML in a column, what would drive the choice of blob vs. text?
Don't use blob unless none of the other types makes sense. For XML, text is the natural fit; see the sketch below.
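
A small Python-driver sketch of the practical difference (keyspace and table names are assumptions): text arrives as a validated UTF-8 string you can read in cqlsh, while blob is opaque bytes whose encoding is entirely your problem.

```python
from cassandra.cluster import Cluster

session = Cluster(["127.0.0.1"]).connect("demo_ks")  # hypothetical keyspace
session.execute("""
    CREATE TABLE IF NOT EXISTS docs (
        id int PRIMARY KEY,
        xml_text text,   -- validated UTF-8, human-readable in cqlsh
        xml_blob blob)   -- opaque bytes, no validation or interpretation
""")

xml = "<user><name>alice</name></user>"
session.execute(
    "INSERT INTO docs (id, xml_text, xml_blob) VALUES (%s, %s, %s)",
    (1, xml, xml.encode("utf-8")))  # blob requires bytes; a str would fail
```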
