describe spark table/view comment - apache-spark

We can create a table or a view with a comment describing it.
For example (from spark docs):
CREATE TABLE student (id INT, name STRING, age INT) USING CSV
COMMENT 'this is a comment'
TBLPROPERTIES ('foo'='bar');
How can you retrieve the comment in a "clean format"?
By clean format, I mean only (or almost only) the table name and the comment describing it. Any other solution I've found bloats me with all the column types and information (which I don't need in this case).
I've tried:
DESCRIBE student
DESCRIBE EXTENDED student
SHOW CREATE TABLE student
DESCRIBE DETAIL student -- databricks only
SHOW VIEWS FROM default -- to try seeing views description
SHOW TABLES FROM default -- to try seeing tables description
The best would be to have something like SHOW TABLE/SHOW VIEWS, but with a column adding the description.
Is there an out-of-the-box solution for this? If not, is there a good custom way to achieve it?
Thank you.

There is no way to get only the table comment. However, it's fairly easy to filter it out of the DESCRIBE TABLE EXTENDED output, for example using Scala:
spark.sql("CREATE TABLE student (id INT, name STRING, age INT) USING CSV COMMENT 'this is a comment'")
spark.sql("DESCRIBE TABLE EXTENDED student").filter($"col_name" === "Comment").show
+--------+-----------------+-------+
|col_name| data_type|comment|
+--------+-----------------+-------+
| Comment|this is a comment| |
+--------+-----------------+-------+
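If you want something closer to SHOW TABLES with an extra description column, there is no out-of-the-box statement, but you can run DESCRIBE TABLE EXTENDED per table and pluck out the Comment row. The helper below is a sketch that works on the rows as plain (col_name, data_type, comment) tuples, matching the output shown above; adapt the row access to whatever your driver or DataFrame API returns.

```python
# Sketch: extract the table-level comment from DESCRIBE TABLE EXTENDED rows.
# Each row is assumed to be a (col_name, data_type, comment) tuple, matching
# the output shown above.

def extract_comment(describe_rows):
    """Return the table comment, or None if the table has no comment."""
    for col_name, data_type, _comment in describe_rows:
        if col_name == "Comment":
            return data_type  # the comment text lives in the data_type column
    return None

# Example rows as DESCRIBE TABLE EXTENDED might return them:
rows = [
    ("id", "int", None),
    ("name", "string", None),
    ("# Detailed Table Information", "", ""),
    ("Comment", "this is a comment", ""),
]
print(extract_comment(rows))  # this is a comment
```

Looping this over the table names from SHOW TABLES gives you the (table name, comment) listing you described.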

Related

Databricks table metadata through JDBC driver

The Spark JDBC driver (SparkJDBC42.jar) is unable to capture certain information from the below table structure:
table level comment
The TBLPROPERTIES key-value pair information
PARTITION BY information
However, it captures the column-level comment (e.g. the comment on the employee_number column), all columns of the employee table, and their data types.
Please advise if I need to configure any additional properties to be able to read/extract the information that the driver currently cannot extract.
create table default.employee(
employee_number INT COMMENT 'Unique identifier for an employee',
employee_name VARCHAR(50),
employee_age INT)
PARTITIONED BY (employee_age)
COMMENT 'this is a table level comment'
TBLPROPERTIES ('created.by.user' = 'Noor', 'created.date' = '10-08-2021');
You should be able to execute:
describe table extended default.employee
via the JDBC interface as well. In the first case, it will return a table with 3 columns that you can parse into column-level and table-level properties; it shouldn't be very complex, as there are explicit delimiters between the column-level and table-level data.
You can also execute:
show create table default.employee
that will give you a table with one column containing the SQL statement, which you can then parse.
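The parsing of the DESCRIBE output can be sketched like this: split the rows at the delimiter row that starts the table-level section. The delimiter name ("# Detailed Table Information") and the example row values are taken from typical Spark describe output; verify them against your Spark version.

```python
# Sketch: split DESCRIBE TABLE EXTENDED output (fetched over JDBC) into
# column-level rows and table-level properties. The delimiter row name is an
# assumption based on Spark's describe output format.

DELIMITER = "# Detailed Table Information"

def split_describe(rows):
    columns, table_props = [], {}
    in_table_section = False
    for col_name, data_type, comment in rows:
        if col_name == DELIMITER:
            in_table_section = True
            continue
        if not col_name.strip():  # skip blank separator rows
            continue
        if in_table_section:
            table_props[col_name] = data_type
        else:
            columns.append((col_name, data_type, comment))
    return columns, table_props

# Illustrative rows mirroring the employee table above:
rows = [
    ("employee_number", "int", "Unique identifier for an employee"),
    ("employee_name", "varchar(50)", None),
    ("", "", ""),
    ("# Detailed Table Information", "", ""),
    ("Comment", "this is a table level comment", ""),
    ("Table Properties", "[created.by.user=Noor]", ""),
]
cols, props = split_describe(rows)
```

The table-level comment and TBLPROPERTIES then come out of `props`, while `cols` keeps the column-level comments the driver already exposes.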

In Cassandra, why is dropping a column from tables defined with COMPACT STORAGE not allowed?

As per the DataStax documentation here, we cannot drop a column from tables defined with the COMPACT STORAGE option. What is the reason for this?
This goes back to the original implementation of CQL3, and changes which were made to allow it to abstract a "SQL-like," wide-row structure on top of the original Thrift-based storage engine. Ultimately, managing the schema comes down to whether or not the underlying structure is a table or a column_family.
As an example, I'll create two tables using an old install of Apache Cassandra (2.1.19):
CREATE TABLE student (
studentid TEXT PRIMARY KEY,
fname TEXT,
lname TEXT);
CREATE TABLE studentcomp (
studentid TEXT PRIMARY KEY,
fname TEXT,
lname TEXT)
WITH COMPACT STORAGE;
I'll insert one row into each table:
INSERT INTO student (studentid, fname, lname) VALUES ('janderson','Jordy','Anderson');
INSERT INTO studentcomp (studentid, fname, lname) VALUES ('janderson','Jordy','Anderson');
And then I'll look at the tables with the old cassandra-cli tool:
[default#stackoverflow] list student;
Using default limit of 100
Using default cell limit of 100
-------------------
RowKey: janderson
=> (name=, value=, timestamp=1599248215128672)
=> (name=fname, value=4a6f726479, timestamp=1599248215128672)
=> (name=lname, value=416e646572736f6e, timestamp=1599248215128672)
[default#stackoverflow] list studentcomp;
Using default limit of 100
Using default cell limit of 100
-------------------
RowKey: janderson
=> (name=fname, value=Jordy, timestamp=1599248302715066)
=> (name=lname, value=Anderson, timestamp=1599248302715066)
Do you see the empty/"ghost" column value in the first result? That empty column value was CQL3's link between the column values and the table's meta data. If it's not there, then CQL cannot be used to manage a table's columns.
The comparator used for type conversion was all that was really exposed via Thrift. This lack of meta data control/exposure is what allowed Cassandra to be considered "schemaless" in the pre-CQL days. If I run a describe studentcomp from within the cassandra-cli, I can see the comparators (validation class) used:
Column Metadata:
Column Name: lname
Validation Class: org.apache.cassandra.db.marshal.UTF8Type
Column Name: fname
Validation Class: org.apache.cassandra.db.marshal.UTF8Type
But if I try describe student, I see this:
WARNING: CQL3 tables are intentionally omitted from 'describe' output.
See https://issues.apache.org/jira/browse/CASSANDRA-4377 for details.
Sorry, no Keyspace nor (non-CQL3) ColumnFamily was found with name: student (if this is a CQL3 table, you should use cqlsh instead)
Basically, tables and column families were different entities forced into the same bucket. Adding WITH COMPACT STORAGE essentially made a table a column family.
With that came the lack of any schema management (adding or removing columns), outside of access to the comparators.
Edit 20200905
Can we somehow / someway (hack) drop the columns from table?
You might be able to accomplish this. Sylvain Lebresne wrote A Thrift to CQL3 Upgrade Guide which will have some necessary details for you. I also advise reading through the Jira ticket mentioned above (CASSANDRA-4377), as that covers many of the in-depth technical challenges that make this difficult.

How to create an efficient Cassandra Data model?

I'm new to Cassandra and trying to create an application, in which I have an entity 'student' consisting of 4 columns as given below:
student_id
student_name
dob
course_name
create table student(student_id uuid, student_name text, dob date, course_name text, PRIMARY KEY(student_id));
I have to search students by course_name. According to Cassandra data modeling, to search students by course name I need to create another table, student_by_course_name, consisting of two columns:
course_name
student_id
where course_name will be the partition key and student_id will be the clustering key, as given below:
create table student_by_course_name(course_name text, student_id uuid, PRIMARY KEY(course_name, student_id));
The problem arises when a student changes their course. When I try to update the course name in the student_by_course_name table, it throws an error because the course_name column is a partition key. How can I resolve this, or am I using Cassandra data modeling wrongly?
In this case you have to delete the old entry first and then add a new entry to student_by_course_name with the new course.
Your model looks good
The best way is indeed as Alex suggested. Delete and then update.
There are a couple of problems that you might need to be aware of:
If your course has a LOT of students, it will generate big partitions (for this specific case it might not be an issue).
Deleting entries will create tombstones, and you should be prepared to handle them (e.g. use a low gc_grace_seconds; if you think a lot will be generated, set unchecked_tombstone_compaction on the table).
Cassandra isn't the best for deleting data or updating data in-place. I believe that you have to use a batch statement to keep the tables in sync.
You can take two approaches. The first would be to delete the existing student ID/course name combination. This will create a tombstone, but if it doesn't happen often, it won't be a big deal. The second option would be to keep only the original table and create a secondary index on course name. This allows the course name to be both updated and queried, but may not perform well over time.
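The delete-then-insert dance for the lookup table can be illustrated in plain Python, with a dict standing in for the student_by_course_name partitions. In real CQL this would be a DELETE plus an INSERT (optionally wrapped in a logged batch to keep the tables in sync); the table and column names mirror the ones above.

```python
# Plain-Python illustration of maintaining student_by_course_name when a
# student changes course: delete the old (course, student) entry, then
# insert the new one. The dict keys stand in for partition keys.

student_by_course = {
    "math": {"s1", "s2"},
    "physics": set(),
}

def change_course(student_id, old_course, new_course):
    # DELETE FROM student_by_course_name WHERE course_name=? AND student_id=?
    student_by_course[old_course].discard(student_id)
    # INSERT INTO student_by_course_name (course_name, student_id) VALUES (?, ?)
    student_by_course.setdefault(new_course, set()).add(student_id)

change_course("s1", "math", "physics")
```

The key point: because course_name is the partition key, "updating" it is really a move between partitions, which is why UPDATE alone cannot do it.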

Is it possible to write comment for UDT in cassandra?

In Cassandra, for tables we can write a comment as follows:
CREATE TABLE company.address(
id int PRIMARY KEY,
street text,
...
) WITH COMMENT = 'Table containing the address of company
id - unique identifier of a company,
street - street of the company';
But for a UDT (user-defined type), I can't find a way to write a comment describing each field of the UDT. Is that possible in Cassandra?
Comments for columns are not possible in Cassandra 3.x (the latest available version).
Jira ticket for the same: CASSANDRA-9836.
As of now, the best bet is to use self-explanatory column names.

Understanding Cassandra Data Model

I have recently started learning No-SQL and Cassandra through this article. The author explains the data model through this diagram:
The author also gives the below column family example:
Book {
key: 9352130677 { name: "Hadoop The Definitive Guide", author: "Tom White", publisher: "Oreilly", priceInr: 650, category: "hadoop", edition: 4 },
key: 8177228137 { name: "Hadoop in Action", author: "Chuck Lam", publisher: "Manning", priceInr: 590, category: "hadoop" },
key: 8177228137 { name: "Cassandra: The Definitive Guide", author: "Eben Hewitt", publisher: "Oreilly", priceInr: 600, category: "cassandra" },
}
But in that tutorial, and every other tutorial I have gone through, they end up creating regular tables in Cassandra. I am unable to connect the Cassandra model with what I am creating.
For example, I created a column family called Employee as below:
create columnfamily Employee(empid int primary key,empName text,age int);
Now I inserted some data, and my column family looks like this:
For me this looks like a regular relational table and not like the data model the author has explained. How do I create a Employee column family where each row represents an employee with different attributes? Something like:
Employee{
101:{name:Emp1,age:20}
102:{name:Emp2,salary:1000}
102:{manager_name:Emp3,age:45}
}
You need to understand that although the CQL representation may look like a regular relational table, the internal structure of the rows in Cassandra is completely different. It saves a different set of attributes for each employee, and the nulls you see while querying with CQL are just a representation of empty/nonexistent cells.
What you're trying to achieve is an unstructured data model. Cassandra started with this model, and everything worked as described in the tutorial you've read, but there is an opinion that unstructured data design is unhealthy for development and creates more problems than it solves. So, after some time, Cassandra moved to a "structured" data structure (and from Thrift to CQL). It doesn't mean that you have to store all attributes for all keys/rows, or that all rows have the same number of attributes; it just means that you have to declare attributes before you use them.
You can achieve some kind of unstructured data modeling using Map, List, Set, etc. data types, UDT (User defined types) or just saving your data as json string and parsing it on the application side.
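The "save your data as a JSON string" option mentioned above can be sketched in plain Python: the Cassandra schema stays fixed (one text column), and the application serializes and parses the variable attributes. The column names here are illustrative.

```python
import json

# Sketch: store variable per-employee attributes as a JSON string in a
# single text column, parsing on the application side.

def save_attributes(attrs):
    """Serialize attributes before writing them to a text column."""
    return json.dumps(attrs)

def load_attributes(stored):
    """Parse the text column back into a dict after reading."""
    return json.loads(stored)

# What a row might look like before being written to Cassandra:
row = {"empid": 102, "attributes": save_attributes({"manager_name": "Emp3", "age": 45})}
attrs = load_attributes(row["attributes"])
```

The trade-off of this design is that Cassandra cannot index or query inside the JSON blob; it only works when the application always reads the whole attribute set.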
What you have understood is correct. Just believe it. Internally, Cassandra stores columns exactly like the image in your question.
Now, what you are expecting is to insert a column that was not defined when creating the Employee table. For dynamic columns, you can always use the Map data type.
For example
create table Employee(
empid int primary key,
empName text,
age int,
attributes Map<text,text>);
To add new attributes you can use a query like the one below.
UPDATE Employee SET attributes = attributes + { 'manager_name' : 'Emp3', 'age' : '45' } WHERE empid = 102;
Update -
Another way to create a dynamic column model is as below.
create table Employee(
empid int,
empName text,
attribute text,
attributevalue text,
primary key (empid, empName, attribute)
);
Let's take a few inserts:
insert into Employee (empid,empName,attribute,attributevalue) values (102,'Emp1','age','25') ;
insert into Employee (empid,empName,attribute,attributevalue) values (102,'Emp1','manager','emp2') ;
insert into Employee (empid,empName,attribute,attributevalue) values (102,'Emp1','department','hr') ;
This data structure will create a wide row and behaves like dynamic columns. You can see that empid and empName are common for all three rows; only the attribute and its value change.
Hope this helps.
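How those three inserts land in one wide partition can be sketched in plain Python. Each CQL row is (empid, empName, attribute, attributevalue); grouping by the partition key (empid) shows all the attribute cells stored together, which is what makes the row "wide".

```python
from collections import defaultdict

# Sketch: group the three inserted rows by their partition key (empid).
# The clustering columns (empName, attribute) order the cells inside the
# partition; attributevalue is the stored cell value.
rows = [
    (102, "Emp1", "age", "25"),
    (102, "Emp1", "manager", "emp2"),
    (102, "Emp1", "department", "hr"),
]

partitions = defaultdict(dict)
for empid, emp_name, attribute, value in rows:
    partitions[empid][(emp_name, attribute)] = value
```

All three "dynamic columns" end up in a single partition, so one query by empid retrieves the whole attribute set.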
Cassandra uses a special primary key called a composite key. This is the representation of the partitions, and is also one reason why Cassandra scales well: the composite key is used to determine the nodes on which the rows are stored.
The result in your console may be a result set of rows, but the internal organization of Cassandra is different from that. Have you ever tried to query a table without the primary key? You will quickly see that you can't query that flexibly (because of the partitioning).
After that you will understand why we have to use a query-first design approach for Cassandra. This is completely different from an RDBMS.
