Databricks table metadata through JDBC driver - apache-spark

The Spark JDBC driver (SparkJDBC42.jar) is unable to capture certain information from the table structure below:
the table-level comment
the TBLPROPERTIES key-value pair information
the PARTITIONED BY information
However, it does capture the column-level comments (e.g. the comment against the employee_number column), all the columns of the employee table, and their technical data types.
Please advise whether I need to configure any additional properties to be able to read/extract the information that the driver currently cannot extract.
create table default.employee(
employee_number INT COMMENT 'Unique identifier for an employee',
employee_name VARCHAR(50),
employee_age INT)
PARTITIONED BY (employee_age)
COMMENT 'this is a table level comment'
TBLPROPERTIES ('created.by.user' = 'Noor', 'created.date' = '10-08-2021');

You should be able to execute:
describe table extended default.employee
via the JDBC interface as well. In the first case it will return a table with three columns that you can parse into column-level and table-level properties - it shouldn't be very complex, as there are explicit delimiters between the column-level and table-level data:
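An abridged sketch of that output for the table above (the exact rows vary by Spark/Databricks version, but the "# Partition Information" and "# Detailed Table Information" marker rows are the delimiters to split on):

DESCRIBE TABLE EXTENDED default.employee;
-- col_name                     data_type     comment
-- employee_number              int           Unique identifier for an employee
-- employee_name                varchar(50)   null
-- employee_age                 int           null
-- # Partition Information
-- # col_name                   data_type     comment
-- employee_age                 int           null
-- # Detailed Table Information
-- Comment                      this is a table level comment
-- Table Properties             [created.by.user=Noor, created.date=10-08-2021]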
You can also execute:
show create table default.employee
which will return a table with a single column containing the SQL statement that you can parse:
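The single cell returned (the column is named createtab_stmt in open-source Spark; the name may differ slightly on Databricks) would contain the full DDL, roughly:

show create table default.employee;
-- createtab_stmt
-- CREATE TABLE default.employee (
--   employee_number INT COMMENT 'Unique identifier for an employee',
--   employee_name VARCHAR(50),
--   employee_age INT)
-- PARTITIONED BY (employee_age)
-- COMMENT 'this is a table level comment'
-- TBLPROPERTIES ('created.by.user' = 'Noor', 'created.date' = '10-08-2021')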

Related

Cassandra (3.11.11): find a string in a Cassandra table column

I am a newbie to Cassandra.
I have a table (table1) with data like:
ch1,ch2,ch3,ch4
LD,9813970,1484914,'T03103','T04014'
LD,1008203,1486104,'T03103','T04024'
I want to find a string in this Cassandra table (table1). Is there any option to search for a given string in this table's column ch4 using only the IN operator (not the LIKE operator)? A sample query would be:
select * from table1 where 'T04014' IN (ch4)
If required, the ch4 column may be included in the partition or clustering keys.
You didn't post the table schema so I'm going to assume that ch4 is not part of the primary key.
You cannot include a column in the filter unless it is part of the primary key or you have a secondary index defined on it. Be aware that secondary indexes are not always a good fit. Have a look at when to use an index for details.
The general recommendation is to denormalise and create a table specifically designed for each app query so you get the best performance out of your cluster. Cheers!
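As a sketch of that recommendation (the table name and column types are assumptions, since the schema wasn't posted), you could maintain a second table partitioned by ch4 so the lookup becomes a direct partition read:

-- denormalised lookup table: ch4 is the partition key,
-- the remaining columns cluster to keep each row unique
CREATE TABLE table1_by_ch4 (
    ch4 TEXT,
    ch1 TEXT,
    ch2 INT,
    ch3 INT,
    PRIMARY KEY (ch4, ch1, ch2, ch3)
);

-- IN on the partition key is valid CQL and efficient for a small number of values
SELECT * FROM table1_by_ch4 WHERE ch4 IN ('T04014', 'T04024');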

In Cassandra, why is dropping a column from tables defined with compact storage not allowed?

As per the DataStax documentation here, we cannot delete a column from tables defined with the COMPACT STORAGE option. What is the reason for this?
This goes back to the original implementation of CQL3, and changes which were made to allow it to abstract a "SQL-like," wide-row structure on top of the original Thrift-based storage engine. Ultimately, managing the schema comes down to whether or not the underlying structure is a table or a column_family.
As an example, I'll create two tables using an old install of Apache Cassandra (2.1.19):
CREATE TABLE student (
studentid TEXT PRIMARY KEY,
fname TEXT,
lname TEXT);
CREATE TABLE studentcomp (
studentid TEXT PRIMARY KEY,
fname TEXT,
lname TEXT)
WITH COMPACT STORAGE;
I'll insert one row into each table:
INSERT INTO student (studentid, fname, lname) VALUES ('janderson','Jordy','Anderson');
INSERT INTO studentcomp (studentid, fname, lname) VALUES ('janderson','Jordy','Anderson');
And then I'll look at the tables with the old cassandra-cli tool:
[default#stackoverflow] list student;
Using default limit of 100
Using default cell limit of 100
-------------------
RowKey: janderson
=> (name=, value=, timestamp=1599248215128672)
=> (name=fname, value=4a6f726479, timestamp=1599248215128672)
=> (name=lname, value=416e646572736f6e, timestamp=1599248215128672)
[default#stackoverflow] list studentcomp;
Using default limit of 100
Using default cell limit of 100
-------------------
RowKey: janderson
=> (name=fname, value=Jordy, timestamp=1599248302715066)
=> (name=lname, value=Anderson, timestamp=1599248302715066)
Do you see the empty/"ghost" column value in the first result? That empty column value was CQL3's link between the column values and the table's meta data. If it's not there, then CQL cannot be used to manage a table's columns.
The comparator used for type conversion was all that was really exposed via Thrift. This lack of meta data control/exposure is what allowed Cassandra to be considered "schemaless" in the pre-CQL days. If I run a describe studentcomp from within the cassandra-cli, I can see the comparators (validation class) used:
Column Metadata:
Column Name: lname
Validation Class: org.apache.cassandra.db.marshal.UTF8Type
Column Name: fname
Validation Class: org.apache.cassandra.db.marshal.UTF8Type
But if I try describe student, I see this:
WARNING: CQL3 tables are intentionally omitted from 'describe' output.
See https://issues.apache.org/jira/browse/CASSANDRA-4377 for details.
Sorry, no Keyspace nor (non-CQL3) ColumnFamily was found with name: student (if this is a CQL3 table, you should use cqlsh instead)
Basically, tables and column families were different entities forced into the same bucket. Adding WITH COMPACT STORAGE essentially made a table a column family.
With that came the lack of any schema management (adding or removing columns), outside of access to the comparators.
Edit 20200905
Can we somehow / someway (hack) drop the columns from table?
You might be able to accomplish this. Sylvain Lebresne wrote A Thrift to CQL3 Upgrade Guide which will have some necessary details for you. I also advise reading through the Jira ticket mentioned above (CASSANDRA-4377), as that covers many of the in-depth technical challenges that make this difficult.
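As a hedged aside: newer Cassandra releases (3.0.16+/3.11.2+, via CASSANDRA-10857) added an ALTER TABLE ... DROP COMPACT STORAGE statement that converts a compact table into a regular CQL table, after which ordinary column drops work. Verify it is available on your version before relying on it:

-- convert the column family into a regular CQL table (newer versions only)
ALTER TABLE stackoverflow.studentcomp DROP COMPACT STORAGE;
-- ordinary schema management is now possible
ALTER TABLE stackoverflow.studentcomp DROP lname;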

How to add a new column to the PARTITIONED BY clause in a Hive external table

I have an external Hive table which is filled by a Spark job and partitioned by (event_date date). Now I have modified the Spark code and added one extra column, 'country'. In the earlier written data the country column will have null values, as it is newly added. Now I want to alter the 'partitioned by' clause to partitioned by (event_date date, country string). How can I achieve this? Thank you!!
Please try to alter the partition using the command below:
ALTER TABLE table_name PARTITION part_spec SET LOCATION path
part_spec:
: (part_col_name1=val1, part_col_name2=val2, ...)
See the Databricks spark-sql language manual for the ALTER TABLE command.
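As a usage sketch of that syntax (the table name, partition value, and path are hypothetical):

ALTER TABLE my_events PARTITION (event_date='2021-08-10')
SET LOCATION 'hdfs://namenode/warehouse/events/event_date=2021-08-10';

Note that this only re-points an existing partition's data; changing the set of partition columns of an existing table generally requires creating a new table with the desired PARTITIONED BY clause and reloading the data into it.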

Fetch distinct field values from frozen set column in Cassandra columnfamily

Hi, please help me with a CQL query for the below requirement.
The column family contains the columns: deptid (datatype: uuid), emplList (datatype: set<frozen<employee>>).
How would I get all distinct employee names from the employee objects stored in the set as the column value for emplList?
Such queries can't be expressed in pure CQL - Cassandra is optimized to read data by primary key, and aggregation operations are very limited. You have two choices:
Read all the data from the table in your program, and extract the distinct values
Use Spark with the Spark Cassandra Connector - it will also read all the data from the table, but you'll have a higher-level abstraction to work with, and it can perform a more optimized scan of your table (see the sketch below)
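A minimal Spark SQL sketch of the second option, assuming the Spark Cassandra Connector is on the classpath and that the employee UDT has a name field (the keyspace, table, and field names are assumptions):

-- expose the Cassandra table to Spark SQL
CREATE TEMPORARY VIEW dept_view
USING org.apache.spark.sql.cassandra
OPTIONS (keyspace 'hr', table 'department');

-- explode the frozen set into one row per employee, then de-duplicate on the name field
SELECT DISTINCT emp.name
FROM dept_view
LATERAL VIEW explode(empllist) e AS emp;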

Brisk cassandra TimeUUIDType

I used Brisk. The Cassandra column family automatically maps to Hive tables.
However, if the data type is timeuuid in the column family, it is unreadable in the Hive tables.
For example, I used the following command to create an external table in Hive to map the column family:
Hive > create external table A (rowkey string, column_name string, value string)
> STORED BY 'org.apache.hadoop.hive.cassandra.CassandraStorageHandler'
> WITH SERDEPROPERTIES (
> "cassandra.columns.mapping" = ":key,:column,:value");
If the column name is a TimeUUIDType in Cassandra, it becomes unreadable in the Hive table.
For example, a row in the Cassandra column family looks like:
RowKey: 2d36a254bb04272b120aaf79d70a3578
=> (column=29139210-b6dc-11df-8c64-f315e3a329d6, value={"event_id":101},timestamp=1283464254261)
where the column name is a TimeUUIDType.
In the Hive table, it looks like the following row:
2d36a254bb04272b120aaf79d70a3578 t��ߒ4��!�� {"event_id":101}
So, the column name is unreadable in the Hive table.
This is a known issue with the automatic table mapping. For best results with a TimeUUIDType, turn the auto-mapping feature off in $brisk_home/resources/hive/hive-site.xml via the property:
"cassandra.autoCreateHiveSchema"
and create the table in Hive manually.
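A sketch of the corresponding hive-site.xml entry (the property name comes from the answer above; setting it to false to disable auto-mapping is an assumption):

<property>
  <name>cassandra.autoCreateHiveSchema</name>
  <value>false</value>
</property>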
