Spark Cassandra Connector - not able to fetch dynamic columns - cassandra

I have a cassandra column family with a lot of dynamic columns. I am running a simple Spark-Cassandra connector example where I am trying to fetch all the data from this table. The issue is that it is not fetching any of the dynamic columns from my column family.
With the code and schema below, it fetches the primary key and the secondary-index column for every row, but none of the other columns (the column family has 30+ dynamic columns). Based on my reading here (Spark Datastax Java API Select statements), I suspect the connector currently supports fetching only partition and clustering keys as columns. Could someone please confirm whether my understanding is correct? It would be great if someone could suggest how to fix this.
/**
 * Loads a cassandra column family as a spark RDD.
 */
public static CassandraJavaRDD<CassandraRow> getCassandraTableRDD(
        JavaSparkContext context, String keyspace, String table)
{
    return javaFunctions(context).cassandraTable(keyspace, table);
}
CREATE TABLE source_product_canonical_data_sample (
  'key' text PRIMARY KEY,
  source text
) WITH
  comment='' AND
  comparator=text AND
  read_repair_chance=0.000000 AND
  gc_grace_seconds=864000 AND
  default_validation=text AND
  min_compaction_threshold=4 AND
  max_compaction_threshold=32 AND
  replicate_on_write='true' AND
  compaction_strategy_class='SizeTieredCompactionStrategy' AND
  compression_parameters:sstable_compression='LZ4Compressor';

Your CQL table definition is not aware of your "dynamic columns": it has no compound primary key with clustering columns. Because the connector reads the table through CQL, it only sees the columns declared there. "Dynamic columns" and "wide rows" are terms from the old Thrift data model; in CQL they have been replaced by compound primary keys.
See this excellent blog post by Jonathan Ellis explaining how to transition to the new data model: http://www.datastax.com/dev/blog/does-cql-support-dynamic-columns-wide-rows
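To make the idea concrete, here is a minimal Python sketch (not connector code) of how one Thrift-style wide row corresponds to several CQL rows once the column name becomes a clustering column. The names `to_cql_rows`, `column1`, and `value`, and the sample data, are illustrative assumptions and are not taken from the question's schema.

```python
from typing import Dict, List, Tuple

# Hypothetical Thrift-style wide row: one row key, arbitrary column names.
ThriftRow = Dict[str, str]


def to_cql_rows(row_key: str, columns: ThriftRow) -> List[Tuple[str, str, str]]:
    """Flatten one wide row into (key, column1, value) tuples.

    The tuples are sorted by column name, roughly the way a text
    comparator would order the cells inside a partition.
    """
    return [(row_key, name, value) for name, value in sorted(columns.items())]


# Made-up sample data for illustration only.
wide_row = {"source": "feed-a", "title": "Widget", "price": "9.99"}
for cql_row in to_cql_rows("sku-1", wide_row):
    print(cql_row)
```

Each printed tuple plays the role of one CQL row in a table whose primary key is (key, column1), which is the shape the blog post describes.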


Cassandra : (3.11.11) find a string in the cassandra table column

I am a newbie to Cassandra.
I have a table (table1) with data like this:
ch1,ch2,ch3,ch4
LD,9813970,1484914,'T03103','T04014'
LD,1008203,1486104,'T03103','T04024'
I want to find a string in this table. Is there any way to search for a given string in column ch4 using only the IN operator (not the LIKE operator)? A sample query would be:
select * from table1 where 'T04014' IN (ch4)
If required, the ch4 column may be included in the partition or clustering keys.
You didn't post the table schema so I'm going to assume that ch4 is not part of the primary key.
You cannot include a column in the filter unless it is part of the primary key or you have a secondary index defined on it. Be aware that secondary indexes are not always a good fit. Have a look at when to use an index for details.
The general recommendation is to denormalise and create a table specifically designed for each app query so you get the best performance out of your cluster. Cheers!
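As a rough illustration of the denormalisation idea, the Python sketch below groups the sample rows by the value you want to search on, which is what a query table partitioned by ch4 would do for you. It is plain Python, not driver code; `build_lookup` is a hypothetical helper, and treating the last field of each row as ch4 is an assumption, since the schema was not posted.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Row = Tuple[str, int, int, str, str]

# Sample rows from the question; the last field is treated as ch4 (an assumption).
rows: List[Row] = [
    ("LD", 9813970, 1484914, "T03103", "T04014"),
    ("LD", 1008203, 1486104, "T03103", "T04024"),
]


def build_lookup(source_rows: List[Row]) -> Dict[str, List[Row]]:
    """Group rows by their last field, like a table partitioned by ch4."""
    lookup: Dict[str, List[Row]] = defaultdict(list)
    for row in source_rows:
        lookup[row[-1]].append(row)
    return lookup


by_ch4 = build_lookup(rows)
print(by_ch4["T04014"])
```

With a table keyed this way, the lookup becomes a plain equality (or IN) restriction on the partition key instead of a filter on a regular column.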

In Cassandra, why dropping a column from tables defined with compact storage not allowed?

As per the DataStax documentation here, we cannot drop a column from tables defined with the COMPACT STORAGE option. What is the reason for this?
This goes back to the original implementation of CQL3, and changes which were made to allow it to abstract a "SQL-like," wide-row structure on top of the original Thrift-based storage engine. Ultimately, managing the schema comes down to whether or not the underlying structure is a table or a column_family.
As an example, I'll create two tables using an old install of Apache Cassandra (2.1.19):
CREATE TABLE student (
  studentid TEXT PRIMARY KEY,
  fname TEXT,
  lname TEXT);
CREATE TABLE studentcomp (
  studentid TEXT PRIMARY KEY,
  fname TEXT,
  lname TEXT)
WITH COMPACT STORAGE;
I'll insert one row into each table:
INSERT INTO student (studentid, fname, lname) VALUES ('janderson','Jordy','Anderson');
INSERT INTO studentcomp (studentid, fname, lname) VALUES ('janderson','Jordy','Anderson');
And then I'll look at the tables with the old cassandra-cli tool:
[default@stackoverflow] list student;
Using default limit of 100
Using default cell limit of 100
-------------------
RowKey: janderson
=> (name=, value=, timestamp=1599248215128672)
=> (name=fname, value=4a6f726479, timestamp=1599248215128672)
=> (name=lname, value=416e646572736f6e, timestamp=1599248215128672)
[default@stackoverflow] list studentcomp;
Using default limit of 100
Using default cell limit of 100
-------------------
RowKey: janderson
=> (name=fname, value=Jordy, timestamp=1599248302715066)
=> (name=lname, value=Anderson, timestamp=1599248302715066)
Do you see the empty/"ghost" column value in the first result? That empty column value was CQL3's link between the column values and the table's metadata. If it's not there, then CQL cannot be used to manage a table's columns.
The comparator used for type conversion was all that was really exposed via Thrift. This lack of metadata control/exposure is what allowed Cassandra to be considered "schemaless" in the pre-CQL days. If I run describe studentcomp from within cassandra-cli, I can see the comparators (validation classes) used:
Column Metadata:
Column Name: lname
Validation Class: org.apache.cassandra.db.marshal.UTF8Type
Column Name: fname
Validation Class: org.apache.cassandra.db.marshal.UTF8Type
But if I try describe student, I see this:
WARNING: CQL3 tables are intentionally omitted from 'describe' output.
See https://issues.apache.org/jira/browse/CASSANDRA-4377 for details.
Sorry, no Keyspace nor (non-CQL3) ColumnFamily was found with name: student (if this is a CQL3 table, you should use cqlsh instead)
Basically, tables and column families were different entities forced into the same bucket. Adding WITH COMPACT STORAGE essentially made a table a column family.
With that came the lack of any schema management (adding or removing columns), outside of access to the comparators.
Edit 20200905
Can we somehow / someway (hack) drop the columns from table?
You might be able to accomplish this. Sylvain Lebresne wrote A Thrift to CQL3 Upgrade Guide which will have some necessary details for you. I also advise reading through the Jira ticket mentioned above (CASSANDRA-4377), as that covers many of the in-depth technical challenges that make this difficult.

Fetch distinct field values from frozen set column in Cassandra columnfamily

Hi, please help me write a CQL query for the requirement below.
- The column family contains the columns deptid (datatype: uuid) and emplList (datatype: set<frozen<employee>>).
How can I get all distinct employee names from the employee objects stored in the emplList set column?
Such queries cannot be expressed in pure CQL: Cassandra is optimized to read data by primary key, and aggregation operations are very limited. You have two choices:
- Read all the data from the table in your program and extract the distinct values.
- Use Spark with the Spark Cassandra Connector. It will still read all the data from the table, but you get a higher-level abstraction for working with the data, and it can scan the table in a more optimized way.
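The first option (read everything, then deduplicate on the client) can be sketched in plain Python as follows. The `Employee` fields and the in-memory `rows` list are assumptions standing in for the real UDT and for whatever your driver returns from a full-table read; the question does not show either.

```python
from typing import FrozenSet, Iterable, NamedTuple, Optional, Set, Tuple
from uuid import UUID, uuid4


class Employee(NamedTuple):
    # Hypothetical fields; the question does not show the employee UDT.
    name: str
    title: str


# Each row is (deptid, emplList); emplList may be None for an unset column.
DeptRow = Tuple[UUID, Optional[FrozenSet[Employee]]]


def distinct_employee_names(rows: Iterable[DeptRow]) -> Set[str]:
    """Collect the distinct employee names across every row's set."""
    names: Set[str] = set()
    for _deptid, empl_list in rows:
        for employee in empl_list or ():
            names.add(employee.name)
    return names


# Made-up rows standing in for the result of a full-table read.
rows = [
    (uuid4(), frozenset({Employee("Ann", "dev"), Employee("Bob", "ops")})),
    (uuid4(), frozenset({Employee("Ann", "lead")})),
    (uuid4(), None),
]
print(sorted(distinct_employee_names(rows)))
```

The same loop is what a Spark job would express as a flat-map over the set followed by a distinct, only distributed across executors.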

Cassandra dynamic column family

I am new to Cassandra and have read some articles about static and dynamic column families.
They mention that from Cassandra 3 onward, a table and a column family are the same thing.
I created a keyspace and some tables, and inserted data into them.
CREATE TABLE subscribers(
id uuid,
email text,
first_name text,
last_name text,
PRIMARY KEY(id,email)
);
INSERT INTO subscribers(id,email,first_name,last_name)
VALUES(now(),'Test@123.com','Test1','User1');
INSERT INTO subscribers(id,email,first_name,last_name)
VALUES(now(),'Test2@222.com','Test2','User2');
INSERT INTO subscribers(id,email,first_name,last_name)
VALUES(now(),'Test3@333.com','Test3','User3');
It all seems to work fine.
But what I need is a dynamic column family with only data types and no predefined columns.
An insert query could then pass different arguments each time, and the row should still be inserted.
The articles mention that a dynamic column family does not need a schema (predefined columns).
I am not sure whether this is possible in Cassandra or whether my understanding is wrong.
Is this possible? If so, kindly provide some examples.
Thanks in advance.
I think the articles you're referring to were written in the early years of Cassandra, when it was based on the Thrift protocol. The Cassandra Query Language (CQL) was introduced many years ago and is now the way to work with Cassandra: Thrift is deprecated in Cassandra 3.x and fully removed in 4.0 (not released yet).
If you really need something fully dynamic, you can try to emulate it with a table whose columns are maps from text to a specific type, like this:
create table abc (
id int primary key,
imap map<text,int>,
tmap map<text,text>,
... more types
);
But you need to be careful: collections come with limitations and performance costs, especially if you want to store more than hundreds of elements.
Another approach is to store the data as individual rows:
create table xxxx (
id int,
col_name text,
ival int,
tval text,
... more types
primary key(id, col_name));
Then you can insert each value as its own row, one per column name:
insert into xxxx(id, col_name, ival) values (1, 'col1', 1);
insert into xxxx(id, col_name, tval) values (1, 'col2', 'text');
And select all of the "columns" for an id with:
select * from xxxx where id = 1;
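On the client side, the rows returned by that select can be folded back into a single dynamic record. Below is a minimal Python sketch of that step, assuming each returned row carries the id, col_name, ival, and tval columns from the table above; `Cell` and `collect_columns` are hypothetical names, and no driver is involved.

```python
from typing import Dict, List, NamedTuple, Optional, Union


class Cell(NamedTuple):
    """One row of the xxxx table: a column name plus one typed value slot."""
    id: int
    col_name: str
    ival: Optional[int] = None
    tval: Optional[str] = None


def collect_columns(cells: List[Cell], row_id: int) -> Dict[str, Union[int, str]]:
    """Rebuild a {column name: value} dict for one id.

    For each cell, whichever typed value column is set is used;
    cells with no value set are skipped.
    """
    result: Dict[str, Union[int, str]] = {}
    for cell in cells:
        if cell.id != row_id:
            continue
        value = cell.ival if cell.ival is not None else cell.tval
        if value is not None:
            result[cell.col_name] = value
    return result


# Mirrors the two inserts shown above.
cells = [
    Cell(id=1, col_name="col1", ival=1),
    Cell(id=1, col_name="col2", tval="text"),
]
print(collect_columns(cells, 1))
```

Keeping one value column per type means the type of each "dynamic column" is recovered from which slot is non-null, at the cost of a little bookkeeping in the application.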

How to add the multiple column as a primary keys in cassandra?

I have an existing table with millions of records. Initially we defined two columns as the partition key and clustering key, and now I want to add two more columns to the table as part of the partition key.
How?
If you make a change to the partition key you will need to create a new table and import the existing data. This is due to, in part, the fact that a partition key is not equal to a primary key in a relational database. The partition key is hashed by Cassandra and that hash is used to find partitions on disk. If you change the partition key you change the hash value and can no longer look up the partition!
CREATE TABLE KEYSPACE_NAME.AMAR_EXAMPLE (
  COLUMN_1 TYPE,
  COLUMN_2 TYPE,
  COLUMN_3 TYPE,
  ...
  COLUMN_N TYPE,
  CLUSTERING_COLUMN TYPE,
  // Here we declare the partition key columns and clustering columns
  PRIMARY KEY ((COLUMN_1, COLUMN_2, COLUMN_3, COLUMN_4), CLUSTERING_COLUMN)
)
// If you need to change the default clustering order, declare that here
WITH CLUSTERING ORDER BY (CLUSTERING_COLUMN DESC);
You could export the data to CSV using COPY and then import it into the new table via COPY, or use sstableloader. There is plenty of documentation and walkthroughs on how to use those tools. For example, this DataStax blog post talks about the changes made to the updated sstableloader. If you create a new table and import the existing data, you will create new partitions and new hashes. Cassandra will not let you simply add additional columns to the partition key after the table has been created.
Understanding your data and the Cassandra data modeling techniques will help reduce the amount of work you may find yourself doing when changing partition keys. Check out the self-paced courses provided by DataStax. DS220: Data Modeling could really help.
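The point about hashes can be illustrated with a toy partitioner in Python. This is only a sketch: `toy_token` is a hypothetical stand-in that uses MD5 for convenience, it does not reproduce the tokens a real Cassandra partitioner would compute, and the key values are made up.

```python
import hashlib
from typing import Tuple


def toy_token(partition_key: Tuple[str, ...]) -> int:
    """Stand-in for a partitioner: hash the key columns to a signed 64-bit int.

    MD5 is used purely for illustration; the result is NOT the token
    Cassandra would assign to the same key.
    """
    joined = "\x00".join(partition_key).encode("utf-8")
    return int.from_bytes(hashlib.md5(joined).digest()[:8], "big", signed=True)


# Made-up key values: the same logical record under the old and new key.
old_key = ("acme", "2020")
new_key = ("acme", "2020", "eu", "web")

print("old token:", toy_token(old_key))
print("new token:", toy_token(new_key))
```

The two tokens differ even though the first two key columns are identical, which is why the existing partitions cannot be reused and the data has to be rewritten into a new table.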
