programmatically determining Cassandra columns at runtime

I'm accessing a Cassandra database and I only know the table names.
I want to discover the names & types of the columns.
This will give me the column names:
select column_name
from system.schema_columns
where columnfamily_name = 'customer'
allow filtering;
Is this reasonable?
Does anyone have suggestions about determining column types?

Depending on which driver you're using, you should be able to use its metadata API.
A couple of examples:
http://datastax.github.io/python-driver/api/cassandra/metadata.html#schemas
https://datastax.github.io/java-driver/features/metadata/#schema-metadata
The drivers query the system schema metadata to create these models.
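For instance, with the Python driver you can read the column names and types straight from the cluster metadata. A minimal sketch, assuming a local contact point and placeholder keyspace/table names ('my_ks', 'customer'):

from cassandra.cluster import Cluster

cluster = Cluster(['127.0.0.1'])  # assumed contact point
session = cluster.connect()

# The driver builds this model from the system schema tables.
table_meta = cluster.metadata.keyspaces['my_ks'].tables['customer']
for name, column in table_meta.columns.items():
    print(name, column.cql_type)  # column name and its CQL type string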

You can infer the column types by looking at the classes used for the validator. The validator column is just a string, and the string has one of three formats (a parsing sketch follows this list):
org.apache.cassandra.db.marshal.XXXType for simple column types, where XXX is the Java type for the column (e.g. for bigint columns, XXX is "Long", for varchar/text, XXX is "UTF8", etc.)
org.apache.cassandra.db.marshal.SetType(org.apache.cassandra.db.marshal.XXXType) for set columns, where the type in parenthesis is the type of each set element
org.apache.cassandra.db.marshal.MapType(org.apache.cassandra.db.marshal.XXXType,org.apache.cassandra.db.marshal.XXXType) for maps
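As a rough illustration, here is a naive Python sketch that maps validator strings to CQL type names; the SIMPLE_TYPES table is deliberately incomplete, and nested collection types are not handled:

import re

# Partial mapping from marshal class names to CQL type names
# (assumed and incomplete; extend as needed).
SIMPLE_TYPES = {
    'LongType': 'bigint',
    'UTF8Type': 'text',
    'Int32Type': 'int',
    'BooleanType': 'boolean',
}

def cql_type(validator):
    # Strip the common package prefix, then match the three formats.
    v = validator.replace('org.apache.cassandra.db.marshal.', '')
    m = re.match(r'SetType\((\w+)\)$', v)
    if m:
        return 'set<%s>' % SIMPLE_TYPES.get(m.group(1), m.group(1))
    m = re.match(r'MapType\((\w+),(\w+)\)$', v)
    if m:
        key_t = SIMPLE_TYPES.get(m.group(1), m.group(1))
        val_t = SIMPLE_TYPES.get(m.group(2), m.group(2))
        return 'map<%s, %s>' % (key_t, val_t)
    return SIMPLE_TYPES.get(v, v)

print(cql_type('org.apache.cassandra.db.marshal.LongType'))  # bigint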

Quite an old but still valid question. There is a class variable on your model that describes the columns (field name and column class):
class Tweet(cqldb.Model):
    """
    Object representing the tweet column family in Cassandra
    """
    __keyspace__ = 'my_ks'
    # Column definitions (reconstructed from the output below;
    # the primary key choice is an assumption)
    tweetid = columns.Text(primary_key=True)
    tweet_id = columns.BigInt()
    created_at = columns.DateTime()
    ttype = columns.Text()
    tweet = columns.Text()
    lang = columns.Text()
print(Tweet._defined_columns)
# output
OrderedDict([('tweetid',
              <cassandra.cqlengine.columns.Text at 0x7f4a4c9b66a0>),
             ('tweet_id',
              <cassandra.cqlengine.columns.BigInt at 0x7f4a4c9b6828>),
             ('created_at',
              <cassandra.cqlengine.columns.DateTime at 0x7f4a4c9b6748>),
             ('ttype',
              <cassandra.cqlengine.columns.Text at 0x7f4a4c9b6198>),
             ('tweet',
              <cassandra.cqlengine.columns.Text at 0x7f4a4c9b6390>),
             ('lang',
              <cassandra.cqlengine.columns.Text at 0x7f4a4c9b3d68>)])
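If you want the CQL type names rather than the column classes, each cqlengine column object also carries a db_type attribute; a small follow-up sketch:

for name, column in Tweet._defined_columns.items():
    print(name, column.db_type)
# tweetid text
# tweet_id bigint
# created_at timestamp
# ...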

Related

How to Flatten a semicolon Array properly in Azure Data Factory?

Context: I have a data flow that extracts data from a SQL DB. The data arrives as just one column containing a string separated by tabs, so in order to manipulate it properly I've tried to separate every column with its corresponding data:
First, to 'rebuild' the table I used a 'Derived Column' activity to replace tabs with semicolons (1):
dropLeft(regexReplace(regexReplace(regexReplace(descripcion,'[\t]',';'),'[\n]',';'),'[\r]',';'),1)
Then I used the 'split()' function to get an array and build the columns (2):
split(descripcion, ';')
Problem: When I try to use the 'Flatten' activity (as described here: https://learn.microsoft.com/en-us/azure/data-factory/data-flow-flatten), it just doesn't work: the data flow gives me only one column, or if I add an additional column in the 'Flatten' activity I just get another column with the same data as the first one.
Expected output:
column1  column2                             column3
2000017  ENVASE CORONA CLARA 24/355 ML GRAB  PC13
2004297  ENVASE V FAM GRAB 12/940 ML USADO   PC15
Could you tell me what I'm doing wrong? Thanks, by the way.
You can use the Derived Column activity itself; try it as below.
After the first derived column, what you have is a string array, which can just be split again using the derived column schema modifier,
where firstc represents the source column equivalent to your descripcion column:
Column1: split(firstc, ';')[1]
Column2: split(firstc, ';')[2]
Column3: split(firstc, ';')[3]
Optionally, you can select the columns you need to write to the SQL sink.
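For intuition, the same transformation expressed outside ADF, as a Python sketch with a made-up sample row (note that ADF data flow array indexes are 1-based, while Python's are 0-based):

import re

# Hypothetical sample row: leading tab, then tab-separated fields,
# matching the dropLeft(..., 1) in the expression above.
descripcion = "\t2000017\tENVASE CORONA CLARA 24/355 ML GRAB\tPC13"

# Equivalent of the regexReplace chain plus dropLeft(..., 1).
cleaned = re.sub(r"[\t\n\r]", ";", descripcion)[1:]

# Equivalent of split(firstc, ';')[1] / [2] / [3].
parts = cleaned.split(";")
column1, column2, column3 = parts[0], parts[1], parts[2]
print(column1, column2, column3)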

Dynamically rename/map columns according to Azure Table data model convention

How would you dynamically rename/map columns according to the Azure Table data model convention, where property key names should follow C# identifier rules? We cannot guarantee that the columns coming to us conform to the standard, or that new columns coming in are automatically fixed.
Example:
column_1 (something_in_parens), column with spaces, ...
should be returned as:
column_1 something_in_parens, column_with_spaces, ...
The obvious solution might be to run a Databricks Python step in front of the Copy Data activity, but maybe Copy Data is able to infer the right schema?
columns = ["some Not so nice column Names", "Another ONE", "Last_one"]

new_columns = [x.lower().replace(" ", "_") for x in columns]
# returns ['some_not_so_nice_column_names', 'another_one', 'last_one']
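That simple lower/replace doesn't remove the parentheses from the question's example; a slightly fuller sketch (the helper name and regex are my own assumptions):

import re

def to_identifier(name):
    # Lowercase, turn spaces into underscores, then drop anything
    # that isn't valid in a C#-style identifier (e.g. parentheses).
    name = name.strip().lower().replace(" ", "_")
    return re.sub(r"[^0-9a-z_]", "", name)

print(to_identifier("column_1 (something_in_parens)"))
# column_1_something_in_parens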

CQL - Uniqueness of elements in a set of user defined types

C* sets guarantee that all elements in a set are unique. How does this work for user defined types (UDTs)?
With simple types, the cell name is just the name of the CQL column concatenated with the column value. For example if we have
CREATE TABLE friendsets (
... user text PRIMARY KEY,
... friends set <text>
... );
The friends are stored as
(column=friends:'doug', value=)
(column=friends:'jon', value=)
What if friends is defined as a set of a UDT (friends set<frozen<Friend>>)? Will the name of the cells be 'friends' concatenated with the serialized value of Friend?
Cassandra will serialize the value of a frozen type to a blob when you save it to a table. The representation on disk should be indistinguishable from that of any other set element type, but Cassandra will be able to deserialize the bytes back into a UDT instance once they are read by a query.
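For reference, a minimal sketch of the frozen-UDT set definition, issued through the Python driver (the contact point, keyspace 'my_ks', and Friend's fields are assumptions):

from cassandra.cluster import Cluster

session = Cluster(['127.0.0.1']).connect('my_ks')  # assumed keyspace

# A UDT must be frozen to be used inside a set; Cassandra then
# serializes each friend value to a single blob for the cell name.
session.execute("CREATE TYPE IF NOT EXISTS friend (name text, age int)")
session.execute("""
    CREATE TABLE IF NOT EXISTS friendsets (
        user text PRIMARY KEY,
        friends set<frozen<friend>>
    )
""")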

how to support composite column names in CQL3 with empty prefixes

In Thrift, you could have composite columns of the form string:bytearray, integer:bytearray, and decimal:bytearray. Once defined, you could store values in an integer:bytearray like so:
{empty}.somebytearray
{empty}.somebytearray
5.somebytearray
10.somebytearray
I could then query and get all the columns that were prefixed with {empty}.
It seems this cannot be done in CQL3, so we cannot port our code to CQL3 at this time? Is there a ticket for this, or will it ever be resolved?
thanks,
Dean
The empty column name isn't null.
A good example is the CQL3 row marker, which looks like this when exported via sstable2json:
//<--- row marker ----->
{"key": "6b657937","columns": [["","",1375804090248000], ["value","value7",1375804090248000]]}
It looks like the column name is empty, but it's actually a byte array of 3 bytes: a two-byte component length of zero, followed by the end-of-component byte. So say we want to add a column with an empty name:
// An empty composite column name: three zero bytes (the two-byte
// component length of 0, followed by the end-of-component byte)
columnFamily.addColumn(ByteBuffer.wrap(new byte[3]), value, timestamp);

range queries in cassandra

The following works as expected. But how do I execute range queries like "where age > 40 and age < 50"?
create keyspace Keyspace1;
use Keyspace1;
create column family Users with comparator=UTF8Type and default_validation_class=UTF8Type and key_validation_class=UTF8Type;
set Users[jsmith][first] = 'John';
set Users[jsmith][last] = 'Smith';
set Users[jsmith][age] = long(42);
get Users[jsmith];
=> (column=age, value=42, timestamp=1341375850335000)
=> (column=first, value=John, timestamp=1341375827657000)
=> (column=last, value=Smith, timestamp=1341375838375000)
The best way to do this in Cassandra varies depending on your requirements, but the approaches are fairly similar for supporting these types of range queries.
Basically, you will take advantage of the fact that columns within a row are sorted by their names. So, if you use an age as the column name (or part of the column name), the row will be sorted by ages.
You will find a lot of similarities between this and storing time-series data. I suggest you take a look at Basic Time Series with Cassandra for the fundamentals, and the second half of an intro to the latest CQL features that gives an example of a somewhat more powerful approach.
The built-in secondary indexes are basically designed like a hash table, and don't work for range queries unless that range expression accompanies an equality expression on an indexed column. So, you could ask for select * from users where name = 'Joe' and age > 54, but not simply select * from users where age > 54, because that would require a full table scan. See Secondary Indexes doc for more details.
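To make the first approach concrete, here is a sketch of the modern CQL equivalent (the table name, bucket scheme, and keyspace are made up): put age in a clustering column so rows sort by age within a partition, then range-query it.

from cassandra.cluster import Cluster

session = Cluster(['127.0.0.1']).connect('my_ks')  # assumed keyspace
session.execute("""
    CREATE TABLE IF NOT EXISTS users_by_age (
        bucket text,   -- artificial partition key
        age bigint,
        user text,
        PRIMARY KEY (bucket, age, user)
    )
""")

# Range query on age, scoped to one partition by the bucket equality.
rows = session.execute(
    "SELECT user, age FROM users_by_age "
    "WHERE bucket = 'all' AND age > 40 AND age < 50")
for row in rows:
    print(row.user, row.age)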
You have to create a secondary index on the column age:
update column family Users with column_metadata=[{column_name: age, validation_class: LongType, index_type: KEYS}];
Then use:
get Users where age > 40 and age < 50
Note: I think exclusive operators are not supported since 1.2.
DataStax has good documentation about that: http://www.datastax.com/dev/blog/whats-new-cassandra-07-secondary-indexes
Or you can create and maintain your own secondary index. This is a good link about that:
http://www.anuff.com/2010/07/secondary-indexes-in-cassandra.html
