Querying the same number of labels in Presto

I have a field in a table named types, and this field can hold one or more type values. The values are strings inside [], each wrapped in double quotes, like this:
["type1","type2"]
["type1"]
["type3","type4","type11"]
This table has 100 distinct type values.
What I need: I want to query this table in a way that guarantees the same count for each type value, for example:
100 type1, 100 type2, ..., 100 type100
In the sample rows above, type1 appears twice.
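To make the requirement concrete, here is a minimal sketch in plain Python, not a Presto query. It assumes the types field holds a JSON array string, explodes each row into one entry per type value, and then draws at most a fixed number of rows per type. The function name balanced_sample and the parameters per_type and seed are hypothetical. A row with several type values can be picked once for each of them.

```python
import json
import random
from collections import defaultdict


def balanced_sample(rows, per_type, seed=0):
    """Return {type_value: [row indices]} with at most per_type rows per type.

    rows is a list of JSON array strings such as '["type1","type2"]'
    (an assumption about how the types field is stored).
    """
    rng = random.Random(seed)
    by_type = defaultdict(list)
    # Explode: one (type value, row index) pair per array element.
    for idx, raw in enumerate(rows):
        for type_value in json.loads(raw):
            by_type[type_value].append(idx)
    sample = {}
    for type_value, indices in by_type.items():
        # A type with fewer rows than per_type contributes all of its rows.
        k = min(per_type, len(indices))
        sample[type_value] = sorted(rng.sample(indices, k))
    return sample


rows = ['["type1","type2"]', '["type1"]', '["type3","type4","type11"]']
result = balanced_sample(rows, per_type=1)
```

With per_type=1, every type value in the sample rows ends up with exactly one row, even though type1 occurs in two rows.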

Related

PrestoDB: Filter to get rows where element is contained in map column's keys

I have a column of type map(varchar, varchar). I would like to filter on the keys of the map to get rows of the table where the map contains a given string.
How do I check whether a varchar is contained in the keys of a map type column?
Get rows whose map keys are exactly the given string (varchar) or array of strings:
select * from planet
where map_keys(tags) = ARRAY['barrier'];
Get rows where the array of map keys contains a particular string (varchar):
select * from planet
where contains(map_keys(tags), 'barrier');
In this case,
table name: planet
schema of column "tags": map(varchar, varchar)
String I was searching for in column tags: 'barrier'
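The difference between the two queries can be illustrated with a small Python sketch, where a dict stands in for the map(varchar, varchar) column. The sample rows and the helper names keys_equal and keys_contain are hypothetical, and the order in which Presto returns keys from map_keys is not modeled here.

```python
def keys_equal(tags, wanted):
    """Rough analogue of map_keys(tags) = ARRAY[...]: the key list must match exactly."""
    return list(tags.keys()) == list(wanted)


def keys_contain(tags, key):
    """Rough analogue of contains(map_keys(tags), key): the key only has to be present."""
    return key in tags


# Hypothetical rows of the planet table; each dict plays the role of the tags column.
planet = [
    {"barrier": "fence"},
    {"barrier": "wall", "name": "north gate"},
    {"highway": "residential"},
]

exact = [row for row in planet if keys_equal(row, ["barrier"])]
containing = [row for row in planet if keys_contain(row, "barrier")]
```

Here exact holds only the first row, because the second row has an extra key, while containing holds the first two rows.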

CQL - Uniqueness of elements in a set of user defined types

C* sets guarantee that all elements in a set are unique. How does it work for user defined types (UDT)?
With simple types, the cell name is just the name of the CQL column concatenated with the column value. For example if we have
CREATE TABLE friendsets (
    user text PRIMARY KEY,
    friends set<text>
);
The friends are stored as
(column=friends:'doug', value=)
(column=friends:'jon', value=)
What if friends is defined as a set of UDTs (friends set<frozen<Friend>>)? Will the cell name be 'friends' concatenated with the serialized value of Friend?
Cassandra will serialize the value for frozen types to a BLOB when you save it to a table. The representation on disk should be identical to that of any other element type in your set, but Cassandra will be able to deserialize the bytes to a UDT instance once they are read by a query.

programmatically determining Cassandra columns at runtime

I'm accessing a Cassandra database and I only know the table names.
I want to discover the names & types of the columns.
This will give me the column names:
select column_name
from system.schema_columns
where columnfamily_name = 'customer'
allow filtering;
Is this reasonable?
Does anyone have suggestions about determining column types?
Depending on what driver you're using, you should be able to use the metadata API.
A couple examples:
http://datastax.github.io/python-driver/api/cassandra/metadata.html#schemas
https://datastax.github.io/java-driver/features/metadata/#schema-metadata
The drivers query the system schema metadata to create these models.
You can infer the column types by looking at the classes used for the validator. The validator column is just a string.
The string has one of 3 formats:
org.apache.cassandra.db.marshal.XXXType for simple column types, where XXX is the Java type for the column (e.g. for bigint columns, XXX is "Long", for varchar/text, XXX is "UTF8", etc.)
org.apache.cassandra.db.marshal.SetType(org.apache.cassandra.db.marshal.XXXType) for set columns, where the type in parentheses is the type of each set element
org.apache.cassandra.db.marshal.MapType(org.apache.cassandra.db.marshal.XXXType,org.apache.cassandra.db.marshal.XXXType) for maps
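The three formats above can be picked apart with a few lines of string handling. This is a minimal sketch, assuming the validator strings look exactly as described and are not nested (for example, no map of sets). The names short_name and parse_validator are hypothetical and not part of any driver.

```python
PREFIX = "org.apache.cassandra.db.marshal."


def short_name(class_name):
    """Strip the package prefix and the trailing 'Type', e.g. '...LongType' -> 'Long'."""
    name = class_name.strip()
    if name.startswith(PREFIX):
        name = name[len(PREFIX):]
    if name.endswith("Type"):
        name = name[:-len("Type")]
    return name


def parse_validator(validator):
    """Return (kind, element type names) for a non-nested validator string."""
    validator = validator.strip()
    if "(" not in validator:
        # Format 1: a simple column type.
        return ("simple", [short_name(validator)])
    # Formats 2 and 3: SetType(...) or MapType(..., ...).
    outer, _, inner = validator.partition("(")
    if inner.endswith(")"):
        inner = inner[:-1]
    kind = short_name(outer).lower()
    return (kind, [short_name(part) for part in inner.split(",")])


simple = parse_validator("org.apache.cassandra.db.marshal.LongType")
as_set = parse_validator(
    "org.apache.cassandra.db.marshal.SetType(org.apache.cassandra.db.marshal.UTF8Type)"
)
as_map = parse_validator(
    "org.apache.cassandra.db.marshal.MapType("
    "org.apache.cassandra.db.marshal.UTF8Type,org.apache.cassandra.db.marshal.LongType)"
)
```

The returned names are the marshal class stems (Long, UTF8, and so on); mapping them back to CQL type names such as bigint or text is left to the caller.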
This is quite an old question, but still a valid one. Your model has a class variable that describes its columns (field name and column class):
class Tweet(cqldb.Model):
    """
    Object representing the tweet column family in Cassandra
    """
    __keyspace__ = 'my_ks'
    # follows model definition
    ...
...
print(Tweet._defined_columns)
# output
OrderedDict([('tweetid',
              <cassandra.cqlengine.columns.Text at 0x7f4a4c9b66a0>),
             ('tweet_id',
              <cassandra.cqlengine.columns.BigInt at 0x7f4a4c9b6828>),
             ('created_at',
              <cassandra.cqlengine.columns.DateTime at 0x7f4a4c9b6748>),
             ('ttype',
              <cassandra.cqlengine.columns.Text at 0x7f4a4c9b6198>),
             ('tweet',
              <cassandra.cqlengine.columns.Text at 0x7f4a4c9b6390>),
             ('lang',
              <cassandra.cqlengine.columns.Text at 0x7f4a4c9b3d68>)])

Reading columns with different valuetypes

A SliceQuery<Long, String, String> says the key type is Long, the column name is a String, and the column value is a String. When I execute the slice query, I get a QueryResult<ColumnSlice<String, String>> and can read all the columns, but the key is a Long, so there must be a column whose value is of type Long. I find it a bit confusing how type safety works here (since the query result fixes a single column value type).
Also, if there is a column whose value type is not String, a problem must arise.
How can I write a generic SliceQuery that can query columns of different value types?
P.S.: I am new to Cassandra/Hector.
Thanks
Almost. The first type is the type of the row key, as you point out, but the row key is not stored as a column. The row key is stored off in some other special place. This is one of those gotchas that folks coming from the relational DB world (like me) trip over.
As for how to manage column values with different types, there's a two-pronged approach. First, you store the value as a byte array and serialize it yourself. Second, you key off of the column name to tell you which column - and therefore which value type - you're dealing with. Once you know the correct type you can use the appropriate Serializer to deserialize the byte value into a variable of the correct type. For your own complex objects and special types, you can write your own serializers.
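The two-pronged approach can be sketched in plain Python. This is not Hector code: the column names, the SERIALIZERS registry, and the helpers encode_row and decode_row are all hypothetical, and struct stands in for whatever serializer classes the client library provides.

```python
import struct

# Hypothetical registry: column name -> (serialize, deserialize).
# The column name tells us which value type we are dealing with.
SERIALIZERS = {
    "name": (lambda s: s.encode("utf-8"), lambda b: b.decode("utf-8")),
    "age": (lambda n: struct.pack(">q", n), lambda b: struct.unpack(">q", b)[0]),
    "score": (lambda x: struct.pack(">d", x), lambda b: struct.unpack(">d", b)[0]),
}


def encode_row(values):
    """Serialize every value to bytes, as it would be stored in the column."""
    return {col: SERIALIZERS[col][0](val) for col, val in values.items()}


def decode_row(raw):
    """Pick the deserializer by column name and turn the bytes back into typed values."""
    return {col: SERIALIZERS[col][1](data) for col, data in raw.items()}


original = {"name": "Joe", "age": 42, "score": 3.5}
stored = encode_row(original)   # every value is now a byte string
decoded = decode_row(stored)    # typed values recovered via the column name
```

The query itself only ever sees byte values; the type knowledge lives in the mapping from column name to serializer.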

Cassandra-secondary index on part of the composite key?

I am using a composite primary key consisting of 2 strings Name1, Name2, and a timestamp (e.g. 'Joe:Smith:123456'). I want to query a range of timestamps given an equality condition for either Name1 or Name2.
For example, in SQL:
SELECT * FROM testcf WHERE (timestamp > 111111 AND timestamp < 222222 and Name2 = 'Brown');
and
SELECT * FROM testcf WHERE (timestamp > 111111 AND timestamp < 222222 and Name1 = 'Charlie');
From my understanding, the first part of the composite key is the partition key, so the second query is possible, but the first query would require some kind of index on Name2.
Is it possible to create a separate index on a component of the composite key? Or am I misunderstanding something here?
You will need to manually create and maintain an index of names if you want to keep your schema and support the first query. Given this requirement, I question your choice of data model. Your model should be designed with your read pattern in mind. I presume you are also storing some column values that you want to query by timestamp. If so, perhaps the following model would serve you better:
"[current_day]:Joe:Smith" {
    123456:Field1 : value
    123456:Field2 : value
    123450:Field1 : value
    123450:Field2 : value
}
With this model you can use the current day (or some known day) as a sentinel value, then filter on first and last names. You can also get a range of columns by timestamp using the composite column names.
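The range read over composite column names can be sketched in plain Python. This is only a model of the idea, not Cassandra code: a sorted list of (timestamp, field, value) tuples stands in for the columns of one row, sorted ascending here for simplicity, and the helper slice_by_timestamp is hypothetical. It assumes integer timestamps.

```python
from bisect import bisect_left

# One wide row, keyed by the placeholder row key from the model above.
# Columns are kept sorted by their composite name (timestamp, field).
rows = {
    "[current_day]:Joe:Smith": sorted([
        (123456, "Field1", "c"),
        (123456, "Field2", "d"),
        (123450, "Field1", "a"),
        (123450, "Field2", "b"),
    ]),
}


def slice_by_timestamp(columns, start, end):
    """Return the columns whose timestamp lies in [start, end], using binary search."""
    lo = bisect_left(columns, (start,))      # first column with timestamp >= start
    hi = bisect_left(columns, (end + 1,))    # first column with timestamp > end
    return columns[lo:hi]


columns = rows["[current_day]:Joe:Smith"]
recent = slice_by_timestamp(columns, 123451, 123456)
everything = slice_by_timestamp(columns, 111111, 222222)
```

Because the timestamp is the leading component of the column name, all fields written at the same timestamp sit next to each other and come back together in one slice.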
