Is it possible to create a Cassandra table with GeoMesa specifying keys (i.e. a composite key)? I have a Spark job that writes to Cassandra, and a composite key is necessary for the output table. I would now like to create and write that same table through the GeoMesa API instead of directly to Cassandra. The format is like this:
CREATE TABLE IF NOT EXISTS mykeyspace.testcompkey (pkey1 text, ckey1 int, attr1 int, attr2 int, minlat decimal, minlong decimal, maxlat decimal, maxlong decimal, updatetime text, PRIMARY KEY((pkey1), ckey1) )
Is this possible? As you can see in the CREATE TABLE statement, I have a partition key and a clustering key. From what I have read, I believe GeoServer supports both simple and complex features. I am just wondering whether that support also carries over to Cassandra with GeoMesa.
Thank you
GeoMesa does use composite partition and clustering keys for Cassandra tables, but the keys are not configurable by the user - they are designed to facilitate spatial/temporal/attribute CQL queries.
Keys can be seen in the index table implementations here. The columns field (for example here) defines the primary keys. Columns with partition = true are used for partitioning, the rest are used for clustering.
Related
I am fairly new to Cassandra and currently have the following table in Cassandra:
CREATE TABLE time_data (
id int,
secondary_id int,
timestamp timestamp,
value bigint,
PRIMARY KEY ((id, secondary_id), timestamp)
);
The compound partition key (with secondary_id) is necessary in order to not violate max partition sizes.
The issue I am running into is that I would like to run the query SELECT * FROM time_data WHERE id = ?. Because the table has a compound partition key, this query requires filtering. I realize this is querying a lot of data and partitions, but it is necessary for the application. For reference, id has relatively low cardinality and secondary_id has high cardinality.
What is the best way around this? Should I simply allow filtering on the query? Or is it better to create a secondary index like CREATE INDEX id_idx ON time_data (id)?
You will need to specify the full partition key on queries (ALLOW FILTERING will impact performance badly in most cases).
One way to go, if you know all the secondary_id values (you could add a table to track them if necessary), is to do the work in your application: query all (id, secondary_id) pairs and process the results afterwards. This has the disadvantage of being more complex, but the advantage that it can be done with async queries and in parallel, so many nodes in your cluster participate in processing your task.
See also https://www.datastax.com/dev/blog/java-driver-async-queries
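The fan-out described above can be sketched roughly as follows. This is a minimal illustration, not driver code: `fetch_partition` is a hypothetical stand-in for one single-partition query such as `SELECT * FROM time_data WHERE id = ? AND secondary_id = ?`, and the in-memory `PARTITIONS` dict stands in for the cluster. With a real driver you would swap it for the driver's asynchronous execute call.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical in-memory stand-in for the time_data table:
# (id, secondary_id) -> list of (timestamp, value) rows.
PARTITIONS = {
    (1, 10): [(1000, 5), (2000, 6)],
    (1, 11): [(1500, 7)],
    (2, 10): [(1200, 9)],
}

def fetch_partition(id_, secondary_id):
    """Stand-in for one single-partition query (full partition key given)."""
    return [(id_, secondary_id, ts, value)
            for ts, value in PARTITIONS.get((id_, secondary_id), [])]

def fetch_all_for_id(id_, secondary_ids, max_workers=8):
    """Query every (id, secondary_id) partition in parallel and merge the rows."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = pool.map(lambda sid: fetch_partition(id_, sid), secondary_ids)
        rows = [row for partition in results for row in partition]
    # Sort by timestamp so the merged result reads like one time series.
    return sorted(rows, key=lambda row: row[2])

rows = fetch_all_for_id(1, [10, 11])
```

The list of `secondary_ids` is assumed to come from the tracking table mentioned above; each individual query hits exactly one partition, so no filtering is needed.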
I have a little question concerning the partition key in Cassandra.
When I create a table which contains a field called flxB whose type is a UDT like this:
CREATE TYPE fluxes (
flux float,
flux_prec smallint,
flux_error float,
flux_error_prec smallint,
flux_bibcode text,
system text
);
Can I put the field flxB.flux in my partition key ?
No, you can't put flxB.flux in any part of the primary key.
Also, in Cassandra versions lower than 3.0, a UDT field must be defined as frozen:
When using the frozen keyword, you cannot update parts of a user-defined type value. The entire value must be overwritten. Cassandra treats the value of a frozen, user-defined type like a blob.
In Cassandra, every part of the primary key must be present when inserting or updating. If Cassandra allowed you to put flxB.flux in the partition key, how would it make sure that every part of the primary key is present in the insert/update query?
I am new to Apache Cassandra. I have installed Cassandra and use cqlsh on my laptop.
I created a table using:
create table userpageview( created_at timestamp, hit int, userid int, variantid int, primary key (created_at, hit, userid, variantid) );
and inserted several rows into the table, but when I try to select with a condition on each column (one at a time), I get an error.
Maybe my data model is wrong. Can anyone tell me how to do data modelling in Cassandra?
Thanks
You need to read about partition keys and clustering keys. Cassandra works much differently than relational databases and the types of queries you can do are much more restricted.
Some information to get you started: here and here.
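As a rough illustration of why some of those single-column selects fail: in the table above, `created_at` is the partition key and `hit`, `userid`, `variantid` are clustering columns, in that order. The sketch below is a deliberately simplified model (equality restrictions only, no secondary indexes, no ALLOW FILTERING), and `can_query_without_filtering` is a hypothetical helper, not part of any Cassandra API.

```python
def can_query_without_filtering(partition_key, clustering_key, restricted):
    """Simplified check: equality restrictions only, no secondary indexes."""
    restricted = set(restricted)
    # Every partition key column must be restricted.
    if not set(partition_key) <= restricted:
        return False
    # Clustering columns may only be restricted as a prefix, in declared order.
    remaining = restricted - set(partition_key)
    prefix = []
    for column in clustering_key:
        if column not in remaining:
            break
        prefix.append(column)
    return set(prefix) == remaining

# Key layout of the userpageview table from the question.
PARTITION_KEY = ["created_at"]
CLUSTERING_KEY = ["hit", "userid", "variantid"]

ok_partition_only = can_query_without_filtering(
    PARTITION_KEY, CLUSTERING_KEY, ["created_at"])
ok_hit_only = can_query_without_filtering(
    PARTITION_KEY, CLUSTERING_KEY, ["hit"])
ok_skipping_hit = can_query_without_filtering(
    PARTITION_KEY, CLUSTERING_KEY, ["created_at", "userid"])
```

Under this model only the first call returns `True`: a condition on `hit` alone misses the partition key, and a condition on `created_at` plus `userid` skips the clustering column `hit`.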
I'm trying to leverage a specific feature of Apache Cassandra CQL3: partition and clustering keys for tables that are created with the compact storage option.
For example:
CREATE TABLE EMPLOYEE(id uuid, name text, field text, value text, primary key(id, name , field )) with compact storage;
I've created the table via CQL3 and I'm able to insert rows successfully using the Hector API.
But I couldn't find the right set of options in the Hector API to create the table itself as I require.
To elaborate a little bit more:
In ColumnFamilyDefinition.java I couldn't see an option for setting the storage option (compact storage), and in ColumnDefinition.java I couldn't find an option to say that a column is part of the partition and clustering keys.
Could you please tell me whether I can use Hector for this (i.e. creating the table), and if so, which options I need to provide?
If you are not tied to Hector, you could look into the DataStax Java Driver which was created to use CQL3 and Cassandra's binary protocol.
If I want to partition my primary key by time window, would it be better (for storage and retrieval efficiency) to use a textual representation of the time or a truncated native timestamp, i.e.
CREATE TABLE user_data (
user_id TEXT,
log_day TEXT, -- store as 'yyyymmdd' string
log_timestamp TIMESTAMP,
data_item TEXT,
PRIMARY KEY ((user_id, log_day), log_timestamp));
or
CREATE TABLE user_data (
user_id TEXT,
log_day TIMESTAMP, -- store as (timestamp-in-millis - (timestamp-in-millis mod 86400000))
log_timestamp TIMESTAMP,
data_item TEXT,
PRIMARY KEY ((user_id, log_day), log_timestamp));
Regarding your column key "log_timestamp":
If you are working with multiple writing clients (which I suggest, since otherwise you probably won't get near the possible throughput of a distributed, write-optimized database like C*), you should consider using TimeUUIDs instead of timestamps, as they are conflict-free (assuming MAC addresses are unique). With plain timestamps you would have to guarantee that no two inserts happen at the same time, or one of them will overwrite the other and you will lose data. You can still do column slice queries and other time-based operations on TimeUUIDs.
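A small sketch of the idea, using Python's standard `uuid.uuid1()`, which produces version 1 (time-based) UUIDs, the same UUID version as Cassandra's timeuuid type. The helper `timeuuid_to_datetime` is an illustrative name, not a driver function.

```python
import uuid
from datetime import datetime, timezone

# Offset between the UUID epoch (1582-10-15) and the Unix epoch,
# in 100-nanosecond intervals.
UUID_EPOCH_OFFSET = 0x01B21DD213814000

def timeuuid_to_datetime(u):
    """Recover the embedded timestamp from a version 1 UUID."""
    unix_seconds = (u.time - UUID_EPOCH_OFFSET) / 1e7
    return datetime.fromtimestamp(unix_seconds, tz=timezone.utc)

# Generate many IDs in a tight loop; each one is distinct even though
# many of them are created within the same millisecond.
ids = [uuid.uuid1() for _ in range(1000)]
first_created_at = timeuuid_to_datetime(ids[0])
```

Because the time is still embedded in the value, the IDs remain usable for time-based ordering while avoiding collisions between rows written in the same millisecond.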
I'd use unix time (i.e. 1234567890) over either of those formats - to point to an entire day, you'd just use the timestamp for 00:00.
However, I very much recommend reading Advanced Time Series with Cassandra on the DataStax dev blog. It covers some important things to consider in your model, with regards to bucketing/splitting.
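For reference, the two bucketing options from the question can be computed like this. It is a minimal sketch assuming millisecond timestamps and UTC day boundaries; the function names are illustrative only.

```python
from datetime import datetime, timezone

MILLIS_PER_DAY = 86_400_000  # 24 * 60 * 60 * 1000

def day_bucket_millis(timestamp_ms):
    """Truncate a millisecond timestamp to 00:00 UTC of the same day."""
    return timestamp_ms - (timestamp_ms % MILLIS_PER_DAY)

def day_bucket_text(timestamp_ms):
    """Render the same day bucket as a 'yyyymmdd' string (UTC)."""
    moment = datetime.fromtimestamp(timestamp_ms / 1000, tz=timezone.utc)
    return moment.strftime("%Y%m%d")

sample_ms = 1234567890 * 1000  # the Unix time from the answer, in millis
bucket_ms = day_bucket_millis(sample_ms)
bucket_text = day_bucket_text(sample_ms)
```

Both forms identify the same day, so either works as the `log_day` part of the partition key; the numeric form just avoids string formatting and parsing.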