I had a couple questions regarding the Cassandra connector written by Data Mountaineer. Any help is greatly appreciated as we're trying to figure out the best way to scale our architecture.
Do we have to create a Connector config for each Cassandra table we want to update? For instance, let's say I have 1,000 tables. Each table is dedicated to a different type of widget. Each widget has similar characteristics, but slightly different data. Do we need to create a connector for each table? If so, how is this managed and how does this scale?
In Cassandra, we often need to model column families based on the business need. We may have 3 tables representing user information. 1 by username, 1 by email and 1 by last name. Would we need 3 connector configs and deploy 3 separate Sink tasks to push data to each table?
I think both questions are similar: can the sink handle multiple topics?
The sink can handle multiple tables in one sink, so one configuration is enough. This is set in the KCQL statement connect.cassandra.export.route.query=INSERT INTO orders SELECT * FROM orders-topic;INSERT INTO positions SELECT * FROM positions, but at present the tables need to be in the same Cassandra keyspace. This would route events from the orders-topic topic to a Cassandra table called orders, and events from the positions topic to a table called positions. You can also select specific columns and rename them, like SELECT columnA AS columnB.
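For a large number of tables, the route query could be generated rather than written by hand. The following is a minimal sketch, assuming a plain list of (topic, table) pairs; the helper name build_route_query is hypothetical, and only the property name and the INSERT INTO ... SELECT * FROM ... form come from the answer above.

```python
def build_route_query(routes):
    """Join one KCQL INSERT statement per (topic, table) pair with ';'."""
    statements = [
        "INSERT INTO {table} SELECT * FROM {topic}".format(table=table, topic=topic)
        for topic, table in routes
    ]
    return ";".join(statements)


# Hypothetical topic-to-table routes, mirroring the example in the answer.
routes = [("orders-topic", "orders"), ("positions", "positions")]
route_query = build_route_query(routes)

# The property name is the one quoted in the answer.
config_line = "connect.cassandra.export.route.query=" + route_query
print(config_line)
```

All target tables would still have to live in the same keyspace, as noted above.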
You may want more than one sink instance for separation of concerns, i.e. isolating the write of a group of topics from other unrelated topics.
You can scale with the number of tasks the connector is allowed to run; each task starts a Writer for all the target tables.
We have a support channel of our own for more direct communication. https://datamountaineer.com/contact/
I have an exchangeRates table that gets updated in batch once per week. This is to be used by other batch and streaming jobs, across different clusters, so I want to save it as a persistent, shared table for all jobs to share.
allExchangeRatesDF.write.saveAsTable("exchangeRates")
How best then (for the batch job that manages this data) to gracefully update the table contents (actually overwrite them completely), considering the various Spark jobs that consume it, and particularly given its use in some 24/7 structured streaming streams?
I've checked the APIs, maybe I am missing something obvious! Very likely.
Thanks!
I think you expect some kind of transaction support from Spark so when there's saveAsTable in progress Spark would hold all writes until the update/reset has finished.
I think that the best way to deal with the requirement is to append new records (using insertInto) with the batch id that would denote the rows that belong to a "new table".
insertInto(tableName: String): Unit Inserts the content of the DataFrame to the specified table. It requires that the schema of the DataFrame is the same as the schema of the table.
You'd then use the batch id to deal with the rows as if they were the only rows in the dataset.
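The batch-id idea can be illustrated without Spark. This is a minimal sketch in plain Python, where a list of dicts stands in for the shared table; the helper names append_batch and latest_batch and the sample rates are hypothetical. In Spark, the append step would correspond to the insertInto call described above.

```python
def append_batch(table, rows, batch_id):
    """Append rows tagged with batch_id; existing rows are never rewritten."""
    table.extend(dict(row, batch_id=batch_id) for row in rows)


def latest_batch(table):
    """Return only the rows carrying the highest batch_id."""
    if not table:
        return []
    newest = max(row["batch_id"] for row in table)
    return [row for row in table if row["batch_id"] == newest]


exchange_rates = []  # stand-in for the shared exchangeRates table

# Two weekly loads; the second one supersedes the first for readers.
append_batch(exchange_rates, [{"currency": "EUR", "rate": 1.10}], batch_id=1)
append_batch(exchange_rates, [{"currency": "EUR", "rate": 1.12}], batch_id=2)

current = latest_batch(exchange_rates)
print(current)
```

Readers that filter on the newest batch id never see a half-written table, because older rows stay in place while the new batch is being appended.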
I'm faced with the following problem using PySpark and dataframes with the cassandra-connector. My Cassandra data lake consists of metric measurements across (network) devices, and the entries are of type (device,interface,metric,time,value).
My cassandra table for the raw data has:
PRIMARY KEY ((device,interface,metric),time)
for supposedly efficient fetching of time ranges for a given measurement.
Now for reporting purposes, users can query any set of device/interface/metric combinations (i.e., give me a specific metric for all interfaces of a device). I know the list of each, so I'm not looking to do wildcard searches, but rather IN queries.
I'm using Spark 1.4, so I'm adding filters like the following to obtain dataframes on which to calculate min/max/percentile/etc. for the recorded metric values.
metrics_raw_sub = metrics_raw\
    .filter(metrics_raw.device.inSet(device_list))\
    .filter(metrics_raw.interface.inSet(interface_list))\
    .filter(metrics_raw.metric.inSet(metric_list))
This isn't very efficient as these predicates do not get pushed down to CQL (only the last predicate can be an IN query), so I'm pulling in tons of data and filtering on the client side. (not good)
Why doesn't cassandra-connector allow multiple IN predicates across partition columns? Doing this in a native CQL shell appears to work?
Another approach to my problem above would be to (and this yields efficient individual queries as predicates are pushed down to Cassandra):
for device in device_list:
    for interface in interface_list:
        metrics_raw_sub = metrics_raw\
            .filter(metrics_raw.device == device)\
            .filter(metrics_raw.interface == interface)\
            .filter(metrics_raw.metric.inSet(metric_list))
And then run the aggregation logic for each subquery, but I feel like this is largely serialising what should be a parallel computation across all requested device/interface/metric values... Can I batch the Cassandra queries so I can run my analytics on one large distributed dataframe?
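Since the full lists are known up front, the set of requested partition keys can at least be enumerated once, outside the nested loops. The sketch below is plain Python and only builds the (device, interface, metric) tuples that match the table's partition key; the sample values are hypothetical, and how such a key list would be handed to the connector as one distributed read is not shown here.

```python
from itertools import product


def partition_keys(device_list, interface_list, metric_list):
    """Enumerate every (device, interface, metric) partition key requested."""
    return list(product(device_list, interface_list, metric_list))


# Hypothetical request: one metric pair on one interface of two devices.
keys = partition_keys(
    ["router1", "router2"],
    ["eth0"],
    ["in_octets", "out_octets"],
)
print(len(keys), keys[0])
```

Each tuple identifies exactly one Cassandra partition, so every lookup stays an efficient single-partition read regardless of how the keys are later batched.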
Bottom line, I'm looking to do this very efficiently. If the turn-around times are short enough, we'll run these on-demand. If not, we'll need to look into pre-computing them and storing into tables (which sacrifices flexibility for doing custom time-range reporting)
Any insights would be much appreciated!!
Nik.
If Apache Cassandra's architecture encourages the use of non-normalized column families designed specifically for anticipated queries, how do users edit data that is replicated across many columns without creating inconsistencies?
e.g., example 3 here: http://www.ebaytechblog.com/2012/07/16/cassandra-data-modeling-best-practices-part-1/
If Jay were no longer interested in iPhones, deleting this piece of information would require that columns in 2 separate column families be deleted. Do users just need to code add/edit/delete functions that appropriately update all the relevant tables, or does Cassandra somehow know how records are related and handle this for users?
In the Cassandra 2.x world, the way to keep your denormalized query tables consistent is to use atomic batches.
In an example taken from the CQL documentation, assume that I have two tables for user data. One is the "users" table and the other is "users_by_ssn." To keep these two tables in sync (should a user change their "state" of residence) I would need to apply an upsert like this:
BEGIN BATCH
  UPDATE users
  SET state = 'TX'
  WHERE user_uuid = 8a172618-b121-4136-bb10-f665cfc469eb;
  UPDATE users_by_ssn
  SET state = 'TX'
  WHERE ssn = '888-99-3987';
APPLY BATCH;
Users need to code the add/edit/delete functions themselves.
Note that Cassandra 3.0 has materialized views, which automate denormalization on the server side. Materialized views are added to, edited, and updated automatically based on the parent table.
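One way such a hand-written function might look is a helper that wraps the per-table statements in a single batch. This is a minimal sketch that only assembles the CQL text; build_batch is a hypothetical name, the statements are the ones from the example above, and real application code would more likely send prepared statements through a driver than build strings.

```python
def build_batch(statements):
    """Wrap individual CQL statements in one BEGIN BATCH ... APPLY BATCH block."""
    lines = ["BEGIN BATCH"]
    # Normalise each statement so it ends with exactly one semicolon.
    lines.extend("  " + stmt.strip().rstrip(";") + ";" for stmt in statements)
    lines.append("APPLY BATCH;")
    return "\n".join(lines)


# One UPDATE per denormalized table that stores the user's state.
batch = build_batch([
    "UPDATE users SET state = 'TX' "
    "WHERE user_uuid = 8a172618-b121-4136-bb10-f665cfc469eb",
    "UPDATE users_by_ssn SET state = 'TX' WHERE ssn = '888-99-3987'",
])
print(batch)
```

Keeping the list of affected tables in one place like this makes it harder to forget a table when a new query table is added.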
I wanted to hear your advice about a potential solution for an advertising agency database.
We want to build a system that will be able to track users so that we know what they did on the ads, and where.
There are many types of ads, and some of them are also FORMS, so users can fill in data.
Each form is different, but we don't want to create a table per form.
We thought of creating a very WIDE table with 1k columns, dozens for each type, and store the data.
In short:
Use Cassandra;
Create daily tables so data will be stored on a daily table;
Each table will have 1000 cols (100 for datetime, 100 for int, etc).
Application logic will map the data into relevant cols so we will be able to search and update those later.
What do you think of this?
Be careful with generating tables dynamically in Cassandra. You will start to have problems when you have too many tables because there is a per table memory overhead. Per Jonathan Ellis:
Cassandra will reserve a minimum of 1MB for each CF's memtable: http://www.datastax.com/dev/blog/whats-new-in-cassandra-1-0-performance
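As a rough illustration of how that overhead adds up, here is a minimal sketch of the arithmetic. It assumes the 1 MB minimum from the quote applies to every table and ignores all other per-table costs; the function name and the table counts are hypothetical.

```python
MIN_MEMTABLE_MB_PER_TABLE = 1  # minimum reservation quoted above


def min_reserved_memtable_mb(table_count):
    """Lower bound on memtable memory reserved for table_count tables."""
    return table_count * MIN_MEMTABLE_MB_PER_TABLE


# One daily table kept for a year, then for three years.
one_year = min_reserved_memtable_mb(365)
three_years = min_reserved_memtable_mb(3 * 365)
print(one_year, three_years)
```

The number grows with every day of retained data, independent of how much data each table actually holds.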
Even daily tables are not a good idea in Cassandra (tables per form is even worse). I recommend you build a table that can hold all your data and you know will scale well -- verify this with cassandra-stress.
At this point, heed mikea's advice and start thinking about your access patterns (see Patrick's video series); you may have to build additional tables to meet your querying needs.
Note: For anyone wishing for a schemaless option in c*:
https://blog.compose.io/schema-less-is-usually-a-lie/
http://rustyrazorblade.com/2014/07/the-myth-of-schema-less/
I am building an application and using Cassandra as my datastore. In the app, I need to track event counts per user, per event source, and need to query the counts for different windows of time. For example, some possible queries could be:
Get all events for user A for the last week.
Get all events for all users for yesterday where the event source is source S.
Get all events for the last month.
Low latency reads are my biggest concern here. From my research, the best way I can think of to implement this is a different counter table for each permutation of source, user, and predefined time. For example, create a count_by_source_and_user table, where the partition key is a combination of source and user ID, and then create a count_by_user table for just the user counts.
This seems messy. What's the best way to do this, or could you point towards some good examples of modeling these types of problems in Cassandra?
You are right. If latency is your main concern, and it should be if you have already chosen Cassandra, you need to create a table for each of your queries. This is the recommended way to use Cassandra: optimize for reads and don't worry about redundant storage. And since data within each table is stored sequentially according to its index, you cannot index a table in more than one way (as you would with a relational DB). I hope this helps. Look for the "Data Modeling" presentation that is usually given at "Cassandra Day" events. You may find it on "Planet Cassandra" or Jon Haddad's blog.
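The table-per-query approach means every event is written once per query table. The sketch below simulates that fan-out in plain Python, with Counter objects standing in for the count_by_source_and_user and count_by_user counter tables named in the question; the daily time bucket, the helper names, and the sample events are assumptions made for illustration.

```python
from collections import Counter

# In-memory stand-ins for two Cassandra counter tables.
count_by_source_and_user = Counter()  # key: (source, user_id, day)
count_by_user = Counter()             # key: (user_id, day)


def record_event(source, user_id, day):
    """Increment the matching counter in every query table."""
    count_by_source_and_user[(source, user_id, day)] += 1
    count_by_user[(user_id, day)] += 1


def user_total(user_id, days):
    """Answer 'all events for a user over these days' from one table."""
    return sum(count_by_user[(user_id, day)] for day in days)


record_event("S", "A", "2016-01-01")
record_event("T", "A", "2016-01-02")
record_event("S", "B", "2016-01-02")

week = ["2016-01-01", "2016-01-02"]
print(user_total("A", week))
```

Each read touches only the table built for that query, which is what keeps latency low at the cost of extra writes and storage.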