Using Cassandra 2.2.8.
My challenge is as follows. In my database we have a bunch of tables with millions of rows. Unfortunately, due to loose design, the partition keys on a few tables have grown to gigabytes in size; this is putting pressure on the system and causing issues such as JVM out-of-memory errors and node crashes.
We need to redesign the partition keys on a few tables, and the existing data needs to be retained, i.e. migrated to the new tables.
I'm looking for a solution that lets me export data from a source table to a target table (i.e. one with redesigned composite partition keys); I hope this will spread the partitions in a more balanced manner.
I tried the COPY [tablename(column1, column2, ...)] command, but it probes a number of nodes and puts pressure on the system/heap, i.e. it affects the application. I'm seeking guidance on how best to address this challenge - thank you in advance for any help.
Since you have very big tables and COPY has already failed you, you will need to export and import your data manually. To perform such a task you need the TOKEN function.
With a small amount of client code you can issue queries that perform a full table extraction, something like:
SELECT * FROM mytable WHERE token(pk) >= MIN_TOKEN AND token(pk) < MIN_TOKEN + QUERY_INTERVAL;
SELECT * FROM mytable WHERE token(pk) >= MIN_TOKEN + QUERY_INTERVAL AND token(pk) < MIN_TOKEN + 2*QUERY_INTERVAL;
....
SELECT * FROM mytable WHERE token(pk) >= MAX_TOKEN - QUERY_INTERVAL AND token(pk) <= MAX_TOKEN;
where MIN_TOKEN and MAX_TOKEN are the constant minimum and maximum token values of your cluster's partitioner, and QUERY_INTERVAL is the range window you want to query. The bigger the QUERY_INTERVAL, the more data you fetch in a single query (and the more likely you are to trigger a timeout).
Please note that Cassandra never allows a range operator (> >= <= <) in the WHERE clause on partition key column specifiers. The exception is with the use of the TOKEN function.
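To make this concrete, here is a minimal sketch of such client code, assuming the default Murmur3Partitioner (tokens from -2^63 to 2^63-1) and the DataStax Java driver; the keyspace, table and column names (mykeyspace, mytable, pk) are placeholders:

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.ResultSet;
import com.datastax.driver.core.Row;
import com.datastax.driver.core.Session;

public class FullTableExtract {
    // Murmur3Partitioner token bounds
    private static final long MIN_TOKEN = Long.MIN_VALUE;
    private static final long MAX_TOKEN = Long.MAX_VALUE;
    // window size per query: bigger = fewer queries, but more data (and timeout risk) per query
    private static final long QUERY_INTERVAL = 1L << 48;

    public static void main(String[] args) {
        try (Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
             Session session = cluster.connect("mykeyspace")) {
            long start = MIN_TOKEN;
            while (true) {
                boolean last = start >= MAX_TOKEN - QUERY_INTERVAL;
                long end = last ? MAX_TOKEN : start + QUERY_INTERVAL;
                String cql = last
                        ? "SELECT * FROM mytable WHERE token(pk) >= ? AND token(pk) <= ?"
                        : "SELECT * FROM mytable WHERE token(pk) >= ? AND token(pk) < ?";
                ResultSet rs = session.execute(cql, start, end);
                for (Row row : rs) {
                    // export/process the row here
                }
                if (last) break;
                start = end;
            }
        }
    }
}

A timeout then only affects a single window, which you can retry or split further.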
I also suggest these readings:
Understanding the Token Function in Cassandra
Displaying rows from an unordered partitioner with the TOKEN function
COPY just imports/exports data to/from a file. If you want to redesign your data model, it will probably be better to implement a specialized tool for your task, which will:
read data from the source table in portions (e.g. by token ranges, as #xmas79 described above)
transform each portion to the new model
write each portion to the new table(s)
Here is an example of how to read big tables by token ranges with Java and the DataStax driver.
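For illustration, a rough sketch of such a tool, assuming a hypothetical source table events_old keyed by (user_id) and a target table events_new keyed by ((user_id, bucket), event_time); the schema, column names and bucketing logic are placeholders you would replace with your own model:

import com.datastax.driver.core.*;

public class TableMigrator {
    public static void main(String[] args) {
        try (Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
             Session session = cluster.connect("mykeyspace")) {

            PreparedStatement read = session.prepare(
                    "SELECT user_id, event_time, payload FROM events_old"
                    + " WHERE token(user_id) > ? AND token(user_id) <= ?");
            PreparedStatement write = session.prepare(
                    "INSERT INTO events_new (user_id, bucket, event_time, payload)"
                    + " VALUES (?, ?, ?, ?)");

            for (TokenRange range : cluster.getMetadata().getTokenRanges()) {
                for (TokenRange sub : range.unwrap()) {   // handle the range that wraps around
                    ResultSet rs = session.execute(read.bind()
                            .setToken(0, sub.getStart())
                            .setToken(1, sub.getEnd()));
                    for (Row row : rs) {
                        // placeholder: derive the extra partition key component (e.g. a bucket)
                        int bucket = Math.floorMod(row.getTimestamp("event_time").hashCode(), 16);
                        session.execute(write.bind(
                                row.getUUID("user_id"), bucket,
                                row.getTimestamp("event_time"), row.getString("payload")));
                    }
                }
            }
        }
    }
}

In practice you would also throttle the writes and process several ranges in parallel, similar to what DSBulk and the Spark Cassandra Connector do.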
Related
I am using ScyllaDB, but I think this also applies to Cassandra since ScyllaDB is compatible with Cassandra.
I have the following table (I got ~5 of this kind of tables):
create table batch_job_conversation (
conversation_id uuid,
primary key (conversation_id)
);
This is used by a batch job to make sure some fields are kept in sync. In the application, a lot of concurrent writes/reads can happen. Once in a while, I will correct the values with a batch job.
A lot of writes can happen to the same row, so the same rows get overwritten. A batch job currently picks up rows with this query:
select * from batch_job_conversation
Then the batch job reads the data at that point and makes sure things are in sync. I think this query is bad because it stresses all the partitions and the coordinator node, since it needs to visit ALL partitions.
My question is whether it is better for this kind of table to have a fixed field, something like this:
create table batch_job_conversation (
always_zero int,
conversation_id uuid,
primary key ((always_zero), conversation_id)
);
And then the query would be this:
select * from batch_job_conversation where always_zero = 0
For each batch job I can use a different partition key. The number of rows in these tables will be roughly the same (a few thousand at most), and the same rows will probably be overwritten many times.
Is it better to have a fixed value? Is there another way to handle this? I don't have a logical partition key I can use.
The second model would create a LARGE partition, and you don't want that, trust me ;-)
(you would do a partition scan on top of a large partition, which is worse than the original full scan)
(and another piece of advice - keep your partitions small and have a lot of them, then all your CPUs will be used fairly equally)
The first approach is OK - it is called a FULL SCAN, BUT
you need to manage it properly.
There are several ways; we blogged about it in https://www.scylladb.com/2017/02/13/efficient-full-table-scans-with-scylla-1-6/
and basically it boils down to divide and conquer - see the sketch below.
Also note that Spark implements full scans too.
hth
L
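For reference, a minimal divide-and-conquer sketch along these lines, using the DataStax Java driver (which also works against ScyllaDB); the split factor, thread count and keyspace name are arbitrary placeholders:

import com.datastax.driver.core.*;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class ParallelFullScan {
    public static void main(String[] args) throws InterruptedException {
        try (Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
             Session session = cluster.connect("mykeyspace")) {

            PreparedStatement scan = session.prepare(
                    "SELECT conversation_id FROM batch_job_conversation"
                    + " WHERE token(conversation_id) > ? AND token(conversation_id) <= ?");

            ExecutorService pool =
                    Executors.newFixedThreadPool(Runtime.getRuntime().availableProcessors());

            for (TokenRange range : cluster.getMetadata().getTokenRanges()) {
                for (TokenRange unwrapped : range.unwrap()) {          // handle the wrap-around range
                    for (TokenRange sub : unwrapped.splitEvenly(8)) {  // arbitrary split factor
                        pool.submit(() -> {
                            ResultSet rs = session.execute(scan.bind()
                                    .setToken(0, sub.getStart())
                                    .setToken(1, sub.getEnd()));
                            for (Row row : rs) {
                                // process row.getUUID("conversation_id")
                            }
                        });
                    }
                }
            }
            pool.shutdown();
            pool.awaitTermination(1, TimeUnit.HOURS);
        }
    }
}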
What happens if our query contains several tokens that end up on different nodes?
Is it possible for the client to run multiple queries, sync or async, against those nodes?
Sample 1:
//Our query
SELECT * FROM keyspace1.standard1 WHERE key = 1 or key = 2 or key = 3;
//Client changes our query to multiple queries depending on the token ranges and runs them sync or async.
SELECT * FROM keyspace1.standard1 WHERE key = 1 or key = 3; //tokens on node X
SELECT * FROM keyspace1.standard1 WHERE key = 2; //token on node Y
Sample 2:
//Our query
SELECT * FROM kspc.standard1;
//Client changes our query to multiple queries on the token ranges and runs them sync or async.
SELECT * FROM kspc.standard1 WHERE token(key) > [start range node1] and token(key) < [end range node1];
SELECT * FROM kspc.standard1 WHERE token(key) > [start range node2] and token(key) < [end range node2];
and ...
As Manish mentioned, if a query spans several partitions then the token-aware policy won't select anything and will send the query to any node in the cluster (the same behaviour applies to unprepared queries and DDL). In general this is an anti-pattern, as it puts more load onto the nodes, so it should be avoided. But if you really need it, you can force the driver to send the query to one of the nodes that owns a specific partition key. In Java driver 3.x there was a function statement.setRoutingKey; for Java driver 4.x there should be something similar. Other drivers should have similar functionality, though maybe not all of them.
For the second class of queries it's the same: by default the driver can't determine which node to send the query to, and the routing key should be set explicitly. But in general, a full table scan can be tricky, as you need to handle the conditions on the lower and upper bounds, and you can't expect a token range to start exactly at the lower bound - there can be situations where a token range starts near the upper bound and ends slightly above the lower bound - this is a typical error that I have seen regularly. If you are interested, I have an example of how to perform a full table scan using Java (it uses the same algorithm as the Spark Cassandra Connector and DSBulk) - the main part is the cycle over the available token ranges. But if you're looking into writing a full table scan yourself, think about using the DSBulk parts as an SDK - look at the partitioner module, which was designed specifically for that.
For Sample 1, just query each partition separately and merge the results at the client end; this will be much faster. The DataStax driver has a token-aware policy, but it only works when a query refers to a single partition (see the sketch below).
You can refer to this link.
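A small sketch of that approach with the DataStax Java driver, assuming key is a bigint column (adjust the bind type to your schema):

import com.datastax.driver.core.*;
import java.util.ArrayList;
import java.util.List;

public class PerPartitionQueries {
    public static void main(String[] args) {
        try (Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
             Session session = cluster.connect("keyspace1")) {

            PreparedStatement byKey = session.prepare("SELECT * FROM standard1 WHERE key = ?");

            // fire one single-partition query per key, asynchronously
            List<ResultSetFuture> futures = new ArrayList<>();
            for (long key : new long[]{1L, 2L, 3L}) {
                futures.add(session.executeAsync(byKey.bind(key)));
            }

            // merge the results on the client
            List<Row> merged = new ArrayList<>();
            for (ResultSetFuture future : futures) {
                merged.addAll(future.getUninterruptibly().all());
            }
            // merged now holds the rows of all three partitions
        }
    }
}

Each bound single-partition statement carries its routing information, so the token-aware policy can send it straight to a replica, and the futures let the queries run concurrently.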
For Sample 2, it is an anti-pattern query and you cannot expect the client to do all the work for you. If you want to read the complete table then you can use Spark. DataStax provides the spark-cassandra-connector, which offers roughly the functionality you describe. Here you can find a description of the spark-cassandra-connector.
I'm working on a project that involves reading data from an RDBMS using JDBC, and I have succeeded in reading the data. This is something I will be doing fairly regularly, weekly. So I've been trying to come up with a way to ensure that after the initial read, subsequent ones only pull updated records instead of pulling the entire table again.
I can do this with Sqoop incremental import by specifying the three parameters (--check-column, --incremental last-modified/append and --last-value). However, I don't want to use Sqoop for this. Is there a way I can replicate the same in Spark with Scala?
Secondly, some of the tables do not have a unique column which can be used as partitionColumn, so I thought of using a row_number function to add a unique column to these tables and then get the MIN and MAX of that column as lowerBound and upperBound respectively. My challenge now is how to dynamically pass these values into the read statement below:
val queryNum = "select a1.*, row_number() over (order by sales) as row_nums from (select * from schema.table) a1"

val df = spark.read.format("jdbc").
  option("driver", driver).
  option("url", url).
  option("partitionColumn", "row_nums").
  option("lowerBound", min(row_nums)).
  option("upperBound", max(row_nums)).
  option("numPartitions", some value).
  option("fetchsize", some value).
  option("dbtable", queryNum).
  option("user", user).
  option("password", password).
  load()
I know the above code is not right and might be missing a whole lot of steps, but I guess it gives a general overview of what I'm trying to achieve here.
It's surprisingly complicated to handle incremental JDBC reads in Spark. IMHO, it severely limits the ease of building many applications and may not be worth your trouble if Sqoop is doing the job.
However, it is doable. See this thread for an example using the dbtable option:
Apache Spark selects all rows
To keep this job idempotent, you'll need to read the max value from your prior output, either by loading the data files directly or via a log file that you write out on each run. If your data files are massive you may need the log file; if they are smaller you could potentially load them.
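For illustration, a rough sketch of that pattern, shown with Spark's Java API (the DataFrameReader options are identical from Scala); it assumes the source table has a last_modified column, the prior output lives as Parquet under a placeholder path, and the JDBC driver/URL/credentials are placeholders (timestamp literal formatting is database-specific):

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import static org.apache.spark.sql.functions.max;

public class IncrementalJdbcRead {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder().appName("incremental-read").getOrCreate();

        // watermark: highest last_modified value written by the previous run (placeholder path)
        java.sql.Timestamp lastValue = spark.read().parquet("/data/prior_output")
                .agg(max("last_modified"))
                .first()
                .getTimestamp(0);

        // push the incremental filter down to the database as a subquery
        String incremental =
                "(select * from schema.table where last_modified > '" + lastValue + "') as incr";

        Dataset<Row> updates = spark.read().format("jdbc")
                .option("driver", "org.postgresql.Driver")            // placeholder driver
                .option("url", "jdbc:postgresql://dbhost:5432/mydb")  // placeholder URL
                .option("dbtable", incremental)
                .option("user", "user")                               // placeholder credentials
                .option("password", "password")
                .load();

        // append the new/updated records so the next run sees the new watermark
        updates.write().mode("append").parquet("/data/prior_output");
    }
}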
Accessing all rows from all nodes in Cassandra would be inefficient. Is there a way to get access to Index.db, which already has the row keys? Is something of this sort supported out of the box in Cassandra?
There is no way to get all keys with one request without reaching every node in the cluster. There is, however, paging built into most Cassandra drivers. For example, in the Java driver: https://docs.datastax.com/en/developer/java-driver/3.3/manual/paging/
This puts less stress on each node, as it only fetches a limited amount of data per request. Each subsequent request continues from where the last one stopped, so you will still touch every result for the query you're making.
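For example, a minimal sketch with the Java driver 3.x, fetching only the partition keys page by page; keyspace, table and column names are placeholders:

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.ResultSet;
import com.datastax.driver.core.Row;
import com.datastax.driver.core.Session;
import com.datastax.driver.core.SimpleStatement;
import com.datastax.driver.core.Statement;

public class PagedKeyScan {
    public static void main(String[] args) {
        try (Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
             Session session = cluster.connect("mykeyspace")) {

            // fetch only 500 rows per request; the driver pages transparently
            Statement stmt = new SimpleStatement("SELECT DISTINCT pk FROM mytable").setFetchSize(500);

            ResultSet rs = session.execute(stmt);
            for (Row row : rs) {
                // the next page is fetched automatically when the current one is exhausted
                System.out.println(row.getObject("pk"));
            }
        }
    }
}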
Edit: This is probably what you want: How can I get the primary keys of all records in Cassandra?
One possible option could be querying all the token ranges.
For example,
SELECT DISTINCT <partn_col_name> FROM <table_name> WHERE token(<partn_col_name>) >= <from_token_range> AND token(<partn_col_name>) < <to_token_range>
With the above query you can get all the partition keys available within a given token range. Adjust the token range size depending on execution time.
I have the following table (using CQL3):
create table test (
shard text,
tuuid timeuuid,
some_data text,
status text,
primary key (shard, tuuid, some_data, status)
);
I would like to get rows ordered by tuuid, but this is only possible when I restrict shard - I understand this is due to performance.
I have shard purely for sharding, and I can potentially restrict its range of values to some small range [0-16) say. Then, I could run a query like this:
select * from test where shard in (0,...,15) order by tuuid limit L;
I may have millions of rows in the table, so I would like to understand the performance characteristics of such an ORDER BY query. It would seem like the performance could be pretty bad in general, BUT with a LIMIT clause of some reasonable number (on the order of 10K), this may not be so bad - i.e. a 16-way merge, but with a fairly low limit.
Any tips, advice or pointers into the code on where to look would be appreciated.
Your data is sorted according to your column key, so the performance issue in the merge in your query above does not come from the WHERE clause but from your LIMIT clause, afaik.
Your columns are inserted IN ORDER according to tuuid, so there is no performance issue there.
If you are fetching too many rows at once, I recommend creating a test_meta table where you store the latest timeuuid every X inserts, to get an upper bound on the rows your query will fetch. Then you can change your query to:
select * from test where shard in (0,...,15) and tuuid > x and tuuid < y;
In short: make use of your column keys and get rid of the limit. Alternatively, in Cassandra 2.0 there will be pagination, which will help here too.
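For illustration, a small sketch of that approach with the Java driver, assuming a hypothetical test_meta table keyed by shard with a last_checkpoint timeuuid column (names and the shard list are placeholders):

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.PreparedStatement;
import com.datastax.driver.core.ResultSet;
import com.datastax.driver.core.Row;
import com.datastax.driver.core.Session;
import java.util.UUID;

public class BoundedShardQuery {
    public static void main(String[] args) {
        try (Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
             Session session = cluster.connect("mykeyspace")) {

            // hypothetical checkpoint table: one row per shard holding the latest tuuid
            // written every X inserts
            PreparedStatement metaLookup = session.prepare(
                    "SELECT last_checkpoint FROM test_meta WHERE shard = ?");
            PreparedStatement bounded = session.prepare(
                    "SELECT * FROM test WHERE shard IN ('0','1','2') AND tuuid > ?");

            UUID lower = session.execute(metaLookup.bind("0")).one().getUUID("last_checkpoint");

            ResultSet rs = session.execute(bounded.bind(lower));
            for (Row row : rs) {
                // within each shard the rows come back sorted by tuuid
            }
        }
    }
}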
Another issue I stumbled over: you say that
I may have millions of rows in the table
But according to your data model, you will have exactly as many rows as you have shard values. shard is your row key and - together with the partitioner - determines the distribution/sharding of your data.
hope that helps!
UPDATE
From my personal experience, Cassandra performs quite well during heavy reads as well as writes. If the result sets become too large, I have experienced memory issues on the receiving/client side rather than timeouts on the server side. Still, to prevent either, I recommend having a look at the upcoming (2.0) pagination feature.
In the meanwhile:
Try to investigate using the trace functionality in 1.2.
If you are mostly reading the "latest" data, try adding a reversed type.
For general optimizations like caches etc., first read how Cassandra handles reads on a node and then see this tuning guide.