In Cassandra, if I update different columns concurrently in one row, will there be any write conflicts?
For example I have a table
CREATE TABLE foo (k text, a text, b text, PRIMARY KEY (k))
One thread in my code updates column a
INSERT INTO foo (k, a) VALUES ('hello', 'foo')
while the other thread updates column b
INSERT INTO foo (k, b) VALUES ('hello', 'bar').
When running concurrently, it is possible that the two queries arrive at the server at the same time.
Could I always expect the same result as if I had updated the two columns in one CQL statement?
INSERT INTO foo(k, a, b) VALUES ('hello', 'foo', 'bar')
Will there be any write conflicts? Is each insertion atomic?
As Tom mentioned in his reply, in Cassandra all operations are column-based, so each column should have its own timestamp. In that case, the above scenario should not cause any trouble, given that one thread only updates column a while the other only updates column b. Is my understanding correct?
Thank you!
Write conflicts are resolved by having each server track the time of the write. If two writes to the same column arrive with the exact same timestamp (Cassandra timestamps have microsecond resolution), Cassandra picks one deterministically (as far as I know, by comparing the conflicting values rather than by node UUIDs).
So write conflicts are not something you need to worry about here, and since your two threads touch different columns they never overwrite each other in the first place. Reducing those two queries into the single one will also do the right thing: writes to a single partition are applied atomically.
Of course it is very important that your servers' clocks are synchronized, or funny things may happen.
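A quick way to convince yourself of the per-column timestamps is CQL's WRITETIME() function; here is a minimal sketch against the foo table from the question, reusing the literal values used above:
-- each thread writes a different column of the same row
INSERT INTO foo (k, a) VALUES ('hello', 'foo');
INSERT INTO foo (k, b) VALUES ('hello', 'bar');
-- every non-key column carries its own write timestamp,
-- so the two writes never compete with each other
SELECT a, WRITETIME(a), b, WRITETIME(b) FROM foo WHERE k = 'hello';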
I have a Databricks delta table of financial transactions that is essentially a running log of all changes that ever took place on each record. Each record is uniquely identified by 3 keys, so given that uniqueness, each record can have multiple instances in this table, each representing a historical entry of a change (across one or more columns of that record). Now, if I wanted to find out cases where a specific column value changed, I can easily achieve that by doing something like this:
SELECT t1.Key1, t1.Key2, t1.Key3, t1.Col12 AS `Before`, t2.Col12 AS `After`
FROM table1 t1 INNER JOIN table1 t2
  ON t1.Key1 = t2.Key1 AND t1.Key2 = t2.Key2 AND t1.Key3 = t2.Key3
WHERE t1.Col12 != t2.Col12
However, these tables have a large number of columns. What I'm trying to achieve is a way to identify any columns that changed in a self-join like this. Essentially a list of all columns that changed. I don't care about the actual values that changed, just a list of the column names that changed across all records. It doesn't even have to be per row. The 3 keys will always be excluded, since they uniquely define a record.
Essentially I'm trying to find the columns that are susceptible to change, so that I can focus on them for some other purpose.
Any suggestions would be really appreciated.
Databricks has change data feed (CDF / CDC) functionality that can simplify this type of use case: https://docs.databricks.com/delta/delta-change-data-feed.html
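As a rough sketch of what that looks like in Databricks SQL (table1 stands in for your table, and the starting version 1 is a placeholder):
-- enable the change data feed on the existing Delta table
ALTER TABLE table1 SET TBLPROPERTIES (delta.enableChangeDataFeed = true);
-- read the row-level changes recorded since a given table version;
-- _change_type marks update_preimage / update_postimage pairs
SELECT * FROM table_changes('table1', 1) WHERE _change_type LIKE 'update%';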
I would like to add a notes column to a merged query table, so that when I refresh the data the notes that I've made on records continue to line up. How can I add a column to do this?
See my answer to this question:
Inserting text manually in a custom column and should be visible on refresh of the report
It includes a link to an explanatory video:
https://youtu.be/duNYHfvP_8U?list=PLmajzIMNl6yH7MvMLmlgGUW5dOsKg74mQ
MySQL provides several variations on INSERT and UPDATE to allow inserting and updating exactly the desired data. These features provide a lot of power and flexibility, making MySQL significantly more capable than it otherwise might be. In this article I’ll give an overview of each feature, help you understand how to choose among them, and point out some things to watch out for.
Setup
I am using MySQL 4.1.15 to create my examples. I assume MyISAM tables without support for transactions, with the following sample data:
create table t1 (
    a int not null primary key,
    b int not null,
    c int not null
) type=MyISAM;
create table t2 (
    d int not null primary key,
    e int not null,
    f int not null
) type=MyISAM;
insert into t1 (a, b, c) values
(1, 2, 3),
(2, 4, 6),
(3, 6, 9);
insert into t2 (d, e, f) values
(1, 1, 1),
(4, 4, 4),
(5, 5, 5);
Overview
Suppose I wish to insert the data from t2 into t1. This data would violate the primary key (a row exists where column a is 1) so the insert will fail: ERROR 1062 (23000): Duplicate entry '1' for key 1. Recall that in MySQL, a primary key is simply a unique index named PRIMARY. Any data that violates any unique index will cause the same problem.
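To make that concrete, this is the naive statement that hits the error with the sample data above:
insert into t1 select * from t2;
-- ERROR 1062 (23000): Duplicate entry '1' for key 1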
This situation occurs frequently. For example, I might export some data to a spreadsheet, send it to a client, and the client might update or add some data and return the spreadsheet to me. That’s a terrible way to update data, but for various reasons, I’m sure many readers have found themselves in a similar situation. It happens a lot when I’m working with a client who has multiple versions of data in different spreadsheets, and I’m tasked with tidying it all up, standardizing formatting and importing it into a relational database. I have to start with one spreadsheet, then insert and/or update the differences from the others.
What I want to do is either insert only the new rows, or insert the new rows and update the changed rows (depending on the scenario). There are several ways to accomplish both tasks.
Inserting only new rows
If I want to insert only the rows that will not violate the unique index, I can:
Delete duplicate rows from t2 and insert everything that remains:
delete t2 from t2 inner join t1 on a = d;
insert into t1 select * from t2;
The first statement deletes the first row from t2; the second inserts the remaining two. The disadvantage of this approach is that it’s not transactional, since the tables are MyISAM and there are two statements. This may not be an issue if nothing else is altering either table at the same time. Another disadvantage is that I just deleted some data I might want in subsequent queries.
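Another route to the same end, not shown above but standard MySQL, is to let the server skip the offending rows itself with INSERT IGNORE, which leaves t2 untouched:
insert ignore into t1 select * from t2;
-- rows that would violate the primary key are skipped with a warning;
-- with the sample data, only (4, 4, 4) and (5, 5, 5) from t2 are inserted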
I have a DataFrame with columns a, b for which I want to partition the data by a using a window function, and then give unique indices for b
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.row_number
val window_filter = Window.partitionBy($"a").orderBy($"b".desc)
val result = df.withColumn("uid", row_number().over(window_filter))
But for this use-case, ordering by b is unneeded and may be time consuming. How can I achieve this without ordering?
row_number() without an ORDER BY, or with an ORDER BY on a constant, has non-deterministic behavior and may produce different results for the same rows from run to run due to parallel processing. The same may happen if the ORDER BY column has the same value for many rows: the order of those rows may differ from run to run and you will get different results.
I have a table Foo with 4 columns A, B, C, D. The partitioning key is A. The clustering key is B, C, D.
I want to scan the entire table and find all rows where D is in set (X, Y, Z).
Then I want to delete these rows but I don't want to "kill" Cassandra (because of compactions), I'd like these rows deleted with minimal disruption or risk.
How can I do this?
You have a big problem here. Indeed, you really can't find those rows without scanning all of your partitions. The real problem is that C* only lets you restrict your queries by the partition key, and then by your clustering keys in the order in which they appear in your PRIMARY KEY table declaration. So if your PK is like this:
PRIMARY KEY (A, B, C, D)
then you'd need to filter by A first, then by B, C, and only at the end by D.
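To illustrate with the table from the question (the literals and their types are placeholders), the first query below is rejected by Cassandra, while the second has the shape it will accept:
-- rejected (without ALLOW FILTERING): D is restricted while A, B and C are not
SELECT * FROM Foo WHERE D IN ('X', 'Y', 'Z');
-- accepted: every key before D is restricted first
SELECT * FROM Foo WHERE A = 'a1' AND B = 'b1' AND C = 'c1' AND D IN ('X', 'Y', 'Z');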
That being said, for the part about finding your rows, if this is something you only have to run once, you could:
1. Scan the whole table and compare D in your application logic.
2. If you know the values of A, query every partition in parallel and then compare D in your application.
3. Attach a secondary index and try to exploit its speed from there.
Please note that, depending on how many nodes you have, option 3 is really not an option: secondary indexes don't scale.
If you need to perform such tasks multiple times, I'd suggest you create another table that satisfies this query, something like PRIMARY KEY (D); you'd then just scan three partitions and that would be very fast.
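A minimal sketch of such a query table (column types are assumptions, since the question doesn't give them; A, B and C are kept as clustering columns so every original row stays addressable):
CREATE TABLE Foo_by_D (
    D text,
    A text,
    B text,
    C text,
    PRIMARY KEY (D, A, B, C)
);
-- finding the rows then only touches the three partitions for X, Y and Z
SELECT A, B, C FROM Foo_by_D WHERE D IN ('X', 'Y', 'Z');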
About deleting your rows, I think there's no way to do it without triggering compactions; they are part of C* and you have to live with them. If you really can't tolerate tombstone creation and/or compactions, the only alternative is not to delete rows from a C* cluster, and that often means thinking about a new data model that won't need deletes.
I am new to the column-store DB family and some of the concepts are not yet completely clear to me. I want to use MemSQL to store a sparse matrix.
The table would look something like this:
CREATE TABLE matrix (
    r_id INT,
    c_id INT,
    cell_data VARCHAR(10),
    KEY (`r_id`, `c_id`) USING CLUSTERED COLUMNSTORE
);
The Queries:
SELECT c_id, cell_data FROM matrix WHERE r_id=<val>; i.e. whole row
SELECT r_id, cell_data FROM matrix WHERE c_id=<val>; i.e. whole column
SELECT cell_data FROM matrix WHERE r_id=<val1> AND c_id=<val2>; i.e. one cell
UPDATE matrix SET cell_data=<val> WHERE r_id=<val1> AND c_id=<val2>;
INSERT INTO matrix VALUES (<v1>, <v2>, <v3>);
Queries 1 and 2 are about equally frequent, and 3, 4 and 5 are also about equally frequent; the two groups occur about equally often overall (i.e. Q1,2 : Q3,4,5 ~= 1:1).
I do realize that inserting into a columnstore one row at a time creates a row segment group for each insert, degrading performance. I cannot batch the inserts. I also cannot use the in-memory rowstore (the matrix is too big).
I have three questions:
1. Does the issue with single-row inserts apply to updates too, if only cell_data is changed (i.e. Q4)?
2. Would it be possible to have an in-memory row table in which I do the INSERT (and UPDATE?) operations and periodically batch the contents into the column table?
2a. How would I perform Q1,2 if I need the most recent data (UNION ALL?)?
2b. Is it possible to avoid executing Q3 against both tables (which would mean two round trips?)?
3. I am concerned about the execution speed of Q1 and Q2. Is the clustered key optimal for those? I am not sure how the records would be stored with the table above.
1. Yes, single-row updates also perform poorly - they are essentially a delete and an insert.
2. Yes, and in fact we automatically do this behind the scenes - the most recently inserted data (if it is too small a number of rows to be a good columnar segment) is kept in an in-memory rowstore form, and read queries are essentially looking at a UNION ALL of that data and the column-oriented data. We then batch up this data to write into column-oriented form.
If that doesn't work well enough, depending on your workload, you may benefit from explicitly keeping some of your data in a rowstore table instead of relying on the above behavior, in which case:
2a. yes, to see the most recent data you would use UNION ALL
2b. the data could be in either table, so you would have to query both (like for Q1,2, using UNION ALL works). This does not do two round trips, just one.
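As a sketch of what 2a/2b could look like if you maintain your own rowstore table, assuming it is called matrix_recent and has the same columns as matrix (the table name and the literal r_id value are assumptions):
-- Q1 against both tables: rows still sitting in the rowstore
-- plus everything already flushed to the columnstore
SELECT c_id, cell_data FROM matrix WHERE r_id = 42
UNION ALL
SELECT c_id, cell_data FROM matrix_recent WHERE r_id = 42;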
3. You can order by either r or c first in the columnstore key - r in your current schema. This makes queries for a row efficient, but queries for a column are going to be very inefficient; they may have to scan basically the full table (depending on the patterns in your data). Unfortunately, columnstore tables do not support multiple keys, so there is no good way to solve this. One potential hacky solution is to maintain two copies of your table, one with key (r, c) and one with key (c, r) - this is essentially manually maintaining two indexes.
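If you did go the two-copies route, the second copy would just swap the key order; a sketch, where the table name and the literal in the sample query are made up:
CREATE TABLE matrix_by_col (
    r_id INT,
    c_id INT,
    cell_data VARCHAR(10),
    KEY (`c_id`, `r_id`) USING CLUSTERED COLUMNSTORE
);
-- Q2 (whole column) is then served efficiently from this copy
SELECT r_id, cell_data FROM matrix_by_col WHERE c_id = 7;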
Based on the workload you're describing, it sounds like you are doing many single-row queries (Q3,4,5, which is 50% of the workload), which rowstore is much better suited for than columnstore (see http://docs.memsql.com/latest/concepts/columnstore/). Unfortunately, if it doesn't fit in memory, there isn't really a good way around this other than perhaps to add more memory.