I have a dim Delta table. So far I am calculating dim_id using row_number() + max(dim_id).
dim_id | user_id
-------+--------
  1001 | 1
  1002 | 3
  1003 | 5
  1004 | 9
For example, if I delete the row with dim_id 1004 and then insert a new user_id such as 7, row_number() + max(dim_id) evaluates to 1004 again, so that id is repeated. Is there any way to prevent already-used ids from being generated again once they have been deleted from the Delta table?
The perfect way to solve this would be a primary key constraint, but that is not supported yet.
You can combine monotonically_increasing_id() with row_number() as two columns; see "Generate unique increasing values" for an example.
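A minimal Spark SQL sketch of that combination, assuming a staged batch view named new_users (the view name and column aliases are made up for illustration):

-- Two generated columns side by side, as suggested above:
SELECT
  monotonically_increasing_id()        AS surrogate_id, -- unique within this job, not dense
  ROW_NUMBER() OVER (ORDER BY user_id) AS row_num,      -- dense 1..N sequence for this batch
  user_id
FROM new_users;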
Reference docs:
monotonically_increasing_id()
Constraints on Databricks
I am trying to write a Cassandra query, and my use case is as follows.
Let's say the table is:
ID | Version
1 | 1
1 | 2
2 | 1
2 | 2
2 | 3
Now what I want is to get the latest version for all the IDs.
So the query should give me 2 rows: the first with ID 1, Version 2, and the second with ID 2, Version 3.
I tried a query like SELECT * FROM table WHERE ID = 1 AND Version = MAX(Version), but that is not valid syntax.
Can anybody help with this?
SELECT * FROM table WHERE ID = 1 LIMIT 1 would give you the highest version, provided your clustering key is Version in descending order:
CREATE TABLE table (
id int,
version int,
PRIMARY KEY (id, version)
) WITH CLUSTERING ORDER BY (version DESC);
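With that clustering order, the first row in each partition is the latest version. For example (the PER PARTITION LIMIT form assumes Cassandra 3.6 or newer, and note that TABLE is a reserved word in CQL, so substitute your real table name):

-- Latest version for a single ID:
SELECT * FROM table WHERE id = 1 LIMIT 1;      -- returns id=1, version=2

-- Latest version for every ID in one query (Cassandra 3.6+):
SELECT * FROM table PER PARTITION LIMIT 1;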
I want to delete data between timestamp from my table.
CREATE TABLE propatterns_test.test (
clientId text,
meterId text,
meterreading text,
date timestamp,
PRIMARY KEY (meterId, date) );
My delete query is:
DELETE FROM test WHERE meterid = 'M5' AND date > '2016-12-27 10:00:00+0000';
This returned the following error:
InvalidRequest: Error from server: code=2200 [Invalid query]
message="Invalid operator < for PRIMARY KEY part date"
After that I tried to delete a specific row:
DELETE FROM test WHERE meterid = 'M5' AND date = '2016-12-27 09:42:30+0000';
Actually the table contains the same record, but it was not deleted.
This is what my data looks like:
meterid | date | clientid | meterreading
---------+--------------------------+----------+--------------
M5 | 2016-12-27 09:42:30+0000 | RDS | 35417.8
M5 | 2016-12-27 09:42:44+0000 | RDS | 35417.8
M5 | 2016-12-27 09:47:20+0000 | RDS | 35417.8
M5 | 2016-12-27 09:47:33+0000 | RDS | 35417.8
Nothing is being deleted from the table. So how can I delete data between timestamps when the date is part of the primary key?
I see a couple of things happening here. First of all, like iconnj mentioned, range deletes are not possible in versions prior to Cassandra 3.0.
Secondly, your single-row delete attempt is failing (I believe) because you are not accounting for the milliseconds present on the timestamp. You can see this if you nest your date column inside the timestampAsBlob and blobAsBigint functions:
aploetz@cqlsh:stackoverflow> SELECT meterid,date,blobAsBigint(timestampAsBlob(date))
FROM propatterns WHERE meterid='M5';
meterid | date | system.blobasbigint(system.timestampasblob(date))
---------+--------------------------+---------------------------------------------------
M5 | 2016-12-27 09:42:30+0000 | 1482831750000
M5 | 2016-12-30 17:31:53+0000 | 1483119113231
M5 | 2016-12-30 17:32:08+0000 | 1483119128812
(3 rows)
Note the zeros at the end of the 2016-12-27 09:42:30+0000 row, which I explicitly INSERTed from your example. The two rows I INSERTed using the nested dateof(now()) functions actually have the milliseconds as the last three digits of their timestamps.
Watch what happens when I take those three digits and add them as milliseconds when I delete one of the rows:
aploetz@cqlsh:stackoverflow> DELETE FROM propatterns WHERE meterid='M5'
AND date='2016-12-30 17:32:08.812+0000';
aploetz@cqlsh:stackoverflow> SELECT meterid,date,blobAsBigint(timestampAsBlob(date))
FROM propatterns WHERE meterid='M5';
meterid | date | system.blobasbigint(system.timestampasblob(date))
---------+--------------------------+---------------------------------------------------
M5 | 2016-12-27 09:42:30+0000 | 1482831750000
M5 | 2016-12-30 17:31:53+0000 | 1483119113231
(2 rows)
In summary:
You cannot perform range deletes prior to Cassandra 3.0.
You cannot delete individual rows keyed by timestamps without specifying milliseconds, if milliseconds are indeed present.
Delete with a range clause is possible from C* 3.0 onwards. Looking at the error you got, I think you are on a pre-3.0 version, in which case you won't be able to do this via CQL.
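For reference, on a 3.0+ cluster the range delete from the question should be accepted as written:

-- Valid on Cassandra 3.0+ only: range delete on a clustering column
DELETE FROM test WHERE meterid = 'M5' AND date > '2016-12-27 10:00:00+0000';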
In Cassandra 3 you can use the "DELETE FROM ... USING TIMESTAMP XXX WHERE ..." form:
create table mytime (
location_id text,
tour_id text,
mytime timestamp,
PRIMARY KEY (location_id, tour_id));
INSERT INTO mytime (location_id, tour_id, mytime) values ('location1', '1', toTimeStamp(now()));
INSERT INTO mytime (location_id, tour_id, mytime) values ('location1', '2', toTimeStamp(now()));
Be aware: the value you need to use for the timestamp is in microseconds (the WRITETIME value), not milliseconds:
select location_id, mytime, blobAsBigint(mytime), WRITETIME(mytime) from mytime;
location_id |mytime |system.blobasbigint(mytime) |writetime(mytime) |
------------|------------------------|----------------------------|------------------|
location1 |2018-11-28-09.53.52.110 |1543395232110 |1543395232109517 |
location1 |2018-11-28-09.53.52.742 |1543395232742 |1543395232740055 |
So now you can do
delete from mytime using timestamp 1543395232109517 where location_id = 'location1';
This correctly deletes the entry whose write time is <= 1543395232109517:
select location_id, mytime, blobAsBigint(mytime), WRITETIME(mytime) from mytime;
location_id |mytime |system.blobasbigint(mytime) |writetime(mytime) |
------------|------------------------|----------------------------|------------------|
location1 |2018-11-28-09.53.52.742 |1543395232742 |1543395232740055 |
I have a table/column family in Cassandra 3.7 with sensor data.
CREATE TABLE test.sensor_data (
house_id int,
sensor_id int,
time_bucket int,
sensor_time timestamp,
sensor_reading map<int, float>,
PRIMARY KEY ((house_id, sensor_id, time_bucket), sensor_time)
)
Now when I select from this table I find duplicates for the same primary key, something I thought was impossible.
cqlsh:test> select * from sensor_data;
house_id | sensor_id | time_bucket | sensor_time | sensor_reading
----------+-----------+-------------+---------------------------------+----------------
1 | 2 | 3 | 2016-01-02 03:04:05.000000+0000 | {1: 101}
1 | 2 | 3 | 2016-01-02 03:04:05.000000+0000 | {1: 101}
I think part of the problem is that this data has both been written "live" using java and Datastax java driver, and it has been loaded together with historic data from another source using sstableloader.
Regardless, this shouldn't be possible.
I have no way of connecting to this cluster with the legacy cassandra-cli; perhaps that would have told me something that I can't see using cqlsh.
So, the questions are:
* Is there any way this could happen under known circumstances?
* Can I read more raw data using cqlsh? Specifically, the write time of these two rows. The writetime() function can't operate on primary keys or collections, and that is all I have.
Thanks.
Update:
This is what I've tried, based on comments, answers and other sources:
* selecting using blobAsBigInt gives the same big integer for all identical rows
* connecting using cassandra-cli, after enabling thrift, is possible but reading the table isn't. It's not supported after 3.x
* dumping out using sstabledump is ongoing but expected to take another week or two ;)
I wouldn't expect to see nanoseconds in a timestamp field, and additionally I'm under the impression they're not supported at all. Try this:
SELECT house_id, sensor_id, time_bucket, blobAsBigint(sensor_time) FROM test.sensor_data;
I WAS able to replicate it by inserting the rows via an integer:
INSERT INTO sensor_data(house_id, sensor_id, time_bucket, sensor_time) VALUES (1,2,4,1451692800000);
INSERT INTO sensor_data(house_id, sensor_id, time_bucket, sensor_time) VALUES (1,2,4,1451692800001);
This makes sense because I would suspect one of your drivers is using a bigint to insert the timestamp, and one is likely actually using the datetime.
Tried playing with both timezones and bigints to reproduce this... it seems like only the bigint case is reproducible:
house_id | sensor_id | time_bucket | sensor_time | sensor_reading
----------+-----------+-------------+--------------------------+----------------
1 | 2 | 3 | 2016-01-02 00:00:00+0000 | null
1 | 2 | 4 | 2016-01-01 23:00:00+0000 | null
1 | 2 | 4 | 2016-01-02 00:00:00+0000 | null
1 | 2 | 4 | 2016-01-02 00:00:00+0000 | null
1 | 2 | 4 | 2016-01-02 01:01:00+0000 | null
edit: Tried some shenanigans using bigint in place of datetime insert, managed to reproduce...
Adding some observations on top of what Nick mentioned:
Cassandra primary key = partition key(s) + optional clustering key(s).
Keeping in mind that the partition key (written inside the inner parentheses, and either simple with one column or composite with several) uniquely identifies the partition, while the clustering keys sort data within it, the following has been observed.
Query using SELECT: it is sufficient to supply all of the partition key column(s); you can additionally filter on clustering column(s), but only in the same order in which they were declared in the primary key at table creation.
Update using UPDATE ... SET: the WHERE clause must include not only all of the partition key column(s) but also all of the clustering column(s), as shown in the sketch below.
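A small hypothetical table (all names invented for illustration) makes both observations concrete:

-- Composite partition key (pk1, pk2); clustering columns ck1, ck2
CREATE TABLE demo (
    pk1 int,
    pk2 int,
    ck1 int,
    ck2 int,
    val text,
    PRIMARY KEY ((pk1, pk2), ck1, ck2)
);

-- SELECT: all partition key columns are required; clustering columns are
-- optional but must be restricted in declaration order:
SELECT * FROM demo WHERE pk1 = 1 AND pk2 = 2;
SELECT * FROM demo WHERE pk1 = 1 AND pk2 = 2 AND ck1 = 3;

-- UPDATE: the WHERE clause must name every partition AND clustering column:
UPDATE demo SET val = 'x' WHERE pk1 = 1 AND pk2 = 2 AND ck1 = 3 AND ck2 = 4;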
Answering the question - is there any way this could happen under known circumstances?
Yes, it is possible when the same data is inserted from different sources.
To explain further: if one inserts data from code (an API, etc.) into Cassandra and then inserts the same data from DataStax Studio or any other tool used for direct querying, a duplicate record is inserted.
If the same data is pushed multiple times from code alone, or from the querying tool alone, or repeatedly from any single source, the write behaves idempotently and the data is not inserted again.
A possible explanation could be the way the underlying storage engine computes internal indexes or hashes to identify a row for a given set of columns (since it is column-based).
Note:
The above behaviour (duplicates when the same data is pushed from different sources) has been observed, tested and validated.
Language used: C#
Framework: .NET Core 3
"sensor_time" is part of the primary key. It is not in the "Partition Key" but is a "Clustering Column"; this is why you get two "rows".
However, on disk both "visual rows" are stored in a single Cassandra row. In reality they are just different columns, and CQL merely presents them as two "visual rows".
Clarification: I have not worked with Cassandra for a while, so I might not use the correct terms. When I say "visual rows", I mean what the CQL result shows.
Update
You can run the following experiment (please forgive and fix any syntax errors I make).
This is supposed to create a table with a composite primary key, where
"state" is the "Partition Key" and
"city" is the "Clustering Column".
create table cities(
state int,
city int,
name text,
primary key((state), city)
);
insert into cities(state, city, name)values(1, 1, 'New York');
insert into cities(state, city, name)values(1, 2, 'Corona');
select * from cities where state = 1;
This will return something like:
1, 1, New York
1, 2, Corona
But on the disk this will be stored on single row like this:
+-------+-----------------+-----------------+
| state | city = 1 | city = 2 |
| +-----------------+-----------------+
| | city | name | city | name |
+-------+------+----------+------+----------+
| 1 | 1 | New York | 2 | Corona |
+-------+------+----------+------+----------+
When you have such a composite primary key, you can select or delete on it, e.g.
select * from cities where state = 1;
delete from cities where state = 1;
In the question, the primary key is defined as:
PRIMARY KEY ((house_id, sensor_id, time_bucket), sensor_time)
This means:
"house_id", "sensor_id", "time_bucket" form the "Partition Key", and
"sensor_time" is the "Clustering Column".
So when you select, the real row is split up and shown as if there were several rows.
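Applied to the table from the question, you can see this by restricting on the full partition key; both "visual rows" come back from the same underlying partition:

-- sensor_time is the clustering column, so values that differ there (even by
-- milliseconds that cqlsh does not display) appear as separate "visual rows":
SELECT * FROM test.sensor_data
WHERE house_id = 1 AND sensor_id = 2 AND time_bucket = 3;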
Update
http://www.planetcassandra.org/blog/primary-keys-in-cql/
The PRIMARY KEY definition is made up of two parts: the Partition Key
and the Clustering Columns. The first part maps to the storage engine
row key, while the second is used to group columns in a row. In the
storage engine the columns are grouped by prefixing their name with
the value of the clustering columns. This is a standard design pattern
when using the Thrift API. But now CQL takes care of transposing the
clustering column values to and from the non key fields in the table.
Then read the explanations in "The Composite Enchilada".
I understand that this is not possible using an UPDATE.
What I would like to do instead, is migrate all rows with say PK=0 to new rows where PK=1. Are there any simple ways of achieving this?
For a relatively simple way, you could always do a quick COPY TO/FROM in cqlsh.
Let's say that I have a column family (table) called "emp" for employees.
CREATE TABLE stackoverflow.emp (
id int PRIMARY KEY,
fname text,
lname text,
role text
)
And for the purposes of this example, I have one row in it.
aploetz@cqlsh:stackoverflow> SELECT * FROM emp;
id | fname | lname | role
----+-------+-------+-------------
1 | Angel | Pay | IT Engineer
If I want to re-create Angel with a new id, I can COPY the table's contents TO a .csv file:
aploetz@cqlsh:stackoverflow> COPY stackoverflow.emp TO '/home/aploetz/emp.csv';
1 rows exported in 0.036 seconds.
Now, I'll use my favorite editor to change the id of Angel to 2 in emp.csv. Note, that if you have multiple rows in your file (that don't need to be updated) this is your opportunity to remove them:
2,Angel,Pay,IT Engineer
I'll save the file, and then COPY the updated row back into Cassandra FROM the file:
aploetz@cqlsh:stackoverflow> COPY stackoverflow.emp FROM '/home/aploetz/emp.csv';
1 rows imported in 0.038 seconds.
Now Angel has two rows in the "emp" table.
aploetz@cqlsh:stackoverflow> SELECT * FROM emp;
id | fname | lname | role
----+-------+-------+-------------
1 | Angel | Pay | IT Engineer
2 | Angel | Pay | IT Engineer
(2 rows)
For more information, check the DataStax doc on COPY.
I have to create and query a column family with a composite key of [timestamp, long]. Also,
while querying I want to fire a range query on the timestamp (like timestamp between xxx and yyy). Is this possible?
Currently I am doing something really funny (which I know is not correct): I create keys from the timestamp string for the given range and concatenate each with the long,
like:
1254345345435-1234
3423432423432-1234
1231231231231-9999
and pass the set of keys to the Hector API (so if I have a date range of 1 month and I want per-minute data, I create 30 * 24 * 60 * [number of secondary keys - long] keys).
I can solve the concatenation issue with a composite key, but the query part is what I am trying to understand.
As far as I understand, since we are using RandomPartitioner we cannot really query based on a range, as keys are MD5 checksums. What is the ideal design for this kind of use case?
My schema and requirements are as follows (actual CQL):
CREATE TABLE report(
ts timestamp,
user_id bigint,
svc1 bigint,
svc2 bigint,
svc3 bigint,
PRIMARY KEY(ts, user_id));
select * from report where ts between 123445345435 and 32423423424 and user_id in (123, 567, 987)
You cannot do range queries on the first component of a composite key. Instead, you should write a sentinel value such as a daystamp (the unix epoch at midnight on the current day) as the key, then write a composite column as timestamp:long. This way you can provide the keys that comprise your range, and slice on the timestamp component of the composite column.
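Expressed in today's CQL terms, that daystamp layout looks roughly like the sketch below (the table and column names are made up, and the original answer predates CQL3):

-- Sketch: the day is the partition key, (ts, user_id) are clustering columns,
-- so a slice on ts within a day works even with RandomPartitioner.
CREATE TABLE report_by_day (
    day     text,       -- e.g. '2012-12-05' (or the epoch at midnight)
    ts      timestamp,
    user_id bigint,
    svc1    bigint,
    svc2    bigint,
    svc3    bigint,
    PRIMARY KEY ((day), ts, user_id)
);

-- Query one day's partition and slice on the timestamp range:
SELECT * FROM report_by_day
 WHERE day = '2012-12-05'
   AND ts >= '2012-12-05 00:00:00+0000'
   AND ts <  '2012-12-05 12:00:00+0000';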
Denormalize! You must model your schema in a manner that will enable the types of queries you wish to perform. We create a reverse (aka inverted, inverse) index for such scenarios.
CREATE TABLE report(
KEY uuid PRIMARY KEY,
svc1 bigint,
svc2 bigint,
svc3 bigint
);
CREATE TABLE ReportsByTime(
KEY ascii PRIMARY KEY
) with default_validation=uuid AND comparator=uuid;
CREATE TABLE ReportsByUser(
KEY bigint PRIMARY KEY
)with default_validation=uuid AND comparator=uuid;
See here for a nice explanation. What you are doing now is generating your own ascii key in the times table, to enable yourself to perform the range slice query you want - it doesn't have to be ascii though just something you can use to programmatically generate your own slice keys with.
You can use this approach to facilitate all of your queries, this likely isn't going to suit your application directly but the idea is the same. You can squeeze more out of this by adding meaningful values to the column keys of each table above.
cqlsh:tester> select * from report;
KEY | svc1 | svc2 | svc3
--------------------------------------+------+------+------
1381b530-1dd2-11b2-0000-242d50cf1fb5 | 332 | 333 | 334
13818e20-1dd2-11b2-0000-242d50cf1fb5 | 222 | 223 | 224
13816710-1dd2-11b2-0000-242d50cf1fb5 | 112 | 113 | 114
cqlsh:tester> select * from times;
KEY,1212051037 | 13818e20-1dd2-11b2-0000-242d50cf1fb5,13818e20-1dd2-11b2-0000-242d50cf1fb5 | 1381b530-1dd2-11b2-0000-242d50cf1fb5,1381b530-1dd2-11b2-0000-242d50cf1fb5
KEY,1212051035 | 13816710-1dd2-11b2-0000-242d50cf1fb5,13816710-1dd2-11b2-0000-242d50cf1fb5 | 13818e20-1dd2-11b2-0000-242d50cf1fb5,13818e20-1dd2-11b2-0000-242d50cf1fb5
KEY,1212051036 | 13818e20-1dd2-11b2-0000-242d50cf1fb5,13818e20-1dd2-11b2-0000-242d50cf1fb5
cqlsh:tester> select * from users;
KEY | 13816710-1dd2-11b2-0000-242d50cf1fb5 | 13818e20-1dd2-11b2-0000-242d50cf1fb5
-------------+--------------------------------------+--------------------------------------
23123123231 | 13816710-1dd2-11b2-0000-242d50cf1fb5 | 13818e20-1dd2-11b2-0000-242d50cf1fb5
Why don't you use wide rows, where the key is the timestamp and the column name is the long value? Then you can pass multiple keys (timestamps) to getKeySlice and select multiple columns with withColumnSlice by their name (which is the id).
As I don't know what your column names and values are, I feel this can help you. Can you provide more details of your column family definition?