One issue that I would like to avoid is two branches updating the JOOQ generated code. I imagine that this can lead to a messy merge conflict. Is there a best-practices strategy for managing DB changes across two different branches with JOOQ?
Future jOOQ multi schema version support
Multi schema version code generation is coming in a future release with https://github.com/jOOQ/jOOQ/issues/9626
Work on the required infrastructure for the above has started in jOOQ 3.15. There's a lot of work and open questions, but eventually, it will be possible to define a set of source schemas which should all be supported at the same time:
By code generation
By the runtime (e.g. * includes only columns available in a given version)
Rolling your own using SQL views
Until then, you might be able to pull off a compatibility layer yourself using views. For example:
-- Version 1
CREATE TABLE t (
id INT PRIMARY KEY,
col1 TEXT,
col2 TEXT
);
-- Version 2
CREATE TABLE t (
id INT PRIMARY KEY,
-- col1 was dropped
col2 TEXT,
-- col3 was added
col3 TEXT
);
Now deploy a view that looks the same to your client code for both versions:
-- Version 1
CREATE OR REPLACE VIEW v (id, col1, col2, col3) AS
SELECT id, col1, col2, NULL
FROM t;
-- Version 2
CREATE OR REPLACE VIEW v (id, col1, col2, col3) AS
SELECT id, NULL, col2, col3
FROM t;
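The two views can be sanity-checked quickly outside a real deployment; here is a minimal sketch using in-memory SQLite (purely illustrative - columns_of_view is a made-up helper, and your actual RDBMS may differ in view syntax):

```python
import sqlite3

# Illustrative sketch: both schema versions, seen through the view v,
# expose the same column set to client code.
def columns_of_view(ddl_script):
    conn = sqlite3.connect(":memory:")
    conn.executescript(ddl_script)
    cur = conn.execute("SELECT * FROM v")
    return [d[0] for d in cur.description]

version1 = """
CREATE TABLE t (id INTEGER PRIMARY KEY, col1 TEXT, col2 TEXT);
CREATE VIEW v (id, col1, col2, col3) AS SELECT id, col1, col2, NULL FROM t;
"""
version2 = """
CREATE TABLE t (id INTEGER PRIMARY KEY, col2 TEXT, col3 TEXT);
CREATE VIEW v (id, col1, col2, col3) AS SELECT id, NULL, col2, col3 FROM t;
"""

# Client code sees an identical shape regardless of schema version.
assert columns_of_view(version1) == columns_of_view(version2) == [
    "id", "col1", "col2", "col3"]
```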
If your RDBMS supports updatable views, you might be able to use them like any other table, especially when adding synthetic primary keys / synthetic foreign keys to your generated code.
With a generator strategy, you could further rename your generated view names V to T (assuming you exclude the actual T from being generated), and your client code won't even notice that you emulated the T table with a view.
My s3 location has the below structure
s3://bucketname/snapshot/db_collection/snapshot1/*.parquet
s3://bucketname/snapshot/db_collection/snapshot2/*.parquet
s3://bucketname/snapshot/db_collection/snapshot3/*.parquet
What I want is
to be able to define the Trino table at the level s3://bucketname/snapshot/db_collection/, so that if I query for a row and it exists in 2 snapshots, I get 2 rows as output. I was not able to find out how to write a create table query for this use case (which is essentially a partitioning use case). Also note that the partition folder snapshotX is not in the <abc>=<efg> format.
is there any tool or way to generate the table automatically from the Parquet file or the schema JSON file? I ask because my Parquet file has 150 columns, and each column is again nested, etc. Writing the table definition by hand is not easy.
I tried running the AWS Glue crawler to generate the table and using Athena for querying, but when I run a select query I get weird errors that scare me off, so I don't want to take this path.
My existing table definition is as follows
create table trino.db_collection (
col1 varchar,
col2 varchar,
col3 varchar
)with (
external_location = 's3a://bucket/trino/db_collection/*',
format = 'PARQUET'
)
My setup is AWS EMR 6.8.0 with trino-v388.
Regarding partitions:
As you mentioned, automatic partition discovery won't work because Trino looks for the Hive format col_name=value. As a best practice, I would recommend running a one-time procedure to rename the keys; however, if this is not possible, you can still manually register partitions using the register_partition system procedure. It's just tedious to maintain.
system.register_partition(schema_name, table_name, partition_columns, partition_values, location)
Please note you'll also need to edit your installation config and enable this on the catalog properties file.
From the docs (https://trino.io/docs/current/connector/hive.html#procedures):
Due to security reasons, the procedure is enabled only when hive.allow-register-partition-procedure is set to true.
The partition column has to be last in your table schema, and the partitioned_by property must be defined in the table properties.
So in your example:
create table trino.db_collection (
col1 varchar,
col2 varchar,
col3 varchar,
snapshot varchar
)with (
external_location = 's3a://bucket/trino/db_collection/*',
format = 'PARQUET',
partitioned_by = ['snapshot']
)
Regarding inferring the table schema:
This is not supported in Trino but can be done with Spark or a Glue crawler. If you register the table in the Glue catalog, it can be read by Trino as well.
Can you share the errors you got when selecting?
The answer above is missing the ARRAY keyword:
create table trino.db_collection (
col1 varchar,
col2 varchar,
col3 varchar
)with (
external_location = 's3a://bucket/trino/db_collection/*',
format = 'PARQUET',
partitioned_by = ARRAY['col1','col2']
)
I am new to cassandra and I read some articles about static and dynamic column family.
It is mentioned that, from Cassandra 3, table and column family are the same.
I created key space, some tables and inserted data into that table.
CREATE TABLE subscribers(
id uuid,
email text,
first_name text,
last_name text,
PRIMARY KEY(id,email)
);
INSERT INTO subscribers(id,email,first_name,last_name)
VALUES(now(),'Test@123.com','Test1','User1');
INSERT INTO subscribers(id,email,first_name,last_name)
VALUES(now(),'Test2@222.com','Test2','User2');
INSERT INTO subscribers(id,email,first_name,last_name)
VALUES(now(),'Test3@333.com','Test3','User3');
It all seems to work fine.
But what I need is to create a dynamic column family with only data types and no predefined columns.
Each insert query may pass different column arguments, and the data should still be inserted.
In the articles, it is mentioned that for a dynamic column family there is no need to create a schema (predefined columns).
I am not sure whether this is possible in Cassandra, or whether my understanding is wrong.
Let me know whether this is possible or not.
If possible, kindly provide some examples.
Thanks in advance.
I think that the articles you're referring to were written in the first years of Cassandra, when it was based on the Thrift protocol. The Cassandra Query Language (CQL) was introduced many years ago, and now it's the standard way to work with Cassandra - Thrift is deprecated in Cassandra 3.x and fully removed in 4.0 (not released yet).
If you really need to have fully dynamic stuff, then you can try to emulate this by using table with columns as maps from text to specific type, like this:
create table abc (
id int primary key,
imap map<text,int>,
tmap map<text,text>,
... more types
);
but you need to be careful - there are limitations and performance effects when using collections, especially if you want to store more than hundreds of elements.
another approach is to store data as individual rows:
create table xxxx (
id int,
col_name text,
ival int,
tval text,
... more types
primary key(id, col_name));
then you can insert individual values as separate columns:
insert into xxxx(id, col_name, ival) values (1, 'col1', 1);
insert into xxxx(id, col_name, tval) values (1, 'col2', 'text');
and select all columns as:
select * from xxxx where id = 1;
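To see the shape of what comes back, here is a minimal sketch of the same pattern in in-memory SQLite (illustrative only; Cassandra's storage and query semantics differ, but the row layout is the same idea):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE xxxx (
    id INTEGER, col_name TEXT, ival INTEGER, tval TEXT,
    PRIMARY KEY (id, col_name))""")

# Each "dynamic column" becomes its own row; the unused typed slots stay NULL.
conn.execute("INSERT INTO xxxx (id, col_name, ival) VALUES (1, 'col1', 1)")
conn.execute("INSERT INTO xxxx (id, col_name, tval) VALUES (1, 'col2', 'text')")

rows = conn.execute(
    "SELECT col_name, ival, tval FROM xxxx WHERE id = 1 ORDER BY col_name"
).fetchall()

# Two rows come back for the single logical entity id = 1.
assert rows == [('col1', 1, None), ('col2', None, 'text')]
```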
Assume you have a table with a column that serves as the primary (partition) key (say its name is "id"), and the rest of the columns are "regular" (no clustering) - let's call them "field1", "field2", "field3", "field4", etc. The logic that currently exists in the system might generate 2 separate update commands to the same row. For example:
UPDATE table SET field1='value1' WHERE id='key';
UPDATE table SET field2='value2' WHERE id='key';
These commands run one after the other, at quorum consistency.
Seldom, when you retrieve the row (quorum read) from the DB, it's as if one of the updates did not happen. Is it possible that the inconsistency is caused by this write pattern, and can it be circumvented by making one update call like this:
UPDATE table SET field1='value1',field2='value2' WHERE id='key';
This is happening on Cassandra 2.1.17
Yes, this is totally possible.
If you need to preserve the order of the two statements, you can do 2 things:
add USING TIMESTAMP to your queries and set the timestamp explicitly in client code - this will prevent the inconsistencies
use a batch
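Why the explicit timestamp helps can be sketched with a toy last-write-wins model of a single cell (purely illustrative; Cell and merge are made-up names, not Cassandra internals):

```python
from dataclasses import dataclass

@dataclass
class Cell:
    value: str
    timestamp: int  # client-supplied, as with USING TIMESTAMP

def merge(a: Cell, b: Cell) -> Cell:
    """Per-column last-write-wins: the higher timestamp always wins,
    no matter in which order replicas receive the two writes."""
    return a if a.timestamp >= b.timestamp else b

older = Cell("value1", timestamp=100)
newer = Cell("value1-fixed", timestamp=101)

# Arrival order at a replica is irrelevant once timestamps are explicit:
assert merge(older, newer) is merge(newer, older) is newer
```

With server-assigned timestamps, two coordinators' clocks can disagree, so the statement issued later may carry the lower timestamp and silently lose; setting the timestamp on the client removes that ambiguity.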
What I would have done is change the table definition:
CREATE TABLE TABLE_NAME(
id text,
field text,
value text,
PRIMARY KEY( id , field )
);
This way you don't have to worry about updates to fields for a particular key.
Your queries would be:
INSERT INTO TABLE_NAME (id , field , value ) VALUES ('key','fieldname1', 'value1' );
INSERT INTO TABLE_NAME (id , field , value ) VALUES ('key','fieldname2', 'value2' );
The drawback of this design is that if you have too much data for 'key', it would create a wide row.
For select queries -
SELECT * from TABLE_NAME where id ='key';
On the client side, build your object.
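That client-side step might look like this (a sketch; fetched_rows stands in for whatever your driver returns):

```python
# Rows returned by: SELECT * FROM TABLE_NAME WHERE id = 'key';
# Each row is (id, field, value); fold them back into one object per id.
fetched_rows = [
    ("key", "fieldname1", "value1"),
    ("key", "fieldname2", "value2"),
]

objects = {}
for row_id, field, value in fetched_rows:
    objects.setdefault(row_id, {})[field] = value

assert objects == {"key": {"fieldname1": "value1", "fieldname2": "value2"}}
```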
I am new to Cassandra and am coming from Postgres. I was wondering if there is a way to get data from 2 different tables or column families and then return the results. I have this query:
select p.fullname, p.picture, s.post, s.id, s.comments, s.state, s.city
FROM profiles as p INNER JOIN Chats as s ON (p.id = s.profile_id)
WHERE s.latitudes >= 28 AND 29 >= s.latitudes
AND s.longitudes >= -21 AND -23 >= s.longitudes
The query involves 2 tables, Profiles and Chats, which share a common field: Profiles.id == Chats.profile_id. It basically boils down to this: return all rows where the chat's profile_id equals the profile's id. I would like to keep it that way, because updating profiles then stays simple - only 1 row needs updating per profile update, instead of de-normalizing everything and updating thousands of records. Any help or suggestions would be great.
You have to design your tables in a way that you won't need joins. The best practice is that your table matches exactly the use case it is used for.
Cassandra has a feature called static columns; these allow you to bind values to the partition part of the primary key. Thus, you can create a "joined" version of the table without duplicates.
CREATE TABLE t (
p_id uuid,
p_fullname text STATIC,
p_picture text STATIC,
s_id uuid,
s_post text,
s_comments text,
s_state text,
s_city text,
PRIMARY KEY (p_id, s_id)
);
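How static columns avoid duplication can be sketched with a toy model (illustrative only; upsert and select_all are made-up helpers, not driver APIs):

```python
# Toy model: static columns attach to the partition, not to each
# clustered row. Column names follow the table above.
partitions = {}

def upsert(p_id, s_id=None, statics=None, regulars=None):
    part = partitions.setdefault(p_id, {"static": {}, "rows": {}})
    if statics:
        part["static"].update(statics)        # stored once per partition
    if s_id is not None:
        part["rows"].setdefault(s_id, {}).update(regulars or {})

def select_all(p_id):
    part = partitions[p_id]
    # Every row in the partition is returned together with the same
    # single copy of the static values - the "join" without duplicates.
    return [{**part["static"], "s_id": s, **cols}
            for s, cols in sorted(part["rows"].items())]

upsert("p1", statics={"p_fullname": "Alice"})
upsert("p1", s_id=1, regulars={"s_post": "hello"})
upsert("p1", s_id=2, regulars={"s_post": "world"})

# The profile name is stored once but appears on every chat row:
assert [r["p_fullname"] for r in select_all("p1")] == ["Alice", "Alice"]
```

Updating the profile then means rewriting one static cell, not every chat row.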
I've just had a crash course in Cassandra over the last week and went from the Thrift API to CQL to grokking SuperColumns to learning I shouldn't use them and should use Composite Keys instead.
I'm now trying out CQL3, and it would appear that I can no longer insert into columns that are not defined in the schema, or see those columns in a select *.
Am I missing some option to enable this in CQL3, or does it expect me to define every column in the schema (defeating the purpose of wide, flexible rows, imho)?
Yes, CQL3 does require columns to be declared before use.
But you can do as many ALTERs as you want; no locking or performance hit is entailed.
That said, most of the places where you'd use "dynamic columns" in earlier C* versions are better served by a Map in C* 1.2.
I suggest you explore composite columns with WITH COMPACT STORAGE.
A COMPACT STORAGE column family allows you to define practically only the key columns:
Example:
CREATE TABLE entities_cargo (
entity_id int,
item_id ascii,
qt ascii,
PRIMARY KEY (entity_id, item_id)
) WITH COMPACT STORAGE
Actually, when you insert different values of item_id, you don't add a row with entity_id, item_id and qt; you add a column whose name is the item_id content and whose value is the qt content.
So:
insert into entities_cargo (entity_id,item_id,qt) values(100,'oggetto 1','3');
insert into entities_cargo (entity_id,item_id,qt) values(100,'oggetto 2','3');
Now, here is how you see these rows in CQL3:
cqlsh:goh_master> select * from entities_cargo where entity_id = 100;
entity_id | item_id | qt
-----------+-----------+----
100 | oggetto 1 | 3
100 | oggetto 2 | 3
And here is how they look if you check them from the CLI:
[default@goh_master] get entities_cargo[100];
=> (column=oggetto 1, value=3, timestamp=1349853780838000)
=> (column=oggetto 2, value=3, timestamp=1349853784172000)
Returned 2 results.
You can access a single column with
select * from entities_cargo where entity_id = 100 and item_id = 'oggetto 1';
Hope it helps
Cassandra still allows using wide rows. This answer references a DataStax blog entry, written after the question was asked, which details the links between CQL and the underlying architecture.
Legacy support
A dynamic column family defined through Thrift with the following command (notice there is no column-specific metadata):
create column family clicks
with key_validation_class = UTF8Type
and comparator = DateType
and default_validation_class = UTF8Type
Here is the exact equivalent in CQL:
CREATE TABLE clicks (
key text,
column1 timestamp,
value text,
PRIMARY KEY (key, column1)
) WITH COMPACT STORAGE
Both of these commands create a wide-row column family that stores records ordered by date.
CQL Extras
In addition, CQL provides the ability to assign labels to the row id, column, and value elements to indicate what is being stored. The following alternative way of defining this same structure in CQL highlights this feature, using DataStax's example - a column family storing users' clicks on a website, ordered by time:
CREATE TABLE clicks (
user_id text,
time timestamp,
url text,
PRIMARY KEY (user_id, time)
) WITH COMPACT STORAGE
Notes
a Table in CQL is always mapped to a Column Family in Thrift
the CQL driver uses the first element of the primary key definition as the row key
Composite Columns are used to implement the extra columns that one can define in CQL
using WITH COMPACT STORAGE is not recommended for new designs because it fixes the number of possible columns. In other words, ALTER TABLE ... ADD is not possible on such a table. Just leave it out unless it's absolutely necessary.
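The mapping described in the notes can be sketched in a few lines (dictionaries stand in for the Thrift wide row, tuples for CQL result rows; to_cql_rows is a made-up helper):

```python
# One Thrift-style wide row: row key -> ordered map of (column name -> value).
# Here the column names are timestamps (as ints) and the values are URLs.
wide_row = {"user_id_1": {1000: "/home", 1001: "/search", 1002: "/checkout"}}

# The equivalent CQL view of the clicks table: one result row per
# (user_id, time) pair, i.e. per physical column of the wide row.
def to_cql_rows(rows):
    return [(key, time, url)
            for key, cols in rows.items()
            for time, url in sorted(cols.items())]

cql_rows = to_cql_rows(wide_row)
assert cql_rows == [
    ("user_id_1", 1000, "/home"),
    ("user_id_1", 1001, "/search"),
    ("user_id_1", 1002, "/checkout"),
]
```

Each CQL row is a slice of the same physical partition, which is why the first primary-key element is the row key and the second becomes the column name.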
Interesting - something I didn't know about CQL3. In PlayOrm, the idea is that it is a "partial" schema you must define, and in the WHERE clause of the select you can only use what is defined in the partial schema, BUT it returns ALL the data of the rows, EVEN the data it does not know about... I would expect that CQL should have been doing the same :( I need to look into this now.
thanks,
Dean