How to create a partitioned Trino table on S3 (with sub-folders) - presto

My s3 location has the below structure
s3://bucketname/snapshot/db_collection/snapshot1/*.parquet
s3://bucketname/snapshot/db_collection/snapshot2/*.parquet
s3://bucketname/snapshot/db_collection/snapshot3/*.parquet
What I want is to be able to define the Trino table at the level s3://bucketname/snapshot/db_collection/, so that if I query for a row and it exists in two snapshots, I get two rows as output. I was not able to find out how to write a CREATE TABLE query for this use case (which is essentially a partitioning use case). Also note that the partition folders snapshotX are not in the <abc>=<efg> format.
Is there any tool or way to generate the table automatically from the parquet files or the schema JSON file? I ask because my parquet file has 150 columns and each column is nested, etc. Writing the table definition by hand is not easy.
I tried running an AWS Glue crawler to generate the table and querying with Athena, but when I run a SELECT query I get weird errors that scare me off, so I don't want to go down this path.
My existing table definition is as follows:
create table trino.db_collection (
    col1 varchar,
    col2 varchar,
    col3 varchar
)
with (
    external_location = 's3a://bucket/trino/db_collection/*',
    format = 'PARQUET'
)
My setup is AWS EMR 6.8.0 with trino-v388.

Regarding partitions:
As you mentioned, automatic partition discovery won't work here because Trino looks for the Hive format col_name=value. As a best practice I would recommend running a one-time procedure to rename the keys; however, if this is not possible, you can still manually register partitions using the register_partition system procedure. It's just tedious to maintain.
system.register_partition(schema_name, table_name, partition_columns, partition_values, location)
Please note you'll also need to edit your installation config and enable this in the catalog properties file.
From the docs (https://trino.io/docs/current/connector/hive.html#procedures):
Due to security reasons, the procedure is enabled only when hive.allow-register-partition-procedure is set to true.
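For example, in the Hive catalog properties file (the path etc/catalog/hive.properties is an assumption; use whichever file defines your Hive catalog):

# etc/catalog/hive.properties (assumed file name)
hive.allow-register-partition-procedure=true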
The partition column has to be last in your table schema, and the partitioned_by property must be defined in the table properties.
So in your example:
create table trino.db_collection (
    col1 varchar,
    col2 varchar,
    col3 varchar,
    snapshot varchar
)
with (
    external_location = 's3a://bucket/trino/db_collection/*',
    format = 'PARQUET',
    partitioned_by = ['snapshot']
)
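Once the table exists with the partition column, each snapshot folder has to be registered once. A sketch with positional arguments, assuming the Hive catalog is named hive and the schema is trino as in your DDL (the location path follows your example):

CALL hive.system.register_partition(
    'trino',                                       -- schema_name
    'db_collection',                               -- table_name
    ARRAY['snapshot'],                             -- partition_columns
    ARRAY['snapshot1'],                            -- partition_values
    's3a://bucket/trino/db_collection/snapshot1'   -- location (required here, since the folders are not in col=value format)
);

Repeat the call for snapshot2, snapshot3, and so on.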
Regarding inferring the table schema:
This is not supported in Trino but can be done in Spark or with a Glue crawler. If you register the table in the Glue catalog, it can be read by Trino as well.
Can you share the errors you got when selecting?

The CREATE TABLE example above is missing the ARRAY keyword in partitioned_by; the corrected definition is:
create table trino.db_collection (
    col1 varchar,
    col2 varchar,
    col3 varchar,
    snapshot varchar
)
with (
    external_location = 's3a://bucket/trino/db_collection/*',
    format = 'PARQUET',
    partitioned_by = ARRAY['snapshot']
)

Related

JOOQ code generation strategy across multiple branches

One issue that I would like to avoid is two branches updating the jOOQ-generated code. I imagine that this can lead to a messy merge conflict. Is there a best-practices strategy for managing DB changes across two different branches with jOOQ?
Future jOOQ multi schema version support
Multi schema version code generation is coming in a future release with https://github.com/jOOQ/jOOQ/issues/9626
Work on the required infrastructure for the above has started in jOOQ 3.15. There's a lot of work and open questions, but eventually, it will be possible to define a set of source schemas which should all be supported at the same time:
By code generation
By the runtime (e.g. * includes only columns available in a given version)
Rolling your own using SQL views
Until then, you might be able to pull off a compatibility layer yourself using views. For example:
-- Version 1
CREATE TABLE t (
    id INT PRIMARY KEY,
    col1 TEXT,
    col2 TEXT
);

-- Version 2
CREATE TABLE t (
    id INT PRIMARY KEY,
    -- col1 was dropped
    col2 TEXT,
    -- col3 was added
    col3 TEXT
);
Now deploy a view that looks the same to your client code for both versions:
-- Version 1
CREATE OR REPLACE VIEW v (id, col1, col2, col3) AS
SELECT id, col1, col2, NULL
FROM t;

-- Version 2
CREATE OR REPLACE VIEW v (id, col1, col2, col3) AS
SELECT id, NULL, col2, col3
FROM t;
If your RDBMS supports updatable views, you might be able to use them like any other table, especially when adding synthetic primary keys / synthetic foreign keys to your generated code.
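A minimal sketch, assuming the RDBMS treats simple single-table views like these as (at least partially) updatable, as e.g. PostgreSQL does for columns that are plain column references:

-- Against the Version 2 deployment: col1 is a NULL constant in v, so leave it out of the write
INSERT INTO v (id, col2, col3) VALUES (1, 'a', 'b');

-- Updates through the view also pass through to the underlying table t
UPDATE v SET col2 = 'c' WHERE id = 1;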
With a generator strategy, you could further rename your generated view names V to T (assuming you exclude the actual T from being generated), and your client code won't even notice that you emulated the T table with a view.

Get column type of a table using cql command

I am trying to get the column type of a table using a CQL command.
My table:
CREATE TABLE users (
    id uuid PRIMARY KEY,
    name text
);
Now I am trying to get the type of the name column. With the help of some select query, I want to get text as output.
My use case is: I want to drop the name column only if the type of name is text.
What script should I try?
From CQL you can read this data from the system tables. In Cassandra 3.x, this information is located in the system_schema.columns table, which has the following schema:
CREATE TABLE system_schema.columns (
    keyspace_name text,
    table_name text,
    column_name text,
    clustering_order text,
    column_name_bytes blob,
    kind text,
    position int,
    type text,
    PRIMARY KEY (keyspace_name, table_name, column_name)
) WITH CLUSTERING ORDER BY (table_name ASC, column_name ASC);
so you can use a query like this to retrieve the data:
select type from system_schema.columns
where keyspace_name = 'your_ks' and table_name = 'users' and column_name = 'name';
In Cassandra 2.x, the structure of the system tables is different, so you may need to adapt your query.
If you're accessing the cluster programmatically, the driver hides the differences between Cassandra versions, and you can use something like the Metadata class from the Java driver to get information about a table's structure and the types of its columns. But if you're making schema changes programmatically, you must be careful and explicitly wait for schema agreement.
A related example: listing all counter columns in a given keyspace (note that ALLOW FILTERING triggers a scan, so use it sparingly):
select keyspace_name, table_name, column_name, type
from system_schema.columns
where type = 'counter' and keyspace_name = 'someName'
limit 100 allow filtering;

Presto possible table format values

According to the docs, when creating a table in Presto
CREATE TABLE orders (
    orderkey bigint,
    orderstatus varchar,
    totalprice double,
    orderdate date
)
WITH (format = 'ORC')
you can specify format = 'xxx'. Apart from 'ORC' I know there is TEXTFILE. I am curious what other options there are for the format. And is there a reason why you shouldn't use 'ORC' (I suppose it's the default)?
For the Hive connector, the supported file formats are listed in the Hive connector documentation.
ORC is not the default (the hive.storage-format connector configuration property governs the default format when none is specified in CREATE TABLE, and that setting currently defaults to RCBINARY), though it's generally a recommendable choice.
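For reference, the values documented at the time of writing include ORC, PARQUET, AVRO, RCBINARY, RCTEXT, SEQUENCEFILE, JSON, TEXTFILE, and CSV; the exact set depends on your Presto version, so check the documentation for yours. For example, to create the same table as Parquet:

CREATE TABLE orders (
    orderkey bigint,
    orderstatus varchar,
    totalprice double,
    orderdate date
)
WITH (format = 'PARQUET')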

Apache Ignite with Apache Cassandra

I am exploring Apache Ignite on top of Cassandra as a possible tool for running ad-hoc queries on Cassandra tables. Using Ignite, is it possible to search or query on any column in the underlying Cassandra tables, like in an RDBMS? Or can the join columns and search columns only be partition and clustering columns?
If using Ignite, is there still a need to create indexes on Cassandra? Also, how does Ignite treat materialized views? Will there be a need to create materialized views?
Any insights into how updates to Cassandra releases can/will be handled by Ignite would also be very helpful.
I will elaborate my question further:
Customer table:
CREATE TABLE customer (
    customer_id INT,
    joined_date date,
    name text,
    address TEXT,
    is_active boolean,
    created_by text,
    updated_by text,
    last_updated timestamp,
    PRIMARY KEY(customer_id, joined_date)
);
Product table:
CREATE TABLE PDT_BY_ID (
    device_id uuid,
    desc text,
    serial_number text,
    common_name text,
    customer_id int,
    manu_name text,
    last_updated timestamp,
    model_number text,
    price double,
    PRIMARY KEY((device_id), serial_number)
) WITH CLUSTERING ORDER BY (serial_number ASC);
A join is possible on these tables using Apache Ignite. But is a join possible on non-primary-key columns?
Is it possible, for example, to issue queries on the product table like 'where customer_id = ... AND model_number LIKE '%ABC%''?
Is it possible to issue RDBMS-like queries where one can put conditions on any column, and run ad-hoc queries on the tables?
This is discussed on Apache Ignite forum: http://apache-ignite-users.70518.x6.nabble.com/Newbie-Questions-on-Ignite-over-cassandra-td10264.html
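As a hedged illustration of what is discussed there: if the Cassandra-backed caches declare these columns as query fields, Ignite SQL can filter and join on non-key columns, along these lines (table and column names taken from the question; join collocation or distributed joins would still need to be considered):

SELECT p.serial_number, p.model_number, c.name
FROM PDT_BY_ID p
JOIN customer c ON c.customer_id = p.customer_id
WHERE p.model_number LIKE '%ABC%';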

Load data into Cassandra denormalized table

I understand that since Cassandra does not support joins, we sometimes need to create denormalized tables.
Given that I need to get the item names for each item within an order, given an order id, I create a table using:
CREATE TABLE order (
    order_id int,
    item_id int,
    item_name,
    primary key ((id), item_id)
);
I have two CSV files to load data from, order.csv and item.csv, where order.csv contains order_id and item_id, and item.csv contains item_id and item_name.
The question is how to load the data from the CSV files into the table I created. I insert the data from the order file first and it works fine. When I then insert the items, it throws an error about a missing primary key.
Any idea how I can insert data from different input files into the denormalized table? Thanks.
There is a typo in the definition of the primary key, and item_name needs a type (note also that order is a reserved word in CQL, so consider naming the table orders or quoting it). It should be:
CREATE TABLE order (
    order_id int,
    item_id int,
    item_name text,
    primary key (order_id, item_id)
);
Are you using COPY to upload the data?
Regarding the denormalization, that depends on your use case. In a normalized schema you usually have one table for orders and another for customers, and you do a SQL join to display information about the order and the customer at the same time; in a denormalized table you have the order and the customer information in the same table, and the fields depend on how you are going to query it.
As a rule of thumb, before creating the table, you first need to define the queries you are going to run.
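For reference, a minimal cqlsh COPY sketch (column names from the question; merged.csv is a hypothetical file produced by joining order.csv and item.csv outside Cassandra, e.g. with a small script, since rows from item.csv alone lack the order_id part of the primary key):

-- merged.csv is assumed to contain order_id,item_id,item_name with a header row
COPY order (order_id, item_id, item_name) FROM 'merged.csv' WITH HEADER = true;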
Using a secondary index on your item_id should do the trick:
CREATE INDEX idx_item_id ON order (item_id);
Now you should be able to query like:
SELECT * FROM order WHERE item_id = ?;
Beware that indexes usually have performance impacts, so you can use them to import your data, and drop them when finished.
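For example, dropping the index again once the import is done:

DROP INDEX idx_item_id;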
Please refer to the Cassandra Index Documentation for further information.
