Presto possible table format values

According to the docs, when creating a table in Presto
CREATE TABLE orders (
orderkey bigint,
orderstatus varchar,
totalprice double,
orderdate date
)
WITH (format = 'ORC')
you can specify format = 'xxx'. Apart from 'ORC' I know there is TEXTFILE. I am curious what other options there are for the format. And is there a reason why you shouldn't use 'ORC' (I suppose it's the default)?

For the Hive connector, the supported file formats are listed in the Hive connector documentation; they include ORC, PARQUET, AVRO, RCBINARY, RCTEXT, SEQUENCEFILE, JSON, TEXTFILE, and CSV.
ORC is not the default: the hive.storage-format connector configuration property governs the default format when none is specified in CREATE TABLE, and that setting currently defaults to RCBINARY. ORC is, however, generally the recommended choice.
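If you want ORC by default, you can set this per catalog. A minimal sketch, assuming the Hive catalog is configured in etc/catalog/hive.properties (the file name depends on how your catalog is named):

hive.storage-format=ORC

With that in place, CREATE TABLE statements that omit the format property will produce ORC tables.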

Related

How to create a partitioned Trino table on S3 (with sub-folders)

My s3 location has the below structure
s3://bucketname/snapshot/db_collection/snapshot1/*.parquet
s3://bucketname/snapshot/db_collection/snapshot2/*.parquet
s3://bucketname/snapshot/db_collection/snapshot3/*.parquet
What I want is
to be able to define the Trino table at the level s3://bucketname/snapshot/db_collection/, so that if I query for a row and it exists in 2 snapshots, I get 2 rows as output. I was not able to find out how to write a create table query for this use case (which is essentially a partitioning use case). Also note that the partition folder snapshotX is not in the <abc>=<efg> format.
Is there any tool or way to generate the table automatically from the Parquet files or the schema JSON file? I ask because my Parquet file has 150 columns, and each column is again nested, so writing the table definition by hand is not easy.
I tried running the AWS Glue crawler to generate the table and Athena for querying, but when I run a select query I get weird errors that scare me off, so I don't want to use this path.
My existing table definition is as follows
create table trino.db_collection (
col1 varchar,
col2 varchar,
col3 varchar
) with (
external_location = 's3a://bucket/trino/db_collection/*',
format = 'PARQUET'
)
My setup is AWS EMR 6.8.0 with trino-v388.
Regarding partitions:
As you mentioned, automatic partition discovery won't work because Trino looks for the Hive format col_name=value. As a best practice I would recommend running a one-time procedure to rename the keys; however, if this is not possible, you can still manually register partitions using the register_partition system procedure. It's just tedious to maintain.
system.register_partition(schema_name, table_name, partition_columns, partition_values, location)
Please note you'll also need to edit your installation config and enable this on the catalog properties file.
From the docs (https://trino.io/docs/current/connector/hive.html#procedures):
Due to security reasons, the procedure is enabled only when hive.allow-register-partition-procedure is set to true.
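A minimal sketch of enabling it, assuming the Hive catalog is configured in etc/catalog/hive.properties:

hive.allow-register-partition-procedure=true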
The partition column has to be the last one in your table schema, and the partitioned_by property must be defined in the table properties.
So in your example:
create table trino.db_collection (
col1 varchar,
col2 varchar,
col3 varchar,
snapshot varchar
) with (
external_location = 's3a://bucket/trino/db_collection/*',
format = 'PARQUET',
partitioned_by = ['snapshot']
)
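With the table in place, each snapshot folder can be registered. A sketch of one such call (the schema name default is an assumption; adjust to your catalog and schema):

CALL system.register_partition(
    schema_name => 'default',
    table_name => 'db_collection',
    partition_columns => ARRAY['snapshot'],
    partition_values => ARRAY['snapshot1'],
    location => 's3a://bucketname/snapshot/db_collection/snapshot1')

Repeat the call for snapshot2, snapshot3, and so on.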
Regarding inferring the table schema:
This is not supported in Trino but can be done with Spark or a Glue crawler. If you register the table in the Glue catalog, it can be read by Trino as well.
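For example, a sketch using Spark SQL, which infers the schema from the Parquet files when a datasource table is created over an existing location (the database and table names here are placeholders):

CREATE TABLE mydb.db_collection_tmp
USING PARQUET
LOCATION 's3://bucketname/snapshot/db_collection/snapshot1/';

SHOW CREATE TABLE mydb.db_collection_tmp;

The SHOW CREATE TABLE output contains the full inferred column list, which you can adapt for the Trino table definition.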
Can you share the errors you got when selecting?
The example above is missing the ARRAY keyword; partitioned_by expects an array literal:
create table trino.db_collection (
col1 varchar,
col2 varchar,
col3 varchar,
snapshot varchar
) with (
external_location = 's3a://bucket/trino/db_collection/*',
format = 'PARQUET',
partitioned_by = ARRAY['snapshot']
)

Databricks table metadata through JDBC driver

The Spark JDBC driver (SparkJDBC42.jar) is unable to capture certain information from the below table structure:
the table-level comment
the TBLPROPERTIES key-value pair information
the PARTITIONED BY information
However, it captures the column-level comment (e.g. the comment against the employee_number column), all columns of the employee table, and their technical data types.
Please advise if I need to configure any additional properties to be able to read/extract the information that the driver currently cannot.
create table default.employee(
employee_number INT COMMENT 'Unique identifier for an employee',
employee_name VARCHAR(50),
employee_age INT)
PARTITIONED BY (employee_age)
COMMENT 'this is a table level comment'
TBLPROPERTIES ('created.by.user' = 'Noor', 'created.date' = '10-08-2021');
You should be able to execute:
describe table extended default.employee
via the JDBC interface as well. In the first case it will return a table with 3 columns that you can parse into column-level and table-level properties. It shouldn't be very complex, as there are explicit delimiters between the column-level and table-level data.
You can also execute:
show create table default.employee
That will give you a table with one column containing the SQL statement, which you can parse.

Spark SQL ignoring dynamic partition filter value

Running into an issue on Spark 2.4 on EMR 5.20 in AWS.
I have a string column as a partition column, which holds date values. My goal is to have the max value of this column referenced as a filter. The values look like 2019-01-01 for January 1st, 2019.
In this query, I am trying to filter to a certain date value (which is a string data type), and Spark ends up reading all directories, not just the resulting max(value).
spark.sql("select mypartitioncolumn, column1, column2 from mydatabase.mytable where mypartitioncolumn= (select max(mypartitioncolumn) from myothertable) group by 1,2,3 ").show
However, in this instance, If I hardcode the value, it only reads the proper directory.
spark.sql("select mypartitioncolumn, column1, column2 from mydatabase.mytable where mypartitioncolumn= '2019-01-01' group by 1,2,3 ").show
Why does Spark not treat both methods the same way? I made sure that if I run the select max(mypartitioncolumn) from myothertable query, it returns exactly the same value as my hardcoded method (as well as the same data type).
I can't find anything in the documentation that differentiates partition querying other than data type differences. I checked that the schema in both the source table and the value are string types, and also tried casting the value to a string, cast((select max(mypartitioncolumn) from myothertable) as string); it doesn't make any difference.
Workaround by changing configuration
sql("set spark.sql.hive.convertMetastoreParquet = false")
From the Spark docs:
"When reading from and writing to Hive metastore Parquet tables, Spark SQL will try to use its own Parquet support instead of Hive SerDe for better performance. This behavior is controlled by the spark.sql.hive.convertMetastoreParquet configuration, and is turned on by default."

Unable to coerce to a formatted date - Cassandra timestamp type

I have the values stored for timestamp type column in cassandra table in format of
2018-10-27 11:36:37.950000+0000 (GMT date).
I get Unable to coerce '2018-10-27 11:36:37.950000+0000' to a formatted date (long) when I run below query to get data.
select create_date from test_table where create_date='2018-10-27 11:36:37.950000+0000' allow filtering;
How to get the query working if the data is already stored in the table (of format, 2018-10-27 11:36:37.950000+0000) and also perform range (>= or <=) operations on create_date column?
I tried with create_date='2018-10-27 11:36:37.95Z' and
create_date='2018-10-27 11:36:37.95' too.
Is it possible to perform filtering on this kind of timestamp type data?
P.S. Using cqlsh to run query on cassandra table.
In the first case, the problem is that you specify the timestamp with microseconds, while Cassandra operates with milliseconds; try removing the last three digits, i.e. .950 instead of .950000 (see this document for details). Timestamps are stored inside Cassandra as a 64-bit number and are formatted when printing results using the format specified by the datetimeformat option of cqlshrc (see doc). Dates without an explicit timezone require that a default timezone is specified in cqlshrc.
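For example, a sketch of the corrected queries (same table and column; the filtering caveats below still apply):

select create_date from test_table where create_date='2018-10-27 11:36:37.950+0000' allow filtering;
select create_date from test_table where create_date>='2018-10-27 00:00:00+0000' and create_date<'2018-10-28 00:00:00+0000' allow filtering;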
Regarding your question about filtering the data: this query will work only for small amounts of data; on bigger data sets it will most probably time out, as it needs to scan all data in the cluster. Also, the data won't be sorted correctly, because sorting happens only inside a single partition.
If you want to perform such queries, then the Spark Cassandra Connector may be the better choice, as it can efficiently select the required data, and you can then perform sorting, etc., although this will require many more resources.
I recommend taking the DS220 course from DataStax Academy to understand how to model data for Cassandra.
This works for me:
var datetime = DateTime.UtcNow.ToString("yyyy-MM-dd HH:mm:ss"); // lowercase mm for minutes; MM would be the month
var query = $"SET updatedat = '{datetime}' WHERE ...

Apache Ignite with Apache Cassandra

I am exploring Apache Ignite on top of Cassandra as a possible tool for ad-hoc queries on Cassandra tables. Using Ignite, is it possible to search or query on any column in the underlying Cassandra tables, like in an RDBMS? Or can the join and search columns only be partition and clustering columns?
If using Ignite, is there still a need to create indexes in Cassandra? Also, how does Ignite treat materialized views? Will there be a need to create materialized views?
Also, any insight into how updates to Cassandra releases can/will be handled by Ignite would be very helpful.
I will elaborate my question further:
Customer table:
CREATE TABLE customer (
customer_id INT,
joined_date date,
name text,
address TEXT,
is_active boolean,
created_by text,
updated_by text,
last_updated timestamp,
PRIMARY KEY(customer_id, joined_date)
);
Product table:
CREATE TABLE PDT_BY_ID (
device_id uuid,
desc text,
serial_number text,
common_name text,
customer_id int,
manu_name text,
last_updated timestamp,
model_number text,
price double,
PRIMARY KEY((device_id), serial_number)
) WITH CLUSTERING ORDER BY (serial_number ASC);
A join is possible on these tables using Apache Ignite.
But is the join possible on non-primary keys?
Is it possible, for example, to issue queries on the product table like where customer_id = ... AND model_number LIKE '%ABC%'?
Is it possible to issue RDBMS-like queries where one can put conditions on any columns?
Run ad-hoc queries on the tables?
This is discussed on Apache Ignite forum: http://apache-ignite-users.70518.x6.nabble.com/Newbie-Questions-on-Ignite-over-cassandra-td10264.html
