Does Trino (formerly Presto) INSERT work with CTEs?

I am trying to insert rows into a table with a query of this form:
with my_CTE as
(SELECT a,b,c
FROM my_source_table
WHERE <some conditions to apply>)
INSERT INTO my_destination_table
SELECT a, b, c
FROM my_CTE;
This error appears:
mismatched input 'INSERT'. Expecting: '(', 'SELECT', 'TABLE', 'VALUES'
I know the CTE is written correctly because I can do other things with it.
I know the table exists with the columns the CTE produces, because I can write to it with the query shown above in Hive.
Does Trino not support using a CTE together with INSERT as shown above?
I have reasons for wanting to use Trino instead of Hive.

I had the order of the CTE and the INSERT backwards: in Trino, the WITH clause is part of the query that follows INSERT INTO, so it must come after the INSERT rather than before it.
This works:
INSERT INTO my_destination_table
with my_CTE as
(SELECT a,b,c
FROM my_source_table
WHERE <some conditions to apply>)
SELECT a, b, c
FROM my_CTE;
See the previous Stack Exchange Q&A in my comment above.

Related

SparkSQL cannot find existing rows in a Hive table that Trino (Presto) does

We are using Trino (Presto) and SparkSQL to query Hive tables on S3, but they give different results for the same query on the same tables. We have narrowed down the main problem: there are rows in a problematic Hive table that can be found with a simple WHERE filter on a specific column in Trino, but cannot be found with SparkSQL. The SQL statements are identical in both engines.
On the other hand, SparkSQL can find these rows in the source table of that problematic table, filtering on the same column.
The CREATE TABLE AS SELECT (CTAS) statement:
CREATE TABLE problematic_hive_table AS SELECT c1,c2,c3 FROM source_table
The SELECT that finds the missing rows in Trino but not in SparkSQL:
SELECT * FROM problematic_hive_table WHERE c1='missing_rows_value_in_column'
And this is the SELECT with which SparkSQL does find these rows, against the source table:
SELECT * FROM source_table WHERE c1='missing_rows_value_in_column'
We execute the CTAS in Trino (Presto). If we use ...WHERE trim(c1) = 'missing_key', then Spark can also find the missing rows, yet the fields do not appear to contain trailing spaces (the lengths of these fields are the same in the source table as in the problematic table). In the source table, Spark can find these missing rows without trim.
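For anyone reproducing this diagnosis, a byte-level comparison makes invisible characters visible. A minimal sketch, reusing the table and column names above (to_hex/to_utf8 are Trino built-ins; hex/encode are the SparkSQL equivalents):
-- Trino: show the length and raw bytes of the column
SELECT c1, length(c1), to_hex(to_utf8(c1))
FROM problematic_hive_table
WHERE trim(c1) = 'missing_rows_value_in_column';
-- SparkSQL: same idea
SELECT c1, length(c1), hex(encode(c1, 'UTF-8'))
FROM problematic_hive_table
WHERE trim(c1) = 'missing_rows_value_in_column';
If the hex dumps differ between the engines, or show bytes beyond the visible characters (e.g. non-breaking spaces), that would explain why trim() changes the result.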

Spark SQL Insert Select with a column list?

As I read the Spark/Hive SQL documentation, it appears that INSERT INTO a table with a column list is not supported in Spark 2.4 and earlier versions.
I have a source table and a destination table with different number of columns and different column names which I need to copy.
Does this mean I have to code this in PySpark, since Spark SQL will not be able to do it?
Example:
input_table( cola, colb, colc, cold, cole)
output_table(fieldx, fieldy, fieldz)
In SQL (assuming RDBMS such as MS-SQL, PostgreSQL etc) I would do the following:
insert into output_table(fieldx, fieldy, fieldz) select cola, colb, colc from input_table
Spark SQL does not allow this; it does not accept a column list in the INSERT statement.
Question: how can I do this with a minimum of code and maximum performance, in either PySpark or (ideally) Spark SQL (I am using Spark 2.4)?
Thank you.
Specify the columns of output_table that won't be copied from input_table as NULL in the SELECT. (This is what would happen anyway if inserting only a subset of columns via a column list were allowed.) Spark matches the SELECT list to the target columns by position:
insert into output_table
select cola, colb, colc -- mapped by position to fieldx, fieldy, fieldz; pad any non-copied target columns with null, e.g. null as other1
from input_table
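As an aside, my recollection is that this restriction was lifted later: Spark 3.1 added column-list support in INSERT (SPARK-32976), so on newer versions the statement from the question should work as written (verify against your version's docs):
insert into output_table (fieldx, fieldy, fieldz)
select cola, colb, colc
from input_table
On 2.4, the positional SELECT above is the way to go.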

Does OR give better performance than IN for Hive queries?

I have the following two queries in Hive that produce the same result.
select * from table1 where col1 IN (a, b, c)
select * from table1 where col1=a OR col1=b OR col1=c
As per my understanding, IN will be converted internally to a sequence of ORs.
I executed both locally in spark-sql and did not find any performance difference (execution time, amount of data scanned, etc.).
So what difference, if any, is there between IN and OR in terms of functionality?
Any help will be appreciated.
col1 IN (a, b, c) is a macro that expands to col1 = a OR col1 = b OR col1 = c.
There is no performance difference.
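You can check this yourself by comparing the plans; both statements should compile down to the same filter predicate. A quick sketch (EXPLAIN works in both Hive and spark-sql; the table and values are the placeholders from the question):
EXPLAIN select * from table1 where col1 IN (a, b, c);
EXPLAIN select * from table1 where col1=a OR col1=b OR col1=c;
If the two plans print an identical predicate, the rewrite has happened and there is nothing left to choose between them.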

U-SQL CREATE TABLE statement failing

I'm trying to create a U-SQL table from two tables using CREATE TABLE AS SELECT (CTAS), as below:
DROP TABLE IF EXISTS tpch_query2_result;
CREATE TABLE tpch_query2_result
(
INDEX idx_query2
CLUSTERED(P_PARTKEY ASC)
DISTRIBUTED BY HASH(P_PARTKEY)
) AS
SELECT
a.P_PARTKEY
FROM part AS a INNER JOIN partsupp AS b ON a.P_PARTKEY == b.PS_PARTKEY;
But while running the U-SQL script I'm getting the error below:
E_CSC_USER_QUALIFIEDCOLUMNNOTFOUND: Column 'P_PARTKEY' not found in rowset 'a'.
Line 11
E_CSC_USER_QUALIFIEDCOLUMNNOTFOUND: Column 'PS_PARTKEY' not found in rowset 'b'.
I'm not sure what is causing this. Can someone provide some insight into this error? Thanks.
The error normally indicates that the specified column does not exist in the rowset referenced by a (i.e., part) or b (i.e., partsupp). What is the schema of each of these tables? Do they have columns with the expected names? A quick probe is sketched below.
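One way to check is to compile a probe script that projects just the suspect column; if the schema lacks it, the compiler fails with the same error. A minimal sketch (the output path is made up):
@probe =
    SELECT P_PARTKEY FROM part;
OUTPUT @probe TO "/out/part_probe.csv" USING Outputters.Csv();
Repeat with PS_PARTKEY against partsupp. If either probe fails, compare the failing name against the table's declared columns, and watch for case differences, since U-SQL identifiers are case-sensitive.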

Bad Request: No indexed columns present in by-columns clause with Equal operator: CQL error?

I have below table in CQL-
create table test (
employee_id text,
employee_name text,
value text,
last_modified_date timeuuid,
primary key (employee_id)
);
I inserted a couple of records into the table, similar to what we will be inserting in our actual use case:
insert into test (employee_id, employee_name, value, last_modified_date) values ('1', 'e27', 'some_value', now());
insert into test (employee_id, employee_name, value, last_modified_date) values ('2', 'e27', 'some_new_value', now());
insert into test (employee_id, employee_name, value, last_modified_date) values ('3', 'e27', 'some_again_value', now());
insert into test (employee_id, employee_name, value, last_modified_date) values ('4', 'e28', 'some_values', now());
insert into test (employee_id, employee_name, value, last_modified_date) values ('5', 'e28', 'some_new_values', now());
Now I was writing a select query for: give me all the employee_id values for employee_name 'e27'.
select employee_id from test where employee_name = 'e27';
And this is the error I am getting -
Bad Request: No indexed columns present in by-columns clause with Equal operator
Perhaps you meant to use CQL 2? Try using the -2 option when starting cqlsh.
Is there anything I am doing wrong here?
My use cases in general are:
Give me everything for a given employee_name.
Give me everything that has changed in the last 5 minutes.
Give me the latest employee_id and value for a given employee_name.
Give me all the employee_id values for a given employee_name.
I am running Cassandra 1.2.11
The general rule is simple: "you can only query by columns that are part of the key". The explanation: any other query would require a complete scan of the table, which could mean sifting through a lot of data.
There are things that can modify this rule (sketches follow the list):
use secondary indexes for columns with low cardinality (more details here)
define multi-column keys (e.g. PRIMARY KEY (col1, col2), which allows queries like col1 = value1, or col1 = value1 AND col2 <cond>)
use ALLOW FILTERING in queries. This will result in a warning, as Cassandra will have to sift through a lot of data with no performance guarantees. For more details see ALLOW FILTERING in CQL and this SO thread.
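A minimal sketch of the last two options, reusing the table from the question (the new table name is made up):
-- multi-column (compound) primary key: employee_name becomes the partition key
create table test_by_name (
employee_name text,
employee_id text,
value text,
last_modified_date timeuuid,
primary key (employee_name, employee_id)
);
select employee_id from test_by_name where employee_name = 'e27';
-- or, accepting a full scan explicitly (supported on newer Cassandra versions)
select employee_id from test where employee_name = 'e27' ALLOW FILTERING;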
Cassandra takes a little getting used to :) Some of us have been spoiled by the extras an RDBMS does for you that you do not get for free from NoSQL.
Think back on a regular RDBMS table: if you SELECT on a column that has no index, the DB must do a full-table scan to find all the matches you seek. This is a no-no in Cassandra, and it will complain if you try to do it. Imagine if the query matched 10^32 rows; it is not a reasonable ask.
In your table, you have declared PRIMARY KEY (employee_id); this is the row's primary and unique identifying key. You can now SELECT * FROM test WHERE employee_id = '123'; this is perfectly reasonable and Cassandra will happily return the result.
However, your SELECT ... FROM test WHERE employee_name = 'e27' tells Cassandra to read EVERY record looking for matches on 'e27'. With no index to rely on, it politely asks you to 'forget it'.
If you want to filter on a column, make sure you have an index on that column so that Cassandra can perform the filtering you need.
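A minimal sketch of that fix for the table in the question (the index name is my own):
create index idx_employee_name on test (employee_name);
select employee_id from test where employee_name = 'e27';
Note that secondary indexes work best on low-cardinality columns, as the first answer points out; for the heaviest queries, a purpose-built table keyed by employee_name is usually the better design.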
