Reference to a field of a row object - Presto

I'm having trouble accessing the fields of row objects that I have created in Presto. The Presto documentation claims that "fields ... are accessed with field reference operator", but that doesn't seem to work. This code reproduces the problem:
CREATE TABLE IF NOT EXISTS data AS
SELECT * FROM (VALUES
    (1, 'Adam', 17),
    (2, 'Bill', 42)
) AS x (id, name, age);

CREATE TABLE IF NOT EXISTS ungrouped_data AS
WITH grouped_data AS (
    SELECT
        id,
        ROW(name, age) AS name_age
    FROM data
)
SELECT
    id,
    name_age.1 AS name,
    name_age.2 AS age
FROM grouped_data;
This returns an "extraneous input '.1'" error.

Starting with Trino (formerly known as PrestoSQL) 314, it is now possible to reference ROW fields using the [] subscript operator.
WITH grouped_data AS (
    SELECT
        id,
        ROW(name, age) AS name_age
    FROM data
)
SELECT
    id,
    name_age[1] AS name,
    name_age[2] AS age
FROM grouped_data;

ROW(name, age) creates a row without field names. Currently, to access the fields of such a row, you need to cast it to a row type with field names. Try this:
WITH grouped_data AS (
    SELECT
        id,
        CAST(ROW(name, age) AS ROW(col1 VARCHAR, col2 INTEGER)) AS name_age
    FROM data
)
SELECT
    id,
    name_age.col1 AS name,
    name_age.col2 AS age
FROM grouped_data;
Result:
 id | name | age
----+------+-----
  1 | Adam |  17
  2 | Bill |  42
See https://github.com/prestodb/presto/issues/7640 for discussions on this.
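If you are on version 314 or later, you can combine both answers: cast to a named ROW type when building the table, and then dot notation and the subscript operator both work downstream. A minimal sketch (the table name named_grouped_data is just for illustration):
CREATE TABLE IF NOT EXISTS named_grouped_data AS
SELECT
    id,
    CAST(ROW(name, age) AS ROW(name VARCHAR, age INTEGER)) AS name_age
FROM data;

SELECT
    id,
    name_age.name AS name, -- dot notation on the named field
    name_age[2] AS age     -- 1-based subscript operator (314+)
FROM named_grouped_data;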

Related

Get value from specific map-key in Cassandra

For example, I have a map under the column 'users' in a table called 'table' with primary key 'Id'.
If the map looks like this, {'Phone': '1234567899', 'City': 'Dublin'}, I want to get the value for the key 'Phone' for a specific 'Id' in the Cassandra database.
Yes, that's possible to do with CQL when using a MAP collection.
To test this, I created a simple table using the specifications and data you mentioned above:
> CREATE TABLE stackoverflow.usermap (
      id text PRIMARY KEY,
      users map<text, text>);
> INSERT INTO usermap (id, users)
  VALUES ('1a', {'Phone': '1234567899', 'City': 'Dublin'});
> SELECT * FROM usermap WHERE id='1a';

 id | users
----+--------------------------------------------
 1a | {'City': 'Dublin', 'Phone': '1234567899'}

(1 rows)
Then I queried with the same WHERE clause, but altered my SELECT to pull back only the user's phone:
> SELECT users['Phone'] FROM usermap WHERE id='1a';
users['Phone']
----------------
1234567899
(1 rows)
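For completeness, individual map entries can also be written or removed by key in the same style (a small sketch against the same usermap table):
> UPDATE usermap SET users['Phone'] = '0987654321' WHERE id='1a';
> DELETE users['City'] FROM usermap WHERE id='1a';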

Cassandra migrate int to bigint

What would be the easiest way to migrate an int to a bigint in Cassandra? I thought of creating a new column of type bigint and then running a script to basically set the value of that column = the value of the int column for all rows, and then dropping the original column and renaming the new column. However, I'd like to know if someone has a better alternative, because this approach just doesn't sit quite right with me.
You could ALTER your table and change your int column to a varint type. Check the documentation about ALTER TABLE, and the data types compatibility matrix.
The only other alternative is what you said: add a new column and populate it row by row. Dropping the original column is entirely optional: if you don't assign values to it on insert, everything will stay as it is, and new records won't consume space for it.
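A minimal sketch of that add-and-backfill approach (mytable, id, value, and value_big are hypothetical names; CQL cannot copy one column into another, so the per-row copy has to be driven by an external script):
ALTER TABLE mytable ADD value_big bigint;
-- for each row, from the script:
--   UPDATE mytable SET value_big = <old int value> WHERE id = <row id>;
-- optionally, once backfilled:
ALTER TABLE mytable DROP value;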
You can ALTER your table in Cassandra to store values beyond the int range by using varint. See the example:
cassandra@cqlsh:demo> CREATE TABLE int_test (id int, name text, PRIMARY KEY (id));
cassandra@cqlsh:demo> SELECT * FROM int_test;

 id | name
----+------

(0 rows)
cassandra@cqlsh:demo> INSERT INTO int_test (id, name) VALUES (2147483647, 'abc');
cassandra@cqlsh:demo> SELECT * FROM int_test;

 id         | name
------------+------
 2147483647 |  abc

(1 rows)
cassandra@cqlsh:demo> ALTER TABLE demo.int_test ALTER id TYPE varint;
cassandra@cqlsh:demo> INSERT INTO int_test (id, name) VALUES (9999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999, 'abcd');
cassandra@cqlsh:demo> SELECT * FROM int_test;

 id                                                                                                                           | name
------------------------------------------------------------------------------------------------------------------------------+------
                                                                                                                   2147483647 |  abc
 9999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999 | abcd

(2 rows)
cassandra@cqlsh:demo>

Not getting exact output using user-defined data types in Cassandra

In Cassandra, I created a user-defined data type:
cqlsh:test> create type fullname ( firstname text, lastname text );
And I created a table with that data type and inserted into it like this:
cqlsh:test> create table people ( id UUID primary key, names set<frozen<fullname>> );
cqlsh:test> insert into people (id, names) values (
... now(),
... {{firstname: 'Jim', lastname: 'Jones'}}
... );
When I query the table, I get output with some additional values, like this:
cqlsh:test> SELECT * from people ;
id | names
--------------------------------------+--------------------------------------------
3a59e2e0-14df-11e5-8999-abcdb7df22fc | {\x00\x00\x00\x03Jim\x00\x00\x00\x05Jones}
How can I get output like this?
select * from people;
id | names
--------------------------------------+-----------------------------------------
69ba9d60-a06b-11e4-9923-0fa29ba414fb | {{firstname: 'Jim', lastname: 'Jones'}}

How to construct a range query in Cassandra?

CREATE TABLE users (
    userID uuid,
    firstname text,
    lastname text,
    state text,
    zip int,
    age int,
    PRIMARY KEY (userID)
);
I want to construct the following queries:
select * from users where age between 30 and 40
select * from users where state IN ('AZ', 'WA')
I know I need two more tables to do these queries, but I don't know how they should be defined.
EDIT
From Carlo's comments, I see this is the only possibility
CREATE TABLE users (
    userID uuid,
    firstname text,
    lastname text,
    state text,
    zip int,
    age int,
    PRIMARY KEY (age, zip, userID)
);
Now, to select users with age between 15 and 30, this is the only possibility:
select * from users where age IN (15,16,17,....30)
However, using the IN operator here is not recommended and is an anti-pattern.
How about creating a secondary index on age?
CREATE INDEX users_age ON users (age)
Will this help?
Thanks
Range queries are a prickly subject.
The way to perform a real range query is to use a compound primary key and put the range on the clustering part. Since the range is on the clustering part, you can't perform the queries you wrote: you need at least an equality condition on the whole partition key.
Let's see an example:
CREATE TABLE users (
    mainland text,
    state text,
    uid int,
    name text,
    zip int,
    PRIMARY KEY ((mainland), state, uid)
);
The uid is an int here just to make testing easier.
insert into users (mainland, state, uid, name, zip) VALUES ( 'northamerica', 'washington', 1, 'john', 98100);
insert into users (mainland, state, uid, name, zip) VALUES ( 'northamerica', 'texas', 2, 'lukas', 75000);
insert into users (mainland, state, uid, name, zip) VALUES ( 'northamerica', 'delaware', 3, 'henry', 19904);
insert into users (mainland, state, uid, name, zip) VALUES ( 'northamerica', 'delaware', 4, 'dawson', 19910);
insert into users (mainland, state, uid, name, zip) VALUES ( 'centraleurope', 'italy', 5, 'fabio', 20150);
insert into users (mainland, state, uid, name, zip) VALUES ( 'southamerica', 'argentina', 6, 'alex', 10840);
Now the query can perform what you need:
select * from users where mainland = 'northamerica' and state > 'ca' and state < 'ny';
Output:
 mainland     | state    | uid | name   | zip
--------------+----------+-----+--------+-------
 northamerica | delaware |   3 | henry  | 19904
 northamerica | delaware |   4 | dawson | 19910
If you put an int (age, zip code) as the first column of the clustering key, you can perform the same queries comparing integers.
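A hedged sketch of that variant (users_by_age is a hypothetical table, not one from the answer): with age as the first clustering column, an integer range works inside a partition.
CREATE TABLE users_by_age (
    mainland text,
    age int,
    uid int,
    name text,
    PRIMARY KEY ((mainland), age, uid)
);

SELECT * FROM users_by_age
WHERE mainland = 'northamerica' AND age > 30 AND age < 40;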
TAKE CARE: most people looking at this situation start thinking "OK, I can put a fake partition key that is always the same, and then I can perform range queries." This is a huge error: the partition key is responsible for data distribution across nodes. Setting a fixed partition key means that all data will end up on the same node (and on its replicas).
Dividing the world into 15-20 zones (in order to have 15-20 partition keys) is something, but it is not enough; it is done here just to create a valid example.
EDIT (in response to the question's edit):
I did not say that this is the only possibility; if you can't find a valid way to partition your users and you need to perform this kind of query, this is one possibility, not the only one. Range queries should be performed on the clustering-key portion. A weak point of using the age as partition key is that you can't perform an UPDATE over it; any time you need to update a user's age, you have to perform a delete and an insert. (An alternative could be storing the birth_year/birth_date instead of the age and calculating the age client side, as sketched below.)
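A hedged sketch of that birth-year alternative (the table and column names are hypothetical): the stored value never changes, and the application converts the desired age range into a birth-year range at query time.
CREATE TABLE users_by_birth_year (
    mainland text,
    birth_year int,
    uid int,
    PRIMARY KEY ((mainland), birth_year, uid)
);

-- age between 30 and 40 in 2015 corresponds to birth_year between 1975 and 1985
SELECT * FROM users_by_birth_year
WHERE mainland = 'northamerica' AND birth_year >= 1975 AND birth_year <= 1985;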
To answer your question about adding a secondary index: queries on a secondary index actually do not support the IN operator. From the CQL error message, it looks like support is planned:
Bad Request: IN predicates on non-primary-key columns (xxx) is not yet supported
However, even if secondary indexes did support the IN operator, your query would not change from:
select * from users where age IN (15,16,17,....30)
Just to clarify my point: anything that does not have a "clean", "ready" solution requires effort from the user to model the data in a way that satisfies their needs. To give an example (I don't say this is a good solution: I would not use it):
CREATE TABLE users (
    years_range text,
    age int,
    uid int,
    PRIMARY KEY ((years_range), age, uid)
);
Put in some data:
insert into users (years_range, age , uid) VALUES ( '11_15', 14, 1);
insert into users (years_range, age , uid) VALUES ( '26_30', 28, 3);
insert into users (years_range, age , uid) VALUES ( '16_20', 16, 2);
insert into users (years_range, age , uid) VALUES ( '26_30', 29, 4);
insert into users (years_range, age , uid) VALUES ( '41_45', 41, 5);
insert into users (years_range, age , uid) VALUES ( '21_25', 23, 5);
Query the data:
select * from users where years_range in('11_15', '16_20', '21_25', '26_30') and age > 14 and age < 29;
Output:
 years_range | age | uid
-------------+-----+-----
       16_20 |  16 |   2
       21_25 |  23 |   5
       26_30 |  28 |   3
This solution might solve your problem and could be used in a small cluster, where about 20 keys (0_5 ... 106_110) might give a good distribution. But this solution, like the one before, does not allow an UPDATE and reduces the distribution of the keys. The advantage is that you have small IN sets.
In a perfect world where secondary indexes already allowed the IN clause, I'd use the UUID as partition key, the years_range (stored as birth_year_range) as a secondary index, and "filter" my data client side (if interested in 10 < age < 22, I would ask for IN('1991_1995', '1996_2000', '2001_2005', '2006_2010', '2011_2015'), calculating and removing the unneeded years in my application).
HTH,
Carlo
I found that using ALLOW FILTERING, I can query a range. Here is an example:
CREATE TABLE users2 (
    mainland text,
    state text,
    uid int,
    name text,
    age int,
    PRIMARY KEY (uid, age, state)
);
insert into users2 (mainland, state, uid, name, age) VALUES ( 'northamerica', 'washington', 1, 'john', 81);
insert into users2 (mainland, state, uid, name, age) VALUES ( 'northamerica', 'texas', 2, 'lukas', 75);
insert into users2 (mainland, state, uid, name, age) VALUES ( 'northamerica', 'delaware', 3, 'henry', 19);
insert into users2 (mainland, state, uid, name, age) VALUES ( 'northamerica', 'delaware', 4, 'dawson', 90);
insert into users2 (mainland, state, uid, name, age) VALUES ( 'centraleurope', 'italy', 5, 'fabio', 50);
insert into users2 (mainland, state, uid, name, age) VALUES ( 'southamerica', 'argentina', 6, 'alex', 40);
select * from users2 where age>50 and age<=100 allow filtering;
 uid | age | state      | mainland     | name
-----+-----+------------+--------------+--------
   1 |  81 | washington | northamerica |   john
   2 |  75 |      texas | northamerica |  lukas
   4 |  90 |   delaware | northamerica | dawson
(3 rows)
I am not sure whether this is a performance killer, but it seems to work. In fact, I don't even have to give the partition key (uid in this case) during query execution.
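A hedged aside on that last point: with PRIMARY KEY (uid, age, state), if you do restrict the partition key, the age range becomes an ordinary clustering-column query, so no ALLOW FILTERING (and no cluster-wide scan) is needed at all:
SELECT * FROM users2 WHERE uid = 1 AND age > 50 AND age <= 100;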

Mixing column types in Cassandra / wide rows

I am trying to learn how to implement a feed in Cassandra (think Twitter). I want to use wide rows to store all the posts made by a user.
I am thinking about adding user information or statistical information in the same row (number of posts, last post date, user name, etc.).
My question is: are the field names (name, age, etc.) stored per column? Or do those wide rows only store the column names and values actually specified? Am I wasting disk space? Am I compromising performance somehow?
Thanks!
-- TABLE CREATION
CREATE TABLE user_feed (
    owner_id int,
    name text,
    age int,
    posted_at timestamp,
    post_text text,
    PRIMARY KEY (owner_id, posted_at)
);
-- INSERTING THE USER
insert into user_feed (owner_id, name, age, posted_at) values (1, 'marc', 36, 0);
-- INSERTING USER POSTS
insert into user_feed (owner_id, posted_at, post_text) values (1, dateof(now()), 'first post!');
insert into user_feed (owner_id, posted_at, post_text) values (1, dateof(now()), 'hello there');
insert into user_feed (owner_id, posted_at, post_text) values (1, dateof(now()), 'i am kind of happy');
-- GETTING THE FEED
select * from user_feed where owner_id=1 and posted_at>0;
-- RESULT
owner_id | posted_at | age | name | post_text
----------+--------------------------+------+------+--------------------
1 | 2014-07-04 12:01:23+0000 | null | null | first post!
1 | 2014-07-04 12:01:23+0000 | null | null | hello there
1 | 2014-07-04 12:01:23+0000 | null | null | i am kind of happy
-- GETTING USER INFO - ONLY USER INFO IS POSTED_AT=0
select * from user_feed where owner_id=1 and posted_at=0;
-- RESULT
owner_id | posted_at | age | name | post_text
----------+--------------------------+------+------+--------------------
1 | 1970-01-01 00:00:00+0000 | 36 | marc | null
What about making them static?
A static column has the same value for every row in a partition, and since your partition key is the owner's id, you can avoid wasting space while still retrieving the user information in any query.
CREATE TABLE user_feed (
    owner_id int,
    name text static,
    age int static,
    posted_at timestamp,
    post_text text,
    PRIMARY KEY (owner_id, posted_at)
);
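A minimal usage sketch (assuming the table above): static columns can be written with just the partition key, and they come back on every row of the feed.
INSERT INTO user_feed (owner_id, name, age) VALUES (1, 'marc', 36);
INSERT INTO user_feed (owner_id, posted_at, post_text) VALUES (1, dateof(now()), 'first post!');

SELECT posted_at, name, age, post_text FROM user_feed WHERE owner_id = 1;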
Cheers,
Carlo
