I have loaded a table from an input file.
CREATE TABLE MyTable (
ID INT,
VALUE FLOAT,
RATE INT
...
LOAD DATA LOCAL INPATH 'MYPATH' INTO TABLE MyTable;
Now I'd like to create a new based on this one
DerivedTable =
SELECT ID, VALUE*RATE AS Total
FROM MyTable
WHERE VALUE IS NOT NULL;
Then I'm going to use this table as a source for other tables and for outputs.
What is a correct Sql (or Hive) way to create this "temporary" table? This should work in spark-sql?
PS: I know how to do that in spark-shell. But that is not what I'm looking for.
You can:
CREATE TEMPORARY VIEW DerivedTable AS (
SELECT ID, VALUE*RATE AS Total
FROM MyTable
WHERE VALUE IS NOT NULL);
Related
I have a dev machine with Cassandra 3.9 and 2 tables, one has ~~ 400,000 records, another about 40,000,000 records. Their structures are different.
Each of them has a secondary index on a field x, and I'm trying to run a query of the form SELECT MAX(x) FROM table. On the first table, the query takes a couple of seconds, and on the second table, it times out.
My experience is with relational databases where these queries are trivial and fast. So in Cassandra, it looks like the index isn't used to execute these queries? Is there an alternative?
In cassandra aggregation functions such as MIN, MAX, COUNT, SUM or AVG on a table without specifing a partition key is a bad practice. instead, you can have an other table that store the max value of x field for both tables.
However, you have to add some client side logic to maintain this max value in the other table when you run INSERT or UPDATE statements.
Tables structures :
CREATE TABLE t1 (
pk text PRIMARY KEY,
x int
);
CREATE TABLE t2 (
pk text PRIMARY KEY,
x int
);
CREATE TABLE agg_table (
table_name text PRIMARY KEY,
max_value int
);
So with this structure you can have the max value for a table :
SELECT max_value
FROM agg_table
WHERE table_name = 't1';
Hope this can help you.
There is a number of Cassandra built-in functions, for example now() or uuid(). Is it possible to call those functions without a SELECT operation, using CQL? So far I have to do
SELECT count(*), uuid() from table;
Where table is a table that's always empty.
Is there a better way?
Unfortunately no, you cannot call functions like uuid() and now() without executing a query/upsert. But I do have a way to keep you from having to maintain an empty table.
SELECT uuid() FROM system.local;
system.local will:
A) always be there.
and
B) only ever contain a single row.
It's similar to what you're doing now, but again, it prevents you from having to maintain an empty table just to gen-up a UUID.
Upsert example
If I have a table like this:
CREATE TABLE timetest (
yearmonth TEXT,
id UUID,
value TEXT,
PRIMARY KEY (yearmonth, id));
I can INSERT to it and gen a new UUID like this:
INSERT INTO timetest (yearmonth,id,value)
VALUES ('201601',uuid(),'v1');
I have a cassandra table defined like this:
CREATE TABLE test.test(
id text,
time bigint,
tag text,
mstatus boolean,
lonumb int,
PRIMARY KEY (id, time, tag)
)
And I want to select one column using select.
I tried:
select * from test where lonumb = 4231;
It gives:
code=2200 [Invalid query] message="No indexed columns present in by-columns clause with Equal operator"
Also I cannot do
select * from test where mstatus = true;
Doesn't cassandra support where as a part of CQL? How to correct this?
You can only use WHERE on the indexed or primary key columns. To correct your issue you will need to create an index.
CREATE INDEX iname
ON keyspacename.tablename(columname)
You can see more info here.
But you have to keep in mind that this query will have to run against all nodes in the cluster.
Alternatively you might rethink your table structure if the lonumb is something you'll do the most queries on.
Jny is correct in that WHERE is only valid on columns in the PRIMARY KEY, or those where a secondary index has been created for. One way to solve this issue is to create a specific query table for lonumb queries.
CREATE TABLE test.testbylonumb(
id text,
time bigint,
tag text,
mstatus boolean,
lonumb int,
PRIMARY KEY (lonumb, time, id)
)
Now, this query will work:
select * from testbylonumb where lonumb = 4231;
It will return all CQL rows where lonumb = 4231, sorted by time. I put id on the PRIMARY KEY to ensure uniqueness.
select * from test where mstatus = true;
This one is trickier. Indexes and keys on low-cardinality columns (like booleans) are generally considered a bad idea. See if there's another way you could model that. Otherwise, you could experiment with a secondary index on mstatus, but only use it when you specify a partition key (lonumb in this case), like this:
select * from testbylonumb where lonumb = 4231 AND mstatus = true;
Maybe that wouldn't perform too badly, as you are restricting it to a specific partition. But I definitely wouldn't ever do a SELECT * on mstatus.
Using Cassandra 2.0.9 CQL, how does one query for rows that don't have a particular column defined? For example:
create table testtable ( id int primary key, thing int );
create index on testtable ( thing );
# can now select rows by thing
insert into testtable( id, thing ) values ( 100, 100 );
# row values will be persistent
update testtable using TTL 30 set thing=1 where id=100;
# wait 30 seconds, thing column will go away for the row
select * from testtable;
Ideally I'd like to be able to do something like this:
select * from testtable where NOT DEFINED thing;
or some such and have the row with the id==1 returned. Is there any way to search for rows that do not have a particular column value assigned?
I'm afraid I've been through the Datastax 2.0 manual, as well as the CQLSH help with no luck trying to find an operator or syntax for this. Thanks.
Doesn't appear to be available yet
https://issues.apache.org/jira/browse/CASSANDRA-3783
We are trying to store lots of attributes for a particular profile_id inside a table (using CQL3) and cannot wrap our heads around which approach is the best:
a. create table mytable (profile_id, a1 int, a2 int, a3 int, a4 int ... a3000 int) primary key (profile_id);
OR
b. create MANY tables, eg.
create table mytable_a1(profile_id, value int) primary key (profile_id);
create table mytable_a2(profile_id, value int) primary key (profile_id);
...
create table mytable_a3000(profile_id, value int) primary key (profile_id);
OR
c. create table mytable (profile_id, a_all text) primary key (profile_id);
and just store 3000 "columns" inside a_all, like:
insert into mytable (profile_id, a_all) values (1, "a1:1,a2:5,a3:55, .... a3000:5");
OR
d. none of the above
The type of query we would be running on this table:
select * from mytable where profile_id in (1,2,3,4,5423,44)
We tried the first approach and the queries keep timing out and sometimes even kill cassandra nodes.
The answer would be to use a clustering column. A clustering column allows you to create dynamic columns that you could use to hold the attribute name (col name) and it's value (col value).
The table would be
create table mytable (
profile_id text,
attr_name text,
attr_value int,
PRIMARY KEY(profile_id, attr_name)
)
This allows you to add inserts like
insert into mytable (profile_id, attr_name, attr_value) values ('131', 'a1', 3);
insert into mytable (profile_id, attr_name, attr_value) values ('131', 'a2', 1031);
.....
insert into mytable (profile_id, attr_name, attr_value) values ('131', 'an', 2);
This would be the optimal solution.
Because you then want to do the following
'The type of query we would be running on this table: select * from mytable where profile_id in (1,2,3,4,5423,44)'
This would require 6 queries under the hood but cassandra should be able to do this in no time especially if you have a multi node cluster.
Also if you use the DataStax Java Driver you can run this requests asynchronously and concurrently on your cluster.
For more on data modelling and the DataStax Java Driver check out DataStax's free online training. Its worth a look
http://www.datastax.com/what-we-offer/products-services/training/virtual-training
Hope it helps.