We have a table:
CREATE TABLE table (
col1 text,
col2 text,
col3 timestamp,
col4 int,
col5 timestamp,
PRIMARY KEY (col1, col2, col3, col4)
) WITH CLUSTERING ORDER BY (col2 DESC, col3 DESC, col4 DESC);
When I try querying from this table like:
select * from table where col1 = 'something' and col3 < 'something'
and col4 = 12 limit 5 ALLOW FILTERING;
select * from table where col1 = 'something' and col4 < 23
and col3 >= 'something' ALLOW FILTERING;
I always get the error: Clustering column "col4" cannot be restricted (preceding column "col3" is restricted by a non-EQ relation).
I tried changing the clustering order of the table to col4, col3, col2, but then the second query fails and throws a similar error.
Any suggestion/advice on how to solve this problem?
We are on Cassandra 3.0.17.7.
You can use a non-equality condition only on the last clustering column restricted by the query.
For example, you can use col1 = val AND col2 <= ..., or col1 = val AND col2 = val2 AND col3 <= ..., or col1 = val AND col2 = val2 AND col3 = val3 AND col4 <= ..., but you can't put non-equality conditions on several clustering columns - that's a consequence of how Cassandra stores and reads data.
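For illustration, a quick sketch against the table above (placeholder values, not from the original answer) of which shapes are accepted:
-- Valid: equality on every clustering column preceding the range column.
select * from table where col1 = 'x' and col2 = 'y' and col3 < '2020-01-01';
select * from table where col1 = 'x' and col2 = 'y'
and col3 = '2020-01-01' and col4 <= 12;
-- Invalid: col4 is restricted while the preceding col3 has a non-EQ
-- relation, which is exactly the error above.
select * from table where col1 = 'x' and col3 < '2020-01-01' and col4 = 12;
If both query patterns are needed, the usual Cassandra approach is a second, query-specific table (e.g. clustering by col2, col4, col3) maintained alongside this one.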
I have a table that looks like this
select 'Alice' AS ID, 1 as col1, 3 as col2, -2 as col3, 9 as col4
union all
select 'Bob' AS ID, -9 as col1, 2 as col2, 5 as col3, -6 as col4
I would like to get the top 3 absolute values for each record across the four columns, and then format the output as a dictionary or STRUCT like below:
select
'Alice' AS ID, [STRUCT('col4' AS column, 9 AS value), STRUCT('col2',3), STRUCT('col3',-2)] output
union all
select
'Bob' AS ID, [STRUCT('col1' AS column, -9 AS value), STRUCT('col4',-6), STRUCT('col3',5)]
output
I would like it to be dynamic, so I want to avoid writing out the columns individually. It could go up to 100 columns, and they change.
For more context, I am trying to get the top three features from the batch local explanations output in Vertex AI
https://cloud.google.com/vertex-ai/docs/tabular-data/classification-regression/get-batch-predictions
I have looked up some examples and would like something similar to the second answer here: How to get max value of column values in a record? (BigQuery)
EDIT: the data is actually structured like this. If this is easier to work with, it would be a better option to start from:
select 'Alice' AS ID, STRUCT(1 as col1, 3 as col2, -2 as col3, 9 as col4) AS featureAttributions
union all
SELECT 'Bob' AS ID, STRUCT(-9 as col1, 2 as col2, 5 as col3, -6 as col4) AS featureAttributions
Consider the query below. UNPIVOT rotates col1 through col4 into (column, value) rows, and ARRAY_AGG(... ORDER BY ABS(value) DESC LIMIT 3) then keeps the top three per ID.
SELECT ID, ARRAY_AGG(STRUCT(column, value) ORDER BY ABS(value) DESC LIMIT 3) output
FROM (
SELECT * FROM sample_table UNPIVOT (value FOR column IN (col1, col2, col3, col4))
)
GROUP BY ID;
Query results
Dynamic Query
I would like it to be dynamic, so avoid writing out columns individually
You need dynamic SQL for this. Referring to the answer from @Mikhail that you linked in the post, you can write a dynamic query like the one below.
EXECUTE IMMEDIATE FORMAT("""
SELECT ID, ARRAY_AGG(STRUCT(column, value) ORDER BY ABS(value) DESC LIMIT 3) output
FROM (
SELECT * FROM sample_table UNPIVOT (value FOR column IN (%s))
)
GROUP BY ID
""", ARRAY_TO_STRING(
REGEXP_EXTRACT_ALL(TO_JSON_STRING((SELECT AS STRUCT * EXCEPT (ID) FROM sample_table LIMIT 1)), r'"([^,{]+)":'), ',')
);
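As a sanity check, a small sketch (against the sample data above) of what the inner expression produces: it JSON-encodes one row minus ID and extracts the keys, which yields the column list spliced into UNPIVOT.
-- Returns 'col1,col2,col3,col4' for the sample data above.
SELECT ARRAY_TO_STRING(
  REGEXP_EXTRACT_ALL(
    TO_JSON_STRING((SELECT AS STRUCT * EXCEPT (ID) FROM sample_table LIMIT 1)),
    r'"([^,{]+)":'),
  ',');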
For the updated sample table:
SELECT ID, ARRAY_AGG(STRUCT(column, value) ORDER BY ABS(value) DESC LIMIT 3) output
FROM (
SELECT * FROM (SELECT ID, featureAttributions.* FROM sample_table)
UNPIVOT (value FOR column IN (col1, col2, col3, col4))
)
GROUP BY ID;
EXECUTE IMMEDIATE FORMAT("""
SELECT ID, ARRAY_AGG(STRUCT(column, value) ORDER BY ABS(value) DESC LIMIT 3) output
FROM (
SELECT * FROM (SELECT ID, featureAttributions.* FROM sample_table)
UNPIVOT (value FOR column IN (%s))
)
GROUP BY ID
""", ARRAY_TO_STRING(
REGEXP_EXTRACT_ALL(TO_JSON_STRING((SELECT featureAttributions FROM sample_table LIMIT 1)), r'"([^,{]+)":'), ',')
);
I have data that looks like below:
col1,col2,col3,col4,col5,col6,col7,col8,col9,col10,col11,col12,col13,col14,col15
2020-10-30 17:57:17,False,2020-07-01,14,2,False,0.0,True,30.0,True,30.0,True,True,True,False
2020-10-30 17:57:17,False,2020-07-01,15,2,True,28.0,False,0.0,False,0.0,True,True,True,False
2020-11-15 17:57:17,True,2020-07-01,5,2,True,28.0,False,0.0,False,0.0,True,True,True,False
2020-11-15 17:57:17,False,2020-07-01,7,2,False,0.0,True,30.0,True,30.0,True,True,True,False
My query looks like the following:
select distinct on (col3) col4
from table where col13 is true and col15 is false
and col3 = '2020-07-01'
and col1 <= '2020-09-16'
and col2 is false order by col3, col1 asc;
My expected answer should be [14, 15], since these are the earliest records for '2020-07-01'. However, using the above query I only get [15]. Any ideas what I might be doing wrong?
I was able to resolve this using the following query (DISTINCT ON (col3) keeps exactly one row per distinct col3 value, which is why the original query returned a single result):
select distinct col4
from table where col13 is true and col15 is false
and col3 = '2020-07-01'
and col1 = (select min(col1) from table
where col1 <= '2020-09-16' and col3 = '2020-07-01')
and col2 is false;
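A window function is another option that avoids the correlated subquery; this is a sketch under the same filters as above, keeping every row tied for the earliest col1:
select col4
from (
  select col4, rank() over (order by col1) as rnk
  from table
  where col13 is true and col15 is false
    and col3 = '2020-07-01'
    and col1 <= '2020-09-16'
    and col2 is false
) t
where rnk = 1;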
I have an update query as below:
WITH update_cte as (
SELECT col1, col2,
daterange(col3, col4) as col5
FROM temp_table
)
UPDATE my_table
SET col5 = update_cte.col5
FROM update_cte inner join my_table
ON (update_cte.col1 = my_table.col1)
AND (update_cte.col2 = my_table.col2)
It gives me the following error:
"Error: table name \"my_table\" specified more than once\n"}
I was able to resolve it by giving my_table an alias in the inner join.
WITH update_cte as (
SELECT col1, col2,
daterange(col3, col4) as col5
FROM temp_table
)
UPDATE my_table
SET col5 = update_cte.col5
FROM update_cte inner join my_table as idr
ON (update_cte.col1 = idr.col1)
AND (update_cte.col2 = idr.col2)
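Note that the alias only silences the error: the UPDATE target my_table is not joined to the FROM result, so every target row is paired with arbitrary matching rows. The more idiomatic PostgreSQL form (a sketch using the same columns as above) joins the CTE to the target in the WHERE clause:
WITH update_cte as (
SELECT col1, col2,
daterange(col3, col4) as col5
FROM temp_table
)
UPDATE my_table
SET col5 = update_cte.col5
FROM update_cte
WHERE my_table.col1 = update_cte.col1
AND my_table.col2 = update_cte.col2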
I have two tables, Table1 and Table2, with the structure below.
Table1
PkCol1
PkCol2
Col3
Col4
Col5
Table2
PkCol1
PkCol2
Col3
Col4
Col5
But I am getting the primary key information as input. For example, I receive it as PkCol1,PkCol2. I may receive more primary key columns as input too.
How do I dynamically add my WHERE condition to Spark SQL?
Below is my code
df.createOrReplaceTempView("Table1")
df2.createOrReplaceTempView("Table2")
val primaryKeyString = ar(1)
val df3 = spark.sql("Select * from table1 where "+primaryKeyString+" not in (select "+primaryKeyString+" from table2)").toDF()
If there is a better way to do it with Dataframes let me know.
I am able to achieve my purpose by concatenating the key columns in Spark SQL as below:
val df3 = spark.sql("Select * from table1 where CONCAT("+primaryKeyString+") not in (select CONCAT("+primaryKeyString+") from table2)").toDF()
Trying to find if there is a better way to achieve it in Scala.
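One option worth trying (a sketch, with hypothetical aliases t1/t2): Spark SQL supports LEFT ANTI JOIN, which keeps the rows of table1 that have no match in table2. It avoids the collision risk of CONCAT (e.g. 'ab' + 'c' equals 'a' + 'bc') and behaves more predictably than NOT IN when key columns contain nulls.
-- Rows of table1 whose composite key does not appear in table2.
SELECT t1.*
FROM table1 t1
LEFT ANTI JOIN table2 t2
ON t1.PkCol1 = t2.PkCol1
AND t1.PkCol2 = t2.PkCol2
The ON clause can be assembled from the incoming key columns the same way primaryKeyString is spliced into the query above.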
I'm new to Cassandra and I'm having trouble inserting some rows into a database, getting the error in the title.
I use Cassandra 1.0.8 and cqlsh for making changes to my database.
Next, I explain the steps I take before getting the error:
CREATE A COLUMN FAMILY
CREATE TABLE test (
col1 int PRIMARY KEY,
col2 bigint,
col3 boolean,
col4 timestamp
);
INSERT SEVERAL ROWS WITHOUT SPECIFYING ALL OF THE COLUMNS OF THE TABLE
insert into test (col1, col2, col3) values (1, 100, true);
insert into test (col1, col2, col3) values (2, 200, false);
SELECT FOR CHECKING THAT ROWS HAVE BEEN INSERTED CORRECTLY
select * from test;
The result shows both rows as expected.
INSERT A ROW SPECIFYING A VALUE FOR col4 (NOT SPECIFIED BEFORE)
insert into test (col1, col2, col3, col4) values (3, 100, true, '2011-02-03');
SELECT FOR CHECKING THAT ROW HAS BEEN INSERTED CORRECTLY
select * from test;
This SELECT is where the error occurs.
SELECT EACH COLUMN OF THE TABLE SEPARATELY
select col1 from test;
select col2 from test;
select col3 from test;
select col4 from test;
Each works fine and shows the right values.
Then, my question is: what's the problem with the first SELECT? What's wrong?
Thanks in advance!!
NOTE:
If I define col4 as an integer rather than a timestamp, it works. However, I've tried inserting col4 in the normalized format yyyy-mm-dd HH:mm (I've tried '2011-02-03 01:05' and '2011-02-03 01:05:10'), but it doesn't work.
Cassandra 1.0.8 shipped with CQL2, and that's where your problem is coming from. I managed to recreate this in 1.0.8, but it works fine with 1.2.x, so my advice is to upgrade if you can.
In C* 1.2.10
cqlsh> update db.user set date='2011-02-03 01:05' where user='JCTYpjJlM';
cqlsh> SELECT * from db.user ;
 user      | date                     | password
-----------+--------------------------+----------
 xvkYQKerQ |                     null |      765
 JCTYpjJlM | 2011-02-03 01:05:00+0200 |      391
@mol
Weird. Try inserting col4 as an integer (convert to milliseconds first) or use the normalized format: yyyy-mm-dd HH:mm
According to the doc here, you can omit the time and just input the date, but it seems that breaks something in your case.
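For reference, a minimal sketch of the integer workaround mentioned above (hypothetical: col4 redeclared as bigint holding epoch milliseconds; 1296691200000 corresponds to 2011-02-03 00:00 UTC):
CREATE TABLE test (
col1 int PRIMARY KEY,
col2 bigint,
col3 boolean,
col4 bigint -- epoch milliseconds instead of timestamp
);
insert into test (col1, col2, col3, col4) values (3, 100, true, 1296691200000);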