I have a Cassandra table with one column defined as a set.
How can I achieve something like this:
SELECT * FROM <table> WHERE <set_column_name> NOT CONTAINS <value>
A proper secondary index was already created.
From the documentation:
SELECT select_expression FROM keyspace_name.table_name WHERE
relation AND relation ... ORDER BY ( clustering_column ( ASC | DESC
)...) LIMIT n ALLOW FILTERING
then later:
relation is:
column_name op term
and finally:
op is = | < | > | <= | >= | CONTAINS | CONTAINS KEY
So there's no native way to perform such a query. You have to work around it by designing a new table that specifically satisfies this query.
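For instance, here is a minimal sketch of that kind of query table, assuming the excluded value is known up front (all table and column names here are hypothetical):
CREATE TABLE items_without_tag (
    excluded_tag text,
    item_id uuid,
    payload text,
    PRIMARY KEY (excluded_tag, item_id)
);

-- At write time the application knows whether the set contains the value,
-- so it only inserts a row here when the value is absent:
INSERT INTO items_without_tag (excluded_tag, item_id, payload)
VALUES ('archived', 123e4567-e89b-12d3-a456-426614174000, 'some data');

-- The "NOT CONTAINS" query then becomes a straight partition read:
SELECT * FROM items_without_tag WHERE excluded_tag = 'archived';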
I have a dataset.table partitioned by date (100 partitions), like this:
table_name_(100), which means: table_name_20200101, table_name_20200102, table_name_20200103, ...
Example of table_name_20200101:
| id  | col_1 | col_2 | col_3 |
|-----|-------|-------|-------|
| xxx | 2     | 6     | 10    |
| yyy | 1     | 60    | 29    |
| zzz | 12    | 61    | 78    |
| aaa | 18    | 56    | 80    |
I would like to delete the row with id = 'yyy' in all the (partitioned) tables:
DELETE FROM `project_id.dataset_id.table_name_*`
WHERE id = 'yyy'
I got this error:
Illegal operation (write) on meta-table
project_id:dataset_id.table_name_*
Is there a way to delete the rows with id = 'yyy' in all the tables?
Thank you
Okay, a few things to call out here to ensure we're using consistent terminology.
You're talking about sharded tables, not partitioned. In a partitioned table, the data within the table is organized based on the partitioning specification. Here, you just have a series of tables named using a common prefix and a suffix based on date.
The use of the table_prefix* syntax is called a wildcard table, and DML is explicitly not allowed via wildcard tables: https://cloud.google.com/bigquery/docs/querying-wildcard-tables
The table_name_(100) is an aspect of how the BigQuery UI collapses series of like-named tables to save space in the navigation panes. It's not how the service itself references tables at all.
The way you can accomplish this is to leverage other aspects of BigQuery: The INFORMATION_SCHEMA tables and scripting functionality.
Information about what tables are in a dataset is available via the TABLES view: https://cloud.google.com/bigquery/docs/information-schema-tables
Information about scripting can be found here: https://cloud.google.com/bigquery/docs/reference/standard-sql/scripting
Now, here's an example that combines these concepts:
DECLARE myTables ARRAY<STRING>;
DECLARE X INT64 DEFAULT 0;
DECLARE queryStr STRING;

# First, we query INFORMATION_SCHEMA to generate an array of the tables we want to process.
# This INFORMATION_SCHEMA query currently has a LIMIT clause so that if you get it wrong,
# you won't bork all the tables in the dataset in one go.
SET myTables = (
  SELECT ARRAY_AGG(t)
  FROM (
    SELECT TABLE_NAME AS t
    FROM `my-project-id`.my_dataset.INFORMATION_SCHEMA.TABLES
    WHERE
      TABLE_TYPE = 'BASE TABLE' AND
      STARTS_WITH(TABLE_NAME, 'table_name_')
    ORDER BY TABLE_NAME
    LIMIT 2
  )
);

# Now, we process that array of tables using scripting's loop construct,
# one at a time.
LOOP
  IF X >= ARRAY_LENGTH(myTables) THEN
    LEAVE;
  END IF;
  # DANGER WILL ROBINSON: This mutates tables!!!
  #
  # The next line constructs the SQL statement we want to run for each table.
  #
  # In this example, we're constructing the same DML DELETE
  # statement to run on each table. For safety's sake, you may want to start with
  # something like a SELECT query to validate your assumptions and project the
  # myTables values to see what you're getting.
  SET queryStr = "DELETE FROM `my-project-id`.my_dataset." || myTables[SAFE_OFFSET(X)] || " WHERE id = 'yyy'";
  # Now, run the generated SQL via EXECUTE IMMEDIATE.
  EXECUTE IMMEDIATE queryStr;
  SET X = X + 1;
END LOOP;
I have just 2 tables, wherein I need to get the records from the first table (a big table, 10M rows) whose transaction date is less than or equal to the effective date present in the second table (a small table with 1 row); this result set will then be consumed by downstream queries.
Table Transact:
tran_id | cust_id | tran_amt | tran_dt
1234    | XYZ     | 12.55    | 10/01/2020
5678    | MNP     | 25.99    | 25/02/2020
5561    | XYZ     | 32.45    | 30/04/2020
9812    | STR     | 10.32    | 15/08/2020
Table REF:
eff_dt
30/07/2020
Hence, as per the logic, I should get back the first 3 rows and discard the last record, since its date is greater than the reference date in the REF table.
To do this, I have used a non-equi Cartesian join between these tables:
select
/*+ MAPJOIN(b) */
a.tran_id,
a.cust_id,
a.tran_amt,
a.tran_dt
from transact a
inner join ref b
on a.tran_dt <= b.eff_dt
However, this SQL is taking forever to complete due to the cross join with the transact table, even when using broadcast hints.
So is there any smarter way to implement the same logic that would be more efficient? In other words, is it possible to optimize the theta join in this query?
Thanks in advance.
Referring to https://databricks.com/session/optimizing-apache-spark-sql-joins, I would suggest something like this:
Can you try bucketing on tran_dt (bucketed on year/month only), and writing 2 queries to do the same work?
First query: tran_dt (year/month) < eff_dt (year/month). This could help you pick up whole buckets that fall before 2020/07, rather than checking the tran_dt of each and every record.
Second query: tran_dt (year/month) = eff_dt (year/month) AND tran_dt (day) <= eff_dt (day). A sketch of both is below.
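Here is a rough sketch of what those two queries could look like, assuming you materialize derived tran_ym ('yyyyMM') and tran_day columns on the bucketed table (all names here are illustrative, and eff_dt is assumed to be a date column):
-- Query 1: whole buckets strictly before the effective month can be taken as-is.
SELECT a.tran_id, a.cust_id, a.tran_amt, a.tran_dt
FROM transact_bucketed a
JOIN ref b
  ON a.tran_ym < date_format(b.eff_dt, 'yyyyMM')

UNION ALL

-- Query 2: within the effective month, compare at day granularity.
SELECT a.tran_id, a.cust_id, a.tran_amt, a.tran_dt
FROM transact_bucketed a
JOIN ref b
  ON a.tran_ym = date_format(b.eff_dt, 'yyyyMM')
 AND a.tran_day <= day(b.eff_dt)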
I have created a persistent table via df.saveAsTable.
When I run the following query:
spark.sql("""SELECT * FROM mytable """).show()
I get a view of the DataFrame with all of its columns and all of the data.
However, when I run
spark.sql("""SELECT 'NameDisplay' FROM mytable """).show()
I receive results that look like this:
| NameDisplay|
|--|
| NameDisplay |
| NameDisplay |
| NameDisplay |
| NameDisplay |
| NameDisplay |
| NameDisplay |
NameDisplay is definitely one of the columns in the table, as it's shown when I run select *. How come it is not shown in the second query?
The issue was using quotes on the column name. It needs to be escaped via backticks: `NameDisplay`.
Selecting 'NameDisplay', in SQL, selects the literal text "NameDisplay". Given that, the results you got are in fact valid.
To select the values of the NameDisplay column, you must issue:
"SELECT NameDisplay FROM mytable "
Or, if you need to quote it (perhaps because the column was created that way, has spaces, or is case-sensitive):
"""SELECT `NameDisplay` FROM mytable"""
This is SQL syntax, nothing specific to Spark.
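A quick way to see the difference side by side (mytable and NameDisplay are from the question):
-- Returns the same literal string once per row:
SELECT 'NameDisplay' FROM mytable

-- Returns the values of the column:
SELECT NameDisplay FROM mytable

-- Equivalent, with the identifier explicitly quoted:
SELECT `NameDisplay` FROM mytable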
How can I check in node.js that a column does not exist in Apache Cassandra?
I need to add a column only if it does not exist.
I have read that I must run a select first, but if I select a column that does not exist, it returns an error.
Note that if you're on Cassandra 3.x and up, you'll want to query from the columns table on the system_schema keyspace:
aploetz@cqlsh:system_schema> SELECT * FROM system_schema.columns
WHERE keyspace_name='stackoverflow'
AND table_name='vehicle_information'
AND column_name='name';
keyspace_name | table_name | column_name | clustering_order | column_name_bytes | kind | position | type
---------------+---------------------+-------------+------------------+-------------------+---------+----------+------
stackoverflow | vehicle_information | name | none | 0x6e616d65 | regular | -1 | text
(1 rows)
You can check a column's existence using a select query on the system.schema_columns table.
Suppose you have the table test_table in keyspace test, and you want to check whether a column test_column exists or not.
Use the query below:
SELECT * FROM system.schema_columns WHERE keyspace_name = 'test' AND columnfamily_name = 'test_table' AND column_name = 'test_column';
If the above query returns a result, then the column exists; otherwise it does not.
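If the check comes back empty, you would then issue the ALTER TABLE yourself; a minimal example, assuming a text column:
ALTER TABLE test.test_table ADD test_column text;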
I have a table that has a column of list type (tags):
CREATE TABLE "Videos" (
video_id UUID,
title VARCHAR,
tags LIST<VARCHAR>,
PRIMARY KEY (video_id, upload_timestamp)
) WITH CLUSTERING ORDER BY (upload_timestamp DESC);
I have plenty of rows containing various values in the tags column, e.g. ["outdoor","funny cats","funny mice"].
I want to perform a SELECT query that will return all rows that contain "funny cats" in the tags column. How can I do that?
To directly answer your question, yes there is a way to accomplish this. As of Cassandra 2.1 you can create a secondary index on a collection. First, I'll re-create your column family definition (while adding a definition for upload_timestamp timeuuid) and put some values in it.
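Based on the definition from the question, the re-created table would look something like this (the timeuuid type for upload_timestamp is my assumption; the INSERTs for the sample rows are omitted):
CREATE TABLE videos (
    video_id uuid,
    upload_timestamp timeuuid,
    title varchar,
    tags list<varchar>,
    PRIMARY KEY (video_id, upload_timestamp)
) WITH CLUSTERING ORDER BY (upload_timestamp DESC);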
aploetz@cqlsh:stackoverflow> SELECT * FROM videos;
video_id | upload_timestamp | tags | title
--------------------------------------+--------------------------------------+-----------------------------------------------+---------------------------
2977b806-df76-4dd7-a57e-11d361e72ce1 | fc011080-64f9-11e4-a819-21b264d4c94d | ['sci-fi', 'action', 'adventure'] | Star Wars
ab696e1f-78c0-45e6-893f-430e88db7f46 | 8db7c4b0-64fa-11e4-a819-21b264d4c94d | ['documentary'] | The Witches of Whitewater
15e6bc0d-6195-4d8b-ad25-771966c780c8 | 1680d120-64fa-11e4-a819-21b264d4c94d | ['dark comedy', 'action', 'language warning'] | Pulp Fiction
(3 rows)
Next, I'll create a secondary index on the tags column:
aploetz@cqlsh:stackoverflow> CREATE INDEX ON videos (tags);
Now, if I want to query the videos that contain the tag "action," I can accomplish this with the CONTAINS keyword:
aploetz@cqlsh:stackoverflow> SELECT * FROM videos WHERE tags CONTAINS 'action';
video_id | upload_timestamp | tags | title
--------------------------------------+--------------------------------------+-----------------------------------------------+--------------
2977b806-df76-4dd7-a57e-11d361e72ce1 | fc011080-64f9-11e4-a819-21b264d4c94d | ['sci-fi', 'action', 'adventure'] | Star Wars
15e6bc0d-6195-4d8b-ad25-771966c780c8 | 1680d120-64fa-11e4-a819-21b264d4c94d | ['dark comedy', 'action', 'language warning'] | Pulp Fiction
(2 rows)
With this all being said, I should pass along a couple of warnings:
Secondary indexes do not perform well at scale. They exist to provide convenience, not performance. If you expect to have to query by tag often, then the right way to solve this would be to create a videosbytag query table, with the same data but keyed like this: PRIMARY KEY (tag, video_id) (see the sketch after these warnings).
You don't need the double-quotes in your table name. In fact, having it in quotes may cause you problems (ok, maybe minor irritations) down the road.
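As for that query table, here's a minimal sketch (column types are borrowed from the example above; keeping it in sync at write time is up to the application):
CREATE TABLE videosbytag (
    tag varchar,
    video_id uuid,
    upload_timestamp timeuuid,
    title varchar,
    PRIMARY KEY (tag, video_id)
);

-- The tag lookup then becomes a single-partition read, with no secondary index:
SELECT * FROM videosbytag WHERE tag = 'action';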