BigQuery - Delete rows from table partitioned by date - python-3.x

I have a dataset.table partitioned by date (100 partitions), like this:
table_name_(100), which means: table_name_20200101, table_name_20200102, table_name_20200103, ...
Example of table_name_20200101:
| id  | col_1 | col_2 | col_3 |
|-----|-------|-------|-------|
| xxx | 2     | 6     | 10    |
| yyy | 1     | 60    | 29    |
| zzz | 12    | 61    | 78    |
| aaa | 18    | 56    | 80    |
I would like to delete the row with id = 'yyy' in all the (partitioned) tables:
DELETE FROM `project_id.dataset_id.table_name_*`
WHERE id = 'yyy'
I get this error:
Illegal operation (write) on meta-table
project_id:dataset_id.table_name_*
Is there a way to delete the rows with id = 'yyy' in all the (partitioned) tables?
Thank you

Okay, a few things to call out here to ensure we're using consistent terminology.
You're talking about sharded tables, not partitioned. In a partitioned table, the data within the table is organized based on the partitioning specification. Here, you just have a series of tables named using a common prefix and a suffix based on date.
The use of the table_prefix* syntax is called a wildcard table, and DML is explicitly not allowed via wildcard tables: https://cloud.google.com/bigquery/docs/querying-wildcard-tables
The table_name_(100) is an aspect of how the BigQuery UI collapses series of like-named tables to save space in the navigation panes. It's not how the service itself references tables at all.
The way you can accomplish this is to leverage other aspects of BigQuery: The INFORMATION_SCHEMA tables and scripting functionality.
Information about what tables are in a dataset is available via the TABLES view: https://cloud.google.com/bigquery/docs/information-schema-tables
Information about scripting can be found here: https://cloud.google.com/bigquery/docs/reference/standard-sql/scripting
Now, here's an example that combines these concepts:
DECLARE myTables ARRAY<STRING>;
DECLARE X INT64 DEFAULT 0;
DECLARE queryStr STRING;

# First, we query INFORMATION_SCHEMA to generate an array of the tables we want to process.
# This INFORMATION_SCHEMA query currently has a LIMIT clause so that if you get it wrong,
# you won't bork all the tables in the dataset in one go.
SET myTables = (
  SELECT
    ARRAY_AGG(t)
  FROM (
    SELECT
      TABLE_NAME AS t
    FROM `my-project-id`.my_dataset.INFORMATION_SCHEMA.TABLES
    WHERE
      TABLE_TYPE = 'BASE TABLE' AND
      STARTS_WITH(TABLE_NAME, 'table_name_')
    ORDER BY TABLE_NAME
    LIMIT 2
  )
);

# Now, we process that array of tables using scripting's loop construct,
# one at a time.
LOOP
  IF X >= ARRAY_LENGTH(myTables)
    THEN LEAVE;
  END IF;
  # DANGER WILL ROBINSON: This mutates tables!!!
  #
  # The next line constructs the SQL statement we want to run for each table.
  #
  # In this example, we're constructing the same DML DELETE
  # statement to run on each table. For safety's sake, you may want to start with
  # something like a SELECT query to validate your assumptions and project the
  # myTables values to see what you're getting.
  SET queryStr = "DELETE FROM `my-project-id`.my_dataset." || myTables[SAFE_OFFSET(X)] || " WHERE id = 'yyy'";
  # Now, run the generated SQL via EXECUTE IMMEDIATE.
  EXECUTE IMMEDIATE queryStr;
  SET X = X + 1;
END LOOP;
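Since the question is tagged python-3.x: if you'd rather drive this from Python than from the console, the whole script above can be submitted as a single query job with the google-cloud-bigquery client. A minimal sketch, assuming credentials are already configured and the script is saved locally as delete_yyy.sql (a hypothetical file name):
# Minimal sketch: run the BigQuery scripting block above as one query job from Python.
from google.cloud import bigquery

client = bigquery.Client(project="my-project-id")  # project id from the example above

# Load the scripting block shown above from a local file (hypothetical name).
with open("delete_yyy.sql") as f:
    script = f.read()

job = client.query(script)  # a script is submitted like any other query
job.result()                # block until every statement in the script has run
print("done, job id:", job.job_id)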

Related

Optimizing Theta Joins in Spark SQL

I have just 2 tables. I need to get the records from the first table (a big table, ~10 M rows) whose transaction date is less than or equal to the effective date in the second table (a small table with 1 row); this result set will then be consumed by downstream queries.
Table Transact:
tran_id | cust_id | tran_amt | tran_dt
1234 | XYZ | 12.55 | 10/01/2020
5678 | MNP | 25.99 | 25/02/2020
5561 | XYZ | 32.45 | 30/04/2020
9812 | STR | 10.32 | 15/08/2020
Table REF:
eff_dt |
30/07/2020 |
So, per this logic, I should get back the first 3 rows and discard the last record, since its date is greater than the reference date (in the REF table).
Hence, I have used a non-equi (Cartesian) join between these tables:
select
/*+ MAPJOIN(b) */
a.tran_id,
a.cust_id,
a.tran_amt,
a.tran_dt
from transact a
inner join ref b
on a.tran_dt <= b.eff_dt
However, this SQL takes forever to complete due to the cross join with the transact table, even with the broadcast hint.
So is there any smarter way to implement the same logic that will be more efficient than this? In other words, is it possible to optimize the theta join in this query?
Thanks in advance.
So I wrote up something like this, referring to https://databricks.com/session/optimizing-apache-spark-sql-joins:
Can you try bucketing on tran_dt (bucketed on year/month only), and writing 2 queries to do the same work?
First query: tran_dt (year/month) < eff_dt (year/month). This could help by picking up whole buckets (rather than checking each and every record's tran_dt) that are earlier than 2020/07.
Second query: tran_dt (year/month) = eff_dt (year/month) AND tran_dt (day) <= eff_dt (day).
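A minimal PySpark sketch of that two-query bucketing idea. The derived tran_ym column, the transact_bucketed table name, the bucket count, and the dd/MM/yyyy date format are all assumptions; the point is the split of the non-equi predicate into a whole-month query plus an intra-month query:
# Sketch only: bucket the big table by a derived year/month key, then split the
# non-equi predicate into two queries and union the results.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("theta-join-sketch").getOrCreate()

transact = spark.table("transact").withColumn(
    "tran_ym", F.date_format(F.to_date("tran_dt", "dd/MM/yyyy"), "yyyyMM")
)

# One-time step: write a bucketed (and sorted) copy keyed on the year/month column.
(transact.write
    .bucketBy(24, "tran_ym")
    .sortBy("tran_ym")
    .mode("overwrite")
    .saveAsTable("transact_bucketed"))

# The REF table has a single row, so just collect its effective date.
eff_dt = spark.table("ref").select(
    F.to_date("eff_dt", "dd/MM/yyyy").alias("d")
).first()["d"]
eff_ym, eff_day = eff_dt.strftime("%Y%m"), eff_dt.day

tb = spark.table("transact_bucketed")

# Query 1: all months strictly before the effective month.
q1 = tb.where(F.col("tran_ym") < eff_ym)

# Query 2: the effective month itself, filtered down to the day.
q2 = tb.where(
    (F.col("tran_ym") == eff_ym)
    & (F.dayofmonth(F.to_date("tran_dt", "dd/MM/yyyy")) <= F.lit(eff_day))
)

result = q1.unionByName(q2)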

Azure Kusto - how to parse a string looking for the last node?

I am writing Kusto queries to analyze the state of the database when simple queries run for a long time.
For example: a record in dependencies with type == 'SQL' is a SQL Server query. If its duration at timestamp 2019-06-24T16:41:24.856 is >= 15000 (>= 15 secs),
I would like to query and analyze the dtu_consumption_percent from AzureMetrics between 2019-06-24T16:40:24.856 and 2019-06-24T16:42:24.856 (1 min before and 1 min after the query completion time) to determine the state of the database at that point in time.
Question: I wonder if anyone can give me pointers on getting the database name out of the target column from dependencies?
target looks like this:
tcp:sqlserver-xxx-xxxxxx.database.windows.net | DDDDD
and I need to extract DDDDD to join to the AzureMetrics column Resource.
Thank you!
As Yoni says you can use parse, or you could use substring:
let T = datatable(Value:string) [
    'tcp:sqlserver-xxx-xxxxxx.database.windows.net | DDDDD',
    'udp:appserver-yyy-yyyyyy.database.contoso.com | EEEEE'
];
T
// Look for the pipe and take everything after it as the value
| extend ToSubstring = substring(Value, indexof(Value, "|")+1)
https://learn.microsoft.com/en-us/azure/kusto/query/substringfunction
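One small thing to watch: because indexof returns the position of the pipe itself, the +1 leaves the space after the delimiter in place, so the extracted value is ' DDDDD' rather than 'DDDDD'. If that matters for the later join against the AzureMetrics Resource column, you can trim it first (for example with trim(" ", ToSubstring)).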
However if you find yourself doing this a lot you may want to take a look at Custom Fields:
https://learn.microsoft.com/en-us/azure/azure-monitor/platform/custom-fields
You could use the parse operator:
https://learn.microsoft.com/en-us/azure/kusto/query/parseoperator
print value = 'tcp:sqlserver-xxx-xxxxxx.database.windows.net | DDDDD'
| parse value with * "| " database
this returns:
| value | database |
|-------------------------------------------------------|----------|
| tcp:sqlserver-xxx-xxxxxx.database.windows.net | DDDDD | DDDDD |

How to do negation for 'CONTAINS'

I have a Cassandra table with one column defined as a set.
How can I achieve something like this:
SELECT * FROM <table> WHERE <set_column_name> NOT CONTAINS <value>
A proper secondary index was already created.
From the documentation:
SELECT select_expression
FROM keyspace_name.table_name
WHERE relation AND relation ...
ORDER BY ( clustering_column ( ASC | DESC )...)
LIMIT n
ALLOW FILTERING
then later:
relation is:
column_name op term
and finally:
op is = | < | > | <= | >= | CONTAINS | CONTAINS KEY
So there's no native way to perform such a query. You have to work around it by designing a new table that specifically satisfies this query.

Cassandra - Overlapping Data Ranges

I have the following 'Tasks' table in Cassandra.
Task_ID UUID - Partition Key
Starts_On TIMESTAMP - Clustering Column
Ends_On TIMESTAMP - Clustering Column
I want to run a CQL query to get the overlapping tasks for a given date range. For example, if I pass in two timestamps (T1 and T2) as parameters to the query, I want to get all the tasks that fall within that range (that is, the overlapping records).
What is the best way to do this in Cassandra? I cannot just use two ranges on Starts_On and Ends_On here, because to add a range query on Ends_On I have to have an equality check on Starts_On.
In CQL you can only range query on one clustering column at a time, so you'll probably need to do some kind of client side filtering in your application. So you could range query on starts_on, and as rows are returned, check ends_on in your application and discard rows that you don't want.
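A minimal sketch of that client-side filtering with the Python cassandra-driver. The contact point, keyspace, and the bucket-style partition key p are assumptions; as one of the later notes points out, you need a partition key other than Task_ID to be able to range query on Starts_On at all:
# Sketch only: range query on starts_on server-side, filter on ends_on client-side.
from datetime import datetime
from cassandra.cluster import Cluster

cluster = Cluster(["127.0.0.1"])          # hypothetical contact point
session = cluster.connect("my_keyspace")  # hypothetical keyspace

t1 = datetime(2020, 8, 1)   # start of the period of interest
t2 = datetime(2020, 8, 31)  # end of the period of interest

# Overlap condition is starts_on <= t2 AND ends_on >= t1;
# only the starts_on half can be pushed down to Cassandra here.
rows = session.execute(
    "SELECT task_id, starts_on, ends_on FROM tasks "
    "WHERE p = %s AND starts_on <= %s",
    (1, t2),
)

# Discard rows whose ends_on falls before the period of interest.
overlapping = [r for r in rows if r.ends_on >= t1]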
Here's another idea (somewhat unconventional). You could create a user defined function to implement the second range filter (in Cassandra 2.2 and newer).
Suppose you define your table like this (shown with ints instead of timestamps to keep the example simple):
CREATE TABLE tasks (
p int,
task_id timeuuid,
start int,
end int,
end_range int static,
PRIMARY KEY(p, start));
Now we create a user defined function to check returned rows based on the end time, and return the task_id of matching rows, like this:
CREATE FUNCTION my_end_range(task_id timeuuid, end int, end_range int)
CALLED ON NULL INPUT RETURNS timeuuid LANGUAGE java AS
'if (end <= end_range) return task_id; else return null;';
Now I'm using a trick there with the third parameter. In an apparent (major?) oversight, you can't pass a constant to a user-defined function. So to work around that, we pass a static column (end_range) as our constant.
So first we have to set the end_range we want:
UPDATE tasks SET end_range=15 where p=1;
And let's say we have this data:
SELECT * FROM tasks;
p | start | end_range | end | task_id
---+-------+-----------+-----+--------------------------------------
1 | 1 | 15 | 5 | 2c6e9340-4a88-11e5-a180-433e07a8bafb
1 | 2 | 15 | 7 | 3233a040-4a88-11e5-a180-433e07a8bafb
1 | 4 | 15 | 22 | f98fd9b0-4a88-11e5-a180-433e07a8bafb
1 | 8 | 15 | 15 | 37ec7840-4a88-11e5-a180-433e07a8bafb
Now let's get the task_id's that have start >= 2 and end <= 15:
SELECT start, end, my_end_range(task_id, end, end_range) FROM tasks
WHERE p=1 AND start >= 2;
start | end | test.my_end_range(task_id, end, end_range)
-------+-----+--------------------------------------------
2 | 7 | 3233a040-4a88-11e5-a180-433e07a8bafb
4 | 22 | null
8 | 15 | 37ec7840-4a88-11e5-a180-433e07a8bafb
So that gives you the matching task_ids, and you have to ignore the null rows (I haven't figured out a way to drop rows using UDFs). You'll note that the filter of start >= 2 dropped one row before it was passed to the UDF.
Anyway not a perfect method obviously, but it might be something you can work with. :)
A while ago I wrote an application that faced a similar problem, in querying events that had both start and end times. For our scenario, I was able to partition on a userID (as queries were for events of a specific user), set a clustering column for type of event, and also for event date. The table structure looked something like this:
CREATE TABLE userEvents (
userid UUID,
eventTime TIMEUUID,
eventType TEXT,
eventDesc TEXT,
PRIMARY KEY ((userid),eventTime,eventType));
With this structure, I can query by userid and eventtime:
SELECT userid,dateof(eventtime),eventtype,eventdesc FROM userevents
WHERE userid=dd95c5a7-e98d-4f79-88de-565fab8e9a68
AND eventtime >= mintimeuuid('2015-08-24 00:00:00-0500');
userid | system.dateof(eventtime) | eventtype | eventdesc
--------------------------------------+--------------------------+-----------+-----------
dd95c5a7-e98d-4f79-88de-565fab8e9a68 | 2015-08-24 08:22:53-0500 | End | event1
dd95c5a7-e98d-4f79-88de-565fab8e9a68 | 2015-08-24 11:45:00-0500 | Begin | lunch
dd95c5a7-e98d-4f79-88de-565fab8e9a68 | 2015-08-24 12:45:00-0500 | End | lunch
(3 rows)
That query will give me all event rows for a particular user for today.
NOTES:
If you need to query by whether or not an event is starting or ending (I did not) you will want to order eventType ahead of eventTime in the primary key.
You will store each event twice (once for the beginning, and once for the end); a short write sketch follows these notes. Duplication of data usually isn't much of a concern in Cassandra, but I did want to explicitly point that out.
In your case, you will want to find a good key to partition on, as Task_ID will be too unique (high cardinality). This is a must in Cassandra, as you cannot range query on a partition key (only a clustering key).
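A minimal sketch of the double write mentioned in the notes above, using the Python cassandra-driver. The contact point, keyspace, and the example timestamps are assumptions; uuid_from_time converts a datetime into the TIMEUUID the eventtime column expects:
# Sketch only: each task is written twice, once as a Begin row and once as an End row.
from datetime import datetime
from uuid import UUID
from cassandra.cluster import Cluster
from cassandra.util import uuid_from_time

cluster = Cluster(["127.0.0.1"])          # hypothetical contact point
session = cluster.connect("my_keyspace")  # hypothetical keyspace

userid = UUID("dd95c5a7-e98d-4f79-88de-565fab8e9a68")
insert = session.prepare(
    "INSERT INTO userevents (userid, eventtime, eventtype, eventdesc) "
    "VALUES (?, ?, ?, ?)"
)

# Begin and End rows for the same event.
session.execute(insert, (userid, uuid_from_time(datetime(2015, 8, 24, 11, 45)), "Begin", "lunch"))
session.execute(insert, (userid, uuid_from_time(datetime(2015, 8, 24, 12, 45)), "End", "lunch"))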
There doesn't seem to be a completely satisfactory way to do this in Cassandra but the following method seems to work well:
I cluster the table on the Starts_On timestamp in descending order. (Ends_On is just a regular column.) Then I constrain the query with Starts_On < ?, where the parameter is the end of the period of interest - i.e. I filter out events that start after our period of interest has finished.
I then iterate through the results until the row's Ends_On is earlier than the start of the period of interest and throw away the rest of the result rows. (Note that this assumes events don't overlap - there are no subsequent results with a later Ends_On.)
Throwing away the rest of the result rows might seem wasteful, but here's the crucial bit: You can set the paging size sufficiently small that the number of rows to throw away is relatively small, even if the total number of rows is very large.
Ideally you want the paging size just a little bigger than the total number of relevant rows that you expect to receive back. If the paging size is too small, the driver ends up retrieving multiple pages, which could hurt performance. If it is too large, you end up throwing away a lot of rows, and again this could hurt performance by transferring more data than is necessary. In practice you can probably find a good compromise.
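A minimal sketch of that paging approach with the Python cassandra-driver. The table layout (a bucket partition key p, clustering on Starts_On in descending order) and the paging size are assumptions:
# Sketch only: constrain starts_on < period_end, read in small pages, and stop
# as soon as ends_on is earlier than period_start (assumes events don't overlap,
# as noted above).
from datetime import datetime
from cassandra.cluster import Cluster
from cassandra.query import SimpleStatement

cluster = Cluster(["127.0.0.1"])          # hypothetical contact point
session = cluster.connect("my_keyspace")  # hypothetical keyspace

period_start = datetime(2020, 8, 1)
period_end = datetime(2020, 8, 31)

# Keep the page size just above the number of rows you expect to keep.
stmt = SimpleStatement(
    "SELECT task_id, starts_on, ends_on FROM tasks "
    "WHERE p = %s AND starts_on < %s",
    fetch_size=50,
)

overlapping = []
for row in session.execute(stmt, (1, period_end)):
    if row.ends_on < period_start:
        break  # rows arrive in starts_on DESC order; the rest can be discarded
    overlapping.append(row)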

Pull Oracle data From Excel File

So I have a lot of rows in Excel: 10,000 rows or so of data, and I am working with 10,000 or so different IDs. Is there a way to query the Oracle database just once, by capturing the entire ID column as a group and including that group in the WHERE clause, instead of looping over the 10,000 assets and querying the database 10,000 times?
Sorry for not providing code. I really have not attempted this because I don't know if a solution exists.
Something like what you are asking can be accomplished in a two-step process: first, by creating SELECT-FROM-DUAL queries for the relevant IDs, and second, by putting those queries into your main query and joining against them to limit the results to only the rows you need.
For the first step, use Excel to create SELECT-FROM-DUAL subqueries.
If your ID column starts in cell A2, copy the following formula into an empty cell on the same row and drag it down the column until all rows with an ID also have the formula. Alter the references to cells A2 and A3 if your IDs don't start in cell A2.
="SELECT "&A2&" AS id FROM DUAL"&IF(NOT(ISBLANK(A3)), " UNION ALL", "")
Ultimately, what we want is a block of SELECT-FROM-DUAL statements that look like the below. Note that the last statement will not end in "UNION ALL", but all other statements should.
| IDs | Formula                            |
|-----|------------------------------------|
| 1   | SELECT 1 AS id FROM DUAL UNION ALL |
| 2   | SELECT 2 AS id FROM DUAL UNION ALL |
| 3   | SELECT 3 AS id FROM DUAL UNION ALL |
| 4   | SELECT 4 AS id FROM DUAL UNION ALL |
| 5   | SELECT 5 AS id FROM DUAL UNION ALL |
| 6   | SELECT 6 AS id FROM DUAL           |
For the second step, add all the SELECT-FROM-DUAL statements to your main query and then add an appropriate JOIN condition.
SELECT
    *
FROM table_you_need tyn
INNER JOIN (
    SELECT 1 AS id FROM DUAL UNION ALL
    SELECT 2 AS id FROM DUAL UNION ALL
    SELECT 3 AS id FROM DUAL UNION ALL
    SELECT 4 AS id FROM DUAL UNION ALL
    SELECT 5 AS id FROM DUAL UNION ALL
    SELECT 6 AS id FROM DUAL
) your_ids yi
    ON tyn.id = yi.id
;
If you had a shorter list of IDs you could use a similar strategy to create an ID list for a WHERE ids IN (<list_of_numbers>) clause, but an Oracle IN list is limited to 1,000 expressions, so that would not work for your current question.
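If you'd rather build that block in code than with an Excel formula, the same UNION ALL trick can be generated programmatically. A minimal Python sketch using openpyxl and cx_Oracle; the file name, sheet layout, connection string, and table/column names are all assumptions, and it assumes numeric IDs:
# Sketch only: read the ID column from the workbook, build the SELECT-FROM-DUAL
# block, and join against it in a single query.
import cx_Oracle
from openpyxl import load_workbook

wb = load_workbook("ids.xlsx", read_only=True)  # hypothetical file name
ws = wb.active

# IDs assumed to be numeric and to live in column A, starting at row 2.
ids = [row[0] for row in ws.iter_rows(min_row=2, max_col=1, values_only=True)
       if row[0] is not None]

dual_block = " UNION ALL ".join(f"SELECT {i} AS id FROM DUAL" for i in ids)

sql = f"""
    SELECT tyn.*
    FROM table_you_need tyn
    INNER JOIN ({dual_block}) your_ids yi
        ON tyn.id = yi.id
"""

conn = cx_Oracle.connect("user", "password", "host:1521/service")  # hypothetical
cur = conn.cursor()
cur.execute(sql)
rows = cur.fetchall()
For string IDs you would want to quote or bind the values rather than interpolating them directly.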
You can import data from Excel using Toad or SQL Developer. You need to create a table first in the database.
You can read the data directly with external tables if you save the Excel file as a CSV file to a folder on the database server that the database can access.
You can read files as Excel (xls or xlsx format) using a PL/SQL library.
There are probably a few other ways I haven't thought of as well. This is a very common question.
