Please give me an example of an ImpEx update using [batchmode=true] and without using [batchmode=true].
Let's take these two items as an example:
Product 1 :
--------------------------------------------------
| id (unique) | catalog (unique) | name |
--------------------------------------------------
| x | staged | |
--------------------------------------------------
Product 2 :
--------------------------------------------------
| id (unique) | catalog (unique) | name |
--------------------------------------------------
| x | online | |
--------------------------------------------------
Now let's assume that you want to update the name of both items using an ImpEx script:
BATCH MODE:
Using batch mode it looks like this (at least one unique attribute is needed to select the items; every item matching that unique attribute is modified, and the non-unique attributes are updated):
UPDATE Product[batchmode=true]; id[unique=true]; name;
; x ; randomName;
WITHOUT BATCH MODE:
Without batch mode you have to specify all the unique attributes to modify the items:
UPDATE Product; id[unique=true]; catalog[unique=true]; name;
; x ; staged ; randomName;
; x ; online ; randomName;
Hope this helps
Without batch mode you have to specify a unique attribute (or set of attributes) for each row. Each row MUST have a unique set of key values, i.e. the attributes marked [unique=true] must identify exactly one item.
With batch mode a single data row does not have to identify a single item: the value of the attribute marked [unique=true] can be present in multiple rows in the database, and all matching items are updated.
I have a dataset.table partitioned by date (100 partitions), like this:
table_name_(100), which means: table_name_20200101, table_name_20200102, table_name_20200103, ...
Example of table_name_20200101:
| id | col_1 | col_2 | col_3 |
-----------------------------------------------------------------------------
| xxx | 2 | 6 | 10 |
| yyy | 1 | 60 | 29 |
| zzz | 12 | 61 | 78 |
| aaa | 18 | 56 | 80 |
I would like to delete the row with id = 'yyy' in all the (partitioned) tables:
DELETE FROM `project_id.dataset_id.table_name_*`
WHERE id = 'yyy'
I got this error :
Illegal operation (write) on meta-table
project_id:dataset_id.table_name_*
Is there a way to delete the rows with id = 'yyy' in all the (partitioned) tables?
Thank you
Okay, there are various things to call out here to ensure we're using consistent terminology.
You're talking about sharded tables, not partitioned. In a partitioned table, the data within the table is organized based on the partitioning specification. Here, you just have a series of tables named using a common prefix and a suffix based on date.
The use of the table_prefix* syntax is called a wildcard table, and DML is explicitly not allowed via wildcard tables: https://cloud.google.com/bigquery/docs/querying-wildcard-tables
The table_name_(100) is an aspect of how the BigQuery UI collapses series of like-named tables to save space in the navigation panes. It's not how the service itself references tables at all.
The way you can accomplish this is to leverage other aspects of BigQuery: The INFORMATION_SCHEMA tables and scripting functionality.
Information about what tables are in a dataset is available via the TABLES view: https://cloud.google.com/bigquery/docs/information-schema-tables
Information about scripting can be found here: https://cloud.google.com/bigquery/docs/reference/standard-sql/scripting
Now, here's an example that combines these concepts:
DECLARE myTables ARRAY<STRING>;
DECLARE X INT64 DEFAULT 0;
DECLARE queryStr STRING;
# First, we query INFORMATION_SCHEMA to generate an array of the tables we want to process.
# This INFORMATION_SCHEMA query currently has a LIMIT clause so that if you get it wrong,
# you won't bork all the tables in the dataset in one go.
SET myTables = (
  SELECT
    ARRAY_AGG(t)
  FROM (
    SELECT
      TABLE_NAME AS t
    FROM `my-project-id`.my_dataset.INFORMATION_SCHEMA.TABLES
    WHERE
      TABLE_TYPE = 'BASE TABLE' AND
      STARTS_WITH(TABLE_NAME, 'table_name_')
    ORDER BY TABLE_NAME
    LIMIT 2
  )
);
# Now, we process that array of tables using scripting's loop construct,
# one at a time.
LOOP
  IF X >= ARRAY_LENGTH(myTables)
    THEN LEAVE;
  END IF;
  # DANGER WILL ROBINSON: This mutates tables!!!
  #
  # The next line constructs the SQL statement we want to run for each table.
  #
  # In this example, we're constructing the same DML DELETE
  # statement to run on each table. For safety's sake, you may want to start with
  # something like a SELECT query to validate your assumptions and project the
  # myTables values to see what you're getting.
  SET queryStr = "DELETE FROM `my-project-id`.my_dataset." || myTables[SAFE_OFFSET(X)] || " WHERE id = 'yyy'";
  # Now, run the generated SQL via EXECUTE IMMEDIATE.
  EXECUTE IMMEDIATE queryStr;
  SET X = X + 1;
END LOOP;
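For a dry run, a minimal sketch (using the same hypothetical project and dataset names as above) would swap the DELETE for a SELECT inside the loop, so that each iteration just counts the rows that would be removed:
SET queryStr = "SELECT COUNT(*) AS rows_to_delete FROM `my-project-id`.my_dataset."
    || myTables[SAFE_OFFSET(X)]
    || " WHERE id = 'yyy'";
EXECUTE IMMEDIATE queryStr;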
We are dealing with a situation where we store items with a variable number of properties (it is a SaaS solution and every instance has a different set of properties). What we are struggling with is the dimension of time.
What would be the best way to store the data if we want to be able to:
Quickly get individual items.
Get the value of a property at a certain timestamp (i.e. historic info).
Note: we do not want to search for property values, we want speed :-) We will have many items with many properties, with many timestamps that we should be able to fetch as fast as possible.
Example use case of the SaaS solution: we have a ship with 10,000 sensors that collect temperature every minute. This means that we have 10,000 "items" with "temperature" as one of the properties. They will be updated every minute and we want to store the history.
Option 1. Store all in maps (Id = Primary Key)
------------------------------------------------
Id | Name | Props
------------------------------------------------
1 | Foo | map<timestamp, map<name, text>>
------------------------------------------------
2 | Bar | map<timestamp, map<name, text>>
------------------------------------------------
In the map we will have something like:
{
  "1518023285": {
    "Prop A": "Value A"
  },
  "1518011111": {
    "Prop A": "Value B",
    "Prop B": "Value C"
  }
}
Prop A and Prop B were created at the same time (1518011111); Prop A then got updated (1518023285).
We will collect the complete item and use our application to find the right value at the right time.
Option 2. Store time in maps and props as rows (Id = Primary Key)
-----------------------------------------------------------
Id | Name | Prop_A | Prop_B
-----------------------------------------------------------
1 | Foo | map<timestamp, text> | map<timestamp, text>
-----------------------------------------------------------
2 | Bar | map<timestamp, text> | map<timestamp, text>
-----------------------------------------------------------
In the column Prop_A we will have something like:
{
"1518023285": "Value B",
"1518011111": "Value A"
}
Meaning that Prop_A got created with Value A and updated later with Value B.
We will collect the complete item and use our application to find the right value at the right time.
Option 3. Properties in a map and time in a row (Id = Primary Key, ItemId has index, Time has index)
-------------------------------------------------
Id | ItemId | Name | Time | Props
-------------------------------------------------
1 | 1 | Foo | 1518011111 | map<name, text>
-------------------------------------------------
2 | 2 | Bar | 1518011111 | map<name, text>
-------------------------------------------------
3 | 2 | Bar | 1518023285 | map<name, text>
-------------------------------------------------
A map will look like:
{
"Prop A": "Value A",
"Prop B": "Value B"
}
We will collect all rows of the item and find the right time in our application.
Option 4. Properties and time in a row (Id = Primary Key, ItemId has index, Time has index)
----------------------------------------------------
Id | ItemId | Name | Time | Prop_A | Prop_B
----------------------------------------------------
1 | 1 | Foo | 1518011111 | Value A | Value B
----------------------------------------------------
2 | 2 | Bar | 1518011111 | Value A | Value B
----------------------------------------------------
3 | 2 | Bar | 1518023285 | Value A | Value C
----------------------------------------------------
Row 3 got updated.
We create two CQL queries: one to find the latest version and a second to collect the props.
CQL collections are (with some exceptions) completely deserialized into memory, which could be really bad long term. Especially from a performance perspective it's less than ideal; they exist for convenience with smaller maps, not for performance.
I would actually recommend something like Option 4, e.g. ((id, item_id), name, time, prop), where prop can just be "A" or "B" and a value field holds its value. If prop is really limited to just A-C or so, you can switch time and prop so you can query for the timeline of each property and just merge a few queries together. Be sure to change the ordering of time so that the recent data is at the beginning of the partition, for more efficient reads when getting the latest value. If there are a ton of inserts you will want to break up the partitions more, maybe by including a "year-month" in your partition key.
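A minimal CQL sketch of that idea, using the swapped (prop, time) clustering and descending time; the table and column names here are placeholders, not something from the question:
CREATE TABLE item_props (
    id int,
    item_id int,
    name text,
    prop text,
    time timestamp,
    value text,
    PRIMARY KEY ((id, item_id), prop, time)
) WITH CLUSTERING ORDER BY (prop ASC, time DESC);

-- latest value of property 'A' for one item, read from the head of the partition
SELECT value FROM item_props
WHERE id = 1 AND item_id = 1 AND prop = 'A'
LIMIT 1;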
I would go for option 3, but with a similar change to what Chris is proposing:
((id, item_id), time, name, map)
If the maps don't change within a timestamp (meaning they are read-only for that timestamp), I don't see a downside to taking advantage of the collection. It will also save you some disk space to have all the properties in one map instead of in separate columns.
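As a sketch (again, table and column names are placeholders), that layout could look like this in CQL, with one map of properties per timestamp and the newest rows first:
CREATE TABLE item_history (
    id int,
    item_id int,
    time timestamp,
    name text,
    props map<text, text>,
    PRIMARY KEY ((id, item_id), time, name)
) WITH CLUSTERING ORDER BY (time DESC, name ASC);

-- all properties of an item at (or just before) a given time
SELECT props FROM item_history
WHERE id = 1 AND item_id = 1 AND time <= '2018-02-07 16:28:05+0000'
LIMIT 1;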
So, I have a list of dictionaries like this:
dnc_info = [{'website': 'www.mdn.com', 'name': 'shubham', 'company_name': 'mdn'}, {'website': 'google.com', 'name': 'ketan', 'company_name': 'google'}, {'website': 'http://microsoft.com', 'name': 'somename', 'company_name': 'microsoft'}, {'website': None, 'name': 'somename2', 'company_name': None}, ...]  # up to 10,000 dicts
Now, I have a database (PostgreSQL) table which contains the following fields:
+--------------+-------------+--------------------+-------------+------------+
| company_name | website     | email              | campaign_id | color_code |
+--------------+-------------+--------------------+-------------+------------+
| google       | google.com  | shubham#google.com | 50          | #FFFFFF    |
| mdn          | www.mdn.com | some#mdn.com       | 50          | #FFFFFF    |
+--------------+-------------+--------------------+-------------+------------+
up to 20,000 rows
Now what I want is to be able to update the color_code field of the above table from dnc_info on the basis of the following conditions:
Condition 1: The table's company name should match the dnc_info company name, ignoring case.
Condition 2: Only the website's domain from the table should match the dnc_info website domain, ignoring case.
Condition 3: The table's email domain should match the dnc_info website's domain, also ignoring case.
Condition 4: The table's email should match the dnc_info email, also ignoring case.
I'm able to create separate lists for every object key from dnc_info like this:
website = ['mdn.com', 'google.com', 'microsoft.com']
email = ['shubham#mdn.com', 'someone#google.com']
Please suggest an optimised query based on the above conditions that will update the color_code column in the table.
Instead of using the ORM, I used a raw query and it worked for me.
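For reference, here is a minimal sketch of what such a raw statement could look like. Everything in it is an assumption rather than something from the answer above: the table is called companies, the dnc_info entries have been flattened into (company_name, domain) pairs interpolated as a VALUES list, the '#' in the sample emails stands for '@', only conditions 1-3 are covered (condition 4 would need email values in dnc_info), and you would combine the conditions with AND/OR as your business rules require:
UPDATE companies AS c
SET    color_code = '#000000'  -- whatever new color code you need
FROM  (VALUES
         ('google', 'google.com'),
         ('mdn',    'mdn.com')
      ) AS d(company_name, domain)
WHERE  LOWER(c.company_name) = LOWER(d.company_name)            -- condition 1: company name
   OR  LOWER(c.website) LIKE '%' || LOWER(d.domain)             -- condition 2: website domain
   OR  LOWER(split_part(c.email, '@', 2)) = LOWER(d.domain);    -- condition 3: email domain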
I'm using a 2-D ArrayList to store the query results of sql.rows in SoapUI Groovy. Outputrows, in the code below, is an ArrayList.
Outputrows = sql.rows("""select CORR.Preferred as preferred, CORR.Category as category, CORR.Currency as currency
    from BENEFICIARY CORR
    join LOCATION LOC on CORR.UID = LOC.UID""")
The problem with the ArrayList is that I'm unable to update the value of a particular cell with the set command; set is not a valid method on the GroovyRowResult class.
Outputrows.get(row).set(col,categoryValue)
So I am just wondering if I can store the query results (Outputrows) in a 2-D map instead, and if so, how I can update the value of any particular row using a map key:
[{'preferred': 'N', 'category': 'Commercial', 'currency': 'USD'}, ...] and so on.
If I want to update the currency for the 3rd row, how can I do that?
Data in the output
Preferred | Category | Currency |
----------------------------------
N | CMP | USD |
----------------------------------
Y | RTL | GBP |
----------------------------------
N | CMP | JPY |
----------------------------------
Y | RTL | USD |
----------------------------------
Now here in 'Outputrows' the values are stored starting from the first row (N, CMP, USD) as an ArrayList. I would like to store the values of the query result, 'Outputrows', as maps instead of an ArrayList, so I can easily access any value in 'Outputrows' with a map key.
Hope this makes sense.
I need to use the column name with put instead of the column number.
Outputrows.get(row).put("currency", categoryValue)   // this is correct
Outputrows.get(row).put(2, categoryValue)            // adds a new column named "2" instead of referring to the currency column
Let's say I have users. Those users can have access to multiple projects, and a project can also allow multiple users.
So I model four tables: users (by id), projects (by id), projects_by_user and users_by_project.
-----------   ------------   --------------------   --------------------
| users   |   | projects |   | projects_by_user |   | users_by_project |
|---------|   |----------|   |------------------|   |------------------|
| id    K |   | id     K |   | user_id        K |   | project_id     K |
| name    |   | name     |   | project_id     C |   | user_id        C |
-----------   ------------   | project_name   S |   | user_name      S |
                              --------------------   --------------------
So I'm storing the user_name in the users_by_project table and the project_name in the projects_by_user table for querying.
The problem I have is that when a user updates the project_name, this will of course update the projects table. But for data consistency I also need to update each partition in the projects_by_user table.
As far as I can see, this is only possible by querying all the users from the users_by_project table and doing an update for each user.
Is there any better way without first reading lots of data?
I don't see why you need four tables. Your users and projects tables could contain all of the data.
If you define the tables like this:
CREATE TABLE users (
    user_id int PRIMARY KEY,
    name text,
    project_ids list<int>
);
CREATE TABLE projects (
    project_id int PRIMARY KEY,
    name text,
    user_ids list<int>
);
Then each user would have a list of project ids they have access to, and each project would have a list of users that have access to it.
To add access to project 123 to user 1 you would run:
BEGIN BATCH
UPDATE users SET project_ids = project_ids + [123] WHERE user_id=1;
UPDATE projects SET user_ids = user_ids + [1] WHERE project_id=123;
APPLY BATCH;
To change a project name, you would just do:
UPDATE projects SET name = 'New project name' WHERE project_id=123;
For simplicity I showed the id fields as ints, but normally you would use uuids for those.
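Reading with this model then takes two steps. For example, to list the project names a user has access to (a sketch against the tables above):
SELECT project_ids FROM users WHERE user_id = 1;
-- then, with the returned ids:
SELECT project_id, name FROM projects WHERE project_id IN (123, 456);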
I don't think there is a better way. Cassandra has a lot of limitations on the queries you can make. In your case, you have to create a compound key (user_id, project_id), and in order to update it you have to provide both parts in the WHERE clause, which means you have to read all users for a specific project and update each of them. If you have a large database and this scenario happens often, this would be a significant overhead, so I guess it would be better to remove the project_name field from the table and perform the join of projects and projects_by_user at the application level.
BTW: the scenario you described here fits a relational database model better, so if the rest of your data model is similar to this, I would think about using a relational database instead.