What is the best way to manage an upsert with CloudSpanner based on some conditions? - google-cloud-spanner

I have this (simplified) case, with 2 tables related in this way:
CREATE TABLE a (ida STRING(36) NOT NULL, name STRING(15)) PRIMARY KEY (ida);
CREATE TABLE b (idb INT64 NOT NULL, ida STRING(36) NOT NULL) PRIMARY KEY (ida, idb)
INTERLEAVE IN PARENT a ON DELETE CASCADE;
where ida is a UUID4 string that I generate in my code (Python 3).
In my case, a batch of a few thousand "tuples" (idb, name) is sent to my service.
If idb does not exist in table b, then I create a UUID4 and do the following inserts:
my_uuid_1 = str(uuid.uuid4())  # generated via Python 3
idb = 123 # received from the request
INSERT a (ida, name) VALUES (my_uuid_1, 'John')
INSERT b (idb, ida) VALUES (123, my_uuid_1)
If idb exists in table b, then just update table a with the new name, if it has changed.
This process needs to run as a batch over multiple records, any of which can hit either case described above. To do this with Cloud Spanner I have been looking at the following functionality:
def _unit_of_work(transaction):
    try:
        transaction.insert_or_update(
            table=table,
            columns=columns,
            values=values,
        )
    except BadRequest as err:
        logging.error(f'Error: {err.args}')
        raise
spanner_database = spanner_instance.database(database_id=my_database_id)
spanner_database.run_in_transaction(_unit_of_work)
but I cannot see a way to use it with the condition described above. Am I looking in the right direction, or is there a better way to do it?

In your _unit_of_work function, you could use the read API (https://googleapis.dev/python/spanner/latest/transaction-usage.html#read-table-data) or the query API (https://googleapis.dev/python/spanner/latest/transaction-usage.html#execute-a-sql-select-statement) to read idb from table b and branch on the result. run_in_transaction will then execute these reads and writes in a single transaction.

Related

mariadb python - executemany using SELECT

I'm trying to insert many rows into a table in MariaDB.
To do this I want to use executemany() to increase speed.
Each inserted row depends on another table, which is looked up with a SELECT.
I have found statements saying that SELECT doesn't work in executemany().
Are there other ways to solve this problem?
import mariadb
connection = mariadb.connect(host=HOST,port=PORT,user=USER,password=PASSWORD,database=DATABASE)
cursor = connection.cursor()
query="""INSERT INTO [db].[table1] ([col1], [col2] ,[col3])
VALUES ((SELECT [colX] from [db].[table2] WHERE [colY]=? and
[colZ]=(SELECT [colM] from [db].[table3] WHERE [colN]=?)),?,?)
ON DUPLICATE KEY UPDATE
[col2]= ?,
[col3] =?;"""
values=[input_tuplets]
When running the code I get the same value for [col1] (the SELECT statement) in every row; it corresponds to the values from the first tuple.
If SELECT doesn't work in executemany(), is there another workaround for what I'm trying to do?
Thanks a lot!
I think the way to go is to read out the tables needed,
do the lookup in Python,
and then use executemany() to insert all the data.
It will require two more queries (to read the tables), but the run time will be acceptable.
Thanks for your first question on Stack Overflow, which identified a bug in MariaDB Server.
Here is a simple script to reproduce the problem:
CREATE TABLE t1 (a int);
CREATE TABLE t2 LIKE t1;
INSERT INTO t2 VALUES (1),(2);
Python:
>>> cursor.executemany("INSERT INTO t1 VALUES \
((SELECT a FROM t2 WHERE a=?))", [(1,),(2,)])
>>> cursor.execute("SELECT a FROM t1")
>>> cursor.fetchall()
[(1,), (1,)]
I have filed an issue in MariaDB Bug tracking system.
As a workaround, I would suggest reading the country table once into an array (according to Wikipedia there are 195 different countries) and use these values instead of a subquery.
e.g.
countries= {}
cursor.execute("SELECT country, id FROM countries")
for row in cursor:
countries[row[0]]= row[1]
and then in executemany:
cursor.executemany("INSERT INTO region (region, id_country) VALUES ('south', ?)", [(countries["fra"],), (countries["ger"],)])
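For reference, here is a minimal end-to-end sketch of that workaround. It uses the standard-library sqlite3 module as a stand-in for the MariaDB connection so that it runs without a server; the column types and the pending input rows are assumptions made for the example. Both drivers take ? as the placeholder, so the lookup-then-executemany part should carry over.

```python
import sqlite3

# sqlite3 stands in for the MariaDB connection so the sketch runs anywhere.
conn = sqlite3.connect(":memory:")
cursor = conn.cursor()
cursor.execute("CREATE TABLE countries (id INTEGER PRIMARY KEY, country TEXT)")
cursor.execute("CREATE TABLE region (region TEXT, id_country INTEGER)")
cursor.executemany(
    "INSERT INTO countries (id, country) VALUES (?, ?)",
    [(1, "fra"), (2, "ger")],
)

# One query loads the whole lookup table into a dict ...
cursor.execute("SELECT country, id FROM countries")
countries = {country: country_id for country, country_id in cursor.fetchall()}

# ... so every parameter tuple already carries the resolved id
# and no subquery is needed inside executemany().
pending = [("south", "fra"), ("north", "ger")]  # hypothetical input rows
rows = [(region, countries[code]) for region, code in pending]
cursor.executemany("INSERT INTO region (region, id_country) VALUES (?, ?)", rows)
conn.commit()

cursor.execute("SELECT region, id_country FROM region ORDER BY id_country")
inserted = cursor.fetchall()
```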

Delta table merge on multiple columns

I have a table whose primary key spans multiple columns, so I need to perform the merge logic on multiple columns:
DeltaTable.forPath(spark, "path")
  .as("data")
  .merge(
    finalDf1.as("updates"),
    "data.column1 = updates.column1 AND data.column2 = updates.column2 AND data.column3 = updates.column3 AND data.column4 = updates.column4 AND data.column5 = updates.column5")
  .whenMatched
  .updateAll()
  .whenNotMatched
  .insertAll()
  .execute()
When I check the row counts, the data is not being updated as expected.
Could someone help me with this?
Please also try an approach like the one in this example: https://docs.databricks.com/_static/notebooks/merge-in-cdc.html
Create a changes table with additional columns in which you record whether a row is:
new (to be inserted)
old (primary key exists) and nothing has changed
old (primary key exists) but other fields need an update
and then use additional conditions in the merge, for example:
.whenNotMatched("s.new = true")
.insertAll()
.whenMatched("s.updated = true")
.updateExpr(Map("key" -> "s.key", "value" -> "s.newValue"))
How are you counting your rows?
One thing to keep in mind is that directly reading and counting from the parquet files produced by Delta Lake will potentially give you a different result than reading the rows through the delta table interface. Remember that delta keeps a log and supports time travel so it does store copies of rows as they change over time.
Here's a way to accurately count the current rows in a delta table:
deltaTable = DeltaTable.forPath(spark,<path to your delta table>)
deltaTable.toDF().count()
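On the multi-column condition itself: the predicate can be generated from the list of key columns instead of being typed by hand, which avoids copy-paste slips. The sketch below is only an illustration; merge_condition is a hypothetical helper, and the data and updates aliases are taken from the question. The optional null_safe flag switches to Spark SQL's <=> operator, because a plain = never matches when a key column is NULL, and such rows would then be inserted rather than updated. Whether that applies to the data in the question is not known.

```python
def merge_condition(key_columns, target="data", source="updates", null_safe=False):
    """Build the join predicate for a merge on several key columns."""
    # "<=>" is Spark SQL's null-safe equality; plain "=" never matches NULL keys.
    op = "<=>" if null_safe else "="
    return " AND ".join(
        f"{target}.{col} {op} {source}.{col}" for col in key_columns
    )


keys = ["column1", "column2", "column3", "column4", "column5"]
condition = merge_condition(keys)
```

The resulting string can be passed as the condition argument of merge in place of the hand-written one.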

How to make a lookup-table in cassandra

I want to create a table in Cassandra that is used as a lookup table. I have a lot of URLs in my database and want to store ids instead of the URL strings. So my approach is to store the URLs in a table with two columns: id (int) and url (text).
My problem is that I need an index on the url field and also on the id field.
The first index is used while processing new URLs (to find the id for a URL in the database) and the second index is used while displaying data (to get the URL for an id).
How can I implement that in Cassandra?
I would suggest creating 2 separate tables for this:
CREATE TABLE id_url (id int primary key, url text);
and
CREATE TABLE url_id (url text primary key, id int);
Inserts to these tables should be done with a batch:
BEGIN BATCH
INSERT INTO id_url (id, url) VALUES (1, '<url1>');
INSERT INTO url_id (url, id) VALUES ('<url1>', 1);
APPLY BATCH;
You could create your table like this:
CREATE TABLE urls_table(
id int PRIMARY KEY,
url text
);
and then create an index on the second column:
create index urls_table_url on urls_table (url);
The lookup by id is satisfied because id is the partition key. The lookup by url is satisfied because you created an index on the url column.

Foreign keys Sqlite3 Python3

I have been having some trouble with my understanding of how foreign keys work in sqlite3.
I'm trying to get the userid (james) from one table, userstuff, to appear as a foreign key in my otherstuff table. Yet when I query, it returns None.
So far I have tried:
Enabling foreign key support
Rewriting a test script (the one being discussed here) to isolate the issue
Rewriting some code after finding issues in how I had initially written it
After some research I came across joins, but I do not think they are the solution, since as far as I am aware my current query is an alternative to a join
Code
import sqlite3 as sq
class DATAB:
    def __init__(self):
        self.conn = sq.connect("Atest.db")
        self.conn.execute("pragma foreign_keys")
        self.c = self.conn.cursor()
        self.createtable()
        self.defaultdata()
        self.show_details() # NOTE DEFAULT DATA ALREADY RAN
    def createtable(self):
        self.c.execute("CREATE TABLE IF NOT EXISTS userstuff("
                       "userid TEXT NOT NULL PRIMARY KEY,"
                       " password TEXT)")
        self.c.execute("CREATE TABLE IF NOT EXISTS otherstuff("
                       "anotherid TEXT NOT NULL PRIMARY KEY,"
                       "password TEXT,"
                       "user_id TEXT REFERENCES userstuff(userid))")
    def defaultdata(self):
        self.c.execute("INSERT INTO userstuff (userid, password) VALUES (?, ?)", ('james', 'password'))
        self.c.execute("INSERT INTO otherstuff (anotherid, password, user_id) VALUES (?, ?, ?)",('aname', 'password', 'james'))
        self.conn.commit()
    def show_details(self):
        self.c.execute("SELECT user_id FROM otherstuff, userstuff WHERE userstuff.userid=james AND userstuff.userid=otherstuff.user_id")
        print(self.c.fetchall())
        self.conn.commit()
-----NOTE CODE BELOW THIS IS FROM NEW FILE---------
import test2 as ts
x = ts.DATAB()
Many thanks
A foreign key constraint is just that, a constraint.
This means that it prevents you from inserting data that would violate the constraint; in this case, it would prevent you from inserting a non-NULL user_id value that does not exist in the parent table.
By default, foreign key constraints allow NULL values. If you want to prevent otherstuff rows without a parent row, add a NOT NULL constraint to the user_id column.
In any case, a constraint does not magically generate data (and the database cannot know which ID you want). If you want to reference a specific row of the parent table, you have to insert its ID.
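To make this concrete, here is a small sketch using an in-memory database and the table definitions from the question, with NOT NULL added to user_id as suggested above. It also assumes two adjustments to the question's code: the pragma is given a value (PRAGMA foreign_keys = ON), and 'james' is bound as a query parameter.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Enforcement is usually off by default and is switched on per connection;
# "PRAGMA foreign_keys" without a value only reports the current setting.
conn.execute("PRAGMA foreign_keys = ON")
conn.execute(
    "CREATE TABLE userstuff (userid TEXT NOT NULL PRIMARY KEY, password TEXT)"
)
conn.execute(
    "CREATE TABLE otherstuff ("
    "anotherid TEXT NOT NULL PRIMARY KEY, "
    "password TEXT, "
    "user_id TEXT NOT NULL REFERENCES userstuff(userid))"
)
conn.execute("INSERT INTO userstuff VALUES (?, ?)", ("james", "password"))
# The child row carries the parent's ID explicitly.
conn.execute(
    "INSERT INTO otherstuff VALUES (?, ?, ?)", ("aname", "password", "james")
)

# A child row pointing at a missing parent is rejected by the constraint.
try:
    conn.execute(
        "INSERT INTO otherstuff VALUES (?, ?, ?)", ("bname", "password", "nobody")
    )
    rejected = False
except sqlite3.IntegrityError:
    rejected = True

# Bind 'james' as a parameter; a bare james would be parsed as a column name.
rows = conn.execute(
    "SELECT otherstuff.user_id FROM otherstuff "
    "JOIN userstuff ON userstuff.userid = otherstuff.user_id "
    "WHERE userstuff.userid = ?",
    ("james",),
).fetchall()
```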

Inserting/Updating sqlite table from python program

I have a sqlite3 table as shown below
Record(WordID INTEGER PRIMARY KEY, Word TEXT, Wordcount INTEGER, Docfrequency REAL).
I want to create the table and insert data into it if the table does not exist; otherwise I want to update the table so that only the 'Wordcount' column is updated, using the value in the 'Word' column as the reference. I am trying to do this from a Python program like this:
import sqlite3
conn = sqlite3.connect("mydatabase")
c = conn.cursor()
#Create table
c.execute("CREATE TABLE IF NOT EXISTS Record(WordID INTEGER PRIMARY KEY, Words TEXT, Wordcount INTEGER, Docfrequency REAL)")
#Update table
c.execute("UPDATE TABLE IF EXISTS Record")
#Insert a row of data
c.execute("INSERT INTO Record values (1,'wait', 9, 10.0)")
c.execute("INSERT INTO Record values (2,'Hai', 5, 6.0)")
#Updating data
c.execute("UPDATE Record SET Wordcount='%d' WHERE Words='%s'" %(11,'wait') )
But I can't update the table. On running the program I am getting the error message as
c.execute("UPDATE TABLE IF EXISTS Record")
sqlite3.OperationalError: near "TABLE": syntax error
How should I write the code to update the table?
Your SQL query for UPDATE is invalid - see the documentation.
Also, I don't understand why you'd want to check for the table's existence when updating, given that just before that you're creating it if it doesn't exist.
If your goal is to update an entry if it exists or insert it if it doesn't, you might do it either by:
First doing an UPDATE and checking the number of rows updated. If 0, you know the record didn't exist and you should INSERT instead.
First doing an INSERT - if there's an error related to constraint violation, you know the entry already existed and you should do an UPDATE instead.
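The first option (UPDATE, then INSERT if nothing was updated) might look like the sketch below. It uses an in-memory database and the column names from the question's CREATE TABLE statement; set_wordcount and its docfrequency default are hypothetical, introduced only for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
c = conn.cursor()
c.execute(
    "CREATE TABLE IF NOT EXISTS Record("
    "WordID INTEGER PRIMARY KEY, Words TEXT, Wordcount INTEGER, Docfrequency REAL)"
)


def set_wordcount(cursor, word, count, docfrequency=0.0):
    """Update Wordcount for an existing word, or insert the word if missing."""
    cursor.execute("UPDATE Record SET Wordcount = ? WHERE Words = ?", (count, word))
    if cursor.rowcount == 0:
        # No row matched, so the word is new; WordID is assigned by SQLite.
        cursor.execute(
            "INSERT INTO Record (Words, Wordcount, Docfrequency) VALUES (?, ?, ?)",
            (word, count, docfrequency),
        )


set_wordcount(c, "wait", 9, 10.0)  # no matching row yet, so this inserts
set_wordcount(c, "wait", 11)       # row exists, so only Wordcount changes
conn.commit()
result = c.execute("SELECT Words, Wordcount, Docfrequency FROM Record").fetchall()
```

Passing the values as ? parameters, rather than formatting them into the SQL string with %, also avoids quoting problems.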
