How to overwrite data with PySpark's JDBC without losing schema? - apache-spark

I have a DataFrame that I want to write to a PostgreSQL database. If I simply use the "overwrite" mode, like:
df.write.jdbc(url=DATABASE_URL, table=DATABASE_TABLE, mode="overwrite", properties=DATABASE_PROPERTIES)
The table is recreated and the data is saved. The problem is that I'd like to keep the PRIMARY KEY and indexes on the table. So I'd like either to overwrite only the data while keeping the table schema, or to add the primary key constraint and indexes afterward. Can either be done with PySpark? Or do I need to connect to PostgreSQL and execute the commands to add the indexes myself?

The default behavior for mode="overwrite" is to first drop the table, then recreate it with the new data. You can instead have Spark truncate the existing table and then insert your data by including option("truncate", "true"):
df.write.option("truncate", "true").jdbc(url=DATABASE_URL, table=DATABASE_TABLE, mode="overwrite", properties=DATABASE_PROPERTIES)
This way, you are not recreating the table so it shouldn't make any modifications to your schema.
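A minimal sketch of the same call wrapped in a helper, in case it is used in several places. The function name overwrite_rows_keep_table is hypothetical; the only Spark calls used are the ones from the answer (DataFrameWriter.option and DataFrameWriter.jdbc). Note that, as far as I know, the truncate option only helps when the DataFrame's columns are compatible with the existing table; if the schema differs, the write may fail and the table would have to be dropped and recreated after all.

```python
def overwrite_rows_keep_table(df, url, table, properties):
    """Replace the rows of an existing JDBC table without dropping it.

    `df` is expected to be a pyspark.sql.DataFrame. With truncate=true,
    mode="overwrite" asks Spark to TRUNCATE the table instead of
    DROP + CREATE, so the primary key and indexes stay in place.
    """
    writer = df.write.option("truncate", "true")
    writer.jdbc(url=url, table=table, mode="overwrite", properties=properties)


# Usage, assuming the names from the question are defined:
# overwrite_rows_keep_table(df, DATABASE_URL, DATABASE_TABLE, DATABASE_PROPERTIES)
```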

Related

How to force an ADX to update from source with all historical data

Does anyone know a way to force a child table to update from the source table? It would be a one-off command to run when the child table is created; after that we have an automatic update policy in place.
Googling suggests trying this, but it produces a syntax error:
.update policy childTable with (sourceTable)
Thanks!:)
Update policy is a mechanism that works during the ingestion and does not support backfill.
You can consider using materialized view with backfill property (if your transformation logic falls under the limitations) or create the child table based on a query, using the .set command.
If your source table is huge, you might need to split the backfill into multiple commands.
We had to use this:
.append childTable <| updateFunction
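To illustrate the "split it into multiple commands" advice, here is a rough Python sketch that only generates the command text, one command per time window. It assumes the source table has a datetime column to window on (called Timestamp here, which is a hypothetical name) and uses the .set-or-append command so the first window creates the table and later ones append. The helper name backfill_commands and the optional transform string are likewise assumptions; adapt the query to your own update function.

```python
from datetime import datetime, timedelta


def backfill_commands(child, source, time_column, start, end, step, transform=""):
    """Yield one .set-or-append command per [lo, hi) time window."""
    if step <= timedelta(0):
        raise ValueError("step must be positive")
    lo = start
    while lo < end:
        hi = min(lo + step, end)
        # Filter the source first, then apply the (optional) transformation.
        query = (
            f"{source} | where {time_column} >= datetime({lo.isoformat()}) "
            f"and {time_column} < datetime({hi.isoformat()})"
        )
        if transform:
            query += f" | {transform}"
        yield f".set-or-append {child} <| {query}"
        lo = hi


commands = list(
    backfill_commands(
        "childTable", "sourceTable", "Timestamp",
        datetime(2024, 1, 1), datetime(2024, 1, 3), timedelta(days=1),
    )
)
for command in commands:
    print(command)
```

Each printed line can then be run as a separate management command, which keeps every individual ingestion small.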

Write data frame to hive table in spark

Could you please tell me whether this command could cause all tables in the DB to be overwritten:
df.write.option("path", "path_to_the_db/hive/").mode("overwrite").saveAsTable("result_data")
result_data is a new table in the DB; it did not exist before.
After these commands, all tables disappeared.
I was using Spark3 and tried to solve an error:
Can not create the managed table('result_data').
The associated location('dbfs:/user/hive/warehouse/result_data') already exists.
I expected that a new table will be created without any issues if it doesn’t exist.
If path_to_the_db/hive contains other tables and you overwrite into that folder, it seems possible that the whole directory would be emptied first, yes. Perhaps you should use path_to_the_db/hive/result_data instead.
According to the error, though, your table does already exist.
You can use Spark to register a temporary table in SQL code, then run INSERT OVERWRITE query for existing tables.
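A small sketch of the "one directory per table" suggestion, assuming the base path from the question. The helper name save_table_in_own_dir is hypothetical; the Spark calls are the same ones used in the question (option("path", ...), mode("overwrite"), saveAsTable). The idea is that an overwrite can then only clear the table's own subdirectory, not the directory shared with other tables.

```python
import posixpath


def save_table_in_own_dir(df, base_path, table_name):
    """Write `df` as a table whose files live in base_path/table_name."""
    # Join with "/" regardless of the local OS, since this is a storage path.
    table_path = posixpath.join(base_path, table_name)
    (
        df.write
        .option("path", table_path)
        .mode("overwrite")
        .saveAsTable(table_name)
    )
    return table_path


# Usage, assuming `df` is a pyspark.sql.DataFrame:
# save_table_in_own_dir(df, "path_to_the_db/hive/", "result_data")
```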

Is there a method to drop a keyspace without removing the schema in Cassandra? If there is, how would I do this?

There is a script that exports the Cassandra schema and generates two cql files. These files are then run by the schema-restoring script.
I had previously dropped the keyspace. While restoring the schema I am getting this error: cannot add column family to nonexisting keyspaces "graphdb"
It sounds to me like you are attempting to create the table in a keyspace that has not been created yet. Make sure that the statements in the two exported cql files run in the right order: the keyspace needs to be created before you attempt to create the tables.
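If the exported statements are available as a list of strings, the ordering requirement can be sketched in Python as below. This is only an illustration under that assumption: the helper name order_schema_statements is hypothetical, the classification is a simple regular-expression match on the start of each statement, and real exports may contain statements this does not anticipate.

```python
import re

# Earlier patterns must run first: keyspaces, then user types, then tables.
_ORDER = [
    re.compile(r"^\s*CREATE\s+KEYSPACE\b", re.IGNORECASE),
    re.compile(r"^\s*CREATE\s+TYPE\b", re.IGNORECASE),
    re.compile(r"^\s*CREATE\s+(TABLE|COLUMNFAMILY)\b", re.IGNORECASE),
]


def _rank(statement):
    """Return the position of the first matching pattern, or last place."""
    for rank, pattern in enumerate(_ORDER):
        if pattern.match(statement):
            return rank
    return len(_ORDER)


def order_schema_statements(statements):
    """Return statements with keyspace creation ahead of table creation.

    sorted() is stable, so statements of the same kind keep their order.
    """
    return sorted(statements, key=_rank)


exported = [
    "CREATE TABLE graphdb.nodes (id uuid PRIMARY KEY);",
    "CREATE KEYSPACE graphdb WITH replication = "
    "{'class': 'SimpleStrategy', 'replication_factor': 1};",
]
ordered = order_schema_statements(exported)
```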

Data modelling for consistent secondary keys with Cassandra

With Cassandra,
I want to represent every user object with a unique uuid, and also let it carry a set of zero or more secondary user keys that map to the user. Each secondary key should map to one and only one user(id). Because I need to be able to quickly look up a user by secondary key, I maintain a separate lookup table instead of a secondary INDEX.
I've modelled the data like this, but I am open to alternatives:
CREATE TABLE users (
userid uuid PRIMARY KEY,
name text,
secondarykeys set<text>
);
CREATE TABLE user_secondarykeys (
secondarykey text,
userid uuid,
PRIMARY KEY(secondarykey)
);
A typical use case is this:
I have a user with the secondary key mail:andreas#example.org, and I would like to check whether any user with that secondary key exists; if none exists, I would like to create a new user object.
I can look for the secondary key:
SELECT * FROM "user_secondarykeys" WHERE secondarykey = 'mail:andreas#example.org';
and if I do not find any matches, I can insert a new user:
BEGIN BATCH
INSERT INTO users (userid, name, secondarykeys) VALUES (77059e45-5fac-460b-9c4f-47528c292be0, 'Andreas', {'mail:andreas#example.org'});
INSERT INTO user_secondarykeys (secondarykey, userid) VALUES ('mail:andreas#example.org', 77059e45-5fac-460b-9c4f-47528c292be0);
APPLY BATCH;
My problem is that this can lead to inconsistent data, because another user can be inserted with that secondary key between my select and my inserts.
I'm thinking that if I can make my INSERT transaction fail when the secondary key already exists in user_secondarykeys, that would work, because it should then also revert the insert into the users table, thanks to the atomic property of the transaction. However, I do not know any way to make the INSERT fail if the secondary key exists. If I add IF NOT EXISTS to the second insert, it will not revert the transaction; it will just avoid inserting into user_secondarykeys, but it will still insert into users.
Any suggestions on how to implement this use case in a reliable way are appreciated. Thanks.
First of all, I think your model is fairly complicated, and I'm not sure I correctly understand all of your requirements.
If you receive the secondary key first and then have to decide whether or not to add a user, the following will work for you.
Instead of checking the user_secondarykeys table with a SELECT statement for the presence of a particular secondary key, run this:
INSERT INTO user_secondarykeys (secondarykey, userid) VALUES ('mail:andreas#example.org', 77059e45-5fac-460b-9c4f-47528c292be0) IF NOT EXISTS;
If it is applied, this secondary key was not yet connected to any user, which leaves two cases: the user doesn't exist, or the user exists and someone wants to add a new secondary key for them. The following does the job in both cases (it is written as an UPDATE because CQL only accepts the "set = set + {...}" form in an UPDATE, and an UPDATE in Cassandra creates the row if it is missing):
UPDATE users SET name = 'Andreas', secondarykeys = secondarykeys + {'mail:andreas#example.org'} WHERE userid = 77059e45-5fac-460b-9c4f-47528c292be0;
Because inserts/updates in Cassandra are idempotent (except for counters), this works even if there is already a user with that id in the users table; it simply adds another secondary key for them.
The advantage of this solution is that it removes the time gap that can make your data inconsistent. You have a guarantee that no one will insert two users with the same secondary key. You specified that a user can have no secondary keys at all; in that situation you can add them straight to the users table.
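The two steps above could be driven from application code roughly as follows. This is a sketch, assuming a session object that behaves like the DataStax Python driver's Session (execute(query, params) with %s placeholders, and a was_applied flag on the result of a conditional statement). The function name claim_key_and_upsert_user is hypothetical, and the second step is written as an UPDATE, which in Cassandra acts as an upsert.

```python
import uuid

CLAIM_KEY = (
    "INSERT INTO user_secondarykeys (secondarykey, userid) "
    "VALUES (%s, %s) IF NOT EXISTS"
)
UPSERT_USER = (
    "UPDATE users SET name = %s, secondarykeys = secondarykeys + %s "
    "WHERE userid = %s"
)


def claim_key_and_upsert_user(session, secondary_key, name, userid=None):
    """Claim the secondary key first; touch `users` only if the claim won.

    Returns the user id on success, or None if the key was already taken.
    """
    userid = userid or uuid.uuid4()
    # Step 1: lightweight transaction, applied only if the key is unused.
    result = session.execute(CLAIM_KEY, (secondary_key, userid))
    if not result.was_applied:
        return None
    # Step 2: upsert the user and add the key to its set.
    session.execute(UPSERT_USER, (name, {secondary_key}, userid))
    return userid
```

One thing to keep in mind with this design: the two statements are not atomic together, so if the second one fails after the first succeeded, the key stays claimed without a user row, and the caller would need to retry the second step.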
Regarding this part of your question: "I'm thinking that if I can make my INSERT transaction fail if the secondary key already exists in user_secondarykeys, that would work, because it should then also revert the insert into the users table, because of the atomic property of the transaction. However, I do not know any ways to make the INSERT fail if the secondary key exists. If I add IF NOT EXISTS to the second insert, it will not revert the transaction; it will just avoid inserting into user_secondarykeys, but it will still insert into users."
Since Cassandra 2.0.6 you can use conditional statements inside a batch, and if any of the conditions is not met, none of the statements in that batch is applied. This sounds great, but there is a limitation: all of the statements inside the batch have to operate on a single partition. Consequently, it is impossible to make a cross-partition or cross-table conditional insert/update/delete. So in your case this:
BEGIN BATCH
INSERT INTO users (userid, name, secondarykeys) VALUES (77059e45-5fac-460b-9c4f-47528c292be0, 'Andreas', {'mail:andreas#example.org'});
INSERT INTO user_secondarykeys (secondarykey, userid) VALUES ('mail:andreas#example.org', 77059e45-5fac-460b-9c4f-47528c292be0) IF NOT EXISTS;
APPLY BATCH;
would not even pass query validation, because it tries to operate on two different tables.
I'm not sure whether this suits your other requirements; I would need more information about your queries and the velocity/volume of the data. There are certainly other ways of modeling this.
It would greatly simplify the problem if every user had to have at least one secondary key (e.g. email would be a great unique key for your users table), but those are your requirements, so unless you can change them there is nothing to discuss.
Hope this will help you a bit.
Good luck!

Migrator.net drop table in Up(), what to do in Down()?

I have Migrator.net implemented in my project and I am removing a table from the current schema. My Up() simply contains Database.RemoveTable("FooTable"). But now I'm at a bit of a loss as to what I'm supposed to do for my Down(). Do I need to manually parse all past migrations for modifications on FooTable? Is there a way to run all previous migrations on FooTable in Down()?
What about the data? If there were 50,000 rows, recreating an empty table doesn't roll back to the previous state.
To enable database downgrades with data you need to:
In Up(), detach the table from your data model (e.g. drop FKs) and rename it to something like DeleteMe_FooTable. Do not actually drop it, however.
In Down(), reattach it to your data model: rename it to its original name and restore the FKs.
A few days/weeks after the deployment, when you are 100% sure you will never roll back, a DBA can manually delete the table.
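The rename-instead-of-drop idea is independent of Migrator.net, so here is a small self-contained illustration using Python's built-in sqlite3 module rather than the Migrator.net API. The functions up and down are stand-ins for the migration's Up() and Down(); the foreign-key handling mentioned above is left out to keep the sketch short.

```python
import sqlite3


def up(conn):
    """'Remove' the table by renaming it, so its rows survive."""
    conn.execute("ALTER TABLE FooTable RENAME TO DeleteMe_FooTable")


def down(conn):
    """Undo up(): give the table its original name back."""
    conn.execute("ALTER TABLE DeleteMe_FooTable RENAME TO FooTable")


def table_names(conn):
    """Return the names of all tables currently in the database."""
    rows = conn.execute("SELECT name FROM sqlite_master WHERE type = 'table'")
    return {row[0] for row in rows}


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE FooTable (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany("INSERT INTO FooTable (name) VALUES (?)", [("a",), ("b",)])

up(conn)
tables_after_up = table_names(conn)

down(conn)
rows_after_down = conn.execute("SELECT COUNT(*) FROM FooTable").fetchone()[0]
```

After up() the application no longer sees FooTable, and after down() both rows are back, which a drop-and-recreate Down() could not achieve.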
The idea of Down() is that it reverses the effects of your Up() method, so technically if you ran Up() and then Down() right after, your database schema would be back where it started.
In your case you would have to recreate the table in your Down().

Resources