Update existing rows, while altering Cassandra table - cassandra

I have some table in Cassandra, and I need to add field with default data.
Is there way, to add default value to already existing rows, without updating all data manually?
ALTER TABLE data ADD some_bool bool; // Make it false for all existing records.
(Docs: ALTER TABLE Does not update existing rows)

You have to take care of that at application level when you retrieve the rows. Cassandra will return data to the client as NULL, so everything depends on the driver and language you use. Check the driver's documentation to find out if the returned values are null or real values. They usually have an isNull method to perform such checks.

Related

Insert or Update a delta table from a dataframe in Pyspark

I have a pyspark dataframe currently from which I initially created a delta table using below code -
df.write.format("delta").saveAsTable("events")
Now, since the above dataframe populates the data on daily basis in my requirement, hence for appending new records into delta table, I used below syntax -
df.write.format("delta").mode("append").saveAsTable("events")
Now this whole thing I did in databricks and in my cluster. I want to know how can I write generic pyspark code in python that will create delta table if it does not exists and append records if delta table exists.This thing I want to do because if I give my python package to someone, they will not have the same delta table in their environment so it should get created dynamically from code.
If you don't have Delta table yet, then it will be created when you're using the append mode. So you don't need to write any special code to handle the case when table doesn't exist yet, and when it exits.
P.S. You'll need to have such code only in case if you're performing merge into the table, not append. In this case the code will looks like this:
if table_exists:
do_merge
else:
df.write....
P.S. here is a generic implementation of that pattern
There are eventually two operations available with spark
saveAsTable:- create or replace the table if present or not with the current DataFrame
insertInto:- Successful if the table present and perform operation based on the mode('overwrite' or 'append'). it requires the table to be available in the database.
The .saveAsTable("events") Basically rewrites the table every time you call it. which means that, even if you have a table present earlier or not, it will replace the table with the current DataFrame value. Instead, you can perform the below operation to be in the safer side:
Step 1: Create the table even if it is present or not. If present, remove the data from the table and append the new data frame records, else create the table and append the data.
df.createOrReplaceTempView('df_table')
spark.sql("create table IF NOT EXISTS table_name using delta select * from df_table where 1=2")
df.write.format("delta").mode("append").insertInto("events")
So, every time it will check if the table is available or not, else it will create the table and move to next step. Else, if the table is available, then append the data into the table.

Saving an existing item in table from dataFrame

I have a dataframes which have few rows among them some already exists in db. I want to update few columns of existing rows. How can we do that?
I see we have SaveModes:
append and override which might serve the purpose but there is a limitation in both the cases.
With append, I am getting primary key error, as this option tries to create a new row in db
With ovverride, I will loose values for the unchanged attributes in the tuple.
Can someone please suggest how can I update few attributes(Columns values) of a row(tuple).?
This can be handled in MySql level, The concept is known as upsert.
case when : primary key is new
The SQL will insert into MySQL DB as new row
Case when : primary key is existing
You can use
INSERT
ON DUPLICATE KEY UPDATE
Which will update the key with the new entries/changes.
Read More here and here.
The ideal way to such use case is, insert your data into a temporary table first in your MySQL DB and post that use a trigger in order to load that data into original table. Call that trigger from spark itself.
In spark, dataframes are immutable. So you cannot change a value in place. One way would be to read the complete table, make the modification and write back the complete table in overwrite mode. This will take time.
If your modifications are always for a particular group, say user id based or date based, then you can write the data based on that column using partitionBy(). Then you can read that partition using .filter() do the modifications and overwrite only that partition using insertInto() - from pyspark 2.3.0
Refer this answer for other versions for pyspark :Overwrite specific partitions in spark dataframe write method

Azure SQL External Table alternatives

Azure external tables between two azure sql databases on the same server don't perform well. This is known. I've been able to improve performance by defining a view from which the external table is defined. This works if the view can limit the data set returned. But this partial solution isn't enough. I'd love a way to at least nightly, move all the data that has been inserted or updated from the full set of tables from the one database (dbo schema) to the second database (pushing into the altdbo schema). I think Azure data factory will let me do this, but I haven't figured out how. Any thoughts / guidance? The copy option doesn't copy over table schemas or updates
Data Factory Mapping Data Flow can help you achieve that.
Using the AlterRow active and select an uptade method in Sink:
This can help you copy the new inserted or updated data to the another Azure SQL database based on the Key Column.
Alter Row: Use the Alter Row transformation to set insert, delete, update, and
upsert policies on rows.
Update method: Determines what operations are allowed on your
database destination. The default is to only allow inserts. To
update, upsert, or delete rows, an alter-row transformation is
required to tag rows for those actions. For updates, upserts and
deletes, a key column or columns must be set to determine which row
to alter.
Hope this helps.

How to quickly migrate from one table into another one with different table structure in the same/different cassandra?

I had one table with more than 10,000,000 records in Cassandra, but for some reason, I want to build another Cassandra table with the same fields and several additional fields, and I will migrate the previous data into it. And now the two tables are in the same Cassandra cluster.
I want to ask how to finish this task in a shortest time?
And If my new table in the different Cassandra, How to do it?
Any advice will be appreciated!
If you just need to add blank fields to a table, then the best thing to do is use the alter table command to add the fields to the existing table. Then no copying of the data would be needed and the new fields would show up as null in the existing rows until you set them to something.
If you want to change the structure of the data in the new table, or write it to a different cluster, then you'd probably need to write an application to read each row of the old table, transform the data as needed, and then write each row to the new location.
You could also do this by exporting the data to a csv file, write a program to restructure the csv file as needed, then import the csv file into the new location.
Another possible method would be to use Apache Spark. You'd read the existing table into an RDD, transform and filter the data into a new RDD, then save the transformed RDD to the new table. That would only work within the same cluster and would be fairly complex to set up.

Adding columns to a sybase table with unique auto_identity index option

I've inherited a Sybase database that has the 'unique auto_identity index' option enabled on it. As part of an upgrade process I need to add a few extra columns to the tables in this database i.e.
alter table mytable add <newcol> float default -1 not null
When I try to do this I get the follow error:
Column names in each table must be unique, column name SYB_IDENTITY_COL in table #syb__altab....... is specifed more than once
Is it possible to add columns to a table with this property enabled?
Update 1:
I created the following test that replicates the problem:
use master
sp_dboption 'esmdb', 'unique auto_identity indexoption',true
use esmdb
create table test_unique_ids (test_col char)
alter table test_unique_ids add new_col float default -1 not null
The alter table command here produces the error. (Have tried this on ASE 15/Solaris and 15.5/Windows)
Update 2:
This is a bug in the Sybase dbisql interface, which the client tools Sybase Central and Interactive SQL use to access the database and it only appears to affect tables with the 'unique auto_identity index' option enabled.
To work around the problem use a different SQL client (via JDBC for example) to connect to the database or use isql on the command line.
Should be no problem to ALTER TABLE with such columns; the err msg indicates the problem regards something else. I need to see the CREATE TABLE DDL.
Even if we can't ALTER TABLE, which we will try first, there are several work-arounds.
Responses
Hah! Internal Sybase error. Open a TechSupport case.
Workaround:
Make sure you get jthe the exact DDL. sp_help . Note the IDENTITY columns and indices.
Create a staging table, exactly the same. Use the DDL from (1). Exclude the Indices.
INSERT new_table SELECT old_table. If the table is large, break it into batches of 1000 rows per batch.
Now create the Indices.
If the table is very large, AND time is an issue, then use bcp. You need to research that first, I am happy to answer questions afterwards.
When I ran your sample code I first get the error:
The 'select into' database option is not enabled for database 'mydb'. ALTER TABLE with data copy cannot be done. Set the 'select into' database option and re-run
This is no doubt because the data within your table needs copying out because the new column is not null. This will use tempdb I think, and the error message you've posted refers to a temp table. Is it possible that this dboption has been accidentally enabled for the tempdb?
It's a bit of a shot in the dark, as I only have 12.5 to test on here, and it works for me. Or it could be a bug.

Resources