I need to delete all rows in Cassandra but with Amazon Keyspace isn't possible to execute TRUNCATE tbl_name because the TRUNCATE api isn't supported yet.
Now the few ideas that come in my mind are a little bit tricky:
Solution A
select all the rows
cycle all the rows and delete it (one by one or in a batch)
Solution B
DROP TABLE
CREATE TABLE with the structure of the old table
Do you have any idea to keep the process simplest?
Tnx in advance
If the data is not required. Option B - drop the table and recreate. You can pass in the capacity on create table statment using custom table properties.
CREATE TABLE my_keyspace.my_table (
id text,
division text,
project text,
role text,
manager_id text,
PRIMARY KEY (id,division))
WITH CUSTOM_PROPERTIES=
{'capacity_mode':
{'throughput_mode' : 'PROVISIONED',
'read_capacity_units' : 10,
'write_capacity_units' : 20},
'point_in_time_recovery': {'status': 'enabled'}}
AND TAGS={'pii' :'true',
'prod':'true'
};
Option C. If you require the data you can also leverage on-demand capacity mode which is pay-per request mode. With no request you only have to pay for storage. You can change modes once a day.
ALTER TABLE my_keyspace.my_table
WITH CUSTOM_PROPERTIES=
{'capacity_mode': {'throughput_mode': 'PAY_PER_REQUEST'}}
Solution B should be fine in absence of TRUNCATE. In older versions (version prior to 2.1) of Cassandra recreating table with the same name was a problem. Refer article Datastax FAQ Blog. But since then issue has been resolved via CASSANDRA-5202.
If data in table is not required anymore it is better to drop the table and recreate it. Moreover it will be very tedious task if table contains big amount of data.
Related
I have column toor_0_sup with the bigint in cassandra db I want to change datatype to text.
How I can do that??
As we know for cassandra we cant change datatype from bigint to text.
If any solution for that will help me.
Tried to alter table but not working.
By this thread discussion Altering column types is no longer supported.
REF: https://dba.stackexchange.com/questions/314933/how-do-we-handle-alter-table-column-type-in-cassandra-in-actual-scenarios
Possible approach:
unload the table data
drop the table
create the table with structure
load the data
On cassandra, we only need 100 days of data for specific tables. However, we only recently set the TTL value and the data older than that still stays in the system as stale data. We were thinking of different approaches to delete the old data out of the system. One suggestion was to create a Spark job to identify the data older than a specific timeframe and delete them all.
Another thought was to create a new table with just 100 days data and delete the old table. But I have various doubts on
how to rename the table where live data is being updated,
how will cassandra deal with such a table? While I have recreated a new table with less data and renamed it on one node(say node 1), will the other nodes in the cluster automatically delete the older data in their tables or sync the table on the node 1 and push all the older data onto it?
I am really new to cassandra and require expert advice on this.
Please suggest if there are better ways to handle this.
Cassandra does not have a way to rename a table, you will need to
create the new table with a different name
ensure this table has the TTL clause
load into it only the subset of records that you are interested on; this could be tricky as the query will depend on the schema of the table, is the column with the timestamp part of the clustering key?
update your application to point to the new table
drop the table
I have a dataframes which have few rows among them some already exists in db. I want to update few columns of existing rows. How can we do that?
I see we have SaveModes:
append and override which might serve the purpose but there is a limitation in both the cases.
With append, I am getting primary key error, as this option tries to create a new row in db
With ovverride, I will loose values for the unchanged attributes in the tuple.
Can someone please suggest how can I update few attributes(Columns values) of a row(tuple).?
This can be handled in MySql level, The concept is known as upsert.
case when : primary key is new
The SQL will insert into MySQL DB as new row
Case when : primary key is existing
You can use
INSERT
ON DUPLICATE KEY UPDATE
Which will update the key with the new entries/changes.
Read More here and here.
The ideal way to such use case is, insert your data into a temporary table first in your MySQL DB and post that use a trigger in order to load that data into original table. Call that trigger from spark itself.
In spark, dataframes are immutable. So you cannot change a value in place. One way would be to read the complete table, make the modification and write back the complete table in overwrite mode. This will take time.
If your modifications are always for a particular group, say user id based or date based, then you can write the data based on that column using partitionBy(). Then you can read that partition using .filter() do the modifications and overwrite only that partition using insertInto() - from pyspark 2.3.0
Refer this answer for other versions for pyspark :Overwrite specific partitions in spark dataframe write method
I have a Folder which previously had subfolders based on ingestiontime which is also the original PARTITION used in its Hive Table.
So the Folder Looks as -
s3://MyDevBucket/dev/myStreamingData/ingestiontime=20200712230000/....
s3://MyDevBucket/dev/myStreamingData/ingestiontime=20200711230000/....
s3://MyDevBucket/dev/myStreamingData/ingestiontime=20200710230000/....
s3://MyDevBucket/dev/myStreamingData/ingestiontime=20200709230000/....
........
Inside each ingestiontime folder, data is present in PARQUET format.
Now in the Same myStreamingData folder, I am adding another folder that holds similar data but in the folder named businessname.
So my Folder structure now looks like -
s3://MyDevBucket/dev/myStreamingData/businessname=007/ingestiontime=20200712230000/....
s3://MyDevBucket/dev/myStreamingData/businessname=007/ingestiontime=20200711230000/....
s3://MyDevBucket/dev/myStreamingData/businessname=007/ingestiontime=20200710230000/....
s3://MyDevBucket/dev/myStreamingData/ingestiontime=20200712230000/....
s3://MyDevBucket/dev/myStreamingData/ingestiontime=20200711230000/....
s3://MyDevBucket/dev/myStreamingData/ingestiontime=20200710230000/....
s3://MyDevBucket/dev/myStreamingData/ingestiontime=20200709230000/....
........
So I need to add the data in the businessname partition to my current hive table too.
To achieve this , I was running the ALTER Query - ( on Databricks)
%sql
alter table gp_hive_table add partition (businessname=007,ingestiontime=20200712230000) location "s3://MyDevBucket/dev/myStreamingData/businessname=007/ingestiontime=20200712230000"
But I am getting this error -
Error in SQL statement: AnalysisException: businessname is not a valid partition column in table `default`.`gp_hive_table`.;
What part I am doing incorrectly here ?
Thanks in Advance.
Since you're already using Databricks and this is a streaming use case, you should definitely take a serious look at using Delta Lake tables.
You won't have to mess with explicit ... ADD PARTITION and MSCK statements.
Delta Lake with the ACID properties will ensure your data is committed properly, if your job fails you won't end up with partial results. As soon as the data is committed, it is available to users (again without the MSCK and ADD PARTITION) statements.
Just change 'USING PARQUET' to 'USING DELTA' in your DDL.
You can also (CONVERT) your existing parquet table to a Delta Lake table and then start using INSERT, UPDATE, DELETE, MERGE INTO, COPY INTO, from Spark batch and structured streaming jobs. OPTIMIZE will clean up the small file problem.
alter table gp_hive_table add partition is to add partition(data location, not new column) to the table with already defined partitioning scheme, it does not change current partitioning scheme, it just adds partition metadata, that in some location there is partition corresponding to some partitioning column value.
If you want to change partition columns, you need to recreate the table.:
Drop (check it is EXTERNAL) the table: DROP TABLE gp_hive_table;
Create table with new partitioning column. Partitions WILL NOT be created automatically.
Now you can add partitions using ALTER TABLE ADD PARTITION or use MSCK REPAIR TABLE to create them automatically based on directory structure. Directory structure should already match partitioning scheme before you execute these commands
So,
building upon the suggestion from #leftjoin,
Instead of having a hive table without businessname as one of the partition ,
What I did is -
Step 1 -> Create hive table with - PARTITION BY (businessname long,ingestiontime long)
Step 2 -> Executed the query - MSCK REPAIR <Hive_Table_name> to auto add partitions.
Step 3 ->
Now, there are ingestiontime folders which are not in the folder businessname i.e
folders like -
s3://MyDevBucket/dev/myStreamingData/ingestiontime=20200712230000/....
s3://MyDevBucket/dev/myStreamingData/ingestiontime=20200711230000/....
s3://MyDevBucket/dev/myStreamingData/ingestiontime=20200710230000/....
s3://MyDevBucket/dev/myStreamingData/ingestiontime=20200709230000/....
I wrote a small piece of code to fetch all such partitions and then ran the following query for all of them -
ALTER TABLE <hive_table_name> ADD PARTITION (businessname=<some_value>,ingestiontime=<ingestion_time_partition_name>) LOCATION "<s3_location_of_all_partitions_not_belonging_to_a_specific_businesskey>
This solved my issue.
I've inherited a Sybase database that has the 'unique auto_identity index' option enabled on it. As part of an upgrade process I need to add a few extra columns to the tables in this database i.e.
alter table mytable add <newcol> float default -1 not null
When I try to do this I get the follow error:
Column names in each table must be unique, column name SYB_IDENTITY_COL in table #syb__altab....... is specifed more than once
Is it possible to add columns to a table with this property enabled?
Update 1:
I created the following test that replicates the problem:
use master
sp_dboption 'esmdb', 'unique auto_identity indexoption',true
use esmdb
create table test_unique_ids (test_col char)
alter table test_unique_ids add new_col float default -1 not null
The alter table command here produces the error. (Have tried this on ASE 15/Solaris and 15.5/Windows)
Update 2:
This is a bug in the Sybase dbisql interface, which the client tools Sybase Central and Interactive SQL use to access the database and it only appears to affect tables with the 'unique auto_identity index' option enabled.
To work around the problem use a different SQL client (via JDBC for example) to connect to the database or use isql on the command line.
Should be no problem to ALTER TABLE with such columns; the err msg indicates the problem regards something else. I need to see the CREATE TABLE DDL.
Even if we can't ALTER TABLE, which we will try first, there are several work-arounds.
Responses
Hah! Internal Sybase error. Open a TechSupport case.
Workaround:
Make sure you get jthe the exact DDL. sp_help . Note the IDENTITY columns and indices.
Create a staging table, exactly the same. Use the DDL from (1). Exclude the Indices.
INSERT new_table SELECT old_table. If the table is large, break it into batches of 1000 rows per batch.
Now create the Indices.
If the table is very large, AND time is an issue, then use bcp. You need to research that first, I am happy to answer questions afterwards.
When I ran your sample code I first get the error:
The 'select into' database option is not enabled for database 'mydb'. ALTER TABLE with data copy cannot be done. Set the 'select into' database option and re-run
This is no doubt because the data within your table needs copying out because the new column is not null. This will use tempdb I think, and the error message you've posted refers to a temp table. Is it possible that this dboption has been accidentally enabled for the tempdb?
It's a bit of a shot in the dark, as I only have 12.5 to test on here, and it works for me. Or it could be a bug.