Migrate data from Cassandra to Cassandra

We have two Cassandra clusters: the first one has the old data and the second one has the new data.
Now we want to move or copy the old data from the first cluster to the second. What is the best way to do this, and how?
We are using DSE 3.1.4.

One tool you could try would be the COPY TO/FROM cqlsh command.
On a node in the old cluster, you would use the COPY TO:
cqlsh> COPY myTable (col1, col2, col3, col4) TO 'temp.csv'
And then (after copying the file over) on a node in your new cluster, you would copy the data in the CSV file into Cassandra:
cqlsh> COPY myTable (col1, col2, col3, col4) FROM 'temp.csv'
Here is some more documentation on the COPY command.
Note that COPY TO/FROM is only recommended for tables containing a few million rows or fewer. For larger datasets you should look at:
Cassandra Bulk Loader
sstable2json

There is a tool called /usr/bin/sstableloader for copying data between clusters. When I used it months ago I ran into an error and used this instead, but since that was a long time ago, sstableloader may well have been fixed by now.

Related

Apache Hive Add TIMESTAMP partition using alter table statement

I'm currently running MSCK REPAIR TABLE SCHEMA.TABLENAME for all my tables after data is loaded.
As the partitions are growing, this statement is taking much longer (sometimes more than 5 minutes) for one table. I know it scans and parses through all partitions in S3 (where my data is) and then adds the latest partitions into the Hive metastore.
I want to replace MSCK REPAIR with an ALTER TABLE ADD PARTITION statement. MSCK REPAIR works perfectly fine for adding the latest partitions; however, I'm facing a problem with the TIMESTAMP value in the partition when using ALTER TABLE ADD PARTITION.
I have a table with four partitions (part_dt STRING, part_src STRING, part_src_file STRING, part_ldts TIMESTAMP).
After running MSCK REPAIR, the SHOW PARTITIONS command gives me the output below:
hive> show partitions hub_cont;
OK
part_dt=20181016/part_src=asfs/part_src_file=kjui/part_ldts=2019-05-02 06%3A30%3A39
But when I drop the above partition from the metastore and recreate it using ALTER TABLE ADD PARTITION:
hive> alter table hub_cont add partition(part_dt='20181016',part_src='asfs',part_src_file='kjui',part_ldts='2019-05-02 06:30:39');
OK
Time taken: 1.595 seconds
hive> show partitions hub_cont;
OK
part_dt=20181016/part_src=asfs/part_src_file=kjui/part_ldts=2019-05-02 06%3A30%3A39.0
Time taken: 0.128 seconds, Fetched: 1 row(s)
It is adding .0 at the end of the timestamp value. When I query the table for this partition, it gives me 0 records.
Is there a way to add a partition with a timestamp value without this zero getting appended at the end? I'm unable to figure out what MSCK REPAIR is doing in this case that the ALTER TABLE statement is not.
The same thing happens if you insert dynamic partitions: they are created with .0 because the default string representation of a timestamp includes the milliseconds part. MSCK REPAIR TABLE, on the other hand, simply finds new folders and adds them to the metastore, and that works because a timestamp string without milliseconds is still compatible with the timestamp type.
The solution is to use STRING instead of TIMESTAMP for the partition column and remove the milliseconds explicitly.
But first of all, double-check that you really have millions of rows in a single partition and really need timestamp-grain partitioning rather than DATE, and that this partition column is actually significant (for example, if it is functionally dependent on another partition column such as part_src_file, you can get rid of it completely). Too many partitions will cause performance degradation.
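As a rough sketch of that approach, assuming the table is recreated with part_ldts as a STRING partition column and that partitions are added through Spark's Hive support (the embedded ALTER TABLE statement works the same way in the hive shell):
import java.time.LocalDateTime
import java.time.format.DateTimeFormatter
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("add-partition-sketch")
  .enableHiveSupport()
  .getOrCreate()

// Format the load timestamp without a milliseconds part so the partition value
// matches the folder name exactly.
val partLdts = LocalDateTime.now()
  .format(DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss"))

spark.sql(
  s"""ALTER TABLE hub_cont ADD IF NOT EXISTS PARTITION (
     |  part_dt='20181016', part_src='asfs',
     |  part_src_file='kjui', part_ldts='$partLdts')""".stripMargin)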

Databricks - How to change a partition of an existing Delta table?

I have a table in Databricks Delta which is partitioned by transaction_date. I want to change the partition column to view_date. I tried to drop the table and then create it with a new partition column using PARTITIONED BY (view_date).
However, my attempt failed since the actual files reside in S3, and even if I drop a Hive table the partitions remain the same.
Is there any way to change the partitioning of an existing Delta table? Or is the only solution to drop the actual data and reload it with a newly specified partition column?
There's actually no need to drop tables or remove files. All you need to do is read the current table, overwrite the contents AND the schema, and change the partition column:
val input = spark.read.table("mytable")
input.write.format("delta")
.mode("overwrite")
.option("overwriteSchema", "true")
.partitionBy("colB") // different column
.saveAsTable("mytable")
UPDATE: There previously was a bug with time travel and changes in partitioning that has now been fixed.
As Silvio pointed out, there is no need to drop the table. In fact, the approach strongly recommended by Databricks is to replace the table.
https://docs.databricks.com/sql/language-manual/sql-ref-syntax-ddl-create-table-using.html#parameters
In Spark SQL, this can be done easily with:
REPLACE TABLE <tablename>
USING DELTA
PARTITIONED BY (view_date)
AS
SELECT * FROM <tablename>
Modded example from:
https://docs.databricks.com/delta/best-practices.html#replace-the-content-or-schema-of-a-table
Python solution:
If you need more than one column in the partition, use partitionBy(column, column_2, ...):
def change_partition_of(table_name, column):
    df = spark.read.table(table_name)
    df.write.format("delta").mode("overwrite").option("overwriteSchema", "true").partitionBy(column).saveAsTable(table_name)

change_partition_of("i.love_python", "column_a")

SparkSQL - some partitions appear in HiveServer2 but not SparkSQL

A Hive external table is pointing to files on S3; the DDL includes a PARTITIONED BY (eod) clause. Under the folder there are 5 subfolders, each with a file underneath for a different partition date, i.e.
eod=20180602/fileA
eod=20180603/fileA
eod=20180604/fileA
eod=20180605/fileA
eod=20180606/fileA
MSCK REPAIR TABLE is run on HiveServer2.
Running select distinct part_dt from tbl on HiveServer2 (port 10000) returns all 5 dates.
However, running select distinct part_dt from tbl on SparkThriftServer (i.e. SparkSQL, port 10015) returns only the first 2 dates.
How is this possible?
Even when running MSCK REPAIR on SparkThriftServer, the discrepancy still exists.
The file schema is the same for all dates (i.e. each file has the same number and types of columns).
Resolved: the 8 affected tables had previously been cached in SparkSQL (i.e. CACHE TABLE <table>). Once I ran UNCACHE TABLE <table>, all partitions lined up again!
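For reference, a minimal sketch of that fix driven from the Spark side (table name tbl as in the question; spark is the usual SparkSession built with Hive support):
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("refresh-partitions-sketch")
  .enableHiveSupport()
  .getOrCreate()

// Pick up partitions that were added outside of Spark (e.g. via HiveServer2).
spark.sql("MSCK REPAIR TABLE tbl")

// Drop the stale cached copy so queries see all partitions, then refresh the metadata.
spark.catalog.uncacheTable("tbl")
spark.catalog.refreshTable("tbl")

spark.sql("SELECT DISTINCT part_dt FROM tbl").show()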

How do I delete rows from Cassandra table from spark streaming using deleteFromCassandra?

I have an RDD containing the primary keys of a table. I need to delete the rows in the Cassandra table that match the values in the RDD.
I see that there is a deleteFromCassandra in the spark-cassandra-connector, but I am unable to use it: deleteFromCassandra is unresolved.
https://github.com/datastax/spark-cassandra-connector/blob/master/spark-cassandra-connector/src/main/scala/com/datastax/spark/connector/streaming/DStreamFunctions.scala
Thanks for your help.
If I understand your use case, you can follow this question; maybe it will help you:
Delete from cassandra Table in Spark
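For what it's worth, a minimal sketch of how deleteFromCassandra is usually wired up on an RDD (the keyspace, table and key values here are hypothetical; the method needs the blanket com.datastax.spark.connector._ import and a connector version that actually ships it, otherwise the symbol stays unresolved):
import com.datastax.spark.connector._   // brings deleteFromCassandra into scope on RDDs
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("delete-from-cassandra-sketch")
  .set("spark.cassandra.connection.host", "127.0.0.1")  // assumption: local Cassandra node
val sc = new SparkContext(conf)

// RDD of primary-key tuples in primary-key order: (partition key, clustering key).
val keysToDelete = sc.parallelize(Seq((1, "a"), (2, "b")))

// Deletes the rows of my_keyspace.my_table whose primary keys match the RDD contents.
keysToDelete.deleteFromCassandra("my_keyspace", "my_table")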

Apache Spark search inside a dataset

I am new to Spark and trying to achieve the following. I am not able to find the best way to do this; please suggest.
I am using Spark 2.0 and its Dataset API, Cassandra 3.7 and the Cassandra Java connector.
I have one column family with a partition key and 2 clustering keys. For example:
myksyspace.myTable (some_id, col1, col2, col3, col4, col5, PRIMARY KEY (some_id, col1, col2))
I can load the data of myksyspace.myTable into myTableDataset. The data has a large number of rows (maybe 200000).
Every hour I get updated data from another source; this may contain new data that is not yet in my database, and I want to save it there.
The data I receive from the other source contains updated values but no value for “col2”. I get the rest of the data in a list, “dataListWithSomeNewData”, in my Java code.
Now I want to match the data in the list with the data in myTableDataset, copy col2 from the dataset into the list “dataListWithSomeNewData”, then generate a new dataset and save it back to the database. This way my existing data will be updated. I want the new data to be inserted with a newly generated unique value of col2 for each list item. How do I achieve this?
I want to avoid collectAsList() on the dataset so that I don't run out of memory, as I may load a large amount of data. With collectAsList() the code works for a small amount of data.
Any suggestions or ideas on this? How can I achieve it?
Thank you in advance.
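One way to sketch the join-based approach described above (in Scala rather than Java for brevity; the column names follow the example table, and generating a UUID for col2 on new rows is just an assumption):
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark = SparkSession.builder().appName("merge-sketch").getOrCreate()
import spark.implicits._

// Existing table, read through the Spark Cassandra connector.
val myTableDataset = spark.read
  .format("org.apache.spark.sql.cassandra")
  .options(Map("keyspace" -> "myksyspace", "table" -> "myTable"))
  .load()

// Hourly feed turned into a DataFrame; note it has no col2.
val incoming = Seq(("id1", "c1", "v3", "v4", "v5"))
  .toDF("some_id", "col1", "col3", "col4", "col5")

// Generates a fresh unique value for rows that are genuinely new.
val makeId = udf(() => java.util.UUID.randomUUID().toString)

// Pick up col2 from existing rows by joining on the known key columns;
// fall back to a newly generated value where there is no match.
val merged = incoming
  .join(myTableDataset.select("some_id", "col1", "col2"), Seq("some_id", "col1"), "left")
  .withColumn("col2", coalesce($"col2", makeId()))

merged.write
  .format("org.apache.spark.sql.cassandra")
  .options(Map("keyspace" -> "myksyspace", "table" -> "myTable"))
  .mode("append")
  .save()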
