Bulk copy from Cassandra table column to a file - cassandra

I have a requirement to copy a Cassandra database column into a file.
The database has 15 million records with the columns below. I want to copy the payment column data into a file. Since this is a production environment, I am concerned that doing so will put stress on the Cassandra cluster.
userid | contract | payment | createdDate
Any suggestions?
Out of the 15 million payment details, we want to modify a few (based on some condition) and insert them into a different Cassandra table.
The plan is: copy to a file -> process it -> write it to the new database table. But first of all, how do I get a copy of the column from the Cassandra database?
Regards
Kiran

You can use Spark with the Spark Cassandra Connector (SCC) to load the data, modify it, and write it back. SCC has a number of knobs for tuning throughput, so that you don't overload the cluster when reading and writing.
If you don't have Spark, you can still use a similar approach when fetching the data: don't issue select * from table (this will overload the node that handles the request), but instead load the data by specific token ranges, so that the queries go to different servers and don't overload any of them too much. You can find a code example that scans by token ranges here.
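As a rough illustration of the token-range idea, the sketch below only builds the CQL statements; it does not talk to a cluster. It assumes the default Murmur3Partitioner (tokens from -2^63 to 2^63 - 1), that userid is the partition key, and a hypothetical table name ks.payments. It splits the ring into equal slices, whereas a real scan would normally align the slices with the ranges owned by each node.

```python
# Full token range of Murmur3Partitioner (assumed to be the partitioner in use).
MIN_TOKEN = -2**63
MAX_TOKEN = 2**63 - 1


def token_range_queries(table, partition_key, column, splits):
    """Return CQL statements that together cover the whole token ring.

    `table`, `partition_key` and `column` are placeholders supplied by the
    caller; nothing here is specific to a real schema.
    """
    total = MAX_TOKEN - MIN_TOKEN + 1  # number of tokens on the ring (2**64)
    step = total // splits
    queries = []
    for i in range(splits):
        start = MIN_TOKEN + i * step
        # The last slice absorbs any remainder so the ring is fully covered.
        end = MAX_TOKEN if i == splits - 1 else start + step - 1
        queries.append(
            f"SELECT {column} FROM {table} "
            f"WHERE token({partition_key}) >= {start} "
            f"AND token({partition_key}) <= {end}"
        )
    return queries


# Hypothetical usage: 4 slices over an assumed table ks.payments.
queries = token_range_queries("ks.payments", "userid", "payment", 4)
```

Each statement can then be executed one at a time (or with a small, bounded level of concurrency) through whichever driver you use, and the results appended to the file, which keeps the load on any single node low.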

Related

Partition strategy for hive

I have a monthly Spark job that processes data and saves it into Hive/Impala tables (the file storage format is Parquet). The granularity of the table is daily data, but the source data for this job also arrives monthly.
I'm trying to work out how best to partition the table. I'm thinking of partitioning the table based on a month key. Does anyone see any problems with this approach, or have other suggestions? Thanks.

How to get all the operations done on an oracle table be imported into hive for processing?(not just actual data in table, but the operations also)

I have a table in an Oracle database that receives many transactions (let's say around 100 million inserts, updates or deletes in a day). I want to bring all the transactions happening in that table into Hive for processing through Spark or Hive.
For example:
Let's say a record in that Oracle table goes through an initial insert operation, followed by 5 updates to the same or different columns, and is finally deleted. I want to capture all such operations for all the records in that table and import them into Hive.
We want to find records with number of operations that exceed a threshold for specific columns and pull a report on them.
Has anyone come across such a use case? Appreciate any help in achieving this.

Incremental load without date or primary key column using azure data factory

I have a source, let's say a SQL database or an Oracle database, and I want to pull the table data into an Azure SQL database. The problem is that the table has neither a date column recording when data was inserted nor a primary key column. Is there any other way to perform this operation?
One way of doing it semi-incrementally is to partition the table by a fairly stable column in the source table; then you can use a mapping data flow to compare the partitions (this can be done with row counts, aggregations, hashbytes, etc.). On each load, you store the comparison output for the partitions as metadata somewhere, so that you can compare against it the next time you load. That way you reload only the partitions that have changed since your last load.
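The comparison logic can be sketched outside Azure Data Factory in plain Python. This is only an illustration of the idea, not a mapping data flow: rows are assumed to be dictionaries, the column name region is hypothetical, and the per-partition fingerprint is a row count plus an order-independent sum of row hashes.

```python
import hashlib
from collections import defaultdict


def partition_fingerprints(rows, partition_column):
    """Map each partition value to (row count, order-independent hash sum)."""
    counts = defaultdict(int)
    sums = defaultdict(int)
    for row in rows:
        key = row[partition_column]
        # Hash a canonical representation of the row (columns sorted by name).
        row_bytes = repr(sorted(row.items())).encode("utf-8")
        row_hash = int.from_bytes(hashlib.sha256(row_bytes).digest(), "big")
        counts[key] += 1
        # Addition modulo 2**256 makes the result independent of row order.
        sums[key] = (sums[key] + row_hash) % 2**256
    return {key: (counts[key], sums[key]) for key in counts}


def changed_partitions(previous, current):
    """Return partitions that were added, removed, or whose content changed."""
    all_keys = previous.keys() | current.keys()
    return sorted(k for k in all_keys if previous.get(k) != current.get(k))


# Hypothetical data: one row in the "EU" partition changes between loads.
last_load = [{"region": "EU", "amount": 10}, {"region": "US", "amount": 7}]
this_load = [{"region": "EU", "amount": 12}, {"region": "US", "amount": 7}]

stored = partition_fingerprints(last_load, "region")   # saved as metadata
fresh = partition_fingerprints(this_load, "region")
to_reload = changed_partitions(stored, fresh)
```

Only the partitions listed in to_reload would need to be copied again; the fresh fingerprints then replace the stored metadata for the next run.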

DSE (Cassandra) - Range search on int data type

I am a beginner using Cassandra. I created a table with below details and when I try to perform range search using token, I am not getting any results. Am I doing something wrong or is it my understanding of data model?
Query select * from test where token(header)>=2 and token(header)<=4;
The token function calculates the token from the value, based on the configured partitioner. The calculated value is a hash that is used to identify the node where the data is located; it is not the data itself.
Cassandra can perform a range search on values only for clustering columns (and only for some designs), and only inside a single partition. If you need to perform a range search on an arbitrary column (including partition keys), there is DSE Search, which allows you to index the table and perform different types of search, including range searches. But take into account that it will be much slower than traditional Cassandra queries.
In your situation, you can run 3 queries in parallel (to cover values 2,3,4), like this:
select * from test where header = value;
and then combine results in your code.
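A minimal sketch of that fan-out, using only the Python standard library. The run_query callable is a placeholder (an assumption, not part of any driver API): it is expected to execute the single-value query above for one header value and return a list of rows.

```python
from concurrent.futures import ThreadPoolExecutor


def fetch_by_values(run_query, values, max_workers=3):
    """Run one query per value in parallel and merge the rows.

    `run_query` is a hypothetical callable supplied by the caller; it takes
    one header value and returns the rows for that partition.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map() preserves the order of `values` in its results.
        per_value = pool.map(run_query, values)
        return [row for rows in per_value for row in rows]


# Stand-in for a real query function, so the sketch runs without a cluster.
fake_table = {2: [("2", "a")], 3: [], 4: [("4", "b"), ("4", "c")]}
rows = fetch_by_values(lambda value: fake_table.get(value, []), [2, 3, 4])
```

With a real driver, run_query would wrap a prepared statement bound to the given value.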
I recommend taking the DS201 and DS220 courses on DataStax Academy to understand how Cassandra performs queries, and how to model data to make this possible.

Data is not getting written in sorted format on target oracle table through SPARK

I have a table in hive with below schema
emp_id:int
emp_name:string
I have created data frame from above hive table
df = sql_context.sql('SELECT * FROM employee ORDER by emp_id')
df.show()
After running the code above, I can see that the data is sorted properly on emp_id.
I am trying to write the data to an Oracle table with the code below, but the rows in the target table do not end up in sorted order:
df.write.jdbc(url=url, table='target_table', properties=properties, mode="overwrite")
As I understand it, this happens because multiple executor processes run at the same time, one per data partition. The sort from the query is applied within each partition, and when multiple processes write to Oracle at the same time, the ordering in the result table is distorted.
I also tried repartitioning the data into a single partition (which is not an ideal solution), and after writing the data to Oracle the sort order was preserved.
Is there any way to write sorted data to an RDBMS from Spark?
TL;DR When working with relational systems you should never depend on the insert order. Spark is not really relevant here.
Relational databases, including Oracle, don't guarantee any intrinsic order of the stored data. Exact order of stored records is a detail of implementation, and can change during lifetime of the data.
The sole exception in Oracle is Index Organized Tables, where:
data for an index-organized table is stored in a B-tree index structure in a primary key sorted manner.
This of course requires a primary key which can reliably determine order.
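In practice this means the order is requested when the data is read, not when it is written. The sketch below uses Python's built-in sqlite3 module purely as a stand-in for Oracle (an assumption made so that the example runs anywhere); the table and column names mirror the question.

```python
import sqlite3


def read_sorted_employees(rows):
    """Insert rows in arbitrary order, then read them back sorted by emp_id."""
    conn = sqlite3.connect(":memory:")  # in-memory stand-in for the target DB
    try:
        conn.execute("CREATE TABLE target_table (emp_id INTEGER, emp_name TEXT)")
        # The insert order is whatever the writers happen to produce.
        conn.executemany("INSERT INTO target_table VALUES (?, ?)", rows)
        # The ordering guarantee comes only from ORDER BY in the query.
        cursor = conn.execute(
            "SELECT emp_id, emp_name FROM target_table ORDER BY emp_id"
        )
        return cursor.fetchall()
    finally:
        conn.close()


unordered = [(3, "carol"), (1, "alice"), (2, "bob")]
result = read_sorted_employees(unordered)
```

Whatever order the Spark executors write in, a reader that needs sorted rows adds ORDER BY emp_id to its own query.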
