How do I get all the operations done on an Oracle table imported into Hive for processing (not just the actual data in the table, but the operations as well)? - apache-spark

I have a table in an Oracle DB which receives a large number of transactions (say around 100 million inserts, updates or deletes in a day). I want to get all the transactions happening on that table brought into Hive for processing through Spark or Hive.
For example:
let's say a record in that Oracle table goes through an initial insert operation, followed by 5 updates to the same or different columns, and finally gets deleted. I want to capture all such operations for all the records in that table and import them into Hive.
We want to find records whose number of operations on specific columns exceeds a threshold and pull a report on them.
Has anyone come across such a use case? Appreciate any help in achieving this.

Related

Best practices to store data every minute and select only the latest 24h from the database?

The task is to permanently record new data to a database every minute and then, occasionally, read only the latest 24h of data, using Python.
The only approach I know:
create a script A that inserts one new row per minute into a MariaDB table, with a timestamp as a field value
create a script B that reads from the table, using WHERE on the timestamp values
The problem is, there are 2 restrictions:
it is not allowed to have more than 10,000 rows in one database table
it is not allowed to delete any rows
How to fulfill the task and meet both restrictions? Are there best practices?
Thanks!
You can create a new table every X days, when the current one is full. Name each table with its first timestamp value.
With this solution you need to create your B script this way:
List all tables
Find the tables you are looking for
Write your SQL query over all these tables using UNION ALL
You can do it in a single SQL query for optimisation, or in a script using multiple queries for simplicity.
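For illustration, a minimal Python sketch of such a "script B", assuming each table is named data_<first unix timestamp> and has columns (ts, value) - both the naming convention and the column names are assumptions, not part of the question:

import time
import mysql.connector

conn = mysql.connector.connect(host="localhost", user="app",
                               password="secret", database="metrics")
cur = conn.cursor()

since = int(time.time()) - 24 * 3600  # lower bound: 24 hours ago

# 1. List all tables and keep the ones that may still contain rows
#    from the last 24 hours.
cur.execute("SHOW TABLES")
tables = sorted((t for (t,) in cur.fetchall() if t.startswith("data_")),
                key=lambda t: int(t.split("_")[1]))
recent = [t for t in tables if int(t.split("_")[1]) >= since]
if len(recent) < len(tables):
    # the table just before the window may also hold some recent rows
    recent.insert(0, tables[len(tables) - len(recent) - 1])

# 2. One UNION ALL query across all candidate tables.
query = " UNION ALL ".join(
    "SELECT ts, value FROM {} WHERE ts >= %s".format(t) for t in recent
)
cur.execute(query, [since] * len(recent))
rows = cur.fetchall()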

Cassandra - get all data for a certain time range

Is it possible to query a Cassandra database to get records for a certain range?
I have a table definition like this
CREATE TABLE domain(
domain_name text,
status int,
last_scanned_date bigint,
PRIMARY KEY(domain_name, last_scanned_date)
)
My requirement is to get all the domains which have not been scanned in the last 24 hours. I wrote the following query, but it is not efficient, as Cassandra tries to fetch the entire dataset because of ALLOW FILTERING:
SELECT * FROM domain where last_scanned_date<=<last24hourstimeinmillis> ALLOW FILTERING;
Then I decided to do it in two queries
1st query:
SELECT DISTINCT domain_name FROM domain;
2nd query:
Use the IN operator to query domains which have not been scanned in the last 24 hours
SELECT * FROM domain where
domain_name IN('domain1','domain2')
AND
last_scanned_date<=<last24hourstimeinmillis>
My second approach works, but comes with an extra overhead of querying first for distinct values.
Is there any better approach than this?
You should update your table definition. Currently you are using the domain name as your partition key, but you cannot have more than 2 billion records in a single Cassandra partition.
I would suggest using time as part of your partition key. If you are not going to receive more than 2 billion requests per day, try using the day since epoch as the partition key. You could use composite partition keys, but they won't be helpful for your query.
While querying you then have to scan at most two partitions, with an additional filter in the query, or with your application filtering out results which do not belong to the range you have specified.
Go over the following concepts before finalizing your design:
https://docs.datastax.com/en/cql/3.3/cql/cql_using/useCompositePartitionKeyConcept.html
https://docs.datastax.com/en/dse-planning/doc/planning/planningPartitionSize.html
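A minimal sketch of that design with the Python cassandra-driver; the keyspace name, the domain_by_day table and the exact window logic below are illustrative assumptions, not something the answer prescribes:

import time
from cassandra.cluster import Cluster

session = Cluster(["127.0.0.1"]).connect("scans")  # keyspace name is an assumption

# Revised table: the day bucket is the partition key, so a 24-hour
# window touches at most two partitions.
session.execute("""
    CREATE TABLE IF NOT EXISTS domain_by_day (
        scan_day int,
        last_scanned_date bigint,
        domain_name text,
        status int,
        PRIMARY KEY (scan_day, last_scanned_date, domain_name)
    )
""")

now_ms = int(time.time() * 1000)
cutoff = now_ms - 24 * 3600 * 1000
today = now_ms // (24 * 3600 * 1000)

rows = []
# All data from the last 24 hours lives in at most two day buckets.
for day in (today, today - 1):
    rows.extend(session.execute(
        "SELECT domain_name, last_scanned_date FROM domain_by_day "
        "WHERE scan_day = %s AND last_scanned_date >= %s",
        (day, cutoff)))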
Cassandra can effectively perform range queries only inside one partition. The same applies to aggregations such as DISTINCT. So in your case you would need a single partition containing all the data, but that is bad design.
You may try to split this big partition into smaller ones by using TLDs as separate partition keys and fetching from every partition in parallel, but this will also lead to imbalance, as some TLDs have more sites than others.
Another issue with your schema is that last_scanned_date is a clustering column, which means that when you update last_scanned_date you are effectively inserting a new row into the database - you would need to explicitly remove the row for the previous last_scanned_date, otherwise the query last_scanned_date<=<last24hourstimeinmillis> will keep fetching old rows that you have already scanned.
Part of the problem with your current design could be solved by using Spark, which is able to perform an effective scan of the full table via a token range scan plus a range scan for every individual row - this will return only the data in the given time range. Or, if you don't want to use Spark, you can perform the token range scan in your own code, something like this.
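If you want to hand-roll that token range scan in Python, a rough sketch (the split count and keyspace name are made up; the table and columns are the ones from the question):

import time
from cassandra.cluster import Cluster

session = Cluster(["127.0.0.1"]).connect("my_keyspace")  # keyspace is an assumption

# Full Murmur3 token ring, split into equal slices so each query only
# touches a small, bounded part of the table.
MIN_TOKEN, MAX_TOKEN = -2**63, 2**63 - 1
SPLITS = 256  # tune to your cluster/table size

cutoff = int(time.time() * 1000) - 24 * 3600 * 1000
step = (MAX_TOKEN - MIN_TOKEN) // SPLITS

stale = []
for i in range(SPLITS):
    lo = MIN_TOKEN + i * step
    hi = MAX_TOKEN if i == SPLITS - 1 else MIN_TOKEN + (i + 1) * step - 1
    # ALLOW FILTERING is still needed for the clustering-column filter,
    # but each query is now bounded to one token slice.
    stale.extend(session.execute(
        "SELECT domain_name, last_scanned_date FROM domain "
        "WHERE token(domain_name) >= %s AND token(domain_name) <= %s "
        "AND last_scanned_date <= %s ALLOW FILTERING",
        (lo, hi, cutoff)))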

Bulk copy from Cassandra table column to a file

I have a requirement to copy a Cassandra database column into a file.
The database has 15 million records with the columns below. I want to copy the payment column data into a file. Since it is a production environment, this will put stress on the Cassandra cluster.
userid | contract | payment | createdDate
Any suggestions?
Out of the 15 million payment records we want to modify a few (based on some condition) and insert them into a different Cassandra table.
Copy to a file -> process it -> write it to a new database table: that is the plan. But first of all, how do we get a copy of the column from the Cassandra database?
Regards
Kiran
You can use Spark + the Spark Cassandra Connector (SCC) to perform the data loading, modification and writing back. SCC has a number of knobs that you can use to tune throughput, so you don't overload the cluster when reading and writing.
If you don't have Spark, you can still use a similar approach when fetching the data - don't issue select * from table (this will overload the node that handles the request); instead, load the data by specific token ranges, so the queries go to different servers and don't overload any of them too much. You can find a code example that does a scan by token ranges here.
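A rough PySpark sketch of the Spark + SCC route; the keyspace/table names, the filter condition and the throttling value are placeholders, and the target table is assumed to already exist:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = (SparkSession.builder
         .appName("payments-copy")
         .config("spark.cassandra.connection.host", "10.0.0.1")
         # throttle writes so the production cluster is not overloaded
         .config("spark.cassandra.output.throughputMBPerSec", "5")
         .getOrCreate())

payments = (spark.read.format("org.apache.spark.sql.cassandra")
            .options(keyspace="billing", table="payments")
            .load())

# "modify a few based on some condition" - placeholder transformation
changed = (payments.filter(F.col("payment") > 1000)
                   .withColumn("payment", F.col("payment") * 0.9))

(changed.write.format("org.apache.spark.sql.cassandra")
        .options(keyspace="billing", table="payments_adjusted")
        .mode("append")
        .save())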

Data is not getting written in sorted format on target oracle table through SPARK

I have a table in hive with below schema
emp_id:int
emp_name:string
I have created data frame from above hive table
df = sql_context.sql('SELECT * FROM employee ORDER by emp_id')
df.show()
After above code is run I see that data is sorted properly on emp_id
I am trying to write the data to Oracle table through below code
df.write.jdbc(url=url, table='target_table', properties=properties, mode="overwrite")
As per my understanding, this is happening because multiple executor processes run at the same time, each on its own data partition: the sort applied through the query holds within a partition, but when multiple processes write data to Oracle at the same time, the ordering of the resulting table is distorted.
I further tried repartitioning the data into just one partition (which is not an ideal solution), and after writing the data to Oracle the sorting worked properly.
Is there any way to write sorted data to an RDBMS from Spark?
TL;DR When working with relational systems you should never depend on the insert order. Spark is not really relevant here.
Relational databases, including Oracle, don't guarantee any intrinsic order of the stored data. The exact order of stored records is an implementation detail and can change during the lifetime of the data.
The sole exception in Oracle is Index-Organized Tables, where:
data for an index-organized table is stored in a B-tree index structure in a primary key sorted manner.
This of course requires a primary key which can reliably determine order.
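If a consumer needs the rows in emp_id order, that order belongs in the query that reads them back, not in the write. A minimal PySpark sketch, reusing the url/properties placeholders from the question and assuming spark is the active SparkSession:

# Order is a property of the query, not of the stored rows: re-apply
# the ordering whenever the data is read back.
ordered = (spark.read
                .jdbc(url=url, table="target_table", properties=properties)
                .orderBy("emp_id"))
ordered.show()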

How to optimize a table containing 1 billion rows, fixed row format, using the MyISAM engine in MySQL?

I have a table containing 1 billion rows, in fixed row format, using the MyISAM engine in MySQL. I am thinking of sharding the table, but that development takes time. Are there any temporary solutions for improving performance?
You can take a look at MySQL partitioning: http://dev.mysql.com/doc/refman/5.1/en/partitioning-overview.html
It allows you to distribute portions of individual tables across a file system, transparently to your queries.
As per your comment, if the insert/select ratio really is 100:1, then I don't see any reason to have indexes on the table (apart from the primary key index, if any). They will only slow down your inserts further.
Also, if you can queue inserts to this table, you can try creating an in-memory table (http://dev.mysql.com/doc/refman/5.0/en/memory-storage-engine.html), direct all inserts to it (which will be faster), and then do a bulk insert/periodic flush into your MyISAM-based table.
You can also partition the table on a specific column out of the 4 you have (if there is a good candidate), or go for hash-based partitioning (if you don't find one), as sketched below. I am not sure why you say sharding is going to take development time; you can partition an existing non-partitioned table too: http://forums.mysql.com/read.php?106,264106,264110
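For illustration, partitioning an existing table in place might look like the sketch below (Python with mysql-connector; the table name big_table and the column user_id are invented for the example, and on a 1-billion-row table this ALTER rebuilds the table, so expect it to take a long time):

import mysql.connector

conn = mysql.connector.connect(host="localhost", user="app",
                               password="secret", database="bigdb")
cur = conn.cursor()

# Hash-partition the existing, previously non-partitioned table.
cur.execute("""
    ALTER TABLE big_table
    PARTITION BY HASH(user_id)
    PARTITIONS 32
""")
conn.commit()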
