Create data lineage on YugabyteDB through Apache Atlas

Not many resources are available online, but I want to create a data lineage system for data sourced from YugabyteDB through Apache Atlas. Any pointers are appreciated.
For example, below is the process that I have:
[TABLE A] --python function--> [TABLE B] --> [report x]
Let's say both table A and table B are in YugabyteDB.
The Python function aggregates the data from table A and inserts it into table B. Report X is then created on top of table B.
If I want to create lineage in Atlas for this process, I understand that I will have to create 4 entities: 2 table entities and 2 process entities. Then I will have to build relationships between them, but what I am not sure about is how any new data that arrives tomorrow will get reflected in Atlas.
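Atlas has no built-in hook for YugabyteDB, so one option is to register the entities yourself through the v2 REST API. Below is a minimal sketch, assuming an Atlas endpoint, basic-auth credentials, and the generic DataSet/Process types (you may prefer a custom subtype such as a yugabyte_table type you define, or rdbms_table); none of these names come from the original post.

```python
# Minimal sketch of registering lineage in Atlas over its v2 REST API.
# Assumptions (not from the original post): Atlas runs at ATLAS_URL with
# basic auth, and the generic "DataSet"/"Process" type names are acceptable.
import requests

ATLAS_URL = "http://atlas-host:21000/api/atlas/v2"   # assumed endpoint
AUTH = ("admin", "admin")                            # assumed credentials

def dataset(name):
    """Build a minimal DataSet entity for one YugabyteDB table."""
    return {
        "typeName": "DataSet",   # replace with your table type, e.g. a custom yugabyte_table
        "attributes": {"qualifiedName": f"yugabyte.{name}@cluster1", "name": name},
    }

def process(name, inputs, outputs):
    """Build a Process entity whose inputs/outputs reference tables by qualifiedName."""
    ref = lambda qn: {"typeName": "DataSet", "uniqueAttributes": {"qualifiedName": qn}}
    return {
        "typeName": "Process",
        "attributes": {
            "qualifiedName": f"etl.{name}@cluster1",
            "name": name,
            "inputs": [ref(q) for q in inputs],
            "outputs": [ref(q) for q in outputs],
        },
    }

entities = [
    dataset("table_a"),
    dataset("table_b"),
    process("aggregate_a_to_b",
            inputs=["yugabyte.table_a@cluster1"],
            outputs=["yugabyte.table_b@cluster1"]),
]

# Bulk-create (or update, since qualifiedName is the unique key) the entities.
resp = requests.post(f"{ATLAS_URL}/entity/bulk", json={"entities": entities}, auth=AUTH)
resp.raise_for_status()
```

Because entities are keyed by qualifiedName, re-posting the same payload from the Python job on each run is effectively idempotent: new rows flowing through the same tables do not require new lineage entities; only new tables or new processes do. If you want per-run granularity, you could additionally register a per-run entity each time the job executes (some deployments model this with a process-execution style type).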

Related

Azure Data Factory: join two Lookup outputs

I'm trying to create a pipeline which:
Gets some info from an Azure SQL Database table with a LookUp activity.
Gets similar info from an Oracle Database table with a similar LookUp activity.
Join both tables somehow
Then process this joined data with a ForEach activity.
It seems pretty simple but I'm not able to find a solution that suits my requirements.
I want to join those tables with a simple join, as both tables share a column with the same type and value. I've tried some approaches:
Tried a Filter activity with items like #union(activity('Get Tables from oracle config').output.value, activity('Get tables from azure config').output.value) and condition #or(equals(item().status, 'READY'), equals(item().owner,'OWNer')), but it fails because some records don't have a status field and others don't have the owner field, and I don't know how to get around this error.
Tried a Data Flow activity approach, which should be the right one, but the Oracle connection is not compatible with the Data Flow activity, only the Azure SQL Server one.
I've tried to put all the records into an array and then process it through a Databricks activity, but Databricks doesn't accept an array as a job input through its widgets (a possible workaround is sketched below).
I've tried a ForEach loop for appending the Oracle result to the Azure SQL one but no luck at all.
So, I'm totally blocked on how to proceed. Any suggestions? Thanks in advance.
Here's a pipeline overview (screenshot not included).
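One workaround for the Databricks limitation mentioned above is to serialize each Lookup output to a JSON string in ADF (for example with @string(activity('Get Tables from oracle config').output.value)) and pass the strings as notebook parameters. A minimal sketch of the notebook side, assuming widget names oracle_rows/azuresql_rows and a shared column called id (both assumptions):

```python
# Sketch of the Databricks side, assuming ADF passes both Lookup outputs as
# JSON strings via notebook widgets named "oracle_rows" and "azuresql_rows".
import json
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# dbutils is available inside Databricks notebooks without an import.
oracle_rows = json.loads(dbutils.widgets.get("oracle_rows"))
azure_rows = json.loads(dbutils.widgets.get("azuresql_rows"))

oracle_df = spark.createDataFrame(oracle_rows)
azure_df = spark.createDataFrame(azure_rows)

# Join on the shared column (assumed to be called "id" here).
joined = oracle_df.join(azure_df, on="id", how="inner")
joined.show()
```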

How to persist data to Hive from PySpark - Avoiding duplicates

I am working with GraphFrames, PySpark, and Hive to process graph data. As I process data I will be building a graph, and eventually I will persist this data into a Hive table, where I will not update it ever again.
Subsequent runs may have relationships to nodes from previous runs, so I will want to ensure I don't duplicate data.
For example, run #1 might find nodes: A, B, C. Run #2 might re-find node A, and also find new nodes X, Y, Z. I do not want A to appear twice in my table.
I am looking for the best way to handle this and would like to address the following issues:
I will need to track the status of the node as I process metadata associated with it. I will only want to persist the node's data to Hive after I have finished this processing.
I want to ensure that I don't create duplicate data when I encounter the same node (e.g. when I re-find A node above, I don't want to insert another row into Hive)
I am currently tinkering with the best way to do this. I know Hive supports ACID transactions now, but it does not appear as though PySpark currently supports CRUD-type operations. So here is what I'm planning on:
On each run, create a dataframe to store the nodes I have found.
When a new node is found: check if the node already exists in Hive (e.g. sqlContext.sql("SELECT * FROM existingTable WHERE name = '<NAME>'")). If it does not exist, update the dataframe with x = vertices.withColumn("name", F.when(F.col("id") == "a", "<THE-NEW-NAME>").otherwise(F.col("name"))) to add it to our DataFrame.
Once all the nodes have finished processing, create a temporary view: x.createOrReplaceTempView("myTmpView")
Finally, insert data from my temporary view into an existing table with sqlContext.sql("INSERT INTO TABLE existingTable SELECT * FROM myTmpView")
I think this will work, but it seems extremely hacky. I'm not sure if this is a function of my lack of understanding of Hive/Spark, or if this is just the nature of the tech stack. Is there a better way to do this? Is there a performance cost to handling it in this way?
In the Delta Lake API, upserts (MERGE) are supported using Scala and also Python, which is exactly what you are trying to implement.
https://docs.delta.io/latest/delta-update.html#merge-examples
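A minimal MERGE sketch, assuming the nodes table has already been converted to a Delta table at some path and that "name" is the dedup key (the path, column names, and the stand-in DataFrame below are assumptions):

```python
# Upsert this run's nodes into an existing Delta table keyed on "name",
# so re-found nodes (e.g. node A) do not create duplicate rows.
from pyspark.sql import SparkSession
from delta.tables import DeltaTable

spark = SparkSession.builder.getOrCreate()   # assumes Delta Lake is on the classpath

# Stand-in for the DataFrame of nodes found in the current run.
x = spark.createDataFrame([("A",), ("X",), ("Y",), ("Z",)], ["name"])

existing = DeltaTable.forPath(spark, "/warehouse/graph_nodes")   # assumed location

(existing.alias("t")
    .merge(x.alias("s"), "t.name = s.name")
    .whenMatchedUpdateAll()
    .whenNotMatchedInsertAll()
    .execute())
```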
Here is an alternate solution (a sketch follows the steps):
Have a column updated_time (timestamp) in your table.
Union prev_run_results and current_run_results.
Group by 'node' and select the latest timestamp.
Save the results.
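A sketch of that approach in PySpark, assuming the node identifier column is called name and that the current run's nodes are in a DataFrame like the question's x (all column and table names are assumptions):

```python
# Union the previous results with this run's nodes and keep only the newest
# row per node, based on updated_time. Both sides must share the same schema.
from pyspark.sql import SparkSession, Window, functions as F

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

# Stand-in for this run's node DataFrame (replace with your real one).
x = spark.createDataFrame([("A",), ("X",)], ["name"])

prev_run = spark.table("existingTable")                              # already in Hive
current_run = x.withColumn("updated_time", F.current_timestamp())    # this run's nodes

combined = prev_run.unionByName(current_run)

# Keep only the newest row per node.
w = Window.partitionBy("name").orderBy(F.col("updated_time").desc())
deduped = (combined
           .withColumn("rn", F.row_number().over(w))
           .filter("rn = 1")
           .drop("rn"))

deduped.write.mode("overwrite").saveAsTable("existingTable_deduped")
```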

Is Spark good for processing data from a SQL DB in a job? How to avoid processing the same data in the job?

I have a problem and I wonder if Spark is a good tool to solve it:
There is a SQL DB. I want to process data from a table like this:
Orders Table:
| id | product | date |
I would like to create "processing job" which can scan all records and save to other db/file.
Ultimately, I would like to have several features/tables in the database/file (for example, the older product orders, the number of orders for a given month).
So, the target database/file will contain the ordersForGivenMounths table with the values: September: 150 (orders with same id), October: 230 ... etc.
Tables in the database will be expanded. I have given only two examples.
Can it be done with Spark? Is it a good tool for this type of task?
Can I create jobs in Spark that will process the sql database every given period of time?
New records will be constantly added to the source sql database. Is it possible to configure Spark so that it does not process data that it has processed earlier and already pushed into the target database/file earlier?
I was looking for tutorials/docs but most are introductions without specific solutions.
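To make the scenario concrete, here is a hedged sketch of a periodic Spark batch job that reads the Orders table over JDBC, skips rows already processed in earlier runs by tracking a simple watermark, and writes one of the feature tables. The JDBC URL, credentials, column names, and watermark file are all assumptions, not part of the question:

```python
# Sketch of a periodic Spark batch job over the Orders table.
import json, os
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

WATERMARK_FILE = "/tmp/orders_watermark.json"   # stores the last processed order id

def load_watermark():
    if os.path.exists(WATERMARK_FILE):
        with open(WATERMARK_FILE) as f:
            return json.load(f)["last_id"]
    return 0

last_id = load_watermark()

orders = (spark.read.format("jdbc")
          .option("url", "jdbc:postgresql://db-host:5432/shop")   # assumed source DB
          .option("dbtable", "orders")
          .option("user", "user").option("password", "secret")
          .load()
          .filter(F.col("id") > last_id))        # skip rows processed in earlier runs

# Example feature table: number of new orders per month this run
# (incremental counts; downstream can sum them per month).
orders_per_month = (orders
                    .groupBy(F.date_format("date", "yyyy-MM").alias("month"))
                    .count())

orders_per_month.write.mode("append").parquet("/data/ordersForGivenMonths")

# Persist the new watermark for the next run.
new_max = orders.agg(F.max("id")).first()[0]
if new_max is not None:
    with open(WATERMARK_FILE, "w") as f:
        json.dump({"last_id": int(new_max)}, f)
```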
I think you can use Spark Streaming with a custom receiver, and you can add some logic in the receiver: http://spark.apache.org/docs/latest/streaming-custom-receivers.html

Require help in creating a Cassandra data model design for my requirement

I have a Job_Status table with 3 columns:
Job_ID (numeric)
Job_Time (datetime)
Machine_ID (numeric)
A few other fields containing stats (like memory, CPU utilization)
At a regular interval (say 1 minute), entries are inserted into the above table for the jobs running on each machine.
I want to design the data model in Cassandra.
My requirement is to get the list (pairs) of jobs which are running at the same time on 2 or more machines.
I have created a table with Job_ID and Job_Time as the primary key, but in order to achieve the desired result I have to do a lot of parsing of the data after retrieving the records.
This takes a lot of time when the number of records reaches around 500 thousand.
This requirement calls for an operation like an inner join in SQL, but I can't use SQL due to some business reasons, and a SQL query over such a huge data set also takes a lot of time, as I found when I tried it with dummy data in SQL Server.
So I require your help on below points:
Kindly suggest some efficient data model in Cassandra for this requirement.
How can the join operation of SQL be achieved/implemented in a Cassandra database?
Kindly suggest some alternate design/algorithm. I have been stuck on this problem for a very long time.
That's a pretty broad question. As a general approach you might want to look at pairing Cassandra with Spark so that you could do the large join in parallel.
You would insert jobs into your table when they start and delete them when they complete (possibly with a TTL set on insert so that jobs that don't get deleted will auto delete after some time).
When you wanted to update your pairing of jobs, you'd run a Spark batch job that would load the table data into an RDD and then do a map/reduce operation on the data, or use Spark SQL to do a SQL-style join. You'd probably then write the resulting RDD back to a Cassandra table.
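A sketch of that batch job using the DataFrame API of the spark-cassandra-connector; the keyspace/table names, the connection host, and the exact pairing condition (same timestamp, different machines) are assumptions:

```python
# Load the job status table with the spark-cassandra-connector and pair up
# jobs that were running on different machines at the same time.
from pyspark.sql import SparkSession, functions as F

spark = (SparkSession.builder
         .config("spark.cassandra.connection.host", "cassandra-host")  # assumed host
         .getOrCreate())

jobs = (spark.read
        .format("org.apache.spark.sql.cassandra")
        .options(keyspace="monitoring", table="job_status")
        .load()
        .select("job_id", "job_time", "machine_id"))

a, b = jobs.alias("a"), jobs.alias("b")

# Self-join on the timestamp; keep pairs from different machines and avoid
# emitting both (x, y) and (y, x).
pairs = (a.join(b, on=(F.col("a.job_time") == F.col("b.job_time")) &
                      (F.col("a.machine_id") < F.col("b.machine_id")))
          .select(F.col("a.job_time").alias("job_time"),
                  F.col("a.job_id").alias("job_1"),
                  F.col("b.job_id").alias("job_2")))

# Write the result back to an (assumed, pre-created) Cassandra table.
(pairs.write
    .format("org.apache.spark.sql.cassandra")
    .options(keyspace="monitoring", table="job_pairs")
    .mode("append")
    .save())
```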

cassandra "data model failover", not looking for cluster failover

I understand that Cassandra does not have a master-slave relationship, which is what I am after (peer to peer), and Cassandra's data replication concept is another feature I am looking forward to using.
However, is there an inbuilt function within Cassandra that will perform a data model failover without changes to the application? I am not after cluster failover but data model failover, and possibly rollback of the data model.
I would also like to know whether a database compare tool is available natively.
Sorry for my bad English.
Thanks.
E.g.:
Data Model A is table1...table10 (for example)
Data Model B is a replica of Data Model A
Scenario....
Control Centre East - 3 nodes (consistency must be high for all of table1...table10)
Control Centre West - 2 nodes (eventual consistency for table1...table10)
Control Centre North - 1 node (consistency must be high for some tables, table1...table5, and eventual consistency for the rest)
Sequence of events:
All nodes read/write to Data Model set A (I mean clients read and write Data Model set A). Data Model A is the one in production.
Make changes to Data Model B (add new rows, modify column values, etc.).
Fail over or change over all nodes from Data Model set A to Data Model set B.
During the failover, a small number of column values for identical rows between Data Model A and B have to be copied over from Data Model A to B.
After the failover, check whether the application gives wrong calculations (formula/output) based on the new Data Model B.
If Data Model B is problematic, fail over back to Data Model A for all nodes. Check/fix Data Model B before running steps 2-3 again.
Hopefully, it is not too long winded.
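Cassandra has no built-in notion of a "data model failover", so this usually ends up on the application side. One common shape, sketched below with the Python cassandra-driver, is to keep Data Model A and B in separate keyspaces, let the client switch which keyspace it targets, and copy the few needed column values at changeover. All keyspace, table, and column names here are assumptions:

```python
# Sketch: two keyspaces acting as Data Model A and B; the client switches
# between them, and selected columns are copied from A to B at changeover.
from cassandra.cluster import Cluster

cluster = Cluster(["cassandra-host"])          # assumed contact point
session = cluster.connect()

ACTIVE_KEYSPACE = "model_a"                    # flip to "model_b" at failover time

def query_table1(row_id):
    # All application reads go through the currently active data model.
    return session.execute(
        f"SELECT * FROM {ACTIVE_KEYSPACE}.table1 WHERE id = %s", (row_id,)).one()

def copy_column_to_model_b(row_id):
    # During changeover, copy a small set of column values for identical rows
    # from Data Model A to Data Model B.
    row = session.execute(
        "SELECT id, some_column FROM model_a.table1 WHERE id = %s", (row_id,)).one()
    if row is not None:
        session.execute(
            "UPDATE model_b.table1 SET some_column = %s WHERE id = %s",
            (row.some_column, row.id))
```

Rolling back is then just switching ACTIVE_KEYSPACE back to model_a, which matches step 7 in the sequence above; the schema comparison between the two keyspaces still has to be done with external tooling.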
