Cassandra Schema Design - Handling Merging of Similar but Differing Source Data Sets

I'm working on a project to merge data from multiple database tables and files into Cassandra. The data will come from different sources such as flat files, SQL databases, etc.
Problem statement: most of these sources are similar, but there are some differences, and I want to merge them all into a single Cassandra table. There are about 50 fields in common and an extra 20 fields that don't coexist across sources. My thought is that I can merge everything into one table with all of the fields and just leave the unpopulated ones empty (tombstones). The other option would be to merge the common fields into regular Cassandra columns and put the differing fields into a map column; however, I don't know whether there is any real benefit to doing this other than it looking nicer.
Any ideas/advice from people who have dealt with this?

What you need is an ETL (Extract/Transform/Load) tool to combine, clean, and/or standardize the data, and to use Cassandra as your repository. There are multiple tools on the market that provide this functionality (a Google search for "ETL tools" will give you an overwhelming number of options to choose from).
As a personal preference, check https://nifi.apache.org/ ; you can define those transformations and filters as workflows.
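For reference, a minimal CQL sketch of the two layouts the question describes might look like this (table and column names are hypothetical):

    -- Option 1: one wide table; fields a given source doesn't supply are simply left unset
    CREATE TABLE merged_records (
        record_id  uuid PRIMARY KEY,
        source     text,
        field_a    text,   -- ... roughly 50 shared fields
        field_b    int,
        extra_x    text,   -- ... roughly 20 source-specific fields, empty for most rows
        extra_y    text
    );

    -- Option 2: shared fields as regular columns, source-specific fields in a map
    CREATE TABLE merged_records_map (
        record_id  uuid PRIMARY KEY,
        source     text,
        field_a    text,
        field_b    int,
        extras     map<text, text>
    );

The map keeps the table definition compact but forces every source-specific value into a single value type, so, as the question suspects, the benefit is largely cosmetic.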

How do I figure out the right data design and the right tools/database/query for the requirement below?

I have a requirement but I am not able to figure out how to solve it. I have datasets in the format below:
id, atime, grade
123, time1, A
241, time2, B
123, time3, C
or if I put in list format:
[[123,time1,A],[124,timeb,C],[123,timec,C],[143,timed,D],[423,timee,P].......]
Now my use case is to perform comparisons, aggregations and queries over multiple rows, such as:
time difference between last 2 rows where id=123
time difference between last 2 rows where id=123 and grade=A
Time difference between first, 3rd, 5th and latest one
all data (or last 10 records for particular id) should be easily accessible.
I also need to do further computation. What format should I choose for the dataset, and what database/tools should I use?
I don't think a relational database is useful here. I'm not able to solve it with Solr/Elastic; if you have any ideas, please give a brief explanation. Or any other tool - Spark, Hadoop, Cassandra - any pointers?
I am trying out things but any help is appreciated.
Choosing the right technology depends heavily on your SLA: how much latency can your queries tolerate? What are your query types? Is your data big data or not? Is the data updatable? Do you expect late events? Do you need the historical raw data in the future, or can you use techniques like rollup? And so on. To clarify my answer: you can probably solve your problems with window functions. For example, you could store your data in any of the tools you mentioned and, using the Presto SQL engine, query it to get your desired result. But not all of them are optimal. Furthermore, these kinds of problems usually cannot be solved with a single tool; a set of tools may be needed to cover all the requirements.
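As an illustration only, assuming a table named events(id, atime, grade), a Presto-style query using the LAG window function could express the "time difference between the last 2 rows where id=123" requirement:

    -- hypothetical table events(id, atime, grade); Presto-style SQL
    SELECT date_diff('second', prev_atime, atime) AS seconds_between_last_two
    FROM (
        SELECT atime,
               LAG(atime) OVER (PARTITION BY id ORDER BY atime) AS prev_atime
        FROM events
        WHERE id = 123
    ) t
    ORDER BY atime DESC
    LIMIT 1;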
tl;dr: the text below doesn't arrive at a solution; it introduces a way to think about data modeling and choosing tools.
Let me try to model the problem so as to choose a single tool. I assume your data is not updatable, you need low-latency responses, we don't expect any late events, and we face a large-volume data stream that must be kept as raw data.
Based on the first and second requirements, it's crucial to have random access (it seems you want to query by a particular ID), so solutions like Parquet or ORC files are not a good choice.
Based on the last requirement, data must be partitioned by ID. The first, second, and last requirements all rely on ID as the identifying part, and there seems to be no need for joins or for global ordering on other fields such as time. So we can choose ID as the partition key (physical or logical) and atime as the clustering part: for each ID, events are ordered by time.
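In Cassandra terms, for example, that model would look roughly like this (the table name is hypothetical):

    -- id as the partition key, atime as the clustering column, newest events first
    CREATE TABLE events_by_id (
        id     int,
        atime  timestamp,
        grade  text,
        PRIMARY KEY ((id), atime)
    ) WITH CLUSTERING ORDER BY (atime DESC);

    -- "last 10 records for a particular id" becomes a cheap partition slice
    SELECT * FROM events_by_id WHERE id = 123 LIMIT 10;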
The third requirement is a bit vague: do you want a result over all the data, or per ID?
For computing the first three conditions, we need a tool that supports window functions.
Based on the notes above, it seems we should choose a tool with good support for random-access queries. Cassandra, Postgres, Druid, MongoDB, and ElasticSearch are the ones that currently come to mind. Let's check them:
Cassandra: It's great on response time for random-access queries, can handle a huge amount of data easily, and has no single point of failure. But sadly it does not support window functions. You also have to design your data model carefully, and it does not seem to be a good choice here (because of the future need for raw data). We could bypass some of these limitations by using Spark alongside Cassandra, but for now we prefer to avoid adding a new tool to our stack.
Postgres: It's great at random-access queries on indexed columns, and it supports window functions. We can shard the data (horizontal partitioning) across multiple servers (and by choosing ID as the shard key we get data locality for computations). But there is a problem: ID is not unique, so we cannot choose ID as the primary key, and we face some issues with random access (we could make the ID and atime columns a compound primary key, with atime as a timestamp column, but that does not save us).
Druid: It's a great OLAP tool. Thanks to the way Druid stores data (segment files), with the right data model you can run analytic queries over huge volumes of data in sub-second time. It does not support window functions, but with rollup and some other functions (like EARLIEST) we can answer our questions. However, by using rollup we lose the raw data, and we need it.
MongoDB: It supports random-access queries and sharding. We can also get some form of window function through its aggregation framework by defining pipelines. It supports capped collections, which we could use to store the last 10 events for each ID if the cardinality of the ID column is not too high. It seems this tool can cover all of our requirements.
ElasticSearch: It's great at random access, maybe the best of the lot. With certain filter aggregations we can get a kind of window function, and it handles large amounts of data through sharding. But its query language is hard. I can imagine answering the first and second questions with ES, but right now I can't form the query in my head; it would take time to find the right solution with it.
So it seems MongoDB and ElasticSearch can meet the requirements, but there are a lot of 'if's along the way. I don't think there's a straightforward solution with a single tool. Maybe we should choose multiple tools and use techniques like duplicating data to find an optimal solution.

lakeFS, Hudi, Delta Lake merge and merge conflicts

I'm reading the documentation about lakeFS and right now I don't clearly understand what a merge, or even a merge conflict, is in terms of lakeFS.
Let's say I use Apache Hudi for ACID support over a single table. I'd like to introduce multi-table ACID support and for this purpose would like to use lakeFS together with Hudi.
If I understand everything correctly, lakeFS is a data-agnostic solution and knows nothing about the data itself. lakeFS only establishes boundaries (version control) and somehow moderates concurrent access to the data.
So the reasonable question is: if lakeFS is data agnostic, how does it support the merge operation? What does a merge itself mean in terms of lakeFS? And is it possible to have a merge conflict there?
You do understand everything correctly. You can see on the branching model page that lakeFS is currently data agnostic and relies simply on the hierarchical directory structure. A conflict occurs when two branches update the same file.
This behavior fits most data engineers' CI/CD use cases.
If you are working with Delta Lake and have made changes to the same table from two different branches, there will still be a conflict, because both branches changed the log file. To resolve the conflict you would need to forgo one of the change sets.
Admittedly this is not the best user experience, and it's currently being worked on. You can read more about it in the roadmap documentation.

Spotfire for cross fact table joins using conformed dimensions

A few years ago I ascertained that Spotfire cannot perform multi-fact-table queries using conformed dimensions a la Ralph Kimball - as with Tableau, where this is still the case.
Is this still so? Most people I speak to are not aware of this. I am not in a position to quickly assess this, hence my question.
If you are reading from a DB, you can create custom information links using SQL (or what Spotfire calls SQL; it's a little different) that can certainly join multiple fact tables together through conformed dimensions. These may perform well or poorly depending on the amount of data and the structure of the tables in question.
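For example, SQL of that kind usually follows Kimball's "drill across" pattern: aggregate each fact table to the grain of the conformed dimension first, then join the aggregates. A hypothetical sketch (all table and column names are made up):

    SELECT COALESCE(s.date_key, r.date_key) AS date_key,
           s.sales,
           r.returns
    FROM (
        SELECT date_key, SUM(sales_amount) AS sales
        FROM fact_sales
        GROUP BY date_key
    ) s
    FULL OUTER JOIN (
        SELECT date_key, SUM(return_amount) AS returns
        FROM fact_returns
        GROUP BY date_key
    ) r ON r.date_key = s.date_key;

Joining the two fact tables to the dimension directly, without aggregating first, risks fanning rows out and double-counting measures.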
You can also 'join' fact tables across dimensions (or directly to each other if you have the right keys) within the tool itself. These are called relations and work under the same principles, but don't kick off joined SQL statements.
If you create a view in the DB that does the joins, as you said, Spotfire can read that into an information link as well.

Using Cognos 10.1 which is better an Inner Join or an "IN" Filter?

I'm using Cognos 10.1 and I have a report that uses two queries each with the same primary key.
Query 1: UniqueIds
Query 2: DetailedInfo
I'm not sure how to tell whether it's better to build the report using the DetailedInfo query with a filter that says PrimaryKey in (UniqueIds.PrimaryKey), or to create a third query that joins UniqueIds to DetailedInfo on PrimaryKey.
I'm new to Cognos and I'm learning to think differently. Using Microsoft SQL Server I'd just use an inner join.
So my question is: in Cognos 10.1, which way is better, and how can I tell what the performance differences are?
You'd better start from the beginning.
Your queries (I hope Query Subjects) should be joined in Framework Manager, in a model. Then you can easily filter the second query by applying filters to the first.
Joins in Report Studio are the last resort.
The report writer's ultimate weapon is a well-indexed data warehouse with a solid framework model built on top.
You want as much of your filtering and joining as possible to happen on the database side. If it doesn't, large data sets are brought over to the Cognos server before they are joined and filtered by Cognos.
The more work that happens on the database, the faster your reports will be. By building your reports in certain ways, you can mitigate Cognos side processing, and promote database side processing.
The first and best way to do this is with a good Framework Model, as Alexey pointed out. This will allow your reports to be simpler, and pushes most of the work to the database.
However a good model still exposes table keys to report authors so that they can have the flexibility to create unique data sets. Not every report warrants a new Star Schema, and sometimes you want to join the results of queries against two different Star Schema sources.
When using a join or a filter, Cognos attempts to push all of the work to the database by default. It wants to receive only the final data set and nothing else.
However, when creating your filters, you have two ways of referring to a value: with explicit names that refer to modeled data sources (e.g. [Presentation View].[Sales].[Sales Detail].[Net Profit]) or by referring to a column in the current data set (such as [Net Profit]). Using explicit columns from the model helps ensure the filters are applied at the database.
Sometimes that is not possible, such as with a calculated column. For example, if you don't have Net Profit in your database or your model, you may establish it with a calculated column. If you then filter on [Net Profit] > 1000, Cognos will pull the data set into Cognos before applying your filter. The final result will be the same, but depending on the size of the data before and after the filter is applied, you could see a performance decrease.
It is possible to have nested queries within your report, and Cognos will generate a single giant SQL statement for the highest-level query, which includes sub-queries for all the lower-level data. You can generate the SQL/MDX to see how Cognos is building the queries.
Also, try experimenting. Save your report with a new name, try it one way and time it. Run it a few times and take an average execution speed. Time it again with the alternate method and compare.
With smaller data sets you are unlikely to see any difference. The larger your data set gets, the bigger the difference your chosen method will make to report speed.
Use joins to merge two queries together so that columns from both queries can be used in the report. Use IN() syntax if your only desire is to filter one query using the existence of corresponding rows in a second. That said, there are likely to be many cases that both methods will be equally performant, depending on the number of rows involved, indexes etc.
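In plain SQL terms, the two approaches look something like the following sketch (table names follow the question; the SQL Cognos actually generates will differ):

    -- join: use this when the report needs columns from both queries
    SELECT d.*
    FROM   DetailedInfo d
    JOIN   UniqueIds u ON u.PrimaryKey = d.PrimaryKey;

    -- IN filter: use this when UniqueIds only restricts DetailedInfo
    SELECT d.*
    FROM   DetailedInfo d
    WHERE  d.PrimaryKey IN (SELECT PrimaryKey FROM UniqueIds);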
By the way, within a report Cognos only supports joins and unions between different queries. You can reference other queries directly in filters even without an established relationship, but I've seen quirks with this, like it working when run interactively but not when scheduled or exported. I would avoid doing this in reports.

Is Cassandra a good candidate for a database that must sustain over 100 read/write operations per second?

Currently our system uses PostgreSQL, but we seem to have pushed the limit of its capabilities. Some of our tables need to handle over 100 read/write operations per second, so it is probably time to scale horizontally across multiple machines.
I have a lot of experience using GAE's Big Table. Big Table had rich options for querying; for example, queries were possible against list data fields. Cassandra is supposed to be based on Big Table, but if I understand correctly, with Cassandra we will actually have to custom-code a layer on top that uses and maintains index tables.
It would be great if there were an open-source database available for which we did not have to build our own custom logic for maintaining index tables, zig-zag merge joins, etc.
Is Cassandra a good candidate here? Or are there ones that might be considered better?
Unless the operations are huge joins or return hundreds of thousands of rows, any database you choose will be able to sustain 100 ops/s. Cassandra will have no problems serving thousands if not tens of thousands of reads and writes per node.
Without knowing more about your particular use case it's impossible to give you meaningful advice. Cassandra is a great database, but whether it's right for you I don't know. I'd suggest looking through the cassandra tag here on Stack Overflow, seeing what people ask about and whether it looks at all like what you're trying to do, and whether the answers say it's possible with Cassandra (I know I've answered quite a few questions where the answer was that Cassandra wasn't the best choice for that particular case).
Cassandra and GAE Big Table have big similarities, but also big differences. One thing that trips up new Cassandra users is that there really isn't any way of doing things like "add this thing only unless that other thing was there" or "add an item and remove all but the last N items".
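For context, the "index table" layer the question mentions is usually just a second, application-maintained table keyed by the lookup value. A hypothetical CQL sketch:

    -- primary table
    CREATE TABLE users_by_id (
        user_id uuid PRIMARY KEY,
        email   text,
        name    text
    );

    -- hand-maintained index: the application writes to both tables on every insert/update
    CREATE TABLE users_by_email (
        email   text PRIMARY KEY,
        user_id uuid
    );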
