Apache Calcite Data Federation Use Case

Just want to check whether Apache Calcite can be used for the "Data Federation" use case (a single query spanning multiple databases).
The idea is that I have a master query over 5 tables, where 2 tables come from one database (say Hive) and 3 tables from another database (say MySQL).
Can I execute the master query across multiple databases from one JDBC client interface?
If this is possible, where does the query execution (particularly the inter-database join) happen?
Also, can I get a physical plan from Calcite that I can execute explicitly in another execution engine?
I read in the Calcite documentation that it can push down Join and GroupBy, but I could not understand it. Can anyone help me understand this?

I will try to answer. You can also send questions to the mailing list (dev@calcite.apache.org); you are more likely to get an answer there.
Can I execute the master query across multiple databases from one JDBC client interface? If this is possible, where does the query execution (particularly the inter-database join) happen?
Yes, you can. The inter-database join happens in memory, in the process where Calcite runs.
Can I get a physical plan from Calcite that I can execute explicitly in another execution engine?
Yes, you can; a lot of Calcite consumers do it this way. But you will have to wrap around the Calcite rule system, i.e. extract the optimized plan and execute it yourself.
I read in the Calcite documentation that it can push down Join and GroupBy, but I could not understand it. Can anyone help me understand this?
These are SQL optimizations that the engine performs. Imagine a GROUP BY that could have been evaluated on a tiny table but was written after a join with a huge table: pushing the aggregation down, below the join, avoids materializing the large intermediate result.
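For what it's worth, here is a minimal sketch of how such a federated query can be wired up through Calcite's JDBC adapter. The connection URLs, credentials, and table names (events, orders) are invented for illustration, and it assumes both backends are reachable over JDBC:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;
import java.util.Properties;
import javax.sql.DataSource;
import org.apache.calcite.adapter.jdbc.JdbcSchema;
import org.apache.calcite.jdbc.CalciteConnection;
import org.apache.calcite.schema.SchemaPlus;

public class FederationDemo {
  public static void main(String[] args) throws Exception {
    Properties info = new Properties();
    info.setProperty("lex", "JAVA"); // case-sensitive identifiers
    Connection connection = DriverManager.getConnection("jdbc:calcite:", info);
    CalciteConnection calcite = connection.unwrap(CalciteConnection.class);
    SchemaPlus root = calcite.getRootSchema();

    // Register each backend as a sub-schema. Calcite pushes down what each
    // source can handle and evaluates the cross-source join itself, in memory.
    DataSource mysql = JdbcSchema.dataSource(
        "jdbc:mysql://localhost:3306/sales", "com.mysql.cj.jdbc.Driver", "user", "pass");
    root.add("MYSQL", JdbcSchema.create(root, "MYSQL", mysql, null, null));

    DataSource hive = JdbcSchema.dataSource(
        "jdbc:hive2://localhost:10000/default", "org.apache.hive.jdbc.HiveDriver", "user", "pass");
    root.add("HIVE", JdbcSchema.create(root, "HIVE", hive, null, null));

    // One master query over both databases through a single JDBC interface.
    try (Statement stmt = calcite.createStatement();
         ResultSet rs = stmt.executeQuery(
             "select h.user_id, m.total from HIVE.events h "
                 + "join MYSQL.orders m on h.user_id = m.user_id")) {
      while (rs.next()) {
        System.out.println(rs.getString(1) + " " + rs.getBigDecimal(2));
      }
    }
  }
}
```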

Related

Use RichMap in Flink like Scala MapPartition

In Spark, we have the mapPartitions function, which can be used to do some initialization for a group of entries, like some DB operation.
Now I want to do the same thing in Flink. After some research I found out that I can use RichMap for this, but it has the drawback that the operation can only be done in the open method, which runs once at the start of a streaming job. I will explain my use case, which should clarify the situation.
Example: I am getting data for millions of users from Kafka, but I only want the data of some users to be persisted in the end. This list of users is dynamic and is available in a DB. I want to look up the current users every 10 minutes, so that I filter out and store the data for only those users. In Spark (mapPartitions), the user lookup would happen for every group, and I had configured it to refresh the users from the DB every 10 minutes. But with Flink's RichMap I can do that only in the open function, when my job starts.
How can I do this in Flink?
It seems that what you want to do is a stream-table join. There are multiple ways of doing that, but the easiest one here would be to use the Broadcast State pattern.
The idea is to define a custom DataSource that periodically queries the data from the SQL table (or, even better, use CDC), broadcast that table stream as broadcast state, and connect it with the actual users stream.
Inside the ProcessFunction for the connected streams you will have access to the broadcast table data, so you can perform a lookup for every user you receive and decide what to do with it.
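A rough sketch of that pattern in Java follows. The event format, UserListSource, and loadAllowedUsersFromDb are all invented for illustration; a real job would run a JDBC query (or consume a CDC stream) instead of returning an empty set:

```java
import java.util.HashSet;
import java.util.Set;
import org.apache.flink.api.common.state.MapStateDescriptor;
import org.apache.flink.api.common.typeinfo.Types;
import org.apache.flink.streaming.api.datastream.BroadcastStream;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.co.BroadcastProcessFunction;
import org.apache.flink.streaming.api.functions.source.SourceFunction;
import org.apache.flink.util.Collector;

public class BroadcastUserFilterJob {

  static final MapStateDescriptor<String, Boolean> USERS_DESC =
      new MapStateDescriptor<>("allowed-users", Types.STRING, Types.BOOLEAN);

  // Custom source that re-reads the user list every 10 minutes.
  static class UserListSource implements SourceFunction<Set<String>> {
    private volatile boolean running = true;

    @Override
    public void run(SourceContext<Set<String>> ctx) throws Exception {
      while (running) {
        ctx.collect(loadAllowedUsersFromDb()); // hypothetical JDBC lookup
        Thread.sleep(10 * 60 * 1000L);
      }
    }

    @Override
    public void cancel() {
      running = false;
    }

    private Set<String> loadAllowedUsersFromDb() {
      return new HashSet<>(); // stand-in for the real DB query
    }
  }

  public static void main(String[] args) throws Exception {
    StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

    DataStream<String> events = env.socketTextStream("localhost", 9999); // stand-in for Kafka
    BroadcastStream<Set<String>> users =
        env.addSource(new UserListSource()).broadcast(USERS_DESC);

    events.connect(users)
        .process(new BroadcastProcessFunction<String, Set<String>, String>() {
          @Override
          public void processElement(String event, ReadOnlyContext ctx,
                                     Collector<String> out) throws Exception {
            // Keep the event only if its user id is in the broadcast set;
            // assumes a CSV-style event whose first field is the user id.
            String userId = event.split(",")[0];
            if (Boolean.TRUE.equals(ctx.getBroadcastState(USERS_DESC).get(userId))) {
              out.collect(event);
            }
          }

          @Override
          public void processBroadcastElement(Set<String> userIds, Context ctx,
                                              Collector<String> out) throws Exception {
            // Replace the broadcast state with the newest user snapshot.
            ctx.getBroadcastState(USERS_DESC).clear();
            for (String id : userIds) {
              ctx.getBroadcastState(USERS_DESC).put(id, Boolean.TRUE);
            }
          }
        })
        .print();

    env.execute("broadcast-user-filter");
  }
}
```

Each new snapshot replaces the broadcast state, so the filter always reflects the user list as of the last refresh.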

Dynamic Cassandra queries

I have a messenger application with a history page, on which you can see your sent and received messages.
Since the growing amount of messages has hurt my performance, I have been thinking about using Cassandra.
After researching the topic, I found out that in Cassandra you have to design your tables to satisfy your queries.
Now the problem: on the history page you can apply any number of different filters at the same time, e.g. filter by date, receiver, and sender.
If I were to use Cassandra, would I need to create a table for every combination of these filters?
Or is this a bad use case for Cassandra in general?
If so, are there any alternatives?
Why don't you just use a SELECT statement?
You should definitely have a look at CQL (the Cassandra Query Language).
While CQL and SQL share a similar syntax, the queries you can run are quite different.
The reason for these differences is that Cassandra deals with distributed data and aims to prevent inefficient queries.
See this link for reference. It shows which queries you can and cannot do.
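To make the "tables satisfy your queries" point concrete, here is a hedged sketch with the DataStax Java driver. The table, columns, and values are invented; the point is that the partition key decides which filters are cheap:

```java
import com.datastax.oss.driver.api.core.CqlSession;

public class MessageHistorySchema {
  public static void main(String[] args) {
    try (CqlSession session = CqlSession.builder().build()) { // localhost by default
      // Partition by the column you filter on most; cluster by time so a
      // date filter becomes a cheap range scan inside one partition.
      session.execute(
          "CREATE TABLE IF NOT EXISTS messages_by_sender ("
              + " sender text, sent_at timestamp, receiver text, body text,"
              + " PRIMARY KEY ((sender), sent_at)"
              + ") WITH CLUSTERING ORDER BY (sent_at DESC)");

      // Efficient: restricted by partition key plus a clustering-key range.
      session.execute(
          "SELECT * FROM messages_by_sender"
              + " WHERE sender = 'alice' AND sent_at >= '2020-01-01'");

      // Rejected without ALLOW FILTERING: no partition key restriction.
      // session.execute("SELECT * FROM messages_by_sender WHERE receiver = 'bob'");
    }
  }
}
```

You don't necessarily need a table for every filter combination: a table per leading filter (sender, receiver) often covers the remaining filters via clustering columns.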

When to use Materialized Views?

I'm learning Cassandra now and I understand I should make a table for each query. I'm not sure when I should make separate tables or materialized views. For example, I have the following queries for users and posts:
users_by_id
users_by_email
users_by_session_key
posts_by_id
posts_by_category
posts_by_user
Should I always use materialized views?
It seems to me that if you want to keep the Posts or Users consistent across queries, you have to use materialized views. However, I have read that materialized views incur a read-before-write latency.
On the other hand, if I use different tables, am I supposed to make 3 inserts every time a new post is created? I noticed that I get the error batch with conditions cannot span multiple tables, which means I have to insert into each separate table one at a time, and that can cause consistency problems if one of the inserts fails. (A batch statement would fail all 3 if one of them failed.)
So, since it makes sense to have consistency, it seems to me that I will always want to use materialized views and take the read-before-write penalty.
I guess my other question is: when would it ever be okay for data to be inconsistent?
So I am hoping someone can provide more clarity on how to handle multiple queries in Cassandra for a 'theoretical model' like Users or Posts. Should I be using materialized views? If I use 3 different tables for each model, how do I keep them consistent? Just hope that all 3 inserts don't fail? That doesn't seem right.
Read my deep dive blog post for all the trade-offs when using materialized views. Once you understand the trade-offs, choose wisely: http://www.doanduyhai.com/blog/?p=1930
No, you shouldn't always use materialized views. The cleanest solution is an interface in your application that handles all the different tables itself. But there are also some use cases for materialized views: if you don't have the time to build that application layer but you need the feature, use materialized views. You pay a performance trade-off, but in that scenario time is more important. Also, if you need real updates instead of upserts on all tables, use materialized views.
A batch is useful for buffering, or for writing data sets with the same partition key together. For example: you have a high-throughput application. Between your heartbeats, or while another query executed with QUORUM is still in flight, you receive 10 other events with the same partition key. You don't execute them immediately because you're waiting for a successful response; once success comes back, you execute them as one batch. But please keep in mind: only use a batch for the same partition key.
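For illustration, a small sketch of such a same-partition batch with the DataStax Java driver 4.x; the table and values are invented:

```java
import java.time.Instant;
import com.datastax.oss.driver.api.core.CqlSession;
import com.datastax.oss.driver.api.core.cql.BatchStatement;
import com.datastax.oss.driver.api.core.cql.BatchStatementBuilder;
import com.datastax.oss.driver.api.core.cql.BatchType;
import com.datastax.oss.driver.api.core.cql.PreparedStatement;

public class SamePartitionBatch {
  public static void main(String[] args) {
    try (CqlSession session = CqlSession.builder().build()) {
      PreparedStatement insert = session.prepare(
          "INSERT INTO events_by_user (user_id, ts, payload) VALUES (?, ?, ?)");

      // Every row shares user_id 'alice', i.e. the same partition key, so
      // the whole batch lands on one replica set and stays cheap.
      BatchStatementBuilder batch = BatchStatement.builder(BatchType.UNLOGGED);
      for (int i = 0; i < 10; i++) {
        batch.addStatement(insert.bind("alice", Instant.now(), "event-" + i));
      }
      session.execute(batch.build());
    }
  }
}
```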
Generally, remember one important thing: Cassandra has an eventual consistency model. That means that even with QUORUM you get consistency most of the time, but not always. If your application needs full consistency rather than eventual consistency, use another solution, e.g. SQL with sharding. Cassandra is optimized for writes, and you will only be happy with it when you use its features as intended.
Some performance tips:
If you need better consistency, use QUORUM, never ALL. And, generally, write your queries standalone; only sometimes is a batch useful. Don't execute queries with ALLOW FILTERING. Don't use token ranges or the IN operator on partition keys :)
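To illustrate the "write your queries standalone" advice against the asker's three-tables problem, a hedged sketch (table and column names invented): each denormalized insert is issued on its own, and because Cassandra writes are idempotent upserts, a failed insert can simply be retried until all tables converge.

```java
import java.time.Instant;
import java.util.UUID;
import com.datastax.oss.driver.api.core.CqlSession;
import com.datastax.oss.driver.api.core.cql.PreparedStatement;

public class PostWriter {
  private final CqlSession session;
  private final PreparedStatement byId;
  private final PreparedStatement byCategory;
  private final PreparedStatement byUser;

  PostWriter(CqlSession session) {
    this.session = session;
    byId = session.prepare(
        "INSERT INTO posts_by_id (post_id, author, category, body) VALUES (?, ?, ?, ?)");
    byCategory = session.prepare(
        "INSERT INTO posts_by_category (category, created_at, post_id, author, body)"
            + " VALUES (?, ?, ?, ?, ?)");
    byUser = session.prepare(
        "INSERT INTO posts_by_user (author, created_at, post_id, category, body)"
            + " VALUES (?, ?, ?, ?, ?)");
  }

  // Three standalone writes, one per query table. Retry any failure: the
  // upsert semantics make replays harmless, so the tables converge.
  void save(UUID postId, String author, String category, String body) {
    Instant now = Instant.now();
    session.execute(byId.bind(postId, author, category, body));
    session.execute(byCategory.bind(category, now, postId, author, body));
    session.execute(byUser.bind(author, now, postId, category, body));
  }
}
```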

cassandra trigger execution sequence

I want to use a Cassandra trigger to import my data into Elasticsearch for searching.
For data consistency, I would like these steps to execute atomically.
So I want to know where the trigger sits in the execution sequence: does it run atomically together with the commitlog write, memtable write, and index update, or is the trigger completely asynchronous?
Triggers run before everything you listed above. The intent is to capture a mutation before it is persisted in the database, potentially to enhance the data as it is received. What you have outlined could therefore hit edge failure conditions where data is indexed in ES but never persisted to the database.
Have you looked at the DataStax search product? It has a much deeper integration with Cassandra that avoids these problems.
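For reference, a minimal sketch of Cassandra's server-side trigger interface, just to show where it runs in the write path; the Elasticsearch call is left as a hypothetical comment:

```java
import java.util.Collection;
import java.util.Collections;
import org.apache.cassandra.db.Mutation;
import org.apache.cassandra.db.partitions.Partition;
import org.apache.cassandra.triggers.ITrigger;

public class EsIndexTrigger implements ITrigger {
  // augment() is invoked on the coordinator *before* the mutation hits the
  // commitlog or memtable, so a side effect here (e.g. indexing into ES)
  // can succeed even if the write itself later fails; that is the
  // consistency caveat described above.
  @Override
  public Collection<Mutation> augment(Partition update) {
    // indexIntoElasticsearch(update);  // hypothetical helper, not shown
    return Collections.emptyList();     // no additional mutations to apply
  }
}
```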

Multithreaded DBMSs?

I was wondering what DBMSs actually use multithreading in their query plans/executions?
Oracle supports this, as do SQL Server and DB2. I do not believe that MySQL or PostgreSQL support parallel queries, though.
I believe most databases that support table partitioning will support querying each partition at the same time if the need arises, rather than just pruning unneeded partitions. Oracle can do this. Teradata definitely does this.
MySQL only uses one thread per query (in the standard engines); this includes when the tables are partitioned.
Multithreading is used in many areas of a DBMS, for example in query evaluation:
*) Parallel query execution uses multithreading to optimize the performance of query evaluation.
*) Parallelizing the database backup: creating a separate backup thread for each available tape drive speeds up the database server backup. Oracle, for example, does this.
*) Table reorganization: over time the database grows bulky, and the DBA reorganizes tables with the intention of improving performance; this reorganization can also be parallelized.
In Oracle, POSIX threads and C++ are used to implement the multithreading.
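As a toy illustration of the partition-parallel querying mentioned above, a hedged sketch in plain JDBC: one worker thread per partition table, partial counts merged at the end. The connection URL and partition names are invented:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class ParallelPartitionScan {
  public static void main(String[] args) throws Exception {
    List<String> partitions = List.of("sales_2021", "sales_2022", "sales_2023");
    ExecutorService pool = Executors.newFixedThreadPool(partitions.size());

    List<Future<Long>> partials = new ArrayList<>();
    for (String part : partitions) {
      partials.add(pool.submit(() -> {
        // Each worker opens its own connection: JDBC connections must not
        // be shared across threads.
        try (Connection c = DriverManager.getConnection("jdbc:postgresql://localhost/db");
             Statement s = c.createStatement();
             ResultSet rs = s.executeQuery("SELECT count(*) FROM " + part)) {
          rs.next();
          return rs.getLong(1);
        }
      }));
    }

    long total = 0;
    for (Future<Long> f : partials) {
      total += f.get(); // merge the partial results
    }
    pool.shutdown();
    System.out.println("total rows: " + total);
  }
}
```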
