In Spark, there is the mapPartitions function, which can be used to do some one-time initialization for a whole group of records, like opening a database connection.
Now I want to do the same thing in Flink. After some research I found that I can use RichMapFunction for this, but it has the drawback that the initialization can only happen in the open method, which runs once when the streaming job starts. I will explain my use case, which will clarify the situation.
Example: I am getting data for millions of users from Kafka, but I only want the data of some users to be persisted in the end. This list of users is dynamic and lives in a DB. I want to look up the current users every 10 minutes, so that I filter out and store only those users' data. In Spark (mapPartitions) the user lookup would happen for every partition, and I had configured it to refresh the users from the DB every 10 minutes. But with Flink's RichMapFunction I can do that only in the open method, when my job starts.
How can I do this kind of operation in Flink?
It seems that what you want to do is a stream-table join. There are multiple ways of doing that, but it seems the easiest one here would be to use the broadcast state pattern.
The idea is to define a custom source that periodically queries the SQL table (or, even better, use CDC), broadcast that table stream as broadcast state, and connect it with the actual user stream.
Inside the process function for the connected streams you will have access to the broadcast table data, so you can perform a lookup for every record you receive and decide what to do with it.
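A minimal sketch of that pattern in Java follows. The KafkaEventSource, PeriodicDbUserSource, and PersistenceSink classes are hypothetical placeholders for your Kafka source, a source that re-queries the user table every 10 minutes, and your sink; everything else is Flink's broadcast state API.

import org.apache.flink.api.common.state.MapStateDescriptor;
import org.apache.flink.api.common.typeinfo.Types;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.BroadcastStream;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.co.BroadcastProcessFunction;
import org.apache.flink.util.Collector;

public class UserFilterJob {

    // Broadcast state: userId -> marker (true = persist this user's data).
    static final MapStateDescriptor<String, Boolean> ALLOWED_USERS =
            new MapStateDescriptor<>("allowedUsers", Types.STRING, Types.BOOLEAN);

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Hypothetical sources: the Kafka stream of (userId, payload) events and a
        // custom source that re-queries the user table from the DB every 10 minutes.
        DataStream<Tuple2<String, String>> events = env.addSource(new KafkaEventSource());
        DataStream<String> allowedUserIds = env.addSource(new PeriodicDbUserSource());

        BroadcastStream<String> broadcastUsers = allowedUserIds.broadcast(ALLOWED_USERS);

        events.connect(broadcastUsers)
              .process(new BroadcastProcessFunction<Tuple2<String, String>, String, Tuple2<String, String>>() {
                  @Override
                  public void processElement(Tuple2<String, String> event, ReadOnlyContext ctx,
                                             Collector<Tuple2<String, String>> out) throws Exception {
                      // Look up the event's user in the broadcast state; keep it only if present.
                      if (ctx.getBroadcastState(ALLOWED_USERS).contains(event.f0)) {
                          out.collect(event);
                      }
                  }

                  @Override
                  public void processBroadcastElement(String userId, Context ctx,
                                                      Collector<Tuple2<String, String>> out) throws Exception {
                      // Refresh the allowed-user set whenever the periodic DB source emits an id.
                      ctx.getBroadcastState(ALLOWED_USERS).put(userId, true);
                  }
              })
              .addSink(new PersistenceSink()); // hypothetical sink that stores the filtered events

        env.execute("filter-users-with-broadcast-state");
    }
}

One design note: if users can also disappear from the list, the periodic source would need to emit full snapshots (or explicit removals) and processBroadcastElement would have to clear or remove entries accordingly.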
I have a thrift server up and running, with users sending queries over a JDBC connection. Can I intercept and modify the queries as they come in, and then send the result of the modified query back to the user?
For example - I want the user to be able to send the query
SELECT * FROM table_x WHERE pid="123";
And have the query modified to
SELECT * FROM table_y WHERE pid="123";
and the results of the second query should be returned. This should be transparent to the user.
SparkExecuteStatementOperation and SparkSession are where we thought we would add our code. I am using (not yet in production) a simple rule based on an external policy: I change the table name in the SQL to a view name before passing it on. It's a bit hacky though.
There is no way to change the query inside Spark Thrift Server itself. You can change the query in another way, before it reaches your JDBC/ODBC driver. For simple queries, plain string modification works; swapping only a table name is easy. But parsing and rewriting a complex query takes several operations and is not easy.
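As a rough illustration of the string-modification approach, a naive table-name rewrite might look like the sketch below. The interception point and the policy map are hypothetical; Spark Thrift Server offers no official hook for this.

import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Naive table-name rewrite, only safe for simple queries. The policy map
// (table_x -> table_y) stands in for whatever external policy you consult.
public class TableNameRewriter {

    private final Map<String, String> policy;

    public TableNameRewriter(Map<String, String> policy) {
        this.policy = policy;
    }

    public String rewrite(String sql) {
        String result = sql;
        for (Map.Entry<String, String> rule : policy.entrySet()) {
            // Whole-word match so "table_x" does not also hit "table_xy".
            Pattern p = Pattern.compile("\\b" + Pattern.quote(rule.getKey()) + "\\b",
                                        Pattern.CASE_INSENSITIVE);
            result = p.matcher(result).replaceAll(Matcher.quoteReplacement(rule.getValue()));
        }
        return result;
    }

    public static void main(String[] args) {
        TableNameRewriter rewriter = new TableNameRewriter(Map.of("table_x", "table_y"));
        System.out.println(rewriter.rewrite("SELECT * FROM table_x WHERE pid=\"123\";"));
        // prints: SELECT * FROM table_y WHERE pid="123";
    }
}

This kind of rewrite breaks as soon as the table name appears inside a string literal, a quoted identifier, or a CTE, which is exactly why proper parsing (or a proxy) is needed for anything non-trivial.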
You could use a database proxy to rewrite the queries as needed before they hit the database(s).
I'm not sure if it makes sense in your particular situation, but if it does, take a look at Gallium Data; that's a common use case for it.
In a Node.js application I have to maintain a "who was online in the last N minutes" state. Since there are potentially thousands of online users, for performance reasons I decided not to update my PostgreSQL user table for this task.
I chose to use Redis to manage the online status. It's very easy and efficient.
But now I want to run complex queries against the user table, sorted by online status.
I was thinking of creating an online table filled every minute from a Redis snapshot, but I'm not sure it's the best solution.
After the table is refilled, will the next query referencing the online table take a big hit caused by rebuilding or loading the new indexes?
Does anyone know a better solution?
I had to solve almost this exact same issue, but I took a different approach because I didn't like the problems caused by trying to mix Redis and Postgres.
My solution was to collect the online data in a queue (ZeroMQ in my case, but any queueing system should work, or a stream-processing service like Amazon Kinesis, the alternative I looked at). I then inserted the data in batches into a second table (not the users table). I never delete or update that table; only inserts and queries are allowed.
Doing things this way preserved the ability to do joins between the latest online data and the users table without bogging down the database or creating many updates on the users table. It also has the side effect of giving us a lot of other useful data.
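A rough Java/JDBC sketch of that batching step, with a hypothetical append-only last_online(user_id, seen_at) table and made-up connection details:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.Timestamp;
import java.time.Instant;
import java.util.List;

public class OnlineBatchWriter {

    // Append one drained batch of "seen online" user ids in a single round trip.
    public static void writeBatch(Connection conn, List<Long> userIds) throws Exception {
        String sql = "INSERT INTO last_online (user_id, seen_at) VALUES (?, ?)";
        try (PreparedStatement ps = conn.prepareStatement(sql)) {
            Timestamp now = Timestamp.from(Instant.now());
            for (Long userId : userIds) {
                ps.setLong(1, userId);
                ps.setTimestamp(2, now);
                ps.addBatch();
            }
            ps.executeBatch();
        }
    }

    public static void main(String[] args) throws Exception {
        // Made-up connection details; in practice the ids come from the queue consumer.
        try (Connection conn = DriverManager.getConnection(
                "jdbc:postgresql://localhost:5432/app", "app", "secret")) {
            writeBatch(conn, List.of(1L, 2L, 3L));
        }
    }
}

The "sorted by online status" queries then become a join between users and the latest seen_at per user_id, which an index on (user_id, seen_at) keeps cheap.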
One thing to note, which I have thought about when considering other solutions to this problem, is that your users table is transactional data (OLTP), while the latest online information is really analytics data (OLAP). So if you have a data warehouse, data lake, big data platform, or whatever term of the week you want to use for storing this type of data and querying against it, that may be a better solution.
I am trying to sync my Spark database on S3 with an older Oracle database via a daily ETL Spark job. I am trying to understand just what Spark does when it connects to an RDS like Oracle to fetch data.
Does it only grab the data present at the time of Spark's request to the DB (i.e. if it fetches data from an Oracle DB at 2/2 17:00:00, will it only grab data up to that point in time)? In other words, will any new data or updates arriving at 2/2 17:00:01 not be picked up by the fetch?
Well, it depends. In general you have to assume that this behavior is non-deterministic, unless explicitly ensured by your application and database design.
By default Spark will fetch data every time you execute an action on the corresponding Spark dataset. It means that every execution might see a different state of your database.
This behavior can be affected by multiple factors:
Explicit caching and possible cache evictions.
Implicit caching with shuffle files.
The exact set of parameters you use with the JDBC data source.
In the first two cases Spark can reuse already fetched data without going back to the original data source. The third one is much more interesting. By default Spark fetches data using a single transaction, but there are methods which enable parallel reads based on column ranges or predicates. If one of these is used, Spark will fetch data using multiple transactions, and each one may observe a different state of your database.
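For illustration, a partitioned JDBC read in Java looks roughly like this (the connection string, table, and bounds are made-up); each of the four partitions issues its own query, and therefore its own transaction:

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class JdbcParallelRead {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder().appName("jdbc-parallel-read").getOrCreate();

        // Four partitions on a numeric column -> four independent queries (and transactions),
        // each of which may observe a different state of the source database.
        Dataset<Row> orders = spark.read()
                .format("jdbc")
                .option("url", "jdbc:oracle:thin:@//dbhost:1521/ORCL")   // made-up connection string
                .option("dbtable", "ORDERS")
                .option("user", "etl_user")
                .option("password", "secret")
                .option("partitionColumn", "ORDER_ID")
                .option("lowerBound", "1")
                .option("upperBound", "1000000")
                .option("numPartitions", "4")
                .load();

        orders.show();
        spark.stop();
    }
}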
If consistent point-in-time semantics is required you have basically two options:
Use immutable, append-only and timestamped records in your database and issue timestamp dependent queries from Spark.
Perform consistent database dumps and use these as a direct input to your Spark jobs.
While the first approach is much more powerful, it is much harder to implement if you're working with a pre-existing architecture.
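A sketch of the first option, pushing a timestamp-bounded query down to Oracle so that every run of the job sees the same cut-off; the table, column, and timestamp are assumptions:

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class SnapshotRead {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder().appName("snapshot-read").getOrCreate();

        // Fix the cut-off once per job run; with append-only, timestamped records,
        // re-executing an action re-runs the same bounded query and sees the same rows.
        String cutoff = "2024-02-02 17:00:00";
        String boundedQuery =
                "(SELECT * FROM ORDER_EVENTS WHERE CREATED_AT <= TO_TIMESTAMP('" + cutoff
                        + "', 'YYYY-MM-DD HH24:MI:SS')) snapshot";

        Dataset<Row> snapshot = spark.read()
                .format("jdbc")
                .option("url", "jdbc:oracle:thin:@//dbhost:1521/ORCL") // made-up connection string
                .option("dbtable", boundedQuery) // a subquery alias can stand in for the table
                .option("user", "etl_user")
                .option("password", "secret")
                .load();

        snapshot.write().mode("overwrite").parquet("s3a://my-bucket/orders_snapshot/");
        spark.stop();
    }
}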
Just want to check whether Apache Calcite can be used for the "data federation" use case (querying across multiple databases).
The idea is that I have a master query over 5 tables, where some tables come from one database (say Hive) and 3 tables from another database (say MySQL).
Can I execute the master query across multiple databases from one JDBC client interface?
If this is possible, where does the query execution (particularly the inter-database join) happen?
Also, can I get a physical plan from Calcite that I can then execute explicitly in another execution engine?
I read in the Calcite documentation that it can push down Join and GroupBy, but I could not understand it. Can anyone help me understand this?
I will try to answer. You can also send questions to the mailing list (dev@calcite.apache.org); you are more likely to get an answer there.
Can I execute the master query across multiple databases from one JDBC client interface? If this is possible, where does the query execution (particularly the inter-database join) happen?
Yes, you can. The inter-database join happens in memory, in the process where Calcite runs.
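A rough sketch of what that looks like in Java, registering two JDBC-backed schemas on one Calcite connection and joining across them. The connection strings and table names are made-up, and using the Hive JDBC driver behind Calcite's JDBC adapter is an assumption (it depends on the driver exposing standard metadata):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;
import java.util.Properties;
import javax.sql.DataSource;

import org.apache.calcite.adapter.jdbc.JdbcSchema;
import org.apache.calcite.jdbc.CalciteConnection;
import org.apache.calcite.schema.SchemaPlus;

public class FederationExample {
    public static void main(String[] args) throws Exception {
        Properties info = new Properties();
        info.setProperty("lex", "JAVA"); // case-sensitive identifiers, backtick quoting

        Connection connection = DriverManager.getConnection("jdbc:calcite:", info);
        CalciteConnection calcite = connection.unwrap(CalciteConnection.class);
        SchemaPlus root = calcite.getRootSchema();

        // Register each backend as a sub-schema. Calcite pushes what it can to each
        // source and performs the cross-source join itself, in this JVM's memory.
        DataSource hive = JdbcSchema.dataSource(
                "jdbc:hive2://hive-host:10000/default", "org.apache.hive.jdbc.HiveDriver", "user", "pw");
        root.add("HIVE", JdbcSchema.create(root, "HIVE", hive, null, "default"));

        DataSource mysql = JdbcSchema.dataSource(
                "jdbc:mysql://mysql-host:3306/shop", "com.mysql.cj.jdbc.Driver", "user", "pw");
        root.add("MYSQL", JdbcSchema.create(root, "MYSQL", mysql, null, "shop"));

        try (Statement stmt = calcite.createStatement();
             ResultSet rs = stmt.executeQuery(
                     "SELECT e.user_name, o.total "
                   + "FROM HIVE.events e JOIN MYSQL.orders o ON e.order_id = o.id")) {
            while (rs.next()) {
                System.out.println(rs.getString(1) + " " + rs.getDouble(2));
            }
        }
        connection.close();
    }
}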
Can I get a physical plan from Calcite that I can execute explicitly in another execution engine?
Yes, you can. A lot of Calcite consumers do it this way, but you will have to wrap around the Calcite rule system to produce and execute the plan yourself.
I read in the Calcite documentation that it can push down Join and GroupBy, but I could not understand it. Can anyone help me understand this?
These are SQL optimisations that the engine performs. Imagine a GROUP BY that could have been evaluated on a tiny table but is actually written after a join with a huge table; pushing the aggregation below the join means grouping the small input instead of the much larger joined result.
I'm not a MongoDB expert, so I'm a little unsure about the server setup now.
I have a single instance running MongoDB 3.0.2 with WiredTiger, accepting both read and write ops. It collects logs from clients, so the write load is decent. Once a day I want to process these logs and calculate some metrics using the aggregation framework. The data set to process is roughly all logs from the last month, and the whole calculation takes about 5-6 hours.
I'm thinking about splitting writes and reads to avoid locks on my collections (the server continues to write logs while I'm reading; newly written logs may match my queries, but I can skip them because I don't need 100% accuracy).
In other words, I want a setup with a secondary for reads, where replication does not run continuously but starts at a configured time, or better, is triggered before the read operations start.
I do all my processing from Node.js, so one option I see here is to export the data created in some period like [yesterday, today], import it into the read instance myself, and run the calculations after the import is done. I looked at replica sets and master/slave replication as possible setups, but I didn't figure out how to configure them to achieve the described scenario.
So maybe I'm wrong and missing something here? Are there any other options to achieve this?
Your idea of using a replica-set is flawed for several reasons.
First, a replica-set always replicates the whole mongod instance. You can't enable it for individual collections, and certainly not only for specific documents of a collection.
Second, deactivating replication and enabling it before you start your report generation is not a good idea either. When you enable replication, the new slave will not be immediately up to date. It will take a while until it has processed the changes since its last contact with the master. There is no way to tell how long this will take (you can check how far a secondary is behind the primary using rs.status() and comparing the secondary's optimeDate with its lastHeartbeat date).
But when you want to perform data-mining on a subset of your documents selected by timespan, there is another solution.
Transfer the documents you want to analyze to a new collection. You can do this with an aggregation pipeline consisting only of a $match which selects the documents from the last month, followed by an $out. The out-operator specifies that the results of the aggregation are not sent to the application/shell, but are written to a new collection (which is automatically emptied before this happens). You can then perform your reporting on the new collection without locking the actual one. It also has the advantage that you are now operating on a much smaller collection, so queries will be faster, especially those which can't use indexes. Also, since the data in the new collection does not change while you run your aggregations, your reports won't have inconsistencies caused by data changing between them.
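A minimal sketch of that pipeline with the MongoDB Java driver; the database, collection, and field names ("metrics", "logs", "ts", "logs_last_month") are assumptions:

import java.util.Arrays;
import java.util.Date;
import java.util.concurrent.TimeUnit;

import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import com.mongodb.client.model.Aggregates;
import com.mongodb.client.model.Filters;
import org.bson.Document;

public class ExtractLastMonth {
    public static void main(String[] args) {
        try (MongoClient client = MongoClients.create("mongodb://localhost:27017")) {
            MongoCollection<Document> logs =
                    client.getDatabase("metrics").getCollection("logs");

            Date oneMonthAgo = new Date(System.currentTimeMillis() - TimeUnit.DAYS.toMillis(30));

            // $match selects last month's documents, $out replaces the "logs_last_month"
            // collection with the result; toCollection() forces the pipeline to run.
            logs.aggregate(Arrays.asList(
                    Aggregates.match(Filters.gte("ts", oneMonthAgo)),
                    Aggregates.out("logs_last_month")
            )).toCollection();

            // Reporting queries and further aggregations can now run against
            // logs_last_month without touching the live "logs" collection.
        }
    }
}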
When you are certain that you will need a second server for report generation, you can still use replication and perform the aggregation on the secondary. However, I would really recommend building a proper replica set (consisting of a primary, a secondary and an arbiter) and leaving replication active at all times. Not only will that make sure that your data isn't outdated when you generate your reports, it also gives you the important benefit of automatic failover should your primary go down for some reason.