Spark Streaming for multiple data sources - apache-spark

We would like to design/develop a DataConnector service using Spark Streaming.
The DataConnector service will allow the user to specify a source and destination database, along with
the connection parameters for each. The system therefore needs to handle dynamic data connection requests, and we would like one streaming job that can handle all connection requests made to the DataConnector service.
We propose to handle this in the following manner:
All connection requests made to the DataConnector service will be recorded in our "meta-store" database.
We will have a custom receiver that fetches all active connection requests from the meta-store, and we will use the receiver's DStream results (database names, connection strings, etc.) to fetch data from the multiple databases and process it.
Any problems with this approach?

Related

Session to connect Snowflake from Lambda

I'm developing an AWS Lambda in Python which will be triggered by API Gateway, and the Lambda will connect to my Snowflake. I'll process a few CSV files via API Gateway to get some data from Snowflake. Currently I'm using the Python connector to connect to Snowflake.
My issue is: if my CSV has 100 records, it processes the records recursively and connects to Snowflake from the Lambda every time to process each record, and that is impacting performance.
Is there any method or mechanism by which the Lambda can create a session for a certain period of time and process all records over a single connection?
As far as I know, connect() will automatically create a session that will last for a period of time. Once connected, you can use the cursor to execute multiple commands without needing to call connect() every time. Docs here. But I'm guessing you know this, and what you want is a single command instead of having to call multiple INSERT.
This is also possible, using a STAGE and COPY INTO command instead of INSERT. You can find an example from Snowflake documentation for bulk loading from AWS S3 here.
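To illustrate the idea (the question uses the Python connector, but the pattern is the same in any driver), here is a minimal sketch using Snowflake's Node.js driver, snowflake-sdk, with placeholder connection parameters, table and stage names: the connection is created once rather than per record, and a single COPY INTO from a stage replaces 100 individual INSERTs.

var snowflake = require('snowflake-sdk');

// Create the connection once, not once per record.
var connection = snowflake.createConnection({
  account: 'my_account',        // placeholder connection parameters
  username: 'my_user',
  password: 'my_password',
  warehouse: 'my_wh',
  database: 'my_db',
  schema: 'public'
});

connection.connect(function (err) {
  if (err) throw err;
  // One bulk load instead of row-by-row INSERTs. This assumes the CSV file
  // has already been uploaded to a stage (e.g. an external S3 stage).
  connection.execute({
    sqlText: "COPY INTO my_table FROM @my_stage/my_file.csv " +
             "FILE_FORMAT = (TYPE = CSV SKIP_HEADER = 1)",
    complete: function (err, stmt, rows) {
      if (err) throw err;
      console.log('bulk load finished');
    }
  });
});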

Is there any way to differentiate sessions of each client in cassandra QueryHandler?

My aim is to log unique queries per session by writing a custom QueryHandler implementation, as logging all queries causes a performance hit in our case.
Consider the case: a user connects to the Cassandra cluster with the Java client and performs "select * from users where id = ?" 100 times.
Another user connects from cqlsh and performs the same query 50 times. I want to log only two queries in this case, and for that I need a unique session ID per login.
Cassandra provides the interface below, where all requests land, but none of its APIs provide a session ID to differentiate between the two sessions described above.
org.apache.cassandra.cql3.QueryHandler
Note: I am able to get the remote address/port, but I want some ID which is created when the user logs in and destroyed when they disconnect.
In queryState.getClientState().getRemoteAddress() the address + port will be unique per tcp connection in the sessions pool. There can be multiple concurrent requests over each connection though, and a session can have multiple connections per host. There is also no guarantee the same tcp connection will be used from one request to another on client side.
However a single session cannot be connected as 2 different users (part of the initialization of connection) so the scenario you described isn't possible from the same Session object perspective. I think just using the address as the key for uniqueness will be all you can do given how the protocol/driver works. It will at least dedup things a little.
Are you actually processing your logging inline or are you pushing it off async? If you're using logback it should be using an async appender, but if you're posting events synchronously to another server, it might be better just to throw all the events on a queue and let it do the deduping in another thread so you don't hurt latency.

Creating a Table in Node OPCUA

How can the address space be extended into your SQL database with bidirectional mirroring, so that any value change at the variable end or the database end is immediately reflected at the opposite end?
So if I have a table in the database whose values can be changed from outside (for example, data could be added, deleted or updated), how would my node-opcua server be notified?
In OPC UA, any server follows a service-oriented architecture, meaning the server processes a request only when a client issues a service request.
In your case, you can achieve this with the help of subscribing for data change and monitoring the node which exposes your database table to the client. Subscribing for data change is only possible when that node is exposed to the client.
Once the node is subscribed for data change, there are two values the server needs from the client:
Sampling interval: how frequently the server should refresh data from the source.
Publishing interval: how frequently the client will ask the server for notifications.
Let's say, for example, the sampling interval is 100 milliseconds and the publishing interval is 1 minute. That means the server has to collect samples from the source (in your case the database) every 100 milliseconds, but the client will request all of those collected samples only every 1 minute.
In this way you will be able to keep the server updated with the changed values of the table in the database.
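To make those two intervals concrete, here is a rough client-side sketch with node-opcua (the names follow the older callback-style API and may differ between node-opcua versions; the node id is a placeholder): the monitored item requests a 100 ms sampling interval while the subscription publishes once per minute.

// assumes an already-connected ClientSession named `session`
var opcua = require("node-opcua");

var subscription = new opcua.ClientSubscription(session, {
  requestedPublishingInterval: 60000,   // publishing interval: 1 minute
  publishingEnabled: true,
  priority: 10
});

var monitoredItem = subscription.monitor(
  { nodeId: opcua.resolveNodeId("ns=1;s=UsersTable"), attributeId: opcua.AttributeIds.Value },
  { samplingInterval: 100, discardOldest: true, queueSize: 600 },  // sampling interval: 100 ms
  opcua.TimestampsToReturn.Both
);

monitoredItem.on("changed", function (dataValue) {
  // delivered in batches, at most once per publishing interval
  console.log("table value changed:", dataValue.value.value);
});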
If the SDK supports multithreading, then there is another way to achieve what is asked in the question:
In the server application, let the data source (i.e. the database) object run in its own thread.
Create a callback to the server application layer and initialise the data source object with this callback.
When data changes in the database, trigger a call to the data source thread, and if it is data that the server needs to be informed about, call the callback function which was initialised earlier.
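A rough sketch of that callback approach on the node-opcua server side, under a few assumptions: dbEvents stands in for whatever mechanism your data-source thread uses to signal a change, usersVariable is the UAVariable that exposes the table (created with namespace.addVariable(...)), and setValueFromSource pushes the new sample into the address space.

var opcua = require("node-opcua");

// dbEvents: placeholder EventEmitter fired by the database/data-source layer.
dbEvents.on("rowChanged", function (newValue) {
  // Push the changed value into the address space; subscribed clients
  // will then receive it as a data-change notification.
  usersVariable.setValueFromSource(new opcua.Variant({
    dataType: opcua.DataType.Double,
    value: newValue
  }));
});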
I hope this answered your question.

Is there a way to connect to a database via sockets and socket.io?

I am writing an application whereby some external module/component is updating a SQLite database with new data every few hundred milliseconds or so, and my job is to write an application that queries that data and broadcasts it over sockets every few hundred milliseconds as well.
So currently I'm doing something like this with node, express, and socket.io:
timer = setInterval(function() {
  db.all('SELECT * FROM cache', function(err, rows) {
    io.emit('data', rows);
  });
}, 400);
But I feel like there should be a more direct approach to this, whereby I can maintain a socket connection directly to the database, and listen for changes "live", rather than having to do blind queries (even if the data may not have changed), and emit.
Maybe this is not supported by SQLite (which is fine, I think I have some flexibility in the storage system I'm using), but is what I'm asking at all possible?
Note that I don't have control over the database updating process, so I can't just emit the data I'm about to store in the database. That whole process is a black box C program and I ONLY have access to the database itself.
What you're looking for is commonly called pub/sub (short for publish and subscribe). Clients waiting for data connect to a server and subscribe to the sort of events they want to receive. The data originators also connect to this server and publish events. The RPC with events that Socket.IO gives you are really similar to this. The clients have set up handlers for certain types of events, and the server fires these events with the appropriate data.
The problem is, pub/sub isn't typically implemented in a database. (Redis is an exception.) SQLite certainly has no capability for this. Since you can't modify the original application and only have access to the file database, there is nothing you can do. What you need is to effectively make your server an adapter from polling the database to broadcasting messages.
I do see a couple of problems with your setup, though. The first is that you are querying the database every 400 milliseconds. Don't do that. What if your query takes 500 milliseconds? Now you have a second query piling up. What if those two queries are now slow because they are both attempting to run at the same time? Now you have 3, 4, 5, and then 100 queries piling up. Don't schedule your next query to run until the current one is done. Check out an implementation of throttle for this.
The next problem is that you are blindly sending out all of the results to the client every time. I don't know what your application does, but I'm guessing that there is a chance for overlap from the previous query. Does your database have columns with timestamps? You could modify your query to use them. Or, modify your application to filter them.
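Both suggestions can be folded into the original loop with a small change. A sketch, assuming the cache table has a last_updated timestamp column and reusing the db and io objects from the question: the next query is only scheduled once the current one has finished, and only rows newer than the last batch are fetched and emitted.

var lastSeen = 0;

function poll() {
  // Fetch only rows that changed since the previous poll
  // (assumes a last_updated column; adapt to your schema).
  db.all('SELECT * FROM cache WHERE last_updated > ?', [lastSeen], function(err, rows) {
    if (!err && rows.length > 0) {
      lastSeen = Math.max.apply(null, rows.map(function(r) { return r.last_updated; }));
      io.emit('data', rows);
    }
    // Schedule the next query only after this one has completed,
    // so slow queries can never pile up on top of each other.
    setTimeout(poll, 400);
  });
}

poll();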

replicaset vs multi-mongos vs multiple connections

What is the difference between these features of mongoose, and why would you use each of them?
For now I just need a method to transfer a document from one database to another.
Replica-Set
A replica set is two or more MongoDB servers which mirror the same data. Reads can be served by any member of the set, but writes can only be handled by a single server (the "Master" or "Primary").
An application can only connect to the replica-set members it knows, so you need to tell it the hostnames and ports of all of them. There are cases where you want to restrict an application to specific members. In that case you wouldn't tell them about the other servers.
Multiple mongos
Another feature to scale MongoDB on multiple servers is sharding. A sharded cluster consists of multiple replica-sets or stand-alone MongoDB servers, where each one holds only a part of the data. This improves both read and write performance but is technically more complex. When an application wants to connect to a cluster, it doesn't connect to the MongoDB processes directly. Each connection goes through a MongoDB router (mongos) instead, which forwards each query to the mongods that are responsible for it. For increased performance and redundancy, a cluster can have multiple mongos servers. When this is the case, the clients should pick one at random for each connection.
Multiple connections
When your application opens multiple connections to the database, it can perform multiple requests in parallel. Usually the database driver should do this automatically, so you don't have to worry about this, unless you need to connect to multiple databases at the same time or you need connections with different connection settings for some reason.
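For the concrete task in the question, copying a document from one database to another, two connections are enough. A minimal sketch with mongoose (placeholder URIs, schema and id; callback style shown, while recent mongoose versions use promises instead):

var mongoose = require('mongoose');

var userSchema = new mongoose.Schema({ name: String, email: String });

// One independent connection per database.
var sourceDb = mongoose.createConnection('mongodb://localhost/source_db');
var targetDb = mongoose.createConnection('mongodb://localhost/target_db');

var SourceUser = sourceDb.model('User', userSchema);
var TargetUser = targetDb.model('User', userSchema);

// Read the document from one database and write it to the other.
SourceUser.findById(someId, function (err, doc) {
  if (err || !doc) return;
  TargetUser.create(doc.toObject(), function (err) {
    if (err) console.error(err);
  });
});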
