How to secure access to HiveThriftServer2.startWithContext - security

I want to read data from Hadoop and make it available in memory through SparkSQL so I can use a NodeJS application to access it with low latency.
Since the data is sensitive I need to ensure whoever is accessing this thrift server instance has the required access.
I don't know much about the Thrift server in Spark, but I understand that my application should provide a username and password to authenticate against the server.
How can I do that? Do I define a username and password in my Thrift or Spark code/environment and then use that in my JDBC/ODBC calls?
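For example, would the client side look roughly like this? (Host, port, table name, and credentials below are placeholders, and I'm assuming the Hive JDBC driver, since the Spark Thrift server speaks the HiveServer2 protocol.)

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class ThriftServerQuery {
    public static void main(String[] args) throws Exception {
        // The Spark Thrift server speaks the HiveServer2 protocol, so the Hive JDBC driver applies.
        Class.forName("org.apache.hive.jdbc.HiveDriver");

        // Placeholder host, port, and database.
        String url = "jdbc:hive2://thrift-host:10000/default";

        // Username/password supplied by the client application; how the server validates them
        // depends on how authentication is configured on the Thrift server side.
        try (Connection conn = DriverManager.getConnection(url, "app_user", "app_password");
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery("SELECT count(*) FROM my_cached_table")) {
            while (rs.next()) {
                System.out.println(rs.getLong(1));
            }
        }
    }
}
```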
Thanks!

Related

Connection Pooling in VoltDB

If I use the JDBC approach, I can achieve connection pooling using a third-party library (Apache DBCP).
I am using the client-based approach, and VoltDB does not expose a connection object, so how can I implement connection pooling?
Is there any such mechanism for the client-based approach?
The client-based approach is a lighter-weight yet more powerful API than JDBC.
The Client object should be connected to each of the servers in the cluster. Alternatively, you can set the "TopologyChangeAware" property to true on the ClientConfig object before creating the Client object, then connect the client to any server in the cluster and it will create connections to all the others automatically.
The application then interacts with the database through this Client object, which holds the connections, rather than through a JDBC Connection object. Since the Client object is thread-safe and supports multiple simultaneous invocations of callProcedure() on multiple threads, there is no need to create a pool of Clients.
For more details on the Client interface, see Using VoltDB, Chapter 6: Designing VoltDB Client Applications.
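For illustration, here is a minimal sketch of a topology-aware client; the hostname, credentials, and procedure name are placeholders, and the exact ClientConfig options available depend on your client library version:

```java
import org.voltdb.client.Client;
import org.voltdb.client.ClientConfig;
import org.voltdb.client.ClientFactory;
import org.voltdb.client.ClientResponse;

public class VoltClientExample {
    public static void main(String[] args) throws Exception {
        // Credentials are placeholders.
        ClientConfig config = new ClientConfig("username", "password");
        // Let the client discover and connect to the rest of the cluster automatically.
        config.setTopologyChangeAware(true);

        Client client = ClientFactory.createClient(config);
        client.createConnection("voltdb-host1");  // any one node is enough when topology-aware

        // The single Client instance is thread-safe; share it instead of pooling it.
        ClientResponse response = client.callProcedure("MyStoredProcedure", 42);
        System.out.println(response.getResults()[0]);

        client.drain();
        client.close();
    }
}
```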
Disclaimer: I work for VoltDB.

VoltDB - Writing your own client or using JSON HTTP Interface

I am a bit confused about how I should perform my operations using VoltDB. There are two choices:
Run the VoltDB server, create a connection from the client, and call the required procedure.
Use the JSON HTTP interface provided by VoltDB itself.
I have different applications that need to access the data stored in VoltDB, so I was writing code to connect and call the required procedures. But later, when I read about the JSON HTTP interface provided by VoltDB, I realized that the data can be accessed over HTTP APIs without connecting each application to VoltDB directly.
Now I am confused: which method should I choose, and why?
I am pretty much in favor of using the HTTP APIs provided by VoltDB, but what are the implications?
Well, the answer is pretty straightforward.
If you have a situation where low latency is a high priority, for example:
storing/processing real-time data at a high rate of transactions per second,
high insertion rates,
high data query rates,
then using a real client will typically be the best solution, since you can keep a persistent connection open. That is not possible with the HTTP API, which needs to reconnect and re-authenticate for each call. Use the HTTP API for low-frequency operational queries, such as occasional fetching or storing of data.
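For an occasional operational query, a call through the JSON interface might look roughly like the sketch below; the host, port, procedure name, and credentials are placeholders, and the /api/1.0/ endpoint with its Procedure/Parameters/User/Password query parameters should be verified against the documentation for your VoltDB version:

```java
import java.net.URI;
import java.net.URLEncoder;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;

public class VoltJsonQuery {
    public static void main(String[] args) throws Exception {
        // Each call goes through a fresh HTTP request/response cycle, including authentication,
        // which is why this path suits low-frequency operational queries rather than hot paths.
        String params = URLEncoder.encode("[42]", StandardCharsets.UTF_8);
        String url = "http://voltdb-host:8080/api/1.0/"
                + "?Procedure=MyStoredProcedure"
                + "&Parameters=" + params
                + "&User=username&Password=password";

        HttpClient http = HttpClient.newHttpClient();
        HttpRequest request = HttpRequest.newBuilder(URI.create(url)).GET().build();
        HttpResponse<String> response = http.send(request, HttpResponse.BodyHandlers.ofString());

        System.out.println(response.body());  // JSON result set
    }
}
```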

Scaling nodejs app with pm2

I have an app that receives data from several sources in realtime using logins and passwords. After data is received, it is stored in an in-memory store and replaced when new data becomes available. I also use sessions backed by MongoDB to authenticate user requests. The problem is that I can't scale this app with pm2, since I can only use one connection to my data source per login/password pair.
Is there a way to use a different login/password for each cluster instance, or to get the cluster ID inside the app?
Are memory values/sessions shared between cluster instances, or are they separate? Thank you.
So if I understood this question, you have a Node.js app that connects to a third party using HTTP or another protocol, and since you only have a single credential per source, you cannot connect to said third party from more than one instance. To answer your question: yes, it is possible to set up your cluster instances to each use a unique user/password combination; the tricky part would be how to assign these credentials to each instance (assuming you don't want to hard-code them). You'd have to do this assignment when the processes start up, perhaps using a data store to hold the credentials and introducing some sort of locking mechanism for each credential (so that each credential is unique to a particular instance).
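Purely as an illustration of that claiming step (sketched in Java with the Jedis Redis client for consistency with the other examples in this thread, although the same idea translates directly to Node), each instance could try the pooled credentials in turn until it atomically wins one; the credential list, key names, and instance ID variable are all hypothetical:

```java
import java.util.List;
import redis.clients.jedis.Jedis;

public class CredentialClaim {
    // Hypothetical credential pool; in practice this would live in the data store as well.
    private static final List<String> LOGINS = List.of("login1", "login2", "login3");

    public static String claimLogin(Jedis redis, String instanceId) {
        for (String login : LOGINS) {
            // SETNX is atomic: only one instance can claim a given login.
            if (redis.setnx("credential-lock:" + login, instanceId) == 1L) {
                return login;
            }
        }
        throw new IllegalStateException("No free credentials left for instance " + instanceId);
    }

    public static void main(String[] args) {
        try (Jedis redis = new Jedis("localhost", 6379)) {
            String login = claimLogin(redis, System.getenv().getOrDefault("INSTANCE_ID", "0"));
            System.out.println("This instance will use: " + login);
        }
    }
}
```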
If I were in your shoes, however, what I would do is create a new server whose sole job would be to get this "realtime data" and store it somewhere available to the cluster, such as Redis or some other persistent store. It would be a standalone server, just getting this data. You can also attach a RESTful API to it, so that if your other servers need to communicate with it, they can do so via HTTP or a message queue (again, Redis would work fine there as well).
'Realtime' is vague; are you using WebSockets? If HTTP requests are made often enough, that could also be considered 'realtime'.
Possibly your problem is like something we encountered when scaling SocketStream (WebSockets) apps, where the persistent connection requires that the same client's requests be routed to the same process. (There are other network topologies/architectures which don't require this, but that's another topic.)
You'll need to use fork mode with one process only, plus a solution to make sessions sticky, e.g.:
https://www.npmjs.com/package/sticky-session
I have some example code but need to find it (it's been over a year since I deployed it).
Basically you wind up just using pm2 for the 'always-on' feature; the sticky-session module handles the Node clusterization stuff.
I may post an example later.

Hazelcast: difference between Java native client and embedded version

We use Hazelcast (2.3) in a web backend running in a Java servlet container to distribute data in a cluster. Hazelcast maps are persisted in a MySQL database using a MapStore interface. Right now we are using the Java native client interface, and I wonder what the difference is between a "native" client and the embedded version when it comes to performance.
Is it correct that a "native" client might connect to any of the cluster nodes and that this decision is made again for every single request?
Is it correct that the overhead of sending all requests and responses through a TCP socket in a native client is avoided when the embedded version is used?
Is it fair to conclude that the embedded version is in general faster than the "native" client?
In case of a "native" client: it is correct that the MapStore implementation is part of the Hazelcast server (as class during runtime)? Or is it part of the "native" client so that all data that has to be persisted is sent through the TCP socket at first?
1- You give the native client a set of nodes to connect to. Once it connects to one, it will use that node for communication with the cluster until the node dies. When it dies, the client will connect to another node to continue communication.
2- With the native client there are two hops: one from the client to the connected node, and one from that node to the target node. (The target node is the node where the target data is located.) With the embedded version there is a single hop, as the member already knows where the wanted data is located (the target node).
3- Yes, generally, but see the following (from the Hazelcast documentation):
LiteMember is a member of the cluster; it has a socket connection to every member in the cluster and it knows where the data is, so it will get to the data much faster. But LiteMember has the clustering overhead and it must be in the same data center, even on the same rack. The native client, however, is not a member and relies on one of the cluster members. Native clients can be anywhere in the LAN or WAN. It scales much better and the overhead is considerably less. So if you have fewer clients than Hazelcast nodes then LiteMember can be an option; otherwise definitely try the native client. As a rule of thumb: try the native client first; if it doesn't perform well enough for you, then consider LiteMember.
4- Store operations are executed on the Hazelcast server. The object sent from the client is persisted to the centralized data store by the target node, which also stores the object in its memory.
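To make the embedded-versus-client distinction concrete, here is a rough sketch using the Hazelcast 3.x-style Java API (the 2.3 API differs in the details, and the map name is a placeholder):

```java
import com.hazelcast.client.HazelcastClient;
import com.hazelcast.client.config.ClientConfig;
import com.hazelcast.config.Config;
import com.hazelcast.core.Hazelcast;
import com.hazelcast.core.HazelcastInstance;
import com.hazelcast.core.IMap;

public class EmbeddedVsClient {
    public static void main(String[] args) {
        // Embedded: this JVM is a full cluster member; it holds data, knows the partition
        // table, and map operations can reach the owning node in a single hop.
        HazelcastInstance member = Hazelcast.newHazelcastInstance(new Config());
        IMap<String, String> embeddedMap = member.getMap("customers");
        embeddedMap.put("42", "Alice");

        // Native client: not a member; it talks to the cluster over a TCP socket,
        // so a request may take two hops (client -> connected node -> owning node).
        ClientConfig clientConfig = new ClientConfig();
        clientConfig.getNetworkConfig().addAddress("127.0.0.1:5701");  // the embedded member above
        HazelcastInstance client = HazelcastClient.newHazelcastClient(clientConfig);
        IMap<String, String> clientMap = client.getMap("customers");
        System.out.println(clientMap.get("42"));

        client.shutdown();
        member.shutdown();
    }
}
```

Both instances expose the same IMap API, which is why switching between the two deployment modes mostly changes topology and latency rather than application code.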

Server side tasks for CouchDB

I need to perform some background tasks periodically in CouchDB (I guess that could be done through a cronjob; I'm just curious about native CouchDB approaches). I also need to retrieve some resources over HTTP on the server (e.g. to authenticate through OAuth2 and store the token permanently in some document). Could this be achieved somehow, e.g. by integrating Node.js with CouchDB? I don't really like the idea of having a Node.js web server in front of CouchDB; I'm trying to avoid that additional layer and use CouchDB as the HTTP server, DB backend, and home for the server-side business logic.
CouchDB is a database. Its primary job is to store data. Yes, it has some JavaScript parts but those are to help it build indexes, or convert to and from JSON.
Asking CouchDB to run periodic cron-style tasks, or to fetch HTTP resources, is similar to asking MySQL to run periodic cron-style tasks, or to fetch HTTP resources. Unfortunately, it's not possible.
You do not necessarily need an HTTP server. You can build a 2.1-tier architecture, with direct browser-to-CouchDB connections as before, but run your periodic or long-lasting back-end programs yourself; they simply read and write CouchDB data as a normal user (perhaps an admin user).
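A rough sketch of such a back-end program, assuming a hypothetical database name and document layout and using plain HTTP against CouchDB's document API (the OAuth2 token fetch itself is left out, and authentication against CouchDB is omitted for brevity):

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.Instant;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

public class TokenRefresher {
    public static void main(String[] args) {
        HttpClient http = HttpClient.newHttpClient();
        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();

        // Periodically obtain a token (elided here) and store it as a document
        // via CouchDB's plain HTTP API; a POST to the database creates a document.
        scheduler.scheduleAtFixedRate(() -> {
            try {
                String token = "...";  // fetch via OAuth2 here; intentionally left out
                String doc = String.format(
                        "{\"type\":\"oauth_token\",\"token\":\"%s\",\"updated\":\"%s\"}",
                        token, Instant.now());
                HttpRequest request = HttpRequest.newBuilder(
                                URI.create("http://localhost:5984/app_db"))  // placeholder DB
                        .header("Content-Type", "application/json")
                        .POST(HttpRequest.BodyPublishers.ofString(doc))
                        .build();
                HttpResponse<String> response =
                        http.send(request, HttpResponse.BodyHandlers.ofString());
                System.out.println("CouchDB replied: " + response.body());
            } catch (Exception e) {
                e.printStackTrace();
            }
        }, 0, 1, TimeUnit.HOURS);
    }
}
```

Run under cron, systemd, or a process manager, this kind of worker stays entirely outside CouchDB while still treating it as the system of record.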
