Does the SearchLight package use connection pools?

SearchLight is a Julia package that builds the ORM (object-relational mapping) layer in Genie (a web development framework in Julia).
Right now, I am building a website backend and decided to use SearchLight for data storage, because someone told me to do so with an ORM instead of "just" parsing SQL queries. Sorry, I am really not a web developer and know basically nothing.
Unfortunately, SearchLight is (i) unbelievably badly documented (i.e. in most cases not at all) and (ii) missing important functionality, like setting foreign keys in a DB table. I found several workarounds, even if they are not pretty and kind of run counter to the ORM idea.
Actual Question
Does SearchLight use connection pools?
The web service I'm building receives a JSON request, queries the database, gets the result, and sends it back to the client. This might take several seconds, and multiple clients making requests should not have to "stand in line" and wait for other requests to finish. Will SearchLight do that, i.e. open several connections and use them when they're needed? (Or does it do something else that leads to what I want?)

Related

How to distribute NodeJS requests to several servers and merge the results

I have a simple NodeJS web app that calls several APIs asynchronously and merges the results to return one big result. Now let's say that I want to optimize this. How do I do this?
I am new to NodeJS and also to the concept of scaling systems. I have been reading about load balancing, distributed systems, etc. I think this is the right way to go, but honestly I don't know.
I was thinking of doing something like this:
Set up a system that has several servers, each with an instance of a NodeJS web app that makes an API call given a path and returns the result.
Have a master server that grabs the result from each of these servers, merges the results, and returns the merged result to the client.
Is this the right way to go? What technologies do I use? Thank you for your help.
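To make the idea concrete, here is a rough sketch of the master step. The two entries in `workers` are hypothetical stand-ins for the worker servers (in a real setup each would be an HTTP call to another machine), and the shape of the returned data is assumed purely for illustration.

```javascript
// Hypothetical stand-ins for the worker servers; each would normally be
// an HTTP call to another machine.
const workers = [
  async (path) => ({ users: [`user-for-${path}`] }),
  async (path) => ({ orders: [`order-for-${path}`] }),
];

// Master: ask every worker in parallel, then merge the partial results
// into one object.
async function fetchMerged(path) {
  const partials = await Promise.all(
    workers.map((callWorker) => callWorker(path))
  );
  return Object.assign({}, ...partials);
}

fetchMerged('/report').then((result) => console.log(result));
```

Note that `Promise.all` rejects as soon as any single worker fails, so a real master would need to decide how to handle partial failures.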
I am guessing you are trying to set up web crawling or API crawling to grab data from a 3rd-party endpoint. If that is true, you would have a list of users, IDs, or something similar that you pass to the web service you call to grab the data.
First off, making a large number of requests very fast is tricky, and whether it stays stable and robust depends on several factors:
Whether the 3rd-party API is rate limited.
The network connection on the client machine making the requests.
Error handling for both API and client errors, such as connection resets.
The sheer volume of data you are fetching back, e.g. if you are trying to crawl data on millions of users from a 3rd-party API as fast as possible.
Your instinct is correct that you would have to scale this over several servers, or at least several parallel Node processes on a machine with a lot of resources. However, my recommendation would be to start small, test, and then scale. Here are a few steps.
1) Use a good, robust Node HTTP client like axios.
2) If you are dealing with a huge number of items (usernames, IDs, emails, etc.), you will need a stable way of iterating over them. Put them in a database like PostgreSQL or MySQL.
3) From here on, figure out the fastest rate at which the API supports being called, and write a stable function to iterate over your 'input' and call the API.
4) Then you have a couple of options. If the data you are collecting is separate for each request you make, you can save it back in the database for each input. If you literally want to merge the data from multiple API calls, you can use a key-value store like Redis: give an ID to each call, create a combination key in input+request_id format, and merge the results when all requests are done.
5) When you have a small-scale model in place, you can add a good job manager like Kue or Bull to the mix and split the set of inputs in the database from point (2) over several jobs that can be run in parallel.
6) Once you have a stable job manager that can repeat this Node process for a range of inputs, you are at a point where you can scale.
7) Deploy this same code on multiple servers that all talk to the same database and Redis. Set up the Node process to run under a process manager like PM2.
8) Finally, the way the setup works is that each copy of the same Node program fetches a different set of inputs (usernames, IDs, etc.) from the source database and writes the results back to the database or Redis, depending on how you want to handle the output.
9) Optionally, post-process on Redis to fetch the key-value pairs and merge the responses grouped by input.
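The combination-key idea and the final merge step can be sketched as follows. This is only an illustration: a plain `Map` stands in for Redis, the function names are made up, and it assumes request IDs never contain a colon.

```javascript
// In-memory stand-in for a key-value store such as Redis (assumption:
// a real deployment would use a Redis client instead of this Map).
const store = new Map();

// Build the combination key "input:requestId".
function makeKey(input, requestId) {
  return `${input}:${requestId}`;
}

// Save the response of one API call under its combination key.
function saveResult(input, requestId, response) {
  store.set(makeKey(input, requestId), response);
}

// Post-processing: group all stored responses by input and merge them.
function mergeByInput() {
  const merged = {};
  for (const [key, response] of store) {
    const input = key.slice(0, key.lastIndexOf(':'));
    merged[input] = { ...(merged[input] || {}), ...response };
  }
  return merged;
}

saveResult('alice', 'profile', { name: 'Alice' });
saveResult('alice', 'stats', { followers: 10 });
saveResult('bob', 'profile', { name: 'Bob' });
console.log(mergeByInput());
```

If two calls for the same input return the same field, the later one wins in this sketch, so pick field names that do not collide.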
Some important things you have to be hyper-aware of when coding this are:
Memory Management: Use design patterns, code, and libraries that save you the most memory. Load the absolute minimum of what you need into memory. E.g. iterating over an array of 1 million usernames in memory is more expensive than keeping them in a database and paging over them.
Error Handling: There will be lots of errors: API errors, unforeseen exceptions, memory leaks, network drops, etc. Having a robust error handling and recovery mechanism will save the day.
Logging: Good quality logging will be critical to keep a check on how different parts of the system are doing. Look at winston.
Throttling API calls: Remember that making 10,000 API calls in the same minute will likely crash your machine, or even most APIs. At the very least it will go very slowly due to memory overload. However, adding a slight delay (like 10 milliseconds) between every 10 parallel calls will be a HUGE boost in speed and make the calls much more stable. This strategy is called throttling or rate-limiting the API calls. Finding a sweet spot that works for your problem is important. Yes, going slow can actually make you reach your goal faster!
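As a rough sketch of the throttling and error handling points, the helper below runs 10 calls in parallel, pauses 10 milliseconds between batches, and records failures instead of crashing. The names `throttledCalls` and `fakeApi` are made up for illustration, and `fakeApi` is only a stub standing in for a real HTTP call.

```javascript
// Pause for the given number of milliseconds.
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Split an array into consecutive chunks of at most `size` items.
function chunk(items, size) {
  const chunks = [];
  for (let i = 0; i < items.length; i += size) {
    chunks.push(items.slice(i, i + size));
  }
  return chunks;
}

// Call `callApi` for every input, `batchSize` calls in parallel, waiting
// `delayMs` after each batch. Failures are recorded, not thrown.
async function throttledCalls(inputs, callApi, batchSize = 10, delayMs = 10) {
  const results = [];
  for (const batch of chunk(inputs, batchSize)) {
    const settled = await Promise.allSettled(
      batch.map(async (input) => callApi(input))
    );
    settled.forEach((outcome, i) => {
      results.push(
        outcome.status === 'fulfilled'
          ? { input: batch[i], ok: true, value: outcome.value }
          : { input: batch[i], ok: false, error: String(outcome.reason) }
      );
    });
    await sleep(delayMs);
  }
  return results;
}

// Hypothetical API stub: fails for one input to exercise the error path.
async function fakeApi(id) {
  if (id === 7) throw new Error('connection reset');
  return { id, fetched: true };
}

throttledCalls([1, 2, 3, 7], fakeApi).then((results) => console.log(results));
```

The batch size and delay are the knobs to tune when looking for the sweet spot; the failed entries can be retried or logged afterwards.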
Your question was quite broad and did not include a specific code question, so this is a general strategy. Hopefully it gives you a good starting point and links to reference materials so you can start building your solution.

Options for getting a CPU intensive job off my web server?

I have been working on a Web App for visualizing live data. It is crucial that this data is kept up to date on the client side without such updates being invoked directly by the client (e.g. no button presses or refreshing the page). Currently, on page load, I grab the current data set from a database (DynamoDB) via Ajax, and subsequent updates are pushed to any listening clients every 5 minutes via a Websockets connection (using Socket.io).
I have overlooked the computational load of this update job. It has to mine some data, process it, update the database, and send the update out to all clients. As a result, the web server is left unresponsive for about 30 seconds with each update. Furthermore, my current architecture limits me from putting my server behind a load balancer, which is something I anticipate coming up in the future. For both these reasons, I really need to get this update job off my web server.
I am relatively inexperienced in web development, and I don't feel I am knowledgeable enough about these technologies to know the drawbacks of the solutions I have come up with. Currently, I am considering:
Break the update off into a separate process so it does not block the Node event loop. This would solve my issue in the short term, but if I ever want to load balance my application, I can't have the update running on multiple machines.
Drop Websockets entirely and just have the client query the database every 5 minutes, while a separate process (or separate server if I want load balancing) keeps the database up to date without interacting directly with the client. Will this kind of access pattern put too much load on my db?
Have a separate server run the update, and send the result via Websockets (or maybe some other protocol) to my load balanced application servers, which then push that update to all listening clients as usual. Is this even possible?
Perhaps there are other solutions. It seems like this would be a relatively common problem, so I was hoping I could find some guidance here. What are the potential issues with the solutions I have proposed, and are there other possible solutions that may suit my use case better?
It sounds like you want one process sitting somewhere which crunches the data and publishes it to a stream. Clients can then subscribe to the stream as and when they like. Redis handles streams nicely, so you could process your data and push it into a Redis stream. You could then create a small Node service which subscribes to the Redis stream and pushes the formatted data out over a websocket or via polling.
In this scenario you can then scale up either the publishing process (the one crunching the numbers) if your data load goes up, or scale up your subscribed process (which serves the data over a websocket to browsers) if you get an influx of clients watching the data.
You can also easily distribute the hosting of these services across other machines, and even write them in different languages if you decide the number crunching needs something like threading.
You're then left with the issue of clients (web browsers) consuming this data with a load balancer in between. This can be a hard problem if you use websockets, and it comes with its own pros and cons. But importantly, you'll have separated your data crunching from your result publishing, and that'll isolate your issue to only the load balancing.
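The split between the publishing process and the subscribing process can be sketched in a few lines. This is only an illustration of the shape: Node's built-in `EventEmitter` stands in for the Redis stream, both sides live in one process here, and `send` is a hypothetical callback that would wrap a websocket in a real service.

```javascript
const { EventEmitter } = require('events');

// In-process stand-in for a Redis stream or pub/sub channel (assumption).
const stream = new EventEmitter();

// Publisher: crunches the data and publishes the result. It knows nothing
// about the clients that consume it.
function publishUpdate(rawReadings) {
  const total = rawReadings.reduce((sum, value) => sum + value, 0);
  const update = {
    count: rawReadings.length,
    average: total / rawReadings.length,
  };
  stream.emit('update', update);
  return update;
}

// Subscriber: receives each update and forwards it to its own clients.
// `send` is a hypothetical callback, e.g. a wrapper around a websocket.
function subscribe(send) {
  const handler = (update) => send(JSON.stringify(update));
  stream.on('update', handler);
  return () => stream.off('update', handler); // call to unsubscribe
}

const received = [];
const unsubscribe = subscribe((message) => received.push(message));
publishUpdate([2, 4, 6]);
unsubscribe();
publishUpdate([1]); // not delivered: nobody is subscribed any more
console.log(received);
```

Because the publisher only talks to the stream, either side can be moved to another machine or scaled up without touching the other.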
I have done pretty much the same to check resources on some of our servers.
I have a C# service getting the information on each server that we manage and sending it to a queue (AMQ).
From there, I have a STOMP client fetching data from AMQ and emitting it to a websocket.
My main microservice fetches the data and saves it into a DB.
My visualisation web app is connected to the same websocket and fetches the data as it is sent, in order to display it.
The AMQ step isn't mandatory at all; it's just something I had to work with (for historical reasons).
I don't know what type of data you are working with, so I don't know if my solution can apply to you.
Don't hesitate to ask if I'm not clear or you have any questions.
This is a big question and I'm not going to try and give you a definitive answer.
For option 2
It really depends on how expensive your queries are. You can make DynamoDB fast if you pay for enough throughput. That said, on the face of it, re-loading your whole dataset, when that sounds like it's probably large, probably isn't good engineering.
For option 3
This option seems best to me if it's achievable, although admittedly it's hard to say with such a complex system. Obviously you can't share your whole project.
Given you are already using AWS, you might want to look into AWS Lambda. If you can move the update process into a standalone job, you can host it on Lambda and move the load off the web server. Lambda is essentially infinitely scalable and you only pay for the compute you use.
This really depends on you being able to split the update task off into a separate service. It's likely you would need a fair bit of refactoring to isolate it as a service. If you can break little bits off at a time and make the move gradually, even better.
If you consider trying this, and you've not used Lambda before, I would definitely start small with some hello world examples. Then try a very simple service in your application, and build up to taking on the update service.
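A hello-world-sized starting point might look like the following. It is only a sketch: `computeUpdate` and the `event.readings` field are hypothetical placeholders for the real update job and its input, and the handler is written as an exported async function that takes the event, which is the usual shape for a Node.js Lambda handler.

```javascript
// Hypothetical update job: the real one would mine data and write the
// result to DynamoDB.
function computeUpdate(readings) {
  return { count: readings.length, max: Math.max(...readings) };
}

// Lambda-style entry point: an async function that receives the event
// and returns a response object.
async function handler(event) {
  const readings = event.readings || [];
  if (readings.length === 0) {
    return { statusCode: 400, body: JSON.stringify({ error: 'no readings' }) };
  }
  return { statusCode: 200, body: JSON.stringify(computeUpdate(readings)) };
}

module.exports = { handler };
```

Keeping the actual work in a plain function like `computeUpdate` means it can be tested locally without deploying anything.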
You might also consider looking into Amazon Simple Queue Service (SQS) to handle the comms between clients and server.
Database tuning
If a lot of your update time is spent waiting for database actions to complete, rather than server processing, you can consider tuning that side of things up. Things to consider are:
Buying more throughput
Using batch operations (as these move load to DynamoDB from your server)
Tuning keys, indexes and database access

EF6 with one database per client (same context)

I am developing an ASP.NET MVC 5 application that will be a SaaS for my clients. I want to use EF6 and I am currently using LocalDB. I am an Entity Framework beginner and I am having a hard time learning it. I have been searching the web for the last 2 days and found different approaches, but never anything clear enough to answer my questions.
I followed Scott Allen's tutorial on ASP.NET MVC 4 and 5, so currently I have 2 contexts, 'IdentityDbContext' and 'MyAppDbContext', both pointing to the DefaultConnection string using a database called MyAppDb.mdf.
I want my customers to be able to log in on the website and connect to their own database, so I was planning on creating a new ConnectionString (and database) for each of my clients and keeping one ConnectionString for my client account information using my IdentityDbContext.
I have plenty of questions, but here are the 2 most important ones:
1) I am not sure how to do that and test it locally. Do I have to create new data connections for all my clients and, when a client connects, edit the connection string dynamically and pass it to 'MyAppDbContext'?
2) Even if I am able to do this, let's say I have 200 customers. That means I will have 201 databases: 1 account database (IdentityDbContext) and 200 client databases (MyAppDbContext). If I change my model in the future, does it mean I have to run the Package Manager Console migration commands for each of the 200 databases? This seems brutal. There must be a way to propagate my model easily to every client's database, right?
Sorry for the long post and thank you very much in advance.
The answer to (1) is basically "yes", you need to do just that. The answer to (2) is that you'll have to run migrations against all the databases. I can't imagine how you would think there would be any other way to do it, you've got 200 separate databases that all need the same schema change. The only way to accomplish that is to run the same script (or migration) against each one of them. That's the downside of a single-tenant model like you've got.
A few things you should know since you're new to all of this. First, LocalDB is only for development. It's fine to use it while in development, but remember that you'll need a full SQL Server instance when it comes time to deploy. It's surprising how common a hangup this is, so I just want to make sure you know right out of the gate.
Second, migrations, at least code-first migrations, are also for development. You should never run your code-first migrations against a production database. Not only would this require that you actually access the production database directly from Visual Studio, which is a pretty big no-no in and of itself, but nothing should ever happen on a production database unless you explicitly know what's changing, where. I have a write-up about how to migrate production databases that might be worth looking at.
For something like your 200 database scenario, though, it would probably be better to invest in something like this from Red Gate.

Send requests directly to couchDB from NodeJS/Angular application?

I'm currently building a new web-application with user registration, profiles, image upload and so on. I was using the MEAN stack (MongoDB, ExpressJS, Angular, NodeJS) for previous projects and now want to try out couchDB.
couchDB delivers a REST API for free. I could shift all the logic to the client and make sure that the input is valid using couchDB's validation functions. Therefore I could make the requests from the client directly to the database, and I would not have to code annoying things like CRUD operations in my ExpressJS controllers. Authentication, validation and simple CRUD operations: it's all there, and for free.
Is there a reason not to do so? The alternative would be to pass the request to my server and then pass it on to couchDB from there, which pretty much eradicates all the nice benefits over MongoDB.
greetings,
Michel
I think your proposal is at least theoretically true and you might want to go ahead and do it, perhaps forwarding requests from the browser to couchdb with a reverse proxy like nginx or node-http-proxy. I believe there are products on the market espousing this "no application server" architecture such as parse.com, which provides some social proof that this idea is at least interesting and worth exploring.
However I think you will at some point discover there is such a thing as an application server and people use them and write code for them in nearly every application for good reason. Debugging problems with your couchdb data validation code is probably going to be cumbersome at best. Compare that to the amazing features you have debugging node.js code with node-inspector and the chrome developer tools debugger.
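For reference, the validation code in question is a `validate_doc_update` function stored as a string in a design document, which couchdb runs on every write. A minimal sketch, assuming each document carries a hypothetical `owner` field holding the user name of its owner:

```javascript
// Sketch of a CouchDB validate_doc_update function. CouchDB calls it on
// every write; throwing { forbidden: ... } or { unauthorized: ... }
// rejects the write. The `owner` field is an assumption for illustration.
function validateDocUpdate(newDoc, oldDoc, userCtx) {
  if (!userCtx.name) {
    throw { unauthorized: 'Please log in first.' };
  }
  if (oldDoc && oldDoc.owner !== userCtx.name) {
    throw { forbidden: 'Only the owner may change this document.' };
  }
  if (!newDoc._deleted && newDoc.owner !== userCtx.name) {
    throw { forbidden: 'The owner field must match the logged-in user.' };
  }
}

// Local check: a user trying to overwrite someone else's document.
try {
  validateDocUpdate({ owner: 'bob' }, { owner: 'alice' }, { name: 'bob' });
} catch (rejection) {
  console.log(rejection);
}
```

Note that a function like this only guards writes; it does nothing to limit which documents a user can read.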
couchdb is also probably not going to provide realistically granular enough authorization capabilities. This means eventually your application will be exposed to malicious users just doing a PUT with the right document id and gaining access to data they are unauthorized to see or change.
Very few applications are simple enough that UI + DB can handle all of the data transitions and operations that are needed. You could in theory code some of this logic in the browser, but having the Internet between your compound query logic and your database is going to add so much latency to your app to make some features impossible, especially if you have to do a query, get some results, then do a secondary query based on each of those results. That is sometimes feasible between a server-side application and its couchdb, but doing that across the Internet will suffer from the latency.

Is there a compelling reason to use an AMQP based server over something like beanstalkd or redis?

I'm writing a piece to a project that's responsible for processing tasks outside of the main application facing data server, which is written in javascript using Node.js. It needs to handle tasks which are scheduled in the future and potentially handle tasks that are "right now". The "right now" just means the next time a worker becomes available it will operate on that task, so that bit might not matter. The workers are going to all talk to external resources, an example job would be to send an email. We are a small shop and we don't have a ton of resources so one thing I don't want to do is start mixing languages at this point in the process, and I already see that Node can do this for us pretty easily, so that's what we're going to go with unless I see a compelling reason not to before I start coding, which is soon.
All that said, I can't tell if there is a compelling reason to use an AMQP based server, like OpenAMQ or RabbitMQ over something like Kue or Beanstalkd with a node client. So, here we go:
Is there a compelling reason to use an AMQP based server over something like beanstalkd or redis with Kue? If yes, which AMQP based server would fit best with the architecture that I laid out? If no, which NoSQL solution (beanstalkd, redis/Kue) would be easiest to set up and fastest to deploy?
FWIW, I'm not accepting my answer yet, I'm going to explain what I've decided and why. If I don't get any answers that appear to be better than what I've decided, I'll accept my own later.
I decided on Kue. It supports multiple workers running asynchronously, and with cluster it can take advantage of multicore systems. It is easily extended to provide security. It's backed with Redis, which is used all over for this exact thing, so I know I'm not backing my job process server with unproven software (that's not to say that any of the others are unproven.)
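To illustrate the job model I need (some jobs due "right now", some scheduled for later, each picked up by the next free worker), here is a small in-memory sketch. It is not Kue's API: `JobQueue`, `takeDue` and `runWorker` are made-up names, an array stands in for Redis, and the clock is passed in explicitly so the behaviour is easy to follow.

```javascript
// In-memory stand-in for a Redis-backed job queue (assumption: this is
// not Kue's API, only the scheduling idea).
class JobQueue {
  constructor() {
    this.jobs = [];
  }

  // Add a job that becomes due `delayMs` after `now` (0 means "right now").
  add(type, data, delayMs = 0, now = Date.now()) {
    this.jobs.push({ type, data, runAt: now + delayMs });
  }

  // Remove and return the due job with the earliest runAt, or null if
  // no job is due yet.
  takeDue(now = Date.now()) {
    let best = -1;
    this.jobs.forEach((job, i) => {
      if (job.runAt <= now && (best === -1 || job.runAt < this.jobs[best].runAt)) {
        best = i;
      }
    });
    return best === -1 ? null : this.jobs.splice(best, 1)[0];
  }
}

// A worker drains every job that is due at time `now`.
function runWorker(queue, handlers, now = Date.now()) {
  const done = [];
  let job;
  while ((job = queue.takeDue(now)) !== null) {
    done.push(handlers[job.type](job.data));
  }
  return done;
}

const queue = new JobQueue();
queue.add('email', { to: 'a@example.com' }, 0, 1000); // due immediately
queue.add('email', { to: 'b@example.com' }, 60000, 1000); // due in a minute
const handlers = { email: (data) => `sent to ${data.to}` };
console.log(runWorker(queue, handlers, 1000)); // only the first job is due
console.log(runWorker(queue, handlers, 61000)); // now the delayed job runs
```

Several workers can call `takeDue` against the same queue, since each job is removed when it is taken; a Redis-backed queue gives the same guarantee across processes.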
The most compelling reason I picked Kue is that it provides a JSON API, so the client applications (the first client is going to be a web-based application, but we're planning on making smartphone apps also) can add jobs easily without going through the main application-facing Node instance, and I can be totally out of the way of the rest of my team as I write this. I don't need a route, I don't need anything; it's all provided for me, so I don't need to write anything to support this. This has another advantage: with an extension to provide login/password security, only authorized clients can add jobs, so I don't have to expose my redis server to client applications directly. It also has a built-in web console, and the API allows the client to pull back lists of jobs associated with a given user very easily, so we can show the user all of their scheduled tasks in a nifty calendar view with zero effort on my part.
The other compelling reason is the lack of steep learning curve associated with getting redis and Kue going for me. I've set up redis before, and Kue is simple and effective.
Yes, I'm a lazy developer, but I'm the good kind of lazy developer.
UPDATE:
I have it working and doing jobs, and the throughput is amazing. I split out the task marshaling logic into its own Node instance; basically all I have to do is deploy my repo to a new machine and run node task-server.js to scale out my workers. I may need to add some more job searching calls to Kue because of how I implemented a few things, but that will be easy.

Resources