Meteor backend parsing XML APIs, scraping websites etc. - node.js

I would be very grateful if someone could answer my question. I'm new to Node.js and I'm building an app in Meteor. Everything is fine so far, Mongo etc., but once I finish the huge CRUD part I will need to parse some XML APIs and scrape some websites, all as backend tasks run via cron etc. My question is: I don't see any examples of such backends in Meteor, only examples using npm libs. Is that the only path to follow? Also, Meteor writes Mongo's ids as strings, while PHP writes them as ObjectId. If I use npm libraries, will they write ObjectId, and will that do any harm? The overall question is: for a parsing backend in Meteor, are npm packages the right path?

I quote @Dan Dascalescu from his excellent answer to another question:
There are several packages to run background tasks in Meteor. From the simplest to the most involved:
super basic cron packages: cron, easycron
percolatestudio:synced-cron - cron jobs distributed across multiple app servers
queue-async - minimalistic async (see below), written by D3 author Mike Bostock
peerlibrary:async - wrapper for the popular async package for Node.js and the browser. Offers over 20 functions (map, reduce, every, filter etc.) and supports powerful control flow (serial, parallel, waterfall etc.); see also this example.
artwells:queue - priorities, scheduling, logging, re-queuing. Queue backed by MongoDB.
vsivsi:jobCollection - schedule persistent jobs to be run anywhere (servers, clients). I used this to power the RSS feed aggregation at a financial news aggregator startup (StockBase.com).
differential:workers - spawn headless worker Meteor processes to work on async jobs
Packages I would recommend caution with:
PowerQueue - queue async tasks, throttle resource usage, retry failed. Supports sub queues. No scheduling. No tests, but nifty demo. Not suitable for running for a long while due to using recursive calls.
Kue - the priority job queue for Node.js backed by Redis. Not updated for Meteor 0.9+.
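For the simple scheduled-scraping case in the question, a package like percolatestudio:synced-cron is usually enough. A minimal sketch, assuming the package is added to the app and that fetchAndParseFeeds is a hypothetical server-side function you would write around whatever npm XML parser/scraper you pick:

```js
// server/cron.js -- runs only on the Meteor server
SyncedCron.add({
  name: 'Parse XML feeds and scrape sites',   // unique job name
  schedule: function (parser) {
    return parser.text('every 2 hours');      // later.js text syntax
  },
  job: function () {
    // fetchAndParseFeeds is a placeholder for your own scraping/parsing code
    return fetchAndParseFeeds();
  }
});

Meteor.startup(function () {
  SyncedCron.start();
});
```

As for the _id question: Meteor collections use string ids by default, but you can create a collection with the idGeneration: 'MONGO' option if you need real ObjectIds.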

Related

Scheduled push notifications - inside Node.js function or CRON job?

I have to implement scheduled push notifications in the backend (Node.js) which are triggered only at a certain time. So I need to query the DB (PostgreSQL) every 1-2 minutes and find only those notifications which need to be triggered.
What is the better solution?
Use internal setTimeout query function
or
External CRON script which will trigger querying function in Node.js?
Thank you
If the second option is just a cron job that makes an HTTP request to your service, it's pretty much equivalent. If, instead, the solution is packaged as a script and the cron drives that script directly, it has a couple of trade-offs, mainly around operations:
Use internal setTimeout query function or
This means you have to launch a long-running service and keep it running. Things like memory leaks may become an issue.
External CRON script which will trigger querying function in Node.js?
This is the strategy that Google Cloud (GCP) uses for its cron offering. If the cron just pings a web URL, the two solutions are pretty much equivalent.
IMO, the biggest issue with both of these is being careful about coupling a background (async) workload with an online workload. If your service is serving real-time HTTP requests but is also running these background workloads, the latter take resources away from servicing synchronous HTTP requests. If they are two fundamentally different workloads, then it also makes sense to separate them for scaling purposes.
I've been in a situation where monitoring actually informed this decision. The company used Prometheus and didn't have the Pushgateway installed, so a cron-based solution had zero metric visibility, while it was trivial to add metrics and alerting to the service version.
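For reference, the in-process variant is only a few lines. Here is a minimal sketch, assuming a notifications table with a send_at column and a hypothetical sendPush function, using the node-postgres (pg) package:

```js
const { Pool } = require('pg');        // npm install pg
const pool = new Pool();               // connection settings come from PG* env vars

const INTERVAL_MS = 60 * 1000;         // poll every minute

async function deliverDueNotifications() {
  // Fetch notifications whose scheduled time has passed and that were not sent yet.
  const { rows } = await pool.query(
    'SELECT id, payload FROM notifications WHERE send_at <= now() AND sent = false'
  );
  for (const row of rows) {
    await sendPush(row.payload);       // placeholder for your push-sending code
    await pool.query('UPDATE notifications SET sent = true WHERE id = $1', [row.id]);
  }
}

setInterval(() => {
  deliverDueNotifications().catch(console.error);
}, INTERVAL_MS);
```

The cron variant would be the same query wrapped in a script (or behind an HTTP endpoint) that exits when it is done.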

How to prioritize express requests/responses over other intensive server-related tasks

My node application currently has two main modules:
a scraper module
an express server
The former is a very server-intensive task which runs indefinitely in a loop. It scrapes information from more than 100 URLs, crunches the data and puts it into a MongoDB database (using mongoose). This process runs over and over and over. :P
The latter, my express server, responds to HTTP/socket GET requests and returns to the requesting client the crunched data which the scraper wrote to the db.
I'd like to optimize the performance of my server so that the express requests and responses get prioritized over the server-intensive task(s). A client should be able to get the requested data asap, without the scraper eating up all of my server resources.
I thought about putting the server-intensive task or the express server into its own thread, but then I stumbled upon cluster and child processes, and now I'm totally confused which approach would be the right one for my situation.
One of the benefits I have is that there is a clear separation between the writing part of my application and the reading part. The scraper writes to the db, express only reads from it (no POST/PUT/DELETE/... calls are exposed). So I -guess- I won't run into threading problems with different threads trying to write to the same db.
Any good suggestions? Thanks in advance!
Resources like CPU and memory required by processes are managed by the operating system. You should not waste your time writing that logic within your source code.
I think you should look at the problem from outside your source code files. Once they run, they are processes, and processes are managed, as I said, by the OS.
Firstly, I would split this into two separate commands.
One being the scraper module (eg npm run scraper, that runs something like node scraper.js).
The other one being your express server (eg npm start, that runs something like node server.js).
This approach will let you configure that within your OS or your cluster.
A quick approach would be to use Docker, with two Docker containers running your processes under CPU usage limits. This is fairly easy to do, does not require you to spin up a new server, and at the same time provides the isolation level you need to scale to many servers in the future.
Steps to do this:
Learn a little about Docker and docker-compose and install them on your server
Build a Docker image for your application (you can push it to the free private repository that Docker Hub gives you)
Build a docker-compose file for your two services using that image, with the CPU configuration you need (you can set both CPU and memory limits easily); see the sketch below
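As a rough illustration of that compose file, here is a minimal sketch; the image name, service names and the concrete limits are assumptions (compose file format 2.2, where cpus and mem_limit are plain service options):

```yaml
version: "2.2"
services:
  web:
    image: myapp               # hypothetical image built from your project
    command: npm start         # runs node server.js
    ports:
      - "3000:3000"
  scraper:
    image: myapp
    command: npm run scraper   # runs node scraper.js
    cpus: 0.5                  # cap the scraper at half a CPU core
    mem_limit: 512m            # and at 512 MB of RAM
```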
An alternative would be to run the two commands (npm run scraper and npm start) with some tool like cpulimit, nice/ionice, or to manage namespaces and cgroups manually (but Docker does that for you).
PS: I would also recommend rethinking your backend process. Maybe it's better to run it every 12 hours or so, instead of all the time, and to run it from cron instead of a loop.

node.js golang composite architecture for web application

I am currently architecting a web app that will use node.js for basic routing. Some parts of the app are more processor intensive and I wanted to use golang for those parts. However, I'm not sure the best way to install and communicate between the two languages. I'm using Amazon Elastic Beanstalk for initial tests, so any specifics can be targeted for that platform.
In essence it boils down to the following 2 questions:
1) How do you install both node.js and a golang docker image on Amazon EC2? Amazon has guides for one or the other, but not both.
2) What is the best way to offload processor-intensive tasks from node.js to a golang codebase (I could imagine RPC, or just running golang on some localhost port, but I'm new to this type of thing)? The golang tasks might be things like serious number crunching or complex graph searches.
Thanks for any guidance.
Go is trivial to deploy. Just build it on a Linux box (or use gox) and deploy the binary. (You don't need Go installed on the server to run a Go program.)
There are many options for communicating between Go and Node.js. Here are a few:
If the work you are doing takes a long time it may not be appropriate to have the user wait for a response. For background tasks you can use a queue (like Redis' rpoplpush or a real queue like Kafka or RabbitMQ, or since you're using Amazon: SQS). Push your job as a JSON object onto the queue, then write a Go program that pulls from the queue, does its processing and then writes the final result somewhere.
Go has a jsonrpc library. You can communicate over TCP: serialize a request in Node, read it in Go, then deserialize the response in Node. It's the JSON-RPC 1.0 protocol, and for TCP all you have to do is add some message framing (prefix your JSON string with a length) or just newline-separate each request/response.
Write a standard HTTP service in Go and just make HTTP calls from NodeJS. (PUT/POST/GET)
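For the simplest of those options, a plain HTTP service in Go called from Node, the Node side is just an HTTP request. A minimal sketch; the port, path and payload shape are assumptions:

```js
const http = require('http');

// Sends a CPU-heavy job to a Go service assumed to listen on localhost:8080
// and resolves with the parsed JSON result.
function crunchNumbers(payload) {
  return new Promise((resolve, reject) => {
    const body = JSON.stringify(payload);
    const req = http.request(
      {
        host: '127.0.0.1', port: 8080, path: '/crunch', method: 'POST',
        headers: { 'Content-Type': 'application/json', 'Content-Length': Buffer.byteLength(body) }
      },
      (res) => {
        let data = '';
        res.on('data', (chunk) => { data += chunk; });
        res.on('end', () => resolve(JSON.parse(data)));
      }
    );
    req.on('error', reject);
    req.end(body);
  });
}

// Usage inside an Express handler (illustrative):
// const result = await crunchNumbers({ numbers: [1, 2, 3] });
```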

Equivalent of Celery in Node JS

Please suggest an equivalent of Celery in Node JS to run asynchronous tasks.
I have been able to search for the following:
(Later)
Kue
coffee-resque
cron
node-celery
I have run both manual and automated threads in the background and interacted with MongoDB.
node-celery uses Redis and not MongoDB. Is there any way I can change that? When I installed node-celery, redis was installed as a dependency.
I am new to Celery, please guide. Thanks.
Celery is basically a RabbitMQ client. There are producers (tasks), consumers (workers) and an AMQP message broker which delivers messages between them.
Knowing that will enable you to write your own Celery in node.js.
node-celery here is a library that enables your node process to work both as a Celery client (producer/publisher) and a Celery worker (consumer).
See https://abhishek-tiwari.com/post/amqp-rabbitmq-and-celery-a-visual-guide-for-dummies
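To give an idea of what the client side looks like, here is a minimal sketch roughly following node-celery's README; the broker URL and the tasks.add task name are assumptions, and the task itself would be implemented by a (typically Python) Celery worker:

```js
var celery = require('node-celery');

var client = celery.createClient({
  CELERY_BROKER_URL: 'amqp://guest:guest@localhost:5672//',  // assumed local RabbitMQ
  CELERY_RESULT_BACKEND: 'amqp'
});

client.on('error', function (err) {
  console.error(err);
});

client.on('connect', function () {
  // Call a task registered on the Celery worker side.
  client.call('tasks.add', [1, 2], function (result) {
    console.log(result);
    client.end();
  });
});
```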
Edit-1/2018
My recommendation is not to use Kue now, as it seems to be a stalled project; use Celery instead. It is very well supported and maintained by the community and supports a large number of use cases.
Old Answer
Go for Kue, it's a holistic solution that resembles Celery in the Python world; it has the concepts of producers/consumers, delayed tasks, task retrial, task TTL, the ability to round-robin tasks across multiple consumers listening to the same queue, etc.
Celery is probably more advanced, with more features and more brokers supported, and you can use celery-node if you like, but, in my opinion, there is no need to go for a hybrid solution that requires installing both Python and Node when you can use only one language that is sufficient in 90% of the cases (unless necessary, of course).
Go for Kue, it's a holistic solution that resembles Celery in the Python world; it has the concepts of producers/consumers, delayed tasks, task retrial, task TTL, the ability to round-robin tasks across multiple consumers listening to the same queue, etc.
Kue, after so much time has passed, still has the same old core issues unsolved:
github.com/Automattic/kue/issues/514
github.com/Automattic/kue/issues/130
github.com/Automattic/kue/issues/53
If anyone reading this doesn't want to rewrite Kue, don't start with it. It's good for simple tasks. But if you need to deal with a lot of them, with concurrency, or with task chains (when one task creates another), stop wasting your time.
I've wasted a month trying to debug Kue with still no success. The best choice was to swap Kue for a pub/sub messaging queue on RabbitMQ plus Rabbot (another RabbitMQ wrapper).
Personally, I haven't used Celery enough to go all in for it, but having searched for a Celery alternative and found someone advising Kue just made my blood boil.
If you want to send a delayed email (as in the Kue example) you can go with whatever you'd like without worrying about errors. But if you want a reliable task/message queue, don't even start with Kue. I'd personally go with node-celery.
It is also worth mentioning https://github.com/OptimalBits/bull. It is a fast, reliable, Redis-based queue written for stability and atomicity.
Bull 4 (BullMQ) is currently in beta and has some nice features: https://github.com/taskforcesh/bullmq
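Here is a minimal Bull (v3) sketch with a producer and a worker sharing one queue; the queue name, Redis URL and job payload are assumptions:

```js
const Queue = require('bull');                 // npm install bull

// Producer and worker both connect to the same named queue on Redis.
const scoreQueue = new Queue('image-scoring', 'redis://127.0.0.1:6379');

// Worker side: process up to 2 jobs concurrently.
scoreQueue.process(2, async (job) => {
  console.log('scoring image', job.data.imageId);
  // ... do the actual work here ...
  return { ok: true };
});

// Producer side: enqueue a job with retries and exponential backoff.
scoreQueue.add(
  { imageId: 42 },
  { attempts: 3, backoff: { type: 'exponential', delay: 5000 } }
);
```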
In our experience, Kue was unreliable, losing jobs. Granted, we were using an older version; it's probably been fixed since. That was also during the period when TJ abandoned the project and the new maintainers hadn't been chosen yet. We switched to beanstalkd and have been very happy. We're using https://github.com/ceejbot/fivebeans as the node interface to beanstalkd.

Background processing on a nodejs, mongodb and heroku stack

I'm writing a simple image upload site as a learning project.
It's written in nodejs, with mongodb and deployed onto Heroku cedar.
I'd like to implement a node script that runs, say, once an hour, and applies the reddit algorithm to the images and stores the score against each image in MongoDB.
How can I achieve this, bearing in mind that I am on Heroku and have file system limitations? Given the cedar architecture, it would be best to hand off to a separate worker, but if there's a faster/simpler/easier approach I'd be happy to hear it. The Heroku Dev Center article on workers/background jobs unfortunately doesn't list any tutorials yet for such a system.
My previous experience of background processing on Heroku was with Rails, using the scheduled tasks add-on plus delayed_job, and it was very straightforward.
An extremely simple approach might utilize setInterval or node-cron. You might also want to spawn or fork a child process for this periodic processing.
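For the "once an hour" case, a node-cron sketch might look like this; score-images.js is a hypothetical script that would recompute the scores in MongoDB:

```js
const cron = require('node-cron');             // npm install node-cron
const { fork } = require('child_process');

// Run at minute 0 of every hour.
cron.schedule('0 * * * *', () => {
  // Hand the heavy work to a child process so the web dyno stays responsive.
  const worker = fork('./score-images.js');
  worker.on('exit', (code) => console.log('scoring run finished with code', code));
});
```

On Heroku, the same script could also run in a separate worker dyno (or via the Scheduler add-on) instead of inside the web process.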

Resources