Build an extensible system for scraping websites - node.js

Currently, I have a server running. Whenever I receive a request, I want some mechanism to start the scraping process on some other resource (preferably dynamically created), as I don't want to perform scraping on my main instance. Further, I don't want the other instance to keep running and charging me when I am not scraping data.
So, preferably a system that I can request to start scraping the site and close when it finishes.
Currently, I have looked into Google Cloud Functions, but they cap every function at 9 minutes, so they won't fit my requirement, as scraping would take much longer than that. I have also looked into the AWS SDK; it lets us create VMs at runtime and also close them, but I can't figure out how to push my API script onto the newly created AWS instance.
Further, the system should be extensible: I have many different scripts that scrape different websites, so a robust solution would be ideal.
I am open to using any technology. Any help would be greatly appreciated. Thanks

I can't figure out how to push my API script onto the newly created AWS instance.
This is achieved by using UserData:
When you launch an instance in Amazon EC2, you have the option of passing user data to the instance that can be used to perform common automated configuration tasks and even run scripts after the instance starts.
So basically, you would construct your UserData to install your scripts and all their dependencies, and then run them. This is executed when new instances are launched.
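A rough sketch of that flow with the AWS SDK for JavaScript (v2) is below; the region, AMI ID, instance type and repository URL are placeholders, and setting `InstanceInitiatedShutdownBehavior` to `terminate` means the VM removes itself once the UserData script shuts the machine down, so billing stops when scraping finishes.

```js
// Sketch: launch a throwaway EC2 instance that installs and runs a scraper,
// then shuts itself down. AMI ID, instance type and repo URL are placeholders.
const AWS = require('aws-sdk');
const ec2 = new AWS.EC2({ region: 'us-east-1' });

const userData = `#!/bin/bash
# install runtime and fetch the scraper (adjust packages for your AMI)
yum install -y git nodejs
git clone https://github.com/your-org/your-scraper.git /opt/scraper
cd /opt/scraper && npm install
node index.js --site=example.com
# stop billing when the job is done
shutdown -h now
`;

async function startScraperInstance() {
  const result = await ec2.runInstances({
    ImageId: 'ami-0123456789abcdef0',            // placeholder AMI
    InstanceType: 't3.small',
    MinCount: 1,
    MaxCount: 1,
    // terminate (not just stop) the VM when the OS shuts down
    InstanceInitiatedShutdownBehavior: 'terminate',
    UserData: Buffer.from(userData).toString('base64'),
  }).promise();
  return result.Instances[0].InstanceId;
}

startScraperInstance().then(id => console.log('launched', id));
```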
If you want the system to be scalable, you can launch your instances in an Auto Scaling Group and scale it up or down as you require.
The other option is running your scripts as Docker containers. For example using AWS Fargate.
By the way, AWS Lambda has a limit of 15 minutes, so not much more than Google Cloud Functions.

Related

GCP App Engine use for non web applications

I have a use case where I'd like to have an app running on GCP on a schedule. Every X hours my main.py would execute a function, but I don't think I need a web app or Flask (which is what the examples I've found use).
I did try the functions-framework; would this be an option within App Engine (using the functions-framework entrypoint as the entrypoint for the app)?
Conceptually I don't know if App Engine is the right way forward, although it does look like the simplest option (excluding Cloud Functions, which I can't use because of the time restrictions).
Thanks!
You can use a Cloud Run Job (note that it's still in preview). As its documentation says
Unlike a Cloud Run service, which listens for and serves requests, a Cloud Run job only runs its tasks and exits when finished. A job does not listen for or serve requests, and cannot accept arbitrary parameters at execution.
You can also still use App Engine (Python + Flask). Using Cloud Scheduler, you schedule invoking a URL of your web app. However, because your task is long running, you should use Cloud Tasks; tasks allow you to run longer processes. Essentially, you'll have a 2-step process:
a. Cloud Scheduler invokes a URL on your GAE app.
b. This URL in turn pushes a task into your task queue, which then executes the task. This is a blog article (with sample code) we wrote for using tasks in GAE. It's for Django, but you can easily replace it with Flask.
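The linked article is Python, but the same pattern works from Node.js as well; a hedged sketch of the step-b handler enqueuing a Cloud Task with @google-cloud/tasks (the project, location, queue name and worker path are placeholders):

```js
// Sketch: the URL hit by Cloud Scheduler enqueues a Cloud Task that targets
// a longer-running worker handler in the same App Engine app.
const { CloudTasksClient } = require('@google-cloud/tasks');
const client = new CloudTasksClient();

async function enqueueWork() {
  const parent = client.queuePath('my-project', 'us-central1', 'scrape-queue');
  const [task] = await client.createTask({
    parent,
    task: {
      appEngineHttpRequest: {
        httpMethod: 'POST',
        relativeUri: '/tasks/run-job',   // placeholder worker endpoint
        headers: { 'Content-Type': 'application/json' },
        body: Buffer.from(JSON.stringify({ job: 'nightly' })).toString('base64'),
      },
    },
  });
  return task.name;
}
```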
If you just need to run some backend logic and then shut down until the next run, Cloud Functions is made for that.
You can set up a Cloud Scheduler job to invoke the function on a time basis.
Make sure to keep the function private (not publicly invokable), and configure a service account for Cloud Scheduler to use, with the rights to invoke the private function.
Be aware of the functions configuration options to fit your use case https://cloud.google.com/functions/docs/configuring , as well as the limits https://cloud.google.com/functions/quotas#resource_limits
A good tutorial for implementing it: https://cloud.google.com/community/tutorials/using-scheduler-invoke-private-functions-oidc
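Since the question mentions the Functions Framework: the entrypoint Cloud Scheduler would invoke is just an ordinary HTTP function. A minimal sketch, shown here in Node.js (the Python framework is analogous, and the function name is arbitrary):

```js
// Minimal HTTP entrypoint that Cloud Scheduler can be pointed at.
const functions = require('@google-cloud/functions-framework');

functions.http('runJob', async (req, res) => {
  // ... your backend logic goes here ...
  console.log('job triggered at', new Date().toISOString());
  res.status(200).send('done');
});
```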

How to reload tensorflow model in Google Cloud Run server?

I have a webserver hosted on cloud run that loads a tensorflow model from cloud file store on start. To know which model to load, it looks up the latest reference in a psql db.
Occasionally a retrain script runs using google cloud functions. This stores a new model in cloud file store and a new reference in the psql db.
Currently, in order to use this new model I would need to redeploy the Cloud Run instance so it grabs the new model on start. How can I automate using the newest model instead? Something elegant, robust, and scalable is ideal, of course, but if something hacky/clunky yet functional is much easier, that would be preferred. This is a throw-away prototype, but it needs to be available and usable.
I have considered a few options but I'm not sure how possible either of them are:
Create some sort of postgres trigger/notification that the Cloud Run server listens to. I guess this would require another thread; this ups complexity, and I'm unsure how multiple threads work with Cloud Run.
Similar, but use HTTP pub/sub: make an endpoint on the server to re-look-up and fetch the latest model, and publish when the retrainer finishes.
I could deploy a new instance and remove the old one after the retrainer runs. Simple in some regards, but it seems riskier and might be hard to accomplish programmatically.
Your current pattern should implement cache management (because you cache a model). How can you invalidate the cache?
Restart the instance? Cloud Run doesn't let you control the instances. The easiest way is to redeploy a new revision to force the current instances to stop and new ones to start.
Set a TTL? It's an option: load a model for XX hours, and then reload it from the source. Problem: you could have glitches (some instances with the new model and some with the old one, until the cache TTL expires for all the instances).
Offer a cache invalidation mechanism? As said before, it's hard because Cloud Run doesn't let you communicate with all the instances directly. So, a push mechanism is very hard and tricky to implement (not impossible, but I don't recommend you waste time on that). A pull mechanism is an option: check a "latest updated date" somewhere (a record in Firestore, a file in Cloud Storage, an entry in Cloud SQL, ...) and compare it with your model's updated date. If they match, great; if not, reload the latest model.
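A minimal sketch of that pull check, assuming the model artifact lives in a Cloud Storage bucket; the bucket/object names and the loadModelFromGcs helper are placeholders, and the same comparison can be done from a Python server:

```js
// Sketch of the "pull" mechanism: compare the model file's updated timestamp
// in GCS with the timestamp of the model currently loaded in memory.
const { Storage } = require('@google-cloud/storage');
const storage = new Storage();

let loadedAt = null;   // set when the model is (re)loaded
let model = null;

async function getModel() {
  const [meta] = await storage
    .bucket('my-models-bucket')              // placeholder bucket
    .file('latest/model.tar.gz')             // placeholder object
    .getMetadata();
  const updated = new Date(meta.updated);
  if (!model || updated > loadedAt) {
    model = await loadModelFromGcs();        // your existing loading logic (placeholder)
    loadedAt = updated;
  }
  return model;
}
```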
You have several solutions; it all depends on what you prefer.
But there is another solution, my preference: every time you have a new model, build a new container with the new model already loaded into it (with Cloud Build) and deploy that new container on Cloud Run.
That solution solves your cache management issue, and you will get better cold start latency for all your new instances (in addition to easier rollback, A/B testing or canary release capability, version management and control, portability, testing locally or in other environments, ...).

How do I run puppeteer on a server/in the cloud

Feels like I've searched the entire web for an answer...to no avail. I have a puppeteer script that works perfectly locally. My local machine is a little unreliable, so I've been trying to push this script to the cloud so that it can run there. But I have no idea where to start. I'm sitting here with an IBM cloud account with no idea what to do. Can anyone help me out?
Running Puppeteer scripts can be done on any cloud platform that
exposes a Node.js environment
enables running a browser (Puppeteer will need to start Chromium)
This could be achieved, for example, using AWS EC2.
AWS Lambda, Google Cloud Functions and IBM Cloud Functions (and similar services) might also work, but they may need additional work on your side to get the browser running.
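On a plain VM, that extra work is mostly installing Chromium's system libraries and launching with flags that tolerate a headless (often root) environment; a typical hedged launch sketch:

```js
const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch({
    headless: true,
    // Commonly needed when running as root in minimal Linux/cloud images;
    // understand the sandbox trade-off before disabling it.
    args: ['--no-sandbox', '--disable-setuid-sandbox', '--disable-dev-shm-usage'],
  });
  const page = await browser.newPage();
  await page.goto('https://example.com');
  console.log(await page.title());
  await browser.close();
})();
```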
For a step-by-step guide, I would suggest checking out this article and this follow-up.
Also, it might just be easier to look into services like Checkly (disclaimer: I work for Checkly), Browserless and similar (a quick search for something along the lines of "run puppeteer online" will return several of those), which allow you to run Puppeteer checks online without requiring any additional setup. Useful if you are serious about using Puppeteer for testing or synthetic monitoring in the long run.

What is the best service for a GCP FTP Node App?

Ok, so a bit of background on what we are doing.
We have various weather stations and soil monitoring stations across the country that gather up data and then, using FTP, upload it to a server for processing.
Note: this server is not located in the GCP, but we are migrating all our services over at the moment.
Annoyingly FTP is the only service that these particular stations allow. Newer stations thankfully are using REST APIs instead, so that makes it much simpler.
I have written a small nodejs app that works with ftp-srv. This acts as the FTP server.
I have also written a new FileSystem class that will hook directly into Google Cloud Storage. So instead of getting a local directory, it reads the GCS directory.
This allows for weather stations to upload their dump files direct to GCP for processing.
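For reference, wiring a custom file system into ftp-srv looks roughly like this; GcsFileSystem stands in for the questioner's own GCS-backed class, and the credentials, bucket, public IP and passive-port range are placeholders:

```js
const FtpSrv = require('ftp-srv');
// GcsFileSystem is the custom FileSystem class backed by Google Cloud Storage.
const GcsFileSystem = require('./gcs-file-system');

const ftpServer = new FtpSrv({
  url: 'ftp://0.0.0.0:21',
  pasv_url: '203.0.113.10',        // public IP the stations connect to (placeholder)
  pasv_min: 1024,
  pasv_max: 1048,
});

ftpServer.on('login', ({ connection, username, password }, resolve, reject) => {
  if (username === 'station' && password === 'secret') {   // placeholder auth
    return resolve({ fs: new GcsFileSystem(connection, { bucket: 'weather-dumps' }) });
  }
  return reject(new Error('Invalid credentials'));
});

ftpServer.listen().then(() => console.log('FTP server listening on port 21'));
```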
My question is, what is the best service to use?
First I thought of using App Engine: since it's just a small nodejs app, I don't really want to have to go and create a VM just to run it.
However, I have found that I am unable to open up port 21 and the other ports used for passive FTP.
I then thought of using Kubernetes Engine. To be honest, I don't know anything at all about it as of yet, but it seems like a bit of overkill just to run this small app.
My last thought would be to use Compute Engine. I have a working copy with PROFTPD installed and working, so I know I can get the ports open and have data flowing, but I feel it's a bit of overkill to run a full VM just for something acting as an intermediary between the weather stations and GCS.
Any recommendations would be very appreciated.
Thanks!
Kubernetes just for FTP would be using a crane to lift your fork.
Google Compute Engine and PROFTPD will fit in a micro instance at a whopping cost of about $6.00 per month.
The other Google Compute services do not support FTP. This includes:
App Engine Standard
App Engine Flexible
Cloud Run
Cloud Functions
This leaves you with either Kubernetes or Compute Engine.

Automating NodeJS scripts with Google Cloud Platform

My question is in regards to clarification and/or anybodies previous experience with NodeJS and Google Cloud Platform (GCP).
I have developed numerous NodeJS scripts that read and transform several JSON sports feeds in order to populate a Google Firebase database backend.
The NodeJS scripts work exactly as desired, with the exception that I need to run/execute them manually in order to populate the backend. I obviously want this to happen automatically, let's say at an interval of every 2 minutes.
I am unclear on how to achieve this. Does GCP offer a cron job that can execute my NodeJS script on a specific time interval? If so, how should I implement it?
If you are planning on using Compute Engine, you can just use a cron job, which comes with both the Debian and Red Hat Linux public images available within Google Cloud Platform.
You could create an entry like this to run the script every 2 minutes:
*/2 * * * * /usr/local/bin/node /home/example/script.js
Here are two examples of how to do this using cron and appengine:
https://github.com/firebase/functions-cron
https://mhaligowski.github.io/blog/2017/05/25/scheduled-cloud-function-execution.html
The basic idea is the same: one appengine app for cron, where you tell it what URL to get, at what frequency. What is serving at the URL is immaterial here; you would obviously have your nodejs app in an appengine instance, serving URLs that match those given to cron. The cron portion of the examples is independent of language; it is REST based.
So the steps for you would be:
Set up your nodejs app in GAE the standard way (regardless of the fact that you want your app URLs called at intervals)
Set up your cron app in GAE as explained in those examples
Watch your nodejs app from step 1 being called as you specified in step 2!
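For completeness, the nodejs side of step 1 is just an ordinary handler on whatever URL the cron configuration requests; a minimal Express sketch (the path is arbitrary):

```js
const express = require('express');
const app = express();

// URL that the GAE cron configuration requests on schedule
app.get('/tasks/update-feeds', async (req, res) => {
  // ... run the existing feed-transform logic and write to Firebase here ...
  res.status(200).send('feeds updated');
});

// App Engine provides the port via the PORT environment variable
app.listen(process.env.PORT || 8080);
```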
