How do I use the foreach method in MongoDB to do scraping/API calls without getting blacklisted by sites? - node.js

I have about 20 documents currently in my collection (and I'm planning to add many more, probably into the hundreds). I'm using the MongoDB Node.js client's collection.forEach() method to iterate through each one and, based on the document's data, hit three different endpoints: two APIs (Walmart and Amazon) and one website scrape (name not relevant). Each document contains the data needed to execute the requests, and I then update the documents with the returned data.
The problem I'm encountering is that the Walmart API and the website scrape will not return data toward the end of the iteration, or at least my database is not getting updated. My assumption is that the forEach method is firing off a bunch of simultaneous requests, and either I'm bumping up against some arbitrary limit on simultaneous requests allowed by the endpoint, or the endpoints simply can't handle this many requests and ignore anything above and beyond their "request capacity." I've run some of the documents that were not updating through the same code, but in a different collection containing just a single document, and they did update, so I don't think it's bad data inside the documents.
I'm running this on Heroku (and locally for testing) using Node.js. Results are similar both on the Heroku instance and locally.
If my assumption is correct, I need a better way to structure this so that there is some separation between requests, or so that it only processes x records on a single pass.

It sounds like you need to throttle your outgoing web requests. There's a fantastic node module for doing this called limiter. The code looks like this:
var RateLimiter = require('limiter').RateLimiter;
var limiter = new RateLimiter(1, 1000); // allow at most 1 request per 1000 ms

var throttledRequest = function () {
  limiter.removeTokens(1, function () {
    // your outgoing request goes here
    console.log('Only prints once per second');
  });
};

throttledRequest();
throttledRequest();
throttledRequest();
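Tying that back to the original question, here is a minimal sketch of how the limiter might wrap the per-document work inside the forEach iteration. The collection name, the fetchAndUpdate helper, and the one-request-per-second rate are assumptions for illustration, not part of the original answer:

var RateLimiter = require('limiter').RateLimiter;
var limiter = new RateLimiter(1, 1000); // at most one outgoing request per second

// `db` is assumed to be a connected Db instance from MongoClient.connect().
// `fetchAndUpdate` is a hypothetical helper that calls the Walmart/Amazon APIs
// or scrapes the site for one document and writes the result back to MongoDB.
db.collection('products').find({}).forEach(function (doc) {
  // forEach fires quickly for every document; the limiter queues the real work
  limiter.removeTokens(1, function (err) {
    if (err) { return console.error(err); }
    fetchAndUpdate(doc, function (err) {
      if (err) { console.error('Update failed for document ' + doc._id, err); }
    });
  });
});

With the limiter queuing the requests, the endpoints see a steady one request per second instead of 20+ near-simultaneous hits.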

Related

Node app that fetches, processes, and formats data for consumption by a frontend app on another server

I currently have a frontend-only app that fetches 5-6 different JSON feeds, grabs some necessary data from each of them, and then renders a page based on said data. I'd like to move the data fetching / processing part of the app to a server-side node application which outputs one simple JSON file which the frontend app can fetch and easily render.
There are two noteworthy complications for this project:
1) The new backend app will have to live on a different server than its frontend counterpart
2) Some of the feeds change fairly often, so I'll need the backend processing to constantly check for changes (every 5-10 seconds). Currently with the frontend-only app, the browser fetches the latest versions of the feeds on load. I'd like to replicate this behavior as closely as possible
My thought process for solving this took me in two directions:
The first is to set up an Express application that uses setTimeout to constantly check for new data to process. This data is then sent as the response to a simple GET request:
const express = require('express');
const app = express();
const port = process.env.PORT || 3000;
let processedData = {};

const getData = () => {...} // returns a promise that fetches and processes data

/* use an immediately invoked function with setTimeout to fetch the data
 * when the program starts and then once every 5 seconds after that */
(function refreshData() {
  getData().then((data) => {
    processedData = data;
  }).catch((err) => console.error(err));
  setTimeout(refreshData, 5000);
})();

app.get('/', (req, res) => {
  res.send(processedData);
});

app.listen(port, () => {
  console.log(`Started on port ${port}`);
});
I would then run a simple GET request from the client (after properly adjusting the CORS headers) to get the JSON object.
My questions about this approach are pretty generic: Is this even a good solution to this problem? Will this drive up hosting costs based on processing / client GET requests? Is setTimeout a good way to have a task run repeatedly on the server?
The other solution I'm considering would involve setting up an AWS Lambda function that writes the resulting JSON to an S3 bucket. It looks like the minimum interval for scheduling an AWS Lambda function is 1 minute, however. I imagine I could set up 3 or 4 identical Lambda functions and offset them by 10-15 seconds, but that seems so hacky that it makes me physically uncomfortable.
Any suggestions / pointers / solutions would be greatly appreciated. I am not yet a super experienced backend developer, so please ELI5 wherever you deem fit.
A few pointers.
Use cron tasks for periodic processing of data (see the sketch after these pointers). This is far preferable, especially if you are formatting a lot of data.
Don't set up multiple Lambda functions for the same task. It's going to be messy to maintain all those functions.
After processing/fetching the feed, you can store the JSON file on your own server or in S3. Note that if it's S3, you are paying for, and waiting on, a network operation. You can read the file from your Express app and just send the response back to your clients.
Depending on the file size and the load on your server, you might want to add a caching server so that you can cache the response until new JSON data is available.
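For the cron-task pointer, here is one minimal in-process sketch using the node-cron package (an assumption; a system cron job invoking a separate script would work just as well). The 5-second interval and the getData helper mirror the question; the placeholder implementation is hypothetical:

const express = require('express');
const cron = require('node-cron');

const app = express();
let processedData = {};

// Stand-in for the question's real fetch-and-process logic.
const getData = () => Promise.resolve({ /* fetched and processed feed data */ });

// Refresh the cached JSON every 5 seconds (node-cron accepts a 6-field
// expression with a leading seconds column).
cron.schedule('*/5 * * * * *', () => {
  getData()
    .then((data) => { processedData = data; })
    .catch((err) => console.error('feed refresh failed', err));
});

// Clients always read from the in-memory cache, so each GET is cheap no matter
// how expensive the feed processing is.
app.get('/feed.json', (req, res) => {
  res.json(processedData);
});

app.listen(process.env.PORT || 3000);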

Exposing Meteor's mongo DB to a stateless client

I need a legacy Java application to pull information from a Meteor collection.
Ideally, I would need a simple service from which my app could download the latest list of item prices. Something like an HTTP GET to:
www.mystore.com/listOfPrices
would return JSON with an array:
[{"item":"beer", "price":"2.50"}, {"item":"water", "price":"1"}]
The problem is that I cannot make a Meteor page print the result "as is", because Meteor assumes the client supports JavaScript. Note that I do plan to implement the Java DDP client at a later stage, but here I would like to start with a very simple service.
Idea: I thought of running my own Node.js service alongside the Meteor service to retrieve a snapshot of the collection. This service would use a server-side JavaScript DDP client to subscribe, filter, and then return the collection, once loaded, as a JSON document (array).
Any idea on how to achieve this?
Looks like you want to provide a REST interface. See the MeteorPedia page on REST for how to expose collection data. It might be as simple as
prices = new Mongo.Collection('prices');

// Add access points for `GET`, `POST`, `PUT`, `DELETE`
HTTP.publish({collection: prices}, function (data) {
  // here you have access to this.userId, this.query, this.params
  return prices.find({});
});
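If the HTTP.publish package isn't an option, a minimal alternative sketch is to serve the collection as plain JSON with Meteor's built-in WebApp connect handlers on the server. The /listOfPrices path matches the question; the handler below is an assumption, not part of the original answer:

// Server-side only: a plain HTTP endpoint, no DDP or client-side JavaScript required.
WebApp.connectHandlers.use('/listOfPrices', function (req, res) {
  var docs = prices.find({}, {fields: {item: 1, price: 1, _id: 0}}).fetch();
  res.writeHead(200, {'Content-Type': 'application/json'});
  res.end(JSON.stringify(docs));
});

A legacy client can then simply GET www.mystore.com/listOfPrices and parse the array.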

How to lock (Mutex) in NodeJS?

There is an external resource (an API for checking available inventories) that can only be accessed by one thread at a time.
My problems are:
The Node.js server handles requests concurrently, so we might have multiple requests trying to reserve inventories at the same time.
If I hit the inventory API concurrently, it will return duplicate available inventories.
Therefore, I need to make sure that I am hitting the inventory API one request at a time.
There is no way for me to change the inventory API (it's legacy), so I must find a way to synchronize the requests in my Node.js server.
Note:
There is only one Node.js server, running one process, so I only need to synchronize the requests within that server.
Low-traffic server running on Express.js.
I'd use something like the async module's queue and set its concurrency parameter to 1. That way, you can put as many tasks in the queue as you need to run, but they'll only run one at a time.
The queue would look something like:
var async = require('async');

// The worker runs one task at a time because the concurrency parameter is 1.
var inventoryQueue = async.queue(function (task, callback) {
  // use the values in "task" to call your inventory API here
  // pass your results to "callback" when you're done
}, 1);
Then, to make an inventory API request, you'd do something like:
var inventoryRequestData = { /* data you need to make your request; product id, etc. */ };
inventoryQueue.push(inventoryRequestData, function(err, results) {
  // this will be called with your results
});
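A rough sketch of how the queue might sit in front of the legacy API inside an Express route; the route path, the checkInventory helper, and the task fields are assumptions for illustration:

var express = require('express');
var async = require('async');

var app = express();

// Only one task runs at a time, so the legacy API never sees concurrent calls.
var inventoryQueue = async.queue(function (task, callback) {
  // `checkInventory` is a hypothetical wrapper around the legacy inventory API.
  checkInventory(task.productId, function (err, available) {
    callback(err, available);
  });
}, 1);

app.get('/reserve/:productId', function (req, res) {
  inventoryQueue.push({productId: req.params.productId}, function (err, available) {
    if (err) { return res.status(500).send(err.message); }
    res.json({available: available});
  });
});

app.listen(process.env.PORT || 3000);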

Fetching external resources in parallel in node - good practice?

I have a setup where a node server acts as a proxy server to serve images.
For example, for an image "test1.jpg", the exact same image can be fetched from 3 external sources, let's say:
a. www.abc.com/test1.jpg
b. www.def.com/test1.jpg
c. www.ghi.com/test1.jpg
When the Node.js server gets a request for "test1.jpg", it first gets a list of external URLs from a DB. Among these external resources, at least one is always behind a CDN and is "expected" to respond faster, and hence is the preferred source for the image.
My question is: which of the two methods below is the correct way to achieve this (or is there another method)?
Fire HTTP requests (using mikeal's request client module) for all the URLs at the same time. Get their promise objects and, whichever source responds first, send that image back to the user (it can be any of the three sources, not necessarily the preferred source behind the CDN, but that doesn't matter since the image is exactly the same). The disadvantage I see is that for every image we hit 3 sources. Also, the promises for the HTTP requests can still get fulfilled after the response from the first successful source has been sent out.
Fire HTTP requests one at a time, starting with the most preferred image, wait for it to fail (i.e. a 404 on the image), and then proceed to the next preferred image. We make fewer HTTP requests, but the user waits longer.
Some pseudo code
Method 1
while (imagePreferences.length > 0) {
  var url = imagePreferences.shift(); // take the next URL off the front of the list
  getImage(url).then(function () {
    sendImage();
  }, function (err) {
    console.log(err);
  });
}
Method 2
if (imageUrls.length > 0) {
  var url = imageUrls.shift(); // take the most preferred URL first
  getImage(url).then(function (imageResp) {
    sendImageResp();
  }, function (err) {
    getNextImage(); // recurse over this
  });
}
This is just pseudocode. I am new to Node.js. Any help/views would be appreciated.
I prefer the 1st option; CDNs are designed to handle massive request volumes. Your code is perfectly fine for sending HTTP requests to multiple sources in parallel.
In case you want to stop the other requests after successfully receiving the first image, you can use async.detect: https://github.com/caolan/async#detect
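For the respond-with-the-first-success part of option 1, here is a minimal sketch reusing the getImage(url) helper assumed in the pseudocode above. It relies on Promise.any (available in Node 15+), which resolves with the first fulfilled promise and ignores rejections; note that the slower requests still complete in the background unless explicitly aborted. This is an alternative illustration, not the async.detect approach linked above:

// Fire all sources in parallel and respond with whichever image resolves first;
// rejected requests (e.g. 404s) are simply ignored unless every source fails.
function getFastestImage(urls) {
  return Promise.any(urls.map(function (url) {
    return getImage(url); // assumed to return a promise that rejects on failure
  }));
}

getFastestImage(imageUrls)
  .then(function (imageResp) {
    sendImageResp(imageResp);
  })
  .catch(function (err) {
    // AggregateError: every source failed
    console.log(err);
  });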

How does Meteor receive updates to the results of a MongoDB query?

I asked a question a few months ago, to which Meteor seems to have the answer.
Which, if any, of the NoSQL databases can provide stream of *changes* to a query result set?
How does Meteor receive updates to the results of a MongoDB query?
Thanks,
Chris.
You want query.observe() for this. Say you have a Posts collection with a tags field, and you want to get notified when a post with the important tag is added.
http://docs.meteor.com/#observe
// collection of posts that includes an array of tags
var Posts = new Meteor.Collection('posts');

// DB cursor to find all posts with 'important' in the tags array.
var cursor = Posts.find({tags: 'important'});

// watch the cursor for changes
var handle = cursor.observe({
  added: function (post) { ... },   // run when a post is added
  changed: function (post) { ... }, // run when a post is changed
  removed: function (post) { ... }  // run when a post is removed
});
You can run this code on the client, if you want to do something in each browser when a post changes. Or you can run it on the server, if you want to, say, send an email to the team when an important post is added.
Note that added and removed refer to the query, not the document. If you have an existing post document and run
Posts.update(my_post_id, {$addToSet: {tags: 'important'}});
this will trigger the 'added' callback, since the post is getting added to the query result.
Currently, Meteor really works well with one instance/process. In that case all queries go through this instance, and it can broadcast the changes back to the other clients. Additionally, it polls MongoDB every 10 seconds for changes to the database made by outside queries. There are plans for 1.0 to improve the scalability and hopefully allow multiple instances to inform each other about changes.
DerbyJS, on the other hand, uses Redis PubSub.
From the docs:
On the server, a collection with that name is created on a backend Mongo server. When you call methods on that collection on the server, they translate directly into normal Mongo operations.
On the client, a Minimongo instance is created. Minimongo is essentially an in-memory, non-persistent implementation of Mongo in pure JavaScript. It serves as a local cache that stores just the subset of the database that this client is working with. Queries on the client (find) are served directly out of this cache, without talking to the server.
When you write to the database on the client (insert, update, remove), the command is executed immediately on the client, and, simultaneously, it's shipped up to the server and executed there too. The livedata package is responsible for this.
That explains client-to-server.
Server-to-client, from what I can gather, is handled by the livedata and mongo-livedata packages.
https://github.com/meteor/meteor/tree/master/packages/mongo-livedata
https://github.com/meteor/meteor/tree/master/packages/livedata
Hope that helps.
