fetching data from remote couchdb database - node.js

I would like to fetch all the data provided by this api with is a couchdb, https://skimdb.npmjs.com/registry/_all_docs its total size is around 1.9 Million row, I need all this data to do some offline local processing , since the size of the data is huge I would like to fetch it by batch , how I could I do it ?
I tried fetching all the data from node app by it take to much time and it returned empty.
Thanks in advance

I found solution to my question :
There is many ways to do it , one efficient solution is to specify the start_key and the limit which represents the batch s
import fs from 'fs';
const registeryUrl = "https://skimdb.npmjs.com/registry/_all_docs?"
// In my case the start key id is '-'
const startkey="-"
function fetchdata(startkey){
console.log(startkey)
// batch size
const limit = 1000
fetch(registeryUrl +"startkey_docid="+startkey+"&limit=" + limit )
.then((response) => response.json())
.then((data) => {
// check if there is data still to fetch from api
var isStill = ((data["total_rows"]-data["offset"]) >limit) ? true : false;
// the last element will be the next startkey
var last_element=data["rows"].pop()["id"]
/** process you data as you want */
return [last_element,isStill]
})
.then((array) => {
const isStill=array[1]
const last_element=array[0]
// while there is more data be fetched
if (isStill){
fetchdata(last_element)
}
else{
console.log("finished fetching")
}
})
}
fetchdata(startkey)

Related

How to perform server-side pagination from mongodb using nodejs?

I am working on a project that deals with a large amount of data. Fetching of all data at once from mongodb is not an option since it results in bad user experience. I am working on creating an infinite loading setup and with each scroll, I want a fix number of data that is fetched from mongodb and will concatenate the newly fetched data with the previously fetched data to show results on my webpage.
How to do pagination in mongodb using nodejs?
The mongodb node.js driver allows you to use the pagination through the limit and the skip attributes.
// You can first start by counting the number of entities in your collection
collection.countDocuments().then((count) => {
var step = 1000;
var offset = 0;
var limit = step;
//then exploiting your offset and limit variables
//you can limit the number of results you get in each query (a page of results)
while (offset < count) {
process(offset, limit);
offset += step;
}
});
})
.catch((error) => console.error(error));
async function process(offset, limit) {
var entities = collection.find(
{},
{
limit: limit,
skip: offset,
}
);
for await (const entity of entities) {
// do what you want
}
}
You can find more details on the MongoDB documentation page.
https://www.mongodb.com/docs/drivers/node/current/fundamentals/crud/read-operations/limit/

Ability to provide insights from Redis Bull Queue data

I have an application that makes API calls to another system, and it queues these API calls in a queue using Bull and Redis.
However, occasionally it gets bogged down with lots of API calls, or something stops working properly, and I want an easy way for users to check if the system is just "busy". (Otherwise, if they perform some action, and 10 minutes later it hasn't completed, they'll keep trying it again, and then we get a backlog of more entries (and in some cases data issues where they've issued duplicate parts, etc.)
Here's what a single "key" looks like for a successful API call in the queue:
HSET "bull:webApi:4822" "timestamp" "1639085540683"
HSET "bull:webApi:4822" "returnvalue" "{"id":"e1df8bb4-fb6c-41ad-ba62-774fe64b7882","workOrderNumber":"WO309967","status":"success"}"
HSET "bull:webApi:4822" "processedOn" "1639085623027"
HSET "bull:webApi:4822" "data" "{"id":"e1df8bb4-fb6c-41ad-ba62-774fe64b7882","token":"eyJ0eXAiOiJKV1QiL....dQVyEpXt64Fznudfg","workOrder":{"members":{"lShopFloorLoad":true,"origStartDate":"2021-12-09T00:00:00","origRequiredQty":2,"requiredQty":2,"requiredDate":"2021-12-09T00:00:00","origRequiredDate":"2021-12-09T00:00:00","statusCode":"Released","imaItemName":"Solid Pin - Black","startDate":"2021-12-09T00:00:00","reference":"HS790022053","itemId":"13840402"}},"socketId":"3b9gejTZjAXsnEITAAvB","type":"Create WO"}"
HSET "bull:webApi:4822" "delay" "0"
HSET "bull:webApi:4822" "priority" "0"
HSET "bull:webApi:4822" "name" "__default__"
HSET "bull:webApi:4822" "opts" "{"lifo":true,"attempts":1,"delay":0,"timestamp":1639085540683}"
HSET "bull:webApi:4822" "finishedOn" "1639085623934"
You can see in this case it took 83 seconds to process. (1639085540 - 1639085623)
I'd like to be able to provide summary metrics like:
Most recent API call was added to queue X seconds ago
Most recent successful API call completed X seconds ago and took XX seconds to
complete.
I'd also like to be able to provide a list of the 50 most recent API calls, formatted in a nice way and tagged with "success", "pending", or "failed".
I'm fairly new to Redis and Bull, and I'm trying to figure out how to query this data (using Redis in Node.js) and return this data as JSON to the application.
I can pull a list of keys like this:
// #route GET /status
async function status(req, res) {
const client = createClient({
url: `redis://${REDIS_SERVER}:6379`
});
try {
client.on('error', (err) => console.log('Redis Client Error', err));
await client.connect();
const value = await client.keys('*');
res.json(value)
} catch (error) {
console.log('ERROR getting status: ', error.message, new Date())
res.status(500).json({ message: error.message })
} finally {
client.quit()
}
}
Which will return ["bull:webApi:3","bull:webApi:1","bull:webApi:2"...]
But how can I pull the values associated to the respective keys?
And how can I find the key with the highest number, and then pull the details for the "last 50". In SQL, it would be like doing a ORDER BY key_number DESC LIMIT 50 - but I'm not sure how to do it in Redis.
I'm a bit late here, but if you're not set on manually digging around in Redis, I think Bull's API, in particular Queue#getJobs(), has everything you need here, and should be much easier to work with. Generally, you shouldn't have to reach into Redis to do any common tasks like this, that's what Bull is for!
If I understand you goal correctly, you should be able to do something like:
const Queue = require('bull')
async function status (req, res) {
const { listNum = 10 } = req.params
const api_queue = new Queue('webApi', `redis://${REDIS_SERVER}:6379`)
const current_timestamp_sec = new Date().getTime() / 1000 // convert to seconds
const recent_jobs = await api_queue.getJobs(null, 0, listNum)
const results = recent_jobs.map(job => {
const processed_on_sec = job.processedOn / 1000
const finished_on_sec = job.finishedOn / 1000
return {
request_data: job.data,
return_data: job.returnvalue,
processedOn: processed_on_sec,
finishedOn: finished_on_sec,
duration: finished_on_sec - processed_on_sec,
elapsedSinceStart: current_timestamp_sec - processed_on_sec,
elapsedSinceFinished: current_timestamp_sec - finished_on_sec
}
})
res.json(results)
}
That will get you the most recent numList* jobs in your queue. I haven't tested this full code, and I'll leave the error handling and adding of your custom fields to the job data up to you, but the core of it is solid and I think that should meet your needs without ever having to think about how Bull stores things in Redis.
I also included a suggestion on how to deal with timestamps a bit more nicely, you don't need to do string processing to convert milliseconds to seconds. If you need them to be integers you can wrap them in Math.floor().
* at least that many, anyway - see the second note below
A couple notes:
The first argument of getJobs() is a list of statuses, so if you want to look at just completed jobs, you can pass ['completed'], or completed and active, do ['completed', 'active'], etc. If no list is provided (null) it defaults to all statuses.
As mentioned in the reference I linked, the limit here is per state - so you'll likely get more than listNum jobs back. It doesn't seem like that should be a problem for your use case, but if it is, you can sort the list returned (probably by job id) and just return the first listNum - you're guaranteed to get at least that many (assuming there are that many jobs in your queue), and won't get more than 6*listNum (since there are 6 states).
Folks new to Bull can get nervous about instantiating a Queue object to do stuff like this - but don't be! By itself a Queue instance doesn't do anything, it's just an interface to the given queue. It won't start processing jobs until you call process() to add a processor. This is, incidentally, also how you'd add jobs from a separate process than you run your queues in, but of course nothing will be added unless you call add().
So I've figured out how to pull the data I need. I'm not saying it's a good method, and I'm open to suggestions; but it seems to work to provide a filtered JSON return with the needed data, without changing how the queue functions.
Here's what it looks like:
// #route GET /status/:listNum
async function status(req, res) {
const { listNum = 10} = req.params
const client = createClient({
url: `redis://${REDIS_SERVER}:6379`
});
try {
client.on('error', (err) => console.log('Redis Client Error', err));
await client.connect();
// Find size of queue database
const total_keys = await client.sendCommand(['DBSIZE']);
const upper_value = total_keys;
const lower_value = total_keys - listNum;
// Generate array
const range = (start, stop) => Array.from({ length: (start - stop) + 1}, (_, i) => start - (i));
var queue_ids = range(upper_value, lower_value)
queue_ids = queue_ids.filter(function(x){ return x > 0 }); // Filer out anything less than zero
// Current timestamp in seconds
const current_timestamp = parseInt(String(new Date().getTime()).slice(0, -3)); // remove microseconds ("now")
var response = []; // Initialize array
for(id of queue_ids){ // Loop through queries
// Query value
var value = await client.HGETALL('bull:webApi:'+id);
if(Object.keys(value).length !== 0){ // if returned a value
// Grab most of the request (exclude the token & socketId to save space, not used)
var request_data = JSON.parse(value.data)
request_data.token = '';
request_data.socketId = '';
// Grab & calculate desired times
const processedOn = value.processedOn.slice(0, -3); // remove microseconds ("start")
const finishedOn = value.finishedOn.slice(0, -3); // remove microseconds ("done")
const duration = finishedOn - processedOn; // (seconds)
const elapsedSinceStart = current_timestamp - processedOn;
const elapsedSinceFinished = current_timestamp - finishedOn;
// Grab the returnValue
const return_data = value.returnValue;
// ignoring queue keys of: opts, priority, delay, name, timestamp
const object_data = {request_data: request_data, processedOn: processedOn, finishedOn: finishedOn, return_data: return_data, duration: duration, elapsedSinceStart: elapsedSinceStart, elapsedSinceFinished: elapsedSinceFinished }
response.push(object_data);
}
}
res.json(response);
} catch (error) {
console.log('ERROR getting status: ', error.message, new Date());
res.status(500).json({ message: error.message });
} finally {
client.quit();
}
}
It's looping the Redis query, so I wouldn't want to use this for hundreds of keys, but for 10 or even 50 I'm thinking it should work.
For now I've resorted to getting the total number of keys and working backwards:
await client.sendCommand(['DBSIZE']);
In my case it will return a total number slightly higher than the highest key id (~ a handful of status keys), but at least gets close, and then I just filter out any non-responses.
I've looked at ZRANGE a bit, but I can't figure out how to get it to give me the last id. When I have a Redis database (Bull Queue) like this:
If there's a simple Redis command I can run that will return "3", I'd probably use that instead. (since bull:webApi:3 has the highest number)
(In actual use case, this might be 9555 or some high number; I just want to get the highest numbered key that exists.)
For now I'll try using the method I've come up with above.

Firebase function Node.js transform stream

I'm creating a Firebase HTTP Function that makes a BigQuery query and returns a modified version of the query results. The query potentially returns millions of rows, so I cannot store the entire query result in memory before responding to the HTTP client. I am trying to use Node.js streams, and since I need to modify the results before sending them to the client, I am trying to use a transform stream. However, when I try to pipe the query stream through my transform stream, the Firebase Function crashes with the following error message: finished with status: 'response error'.
My minimal reproducible example is as follows. I am using a buffer, because I don't want to process a single row (chunk) at a time, since I need to make asynchronous network calls to transform the data.
return new Promise((resolve, reject) => {
const buffer = new Array(5000)
let bufferIndex = 0
const [job] = await bigQuery.createQueryJob(options)
const bqStream = job.getQueryResultsStream()
const transformer = new Transform({
writableObjectMode: true,
readableObjectMode: false,
transform(chunk, enc, callback) {
buffer[bufferIndex] = chunk
if (bufferIndex < buffer.length - 1) {
bufferIndex++
}
else {
this.push(JSON.stringify(buffer).slice(1, -1)) // Transformation should happen here.
bufferIndex = 0
}
callback()
},
flush(callback) {
if (bufferIndex > 0) {
this.push(JSON.stringify(buffer.slice(0, bufferIndex)).slice(1, -1))
}
this.push("]")
callback()
},
})
bqStream
.pipe(transform)
.pipe(response)
bqStream.on("end", () => {
resolve()
})
}
I cannot store the entire query result in memory before responding to the HTTP client
Unfortunately, when using Cloud Functions, this is precisely what must happen.
There is a documented limit of 10MB for the response payload, and that is effectively stored in memory as your code continues to write to the response. Streaming of requests and responses is not supported.
One alternative is to write your response to an object in Cloud Storage, then send a link or reference to that file to the client so it can read the response fully from that object.
If you need to send a large streamed response, Cloud Functions is not a good choice. Neither is Cloud Run, which is similarly limited. You will need to look into other solutions that allow direct socket access, such as Compute Engine.
I tried to implement the workaround as suggested by Doug Stevenson and got the following error:
#firebase/firestore: Firestore (9.8.2):
Connection GRPC stream error.
Code: 3
Message: 3
INVALID_ARGUMENT: Request payload size exceeds the limit: 11534336 bytes.
I created a workaround to store data in Firestore first. It works fine when the content size is below 10MB.
import * as firestore from "firebase/firestore";
import { initializeApp } from "firebase/app";
import { firebaseConfig } from '../conf/firebase'
// Initialize Firebase
const app = initializeApp(firebaseConfig);
const fs = firestore.getFirestore(app);
export async function storeStudents(data, context) {
const students = await api.getTermStudents()
const batch = firestore.writeBatch(fs);
students.forEach((student) => {
const ref = firestore.doc(fs, 'students', student.studentId)
batch.set(ref, student)
})
await batch.commit()
return 'stored'
}
exports.getTermStudents = functions.https.onCall(storeStudents);
UPDATE:
To bypass Firestore's limit when using the batch function, I just looped through the array and set (add/update) documents. Set() creates or overwrites a single document.
export async function storeStudents(data, context) {
const students = await api.getTermStudents({images: true})
students.forEach((student: Student) => {
const ref = firestore.doc(fs, 'students', student.student_id)
firestore.setDoc(ref, student)
})
return 'stored'
}

Multiple document reads in node js Firestore transaction

I want to perform a transaction that requires updating two documents using the previous values of those documents.
For the sake of the question, I'm trying to transfer 100 tokens from one app user to another. This operation must be atomic to keep the data integrity of my DB, so on the server side I though to use admin.firestore().runTransaction.
As I understand runTransaction needs to perform all reads before performing writes, so how do I read both user's balance before updating the data?
This is what I have so far:
db = admin.firestore();
const user1Ref = db.collection('users').doc(user1Id);
const user2Ref = db.collection('users').doc(user2Id);
transaction = db.runTransaction(t => {
return t.get(user1Ref).then(user1Snap => {
const user1Balance = user1Snap.data().balance;
// Somehow get the second user's balance (user2Balance)
t.update(user1Ref , {balance: user1Balance - 100});
t.update(user2Ref , {balance: user2Balance + 100});
return Promise.resolve('Transferred 100 tokens from ' + user1Id + ' to ' + user2Id);
});
}).then(result => {
console.log('Transaction success', result);
});
You can use getAll. See documentation at https://cloud.google.com/nodejs/docs/reference/firestore/0.15.x/Transaction?authuser=0#getAll
You can use Promise.all() to generate a single promise that resolves when all promises in the array passed to it have resolved. Use that promise to continue work after all your document reads are complete - it will contain all the results. The general form of your code should be like this:
const p1 = t.get(user1Ref)
const p2 = t.get(user2Ref)
const pAll = Promise.all([p1, p2])
pAll.then(results => {
snap1 = results[0]
snap2 = results[1]
// work with snap1 and snap2 here, make updates to refs...
})

Firebase Cloud Functions - move data in Firestore

Have a very basic understanding of the Typescript language, but would like to know, how can I copy multiple documents from one firestore database collection to another collection?
I know how to send the request from the app's code along with the relevant data (a string and firebase auth user ID), but unsure about the Typescript code to handle the request...
Thats a very broad question, but something like this can move data in moderate sizes from one collection to another:
import * as _ from 'lodash';
import {firestore} from 'firebase-admin';
export async function moveFromCollection(collectionPath1: string, collectionPath2: string): void {
try {
const collectionSnapshot1Ref = firestore.collection(collectionPath1);
const collectionSnapshot2Ref = firestore.collection(collectionPath2);
const collectionSnapshot1Snapshot = await collectionSnapshot1Ref.get();
// Here we get all the snapshots from collection 1. This is ok, if you only need
// to move moderate amounts of data (since all data will be stored in memory)
// Now lets use lodash chunk, to insert data in batches of 500
const chunkedArray = _.chunk(collectionSnapshot1Snapshot.docs, 500);
// chunkedArray is now an array of arrays, with max 500 in each
for (const chunk of chunkedArray) {
const batch = firestore.batch();
// Use the batch to insert many firestore docs
chunk.forEach(doc => {
// Now you might need some business logic to handle the new address,
// but maybe something like this is enough
const newDocRef = collectionSnapshot2Ref.doc(doc.id);
batch.set(newDocRef, doc.data(), {merge: false});
});
await batch.commit();
// Commit the batch
}
console.log('Done!');
} catch (error) {
console.log(`something went wrong: ${error.message}`);
}
}
But maybe you can tell more about the use case?

Resources