Node.js: How to process objects of a big JSON file one by one to avoid heap limit errors

I'm trying to process a few hundred json.gz files using worker threads.
At some point I get a JS heap limit error because of 3 big files (about 3 GB each unzipped).
I tried to find a way to process the objects of each file one by one, but all I've managed to get is all of the file's objects at once.
Here is the worker code at the moment:
for (let gzFile of zippedFiles) {
const gunzip = zlib.createGunzip()
const parser = JSONStream.parse('offers.*')
const readStream = fs.createReadStream(gzFile)
readStream.pipe(gunzip).pipe(parser)
.pipe(es.map((offers, callback) => { //offers contains all of the current file objects array
offers.forEach(rawProduct => {
let processedProduct = getProcessedProduct(rawProduct)
parentPort.postMessage({ processedProduct })
})
callback()
})
.on('error', (e) => {
console.trace(`Error while reading file`, e)
})
.on('end', () => {
idxCount++
if (idxCount === lastIdx) {
parentPort.postMessage({ completed: true })
}
})
)
}
JSON structure:
{
"offers":
{
"offer":
[
{}, // => the objects I want to get one by one
{},
{}
]
}
}
How can I avoid the JS heap limit error?
Thanks!

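Nidhim David's suggestion is exactly what I was looking for.
Here is the working code:
// chain, parser, pick and streamArray are assumed to come from the stream-chain and stream-json packages
const { chain } = require('stream-chain');
const { parser } = require('stream-json');
const { pick } = require('stream-json/filters/Pick');
const { streamArray } = require('stream-json/streamers/StreamArray');
for (let gzFile of zippedFiles) {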
const pipeline = chain([
fs.createReadStream(gzFile),
zlib.createGunzip(),
parser(),
pick({ filter: 'offers.offer' }), //getting the array of objects
streamArray(),
]);
pipeline.on('data', ({key, value}) => {
//getting objects one by one and processing them
const rawProduct = value;
const processedProduct = getProcessedProduct(rawProduct);
parentPort.postMessage({ processedProduct });
})
pipeline.on('end', () => {
idxCount++;
if (idxCount === lastIdx) {
debug(`last zipped file, sending complete message`);
parentPort.postMessage({ completed: true });
}
});
}
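The loop above starts a pipeline for every file without waiting for the previous one to finish, so several large files can be decompressing at the same time. A minimal sketch of a sequential variant follows. It assumes each pipeline is a standard Node.js readable stream (so `for await` can consume it); `processSequentially`, `makeStream` and `onItem` are hypothetical names, and `Readable.from` stands in for the real gunzip/parse chain so the sketch runs on its own.

```javascript
const { Readable } = require('node:stream');

// Consume one stream at a time; the next source is opened only after the
// previous stream has ended. An error from a stream rejects the returned promise.
async function processSequentially(sources, makeStream, onItem) {
  let processed = 0;
  for (const source of sources) {
    // for await pulls items one by one and waits for onItem before reading more
    for await (const item of makeStream(source)) {
      await onItem(item, source);
      processed++;
    }
  }
  return processed;
}

// Stand-in data: in the worker, makeStream would build the chain([...]) pipeline for a file.
const fakeFiles = {
  'a.json.gz': [{ key: 0, value: { id: 1 } }, { key: 1, value: { id: 2 } }],
  'b.json.gz': [{ key: 0, value: { id: 3 } }],
};
const makeFakeStream = (name) => Readable.from(fakeFiles[name]);

const seen = [];
const done = processSequentially(
  Object.keys(fakeFiles),
  makeFakeStream,
  ({ value }) => { seen.push(value.id); }
);
```

With this shape the completed message could be posted once, after the returned promise resolves, instead of counting end events across pipelines.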

Related

Retrieving Data from Firestore with angular/fire/rxjs

I'm trying to get collection data from a firestore instance and don't want to use valueChanges{idField: id}. So far this is the only solution that somehow processes some of the data and gets the output close to what I need.
I'm new to angular & angular/fire as well as to rxjs and am really struggling to understand observables, pipe, map and rxjs in general.
What am I missing here?
async fetchJobs() {
let jc = await collection(this.firestore, 'jobs');
let cSN = await collectionSnapshots(jc);
let jobsArr = cSN.pipe(
map((data) =>
data.forEach((d) => {
let jobsData = d['_document']['data']['value']['mapValue'][
'fields'
] as Job;
const newData = {
id: d.id,
title: jobsData.title,
subtitle: jobsData.subtitle,
description: jobsData.description,
publish: jobsData.publish,
img: jobsData.img,
} as Job;
return newData;
})
)
);
}
This should work.
fetchJobs(): Observable<Job[]> {
const jc = collection(this.firestore, 'jobs')
return collectionSnapshots(jc)
.pipe(
map((snapshots) =>
snapshots.map((snapshot) => {
return { ...snapshot.data(), id: snapshot.id } as Job
})
)
)
}
which is equivalent to:
fetchJobs(): Observable<Job[]> {
const jc = collection(this.firestore, 'jobs')
return collectionData(jc, { idField: 'id' })
.pipe(
map((data) => data as Job[])
)
}
Since you only need to fetch the Job's data, collectionData() is way more appropriate.
collectionSnapshots() may be interesting when you need to perform additional operations on each Job, such as updating/deleting each one of them, which is possible with snapshot.ref
Example:
fetchJobs() {
const jc = collection(this.firestore, 'jobs')
return collectionSnapshots(jc)
}
deleteAllJobs() {
this.fetchJobs()
.pipe(take(1))
.subscribe(snapshots =>
snapshots.forEach((snapshot) => {
deleteDoc(snapshot.ref)
})
)
}
This is a mere example and the logic may not apply to your use case.
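The attempt in the question produced nothing usable because `forEach` always returns `undefined`, so the outer `map` emitted `undefined` instead of an array. A small sketch of the difference, using plain objects as stand-ins for Firestore snapshots (the `fakeSnapshots` data and the `toJob` helper are illustrative, not part of angular/fire):

```javascript
// Stand-ins for document snapshots: each has an id and a data() method.
const fakeSnapshots = [
  { id: 'j1', data: () => ({ title: 'Dev', publish: true }) },
  { id: 'j2', data: () => ({ title: 'Ops', publish: false }) },
];

// Same mapping as in the answer: spread the fields and add the document id.
const toJob = (snapshot) => ({ ...snapshot.data(), id: snapshot.id });

// forEach discards whatever the callback returns.
const viaForEach = fakeSnapshots.forEach((s) => toJob(s));

// map collects the returned values into a new array.
const viaMap = fakeSnapshots.map(toJob);

console.log(viaForEach); // undefined
console.log(viaMap.length); // 2
```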

Best way to access data in react

PROBLEM:
I have a MERN application that has a model with a couple of other models in it. The problem, which I figured out later, is that it saves the _id of the object and not the actual object in the model when you do this:
const checkoutHistory = new Schema({
book: { type: mongoose.Schema.Types.ObjectId, ref: 'books',required: true },
checkoutCopiesNum: {type: Number, required: true},
profChosen: { type: mongoose.Schema.Types.ObjectId, ref: 'prof', required: true },
dueDate: {type: String, required: true}
})
When retrieved, the book: part of the object will be an id, some string like "DKKLDFJhdkghhe839kdd". This is fine, because I guess I can then make an API call in the React app later to search for this book. Is this the correct way to do it, though?
The other way I thought of was to call the findById functions in the actual endpoint that retrieves the data and set that data there. It didn't work, though. Here is the code for that:
const checkoutHistoryMiddle = async (req, res, next) => {
try {
//get the body of the request
const body = req.body
//check for data
if(!body){
return res.status(400).json({
success: false,
error: 'no body given'
})
}
const history = new CheckoutHist(body)
console.log(history)
// await Book.findById({_id: history.book}, (err, book) => {
// history.book = book
// })
// await Prof.findById({_id: history.profChosen}, (err, prof) => history.profChosen = prof)
console.log(history)
history.save().then(() => next()).catch(error => {
return res.status(400).json({
success: false,
message: error,
msg: "checkout save failed"
})
})
} catch (error) {
res.status(400).json({
success: false,
message: error,
msg: "checkoutHist failed"
})
}
}
I commented out the part I was talking about because well, it didn't work. It still saved the id instead of the object. Which like I said is fine. I gave my other idea a go and decided to do the calls inside the react app.
So I first got the array of objects from the schema provided above like this:
const [bookHist, setBookHist] = useState()
useEffect( () => {
const getHistory = async () => {
api.getCheckoutHist().then(hist => {
setBookHist(hist.data.data.filter((data) => data.book === props.book_id))
})
}
getHistory()
}, [])
This will create an array of objects in bookHist that looks like this
[{_id: "DKJFDKJDKLFJSL", book: "LDKhgajgahgelkji8440skg", checkoutCopiesNum: 3, profChosen: "gjellkdh39gh39kal930alkdfj", dueDate: "11/11/11"}, {...}]
So the next step would be to take each item in the array and use its id to search the database, e.g. api.findProfByID(bookHist[0].profChosen).
Then I would need to update only that item in the bookHist state, without affecting the other items in the array.
My questions are: what is the best way to update one item in the array state?
How do I make so many API calls, and how do I make sure they are awaited so that the state actually changes once the calls complete?
Here are things I have tried so far:
useEffect(() => {
bookHist.map(async bHist => {
await Axios.get("http://localhost:8174/user/professor/" + bHist.profChosen).then(async prof => {
// console.log(prof)
// console.log(prof)
bHist.profChosen = prof.data.data
// setBookHist(prevStat => ({}))
// setBookHist(...bookHist, [bookHist.])
})
setBookHist(bHist)
})
}, [])
This didn't work. I assume the state is not updated because the code does not wait for the map to finish before it sets the state of bookHist.
So then I searched the internet and found the Promise.all method, used like this:
useEffect(() => {
const change = async () => {
if(bookHist){
console.log("prof")
//get the prof data
// const galleries = []
await Promise.all(bookHist.map( (bHist, index) => {
return await Axios.get("http://localhost:8174/user/professor/" + bHist.profChosen);
})).then(someData => {
console.log(someData)
});
}
change()
}, [])
This also does not work, for reasons I don't understand. It only works on a hot reload, not on a full refresh: the logging actually logs something after a hot reload.
Here is the entirety of the functional component:
import React, {useState, useEffect} from 'react'
import api from '../../api/index'
import Axios from 'axios'
export default function CheckoutBookHistroy(props){
const [bookHist, setBookHist] = useState()
const [histData, setHistData] = useState([{
book: {},
prof: {}
}])
useEffect( () => {
const getHistory = async () => {
api.getCheckoutHist().then(hist => {
setBookHist(hist.data.data.filter((data) => data.book === props.book_id))
})
}
getHistory()
}, [])
//i also tried this way but this resulted in an infinite loop
const [profChosen, setProfChosen] = useState()
const handleProfFind = async (id) => {
await Axios.get("http://localhost:8174/user/professor/" + id).then(prof => {
setProfChosen(prof.data.data)
})
}
return (
<div>
{
bookHist ?
bookHist.map(data => {
//need to present the prof data here for each data obj
return (
<div>Checked out {data.checkoutCopiesNum}</div>
)}) : <div>no data</div>
}
</div>
)
}
I really hope I can gain some insight into the correct way to do all of this. I must be either really close or awfully wrong. Thank you in advance!
Just by looking at your code, I don't see too many issues, although it is a bit convoluted.
Some functions have no caller, e.g. handleProfFind. One suggestion: if you want to do something, just do it; there is no need for that many functions. For example:
// assume you only want to do it once after mounting
useEffect( () => {
if (!bookHist) { // bookHist is the state variable from the question
api.getCheckoutHist().then(hist => {
// you can set your data state here
// or you can get the id inside each item, and then call more APIs
// whatever you want to do, please finish it here
})
}
}, [])
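One way to finish the work in a single place is to resolve every professor with one `Promise.all` and then call the state setter once with a new array. The sketch below keeps React out of the picture so it can run on its own; `attachProfessors` and `fetchProfessor` are hypothetical names, and `fetchProfessor` stands in for the Axios call from the question.

```javascript
// Returns a NEW array in which each history entry has its professor id replaced
// by the professor object. The input array and its items are not mutated.
async function attachProfessors(history, fetchProfessor) {
  return Promise.all(
    history.map(async (entry) => {
      const prof = await fetchProfessor(entry.profChosen);
      return { ...entry, profChosen: prof };
    })
  );
}

// Stand-in for Axios.get(".../user/professor/" + id).then(res => res.data.data)
const fakeProfessors = { p1: { name: 'Ada' }, p2: { name: 'Alan' } };
const fetchProfessor = async (id) => fakeProfessors[id];

const history = [
  { _id: 'h1', profChosen: 'p1', checkoutCopiesNum: 3 },
  { _id: 'h2', profChosen: 'p2', checkoutCopiesNum: 1 },
];

const enriched = attachProfessors(history, fetchProfessor);
// Inside the effect this would end with something like: enriched.then(setBookHist)
```

Note that an effect with an empty dependency array runs once on mount, when bookHist is still undefined, which is consistent with the second effect in the question only logging after a hot reload. Chaining the professor lookups off the first request, as sketched here, avoids that ordering problem.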

How to return a list of objects from Cypress Custom Commands in TypeScript

I am using Cypress for my end-to-end integration tests. I have a use case that involves returning a list of objects from a Cypress custom command, and I am having difficulty doing so. Here is my code:
index.ts
declare global {
namespace Cypress {
interface Chainable<Subject> {
getTestDataFromElmoDynamoDB({locale, testType}): Cypress.Chainable<JQuery<expectedData[]>> // ??? not sure what return type should be given here.
}
}
}
Cypress.Commands.add('getTestDataFromDynamoDB', ({locale, testType}) => {
// expectedData is an interface declared. My use case is to return the list of this type.
let presetList: expectedData[]
cy.task('getTestDataFromDynamoDB', {
locale: locale,
testType: testType
}).then((presetData: any) => {
presetList = presetData;
// the whole idea here is to return presetList from cypress task
return cy.wrap(presetList) //??? not sure what should be written here
})
})
sampleSpec.ts
describe('The Sample Test', () => {
it.only('DemoTest', () => {
cy.getTestDataElmoDynamoDB({
locale: env_parameters.env.locale,
testType: "ChangePlan"
}).then((presetlist) => {
// not sure on how to access the list here. Tried wrap and alias but no luck.
presetList.forEach((preset: expectedData) => {
//blah blah blah
})
})
})
})
Did anyone work on similar use case before?
Thanks,
Saahith
Here is my own command for doing exactly that.
Cypress.Commands.add("convertArrayOfAlliasedElementsToArrayOfInteractableElements", (arrayOfAlliases) => {
let arrayOfRecievedAlliasValues = []
for (let arrayElement of arrayOfAlliases) {
cy.get(arrayElement)
.then(aelement =>{
arrayOfRecievedAlliasValues.push(aelement)
})
}
return cy.wrap(arrayOfRecievedAlliasValues)
})
The way I do it is to pass in an array and cy.wrap the array, because that lets you chain the command with an interactable array.
The key point is that it has to be passed as an array or object, because they are reference types, and in Cypress it is hard to work with let/var/const variables that hold value types.
You can also alias the cy.wrapped object if you like.
The way to use it in code is:
cy.convertArrayOfAlliasedElementsToArrayOfInteractableElements(ArayOfElements)
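The reference-versus-value point can be seen without Cypress. In the sketch below a tiny hand-written queue stands in for deferred command callbacks (it is an illustration only, not Cypress's actual scheduler): a number captured at return time stays stale, while the array that is filled later is visible through the same reference.

```javascript
// Minimal deferred-callback queue, used only for this illustration.
const queue = [];
const defer = (fn) => queue.push(fn);
const flush = () => { while (queue.length) queue.shift()(); };

function collect() {
  let count = 0;    // primitive: copied by value when returned
  const items = []; // array: the caller receives a reference to this same object
  defer(() => {
    count += 1;
    items.push('a');
  });
  // The deferred callback has not run yet at this point.
  return { count, items };
}

const result = collect();
flush(); // now the deferred callback runs

console.log(result.count); // 0
console.log(result.items); // [ 'a' ]
```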
What you asked for can be implemented as follows. I do not know what type expectedData is, so let's assume it is string[]; you can replace string[] with your own type.
plugins/index.ts
module.exports = (on: any, config: any) => {
on('task', {
getDataFromDB(arg: {locale: string, testType: string}){
// generate some data for an example
const list: string[] = [];
list.push('a', 'b');
return list;
},
});
};
commands.ts
declare global {
namespace Cypress {
interface Chainable<Subject> {
getTestDataElmoDynamoDB(arg: {locale: string, testType: string}): Cypress.Chainable<string[]>
}
}
}
Cypress.Commands.add('getTestDataElmoDynamoDB', (arg: {locale: string, testType: string}) => {
let presetList: string[] = [];
cy.task('getDataFromDB', arg)
.then((presetData?: string[]) => {
expect(presetData).not.be.undefined.and.not.be.empty;
// if the data is incorrect, the code will break earlier on expect, this line for typescript compiler
if (!presetData || !presetData.length) throw new Error('Present data are undefined or empty');
presetList = presetData;
return cy.wrap(presetList); // or you can return cy.wrap(presetData)
});
});
db.spec.ts
describe('Test database methods', () => {
it('When take some test data, expect that the data was received successfully ', () => {
cy.getTestDataElmoDynamoDB({ locale: 'someEnvVar', testType: 'ChangePlan' })
.then((list) => {
expect(list).not.empty.and.not.be.undefined;
cy.log(list); // [a,b]
// You can interact with list here as with a regular array, via forEach();
});
});
});
You can also access and receive data from cy.task directly in the spec file.
describe('Test database methods', () => {
it('When take some test data, expect that the data was received successfully ', () => {
cy.task('getDataFromDB', { locale: 'someEnvVar', testType: 'ChangePlan' })
.then((list?: string[]) => {
expect(list).not.be.empty.and.not.be.undefined;
cy.log(list); // [a,b] — the same list as in the version above
});
});
});

Unable to save data in gatsby graphql layer while creating source plugin

I am trying to fetch all the videos of a YouTube channel grouped by playlist. So first I fetch all the playlists and then fetch the corresponding videos for each one.
const fetch = require("node-fetch")
const queryString = require("query-string")
module.exports.sourceNodes = async (
{ actions, createNodeId, createContentDigest },
configOptions
) => {
const { createNode } = actions
// Gatsby adds a configOption that's not needed for this plugin, delete it
delete configOptions.plugins
// plugin code goes here...
console.log("Testing my plugin", configOptions)
// Convert the options object into a query string
const apiOptions = queryString.stringify(configOptions)
const apiUrl = `https://www.googleapis.com/youtube/v3/playlists?${apiOptions}`
// Helper function that processes a content to match Gatsby's node structure
const processContent = content => {
const nodeId = createNodeId(`youtube--${content.id}`)
const nodeContent = JSON.stringify(content)
const nodeData = Object.assign({}, content, {
id: nodeId,
parent: null,
children: [],
internal: {
type: `tubeVideo`,
content: nodeContent,
contentDigest: createContentDigest(content)
}
})
return nodeData
}
return fetch(apiUrl)
.then(res => res.json())
.then(data => {
data.items.forEach(item => {
console.log("item", item.id)
//fetch videos of the playlist
let playlistApiOption = queryString.stringify({
part: "snippet,contentDetails",
key: "AIzaSyDPdlc3ctJ7yodRZE_GfbngNBEYbdcyys8",
playlistId: item.id,
fields: "items(id,snippet(title,description,thumbnails),contentDetails)"
})
let playlistApiUrl = `https://www.googleapis.com/youtube/v3/playlistItems?${playlistApiOption}`
fetch(playlistApiUrl)
.then(res => res.json())
.then(data => {
data.items.forEach(video => {
console.log("videos", video)
// Process the video data to match the structure of a Gatsby node
const nodeData = processContent(video)
// console.log(nodeData)
// Use Gatsby's createNode helper to create a node from the node data
createNode(nodeData)
})
})
})
})
}
Here nodes are created for the individual videos, but I can't query these nodes from the GraphQL store, i.e. the data is not being saved in the GraphQL store.
Edit: Wait, I just realized it's inside a loop. Your sourceNodes is not waiting for the fetch calls inside your loop to resolve. In this case, you'd have to use something like Promise.all to wait for every item in the loop. The code is updated to reflect that.
return fetch(apiUrl)
.then(res => res.json())
.then(data => {
return Promise.all(
data.items.map(item => {
/* etc. */
return fetch(playlistApiUrl)
.then(res => res.json())
.then(data => {
data.items.forEach(video => {
/* etc. */
createNode(nodeData)
})
})
})
)
})
Check out async/await syntax; it might make finding this type of issue easier.
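As an illustration of that suggestion, here is a sketch of the same two-level fetch written with async/await. It is not the plugin's actual code: `sourcePlaylistVideos`, `fetchJson`, `playlistsUrl` and `itemsUrlFor` are hypothetical names, and `fetchJson` is injected so the example runs without network access.

```javascript
// Resolves only after every playlist's videos have been passed to createNode,
// which is the promise sourceNodes would need to return or await.
async function sourcePlaylistVideos({ fetchJson, createNode, playlistsUrl, itemsUrlFor }) {
  const playlists = await fetchJson(playlistsUrl);
  await Promise.all(
    playlists.items.map(async (playlist) => {
      const videos = await fetchJson(itemsUrlFor(playlist.id));
      videos.items.forEach((video) => createNode(video));
    })
  );
}

// Stand-in responses keyed by URL (hypothetical data).
const responses = {
  '/playlists': { items: [{ id: 'pl1' }, { id: 'pl2' }] },
  '/items/pl1': { items: [{ id: 'v1' }, { id: 'v2' }] },
  '/items/pl2': { items: [{ id: 'v3' }] },
};
const created = [];
const sourcing = sourcePlaylistVideos({
  fetchJson: async (url) => responses[url],
  createNode: (node) => created.push(node.id),
  playlistsUrl: '/playlists',
  itemsUrlFor: (id) => `/items/${id}`,
});
```

Because every inner request is awaited through Promise.all, a forgotten return shows up as a missing await rather than as nodes created after sourcing has finished.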

When using vm2 in worker_threads, is it possible to share a NodeVM instance between workers?

I am using worker_threads and vm2 to implement a serverless-like thing, but I cannot create a NodeVM instance in the main thread and then pass it through workerData (because of a worker_threads limitation). So I can only call new NodeVM in a worker thread per request, where I cannot reuse a VM instance, and the cost hurts.
new NodeVM() takes 200 ~ 450 ms to finish, so I would like to pre-initialize a reusable instance.
const w = new Worker(`
(async () => {
const { workerData, parentPort } = require('worker_threads');
const { NodeVM } = require('vm2');
const t = Date.now();
const vm = new NodeVM({ // cost 200 ~ 450 ms
console: 'inherit',
require: {
external: [ 'request-promise', 'lodash' ],
builtin: [],
import: [ 'request-promise', 'lodash' ], // faster if added
},
});
console.log('time cost on new NodeVM:', Date.now() - t);
const fnn = vm.run(workerData.code, workerData.filename);
console.log('time cost by initializing vm:', Date.now() - t);
try {
const ret = await fnn(workerData.params);
parentPort.postMessage({
data: typeof ret === 'string' ? ret : JSON.stringify(ret),
});
} catch (e) {
parentPort.postMessage({
err: e.toString(),
});
}
console.log('----worker donex');
})();
`,
{
workerData: {
params,
code,
dirname: __dirname,
filename: `${__dirname}/faasVirtual/${fn}.js`,
},
eval: true,
});
Can anybody give me some advice?
Thanks a lot.
I've decided to prohibit external module imports, because require internally uses readFileSync, which accounts for most of the time, and Node's built-in http module can replace request-promise.
After commenting out the external option, the average init time is roughly 10+ ms, which is acceptable for now.
But if worker_threads could clone function objects through workerData, it would be more efficient.
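A different angle on pre-initialization, sketched under assumptions: instead of passing an instance through workerData, keep one worker alive and send it a message per request, so whatever it set up at startup is reused. In the sketch below `sandbox` is only a stand-in for the NodeVM instance, and `startReusableWorker` is a hypothetical helper. Whether a single NodeVM can safely be shared across requests is not something the thread settles.

```javascript
const { Worker } = require('node:worker_threads');

// Worker body: the setup at the top runs once, then every message reuses it.
// `sandbox` is a stand-in for the expensive object (new NodeVM(...) in the question).
const workerSource = `
  const { parentPort } = require('node:worker_threads');
  const sandbox = { calls: 0 };
  parentPort.on('message', ({ id, params }) => {
    sandbox.calls += 1;
    parentPort.postMessage({ id, data: { params, calls: sandbox.calls } });
  });
`;

function startReusableWorker() {
  const worker = new Worker(workerSource, { eval: true });
  const pending = new Map(); // request id -> resolve function
  let nextId = 0;
  worker.on('message', ({ id, data }) => {
    pending.get(id)(data);
    pending.delete(id);
  });
  return {
    // Sends one request to the long-lived worker and resolves with its reply.
    run(params) {
      return new Promise((resolve) => {
        const id = nextId++;
        pending.set(id, resolve);
        worker.postMessage({ id, params });
      });
    },
    close: () => worker.terminate(),
  };
}
```

A caller would create the worker once at startup, call run(params) for each request, and call close() on shutdown.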
