I've searched a lot, and this may well be a duplicate question.
I'm trying to bulk insert into a table.
My approach was like this:
knex('test_table').where({
  user: 'user#example.com',
})
.then(result => {
  knex.transaction(trx => {
    Bluebird.map(result, data => {
      return trx('main_table')
        .insert(data.insert_row)
    }, { concurrency: 3 })
    .then(trx.commit);
  })
  .then(() => {
    console.log("done bulk insert")
  })
  .catch(err => console.error('bulk insert error: ', err))
})
This would work if the columns were text or numeric, but I have jsonb columns.
I got this error:
invalid input syntax for type json
How can I solve this problem?
Sounds like some of the json columns don't have their data stringified when they are sent to the DB.
Also, that is pretty much the slowest way to insert multiple rows, because you are doing one query per inserted row and using a single connection for all the inserts.
That concurrency of 3 only causes the pg driver to buffer those queries before they are sent to the DB through the same transaction that all the other queries use.
Something like this should be pretty efficient (I didn't test running the code, so there might be errors):
const rows = await knex('test_table').where({ user: 'user#example.com' });

rows.forEach(row => {
  // make sure that json columns are actually json strings
  row.someColumnWithJson = JSON.stringify(row.someColumnWithJson);
});

await knex.transaction(async trx => {
  const chunk = 200;
  // insert rows in 200 row batches
  for (let i = 0; i < rows.length; i += chunk) {
    const rowsToInsert = rows.slice(i, i + chunk);
    await trx('main_table').insert(rowsToInsert);
  }
});
Also knex.batchInsert might work for you.
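For reference, here is a minimal sketch of the batchInsert variant, reusing the same tables and the hypothetical someColumnWithJson column from above (untested):

```js
// fetch the source rows and make sure jsonb columns are JSON strings
const rows = await knex('test_table').where({ user: 'user#example.com' });
rows.forEach(row => {
  row.someColumnWithJson = JSON.stringify(row.someColumnWithJson);
});

// batchInsert chunks the rows (here 200 per insert statement) and runs
// everything inside the given transaction
await knex.transaction(async trx => {
  await knex.batchInsert('main_table', rows, 200).transacting(trx);
});
```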
I'm a bit puzzled by the situation I have now.
I have a simple SQL statement that I execute from Node.js on a SQLite database. The SQL statement returns values with a lot of decimals, even though my data only contains two decimals.
When I run the exact same query in DB Browser for SQLite, I have a correct result.
My Node.js code:
app.get('/payerComparison/', (req, res) => {
  // Returns labels and values within response
  var response = {};
  let db = new sqlite3.Database('./Spending.db', sqlite3.OPEN_READONLY, (err) => {
    if (err) { console.log(err.message); return }
  });
  response['labels'] = [];
  response['data'] = [];
  db.each("SELECT payer, sum(amount) AS sum FROM tickets GROUP BY payer", (err, row) => {
    if (err) { console.log(err.message); return }
    response['labels'].push(row.payer);
    response['data'].push(row.sum);
  });
  db.close((err) => {
    if (err) { console.log(err.message); return }
    // Send data
    console.log(response);
    res.send(JSON.stringify(response));
  });
})
What I get on the command line:
{
  labels: [ 'Aurélien', 'Commun', 'GFIS', 'Pauline' ],
  data: [ 124128.26, 136426.43000000008, 5512.180000000001, 39666.93 ]
}
The result in DB Browser
I hope you can help me clarify this mystery!
Thank you
Round the values to 2 decimals :). The extra decimals appear because SQLite stores the amounts as 8-byte IEEE floating-point numbers, so the sums accumulate tiny binary rounding errors even though each amount only has two decimals.
SELECT payer, round(sum(amount),2) AS sum FROM tickets GROUP BY payer
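If you prefer to keep the SQL unchanged, you could also round on the Node.js side instead; a minimal sketch of the db.each callback (untested):

```js
db.each("SELECT payer, sum(amount) AS sum FROM tickets GROUP BY payer", (err, row) => {
  if (err) { console.log(err.message); return }
  response['labels'].push(row.payer);
  // round to 2 decimals in JavaScript instead of in SQL
  response['data'].push(Math.round(row.sum * 100) / 100);
});
```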
I am currently testing out Bigtable to see if it is something we will use.
We currently use Cloud SQL with Postgres 9.6, with the following schema:
id, sensor_id, time, value
Most of our queries request data within a range, something like this:
SELECT
  *
FROM
  readings
WHERE
  sensor_id IN (7297, 7298, 7299, 7300)
  AND time BETWEEN '2018-07-15 00:00:00' AND '2019-07-15 00:00:00'
ORDER BY
  time, sensor_id
Each sensor can have readings every 10mins or so, so that's a fair bit of data.
At last check, we have 2 billion records, which is increasing a lot each day.
For Bigtable I am importing with a row key of
readings#timestamp#sensorId, so something like this: readings#20180715000000#7297
So far, so good.
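For reference, writes with that key layout would look roughly like this (a simplified sketch, not the actual import code; the instance, table and column family names are made up):

```js
const { Bigtable } = require('@google-cloud/bigtable');

const bigtable = new Bigtable();
const table = bigtable.instance('my-instance').table('readings');

// one row per reading, keyed as readings#timestamp#sensorId
await table.insert([
  {
    key: 'readings#20180715000000#7297',
    data: {
      readings: {          // column family (assumed name)
        value: '21.5',     // the sensor value
      },
    },
  },
]);
```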
To query a range (using Node) I am doing this:
const fromDate = '20180715000000'
const toDate = '20190715000000'

const ranges = sensorIds.map(sensorId => {
  return {
    start: `readings#${fromDate}#${sensorId}`,
    end: `readings#${toDate}#${sensorId}`,
  }
});

const results = [];

await table.createReadStream({
    column: {
      cellLimit: 1,
    },
    ranges
  })
  .on('error', err => {
    console.log(err);
  })
  .on('data', row => {
    results.push({
      id: row.id,
      data: row.data
    })
  })
  .on('end', async () => {
    console.log(` ${results.length} Rows`)
  })
My understanding was that the results would be similar to the SQL query above, but it seems to return rows for all sensor IDs across the date range, not just the ones specified in the query.
My questions:
Is this the correct row key that we should be using for this type of querying?
If it is correct, can we filter per range? Or is there a filter that we have to use to only return the values for the given date range and sensor ID range?
Thanks in advance for your advice.
The problem is that you are setting up your ranges variable in the wrong way, and Bigtable is getting lost because of that. Try doing the following:
const fromDate = '20180715000000'
const toDate = '20190715000000'
const sensorId = sensorIds[0]

const filter = {
  column: {
    cellLimit: 1,
  },
  value: {
    start: `readings#${fromDate}#${sensorId}`,
    end: `readings#${toDate}#${sensorId}`,
  }
};

const results = [];

await table.createReadStream({
    filter
  })
  .on('error', err => {
    console.log(err);
  })
  .on('data', row => {
    results.push({
      id: row.id,
      data: row.data
    })
  })
  .on('end', async () => {
    console.log(` ${results.length} Rows`)
  })
NOTE: I am taking the first element of sensorIds, which I assume is a list of all the IDs, but you can select any of them. Also, this is all untested, but it should be a good starting point for you.
You can find snippets on the usage of the Node.js client for Bigtable in this GitHub repo.
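If you need rows for several sensors at once, one possible extension of the snippet above (equally untested, same assumptions) is to run the filtered read once per sensor ID and merge the results:

```js
const results = [];

// one filtered read per sensor, collected into a single results array
for (const sensorId of sensorIds) {
  const filter = {
    column: {
      cellLimit: 1,
    },
    value: {
      start: `readings#${fromDate}#${sensorId}`,
      end: `readings#${toDate}#${sensorId}`,
    },
  };

  await new Promise((resolve, reject) => {
    table.createReadStream({ filter })
      .on('error', reject)
      .on('data', row => results.push({ id: row.id, data: row.data }))
      .on('end', resolve);
  });
}

console.log(`${results.length} Rows`);
```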
I'm using Node.js, Mongoose and MongoDB on an M0 Sandbox (General) Atlas tier. I want to use streaming to make better use of resources. Here's my code:
let counter = 0;
let batchCounter = 0;
const batchSize = 500;
const tagFollowToBeSaved = [];
let startTime;

USER.find({}, { _id: 1 }).cursor()
  .on('data', (id) => {
    if (counter === 0) {
      startTime = process.hrtime();
    }
    if (counter === batchSize) {
      counter = 0;
      const [s2] = process.hrtime();
      console.log('finding ', batchSize, ' data took: ', s2 - startTime[0]);
      TAG_FOLLOW.insertMany(tagFollowToBeSaved, { ordered: false })
        .then((ok) => {
          batchCounter += 1;
          const [s3] = process.hrtime();
          console.log(`batch ${batchCounter} took: ${s3 - s2}`);
        })
        .catch((error) => {
          console.log('error in inserting following: ', error);
        });
      // deleting array
      tagFollowToBeSaved.splice(0, tagFollowToBeSaved.length);
    }
    else {
      counter += 1;
      tagFollowToBeSaved.push({
        tagId: ObjectId('5e81ba5a5c86d7215cdc9c88'),
        followedBy: id,
      });
    }
  })
  .on('error', (error) => {
    console.log('error while streaming: ', error);
  })
  .on('end', () => {
    console.log('END');
  });
What I'm doing:
First, I read the _ids of 200k users from the USER collection using the streams API with a cursor.
Then I check whether the number of read _ids has reached the batch size limit; if so, I insert these documents into the TAG_FOLLOW model.
All the console.logs are for analysis purposes and you can ignore them.
However, insertMany does not work properly. I'm monitoring the database while inserting, and it seems to insert about 100 documents instead of 500 per batch without throwing any error, so I'm losing roughly 400 documents per batch, which is unacceptable!
I changed the batchSize value, but it does not help; I still lose some data in each batch.
So what is the problem? What is a better approach for saving a large amount of data in an optimized and reliable way?
Thanks in advance.
I am trying to use PostgreSQL's COPY FROM API to stream potentially thousands of records into a database as they are dynamically generated in Node.js code. To do so, I wrote this generic wrapper function:
function streamRows(client, { table, columns, data }) {
  return new Promise((resolve, reject) => {
    const sqlStream = client.query(
      copyFrom(`COPY ${ table } (${ columns.join(', ') }) FROM STDIN`));
    const rowStream = new Readable();
    rowStream.pipe(sqlStream)
      .on('finish', resolve)
      .on('error', reject);
    for (const row of data) {
      rowStream.push(`${ row.join('\t') }\n`);
    }
    rowStream.push('\\.\n');
    rowStream.push(null);
  });
}
The database table I'm writing into looks like this:
CREATE TABLE devices (
  id SERIAL PRIMARY KEY,
  group_id INTEGER REFERENCES groups(id),
  serial_number CHAR(12) NOT NULL,
  status INTEGER NOT NULL
);
And I am calling it as follows:
function *genRows(id, devices) {
  let count = 0;
  for (const serial of devices) {
    yield [ id, serial, UNSTARTED ];
    count++;
    if (count % 10 === 0) log.info(`Streamed ${ count } rows...`);
  }
  log.info(`Streamed ${ count } rows.`);
}

await streamRows(client, {
  table: 'devices',
  columns: [ 'group_id', 'serial_number', 'status' ],
  data: genRows(id, devices),
});
The log statements in the generator function that produces the per-row data all run as expected, and the output indicates that the generator always runs to completion, streaming all the data rows I want. No errors are ever thrown. But when I wait for it to complete, the table sometimes ends up with 0 rows added to it; in other words, it looks like I sent all that data to Postgres, but none of it was actually inserted. What am I doing wrong?
I do not know exactly what parts of this made the difference and what is purely stylistic, but after playing around with a bunch of different examples from across the web, I managed to cobble together this function which works:
function streamRows(client, { table, columns, data }) {
  return new Promise((resolve, reject) => {
    const iterator = data[Symbol.iterator]();
    const rs = new Readable();
    const ws = client.query(copyFrom(`COPY ${ table } (${ columns.join(', ') }) FROM STDIN`));
    rs._read = function() {
      const { value, done } = iterator.next();
      rs.push(done ? null : `${ value.join('\t') }\n`);
    };
    rs.on('error', reject);
    ws.on('error', reject);
    ws.on('end', resolve);
    rs.pipe(ws);
  });
}
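A possibly simpler variant of the same idea (untested; it assumes Node.js 12+ for Readable.from and the same copyFrom import from pg-copy-streams) builds the source stream from the iterable instead of overriding _read:

```js
const { Readable } = require('stream');

function streamRows(client, { table, columns, data }) {
  return new Promise((resolve, reject) => {
    const ws = client.query(copyFrom(`COPY ${ table } (${ columns.join(', ') }) FROM STDIN`));
    // Readable.from() pulls rows from the generator on demand,
    // so back-pressure is handled automatically
    const rs = Readable.from(
      (function* () {
        for (const row of data) {
          yield `${ row.join('\t') }\n`;
        }
      })()
    );
    rs.on('error', reject);
    ws.on('error', reject);
    // resolving on 'end' matches the version above; depending on your
    // pg-copy-streams version you may need 'finish' instead
    ws.on('end', resolve);
    rs.pipe(ws);
  });
}
```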
I have tried several solutions to get this working, but all failed. I am reading MongoDB documents using cursor.eachAsync() and converting some doc fields. I need to move these docs to another collection after the conversion. My idea is that after 1000 docs are processed, they should be bulk-inserted into the destination collection. This works fine except for the last batch of records, which has fewer than 1000 docs. To phrase the same problem differently: if the number of remaining records is < 1000, they are not inserted.
1. First version - bulk insert after async()
After eachAsync() finishes, the bulk object should still hold the remaining docs (< 1000) and I should be able to insert them, but I find that bulk.length is 0. (I have removed those statements from the code snippet below.)
```js
async function run() {
  await mongoose.connect(dbPath, dbOptions);
  const cursor = events.streamEvents(query, 10);
  let successCounter = 0;
  let bulkBatchSize = 1000;
  let bulkSizeCounter = 0;
  let sourceDocsCount = 80;
  var bulk = eventsConvertedModel.collection.initializeOrderedBulkOp();
  await cursor.eachAsync((doc) => {
    let pPan = new Promise((resolve, reject) => {
      getTokenSwap(doc.panTokenIdentifier, doc._id)
        .then((swap) => {
          resolve(swap);
        });
    });
    let pXml = new Promise((resolve, reject) => {
      let xmlObject;
      getXmlObject(doc)
        .then(getXmlObjectToken)
        .then((newXmlString) => {
          resolve(newXmlString);
        })
        .catch(errFromPromise1 => {
        })
        .catch(error => {
          reject(error);
        });
    });
    Promise.all([pPan, pXml])
      .then(([panSwap, xml]) => {
        doc.panTokenIdentifier = panSwap;
        doc.eventRecordTokenText = xml;
        return doc;
      })
      .then((newDoc) => {
        successCounter++;
        bulkSizeCounter++;
        bulk.insert(newDoc);
        if (bulkSizeCounter % bulkBatchSize == 0) {
          bulk.execute()
            .then(result => {
              bulkSizeCounter = 0;
              let msg = "Conversion- bulk insert =" + result.nInserted;
              console.log(msg);
              bulk = eventsConvertedModel.collection.initializeOrderedBulkOp();
              Promise.resolve();
            })
            .catch(bulkErr => {
              logger.error(bulkErr);
            });
        }
        else {
          Promise.resolve();
        }
      })
      .catch(err => {
        console.log(err);
      });
  });
  console.log("outside-async=" + bulk.length); // always 0
  console.log("run()- Docs converted in this run =" + successCounter);
  process.exit(0);
}
```
2. Second version (track expected number of iterations and after all iterations, change batch size to say 10).
Result - The batch size value changes but it's not reflected in bulk.insert. The records are lost.
3. Same as 2nd but insert one record at a time after bulk inserts are done.
```js
let d = eventsConvertedModel(newDoc);
d.isNew = true;
d._id = mongoose.Types.ObjectId();
d.save().then(saved => {
  console.log(saved._id)
  Promise.resolve();
}).catch(saveFailed => {
  console.log(saveFailed);
  Promise.resolve();
});
```
Result - I was getting a DocumentNotFound error, so I added d.isNew = true. But for some reason only a few records get inserted and many of them get lost.
I have also tried other variations using the number of expected bulk-insert iterations. Finally, I changed the code to write to a file (one doc at a time), but I am still wondering if there is any way to make the writes to the DB work.
Dependencies:
Node v8.0.0
Mongoose 5.2.2