Azure DocumentDB bulk insert using stored procedure

Hi, I am using 16 collections to insert around 3-4 million JSON objects, each ranging from 5-10 KB. I am using a stored procedure to insert these documents, and I have 22 Capacity Units.
function bulkImport(docs) {
    var collection = getContext().getCollection();
    var collectionLink = collection.getSelfLink();

    // The count of imported docs, also used as the current doc index.
    var count = 0;

    // Validate input.
    if (!docs) throw new Error("The array is undefined or null.");

    var docsLength = docs.length;
    if (docsLength == 0) {
        getContext().getResponse().setBody(0);
    }

    // Call the CRUD API to create a document.
    tryCreateOrUpdate(docs[count], callback);

    // Note that there are 2 exit conditions:
    // 1) The createDocument request was not accepted.
    //    In this case the callback will not be called, we just call setBody and we are done.
    // 2) The callback was called docs.length times.
    //    In this case all documents were created and we don't need to call tryCreate anymore. Just call setBody and we are done.
    function tryCreateOrUpdate(doc, callback) {
        var isAccepted = true;
        var isFound = collection.queryDocuments(collectionLink, 'SELECT * FROM root r WHERE r.id = "' + doc.id + '"', function (err, feed, options) {
            if (err) throw err;
            if (!feed || !feed.length) {
                isAccepted = collection.createDocument(collectionLink, doc, callback);
            } else {
                // The document already exists; replace it.
                var existingDoc = feed[0];
                isAccepted = collection.replaceDocument(existingDoc._self, doc, callback);
            }
        });

        // If the request was accepted, callback will be called.
        // Otherwise report the current count back to the client,
        // which will call the script again with the remaining set of docs.
        // This condition will happen when this stored procedure has been running too long
        // and is about to get cancelled by the server. This will allow the calling client
        // to resume this batch from the point we got to before isAccepted was set to false.
        if (!isFound && !isAccepted) getContext().getResponse().setBody(count);
    }

    // This is called when collection.createDocument is done and the document has been persisted.
    function callback(err, doc, options) {
        if (err) throw err;

        // One more document has been inserted, increment the count.
        count++;
        if (count >= docsLength) {
            // If we have created all documents, we are done. Just set the response.
            getContext().getResponse().setBody(count);
        } else {
            // Create the next document.
            tryCreateOrUpdate(docs[count], callback);
        }
    }
}
My C# code looks like this:
public async Task<int> Add(List<JobDTO> entities)
{
    int currentCount = 0;
    int documentCount = entities.Count;

    while (currentCount < documentCount)
    {
        string argsJson = JsonConvert.SerializeObject(entities.Skip(currentCount).ToArray());
        var args = new dynamic[] { JsonConvert.DeserializeObject<dynamic[]>(argsJson) };

        // 6. Execute the batch.
        StoredProcedureResponse<int> scriptResult = await DocumentDBRepository.Client.ExecuteStoredProcedureAsync<int>(sproc.SelfLink, args);

        // 7. Prepare for the next batch.
        int currentlyInserted = scriptResult.Response;
        currentCount += currentlyInserted;
    }

    return currentCount;
}
The problem I am facing is that, out of the 400k documents I try to insert, documents sometimes get missed without any error being raised.
The application is a worker role deployed to the cloud.
If I increase the number of threads or instances inserting into DocumentDB, the number of missed documents is much higher.
How can I figure out what the problem is? Thanks in advance.

I found that when trying this code I would get an error at docs.length which stated that length was undefined.
function bulkImport(docs) {
    var collection = getContext().getCollection();
    var collectionLink = collection.getSelfLink();

    // The count of imported docs, also used as the current doc index.
    var count = 0;

    // Validate input.
    if (!docs) throw new Error("The array is undefined or null.");

    var docsLength = docs.length; // length is undefined
}
After many tests (I could not find anything in the Azure documentation), I realized that I could not pass an array as was suggested. The parameter had to be an object. I had to modify the batch code like this in order for it to run.
I also found I could not simply pass an array of documents in the DocumentDB Script Explorer (Input box) either, even though the placeholder help text says you can.
This code worked for me:
// pseudo object for reference only
docObject = {
    "items": [{doc}, {doc}, {doc}]
}

function bulkImport(docObject) {
    var context = getContext();
    var collection = context.getCollection();
    var collectionLink = collection.getSelfLink();
    var count = 0;

    // Check input.
    if (!docObject.items || !docObject.items.length) throw new Error("invalid document input parameter or undefined.");

    var docs = docObject.items;
    var docsLength = docs.length;
    if (docsLength == 0) {
        context.getResponse().setBody(0);
    }

    // Call the function to create a document.
    tryCreateOrUpdate(docs[count], callback);

    // Obviously I have truncated this function. The above code should help you understand what has to change.
}
Hopefully Azure documentation will catch up or become easier to find if I missed it.
I'll also be placing a bug report for the Script Explorer in hopes that the Azurites will update.

It’s important to note that stored procedures have bounded execution: all operations must complete within the server-specified request timeout. If an operation does not complete within that time limit, the transaction is automatically rolled back. To simplify development around this time limit, all CRUD (Create, Read, Update, and Delete) operations return a Boolean value indicating whether the operation will complete. This Boolean value can be used as a signal to wrap up execution and to implement a continuation-based model to resume execution (this is illustrated in our code samples below).
The bulk-insert stored procedure provided above implements the continuation model by returning the number of documents successfully created. This is noted in the stored procedure's comments:
// If the request was accepted, callback will be called.
// Otherwise report the current count back to the client,
// which will call the script again with the remaining set of docs.
// This condition will happen when this stored procedure has been running too long
// and is about to get cancelled by the server. This will allow the calling client
// to resume this batch from the point we got to before isAccepted was set to false.
if (!isFound && !isAccepted) getContext().getResponse().setBody(count);
If the output document count is less than the input document count, you will need to re-run the stored procedure with the remaining set of documents.
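For reference, that client-side continuation loop can be sketched like this in Node.js. This is a minimal sketch only: executeSproc is a hypothetical stand-in for whatever "execute stored procedure" call your SDK exposes, and the object-wrapped parameter follows the answer above.
// executeSproc(sprocLink, docObject) is a hypothetical stand-in for your SDK's
// "execute stored procedure" call; it must resolve with the count returned by the sproc.
async function bulkImportAll(executeSproc, sprocLink, docs) {
    var imported = 0;
    while (imported < docs.length) {
        // Re-run the sproc with only the documents that are still pending,
        // wrapped in an object as described in the earlier answer.
        var createdThisRound = await executeSproc(sprocLink, { items: docs.slice(imported) });
        if (createdThisRound === 0) {
            // No progress this round; stop instead of looping forever.
            throw new Error("bulk import made no progress at offset " + imported);
        }
        imported += createdThisRound;
    }
    return imported;
}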

Since May 2018 there is a new Batch SDK for Cosmos DB, and there is a GitHub repo to get you started.
I have been able to import 100,000 records in 9 seconds. Using Azure Batch to fan out the inserts, I have imported 19 million records in 1 minute 15 seconds. This was on a collection provisioned at 1.66 million RU/s, which you can obviously scale down after the import.

Related

Using node.js and promise to fetch paginated data

Please keep in mind that I am new to Node.js and I am used to Android development.
My scenario is like this:
Run a query against the database that returns either null or a value.
Call a web service with that database value; the service returns paginated info, meaning that each call gives me a parameter to pass to the next call if there is more info to fetch.
After all the items are retrieved, store them in a database table.
If everything went well, for each item received previously, I need to make another web call and store the retrieved info in another table.
If fetching any of the data sets fails, all data must be reverted from the database.
So far, I've tried this:
getAllData: function(){
    self.getMainWebData(null)
        .then(function(result){
            //get secondary data for each result row and insert it into database
        });
},

getMainWebData: function(nextPage){
    return new Promise(function(resolve, reject) {
        module.getWebData(nextPage, function(errorReturned, response, values) {
            if (errorReturned) {
                reject(errorReturned);
            }
            nextPage = response.nextPageValue;
            resolve(values);
        })
    }).then(function(result) {
        //here I need to insert the returned values in database
        //there's a new page, so fetch the next set of data
        if (nextPage) {
            //call again getMainWebData?
            self.getMainWebData(nextPage)
        }
    })
}
There are a few things missing. From what I've tested, getAllData.then fires only once, for the first set of items, and not for the others, so clearly my handling of the returned data is not right.
LATER EDIT: I've edited the scenario. Given some more research, my feeling is that I could use a chain of .then() calls to perform the operations in sequence.
Yes, that happens because you are resolving the promise on the first call itself. You should put resolve(values) inside an if statement that checks whether more data still needs to be fetched. You will also need to restructure the logic, since Node is asynchronous, and the above code will not work unless you change it.
Solution 1:
You can append the paginated responses to a variable outside the context of the calls you are making, and use that value after you are done with all the responses.
getAllData: function(){
    self.getMainWebData(null)
        .then(function(result){
            // make your database transaction if result is not an error
        });
},

function getList(nextPage, result, callback){
    module.getWebData(nextPage, function(errorReturned, response, values) {
        if (errorReturned)
            return callback(errorReturned);
        result.push(values);
        nextPage = response.nextPageValue;
        if (nextPage)
            getList(nextPage, result, callback);
        else
            callback(null, result);
    })
}

getMainWebData: function(nextPage){
    return new Promise(function(resolve, reject) {
        var result = [];
        getList(nextPage, result, function(err, results){
            if (err)
                reject(err);
            else {
                // Here all the items are retrieved; you can store them in a database table.
                // For each item received, make your web call and store it into another variable or result set.
                // The suggestion is to make the database transaction only after you have retrieved all your data;
                // otherwise it will involve a database rollback, which depends on the database you are using.
                // After all this is done, resolve the promise with the resulting value.
                resolve(results);
            }
        });
    })
}
I have not tested it, but something like this should work. If the problem persists, let me know in the comments.
Solution 2:
You can drop the promises and do the same thing with callbacks, which are easier to follow and will make sense to programmers who are familiar with structured languages.
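A minimal sketch of that callback-only variant, reusing the getList helper from Solution 1 (the database and secondary web-call steps are left as comments, as above):
getAllData: function(callback){
    // Start from the first page (null cursor) and collect everything via callbacks.
    getList(null, [], function(err, results){
        if (err)
            return callback(err);
        // All pages are collected here; run the secondary web calls / database
        // transaction, then hand the final result back to the caller.
        callback(null, results);
    });
}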
Looking at your problem, I have created code that loops through promises and only proceeds if there is more data to be fetched; the stored data is still available in an array.
I hope this helps. Don't forget to mark the answer if it does.
let fetchData = (offset = 0, limit = 10) => {
    let addresses = [...Array(100).keys()];
    return Promise.resolve(addresses.slice(offset, offset + limit));
}

// o => offset & l => limit
let o = 0, l = 10;
let results = [];

let process = p => {
    if (!p) return p;
    return p.then(data => {
        // Process the data here.
        console.log(data);
        // Increment the pagination.
        o += l;
        results = results.concat(data);
        // While a full page (equal to the limit) comes back, fetch the next page;
        // otherwise return the collected result.
        return (data.length == l) ? process(fetchData(o, l)) : results;
    })
}

process(fetchData(o, l))
    .then(data => {
        // All the fetched data will be here.
    }).catch(err => {
        // Handle errors here.
        // All the data retrieved so far is still available in the "results" array.
    });
If you want to do this more often, I have also created a gist for reference.
If you don't want to use any global variables and want to do it in a more functional way, you can check this example; however, it requires a little more complexity.
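For illustration only (this is my own sketch, not the gist or example referred to above), the functional variant threads the accumulator through the recursion instead of using an outer variable, with fetchData as defined earlier:
// Recursively fetch pages, passing the accumulator along instead of using globals.
let processAll = (offset, limit, acc = []) =>
    fetchData(offset, limit).then(data => {
        let collected = acc.concat(data);
        // A full page suggests more data may follow; a short page means we are done.
        return (data.length == limit)
            ? processAll(offset + limit, limit, collected)
            : collected;
    });

processAll(0, 10)
    .then(all => {
        // All the fetched data is available here, no globals involved.
        console.log(all.length);
    })
    .catch(err => {
        // Handle errors here.
    });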

Google Cloud Datastore, how to query for more results

Straight and simple, I have the following function, using the Google Cloud Datastore Node.js API:
fetchAll(query, result=[], queryCursor=null) {
    this.debug(`datastoreService.fetchAll, queryCursor=${queryCursor}`);
    if (queryCursor !== null) {
        query.start(queryCursor);
    }
    return this.datastore.runQuery(query)
        .then( (results) => {
            result = result.concat(results[0]);
            if (results[1].moreResults === _datastore.NO_MORE_RESULTS) {
                return result;
            } else {
                this.debug(`results[1] = `, results[1]);
                this.debug(`fetch next with queryCursor=${results[1].endCursor}`);
                return this.fetchAll(query, result, results[1].endCursor);
            }
        });
}
The Datastore API object is in the variable this.datastore;
The goal of this function is to fetch all results for a given query, notwithstanding any limits on the number of items returned per single runQuery call.
I have not yet found any definite hard limits imposed by the Datastore API on this, and the documentation seems somewhat opaque on this point, but I have noticed that I always get
results[1] = { moreResults: 'MORE_RESULTS_AFTER_LIMIT' },
indicating that there are still more results to be fetched, while results[1].endCursor remains stuck at a constant value that is passed on again on each iteration.
So, given some simple query that I plug into this function, I just keep running the query iteratively, setting the query's start cursor (by doing query.start(queryCursor);) to the endCursor obtained in the result of the previous query. My hope is, obviously, to obtain the next batch of results on each successive query in this iteration, but I always get the same value for results[1].endCursor. My question is: why?
Conceptually, I cannot see a difference to this example given in the Google Documentation:
// By default, google-cloud-node will automatically paginate through all of
// the results that match a query. However, this sample implements manual
// pagination using limits and cursor tokens.
function runPageQuery (pageCursor) {
    let query = datastore.createQuery('Task')
        .limit(pageSize);

    if (pageCursor) {
        query = query.start(pageCursor);
    }

    return datastore.runQuery(query)
        .then((results) => {
            const entities = results[0];
            const info = results[1];

            if (info.moreResults !== Datastore.NO_MORE_RESULTS) {
                // If there are more results to retrieve, the end cursor is
                // automatically set on `info`. To get this value directly, access
                // the `endCursor` property.
                return runPageQuery(info.endCursor)
                    .then((results) => {
                        // Concatenate entities
                        results[0] = entities.concat(results[0]);
                        return results;
                    });
            }

            return [entities, info];
        });
}
(The only difference is that I don't specify a limit on the size of the query result myself; I have also tried setting it to 1000, which does not change anything.)
Why does my code run into this infinite loop, stuck at the same endCursor on every step? And how do I correct this?
Also, what is the hard limit on the number of results obtained per call to datastore.runQuery()? I have not found this information in the Google Datastore documentation so far.
Thanks.
Looking at the API documentation for the Node.js client library for Datastore, there is a section on that page titled "Paginating Records" that may help you. Here's a direct copy of the code snippet from that section:
var express = require('express');
var app = express();

var NUM_RESULTS_PER_PAGE = 15;

app.get('/contacts', function(req, res) {
    var query = datastore.createQuery('Contacts')
        .limit(NUM_RESULTS_PER_PAGE);

    if (req.query.nextPageCursor) {
        query.start(req.query.nextPageCursor);
    }

    datastore.runQuery(query, function(err, entities, info) {
        if (err) {
            // Error handling omitted.
            return;
        }

        // Respond to the front end with the contacts and the cursoring token
        // from the query we just ran.
        var frontEndResponse = {
            contacts: entities
        };

        // Check if more results may exist.
        if (info.moreResults !== datastore.NO_MORE_RESULTS) {
            frontEndResponse.nextPageCursor = info.endCursor;
        }

        res.render('contacts', frontEndResponse);
    });
});
Maybe you can try using one of the other syntax options (instead of Promises). The runQuery method can take a callback function as an argument, and that callback's parameters include explicit references to the entities array and the info object (which has the endCursor as a property).
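For example, your fetchAll could be rewritten on top of that callback form, with a promise only at the outer boundary. This is an untested sketch based on the callback signature shown in the snippet above, reusing the names from your service class:
fetchAll(query) {
    // Accumulate entities across pages using the callback form of runQuery,
    // re-issuing the query from info.endCursor until NO_MORE_RESULTS.
    const datastore = this.datastore;
    return new Promise((resolve, reject) => {
        const page = (cursor, acc) => {
            if (cursor) {
                query.start(cursor);
            }
            datastore.runQuery(query, (err, entities, info) => {
                if (err) return reject(err);
                const all = acc.concat(entities);
                if (info.moreResults === _datastore.NO_MORE_RESULTS) {
                    resolve(all);
                } else {
                    page(info.endCursor, all);
                }
            });
        };
        page(null, []);
    });
}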
And there are limits and quotas imposed on calls to the Datastore API as well. Here are links to official documentation that address them in detail:
Limits
Quotas

How to control serial and parallel control flow with mapped functions?

I've drawn a simple flow chart which basically crawls some data from the internet and loads it into a database. So far I had thought I was at peace with promises, but now I have an issue that I have been working on for at least three days without making a single step forward.
Here is the flow chart:
Consider there is a static string array like so: const courseCodes = ["ATA", "AKM", "BLG", ... ].
I have a fetch function; it basically does an HTTP request followed by parsing, and afterwards returns some object array.
fetch works perfectly, invoking its callback with the expected object array; it even worked with Promises, which was much nicer and tidier.
The fetch function should be invoked with every element of the courseCodes array as its parameter. This task should be performed in parallel, since those separate fetch calls do not affect each other.
As a result, there should be a results array in the callback (or in the Promise's resolve parameter) which is an array of arrays of objects. With those results, I should invoke my loadCourse function with each element of the results array as its parameter. Those tasks should be performed serially, because loadCourse basically queries the database to see whether a similar object exists, and adds it if it does not.
How can I perform these kinds of tasks in Node.js? I could not maintain the asynchronous flow in a scenario like this; I've failed with the caolan/async library and with the bluebird and q promise libraries.
Try something like this, if you can follow it:
const courseCodes = ["ATA", "AKM", "BLG", ... ]

// Stores the tasks to be performed.
var parallelTasks = [];
var serialTasks = [];

// Keeps track of courses fetched & results.
var courseFetchCount = 0;
var results = {};

// Your fetch function.
function fetch(course_code) {
    // Your code to fetch & parse.
    // Store the result for each course in the results object.
    results[course_code] = 'whatever result comes from your fetch & parse code...';
}

// Your load function.
function loadCourse(results) {
    for (var index in results) {
        var result = results[index]; // result for a single course
        var task = (
            function(result) {
                return function() {
                    saveToDB(result);
                    // Once this result is saved, move on to the next serial task.
                    nextInSerial(null, result);
                }
            }
        )(result);
        serialTasks.push(task);
    }

    // Execute serial tasks for saving results to the database or whatever.
    var firstSerialTask = serialTasks.shift();
    if (firstSerialTask) firstSerialTask();
}

// Pseudo function to save a result to the database.
function saveToDB(result) {
    // Your code to store in the db here.
}

// Checks if fetch() is complete for all course codes in your array
// and then starts the serial tasks for saving results to the database.
function CheckIfAllCoursesFetched() {
    courseFetchCount++;
    if (courseFetchCount == courseCodes.length) {
        // Now process the courses serially.
        loadCourse(results);
    }
}

// Helper function that executes tasks in serial fashion.
function nextInSerial(err, result) {
    if (err) throw new Error(err.message);
    var nextSerialTask = serialTasks.shift();
    if (nextSerialTask) nextSerialTask(result);
}

// Build the parallel tasks for fetching.
for (var index in courseCodes) {
    var course_code = courseCodes[index];
    var task = (
        function(course_code) {
            return function() {
                fetch(course_code);
                CheckIfAllCoursesFetched();
            }
        }
    )(course_code);
    parallelTasks.push(task);
}

// Start executing the parallel fetch tasks.
for (var task_index in parallelTasks) {
    parallelTasks[task_index]();
}
Or you may refer to the nimble npm module.
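Alternatively, here is a minimal sketch of the same parallel-then-serial flow using plain Promises, assuming fetch returns a Promise of an object array and saveToDB returns a Promise (which is not how they are written above):
// Fetch all courses in parallel, then persist the results one at a time.
function crawlAndLoad(courseCodes) {
    // Parallel phase: one fetch per course code.
    return Promise.all(courseCodes.map(code => fetch(code)))
        .then(resultsPerCourse => {
            // Serial phase: chain the saves so each one starts after the previous finishes.
            return resultsPerCourse.reduce(
                (chain, courseResults) => chain.then(() => saveToDB(courseResults)),
                Promise.resolve()
            );
        });
}

crawlAndLoad(["ATA", "AKM", "BLG"])
    .then(() => console.log("all courses loaded"))
    .catch(err => console.error(err));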

Check if a document exists in mongoose (Node.js)

I have seen a number of ways of finding documents in MongoDB such that there is no performance hit, i.e. you don't really retrieve the document; instead you just retrieve a count of 1 or 0 depending on whether the document exists or not.
In mongoDB, one can probably do:
db.<collection>.find(...).limit(1).size()
In mongoose, you either have callbacks or not. But in both cases, you are retrieving the entries rather than checking the count. I simply want a way to check if a document exists in mongoose — I don't want the document per se.
EDIT: Now fiddling with the async API, I have the following code:
for (var i = 0; i < deamons.length; i++) {
    var deamon = deamons[i]; // get the deamon from the parsed XML source
    deamon = createDeamonDocument(deamon); // create a PSDeamon type document

    PSDeamon.count({deamonId: deamon.deamonId}, function(error, count) { // check if the document exists
        if (!error) {
            if (count == 0) {
                console.log('saving ' + deamon.deamonName);
                deamon.save(); // save
            } else {
                console.log('found ' + deamon.leagueName);
            }
        }
    })
}
You have to read about JavaScript scope. Anyway, try the following code:
for (var i = 0; i < deamons.length; i++) {
    (function(d) {
        var deamon = d;
        // create a PSDeamon type document
        PSDeamon.count({
            deamonId : deamon.deamonId
        }, function(error, count) { // check if the document exists
            if (!error) {
                if (count == 0) {
                    console.log('saving ' + deamon.deamonName);
                    // get the deamon from the parsed XML source
                    deamon = createDeamonDocument(deamon);
                    deamon.save(); // save
                } else {
                    console.log('found ' + deamon.leagueName);
                }
            }
        })
    })(deamons[i]);
}
Note: since it includes some DB operations, I have not tested it.
I found it simpler this way.
let docExists = await Model.exists({key: value});
console.log(docExists);
Otherwise, if you use it inside a function, make sure the function is async.
let docHandler = async () => {
let docExists = await Model.exists({key: value});
console.log(docExists);
};
You can use count; it doesn't retrieve entries. It relies on MongoDB's count operation, which:
Counts the number of documents in a collection.
Returns a document that contains this count as well as the command status.
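A minimal sketch of that count-based existence check, reusing PSDeamon and deamonId from the question above (the query condition is whatever identifies your document):
// Count matching documents instead of fetching them; count is a plain number.
PSDeamon.count({ deamonId: deamon.deamonId }, function (err, count) {
    if (err) return console.error(err);
    if (count > 0) {
        // The document exists.
    } else {
        // No such document.
    }
});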

In node.js, how to use node.js and mongodb to store data in multiple levels

I've run into a weird problem: when I use MongoDB to store data, some data goes missing, which I think is because of its asynchronous nature.
So for this list the timetable, I would use re
/* Here is the application, in which, by using a train_uid and today's date: */
var today = new Date();
var day = today.getDay();

scheduleModel.findByTrainAndTime(train_uid, today, function(err, doc) {
    var a = new Object();
    if (err) {
    } else {
        if (doc != null) {
            // the mongodb database can give me some data about the train_id, uid
            a.train_uid = doc.train_uid;
            a.train_id = train_id;
and, most importantly, a train schedule timetable. The timetable (doc.time_schedule) is a list of JSON objects with fields like arrival, departure and tiploc. However, I need to change the tiploc to a stanox number, and referenceModel can find the stanox given a tiploc number.
            // doc.time_schedule
            // here is where I add an array
So I use async: for each item in the list, I use referenceModel to query the stanox and construct an array, a.timeline, to store each b. Finally, when all the async operations are finished, trainModel stores an object with the array of stanox objects. However, when it gets to the MongoDB database, the array of stanox objects is empty. I guess this is because of the asynchronous operations, but since I used async, why doesn't it work?
            a.train_uid = doc.train_uid; // works
            a.train_id = train_id;       // works
            a.timeline = [];             // doesn't work
            a.timeline = new Array();
            var b;

            async.forEachSeries(doc.time_schedule,
                function(item, callback) {
                    referenceModel.findStanoxByTicloc(item.tiploc_code, function(err, sanox) {
                        try {
                            b = new Object();
                            b.sanox = sanox;
                            a.time.push(b);
                        } catch (err2) {
                        }
                    });
                    callback();
                },
                function(err) {
                    trainModel.createNewTrain(a, function() {});
                }
            );
        } // end of if (doc != null)
    } // end of else
}); // end of findByTrainAndTime
You're calling callback after you fire off the asynchronous find, but before it actually comes back. You need to wait until after you've gotten the data to do that. The following should work better:
async.forEachSeries(doc.time_schedule,
    function(item, callback) {
        referenceModel.findStanoxByTicloc(item.tiploc_code, function(err, sanox) {
            try {
                b = new Object();
                b.sanox = sanox;
                a.time.push(b);
            } catch (err2) {
            }
            // Only move on to the next item once this lookup has completed.
            callback();
        });
    },
    function(err) {
        trainModel.createNewTrain(a, function() {});
    }
);
