I'm using Cosmos DB together with Azure Functions. I have one long-running activity that processes stores' transactions. There are around 3,000 stores, and each store has around 27 million transaction records.
I'm reading the data store by store, performing some arithmetic operations on it, and saving the calculated result in another Cosmos container.
Here's my code:
var distinctStores = _storeContainer
.Where(d => d.CreatedDate >= _yesterday.Date && d.CreatedDate < _today.Date)
.DistinctBy(x => x.LocationId)
.Select(s => s.LocationId);
try
{
    foreach (var store in distinctStores) // looping over 3000 stores here
    {
        var data = store.Transactions; // around 27 million transaction records per store
        // Here I'm doing all calculations over the data
        // and getting the results from the calculation.
        var result = getCalculatedData(data);
        // Save the result to the container.
        Save(result);
    }
}
private bool Save(List<MyModel> list)
{
_storeDataRepo.AddBulkAsync(list);
}
Here's my repository service that stores the data in the Cosmos DB container.
public async Task AddBulkAsync(List<TEntity> documents)
{
    List<Task> concurrentTasks = new();
    foreach (var item in documents)
    {
        concurrentTasks.Add(_container.CreateItemAsync<TEntity>(item));
    }
    await Task.WhenAll(concurrentTasks);
}
I'm a bit worried about calling _storeDataRepo.AddBulkAsync(list) inside the loop over stores, which in turn executes await Task.WhenAll(concurrentTasks) on every iteration.
Is this a good approach? Please advise and suggest better alternatives.
private bool Save(List<MyModel> list)
{
_storeDataRepo.AddBulkAsync(list);
}
AddBulkAsync is an async method, so you should not do this. It fires the async task in the background (fire-and-forget) without waiting for it, so any failures are silently lost and the work may still be running when your code moves on. Return or await the Task instead:
private Task SaveAsync(List<MyModel> list)
{
return _storeDataRepo.AddBulkAsync(list);
}
or:
private async Task SaveAsync(List<MyModel> list)
{
await _storeDataRepo.AddBulkAsync(list);
}
And called as:
foreach (var store in distinctStores) // looping over 3000 stores here
{
    var data = store.Transactions; // around 27 million transaction records per store
    // Here I'm doing all calculations over the data
    // and getting the results from the calculation.
    var result = getCalculatedData(data);
    // Save the result to the container.
    await SaveAsync(result);
}
Because the bulk write does not really care whether the operations belong to the same store, you could also make a single SaveAsync call with the data from all stores:
List<TEntity> documents = new List<TEntity>();
foreach (var store in distinctStores) // looping over 3000 stores here
{
    var data = store.Transactions; // around 27 million transaction records per store
    // Here I'm doing all calculations over the data
    // and getting the results from the calculation.
    var result = getCalculatedData(data);
    // Add to the documents meant to be saved.
    documents.AddRange(result);
}
await SaveAsync(documents);
My app can have a large number of writes, reads and updates (it can even go above 10,000) under certain circumstances.
While developing the application locally, these operations usually take a few seconds at most (great!). However, they can easily take minutes when the application runs on Google Cloud, to the point that the Firebase Function times out.
I developed a controlled test in a separate project whose sole purpose is to write, get and delete thousands of items for benchmarking. These were the results (averaged over several tests):
Local Emulator:
5000 items, 4.2s write, 2.2s delete
5000 items, batch mode ON, 0.75s write, 0.11s delete
Cloud Firestore:
100 items, 15.8s write, 14.5s delete
1000 items, batch mode ON, 4.8s write, 3.0s delete
5000 items, async mode ON, 10.2s write, 8.0s delete
5000 items, batch & async ON, 4.5s write, 3.9s delete
NOTE: My local emulator crashes whenever I try to perform DB operations asynchronously (a problem for another day), which is why I was unable to test the write/delete speeds asynchronously locally. Also, write and read values usually vary ±25% between runs.
However, as you can see, the fact that my local emulator in its slowest mode is faster than the fastest test in the cloud definitely raises some questions.
Could it be that I have some sort of configuration issue, or are these numbers standard for Firestore? Here is the (summarised) TypeScript code if you wish to try it:
functions.runWith({ timeoutSeconds: 540, memory: "2GB" }).https.onRequest(async (req, res) => {
//getting the settings from the request
var data = req.body;
var numWrites: number = data.numWrites;
var syncMode: boolean = !data.asyncMode;
var batchMode: boolean = data.batchMode;
var batchLimit: number = data.batchLimit;
//pre-run setup
var dbObj = {
number: 123,
string: "abc",
boolean: true,
object: { var1: "var1", num1: 1 },
array: [1, 2, 3, 4]
};
var collection = db.collection("testCollection");
var startTime = moment();
//insert requested number of items, using requested settings
var allInserts: Promise<any>[] = [];
if (!batchMode) { //sequential writes
for (var i = 0; i < numWrites; i++) {
var set = collection.doc().set(dbObj);
allInserts.push(set);
if (syncMode) await set;
}
} else { //batch writes
var batch = db.batch();
for (var i = 1; i <= numWrites; i++) {
batch.set(collection.doc(), dbObj);
if (i % batchLimit === 0) {
var commit = batch.commit();
allInserts.push(commit);
batch = db.batch();
if (syncMode) await commit;
}
}
}
//some logging information. Getting items to delete
var numInserts = allInserts.length;
await Promise.all(allInserts);
var insertTime = moment();
var alldocs = (await collection.get()).docs;
var numDocs = alldocs.length;
var getTime = moment();
//deletes all of the items in the collection
var allDeletes: Promise<any>[] = [];
if (!batchMode) { //sequential deletes
for (var doc of alldocs) {
var del = doc.ref.delete();
allDeletes.push(del);
if (syncMode) await del;
}
} else { //batch deletes
var batch = db.batch();
for (var i = 1; i <= numDocs; i++) {
var doc = alldocs[i - 1];
batch.delete(doc.ref);
if (i % batchLimit === 0) {
var commit = batch.commit();
allDeletes.push(commit);
batch = db.batch();
if (syncMode) await commit;
}
}
}
var numDeletes = allDeletes.length;
await Promise.all(allDeletes);
var deleteTime = moment();
res.status(200).send(/* a whole bunch of metrics for analysis */);
});
EDIT: Just to clarify, the UI does not perform these write operations, so latency between the end-user machine and the cloud servers should (in theory) not cause any major latency issues. The communication with the database is handled entirely by Firebase Functions.
EDIT 2: I have run this test on two deployments, one in Europe and one in the US. Both took around the same amount of time to run, even though my pings to these two servers are vastly different.
It is normal to get faster responses from the local emulator than from Cloud Firestore, as the remote environment adds network traffic that takes time.
For large numbers of operations from a single source, the recommendation is to use batch operations, as these reduce the number of transactions and, with them, the round trips.
The async mode is faster because the caller does not wait for each transaction to complete before sending the next one, so it also makes sense that the runs are quicker with it.
The times in your table look normal to me.
As an additional optimization, make sure that the region where your Firestore database is located is the closest one to your location.
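To illustrate the combination that performed best in your table (batch & async), here is a minimal sketch, assuming the firebase-admin SDK and a testCollection like yours, that chunks the writes into batches of up to 500 operations (Firestore's per-batch limit) and commits the batches concurrently:
import * as admin from "firebase-admin";

admin.initializeApp();
const db = admin.firestore();

// Chunk `docs` into batches of up to `batchLimit` writes and commit them in parallel.
async function batchedParallelWrite(docs: object[], batchLimit = 500): Promise<void> {
    const collection = db.collection("testCollection"); // assumed collection name
    const commits: Promise<any>[] = [];
    for (let i = 0; i < docs.length; i += batchLimit) {
        const batch = db.batch();
        for (const doc of docs.slice(i, i + batchLimit)) {
            batch.set(collection.doc(), doc);
        }
        commits.push(batch.commit());
    }
    await Promise.all(commits); // all batches are in flight at the same time
}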
My application makes about 50 redis.get calls to serve a single HTTP request; it serves millions of requests daily and runs on about 30 pods.
When monitoring with New Relic I see an average redis.get time of 200 ms. To optimize this, I wrote a simple pipelining wrapper in Node.js over redis.get: it pushes all requests into a queue and then executes the queue using redis.mget (fetching all the keys in one bulk call).
Following is the code snippet:
class RedisBulk {
constructor() {
this.queue = [];
this.processingQueue = {};
this.intervalId = setInterval(() => {
this._processQueue();
}, 5);
}
clear() {
clearInterval(this.intervalId);
}
get(key, cb) {
this.queue.push({cb, key});
}
_processQueue() {
if (this.queue.length > 0) {
let queueLength = this.queue.length;
logger.debug('Processing Queue of length', queueLength);
let time = (new Date).getTime();
this.processingQueue[time] = this.queue;
this.queue = []; //empty the queue
let keys = [];
this.processingQueue[time].forEach((item)=> {
keys.push(item.key);
});
global.redisClient.mget(keys, (err, replies)=> {
if (err) {
captureException(err);
console.error(err);
} else {
this.processingQueue[time].forEach((item, index)=> {
item.cb(err, replies[index]);
});
}
delete this.processingQueue[time];
});
}
}
}
let redis_bulk = new RedisBulk();
redis_bulk.get('a', (err, value) => { /* use value */ });
redis_bulk.get('b', (err, value) => { /* use value */ });
redis_bulk.get('c', (err, value) => { /* use value */ });
redis_bulk.get('d', (err, value) => { /* use value */ });
My question is: is this a good approach? Will it help optimize Redis GET time? Is there any other solution to the above problem?
Thanks
I'm not a Redis expert, but judging by the documentation:
MGET has the time complexity of
O(N) where N is the number of keys to retrieve.
And GET has the time complexity of
O(1)
This brings both scenarios to the same end result in terms of time complexity in your case. A bulk request with MGET can give you some improvement on the I/O side (fewer round trips), but apart from that it looks like you have the same bottleneck.
Ideally I'd split the data into chunks and respond via multiple HTTP requests in an async fashion, if that's an option.
Alternatively, you can try combining GET with Promise.all() to run all the GET requests you need in parallel.
Something like:
const asyncRedis = require("async-redis");
const client = asyncRedis.createClient();

function bulk(keys) {
    // run one GET per key, all in parallel; resolves to the values in key order
    return Promise.all(keys.map((key) => client.get(key)));
}
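And for completeness, a sketch that combines the two ideas above (chunking plus MGET), reusing the client from the snippet and assuming async-redis promisifies mget the same way it does get; the chunk size of 25 and the example keys are arbitrary:
async function chunkedMget(keys, chunkSize = 25) {
    const chunks = [];
    for (let i = 0; i < keys.length; i += chunkSize) {
        chunks.push(keys.slice(i, i + chunkSize));
    }
    // one MGET per chunk, with all chunks in flight at the same time
    const replies = await Promise.all(chunks.map((chunk) => client.mget(chunk)));
    return replies.flat(); // values come back in the same order as `keys`
}

chunkedMget(['a', 'b', 'c', 'd']).then((values) => console.log(values));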
I've drawn a simple flow chart for a process that basically crawls some data from the internet and loads it into a database. So far I had thought I was at peace with promises, but now I have an issue I've been working on for at least three days without a single step forward.
Here is the flow, step by step:
Consider there is a static string array like so: const courseCodes = ["ATA", "AKM", "BLG", ... ].
I have a fetch function; it basically does an HTTP request followed by parsing, and afterwards it returns an array of objects.
fetch works perfectly, invoking its callback with the expected object array; it even worked with promises, which was much tidier.
The fetch function should be invoked with every element of the courseCodes array as its parameter. These calls should run in parallel, since the separate fetches do not affect each other.
As a result, there should be a results array in the callback (or the promise's resolve parameter) that contains an array of arrays of objects. With those results, I should invoke my loadCourse function for each entry in the results array. Those calls should run in series, because loadCourse queries the database to check whether a similar object exists and adds it if it does not.
How can I perform this kind of task in Node.js? I could not maintain the asynchronous flow in a scenario like this. I've failed with the caolan/async library and with the bluebird and q promise libraries.
Try something like this:
const courseCodes = ["ATA", "AKM", "BLG", ... ]
//stores the tasks to be performed.
var parallelTasks = [];
var serialTasks = [];
//keeps track of courses fetched & results.
var courseFetchCount = 0;
var results = {};
//your fetch function.
function fetch(course_code) {
    //your code to fetch & parse goes here.
    //store the result for each course in the results object...
    results[course_code] = 'whatever result comes from your fetch & parse code...';
    //...and, from your fetch/parse completion callback, signal that this course is done:
    CheckIfAllCoursesFetched();
}
//your load function.
function loadCourse(results) {
    for (var index in results) {
        var result = results[index]; //result for a single course
        var task = (
            function(result) {
                return function(done) {
                    saveToDB(result);
                    done(); //tell the serial runner this task has finished
                }
            }
        )(result);
        serialTasks.push(task);
    }
    //execute serial tasks for saving results to the database.
    nextInSerial();
}
//pseudo function to save a result to the database.
function saveToDB(result) {
    //your code to store in db here.
}
//checks if fetch() is complete for all course codes in your array
//and then starts the serial tasks for saving results to the database.
function CheckIfAllCoursesFetched() {
    courseFetchCount++;
    if (courseFetchCount == courseCodes.length) {
        //now process courses serially
        loadCourse(results);
    }
}
//helper function that executes tasks in serial fashion.
function nextInSerial(err) {
    if (err) throw new Error(err.message);
    var nextSerialTask = serialTasks.shift();
    if (nextSerialTask) nextSerialTask(nextInSerial);
}
//start executing parallel tasks for fetching.
for (var index in courseCodes) {
    var course_code = courseCodes[index];
    var task = (
        function(course_code) {
            return function() {
                fetch(course_code);
            }
        }
    )(course_code);
    parallelTasks.push(task);
}
for (var task_index in parallelTasks) {
    parallelTasks[task_index]();
}
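Alternatively, if you wrap your fetch in a promise, the whole flow collapses into a few lines. This is only a sketch: fetchCourse and loadCourse are hypothetical promise-returning wrappers around your own fetch and database code.
// fetchCourse(code): promise of the parsed object array for one course (hypothetical wrapper).
// loadCourse(objects): promise that resolves once the objects are checked/inserted (hypothetical wrapper).
async function crawlAndLoad(courseCodes) {
    // run all fetches in parallel
    const results = await Promise.all(courseCodes.map((code) => fetchCourse(code)));
    // load the results one by one, in series
    for (const objects of results) {
        await loadCourse(objects);
    }
}

crawlAndLoad(["ATA", "AKM", "BLG"]).catch((err) => console.error(err));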
Or you may refer to the nimble npm module.
Hi, I am using 16 collections to insert around 3-4 million JSON objects, ranging from 5-10 KB per object. I am using a stored procedure to insert these documents. I have 22 capacity units.
function bulkImport(docs) {
var collection = getContext().getCollection();
var collectionLink = collection.getSelfLink();
// The count of imported docs, also used as current doc index.
var count = 0;
// Validate input.
if (!docs) throw new Error("The array is undefined or null.");
var docsLength = docs.length;
if (docsLength == 0) {
getContext().getResponse().setBody(0);
}
// Call the CRUD API to create a document.
tryCreateOrUpdate(docs[count], callback);
// Note that there are 2 exit conditions:
// 1) The createDocument request was not accepted.
// In this case the callback will not be called, we just call setBody and we are done.
// 2) The callback was called docs.length times.
// In this case all documents were created and we don't need to call tryCreate anymore. Just call setBody and we are done.
function tryCreateOrUpdate(doc, callback) {
var isAccepted = true;
var isFound = collection.queryDocuments(collectionLink, 'SELECT * FROM root r WHERE r.id = "' + doc.id + '"', function (err, feed, options) {
if (err) throw err;
if (!feed || !feed.length) {
isAccepted = collection.createDocument(collectionLink, doc, callback);
}
else {
// The metadata document.
var existingDoc = feed[0];
isAccepted = collection.replaceDocument(existingDoc._self, doc, callback);
}
});
// If the request was accepted, callback will be called.
// Otherwise report current count back to the client,
// which will call the script again with remaining set of docs.
// This condition will happen when this stored procedure has been running too long
// and is about to get cancelled by the server. This will allow the calling client
// to resume this batch from the point we got to before isAccepted was set to false
if (!isFound && !isAccepted) getContext().getResponse().setBody(count);
}
// This is called when collection.createDocument is done and the document has been persisted.
function callback(err, doc, options) {
if (err) throw err;
// One more document has been inserted, increment the count.
count++;
if (count >= docsLength) {
// If we have created all documents, we are done. Just set the response.
getContext().getResponse().setBody(count);
} else {
// Create next document.
tryCreateOrUpdate(docs[count], callback);
}
}
My C# code looks like this:
public async Task<int> Add(List<JobDTO> entities)
{
int currentCount = 0;
int documentCount = entities.Count;
while(currentCount < documentCount)
{
string argsJson = JsonConvert.SerializeObject(entities.Skip(currentCount).ToArray());
var args = new dynamic[] { JsonConvert.DeserializeObject<dynamic[]>(argsJson) };
// 6. execute the batch.
StoredProcedureResponse<int> scriptResult = await DocumentDBRepository.Client.ExecuteStoredProcedureAsync<int>(sproc.SelfLink, args);
// 7. Prepare for next batch.
int currentlyInserted = scriptResult.Response;
currentCount += currentlyInserted;
}
return currentCount;
}
The problem I am facing is that, out of the 400k documents I try to insert, documents sometimes get missed without any error being raised.
The application is a worker role deployed in the cloud.
If I increase the number of threads or instances inserting into DocumentDB, the number of missed documents is much higher.
How can I figure out what the problem is? Thanks in advance.
I found that when trying this code I would get an error at docs.length stating that length was undefined.
function bulkImport(docs) {
var collection = getContext().getCollection();
var collectionLink = collection.getSelfLink();
// The count of imported docs, also used as current doc index.
var count = 0;
// Validate input.
if (!docs) throw new Error("The array is undefined or null.");
var docsLength = docs.length; // length is undefined
}
After many tests (I could not find anything in the Azure documentation) I realized that I could not pass an array as was suggested. The parameter had to be an object. I had to modify the batch code like this in order for it to run.
I also found that I could not simply pass an array of documents in the DocumentDB Script Explorer (input box) either, even though the placeholder help text says you can.
This code worked for me:
// pseudo object for reference only
docObject = {
"items": [{doc}, {doc}, {doc}]
}
function bulkImport(docObject) {
var context = getContext();
var collection = context.getCollection();
var collectionLink = collection.getSelfLink();
var count = 0;
// Check input
if (!docObject.items || !docObject.items.length) throw new Error("invalid document input parameter or undefined.");
var docs = docObject.items;
var docsLength = docs.length;
if (docsLength == 0) {
context.getResponse().setBody(0);
}
// Call the funct to create a document.
tryCreateOrUpdate(docs[count], callback);
// Obviously I have truncated this function. The above code should help you understand what has to change.
}
Hopefully Azure documentation will catch up or become easier to find if I missed it.
I'll also be placing a bug report for the Script Explorer in hopes that the Azurites will update.
It's important to note that stored procedures have bounded execution, in which all operations must complete within the server-specified request timeout duration. If an operation does not complete within that time limit, the transaction is automatically rolled back. In order to simplify development around time limits, all CRUD (Create, Read, Update, and Delete) operations return a Boolean value that represents whether that operation will complete. This Boolean value can be used as a signal to wrap up execution and to implement a continuation-based model to resume execution (this is illustrated in the code samples below).
The bulk-insert stored procedure provided above implements the continuation model by returning the number of documents successfully created. This is noted in the stored procedure's comments:
// If the request was accepted, callback will be called.
// Otherwise report current count back to the client,
// which will call the script again with remaining set of docs.
// This condition will happen when this stored procedure has been running too long
// and is about to get cancelled by the server. This will allow the calling client
// to resume this batch from the point we got to before isAccepted was set to false
if (!isFound && !isAccepted) getContext().getResponse().setBody(count);
If the output document count is less than the input document count, you will need to re-run the stored procedure with the remaining set of documents.
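The questioner's C# while loop above already implements this; purely as an illustration of the same continuation pattern, here is a sketch using the newer @azure/cosmos JavaScript SDK. The endpoint, key, database, collection and the bulkImport stored procedure id are placeholders/assumptions.
import { CosmosClient } from "@azure/cosmos";

// Keep calling the stored procedure with the documents that have not been inserted yet,
// until the counts returned by the sproc add up to the full input.
async function bulkImportWithContinuation(docs: any[]): Promise<number> {
    const client = new CosmosClient({ endpoint: "<endpoint>", key: "<key>" }); // placeholders
    const container = client.database("<database>").container("<collection>");
    const sproc = container.scripts.storedProcedure("bulkImport"); // assumed sproc id

    let inserted = 0;
    while (inserted < docs.length) {
        // the sproc returns how many documents it created before hitting the execution time limit
        const { resource: count } = await sproc.execute("<partitionKeyValue>", [docs.slice(inserted)]);
        if (!count) break; // avoid looping forever if nothing was accepted this round
        inserted += count;
    }
    return inserted;
}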
Since May 2018 there is a new Batch SDK for Cosmos DB. There is a GitHub repo to get you started.
I have been able to import 100,000 records in 9 seconds. And by using Azure Batch to fan out the inserts, I have done 19 million records in 1 minute 15 seconds. This was on a collection provisioned at 1.66 million RU/s, which you can obviously scale down after the import.
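That answer refers to a .NET-side SDK; as a rough equivalent sketch on the JavaScript side, newer versions of the @azure/cosmos SDK expose container.items.bulk, which accepts up to 100 operations per call. The SDK version, credentials and container names below are assumptions.
import { CosmosClient, OperationInput } from "@azure/cosmos";

// Sketch: create documents in bulk, 100 operations per request.
async function bulkCreate(docs: object[]): Promise<void> {
    const client = new CosmosClient({ endpoint: "<endpoint>", key: "<key>" }); // placeholders
    const container = client.database("<database>").container("<collection>");

    for (let i = 0; i < docs.length; i += 100) {
        const operations = docs
            .slice(i, i + 100)
            .map((doc): OperationInput => ({ operationType: "Create", resourceBody: doc as any }));
        const response = await container.items.bulk(operations);
        // each entry of `response` carries a statusCode you can check for failed operations
    }
}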
I met a weird problem: when I use MongoDB to store data, some data is missing, which I think is because of its asynchronous nature.
So for the schedule timetable list I use referenceModel, as described below.
/* Here is the application, in which, given a train_uid and today's date,
I look up the train's schedule: */
var today = new Date();
var day = today.getDay();
scheduleModel.findByTrainAndTime(train_uid,today,function(err, doc){
var a = new Object();
if(err){}
else{
if(doc != null)
{
//mongodb database can give me some data about the train_id ,uid
a.train_uid = doc.train_uid;
a.train_id = train_id;
and, most importantly, a train schedule timetable. The timetable is a list (doc.time_schedule) of JSON objects with fields like arrival, departure and tiploc. However, I need to change each tiploc to a sanox number, and referenceModel can look up the sanox for a given tiploc number.
//doc.time_schedule
// here I build an array
So I use async: for each item in the list, I use referenceModel to query the sanox and construct an array (a.timeline) to store each b. Finally, when all the async operations have finished, trainModel stores the object together with its array of sanox objects. However, when it reaches the MongoDB database, the array of sanox objects is empty. I guess it is because of the asynchronous operations, but since I used async, why doesn't it work?
a.train_uid = doc.train_uid; // works
a.train_id = train_id; // works
a.timeline = [] // doesn't work
a.timeline = new Array();
var b;
async.forEachSeries(doc.time_schedule,
function(item,callback){
referenceModel.findStanoxByTicloc(item.tiploc_code,function(err,sanox){
try{
b = new Object();
b.sanox = sanox;
a.time.push(b);
}catch(err2){
}
});
callback();
},
function(err){
trainModel.createNewTrain(a,function(){});
});
}
}
});
You're calling callback after you fire off the asynchronous find, but before it actually comes back. You need to wait until after you've gotten the data to do that. The following should work better:
async.forEachSeries(doc.time_schedule,
    function(item, callback){
        referenceModel.findStanoxByTicloc(item.tiploc_code, function(err, sanox){
            try{
                b = new Object();
                b.sanox = sanox;
                a.time.push(b);
            }catch(err2){
            }
            callback();
        });
    },
    function(err){
        trainModel.createNewTrain(a, function(){});
    }
);
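As a further refinement (not part of the original answer, just a sketch): pass the error to callback so that async.forEachSeries stops the series on failure and hands the error to the final handler:
async.forEachSeries(doc.time_schedule,
    function(item, callback){
        referenceModel.findStanoxByTicloc(item.tiploc_code, function(err, sanox){
            if (err) return callback(err); // stop the series and report the error
            a.time.push({ sanox: sanox });
            callback();
        });
    },
    function(err){
        if (err) return console.error(err); // assumed error handling
        trainModel.createNewTrain(a, function(){});
    }
);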