MarkLogic 8 - XQuery write large result set to a file efficiently - node.js

UPDATE: See "MarkLogic 8 - Stream large result set to a file - JavaScript - Node.js Client API" for an answer on how to do this in JavaScript. This question is specifically about XQuery.
I have a web application that consumes rest services hosted in node.js.
Node simply proxies the request to XQuery which then queries MarkLogic.
These queries already have paging setup and work fine in the normal case to return a page of data to the UI.
I need to have an export feature such that when I put a URL parameter of export=all on a request, it no longer looks up a single page.
At that point it should get the whole result set, even if it's a million records, and save it to a file.
The actual request needs to return immediately saying, "We will notify you when your download is ready."
One suggestion was to use xdmp:spawn to call the XQuery in the background which would save the results to a file. My actual HTTP request could then return immediately.
For the spawn piece, I think the idea is that I run my query with different options in order to get all results instead of one page. Then I would loop through the data, build up a string variable, and call xdmp:save with it.
Some questions: is this a good idea? Is there a better way? If I loop through the result set and it happens to be very large (gigabytes), it could cause memory issues.
Is there no way to directly stream the results to a file in XQuery?
Note: Another idea I had was to intercept the request at the proxy (node) layer, do an xdmp:estimate to get the record count, and then loop through, querying each page and flushing it to disk. In this case I would need to find some way to return my request immediately yet process in the background in node, which seems to have some ideas here: http://www.pubnub.com/blog/node-background-jobs-async-processing-for-async-language/
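Something like this untested sketch is what I have in mind for the node layer (queryPage and notifyUser are hypothetical helpers):

var fs = require('fs');

app.get('/export', function (req, res) {
  // Return immediately; the export continues in the background.
  res.status(202).send('We will notify you when your download is ready.');

  var out = fs.createWriteStream('/tmp/export.json');
  var pageSize = 1000;

  function writePage(start) {
    // queryPage would proxy one page of results from the XQuery endpoint
    queryPage(start, pageSize).then(function (results) {
      if (results.length === 0) {
        out.end();     // no more pages: close the file
        notifyUser();  // hypothetical notification hook
        return;
      }
      // Respect backpressure: wait for 'drain' if the write buffer is full.
      var ok = out.write(JSON.stringify(results) + '\n');
      if (ok) writePage(start + pageSize);
      else out.once('drain', function () { writePage(start + pageSize); });
    });
  }

  writePage(1);
});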

One possible strategy would be to use a self-spawning task that, on each iteration, gets the next page of the results for a query.
Instead of saving the results directly to a file, however, you might want to consider using xdmp:http-post() to send each page to a server:
http://docs.marklogic.com/xdmp:http-post?q=xdmp:http-post&v=8.0&api=true
In particular, the server could be a Node.js server that appends each page, as it arrives, to a file or any other data sink.
That way, Node.js could handle the long-running asynchronous IO with minimal load on the database server.
When a self-spawned task hits the end of the query, it can again use an HTTP request to notify Node.js to close the file and report that the export is finished.
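A minimal sketch of what the Node.js side of that could look like, assuming each xdmp:http-post() call carries one page in its request body (the endpoint names are illustrative):

var express = require('express');
var fs = require('fs');
var app = express();

var out = fs.createWriteStream('export.out');

// Each spawned task posts one page here; it gets appended to the file.
app.post('/export/page', function (req, res) {
  req.pipe(out, { end: false });  // append without closing the stream
  req.on('end', function () { res.sendStatus(200); });
});

// The final task calls this when the query is exhausted.
app.post('/export/done', function (req, res) {
  out.end();  // close the file
  res.sendStatus(200);
  // ...then notify the user that the export is finished
});

app.listen(3000);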
Hoping that helps,

Related

nodejs vs. ruby / understanding request processing order

I have a simple utility that I use to resize images on the fly via URL params.
Having some trouble with the Ruby image libraries (CMYK to RGB is, how to say… "unavailable"), I gave it a shot via Node.js, which solved the issue.
Basically, if the image does not exist, Node or Ruby transforms it. Otherwise, when the image has already been requested/transformed, the Ruby or Node processes aren't touched and the image is returned statically.
The Ruby version works perfectly; it's a bit slow if a lot of transforms are requested at once, but very stable: it always goes through, whatever the amount (I see the images arriving on the page one after another).
With Node it also works perfectly, but when a large number of images is requested for a single page load, the first image is transformed, then all the other requests return the very same image (the last one transformed). If I refresh the page, the first image (already transformed) is returned right away, the second one is returned correctly transformed, but then all the other images returned are the same as the one just newly transformed. And it goes on like that for every refresh. Not optimal: basically the requests are "merged" at some point and all return the same image, for reasons I don't understand.
(By 'large amount', I mean more than one.)
The Ruby version:
get "/:commands/*" do |commands, remote_path|
  path = "./public/#{commands}/#{remote_path}"
  root_domain = request.host.split(/\./).last(2).join(".")
  url = "https://storage.googleapis.com/thebucket/store/#{remote_path}"
  img = Dragonfly.app.fetch_url(url)
  resized_img = img.thumb(commands).to_response(env)
  return resized_img
end
The Node.js version:
app.get('/:transform/:id', function (req, res, next) {
  parser.parse(req.params, function (resized_img) {
    // the transforms are done via lovell/sharp
    // parser.parse parses the params, writes the file,
    // and returns the file path
    // then:
    fs.readFile(resized_img, function (error, data) {
      res.write(data)
      res.end()
    })
  })
})
Feels like I'm missing a crucial point about Node here. I expected the same behaviour from Node as from Ruby, but obviously the same pattern transposed to the Node world just does not work as expected. Node is not waiting for a request to finish processing; rather, it processes requests in an order that is not clear to me.
I also understand that I'm not putting the right words to describe the issue; I'm hoping it might speak to some experienced users who can provide clarifications for a better understanding of what happens behind the Node scenes.

How to implement a server-side rendering datatable using Node and MongoDB?

So I have one user collection (MongoDB) which consists of millions of users.
I'm using Node.js as the backend, AngularJS as the frontend, and DataTables for displaying those users.
But DataTables loads all users in one API call, which loads more than 1 million users.
This makes my API response too slow.
I want only the first 50 users, then the next 50, and so on...
Server stack = Node.js + AngularJS + MongoDB
Thanks
If you are using DataTables with a huge amount of data, you should consider using its server-side processing functionality.
Server-side processing for DataTables is described here: https://datatables.net/manual/server-side
If you'd rather not implement this on your server yourself, you could use third parties like:
https://github.com/vinicius0026/datatables-query
https://github.com/eherve/mongoose-datatable
Hope this helps.
The way to solve your client fetching users from your server (and DB) and rendering them into a datatable is pagination. There are a few ways of implementing pagination that I have seen; let's assume you are using REST.
One way of doing this is having your API ending with:
/api/users?skip=100&limit=50
Meaning, the client will ask your server for users (using default sorting), skipping the first 100 results it finds and returning the next 50 users.
Another way is to have your API like this(I don't really like this approach):
/api/users?page=5&pageSize=50
Meaning, the client will pass which page and how many results per page it wants to fetch. This results in a server-side calculation, because you would need to fetch users 250-300.
You can read on pagination a lot more on the web.
Having said that, your next issue is to fetch the desired users from the database. MongoDB has skip and limit functions for exactly this, which is why I like the first API better. You can do the query as follows:
users.find().skip(50).limit(50)
You can read more about the limit and skip functions in the MongoDB documentation.
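A minimal sketch of the first API shape, assuming an Express app and a Mongoose User model (both names are illustrative):

app.get('/api/users', function (req, res) {
  var skip = parseInt(req.query.skip, 10) || 0;
  var limit = parseInt(req.query.limit, 10) || 50;

  User.find()
    .sort({ _id: 1 })  // a stable sort keeps pages consistent
    .skip(skip)
    .limit(limit)
    .exec(function (err, users) {
      if (err) return res.status(500).send(err);
      res.json(users);
    });
});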
First thing you need is to add skip and limit to your Mongo query, like this:
Model.find().skip(offset).limit(limit)
Then the next thing you have to do is enable server-side processing in DataTables.
If you are using the plain JavaScript DataTables, this fiddle will work for you:
http://jsfiddle.net/bababalcksheep/ntcwust8/
For angular-datatables:
http://l-lin.github.io/angular-datatables/archives/#/serverSideProcessing
One other way, if you want to send your own parameters:
var draw = 1; // DataTables' draw counter, echoed back by the server

$scope.dtOptions = DTOptionsBuilder.newOptions()
  .withOption('serverSide', true)
  .withOption('processing', true)
  .withOption('ajax', function (data, callback, settings) {
    // make an ajax request using data.start and data.length
    $http.post(url, {
      draw: draw,
      limit: data.length,
      offset: data.start,
      contains: data.search.value
    }).success(function (res) {
      // map your server's response to the DataTables format and pass it to
      // DataTables' callback
      draw = res.draw;
      callback({
        recordsTotal: res.meta,
        recordsFiltered: res.meta,
        draw: res.draw,
        data: res.data
      });
    });
  })
You will get the page length and the offset (as the start variable) in the data object in the .withOption('ajax', fun...) section, and from there you can pass them as params in a GET request, e.g. /route?offset=data.start&limit=data.length, or in the body of a POST request as in the example above.
On hitting the next button in the table, this function will automatically be triggered with limit, start and many other DataTables-related values.
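For completeness, here is a sketch of what the matching server-side handler could look like (Express + Mongoose, where the User model and the searched name field are illustrative), answering in the shape the callback above expects:

app.post('/route', function (req, res) {
  var query = req.body.contains
    ? { name: new RegExp(req.body.contains, 'i') }  // illustrative search field
    : {};

  User.count(query, function (err, total) {
    if (err) return res.status(500).send(err);
    User.find(query)
      .skip(req.body.offset)
      .limit(req.body.limit)
      .exec(function (err, users) {
        if (err) return res.status(500).send(err);
        res.json({
          draw: req.body.draw,  // echo the draw counter back
          meta: total,          // feeds recordsTotal/recordsFiltered above
          data: users
        });
      });
  });
});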
@mahesh
When the page loads, create two variables, let's say skipVar = 0 and limit. When the user clicks on next, send the skipVar value under the key skip:
var skipVar = 0
On page load, send: skip=skipVar&limit=limit
On the next button:
skipVar = skipVar + limit
and send the query string as
skip=skipVar&limit=limit

Nodejs: Do additional stuff after res.send

I'm using Node as a web server and I want to log every request to it in a database. I also want the user to receive the response as quickly as possible, so I came up with this code:
// ... putting together the response_data
res.send(response_data);
// ... now log the request into the DB and maybe do additional stuff
It works, and I like the idea of putting some of the (time-)expensive stuff behind the send. But as I'm new to Node, I'm asking whether this is a common pattern?
On Stack Overflow I only find people having problems because they try to send additional data after res.send - but I never heard anybody say "yeah, this is a great feature for your responsiveness", so I'm not sure if there's a major flaw with this solution that I just don't see yet...
As long as you don't need to send anything back to the user as a result of the "additional" stuff, your approach is fine.
The problem most people run into is trying to send data down the response after the response has already been sent, e.g.:
res.send(response_data);
// do additional stuff
res.send(additional_data); // KABOOM!
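If you want to make the ordering explicit, one variant (a sketch; logRequestToDb is an illustrative function) is to hook the response's 'finish' event instead of just placing the code after res.send:

app.get('/some/route', function (req, res) {
  res.on('finish', function () {
    // runs once the response has been handed off to the OS
    logRequestToDb(req, res);  // the (time) expensive logging
  });
  res.send(response_data);
});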

Websockets with Streaming Archives

So this is the setup I'm working with:
I am on an express server which must stream an archived binary payload to a browser (does not matter if it is zip, tar or tar.gz - although zip would be nice).
On this server, I have a websocket open that connects to another server which is sending me binary payloads of individual files in a directory. I get these payloads streamed, piece-by-piece, as buffers, and I'm doing this serially (that is - file-by-file - there aren't multiple websockets open at one time, and there is one websocket per file). This is the websocket library I'm using: https://github.com/einaros/ws
I would like to go through each file, open a websocket, and then append the buffers to an archiver as they come through the websocket. When data is appended to the archiver, it would be nice if I could stream the output of the archiver to the browser (via the response object with response.write). So, basically, as I'm getting the payload from the websocket, I would like that payload streamed through an archiver and then to the response. :-)
Some things I have looked into:
node-zipstream - This is nice because it gives me an output stream I can pipe directly to response.write. However, it doesn't appear to support nested files/folders, and, more importantly, it only accepts an input stream. I have looked at the source code (which is quite terse and readable), and it seems as though, if I were able to have access to the update function within ZipStream.prototype.addFile, I could just call that each time on the message event when I get a binary buffer from the websocket. This is quite messy/hacky though, and, given that this library already doesn't seem to support nested files/folders, I'm not sure I will be going with it.
node-archiver - This suffers from the same issue as node-zipstream (probably because it was inspired by it) where it allows me to pipe the output, but I cannot append multiple buffers for the same file within the archive (and then manually signal when the last buffer has been appended for a given file). However, it does allow me to have nested folders, which is a clear win over node-zipstream.
Is there something I'm not aware of, or is this just a really crazy thing that I want to do?
The only alternative I see at this point is to wait for the entire payload to be streamed through a websocket and then append with node-archiver, but I really would like to reap the benefit of true streaming/archiving on-the-fly.
I've also thought about the possibility of creating a read stream of sorts just to serve as a proxy object that I can pass into node-archiver and then just append the buffers I get from the websocket to this read stream. Looking at various read streams, I'm not sure how to do this though. The only way I could think of was creating a writestream, piping buffers to it, and having a readstream read from that writestream. Am I on the correct thought process here?
As always, thanks for any help/direction you can offer SO community.
EDIT:
Since I just opened this question, and I'm new to node, there may be a better answer than the one I provided. I will keep this question open and accept a better answer if one presents itself within a few days. As always, I will upvote any other answers, even if they're ridiculous, as long as they're correct and allow me to stream on-the-fly as mine does.
I figured out a way to get this working with node-archiver. :-)
It was based off my hunch of creating a temporary "proxy stream" of sorts, inspired by this SO question: How to create streams from string in Node.Js?
The basic gist is (coffeescript syntax):
archive = archiver 'zip'
archive.pipe response # where response is the http response

# and then for each file...
fileName = ... # known file name
fileSize = ... # known file size
ws = ... # create websocket
proxyStream = new Stream()
numBytesStreamed = 0

archive.append proxyStream, name: fileName

ws.on 'message', (dataBuffer) ->
  numBytesStreamed += dataBuffer.length
  proxyStream.emit 'data', dataBuffer
  if numBytesStreamed is fileSize
    proxyStream.emit 'end'
    # function/indicator to do this for the next file in the folder

# and then when you're completely done...
archive.finalize (err, bytesOfArchive) ->
  if err?
    # do whatever
  else
    # unless you somehow knew this ahead of time
    res.addTrailers
      'Content-Length': bytesOfArchive
    res.end()
Note that this is not the complete solution I implemented. There is still a lot of logic dealing with getting the files, their paths, etc. Not to mention error-handling.
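For reference, here is the same proxy-stream idea in plain JavaScript using a PassThrough stream, which saves you from emitting 'data'/'end' by hand (a sketch; the names mirror the CoffeeScript above):

var archiver = require('archiver');
var stream = require('stream');

var archive = archiver('zip');
archive.pipe(response);  // response is the http response

function appendFile(ws, fileName, fileSize) {
  var proxyStream = new stream.PassThrough();
  var numBytesStreamed = 0;

  archive.append(proxyStream, { name: fileName });

  ws.on('message', function (dataBuffer) {
    numBytesStreamed += dataBuffer.length;
    proxyStream.write(dataBuffer);  // buffers flow on into the archive
    if (numBytesStreamed === fileSize) {
      proxyStream.end();            // tells archiver this entry is complete
    }
  });
}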

Fire Off an asynchronous thread and save data in cache

I have an ASP.NET MVC 3 (.NET 4) web application.
This app fetches data from an Oracle database and mixes some information with another SQL database.
Many tables are joined together and a lot of database reading is involved.
I have already optimized the fetching side as best I could and I don't have problems with that.
I've used caching to save information I don't need to fetch over and over.
Now I would like to build a responsive interface: my goal is to present the users with the filtered order headers and load the order lines in the background.
I want to do that because I need to manage all the lines (order lines) as a whole, because of some calculations.
What I have done so far is using jQuery to make an Ajax call to my action where I fetch the order headers and save them in a cache (System.Web.Caching.Cache).
When the Ajax call has succeeded I fire off another Ajax call to fetch the lines (and, once again, save the result in a cache).
It works quite well.
Now I was trying to figure out if I can move some of this logic from the client to the server.
When my action is called, I want to fetch the order header, start a new thread - responsible for fetching the order lines - and return the result to the client.
In a test app I tried both ThreadPool.QueueUserWorkItem and Task.Factory but I want the generated thread to access my cache.
I've put together a test app and done something like this:
TEST 1
[HttpPost]
public JsonResult RunTasks01()
{
    var myCache = System.Web.HttpContext.Current.Cache;
    myCache.Remove("KEY1");
    ThreadPool.QueueUserWorkItem(o => MyFunc(1, 5000000, myCache));
    return (Json(true, JsonRequestBehavior.DenyGet));
}
TEST 2
[HttpPost]
public JsonResult RunTasks02()
{
    var myCache = System.Web.HttpContext.Current.Cache;
    myCache.Remove("KEY1");
    Task.Factory.StartNew(() =>
    {
        MyFunc(1, 5000000, myCache);
    });
    return (Json(true, JsonRequestBehavior.DenyGet));
}
MyFunc creates a list of items and saves the result in the cache; pretty silly, but it's just a test.
I would like to know if someone has a better solution, or knows of any implications of accessing the cache from a separate thread.
Is there anything I need to be aware of, should avoid, or could improve?
Thanks for your help.
One possible issue I can see with your approach is that System.Web.HttpContext.Current might not be available in a separate thread, as that thread could run later, once the request has finished. I would recommend using the classes in the System.Runtime.Caching namespace, introduced in .NET 4.0, instead of the old HttpContext.Cache.
