I am looking for ways to optimize the bulk download of CRM 2011 data. Here are the two main scenarios:
a) Full synchronization: download of all data - first all accounts, then all contacts, and so on.
b) Incremental synchronization: download of all entities modified since a given date.
We use a multithreaded downloader with 3 threads. Each thread performs a FetchXml query for one entity type, which is downloaded page by page. Parsed objects are stored in the downloader cache and the downloader moves on to the next page. Another thread pulls the downloaded data from the cache and processes it. This organization increases the download speed more than 2x.
The problems I see:
a) The FetchXml protocol is quite inefficient; responses contain a lot of unneeded data. For example, FormattedValues take 10-15% of the bandwidth (my data show ~15% of the raw XML stream, or ~10% of the zipped stream), although all we do with them is a) parse the XML and b) throw them away. (Note that the parsing is not negligible either: the iOS/Android Mono XML parsers are surprisingly slow.)
b) In the case of incremental synchronization, most of the FetchXml requests return zero items. Here it would be highly desirable to combine several FetchXml requests into one (AFAIK that is impossible), or perhaps use another trick such as asking for the counts of modified objects first; I have not yet investigated what is possible.
Does anybody have advice on how to optimize FetchXml traffic?
Your fastest method would be to use SQL Server directly for something like this (unless you are using CRM Online).
To make the incremental faster, your best bet is to use the aggregate functionality FetchXML provides which is both extremely quick and less verbose.
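For the incremental case, one way to use aggregates is to first ask each entity for a count of records modified since the last sync, and only run the full paged query when that count is non-zero. A minimal sketch of such a count query as it might be built on the client; the entity, alias and timestamp are illustrative, so adjust them to your own schema:

    // Hedged sketch: count of contacts modified since the last sync.
    const modifiedCountFetchXml: string = `
      <fetch aggregate="true">
        <entity name="contact">
          <attribute name="contactid" alias="modified_count" aggregate="count" />
          <filter>
            <condition attribute="modifiedon" operator="on-or-after" value="2012-06-01" />
          </filter>
        </entity>
      </fetch>`;
    // Send this through whatever channel you already use for FetchXml
    // (the CRM 2011 organization service) and read back a single aggregate row.

A count response is tiny compared to a full page of records, so entities with no changes cost almost nothing.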
Why parse on iOS/Android Mono at all? If you are sharing this data with a large number of devices, you'd be better off having a central caching server that could send it back to the devices in a zipped JSON format (or possibly BSON). The caching server would then request only the changes, process them, and send incremental updates to the clients in whatever format you choose. That would be considerably faster on the clients and use far less bandwidth.
I'm not sure of a way to further optimize FetchXML. I would question why you're not using the OData endpoint and REST, especially if you're primarily concerned with the amount of data being sent over the wire.
I have talked to some brilliant CRM MVPs, and I know they have used REST to migrate data to CRM. I'm not sure if they did it because it was faster, but I assumed that was why.
I do know that it will minimize the amount of data sent to the client, since XML is extremely bloated.
Have a look at ExecuteMultipleRequest, which allows you to perform multiple requests/queries at once: http://msdn.microsoft.com/en-us/library/microsoft.xrm.sdk.messages.executemultiplerequest.aspx
I have different external APIs doing basically the same thing, but in different ways: adding product information (ext_api).
I would like to make an adapter API that would call, behind the scenes, the different external APIs (adapter_api).
My problem is the following: the external APIs are optimized for calls with a batch of product attributes, whereas my API would be called on a product-by-product basis.
I would like to somehow keep a buffer of product attributes that grows as my adapter_api is called. When the number of product attributes reaches a certain limit, ext_api would be called and the buffer reset, ready to receive more product attributes.
I'm wondering how to achieve that. I was thinking of making a REST API in Python that would store the buffer of product attributes. I would like this REST API to be able to scale on a Kubernetes cluster: it would need low latency, and several instances of this API would write to the buffer of products until one of them reaches the limit and makes the call to the external API.
Here is what I have in mind:
Are there any best practices concerning the buffer in this use case? To add some extra information: my main purpose here is to hide from the internal business APIs (not drawn) the complexity of calling many different external APIs, each of which has its own rules and credentials.
Thank you very much for your help.
You didn't tell us your performance evaluation criteria.
You did tell us this:
don't know how to store the buffer : I would like to avoid databases or files.
which makes little sense, since there's a simple answer to this question:
Is there any best practices on this use case ?
Yes. The best practice is to append requests to buffer.txt and send the batch when that file exceeds some threshold. A convenient way to implement the threshold would be to send when getsize() reports a large enough value. If requests are of quite different size and the batch size really matters to you, then append a single byte to a 2nd file, and use the size of that to indicate how many entries are enqueued.
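A minimal sketch of that file-buffer idea, shown in TypeScript with Node's fs module (the question mentions Python, but the logic is identical; sendBatch() is a hypothetical stand-in for the real ext_api call, and the threshold is arbitrary):

    import * as fs from "fs";

    const BUFFER_FILE = "buffer.txt";
    const MAX_BUFFER_BYTES = 64 * 1024; // flush threshold; tune to taste

    // Hypothetical stand-in for the real batched call to ext_api.
    function sendBatch(lines: string[]): void {
      console.log(`sending a batch of ${lines.length} product attributes`);
    }

    // Called by the adapter API for each incoming product attribute.
    function enqueue(productAttribute: object): void {
      // One JSON document per line, appended to the buffer file.
      fs.appendFileSync(BUFFER_FILE, JSON.stringify(productAttribute) + "\n");

      // getsize()-style check: flush once the file is large enough.
      if (fs.statSync(BUFFER_FILE).size >= MAX_BUFFER_BYTES) {
        const lines = fs.readFileSync(BUFFER_FILE, "utf8").split("\n").filter(Boolean);
        sendBatch(lines);
        fs.truncateSync(BUFFER_FILE); // reset the buffer
      }
    }

Note that with several API instances, as in the Kubernetes setup described above, the buffer file would have to live on shared storage and the append/flush would need some coordination; that is exactly the kind of requirement the section below asks about.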
Requirements
The heart of your question seems to revolve around what was left unsaid:
What is the cost function for sending too many "small" batches to ext_api?
What is the cost function for the consumer of the adapter_api; what does it care about? Low-latency returns, perhaps?
If ext_api permanently fails (say, a day of downtime), do we have some responsibility for quickly notifying the consumer that its updates are going into a black hole?
And why would using the filesystem be inappropriate? It seems a perfect match for your needs.
Consider using a global in-memory object, such as a list or queue, for the batch you're accumulating; you might want to protect accesses with a lock (a sketch follows below).
Maybe your client doesn't really want a one-product-at-a-time API. Maybe you'd prefer to have your client accumulate items, sending only when its batch size is big enough.
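A sketch of that in-memory variant with a module-level array and a size threshold (TypeScript again; in a multi-threaded Python service you'd guard the buffer with a lock, and each process or pod keeps its own independent buffer; sendBatch() is hypothetical):

    // Module-level batch, shared by every request handled by this process.
    const batch: object[] = [];
    const MAX_BATCH_SIZE = 100;

    // Hypothetical stand-in for the real ext_api call.
    function sendBatch(items: object[]): void {
      console.log(`flushing ${items.length} items to ext_api`);
    }

    function addToBatch(productAttribute: object): void {
      batch.push(productAttribute);
      if (batch.length >= MAX_BATCH_SIZE) {
        // splice() empties the array and hands the flushed items to sendBatch().
        sendBatch(batch.splice(0, batch.length));
      }
    }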
I read this question but it didn't really help.
First and most important thing: time performance is the focus of the application I'm developing.
We have a client/server model (even distributed or cloud if we wish) and a data structure D hosted on the server. Each client request consists of:
Read something from D
Possibly write something to D
Possibly delete something from D
We can say that in this application the relative frequency of the received operations can be described as delete << write << read. In addition:
Read ops absolutely cannot wait: they must be processed immediately.
Write and delete ops can wait some time, but the sooner the better.
Given the description above, any locking mechanism is not acceptable: it would imply that read operations could wait, which is not allowed (sorry to stress it so much, but it's really a crucial point).
Consistency is not necessary: if a write/delete operation has been performed and a later read doesn't see its effect, it's not a big deal. It would be better, but it's not required.
The solution should be data-structure-independent, so it shouldn't matter whether we write to a vector, list, map or Donald Trump's face.
The data structure could occupy a large amount of memory.
My solution so far:
We use two servers: the first server (called f) holds a copy Df, the second server (called s) holds a copy Ds that is kept up to date.
f answers client requests using Df and forwards the write/delete operations to s. Then s applies those write/delete operations to Ds sequentially.
At a certain point, all future client requests are redirected to s. At the same time, f copies s's updated Ds into its Df.
Now the roles of f and s are swapped: s answers client requests using Ds, and f keeps an updated version of the data. The swapping process is repeated periodically.
Notice that I deliberately omitted A LOT of details for simplicity (for example, once the swap has happened, f has to finish all pending client requests before applying the write/delete operations received from s in the meantime).
Why do we need two servers? Because the data structure is potentially too big to fit in a single server's memory.
Now, my question is: is there a similar approach in the literature? I came up with this protocol in 10 minutes; I find it strange that no (better) solution along these lines has already been proposed!
PS: I may have forgotten some application specs; don't hesitate to ask for clarification!
The scheme that you have works; I don't see any particular problem with it. This is basically how many databases run their HA solution: they apply a log of writes to replicas. This model affords a great deal of flexibility in how the replicas are formed, accessed and maintained. Failovers are easy, too.
An alternative technique is to use persistent data structures. Each write returns a new and independent version of the data. All versions can be read in a stable, lock-free way. Versions can be kept or discarded at will, and they share as much of the underlying state as possible.
Usually, trees underlie such persistent data structures, because it is easy to update a small part of the tree and reuse most of the old tree.
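A toy illustration of the path-copying idea behind such persistent trees (TypeScript; an unbalanced binary search tree to keep the sketch short, whereas a real implementation would also rebalance):

    // An immutable binary search tree: insert() returns a new root and
    // shares every subtree that the insertion did not touch.
    interface TreeNode {
      readonly key: number;
      readonly left: TreeNode | null;
      readonly right: TreeNode | null;
    }

    function insert(root: TreeNode | null, key: number): TreeNode {
      if (root === null) return { key, left: null, right: null };
      if (key < root.key) {
        // Copy only the nodes on the search path; the right subtree is shared.
        return { key: root.key, left: insert(root.left, key), right: root.right };
      }
      if (key > root.key) {
        return { key: root.key, left: root.left, right: insert(root.right, key) };
      }
      return root; // key already present: the old version is reused as-is
    }

    // Readers holding v1 keep seeing a stable snapshot while v2 exists.
    const v1 = insert(insert(null, 5), 3);
    const v2 = insert(v1, 7);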
A reason you might not have found a more sophisticated approach is that your problem is extremely general: You want this to work with any data structure at all and the data can be big.
SQL Server Hekaton uses a quite sophisticated data structure to achieve lock-free, readable, point-in-time snapshots of any database contents. Maybe it's worth a look at how they do it (they released a paper describing every detail of the system). They also allow ACID transactions, serializability and concurrent writes, all lock-free.
At the same time, f copies s's updated Ds into its Df.
This copy will take a long time because the data is big. It will block readers. A better approach is to apply the log of writes to the writable copy before accepting new writes there. That way reads can be accepted continuously.
The switchover is also a short period where reads might see slightly higher latency than normal.
I'm writing my first 'serious' Node/Express application, and I'm becoming concerned about the number of O(n) and O(n^2) operations I'm performing on every request. The application is a blog engine, which indexes and serves up articles stored in markdown format in the file system. The contents of the articles folder do not change frequently, as the app is scaled for a personal blog, but I would still like to be able to add a file to that folder whenever I want, and have the app include it without further intervention.
Operations I'm concerned about
When /index is requested, my route is iterating over all files in the directory and storing them as objects
When a "tag page" is requested (/tag/foo) I'm iterating over all the articles, and then iterating over their arrays of tags to determine which articles to present in an index format
Now, I know that this is probably premature optimisation, as the performance is still satisfactory with fewer than 200 files, though definitely not lightning fast. And I also know that in production, measures like this wouldn't be considered necessary/worthwhile unless backed by significant benchmarking results. But as this is purely a learning exercise/demonstration of ability, and as I'm (perhaps excessively) concerned about learning optimal habits and patterns, I worry I'm committing some kind of sin here.
Measures I have considered
I get the impression that a database might be a more typical solution, rather than filesystem I/O. But this would mean monitoring the directory for changes and processing/adding new articles to the database, a whole separate operation/functionality. If I did this, would it make sense to be watching that folder for changes even when a request isn't coming in? Or would it be better to check the freshness of the database, then retrieve results from the database? I also don't know how much this helps ultimately, as database calls are still async/slower than internal state, aren't they? Or would a database query, e.g. articles where tags contain x be O(1) rather than O(n)? If so, that would clearly be ideal.
Also, I am beginning to learn about techniques/patterns for caching results, e.g. a property on the function containing the previous result, which could be checked for and served up without performing the operation. But I'd need to check whether the folder had new files added to know if it was OK to serve up the cached version, right? More fundamentally (and this is the essential newbie question at hand), is it considered OK to do this? Everyone talks about how node apps should be stateless, and this would amount to maintaining state, right? Once again, I'm still a fairly raw beginner, so reading the source of mature apps isn't always as enlightening to me as I wish it was.
Also, have I fundamentally misunderstood how routes work in Node/Express? If I store a variable in index.js, are all the variables/objects created by it destroyed when the route is done and the page is served? If so, I apologise profusely for my ignorance, as that would negate basically everything discussed and make maintaining an external database (or just continuing to redo the file I/O) the only solution.
First off, the request and response objects that are part of each request last only for the duration of a given request and are not shared by other requests. They will be garbage collected as soon as they are no longer in use.
But, module-scoped variables in any of your Express modules last for the duration of the server. So, you can load some information in one request, store it in a module-level variable and that information will still be there when the next request comes along.
Since multiple requests can be "in-flight" at the same time if you are using any async operations in your request handlers, then if you are sharing/updating information between requests you have to make sure you have atomic updates so that the data is shared safely. In node.js, this is much simpler than in a multi-threaded response handler web server, but there still can be issues if you're doing part of an update to a shared object, then doing some async operation, then doing the rest of an update to a shared object. When you do an async operation, another request could run and see the shared object.
When not doing an async operation, your Javascript code is single threaded so other requests won't interleave until you go async.
It sounds like you want to cache your parsed state into a simple in-memory Javascript structure and then intelligently update this cache of information when new articles are added.
Since you already have the code to parse your set of files and tags into in-memory Javascript variables, you can just keep that code. You will want to package that into a separate function that you can call at any time and it will return a newly updated state.
Then, you want to call it when your server starts and that will establish the initial state.
All your routes can be changed to operate on the cached state and this should speed them up tremendously.
Then, all you need is a scheme to decide when to update the cached state (e.g. when something in the file system changed). There are lots of options and which to use depends a little bit on how often things will change and how often the changes need to get reflected to the outside world. Here are some options:
You could register a file system watcher for a particular directory of your file system and when it triggers, you figure out what has changed and update your cache. You can make the update function as dumb (just start over and parse everything from scratch) or as smart (figure out what one item changed and update only that part of the cache) as it is worth doing. I'd suggest you start simple and only invest more in it when you're sure that effort is needed.
You could simply rebuild the cache once every hour, whether or not anything has changed. Updates would take an average of 30 minutes to show up, but this would take 10 seconds to implement (a sketch combining this with option 1 follows after this list).
You could create an admin function in your server to instruct the server to update its cache now. This might be combined with option 2, so that if you added new content, it would automatically show within an hour, but if you wanted it to show immediately, you could hit the admin page to tell it to update its cache.
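A minimal sketch of options 1 and 2 combined, in TypeScript-flavoured Express; buildCacheFromDisk() stands in for the parsing code you already have, and the ./articles path is illustrative:

    import express from "express";
    import * as fs from "fs";

    interface Article { slug: string; tags: string[]; html: string; }

    let cache: Article[] = [];

    // Your existing "read the folder and parse every markdown file" code,
    // packaged as one function that returns a fresh in-memory state.
    function buildCacheFromDisk(): Article[] {
      // ... read ./articles, parse front matter and markdown ...
      return [];
    }

    function refreshCache(): void {
      cache = buildCacheFromDisk();
    }

    refreshCache(); // establish the initial state at server start

    // Option 1: rebuild whenever the folder changes (debounce this in real code,
    // since fs.watch can fire several events for a single save).
    fs.watch("./articles", () => refreshCache());

    // Option 2: belt-and-braces rebuild once an hour.
    setInterval(refreshCache, 60 * 60 * 1000);

    const app = express();
    app.get("/tag/:tag", (req, res) => {
      // Routes now read the in-memory cache instead of re-reading the disk.
      res.json(cache.filter(a => a.tags.includes(req.params.tag)));
    });
    app.listen(3000);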
I have a console/desktop application that crawls a lot of data (think a million calls) from various web services. At any given time I have about 10 threads performing these calls and aggregating the data into a MySQL database. All seeds are also stored in a database.
What would be the best way to report its progress? By progress I mean:
How many calls already executed
How many failed
What's the average call duration
How much is left
I thought about logging all of them somehow and tailing the log to get the data. Another idea was to offer some kind of output to an always-open TCP endpoint where some form of UI could read the data and display some aggregation. Both ways look too rough and too complicated.
Any other ideas?
The "best way" depends on your requirements. If you use a logging framework like NLog, you can plug in a variety of logging targets like files, databases, the console or TCP endpoints.
You can also use a viewer like Harvester as a logging target.
When logging multi-threaded applications, I sometimes have an additional thread that writes a summary of progress to the logger every so often (e.g. every 15 seconds).
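The question is presumably .NET, but the periodic-summary pattern is language-neutral; here is a sketch in TypeScript with assumed counter names, where the crawler workers update the shared stats object as calls finish:

    // Shared counters, updated by the crawler workers as calls complete.
    const stats = { done: 0, failed: 0, totalDurationMs: 0, total: 1000000 };

    // The "summary" reporter: one aggregated line every 15 seconds.
    setInterval(() => {
      const avgMs = stats.done > 0 ? stats.totalDurationMs / stats.done : 0;
      console.log(
        `progress: ${stats.done}/${stats.total} calls done, ` +
        `${stats.failed} failed, avg call ${avgMs.toFixed(0)} ms, ` +
        `${stats.total - stats.done} left`
      );
    }, 15000);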
Since it is a console application, just use Console.WriteLine and have the application spit the important stuff out to the console.
I did something similar in an application that I created to export PDFs from a SQL Server database back into PDF format.
You can do it many different ways. If you are counting records and their sizes, you can run a tally of sorts and have it show the total every so many records.
I also wrote out to a text file, so that I could keep track of all the PDFs, which case numbers they went to, and things like that. That information is in the answer I gave to the question linked above.
You could also write the statistics out to a text file every so often.
The logger that Eric J. mentions is probably going to be a little easier to implement, and would be a nice tool for your toolbox.
All of these options are valid, depending on your specific needs.
We're designing a backbone application, in which each server-side collection has the potential to contain tens of thousands of records. As an analogy - think of going into the 'Sent Items' view of an email application.
In the majority of Backbone examples I've seen, the collections involved are at most 100-200 records, and therefore fetching the whole collection and working with it in the client is relatively easy. I don't believe this would be the case with a much larger set.
Has anyone done any work with Backbone on large server-side collections?
Have you encountered performance issues (especially on mobile devices) at a particular collection size?
What decision(s) did you take around how much to fetch from the server?
Do you download everything or just a subset?
Where do you put the logic for any custom mechanism (on the Collection prototype, for example)?
Yes, at about 10,000 items, older browsers could not handle the display well. We thought it was a bandwidth issue, but even locally, with as much bandwidth as a high-performance machine could throw at it, Javascript just kinda passed out. This was true on Firefox 2 and IE7; I haven't tested it on larger systems since.
We were trying to fetch everything. This didn't work for large datasets. It was especially pernicious with Android's browser.
Our data was in a tree structure, with other data depending upon the presence of data in that tree. The data could change due to actions from other users or other parts of the program. Eventually, we made the tree structure fetch only the currently visible nodes, and the other parts of the system independently verified the validity of the datasets on which they depended. This is a race condition, but in actual deployment we never saw any problems. I would have liked to use socket.io here, but management didn't understand or trust it.
Since I use CoffeeScript, I just inherited from Backbone.Collection and created my own superclass, which also implements a custom sync() call. The syntax for invoking a superclass's method is really useful here:
class Dataset extends BaseAccessClass
  initialize: (attributes, options) ->
    Dataset.__super__.initialize.apply(@, arguments)
    # Customizations go here.
Like Elf said, you should really paginate data loading from the server; you'd save a lot of load on the server by not downloading items you may not need. Just creating a collection with 10k models locally in Chrome takes half a second. It's a huge load.
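A sketch of server-side paging with a stock Backbone collection (TypeScript-ish; the page/per_page parameter names are whatever your server happens to expect):

    import * as Backbone from "backbone";

    const sentItems = new Backbone.Collection();

    // fetch() forwards `data` to jQuery.ajax, so the server receives
    // ?page=2&per_page=50 and can return just that slice of the collection.
    sentItems.fetch({
      data: { page: 2, per_page: 50 },
      remove: false, // keep the pages that were already loaded
    });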
You can put the work on another CPU thread by using a worker, and then send transient objects to the main thread to render them in the DOM.
Once you have a collection that big rendering in the DOM, lazy rendering will only get you so far. The memory will slowly increase until it crashes the browser (which happens quickly on tablets). You should use object pooling for the elements; it allows you to set a small maximum memory footprint and keep it there.
I'm building a PerfView for Backbone that can render 1,000,000 models and scroll at 120FPS on Chrome. The code is all up on GitHub: https://github.com/puppybits/BackboneJS-PerfView. It's commented, so it shows a lot of the other optimizations you'd need to display large data sets.