YUI DataTable performance

In YUI3 (3.11) adding 100 items to my table takes 800ms. What are some ways to optimize this (and why is this so slow in the first place)?
data = data.slice(1, 100);
data.forEach(function (item) {
    data_table.data.add({ 'name': item });
});

DataTable.data uses a YUI ModelList internally to store data, which turns every row into a Model and fires an add event for each one you add to your table. Because of that, it currently isn't well suited to adding many rows at once.
The best way to solve your problem is probably to reset the table with only the rows you want to show:
data = data.slice(1, 100);
data_table.data.reset(data);
That will let you add data to your DataTable with a lot less overhead, and prevent the firing of those additional events.
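For completeness, here is a small sketch that keeps the { name: ... } row shape from the question, since reset() takes an array of row objects (or Model instances) rather than raw values:

data = data.slice(1, 100);
var rows = data.map(function (item) {
    return { name: item };
});
data_table.data.reset(rows);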

How to filter data in extractor?

I've got a long-running pipeline that has some failing items (items that at the end of the process are not loaded because they fail database validation or something similar).
I want to rerun the pipeline, but only process the items that failed the import on the last run.
I have a system in place where I check each item ID (received from an external source). I do this check in my loader: if I already have that item ID in the database, I skip loading/inserting that item.
This works great. However, it's slow, since I do extract-transform-load for each of these items, and only then, on load, I query the database (one query per item) and compare item IDs.
I'd like to filter out these records sooner. If I do it in the transformer, I can again only do it per item. It looks like the extractor could be the place, or I could pass records to the transformer in batches and then filter and explode the items in the (first) transformer.
What would be the better approach here?
I'm also thinking about reusability of my extractor, but I guess I could live with the fact that one extractor does both extract and filter. I think the best solution would be to be able to chain multiple extractors. Then I'd have one that extracts the data and another one that filters the data.
EDIT: Maybe I could do something like this:
already_imported_item_ids = Items.pluck(:item_id)

Kiba.run(
  Kiba.parse do
    source(...)

    transform do |item|
      next if already_imported_item_ids.include?(item)
      item
    end

    transform(...)
    destination(...)
  end
)
I guess that could work?
A few hints:
The higher (sooner) in the pipeline, the better. If you can find a way to filter out right from the source, the cost will be lower, because you do not have to manipulate the data at all.
If your scale is small enough, you could load the full list of ids once at the start in a pre_process block (mostly what you have in mind in your code sample), then compare right after the source (see the sketch after these hints). Obviously this doesn't scale infinitely, but it can work for a long time depending on your dataset size.
If you need a higher scale, I would advise either working with a buffering transform (grouping N rows) so that a single SQL query can verify the existence of all N row ids in the target database, or working with groups of rows and then exploding them back into individual rows.
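To make the pre_process idea concrete, here is a minimal, runnable sketch. It assumes the kiba gem; the in-memory source/destination classes and the item_id field are stand-ins for your real extractor, loader and schema (in the real job the pre_process block would do something like Items.pluck(:item_id)):

require "kiba"

# Stand-ins so the sketch runs on its own; swap in your real source/destination.
class ArraySource
  def initialize(rows)
    @rows = rows
  end

  def each(&block)
    @rows.each(&block)
  end
end

class ArrayDestination
  def initialize(sink)
    @sink = sink
  end

  def write(row)
    @sink << row
  end
end

already_imported_item_ids = nil
loaded_rows = []

job = Kiba.parse do
  pre_process do
    # Real job: already_imported_item_ids = Items.pluck(:item_id)
    already_imported_item_ids = %w[a b]
  end

  source ArraySource, [{ item_id: "a" }, { item_id: "c" }]

  # Filter right after the source: returning nil drops the row, so rows that
  # were already imported never reach the later transforms or the loader.
  transform do |row|
    already_imported_item_ids.include?(row[:item_id]) ? nil : row
  end

  destination ArrayDestination, loaded_rows
end

Kiba.run(job)
loaded_rows # => [{ item_id: "c" }]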

How to use SplitOn with singleInstance?

I have a logic app with a sql trigger that gets multiple rows.
I need to split on the rows so that I have a better overview about the actions I do per row.
Now I would like that the logic app is only working on one row at a time.
What would be the best solution for that since
"operationOptions": "singleInstance", and
"runtimeConfiguration": {
"concurrency": {
"runs": 1
}
},
are not working with splitOn.
I was also thinking about calling another logic app and having that logic app use a runtimeConfiguration, but that just sounds like an ugly workaround.
Edit:
The row is atomic, and no sorting is needed. Each row can be worked on separately and independent of other data.
As far as I can tell I wouldn't use a foreach for that, since then one failure within a row will lead to a failed logic app.
If one dataset (row) fails, the others should still be tried and the error should be easily visible.
Yes, you are seeing the expected behavior. Keep in mind, the split happens in the trigger, not the workflow. BizTalk works the same way except it's a bit more obvious there.
You don't want concurrent processing, you want ordered processing. Right now, the most direct way to handle this is by Foreach'ing over the collection. Though waiting ~3 weeks might be a better option.
One decision point will be whether the atomicity is the collection or the item. Also, you'll need to know if overlapping batches are ok or not.
For instance, if you need to process all items in order, with batch level validation, Foreach with concurrency = 1 is what you need.
Today (as of 2018-03-06) concurrency control is not supported for split-on triggers.
Having said that, concurrency control should be enabled for all trigger types (including split-on triggers) within the next 2-3 weeks.
In the interim you could remove the splitOn property on your trigger and set its concurrency limit to 1. This will start a single run for the entire collection of items, but you can use a foreach loop in your definition to limit concurrency as well. The drawback here is that the trigger will wait until the run as a whole is completed (all items are processed), so the throughput will not be optimal.
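For illustration only, here is roughly what that foreach could look like once the trigger's splitOn is removed. This is a sketch: the action names are placeholders and the @triggerBody()?['value'] path depends on your SQL trigger's actual output shape.

"For_each_row": {
    "type": "Foreach",
    "foreach": "@triggerBody()?['value']",
    "runAfter": {},
    "runtimeConfiguration": {
        "concurrency": {
            "repetitions": 1
        }
    },
    "actions": {
        "Process_single_row": {
            "type": "Compose",
            "inputs": "@item()",
            "runAfter": {}
        }
    }
}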

How to stop power query querying full original dataset

I have an excel file connected to an access database. I have created a query through Power Query that simply brings the target table into the file and does a couple of minor things to it. I don’t load this to a worksheet but maintain a connection only.
I then have a number of other queries linking to the table created in the first query.
In one of these linked queries, I apply a variety of filters to exclude certain products, customers and so on. This reduces the 400,000 records in the original table in the first query down to around 227,000 records. I then load this table to a worksheet to do some analysis.
Finally I have a couple of queries looking at the 227,000 record table. However, I notice that when I refresh these queries and watch the progress in the right hand pane, they still go through 400,000 records as if they are looking through to the original table.
Is there any way to stop this happening in the expectation that doing so would help to speed up queries that refer to datasets that have themselves already been filtered?
Alternatively is there a better way to do what I’m doing?
Thanks
First: How are you refreshing your queries? If you execute them one at a time then yes, they're all independent. However, when using Excel 2016 on a workbook where "Fast Data Load" is disabled on all queries, I've found that a Refresh All does cache and share query results with downstream queries!
Failing that, you could try the following:
Move the query that makes the 227,000-row table into its own group called "Refresh First"
Place your cursor in your 227,000-row table and click Data - Get & Transform - From Table,
Change all of your queries to pull from this new query rather than the source.
Create another group called "Refresh Second" that contains every query that is downstream of the query you created in step 2 and loads data to the workbook
Move any remaining queries that load to the workbook into "Refresh First", "Refresh Second", or some other group. (By the way: I usually also have a "Connections" group that holds every query that doesn't load data to the workbook, too.)
Unfortunately, once you do this, "Refresh All" would have to be done twice to ensure all source changes are fully propagated, because those 227,000 rows will be used before they've been updated from the 400,000. If you're willing to put up with this and refresh manually then you're all set: just right-click and refresh the first group, wait, then right-click and refresh the second one.
For a more idiot-proof way of refreshing... you could try automating it with VBA, but queries normally refresh in the background; it will take some extra work to ensure that the second group of queries aren't started before all of the queries in your "Refresh First" group are completed.
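If you do go the VBA route, a rough sketch like the following could force each group to refresh synchronously, in order. This assumes the default connection names of the form "Query - <QueryName>"; the specific names below are made up, so check Data > Queries & Connections for the real ones.

Sub RefreshGroupsInOrder()
    Dim firstGroup As Variant, secondGroup As Variant, n As Variant

    firstGroup = Array("Query - Filtered227k")                  ' hypothetical names
    secondGroup = Array("Query - Analysis1", "Query - Analysis2")

    For Each n In firstGroup
        With ThisWorkbook.Connections(n)
            .OLEDBConnection.BackgroundQuery = False             ' force a synchronous refresh
            .Refresh
        End With
    Next n

    For Each n In secondGroup
        With ThisWorkbook.Connections(n)
            .OLEDBConnection.BackgroundQuery = False
            .Refresh
        End With
    Next n
End Sub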
Or... I've learned to strike a balance between fidelity in the real world but speed when developing by doing the following:
Create a query called "ProductionMode" that returns true if you want full data, or false if you're just testing each query. This can just be a parameter if you like (a tiny example is sketched after these steps).
Create a query called "fModeSensitiveQuery" defined as
let
    // Get this once per time this function is retrieved and cached, OUTSIDE of
    // what happens each time the returned function is executed
    queryNameSuffix = if ProductionMode then
        ""
    else
        " Cached",

    // We can now use the pre-rendered queryNameSuffix value as a private
    // variable that's not computed each time the function is called
    returnedFunction = (queryName as text) as table =>
        Expression.Evaluate(
            Expression.Identifier(queryName & queryNameSuffix),
            #shared
        )
in
    returnedFunction
For each slow query ("YourQueryName") that loads to the table,
Create "YourQueryName Cached" as a query that pulls straight from results the table.
Create "modeYourQueryName" as a query defined as fModeSensitiveQuery("YourQueryName")
Change all queries that use YourQueryName to use modeYourQueryName instead.
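For reference, a couple of hedged sketches of what these helper queries can look like. ProductionMode can be as simple as a blank query returning a logical value, and "YourQueryName Cached" can read the already-loaded worksheet table back in (this assumes the loaded table kept its default name, i.e. the query name with any spaces turned into underscores):

// ProductionMode
let
    Source = true
in
    Source

// YourQueryName Cached: pull the rows already loaded to the worksheet
// instead of recomputing the whole upstream query
let
    Source = Excel.CurrentWorkbook(){[Name = "YourQueryName"]}[Content]
in
    Source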
Now you can flip ProductionMode to true and changes propagate completely, or flip ProductionMode to false and you can test small changes quickly; if you're refreshing just one query it isn't recomputing the entire upstream to test it! Plus, I don't know why but when doing a Refresh All I'm pretty sure it also speeds up the whole thing even when ProductionMode is true!!!
This method has three caveats that I'm aware of:
Be sure to update your "YourQueryName Cached" query any time the "YourQueryName" query's resulting columns are added, removed, renamed, or typed differently. Or better yet, delete and recreate them. You can do this because,
Power Query won't recognize your "YourQueryName" and "YourQueryName Cached" queries as dependencies of "modeYourQueryName". The Query Dependencies diagram won't be quite right, you'll be able to delete "YourQueryName" or "YourQueryName Cached" without Power Query stopping you, and renaming "YourQueryName" will break things instead of Power Query automatically changing all of your other queries accordingly.
While faster, the user-experience is a rougher ride, too! The UI gets a little jerky because (and I'm totally guessing, here) this technique seems to cause many more queries to finish simultaneously, flooding Excel with too many repaint requests at the same time. (This isn't a problem, really, but it sure looks like one when you aren't expecting it!)

Mongodb document insertion order

I have a mongodb collection for tracking user audit data. So essentially this will be many millions of documents.
Audits are tracked by loginID (user) and their activities on items. example: userA modified 'item#13' on date/time.
Case: I need to query with filters based on user and item. That's simple. This returns many thousands of documents per item. I need to list them by latest date/time (descending order).
Problem: How can I insert new documents at the top of the stack (like a capped collection)? Or is it possible to find records from the bottom of the stack (reverse order)? I do NOT like the idea of find and sorting, because when dealing with thousands and millions of documents sorting is a bottleneck.
Any solutions?
Stack: mongodb, node.js, mongoose.
Thanks!
the top of the stack?
you're implying there is a stack, but there isn't - there's a tree, or more precisely, a B-Tree.
I do NOT like the idea of find and sorting
So you want to sort without sorting? That doesn't seem to make much sense. Stacks are essentially in-memory data structures; they don't work well on disk because they require huge contiguous blocks (in fact, huge stacks don't even work well in memory, and growing a stack requires copying the entire data set, so that would hardly work).
sorting is a bottleneck
It shouldn't be, at least not for data that is stored closely together (data locality). Sorting is an O(m log n) operation, and since the _id field already encodes a timestamp, you already have a field that you can sort on. m is relatively small, so I don't see the problem here. Have you even tried that? With MongoDB 3.0, index intersection has become more powerful; you might not even need _id in the compound index.
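For reference, a quick sketch of the kind of compound index and query I mean (collection and field names are made up; adapt them to your schema):

// Equality fields first, then _id descending for the newest-first sort
db.audits.createIndex({ loginID: 1, item: 1, _id: -1 })

// Top 20 newest entries for one user/item pair; the index serves both the
// filter and the sort, so MongoDB never sorts thousands of documents in memory
db.audits.find({ loginID: "userA", item: "item#13" }).sort({ _id: -1 }).limit(20)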
On my machine, getting the top items from a large collection, filtered by an index takes 1ms ("executionTimeMillis" : 1) if the data is in RAM. The sheer network overhead will be in the same league, even on localhost. I created the data with a simple network creation tool I built and queried it from the mongo console.
I have encountered the same problem. My solution is to create an additional collection which maintains the top 10 records. The good point is that you can query it quickly. The bad point is that you need to update the additional collection.
I found this which inspired me. I implemented my solution with ruby + mongoid.
My solution:
collection definition
class TrainingTopRecord
  include Mongoid::Document

  field :training_records, :type => Array

  belongs_to :training

  index({training_id: 1}, {unique: true, drop_dups: true})
end
The maintenance process:
if t.training_top_records == nil
  training_top_records = TrainingTopRecord.create! training_id: t.id
else
  training_top_records = t.training_top_records
end

training_top_records.training_records = [] if training_top_records.training_records == nil

top_10_records = training_top_records.training_records
top_10_records.push({
  'id' => r.id,
  'return' => r.return
})
top_10_records.sort_by! { |record| -record['return'] }

# limit training_records' size to 10
top_10_records.slice! 10, top_10_records.length - 10

training_top_records.save
MongoDB's ObjectId embeds a creation timestamp, so _id values have a natural insertion ordering.
This means the last inserted item is normally fetched last.
You can reverse that by using db.collectionName.find().sort({ _id: -1 }) during a fetch.
Filters can then follow.
You will not need to create any additional indexes, since this sorts on _id, which is indexed by default.
This is possibly the only efficient way you can achieve what you want.

Which is the best method to do pagination so that load on server is minimum

I have done a bit of research on pagination, and from what I have read there are two contradictory ways of doing it:
Load a small set of data from the database each time a user clicks next
Problem - Suppose there are a million rows that meet the WHERE conditions. That means a million rows are retrieved, stored, filesorted, then most of them are discarded and only 20 retrieved. If the user clicks the "next" button, the same process happens again, only a different 20 are retrieved. (ref - http://www.mysqlperformanceblog.com/2008/09/24/four-ways-to-optimize-paginated-displays/)
Load all the data from the database and cache it... This has a few problems too, mentioned here: http://www.javalobby.org/java/forums/t63849.html
So I know I will have to use a hybrid of both. However, the question boils down to: which operation is more expensive -
making repeated queries to the database for small chunks of data
or
transferring a large result set over the network
My company has exactly this situation, and we've chosen a bit of a hybrid. Our data is tabular, so we send it via AJAX to DataTables. This allows for good UI formatting, sorting, filtering, and show/hide of columns. DataTables has a great solution called "pipelining" that will "queue ahead": it grabs a quantity of data ahead of the user's action (in our case, up to 5 times the records they request) and then pages through without requests until it runs out of data. It's EXTREMELY easy to implement with DataTables, but I suspect a similar solution would not be difficult if you had to write it by hand using jQuery's AJAX functionality (a rough sketch of the idea follows).
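The sketch below shows the general pipelining idea using DataTables' ajax-as-a-function option, not the official pipeline plug-in from the DataTables examples. It ignores sort/filter cache invalidation to stay short, and the endpoint, parameter names, and columns are made up:

var cache = { start: 0, rows: [], recordsTotal: 0, recordsFiltered: 0 };
var PAGES_AHEAD = 5;

$('#audit-table').DataTable({
    serverSide: true,
    ajax: function (request, callback) {
        var end = request.start + request.length;
        var haveIt = request.start >= cache.start &&
                     end <= cache.start + cache.rows.length;

        if (haveIt) {
            // Serve this page straight from the cached block - no request made.
            callback({
                draw: request.draw,
                recordsTotal: cache.recordsTotal,
                recordsFiltered: cache.recordsFiltered,
                data: cache.rows.slice(request.start - cache.start, end - cache.start)
            });
            return;
        }

        // Cache miss: fetch several pages worth of rows in one round trip.
        $.ajax({
            url: '/api/audits',   // hypothetical endpoint
            dataType: 'json',
            data: { start: request.start, length: request.length * PAGES_AHEAD }
        }).done(function (response) {
            cache = {
                start: request.start,
                rows: response.data,
                recordsTotal: response.recordsTotal,
                recordsFiltered: response.recordsFiltered
            };
            callback({
                draw: request.draw,
                recordsTotal: response.recordsTotal,
                recordsFiltered: response.recordsFiltered,
                data: response.data.slice(0, request.length)
            });
        });
    },
    columns: [{ data: 'name' }, { data: 'date' }]   // hypothetical columns
});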
I tried doing a full load and cache on a 1.5 million record database and it was a trainwreck. The client almost dumped me because they got mad it was so slow. After a solid overnight of AJAX goodness, the client was happy once again. But best never to get to that point.
Good Luck.
