SubSonic & LinqTemplate: Why is read so slow here? - subsonic

Not sure where else to get this to the SubSonic crew since they took down their forums and said to use Stack Overflow, so here we go.
I've been struggling to achieve maximum performance with an ORM. I love SubSonic, and its bulk insert, update, and delete obliterate Entity Framework in this context; however, for straight simple reads, EF blows SubSonic out of the water with 5x faster performance.
I've stripped both down: removed change tracking, foreign key properties, navigation properties, everything down to POCOs.
EF executes a simple select of 1M rows in about 4.8 seconds; SubSonic's LINQ is taking 5x that...
Looking at SQL Profiler, the calls are slightly different:
Entity Framework: SELECT TOP (1000000)
[c].[Id] AS [Id],
[c].[ProjectIDL] AS [ProjectIDL],
[c].[DescriptorIDL] AS [DescriptorIDL],
[c].[FieldIDL] AS [FieldIDL],
[c].[Data] AS [Data],
[c].[OptionId] AS [OptionId]
FROM [dbo].[DescriptorFieldValues] AS [c]
SubSonic: SELECT TOP (1000000) [t0].[Data], [t0].[DescriptorIDL], [t0].[FieldIDL], [t0].[Id], [t0].[OptionId], [t0].[ProjectIDL]
FROM [dbo].[DescriptorFieldValues] AS t0
The SQL Profiler trace shows a massive duration difference between the two.
I checked the Audit Login event before both queries and they are identical...
If I run the same queries in SQL Server Management Studio, each takes about 11 seconds.
Looking into the SubSonic 3.0.4 source, I cannot figure out where I would make an adjustment to make it match EF. Is the query syntax here really making that big a difference, or is there some magic going on that I'm not aware of?
Thanks for your help!

Aside from the column ordering, the two SQL statements are for all intents and purposes identical. Finding that the two queries take the same amount of time to execute in SQL Server Management Studio seems to support this.
That would seem to indicate that it is SubSonic itself, rather than the generated SQL, that takes longer than EF. SubSonic is known to have some performance issues that could certainly explain this difference.
Really, though, we'd need far more detail on your exact usage to really pin down an answer to why this query is slower for you.
Also, if you are looking for the current activity and help with SubSonic, you should try their Google Group.
The creator of SubSonic, Rob Conery, stopped working on it quite a while ago, and the "SubSonic crew" haven't released any substantial updates since (if any at all; I don't think they have, but I'm not completely certain). It's a project you may as well consider "finished" in that work on it has ceased for all practical purposes, but in a way more work isn't really needed; it works.
By the way, the page that says the SubSonic forums are shut down and to come to Stack Overflow hasn't been updated in over a year, maybe two.

After getting berated by Rob Conery for wanting to learn why SubSonic is slow in comparison to EF, and being told we are stupid for wanting to learn why SubSonic is broken, we believe our team has identified a couple of places where these performance issues originate:
In Extensions/Database.cs, Load uses per-row, per-property reflection to materialize each DataRow into a concrete object.
In Extensions/Database.cs, ToEnumerable does all of the DataReader conversion in a single sequential while loop.
In ExecutionBuilder, the query compiler is a straight copy-paste of an alpha, "for educational purposes only" query compiler; it is woefully out of date and relies on DynamicInvoke.
Our team plans to make the following modifications, in the following order:
In Extensions/Database.cs ToEnumerable, pull the PropertyInfos once and pass them into Load; this is believed to have minimal performance impact, but likely a decent memory-utilization impact (a rough sketch of the idea appears below).
Modify ToEnumerable to multithread the materialization for large data sets.
Add a method to the object factories that constructs an object from a DataRow without using reflection, using the T4 template code generation ahead of time instead.
Inside Load, check the object activator for this interface method; if it exists use it, otherwise fall back to reflection-based materialization.
Update the ExecutionBuilder to avoid the use of DynamicInvoke.
Hopefully this should bring our performance needs to fruition.
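Here is the rough sketch referred to above. It is a minimal illustration of the general idea only, not SubSonic's actual API (FastMaterializer and ToObjects are made-up names), and it assumes .NET 4 for Expression.Assign: resolve property metadata once per type, compile setter delegates, and leave only delegate calls in the per-row loop.

// A minimal sketch (not SubSonic's actual API): resolve property metadata once
// per type and compile setter delegates, so per-row work is plain delegate
// calls instead of reflection. Assumes .NET 4 for Expression.Assign.
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Data;
using System.Linq.Expressions;
using System.Reflection;

public static class FastMaterializer
{
    // One compiled setter per writable property, cached per type.
    private static readonly ConcurrentDictionary<Type, Dictionary<string, Action<object, object>>> SetterCache =
        new ConcurrentDictionary<Type, Dictionary<string, Action<object, object>>>();

    private static Dictionary<string, Action<object, object>> GetSetters(Type type)
    {
        return SetterCache.GetOrAdd(type, t =>
        {
            var setters = new Dictionary<string, Action<object, object>>(StringComparer.OrdinalIgnoreCase);
            foreach (PropertyInfo prop in t.GetProperties(BindingFlags.Public | BindingFlags.Instance))
            {
                if (!prop.CanWrite) continue;

                // (object target, object value) => ((T)target).Prop = (PropType)value;
                var target = Expression.Parameter(typeof(object), "target");
                var value = Expression.Parameter(typeof(object), "value");
                var assign = Expression.Assign(
                    Expression.Property(Expression.Convert(target, t), prop),
                    Expression.Convert(value, prop.PropertyType));
                setters[prop.Name] = Expression.Lambda<Action<object, object>>(assign, target, value).Compile();
            }
            return setters;
        });
    }

    public static IEnumerable<T> ToObjects<T>(IDataReader reader) where T : class, new()
    {
        var setters = GetSetters(typeof(T)); // resolved once, not once per row
        while (reader.Read())
        {
            var item = new T();
            for (int i = 0; i < reader.FieldCount; i++)
            {
                if (reader.IsDBNull(i)) continue;
                Action<object, object> setter;
                if (setters.TryGetValue(reader.GetName(i), out setter))
                    setter(item, reader.GetValue(i));
            }
            yield return item;
        }
    }
}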
Thank you to Jeff V, Ken I, and QES for their help. It's sad to see the creator of SubSonic, Rob Conery, so defensive about SubSonic's performance, when it appears to be fairly easy to solve.
~ JT

SubSonic 3 is so slow because it needs to compile the same things again and again before it goes to SQL Server and asks for results.
This compilation happens in the LINQ layer.
For example, a simple static command like "SELECT * FROM Products WHERE ProductID > 100", which is typed in SubSonic 3 as Products.Find(x => x.ProductID > 100), first needs a lot of time to be converted into the string command, and I mean a very long time.
This is the main reason why SubSonic 3 is so slow, and that's why, to me, SubSonic 3 is worthless.
In LINQ to SQL the trick is the CompiledQuery.Compile function, which compiles the query once and keeps it in memory. However, even that is slow by my standards.
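For reference, the pattern looks roughly like this in LINQ to SQL; the context, entity, and connection string below are placeholders for whatever your model actually is:

// Rough LINQ to SQL illustration of CompiledQuery.Compile. The context and
// entity are placeholder names; the point is that the expression-to-SQL
// translation happens once and the resulting delegate is reused.
using System;
using System.Data.Linq;
using System.Data.Linq.Mapping;
using System.Linq;

[Table(Name = "Products")]
public class Product
{
    [Column(IsPrimaryKey = true)] public int ProductID;
    [Column] public string ProductName;
}

public class NorthwindDataContext : DataContext
{
    public NorthwindDataContext(string connection) : base(connection) { }
    public Table<Product> Products { get { return GetTable<Product>(); } }
}

public static class ProductQueries
{
    // Translated to SQL once; every call afterwards just executes it.
    public static readonly Func<NorthwindDataContext, int, IQueryable<Product>> ByMinId =
        CompiledQuery.Compile((NorthwindDataContext db, int minId) =>
            db.Products.Where(p => p.ProductID > minId));
}

// Usage:
// using (var db = new NorthwindDataContext("...connection string..."))
// {
//     var products = ProductQueries.ByMinId(db, 100).ToList();
// }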
Now, back to SubSonic 2: that is a really nice idea, but it still needs a lot of optimization inside. Apparently they didn't think much about speed. Anyway, with some optimization SubSonic 2 can be super fast, close to plain ADO commands.
My tests a year ago:
Benchmark Linq2SQL, Subsonic2, Subsonic3 - Any other ideas to make them faster?
My idea is to go back to SubSonic 2, make it better and faster, and drop SubSonic 3. I thank them for both SubSonic 2 and 3, but with version 3 they failed. That's OK; I have made many programs over the years and not all of them get used. It's no big deal.

Related

Xpages design elements, managed bean limits

Sorry, this is not really a coding question; I'm not sure if I should be posting it here.
I struggle with the concept of what is 'large' in Notes NSF application design elements, as opposed to the amount of data or records stored. For example, it is stated that we shouldn't have too many views, but 'too many' doesn't give any scale whatsoever: is it 10, 50, 100, or 500 before it 'slows down'? I realise it also depends on the view design, but some idea of 'too many' would be beneficial. In this instance data and design elements are in the same NSF.
Is there a recommendation regarding the number of elements such as XPages, Custom Controls, Managed Beans, Java classes, etc.? What would be deemed excessive? In this instance I have data and logic in separate NSFs.
Any guidance would be greatly appreciated.
Thanks
There is a limit on the number of design elements. But unless you're importing a whole JavaScript framework into an NSF, you're not likely to hit it.
As has been mentioned, view performance depends on many factors. 500 decently designed views can be fine; 50 badly performing views can be bad. Lots of re-sortable columns increases the number of indexes that need to be created and managed. Using @Today or @Now in a view selection formula or column formula will be a big problem. Having lots of documents that rarely change, smaller numbers of documents that are updated every 30 seconds, lots of users regularly updating - these will all have an impact on performance.
Performance of your code will also have an impact, and the XPages Toolbox or agent profiling will give you an idea of where the time goes. DocumentCollection.count() is slow, but sometimes it is needed. NoteCollections may be quicker. There are various blog posts covering this.
A managed bean that has a Map that grows and grows will impact Java memory.
But there are always performance enhancements being made on the server side. gRPC in Domino 10 will be extremely performant. So always try to be on a recent version and keep up to date with sessions at conferences etc so you know what TCO improvements are being made.
The bottom line is without an intimate understanding of your architecture and code, no one will be able to give you a definitive answer.

View Index is always being rebuilt

Lately I have encountered a problem with my view index being rebuilt all the time, and users are having massive issues with this particular view.
I figured it was due to @Date in my selection formula as well as in one of my column formulas; that way the selection formula would be different every second that passes.
So I figured that, since I don't need hours/minutes/seconds in my formulas, I would use @Today. This worked out well for 2-3 days, and after that the same problem occurred again.
Since the problem is back, I'm not quite sure that this even causes it. When this particular view is open, I have issues in every tab that's open in Notes, not only in this specific database.
Is this a common/known issue? What can I do to avoid this problem?
Yes, it's a common issue that has been well known since the very early days of Notes more than 20 years ago.
@Date is not a problem on its own. @Now and @Today are both problems.
Using @TextToTime("Today") was a popular workaround that was discovered early on. This hid the problem from the indexer, so the server failed to realize that the view was out of date. It doesn't solve the underlying problem, though, which is that the view is trying to do something that views simply aren't designed to do. Views are intended to be static, requiring an update only when documents change. Introducing time into a selection or column formula makes them dynamic, which kills that presumption and is a major source of performance problems.
Using this workaround requires that the view be fully rebuilt every night. You can do that by setting the view index options to "Manual" and setting up a program document to run an updall command with the -T option for the specific database and view once per night. Note that if your users are spread out across time zones, you'll have to pick one specific time as your standard, and if you have servers spread out across time zones you're going to have a lot of fun figuring out how to make them all show the same documents in the view at all times - but that's common to pretty much all approaches to the problem.
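For example, the nightly command would look something like this (run from a program document, or typed at the server console with the leading "load"; the database path and view name here are placeholders, and -R requests a full rebuild of that view rather than an incremental update):

load updall apps\myapp.nsf -T "ByToday" -R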
See this IBM Technote for a description of several other options that people have used over the years, with their pros and cons. Also see this article by Andre Guirard, which covers date/time issues in great detail.
I would add that the agent-and-folder solution that they describe in the Technote was generally my preferred approach, but it does have an additional disadvantage that they don't mention: it can eventually lead to an obscure situation where the server throws an error "Folder is larger than supported". This error actually has nothing to do with the size of the folder in documents; it refers to fragmentation of internal structures that occurs as large numbers of documents are moved in and out of the folder over time. It could only be fixed by deleting and re-creating the folder, which you can do in your agent code. I believe this problem may be fixed in more recent versions of Domino, but it caused me a lot of grief back in the Notes 6 and 7 timeframes.

Most suited search algorithm?

I'm now facing a problem and I'm not sure what the right solution is. I'll try to explain it, and I hope someone has some good solutions for me:
I have two big data arrays. One that I'm browsing has somewhere between 50^3 and 150^3 data samples (usually between 50^3 and 100^3; 150^3 is a rare worst case).
For every sample, I want to run a query against another structure of around the same size (so there is a huge number of total combinations; I can't explore them all).
The query can't be predicted exactly but usually, it is something like :
The structure has fields A, B, C, D, E, F, G (EDIT: in total, it's something like 10 to 20 int fields).
The query is something like:
10 < A < 20 and B > 100 and D > 200.
Yes, it's really close to SQL.
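In C# terms, the kind of filter I need is roughly the following (the struct and array are only an illustration of the flat 10-20 int fields, using the field names from the example above):

// Illustration only: the example query "10 < A < 20 and B > 100 and D > 200"
// expressed as a predicate over a flat in-memory array of int fields.
using System.Linq;

struct Sample
{
    public int A, B, C, D, E, F, G;
}

static class Demo
{
    static Sample[] Filter(Sample[] data)
    {
        return data.Where(s => s.A > 10 && s.A < 20 && s.B > 100 && s.D > 200)
                   .ToArray();
    }
}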
I thought to put this in a database, but actually it would be a standalone database, and I can work in RAM to make it even faster (speed is an essential criteria).
I thought about trying something with GPGPU, but it seems to be a terrible idea: even though the search itself can be parallelized, searching for an unpredictable number of results doesn't look like a good fit (if someone can tell me whether my understanding is right, it would help me confirm that I should forgo this approach).
EDIT: the number of results is unpredictable because of the nature of the query, but it is quite low, since the purpose is to find a small number of well-suited combinations.
Then, since I could use a DB, why not build a B-tree in RAM? It seems close to the solution, but is it? If it is, how should I build my indexes? Can I really do multidimensional indexes, since multidimensional searches will always exist? Perhaps a UB-tree or R-tree could do the job (but in my second data sample I could have some duplicates, so doesn't that make the R-tree inapplicable?).
The thing is, I'm not sure I understand all of these properly right now, so if one of you knows these trees (or GPGPU, or solutions I didn't think of), perhaps you could let me know which one I should explore, learn, and implement?
GPGPU is not a suitable choice because you are limited by the card's capacity, and since you are not telling us the data size of these samples, I am assuming that a Titan X-tier card will not suffice. If you could go really wild - Tesla or FirePro - then it might actually be worth it, since you mentioned that speed really matters. But I am going to speculate that those are out of your budget, and considering that you would have to learn CUDA or OpenCL to build something that will generally be a pain to port here and there, my take is "no".
You mentioned that you have an unpredictable number of results, and that is a bad thing. You should develop a formula that roughly estimates the amount of space needed; otherwise it will be disappointing to have your program work on something for quite some time only to hit a capacity error or crash. On the other hand, if RAM capacity is not sufficient, you could work "database style", fetching data from storage when needed (which is quite bothersome to implement because of the scheduling involved).
If you do have the time to go bespoke, here is a helpful link. Remember, you are going to stumble a lot, but when you make it you will have learnt a ton of stuff:
https://www.quora.com/What-are-some-fast-similarity-search-algorithms-and-data-structures-for-high-dimensional-vectors
In my opinion, an in-memory database is the easiest and at the same time most reliable option without compromising on speed. Which one to use is up to you; I think MemSQL is a good one.
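That said, even a hand-rolled in-memory index gets you a long way for range predicates like the one in the question: sort the array once on one field, binary-search the bounds of that field's range, and apply the remaining conditions only to that slice. A rough sketch follows; indexing on A is an arbitrary assumption, and in practice you would pick the most selective field.

// Hand-rolled range index over the flat struct from the question: sort once
// on A, binary-search the lower bound, scan only while A stays in range, and
// apply the residual conditions to that slice.
using System.Collections.Generic;
using System.Linq;

struct Sample
{
    public int A, B, C, D, E, F, G;
}

static class RangeIndex
{
    // Built once up front: the data sorted by A.
    public static Sample[] BuildIndex(IEnumerable<Sample> data)
    {
        return data.OrderBy(s => s.A).ToArray();
    }

    // First index whose A is >= value (classic lower bound).
    static int LowerBound(Sample[] sortedByA, int value)
    {
        int lo = 0, hi = sortedByA.Length;
        while (lo < hi)
        {
            int mid = (lo + hi) / 2;
            if (sortedByA[mid].A < value) lo = mid + 1; else hi = mid;
        }
        return lo;
    }

    // 10 < A < 20 AND B > 100 AND D > 200, scanning only the qualifying A range.
    public static IEnumerable<Sample> Query(Sample[] sortedByA)
    {
        for (int i = LowerBound(sortedByA, 11); i < sortedByA.Length && sortedByA[i].A < 20; i++)
        {
            Sample s = sortedByA[i];
            if (s.B > 100 && s.D > 200)
                yield return s;
        }
    }
}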

Oracle Sort Order - What may cause it to change

Disclaimer: I know that it is bad to not use an 'ORDER BY' in SQL when sorted data is required.
I am currently supporting a Pro*C program which is having a weird problem.
One of the possible causes of the weird problem may be that the original developers (from a long time ago) did not use ORDER BY in their SQL even though the program logic depends on it!
The program has been working fine all these years and started showing problems only recently.
We are trying to pin the weird problem on the ORDER BY mistake (there are other candidate causes, such as a recent port from Solaris to Linux).
What shadowy things on the database end should we look at that may have changed the old sort order? Things like data files, etc.?
Anybody have any experience with Pro*C on Solaris magically sorting the result-set?
Thanks!
Since you know that the program cares about the order in which results are returned and you know that the query that is submitted is missing an ORDER BY clause, is there a reason that you don't just fix the problem rather than looking to try to figure out whether the actual order of results may have changed? If you fix the known ORDER BY problem and the "weird problem" you have disappears, that would provide some pretty good evidence that the "weird problem" is, in fact, caused by the missing ORDER BY.
Unfortunately, there are lots of things that might have caused the order of results to change many of which may be impossible to track down. The most obvious cause would be a change in the execution plan. That, in turn, may have been caused either because statistics changed or because statistics didn't change enough or because of a patch or because of an initialization parameter change or because of a client configuration change among other things. If you are licensed to use the AWR (Automatic Workload Repository), you might be able to find evidence that the plan has changed by looking to see if there are multiple PLAN_HASH_VALUE values for the SQL_ID in DBA_HIST_SQLSTAT over different days. If there are, you'd still have to try to figure out whether the different plans actually caused the results to be returned in a different order. Beyond a query plan change, though, there are dozens of other possible causes. The physical order of data on disk may have changed because someone reorganized the table or because someone moved data files around on the disk or because the SAN automatically rebalanced something by moving data around. Some data may have been cached (or may not have been cached) in general in the past that is now cached. An Oracle patch may have been applied.
I suggest that you replace your physical table with a view and enforce the required order in that view.
Example:
TABLE_NOT_SORTED --> rename to --> PHYS_TABLE_NOT_SORTED
CREATE VIEW TABLE_NOT_SORTED
AS
SELECT * FROM PHYS_TABLE_NOT_SORTED
ORDER BY DESIRED_COLUMNS
In response to a comment:
According to this question and the Ask Tom answer, since Oracle does not guarantee any default ordering when you do not use ORDER BY, it is free to change it. They are absolutely right, of course: if you need sorting, use ORDER BY.
Other than that, we cannot say anything about your code or the default ordering.

What is the best way to search multiple sources simultaneously?

I'm writing a phonebook search that will query multiple remote sources, but I'm wondering how best to approach this task.
The easiest way to do this is to take the query, start a thread per remote source (limiting max results to, say, 10), wait for the results from all threads, aggregate the list into a total of 10 entries, and return them.
BUT... which of the remote sources is more important if all sources return at least 10 results? Then I would have to do a search on the search results. While this would yield accurate information, it seems inefficient and unlikely to scale well.
Is there a solution, commercial or open source, that I could use and extend, or is there a clever algorithm I've missed?
Thanks
John, I believe what you want is federated search. I suggest you check out Solr as a framework for this. I agree with Nick that you will have to evaluate the relative quality of the different sources yourself, and build a merge function. Solr has some infrastructure for this, as this email thread shows.
To be honest, I haven't seen a ready-made solution, but this is why we programmers exist: to create one when it isn't readily available :-)
The way I would do it is similar to what you describe: using threads. If this is a web application, then AJAX is your friend for speed and usability; for a desktop app, GUI representation isn't even an issue.
It sounds like you can't determine or guess up front which source is the best in terms of reliability, speed and number of results, so you need to set up your program to determine the best results on the fly. Let's say you have 10 data sources, and therefore 10 threads. When you fire up your threads, wait for the first one to return with more than 0 results; this becomes your "master" result. As other threads return, you compare them against the "master" result and add new entries. There is really no way to avoid this if you want to provide unique results. You can start displaying results as soon as you have your first thread back. You don't have to update your screen right away with all the new results as they come in, but if it takes some time the user may become agitated. You can just have some sort of indicator that shows more results are available, if you have more than 10, for instance.
If you only have a few sources, like 10, and you limit the number of results per source to around 10, it really shouldn't take much time to sort through them in any programming language. Also make sure you can recover if your remote sources are not available; if, say, you wait for all 10 sources to come back before displaying data, you may be in for a long wait if one of the sources is down.
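Here is a rough sketch of that thread-per-source approach with a per-source result cap, a timeout so a dead source can't stall the search, and a simple merge. The Entry record, the PhonebookSource delegate, and the scoring are made up purely for illustration:

// Rough sketch: query several phonebook sources in parallel, cap each at 10
// results, tolerate failures/timeouts, and merge into a single top-10 list.
using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading.Tasks;

record Entry(string Name, string Number, double Score);

// Each source takes the query and a max-results cap and returns its matches.
delegate Task<IReadOnlyList<Entry>> PhonebookSource(string query, int max);

static class FederatedSearch
{
    public static async Task<List<Entry>> SearchAsync(
        string query, IEnumerable<PhonebookSource> sources, int perSource = 10, int total = 10)
    {
        var tasks = sources.Select(async source =>
        {
            try
            {
                // Give each source a bounded amount of time so one dead
                // source can't stall the whole search.
                var work = source(query, perSource);
                var done = await Task.WhenAny(work, Task.Delay(TimeSpan.FromSeconds(2)));
                return done == work ? await work : (IReadOnlyList<Entry>)Array.Empty<Entry>();
            }
            catch
            {
                // A failed source simply contributes nothing.
                return (IReadOnlyList<Entry>)Array.Empty<Entry>();
            }
        }).ToList();

        var results = await Task.WhenAll(tasks);

        // Merge, de-duplicate by phone number, and keep the best-scored entries.
        return results.SelectMany(r => r)
                      .GroupBy(e => e.Number)
                      .Select(g => g.OrderByDescending(e => e.Score).First())
                      .OrderByDescending(e => e.Score)
                      .Take(total)
                      .ToList();
    }
}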
The other approach is to fool the user a bit, the way airfare search sites do, where they make you wait a few seconds while they collect and sort results. I really like Kayak.com's implementation, as it makes me feel like it's doing something, unlike some other sites.
Hope that helps.
