I am working on a Qt application in Python 3.6.5 using PySide2 v5.6.0. The application is a tool for labeling many tens of thousands of images for training neural networks, and it has a QTableWidget for viewing information pertaining to each image.
I have implemented filtering by label status/name/etc, and when a filter is applied, I am clearing the table with mainWin.tableWidget.clearContents() and re-populating it with the new entries. I have tested this on a few hundred images, and it clears the table in less than a second; however, when the details for more than a few thousand image files have been loaded into the table, the program hangs on the clearContents() method. I don't know if it will eventually finish, but I have waited over 30 minutes in some cases and it has never cleared the table. One of my testers has reported that if you do wait long enough, it will eventually filter on large data sets, but obviously this is not a viable solution.
Trying clear() and setRowCount(0) gives the same result.
The only thing special about the table is that I have a few signals hooked up to it to display images when a user clicks on a new row, and even with those signals disabled it still hangs.
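For reference, here is a stripped-down sketch of the filter-and-repopulate flow (the record structures and helper names are simplified placeholders, not my exact code):

    from PySide2 import QtWidgets

    def apply_filter(main_win, records, predicate):
        # "records" and "predicate" stand in for my real data structures.
        table = main_win.tableWidget
        table.clearContents()    # hangs here once a few thousand rows are loaded
        matching = [r for r in records if predicate(r)]
        table.setRowCount(len(matching))
        for row, record in enumerate(matching):
            for col, text in enumerate(record):    # each cell holds 10-30 characters of text
                table.setItem(row, col, QtWidgets.QTableWidgetItem(text))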
Here is my question: is there some way to quickly clear the rows, or maybe filter them with some built-in QTableWidget function?
Also, I feel I should point out that I am not storing image data in the table. All that I am storing in each cell is about 10-30 characters of text. Images are loaded from the disk when they are selected and are not stored in RAM.
Thank you.
I have reached the limit of 20 metrics in GDS. I am using a report to do some aggregation and calculation on the data and then download it to an Excel file. To add more metrics I am using optional metrics, but every time I refresh the report I have to manually click all the squares before I can download the file. How can I deal with this?
AFAIK, Google Data Studio doesn't have a limit of 20 metrics.
Some visual component may have this limit. If that's the case, there isn't anything that can be done.
You can try to add multiple visuals and position them close to each other, so users will think they're the same component, like in the picture below:
Notice there are two tables (the first one is selected), but they are positioned in a way that makes users think there is only one.
I understand that CouchDB hashes the source of each design document to derive the name of its index file. Whenever I change the source code, the index needs to be rebuilt. CouchDB does this when a view in the document is requested for the first time.
What I'd expect to happen and want to happen
Each time I change a design doc, the first call to a view will take significantly longer than usual and may time out. The index will continue to build. Once this is completed, the view will only process changes and will be very fast.
What actually happens
1) When running an amended view for the first time, I see the process in the status window slowly reach 100%. This takes about 2 hours, during which all CPUs are fully utilized.
2) Once the process reaches 99% it remains there for about an hour and then disappears. CPU utilization drops to just one CPU.
3) After the process has disappeared, the data file for the view keeps growing for about half an hour to an hour, with CPU utilization near 0%.
4) The index file suddenly stops increasing in size.
If I request the view again once I've reached state 4), the characteristics of 3) start again. I have to repeat this process between 5 and 50 times until I can finally retrieve the view values.
If the view gets requested a second time whilst still in stage 1) or 2), it will most definitely run out of memory and I have to restart the CouchDB service. This is despite my DB rarely using more than 2 GByte when running just one job, and more than 4 GByte being free in normal operation.
I have tried to tweak configuration settings, add more memory, but nothing seems to have an impact.
My Question
Do I misunderstand the concept of running views or is something wrong with my setup?
If this is expected, is there anything I can tweak to reduce the number of reruns?
Context
My documents are pretty large (1 to 20 MByte). The data they contain is well structured; they are usually web-analytics reports and, in a relational database, would be stored as several tens of thousands of rows of data.
My map function extracts these rows. It returns the dimensions as the key array. The key array sometimes exceeds 20 columns; most views have fewer than 10.
The reduce function aggregates (sums) all values in rows with identical keys. The metrics are stored in a dictionary and may contain different keys; the reduce function treats keys missing from one document as 0 when adding to the aggregate.
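To make this concrete, here is a rough Python illustration of the logic (the actual views are JavaScript, as noted below; the document layout of "rows", "dimensions" and "metrics" is invented for the example):

    def map_report(doc):
        # One report document contains many rows; emit one key/value pair
        # per row, keyed by its dimension values.
        for row in doc.get("rows", []):
            yield (row["dimensions"], row["metrics"])

    def reduce_metrics(values):
        # Sum the metric dictionaries of rows that share a key, treating a
        # metric that is missing from one row as 0.
        totals = {}
        for metrics in values:
            for name, value in metrics.items():
                totals[name] = totals.get(name, 0) + value
        return totals

    # Example: two rows with the same key, one of them missing "bounces".
    print(reduce_metrics([{"visits": 10, "bounces": 2}, {"visits": 5}]))
    # -> {'visits': 15, 'bounces': 2}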
I am using CouchDB 1.5.0 on Windows Server 2008 R2 with 2 CPUs and 8 GByte of memory.
The views are written in JavaScript using the couchjs query server.
My design documents usually consist of several views, with a '_lib' view that does not emit any data but contains an exhaustive library of functions accessed by the actual views.
It is a known issue, but just in case: if you have gigabytes of docs, you can forget about reduce functions. Only built-in ones will work fast enough.
It is possible to set os_process_limit to an extra-low value (1 second, for example). This way you can detect which docs take a long time to index and optimize your map function for performance.
I have an attendance recording system that has 2 databases, one for current, another for archiving. The server processes attendance records, and puts records marked completed into the archive. There is no processing done in the archive database.
Here's the issue. One of the requirements was to build a blank record for each staff member every day, into which attendance records are put. The agent that does this calls a few procedures and does some checking within the database. Currently, roughly 1,800 blank records are created daily. On the development PC, processing each record takes roughly 2 to 3 seconds, which translates to an average of an hour and a half. However, when we deployed it on the server, processing each record takes roughly 7 seconds, which translates into 3 and a half hours to complete. We have had instances where the agent takes 4.5 to 5 hours to complete.
Note that in both instances the agents are scheduled. There are no other Lotus apps on the server, and the server is free and idle most of the time (no other applications except Windows Server and Lotus Notes). Is there anything that could cause the additional processing time on the server compared to the development PC?
Your process is generating 1800 new documents every day, and you have said that you are also archiving documents regularly, so I presume that means that you are deleting them after you archive them. Performance problems can build up over time in applications like this. You probably have a large number of deletion stubs in the database, and the NSF file is probably highly fragmented (internally and/or externally).
You should use the free NotesPeek utility to examine the database and see how many deletion stubs it contains. Then you should check the purge interval setting and consider lowering it to the smallest value that you are comfortable with. (I.e., big enough so you know that all servers and users will replicate within that time, but small enough to avoid allowing a large buildup of deletion stubs.) If you change the purge interval, you can wait 24 hours for the stubs to be purged, or you can manually run updall against the database on the server console to force it.
Then you should run compact -c on the NSF file, and also run a defrag on the server disk volume where the NSF lives.
If these steps do improve your performance, then you may want to take steps in your code to prevent recurrence of the problem by using coding techniques that minimize deletion stubs, database growth and fragmentation.
I.e., go into your code for archiving and change it so it doesn't delete the documents after archiving. Instead, have your code mark them with a field such as FreeDocList := "1". Then add a hidden view called (FreeDocList) with a selection formula of FreeDocList = "1". Also go into every other view in the database and add & (!(FreeDocList = "1")) to the selection formulas. Then change the code that adds the new blank documents, so that instead of creating new docs it just goes to the FreeDocList view, finds the first document, sets FreeDocList = "0", and clears all the previous field values. Of course, if there aren't enough documents in the FreeDocList view, your code would revert to the old behavior and create a new document.
With the above changes, you will be re-using your existing documents whenever possible instead of deleting and creating new ones. I've run benchmarks on code like this and found that it can help; but I can't guarantee it in all cases. Much would depend on what else is going on in the application.
In AcaniUsers, I'm downloading the closest 20 users to me and displaying their profile pictures as thumbnails in a table view. User & Photo are both Resources because they each have an id (a MongoDB BSON ObjectId) on the server. Each User has a unique_id. Each Photo has four different sizes (images) on the server: square: 75x75, square#2x: 150x150, large: 320x480, large#2x: 640x960. But each device will only have two of these sizes, depending on whether it's an iPhone 3 or 4 (retina display). Each of these sizes has its own MongoDB collection, and all four images for each Photo have the same BSON ObjectId across these four collections.
In the future, I may give User a relationship called photos to allow a user to have more than one photo. Also, although I don't foresee this, I may add more Image sizes (types).
The fresh attribute on Image tells me whether I've downloaded the latest Image. I set this to NO whenever the Photo's ID has changed, and then back to YES after I've finished downloading the Image.
Should I store the four different images in Core Data or on the file system and just store their URLs in Core Data? I read somewhere that over 1 or 2 MB you should store in the file system, not Core Data. So, I was thinking of storing the square images in Core Data and the large images in the file system, but I'd rather store them all the same way to make things easier. So maybe I'll just store them all in the file system? What do you think?
Do you think I should discard the 75x75 & 320x480 sizes since pretty soon iPhone 3's will be gone?
How can I improve my design of the entities and their attributes and relationships? For example, is the Resource entity even beneficial at all?
I'm displaying the Users with an NSFetchedResultsController. However, it doesn't know when the User's image gets updated, so the images don't show up until I scroll aggressively the first time. How do I let the NSFetchedResultsController know that a user's thumbnail has finished downloading? Do I have to use KVO?
To answer your questions:
1. I'd store them all in the file system and record the URL in the database. I've never been a big fan of storing image data in the DB. Plus, it'll simplify things a little to have all of the image storage uniform; that way, in your image-loading code, you don't have to worry about whether it's a type that's stored in the DB or on the file system.
2. No, I wouldn't do that yet. The iPhone 3 is going to be around for a bit longer. AT&T is still selling them as the cheap entry-level iPhone; I just saw a commercial the other night advertising them for $49.
3. Remove the Resource entity and add the id attribute to each of the classes. How you did it is actually bad: abstract entities should only be used when you have a couple of entities that are almost identical and only have a few differences between them. Under the hood, Core Data will make only one table for an abstract entity and all of its children, so right now you're going to end up with a single table that contains both your User and Photo entries, which can be bad when you're trying to query just one type of entity.
You should also delete the Image entity and move its attributes into the Photo entity. The Photo will always have those values associated with it, and the same values won't be shared between photos. Having them as a separate entity will cause a slowdown: you'll either need to load them with the photos, which requires a join (slow), or they'll be loaded one at a time when you access either the data or fresh attributes, which is also slow. When each of those faults is fired in the latter scenario, a separate query and round trip to the disk happens for each object. So when you loop through your pictures for display in the table, you'll be firing n queries instead of one, which can be a big difference in performance.
4. You can use KVO to do it. Have your table cell observe the User or Picture (depending on whether you already have the Picture added to the User and are changing its data, or are adding a new Picture to the User on load completion). When the observer gets triggered, update the image being displayed.
I have done a bit of research on pagination, and from what I have read there are two contradictory ways of doing it:
1. Load a small set of data from the database each time the user clicks "next" (a rough sketch of this approach follows after the second option).
Problem - Suppose there are a million rows that meet any WHERE conditions. That means a million rows are retrieved, stored, filesorted, and then most of them are discarded and only 20 are returned. If the user clicks the "next" button, the same process happens again, only a different 20 are returned. (ref - http://www.mysqlperformanceblog.com/2008/09/24/four-ways-to-optimize-paginated-displays/)
2. Load all the data from the database and cache it... This has a few problems too, mentioned here - http://www.javalobby.org/java/forums/t63849.html
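To be clear about option 1, this is the kind of chunked query I mean (sketched here with sqlite3 and an invented schema, purely for illustration):

    import sqlite3

    PAGE_SIZE = 20

    def fetch_page(conn, page):
        # Option 1: ask the database for only one page at a time. Note that
        # with a plain LIMIT/OFFSET the server may still scan and sort every
        # matching row before discarding all but these 20, which is the cost
        # described in the linked article.
        cur = conn.execute(
            "SELECT id, title FROM items "
            "WHERE status = ? ORDER BY created_at DESC LIMIT ? OFFSET ?",
            ("active", PAGE_SIZE, page * PAGE_SIZE),
        )
        return cur.fetchall()

    # conn = sqlite3.connect("app.db")    # invented database/table names
    # first_page = fetch_page(conn, 0)    # rows 1-20
    # second_page = fetch_page(conn, 1)   # rows 21-40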
So I know I will have to use a hybrid of both; however, the question boils down to which operation is more expensive:
making repeated queries to the database for small chunks of data
or
transferring a large result set over the network?
My company has exactly this situation, and we've chosen a bit of a hybrid. Our data is tabular, so we send it via AJAX to DataTables. This allows for good UI formatting, sorting, filtering, and show/hide of columns. DataTables has a great "queue ahead" solution called "pipelining" that grabs a quantity of data ahead of the user's action (in our case, up to 5 times the records they request) and then pages through without further requests until it runs out of data. It's EXTREMELY easy to implement with DataTables, but I suspect a similar solution would not be difficult if you had to write it by hand using jQuery's AJAX functionality.
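If it helps, here is a rough sketch of what the server side of that can look like; this is a hypothetical Flask endpoint with an invented schema, not our production code. DataTables sends start/length parameters in server-side mode, and with pipelining the client simply requests a larger "length" and pages through the cached block locally:

    import sqlite3
    from flask import Flask, request, jsonify

    app = Flask(__name__)

    @app.route("/table-data")
    def table_data():
        # start/length come from DataTables; with pipelining, length is
        # several pages' worth (5x in our case).
        start = int(request.args.get("start", 0))
        length = int(request.args.get("length", 20))
        conn = sqlite3.connect("app.db")                 # invented DB name
        total = conn.execute("SELECT COUNT(*) FROM items").fetchone()[0]
        rows = conn.execute(
            "SELECT id, title, status FROM items ORDER BY id LIMIT ? OFFSET ?",
            (length, start),
        ).fetchall()
        conn.close()
        return jsonify({
            "draw": int(request.args.get("draw", 1)),
            "recordsTotal": total,
            "recordsFiltered": total,   # no search filtering in this sketch
            "data": rows,
        })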
I tried doing a full load and cache on a 1.5 million record database and it was a trainwreck. The client almost dumped me because they got mad it was so slow. After a solid overnight of AJAX goodness, the client was happy once again. But best never to get to that point.
Good Luck.