Folium + Bokeh: Poor performance and massive memory usage - python-3.x

I'm using Folium and Bokeh together in a Jupyter notebook. I'm looping through a dataframe; for each row I insert a marker on the Folium map, pull some data from a separate dataframe, create a Bokeh chart from that data, and then embed the Bokeh chart in the Folium marker's popup via an IFrame. The code is as follows:
import folium
from bokeh.charts import Bar
from bokeh.charts.attributes import CatAttr
from bokeh.models import HoverTool
from bokeh.embed import file_html
from bokeh.resources import INLINE

map = folium.Map(location=[36.710021, 35.086146], zoom_start=6)
for i in range(0, len(duty_station_totals)):
    popup_table = station_dept_totals.loc[station_dept_totals['Duty Station'] == duty_station_totals.iloc[i, 0]]
    chart = Bar(popup_table, label=CatAttr(columns=['Department / Program'], sort=False), values='dept_totals',
                title=duty_station_totals.iloc[i, 0] + ' Staff', xlabel='Department / Program', ylabel='Staff',
                plot_width=350, plot_height=350)
    hover = HoverTool(point_policy='follow_mouse')
    hover.tooltips = [('Staff', '@height'), ('Department / Program', '@{Department / Program}'),
                      ('Duty Station', duty_station_totals.iloc[i, 0])]
    chart.add_tools(hover)
    html = file_html(chart, INLINE, "my plot")
    iframe = folium.element.IFrame(html=html, width=400, height=400)
    popup = folium.Popup(iframe, max_width=400)
    marker = folium.CircleMarker(duty_station_totals.iloc[i, 2],
                                 radius=duty_station_totals.iloc[i, 1] * 150,
                                 color=duty_station_totals.iloc[i, 3],
                                 fill_color=duty_station_totals.iloc[i, 3])
    marker.add_to(map)
    folium.Marker(duty_station_totals.iloc[i, 2],
                  icon=folium.Icon(color='black', icon_color=duty_station_totals.iloc[i, 3]),
                  popup=popup).add_to(map)
map
This loop runs extremely slowly and adds roughly 200 MB to the memory usage of the associated Python 3.5 process per run of the loop! In fact, after running the loop a couple of times my entire MacBook slows to a crawl; even the mouse lags. The associated map also lags heavily when scrolling and zooming, and the popups are slow to open. In case it isn't obvious, I'm pretty new to the Python analytics and web visualization world, so maybe there is something clearly very inefficient here.
I'm wondering why this is and whether there is a better way of having Bokeh charts appear in the map popups. From some basic experiments I've done, it doesn't seem that the issue is with the calls to Bar; the memory usage really skyrockets when I include the calls to file_html and gets worse still once the calls to folium.element.IFrame are added. It seems like there is some sort of memory leak, given that memory usage keeps increasing when the same code is re-run.
If anyone has ideas as to how to achieve the same effect (Bokeh charts opening when clicking a Folium marker) in a more efficient manner I would really appreciate it!
Update following some experimentation
I've run through the loop step by step and observed the change in memory usage as more steps are added, to try to isolate which piece of code is driving this issue. On the Bokeh side, the biggest culprit seems to be the calls to file_html(): running the loop up to and including this step adds about 5 MB of memory usage to the associated Python 3.5 process per run (the loop creates 18 charts), even when including bokeh.io.curdoc().clear().
The bigger issue, however, seems to be driven by Folium. Running the whole loop, including the creation of the Folium IFrames with the Bokeh-generated HTML and the map markers linked to those IFrames, adds 25-30 MB to the memory usage of the Python process per run.
So, I guess this is turning into more of a Folium question. Why is this structure so memory-intensive, and is there a better way? By the way, saving the resulting Folium map as an HTML file with map.save('map.html') creates a huge, 22 MB HTML file.

There are lots of different use-cases, and some of them come with unavoidable trade-offs. In order to make some other use-cases very simple and convenient, Bokeh has an implicit "current document" and keeps accumulating things there. For the particular use-case of generating a bunch of plots sequentially in a loop, you will want to call bokeh.io.reset_output() in between each, to prevent this accumulation.
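For example, a minimal sketch of the loop from the question with reset_output() added; build_chart() here is a hypothetical helper standing in for the Bar(...) and HoverTool setup shown above, and the dataframe names come from the question:

import folium
from bokeh.io import reset_output
from bokeh.embed import file_html
from bokeh.resources import INLINE

map = folium.Map(location=[36.710021, 35.086146], zoom_start=6)
for i in range(len(duty_station_totals)):
    reset_output()  # clear Bokeh's implicit current document before building the next chart
    chart = build_chart(i)  # hypothetical helper wrapping the Bar(...) call from the question
    html = file_html(chart, INLINE, "my plot")
    iframe = folium.element.IFrame(html=html, width=400, height=400)
    folium.Marker(duty_station_totals.iloc[i, 2],
                  popup=folium.Popup(iframe, max_width=400)).add_to(map)
map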

Related

Why do GBuffers need to be created for each frame in D3D12?

I have experience with D3D11 and want to learn D3D12. I am reading the official D3D12 multithreading example and don't understand why the shadow map (generated in the first pass as a DSV, consumed in the second pass as an SRV) is created for each frame (actually only 2 copies, as the FrameResource is reused every 2 frames).
The code that creates the shadow map resource is here, in the FrameResource class, instances of which are created here.
There is actually another resource that is created for each frame: the constant buffer. I kind of understand the constant buffer: because it is written by the CPU (D3D11 dynamic usage) and needs to remain unchanged until the GPU finishes using it, there need to be 2 copies. However, I don't understand why the shadow map needs the same treatment, because it is only modified by the GPU (D3D11 default usage), and there are fence commands to separate reads and writes to that texture anyway. As long as the GPU respects the fence, a single texture should be enough for it to work correctly. Where am I wrong?
Thanks in advance.
EDIT
According to the comment below, the "fence" I mentioned above should more accurately be called "resource barrier".
The key issue is that you don't want to stall the GPU for best performance. Double-buffering is a minimal requirement, but typically triple-buffering is better for smoothing out frame-to-frame rendering spikes, etc.
FWIW, the default behavior of DXGI Present is to stall only after you have submitted THREE frames of work, not two.
Of course, there's a trade-off between triple-buffering and input responsiveness, but if you are maintaining 60 Hz or better then it's likely not noticeable.
With all that said, you typically don't need to double-buffer depth/stencil buffers for rendering, although if you wanted the initial write of the depth buffer to overlap with the read of the previous frame's depth-buffer passes, then you would want distinct buffers per frame for performance and correctness.
The 'writes' are all complete before the 'reads' in DX12 because of the injection of the 'Resource Barrier' into the command-list:
void FrameResource::SwapBarriers()
{
    // Transition the shadow map from writeable to readable.
    m_commandLists[CommandListMid]->ResourceBarrier(1,
        &CD3DX12_RESOURCE_BARRIER::Transition(m_shadowTexture.Get(),
            D3D12_RESOURCE_STATE_DEPTH_WRITE,
            D3D12_RESOURCE_STATE_PIXEL_SHADER_RESOURCE));
}

void FrameResource::Finish()
{
    m_commandLists[CommandListPost]->ResourceBarrier(1,
        &CD3DX12_RESOURCE_BARRIER::Transition(m_shadowTexture.Get(),
            D3D12_RESOURCE_STATE_PIXEL_SHADER_RESOURCE,
            D3D12_RESOURCE_STATE_DEPTH_WRITE));
}
Note that this sample is a port/rewrite of the older legacy DirectX SDK sample MultithreadedRendering11, so it may be just an artifact of convenience to have two shadow buffers instead of just one.

Objective-C for loops with @autoreleasepool and ARC

As part of an app that allows auditors to create findings and associate photos with them (saved as Base64 strings due to a limitation of the web service), I have to loop through all findings and their photos within an audit and set their sync value to true.
Whilst I perform this loop I see a memory spike jumping from around 40 MB up to 500 MB (for roughly 350 photos and 255 findings), and this number never goes down. On average our users are creating around 1000 findings and 500-700 photos before attempting to use this feature. I have attempted to use @autoreleasepool blocks to keep the memory down, but it never seems to get released.
for (Finding * __autoreleasing f in self.audit.findings) {
    @autoreleasepool {
        [f setToSync:@YES];
        NSLog(@"%@", f.idFinding);
        for (FindingPhoto * __autoreleasing p in f.photos) {
            @autoreleasepool {
                [p setToSync:@YES];
                p = nil;
            }
        }
        f = nil;
    }
}
The relationships and retain cycles look like this
Audit has a strong reference to Finding
Finding has a weak reference to Audit and a strong reference to FindingPhoto
FindingPhoto has a weak reference to Finding
What am I missing in terms of being able to effectively loop through these objects and set their properties without causing such a huge spike in memory. I'm assuming it's got something to do with so many Base64 strings being loaded into memory when looping through but never being released.
So, first, make sure you have a batch size set on the fetch request. Choose a relatively small number, but not too small because this isn't for UI processing. You want to batch a reasonable number of objects into memory to reduce loading overhead while keeping memory usage down. Try 50 or 100 and see how it goes, then consider upping the batch size a little.
If all of the objects you're loading are managed objects, then the correct way to evict them during processing is to turn them back into faults. That's done by calling refreshObject:mergeChanges: on the context. But that discards any changes, and your loop is specifically there to make changes.
So, what you should really be doing is batch saving the objects you've modified and then turning those objects back into faults to remove the data from memory.
So, in your loop, keep a counter of how many objects you've modified, save the context each time you hit that count, and then refresh all of the objects processed so far. The batch size on the fetch and the batch size for saving should be the same number.
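A rough Objective-C sketch of that pattern, assuming a Core Data context variable named context and the Finding/FindingPhoto entities from the question (error handling omitted, names illustrative):

static const NSUInteger kBatchSize = 100;

NSFetchRequest *request = [NSFetchRequest fetchRequestWithEntityName:@"Finding"];
request.fetchBatchSize = kBatchSize;          // load Findings in batches

NSArray *findings = [context executeFetchRequest:request error:NULL];
NSMutableArray *pending = [NSMutableArray array];

for (Finding *f in findings) {
    [f setToSync:@YES];
    for (FindingPhoto *p in f.photos) {
        [p setToSync:@YES];
        [pending addObject:p];
    }
    [pending addObject:f];

    if (pending.count >= kBatchSize) {
        [context save:NULL];                  // persist the changes made so far
        for (NSManagedObject *obj in pending) {
            // Turn the saved object back into a fault, releasing its data
            // (including the Base64 strings) from memory.
            [context refreshObject:obj mergeChanges:NO];
        }
        [pending removeAllObjects];
    }
}
[context save:NULL];                          // save any remaining changes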
There's probably a big difference in size between your "Finding" objects and the associated images, so your primary aim should be to redesign your database so that unfaulting (loading) a Finding object does not automatically load the Base64-encoded image.
That's actually one of the major strengths of Core Data: loading only part of an object hierarchy. Just move the Base64-encoded data into its own managed object so that Core Data does not load it with the Finding. It will still be loaded as needed when the relationship is accessed.

FabricJS ApplyFilter is slow on large images

Taking a very basic stock example such as the redify filter with a large image (1200x1024), I was trying to determine why it takes what I think is too long. After some investigating, I found that the delay occurs in fabricjs::ApplyFilter, where replacement.src = canvasEl.toDataURL('image/png'); (line 17933 in 1.6.2). That takes a long time, even compared to the filter's complete pass over the pixels.
Is there some way around this? Can I do something differently to speed up the process? TIA

XNA Xbox frame drops when GC kicks in

I'm developing an app (an XNA game) for the Xbox, which is a pretty simple app. The start page contains tiles with moving GIF images. Those GIFs are actually all PNG images, which get loaded once by every tile and put in an array. Then, using a defined delay, the images are played back (using a counter which increases every time the delay passes).
This all works well, however, I noticed some small lag every x seconds in the movement of the GIF images. I then started to add some benchmarking stuff:
http://gyazo.com/f5fe0da3ff81bd45c0c52d963feb91d8
As you can see, the FPS is pretty low for such a simple program (this is in debug; when running the app on the Xbox itself, I get an average of 62 fps).
2 important settings:
Graphics.SynchronizeWithVerticalRetrace = false;
IsFixedTimeStep = false;
Changing IsFixedTimeStep to true increases the lag: the settings tile has wheels which rotate, and you can see the wheels jump back a little every few seconds. The same goes for SynchronizeWithVerticalRetrace; enabling it also increases the lag.
I noticed a connection between the lag and the moments the garbage collector kicks in: every time it runs, there is a lag.
Don't mind the MAX HMU (heap memory usage), as that is taken at the start; the average is more realistic.
Here is another screenshot from the performance monitor; however, I don't understand much of this tool since it's the first time I'm using it. Hope it helps:
http://gyazo.com/f70a3d400657ac61e6e9f2caaaf17587
After a little research I found the culprit.
I have custom components that all derive from GameComponent and get added to the Components list of the main Game class.
This was one of two major problems: it caused everything to update, even things that didn't need an update. (The draw method was the only one that kept the page state in mind and only drew if needed.)
I fixed this by using different "screens" (or pages, as I called them), which are the only components that derive from GameComponent.
Then I only update the page which is active, and the custom components on that page also get updated. Problem fixed.
The second big problem is the following:
I made a class which helps me position things on the screen relatively, with percentages, parent containers, aligns and v-aligns, etc.
That class had properties for sizes and vectors, but instead of saving the calculated values in backing fields, I recalculated them every time a property was accessed. Calculating complex things like that follows references (to parent and child containers, for example), which made it very hard for the CLR because it had a lot of work to do.
I have now rebuilt the whole positioning class into a fully optimized class with flags to recalculate only when necessary, and instead of drops of 20 fps, I now get an average of 170+ fps!

How to search for possibilities to parallelize?

I have some serial code that I have started to parallelize using Intel's TBB. My first aim was to parallelize almost all the for loops in the code (I have even parallelized a for loop inside another for loop), and having done that I now get some speedup. I am looking for more places/ideas/options to parallelize. I know this might sound a bit vague without much reference to the problem, but I am looking for generic ideas here which I can explore in my code.
Overview of the algorithm (it is run over all levels of the image, starting with the smallest and increasing the width and height by 2 each time until the actual height and width are reached):
For all image pairs starting with the smallest pair
    For height = 2 to image_height - 2
        Create a 5 by image_width ROI of both left and right images.
        For width = 2 to image_width - 2
            Create a 5 by 5 window of the left ROI centered around width and find best match in the right ROI using NCC
            Create a 5 by 5 window of the right ROI centered around width and find best match in the left ROI using NCC
            Disparity = current_width - best match
    The edge pixels that did not receive a disparity get the disparity of their neighbors
    For height = 0 to image_height
        For width = 0 to image_width
            Check smoothness, uniqueness and order constraints (parallelized separately)
    For height = 0 to image_height
        For width = 0 to image_width
            For disparities that failed the constraints, use the average disparity of neighbors that passed the constraints
    Normalize all disparities and output to screen
Just for some perspective, it may not always be worthwhile to parallelize something.
Just because you have a for loop where each iteration can be done independently of the others doesn't always mean you should parallelize it.
TBB has some overhead for starting those parallel_for loops, so unless you're looping a large number of times, you probably shouldn't parallelize it.
But if each iteration is extremely expensive (like in CirrusFlyer's example), then feel free to parallelize it.
More specifically, look for places where the overhead of setting up the parallel computation is small relative to the cost of the work being parallelized.
Also, be careful about doing nested parallel_for loops, as this can get expensive. You may want to just stick with parallelizing the outer for loop.
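For example, a minimal TBB sketch that parallelizes only the outer row loop of the matching step; process_pixel() is a hypothetical stand-in for the per-pixel NCC work described above:

#include <tbb/parallel_for.h>

void process_pixel(int row, int col);   // hypothetical: the NCC matching for one pixel

void compute_disparities(int image_height, int image_width) {
    // One parallel_for over rows; the inner column loop stays serial, so the
    // scheduling overhead is paid once per row rather than once per pixel.
    tbb::parallel_for(2, image_height - 2, [=](int row) {
        for (int col = 2; col < image_width - 2; ++col) {
            process_pixel(row, col);
        }
    });
}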
The silly answer is anything that is time-consuming or iterative. I use Microsoft's .NET 4.0 Task Parallel Library (TPL), and one of the interesting things about its design is its "expressed parallelism", an interesting term to describe "attempted parallelism": though your code may say "use the TPL here", if the host platform doesn't have the necessary cores it will simply invoke the old-fashioned serial code in its place.
I have begun to use the TPL on all my projects, especially any place there are loops (this requires that I design my classes and methods so that there are no dependencies between the loop iterations). But anywhere that might previously have been plain old multithreaded code, I now look to see whether it's something I can place on different cores.
My favorite so far has been an application I have that downloads ~7,800 different URLs to analyze the contents of the pages and, if it finds the information it's looking for, does some additional processing. This used to take between 26 and 29 minutes to complete. My Dell T7500 workstation with dual quad-core 3 GHz Xeon processors, 24 GB of RAM, and Windows 7 Ultimate 64-bit now crunches through the entire thing in about 5 minutes. A huge difference for me.
I also have a publish/subscribe communication engine that I have been refactoring to take advantage of the TPL (especially for "pushing" data from the server to clients: you may have 10,000 client computers that have stated their interest in specific things, and once such an event occurs, I need to push data to all of them). I don't have this done yet, but I'm really looking forward to seeing the results on this one.
Food for thought ...
