I'm working on some code to select and export geodata based on a bounding box. The data I want to select comes from two separate layers in a huge File GDB (16 GB) covering the entire Netherlands. I use a bounding box so as to avoid reading the entire dataset before making a selection.
This method works great on a GeoPackage database, but with a File Geodatabase the processing time is far longer (0.2 s vs. 300 s for a 200x200 meter selection). The File GDB I'm using has a spatial index set for the layers I'm reading. I'm using geopandas to read and select. Below is an example for the layer 'Adres':
import geopandas as gpd

def ImportGeodata(FilePath, BoundingBox):
    # read only the features that intersect the bounding box
    importBag = gpd.read_file(FilePath, layer='Adres', bbox=BoundingBox)
    # copy the id column so it can serve as a merge key later on
    importBag['mergeid'] = importBag['identificatie']
    return importBag
Am I overlooking something? Or is this a limitation of reading from a huge File GDB? I can't find an obvious mistake here. For now the workaround is another script that imports the layers I need and dumps them into a GeoPackage. The problem is that this runs for 3 to 4 hours (the resulting GeoPackage is almost 6 GB), and I'd have to repeat it every month or so to process a new version of this dataset, which I'd rather avoid.
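One thing that may be worth trying, assuming a recent geopandas (0.11+) with the pyogrio package installed (both are assumptions on my part, not part of the original setup): the pyogrio engine uses GDAL's vectorized I/O and is often dramatically faster than the default fiona engine on large File Geodatabases. A minimal sketch of the same read:

import geopandas as gpd

def ImportGeodataPyogrio(FilePath, BoundingBox):
    # same bbox read as above, but forcing the pyogrio engine;
    # requires geopandas >= 0.11 and pyogrio installed (assumption)
    importBag = gpd.read_file(FilePath, layer='Adres', bbox=BoundingBox, engine='pyogrio')
    importBag['mergeid'] = importBag['identificatie']
    return importBag

It may also be worth checking the GDAL version underneath: if I remember correctly, older builds of the OpenFileGDB driver did not use the .spx spatial index, which would turn every bbox query into a full-table scan.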
Curious what you guys come up with.
I'm currently using pickle to save some big data containing many numpy matrices of size 10k*10k. Even though I use several similar (separate) Python files, whenever I save the data, the size of the saved .dat file is always 4 GB. Is that just a coincidence, or can it not save more than this amount?
Also, when I load the data it uses more than 90% of the memory, which is not workable for me. I have heard of cPickle and joblib; here is a comparison of them: What are the different use cases of joblib versus pickle?
I would like to reduce memory usage. Should I switch to joblib? What would be the fastest way?
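If memory at load time is the main concern, here is a minimal sketch of one approach, assuming the data is a dict of plain numpy arrays (an assumption on my part): joblib can memory-map the arrays instead of reading them fully into RAM, as long as the file was saved without compression.

import numpy as np
import joblib

# stand-in for the real 10k x 10k matrices (hypothetical structure)
data = {'m1': np.random.rand(1000, 1000), 'm2': np.random.rand(1000, 1000)}

# save uncompressed; joblib stores large numpy arrays in a raw layout
joblib.dump(data, 'data.joblib')

# memory-map the arrays on load instead of copying them into RAM;
# note that mmap_mode only works for files saved without compression
data_loaded = joblib.load('data.joblib', mmap_mode='r')

With mmap_mode='r' the array contents are paged in lazily from disk as they are accessed, so the process's resident memory stays small.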
Thanks for any suggestions.
p.s. I use Python 3.8 on Ubuntu 20.04 with Spyder IDE.
My data is in a large multi-indexed pandas DataFrame. I re-index to flatten the DataFrame and then feed it through ColumnDataSource, but I need to group my data row-wise in order to plot it correctly (think a bunch of torque curves corresponding to a bunch of gears for a car). If I just plot the dictionary output of ColumnDataSource, it's a mess.
I've tried converting the ColumnDataSource output back to a DataFrame, but then I lose the update functionality: the callback won't touch the DataFrame, and the plots won't change. Anyone have any ideas?
The short answer to the question in the title is "Yes". The ColumnDataSource is the special, central data structure of Bokeh. It provides the data for all the glyphs in a plot, or the content in data tables, and automatically keeps that data synchronized between the Python and JavaScript sides, so that you don't have to, e.g., write a bunch of low-level websocket code yourself. To update things like glyphs in a plot, you update the CDS that drives them.
It's possible there are improvements that could be made in your approach to updating the CDS, but it is impossible to speculate without seeing actual code for what you have tried.
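As a minimal sketch of the general pattern (the column names 'gear', 'rpm', and 'torque' are made up for illustration): one CDS per group gives you a separate glyph per curve, each of which can be updated independently from a callback.

import pandas as pd
from bokeh.models import ColumnDataSource
from bokeh.plotting import figure, show

# hypothetical flattened frame; 'gear' tags which curve each row belongs to
df = pd.DataFrame({
    'gear': [1, 1, 2, 2],
    'rpm': [1000, 2000, 1000, 2000],
    'torque': [120, 150, 100, 130],
})

# one CDS per group, so each curve is driven by its own source
sources = {gear: ColumnDataSource(sub) for gear, sub in df.groupby('gear')}

p = figure()
for gear, src in sources.items():
    p.line('rpm', 'torque', source=src, legend_label=f'gear {gear}')

# inside a callback, assign a plain dict to .data and the glyph redraws itself
sources[1].data = {'gear': [1, 1], 'rpm': [1000, 2000], 'torque': [125, 160]}

show(p)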
I'm looking for a Python 3 module that can generate a visual representation of a graph. Ideally I would give it a list of nodes and a list of connections, as well as a small amount of data associated with those things, and it would open a window (an image saved to disk is fine too) showing said nodes connected as specified. I don't want to specify the positions of the nodes; instead I'd like the software to arrange them in a way that at least approximately minimizes edge crossings.
Is there any such module? All I've been able to find are plotters and such...
If there is none, an easy-to-learn graphics module would do; I have never done any graphics work.
You can take a look at networkx. It offers the possibility to draw graphs with matplotlib.
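A minimal sketch: the force-directed spring_layout chooses node positions automatically, which tends to keep edge crossings down, and the node data shown here is purely illustrative.

import networkx as nx
import matplotlib.pyplot as plt

G = nx.Graph()
G.add_edges_from([(1, 2), (1, 3), (2, 4), (3, 4), (1, 4)])

# attach a small amount of data to nodes if needed (hypothetical attribute)
G.nodes[1]['label'] = 'start'

pos = nx.spring_layout(G)  # positions computed by a force-directed algorithm
nx.draw(G, pos, with_labels=True, node_color='lightblue')
plt.savefig('graph.png')  # or plt.show() to open a window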
I am loading big, raw data files with Python. It is a collection of images (a video stream) that I want to display in an interface. As of now I am embedding a matplotlib graph and using the imshow() command, but it is very slow.
The fast part is reading the data itself, but splitting it into a numpy array already takes 8 seconds for a 14 MB file. We have 50 GB files; that would take 8 hours. It's probably not the biggest problem though.
The real problem is displaying the images. Let's say all images of the 14 MB file are in RAM (I'm assuming Python keeps them there, which is also my problem with Python: you don't know what the hell is happening). Right now I am replotting the image every time and then redrawing the canvas, and that seems to be the bottleneck. Is there any way to reduce it?
Images are usually 680*480 (but also variable) of a variable datatype, usually uint8. The interface is a GUI with a slider bar that you can drag to get to a certain frame. An additional feature will be a play button that goes through the frames in near real-time. Windows application.
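On the replot-every-frame bottleneck, a minimal sketch of one common fix: reuse a single AxesImage and only swap its data, instead of calling imshow() again per frame (the frame shape and count below are made up for illustration).

import numpy as np
import matplotlib.pyplot as plt

# stand-in frames; in practice these come from the raw file
frames = [np.random.randint(0, 256, (480, 680), dtype=np.uint8) for _ in range(100)]

fig, ax = plt.subplots()
im = ax.imshow(frames[0], cmap='gray', vmin=0, vmax=255)

def show_frame(i):
    im.set_data(frames[i])   # reuse the existing image artist
    fig.canvas.draw_idle()   # schedule a redraw instead of a full replot

show_frame(10)
plt.show()

Creating a new image artist per frame forces matplotlib to rebuild the axes each time; set_data() avoids that, which is usually much faster.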
I am working on an assignment for a course. The code creates variables for use in a datadist() call from rms. We then create a simple linear regression model using ols(). Printing/creating the first plot, before the datadist() and ols() calls, is simple. We use:
plot(x,y,pch='o')
lines(x,yTrue,lty=2,lwd=.5,col='red')
Then we create the datadist() and the ols() fit, here named fit0.
mydat=data.frame(x=x,y=y)
dd=datadist(mydat)
options(datadist='dd')
fit0=ols(y~x,data=mydat)
fit0
anova(fit0)
This all works smoothly, printing out the results of the linear regression and the anova table. Then we want to predict based on the model and plot these predictions. The plot prints out nicely; however, the lines and points won't show up. The code:
ff=Predict(fit0)
plot(ff)
lines(x,yTrue,lwd=2,lty=1,col='red')
points(x,y,pch='.')
Note - this works fine in R. I much prefer to use RStudio, though I can switch to R if there's no clear solution to this issue. I've tried dev.off() several times (repeating until the null device is reached), I've tried closing RStudio and re-opening it, I've uninstalled and reinstalled R, RStudio, and the rms package (which includes ggplot2), updated the packages, and made my RStudio graphics window larger. No solution I've seen works. Help!