accessing geospatial raster data with limited memory - python-3.x

I am following the Rasterio documentation to access the geospatial raster data downloaded from here -- a large TIFF image. Unfortunately, I do not have enough memory, so numpy throws an ArrayMemoryError.
numpy.core._exceptions._ArrayMemoryError: Unable to allocate 77.8 GiB for an array with shape (1, 226112, 369478) and data type uint8
My code is as follows:
import rasterio
import rasterio.features
import rasterio.warp
file_path = r'Path\to\file\ESACCI-LC-L4-LC10-Map-10m-MEX.tif'
with rasterio.open(file_path) as dataset:
    # Read the dataset's valid data mask as an ndarray.
    mask = dataset.dataset_mask()

    # Extract feature shapes and values from the array.
    for geom, val in rasterio.features.shapes(
            mask, transform=dataset.transform):
        # Transform shapes from the dataset's own coordinate
        # reference system to CRS84 (EPSG:4326).
        geom = rasterio.warp.transform_geom(
            dataset.crs, 'EPSG:4326', geom, precision=6)

        # Print GeoJSON shapes to stdout.
        print(geom)
I need a way to store the numpy array on disk, so I tried looking into numpy memmap, but I do not understand how to implement it for this. Additionally, I do not need the full geospatial data; I am only interested in the lat, long, and the type of land cover, as I planned to merge this with another dataset.
Using Python 3.9.
Edit:
I updated my code to try using a window.
from rasterio.windows import Window

with rasterio.open(file_path) as dataset:
    mask = dataset.read(1, window=Window(0, 0, 226112, 369478))
    ...
I can obviously adjust the window and load the file in sections now. However, I do not understand why this has almost halved the memory required, from 77.8 GiB to 47.6 GiB.
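For reference, a minimal sketch of what section-by-section reading can look like, iterating over the file's internal blocks so that only one small array is in memory at a time (the processing step is a placeholder):

with rasterio.open(file_path) as dataset:
    # block_windows yields the file's internal tiles as Window objects
    for ij, window in dataset.block_windows(1):
        block = dataset.read(1, window=window)
        # ... process `block` here, e.g. collect land-cover values ...

As for the halving: Window takes (col_off, row_off, width, height), so Window(0, 0, 226112, 369478) requests a width of 226112 and a height of 369478. Since the dataset is only 226112 rows tall, the window is clipped to the dataset's extent, leaving a 226112 x 226112 uint8 array, and 226112² bytes is almost exactly the reported 47.6 GiB.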


combine overlapping labelled objects and modify label values

I have a Z-stack of 2D confocal microscopy images (2D slices) and I want to segment cells. The Z-stack of 2D images is actually 3D data. In different slices along the Z-axis, I see the same cells appear in multiple slices. I am interested in the cell shape in the XY plane, so I want to preserve the largest cell area from the different Z-axis slices. I thought to combine the consecutive 2D slices after converting them to labelled binary images, but I am having a few issues and I need some help to proceed further.
I have two images img_a and img_b. I first converted them to binary images using OTSU, then applied some morphological operations, and then used cv2.connectedComponentsWithStats() to obtain labelled objects. After labelling the images, I combined them using cv2.bitwise_or(), but it messes up the labels. You can see this in the attached processed image (cells highlighted by red circles): I see multiple labels for one overlapping cell. However, I want to assign one unique label to every combined overlapping object.
What I want at the end is that when I combine two labelled images, I want to assign one single label (a unique value) to the combined overlapping objects and keep the largest cell area by combining both images. Does anyone know how to do it?
Here is the code:
from matplotlib import pyplot as plt
from skimage import io, color, measure
from skimage.util import img_as_ubyte
from skimage.segmentation import clear_border
import cv2
import numpy as np

# img_a and img_b are the two RGB slice images (loaded elsewhere)
cells_a = img_a[:, :, 1]  # get the green channel

# Threshold image to binary using OTSU.
ret_a, thresh_a = cv2.threshold(cells_a, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)

# Morphological operations to remove small noise - opening
kernel = np.ones((3, 3), np.uint8)
opening_a = cv2.morphologyEx(thresh_a, cv2.MORPH_OPEN, kernel, iterations=2)
opening_a = clear_border(opening_a)  # Remove edge-touching pixels

numlabels_a, labels_a, stats_a, centroids_a = cv2.connectedComponentsWithStats(opening_a)
img_a1 = color.label2rgb(labels_a, bg_label=0)

## Now do the same with img_b
cells_b = img_b[:, :, 1]  # get the green channel

# Threshold image to binary using OTSU.
ret_b, thresh_b = cv2.threshold(cells_b, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)

# Morphological operations to remove small noise - opening
opening_b = cv2.morphologyEx(thresh_b, cv2.MORPH_OPEN, kernel, iterations=2)
opening_b = clear_border(opening_b)  # Remove edge-touching pixels

numlabels_b, labels_b, stats_b, centroids_b = cv2.connectedComponentsWithStats(opening_b)
img_b1 = color.label2rgb(labels_b, bg_label=0)

## Now combine the two images
combined = cv2.bitwise_or(labels_a, labels_b)  # combine both labelled images to get maximum area per cell
combined_img = color.label2rgb(combined, bg_label=0)
plt.imshow(combined_img)
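A quick illustration of why the OR mangles the labels: cv2.bitwise_or combines the bit patterns of the label integers rather than treating the labels as identities, so unrelated label values collide into new ones. A toy example:

import numpy as np

a = np.array([1, 2, 0], dtype=np.int32)
b = np.array([3, 1, 2], dtype=np.int32)
print(np.bitwise_or(a, b))  # [3 3 2] -- 1|3 and 2|1 both give 3, a spurious label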
Images can be found here:
Based on the comments from Christoph Rackwitz and beaker, I started to look around for 3D connected-components labelling. I found one Python library that can handle such things, installed it, and gave it a try. It seems to be doing pretty well: it assigns labels in each slice and keeps the labels the same for the same cells across slices. This is exactly what I wanted.
Here is the link to the library that I used to label objects in 3D.
https://pypi.org/project/connected-components-3d/
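A rough sketch of how the library can be applied to the two binary slices from the code above (stacking opening_a and opening_b into a volume is my own assumption about the data; cc3d is the import name of connected-components-3d):

import numpy as np
import cc3d  # pip install connected-components-3d

# Stack the 2D binary masks into a (depth, height, width) volume.
stack = np.stack([opening_a, opening_b]).astype(np.uint8)

# Label connected components in 3D: overlapping cells in consecutive
# slices fall into one component and therefore share one label.
labels_3d = cc3d.connected_components(stack, connectivity=26)

# Per-slice label images with labels that are consistent across slices.
labels_a3d, labels_b3d = labels_3d[0], labels_3d[1]

From there, per-label statistics such as the largest area across slices can be computed with np.bincount or skimage.measure.regionprops.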

memory issues for sparse one hot encoded features

I want to create a sparse matrix of one-hot encoded features from the data frame df, but I am getting a memory error with the code given below. The shape of sparse_onehot is (450138, 1508).
import pandas as pd
import scipy.sparse

# df is the input data frame (loaded elsewhere)
sp_features = ['id', 'video_id', 'genre']
sparse_onehot = pd.get_dummies(df[sp_features], columns=sp_features)
X = scipy.sparse.csr_matrix(sparse_onehot.values)
I get the memory error shown below.
MemoryError: Unable to allocate 647. MiB for an array with shape (1508, 450138) and data type uint8
I have tried scipy.sparse.lil_matrix and get the same error as above.
Is there any efficient way of handling this?
Thanks in advance
Try setting the sparse parameter to True:
sparse : bool, default False
    Whether the dummy-encoded columns should be backed by a SparseArray (True) or a regular NumPy array (False).
sparse_onehot = pd.get_dummies(df[sp_features], columns=sp_features, sparse=True)
This will use a much more memory-efficient (but somewhat slower) representation than the default one.
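Putting it together, the sparse frame can then be handed to SciPy without ever materializing the dense (450138, 1508) array. A sketch, assuming pandas >= 0.25, where an all-sparse DataFrame exposes .sparse.to_coo():

import pandas as pd
import scipy.sparse

sparse_onehot = pd.get_dummies(df[sp_features], columns=sp_features, sparse=True)
# Convert the SparseArray-backed frame to COO, then to CSR.
X = scipy.sparse.csr_matrix(sparse_onehot.sparse.to_coo())
print(X.shape)  # (450138, 1508)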

I can't generate a word cloud with some images

I have just started with the wordcloud module in Python 3.7, and I'm using the code below to generate word clouds from a dictionary, trying different masks. It only works for some images: in two cases it worked, with images of 831x816 and 1000x808 pixels. Does this have to do with the size of the image? Or is it because the images are kind of blurry? Or what is it?
Here is my code:
import numpy as np
import matplotlib.pyplot as plt
from PIL import Image
from wordcloud import WordCloud

# `frequencies` is the word-frequency dictionary (built elsewhere)
our_mask = np.array(Image.open('twitter.png'))
twitter_cloud = WordCloud(background_color='white', mask=our_mask)
twitter_cloud.generate_from_frequencies(frequencies)
twitter_cloud.to_file("twitter_cloud.jpg")
plt.imshow(twitter_cloud)
plt.axis('off')
plt.show()
How can I fix this?
I had a similar problem with a black-and-white image I used. What fixed it for me was cropping the image more closely to the black drawing, so there was no unnecessary bulk of white area around the edges.
Some images need to be adjusted before they can be used as masks. Note that only pure white pixel values are masked out (all other values are masked in). The problem is that in some images the color's np.array does not match this convention. To solve this, the following can be done:
1. Create the mask object (please try with your own image, as I couldn't upload one):
import numpy as np
import pandas as pd
from PIL import Image
from wordcloud import WordCloud

mask = np.array(Image.open("filepath/picture.png"))
print(mask)
If the array value for white is 255, then it is okay. But if it is 0 (or possibly some other value), we have to change it to 255.
2. In the case of other values, the code for changing them:
2-1. Create a function for the transformation (here our value = 0):
def transform_zeros(val):
    if val == 0:
        return 255
    else:
        return val
2-2. Create an np.array of the same shape:
maskable_image = np.ndarray((mask.shape[0],mask.shape[1]), np.int32)
2-3. Apply the transformation:
for i in range(len(mask)):
    maskable_image[i] = list(map(transform_zeros, mask[i]))
3. Checking:
print(maskable_image)
Then you can use this array for your mask.
mask = maskable_image
All of this is copied and interpreted from this link, so check it if you find my attempted explanation unclear, as I have just provided the solution but don't understand that much about image color arrays and their transformation.
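As a side note, the per-row loop in step 2-3 can be replaced with a vectorized NumPy equivalent (same assumption: the value to turn into 255 is 0):

maskable_image = np.where(mask == 0, 255, mask).astype(np.int32)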

Awkward Array: How to get numpy array after storing as Parquet (not BitMasked)?

I want to store 2D arrays of different lengths as an AwkwardArray, store them as Parquet, and later access them again.
The problem is that, after loading from Parquet, the format is BitMaskedArray and the access performance is a bit slow, as demonstrated by the following code:
import numpy as np
import awkward as awk
# big enough to feel the performance cost (imitating a big audio file); 2D
np_arr0 = np.arange(20000000, dtype=np.float32).reshape(2, -1)
print(np_arr0.shape)
# (2, 10000000)
# different size
np_arr1 = np.arange(20000000, 36000000, dtype=np.float32).reshape(2, -1)
print(np_arr1.shape)
# (2, 8000000)
# slow; turn into AwkwardArray
awk_arr = awk.fromiter([np_arr0, np_arr1])
# fast; returns np.ndarray
awk_arr[0][0]
# store and load from parquet
awk.toparquet("sample.parquet", awk_arr)
pq_array = awk.fromparquet("sample.parquet")
# kinda slow; return BitMaskedArray
pq_array[0][0]
If we inspect the return, we see:
pq_array[0][0].layout
# layout
# [ ()] BitMaskedArray(mask=layout[0], content=layout[1], maskedwhen=False, lsborder=True)
# [ 0] ndarray(shape=1250000, dtype=dtype('uint8'))
# [ 1] ndarray(shape=10000000, dtype=dtype('float32'))
# trying to access only float32 array [1]
pq_array[0][0][1]
# expected
# array([0.000000e+00, 1.000000e+00, 2.000000e+00, ..., 9.999997e+06, 9.999998e+06, 9.999999e+06], dtype=float32)
# reality
# 1.0
Question
How can I load AwkwardArray from Parquet and quickly access the numpy values?
Info from README (GitHub)
awkward.fromparquet is lazy-loading the Parquet file.
Good, that's what will help when doing e.g. pq_array[0][0][:1000].
The next layer of new structure is that the jagged array is bit-masked. Even though none of the values are nullable, this is an artifact of the way Parquet formats columnar data.
I guess there is no way around this. However, is this the reason why loading is kinda slow? Can I still access the data as a numpy.ndarray by accessing it directly (without the bitmask)?
Additional attempt
Loading it with Arrow, then Awkward:
import pyarrow as pa
import pyarrow.parquet as pq
# Parquet as Arrow
pa_array = pq.read_table("sample.parquet")
# returns table instead of JaggedArray
awk.fromarrow(pa_array)
# <Table [<Row 0> <Row 1>] at 0x7fd92c83aa90>
In both Arrow and Parquet, all data is nullable, so Arrow/Parquet writers are free to throw in bitmasks wherever they want to. When reading the data back, Awkward has to treat those bitmasks as meaningful (mapping them to awkward.BitMaskedArray), but they might be all valid, particularly if you know that you didn't set any values to null.
If you're willing to ignore the bitmask, you can reach behind it by calling
pq_array[0][0].content
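Assuming the two-level layout shown in the question (a BitMaskedArray whose content is the float32 ndarray), that hands back the flat values directly, for example:

flat = pq_array[0][0].content  # the underlying float32 ndarray, mask ignored
print(flat[:5])                # [0. 1. 2. 3. 4.]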
As for the slowness, I can say that
import awkward as ak
# slow; turn into AwkwardArray
awk_arr = ak.fromiter([np_arr0, np_arr1])
is going to be slow because ak.fromiter is one of the few functions that is implemented with a Python for loop; iterating over 10 million values in a NumPy array with a Python for loop is going to be painful. You can build the same thing manually with
>>> ak_arr0 = ak.JaggedArray.fromcounts([np_arr0.shape[1], np_arr0.shape[1]],
... np_arr0.reshape(-1))
>>> ak_arr1 = ak.JaggedArray.fromcounts([np_arr1.shape[1], np_arr1.shape[1]],
... np_arr1.reshape(-1))
>>> ak_arr = ak.JaggedArray.fromcounts([len(ak_arr0), len(ak_arr1)],
... ak.concatenate([ak_arr0, ak_arr1]))
As for Parquet being slow, I can't say why: it could be related to page size or row group size. Since Parquet is a "medium weight" file format (between "heavyweights" like HDF5 and "lightweights" like npy/npz), it has a few tunable parameters (not a lot).
You might also want to consider
ak.save("file.awkd", ak_arr)
ak_arr2 = ak.load("file.awkd")
which is really just the npy/npz format with JSON metadata to map Awkward arrays to and from flat NumPy arrays. For this sample, the file.awkd is 138 MB.

mangled images in VTK from ITK

I am reading an image with SimpleITK, but I get these results in VTK. Any help? I am not sure where things are going wrong here.
Please see image here.
####
CODE
import vtk
import SimpleITK as sitk

# pixelmap maps SimpleITK pixel IDs to VTK scalar types (defined elsewhere)

def sitk2vtk(img):
    size = list(img.GetSize())
    origin = list(img.GetOrigin())
    spacing = list(img.GetSpacing())
    sitktype = img.GetPixelID()
    vtktype = pixelmap[sitktype]
    ncomp = img.GetNumberOfComponentsPerPixel()

    # there doesn't seem to be a way to specify the image orientation in VTK

    # convert the SimpleITK image to a numpy array
    i2 = sitk.GetArrayFromImage(img)
    i2_string = i2.tostring()

    # send the numpy array to VTK with a vtkImageImport object
    dataImporter = vtk.vtkImageImport()
    dataImporter.CopyImportVoidPointer(i2_string, len(i2_string))
    dataImporter.SetDataScalarType(vtktype)
    dataImporter.SetNumberOfScalarComponents(ncomp)

    # VTK expects 3-dimensional parameters
    if len(size) == 2:
        size.append(1)
    if len(origin) == 2:
        origin.append(0.0)
    if len(spacing) == 2:
        spacing.append(spacing[0])

    # Set the new VTK image's parameters
    dataImporter.SetDataExtent(0, size[0] - 1, 0, size[1] - 1, 0, size[2] - 1)
    dataImporter.SetWholeExtent(0, size[0] - 1, 0, size[1] - 1, 0, size[2] - 1)
    dataImporter.SetDataOrigin(origin)
    dataImporter.SetDataSpacing(spacing)
    dataImporter.Update()

    vtk_image = dataImporter.GetOutput()
    return vtk_image
###
END CODE
You are ignoring two things:
1. There is an order change when you perform GetArrayFromImage. The order of indices and dimensions needs careful attention during conversion. Quoting from the SimpleITK Notebooks at http://insightsoftwareconsortium.github.io/SimpleITK-Notebooks/01_Image_Basics.html:
ITK's Image class does not have a bracket operator. It has a GetPixel which takes an ITK Index object as an argument, which is an array ordered as (x,y,z). This is the convention that SimpleITK's Image class uses for the GetPixel method as well.
In numpy, however, an array is indexed in the opposite order (z,y,x).
2. There is a change of coordinates between ITK and VTK image representations. Historically, in computer graphics there is a tendency to align the camera in such a way that the positive Y axis is pointing down. This results in a change of coordinates between ITK and VTK images.
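A small sketch that makes the first point concrete, using only standard SimpleITK calls:

import SimpleITK as sitk

img = sitk.Image(4, 3, 2, sitk.sitkUInt8)  # size passed as (x, y, z)
arr = sitk.GetArrayFromImage(img)

print(img.GetSize())  # (4, 3, 2) -- ITK/SimpleITK order: (x, y, z)
print(arr.shape)      # (2, 3, 4) -- numpy order: (z, y, x)

This axis reversal is worth keeping in mind in sitk2vtk above, where the raw bytes of the numpy array are paired with extents built from img.GetSize().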
