Memory issues for sparse one-hot encoded features - python-3.x

I want to create a sparse matrix for one-hot encoded features from the data frame df, but I am getting a memory error for the code given below. The shape of sparse_onehot is (450138, 1508).
sp_features = ['id', 'video_id', 'genre']
sparse_onehot = pd.get_dummies(df[sp_features], columns = sp_features)
import scipy.sparse
X = scipy.sparse.csr_matrix(sparse_onehot.values)
I get a memory error as shown below.
MemoryError: Unable to allocate 647. MiB for an array with shape (1508, 450138) and data type uint8
I have tried scipy.sparse.lil_matrix and get the same error as above.
Is there any efficient way of handling this?
Thanks in advance

Try setting the sparse parameter to True:
sparse : bool, default False
    Whether the dummy-encoded columns should be backed by a SparseArray (True) or a regular NumPy array (False).
sparse_onehot = pd.get_dummies(df[sp_features], columns = sp_features, sparse = True)
This will use a much more memory-efficient (but somewhat slower) representation than the default one.
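If a scipy CSR matrix is still needed downstream (e.g. for scikit-learn), the sparse-backed dummies can be converted without densifying. A minimal sketch, assuming df is the question's data frame and a pandas version that provides the DataFrame.sparse accessor:
import pandas as pd

sparse_onehot = pd.get_dummies(df[sp_features], columns=sp_features, sparse=True)
# Convert the SparseArray-backed frame to a scipy COO matrix, then to CSR,
# without ever materializing a dense array in between.
X = sparse_onehot.sparse.to_coo().tocsr()
print(X.shape)  # (450138, 1508)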

Related

Accessing geospatial raster data with limited memory

I am following the Rasterio documentation to access geospatial raster data downloaded from here -- a large TIFF image. Unfortunately, I do not have enough memory, so numpy throws an ArrayMemoryError.
numpy.core._exceptions._ArrayMemoryError: Unable to allocate 77.8 GiB for an array with shape (1, 226112, 369478) and data type uint8
My code is as follows:
import rasterio
import rasterio.features
import rasterio.warp
file_path = r'Path\to\file\ESACCI-LC-L4-LC10-Map-10m-MEX.tif'
with rasterio.open(file_path) as dataset:
    # Read the dataset's valid data mask as a ndarray.
    mask = dataset.dataset_mask()
    # Extract feature shapes and values from the array.
    for geom, val in rasterio.features.shapes(
            mask, transform=dataset.transform):
        # Transform shapes from the dataset's own coordinate
        # reference system to CRS84 (EPSG:4326).
        geom = rasterio.warp.transform_geom(
            dataset.crs, 'EPSG:4326', geom, precision=6)
        # Print GeoJSON shapes to stdout.
        print(geom)
I need a way to store the numpy array to disk, so I tried looking into numpy memmap, but I do not understand how to implement it for this. Additionally, I do not need the full geospatial data; I am only interested in the lat, long, and the type of land cover, as I plan to merge this with another dataset.
Using Python 3.9.
Edit:
I updated my code to try using a window.
from rasterio.windows import Window

with rasterio.open(file_path) as dataset:
    mask = dataset.read(1, window=Window(0, 0, 226112, 369478))
    ...
I can obviously adjust the window and load the file in sections now. However, I do not understand why this has almost halved the memory required, from 77.8 GiB to 47.6 GiB.
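For reference, one way to keep peak memory small is to iterate over the file's native block windows instead of one huge window. A minimal sketch (not part of the original post), assuming the GeoTIFF is internally tiled:
import rasterio

file_path = r'Path\to\file\ESACCI-LC-L4-LC10-Map-10m-MEX.tif'
with rasterio.open(file_path) as dataset:
    # block_windows yields (block_index, Window) pairs that follow the file's
    # internal tiling, so each read stays small.
    for _, window in dataset.block_windows(1):
        block = dataset.read(1, window=window)
        # process `block` here, e.g. collect land-cover classes per window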

Nested loops altering rows in pandas - Avoiding "A value is trying to be set on a copy of a slice from a DataFrame"

Summary
I am trying to loop through a pandas dataframe, and to run a secondary loop at each iteration. The secondary loop calculates something that I want to append into the original dataframe, so that when the primary loop advances, some of the rows are recalculated based on the changed values. (For those interested, this is a simple advective model of carbon accumulation in soils. When a new layer of soil is deposited, mixing processes penetrate into older layers and transform their properties to a set depth. Thus, each layer deposited changes those below it incrementally, until a former layer lies below the mixing depth.)
I have produced an example of how I want this to work; however, it generates the common warning:
A value is trying to be set on a copy of a slice from a DataFrame
See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
self._setitem_single_block(indexer, value, name)
I have looked into the documentation linked in the warning as well as myriad posts on this forum, but none address continually looping through a changing dataframe.
What I've tried, and some possible solutions
Below is some example code. It works more or less as I want it to, but it produces the warning. Should I:
Suppress the warning and continue working with this architecture? In this case, am I asking for trouble with un-reproducible results?
Try a different architecture altogether, like a numpy array from the original dataframe?
Try df.append() or df.copy() to avoid the warning?
I have tried `df.copy()` to no avail - the warning was still thrown.
Example code:
import pandas as pd
a = pd.DataFrame(
    {
        'a': [x/2 for x in range(1, 11)],
        'b': ['hot dog', 'slider', 'watermelon', 'funnel cake', 'cotton candy', 'lemonade', 'fried oreo', 'ice cream', 'corn', 'sausage'],
        'c': ['meat', 'meat', 'vegan', 'vegan', 'vegan', 'vegan', 'dairy', 'dairy', 'vegan', 'meat']
    }
)
print(a)

z = [x/(x+2) for x in range(1, 5)]
print(z)

# Primary loop through rows of the main dataframe
for ind, row in a.iterrows():
    # Pull out a chunk of the dataframe. This is the portion of the dataframe that
    # will be modified. What is below this is already modified and locked into the
    # geological record. What is above has not yet been deposited.
    b = a.iloc[ind:(ind + len(z)), :]
    # Define the size of the secondary loop. Taking the minimum avoids the model
    # mixing below the boundary layer (key error).
    loop = min([len(z), len(b)])
    # Now loop through the sub-dataframe and change accordingly.
    for fraction in range(loop):
        b['a'].iloc[fraction] = b['a'].iloc[fraction] * z[fraction]
    # Append the original dataframe with new data:
    a.iloc[ind:(ind + loop), :] = b
    # Try df.copy(), but still throws warning!
    # a.iloc[ind:(ind + loop), :] = b.copy()

print(a)
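One possible way to sidestep the warning (a sketch of the "different architecture" option above, not a definitive fix) is to write directly into a with positional indexing on both axes, so no intermediate slice is ever modified:
# Same setup as above: multiply column 'a' of each of the next `loop` rows by z,
# assigning through a.iloc on the original frame instead of through the slice b.
col = a.columns.get_loc('a')
for ind in range(len(a)):
    loop = min(len(z), len(a) - ind)
    for fraction in range(loop):
        a.iloc[ind + fraction, col] *= z[fraction]
print(a)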

Awkward Array: How to get numpy array after storing as Parquet (not BitMasked)?

I want to store 2D arrays of different lengths as an Awkward Array, save them as Parquet, and later access them again.
The problem is that, after loading from Parquet, the format is BitMaskedArray and the access performance is a bit slow, as demonstrated by the following code:
import numpy as np
import awkward as awk
# big to feel performance (imitating big audio file); 2D
np_arr0 = np.arange(20000000, dtype=np.float32).reshape(2, -1)
print(np_arr0.shape)
# (2, 10000000)
# different size
np_arr1 = np.arange(20000000, 36000000, dtype=np.float32).reshape(2, -1)
print(np_arr1.shape)
# (2, 8000000)
# slow; turn into AwkwardArray
awk_arr = awk.fromiter([np_arr0, np_arr1])
# fast; returns np.ndarray
awk_arr[0][0]
# store and load from parquet
awk.toparquet("sample.parquet", awk_arr)
pq_array = awk.fromparquet("sample.parquet")
# kinda slow; return BitMaskedArray
pq_array[0][0]
If we inspect the return, we see:
pq_array[0][0].layout
# layout
# [ ()] BitMaskedArray(mask=layout[0], content=layout[1], maskedwhen=False, lsborder=True)
# [ 0] ndarray(shape=1250000, dtype=dtype('uint8'))
# [ 1] ndarray(shape=10000000, dtype=dtype('float32'))
# trying to access only float32 array [1]
pq_array[0][0][1]
# expected
# array([0.000000e+00, 1.000000e+00, 2.000000e+00, ..., 9.999997e+06, 9.999998e+06, 9.999999e+06], dtype=float32)
# reality
# 1.0
Question
How can I load AwkwardArray from Parquet and quickly access the numpy values?
Info from README (GitHub)
awkward.fromparquet is lazy-loading the Parquet file.
Good, that's what will help when doing e.g. pq_array[0][0][:1000].
The next layer of new structure is that the jagged array is bit-masked. Even though none of the values are nullable, this is an artifact of the way Parquet formats columnar data.
I guess there is no way around this. However, is this the reason why loading is kinda slow? Can I still access the data as a numpy.ndarray by reaching it directly (not bit-masked)?
Additional attempt
Loading it with Arrow, then Awkward:
import pyarrow as pa
import pyarrow.parquet as pq
# Parquet as Arrow
pa_array = pq.read_table("sample.parquet")
# returns table instead of JaggedArray
awk.fromarrow(pa_array)
# <Table [<Row 0> <Row 1>] at 0x7fd92c83aa90>
In both Arrow and Parquet, all data is nullable, so Arrow/Parquet writers are free to throw in bitmasks wherever they want to. When reading the data back, Awkward has to treat those bitmasks as meaningful (mapping them to awkward.BitMaskedArray), but they might be all valid, particularly if you know that you didn't set any values to null.
If you're willing to ignore the bitmask, you can reach behind it by calling
pq_array[0][0].content
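For instance (a small check reusing the question's pq_array; the types follow from the layout printed above, not verified in the original answer), the content behind the mask should then be usable as an ordinary NumPy array:
flat = pq_array[0][0].content   # per the layout, a float32 ndarray of length 10000000
print(type(flat), flat.dtype, flat.shape)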
As for the slowness, I can say that
import awkward as ak
# slow; turn into AwkwardArray
awk_arr = ak.fromiter([np_arr0, np_arr1])
is going to be slow because ak.fromiter is one of the few functions that is implemented with a Python for loop—iterating over 10 million values in a NumPy array with a Python for loop is going to be painful. You can build the same thing manually with
>>> ak_arr0 = ak.JaggedArray.fromcounts([np_arr0.shape[1], np_arr0.shape[1]],
... np_arr0.reshape(-1))
>>> ak_arr1 = ak.JaggedArray.fromcounts([np_arr1.shape[1], np_arr1.shape[1]],
... np_arr1.reshape(-1))
>>> ak_arr = ak.JaggedArray.fromcounts([len(ak_arr0), len(ak_arr1)],
... ak.concatenate([ak_arr0, ak_arr1]))
As for Parquet being slow, I can't say why: it could be related to page size or row group size. Since Parquet is a "medium weight" file format (between "heavyweights" like HDF5 and "lightweights" like npy/npz), it has a few tunable parameters (not a lot).
You might also want to consider
ak.save("file.awkd", ak_arr)
ak_arr2 = ak.load("file.awkd")
which is really just the npy/npz format with JSON metadata to map Awkward arrays to and from flat NumPy arrays. For this sample, the file.awkd is 138 MB.

ValueError: shapes (5,14) and (16,) not aligned: 14 (dim 1) != 16 (dim 0)

I am working on the housing dataset, and when trying to fit the linear regression model I get the error mentioned in the title. The complete code is below.
I am not sure where the code is going wrong; I pasted it as-is from the reference book.
from sklearn.linear_model import LinearRegression
lin_reg = LinearRegression()
lin_reg.fit(housing_prepared, housing_labels)
some_data = housing.iloc[:5]
some_labels = housing_labels.iloc[:5]
some_data_prepared = full_pipeline.transform(some_data)
print("Predictions:\t", lin_reg.predict(some_data_prepared))
ERROR: ValueError: shapes (5,14) and (16,) not aligned: 14 (dim 1) != 16 (dim 0)
What am I doing wrong here?
Explanation
Hi, I guess you are reading and following the Hands-On Machine Learning with Scikit-Learn and TensorFlow book. The same problem happened to me.
In the following part of the code you select the first 5 instances from the data set. One of the attributes in the data set, ocean_proximity, is an object, and for the linear regression model to be able to operate with it, it must be translated to numbers, which in the book is done with one-hot encoding.
One-hot encoding works by collecting all the categories that can be assigned to the attribute, in this case 5 ('<1H OCEAN', 'INLAND', 'NEAR OCEAN', 'NEAR BAY', 'ISLAND'), and then creating a vector of that length for each instance, with every element set to zero except the one for that instance's category, which is set to 1 (or another value). For example:
If ocean_proximity equals '<1H OCEAN' the conversion would be [1, 0, 0, 0, 0]
When you select the first five instances of the data set, nothing guarantees that all the categories of "ocean_proximity" will appear. It could happen that only 3 of them appear, or just 1. Therefore, if you apply one-hot encoding to those five selected rows and only 3 categories appear (for example just 'INLAND', 'ISLAND' and 'NEAR BAY'), the vectors created by the one-hot encoding will be of length 3.
some_data = housing.iloc[:5]
some_labels = housing_labels.iloc[:5]
some_data_prepared = full_pipeline.transform(some_data)
The error is just telling you that, since the one-hot conversion of some_data created vectors shorter than 5, the total number of columns in some_data_prepared is 14, which is less than the 16 columns in housing_prepared, so the model cannot make predictions with it.
If you transform both some_data_prepared and housing_prepared into dataframes and then call .head() you will see the problem.
pd.DataFrame(some_data_prepared).head()
pd.DataFrame(housing_prepared).head()
Solution
To solve the problem you must create the missing columns in some_data_prepared by creating a zeroed numpy array of shape (5, x) (where 5 is the number of rows and x the number of missing columns) and concatenating it to some_data_prepared so that it matches the shape of the housing_prepared data set.
import numpy as np

some_data = housing.iloc[:5]
some_labels = housing_labels.iloc[:5]
some_data_prepared = full_pipeline.fit_transform(some_data)
dummy_array = np.zeros((5,1))
some_data_prepared = np.c_[some_data_prepared, dummy_array]
predictions = linear_regression.predict(some_data_prepared)
print("Predictions: ", predictions)
print("Labels: ", some_labels.values)
The issue is missing category values (ocean_proximity in this case) in some_data compared to housing_prepared.
housing_prepared.shape gives (16512, 16), but some_data_prepared.shape gives (5,14), so add zeros for the missing columns:
dummy_array = np.zeros((5,2))
some_data_prepared = np.c_[some_data_prepared,dummy_array]
The 2 in np.zeros is the number of missing columns.
I at first encountered the same issue on this piece of code. After exploring the issues of the handson-ml repository, I think I have understood the subtlety which is causing the error here.
My guess is that (as in my case) closing the notebook caused what was in memory (the trained model in particular) to be lost. In my case, I could get the result and avoid the error by rerunning the notebook from the beginning.
From a theoretical viewpoint, though, you should never call fit() or fit_transform() on data which is not training data (e.g. on some_data). Here, running fit_transform(some_data) and then stacking the dummy array onto some_data_prepared works, but it forces the pipeline to be fitted again on some_data rather than on housing_prepared, which is not what you want.
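A minimal sketch of the shape mismatch in isolation (a toy column standing in for ocean_proximity, not the book's pipeline): an encoder fitted on the full training data keeps all categories when transforming a subset, while refitting on the subset silently drops columns.
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# toy stand-in for the ocean_proximity column
full = pd.DataFrame({'cat': ['<1H OCEAN', 'INLAND', 'NEAR OCEAN', 'NEAR BAY', 'ISLAND']})
subset = full.iloc[:2]                               # only 2 of the 5 categories appear

enc = OneHotEncoder().fit(full)                      # fit on the full training data
print(enc.transform(subset).shape)                   # (2, 5): all categories kept

print(OneHotEncoder().fit_transform(subset).shape)   # (2, 2): columns go missing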

Resized copy of Pytorch Tensor/Dataset

I have a homemade dataset with a few million rows. I am trying to make truncated copies. So I clip the tensors that I'm using to make the original dataset and create a new dataset. However, when I save the new dataset, which is only 20K rows, it's the same size on disk as the original dataset. Otherwise everything seems kosher, including, when I check, the size of the new tensors. What am I doing wrong?
#original dataset - 2+million rows
dataset = D.TensorDataset(training_data, labels)
torch.save(dataset, filename)
#20k dataset for experiments
d = torch.Tensor(training_data[0:20000])
l = torch.Tensor(labels[0:20000])
ds_small = D.TensorDataset(d,l)
#this is the same size as the one above on disk... approx 1.45GB
torch.save(ds_small, filename_small)
Thanks
In your code d and training_data share the same underlying storage, even though you use slicing during the creation of d: slicing a tensor returns a view, and torch.save serializes the whole storage that a tensor refers to, so the small dataset still carries the full 2+ million rows on disk. To give you a solution:
d = x[0:10000].clone()
l = y[0:10000].clone()
clone() will give you tensors whose memory is independent from the old tensors', and the file size will be much smaller.
Note that using torch.Tensor() is not necessary when creating d and l since training_data and labels are already tensors.
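Applied to the question's snippet (a sketch, reusing the question's D, filename_small, and tensors):
# slice, then clone so the new tensors own only the 20k rows
d = training_data[0:20000].clone()
l = labels[0:20000].clone()
ds_small = D.TensorDataset(d, l)
torch.save(ds_small, filename_small)  # the saved file now only contains the 20k rows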
