Awkward Array: How to get numpy array after storing as Parquet (not BitMasked)? - python-3.x

I want to store 2D arrays of different lengths as an AwkwardArray, write them to Parquet, and later access them again.
The problem is that, after loading from Parquet, the format is BitMaskedArray and access is somewhat slow, as demonstrated by the following code:
import numpy as np
import awkward as awk
# big to feel performance (imitating big audio file); 2D
np_arr0 = np.arange(20000000, dtype=np.float32).reshape(2, -1)
print(np_arr0.shape)
# (2, 10000000)
# different size
np_arr1 = np.arange(20000000, 36000000, dtype=np.float32).reshape(2, -1)
print(np_arr1.shape)
# (2, 8000000)
# slow; turn into AwkwardArray
awk_arr = awk.fromiter([np_arr0, np_arr1])
# fast; returns np.ndarray
awk_arr[0][0]
# store and load from parquet
awk.toparquet("sample.parquet", awk_arr)
pq_array = awk.fromparquet("sample.parquet")
# kinda slow; returns BitMaskedArray
pq_array[0][0]
If we inspect the return, we see:
pq_array[0][0].layout
# layout
# [ ()] BitMaskedArray(mask=layout[0], content=layout[1], maskedwhen=False, lsborder=True)
# [ 0] ndarray(shape=1250000, dtype=dtype('uint8'))
# [ 1] ndarray(shape=10000000, dtype=dtype('float32'))
# trying to access only float32 array [1]
pq_array[0][0][1]
# expected
# array([0.000000e+00, 1.000000e+00, 2.000000e+00, ..., 9.999997e+06, 9.999998e+06, 9.999999e+06], dtype=float32)
# reality
# 1.0
Question
How can I load AwkwardArray from Parquet and quickly access the numpy values?
Info from README (GitHub)
awkward.fromparquet is lazy-loading the Parquet file.
Good, that's what will help when doing e.g. pq_array[0][0][:1000]
The next layer of new structure is that the jagged array is bit-masked. Even though none of the values are nullable, this is an artifact of the way Parquet formats columnar data.
I guess there is no way around this. However, is this the reason why loading is kinda slow? Can I still access the data directly as a numpy.ndarray (bypassing the bitmask)?
Additional attempt
Loading it with Arrow, then Awkward:
import pyarrow as pa
import pyarrow.parquet as pq
# Parquet as Arrow
pa_array = pq.read_table("sample.parquet")
# returns table instead of JaggedArray
awk.fromarrow(pa_array)
# <Table [<Row 0> <Row 1>] at 0x7fd92c83aa90>

In both Arrow and Parquet, all data is nullable, so Arrow/Parquet writers are free to throw in bitmasks wherever they want to. When reading the data back, Awkward has to treat those bitmasks as meaningful (mapping them to awkward.BitMaskedArray), but they might be all valid, particularly if you know that you didn't set any values to null.
If you're willing to ignore the bitmask, you can reach behind it by calling
pq_array[0][0].content
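For example, a quick check (a sketch, assuming the layout printed above, where layout[1] is a plain float32 ndarray):
# .content exposes the underlying buffer, so slicing it is ordinary NumPy indexing
vals = pq_array[0][0].content
print(type(vals))   # expected: <class 'numpy.ndarray'>
print(vals[:5])     # first five samples, no bitmask involved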
As for the slowness, I can say that
import awkward as ak
# slow; turn into AwkwardArray
awk_arr = ak.fromiter([np_arr0, np_arr1])
is going to be slow because ak.fromiter is one of the few functions that is implemented with a Python for loop; iterating over 10 million values in a NumPy array with a Python for loop is going to be painful. You can build the same thing manually with
>>> ak_arr0 = ak.JaggedArray.fromcounts([np_arr0.shape[1], np_arr0.shape[1]],
... np_arr0.reshape(-1))
>>> ak_arr1 = ak.JaggedArray.fromcounts([np_arr1.shape[1], np_arr1.shape[1]],
... np_arr1.reshape(-1))
>>> ak_arr = ak.JaggedArray.fromcounts([len(ak_arr0), len(ak_arr1)],
... ak.concatenate([ak_arr0, ak_arr1]))
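A quick sanity check on the manually built array (a sketch; it assumes the code above ran and just confirms the structure matches what fromiter produced):
# expected: 2 outer entries of 2 rows each, first row equal to np_arr0[0]
print(len(ak_arr), len(ak_arr[0]))               # expected: 2 2
print(np.array_equal(ak_arr[0][0], np_arr0[0]))  # expected: True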
As for Parquet being slow, I can't say why: it could be related to page size or row group size. Since Parquet is a "medium weight" file format (between "heavyweights" like HDF5 and "lightweights" like npy/npz), it has a few tunable parameters (not a lot).
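Those knobs can be experimented with by round-tripping through pyarrow (a sketch, not a benchmark; row_group_size and data_page_size are standard pyarrow.parquet.write_table options, and the values below are guesses rather than recommendations):
import pyarrow.parquet as pq
# re-write the table read earlier with explicit row-group / page settings,
# then reload it with awkward to compare access speed
table = pq.read_table("sample.parquet")
pq.write_table(table, "sample_tuned.parquet",
               row_group_size=1,            # one outer row per row group
               data_page_size=1024 * 1024)  # ~1 MiB data pages
pq_array2 = ak.fromparquet("sample_tuned.parquet")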
You might also want to consider
ak.save("file.awkd", ak_arr)
ak_arr2 = ak.load("file.awkd")
which is really just the npy/npz format with JSON metadata to map Awkward arrays to and from flat NumPy arrays. For this sample, the file.awkd is 138 MB.

Related

Python PyVisa convert queried binary data to ascii data

I'm currently using a Keysight VNA and I control it using PyVISA. Since I have a rapidly changing system, I want to query binary data instead of ASCII data from the machine, since it is about 10 times faster. The issue I am having is converting the data back to ASCII again.
Minimal example code:
import pyvisa as visa
import numpy as np
device_address = 'TCPIP0::localhost::hislip1,4880::INSTR'
rm = visa.ResourceManager('C:\\Windows\\System32\\visa32.dll')
device = rm.open_resource(device_address)
# presetting device for SNP data measurement
# ...
device.query_ascii_values('CALC:DATA:SNP? 2', container=np.ndarray)  # works super but is slow
device.write('FORM:DATA REAL,64')
device.query_binary_values('CALC:DATA:SNP? 2', container=np.ndarray)  # 10 times faster but how to read data
The official docs for querying binary values don't give me anything. I found the functions for the code on git here and some helper functions for converting data here, but I am still unable to convert the data such that it matches what I got from the ASCII query command. If possible, I would like to keep 'container=np.ndarray'.
Functions from the last link that I have tested:
bin_data = device.query_binary_values('CALC:DATA:SNP? 2', container = np.ndarray)
num = from_binary_block(bin_data) # "Convert a binary block into an iterable of numbers."
ascii_data = to_ascii_block(num) # "Turn an iterable of numbers in an ascii block of data."
but the data from query_ascii_values and the values of ascii_data don't match. Any help is highly appreciated.
Edit:
With the following code
device.write(f"SENS:SWE:POIN 5;")
data_bin = device.query_binary_values('CALC:DATA? SDATA', container=np.ndarray)
I got
data_bin = array([-5.0535379e-34, 1.3452465e-43, -1.7349754e+09, 1.3452465e-43,
-8.6640313e+22, 8.9683102e-44, 5.0314407e-06, 3.1389086e-43,
4.8143607e-36, 3.1389086e-43, -4.1738553e-12, 1.3452465e-43,
-1.5767541e+11, 8.9683102e-44, -2.8241991e+32, 1.7936620e-43,
4.3024710e+16, 1.3452465e-43, 2.1990014e+07, 8.9683102e-44],
dtype=float32)
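One thing worth checking (an assumption on my part, not a confirmed diagnosis): FORM:DATA REAL,64 makes the instrument send 64-bit floats, while query_binary_values defaults to datatype='f' (32-bit), which would produce garbage values like the ones above. A sketch using the standard PyVISA arguments:
device.write('FORM:DATA REAL,64')
data_bin = device.query_binary_values('CALC:DATA? SDATA',
                                      datatype='d',         # 64-bit floats to match REAL,64
                                      is_big_endian=False,  # verify against the VNA's FORM:BORD setting
                                      container=np.ndarray)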

Reading a set of HDF5 files and then slicing the resulting datasets without storing them in the end

I think some of my question is answered here. But the difference in my case is that I'm wondering if it is possible to do the slicing step without having to re-write the datasets to another file first.
Here is the code that reads in a single HDF5 file that is given as an argument to the script:
with h5py.File(args.H5file, 'r') as df:
    print('Here are the keys of the input file\n', df.keys())
    # interesting point here: you need the [:] behind each of these and we didn't need it when
    # creating datasets not using the 'with' formalism above. Adding that even handled the cases
    # in the 'hits' and 'truth_hadrons' where there are additional dimensions...go figure.
    jetdset = df['jets'][:]
    haddset = df['truth_hadrons'][:]
    hitdset = df['hits'][:]
Then later I do some slicing operations on these datasets.
Ideally I'd be able to pass a wild-card into args.H5file and then the whole set of files, all with the same data formats, would end up in the three datasets above.
I do not want to store or make persistent these three datasets at the end of the script as the output are plots that use the information in the slices.
Any help would be appreciated!
There are at least 2 ways to access multiple files:
1. If all files follow a naming pattern, you can use the glob module. It uses wildcards to find files. (Note: I prefer glob.iglob; it is an iterator that yields values without creating a list. glob.glob creates a list, which you frequently don't need.)
2. Alternatively, you could input a list of filenames and loop over the list.
Example of iglob:
import glob
import h5py
for fname in glob.iglob('img_data_0?.h5'):
    with h5py.File(fname, 'r') as h5f:
        print('Here are the keys of the input file\n', h5f.keys())
Example with a list of names:
filenames = ['img_data_01.h5', 'img_data_02.h5', 'img_data_03.h5']
for fname in filenames:
    with h5py.File(fname, 'r') as h5f:
        print('Here are the keys of the input file\n', h5f.keys())
Next, your code mentions using [:] when you access a dataset. Whether or not you need to add indices depends on the object you want returned.
If you include [()], it returns the entire dataset as a numpy array. Note [()] is now preferred over [:]. You can use any valid slice notation, e.g., [0,0,:] for a slice of a 3-axis array.
If you don't include [:], it returns a h5py dataset object, which
behaves like a numpy array. (For example, you can get dtype and shape, and slice the data). The advantage? It has a smaller memory footprint. I use h5py dataset objects unless I specifically need an array (for example, passing image data to another package).
Examples of each method:
jets_dset = h5f['jets'] # w/out [()] returns a h5py dataset object
jets_arr = h5f['jets'][()] # with [()] returns a numpy array object
Finally, if you want to create a single array that merges values from 3 datasets, you have to create an array big enough to hold the data, then load with slice notation. Alternatively, you can use np.concatenate() (However, be careful, as concatenating a lot of data can be slow.)
A simple example is shown below. It assumes you know the shape of the datasets and that it is the same for all 3 files (a0, a1 are the axis lengths for 1 dataset). If you don't know them, you can get them from the .shape attribute.
Example for method 1 (pre-allocating array jets3x_arr):
a0, a1 = 100, 100
jets3x_arr = np.empty(shape=(a0, a1, 3))  # add dtype= if not float
for cnt, fname in enumerate(glob.iglob('img_data_0?.h5')):
    with h5py.File(fname, 'r') as h5f:
        jets3x_arr[:, :, cnt] = h5f['jets']
Example for method 2 (using np.concatenate()):
a0, a1 = 100, 100
for cnt, fname in enumerate(glob.iglob('img_data_0?.h5')):
    with h5py.File(fname, 'r') as h5f:
        if cnt == 0:
            jets3x_arr = h5f['jets'][()].reshape(a0, a1, 1)
        else:
            jets3x_arr = np.concatenate(
                (jets3x_arr, h5f['jets'][()].reshape(a0, a1, 1)), axis=2)
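Putting the pieces together for the original use case (a sketch; it assumes args.H5file holds a glob pattern such as 'img_data_0?.h5' and that all three datasets exist in every file):
import glob
import h5py
# loop over every matching file, slice in memory, never write anything back out
for fname in glob.iglob(args.H5file):
    with h5py.File(fname, 'r') as h5f:
        jetdset = h5f['jets'][()]
        haddset = h5f['truth_hadrons'][()]
        hitdset = h5f['hits'][()]
        # ... slicing and plotting for this file only ...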

accessing geospatial raster data with limited memory

I am following the Rasterio documentation to access the geospatial raster data downloaded from here -- a large tiff image. Unfortunately, I do not have enough memory so numpy throws an ArrayMemoryError.
numpy.core._exceptions._ArrayMemoryError: Unable to allocate 77.8 GiB for an array with shape (1, 226112, 369478) and data type uint8
My code is as follows:
import rasterio
import rasterio.features
import rasterio.warp
file_path = r'Path\to\file\ESACCI-LC-L4-LC10-Map-10m-MEX.tif'
with rasterio.open(file_path) as dataset:
    # Read the dataset's valid data mask as a ndarray.
    mask = dataset.dataset_mask()
    # Extract feature shapes and values from the array.
    for geom, val in rasterio.features.shapes(
            mask, transform=dataset.transform):
        # Transform shapes from the dataset's own coordinate
        # reference system to CRS84 (EPSG:4326).
        geom = rasterio.warp.transform_geom(
            dataset.crs, 'EPSG:4326', geom, precision=6)
        # Print GeoJSON shapes to stdout.
        print(geom)
I need a way to store the numpy array on disk, so I tried looking into numpy memmap, but I do not understand how to implement it for this. Additionally, I do not need the full geospatial data; I am only interested in the lat, long, and the type of land cover, as I planned to merge this with another dataset.
Using python 3.9.
Edit:
I updated my code to try using a window.
from rasterio.windows import Window

with rasterio.open(file_path) as dataset:
    mask = dataset.read(1, window=Window(0, 0, 226112, 369478))
    ...
I can obviously adjust the window and load the file in sections now. However, I do not understand why this has almost halved the memory required, from 77.8 GiB to 47.6 GiB.
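If the goal is just to pull out land-cover values without ever holding the whole raster, one option (a sketch, untested on this file; block_windows is a standard rasterio DatasetReader method, and band 1 is assumed) is to iterate over the file's internal tiles:
import rasterio
with rasterio.open(file_path) as dataset:
    # iterate over the dataset's internal blocks so only one tile is in memory at a time
    for _, window in dataset.block_windows(1):
        tile = dataset.read(1, window=window)  # small uint8 array for this tile
        # ... extract the lat/long and land-cover classes you need from this tile ...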

Why is ColumnTransformer producing a different output using the same code but different .csv files?

I am trying to finish this course tooth and nail with the hope of being able to do this kind of work at entry level by springtime. This is my first post here on this incredible resource, and I will do my best to conform to the posting format. As a way to reinforce my learning and commit it to long-term memory, I'm trying the same things on my own dataset of > 500 entries containing data more relevant to me, as opposed to dummy data.
I'm learning about the data preprocessing phase where you fill in missing values and separate the columns into their respective X and Y to be fed into the models later on, if I understand correctly.
So in the course example, it's the top-left dataset of countries. Then the bottom left is my own database of data I've been keeping for about a year on a multiplayer game I play. It has 100 or so characters you can choose from, who are played across 5 different categorical roles.
[Images: course dataset (top left); personal dataset (bottom left); personal dataset column-transformed results]
What's up with the different outputs that are produced, when the only difference is the dataset (.csv file)? The course's dataset looks right; that first column of countries (textual categories) gets turned into binary vectors in the output, no? Why is the output on my dataset omitting columns and producing these bizarre-looking tuples followed by what looks like a random number? I've tried removing the np.array function, and I've tried printing each output at each level, but I am unable to see what's causing the difference. I expected that on my dataset it would transform the characters' names into binary vectors (combinations of 1s/0s?) so the computer can understand the difference and map them to the appropriate results. Instead I'm getting that weird-looking output I've never seen before.
EDIT: It turns out these bizarre number combinations are what's called a "sparse matrix." I had to do some research, starting with type(), which yielded csr_array. If I understood what I read correctly, all the stuff inside takes up one column, so I just tried all rows/columns using [:] and I didn't get an error.
Really appreciate your time and assistance.
EDIT: Thanks to this thread, I was able to make my way to the end of this data preprocessing/import/cleaning exercise, through to feature scaling, using my own dataset of ~550 rows.
import pandas as pd
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, LabelEncoder, StandardScaler
from sklearn.model_selection import train_test_split
# IMPORT RAW DATA // ASSIGN X AND Y RAW
df = pd.read_csv('datasets/winpredictor.csv')
X = df.iloc[:, :-1].values
y = df.iloc[:, -1].values
# TRANSFORM CATEGORICAL DATA
ct = ColumnTransformer(
    transformers=[('encoder', OneHotEncoder(), [0, 1])],
    remainder='passthrough')
le = LabelEncoder()
X = ct.fit_transform(X)
y = le.fit_transform(y)
# SPLIT THE DATA INTO TRAINING AND TEST SETS
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=.8, test_size=.2, random_state=1)
# FEATURE SCALING
sc = StandardScaler(with_mean=False)
X_train[:, :] = sc.fit_transform(X_train[:, :])
X_test[:, :] = sc.transform(X_test[:, :])
First of all, I encourage you to keep working through this course, and for sure you will be a great data scientist in a few weeks.
Let's talk about your problem. It seems that you only have a visualization issue, due to the large number of different "Hero" values (I think you have 37 unique ones).
Let me explain the results you printed. The program only shows the values in each sample that are different from 0:
(0, 10) = 1 --> 0 refers to the first sample, and 10 means that the 10th value of that sample is equal to 1.
(0, 37) = 5 --> 0 refers to the first sample, and 37 means that the 37th value is equal to 5.
etc.
So your first sample will be something like:
[0,0,0,0,0,0,0,0,0,0,1,.........., 5, 980,-30, 1000, 6023]
Which is the way to express the first sample of "Jakiro".
["Jakiro",5, 980,-30, 1000, 6023]
To sum up, the first 37 values come from your OneHotEncoder, and the last 5 are your original numerical values.
So it seems to be correct; it's just a different way of displaying the result, due to the large number of classes in the categorical variable.
You can try reducing the number of rows in X (to 4, for example) and run the same process. Then you will get output similar to the course's.
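If you want to see the familiar 0/1 columns instead of the sparse printout, you can also densify a few rows (a sketch; X here is the csr result of ct.fit_transform above, and .toarray() is the standard scipy conversion):
# show the first 4 samples as dense rows: one-hot columns followed by the numeric columns
print(X[:4].toarray())
# alternatively, the encoder can be told to return dense output directly; the keyword is
# sparse=False in older scikit-learn releases and sparse_output=False in newer ones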

memory issues for sparse one hot encoded features

I want to create a sparse matrix of one-hot encoded features from a data frame df, but I am getting a memory issue with the code given below. The shape of sparse_onehot is (450138, 1508).
sp_features = ['id', 'video_id', 'genre']
sparse_onehot = pd.get_dummies(df[sp_features], columns = sp_features)
import scipy
X = scipy.sparse.csr_matrix(sparse_onehot.values)
I get a memory error as shown below.
MemoryError: Unable to allocate 647. MiB for an array with shape (1508, 450138) and data type uint8
I have tried scipy.sparse.lil_matrix and get the same error as above.
Is there any efficient way of handling this?
Thanks in advance
Try setting the sparse parameter to True:
sparse : bool, default False
Whether the dummy-encoded columns should be backed by a SparseArray (True) or a regular NumPy array (False).
sparse_onehot = pd.get_dummies(df[sp_features], columns = sp_features, sparse = True)
This will use a much more memory efficient (but somewhat slower) representation than the default one.
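If a scipy matrix is still needed afterwards (for an estimator that wants csr input, say), the sparse-backed frame can be converted without materializing the dense array. A sketch, assuming all the dummy columns are sparse:
import pandas as pd
import scipy.sparse
sparse_onehot = pd.get_dummies(df[sp_features], columns=sp_features, sparse=True)
# DataFrame.sparse.to_coo() builds a scipy COO matrix straight from the sparse columns
X = sparse_onehot.sparse.to_coo().tocsr()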
