I have a tuple of numpy.ndarrays (of different dtypes), representing columns from a pandas dataframe.
args = tuple(seconds[column].values for column in seconds if column != 'pair')
In order to know the column names I have created a numba typedDict col_names looking like this:
DictType[unicode_type,int64]<iv=None>({ts: 0, volume: 1, price: 2, ema_14: 3, slope_14: 4})
The numba function takes a dynamic number of numpy.ndarrays through the *args tuple, plus the col_names typed dictionary.
However, when I try to access a specific numpy array from the tuple with args[col_names['ts']], numba won't allow it, because col_names['ts'] is not a compile-time constant.
from numba import njit

@njit
def numba_function(col_names, *args):
    print(args[col_names['ts']])  # this doesn't work, as col_names['ts'] is not a compile-time constant
    print(args[0])                # this works, as 0 is a compile-time constant
Can you suggest a better approach for this problem?
Thank you in advance!
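A possible workaround, assuming a Numba version that provides numba.literal_unroll (0.47+), is to unroll the heterogeneous tuple and compare a running index against the value looked up in the typed dictionary. This is only a sketch; the function name and the placeholder work inside the loop are mine:
from numba import njit, literal_unroll

@njit
def use_column(col_names, name, *args):
    target = col_names[name]          # runtime value looked up in the typed dict
    i = 0
    for arr in literal_unroll(args):  # literal_unroll versions the loop body per element type
        if i == target:
            print(arr.shape[0])       # placeholder work; replace with whatever you need from this array
        i += 1
Returning the selected array directly is harder, because each element of args can have a different dtype, so it is usually easier to do the per-array work inside the unrolled loop.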
Below is a screenshot of the branches of data in my .hdf5 file. I am trying to extract the existing column names (ie. experiment_id, session_id....) from this particular BlinkStartEvent segment.
I have the following code that was able to access this section of the data and extract the numerical data as well. But for some reason, I cannot extract the corresponding column names, which I wish to append to a separate list so I can create a dictionary out of this entire dataset. I thought .keys() was supposed to do it, but it didn't.
import h5py

def traverse_datasets(hdf_file):

    def h5py_dataset_iterator(g, prefix=''):
        for key in g.keys():
            #print(key)
            item = g[key]
            path = f'{prefix}/{key}'
            if isinstance(item, h5py.Dataset):  # test for dataset
                yield (path, item)
            elif isinstance(item, h5py.Group):  # test for group (go down)
                yield from h5py_dataset_iterator(item, path)

    for path, _ in h5py_dataset_iterator(hdf_file):
        yield path

with h5py.File(filenameHDF[0], 'r') as f:
    for dset in traverse_datasets(f):
        if str(dset[-15:]) == 'BlinkStartEvent':
            print('-----Path:', dset)                # path that leads to the data
            print('-----Shape:', f[dset].shape)      # the length (dimension) of the data
            print('-----Data type:', f[dset].dtype)  # prints the compound dtype for all columns
            data2 = f[dset][()]                      # the entire dataset
            # print('Check column names', f[dset].keys())
            # -> AttributeError: 'Dataset' object has no attribute 'keys'
I got the following as the output:
-----Path: /data_collection/events/eyetracker/BlinkStartEvent
-----Shape: (220,)
-----Data type: [('experiment_id', '<u4'), ('session_id', '<u4'), ('device_id', '<u2'), ('event_id', '<u4'), ('type', 'u1'), ('device_time', '<f4'), ('logged_time', '<f4'), ('time', '<f4'), ('confidence_interval', '<f4'), ('delay', '<f4'), ('filter_id', '<i2'), ('eye', 'u1'), ('status', 'u1')]
Traceback (most recent call last):
  File "C:\Users\angjw\Dropbox\NUS PVT\Analysis\PVT analysis_hdf5access.py", line 64, in <module>
    print('Check column names', f[dset].keys())
AttributeError: 'Dataset' object has no attribute 'keys'
What am I getting wrong here?
Also, is there a more efficient way to access the data such that I can do something (hypothetical) like:
data2[0]['experiment_id'] = 1
data2[1]['time'] = 78.35161
data2[2]['logged_time'] = 80.59253
rather than having to go through the process of setting up a dictionary for every single row of data?
You're close. The dataset's .dtype attribute returns the dataset's type as a NumPy dtype. Adding .descr returns it as a list of (field name, field type) tuples. The code below prints the field names inside your loop:
for (f_name, f_type) in f[dset].dtype.descr:
    print(f_name)
There are better ways to work with HDF5 data than creating a dictionary for every single row of data (unless you absolutely want a dictionary for some reason). h5py is designed to work with dataset objects similar to NumPy arrays. (However, not all NumPy operations work on h5py dataset objects). The following code accesses the data and returns 2 similar (but slightly different) data objects.
# this returns a h5py dataset object that behaves like a NumPy array:
dset_obj = f[dset]
# this returns a NumPy array:
dset_arr = f[dset][()]
You can slice data from either object using standard NumPy slicing notation (using field names and row values). Continuing from above...
# returns row 0 from field 'experiment_id'
val0 = dset_obj[0]['experiment_id']
# returns row 1 from field 'time'
val1 = dset_obj[1]['time']
# returns row 2 from field 'logged_time'
val2 = dset_obj[2]['logged_time']
(You will get the same values if you replace dset_obj with dset_arr above.)
You can also slice entire fields/columns like this:
# returns field 'experiment_id' as a NumPy array
expr_arr = dset_obj['experiment_id']
# returns field 'time' as a NumPy array
time_arr = dset_obj['time']
# returns field 'logged_time' as a NumPy array
logtime_arr = dset_obj['logged_time']
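And if you do still want a dictionary keyed by column name, it doesn't have to be built row by row; a short sketch reusing the objects above (the dict comprehension is my addition for illustration, not part of the original code):
# map each field name to the full column, read once per field
col_dict = {name: dset_obj[name] for name in dset_obj.dtype.names}
print(col_dict['experiment_id'][:5])  # first 5 values of that column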
That should answer your initial questions. If not, please add comments (or modify the post), and I will update my answer.
My previous answer used the h5py package (same package as your code). There is another Python package that I like to use with HDF5 data: PyTables (aka tables). Both are very similar, and each has unique strengths.
h5py attempts to map the HDF5 feature set to NumPy as closely as possible. Also, it uses Python dictionary syntax to iterate over object names and values. So, it is easy to learn if you are familiar with NumPy. Otherwise, you have to learn some NumPy basics (like interrogating dtypes). Homogeneous data is returned as a np.array and heterogeneous data (like yours) is returned as a np.recarray.
PyTables builds an additional abstraction layer on top of HDF5 and NumPy. Two unique capabilities I like are: 1) recursive iteration over nodes (groups or datasets), so a custom dataset generator isn't required, and 2) heterogeneous data is accessed with a "Table" object that has more methods than basic NumPy recarray methods. (Plus it can do complex queries on tables, has advanced indexing capabilities, and is fast!)
To compare them, I rewrote your h5py code with PyTables so you can "see" the difference. I incorporated all the operations in your question, and included the equivalent calls from my h5py answer. Differences to note:
The f.walk_nodes() method is a built-in method that replaces your generator. However, it returns an object (a Table object in this case), not the Table (dataset) name. So, the code is slightly different in order to work with the object instead of the name.
Use Table.read() to load the data into a NumPy (record) array. Different examples show how to load the entire Table into an array, or load a single column referencing the field name.
Code below:
import tables as tb

with tb.File(filenameHDF[0], 'r') as f:
    for tb_obj in f.walk_nodes('/', 'Table'):
        if str(tb_obj.name[-15:]) == 'BlinkStartEvent':
            print('-----Name:', tb_obj.name)          # Table name without the path
            print('-----Path:', tb_obj._v_pathname)   # path that leads to the data
            print('-----Shape:', tb_obj.shape)        # the length dimension of the data
            print('-----Data type:', tb_obj.dtype)    # prints out the np.dtype for all column names/variable types
            print('-----Field/Column names:', tb_obj.colnames)  # prints out the names of all columns as a list
            data2 = tb_obj.read()                     # the entire Table (dataset) into array data2
            # returns field 'experiment_id' as a NumPy (record) array
            expr_arr = tb_obj.read(field='experiment_id')
            # returns field 'time' as a NumPy (record) array
            time_arr = tb_obj.read(field='time')
            # returns field 'logged_time' as a NumPy (record) array
            logtime_arr = tb_obj.read(field='logged_time')
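To illustrate the query capability mentioned earlier, Table.read_where() accepts a condition string and returns only the matching rows as a NumPy record array. A sketch using the tb_obj from the loop above; the condition on the 'eye' field is purely illustrative:
# rows where the 'eye' field equals 0
subset = tb_obj.read_where('eye == 0')
print(subset.shape)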
Does someone know why pandas behaves differently when the column we use as BY in GROUPBY contains only 1 unique value? Specifically, if there is just 1 unique value and the applied function returns a pandas.Series, the returned output is basically transposed in comparison to the case with multiple unique values:
dt = pd.date_range('2021-01-01', '2021-01-02 23:00', closed=None, freq='1H')
df = pd.DataFrame({'date':dt.date, 'vals': range(dt.shape[0])}, index=dt)
dt1 = pd.date_range('2021-01-01', '2021-01-01 23:00', closed=None, freq='1H')
df2 = pd.DataFrame({'date':dt1.date, 'vals': range(dt1.shape[0])}, index=dt1)
def f(row):
    return row['vals']
print(df.groupby('date').apply(f).shape)
print(df2.groupby('date').apply(f).shape)
[out 1] (48,)
[out 2] (1, 24)
Is there some simple parameter I can use to make sure the behavior is consistent? Would it make sense to submit it as a bug report due to the inconsistency, or is it "expected" (I understood from a previous question that sometimes poor design or a small quirk is not a bug)? (I still love pandas, it's just that these small things can make it very painful to use.)
squeeze()
DataFrame.squeeze() and Series.squeeze() can make the shapes consistent:
>>> df.groupby('date').apply(f).squeeze().shape
(48,)
>>> df2.groupby('date').apply(f).squeeze().shape
(24,)
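For context, squeeze() simply drops axes of length one, which is why it flattens the single-group result above. A standalone illustration with a throwaway frame (the toy data here is mine):
import pandas as pd

one_row = pd.DataFrame([[1, 2, 3]])
print(one_row.shape)            # (1, 3)
print(one_row.squeeze().shape)  # (3,) -- the length-1 row axis is dropped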
squeeze=True (deprecated)
groupby() has a squeeze param:
squeeze: Reduce the dimensionality of the return type if possible, otherwise return a consistent type.
>>> df.groupby('date', squeeze=True).apply(f).shape
(48,)
>>> df2.groupby('date', squeeze=True).apply(f).shape
(24,)
This has been deprecated since pandas 1.1.0 and will be removed in the future.
I have the following dict and pandas DataFrame.
import numpy as np
import pandas as pd
from pandas import Timestamp

sample_dict = {'isDuplicate': {'1051681551': False, '1037545402': True, '1035390559': False},
               'dateTime': {'1051681551': Timestamp('2019-01-29 09:09:00+0000', tz='UTC'),
                            '1037545402': Timestamp('2019-01-11 02:06:00+0000', tz='UTC'),
                            '1035390559': Timestamp('2019-01-08 14:35:00+0000', tz='UTC')},
               'dateTimePub': {'1051681551': None, '1037545402': None, '1035390559': None}}
df = pd.DataFrame.from_dict(sample_dict)
I want to apply a np.where() function to the dateTime and dateTimePub columns like:
def _replace_datetime_with_datetime_pub(news_dataframe):
    news_dataframe.dateTime = np.where(news_dataframe.dateTimePub, news_dataframe.dateTimePub, news_dataframe.dateTime)
    return pd.to_datetime(news_dataframe.dateTime)

df.apply(_replace_datetime_with_datetime_pub)
But I get the following error,
AttributeError: 'Series' object has no attribute 'dateTimePub'
It's possible to do df = _replace_datetime_with_datetime_pub(df). But my question is,
how to apply this function via either pd.DataFrame.apply or pd.DataFrame.transform method, and
why do I get this error?
I have already checked many other similar questions, but none of them had AttributeError.
With apply, you're breaking down your DataFrame into series to pass to your function. Since you don't specify the axis keyword argument, pandas assumes you want to pass each column as a series. This is the source of the AttributeError you're getting. For pandas to pass each row as a series, you want to specify axis=1 in your apply call.
Even then, you'll need to adapt the function some to fit it into the apply paradigm. In particular, you want to think of how the function should process each row it encounters. The function you pass to apply (if you specify axis=1) will work on each row in isolation from every other row. The return value from each row will then be stitched together to return a series.
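As a sketch of that row-wise pattern (reusing names from the question; this is only meant to show the axis=1 mechanics, not the most efficient way to do the replacement):
def _replace_datetime_with_datetime_pub(row):
    # each call receives one row as a Series; fall back to dateTime when dateTimePub is empty
    value = row['dateTimePub'] if row['dateTimePub'] else row['dateTime']
    return pd.to_datetime(value)

df['dateTime'] = df.apply(_replace_datetime_with_datetime_pub, axis=1)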
I am quite new to this, but I need a solution to a mathematical problem that involves finding the roots of a function built from (several) cumulative distribution functions.
For simplicity I tried to code a similar procedure with as simple a function as possible, but even that doesn't work.
Would anyone tell me please what am I doing wrong?
from scipy.optimize import fsolve
import sympy as sy
import numpy as np
from scipy.stats import norm

y = sy.Symbol('y')

def cdf(x):
    cdf_norm = norm.cdf(x, 100, 20)
    return cdf_norm

result = fsolve(y**2 - 14*y + 7 - cdf(y))
print(result)
The problem seems to be that fsolve requires a function as its first argument. However, you passed it an expression, which Python tries to evaluate to a value before fsolve is ever called, and that evaluation fails because y is a sympy Symbol rather than a numeric value. fsolve also requires one more argument: an ndarray containing the initial estimates of the roots. So, one easy solution is to define another function:
def f(y):
    return y**2 - 14*y + 7 - cdf(y)

result = fsolve(f, np.array([1, 0]))
print(result)
I get the following result:
array([ 0.51925928, 0.51925928])
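Both entries are the same because f is applied elementwise, so the two components of the initial guess np.array([1, 0]) converge independently, and both start closest to the same root (7 - sqrt(42) ≈ 0.519; the cdf term is essentially zero that far below the mean of 100). To look for the other root you would start the search near it; the starting value below is just an illustrative guess of mine:
result2 = fsolve(f, np.array([13.0]))  # start near the larger root of y**2 - 14*y + 7
print(result2)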
Whenever I read a CSV file that has a column of strings, I've found that by default pandas gives its dtype as object. I've tried to use mydf['mycol'].astype(str) to change the dtype of a column mycol from object to str, but it didn't work - it didn't give me an error, but at the same time, the dtype remained the same.
I read that pandas has been built on top of numpy, and numpy allows for both str_ and unicode_ (see here: numpy scalar types). I'm NOT very familiar with the internal workings of pandas and NOT familiar with numpy in general.
Is there anything I can do when using pandas.io.parsers.read_csv to make sure that a column of strings in the CSV file is read as a dtype of str rather than object?
More specifically, what parameters (from those given below) do I need to use to achieve this?
pandas.io.parsers.read_csv(filepath_or_buffer, sep=', ', dialect=None,
compression=None, doublequote=True, escapechar=None, quotechar='"', quoting=0,
skipinitialspace=False, lineterminator=None, header='infer', index_col=None,
names=None, prefix=None, skiprows=None, skipfooter=None, skip_footer=0,
na_values=None, na_fvalues=None, true_values=None, false_values=None,
delimiter=None, converters=None, dtype=None, usecols=None, engine=None,
delim_whitespace=False, as_recarray=False, na_filter=True, compact_ints=False,
use_unsigned=False, low_memory=True, buffer_lines=None, warn_bad_lines=True,
error_bad_lines=True, keep_default_na=True, thousands=None, comment=None,
decimal='.', parse_dates=False, keep_date_col=False, dayfirst=False,
date_parser=None, memory_map=False, float_precision=None, nrows=None,
iterator=False, chunksize=None, verbose=False, encoding=None, squeeze=False,
mangle_dupe_cols=True, tupleize_cols=False, infer_datetime_format=False,
skip_blank_lines=True)
Somewhat related: is there any variable / flag in the parameters of pandas.io.parsers.read_csv that can automatically read a missing string from a column of string as '' (empty string) rather than read a missing string as nan?
Also, many of the parameters that can be passed to pandas.io.parsers.read_csv are NOT described in the documentation (pandas.io.parsers.read_csv.html), for example na_fvalues, use_unsigned, compact_ints, etc. Aside from reading the code (which is a bit long), would there be any other place where more detailed documentation for all the parameters is available?
This was a technical decision taken by Wes not to use NumPy's string dtype: NumPy allocates all strings in an array with the same fixed size.
In most real-world use cases strings are not fixed size, and often a few are very long. It's wasteful to allocate a very large contiguous block of memory (and, IIRC, counterintuitively, it can be slower!) to store them as if they were fixed size:
In [11]: np.array(["ab", "a"]) # The 2 is the length
Out[11]:
array(['ab', 'a'],
dtype='|S2')
In [12]: np.array(['this is a very long string', 'a', 'b', 'c'])
Out[12]:
array(['this is a very long string', 'a', 'b', 'c'],
dtype='|S26')
To give a silly example, here is a case where the object dtype actually takes up less memory:
In [21]: a = np.array(['a'] * 99 + ['this is a very long string, really really really really really long, oh yes'])
In [22]: a.nbytes
Out[22]: 7500
In [23]: b = a.astype(object)
In [24]: b.nbytes + sum(sys.getsizeof(item) for item in b)
Out[24]: 4674
There's also some "surprising" behaviour of numpy strings (also due to their layout):
In [31]: a = np.array(['ab', 'c'])
In [32]: a[1] = 'def'
In [33]: a # what the f?
Out[33]:
array(['ab', 'de'],
dtype='|S2')
If you wanted to fix this behaviour - and keep the numpy string dtype - you would have to make a copy for every assignment. (With object arrays you get this for free: you simply change the pointer.)
Hence in pandas strings are stored using the object dtype.
Note: I thought there was a section of the docs which discussed this decision but I can't seem to locate it...
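On the practical side of the original question: in recent pandas you can pass dtype=str to read_csv (the column is still reported as object, but every value is a Python str), and keep_default_na controls how empty fields are read. A hedged sketch, since exact behaviour depends on your pandas version; the file and column names are placeholders:
import pandas as pd

# dtype forces string parsing; keep_default_na=False leaves empty fields as '' instead of NaN
df = pd.read_csv('myfile.csv', dtype={'mycol': str}, keep_default_na=False)
print(df['mycol'].dtype)               # still reported as object
print(df['mycol'].map(type).unique())  # but the values are Python str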