f = h5py.File(data_dir+rec_filename+'.hdf5', 'r')
trial = f.values() #_._next_() #.next()
equilibrate = trial.attrs['equilibrate']
This syntax gives me the error: 'ValuesViewHDF5' object has no attribute 'attrs'. Does anyone know why I get this error?
I tried to find out whether the syntax changed in newer h5py versions but couldn't find anything relevant.
Maybe it is a problem related to incompatibilities between Python 2.x and Python 3.x.
If you want to get the equilibrate attribute on the file object, you need to use this line:
equilibrate = f.attrs['equilibrate']
However, as @hpaulj mentioned, you need to confirm the equilibrate attribute exists. You can access and print all attributes at the file level with this code:
with h5py.File(data_dir+rec_filename+'.hdf5', 'r') as f:
    for k in f.attrs.keys():
        print(f"{k} => {f.attrs[k]}")
Complete details about creating and reading attributes with h5py are available in this answer to a similar question:
How to read HDF5 attributes (metadata) with Python and h5py
BTW, why are you using f.values()? It does NOT return the file object. If you inspect the returned object, you will find it has the objects from the referenced group (e.g., Group and Dataset objects for f, the file object). Repeating what @hpaulj said, h5py uses dictionary syntax to access group and dataset names and/or objects (but they are not dictionaries!). The keys are the object names and the values are the objects. Here is a simple example showing how the dictionary syntax behaves:
with h5py.File(data_dir+rec_filename+'.hdf5', 'r') as f:
    for k in f:  # get the keys, i.e. the object names
        print(f"Object name: {k}")
    for k in f.keys():  # same as the example above
        print(f"Object name: {k}")
    for v in f.values():  # get the values, i.e. the objects
        print(f"Object: {v}; name: {v.name}")
    for k, v in f.items():  # get both keys and values
        print(f"Object name: {k}; Object: {v}")
Below is a screenshot of the branches of data in my .hdf5 file. I am trying to extract the existing column names (i.e. experiment_id, session_id, ...) from this particular BlinkStartEvent segment.
I have the following code that is able to access this section of the data and extract the numerical data as well. But for some reason, I cannot extract the corresponding column names, which I want to append to a separate list so I can create a dictionary out of the entire dataset. I thought .keys() was supposed to do it, but it didn't.
import h5py

def traverse_datasets(hdf_file):
    def h5py_dataset_iterator(g, prefix=''):
        for key in g.keys():
            #print(key)
            item = g[key]
            path = f'{prefix}/{key}'
            if isinstance(item, h5py.Dataset):  # test for dataset
                yield (path, item)
            elif isinstance(item, h5py.Group):  # test for group (go down)
                yield from h5py_dataset_iterator(item, path)
    for path, _ in h5py_dataset_iterator(hdf_file):
        yield path

with h5py.File(filenameHDF[0], 'r') as f:
    for dset in traverse_datasets(f):
        if str(dset[-15:]) == 'BlinkStartEvent':
            print('-----Path:', dset)  # path that leads to the data
            print('-----Shape:', f[dset].shape)  # the length dimension of the data
            print('-----Data type:', f[dset].dtype)  # prints out the dtype for all columns
            data2 = f[dset][()]  # the entire dataset
            # print('Check column names', f[dset].keys())  # I tried this but got an AttributeError: 'Dataset' object has no attribute 'keys'
I got the following as the output:
-----Path: /data_collection/events/eyetracker/BlinkStartEvent
-----Shape: (220,)
-----Data type: [('experiment_id', '<u4'), ('session_id', '<u4'), ('device_id', '<u2'), ('event_id', '<u4'), ('type', 'u1'), ('device_time', '<f4'), ('logged_time', '<f4'), ('time', '<f4'), ('confidence_interval', '<f4'), ('delay', '<f4'), ('filter_id', '<i2'), ('eye', 'u1'), ('status', 'u1')]
Traceback (most recent call last):
  File "C:\Users\angjw\Dropbox\NUS PVT\Analysis\PVT analysis_hdf5access.py", line 64, in <module>
    print('Check column names', f[dset].keys())
AttributeError: 'Dataset' object has no attribute 'keys'
What am I getting wrong here?
Also, is there a more efficient way to access the data such that I can do something (hypothetical) like:
data2[0]['experiment_id'] = 1
data2[1]['time'] = 78.35161
data2[2]['logged_time'] = 80.59253
rather than having to go through the process of setting up a dictionary for every single row of data?
You're close. The dataset's .dtype attribute gives you the dataset's type as a NumPy dtype. Adding .descr returns it as a list of (field name, field type) tuples. See the code below to print the field names inside your loop:
for (f_name, f_type) in f[dset].dtype.descr:
    print(f_name)
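Alternatively, if you only need the names (not the types), the dtype's .names attribute returns them directly:
# .names returns just the field names as a tuple of strings
print(f[dset].dtype.names)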
There are better ways to work with HDF5 data than creating a dictionary for every single row of data (unless you absolutely want a dictionary for some reason). h5py is designed to work with dataset objects similar to NumPy arrays. (However, not all NumPy operations work on h5py dataset objects). The following code accesses the data and returns 2 similar (but slightly different) data objects.
# this returns a h5py dataset object that behaves like a NumPy array:
dset_obj = f[dset]
# this returns a NumPy array:
dset_arr = f[dset][()]
You can slice data from either object using standard NumPy slicing notation (using field names and row values). Continuing from above...
# returns row 0 from field 'experiment_id'
val0 = dset_obj[0]['experiment_id']
# returns row 1 from field 'time'
val1 = dset_obj[1]['time']
# returns row 2 from field 'logged_time'
val2 = dset_obj[2]['logged_time']
(You will get the same values if you replace dset_obj with dset_arr above.)
You can also slice entire fields/columns like this:
# returns field 'experiment_id' as a NumPy array
expr_arr = dset_obj['experiment_id']
# returns field 'time' as a NumPy array
time_arr = dset_obj['time']
# returns field 'logged_time' as a NumPy array
logtime_arr = dset_obj['logged_time']
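You can also combine field and row slicing in one expression. For example:
# first 10 rows of field 'time' as a NumPy array
first_times = dset_obj['time'][:10]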
That should answer your initial questions. If not, please add comments (or modify the post), and I will update my answer.
My previous answer used the h5py package (same package as your code). There is another Python package that I like to use with HDF5 data: PyTables (aka tables). Both are very similar, and each has unique strengths.
h5py attempts to map the HDF5 feature set to NumPy as closely as possible. Also, it uses Python dictionary syntax to iterate over object names and values. So, it is easy to learn if you are familiar with NumPy. Otherwise, you have to learn some NumPy basics (like interrogating dtypes). Homogeneous data is returned as a np.array and heterogeneous data (like yours) is returned as a np.recarray.
PyTables builds an additional abstraction layer on top of HDF5 and NumPy. Two unique capabilities I like are: 1) recursive iteration over nodes (groups or datasets), so a custom dataset generator isn't required, and 2) heterogeneous data is accessed with a "Table" object that has more methods than basic NumPy recarray methods. (Plus it can do complex queries on tables, has advanced indexing capabilities, and is fast!)
To compare them, I rewrote your h5py code with PyTables so you can "see" the difference. I incorporated all the operations in your question, and included the equivalent calls from my h5py answer. Differences to note:
The f.walk_nodes() method is a built-in method that replaces your generator. However, it returns an object (a Table object in this case), not the Table (dataset) name. So, the code is slightly different because it works with the object instead of the name.
Use Table.read() to load the data into a NumPy (record) array. Different examples show how to load the entire Table into an array, or load a single column referencing the field name.
Code below:
import tables as tb

with tb.File(filenameHDF[0], 'r') as f:
    for tb_obj in f.walk_nodes('/', 'Table'):
        if str(tb_obj.name[-15:]) == 'BlinkStartEvent':
            print('-----Name:', tb_obj.name)  # Table name without the path
            print('-----Path:', tb_obj._v_pathname)  # path that leads to the data
            print('-----Shape:', tb_obj.shape)  # the length dimension of the data
            print('-----Data type:', tb_obj.dtype)  # prints out the np.dtype for all column names/variable types
            print('-----Field/Column names:', tb_obj.colnames)  # prints out the names of all columns as a list

            data2 = tb_obj.read()  # reads the entire Table (dataset) into array data2

            # returns field 'experiment_id' as a NumPy array
            expr_arr = tb_obj.read(field='experiment_id')
            # returns field 'time' as a NumPy array
            time_arr = tb_obj.read(field='time')
            # returns field 'logged_time' as a NumPy array
            logtime_arr = tb_obj.read(field='logged_time')
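As a taste of the Table query capability mentioned above, PyTables can also filter rows with a condition string via Table.read_where(). A minimal sketch, with a hypothetical threshold on the 'time' field:
# Sketch: rows where the 'time' field exceeds a hypothetical threshold
subset = tb_obj.read_where('time > 100.0')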
How do I make a multidimensional dictionary with multiple keys and values, and how do I print its keys and values?
from this format:
main_dictionary = { Mainkey: {keyA: value,
                              keyB: value,
                              keyC: value
                            }}
I tried to do it, but it gives me an error on the manufacturer key. Here is my code:
car_dict[manufacturer] [type]= [( sedan, hatchback, sports)]
Here is my error:
File "E:/Programming Study/testupdate.py", line 19, in campany
car_dict[manufacturer] [type]= [( sedan, hatchback, sports)]
KeyError: 'Nissan'
And my printing code is:
for manufacuted_by, type, sedan, hatchback, sports in cabuyao_dict[bgy]:
    print("Manufacturer Name:", manufacuted_by)
    print('-' * 120)
    print("Car type:", type)
    print("Sedan:", sedan)
    print("Hatchback:", hatchback)
    print("Sports:", sports)
Thank you! I'm new to Python.
I think you have a slight misunderstanding of how a dict works, and how to "call back" the values inside of it.
Let's make two examples for how to create your data-structure:
car_dict = {}
car_dict["Nissan"] = {"types": ["sedan", "hatchback", "sports"]}
print(car_dict) # Output: {'Nissan': {'types': ['sedan', 'hatchback', 'sports']}}
from collections import defaultdict
car_dict2 = defaultdict(dict)
car_dict2["Nissan"]["types"] = ["sedan", "hatchback", "sports"]
print(car_dict2) # Output: defaultdict(<class 'dict'>, {'Nissan': {'types': ['sedan', 'hatchback', 'sports']}})
In both examples above, I first create a dictionary, and then on the following row I add the values I want it to contain. In the first example, I give car_dict the key "Nissan" and set its value to a new dictionary containing some values.
In the second example I use defaultdict(dict), which basically has the logic of "if I am not given a value for key, then use the factory (dict) to create a value for it".
Can you see the difference in how the values are initialized in the two methods?
When you called car_dict[manufacturer][type] in your code, you hadn't yet initialized car_dict["Nissan"] = value, so when you tried to retrieve it, car_dict raised a KeyError.
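If you want to stay with a plain dict, dict.setdefault() is a third option that creates the inner dict on first access. A minimal sketch using your "Nissan" example:
car_dict = {}
# setdefault returns the existing inner dict, or inserts {} first if the key is missing
car_dict.setdefault("Nissan", {})["types"] = ["sedan", "hatchback", "sports"]
print(car_dict)  # Output: {'Nissan': {'types': ['sedan', 'hatchback', 'sports']}}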
As for printing out the values, you can do something like this:
for key in car_dict:
    manufacturer = key
    car_types = car_dict[key]["types"]
    print(f"The manufacturer '{manufacturer}' has the following types:")
    for t in car_types:
        print(t)
Output:
The manufacturer 'Nissan' has the following types:
sedan
hatchback
sports
When you loop through a dict, by default you are looping through only the keys contained in it. That means we have to retrieve the values for each key inside the loop itself to print them correctly.
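If you want both keys and values in one pass, dict.items() avoids the extra lookup:
for manufacturer, info in car_dict.items():
    print(f"The manufacturer '{manufacturer}' has the following types:")
    for t in info["types"]:
        print(t)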
Also, as a side note: you should try to avoid using built-in names such as type for variables, because you then shadow the built-in function, which can cause problems later when you need to compare the types of variables.
import pandas as pd
import nltk
import os
directory = os.listdir(r"C:\...")
x = []
num = 0
for i in directory:
    x.append(pd.read_fwf("C:\\..." + i))
    x[num] = x[num].to_string()
So, once I have the dictionary x = [] populated by read_fwf for each file in my directory:
I want to know how to make every single character lowercase. I am having trouble understanding the syntax and how it is applied to a dictionary.
I want to define a filter that I can use to count occurrences of a list of words in this newly defined dictionary, e.g.,
list = [bus, car, train, aeroplane, tram, ...]
Edit: Quick unrelated question:
Is pd.read_fwf the best way to read .txt files? If not, what else could I use?
Any help is very much appreciated. Thanks
Edit 2: Sample data and output that I want:
Sample:
The Horncastle boar's head is an early seventh-century Anglo-Saxon
ornament depicting a boar that probably was once part of the crest of
a helmet. It was discovered in 2002 by a metal detectorist searching
in the town of Horncastle, Lincolnshire. It was reported as found
treasure and acquired for £15,000 by the City and County Museum, where
it is on permanent display.
Required output - changes everything in uppercase to lowercase:
the horncastle boar's head is an early seventh-century anglo-saxon
ornament depicting a boar that probably was once part of the crest of
a helmet. it was discovered in 2002 by a metal detectorist searching
in the town of horncastle, lincolnshire. it was reported as found
treasure and acquired for £15,000 by the city and county museum, where
it is on permanent display.
You shouldn't need to use pandas or dictionaries at all. Just use Python's built-in open() function:
# Open the file in read mode with a context manager
with open(r'C:\path\to\your\file.txt', 'r') as file:
    # Read the file into a string
    text = file.read()

# Use the string's lower() method to make everything lowercase
text = text.lower()
print(text)

# Split the text by whitespace into a list of words
word_list = text.split()

# Get the number of elements in the list (the word count)
word_count = len(word_list)
print(word_count)
If you want, you can do it in the reverse order:
# Open the file in read mode with a context manager
with open(r'C:\path\to\your\file.txt', 'r') as file:
    # Read the file into a string
    text = file.read()

# Split the text by whitespace into a list of words
word_list = text.split()

# Use a list comprehension to apply lower() to each word
lowercase_word_list = [word.lower() for word in word_list]
print(lowercase_word_list)
Using a context manager for this is good since it automatically closes the file for you as soon as it goes out of scope (de-indented from the with statement block). Otherwise you would have to call open() yourself and remember to call file.close() when you are done.
I think there are some other benefits to using context managers, but someone please correct me if I'm wrong.
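To get back to your word-list filter, here is one sketch that counts only the words you care about, using collections.Counter on the lowercased words (the keyword list is taken from your question):
from collections import Counter

# word_list comes from either example above
counts = Counter(word.lower() for word in word_list)
keywords = ['bus', 'car', 'train', 'aeroplane', 'tram']  # from the question
for w in keywords:
    print(w, counts[w])  # Counter returns 0 for words that never appear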
I think what you are looking for is dictionary comprehension:
# Python 3
new_dict = {key: val.lower() for key, val in old_dict.items()}
# Python 2
new_dict = {key: val.lower() for key, val in old_dict.iteritems()}
items()/iteritems() gives you the (key, value) pairs represented in the dictionary as tuples (e.g. [('somekey', 'SomeValue'), ('somekey2', 'SomeValue2')]).
The comprehension iterates over each of these pairs, creating a new dictionary in the process. In the key: val.lower() section, you can do whatever manipulation you want to create the new dictionary.
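For example, with made-up sample values:
old_dict = {'doc1': 'The Horncastle BOAR', 'doc2': 'A HELMET crest'}
new_dict = {key: val.lower() for key, val in old_dict.items()}
print(new_dict)  # {'doc1': 'the horncastle boar', 'doc2': 'a helmet crest'}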
I'm new to PyTables and I want to retrieve a dataset from an HDF5 file using natural naming, but I'm getting this error with this input:
f = tables.open_file("filename.h5", "r")
f.root.group-1.dataset-1.read()
group / does not have a child named group
and if I try:
f.root.group\-1.dataset\-1.read()
group / does not have a child named group
unexpected character after line continuation character
I can't change the names of the groups because this is big data from an experiment.
You can't use the minus (hyphen) sign with Natural Naming because it's not a valid character in a Python variable name (group-1 and dataset-1 look like subtraction operations!). See this discussion:
why-python-does-not-allow-hyphens
If you have groups and datasets that use this naming convention, you will have to use the file.get_node() method to access them. Here's a simple code snippet to demonstrate. The first part creates 2 groups and tables (datasets): #1 uses _ and #2 uses - in the group and table names. The second part accesses dataset #1 with Natural Naming, and dataset #2 with file.get_node().
import tables as tb
import numpy as np

# Create h5 file with 2 groups and datasets:
# '/group_1', 'ds_1' : Natural Naming supported
# '/group-2', 'ds-2' : Natural Naming NOT supported
h5f = tb.open_file('SO_55211646.h5', 'w')
h5f.create_group('/', 'group_1')
h5f.create_group('/', 'group-2')

mydtype = np.dtype([('a', float), ('b', float), ('c', float)])
h5f.create_table('/group_1', 'ds_1', description=mydtype)
h5f.create_table('/group-2', 'ds-2', description=mydtype)

# Close, then reopen the file READ ONLY
h5f.close()
h5f = tb.open_file('SO_55211646.h5', 'r')

testds_1 = h5f.root.group_1.ds_1.read()
print(testds_1.dtype)

# these aren't valid Python statements:
# testds-2 = h5f.root.group-2.ds-2.read()
# print(testds-2.dtype)

testds_2 = h5f.get_node('/group-2', 'ds-2').read()
print(testds_2.dtype)

h5f.close()
Trying to make a method to add new rows following the interface below:
def row_add(self, **rowtoadd)
I don't see how, if I define my columns like:
stuff1, stuff2, stuff3
I can get a **namedtuple to sort itself into the correct order of column names.
So far, I've tried this (here table is the file path we're editing, containing the CSV we need):
def row_add(self, **rowtoadd):
    if os.path.isfile(self.table):
        with open(self.table, 'a') as csvfile:
            csvwriter = csv.writer(csvfile)
            csvwriter.writerow(rowtoadd)
But the namedtuple is not converted into a row; only the variable names are.
ex:
row_add(stuff1="hello1", stuff2="hello2", stuff3="hello3")
cat ./my_file.csv -> stuff1, stuff2, stuff3
Try the following:
csvwriter.writerow(rowtoadd[x] for x in sorted(rowtoadd.keys()))
The issue is two-fold:
1. rowtoadd is a dict object, and (before Python 3.7, where insertion order became guaranteed) the order of a dict's keys was not upheld.
2. When you call writerow(rowtoadd), the default iteration over a dict is over its keys, which is why your csv file is getting the keys rather than the values.
In my line of code above, sorted(rowtoadd.keys()) sorts the keys of the dict so that they are in a predictable (alphabetical) order. rowtoadd[x] for x in ... is a generator expression that yields the corresponding values in that order, which is what gets written to the file.
A key thing to understand here is that the csvwriter is not aware of the file's preexisting structure. It doesn't know what order the keys should be in, so you need to specify that order somehow. In this case, I specified the order alphabetically, but you may need to do it differently.
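Another option, if you do know the column order up front, is csv.DictWriter, which maps keys to columns by name. A sketch assuming the stuff1..stuff3 columns from your question (the fieldnames list is an assumption; newline='' is standard practice for the csv module):
import csv
import os

def row_add(self, **rowtoadd):
    # fieldnames pins the column order; keys map to columns by name
    fieldnames = ['stuff1', 'stuff2', 'stuff3']  # assumed column order
    if os.path.isfile(self.table):
        with open(self.table, 'a', newline='') as csvfile:
            writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
            writer.writerow(rowtoadd)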
If you don't know the names of the fields beforehand, you could use positional arguments to keep the order of the fields. Positional arguments become a tuple, which is an ordered type in python:
def row_add(self, *row):
    if os.path.isfile(self.table):
        with open(self.table, 'a') as csvfile:
            csvwriter = csv.writer(csvfile)
            csvwriter.writerow(row)
This solution relies on the fact that the caller provides the arguments in the correct order.
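For example, with a hypothetical instance obj:
obj.row_add('hello1', 'hello2', 'hello3')  # values must be passed in column order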