Accessing HDF5 file structure while omitting certain groups and datasets - python-3.x

I would like to access an HDF5 file structure with h5py, where the groups and datasets are stored as follows:
/Group 1/Sub Group 1/*/Data set 1/
where the asterisk signifies a sub-sub group which has a unique address. However, its address is irrelevant, since I am simply interested in the data sets it contains. How can I access any random sub-sub group without having to specify its unique address?
Here is a script for a specific case:
import h5py as h5
deleteme = h5.File("deleteme.hdf5", "w")
nobody_in_particular = deleteme.create_group("/grp_1/subgr_1/nobody_in_particular/")
dt = h5.special_dtype(vlen=str)
dataset_1 = nobody_in_particular.create_dataset("dataset_1",(1,),dtype=dt)
dataset_1.attrs[str(1)] = "Some useful data 1"
dataset_1.attrs[str(2)] = "Some useful data 2"
deleteme.close()
# access data from nobody_in_particular subgroup and do something
deleteme = h5.File("deleteme.hdf5", "r")
deleteme["/grp_1/subgr_1/nobody_in_particular/dataset_1"]
This gives output:
<HDF5 dataset "dataset_1": shape (1,), type "|O">
Now I wish to accomplish the same result, but without knowing who (or which group) in particular. Any random subgroup in place of nobody_in_particular will do. How can I access such a subgroup?
In other words:
deleteme["/grp_1/subgr_1/<any random sub-group>/dataset_1"]

Assuming you only want to read and not create groups/datasets, using visit (http://docs.h5py.org/en/latest/high/group.html#Group.visit) with a suitable callback will let you select the desired groups/datasets.
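For example, here is a minimal sketch building on the file created above. It uses the closely related visititems() (which also hands the callback the node object) to act on any dataset named dataset_1, no matter which sub-sub group holds it:
import h5py as h5

# Walk everything below /grp_1/subgr_1 and act on any dataset called
# "dataset_1", without knowing the name of the intermediate group.
def on_node(name, obj):
    if isinstance(obj, h5.Dataset) and name.rsplit("/", 1)[-1] == "dataset_1":
        print(name, dict(obj.attrs))  # do something with the dataset here

with h5.File("deleteme.hdf5", "r") as f:
    f["/grp_1/subgr_1"].visititems(on_node)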

Related

target_transform in torchvision.datasets.ImageFolder seems not to work

I am using PyTorch 1.13 with Python 3.10.
I have a problem where I import pictures from a folder structure using
data = ImageFolder(root='./faces/', loader=img_loader, transform=transform,
                   is_valid_file=is_valid_file)
In this call, labels are assigned automatically according to the subdirectory an image belongs to.
I wanted to assign different labels and use target_transform for this purpose (e.g. I wanted to use a word from the file name to assign an appropriate label).
I have used
def target_transform(id):
    print(2)
    return id * 2
data = ImageFolder(root='./faces/', loader=img_loader, transform=transform, target_transform=target_transform, is_valid_file=is_valid_file)
Next,
data = ImageFolder(root='./faces/', loader=img_loader, transform=transform, target_transform=lambda id:2*id, is_valid_file=is_valid_file)
or
data = ImageFolder(root='./faces/', loader=img_loader, transform=transform,
                   target_transform=torchvision.transforms.Lambda(lambda id: 2*id),
                   is_valid_file=is_valid_file)
But none of these affect the labels. In addition, in the first example I included the print statement to see whether the function is called, but it is not. I have searched for uses of this function, but the examples I have found do not work and the documentation is scarce in this respect. Any idea what is wrong with the code?
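One detail worth checking (my note; it is not from the original thread): ImageFolder applies target_transform lazily, inside __getitem__, so nothing runs when the dataset object is constructed. The transform, and hence the print, only fires when an item is actually fetched:
# target_transform runs per item at access time, not at construction;
# data.targets also stays untransformed.
img, label = data[0]  # should print 2 and yield the doubled label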

Stuck using pandas to build RPG item generator

I am trying to build a simple random item generator for a game I am working on.
So far I am stuck trying to figure out how to store and access all of the data. I went with pandas using .csv files to store the data sets.
I want to add weighted probabilities to what items are generated so I tried to read the csv files and compile each list into a new set.
I got the program to pick a random set but got stuck when trying to pull a random row from that set.
I am getting an error when I use .sample() to pull the item row which makes me think I don't understand how pandas works. I think I need to be creating new lists so I can later index and access the various statistics of the items once one is selected.
Once I pull the item, I intend to add effects that change the displayed damage, armor, and so on. So I was thinking of having the new item be its own list, then using damage = item[2] + 3 or whatever I need.
The error is: AttributeError: 'list' object has no attribute 'sample'
Can anyone help with this problem? Maybe there is a better way to set up the data?
Here is my code so far:
import pandas as pd
import random
df = [pd.read_csv('weapons.csv'), pd.read_csv('armor.csv'), pd.read_csv('aether_infused.csv')]
def get_item():
    item_class = [random.choices(df, weights=(45,40,15), k=1)] #this part seemed to work. When I printed item_class it printed one of the entire lists at the correct odds
    item = item_class.sample()
    print (item) #to see if the program is working

get_item()
I think you are getting slightly confused with lists vs. list elements. This should work; I stubbed your DataFrames with simple ones.
import pandas as pd
import random
# Actual data. Comment it out if you do not have the csv files
df = [pd.read_csv('weapons.csv'), pd.read_csv('armor.csv'), pd.read_csv('aether_infused.csv')]
# My stubs -- uncomment and use this instead of the line above if you want to run this specific example
# df = [pd.DataFrame({'weapons' : ['w1','w2']}), pd.DataFrame({'armor' : ['a1','a2', 'a3']}), pd.DataFrame({'aether' : ['e1','e2', 'e3', 'e4']})]
def get_item():
    # I removed [] from the line below -- choices() already returns a list of length 1
    item_class = random.choices(df, weights=(45,40,15), k=1)
    # I added [0] to choose the first element of item_class, which is a list of length 1 from the line above
    item = item_class[0].sample()
    print (item) #to see if the program is working

get_item()
prints random rows from the random dataframes that I set up, such as
weapons
1 w2
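A hedged follow-up for the "effects" idea from the question: .sample() returns a one-row DataFrame, so you can pull the row out as a Series and adjust stats by column name instead of positional indexing. The 'damage' column below is hypothetical:
row = item_class[0].sample().iloc[0].copy()  # the sampled row as a mutable Series
row["damage"] = row["damage"] + 3            # apply an effect by name ('damage' is assumed)
print(row)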

loop to read multiple files

I am using ObsPy's _read_segy function to read a SEG-Y file using the following line of code:
line_1=_read_segy('st1.segy')
However I have a large number of files in a folder as follow:
st1.segy
st2.segy
st3.segy
.
.
st700.segy
I want to use a for loop to read the data, but I am new to this, so can anyone help me in this regard?
Currently I am using repeated lines to read the data, as follows:
line_1=_read_segy('st1.segy')
line_2=_read_segy('st2.segy')
The next step is to display the SEG-Y data using matplotlib, and again I am running the following lines of code on individual lines, which is far too much repeated work. Can someone help me with creating a loop to display the data and save the figures?
data=np.stack([t.data for t in line_1.traces])
vm=np.percentile(data,99)
plt.figure(figsize=(60,30))
plt.imshow(data.T, cmap='seismic',vmin=-vm, vmax=vm, aspect='auto')
plt.title('Line_1')
plt.savefig('Line_1.png')
plt.show()
Your kind suggestions will help me a lot as I am a beginner in python programming.
Thank you
If you want to reduce code duplication, you can use functions, and if you want to do something repeatedly, you can use loops. So you can call a function in a loop if you want to do this for all files.
Now, for reading the files in a folder, you can use Python's glob package. Something like below:
import glob, os
import numpy as np
import matplotlib.pyplot as plt
from obspy.io.segy.core import _read_segy  # adjust to match however you already import it

def save_fig(in_file_name, out_file_name):
    line_1 = _read_segy(in_file_name)
    data = np.stack([t.data for t in line_1.traces])
    vm = np.percentile(data, 99)
    plt.figure(figsize=(60, 30))
    plt.imshow(data.T, cmap='seismic', vmin=-vm, vmax=vm, aspect='auto')
    plt.title(out_file_name)
    plt.savefig(out_file_name)

segy_files = list(glob.glob(segy_files_path + "/*.segy"))
for index, file in enumerate(segy_files):
    save_fig(file, "Line_{}.png".format(index + 1))
Here, segy_files_path is the folder where your files reside.
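One caveat (my addition, not from the original answer): glob does not guarantee numeric order, so st10.segy can come before st2.segy, and the Line_N numbering would then not match the file numbering. Sorting by the number embedded in the file name fixes that:
import re

# Sort st1.segy ... st700.segy by their embedded number rather than as strings;
# assumes each base file name contains exactly one run of digits.
segy_files.sort(key=lambda p: int(re.search(r"\d+", os.path.basename(p)).group()))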
You just need to dynamically open the files in a loop. Fortunately they all follow the same naming pattern.
N = 700
for n in range(1, N + 1):  # Files are numbered st1.segy ... st700.segy.
    line_n = _read_segy(f"st{n}.segy")  # Dynamic name.
    data = np.stack([t.data for t in line_n.traces])
    vm = np.percentile(data, 99)
    plt.figure(figsize=(60, 30))
    plt.imshow(data.T, cmap="seismic", vmin=-vm, vmax=vm, aspect="auto")
    plt.title(f"Line_{n}")
    plt.savefig(f"Line_{n}.png")  # Save before show(), which clears the figure.
    plt.show()
    plt.close()  # Needed if you don't want to keep 700 figures open.
I'll focus on addressing the file looping, as you said you're new and I'm assuming simple loops are something you'd like to learn about (the first example is sufficient for this).
If you'd like an answer to your second question, it might be worth providing some example data, the output result (graph) of your current attempt, and a description of your desired output. If you provide that reproducible example and clear description of the problem you're having it'd be easier to answer.
Create a list (or other iterable) to hold the file names to read, and another container (maybe a dict) to hold the result of your read_segy.
files = ['st1.segy', 'st2.segy']
lines = {}  # creates an empty dictionary; dictionaries consist of key: value pairs

for f in files:  # f will first be 'st1.segy', then 'st2.segy'
    lines[f] = read_segy(f)
As stated in the comment by @Guimoute, if you want to dynamically generate the file names, you can build the files list by appending integers to the base file name.
lines = {}  # creates an empty dictionary; dictionaries have key: value pairs
missing_files = []

for i in range(1, 701):
    f = f"st{i}.segy"  # gives "st1.segy" for i = 1
    try:  # in case one of the files is missing or can't be read
        lines[f] = read_segy(f)
    except Exception:
        missing_files.append(f)  # store names of missing or unreadable files

Cannot retrieve Datasets in PyTables using natural naming

I'm new to PyTables and I want to retrieve a dataset from an HDF5 file using natural naming, but with this input I'm getting an error:
f = tables.open_file("filename.h5", "r")
f.root.group-1.dataset-1.read()
group / does not have a child named group
and if I try:
f.root.group\-1.dataset\-1.read()
group / does not have a child named group
unexpected character after line continuation character
I can't change the names of the groups because this is big data from an experiment.
You can't use the minus (hyphen) sign with Natural Naming because it's not a valid character in a Python variable name (group-1 and dataset-1 look like subtraction operations!). See this discussion:
why-python-does-not-allow-hyphens
If you have groups and datasets that use this naming convention, you will have to use the file.get_node() method to access them. Here's a simple code snippet to demonstrate. The first part creates 2 groups and tables (datasets): #1 uses _ and #2 uses - in the group and table names. The second part accesses dataset #1 with Natural Naming and dataset #2 with file.get_node().
import tables as tb
import numpy as np
# Create h5 file with 2 groups and datasets:
# '/group_1', 'ds_1' : Natural Naming Supported
# '/group-2', 'ds-2' : Natural Naming NOT Supported
h5f = tb.open_file('SO_55211646.h5', 'w')
h5f.create_group('/', 'group_1')
h5f.create_group('/', 'group-2')
mydtype = np.dtype([('a',float),('b',float),('c',float)])
h5f.create_table('/group_1', 'ds_1', description=mydtype )
h5f.create_table('/group-2', 'ds-2', description=mydtype )
# Close, then Reopen file READ ONLY
h5f.close()
h5f = tb.open_file('SO_55211646.h5', 'r')
testds_1 = h5f.root.group_1.ds_1.read()
print (testds_1.dtype)
# these aren't valid Python statements:
#testds-2 = h5f.root.group-2.ds-2.read()
#print (testds-2.dtype)
testds_2 = h5f.get_node('/group-2','ds-2').read()
print (testds_2.dtype)
h5f.close()
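As an aside (my addition, not part of the original answer): PyTables can also discover nodes without typing their names at all. File.walk_nodes() iterates every node of a given class, hyphens and all:
with tb.open_file('SO_55211646.h5', 'r') as h5f:
    for tbl in h5f.walk_nodes('/', classname='Table'):
        print(tbl._v_pathname)  # prints /group_1/ds_1 and /group-2/ds-2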

How to reshape/recast data in Excel from wide- to long-format

I want to reshape my data in Excel from its current "wide" format into "long" format. You can see that each variable (column name) corresponds to a tenure, race, and cost burden. I want to more easily put these data into a pivot table, but I'm not sure how to do this. Any ideas out there?
FYI, the data are HUD CHAS (Department of Housing and Urban Development, Comprehensive Housing Affordability Strategy), which has over 20 tables that would need to be reshaped.
There is a simple R script that will help with this. The function accepts the path to your csv file and the number of header variables you have. In the example image/data I provided, there are 7 header variables. That is, the actual data (T9_est1) starts on the 8th column.
# Use the command below if you do not have the tidyverse package installed.
# install.packages("tidyverse")
library(tidyverse)

read_data_long <- function(path_to_csv, header_vars) {
  data_table <- read_csv(path_to_csv)
  fields_to_melt <- names(data_table[, as.numeric(header_vars + 1):ncol(data_table)])
  melted <- gather(data_table, fields_to_melt, key = 'variable', value = 'values')
  return(melted)
}
# Change the file path to where your data is and where you want it written to.
# Also change "7" to the number of header variables your data has.
melted_data <- read_data_long("path_to_input_file.csv", 7)
write_csv(melted_data, "new_path_to_melted_file.csv")
(Updated 7/25/18 with a more elegant solution; Again 9/28/18 with small change.)
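If you'd rather avoid R, here is a minimal equivalent sketch in Python using pandas.melt, under the same assumption that the first 7 columns are header variables (the file paths are placeholders):
import pandas as pd

# Everything after the first 7 (header) columns is unpivoted into long format.
df = pd.read_csv("path_to_input_file.csv")
melted = df.melt(id_vars=list(df.columns[:7]), var_name="variable", value_name="values")
melted.to_csv("new_path_to_melted_file.csv", index=False)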
