Convert panda dataframe to h5 file - python-3.x

I want to store my dataframe in h5 file. My dataframe is:
dfbhbh=pd.DataFrame([m1bhbh,m2bhbh,adcobhbh,edcobhbh,alisabhbh,elisabhbh,tevolbhbh,distbhbh,metalbhbh,compbhbh,weightbhbh]).T
dfbhbh.columns=['m_1','m_2','a_DCO','e_DCO','a_LISA','e_LISA','t_evol','dist','Z','component','weight']
I am trying to convert it using:
hf=h5py.File('anew', 'w')
for i in range(len(dfbhbh)):
hf.create_dataset('simulations',list(dfbhbh.iloc[i]))
And I'm getting the error
TypeError: Can't convert element 9 (low_alpha_disc) to hsize_t
I removed the entire array of the component (even though it is extremely significant) but the code did not run.
I also tried to insert directly the data in the h5 file like this
hf.create_dataset('simulations', m1bhbh)
I got this error
Dimensionality is too large (dimensionality is too large)
The variable 'm1bhbh' is a float type with length 1499.

Try:
hf.create_dataset('simulations', data = m1bhbh)
instead of
hf.create_dataset('simulations', m1bhbh)
(Don't forget to clear outputs before running it.)

Related

Saving panda dataframe to csv changes values

I want to save a bunch of values in a dataframe to csv but I keep running in the problem that something changes to values while saving. Let's have a look at the MWE:
import pandas as pd
import csv
df = {
"value1": [110.589, 222.534, 390.123],
"value2": [50.111, 40.086, 45.334]
}
df.round(1)
#checkpoint
df.to_csv(some_path)
If I debug it and look at the values of df at the step which I marked "checkpoint", thus after rounding, they are like
[110.6000, 222.5000, 390.1000],
[50.1000, 40.1000, 45.3000]
In reality, my data frame is much larger and when I open the csv after saving, some values (usually in a random block of a couple of dozen rows) have changed! They then look like
[110.600000000001, 222.499999999999, 390.099999999999],
[50.099999999999, 40.100000000001, 45.300000000001]
So it's always a 0.000000000001 offset from the "real"/rounded values. Does anybody know what's going on here/how I can avoid this?
This is a typical floating point problem. pandas gives you the option to define a float_format:
df.to_csv(some_path, float_format='%.4f')
This will force 4 decimals (or actually, does a cut-off at 4 decimals). Note that values will be treated as strings now, so if you set quoting on strings, then these columns are also quoted.

Reading a set of HDF5 files and then slicing the resulting datasets without storing them in the end

I think some of my question is answered here:1
But the difference that I have is that I'm wondering if it is possible to do the slicing step without having to re-write the datasets to another file first.
Here is the code that reads in a single HDF5 file that is given as an argument to the script:
with h5py.File(args.H5file, 'r') as df:
print('Here are the keys of the input file\n', df.keys())
#interesting point here: you need the [:] behind each of these and we didn't need it when
#creating datasets not using the 'with' formalism above. Adding that even handled the cases
#in the 'hits' and 'truth_hadrons' where there are additional dimensions...go figure.
jetdset = df['jets'][:]
haddset = df['truth_hadrons'][:]
hitdset = df['hits'][:]
Then later I do some slicing operations on these datasets.
Ideally I'd be able to pass a wild-card into args.H5file and then the whole set of files, all with the same data formats, would end up in the three datasets above.
I do not want to store or make persistent these three datasets at the end of the script as the output are plots that use the information in the slices.
Any help would be appreciated!
There are at least 2 ways to access multiple files:
If all files follow a naming pattern, you can use the glob
module. It uses wildcards to find files. (Note: I prefer
glob.iglob; it is an iterator that yields values without creating a list. glob.glob creates a list which you frequently don't need.)
Alternatively, you could input a list of filenames and loop on
the list.
Example of iglob:
import glob
for fname in glob.iglob('img_data_0?.h5'):
with h5py.File(fname, 'r') as h5f:
print('Here are the keys of the input file\n', h5.keys())
Example with a list of names:
filenames = [ 'img_data_01.h5', 'img_data_02.h5', 'img_data_03.h5' ]
for fname in filenames:
with h5py.File(fname, 'r') as h5f:
print('Here are the keys of the input file\n', h5.keys())
Next, your code mentions using [:] when you access a dataset. Whether or not you need to add indices depends on the object you want returned.
If you include [()], it returns the entire dataset as a numpy array. Note [()] is now preferred over [:]. You can use any valid slice notation, e.g., [0,0,:] for a slice of a 3-axis array.
If you don't include [:], it returns a h5py dataset object, which
behaves like a numpy array. (For example, you can get dtype and shape, and slice the data). The advantage? It has a smaller memory footprint. I use h5py dataset objects unless I specifically need an array (for example, passing image data to another package).
Examples of each method:
jets_dset = h5f['jets'] # w/out [()] returns a h5py dataset object
jets_arr = h5f['jets'][()] # with [()] returns a numpy array object
Finally, if you want to create a single array that merges values from 3 datasets, you have to create an array big enough to hold the data, then load with slice notation. Alternatively, you can use np.concatenate() (However, be careful, as concatenating a lot of data can be slow.)
A simple example is shown below. It assumes you know the shape of the dataset, and they are the same for all 3 files. (a0, a1 are the axes lengths for 1 dataset) If you don't know them, you can get them from the .shape attribute
Example for method 1 (pre-allocating array jets3x_arr):
a0, a1 = 100, 100
jets3x_arr = np.empty(shape=(a0, a1, 3)) # add dtype= if not float
for cnt, fname in enumerate(glob.iglob('img_data_0?.h5')):
with h5py.File(fname, 'r') as h5f:
jets3x_arr[:,:,cnt] = h5f['jets']
Example for method 2 (using np.concatenate()):
a0, a1 = 100, 100
for cnt, fname in enumerate(glob.iglob('img_data_0?.h5')):
with h5py.File(fname, 'r') as h5f:
if cnt == 0:
jets3x_arr= h5f['jets'][()].reshape(a0,a1,1)
else:
jets3x_arr= np.concatenate(\
(jets3x_arr, h5f['jets'][()].reshape(a0,a1,1)), axis=2)

Python: how can I get the mode from a month column that i extracted from a datetime column?

I'm new at this! Doing my first Python project. :)
My tasks are:
convert df['Start Time'] from string to datetime
create a month column from df['Start Time']
get the mode of that month.
I used a few different ways to do all 3 of the steps, but trying to get the mode always returns TypeError: tuple indices must be integers or slices, not str. This happens even if I try converting the "tuple" into a list or NumPy array.
Ways I tried to extract month from Start Time:
df['extracted_month'] = pd.DatetimeIndex(df['Start Time']).month
df['extracted_month'] = np.asarray(df['extracted_month'])
df['extracted_month'] = df['Start Time'].dt.month
Ways I've tried to get the mode:
print(df['extracted_month'].mode())
print(df['extracted_month'].mode()[0])
print(stat.mode(df['extracted_month']))
Trying to get the index with df.columns.get_loc("extracted_month") then replacing it in the mode code gives me the SAME error (TypeError: tuple indices must be integers or slices, not str).
I think I should convert df['extracted_month'] into a different... something. What is it?
Note: My extracted_month column is a STRING, but you should still be able to get the mode from a string variable! I'm not changing it, that would be giving up.
Edit: using the following code still results in the same error
extracted_month = pd.Index(df['extracted_month'])
print(extracted_month.value_counts())
The error is likely caused by the way you are creating your dataframe.
If the dataframe is created in another function, and that function returns other things along with the dataframe, but you assign it to the variable df, then df will be a tuple that contains the actual dataframe, and not the dataframe itself.

Python - In a ML code. Getting error : IndexError: list index out of range

I was going thru some ML python code just try to understand what is does and how it works. I noticed a youtube video that took me to this code random-forests-tutorials. The code actually uses hard-coded Array/List. But if I use file as input, then it throws
IndexError: list index out of range in the print_tree function
could someone please help me with resolving this? I have not yet changed anything else in the program besides just pointing it to file as input instead of hard-coded Array.
I created this function to read the CSV data from HEADER and TRAINING files. But to read the TESTING data file i have similar function but am not reading row[5] as it does not exist. the number of columns of Testing data file is 1 short.
def getBackData(filename)
with open(filename, newline='') as csvfile:
rawReader = csv.reader(csvfile, delimiter=',', quotechar='"')
if "_training" in filename:
parsed = ((row[0],
int(row[1]),
int(row[2]),
int(row[3]),
row[4],
row[5])
for row in rawReader)
else:
parsed = rawReader
theData = list(parsed)
return theData
So in the code am using the variables as
training_data = fs.getBackData(fileToUse + "_training.dat")
header = fs.getBackData(fileToUse + "_header.dat")
testing_data =fs.getBackData(fileToUse + "_testing.dat")
Sample Data for Header is
header = ["CYCLE", "PASSES", "FAILURES", "IGNORED", "STATUS", "ACCEPTANCE"]
Sample for Training Data is
"cycle1",10,5,1,"fail","discard"
"cycle2",7,9,0,"fail","discard"
"cycle3",14,2,0,"pass","accept"
Sample for Testing Data is
"cycle71",11,4,1,"failed"
"cycle72",16,0,0,"passed"
I cant believe myself. I was wondering why was it so difficult to use a CSV file when every thing else is so easy in Python. My bad, I am new to it. So I finally figured out what was causing the list out of bound.
the function getBackData to be used to Training DATA & Testing Data only.
Separate function required for Header and Testing Data. Because Header will have equal number of columns, still the data type will be String.
Actually, I was using the function getBackData for Header also. and it was returning the CSV (containing headers) in a 2D list. Typically thats what it does. this was causing the issue.
Headers were supposed to be read as header[index], instead the code was recognizing it as header[row][col]. thats what I missed. I assumed Python to be intelligent enough to understand if only 1 row is there in CSV it should return a 1-D array.
Deserves a smiley :-)

Convert All Items in a Dataframe to Float

I am trying to convert all items in my dataframe to a float. The types are varies at the moment. The following error persist -> ValueError: could not convert string to float: '116,584.54'
The file can be found at https://www.imf.org/external/pubs/ft/weo/2019/01/weodata/WEOApr2019all.xls
I checked the value in excel, it is a Number. I tried .replace, .astype, pd.to_numeric.
for i in weo['1980']:
if i == float:
print(i)
i.replace(",",'')
i.replace("--",np.nan)
else:
continue
Also, I have tried:
weo['1980'] = weo['1980'].apply(pd.to_numeric)
You can try using DataFrame.astype in order to conduct the conversion which is usually the recommended approach. As you already attempted in your question, you may have to remove all the comas form the string in column 1980 first as it may cause the same error as quoted in your question:
weo['1980'] = weo['1980'].replace(',', '')
weo['1980'] = weo['1980'].asytpe(float)
If you're reading your DataFrame from Excel using pandas.read_excel, you can also specify the thousands argument to do this conversion for you which will likely result in a higher performance:
pandas.read_excel(file, thousands=',')
I had types error all the time while playing with dataframes. I now always use this to convert all the values that can be converted into floats.
# Convert all columns that can be converted into float into float.
# Error were raised because their type was Object
df = df.apply(pd.to_numeric, errors='ignore')

Resources