How to write a Pandas DataFrame into an HDF5 dataset - python-3.x

I'm trying to write data from a Pandas dataframe into a nested HDF5 file, with multiple groups and datasets within each group. I'd like to keep it as a single file which will grow in the future on a daily basis. I've had a go with the following code, which shows the structure of what I'd like to achieve:
import h5py
import numpy as np
import pandas as pd

file = h5py.File('database.h5', 'w')
d = {'one': pd.Series([1., 2., 3.], index=['a', 'b', 'c']),
     'two': pd.Series([1., 2., 3., 4.], index=['a', 'b', 'c', 'd'])}
df = pd.DataFrame(d)

groups = ['A', 'B', 'C']
for m in groups:
    group = file.create_group(m)
    dataset = ['1', '2', '3']
    for n in dataset:
        data = df
        ds = group.create_dataset(m + n, data.shape)
        print("Dataset dataspace is", ds.shape)
        print("Dataset Numpy datatype is", ds.dtype)
        print("Dataset name is", ds.name)
        print("Dataset is a member of the group", ds.parent)
        print("Dataset was created in the file", ds.file)
        print("Writing data...")
        ds[...] = data
        print("Reading data back...")
        data_read = ds[...]
        print("Printing data...")
        print(data_read)
file.close()
This way the nested structure is created but it loses the index and columns. I've tried the
df.to_hdf('database.h5', ds, table=True, mode='a')
but it didn't work; I get this error:
AttributeError: 'Dataset' object has no attribute 'split'
Can anyone shed some light, please? Many thanks.

df.to_hdf() expects a string as a key parameter (second parameter):
key : string
identifier for the group in the store
so try this:
df.to_hdf('database.h5', ds.name, table=True, mode='a')
where ds.name should return you a string (key name):
In [26]: ds.name
Out[26]: '/A1'
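For completeness, a sketch of the original loop written with pandas only (assuming PyTables is installed; format='table' is the newer spelling of the older table=True keyword):

import pandas as pd

d = {'one': pd.Series([1., 2., 3.], index=['a', 'b', 'c']),
     'two': pd.Series([1., 2., 3., 4.], index=['a', 'b', 'c', 'd'])}
df = pd.DataFrame(d)

for m in ['A', 'B', 'C']:
    for n in ['1', '2', '3']:
        # key is a plain string such as 'A/1'; nested groups are created automatically
        df.to_hdf('database.h5', key=m + '/' + n, mode='a', format='table')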

I thought I'd have a go with pandas/PyTables and the HDFStore class instead of h5py, so I tried the following:
import numpy as np
import pandas as pd

db = pd.HDFStore('Database.h5')
index = pd.date_range('1/1/2000', periods=8)
df = pd.DataFrame(np.random.randn(8, 3), index=index, columns=['Col1', 'Col2', 'Col3'])

groups = ['A', 'B', 'C']
i = 1
for m in groups:
    subgroups = ['d', 'e', 'f']
    for n in subgroups:
        db.put(m + '/' + n, df, format='table', data_columns=True)
It works: 9 groups (groups in PyTables rather than datasets in h5py?) are created, from A/d to C/f. Columns and indexes are preserved and I can do the dataframe operations I need. I'm still wondering, though, whether this is an efficient way to retrieve data from a specific group which will become huge in the future, i.e. operations like
db['A/d'].Col1[4:]
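For large stores, a hedged note: because the groups were written with format='table' and data_columns=True, HDFStore.select can filter inside the file rather than loading the whole group first. A sketch using the db and column names from above:

# Read only the rows of group 'A/d' that satisfy a condition on Col1,
# instead of pulling the entire group into memory with db['A/d'].
subset = db.select('A/d', where='Col1 > 0', columns=['Col1'])

# Row-number slicing is also possible without loading everything:
tail = db.select('A/d', start=4)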

Related

Iteratively append new data into pandas dataframe column and join with another dataframe

I have been extracting data from many APIs, and I would like to add a common id column across all of them.
I have tried the below:
df = pd.DataFrame()
for i in range(1, 200):
    url = '{id}/values'.format(id=i)
    res = request.get(url, headers=headers)
    if res.status_code == 200:
        data = json.loads(res.content.decode('utf-8'))
        if data['success']:
            df['id'] = i
            test = pd.json_normalize(data[parent][child])
            df = df.append(test, index=False)
But in the dataframe's id column I only get the last iterated id, and for APIs that return many rows I'm getting invalid data.
For performance reasons it would be better to first store the data in a dictionary and then create the dataframe from that dictionary:
import pandas as pd
from collections import defaultdict

d = defaultdict(list)
for i in range(1, 200):
    # simulate dataframe retrieved from pd.json_normalize() call
    row = pd.DataFrame({'id': [i], 'field1': [f'f1-{i}'], 'field2': [f'f2-{i}'], 'field3': [f'f3-{i}']})
    for k, v in row.to_dict().items():
        d[k].append(v[0])

df = pd.DataFrame(d)
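If you would rather keep the per-response frames from pd.json_normalize, an alternative sketch (the response frame is simulated here, just as above): tag each chunk with its id and concatenate once at the end, which also avoids assigning df['id'] = i before any rows exist.

import pandas as pd

frames = []
for i in range(1, 200):
    # simulate the frame returned by pd.json_normalize() for this API call
    test = pd.DataFrame({'field1': [f'f1-{i}'], 'field2': [f'f2-{i}']})
    test['id'] = i  # applied to every row of this chunk, not overwritten later
    frames.append(test)

df = pd.concat(frames, ignore_index=True)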

Changing column data type in Pandas DF using for command

I want to change the data type for a few columns in a database from Int/Float to Object.
I am trying to do this using a for loop, but the data type is not changing.
for i in convert_to_object:
    i = i.astype('object', inplace=True)

# convert_to_object is a list of columns in data frame df_app1
convert_to_object = [df_app1.REGION_RATING_CLIENT,
                     df_app1.REGION_RATING_CLIENT_W_CITY,
                     df_app1.REG_REGION_NOT_LIVE_REGION,
                     df_app1.REG_REGION_NOT_WORK_REGION,
                     df_app1.LIVE_REGION_NOT_WORK_REGION,
                     df_app1.REG_CITY_NOT_LIVE_CITY,
                     df_app1.REG_CITY_NOT_WORK_CITY,
                     df_app1.LIVE_CITY_NOT_WORK_CITY]

for i in convert_to_object:
    i = i.astype('object', inplace=True)

df.info()
The data type for the given columns should change to Object.
As far as I know, DataFrame.astype does not support the inplace param. See https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.astype.html How about:
cols = [
    "REGION_RATING_CLIENT",
    "REGION_RATING_CLIENT_W_CITY",
    "REG_REGION_NOT_LIVE_REGION",
    "REG_REGION_NOT_WORK_REGION",
    "LIVE_REGION_NOT_WORK_REGION",
    "REG_CITY_NOT_LIVE_CITY",
    "REG_CITY_NOT_WORK_CITY",
    "LIVE_CITY_NOT_WORK_CITY"
]
df[cols] = df[cols].astype(object)
Consider this example:
import pandas as pd

df = pd.DataFrame({
    'A': [1, 2],
    'B': [1.2, 3.4]
})
print(df.dtypes)
df[['A', 'B']] = df[['A', 'B']].astype(object)
print(df.dtypes)
Returns:
A int64
B float64
dtype: object
A object
B object
dtype: object
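If you would rather keep the loop structure from the question, a small sketch using the cols list above: iterate over the column names and assign the converted Series back, since astype returns a new object rather than modifying anything in place.

for col in cols:
    # astype returns a new Series, so assign it back to the dataframe column
    df[col] = df[col].astype(object)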

How to appropriately test Pandas dtypes within dataframes?

Objective: to create a function that can match given dtypes to a predefined data type scenario.
Description: I want to be able to classify given datasets into predefined scenario types based on their attributes.
Below are two example datasets (df_a and df_b). df_a has only dtypes that are equal to 'object' while df_b has both 'object' and 'int64':
# scenario_a
data_a = [['tom', 'blue'], ['nick', 'green'], ['julia', 'red']]
df_a = pd.DataFrame(data_a, columns=['Name', 'Color'])
df_a['Color'] = df_a['Color'].astype('object')

# scenario_b
data_b = [['tom', 10], ['nick', 15], ['julia', 14]]
df_b = pd.DataFrame(data_b, columns=['Name', 'Age'])
I want to be able to determine automatically which scenario it is based on a function:
import pandas as pd
import numpy as np

def scenario(data):
    if data.dtypes.str.contains('object'):
        return scenario_a
    if data.dtypes.str.contains('object', 'int64'):
        return scenario_b
Above is what I have so far, but it isn't getting the results I was hoping for.
When using the function scenario(df_a) I am looking for the result to be scenario_a and when I pass df_b I am looking for the function to be able to determine, correctly, what scenario it should be.
Any help would be appreciated.
Here is one approach. Create a dict scenarios, with the keys a sorted tuple of predefined dtypes, and the value being what you would want returned by the function.
Using your example, something like:
# scenario_a
data_a = [['tom', 'blue'], ['nick', 'green'], ['julia', 'red']]
df_a = pd.DataFrame(data_a, columns=['Name', 'Color'])
df_a['Color'] = df_a['Color'].astype('object')

# scenario_b
data_b = [['tom', 10], ['nick', 15], ['julia', 14]]
df_b = pd.DataFrame(data_b, columns=['Name', 'Age'])

scenario_a = tuple(sorted(df_a.dtypes.unique()))
scenario_b = tuple(sorted(df_b.dtypes.unique()))

scenarios = {
    scenario_a: 'scenario_a',
    scenario_b: 'scenario_b'
}
print(scenarios)
# {(dtype('O'),): 'scenario_a', (dtype('int64'), dtype('O')): 'scenario_b'}

def scenario(data):
    dtypes = tuple(sorted(data.dtypes.unique()))
    return scenarios.get(dtypes, None)

scenario(df_a)
# 'scenario_a'
scenario(df_b)
# 'scenario_b'
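A small variant (my own suggestion, not from the answer above): using dtype names as strings keeps the keys easy to write by hand and avoids relying on the ordering of numpy dtype objects.

def scenario_by_name(data):
    # dtype names as plain strings, e.g. ('int64', 'object')
    dtypes = tuple(sorted(data.dtypes.astype(str).unique()))
    lookup = {
        ('object',): 'scenario_a',
        ('int64', 'object'): 'scenario_b',
    }
    return lookup.get(dtypes)

scenario_by_name(df_b)
# 'scenario_b'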

Subtract a single value from columns in pandas

I have two data frames, df and df_test. I am trying to create a new dataframe for each df_test row that will include the difference between the x coordinates and the y coordinates. I would also like to create a new column that gives the magnitude of this distance between objects. Below is my code.
import pandas as pd
import numpy as np

# Create Dataframe
index_numbers = np.linspace(0, 10, 11, dtype=np.int)
index_ = ['OP_%s' % number for number in index_numbers]
header = ['X', 'Y', 'D']
# print(index_)
data = np.round_(np.random.uniform(low=0, high=10, size=(len(index_), 3)), decimals=0)
# print(data)
df = pd.DataFrame(data=data, index=index_, columns=header)
df_test = df.sample(3)
# print(df)
# print(df_test)

for index, row in df_test.iterrows():
    print(index)
    print(row)
    df_(index) = df
    df_(index)['X'] = df['X'] - df_test['X'][row]
    df_(index)['Y'] = df['Y'] - df_test['Y'][row]
    df_(index)['Dist'] = np.sqrt(df_(index)['X']**2 + df_(index)['Y']**2)
    print(df_(index))
Better For Loop
for index, row in df_test.iterrows():
    # print(index)
    # print(row)
    # print("df_{0}".format(index))
    df_temp = df.copy()
    df_temp['X'] = df_temp['X'] - df_test['X'][index]
    df_temp['Y'] = df_temp['Y'] - df_test['Y'][index]
    df_temp['Dist'] = np.sqrt(df_temp['X']**2 + df_temp['Y']**2)
    print(df_temp)
I have written a for loop to run through each row of the df_test dataframe and "try" to create the columns. The (index) in each loop is meant to become the name of a new dataframe based on the test row used. Once a dataframe is created with the modified and new columns, I need to save it to a dictionary. The improved loop produces each of the new dataframes I need, but what is the best way to save each one? Any help in creating these columns would be greatly appreciated.
Please comment with any questions so that I can make it easier to understand, if need be.
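One way to collect the results, a sketch building on the improved loop above: store each frame in a dictionary keyed by the df_test index label.

results = {}
for index, row in df_test.iterrows():
    df_temp = df.copy()
    df_temp['X'] = df_temp['X'] - row['X']
    df_temp['Y'] = df_temp['Y'] - row['Y']
    df_temp['Dist'] = np.sqrt(df_temp['X']**2 + df_temp['Y']**2)
    results[index] = df_temp
# results maps each sampled index label (e.g. 'OP_3', if it was sampled) to its frame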

Merge and then sort columns of a dataframe based on the columns of the merging dataframe

I have two dataframes, both indexed with timestamps. I would like to preserve the order of the columns in the first dataframe that is merged.
For example:
#required packages
import pandas as pd
import numpy as np
# defining stuff
num_periods_1 = 11
num_periods_2 = 4
# create sample time series
dates1 = pd.date_range('1/1/2000 00:00:00', periods=num_periods_1, freq='10min')
dates2 = pd.date_range('1/1/2000 01:30:00', periods=num_periods_2, freq='10min')
column_names_1 = ['C', 'B', 'A']
column_names_2 = ['B', 'C', 'D']
df1 = pd.DataFrame(np.random.randn(num_periods_1, len(column_names_1)), index=dates1, columns=column_names_1)
df2 = pd.DataFrame(np.random.randn(num_periods_2, len(column_names_2)), index=dates2, columns=column_names_2)
df3 = df1.merge(df2, how='outer', left_index=True, right_index=True, suffixes=['_1', '_2'])
print("\nData Frame Three:\n", df3)
The above code generates two data frames, the first with columns C, B, and A, and the second with columns B, C, and D. The current output has the columns in the following order: C_1, B_1, A, B_2, C_2, D. What I want is for the columns from the output of the merge to be C_1, C_2, B_1, B_2, A_1, D_2: the column order from the first data frame is preserved, and any column shared with the second data frame is added next to the corresponding one.
Could there be a setting in merge or can I use sort_index to do this?
EDIT: Maybe a better way to phrase the sorting process would be to call it uncollated, where matching columns are placed next to each other, and so on.
Using an OrderedDict, as you suggested.
from collections import OrderedDict
from itertools import chain

c = df3.columns.tolist()
o = OrderedDict()
for x in c:
    o.setdefault(x.split('_')[0], []).append(x)
c = list(chain.from_iterable(o.values()))
df3 = df3[c]
An alternative that involves extracting the prefixes and then calling sorted on the index.
# https://stackoverflow.com/a/46839182/4909087
p = [s[0] for s in c]
c = sorted(c, key=lambda x: (p.index(x[0]), x))
df = df[c]
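With the sample frames above, the OrderedDict approach should leave the columns in roughly this order (A and D keep their bare names because only the overlapping B and C columns receive suffixes):

print(df3.columns.tolist())
# ['C_1', 'C_2', 'B_1', 'B_2', 'A', 'D']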
