How to use HDFStore.select to screen data - python-3.x

Question:
1. How do I select rows where (pseudo code) columns['Name'] == 'Name_A' (Name_A is just an example) and columns['Time'] is between 2021-11-21 00:00:00 and 2021-11-22 00:00:00?
I have stored about 4 billion rows of data in an HDF5 file.
Now I want to select some of that data.
My code looks like this:
import pandas as pd
ss = pd.HDFStore("xh_data_L9.hdf5") #<class 'pandas.io.pytables.HDFStore'>
print(type(ss))
print(ss.keys())
s_1 = ss.select('alldata',start=0,stop=500) # data example
print(s_1)
ss.close()
I found the HDFStore.select signature:
HDFStore.select(key, where=None, start=None, stop=None, columns=None, iterator=False, chunksize=None, auto_close=False)
# These attempts do not run successfully:
s_3 = ss.select('alldata',where="Time>2021-11-21 00:00:00 & Time<2021-11-22 00:00:00)")
s_3 = ss.select('alldata',['Name'] == 'Name_A')
I have googled some methods, but I don't know how to use "where".

I found that the cause was whether data_columns was set when the file was created.
# This way of creating the HDF5 file does not define data_columns:
ss.append('store', df_temp, index=True)
# This way does define data_columns:
store.append("store", df_temp, format="table", data_columns=True)
# Check whether the store includes data_columns:
import pandas as pd
ss = pd.HDFStore("store.hdf5")
print(ss.info())
If the result includes "dc->[Time,Name,Value]", then queries on those columns work:
ss.select("store", where="Name='Name_A'")
# Single quotation marks are required around string values.
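Once data_columns are in place, both conditions from the question can be combined in a single where string. A minimal sketch, assuming the table was written with data_columns=True and that Time is stored as a datetime64 column named Time and Name as a string column:
import pandas as pd

with pd.HDFStore("store.hdf5") as ss:
    # String values need quotes inside the where expression;
    # datetime comparisons accept quoted ISO timestamps.
    result = ss.select(
        "store",
        where="Name='Name_A' & Time>'2021-11-21 00:00:00' & Time<'2021-11-22 00:00:00'",
    )
print(result)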
The following is the official documentation for data_columns:
data_columns : list of columns, or True, default None
    List of columns to create as indexed data columns for on-disk queries,
    or True to use all columns. By default only the axes of the object are indexed.
See here <https://pandas.pydata.org/pandas-docs/stable/user_guide/io.html#query-via-data-columns>.

Related

How do I extract the column names from a .hdf5 file table and extract specific row data based on a specified column name?

Below is a screenshot of the branches of data in my .hdf5 file. I am trying to extract the existing column names (i.e. experiment_id, session_id, ...) from this particular BlinkStartEvent segment.
I have the following code that is able to access this section of the data and extract the numerical data as well. But for some reason, I cannot extract the corresponding column names, which I want to append to a separate list so I can create a dictionary out of this entire dataset. I thought .keys() was supposed to do it, but it didn't.
import h5py

def traverse_datasets(hdf_file):
    def h5py_dataset_iterator(g, prefix=''):
        for key in g.keys():
            # print(key)
            item = g[key]
            path = f'{prefix}/{key}'
            if isinstance(item, h5py.Dataset):  # test for dataset
                yield (path, item)
            elif isinstance(item, h5py.Group):  # test for group (go down)
                yield from h5py_dataset_iterator(item, path)
    for path, _ in h5py_dataset_iterator(hdf_file):
        yield path

with h5py.File(filenameHDF[0], 'r') as f:
    for dset in traverse_datasets(f):
        if str(dset[-15:]) == 'BlinkStartEvent':
            print('-----Path:', dset)  # path that leads to the data
            print('-----Shape:', f[dset].shape)  # the length dimension of the data
            print('-----Data type:', f[dset].dtype)  # prints out the unicode for all columns
            data2 = f[dset][()]  # The entire dataset
            # print('Check column names', f[dset].keys())  # I tried this but I got an AttributeError: 'Dataset' object has no attribute 'keys' error
I got the following as the output:
-----Path: /data_collection/events/eyetracker/BlinkStartEvent
-----Shape: (220,)
-----Data type: [('experiment_id', '<u4'), ('session_id', '<u4'), ('device_id', '<u2'), ('event_id', '<u4'), ('type', 'u1'), ('device_time', '<f4'), ('logged_time', '<f4'), ('time', '<f4'), ('confidence_interval', '<f4'), ('delay', '<f4'), ('filter_id', '<i2'), ('eye', 'u1'), ('status', 'u1')]
Traceback (most recent call last):
File "C:\Users\angjw\Dropbox\NUS PVT\Analysis\PVT analysis_hdf5access.py", line 64, in <module>
print('Check column names', f[dset].keys())
AttributeError: 'Dataset' object has no attribute 'keys'
What am I getting wrong here?
Also, is there a more efficient way to access the data such that I can do something (hypothetical) like:
data2[0]['experiment_id'] = 1
data2[1]['time'] = 78.35161
data2[2]['logged_time'] = 80.59253
rather than having to go through the process of setting up a dictionary for every single row of data?
You're close. The dataset's .dtype attribute gives you its NumPy dtype. Adding .descr returns it as a list of (field name, field type) tuples. See the code below to print the field names inside your loop:
for (f_name, f_type) in f[dset].dtype.descr:
    print(f_name)
There are better ways to work with HDF5 data than creating a dictionary for every single row of data (unless you absolutely want a dictionary for some reason). h5py is designed to work with dataset objects similar to NumPy arrays. (However, not all NumPy operations work on h5py dataset objects). The following code accesses the data and returns 2 similar (but slightly different) data objects.
# this returns a h5py dataset object that behaves like a NumPy array:
dset_obj = f[dset]
# this returns a NumPy array:
dset_arr = f[dset][()]
You can slice data from either object using standard NumPy slicing notation (using field names and row values). Continuing from above...
# returns row 0 from field 'experiment_id'
val0 = dset_obj[0]['experiment_id']
# returns row 1 from field 'time'
val1 = dset_obj[1]['time']
# returns row 2 from field 'logged_time'
val2 = dset_obj[2]['logged_time']
(You will get the same values if you replace dset_obj with dset_arr above.)
You can also slice entire fields/columns like this:
# returns field 'experiment_id' as a NumPy array
expr_arr = dset_obj['experiment_id']
# returns field 'time' as a NumPy array
time_arr = dset_obj['time']
# returns field 'logged_time' as a NumPy array
logtime_arr = dset_obj['logged_time']
That should answer your initial questions. If not, please add comments (or modify the post), and I will update my answer.
My previous answer used the h5py package (same package as your code). There is another Python package that I like to use with HDF5 data: PyTables (aka tables). Both are very similar, and each has unique strengths.
h5py attempts to map the HDF5 feature set to NumPy as closely as possible. Also, it uses Python dictionary syntax to iterate over object names and values. So, it is easy to learn if you are familiar with NumPy. Otherwise, you have to learn some NumPy basics (like interrogating dtypes). Homogeneous data is returned as a np.array and heterogeneous data (like yours) is returned as a np.recarray.
PyTables builds an additional abstraction layer on top of HDF5 and NumPy. Two unique capabilities I like are: 1) recursive iteration over nodes (groups or datasets), so a custom dataset generator isn't required, and 2) heterogeneous data is accessed with a "Table" object that has more methods than basic NumPy recarray methods. (Plus it can do complex queries on tables, has advanced indexing capabilities, and is fast!)
To compare them, I rewrote your h5py code with PyTables so you can "see" the difference. I incorporated all the operations in your question, and included the equivalent calls from my h5py answer. Differences to note:
The f.walk_nodes() method is a built-in method that replaces your generator. However, it returns an object (a Table object in this case), not the Table (dataset) name. So the code is slightly different, working with the object instead of the name.
Use Table.read() to load the data into a NumPy (record) array. The examples show how to load the entire Table into an array, or a single column by referencing the field name.
Code below:
import tables as tb

with tb.File(filenameHDF[0], 'r') as f:
    for tb_obj in f.walk_nodes('/', 'Table'):
        if str(tb_obj.name[-15:]) == 'BlinkStartEvent':
            print('-----Name:', tb_obj.name)  # Table name without the path
            print('-----Path:', tb_obj._v_pathname)  # path that leads to the data
            print('-----Shape:', tb_obj.shape)  # the length dimension of the data
            print('-----Data type:', tb_obj.dtype)  # prints out the np.dtype for all column names/variable types
            print('-----Field/Column names:', tb_obj.colnames)  # prints out the names of all columns as a list

            data2 = tb_obj.read()  # The entire Table (dataset) into record array data2

            # returns field 'experiment_id' as a NumPy array
            expr_arr = tb_obj.read(field='experiment_id')
            # returns field 'time' as a NumPy array
            time_arr = tb_obj.read(field='time')
            # returns field 'logged_time' as a NumPy array
            logtime_arr = tb_obj.read(field='logged_time')
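Because heterogeneous data comes back as a Table, PyTables can also filter rows on disk with Table.read_where(). A small sketch using the path printed in your output above; the condition values are made up purely for illustration:
import tables as tb

with tb.open_file(filenameHDF[0], 'r') as f:
    tbl = f.get_node('/data_collection/events/eyetracker/BlinkStartEvent')
    # The condition is evaluated on disk; only matching rows are returned
    # as a NumPy record array.
    subset = tbl.read_where('(experiment_id == 1) & (time > 78.0)')
    print(subset.shape, subset.dtype.names)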

How to output data to Azure ML Batch Endpoint correctly using python?

When invoking Azure ML Batch Endpoints (creating jobs for inferencing), the run() method should return a pandas DataFrame or an array, as explained here.
However, the example shown does not produce an output with headers for a CSV, which is often needed.
The first thing I tried was to return the data as a pandas DataFrame, and the result is just a simple CSV with a single column and no headers.
When I try to pass the values with several columns and their corresponding headers, to be saved later as CSV, I get awkward square brackets (representing Python lists) and apostrophes (representing strings).
I haven't been able to find documentation elsewhere to fix this.
This is the way I found to create clean output in CSV format using Python, from a batch endpoint invocation in Azure ML:
def run(mini_batch):
    batch = []
    for file_path in mini_batch:
        df = pd.read_csv(file_path)
        # Do any data quality verification here:
        if 'id' not in df.columns:
            logger.error("ERROR: CSV file uploaded without id column")
            return None
        else:
            df['id'] = df['id'].astype(str)
        # Now we create the predictions, with the model previously loaded in init():
        df['prediction'] = model.predict(df)
        # or alternatively: df[MULTILABEL_LIST] = model.predict(df)
        batch.append(df)
    batch_df = pd.concat(batch)
    # After joining all data, we create the column headers as a string,
    # removing the square brackets and apostrophes:
    azureml_columns = str(batch_df.columns.tolist())[1:-1].replace('\'', '')
    result = []
    result.append(azureml_columns)
    # Now we parse all values as strings, row by row,
    # adding a comma between each value:
    for row in batch_df.iterrows():
        azureml_row = str(row[1].values).replace(' ', ',')[1:-1].replace('\'', '').replace('\n', '')
        result.append(azureml_row)
    logger.info("Finished Run")
    return result
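For what it's worth, a hedged alternative sketch for the tail of run(): pandas can serialize the header and rows directly, since to_csv(index=False) returns one CSV string and splitlines() turns it into the same list-of-strings shape as above. I have not verified this against the batch endpoint's output writer, and format_batch_output is just a hypothetical helper name:
import pandas as pd

def format_batch_output(batch_df: pd.DataFrame) -> list:
    # Header line first, then one comma-separated line per row,
    # with no brackets or quote characters to strip afterwards.
    return batch_df.to_csv(index=False).splitlines()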

Pandas read_xml and SEPA (CAMT 053) XML

Recently I wanted to try the newly implemented read_xml function in pandas. I thought about testing the feature with SEPA camt-format XML. I'm stuck on the function's parameters, as I'm unfamiliar with the lxml logic. I tried pointing to the transaction values as rows (the "Ntry" tag), thinking it would loop through those rows and create the dataframe. Leaving xpath at its default returns an empty dataframe with the columns "GrpHdr" and "Rpt", but the relevant data is one level below "Rpt". Setting xpath='//*' creates a huge dataframe with every tag as a column and the values randomly sorted.
If anyone is familiar with using pandas read_xml on nested XML, I'd appreciate any hints.
The xml file looks like this (fake values):
<Document>
  <BkToCstmrAcctRpt>
    <GrpHdr>
      <MsgId>Account</MsgId>
      <CreDtTm>2021-08-05T14:20:23.077+02:00</CreDtTm>
      <MsgRcpt>
        <Nm> Name</Nm>
      </MsgRcpt>
    </GrpHdr>
    <Rpt>
      <Id>Account ID</Id>
      <CreDtTm>2021-08-05T14:20:23.077+02:00</CreDtTm>
      <Acct>
        <Id>
          <IBAN>DEXXXXX</IBAN>
        </Id>
      </Acct>
      <Bal>
        <Tp>
          <CdOrPrtry>
          </CdOrPrtry>
        </Tp>
        <Amt Ccy="EUR">161651651651</Amt>
        <CdtDbtInd>CRDT</CdtDbtInd>
        <Dt>
          <DtTm>2021-08-05T14:20:23.077+02:00</DtTm>
        </Dt>
      </Bal>
      <Ntry>
        <Amt Ccy="EUR">11465165</Amt>
        <CdtDbtInd>CRDT</CdtDbtInd>
        <Sts>BOOK</Sts>
        <BookgDt>
          <Dt>2021-08-02</Dt>
        </BookgDt>
        <ValDt>
          <Dt>2021-08-02</Dt>
        </ValDt>
        <BkTxCd>
          <Domn>
            <Cd>PMNT</Cd>
            <Fmly>
              <Cd>RCDT</Cd>
              <SubFmlyCd>ESCT</SubFmlyCd>
            </Fmly>
          </Domn>
          <Prtry>
            <Cd>NTRF+65454</Cd>
            <Issr>DFE</Issr>
          </Prtry>
        </BkTxCd>
        <NtryDtls>
          <TxDtls>
            <Amt Ccy="EUR">4945141.0</Amt>
            <CdtDbtInd>CRDT</CdtDbtInd>
            <BkTxCd>
              <Domn>
                <Cd>PMNT</Cd>
                <Fmly>
                  <Cd>RCDT</Cd>
                  <SubFmlyCd>ESCT</SubFmlyCd>
                </Fmly>
              </Domn>
              <Prtry>
                <Cd>NTRF+55155</Cd>
                <Issr>DFEsds</Issr>
              </Prtry>
            </BkTxCd>
            <RltdPties>
              <Dbtr>
                <Nm>Name</Nm>
              </Dbtr>
              <Cdtr>
                <Nm>Name</Nm>
              </Cdtr>
            </RltdPties>
            <RmtInf>
              <Ustrd>Referenz NOTPROVIDED</Ustrd>
              <Ustrd> Buchug</Ustrd>
            </RmtInf>
          </TxDtls>
        </NtryDtls>
      </Ntry>
    </Rpt>
  </BkToCstmrAcctRpt>
</Document>
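For reference, a minimal sketch of the xpath attempt described above (pandas >= 1.3; this assumes the sample is saved as example.xml and, as in the snippet, carries no default namespace). Pointing xpath at the Ntry elements yields one row per entry, but only their direct children become columns, so nested blocks such as NtryDtls/TxDtls stay out of reach:
import pandas as pd

# One row per <Ntry>; shallow children (Amt, CdtDbtInd, Sts, ...) become
# columns, while the nested transaction details are not expanded.
entries = pd.read_xml("example.xml", xpath=".//Ntry")
print(entries[["Amt", "CdtDbtInd", "Sts"]])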
The bank statement is not a shallow XML, and is thus not very suitable for pandas.read_xml (as indicated in the documentation).
Instead I suggest using the sepa library.
from sepa import parser
import re
import pandas as pd

# Utility function to remove additional namespaces from the XML
def strip_namespace(xml):
    return re.sub(' xmlns="[^"]+"', '', xml, count=1)

# Read file
with open('example.xml', 'r') as f:
    input_data = f.read()

# Parse the bank statement XML to a dictionary
camt_dict = parser.parse_string(parser.bank_to_customer_statement, bytes(strip_namespace(input_data), 'utf8'))
statements = pd.DataFrame.from_dict(camt_dict['statements'])

all_entries = []
for i, _ in statements.iterrows():
    if 'entries' in camt_dict['statements'][i]:
        df = pd.DataFrame()
        dd = pd.DataFrame.from_records(camt_dict['statements'][i]['entries'])
        df['Date'] = dd['value_date'].str['date']
        df['Date'] = pd.to_datetime(df['Date']).dt.strftime('%Y-%m-%d')
        iban = camt_dict['statements'][i]['account']['id']['iban']
        df['IBAN'] = iban
        df['Currency'] = dd['amount'].str['currency']
        all_entries.append(df)
df_entries = pd.concat(all_entries)

Dask: Add list to a column value like pandas does

I am a bit new to dask. I have a large csv file and a large list. The number of rows in the csv equals the length of the list. I am trying to create a new column in the Dask dataframe from the list. In pandas it is pretty straightforward, but in Dask I am having a hard time creating the new column. I am avoiding pandas because my data is 15GB+.
Please see my attempts below.
csv Data
name,text,address
john,some text here,MD
tim,some text here too,WA
Code tried
import dask.dataframe as dd
import numpy as np
ls = ['one','two']
ddf = dd.read_csv('../data/test.csv')
ddf.head()
Try #1:
ddf['new'] = ls # TypeError: Column assignment doesn't support type list
Try #2: What should be passed here for condlist?
ddf['new'] = np.select(choicelist=ls) # TypeError: _select_dispatcher() missing 1 required positional argument: 'condlist'
Looking for this output:
name text address new
0 john some text here MD one
1 tim some text here too WA two
Try creating a dask array (or series) and then assigning it as a new column, like this:
#ls = dd.from_array(np.array(['one','two']))
#ddf['new'] = ls
# As tested by OP
import dask.array as da
ls = da.array(['one','two','three'])
ddf['new'] = ls
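A hedged sketch of the same idea with the chunking made explicit: assigning a dask array to a column only lines up when the array's chunks match the dataframe's partition lengths, so computing those lengths first makes the alignment explicit (at the cost of an extra pass over the data):
import dask.dataframe as dd
import dask.array as da
import numpy as np

ls = ['one', 'two']
ddf = dd.read_csv('../data/test.csv')

# Chunk the array exactly like the dataframe's partitions so the
# column assignment matches row-for-row.
part_lens = ddf.map_partitions(len).compute().tolist()
new_col = da.from_array(np.array(ls), chunks=(tuple(part_lens),))
ddf['new'] = new_col
print(ddf.compute())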

How do you take one dataframe that covers multiple years, and break it into a separate DF for each year

I already looked on SE and couldn't find an answer to my question. I am still new to this.
I am trying to take a purchasing csv file and break it into separate dataframes for each year.
For example, if I have a listing with full dates in MM/DD/YYYY format, I am trying to separate them into dataframes for each year. Like Ord2015, Ord2014, etc...
I tried to convert the full date into just the year, and also attempted to use slicing to look only at the last four characters of the date, to no avail.
Here is my current (incomplete) attempt:
import pandas as pd
import csv
import numpy as np
import datetime as dt
import re
purch1 = pd.read_csv('purchases.csv')
#Remove unneeded fluff
del_colmn = ['pid', 'notes', 'warehouse_id', 'env_notes', 'budget_notes']
purch1 = purch1.drop(del_colmn, 1)
#break down by year only
purch1.sort_values(by=['order_date'])
Ord2015 = ()
Ord2014 = ()
for purch in purch1:
    Order2015.add(purch1['order_date'] == 2015)
Per request by @anon01, here are the results of the code you had me run. I only used a sample of four rows, as that was all I was initially playing with. The record has almost 20k lines, so I only pulled aside a few to play with.
'{"pid":{"0":75,"2":95,"3":117,"1":82},"env_id":{"0":12454,"2":12532,"3":12623,"1":12511},"ord_date":{"0":"10\/2\/2014","2":"11\/22\/2014","3":"2\/17\/2015","1":"11\/8\/2014"},"cost_center":{"0":"Ops","2":"Cons","3":"Net","1":"Net"},"dept":{"0":"Ops","2":"Cons","3":"Ops","1":"Ops"},"signing_mgr":{"0":"M. Dodd","2":"L. Price","3":"M. Dodd","1":"M. Dodd"},"check_num":{"0":null,"2":null,"3":null,"1":82301.0},"rec_date":{"0":"10\/11\/2014","2":"12\/2\/2014","3":"3\/1\/2015","1":"11\/20\/2014"},"model":{"0":null,"2":null,"3":null,"1":null},"notes":{"0":"Shipped to east WH","2":"Rec'd by L.Price","3":"Shipped to Client (1190)","1":"Rec'd by K. Wilson"},"env_notes":{"0":"appr by K.Polt","2":"appr by S. Crane","3":"appr by K.Polt","1":"appr by K.Polt"},"budget_notes":{"0":null,"2":"OOB expense","3":"Bill to client","1":null},"cost_year":{"0":2014.0,"2":2015.0,"3":null,"1":2014.0}}'
You can add parse_dates to read_csv to convert the column to datetimes, then create a dictionary of DataFrames dfs and select by key:
purch1 = pd.read_csv('purchases.csv', parse_dates=['ord_date'])
dfs = dict(tuple(purch1.groupby(purch1['ord_date'].dt.year)))
Ord2015 = dfs[2015]
Ord2016 = dfs[2016]
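One caveat: dfs only contains the years actually present in the data, so selecting a missing year raises a KeyError; checking the keys first (or using .get) avoids that:
# Inspect which years were found before selecting one:
print(sorted(dfs.keys()))   # e.g. [2014, 2015] for the sample rows above
Ord2014 = dfs.get(2014)     # returns None instead of raising if 2014 is absent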
It is not recommended, but it is possible to create DataFrames from the year groups dynamically:
for i, g in purch1.groupby(purch1['ord_date'].dt.year):
    globals()['Ord' + str(i)] = g

Resources