h5py accuracy for matrix storage - python-3.x

I want to use python3-h5py to store a matrix in the HDF5 format.
My problem is that when I compare the initial data to the data extracted from the HDF5 file, I get surprising differences.
import numpy
import h5py
# Create a vector of float64 values between 0 and 1
A = numpy.array(range(16384+1))/(16384+1)
# Save the corresponding float16 array to a HDF5 file
Fid = h5py.File("Output/Test.hdf5","w")
Group01 = Fid.create_group("Group")
Group01.create_dataset("Data", data=A, dtype='f2')
# Group01.create_dataset("Data", data=A.astype(numpy.float16), dtype='f2')# Use that line to avoid the bug
Fid.flush()
Fid.close()
# Read the HDF5 file
Fid = h5py.File("Output/Test.hdf5",'r')
B = Fid["Group/Data"][:]
Fid.close()
# Compare float64 and float16 Values
print(A[8192])
print(B[8192])
print("")
print(A[8192+1])
print(B[8192+1])
print("")
print(A[16384])
print(B[16384])
Gives:
0.499969484284
0.25

0.500030515716
0.5

0.999938968569
0.5
Sometimes I get a difference of about "0.00003" and sometimes "0.4999".
Normally, I should always get a difference of about "0.00003", which corresponds to the float16 rounding error for a value between 0 and 1.
But the "0.4999" difference is really unexpected. I have noticed that it happens for values close to a power of 2 (for example, "~1/2" gets stored as "~1/4").
Is this a bug in the h5py package?
Thanks in advance,
Stéphane,
[Xubuntu 17.09 64bits + python3-h5py v2.7.1-2 + python3 v3.6.3-0ubuntu2]

I am not sure this can be considered a proper answer, but I finally got rid of my problem with a small workaround.
To sum it up, it looks like there is a bug in h5py v2.7.1-2.
When using h5py to store arrays, don't use a command like this:
`Group01.create_dataset("Data", data=A, dtype='f2')  # buggy command`
Use this instead:
`Group01.create_dataset("Data", data=A.astype(numpy.float16), dtype='f2')`
Edit 18 Nov 2022: with h5py==3.7.0 the bug is fixed.
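For anyone who wants to check their own h5py version, here is a minimal round-trip sketch (assuming the same data as above, and using NumPy's own float64-to-float16 cast as the reference):
import numpy
import h5py

A = numpy.array(range(16384 + 1)) / (16384 + 1)
expected = A.astype(numpy.float16)  # reference: NumPy's float64 -> float16 rounding

with h5py.File("Test.hdf5", "w") as fid:
    fid.create_dataset("Group/Data", data=A, dtype='f2')
with h5py.File("Test.hdf5", "r") as fid:
    B = fid["Group/Data"][:]

# On an affected h5py version, the mismatching indices cluster near powers of 2,
# where the stored value is roughly half of the expected one.
mismatch = numpy.flatnonzero(B != expected)
print(len(mismatch), "mismatching values; first few indices:", mismatch[:10])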

Related

Python PyVisa convert queried binary data to ascii data

I'm currently using a Keysight VNA product and I control it using PyVisa. Since I have a rapidly changing system, I wish to query binary data instead of ASCII data from the machine, since it is about 10 times faster. The issue I am having is converting the data back to ASCII.
Minimal example code:
import pyvisa as visa
import numpy as np
device_address = 'TCPIP0::localhost::hislip1,4880::INSTR'
rm = visa.ResourceManager('C:\\Windows\\System32\\visa32.dll')
device = rm.open_resource(device_address)
# presetting device for SNP data measurement
# ...
device.query_ascii_values('CALC:DATA:SNP? 2', container=np.ndarray)  # works great but is slow
device.write('FORM:DATA REAL,64')
device.query_binary_values('CALC:DATA:SNP? 2', container=np.ndarray)  # 10 times faster, but how to read the data?
The official docs on querying binary values don't give me anything. I found the functions for the code on git here, and some helper functions for converting data here, but I am still unable to convert the data such that it matches the result of the ASCII query command. If possible, I would like 'container=np.ndarray' to be kept.
Functions from the last link that I have tested:
bin_data = device.query_binary_values('CALC:DATA:SNP? 2', container = np.ndarray)
num = from_binary_block(bin_data) # "Convert a binary block into an iterable of numbers."
ascii_data = to_ascii_block(num) # "Turn an iterable of numbers in an ascii block of data."
but the data from query_ascii_values and the values of ascii_data don't match. Any help is highly appreciated.
Edit:
With the following code
device.write(f"SENS:SWE:POIN 5;")
data_bin = device.query_binary_values('CALC:DATA? SDATA', container=np.ndarray)
I got
data_bin = array([-5.0535379e-34, 1.3452465e-43, -1.7349754e+09, 1.3452465e-43,
-8.6640313e+22, 8.9683102e-44, 5.0314407e-06, 3.1389086e-43,
4.8143607e-36, 3.1389086e-43, -4.1738553e-12, 1.3452465e-43,
-1.5767541e+11, 8.9683102e-44, -2.8241991e+32, 1.7936620e-43,
4.3024710e+16, 1.3452465e-43, 2.1990014e+07, 8.9683102e-44],
dtype=float32)
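The pattern of alternating huge and tiny float32 values suggests the 64-bit words sent by the instrument are being unpacked as 32-bit floats. A hedged sketch of what matching the transfer format might look like (the datatype and byte-order values below are assumptions that must be checked against the instrument's FORM settings):
# After FORM:DATA REAL,64 the instrument sends 8-byte IEEE-754 doubles,
# so tell pyvisa to unpack 'd' (float64) instead of the default 'f' (float32).
data_bin = device.query_binary_values(
    'CALC:DATA? SDATA',
    datatype='d',         # assumption: matches FORM:DATA REAL,64
    is_big_endian=False,  # assumption: must match the instrument's FORM:BORD setting
    container=np.ndarray,
)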

Analysis on data .txt files

I have a series of .txt files that I want to analyse all at the same time. The files typically have a length of about 1000 values. I want to check the first 200 values of each file for outliers, where an outlier is any value below 12. I therefore use the code below; however, I get the error:
'numpy.bool_' object does not support item assignment. How do I overcome this? Should I not use loadtxt to perform these kinds of checks?
for files in document:
    Rf_file = open(files, "r")
    Rf_value = np.loadtxt(Rf_file)
    # Indicate outliers
    for i in range(0, 200):
        outliers = Rf_value[i] < 12
        Rf_value = Rf_value[outliers]
Without an example it is hard to give a perfect answer, but it is most probably something like this:
import numpy as np

for document in documents:
    values = np.loadtxt(document)
    values_200 = values[:200]
    outliers = values_200[values_200 < 12]
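If you also want to report which files contain outliers, a small extension of the same idea (documents is assumed to be a list of .txt file paths):
import numpy as np

outliers_per_file = {}
for document in documents:  # documents: assumed list of .txt file paths
    values = np.loadtxt(document)
    values_200 = values[:200]  # only analyse the first 200 values
    outliers_per_file[document] = values_200[values_200 < 12]

for name, outliers in outliers_per_file.items():
    print(name, "->", len(outliers), "values below 12")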

Pandas .rolling.corr using date/time offset

I am having a bit of an issue with pandas's rolling function and I'm not quite sure where I'm going wrong. If I mock up two test series of numbers:
df_index = pd.date_range(start='1990-01-01', end ='2010-01-01', freq='D')
test_df = pd.DataFrame(index=df_index)
test_df['Series1'] = np.random.randn(len(df_index))
test_df['Series2'] = np.random.randn(len(df_index))
Then it's easy to have a look at their rolling annual correlation:
test_df['Series1'].rolling(365).corr(test_df['Series2']).plot()
which produces:
All good so far. If I then try to do the same thing using a datetime offset:
test_df['Series1'].rolling('365D').corr(test_df['Series2']).plot()
I get a wildly different (and obviously wrong) result:
Is there something wrong with pandas or is there something wrong with me?
Thanks in advance for any light you can shed on this troubling conundrum.
It's very tricky; I think the behavior of window as an int and window as an offset is different:
New in version 0.19.0 are the ability to pass an offset (or convertible) to a .rolling() method and have it produce variable sized windows based on the passed time window. For each time point, this includes all preceding values occurring within the indicated time delta. This can be particularly useful for a non-regular time frequency index.
You should check out the docs on Time-aware Rolling.
r1 = test_df['Series1'].rolling(window=365) # has default `min_periods=365`
r2 = test_df['Series1'].rolling(window='365D') # has default `min_periods=1`
r3 = test_df['Series1'].rolling(window=365, min_periods=1)
r1.corr(test_df['Series2']).plot()
r2.corr(test_df['Series2']).plot()
r3.corr(test_df['Series2']).plot()
This code produces similarly shaped plots for r2.corr().plot() and r3.corr().plot(), but note that the calculated results can still differ; compare r2.corr(test_df['Series2']) with r3.corr(test_df['Series2']).
I think for a regular time-frequency index, you should just stick with r1.
This is mainly because the results of rolling(365) and rolling('365D') are different.
For example:
sub = test_df.head()
sub['Series2'].rolling(2).sum()
Out[15]:
1990-01-01 NaN
1990-01-02 -0.355230
1990-01-03 0.844281
1990-01-04 2.515529
1990-01-05 1.508412
sub['Series2'].rolling('2D').sum()
Out[16]:
1990-01-01 -0.043692
1990-01-02 -0.355230
1990-01-03 0.844281
1990-01-04 2.515529
1990-01-05 1.508412
Since there are a lot of NaNs in rolling(365) (the first 364 windows have fewer than 365 values), the correlations of the two series computed the two ways are quite different.
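To see the min_periods effect directly, a quick check (reusing the test_df defined in the question) counts the NaNs each variant produces:
# The int window leaves the first 364 results as NaN because min_periods
# defaults to the window size; the offset window starts emitting values
# much sooner.
print(test_df['Series1'].rolling(365).corr(test_df['Series2']).isna().sum())
print(test_df['Series1'].rolling('365D').corr(test_df['Series2']).isna().sum())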

Python 3 OutOfBoundsDatetime: Out of bounds nanosecond timestamp: (Workaround)

Encountered an error today while importing a CSV file with dates. The file has known quality issues, and in this case one entry was "3/30/3013" due to a data entry error.
Reading other entries about the OutOfBoundsDatetime error, pandas's nanosecond timestamp upper limit maxes out at 4/11/2262. The suggested solution was to fix the formatting of the dates, but in my case the date format is correct and the data itself is wrong.
Applying numpy logic:
df['Contract_Signed_Date'] = np.where(df['Contract_Signed_Date'] > '12/16/2017',
                                      df['Alt_Date'], df['Contract_Signed_Date'])
Essentially, if the file's 'Contract Signed Date' is greater than today (being 12/16/2017), I want to use the Alt_Date column instead. It seems to work, except that when it hits the year-3013 entry it errors out. What's a good pythonic way around the out-of-bounds error?
Perhaps hideously unpythonic, but it appears to do what you want.
Input, file arthur.csv:
input_date,var1,var2
3/30/3013,2,34
02/2/2017,17,35
Code:
import pandas as pd
from io import StringIO

target_date = '2017-12-17'
for_pandas = StringIO()
print('input_date,var1,var2,alt_date', file=for_pandas)  # new header
with open('arthur.csv') as arthur:
    next(arthur)  # skip header in csv
    for line in arthur:
        line_items = line.rstrip().split(',')
        date = '{:4s}-{:0>2s}-{:0>2s}'.format(*list(reversed(line_items[0].split('/'))))
        if date > target_date:
            output = '{},{},{},{}'.format(*['NaT', line_items[1], line_items[2], date])
        else:
            output = '{},{},{},{}'.format(*[date, line_items[1], line_items[2], 'NaT'])
        print(output, file=for_pandas)
for_pandas.seek(0)
df = pd.read_csv(for_pandas, parse_dates=['input_date', 'alt_date'])
print(df)
Output:
   input_date  var1  var2    alt_date
0         NaT     2    34  3013-30-03
1  2017-02-02    17    35         NaT
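For a more pandas-idiomatic alternative, one option (a sketch, with hypothetical sample data) is pd.to_datetime's errors='coerce', which turns out-of-bounds dates into NaT so they can be replaced from Alt_Date without hand-rolling a parser:
import pandas as pd

df = pd.DataFrame({
    'Contract_Signed_Date': ['3/30/3013', '02/2/2017'],  # hypothetical sample rows
    'Alt_Date': ['1/15/2017', '1/20/2017'],
})
signed = pd.to_datetime(df['Contract_Signed_Date'], errors='coerce')  # 3/30/3013 -> NaT
alt = pd.to_datetime(df['Alt_Date'], errors='coerce')
today = pd.Timestamp('2017-12-16')
# Use Alt_Date wherever the signed date is out of bounds (now NaT) or in the future
df['Contract_Signed_Date'] = signed.where(signed.notna() & (signed <= today), alt)
print(df)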

RuntimeWarning: divide by zero encountered in log when using pvlib

I'm using PVLib to model a PV system. I'm pretty new to coding and Python, and this is my first time using PVLib, so not surprisingly I've hit some difficulties.
Specifically, I've created the following code using the extensive readthedocs examples at http://pvlib-python.readthedocs.io/en/latest/index.html:
import pandas as pd
import numpy as np
from numpy import isnan
import datetime
import pytz
# pvlib imports
import pvlib
from pvlib.forecast import GFS, NAM, NDFD, HRRR, RAP
from pvlib.pvsystem import PVSystem, retrieve_sam
from pvlib.modelchain import ModelChain
# set location (Royal Greenwich Observatory, London, UK)
latitude, longitude, tz = 51.4769, 0.0005, 'Europe/London'
# specify time range.
start = pd.Timestamp(datetime.date.today(), tz=tz)
end = start + pd.Timedelta(days=5)
periods = 8 # number of periods that the GFS model and/or the model chain allows us to forecast power output.
# specify what irradiance variables we want
irrad_vars = ['ghi', 'dni', 'dhi']
# Use Global Forecast System model. The GFS is the US model that provides forecasts for the entire globe.
fx_model = GFS() # note: gives output in 3-hourly intervals
# retrieve data in processed format (convert temps from Kelvin to Celsius, combine elements of wind speed, complete irradiance data)
# Returns pandas.DataFrame object
fx_data = fx_model.get_processed_data(latitude, longitude, start, end)
# load module and inverter specifications
sandia_modules = pvlib.pvsystem.retrieve_sam('SandiaMod')
cec_inverters = pvlib.pvsystem.retrieve_sam('cecinverter')
module = sandia_modules['SolarWorld_Sunmodule_250_Poly__2013_']
inverter = cec_inverters['ABB__PVI_3_0_OUTD_S_US_Z_M_A__240_V__240V__CEC_2014_']
# model a fixed system in the UK. 10 strings of 250W panels, with 40 panels per string. Gives a nominal 100kW array
system = PVSystem(module_parameters=module, inverter_parameters=inverter, modules_per_string=40, strings_per_inverter=10)
# use a ModelChain object to calculate modelling intermediates
mc = ModelChain(system, fx_model.location, orientation_strategy='south_at_latitude_tilt')
# extract relevant data for model chain
mc.run_model(fx_data.index, weather=fx_data)
# OTHER CODE AFTER THIS TO DO SOMETHING WITH THE DATA
Having used a lot of print() statements in the console to debug, I can see that at the final line
mc.run_model(fx_data.index....
I get the following error:
/opt/pyenv/versions/3.6.0/lib/python3.6/site-packages/pvlib/pvsystem.py:1317:
RuntimeWarning: divide by zero encountered in log
module['Voco'] + module['Cells_in_Series']*delta*np.log(Ee) +
/opt/pyenv/versions/3.6.0/lib/python3.6/site-packages/pvlib/pvsystem.py:1323:
RuntimeWarning: divide by zero encountered in log
module['C3']*module['Cells_in_Series']*((delta*np.log(Ee)) ** 2) +
As a result, when I then go on to look at the ac_power outputs, I get what looks like erroneous data (every forecast hour that is not NaN shows 3000 W).
I'd really appreciate any help you can give as I don't know what's causing it. Maybe I'm specifying the system incorrectly?
Thanks, Matt
I think the warnings you're seeing are ok to ignore. A handful of pvlib algorithms spit out warnings due to things like 0 values at night.
I think your problem with the non-NaN values is unrelated to the warnings. Study the other modeling results (stored as mc attributes -- see documentation and source code) to see if you can track down the source of your problem.
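For example, a sketch of how to do that (assuming a pvlib version of this era, where ModelChain exposes intermediates such as effective_irradiance, dc, and ac as attributes):
import numpy as np

# Optionally silence the night-time log-of-zero warnings while the model runs
with np.errstate(divide='ignore'):
    mc.run_model(fx_data.index, weather=fx_data)

# Then inspect the intermediates to find where the 3000 W plateau first appears
print(mc.effective_irradiance.describe())  # irradiance reaching the cells
print(mc.dc.describe())                    # DC output from the Sandia module model
print(mc.ac.describe())                    # AC output after the inverter model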
