How to extract an array from a text file - python-3.x

I have a .txt file with thousands of tensors written inside. My problem is that they are all written in the following format (it is a string):
' tensor([ 9.8228e-01, -2.6578e-01, 9.6711e-01,........, -0.3274, -0.3205])'
How can I convert this into an array of floats? I am also having trouble handling the scientific-notation parts such as 'e-01'.
Thank you very much!

You could simply map() float over the strings obtained by splitting the substring between [ and ] on commas:
s = 'tensor([ 9.8228e-01, -2.6578e-01, 9.6711e-01, -0.3274, -0.3205])'
list(map(float, s[s.find('[') + 1:s.find(']')].split(',')))
# [0.98228, -0.26578, 0.96711, -0.3274, -0.3205]
or, to get a NumPy array:
import numpy as np
np.fromiter(map(float, s[s.find('[') + 1:s.find(']')].split(',')), dtype=float)
# array([ 0.98228, -0.26578, 0.96711, -0.3274 , -0.3205 ])
EDIT
NumPy offers a faster alternative with np.fromstring():
np.fromstring(s[s.find('[') + 1:s.find(']')], dtype=float, sep=',')
which is essentially an optimized version of the above. Note that you still need to strip the tensor( prefix and the brackets, since np.fromstring() stops at the first character it cannot parse as a number.
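Since the file contains thousands of these strings, the same parsing can be applied per line. A minimal sketch, assuming one tensor(...) string per line; the file name tensors.txt is hypothetical:
import numpy as np

arrays = []
with open('tensors.txt') as fh:    # hypothetical file name
    for line in fh:
        start, end = line.find('['), line.find(']')
        if start != -1 and end != -1:    # skip lines without a [...] payload
            arrays.append(np.fromstring(line[start + 1:end], dtype=float, sep=','))
If every tensor has the same length, np.vstack(arrays) then yields a single 2-D array.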

Related

sklearn dcg_score not working as expected

This is my code:
from sklearn.metrics import dcg_score
import numpy as np

true_relevance = np.asarray([[10]])
scores = np.asarray([[.1]])
dcg_score(true_relevance, scores)
The code above should produce 10 as the dcg_score: the formula from Wikipedia gives 10/log2(2) = 10. Instead, I get:
ValueError: Only ('multilabel-indicator', 'continuous-multioutput', 'multiclass-multioutput') formats are supported. Got binary instead
Did anyone encounter this?
Since computing DCG on a single element is not meaningful, the sklearn library requires at least two y_true and y_score elements in the corresponding arrays.
You can check this by exploring the sklearn code (or through debugging): https://github.com/scikit-learn/scikit-learn/blob/f3f51f9b611bf873bd5836748647221480071a87/sklearn/utils/multiclass.py#L158
For example:
true_relevance = np.asarray([[10, 5]])
scores = np.asarray([[.1, .2]])
dcg_score(true_relevance, scores)
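As a sanity check, here is a sketch (not part of the original answer) that reproduces the score by hand from the Wikipedia formula: relevances are sorted by predicted score and discounted by log2(rank + 1):
import numpy as np
from sklearn.metrics import dcg_score

true_relevance = np.asarray([[10, 5]])
scores = np.asarray([[.1, .2]])

order = np.argsort(scores[0])[::-1]       # rank items by predicted score, descending
ranks = np.arange(1, len(order) + 1)
manual = np.sum(true_relevance[0][order] / np.log2(ranks + 1))
print(manual)                             # ~11.31 = 5/log2(2) + 10/log2(3)
print(dcg_score(true_relevance, scores))  # matches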

Python 3.x - Why can I not convert this numpy array to a Pillow image?

The first group of code (under the import lines) simply opens a .png and shows it.
The second group of code creates a transparent 1 x 1 image (RGBA format) as a numpy array just like before, but although the types at play seem to be exactly the same, the last line fails to execute.
The error I'm getting is: "TypeError: Cannot handle this data type: (1, 1, 4)", and I have no idea why; when I print the arrays directly, they seem to be in identical formats. Thank you in advance to anyone willing to help.
from PIL import Image
import numpy as np
i = np.array(Image.open(r'folder\test.png'))
print(i)
Image.fromarray(i)
o = np.zeros((1, 1, 4))
print(o)
Image.fromarray(o)
It is about the dtype: np.zeros() defaults to float64, which Pillow cannot map to an image mode for a 4-channel array. You can use the following:
o = np.zeros((1, 1, 4), dtype=np.uint8)
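Putting it together, a minimal sketch of the fix; the rescaling variant at the end is an assumption for the case where the data must stay float in the 0..1 range:
import numpy as np
from PIL import Image

o = np.zeros((1, 1, 4), dtype=np.uint8)   # uint8 is what Pillow expects here
img = Image.fromarray(o)                  # mode is inferred as RGBA from uint8 x 4 channels
print(img.mode, img.size)                 # RGBA (1, 1)

# if the array must stay float (values in 0..1), rescale before converting:
f = np.zeros((1, 1, 4))
img2 = Image.fromarray((f * 255).astype(np.uint8))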

How to save list of extracted features in hdf5 file python

I am extracting some features from an audio file, saving them in a list, and then saving the list in an HDF5 file, but it causes an error. Previously I was saving the features directly to the HDF5 file, but it just overwrote all the values and saved only the last one.
ampList = []
mffcslist = []
centroidlist = []
i = 0
ampList.append(Xdb)  # saving extracted feature in a list
mffcslist.append(mfccs)
centroidlist.append(spectral_centroids)
with h5py.File('C:/Users/Aweem Ashar/Desktop/feature.h5', 'a') as f:
    f.close()
    for i in range(len(audio_path)):
        # print(ampList[i])
        f.create_dataset("amplitude", data=ampList[i])
        f.create_dataset("MffC", data=mffcslist[i])
        f.create_dataset("spectral", data=centroidlist[i])
        # plt.show()  # To view Wave graph
I didn't look at your code that closely when I wrote my comment. I just realized you are loading your list data one element at a time. There are much better/faster ways to do it with NumPy arrays. I don't know what kind of data you're working with, so I created a very simple example with a few floats in ampList. I use np.asarray() to convert the list to a NumPy array and load it into the dataset in one shot. Much easier and more compact. This method (with np.asarray()) will work for any list whose elements share a common type (all floats or all ints).
My simple example:
import h5py
import numpy as np

ampList = [20., 11., 33., 40., 100.]
with h5py.File('SO_58092765.h5', 'w') as h5f:
    h5f.create_dataset("amplitude", data=np.asarray(ampList))
A Better Approach:
The example above addresses your basic question (how to copy the list data into an HDF5 dataset). However, I think there is a better approach for your scenario. I assume you have amplitude, MffC, and spectral data for each and every audio file, AND it would be convenient to have that data associated with the audio file name. If so, that's where HDF5 and mixed-format datatypes are so powerful.
I created a second example (below) to show how you can save mixed data in a single dataset. I assumed the following datatypes (to make the example interesting):
Audio file name: String
amplitude: Float
MffC: Integer
Spectral (centroid): Float array of shape (3,)
This example creates 2 HDF5 files:
SO_58092765_3ds.h5: saves each List as a separate dataset.
SO_58092765_1ds.h5: saves all List data in a single dataset, with each List written to a separate Field/Column.
The second method uses a Numpy datatype (dtype) to define the name and datatype of each column of data in the HDF5 dataset. The dtype is then used to create an empty dataset. Each List is written to the dataset by referencing the field name.
Second example:
import h5py
import numpy as np

fileList = ['audio1.mp3', 'audio2.mp3', 'audio11.mp3', 'audio21.mp3', 'audio22.mp3']
ampList = [20., 11., 33., 40., 100.]
mffcslist = [12, 8, 9, 14, 33]
centroidlist = [(0., 0., 0.), (1., 0., 0.),
                (0., 1., 0.), (0., 1., 0.),
                (1., 1., 1.)]

# create SO_58092765_3ds.h5 (one dataset per list):
with h5py.File('SO_58092765_3ds.h5', 'w') as h5f:
    h5f.create_dataset("amplitude", data=np.asarray(ampList))
    h5f.create_dataset("MffC", data=np.asarray(mffcslist))
    h5f.create_dataset("spectral", data=np.asarray(centroidlist))

# create SO_58092765_1ds.h5 with a compound dtype (one field per list):
ds_dtype = np.dtype([("audiofile", 'S20'), ("amplitude", float),
                     ("MffC", int), ("spectral", float, (3,))])
with h5py.File('SO_58092765_1ds.h5', 'w') as h5f:
    ds = h5f.create_dataset("test_data", shape=(len(ampList),), dtype=ds_dtype)
    ds['audiofile'] = np.asarray(fileList)
    ds['amplitude'] = np.asarray(ampList)
    ds['MffC'] = np.asarray(mffcslist)
    ds['spectral'] = np.asarray(centroidlist)
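To confirm the layout, here is a short read-back sketch (not part of the original answer); note that the 'S20' strings come back as bytes:
import h5py

with h5py.File('SO_58092765_1ds.h5', 'r') as h5f:
    ds = h5f['test_data']
    print(ds.dtype.names)       # ('audiofile', 'amplitude', 'MffC', 'spectral')
    print(ds['audiofile'][0])   # b'audio1.mp3'
    print(ds['spectral'][:])    # 5 rows of 3 floats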
import matplotlib.pyplot as plt
from glob import glob
import librosa as lb
import sklearn
import librosa.display
import librosa
import h5py
import numpy as np

dir = 'C:\\Users\\Aweem Ashar\\Desktop\\recordingd'
audio_path = glob(dir + '/*.wav')
ampList = []
mffcslist = []
centroidlist = []
for file in range(0, len(audio_path)):
    x, sr = lb.load(audio_path[file])
    print(type(x), type(sr))
    librosa.display.waveplot(x, sr=sr)
    X = librosa.stft(x)
    Xdb = librosa.amplitude_to_db(abs(X))
    librosa.display.specshow(Xdb, sr=sr, x_axis='time', y_axis='hz')
    spectral_centroids = librosa.feature.spectral_centroid(x, sr=sr)[0]
    spectral_centroids.shape
    frames = range(len(spectral_centroids))
    t = librosa.frames_to_time(frames)

    def normalize(x, axis=0):
        return sklearn.preprocessing.minmax_scale(x, axis=axis)

    librosa.display.waveplot(x, sr=sr, alpha=0.4)
    plt.plot(t, normalize(spectral_centroids), color='r')
    mfccs = librosa.feature.mfcc(x, sr=sr)
    print(mfccs.shape)
    librosa.display.specshow(mfccs, sr=sr, x_axis='time')
    ampList.append(Xdb)
    mffcslist.append(mfccs)
    centroidlist.append(spectral_centroids)
with h5py.File('C:/Users/Aweem Ashar/Desktop/feature.h5', 'w') as f:
    f.create_dataset("amplitude", data=np.asarray(ampList))
    f.create_dataset("MFCC", data=np.asarray(mffcslist))
    f.create_dataset("SpectralCentroid", data=np.asarray(centroidlist))
Aweem, I'm not familiar with librosa or sklearn, so I can't debug all your code. When working with something new, use a minimal, complete, verifiable example (MCVE) to confirm behavior with simple data sets; it is much easier to diagnose.
To do that, I simplified the for loop in your second post, reorganizing and removing what I thought was unnecessary. Also, you don't need to loop over all the files: change the glob() call to get 1 (or a few) .wav files. The shape of Xdb (saved to ampList) should show why NumPy's asarray() tries to broadcast the shapes (and why you get an error). If not, post the output for review.
Finally, you should comment out the create_dataset("amplitude") call to verify that the other 2 create_dataset() calls work. Good luck.
dir = 'C:\\Users\\Aweem Ashar\\Desktop\\recordingd'
# change this to get 1 wav file:
audio_path = glob(dir + '/*.wav')
ampList = []
mffcslist = []
centroidlist = []
for file in range(0, len(audio_path)):
    x, sr = lb.load(audio_path[file])
    print(x, sr)
    X = librosa.stft(x)
    Xdb = librosa.amplitude_to_db(abs(X))
    print(Xdb.shape)
    ampList.append(Xdb)
    spectral_centroids = librosa.feature.spectral_centroid(x, sr=sr)[0]
    print(spectral_centroids.shape)
    centroidlist.append(spectral_centroids)
    mfccs = librosa.feature.mfcc(x, sr=sr)
    print(mfccs.shape)
    mffcslist.append(mfccs)
with h5py.File('C:/Users/Aweem Ashar/Desktop/feature.h5', 'w') as f:
    f.create_dataset("amplitude", data=np.asarray(ampList))
    f.create_dataset("MFCC", data=np.asarray(mffcslist))
    f.create_dataset("SpectralCentroid", data=np.asarray(centroidlist))

How to reverse one hot encoded value to Label?

I am working on a simple dataset to detect rock or mine, with class names 'R' and 'M'. I have encoded R to 1 and M to 0. Now I want to reverse it.
I have tried many ways but couldn't find an approach to convert 1 back to R and 0 back to M.
import numpy as np
import pandas as pd
import keras
from sklearn.preprocessing import LabelEncoder
df = pd.read_csv('D:\\Datasets\\node-fussy-examples-master\\node-fussy-examples-master\\sonar\\training.csv')
ds=df.values
x_train=df[df.columns[0:60]].values
y_train=df[df.columns[60]]
encoder = LabelEncoder()
encoder.fit(y_train)
encoded_Y = encoder.transform(y_train)
I expect 1 to be R and 0 to be M
You can use the inverse_transform method:
from sklearn import preprocessing

le = preprocessing.LabelEncoder()
le.fit([1, 2, 2, 6])
print(le.transform([1, 1, 2, 6]))          # [0 0 1 2]
print(le.inverse_transform([0, 0, 1, 2]))  # [1 1 2 6]
If you need to do the same thing in Tensorflow, look at this thread.
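Applied to the question's setup, a minimal sketch with stand-in labels (the same LabelEncoder pattern as the question's code):
from sklearn.preprocessing import LabelEncoder

y_train = ['R', 'M', 'M', 'R']                # stand-in for df[df.columns[60]]
encoder = LabelEncoder()
encoded_Y = encoder.fit_transform(y_train)
print(encoded_Y)                              # [1 0 0 1], i.e. 'M' -> 0, 'R' -> 1
print(encoder.inverse_transform(encoded_Y))   # ['R' 'M' 'M' 'R']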
I just came across a use case today where I needed to convert a one-hot-encoded tensor back to a normal label tensor. I know you can use np.argmax(probs, axis=1) or similar to reverse a one-hot-encoded probability tensor, but that didn't work in my case: my data was not a soft probability tensor but a label tensor filled with either 0 or 1. I know this is not entirely relevant to the OP's question, but I thought someone might need to do something similar, so I will write my solution down here.
def reverse_onehot(onehot_data):
    # onehot_data assumed to be channel-last
    data_copy = np.zeros(onehot_data.shape[:-1])
    for c in range(onehot_data.shape[-1]):
        img_c = onehot_data[..., c]
        data_copy[img_c == 1] = c
    return data_copy
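A quick check of the helper above, with stand-in data:
import numpy as np

onehot = np.zeros((2, 2, 3))    # 2 x 2 image, 3 classes, channel-last
onehot[0, 0, 2] = onehot[0, 1, 0] = 1
onehot[1, 0, 1] = onehot[1, 1, 1] = 1
print(reverse_onehot(onehot))   # [[2. 0.] [1. 1.]]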
Let's say y is your one-hot-encoded array. Then the following should give you the labels back:
unique_classes[np.argmax(y, axis=1)]
assuming you used unique_classes for encoding too (order is important).
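For example, a sketch with the two labels from the question:
import numpy as np

unique_classes = np.array(['M', 'R'])         # must match the order used when encoding
y = np.array([[0, 1], [1, 0], [0, 1]])        # one-hot rows
print(unique_classes[np.argmax(y, axis=1)])   # ['R' 'M' 'R']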

Delimit array with different strings

I have a text file that contains 3 columns of useful data that I would like to extract in Python using NumPy. The file type is *.nc and is NOT a netCDF4 file type; it is a standard output file type for CNC machines. In my case it is sort of a CMM (coordinate-measuring machine). The format goes something like this:
X0.8523542Y0.0000000Z0.5312869
The X,Y, and Z are the coordinate axes on the machine. My question is, can I delimit an array with multiple delimiters? In this case: "X","Y", and "Z".
You can use Pandas
import pandas as pd
from io import StringIO
#Create a mock file
ncfile = StringIO("""X0.8523542Y0.0000000Z0.5312869
X0.7523542Y1.0000000Z0.5312869
X0.6523542Y2.0000000Z0.5312869
X0.5523542Y3.0000000Z0.5312869""")
df = pd.read_csv(ncfile, header=None)
# Use regex with split to define delimiters as X, Y, Z.
df_out = df[0].str.split(r'X|Y|Z', expand=True)
df_out = df_out.set_axis(['index', 'X', 'Y', 'Z'], axis=1, inplace=False)
Output:
  index          X          Y          Z
0        0.8523542  0.0000000  0.5312869
1        0.7523542  1.0000000  0.5312869
2        0.6523542  2.0000000  0.5312869
3        0.5523542  3.0000000  0.5312869
I ended up using the Pandas solution provided by Scott. For a reason I am not 100% clear on, I could not simply convert the array from string to float with float(array). I created an array of equal size and iterated over it, converting each individual element to a float and saving it into the other array.
Thanks all
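For the record, the element-wise loop should not be necessary: pandas can cast the split string columns in one call. A sketch, using a stand-in for the df_out built above:
import pandas as pd

df_out = pd.DataFrame({'X': ['0.8523542'], 'Y': ['0.0000000'], 'Z': ['0.5312869']})
df_floats = df_out[['X', 'Y', 'Z']].astype(float)   # vectorized cast, no loop needed
print(df_floats.dtypes)                             # all float64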
Using the filter function that I suggested in a comment:
String sample (stand-in for the file):
In [1]: txt = '''X0.8523542Y0.0000000Z0.5312869
...: X0.8523542Y0.0000000Z0.5312869
...: X0.8523542Y0.0000000Z0.5312869
...: X0.8523542Y0.0000000Z0.5312869'''
Basic genfromtxt use - getting strings:
In [3]: np.genfromtxt(txt.splitlines(), dtype=None,encoding=None)
Out[3]:
array(['X0.8523542Y0.0000000Z0.5312869', 'X0.8523542Y0.0000000Z0.5312869',
'X0.8523542Y0.0000000Z0.5312869', 'X0.8523542Y0.0000000Z0.5312869'],
dtype='<U30')
This array of strings could be split in the same spirit as the pandas answer.
Define a function to replace the delimiter characters in a line:
In [6]: def foo(aline):
   ...:     return aline.replace('X','').replace('Y',',').replace('Z',',')
re could be used for a prettier split.
Test it:
In [7]: foo('X0.8523542Y0.0000000Z0.5312869')
Out[7]: '0.8523542,0.0000000,0.5312869'
Use it in genfromtxt:
In [9]: np.genfromtxt((foo(aline) for aline in txt.splitlines()), dtype=float,delimiter=',')
Out[9]:
array([[0.8523542, 0. , 0.5312869],
[0.8523542, 0. , 0.5312869],
[0.8523542, 0. , 0.5312869],
[0.8523542, 0. , 0.5312869]])
With a file instead, the generator would be something like:
(foo(aline) for aline in open(afile))
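And the re-based split hinted at above, as a rough sketch:
import re
import numpy as np

line = 'X0.8523542Y0.0000000Z0.5312869'
nums = re.findall(r'[-+]?\d+\.\d+', line)    # pull the three numbers directly
print(np.array(nums, dtype=float))           # [0.8523542 0.        0.5312869]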
