Delimit array with different strings - python-3.x

I have a text file that contains 3 columns of useful data that I would like to be able to extract in python using numpy. The file type is a *.nc and is NOT a netCDF4 filetype. It is a standard file output type for CNC machines. In my case it is sort of a CMM (coordinate measurement machine). The format goes something like this:
X0.8523542Y0.0000000Z0.5312869
The X,Y, and Z are the coordinate axes on the machine. My question is, can I delimit an array with multiple delimiters? In this case: "X","Y", and "Z".

You can use Pandas
import pandas as pd
from io import StringIO
#Create a mock file
ncfile = StringIO("""X0.8523542Y0.0000000Z0.5312869
X0.7523542Y1.0000000Z0.5312869
X0.6523542Y2.0000000Z0.5312869
X0.5523542Y3.0000000Z0.5312869""")
df = pd.read_csv(ncfile,header=None)
#Use regex with split to define delimiters as X, Y, Z.
df_out = df[0].str.split(r'X|Y|Z', expand=True)
df_out = df_out.set_axis(['index','X','Y','Z'], axis=1)
Output:
  index          X          Y          Z
0        0.8523542  0.0000000  0.5312869
1        0.7523542  1.0000000  0.5312869
2        0.6523542  2.0000000  0.5312869
3        0.5523542  3.0000000  0.5312869

I ended up using the Pandas solution provided by Scott. For a reason I was not 100% clear on at the time (float() only accepts a single value, not a whole array), I could not simply convert the array from string to float with float(array). I created an array of equal size and iterated over it, converting each individual element to a float and saving it into the new array.
Thanks all
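For reference, the element-by-element loop isn't needed: the string columns can be cast in one step. A minimal sketch, assuming df_out and the column names from Scott's answer above:
import numpy as np
# cast the three coordinate columns from strings to floats in one step
coords = df_out[['X', 'Y', 'Z']].astype(float).to_numpy()
# coords is now an (n_rows, 3) float array ready for further numpy work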

Using the filter function that I suggested in a comment:
String sample (stand-in for a file):
In [1]: txt = '''X0.8523542Y0.0000000Z0.5312869
...: X0.8523542Y0.0000000Z0.5312869
...: X0.8523542Y0.0000000Z0.5312869
...: X0.8523542Y0.0000000Z0.5312869'''
Basic genfromtxt use - getting strings:
In [3]: np.genfromtxt(txt.splitlines(), dtype=None,encoding=None)
Out[3]:
array(['X0.8523542Y0.0000000Z0.5312869', 'X0.8523542Y0.0000000Z0.5312869',
'X0.8523542Y0.0000000Z0.5312869', 'X0.8523542Y0.0000000Z0.5312869'],
dtype='<U30')
This array of strings could be split in the same spirit as the pandas answer.
Define a function to replace the delimiter characters in a line:
In [6]: def foo(aline):
...: return aline.replace('X','').replace('Y',',').replace('Z',',')
re could be used for a prettier split.
Test it:
In [7]: foo('X0.8523542Y0.0000000Z0.5312869')
Out[7]: '0.8523542,0.0000000,0.5312869'
Use it in genfromtxt:
In [9]: np.genfromtxt((foo(aline) for aline in txt.splitlines()), dtype=float,delimiter=',')
Out[9]:
array([[0.8523542, 0. , 0.5312869],
[0.8523542, 0. , 0.5312869],
[0.8523542, 0. , 0.5312869],
[0.8523542, 0. , 0.5312869]])
With a file instead, the generator would be something like:
(foo(aline) for aline in open(afile))
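Put together, reading straight from a file might look like the sketch below (the filename part.nc is just a placeholder):
import numpy as np

def foo(aline):
    # strip the leading X, turn Y and Z into comma delimiters
    return aline.replace('X', '').replace('Y', ',').replace('Z', ',')

with open('part.nc') as fh:
    data = np.genfromtxt((foo(aline) for aline in fh), dtype=float, delimiter=',')
# data has one row per line and one column each for X, Y and Z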

Related

How can I interpolate a numpy array so that it becomes a certain length?

I have three numpy arrays each with different lengths:
A.shape = (3401,)
B.shape = (2200,)
C.shape = (4103,)
I would like to average the three arrays to produce a new array with size of the largest array (in this case C):
D.shape = (4103,)
Problem is, I don't think I can do this without adding "fake" data to A and B, by interpolation.
How can I perform interpolation on the first two numpy arrays so that they are of the same length as array C?
Do I even need to interpolate here?
First thing that comes to mind is zoom from scipy:
The array is zoomed using spline interpolation of the requested order.
Code:
import numpy as np
from scipy.ndimage import zoom
A = np.random.rand(3401)
B = np.random.rand(2200)
C = np.ones(4103)
for arr in [A, B]:
    zoom_rate = C.shape[0] / arr.shape[0]
    arr = zoom(arr, zoom_rate)  # note: this rebinds arr; collect the results in a list if you need them later
    print(arr.shape)
Output:
(4103,)
(4103,)
I think the simplest option is to do the following:
D = np.concatenate([np.average([A[:2200], B, C[:2200]], axis=0),
                    np.average([A[2200:3401], C[2200:3401]], axis=0),
                    C[3401:]])
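If you'd rather avoid scipy, np.interp does the same stretching with plain linear interpolation. A minimal sketch (the array contents here are random placeholders):
import numpy as np

A = np.random.rand(3401)
B = np.random.rand(2200)
C = np.random.rand(4103)

# resample A and B onto a grid of C's length, then average all three
grid = np.linspace(0, 1, C.shape[0])
A_i = np.interp(grid, np.linspace(0, 1, A.shape[0]), A)
B_i = np.interp(grid, np.linspace(0, 1, B.shape[0]), B)

D = np.mean([A_i, B_i, C], axis=0)
print(D.shape)  # (4103,)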

Why does removing a comma in this source file result in a scalar dataset in the HDF5 file?

f = h5.File("image_data.h5", 'w')
f["horizontal_min"] = horizontal_min,
f["horizontal_max"] = horizontal_max,
f["vertical_min"] = vertical_min,
f["vertical_max"] = vertical_max,
results in
horizontal_max Dataset {1}
horizontal_min Dataset {1}
vertical_max Dataset {1}
vertical_min Dataset {1}
But if the comma at the end of each line is removed, as in:
f = h5.File("image_data.h5", 'w')
f["horizontal_min"] = horizontal_min
f["horizontal_max"] = horizontal_max
f["vertical_min"] = vertical_min
f["vertical_max"] = vertical_max
I get the following (output from h5ls):
horizontal_max Dataset {SCALAR}
horizontal_min Dataset {SCALAR}
vertical_max Dataset {SCALAR}
vertical_min Dataset {SCALAR}
Note that the dataset changed from Dataset {1} to {SCALAR}. The comma does not seem to change the type, as shown below:
In [3]: type(5.0,)
Out[3]: float
In [4]: type(5.0)
Out[4]: float
Why is this change happening?
In an interactive ipython session:
In [66]: 1,
Out[66]: (1,)
In [67]: 1
Out[67]: 1
With the comma, the value is a tuple, which h5py will save as a 1d array. Without the comma it's a scalar.
In Python the comma is part of the tuple syntax; it matters more than the (). It isn't a superfluous line ending. (; is the optional line ender.)
The correct way to do your last test:
In [71]: x=1,
In [72]: type(x)
Out[72]: tuple
In [73]: x=1
In [74]: type(x)
Out[74]: int
In type(1,) the comma is part of the arguments tuple. type takes 1 or 3 arguments.
type(object_or_name, bases, dict)
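To see the effect in isolation, a small sketch (the file name comma_demo.h5 is arbitrary):
import h5py as h5

with h5.File("comma_demo.h5", "w") as f:
    f["as_tuple"] = 5.0,    # trailing comma -> tuple -> 1-d dataset of shape (1,)
    f["as_scalar"] = 5.0    # plain float -> scalar dataset of shape ()

with h5.File("comma_demo.h5", "r") as f:
    print(f["as_tuple"].shape)   # (1,)
    print(f["as_scalar"].shape)  # ()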

How to extract an array from a text file

I have a .txt file with thousands of tensors written inside. My problem is that they are all written in the following format (it is a string):
' tensor([ 9.8228e-01, -2.6578e-01, 9.6711e-01,........, -0.3274, -0.3205])'
How can I convert this into an array of floats? I have problems with handling the 'e-01' parts as well.
Thank you very much!
You could just map() float over the strings obtained by splitting on , the substring between [ and ]:
s = 'tensor([ 9.8228e-01, -2.6578e-01, 9.6711e-01, -0.3274, -0.3205])'
list(map(float, s[s.find('[') + 1:s.find(']')].split(',')))
# [0.98228, -0.26578, 0.96711, -0.3274, -0.3205]
or to get to a NumPy array:
import numpy as np
np.fromiter(map(float, s[s.find('[') + 1:s.find(']')].split(',')), dtype=float)
# array([ 0.98228, -0.26578, 0.96711, -0.3274 , -0.3205 ])
EDIT
NumPy offers a faster alternative using np.fromstring():
np.fromstring(s[s.find('['):], dtype=float, sep=', ')
which is essentially an optimized version of the above. Note that you still need to slice off the leading tensor( part of the string, as done here.
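Applied to a whole file, a small helper could look like this (the file name tensors.txt and the one-tensor-per-line layout are assumptions):
import numpy as np

def parse_tensor_line(line):
    # keep only the comma-separated numbers between the brackets
    inner = line[line.find('[') + 1:line.find(']')]
    return np.fromstring(inner, dtype=float, sep=',')

with open('tensors.txt') as fh:
    arrays = [parse_tensor_line(line) for line in fh if '[' in line]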

Plotting a chart in which the Y data is text and the X data is numeric, from a dictionary. Matplotlib & Python 3 [duplicate]

I can create a simple bar chart in matplotlib from the 'simple' dictionary:
import matplotlib.pyplot as plt
D = {u'Label1':26, u'Label2': 17, u'Label3':30}
plt.bar(range(len(D)), D.values(), align='center')
plt.xticks(range(len(D)), D.keys())
plt.show()
But how do I create a line plot from the text and numeric data of this dictionary?
T_OLD = {'10': 'need1', '11': 'need2', '12': 'need1', '13': 'need2', '14': 'need1'}
Like the picture below
You may use numpy to convert the dictionary to an array with two columns, which can be plotted.
import matplotlib.pyplot as plt
import numpy as np
T_OLD = {'10' : 'need1', '11':'need2', '12':'need1', '13':'need2','14':'need1'}
x = list(zip(*T_OLD.items()))
# sort array, since dictionary is unsorted
x = np.array(x)[:,np.argsort(x[0])].T
# mark the second column: 1 if "need2", else 0
x[:,1] = (x[:,1] == "need2").astype(int)
# plot the two columns of the array
plt.plot(x[:,0], x[:,1])
# set the labels accordingly
plt.gca().set_yticks([0,1])
plt.gca().set_yticklabels(['need1', 'need2'])
plt.show()
The following would be a version which is independent of the actual content of the dictionary; the only assumption is that the keys can be converted to floats.
import matplotlib.pyplot as plt
import numpy as np
T_OLD = {'10': 'run', '11': 'tea', '12': 'mathematics', '13': 'run', '14' :'chemistry'}
x = np.array(list(zip(*T_OLD.items())))
u, ind = np.unique(x[1,:], return_inverse=True)
x[1,:] = ind
x = x.astype(float)[:,np.argsort(x[0])].T
# plot the two columns of the array
plt.plot(x[:,0], x[:,1])
# set the labels accordingly
plt.gca().set_yticks(range(len(u)))
plt.gca().set_yticklabels(u)
plt.show()
Use numeric values for your y-axis ticks, and then map them to desired strings with plt.yticks():
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
# example data
times = pd.date_range(start='2017-10-17 00:00', end='2017-10-17 5:00', freq='H')
data = np.random.choice([0,1], size=len(times))
data_labels = ['need1','need2']
fig, ax = plt.subplots()
ax.plot(times, data, marker='o', linestyle="None")
plt.yticks(data, data_labels)
plt.xlabel("time")
Note: It's generally not a good idea to use a line graph to represent categorical changes in time (e.g. from need1 to need2). Doing that gives the visual impression of a continuum between time points, which may not be accurate. Here, I changed the plotting style to points instead of lines. If for some reason you need the lines, just remove linestyle="None" from the call to plt.plot().
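If you do want connected lines without implying a smooth transition between categories, a step plot is one option. A sketch not taken from the original answer, reusing the same kind of example data:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

times = pd.date_range(start='2017-10-17 00:00', end='2017-10-17 5:00', freq='H')
data = np.random.choice([0, 1], size=len(times))

fig, ax = plt.subplots()
# steps-post holds each category value until the next time point
ax.plot(times, data, drawstyle="steps-post")
plt.yticks([0, 1], ['need1', 'need2'])
plt.xlabel("time")
plt.show()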
UPDATE
(per comments)
To make this work with a y-axis category set of arbitrary length, use ax.set_yticks() and ax.set_yticklabels() to map to y-axis values.
For example, given a set of potential y-axis values called labels, let N be the size of a subset of labels (here we'll set it to 4, but it could be any size).
Then draw a random sample data of y values and plot against time, labeling the y-axis ticks based on the full set labels. Note that we still use set_yticks() first with numerical markers, and then replace with our category labels with set_yticklabels().
labels = np.array(['A','B','C','D','E','F','G'])
N = 4
# example data
times = pd.date_range(start='2017-10-17 00:00', end='2017-10-17 5:00', freq='H')
data = np.random.choice(np.arange(len(labels)), size=len(times))
fig, ax = plt.subplots(figsize=(15,10))
ax.plot(times, data, marker='o', linestyle="None")
ax.set_yticks(np.arange(len(labels)))
ax.set_yticklabels(labels)
plt.xlabel("time")
This gives the exact desired plot:
import matplotlib.pyplot as plt
from collections import OrderedDict
T_OLD = {'10' : 'need1', '11':'need2', '12':'need1', '13':'need2','14':'need1'}
T_SRT = OrderedDict(sorted(T_OLD.items(), key=lambda t: t[0]))
plt.plot(map(int, T_SRT.keys()), map(lambda x: int(x[-1]), T_SRT.values()),'r')
plt.ylim([0.9,2.1])
ax = plt.gca()
ax.set_yticks([1,2])
ax.set_yticklabels(['need1', 'need2'])
plt.title('T_OLD')
plt.xlabel('time')
plt.ylabel('need')
plt.show()
For Python 3.x the plotting line needs to explicitly convert the map() output to lists:
plt.plot(list(map(int, T_SRT.keys())), list(map(lambda x: int(x[-1]), T_SRT.values())),'r')
as in Python 3.X map() returns an iterator as opposed to a list in Python 2.7.
The plot uses the dictionary keys converted to ints and the last character of need1 or need2, also converted to an int. This relies on the particular structure of your data; if the values were need1 and need3 it would need a couple more operations.
After plotting and changing the axes limits, the program simply modifies the tick labels at y positions 1 and 2. It then also adds the title and the x and y axis labels.
The important part is that the dictionary/input data has to be sorted. One way to do this is to use OrderedDict. Here T_SRT is an OrderedDict object sorted by the keys of T_OLD.
The output is:
This is a more general case for more values/labels in T_OLD. It assumes that the label is always 'needX' where X is any number. This can readily be extended to the general case of any string preceding the number, though it would require more processing.
import matplotlib.pyplot as plt
from collections import OrderedDict
import re
T_OLD = {'10' : 'need1', '11':'need8', '12':'need11', '13':'need1','14':'need3'}
T_SRT = OrderedDict(sorted(T_OLD.items(), key=lambda t: t[0]))
x_val = list(map(int, T_SRT.keys()))
y_val = list(map(lambda x: int(re.findall(r'\d+', x)[-1]), T_SRT.values()))
plt.plot(x_val, y_val,'r')
plt.ylim([0.9*min(y_val),1.1*max(y_val)])
ax = plt.gca()
y_axis = list(set(y_val))
ax.set_yticks(y_axis)
ax.set_yticklabels(['need' + str(i) for i in y_axis])
plt.title('T_OLD')
plt.xlabel('time')
plt.ylabel('need')
plt.show()
This solution finds the number at the end of the label using re.findall to accommodate the possibility of multi-digit numbers. The previous solution just took the last character of the string because the numbers were single digit. It still assumes that the number used for the plotting position is the last number in the string, hence the [-1]. Again, for Python 3.x the map() output is explicitly converted to a list, a step that is not necessary in Python 2.7.
The labels are now generated by first selecting unique y-values using set and then renaming their labels through concatenation of the strings 'need' with its corresponding integer.
The limits of y-axis are set as 0.9 of the minimum value and 1.1 of the maximum value. Rest of the formatting is as before.
The result for this test case is:
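A side note on the OrderedDict sorting used in both snippets: sorting the keys as strings only works while they all have the same number of digits; '9' would sort after '10'. A hedged variant sorts by the numeric value of the key instead:
from collections import OrderedDict

T_OLD = {'9': 'need2', '10': 'need1', '11': 'need2'}
# sort by the integer value of the key, not by its string form
T_SRT = OrderedDict(sorted(T_OLD.items(), key=lambda t: int(t[0])))
print(list(T_SRT.keys()))  # ['9', '10', '11']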

Insert field into structured array at a specific column index

I'm currently using np.loadtxt to load some mixed data into a structured numpy array. I do some calculations on a few of the columns to output later. For compatibility reasons I need to maintain a specific output format so I'd like to insert those columns at specific points and use np.savetxt to export the array in one shot.
A simple setup:
import numpy as np
x = np.zeros((2,),dtype=('i4,f4,a10'))
x[:] = [(1,2.,'Hello'),(2,3.,'World')]
newcol = ['abc','def']
For this example I'd like to make newcol the 2nd column. I'm very new to Python (coming from MATLAB). From my searching all I've been able to find so far are ways to append newcol to the end of x to make it the last column, or x to newcol to make it the first column. I also turned up np.insert but it doesn't seem to work on a structured array because it's technically a 1D array (from my understanding).
What's the most efficient way to accomplish this?
EDIT1:
I investigated np.savetxt a little further and it seems like it can't be used with a structured array, so I'm assuming I would need to loop through and write each row with f.write. I could specify each column explicitly (by field name) with that approach and not have to worry about the order in my structured array, but that doesn't seem like a very generic solution.
For the above example my desired output would be:
1, abc, 2.0, Hello
2, def, 3.0, World
This is a way to add a field to the array, at the position you require:
from numpy import zeros, empty

def insert_dtype(x, position, new_dtype, new_column):
    if x.dtype.fields is None:
        raise ValueError("`x' must be a structured numpy array")
    new_desc = x.dtype.descr
    new_desc.insert(position, new_dtype)
    y = empty(x.shape, dtype=new_desc)
    for name in x.dtype.names:
        y[name] = x[name]
    y[new_dtype[0]] = new_column
    return y

x = zeros((2,), dtype='i4,f4,a10')
x[:] = [(1, 2., 'Hello'), (2, 3., 'World')]
new_dt = ('my_alphabet', '|S3')
new_col = ['abc', 'def']
x = insert_dtype(x, 1, new_dt, new_col)
Now x looks like
array([(1, 'abc', 2.0, 'Hello'), (2, 'def', 3.0, 'World')],
dtype=[('f0', '<i4'), ('my_alphabet', 'S3'), ('f1', '<f4'), ('f2', 'S10')])
The solution is adapted from here.
To print the recarray to file, you could use something like:
from matplotlib.mlab import rec2csv
rec2csv(x,'foo.txt')
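Note that matplotlib.mlab.rec2csv has since been removed from matplotlib. As an alternative sketch that needs no extra dependency, each row of the structured array can be formatted by hand (the byte-string fields are decoded so they print cleanly):
def fmt(value):
    # decode bytes fields (the 'a10'/'S3' columns) for clean text output
    return value.decode() if isinstance(value, bytes) else str(value)

with open('foo.txt', 'w') as fh:
    for row in x:
        fh.write(', '.join(fmt(row[name]) for name in x.dtype.names) + '\n')
# foo.txt now contains:
# 1, abc, 2.0, Hello
# 2, def, 3.0, World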
