I have been reading from the FER2013 CSV file, which contains three columns: emotion, pixel values, and usage. The pixel values are given in string format, and I want to rescale them from 0-255 to 0-1, so I first need to convert them to int/float before I can do any mathematical operations on them.
I first read the CSV file using the pandas read_csv function and then used iloc to read the pixel values into a variable called x_tr. Printing it shows its dtype as object, which also confuses me. If x_tr is a numpy ndarray, how should I convert it to numeric values?
I tried x_tr.astype(np.float), but it raised the error mentioned below.
x_tr = train.iloc[:,1].values
x_tr
What I tried in order to convert it to float:
x_tr = train.iloc[:,1].values
x_tr = x_tr.astype(np.float)
and this is the error I got:
Please help.
Don't convert your pixels into an array manually; instead, treat each one as a simple string and use the numpy.fromstring() method. Here's an example for reference.
>>> s = '1 2 3 4'
>>> f = np.fromstring(s, dtype=float, sep=' ')
>>> f
array([1., 2., 3., 4.])
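Applied to the question's dataframe, a minimal sketch (assuming the file is named fer2013.csv and the pixel strings sit in the second column, as described) could look like this:
import numpy as np
import pandas as pd

train = pd.read_csv('fer2013.csv')   # columns: emotion, pixels, usage
x_tr = train.iloc[:, 1].values       # dtype=object: an array of strings

# Parse each string into floats, stack into a 2-d array,
# then rescale from 0-255 to 0-1.
x_tr = np.stack([np.fromstring(s, dtype=float, sep=' ') for s in x_tr])
x_tr = x_tr / 255.0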
I have a banded sparse square matrix, A, of type <class 'scipy.sparse.csr.csr_matrix'> and size 400 x 400. I'd like to split it into square blocks of size 200 x 200 each, for instance:
block1 = A[0:200, 0:200]
block2 = A[100:300, 100:300]
block3 = A[200:400, 200:400]
The same information about the slices is stored in a list of tuples.
[(0,200), (100, 300), (200, 400)]
Suggestions on how to split the sparse square matrix would be really helpful.
You can convert to a regular array and then split it:
from scipy.sparse import csr_matrix
import numpy as np
row = np.arange(400)[::2]
col = np.arange(400)[1::2]
data = np.random.randint(1, 10, (200))
compressed_matrix = csr_matrix((data, (row, col)), shape=(400, 400))
# Convert to a regular array
m = compressed_matrix.toarray()
# Split the matrix
sl = [(0,200), (100, 300), (200, 400)]
blocks = [m[i, i] for i in map(lambda x: slice(*x), sl)]
And if you want you can convert back each block to a compressed matrix:
blocks_csr = list(map(csr_matrix, blocks))
CODE EXPLANATION
The creation of the blocks is based on a list comprehension and basic slicing.
Each input tuple is converted to a slice object, only to create a series of row and column indexes corresponding to the elements to be selected; in this answer, this is sufficient to select the requested square block matrix. Slice objects are generated when extended indexing syntax is used: to be clear, a[start:stop:step] creates a slice object equivalent to slice(start, stop, step). In our case, they are used to dynamically change the indexes to be selected, according to the block we want to extract. So, if you consider the first block, m[i, i] is equivalent to m[0:200, 0:200].
Slice objects are a form of basic indexing, so a view of the original array is created rather than a copy (this means that if you modify the view, the original array will be modified as well; you can easily create a copy of the original data using the copy method of the numpy array).
The map object is used to generate slice objects from the input tuples; map applies the function provided as its first argument to all the elements of its second argument.
lambda is used to create an anonymous function, i.e., a function defined without a name. Anonymous functions are useful for small, specific tasks that you do not want to code as a standard function, either because you are not going to reuse them or because you need them only briefly, as in this code. They make the code more compact than defining the corresponding named functions would.
*x is called unpacking, i.e., you extract (unpack) the elements from the tuple. Suppose you have a function f and a tuple a = (1, 2, 3); then f(*a) is equivalent to f(1, 2, 3) (you can think of unpacking as removing a level of parentheses).
So, looking back at the code:
blocks = [                        # this is a list comprehension
    m[i, i]                       # basic slicing of the input array
    for i in map(                 # map applies a function to every item of an iterable
        lambda x: slice(*x), sl   # create a slice object from each index-range tuple
    )
]
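As a hedged side note: scipy's csr_matrix supports the same basic slicing directly, so the blocks can also be taken without the dense round trip, keeping each block sparse.
# Slice the csr_matrix directly; each block remains a sparse matrix.
blocks_sparse = [compressed_matrix[slice(*t), slice(*t)] for t in sl]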
I am working on a new routine inside some OOP-based code and encountered a problem while modifying a data array (a short example of the code is below).
Basically, this routine takes the array R, transposes it, sorts it, and then filters out the data below the pre-determined value of thres. Then I re-transpose this array back into its original dimensions and plot each of its rows against the first element of T.
import numpy as np
import matplotlib.pyplot as plt
R = np.random.rand(3,8)
R = R.transpose() # transpose the random matrix
R = R[R[:,0].argsort()] # sort this matrix
print(R)
T = ([i for i in np.arange(1,9,1.0)],"temps (min)")
thres = float(input("Define the threshold of coherence: "))
if thres >= 0.0 and thres <= 1.0 :
    R = R[R[:, 0] >= thres] # how to filter unwanted values? changing to NaN / zeros ?
else :
    print("The coherence value is absurd or you're not giving a number!")
print("The final results are ")
print(R)
print(R.transpose())
R.transpose() # re-transpose this matrix
ax = plt.subplot2grid( (4,1),(0,0) )
ax.plot(T[0],R[0])
ax.set_ylabel('Coherence')
ax = plt.subplot2grid( (4,1),(1,0) )
ax.plot(T[0],R[1],'.')
ax.set_ylabel('Back-azimuth')
ax = plt.subplot2grid( (4,1),(2,0) )
ax.plot(T[0],R[2],'.')
ax.set_ylabel('Velocity\nkm/s')
ax.set_xlabel('Time (min)')
However, I encounter an error
ValueError: x and y must have same first dimension, but have shapes (8,) and (3,)
I commented the part where I think the problem might reside (how to filter unwanted values?), but the question remains.
How can I plot these two arrays (R and T) while still filtering out unwanted values below thres? Can I transform those unwanted values to zero or NaN and then successfully plot them? If yes, how can I do that?
Your help would be much appreciated.
With the help of a techie friend, the problem was simply resolved by keeping this part
R = R[R[:, 0] >= thres]
because removing unwanted elements is preferable to changing them to NaN or zero. The plotting problem is then fixed by a slight modification to this part
ax.plot(T[0][:len(R[0])],R[0])
and likewise for the subsequent plotting calls. This slices T down to the same length as R.
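For completeness, a minimal sketch of the NaN alternative the question raised (names here are illustrative): matplotlib skips NaN points, so blanking out whole rows of the pre-transpose (8, 3) array keeps every row of the transposed result at 8 entries, matching T[0].
mask = R[:, 0] >= thres                      # rows at or above the threshold
R_nan = np.where(mask[:, None], R, np.nan)   # NaN out the rejected rows
# After R_nan.transpose(), each row still has 8 entries, matching T[0].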
import numpy as np
import torch
a = torch.zeros(5)
b = torch.tensor(tuple((0,1,0,1,0)),dtype=torch.uint8)
c= torch.tensor([7.,9.])
print(a[b].size())
a[b]=c
print(a)
torch.Size([2])
tensor([0., 7., 0., 9., 0.])
I am struggling to understand how this works. I initially thought the above code was using fancy indexing, but I realised that values from the c tensor are copied to the positions where b is marked 1. Also, if I don't specify the dtype of b as uint8, the above code does not work. Can someone please explain the mechanism of the above code?
Indexing with arrays works the same as in numpy and most other vectorized math packages I am aware of. There are two cases:
When b is of type uint8 (think boolean, pytorch doesn't distinguish bool from uint8), a[b] is a 1-d array containing the subset of values of a (a[i]) for which the corresponding entry in b (b[i]) was nonzero. Assigning through such an index, as in a[b] = c, writes into the corresponding locations of the original a (note that merely reading a[b] returns a copy, not a view).
The alternative type you can use for indexing is an array of int64, in which case a[b] creates an array of shape (*b.shape, *a.shape[1:]). Its structure is as if each element of b (b[i]) was replaced by a[b[i]]. In other words, you create a new array by specifying from which indexes of a the data should be fetched. Again, assignment through the index writes back into a: a[b] = c changes a[b[i]] for each i. An example use case is shown in this question.
These two modes are explained for numpy under integer array indexing and boolean array indexing; for the latter, keep in mind that pytorch uses uint8 in place of bool.
Also, if your goal is to copy data from one tensor to another, keep in mind that an operation like a[ixs] = b[ixs] is an in-place operation (a is modified in place), which may not play well with autograd. If you want to do out-of-place masking, use torch.where. An example use case is shown in this answer.
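A small sketch illustrating both modes plus the torch.where alternative (newer PyTorch versions accept torch.bool masks; the question's uint8 masks behave the same way):
import torch

a = torch.zeros(5)
mask = torch.tensor([False, True, False, True, False])  # boolean mask

print(a[mask].shape)               # torch.Size([2]): two True entries in the mask
a[mask] = torch.tensor([7., 9.])   # in-place assignment into a
print(a)                           # tensor([0., 7., 0., 9., 0.])

idx = torch.tensor([1, 3])         # integer (int64) indexing
print(a[idx])                      # tensor([7., 9.]): values gathered by position

b = torch.full((5,), -1.0)
print(torch.where(mask, b, a))     # out-of-place: take b where mask is True, else a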
I'm reading a bin list from a config file, and it is being read as a str.
I want to convert the str to a list so that I can use it in the binning function.
Here is an example
import numpy as np
import pandas as pd
raw_data = {'student':['A','B','C'],'marks_maths':[75,90,99]}
df = pd.DataFrame(raw_data, columns = ['student','marks_maths'])
bins = str([0,50,75,np.inf])
groups = ['L','M','H']
df['maths_level'] = pd.cut(df['marks_maths'], bins, labels=groups)
I get errors indicating
ValueError('bins must increase monotonically.')
IndexError: list assignment index out of range
From help(pd.cut), it looks like it expects bins to be an int or a sequence of scalars, not a string:
cut(x, bins, right=True, labels=None, retbins=False, precision=3, include_lowest=False)
Return indices of half-open bins to which each value of `x` belongs.
Parameters
----------
x : array-like
Input array to be binned. It has to be 1-dimensional.
bins : int or sequence of scalars
If `bins` is an int, it defines the number of equal-width bins in the
range of `x`. However, in this case, the range of `x` is extended
by .1% on each side to include the min or max values of `x`. If
`bins` is a sequence it defines the bin edges allowing for
non-uniform bin width. No extension of the range of `x` is done in
this case.
try this:
bins = [0,50,75,np.inf]
not
bins = str([0,50,75,np.inf])
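Since the value arrives from the config file as a string, here is a minimal sketch of parsing it back into a numeric list (the comma-separated config value below is hypothetical; float('inf') yields the same value as np.inf):
import numpy as np
import pandas as pd

raw_data = {'student': ['A', 'B', 'C'], 'marks_maths': [75, 90, 99]}
df = pd.DataFrame(raw_data, columns=['student', 'marks_maths'])

bins_str = '0, 50, 75, inf'                     # hypothetical config value
bins = [float(v) for v in bins_str.split(',')]  # [0.0, 50.0, 75.0, inf]
groups = ['L', 'M', 'H']
df['maths_level'] = pd.cut(df['marks_maths'], bins, labels=groups)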
Question: is my method of converting a numpy array of numbers to a numpy array of strings, with a specific number of decimal places AND trailing zeros removed, the 'best' way?
import numpy as np
x = np.array([1.12345, 1.2, 0.1, 0, 1.230000])
print(np.core.defchararray.rstrip(np.char.mod('%.4f', x), '0'))
outputs:
['1.1235' '1.2' '0.1' '0.' '1.23']
which is the desired result. (I am OK with the rounding issue)
Both 'rstrip' and 'mod' are numpy functions, which means this is fast, but is there a way to accomplish this with ONE built-in numpy function? (i.e., does 'mod' have an option that I couldn't find?) It would save the overhead of returning two copies, which for very large arrays is slow-ish.
thanks!
Thanks to Warren Weckesser for providing valuable comments. Credit to him.
I converted my code to use:
formatter = '%d'
if num_type == 'float':
    formatter = '%%.%df' % decimals
np.savetxt(out, arr, fmt=formatter)
where out is a file handle to which I had already written my headers. Alternatively, I could also use the header= argument in np.savetxt. I have no clue how I didn't see those options in the documentation.
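For instance, a minimal sketch of that alternative (the filename and header string are hypothetical):
# header= writes a line above the data; comments='' drops the default '# ' prefix.
np.savetxt('out.txt', arr, fmt='%.4f', header='col1 col2 col3', comments='')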
For a 1300 by 1300 numpy array, creating the output line by line as I did before (using np.core.defchararray.rstrip(np.char.mod('%.4f', x), '0')) took ~1.7 seconds, while np.savetxt takes 0.48 seconds.
So np.savetxt is a cleaner, more readable, and faster solution.
Note:
I did try:
np.savetxt(out, arr, fmt='%.4g')
in an effort to avoid a switch based on number type, but it did not work as I had hoped.
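A likely reason, sketched briefly: %g counts significant digits rather than decimal places, and it switches to scientific notation for large magnitudes.
print('%.4g' % 1.12345)   # '1.123'     -> 4 significant digits, not 4 decimals
print('%.4g' % 123456.0)  # '1.235e+05' -> falls back to scientific notation
print('%.4f' % 1.12345)   # '1.1235'    -> 4 decimal places, as wanted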