cython implementation of groupby failing with NameError - python-3.x

I am attempting to speed up dozens of calls I make to pandas groupby using Cython-optimised functions. These include straight groupby, groupby with ranking and others. I have one that does a groupby. The cells run in my notebook, but when the function is called I get a NameError.
Here is the test code from my notebook (in 3 cells there):
%%cython
def _set_indices(keys_as_int, n_keys):
    import numpy
    cdef int i, j, k
    cdef object[:, :] indices = [[i for i in range(0)] for _ in range(n_keys)]
    for j, k in enumerate(keys_as_int):
        indices[k].append(j)
    return [([numpy.array(elt) for elt in indices])]
def group_by(keys):
    _, first_occurrences, keys_as_int = np.unique(keys, return_index=True, return_inverse=True)
    n_keys = max(keys_as_int) + 1
    indices = [[] for _ in range(max(keys_as_int) + 1)]
    print(str(keys_as_int) + str(n_keys) + str(indices))
    indices = _set_indices(keys_as_int, n_keys)
    return indices
%%timeit
result = group_by(['1', '2', '3', '1', '3'])
print(str(result))
The error I get is:
<ipython-input-20-3f8635aec47f> in group_by(keys)
4 indices = [[] for _ in range(max(keys_as_int) + 1)]
5 print(str(keys_as_int) + str(n_keys) + str(indices))
----> 6 indices = _set_indices(keys_as_int, n_keys)
7 return indices
NameError: name '_set_indices' is not defined
Can someone explain whether this is due to the notebook or whether I have done something wrong in the way I use Cython? I am new to it.
Also, any hints for getting a strongly typed solution with minimum cache hits are most welcome.

You need to put your _set_indices function in the same cell, or you need to explicitly import it. From the Compiling with a Jupyter Notebook documentation:
"Note that each cell will be compiled into a separate extension module."
After compilation, you do have a global name _set_indices, but that doesn't make it available as a global in the separate extension module for the group_by() function.
You'll need to put the two function definitions into the same cell, or create a separate module for the utility functions.
Note that there is also another issue with the code; you can't just create a typed memory view from a list of lists:
Traceback (most recent call last):
  File "so58378716.pyx", line 22, in init so58378716
    result = group_by(['1', '2', '3', '1', '3'])
  File "so58378716.pyx", line 19, in so58378716.group_by
    indices = _set_indices(keys_as_int, n_keys)
  File "so58378716.pyx", line 6, in so58378716._set_indices
    cdef object[:, :] indices = [[i for i in range(0)] for _ in range(n_keys)]
  File "stringsource", line 654, in View.MemoryView.memoryview_cwrapper
  File "stringsource", line 349, in View.MemoryView.memoryview.__cinit__
TypeError: a bytes-like object is required, not 'list'
You'd have to create an actual numpy array, or use a cython.view.array object, or an array.array.
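For the grouping itself, a typed memoryview of Python lists may not be needed at all. As a minimal sketch, and as a plain NumPy alternative rather than a fix of the Cython code above, the index lists can be built with np.unique, a stable argsort and np.split. The name group_by is reused from the question; the assumption here is that the function should return one array of positions per distinct key.

```python
import numpy as np


def group_by(keys):
    """Return one array of positions per distinct key, in sorted key order."""
    keys = np.asarray(keys)
    # keys_as_int[i] is the index of keys[i] among the sorted unique keys.
    _, keys_as_int = np.unique(keys, return_inverse=True)
    # A stable sort keeps positions within each group in their original order.
    order = np.argsort(keys_as_int, kind="stable")
    # Group sizes give the split points between consecutive groups.
    counts = np.bincount(keys_as_int)
    return np.split(order, np.cumsum(counts)[:-1])


groups = group_by(['1', '2', '3', '1', '3'])
print([g.tolist() for g in groups])  # [[0, 3], [1], [2, 4]]
```

Because everything stays inside vectorised NumPy calls, this version has no Python-level loop to compile, so it is worth timing before investing in a typed Cython variant.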

Related

Sympy Call symbols created using a range

I'm trying to create symbols for a DH table, but rather than write them out by hand, I want to create them in a function. However, I don't know how to call the variable when making the table. Here is a synopsis of the problem:
from sympy import *
def naming_symbols(N):
    theta = symbols(f"theta:{N}")
    L = symbols(f"L:{N}")
    alpha = symbols(f"alpha:{N}")
    d = symbols(f"d:{N}")
    pprint(theta[:])
    pprint(L[:])
    pprint(alpha[:])
    pprint(d[:])
    return theta, L, alpha, d
naming_symbols(3)
print(theta2)
returns:
"*FileName*", line 18, in <module>
print(theta2)
NameError: name 'theta2' is not defined
(θ₀, θ₁, θ₂)
(L₀, L₁, L₂)
(α₀, α₁, α₂)
(d₀, d₁, d₂)
Process finished with exit code 1
This is the same for "theta_2" and "theta"
How do I call the created symbols? As in, I want to put "theta2" in the table, but it doesn't recognize it as a created symbol. I think I need to add the symbols into a dictionary or something, but don't know how to do that either. I thought the creation would add it to the dictionary... but... well, please help.
There is a difference between the Symbol (a python object that SymPy creates) and the variable that you assign it to. You already know that you can call a value like 1 anything you want:
>>> x = 1
>>> y = 1
The same is true for a Symbol that you create.
>>> my_x = Symbol('x'); my_x
x
The convention is often to use a variable name that matches the Symbol name, but this is not necessary. Notice that printing my_x (the variable) shows x (the Symbol).
The symbols command creates a tuple of Symbols. You can call that tuple anything you want, just like you can with numerical values:
>>> v = (1, 2); v[0]
1
>>> my_v = symbols('v:3'); my_v[0]
v0
Your function is creating tuples of Symbols. You are assigning (and returning) those tuples from the function. In order to use those locally outside of the function you have to create python variable names for the elements of the tuples or you can access them by index just like for the tuple v defined above:
>>> one, two = v # assigning names for the tuple elements
>>> one
1
>>> v[1] # using index to the tuple name `v`
2
>>> naming_symbols(2)
(θ₀, θ₁)
(L₀, L₁)
(α₀, α₁)
(d₀, d₁)
((theta0, theta1), (L0, L1), (alpha0, alpha1), (d0, d1))
>>> t, l, a, d = _ # assigning names to the tuples
>>> t[0]
theta0
>>> t0,t1 = t # assigning names to the elements of a tuple
>>> t0
theta0
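Applied to the function from the question, the same idea can be sketched as a script. The pprint calls are dropped for brevity, and symbol_table is a hypothetical helper dictionary (not part of SymPy) that maps each symbol's name to the symbol, which is one way to get the "dictionary" lookup the question asks about.

```python
from sympy import symbols


def naming_symbols(n):
    """Create n symbols each for theta, L, alpha and d and return the tuples."""
    theta = symbols(f"theta:{n}")
    L = symbols(f"L:{n}")
    alpha = symbols(f"alpha:{n}")
    d = symbols(f"d:{n}")
    return theta, L, alpha, d


# Bind the returned tuples to names in the calling scope.
theta, L, alpha, d = naming_symbols(3)

# Option 1: index into the tuple wherever the symbol is needed.
theta2 = theta[2]

# Option 2: look symbols up by name (symbol_table is a hypothetical helper).
symbol_table = {s.name: s for group in (theta, L, alpha, d) for s in group}

print(theta2, symbol_table["alpha1"])  # theta2 alpha1
```

Either form can then be used directly when filling in the DH table, for example theta[2] or symbol_table["theta2"].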
___ probably stop here, but note...
It is possible to have the variable names created for you by using the var command instead of the symbols command. It is probably better for you to learn how to do the above, but this is an example of using var:
>>> def do(n):
...     var(f'y:{n}')
...
Until this function is run, y0 is not defined
>>> y0
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
NameError: name 'y0' is not defined
But after running it with n=3, y0, y1 and y2 will exist in Python's global namespace (var injects the names there, even when it is called inside a function):
>>> do(3)
>>> y0
y0
>>> y1
y1
Note that the naming convention of calling the SymPy symbol by the matching variable name is used. This only works if you create Symbol names that are valid Python variable names. So although var('x(1)') will create a Symbol with the name x(1), you cannot type this to use the variable:
>>> var('x(1)')
x(1)
>>> x(1)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
TypeError: 'Symbol' object is not callable
>>> y=_ # assign the name y to this Symbol
>>> y
x(1)

Cannot create a numpy array using numpy's `full()` method and a python list

I can create a numpy array from a python list as follows:
>>> a = [1,2,3]
>>> b = np.array(a).reshape(3,1)
>>> print(b)
[[1]
 [2]
 [3]]
However, I don't know what causes error in the following code:
Code :
>>> a = [1,2,3]
>>> b = np.full((3,1), a)
Error :
ValueError Traceback (most recent call last)
<ipython-input-275-1ab6c109dda4> in <module>()
1 a = [1,2,3]
----> 2 b = np.full((3,1), a)
3 print(b)
/usr/local/lib/python3.6/dist-packages/numpy/core/numeric.py in full(shape, fill_value, dtype, order)
324 dtype = array(fill_value).dtype
325 a = empty(shape, dtype, order)
--> 326 multiarray.copyto(a, fill_value, casting='unsafe')
327 return a
328
<__array_function__ internals> in copyto(*args, **kwargs)
ValueError: could not broadcast input array from shape (3) into shape (3,1)
Even though the list a has 3 elements and I expect a 3x1 numpy array, the full() method fails to deliver it.
I also referred to NumPy's broadcasting article. However, it is focused on arithmetic operations, so I couldn't obtain anything useful from it.
It would be great if you could help me understand the difference between the two array creation methods above and the cause of the error.
NumPy is unable to broadcast the two shapes together because your list is interpreted as a 1-D 'row vector' (np.array(a).shape == (3,)) while you are asking for a 'column vector' (shape == (3, 1)). Broadcasting aligns shapes starting from the last axis, so the 3 elements of a would have to fit into an axis of length 1. If you are set on using np.full, then you can shape your list as a column vector initially:
>>> import numpy as np
>>>
>>> a = [[1],[2],[3]]
>>> b = np.full((3,1), a)
Another option is to convert a into a numpy array ahead of time and add a new axis to match the desired output shape.
>>> a = [1,2,3]
>>> a = np.array(a)[:, np.newaxis]
>>> b = np.full((3,1), a)
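The broadcasting behaviour described above can be checked with a short script. This is only an illustrative sketch; the variable names rows, failed, column and wide are made up for the example.

```python
import numpy as np

a = [1, 2, 3]

# Shape (3,) lines up with the LAST axis of the target, so this works.
rows = np.full((2, 3), a)

# The same list cannot fill (3, 1): 3 values would have to fit an axis of length 1.
try:
    np.full((3, 1), a)
    failed = False
except ValueError:
    failed = True

# A (3, 1) fill value matches (3, 1) and also stretches across wider targets.
column = np.array(a)[:, np.newaxis]
wide = np.full((3, 4), column)

print(rows.tolist())     # [[1, 2, 3], [1, 2, 3]]
print(failed)            # True
print(wide[2].tolist())  # [3, 3, 3, 3]
```

In other words, np.full repeats the fill value along axes where it has length 1 (or is missing), but it never reshapes it.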

how to solve this error with lambda and sorted method when i try to make sentiment analysis (POS or NEG text)?

Input code:
best = sorted(word_scores.items(), key=lambda w, s: s, reverse=True)[:10000]
Result:
Traceback (most recent call last):
File "C:\Users\Sarah\Desktop\python\test.py", line 78, in <module>
best = sorted(word_scores.items(), key=lambda w, s: s, reverse=True)[:10000]
TypeError: <lambda>() missing 1 required positional argument: 's'
How do I solve it?
If I've understood the format of your word_scores dictionary correctly (that the keys are words and the values are integers representing scores), and you're simply looking to get an ordered list of words with the highest scores, it's as simple as this:
best = sorted(word_scores, key=word_scores.get, reverse=True)[:10000]
If you want to use a lambda to get an ordered list of tuples, where each tuple is a word and a score, and they are ordered by score, you can do the following:
best = sorted(word_scores.items(), key=lambda x: x[1], reverse=True)[:10000]
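The difference between this and your original attempt is that I have passed one argument (x) to the lambda, and x is a tuple of length 2: x[0] is the word and x[1] is the score. sorted calls the key function with exactly one argument per item, which is why a lambda taking two parameters (w, s) fails. Since we want to sort by score, we use x[1].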
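Both variants can be tried on a small dictionary. The scores below are invented purely for illustration, and operator.itemgetter is shown as a standard-library alternative to the lambda, not as something the original code used.

```python
from operator import itemgetter

# Hypothetical scores, only for illustration.
word_scores = {"good": 3.2, "bad": 2.9, "the": 0.1, "awful": 4.0}

# Words only, highest score first.
top_words = sorted(word_scores, key=word_scores.get, reverse=True)[:2]

# (word, score) pairs; the key function receives each pair as ONE argument.
top_pairs = sorted(word_scores.items(), key=lambda pair: pair[1], reverse=True)[:2]

# itemgetter(1) is an equivalent key function from the standard library.
top_pairs_alt = sorted(word_scores.items(), key=itemgetter(1), reverse=True)[:2]

print(top_words)  # ['awful', 'good']
print(top_pairs)  # [('awful', 4.0), ('good', 3.2)]
```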

slicing error in numpy array

I am trying to run the following code
fs = 1000
data = np.loadtxt("trainingdataset.txt", delimiter=",")
data1 = data[:,2]
data2 = data1.astype(int)
X,Y = data2['521']
but it gets me the following error
Traceback (most recent call last):
File "C:\Users\hadeer.elziaat\Desktop\testspec.py", line 58, in <module>
X,Y = data2['521']
IndexError: only integers, slices (`:`), ellipsis (`...`), numpy.newaxis (`None`) and integer or boolean arrays are valid indices
my dataset
1,4,6,10
2,100,125,10
3,100,7216,254
4,100,527,263
5,100,954,13
6,100,954,23
You're using the string '521' rather than the number 521 for indexing. Try X,Y = data2[521] instead.
If you are only given the string, you could cast it to an int first: X,Y = data2[int('521')], but int() raises a ValueError if the string is not a valid integer.
Next problem: you are asking for two variables, one for X and one for Y, yet the data2[521] selection only provides you with a single value (the number in the 3rd column, 522nd row).
You say you want all the data in the 3rd column.
I assume you also want some kind of x-axis, since you are attempting to do X, Y = .... How about using the first column for that? Then your code would be:
import numpy as np
data = np.loadtxt("trainingdataset.txt", delimiter=',', dtype='int')
x = data[:, 0]
y = data[:, 2]
What remains unclear from your question is why you tried to index your data with '521', which failed because you cannot use strings as indices on plain arrays.
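To try the column selection without the file, the six rows shown in the question can be loaded from memory. This is a sketch under that assumption; io.StringIO stands in for trainingdataset.txt, and the row number "3" is an arbitrary example of an index that arrives as a string.

```python
import io

import numpy as np

# The six rows from the question, held in memory instead of trainingdataset.txt.
sample = io.StringIO(
    "1,4,6,10\n"
    "2,100,125,10\n"
    "3,100,7216,254\n"
    "4,100,527,263\n"
    "5,100,954,13\n"
    "6,100,954,23\n"
)
data = np.loadtxt(sample, delimiter=",", dtype=int)

x = data[:, 0]  # first column
y = data[:, 2]  # third column

# A row number that arrives as a string must be converted before indexing.
row = int("3")
print(x[row], y[row])  # 4 527
```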

How to make a tuple including a numpy array hashable?

One way to make a numpy array hashable is setting it to read-only. This has worked for me in the past. But when I use such a numpy array in a tuple, the whole tuple is no longer hashable, which I do not understand. Here is the sample code I put together to illustrate the problem:
import numpy as np
npArray = np.ones((1,1))
npArray.flags.writeable = False
print(npArray.flags.writeable)
keySet = (0, npArray)
print(keySet[1].flags.writeable)
myDict = {keySet : 1}
First I create a simple numpy array and set it to read-only. Then I add it to a tuple and check if it is still read-only (which it is).
When I want to use the tuple as key in a dictionary, I get the error TypeError: unhashable type: 'numpy.ndarray'.
Here is the output of my sample code:
False
False
Traceback (most recent call last):
File "test.py", line 10, in <module>
myDict = {keySet : 1}
TypeError: unhashable type: 'numpy.ndarray'
What can I do to make my tuple hashable and why does Python show this behavior in the first place?
You claim that "One way to make a numpy array hashable is setting it to read-only", but that's not actually true. Setting an array to read-only just makes it read-only. It doesn't make the array hashable, for multiple reasons.
The first reason is that an array with the writeable flag set to False is still mutable. First, you can always set writeable=True again and resume writing to it, or do more exotic things like reassign its shape even while writeable is False. Second, even without touching the array itself, you could mutate its data through another view that has writeable=True.
>>> x = numpy.arange(5)
>>> y = x[:]
>>> x.flags.writeable = False
>>> x
array([0, 1, 2, 3, 4])
>>> y[0] = 5
>>> x
array([5, 1, 2, 3, 4])
Second, for hashability to be meaningful, objects must first be equatable: == must return a boolean, and must be an equivalence relation. NumPy arrays don't do that, because == on arrays compares element by element and returns an array of booleans. The purpose of hash values is to quickly locate equal objects, but when your objects don't even have a built-in notion of equality, there's not much point to providing hashes.
You're not going to get hashable tuples with arrays inside. You're not even going to get hashable arrays. The closest you can get is to put some other representation of the array's data in the tuple.
The fastest way to hash a numpy array is likely to hash its tobytes() output (tostring() is a deprecated alias of the same method).
In [11]: %timeit hash(y.tobytes())
What you could do is rather than use a tuple define a class:
class KeySet(object):
    def __init__(self, i, arr):
        self.i = i
        self.arr = arr
    def __hash__(self):
        # Hash the integer together with the raw bytes of the array.
        return hash((self.i, hash(self.arr.tobytes())))
Now you can use it in a dict:
In [21]: ks = KeySet(0, npArray)
In [22]: myDict = {ks: 1}
In [23]: myDict[ks]
Out[23]: 1
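Note that KeySet as written keeps Python's default identity-based equality, so a second KeySet built from an equal array would not find the same dictionary entry. If lookup by content is wanted, one option is to follow the earlier advice and put another representation of the array's data into a plain tuple. The following is a minimal sketch of that idea; array_key is a hypothetical helper name, and the key is a snapshot, so later changes to the array do not change a key that was already built.

```python
import numpy as np


def array_key(i, arr):
    """Build a hashable key from an integer and a snapshot of an array."""
    arr = np.asarray(arr)
    # Shape and dtype are included because different arrays can share raw bytes.
    return (i, arr.shape, arr.dtype.str, arr.tobytes())


first = np.ones((1, 1))
second = np.ones((1, 1))  # separate object, same contents

my_dict = {array_key(0, first): 1}
print(my_dict[array_key(0, second)])  # 1
```

Since the key is an ordinary tuple of an int, a tuple and two immutable strings of characters or bytes, both hashing and equality come from the built-in types and stay consistent with each other.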
