Risk scoring in python - python-3.x

I have metrics to detect fraud, say calls, transfer rate, aux time, and so on.
I have grouped them into bins based on quartiles and now I have to give ratings from 1 to 5 based on the bins. For example: calls <= 150 is assigned a rating of 1, calls > 150 and <= 300 a rating of 2, and so on. Likewise for all the metrics.
I tried the following:
np.where(x.Calls<=125.8,1,
np.where(x.Calls>=153.2 & x.Calls<=190.0,2,np.where(x.Calls>=190.0 & x.Calls<=235.0,3,np.where(x.Calls>=235.0 & x.Calls<=304.4,4,np.where(x.Calls>=304.4,5,0))))
Error:
File "<ipython-input-32-41fe2292e308>", line 2
np.where(x.Calls>=153.2 & x.Calls<=190.0,2,np.where(x.Calls>=190.0 &
x.Calls<=235.0,3,np.where(x.Calls>=235.0 &
x.Calls<=304.4,4,np.where(x.Calls>=304.4,5,0))))
^ SyntaxError: unexpected EOF while parsing
I want the code to take the ranges of values from the computed quartiles and assign the ratings on its own.

Your specific error indicates that you have left some parentheses open. (Note also that `&` binds more tightly than the comparison operators, so each comparison needs its own parentheses, e.g. (x.Calls >= 153.2) & (x.Calls <= 190.0).)
But you're getting this error because the nested np.where approach is really hard to implement correctly (and therefore to debug and maintain), so it's worth thinking about other ways.
The rules you want to implement aren't totally clear to me, but I think np.digitize might help you make progress. It 'quantizes' your data: you give it an array-like of bins, and it returns the bin each value of an array appears in. It works like this:
>>> import numpy as np
>>> a = np.array([55, 99, 65, 121, 189, 205, 211, 304, 999])
>>> bins = [100, 200, 300]
>>> np.digitize(a, bins=bins)
array([0, 0, 0, 1, 1, 2, 2, 3, 3])
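For example, to turn the quartile cut points from the question into 1-to-5 ratings for Calls, here is a minimal sketch; it assumes x is your DataFrame, that the undefined gap between 125.8 and 153.2 should fall into rating 1, and that boundary values go to the higher bin:
# Cut points taken from the question; bins 0..4 are shifted to ratings 1..5.
bins = [153.2, 190.0, 235.0, 304.4]
x['Calls_rating'] = np.digitize(x.Calls, bins=bins) + 1
The same pattern can be repeated for each metric with its own quartile boundaries.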

Related

Find start/stop location of sharp cumulative events

Here is an example set of data:
EDIT: Included some more data.
x = [0, 5, 6,15, 20, 40, 73,
100,101,102,103,104,105,106,108,111,115,
116,117,118,119,120,123,124,125,126,127,
128,129,130,131, 150,161,170, 183, 194,
210, 234, 257, 271,272,273,274, 275,276,
277,278,279,280,281,282,283,284,285,287,
288,291,292,293,294,295,296,297,298,300,301,
302,303,304,305,306,307,308,309,310,311,
340, 351, 358, 360, 380, 390, 400,401,
402,403, 404, 405, 408, 409, 413, 420,
425,426,427,428,429,430,431,432,433,434,435,
436, 440, 450, 455]
y = np.arange(1, len(x)+1)
Here is what the data looks like visually; each sharp increase could potentially be longer. The last sharp increase also contains a pause, but I would like it to still be considered one set of data. The black dots are the gradient.
I am attempting to find the start/end x-values for each sharp increase in cumulative counts. So the output should be an array of indexes, like what Riley has done.
A vectorized method would be ideal to help with time constraints and to go through the data quickly. Here is a rough outline of what has been done so far within a pandas DataFrame (a rough sketch of these steps follows below):
1. Shift the "x-data" and take a difference
2. Check whether sequential differences are below a threshold to create a logic array
3. Do a rolling sum on the logic array so consecutive Trues keep adding to the count
4. Find when the rolling sum exceeds another threshold
5. Compare with the previous value to ensure it is increasing/decreasing for start/stop times
6. Add those times to an index list
It seems a little finicky on some of the rolling averages and isn't as quick as I would like. Multiplying some of these large arrays with logic arrays seems to take a good amount of time.
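For reference, here is a rough sketch of the outlined steps in pandas; the threshold values are placeholders, not the ones actually used:
import numpy as np
import pandas as pd

df = pd.DataFrame({'x': x, 'y': y})
diff_threshold = 2       # placeholder: max allowed gap between consecutive x values
window = 8               # placeholder: rolling window length
count_threshold = 8      # placeholder: how many small gaps must accumulate

df['dx'] = df['x'].diff()                                   # step 1: shift the x-data and difference
df['small_gap'] = (df['dx'] <= diff_threshold).astype(int)  # step 2: logic array of small gaps
df['run'] = df['small_gap'].rolling(window).sum()           # step 3: rolling sum, so consecutive small gaps add up
df['dense'] = df['run'] >= count_threshold                  # step 4: rolling sum exceeds the second threshold
edges = df['dense'].astype(int).diff()                      # step 5: compare with the previous value
starts = df.index[edges == 1]                               # rising edges are start indices
stops = df.index[edges == -1]                               # falling edges are stop indices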
EDIT: Here is the code Riley has provided, and it offers an excellent start. It is also only a couple of lines of code, whereas my method above was almost 50 or so.
rate_threshold = 0.25
min_consecutive = 8
above_rate = np.gradient(y,x) >= rate_threshold
sequence_diff = np.diff(np.lib.stride_tricks.sliding_window_view(above_rate, window_shape=min_consecutive).all(axis=1).astype(int))
intervals = np.hstack([np.argwhere(sequence_diff==1) + 1, np.argwhere(sequence_diff==-1) + min_consecutive-1])
The new issue comes from the final sharp increase of data. Two sets of start/end points are returned, where the desired would just be one.
My initial thought is to include some kind of averaging routine within the sliding window to account for these drops in the gradient, so the end isn't so hard-set.
Not sure what your desired output would look like, so let's start with this, verify it does what you want it to, then go from there:
from numpy.lib.stride_tricks import sliding_window_view

rate_threshold = 1
min_consecutive = 5
above_rate = np.diff(y)/np.diff(x) >= rate_threshold
sequence_diff = np.diff(sliding_window_view(above_rate, window_shape=min_consecutive).all(axis=1).astype(int))
intervals = np.hstack([np.argwhere(sequence_diff==1) + 1, np.argwhere(sequence_diff==-1) + min_consecutive-1])
intervals is a 2D NumPy array of indices whose two columns are the first and last index of each sequence (of length min_consecutive) of rates above the threshold:
array([[ 7, 12],
[16, 20],
[22, 29],
[39, 52],
[56, 62],
[64, 74]], dtype=int64)
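If the last two detected intervals should really be one (the pause described in the edit), one possible follow-up, not part of the answer above, is to merge intervals whose gap is below a tolerance; max_gap here is a placeholder value:
def merge_intervals(intervals, max_gap=3):
    # Merge consecutive [start, stop] index pairs separated by at most max_gap rows.
    merged = [list(intervals[0])]
    for start, stop in intervals[1:]:
        if start - merged[-1][1] <= max_gap:
            merged[-1][1] = stop            # extend the previous interval over the pause
        else:
            merged.append([start, stop])
    return np.array(merged)

merged_intervals = merge_intervals(intervals, max_gap=3)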

How can I use broadcasting to code my program in one line?

I have code that works fine; however, the exercise is to write it in one line using broadcasting, and I've found it very complicated to do. This is the code:
import numpy as np
v1 = np.array([10, 20, 30, 40, 50])
v2 = np.array([0, 1, 2, 3 ])
matrix = []
for i in v1:
    matrix.append(i**v2)
matrixx = np.array(matrix).reshape([5,4])
print(matrixx)
Please some help!
You don't need to do anything special for broadcasting (it occurs automatically) in this case: once v1 is reshaped to (5, 1), its trailing length-1 dimension broadcasts against v2's shape (4,) to give a (5, 4) result.
You can get the same output without loop/comprehension:
print(v1.reshape(5,1)**v2)
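For reference, the equivalent one-liner with None/np.newaxis indexing, together with the output it produces:
import numpy as np

v1 = np.array([10, 20, 30, 40, 50])
v2 = np.array([0, 1, 2, 3])

# v1[:, None] has shape (5, 1); broadcasting against v2 (shape (4,)) gives shape (5, 4).
print(v1[:, None] ** v2)
# [[     1     10    100   1000]
#  [     1     20    400   8000]
#  [     1     30    900  27000]
#  [     1     40   1600  64000]
#  [     1     50   2500 125000]]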

Error when using an awkward array with an index array

I currently have a list of values and an awkward array of integer values. I want the same dimension awkward array, but where the values are the indices of the "values" arrays corresponding with the integer values of the awkward array. For instance:
values = ak.Array(np.random.rand(100))
arr = ak.Array((np.random.randint(0, 100, 33), np.random.randint(0, 100, 125)))
I want something like values[arr], but that gives the following error:
>>> values[arr]
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "C:\Anaconda3\lib\site-packages\awkward\highlevel.py", line 943, in __getitem__
return ak._util.wrap(self._layout[where], self._behavior)
ValueError: cannot fit jagged slice with length 2 into RegularArray of size 100
If I run it with a loop, I get back what I want:
>>> values = ([values[i] for i in arr])
>>> values
[<Array [0.842, 0.578, 0.159, ... 0.726, 0.702] type='33 * float64'>, <Array [0.509, 0.45, 0.202, ... 0.906, 0.367] type='125 * float64'>]
Is there another way to do this, or is this it? I'm afraid it'll be too slow for my application.
Thanks!
If you're trying to avoid Python for loops for performance, note that the first line casts a NumPy array as Awkward with ak.from_numpy (no loop, very fast):
>>> values = ak.Array(np.random.rand(100))
but the second line iterates over data in Python (has a slow loop):
>>> arr = ak.Array((np.random.randint(0, 100, 33), np.random.randint(0, 100, 125)))
because a tuple of two NumPy arrays is not a NumPy array. It's a generic iterable, and the constructor falls back to ak.from_iter.
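If building arr this way turns out to be a bottleneck, one way to avoid the from_iter fallback, sketched here under the assumption that ak.unflatten is available in your Awkward version, is to generate a single flat NumPy array and split it by counts:
flat = np.random.randint(0, 100, 33 + 125)   # one flat NumPy array, no Python loop
arr = ak.unflatten(flat, [33, 125])          # jagged array: sublists of length 33 and 125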
On your main question, the reason that arr doesn't slice values is because arr is a jagged array and values is not:
>>> values
<Array [0.272, 0.121, 0.167, ... 0.152, 0.514] type='100 * float64'>
>>> arr
<Array [[15, 24, 9, 42, ... 35, 75, 20, 10]] type='2 * var * int64'>
Note the types: values has type 100 * float64 and arr has type 2 * var * int64. There's no rule for values[arr].
Since it looks like you want to slice values with arr[0] and then arr[1] (from your list comprehension), it could be done in a vectorized way by duplicating values for each element of arr, then slicing.
>>> # The np.newaxis is to give values a length-1 dimension before concatenating.
>>> duplicated = ak.concatenate([values[np.newaxis]] * 2)
>>> duplicated
<Array [[0.272, 0.121, ... 0.152, 0.514]] type='2 * 100 * float64'>
Now duplicated has length 2 and one level of nesting, just like arr, so arr can slice it. The resulting array also has length 2, but the length of each sublist is the length of each sublist in arr, rather than 100.
>>> duplicated[arr]
<Array [[0.225, 0.812, ... 0.779, 0.665]] type='2 * var * float64'>
>>> ak.num(duplicated[arr])
<Array [33, 125] type='2 * int64'>
If you're scaling up from 2 such lists to a large number, then this would eat up a lot of memory. Then again, the size of the output of this operation would also scale as "length of values" × "length of arr". If this "2" is not going to scale up (if it will be at most thousands, not millions or more), then I wouldn't worry about the speed of the Python for loop. Python scales well for thousands, but not billions (depending, of course, on the size of the things being scaled!).

Optimization of predictions from sklearn model (e.g. RandomForestRegressor)

Has anyone used any optimization methods on fitted sklearn models?
What I'd like to do is fit a model on the training data and then, using this model, find the combination of inputs for which the model predicts the largest value.
Some example, simplified code:
import pandas as pd
df = pd.DataFrame({
'temperature': [10, 15, 30, 20, 25, 30],
'working_hours': [10, 12, 12, 10, 30, 15],
'sales': [4, 7, 6, 7.3, 10, 8]
})
from sklearn.ensemble import RandomForestRegressor
model = RandomForestRegressor()
X = df.drop(['sales'], axis=1)
y = df['sales']
model.fit(X, y);
Our baseline is a simple loop that predicts all combinations of the variables:
results = pd.DataFrame(columns=['temperature', 'working_hours', 'sales_predicted'])
import numpy as np
for temp in np.arange(1, 100.01, 1):
    for work_hours in np.arange(1, 60.01, 1):
        results = pd.concat([
            results,
            pd.DataFrame({
                'temperature': temp,
                'working_hours': work_hours,
                'sales_predicted': model.predict(np.array([temp, work_hours]).reshape(1, -1))
            })
        ])
print(results.sort_values(by='sales_predicted', ascending=False))
Using that approach it's difficult or impossible to:
* do it fast (brute-force method)
* implement constraints involving dependencies between two or more variables
We tried the PuLP and Pyomo libraries, but neither allows passing model.predict as an objective function; both return the error:
TypeError: float() argument must be a string or a number, not 'LpVariable'
Does anyone have an idea how we can get rid of the loop and use some other approach?
When people talk about optimizing fitted sklearn models, they usually mean maximizing accuracy/performance metrics. So if you are trying to maximize your predicted value, you can definitely improve your code to achieve it more efficiently, like below.
You are collecting all the predictions in a big results DataFrame and then sorting it to find the maximum. Instead, you can search for an increase in your target variable (sales_predicted) on the fly, using simple if logic. So just change your loop into this:
max_sales_predicted = 0
for temp in np.arange(1, 100.01, 1):
    for work_hours in np.arange(1, 60.01, 1):
        sales_predicted = model.predict(np.array([temp, work_hours]).reshape(1, -1))
        if sales_predicted > max_sales_predicted:
            max_sales_predicted = sales_predicted
            desired_temp = temp
            desired_work_hours = work_hours
This way you only keep a specification when it produces a prediction that exceeds your current maximum; otherwise, you do nothing.
The result of my code is the same as yours, i.e. a max_sales_predicted value of 9.2. Also, desired_temp and desired_work_hours now give you the specification that produces that maximum. Hope this helps.
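If the nested Python loop itself becomes the bottleneck, a further option, not part of the answer above but a common pattern, is to build the whole grid of candidate inputs once, call model.predict a single time, and take the argmax:
# Build every (temperature, working_hours) combination as one 2-column array.
temps = np.arange(1, 100.01, 1)
hours = np.arange(1, 60.01, 1)
grid = np.array(np.meshgrid(temps, hours)).T.reshape(-1, 2)

preds = model.predict(grid)        # one vectorized call instead of ~6000 single-row calls
best_temp, best_hours = grid[np.argmax(preds)]
print(best_temp, best_hours, preds.max())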

TypeError: 'Tensor' object is not iterable

In TensorFlow, I have a tensor with 512 rows and 2 columns. What I want to do is: filter column 2 of the tensor based on the unique values of column 1, and then for each unique value of column 1 process the corresponding values of column 2 in an inner loop.
So, as an example, I have a 2-dimensional tensor whose value (after evaluating it in a session) looks like the following:
[[ 509, 270],
[ 533, 568],
[ 472, 232],
...,
[ 6, 276],
[ 331, 165],
[ 401, 1144]]
509, 533, 472 ... are elements of column1 and 270, 568, 232,... are elements of column 2.
Is there a way that I can define the following 2 steps within the graph (not while executing the session)?
get unique values of column1
for each `unique_value` in column1:
    values_in_column2 = values in column2 corresponding to `unique_value` (filter column2 by `unique_value`)
    some_function(values_in_column2)
I can do the above steps while running the session, but I would like to define them in the graph, which I can then run in a session after defining many subsequent steps.
Is there any way to do this? Appreciate any kind of help in this regard.
Here is pseudo code for what I want to do.
tensor1 = tf.stack([column1, column2], axis = 1)
column1 = tensor1[:, 0]   # column 1 is the first column, i.e. index 0 along axis 1
unique_column1, unique_column1_indexes = tf.unique(column1)
for unique_column1_value in unique_column1:
    column1_2_indexes = tf.where(column1 == unique_column1_value)
    corresponding_column2_values = tensor1[column1_2_indexes][:, 1]
But as of now it gives an error:
TypeError: 'Tensor' object is not iterable.
at the following line:
for unique_column1_value in unique_column1.
I have followed this question: "TypeError: 'Tensor' object is not iterable" error with tensorflow Estimator which does not apply to me.
I understand that I need to use while_loop but I don't know how.
Regards,
Sumit
Updated: there is a solution for the case when column1 is sorted here. Note that there is also a feature request for the more general version, but it was closed for inactivity. The sorted-version solution looks like this:
column1 = tf.constant([1,2,2,2,3,3,4])
column2 = tf.constant([5,6,7,8,9,10,11])
tensor1 = tf.stack([column1, column2], axis = 1)
unique_column1, unique_column1_indices, counts = tf.unique_with_counts(column1)
unique_ix = tf.cumsum(tf.pad(counts,[[1,0]]))[:-1]
output = tf.gather(tensor1, unique_ix)
which outputs: [[1, 5], [2, 6], [3, 9], [4, 11]]
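If some_function is a reduction such as a sum or mean per group, one graph-level way to process all of column2 for each unique value of a sorted column1, offered as a sketch rather than part of the solution above, is a segment op:
unique_column1, segment_ids = tf.unique(column1)
per_group_sum = tf.math.segment_sum(column2, segment_ids)   # one result per unique column1 value
# For the example columns above this yields [5, 21, 19, 11].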
