How to NaN elements in a numpy array based on an upper and lower boundary - python-3.x

I have a numpy array with the elements 0-10.
a = np.arange(0,11)
np.random.shuffle(a)
a
array([ 1, 7, 8, 0, 2, 3, 4, 10, 9, 5, 6])
I wanted to convert elements to NaN if they are between 4 and 8.
As a first step I tried getting the array with np.where like below:
np.where([a > 4] & [a < 8])
but got an error. Any help, please?

You need:
import numpy as np
a = np.arange(0,11)
np.random.shuffle(a)
print(a)
# [ 7 4 2 3 6 10 1 9 5 0 8]
a = np.where((a > 4) & (a < 8), np.nan, a)
print(a)
# [nan 4. 2. 3. nan 10. 1. 9. nan 0. 8.]
If you want to know how np.where() works, refer to numpy.where(): detailed, step-by-step explanation / examples.
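For completeness, here is a short sketch of why the original attempt fails, and of an in-place alternative to np.where (the astype(float) step is an addition here, needed because an integer array cannot hold NaN):

```python
import numpy as np

a = np.arange(0, 11)

# The original expression wraps each comparison in a Python list,
# so `&` is applied to two lists, which raises a TypeError.
try:
    np.where([a > 4] & [a < 8])
except TypeError as e:
    print("TypeError:", e)

# Alternative: assign NaN through a boolean mask.
# Cast to float first, since an integer array cannot hold NaN.
b = a.astype(float)
b[(b > 4) & (b < 8)] = np.nan
print(b)  # elements 5, 6, 7 become nan
```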

Related

Pandas qcut ValueError: Input array must be 1 dimensional

I was trying to categorize my values into 10 bins and I met with this error. How can I avoid this error and bin them smoothly?
Attached are samples of the data and code.
Data
JPM
2008-01-02 NaN
2008-01-03 NaN
2008-01-04 NaN
2008-01-07 NaN
2008-01-08 NaN
... ...
2009-12-24 -0.054014
2009-12-28 0.002679
2009-12-29 -0.030015
2009-12-30 -0.019058
2009-12-31 -0.010090
505 rows × 1 columns
Code
group_names = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
discretized_roc = pd.qcut(df, 10, labels=group_names)
Pass the column JPM, and use labels=False to get only the integer indicators of the bins:
discretized_roc = pd.qcut(df['JPM'], 10, labels=False)
If you need to select the first column by position instead of by label, use DataFrame.iloc:
discretized_roc = pd.qcut(df.iloc[:, 0], 10, labels=False)
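As a runnable sketch with synthetic stand-in data (the random returns below are only a placeholder for the OP's JPM column, matching its shape and leading NaNs):

```python
import numpy as np
import pandas as pd

# Placeholder frame mimicking the question's data: 505 rows, one
# column named 'JPM', with a few leading NaNs.
rng = np.random.default_rng(0)
df = pd.DataFrame({"JPM": rng.normal(size=505)})
df.iloc[:5, 0] = np.nan

# Passing the whole frame raises "Input array must be 1 dimensional";
# passing the column works, and labels=False yields integer bin codes.
discretized_roc = pd.qcut(df["JPM"], 10, labels=False)
print(discretized_roc.nunique())     # 10 bins
print(discretized_roc.isna().sum())  # NaN rows stay NaN: 5
```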

Pandas create a new column based on existing column with chain rule

I have below code
import pandas as pd
import numpy as np
df = pd.DataFrame({"A":[12, 4, 5, 3, 1],"B":[7, 2, 54, 3, None],"C":[20, 16, 11, 3, 8],"D":[14, 3, None, 2, 6]})
df['A1'] = np.where(df['A'] > 10, 10, np.where(df['A'] < 3, 3, df['A']))
While this is okay, I want to create the final dataframe (i.e. the 2nd line of code) by method chaining from the first line. I want to achieve this to improve readability.
Could you please help how can I achieve this?
You can use clip here:
df.assign(A1=df['A'].clip(upper=10,lower=3))
A B C D A1
0 12 7.0 20 14.0 10
1 4 2.0 16 3.0 4
2 5 54.0 11 NaN 5
3 3 3.0 3 2.0 3
4 1 NaN 8 6.0 3
If you really need to do this in one line (note that I don't find this readable):
pd.DataFrame({"A": [12, 4, 5, 3, 1],
              "B": [7, 2, 54, 3, None],
              "C": [20, 16, 11, 3, 8],
              "D": [14, 3, None, 2, 6]}).assign(A1=lambda x: x['A'].clip(upper=10, lower=3))
You could use np.select() like the following. It makes the conditions and choices very readable.
conditions = [df['A'] > 10,
              df['A'] < 3]
choices = [10,3]
df['A2'] = np.select(conditions, choices, default = df['A'])
print(df)
A B C D A1 A2
0 12 7.0 20 14.0 10 10
1 4 2.0 16 3.0 4 4
2 5 54.0 11 NaN 5 5
3 3 3.0 3 2.0 3 3
4 1 NaN 8 6.0 3 3

How to create a separate df after applying groupby?

I have a df as follows:
Product Step
1 1
1 3
1 6
1 6
1 8
1 1
1 4
2 2
2 4
2 8
2 8
2 3
2 1
3 1
3 3
3 6
3 6
3 8
3 1
3 4
What I would like to do is to:
For each Product, every Step must be grabbed and the order must not be changed; that is, if we look at Product 1, after Step 8 there is a 1 coming, and that 1 must stay after the 8. So the expected output for Products 1 and 3 should be in the order 1, 3, 6, 8, 1, 4; for Product 2 it must be 2, 4, 8, 3, 1.
Update:
Here, I only want one value of 6 for Products 1 and 3, since in the main df the two 6s are next to each other; but both values of 1 must be present, since they are not next to each other.
Once the first step is done, the products with the same Steps must be grouped together into a new df (in the below example: Product 1 and 3 have same Steps, so they must be grouped together)
What I have done:
import pandas as pd
sid = pd.DataFrame(data.groupby('Product').apply(lambda x: x['Step'].unique())).reset_index()
But it is yielding a result like:
Product 0
0 1 [1 3 6 8 4]
1 2 [2 4 8 3 1]
2 3 [1 3 6 8 4]
which is not the result I want. I would like the value for the first and third product to be [1 3 6 8 1 4].
IIUC, create a Newkey column by using cumsum and diff:
df['Newkey'] = df.groupby('Product').Step.apply(lambda x: x.diff().ne(0).cumsum())
df.drop_duplicates(['Product', 'Newkey'], inplace=True)
s = df.groupby('Product').Step.apply(tuple)
s.reset_index().groupby('Step').Product.apply(list)
Step
(1, 3, 6, 8, 1, 4) [1, 3]
(2, 4, 8, 3, 1) [2]
Name: Product, dtype: object
groupby preserves the order of rows within a group, so there isn't much need to worry about the rows shifting.
A straightforward, but not greatly performant, solution would be to apply(tuple), since tuples are hashable, allowing you to group on them to see which Products are identical. form_seq will make it so that consecutive values only appear once in the list of steps before the tuple is formed.
def form_seq(x):
    x = x[x != x.shift()]
    return tuple(x)
s = df.groupby('Product').Step.apply(form_seq)
s.groupby(s).groups
#{(1, 3, 6, 8, 1, 4): Int64Index([1, 3], dtype='int64', name='Product'),
# (2, 4, 8, 3, 1): Int64Index([2], dtype='int64', name='Product')}
Or if you'd like a DataFrame:
s.reset_index().groupby('Step').Product.apply(list)
#Step
#(1, 3, 6, 8, 1, 4) [1, 3]
#(2, 4, 8, 3, 1) [2]
#Name: Product, dtype: object
The values of that dictionary are the groupings of products that share the step sequence (given by the dictionary keys). Products 1 and 3 are grouped together by the step sequence 1, 3, 6, 8, 1, 4.
Another very similar way:
df_no_dups = df[df.shift() != df].dropna(how='all').ffill()
df_no_dups_grouped = df_no_dups.groupby('Product')['Step'].apply(list)
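Putting the pieces above together as a self-contained sketch (the frame literal below is reconstructed from the question's data):

```python
import pandas as pd

df = pd.DataFrame({
    "Product": [1] * 7 + [2] * 6 + [3] * 7,
    "Step": [1, 3, 6, 6, 8, 1, 4,
             2, 4, 8, 8, 3, 1,
             1, 3, 6, 6, 8, 1, 4],
})

# Collapse consecutive duplicate steps within each product, keeping order.
def form_seq(x):
    return tuple(x[x != x.shift()])

s = df.groupby("Product")["Step"].apply(form_seq)

# Group products that share an identical step sequence.
grouped = s.reset_index().groupby("Step")["Product"].apply(list)
print(grouped.to_dict())
# {(1, 3, 6, 8, 1, 4): [1, 3], (2, 4, 8, 3, 1): [2]}
```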

Three-dimensional array processing

I want to turn
arr = np.array([[[1,2,3],[4,5,6],[7,8,9],[10,11,12]], [[2,2,2],[4,5,6],[7,8,9],[10,11,12]], [[3,3,3],[4,5,6],[7,8,9],[10,11,12]]])
into
arr = np.array([[[1,2,3],[7,8,9],[10,11,12]], [[2,2,2],[7,8,9],[10,11,12]], [[3,3,3],[7,8,9],[10,11,12]]])
Below is the code:
a = 0
b = 0
NewArr = []
while a < 3:
    c = arr[a, :, :]
    d = arr[a]
    print(d)
    if c[1, 2] == 6:
        c = np.delete(c, [1], axis=0)
    a += 1
    b += 1
    c = np.concatenate((d, c), axis=1)
    print(c)
But after deleting the row containing the number 6, I cannot stitch the array back together. Can someone help me?
thank you very much for your help.
If you want a more automatic way of processing your input data, here is an answer using numpy functions:
arr[np.newaxis,~np.any(arr==6,axis=2)].reshape((3,-1,3))
np.any(arr==6, axis=2) outputs an array which has True for rows which contain the value 6. We take the inverse of those booleans, since we want to remove those rows. The result is then used as an index selection on arr, with a np.newaxis because the output of np.any has one axis fewer than the original array.
Finally, the output is reshaped into a (3, x, 3) array, where x depends on the number of rows that were removed (hence the -1 in reshape).
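A runnable check of the masking idea, in a slightly simplified form (the np.newaxis turns out not to be strictly needed once reshape is applied):

```python
import numpy as np

arr = np.array([[[1, 2, 3], [4, 5, 6], [7, 8, 9], [10, 11, 12]],
                [[2, 2, 2], [4, 5, 6], [7, 8, 9], [10, 11, 12]],
                [[3, 3, 3], [4, 5, 6], [7, 8, 9], [10, 11, 12]]])

# True for every row (along axis 1) containing a 6 somewhere.
mask = np.any(arr == 6, axis=2)   # shape (3, 4)

# Boolean indexing flattens the first two axes; reshape restores 3-D.
out = arr[~mask].reshape(3, -1, 3)
print(out.shape)  # (3, 3, 3)
```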
Based on the input / output you provide, a simpler solution would be to just use index selection and slices:
import numpy as np
arr = np.array([[[1,2,3],[4,5,6],[7,8,9],[10,11,12]], [[2,2,2],[4,5,6],[7,8,9],[10,11,12]], [[3,3,3],[4,5,6],[7,8,9],[10,11,12]]])
print("arr=")
print(arr)
expected_result = np.array([[[1,2,3],[7,8,9],[10,11,12]], [[2,2,2],[7,8,9],[10,11,12]], [[3,3,3],[7,8,9],[10,11,12]]])
# select indices 0, 2 and 3 along axis 1
a = np.copy(arr[:,[0,2,3],:])
print("a=")
print(a)
print(np.array_equal(a, expected_result))
Output:
arr=
[[[ 1 2 3]
[ 4 5 6]
[ 7 8 9]
[10 11 12]]
[[ 2 2 2]
[ 4 5 6]
[ 7 8 9]
[10 11 12]]
[[ 3 3 3]
[ 4 5 6]
[ 7 8 9]
[10 11 12]]]
a=
[[[ 1 2 3]
[ 7 8 9]
[10 11 12]]
[[ 2 2 2]
[ 7 8 9]
[10 11 12]]
[[ 3 3 3]
[ 7 8 9]
[10 11 12]]]
True

Why is order of data items reversed while creating a pandas series?

I am new to python and pandas so please bear with me. I tried searching the answer everywhere but couldn't find it. Here's my question:
This is my input code:
list = [1, 2, 3, 1, 2, 3]
s = pd.Series([1, 2, 3, 10, 20, 30], list)
The output is:
1 1
2 2
3 3
1 10
2 20
3 30
dtype: int64
Now, my question is: why is the "list" coming before the first list specified while creating the series? I tried running the same code multiple times to check whether the series creation is orderless. Any help would be highly appreciated.
Python Version:
Python 3.6.0
Pandas Version:
'0.19.2'
I think you omitted the index parameter, which specifies the first column, called the index. So the Series construction is really:
# don't use list as a variable name, because it shadows the built-in list in Python
L = [1, 2, 3, 1, 2, 3]
s = pd.Series(data=[1, 2, 3, 10, 20, 30], index=L)
print (s)
1 1
2 2
3 3
1 10
2 20
3 30
dtype: int64
You can also check the Series documentation.
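A minimal sketch that makes the index/data split explicit:

```python
import pandas as pd

L = [1, 2, 3, 1, 2, 3]
s = pd.Series(data=[1, 2, 3, 10, 20, 30], index=L)

# The left "column" in the printed Series is the index, not the data;
# the data keep exactly the order in which they were passed.
print(list(s.index))   # [1, 2, 3, 1, 2, 3]
print(list(s.values))  # [1, 2, 3, 10, 20, 30]

# A repeated index label selects every matching row.
print(list(s.loc[1]))  # [1, 10]
```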