How exactly 'abs()' and 'argsort()' work together - python-3.x

#Creating DataFrame
df=pd.DataFrame({'AAA' : [4,5,6,7], 'BBB' : [10,20,30,40],'CCC' : [100,50,-30,-50]}); df
output:
AAA BBB CCC
0 4 10 100
1 5 20 50
2 6 30 -30
3 7 40 -50
aValue = 43.0
df.loc[(df.CCC-aValue).abs().argsort()]
output:
AAA BBB CCC
1 5 20 50
0 4 10 100
2 6 30 -30
3 7 40 -50
The output is confusing; can you please explain in detail how the line below
works:
df.loc[(df.CCC-aValue).abs().argsort()]

With abs() flipping negative values and the subtraction shifting values around, it's hard to visualize what's going on, so let's calculate it step by step:
In [97]: x = np.array([100,50,-30,-50])
In [98]: x-43
Out[98]: array([ 57, 7, -73, -93])
In [99]: abs(x-43)
Out[99]: array([57, 7, 73, 93])
In [100]: np.argsort(abs(x-43))
Out[100]: array([1, 0, 2, 3])
In [101]: x[np.argsort(abs(x-43))]
Out[101]: array([ 50, 100, -30, -50])
argsort is the indexing that puts the elements in sorted order. We can see that with:
In [104]: Out[99][Out[100]]
Out[104]: array([ 7, 57, 73, 93])
or
In [105]: np.array([57, 7, 73, 93])[[1, 0, 2, 3]]
Out[105]: array([ 7, 57, 73, 93])
How they work together is determined by the Python syntax; that part is straightforward.

(df.CCC-aValue).abs() takes the absolute value of df.CCC - aValue, argsort() returns the indexes that would sort those values, and df.loc selects the rows in that sorted order.
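Putting the same steps back in pandas terms, here is a minimal sketch; note that argsort() returns positional indexes, so df.loc only behaves as expected here because the default RangeIndex makes labels and positions coincide (with a custom index, df.iloc would be the safer call):
import pandas as pd

df = pd.DataFrame({'AAA': [4, 5, 6, 7],
                   'BBB': [10, 20, 30, 40],
                   'CCC': [100, 50, -30, -50]})
aValue = 43.0

diff = df.CCC - aValue   # signed distances: [57.0, 7.0, -73.0, -93.0]
dist = diff.abs()        # absolute distances: [57.0, 7.0, 73.0, 93.0]
order = dist.argsort()   # positions that would sort dist: [1, 0, 2, 3]
print(df.loc[order])     # rows ordered by closeness of CCC to aValue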


Get sum of group subset using pandas groupby

I have a dataframe as shown. Using Python, I want to get the sum of 'Value' for each 'Id' group up to the first occurrence of 'Stage' 12.
df = pd.DataFrame({'Id':[1,1,1,2,2,2,2],
'Date': ['2020-04-23', '2020-04-25', '2020-04-28', '2020-04-20', '2020-05-01', '2020-05-05', '2020-05-12'],
'Stage': [11, 12, 15, 11, 14, 12, 12],
'Value': [5, 4, 6, 12, 2, 8, 3]})
Id Date Stage Value
1 2020-04-23 11 5
1 2020-04-25 12 4
1 2020-04-28 15 6
2 2020-04-20 11 12
2 2020-05-01 14 2
2 2020-05-05 12 8
2 2020-05-12 12 3
My desired output:
Id Value
1 9
2 22
Would be very thankful if someone could help.
Let us use groupby with transform('idxmax') to filter the dataframe, then do another round of groupby:
idx = df['Stage'].eq(12).groupby(df['Id']).transform('idxmax')
output = df[df.index <= idx].groupby('Id')['Value'].sum().reset_index()
Detail
transform('idxmax') returns, for every row in a group, the index of the group's first row whose Stage equals 12; filtering with df.index <= idx then keeps each group's data up to and including that first 12.
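For reference, a minimal end-to-end run on the sample data, with the intermediate idx shown as comments:
import pandas as pd

df = pd.DataFrame({'Id': [1, 1, 1, 2, 2, 2, 2],
                   'Stage': [11, 12, 15, 11, 14, 12, 12],
                   'Value': [5, 4, 6, 12, 2, 8, 3]})

# for each row, the index of the first Stage == 12 row in its Id group
idx = df['Stage'].eq(12).groupby(df['Id']).transform('idxmax')
# idx: [1, 1, 1, 5, 5, 5, 5]

output = df[df.index <= idx].groupby('Id')['Value'].sum().reset_index()
print(output)
#    Id  Value
# 0   1      9
# 1   2     22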

how to sort nested dictionaries based on multiple keys

I need to sort the dictionary dicti and display it as follows:
compile the following statistics for each player:
Number of best-of-5 set matches won
Number of best-of-3 set matches won
Number of sets won
Number of games won
Number of sets lost
Number of games lost
You should print out to the screen (standard output) a summary in decreasing order of ranking, where the ranking is according to the criteria 1-6 in that order (compare item 1, if equal compare item 2, if equal compare item 3 etc, noting that for items 5 and 6 the comparison is reversed).
I have stored the results in a dictionary, but I am not familiar with sorting dictionaries. I've no clue how to do it.
dicti={'Federer': {'gameswon': 142, 'gameslost': 143, 'setswon': 13, 'setslost': 16, 'fivesetmatch': 3, 'threesetmatch': 1},
'Nadal': {'gameswon': 143, 'gameslost': 142, 'setswon': 16, 'setslost': 13, 'fivesetmatch': 2, 'threesetmatch': 2},
'Halep': {'gameswon': 15, 'gameslost': 12, 'setswon': 2, 'setslost': 1, 'fivesetmatch': 0, 'threesetmatch': 1},
'Wozniacki': {'gameswon': 12, 'gameslost': 15, 'setswon': 1, 'setslost': 2, 'fivesetmatch': 0, 'threesetmatch': 0}}
Use pandas for data analysis and getting insights
>>> import pandas as pd
>>> df = pd.DataFrame.from_dict(dicti)
>>> df
Federer Nadal Halep Wozniacki
gameswon 142 143 15 12
gameslost 143 142 12 15
setswon 13 16 2 1
setslost 16 13 1 2
fivesetmatch 3 2 0 0
threesetmatch 1 2 1 0
>>> df.describe()
Federer Nadal Halep Wozniacki
count 6.000000 6.000000 6.000000 6.00000
mean 53.000000 53.000000 5.166667 5.00000
std 69.561484 69.558608 6.554896 6.69328
min 1.000000 2.000000 0.000000 0.00000
25% 5.500000 4.750000 1.000000 0.25000
50% 14.500000 14.500000 1.500000 1.50000
75% 110.500000 110.500000 9.500000 9.50000
max 143.000000 143.000000 15.000000 15.00000
For example,
For number of games won you could do
>>> df.loc['gameswon'].sum()
312
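That said, the multi-key ranking itself can be done with plain sorted() and a key tuple. A minimal sketch using dicti from above, mapping criteria 1-6 onto the dictionary's keys (win counts are negated so they sort descending; the loss counts for items 5 and 6 stay positive, since their comparison is reversed):
def rank_key(item):
    name, s = item
    # criteria 1-6, in order; negation makes the win counts sort descending
    return (-s['fivesetmatch'], -s['threesetmatch'], -s['setswon'],
            -s['gameswon'], s['setslost'], s['gameslost'])

for name, stats in sorted(dicti.items(), key=rank_key):
    print(name, stats)
# prints Federer, Nadal, Halep, Wozniacki for the data above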

Pandas: Random integer between values in two columns

How can I create a new column that holds a random integer between the values of two other columns in each row?
Example df:
import pandas as pd
import numpy as np
data = pd.DataFrame({'start': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
'end': [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]})
data = data.iloc[:, [1, 0]]
Now I am trying something like this:
data['rand_between'] = data.apply(lambda x: np.random.randint(data.start, data.end))
or
data['rand_between'] = np.random.randint(data.start, data.end)
But of course it doesn't work, because data.start is a Series, not a number.
How can I use numpy.random with data from columns as a vectorized operation?
You are close; specify axis=1 to process the data by rows, and change data.start/end to x.start/end to work with scalars:
data['rand_between'] = data.apply(lambda x: np.random.randint(x.start, x.end), axis=1)
Another possible solution:
data['rand_between'] = [np.random.randint(s, e) for s,e in zip(data['start'], data['end'])]
print (data)
start end rand_between
0 1 10 8
1 2 20 3
2 3 30 23
3 4 40 35
4 5 50 30
5 6 60 28
6 7 70 60
7 8 80 14
8 9 90 85
9 10 100 83
If you want to truly vectorize this, you can generate a random number between 0 and 1 and scale it with your min/max numbers:
(
data['start'] + np.random.rand(len(data)) * (data['end'] - data['start'] + 1)
).astype('int')
Out:
0 1
1 18
2 18
3 35
4 22
5 27
6 35
7 23
8 33
9 81
dtype: int64
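As a further note, and an assumption worth checking against your NumPy version: np.random.randint has accepted broadcastable array-valued low/high since around NumPy 1.11, and the newer Generator API (NumPy 1.17+) spells it integers, so the fully vectorized call can be direct:
import numpy as np
import pandas as pd

data = pd.DataFrame({'start': [1, 2, 3], 'end': [10, 20, 30]})

rng = np.random.default_rng(0)
# low/high broadcast elementwise, so no apply or loop is needed
data['rand_between'] = rng.integers(data['start'], data['end'])
print(data)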

How to shuffle data in python keeping some n number of rows intact

I want to shuffle my data in such a manner that each block of 4 rows remains intact. For example, if I have 16 rows, then the first 4 rows can go to the last position, the second 4 rows to the third, and so on, in any particular order. I am trying to do this in Python.
Reshape to split the first axis into two, with the latter of length equal to the group length (4), giving us a 3D array; then use np.random.shuffle, which shuffles along the first axis. Since the reshaped version is a view into the original array, the results are assigned back into it directly. Being in-situ, this should be pretty efficient (both memory-wise and performance-wise).
Hence, the implementation would be as simple as this -
def array_shuffle(a, n=4):
    a3D = a.reshape(a.shape[0]//n, n, -1)  # a is the input array
    np.random.shuffle(a3D)
Another variant of it would be to generate random permutations covering the length of the 3D array, then indexing into it with those and finally reshaping back to 2D. This makes a copy, but seems more performant than the in-situ edits of the previous method.
The implementation would be -
def array_permuted_indexing(a, n=4):
    m = a.shape[0]//n
    a3D = a.reshape(m, n, -1)
    return a3D[np.random.permutation(m)].reshape(-1, a3D.shape[-1])
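A quick usage sketch of the two helpers above (assuming the row count is divisible by n):
import numpy as np

a = np.arange(48).reshape(16, 3)     # 16 rows, to be shuffled in 4-row blocks

array_shuffle(a, n=4)                # in-situ: a itself is reordered
b = array_permuted_indexing(a, n=4)  # returns a shuffled copy; a is unchanged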
Step-by-step run on shuffling method -
1] Setup random input array and split into a 3D version :
In [2]: np.random.seed(0)
In [3]: a = np.random.randint(11,99,(16,3))
In [4]: a3D = a.reshape(a.shape[0]//4,4,-1)
In [5]: a
Out[5]:
array([[55, 58, 75],
[78, 78, 20],
[94, 32, 47],
[98, 81, 23],
[69, 76, 50],
[98, 57, 92],
[48, 36, 88],
[83, 20, 31],
[91, 80, 90],
[58, 75, 93],
[60, 40, 30],
[30, 25, 50],
[43, 76, 20],
[68, 43, 42],
[85, 34, 46],
[86, 66, 39]])
2] Check the 3D array :
In [6]: a3D
Out[6]:
array([[[55, 58, 75],
[78, 78, 20],
[94, 32, 47],
[98, 81, 23]],
[[69, 76, 50],
[98, 57, 92],
[48, 36, 88],
[83, 20, 31]],
[[91, 80, 90],
[58, 75, 93],
[60, 40, 30],
[30, 25, 50]],
[[43, 76, 20],
[68, 43, 42],
[85, 34, 46],
[86, 66, 39]]])
3] Shuffle it along the first axis (in-situ) :
In [7]: np.random.shuffle(a3D)
In [8]: a3D
Out[8]:
array([[[69, 76, 50],
[98, 57, 92],
[48, 36, 88],
[83, 20, 31]],
[[43, 76, 20],
[68, 43, 42],
[85, 34, 46],
[86, 66, 39]],
[[55, 58, 75],
[78, 78, 20],
[94, 32, 47],
[98, 81, 23]],
[[91, 80, 90],
[58, 75, 93],
[60, 40, 30],
[30, 25, 50]]])
4] Verify the changes back in the original array :
In [9]: a
Out[9]:
array([[69, 76, 50],
[98, 57, 92],
[48, 36, 88],
[83, 20, 31],
[43, 76, 20],
[68, 43, 42],
[85, 34, 46],
[86, 66, 39],
[55, 58, 75],
[78, 78, 20],
[94, 32, 47],
[98, 81, 23],
[91, 80, 90],
[58, 75, 93],
[60, 40, 30],
[30, 25, 50]])
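One caveat, as an aside: the in-situ behaviour relies on reshape returning a view, which holds for a contiguous array like a here. np.shares_memory can confirm that assumption:
# a3D aliases a's buffer, so shuffling a3D reorders a as well
print(np.shares_memory(a, a3D))  # True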
Runtime test
In [102]: a = np.random.randint(11,99,(16000,3))
In [103]: df = pd.DataFrame(a)
# @piRSquared's soln1
In [106]: %timeit df.iloc[np.random.permutation(np.arange(df.shape[0]).reshape(-1, 4)).ravel()]
100 loops, best of 3: 2.88 ms per loop
# @piRSquared's soln2
In [107]: %%timeit
...: d = df.set_index(np.arange(len(df)) // 4, append=True).swaplevel(0, 1)
...: pd.concat([d.xs(i) for i in np.random.permutation(range(4))])
100 loops, best of 3: 3.48 ms per loop
# Array based soln-1
In [108]: %timeit array_shuffle(a, n=4)
100 loops, best of 3: 3.38 ms per loop
# Array based soln-2
In [109]: %timeit array_permuted_indexing(a, n=4)
10000 loops, best of 3: 125 µs per loop
Setup
Consider the dataframe df
df = pd.DataFrame(np.random.randint(10, size=(16, 4)), columns=list('WXYZ'))
df
W X Y Z
0 9 8 6 2
1 0 9 5 5
2 7 5 9 4
3 7 1 1 8
4 7 7 2 2
5 5 5 0 2
6 9 3 2 7
7 5 7 2 9
8 6 6 2 8
9 0 7 0 8
10 7 5 5 2
11 6 0 9 5
12 9 2 2 2
13 8 8 2 5
14 4 1 5 6
15 1 2 3 9
Option 1
Inspired by @B.M. and @Divakar
I'm using np.random.permutation because it returns a copy that is a permuted version of what was passed. This means I can then pass that directly to iloc and return what I need.
df.iloc[np.random.permutation(np.arange(16).reshape(-1, 4)).ravel()]
W X Y Z
12 9 2 2 2
13 8 8 2 5
14 4 1 5 6
15 1 2 3 9
0 9 8 6 2
1 0 9 5 5
2 7 5 9 4
3 7 1 1 8
8 6 6 2 8
9 0 7 0 8
10 7 5 5 2
11 6 0 9 5
4 7 7 2 2
5 5 5 0 2
6 9 3 2 7
7 5 7 2 9
Option 2
I'd add a level to the index that we can call on when shuffling
d = df.set_index(np.arange(len(df)) // 4, append=True).swaplevel(0, 1)
d
W X Y Z
0 0 9 8 6 2
1 0 9 5 5
2 7 5 9 4
3 7 1 1 8
1 4 7 7 2 2
5 5 5 0 2
6 9 3 2 7
7 5 7 2 9
2 8 6 6 2 8
9 0 7 0 8
10 7 5 5 2
11 6 0 9 5
3 12 9 2 2 2
13 8 8 2 5
14 4 1 5 6
15 1 2 3 9
Then we can shuffle
pd.concat([d.xs(i) for i in np.random.permutation(range(4))])
W X Y Z
12 9 2 2 2
13 8 8 2 5
14 4 1 5 6
15 1 2 3 9
4 7 7 2 2
5 5 5 0 2
6 9 3 2 7
7 5 7 2 9
0 9 8 6 2
1 0 9 5 5
2 7 5 9 4
3 7 1 1 8
8 6 6 2 8
9 0 7 0 8
10 7 5 5 2
11 6 0 9 5
The code below in Python does the magic:
from random import shuffle
import numpy as np
from math import ceil

# creating sample dataset
d = [[i*4 + j for i in range(5)] for j in range(25)]
a = np.array(d, int)
print('--------------Input--------------')
print(a)

gl = 4  # group length, i.e. the number of rows that must stay intact
parts = ceil(1.0*len(a)/gl)  # number of partitions based on group length

# creating the partition list and shuffling it to use later
x = [i for i in range(int(parts))]
shuffle(x)

# creates the new dataset based on the shuffled partition list
fg = x.pop(0)
f = a[gl*fg:gl*(fg+1)]
for i in x:
    t = a[gl*i:(i+1)*gl]
    f = np.concatenate((f, t), axis=0)
print('--------------Output--------------')
print(f)

Converting float into range in Python

I am doing some data analysis with pandas and am struggling to find a nice, clean way of summing up a range of numbers. I have a data frame with a column of floats; however, I am not interested in the exact number, but in a rough range. Ultimately I want to run a pivot and count how many values are in a certain range. Therefore, ideally, I would want to create a new column in my data frame that converts my column of floats into a range. Say df[number] = 3.5, then df[range] = 0-10.
The ranges should be 0-10, 10-20, ... >100
This may sound very arbitrary, but I've been struggling to find an answer on this. Many thanks
Pandas has a cut function for this:
In [12]: s = pd.Series(np.random.uniform(0, 110, 100))
In [13]: s
Out[13]:
0 2.652461
1 46.536276
2 6.455352
3 6.075963
4 40.013378
...
95 39.775493
96 99.688307
97 41.064469
98 91.401904
99 60.580600
dtype: float64
In [14]: cuts = np.arange(0, 101, 10)
In [15]: pd.cut(s, cuts)
Out[15]:
0 (0, 10]
1 (40, 50]
2 (0, 10]
3 (0, 10]
4 (40, 50]
...
95 (30, 40]
96 (90, 100]
97 (40, 50]
98 (90, 100]
99 (60, 70]
dtype: category
Categories (10, object): [(0, 10] < (10, 20] < (20, 30] < (30, 40] ... (60, 70] < (70, 80] < (80, 90] < (90, 100]]
See the docs for controlling what happens with endpoints.
Note that in 0.18 (coming out soonish) the result will be an IntervalIndex instead of a Categorical, which will make things even nicer.
To get your counts per interval, use the value_counts method
In [17]: pd.cut(s, cuts).value_counts()
Out[17]:
(30, 40] 15
(40, 50] 13
(50, 60] 12
(60, 70] 10
(0, 10] 10
(90, 100] 8
(70, 80] 8
(80, 90] 7
(10, 20] 6
(20, 30] 3
dtype: int64
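To get labels in the question's '0-10', '10-20', ..., '>100' form, pd.cut also accepts a labels argument, and np.inf works as an open-ended last edge; a minimal sketch:
import numpy as np
import pandas as pd

s = pd.Series([3.5, 57.6, 105.2])
edges = list(np.arange(0, 101, 10)) + [np.inf]  # 0, 10, ..., 100, inf
labels = [f'{i}-{i + 10}' for i in range(0, 100, 10)] + ['>100']
print(pd.cut(s, edges, labels=labels))
# 0     0-10
# 1    50-60
# 2     >100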
If you prefer to do it without pandas, a plain-Python helper works:
def get_range_for(x, start=0, stop=100, step=10):
    if x < start:
        return (float('-inf'), start)
    if x >= stop:
        return (stop, float('inf'))
    left = step * ((x - start) // step)
    right = left + step
    return (left, right)
Examples:
>>> get_range_for(3.5)
(0.0, 10.0)
>>> get_range_for(27.3)
(20.0, 30.0)
>>> get_range_for(75.6)
(70.0, 80.0)
Corner cases:
>>> get_range_for(-100)
(-inf, 0)
>>> get_range_for(1234)
(100, inf)
>>> get_range_for(0)
(0, 10)
>>> get_range_for(10)
(10, 20)
Using the properties of integer division should help. Because you want ranges in units of 10, dividing a number by 10 (13.5 / 10 == 1.35), converting it to an integer (int(1.35) == 1), and then multiplying by 10 (1 * 10 == 10) will convert the number to the low-end of the range it falls into. This might need some refinement (especially for negative numbers), but you could try something like:
>>> import pandas as pd
>>> import numpy as np
>>> df = pd.DataFrame({'vals': [3.5, 4.2, 10.5, 19.5, 20.3, 24.2]})
>>> df
vals
0 3.5
1 4.2
2 10.5
3 19.5
4 20.3
5 24.2
>>> df['range_start'] = np.floor(df['vals'] / 10) * 10
>>> df
vals range_start
0 3.5 0
1 4.2 0
2 10.5 10
3 19.5 10
4 20.3 20
5 24.2 20
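Building on that, a hedged sketch for turning range_start into the string labels the question asks for, with a '>100' catch-all (the np.where line is an addition, not part of the answer above):
import numpy as np
import pandas as pd

df = pd.DataFrame({'vals': [3.5, 19.5, 24.2, 123.4]})
start = np.floor(df['vals'] / 10).astype(int) * 10  # low end of each bucket
df['range'] = np.where(df['vals'] > 100, '>100',
                       start.astype(str) + '-' + (start + 10).astype(str))
print(df)
#     vals  range
# 0    3.5   0-10
# 1   19.5  10-20
# 2   24.2  20-30
# 3  123.4   >100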
