The concept of Rank in J

If I have one-dimensional data such as
v1 =: 5 $ i.5
v1
0 1 2 3 4
then v1 -"1 0 v1 gives me the Euclidean vectors like
v1 -"1 0 v1
0 1 2 3 4
_1 0 1 2 3
_2 _1 0 1 2
_3 _2 _1 0 1
_4 _3 _2 _1 0
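For readers who think more easily in Python, the pairing that -"1 0 performs can be sketched as below. This is only an illustration of the rank semantics (the whole left vector against each atom on the right), not J itself; the function name sub_rank_1_0 is made up for this sketch.

```python
def sub_rank_1_0(left, right):
    """Mimic J's  left -"1 0 right  for two flat lists.

    Left rank 1 treats the whole left list as one cell; right rank 0 takes
    each atom of the right list in turn. One row is produced per right atom.
    """
    return [[a - b for a in left] for b in right]


v1 = list(range(5))
table = sub_rank_1_0(v1, v1)
for row in table:
    print(row)
```

Each printed row is v1 minus one atom of v1, which is the same layout as the J result above.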
From these we can also find the Euclidean distance matrix easily.
This is how I find the Euclidean vectors and the distance matrix for 2D vectors:
v2=: <"1 ? 5 2 $ 10
v2
┌───┬───┬───┬───┬───┐
│4 0│4 5│5 7│8 3│6 0│
└───┴───┴───┴───┴───┘
direction_vector=: <"1 @: (-"0 @:(-/"2 @: (>"0 @: (diff))))
distance =: +/"1 @: *: @: (>"2 @:(direction_vector))
m =: 3 : '(i.(#y)) distance"0 _ y'
m v2
0 25 50 25 4
25 0 5 20 29
50 5 0 25 50
25 20 25 0 13
4 29 50 13 0
However, my problem is that I am not sure how to find the Euclidean vectors and distances of 2D data in a smart and clean way.
As you can see from the table below, my algorithm spends more than 1/3 of its time calculating the direction vectors of the data. 14.5 seconds is not bad, but the problem arises when I have a bigger dataset.
Time (seconds)
+----------------+------+--------+---------+-----+----+------+
|name |locale|all |here |here%|cum%|rep |
+----------------+------+--------+---------+-----+----+------+
|direction_vector|base |6.239582| 5.105430| 35.2| 35 |773040|
|move |base |9.741510| 1.753868| 12.1| 47 | 3390|
|script |base |1.969949| 1.443148| 9.9| 57 | 18|
|distance |base |5.650358| 1.318022| 9.1| 66 |579780|
|enclosestrings |pcsv |1.491832| 1.255603| 8.6| 75 | 1|
|diff |base |1.134585| 1.134585| 7.8| 83 |773186|
|makedsv |pcsv |1.728721| 0.236883| 1.6| 84 | 4|
|norm |base |0.221794| 0.221794| 1.5| 86 | 3390|
|xpt |base |0.194896| 0.194896| 1.3| 87 | 3390|
|ypt |base |0.193579| 0.193579| 1.3| 89 | 3390|
|writedsv |pcsv |2.067408| 0.186687| 1.3| 90 | 4|
|cosd |base |0.172359| 0.172359| 1.2| 91 | 113|
|[rest] | | | 1.300733| 9.0|100 | |
|[total] | | |14.517587|100.0|100 | |
+----------------+------+--------+---------+-----+----+------+
I think I can definitely simplify direction_vector by using rank, but I got stuck. I tried "2 1 "1 1 "1 2 "_ 1 "1 0 ..., but none of them gave me a clear result.
Can anyone help me on this issue? Thank you!

I would begin by noticing that v2 u/ v2 makes a table out of the items of v2 by applying u between those items. You can simplify this a bit by using u/~ v2, which is the same as v2 u/ v2. The next question is what u should be, but before we go there: boxing really slows things down, and you don't actually have to box the vector for this to work, since the items can already be written like this:
[ v2=: 5 2 $ 4 0 4 5 5 7 8 3 6 0
4 0
4 5
5 7
8 3
6 0
and this makes the items the vectors, which is what you want in order to use u/~ v2.
Now, back to the question of what we want u to be. Since u is fed the items of v2 to build the table, you want to subtract the items from each other as vectors (rank 1), then square the differences, and then add them together. Translating this into J gives +/@:*:@:-"1 as u:
+/@:*:@:-"1/~ v2
0 25 50 25 4
25 0 5 20 29
50 5 0 25 50
25 20 25 0 13
4 29 50 13 0
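As a cross-check of the table above, here is a minimal plain-Python sketch of the same computation: subtract each pair of items as vectors, square, and sum. The function name squared_distance_table is made up for this sketch. Note that the values are sums of squares, so a square root would still be needed for the actual Euclidean distance.

```python
def squared_distance_table(points):
    """Pairwise squared Euclidean distances between all items of points."""
    return [
        [sum((a - b) ** 2 for a, b in zip(p, q)) for q in points]
        for p in points
    ]


v2 = [(4, 0), (4, 5), (5, 7), (8, 3), (6, 0)]
table = squared_distance_table(v2)
for row in table:
    print(row)
```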
If you time this I hope that you will find it much faster than your solution because it does not require boxing. The key area that you were concerned about with rank was that it be applied after the Table adverb /
Hope this helps even though it is a slightly different approach and let me know what your timings look like.

New answer so that I could format the result properly.
Looking at your result I think that there is further room for improvement at least by my measurements. Getting away from the tacit approach and moving to a sequential application with the change of +/ to +/"1 gives roughly twice the speed.
[ v2=: 5 2 $ 4 0 4 5 5 7 8 3 6 0
4 0
4 5
5 7
8 3
6 0
(10000) 6!:2 '+/@:*:@:-"1/~ v2'
8.4647e_6
(10000) 6!:2 '+/"1 *: -"1/~ v2'
3.1289e_6
(+/"1 *: -"1/~ v2) -: (+/@:*:@:-"1/~ v2)
1
+/"1 *: -"1/~ v2
0 25 50 25 4
25 0 5 20 29
50 5 0 25 50
25 20 25 0 13
4 29 50 13 0


How can I get the count of sequential events pairs from a Pandas dataframe?

I have a dataframe that looks like this:
ID EVENT DATE
1 1 142
1 5 167
1 3 245
2 1 54
2 5 87
3 3 165
3 2 178
And I would like to generate something like this:
EVENT_1 EVENT_2 COUNT
1 5 2
5 3 1
3 2 1
The idea is how many items (ID) go from one event to the next one. Don't care about previous states, I just want to consider the next state from the current state (e.g.: for ID 1, I don't want to count a transition from 1 to 3 because first, it goes to event 5 and then to 3).
The date format is the number of days from a specific date (sort of like SAS format).
Is there a clean way to achieve this?
Let's try this:
(df.groupby([df['EVENT'].rename('EVENT_1'),
df.groupby('ID')['EVENT'].shift(-1).rename('EVENT_2')])['ID']
.count()).rename('COUNT').reset_index().astype(int)
Output:
| | EVENT_1 | EVENT_2 | COUNT |
|---:|----------:|----------:|--------:|
| 0 | 1 | 5 | 2 |
| 1 | 3 | 2 | 1 |
| 2 | 5 | 3 | 1 |
Details: Groupby on 'EVENT' and shifted 'EVENT' within each ID, then count.
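To make the logic concrete without pandas, here is a small plain-Python sketch of the same idea: collect each ID's events in order, pair every event with the next one for that ID, and count the pairs. It assumes the rows are already sorted by DATE within each ID, as in the sample data; the variable names are illustrative only.

```python
from collections import Counter, defaultdict

rows = [  # (ID, EVENT, DATE), assumed sorted by DATE within each ID
    (1, 1, 142), (1, 5, 167), (1, 3, 245),
    (2, 1, 54), (2, 5, 87),
    (3, 3, 165), (3, 2, 178),
]

# Collect the ordered event sequence of every ID.
events_by_id = defaultdict(list)
for id_, event, _date in rows:
    events_by_id[id_].append(event)

# Count each (current event, next event) pair; the last event of an ID
# has no successor and so contributes nothing.
transitions = Counter()
for events in events_by_id.values():
    transitions.update(zip(events, events[1:]))

for (event_1, event_2), count in sorted(transitions.items()):
    print(event_1, event_2, count)
```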
You could use groupby and shift. We'll also use rename_axis and reset_index to tidy up the final output:
(pd.concat([f.groupby([f['EVENT'], f['EVENT'].shift(-1).astype('Int64')]).size()
for _, f in df.groupby('ID')])
.groupby(level=[0, 1]).sum()
.rename_axis(['EVENT_1', 'EVENT_2']).reset_index(name='COUNT'))
[out]
EVENT_1 EVENT_2 COUNT
0 1 5 2
1 3 2 1
2 5 3 1

Calibrate with cph function (with external validation)

I have two questions about using calibrate with the cph function.
My data have 5 independent variables (BMI through RT) and 2 dependent variables (Time, Event).
> head(data)
BMI Taxanes Surgery LND RT Event Time
1 19 0 0 2 5 0 98
2 20 0 0 3 3 0 97
3 21 0 0 8 2 0 17
4 18 0 0 1 3 0 35
5 20 1 0 3 1 0 27
6 20 1 0 2 3 1 2
> str(data)
$ BMI : num 19 20 21 18 20 20 20 ...
$ Taxanes: int 0 0 0 0 1 1 1 0 0 0 ...
$ Surgery: num 0 0 0 0 0 0 1 0 0 0 ...
$ LND : int 2 3 8 1 3 2 2 2 5 2 ...
$ RT : Factor w/ 7 levels "0","1","2","3",..: 5 3 2 3 1 3 ...
$ Event : int 0 0 0 0 0 1 0 0 0 0 ...
$ Time : num 98 97 17 35 27 2 22 ...
(1) With this data, I ran a survival analysis with a cph model, and I want to make a calibration plot from it. But I got the error "Error in x(x) : argument "y" is missing, with no default". I have looked through a lot of material, but I cannot find the reason for this error. Even in the documentation of the calibrate function that I found online, I cannot find an argument 'y'. Please help me with this question.
> ddist <- datadist(data)
> options(datadist='ddist')
>
> fit = cph(Surv(Time,Event) ~ BMI + Surgery + Taxanes + RT + LND, data=data, x=TRUE, y=TRUE, surv=TRUE, dxy=TRUE, time.inc=36)
> plot(calibrate(fit))
Using Cox survival estimates at 36 Days
**Error in x(x) : argument "y" is missing, with no default**
(2) Eventually I want to do external validation of this cph model (fit).
If the new data set is called dat2 (with the same variables as data), what are the observed and predicted survival? I know that the predicted values are calculated with code like this:
val<-val.surv(fit, newdata=dat2, S=Surv(dat2$Time,dat2$Event))
But how do I get the actual (observed) survival in the new data (dat2)? Please help with this problem. Thank you so much in advance!

How to do cumulative mean and count in an easy way

I have the following dataframe in pandas:
import pandas as pd
data = {'call_put':['C', 'C', 'P','C', 'P'],'price':[10,20,30,40,50], 'qty':[11,12,11,14,9]}
df=pd.DataFrame(data)
df['amt']=df.price*df.qty
call_put price qty amt
0 C 10 11 110
1 C 20 12 240
2 P 30 11 330
3 C 40 14 560
4 P 50 9 450
I want output like the following, with the count, median and sum calculated as shown, depending on whether call_put is 'C' or 'P':
call_put price qty amt cummcount cummmedian cummsum
C 10 11 110 1 110 110
C 20 12 240 2 175 ((110+240)/2 ) 350
P 30 11 330 1 330 680
C 40 14 560 3 303.33 (110+240+560)/3 1240
P 50 9 450 2 390 ((330+450)/2) 1690
Can it be done in some easy way without creating additional dataframes and functions?
Create a groupby object named g and use df.assign to add the columns. Dropping only the group level of the expanding result (level=0) keeps the original row labels, so the values line up with the rows of df:
g=df.groupby('call_put')
final=df.assign(cum_count=g.cumcount().add(1),
cummedian=g['amt'].expanding().mean().reset_index(level=0, drop=True), cum_sum=df.amt.cumsum())
call_put price qty amt cum_count cummedian cum_sum
0 C 10 11 110 1 110.000000 110
1 C 20 12 240 2 175.000000 350
2 P 30 11 330 1 330.000000 680
3 C 40 14 560 3 303.333333 1240
4 P 50 9 450 2 390.000000 1690
Note: for P , the cummedian should be 390 since (330+450)/2 = 390
For cum_count look at df.groupby.cumcount()
for cummedian check how expanding() works ,
for cumsum check df.cumsum()
IIUC, this should work:
df['cumcount']=df.groupby('call_put').cumcount()+1
df['cummedian']=df.groupby('call_put')['amt'].expanding().mean().droplevel(0)
df['cumsum']=df['amt'].cumsum()
Thanks, the following solution works:
g=df.groupby('call_put')
final=df.assign(cum_count=g.cumcount().add(1),
cummedian=g['amt'].expanding().mean().reset_index(level=0, drop=True), cum_sum=df.amt.cumsum())
If I run the following without drop=True
g['amt'].expanding().mean().reset_index()
why does the output show level_1?
call_put level_1 amt
0 C 0 110.000000
1 C 1 175.000000
2 C 3 303.333333
3 P 2 330.000000
4 P 4 390.000000
g['amt'].expanding().mean().reset_index(drop=True)
0 110.000000
1 175.000000
2 303.333333
3 330.000000
4 390.000000
Name: amt, dtype: float64
Can you please explain in more detail?
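A short sketch that may help with the level_1 question. The expanding mean of a grouped column comes back with two index levels: the group key (call_put) and the original row label. The second level has no name, so reset_index() labels the resulting column level_1, after its position. The column name cum_mean below is chosen only for this illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "call_put": ["C", "C", "P", "C", "P"],
    "price": [10, 20, 30, 40, 50],
    "qty": [11, 12, 11, 14, 9],
})
df["amt"] = df["price"] * df["qty"]

expanding_mean = df.groupby("call_put")["amt"].expanding().mean()

# Two index levels: the group key and the (unnamed) original row label.
print(expanding_mean.index.names)

# reset_index() turns both levels into columns; the unnamed one becomes
# "level_1" because it is index level number 1.
flat = expanding_mean.reset_index()
print(flat.columns.tolist())

# Dropping only the group level keeps the original row labels, so the
# values align with df's rows when assigned back.
df["cum_mean"] = expanding_mean.droplevel(0)
print(df)
```

With drop=True and no level argument, both levels are discarded and the values are renumbered 0..4 in group order (all C rows, then all P rows), which no longer matches the row order of df.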
How do you add one more condition in the groupby clause?
g=df.groupby('call_put', 'price' < 50)
TypeError: '<' not supported between instances of 'str' and 'int'
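The TypeError arises because 'price' < 50 compares the string 'price' itself with 50, before pandas ever sees it. One way to group by a condition, sketched here under the assumption that the goal is one group per (call_put, price below 50) combination, is to pass a list whose second key is a boolean Series built from the column. The name under_50 is made up for this example.

```python
import pandas as pd

df = pd.DataFrame({
    "call_put": ["C", "C", "P", "C", "P"],
    "price": [10, 20, 30, 40, 50],
    "qty": [11, 12, 11, 14, 9],
})

# Boolean Series computed from the column, not from the column's name.
under_50 = (df["price"] < 50).rename("under_50")

# groupby accepts a list that mixes column names and Series of equal length.
counts = df.groupby(["call_put", under_50]).size()
print(counts)
```

If the intent is instead to keep only the rows with price below 50, filtering first with df[df['price'] < 50] and then grouping by 'call_put' is the usual alternative.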

FoldList like primitive in J

Mathematica has a built-in function called FoldList (see the FoldList function description). Is there a similar primitive verb in J?
(I know that J has a ^: verb, which is like Nest and FixedPoint.)
To clarify my question: J has dyadic verbs, so usually u / x1 x2 x3 becomes x1 u (x2 u x3), which works just like FoldList but in reverse order.
The exception is when the function u takes a y with a different shape from x. FoldList has an initial x. In J, if x3 has a different shape, one has to rely on < to box it together with the rest. For example, one has to box and unbox:
[list =. (;/ 3 3 4 3 3 34),(< 1 2)
+-+-+-+-+-+--+---+
|3|3|4|3|3|34|1 2|
+-+-+-+-+-+--+---+
tf =: 4 : '<((> x) , >y)'
tf/ list
+----------------+
|1 2 3 3 4 3 3 34|
+----------------+
tf/\ |. list
+---+------+--------+----------+------------+--------------+----------------+
|1 2|1 2 34|1 2 34 3|1 2 34 3 3|1 2 34 3 3 4|1 2 34 3 3 4 3|1 2 34 3 3 4 3 3|
+---+------+--------+----------+------------+--------------+----------------+
which is kind of inconvenient. Any better solutions?
u/\ comes very close (if you don't mind the right folding):
+/\ 1 2 3 4
1 3 6 10
*/\1+i.10
1 2 6 24 120 720 5040 ...
(+%)/\7#1. NB. continued fraction of phi
1 2 1.5 1.66667 1.6 1.625 1.61538
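For comparison, here is a rough Python sketch of what the prefix scan does in these examples: for every prefix of the list, the function is folded from the right. The helper name scan_right is made up for this illustration, and the lambda x + 1 / y stands in for the hook (+%).

```python
from functools import reduce


def scan_right(f, xs):
    """For each prefix of xs, fold f from the right and keep the result."""
    results = []
    for i in range(1, len(xs) + 1):
        prefix = xs[:i]
        # Right fold: x1 f (x2 f (... f xn)), seeded with the last item.
        folded = reduce(lambda acc, x: f(x, acc), reversed(prefix[:-1]), prefix[-1])
        results.append(folded)
    return results


sums = scan_right(lambda x, y: x + y, [1, 2, 3, 4])
phi = scan_right(lambda x, y: x + 1 / y, [1.0] * 7)
print(sums)
print([round(v, 5) for v in phi])
```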
edit on your edit:
The first two elements of FoldList are x and f(x,a). In J those two have to be of the same "kind" (shape+type) if you want them on the same list. The inconvenience comes from J's data structures not from the lack of a FoldList verb. If you exclude x from the list, things are easier:
FoldListWithout_x =: 1 : 'u/ each }.<\y'
; FoldListWithout_x 1 2 3 4
┌─────┬───────┬─────────┐
│┌─┬─┐│┌─┬─┬─┐│┌─┬─┬─┬─┐│
││1│2│││1│2│3│││1│2│3│4││
│└─┴─┘│└─┴─┴─┘│└─┴─┴─┴─┘│
└─────┴───────┴─────────┘
>+ FoldListWithout_x 1 2 3 4
3 6 10
(+%) FoldListWithout_x 7#1
┌─┬───┬───────┬───┬─────┬───────┐
│2│1.5│1.66667│1.6│1.625│1.61538│
└─┴───┴───────┴───┴─────┴───────┘
The next logical step is to include a boxed x after making the folds, but that will either require more complex code or a case-by-case construction. Eg:
FoldList =: 1 :'({.y) ; u FoldListWithout_x y'
+ FoldList 1 2 3 4
┌─┬─┬─┬──┐
│1│3│6│10│
└─┴─┴─┴──┘
; FoldList 1 2 3 4
┌─┬─────┬───────┬─────────┐
│1│┌─┬─┐│┌─┬─┬─┐│┌─┬─┬─┬─┐│
│ ││1│2│││1│2│3│││1│2│3│4││
│ │└─┴─┘│└─┴─┴─┘│└─┴─┴─┴─┘│
└─┴─────┴───────┴─────────┘
vs
FoldList =: 1 :'(<{.y) ; u FoldListWithout_x y'
+ FoldList 1 2 3 4
┌───┬─┬─┬──┐
│┌─┐│3│6│10│
││1││ │ │ │
│└─┘│ │ │ │
└───┴─┴─┴──┘
; FoldList 1 2 3 4
┌───┬─────┬───────┬─────────┐
│┌─┐│┌─┬─┐│┌─┬─┬─┐│┌─┬─┬─┬─┐│
││1│││1│2│││1│2│3│││1│2│3│4││
│└─┘│└─┴─┘│└─┴─┴─┘│└─┴─┴─┴─┘│
└───┴─────┴───────┴─────────┘
I guess @Dan Bron's comment deserves an answer. It is discussed with some solutions in http://www.jsoftware.com/pipermail/programming/2006-May/002245.html
if we define an adverb (modified from the link above)
upd =: 1 : 0
:
u&.> /\ ( <"_ x),<"0 y
)
then
1 2 , upd |. 3 3 4 3 3 34
┌───┬──────┬────────┬──────────┬────────────┬──────────────┬────────────────┐
│1 2│1 2 34│1 2 34 3│1 2 34 3 3│1 2 34 3 3 4│1 2 34 3 3 4 3│1 2 34 3 3 4 3 3│
└───┴──────┴────────┴──────────┴────────────┴──────────────┴────────────────┘
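For readers coming from other languages, the same "fold with an initial value, keeping every intermediate result" can be sketched in Python with itertools.accumulate and its initial argument (Python 3.8 or later). The helper name fold_list is made up for this sketch; it reproduces the boxes above as plain lists.

```python
from itertools import accumulate


def fold_list(f, x, items):
    """Left fold that keeps every intermediate result, starting with x."""
    return list(accumulate(items, f, initial=x))


# Append one item at a time to the initial list 1 2, feeding the data in
# reverse order as in the J example.
steps = fold_list(lambda acc, item: acc + [item], [1, 2], reversed([3, 3, 4, 3, 3, 34]))
for step in steps:
    print(step)
```

Because Python lists can hold items of different shapes, no boxing or unboxing is needed here, which is the inconvenience the question describes on the J side.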

J : Iverson's `table` function from _A Personal View of APL_

At the very end of A Personal View of APL (right before the references), Ken Iverson gave the following series of J code snippets:
[a=. b=. i. 5
0 1 2 3 4
a +/ b
0 1 2 3 4
1 2 3 4 5
2 3 4 5 6
3 4 5 6 7
4 5 6 7 8
over=.({.,.@;}.)&":@,
by=. (,~"_1 ' '&;&,.)~
a by b over a !/ b
+-+---------+
| |0 1 2 3 4|
+-+---------+
|1|1 1 1 1 1|
|2|0 1 2 3 4|
|3|0 0 1 3 6|
|4|0 0 0 1 4|
|5|0 0 0 0 1|
+-+---------+
table=. /([`by`]`over`)\
2 3 5 *table 1 2 3 4 5
+-+-------------+
| |1 2 3 4 5|
+-+-------------+
|2|2 4 6 8 10|
|3|3 6 9 12 15|
|5|5 10 15 20 25|
+-+-------------+
All of these work for me in J701, except the last, which gives me:
table=. /([`by`]`over`)\
2 3 5 *table 1 2 3 4 5
|rank error
| 2 3 5 *table 1 2 3 4 5
I notice in the original PDF from IBM that the quotes look more like:
table=. /([`by']`over')\
But this is a syntax error.
Was there a transcription error converting the PDF to HTML on the J site, or has the syntax of J changed?
I don't think that this is valid J syntax (I mean, for what it is supposed to do); maybe it was then, but not any more. The adverb table can be simply defined as:
table =: 1 :'[ by ] over u/'
The closest I can get to Iverson's version is:
table =: /([`by`]`over`)
but then you have to evoke (`:) the result of the adverb:
2 3 5 (*table`:6) 1 2 3 4 5
┌─┬─────────────┐
│ │1 2 3 4 5│
├─┼─────────────┤
│2│2 4 6 8 10│
│3│3 6 9 12 15│
│5│5 10 15 20 25│
└─┴─────────────┘
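To show what the bordered table amounts to, independent of J syntax, here is a minimal Python sketch: apply the operation to every (row, column) pair, then put the column arguments on top and the row arguments on the left. The function name table and the plain-text layout are choices made for this illustration, not a translation of by and over.

```python
import operator


def table(op, rows, cols):
    """Return a bordered operation table as a list of text lines."""
    body = [[str(op(r, c)) for c in cols] for r in rows]
    header = [str(c) for c in cols]
    # Each column is as wide as its widest entry, header included.
    widths = [
        max(len(header[j]), *(len(row[j]) for row in body))
        for j in range(len(cols))
    ]
    label_width = max(len(str(r)) for r in rows)
    lines = [
        " " * label_width + " | "
        + " ".join(h.rjust(w) for h, w in zip(header, widths))
    ]
    for r, row in zip(rows, body):
        lines.append(
            str(r).rjust(label_width) + " | "
            + " ".join(v.rjust(w) for v, w in zip(row, widths))
        )
    return lines


lines = table(operator.mul, [2, 3, 5], [1, 2, 3, 4, 5])
print("\n".join(lines))
```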
J has changed. Earlier versions allowed adverbs and conjunctions to be defined in ways that are no longer possible.
A version of table that is compatible with recent versions appears in the J Dictionary under "Bordering a Table"

Resources