At the very end of A Personal View of APL (right before the references), Ken Iverson gave the following series of J code snippets:
[a=. b=. i. 5
0 1 2 3 4
a +/ b
0 1 2 3 4
1 2 3 4 5
2 3 4 5 6
3 4 5 6 7
4 5 6 7 8
over=.({.,.#;}.)&":#,
by=. (,~"_1 ' '&;&,.)~
a by b over a !/ b
+-+---------+
| |0 1 2 3 4|
+-+---------+
|0|1 1 1 1 1|
|1|0 1 2 3 4|
|2|0 0 1 3 6|
|3|0 0 0 1 4|
|4|0 0 0 0 1|
+-+---------+
table=. /([`by`]`over`)\
2 3 5 *table 1 2 3 4 5
+-+-------------+
| |1  2  3  4  5|
+-+-------------+
|2|2  4  6  8 10|
|3|3  6  9 12 15|
|5|5 10 15 20 25|
+-+-------------+
All of these work for me in J701, except the last, which gives me:
table=. /([`by`]`over`)\
2 3 5 *table 1 2 3 4 5
|rank error
| 2 3 5 *table 1 2 3 4 5
I notice in the original PDF from IBM that the quotes look more like:
table=. /([`by']`over')\
But this is a syntax error.
Was there a transcription error converting the PDF to HTML on the J site, or has the syntax of J changed?
I don't think this is valid J syntax (for what it is supposed to do, I mean); maybe it was back then, but not any more. The adverb table can be defined simply as:
table =: 1 :'[ by ] over u/'
The closest I can get to Iverson's version is:
table =: /([`by`]`over`)
but then you have to evoke (`:) the result of the adverb:
2 3 5 (*table`:6) 1 2 3 4 5
┌─┬─────────────┐
│ │1  2  3  4  5│
├─┼─────────────┤
│2│2  4  6  8 10│
│3│3  6  9 12 15│
│5│5 10 15 20 25│
└─┴─────────────┘
J has changed. Earlier versions allowed adverbs and conjunctions to be defined in ways that are no longer possible.
A version of table that is compatible with recent versions of J appears in the J Dictionary under "Bordering a Table".
Related
I have a dataframe with a number of features. There is one particular feature that is totally dynamic, and I aim to encode it. I cannot use one-hot encoding because the number of unique values can change. LabelEncoder could be of use, but can the number of classes/target labels be changed?
Consider an example (the Name feature):
index | A | B | Name
------+---+---+-----
    1 | 5 | 6 | abc
    2 | 4 | 7 | abc
    2 | 3 | 0 | def
    2 | 3 | 0 | ghi
    3 | 3 | 0 | abc
    3 | 3 | 0 | def
I wish to encode it as
index | A | B | Name
------+---+---+-----
    1 | 5 | 6 | 1
    2 | 4 | 7 | 1
    2 | 3 | 0 | 2
    2 | 3 | 0 | 3
    3 | 3 | 0 | 1
    3 | 3 | 0 | 2
Also keep in mind that if, later on, a value different from all of these comes up, it should automatically be stored in the encoder with the next successive value. For example, if the next row's input is
index | A | B | Name
------+---+---+-----
    1 | 5 | 6 | xyz
it gets encoded and used as
index | A | B | Name
------+---+---+-----
    1 | 5 | 6 | 4
And how do I get the original value back?
You can try factorize, which numbers values in order of first appearance:
df.Name=df.Name.factorize()[0]+1
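A minimal sketch of the full round trip with pd.factorize, which also answers how to get the original values back: keep the uniques array that factorize returns alongside the codes (assuming df is the frame from the question):
import pandas as pd

codes, uniques = pd.factorize(df['Name'])  # codes are 0-based, in order of first appearance
df['Name'] = codes + 1                     # 1-based, matching the desired output

# later: map the stored codes back to the original strings
originals = uniques[(df['Name'] - 1).to_numpy()]
Note that factorize alone does not remember anything between calls, so to give a genuinely new value the next free code later, you would need to keep uniques around and extend it yourself (or re-factorize the full column).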
You can use astype category and then use the category accessor .cat to get the assigned codes:
df['Name'] = df['Name'].astype('category').cat.codes + 1
Output:
index A B Name
0 1 5 6 1
1 2 4 7 1
2 2 3 0 2
3 2 3 0 3
4 3 3 0 1
5 3 3 0 2
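One caveat with this route, plus the reverse mapping: astype('category') orders categories lexically, while factorize numbers them by first appearance; the two agree here only because 'abc', 'def', 'ghi' happen to appear in sorted order. The categories index gives the way back. A sketch, again assuming df from the question:
cat = df['Name'].astype('category')  # done before overwriting the column
codes = cat.cat.codes + 1            # what gets stored in 'Name'

# reverse mapping: code -> original string
originals = cat.cat.categories[(codes - 1).to_numpy()]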
I have a dataframe that looks like this:
ID EVENT DATE
1 1 142
1 5 167
1 3 245
2 1 54
2 5 87
3 3 165
3 2 178
And I would like to generate something like this:
EVENT_1 EVENT_2 COUNT
1 5 2
5 3 1
3 2 1
The idea is to count how many items (IDs) go from one event to the next. I don't care about previous states; I just want to consider the next state from the current state (e.g., for ID 1, I don't want to count a transition from 1 to 3, because it first goes to event 5 and then to 3).
The date format is the number of days from a specific date (sort of like SAS format).
Is there a clean way to achieve this?
Let's try this:
(df.groupby([df['EVENT'].rename('EVENT_1'),
             df.groupby('ID')['EVENT'].shift(-1).rename('EVENT_2')])['ID']
   .count()).rename('COUNT').reset_index().astype(int)
Output:
| | EVENT_1 | EVENT_2 | COUNT |
|---:|----------:|----------:|--------:|
| 0 | 1 | 5 | 2 |
| 1 | 3 | 2 | 1 |
| 2 | 5 | 3 | 1 |
Details: group by 'EVENT' paired with the next 'EVENT' within each ID (the shift(-1)), then count. Rows where the shifted value is NaN (the last event of each ID) drop out automatically, since NaN group keys are excluded.
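For reference, here is what the intermediate grouped shift produces on the sample data (a sketch; the frame is reconstructed from the question):
import pandas as pd

df = pd.DataFrame({'ID':    [1, 1, 1, 2, 2, 3, 3],
                   'EVENT': [1, 5, 3, 1, 5, 3, 2],
                   'DATE':  [142, 167, 245, 54, 87, 165, 178]})

# next EVENT within each ID; the last row of each ID has no successor
print(df.groupby('ID')['EVENT'].shift(-1))
# 0    5.0
# 1    3.0
# 2    NaN
# 3    5.0
# 4    NaN
# 5    2.0
# 6    NaN
# Name: EVENT, dtype: float64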
You could use groupby and shift. We'll also use rename_axis and reset_index to tidy up the final output:
(pd.concat([f.groupby([f['EVENT'], f['EVENT'].shift(-1).astype('Int64')]).size()
            for _, f in df.groupby('ID')])
   .groupby(level=[0, 1]).sum()
   .rename_axis(['EVENT_1', 'EVENT_2']).reset_index(name='COUNT'))
Output:
EVENT_1 EVENT_2 COUNT
0 1 5 2
1 3 2 1
2 5 3 1
I have a dataframe that looks like:
A B
1 2
1 3
1 4
2 5
2 6
3 7
3 8
If I do df.groupby('A'), how do I turn each group into a sub-dataframe, so it will look like this? For A=1,
A B
1 2
1 3
1 4
for A=2,
A B
2 5
2 6
for A=3,
A B
3 7
3 8
By using get_group
g=df.groupby('A')
g.get_group(1)
Out[367]:
A B
0 1 2
1 1 3
2 1 4
You are close; you need to convert the groupby object to a dictionary of DataFrames:
dfs = dict(tuple(df.groupby('A')))
print (dfs[1])
A B
0 1 2
1 1 3
2 1 4
print (dfs[2])
A B
3 2 5
4 2 6
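If the aim is simply to process every sub-dataframe in turn rather than to look one up by key, you can also iterate over the groupby object directly; a minimal sketch under the same data assumptions:
# each iteration yields the group key and the corresponding sub-dataframe
for name, sub in df.groupby('A'):
    print(f"A={name}")
    print(sub)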
Hi all, I have the following dataframe:
A | B | C
1 | 2 | 3
2 | 3 | 4
3 | 4 | 5
4 | 5 | 6
And I am trying to repeat only the last two rows of the data, so that it looks like this:
A | B | C
1 | 2 | 3
2 | 3 | 4
3 | 4 | 5
3 | 4 | 5
4 | 5 | 6
4 | 5 | 6
I have tried using append, concat and repeat to no avail.
repeated = lambda x:x.repeat(2)
df.append(df[-2:].apply(repeated),ignore_index=True)
This returns the following dataframe, which is incorrect:
A | B | C
1 | 2 | 3
2 | 3 | 4
3 | 4 | 5
4 | 5 | 6
3 | 4 | 5
3 | 4 | 5
4 | 5 | 6
4 | 5 | 6
You can use numpy.repeat to repeat the last two index values, create df1 with loc, and then append it to the original, after first filtering out the last 2 rows with iloc:
df1 = df.loc[np.repeat(df.index[-2:].values, 2)]
print (df1)
A B C
2 3 4 5
2 3 4 5
3 4 5 6
3 4 5 6
print (df.iloc[:-2])
A B C
0 1 2 3
1 2 3 4
df = df.iloc[:-2].append(df1,ignore_index=True)
print (df)
A B C
0 1 2 3
1 2 3 4
2 3 4 5
3 3 4 5
4 4 5 6
5 4 5 6
If you want to use your own code, add iloc to filter only the last 2 rows:
repeated = lambda x:x.repeat(2)
df = df.iloc[:-2].append(df.iloc[-2:].apply(repeated),ignore_index=True)
print (df)
A B C
0 1 2 3
1 2 3 4
2 3 4 5
3 3 4 5
4 4 5 6
5 4 5 6
Use pd.concat and index slicing with .iloc:
pd.concat([df,df.iloc[-2:]]).sort_values(by='A')
Output:
A B C
0 1 2 3
1 2 3 4
2 3 4 5
2 3 4 5
3 4 5 6
3 4 5 6
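Note that sort_values(by='A') restores the order here only because column 'A' is already sorted. A data-independent variant (a sketch, assuming the default integer index) sorts on the original index instead, using a stable sort so each original row stays ahead of its duplicate:
import pandas as pd

# duplicate the last two rows, then restore row order by original index
out = pd.concat([df, df.iloc[-2:]]).sort_index(kind='mergesort').reset_index(drop=True)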
I'm partial to manipulating the index into the pattern we are aiming for, then asking the dataframe to take the new form.
Option 1
Use pd.DataFrame.reindex
df.reindex(df.index[:-2].append(df.index[-2:].repeat(2)))
A B C
0 1 2 3
1 2 3 4
2 3 4 5
2 3 4 5
3 4 5 6
3 4 5 6
Same thing in multiple lines
i = df.index
idx = i[:-2].append(i[-2:].repeat(2))
df.reindex(idx)
Could also use loc
i = df.index
idx = i[:-2].append(i[-2:].repeat(2))
df.loc[idx]
Option 2
Reconstruct from values. Only do this if all dtypes are the same.
i = np.arange(len(df))
idx = np.append(i[:-2], i[-2:].repeat(2))
pd.DataFrame(df.values[idx], df.index[idx])
0 1 2
0 1 2 3
1 2 3 4
2 3 4 5
2 3 4 5
3 4 5 6
3 4 5 6
Option 3
Can also use np.array in iloc
i = np.arange(len(df))
idx = np.append(i[:-2], i[-2:].repeat(2))
df.iloc[idx]
A B C
0 1 2 3
1 2 3 4
2 3 4 5
2 3 4 5
3 4 5 6
3 4 5 6
I want to calculate the probability of all the data in a dataframe column according to its own distribution. For example, my data looks like this:
data
0 1
1 1
2 2
3 3
4 2
5 2
6 7
7 8
8 3
9 4
10 1
And the output I expect like this:
data pro
0 1 0.155015
1 1 0.155015
2 2 0.181213
3 3 0.157379
4 2 0.181213
5 2 0.181213
6 7 0.048717
7 8 0.044892
8 3 0.157379
9 4 0.106164
10 1 0.155015
I also referred to another question (How to compute the probability ...) to get the example above. My code is as follows:
import pandas as pd
import scipy.stats

samples = [1, 1, 2, 3, 2, 2, 7, 8, 3, 4, 1]
samples = pd.DataFrame(samples, columns=['data'])
print(samples)
kde = scipy.stats.gaussian_kde(samples['data'].tolist())
samples['pro'] = kde.pdf(samples['data'].tolist())
print(samples)
But the problem is that if my column is too long, the operation becomes slow. Is there a better way to do it in pandas? Thanks in advance.
"Its own distribution" does not necessarily mean a KDE; for the empirical distribution you can use value_counts with normalize=True:
df.assign(pro=df.data.map(df.data.value_counts(normalize=True)))
data pro
0 1 0.272727
1 1 0.272727
2 2 0.272727
3 3 0.181818
4 2 0.272727
5 2 0.272727
6 7 0.090909
7 8 0.090909
8 3 0.181818
9 4 0.090909
10 1 0.272727
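If the Gaussian KDE value is actually what's wanted and the column is long, most of the cost is evaluating kde.pdf once per row; evaluating it only on the distinct values and mapping back gives the same 'pro' column. A sketch, assuming SciPy is available:
import pandas as pd
import scipy.stats

samples = pd.DataFrame({'data': [1, 1, 2, 3, 2, 2, 7, 8, 3, 4, 1]})

kde = scipy.stats.gaussian_kde(samples['data'])
uniq = samples['data'].unique()
# evaluate the density once per distinct value, then broadcast with map
samples['pro'] = samples['data'].map(dict(zip(uniq, kde.pdf(uniq))))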