How to reshape data array from wide to long in J?

I would like to replicate Stata's reshape function in J.
For example, Stata can reshape a dataset "from wide to long". Below is Example 1 from its documentation:
. use http://www.stata-press.com/data/r11/reshape1.dta
. list
+-------------------------------------------------------+
| id   sex   inc80   inc81   inc82   ue80   ue81   ue82 |
|-------------------------------------------------------|
|  1     0    5000    5500    6000      0      1      0 |
|  2     1    2000    2200    3300      1      0      0 |
|  3     0    3000    2000    1000      0      0      1 |
+-------------------------------------------------------+
. reshape long inc ue, i(id) j(year)
. list
+-----------------------------+
| id   year   sex    inc   ue |
|-----------------------------|
|  1     80     0   5000    0 |
|  1     81     0   5500    1 |
|  1     82     0   6000    0 |
|  2     80     1   2000    1 |
|  2     81     1   2200    0 |
|  2     82     1   3300    0 |
|  3     80     0   3000    0 |
|  3     81     0   2000    0 |
|  3     82     0   1000    1 |
+-----------------------------+
NB. Python Pandas has a similar function ("stack").
I understand that J can import the data file (CSV format) as follows.
load 'web/gethttp'
] dataset =: gethttp 'https://bbbyc.github.io/reshape1.csv'
load 'tables/csv'
] dataInJArray =: fixcsv dataset
I am lost after getting this dataInJArray. How can I reshape it? I'd appreciate any hints or advice!
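For reference, the equivalent reshape in pandas looks roughly like this (an untested sketch; pd.wide_to_long is arguably closer to Stata's reshape than stack, and it assumes the CSV carries the headers id, sex, inc80, ..., ue82):

import pandas as pd

# Sketch of the target behaviour in pandas, assuming the CSV headers
# are id, sex, inc80, inc81, inc82, ue80, ue81, ue82.
df = pd.read_csv('https://bbbyc.github.io/reshape1.csv')
long = pd.wide_to_long(df, stubnames=['inc', 'ue'], i='id', j='year')
print(long.reset_index())  # columns: id, year, sex, inc, ue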

To actually work on your specific problem using J you could do this:
NB. t is the data to be stacked:
[ t=: 3 8 $ 1 0 5000 5500 6000 0 1 0 2 1 2000 2200 3300 1 0 0 3 0 3000 2000 1000 0 0 1
1 0 5000 5500 6000 0 1 0
2 1 2000 2200 3300 1 0 0
3 0 3000 2000 1000 0 0 1
You can select and combine the different columns appropriately:
({. ,. 1&{ ,. (2 3 4 & {),. (5 6 7 & {))"1 t
1 0 5000 0
1 0 5500 1
1 0 6000 0

2 1 2000 1
2 1 2200 0
2 1 3300 0

3 0 3000 0
3 0 2000 0
3 0 1000 1
The result is a rank-3 array, displayed with blank lines between the groups, so you apply ,/ to flatten it into a single table:
[ s0 =. ,/@:(({. ,. 1&{ ,. (2 3 4 & {),. (5 6 7 & {))"1) t
1 0 5000 0
1 0 5500 1
1 0 6000 0
2 1 2000 1
2 1 2200 0
2 1 3300 0
3 0 3000 0
3 0 2000 0
3 0 1000 1
I am not sure how well this generalizes, but variations could be used on tables with any number of records, provided they are organized appropriately.
To finish off the formatting and introduce the years:
[s1=. ,. each <"1 |: s0 NB. years inserted in the next step
+-+-+----+-+
|1|0|5000|0|
|1|0|5500|1|
|1|0|6000|0|
|2|1|2000|1|
|2|1|2200|0|
|2|1|3300|0|
|3|0|3000|0|
|3|0|2000|0|
|3|0|1000|1|
+-+-+----+-+
[s2=. ({. , ,.@:(9 $ 80 81 82"_); }.)s1 NB. 80 81 82"_ creates a verb that returns 80 81 82 given any argument
+-+--+-+----+-+
|1|80|0|5000|0|
|1|81|0|5500|1|
|1|82|0|6000|0|
|2|80|1|2000|1|
|2|81|1|2200|0|
|2|82|1|3300|0|
|3|80|0|3000|0|
|3|81|0|2000|0|
|3|82|0|1000|1|
+-+--+-+----+-+
('id';'year';'sex';'inc';'ue'),:s2
+--+----+---+----+--+
|id|year|sex|inc |ue|
+--+----+---+----+--+
|1 |80 |0 |5000|0 |
|1 |81 |0 |5500|1 |
|1 |82 |0 |6000|0 |
|2 |80 |1 |2000|1 |
|2 |81 |1 |2200|0 |
|2 |82 |1 |3300|0 |
|3 |80 |0 |3000|0 |
|3 |81 |0 |2000|0 |
|3 |82 |0 |1000|1 |
+--+----+---+----+--+

Related

Group rows based on the current occurrence of a variable

I am trying to group a dataframe based on the occurrence of a variable. For example, take this dataframe:
   | col_1 | col_2
------------------
0  |   1   |   1
1  |   0   |   1
2  |   0   |   1
3  |   0   |  -1
4  |   0   |  -1
5  |   0   |  -1
6  |   0   |  NaN
7  |  -1   |  NaN
8  |   0   |  NaN
9  |   0   |  -1
10 |   0   |  -1
11 |   0   |  -1
I want to split the dataframe into one dataframe per consecutive run of the same value in col_2, taking each next run into another dataframe and so on until the end, while ignoring the NaN rows.
So the final output would be like:
ones_1 =
   | col_1 | col_2
------------------
0  |   1   |   1
1  |   0   |   1
2  |   0   |   1

mones_1 =
3  |   0   |  -1
4  |   0   |  -1
5  |   0   |  -1

mones_2 =
9  |   0   |  -1
10 |   0   |  -1
11 |   0   |  -1
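For reference, the example frame can be built like this (a minimal setup sketch so the snippets below run as-is):

import numpy as np
import pandas as pd

# Matches the table in the question: col_2 has three runs (1s, -1s, -1s)
# separated by NaN rows.
df = pd.DataFrame({
    'col_1': [1, 0, 0, 0, 0, 0, 0, -1, 0, 0, 0, 0],
    'col_2': [1, 1, 1, -1, -1, -1, np.nan, np.nan, np.nan, -1, -1, -1],
})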
I suggest creating a dictionary of DataFrames:
#only non missing rows
mask = df['col_2'].notna()
#create unique groups
g = df['col_2'].ne(df['col_2'].shift()).cumsum()
#create counter of filtered g
g = g[mask].groupby(df['col_2']).transform(lambda x:pd.factorize(x)[0]) + 1
#map positive and negative values to strings and add counter values
g = df.loc[mask, 'col_2'].map({-1:'mones_',1:'ones_'}) + g.astype(str)
#generally groups
#g = 'val' + df.loc[mask, 'col_2'].astype(str) + ' no' + g.astype(str)
print (g)
0 ones_1
1 ones_1
2 ones_1
3 mones_1
4 mones_1
5 mones_1
9 mones_2
10 mones_2
11 mones_2
Name: col_2, dtype: object
#create dictionary of DataFrames
dfs = dict(tuple(df.groupby(g)))
print (dfs)
{'mones_1': col_1 col_2
3 0 -1.0
4 0 -1.0
5 0 -1.0, 'mones_2': col_1 col_2
9 0 -1.0
10 0 -1.0
11 0 -1.0, 'ones_1': col_1 col_2
0 1 1.0
1 0 1.0
2 0 1.0}
#select by keys
print (dfs['ones_1'])
col_1 col_2
0 1 1.0
1 0 1.0
2 0 1.0
It is not recommended, but it is possible to create DataFrames with variable names per group:
for i, g in df.groupby(g):
    globals()[i] = g
print (ones_1)
col_1 col_2
0 1 1.0
1 0 1.0
2 0 1.0
Here is another approach (keeping them in a dictionary is the idea again):
m=df[df.col_2.notna()] #filter out the NaN rows
#start a new group when the value changes or the index is not consecutive
s=m.col_2.ne(m.col_2.shift())|m.index.to_series().diff().fillna(1).gt(1)
dfs={f'df_{int(i)}':g for i , g in df.groupby(s.cumsum())} #groupby and store in dict
Access the dataframes by accessing the keys:
print(dfs['df_1'])
print('---------------------------------')
print(dfs['df_2'])
print('---------------------------------')
print(dfs['df_3'])
col_1 col_2
0 1 1.0
1 0 1.0
2 0 1.0
---------------------------------
col_1 col_2
3 0 -1.0
4 0 -1.0
5 0 -1.0
---------------------------------
col_1 col_2
9 0 -1.0
10 0 -1.0
11 0 -1.0

Creating a new column based on other columns' values with specific requirements in a Python dataframe

I want to create a new column in a pandas dataframe based on specific requirements on other columns. For example, my dataframe df:
 A | B
-------
 5 | 0
 5 | 1
15 | 1
10 | 1
10 | 1
20 | 2
15 | 2
10 | 2
 5 | 3
15 | 3
10 | 4
20 | 0
I want to create a new column C with the below requirements:
When the value of B = 0, then C = 0.
Rows with the same value in B get the same value in C. Within a group of equal B values, rows are classified as start, middle, and end. So for value 1 there is 1 start, 2 middle, and 1 end; for value 3 there is 1 start, 0 middle, and 1 end. The calculation for each section uses a threshold that I specify:
threshold = 10.
Let's look at the values with B = 1:
Start:
C.loc[2] = min(threshold, A.loc[1]) + A.loc[2]
Middle:
C.loc[3] = A.loc[3]
C.loc[4] = A.loc[4]
End:
C.loc[5] = min(threshold, A.loc[6])
The output value of C for the whole group is the sum of the above calculations.
When the value of B is unique and not 0, for example when B = 4:
C[10] = min(threshold, A.loc[9]) + min(threshold, A.loc[11])
I can handle the B = 0 case and the unique case, but I'm struggling with the start/middle/end case.
So, the final output will be:
 A | B |  C
------------
 5 | 0 |  0
 5 | 1 | 45
15 | 1 | 45
10 | 1 | 45
10 | 1 | 45
20 | 2 | 50
15 | 2 | 50
10 | 2 | 50
 5 | 3 | 25
15 | 3 | 25
10 | 4 | 20
20 | 0 |  0
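One reading that reproduces the sample output (a hedged sketch of my own, not an accepted solution): treat each consecutive run of equal, non-zero B as a group; its C is min(threshold, A just before the run), plus the run's A values except the last row, plus min(threshold, A just after the run), with a missing neighbour contributing 0.

import pandas as pd

df = pd.DataFrame({
    'A': [5, 5, 15, 10, 10, 20, 15, 10, 5, 15, 10, 20],
    'B': [0, 1, 1, 1, 1, 2, 2, 2, 3, 3, 4, 0],
})
threshold = 10

a = df['A'].tolist()
c = [0] * len(df)
runs = (df['B'] != df['B'].shift()).cumsum()   # consecutive equal-B runs
for _, grp in df.groupby(runs):
    if grp['B'].iloc[0] == 0:
        continue                                # rule: B == 0 -> C = 0
    idx = list(grp.index)
    total = sum(a[i] for i in idx[:-1])         # run rows except the last
    if idx[0] > 0:
        total += min(threshold, a[idx[0] - 1])  # clipped left neighbour
    if idx[-1] < len(a) - 1:
        total += min(threshold, a[idx[-1] + 1]) # clipped right neighbour
    for i in idx:
        c[i] = total
df['C'] = c
print(df)  # reproduces the 45 / 50 / 25 / 20 values above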

Pivot table based on groupby in Pandas

I have a dataframe like this:
customer_id | date | category
1 | 2017-2-1 | toys
2 | 2017-2-1 | food
1 | 2017-2-1 | drinks
3 | 2017-2-2 | computer
2 | 2017-2-1 | toys
1 | 2017-3-1 | food
>>> import pandas as pd
>>> dt = dict(customer_id=[1, 2, 1, 3, 2, 1],
...           date='2017-2-1 2017-2-1 2017-2-1 2017-2-2 2017-2-1 2017-3-1'.split(),
...           category=["toys", "food", "drinks", "computer", "toys", "food"])
>>> df = pd.DataFrame(dt)
To one-hot encode the categories into new columns, I know I can use df.pivot_table(index = ['customer_id'], columns = ['category']).
>>> df['Indicator'] = 1
>>> df.pivot_table(index=['customer_id'], columns=['category'],
values='Indicator').fillna(0).astype(int)
category computer drinks food toys
customer_id
1 0 1 1 1
2 0 0 1 1
3 1 0 0 0
>>>
I also want to group by date so that each row only contains information from one date, as in the desired output below: id 1 has two rows because it has two unique dates in the date column.
customer_id | toys | food | drinks | computer
1 | 1 | 0 | 1 | 0
1 | 0 | 1 | 0 | 0
2 | 1 | 1 | 0 | 0
3 | 0 | 0 | 0 | 1
You may be looking for crosstab:
>>> pd.crosstab([df.customer_id,df.date], df.category)
category computer drinks food toys
customer_id date
1 2017-2-1 0 1 0 1
2017-3-1 0 0 1 0
2 2017-2-1 0 0 1 1
3 2017-2-2 1 0 0 0
>>>
>>> pd.crosstab([df.customer_id, df.date],
...             df.category).reset_index(level=1)
category date computer drinks food toys
customer_id
1 2017-2-1 0 1 0 1
1 2017-3-1 0 0 1 0
2 2017-2-1 0 0 1 1
3 2017-2-2 1 0 0 0
>>>
>>> pd.crosstab([df.customer_id, df.date],
...             df.category).reset_index(level=1, drop=True)
category computer drinks food toys
customer_id
1 0 1 0 1
1 0 0 1 0
2 0 0 1 1
3 1 0 0 0
>>>
Assuming your frame is called df, you could add an indicator column and then directly use .pivot_table:
df['Indicator'] = 1
pvt = df.pivot_table(index=['date', 'customer_id'],
                     columns='category',
                     values='Indicator').fillna(0)
This gives a dataframe that looks like:
category computer drinks food toys
date customer_id
2017-2-1 1 0.0 1.0 0.0 1.0
2 0.0 0.0 1.0 1.0
2017-2-2 3 1.0 0.0 0.0 0.0
2017-3-1 1 0.0 0.0 1.0 0.0
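For what it's worth, a get_dummies-based sketch (my own variation, not from either answer) reaches the same table without the indicator column:

import pandas as pd

# One-hot encode the category, then take the per-(customer_id, date)
# maximum so a repeated category on the same day still yields a single 1.
dummies = pd.get_dummies(df['category'])
out = dummies.groupby([df['customer_id'], df['date']]).max()
print(out.reset_index(level=1, drop=True))  # drop the date level, as above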

Write 1s faster to col-rows based on positions in a list

I'm new to pandas. I'm using a dataframe to tally how many times two positions match.
Here is the code in question, right at the start; the "what am I trying to accomplish" is below.
def crossovers(df, index):
    # Duplicate the dataframe passed in
    _dfcopy = df.copy(deep=True)
    # Set all values to 0
    _dfcopy[:] = 0.0
    # Change the value of any col/row where there's a shared SNP
    for i in index:
        for j in index:
            if i == j: continue  # Don't include self as a shared SNP
            _dfcopy[i][j] = 1
    # Return the DataFrame.
    # Should only contain 0s (no shared SNP) or 1s (a shared SNP)
    return _dfcopy
QUESTION:
The code flips all the 0s in a dataframe to 1s at every intersection of the rows/columns named in a list (see details below).
I.e. if the list is
_indices = [0,2,3]
...all the locations at (0,2); (0,3); (2,0); (2,3); (3,0); and (3,2) get flipped to 1s.
Currently I do this by iterating the list against itself in a nested loop, but this is painfully slow, and I'm passing in 16 million lines of data (16 million index lists).
How can I speed up this overall process?
LONGER DESCRIPTION
I start with a dataframe called sharedby_BOTH similar to the one below, except much larger (70 cols x 70 rows). I'm using it to tally occurrences of shared data intersections.
Rows (index) are labeled 0, 1, 2, 3, 4, ... 70, as are the columns. Each location contains a 0.
sharedby_BOTH
0 1 2 3 4 (more)
------------------
0 | 0 | 0 | 0 | 0 | 0
1 | 0 | 0 | 0 | 0 | 0
2 | 0 | 0 | 0 | 0 | 0
3 | 0 | 0 | 0 | 0 | 0
4 | 0 | 0 | 0 | 0 | 0
(more)
Then I have a list, which contains intersecting data.
_indices = [0,2,3 (more)] # for example
This means that 0, 2, & 3 all contain shared data. So, I pass it to crossovers which returns a dataframe with a "1" at the intersection places, obtaining this...
0 1 2 3 4 (more)
------------------
0 | 0 | 0 | 1 | 1 | 0
1 | 0 | 0 | 0 | 0 | 0
2 | 1 | 0 | 0 | 1 | 0
3 | 1 | 0 | 1 | 0 | 0
4 | 0 | 0 | 0 | 0 | 0
(more)
...where the shared data locations are (0,2),(0,3),(2,0),(2,3),(3,0),(3,2).
Notice that self is not counted: (0,0), (2,2), and (3,3) do NOT have 1s.
Then I add this to the original dataframe with this code (inside a loop)...
sharedby_BOTH = sharedby_BOTH.add(crossovers(sharedby_BOTH, _indices))
I repeat this in a loop...
for pos, pos_val in chrom_val.items():  # pos_val is a dict
    _indices = [i for i, x in enumerate(pos_val["sharedby"]) if (x == "HET")]
    sharedby_BOTH = sharedby_BOTH.add(crossovers(sharedby_BOTH, _indices))
The end result is that sharedby_BOTH will look like the following, if I added the three example _indices
sharedby_BOTH = sharedby_BOTH.add(crossovers(sharedby_BOTH, [0,2,3] ))
sharedby_BOTH = sharedby_BOTH.add(crossovers(sharedby_BOTH, [0,2,4] ))
sharedby_BOTH = sharedby_BOTH.add(crossovers(sharedby_BOTH, [0,2,3] ))
0 1 2 3 4 (more)
------------------
0 | 0 | 0 | 3 | 2 | 1
1 | 0 | 0 | 0 | 0 | 0
2 | 3 | 0 | 0 | 2 | 1
3 | 2 | 0 | 2 | 0 | 0
4 | 1 | 0 | 1 | 0 | 0
(more)
...where, amongst the three index lists passed in...
0 shared data with 2 a total of three times, so (0,2) and (2,0) total three.
0 shared data with 3 twice, so (0,3) and (3,0) total two.
0 shared data with 4 only once, so (0,4) and (4,0) total one.
I hope this makes sense :)
EDIT
I did try the following...
addit = pd.DataFrame(1, index=_indices, columns=_indices)
sharedby_BOTH = sharedby_BOTH.add(addit)
BUT...then any locations within sharedby_BOTH that DID NOT HAVE SHARED DATA ended up as NAN
I.e...
sharedby_BOTH = pd.DataFrame(0, index=[x for x in range(4)], columns=[x for x in range(4)])
_indices = [0,2,3 (more)] # for example
addit = pd.DataFrame(1, index=_indices, columns=_indices)
sharedby_BOTH = sharedby_BOTH.add(addit)
0 1 2 3 4 (more)
------------------
0 | NAN | NAN | 1 | 1 | NAN
1 | NAN | NAN | NAN | NAN | NAN
2 | 1 | NAN | NAN | 1 | NAN
3 | 1 | NAN | 1 | NAN | NAN
4 | NAN | NAN | NAN | NAN | NAN
(more)
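For reference, the NaNs in that attempt come from index alignment: .add() only fills labels present in both frames unless told otherwise. A minimal fix sketch (the _indices and frame size here are just the example values from above):

import numpy as np
import pandas as pd

sharedby_BOTH = pd.DataFrame(0, index=range(5), columns=range(5))
_indices = [0, 2, 3]
addit = pd.DataFrame(1, index=_indices, columns=_indices)
np.fill_diagonal(addit.values, 0)  # plain DataFrame(1, ...) puts 1s on the
                                   # diagonal, unlike crossovers(), so zero it
# fill_value=0 treats labels missing from `addit` as 0 instead of NaN
sharedby_BOTH = sharedby_BOTH.add(addit, fill_value=0)
print(sharedby_BOTH)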
I'd organize it with numpy slice assignment and the handy np.triu_indices function. It returns the row and column indices of the upper triangle. I make sure to pass k=1 to ensure I skip the diagonal. When I slice assign, I make sure to use both i, j and j, i to get the upper and lower triangles.
def xover(n, idx):
    idx = np.asarray(idx)
    a = np.zeros((n, n))
    i_, j_ = np.triu_indices(len(idx), 1)
    i = idx[i_]
    j = idx[j_]
    a[i, j] = 1
    a[j, i] = 1
    return a
pd.DataFrame(xover(len(df), [0, 2, 3]), df.index, df.columns)
0 1 2 3
0 0.0 0.0 1.0 1.0
1 0.0 0.0 0.0 0.0
2 1.0 0.0 0.0 1.0
3 1.0 0.0 1.0 0.0
Timings
%timeit pd.DataFrame(xover(len(df), [0, 2, 3]), df.index, df.columns)
10000 loops, best of 3: 192 µs per loop
%%timeit
for i, j in product(li, repeat=2):
    if i != j:
        ndf.loc[i, j] = 1
100 loops, best of 3: 6.8 ms per loop
You can use itertools.product and loc for assignment, i.e.
from itertools import product

li = [0, 2, 3]
ndf = df.copy()
for i, j in product(li, repeat=2):
    if i != j:
        ndf.loc[i, j] = 1
0 1 2 3 4
0 0 0 1 1 0
1 0 0 0 0 0
2 1 0 0 1 0
3 1 0 1 0 0
4 0 0 0 0 0

Complete Truth Tables Based On Binary

I am trying to figure out names for every combination with truth tables.
In the first table, I have every truth table for a two-input, one-output system. The inputs are read by row. The outputs are in a binary counted format: each output is read by column and is labeled with a hex number 0 to F, and the input row determines which bit of that column applies.
In the second table, I have listed by row how each output column of the first chart works: the logic gate name, the equivalent if-statement condition in JavaScript, and a description of how each would work. A hyphen marks entries that are not complete.
Are there names for the blank spaces in the gate names in the second table?
Complete Truth Tables
Inputs | Outputs
1 2 | 0 1 2 3 4 5 6 7 8 9 A B C D E F
-----------------------------------------
0 0 | 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1
0 1 | 0 0 0 0 1 1 1 1 0 0 0 0 1 1 1 1
1 0 | 0 0 1 1 0 0 1 1 0 0 1 1 0 0 1 1
1 1 | 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1
Num | Gate  | Javascript | Return True If
--- | ----- | ---------- | --------------
0   | -     | 0          | FALSE
1   | AND   | I1&&I2     | I1 AND I2
2   | -     | I1&&!I2    | I1 AND NOT I2
3   | -     | I1         | I1
4   | -     | !I1&&I2    | I2 AND NOT I1
5   | -     | I2         | I2
6   | XOR   | I1!==I2    | I1 NOT EQUALS I2
7   | OR    | I1||I2     | I1 OR I2
8   | NOR   | !I1&&!I2   | NOT I1 AND NOT I2
9   | XNOR  | I1==I2     | I1 EQUALS I2
A   | -     | !I2        | NOT I2
B   | -     | !(!I1&&I2) | NOT ( I2 AND NOT I1 )
C   | -     | !I1        | NOT I1
D   | -     | !(I1&&!I2) | NOT ( I1 AND NOT I2 )
E   | NAND  | !I1||!I2   | NOT I1 OR NOT I2
F   | -     | 1          | TRUE
Some of the other combinations have gate names, but not all do.
The A and C cases are each an example of a NOT gate, and the 3 and 5 cases are each an example of a BUFFER.
The D case is known as an IMPLY gate, but this is not as commonly known as the others.
For the rest, there are no commonly used gate names because to implement their boolean function would require either no gates (as in TRUE and FALSE), or they would require a combination of two or more of the conventional gates that you have already identified. There may be specific implementations of tools or systems that have created names for these "quasi-gates", but they are not in common use.
See Also
Logic Gate (Wikipedia)
Imply Gate (Wikipedia)
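As a sanity check on the first table, here is a short Python sketch that regenerates all 16 output columns from the "binary counted" rule: column n's output for input row (I1, I2) is bit 3 - (2*I1 + I2) of n.

# Reprints the four input rows of the "Complete Truth Tables" chart,
# with the 16 output bits per row derived directly from the column number.
for i1, i2 in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    bit = 3 - (2 * i1 + i2)                    # row (0,0) is the high bit
    outputs = [(n >> bit) & 1 for n in range(16)]
    print(i1, i2, '|', ' '.join(map(str, outputs)))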
