Pivot table based on groupby in Pandas - python-3.x

I have a dataframe like this:
customer_id | date | category
1 | 2017-2-1 | toys
2 | 2017-2-1 | food
1 | 2017-2-1 | drinks
3 | 2017-2-2 | computer
2 | 2017-2-1 | toys
1 | 2017-3-1 | food
>>> import pandas as pd
>>> dt = dict(customer_id=[1, 2, 1, 3, 2, 1],
...           date='2017-2-1 2017-2-1 2017-2-1 2017-2-2 2017-2-1 2017-3-1'.split(),
...           category=["toys", "food", "drinks", "computer", "toys", "food"])
>>> df = pd.DataFrame(dt)
To turn the category values into new columns and one-hot encode them, I know I can use df.pivot_table(index=['customer_id'], columns=['category']).
>>> df['Indicator'] = 1
>>> df.pivot_table(index=['customer_id'], columns=['category'],
...                values='Indicator').fillna(0).astype(int)
category computer drinks food toys
customer_id
1 0 1 1 1
2 0 0 1 1
3 1 0 0 0
>>>
I also want to group by date so that each row only contains information from the same date, as in the desired output below: id 1 has two rows because it appears on two unique dates in the date column.
customer_id | toys | food | drinks | computer
1 | 1 | 0 | 1 | 0
1 | 0 | 1 | 0 | 0
2 | 1 | 1 | 0 | 0
3 | 0 | 0 | 0 | 1

You may be looking for crosstab:
>>> pd.crosstab([df.customer_id,df.date], df.category)
category computer drinks food toys
customer_id date
1 2017-2-1 0 1 0 1
2017-3-1 0 0 1 0
2 2017-2-1 0 0 1 1
3 2017-2-2 1 0 0 0
>>>
>>> pd.crosstab([df.customer_id,df.date],
df.category).reset_index(level=1)
category date computer drinks food toys
customer_id
1 2017-2-1 0 1 0 1
1 2017-3-1 0 0 1 0
2 2017-2-1 0 0 1 1
3 2017-2-2 1 0 0 0
>>>
>>> pd.crosstab([df.customer_id, df.date],
df.category).reset_index(level=1, drop=True)
category computer drinks food toys
customer_id
1 0 1 0 1
1 0 0 1 0
2 0 0 1 1
3 1 0 0 0
>>>
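Note that crosstab counts occurrences, so if a customer had the same category more than once on a date the cell would be greater than 1; if you want a strict 0/1 indicator, one option (not needed for the sample data) is to clip the counts:
>>> pd.crosstab([df.customer_id, df.date], df.category).clip(upper=1)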

Assuming your frame is called df, you could add an indicator column and then directly use .pivot_table:
df['Indicator'] = 1
pvt = df.pivot_table(index=['date', 'customer_id'],
                     columns='category',
                     values='Indicator')\
        .fillna(0)
This gives a dataframe that looks like:
category computer drinks food toys
date customer_id
2017-2-1 1 0.0 1.0 0.0 1.0
2 0.0 0.0 1.0 1.0
2017-2-2 3 1.0 0.0 0.0 0.0
2017-3-1 1 0.0 0.0 1.0 0.0
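If you want it to match the desired output exactly (integer indicators indexed by customer_id only), you could then drop the date level and cast back to int, for example:
pvt.reset_index(level='date', drop=True).astype(int)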

Related

How to reshape data array from wide to long in J?

I would like to replicate Stata's reshape function in J.
For example, Stata can reshape a dataset "from wide to long". Below is Example 1 from their documentation:
. use http://www.stata-press.com/data/r11/reshape1.dta
. list
+-------------------------------------------------------+
| id sex inc80 inc81 inc82 ue80 ue81 ue82 |
|-------------------------------------------------------|
| 1 0 5000 5500 6000 0 1 0 |
| 2 1 2000 2200 3300 1 0 0 |
| 3 0 3000 2000 1000 0 0 1 |
+-------------------------------------------------------+
. reshape long inc ue, i(id) j(year)
. list
+-----------------------------+
| id year sex inc ue |
|-----------------------------|
| 1 80 0 5000 0 |
| 1 81 0 5500 1 |
| 1 82 0 6000 0 |
| 2 80 1 2000 1 |
| 2 81 1 2200 0 |
| 2 82 1 3300 0 |
| 3 80 0 3000 0 |
| 3 81 0 2000 0 |
| 3 82 0 1000 1 |
+-----------------------------+
NB. Python Pandas has a similar function ("stack").
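(For reference, the closest pandas analogue of Stata's reshape long is probably pd.wide_to_long rather than stack; a rough sketch using the same columns as the Stata example:)
import pandas as pd

wide = pd.DataFrame({'id': [1, 2, 3], 'sex': [0, 1, 0],
                     'inc80': [5000, 2000, 3000], 'inc81': [5500, 2200, 2000], 'inc82': [6000, 3300, 1000],
                     'ue80': [0, 1, 0], 'ue81': [1, 0, 0], 'ue82': [0, 0, 1]})
# 'inc' and 'ue' are the stub names; the numeric suffix becomes the new 'year' column
long_df = pd.wide_to_long(wide, stubnames=['inc', 'ue'], i='id', j='year').reset_index()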
I understand that J can import the data files (csv format) as follows.
load 'web/gethttp'
] dataset =: gethttp 'https://bbbyc.github.io/reshape1.csv'
load 'tables/csv'
] dataInJArray =: fixcsv dataset
I am lost after getting this dataInJArray. How can I reshape it? Appreciate any hints / advice!
To actually work on your specific problem using J you could do this:
NB. t is the data to be stacked:
[ t=: 3 8 $ 1 0 5000 5500 6000 0 1 0 2 1 2000 2200 3300 1 0 0 3 0 3000 2000 1000 0 0 1
1 0 5000 5500 6000 0 1 0
2 1 2000 2200 3300 1 0 0
3 0 3000 2000 1000 0 0 1
You can select and combine the different columns appropriately:
({. ,. 1&{ ,. (2 3 4 & {),. (5 6 7 & {))"1 t
1 0 5000 0
1 0 5500 1
1 0 6000 0
2 1 2000 1
2 1 2200 0
2 1 3300 0
3 0 3000 0
3 0 2000 0
3 0 1000 1
Since this produces a separate plane for each row (displayed with gaps between the groups), you apply ,/ to the whole result to collapse it into a single table:
,/#:(({. ,. 1&{ ,. (2 3 4 & {),. (5 6 7 & {))"1) t
1 0 5000 0
1 0 5500 1
1 0 6000 0
2 1 2000 1
2 1 2200 0
2 1 3300 0
3 0 3000 0
3 0 2000 0
3 0 1000 1
I am not sure how well this generalizes, but variations could be used on tables of any number of records if they are already organized appropriately.
To finish off the formatting and introduce the 'years' (here s0 is taken to be the stacked result from the previous step):
[s1=. ,. each <"1 |: s0 NB. years inserted in the next step
+-+-+----+-+
|1|0|5000|0|
|1|0|5500|1|
|1|0|6000|0|
|2|1|2000|1|
|2|1|2200|0|
|2|1|3300|0|
|3|0|3000|0|
|3|0|2000|0|
|3|0|1000|1|
+-+-+----+-+
[s2=. ({. , ,.#:(9 $ 80 81 82"_); }.)s1 NB. 80 81 82"_ creates a verb that returns 80 81 82 given any argument
+-+--+-+----+-+
|1|80|0|5000|0|
|1|81|0|5500|1|
|1|82|0|6000|0|
|2|80|1|2000|1|
|2|81|1|2200|0|
|2|82|1|3300|0|
|3|80|0|3000|0|
|3|81|0|2000|0|
|3|82|0|1000|1|
+-+--+-+----+-+
('id';'year';'sex';'inc';'ue'),:s2
+--+----+---+----+--+
|id|year|sex|inc |ue|
+--+----+---+----+--+
|1 |80 |0 |5000|0 |
|1 |81 |0 |5500|1 |
|1 |82 |0 |6000|0 |
|2 |80 |1 |2000|1 |
|2 |81 |1 |2200|0 |
|2 |82 |1 |3300|0 |
|3 |80 |0 |3000|0 |
|3 |81 |0 |2000|0 |
|3 |82 |0 |1000|1 |
+--+----+---+----+--+

List column names having values greater than zero

I have the following dataframe:
A | B | C | D
1 0 2 1
0 1 1 0
0 0 0 1
I want to add a new column that lists, for each row, every column whose value is greater than zero, along with that value:
A | B | C | D | New
1 0 2 1 A-1, C-2, D-1
0 1 1 0 B-1, C-1
0 0 0 1 D-1
We can use mask and stack:
# mask zeros as NaN, stack to long form, build "col-value" strings,
# then concatenate them back per original row
s = (df.mask(df == 0).stack().astype(int).astype(str)
       .reset_index(level=1).apply('-'.join, 1)
       .add(',').sum(level=0).str[:-1])
df['New'] = s
df
Out[170]:
A B C D New
0 1 0 2 1 A-1,C-2,D-1
1 0 1 1 0 B-1,C-1
2 0 0 0 1 D-1
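On recent pandas versions sum(level=0) is no longer available, so the concatenation step can be written with an explicit groupby instead; a sketch of the adjusted chain, assuming a fresh df with only the numeric columns:
s = (df.mask(df == 0).stack().astype(int).astype(str)
       .reset_index(level=1).apply('-'.join, 1)
       .groupby(level=0).agg(', '.join))
df['New'] = s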
Combine the column names with the df values that are not zero and then filter out the None values.
import numpy as np

df = pd.read_clipboard()
arrays = np.where(df != 0, df.columns.values + '-' + df.values.astype('str'), None)
new = []
for array in arrays:
    new.append(list(filter(None, array)))
df['New'] = new
df
Out[1]:
A B C D New
0 1 0 2 1 [A-1, C-2, D-1]
1 0 1 1 0 [B-1, C-1]
2 0 0 0 1 [D-1]
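A simpler row-wise sketch (less vectorized, but easy to read) that builds the same comma-separated strings with apply, assuming the columns are A through D as in the example:
df['New'] = df[['A', 'B', 'C', 'D']].apply(
    lambda row: ', '.join(f'{col}-{val}' for col, val in row.items() if val > 0),
    axis=1)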

Group rows based on the current occurrence of a variable

I am trying to group a dataframe based on the occurrence of a variable. For example, take this dataframe:
| col_1 | col_2
---------------------
0 | 1 | 1
1 | 0 | 1
2 | 0 | 1
3 | 0 | -1
4 | 0 | -1
5 | 0 | -1
6 | 0 | NaN
7 | -1 | NaN
8 | 0 | NaN
9 | 0 | -1
10| 0 | -1
11| 0 | -1
I want to group each consecutive run of the same value in col_2 into its own dataframe, put the next run into another dataframe, and so on until the end of the dataframe, while ignoring NaN.
So the final output would be like:
ones_1 =
| col_1 | col_2
---------------------
0 | 1 | 1
1 | 0 | 1
2 | 0 | 1
mones_1 =
3 | 0 | -1
4 | 0 | -1
5 | 0 | -1
mones_2 =
9 | 0 | -1
10| 0 | -1
11| 0 | -1
I suggest creating a dictionary of DataFrames:
#only non missing rows
mask = df['col_2'].notna()
#create unique groups
g = df['col_2'].ne(df['col_2'].shift()).cumsum()
#create counter of filtered g
g = g[mask].groupby(df['col_2']).transform(lambda x:pd.factorize(x)[0]) + 1
#map positive and negative values to strings and add counter values
g = df.loc[mask, 'col_2'].map({-1:'mones_',1:'ones_'}) + g.astype(str)
#generally groups
#g = 'val' + df.loc[mask, 'col_2'].astype(str) + ' no' + g.astype(str)
print (g)
0 ones_1
1 ones_1
2 ones_1
3 mones_1
4 mones_1
5 mones_1
9 mones_2
10 mones_2
11 mones_2
Name: col_2, dtype: object
#create dictionary of DataFrames
dfs = dict(tuple(df.groupby(g)))
print (dfs)
{'mones_1': col_1 col_2
3 0 -1.0
4 0 -1.0
5 0 -1.0, 'mones_2': col_1 col_2
9 0 -1.0
10 0 -1.0
11 0 -1.0, 'ones_1': col_1 col_2
0 1 1.0
1 0 1.0
2 0 1.0}
#select by keys
print (dfs['ones_1'])
col_1 col_2
0 1 1.0
1 0 1.0
2 0 1.0
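If you later need everything back in a single frame, the dictionary can be concatenated, with the group labels becoming the outer index level; a small usage sketch:
combined = pd.concat(dfs)        # dict keys become the first index level
print(combined.loc['mones_2'])   # pull a single group back out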
It is not recommended, but it is possible to create DataFrames for each group as variables:
for i, g in df.groupby(g):
    globals()[i] = g
print (ones_1)
col_1 col_2
0 1 1.0
1 0 1.0
2 0 1.0
Here is another approach (again, the idea is to keep them in a dictionary):
m=df[df.col_2.notna()] #filter out the NaN rows
#check if the index are in sequence along with that check if values changes per row
s=m.col_2.ne(m.col_2.shift())|m.index.to_series().diff().fillna(1).gt(1)
dfs={f'df_{int(i)}':g for i , g in df.groupby(s.cumsum())} #groupby and store in dict
Access the dataframes by accessing the keys:
print(dfs['df_1'])
print('---------------------------------')
print(dfs['df_2'])
print('---------------------------------')
print(dfs['df_3'])
col_1 col_2
0 1 1.0
1 0 1.0
2 0 1.0
---------------------------------
col_1 col_2
3 0 -1.0
4 0 -1.0
5 0 -1.0
---------------------------------
col_1 col_2
9 0 -1.0
10 0 -1.0
11 0 -1.0

How to efficiently disaggregate data?

I have Google Analytics data which I am trying to disaggregate.
Below is a simplified version of the dataframe I am dealing with:
date | users | goal_completions
20150101| 2 | 1
20150102| 3 | 2
I would like to disaggregate the data such that each "user" has its own row. In addition, the third column, "goal_completions" will also be disaggregated with the assumption that each user can only have 1 "goal_completion".
The output I am seeking will be something like this:
date | users | goal_completions
20150101| 1 | 1
20150101| 1 | 0
20150102| 1 | 1
20150102| 1 | 1
20150102| 1 | 0
I was able to duplicate each row based on the number of users on a given date, however I can't seem to find a way to disaggregate the "goal_completion" column. Here is what I currently have after duplicating the "users" column:
date | users | goal_completions
20150101| 1 | 1
20150101| 1 | 1
20150102| 1 | 2
20150102| 1 | 2
20150102| 1 | 2
Any help will be appreciated - thanks!
IIUC, use repeat to expand each row by its user count, then adjust the two columns with cumcount and np.where:
import numpy as np

df = df.reindex(df.index.repeat(df.users))
df = df.assign(users=1)
df.goal_completions = np.where(df.groupby(level=0).cumcount() < df.goal_completions, 1, 0)
df
Out[609]:
date users goal_completions
0 20150101 1 1
0 20150101 1 0
1 20150102 1 1
1 20150102 1 1
1 20150102 1 0
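A self-contained version of the same idea, with the sample data from the question spelled out and the numpy/pandas imports included:
import numpy as np
import pandas as pd

df = pd.DataFrame({'date': [20150101, 20150102],
                   'users': [2, 3],
                   'goal_completions': [1, 2]})

out = df.reindex(df.index.repeat(df['users'])).assign(users=1)
# within each original row, the first `goal_completions` copies get a 1, the rest 0
out['goal_completions'] = np.where(
    out.groupby(level=0).cumcount() < out['goal_completions'], 1, 0)
print(out)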

Write 1s faster to col-rows based on positions in a list

I'm new to pandas. I'm using a dataframe to tally how many times two positions match.
Here is the code in question, right at the start; the "what am I trying to accomplish" is described below.
def crossovers(df, index):
    # Duplicate the dataframe passed in
    _dfcopy = df.copy(deep=True)
    # Set all values to 0
    _dfcopy[:] = 0.0
    # change the value of any col/row where there's a shared SNP
    for i in index:
        for j in index:
            if i == j: continue  # Don't include self as a shared SNP
            _dfcopy[i][j] = 1
    # Return the DataFrame.
    # Should only contain 0s (no shared SNP) or 1s (a shared SNP)
    return _dfcopy
QUESTION:
The code flips all the 0s in a dataframe to 1s for all the intersections of rows/columns in a list (see details below).
I.e. if the list is
_indices = [0,2,3]
...all the locations at (0,2); (0,3); (2,0); (2,3); (3,0); and (3,2) get flipped to 1s.
Currently I do this by iterating through the list nested against itself. But this is painfully slow, and I'm passing in 16 million lines of data (16 million indices).
How can I speed up this overall process?
LONGER DESCRIPTION
I start with a dataframe called sharedby_BOTH similar to below, except much larger (70 cols x 70 rows)- I'm using it to tally occurrences of shared data intersections.
Rows (index) are labeled 0,1,2,3 & 4...70 - as are the columns. Each location contains a 0.
sharedby_BOTH
0 1 2 3 4 (more)
------------------
0 | 0 | 0 | 0 | 0 | 0
1 | 0 | 0 | 0 | 0 | 0
2 | 0 | 0 | 0 | 0 | 0
3 | 0 | 0 | 0 | 0 | 0
4 | 0 | 0 | 0 | 0 | 0
(more)
Then I have a list, which contains intersecting data.
_indices = [0,2,3 (more)] # for example
This means that 0, 2, & 3 all contain shared data. So, I pass it to crossovers which returns a dataframe with a "1" at the intersection places, obtaining this...
0 1 2 3 4 (more)
------------------
0 | 0 | 0 | 1 | 1 | 0
1 | 0 | 0 | 0 | 0 | 0
2 | 1 | 0 | 0 | 1 | 0
3 | 1 | 0 | 1 | 0 | 0
4 | 0 | 0 | 0 | 0 | 0
(more)
...where the shared data locations are (0,2),(0,3),(2,0),(2,3),(3,0),(3,2).
Notice that self is not recognized: (0,0), (2,2), and (3,3) do NOT have 1s.
Then I add this to the original dataframe with this code (inside a loop)...
sharedby_BOTH = sharedby_BOTH.add(crossovers(sharedby_BOTH, _indices))
I repeat this in a loop...
for pos, pos_val in chrom_val.items():  # pos_val is a dict
    _indices = [i for i, x in enumerate(pos_val["sharedby"]) if (x == "HET")]
    sharedby_BOTH = sharedby_BOTH.add(crossovers(sharedby_BOTH, _indices))
The end result is that sharedby_BOTH would look like the following if I added the three example _indices lists:
sharedby_BOTH = sharedby_BOTH.add(crossovers(sharedby_BOTH, [0,2,3] ))
sharedby_BOTH = sharedby_BOTH.add(crossovers(sharedby_BOTH, [0,2,4] ))
sharedby_BOTH = sharedby_BOTH.add(crossovers(sharedby_BOTH, [0,2,3] ))
0 1 2 3 4 (more)
------------------
0 | 0 | 0 | 3 | 2 | 1
1 | 0 | 0 | 0 | 0 | 0
2 | 3 | 0 | 0 | 2 | 1
3 | 2 | 0 | 2 | 0 | 0
4 | 1 | 0 | 1 | 0 | 0
(more)
...where, amongst the three indices passed in...
0 shared data with 2 a total of three times, so (0,2) and (2,0) total three.
0 shared data with 3 twice, so (0,3) and (3,0) total two.
0 shared data with 4 only once, so (0,4) and (4,0) total one.
I hope this makes sense :)
EDIT
I did try the following...
addit = pd.DataFrame(1, index=_indices, columns=_indices)
sharedby_BOTH = sharedby_BOTH.add(addit)
BUT...then any locations within sharedby_BOTH that DID NOT HAVE SHARED DATA ended up as NAN
I.e...
sharedby_BOTH = pd.DataFrame(0, index=[x for x in range(4)], columns=[x for x in range(4)])
_indices = [0,2,3 (more)] # for example
addit = pd.DataFrame(1, index=_indices, columns=_indices)
sharedby_BOTH = sharedby_BOTH.add(addit)
0 1 2 3 4 (more)
------------------
0 | NAN | NAN | 1 | 1 | NAN
1 | NAN | NAN | NAN | NAN | NAN
2 | 1 | NAN | NAN | 1 | NAN
3 | 1 | NAN | 1 | NAN | NAN
4 | NAN | NAN | NAN | NAN | NAN
(more)
I'd organize it with numpy slice assignment and the handy np.triu_indices function. It returns the row and column indices of the upper triangle. I make sure to pass k=1 to ensure I skip the diagonal. When I slice assign, I make sure to use both i, j and j, i to get upper and lower triangles.
def xover(n, idx):
    idx = np.asarray(idx)
    a = np.zeros((n, n))
    i_, j_ = np.triu_indices(len(idx), 1)
    i = idx[i_]
    j = idx[j_]
    a[i, j] = 1
    a[j, i] = 1
    return a
pd.DataFrame(xover(len(df), [0, 2, 3]), df.index, df.columns)
0 1 2 3
0 0.0 0.0 1.0 1.0
1 0.0 0.0 0.0 0.0
2 1.0 0.0 0.0 1.0
3 1.0 0.0 1.0 0.0
Timings
%timeit pd.DataFrame(xover(len(df), [0, 2, 3]), df.index, df.columns)
10000 loops, best of 3: 192 µs per loop
%%timeit
for i, j in product(li, repeat=2):
    if i != j:
        ndf.loc[i, j] = 1
100 loops, best of 3: 6.8 ms per loop
You can use itertools.product and .loc for assignment, i.e.
from itertools import product
li = [ 0,2,3]
ndf = df.copy()
for i, j in product(li, repeat=2):
    if i != j:
        ndf.loc[i, j] = 1
0 1 2 3 4
0 0 0 1 1 0
1 0 0 0 0 0
2 1 0 0 1 0
3 1 0 1 0 0
4 0 0 0 0 0
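To apply either approach to the full accumulation described in the question, one option is to keep the running total as a plain numpy array and only wrap it in a DataFrame at the end; a sketch that assumes the xover function above plus the chrom_val and sharedby_BOTH names from the question:
import numpy as np
import pandas as pd

n = len(sharedby_BOTH)
counts = np.zeros((n, n))
for pos, pos_val in chrom_val.items():          # same loop as in the question
    _indices = [i for i, x in enumerate(pos_val["sharedby"]) if x == "HET"]
    counts += xover(n, _indices)                # add this round's 0/1 crossover matrix

sharedby_BOTH = pd.DataFrame(counts, index=sharedby_BOTH.index,
                             columns=sharedby_BOTH.columns)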
