Group rows based on the current occurrence of a variable - python-3.x

I am trying to group a dataframe based on the occurrence of a variable. For example, take this dataframe:
| col_1 | col_2
---------------------
0 | 1 | 1
1 | 0 | 1
2 | 0 | 1
3 | 0 | -1
4 | 0 | -1
5 | 0 | -1
6 | 0 | NaN
7 | -1 | NaN
8 | 0 | NaN
9 | 0 | -1
10| 0 | -1
11| 0 | -1
I want to group the rows of the current run of a value in col_2 into one dataframe, put the next run into another dataframe, and so on until the end of the dataframe, while ignoring the NaN rows.
So the final output would be like:
ones_1 =
| col_1 | col_2
---------------------
0 | 1 | 1
1 | 0 | 1
2 | 0 | 1
mones_1 =
3 | 0 | -1
4 | 0 | -1
5 | 0 | -1
mones_2 =
9 | 0 | -1
10| 0 | -1
11| 0 | -1
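For reproducibility, the example frame can be built like this (a minimal sketch; the blank col_2 entries are assumed to be float NaN):
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'col_1': [1, 0, 0, 0, 0, 0, 0, -1, 0, 0, 0, 0],
    'col_2': [1, 1, 1, -1, -1, -1, np.nan, np.nan, np.nan, -1, -1, -1],
})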

I suggest creating a dictionary of DataFrames:
#only non missing rows
mask = df['col_2'].notna()
#create unique groups
g = df['col_2'].ne(df['col_2'].shift()).cumsum()
#create counter of filtered g
g = g[mask].groupby(df['col_2']).transform(lambda x:pd.factorize(x)[0]) + 1
#map positive and negative values to strings and add counter values
g = df.loc[mask, 'col_2'].map({-1:'mones_',1:'ones_'}) + g.astype(str)
#generally groups
#g = 'val' + df.loc[mask, 'col_2'].astype(str) + ' no' + g.astype(str)
print (g)
0 ones_1
1 ones_1
2 ones_1
3 mones_1
4 mones_1
5 mones_1
9 mones_2
10 mones_2
11 mones_2
Name: col_2, dtype: object
#create dictionary of DataFrames
dfs = dict(tuple(df.groupby(g)))
print (dfs)
{'mones_1': col_1 col_2
3 0 -1.0
4 0 -1.0
5 0 -1.0, 'mones_2': col_1 col_2
9 0 -1.0
10 0 -1.0
11 0 -1.0, 'ones_1': col_1 col_2
0 1 1.0
1 0 1.0
2 0 1.0}
#select by keys
print (dfs['ones_1'])
col_1 col_2
0 1 1.0
1 0 1.0
2 0 1.0
It is not recommended, but it is possible to create DataFrames as variables named after the groups:
for name, group in df.groupby(g):
    globals()[name] = group
print (ones_1)
col_1 col_2
0 1 1.0
1 0 1.0
2 0 1.0

Here is another approach (again, keeping the pieces in a dictionary is the idea):
m = df[df.col_2.notna()]  # filter out the NaN rows
# start a new group whenever the value changes or the index is not consecutive
s = m.col_2.ne(m.col_2.shift()) | m.index.to_series().diff().fillna(1).gt(1)
dfs = {f'df_{int(i)}': g for i, g in df.groupby(s.cumsum())}  # groupby and store in a dict
Access the dataframes by accessing the keys:
print(dfs['df_1'])
print('---------------------------------')
print(dfs['df_2'])
print('---------------------------------')
print(dfs['df_3'])
col_1 col_2
0 1 1.0
1 0 1.0
2 0 1.0
---------------------------------
col_1 col_2
3 0 -1.0
4 0 -1.0
5 0 -1.0
---------------------------------
col_1 col_2
9 0 -1.0
10 0 -1.0
11 0 -1.0


compare columns with NaN or <NA> values pandas

I have a dataframe with NaN and non-NaN values, and I want to compare two columns of that dataframe, checking row by row whether each value is null or not null. For example:
If column a_1 has a null value and column a_2 has a non-null value, then for that particular row the result should be 1 in the new column a_12.
If the values in both a_1 (e.g. 123) and a_2 (e.g. 345) are not null, and the values are not equal, then the result should be 3 in column a_12.
Below is the code snippet I have used for the comparison. For scenario 1 I am getting 3 instead of 1. Please guide me to the correct output.
try:
    if (x[cols[0]] == x[cols[1]]) & (~np.isnan(x[cols[0]])) & (~np.isnan(x[cols[1]])):
        return 0
    elif (np.isnan(x[cols[0]])) & (np.isnan(x[cols[1]])):
        return 0
    elif (~np.isnan(x[cols[0]])) & (np.isnan(x[cols[1]])):
        return 1
    elif (np.isnan(x[cols[0]])) & (~np.isnan(x[cols[1]])):
        return 2
    elif (x[cols[0]] != x[cols[1]]) & (~np.isnan(x[cols[0]])) & (~np.isnan(x[cols[1]])):
        return 3
    else:
        pass
except Exception as exc:
    if (x[cols[0]] == x[cols[1]]) & (pd.notna(x[cols[0]])) & (pd.notna(x[cols[1]])):
        return 0
    elif (pd.isna(x[cols[0]])) & (pd.isna(x[cols[1]])):
        return 0
    elif (pd.notna(x[cols[0]])) & (pd.isna(x[cols[1]])):
        return 1
    elif (pd.isna(x[cols[0]])) & (pd.notna(x[cols[1]])):
        return 2
    elif (x[cols[0]] != x[cols[1]]) & (pd.notna(x[cols[0]])) & (pd.notna(x[cols[1]])):
        return 3
    else:
        pass
I have used pd.isna() and pd.notna() as well as np.isnan() and ~np.isnan(), because np.isnan() works for some columns but just throws an error for others.
Please guide me to achieve the expected result.
Expected Output:
| a_1 | a_2 | result |
|-----------|---------|--------|
| gssfwe | gssfwe | 0 |
| <NA> | <NA> | 0 |
| fsfsfw | <NA> | 1 |
| <NA> | qweweqw | 2 |
| adsadgsgd | wwuwquq | 3 |
Output Got with the above code:
| a_1 | a_2 | result |
|-----------|---------|--------|
| gssfwe | gssfwe | 0 |
| <NA> | <NA> | 0 |
| fsfsfw | <NA> | 3 |
| <NA> | qweweqw | 3 |
| adsadgsgd | wwuwquq | 3 |
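For reference, a frame matching the example can be built like this (a sketch; plain np.nan is assumed for the missing entries, and the result column is omitted):
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'a_1': ['gssfwe', np.nan, 'fsfsfw', np.nan, 'adsadgsgd'],
    'a_2': ['gssfwe', np.nan, np.nan, 'qweweqw', 'wwuwquq'],
})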
Going by the logic in your code, you'd want to define a function and apply it across your DataFrame.
df = pd.DataFrame({'a_1': [1, 2, np.nan, np.nan, 1], 'a_2': [2, np.nan, 1, np.nan, 1]})
The categories you want map neatly to binary numbers, which you can use to write a short function like -
def nan_check(row):
    x, y = row
    if x != y:
        return int(f'{int(pd.notna(y))}{int(pd.notna(x))}', base=2)
    return 0

df['flag'] = df.apply(nan_check, axis=1)
Output
a_1 a_2 flag
0 1.0 2.0 3
1 2.0 NaN 1
2 NaN 1.0 2
3 NaN NaN 0
4 1.0 1.0 0
You can try np.select, but I think you need to rethink the conditions and the expected output:
Condition 1: if the column a_1 have null values, column a_2 have not null values, then for that particular row, the result should be 1 in the new column a_12.
Condition 2: If the values in both a_1 & a_2 is not null, and the values are not equal, then the result should be 3 in column a_12.
df['a_12'] = np.select(
[df['a_1'].isna() & df['a_2'].notna(),
df['a_1'].notna() & df['a_2'].notna() & df['a_1'].ne(df['a_2'])],
[1, 3],
default=0
)
print(df)
a_1 a_2 result a_12
0 gssfwe gssfwe 0 0
1 NaN NaN 0 0
2 fsfsfw NaN 1 0 # Shouldn't be Condition 1 since a_1 is not NaN
3 NaN qweweqw 2 1 # Condition 1
4 adsadgsgd wwuwquq 3 3
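If all four outcomes from the question are needed, the same np.select pattern extends naturally (a sketch, assuming the string frame shown above):
df['a_12'] = np.select(
    [df['a_1'].notna() & df['a_2'].isna(),                              # scenario 1
     df['a_1'].isna() & df['a_2'].notna(),                              # scenario 2
     df['a_1'].notna() & df['a_2'].notna() & df['a_1'].ne(df['a_2'])],  # scenario 3
    [1, 2, 3],
    default=0  # equal non-null values, or both missing
)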

List column names having values greater than zero

I have following dataframe
A | B | C | D
1 0 2 1
0 1 1 0
0 0 0 1
I want to add a new column that, for each row, lists every column whose value is greater than zero, together with that value:
A | B | C | D | New
1 0 2 1 A-1, C-2, D-1
0 1 1 0 B-1, C-1
0 0 0 1 D-1
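For reproducibility, the example frame can also be built directly rather than pasted from the clipboard (a minimal sketch):
import pandas as pd

df = pd.DataFrame({'A': [1, 0, 0], 'B': [0, 1, 0], 'C': [2, 1, 0], 'D': [1, 0, 1]})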
We can use mask and stack
s = (df.mask(df == 0).stack()
       .astype(int).astype(str)
       .reset_index(level=1)
       .apply('-'.join, 1)
       .add(',')
       .sum(level=0)
       .str[:-1])
df['New'] = s
df
Out[170]:
A B C D New
0 1 0 2 1 A-1,C-2,D-1
1 0 1 1 0 B-1,C-1
2 0 0 0 1 D-1
Combine the column names with the df values that are not zero and then filter out the None values.
df = pd.read_clipboard()
arrays = np.where(df!=0, df.columns.values + '-' + df.values.astype('str'), None)
new = []
for array in arrays:
    new.append(list(filter(None, array)))
df['New'] = new
df
Out[1]:
A B C D New
0 1 0 2 1 [A-1, C-2, D-1]
1 0 1 1 0 [B-1, C-1]
2 0 0 0 1 [D-1]
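If a comma-separated string is preferred over a list (matching the question's example output), the collected lists can simply be joined; a small follow-up sketch using the new list built above:
df['New'] = [', '.join(x) for x in new]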

How to efficiently disaggregate data from Google Analytics?

I have Google Analytics data which I am trying to disaggregate.
Below is a simplified version of the dataframe I am dealing with:
date | users | goal_completions
20150101| 2 | 1
20150102| 3 | 2
I would like to disaggregate the data such that each "user" has its own row. In addition, the third column, "goal_completions" will also be disaggregated with the assumption that each user can only have 1 "goal_completion".
The output I am seeking will be something like this:
date | users | goal_completions
20150101| 1 | 1
20150101| 1 | 0
20150102| 1 | 1
20150102| 1 | 1
20150102| 1 | 0
I was able to duplicate each row based on the number of users on a given date, but I can't seem to find a way to disaggregate the "goal_completions" column. Here is what I currently have after duplicating by the "users" column:
date | users | goal_completions
20150101| 1 | 1
20150101| 1 | 1
20150102| 1 | 2
20150102| 1 | 2
20150102| 1 | 2
Any help will be appreciated - thanks!
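For reproducibility, a minimal construction of the example frame (assuming integer dates as shown):
import numpy as np
import pandas as pd

df = pd.DataFrame({'date': [20150101, 20150102],
                   'users': [2, 3],
                   'goal_completions': [1, 2]})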
IIUC, use repeat to create the expanded rows, then adjust the two columns using cumcount with np.where:
df=df.reindex(df.index.repeat(df.users))
df=df.assign(users=1)
df.goal_completions=np.where(df.groupby(level=0).cumcount()<df.goal_completions,1,0)
df
Out[609]:
date users goal_completions
0 20150101 1 1
0 20150101 1 0
1 20150102 1 1
1 20150102 1 1
1 20150102 1 0
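If a fresh 0..n index is wanted afterwards (the repeated index above comes from reindex), one possible final step:
df = df.reset_index(drop=True)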

Pivot table based on groupby in Pandas

I have a dataframe like this:
customer_id | date | category
1 | 2017-2-1 | toys
2 | 2017-2-1 | food
1 | 2017-2-1 | drinks
3 | 2017-2-2 | computer
2 | 2017-2-1 | toys
1 | 2017-3-1 | food
>>> import pandas as pd
>>> dt = dict(customer_id=[1, 2, 1, 3, 2, 1],
...           date='2017-2-1 2017-2-1 2017-2-1 2017-2-2 2017-2-1 2017-3-1'.split(),
...           category=["toys", "food", "drinks", "computer", "toys", "food"])
>>> df = pd.DataFrame(dt)
To get the categories as new columns and one-hot encode those columns, I know I can use df.pivot_table(index=['customer_id'], columns=['category']).
>>> df['Indicator'] = 1
>>> df.pivot_table(index=['customer_id'], columns=['category'],
values='Indicator').fillna(0).astype(int)
category computer drinks food toys
customer_id
1 0 1 1 1
2 0 0 1 1
3 1 0 0 0
>>>
I also want to group by date so each row only contains information from the same date, as in the desired output below; id 1 has two rows because there are two unique dates in its date column.
customer_id | toys | food | drinks | computer
1 | 1 | 0 | 1 | 0
1 | 0 | 1 | 0 | 0
2 | 1 | 1 | 0 | 0
3 | 0 | 0 | 0 | 1
You may be looking for crosstab:
>>> pd.crosstab([df.customer_id,df.date], df.category)
category computer drinks food toys
customer_id date
1 2017-2-1 0 1 0 1
2017-3-1 0 0 1 0
2 2017-2-1 0 0 1 1
3 2017-2-2 1 0 0 0
>>>
>>> pd.crosstab([df.customer_id,df.date],
df.category).reset_index(level=1)
category date computer drinks food toys
customer_id
1 2017-2-1 0 1 0 1
1 2017-3-1 0 0 1 0
2 2017-2-1 0 0 1 1
3 2017-2-2 1 0 0 0
>>>
>>> pd.crosstab([df.customer_id, df.date],
df.category).reset_index(level=1, drop=True)
category computer drinks food toys
customer_id
1 0 1 0 1
1 0 0 1 0
2 0 0 1 1
3 1 0 0 0
>>>
Assuming your frame is called df, you could add an indicator column and then directly use .pivot_table:
df['Indicator'] = 1
pvt = (df.pivot_table(index=['date', 'customer_id'],
                      columns='category',
                      values='Indicator')
         .fillna(0))
This gives a dataframe that looks like:
category computer drinks food toys
date customer_id
2017-2-1 1 0.0 1.0 0.0 1.0
2 0.0 0.0 1.0 1.0
2017-2-2 3 1.0 0.0 0.0 0.0
2017-3-1 1 0.0 0.0 1.0 0.0
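To get closer to the layout in the question (integer flags, customer_id index only), one possible follow-up is to cast and drop the date level (a sketch):
out = pvt.astype(int).reset_index(level='date', drop=True)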

Write 1s faster to col-rows based on positions in a list

I'm new to pandas. I'm using a dataframe to tally how many times two positions match.
Here is the code in question, right at the start; the "what am I trying to accomplish" part is below...
def crossovers(df, index):
    # Duplicate the dataframe passed in
    _dfcopy = df.copy(deep=True)
    # Set all values to 0
    _dfcopy[:] = 0.0
    # Change the value of any col/row where there's a shared SNP
    for i in index:
        for j in index:
            if i == j: continue  # Don't include self as a shared SNP
            _dfcopy[i][j] = 1
    # Return the DataFrame.
    # Should only contain 0s (no shared SNP) or 1s (a shared SNP)
    return _dfcopy
QUESTION:
The code flips all the 0s in a dataframe to 1s at every intersection of the rows/columns given in a list (see details below).
I.e. if the list is
_indices = [0,2,3]
...all the locations at (0,2); (0,3); (2,0); (2,3); (3,0); and (3,2) get flipped to 1s.
Currently I do this by iterating over the list against itself in a nested loop. But this is painfully slow... and I'm passing in 16 million lines of data (16 million indices).
How can I speed up this overall process?
LONGER DESCRIPTION
I start with a dataframe called sharedby_BOTH similar to below, except much larger (70 cols x 70 rows)- I'm using it to tally occurrences of shared data intersections.
Rows (index) are labeled 0,1,2,3 & 4...70 - as are the columns. Each location contains a 0.
sharedby_BOTH
0 1 2 3 4 (more)
------------------
0 | 0 | 0 | 0 | 0 | 0
1 | 0 | 0 | 0 | 0 | 0
2 | 0 | 0 | 0 | 0 | 0
3 | 0 | 0 | 0 | 0 | 0
4 | 0 | 0 | 0 | 0 | 0
(more)
Then I have a list, which contains intersecting data.
_indices = [0,2,3 (more)] # for example
This means that 0, 2, & 3 all contain shared data. So, I pass it to crossovers which returns a dataframe with a "1" at the intersection places, obtaining this...
0 1 2 3 4 (more)
------------------
0 | 0 | 0 | 1 | 1 | 0
1 | 0 | 0 | 0 | 0 | 0
2 | 1 | 0 | 0 | 1 | 0
3 | 1 | 0 | 1 | 0 | 0
4 | 0 | 0 | 0 | 0 | 0
(more)
...where the shared data locations are (0,2),(0,3),(2,0),(2,3),(3,0),(3,2).
Notice that self is not recognized: (0,0), (2,2), and (3,3) DO NOT have 1s.
Then I add this to the original dataframe with this code (inside a loop)...
sharedby_BOTH = sharedby_BOTH.add(crossovers(sharedby_BOTH, _indices))
I repeat this in a loop...
for pos, pos_val in chrom_val.items():  # pos_val is a dict
    _indices = [i for i, x in enumerate(pos_val["sharedby"]) if x == "HET"]
    sharedby_BOTH = sharedby_BOTH.add(crossovers(sharedby_BOTH, _indices))
The end result is that sharedby_BOTH will look like the following, if I added the three example _indices
sharedby_BOTH = sharedby_BOTH.add(crossovers(sharedby_BOTH, [0,2,3] ))
sharedby_BOTH = sharedby_BOTH.add(crossovers(sharedby_BOTH, [0,2,4] ))
sharedby_BOTH = sharedby_BOTH.add(crossovers(sharedby_BOTH, [0,2,3] ))
0 1 2 3 4 (more)
------------------
0 | 0 | 0 | 3 | 2 | 1
1 | 0 | 0 | 0 | 0 | 0
2 | 3 | 0 | 0 | 2 | 1
3 | 2 | 0 | 2 | 0 | 0
4 | 1 | 0 | 1 | 0 | 0
(more)
...where, amongst the three index lists passed in...
0 shared data with 2 a total of three times, so (0,2) and (2,0) total three.
0 shared data with 3 twice, so (0,3) and (3,0) total two.
0 shared data with 4 only once, so (0,4) and (4,0) total one.
I hope this makes sense :)
EDIT
I did try the following...
addit = pd.DataFrame(1, index=_indices, columns=_indices)
sharedby_BOTH = sharedby_BOTH.add(addit)
BUT... then any locations within sharedby_BOTH that DID NOT HAVE SHARED DATA ended up as NaN.
I.e...
sharedby_BOTH = pd.DataFrame(0, index=[x for x in range(4)], columns=[x for x in range(4)])
_indices = [0,2,3 (more)] # for example
addit = pd.DataFrame(1, index=_indices, columns=_indices)
sharedby_BOTH = sharedby_BOTH.add(addit)
0 1 2 3 4 (more)
------------------
0 | NAN | NAN | 1 | 1 | NAN
1 | NAN | NAN | NAN | NAN | NAN
2 | 1 | NAN | NAN | 1 | NAN
3 | 1 | NAN | 1 | NAN | NAN
4 | NAN | NAN | NAN | NAN | NAN
(more)
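As a side note, not from the answers below: the NaN issue in this attempt can likely be avoided by passing fill_value=0 to add, although, unlike crossovers, this also counts self matches, so the diagonal then needs to be zeroed; a sketch:
addit = pd.DataFrame(1, index=_indices, columns=_indices)
# fill_value=0 treats cells missing from addit as 0, so untouched cells stay numeric
sharedby_BOTH = sharedby_BOTH.add(addit, fill_value=0)
# zero the diagonal, since self should not count as a shared SNP
sharedby_BOTH = sharedby_BOTH.where(~np.eye(len(sharedby_BOTH), dtype=bool), 0)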
I'd organize it with numpy slice assignment and the handy np.triu_indices function. It returns the row and column indices of the upper triangle. I make sure to pass k=1 to skip the diagonal. When I slice-assign, I use both (i, j) and (j, i) to fill the upper and lower triangles.
def xover(n, idx):
    idx = np.asarray(idx)
    a = np.zeros((n, n))
    i_, j_ = np.triu_indices(len(idx), 1)
    i = idx[i_]
    j = idx[j_]
    a[i, j] = 1
    a[j, i] = 1
    return a
pd.DataFrame(xover(len(df), [0, 2, 3]), df.index, df.columns)
0 1 2 3
0 0.0 0.0 1.0 1.0
1 0.0 0.0 0.0 0.0
2 1.0 0.0 0.0 1.0
3 1.0 0.0 1.0 0.0
Timings
%timeit pd.DataFrame(xover(len(df), [0, 2, 3]), df.index, df.columns)
10000 loops, best of 3: 192 µs per loop
%%timeit
for i, j in product(li, repeat=2):
    if i != j:
        ndf.loc[i, j] = 1
100 loops, best of 3: 6.8 ms per loop
You can use itertools.product and loc for assignment, i.e.
from itertools import product
li = [0, 2, 3]
ndf = df.copy()
for i, j in product(li, repeat=2):
    if i != j:
        ndf.loc[i, j] = 1
0 1 2 3 4
0 0 0 1 1 0
1 0 0 0 0 0
2 1 0 0 1 0
3 1 0 1 0 0
4 0 0 0 0 0
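For 16 million index lists, a further option (an assumption, not taken from the answers above) is to skip DataFrame.add entirely and accumulate into a single NumPy array, converting to a DataFrame only once at the end:
import numpy as np
import pandas as pd

n = 70                               # number of positions (rows/cols of sharedby_BOTH)
acc = np.zeros((n, n))
for idx in all_index_lists:          # hypothetical iterable of index lists like [0, 2, 3]
    idx = np.asarray(idx)
    i_, j_ = np.triu_indices(len(idx), 1)
    np.add.at(acc, (idx[i_], idx[j_]), 1)  # upper triangle
    np.add.at(acc, (idx[j_], idx[i_]), 1)  # mirrored lower triangle
sharedby_BOTH = pd.DataFrame(acc)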
