Python Pandas: select rows matching the first columns of a numpy array - python-3.x

I have a dataframe like this:
df = pd.DataFrame({'A':[1,2,3], 'B':[4,5,6],'C':[7,8,9],'D':[10,11,12]})
and a numpy array, arr, that may vary in shape, like this:
arr = np.array([[1,4],[2,6]])
arr = np.array([[2,5,8], [1,5,8]])
And I would like to get all rows in df whose leading columns match a row of arr, along these lines:
for x in arr:
    df[df.iloc[:, :len(x)].eq(x).all(1)]
Thanks guys!

IIUC, you can convert the array to a DataFrame and use merge:
arr = np.array([[1,4],[2,6],[2,5]])
df.merge(pd.DataFrame(arr, columns = df.iloc[:,:arr.shape[1]].columns))
A B C D
0 1 4 7 10
1 2 5 8 11
This solution will handle arrays of different shapes (as long as arr.shape[1] <= df.shape[1]):
arr = np.array([[2,5,8], [1,5,8], [3,6,9]])
df.merge(pd.DataFrame(arr, columns = df.iloc[:,:arr.shape[1]].columns))
A B C D
0 2 5 8 11
1 3 6 9 12
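If you'd rather avoid the merge (for example, to keep df's original index), one alternative is a boolean mask built with numpy broadcasting - a minimal sketch, assuming arr is a 2-D array as above:
# compare each df row (n, 1, k) against each arr row (1, m, k): a row matches
# if it equals all k leading elements of at least one row of arr
lead = df.iloc[:, :arr.shape[1]].to_numpy()
mask = (lead[:, None, :] == arr[None, :, :]).all(axis=2).any(axis=1)
df[mask]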

Related

pandas listing same indexes

If a table has the same index 3 times in a row, I want to fetch those rows as a dataframe.
Example:
index var1
1 a
2 b
2 c
2 d
3 e
2 f
5 g
2 f
Expected output after running the code:
index var1
2 b
2 c
2 d
One option is to split the data frame at the indices where the index value changes, check the size of each chunk, filter out chunks smaller than the threshold, and then recombine the rest:
import pandas as pd
import numpy as np
diff_indices = np.flatnonzero(df['index'].diff().ne(0))
diff_indices
# array([0, 1, 4, 5, 6, 7], dtype=int32)
pd.concat([chunk for chunk in np.split(df, diff_indices) if len(chunk) >= 3])
index var1
1 2 b
2 2 c
3 2 d
Let us identify the blocks of consecutive indices using cumsum, then group and transform with count to find the size of each block, and finally select the rows where the block size > 2:
b = df['index'].diff().ne(0).cumsum()
df[b.groupby(b).transform('count') > 2]
index var1
1 2 b
2 2 c
3 2 d
You can assign consecutive rows the same value by comparing each row with the next one and taking a cumsum. Then group by those values and keep the groups that contain exactly 3 rows:
m = df['index'].ne(df['index'].shift()).cumsum()
out = df.groupby(m).filter(lambda g: len(g) == 3)
print(out)
index var1
1 2 b
2 2 c
3 2 d
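If "3 times in a row" should also cover longer runs, the same grouping works with a >= check - a small variant:
# keep every run whose length is at least 3, not exactly 3
out = df.groupby(m).filter(lambda g: len(g) >= 3)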
Here's one more solution on top of the ones above (this one is more generalizable, since it selects ALL slices that meet the given criterion):
import pandas as pd
df['diff_index'] = df['index'].diff(-1)  # diff with the next row; 0 marks a repeat
df = df.fillna(999)                      # get rid of the NaN in the last row
df['diff_index'] = df['diff_index'].astype(int)  # convert the diff to int
df_selected = []                         # list of all the slices we're going to collect
l = list(df['diff_index'])
for i in range(len(l) - 1):
    if l[i] == 0 and l[i + 1] == 0:      # two consecutive 0s mean three equal indices
        df_temp = df[df.index.isin([i, i + 1, i + 2])]
        del df_temp['diff_index']
        df_selected.append(df_temp)      # append the slice to our list
print(df_selected)  # all identified data frames (in your example, there is only one)
[ index var1
1 2 b
2 2 c
3 2 d]
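Note that runs longer than three rows will produce overlapping slices here, one per window of two consecutive zero diffs.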

pandas expand dataframe column with tuples, into multiple columns and rows

I have a data frame where one column contains a list of several tuples. I want to turn each tuple element into its own column and create a new row for each tuple. This code shows what I mean along with the solution I came up with:
import numpy as np
import pandas as pd
a = pd.DataFrame(data=[['a', 'b', [(1, 2, 3), (6, 7, 8)]],
                       ['c', 'd', [(10, 20, 30)]]], columns=['one', 'two', 'three'])
df2 = pd.DataFrame(columns=['one', 'two', 'A', 'B', 'C'])
print(a)
for index, item in a.iterrows():
    for xtup in item.three:
        temp = pd.Series(item)
        temp['A'] = xtup[0]
        temp['B'] = xtup[1]
        temp['C'] = xtup[2]
        temp = temp.drop('three')
        df2 = df2.append(temp)
print(df2)
The output is:
one two three
0 a b [(1, 2, 3), (6, 7, 8)]
1 c d [(10, 20, 30)]
one two A B C
0 a b 1 2 3
0 a b 6 7 8
1 c d 10 20 30
Unfortunately, my solution takes 2 hours to run on 55,000 rows! Is there a more efficient way to do this?
We can explode the list column into rows, then expand each tuple into its own columns:
a = a.explode('three')
a = pd.concat([a, pd.DataFrame(a.pop('three').tolist(), index=a.index)], axis=1)
one two 0 1 2
0 a b 1 2 3
0 a b 6 7 8
1 c d 10 20 30
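The expanded columns come out with positional labels 0, 1, 2; if you want the A/B/C names from the original attempt, rename them afterwards - a small follow-up, assuming every tuple has exactly three elements:
a = a.rename(columns={0: 'A', 1: 'B', 2: 'C'})
This vectorized route also sidesteps the row-by-row df2.append, which was the bottleneck (and which was removed entirely in pandas 2.0).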

How to join several data frames containing different pieces of one data into one?

I have several - let's say three - data frames that contain different rows (sometimes overlapping) of another data frame. The columns are the same for all three dfs. I now want to create a final data frame containing all the rows from the three dfs. Moreover, I need to add a column to the final df recording which of the three dfs each row came from.
Example below
Original data frame:
original_df = pd.DataFrame(np.array([[1,1],[2,2],[3,3],[4,4],[5,5],[6,6]]), columns = ['label1','label2'])
Three dfs containing different pieces of the original df:
columns = ['label1', 'label2']
a = original_df.loc[0:1, columns]
b = original_df.loc[2:2, columns]
c = original_df.loc[3:, columns]
I want to get the following data frame:
final_df = pd.DataFrame(np.array([[1,1,'a'],[2,2,'a'],[3,3,'b'],[4,4,'c'],\
[5,5,'c'],[6,6,'c']]), columns = ['label1','label2', 'from which df this row'])
or simply use integers to mark which df each row came from:
final_df = pd.DataFrame(np.array([[1,1,1],[2,2,1],[3,3,2],[4,4,3],\
[5,5,3],[6,6,3]]), columns = ['label1','label2', 'from which df this row'])
Thank you in advance!
IIUC, you can use pd.concat with the keys and names arguments:
pd.concat(
    [a, b, c], keys=['a', 'b', 'c'],
    names=['from which df this row']
).reset_index(0)
from which df this row label1 label2
0 a 1 1
1 a 2 2
2 b 3 3
3 c 4 4
4 c 5 5
5 c 6 6
However, I'd recommend that you store those dataframe pieces in a dictionary.
parts = {
    'a': original_df.loc[0:1],
    'b': original_df.loc[2:2],
    'c': original_df.loc[3:]
}
pd.concat(parts, names=['from which df this row']).reset_index(0)
from which df this row label1 label2
0 a 1 1
1 a 2 2
2 b 3 3
3 c 4 4
4 c 5 5
5 c 6 6
And as long as they are stored in a dictionary, you can also use assign, like this:
pd.concat(d.assign(**{'from which df this row': k}) for k, d in parts.items())
label1 label2 from which df this row
0 1 1 a
1 2 2 a
2 3 3 b
3 4 4 c
4 5 5 c
5 6 6 c
Keep in mind that I used the double-splat ** because you have a column name with spaces. If you had a column name without spaces, we could do
pd.concat(d.assign(WhichDF=k) for k, d in parts.items())
label1 label2 WhichDF
0 1 1 a
1 2 2 a
2 3 3 b
3 4 4 c
4 5 5 c
5 6 6 c
Just create a list and in the end concatenate:
list_df = []
list_df.append(df1)
list_df.append(df2)
list_df.append(df3)
df = pd.concat(list_df)
Perhaps this can work / add value for you :)
import pandas as pd
# from your post
a = original_df.loc[0:1, columns]
b = original_df.loc[2:2, columns]
c = original_df.loc[3:, columns]
# create new column to label the datasets
a['label'] = 'a'
b['label'] = 'b'
c['label'] = 'c'
# add each df to a list
combined_l = []
combined_l.append(a)
combined_l.append(b)
combined_l.append(c)
# concat all dfs into 1
df = pd.concat(combined_l)

How do I copy to a range, rather than a list, of columns?

I am looking to append several columns to a dataframe.
Let's say I start with this:
import pandas as pd
dfX = pd.DataFrame({'A': [1,2,3,4],'B': [5,6,7,8],'C': [9,10,11,12]})
dfY = pd.DataFrame({'D': [13,14,15,16],'E': [17,18,19,20],'F': [21,22,23,24]})
I am able to append the dfY columns to dfX by defining the new columns in list form:
dfX[[3,4]] = dfY.iloc[:,1:3].copy()
...but I would rather do so this way:
dfX.iloc[:,3:4] = dfY.iloc[:,1:3].copy()
The former works! The latter executes, returns no errors, but does not alter dfX.
Are you looking for
dfX = pd.concat([dfX, dfY], axis = 1)
It returns
A B C D E F
0 1 5 9 13 17 21
1 2 6 10 14 18 22
2 3 7 11 15 19 23
3 4 8 12 16 20 24
And you can append several dataframes in this like pd.concat([dfX, dfY, dfZ], axis = 1)
If you need to append, say, only columns D and E from dfY to dfX, go for:
pd.concat([dfX, dfY[['D', 'E']]], axis = 1)
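As for why the .iloc version silently does nothing: dfX only has columns at positions 0-2, and a positional slice past the end simply selects nothing (slice indexers are allowed to run out of bounds). Since .iloc cannot enlarge a frame, there is nowhere for the assigned values to go. A minimal sketch of a label-based alternative, assuming you want dfY's last two columns added under their own names:
print(dfX.iloc[:, 3:4])   # empty selection, so assigning to it is a no-op
# create the new columns by label instead; .to_numpy() sidesteps index alignment
dfX['E'] = dfY['E'].to_numpy()
dfX['F'] = dfY['F'].to_numpy()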

iterate through rows and columns in excel using pandas-Python 3

I have an excel spreadsheet that I read with this code:
df=pd.ExcelFile('/Users/xxx/Documents/Python/table.xlsx')
ccg=df.parse("CCG")
where CCG is the sheet I want inside the spreadsheet.
The sheet looks like this:
col1 col2 col3
x a 1 2
x b 3 4
x c 5 6
x d 7 8
x a 9 10
x b 11 12
x c 13 14
y a 15 16
y b 17 18
y c 19 20
y d 21 22
y a 23 24
How would I write code that gets the values of col2 and col3 for rows that contain both a and x? The proposed output for this table would be: col2 = [1, 9], col3 = [2, 10].
Try this:
df = (pd.read_excel('/Users/xxx/Documents/Python/table.xlsx', 'CCG',
                    index_col=0, usecols=['col1', 'col2'])
        .query("index == 'x' and col1 == 'a'"))
Demo:
Excel file:
In [243]: fn = r'C:\Temp\.data\41718085.xlsx'
In [244]: pd.read_excel(fn, 'CCG', index_col=0, usecols=['col1','col2']) \
.query("index == 'x' and col1 == 'a'")
Out[244]:
col1 col2
x a 1
x a 9
You can do:
df = pd.read_excel('/Users/xxx/Documents/Python/table.xlsx', sheet_name='CCG', index_col=0)
filter = df[(df.index == 'x') & (df.col1 == 'a')]
Then from here, you can get all the values as numpy arrays with:
filter['col2'].to_numpy()
filter['col3'].to_numpy()
Managed to solve it with a counter: iterate through the rows, adding 1 to the count each time an 'a' is found, and append the row position to a list of indices only if it falls within one of the ranges covered by 'x'; once I have the indices, I search through col2 and col3 and pull out the values at those indices.
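A rough sketch of that manual approach (variable names are illustrative, assuming the sheet was read with index_col=0 as in the answers above, so the x/y labels form the index):
indices = [i for i, (label, row) in enumerate(ccg.iterrows())
           if label == 'x' and row['col1'] == 'a']
col2_values = ccg['col2'].iloc[indices].tolist()   # [1, 9]
col3_values = ccg['col3'].iloc[indices].tolist()   # [2, 10]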
