Iterate through Pandas dataframe rows in a triangular fashion - python-3.x

I have a Pandas dataframe df like this:
   col1     col2
0  value11  List1
1  value12  List2
2  value13  List3
.. ...      ...
i  value1i  List_i
j  value1j  List_j
.. ...      ...
Col1 is the key (it does not repeat), and Col2 is a list. In the end, I want the set intersection of every pair of rows of Col2.
I would like to iterate through this dataframe in a triangular fashion.
Something along the lines of:
for i = 0; i < len(df); i++
    for j = i + 1; j < len(df); j++
        Set(List_i).intersect(Set(List_j))
So the first iterator goes through the full dataframe, while the second starts one index past the first and runs to the end of the dataframe.
How can I do this efficiently and quickly?
Edit:
The naive way of doing this is:
col1_list = list(set(df.col1))
num_col1_entries = len(col1_list)
for idx, value1 in enumerate(col1_list):
    for j in range(idx + 1, num_col1_entries):
        value2 = col1_list[j]
        # .iloc[0] pulls the list itself out of the single-row selection
        list1 = df.loc[df.col1 == value1, 'col2'].iloc[0]
        list2 = df.loc[df.col1 == value2, 'col2'].iloc[0]
        print(set(list1).intersection(set(list2)))
Expected output: n(n-1)/2 printed set intersections, one for each pair of rows of col2.

You can use itertools. Let's say this is your dataframe:
  col1     col2
0 value11  List1
1 value12  List2
2 value13  List3
3 value14  List4
4 value15  List5
5 value16  List6
Then get all the combinations (15) and print the intersection of the two lists:
from itertools import combinations

for pair in combinations(df.index, 2):
    print(pair)
    list1 = df.iloc[pair[0], 1]
    list2 = df.iloc[pair[1], 1]
    print(set(list1).intersection(set(list2)))
Output (only printing the pair):
(0, 1)
(0, 2)
(0, 3)
(0, 4)
(0, 5)
(1, 2)
(1, 3)
(1, 4)
(1, 5)
(2, 3)
(2, 4)
(2, 5)
(3, 4)
(3, 5)
(4, 5)
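Since the question asks for speed, one further optimization (my addition, not part of the answer above) is to build each row's set once instead of calling set() inside the pair loop. A minimal sketch, assuming col2 holds lists of hashable items:
from itertools import combinations

# Build every row's set once; the pair loop below would otherwise
# reconstruct each set O(n) times.
sets = [set(lst) for lst in df['col2']]

for i, j in combinations(range(len(sets)), 2):
    print((i, j), sets[i] & sets[j])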

Related

How can I iterate through a DataFrame with a conditional to reorganize my data?

I have a DataFrame in the following format, and I would like to rearrange it based on a conditional using one of the columns of data.
My current DataFrame has the following format:
df.head()
Room Temp1 Temp2 Temp3 Temp4
R1 1 2 1 3
R1 2 3 2 4
R1 3 4 3 5
R2 1 1 2 2
R2 2 2 3 3
...
R15 1 1 1 1
I would like to 'pivot' this DataFrame to look like this:
Room
R1 = [1, 2, 3, 2, 3, 4, 1, 2, 3, 3, 4, 5]
R2 = [1, 2, 1, 2, 2, 3, 2, 3]
...
R15 = [1, 1, 1, 1]
Where:
R1 = Temp1 + Temp2 + Temp3 + Temp4 (all temperature columns concatenated)
So that:
R1 = [1, 2, 3, 2, 3, 4, 1, 2, 3, 3, 4, 5]
First, I tried creating a list from each column using a 'where' conditional in which Room == 'R1':
room1 = np.where(df["Room"] == 'R1', df["Temp1"], 0).tolist()
It works, but I would need to do this individually for every column, of which there are many more than 4 in my other datasets.
Second: I tried to iterate through them:
i = ['Temp1', 'Temp2', 'Temp3', 'Temp4']
room1 = []
for i in df[i]:
    for row in df["Room"]:
        while row == "R1":
...and this is where I get very lost. Where do I go next? How can I iterate through the rest of the columns and end up with the DataFrame I have above?
This should work (although it's not very efficient and will be slow on a big DataFrame):
results = {}  # dict to store results
cols = ['Temp1', 'Temp2', 'Temp3', 'Temp4']
for r in df['Room'].unique():
    room_list = []
    sub_frame = df[df['Room'] == r]
    for col in cols:
        sub_col = sub_frame[col]
        for val in sub_col:
            room_list.append(val)
    results[r] = room_list
The lists are stored in the results dict, so you can access, say, R1 with:
results['R1']
Usually iterating over DataFrames is a bad idea, though; I'm sure there's a better solution!
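One such better solution (my sketch, not part of the original answer) is to let groupby do the row selection and flatten the values column by column, assuming the Temp columns hold plain numbers:
cols = ['Temp1', 'Temp2', 'Temp3', 'Temp4']

# Transpose so values concatenate column-by-column (all Temp1 values,
# then all Temp2 values, ...), matching the expected output order,
# then flatten to one list per room.
results = {room: group[cols].to_numpy().T.ravel().tolist()
           for room, group in df.groupby('Room')}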
I found the answer!
The trick is to use the .pivot() function to rearrange the columns accordingly. I had an additional column called 'Time' which I did not include in the original post, thinking it was not relevant to the solution.
What I ended up doing is pivoting the table based on Columns and Values using index as the rooms:
df = df.pivot(index="Room", columns="Time", values=["Temp1", "Temp2", "Temp3", "Temp4"])
Thank you to those who helped me on the way!
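For readers who want to try this, a minimal runnable sketch (my addition, with made-up data and a hypothetical Time column standing in for the one omitted from the original post; pivoting on a list of value columns needs a reasonably recent pandas):
import pandas as pd

# Hypothetical data: three readings per room at times 1, 2, 3.
df = pd.DataFrame({
    "Room":  ["R1", "R1", "R1", "R2", "R2", "R2"],
    "Time":  [1, 2, 3, 1, 2, 3],
    "Temp1": [1, 2, 3, 1, 2, 3],
    "Temp2": [2, 3, 4, 1, 2, 3],
})

# One row per room; the columns become a (Temp, Time) MultiIndex.
wide = df.pivot(index="Room", columns="Time", values=["Temp1", "Temp2"])
print(wide)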

Finite loop over a list with repeated parameter

I am working on an algorithm, and I would like to apply a function iteratively, a finite number of times, over the 3 rows of an array: row 1, row 2, row 3, back to row 1, row 2, and so on.
What I tried stops after the 3rd row:
import numpy as np

m, n = 3, 3
A = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
b = [1, 7, 9]

def my_func(x, i):
    pro = x + A_i^T[i,:]
    return pro

rows = 1
rows = [1, 2, 3]
x = np.zeros(n)
for n in range(1000):
    y = my_func(x, rows)
    print(y)
    x = y
    rows += 1
Nest the operation inside another loop:
for i in range(int(1000 / len(rows))):  # number of times to iterate rows 1, 2 & 3
    for row in A:
        # put your operations here
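A more complete sketch (my addition, under the assumption that the goal is simply to cycle through the rows for a fixed number of steps) uses a modulo index so the row wraps back to 0 automatically:
import numpy as np

A = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
x = np.zeros(3)

for k in range(9):           # 9 steps -> each row visited 3 times
    row = A[k % A.shape[0]]  # index cycles 0, 1, 2, 0, 1, 2, ...
    x = x + row              # placeholder update; swap in the real my_func
    print(k, row, x)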

How to find column position for first matching

I have a dataframe with 500K rows and 200 columns. For each row I need to find the column index of the first zero; if a row contains no zeros, the result should be 999.
Thank you for your kind help.
This is my example:
import numpy as np
import pandas as pd

a = {'A': [1, 2, 5, 7, 0, 9],
     'B': [6, 5, 0, 0, 7, 2],
     'C': [0, 8, np.nan, 10, 0, 6],
     'D': [np.nan, 9, 5, 2, 6, 7],
     'E': [1, 4, 6, 3, 3, 6]}
aidx = ['id_1', 'id_2', 'id_3',
        'id_4', 'id_5', 'id_6']
df = pd.DataFrame(a, index=aidx)

def get_col(df, num):
    df_num = df == num
    df_num = df_num[df_num.any(axis=1)].idxmax(axis=1)
    return df_num

df_new = pd.DataFrame(get_col(df, 0))
df_need = pd.DataFrame([2, 999, 1, 1, 0, 999], index=aidx)
Just like this:
s=(df.values==0)
np.where(np.any(s,1),s.argmax(1),999)
Out[77]: array([ 2, 999, 1, 1, 0, 999], dtype=int64)
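For context (my note, not part of the answer above): argmax on a boolean array returns the position of the first True, which is why the mask trick finds the first zero in each row:
import numpy as np

row = np.array([7, 0, 3]) == 0   # -> array([False,  True, False])
print(row.argmax())              # 1: the position of the first True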
Create a mapping dictionary from the dataframe's column names to their positions, and construct the result using numpy.where:
d = dict(zip(df.columns, np.arange(len(df.columns))))
df = pd.DataFrame(np.where(df.eq(0).any(1),df.eq(0).idxmax(1), 999), index=df.index)
df[0] = df[0].map(d).fillna(999).astype(int)
0
id_1 2
id_2 999
id_3 1
id_4 1
id_5 0
id_6 999
Or using NumPy:
from numpy import copy

a = copy(np.where(df.eq(0).any(1), df.eq(0).idxmax(1), 999))
for k, v in d.items():
    a[a == k] = v
pd.DataFrame(a, index=df.index)

Get row and column in Pandas for a cell with a certain value

I am trying to read an Excel spreadsheet that is unformatted using Pandas. There are multiple tables within a single sheet and I want to convert these tables into dataframes. Since it is not already "indexed" in the traditional way, there are no meaningful column or row indices. Is there a way to search for a specific value and get the row, column where that is? For example, say I want to get a row, column number for all cells that contain the string "Title".
I have already tried things like DataFrame.filter but that only works if there are row and column indices.
Create a df with NaN where your_value is not found.
Drop all rows that don't contain the value.
Drop all columns that don't contain the value.
a = df.where(df=='your_value').dropna(how='all').dropna(axis=1)
To get the row(s)
a.index
To get the column(s)
a.columns
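An alternative worth noting (my addition, not from the answer above): numpy.where on the boolean mask returns the positions of every matching cell directly, which you can translate back to labels:
import numpy as np

# np.where on a 2-D boolean mask returns parallel arrays of
# row and column positions for every matching cell.
rows, cols = np.where((df == 'your_value').to_numpy())
positions = list(zip(df.index[rows], df.columns[cols]))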
You can do it with a long and hard-to-read list comprehension:
# assume this df and that we are looking for 'abc'
df = pd.DataFrame({'col':['abc', 'def','wert','abc'], 'col2':['asdf', 'abc', 'sdfg', 'def']})
[(df[col][df[col].eq('abc')].index[i], df.columns.get_loc(col)) for col in df.columns for i in range(len(df[col][df[col].eq('abc')].index))]
out:
[(0, 0), (3, 0), (1, 1)]
I should note that this is (index value, column location)
You can also change .eq() to .str.contains() if you are looking for strings that contain a certain value:
[(df[col][df[col].str.contains('ab')].index[i], df.columns.get_loc(col)) for col in df.columns for i in range(len(df[col][df[col].str.contains('ab')].index))]
You can simply create a mask of the same shape as your df by calling df == 'title'.
You can then combine this with the df.where() method, which sets all fields that differ from your keyword to NA, and finally use dropna() to reduce it to the valid fields. Then you can use df.columns and df.index like you're used to.
df = pd.DataFrame({"a": [0,1,2], "b": [0, 9, 7]})
print(df.where(df == 0).dropna().index)
print(df.where(df == 0).dropna().columns)
#Int64Index([0], dtype='int64')
#Index(['a', 'b'], dtype='object')
Here's an example that fetches the row and column indices of all cells containing the word 'title':
df = pd.DataFrame({'A': ['here goes the title', 'tt', 'we have title here'],
                   'B': ['ty', 'title', 'complex']})
df
+---+---------------------+---------+
| | A | B |
+---+---------------------+---------+
| 0 | here goes the title | ty |
| 1 | tt | title |
| 2 | we have title here | complex |
+---+---------------------+---------+
idx = df.apply(lambda x: x.str.contains('title'))

col_idx = []
for i in range(df.shape[1]):
    col_idx.append(df.iloc[:, i][idx.iloc[:, i]].index.tolist())

out = []
cnt = 0
for i in col_idx:
    for j in range(len(i)):
        out.append((i[j], cnt))
    cnt += 1
out
# [(0, 0), (2, 0), (1, 1)] # Expected output
The answer by @firefly works if the second dropna also gets how='all', like so:
a = df.where(df == 'your_value').dropna(how='all').dropna(how='all', axis=1)
Another approach that's in the vein of @It_is_Chris's solution, but may be a little easier to read:
# assuming this df and that we are looking for 'abc'
df = pd.DataFrame({'col':['abc', 'def','wert','abc'], 'col2':['asdf', 'abc', 'sdfg', 'def']})
[x[1:] for x in ((v, i, j) for i, row_tup in enumerate(df.itertuples(index=False)) for j, v in enumerate(row_tup)) if x[0] == "abc"]
Output
[(0, 0), (1, 1), (3, 0)]
Similar to what Chris said, I found this to work for me, although it's not the prettiest or shortest way. Wrapped here as a helper (the name find_matches is just a label) that returns all the (row, column) pairs matching a compiled regular expression in a dataframe:
def find_matches(df, regex):
    # Collect (row, column) positions whose cell matches the compiled regex.
    tuples = []
    row_count = 0
    for row in df.itertuples(index=False):  # index=False keeps positions aligned with data columns
        col_count = 0
        for col in row:
            if regex.match(str(col)):
                tuples.append((row_count, col_count))
            col_count += 1
        row_count += 1
    return tuples
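A usage sketch for the helper above (the dataframe and pattern are made up):
import re
import pandas as pd

df = pd.DataFrame({'a': ['foo', 'bar1'], 'b': ['2baz', 'qux']})
# regex.match anchors at the start of the cell, so only '2baz' matches.
print(find_matches(df, re.compile(r'\d')))
# [(0, 1)]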

What does the output of this line in pandas dataframe signify?

I am learning Pandas DataFrame and came across this code:
df = pd.DataFrame(np.array([[1, 2, 3], [4, 5, 6]]))
Now when I use print(list(df.columns.values)) as suggested on this page, the output is:
[0, 1, 2]
I am unable to understand the output. What are the values 0, 1, 2 signifying? Since the height of the DataFrame is 2, I suppose the last value 2 is signifying the height. What about 0 and 1?
I apologize if this question is a duplicate. I couldn't find any relevant explanation. If there is any similar question, please mention the link.
Many thanks.
If the question is what the columns are, check these samples:
df = pd.DataFrame(np.array([[1, 2, 3], [4, 5, 6]]))
print (df)
0 1 2
0 1 2 3
1 4 5 6
#default columns names
print(list(df.columns.values))
[0, 1, 2]
print(list(df.index.values))
[0, 1]
df = pd.DataFrame(np.array([[1, 2, 3], [4, 5, 6]]), columns=list('abc'))
print (df)
a b c
0 1 2 3
1 4 5 6
#custom columns names
print(list(df.columns.values))
['a', 'b', 'c']
print(list(df.index.values))
[0, 1]
You can also check the docs:
The axis labeling information in pandas objects serves many purposes:
Identifies data (i.e. provides metadata) using known indicators, important for analysis, visualization, and interactive console display
Enables automatic and explicit data alignment
Allows intuitive getting and setting of subsets of the data set
What is a data frame?
df is a data frame. Take a step back and take in what that means. I mean outside what it means from a Pandas perspective. Though there are many nuances to what different people mean by a data frame, generally, it is a table of data with rows and columns.
How do we reference those rows and/or columns?
Consider the example data frame df. I create a 4x4 table with tuples in each cell representing the (row, column) position of that cell. You'll also notice the labels on the rows are ['A', 'B', 'C', 'D'] and the labels on the columns are ['W', 'X', 'Y', 'Z']
df = pd.DataFrame(
    [[(i, j) for j in range(4)] for i in range(4)],
    list('ABCD'), list('WXYZ')
)
df
W X Y Z
A (0, 0) (0, 1) (0, 2) (0, 3)
B (1, 0) (1, 1) (1, 2) (1, 3)
C (2, 0) (2, 1) (2, 2) (2, 3)
D (3, 0) (3, 1) (3, 2) (3, 3)
If we wanted to reference by position, we could highlight the cell at the zeroth row and third column:
df.style.applymap(lambda x: 'background: #aaf' if x == (0, 3) else '')
We could get at that position with iloc (which handles ordinal/positional indexing)
df.iloc[0, 3]
(0, 3)
What makes Pandas special is that it gives us an alternative way to reference both the rows and/or the columns. We could reference by the labels using loc (which handles label indexing)
df.loc['A', 'Z']
(0, 3)
I intentionally labeled the rows and columns with letters so as to not confuse label indexing with positional indexing. In your data frame, you let Pandas give you a default index for both rows and columns and those labels end up just being equivalent to positions when you begin.
What is the difference between label and positional indexing?
Consider this modified version of our data frame. Let's call it df_
df_ = df.sort_index(axis=1, ascending=False)
df_
Z Y X W
A (0, 3) (0, 2) (0, 1) (0, 0)
B (1, 3) (1, 2) (1, 1) (1, 0)
C (2, 3) (2, 2) (2, 1) (2, 0)
D (3, 3) (3, 2) (3, 1) (3, 0)
Notice that the columns are in reverse order. And when I call the same positional reference as above but on df_
df_.iloc[0, 3]
(0, 0)
I get a different answer because my columns have shifted around and are out of their original position.
However, if I call the same label reference
df_.loc['A', 'Z']
(0, 3)
I get the same thing. So label indexing allows me to reference regardless of the order of rows or columns.
OK! But what about OP's question?
Pandas stores the data in the values attribute:
df.values
array([[(0, 0), (0, 1), (0, 2), (0, 3)],
[(1, 0), (1, 1), (1, 2), (1, 3)],
[(2, 0), (2, 1), (2, 2), (2, 3)],
[(3, 0), (3, 1), (3, 2), (3, 3)]], dtype=object)
The column labels are in the columns attribute:
df.columns
Index(['W', 'X', 'Y', 'Z'], dtype='object')
And the row labels are in the index attribute:
df.index
Index(['A', 'B', 'C', 'D'], dtype='object')
It so happens that in OP's sample data frame, the column labels were the defaults: [0, 1, 2].
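To tie this back to the original question, a minimal check (my sketch) of those default labels:
import numpy as np
import pandas as pd

# With no labels supplied, pandas assigns a RangeIndex, so labels
# and positions coincide.
df = pd.DataFrame(np.array([[1, 2, 3], [4, 5, 6]]))
print(df.columns)  # RangeIndex(start=0, stop=3, step=1)
print(df.index)    # RangeIndex(start=0, stop=2, step=1)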
