Using dataframe value as an index for a list [duplicate] - python-3.x

This question already has an answer here:
Python: an efficient way to slice a list with a index list
(1 answer)
Closed 3 years ago.
I have a dataframe and a list set up as follows:
index      list_index
sample1             1
sample2             2
sample3             4
values = [-0.5, -23, 0, 15, 100]
I am trying to create a new column in the dataframe that uses each row's list_index to look up the corresponding entry in the values list, like the following:
index      list_index    val
sample1             1    -23
sample2             2      0
sample3             4    100
My code is:
df['val'] = values[df['list_index']]
I am getting TypeError: list indices must be integers or slices, not Series.

I would use pandas.Series.apply.
Example code:
import pandas as pd
df = pd.DataFrame({'A' : [1,3,5]})
v = [0, 1, 0, 1, 0, 1]
df['B'] = df['A'].apply(lambda x: v[x])
Results in what you are looking for I think:
Out[7]:
   A  B
0  1  1
1  3  1
2  5  1
For your code, that would be:
df['val'] = df['list_index'].apply(lambda x: values[x])
Essentially, you are getting the above error because you are not passing the list indices element by element but as a whole Series, and a plain Python list cannot be indexed by a Series.
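For completeness, the lookup can also be vectorized without apply by converting the list to a NumPy array first. This is a sketch of that idea; the frame and values below are reconstructed from the question:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'list_index': [1, 2, 4]},
                  index=['sample1', 'sample2', 'sample3'])
values = [-0.5, -23, 0, 15, 100]

# A NumPy array (unlike a plain list) accepts a whole Series of positions at once
df['val'] = np.array(values)[df['list_index']]
print(df)
```

On large frames this avoids the per-row Python call that apply makes.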

Related

Compare value in a dataframe to multiple columns of another dataframe to get a list of lists where entries match in an efficient way

I have two pandas DataFrames and I want to find all entries of the second DataFrame where a specific value occurs.
As an example:
df1:
   NID
0    1
1    2
2    3
3    4
4    5
df2:
   EID  N1  N2  N3  N4
0    1   1   2  13  12
1    2   2   3  14  13
2    3   3   4  15  14
3    4   4   5  16  15
4    5   5   6  17  16
5    6   6   7  18  17
6    7   7   8  19  18
7    8   8   9  20  19
8    9   9  10  21  20
9   10  10  11  22  21
Now, what I basically want is a list of lists with the EID values (from df2) where each NID value (from df1) occurs in any of the columns N1, N2, N3, N4:
Solution would be:
sol = [[1], [1, 2], [2, 3], [3, 4], [4, 5]]
The desired solution explained:
The solution has 5 entries (len(sol) == 5) since I have 5 entries in df1.
The first entry in sol is [1] because the value NID = 1 only appears in the columns N1..N4 for EID = 1 in df2.
The second entry in sol refers to the value NID = 2 (of df1) and has length 2, because NID = 2 can be found in column N1 (for EID = 2) and in column N2 (for EID = 1). Therefore, the second entry in the solution is [1, 2], and so on.
What I tried so far is looping over each element of df1 and, for each one, looping over df2 to see if NID is in any of the columns N1, N2, N3, N4. This works, but for large DataFrames (each can have several thousand rows) it becomes extremely time-consuming.
Therefore I was looking for a much more efficient solution.
My code as implemented:
Input data:
import pandas as pd
df1 = pd.DataFrame({'NID': [1, 2, 3, 4, 5]})
df2 = pd.DataFrame({'EID': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
                    'N1': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
                    'N2': [2, 3, 4, 5, 6, 7, 8, 9, 10, 11],
                    'N3': [13, 14, 15, 16, 17, 18, 19, 20, 21, 22],
                    'N4': [12, 13, 14, 15, 16, 17, 18, 19, 20, 21]})
Solution acquired using looping:
sol = []
for idx, node in df1.iterrows():
    x = []
    for idx2, elem in df2.iterrows():
        if node['NID'] == elem['N1']:
            x.append(elem['EID'])
        if node['NID'] == elem['N2']:
            x.append(elem['EID'])
        if node['NID'] == elem['N3']:
            x.append(elem['EID'])
        if node['NID'] == elem['N4']:
            x.append(elem['EID'])
    sol.append(x)
print(sol)
If anyone has a solution where I do not have to loop, I would be very happy. Maybe a NumPy function, or something like cKDTrees, but unfortunately I have no idea how to solve this problem in a faster way.
Thank you in advance!
You can reshape to long form with melt, filter with loc, and aggregate with groupby.agg(list). Then reindex on df1['NID'] and convert with tolist:
out = (df2
       .melt('EID')  # reshape to long form
       # filter the values that are in df1['NID']
       .loc[lambda d: d['value'].isin(df1['NID'])]
       # aggregate as list
       .groupby('value')['EID'].agg(list)
       # ensure all original NID are present in order
       # and convert to list
       .reindex(df1['NID']).tolist()
       )
Alternative with stack:
df3 = df2.set_index('EID')
out = (df3
       .where(df3.isin(df1['NID'].tolist())).stack()
       .reset_index(name='group')
       .groupby('group')['EID'].agg(list)
       .reindex(df1['NID']).tolist()
       )
Output:
[[1], [2, 1], [3, 2], [4, 3], [5, 4]]
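Since the question explicitly asks about NumPy: a broadcasting sketch (my addition, not part of the answer above) compares every NID against every N1..N4 cell in one shot, with no Python-level loop over df2:

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'NID': [1, 2, 3, 4, 5]})
df2 = pd.DataFrame({'EID': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
                    'N1': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
                    'N2': [2, 3, 4, 5, 6, 7, 8, 9, 10, 11],
                    'N3': [13, 14, 15, 16, 17, 18, 19, 20, 21, 22],
                    'N4': [12, 13, 14, 15, 16, 17, 18, 19, 20, 21]})

nids = df1['NID'].to_numpy()                      # shape (5,)
cells = df2[['N1', 'N2', 'N3', 'N4']].to_numpy()  # shape (10, 4)

# (5,1,1) == (1,10,4) broadcasts to (5,10,4); any() over the last axis
# marks, for each NID, which df2 rows contain it somewhere in N1..N4
row_hits = (nids[:, None, None] == cells[None, :, :]).any(axis=2)
eids = df2['EID'].to_numpy()
sol = [eids[hits].tolist() for hits in row_hits]
print(sol)  # EIDs come out in df2 row order
```

Memory grows as len(df1) * len(df2) * 4 booleans, so for frames of a few thousand rows this stays manageable while avoiding iterrows entirely.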

Pandas DataFrame producing unexpected result with list comprehension

I am using list comprehensions on a DataFrame to find the elements common to its columns.
df
   A  B  C
0  1  2  0
1  3  4  6
2  5  6  7
3  7  3  3
4  9  1  9
l = [i for i in df.A if i in df.B]
l
[1, 3]
list2 = [i for i in l if i in df.C]
list2
[1, 3]
The first list comprehension produces the expected result, i.e. the common elements of A and B are [1, 3].
But [i for i in l if i in df.C] produces an unexpected result.
Convert the DataFrame column to a list.
list2 = [i for i in l if i in list(df.C)]
OR
list2=[i for i in l if i in df.C.tolist()]
output of list2:
>>>print(list2)
[3]
This is because `i in df.C` checks membership in the Series index, not in its values; labels 1 and 3 both exist in the index, so both elements passed the test.
You can also use df.C.values instead.
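A quick demonstration of the pitfall, using the C column from the question: the `in` operator on a Series is a membership test on its index labels, just as it is on a dict's keys:

```python
import pandas as pd

c = pd.Series([0, 6, 7, 3, 9])  # default index 0..4

print(7 in c)           # False: 7 is not an index label, even though it is a value
print(3 in c)           # True:  label 3 exists, regardless of the values
print(7 in c.tolist())  # True:  this is a test on the values
```

Converting with tolist(), list(), or .values before the membership test makes the comprehension operate on values, as intended.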

Sorting one column by absolute value for values different from zero, keeping equal values of another column together

I have the following dataframe:
A    B    C
============
11   x    2
11   y    0
13   x  -10
13   y    0
10   x    7
10   y    0
and I would like to sort C by absolute value, ignoring the zeros. But as I need to keep the rows with equal A values together, it would look like below (sorted by absolute value, but with the 0 rows staying next to their group):
A    B    C
============
13   x  -10
13   y    0
10   x    7
10   y    0
11   x    2
11   y    0
I can't manage to obtain this with sort_values(). If I sort by C, the equal A values are no longer together.
Step 1: get absolute values
# creating a column with the absolute values
df["abs_c"] = df["c"].abs()
Step 2: sort values on absolute values of "c"
# sorting by absolute value of "c" & reseting the index & assigning it back to df
df = df.sort_values("abs_c",ascending=False).reset_index(drop=True)
Step 3: get the order of column "a" based on the sorted values. This uses drop_duplicates, which keeps the first instance of each value in column "a" (now sorted by the absolute value of "c"). This order is used in the next step.
# getting the order of "a" based on sorted value of "c"
order_a = df["a"].drop_duplicates()
Step 4: based on the order of "a" and the sorted values of "c" creating a data frame
# based on the order_a creating a data frame as per the order_a which is based on the sorted values of abs "c"
sorted_df = pd.DataFrame()
for i in range(len(order_a)):
    # DataFrame.append was removed in pandas 2.0; pd.concat is the replacement
    sorted_df = pd.concat([sorted_df, df[df["a"] == order_a.iloc[i]]])
Step 5:Assigning the sorted df back to df
# reset index of sorted values and assigning it back to df
df = sorted_df.reset_index(drop=True)
Output
    a  b   c  abs_c
0  13  x -10     10
1  13  y   0      0
2  10  x   7      7
3  10  y   0      0
4  11  x   2      2
5  11  y   0      0
Doc reference
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.sort_values.html
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.drop_duplicates.html
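Steps 1-5 can also be collapsed into a single sort (my sketch, not part of the answer above): give every row its group's maximum |c| via groupby().transform('max') and sort on that one key, so whole groups move together:

```python
import pandas as pd

df = pd.DataFrame({'a': [11, 11, 13, 13, 10, 10],
                   'b': ['x', 'y', 'x', 'y', 'x', 'y'],
                   'c': [2, 0, -10, 0, 7, 0]})

# Each row gets the largest |c| of its 'a' group; sorting on that key moves
# groups as blocks, and the stable mergesort keeps rows inside a group in
# their original order (non-zero row first, then its 0 row)
key = df['c'].abs().groupby(df['a']).transform('max')
out = (df.assign(key=key)
         .sort_values('key', ascending=False, kind='mergesort')
         .drop(columns='key')
         .reset_index(drop=True))
print(out)
```

This avoids both the explicit loop and the helper abs_c column in the final result.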
Sorry, it doesn't turn out very nice, but I almost never use pandas. I hope it works out the way you want.
import pandas as pd
df = pd.DataFrame({'a': [11, 11, 13, 13, 10, 10],
                   'b': ['x', 'y', 'x', 'y', 'x', 'y'],
                   'c': [2, 0, -10, 0, 7, 0]})
# non-zero rows, sorted by absolute value of 'c'
mask = df[df['c'] != 0].copy()  # .copy() avoids SettingWithCopyWarning
mask['abs'] = mask['c'].abs()
mask = mask.sort_values('abs', ascending=False)
# rebuild the frame group by group in that order, so rows sharing the same
# 'a' stay together (writing the sorted rows back one by one would split them)
df = pd.concat([df[df['a'] == a_val] for a_val in mask['a']]).reset_index(drop=True)
print(df)

How to access the items in list from Dataframe against each index item? [duplicate]

This question already has answers here:
Pandas expand rows from list data available in column
(3 answers)
Closed 3 years ago.
Consider the table (DataFrame) below.
I need each item in the list placed against its index, such as given below. What are the possible ways of doing this in Python?
Feel free to tweak the question if that better matches the context.
You can do this using the pandas library with the explode method. Here is how your code would look -
import pandas as pd
data = [["A", [1, 2, 3, 4]], ["B", [9, 6, 4]]]
df = pd.DataFrame(data, columns=['Index', 'Lists'])
print(df)
df = df.explode('Lists').reset_index(drop=True)
print(df)
Your output would be -
  Index         Lists
0     A  [1, 2, 3, 4]
1     B     [9, 6, 4]
  Index Lists
0     A     1
1     A     2
2     A     3
3     A     4
4     B     9
5     B     6
6     B     4
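One caveat worth knowing (my addition, not from the answer above): explode leaves the exploded column with object dtype even when every item is an integer, so cast it if you need numeric operations afterwards:

```python
import pandas as pd

df = pd.DataFrame({'Index': ['A', 'B'], 'Lists': [[1, 2, 3, 4], [9, 6, 4]]})
out = df.explode('Lists').reset_index(drop=True)

print(out['Lists'].dtype)   # object, even though every item is an int
out['Lists'] = out['Lists'].astype('int64')
print(out['Lists'].dtype)   # int64
```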

Python/Pandas return column and row index of found string

I've searched previous answers relating to this, but those answers seem to use NumPy because the array contains numbers. I am trying to search for a keyword in a sentence in a DataFrame ('Timeframe'), where the full sentence is 'Timeframe for wave in ____', and would like to return the column and row index. For example:
df.iloc[34,0]
returns the string I am looking for, but I am avoiding a hard-coded position for dynamic reasons. Is there a way to return [34, 0] when I search the dataframe for the keyword 'Timeframe'?
EDIT:
To get the index you need str.contains with boolean indexing, but then there are 3 possible outcomes:
df = pd.DataFrame({'A':['Timeframe for wave in ____', 'a', 'c']})
print (df)
A
0 Timeframe for wave in ____
1 a
2 c
def check(val):
    a = df.index[df['A'].str.contains(val)]
    if a.empty:
        return 'not found'
    elif len(a) > 1:
        return a.tolist()
    else:
        # only one value - return scalar
        return a.item()
print (check('Timeframe'))
0
print (check('a'))
[0, 1]
print (check('rr'))
not found
Old solution:
It seems you need numpy.where to check for the value Timeframe:
df = pd.DataFrame({'A': list('abcdef'),
                   'B': [4, 5, 4, 5, 5, 4],
                   'C': [7, 8, 9, 4, 2, 'Timeframe'],
                   'D': [1, 3, 5, 7, 1, 0],
                   'E': [5, 3, 6, 9, 2, 4],
                   'F': list('aaabbb')})
print (df)
   A  B          C  D  E  F
0  a  4          7  1  5  a
1  b  5          8  3  3  a
2  c  4          9  5  6  a
3  d  5          4  7  9  b
4  e  5          2  1  2  b
5  f  4  Timeframe  0  4  b
a = np.where(df.values == 'Timeframe')
print (a)
(array([5], dtype=int64), array([2], dtype=int64))
b = [x[0] for x in a]
print (b)
[5, 2]
In case you have multiple columns to look into, you can use the following code example:
import numpy as np
import pandas as pd
df = pd.DataFrame([[1, 2, 3, 4],
                   ["a", "b", "Timeframe for wave in____", "d"],
                   [5, 6, 7, 8]])
mask = np.column_stack([df[col].str.contains("Timeframe", na=False) for col in df])
find_result = np.where(mask == True)
result = [find_result[0][0], find_result[1][0]]
Then output for df and result would be:
>>> df
   0  1                          2  3
0  1  2                          3  4
1  a  b  Timeframe for wave in____  d
2  5  6                          7  8
>>> result
[1, 2]
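For the record, a stack-based sketch (my addition, not from the answers above) returns every (row, column) position of the keyword without building the mask column by column:

```python
import pandas as pd

df = pd.DataFrame([[1, 2, 3, 4],
                   ['a', 'b', 'Timeframe for wave in____', 'd'],
                   [5, 6, 7, 8]])

# stack() turns the frame into one Series keyed by (row label, column label);
# astype(str) lets str.contains run over the numeric cells too
cells = df.stack().astype(str)
positions = cells[cells.str.contains('Timeframe')].index.tolist()
print(positions)  # list of (row, column) pairs
```

Unlike taking only the first np.where hit, this keeps all matches if the keyword appears in several cells.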