How to access the items in list from Dataframe against each index item? [duplicate] - python-3.x

This question already has answers here:
Pandas expand rows from list data available in column
(3 answers)
Closed 3 years ago.
Consider the table (DataFrame) below.
I need each item in the list listed against its index, as given below. What are the possible ways of doing this in Python?
Anybody can tweak the question if it matches the context.

You can do this with the explode method from the pandas library. Here is how your code would look:
import pandas as pd
df = [["A", [1,2,3,4]],["B",[9,6,4]]]
df = pd.DataFrame(df, columns = ['Index', 'Lists'])
print(df)
df = df.explode('Lists').reset_index(drop=True)
print(df)
Your output would be:
Index Lists
0 A [1, 2, 3, 4]
1 B [9, 6, 4]
Index Lists
0 A 1
1 A 2
2 A 3
3 A 4
4 B 9
5 B 6
6 B 4
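One detail the printed output does not show: as far as I know, explode leaves the exploded column with dtype object, even when every item is an integer. A minimal sketch, reusing the data from the answer above, that converts the column back to integers (the names exploded and dtype_before are only illustrative):

```python
import pandas as pd

df = pd.DataFrame([["A", [1, 2, 3, 4]], ["B", [9, 6, 4]]], columns=["Index", "Lists"])

# One row per list item; reset_index gives a fresh 0..n-1 index.
exploded = df.explode("Lists").reset_index(drop=True)

# The exploded column still holds generic Python objects at this point.
dtype_before = exploded["Lists"].dtype

# Cast to a numeric dtype so that arithmetic and aggregation behave as expected.
exploded["Lists"] = exploded["Lists"].astype(int)
print(dtype_before, exploded["Lists"].dtype)
```

The cast assumes every list item is a valid integer; with mixed or missing values pd.to_numeric may be a better fit.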

Related

Compare value in a dataframe to multiple columns of another dataframe to get a list of lists where entries match in an efficient way

I have two pandas DataFrames and I want to find all entries of the second DataFrame where a specific value occurs.
As an example:
df1:
NID
0 1
1 2
2 3
3 4
4 5
df2:
EID N1 N2 N3 N4
0 1 1 2 13 12
1 2 2 3 14 13
2 3 3 4 15 14
3 4 4 5 16 15
4 5 5 6 17 16
5 6 6 7 18 17
6 7 7 8 19 18
7 8 8 9 20 19
8 9 9 10 21 20
9 10 10 11 22 21
What I basically want is a list of lists with the values EID (from df2) where the values NID (from df1) occur in any of the columns N1, N2, N3, N4:
Solution would be:
sol = [[1], [1, 2], [2, 3], [3, 4], [4, 5]]
The desired solution explained:
The solution has 5 entries (len(sol) = 5) since I have 5 entries in df1.
The first entry in sol is 1 because the value NID = 1 only appears in the columns N1,N2,N3,N4 for EID=1 in df2.
The second entry in sol refers to the value NID=2 (of df1) and has the length 2 because NID=2 can be found in column N1 (for EID=2) and in column N2 (for EID=1). Therefore, the second entry in the solution is [1,2] and so on.
What I tried so far is looping for each element in df1 and then looping for each element in df2 to see if NID is in any of the columns N1,N2,N3,N4. This solution works but for huge dataframes (each df can have up to some thousand entries) this solution becomes extremely time-consuming.
Therefore I was looking for a much more efficient solution.
My code as implemented:
Input data:
import pandas as pd
df1 = pd.DataFrame({'NID':[1,2,3,4,5]})
df2 = pd.DataFrame({'EID':[1,2,3,4,5,6,7,8,9,10],
'N1':[1,2,3,4,5,6,7,8,9,10],
'N2':[2,3,4,5,6,7,8,9,10,11],
'N3':[13,14,15,16,17,18,19,20,21,22],
'N4':[12,13,14,15,16,17,18,19,20,21]})
solution acquired using looping:
sol = []
for idx, node in df1.iterrows():
    x = []
    for idx2, elem in df2.iterrows():
        if node['NID'] == elem['N1']:
            x.append(elem['EID'])
        if node['NID'] == elem['N2']:
            x.append(elem['EID'])
        if node['NID'] == elem['N3']:
            x.append(elem['EID'])
        if node['NID'] == elem['N4']:
            x.append(elem['EID'])
    sol.append(x)
print(sol)
If anyone has a solution where I do not have to loop, I would be very happy. Maybe using a numpy function or something like cKDTrees but unfortunately I have no idea on how to get this problem solved in a faster way.
Thank you in advance!
You can reshape with melt, filter with loc, and groupby.agg as list. Then reindex and convert tolist:
out = (df2
.melt('EID') # reshape to long form
# filter the values that are in df1['NID']
.loc[lambda d: d['value'].isin(df1['NID'])]
# aggregate as list
.groupby('value')['EID'].agg(list)
# ensure all original NID are present in order
# and convert to list
.reindex(df1['NID']).tolist()
)
Alternative with stack:
df3 = df2.set_index('EID')
out = (df3
.where(df3.isin(df1['NID'].tolist())).stack()
.reset_index(name='group')
.groupby('group')['EID'].agg(list)
.reindex(df1['NID']).tolist()
)
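Output of the melt version (within each sub-list the EIDs follow the column order of df2, N1 before N2, so the order differs from the loop result):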
[[1], [2, 1], [3, 2], [4, 3], [5, 4]]
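One thing to watch with the reindex step: an NID that never occurs in df2 comes back as NaN rather than as an empty list. A minimal sketch of one way to handle that, using small made-up data (the value 99 and the shortened df2 are assumptions for illustration, not part of the question):

```python
import pandas as pd

# Hypothetical data: NID 99 does not occur anywhere in df2.
df1 = pd.DataFrame({"NID": [1, 2, 99]})
df2 = pd.DataFrame({"EID": [1, 2, 3], "N1": [1, 2, 3], "N2": [2, 3, 4]})

grouped = (
    df2.melt("EID")                                   # long form: EID, variable, value
    .loc[lambda d: d["value"].isin(df1["NID"])]      # keep only values present in df1
    .groupby("value")["EID"].agg(list)               # one list of EIDs per value
)

# reindex leaves NaN for NIDs without a match; swap those for empty lists.
out = [g if isinstance(g, list) else [] for g in grouped.reindex(df1["NID"])]
print(out)
```

If the order inside each sub-list matters, sorting each list afterwards is a simple way to make the result independent of the column order.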

How to access list of list values in columns in dataset

One column of my DataFrame holds lists of values. For example, I have columns A, B and C plus an output column: column A holds 12, column B holds 30, and column C holds a list such as [0.01, 1.234, 2.31]. When I try to compute the mean of each list, I get an error saying that the 'list' object has no attribute 'mean'. How can I replace every list in that column with its mean?
You can transform the column which contains the lists to another DataFrame and calculate the mean.
import pandas as pd
df = ... # Original df
pd.DataFrame(df['column_with_lists'].values.tolist()).mean(1)
This would result in a pandas Series which looks like the following:
0 mean_of_list_row_0
1 mean_of_list_row_1
. .
. .
. .
n mean_of_list_row_n
You can use apply(np.mean) on the column with the lists in it to get the mean. For example:
Build a dataframe:
import numpy as np
import pandas as pd
df = pd.DataFrame([[2,4],[4,6]])
df[3] = [[5,7],[8,9,10]]
print(df)
0 1 3
0 2 4 [5, 7]
1 4 6 [8, 9, 10]
Use apply(np.mean)
print(df[3].apply(np.mean))
0 6.0
1 9.0
Name: 3, dtype: float64
If you want to convert that column into the mean of the lists:
df[3] = df[3].apply(np.mean)
print(df)
0 1 3
0 2 4 6.0
1 4 6 9.0
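Another option, sketched here on the same example data, is to explode the list column and average per original row. This is only an alternative to the two answers above, and it assumes every list item is numeric:

```python
import pandas as pd

df = pd.DataFrame([[2, 4], [4, 6]])
df[3] = [[5, 7], [8, 9, 10]]

# explode gives one row per list item and repeats the original index label,
# so grouping on the index (level=0) averages each original list.
means = df[3].explode().astype(float).groupby(level=0).mean()
print(means.tolist())
```

Lists of different lengths are fine here, because each group is averaged over its own items only.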

Combine dataframe within the list to form a single dataframe using pandas in python [duplicate]

This question already has answers here:
How to merge two dataframes side-by-side?
(6 answers)
Closed 2 years ago.
Let's say I have a list df_list with three single-column pandas DataFrames as below:
>>> df_list
[ A
0 1
1 2
2 3, B
0 4
1 5
2 6, C
0 7
1 8
2 9]
I would like to merge them to become a single dataframe dat as below:
>>> dat
A B C
0 1 4 7
1 2 5 8
2 3 6 9
One way I can get it done is to create a blank DataFrame and concatenate each of them onto it in a for loop.
dat = pd.DataFrame([])
for i in range(0, len(df_list)):
    dat = pd.concat([dat, df_list[i]], axis=1)
Is there a more efficient way to achieve this without using iteration? Thanks in advance.
Use concat with list of DataFrames:
dat = pd.concat(df_list, axis = 1)
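A minimal runnable sketch of the same call, rebuilding df_list from the values shown in the question. The extra frame named shifted is an assumption added only to show that concat aligns rows on index labels rather than on position:

```python
import pandas as pd

df_list = [
    pd.DataFrame({"A": [1, 2, 3]}),
    pd.DataFrame({"B": [4, 5, 6]}),
    pd.DataFrame({"C": [7, 8, 9]}),
]

# A single call places all frames side by side, aligned on the index.
dat = pd.concat(df_list, axis=1)
print(dat)

# Hypothetical frame with a different index: rows are matched by label,
# so labels missing on either side are filled with NaN.
shifted = pd.DataFrame({"D": [0, 0, 0]}, index=[1, 2, 3])
wider = pd.concat([dat, shifted], axis=1)
print(wider)
```

If the frames should be joined purely by position, calling reset_index(drop=True) on each one first avoids the label alignment.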

How to convert a column containing list into separate column in pandas data-frame? [duplicate]

This question already has answers here:
Split a Pandas column of lists into multiple columns
(11 answers)
How to convert string representation of list to a list
(19 answers)
Closed 3 years ago.
I've a data frame and one of its columns contains a list.
A B
0 5 [3, 4]
1 4 [1, 1]
2 1 [7, 7]
3 3 [0, 2]
4 5 [3, 3]
5 4 [2, 2]
The output should look like this:
A x y
0 5 3 4
1 4 1 1
2 1 7 7
3 3 0 2
4 5 3 3
5 4 2 2
I have tried these options that I found here, but it's not working.
df = pd.DataFrame(data={"A":[0,1],
"B":[[3,4],[1,1]]})
df['x'] = df['B'].apply(lambda x:x[0])
df['y'] = df['B'].apply(lambda x:x[1])
df.drop(['B'],axis=1,inplace=True)
A x y
0 0 3 4
1 1 1 1
In case the list is stored as a string:
from ast import literal_eval
df = pd.DataFrame(data={"A":[0,1],
"B":['[3,4]','[1,1]']})
df['x'] = df['B'].apply(lambda x:literal_eval(x)[0])
df['y'] = df['B'].apply(lambda x:literal_eval(x)[1])
df.drop(['B'],axis=1,inplace=True)
Third way (credit goes to @anky_91):
df = pd.DataFrame(data={"A":[0,1],
"B":['[3,4]','[1,1]']})
df["B"] = df["B"].apply(lambda x :literal_eval(x))
df = df[['A']].join(pd.DataFrame(df["B"].values.tolist(),columns=['x','y'],index=df.index))
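For reference, a self-contained sketch of the split using part of the data from the question. It assumes every list has exactly two items; the names xy and result are only illustrative:

```python
import pandas as pd

df = pd.DataFrame({"A": [5, 4, 1], "B": [[3, 4], [1, 1], [7, 7]]})

# Build the new columns from the lists, keeping the original index
# so that the join below lines the rows up correctly.
xy = pd.DataFrame(df["B"].tolist(), columns=["x", "y"], index=df.index)

# Drop the list column and attach the two new columns.
result = df.drop(columns="B").join(xy)
print(result)
```

Building one DataFrame from all the lists avoids calling apply once per new column.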

Using dataframe value as an index for a list [duplicate]

This question already has an answer here:
Python: an efficient way to slice a list with a index list
(1 answer)
Closed 3 years ago.
I have a dataframe and a list set up as follows:
index list_index
sample1 1
sample2 2
sample3 4
values = [-0.5, -23, 0, 15, 100]
I am trying to create a new column in the dataframe that takes the list_index and values list. Like the following:
index list_index val
sample1 1 -23
sample2 2 0
sample3 4 100
My code is:
df['val'] = values[df['list_index']]
I am getting TypeError: list indices must be integers or slices, not Series.
I would use pandas.Series.apply.
Example code:
import pandas as pd
df = pd.DataFrame({'A' : [1,3,5]})
v = [0, 1, 0, 1, 0, 1]
df['B'] = df['A'].apply(lambda x: v[x])
Results in what you are looking for I think:
Out[7]:
A B
0 1 1
1 3 1
2 5 1
For your code, you would do the following:
df['val'] = df['list_index'].apply(lambda x: values[x])
Essentially, you get the above error because you are indexing the list with a whole Series rather than with one integer at a time, which a Python list does not support.
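If the list is long, a vectorized alternative is to turn it into a NumPy array, which does accept an array of positions as an index. A minimal sketch, assuming the sample labels are the DataFrame index and every list_index is a valid position in values:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {"list_index": [1, 2, 4]},
    index=["sample1", "sample2", "sample3"],
)
values = [-0.5, -23, 0, 15, 100]

# NumPy arrays support indexing with an integer array, unlike Python lists.
df["val"] = np.asarray(values)[df["list_index"].to_numpy()]
print(df)
```

Note that the array is converted to floats here because values mixes -0.5 with integers, so val comes out as a float column.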
