pandas: generate a new column based on values from another column, considering duplicates

I am working with a DataFrame in which each value of one column is a list. I want to derive a new column that assigns a unique integer id to each row whose list has more than one element; rows with single-element lists get 0. If two lists contain the same elements in a different order, they should be assigned the same id. A sample DataFrame looks like this:
document_no_list cluster_id
[1,2,3] 1
[3,2,1] 1
[4,5,6,7] 2
[8] 0
[9,10] 3
[10,9] 3
The cluster_id column only considers the 1st, 2nd, 3rd, 5th and 6th rows, each of which has a list with more than one element, and assigns a unique integer id to the corresponding cell. [1,2,3] and [3,2,1] should share one cluster_id, as should [9,10] and [10,9].
I asked a similar question, without the duplicate-list requirement, at
pandas how to derived values for a new column base on another column
How can I do this in pandas?

First, add a column with the list lengths and another with each list converted to a sorted tuple (tuples are hashable, which drop_duplicates needs; plain lists are not):
df['list_len'] = df.document_no_list.apply(len)
df['list_sorted'] = df.document_no_list.apply(lambda x: tuple(sorted(x)))
Then assign a cluster_id to each distinct sorted tuple that has more than one element:
ids = df.loc[df.list_len > 1, ['list_sorted']].drop_duplicates()
ids['cluster_id'] = range(1, len(ids) + 1)
Left join this onto the original dataframe, and fill the rows that were not matched (the singletons) with zeros:
df = df.merge(ids, how='left').fillna({'cluster_id': 0})
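As a cross-check on the steps above, here is a minimal single-pass sketch that avoids the merge. It is an alternative, not the method from the answer: the helper name `cluster_id_for` and the `seen` dictionary are made up for illustration, and ids are assumed to be numbered in order of first appearance.

```python
import pandas as pd

df = pd.DataFrame({
    "document_no_list": [[1, 2, 3], [3, 2, 1], [4, 5, 6, 7], [8], [9, 10], [10, 9]]
})

# Maps an order-independent key (sorted tuple) to the id it was first given.
seen = {}

def cluster_id_for(lst):
    """Hypothetical helper: 0 for singletons, otherwise a shared id per element set."""
    if len(lst) <= 1:
        return 0
    key = tuple(sorted(lst))  # [3, 2, 1] and [1, 2, 3] produce the same key
    # setdefault stores the next free id only if the key is new.
    return seen.setdefault(key, len(seen) + 1)

df["cluster_id"] = df["document_no_list"].apply(cluster_id_for)
print(df)
```

On the sample data this reproduces the cluster_id column shown in the question.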

Related

Find three sets of Top 5 non-zero column names in descending order

I am relatively new to Python and am struggling with a problem.
I am trying to find the top keywords based on a description in order to optimize a keyword search algorithm.
I have created a TF-IDF matrix and need to do the following:
a) Find the top n column names row-wise (these correspond to keywords, because the column names are the tokens of my corpus).
b) Divide the top n columns into three sets in descending order (Set1: top 10 TF-IDF scores, Set2: 11-20, Set3: 21-30). If there are 15 items, the top 10 go in the first column, the next 5 go in the second column, and the third column stays empty.
I have a code snippet which creates a column per keyword (top 3). I want to extend it to save buckets of 10 items each. The cur
Here is the code:
dfScore = pd.DataFrame(score.toarray(), columns=tfidf.get_feature_names())
pd.concat([df,pd.DataFrame(dfScore.apply(lambda x:list(dfScore.columns[np.array(x).argsort()[::-1][:3]]), axis=1).values.tolist(), columns=['One', 'Two', 'Three'])], axis=1)
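One possible way to extend the snippet to buckets is sketched below. It is only an illustration under stated assumptions: the function name `top_keyword_buckets`, its parameters, and the tiny score table are invented here, zero scores are assumed to be excluded, and a bucket size of 2 is used instead of 10 so the result is easy to read.

```python
import numpy as np
import pandas as pd

def top_keyword_buckets(scores, n_buckets=3, bucket_size=10):
    """Hypothetical helper: per row, rank non-zero columns by descending score
    and split the ranked names into consecutive buckets of bucket_size."""
    cols = scores.columns.to_numpy()
    buckets = {f"Set{b + 1}": [] for b in range(n_buckets)}
    for values in scores.to_numpy():
        order = np.argsort(-values, kind="stable")          # highest score first
        ranked = [cols[i] for i in order if values[i] > 0]  # drop zero scores
        for b in range(n_buckets):
            # Slicing past the end yields an empty list, so short rows
            # simply leave the later buckets empty.
            chunk = ranked[b * bucket_size:(b + 1) * bucket_size]
            buckets[f"Set{b + 1}"].append(chunk)
    return pd.DataFrame(buckets, index=scores.index)

# Made-up stand-in for dfScore: rows are documents, columns are tokens.
scores = pd.DataFrame(
    [[0.1, 0.5, 0.0, 0.3, 0.2],
     [0.0, 0.0, 0.9, 0.0, 0.0]],
    columns=["a", "b", "c", "d", "e"],
)
result = top_keyword_buckets(scores, n_buckets=3, bucket_size=2)
print(result)
```

With the real matrix, the call would presumably be `top_keyword_buckets(dfScore)` with the default bucket size of 10, and the result could be concatenated to df with pd.concat as in the original snippet.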

Excel - Lookup date in matrix and return column heading

I have a matrix between Products and Enablers, where the intersection between the two represents a point in time.
Product list  Enabler 1  Enabler 2  Enabler 3
Product 1     10-Oct     11-Oct     20-Oct
Product 2     20-Nov     25-Nov     01-Dec
Product 3     10-Oct     21-Oct     25-Oct
I need to turn this into a 'timeline' view, so that there are two ways to see the data visually: the dates run across the top and, based on the timing in the first table, each cell shows the corresponding Enabler at the correct date. Something like:
Product list  10-Oct     11-Oct     12-Oct
Product 1     Enabler 1  Enabler 2
Product 2
Product 3     Enabler 1
Does anyone have any ideas how I'd do this? I think it requires an INDEX MATCH array formula as it needs to look across the matrix to find the date in that row, then return what is in the header column...but this isn't my area of expertise and I just can't seem to figure out how to make it work.
One approach might be to return this as an array. You could do:
=IF( ( Table1[[Enabler 1]:[Enabler 3]] = B7:D7 ) * ( Table1[Product list] = A8:A10),
Table1[[#Headers],[Enabler 1]:[Enabler 3]],
"" )
where Table1 is an Excel Table that holds your Product List and Enablers as columns (as shown in your first table); A8:A10 is the list of products in your second table; and B7:D7 is the list of dates in your second table shown as column headers. The formula would be placed in the upper left cell of your second table, which is B8 in my example.
The result will spill into the second table.
If you wanted your second table to be an Excel Table, the approach would be different, as arrays cannot spill into Excel Tables.

Sum dictionary values stored in Data frame Columns

I have a DataFrame column whose cells hold lists of dictionaries. I want to sum only the values and store the result in a new column.
Column 1                                  Desired Output
[{'Apple':3}, {'Mango':2}, {'Peach':4}]   9
[{'Blue':2}, {'Black':1}]                 3
df['Desired Output'] = [sum(x) for x in df['Column 1']]
df
Assuming your Column 1 column does indeed have dictionaries (and not strings that look like dictionaries), this should do the trick:
df['Desired Output'] = df["Column 1"].apply(lambda lst: sum(sum(d.values()) for d in lst))
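A small runnable sketch of this, built on the sample from the question. The helper `total_of_values` is a hypothetical name, and the `ast.literal_eval` branch is an added assumption for the case where the cells turn out to be strings that only look like lists of dictionaries.

```python
import ast
import pandas as pd

df = pd.DataFrame({
    "Column 1": [
        [{"Apple": 3}, {"Mango": 2}, {"Peach": 4}],
        [{"Blue": 2}, {"Black": 1}],
    ]
})

def total_of_values(cell):
    """Hypothetical helper: sum every value of every dict in one cell."""
    if isinstance(cell, str):
        # Assumption: string cells hold a Python-literal list of dicts.
        cell = ast.literal_eval(cell)
    return sum(sum(d.values()) for d in cell)

df["Desired Output"] = df["Column 1"].apply(total_of_values)
print(df)
```

Summing the cell directly with `sum(x)` fails because it tries to add dictionaries to an integer, which is why the inner `sum(d.values())` is needed.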

compare one column value with all the values of other column using pandas

I have an Excel file which contains the values below.
I need to compare each a_id value with all the values of b_id; if it matches any of them, a_flag should be set to 1, otherwise 0.
For example, take the first value in a_id, i.e. 123, and compare it with all the values of b_id (113, 211, 222, 123). When the comparison reaches 123 in b_id, it matches, so a_flag is set to 1.
In the same way, every value of a_id is compared with all the values of b_id, so in the end the a_flag column holds either 1 or 0.
Once that is done, each value of b_id is compared with all the values in the a_id column, and the b_flag column is updated accordingly.
Finally, I will have the data below.
I need to do this using pandas because I am dealing with a large collection of data. Below is my attempt, but it only compares each a_id value with the b_id value in the same row. For example, it compares 123 (the first a_id value) only with 113 (the first b_id value).
import pandas as pd
df1 = pd.read_excel('system_data.xlsx')
df1['a_flag'] = (df1['a_id'] == df1['b_id']).astype(int)
Use Series.isin to test membership:
df1['a_flag'] = df1['a_id'].isin(df1['b_id']).astype(int)
df1['b_flag'] = df1['b_id'].isin(df1['a_id']).astype(int)
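To see the effect without the Excel file, here is a self-contained sketch. The b_id values are the ones quoted in the question; the a_id values other than 123 are made up for illustration.

```python
import pandas as pd

# Stand-in for pd.read_excel('system_data.xlsx'); a_id values after 123 are invented.
df1 = pd.DataFrame({
    "a_id": [123, 124, 125, 126],
    "b_id": [113, 211, 222, 123],
})

# isin checks each value against the whole other column, not just the same row.
df1["a_flag"] = df1["a_id"].isin(df1["b_id"]).astype(int)
df1["b_flag"] = df1["b_id"].isin(df1["a_id"]).astype(int)
print(df1)
```

Here 123 appears in both columns, so it is flagged with 1 in the first row of a_flag and the last row of b_flag, while a row-wise `==` comparison would flag nothing.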

How to select bunch of rows

I have a DataFrame with multiple columns. I want to select each bunch of rows in which column B has consecutive 1s, keep a bunch only if column A has a value equal to 0.04 somewhere in those rows, and then extract the start and end values of column A for that bunch.
Here is my dataframe
Here is my desired output:
Label the consecutive groups with .diff().abs().cumsum().bfill(), filter out the groups that do not satisfy the conditions (x['B'].eq(1).any() and x['A'].eq(0.04).any()), then group by the consecutive-group column again and use agg to extract the first and last values of A:
df['temp'] = df.B.diff().abs().cumsum().bfill()
df.groupby('temp').filter(lambda x: (x['B'].eq(1).any() and x['A'].eq(0.04).any()))\
.groupby('temp').agg({'A':['first','last']})
Out:
          A
      first  last
temp
3.0   344.0  39.9
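The same idea can be tried on a small made-up frame. This sketch swaps in a different labelling technique, comparing B with its shifted self, instead of the diff-based labelling above, so that the first row always starts its own run; the column name `run` and all data values are assumptions for illustration.

```python
import pandas as pd

# Invented data: one run of 1s containing 0.04, one run of 1s without it,
# and a 0.04 that sits in a run of 0s.
df = pd.DataFrame({
    "A": [5.0, 344.0, 0.04, 39.9, 7.0, 1.0, 2.0, 0.04],
    "B": [0, 1, 1, 1, 0, 1, 1, 0],
})

# A new run starts wherever B differs from the previous row.
df["run"] = (df["B"] != df["B"].shift()).cumsum()

# Keep only runs where B is 1 throughout and A contains 0.04.
kept = df.groupby("run").filter(
    lambda g: bool(g["B"].eq(1).all() and g["A"].eq(0.04).any())
)

# Start and end value of A for each surviving run.
result = kept.groupby("run")["A"].agg(["first", "last"])
print(result)
```

If the values in A come from floating-point arithmetic rather than literals, an exact `eq(0.04)` test may miss them, and a tolerance-based comparison such as numpy.isclose may be the safer choice.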
