Find three sets of Top 5 non-zero column names in descending order - python-3.x

I am relatively new to Python and struggling with a problem.
I am trying to find the top keywords for each description in order to optimize a keyword search algorithm.
I have created a TF-IDF matrix and I need to do the following:
a) Find the top n column names row-wise (these correspond to keywords, because the column names are the tokens of my corpus).
b) Divide the top n columns into three sets in descending order (Set1: top 10 TF-IDF scores, Set2: 11-20, Set3: 21-30). If there are 15 items, the top 10 go in the first column, the next 5 go in the second column, and the third column stays empty.
I have a code snippet which creates a column per keyword (top 3) . I want to extend it to save buckets of 10 items each. The cur
This is the code:
dfScore = pd.DataFrame(score.toarray(), columns=tfidf.get_feature_names())
pd.concat([df,pd.DataFrame(dfScore.apply(lambda x:list(dfScore.columns[np.array(x).argsort()[::-1][:3]]), axis=1).values.tolist(), columns=['One', 'Two', 'Three'])], axis=1)
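One way the bucketing described in (a) and (b) might be sketched is shown below. It is only an illustration under assumptions: the function name top_keyword_buckets, its parameters bucket_size and n_buckets, the Set1..Set3 column names and the choice to store each bucket as a comma-separated string are all hypothetical. Zero scores are dropped before ranking, as the title suggests.

```python
import numpy as np
import pandas as pd


def top_keyword_buckets(scores, bucket_size=10, n_buckets=3):
    """Rank the non-zero columns of each row by score (descending) and
    split the ranked column names into consecutive buckets.

    Hypothetical helper: `scores` is assumed to be a numeric DataFrame
    whose column names are the tokens (like dfScore in the question).
    """
    names = scores.columns.to_numpy()
    rows = []
    for values in scores.to_numpy():
        order = np.argsort(-values, kind="stable")   # highest score first
        order = order[values[order] > 0]             # keep non-zero scores only
        ranked = names[order][: bucket_size * n_buckets]
        # Bucket i holds ranks i*bucket_size .. (i+1)*bucket_size - 1;
        # a bucket with no items becomes an empty string.
        rows.append([
            ", ".join(ranked[i * bucket_size:(i + 1) * bucket_size])
            for i in range(n_buckets)
        ])
    columns = [f"Set{i + 1}" for i in range(n_buckets)]
    return pd.DataFrame(rows, index=scores.index, columns=columns)


# Tiny made-up example with buckets of 2 instead of 10.
scores = pd.DataFrame(
    [[0.1, 0.0, 0.5, 0.3],
     [0.0, 0.0, 0.2, 0.0]],
    columns=["a", "b", "c", "d"],
)
result = top_keyword_buckets(scores, bucket_size=2, n_buckets=2)
```

With the question's variables this could presumably be used as pd.concat([df, top_keyword_buckets(dfScore)], axis=1), assuming df and dfScore share the same index.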

Related

Excel - Lookup date in matrix and return column heading

I have a matrix between Products and Enablers, where the intersection between the two represents a point in time.
Product list  Enabler 1  Enabler 2  Enabler 3
Product 1     10-Oct     11-Oct     20-Oct
Product 2     20-Nov     25-Nov     01-Dec
Product 3     10-Oct     21-Oct     25-Oct
I need to turn this into a 'timeline' view, so that there are two ways to see the data visually. In this view the dates run across the top and, based on the timing in the first table, each cell shows the corresponding 'Enabler' at the correct date, something like:
Product list  10-Oct     11-Oct     12-Oct
Product 1     Enabler 1  Enabler 2
Product 2
Product 3     Enabler 1
Does anyone have any ideas how I'd do this? I think it requires an INDEX MATCH array formula as it needs to look across the matrix to find the date in that row, then return what is in the header column...but this isn't my area of expertise and I just can't seem to figure out how to make it work.
One approach might be to return this as an array. You could do:
=IF( ( Table1[[Enabler 1]:[Enabler 3]] = B7:D7 ) * ( Table1[Product list] = A8:A10),
Table1[[#Headers],[Enabler 1]:[Enabler 3]],
"" )
where Table1 is an Excel Table that holds your Product List and Enablers as columns (as shown in your first table); A8:A10 is the list of products in your second table; and B7:D7 is the list of dates in your second table shown as column headers. The formula would be placed in the upper left cell of your second table - in my example, B8 as shown here:
The result will spill into the second table.
If you wanted your second table to be an Excel Table, the approach would be different, as arrays cannot spill into Excel Tables.

How to return count of number in sequence after validating three values from 3 different columns even though there will be matched data found?

I have a range of cells in Excel. How can I increment a number when a row passes validation against three different columns?
I tried the formula COUNTIF($A$2:A2,A2), which creates a number sequence. But I have other data to validate from other columns for it to return the correct number sequence.
First validation: count the emp no in column range A1:A5, which returns a result under the Hierarchy column.
Second validation: check the % value under column L against the hierarchy levels below, which is where the problem comes from.
1 - 0.25
2 - 0.25
3 - 0.5
4 - 0.5
5 - 1
Third validation: check the type of relation (see the Relation column), which also needs to be taken into account when returning the sequence number. Below is the Relation Level table.
I don't know how to join these three conditions to get the result below.
My real problem is how to get the sequence number when a person has 3 children, who should be tagged 2, 3, 4 (after the spouse, which is 1). The next relation, Parent, should then be tagged with the next number after the last child, which would be 5. As per the relation table the Parent level is 3, but the number is adjusted according to the count of relations a person has. In this specific instance, even though the Parent's sequence number is 5, it should still have 0.5 EE % (see relation table level vs. % hierarchy level). I hope this makes sense, but let me know if you have any questions.
I hope someone can help me with this, because I am not an expert when it comes to Excel formulas. Thank you!

Sampling a dataframe according to some rules: balancing a multilabel dataset

I have a dataframe like this:
df = pd.DataFrame({'id':[10,20,30,40],'text':['some text','another text','random stuff', 'my cat is a god'],
'A':[0,0,1,1],
'B':[1,1,0,0],
'C':[0,0,0,1],
'D':[1,0,1,0]})
Here I have columns from A to D, but my real dataframe has 100 columns with values of 0 and 1. The real dataframe has 100k records.
For example, column A is related to the 3rd and 4th rows of text, because it is labeled as 1. In the same way, A is not related to the 1st and 2nd rows of text, because it is labeled as 0.
What I need to do is sample this dataframe so that I have the same, or about the same, number of rows for each feature.
In this case, feature C has only one occurrence, so I need to filter all the other columns so that I have one text with A, one text with B, one text with C, etc.
Ideally, I could set for example n=100, meaning I want to sample so that I have 100 records for every feature.
This is a multilabel training dataset and it is highly unbalanced; I am looking for the best way to balance it for a machine learning task.
Important: I don't want to exclude the 0 features. I just want to have ABOUT the same number of 1s and 0s in each column.
For example, in a final dataset with 1k records, I would like to have all columns from A to final_column, each with the same number of 1s and 0s. To accomplish this I will need to randomly discard rows (text and id) only.
The approach I was trying was to look for the feature with the lowest 1 and 0 counts and then use this value as a threshold.
Edit 1: One possible way I thought is to use:
df.sum(axis=0, skipna=True)
Then I can use the column with the lowest sum as the threshold to filter the text column. I don't know how to do this filtering step.
Thanks
The exact output you expect is unclear, but assuming you want to get 1 random row per letter with 1 you could reshape (while dropping the 0s) and use GroupBy.sample:
(df
.set_index(['id', 'text'])
.replace(0, float('nan'))
.stack()
.groupby(level=-1).sample(n=1)
.reset_index()
)
NB. you can rename the columns if needed
output:
   id             text level_2    0
0  30     random stuff       A  1.0
1  20     another text       B  1.0
2  40  my cat is a god       C  1.0
3  30     random stuff       D  1.0
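If more than one row per label is wanted, a fixed n in GroupBy.sample can be a problem for rare labels that have fewer than n positive rows. The sketch below swaps in a different technique, shuffling and then taking GroupBy.head(n), so that rare labels simply contribute what they have. The names n, long, positives, sampled and balanced are hypothetical, and melt is used here instead of the stack-based reshape above.

```python
import pandas as pd

df = pd.DataFrame({
    'id': [10, 20, 30, 40],
    'text': ['some text', 'another text', 'random stuff', 'my cat is a god'],
    'A': [0, 0, 1, 1],
    'B': [1, 1, 0, 0],
    'C': [0, 0, 0, 1],
    'D': [1, 0, 1, 0],
})

n = 1  # hypothetical cap on the number of rows kept per label

# Long format: one row per (id, text, label) pair, keeping only the 1s.
long = df.melt(id_vars=['id', 'text'], var_name='label', value_name='flag')
positives = long[long['flag'] == 1]

# Shuffle, then keep at most n rows per label. Labels with fewer than n
# positive rows contribute all the rows they have.
sampled = (positives
           .sample(frac=1, random_state=0)
           .groupby('label')
           .head(n))

# Rows of the original frame that were picked for at least one label.
balanced = df[df['id'].isin(sampled['id'])]
```

Because a picked row can carry several labels, the label counts in balanced are only approximately equal; this is a starting point rather than an exact balancing procedure.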

pandas generates a new column based on values from another column considering duplicates

I am working on a dataframe with a column in which each value is a list. I want to derive a new column that only considers lists whose size is greater than 1 and assigns a unique integer to the corresponding row as an id. If two lists contain the same elements in a different order, they should be assigned the same id. A sample dataframe looks like this:
document_no_list cluster_id
[1,2,3] 1
[3,2,1] 1
[4,5,6,7] 2
[8] 0
[9,10] 3
[10,9] 3
The cluster_id column only considers the 1st, 2nd, 3rd, 5th and 6th rows, each of which has a list of size greater than 1, and assigns a unique integer id to the corresponding cell. Also, [1,2,3] and [3,2,1] should share one cluster_id, as should [9,10] and [10,9].
I asked a similar question, without considering duplicate list values, at
pandas how to derived values for a new column base on another column
I am wondering how to do that in pandas.
First, assign one column with the list lengths and another with each list sorted and converted to a tuple (tuples are hashable, which drop_duplicates and merge require; plain lists are not):
df['list_len'] = df.document_no_list.apply(len)
df['list_sorted'] = df.document_no_list.apply(lambda x: tuple(sorted(x)))
Then assign a cluster_id to each distinct sorted tuple:
ids = df.loc[df.list_len > 1, ['list_sorted']].drop_duplicates()
ids['cluster_id'] = range(1,len(ids)+1)
Left join this onto the original dataframe, and fill whatever that hasn't been joined (the singletons) with zeros:
df.merge(ids, how = 'left').fillna({'cluster_id':0})
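To check the expected ids against the sample data, a dictionary-based alternative can be sketched as follows. This is not the merge approach above, just a minimal illustration; the helper name cluster_id and the seen dictionary are hypothetical.

```python
import pandas as pd

df = pd.DataFrame({
    "document_no_list": [[1, 2, 3], [3, 2, 1], [4, 5, 6, 7], [8], [9, 10], [10, 9]],
})

seen = {}  # maps each sorted tuple to the id it was first given


def cluster_id(doc_list):
    """Return 0 for singletons, otherwise an id shared by all lists
    that contain the same elements regardless of order."""
    if len(doc_list) <= 1:
        return 0
    key = tuple(sorted(doc_list))            # order-independent, hashable key
    return seen.setdefault(key, len(seen) + 1)


df["cluster_id"] = df["document_no_list"].apply(cluster_id)
```

Ids are handed out in order of first appearance, so on the sample data the column matches the one shown in the question.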

Find the top n values in a range while keeping the sum of values in another range under x value

I'd like to accomplish the following task. There are three columns of data. Column A represents price, where the sum needs to be kept under $100,000. Column B represents a value. Column C represents a name tied to columns A & B.
Out of >100 rows of data, I need to find the highest 8 values in column B while keeping the sum of the prices in column A under $100,000. And then return the 8 names from column C.
Can this be accomplished?
EDIT:
I attempted the Solver solution w/ no luck. 200 rows looks to be the max w/ Solver, and that is what I'm using now. Here are the steps I've taken:
Create a column called rank RANK(B2,$B$2:$B$200) (used column D -- what is the purpose of this?)
Create a column called flag just put in zeroes (used column E)
Create 3 total cells total_price (=SUM(A2:A200)), total_value (=SUM(B2:B200)) and total_flag (=(E2:E200))
Use solver to minimize total_value (shouldn't this be maximize??)
Add constraints -Total_price<=100000 -Total_flag=8 -Flag cells are binary
Using Simplex LP, it simply changes the flags for the first 8 values. However, the total price for the first 8 values is >$100,000 ($140k). I've tried changing some options in the Solver Parameters as well as using different solving methods to no avail. I'd like to post an image of the parameter settings, but don't have enough "reputation".
EDIT #2:
The first 5 rows looks like this, price goes down to ~$6k at the bottom of the table.
Price Value Name Rank Flag
$22,538 42.81905675 Blow, Joe 1 0
$22,427 37.36240932 Doe, Jane 2 0
$17,158 34.12127693 Hall, Cliff 3 0
$16,625 33.97654031 Povich, John 4 0
$15,631 33.58212402 Cow, Holy 5 0
I'll give you the Solver solution as a starting point. It involves the creation of some extra columns and total cells. Note that Solver is limited in the number of cells it can handle, but it will work with 100 anyway.
Create a column called rank RANK(B2,$B$2:$B$100)
Create a column called flag just put in zeroes
Create 3 total cells total_price, total_value and total_flag
Use solver to minimize total_value
Add constraints
-Total_price<=100000
-Total_flag=8
-Flag cells are binary
This will flag the rows you want and you can grab the names however you want.
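Outside Excel, the same selection problem (pick exactly k rows, keep the total price within a budget, maximize the total value) can be sketched as a small dynamic program. This is an alternative illustration and not what Solver does internally; the function name pick_rows and the parameters k and budget are hypothetical, and the demo uses only the five sample rows from the question with k=2 and a made-up budget of 40000.

```python
def pick_rows(rows, k, budget):
    """Choose exactly k rows maximizing total value with total price <= budget.

    rows: iterable of (price, value, name) with integer prices.
    Returns (total_value, names) or None if no selection fits.
    """
    # best[count] maps a reachable total price to the best
    # (total_value, names) achievable with exactly `count` rows.
    best = [dict() for _ in range(k + 1)]
    best[0][0] = (0.0, ())
    for price, value, name in rows:
        # Go through counts downwards so each row is used at most once.
        for count in range(k - 1, -1, -1):
            for total_price, (total_value, names) in list(best[count].items()):
                new_price = total_price + price
                if new_price > budget:
                    continue
                candidate = (total_value + value, names + (name,))
                current = best[count + 1].get(new_price)
                if current is None or candidate[0] > current[0]:
                    best[count + 1][new_price] = candidate
    if not best[k]:
        return None
    return max(best[k].values(), key=lambda state: state[0])


# The five sample rows from the question: (price, value, name).
rows = [
    (22538, 42.81905675, "Blow, Joe"),
    (22427, 37.36240932, "Doe, Jane"),
    (17158, 34.12127693, "Hall, Cliff"),
    (16625, 33.97654031, "Povich, John"),
    (15631, 33.58212402, "Cow, Holy"),
]
best_value, best_names = pick_rows(rows, k=2, budget=40000)
```

For the real data the call would presumably use k=8 and budget=100000; the number of states grows with the number of distinct price totals, so this is a sketch for moderate table sizes rather than a tuned solver.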
