How to add NER tags to features - nlp

I have a set of training sentences for which I computed some float features. In each sentence, two entities are identified. They are either of type 'PERSON', 'ORGANIZATION', 'LOCATION', or 'OTHER'. I would like to add these types to my feature matrix (which stores float variables).
My question is: is there a recommended way to add these entity types?
I can think of two ways for now:
- adding TWO columns, one for each entity, filled with entity type ids (e.g. 0 to 3 or 1 to 4)
- adding EIGHT columns, one for each combination of entity and entity type, filled with 0's and 1's
Best!

I would recommend using something that can easily be normalized and that is in the same range as the rest of your data.
So if all your float values are between -1 and 1, I would keep the values from your named entity recognition in the same range.
Depending on what you prefer, or on what gives you the best result, you could either assign 4 values in the same range as the rest of your floats or use a binary encoding with more columns.

Finally, the second suggestion (adding EIGHT columns, one for each entity type and each entity, and filling them with 0's and 1's) worked fine!
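
The eight-column encoding can be sketched as follows. This is a minimal illustration using NumPy, not the asker's actual code: the helper name one_hot_entity_pairs, the type ordering, and the sample data are all assumptions made for the example.

```python
import numpy as np

# Assumed ordering of the four entity types; any fixed order works.
ENTITY_TYPES = ["PERSON", "ORGANIZATION", "LOCATION", "OTHER"]
TYPE_INDEX = {name: i for i, name in enumerate(ENTITY_TYPES)}


def one_hot_entity_pairs(pairs):
    """Return an (n_sentences, 8) matrix of 0/1 values: 4 columns per entity."""
    n_types = len(ENTITY_TYPES)
    out = np.zeros((len(pairs), 2 * n_types), dtype=float)
    for row, (first, second) in enumerate(pairs):
        out[row, TYPE_INDEX[first]] = 1.0             # columns 0-3: first entity
        out[row, n_types + TYPE_INDEX[second]] = 1.0  # columns 4-7: second entity
    return out


# Hypothetical data: 3 sentences with 2 float features each.
float_features = np.array([[0.2, -0.5], [0.9, 0.1], [-0.3, 0.7]])
entity_pairs = [
    ("PERSON", "LOCATION"),
    ("ORGANIZATION", "OTHER"),
    ("PERSON", "PERSON"),
]

# Append the eight indicator columns to the existing float features.
features = np.hstack([float_features, one_hot_entity_pairs(entity_pairs)])
```

Unlike the two-column id encoding, the indicator columns do not impose an artificial ordering on the entity types.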


Deal with Ties when Using Index/Match

I'm currently pulling the top five numerical values from one sheet and entering them into a different sheet. Each number is in its own column and there is a name matching that column, EX:
Ties are common in the data I'm working with, and they all but break my formulas.
For getting the name:
=INDEX('Total Cases by Categories'!$B$18:$B$50, MATCH(LARGE('Total Cases by Categories'!$H$18:$H$50, A39),'Total Cases by Categories'!$H$18:$H$50, 0))
For getting the numerical value associated with the name:
=LARGE('Total Cases by Categories'!$H$18:$H, A39)
When 2 people have the same numerical value within a category, the first of them appears twice, I assume because of their position within the sheet.
So something like this happens:
In the event of a tie, I want to list both names that have the same number of points instead of repeating the first name that shows up with the duplicated value.
Any help would be appreciated!
Actually, LARGE does return each of the tied values. It's MATCH that can't look beyond the first match. To the best of my knowledge there is no way around that short of not using MATCH at all, which is the difficult route. Therefore the solution is to have no ties.
This is achieved with a helper column that contains no identical numbers, created by adding an insignificant decimal to each value. Since you are dealing with integers, adding 0.1 would be insignificant for your purposes, but 13.1 is different from 13.2. If you need to extract the "real" number from this, use INT(13.2).
Using the row number to generate an insignificant decimal is popular for this purpose. In row 1 ROW()/10 will return 0.1. But in row 10 ROW()/10 will return 1.0 which isn't an insignificant number anymore. Therefore you have to work with ROW()/100 or an even larger divisor, depending upon how many rows you have. Try ROW()/10^6 - any decimal will do the tie-breaking job.
You may not like that using ROW() will list tied participants in the order in which they appear in the worksheet. The differentiating decimals can be created by any other means that doesn't create ties in itself.
Normally, the helper column with the decimals added will be hidden. It contains a formula like =D23 + (ROW()/10000), which maintains itself. Point both MATCH and the LARGE nested inside it at the helper column, so that the lookup value is unique. The LARGE that displays the score can read either the helper column (wrapped in INT) or the original column.
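
The logic of the helper column can be traced outside the spreadsheet. The following is a rough Python sketch of the same idea, not spreadsheet code: the names and scores are made up, and kth_largest_name is a hypothetical function that imitates INDEX(names, MATCH(LARGE(helper, k), helper, 0)).

```python
# Hypothetical sheet contents: one name and one integer score per row.
names = ["Ann", "Bob", "Cid", "Dee"]
scores = [13, 9, 13, 11]

# Helper column: score plus a tiny row-dependent decimal, like =D23 + ROW()/10^6.
helper = [score + row / 10**6 for row, score in enumerate(scores, start=1)]


def kth_largest_name(k):
    """Imitate INDEX(names, MATCH(LARGE(helper, k), helper, 0))."""
    target = sorted(helper, reverse=True)[k - 1]  # LARGE(helper, k)
    position = helper.index(target)               # MATCH(..., 0): first exact match
    return names[position], int(target)           # INT() recovers the original score


top3 = [kth_largest_name(k) for k in (1, 2, 3)]
```

Because every helper value is unique, the exact match always lands on a different row, so both tied names appear. With ROW() as the tie-breaker, the tied entry further down the sheet ranks first.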

Is there a way to call the pcfcross function on groups of marks?

I'm using the pcfcross function to estimate the pair correlation functions (PCFs) between pairs of cell types, indicated by marks. I would now like to expand my analysis to include measuring the PCFs between cell types and groups of cell types. Is there a way to use the pcfcross function on a group of marks?
Alternatively, is there a way to change the marks of a group of marks to a singular mark?
You can collapse several levels of a factor to a single level, using the spatstat function mergeLevels. This will group several types of points into a single type.
However, this may not give you any useful new information. The pair correlation function is a second-order summary, so the pair correlation for the grouped data can be calculated from the pair correlations for the un-grouped data. (See Chapter 7 of the spatstat book).
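
To illustrate that last point: for a stationary multitype point process, the theoretical cross pair correlation between a merged group A and another type k (not in A) is the intensity-weighted average of the individual cross pair correlations. The sketch below is plain Python, not spatstat code; merged_cross_pcf and its inputs are hypothetical, and estimates computed from data will only satisfy the relation approximately.

```python
def merged_cross_pcf(intensities, cross_pcfs):
    """Cross PCF between a merged group A and a type k that is not in A.

    intensities: one intensity lambda_i for each type i in A.
    cross_pcfs:  one list per type i in A holding g_ik(r) on a shared grid of r.
    Assumes a stationary process, so pair densities add across the merged types:
    g_Ak(r) = sum_i(lambda_i * g_ik(r)) / sum_i(lambda_i).
    """
    total = sum(intensities)
    n_r = len(cross_pcfs[0])
    return [
        sum(lam * g[j] for lam, g in zip(intensities, cross_pcfs)) / total
        for j in range(n_r)
    ]


# Hypothetical values at two distances r for two cell types merged into one group.
g_group = merged_cross_pcf([1.0, 3.0], [[2.0, 1.0], [1.0, 1.0]])
```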

How to apply sklearn pipeline to a list of features depending on availability

I have a pandas dataframe with 10 features (e.g., all floats). Based on the characteristics of the features (e.g., the mean), the dataframe can be broken into 4 subsets: mean <0, mean within range (0,1), mean within range (1,100), mean >=100
A different pipeline will be applied to each subset. However, the subsets may not always be available. For example, the dataset might contain only mean <0, or only mean <0 and mean (1,100), or all 4 subsets.
The question is how to apply the pipelines depending on the availability of the subsets.
The problem is that there will be a total of 7 different combinations:
all subsets exist, only 2 subsets exist, or only 1 subset exists.
How can I assign different pipelines depending on the availability of the subsets without using a long if/else chain (10 if/else branches)?
if only subset1 exists:
    make_column_transformer((pipeline1, subset1))
elif only subset2 exists:
    make_column_transformer((pipeline2, subset2))
elif only subset3 exists:
    make_column_transformer((pipeline3, subset3))
elif only subset1 and subset2 exist:
    make_column_transformer((pipeline1, subset1), (pipeline2, subset2))
elif only subset3 and subset2 exist:
    make_column_transformer((pipeline3, subset3), (pipeline2, subset2))
elif only subset1 and subset3 exist:
    make_column_transformer((pipeline1, subset1), (pipeline3, subset3))
elif subset1 and subset2 and subset3 exist:
    make_column_transformer((pipeline1, subset1), (pipeline2, subset2), (pipeline3, subset3))
Is there a better way that avoids this if/else chain (considering that we might have 10 different subsets)?
The way to apply different transformations to different sets of features is with ColumnTransformer [1]. You can keep one list of column names per transformer and fill each list based on the conditions you want, for example cols_mean_lt0 = [...], etc. Each transformer then processes the columns named in its list.
Having said that, your approach doesn't look good to me. You probably want to scale the features so they all have the same mean and std. Depending on the algorithm you use, this may or may not be mandatory.
[1] https://scikit-learn.org/stable/modules/generated/sklearn.compose.ColumnTransformer.html
EDIT:
ColumnTransformer takes a list of transformers, each of which is a tuple of (name, transformer, columns). What you want is to have multiple transformers, each of which will process different columns. The columns in the tuple can be indicated by 'string or int, array-like of string or int, slice, boolean mask array or callable'. Here is where I suggest you pass a list of columns.
This way, you can have three transformers, one for each of your cases. To indicate which columns each transformer should process, you just have to create three lists, one for each transformer. Each column will belong to one of the lists. This is simple to do: in a loop, check the mean of each column, then append the column name to the list that belongs to the matching transformer.
Hope this helps!
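
A minimal sketch of that loop, assuming pandas and scikit-learn are installed. The bucket names, the scalers standing in for the four pipelines, the boundary handling, and the helper names bucket_for and build_column_transformer are all illustrative assumptions, not part of the original question.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import (
    MaxAbsScaler,
    MinMaxScaler,
    RobustScaler,
    StandardScaler,
)

# Hypothetical stand-ins for the four pipelines; any transformer or Pipeline works.
PIPELINES = {
    "mean_lt_0": StandardScaler(),
    "mean_0_1": MinMaxScaler(),
    "mean_1_100": RobustScaler(),
    "mean_ge_100": MaxAbsScaler(),
}


def bucket_for(mean):
    """Map a column mean to a bucket name (boundary handling is an assumption)."""
    if mean < 0:
        return "mean_lt_0"
    if mean < 1:
        return "mean_0_1"
    if mean < 100:
        return "mean_1_100"
    return "mean_ge_100"


def build_column_transformer(df):
    """Fill one column list per bucket, then keep only the non-empty buckets."""
    columns = {name: [] for name in PIPELINES}
    for col in df.columns:
        columns[bucket_for(df[col].mean())].append(col)
    transformers = [
        (name, PIPELINES[name], cols) for name, cols in columns.items() if cols
    ]
    return ColumnTransformer(transformers)


# Hypothetical frame in which the (1,100) subset is absent.
df = pd.DataFrame(
    {"a": [-2.0, -1.0, -3.0], "b": [0.2, 0.4, 0.6], "c": [150.0, 200.0, 250.0]}
)
ct = build_column_transformer(df)
transformed = ct.fit_transform(df)
```

Empty buckets are simply left out of the transformer list, so no branching on combinations of subsets is needed, and adding a tenth subset only means adding one more dictionary entry and one more threshold.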

How to set a desired name in MATLAB from words and numbers extracted from data?

I want to extract three numbers from different columns of my dataset, combine them with some words to form the name of a variable in the workspace, and then assign a matrix to this variable. For instance:
data=dataset{:,:,5};
FID=data(1,14);
VID=data(1,1);
PID=data(1,15)
Here I extracted three numbers from different columns of a matrix in the dataset:
FID=4 , VID=8 , PID=12
Now I want to create a variable in the workspace from these three numbers and three words, separated by underscores, such as: A4_B8_C12
and then assign a matrix to this variable:
A4_B8_C12=dataset{:,:,5};
Since my dataset is a cell array containing 2169 matrices, I'm writing code that extracts the three numbers from each desired matrix and uses them along with the desired words to create several matrices.
How can I do that?
Creating dynamically named variables is not good practice when you already have cell arrays, structs and arrays available; it goes against the whole point of using arrays. If you still want to program this way, you can use the following code:
for i=1:5
    data=dataset{:,:,i};
    FID=data(1,14);
    VID=data(1,1);
    PID=data(1,15);
    % Build the assignment as a string, e.g. 'A4_B8_C12=data;', and execute it
    eval(sprintf('A%d_B%d_C%d=data;',FID,VID,PID));
end
eval executes a string as code, a style of programming that can be used for self-modifying code.

Return variable from its name in MATLAB

Say I have a variable var=1 and a string str='var'.
How can I obtain the value of var from str? I tried using str2num(str), but it didn't work.
Also, if I had 2 strings str1='some letters' and str2='str1', can I obtain the phrase 'some letters' from str2?
I want to do this because I have many (quite big) matrices and I want to separate them into groups, so I thought about making cells with the names of the matrices that belong to each group (a matrix can belong to more than one group, so making cells with the matrices themselves is not very good).
You can use eval:
x = eval( str ) ;
But it's not recommended.
Though it can easily be achieved with an eval as @Shai mentioned, you probably don't really want to do this. Using eval hinders your debugging, and depending on the names of variables seriously limits the flexibility of your code. If you want to name something, you may be better off using a struct with a data field and a name field instead.
Judging from your description, I wonder about the following:
1. Why do you have many matrices?
For each variable that you have, you depend on a name. Depending on a lot of names is typically undesirable. Hence my suggestion:
Use a (cell) array containing these matrices
2. How exactly do you want them to be grouped?
It is not clear to me how you want the grouping to work, but think of this:
If you want to use names, create a struct or array of structs with a nameField, but
otherwise just use a cell array and have each matrix get a number.
You can now handle the matrices more easily, and things like "selecting 10 random matrices" or "selecting all matrices whose nameField contains 'abc'" can be done easily and efficiently.
You can also store a field alongside your data specifying which groups it belongs to, or you can define groups as simple lists of numbers.
