Defining data structures / types in Haskell

How would it be possible to define a data structure in Haskell such that certain constraints/rules apply to the elements of the structure, AND have this reflected in the type?
For example, suppose I have a type made up of a list of another type, say
r = [x | x <- input, rule1, rule2, rule3].
In this case, the type of r is a list of elements of (the type of x). But by saying only this, we lose the rules. So how would it be possible to retain this extra information in the type definition?
To make my question more concrete, take the Sudoku case. A Sudoku grid is a list of rows, each of which is in turn a list of cells. But as we all know, there are constraints on the values and how often they may occur. Yet when one writes down the types, these constraints don't show up in the definitions of the grid and row types.
Or is this not possible?
Thanks.

In the Sudoku example, create a data type with multiple constructors, each representing a 'rule' or semantic property.
E.g.:
data SudokuType = NotValidatedRow | InvalidRow | ValidRow
Now, in some validation function you would return an InvalidRow wherever you detect a violation of the Sudoku rules, and a ValidRow wherever a row (or column, or square, etc.) passes. This also allows you to pattern match on the result.
The problem you're having is that you're not using types, you're using values. You're defining a list of values, but the list type does not say anything about the values it contains.
Note that the example I used is probably not very useful as it does not contain any information about the row's position or anything like that, but you can extend it yourself as you like.
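A minimal sketch of that idea (Cell, validateRow, and the single rule checked below are illustrative, not a full Sudoku checker):

import Data.List (sort)

-- A cell is just an Int here for simplicity.
type Cell = Int

-- Constructors encode the semantic state of a row.
data SudokuRow
  = NotValidatedRow [Cell]  -- raw input, not yet checked
  | InvalidRow      [Cell]  -- failed one of the rules
  | ValidRow        [Cell]  -- passed all rules
  deriving Show

-- Hypothetical validation function: the only rule checked here is that
-- the row contains the digits 1..9 exactly once.
validateRow :: [Cell] -> SudokuRow
validateRow cs
  | sort cs == [1 .. 9] = ValidRow cs
  | otherwise           = InvalidRow cs

-- Callers can now pattern match on the result:
describe :: SudokuRow -> String
describe (ValidRow _)        = "row satisfies the rules"
describe (InvalidRow _)      = "row violates a rule"
describe (NotValidatedRow _) = "row has not been checked yet"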

Related

How to modify dynamic complex data type fields in Azure Data Factory data flow

I have a complex data type (fraudData) that undesirably has hyphen characters in its field names. I need to remove the hyphens or change them to some other character.
The input schema of the complex object looks like:
I have tried using the "Select" and "Derived Column" data flow transformations and adding a custom mapping. Both seem to have the same mapping interface. My current attempt with Select is:
This gets me close to the desired result: I can use the replace expression to convert hyphens to underscores.
The problem is that this mapping creates new root-level columns outside of the fraudData structure. I would like to preserve the hierarchy of the fraudData structure and modify the column names in place.
If I am unable to modify fraudData in place, is there any way I can take the new columns and merge them into another complex data type?
Update: I do not know the fields of the complex data type in advance; this is a schema drift problem, which is why I have tried the pattern matching solution. I will not be able to hardcode known sub-column names.
You can rename the sub-columns of a complex data type using the Derived Column transformation and then convert them back into a complex data type. I tried this with sample data; below is the approach.
A sample complex data type column with two sub-fields is used, as shown in the image below.
Image 1: source data preview
In the Derived Column transformation, the expression for the column fraudData is given as
#(fraudData_1_chn=fraudData.{fraudData-1-chn},
fraudData_2_chn=fraudData.{fraudData-2-chn})
Image 2: Derived Column settings
This expression renames the subfields and nests them under the parent column fraudData.
Image 3: transformed data (fields renamed)
Update: To rename sub-columns dynamically
You can use the below expression to rename all the fields under the root column fraudData.
#(each(fraudData, match(true()), replace($$,'-','_') = $$))
This replaces - with _ in every field name.
You can also use a pattern match in the expression.
#(each(fraudData, patternMatch(`fraudData-.+`), replace($$,'-','_') = $$))
This expression only takes the fields matching the pattern fraudData-.+ and replaces - with _ in those fields.
References:
Microsoft documentation on the script for hierarchical definitions in data flow.
Microsoft documentation on building schemas using the Derived Column transformation.

Is there a way to call the pcfcross function on groups of marks?

I'm using the pcfcross function to estimate the pair correlation functions (PCFs) between pairs of cell types, indicated by marks. I would now like to expand my analysis to include measuring the PCFs between cell types and groups of cell types. Is there a way to use the pcfcross function on a group of marks?
Alternatively, is there a way to change the marks of a group of marks into a single mark?
You can collapse several levels of a factor to a single level, using the spatstat function mergeLevels. This will group several types of points into a single type.
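A minimal sketch of this, using the built-in lansing pattern (the group name "oaks" and the chosen levels are only illustrative, and this assumes a spatstat version whose mergeLevels accepts a marked point pattern directly):

library(spatstat)

# Marks of 'lansing' are tree species; merge two of them into one group level.
X <- lansing
Y <- mergeLevels(X, oaks = c("redoak", "whiteoak"))

# Pair correlation between an un-grouped type and the merged group.
g <- pcfcross(Y, "hickory", "oaks")
plot(g)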
However, this may not give you any useful new information. The pair correlation function is a second-order summary, so the pair correlation for the grouped data can be calculated from the pair correlations for the un-grouped data. (See Chapter 7 of the spatstat book).

How to apply an sklearn pipeline to a list of features depending on availability

I have a pandas dataframe with 10 features (e.g., all floats). Given the different characteristics of the features (e.g., their mean), the dataframe can be broken into 4 subsets: mean < 0, mean within the range (0, 1), mean within the range (1, 100), and mean >= 100.
For each subset a different pipeline will be applied; however, not all subsets will always be present. For example, the dataset might contain only mean < 0, or only mean < 0 and mean in (1, 100), or all 4 subsets.
The question is how to apply the pipelines depending on the availability of the subsets.
The problem is that there will be 7 different combinations in total:
all subsets exist, only 3 exist, only 2 exist, or only 1 exists.
How can I assign different pipelines depending on the availability of the subsets, without using a nested if/else (10 if/else branches) like the following?
if subset1 exists:
    make_column_transformer((pipeline1, subset1))
elif subset2 exists:
    make_column_transformer((pipeline2, subset2))
elif subset3 exists:
    make_column_transformer((pipeline3, subset3))
elif subset1 and subset2 exist:
    make_column_transformer((pipeline1, subset1), (pipeline2, subset2))
elif subset3 and subset2 exist:
    make_column_transformer((pipeline3, subset3), (pipeline2, subset2))
elif subset1 and subset3 exist:
    make_column_transformer((pipeline1, subset1), (pipeline3, subset3))
elif subset1 and subset2 and subset3 exist:
    make_column_transformer((pipeline1, subset1), (pipeline2, subset2), (pipeline3, subset3))
Is there a better way to avoid this nested if/else (considering that we might have 10 different subsets)?
The way to apply different transformations to different sets of features is ColumnTransformer [1]. You could then have lists of column names, which can be filled based on the conditions you want. Each transformer will take the columns listed in its list, for example cols_mean_lt0 = [...], etc.
Having said that, your approach doesn't look good to me. You probably want to scale the features so they all have the same mean and std; depending on the algorithm you'll use, this may or may not be mandatory.
[1] https://scikit-learn.org/stable/modules/generated/sklearn.compose.ColumnTransformer.html
EDIT:
ColumnTransformer takes transformers, each of which is a tuple of (name, transformer, columns). What you want is to have multiple transformers, each of which will process different columns. The columns in the tuple can be indicated by 'string or int, array-like of string or int, slice, boolean mask array or callable'. This is where I suggest you pass a list of columns.
This way, you can have three transformers, one for each of your cases. To indicate which columns you want each transformer to process, you just have to create three lists, one per transformer. Each column will correspond to one of the lists. This is simple to do: in a loop, check each column's mean and append the column name to the list corresponding to the matching transformer.
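A minimal sketch of that loop plus ColumnTransformer (the thresholds, the scalers inside each pipeline, and names such as build_transformer are only illustrative):

import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler, MinMaxScaler, RobustScaler

def build_transformer(df: pd.DataFrame) -> ColumnTransformer:
    # Bucket columns by their mean; empty buckets simply produce no transformer.
    cols_mean_lt0, cols_mean_0_1, cols_mean_1_100, cols_mean_ge100 = [], [], [], []
    for col in df.columns:
        m = df[col].mean()
        if m < 0:
            cols_mean_lt0.append(col)
        elif m < 1:
            cols_mean_0_1.append(col)
        elif m < 100:
            cols_mean_1_100.append(col)
        else:
            cols_mean_ge100.append(col)

    # Pair each bucket with its pipeline and keep only the non-empty ones,
    # so no if/else over subset combinations is needed.
    candidates = [
        ("lt0",   make_pipeline(StandardScaler()), cols_mean_lt0),
        ("0_1",   make_pipeline(MinMaxScaler()),   cols_mean_0_1),
        ("1_100", make_pipeline(RobustScaler()),   cols_mean_1_100),
        ("ge100", make_pipeline(StandardScaler()), cols_mean_ge100),
    ]
    transformers = [(name, pipe, cols) for name, pipe, cols in candidates if cols]
    return ColumnTransformer(transformers, remainder="passthrough")

The returned ColumnTransformer can then be placed in front of an estimator, e.g. make_pipeline(build_transformer(X_train), some_estimator).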
Hope this helps!

How to add NER tags to features

I have a set of training sentences for which I computed some float features. In each sentence, two entities are identified. They are either of type 'PERSON', 'ORGANIZATION', 'LOCATION', or 'OTHER'. I would like to add these types to my feature matrix (which stores float variables).
My question is: is there a recommended way to add these entity types?
I could think of two ways for now:
either adding TWO columns, one for each entity, filled with entity-type ids (e.g. 0 to 3 or 1 to 4)
adding EIGHT columns, one for each entity type and each entity, and filling them with 0's and 1's
Best!
I would recommend using something that can easily be normalized and that is in the same range as the rest of your data.
So if all your float values are between -1 and 1, I would keep the values from your named-entity recognition in the same range.
Depending on what you prefer or what gives you the best result, you could either assign 4 values in the same range as the rest of your floats, or use a binary encoding with more columns.
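A small sketch of both options in pandas (the column names entity1_type/entity2_type and the toy data are made up):

import pandas as pd

# Toy feature matrix: one row per sentence, with the two identified
# entities' types in hypothetical columns 'entity1_type' and 'entity2_type'.
df = pd.DataFrame({
    "some_float_feature": [0.3, -0.7],
    "entity1_type": ["PERSON", "LOCATION"],
    "entity2_type": ["ORGANIZATION", "OTHER"],
})

types = ["PERSON", "ORGANIZATION", "LOCATION", "OTHER"]

# Option 1: two id columns (values 0..3), a small numeric range
ids = df[["entity1_type", "entity2_type"]].apply(lambda c: c.map(types.index))

# Option 2: eight 0/1 columns, one per (entity, type) pair
cat = df[["entity1_type", "entity2_type"]].astype(pd.CategoricalDtype(types))
onehot = pd.get_dummies(cat, prefix=["e1", "e2"]).astype(int)

features = pd.concat([df[["some_float_feature"]], onehot], axis=1)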
Finally, the second suggestion (adding EIGHT columns, one for each entity type and each entity, and filling them with 0's and 1's) worked fine!

Return variable from its name in MATLAB

Say I have a variable var=1 and a string str='var'.
How can I obtain the value of var from str? I tried using str2num(str), but it didn't work.
Also, if I had two strings str1='some letters' and str2='str1', could I obtain the phrase 'some letters' from str2?
I want to do this because I have many (quite big) matrices and I want to separate them into groups, so I thought about making cells with the names of the matrices that belong to each group (a matrix can belong to more than one group, so making cells with the matrices themselves is not very good).
You can use eval:
x = eval( str ) ;
But it's not recommended.
Though it can easily be achieved with eval as @Shai mentioned, you probably don't really want to do this. Using eval hinders debugging, and depending on the names of variables seriously limits the flexibility of your code. If you want to name something, you may be better off using a struct with a data field and a name field instead.
Judging from your description, I wonder about the following:
1. Why do you have many matrices?
For each variable that you have, you depend on a name. Depending on a lot of names is typically undesirable. Hence my suggestion:
Use a (cell) array containing these matrices
2. In exactly what way do you want them to be grouped?
It is not clear to me how you want the grouping to work, but think of this:
If you want to use names, create a struct or array of structs with a nameField, but
otherwise just use a cell array and have each matrix get a number.
You can now handle the matrices more easily, and things like 'select 10 random matrices' or 'select all matrices whose nameField contains abc' can be done easily and efficiently.
You can also have a field in your data specifying which groups it belongs to, or you can define groups as simple lists of numbers.
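A minimal MATLAB sketch of the struct-array idea (the field names name/data and the group names are only illustrative):

% One struct per matrix, holding both its name and its data.
M = struct('name', {}, 'data', {});
M(1) = struct('name', 'A', 'data', rand(3));
M(2) = struct('name', 'B', 'data', rand(3));
M(3) = struct('name', 'C', 'data', rand(3));

% Groups are just lists of indices into M; a matrix may appear in several.
groups.small = [1 3];
groups.other = [2 3];

% Select all matrices in the 'small' group.
smallMatrices = {M(groups.small).data};

% Select all matrices whose name contains 'A'.
hasA = contains({M.name}, 'A');
selected = {M(hasA).data};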
