In featuretools, How to control the application of where_primitives? - featuretools

In featuretools we have various primitive application control mechanisms to custom apply the primitives to select entities and columns.
They are very neatly documented here
The ignore_entities and ignore_variables parameters of DFS control entities and variables (columns) that should be ignored for all primitives. This is useful for ignoring columns or entities that don’t relate to the problem or otherwise shouldn’t be included in the DFS run.
Options for individual primitives or groups of primitives are set by the primitive_options parameter of DFS. This parameter maps any desired options to specific primitives.
Using primitive_options I can control the application of primitives to the individual entity or, more granularly, to columns within each entity. I can also control the columns by which I groupby to apply groupby_trans_primitive.
I cannot find (i have searched enough to think it does not exist) how to control the application of where primitives.
For example: say, I have a column for spend. I create a seed_feature to create buckets on the spend column. I might want to create the feature min(spend) on the whole. But, within the bucket [10000,15000], I might not want to create the min(spend where spend_bucket == 10000_15000). How do I go about having this kind of control where I control primitives application only when where clause is in effect

Currently the ability to control where primitives via primitive_options does not exist.
Potentially using drop_contains with a substring could give the desired control.
This issue will track adding support for this to Featuretools:
https://github.com/alteryx/featuretools/issues/1514

Related

How to use Azure Search Service with heterogenous data sources

I have worked on Azure Search service previously where I created an indexer directly on a SQL DB in the Azure Portal.
Now I have a use-case where I would want to ingest from multiple data sources each having different data schema. Assume these data sources to be 3 search APIs of X,Y,Z teams. All of them take search term and gives back results in their own schema. I want my Azure Search Service to be proxy for these so that I have one search API that a user can use to get results from multiple sources, ordered correctly.
How should I go about doing it? I assume that I might have to create a common schema and whenever user searches something, I would call these 3 APIs and get results, map them to a common schema and then index this data in common schema into Azure Search index. Finally, call this Azure Search API to give back the results to the caller.
I would appreciate any help! If I can get hold of a better documentation for doing this work, that will be great as well.
Your assumption is correct. You can work with 3 different indexes and fire queries against them, or you can try to combine all of them in the same index. The benefit of the second approach is a better way to implement ordering / paging as all the information will be stored in the same index.
It really depends on what you mean by ordered correctly. Should team X be able to see results from teams Y and Z? The only way you can get ranked results like this is to maintain a single index with a common schema containing data from all teams.
One potential pitfall with this approach is conflicts in the schema. For example if one team requires a field to be of a specific datatype or use a specific analyzer, while another team has different requirements. We do this in our indexes, but with some carefully selected common fields and then dedicated fields prefixed according to our own naming convention to avoid conflicts.
One thing to consider is the need to reset the index. If you need to add, change or remove fields you will have to delete the index and create it again with a new schema. If you have a common index and team X needs to add a new property, you would need to reset (delete and create) the common index which affects all teams.
So, creating separate indexes per team has its benefits. Each team can have their own schema without risk of conflicts and they can reset their index without affecting the other teams.

Blazegraph Tinkerpop 3 Indexing

I am trying to learn about Blazegraph. At the moment I am puzzled how I can optimise simple lookups.
Suppose all my vertices have a property id, which is unique. This property is set by the user. Is there any way to speed up finding a vertex of a particular id while still sticking to the Tinkerpop APIs?
Is the search API defined here the only way?
My previous experience is in TitanDB and in Titan's case it's possible to define an index which the Tinkerpop APIs integrate with flawlessly. Is there any way to achieve the same results in Blazegraph without using the Search API?
Whether a mid-traversal V() uses an index or not, depends on a)
whether suitable index exists and b) if the particular graph system
provider implemented this functionality.
Gremlin (Tinkerpop) does not specify how to set indexes although the documentation presents things like the following
graph.createIndex("username",Vertex.class)
But may be reserved for the ThinkerGraph implementation, as a matter of fact it says
Each graph system will have different mechanism by which indices and
schemas are defined. TinkerPop3 does not require any conformance in
this area. In TinkerGraph, the only definitions are around indices.
With other graph systems, property value types, indices, edge labels,
etc. may be required to be defined a priori to adding data to the
graph.
There is an example for Neo4J
TinkerPop3 does not provide method interfaces for defining
schemas/indices for the underlying graph system. Thus, in order to
create indices, it is important to call the Neo4j API directly.
But the code is very specific for that plugin
graph.cypher("CREATE INDEX ON :person(name)")
Note that for BlazeGraph the search uses a built in full-text index

Complex Finds in Domain Driven Design

I'm looking into converting part of an large existing VB6 system, into .net. I'm trying to use domain driven design, but I'm having a hard time getting my head around some things.
One thing that I'm completely stumped on is how I should handle complex find statements. For example, we currently have a screen that displays a list of saved documents, that the user can select and print off, email, edit or delete. I have a SavedDocument object that does the trick for all the actions, but it only has the properties relevant to it, and I need to display the client name that the document is for and their email address if they have one. I also need to show the policy reference that this document may have come from. The Client and Policy are linked to the SavedDocument but are their own aggregate roots, so are not loaded at the same time the SavedDocuments are.
The user is also allowed to specify several filters to reduce the list down. These to can be from properties that are stored on the SavedDocument or the Client and Policy.
I'm not sure how to handle this from a Domain driven design point of view.
Do I have a function on a repository that takes the filters and returns me a list of SavedDocuments, that I then have to turn into a different object or DTO, and fill with the additional client and policy information? That seem a little slow as I have to load all the details using multiple calls.
Do I have a function on a repository that takes the filters and returns me a list of SavedDocumentsForList objects that contain just the information I want? This seems the quickest but doesn't feel like I'm using DDD.
Do I load everything from their objects and do all the filtering and column selection in a service? This seems the slowest, but also appears to be very domain orientated.
I'm just really confused how to handle these situations, and I've not really seeing any other people asking questions about it, which masks me feel that I'm missing something.
Queries can be handled in a few ways in DDD. Sometimes you can use the domain entities themselves to serve queries. This approach can become cumbersome in scenarios such as yours when queries require projections of multiple aggregates. In this case, it is easier to use objects explicitly designed for the respective queries - effectively DTOs. These DTOs will be read-only and won't have any behavior. This can be referred to as the read-model pattern.

Core data predicate based on multiple entitites

Normally if you have a 1 to many relationship in Core Data I understand that you should set that up as a relationship in the data model.
In this case, it is difficult to do because of the origin and management of the data.
I'm trying to essentially accomplish a join.
I'd like to fetch an entity A which meets some criteria on A but also meets a criteria on B.code and another attribute.
select statement would be
select attributeFromA from A, B where A.code = B.code and B.attrib="foo"
Is there a reasonable way to accomplish this without creating a relationship in core data?
I've only found two solutions, neither very good.
From what I've read, Core Data does not support a query against multiple entities unless they have a relationship between them.
Add a relationship anyway. This can be particularly bad since the data is coming from a server. No way to easily maintain relationships when individually updating each table from the server. Need to recreate relationships when data changes.
Manually perform the join outside of Core Data. In the above case, the intent is to get the set of object identifiers ('code') that match. One way to do that is to perform separate queries then get the intersection. Setup each query to only retrieve 'code', not managed objects.

Optional or boolean elements to specify characteristics in XML schema?

I'm trying to create an XML schema to describe some aspects of hospitals. A hospital may have 24 hour coverage on: emergency services, operating room, pharmacist, etc. The entire list is relatively short - around 10. The coverage may be on more than one of these services.
My question is how best to represent this. I'm thinking along the lines of:
<coverage>
<emergencyServices/>
<operatingRoom/>
</coverage>
Basically, the services are optional and, if they exist, the coverage is offered by the hospital.
Alternatively, I could have:
<coverage>
<emergencyServices>true</emergencyServices>
<operatingRoom>true</operatingRoom>
<pharmacist>false</pharmacist>
</coverage>
In this case, I require all the elements, but a value of false means that the coverage isn't offered.
There are probably other approaches.
What's the best practice for something like this? And, if I use the first option, what type should the elements be in the schema?
Best practice here depends really on the consumer.
The short and simple rule is that markup is for structure, and content is for data. So having them contain xs:boolean values is generally the best course.
Now, on to the options:
Having separate untyped elements is simple and clear; sometimes processing systems may have difficulty reading them, because some XML-relational mappers may not see any data in the elements to put in relational tables. But if they had values, like <emergencyServices>true</emergencyServices>, then the relational table would have a value to hold.
Again, if you have fixed element names, it means if your consumer is using a system that maps the XML to a database, every time you add a service, a schema change will have to be made.
There are several other ways; each has trade-offs:
Using a <xs:string> with an enumeration, and allow multiple copies. Then you could have <coverage>emergencyServices</coverage><coverage>operatingRoom</coverage>. It makes adding to the list simpler, but allows duplicates. This scheme does not require schema changes in the database for the consumer.
You could use attributes on the <coverage> element. They would have a xs:boolean type, but still require a schema change. Of course, this evokes the attribute vs. element argument.
One good resource is Chapter 11 of Effective XML. At least this should be read before making a final decision.

Resources