How do I flatten a featuretools entity set to get wide input format?

I have an entity set with relationships defined. Is there a method to get a left-joined version of all the dataframes in the entities, given that the relationships are already there?
I can merge the dataframes outside of Featuretools using pandas, but I would like to leverage the well-defined EntitySet.

This functionality does not currently exist in Featuretools. You can do it outside of Featuretools (with pandas).
Feel free to create an issue for this feature:
https://github.com/alteryx/featuretools/issues/new/choose
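In the meantime, here is a minimal sketch of doing the flattening with pandas by walking the EntitySet's relationships. It assumes the Featuretools 0.x API, where a Relationship exposes parent_entity/child_entity and parent_variable/child_variable and an Entity exposes .df (attribute names differ in newer releases), and it only joins the direct parents of the target entity:

def flatten_entityset(es, target_entity):
    # Start from the target entity's dataframe and left-join each direct parent.
    wide = es[target_entity].df.copy()
    for rel in es.relationships:
        if rel.child_entity.id != target_entity:
            continue
        parent_df = es[rel.parent_entity.id].df
        wide = wide.merge(
            parent_df,
            how="left",
            left_on=rel.child_variable.id,
            right_on=rel.parent_variable.id,
            suffixes=("", "_" + rel.parent_entity.id),
        )
    return wide

For example, flatten_entityset(es, "orders") (entity name illustrative) would return the orders dataframe left-joined with each of its parent dataframes.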

Related

In featuretools, how do I control the application of where_primitives?

In Featuretools we have various mechanisms to control how primitives are applied, so that they only apply to selected entities and columns.
They are very neatly documented here
The ignore_entities and ignore_variables parameters of DFS control entities and variables (columns) that should be ignored for all primitives. This is useful for ignoring columns or entities that don’t relate to the problem or otherwise shouldn’t be included in the DFS run.
Options for individual primitives or groups of primitives are set by the primitive_options parameter of DFS. This parameter maps any desired options to specific primitives.
Using primitive_options I can control the application of primitives to individual entities or, more granularly, to columns within each entity. I can also control which columns are used for grouping when applying groupby_trans_primitives.
I cannot find (I have searched enough to believe it does not exist) a way to control the application of where primitives.
For example: say I have a spend column and I create a seed feature that buckets spend. I might want to create the feature min(spend) over the whole data, but within the bucket [10000, 15000] I might not want to create min(spend where spend_bucket == 10000_15000). How do I go about getting this kind of control, where I restrict primitive application only when a where clause is in effect?
Currently the ability to control where primitives via primitive_options does not exist.
Potentially using drop_contains with a substring could give the desired control.
This issue will track adding support for this to Featuretools:
https://github.com/alteryx/featuretools/issues/1514
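For reference, a hedged sketch of that drop_contains workaround: where-features include the WHERE clause in their generated names, so matching on that substring drops just those features. The entityset, entity, column and bucket names below are illustrative rather than taken from the question, and in Featuretools 1.x the parameter is target_dataframe_name instead of target_entity:

import featuretools as ft

feature_matrix, feature_defs = ft.dfs(
    entityset=es,                  # the EntitySet from the question
    target_entity="customers",     # target_dataframe_name in Featuretools 1.x
    agg_primitives=["min"],
    where_primitives=["min"],
    # assumes spend_bucket already drives where-features (e.g. via interesting
    # values or a seed feature), producing names like
    # "MIN(transactions.spend WHERE spend_bucket = 10000_15000)";
    # any generated feature whose name contains this substring is dropped
    drop_contains=["WHERE spend_bucket = 10000_15000"],
)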

Does sklearn have any model type metadata either in the project or outside?

For example, it could be useful to have information in the library that allows one to select all tree-based ensemble models that work on regression/classification tasks with more than one output.
I think users could gradually create this meta-data in the library if it doesn't already exist.
So something like:
[model_entry for model_entry in sklearn.meta_info if model_entry.2d_y and model_entry.ensemble]
but with better names.
You can always make use of the estimator tags to get such information: https://scikit-learn.org/dev/developers/develop.html#estimator-tags
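A rough sketch of how those tags could be queried, combining all_estimators with the private _get_tags() method (tag names and this API are version-dependent); treating "ensemble" as a BaseEnsemble subclass and "more than one output" as the multioutput tag is my assumption here, not something the docs prescribe:

from sklearn.ensemble import BaseEnsemble
from sklearn.utils import all_estimators

multioutput_ensemble_regressors = []
for name, Est in all_estimators(type_filter="regressor"):
    if not issubclass(Est, BaseEnsemble):
        continue
    try:
        est = Est()  # estimators that require constructor arguments are skipped
    except TypeError:
        continue
    if est._get_tags().get("multioutput", False):
        multioutput_ensemble_regressors.append(name)

print(multioutput_ensemble_regressors)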

Bulk import of graph data with ArangoDB java driver

I have a question regarding bulk import when working with the graph layer of ArangoDB and its Java driver. I'm using ArangoDB 3.4.5 with Java driver 5.0.0.
In the document layer, it's possible to use ArangoCollection.importDocuments to insert several documents at once. However, for the graph-layer collections, ArangoEdgeCollection and ArangoVertexCollection, the importDocuments method (or a corresponding importVertices/importEdges method) does not exist. So, if I want to bulk import my graph data, I have to bypass the graph layer and call importDocuments myself on the vertex collections, *_ELEMENT-PROPERTIES, *_ELEMENT-HAS-PROPERTIES, and edge collections separately.
Furthermore, when the edge collections already exist in the database, it's not even possible to perform a bulk import, because the existing collection is already defined as an edge collection.
Or maybe what I'm writing isn't true and I've overlooked something essential?
If not, is there a reason why bulk import is not implemented for the graph layer? Or is a graph bulk import just a nice-to-have that hasn't been implemented yet?
Based on my findings described above, bulk import of graph data with the Java driver seems impossible to me if the graph collections already exist (because of the edge collections). It would only be possible if the edge collections were created from scratch as ordinary collections, which already smells like having to write my own basic graph layer (which, of course, I don't want to do).
I guess another way would be importing JSON data, which I haven't analyzed much so far because it seems inconvenient when I need to manipulate (or create) the data with Java before storing it. That's why I would really like to work with the Java driver.
Thank you very much for any reply, opinion or corrections.

How to apply a dprep package to incoming data in score.py Azure Workbench

I had been wondering if it is possible to apply "data preparation" (.dprep) files to incoming data in score.py, similar to how Pipeline objects may be applied. This would be very useful for model deployment. To find out, I asked this question on the MSDN forums and received a response confirming it was possible, but with little explanation of how to actually do it. The response was:
in your score.py file, you can invoke the dprep package from Python SDK to apply the same transformation to the incoming scoring data. make sure you bundle your .dprep file in the image you are building.
So my questions are:
What function do I apply to invoke this dprep package?
Is it: run_on_data(user_config, package_path, dataflow_idx=0, secrets=None, spark=None) ?
How do I bundle it into the image when creating a web-service from the CLI?
Is there a switch to -f for score files?
I have scanned through the entire documentation and Workbench Repo but cannot seem to find any examples.
Any suggestions would be much appreciated!
Thanks!
EDIT:
Scenario:
I import my data from a live database and let's say this data set has 10 columns.
I then feature engineer this (.dsource) data set using the Workbench, resulting in a .dprep file whose output may have 13 columns.
This .dprep data set is then imported as a pandas DataFrame and used to train and test my model.
Now I have a model ready for deployment.
This model is deployed via Model Management to a Container Service and will be fed data from a live database, which once again will be in the original format (10 columns).
Obviously, this model has been trained on the transformed data (13 columns) and will not be able to make a prediction on the 10-column data set.
What function may I use in the 'score.py' file to apply the same transformation I created in workbench?
I believe I may have found what you need.
From this documentation you would import from the azureml.dataprep package.
There aren't any examples there, but searching on GitHub, I found this file which has the following to run data preparation.
from azureml.dataprep import package
df = package.run('Data analysis.dprep', dataflow_idx=0)
Hope that helps!
To me, it looks like this can be achieved by using the run_on_data(user_config, package_path, dataflow_idx=0, secrets=None, spark=None) method from the azureml.dataprep.package module.
From the documentation:
run_on_data(user_config, package_path, dataflow_idx=0, secrets=None, spark=None) runs the specified data flow based on an in-memory data source and returns the results as a dataframe. The user_config argument is a dictionary that maps the absolute path of a data source (.dsource file) to an in-memory data source represented as a list of lists.
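Building on that, a hedged sketch of what the run() function in score.py might look like; the .dsource/.dprep paths, the column counts and the model variable are placeholders for this scenario, not values taken from the documentation:

import json
from azureml.dataprep.package import run_on_data

def run(raw_json):
    incoming_rows = json.loads(raw_json)      # 10-column rows as a list of lists
    user_config = {
        # map the .dsource used at training time to the in-memory rows
        "/var/azureml-app/live_database.dsource": incoming_rows,
    }
    # replay the Workbench transformations (10 -> 13 columns) on the incoming data
    df = run_on_data(user_config, "/var/azureml-app/engineered.dprep", dataflow_idx=0)
    prediction = model.predict(df)            # `model` assumed to be loaded in init()
    return json.dumps(prediction.tolist())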

Nodes order in xml created from dataset

I'm filling tables in a .NET DataSet with data.
There is a nested relation between the tables, so the XML exported using the GetXml() method is nested (the child rows become child nodes).
I'm sending this XML to a conversion module that converts it from the DataSet schema (I'm using the DataSet's XSD file) to another schema via an XSLT map.
The problem is that in the XML I receive from the DataSet (using GetXml), the child nodes are not in the correct order (different from the order they have in the schema). For this reason, the schema validation in the conversion module fails!
I've found this W3C documentation:
All or Sequence
I've tried to act according to this, but it seems that the value "all" can't coexist with the relations between the tables in the DataSet, and I'm getting many weird error messages.
Is there a better way to control the child node order, or to make the schema validation succeed even if the order is different?
I would use explicit SELECT statements in your SQL:
SELECT Column1, Column2 FROM ...
If this isn't possible, you will need to make your XSD match your physical table specifications.
