What are the steps within sklearn.pipeline.make_pipeline()?
The documentation doesn't explicitly state which steps we "save" when using make_pipeline() instead of doing it the normal way. I was wondering what those steps are, just for my own understanding.
I can guess that one of the steps is applying a scaler, like StandardScaler(), to all columns in the dataset.
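To make the comparison concrete, here is a minimal sketch of what I mean by "the normal way" versus make_pipeline(); the two estimators are just placeholders I picked for illustration:
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.preprocessing import StandardScaler

# "The normal way": spell out each (name, estimator) step yourself.
pipe_explicit = Pipeline(steps=[
    ("scaler", StandardScaler()),
    ("classifier", LogisticRegression()),
])

# The shorthand: just pass the estimators; the step names seem to be generated for you.
pipe_short = make_pipeline(StandardScaler(), LogisticRegression())

# Inspecting .steps shows the (name, estimator) pairs each pipeline ended up with.
print(pipe_explicit.steps)
print(pipe_short.steps)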
Related
I have defined an Azure Machine Learning Pipeline with three steps:
e2e_steps = [etl_model_step, train_model_step, evaluate_model_step]
e2e_pipeline = Pipeline(workspace=ws, steps=e2e_steps)
The idea is to run the Pipeline in the given sequence:
etl_model_step
train_model_step
evaluate_model_step
However, my experiment is failing because it is trying to execute evaluate_model_step before train_model_step.
How do I enforce the sequence of execution?
azureml.pipeline.core.StepSequence lets you do exactly that.
A StepSequence can be used to easily run steps in a specific order, without needing to specify data dependencies through the use of PipelineData.
See the docs to read more.
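A minimal sketch, reusing the workspace and step objects from your question and following the StepSequence pattern from the docs:
from azureml.pipeline.core import Pipeline, StepSequence

# ws, etl_model_step, train_model_step and evaluate_model_step as defined in your code.
# StepSequence enforces the given order without requiring data dependencies between steps.
step_sequence = StepSequence(steps=[etl_model_step, train_model_step, evaluate_model_step])
e2e_pipeline = Pipeline(workspace=ws, steps=step_sequence)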
However, the preferred way to have steps run in order is to stitch them together via PipelineData or OutputFileDatasetConfig. In your example, does train_model_step depend on outputs from the ETL step? If so, consider making that dependency the way the steps are ordered. See this tutorial for more info.
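A rough sketch of that approach; the script names, argument names and compute_target below are placeholder assumptions, not taken from your setup:
from azureml.data import OutputFileDatasetConfig
from azureml.pipeline.steps import PythonScriptStep

# Intermediate dataset written by the ETL step and read by the training step.
prepared_data = OutputFileDatasetConfig(name="prepared_data")

etl_model_step = PythonScriptStep(
    name="etl",
    script_name="etl.py",                      # placeholder script
    arguments=["--output", prepared_data],
    compute_target=compute_target,             # placeholder compute target
)

train_model_step = PythonScriptStep(
    name="train",
    script_name="train.py",                    # placeholder script
    arguments=["--input", prepared_data.as_input()],
    compute_target=compute_target,
)
# Because the training step consumes the ETL output, Azure ML schedules it after the ETL step.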
Extending the question Cleanup steps for Cucumber scenarios. I am aware that I can use tagged #After hooks to repeat the last few steps for all scenarios matching the tag. However, this implementation will live in my Java classes, and my business users will have no idea about it. Also, my acceptance tests are huge, around 200. Let's say each feature file contains 10 scenarios and the last 3-4 steps are common to all of them in that feature file. So I will have 20 feature files and 20 unique tags. I can create 20 #After hook functions and silently perform those steps. But how will my business owners know about this if they cannot see the technical implementation?
The purpose of the 'Background' keyword is to repeat the same steps at the beginning of the scenarios. We could have easily achieved this using tagged #Before hooks, so why 'Background'? If we had a 'Postground' keyword, the opposite of 'Background', the above problem could be solved. What do you think?
Note: I have logged an issue for this, but it got closed by #aslakhellosoy. I think I did not articulate the problem statement well.
Instead of repeating the same steps one by one, you can extract helper methods to perform the individual actions those steps perform, and call those helper methods either one by one in individual steps, or in sequence from the overarching step.
That way you can still make visible to the business users what happens, without having to spell out all the individual steps.
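Your project is in Java with cucumber-jvm, but as a rough illustration of the pattern (sketched here with Python's behave bindings to keep this page's examples in one language, and with made-up step texts and helper names), the idea looks roughly like this:
from behave import then, when

# Plain helper functions hold the real logic; step definitions stay thin.
def archive_report(context):
    pass  # placeholder for the real UI/API action

def log_out(context):
    pass  # placeholder for the real UI/API action

@when('I archive the report')
def step_archive_report(context):
    archive_report(context)

@then('the report is archived and I am logged out')
def step_common_cleanup(context):
    # One business-readable step that runs the common tail sequence,
    # so the behaviour stays visible in the feature file.
    archive_report(context)
    log_out(context)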
For more information, check the Cucumber documentation on Helper Methods.
If you still have more questions (I realise the documentation on Helper Methods isn't very extensive), please join the Cucumber Slack.
I want to know the difference between "feature numeric" and "numeric" columns in Azure Machine Learning Studio.
The documentation site states:
Because all columns are initially treated as features, for modules that perform mathematical operations, you might need to use this option to prevent numeric columns from being treated as variables.
But nothing more: not what a feature is, nor in which modules you need features. Nothing.
I specifically would like to understand whether the Clear feature option in the Fields dropdown of the Edit Metadata module has any effect. Can somebody give me a scenario where this Clear feature operation changes the ML outcome? Thank you
According to the documentation it ought to have an effect:
Use the Fields option if you want to change the way that Azure Machine Learning uses the data in a model.
But what can this effect be? Any example might help.
As you suspect, setting a column as a feature does have an effect, and it's actually quite important: when training a model, the algorithms will only take into account columns with the feature flag, effectively ignoring the others.
For example, if you have a dataset with columns Feature1, Feature2, and Label and you want to try out just Feature1, you would apply Clear feature to the Feature2 column (while making sure that Feature1 has the feature flag set, of course).
I am reading a lot about Gherkin, and I have already read that it is not good to repeat steps, and that the keyword "Background" should be used for this. But in the example on this page they repeat the same "Given" again and again. Could it be that I am getting it wrong? I would like to know your opinion about it.
As with several things, this is a topic that will generate different opinions. In this particular example I would have moved the "Given that I select the post" step to the Background section, as it seems to be a prerequisite for all scenarios in this feature. Of course, this would leave the scenarios in the feature without an actual Given section, but those steps would be incorporated from the Background section on execution.
I have also seen cases where the decision of moving steps to the Background is a trade-off between having more or fewer feature files and how these are structured. For example, if there are 10 scenarios for a particular feature with a lot of similar steps between them, but 1 or 2 scenarios do not require a particular step, then those 1 or 2 scenarios would have to be moved into a new feature file in order to have the exact same steps in the Background section of the original feature.
Of course it is correct to keep the scenarios like this. From a tester's perspective, the scenarios/test cases should run independently; therefore, you can keep these tests separate for each functionality.
But if you are doing integration testing, some of these test cases can be merged, so you can cover multiple test cases in one scenario.
And since the "Given" statement is repeated, you can put it in the Background, so you don't have to call it in each scenario.
Note: these separate scenarios will be handy when you run the scripts separately with annotation tags, when you just have to check a specific functionality or a bug fix.
How do I decide between creating several WebJobs with 1 function each and bundling several functions into one or only a few WebJobs?
Thanks
There is no straight answer to your question. Sorry.
Usually you group functions by workflow or role. For example, if you have a workflow that contains a function that resizes an image, then a function that applies a watermark, and another one that replicates the images, it makes sense to put all the functions together because they are related. You are more likely to change all of them when you modify the flow.
On the other hand, you might argue that functions should be separated. Unless you change the input/output, there is no reason to modify more than one function. However, if you need to change more than one function, you will end up editing more projects.
As you see, both arguments have pros/cons and there is really no right answer.
Try to experiment and see which approach works better for your solution.
PS: The only guideline that I can give is: if the functions are really small (a few lines of code), it is probably easier to put them in the same WebJob, because there is quite some overhead in maintaining multiple assemblies.