I want to perform attribute selection in Weka, but my dataset is rather big, and the program runs quite a while. That's why I want to see the current best set of attributes found. How do I do it?
For example, genetic search has the "Report Frequency" parameter, but all the results are shown only after the whole search has finished, which is not what I need.
There is no progress bar either, so I don't even know how long I will have to wait...
Feature or attribute selection is a standard problem in the data mining and machine learning domains.
If you want to select a good set of attributes, you should preprocess your data by ranking the attributes based on their quality. Ranking methods based on statistical measures, such as the p-metric or the t-statistic, are popular. One cannot simply select attributes at random from a large set without some intuition about their nature.
If you do not need to run attribute selection on your whole dataset, you could run it on a smaller sample of your dataset (simply edit your ARFF file).
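If you would rather script the sampling than edit the ARFF file by hand, a minimal sketch could look like the following (the file names and the 10% fraction are just assumptions):

import random

def sample_arff(src="full.arff", dst="sample.arff", fraction=0.1, seed=42):
    # Keep everything up to and including the @data line (the ARFF header).
    with open(src) as f:
        lines = f.readlines()
    data_start = next(i for i, line in enumerate(lines)
                      if line.strip().lower() == "@data") + 1
    header, data = lines[:data_start], lines[data_start:]
    # Randomly keep a fraction of the data rows.
    random.seed(seed)
    sampled = random.sample(data, max(1, int(len(data) * fraction)))
    with open(dst, "w") as f:
        f.writelines(header + sampled)

sample_arff()

Running the attribute selection on sample.arff should then finish much faster, and you can verify the chosen attributes on the full dataset afterwards.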
I would like to use ParaView for postprocessing of FE models. However, I am missing an essential feature in the VTK format, which probably exists, but I don't know its name or how it is implemented in VTK.
In FE models it is common to group some nodes/elements. Depending on the program, these are named differently: Groups, Sets, Selections, and so on. Basically, they are just arrays with reference numbers for quick selection. For example: a tube could have the selections "inlet", "outlet" and "wall". Is there any possibility to store such a selection in the VTK format? The goal would be to be able to apply filters only to this node selection, for example to get results only from certain nodes.
By the way, I export my calculated data to VTK myself, because my FE program does not have native support for the VTK format. So I am more interested in the required data structure than in a workflow for program XY.
In VTK, you cannot apply filters to only a subset of a data object. What you need is to split your data into several objects for processing.
I see two ways for that:
1. Create one object per selection and combine them in a vtkMultiBlockDataSet, with one selection per block. You can then use vtkExtractBlock to apply filters to a specific part.
2. Add a PartId array to your data. You can then use thresholding to extract the region of interest.
I advise using option 1, as it carries more semantics.
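A minimal sketch of option 1 in Python, assuming the inlet/outlet/wall selections already exist as separate grids (the grid variables and the file name below are placeholders):

import vtk

# Placeholder grids standing in for the real inlet/outlet/wall selections.
inlet_grid, outlet_grid, wall_grid = (vtk.vtkUnstructuredGrid() for _ in range(3))

multiblock = vtk.vtkMultiBlockDataSet()
multiblock.SetBlock(0, inlet_grid)
multiblock.SetBlock(1, outlet_grid)
multiblock.SetBlock(2, wall_grid)

# Name the blocks so they show up with readable names in ParaView.
multiblock.GetMetaData(0).Set(vtk.vtkCompositeDataSet.NAME(), "inlet")
multiblock.GetMetaData(1).Set(vtk.vtkCompositeDataSet.NAME(), "outlet")
multiblock.GetMetaData(2).Set(vtk.vtkCompositeDataSet.NAME(), "wall")

# Write the whole model as a .vtm file that ParaView can open directly.
writer = vtk.vtkXMLMultiBlockDataWriter()
writer.SetFileName("tube.vtm")
writer.SetInputData(multiblock)
writer.Write()

# Extract a single selection so that further filters only see that part.
extract = vtk.vtkExtractBlock()
extract.SetInputData(multiblock)
extract.AddIndex(2)  # flat index: 0 is the root, 1/2/3 are the blocks, so 2 = "outlet"
extract.Update()

In ParaView itself, opening the .vtm file gives you the same block structure, and the Extract Block filter plays the role of vtkExtractBlock.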
I'm a mechanical engineer, and I have developed a pretty cool spreadsheet that I use to size steel members for lifting beams. The drawback is that I need to do some trial and error in the selection of the member until I get one that is as close to the allowable limits as possible.
What I'm hoping to improve on is to develop a function such that, based upon a length and weight variable that I enter, the program runs a loop and automatically selects the best member size(s) from a list of the members and their physical properties. Is this possible?
Yeah, depending on the complexity, a simple search through the parameters (less than, more than, etc.) might bring you the answer. You can do it quite easily with the pandas library: just load the Excel file as a pandas DataFrame (pandas.read_excel()), which will then allow you to perform those searches on the DataFrame object.
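For example, a rough sketch of such a search (the column names, values, and the small example table are made up; in practice you would read them from your spreadsheet):

import pandas as pd

# In practice: members = pd.read_excel("steel_members.xlsx")
# A tiny made-up table stands in for the spreadsheet here.
members = pd.DataFrame({
    "section":      ["W8x10", "W10x22", "W12x26"],
    "depth_in":     [7.89, 10.2, 12.2],
    "capacity_kip": [12.0, 28.0, 39.0],
})

required_capacity = 25.0  # the value you would enter

# Simple search: keep the members that meet the limit, then take the one
# closest to it (i.e. the most economical admissible section).
candidates = members[members["capacity_kip"] >= required_capacity]
print(candidates.sort_values("capacity_kip").head(1))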
If you want to run some optimization algorithm, you should look into SciPy's optimize module to get what you're looking for based on the input data (it handles both unconstrained and constrained problems).
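A minimal sketch of a constrained optimization with SciPy; the objective and constraint below are made-up stand-ins for a real beam-sizing calculation:

from scipy.optimize import minimize

def weight(x):
    # x[0] = depth, x[1] = flange width (hypothetical design variables);
    # pretend the weight is proportional to the cross-section area.
    return x[0] * x[1]

def capacity_margin(x):
    # Must stay >= 0 for the design to be admissible (made-up capacity model).
    return x[0] * x[1] ** 2 - 50.0

result = minimize(weight, x0=[10.0, 5.0],
                  bounds=[(4, 24), (2, 12)],
                  constraints=[{"type": "ineq", "fun": capacity_margin}])
print(result.x, result.fun)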
Of course, the question you've stated is quite general, so I can only point you in the right direction. More info would be better.
I want to know what the difference is between "feature numeric" and plain "numeric" columns in Azure Machine Learning Studio.
The documentation site states:
Because all columns are initially treated as features, for modules
that perform mathematical operations, you might need to use this
option to prevent numeric columns from being treated as variables.
But nothing more. It does not explain what a feature is, or in which modules you need features. Nothing.
I specifically would like to understand whether the Clear feature option in the Fields dropdown of the Edit Metadata module has any effect. Can somebody give me a scenario where this clear-feature operation changes the ML outcome? Thank you.
According to the documentation it ought to have an effect:
Use the Fields option if you want to change the way that Azure Machine
Learning uses the data in a model.
But what can this effect be? Any example might help
As you suspect, setting a column as feature does have an effect, and it's actually quite important - when training a model, the algorithms will only take into account columns with the feature flag, effectively ignoring the others.
For example, if you have a dataset with columns Feature1, Feature2, and Label and you want to try out just Feature1, you would apply clear feature to the Feature2 column (while making sure that Feature1 has the feature label set, of course).
When passing dataframes as entities into an entityset and running DFS on it, are we supposed to exclude the target variable from DFS? I have a model that scored 0.76 roc_auc after traditional feature selection methods tried manually, and I used Featuretools to see if it improves the score. So I ran DFS on an entityset that included the target variable as well. Surprisingly, the roc_auc score went up to 0.996 and the accuracy to 0.9997, so I am doubtful of the scores: since I passed the target variable into Deep Feature Synthesis as well, information related to the target might have leaked into the training data. Am I assuming correctly?
Deep Feature Synthesis and Featuretools do allow you to keep your target in your entity set (in order to create new features using historical values of it), but you need to set up the “time index” and use “cutoff times” to do this without label leakage.
You use the time index to specify the column that holds the value for when data in each row became known. This column is specified using the time_index keyword argument when creating the entity using entity_from_dataframe.
Then, you use cutoff times when running ft.dfs() or ft.calculate_feature_matrix() to specify the last point in time you should use data when calculating each row of your feature matrix. Feature calculation will only use data up to and including the cutoff time. So, if this cutoff time is before the time index value of your target, you won’t have label leakage.
You can read about those concepts in detail in the documentation on Handling Time.
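A minimal sketch of that setup, assuming a single "customers" dataframe whose column names, dates, and target values are all made up for illustration:

import pandas as pd
import featuretools as ft

data = pd.DataFrame({
    "customer_id": [1, 2, 3],
    "signup_date": pd.to_datetime(["2019-01-01", "2019-02-01", "2019-03-01"]),
    "target": [0, 1, 0],
})

es = ft.EntitySet("example")
es = es.entity_from_dataframe(entity_id="customers",
                              dataframe=data,
                              index="customer_id",
                              time_index="signup_date")

# Cutoff times: one row per instance, giving the last point in time whose
# data may be used when computing that row of the feature matrix.
cutoff_times = pd.DataFrame({
    "customer_id": [1, 2, 3],
    "time": pd.to_datetime(["2019-01-15", "2019-02-15", "2019-03-15"]),
})

feature_matrix, features = ft.dfs(entityset=es,
                                  target_entity="customers",
                                  cutoff_time=cutoff_times)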
If you don’t want to deal with the target at all, you have two options (a short sketch of both follows below):
1. Use pandas to drop it out of your dataframe entirely before making it an entity. If it’s not in the entityset, it can’t be used to create features.
2. Set the drop_contains keyword argument in ft.dfs to ['target']. This stops any feature from being created which includes the string 'target'.
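Continuing the sketch above (same made-up names):

# Option 1: drop the target before building the entityset.
data_no_target = data.drop(columns=["target"])

# Option 2: keep the target in the entityset but forbid any feature whose
# name contains the string "target".
feature_matrix, features = ft.dfs(entityset=es,
                                  target_entity="customers",
                                  cutoff_time=cutoff_times,
                                  drop_contains=["target"])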
Whichever of the above options you choose, it is still possible to pass a target column directly through DFS. If you add the target to your cutoff times dataframe, it is passed through to the resulting feature matrix. That can be useful because it ensures the target column remains aligned with the other features. You can find an example of passing the label through in the documentation.
Advanced Solution using Secondary Time Index
Sometimes a single time index isn’t enough to represent datasets where the information in a row became known at two different times. This commonly occurs when the target is a column that only becomes known later than the rest of the row. To handle this situation, we need to use a “secondary time index”.
Here is an example from a Kaggle kernel on predicting when a patient will miss an appointment with a doctor where a secondary time index is used. The dataset has a scheduled_time, when the appointment is scheduled, and an appointment_day, which is when the appointment actually happens. We want to tell Featuretools that some information like the patient’s age is known when they schedule the appointment, but other information like whether or not a patient actually showed up isn't known until the day of the appointment.
To do this, we create an appointments entity with a secondary time index as follows:
es = ft.EntitySet('Appointments')
es = es.entity_from_dataframe(entity_id="appointments",
                              dataframe=data,
                              index='appointment_id',
                              time_index='scheduled_time',
                              secondary_time_index={'appointment_day': ['no_show', 'sms_received']})
This says that most columns can be used as of the time index scheduled_time, but that the variables no_show and sms_received can’t be used until the time given by the secondary time index.
We then make predictions at the scheduled_time by setting our cutoff times to be
cutoff_times = es['appointments'].df[['appointment_id', 'scheduled_time', 'no_show']]
By passing that dataframe into DFS, the no_show column will be passed through untouched, while historical values of no_show can still be used to create features. An example would be something like ages.PERCENT_TRUE(appointments.no_show), or “the percentage of people of each age that have not shown up in the past”.
If you are using the target variable in your DFS, then you are leaking information about it into your training data. So you have to remove your target variable before doing any kind of feature engineering (manual or via DFS).
I'm working on a pretty basic web app (not much more than CRUD stuff). However, the requirements call for a bunch of data to be displayed with each item in the search results - IDs, dates, email addresses, long descriptions... too much to fit neatly into a simple grid, and too dissimilar to make them flow together (like the natural language example from this article).
Is there a design pattern for attractively displaying many descriptive fields with each search result?
(Please don't tell me to just remove some fields from the results; that's not an option for this project.)
Obviously there are many ways you can handle this, and to a degree it's a factor of your information design abilities and preferences.
Natural Data Groupings
What I would do is try to organize your data into a small number of "buckets." You state that the data are too dissimilar to be arranged into a sentence, but it's likely you can create a few logical groups. Since we can't see all your data, I'll guess that you have information about a person (email, name, ID?), about some sort of event (dates? type?), or maybe about some kind of object related to the person (orders? classes?). Whatever they are, some of the data will be more closely related to each other than others.
Designing in Chunks
Take each loose "bucket" and design a kind of "plate" -- a grouping just for the information in that bucket. The design problem within this constrained chunk is easier to tackle: maybe it's a little table-like layout, maybe it's something non-tabular, like the stackoverflow user "nameplate". Maybe long textual data have their own plates, or maybe they're grouped into a single plate, but with a preview/detail click-for-more arrangement.
Using a Grid
Now that you have a small number of "plates," go back to a grid-like approach for your overall search result row design. Arrange the plates as units within the row, and be sure to keep them aligned. Following an overall grid (HTML table or otherwise) for the plates will avoid an "information soup" problem. You'll have clean columns that scan well, and a readable, natural information hierarchy. The natural language example you cite would indeed be difficult to parse if it were one of many rows displayed in a search results grid.
Consistency
Be sure to use a common "design vocabulary" when you're working on the chunks -- consistent styling of labels, consistent spacing... so when everything's displayed, despite the bulk of information, it all feels like it's part of the same family.
It's an interesting design exercise. Many comps, lots of iteration, and some brainstorming should get you where you need to be.
It probably depends on the content you're displaying. Look at the StackOverflow layout for this question. It has Votes, Title, Description, Tags, Author, etc. The content wouldn't work well in a grid for sure, nor does it flow nicely on its own.
I think it's time to get creative ;)
No one ever thinks about what this is going to look like on their screen, do they?
One thing you can do is truncate the displayed text, and then display the expanded version in a tooltip on hover, or after the user clicks on it.
For example, display only the two-letter state abbreviation but show the full state name on hover.
Or, to save even more space, only display the state abbreviation, and put the entire address in the tooltip.
For long descriptions, you can display only the first few characters, followed by an ellipsis or the word "More". Then, show the full text either on hover or on click.
One disadvantage of the hover approach is that you can't sort the column on that text. There's nothing for the user to click to request the sort.