Interpreting analysis with PCA [closed] - scikit-learn

The focus of this question is: What components should I keep?
There is a dataset that has this structure:
Each row is associated with an image in a directory.
The variable confidence is a dummy value that is always 1.
The names of the coordinates where an object is identified are: XMin, XMax, YMin, YMax.
The names of the image characteristics are: IsOccluded, IsTruncated, IsGroupOf, IsDepiction, IsInside.
So I made a correlation table (shown below), which suggests that the 4 components corresponding to the points in the image are necessary.
Then I made a table with the principal components and their explained variance ratio, shown below.
After that I used PCA from sklearn, which shows the number of components and their cumulative explained variance.
I interpret from all this that the 4 coordinates are totally necessary.
How can I demonstrate that the image characteristics are not relevant?

"the last YMin coordinate has a low percentage of cumulative explained variance"
This is wrong, because PCA gives you cumulative explained variance per principal component, not per variable of your original basis.
What it tells you is that you can project the data onto only 3 dimensions instead of N while still keeping 70% of the variability, while if you keep 4 dimensions you keep 80% of the variability. But this holds only after a specific change of basis (that of the PCs), not after dropping some of the initial variables.
To see how important the initial variables are, you can look at the vector representation of the principal components: each coordinate of a component is the weight (loading) of the corresponding initial variable in that component.
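A minimal sketch of that check with scikit-learn (random stand-in data; only the column names are taken from the question):
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

cols = ["XMin", "XMax", "YMin", "YMax",
        "IsOccluded", "IsTruncated", "IsGroupOf", "IsDepiction", "IsInside"]
# Stand-in data; replace with the real dataframe.
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.random((200, len(cols))), columns=cols)

# Standardise so variables with larger scales don't dominate the PCs.
X = StandardScaler().fit_transform(df[cols])
pca = PCA().fit(X)

# Variance is explained per principal component, not per original variable.
print(pca.explained_variance_ratio_.cumsum())

# Loadings: each row is a PC, each column an original variable; large
# absolute values show which original variables drive that component.
loadings = pd.DataFrame(pca.components_, columns=cols)
print(loadings.round(2))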
"the 4 coordinates are totally necessary"
It depends on your interpretation of "necessary".


How does the window size affect word2vec and how do we choose window size according to different tasks? [closed]

For example, if I choose two window sizes, 5 and 50, and train the word2vec model, will the 50 one take more time to train? Will the embeddings of the 50 one concentrate more on the semantics of the text and the 5 one concentrate more on single words?
BTW, the two questions above are just my thinking/examples of what I am seeking. My real question is just the title: "How does the window size affect word2vec and how do we choose window size according to different tasks?"
A larger window will take longer to train.
A larger window will have a stronger effect on runtime in 'skip-gram' mode, where a larger window means more individual center-word predictions & error-backpropagations. It'll have a milder effect on runtime in 'CBOW' mode, where it just means more averaging of input-vectors and fan-out of the final effects for each prediction/backpropagation.
For how it affects the character of the resulting word-vectors, there's some discussion & a related research paper in a prior answer: Word2Vec: Effect of window size used
Generally, you'd optimize the window value the same as any other tunable parameter, by devising some repeatable way to score the final word-vectors on your real task (or a close/correlated simulation), then trying a range of values to see which scores best on your evaluation.
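For instance, a minimal sketch of that comparison with gensim (the toy corpus and parameter values are placeholders, not from the question):
import time
from gensim.models import Word2Vec

# Toy corpus; substitute your real tokenized sentences.
sentences = [["the", "cat", "sat", "on", "the", "mat"],
             ["dogs", "and", "cats", "are", "common", "pets"]] * 500

for window in (5, 50):
    start = time.time()
    model = Word2Vec(sentences, vector_size=100, window=window,
                     sg=1, min_count=1, epochs=5, workers=1)
    print(f"window={window}: trained in {time.time() - start:.1f}s")
    # Evaluate model.wv on your real task here and keep the best-scoring window.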

How can I trace eating with mouth closed using Google ML Kit

Currently I try to calculate the distance between UPPER_LIP_BOTTOM and LOWER_LIP_TOP, and I set the threshold value to 23 (calculated as the minimum distance between UPPER_LIP_BOTTOM and LOWER_LIP_TOP). If the current distance goes above the threshold it shows "Eating", but this method does not work when I am eating with my mouth closed.
You can experiment with a couple of things:
Take all the points in the mouth as input and build a second ML classifier model (a single fully connected layer might work); see the sketch after this list.
In addition to the above, take input from multiple frames. There may be additional complications if the frames are not taken at regular intervals.
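A minimal sketch of the first suggestion, assuming you export the mouth landmark coordinates from ML Kit and hand-label frames yourself (the random arrays here are stand-ins for that real data):
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

# Hypothetical setup: one sample per frame, built by flattening the (x, y)
# coordinates of all mouth contour points, e.g. 40 points -> 80 features.
# Labels: 0 = not eating, 1 = eating, collected by hand for training.
rng = np.random.default_rng(0)
X = rng.random((400, 80))          # stand-in for real landmark coordinates
y = rng.integers(0, 2, size=400)   # stand-in for hand-labelled frames

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# A single small hidden layer, as suggested above.
clf = MLPClassifier(hidden_layer_sizes=(16,), max_iter=500, random_state=0)
clf.fit(X_train, y_train)
print("held-out accuracy:", clf.score(X_test, y_test))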
I am interested in the use-case, can you tell us more?

Overlapping/crowded labels on y-axis python [duplicate]

I'm in a bit of a rush to finish this for tomorrow's presentation to the project owner. We are a small group of economics students in Germany trying to figure out machine learning with Python. We set up a random forest classifier and want to show the estimator's important features in a neat plot. A Google search led us to the following solution, which kind of does the trick, but leaves us unsatisfied due to the overlapping of the labels on the y-axis. The code we used looks like this:
import numpy as np
import matplotlib.pyplot as plt

feature_importances = clf.best_estimator_.feature_importances_
# Scale so the most important feature is 100.
feature_importances = 100 * (feature_importances / feature_importances.max())
# Indices that sort features from least to most important.
sorted_idx = np.argsort(feature_importances)
pos = np.arange(sorted_idx.shape[0])
# One horizontal bar (and one y-tick label) per feature.
plt.barh(pos, feature_importances[sorted_idx], align='center', height=0.8)
plt.yticks(pos, df_year_four.columns[sorted_idx])
plt.show()
Due to privacy, let me just say this: the feature names on the y-axis are overlapping (there are about 30 of them). I was looking into the matplotlib documentation to understand how to do this myself, but unfortunately I couldn't find anything helpful. It seems that training and testing models is easier than understanding matplotlib and creating plots :D
Thank you so much for helping out and taking the time, I appreciate it.
I see your solution, and I just want to add this link to explain why: How to change spacing between ticks in matplotlib?
The spacing between ticklabels is exclusively determined by the space between ticks on the axes. Therefore the only way to obtain more space between given ticklabels is to make the axes larger.
The question I linked shows that by making the graph large enough, your axis labels would naturally be spaced better.
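For instance, reusing the variables from the question's snippet, something along these lines should give the labels room (the figsize values are a guess to tune):
import matplotlib.pyplot as plt

# Same plot as in the question, but on a taller figure so that ~30
# horizontal bars and their y-tick labels get enough vertical room.
plt.figure(figsize=(8, 10))
plt.barh(pos, feature_importances[sorted_idx], align='center', height=0.8)
plt.yticks(pos, df_year_four.columns[sorted_idx], fontsize=8)
plt.tight_layout()   # keep long labels from being clipped at the left edge
plt.show()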
You are using np.argsort, which returns a numpy array with many indices, and you are using that array to label your y-axis, so the labels overlap.
My suggestion would be to use a single index into sorted_idx, like
plt.yticks(pos, df_year_four.columns[sorted_idx[0]])
This will plot only one label.
Got it, guys!
'Geistesblitz', as we say in Germany! (a flash of inspiration)
See the variable sorted_idx in the snippet above? Slice it to sorted_idx[-15:]
to view only the top half of the features and loosen up the y-axis. Yes!!! This works well because it drops the many less important features.
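In code, that fix against the snippet above might look like this (top_idx is just an illustrative name):
# Keep only the 15 most important features (argsort is ascending, so the
# most important indices sit at the end of sorted_idx).
top_idx = sorted_idx[-15:]
pos = np.arange(top_idx.shape[0])
plt.barh(pos, feature_importances[top_idx], align='center', height=0.8)
plt.yticks(pos, df_year_four.columns[top_idx])
plt.show()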

How to explain this decision tree interpretability question?

[Figures 5 and 6: two decision trees; the second is much deeper than the first.]
The two pictures above show the two decision trees.
Question is: It is often claimed that a strength of decision trees is their interpretability.
Is this always justified? Refer to Figures 5 and 6 to help with your answer.
I think the point of the question is that a decision tree is interpretable if its depth is relatively small. The second tree is very deep, i.e. for one single prediction you have to process a high number of different splitting decisions. You therefore lose interpretability, because the explanation for any prediction is an intersection of too many conditions for a human user to process.
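A small sklearn sketch of that contrast, using the iris toy dataset (max_depth=2 is an arbitrary cap chosen for illustration):
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = load_iris(return_X_y=True)

# An unconstrained tree can grow deep, so one prediction follows a long
# chain of conditions; capping max_depth keeps the rules human-readable.
deep = DecisionTreeClassifier(random_state=0).fit(X, y)
shallow = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)

print("deep tree depth:", deep.get_depth())
print("shallow tree depth:", shallow.get_depth())
print(export_text(shallow))  # the shallow tree as readable if/else rules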

What exactly is computer resolution? [closed]

Maybe it's a bit of a trivial question, but I am confused about screen resolution. What is it?
If the resolution is, say, 150 x 100, does it mean that 150 pixels cover the entire width of my screen (horizontally)? And does it mean that the size of a pixel is not fixed, since 300 x 200 would mean 300 pixels covering the same length?
Also, say I take a pixel and draw a circle around it. Is it possible that the circle passes through the centre of every pixel it covers, or will/can there be some pixels whose centre the boundary does not pass through? (For the extreme point on the diameter, the boundary does pass through the centre.)
That is, taking a pixel, can I say whether it is inside the circle or outside? (Again, the extreme point on the diameter is inside.)
EDIT: Also, on a normal x-y axis the points on the boundary can have decimal coordinates, but pixel indices in a window increase only in unit steps. So how do we decide which pixels get coloured when drawing a circle?
When you read 1024x768, 1024 is the horizontal resolution and 768 is the vertical resolution, in pixels.
A pixel can be a square or a rectangle; it's not specified.
To determine whether a pixel is inside a circle, you (simplified) calculate the distance from the centre of that pixel to the centre of the circle and check whether it is less than or equal to the radius; if it is, the pixel is inside.
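A minimal sketch of that distance check (pixels_in_circle is a hypothetical helper; pixel centres are assumed to sit at half-integer offsets):
# A pixel with integer index (px, py) is treated as a unit square whose
# centre is at (px + 0.5, py + 0.5). The pixel is coloured if that centre
# lies within the circle of radius r around (cx, cy).
def pixels_in_circle(cx, cy, r):
    filled = []
    for py in range(int(cy - r) - 1, int(cy + r) + 2):
        for px in range(int(cx - r) - 1, int(cx + r) + 2):
            dx = px + 0.5 - cx
            dy = py + 0.5 - cy
            if dx * dx + dy * dy <= r * r:
                filled.append((px, py))
    return filled

print(pixels_in_circle(5.0, 5.0, 3.0))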
