I need to find a group of datasets to apply some non-parametric tests. Does anyone know where I can find example datasets to download and apply those methods to?
Thank you all :)
I'm not sure exactly what you need, but try the wine dataset from the R package ordinal:
library(ordinal)
data(wine)
And information about the dataset:
The ‘wine’ data set is adopted from Randall (1989) and from a factorial experiment on factors determining the bitterness of wine. Two treatment factors (temperature and contact) each have two levels. Temperature and contact between juice and skins can be controlled when crushing grapes during wine production. Nine judges each assessed wine from two bottles from each of the four treatment conditions, hence there are 72 observations in all.
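If you would rather work in Python, here is a minimal sketch along the same lines; the choice of the iris data and of the Kruskal-Wallis test is only an illustration of applying a non-parametric method to a freely downloadable dataset:
# Minimal sketch: a non-parametric test (Kruskal-Wallis) on an example dataset.
# The iris data and the choice of test are illustrative assumptions only.
from scipy.stats import kruskal
from sklearn.datasets import load_iris

iris = load_iris()
sepal_length = iris.data[:, 0]                                # first feature: sepal length (cm)
groups = [sepal_length[iris.target == k] for k in range(3)]   # split by species

stat, p_value = kruskal(*groups)                              # non-parametric one-way test
print(f"Kruskal-Wallis H = {stat:.2f}, p = {p_value:.3g}")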
Statistics is not my major and English is not my native language. I have tried to apply for data analysis or data science jobs in industry. However, I do not know how to describe my research process below in a concise and professional way. I would highly appreciate it if you could provide such help.
Background: I simulate properties of materials using different research packages, such as LAMMPS. The simulated data are only the coordinates of atoms. Below are my data analysis steps.
Step 1: Clean the data to make sure it is complete and that each atom ID is unique and consistent across time moments (timesteps).
Step 2: Calculate the distance from each center atom to its neighboring atoms to find the target species (a configuration formed by several target atoms, such as Al-O-H, Si-O-H, Al-O-H2, H3-O); see the sketch after this list.
Step 3: Count the number of each species as a function of space and/or time, and plot the species distribution over space and/or time as well as the lifetime distribution of each species.
NOTE: such a distribution is different from a statistical distribution such as the normal or binomial distribution.
Step 4: Based on the above distributions, explore and interpret the correlations between species.
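To make step 2 concrete, here is a rough sketch in Python of the kind of computation I mean; the coordinates, species list and 1.2 Å cutoff are made-up placeholders, and periodic boundary conditions are ignored for simplicity.
# Rough sketch of step 2: find each atom's neighbours within a cutoff and count
# one illustrative species (Al-O-H). The coordinates, species and cutoff below
# are placeholders; periodic boundary conditions are ignored for simplicity.
import numpy as np
from scipy.spatial import cKDTree

def neighbours_within(coords, cutoff=1.2):
    """For each atom index, return the set of atom indices closer than `cutoff`."""
    tree = cKDTree(coords)
    neigh = {i: set() for i in range(len(coords))}
    for i, j in tree.query_pairs(r=cutoff):
        neigh[i].add(j)
        neigh[j].add(i)
    return neigh

def count_al_o_h(coords, species, cutoff=1.2):
    """Count O atoms whose neighbours are exactly one Al and one H."""
    neigh = neighbours_within(coords, cutoff)
    count = 0
    for i, s in enumerate(species):
        if s == "O" and sorted(species[j] for j in neigh[i]) == ["Al", "H"]:
            count += 1
    return count

# Example with made-up coordinates for a single timestep
coords = np.array([[0.0, 0.0, 0.0],    # Al
                   [1.0, 0.0, 0.0],    # O
                   [1.6, 0.8, 0.0]])   # H
species = ["Al", "O", "H"]
print(count_al_o_h(coords, species, cutoff=1.2))   # -> 1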
After the above steps, I study the mechanism behind them based on the materials themselves and their local environment.
Could anyone point out how to describe the above steps in statistical terms, data-analytic terms, or otherwise?
I sincerely appreciate your time and help.
I have a dataset on all the software installed by a large group of users. I would have to classify the users into one of 4 categories based on which software they installed (each user can install up to 30 pieces of software).
The categories are highly imbalanced: one category contains nearly 45% of the users in the training dataset, another 35%, a third only 15%, and the fourth only 5%. Let's say that these 4 categories roughly correspond to 4 different IT job types (e.g. "Software Engineer", "DevOp", "Analyst", etc.).
"Software" is a feature which has high cardinality (above 1000) so using naive one-hot encoding does not seem appropriate.
I would like to identify a subset of informative values/levels for this variable. For instance, a software like an anti-virus program probably does not differentiate well between these categories as all or most users would have it installed. A specialised tool (e.g. IDE) might differentiate better, i.e. its frequency of occurrence might be different within each IT job category.
How can I identify such "informative" values using Python? Should I use sklearn.feature_selection.chi2, or sklearn.feature_selection.mutual_info_classif, or some other method?
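To make this concrete, here is a minimal sketch of the kind of scoring I have in mind, assuming the data is reshaped into a binary user-by-software indicator matrix; the toy users, software names and categories are made up:
# Minimal sketch: score each software column with chi-square and mutual
# information against the job category. All names below are made up.
import pandas as pd
from sklearn.feature_selection import chi2, mutual_info_classif

# Toy data: one row per (user, software) install
installs = pd.DataFrame({
    "user":     ["u1", "u1", "u2", "u2", "u3", "u3", "u4"],
    "software": ["antivirus", "ide", "antivirus", "ide", "antivirus", "bi_tool", "antivirus"],
})
labels = pd.Series({"u1": "engineer", "u2": "engineer", "u3": "analyst", "u4": "analyst"})

# Multi-hot user-by-software indicator matrix (1 = installed), instead of
# one-hot encoding a single high-cardinality "software" column
X = pd.crosstab(installs["user"], installs["software"]).clip(upper=1)
y = labels.loc[X.index]

chi2_scores, p_values = chi2(X, y)                              # per-column chi-square
mi_scores = mutual_info_classif(X, y, discrete_features=True)   # per-column mutual information
scores = pd.DataFrame({"chi2": chi2_scores, "p": p_values, "mi": mi_scores},
                      index=X.columns).sort_values("mi", ascending=False)
print(scores)   # "antivirus" (installed by everyone) should score near zero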
Suppose you have a set of transcribed customer service calls between customers and human agents, where the average call length is 7 minutes. Customers mostly call because of issues they have with the product. Let's assume that a human can assign one label per axis per call:
Axis 1: What was the problem from the customer's perspective?
Axis 2: What was the problem from the agent's perspective?
Axis 3: Could the agent resolve the customer's issue?
Based on the manually labeled texts, you want to train a text classifier that predicts a label for each call on each of the three axes. But labeling the recordings takes time and costs money. On the other hand, you need a certain amount of training data to get good prediction results.
Given the above assumptions, how many manually labeled training texts would you start with? And how do you know that you need more labeled training texts?
Maybe you've worked on a similar task before and can give some advice.
UPDATE (2018-01-19): There's no right or wrong answer to my question. OK, ideally somebody has worked on exactly the same task, but that's very unlikely. I'll leave the question open for one more week and then accept the best answer.
This would be tricky to answer but I will try my best based on my experience.
In the past, I have performed text classification on 3 datasets; the number in brackets indicates how big each dataset was: restaurant reviews (50k sentences), Reddit comments (250k sentences) and developer comments from issue tracking systems (10k sentences). Each of them had multiple labels as well.
In each of the three cases, including the one with 10k sentences, I achieved an F1 score of more than 80%. I am stressing this dataset specifically because I was told by some that its size is on the small side.
So, in your case, assuming you have at least 1,000 instances (calls that include the conversation between customer and agent) of 7-minute calls on average, this should be a decent start. If the results are not satisfying, you have the following options:
1) Use different models (MNB, Random Forest, Decision Tree, and so on, in addition to whatever you are using); a sketch of such a comparison follows this list.
2) If point 1 gives more or less similar results, check the ratio of instances across all the classes you have (on each of the 3 axes you are talking about here). If the classes are not reasonably balanced, get more data, or try different balancing techniques if you cannot get more data.
3) Another way would be to classify at the sentence level rather than at the message or conversation level, to generate more data and individual labels for sentences rather than for the message or the conversation itself.
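Regarding point 1, a minimal sketch of such a comparison with scikit-learn is below; the toy transcripts and labels are placeholders for your real calls and for one of your three axes at a time:
# Sketch of point 1: compare a few standard classifiers on the same text data
# with cross-validated macro-F1. The transcripts and labels are placeholders.
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.tree import DecisionTreeClassifier

calls = ["the app keeps crashing when I log in",
         "I was charged twice for the same order",
         "cannot reset my password at all",
         "please refund my last payment",
         "the screen freezes after the update",
         "why is there an extra fee on my invoice"]
labels = ["technical", "billing", "technical", "billing", "technical", "billing"]

for name, clf in [("MNB", MultinomialNB()),
                  ("RandomForest", RandomForestClassifier(n_estimators=200)),
                  ("DecisionTree", DecisionTreeClassifier())]:
    pipe = make_pipeline(TfidfVectorizer(), clf)        # bag-of-words + classifier
    scores = cross_val_score(pipe, calls, labels, cv=3, scoring="f1_macro")
    print(f"{name}: mean macro-F1 = {scores.mean():.2f}")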
I'm using PCA from scikit-learn and I'm getting some results which I'm trying to interpret, so I ran into a question: should I subtract the mean (or perform standardization) before using PCA, or is this somehow embedded into the sklearn implementation?
Moreover, if so, which of the two should I perform, and why is this step needed?
I will try to explain it with an example. Suppose you have a dataset that includes a lot of features about housing, and your goal is to classify whether a purchase is good or bad (a binary classification). The dataset includes some categorical variables (e.g. location of the house, condition, access to public transportation, etc.) and some float or integer numbers (e.g. market price, number of bedrooms, etc.). The first thing you may do is encode the categorical variables. For instance, if you have 100 locations in your dataset, the common way is to encode them from 0 to 99. You may even end up encoding these variables in one-hot fashion (i.e. a column of 1s and 0s for each location), depending on the classifier you are planning to use.
Now if you use the price in millions of dollars, the price feature will have a much higher variance and thus a higher standard deviation. Remember that we use the squared difference from the mean to calculate the variance: a bigger scale creates bigger values, and the square of a big value grows faster. But that does not mean the price carries significantly more information than, for instance, the location. In this example, however, PCA would give a very high weight to the price feature, and the weights of the categorical features would almost drop to 0. If you normalize your features, it provides a fair comparison of the explained variance in the dataset. So it is good practice to normalize the mean and scale the features before using PCA.
Before PCA, you should:
1. Mean-normalize (always)
2. Scale the features (if required)
Note: Please remember that steps 1 and 2 are technically not the same thing.
This is a really non-technical answer, but my method is to try both and then see which one accounts for more variation on PC1 and PC2. However, if the attributes are on different scales (e.g. cm vs. feet vs. inches) then you should definitely scale to unit variance. In every case, you should center the data.
Here's the iris dataset with centering and with centering + scaling. In this case, centering led to higher explained variance, so I would go with that one. I got the data from sklearn.datasets' load_iris. Then again, with centering alone PC1 carries most of the weight, so I wouldn't treat patterns found in PC2 as significant. On the other hand, with centering + scaling the weight is split between PC1 and PC2, so both axes should be considered.
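For reference, here is a minimal sketch of that comparison on the iris data; note that scikit-learn's PCA centers the data internally but does not scale it, so scaling has to be done explicitly (e.g. with StandardScaler):
# Compare PCA on iris with centering only (PCA centers internally) versus
# centering + scaling to unit variance via StandardScaler.
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X = load_iris().data

pca_centered = PCA(n_components=2).fit(X)
pca_scaled = PCA(n_components=2).fit(StandardScaler().fit_transform(X))

print("centered only:    ", pca_centered.explained_variance_ratio_)
print("centered + scaled:", pca_scaled.explained_variance_ratio_)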
I am writing a study protocol for my master's thesis. The study seeks to compare the rates of non-communicable diseases (NCDs) and their risk factors, and to determine the effects of rural-to-urban migration. Sibling pairs will be identified from a rural area: one sibling should have participated in the rural NCD survey that is currently ongoing in the area, and the other sibling should have left the area and reported moving to a city. Data will be collected by completing a questionnaire on demographics, family history, medical history, diet, alcohol consumption, smoking and physical activity. This will be done for both the rural and the urban sibling, along with data on the amount of time spent in urban areas.
The outcomes, which are binary (whether one has the condition or not), are: 1. diabetic, 2. hypertensive, 3. obese.
What statistical method can I use to compare the outcomes (stated above) between the two groups, considering that the siblings were matched (one urban sibling for every rural sibling)?
What statistical methods can also be used to explore associations between the amount of time spent in urban residence and the outcomes?
Given that your main aim is to compare quantities of two nominal distributions, a chi-square test seems to be the method of choice for your first question. However, it should be mentioned that a chi-square test is, in a sense, "the smallest" test for assessing differences between samples. If you are studying medicine (or a related field), a chi-square test is fine because it is frequently applied by researchers in this field. If you are studying psychology or sociology (or a related field), I'd advise highlighting the limitations of the test in the discussion section, since it mostly tests your distributions against randomly expected distributions.
Regarding your second question, a logistic regression would be applicable, since it allows binomially distributed variables both as independent variables (predictors) and as dependent variables. However, if you have other interval-scaled variables (e.g. age, weight, etc.), you could also use t-tests or ANOVAs to investigate differences in these variables with respect to the presence of specific diseases (i.e. diabetic or not).
Overall, this matter strongly depends on what you mean by "association". Classically, "association" refers to correlation or linear regression (for which you need interval-scaled variables on "both sides"), but given your data structure, the aforementioned methods are a better fit.
How you actually calculate these tests depends on the statistics software used.
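For example, in Python the two analyses could look roughly like the sketch below; the toy data frame and all column names (group, diabetic, years_urban, age) are hypothetical stand-ins for the questionnaire variables.
# Sketch of the two analyses above. The data frame and the column names
# (group, diabetic, years_urban, age) are hypothetical placeholders.
import pandas as pd
from scipy.stats import chi2_contingency
import statsmodels.formula.api as smf

# Toy data standing in for the real questionnaire
df = pd.DataFrame({
    "group":       ["rural", "urban"] * 20,
    "diabetic":    [0, 1, 0, 0, 1, 1, 0, 1, 0, 1] * 4,
    "years_urban": [0, 5, 0, 8, 1, 12, 0, 3, 2, 10] * 4,
    "age":         [34, 36, 41, 40, 55, 52, 29, 31, 60, 63] * 4,
})

# Question 1: compare outcome rates between the rural and urban groups
table = pd.crosstab(df["group"], df["diabetic"])
chi2_stat, p_value, dof, expected = chi2_contingency(table)
print(f"chi-square = {chi2_stat:.2f}, p = {p_value:.3f}")

# Question 2: association between time spent in urban residence and an outcome
model = smf.logit("diabetic ~ years_urban + age", data=df).fit()
print(model.summary())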