I have employees' performance rating data (low, mixed, strong, significant, exceptional) in one column, and certain commonly occurring words used in their performance reviews encoded as 0 and 1 (0 meaning the word is not present in the review, 1 meaning it is). For example, I have multiple columns like "leadership", "excellent", "lacking", etc., encoded as 0/1 for each employee.
example:
empID | perf rating | team | leadership | lacking | excellent | good
A123  | low         | 1    | 0          | 1       | 1         | 0
C453  | mixed       | 1    | 1          | 0       | 0         | 0
B335  | strong      | 0    | 0          | 1       | 0         | 1
F976  | significant | 1    | 0          | 1       | 1         | 0
G257  | exceptional | 1    | 1          | 1       | 1         | 0
I need to find out which words are associated with positive performance and which words are associated with negative performance. The output dataframe should be in the form word, correlation-coefficient.
I understand that since these are both qualitative variables, we cannot use Pearson's correlation coefficient, and we could use something like Cramér's V to measure association. But I need the coefficients to be between -1 and 1 rather than 0 and 1, so that I can tell which words are positively associated with the performance rating and which are negatively associated.
If I encode the performance rating as 1, 2, 3, 4, 5 (1 being low and 5 being exceptional), and given that the presence of a word is already coded 0/1, can I still use Pearson's correlation coefficient to get positive and negative associations of words with the performance rating, or is that a blunder?
For example, my output should be something like
word       | corr-coeff
team       | -0.02
leadership | 0.712
lacking    | -0.8122
excellent  | 0.6172
good       | 0.5672
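Something like the following sketch is what I have in mind, assuming a pandas DataFrame laid out like the example table above (here I use Spearman rank correlation, so the 1-5 encoding only needs to preserve the order of the ratings; whether Pearson would also be defensible is exactly my question):

    import pandas as pd

    # Toy version of the example table above (hypothetical data)
    df = pd.DataFrame({
        "perf_rating": ["low", "mixed", "strong", "significant", "exceptional"],
        "team":        [1, 1, 0, 1, 1],
        "leadership":  [0, 1, 0, 0, 1],
        "lacking":     [1, 0, 1, 1, 1],
        "excellent":   [1, 0, 0, 1, 1],
        "good":        [0, 0, 1, 0, 0],
    })

    # Encode the ordinal rating as 1..5
    rating_map = {"low": 1, "mixed": 2, "strong": 3, "significant": 4, "exceptional": 5}
    df["rating_num"] = df["perf_rating"].map(rating_map)

    # Signed correlation (-1..1) between each 0/1 word indicator and the rating
    word_cols = ["team", "leadership", "lacking", "excellent", "good"]
    corrs = (
        df[word_cols]
        .corrwith(df["rating_num"], method="spearman")
        .rename("corr-coeff")
        .rename_axis("word")
        .reset_index()
    )
    print(corrs.sort_values("corr-coeff", ascending=False))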
There are a few issues to bear in mind here.
- You have repeated measures on empID. That is, observations on one employee are likely to be more similar to each other than to observations on other employees. This means the observations are not independent, and this needs to be accounted for.
- The research question seems to warrant a regression model.
- I would consider using a multinomial logistic model with random intercepts for employee.
- This will provide estimates of the association between each commonly occurring word and the performance ratings, while accounting for the non-independence of observations within employees.
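As far as I know there is no off-the-shelf mixed-effects multinomial model in Python's statsmodels, so here is only a hedged sketch of the closely related fixed-effects ordinal (proportional-odds) logistic regression with OrderedModel; the random intercepts for employee would need something like R's ordinal::clmm or brms. The synthetic data below is an illustrative stand-in, not anything from the question:

    import numpy as np
    import pandas as pd
    from statsmodels.miscmodels.ordinal_model import OrderedModel

    # Synthetic stand-in for the review data: 0/1 word indicators and an
    # ordered 1..5 rating loosely driven by two of the words
    rng = np.random.default_rng(0)
    n = 500
    word_cols = ["team", "leadership", "lacking", "excellent", "good"]
    X = pd.DataFrame(rng.integers(0, 2, size=(n, len(word_cols))), columns=word_cols)
    latent = 1.5 * X["leadership"] - 2.0 * X["lacking"] + rng.normal(size=n)
    rating = pd.cut(latent, bins=5,
                    labels=["low", "mixed", "strong", "significant", "exceptional"])

    # Positive coefficients mean the word is associated with higher ratings,
    # negative coefficients with lower ratings
    model = OrderedModel(rating, X, distr="logit")
    result = model.fit(method="bfgs", disp=False)
    print(result.summary())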
I have 3 categories of words that correspond to different types of psychological drives (need-for-power, need-for-achievement, and need-for-affiliation). Currently, for every document in my sample (n = 100,000), I am using a tool to count the number of words in each category and calculating a proportion score for each category by converting the raw word counts into a percentage of the total words used in the text.
                 n-power   n-achieve   n-affiliation
Document1        0.010     0.025       0.100
Document2        0.045     0.010       0.050
...              ...       ...         ...
Document100000   0.100     0.020       0.010
For each document, I would like to get a measure of distinctiveness that indicates the degree to which the content of a document on the three psychological categories differs from the average content of all documents (i.e., the prototypical document in my sample). Is there a way to do this?
Essentially what you have is a clustering problem. Currently you have a representation of each of your documents as 3 numbers; let's call it a vector (essentially you have cooked up some embeddings). To do what you want you can:
1) Calculate an average vector for the whole set: add up all the numbers in each column and divide by the number of documents.
2) Pick a metric you like that reflects how well each document vector aligns with the average. You can use Euclidean distance
sklearn.metrics.pairwise.euclidean_distances
or cosine distance
sklearn.metrics.pairwise.cosine_distances
where X is your list of document vectors and Y is a list containing the single average vector. This is a good place to start.
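A minimal sketch of steps 1) and 2) with scikit-learn, using the three example rows from the question as stand-in data:

    import numpy as np
    from sklearn.metrics.pairwise import euclidean_distances, cosine_distances

    # One row per document, columns = (n-power, n-achieve, n-affiliation) proportions
    X = np.array([
        [0.010, 0.025, 0.100],
        [0.045, 0.010, 0.050],
        [0.100, 0.020, 0.010],
    ])

    # 1) Average ("prototypical") document vector
    avg = X.mean(axis=0, keepdims=True)   # shape (1, 3)

    # 2) Distance of each document from the average = distinctiveness score
    print(euclidean_distances(X, avg).ravel())
    print(cosine_distances(X, avg).ravel())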
If I were doing this myself, I would skip the average-vector approach, since you are really dealing with a clustering problem, and use KMeans instead; see the scikit-learn guide for more details.
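And a hedged sketch of the KMeans alternative; the synthetic data and the choice of 5 clusters here are arbitrary illustrations, not anything from the question:

    import numpy as np
    from sklearn.cluster import KMeans

    # Hypothetical document vectors: n_documents x 3 proportion columns
    rng = np.random.default_rng(0)
    X = rng.dirichlet(alpha=[1.0, 1.0, 1.0], size=1000) * 0.2

    km = KMeans(n_clusters=5, n_init=10, random_state=0).fit(X)

    # One notion of "distinctiveness": distance of each document
    # from the centre of its own cluster
    centre_dist = np.linalg.norm(X - km.cluster_centers_[km.labels_], axis=1)
    print(centre_dist[:10])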
Hope this helps!
I have a dataset which contains gender as Male and Female. I have converted Male to 1 and Female to 0 using pandas, and the column now has data type int8. Now I want to normalize columns such as weight and height. What should be done with the gender column: should it be normalized or not? I am planning to use it in a linear regression.
So I think you are mixing up normalization with standardization.
Normalization:
rescales your data into a range of [0;1]
Standardization:
rescales your data to have a mean of 0 and a standard deviation of 1.
Back to your question:
The values in your gender column already range between 0 and 1, so that column is already "normalized". Your question should really be whether you can standardize it, and the answer is: yes, you could, but it doesn't really make sense. This question was already discussed here: Should you ever standardise binary variables?
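As a small illustration of both options (the column names and values here are made up; only the 0/1 gender coding comes from the question):

    import pandas as pd
    from sklearn.preprocessing import MinMaxScaler, StandardScaler

    # Hypothetical data: gender already encoded 0/1, height/weight on their own scales
    df = pd.DataFrame({
        "gender":    [1, 0, 1, 0],
        "height_cm": [180.0, 165.0, 172.0, 158.0],
        "weight_kg": [80.0, 60.0, 75.0, 55.0],
    })

    # Normalization: rescale height/weight to [0, 1]; the 0/1 gender column is left alone
    df[["height_cm", "weight_kg"]] = MinMaxScaler().fit_transform(df[["height_cm", "weight_kg"]])

    # Standardization (the alternative): mean 0, std 1
    # df[["height_cm", "weight_kg"]] = StandardScaler().fit_transform(df[["height_cm", "weight_kg"]])

    print(df)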
I have two groups, "in" and "out," and item categories that can be split up among the groups. For example, I can have item category A that is 99% "in" and 1% "out," and item B that is 98% "in" and 2% "out."
For each of these items, I actually have the counts that are in/out. For example, A could have 99 items in and 1 item out, and B could have 196 items that are in and 4 that are out.
I would like to rank these items based on the percentage that are "in," but I would also like to give some priority to items that have larger overall populations. This is because I would like to focus on items that are very relevant to the "in" group, but still have a large number of items in the "out" group that I could pursue.
Is there some kind of score that could do this?
I'd be tempted to use a probabilistic rank: the probability that an item category belongs to the "in" group given the actual counts for that category. This requires making some assumptions about the data set, including why a category may have any out-of-group items at all. You might take a look at the binomial test or the Mann-Whitney U test for a start. You might also look at some other kinds of nonparametric statistics.
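As a hedged sketch of the binomial-test idea, using the example counts from the question and assuming the pooled "in" rate as the null proportion (that null is my own assumption, not something the data dictates):

    from scipy.stats import binomtest

    # Example counts from the question: A has 99 in / 1 out, B has 196 in / 4 out
    counts = {"A": (99, 1), "B": (196, 4)}

    # Assumed null: the pooled "in" rate across all categories
    total_in = sum(n_in for n_in, _ in counts.values())
    total = sum(n_in + n_out for n_in, n_out in counts.values())
    p0 = total_in / total

    # One-sided test: is this category more "in" than the pooled rate suggests?
    for item, (n_in, n_out) in counts.items():
        res = binomtest(k=n_in, n=n_in + n_out, p=p0, alternative="greater")
        print(item, round(res.pvalue, 4))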
I ultimately ended up using Bayesian averaging, which was recommended in this post. The technique is briefly described in this Wikipedia article and more thoroughly described in this post by Evan Miller and this post by Paul Masurel.
In Bayesian averaging, "prior values" are used to pull the numerator and denominator towards their expected values. Essentially, the expected numerator and expected denominator are added to the actual numerator and denominator. When the numerator and denominator are small, the prior values have a larger impact because they make up a larger proportion of the new numerator/denominator. As the numerator and denominator grow, the Bayesian average approaches the actual average, reflecting the increased confidence.
In my case, the prior value for the average was fairly low, which biased averages with small denominators downward.
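For reference, a minimal sketch of the score described above; the prior weight C and prior mean m are choices you have to make, and the values below are just placeholders:

    import pandas as pd

    # Example counts from the question
    items = pd.DataFrame({
        "item": ["A", "B"],
        "in_count": [99, 196],
        "out_count": [1, 4],
    })
    items["total"] = items["in_count"] + items["out_count"]

    # Bayesian average: add C "pseudo-observations" with prior mean m to each ratio,
    # so small categories are pulled towards m and large ones towards their own rate
    C, m = 50, 0.5
    items["bayes_avg"] = (C * m + items["in_count"]) / (C + items["total"])

    print(items.sort_values("bayes_avg", ascending=False))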
I have been playing around with a Kaplan-Meier survival analysis. I have 3 mutually exclusive conditions. Say:
condition 1 is 'I am not able to work (and not working)'
condition 2 is 'I am able to work and I am working'
condition 3 is 'I am able to work but I am not working'
I am trying to get an overall likelihood of being in either C1, C2 or C3.
I have done 3 separate survival analyses (one for each condition) and added up the cumulative proportions at the same time point, but the total is slightly greater than 1 (between 1.02 and 1.06 to be exact). I was wondering how to explain this overestimation. Is it something in my logic, or in the way the censored data are estimated (or something else)?
Thanks,
I am trying to create a spreadsheet that can find the most likely probability of each student having scored a specific grade on a test.
Only one student can score a given grade, and each grade is scored by only one student.
I have limited information about each student.
There are 5 students (1,2,3,4,5)
and the grades possible are only (100,90,80,70,60)
In the spreadsheet, a 1 denotes that the student DIDN'T score that grade.
Does anyone know how to build a simulation so that I can find the most likely probability of which student scored which grade?
Link:
https://docs.google.com/spreadsheets/d/1a8uUIRzUKsY3DolTM1A0ISqMd-42WCUCiDsxmUT5TKI/edit?usp=sharing
Based on your response in comments, each student has an equal likelihood of getting each grade. No simulation is necessary.
If you want to simulate it anyway, don't use Excel*. Create a vector of students, and pair it with a shuffled vector of the grades. Lather, rinse, repeat as many times as you want to see that the student-to-grade matching is uniformly distributed.
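A hedged sketch of that shuffle-and-repeat simulation in Python rather than Excel:

    import random
    from collections import Counter

    students = [1, 2, 3, 4, 5]
    grades = [100, 90, 80, 70, 60]

    counts = Counter()
    n_trials = 100_000
    for _ in range(n_trials):
        shuffled = random.sample(grades, k=len(grades))  # one random assignment
        for s, g in zip(students, shuffled):
            counts[(s, g)] += 1

    # Each (student, grade) pair should occur about 1/5 of the time
    for (s, g), c in sorted(counts.items()):
        print(f"student {s}, grade {g}: {c / n_trials:.3f}")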
* - To get an idea of how bad Excel can be for random variate generation, enable the Analysis ToolPak, go to "Data -> Data Analysis" on the ribbon, and select "Random Number Generation". Fill in the dialog to request 10 variables and 2000 random numbers, choose a "Normal" distribution, leave the mean and std dev at 0 and 1, and enter a "Random Seed" value of 123. You will find that the resulting table contains 3 instances of the value "-9.35764". Values that extreme should occur about once per twenty thousand years if you generate a billion values per second. Getting three of them is so extreme that it should happen once per 10^30 times the current estimated age of the universe. Conclude that a) it's your lucky day, or b) Excel sucks at random numbers, and despite being informed about this as far back as 1998, Microsoft hasn't bothered to fix it.