Statistics - Complex survey analysis

Consider a random selection of students from a population of 10 students. Which of the following is correct?
The probability of selecting the 2nd student is 1/10 if sampling without replacement
The probability of selecting the 2nd student is 1/10 if sampling with replacement
The probability of selecting the 2nd student is 1/9 if sampling without replacement
The probability of selecting the 2nd student is 1/9 if sampling with replacement
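Whichever option is intended, a quick Monte Carlo sketch in Python makes the two sampling schemes concrete; it assumes "selecting the 2nd student" means picking one specific student and looking at the second draw.

import random

population = list(range(1, 11))   # 10 students, labelled 1..10
target = 2                        # "the 2nd student"
trials = 200_000

# With replacement: every draw is an independent pick from all 10 students.
with_repl = sum(random.choice(population) == target for _ in range(trials)) / trials

# Without replacement: draw two distinct students and look at the second draw.
draws = [random.sample(population, 2) for _ in range(trials)]
uncond = sum(d[1] == target for d in draws) / len(draws)
survivors = [d for d in draws if d[0] != target]
cond = sum(d[1] == target for d in survivors) / len(survivors)

print(with_repl)  # ~ 1/10
print(uncond)     # ~ 1/10: unconditionally, the second draw is still uniform over all 10
print(cond)       # ~ 1/9: given the target was not taken in the first draw, 9 students remain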

Related

LSTM Future Prediction

Why does the LSTM prediction give what looks like a random number, and how do I make a decision from it?
The scenario is: I want to predict future power cuts in households. I have the past 3 months of hourly energy-meter readings. I trained an LSTM model, but it gives seemingly random numbers for the next 24 hours.
How do I conclude, for each hour, whether the power is On or Off?
Sample output shown here:
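One common way to turn a continuous forecast into hourly On/Off decisions is to threshold it. Below is a minimal sketch that assumes the model outputs one value per hour in [0, 1] and uses a hypothetical cut-off of 0.5; the forecast array is a random stand-in for the real LSTM output.

import numpy as np

pred = np.random.rand(24)   # stand-in for the model's 24-hour forecast

threshold = 0.5             # assumed cut-off; in practice, tune it on held-out data
power_on = pred >= threshold

for hour, is_on in enumerate(power_on):
    print(f"hour {hour:02d}: {'On' if is_on else 'Off'}")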

How to measure how distinct a document is based on predefined linguistic categories?

I have 3 categories of words that correspond to different types of psychological drives (need-for-power, need-for-achievement, and need-for-affiliation). Currently, for every document in my sample (n=100,000), I am using a tool to count the number of words in each category, and calculating a proportion score for each category by converting the raw word counts into a percentage based on total words used in the text.
                  n-power   n-achieve   n-affiliation
Document1           0.010       0.025           0.100
Document2           0.045       0.010           0.050
...                   ...         ...             ...
Document100000      0.100       0.020           0.010
For each document, I would like to get a measure of distinctiveness that indicates the degree to which the content of a document on the three psychological categories differs from the average content of all documents (i.e., the prototypical document in my sample). Is there a way to do this?
Essentially what you have is a clustering problem. You have already represented each document with 3 numbers; let's call that a vector (essentially you have cooked up some embeddings). To do what you want, you can:
1) Calculate an average vector for the whole set. Basically, add up all the numbers in each column and divide by the number of documents.
2) Pick a metric that reflects how far each document vector is from that average. You can use Euclidean distance,
sklearn.metrics.pairwise.euclidean_distances
or cosine distance,
sklearn.metrics.pairwise.cosine_distances
where X is your list of document vectors and Y contains the single average vector. This is a good place to start; a sketch follows below.
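A minimal sketch with a few toy rows from the table above (the real X would hold all 100,000 documents):

import numpy as np
from sklearn.metrics.pairwise import euclidean_distances, cosine_distances

# One row per document, columns = (n-power, n-achieve, n-affiliation).
X = np.array([
    [0.010, 0.025, 0.100],
    [0.045, 0.010, 0.050],
    [0.100, 0.020, 0.010],
])

mean_vector = X.mean(axis=0, keepdims=True)   # the "prototypical" document

dist_euclid = euclidean_distances(X, mean_vector).ravel()
dist_cosine = cosine_distances(X, mean_vector).ravel()

print(dist_euclid)   # larger value = more distinct from the average document
print(dist_cosine)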
If I were doing it, I would skip the average-vector approach, since you are really dealing with a clustering problem, and use KMeans instead; see the scikit-learn clustering guide for more.
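A minimal KMeans sketch on the same kind of matrix; the number of clusters here is a hypothetical choice and would normally be picked with the elbow method or a silhouette score.

import numpy as np
from sklearn.cluster import KMeans

X = np.array([
    [0.010, 0.025, 0.100],
    [0.045, 0.010, 0.050],
    [0.100, 0.020, 0.010],
])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

print(kmeans.labels_)                    # cluster assignment per document
print(kmeans.transform(X).min(axis=1))   # distance from each document to its own centroid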
Hope this helps!

Train LSTM model on keras with several multivariate time series

I have a time series dataset of customer behavior. For each month, I have one row per customer which includes a set of features (for example, the amount of spending, number of visits, etc.) and a target value (a binary value: does the customer buy product "A" or not?).
My problem is: I want to train an LSTM model to predict the target value for the next month (does the customer buy product "A" next month?). Since I have multiple time series (one per customer), I have more than one sample per timestamp (for example, for January 2010, I have more than 1000 samples, and so on). How do I train the model? Do I go epoch by epoch and, for each epoch, fit the model one by one on all customers? Is there another side to this I'm missing?
Dataset features:
Number of customers: 1500;
Length of time series: 120;
Number of features per customer: 80 (before adding time-shifted features);
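One common framing, sketched below under the assumption that each customer is treated as one sample of shape (timesteps, features): stack all customers into a single 3D array and let model.fit iterate over them, so no manual per-customer loop is needed. The arrays are random stand-ins for the real data and the layer sizes are arbitrary.

import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense

# Shapes taken from the question; random stand-ins for the real data.
n_customers, n_timesteps, n_features = 1500, 120, 80
X = np.random.rand(n_customers, n_timesteps, n_features)   # one sequence per customer
y = np.random.randint(0, 2, size=(n_customers,))           # buys product "A" next month?

model = Sequential([
    LSTM(32, input_shape=(n_timesteps, n_features)),
    Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

# Keras loops over the customer dimension (samples) within each epoch.
model.fit(X, y, epochs=10, batch_size=64, validation_split=0.2)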

Fleiss-kappa score for interannotator agreement

In my dataset I have a set of categories, where for every category I have a set of 150 examples. Each example has been annotated as true/false by 5 human raters. I am computing the inter-annotator agreement using the Fleiss-kappa score:
1) for the entire dataset
2) for each category in particular
However, the results I obtained show that the Fleiss-kappa score for the entire dataset does not equal the average of the Fleiss-kappa score for each category. In my computation I am using a standard built-in package to compute the scores. Could this be due to a bug in my matrix computation, or are the scores not supposed to be equal? Thanks!
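For reference, a minimal sketch of the two computations with statsmodels (assuming that is the "standard built-in package" in question); the ratings below are random stand-ins for the real annotations.

import numpy as np
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

rng = np.random.default_rng(0)
# Two hypothetical categories, each with 150 examples rated true/false (1/0) by 5 raters.
ratings_by_category = {
    "cat_A": rng.integers(0, 2, size=(150, 5)),
    "cat_B": rng.integers(0, 2, size=(150, 5)),
}

# Fleiss kappa per category.
for name, ratings in ratings_by_category.items():
    table, _ = aggregate_raters(ratings)   # counts per example x rating value
    print(name, fleiss_kappa(table))

# Fleiss kappa for the pooled dataset; in general this need not match the average of the
# per-category scores, since agreement is recomputed from the pooled counts.
all_ratings = np.vstack(list(ratings_by_category.values()))
table_all, _ = aggregate_raters(all_ratings)
print("overall", fleiss_kappa(table_all))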

Simulation in Excel using probability

I am trying to create a spreadsheet that can find the most likely probability that a student scored a specific grade on a test.
Only one student can score each grade, and only one grade can be assigned to each student.
I have limited information about each student.
There are 5 students (1,2,3,4,5)
and the grades possible are only (100,90,80,70,60)
In the spreadsheet a 1 denotes that the student DIDN'T score that grade.
Does anyone know how to build a simulation so that I can find the most likely assignment of which student scored which grade?
Link:
https://docs.google.com/spreadsheets/d/1a8uUIRzUKsY3DolTM1A0ISqMd-42WCUCiDsxmUT5TKI/edit?usp=sharing
Based on your response in comments, each student has an equal likelihood of getting each grade. No simulation is necessary.
If you want to simulate it anyway, don't use Excel*. Create a vector of students, and pair it with a shuffled vector of the grades. Lather, rinse, repeat as many times as you want to see that the student-to-grade matching is uniformly distributed.
* - To get an idea of how bad Excel can be for random variate generation, enable the Analysis ToolPak, go to "Data -> Data Analysis" on the ribbon, and select "Random Number Generation". Fill in that you want 10 variables and 2000 random numbers, choose a "Normal" distribution, leave the mean and standard deviation at 0 and 1, and enter a "Random Seed" value of 123. You will find that the resulting table contains 3 instances of the value "-9.35764". Values that extreme should occur about once per twenty thousand years if you generate a billion of them per second. Getting three of them is so extreme that it should happen once per 10^30 times the current estimated age of the universe. Conclude that a) it's your lucky day, or b) Excel sucks at random numbers, and despite being informed about this as far back as 1998, Microsoft hasn't bothered to fix it.
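A minimal sketch of that shuffle-and-tally simulation, done in Python rather than Excel; the students and grades are taken from the question and the trial count is arbitrary.

import random
from collections import Counter

students = [1, 2, 3, 4, 5]
grades = [100, 90, 80, 70, 60]
trials = 100_000

counts = Counter()
for _ in range(trials):
    shuffled = grades[:]
    random.shuffle(shuffled)
    for student, grade in zip(students, shuffled):
        counts[(student, grade)] += 1

# With no other constraints, each (student, grade) pair shows up about 1/5 of the time,
# i.e. the student-to-grade matching is uniformly distributed.
for pair, n in sorted(counts.items()):
    print(pair, n / trials)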
