I did a diary study in which people had to answer twice a day for 5 days.
They answered once in the morning and once in the afternoon, and the variables used in the two assessments were different (for instance, in the morning I asked about sleeping problems from the night before; in the afternoon I did not ask about sleeping problems).
I merged the morning and afternoon databases.
I am trying to calculate the reliability for each variable (calculating RkF), but I always get the same error.
For example, when trying to calculate Sleeping Problems:
I put:
base_w_final = the dataset
code = the code participants had to create before completing each assessment
register = the day (day 1, day 2, day 3, day 4 and day 5)
items = columns 20 to 24, which correspond to the sleeping-problems items
library(psych)
mlr(base_w_final, grp = "code", Time = "register", items = c(20:24))
At least one item had no variance when finding alpha for subject = 1. Proceed with caution
At least one item had no variance when finding alpha for subject = 2. Proceed with caution
At least one item had no variance when finding alpha for subject = 3. Proceed with caution
[... the same warning is repeated for most of the remaining subjects, up to subject = 135 ...]
Error in dimnames(x) <- dn :
length of 'dimnames' [1] not equal to array extent
In addition: There were 50 or more warnings (use warnings() to see the first 50)
I think this is because the sleeping-problems items are NA for the afternoon assessments. But I tried separating the datasets (one for the morning assessments and one for the afternoon assessments) and calculating the reliability on each, and I still get errors.
If I keep the same database, is there any way to tell R to ignore the missing values, or do you think the error has nothing to do with that?
Thank you so much for all the help!
Introduction
I have written code to give me a set of numbers in '36 by q' format (1 <= q <= 36), subject to the following conditions:
Each row must use numbers from 1 to 36.
No number may repeat itself within a column.
Method
The first row is generated randomly. Each number in the following rows is checked against the above conditions. If a number fails to satisfy one of the given conditions, it doesn't get picked again for that specific place in that specific row. If the code runs out of acceptable values, it starts over again.
Problem
Low q values are fast (say q = 15, which takes less than a second to compute), but the main objective is q = 36. It has been running for more than 24 hours for q = 36 on my PC.
Questions
Can I predict the time required by it using the data I have from lower q values? How?
Is there any better algorithm to perform this in less time?
How can I calculate the average number of cycles it requires? (using combinatorics or otherwise).
Can I predict the time required by it using the data I have from lower q values? How?
Usually, you should be able to determine the running time of your algorithm in terms of input. Refer to big O notation.
If I understood your question correctly, you shouldn't need to spend hours computing a 36x36 matrix satisfying your conditions. Most probably you are stuck in an infinite loop or something similar. It would be clearer if you could share a code snippet.
Is there any better algorithm to perform this in less time?
Well, I tried to do what you described, and it works in O(q) (assuming that the number of rows is constant).
import random

def rotate(arr):
    # Move the last element to the front
    return arr[-1:] + arr[:-1]

y = set(range(1, 37))  # numbers still available for starting new rows
n = 36                 # number of rows
q = 36                 # number of columns
res = []
i = 0
while i < n:
    # Pick q distinct numbers from the remaining pool for a fresh row
    x = []
    for j in range(q):
        if y:
            el = random.choice(list(y))
            y.remove(el)
            x.append(el)
    res.append(x)
    # Fill the next q - 1 rows with rotations of that row
    for j in range(q - 1):
        x = rotate(x)
        res.append(x)
        i += 1
    i += 1
Basically, I choose random numbers from the set {1..36} for a new row, then rotate that row q - 1 times and assign the rotated rows to the next q - 1 rows.
This guarantees both conditions you have mentioned.
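As a quick sanity check (assuming the block above has just been run with q = 36), you can verify both conditions on res directly:

# Every row uses distinct numbers from 1..36, and no number repeats within a column
assert all(len(set(row)) == len(row) and set(row) <= set(range(1, 37)) for row in res)
assert all(len(set(col)) == len(col) for col in zip(*res))
print(len(res), "rows pass both checks")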
How can I calculate the average number of cycles it requires? (Using combinatorics or otherwise.)
If you cannot calculate the computation time in terms of the input (the code is too complex), then fitting a curve seems right.
Or you could create an ML model with the iterations as data and the time for each iteration as the label and perform linear regression. But that seems like overkill in your example.
Graph q vs time
Fit a curve,
Extrapolate to q = 36.
You might also want to graph q vs log(time), as that may give a curve that is easier to fit.
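Here is a minimal sketch of that extrapolation, assuming you have already timed a few small q runs (the numbers below are made up):

import numpy as np

# Hypothetical (q, runtime in seconds) measurements from small runs
qs = np.array([5, 8, 10, 12, 15])
times = np.array([0.01, 0.05, 0.2, 0.9, 4.0])

# Fit a line to q vs log(time), i.e. assume roughly exponential growth
slope, intercept = np.polyfit(qs, np.log(times), 1)

# Extrapolate to q = 36
predicted = np.exp(slope * 36 + intercept)
print(f"predicted runtime for q = 36: {predicted:.0f} seconds")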
I am looking at the vehicle routing problem which minimizes the cost of "the slowest truck" in a fleet.
So now the objective function should involve two quantities:
the sum of all transitions of all vehicles (total distance), and
the cost of the most expensive route
How are these values combined? I am assuming that the global span coefficient
distance_dimension.SetGlobalSpanCostCoefficient(100)
is involved? Is that the coefficient of a weighted sum
cost = w*A + (100-w)*B
where A is the cost of the slowest truck and B is the total distance of all trucks?
No, it's simply: cost = B + A
with B = the sum of all edge costs in the routes (usually set by using routing.SetArcCostEvaluatorOfAllVehicles(arc_cost_callback))
and A = w * (max{end} - min{start})
Note: B is needed to help the solver find a good first solution (otherwise a strategy like CHEAPEST_PATH behaves strangely, since there is no cost on the edges for choosing the cheapest one), while A helps to "distribute" jobs by minimizing the max cumul var. But it is still not a real dispersion incentive.
e.g. supposing a dimension with cumul_start = 0 and 4 routes, costs of 0,0,6,6 are as good as 2,2,2,6 (or 6,6,6,6, but B is higher there).
i.e. max(cumul_end) == 6 in both cases.
I added a section on GlobalSpan here in the doc.
ps: take a look at https://github.com/google/or-tools/issues/685#issuecomment-388503464
pps: in the doc example, try changing maximum_distance = 3000 to 1800 or 3500, if I remember correctly ;)
ppps: Note that you can have several GlobalSpan coefficients on several dimensions; the objective is then just the sum of all these costs multiplied by their respective coefficients...
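For reference, here is a minimal sketch of how B and A are typically wired up with the OR-Tools Python API (the distance_matrix, num_vehicles and depot values below are made-up placeholders):

from ortools.constraint_solver import pywrapcp

# Hypothetical problem data
distance_matrix = [[0, 2, 3], [2, 0, 4], [3, 4, 0]]
num_vehicles = 2
depot = 0

manager = pywrapcp.RoutingIndexManager(len(distance_matrix), num_vehicles, depot)
routing = pywrapcp.RoutingModel(manager)

def distance_callback(from_index, to_index):
    # Convert routing indices back to node indices before the matrix lookup
    return distance_matrix[manager.IndexToNode(from_index)][manager.IndexToNode(to_index)]

transit_callback_index = routing.RegisterTransitCallback(distance_callback)

# B: the sum of arc costs over all routes
routing.SetArcCostEvaluatorOfAllVehicles(transit_callback_index)

# A: w * (max cumul end - min cumul start), added via the distance dimension
routing.AddDimension(transit_callback_index, 0, 3000, True, 'Distance')
distance_dimension = routing.GetDimensionOrDie('Distance')
distance_dimension.SetGlobalSpanCostCoefficient(100)  # w = 100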
I have a question about my results of the Wilcoxon signed rank test:
My data come from a trial with 2 paired groups in which a treatment was used. The results were scored in %. Each group consists of 131 people.
When I run the test in R, I get the following result:
wilcox.test(no.treatment, with.treatment, paired=T)
# Wilcoxon signed rank test with continuity correction
# data: no.treatment and with.treatment V = 3832, p-value = 0.7958
# alternative hypothesis: true location shift is not equal to 0
I am wondering what the V value means. I read somewhere that it has something to do with the number of positive scores, but I am wondering whether it can tell me anything about the data and its interpretation.
I'll give a little bit of background before answering your question.
The Wilcoxon signed rank test compares two values measured on the same N people (here 131); for example, blood values measured for 131 people at two time points. The purpose of the test is to see whether the values have changed.
The V statistic you are getting does not have a direct interpretation. It is based on the pairwise differences between the two measurements for each individual, and it is a statistic that follows a certain probability distribution under the null hypothesis. Intuitively speaking, the further V is from the value you would expect when there is no difference, the larger the difference between the two sets of measurements you sampled.
As always in hypothesis testing, you (well, the wilcox.test function) calculate the probability of observing a value of V at least as extreme as 3832 when the groups are actually the same:
prob('observing a V at least as extreme as 3832, when the groups are actually the same')
If there were really no difference between the two groups, V would be close to its expected value of n(n+1)/4. Whether the V you see is 'close enough' to that depends on the probability distribution, which is not straightforward for this statistic; but luckily that doesn't matter, since wilcox.test knows the distribution and calculates the probability for you (0.7958).
In short
Your groups do not significantly differ and V doesn't have a clear interpretation.
The V statistic produced by the function wilcox.test() can be calculated in R as follows:
# Create random data between -0.5 and 0.5
da_ta <- runif(1e3, min=-0.5, max=0.5)
# Perform Wilcoxon test using function wilcox.test()
wilcox.test(da_ta)
# Calculate the V statistic produced by wilcox.test()
sum(rank(abs(da_ta))[da_ta > 0])
The user MrFlick provided the above answer in reply to this question:
How to get same results of Wilcoxon sign rank test in R and SAS.
The Wilcoxon W statistic is not the same as the V statistic, and can be calculated in R as follows:
# Calculate the Wilcoxon W statistic
sum(sign(da_ta) * rank(abs(da_ta)))
The above statistic can be compared with the Wilcoxon probability distribution to obtain the p-value. There is no simple formula for the Wilcoxon distribution, but it can be simulated using Monte Carlo simulation.
The value of V is not the number of positive scores but the sum of the ranks of the positive differences.
There is likewise a sum of the ranks of the negative differences, which this test does not report. A brief script for calculating both sums is given in the following example:
a <- c(214, 159, 169, 202, 103, 119, 200, 109, 132, 142, 194, 104, 219, 119, 234)
b <- c(159, 135, 141, 101, 102, 168, 62, 167, 174, 159, 66, 118, 181, 171, 112)
diff <- a - b #vector of differences
diff <- diff[ diff != 0 ] #drop all differences equal to zero
diff.rank <- rank(abs(diff)) #rank the absolute differences
diff.rank.sign <- diff.rank * sign(diff) #reattach the signs of the differences to the ranks
ranks.pos <- sum(diff.rank.sign[diff.rank.sign > 0]) #sum of the ranks of the positive differences
ranks.neg <- -sum(diff.rank.sign[diff.rank.sign < 0]) #sum of the ranks of the negative differences
ranks.pos #this is the V of the Wilcoxon signed rank test
[1] 80
ranks.neg
[1] 40
CREDITS: https://www.r-bloggers.com/wilcoxon-signed-rank-test/
(They also provide a nice context for it.)
You can also compare both of these numbers to their average (here, 60), which is the expected value for each side: positive ranks summing to 60 and negative ranks summing to 60 would mean complete equivalence of the two sides. Can positive ranks summing to 80 and negative ranks summing to 40 still be considered equivalent? That is, can we attribute this difference of 20 to chance, or is it large enough to reject the hypothesis of no difference?
So, as they explain, the critical interval for this case is [25, 95]: checking a table of critical values for the Wilcoxon signed rank test, the critical value for this example is 25 (15 pairs at 5% on a two-tailed test; and 120 - 25 = 95). Since the observed sums, 40 and 80, fall inside [25, 95], we cannot discard the possibility that the differences are purely due to random sampling. (Consistently, the p-value is above the alpha.)
Comparing the sum of positive scores to the sum of negative scores helps to judge the significance of the difference and enriches the analysis. The positive ranks are also an input to the calculation of the test's p-value, hence the interest in them.
But extracting meaning from the reported sum of positive ranks (V) alone is not straightforward. At the very least, I would also check the sum of the negative ranks to get a more complete picture of what is happening (along, of course, with general information such as the sample size and the p-value).
I, too, was confused about this seemingly mysterious "V" statistic. I realize there are already some helpful answers here, but I did not really understand them when I first read them. So here I am explaining it again in the way I finally understood it. Hopefully it helps others who are also still confused.
The V statistic is the sum of the ranks assigned to the differences with positive signs. That is, when you run a Wilcoxon signed rank test, it calculates a sum of negative ranks (W-) and a sum of positive ranks (W+). The test statistic (W) is usually the minimum of (W-) and (W+); the V statistic, however, is simply (W+).
To understand why this matters: if the null hypothesis is true, (W+) and (W-) should be similar. This is because, for a given number of samples (n), (W+) and (W-) have a fixed combined value, (W+) + (W-) = n(n+1)/2. If this total is divided roughly evenly, then there is not much of a difference between the paired samples and we fail to reject the null. If there is a large difference between (W+) and (W-), then there is a large difference between the paired samples, and we have evidence for the alternative hypothesis. The size of the difference and its significance are judged against the critical value chart for W.
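As a quick numeric illustration of that identity, here is a small sketch in Python with made-up paired data (the same check is easy to do in R):

import numpy as np
from scipy.stats import rankdata

# Hypothetical paired measurements
before = np.array([125, 115, 130, 140, 140, 115, 140, 125, 140, 135])
after = np.array([110, 122, 125, 120, 140, 124, 123, 137, 135, 145])

d = after - before
d = d[d != 0]                    # drop zero differences, as the test does
r = rankdata(np.abs(d))          # rank the absolute differences
w_plus = r[d > 0].sum()          # this is the V reported by R's wilcox.test()
w_minus = r[d < 0].sum()
n = len(d)
print(w_plus, w_minus, n * (n + 1) / 2)  # w_plus + w_minus equals n*(n+1)/2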
Here are particularly helpful sites to check out if the concept is still not 100%:
1.) https://mathcracker.com/wilcoxon-signed-ranks
2.) https://sphweb.bumc.bu.edu/otlt/mph-modules/bs/bs704_nonparametric/BS704_Nonparametric6.html
3.) https://www.youtube.com/watch?v=TqCg2tb4wJ0
TLDR; the V-statistic reported by R is the same as the W-statistic in cases where (W+) is the smaller of (W+) or (W-).
What is the actual formula to compute sentiment using a sentiment-rated lexicon? The lexicon I am using contains ratings in the range -5 to 5. I want to compute sentiment for individual sentences. Should I compute the average of all sentiment-rated words in a sentence, or just sum them up?
There are several methods for computing an index from scored sentiment components of sentences. Each is based on comparing positive and negative words, and each has advantages and disadvantages.
For your scale, a measure of the central tendency of the scored words would be a fair measure, where the denominator is the number of scored words. This is a form of the "relative proportional difference" measure described below. You would probably not want to divide the total of the sentiment words' scores by all words, since this makes each sentence's measure strongly affected by non-sentiment terms.
If you do not believe that the 11-point rating you describe is accurate, you could simply classify each word as positive or negative depending on its sign. Then you could apply the following methods, where each P and N refer to the counts of positive and negative coded sentiment words, and O is the count of all other words (so that the total number of words is P + N + O).
Absolute Proportional Difference. Bounds: [0,1]
Sentiment = (P − N) / (P + N + O)
Disadvantage: A sentence's score is affected by non-sentiment-related content.
Relative Proportional Difference. Bounds: [-1, 1]
Sentiment = (P − N) / (P + N)
Disadvantage: A sentence's score may tend to cluster very strongly near the scale endpoints (because sentences may contain primarily or exclusively positive, or primarily or exclusively negative, content).
Logit scale. Bounds: [-infinity, +infinity]
Sentiment = log(P + 0.5) - log(N + 0.5)
This tends to have the smoothest properties and is symmetric around zero. The 0.5 is a smoother to prevent log(0).
For details, please see William Lowe, Kenneth Benoit, Slava Mikhaylov, and Michael Laver (2011), "Scaling Policy Preferences From Coded Political Texts," Legislative Studies Quarterly 26(1, Feb): 123-155, where we compare their properties for measuring right-left ideology; everything we discuss also applies to positive-negative sentiment.
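As an illustration, here is a small sketch of the three indices in code (the helper name sentiment_indices and the example counts are mine, not from the paper):

import math

def sentiment_indices(p, n, o):
    """Sentence-level indices from counts of positive (p), negative (n),
    and other (o) words, as described above."""
    absolute_pd = (p - n) / (p + n + o)                 # absolute proportional difference
    relative_pd = (p - n) / (p + n) if p + n else 0.0   # relative proportional difference
    logit = math.log(p + 0.5) - math.log(n + 0.5)       # logit scale
    return absolute_pd, relative_pd, logit

# Example: a sentence with 3 positive, 1 negative and 6 other words
print(sentiment_indices(3, 1, 6))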
You can use R for sentiment computation; here is a link you can refer to:
https://sites.google.com/site/miningtwitter/questions/sentiment/analysis
I participated in Code Jam. I successfully solved the small input of The Repeater challenge, but I can't figure out an approach for multiple strings.
Can anyone explain an algorithm that works for multiple strings? For 2 strings (the small input) I compare the strings character by character and perform operations to make them equal. However, this approach would time out on the large input.
Can someone explain the algorithm they used? I can see other users' solutions but can't figure out what they have done.
I can tell you my solution which worked fine for both small and large inputs.
First, we have to see whether a solution exists. You do that by bringing all strings to their "simplest" form. If any of them does not match, then there is no solution.
e.g.
aaabbbc => abc
abbbbbcc => abc
abbcca => abca
If only the first two were given, then a solution would be possible. As soon as the third is thrown into the mix, it becomes impossible. The algorithm for the "simplification" is to scan the string and collapse any run of repeated characters. As soon as a string's simplified form does not equal that of the rest of the batch, bail out.
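Here is a small sketch of that simplification check in Python, just to illustrate the idea (the helper name simplify is mine):

def simplify(s):
    """Collapse runs of identical characters: 'aaabbbc' -> 'abc'."""
    out = []
    for ch in s:
        if not out or out[-1] != ch:
            out.append(ch)
    return ''.join(out)

strings = ['aaabbbc', 'abbbbbcc', 'abbcca']
forms = {simplify(s) for s in strings}
print('solvable' if len(forms) == 1 else 'Fegla Won')  # 'Fegla Won' for this batch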
As for the actual solution to the problem, I simply converted the strings to a [letter, repeat] format. So for example
qwerty => 1q,1w,1e,1r,1t,1y
qqqwweeertttyy => 3q,2w,3e,1r,3t,2y
(mind you the outputs are internal structures, not actual strings)
Now imagine you have 100 strings, you have already passed the test that a solution exists, and you have all the strings in the [letter, repeat] representation. Go through every letter and find the smallest total 'difference' in repetitions needed to bring them all to the same number. So for example
1a, 1a, 1a => 0 diff
1a, 2a, 2a => 1 diff
1a, 3a, 10a => 9 diff (to bring everything to 3)
The way to do this (I'm pretty sure there is a more efficient way) is to go from the min number to the max number and calculate the sum of all diffs for each candidate. You are not guaranteed that the best target will be one of the numbers in the set. For the last example, you would calculate the diff to bring everything to 1 (0, 2, 9 = 11), then to 2 (1, 1, 8 = 10), then to 3 (2, 0, 7 = 9), and so on up to 10, and choose the min. Strings are limited to 1000 characters, so this is an easy calculation. On my moderate laptop, the results were instant.
Repeat this for every letter of the strings, sum everything up, and that is your solution.
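Here is a short sketch of that per-letter minimisation in Python (the helper name min_total_diff is mine; it simply tries every target count between the min and the max):

def min_total_diff(counts):
    """Minimum number of single-character insert/delete operations needed to
    make all repeat counts of one letter equal."""
    return min(sum(abs(c - target) for c in counts)
               for target in range(min(counts), max(counts) + 1))

print(min_total_diff([1, 1, 1]))   # 0
print(min_total_diff([1, 2, 2]))   # 1
print(min_total_diff([1, 3, 10]))  # 9 (bring everything to 3)

The final answer is then the sum of min_total_diff over every position of the simplified form.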
This answer gives an example to explain why finding the median number of repeats produces the lowest cost.
Suppose we have values:
1 20 30 40 100
And we are trying to find the value which has shortest total distance to all these values.
We might guess the best answer is 50, with cost |50-1|+|50-20|+|50-30|+|50-40|+|50-100| = 159.
Split this into two sums, left and right, where left is the cost of all numbers to the left of our target, and right is the cost of all numbers to the right.
left = |50-1|+|50-20|+|50-30|+|50-40| = 50-1+50-20+50-30+50-40 = 109
right = |50-100| = 100-50 = 50
cost = left + right = 159
Now consider changing the value by x. Provided x is small enough that the same numbers remain on the left, the values will change to:
left(x) = |50+x-1|+|50+x-20|+|50+x-30|+|50+x-40| = 109 + 4x
right(x) = |50+x-100| = 50 - x
cost(x) = left(x)+right(x) = 159+3x
So if we set x=-1 we will decrease our cost by 3, therefore the best answer is not 50.
The amount our cost changes when we move is given by the difference between the number of values to our left (4) and the number of values to our right (1).
Therefore, as long as these are different we can always decrease our cost by moving towards the median.
Therefore the median gives the lowest cost.
If there is an even number of points, such as 1, 100, then all numbers between the two middle points give identical costs, so any of these values can be chosen.
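As a tiny check of this on the example values (the candidate targets tried below are arbitrary):

vals = [1, 20, 30, 40, 100]
costs = {t: sum(abs(v - t) for v in vals) for t in (20, 30, 40, 50)}
print(costs)  # {20: 129, 30: 119, 40: 129, 50: 159}; the median, 30, is cheapest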
Since Thanasis already explained the solution, I'm providing my source code here, in Ruby. It's really short (only about 400 bytes) and follows his algorithm exactly.
def solve(strs)
  form = strs.first.squeeze
  strs.map { |str|
    return 'Fegla Won' if form != str.squeeze
    str.chars.chunk { |c| c }.map { |arr|
      arr.last.size
    }
  }.transpose.map { |row|
    Range.new(*row.minmax).map { |n|
      row.map { |r|
        (r - n).abs
      }.reduce :+
    }.min
  }.reduce :+
end

gets.to_i.times { |i|
  result = solve gets.to_i.times.map { gets.chomp }
  puts "Case ##{i+1}: #{result}"
}
It uses the squeeze method on strings, which removes consecutive duplicate characters. This way, you just compare every squeezed line to the reference (the variable form). If there is an inconsistency, you simply return 'Fegla Won'.
Next, the chunk method on the character array groups all consecutive identical characters, so you can count them easily.