Related
So I am trying to solve mean, median and mode challenge on Hackerrank. I defined 3 functions to calculate mean, median and mode for a given array with length between 10 and 2500, inclusive.
I get an error with an array of 2500 integers, not sure why. I looked into python documentation and found no mentions of max length for lists. I know I can use statistics module but trying the hard way and being stubborn I guess. Any help and criticism is appreciated regarding my code. Please be honest and brutal if need be. Thanks
N = int(input())
var_list = [int(x) for x in input().split()]
def mean(sample_list):
mean = sum(sample_list)/N
print(mean)
return
def median(sample_list):
sorted_list = sorted(sample_list)
if N%2 != 0:
median = sorted_list[(N//2)]
else:
median = (sorted_list[N//2] + sorted_list[(N//2)-1])/2
print(median)
return
def mode(sample_list):
sorted_list = sorted(sample_list)
mode = min(sorted_list)
max_count = sorted_list.count(mode)
for i in sorted_list:
if (i <= mode) and (sorted_list.count(i) >= max_count):
mode = i
print(mode)
return
mean(var_list)
median(var_list)
mode(var_list)
Compiler Message
Wrong Answer
Input (stdin)
2500
19325 74348 68955 98497 26622 32516 97390 64601 64410 10205 5173 25044 23966 60492 71098 13852 27371 40577 74997 42548 95799 26783 51505 25284 49987 99134 33865 25198 24497 19837 53534 44961 93979 76075 57999 93564 71865 90141 5736 54600 58914 72031 78758 30015 21729 57992 35083 33079 6932 96145 73623 55226 18447 15526 41033 46267 52486 64081 3705 51675 97470 64777 31060 90341 55108 77695 16588 64492 21642 56200 48312 5279 15252 20428 57224 38086 19494 57178 49084 37239 32317 68884 98127 79085 77820 2664 37698 84039 63449 63987 20771 3946 862 1311 77463 19216 57974 73012 78016 9412 90919 40744 24322 68755 59072 57407 4026 15452 82125 91125 99024 49150 90465 62477 30556 39943 44421 68568 31056 66870 63203 43521 78523 58464 38319 30682 77207 86684 44876 81896 58623 24624 14808 73395 92533 4398 8767 72743 1999 6507 49353 81676 71188 78019 88429 68320 59395 95307 95770 32034 57015 26439 2878 40394 33748 41552 64939 49762 71841 40393 38293 48853 81628 52111 49934 74061 98537 83075 83920 42792 96943 3357 83393{-truncated-}
Download to view the full testcase
Expected Output
49921.5
49253.5
2184
Your issue seems to be that you are actually using standard list operations rather than calculating things on the fly, while looping through the data once (for the average). sum(sample_list) will almost surely give you something which exceeds the double-limit, i.a.w. it becomes really big.
Further reading
Calculating the mean, variance, skewness, and kurtosis on the fly
How do I determine the standard deviation (stddev) of a set of values?
Rolling variance algorithm
What is a good solution for calculating an average where the sum of all values exceeds a double's limits?
How do I determine the standard deviation (stddev) of a set of values?
How to efficiently compute average on the fly (moving average)?
I figured out that you forgot to change the max_count variable inside the if block. Probably that causes the wrong result. I tested the debugged version on my computer and they seem to work well when I compare their result with the scipy's built-in functions. The correct mode function should be
def mode(sample_list):
N = len(sample_list)
sorted_list = sorted(sample_list)
mode = min(sorted_list)
max_count = sorted_list.count(mode)
for i in sorted_list:
if (sorted_list.count(i) >= max_count):
mode = i
max_count = sorted_list.count(i)
print(mode)
I was busy with some stuff and now came back to completing this. I am happy to say that I have matured enough as a coder and solved this issue.
Here is the solution:
# Enter your code here. Read input from STDIN. Print output to STDOUT
# Input an array of numbers, convert it to integer array
n = int(input())
my_array = list(map(int, input().split()))
my_array.sort()
# Find mean
array_mean = sum(my_array) / n
print(array_mean)
# Find median
if (n%2) != 0:
array_median = my_array[n//2]
else:
array_median = (my_array[n//2 - 1] + my_array[n//2]) / 2
print(array_median)
# Find mode(I could do this using multimode method of statistics module for python 3.8)
def sort_second(array):
return array[1]
modes = [[i, my_array.count(i)] for i in my_array]
modes.sort(key = sort_second, reverse=True)
array_mode = modes[0][0]
print(array_mode)
I have a dataframe which is of length 177 and I want to calculate and plot the partial auto-correlation function (PACF).
I have the data imported etc and I do:
from statsmodels.tsa.stattools import pacf
ys = pacf(data[key][array].diff(1).dropna(), alpha=0.05, nlags=176, method="ywunbiased")
xs = range(lags+1)
plt.figure()
plt.scatter(xs,ys[0])
plt.grid()
plt.vlines(xs, 0, ys[0])
plt.plot(ys[1])
The method used results in numbers greater than 1 for very long lags (90ish) which is incorrect and I get a RuntimeWarning: invalid value encountered in sqrtreturn rho, np.sqrt(sigmasq) but since I can't see their source code I don't know what this means.
To be honest, when I search for PACF, all the examples only carry out PACF up to 40 lags or 60 or so and they never have any significant PACF after lag=2 and so I couldn't compare to other examples either.
But when I use:
method="ols"
# or
method="ywmle"
the numbers are corrected. So it must be the algo they use to solve it.
I tried importing inspect and getsource method but its useless it just shows that it uses another package and I can't find that.
If you also know where the problem arises from, I would really appreciate the help.
For your reference, the values for data[key][array] are:
[1131.130005, 1144.939941, 1126.209961, 1107.300049, 1120.680054, 1140.839966, 1101.719971, 1104.23999, 1114.579956, 1130.199951, 1173.819946, 1211.920044, 1181.27002, 1203.599976, 1180.589966, 1156.849976, 1191.5, 1191.329956, 1234.180054, 1220.329956, 1228.810059, 1207.01001, 1249.47998, 1248.290039, 1280.079956, 1280.660034, 1294.869995, 1310.609985, 1270.089966, 1270.199951, 1276.660034, 1303.819946, 1335.849976, 1377.939941, 1400.630005, 1418.300049, 1438.23999, 1406.819946, 1420.859985, 1482.369995, 1530.619995, 1503.349976, 1455.27002, 1473.98999, 1526.75, 1549.380005, 1481.140015, 1468.359985, 1378.550049, 1330.630005, 1322.699951, 1385.589966, 1400.380005, 1280.0, 1267.380005, 1282.829956, 1166.359985, 968.75, 896.23999, 903.25, 825.880005, 735.090027, 797.869995, 872.8099980000001, 919.1400150000001, 919.320007, 987.4799800000001, 1020.6199949999999, 1057.079956, 1036.189941, 1095.630005, 1115.099976, 1073.869995, 1104.48999, 1169.430054, 1186.689941, 1089.410034, 1030.709961, 1101.599976, 1049.329956, 1141.199951, 1183.26001, 1180.550049, 1257.640015, 1286.119995, 1327.219971, 1325.829956, 1363.609985, 1345.199951, 1320.640015, 1292.280029, 1218.890015, 1131.420044, 1253.300049, 1246.959961, 1257.599976, 1312.410034, 1365.680054, 1408.469971, 1397.910034, 1310.329956, 1362.160034, 1379.319946, 1406.579956, 1440.670044, 1412.160034, 1416.180054, 1426.189941, 1498.109985, 1514.680054, 1569.189941, 1597.569946, 1630.73999, 1606.280029, 1685.72998, 1632.969971, 1681.550049, 1756.540039, 1805.810059, 1848.359985, 1782.589966, 1859.449951, 1872.339966, 1883.949951, 1923.569946, 1960.22998, 1930.6700440000002, 2003.369995, 1972.290039, 2018.050049, 2067.560059, 2058.899902, 1994.9899899999998, 2104.5, 2067.889893, 2085.51001, 2107.389893, 2063.110107, 2103.840088, 1972.180054, 1920.030029, 2079.360107, 2080.409912, 2043.939941, 1940.2399899999998, 1932.22998, 2059.73999, 2065.300049, 2096.949951, 2098.860107, 2173.600098, 2170.949951, 2168.27002, 2126.149902, 2198.810059, 2238.830078, 2278.8701170000004, 2363.639893, 2362.719971, 2384.199951, 2411.800049, 2423.409912, 2470.300049, 2471.649902, 2519.360107, 2575.26001, 2584.840088, 2673.610107, 2823.810059, 2713.830078, 2640.8701170000004, 2648.050049, 2705.27002, 2718.3701170000004, 2816.290039, 2901.52002, 2913.97998]
Your time series is pretty clearly not stationary, so that Yule-Walker assumptions are violated.
More generally, PACF is usually appropriate with stationary time series. You might difference your data first, before considering the partial autocorrelations.
I am trying to learn linearK estimates on a small linnet object from the CRC spatstat book (chapter 17) and when I use the linearK function, spatstat throws an error. I have documented the process in the comments in the r code below. The error is as below.
Error in seq.default(from = 0, to = right, length.out = npos + 1L) : 'to' cannot be NA, NaN or infinite
I do not understand how to resolve this. I am following this process:
# I have data of points for each data of the week
# d1 is district 1 of the city.
# I did the step below otherwise it was giving me tbl class
d1_data=lapply(split(d1, d1$openDatefactor),as.data.frame)
# I previously create a linnet and divided it into districts of the city
d1_linnet = districts_linnet[["d1"]]
# I create point pattern for each day
d1_ppp = lapply(d1_data, function(x) as.ppp(x, W=Window(d1_linnet)))
plot(d1_ppp[[1]], which.marks="type")
# I am then converting the point pattern to a point pattern on linear network
d1_lpp <- as.lpp(d1_ppp[[1]], L=d1_linnet, W=Window(d1_linnet))
d1_lpp
Point pattern on linear network
3 points
15 columns of marks: ‘status’, ‘number_of_’, ‘zip’, ‘ward’,
‘police_dis’, ‘community_’, ‘type’, ‘days’, ‘NAME’,
‘DISTRICT’, ‘openDatefactor’, ‘OpenDate’, ‘coseDatefactor’,
‘closeDate’ and ‘instance’
Linear network with 4286 vertices and 6183 lines
Enclosing window: polygonal boundary
enclosing rectangle: [441140.9, 448217.7] x [4640080, 4652557] units
# the errors start from plotting this lpp object
plot(d1_lpp)
"show.all" is not a graphical parameter
Show Traceback
Error in plot.window(...) : need finite 'xlim' values
coords(d1_lpp)
x y seg tp
441649.2 4649853 5426 0.5774863
445716.9 4648692 5250 0.5435492
444724.6 4646320 677 0.9189631
3 rows
And then consequently, I also get error on linearK(d1_lpp)
Error in seq.default(from = 0, to = right, length.out = npos + 1L) : 'to' cannot be NA, NaN or infinite
I feel lpp object has the problem, but I find it hard to interpret the errors and how to resolve them. Could someone please guide me?
Thanks
I can confirm there is a bug in plot.lpp when trying to plot the marked point pattern on the linear network. That will hopefully be fixed soon. You can plot the unmarked point pattern using
plot(unmark(d1_lpp))
I cannot reproduce the problem with linearK. Which version of spatstat are you running? In the development version on my laptop spatstat_1.51-0.073 everything works. There has been changes to this code recently, so it is likely that this will be solved by updating to development version (see https://github.com/spatstat/spatstat).
I have 2 columns A and B which contain the Spearman's correlation values as follows:
0.127272727 -0.260606061
-0.090909091 -0.224242424
0.345454545 0.745454545
0.478787879 0.660606061
-0.345454545 -0.333333333
0.151515152 -0.127272727
0.478787879 0.660606061
-0.321212121 -0.284848485
0.284848485 0.515151515
0.36969697 -0.139393939
-0.284848485 0.272727273
How can I calculate the average of those correlation values in these 2 columns in Excel or Matlab ? I found a close answer in this link : https://stats.stackexchange.com/questions/8019/averaging-correlation-values
The main point is we can not use mean or average in this case, as explained in the link. They proposed a nice way to do that, but I dont know how to implement it in Excel or Matlab.
Following the second answer of the link you provided, which is the most general case, you can calculate the average Spearman's rho in Matlab as follows:
M = [0.127272727 -0.260606061;
-0.090909091 -0.224242424;
0.345454545 0.745454545;
0.478787879 0.660606061;
-0.345454545 -0.333333333;
0.151515152 -0.127272727;
0.478787879 0.660606061;
-0.321212121 -0.284848485;
0.284848485 0.515151515;
0.36969697 -0.139393939;
-0.284848485 0.272727273];
z = atanh(M);
meanRho = tanh(mean(z));
As you can see it gives mean values of
meanRho =
0.1165 0.1796
whereas the simple mean is quite close:
mean(M)
ans =
0.1085 0.1350
Edit: more information on Fisher's transformation here.
In MATLAB, define a matrix with these values and use mean function as follows:
%define a matrix M
M = [0.127272727 -0.260606061;
-0.090909091 -0.224242424;
0.345454545 0.745454545;
0.478787879 0.660606061;
-0.345454545 -0.333333333;
0.151515152 -0.127272727;
0.478787879 0.660606061;
-0.321212121 -0.284848485;
0.284848485 0.515151515;
0.36969697 -0.139393939;
-0.284848485 0.272727273];
%calculates the mean of each column
meanVals = mean(M);
Result
meanVals =
0.1085 0.1350
It is also possible to calculate the total meanm and the mean of each row as follows:
meanVals = mean(M); %total mean
meanVals = mean(M,2); %mean of each row
I need to plot a 3D bar graph in matlab or excel. I am going to use some dates in x-axis, time in y-axis and some amount on the z-axis. Each record in csv file looks like ...
18-Apr, 21, 139.45
I am not sure how to do this right. can anyone help me please. I tried using pivort chart of excel. however, i could not manipulate the axis and use appropriate space between each tick.
thanks
kaisar
Since the question is lacking details, let me illustrate with an example.
Consider the following code:
%# read file contents: date,time,value
fid = fopen('data.csv','rt');
C = textscan(fid, '%s %s %f', 'Delimiter',',');
fclose(fid);
%# correctly reshape the data, and extract x/y labels
num = 5;
d = reshape(C{1},num,[]); d = d(1,:);
t = reshape(C{2},num,[]); t = t(:,1);
Z = reshape(C{3},num,[]);
%# plot 3D bars
bar3(Z)
xlabel('date'), ylabel('time'), zlabel('value')
set(gca, 'XTickLabel',d, 'YTickLabel',t)
I ran on the following data file:
data.csv
18-Apr,00:00,0.85535
18-Apr,03:00,0.38287
18-Apr,06:00,0.084649
18-Apr,09:00,0.73387
18-Apr,12:00,0.33199
19-Apr,00:00,0.83975
19-Apr,03:00,0.37172
19-Apr,06:00,0.82822
19-Apr,09:00,0.17652
19-Apr,12:00,0.12952
20-Apr,00:00,0.87988
20-Apr,03:00,0.044079
20-Apr,06:00,0.68672
20-Apr,09:00,0.73377
20-Apr,12:00,0.43717
21-Apr,00:00,0.37984
21-Apr,03:00,0.97966
21-Apr,06:00,0.39899
21-Apr,09:00,0.44019
21-Apr,12:00,0.15681
22-Apr,00:00,0.32603
22-Apr,03:00,0.31406
22-Apr,06:00,0.8945
22-Apr,09:00,0.24702
22-Apr,12:00,0.31068
23-Apr,00:00,0.40887
23-Apr,03:00,0.70801
23-Apr,06:00,0.14364
23-Apr,09:00,0.87132
23-Apr,12:00,0.083156
24-Apr,00:00,0.46174
24-Apr,03:00,0.030389
24-Apr,06:00,0.7532
24-Apr,09:00,0.70004
24-Apr,12:00,0.21451
25-Apr,00:00,0.6799
25-Apr,03:00,0.55729
25-Apr,06:00,0.85068
25-Apr,09:00,0.55857
25-Apr,12:00,0.90177
26-Apr,00:00,0.41952
26-Apr,03:00,0.35813
26-Apr,06:00,0.48899
26-Apr,09:00,0.25596
26-Apr,12:00,0.92917
27-Apr,00:00,0.46676
27-Apr,03:00,0.25401
27-Apr,06:00,0.43122
27-Apr,09:00,0.70253
27-Apr,12:00,0.40233
Use MATLAB's CSV reading functions (or write your own) and then use bar3 to display the data.