Failure rate of a system - reliability

In a Microsoft interview I was asked the following question:
A system is guaranteed to fail 10% of the time within any given hour. What is the failure rate after two hours? After a million hours?
I'm not very experienced in Reliability theory and Failure rates, but any input on this question will be very much appreciated.

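Assuming each hour fails independently with probability 0.1, the probability of at least one failure within h hours is 1-(.9^h), where h is the number of hours.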
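As a quick check of that formula, here is a minimal Python sketch. The function name and the independence assumption (each hour fails independently of the others) are mine, not part of the interview question.

```python
def failure_probability(hours: int, hourly_failure: float = 0.1) -> float:
    """Probability of at least one failure within `hours`.

    Assumes every hour fails independently with probability `hourly_failure`.
    """
    survive_all_hours = (1.0 - hourly_failure) ** hours
    return 1.0 - survive_all_hours


two_hours = failure_probability(2)              # about 0.19
million_hours = failure_probability(1_000_000)  # indistinguishable from 1 in floating point
print(two_hours, million_hours)
```

So after two hours the chance of having seen a failure is about 19%, and after a million hours it is, for all practical purposes, certain.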

Optaplanner: Randomly very low "average calculate count per second" when solving in parallel

I am using Optaplanner to solve a comparatively small optimization problem. My use case requires many such optimizations, though, which is why I started to run them in parallel. The parallelism is based on Java 8's parallel streams. They don't allow me to control the actual number of threads used, but I believe it to be based on the available CPU count.
For most of the solver runs this seems to work fine, but I noticed that I sometimes got an invalid solution from a single run, which was not reproducible when that problem was run alone.
After inspecting the logs, I noticed that the "average calculate count per second" was very low for the invalid solution while being fine for the other runs. In fact, the invalid solution was actually the (naively built) initial solution:
[rkJoinPool.commonPool-worker-6] (DefaultSolver.java:203) Solving started: time spent (0), best score (-5hard/-2medium/168soft), environment mode (REPRODUCIBLE), random (JDK with seed 0).
[rkJoinPool.commonPool-worker-6] (DefaultConstructionHeuristicPhase.java:158) Construction Heuristic phase (0) ended: step total (0), time spent (1), best score (-5hard/-2medium/233soft).
[rkJoinPool.commonPool-worker-4] (DefaultSolver.java:203) Solving started: time spent (1), best score (-5hard/-1medium/579soft), environment mode (REPRODUCIBLE), random (JDK with seed 0).
[rkJoinPool.commonPool-worker-4] (DefaultConstructionHeuristicPhase.java:158) Construction Heuristic phase (0) ended: step total (0), time spent (1), best score (-5hard/-1medium/617soft).
[rkJoinPool.commonPool-worker-5] (DefaultSolver.java:203) Solving started: time spent (1), best score (-6hard/-3medium/137soft), environment mode (REPRODUCIBLE), random (JDK with seed 0).
[rkJoinPool.commonPool-worker-7] (DefaultLocalSearchPhase.java:152) Local Search phase (1) ended: step total (42), time spent (704), best score (0hard/0medium/808soft).
[rkJoinPool.commonPool-worker-4] (DefaultLocalSearchPhase.java:152) Local Search phase (1) ended: step total (22), time spent (218), best score (0hard/0medium/1033soft).
[rkJoinPool.commonPool-worker-5] (DefaultSolver.java:238) Solving ended: time spent (210), best score (-6hard/-3medium/137soft), average calculate count per second (4), environment mode (REPRODUCIBLE).
[rkJoinPool.commonPool-worker-7] (DefaultSolver.java:238) Solving ended: time spent (746), best score (0hard/0medium/808soft), average calculate count per second (25256), environment mode (REPRODUCIBLE).
[rkJoinPool.commonPool-worker-4] (DefaultSolver.java:238) Solving ended: time spent (219), best score (0hard/0medium/1033soft), average calculate count per second (30461), environment mode (REPRODUCIBLE).
Notice how Threads 4 and 7 produce good results with 25-30k accs, while Thread 5 produced an invalid result and only used 4 accs (given the 200ms termination timeout I assume that really only one step was taken).
The following configuration was used which was determined using the benchmarker (albeit in a single-thread setup):
<termination>
<millisecondsSpentLimit>2000</millisecondsSpentLimit>
<unimprovedMillisecondsSpentLimit>200</unimprovedMillisecondsSpentLimit>
</termination>
<constructionHeuristic>
<constructionHeuristicType>FIRST_FIT</constructionHeuristicType>
</constructionHeuristic>
<localSearch>
<localSearchType>HILL_CLIMBING</localSearchType>
</localSearch>
I assume that this problem has to do with the fact that several solvers are running in parallel while a time-based termination criterion is used. Is the termination time based on "wall time" or on actual CPU time?
Is a time-based termination criterion a bad idea when running in parallel? It seems to be the best way to use all available computing power, though.
What could cause a single solver to perform so few steps, seemingly at random?
millisecondsSpentLimit and unimprovedMillisecondsSpentLimit are based on wall time, not on actual CPU time.
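AFAIK, parallel streams give you no direct control over the thread count: they run on the common ForkJoinPool, whose default parallelism is derived from the number of available processors, so the solvers can end up competing with each other and with other JVM threads for CPU time. Because the limits above are measured in wall time, a solver that is starved of CPU can hit unimprovedMillisecondsSpentLimit after only a handful of score calculations. I prefer to use an ExecutorService with a thread pool size of Math.max(1, Runtime.getRuntime().availableProcessors() - 2).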

Is process.hrtime() suitable for monitoring uptime?

I am trying to monitor the uptime and send data at high frequency on a Node.js server. The server will broadcast network data every few milliseconds.
Date.now() is not accurate enough for this, so I am thinking of using the high-resolution timer process.hrtime(). I don't know what the maximum value of process.hrtime() is. I need to run the server for at least 6 months. Will it overflow soon?
The primary use of process.hrtime() is measuring performance between intervals (see the docs), but it can also be used to measure uptime with nanosecond precision. Its values are relative to an arbitrary point in the past, so take one reading at startup and pass it back in, as process.hrtime(start), to get the elapsed time.
It returns the time as an array [seconds, nanoseconds], where nanoseconds is the remaining part of the time that does not make up a whole second.
The seconds value would take roughly 285 million years to reach the max safe integer (9007199254740991). The nanoseconds value will never reach it, because the largest remainder that does not make up a whole second is 999999999.
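The arithmetic behind that estimate can be checked with a short Python sketch. The constant and function names are made up for illustration, and a year is taken as 365.25 days. The second function covers a case the answer does not: what happens if the [seconds, nanoseconds] pair is folded into one plain JavaScript number of nanoseconds, which I am adding as an assumption about how the value might be used.

```python
MAX_SAFE_INTEGER = 2**53 - 1            # 9007199254740991, JavaScript's Number.MAX_SAFE_INTEGER
SECONDS_PER_DAY = 24 * 60 * 60
SECONDS_PER_YEAR = 365.25 * SECONDS_PER_DAY
NANOSECONDS_PER_SECOND = 1_000_000_000


def years_until_seconds_overflow() -> float:
    """Years before the whole-seconds part exceeds the max safe integer."""
    return MAX_SAFE_INTEGER / SECONDS_PER_YEAR


def days_until_combined_nanoseconds_overflow() -> float:
    """Days before seconds * 1e9 + nanoseconds, kept as ONE number, exceeds the max safe integer."""
    return MAX_SAFE_INTEGER / NANOSECONDS_PER_SECOND / SECONDS_PER_DAY


print(years_until_seconds_overflow())              # hundreds of millions of years
print(days_until_combined_nanoseconds_overflow())  # only a few months
```

Kept as a pair, the value is safe for any realistic uptime. Collapsed into a single nanosecond count in a plain number, it would lose integer precision before the 6 months are up, so in that case keep the two parts separate or use a BigInt.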

How do I know the last sched time of a process

I have currently run into an issue where a process seems to be stuck somehow: it just doesn't get scheduled, and its status is always 'S'. I have monitored the sched_switch_task trace through debugfs for a while and didn't see the process get scheduled. So I would like to know when the kernel last scheduled this process.
Thanks a lot.
It might be possible using the info in the /proc/pid#/sched file (with pid# replaced by the process ID).
In there you can find these parameters (depending on the kernel version; mine is openSUSE 3.16.7-21-desktop):
se.exec_start : 593336938.868448
...
se.statistics.wait_start : 0.000000
se.statistics.sleep_start : 593336938.868448
se.statistics.block_start : 0.000000
The values are timestamps relative to the system boot time, in a unit which may depend on your system (in my example the values work out to milliseconds: 593336938.868448 ms is about 6 days 20 hours and change).
Of the last 3 parameters listed above, at most one appears to be non-zero at any time, and I suspect that the non-zero value represents the time when the process last entered the corresponding state (with the process actively running when all are zero).
So if your process is indeed stuck, the non-zero value would have been recorded when it got stuck.
Note: this is mostly based on observations and assumptions. I didn't find these parameters documented anywhere, so take them with a grain of salt.
There is plenty of other scheduling info in that file, but it is mostly statistics and, without documentation, difficult to use.
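To pull those fields out programmatically, a minimal Python sketch could look like this. It assumes the file consists of "name : value" lines like the excerpt above; the function names are hypothetical, and the field names are simply the ones shown in the excerpt, which may differ between kernel versions.

```python
STATE_FIELDS = (
    "se.statistics.wait_start",
    "se.statistics.sleep_start",
    "se.statistics.block_start",
)


def parse_sched(text: str) -> dict:
    """Parse 'name : value' lines into a dict, skipping anything that is not numeric."""
    values = {}
    for line in text.splitlines():
        name, separator, raw_value = line.partition(":")
        if not separator:
            continue  # header rules, blank lines, "..." and similar
        try:
            values[name.strip()] = float(raw_value.strip())
        except ValueError:
            continue  # e.g. the first line with the process name and thread count
    return values


def nonzero_state_timestamps(values: dict) -> dict:
    """Return the *_start fields that are currently non-zero."""
    return {name: values[name] for name in STATE_FIELDS if values.get(name, 0.0) != 0.0}


def read_sched(pid: int) -> dict:
    """Read and parse /proc/<pid>/sched (Linux only, needs scheduler debug info enabled)."""
    with open(f"/proc/{pid}/sched") as handle:
        return parse_sched(handle.read())


sample = """se.exec_start : 593336938.868448
se.statistics.wait_start : 0.000000
se.statistics.sleep_start : 593336938.868448
se.statistics.block_start : 0.000000"""
print(nonzero_state_timestamps(parse_sched(sample)))
```

Calling read_sched for the stuck process a few times and comparing se.exec_start between calls would show whether the value moves at all, under the same caveat as above that the meaning of these fields is inferred rather than documented.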

Importance of do_fast_gettimeoffset( ) in linux

I was reading the book "Understanding the Linux Kernel", and it says that the "number of microseconds is calculated by do_fast_gettimeoffset( )". It also says that this is done "to count the number of microseconds that have elapsed within the current second."
I couldn't understand what the author means by the last sentence. Could anyone explain it in more detail?
If you want to understand the linux kernel, you should be aware that that book has been outdated for a long time and that do_fast_gettimeoffset no longer exists.
do_get_fast_time returns the number of seconds, and is always fast.
do_gettimeoffset returns the number of microseconds since the start of the second, and might be slow.
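In other words, the current time is kept as two parts: a count of whole seconds plus the microseconds that have elapsed since the last whole second began. The Python sketch below only illustrates that split; it is not kernel code, and the function name is made up.

```python
import time


def split_time(total_ns: int) -> tuple:
    """Split a nanosecond timestamp into (whole seconds, microseconds within that second)."""
    seconds, remainder_ns = divmod(total_ns, 1_000_000_000)
    microseconds = remainder_ns // 1_000  # always in the range 0..999999
    return seconds, microseconds


seconds, microseconds = split_time(time.time_ns())
print(seconds, microseconds)
```

The first part plays the role of the cheap seconds counter, and the second part is the offset "within the current second" that the book is talking about.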

Processing of sensor data

I am working on a system with laser trip detectors (if something breaks the laser path, I get a one on the output of the laser receiver).
I have many of these trip detectors and I want to detect whether one is malfunctioning, but I do not know how to go about doing this. The lasers should not trip all that often, maybe a few times a day.
A typical case would be that the laser gets tripped for 0.5-2 seconds, or trips briefly and intermittently over a short time period, and possibly again after that (within 2-10 seconds).
Is there a sound statistical method for checking whether a sensor is malfunctioning?
You could create a "profile" for each sensor which includes the mean/min/max of how often it is tripped, how long it stays tripped, how long the time between one trip and the next is, and so on, for example by using the data from some period of time such as the last week or month.
Then you can compare the current state of a sensor to its profile. When the deviation is "big enough", you can assume an exceptional situation, perhaps a malfunction. The hardest part is adjusting the threshold for the deviation from the profile, which, when exceeded, triggers for example "malfunction handling".
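A minimal Python sketch of the profile idea follows. It only looks at trips per day and measures deviation in standard deviations; both choices, along with the function names, the threshold of 3 and the floor on the spread, are assumptions for illustration rather than a recommended methodology.

```python
from statistics import mean, stdev


def build_profile(daily_trip_counts: list) -> dict:
    """Summarise a sensor's history (needs at least two days of data)."""
    return {
        "mean": mean(daily_trip_counts),
        "stdev": stdev(daily_trip_counts),
        "min": min(daily_trip_counts),
        "max": max(daily_trip_counts),
    }


def is_suspicious(profile: dict, todays_count: int,
                  threshold: float = 3.0, min_spread: float = 1.0) -> bool:
    """Flag a day whose trip count is more than `threshold` spreads away from the mean.

    `min_spread` keeps a very regular history from flagging tiny differences.
    """
    spread = max(profile["stdev"], min_spread)
    deviation = abs(todays_count - profile["mean"]) / spread
    return deviation > threshold


history = [3, 2, 4, 3, 3, 2, 4]           # trips per day over the last week
profile = build_profile(history)
print(is_suspicious(profile, 3))           # an ordinary day
print(is_suspicious(profile, 40))          # far more trips than usual
```

The same comparison can be repeated for trip duration and for the gap between trips. As the answer says, the values of threshold and min_spread are the part that needs tuning against real data.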
