What do you call a bug that occurs once, hardly ever occurs again, or cannot be reproduced? - nomenclature

Some "mysterious" bugs occur once, and then they either never occur or occur after months or years. You are not able to reproduce them because you don't know why they happen. Is there a technical name for such bugs?

This kind of bug is usually referred to as a Heisenbug.

Related

MS Access writing to Excel over multiple minutes will sometimes throw a false error

I'm someone who solves problems by looking, not asking, so this is new to me. This has been an issue for years, and it crops up with different computers, networks, versions, and completely different code. There is a lot here, so thank you in advance if you are willing to read the whole thing.
Generally speaking, I write MS Access programs that will open Excel and then create multiple worksheets inside a workbook using data from Access tables and/or Excel sheets. The process can take a couple of minutes to run, and occasionally it will get an error. I could tell you the error message, but it doesn't matter because it will be different depending on where the error occurs. When it occurs I simply click Debug, click Continue, and it... continues. If it errors out again (many loops later), it will happen in the exact same spot.
So, my first step is to make minor changes to the code. In the current program I'm working on, the error happens when I write to a cell and the value comes directly from a table. I created a variable, copied the value into the variable, and then wrote to the cell. The error moved to a completely different part of the program and became a "paste" error. Generally what fixes it is to put a wait function at the spot where the error occurs. One second is usually good enough. Sometimes it takes a couple of these, but that usually solves it. It only took one delay per loop this time, so it is working. I just hate causing delays in my program. So... has anyone seen anything like this before, or is it just me? It feels like a timing issue between Access and Excel, since the delays usually help. Thanks in advance.
I dug up my last major Access project that interacted with Word (ca. 2016), where I struggled with similar issues. I see many, many Debug.Print statements (some commented out, some still active), but unlike what I recalled earlier in my comments, I don't see any "wait" statements anymore! From what I now recall, and after re-inspecting the code, most problems were resolved by:
implementing robust error handling and best practices for always closing automation objects (and/or releasing the objects if I wanted the instances to persist)
subscribing to and utilizing appropriate automation object events to detect and handle interaction rather than trying to force everything into serialized work-then-wait code. To do this, I placed all automation code in well-structured classes that declared automation objects WithEvents (in VBA of course) and then defined relevant event handlers for actions I was effecting. I now recall finally being able to avoid weird errors and application hangs, etc.
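In case it helps, here is a rough sketch of the same event-driven idea in Python with pywin32, purely for illustration (it is not my original VBA, and the class and handler names are placeholders); the point is to react to the application's own events rather than sprinkling fixed waits:

```python
# Rough Python/pywin32 analogue of the VBA "WithEvents" approach described above.
# Purely illustrative -- not my original code; class and handler names are placeholders.
import time
import pythoncom
import win32com.client

class ExcelAppEvents:
    # pywin32 delivers Excel.Application events to methods named "On<EventName>".
    def OnNewWorkbook(self, workbook):
        print("New workbook:", workbook.Name)

    def OnSheetChange(self, sheet, target):
        # React when Excel reports the change, instead of guessing how long to wait.
        print("Sheet changed:", sheet.Name, target.Address)

# Create the automation object with the event sink attached.
excel = win32com.client.DispatchWithEvents("Excel.Application", ExcelAppEvents)
excel.Visible = True
excel.Workbooks.Add()
excel.ActiveSheet.Range("A1").Value = "hello"   # should raise a SheetChange event

# Pump COM messages for a little while so the event handlers actually run.
for _ in range(50):
    pythoncom.PumpWaitingMessages()
    time.sleep(0.1)

excel.Quit()
```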
You also may never get a good answer to a question like this, so even though I am not an absolute expert on Office development, I have had my own experience with frustrating bugs like this, and I'll share my 2 cents. This may not be satisfying, but after experiencing similar behavior with Office automation objects, my general understanding is that interactions between OS processes are not deterministic. Especially since VBA generally has no threading or parallelism concerns, it can be strange to deal with objects that behave in unpredictable ways. The time slices given to each process are at the mercy of the OS and will vary greatly with the number of processors/cores, other running processes, memory management, etc. Despite the purpose of the automation objects--to control instances of Office apps--the APIs are not well designed for inter-application work.
Although it would be great if old automation code produced more useful errors, perhaps nested exceptions (as in .NET and other modern environments), or something that indicates delays and timeouts within callbacks between automation objects, instead you get a hodgepodge of context-dependent errors.
My hardware is old, but still ticking. I often get delays, even if only for a second, when switching between apps, etc. Instead of thinking of it as an error, I just perceive it as a slow machine: wait and continue. It may be useful to treat these types of random errors as similar delays. If a wait call here or there resolves the issue, however annoying, that may just be the best solution... wait and continue.
Every now and then after debugging these types of issues I would actually discover the underlying problem and be able to fix it. At the least I would be able to avoid actual problems with the data, despite errors being raised, just like you describe. But even when I felt that I understood the problem, the answer was still often to do exactly as you have done and just add a short wait.
I do believe now this is a timing issue. After thinking things through, I realized that I could easily (well 3 hours later) separate the database info from the spreadsheet info and then move the updated code that is causing problems into an Excel Macro. I then called that macro from Access. Not only do the errors go away, but it runs about 4 times faster. It's not surprising, I just hadn't thought of that direction before.

Z3 might be inconsistent when solving this string problem?

I just encountered an SMT-LIB problem in the theory of strings that Z3 might have answered inconsistently. When invoking Z3 to solve the problem with the argument smt.string_solver=z3str3, it returns unsat; without any arguments, it returns sat.
I also used CVC4 to solve the problem. It returned a solution, which seems to be a valid model, as I checked it by manually substituting the variable assignments back into the problem.
Since I'm trying to do research using Z3, I would like to know if this is a known behavior of Z3. Thanks to anyone who can help! :)
Edit: I was using Z3 4.7.1 on WSL Ubuntu 16.04.
I'd say that getting unsat or sat depending on Z3's configuration sounds like a bug to me. Check the Z3 issue tracker for issues that describe similar behaviour, and if nothing pops up, file an issue there, ideally with a minimal example that reproduces the problem; your current one is rather long.
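For what it's worth, the Z3 Python API makes it easy to check both configurations against the same reduced constraints. The constraints below are placeholders, not the problem from the question:

```python
# Skeleton for a minimal repro using the Z3 Python API (z3py).
# The constraints are placeholders -- swap in the reduced core of the real problem.
from z3 import String, Length, Contains, Solver, set_param

def check(string_solver):
    set_param("smt.string_solver", string_solver)   # same option as smt.string_solver=... on the CLI
    s = Solver()
    x = String("x")
    s.add(Contains(x, "ab"), Length(x) == 3)         # placeholder constraints
    return s.check()

print("seq (default):", check("seq"))
print("z3str3       :", check("z3str3"))
# If the two lines disagree (sat vs. unsat) on identical constraints,
# that is exactly the kind of minimal example worth attaching to a GitHub issue.
```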

Publishing customization doesn't publish customizations

Sometimes when I hit Publish All Customizations, I notice that the process doesn't produce the expected outcome. At first, I blamed my forgetfulness, but finally I started keeping statistics. There is something wrong with the publishing facility. I even have a short screencast to prove it!
What am I doing wrong?!
You are most likely not doing anything wrong. I've seen that behavior quite a few times. It started with RU11, I believe, and it's a bug. It appears seldom and (most often) has to do with importing a solution from a lower rollup version, as far as I've cared to look into it.
Just to make sure we're talking about the same phenomenon, see my blog if you recognize the signs from the screenshots.
Short remedy - hit the button again. It always works the second time.
Long term solution - wait until Microsoft fixes it.

Any good strategies for dealing with 'not reproducible' bugs? [closed]

Closed. This question is opinion-based. It is not currently accepting answers.
Very often you will get or submit bug reports for defects that are 'not reproducible'. They may be reproducible on your computer or software project, but not on a vendor's system. Or the user supplies steps to reproduce, but you can't see the defect locally. Many variations on this scenario of course, so to simplify I guess what I'm trying to learn is:
What is your company's policy towards 'not reproducible' bugs? Shelve them, close them, ignore them? I occasionally see intermittent, non-reproducible bugs in third-party frameworks, and these are pretty much always closed instantly by the vendor... but they are real bugs.
Have you found any techniques that help in fixing these types of bugs? Usually what I do is get a system info report from the user, and steps to reproduce, then search on keywords, and try to see any sort of pattern.
Verify the steps used to produce the error
Oftentimes the people reporting the error, or the people reproducing the error, will do something wrong and not end up in the same state, even if they think they are. Try to walk it through with the reporting party. I've had a user INSIST that the admin privileges were not appearing correctly. I tried reproducing the error and was unable to. When we walked it through together, it turned out he was logging in as a regular user in that case.
Verify the system/environment used to produce the error
I've found many 'irreproducible' bugs and only later discovered that they ARE reproducible on Mac OS (10.4) running version X of Safari. And this doesn't apply only to browsers and rendering; it can apply to anything: the other applications currently running, whether the user is on RDP or local, admin or regular user, etc. Make certain you get your environment as close to theirs as possible before calling it irreproducible.
Gather Screenshots and Logs
Once you have verified that the user is doing everything correctly and still getting a bug, and that you're doing exactly what they do, and you are NOT getting the bug, then it's time to see what you can actually do about it. Screenshots and logs are critical. You want to know exactly what it looks like, and exactly what was going on at the time.
It is possible that the logs could contain some information that you can reproduce on your system, and once you can reproduce the exact scenario, you might be able to coax the error out of hiding.
Screenshots also help with this, because you might discover that "X piece has loaded correctly, but it shouldn't have because it is dependent on Y", and that might give you a hint. Even if the user can describe what they were doing, a screenshot could help even more.
Gather step-by-step description from the user
It's very common to blame the users and not trust anything they say (because they call a 'usercontrol' a 'thingy'), but even though they might not know the names of what they're seeing, they will still be able to describe some of the behaviour. This includes minor errors that may have occurred a few minutes BEFORE the real error occurred, or slowness in things that are usually fast. All of these can be clues to help you narrow down which aspect is causing the error on their machine and not yours.
Try Alternate Approaches to produce the error
If all else fails, try looking at the section of code that is causing problems, and possibly refactor or use a workaround. If it is possible for you to create a scenario where you start with half the information already there (hopefully in UAT), ask the user to try that approach and see if the error still occurs. Do your best to create alternate but similar approaches that put the error in a different light so that you can examine it better.
Short answer: Conduct a detailed code review on the suspected faulty code, with the aim of fixing any theoretical bugs, and adding code to monitor and log any future faults.
Long answer:
To give a real-world example from the embedded systems world: we make industrial equipment containing custom electronics, with embedded software running on it.
A customer reported that a number of devices on a single site were experiencing the same fault at random intervals. Their symptoms were the same in each case, but they couldn't identify an obvious cause.
Obviously our first step was to try and reproduce the fault in the same device in our lab, but we were unable to do this.
So, instead, we circulated the suspected faulty code within the department, to try and get as many ideas and suggestions as possible. We then held a number of code review meetings to discuss these ideas, and determine a theory which: (a) explained the most likely cause of the faults observed in the field; (b) explained why we were unable to reproduce it; and (c) led to improvements we could make to the code to prevent the fault happening in the future.
In addition to the (theoretical) bug fixes, we also added monitoring and logging code, so if the fault were to occur again, we could extract useful data from the device in question.
To the best of my knowledge, this improved software was subsequently deployed on site, and appears to have been successful.
resolved "sterile" and "spooky"
We have two closed bug categories for this situation.
sterile - cannot reproduce.
spooky - it's acknowledged there is a problem, but it just appears intermittently, isn't quite understandable, and gives everyone a faint case of the creeps.
Error-reporting, log files, and stern demands to "Contact me immediately if this happens again."
If it happens in one context and not in another, we try to enumerate the differences between the two and eliminate them.
Sometimes this works (e.g. other hardware, dual core vs. hyperthreading, laptop-disk vs. workstation disk, ...).
Sometimes it doesn't. If it's possible, we may start remote debugging. If that doesn't help, we may try to get our hands on the customer's system.
But of course, we don't write too many bugs in the first place :)
Well, you try your best to reproduce it, and if you can't, you take a long think and consider how such a problem might arise. If you still have no idea, then there's not much you can do about it.
Some of the new features in Visual Studio 2010 will help. See:
Historical Debugger and Test Impact Analysis in Visual Studio Team System 2010
Better Software Quality with Visual Studio Team System 2010
Manual Testing with Visual Studio Team System 2010
Sometimes the bug is not reproducible even in a pre-production environment that is the exact duplicate of the production environment. Concurrency issues are notorious for this.
Random Failures Are Often Concurrency Issues
Link: https://pragprog.com/tips/
The reason can simply be the Heisenberg effect, i.e. observation changes behaviour. Another reason can be that the chances of hitting the combination of events that triggers the bug are very small.
Sometimes you are lucky and you have audit logs that you can play back, greatly increasing the chances of recreating the issue. You can also stress the environment with high volumes of transactions. This effectively compresses time: if the bug occurs, say, once a week, you may be able to reliably reproduce it in one day by stressing the system to 7x the production load.
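As a toy illustration of that time-compression idea (nothing here is from the book or from production code), an unsynchronized update that only occasionally misbehaves under light load becomes much easier to catch when you multiply the load:

```python
# Toy illustration: a race that is rare under light load becomes visible under stress.
import threading

counter = 0

def worker(iterations):
    global counter
    for _ in range(iterations):
        counter += 1   # read-modify-write with no lock: a latent race condition

def run(threads, iterations):
    global counter
    counter = 0
    pool = [threading.Thread(target=worker, args=(iterations,)) for _ in range(threads)]
    for t in pool:
        t.start()
    for t in pool:
        t.join()
    print(f"{threads:3d} threads: expected {threads * iterations}, got {counter}")

run(2, 100_000)    # may well look correct on any single run
run(50, 100_000)   # lost updates are far more likely to show up here
```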
The last resort is white-box testing, where you go through the code line by line, writing unit tests as you go.
I add logging to the exception handling code throughout the program. You need a method to collect the logs (users can email them, etc.).
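A minimal sketch of what I mean, in Python; the file name and logger name are just examples:

```python
# Minimal sketch: log unhandled exceptions to a file the user can send back.
# The file name and logger name are just examples.
import logging
import logging.handlers

log = logging.getLogger("myapp")
handler = logging.handlers.RotatingFileHandler("myapp.log", maxBytes=1_000_000, backupCount=3)
handler.setFormatter(logging.Formatter("%(asctime)s %(levelname)s %(name)s: %(message)s"))
log.addHandler(handler)
log.setLevel(logging.INFO)

def do_risky_work():
    raise RuntimeError("simulated failure")   # stand-in for the code that misbehaves in the field

try:
    do_risky_work()
except Exception:
    # log.exception() captures the full traceback -- exactly what you want
    # when the bug only ever happens on the user's machine.
    log.exception("Unhandled error in do_risky_work")
```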
Preemptive checks for code versions and sane environments are a good thing too. With the ease of software updates these days, the code and environment the user is running have almost certainly not been tested; they didn't exist when you released your code.
With a web project I'm developing at the moment, I'm doing something very similar to your technique. I'm building a page that I can direct users to in order to collect information such as their browser version and operating system. I'll also be collecting the app's registry info so I can have a look at what they've been doing.
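Something along these lines, assuming a Flask backend (the framework, route, and field names are my assumptions, not necessarily what the project will use):

```python
# Hypothetical diagnostics endpoint (Flask is an assumption; the original project may differ).
# Users reporting a bug are directed to this URL, and the server keeps a copy of what it sees.
from flask import Flask, jsonify, request
import logging

app = Flask(__name__)
logging.basicConfig(filename="diagnostics.log", level=logging.INFO)

@app.route("/diagnostics")
def diagnostics():
    info = {
        "user_agent": request.headers.get("User-Agent", "unknown"),
        "accept_language": request.headers.get("Accept-Language", "unknown"),
        "remote_addr": request.remote_addr,
    }
    logging.info("diagnostics report: %s", info)
    return jsonify(info)   # echoed back so the user can paste it into the bug report

if __name__ == "__main__":
    app.run()
```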
This is a very real problem. I can only speak for web development, but I find users are rarely able to give me the basic information I would need to look into the issue. I suspect it's entirely possible to do something similar with other kinds of development. My plan is to keep working on this system to make it more and more useful.
But my policy is never to close a bug simply because I can't reproduce it, no matter how annoying that may be. And then there are the cases when it's not a bug, but the user has simply gotten confused. Which is a different type of bug, I guess, but just as important.
You talk about problems that are reproducible but only on some systems. These are easy to handle:
First step: By using some sort of remote software, you let the customer tell you what to do to reproduce the problem on the system that has it. If this fails, then close it.
Second step: Try to reproduce the problem on another system. If this fails, make an exact copy of the customer's system.
Third step: If it still fails, you have no option but to try to debug it on the customer's system.
Once you can reproduce it, you can fix it. Doesn't matter on what system.
The tricky ones are the truly non-reproducible issues, that is, things that happen only intermittently. For those I'll have to chime in with the reports, logs, and stern demands attitude. :)
It is important to categorize such bugs (rarely reproducible) and act on them differently than bugs that are frequently reproducible based on specific user actions.
Clear issue description along with steps to reproduce and observed behavior: Unambiguous reporting helps the entire team understand the issue and eliminates incorrect conclusions. For example, a user reporting a blank screen is different from an HMI freeze on a user action. The sequence of steps and the approximate timing of user actions are also important. Did the user select the option immediately after the screen transition, or wait a few minutes? An interesting timing-related bug is the car that was allergic to vanilla ice cream, which baffled automotive engineers.
System config and startup parameters: Sometimes even the hardware configuration and application software version (including driver and firmware versions) can play a trick or two. A mismatch of version or configuration can result in issues that are difficult to reproduce in other setups, so these are essential details to capture. Most bug-reporting tools make these details mandatory when logging an issue.
Extensive logging: This depends on the logging facilities of the project concerned. While working with embedded Linux systems, we provide not only general diagnostic logs but also system-level logs such as dmesg or top output. The culprit may turn out to be not the code flow but abnormal memory or CPU usage. Identify the type of issue and report the relevant logs for investigation.
Code reviews and walk-throughs: Dev teams cannot wait forever to reproduce these issues on their end before taking action. The bug report and available logs should be investigated, and various possibilities identified from the design and code on that basis. If required, the team should prepare a hotfix for the possible root causes and circulate it among teams, including the tester who identified the issue, to see if the bug is still reproducible with it.
Don't close these issues based on the observation of a single tester/team after a fix is identified and checked in: Perhaps the most important part is the approach used to close these issues. Once the fix has been checked in, all testing/validation teams at different locations should be informed so they can run intensive tests and identify any regressions. Only when all (or practically all) of them report it as non-reproducible should a closure assessment be made by senior management.
If it is not reproducible, get logs and screenshots of the exact steps to reproduce.
There's a nice new feature in Windows 7 that allows the user to record what they're doing and then send a report; it comes through as a document with screenshots of every stage. Hopefully it'll help in the cases where the user is interacting with the application in an order the developer wouldn't think of. I've seen plenty of bugs where it's simply that the developer's logical way of using the app doesn't match how end users actually do it... resulting in lots of subtle errors.
Logging is your friend!
Generally, when we discover a bug that we can't reproduce, we either ask the customer to turn on more logging (if it's available) or we release a version with extra logging added around the area we are interested in. Generally speaking, the logging we have is excellent and can be very verbose, so releasing versions with extra logging doesn't happen often.
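One cheap way to let a customer "turn on more logging" without shipping a new build is an environment-variable switch; a quick Python sketch (the variable name MYAPP_DEBUG is made up):

```python
# Sketch of a customer-toggleable verbosity switch.
# The environment variable name MYAPP_DEBUG is made up for illustration.
import logging
import os

level = logging.DEBUG if os.environ.get("MYAPP_DEBUG") == "1" else logging.INFO
logging.basicConfig(
    filename="myapp.log",
    level=level,
    format="%(asctime)s %(levelname)s %(message)s",
)

logging.debug("verbose detail, only recorded when MYAPP_DEBUG=1")
logging.info("normal operational message")
```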
You should also consider the use of memory dumps (which IMO also falls under the umbrella of logging). Producing a minidump is so quick that it can usually be done on production servers, even under load (as long as the number of dumps being produced is low).
The way I see it: being able to reproduce a problem is nice because it gives you an environment where you can debug, experiment, and play around more freely, but reproducing a bug is by no means essential to debugging it! If the bug only happens on someone else's system, then you still need to diagnose and debug the problem in the same way; it's just that this time you need to be cleverer about how you do it.
The accepted answer is the best general approach. At a high level, it's worth weighing the importance of fixing the bug against what you could add as a feature or enhance that would benefit the user. Could a 'non-reproducible' bug take two days to fix? Could a feature be added in that time that gives users more benefit than that bug fix? Maybe the users would prefer the feature. I've been fixated at times as a developer on imperfections I can see, and then users are asked for feedback and none of them actually mention the bug(s) that I can see, but the software is missing a feature that they really want!
Sometimes, simple persistence in attempting to reproduce the bug whilst debugging can be the most effective approach. For this strategy to work, the bug needs to be 'intermittent' rather than completely 'non-reproducible'. If you can repeat a bug even one time in 10, and you have ideas about the most likely place it's occurring, you can place breakpoints at those points then doggedly attempt to repeat the bug and see exactly what's going on. I've experienced this to be more effective than logging in one or two cases (although logging would be my first go-to in general).

Which is better: shipping a buggy feature or not shipping the feature at all?

This is a bit of a philosophical question. I am adding a small feature to my software which I assume will be used by most users, but only maybe 10% of the times they use the software. In other words, the software has been fine without it for 3 months, but 4 or 5 users have asked for it, and I agree that it should be there.
The problem is that, due to limitations of the platform I'm working with (and possibly limitations of my brain), "the best I can do" still has some non-critical but noticeable bugs - let's say the feature as coded is usable but "a bit wonky" in some cases.
What to do? Is a feature that's 90% there really "better than nothing"? I know I'll get some bug reports which I won't be able to fix: what do I tell customers about those? Should I live with unanswered feature requests or unanswered bug reports?
Make sure people know, that you know, that there are problems. That there are bugs. And give them an easy way to provide feedback.
What about having a "closed beta" with the "4 or 5 users" who suggested the feature in the first place?
There will always be unanswered feature requests and bug reports. Ship it, but include a readme with "known issues" and workarounds when possible.
You need to think of this from your user's perspective - which will cause less frustration? Buggy code is usually more frustrating than missing features.
Perfectionists may answer "don't do it".
Business people may answer "do it".
I guess where the balance lies is up to you. I would lean towards putting the feature in there if the bugs are non-critical. Most users don't see your software the same way you do. You're a craftsman/artist, which means you're more critical than regular people.
Is there any way that you can get a beta version to the 4-5 people who requested the feature? Then, once you get their feedback, it may be clear which decision to make.
Precisely document the wonkiness and ship it.
Make sure a user is likely to see and understand your documentation of the wonkiness.
You could even discuss the decision with users who have requested the feature: do some market research.
Just because you can't fix it now, doesn't mean you won't be able to in the future. Things change.
Label what you have now as a 'beta version' and send it out to those people who have asked for it. Get their feedback on how well it works, fix whatever they complain about, and you should then be ready to roll it out to larger groups of users.
Ship early, ship often, constant refactoring.
What I mean is, don't let it stop you from shipping, but don't give up on fixing the problems either.
An inability to resolve wonkiness is a sign of problems in your code base. Spend more time refactoring than adding features.
I guess it depends on your standards. For me, buggy code is not production ready and so shouldn't be shipped. Could you have a beta version with a known-issues list so users know what to expect under certain conditions? They get the benefit of using the new features but also know that it's not perfect (use at their own risk). This may keep those 4 or 5 customers that requested the feature happy for a while, which gives you more time to fix the bugs (if possible) and release to production later for the masses.
Just some thoughts depending on your situation.
Depends. On the bugs, their severity and how much effort you think it will take to fix them. On the deadline and how much you think you can stretch it. On the rest of the code and how much the client can do with it.
I would not expect coders to deliver known problems into test, let alone release them to a customer.
Mind you, I believe in zero tolerance for bugs. Interestingly, I find that it is usually developers/testers who are keenest to remove all bugs - it is often the project manager and/or customer who are willing to accept bugs.
If you must release the code, then document every feature/bug that you are aware of, and commit to fixing each one.
Why don't you post more information about the limitations of the platform you are working on, and perhaps some of the clever folk here can help get your bug list down.
If the demand is for a feature NOW, rather than a feature that works, you may have to ship.
In this situation though:
Make sure you document the bug(s) and consequences (both to the user and other developers).
Be sure to add the bug(s) to your bug tracking database.
If you write unit tests (I hope so), make sure that tests are written which highlight the bugs before you ship (see the sketch after this list). This will mean that when you come to fix the bugs in the future, you know where and what they are, without having to remember.
Schedule the work to fix the bugs ASAP. You do fix bugs before writing new code, don't you?
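For the unit-test point above, one way to keep a known bug visible in the suite without breaking the build is an expected-failure marker; a pytest sketch (the bug ID, module, and function are hypothetical):

```python
# Sketch of a test that documents a known, shipped bug (pytest).
# The bug ID, module, and function names are hypothetical.
import pytest

from myapp.export import export_report   # hypothetical code under test

@pytest.mark.xfail(reason="BUG-123: trailing row is dropped on multi-page reports")
def test_export_keeps_trailing_row():
    rows = [{"id": i} for i in range(101)]   # enough rows to spill onto a second page
    result = export_report(rows)
    assert len(result.rows) == 101           # currently fails; will XPASS once BUG-123 is fixed
```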
If bugs can cause death or can lose users' files then don't ship it.
If bugs can cause the application to crash itself then ship it with a warning (a readme or whatever). If crashes might corrupt the users' files that they were in the middle of editing with this exact application, then display a warning each time they start the application, reminding them to back up their files first.
If bugs can cause BSODs then be very careful about who you ship it to.
If it doesn't break anything else, why not ship it? It sounds like you have a good relationship with your customers, so those who want the feature will be happy to get it even if it's not all the way there, and those who don't want it won't care. Plus you'll get lots of feedback to improve it in the next release!
The important question you need to answer is whether your feature will solve a real business need given the design you've come up with. Then it's only a matter of making the implementation match the design - making the "bugs" non-bugs by defining them as not part of the intended behaviour of the feature (which should be covered by the design).
This boils down to a very real choice of paths: is a bug anything that doesn't work properly, even if it wasn't part of the intended behaviour and design? Or is it a bug only if it doesn't work in accordance with the intended behaviour?
I am a firm believer in the latter; bugs are the things that do not work the way they were intended to work. The implementation should capture the design, that should capture the business need. If the implementation is used to address a different business need that wasn't covered by the design, it is the design that is at fault, not the implementation; thus it is not a bug.
The former attitude is by far the most common amongst programmers in my experience. It is also the way the user views software issues. From a software development perspective, however, it is not a good idea to adopt this view, because it leads you to fix bugs that are not bugs, but design flaws, instead of redesigning the solution to the business need.
Coming from someone who has to install buggy software for their users - don't ship it with that feature enabled.
It doesn't matter if you document it; the end users will forget about that bug the first time they hit it, and that bug will become a critical blocker to them doing their job.

Resources