Is PetraVM Jinx Beta 1 good? - multithreading

PetraVM recently came out with a Beta release of their Jinx product. Has anyone checked it out yet? Any feedback?
By good, I mean:
1) easy to use
2) intuitive
3) useful
4) doesn't take a lot of code to integrate
... those kinds of things.
Thanks guys!

After literally stumbling across Jinx while poking around on Google, I have been on the beta and pre-beta tests with a fair amount of usage already under my belt. As with any beta related comments please understand that things may change or already have changed, so do keep this in mind and take the following with a grain of salt.
So, going through the list of questions one by one:
1) Install and go. Jinx adds a control panel to Visual Studio which you can mostly ignore as the defaults are typically good for most cases. Otherwise you just work normally and forget about it. Jinx does not instrument your code, require any additional libraries linked in or the numerous other things some tools require.
2) The question of "intuitive" is really up to the user. If you understand threaded code and the sorts of bugs possible, Jinx just makes those bugs happen much more frequently, which by itself is a huge benefit to people doing threaded code. While Jinx attempts to stop the code in a state that makes the problem as obvious as possible, "obvious" and "intuitive" are really up to the skill of the programmer.
3) Useful? Anyone who has done threaded code before knows that a race condition can happen regularly or once every month based on cosmic ray counts, that randomness makes debugging threaded code very difficult. With Jinx, even the most minor race condition can be reproduced usually on the first run consistently. This works even for lockless code that other static analysis or instrumenting tools would generally miss.
This sort of quick reproduction of problems is amazingly useful. Jinx has helped me track down a "one instruction in the wrong place" sort of bug that would actually hit once a week at most. Jinx forced the crash to happen almost immediately and allowed me to focus on the actual cause of the bug instead being completely in the dark as to the real source.
4) Integration with Jinx is a breeze. If you don't mind your machine becoming a bit slow, you can tell Jinx to watch the entire machine. It slows the machine down as it is actually watching everything on the machine, including the OS. While interresting and useful if your software consists of multiple processes on the same machine, this is not suggested as it can become painful to work with the machine.
Instead of using the global system, adding an include and two lines of code does the primary work needed of registering the process with Jinx such that Jinx can watch just the registered items. You can help Jinx by using the Jinx specific asserts and registering regions of code that are more important. In the case of the crash mentioned above though, I didn't have to do any of that and Jinx found the problem without the additional integration work. In any case, the integration is extremely simple.
After using Jinx for the last couple months, I have to say that overall it has been a great pleasure. I won't write new threaded code without Jinx running in the background anymore simply because it does its intended job of forcing obscure threading issues to be immediate assert/crashes. As mentioned, things that you could go weeks without seeing become problems almost immediately, this is a wonderful thing to have during initial test and implementation.
KRB

BTW, PetraVM has changed its name to Corensic and you can find Jinx Beta 2 over at www.corensic.com.
--Prashant, the marketing guy at Corensic

Related

Any good strategies for dealing with 'not reproducible' bugs? [closed]

Closed. This question is opinion-based. It is not currently accepting answers.
Want to improve this question? Update the question so it can be answered with facts and citations by editing this post.
Closed 2 years ago.
Improve this question
Very often you will get or submit bug reports for defects that are 'not reproducible'. They may be reproducible on your computer or software project, but not on a vendor's system. Or the user supplies steps to reproduce, but you can't see the defect locally. Many variations on this scenario of course, so to simplify I guess what I'm trying to learn is:
What is your company's policy towards 'not reproducible' bugs? Shelve them, close them, ignore? I occasionally see intermittent, non reproducible bugs in 3-rd party frameworks, and these are pretty much always closed instantly by the vendor... but they are real bugs.
Have you found any techniques that help in fixing these types of bugs? Usually what I do is get a system info report from the user, and steps to reproduce, then search on keywords, and try to see any sort of pattern.
Verify the steps used to produce the error
Oftentimes the people reporting the error, or the people reproducing the error, will do something wrong and not end up in the same state, even if they think they are. Try to walk it through with the reporting party. I've had a user INSIST that the admin privileges were not appearing correctly. I tried reproducing the error and was unable to. When we walked it through together, it turned out he was logging in as a regular user in that case.
Verify the system/environment used to produce the error
I've found many 'irreproducible' bugs and only later discovered that they ARE reproducible on Mac OS (10.4) Running X version of Safari. And this doesn't apply only to browsers and rendering, it can apply to anything; the other applications that are currently being run, whether or not the user is RDP or local, admin or user, etc... Make certain you get your environment as close to theirs as possible before calling it irreproducible.
Gather Screenshots and Logs
Once you have verified that the user is doing everything correctly and still getting a bug, and that you're doing exactly what they do, and you are NOT getting the bug, then it's time to see what you can actually do about it. Screenshots and logs are critical. You want to know exactly what it looks like, and exactly what was going on at the time.
It is possible that the logs could contain some information that you can reproduce on your system, and once you can reproduce the exact scenario, you might be able to coax the error out of hiding.
Screenshots also help with this, because you might discover that "X piece has loaded correctly, but it shouldn't have because it is dependent on Y" and that might give you a hint. Even if the user can describe what were doing, a screen shot could help even more.
Gather step-by-step description from the user
It's very common to blame the users, and not trust anything that they say (because they call a 'usercontrol' a 'thingy') but even though they might not know the names of what they're seeing, they will still be able to describe some of the behaviour they are seeing. This includes some minor errors that may have occured a few minutes BEFORE the real error occurred, or possibly slowness in certain things that are usually fast. All these things can be clues to help you narrow down which aspect is causing the error on their machine and not yours.
Try Alternate Approachs to produce the error
If all else fails, try looking at the section of code that is causing problems, and possibly refactor or use a workaround. If it is possible for you to create a scenario where you start with half the information already there (hopefully in UAT) ask the user to try that approach, and see if the error still occurs. Do you best to create alternate but similar approaches that get the error into a different light so that you can examine it better.
Short answer: Conduct a detailed code review on the suspected faulty code, with the aim of fixing any theoretical bugs, and adding code to monitor and log any future faults.
Long answer:
To give a real-world example from the embedded systems world: we make industrial equipment, containing custom electronics, and embedded software running on it.
A customer reported that a number of devices on a single site were experiencing the same fault at random intervals. Their symptoms were the same in each case, but they couldn't identify an obvious cause.
Obviously our first step was to try and reproduce the fault in the same device in our lab, but we were unable to do this.
So, instead, we circulated the suspected faulty code within the department, to try and get as many ideas and suggestions as possible. We then held a number of code review meetings to discuss these ideas, and determine a theory which: (a) explained the most likely cause of the faults observed in the field; (b) explained why we were unable to reproduce it; and (c) led to improvements we could make to the code to prevent the fault happening in the future.
In addition to the (theoretical) bug fixes, we also added monitoring and logging code, so if the fault were to occur again, we could extract useful data from the device in question.
To the best of my knowledge, this improved software was subsequently deployed on site, and appears to have been successful.
resolved "sterile" and "spooky"
We have two closed bug categories for this situation.
sterile - cannot reproduce.
spooky - it's acknowledged there is a problem, but it just appears intermittently, isn't quite understandable, and gives everyone a faint case of the creeps.
Error-reporting, log files, and stern demands to "Contact me immediately if this happens again."
If it happens in one context, and not in another, we try to enumerate the difference between both, and eliminate them.
Sometimes this works (e.g. other hardware, dual core vs. hyperthreading, laptop-disk vs. workstation disk, ...).
Sometimes it doesn't. If it's possible, we may start remote-debugging. If that doesn't help, we may try get our hands on the customer's system.
But of course, we don't write too many bugs in the first place :)
Well, you try your best to reproduce it, and if you can't, you take a long think and consider how such a problem might arise. If you still have no idea, then there's not much you can do about it.
Some of the new features in Visual Studio 2010 will help. See:
Historical Debugger and Test Impact Analysis in Visual Studio Team System 2010
Better Software Quality with Visual Studio Team System 2010
Manual Testing with Visual Studio Team System 2010
Sometimes the bug is not reproducible even in a pre-production environment that is the exact duplicate of the production environment. Concurrency issues are notorious for this.
Random Failures Are Often Concurrency Issues
Link: https://pragprog.com/tips/
The reason can be simply because of the Heisenberg effect, i.e. observation changes behaviour. Another reason can be because the chances are very small of hitting the combination of events that triggers the bug.
Sometimes you are lucky and you have audit logs that you can playback, greatly increasing the chances of recreating the issue. You can also stress the environment with high volumes of transactions. This effectively compresses time so that if the bug occurs say once a week, you may be able to reliably reproduce it in 1 day if you stress the system to 7 X the production load.
The last resort is whitebox testing where you go through the code line by line writing unit tests as you go.
I add logging to the exception handling code throughout the program. You need a method to collect the logs (users can email it, etc.)
Preemptive checks for code versions and sane environments are a good thing too. With the ease of software updates these days the code and environment the user is running has almost certainly not been tested. It didn't exist when you released your code.
With a web project I'm developing at the moment I'm doing something very similar to your technique. I'm building a page that I can direct users to in order to collect information such as their browser version and operating system. I'll also be collecting the apps registry info so i can have a look at what they've been doing.
This is a very real problem. I can only speak for web development, but I find users are rarely able to give me the basic information I would need to look into the issue. I suspect it's entirely possible to do something similar with other kinds of development. My plan is to keep working on this system to make it more and more useful.
But my policy is never to close a bug simply because I can't reproduce it, no matter how annoying it may be. And then there's the cases when it's not a bug, but the user has simply gotten confused. Which is a different type of bug I guess, but just as important.
You talk about problems that are reproducible but only on some systems. These are easy to handle:
First step: By using some sort of remote software, you let the customer tell you what to do to reproduce the problem on the system that has it. If this fails, then close it.
Second step: Try to reproduce the problem on another system. If this fails, make an exact copy of the customers system.
Third step: If it still fails, you have no option than to try to debug it on the customer system.
Once you can reproduce it, you can fix it. Doesn't matter on what system.
The tricky issue are truly non-reproducible issues, that is things that happen only intermittently. For that I'll have to chime in with the reports, logs and stern demands attitude. :)
It is important to categorize such bugs (rarely reproducible) and act on them differently than bugs that are frequently reproducible based on specific user actions.
Clear issue description along with steps to reproduce and observed behavior: Unambiguous reporting helps in understanding of the issue by entire team eliminating incorrect conclusions. For example, user reporting blank screen is different than HMI freeze on user action. Sequence of steps and approx timing of user action is also important.Did the user immediately select the option after screen transition or waited for a few minutes? An interesting bug concerning timing is a car allergic to vanilla ice-cream that baffled automotive engineers.
System config and startup parameters: Sometimes even hardware configuration and application software version (including drivers and firmware version) may do a trick or two. Mismatch of version or configuration can result in issues that are difficult to reproduce in other setups. Hence these are essential details to be captured. Most bug reporting tools have these details as mandatory parameters to report while logging an issue.
Extensive Logging: This is dependent on the logging facilities followed in concerned projects. While working with embedded Linux systems, we not only provide general diagnostic logs, but also system level logs like dmesg or top command logs. You may never know that wrong part is not the code flow but the abnormal memory usage/CPU usage. Identify the type of the issue and report the relevant logs for investigation.
Code Reviews and Walk-through: Dev teams cannot wait forever to reproduce these issues at their end and then take action. Bug report and available logs should be investigated and various possibilities be identified on this basis from design and code. If required, they should prepare hotfix on possible root causes and circulate the hotfix among teams including the tester who identified it to see if bug is reproducible with it.
Don't close these issues based on observation by a single tester/team after a fix is identified and checked in: Perhaps the most important part is approach followed to close these issues. Once fix of these issues has been checked in, all testing/validation teams at different locations should be informed on it for running intensive tests and identifying regression errors if any. Only all (practically most of them) of them reports as non-reproducible, a closure assessment has to be done by senior management.
If it is not reproduce able get logs, screen shots of exact steps to reproduce.
There's a nice new feature in Windows 7 that allows the user to record what they're doing and then send a report - it comes through as a doc with screen-shots of every stage. Hopefully it'll help in the cases where it's the user interacting with the application in an order that the developer wouldn't think of. I've seen plenty of bugs where it's just a case that the developer's logical way of using the app doesn't fit with how end users actually do it... resulting in lots of subtle errors.
Logging is your friend!
Generally what happens when we discover a bug that we can't reproduce is we either ask the customer to turn on more logging (if its available), or we release a version with extra logging added around the area we are interested in. Generally speaking the logging we have is excellent and has the ability to be very verbose, and so releasing versions with extra logging doesn't happen often.
You should also consider the use of memory dumps (which IMO also falls under the umbrella of logging). Producing a minidump is so quick that it can usually be done on production servers, even under load (as long as the number of dumps being produced is low).
The way I see it: Being able to reproduce a problem is nice because it gives you an environment where you can debug, experiement and play around in more freely, but - reproducing a bug is by no means essential to debug it! If the bug is only happening on someone else system then you still need to diagnose and debug the problem in the same way, its just that this time you need to be cleverer about how you do it.
The accepted answer is the best general approach. At a high level, it's worth weighing the importance of fixing the bug against what you could add as a feature or enhance that would benefit the user. Could a 'non-reproducible' bug take two days to fix? Could a feature be added in that time that gives users more benefit than that bug fix? Maybe the users would prefer the feature. I've been fixated at times as a developer on imperfections I can see, and then users are asked for feedback and none of them actually mention the bug(s) that I can see, but the software is missing a feature that they really want!
Sometimes, simple persistence in attempting to reproduce the bug whilst debugging can be the most effective approach. For this strategy to work, the bug needs to be 'intermittent' rather than completely 'non-reproducible'. If you can repeat a bug even one time in 10, and you have ideas about the most likely place it's occurring, you can place breakpoints at those points then doggedly attempt to repeat the bug and see exactly what's going on. I've experienced this to be more effective than logging in one or two cases (although logging would be my first go-to in general).

Useful and unuseful real life development techniques

I work on a medium/large company that follows what I think are some good practices for development, maybe not the best ones but good enough.
We've some development resource that get implemented on the basis of "do, test, if it useful the use, else throw away". I've found that most of the so called best practices are, sometimes, ideally great but unfeasible or even harmful in real life.
For example, we use to have a dotproject website for our team. The idea was to track tasks, update progress and so on. We used the "do, test, if" with it and the result was... we threw it away and just keep the forum that was extremly useful to communicate between us and keep track of conclusions of meetings and TO DO lists... Tracking each task on the other hand proved to be both time consuming and unrealistic.
Firs of all nobody was doing it, it didn't take a lot of time but developers hated it and make them unhappy because they had to remember updating every task, the estimates for the times of the task proved to be unrealistic most of the time.
So my question is, What development techniques have you tried and found useful/unuseful?
I mean that as in real life, not some theoretical best practice, but as a hands on experience. I'm looking to explore new techniques (or tools or whatever) and I'd like opinions on what to do next. Our current status:
Internal issue tracking system (Useful)
Semiautomated builds (every developer has to maintain the equivalent of a makefile in order for the system to be able to make them).
NO automated testing. Test are performed by a test team. We have integration tests and wide system tests.
Two test labs, one for the Test team, the other for the developers (in case they need to perform test that involves more than one machine or to test out of the development machine)
No unit testing in general. Some libraries have them but usually the developer test its units as he wants.
Full Specification using DOORs.
Test protocols. Formal, written in Word.
Source control (Clear Case). Usually everything is done in the main branch, whenever we ship a version it gets labeled, if needed a branch is made for fixes to that version.
Note: When you can (if you don't mind :P), could you try to justify your proposal? How and why was it useful? How did improve your work?
One of the most useful things we introduced was a project Wiki, an extremely useful dumping ground for all the little titbits of information floating around in peoples head but too trivial to record in a full document.
Having been involved in all sorts of development teams with different methodologies, my experience is that most of the agile principles tend to work great. It generally makes development more fun, since everyone is more engaged. In the larger development environments the basic principle of co-locating all team members brings great benefits, especially when you have separate information analysts and testers next to developers. Have information analysts, testers and developers work together per feature. This creates the best flow of information, with as little overhead as possible. You can take this even further to a Lean development process.
In general, the things that gave us the greatest benefits were the things that lowered the barriers of communication. In a practical sense, a company wiki helped out a great deal as well, lowering the barrier for sharing information. A good bug/features/RFC tracking tool also helped a great deal to have a joint understanding amongst stakeholders of what direction the project is heading. And such a tracking tool does not just have to be internal: lower the barriers to your customers as well. This also helps a great deal in managing expectations.
I feel I'm just getting started here. Others will no doubt come with more suggestions.
Pascal.
Follow this link...
And personally, as a developer I prefer to focus on things that improve my performance. I don't mind checking some bug reporting site to check if new bugs are reported, but I need to be able to view it quickly without having to go through a few dozens of pages or dozens of clicks.
I don't mind writing technical designs before writing code either, as long as I have the tools to write it properly. These tools must be created to increase performance with a minimum of fluffy features that no one uses. For example, in the past I've used Enterprise Architect to create UML models before writing code. It worked fine but the application has some flaws and lots of features that I don't need. When I discovered Altoma UModel, I quickly changed to a much leaner tool for UML generation that offered me exactly what I need. Nothing more, nothing less.
Basically, you have to keep people focused on the final goal. And the final goal is creating some product to be used by your users. Many development teams get lost because they focus on other things instead. None of your users will care about how you created the thing they use. They just need your project to reach their goals.
Thus, the best practice is that which makes your team the most comfortable, including any new team member that would join halfway of any project.
Personally, I'm a very big fan of Automatic Testing (both unit tests and integration tests). In my view it's like a safety net - developers feel safer when changing code when they know there's a test harness that could catch them if the break something. This allows you faster introduction of new features, but also removes the 'fear' of refactoring, for example.
As management for our team: SCRUM

When must you use poor design to finish a project?

There are many different bad practices, such as memory leaks, that are easy to slip into a program on accident. Sometimes, they might even be able to jury-rig your program together.
I'm working on a project right now and it works if I deliberately put a memory leak in my code. If I take the leak out, the code crashes. Obviously this is bad, and needs to be (and will be) fixed soon.
My question is, when do you decide to deliver code in this state, if it's not possible to release code without these poor practices, in time?
If the problem's impact on actual usage of the system can be reasonably expected to be none or negligible, and the delivery date cannot be pushed back, and it can be fixed within a scope of time before the problem's impact becomes more than negligible, ship it.
Obviously this is not ideal or even recommended, but you're clearly pushed into a corner at this point. Sometimes there are no good choices, but pragmatism must win over formal correctness. If an application has a memory leak, but we can reasonably expect that the app will be recycled or machine restarted or whatever before the leak becomes a real problem, that can sometimes be better than delivering late. It depends on the conditions of the agreement and the customer.
It's always better to at least try to push back the delivery date, but I am assuming you've already tried that and it's not an option here.
It is typical once an application has been shipped to ignore technical debt and move on. It's the responsibility of the developers to clearly communicate to the stakeholders the importance of paying off some of that debt, especially in a case like this.
However, given that it seems the customer cares more about a delivery date than correctness, it's less likely anyone will be convinced to pay off any debt once you go live. This is a bad situation to be in. Only the person with all the facts can make the right choice.
"My question is, when do you decide to deliver code in this state, if it's not possible to release code without these poor practices, in time?"
Never.
What you do instead: prioritize and focus.
If what you're working on is really high-priority, and you've mis-designed it, something low- priority has to be sacrificed. Often, some feature(s) must be delayed to give you time to focus on the high priority feature that doesn't work.
If what you're working on is really low-priority, you have to ask why you're not working on something higher-priority. And you still have to focus and prioritize. Sometimes things which are very low priority must be sacrificed.
When you can't do "everything" you have to pick things you can do that will be reasonably bug-free.
You might be interested in the concept of technical debt.
You only have three knobs you can turn when shipping software, assuming a fixed number of developers: features, quality and ship date, and turning one up means the others get turned down.
One of the most difficult things to do in software development is to build your product with the knobs set just right. For example, the Duke Nukem Forever guys have turned the features and quality knob up to eleven and thrown the ship date knob out the window. Microsoft often seems to glue the ship date knob in place and turn down the feature knob as needed, then unglue the ship date knob, turn it up a bit, glue it back down and continue twiddling the other two. And there are seem to be an endless amount of products out there that ship all the time but never put in the features they need to be successful.
In the end, you don't get paid if you don't deliver. Having poor quality hurts you terribly in the long run; reputations are hard to rebuild. It has almost always been that the right thing to do is to cut features if you have too many bugs. Always.
However, bug triage is just as important as feature development. What kind of leak are we talking about here? Are you leaking a byte? A small object? One thousand objects? Entire DLLs? There are scenarios where its probably better to leak a little than to fail to deliver the product.
And what do you mean by leak? Does your application have a well defined life cycle? Where you allocate something once at startup and then never free it until the process dies? Well how long does your process run? Do you expect to run multiple copies of your process?
Obviously you never want to leak, and you should work to develop best practices that minimize leaking, but in the end you have to make a judgment call. Maybe you can just explain the bug to your customers, explain the impact, and they'll buy it anyway. Maybe you can patch it a week later. Maybe you really do need to fix it. But we'd need to know more about it to give good advice.
I will say I have shipped known leaks in the past. I won't say what product or company, but I had a bug where DLL dependencies and insane lifetime management made it next to impossible to correctly free our references to a certain DLL once it was loaded, so in the end we just leaked it. And I still think it was the right thing to do. Other times I've seen things deliberately leaked to keep third party code that was written incorrectly from crashing (though that is a completely separate debate).
But in the end I believe such instances are rare and once you have identified the source of a memory leak, it shouldn't take much more than a day to fix it. It is rare indeed that I would ship with a memory leak that was known and a fix was known. It would have to be something that required a major re-architect involving changing a threading model, or refactoring huge swaths of code, and it would literally have to be a day or two before the product was to ship. At that point I might just leak the memory and promise a patch in a weak after proper testing could be done on the re-architect.
I would be very uncomfortable releasing with such a known bug. It is likely to occur in another way.
You have not specified your environment or language, but I suggest you look at using a memory checking tool such as:
Purify (trial available)
BoundsChecker
Valgrind
or even a free one, Visual Leak Detector
Perhaps, when you are not going to be around to maintain the code later, you don't care about your client/employer and none of the ramifications of your code could possibly affect you.
In other words, in your professional coding life, it's never a really good idea.
If you are working for someone that is less concerned about code quality than you are and simply wants you to finish the code at all costs, then I can see how you'd be in a difficult situation. Finishing faster but poorer will earn you some immediate reward. You should remember though that even if failing to meet your employer/client's expectation for a milestone bites you only once, your poor code may continue to bite you into the future, not only through the difficulties in maintaining it but also through the negative impression others may form of you down the track.
If its a single (or limited) memory leak, and it doesn't grow, and say it only causes a crash when shutting down (the most common case of stuff like this), then it depends. If its a client/desktop software and the users are going to crash every time on their way out, I'd make it an ultra high priority. If its server, and the only one running the server is you, and everything else works fine, I'd say its alright to enter beta. But if the leaks grow, or can cause crashes at "random" times they need to be fixed asap.
To get past an internal milestone, it's arguable, although still nothing to be taken likely.
To release, never. It always comes back and bites you. If your software is in such a bad space that a piece of poor design will get it over the line, you've got much bigger problems looming round the corner
Never, unless you don't care about the poor developer who is going to be maintaining your work afterwards.
Ultimately, a decision like this should be made by the customer or the project manager. Individual developers should not be making these kinds of decisions alone, or keeping this information to themselves.
Tell them what the problem is, and what the consequences will be for not fixing it. If they want you to ship it broken on time, that's their call.
If you don't want to work for people who accept shoddy products, that's your call, but it's a mistake to think that developers have some sort of professional responsibility to ignore their clients' and bosses' quality/cost/time priorities.
If somebody may actually die if you ship bad software, then don't do it, but if the worst-case scenario is that somebody is going to have to reboot a couple times per day, then do what you're told or find another job.

How to detect and debug multi-threading problems?

This is a follow up to this question, where I didn't get any input on this point. Here is the brief question:
Is it possible to detect and debug problems coming from multi-threaded code?
Often we have to tell our customers: "We can't reproduce the problem here, so we can't fix it. Please tell us the steps to reproduce the problem, then we'll fix it." It's a somehow nasty answer if I know that it is a multi-threading problem, but mostly I don't. How do I get to know that a problem is a multi-threading issue and how to debug it?
I'd like to know if there are any special logging frameworks, or debugging techniques, or code inspectors, or anything else to help solving such issues. General approaches are welcome. If any answer should be language related then keep it to .NET and Java.
Threading/concurrency problems are notoriously difficult to replicate - which is one of the reasons why you should design to avoid or at least minimize the probabilities. This is the reason immutable objects are so valuable. Try to isolate mutable objects to a single thread, and then carefully control the exchange of mutable objects between threads. Attempt to program with a design of object hand-over, rather than "shared" objects. For the latter, use fully synchronized control objects (which are easier to reason about), and avoid having a synchronized object utilize other objects which must also be synchronized - that is, try to keep them self contained. Your best defense is a good design.
Deadlocks are the easiest to debug, if you can get a stack trace when deadlocked. Given the trace, most of which do deadlock detection, it's easy to pinpoint the reason and then reason about the code as to why and how to fix it. With deadlocks, it always going to be a problem acquiring the same locks in different orders.
Live locks are harder - being able to observe the system while in the error state is your best bet there.
Race conditions tend to be extremely difficult to replicate, and are even harder to identify from manual code review. With these, the path I usually take, besides extensive testing to replicate, is to reason about the possibilities, and try to log information to prove or disprove theories. If you have direct evidence of state corruption you may be able to reason about the possible causes based on the corruption.
The more complex the system, the harder it is to find concurrency errors, and to reason about it's behavior. Make use of tools like JVisualVM and remote connect profilers - they can be a life saver if you can connect to a system in an error state and inspect the threads and objects.
Also, beware the differences in possible behavior which are dependent on the number of CPU cores, pipelines, bus bandwidth, etc. Changes in hardware can affect your ability to replicate the problem. Some problems will only show on single-core CPU's others only on multi-cores.
One last thing, try to use concurrency objects distributed with the system libraries - e.g in Java java.util.concurrent is your friend. Writing your own concurrency control objects is hard and fraught with danger; leave it to the experts, if you have a choice.
I thought that the answer you got to your other question was pretty good. But I'll emphasis these points.
Only modify shared state in a critical section (Mutual Exclusion)
Acquire locks in a set order and release them in the opposite order.
Use pre-built abstractions whenever possible (Like the stuff in java.util.concurrent)
Also, some analysis tools can detect some potential issues. For example, FindBugs can find some threading issues in Java programs. Such tools can't find all problems (they aren't silver bullets) but they can help.
As vanslly points out in a comment to this answer, studying well placed logging output can also very helpful, but beware of Heisenbugs.
For Java there is a verification tool called javapathfinder which I find it useful to debug and verify multi-threading application against potential race condition and death-lock bugs from the code.
It works finely with both Eclipse and Netbean IDE.
[2019] the github repository
https://github.com/javapathfinder
Assuming I have reports of troubles that are hard to reproduce I always find these by reading code, preferably pair-code-reading, so you can discuss threading semantics/locking needs. When we do this based on a reported problem, I find we always nail one or more problems fairly quickly. I think it's also a fairly cheap technique to solve hard problems.
Sorry for not being able to tell you to press ctrl+shift+f13, but I don't think there's anything like that available. But just thinking about what the reported issue actually is usually gives a fairly strong sense of direction in the code, so you don't have to start at main().
In addition to the other good answers you already got: Always test on a machine with at least as many processors / processor cores as the customer uses, or as there are active threads in your program. Otherwise some multithreading bugs may be hard to impossible to reproduce.
Apart from crash dumps, a technique is extensive run-time logging: where each thread logs what it's doing.
The first question when an error is reported, then, might be, "Where's the log file?"
Sometimes you can see the problem in the log file: "This thread is detecting an illegal/unexpected state here ... and look, this other thread was doing that, just before and/or just afterwards this."
If the log file doesn't say what's happening, then apologise to the customer, add sufficiently-many extra logging statements to the code, give the new code to the customer, and say that you'll fix it after it happens one more time.
Sometimes, multithreaded solutions cannot be avoided. If there is a bug,it needs to be investigated in real time, which is nearly impossible with most tools like Visual Studio. The only practical solution is to write traces, although the tracing itself should:
not add any delay
not use any locking
be multithreading safe
trace what happened in the correct sequence.
This sounds like an impossible task, but it can be easily achieved by writing the trace into memory. In C#, it would look something like this:
public const int MaxMessages = 0x100;
string[] messages = new string[MaxMessages];
int messagesIndex = -1;
public void Trace(string message) {
int thisIndex = Interlocked.Increment(ref messagesIndex);
messages[thisIndex] = message;
}
The method Trace() is multithreading safe, non blocking and can be called from any thread. On my PC, it takes about 2 microseconds to execute, which should be fast enough.
Add Trace() instructions wherever you think something might go wrong, let the program run, wait until the error happens, stop the trace and then investigate the trace for any errors.
A more detailed description for this approach which also collects thread and timing information, recycles the buffer and outputs the trace nicely you can find at:
CodeProject: Debugging multithreaded code in real time 1
A little chart with some debugging techniques to take in mind in debugging multithreaded code.
The chart is growing, please leave comments and tips to be added.
(update file at this link)
Visual Studio allows you to inspect the call stack of each thread, and you can switch between them. It is by no means enough to track all kinds of threading issues, but it is a start. A lot of improvements for multi-threaded debugging is planned for the upcoming VS2010.
I have used WinDbg + SoS for threading issues in .NET code. You can inspect locks (sync blokcs), thread call stacks etc.
Tess Ferrandez's blog has good examples of using WinDbg to debug deadlocks in .NET.
assert() is your friend for detecting race-conditions. Whenever you enter a critical section, assert that the invariant associated with it is true (that's what CS's are for). Though, unfortunately, the check might be expensive and thus not suitable for use in production environment.
I implemented the tool vmlens to detect race conditions in java programs during runtime. It implements an algorithm called eraser.
Develop code the way that Princess recommended for your other question (Immutable objects, and Erlang-style message passing). It will be easier to detect multi-threading problems, because the interactions between threads will be well defined.
I faced a thread issue which was giving SAME wrong result and was not behaving un-predictably since each time other conditions(memory, scheduler, processing load) were more or less same.
From my experience, I can say that HARDEST PART is to recognize that it is a thread issue, and BEST SOLUTION is to review the multi-threaded code carefully. Just by looking carefully at the thread code you should try to figure out what can go wrong. Other ways (thread dump, profiler etc) will come second to it.
Narrow down on the functions that are being called, and rule out what could and could not be to blame. When you find sections of code that you suspect may be causing the issue, add lots of detailed logging / tracing to it. Once the issue occurs again, inspect the logs to see how the code executed differently than it does in "baseline" situations.
If you are using Visual Studio, you can also set breakpoints and use the Parallel Stacks window. Parallel Stacks is a huge help when debugging concurrent code, and will give you the ability to switch between threads to debug them independently. More info-
https://learn.microsoft.com/en-us/visualstudio/debugger/using-the-parallel-stacks-window?view=vs-2019
https://learn.microsoft.com/en-us/visualstudio/debugger/walkthrough-debugging-a-parallel-application?view=vs-2019
I'm using GNU and use simple script
$ more gdb_tracer
b func.cpp:2871
r
#c
while (1)
next
#step
end
The best thing I can think of is to stay away from multi-threaded code whenever possible. It seems there are very few programmers who can write bug free multi threaded applications and I would argue that there are no coders beeing able to write bug free large multi threaded applications.

Which is better: shipping a buggy feature or not shipping the feature at all?

this is a bit of a philosophical question. I am adding a small feature to my software which I assume will be used by most users but only maybe 10% of the times they use the software. In other words, the software has been fine without it for 3 months, but 4 or 5 users have asked for it, and I agree that it should be there.
The problem is that, due to limitations of the platform I'm working with (and possibly limitations of my brain), "the best I can do" still has some non-critical but noticeable bugs - let's say the feature as coded is usable but "a bit wonky" in some cases.
What to do? Is a feature that's 90% there really "better than nothing"? I know I'll get some bug reports which I won't be able to fix: what do I tell customers about those? Should I live with unanswered feature requests or unanswered bug reports?
Make sure people know, that you know, that there are problems. That there are bugs. And give them an easy way to proide feedback.
What about having a "closed beta" with the "4 or 5 users" who suggested the feature in the first place?
There will always be unanswered feature requests and bug reports. Ship it, but include a readme with "known issues" and workarounds when possible.
You need to think of this from your user's perspective - which will cause less frustration? Buggy code is usually more frustrating than missing features.
Perfectionists may answer "don't do it".
Business people may answer "do it".
I guess where the balance is is up to you. I would be swaying towards putting the feature in there if the bugs are non-critical. Most users don't see your software the same way you do. You're a craftsman/artist, which means your more critical than regular people.
Is there any way that you can get a beta version to the 4-5 people who requested the feature? Then, once you get their feedback, it may be clear which decision to make.
Precisely document the wonkiness and ship it.
Make sure a user is likely to see and understand your documentation of the wonkiness.
You could even discuss the decision with users who have requested the feature: do some market research.
Just because you can't fix it now, doesn't mean you won't be able to in the future. Things change.
Label what you have now as a 'beta version' and send it out to those people who have asked for it. Get their feedback on how well it works, fix whatever they complain about, and you should then be ready to roll it out to larger groups of users.
Ship early, ship often, constant refactoring.
What I mean is, don't let it stop you from shipping, but don't give up on fixing the problems either.
An inability to resolve wonkiness is a sign of problems in your code base. Spend more time refactoring than adding features.
I guess it depends on your standards. For me, buggy code is not production ready and so shouldn't be shipped. Could you have a beta version with a known issues list so users know what to expect under certain conditions? They get the benefit of using the new features but also know that it's not perfect (use that their own risk). This may keep those 4 or 5 customers that requested the feature happy for a while which gives you more time to fix the bugs (if possible) and release to production later for the masses.
Just some thoughts depending on your situation.
Depends. On the bugs, their severity and how much effort you think it will take to fix them. On the deadline and how much you think you can stretch it. On the rest of the code and how much the client can do with it.
I would not expect coders to deliver known problems into test let alone to release to a customer.
Mind you, I believe in zero tolerance of bugs. Interestingly I find that it is usually developers/ testers who are keenest to remove all bugs - it is often the project manager and/ or customer who are willing to accept bugs.
If you must release the code, then document every feature/ bug that you are aware of, and commit to fixing each one.
Why don't you post more information about the limitations of the platform you are working on, and perhaps some of the clever folk here can help get your bug list down.
If the demand is for a feature NOW, rather than a feature that works. You may have to ship.
In this situation though:
Make sure you document the bug(s)
and consequences (both to the user
and other developers).
Be sure to add the bug(s) to your
bug tracking database.
If you write unit tests (I hope so),
make sure that tests are written
which highlight the bugs, before you
ship. This will mean that when you
come to fix the bugs in the future,
you know where and what they are,
without having to remember.
Schedule the work to fix the bugs
ASAP. You do fix bugs before
writing new code, don't you?
If bugs can cause death or can lose users' files then don't ship it.
If bugs can cause the application to crash itself then ship it with a warning (a readme or whatever). If crashes might cause the application to corrupt the users' files that they were in the middle of editing with this exact application, then display a warning each time they start up the application, and remind them to backup their files first.
If bugs can cause BSODs then be very careful about who you ship it to.
If it doesn't break anything else, why not ship it? It sounds like you have a good relationship with your customers, so those who want the feature will be happy to get it even if it's not all the way there, and those who don't want it won't care. Plus you'll get lots of feedback to improve it in the next release!
The important question you need to answer is if your feature will solve a real business need given the design you've come up with. Then it's only a matter of making the implementation match the design - making the "bugs" being non-bugs by defining them as not part of the intended behaviour of the feature (which should be covered by the design).
This boils down to a very real choice of paths: is a bug something that doesn't work properly, that wasn't part of the intended behaviour and design? Or is it a bug only if if doesn't work in accordance to the intended behaviour?
I am a firm believer in the latter; bugs are the things that do not work the way they were intended to work. The implementation should capture the design, that should capture the business need. If the implementation is used to address a different business need that wasn't covered by the design, it is the design that is at fault, not the implementation; thus it is not a bug.
The former attitude is by far the most common amongst programmers in my experience. It is also the way the user views software issues. From a software development perspective, however, it is not a good idea to adopt this view, because it leads you to fix bugs that are not bugs, but design flaws, instead of redesigning the solution to the business need.
Coming from someone who has to install buggy software for their users - don't ship it with that feature enabled.
It doesn't matter if you document it, the end users will forget about that bug the first time they hit it, and that bug will become critical to them not being able to do their job.

Resources