MS Access writing to Excel over multiple minutes will sometimes throw a false error - excel

I'm someone who solves problems by looking, not asking. So this is new to me. This has been an issue for years, and it crops up with different computers, networks, versions and completely different code. There is a lot here, so, thank you in advance if you are willing to read the whole thing.
Generally speaking, I write MS Access programs that will open Excel and then create multiple worksheets inside of a workbook using data from Access tables and/or Excel sheets. The process can take a couple of minutes to run and occasionally, it will get an error. I could tell you the error message, but it doesn't matter because it will be different depending where the error occurs. When it occurs I simply click debug and click continue and it... continues. If it errors out again (many loops later), it will happen in the exact same spot.
So, what I start with is to make minor changes to the code. In the current program I'm working on, the error happens when I write to a cell and the value is a value directly from a table. I created a variable, copied the value to the variable and then wrote to the cell. The error moved to a completely different part of the program and it became a "paste" error. Generally what fixes it is to put a wait function at the spot where the error occurs. One second is usually good enough. Sometimes it takes a couple of these, but that usually solves it. It only took one delay per loop this time, so it is working. I just hate causing delays in my program. So... Has anyone seen anything like this before, or is it just me. It feels like a timing issue between Access and Excel since the delays are usually helpful. Thanks in advance.

I dug up my last major Access project that interacted with Word (ca. 2016) where I struggled with similar issues. I see many, many Debug.Print statements (some commented, some still active), but unlike what I recalled earlier in my comments, I don't see any "wait" statements anymore! From what I now recall and after re-inspecting the code, most problems were resolved by
implementing robust error handling and best practices for always closing automation objects (and/or releasing the objects if I wanted the instances to persist)
subscribing to and utilizing appropriate automation object events to detect and handle interaction rather than trying to force everything into serialized work-then-wait code. To do this, I placed all automation code in well-structured classes that declared automation objects WithEvents (in VBA of course) and then defined relevant event handlers for actions I was effecting. I now recall finally being able to avoid weird errors and application hangs, etc.
You also may never get a good answer to a question like this, so despite that I am not an absolute expert on Office development, I have had my own experience with frustrating bugs like this and so I'll share my 2 cents. This may not be satisfying, but after experiencing similar behavior using office automation objects, my general understanding is that interaction between OS processes are not deterministic. Especially since VBA generally has no threading or parallelism concerns, it can be strange to deal with objects that behave in unpredictable ways. The time slices given to each process separately is at the mercy of the OS and it will vary greatly with multiple processors/cores, running processes, memory management, etc. Despite the purpose of the automation objects--to control instances of office apps--the API's are not designed well for inter-application processes.
Although it would be great if old automation code would produces more useful errors, perhaps nested exceptions (like in .Net and other modern environments), something that indicates delays and timeouts within callbacks between automation objects, instead you get hodgepodge of various context errors.
My hardware is old, but still ticking. I often get delays, even if only for a second, when switching between apps, etc. Instead of thinking of it as an error, I just perceive it as a slow machine, just wait and continue. It may be useful to consider these type of random errors as similar delays. If a wait call here or there resolves the issue, however annoying, that may just be the best solution... wait and continue.
Every now and then after debugging these types of issues I would actually discover the underlying problem and be able to fix it. At the least I would be able to avoid actual problems with the data, despite errors being raised, just like you describe. But even when I felt that I understood the problem, the answer was still often to do exactly as you have done and just add a short wait.

I do believe now this is a timing issue. After thinking things through, I realized that I could easily (well 3 hours later) separate the database info from the spreadsheet info and then move the updated code that is causing problems into an Excel Macro. I then called that macro from Access. Not only do the errors go away, but it runs about 4 times faster. It's not surprising, I just hadn't thought of that direction before.

Related

There's any way to delete objects in VB6 without getting errors?

I need to simplify a program made in VB6, so I need to delete a lot of objects but don't want to delete one, run, get an error, fix it, run, get another error, a so on for every object.
I don't created the original program, so is very annoying try to think what someone else thought.
Plus, the code is not commented and is very dirty, the objects have the default names, like Label1, Timer12. WTF?
So I can't get a clue about what a timer does until I fully read the code.
Any idea?
Perhaps the original author didn't have much thought, true. But then you're not giving it much thought either by thinking that the best approach will be to delete an object, run and see what happens, and then rinse and repeat until you have a fully working program. Given your approach, by the time you've finished deleting all the objects - you won't have much left to run anyway! (No data, No Problem!).
Look through the code in conjunction with running and debugging the application. This will give you an idea of what it is trying to do (I am under the assumption there's not really much in the way of documentation for this project?)
Time to think of a new plan! Sometimes re-writing from scratch is absolutely necessary rather than attempting to work with legacy code that is clearly hard to maintain and is far from optimum. The downside is that this will take more time, on paper - but perhaps will save time in the long run. This might be the time also to move away from VB6 and in to the .NET era. Rewriting applications can let you do this too!
If a re-write is out of the question (such a shame), then you can't go through willy-nilly deleting objects! The program functions with these objects. So they've got to stay and you need to maintain it.
See (2).
Good luck, God speed.
Or, given your comment
Rename Label1 to something more appropriate. Do a search replace within that file (Not the whole solution) to the new name, then correct any accidental errors and move on.
I would go with the last option of Moo-Juice's answer, but slightly different: I would rename a control by adding an "x" to its name and run the project.
If it runs, then remove the control, if it doesnt run, then rename the control back by removing the "x".
This should not take more than 1 hour for about 250 controls
But first you might do something else: Instead of just deleting the unused controls, you might also want to look at the controls:
If there are so many controls then it might be possible to convert series of controls into a control array: This might clean up the code a lot as well, and give you a better overview of the remaining controls
When you cleaned up the controls have a look at their names, and rename them to something useful.
After that it is time to clean up the code in the same way.
You can download MZ-Tools to help you with that.
It's a must for the lazy coder :)
Also place "Option Explicit" at the top of all code windows. This will make it easier for you to spot errors caused by not (properly) declared variables.
Do this after you have done "Review Source Code" of MZ-Tools though, as that might save you a lot of work :)

When exactly am I required to set objects to nothing in classic asp?

On one hand the advice to always close objects is so common that I would feel foolish to ignore it (e.g. VBScript Out Of Memory Error).
However it would be equally foolish to ignore the wisdom of Eric Lippert, who appears to disagree: http://blogs.msdn.com/b/ericlippert/archive/2004/04/28/when-are-you-required-to-set-objects-to-nothing.aspx
I've worked to fix a number of web apps with OOM errors in classic asp. My first (time consuming) task is always to search the code for unclosed objects, and objects not set to nothing.
But I've never been 100% convinced that this has helped. (That said, I have found it hard to pinpoint exactly what DOES help...)
This post by Eric is talking about standalone VBScript files, not classic ASP written in VBScript. See the comments, then Eric's own comment:
Re: ASP -- excellent point, and one that I had not considered. In ASP it is sometimes very difficult to know where you are and what scope you're in.
So from this I can say that everything he wrote isn't relevant for classic ASP i.e. you should always Set everything to Nothing.
As for memory issues, I think that assigning objects (or arrays) to global scope like Session or Application is the main reason for such problems. That's the first thing I would look for and rewrite to hold only single identifider in Session then use database to manage the data.
Basically by setting a COM object to Nothing, you are forcing its terminator to run deterministically, which gives you the opportunity to handle any errors it may raise.
If you don't do it, you can get into a situation like the following:
Your code raises an error
The error isn't handled in your code and therefore ...
other objects instantiated in your code go out of scope, and their terminators run
one of the terminators raises an error
and the error that is propagated is the one from the terminator going out of scope, masking the original error.
I do remember from the dark and distant past that it was specifically recommended to close ADO objects. I'm not sure if this was because of a bug in ADO objects, or simply for the above reason (which applies more generally to any objects that can raise errors in their terminators).
And this recommendation is often repeated, though often without any credible reason. ("While ASP should automatically close and free up all object instantiations, it is always a good idea to explicitly close and free up object references yourself").
It's worth noting that in the article, he's not saying you should never worry about setting objects to nothing - just that it should not be the default behaviour for every object in every script.
Though I do suspect he's a little too quick to dismiss the "I saw this elsewhere" method of coding behaviour, I'm willing to bet that there is a reason Eric didn't consider that has caused this to be passed along as a hard 'n' fast rule - dealing with junior programmers.
When you start looking more closely at the Dreyfus model of skill acquisition, you see that at the beginning levels of acquiring a new skill, learners need simple to follow recipes. They do not yet have the knowledge or ability to make the judgement calls Eric qualifies the recommendation with later on.
Think back to when you first started programming. Could you readily judge if you were "set[tting an] expensive objects to Nothing when you are done with them if you are done with them well before they go out of scope"? Did you really know which objects were expensive or when they truly went out of scope?
Thus, most entry level programmers are simply told "always set every object to Nothing when you are done with it" because it is within their grasp to understand and follow. Unfortunately, not many programmers take the time to self-educate, learn, and grow into the higher-level Dreyfus stages where you can use the more nuanced situational approach.
And then we come back to my earlier statement - even the best of us started out at that earlier stage, where we reflexively closed all objects because that was the best we were capable of. We left large bodies of code that people look at now, and project our current competence backwards to the earlier work and assume we did that for reasons we don't understand.
I've got to get going, but I hope to expand this a little futher...

Is PetraVM Jinx Beta 1 good?

PetraVM recently came out with a Beta release of their Jinx product. Has anyone checked it out yet? Any feedback?
By good, I mean:
1) easy to use
2) intuitive
3) useful
4) doesn't take a lot of code to integrate
... those kinds of things.
Thanks guys!
After literally stumbling across Jinx while poking around on Google, I have been on the beta and pre-beta tests with a fair amount of usage already under my belt. As with any beta related comments please understand that things may change or already have changed, so do keep this in mind and take the following with a grain of salt.
So, going through the list of questions one by one:
1) Install and go. Jinx adds a control panel to Visual Studio which you can mostly ignore as the defaults are typically good for most cases. Otherwise you just work normally and forget about it. Jinx does not instrument your code, require any additional libraries linked in or the numerous other things some tools require.
2) The question of "intuitive" is really up to the user. If you understand threaded code and the sorts of bugs possible, Jinx just makes those bugs happen much more frequently, which by itself is a huge benefit to people doing threaded code. While Jinx attempts to stop the code in a state that makes the problem as obvious as possible, "obvious" and "intuitive" are really up to the skill of the programmer.
3) Useful? Anyone who has done threaded code before knows that a race condition can happen regularly or once every month based on cosmic ray counts, that randomness makes debugging threaded code very difficult. With Jinx, even the most minor race condition can be reproduced usually on the first run consistently. This works even for lockless code that other static analysis or instrumenting tools would generally miss.
This sort of quick reproduction of problems is amazingly useful. Jinx has helped me track down a "one instruction in the wrong place" sort of bug that would actually hit once a week at most. Jinx forced the crash to happen almost immediately and allowed me to focus on the actual cause of the bug instead being completely in the dark as to the real source.
4) Integration with Jinx is a breeze. If you don't mind your machine becoming a bit slow, you can tell Jinx to watch the entire machine. It slows the machine down as it is actually watching everything on the machine, including the OS. While interresting and useful if your software consists of multiple processes on the same machine, this is not suggested as it can become painful to work with the machine.
Instead of using the global system, adding an include and two lines of code does the primary work needed of registering the process with Jinx such that Jinx can watch just the registered items. You can help Jinx by using the Jinx specific asserts and registering regions of code that are more important. In the case of the crash mentioned above though, I didn't have to do any of that and Jinx found the problem without the additional integration work. In any case, the integration is extremely simple.
After using Jinx for the last couple months, I have to say that overall it has been a great pleasure. I won't write new threaded code without Jinx running in the background anymore simply because it does its intended job of forcing obscure threading issues to be immediate assert/crashes. As mentioned, things that you could go weeks without seeing become problems almost immediately, this is a wonderful thing to have during initial test and implementation.
KRB
BTW, PetraVM has changed its name to Corensic and you can find Jinx Beta 2 over at www.corensic.com.
--Prashant, the marketing guy at Corensic

Any good strategies for dealing with 'not reproducible' bugs? [closed]

Closed. This question is opinion-based. It is not currently accepting answers.
Want to improve this question? Update the question so it can be answered with facts and citations by editing this post.
Closed 2 years ago.
Improve this question
Very often you will get or submit bug reports for defects that are 'not reproducible'. They may be reproducible on your computer or software project, but not on a vendor's system. Or the user supplies steps to reproduce, but you can't see the defect locally. Many variations on this scenario of course, so to simplify I guess what I'm trying to learn is:
What is your company's policy towards 'not reproducible' bugs? Shelve them, close them, ignore? I occasionally see intermittent, non reproducible bugs in 3-rd party frameworks, and these are pretty much always closed instantly by the vendor... but they are real bugs.
Have you found any techniques that help in fixing these types of bugs? Usually what I do is get a system info report from the user, and steps to reproduce, then search on keywords, and try to see any sort of pattern.
Verify the steps used to produce the error
Oftentimes the people reporting the error, or the people reproducing the error, will do something wrong and not end up in the same state, even if they think they are. Try to walk it through with the reporting party. I've had a user INSIST that the admin privileges were not appearing correctly. I tried reproducing the error and was unable to. When we walked it through together, it turned out he was logging in as a regular user in that case.
Verify the system/environment used to produce the error
I've found many 'irreproducible' bugs and only later discovered that they ARE reproducible on Mac OS (10.4) Running X version of Safari. And this doesn't apply only to browsers and rendering, it can apply to anything; the other applications that are currently being run, whether or not the user is RDP or local, admin or user, etc... Make certain you get your environment as close to theirs as possible before calling it irreproducible.
Gather Screenshots and Logs
Once you have verified that the user is doing everything correctly and still getting a bug, and that you're doing exactly what they do, and you are NOT getting the bug, then it's time to see what you can actually do about it. Screenshots and logs are critical. You want to know exactly what it looks like, and exactly what was going on at the time.
It is possible that the logs could contain some information that you can reproduce on your system, and once you can reproduce the exact scenario, you might be able to coax the error out of hiding.
Screenshots also help with this, because you might discover that "X piece has loaded correctly, but it shouldn't have because it is dependent on Y" and that might give you a hint. Even if the user can describe what were doing, a screen shot could help even more.
Gather step-by-step description from the user
It's very common to blame the users, and not trust anything that they say (because they call a 'usercontrol' a 'thingy') but even though they might not know the names of what they're seeing, they will still be able to describe some of the behaviour they are seeing. This includes some minor errors that may have occured a few minutes BEFORE the real error occurred, or possibly slowness in certain things that are usually fast. All these things can be clues to help you narrow down which aspect is causing the error on their machine and not yours.
Try Alternate Approachs to produce the error
If all else fails, try looking at the section of code that is causing problems, and possibly refactor or use a workaround. If it is possible for you to create a scenario where you start with half the information already there (hopefully in UAT) ask the user to try that approach, and see if the error still occurs. Do you best to create alternate but similar approaches that get the error into a different light so that you can examine it better.
Short answer: Conduct a detailed code review on the suspected faulty code, with the aim of fixing any theoretical bugs, and adding code to monitor and log any future faults.
Long answer:
To give a real-world example from the embedded systems world: we make industrial equipment, containing custom electronics, and embedded software running on it.
A customer reported that a number of devices on a single site were experiencing the same fault at random intervals. Their symptoms were the same in each case, but they couldn't identify an obvious cause.
Obviously our first step was to try and reproduce the fault in the same device in our lab, but we were unable to do this.
So, instead, we circulated the suspected faulty code within the department, to try and get as many ideas and suggestions as possible. We then held a number of code review meetings to discuss these ideas, and determine a theory which: (a) explained the most likely cause of the faults observed in the field; (b) explained why we were unable to reproduce it; and (c) led to improvements we could make to the code to prevent the fault happening in the future.
In addition to the (theoretical) bug fixes, we also added monitoring and logging code, so if the fault were to occur again, we could extract useful data from the device in question.
To the best of my knowledge, this improved software was subsequently deployed on site, and appears to have been successful.
resolved "sterile" and "spooky"
We have two closed bug categories for this situation.
sterile - cannot reproduce.
spooky - it's acknowledged there is a problem, but it just appears intermittently, isn't quite understandable, and gives everyone a faint case of the creeps.
Error-reporting, log files, and stern demands to "Contact me immediately if this happens again."
If it happens in one context, and not in another, we try to enumerate the difference between both, and eliminate them.
Sometimes this works (e.g. other hardware, dual core vs. hyperthreading, laptop-disk vs. workstation disk, ...).
Sometimes it doesn't. If it's possible, we may start remote-debugging. If that doesn't help, we may try get our hands on the customer's system.
But of course, we don't write too many bugs in the first place :)
Well, you try your best to reproduce it, and if you can't, you take a long think and consider how such a problem might arise. If you still have no idea, then there's not much you can do about it.
Some of the new features in Visual Studio 2010 will help. See:
Historical Debugger and Test Impact Analysis in Visual Studio Team System 2010
Better Software Quality with Visual Studio Team System 2010
Manual Testing with Visual Studio Team System 2010
Sometimes the bug is not reproducible even in a pre-production environment that is the exact duplicate of the production environment. Concurrency issues are notorious for this.
Random Failures Are Often Concurrency Issues
Link: https://pragprog.com/tips/
The reason can be simply because of the Heisenberg effect, i.e. observation changes behaviour. Another reason can be because the chances are very small of hitting the combination of events that triggers the bug.
Sometimes you are lucky and you have audit logs that you can playback, greatly increasing the chances of recreating the issue. You can also stress the environment with high volumes of transactions. This effectively compresses time so that if the bug occurs say once a week, you may be able to reliably reproduce it in 1 day if you stress the system to 7 X the production load.
The last resort is whitebox testing where you go through the code line by line writing unit tests as you go.
I add logging to the exception handling code throughout the program. You need a method to collect the logs (users can email it, etc.)
Preemptive checks for code versions and sane environments are a good thing too. With the ease of software updates these days the code and environment the user is running has almost certainly not been tested. It didn't exist when you released your code.
With a web project I'm developing at the moment I'm doing something very similar to your technique. I'm building a page that I can direct users to in order to collect information such as their browser version and operating system. I'll also be collecting the apps registry info so i can have a look at what they've been doing.
This is a very real problem. I can only speak for web development, but I find users are rarely able to give me the basic information I would need to look into the issue. I suspect it's entirely possible to do something similar with other kinds of development. My plan is to keep working on this system to make it more and more useful.
But my policy is never to close a bug simply because I can't reproduce it, no matter how annoying it may be. And then there's the cases when it's not a bug, but the user has simply gotten confused. Which is a different type of bug I guess, but just as important.
You talk about problems that are reproducible but only on some systems. These are easy to handle:
First step: By using some sort of remote software, you let the customer tell you what to do to reproduce the problem on the system that has it. If this fails, then close it.
Second step: Try to reproduce the problem on another system. If this fails, make an exact copy of the customers system.
Third step: If it still fails, you have no option than to try to debug it on the customer system.
Once you can reproduce it, you can fix it. Doesn't matter on what system.
The tricky issue are truly non-reproducible issues, that is things that happen only intermittently. For that I'll have to chime in with the reports, logs and stern demands attitude. :)
It is important to categorize such bugs (rarely reproducible) and act on them differently than bugs that are frequently reproducible based on specific user actions.
Clear issue description along with steps to reproduce and observed behavior: Unambiguous reporting helps in understanding of the issue by entire team eliminating incorrect conclusions. For example, user reporting blank screen is different than HMI freeze on user action. Sequence of steps and approx timing of user action is also important.Did the user immediately select the option after screen transition or waited for a few minutes? An interesting bug concerning timing is a car allergic to vanilla ice-cream that baffled automotive engineers.
System config and startup parameters: Sometimes even hardware configuration and application software version (including drivers and firmware version) may do a trick or two. Mismatch of version or configuration can result in issues that are difficult to reproduce in other setups. Hence these are essential details to be captured. Most bug reporting tools have these details as mandatory parameters to report while logging an issue.
Extensive Logging: This is dependent on the logging facilities followed in concerned projects. While working with embedded Linux systems, we not only provide general diagnostic logs, but also system level logs like dmesg or top command logs. You may never know that wrong part is not the code flow but the abnormal memory usage/CPU usage. Identify the type of the issue and report the relevant logs for investigation.
Code Reviews and Walk-through: Dev teams cannot wait forever to reproduce these issues at their end and then take action. Bug report and available logs should be investigated and various possibilities be identified on this basis from design and code. If required, they should prepare hotfix on possible root causes and circulate the hotfix among teams including the tester who identified it to see if bug is reproducible with it.
Don't close these issues based on observation by a single tester/team after a fix is identified and checked in: Perhaps the most important part is approach followed to close these issues. Once fix of these issues has been checked in, all testing/validation teams at different locations should be informed on it for running intensive tests and identifying regression errors if any. Only all (practically most of them) of them reports as non-reproducible, a closure assessment has to be done by senior management.
If it is not reproduce able get logs, screen shots of exact steps to reproduce.
There's a nice new feature in Windows 7 that allows the user to record what they're doing and then send a report - it comes through as a doc with screen-shots of every stage. Hopefully it'll help in the cases where it's the user interacting with the application in an order that the developer wouldn't think of. I've seen plenty of bugs where it's just a case that the developer's logical way of using the app doesn't fit with how end users actually do it... resulting in lots of subtle errors.
Logging is your friend!
Generally what happens when we discover a bug that we can't reproduce is we either ask the customer to turn on more logging (if its available), or we release a version with extra logging added around the area we are interested in. Generally speaking the logging we have is excellent and has the ability to be very verbose, and so releasing versions with extra logging doesn't happen often.
You should also consider the use of memory dumps (which IMO also falls under the umbrella of logging). Producing a minidump is so quick that it can usually be done on production servers, even under load (as long as the number of dumps being produced is low).
The way I see it: Being able to reproduce a problem is nice because it gives you an environment where you can debug, experiement and play around in more freely, but - reproducing a bug is by no means essential to debug it! If the bug is only happening on someone else system then you still need to diagnose and debug the problem in the same way, its just that this time you need to be cleverer about how you do it.
The accepted answer is the best general approach. At a high level, it's worth weighing the importance of fixing the bug against what you could add as a feature or enhance that would benefit the user. Could a 'non-reproducible' bug take two days to fix? Could a feature be added in that time that gives users more benefit than that bug fix? Maybe the users would prefer the feature. I've been fixated at times as a developer on imperfections I can see, and then users are asked for feedback and none of them actually mention the bug(s) that I can see, but the software is missing a feature that they really want!
Sometimes, simple persistence in attempting to reproduce the bug whilst debugging can be the most effective approach. For this strategy to work, the bug needs to be 'intermittent' rather than completely 'non-reproducible'. If you can repeat a bug even one time in 10, and you have ideas about the most likely place it's occurring, you can place breakpoints at those points then doggedly attempt to repeat the bug and see exactly what's going on. I've experienced this to be more effective than logging in one or two cases (although logging would be my first go-to in general).

How to detect and debug multi-threading problems?

This is a follow up to this question, where I didn't get any input on this point. Here is the brief question:
Is it possible to detect and debug problems coming from multi-threaded code?
Often we have to tell our customers: "We can't reproduce the problem here, so we can't fix it. Please tell us the steps to reproduce the problem, then we'll fix it." It's a somehow nasty answer if I know that it is a multi-threading problem, but mostly I don't. How do I get to know that a problem is a multi-threading issue and how to debug it?
I'd like to know if there are any special logging frameworks, or debugging techniques, or code inspectors, or anything else to help solving such issues. General approaches are welcome. If any answer should be language related then keep it to .NET and Java.
Threading/concurrency problems are notoriously difficult to replicate - which is one of the reasons why you should design to avoid or at least minimize the probabilities. This is the reason immutable objects are so valuable. Try to isolate mutable objects to a single thread, and then carefully control the exchange of mutable objects between threads. Attempt to program with a design of object hand-over, rather than "shared" objects. For the latter, use fully synchronized control objects (which are easier to reason about), and avoid having a synchronized object utilize other objects which must also be synchronized - that is, try to keep them self contained. Your best defense is a good design.
Deadlocks are the easiest to debug, if you can get a stack trace when deadlocked. Given the trace, most of which do deadlock detection, it's easy to pinpoint the reason and then reason about the code as to why and how to fix it. With deadlocks, it always going to be a problem acquiring the same locks in different orders.
Live locks are harder - being able to observe the system while in the error state is your best bet there.
Race conditions tend to be extremely difficult to replicate, and are even harder to identify from manual code review. With these, the path I usually take, besides extensive testing to replicate, is to reason about the possibilities, and try to log information to prove or disprove theories. If you have direct evidence of state corruption you may be able to reason about the possible causes based on the corruption.
The more complex the system, the harder it is to find concurrency errors, and to reason about it's behavior. Make use of tools like JVisualVM and remote connect profilers - they can be a life saver if you can connect to a system in an error state and inspect the threads and objects.
Also, beware the differences in possible behavior which are dependent on the number of CPU cores, pipelines, bus bandwidth, etc. Changes in hardware can affect your ability to replicate the problem. Some problems will only show on single-core CPU's others only on multi-cores.
One last thing, try to use concurrency objects distributed with the system libraries - e.g in Java java.util.concurrent is your friend. Writing your own concurrency control objects is hard and fraught with danger; leave it to the experts, if you have a choice.
I thought that the answer you got to your other question was pretty good. But I'll emphasis these points.
Only modify shared state in a critical section (Mutual Exclusion)
Acquire locks in a set order and release them in the opposite order.
Use pre-built abstractions whenever possible (Like the stuff in java.util.concurrent)
Also, some analysis tools can detect some potential issues. For example, FindBugs can find some threading issues in Java programs. Such tools can't find all problems (they aren't silver bullets) but they can help.
As vanslly points out in a comment to this answer, studying well placed logging output can also very helpful, but beware of Heisenbugs.
For Java there is a verification tool called javapathfinder which I find it useful to debug and verify multi-threading application against potential race condition and death-lock bugs from the code.
It works finely with both Eclipse and Netbean IDE.
[2019] the github repository
https://github.com/javapathfinder
Assuming I have reports of troubles that are hard to reproduce I always find these by reading code, preferably pair-code-reading, so you can discuss threading semantics/locking needs. When we do this based on a reported problem, I find we always nail one or more problems fairly quickly. I think it's also a fairly cheap technique to solve hard problems.
Sorry for not being able to tell you to press ctrl+shift+f13, but I don't think there's anything like that available. But just thinking about what the reported issue actually is usually gives a fairly strong sense of direction in the code, so you don't have to start at main().
In addition to the other good answers you already got: Always test on a machine with at least as many processors / processor cores as the customer uses, or as there are active threads in your program. Otherwise some multithreading bugs may be hard to impossible to reproduce.
Apart from crash dumps, a technique is extensive run-time logging: where each thread logs what it's doing.
The first question when an error is reported, then, might be, "Where's the log file?"
Sometimes you can see the problem in the log file: "This thread is detecting an illegal/unexpected state here ... and look, this other thread was doing that, just before and/or just afterwards this."
If the log file doesn't say what's happening, then apologise to the customer, add sufficiently-many extra logging statements to the code, give the new code to the customer, and say that you'll fix it after it happens one more time.
Sometimes, multithreaded solutions cannot be avoided. If there is a bug,it needs to be investigated in real time, which is nearly impossible with most tools like Visual Studio. The only practical solution is to write traces, although the tracing itself should:
not add any delay
not use any locking
be multithreading safe
trace what happened in the correct sequence.
This sounds like an impossible task, but it can be easily achieved by writing the trace into memory. In C#, it would look something like this:
public const int MaxMessages = 0x100;
string[] messages = new string[MaxMessages];
int messagesIndex = -1;
public void Trace(string message) {
int thisIndex = Interlocked.Increment(ref messagesIndex);
messages[thisIndex] = message;
}
The method Trace() is multithreading safe, non blocking and can be called from any thread. On my PC, it takes about 2 microseconds to execute, which should be fast enough.
Add Trace() instructions wherever you think something might go wrong, let the program run, wait until the error happens, stop the trace and then investigate the trace for any errors.
A more detailed description for this approach which also collects thread and timing information, recycles the buffer and outputs the trace nicely you can find at:
CodeProject: Debugging multithreaded code in real time 1
A little chart with some debugging techniques to take in mind in debugging multithreaded code.
The chart is growing, please leave comments and tips to be added.
(update file at this link)
Visual Studio allows you to inspect the call stack of each thread, and you can switch between them. It is by no means enough to track all kinds of threading issues, but it is a start. A lot of improvements for multi-threaded debugging is planned for the upcoming VS2010.
I have used WinDbg + SoS for threading issues in .NET code. You can inspect locks (sync blokcs), thread call stacks etc.
Tess Ferrandez's blog has good examples of using WinDbg to debug deadlocks in .NET.
assert() is your friend for detecting race-conditions. Whenever you enter a critical section, assert that the invariant associated with it is true (that's what CS's are for). Though, unfortunately, the check might be expensive and thus not suitable for use in production environment.
I implemented the tool vmlens to detect race conditions in java programs during runtime. It implements an algorithm called eraser.
Develop code the way that Princess recommended for your other question (Immutable objects, and Erlang-style message passing). It will be easier to detect multi-threading problems, because the interactions between threads will be well defined.
I faced a thread issue which was giving SAME wrong result and was not behaving un-predictably since each time other conditions(memory, scheduler, processing load) were more or less same.
From my experience, I can say that HARDEST PART is to recognize that it is a thread issue, and BEST SOLUTION is to review the multi-threaded code carefully. Just by looking carefully at the thread code you should try to figure out what can go wrong. Other ways (thread dump, profiler etc) will come second to it.
Narrow down on the functions that are being called, and rule out what could and could not be to blame. When you find sections of code that you suspect may be causing the issue, add lots of detailed logging / tracing to it. Once the issue occurs again, inspect the logs to see how the code executed differently than it does in "baseline" situations.
If you are using Visual Studio, you can also set breakpoints and use the Parallel Stacks window. Parallel Stacks is a huge help when debugging concurrent code, and will give you the ability to switch between threads to debug them independently. More info-
https://learn.microsoft.com/en-us/visualstudio/debugger/using-the-parallel-stacks-window?view=vs-2019
https://learn.microsoft.com/en-us/visualstudio/debugger/walkthrough-debugging-a-parallel-application?view=vs-2019
I'm using GNU and use simple script
$ more gdb_tracer
b func.cpp:2871
r
#c
while (1)
next
#step
end
The best thing I can think of is to stay away from multi-threaded code whenever possible. It seems there are very few programmers who can write bug free multi threaded applications and I would argue that there are no coders beeing able to write bug free large multi threaded applications.

Resources