Identifying Duplicate Exceptions for Bug Tracking

I've developed a "Proof of Concept" application that logs unhandled exceptions from an application to a bug-tracking system (in this case Team Foundation Server, but it could be ANY bug-tracking system). One complication is that I don't want duplicate bug items opened every time the same exception is thrown (for example, when many users encounter the exception, it's still a single "bug").
My first attempt was to store the Exception Type, Message and Stack Trace as fields in the bug-tracking system. The logging component would then query the bug "store" to see if there is an open bug with the same information. (This example is .NET, but I would think the concept is platform independent.)
The problem, obviously, is that these fields can be very large (particularly the stack trace); storing them requires a "full-text" type of implementation, and searching them is very expensive.
I was wondering what approaches have been devised for this problem. I had heard that FogBugz, for example, had such a feature for automated bug tracking, and I was curious how it was implemented.

If you have the stack trace, you could find the failing statement in the stack trace and compare it with the ones already logged. If symbols were included, you'd also get the line number. So now you have several things to compare: the actual error number, the statement that failed, and possibly the line number. If something has already been logged with all of those, then it's more than likely (not 100%, of course) the same issue.
In fact, you could probably parse the stack trace on the word "at", as each frame line in a .NET stack trace begins with "at". The failing statement's frame is the first "at" line (.NET prints the innermost frame first); compare it with the corresponding line of the stored stack traces, and you might actually have something.
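A minimal Python sketch of that comparison, assuming .NET-formatted stack trace strings (the function names here are illustrative, not from the original post):

def failing_frame(stack_trace):
    # .NET prints the innermost frame first, so the failing
    # statement is the first line that begins with "at".
    for line in stack_trace.splitlines():
        stripped = line.strip()
        if stripped.startswith("at "):
            return stripped
    return ""

def looks_like_duplicate(new_trace, stored_traces):
    # Matching failing frames (and, with symbols, line numbers)
    # suggest, but do not prove, the same bug.
    frame = failing_frame(new_trace)
    return bool(frame) and any(failing_frame(t) == frame for t in stored_traces)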
HTH!

You could create a checksum hash of the stack trace and store that as an indexed column. That way the query to the Bug Store would be pretty fast to avoid duplicates on insert.
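A sketch of that idea in Python (trace_fingerprint is an illustrative name, and SHA-256 stands in for whatever checksum you prefer):

import hashlib

def trace_fingerprint(stack_trace):
    # Normalise whitespace so cosmetic differences between reports
    # don't produce different hashes for the same trace.
    normalised = "\n".join(line.strip() for line in stack_trace.splitlines())
    return hashlib.sha256(normalised.encode("utf-8")).hexdigest()

Storing trace_fingerprint(trace) in an indexed column turns the duplicate check on insert into a fast equality lookup instead of a full-text search.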

You could look at the source code for one of the existing open-source solutions that aggregate exceptions.
For example: https://github.com/getsentry/sentry/tree/master/src/sentry
It is not a simple problem, and there are complex heuristics involved (for example, the same exception may be reported in different ways by different browsers, and exceptions caused by browser extensions are common but rarely important).

Related

Log all the information about an exception

How can I write a log entry – serialising information to text – that captures all the useful information about an exception?
In Python 3, exception instances got considerably more awesome, but also more complex. They can have a __cause__ or not, can have a __context__ or not, they can have a __traceback__ or not, they have a type, and probably more wrinkles that I'm missing.
The information needed to diagnose an exception later can include any or all of this. It can also, of course, include the same information for any related (cause, or context, or …?) exception instance, recursively.
At the point of logging a caught exception (prior to re-raising it), what single “get me all the information” technique is there which I can use to record the full diagnostic information contained in the exception object, as text which will be useful to the eventual diagnosis effort?
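One standard-library technique, offered here as a sketch rather than the definitive answer: traceback.format_exception follows the __cause__/__context__ chain by default and renders each linked exception with its type, message and traceback (risky_operation is a hypothetical placeholder):

import logging
import traceback

log = logging.getLogger(__name__)

try:
    risky_operation()  # hypothetical function that may raise
except Exception as exc:
    # The three-argument form works across Python 3 versions;
    # chained (__cause__/__context__) exceptions are included
    # in the formatted output by default.
    text = "".join(traceback.format_exception(type(exc), exc, exc.__traceback__))
    log.error("operation failed:\n%s", text)
    raise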

Command Query Separation: commands must return void?

If CQS prevents commands from returning status variables, how does one code for commands that may not succeed? Let's say you can't rely on exceptions.
It seems like anything that is request/response is a violation of CQS.
So it would seem like you would have a set of "mother may I" methods giving the statuses that would have been returned by the command. But what happens in a multithreaded / multi-computer application?
Say three clients each request that a server's object increase by one (where the object is limited to 0-100). All of them check to see if they can, but only one gets it; the other two can't, because the counter just hit its limit. A returned status would seem to solve the problem here.
It seems like anything that is request/response is a violation of CQS.
Pretty much yes, hence Command-Query-Separation. As Martin Fowler nicely puts it:
The fundamental idea is that we should divide an object's methods into two sharply separated categories:
Queries: Return a result and do not change the observable state of the system (are free of side effects).
Commands: Change the state of a system but do not return a value [my emphasis].
Requesting that a server's object increase by one is a Command, so it should not return a value - processing a response to that request means that you are doing a Command and Query action at the same time which breaks the fundamental tenet of CQS.
So if you want to know what the server's value is, you issue a separate Query.
If you really need a request-response pattern, you either need to have something like a convoluted callback event process to issue queries for the status of a specific request, or pure CQS isn't appropriate for this part of your system - note the word pure.
Multithreading is a main drawback of CQS and can make it hard to apply. Wikipedia has a basic example and discussion of this, and also links to the same Martin Fowler article, where he suggests it is OK to break the pattern to get something done without driving yourself crazy:
[Bertrand] Meyer [the inventor of CQS] likes to use command-query separation absolutely, but there are exceptions. Popping a stack is a good example of a query that modifies state. Meyer correctly says that you can avoid having this method, but it is a useful idiom. So I prefer to follow this principle when I can, but I'm prepared to break it to get my pop.
TL;DR - I would probably just look at returning a response, even though it isn't correct CQS.
The article "Race Conditions Don’t Exist" may help you look at the problem with a CQS/CQRS mindset.
You may want to step back and ask why the counter value absolutely must be known before sending a command. Apparently, you want to decide on the client side whether the counter can be increased further.
The approach is to let the server make that decision. Let all the clients send commands; some will succeed and some will fail. Eventually the clients will get a consistent view of the server object's state (that the limit has been reached) and can stop sending such commands.
This window of inconsistency can lead to wrong decisions by the clients, but it never breaks the consistency of the object (or the domain model) on the server side, as long as commands are handled adequately.
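A minimal Python sketch of that server-side approach (the names are illustrative; the rejected command surfaces here as an exception, which the question rules out, so read it as a stand-in for whatever failure channel the system uses):

import threading

class BoundedCounter:
    # Server-side object: the command itself enforces the 0-100 limit,
    # so clients never need to pre-check.

    def __init__(self, limit=100):
        self._value = 0
        self._limit = limit
        self._lock = threading.Lock()

    def increment(self):
        # Command: changes state and returns nothing.
        with self._lock:
            if self._value >= self._limit:
                raise RuntimeError("counter limit reached")
            self._value += 1

    def value(self):
        # Query: returns state and has no side effects.
        with self._lock:
            return self._value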

What is a good common term for errors and warnings?

I'd like to build an OO hierarchy of errors and warnings returned to the client during a, let's say, pricing operation:
interface PricingMessage {}
interface PricingWarning extends PricingMessage {}
interface PricingError extends PricingMessage {}
class NoSuchProductError implements PricingError {
...
}
I'm not very keen on the name PricingMessage. What is the concept that includes errors and warnings?
EDIT: To be clear, I'm looking for a common concept or name for errors and warnings specifically (excluding e.g. general info messages). For instance, compilers also report errors and warnings. What are these?
Some suggestions...
"Result"
An operation has results, which could be some number of errors, warnings, notes and explicit or implied (no errors) success.
PricingResult
"Issue"
An operation ran, but with issues, some of which could be fatal (errors) and some of which might not be (warnings). If there are no issues, success is implied.
PricingIssue
"Condition"
An operation might have or be in an error condition or a warning condition. I think some compilers use "condition" as an abstract term for an error or warning.
PricingCondition
"Diagnostic"
The outcome of an operation might include diagnostics for errors and warnings. Another compiler term, I believe.
PricingDiagnostic
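For illustration only, here is how the "Result" and "Issue" suggestions might combine, sketched in Python (all names are made up):

from dataclasses import dataclass, field
from enum import Enum

class Severity(Enum):
    WARNING = "warning"
    ERROR = "error"

@dataclass
class PricingIssue:
    severity: Severity
    message: str

@dataclass
class PricingResult:
    issues: list = field(default_factory=list)

    def succeeded(self):
        # Success is implied by the absence of error-level issues.
        return not any(i.severity is Severity.ERROR for i in self.issues)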
If you were dealing with Java or similar OO languages, the word you are looking for would be Exception. This indicates that you have reached an "exceptional" condition which needs to be handled in a controlled way.
Through looking at a few synonym lists, I found the following:
Anomaly, oddity, deviation
Alert, message, notification
Fault, misstep, failure, glitch
I prefer the name Alert. An alert, in my opinion, can have any level of severity: it could be identified as informational, warning, critical, or any other level deemed appropriate.
The counter-argument I have heard to this naming is that alert the noun follows too closely on alert the verb, given that the object (noun) may or may not yet have been brought to the user's attention (verb). In this context, naming them alerts could create a bit of cognitive dissonance, and perhaps confusion for developers reasoning about your code.
The best I can propose is to make a hard distinction in your code base between Alert (the object representing the exceptional condition) and Notification (the act of bringing the alert to the user's attention) to keep things intuitive for programmers going forward.
In theory these could be defined as events - so you could use that.
This is very subjective, but here are a few suggestions:
Output
LogEntry
In web design, the term "admonition" is sometimes used for a block of text that can be an error, warning, or informational.

Oracle Sort Order - What may cause it to change

Disclaimer: I know that it is bad to not use an 'ORDER BY' in SQL when sorted data is required.
I am currently supporting a Pro*C program which is having a weird problem.
One of the possible causes may be that the original developers (from a long time ago) did not use ORDER BY in their SQL even though the program logic depends on it!
The program has been working fine all these years and started showing problems only recently.
We are trying to pin the weird problem on the missing ORDER BY (there are other candidate causes, like a recent port from Solaris to Linux).
What shadowy things on the database end should we look at that may have changed the old sort order? Things like data files, etc.?
Anybody have any experience with Pro*C on Solaris magically sorting the result-set?
Thanks!
Since you know that the program cares about the order in which results are returned, and you know that the query it submits is missing an ORDER BY clause, is there a reason you don't just fix the problem rather than trying to figure out whether the actual order of results may have changed? If you fix the known ORDER BY problem and the "weird problem" disappears, that is pretty good evidence that the "weird problem" was, in fact, caused by the missing ORDER BY.
Unfortunately, there are lots of things that might have caused the order of results to change, many of which may be impossible to track down. The most obvious cause would be a change in the execution plan. That, in turn, may have happened because statistics changed (or didn't change enough), because of a patch, because of an initialization parameter change, or because of a client configuration change, among other things. If you are licensed to use the AWR (Automatic Workload Repository), you might be able to find evidence that the plan has changed by looking for multiple PLAN_HASH_VALUE values for the SQL_ID in DBA_HIST_SQLSTAT over different days. If there are several, you'd still have to figure out whether the different plans actually caused the results to be returned in a different order.
Beyond a query plan change, though, there are dozens of other possible causes. The physical order of data on disk may have changed because someone reorganized the table, because someone moved data files around on the disk, or because the SAN automatically rebalanced something by moving data around. Data that was not cached in the past may now be cached (or vice versa). An Oracle patch may have been applied.
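If you want to check that programmatically, here is a hedged sketch using the python-oracledb driver (querying DBA_HIST_SQLSTAT requires the Diagnostics Pack license, and the connection details and SQL_ID are placeholders):

import oracledb

PLAN_QUERY = """
    SELECT DISTINCT plan_hash_value
    FROM dba_hist_sqlstat
    WHERE sql_id = :sql_id
"""

with oracledb.connect(user="...", password="...", dsn="...") as conn:
    with conn.cursor() as cur:
        cur.execute(PLAN_QUERY, sql_id="<your sql_id>")
        plans = [row[0] for row in cur.fetchall()]

# More than one plan hash value over the AWR retention window
# means the execution plan has changed at some point.
if len(plans) > 1:
    print("plan changed:", plans)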
I suggest renaming the physical table and putting the required order in a view that keeps the old name. For example:
-- Rename the physical table, then recreate the old name as a sorted view.
ALTER TABLE TABLE_NOT_SORTED RENAME TO PHYS_TABLE_NOT_SORTED;

CREATE VIEW TABLE_NOT_SORTED AS
SELECT *
FROM PHYS_TABLE_NOT_SORTED
ORDER BY DESIRED_COLUMNS;
In response to a comment:
According to this question and Ask Tom's answer, since Oracle does not guarantee any default sorting when you do not use ORDER BY, they are free to change it. They are absolutely right, of course: if you need sorting, use ORDER BY.
Other than that, nothing can be said about your code's default ordering.

How to test a program processing large amounts of data stored in an unpredictable format

What I have to do
I'm trying to manipulate some rather large amounts of data stored in Excel files (one of the workbooks has as many as 150 worksheets). The result of these manipulations may yield approximately 800,000 rows in a database table.
The problem
The data stored in the spreadsheets has an unpredictable format. The company that generated these spreadsheets had no fixed or documented format for exporting these files, and erroneous data sometimes appears. For example, most years are represented like "2009", but there are cases where a year is represented as "20". As another example, the data is not really normalized in these files, so I use separators to split the values of certain cells. Sometimes these separators change.
There are things like these that I couldn't predict, and I discovered them only after running an already evolved version of my program over a pretty large part of the available data.
The question
How can one test the correctness of a program in such a situation? Or rather, how to achieve a pretty stable version of the product without running it over the whole available data?
Shall I take a defensive approach and throw exceptions whenever some kind of unexpected issue arises, with the main loop of the program catching and logging them and continuing with the remaining data? This would yield some processed data, but it means that on a subsequent run of the program I have to check what's already in the database from previous runs (which I don't really like).
What's your opinion? How would you tackle this problem?
If there is no specification for what the format of the data is, then anything is acceptable.
If not, then there is either an explicit or implicit specification of the data. I would try and nail this down right now. If you can't get an explicit enough definition of the data to write your program so that it can be expected to run without error, then I would say you are taking a very large risk in causing some serious damage depending on how this data is being used.
You should write your program so that it either throws an exception or logs an error whenever it runs across data that does not meet the specification. Then run the program on PART of the available data until it runs without exceptions; this can be viewed as a training set for the development of your program. Then hold back some of the remaining data as a TEST set. Running on the test set will give you an estimate of how many exceptions/errors your program will generate in production.
Overfitting is a common machine-learning concept, but it is useful for other tasks such as this - program development. It is surprising to me how developers can write a bunch of unit tests, code their application to perform well on them, and then expect similar, bug-free performance in production.
If you're not willing to take all these steps (i.e. run your code on essentially all of the data -- since the test set is also making use of the data) then I would say the task is too large to do.
As an aside, rather than creating a definition of a format that is very strange and peculiar to account for all the "errors" in the current data, you might want to create a new, normalized (in the sense these things are simplified away) specification for the data, and then write a "faulty document patcher" that can be run on faulty documents to fix the data.
If the application generating the data is still in production, then you might need to go to the developers of this application to get a buy in on the new spec. Once you have that, you can then start logging bugs against their application, so hopefully the faulty document patcher can be retired.
More likely, I'm guessing that the software developers are long gone, no one understands the code anymore, if it is even running at all.
How can one test the correctness of a program in such a situation? Or rather, how to achieve a pretty stable version of the product without running it over the whole available data?
For every single data type, I would set reasonable constraints on the values it is allowed to take.
If a cell violates these constraints, throw an exception containing the offending piece of data and its data type. When a piece of data violates its constraints, you can then modify the source to include the additional constraints required for that piece of data, plus a conversion method to make it uniform.
To use the year example you gave: initially a year would be constrained to exactly four digits, so when the program came across the "20" it would throw an exception.
Then you could go and allow two-digit years, with a method to convert them into four-digit ones for further processing.
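A minimal Python sketch of that constrain-then-relax cycle (parse_year is an illustrative name, and the century chosen for two-digit years is an assumption):

import re

class DataConstraintError(ValueError):
    # Carries the offending value and its data type, as suggested above.
    def __init__(self, value, data_type):
        super().__init__(f"{data_type}: unexpected value {value!r}")
        self.value = value
        self.data_type = data_type

def parse_year(cell):
    # Initial constraint: a year must be exactly four digits.
    if re.fullmatch(r"\d{4}", cell):
        return int(cell)
    # Relaxed after hitting "20" in the data: accept two digits and
    # convert (assuming the 2000s, which may not hold for your data).
    if re.fullmatch(r"\d{2}", cell):
        return 2000 + int(cell)
    raise DataConstraintError(cell, "year")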
One question is, will you run your program more than once? From your question it sounds possible you only want to run it once, and then you will then work with the data in the database.
In which case you can be very defensive - throw exceptions whenever unexpected data appears. Run the program repeatedly on ever-larger sets of the data. Initially, solve any exceptions by altering the code, as it's a good rule of thumb that the exceptions you find first are going to be common. You might want to empty the output database between runs.
Later on, you will be finding rare exceptions that might only occur a couple of times in the input. Just solve these by hand and insert the corresponding rows in the database yourself. Or write another small program that reads your exception information and inserts the new rows, rather than running your whole big program again.
Typically, for this sort of thing, I do as MarkJ suggested and encode the whole thing in unit tests.
So I compose a small datafile that at first contains only a few rows of normal data. That's unit test number 1.
Then I take a quick visual scan of some of the data to spot any obvious exceptions. Unit tests 2 through n.
Finally, I write parser code until it passes all the unit tests, and throws and logs exceptions for all unmanaged data.
I then use these oddball bits of data to make new unit tests, and improve the parser until it can pass those too.
Sometimes, though, accommodating a really strange bit of data adds more parser complexity than it's worth; in that case I'll just log the exception, dump the data, and move on. This is a matter of professional judgment.
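Sketched with Python's unittest; parse_year here is a hypothetical stand-in for whatever parser function is under development:

import re
import unittest

def parse_year(cell):
    # Stand-in parser under test; it grows as new oddities are found.
    if re.fullmatch(r"\d{4}", cell):
        return int(cell)
    raise ValueError(f"unexpected year: {cell!r}")

class ParserTests(unittest.TestCase):
    def test_normal_rows(self):
        # Test 1: a few rows of normal data.
        self.assertEqual(parse_year("2009"), 2009)

    def test_unmanaged_data_raises(self):
        # Oddball data should raise so it can be logged and triaged,
        # then promoted into a new test once the parser handles it.
        with self.assertRaises(ValueError):
            parse_year("20")

if __name__ == "__main__":
    unittest.main()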
How about processing every piece of data in one pass (so you don't have to check for duplicates)? Rows that pass go into the database; the exceptions go into an exception file. The user can open the exception file, correct or modify the data, and then run your program on the exception file alone.
This will isolate unhandled data for the user to correct and prevent you from processing the same data twice (or more).
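A rough sketch of that flow in Python (parse_row, db and the file handling are hypothetical placeholders):

def process_all(rows, db, exception_file):
    # Insert what parses cleanly; divert everything else for manual repair.
    for row in rows:
        try:
            record = parse_row(row)  # hypothetical parser
        except Exception as exc:
            # Unhandled data goes to the exception file so the user can
            # correct it and re-run the program on that file alone.
            exception_file.write(f"{row!r}\t{exc}\n")
            continue
        db.insert(record)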
