I often hear arguments like this: A disadvantage of traditional testing is that it is incomplete whereas Alloy analysis is exhaustive and complete (within a bound). But, the first is talking about software, the second is talking about models. Isn't it an apples-to-oranges comparison?
Update: I was wrong. The comparison is not this: testing code versus analyzing models. That is an apples-to-oranges comparison. Instead, the comparisons are these:
Testing models versus analysis of models.
Testing code versus analysis of code.
Those are apples-to-apples comparisons.
So, whether the artifact is a model or code, you can compare two kinds of analysis: testing, which corresponds to drawing a relatively small number of cases randomly, without a bound on the size, versus small scope analysis, which involves all cases within a small bound.
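To make the contrast concrete, here is a minimal sketch in Python rather than Alloy. It assumes a made-up function under analysis (`dedup_buggy`, which only removes adjacent duplicates) and a reference specification (`dedup_spec`); both names, and the two checkers, are hypothetical and only illustrate the two kinds of analysis.

```python
import itertools
import random


def dedup_spec(xs):
    """Reference behaviour: keep the first occurrence of each value."""
    seen, out = set(), []
    for x in xs:
        if x not in seen:
            seen.add(x)
            out.append(x)
    return out


def dedup_buggy(xs):
    """Hypothetical buggy version: only drops *adjacent* duplicates."""
    out = []
    for x in xs:
        if not out or out[-1] != x:
            out.append(x)
    return out


def random_test(trials=100, max_len=20, max_value=10**6, seed=0):
    """Testing: a small number of randomly drawn cases from a huge space."""
    rng = random.Random(seed)
    for _ in range(trials):
        xs = [rng.randrange(max_value) for _ in range(rng.randint(0, max_len))]
        if dedup_buggy(xs) != dedup_spec(xs):
            return xs
    return None  # no counterexample found; says little about correctness


def small_scope_check(max_len=3, num_values=3):
    """Small scope analysis: every list up to max_len over num_values values."""
    for n in range(max_len + 1):
        for xs in itertools.product(range(num_values), repeat=n):
            if dedup_buggy(list(xs)) != dedup_spec(list(xs)):
                return list(xs)
    return None  # no counterexample exists within this bound


print(small_scope_check())
```

Random lists drawn from a large value range rarely contain a repeated value at all, so the random tester is unlikely to trip over the bug, whereas the exhaustive check over a tiny scope is guaranteed to reach a non-adjacent duplicate.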
Thanks to Daniel Jackson for clearing up my misunderstanding.
First, when Alloy was invented, the only existing tools for analyzing models in data-rich languages such as Z and VDM that were not proof-based used scenarios to test the model. Each scenario was constructed by the user, so the approach suffered from the cost of creating the scenarios and the low coverage of their small number.
Second, Alloy has been used to find bugs in code: see the PhD theses by Mandana Vaziri, Mana Taghdiri, Greg Dennis, Juan Pablo Galeotti and others. In all of these, bugs were found that evaded conventional tests.
Third, it's worth noting that bounded-exhaustive forms of testing are becoming viable. Sarfraz Khurshid was a pioneer in this work with his thesis on generating test cases, initially in a tool called TestEra based on Alloy, and later (with Darko Marinov et al.) in a tool called Korat that traded a more directed solving method for less declarative constraints.
Related
In my community, we have recently been using the term "falsification" of a formal specification. The term appears in, for instance:
https://www.cs.huji.ac.il/~ornak/publications/cav05.pdf
I wonder whether the Alloy Analyzer does falsification. It seems so to me, but I'm not sure. Is that correct? If not, what is the difference?
Yes, Alloy is a falsifier. Alloy's primary novelty when it was introduced 20 years ago was to argue that falsification was often more important than verification, since most designs are not correct, so the role of an analyzer should be to find the errors, not to show that they are not present. For a discussion of this issue, see Section 1.4, Verification vs. Refutation in Software analysis: A roadmap (Jackson and Rinard, 2000); Section 5.1.1, Instance Finding and Undecidability Compromises in Software Abstractions (Jackson 2006).
In Alloy's case, though, there's another aspect, which is the argument that scope-complete analysis is actually quite effective from a verification standpoint. This claim is what we called the "small scope hypothesis": that most bugs can be found in small scopes (that is, in analyses that are bounded by a small fixed number of elements in each basic type).
BTW, Alloy was one of the earliest tools to suggest using SAT for bounded verification. See, for example, Boolean Compilation of Relational Specifications (Daniel Jackson, 1998), a tech report that was known to the authors of the first bounded model checking paper, which discusses Alloy's predecessor, Nitpick, in the following terms:
The hypothesis underlying Nitpick is a controversial one. It is that,
in practice, small scopes suffice. In other words, most errors can be
demonstrated by counterexamples within a small scope. This is a purely
empirical hypothesis, since the relevant distribution of errors cannot
be described mathematically: it is determined by the specifications
people write.
Our hope is that successful use of the Nitpick tool will justify the
hypothesis. There is some evidence already for its plausibility. In
our experience with Nitpick to date, we have not gained further
information by increasing the scope beyond 6.
A similar notion of scope is implicit in the context of model checking
of hardware. Although the individual state machines are usually
finite, the design is frequently parameterized by the number of
machines executing in parallel. This metric is analogous to scope; as
the number of machines increases, the state space increases
exponentially, and it is rarely possible to analyze a system involving
more than a handful of machines. Fortunately, however, it seems that
only small configurations are required to find errors. The celebrated
analysis of the Futurebus+ cache protocol [C+95], which perhaps marked
the turning point in model checking’s industrial reputation, was
performed for up to 8 processors and 3 buses. The reported flaws,
however, could be demonstrated with counterexamples involving at most
3 processors and 2 buses.
From my understanding of what is meant by falsification, yes, Alloy does it.
It becomes quite apparent when you look at the motivation behind the creation of Alloy, as formulated in the Software Abstractions book:
This book is the result of a 10-year effort to bridge this gap, to develop a language (Alloy) that captures the essence of software abstractions simply and succinctly, with an analysis that is fully automatic, and can expose the subtlest of flaws.
I want to use Dynamic Topic Modeling by Blei et al. (http://www.cs.columbia.edu/~blei/papers/BleiLafferty2006a.pdf) for a large corpus of nearly 3800 patent documents.
Does anybody have experience in using DTM in the gensim package?
I identified two models:
models.ldaseqmodel – Dynamic Topic Modeling in Python
models.wrappers.dtmmodel – Dynamic Topic Models (DTM)
Which one did you use, or if you used both, which one is "better"? In better words, which one did/do you prefer?
Both packages work fine, and are pretty much functionally identical. Which one you might want to use depends on your use case. There are small differences in the functions each model comes with, and small differences in the naming, which might be a little confusing, but for most DTM use cases, it does not matter very much which you pick.
Are the model outputs identical?
Not exactly. They are, however, very, very close to being identical (98%+). I believe most of the differences come from slightly different handling of the probabilities in the generative process. So far, I've not come across a case where a difference in the sixth or seventh digit after the decimal point has any significant meaning. Interpreting the topics your model finds matters much more than one version finding a higher topic loading for some word by 0.00002.
The big difference between the two models: dtmmodel is a Python wrapper for the original C++ implementation from blei-lab, which means Python will run the binaries, while ldaseqmodel is written entirely in Python.
Why use dtmmodel?
the C++ code is faster than the python implementation
supports the Document Influence Model from Gerrish/Blei 2010 (potentially interesting for your research; see this paper for an implementation)
Why use ldaseqmodel?
easier to install (simple import statement vs downloading binaries)
can use sstats from a pretrained LDA model - useful with LdaMulticore
easier to understand the workings of the code
I mostly use ldaseqmodel, but that's for convenience. Native DIM support would be great to have, though.
What should you do?
Try each of them out, say, on a small sample set and see what the models return. 3800 documents isn't a huge corpus (assuming the patents aren't hundreds of pages each), and I assume that after preprocessing (removing stopwords, images and metadata) your dictionary won't be too large either (lots of standard phrases and legalese in patents, I'd assume). Pick the one that works best for you or has the capabilities you need.
A full analysis might take hours anyway. If you let your code run overnight, there is little practical difference; after all, do you care whether it finishes at 3am or 5am? If runtime is critical, I would assume dtmmodel will be more useful.
For implementation examples, you might want to take a look at these notebooks: ldaseqmodel and dtmmodel
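As a rough starting point for the "try it on a small sample" advice, here is a minimal sketch of feeding time-stamped documents to ldaseqmodel. It assumes your patents are already tokenized and paired with a year; the helper names `build_time_slices` and `fit_ldaseq` are made up for illustration. The one gensim-specific requirement it relies on is that `LdaSeqModel` expects the corpus in chronological order plus a `time_slice` list giving the number of documents in each period.

```python
from collections import Counter


def build_time_slices(docs_with_years):
    """Sort (year, tokens) pairs chronologically and count documents per year.

    Returns the token lists in time order and the per-year document counts.
    """
    ordered = sorted(docs_with_years, key=lambda pair: pair[0])
    counts = Counter(year for year, _ in ordered)
    time_slice = [counts[year] for year in sorted(counts)]
    texts = [tokens for _, tokens in ordered]
    return texts, time_slice


def fit_ldaseq(docs_with_years, num_topics=5):
    """Fit a dynamic topic model; hypothetical wrapper around gensim."""
    # Imported lazily so the helper above can be used without gensim installed.
    from gensim.corpora import Dictionary
    from gensim.models import LdaSeqModel

    texts, time_slice = build_time_slices(docs_with_years)
    dictionary = Dictionary(texts)
    corpus = [dictionary.doc2bow(tokens) for tokens in texts]
    return LdaSeqModel(
        corpus=corpus,
        id2word=dictionary,
        time_slice=time_slice,
        num_topics=num_topics,
    )


sample = [(2001, ["battery", "cell"]), (2000, ["engine"]), (2001, ["anode"])]
print(build_time_slices(sample))
```

Running `fit_ldaseq` on a few hundred documents first should give a feel for runtime before committing to the full corpus.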
I am now working on a project where we are using cucumber-jvm to drive acceptance tests.
On previous projects I would create internal DSLs in groovy or scala to drive acceptance tests. These DSLs would be fairly simple to use such that even a non-techie would be able to write tests with a little bit of guidance.
What I see is that BDD adds another layer of indirection and semantic sugar to the tests, but I fail to see the value-add, especially if the non-techies can use an internal DSL.
In the case of cucumber, stepDefs seem to scatter the code that drives any given test over several different classes, making the test code difficult to read and debug outside the feature file. On the other hand, putting all the code pertaining to one test in a single stepDef class discourages re-use of stepDefs. Both outcomes are undesirable, leaving me asking: is the use of natural language worth all this extra, unintuitive indirection?
Is there something I am missing? Like a subtle philosophical difference between ATDD and BDD? Does the former imply imperative testing whereas the latter implies declarative testing? Do these aesthetic differences have intrinsic value?
So I am left asking what is the value add to justify the deterioration in the readability of the actual code that drives the test. Is this BDD stuff actually worth the pain? Is the value add more than just aesthetic?
I would be grateful if someone out there could come up with a compelling argument as to why the gain of BDD surpasses the pain of BDD?
What I see is that BDD adds another layer of indirection and semantic sugar to the tests, but I fail to see the value-add, especially if the non-techies can use an internal DSL.
The extra layer is the plain language .feature file and at the point of creation it has nothing to do with testing, it has to do with creating the requirements of the system using a technique called specification by example to create well defined stories. When written properly in the business language, specification by example are very powerful at creating a shared understanding. This exercise alone can both reduce the amount of rework and can find defects before development starts. This exercise is otherwise known as deliberate discovery.
Once you have a shared understanding and agreement on the specifications, you enter development and make those specifications executable. Here is where you would use ATDD. So BDD and ATDD are not comparable, they are complementary. As part of ATDD, you drive the development of the system using the behaviour that has been defined by way of example in the story. The nice thing you have as a developer is a formal format that contains preconditions, events, and postconditions that you can automate.
Here on, the automated running of the executable specifications on a CI system will reduce regression and provide you with all the benefits you get from any other automated testing technique.
The really interesting thing is that the executable specification files are long-lived and evolve over time as you add or change behaviour in your system. Unlike most Agile methodologies, where user stories are thrown away after they have been developed, here you have a living documentation of your system that is also the specification and also the automated test.
Let's now run through a healthy BDD-enabled delivery process (this is not the only way, but it is the way we like to work):
Deliberate Discovery session.
Output = agreed specifications delta
ATDD to drive development
Output = actualizing code, automated tests
Continuous Integration
Output = report with screenshots is browsable documentation of the system
Automated Deployment
Output = working software being consumed
Measure & Learn
Output = New ideas and feedback to feed the next deliberate discovery session
So BDD can really help you in the missing piece of most delivery systems, the specifications part. This is typically undisciplined and freeform, and is left up to a few individuals to hold together. This is how BDD is an Agile methodology and not just a testing technique.
With that in mind, let me address some of your other questions.
In the case of cucumber, stepDefs seem to scatter the code that drives any given test over several different classes, making the test code difficult to read and debug outside the feature file. On the other hand, putting all the code pertaining to one test in a single stepDef class discourages re-use of stepDefs. Both outcomes are undesirable, leaving me asking: is the use of natural language worth all this extra, unintuitive indirection?
If you make the stepDefs a super thin layer on top of your automation testing codebase, then it's easy to reuse the automation code from multiple steps. In the test codebase, you should utilize techniques and principles such as the testing pyramid and the shallow depth of test to ensure you have a robust and fast test automation layer. What's also interesting about this separation is that it allows you to reuse the code between your stepDefs and your unit/integration tests.
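To illustrate what "a super thin layer" can look like, here is a minimal sketch in Python. It is not the cucumber-jvm API: the `step` decorator, the `STEPS` registry, `run_scenario` and the `AccountAutomation` class are all hypothetical stand-ins. The point is only that each step definition is a one-liner that delegates to automation code you could equally call from a unit test.

```python
import re


class AccountAutomation:
    """Hypothetical automation layer; holds the real logic and is reusable."""

    def __init__(self):
        self.balance = 0

    def deposit(self, amount):
        self.balance += amount

    def withdraw(self, amount):
        if amount > self.balance:
            raise ValueError("insufficient funds")
        self.balance -= amount


STEPS = []  # (compiled pattern, function) pairs


def step(pattern):
    """Register a step definition for lines matching the given pattern."""
    def register(fn):
        STEPS.append((re.compile(pattern), fn))
        return fn
    return register


@step(r"an account with a balance of (\d+)")
def given_balance(ctx, amount):
    ctx.deposit(int(amount))


@step(r"I withdraw (\d+)")
def when_withdraw(ctx, amount):
    ctx.withdraw(int(amount))


@step(r"the balance should be (\d+)")
def then_balance(ctx, amount):
    assert ctx.balance == int(amount)


def run_scenario(lines):
    """Dispatch each Given/When/Then line to its matching step definition."""
    ctx = AccountAutomation()
    for line in lines:
        text = line.strip().split(" ", 1)[1]  # drop the leading keyword
        for pattern, fn in STEPS:
            match = pattern.fullmatch(text)
            if match:
                fn(ctx, *match.groups())
                break
        else:
            raise LookupError(f"no step definition for: {line}")
    return ctx


run_scenario([
    "Given an account with a balance of 100",
    "When I withdraw 30",
    "Then the balance should be 70",
])
```

Because the steps contain no logic of their own, splitting them across files or grouping them in one file matters much less for readability: the code worth reading and debugging lives in the automation class.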
Is there something I am missing? Like a subtle philosophical difference between ATDD and BDD? Does the former imply imperative testing whereas the latter implies declarative testing? Do these aesthetic differences have intrinsic value?
As mentioned above, ATDD and BDD are complementary and not comparable. On the point of imperative/declarative, specification by example as a technique is very specific. When you are performing the deliberate discovery phase, you always ask the question "can you give me an example?". In that example, you would use exact values. If there are two values that can be used in the preconditions (Given) or event (When) steps, and they have different outcomes (Then step), it means you have two different scenarios. If they have the same outcome, it's likely the same scenario. Therefore, as part of the BDD practice, the steps need to be declarative so as to gain the benefits of deliberate discovery.
So I am left asking what is the value add to justify the deterioration in the readability of the actual code that drives the test. Is this BDD stuff actually worth the pain? Is the value add more than just aesthetic?
It's worth it if you are working in a team where you want to solve the problem of miscommunication. One of the reasons people fail with BDD is that the writing and automation of features is left to the developers and the QAs, and the artifacts are no longer coherent as living specifications; they are just test scripts.
Test scripts tell you how a system does a particular thing, but they do not tell you why.
I would be grateful if someone out there could come up with a compelling argument as to why the gain of BDD surpasses the pain of BDD?
It's about using the right tool for the right job. Using Cucumber for writing unit tests or automated test scripts is like using a hammer to put a screw into wood. It might work, but it's never pretty and it's always painful!
On the subject of tools, your typical business analyst / product owner is not going to have the knowledge needed to peek into your source control and work with you on adding / modifying specs. We created a commercial tool to fix this problem: it lets your whole team collaborate over specifications in the cloud and stays in sync (in real time) with your repository. Check out Simian.
I have also answered a question about BDD here that may be of interest to you that focuses more on development:
Should TDD and BDD be used in conjunction?
Cucumber and Selenium are two popular technologies. Most organizations use Selenium for functional testing, and many of them want to integrate Cucumber with Selenium because Cucumber makes the application flow easy to read and understand. Cucumber is a tool based on the Behavior Driven Development approach that acts as a bridge between the following people:
Software Engineer and Business Analyst.
Manual Tester and Automation Tester.
Manual Tester and Developers.
Cucumber also helps the client understand the application code, as it uses the Gherkin language, which is plain text. Anyone in the organization can understand the behavior of the software. Gherkin's syntax is simple text that is readable and understandable.
Is there a package or methodology in existence for the detection of flawed logical arguments in text?
I was hoping for something that would work for text that is not written in an academic setting (such as a logic class). It might be a stretch but I would like something that can identify where logic is trying to be used and identify the logical error. A possible use for this would be marking errors in editorial articles.
I don't need anything that is polished. I wouldn't mind working to develop something either so I'm really looking for what's out there in the wild now.
That's a difficult problem, because you'll have to map natural language to some logical representation, and deal with ambiguity in the process.
Attempto Project may be interesting for you. It has several tools that you can try online. In particular, RACE may be doing something you wanted to do. It checks for consistency on the given assertions. But the bigger issue here is in transforming them to logical forms.
For an ontology of logical axioms, OpenCyc and the commercial full Cyc ontologies might be worth investigating as well. CycML is used as a language to model the logical assertions, and the Cyc engine is capable of logical inference. The source for OpenCyc can be found in the OpenCyc SourceForge project. The Cyc Wikipedia page also has great information.
Yes, this is a very nasty problem. I would suggest you try to focus in on a narrow domain. For example, if you are looking for logic errors in cancer determination, you have to focus on which type of cancer as well as what you are trying to resolve, e.g. correct treatment plans, correct observations, correct procedures, correct stage determination, etc. Then you have to find the taxonomy or ontology for that specific cancer, e.g. Medline. So, for example, you will likely have to focus in on ONLY lung cancer, and then only a subset of lung cancer types and only observations indicating lung cancer. Then you will have to identify your corpus, knowledge trees and entity relationships, and then worry about negation detection, hypotheticals and subject detection. If healthcare doesn't float your boat, I hear another challenging domain for logic errors is the legal/law industry.
I'm working on some code generation tools, and a lot of complexity comes from doing scope analysis.
I frequently find myself wanting to know things like
What are the free variables of a function or block?
Where is this symbol declared?
What does this declaration mask?
Does this usage of a symbol potentially occur before initialization?
Does this variable potentially escape?
and I think it's time to rethink my scoping kludge.
I can do all this analysis but am trying to figure out a way to structure APIs so that it's easy to use, and ideally, possible to do enough of this work lazily.
What tools like this are people familiar with, and what did they do right and wrong in their APIs?
I'm a bit surprised at the question, as I've done tons of code generation and the question of scoping rarely comes up (except occasionally the desire to generate unique names).
To answer your example questions requires serious program analysis well beyond scoping. Escape analysis by itself is nontrivial. Use-before-initialization can be trivial or nontrivial depending on the target language.
In my experience, APIs for program analysis are difficult to design and frequently language-specific. If you're targeting a low-level language you might learn something useful from the Machine SUIF APIs.
In your place I would be tempted to steal someone else's framework for program analysis. George Necula and his students built CIL, which seems to be the current standard for analyzing C code. Laurie Hendren's group have built some nice tools for analyzing Java.
If I had to roll my own I'd worry less about APIs and more about a really good representation for abstract-syntax trees.
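As an illustration of how much a clean tree representation buys you, here is a minimal sketch of the free-variable question from the original post, over a made-up toy AST. The node types `Var`, `Lam`, `Call` and `Let` are hypothetical and far smaller than any real language; with immutable nodes and one case per node type, the query is a short recursive function.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Var:
    name: str


@dataclass(frozen=True)
class Lam:
    params: tuple  # parameter names bound in the body
    body: object


@dataclass(frozen=True)
class Call:
    fn: object
    args: tuple


@dataclass(frozen=True)
class Let:
    name: str      # bound in body only, not in value
    value: object
    body: object


def free_vars(node):
    """Return the set of names used in node but not bound inside it."""
    if isinstance(node, Var):
        return {node.name}
    if isinstance(node, Lam):
        return free_vars(node.body) - set(node.params)
    if isinstance(node, Call):
        result = free_vars(node.fn)
        for arg in node.args:
            result |= free_vars(arg)
        return result
    if isinstance(node, Let):
        return free_vars(node.value) | (free_vars(node.body) - {node.name})
    raise TypeError(f"unknown node type: {type(node).__name__}")


# Roughly: lambda x: (let y = f(x) in g(y, z))
expr = Lam(("x",), Let("y", Call(Var("f"), (Var("x"),)),
                       Call(Var("g"), (Var("y"), Var("z")))))
print(sorted(free_vars(expr)))
```

Because the nodes are frozen, results could be cached per node, which is one way to get the lazy evaluation the question asks about.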
In the very limited domain of dataflow analysis (which includes the uninitialized-variable question), João Dias and I have adapted some nice work by Sorin Lerner, David Grove, and Craig Chambers. Only our preliminary results are published.
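For readers unfamiliar with how the uninitialized-variable question fits dataflow analysis, here is a minimal sketch of a plain iterative "definitely assigned" analysis. It is not the framework referred to above; the block encoding (a dict from block name to `(uses, defs, successors)`, where uses are read before the block's own defs) and the function name are assumptions made for illustration.

```python
def possibly_uninitialized(blocks, entry):
    """Report, per block, variables read that are not assigned on every path.

    blocks maps a block name to (uses, defs, successors).
    """
    all_vars = set()
    for uses, defs, _ in blocks.values():
        all_vars |= set(uses) | set(defs)

    preds = {name: [] for name in blocks}
    for name, (_, _, succs) in blocks.items():
        for succ in succs:
            preds[succ].append(name)

    # init_in[b]: variables definitely assigned on entry to b.
    # Start optimistic (everything) and shrink to a fixpoint.
    init_in = {name: set(all_vars) for name in blocks}
    init_in[entry] = set()

    changed = True
    while changed:
        changed = False
        for name in blocks:
            if name == entry:
                continue
            outs = [init_in[p] | set(blocks[p][1]) for p in preds[name]]
            new_in = set.intersection(*outs) if outs else set()
            if new_in != init_in[name]:
                init_in[name] = new_in
                changed = True

    report = {}
    for name, (uses, _, _) in blocks.items():
        missing = sorted(set(uses) - init_in[name])
        if missing:
            report[name] = missing
    return report


# x is assigned before the branch; y is assigned on only one arm.
cfg = {
    "entry": ([], ["x"], ["then", "else"]),
    "then": (["x"], ["y"], ["join"]),
    "else": ([], [], ["join"]),
    "join": (["x", "y"], [], []),
}
print(possibly_uninitialized(cfg, "entry"))
```

The intersection over predecessors is what makes this a "must" analysis: a variable counts as initialized at a join only if every incoming path assigned it.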
Finally, if you want to generate code in multiple languages, this is a complete can of worms. I have done it badly several times. If you create something you like, publish it!