How to compare the results of different statistical tests?

I don't know if it is a good question or not.
Here's the case, say I have a scale/continuous dependent variable and a bunch of independent variables. My ultimate goal is to build a model to predict/estimate the dependent variable using these independent variables. I believe it's a common setting.
The point is that I know the physical meaning of all the variables, but I don't know their detailed relationships (or even whether they are related at all). I want to build the model more from an analysis/explanation point of view, so that I can get some real-world insights from it instead of a black box.
My approach is to use a CHAID-like algorithm to build a decision-tree model. At every branch, I want to statistically test each independent variable to see whether there is a relation between it and the dependent variable. Then, based on the test results, I want to pick the most powerful one to grow my tree.
The problem is that, unlike in the CHAID algorithm, where most variables are categorical, in my case the dependent variable is scale (continuous) and the independent variables are categorical or scale. That means I might need to run different statistical tests for different variables, e.g. t-tests and ANOVA for categorical ones and regression for continuous ones. I'm wondering how I should fairly compare these results to pick the most powerful variable (like the correction step in CHAID).
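To make this concrete, here is a rough sketch of the comparison I have in mind, using p-values as the common scale (the column handling and the Bonferroni-style correction are just my assumptions):

```python
# Rough sketch of the idea; the data frame and its columns are made up.
import pandas as pd
from scipy import stats

def predictor_p_value(df: pd.DataFrame, x: str, y: str) -> float:
    """P-value for the association between predictor x and target y."""
    if str(df[x].dtype) in ("object", "category"):
        # Categorical predictor: one-way ANOVA across its groups
        groups = [g[y].to_numpy() for _, g in df.groupby(x)]
        _, p = stats.f_oneway(*groups)
    else:
        # Continuous predictor: test of linear association
        _, p = stats.pearsonr(df[x], df[y])
    return p

def best_split_variable(df: pd.DataFrame, predictors: list, y: str):
    # Bonferroni-style adjustment, analogous to CHAID's correction step
    m = len(predictors)
    adjusted = {x: min(1.0, predictor_p_value(df, x, y) * m) for x in predictors}
    return min(adjusted, key=adjusted.get), adjusted
```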
Any ideas on any part of my plan are greatly appreciated! Thanks!

How to represent a complex use case where every step of the main flow can have multiple scenarios (alternative or error path)?

A little background
I'm new to writing use cases and representing their scenarios.
I'm dealing with a complex system. In the first step of analyzing the system, I created a use case diagram where each use case represents a distinct goal or value for the system. I have tried my best to keep the use cases independent. All these use cases require the initialization and activation of the system, so I decided to factor out this common part and link it to the main use cases using an include relationship.
I understand that include and extend relationships need to be used only when necessary.
Now I'm looking into defining scenarios for each use case and then developing user stories and requirements based on those scenarios.
Main issue
The use cases are very complex, and the easiest way to analyze them seems to be mapping each one into a sequence of steps/activities, where each activity contains several scenarios and each scenario is represented using a sequence diagram.
I understand that an activity cannot be a use case related to the main use case via an include relationship; but having sequence diagrams for activities seems wrong too.
What is the best way to represent a use case where each step of the main flow is complex and can have several interactions between actors and systems, as well as error scenarios that can terminate the sequence at that step, or the possibility of the user cancelling/aborting the sequence?
I have attached a simplified version of the activity diagram for the "Initialize" use case.
As I mentioned, each activity can have many scenarios. For example:
"Perform self check" has many steps, and each step might result in a failure that terminates the sequence and alerts the user (via an HMI). The user can then either terminate the initialization or retry.
"Validate system configuration" includes steps for obtaining the reference config versions and comparing them to the system config, then downloading new config files if necessary and updating the system configs. Each step might have a failure resulting in some sort of message to the user and termination of the sequence. In some cases the user should be able to skip the failed step and proceed without completing that activity.
The same goes for every other activity in the diagram: many steps with exception or alternative paths.
Can I map all of these onto one sequence diagram for the "Initialize" use case?
My attempt to put all these on one sequence diagram failed.
I tried putting all these interactions on an activity diagram with swimlanes, but things got so complex that stakeholders had a hard time understanding what was going on.
Maybe I'm trying to put too much detail at the system level. Should I leave all these interim steps and interactions to the lower levels of design? Should I create a hierarchy of use cases and push the complexity down? I'm confused. :(
What is the best way to deal with this level of complexity? Could you provide some good examples?
The only way to represent a complex use case, where every step of the main flow can have multiple scenarios, is fortunately very simple:
The complexity of the scenarios does not change anything about the simplicity of the actor's goals. And if the goals are not sufficiently simple, you're probably looking at too much detail, or things are not as clear as they should be.
The scenarios are often represented with a set of sequence diagrams. But if it gets really complex you'd better show the flow with an activity diagram.
By the way, you do not need to create an artificial extending or included use-case for the sake of modelling common steps. You may just create a separate activity diagram for the common part. Then, in each of your use-case activity diagrams, you'd insert a call action to the common activity. This also avoids misleadingly including the common part in the description of one use-case and forgetting it in the others.
Last but not least, you also want to develop user-stories based on the use-case scenarios. This is a mixed approach that requires some more thought:
user-stories are generally used without use-cases. Complex requirements are described as an epic. The epic is then successively refined into user-stories that fit in an iteration;
it is possible to structure such user-stories according to stakeholder goals and tasks. This approach is called user-story mapping. It is closer to the use-case, but there is no term to describe the higher-level goals;
use-case driven development is generally used without user-stories: the scenarios and activities directly lead to development without intermediate user-stories.
Fortunately, the Use-Case 2.0 approach allows you to combine both ways. Read the linked whitebook: it's short, it's free, and it's written by the inventor of use-cases together with leading authors of use-case methodology. It offers a reengineered approach that allows agile development, using use-cases for the big picture and use-case slices to break them down dynamically into units that can be developed in one iteration.
A complex use case can remain a single use case, but it may need multiple diagrams to specify its flows.
Your activity diagram (although not 100% UML compliant) gives a good overview of the flow of the use case. Keep this as the main diagram. I would decompose the complex steps into separate diagrams. To indicate that a step is decomposed in a separate diagram, you can display a rake symbol, as follows:
See the UML 2.5.1 specification, section 16.3.4.1, for more information.

How to encode a taxonomy in Weaviate contextionary

I would like to create a semantic context for my data before vectorizing the actual data in Weaviate (https://github.com/semi-technologies/weaviate).
Let's say we have a taxonomy with a set of domain-specific concepts together with links to their related concepts. Could you advise me on the best way to encode not only those concepts but also the relations between them using the contextionary?
Depending on your use case, there are a few possible answers:
1. You can create the "semantic context" in a Weaviate schema and use a vectorization module to vectorize the data according to this schema.
2. You have domain-specific concepts in your data that the out-of-the-box vectorization modules don't know about (e.g., specific abbreviations).
3. You want to capture the semantic context of (i.e., vectorize) the graph itself before adding it to Weaviate.
The first is the easiest and most straightforward; the last is the most esoteric.
Create a schema and use a vectorizer for your data
In your case, you would create a schema based on your taxonomy and load the data using an out-of-the-box vectorizer (this configurator helps you build a Docker Compose file).
I would recommend starting with this anyway, because it will determine your data model and how you can search through and/or classify data. It might even be the case that for your use case this step already solves the problem because the out-of-the-box vectorizers are (bias alert) pretty decent.
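For illustration, a minimal sketch with the Python client (v3 style); the class and property names here are hypothetical and would follow your own taxonomy:

```python
# Minimal sketch (Python client, v3 style). "Concept" and its properties
# are hypothetical names; model them after your taxonomy.
import weaviate

client = weaviate.Client("http://localhost:8080")

concept_class = {
    "class": "Concept",
    "description": "A domain-specific concept from the taxonomy",
    "vectorizer": "text2vec-contextionary",
    "properties": [
        {"name": "name", "dataType": ["text"]},
        {"name": "description", "dataType": ["text"]},
        # A cross-reference to the same class encodes the taxonomy relations
        {"name": "relatedTo", "dataType": ["Concept"]},
    ],
}

client.schema.create_class(concept_class)
```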
Domain-specific concepts
At the moment of writing, Weaviate has two vectorizers, the contextionary and the transformers modules.
If you want to extend Weaviate with custom context, you can extend the contextionary or fine-tune and distribute custom transformers.
If you do this, I would still highly recommend taking the first step, because it will simply improve the results.
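For the contextionary route, the Python client exposes an extension endpoint; roughly like this (the concept, definition and weight are made-up examples):

```python
# Sketch: teach the contextionary a domain-specific abbreviation
# (made-up example) so it can be used during vectorization.
import weaviate

client = weaviate.Client("http://localhost:8080")

client.contextionary.extend(
    concept="nlu",
    definition="natural language understanding",
    weight=1.0,
)
```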
Capture semantic context of your graph
I don't think this is what you want, but it is possible, and quite esoteric. In principle, you can store your vectorized graph in Weaviate, but you need to generate the vectors on your own. For example, at the moment of writing, we are looking at RDF2Vec.
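If you go this route, you can attach your own precomputed vectors when importing objects; a sketch (the object and vector values are placeholders):

```python
# Sketch: import an object with a precomputed vector (e.g., from RDF2Vec).
# The vector here is a placeholder; in practice it comes from your model.
import weaviate

client = weaviate.Client("http://localhost:8080")

client.data_object.create(
    data_object={"name": "SomeConcept"},
    class_name="Concept",
    vector=[0.12, -0.03, 0.57],  # your precomputed graph embedding
)
```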
PS:
Because people often ask about the role of ontologies and taxonomies in Weaviate, I've written this blog post.

Using Conditional Random Fields for Nested named entity recognition

My question is the following.
When we work on named entity recognition tasks, in most cases the classic LSTM-CRF architecture is used, where the CRF uses the Viterbi decoder and the transition matrix to find the best tag sequence for a sentence.
My question is: what if a token is associated with multiple entities and not just one (which is the case in nested NER), as in "Bank of China", where "China" is a location and "Bank of China" is an organization? Can the CRF algorithm be adapted to this case, i.e., to finding more than one possible path in the sequence?
This issue is related to the dataset format more than to the LSTM-CRF itself: you may indeed implement an LSTM-CRF that recognizes nested entities without depth limitation, but such implementations are rather rare.
Most machine learning software (including LSTM-CRF implementations) is trained on the CoNLL (tab-separated) dataset format, which is not convenient for unlimited-depth nesting. Many datasets and systems implement fixed-depth nesting using additional columns (roughly one per nesting depth). Software may use separate or joint learning for each depth, or use cascading models.
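For illustration, a fixed-depth encoding of the question's "Bank of China" example, with one extra column per nesting level, might look like this (hypothetical two-level scheme):

```
Bank    B-ORG   O
of      I-ORG   O
China   I-ORG   B-LOC
```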

Is it possible to generate parts of a meta model from upper layer?

Based on the four-layer MOF structure, I'm currently working on a model (in fact a UML class diagram) at the M1 level. However, I observed that some parts of the meta model highly depend on references to certain classes, which may differ depending on the use case. Therefore, I created a meta model at the M2 level, which allows users to define the variable parts of the M1 model, which can then be generated and incorporated into the M1 model. The following image tries to depict that:
A resulting M1 model example would then look like this:
As switching between the different levels can be a little confusing, I wonder whether this approach is possible per se and UML-conformant? Furthermore, is there by chance a notation for the "generated instances" relation in Figure 1? Within the MOF spec, <<merge>> or <<import>> is for example used, which might fit that purpose.
Your question is probably too broad to give a concise answer. However, here's my advice when dealing with meta models: I found that people hardly have an idea why you need a meta model at all, and it takes quite some time to convince them to start creating one, even with so-called UML pros. Now, with that in mind, it's evident that modelers who are supposed to use the meta model might have even more difficulties dealing with it. This leaves just one way: keep it simple. And that's what I did in the past: introducing a meta model with just the basics, concentrating on meta types, tagged values and some connectors. After a while, people really got used to it and appreciated working with the meta model. Only then does the need arise to switch to a version two, which is still static though.
Now, what you want looks like a version ninety-nine. This would probably only work in a super model where you have some gurus floating on top of it all and providing a meta meta model. This is going to be interesting, and I'd like to be part of that team. However, I doubt you will be able to get practicable results from it. My recommendation is to stay with the static meta model. Everything else will likely lead you nowhere.

DDD/CQRS for composite .NET app with multiple databases

I'll admit that I am still quite a newbie with DDD and even more so with CQRS. I also realize that DDD and/or CQRS might not be the right approach to every problem. Nevertheless, I like the principles but have some questions in the context of a current project.
The solution is a simulator that generates performance data based on the current configuration. Administrators can create and modify the specifications for simulations. Testers set some environmental conditions and run the simulator. The results are captured, aggregated and reported.
The solution consists of 3 component areas, each with its own use-cases, domain logic and supporting data structures. As a result, a modular design seems appealing as a way to segregate logic and separate concerns.
The first area would be the administrative aspect which allows users to create and modify the specifications. This would be a CRUD heavy 'module'.
The second area would be for executing the simulations. The domain model would be similar to the first area but optimized for executing the simulation as opposed to providing a convenient model for editing.
The third area is reporting.
From this I believe that I have three Bounded Contexts, yes? I have three clear entry points into the application, three sets of domain logic and three different data models to support the domain logic.
My first instinct is to follow these lines and create three modules (assemblies) that encapsulate the domain layer for each area. Should I also have three separate databases? Maybe more than three to support write versus read?
I gather this may be preferred for CQRS but am not sure how to go about it. It appears to me that CQRS suggests a set of back-end processes that move data around. But if that's the case, and data persistence is cross-cutting (as DDD suggests), then doesn't my data access code need awareness of all of the domain objects? If so, then is there a benefit to having separate modules?
Finally, something I failed to mention earlier is that specifications are considered 'drafts' until published, which makes them available for simulation. My PublishingService needs to have knowledge of the domain model for both the first and second areas, so that when it responds to the SpecificationPublishedEvent it can read the specification, translate the model and persist it for execution. This makes me think I don't have three bounded contexts after all. Or am I missing something in my analysis?
You may have a modular UI for this, but I don't see three separate domains in what you are describing necessarily.
First off, in CQRS reporting is not directly a domain model concern, it is a facet of the separated Read Model which takes on the responsibility of presenting the domain state optimized for reporting.
Second just because you have different things happening in the domain is not necessarily a reason to bound them away from each other. I'd take a read through the blue DDD book to get a bit better feel for what BCs look like.
I don't really understand your domain well enough but I'll try to give some general suggestions.
Start with where you talked about your PublishingService. I see a Specification aggregate root which takes a few commands that probably look like CreateNewSpecification, UpdateSpecification and PublishSpecification.
The events look similar and probably feel redundant: SpecificationCreated, SpecificationUpdated, SpecificationPublished. Which kind of sucks, but a CRUD-heavy model doesn't have very interesting behaviors. I'd also suggest finding an automated way to deal with model/schema changes on this aggregate, which will be tedious if you don't use code generation, or handling the changes in a dynamic way that doesn't require you to build new events each time.
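Purely for illustration (sketched in Python for brevity; the names are hypothetical and the real model would live in your .NET solution), the command/event shapes might look like:

```python
# Illustrative sketch only; all names are hypothetical.
from dataclasses import dataclass, field

# Commands express intent against the Specification aggregate
@dataclass
class CreateNewSpecification:
    spec_id: str
    payload: dict = field(default_factory=dict)

@dataclass
class PublishSpecification:
    spec_id: str

# Events record what happened; other areas react to them
@dataclass
class SpecificationPublished:
    spec_id: str

def on_specification_published(event: SpecificationPublished) -> None:
    # E.g., a PublishingService reads the spec, translates the model,
    # and persists it in the execution area's store.
    ...
```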
Also you might just consider not using event sourcing for such an aggregate root since it is so CRUD heavy.
The second thing you describe seems to be about starting a simulation which will run based on a Specification and produce data during that simulation (I assume). An event driven architecture makes sense here to decouple updating the reporting data from the process that is producing the data. This has huge benefits if you are producing large amounts of data to process.
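A minimal sketch of that decoupling (hypothetical names, Python for brevity): the simulation publishes events, and a separate handler updates the reporting data:

```python
# Sketch of decoupling reporting from the data-producing simulation.
# All names are hypothetical.
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class SimulationDataProduced:
    simulation_id: str
    metrics: dict

_subscribers: List[Callable[[SimulationDataProduced], None]] = []

def subscribe(handler: Callable[[SimulationDataProduced], None]) -> None:
    _subscribers.append(handler)

def publish(event: SimulationDataProduced) -> None:
    for handler in _subscribers:
        handler(event)

def update_reporting_model(event: SimulationDataProduced) -> None:
    # Aggregate into the reporting store, independent of the simulation run
    ...

subscribe(update_reporting_model)
```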
However, it doesn't sound like a Simulation is necessarily the kind of AR that would benefit from Event Sourcing either, for a couple of reasons:
Simulation really takes only one Command which is something like StartSimulation
Simulation then produces events over its lifetime which represent what is happening internally with the simulation
Simulation doesn't seem to ever receive any other Commands that could depend on the current state of the Simulation
Simulation is not interacted with by multiple clients/users simultaneously and, as we pointed out, it isn't really interacted with at all
In general, domain modeling is very specific to each individual project so it's hard to give you all the information you need to build your domain model. It will come as a result of spending a great deal of time trying to understand your user's needs and the problem they are trying to solve with the software. It likely will go through multiple refinements as you develop insights into their process.
