Yahoo Vespa Create a search definition in runtime - search

I would like to know if there is any API in "vespa platform" which I can use to create a search definition (sd) in runtime.
This is a requirement, because the documents that I will index are depending on the user input in my front end application.

No, there is no such API available. The idea of deploying an immutable application package (including the SD) is a conscious design choice to ensure appropriate management of multiple search clusters in multiple locations over time as well as enabling source control management.
If needed, one could build what you describe "on top" of Vespa: A web service that will let you mutate an existing SD and, upon submit, create the updated application package and deploy to your Vespa cluster. Vespa will (in most cases) handle schema changes without impacting serving.

Related

Decision path for Azure Service Fabric Programming Models

Background
We are looking at porting a 'monolithic' 3 tier Web app to a microservices architecture. The web app displays listings to a consumer (think Craiglist).
The backend consists of a REST API that calls into a SQL DB and returns JSON for a SPA app to build a UI (there's also a mobile app). Data is written to the SQL DB via background services (ftp + worker roles). There's also some pages that allow writes by the user.
Information required:
I'm trying to figure out how (if at all), Azure Service Fabric would be a good fit for a microservices architecture in my scenario. I know the pros/cons of microservices vs monolith, but i'm trying to figure out the application of various microservice programming models to our current architecture.
Questions
Is Azure Service Fabric a good fit for this? If not, other recommendations? Currently i'm leaning towards a bunch of OWIN-based .NET web sites, split up by area/service, each hosted on their own machine and tied together by an API gateway.
Which Service Fabric programming model would i go for? Stateless services with their own backing DB? I can't see how Stateful or Actor model would help here.
If i went with Stateful services/Actor, how would i go about updating data as part of a maintenance/ad-hoc admin request? Traditionally we would simply login to the DB and update the data, and the API would return the new data - but if it's persisted in-memory/across nodes in a cluster, how would we update it? Would i have to expose this all via methods on the service? Similarly, how would I import my existing SQL data into a stateful service?
For Stateful services/actor model, how can I 'see' the data visually, with an object Explorer/UI. Our data is our Gold, and I'm concerned of the lack of control/visibility of it in the reliable services models
Basically, is there some documentation on the decision path towards which programming model to go for? I could model a "listing" as an Actor, and have millions of those - sure, but i could also have a Stateful service that stores the listing locally, and i could also have a Stateless service that fetches it from the DB. How does one decide as to which is the best approach, for a given use case?
Thanks.
What is it about your current setup that isn't meeting your requirements? What do you hope to gain from a more complex architecture?
Microservices aren't a magic bullet. You mainly get four benefits:
You can scale and distribute pieces of your overall system independently. Service Fabric has very sophisticated tools and advanced capabilities for this.
You can deploy and upgrade pieces of your overall system independently. Service Fabric again has advanced capabilities for this.
You can have a polyglot system - each service can be written in a different language/platform.
You can use conflicting dependencies - each service can have its own set of dependencies, like different framework versions.
All of this comes at a cost and introduces complexity and new ways your system can fail. For example: your fast, compile-time checked in-proc method calls now become slow (by comparison to an in-proc function call) failure-prone network calls. And these are not specific to Service Fabric, btw, this is just what happens you go from in-proc method calls to cross-machine I/O - doesn't matter what platform you use. The decision path here is a pro/con list specific to your application and your requirements.
To answer your Service Fabric questions specifically:
Which programming model do you go for? Start with stateless services with ASP.NET Core. It's going to be the simplest translation of your current architecture that doesn't require mucking around with your data layer.
Stateful has a lot of great uses, but it's not necessarily a replacement for your RDBMS. A good place to start is hot data that can be stored in simple key-value pairs, is accessed frequently and needs to be low-latency (you get local reads!), and doesn't need to be datamined. Some examples include user session state, cache data, a "snapshot" of the most recent items in a data stream (like the most recent stock quote in a stream of stock quotes).
Currently the only way to see or query your data is programmatically directly against the Reliable Collection APIs. There is no viewer or "management studio" tool. You have to write (and secure) an API in each service that can display and query data.
Finally, the actor model is a very niche model. It serves specific purposes but if you just treat it as a data store it will not work for you. Like in your example, a listing per actor probably wouldn't work because you can't query across that list, or even have multiple users reading the same listing simultaneously.

How Does the Single Store Xpages work?

I have a number of XPAges design elements that I use in many different databases. If I read the wiki correctly the single store is an all or nothing situation.
So I want to create unique design in a database but use the set of reusable XPages element from a single store location. the wiki says:
Apart from the "dummy or blank XPage with the same name of the default XPage" in each instance application, does it matter if an 'instance' contains XPage design elements?
No. If SCXD is set on an application all XPages design elements are ignored on the database and the application uses the design elements on the SCXD database.
If this is the case then I have to create databases where probably 75% of the code is reusable but I would have to repeat it (and maintain it) in dozens of separate databases. pity!
XPages and related elements (Custom Controls, SSJS Libraries, Java Code) can be inherited from a specific template like other design elements. So, I would setup a database called, perhaps, "Core Components" (.ntf or .nsf) with a template name of "CoreComponents". Then on the individual elements in the target DB you would set inheritance to be specifically from the "CoreComponents" template. Then the elements that are unique to each database do not inherit from any template. You can then use File-Application-Refresh design to update the elements with specific inheritance and the one which are unique in that database will not get overwritten.
You do need to do a clean build after the refresh, so I recommend that you keep the Core Components database locally or on a different server than the others so that the daily design task will not update them resulting in corrupted xsp elements.
IBM's preferred model for reusing XPage artifacts across multiple applications is to create OSGi plugins that leverage the XPages Extensibility API.
NotesIn9 episode 64 demonstrates how to make an existing Custom Control design element a library component, which can then be used in any app that has the library available, instead of having to copy the design element to each app separately. Any subsequent changes to that component are then applied immediately to any apps that use it when a new version of the library is deployed.
If you truly have "dozens" of apps that all share certain features, but the entire design should not be identical across all of them, then the OSGi model is definitely the way to go.
But why not flip the entire model on its head? Traditionally, we've always put the code and the data in the same place (e.g. same NSF) because it was a pain to access -- and, especially, visually represent -- data in one NSF via code in another NSF. That's not true anymore. Why have dozens of apps just because the data lives in dozens of places? Any data source in XPages can be told where the data lives... you can link a central user interface to any number of "remote" data stores (either different NSFs on the same server, or even databases on other servers).
Red Pill, for instance, takes this to its logical extreme: they deploy one NSF, which acts as a portal to all your data, no matter where that data lives. The ACLs of the various NSFs (and Readers fields) still ensure that users don't pry into data they haven't been granted access to, and they have complex analytics algorithms for determining which data the users will actually care about. But if you have 500 NSFs in the domain, you're not maintaining 500 different code templates... it's literally just 1; but that one user interface is how users find, and interact with, all their data.
You certainly don't have to take this premise to that extreme, but perhaps you could identify, say, 5 apps where the UI and / or business logic is similar (or even identical), but the data just lives in multiple places. Create one central app for interacting with all of that data. Create a "homepage" that gives users a way to select which "app" they're trying to access (or, if they should only have access to one to begin with, compute which one that is), and then once they navigate in to the specific "app", just bind the data sources to the relevant NSF instead of assuming each view or document lives in the same NSF that the code does.
It's still a good idea to be aware of the Extensibility API, not only for the sake of code reusability, but also to understand just how much of the behavior of the platform truly is within our control now -- provided, of course, that we're willing to occasionally write some custom Java code. But if you shift away from the one-to-one mapping between code and data that we've habitually maintained in Domino for so long, I can practically guarantee that you'll prefer this approach... both for the ease of implementation and maintenance, and for the comparative simplicity it offers to end users.
You can combine the template technique and the all-code-in-one-database approach:
Divide the application design into two parts: a data part and a code part.
The data part contains all Notes views. If it's an classic Notes application it would contain also all design elements for Notes client like Forms, Subforms, Frames and so on.
The code part contains all XPages, Custom Controls, CSS, client/server JavaScript libraries, Themes, images, jars and so on.
Put your 75% common code into masterData.ntf and masterCode.ntf.
The application code databases appCodeX.ntf inherit all design elements of masterCode.ntf and contain the additional application specific design elements.
The code from all application templates gets united in allCode.ntf. It inherits all from masterCode.ntf and inherits the additional pieces of code from application templates.
Based on that you create an allCode.nsf.
On the data side you use the classic template way.
From here you have to possibilities:
You use Single Copy XPage Design - connect every appData database with allCode.nsf
You connect your XPages in allCode.nsf with appData databases
I prefer the latter. You can define in allCode.nsf where all the application data databases are located, e.g. in property documents.
With the approach showed in picture you're still able to separate application easily e.g. in case you want to sell them. You have already a separate template for every single application.

Azure Web Site Migrations & Concurrency

I have two Azure Websites set up - one that serves the client application with no database, another with a database and WebApi solution that the client gets data from.
I'm about to add a new table to the database and populate it with data using a temporary Seed method that I only plan on running once. I'm not sure what the best way to go about it is though.
Right now I have the database initializer set to MigrateDatabaseToLatestVersion and I've tested this update locally several times. Everything seems good to go but the update / seed method takes about 6 minutes to run. I have some questions about concurrency while migrating:
What happens when someone performs CRUD operations against the database while business logic and tables are being updated in this 6-minute window? I mean - the time between when I hit "publish" from VS, and when the new bits are actually deployed. What if the seed method modifies every entry in another table, and a user adds some data mid-seed that doesn't get hit by this critical update? Should I lock the site while doing it just in case (far from ideal...)?
Any general guidance on this process would be fantastic.
Operations like creating a new table or adding new columns should have only minimal impact on the performance and be transparent, especially if the application applies the recommended pattern of dealing with transient faults (for instance by leveraging the Enterprise Library).
Mass updates or reindexing could cause contention and affect the application's performance or even cause errors. Depending on the case, transient fault handling could work around that as well.
Concurrent modifications to data that is being upgraded could cause problems that would be more difficult to deal with. These are some possible approaches:
Maintenance window
The most simple and safe approach is to take the application offline, backup the database, upgrade the database, update the application, test and bring the application back online.
Read-only mode
This approach avoids making the application completely unavailable, by keeping it online but disabling any feature that changes the database. The users can still query and view data while the application is updated.
Staged upgrade
This approach is based on carefully planned sequences of changes to the database structure and data and to the application code so that at any given stage the application version that is online is compatible with the current database structure.
For example, let's suppose we need to introduce a "date of last purchase" field to a customer record. This sequence could be used:
Add the new field to the customer record in the database (without updating the application). Set the new field default value as NULL.
Update the application so that for each new sale, the date of last purchase field is updated. For old sales the field is left unchanged, and the application at this point does not query or show the new field.
Execute a batch job on the database to update this field for all customers where it is still NULL. A delay could be introduced between updates so that the system is not overloaded.
Update the application to start querying and showing the new information.
There are several variations of this approach, such as the concept of "expansion scripts" and "contraction scripts" described in Zero-Downtime Database Deployment. This could be used along with feature toggles to change the application's behavior dinamically as the upgrade stages are executed.
New columns could be added to records to indicate that they have been converted. The application logic could be adapted to deal with records in the old version and in the new version concurrently.
The Entity Framework may impose some additional limitations in the options, because it generates the SQL statements on behalf of the application, so you would have to take that into consideration when planning the stages.
Staging environment
Changing the production database structure and executing mass data changes is risky business, especially when it must be done in a specific sequence while data is being entered and changed by users. Your options to revert mistakes can be severely limited.
It would be necessary to do extensive testing and simulation in a separate staging environment before executing the upgrade procedures on the production environment.
I agree with the maintenance window idea from Fernando. But here is the approach I would take given your question.
Make sure your database is backed up before doing anything (I am assuming its SQL Azure)
Put up a maintenance page on the Client Application
Run the migration via Visual Studio to your database(I am assuming you are doing this through the console) or a unit test
Publish the website/web api websites
Verify your changes.
The main thing is working with the seed method via Entity Framework is that its easy to get it wrong and without a proper backup while running against Prod you could get yourself in trouble real fast. I would probably run it through your test database/environment first (if you have one) to verify what you want is happening.

mvc-mini-profiler - working with a load balanced web role (azure et al)

I believe that the mvc mini profiler is a bit of a 'God-send'
I have incorporated it in a new MVC project which is targeting the Azure platform.
My question is - how to handle profiling across server (role instance) barriers?
Is this is even possible?
I don't understand why you would need to profile these apps any differently. You want to profile how your app behaves on the production server - go ahead and do it.
A single request will still be executed on a single instance, and you'll get the data from that same instance. If you want to profile services located on a different physical tier as well, that would require different approaches; involving communication through internal endpoints which I'm sure the mini profiler doesn't support out of the box. However, the modification shouldn't be that complicated.
However, would you want to profile physically separated tiers, I would go about it in a different way. Specifically, profile each tier independantly. Because that's how I would go about optimizing it. If you wrap the call to your other tier in a profiler statement, you can see where the problem lies and still be able to solve it.
By default the mvc-mini-profiler stores and delivers its results using HttpRuntime.Cache. This is going to cause some problems in a multi-instance environment.
If you are using multiple instances, then some ways you might be able to make this work are:
to change the Http Cache to an AppFabric Cache implementation (or some MemCached implementation)
to use an alternative Storage strategy for your profile results (the code includes SqlServerStorage as an example?)
Obviously, whichever strategy you choose will require more time/resources than just the single instance implementation.

Use cases of the Workflow Engine

I'd like to know about specific problems you - the SO reader - have solved using Workflow Engines and what libraries/frameworks you used if you didn't roll your own. I'd also like to know when a Workflow Engine wasn't the best choice and if/how you chose something simpler, like a TaskList/WorkList/Task-Management type application using state machines.
Questions:
What problems have you used workflow engines to solve?
What libraries/frameworks did you use?
When did a simpler State Machine/Task Management like system suffice?
Bonus: How did/do you make the distinction between Task Management and Workflow Engine?
I'm looking for first-hand experiences.
Some of the resources I've checked out:
Ruote and State Machine != Workflow Engine
StonePath and Docs
Creating and Managing Worklist Task Plans with Oracle
Design and Implementation of a Workflow Engine - Thesis
What to use Windows Workflow Foundation For
JBoss jBPM Docs
I'm biased as well, as I am the main author of StonePath.
I have developed workflow applications for the U.S. State Department, the Geneva Centre for Humanitarian Demining, several fortune 500 clients, and most recently the Washington DC Public School System. Every time I have seen a 'workflow engine' that tried to be the one master reference for business processes, I have seen an organization fighting itself to work around the tool. This may be due to the fact that these solutions have always been vendor/product driven, and then end up with a tactical team of 'consultants' constantly feeding the app... but because of this, I tend to react negatively when I hear the benefits of process-based tools that promise to 'centralize the workflow definitions in one place and make them repeatable'.
I very much like Ruote - I have been following that project for some time and should I need that kind of solution, it will be the next tool I'll be willing to try. StonePath has a very different purpose than ruote
Ruote is useful to Ruby in general,
StonePath is aimed at Rails, the web framework written in Ruby.
Ruote is about long-lived business processes and their associated definitions (Note - active development on ruote ceased).
StonePath is about managing State-based workflow and tasking.
Frankly, I think the distinction from the outside looking in might be subtle - many times the same kinds of business processes can be represented either way - the state-and-task-based model tends to map to my mental model though.
Let me describe the highlights of a state-based workflow.
States
Imagine a workflow revolving around the processing of something like a mortgage loan or a passport renewal. As the document moves 'around the office', it travels from state to state.
If you are responsible for the document, and your boss asked you for a status update you'd say things like
"It is in data entry"...
"We are checking the applicant's credentials now"...
"we are awaiting quality review"...
"We are done"... and so on.
These are the states in a state-based workflow. We move from state to state via transitions - like "approve", "apply", kickback", "deny", and so on. These tend to be action verbs. Things like this are modeled all the time in software as a state machine.
Tasks
The next part of a state/task-based workflow is the creation of tasks.
A Task is a unit of work, typically with a due date and handling instructions, that connects a work item (the loan application or passport renewal, for instance), to a users "in box".
Tasks can happen in parallel with each other or sequentially
Tasks can be created automatically when we enter states,
Create tasks manually as people realize work needs to get done
Require tasks to be completed before we can move onto a new state.
This kind of behavior is optional, and part of the workflow definition.
The rabbit hole can go a lot deeper than this, and I wrote an article about it for Issue #4 of PragPub, the Pragmatic Programmer's Magazine. Check out the repo link above for an updated PDF of that article.
In working with StonePath the last few months, I have found that the state-based model maps really well to restful web architectures - in particular, the tasks and state transitions map nicely as nested resources. Expect to see future writing from me on this subject.
I'm biased, I'm one of the authors of ruote.
variant 1) state machine attached to a resource (document, order, invoice, book, piece of furniture).
variant 2) state machine attached to a virtual resource named a task
variant 3) workflow engine interpreting workflow definitions
Now your question is tagged "BPM" we can be expanded into "Business Process management". How does that kind of management occur in each of the variant ?
In variant 1, the business process (or workflow) is scattered in the application. The state machine attached to the resource enforces some of the aspects of the workflow, but only those related to the resource. There may be other resources with their own state machine following the same business process.
In variant 2, the workflow can be concentrated around the task resource and represented by the state machine around that resource.
In variant 3, the workflow is enacted by interpreting a resource called a workflow definition (or business process definition).
What happens when the business process changes ? Is it worth having a workflow engine where business processes are manageable resources ?
Most of the state machine libraries have 1 set states + transitions. Workflow engines are, most of them, workflow definition interpreters and they allow multiple different workflows to run together.
What will be the cost of changing the workflow ?
The variants are not mutually exclusive. I have seen many examples where a workflow engine changes the state of multiple resources some of them guarded by state machines.
I also use variant 3 + 2 a lot, for human tasks : the workflow engine, at some points when running a process instance, hands a task (workitem) to a human participant (resource task is created and placed in state 'ready').
You can go a long way with variant 2 alone (the task manager variant).
We could also mention variant 0), where there is no state machine, no workflow engine, and the business process(es) are scattered and/or hardcoded in the application.
You can ask many questions, but if you don't take the time to read the answers and don't take the time to try out and experiment, you won't go very far, and will never acquire any flair for when to use this or that tool.
On a previous project I was working on i added some Workflow type rules to a set of Government Forms in the Healhcare industry.
Forms needed to be filled out by the end user , and depending on some answers other Forms were scheduled to be filled out at a later date. There were also external events that would cancel scheduled Forms or schedule new ones.
Sample Flow :
Patient Admitted -> Schedule Initial Assessment FOrm -> Schedule Quarterly Review Form -> Patient Died -> Cancel Review -> Schedule Discharge Assessment Form
Many other rules were based on things such as Patient age, where they were being admitted etc.
This was an ASP.NET app, the rules were basically a table in the database. I added scripting, so a script would run on Form completion to determine what to do next. This was a horrid design, and would have been perfect for a proper Workflow engine.
I'm one of the authors of the open source Temporal Workflow Engine we initially developed at Uber as Cadence. The difference between Temporal and the majority of the existing workflow engines is that it is developer focused and is extremely flexible and scalable (to tens of thousands updates per second and up to billions of open workflows). The workflows are written as object oriented programs and the engine ensures that the state of the workflow objects including thread stacks and local variables is fully preserved in case of host failures.
What problems have you used workflow engines to solve?
Temporal is used for practically any backend application that lives beyond a single request reply. Examples of usage are:
Distributed CRON jobs
Managing ML/Data pipelines
Reacting to business events. For example trip events at Uber. The workflow can accumulate state based on events received and execute activities when necessary.
Services Deployment to Mesos/ Kubernetes
CI Pipeline implementation
Ensuring that multiple service calls complete when a request is received. Including SAGA pattern implementation
Managing human worker tasks (similar to Amazon MTurk)
Media processing
Customer Support Ticket Routing
Order processing
Testing service similar to ChaosMonkey
and many others
The other set of use cases is based on porting existing workflow engines to run on Temporal. Practically any existing engine workflow specification language can be ported to run on Temporal. This way a single backend service can power multiple domain specific workflow systems.
What libraries/frameworks did you use?
Temporal is a self contained service written in Go with Go, Java, PHP, and Typescript client side SDKs (.NET and Python are coming in 2022). The only external dependency is storage. Cassandra, MySQL and, PostgreSQL are supported. Elasticsearch can be used for advanced indexing.
Temporal also support asynchronous cross region (using AWS terminology) replication.
When did a simpler State Machine/Task Management like system suffice?
Open source Temporal service can be self hosted or temporal.io cloud offering can be used. So the overhead of building any custom state machine/task management is always higher than using Temporal. Outside the company the service and storage for it need to be set up. If you already have an SQL database the service deployment is trivial through a docker image. The docker is also used to run a local Temporal service for development on a personal computer or laptop.
I am one of the authors of Imixs-Workflow. Imixs-Workflow is an open source workflow engine based on BPMN 2.0 and fully integrated into the Java EE technology stack.
I develop workflow engines by myself since more than 10 years. I will try to answer your question in short:
> What problems have you used workflow engines to solve?
My personal goal when I started to think about workflow engines was to avoid hard codding the business logic within my application. Many things in a business application can be reused so it makes sense to keep them configurable. For example:
sending out a notification
view open tasks
assigned a task to a person
describing the current task
From this function list you can see I am talking about human-centric workflows. In short: A human-centric workflow engine answers the questions: Who is responsible for a task and who needs to be informed next? And these are the typical questions in business requirements.
>What libraries/frameworks did you use?
5 years ago we started reimplementing Imixs-Workflow engine focusing on BPMN 2.0. BPMN is the common standard for process modeling. And the surprising thing for me was that we were suddenly able to describe even highly complex business processes that could be visualized and executed. I recommend everyone to use BPMN for modeling business processes.
> When did a simpler State Machine/Task Management like system suffice?
A simple state machine is sufficient if you just want to track the status of a business object. This is the case when you begin to introduce the 'status' attribute into your object model. But in case you need business processes with responsibilities, logging and flow control, then a state machine is no longer sufficient.
> Bonus: How did/do you make the distinction between Task Management and Workflow Engine?
This is exactly the point where many workflow engines mentioned here differ. For a human-centric workflow you typically need a task management to distribute tasks between human actors. For a process automation, this point is not so relevant. It is sufficient if the engine performs certain tasks. Task management and workflow engines can not be compared because task management is always a function of a workflow engine.
Check rails_workflow gem - I think this is close to what you searching.
I have an experience with using Activiti BPMN 2.0 engine for handling high-performance and high-throughput data transfer processes in an infrastructure of network nodes. The basic task was to allow configuration and monitoring of such transfer processes and control each network node (ie. request node1 to send a data file to node2 via specific transport layer).
There could be thousands of processes running at a time and overall tens or low hundreds of thousands processes per day.
There were bunch of different process definitions but it was not necessarily required that an operator of the system could create custom workflows. So the primary use case for the BPM engine itself was to be robust, scalable and allow monitoring of each process flow.
In the end it basically worked but what we learned from that project was that a BPMN platform, or rather the Activiti engine specifically, was not the best bet for such a high-throughput system.
The main challenges were task execution prioritization, DB locking, execution retries to name the few concerning the BPM itself. So we had to develop custom handling of these, for example:
Handling of retries in the BPM for cases when a node had no free worker for given task, or when the node was not running at all.
Execution of parallel transfer tasks in a single process and synchronization of the results (success/failure).
I don't know if other BPMN engines would be more suitable for such scenario since BPMN is mostly intended for long-running business tasks involving user interaction where performance is probably not the same issue as was in our case.
I rolled my own workflow engine to support phased processing of documents - cataloging, sending for image processing (we work with redaction sw), if needed sending to validation, then release and finally shipping back to the client. In our case we have a truckload of documents to process so sometimes we need to run each service separately to control delivery and resources usage. Simple in concept but high performance and distributed processing needed, and we could't find any off the shelf product that fit the bill for us.

Resources