Keyword Search in microservices based architecture - search

I need some advice as to how a search needs to be implemented to search keywords within relational databases owned by microservices.
I have some microservices with their own relational DB. These microservices are likely to be deployed in a docker container.
What would be the best way to use a search engine like Apache SOLR so that each of the microservices' database can be indexed and we can achieve keyword search
Thanks in advance

This seems to be an architectural question and while it makes sense, the question is also a little open ended depending on what your system requires. A couple of things that come off the top of my head is:
Use the DataImportHandler from Apache SOLR.
Use a message queue like Kafka or Kinesis and have the independent services consume from it to propagate to their data stores in this case, a search service backed by Apache SOLR and another service backed by MySQL.
Personally, I've never used the DataImportHandler myself but my initial thoughts are that it couples Apache SOLR to MySQL. Setting up the DataImportHandler requires Apache SOLR to know the MySQL schema, the access credentials, etc. Because of this, I would advise the second option which moves towards a shared-nothing architecture.
I'm going to call the service that is backed by MySQL the "entity" service as it sounds like its going to be the canonical service for saving some particular type of object. The entity service and the search service will have its own particular consumer that ingests events from Kinesis or Kafka into their data stores, the search service to Apache SOLR and the entity service to MySQL.
This helps decouple the services from knowing that each other exists and also allow each of them to scale independently from each other. It'll be a redundancy in data but it should be alright because the data access patterns are different.
The other caveat that I'd like to mention is that it assumes that the entity for which you're saving is allowed to be asynchronous. Notice that messages in this system doesn't require it to be persisted in MySQL at the moment which in this case is the entity service. However, you may change it to your liking such that a message persists in the entity service and then is propagated through a queue to a search service to index. After its been index, you can just add additional endpoints to search Apache SOLR. Hope this gives you some insight on how some other architectures may come into play. If you give a little more insight as to your system and the entities that involved, you might be able to get a better answer.

Related

Shopware 6 partitioning

Has anyone had any experience with database partitioning? We already have a lot of data and queries on it are already starting to slow down. Maybe someone has some examples? These are tables related to orders.
Shopware, since version 6.4.12.0, allows the use of database clusters, see the relevant documentation. You will have to set up a number read-only nodes first. The load of reading data will then be distributed among the read-only nodes while write operations are restricted to the primary node.
Note that in a cluster setup you should also use a lock storage that compliments the setup.
Besides using a DB cluster you can also try to reduce the load of the db server.
The first thing you should enable the HTTP-Cache, still better to additionaly also set up a reverse cache like varnish. This will greatly decrease the number of requests that hit your webserver and thus your DB server as well.
Besides all those measures explained here should improve the overall performance of your shop as well as decreasing load on the DB.
Additionally you could use Elasticsearch, so that costly search requests won't hit the Database. And use a "real" MessageQueue, so that the messages are not stored in the Database. And use Redis instead of the database for the storage of performance critical information as is documented in the articles in this category of the official docs.
The impact of all those measures probably depends on your concrete project setup, so maybe you see in the DB locks something that hints to one of the points i mentioned previously, so that would be an indicator to start in that direction. E.g. if you see a lot of search related queries Elasticsearch would be a great start, but if you see a lot of DB load coming from writing/reading/deleting messages, then the MessageQueue might be a better starting point.
All in all when you use a DB cluster with a primary and multiple replicas and use the additional services i mentioned here your shop should be able to scale quite well without the need for partitioning the actual DB.

Multiple web services sharing single database using Sequelize

I'm trying to make a service architecture which includes two Node.js apps which shares the same database. The overall service architecture looks like below (simplified version)
I'm planning to use Sequelize as an ORM to access the database. As far as I know, if a service uses Sequelize, it needs model to get the structure of data tables. In my case, api and service will access the same database, which means they should share the same Sequelize model.
So here is the question: where should I locate the common Sequelize relevant files? It seems I have two choices:
put them on the upper common location (assuming the project structure is monorepo) so that each apps can use the single same files
maintain copies of files in each apps' project folders. In this case, each apps will be independent(Let's say I want to dockerize each apps) but in case the Sequelize files modified, the same action should be done for the other.
I'm not sure how I understood is correct. Is my question valid? If so, what is the better choice and practice? I appreciate for your answers in advance.
There is no correct answer, it depends on the specific situation, but sharing a database between multiple microservices is a bad design.
Sharing a database means tight coupling at the data level. The direct consequence is that when a service modifies the database table structure, such as deleting the name field of the user table, it may break the APIs of other services and all use the sequelize user model. All services need to update the model definition and modify the implementation code of the API.
If all of your services are maintained by a team, I suggest you choose the first solution, which costs less and is easier to maintain. If your services are maintained by different teams, the two solutions are actually similar, because as long as the table structure is modified, the application layer model needs to be modified or verified whether it still works well.
Therefore, I recommend following the best practices of microservice architecture, first splitting the database vertically according to the business model, and building application APIs on top of it.
Core principles of microservices:
loose coupling
high cohesion

Decision path for Azure Service Fabric Programming Models

Background
We are looking at porting a 'monolithic' 3 tier Web app to a microservices architecture. The web app displays listings to a consumer (think Craiglist).
The backend consists of a REST API that calls into a SQL DB and returns JSON for a SPA app to build a UI (there's also a mobile app). Data is written to the SQL DB via background services (ftp + worker roles). There's also some pages that allow writes by the user.
Information required:
I'm trying to figure out how (if at all), Azure Service Fabric would be a good fit for a microservices architecture in my scenario. I know the pros/cons of microservices vs monolith, but i'm trying to figure out the application of various microservice programming models to our current architecture.
Questions
Is Azure Service Fabric a good fit for this? If not, other recommendations? Currently i'm leaning towards a bunch of OWIN-based .NET web sites, split up by area/service, each hosted on their own machine and tied together by an API gateway.
Which Service Fabric programming model would i go for? Stateless services with their own backing DB? I can't see how Stateful or Actor model would help here.
If i went with Stateful services/Actor, how would i go about updating data as part of a maintenance/ad-hoc admin request? Traditionally we would simply login to the DB and update the data, and the API would return the new data - but if it's persisted in-memory/across nodes in a cluster, how would we update it? Would i have to expose this all via methods on the service? Similarly, how would I import my existing SQL data into a stateful service?
For Stateful services/actor model, how can I 'see' the data visually, with an object Explorer/UI. Our data is our Gold, and I'm concerned of the lack of control/visibility of it in the reliable services models
Basically, is there some documentation on the decision path towards which programming model to go for? I could model a "listing" as an Actor, and have millions of those - sure, but i could also have a Stateful service that stores the listing locally, and i could also have a Stateless service that fetches it from the DB. How does one decide as to which is the best approach, for a given use case?
Thanks.
What is it about your current setup that isn't meeting your requirements? What do you hope to gain from a more complex architecture?
Microservices aren't a magic bullet. You mainly get four benefits:
You can scale and distribute pieces of your overall system independently. Service Fabric has very sophisticated tools and advanced capabilities for this.
You can deploy and upgrade pieces of your overall system independently. Service Fabric again has advanced capabilities for this.
You can have a polyglot system - each service can be written in a different language/platform.
You can use conflicting dependencies - each service can have its own set of dependencies, like different framework versions.
All of this comes at a cost and introduces complexity and new ways your system can fail. For example: your fast, compile-time checked in-proc method calls now become slow (by comparison to an in-proc function call) failure-prone network calls. And these are not specific to Service Fabric, btw, this is just what happens you go from in-proc method calls to cross-machine I/O - doesn't matter what platform you use. The decision path here is a pro/con list specific to your application and your requirements.
To answer your Service Fabric questions specifically:
Which programming model do you go for? Start with stateless services with ASP.NET Core. It's going to be the simplest translation of your current architecture that doesn't require mucking around with your data layer.
Stateful has a lot of great uses, but it's not necessarily a replacement for your RDBMS. A good place to start is hot data that can be stored in simple key-value pairs, is accessed frequently and needs to be low-latency (you get local reads!), and doesn't need to be datamined. Some examples include user session state, cache data, a "snapshot" of the most recent items in a data stream (like the most recent stock quote in a stream of stock quotes).
Currently the only way to see or query your data is programmatically directly against the Reliable Collection APIs. There is no viewer or "management studio" tool. You have to write (and secure) an API in each service that can display and query data.
Finally, the actor model is a very niche model. It serves specific purposes but if you just treat it as a data store it will not work for you. Like in your example, a listing per actor probably wouldn't work because you can't query across that list, or even have multiple users reading the same listing simultaneously.

Benefits of using a hosted search service over building your own

I'm building a B2B Node app which has heavily related data models. We currently have our own search queries, but as we scale some of the queries appear to be becoming sluggish.
We will need to support multilingual search as well as content-based searches (searching matching content within related data).
The queries are growing more and more complicated (each has multiple joins on joins on joins) and I'm now considering a hosted search tool such as Algolia.
Given my concerns below, why should I use a hosted cloud search service rather than continue building my own queries?
Data privacy is important
Data is hosted in our own postgres DB - integrations with that are important (e.g.: will I now need to manually maintain our DB data and data in Algolia?)
Speed will be important, but not so much now
Must be able to do content-based searches across multiple languages
We are a tiny team of devs now, so dev resource time is vital
What other things should I be concerned about that can help make a decision in search capabilities?
Regarding maintenance of both DB and Cloud data, it seems it's as simple as getting all data, caching it, and storing it in the cloud:
var index = Algolia.initIndex('contacts');
var contactsJSON = require('./contacts.json');
index.addObjects(contactsJSON, function(err, content) {
if (err) {
console.error(err);
}
});
Search services like Algolia or self-hosted Elasticsearch/solr operate as full text search, not relational db queries.
But it sounds like the bottleneck is the continual rejoining. Which if you can make your relational data act like a full text document db then that could be a more efficient type of index (pre-joined sort of).
You might also look into views, or a data warehouse (maybe star schema).
But if you are going the search route maybe investigate hosting your own elasticsearch.
You could specify database, schema, sql, index, query details if you want more help.
Full Disclosure: I founded a company called SearchStax on the premise that companies and developers should not spend time setting up, managing, scaling or building tools for the search infrastructure (ops) - they are better off investing time of their employees into building value for the company, whether that be features, capabilities, product or customers.
Open Source Search solutions based on top of Lucene (Apache Solr / Elasticsearch) have what you need now and what you might need in near future from a capability perspective from a search engine. Find a mature service provider / AS-A-Service company that has specialization in open source search and let them deal with all. It may look small effort right now, though it's probably not worth time and effort of your devs to spend time on the operations of that.
For your concerns mentioned above:
Data privacy is important
Your concern around Privacy and Security are addressable. There are multiple ways you can secure your Solr environment and the right MSP or a Managed Solution provider should be able to address those.
a. Security at the transport layer can be addressed by SSL certificates. All the data going over the wire is encrypted.
b. IP Filtering and User Based Authentication should address who has access to what. Solr-as-a-Service offering by Measured Search supports both.
c. Security at rest can be addressed in multiple ways - OS level / File encryption, but you can even go further by ensuring not even your services provider has access to that data by using Searchable Encryption technology.
Privacy concerns are all address by Terms & Conditions - I am sure your legal department will address that from a Service Provider's perspective.
Data is hosted in our own postgres DB - integrations with that are important
Solr provides ability to import data directly (DIH) through a traditional relational database (MySQL, Postgres, Oracle, etc). You can either use that so Solr can pull data periodically or write your own simple script to push data through the Solr APIs.
If you are hosted in the cloud (AWS), a tunnel can be created so only the Solr deployments have the ability to pull data from your servers and your database servers are not exposed to the world, if you choose to go the DIH route.
Speed will be important, but not so much now
Solr is built for search speed - I don't think that's where your problems are going to be. Service offering like Measured Search's - you can spin up a cluster in any data center supported by AWS or Azure and make sure your search deployments are closer to your application servers so the latency overhead is minimal.
Must be able to do content-based searches across multiple languages
Yes, Solr supports that. More than 30 languages.
We are a tiny team of devs now, so dev resource time is vital
I am biased here, but I would not have my developers spend much time on operations and let them focus on what they do best - build great product capabilities to push the limits and deliver business value.
If you are interested in doing a comparison and ROI of doing it yourself vs using a solr-as-a-service like offered by SearchStax, check this paper out - https://www.searchstax.com/white-papers/why-measured-search-is-better-than-diy-solr-infrastructure/

JavaScript and Java Querying of CouchDb

I am looking at Couch Db and I saw Ektorp that presents a JPA like interface for database. However I see that there are examples that how to make query at JavaScript. I didn't understand how the system work.
Do I query a database from web tier without a middle tier? How can security be done with that?
CouchDB uses javascript to define map and reduce functions for it's views. Ektorp is simply providing you a convenient way to create those functions that will be used by couchdb. You might want to read the couchdb wiki page on views:
http://wiki.apache.org/couchdb/Introduction_to_CouchDB_views
Just because the views are javascript, does not imply that you have to create the views from a 'web tier'.
In terms of architecture, you have a couple of options. You can use a traditional three tier approach with a java front end, and in your middle tier call couchdb with ektorp. Then you are in full control of security.
You can also go with what is coming to be known as the 2.1 tier model, where users interact directly with couchdb, mainly with a couchapp. You can then provide support services that listen to the changes feed. I have done this with ektorp and it works very well. Other have used node.js. It is a different way of thinking, but it can work. You can read a fun post about this model here:
http://markmail.org/thread/cfw7f3ef75aoqzin
Anyway, I just wanted to provide you with possible options in how you 'tier' your architecture.

Resources