How do I create in-memory search indexes in Elixir?

I am currently working on an Elixir/Phoenix project and I was wondering what a good way would be to create a quick in-memory search index.
The index would be created per request and destroyed when the request is over; the data currently comes from a database via Ecto. I would also like to query it by more than one key, so not just by :id but also by other fields such as :user_id, which means a flat key-value store may not be enough.
Are there any tools that would be helpful? I looked a bit into mnesia, but when using it with ecto3_mnesia a local file/folder was created, and I would prefer that everything stay in memory.
Thanks

I have no idea about ecto3_mnesia, but I am pretty sure raw :mnesia without any redundant wrapper is a good fit here (or even :ets, if you don't need a clustered solution).
:mnesia.create_table/2 accepts many options; the two you might be interested in are disc_copies and ram_copies. Simply initialize the former with an empty node list and the latter with your complete node list, and you are all set: no disc copies are created, everything stays in memory.

Related

Couchbase: retrieving relational docs in Node.js

I am still debating which way to go, and possibly storing certain information in its own doc. For example, each of a customer's addresses would be its own doc, and the customer doc would hold an array of reference keys stored under addresses. The benefit would be that I could update these address docs directly by key, versus having to get the customer doc first, find the array index of the address, and then either modify the whole doc or use the sub-document API to replace the content at that index.
Where I am stuck is how to retrieve those referenced docs. Is N1QL the only way to go, or does the KV API offer a way to do this short of retrieving the whole customer doc, then looping through the address array and retrieving every referenced doc that way? I know Ottoman offers something like that, but I am having an issue with the latest version of SDK 2.6 and Ottoman, as it is not very well maintained. Hopefully someone can share some insight into what the best way is and why.
If you want to rely on key/value, then you'll need to do the multiple lookups as you've described. I'm not very familiar with Ottoman: it might do this for you, but behind the scenes it will still be multiple key/value operations and/or N1QL.
With N1QL, you can perform JOINs, but again, behind the scenes it's going to eventually be pulling documents out by key/value. It just does those extra steps for you. Direct key/value is always going to be the fastest route.
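To make the key/value route concrete, here is a rough sketch of the multi-lookup against the Node SDK 2.x callback API; the bucket name, the key format, and the assumption that the customer doc keeps its address keys in an addresses array are mine, not anything from your model:

```typescript
import * as couchbase from "couchbase";

const cluster = new couchbase.Cluster("couchbase://localhost");
const bucket = cluster.openBucket("default"); // bucket name is a placeholder

// Wrap the SDK 2.x callback-style get() in a Promise.
function get(key: string): Promise<any> {
  return new Promise((resolve, reject) =>
    bucket.get(key, (err: Error | null, res: any) =>
      err ? reject(err) : resolve(res.value)
    )
  );
}

// Fetch the customer, then fetch every referenced address doc by key.
async function getCustomerWithAddresses(customerKey: string) {
  const customer = await get(customerKey);
  const addressKeys: string[] = customer.addresses || [];
  const addresses = await Promise.all(addressKeys.map((k) => get(k)));
  return { ...customer, addresses };
}
```

The N1QL equivalent would be a lookup join along the lines of SELECT c, a FROM bucket c JOIN bucket a ON KEYS c.addresses, but as noted it resolves those same keys for you behind the scenes.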
If you are still deciding whether to split the data across multiple documents or "denormalize" it into a single doc, one thing you should think about is how often you're going to access customer and addresses together versus how often you're going to access them separately. If you're reading/writing customer+address often, consider putting them in one document; otherwise, consider putting them in multiple documents.
The third option is to store the data in both places, or rather to "cache" the address data in the customer document. This is tricky, because the copies can get out of sync if you're not careful, so make sure it's worth it before you go down that road.

Using Lucene to index private data, should I have a separate index for each user or a single index

I am developing an Azure-based website and I want to provide search capabilities using Lucene. (Structured JSON objects would be indexed and stored in Lucene, and other content such as Word documents would be indexed in Lucene but stored in blob storage.) I want the search to be secure, such that one user would never see a document belonging to another user. I want to allow ad-hoc searches as typed by the user. Lastly, I want to query programmatically to return predefined sets of data, such as "all notes for user X". I think I understand how to add properties to each document to achieve these three objectives. (I am listing them here so that anyone kind enough to answer will have a better idea of what I am trying to do.)
My questions revolve around performance and security.
Can I improve document security by having a separate index for each user, or is including the user's ID as a parameter in each search sufficient?
Can I improve indexing speed and total throughput of the system by having a separate index for each user? My thinking is that having separate indexes would allow me to scale the system by having multiple index writers (perhaps even on different server instances) working at the same time, each on their own index.
Any insight would be greatly appreciated.
Regards,
Nate
Of course, one index.
You can do even better than what you suggested by using ManifoldCF (Apache product that knows how to handle Solr) to manage security.
And one off topic, uninformed suggestion: I'd rather use CloudBees or Heroku (or Amazon) instead of Azure.
Until you use several machines for indexing, I think it's more convenient to use a single index. The Lucene community has done a lot of work to make the indexing process as efficient as it can be, so unless you intentionally want to implement distributed indexing, I don't recommend splitting indexes.
However, there are several reasons why you would want to split indexes:
if your machine has several IO devices that can be utilized in parallel. In this case, if you are IO bound, splitting indexes is a good idea.
splitting document fields between indexes (this is what ParallelReader is intended for). This is a more exotic form of splitting, but it may be a good idea if searches are performed using different groups of fields. Suppose we have two query types: the first uses the fields name and type, and the second uses the fields price and discount. If those fields are updated at different rates (name updates are presumably far rarer than price updates), updating only part of the index requires fewer IO resources, which gives the system more overall throughput.

Initial DB structure / data for MongoDB + NodeJS web application

I'm developing a web application in Node.js with MongoDB as the back end. What I wanted to know is: what is the generally accepted procedure, if one exists, for creating initial collections and populating them with initial data, such as a whitelist of names or lists of predefined constants?
From what I have seen, MongoDB creates collections implicitly any time data is inserted and the collection being inserted into doesn't already exist. Is it standard to let these implicit insertions take care of collection creation, or do people using MongoDB have scripts set up which build the main structure and insert any required initial data? (For example, when using MySQL I'd have a .sql script which I can run to dump and rebuild/repopulate the database from scratch.)
Thank you for any help.
MHY
If you have data, this post on SO might be interesting for you. But since Mongo understands JavaScript, you can easily write a script that prepares the data for you.
It's the nature of Mongo to create everything that does not exist. This allows very flexible and agile development, since you are not constrained by types and don't need to check whether table x already exists before working with it. If you need to create collections dynamically, just ask the database for them and work with them, whether they already exist or not.
If you are looking for a certain object, be sure to check it (that it is not null, or that a certain key exists), because working with null objects may otherwise break your code.
There is absolutely no reason to use setup scripts merely to make collections and databases appear; both database and collection creation happen lazily.
Remember that MongoDB is a completely schema-free document store, so there's no way to even set up a specific schema in advance.
Tools to dump and restore database content (mongodump and mongorestore) are supplied with Mongo.
Now, if your application needs initial data (like configuration parameters or whitelists, as you suggest), it's usually best practice to have your application components set up their own data as needed and to offer data migration paths as well.
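As a sketch of that "application seeds its own data" approach with the Node.js MongoDB driver: everything specific here (database name, collection name, index, and whitelist contents) is a placeholder, and the script is written to be idempotent so it can run on every deploy or application start:

```typescript
import { MongoClient } from "mongodb";

async function seed(uri: string) {
  const client = await MongoClient.connect(uri);
  const db = client.db("myapp"); // placeholder database name

  // Collections appear implicitly on first insert, but touching them here
  // lets you attach indexes (or validators) up front if you want them.
  await db
    .collection("name_whitelist")
    .createIndex({ name: 1 }, { unique: true });

  // Upserts keep the script re-runnable without duplicating documents.
  const whitelist = ["alice", "bob"]; // hypothetical initial data
  for (const name of whitelist) {
    await db
      .collection("name_whitelist")
      .updateOne({ name }, { $setOnInsert: { name } }, { upsert: true });
  }

  await client.close();
}

seed("mongodb://localhost:27017").catch(console.error);
```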

Drupal: Avoid database when dealing with node type info?

I'm writing a Drupal module that deals with creating new nodes from CSV files. The way I've been doing it currently, the user provides a node type, and my module goes to the database to find the fields for that node.
After the user matches the node fields to the CSV fields, I want to validate the data. This requires finding out the types of the node fields. I'm not entirely sure how to do that. (Maybe look at the content_node_field table?)
Then I have to create the nodes. Currently, the module creates a new stdClass object, populates it with the necessary data, and saves it.
But what if I could abstract away from the database entirely and avoid dealing with it? What if I asked the user to point me at a node of this type that already exists? I could node_load() that node and use it to determine the node fields. When it comes time to save the new nodes, I could use the "seed" node to figure out what their structure needs to be.
One downside: this requires at least one node of this type to exist before the module can function.
Also, would this be slower than accessing the db directly?
I fear that over time, database table names could change, and content types could be defined across multiple tables. By working only from a pre-existing node, I could get around many of these issues. Right?
Surely node_load will be hitting the database anyway? The node fields are stored in the database, so if you need to get them, at some point you have to talk to the database. Given that some page loads on Drupal invoke hundreds (or even thousands!) of database queries, I really wouldn't worry about one or two!
Table names are unlikely to change, and the schema should stay fixed between point versions of Drupal, at least. It would be better practice to use the API to get the data you want if that is possible, though, as it would give better protection against change. I don't know if that's possible.

Strategies for search across disparate data sources

I am building a tool that searches people based on a number of attributes. The values for these attributes are scattered across several systems.
As an example, dateOfBirth is stored in a SQL Server database as part of system ABC. That person's sales region assignment is stored in some horrible legacy database. Other attributes are stored in a system only accessible over an XML web service.
To make matters worse, the legacy database and the web service can be really slow.
What strategies and tips should I consider for implementing a search across all these systems?
Note: Although I posted an answer, I'm not confident it's a great answer. I don't intend to accept my own answer unless no one else gives better insight.
You could consider using an indexing mechanism to retrieve and locally index the data across all the systems, and then perform your searches against the index. Searches would be an awful lot faster and more reliable.
Of course, this just shifts the problem from one part of your system to another - now your indexing mechanism has to handle failures and heterogeneous systems, but that may be an easier problem to solve.
Another factor is how often the data changes. If you have to query data in real-time that goes stale very quickly, then indexing may not be practical.
If you can get away with a restrictive search, start by returning a candidate list from the fastest data source using the criteria that source can answer. Then join those records up with the other systems and remove the records which don't match the remaining search criteria.
If you have to implement OR logic, this approach is not going to work.
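A rough sketch of that filter-down approach; every function, type, and attribute name below is a hypothetical stand-in for the real systems (SQL Server for dateOfBirth, the legacy database for the sales region), and as noted it only works for AND-style criteria:

```typescript
// Hypothetical person record assembled from several systems.
interface PersonHit {
  id: string;
  dateOfBirth: string;
  salesRegion?: string;
}

// Stand-ins for the real back ends; replace with actual clients.
async function queryFastSource(criteria: { bornAfter?: string }): Promise<PersonHit[]> {
  return [{ id: "1", dateOfBirth: "1990-05-01" }]; // mock result
}
async function getSalesRegion(personId: string): Promise<string> {
  return "EMEA"; // mock result from the slow legacy database
}

// Narrow the candidate set with the fast source first, then enrich from the
// slow sources and drop anything that no longer matches.
async function search(criteria: { bornAfter?: string; salesRegion?: string }) {
  const candidates = await queryFastSource({ bornAfter: criteria.bornAfter });

  const enriched = await Promise.all(
    candidates.map(async (p) => ({ ...p, salesRegion: await getSalesRegion(p.id) }))
  );

  return criteria.salesRegion
    ? enriched.filter((p) => p.salesRegion === criteria.salesRegion)
    : enriched;
}
```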
While not an actual answer, this might at least get you partway to a workable solution. We had a similar situation at a previous employer - lots of data sources, different ways of accessing those data sources, different access permissions, military/government/civilian sources, etc. We used Mule, which is built around the Enterprise Service Bus concept, to connect these data sources to our application. My details are a bit sketchy, as I wasn't the actual implementor, just an integrator, but what we did was define a channel in Mule. Then you write a simple integration piece to go between the channel and the data source, and another between the application and the channel. The integration piece does the work of making the actual query and formatting the results, so we had a generic SQL integration piece for accessing a database, and for things like web services we had some base classes that implemented common functionality, so the actual customization of the integration pieces was a lot less work than it sounds like. The application could then query the channel, which would handle accessing the various data sources, transform the results into normalized XML, and return them to the application.
This had a lot of advantages in our situation. We could include new data sources for existing queries by simply connecting them to the channel - the application didn't have to know or care what data sources were there, as it only looked at the data from the channel. Since data can be pushed or pulled from the channel, we could also have a data source push updates to the application when, for example, its data changed.
It took a while to get it configured and working, but once we got it going, we were pretty successful with it. In our demo setup, we ended up with 4 or 5 applications acting as both producers and consumers of data, and connecting to maybe 10 data sources.
Have you thought of moving the data into a separate structure?
For example, Lucene stores the data to be searched in a schema-less inverted index. You could have a separate program that retrieves data from all your different sources and puts it in a Lucene index. Your search could work against this index, and the search results could contain a unique identifier plus the system it came from.
http://lucene.apache.org/java/docs/
(There are implementations in other languages as well)
Have you taken a look at YQL? It may not be the perfect solution, but it might give you a starting point to work from.
Well, for starters I'd parallelize the queries to the different systems. That way we can minimize the query time.
You might also want to think about caching and aggregating the search attributes for subsequent queries in order to speed things up.
You have the option of creating an aggregation service or middleware layer that sits in front of all the different systems, so that you can provide a single interface for querying. If you do that, this is where I'd apply the previously mentioned caching and parallelization optimizations.
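For example, the aggregation layer's per-person lookup might fan the requests out in parallel and merge the results into one record; every function and field name in this sketch is hypothetical:

```typescript
// Hypothetical person profile assembled from the three systems.
interface PersonProfile {
  id: string;
  dateOfBirth?: string; // from the SQL Server system ("ABC")
  salesRegion?: string; // from the legacy database
  extraAttributes?: Record<string, string>; // from the XML web service
}

// Stand-ins for the real back ends; replace with actual clients.
async function fetchDateOfBirth(id: string): Promise<string> {
  return "1985-03-12";
}
async function fetchSalesRegion(id: string): Promise<string> {
  return "North";
}
async function fetchWebServiceAttrs(id: string): Promise<Record<string, string>> {
  return { title: "Account Manager" };
}

// Query all three systems concurrently, so total latency is roughly that of
// the slowest single source rather than the sum of all of them.
async function getPerson(id: string): Promise<PersonProfile> {
  const [dateOfBirth, salesRegion, extraAttributes] = await Promise.all([
    fetchDateOfBirth(id),
    fetchSalesRegion(id),
    fetchWebServiceAttrs(id),
  ]);
  return { id, dateOfBirth, salesRegion, extraAttributes };
}
```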
However, with all of that, you will need to weigh the development time, deployment time, and long-term benefits of the effort against migrating the old legacy database to a faster, more modern one. You haven't said how tied into other systems those databases are, so that may not be a very viable option in the short term.
EDIT: in response to data going out of date. You can consider caching your data if you don't need it to always match the database in real time. Also, if some data doesn't change very often (e.g. dates of birth), you should cache it. If you employ caching, you could make your system configurable as to which tables/columns to include or exclude from the cache, and you could give each table/column its own cache timeout with an overall default.
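A minimal sketch of that kind of configurable cache; the attribute keys, timeouts, and key format (attribute:personId) are all invented for illustration:

```typescript
// Per-attribute TTL overrides with an overall default, as described above.
const DEFAULT_TTL_MS = 60_000;
const ttlOverrides: Record<string, number> = {
  "person.dateOfBirth": 24 * 60 * 60 * 1000, // rarely changes: cache for a day
  "person.salesRegion": 5 * 60 * 1000, // changes more often: five minutes
};

interface Entry {
  value: unknown;
  expiresAt: number;
}
const cache = new Map<string, Entry>();

// Keys look like "person.dateOfBirth:42" (attribute:personId).
function cacheSet(key: string, value: unknown): void {
  const attribute = key.split(":")[0];
  const ttl = ttlOverrides[attribute] ?? DEFAULT_TTL_MS;
  cache.set(key, { value, expiresAt: Date.now() + ttl });
}

function cacheGet(key: string): unknown | undefined {
  const entry = cache.get(key);
  if (!entry) return undefined;
  if (Date.now() > entry.expiresAt) {
    cache.delete(key); // expired: force a fresh read from the source system
    return undefined;
  }
  return entry.value;
}
```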
Use Pentaho/Kettle to copy all of the data fields that you can search on and display into a local MySQL database
http://www.pentaho.com/products/data_integration/
Create a batch script to run nightly and update your local copy. Maybe even every hour. Then, write your query against your local MySQL database and display the results.
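Once the Kettle job has populated the local copy, the application-side search is then just ordinary SQL against MySQL; the connection details, table, and column names below are placeholders for whatever the transformation produces:

```typescript
import mysql from "mysql2/promise";

async function searchPeople(region: string, bornAfter: string) {
  // Connects to the local copy refreshed by the nightly/hourly Kettle job.
  const conn = await mysql.createConnection({
    host: "localhost",
    user: "app",
    password: "secret",
    database: "search_copy", // placeholder database name
  });

  // person_search is a placeholder table flattened from the source systems.
  const [rows] = await conn.execute(
    "SELECT id, name, date_of_birth, sales_region FROM person_search " +
      "WHERE sales_region = ? AND date_of_birth > ?",
    [region, bornAfter]
  );

  await conn.end();
  return rows;
}
```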

Resources