Put unstructured data to Elasticsearch - node.js

I have a Node.js app with MongoDB. Now I want to use Elasticsearch and replicate data from Mongo to Elasticsearch. I'm using the npm package "elasticsearch". For example, for the collection "Posts" I have documents like this:
items: [
  {
    _id: '111111111111',
    title: 'test1',
    status: true,
  },
  {
    _id: '22222222',
    title: 'test2',
    status: 0,
  },
  {
    _id: '333333333',
    title: 'test1',
    status: {published: true},
  }
]
As you can see, my data is unstructured, and Elasticsearch returns an error when I add these items. I want a way to turn off this Elasticsearch restriction so that I can add the data as it is. I can't change the data because it is huge.
Any solution?

Elasticsearch rejects these documents because the status field is a boolean in the first one, a number in the second, and an object in the third, which produces mapping conflicts. If you don't want to change the data, I assume you don't expect to search over the conflicting fields (how could you query consistently over a field that could be anything?). In that case, my recommendation is to store the conflicting fields without indexing them. You will still see them in the documents, but you won't be able to query them or aggregate over them. To disable indexing, map the field as type object and set the enabled mapping parameter to false (see Docs).
If you want to be able to query or aggregate over everything, you must make the extra effort of preprocessing your data so that each field has a consistent type.
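As a rough sketch of the mapping described above, assuming Elasticsearch 7 or later (typeless mappings), an index named posts, and that status is the only conflicting field. The function name and the index name are made up for illustration, and the client call in the comment is an assumption about how the legacy "elasticsearch" package would be used.

```javascript
// Build the body of an index-creation request in which the given fields
// stay in _source but are neither parsed nor indexed.
function buildPostsIndexBody(conflictingFields) {
  const properties = {};
  for (const field of conflictingFields) {
    properties[field] = { type: 'object', enabled: false };
  }
  return { mappings: { properties } };
}

const body = buildPostsIndexBody(['status']);
console.log(JSON.stringify(body, null, 2));

// Assumed usage with the legacy "elasticsearch" client (not run here):
// await client.indices.create({ index: 'posts', body });
```

The mapping has to be in place before the first document with that field is indexed, because the enabled setting of an existing field cannot be changed afterwards.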

Related

Usage of TSVECTOR and to_tsquery to filter records in Sequelize

I've been trying to get full-text search to work for a while now without any success. The current documentation has this example:
[Op.match]: Sequelize.fn('to_tsquery', 'fat & rat') // match text search for strings 'fat' and 'rat' (PG only)
So I've built the following query:
Title.findAll({
  where: {
    keywords: {
      [Op.match]: Sequelize.fn('to_tsquery', 'test')
    }
  }
})
And keywords is defined as a TSVECTOR field.
keywords: {
  type: DataTypes.TSVECTOR,
},
It seems to generate the query properly, but I'm not getting the expected results. This is the query generated by Sequelize:
Executing (default): SELECT "id" FROM "Tests" AS "Test" WHERE "Test"."keywords" @@ to_tsquery('test');
And I know that there are multiple records in the database that have 'test' in their vector, such as the following one:
{
  "id": 3,
  "keywords": "'keyword' 'this' 'test' 'is' 'a'",
}
So I'm unsure as to what's going on. What would be the proper way to search for matches based on a TSVECTOR field?
Funnily enough, I'm working on the same thing these days and running into the same problem.
I think part of the solution is here (How to implement PostgresQL tsvector for full-text search using Sequelize?), but I haven't been able to get it to work yet.
If you find working examples, I'm interested. Otherwise, as soon as I find a solution that works 100% I will update this answer.
I also noticed that when I add data (seeds) through Sequelize, the lexeme positions are not stored after each lexeme in the field in question. Do you see the same behavior?
One last thing: did you create the index?
CREATE INDEX tsv_idx ON data USING gin(column);
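One detail worth keeping in mind with the to_tsquery call from the question: it expects explicit operators between words (as in 'fat & rat'), so free-form user input has to be converted first. A minimal sketch of such a conversion follows; the helper name is hypothetical and it simply ANDs all words together, which may or may not be the matching behavior you want.

```javascript
// Turn free-form input into a string that to_tsquery() can parse:
// characters other than letters, digits and underscores are dropped from
// each word, empty words are skipped, and the rest are joined with " & ".
function toTsQueryString(input) {
  return input
    .split(/\s+/)
    .map((word) => word.replace(/[^\p{L}\p{N}_]/gu, ''))
    .filter((word) => word.length > 0)
    .join(' & ');
}

console.log(toTsQueryString('fat  rat')); // fat & rat

// Assumed usage inside the where clause from the question (not run here):
// [Op.match]: Sequelize.fn('to_tsquery', toTsQueryString(userInput))
```

An empty result (for example when the input contains only punctuation) should be handled by the caller before the query is sent.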

What is the most efficient way to perform CRUD operations to millions of documents in MongoDB

I am new to MongoDB and am currently working on a project where MongoDB is my primary database management system. I am using Mongoose as the object data modeling (ODM) library. Suppose I have two collections called products and features. Each product may have multiple features, that is, a one-to-many relationship.
// products schema
const products = mongoose.Schema({
  id: Number,
  name: String,
  description: String,
  category: String,
  price: Number,
}, {
  strict: false,
});
// features schema
const features = mongoose.Schema({
  id: Number,
  product_id: Number,
  feature: String,
  value: String,
}, {
  strict: false,
});
I have imported the documents/records for both collections from external .csv files, and each collection holds more than 3 million records. My client-side application requires data about a particular product together with all of its features, like below:
{
  productId: 3,
  name: 'Denim Jeans',
  description: '',
  category: 'Cloth',
  price: 40.00,
  features: [
    {
      feature: 'Material',
      value: 'Cotton',
    },
    {
      feature: 'color',
      value: 'blue',
    },
    ....
  ]
}
No product will have more than 5-6 features. So, what I wanted to do is embed the feature documents as subdocuments in the product document, like the response above. I wrote a piece of code like this. It's not the exact code, because I deleted it when it did not work, but the logic is the same.
db.products.find({}, (err, product) => {
  // product -> array of all documents from products collection
  // for each product, I am trying to find the corresponding feature from
  // the features collections and embed it to each product document
  product.forEach(item => {
    db.features.find({product_id: item.id}, (err, feature) => {
      // feature -> array of all the features of a product
      // embed the feature array to each individual product item
      item.features = feature;
    })
  })
})
Now, the issue is that when I run the above piece of code, I get out-of-memory errors, because it tries to read millions of records and my machine cannot hold all of them in memory. My question is: what is the best way to retrieve all the products and, for each individual product, run a query to get its corresponding features and embed them inside the product document?
I have a couple of ideas. Kindly correct me if I am wrong. Instead of holding all the products and their features in memory, I want to store them on disk and update the individual products using the Bulk API of MongoDB. But how do I achieve this, and what about the performance? What is the best practice to follow in this case? Or should I keep them in separate collections, make two queries from the application server, and package the response there? Or should I use some kind of aggregation pipeline at the database level? Thanks in advance.
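One of the ideas raised above, doing the join inside the database with an aggregation pipeline, might look roughly like the sketch below. It assumes MongoDB 4.2 or later (for $merge), that the collections are really named products and features, and that the result goes into a separate, hypothetical collection called products_with_features. Treat it as a starting point rather than a tested migration.

```javascript
// Build a pipeline that embeds each product's features and writes the result
// to another collection, so the documents never have to be held in Node.js.
function buildEmbedFeaturesPipeline(targetCollection) {
  return [
    {
      $lookup: {
        from: 'features',
        localField: 'id',
        foreignField: 'product_id',
        as: 'features',
      },
    },
    // Keep the product fields and only feature/value from each embedded feature.
    {
      $project: {
        id: 1, name: 1, description: 1, category: 1, price: 1,
        'features.feature': 1,
        'features.value': 1,
      },
    },
    { $merge: { into: targetCollection } },
  ];
}

const pipeline = buildEmbedFeaturesPipeline('products_with_features');
console.log(JSON.stringify(pipeline[0]));

// Assumed usage in the mongo shell (not run here):
// db.features.createIndex({ product_id: 1 });
// db.products.aggregate(pipeline, { allowDiskUse: true });
```

The index on product_id matters here: without it, the $lookup has to scan the features collection once per product.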

Create View from multiple collections MongoDB

I have the following Mongoose schemas (truncated to hide project-sensitive information) from a healthcare project.
let PatientSchema = mongoose.Schema({_id:String})
let PrescriptionSchema = mongoose.Schema({_id:String, patient: { type: Number, ref: 'Patient', createdAt:Date }})
let ReportSchema = mongoose.Schema({_id:String, patient: { type: Number, ref: 'Patient', createdAt:Date }})
let EventsSchema = mongoose.Schema({_id:String, patient: { type: Number, ref: 'Patient', createdAt:Date }})
There is a UI screen in the mobile and web app called Health history, where I need to paginate the entries from prescriptions, reports and events sorted by createdAt. So I am building a REST endpoint to get this heterogeneous data. How do I achieve this? Is it possible to create a "View" from multiple schema models so that I don't have to load the contents of all 3 collections to fetch one page of entries? The schema of my "View" should look like the one below so that I can run additional queries on it (e.g. find the last report).
{recordType: String /* prescription/report/event */, createdDate: Date, data: Object /* content from any of the 3 tables */}
I can think of three ways to do this.
IMHO the easiest way to achieve this is an aggregation, something like this:
db.Patients.aggregate([
  { $match: { _id: <somePatientId> } },
  {
    $lookup: {
      from: 'Prescription', // use the real collection name; replicate this for Report and Event
      localField: '_id',
      foreignField: 'patient',
      as: 'prescriptions' // or 'reports' or 'events'
    }
  },
  { $unwind: '$prescriptions' }, // or '$reports' or '$events'
  { $sort: { 'prescriptions.createdAt': -1 } }, // newest first
  { $skip: <positive integer> },
  { $limit: <positive integer> },
])
You'll have to adapt it further, to also get the correct createdDate. For this, you might want to look at the $replaceRoot operator.
The second option is to create a new "meta"-collection, that holds your actual list of events, but only holds a reference to your patient as well as the actual event using a refPath to handle the three different event types. This solution is the most elegant, because it makes querying your data way easier, and probably also more performant. Still, it requires you to create and handle another collection, which is why I didn't want to recommend this as the main solution, since I don't know if you can create a new collection.
As a last option, you could create virtual populate fields in Patient, that automatically fetch all prescriptions, reports and events. This has the disadvantage that you can not really sort and paginate properly...
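The second option above (the meta-collection) could be sketched as follows. The collection name HealthHistory, the field names, and the model names Prescription, Report and Event are assumptions for illustration; only refPath itself is the Mongoose feature the answer refers to. The field definitions are a plain object here, so that the paging helper can be tried without a database.

```javascript
// Hypothetical field definitions for a "HealthHistory" meta-collection.
// Each entry points at one prescription, report or event through refPath.
const healthHistoryFields = {
  patient: { type: String, ref: 'Patient', required: true },
  recordType: { type: String, enum: ['Prescription', 'Report', 'Event'], required: true },
  record: { type: String, refPath: 'recordType', required: true },
  createdAt: { type: Date, required: true },
};

// Query parameters for one page of a patient's history, newest first.
function historyPage(patientId, page, pageSize) {
  return {
    filter: { patient: patientId },
    sort: { createdAt: -1 },
    skip: (page - 1) * pageSize,
    limit: pageSize,
  };
}

console.log(historyPage('p1', 3, 20)); // skip is 40, limit is 20

// Assumed usage with Mongoose (not run here):
// const HealthHistory = mongoose.model('HealthHistory', new mongoose.Schema(healthHistoryFields));
// const { filter, sort, skip, limit } = historyPage('p1', 1, 20);
// HealthHistory.find(filter).sort(sort).skip(skip).limit(limit).populate('record');
```

The cost of this design is that every insert into one of the three collections also has to write an entry into the meta-collection.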

MongoDB/Mongoose: Query for valid document property

Using Mongoose for MongoDB, I store several collections of data, each defined by a Mongoose schema.
1) Is there an easy way (without explicitly querying the database) to find out whether a specific property is part of a particular collection schema model?
Let's say I have a collection of users, including information about name and address. At runtime I receive, by mistake, data which is supposed to be stored in the user's document but does not (fully) comply with the schema (e.g. shoe size is included).
2) I know that Mongoose refuses to save the data set in that case, but how, if at all, do I get some sort of feedback about that so I can report back appropriately to the client?
I think the fastest way to check whether a certain collection contains documents that have the field that you mention is to run a count query with the $exists operator on each collection:
db.collection1.count({ field: { $exists: true }});
db.collection2.count({ field: { $exists: true }});
db.collection3.count({ field: { $exists: true }});
Afterwards, you can save the return value of each count operation in a variable and pass it to the client, thus making it possible to convey a message to the end-user.
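For the first part of the question (checking against the schema without querying the database), a plain comparison of the incoming keys with the schema's known paths may be enough. The sketch below uses a hypothetical helper and a hard-coded path list; with Mongoose the list could come from Object.keys(User.schema.paths), where User is an assumed model name. It only looks at top-level keys of a flat schema.

```javascript
// Return the keys of `payload` that are not among the schema's known paths.
function findUnknownFields(knownPaths, payload) {
  const known = new Set(knownPaths);
  return Object.keys(payload).filter((key) => !known.has(key));
}

// A hard-coded list stands in for Object.keys(User.schema.paths) here.
const userPaths = ['_id', 'name', 'address', '__v'];
const unknown = findUnknownFields(userPaths, { name: 'Ann', shoeSize: 42 });
console.log(unknown); // [ 'shoeSize' ]
```

The returned list can be sent back to the client as is, for example in a 400 response that names the rejected fields.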

MongoDB Relational Data Structures with array of _id's

We have been using MongoDB for some time now and there is one thing I just can't wrap my head around. Let's say I have a collection of Users that have a Watch List or Favorite Items List like this:
usersCollection = [
  {
    _id: 1,
    name: "Rob",
    itemWatchList: [
      "111111",
      "222222",
      "333333"
    ]
  }
];
and a separate Collection of Items
itemsCollection = [
  {
    _id: "111111",
    name: "Laptop",
    price: 1000.00
  },
  {
    _id: "222222",
    name: "Bike",
    price: 123.00
  },
  {
    _id: "333333",
    name: "House",
    price: 500000.00
  }
];
Obviously we would not want to insert the whole item object inside the itemWatchList array, because the item's data (e.g. the price) could change.
Let's say we pull that user into the GUI and want to display a grid of the user's itemWatchList. We can't, because all we have is a list of IDs. Is the only option to do a second collection.find([itemWatchList]) and then, in the results callback, manipulate the user record to display the current items? The problem with that is: what if I return an array of multiple Users, each with its own itemWatchList? That would be a callback nightmare to try and keep the results straight. I know Map Reduce or the Aggregation framework can't traverse multiple collections.
What is the best practice here, and is there a better data structure that should be used to avoid this issue altogether?
You have 3 different options with how to display relational data. None of them are perfect, but the one you've chosen may not be the best option for your use case.
Option 1 - Reference the IDs
This is the option you've chosen. Keep a list of Ids, generally in an array of the objects you want to reference. Later to display them, you do a second round-trip with an $in query.
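The second round-trip does not have to turn into nested callbacks, even for several users at once. A minimal sketch, assuming the document shapes from the question; the helper names and the second user are made up for illustration, and the real find call appears only as a comment.

```javascript
// Collect every watched item id across all users into one $in filter,
// so a single extra query covers any number of users.
function buildItemsFilter(users) {
  const ids = new Set();
  for (const user of users) {
    for (const id of user.itemWatchList) ids.add(id);
  }
  return { _id: { $in: [...ids] } };
}

// Replace each user's id list with the matching item documents.
function attachItems(users, items) {
  const byId = new Map(items.map((item) => [item._id, item]));
  return users.map((user) => ({
    ...user,
    itemWatchList: user.itemWatchList
      .map((id) => byId.get(id))
      .filter((item) => item !== undefined),
  }));
}

const users = [
  { _id: 1, name: 'Rob', itemWatchList: ['111111', '222222'] },
  { _id: 2, name: 'Ann', itemWatchList: ['222222'] },
];
const filter = buildItemsFilter(users);
// In the application: const items = await db.collection('items').find(filter).toArray();
const items = [
  { _id: '111111', name: 'Laptop', price: 1000 },
  { _id: '222222', name: 'Bike', price: 123 },
];
console.log(attachItems(users, items)[1].itemWatchList[0].name); // Bike
```

Ids that no longer exist in the items collection are silently dropped from the result, which is usually what a grid view wants.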
Option 2 - Subdocuments
This is probably a bad solution for your situation. It means putting the entire array of documents that are stored in the items collection into your user collection as a sub-document. This is great if only one user can own an item at a time. (For example, different shipping and billing addresses.)
Option 3 - A combination
This may be the best option for you, but it'll mean changing your schema. For example, let's say that your items have 20 properties, but you really only care about the name and price for the majority of your screens. You then have a schema like this:
usersCollection = [
  {
    _id: 1,
    name: "Rob",
    itemWatchList: [
      {
        _id: "111111",
        name: "Laptop",
        price: 1000.00
      },
      {
        _id: "222222",
        name: "Bike",
        price: 123.00
      },
      {
        _id: "333333",
        name: "House",
        price: 500000.00
      }
    ]
  }
];
itemsCollection = [
  {
    _id: "111111",
    name: "Laptop",
    price: 1000.00,
    otherAttributes: ...
  },
  {
    _id: "222222",
    name: "Bike",
    price: 123.00,
    otherAttributes: ...
  },
  {
    _id: "333333",
    name: "House",
    price: 500000.00,
    otherAttributes: ...
  }
];
The difficulty is that you then have to keep these copies in sync with each other. (This is what is meant by eventual consistency.) If you have a low-stakes application (not banking, health care, etc.) this isn't a big deal. You can run the two update queries one after the other: first update the item, then update the users that have that item to the new price. You'll notice this sort of latency on some websites if you pay attention. eBay, for example, often shows a different price on the search results page than on the item's own page, even if you return and refresh the search results.
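The second of those two updates could be built like this. It is a sketch that assumes the embedded schema shown above and that each item appears at most once per watch list, since the positional $ operator only touches the first matching array element. The helper name and the new price are illustrative.

```javascript
// Build the update that copies an item's new price into every user document
// whose watch list embeds that item.
function buildPriceSync(itemId, newPrice) {
  return {
    filter: { 'itemWatchList._id': itemId },
    // The positional $ operator targets the array element matched by the filter.
    update: { $set: { 'itemWatchList.$.price': newPrice } },
  };
}

const { filter, update } = buildPriceSync('222222', 130);
console.log(JSON.stringify(update));

// Assumed usage in the mongo shell (not run here):
// db.items.updateOne({ _id: '222222' }, { $set: { price: 130 } });
// db.users.updateMany(filter, update);
```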
Good luck!

Resources