I need to write search functionality for a website full of videos. The videos are hosted by a third-party online video platform (OVP) such as Brightcove, Kaltura, or Ooyala. The OVP offers a search API. The OVP holds the following information for each video: title, description, is_published, duration, tags and create_date. The search API easily performs search and sorting on these fields.
However, I would like to attach more information to each video, such as available closed-caption languages, number of views, number of likes, most popular in the past month, etc. I suppose the best way for me to store this data is in my own database table, with an FK video_id that relates to the video_id in the OVP's video database table.
If I want to search for videos across fields in both the OVP's video database table (via their API) and my own video database table tbl_video_meta_data (via SQL statements), how should I do it?
I thought of two solutions, but I'm not sure whether either of them is a good idea, or whether there are other alternatives to consider.
1) Perform a search via the OVP search API based on its supported fields. Then perform a separate search within my own tbl_video_meta_data based on its own available fields. Then display records that are common to both search results. In this approach, I'm worried about the performance of doing two separate searches, and then filtering them again at the coding level instead of using SQL to do all this.
2) I should have a cron job that periodically fetches video data from the OVP and loads it into a tbl_video_cache. tbl_video_cache is truncated each time this happens. My table tbl_video_meta_data will of course have an FK video_id that relates to tbl_video_cache. Now I can perform a SQL search on both tables via JOIN. Actually, now that I think about it, this seems to be the best approach.
I think I will go with 2, but curious to know if there are any drawbacks to it.
I just went with 2)
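A minimal sketch of approach 2, assuming SQLite as a stand-in for the real database. The table names come from the question; the column names num_views and num_likes, the helper names refresh_cache and search, and the sample rows are assumptions made only for illustration.

```python
import sqlite3

def refresh_cache(conn, ovp_videos):
    """Replace all cached OVP rows inside one transaction (the cron job's work)."""
    with conn:  # commits on success, rolls back on error
        conn.execute("DELETE FROM tbl_video_cache")
        conn.executemany(
            "INSERT INTO tbl_video_cache (video_id, title, is_published, duration) "
            "VALUES (?, ?, ?, ?)",
            ovp_videos,
        )

def search(conn, title_term, min_views):
    """Search OVP fields and local metadata fields with a single JOIN."""
    sql = """
        SELECT c.video_id, c.title, m.num_views
        FROM tbl_video_cache AS c
        JOIN tbl_video_meta_data AS m ON m.video_id = c.video_id
        WHERE c.is_published = 1 AND c.title LIKE ? AND m.num_views >= ?
        ORDER BY m.num_views DESC
    """
    return conn.execute(sql, (f"%{title_term}%", min_views)).fetchall()

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE tbl_video_cache (
        video_id TEXT PRIMARY KEY, title TEXT, is_published INTEGER, duration INTEGER);
    -- No FK constraint is declared here on purpose (see the note below).
    CREATE TABLE tbl_video_meta_data (
        video_id TEXT PRIMARY KEY, num_views INTEGER, num_likes INTEGER);
""")

# Hypothetical rows standing in for what the OVP API would return.
refresh_cache(conn, [
    ("v1", "Cat video", 1, 60),
    ("v2", "Dog video", 1, 90),
    ("v3", "Cat tricks", 0, 30),
])
conn.executemany(
    "INSERT INTO tbl_video_meta_data VALUES (?, ?, ?)",
    [("v1", 100, 7), ("v2", 500, 20), ("v3", 900, 50)],
)
print(search(conn, "cat", 50))
```

One thing to watch with truncation: if the FK from tbl_video_meta_data to tbl_video_cache is actually enforced, deleting every cache row will either fail or cascade into the metadata, depending on how the constraint is defined. The sketch leaves the constraint undeclared for that reason.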
The issue we are having is mapping the fields returned by a saved search created on the Records Browser Item (http://www.netsuite.com/help/helpcen...cord/item.html) to the different tables they come from.
We have a retail inventory management system with many modules, so the attempt to relate our columns to NetSuite has been going on for a while without any conclusion.
The approach we are trying is to run SuiteScript in the debugger and view the dataset. This worked for searches with a relatively small volume of data. But because the limit is 10,000 rows, we are stuck on the Item search, which returns about 1 million records when we add all the search columns. The problem is that adding and removing individual columns is laborious, and even with just one column the search returns more than 10,000 rows. So it becomes impossible to fetch the data and complete the mapping process.
So I would like to know if there is any way to see only the schema and the relationships for a saved search.
Thanks.
In SuiteScript 1.0, this can be achieved by a scheduled script that creates multiple CSV files from a saved search (SuiteAnswers article 36206). You'll have to get around the search limit (SuiteAnswers article 33496) AND the governance limit (SuiteAnswers article 23406). If you make the file Available Without Login, you should be able to retrieve the CSV with an HTTP GET request without credentials. However, that will make the data potentially viewable by anyone who knows the URL--a security concern that you will have to consider.
In SuiteScript 2.0, this can probably be achieved with a Map/Reduce script (SuiteAnswers article 43795). This may be a better way to optimize the script, but I have not tested it myself in SuiteScript 2.0.
I have two fairly general questions about full text search in a database. I was looking into Elasticsearch and Solr, and it seems to me that one needs to produce separate documents made up of table entries, which then get searched. So the result of such a search is not actually a database entry? Or did I misunderstand something?
I also looked into Whoosh, which does index table columns, and the results from Whoosh are actual table rows.
When using Solr or Elasticsearch, should I put the row ID into the document which gets searched and, after I have my result, use that ID to retrieve the relevant rows from the table? Or is there a better solution?
Another question: if I have an ID like abc/123.64664, which is stored as a string, is there any advantage in searching such a column with FTS? It seems to me there is not much to be gained by indexing it. Or am I wrong?
thanks
Elasticsearch can store the indexed document, and you can retrieve it as part of the query result. Usually people still store the original data in a regular DB, since that gives you more reliability and more flexibility when reindexing. Mind that ES indexes non-relational data. You can keep your data stored in a relational manner and compose denormalized documents for indexing.
As for "abc/123.64664", you can index it as a tokenized string, or you can tune the index for prefix search, etc. It's up to you.
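To make the two options for "abc/123.64664" concrete, here is a small sketch in plain Python. It does not use any Elasticsearch or Solr API; the functions tokenize and edge_prefixes are hypothetical and only imitate what a tokenizing analyzer and a prefix-oriented (edge n-gram style) analyzer would roughly produce.

```python
import re

def tokenize(value):
    """Split on any non-alphanumeric character and lowercase each token."""
    return [t.lower() for t in re.split(r"[^0-9A-Za-z]+", value) if t]

def edge_prefixes(value, min_len=2):
    """Every leading substring of the value, so a prefix lookup is an exact match."""
    v = value.lower()
    return [v[:i] for i in range(min_len, len(v) + 1)]

doc_id = "abc/123.64664"
tokens = tokenize(doc_id)          # a search for "64664" alone can now match
prefixes = edge_prefixes(doc_id)   # a search for "abc/12" can now match
print(tokens)
print(prefixes[:3], "...", prefixes[-1])
```

Whether either form helps depends on how users type the ID: if they always enter it in full, an exact-match lookup on the untokenized string is enough.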
(TL;DR) Don't think about how your data is structured in your RDBMS. Think about what you are searching for.
Content storage for good full text search is quite different from standard relational database storage. So the data going into the search engine can end up looking quite different from the way you stored it.
This is all driven by your expected search results. You may increase the granularity of the data or, conversely, denormalize it so that content from the parent/related record shows up in the records you actually want returned by the search. Text processing (copyField, tokenization, pre-processing, etc.) is also where a lot of content modification happens to make a record findable.
Sometimes, relational databases support full-text search. PostgreSQL is getting better and better at that. But most of the time, relational databases just do not provide enough flexibility to support good relevancy-driven search.
Finally, if the original schema is quite complex, it may make sense to use the search engine only to get the right (relevant) IDs out and then merge them in the client code with the details from the original database records.
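The "IDs from the search engine, details from the database" merge can be sketched as follows. The search engine itself is left out: ranked_ids stands for whatever ordered ID list it returned. The article table, its columns, and the function name fetch_in_rank_order are assumptions for illustration, with SQLite standing in for the real database.

```python
import sqlite3

def fetch_in_rank_order(conn, ranked_ids):
    """Load full rows for the given IDs, keeping the search engine's ranking."""
    if not ranked_ids:
        return []
    placeholders = ",".join("?" * len(ranked_ids))
    rows = conn.execute(
        f"SELECT id, title, body FROM article WHERE id IN ({placeholders})",
        list(ranked_ids),
    ).fetchall()
    by_id = {row[0]: row for row in rows}
    # IN (...) does not preserve order, so re-order here; IDs that no longer
    # exist in the database (a stale index entry) are silently skipped.
    return [by_id[i] for i in ranked_ids if i in by_id]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE article (id INTEGER PRIMARY KEY, title TEXT, body TEXT)")
conn.executemany(
    "INSERT INTO article VALUES (?, ?, ?)",
    [(1, "First", "..."), (2, "Second", "..."), (3, "Third", "...")],
)

ranked_ids = [3, 1, 99]  # hypothetical search result; 99 was deleted from the DB
rows = fetch_in_rank_order(conn, ranked_ids)
print([row[1] for row in rows])
```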
How do you implement a search "engine" on your website / webapp?
Suppose that you have some products, news, events, and so on, all stored in a database in different tables.
You have free text hardcoded inside the website in static pages, or at least you have it in gettext files.
You want to be able to list pages that contain some of the requested query terms.
Personally, I create another table (fulltext with MySQL) that contains the URL and contents of each page, and then I do the fulltext search on that table and report the results.
This table is periodically filled by a script that reads the DB and inserts the data.
Are there better methods to implement a "simple" search?
Well, "simple" is subjective. Your approach to search will not scale and is certainly not suited to complex queries (think boolean searches, range queries, etc.).
My recommendation would be to denormalize your data into a flat structure and write it to Apache Solr. It offers a RESTful interface for integrating into PHP or whatever platform you prefer. It offers faceting, caching, a sophisticated query language etc.
So for a new project, I'm building a system for an ecommerce site. The idea is to import products from suppliers and instead of inserting them directly into our catalog, we would store all the information in a staging area. Each supplier has their own stage (i.e. table in the database), and then I will flatten the multiple staging areas into a single entity (currently a single table but later on perhaps into Sphinx or Solr). Then our merchandisers would be able to search the staging products' relevant fields (name and description) and be shown a list of products that match and then choose to have those products pushed into the live catalog. The search will query on the single table (the flattened staging areas).
My design calls for storing only searchable and filterable fields in the single flattened table, e.g. name, description, supplier_id, supplier_prod_id, etc. The search queries will return only the IDs of the matching items and a class (supplier_id) that identifies which staging area each product is from.
Another senior engineer feels the flattened search table should include other meta fields (which would not be searched on), but could be used when 'pushing' the products from stage to live catalog. He also feels that the query should return all this other information.
I feel pretty strongly about only having searchable fields in the flattened table and having the search return only class/id pairs which could be used to fetch all the other necessary metadata about the product (simple select * from class_table where id in (1,2,3)).
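A rough sketch of that second step, turning class/id pairs into full records. SQLite stands in for the real database; the staging table names, the columns, and the function load_products are hypothetical. The search step is not shown: hits is whatever list of (supplier_id, id) pairs the search returned, whether it came from SQL, Sphinx, or Solr.

```python
import sqlite3
from collections import defaultdict

# Hypothetical mapping from class (supplier_id) to that supplier's staging table.
STAGING_TABLES = {1: "stage_supplier_a", 2: "stage_supplier_b"}

def load_products(conn, hits):
    """Fetch full staging rows for (supplier_id, product_id) pairs, in hit order."""
    ids_by_class = defaultdict(list)
    for supplier_id, product_id in hits:
        ids_by_class[supplier_id].append(product_id)

    found = {}
    for supplier_id, ids in ids_by_class.items():
        # Table names cannot be bound as parameters, so they come from a fixed
        # whitelist rather than from user input.
        table = STAGING_TABLES[supplier_id]
        placeholders = ",".join("?" * len(ids))
        query = f"SELECT id, name, cost FROM {table} WHERE id IN ({placeholders})"
        for row in conn.execute(query, ids):
            found[(supplier_id, row[0])] = row
    return [found[hit] for hit in hits if hit in found]

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE stage_supplier_a (id INTEGER PRIMARY KEY, name TEXT, cost REAL);
    CREATE TABLE stage_supplier_b (id INTEGER PRIMARY KEY, name TEXT, cost REAL);
    INSERT INTO stage_supplier_a VALUES (1, 'Red mug', 3.5), (2, 'Blue mug', 4.0);
    INSERT INTO stage_supplier_b VALUES (1, 'Green bowl', 6.0);
""")

hits = [(2, 1), (1, 2)]  # hypothetical search output: (supplier_id, id) pairs
products = load_products(conn, hits)
print(products)
```

Because load_products only ever sees pairs, the code that produces hits can be swapped out later without touching it.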
Part of my reasoning is that this will make it easier later on to switch the flattened table from database to a search server like sphinx or solr and the rest of the code wouldn't have to be changed just because implementation of the search changed.
Am I on the right path? How can I convince the other engineer that it is important to keep only searchable fields and return only IDs? Or more specifically, why should a search application return only the IDs of objects?
I think that you're on the right path. If those other fields provide no value to either uniquely identify a staged item or to allow the user to filter staged items, then the data is fundamentally useless until the item is pushed to the live environment. If the other engineer feels that the extra metadata will help the users make a more informed decision, then you might as well make those extra fields searchable (thereby meeting your stated purpose for the table(s).)
The only reason I could think of to pre-fetch that other, non-searchable data would be for a performance improvement on the push to the live environment.
You should use each tool for what it does best. A full text search engine, such as Solr or Sphinx, excels at searching textual fields and ranking the hits quickly. It has no special advantage in retrieving stored data in a select-like fashion. A database is optimized for that. So, yes, you are on the right path. Please see Search Engine versus DBMS for other issues involved in deciding what to store inside the search engine.
In the case of sphinx, it only returns document ids and named attributes back to you anyway (attributes being numerical data, for the most part). I'd say you've got the right idea as the other metadata is just a simple JOIN away from the flattened table if you need it.
You can regard Solr as a powerful index, and since an index gives IDs back, it is logical that Solr does the same.
You can use the Solr query parameter fl to ask for identifier-only results, for instance fl=id.
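As an illustration of fl=id, the snippet below only builds the request URL and does not contact a server. The host, the core name products, and the helper name solr_id_query_url are assumptions; q, fl, rows and wt are standard Solr query parameters.

```python
from urllib.parse import urlencode

def solr_id_query_url(base_url, core, query, rows=10):
    """Build a Solr /select URL that asks for the id field only."""
    params = {"q": query, "fl": "id", "rows": rows, "wt": "json"}
    return f"{base_url}/solr/{core}/select?{urlencode(params)}"

# Hypothetical host and core name.
url = solr_id_query_url("http://localhost:8983", "products", "name:mug")
print(url)
```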
However, there's a feature that needs solr to give you back some data too: the highlighting of search terms in the matched documents. If you don't need it, then using solr to retrieve the identifiers only is fine (I assume you need only the documents list, and no other features, like facets, related docs or spell checking).
That said, it should matter how you build your objects in your search function: from the DB, using Solr only to retrieve IDs; from the fields Solr returns (provided they're stored); or even a mix of both. Think Solr for the 'highlighted' content fields and the DB for the other ones. Again, if you don't need highlighting, this is not an issue.
I'm using Solr with thousands of documents but only return the IDs, for the following reasons:
For Solr:
- if a sync error happens, it's not a big deal (in your case especially, displaying a wrong price could be a big issue; with IDs only, the worst case is that an item shows up in the wrong place, but the displayed data are still right)
- you will save a lot of time because you don't ask Solr to return the 'description' of documents (I mean many lines of text)
For your DB:
- you can cache your results, so it's even faster with an ID (you don't need all the data from Solr every time!)
- you build your results in the same way (you don't need one method to build HTML from Solr and another method for your DB)
I think there is a lot more...
I am just wondering if we could achieve some RDBMS capabilities in Lucene.
Example:
1) I have 10,000 project documents (pdf files) which have to be indexed with their content to make them available for search.
2) Every document is related to a SINGLE PROJECT. The project can contain details like project name, number, start date, end date, location, type etc.
I have to search in the contents of the pdf files for a given keyword, but while displaying the results I want to display the project meta data as mentioned in point (2).
My idea is to associate a field called projectId with each pdf file while indexing. Once we get that back from a search, we will fire a second search to get the project meta data.
This way we could avoid duplicated data. Also, if we want to update the project meta data, we will only have to update it in a SINGLE PLACE. Otherwise, if we store this meta data with every pdf document in the index, we will end up updating all of the documents, which is not what I am looking for.
please advise.
If I understand you correctly, you have two questions:
Can I store a project id in Lucene and use it for further searches? Yes, you can. This is a common practice.
Can I use this project id to search Lucene for project meta data? Yes, you can. I do not know if this is a good idea. It depends on the frequency of your meta data updates and your access pattern. If the meta data is relatively static, and you only access it by id, Lucene may be a good place to store it. Otherwise, you can use the project id as a primary key to a database table, which could be a better fit.
Sounds like a perfectly good thing to do. The only limitation you'll have (by storing a reference to the project in Lucene rather than the project data itself) is that you won't be able to query both the document text and project metadata at the same time. For example, "documentText:foo OR projectName:bar" . If you have no such requirement, then seems like storing the ID in Lucene which refers to a database row is a fine thing to do.
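The two-step lookup discussed here can be sketched without any Lucene code. In the snippet below, hits stands for the stored fields of the matching PDF documents, and load_projects is a hypothetical callable that returns project metadata keyed by ID (in practice, a database query). All names and sample values are assumptions.

```python
def attach_project_metadata(hits, load_projects):
    """Join full-text hits to project metadata using the stored projectId."""
    # Many PDFs share one project, so each project is looked up only once.
    project_ids = sorted({hit["projectId"] for hit in hits})
    projects = load_projects(project_ids)
    return [
        {**hit, "project": projects[hit["projectId"]]}
        for hit in hits
        if hit["projectId"] in projects
    ]

# Hypothetical metadata store; it is the single place where project data lives.
PROJECTS = {"P1": {"name": "Harbour bridge", "location": "Oslo"}}

def load_projects(project_ids):
    return {pid: PROJECTS[pid] for pid in project_ids if pid in PROJECTS}

hits = [
    {"file": "spec.pdf", "projectId": "P1"},
    {"file": "budget.pdf", "projectId": "P1"},
    {"file": "orphan.pdf", "projectId": "P2"},  # no metadata available
]
results = attach_project_metadata(hits, load_projects)
print(results)
```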
I am not sure on your overall setup, but maybe Hibernate Search is for you. It would allow you to combine the benefits of a relational database with the power of a fulltext search engine like Lucene. The meta data could live in the database, maybe together with the original pdf documents, while the Lucene documents just contain the searchable data.
This is definitely possible. But always be aware that you're using Lucene for something it was not intended for. In general, Lucene is designed for full-text search, not for mapping relational content. So the more complex your relational content becomes, the more you'll see a decrease in performance.
In particular, there are a few areas to keep a close eye on:
Storing the value of each field in your index will decrease performance. If you are not overly concerned with sub-second search results, or if your index is relatively small, then this may not be a problem.
Also, be aware that if you are not using the default ranking algorithm, and your custom algorithm requires information about the project in order to calculate the score for each document, this will have a dramatic impact on search performance, as well.
If you need a more powerful index that was designed for relational content, there are hierarchical indexing tools out there (one developed by Apache, called Jackrabbit) that are worth looking into.
As your project continues to grow, you might also check out Solr, also developed by Apache, which provides some added functionality, such as multi-faceted search.
You can use Lucene that way;
Pros:
Full-text search is easy to implement, which is not the case in an RDBMS.
Cons:
Referential integrity: you get it for free in an RDBMS, but in Lucene, you must implement it yourself.