We are revamping our existing system, which uses a MySQL database, to handle the following types of data:
transaction and order data
customer information
product information
We need to query this data to pull statistics, and also to filter, facet, and segment lists and KPIs.
We tried ClickHouse, Druid, and Dgraph, running a few tests on sample data to benchmark them and check which database fits our needs.
A few things I liked about Druid are:
Druid search queries, which list all matches along with their dimensions (column names) and the occurrence count for each.
Link: http://druid.io/docs/latest/querying/searchquery.html
utf8mb4 support
Full text search
Case insensitive search
We found ClickHouse to be faster than both MySQL and Druid, but we have the following problems:
Unable to do Druid-like search queries (which return dimensions and occurrences). Is there a workaround to achieve this?
Case-insensitive search: how do we handle this? ClickHouse is case-sensitive, right?
utf8mb4 support: how do we save/store special characters, or emoji that are not supported in utf8?
We had similar issues in MySQL, and changing the collation to utf8mb4 solved them. What do we do in ClickHouse to achieve this?
Your suggestions can help us overcome these challenges and make a better decision.
Thanks in advance.
Unable to do Druid-like search queries (which return dimensions and occurrences). Is there a workaround to achieve this?
That feature sounds like it works roughly like this:
SELECT interval, dim1, COUNT(*) FROM my_table WHERE condition GROUP BY interval, dim1
UNION ALL
SELECT interval, dim2, COUNT(*) FROM my_table WHERE condition GROUP BY interval, dim2
UNION ALL
...
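As a concrete sketch of that workaround (the my_table, city, and device names are hypothetical, as is the search term), each branch reports which dimension matched and how many rows matched it:

SELECT 'city' AS dimension, city AS value, count() AS occurrences
FROM my_table
WHERE positionCaseInsensitiveUTF8(city, 'fish') > 0
GROUP BY city
UNION ALL
SELECT 'device' AS dimension, device AS value, count() AS occurrences
FROM my_table
WHERE positionCaseInsensitiveUTF8(device, 'fish') > 0
GROUP BY device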
Case-insensitive search: how do we handle this? ClickHouse is case-sensitive, right?
There are multiple options: for example, the positionCaseInsensitiveUTF8(haystack, needle) function, or match with regular expressions: https://clickhouse.yandex/docs/en/query_language/functions/string_search_functions/#match-haystack-pattern
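For instance, a minimal sketch (the products table and its name column are hypothetical):

SELECT count()
FROM products
WHERE positionCaseInsensitiveUTF8(name, 'iphone') > 0

-- or, with a case-insensitive regular expression:
SELECT count()
FROM products
WHERE match(name, '(?i)iphone')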
utf8mb4 support: how do we save/store special characters or emoji that are not supported in utf8?
Strings in ClickHouse are arbitrary byte sequences, so you can store whatever you want there, but you should probably check whether the available functions match your use case.
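For example, a quick sketch (the demo table is hypothetical): emoji round-trip fine, since String is just bytes and the UTF-8-aware functions interpret them on read:

CREATE TABLE demo (s String) ENGINE = Memory;
INSERT INTO demo VALUES ('café 😀');
SELECT s, length(s) AS bytes, lengthUTF8(s) AS code_points FROM demo;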
Case-insensitive search: how do we handle this? ClickHouse is case-sensitive, right?
This blog might be useful. Specifically:
Adding (?i) at the beginning of every pattern makes it case-insensitive, like we did before:
SELECT
id,
job_title,
multiMatchAllIndices(description, ['(?i)python', '(?i)javascript', '(?i)postgres']) AS indices,
company
FROM jobs
WHERE length(indices) > 0
Related
I want to search in a Cassandra database.
After much research, I found
Stratio's Cassandra Lucene Index
Is there another way to do a simple search on Cassandra?
I mean a simple search query, something like in MySQL.
I've tried this query, but its results were wrong:
select * from users where uname > 'sa' allow filtering;
It seems to me that you'd want to perform a text search on a non-PRIMARY KEY column.
If that's the case, you could use an SSTable Attached Secondary Index (SASI), which would let you search exactly as you wrote. Specifically, you'd need to create a CONTAINS index to perform inequality and substring searches on text fields. You need Cassandra 3.4 or later.
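A minimal sketch of creating such an index on the users.uname column from your query (the index name is arbitrary):

CREATE CUSTOM INDEX users_uname_idx ON users (uname)
USING 'org.apache.cassandra.index.sasi.SASIIndex'
WITH OPTIONS = { 'mode': 'CONTAINS' };

-- substring search then works without ALLOW FILTERING:
SELECT * FROM users WHERE uname LIKE '%sa%';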
I have two fairly general questions about full-text search in a database. I was looking into Elasticsearch and Solr, and it seems to me that one needs to produce separate documents made up of table entries, which then get searched. So the result of such a search is not actually a database entry? Or did I misunderstand something?
I also looked into Whoosh, which does index table columns, and whose results are actual table rows.
When using Solr or Elasticsearch, should I put the row ID into the document that gets searched, and after I have my result, use that ID to retrieve the relevant rows from the table? Or is there a better solution?
Another question I have: if I have an ID like abc/123.64664, which is stored as a string, is there any advantage in searching such a column with FTS? It seems to me there is not much to be gained by indexing it? Or am I wrong?
Thanks.
Elasticsearch can store the indexed document, and you can retrieve it as part of the query result. Usually people still store the original data in a regular DB; it gives you more reliability and flexibility for reindexing. Mind that ES indexes non-relational data: you can keep your data stored in relational form and compose denormalized documents for indexing.
As for "abc/123.64664", you can index it as a tokenized string, or you can tune the index for prefix search, etc. It's up to you.
(TL;DR) Don't think about how your data is structured in your RDBMS. Think about what you are searching for.
Content storage for good full-text search is quite different from standard relational database storage, so your data going into the search engine can end up looking quite different from the way you stored it.
This is all driven by your expected search results. You may increase the granularity of the data or, the opposite, denormalize it so the parent/related record content shows up in the records you actually want returned as part of the search. Text processing (copyField, tokenization, pre-processing, etc.) is also where a lot of content modification happens to make a record findable.
Sometimes, relational databases support full-text search. PostgreSQL is getting better and better at that. But most of the time, relational databases just do not provide enough flexibility to support good relevancy-driven search.
Finally, if the original schema is quite complex, it may make sense to use the search engine only to get the right (relevant) IDs out, and then merge them in the client code with the details from the original database records.
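The merge step can then be a plain lookup; for example (the orders table and the literal IDs are hypothetical, with the IDs coming back from the search engine in relevance order):

SELECT * FROM orders WHERE id IN (17, 42, 123);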
Suppose there's a table with columns (UserID, FieldID, Value), with half a million records. I want to see if some search term T(N) occurs anywhere in each Value (i.e. Value.Contains( T(N) ) ).
I think I'm just hitting a wall volume-wise; there are simply too many values to sift through. I don't think a full-text index will help, because it's only useful for starts-with queries that match individual words, not occurrences anywhere within the string.
Is there a good approach to indexing this kind of data for such a search in SQL Server?
Half a million records is not terribly large, although I don't know the size of the field contents. A couple of ideas; this was too long for a comment or else I would have posted it as such.
You could implement a full-text search engine like Elastic, Solr, etc. and use it as a sidecar. If, when you are doing text searches, you are not otherwise making much use of the other data, this might be easy enough. Note that you could put other data for searching into Elastic or Solr, but I'm not sure you'd want to duplicate all your data, and those tools aren't really great as a transactional data store.
Another option for volumes this small, assuming you only need basic "contains" searching: create two more tables, keywords and keyword_index (or whatever). When saving, tokenize your text content, write any new keywords to the keywords table, and then add the data to the join table. Index everything, and then do your search off the keywords table, joining back to the master via the intermediate keyword_index table; a schema sketch follows below.
This is fairly hackish, and getting your keyword handling really dialed in (for stemming, etc.) may be a pain. It is a reasonably quick-and-dirty solution for smaller-scale needs, though.
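A minimal T-SQL sketch of that keyword-table approach (all names are hypothetical; the tokenizing happens in application code when you save):

-- keyword dictionary
CREATE TABLE keywords (
    KeywordID int IDENTITY(1,1) PRIMARY KEY,
    Keyword nvarchar(100) NOT NULL UNIQUE
);

-- join table pointing back at the master rows
CREATE TABLE keyword_index (
    KeywordID int NOT NULL REFERENCES keywords (KeywordID),
    UserID int NOT NULL,
    FieldID int NOT NULL,
    PRIMARY KEY (KeywordID, UserID, FieldID)
);

-- search: rows whose Value contained the token 'fish'
SELECT ki.UserID, ki.FieldID
FROM keywords AS k
JOIN keyword_index AS ki ON ki.KeywordID = k.KeywordID
WHERE k.Keyword = 'fish';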
I have a gigantic dataset of more than 2,500,000,000 records distributed among 10 tables in Derby. There are two columns, "floraNfauna" and "locations", common to each table. Now I have to find a particular "floraNfauna" found at particular "locations", so I use a "select" query with "like", e.g. select * from tables where floraNfauna like '%fish%' and locations like '%shallow water bodies%'; and it takes days to finally fetch the results, which sometimes number below 1000. After searching, I found that full-text search would be the best and fastest approach to this. Can you help me with an example?
Derby integrates nicely with Lucene, which is a full-text search engine.
Read more about that here: http://wiki.apache.org/db-derby/LuceneIntegration
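A rough sketch of what that integration looks like (the FAUNA table name is hypothetical; check the wiki page above for the exact procedure signatures):

-- enable the optional Lucene tool (Derby 10.11+)
CALL SYSCS_UTIL.SYSCS_REGISTER_TOOL('luceneSupport', true);

-- index the text column
CALL LuceneSupport.createIndex('APP', 'FAUNA', 'FLORANFAUNA', null);

-- query through the generated table function
SELECT *
FROM TABLE (APP.FAUNA__FLORANFAUNA('fish', 1000, null)) t;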
Firstly, you should consider indexing your tables. Here is an SO link which would definitely help you learn more about why to index a DB table.
More about Adding Indexes to a table.
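For instance (the observations table name is hypothetical):

CREATE INDEX idx_flora_loc ON observations (floraNfauna, locations);

Note, though, that a plain index like this only helps predicates that can use a prefix of the key; a LIKE pattern with a leading wildcard such as '%fish%' still forces a scan, which is why the Lucene route above is worth pursuing.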
Secondly, if you are using a centralized database, then definitely consider upgrading your server hardware configuration.
Thanks, hope it helps.
Is there any column-store database that supports secondary indexes?
I know HBase does, but it's not there yet.
Haggai.
By storing overlapping projections in different sort orders, column stores based on the C-Store architecture (so, as far as commercial implementations go, Vertica) natively support secondary indexes.
See http://db.csail.mit.edu/projects/cstore/vldb.pdf
Also check out MonetDB, which treats "create index" statements as hints for its self-organizing engine.
Take a look at the IndexSpecification class, which is part of r0.19.3.
Here you can see how to use it (maybe they have a test for that as well).
I've never used it and don't know if it performs well. Please share your results with us.
good luck
-- Yonatan
Sybase IQ supports as many indexes as you might ever desire on every column, and even within a column (e.g. the word index, which lets you stay with the defaults or specify your own delimiter).