Currently SQL '%like%' search is used to get all the rows which contains certain keywords. we're trying to replace MySQL like search with Lucene-Solr.
We constructed indexes,
queried to solr with a keyword,
retrieved the primary keys of all corresponding records,
queried to mysql with PK
and fetched the result.
and it got slower. damn!
I suppose that bandwidth used in 1, 2, 3 is the cause (since the result is really huge, like 1 million+), but I cannot figure any better ways.
Is there any other ways to get solr search result except CSV over http? (like file dump in mysql)
We did the same procedure to combine solr and mysql which was 100-1000x faster than single mySql fulltext search .
So your workflow/procedure is not a problem in general.
The question is: where is your bottleneck.
To investigate that, you should take a look to the catalina out to see the query time of each solr request. Same on MySQL - take a look to query-time/long running queries.
We had an performance problem because the returned number of PK was very large -> so the mySQL query was very large because of an very long where in () clause.
Followed by an very large MySQL statement there where lots of rows returned 200-1.000.000+
But the point is, that the application/user does not need such a big date at onces.
So we decided to work with pagination and offset (on solr side). Solr now returns only 30-50 results (depending of the pagination setting of the users application environment).
This works very fast.
//Edit: Is there any other ways to get solr search result except CSV over http?
There are different formats, like XML, PHP, CSV, Python, Ruby and JSON. To change this, you can use the wtparameter, like ....&wt=json
http://wiki.apache.org/solr/CoreQueryParameters#wt
http://wiki.apache.org/solr/QueryResponseWriter
//Edit #2
An additional way could be not only indexing the data to solr. You could (additional) store the data to solr in order to fetch the data from solr and live without MySQL data.
it depends on your data, if that is an way for you...
Solr provides a way to export the results as CSV and JSON
1 million+ is still a very large set. You can always do it in batches.
Can't you retrieve all your MySQL database to Solr?
You can use DIH ( Data Import Handler ) to retrieve all the data from MySQL and add to Solr pretty easy.
Then you will have all information you need in just one place and I think you will get a better performance.
Related
I currently have a table in my Postgres database with about 115k rows that I feel is too slow for my serverless functions. The only thing I need that table for is to lookup values using functions like ILIKE and the network barrier is slowing things down a lot I believe.
My thought was to take the table and make it into a javascript array of objects as it doesn't change often if ever. Now that I have it in a file such as array.ts and inside is:
export default [
{}, {}, {},...
]
What is the best way to query this huge array? Is it best to just use the .filter function? I currently am trying to import the array and filter it but it seems to just hang and never actually complete. MUCH slower the the current DB approach so I am unsure if this is the right approach.
Make the database faster
As people have commented, it's likely that the database will actually perform better than anything else given that databases are good at indexing large data sets. It may just be a case of adding the right index, or changing the way your serverless functions handle the connection pool.
Make local files faster
If you want to do it without the database, there are a couple of things that will make a big difference:
Read the file and then use JSON.parse, do not use require(...)
JavaScript is much slower to parse than JSON. You can therefore make things load much faster by parsing it as JavaScript.
Find a way to split up the data
Especially in a serverless environment, you're unlikely to need all the data for every request, and the serverless function will probably only serve a few requests before it is shutdown and a new one is started.
If you could split your files up such that you typically only need to load an array of 1,000 or so items, things will run much faster.
Depending on the size of the objects, you might consider having a file that contains only the id of the objects & the fields needed to filter them, then having a separate file for each object so you can load the full object after you have filtered.
Use a local database
If the issue is genuinely the network latency, and you can't find a good way to split up the files, you could try using a local database engine.
#databases/sqlite can be used to query an SQLite database file that you could pre-populate with your array of values and index appropriately.
const openDatabase = require('#databases/sqlite');
const {sql} = require('#databases/sqlite');
const db = openDatabase('mydata.db');
async function query(pattern) {
await db.query(sql`SELECT * FROM items WHERE item_name LIKE ${pattern}`);
}
query('%foo%').then(results => console.log(results));
I am trying to do CRUD operations on MongoDB of a very large size around 20GB data and there can be multiple such versions of data. Can anyone guide me on how to handle such high data for the CRUD operations and maintaining the previous versions of the data in MongoDB?
I am using NodeJS as backend and I can also use any other database if required.
Mongodb is a reliable database, I am using it to processes 10-11 billions of data every single day nodejs should also be fine as long as you are handling the files in streams of data.
Things you should do to Optimize:
Indexing - this will be the biggest part, if you want faster queries you better look into indexing in mongodb, every single document needs to be indexed according to your query, else you are going to have a tough time dealing with queries.
Sharding and Replication - this will help you organise the data and increases the query speed, replication would allow you to have your reads and writes separated(there are cons for replication you can read about that in the mongodb documentation).
This are the main things you need to consider, there are a lot but this should get you started...;) need any help please do let me know.
Recently, I'm working on make a solution for storing user's search log/query log into a HBase table.
Let's simple the raw Query log:
query timestamp req_cookie req_ip ...
Data access patterns:
scan through all querys within a time range.
scan through all search history with a specified query
I came up with the following row-key design:
<query>_<timestamp>
But the query may be very long or in different encoding, put query directly into the rowkey seems unwise.
I'm looking for help in optimizing this schema, anybody handling this scenario before?
1- You can do a full table scan with a timerange. In case you need realtime responses you have to maintain a reverse row-key table <timestamp>_<query> (plan your region splitting policy carefully first).
Be warned that sequential row key prefixes will get some of your
regions very hot if you have a lot of concurrence, so it would be wise
to buffer writes to that table. Additionally, if you get more writes than a single region can handle you're going to implement some sort of sharding prefix (i.e modulo of the timestamp), although this will make your
retrievals a lot more complex (you'll have to merge the results of
multiple scans).
2- Hash the query string in a way that you always have a fixed-length row key without having to care about encoding (MD5 maybe?)
I have a use case, in which my data is stored in DynamoDB with hashkey as UniqueID and range key as Date. The same data is also present in simple storage service of Amazon(S3). I want to search all the data based on time range. I want this to be fast enough. I can think of the following possible approaches:
- Scrap the full S3 and sort them based on time(it is not satisfying my latency requirements)
-Using DynamoDB scan filters will not help, as they scan the whole table. Consider data to be of large amount.
Requirements : fast(can get the result in less than 1 minute) ,
do not access a large amount of data,
can't use any other DB source
I think AWS Elasticsearch might be the answer to your problem. DynamoDB is now integrated with Elasticsearch, enabling you to perform full-text queries on your data.
Elasticsearch is a popular open source search and analytics engine designed to simplify real-time search and big data analytics.
Elasticsearch integration is easy with the new Amazon DynamoDB Logstash Plugin.
You should use Query instead of Scan, see http://docs.aws.amazon.com/amazondynamodb/latest/developerguide/QueryAndScan.html.
I have been working with databases recently and before that I was developing standalone components that do not use databases.
With all the DB work I have a few questions that sprang up.
Why is a database query faster than a programming language data retrieval from a file.
To elaborate my question further -
Assume I have a table called Employee, with fields Name, ID, DOB, Email and Sex. For reasons of simplicity we will also assume they are all strings of fixed length and they do not have any indexes or primary keys or any other constraints.
Imagine we have 1 million rows of data in the table. At the end of the day this table is going to be stored somewhere on the disk. When I write a query Select Name,ID from Employee where DOB="12/12/1985", the DBMS picks up the data from the file, processes it, filters it and gives me a result which is a subset of the 1 million rows of data.
Now, assume I store the same 1 million rows in a flat file, each field similarly being fixed length string for simplicity. The data is available on a file in the disk.
When I write a program in C++ or C or C# or Java and do the same task of finding the Name and ID where DOB="12/12/1985", I will read the file record by record and check for each row of data if the DOB="12/12/1985", if it matches then I store present the row to the user.
This way of doing it by a program is too slow when compared to the speed at which a SQL query returns the results.
I assume the DBMS is also written in some programming language and there is also an additional overhead of parsing the query and what not.
So what happens in a DBMS that makes it faster to retrieve data than through a programming language?
If this question is inappropriate on this forum, please delete but do provide me some pointers where I may find an answer.
I use SQL Server if that is of any help.
Why is a database query faster than a programming language data retrieval from a file
That depends on many things - network latency and disk seek speeds being two of the important ones. Sometimes it is faster to read from a file.
In your description of finding a row within a million rows, a database will normally be faster than seeking in a file because it employs indexing on the data.
If you pre-process you data file and provided index files for the different fields, you could speedup data lookup from the filesystem as well.
Note: databases are normally used not for this feature, but because they are ACID compliant and therefore are suitable for working in environments where you have multiple processes (normally many clients on many computers) querying the database at the time.
There are lots of techniques to speed up various kinds of access. As #Oded says, indexing is the big solution to your specific example: if the database has been set up to maintain an index by date, it can go directly to the entries for that date, instead of reading through the entire file. (Note that maintaining an index does take up space and time, though -- it's not free!)
On the other hand, if such an index has not been set up, and the database has not been stored in date order, then a query by date will need to go through the entire database, just like your flat-file program.
Of course, you can write your own programs to maintain and use a date index for your file, which will speed up date queries just like a database. And, you might find that you want to add other indices, to speed up other kinds of queries -- or remove an index that turns out to use more resources than it is worth.
Eventually, managing all the features you've added to your file manager may become a complex task; you may want to store this kind of configuration in its own file, rather than hard-coding it into your program. At the minimum, you'll need features to make sure that changing your configuration will not corrupt your file...
In other words, you will have written your own database.
...an old one, I know... just for if somebody finds this: The question contained "assume ... do not have any indexes"
...so the question was about the sequential dataread fight between the database and a flat file WITHOUT indexes, which the database wins...
And the answer is: if you read record by record from disk you do lots of disk seeking, which is expensive performance wise. A database always loads pages by concept - so a couple of records all at once. Less disk seeking is definitely faster. If you would do a mem buffered read from a flat file you could achieve the same or better read values.