One complex wildcard query leads to Solr OOM

We are using Solr as our search engine, and recently noticed that some user-input wildcard queries can send Solr into an endless loop in
org.apache.lucene.util.automaton.Operations.determinize()
It also eats memory and finally causes an OOM.
The wildcard query looks like *?????????-???????o·???è??*.
We can validate the input parameter ourselves, but I also wonder: is there any configuration that can disable complex wildcard queries like this, which cause severe performance problems?
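For the validation route, a minimal sketch (C#, purely illustrative; the helper name is made up, and the character set is Lucene's standard list of query metacharacters) of escaping untrusted input before it reaches the query parser:

static string EscapeSolrInput(string input)
{
    // Lucene/Solr query metacharacters; escaping * and ? turns wildcards
    // into literal characters. Strip or reject them instead if wildcards
    // should never be allowed at all.
    const string special = "+-&|!(){}[]^\"~*?:\\/";
    var sb = new System.Text.StringBuilder(input.Length);
    foreach (var c in input)
    {
        if (special.IndexOf(c) >= 0) sb.Append('\\');
        sb.Append(c);
    }
    return sb.ToString();
}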

Related

Elasticsearch Query Sanitization & Security

I have a dataset hosted in Elasticsearch that I would like to expose to my end users for searching. Basically, I would like to have a search box in my application that lets users write queries which I can then run against the Elasticsearch cluster.
Elasticsearch recommends that simple query string be used whenever we are dealing with user input that can't be trusted. This is not optimal, as I would like to retain the attribute-based search that the default Elasticsearch queries can handle.
Example indexed document:
{
  "account_id": "1357983",
  "first_name": "godzilla",
  "age": 36
}
I would like to support queries like the following to my end users:
# This is not possible with Simple Query String as far as I understand
first_name:"godzilla" AND age:36
I understand that security & performance are the two main reasons not to run user-entered queries directly on the cluster.
Question:
What are some steps I can take to protect the cluster against malicious(?) queries?
Is there some kind of sanitization step I can run on user entered queries before I run them on Elasticsearch?
Performance issues aside, are there any security implications?
Things I am already doing:
The search results are never directly returned to users. Searches are only used to identify documents that have a unique ID associated with them. These IDs are then used to find the actual records in an RDBMS, and only those records that actually belong to the user are returned to the end user.
Every query has an explicit timeout set to 5 seconds.
Every query has the size attribute set to 20, so that only the top 20 matching records are returned (this is enough for my use case).
expand_wildcards is set to false
(Naive approach that I have not tried yet) - Strip out all wildcard expressions (* and ?) from the query before running it, and show an error message to the user if these characters are used. I am only interested in supporting exact matches for now.
Is there a way to disable regex, fuzziness & wildcards so that only exact matches are returned, so that performance does not take a hit when queries like /.*/ are run? (A sketch of one option appears at the end of this question.)
Disabled Expensive queries as defined here
Apart from the points mentioned above, what else can I do to protect the ES cluster from a security & performance perspective?
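One option in the direction of the wildcard-stripping idea: parse the field:value pairs out of the user's input yourself, then run each clause through simple_query_string with a restricted flags set, which disables the prefix, fuzzy and near operators at the query level. A sketch of such a request body (field names taken from the example document above):

{
  "query": {
    "simple_query_string": {
      "query": "\"godzilla\"",
      "fields": ["first_name"],
      "flags": "AND|OR|NOT|PHRASE"
    }
  }
}

With the flags restricted like this, syntax outside the allowed set (e.g. godzil* or godzilla~2) is treated as literal text rather than as an operator.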

What is the most performant and scalable way to paginate Cosmos DB results with the SQL API

I have more questions based on this question and answer, which is now quite old but still seems to be accurate.
The suggestion of storing results in memory seems problematic in many scenarios.
A web farm where the end-user isn't locked to a specific server.
Very large result sets.
Smaller result sets with many different users and queries.
I see a few ways of handling paging based on what I've read so far.
Use OFFSET and LIMIT at potentially high RU costs.
Use continuation tokens and caches, with the scaling concerns; a sketch of this pattern follows the excerpt below.
Save the continuation tokens themselves to go back to previous pages.
This can get complex since there may not be a one to one relationship between tokens and pages.
See Understanding Query Executions
In addition, there are other reasons that the query engine might need to split query results into multiple pages. These include:
The container was throttled and there weren't available RUs to return more query results
The query execution's response was too large
The query execution's time was too long
It was more efficient for the query engine to return results in additional executions
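For the continuation-token option, the basic pattern with the .NET SDK v3 looks roughly like this (a sketch; the Item type, the query text, and how you obtain the Container are assumptions):

using System.Collections.Generic;
using System.Linq;
using System.Threading.Tasks;
using Microsoft.Azure.Cosmos;

public class Item { public string id { get; set; } }

public static class Paging
{
    // Pass token = null for the first page; a null token in the result
    // means there are no further pages.
    public static async Task<(List<Item> Page, string Token)> GetPageAsync(
        Container container, string token)
    {
        var query = new QueryDefinition("SELECT * FROM c");
        FeedIterator<Item> iterator = container.GetItemQueryIterator<Item>(
            query,
            continuationToken: token,
            requestOptions: new QueryRequestOptions { MaxItemCount = 20 });

        // Note: per the excerpt above, one ReadNextAsync "page" may hold
        // fewer than MaxItemCount items, so a UI page may need several calls.
        FeedResponse<Item> response = await iterator.ReadNextAsync();
        return (response.ToList(), response.ContinuationToken);
    }
}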
Are there any other, maybe newer options for paging?

Are Objection.js WHERE IN queries slowing my Node.js application down?

OK, so this is basically a yes/no question. I have a node application running on Heroku with a Postgres plan. I use Objection.js as the ORM. I'm seeing 300+ ms response times on most of the endpoints, and I have a theory about why. I would like to have this confirmed.
Most of my api endpoints do around 5-10 eager loads. Objection.js handles eager loading by doing additional WHERE IN queries rather than one big query with a lot of JOINs. The reason for this is that it was easier to build that way and shouldn't harm performance too much.
But that got me thinking: Heroku Postgres presumably doesn't run on the same dyno as the node application, so there is network latency for every query. Could it be that all these latencies add up, causing a total of 300ms delay?
Summarizing: would it be speedier to build your own queries with Knex instead of generating them through Objection.js, in case you have a separately hosted database?
Speculating about why this kind of slowdown happens is mostly useless (I'll speculate a bit at the end anyway ;) ). The first thing to do in this kind of situation is to measure where those 300ms are spent.
From the database you should be able to see query times, to spot whether any slow queries are causing the problems.
Also, knex prints some performance information to the console when you run it with the DEBUG=knex:* environment variable set.
Nowadays node also has built-in profiling support, which you can enable by passing the --inspect flag when starting node. You can then connect your node process to Chrome dev tools and see where node is spending its time. From that profile you will be able to see, for example, whether parsing database query results is dominating execution time.
The best way to figure out the slowness is to isolate the slow part of the app and inspect it carefully, or even post that isolated example to Stack Overflow so that other people can tell you why it might be slow. A general explanation like this doesn't give others much to work with in resolving the real issue.
Objection.js handles eager loading by doing additional WHERE IN queries rather than one big query with a lot of JOINs. The reason for this is that it was easier to build that way and shouldn't harm performance too much.
With objection you can select which eager algorithm you want to use. In most cases (when there are one-to-many or many-to-many relations), making multiple queries is actually more performant than using a join, because with a join the amount of data explodes, and transfer time plus result parsing on the node side take too much time.
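Concretely, loading products together with their shops via the multi-query algorithm issues a handful of statements shaped roughly like this (table and column names borrowed from the SQL example further down), each paying one network round trip:

select * from "product" where "id" in (?, ?, ?);
select "product_shop"."product_id", "shop".*
from "shop"
inner join "product_shop" on "product_shop"."shop_code" = "shop"."code"
where "product_shop"."product_id" in (?, ?, ?);

With a separately hosted database, it is those per-query round trips, not the query generation, that add up.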
would it be speedier to build your own queries with Knex instead of generating them through Objection.js, in case you have a separately hosted database?
Usually no.
Speculation part:
You mentioned that "Most of my api endpoints do around 5-10 eager loads". Most of the time when I have encountered this kind of slowness, the reason has been that the app queries too-big chunks of data from the database. When a query returns, for example, tens of thousands of rows, that is several megabytes of JSON data. Merely parsing that amount of data from the database into JavaScript objects takes hundreds of milliseconds. If your queries also cause high CPU load during that 300ms, this might be your problem.
Slow queries. Sometimes the database doesn't have indexes set correctly, so queries have to scan whole tables linearly to get their results. Checking for slow queries in the DB logs will help to find these. Also, if getting the response takes long but CPU load on the node process is low, this might be the case.
Hereby I can confirm that every query takes approx. 30ms, regardless of the complexity of the query. Objection.js indeed does around 10 separate queries because of the eager loading, which explains the cumulative 300ms.
Just FYI; I'm going down this road now ⬇
I've started to delve into writing my own more advanced SQL queries. It seems like you can do pretty advanced stuff, achieving results similar to the eager loading of Objection.js:
select
  "product".*,
  json_agg(distinct brand) as brand,
  case when count(shop) = 0 then '[]' else json_agg(distinct shop) end as shops,
  case when count(category) = 0 then '[]' else json_agg(distinct category) end as categories,
  case when count(barcode) = 0 then '[]' else json_agg(distinct barcode.code) end as barcodes
from "product"
inner join "brand" on "product"."brand_id" = "brand"."id"
left join "product_shop" on "product"."id" = "product_shop"."product_id"
left join "shop" on "product_shop"."shop_code" = "shop"."code"
left join "product_category" on "product"."id" = "product_category"."product_id"
left join "category" on "product_category"."category_id" = "category"."id"
left join "barcode" on "product"."id" = "barcode"."product_id"
group by "product"."id"
This takes 19ms on 1000 products, but usually the limit is 25 products, so it's very performant.
As mentioned by other people, objection uses multiple queries instead of joins by default to perform eager loading. It's a safer default than completely join-based loading, which can become really slow in some cases. You can read more about the default eager algorithm here.
You can choose the join-based algorithm simply by calling the joinEager method instead of eager. joinEager executes one single query.
Objection also used to have a (pretty stupid) default of 1 parallel query per operation, which meant that all queries inside an eager call were executed sequentially. That default has since been removed, and you should get better performance even in cases like yours.
Your trick of using json_agg is pretty clever and actually avoids the slowness problems I referred to, which can arise in some cases when using joinEager. However, it cannot easily be used for nested loading or with other database engines.

Updating a Lucene index frequently is causing performance degradation

I am trying to add Lucene.net to my project, where searching has become more complicated. However, the underlying tables are modified frequently: new rows are inserted, or fields that are used in the Lucene index are updated.
Is it a good idea to use Lucene.net for searching here?
How can I find the modified fields and update the corresponding Lucene index entries? The Lucene index also contains documents that have since been deleted from the table; how can I remove those from the index?
Right now, while loading the page, I:
remove index entries that are no longer present in the table, based on a unique field
insert a document if it does not exist, otherwise update all documents matching the table's unique field
Page loads are taking more time than normal because of these remove/insert/update index calls.
How can I proceed with this?
Lucene is absolutely suited for this type of feature. It is completely thread-safe... IF you use it the right way.
Solution pointers
Create a single IndexWriter and keep it in a globally accessible singleton (either a global static variable or via dependency injection). IWs are completely threadsafe. NEVER open multiple IWs on the same folder.
Perform all updates/deletes via this singleton. (I had one project doing 100's of ops/second with no issues, even on slightly crappy hardware).
Depending on the frequency of change and the latency acceptable to the app, you could:
Send an update/delete to the index every time you update the DB
Keep a "transaction log" or queue (probably in the same DB) of changed rows and deletions (which are hard to track otherwise). Then update the index by consuming the log/queue.
To search, create your IndexSearcher with searcher = new IndexSearcher(writer.GetReader()). This is part of the NRT (near real time) pattern. NEVER create a separate IndexReader on an index folder that is also open by an IW.
Depending on your pattern of usage you may wish to introduce a period of "latency" between changes happening and those changes being "visible" to the searches...
Instances of IS are also thread-safe, so you can keep an instance of an IS through which all your searches go, then recreate it periodically (e.g. with a timer) and swap it in using Interlocked.Exchange.
I previously created a small framework to isolate this from the app and make it reusable.
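Putting those pointers together, a minimal sketch (C#, assuming the classic Lucene.Net 3.0.3 API that writer.GetReader() implies; the index path and analyzer choice are placeholders):

using System.IO;
using System.Threading;
using Lucene.Net.Analysis.Standard;
using Lucene.Net.Index;
using Lucene.Net.Search;
using Lucene.Net.Store;
using Version = Lucene.Net.Util.Version;

public static class SearchIndex
{
    // ONE writer per index folder, kept for the lifetime of the process.
    public static readonly IndexWriter Writer = new IndexWriter(
        FSDirectory.Open(new DirectoryInfo(@"C:\indexes\myapp")),
        new StandardAnalyzer(Version.LUCENE_30),
        IndexWriter.MaxFieldLength.UNLIMITED);

    // NRT searcher built from the writer's reader, never from the folder.
    private static IndexSearcher _searcher = new IndexSearcher(Writer.GetReader());

    public static IndexSearcher Searcher { get { return _searcher; } }

    // Call periodically (e.g. from a timer) so searches see recent writes.
    public static void RefreshSearcher()
    {
        var fresh = new IndexSearcher(Writer.GetReader());
        var old = Interlocked.Exchange(ref _searcher, fresh);
        // Dispose 'old' once in-flight searches have drained (omitted here).
    }
}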
Caveat
Having said that... hosting this inside IIS does raise some problems. IIS will occasionally restart your app. It will also (by default) start the new instance before stopping the existing one, then swap them (so you don't see the startup time of the new one).
So, for a short time there will be two instances of the writer (which is bad!)
You can tell IIS to disable "overlapping" or increase the time between restarts. But this will cause other side-effects.
So you are actually better off creating a separate service to host your Lucene bits. A simple self-hosted WebAPI Windows service is ideal and pretty simple. This also gives you better control over where the index folder goes, plus the ability to host it on a different machine (which isolates the disk IO load). And it means that the service can be accessed from other parts of your system, tested separately, etc.
Why is this "better" than one of the other services suggested?
It's a matter of choice. I am a huge fan of ElasticSearch. It solves a lot of problems around scale and resilience. It also uses the latest version of Java Lucene which is far, far ahead of lucene.net in terms of capability and performance. (The same goes for the other two).
BUT, ES and Solr are Java (which may or may not be an issue for you). AzureSearch is hosted in Azure which again may or may not be an issue.
All three will require climbing a learning curve and will require infrastructure support or external third party SaaS commitment.
If you keep the service in-house and in C#, it keeps things simple, you retain control over the capabilities, and the shape of the API can be tuned to your needs.
No "right" answer. You'll have to make choices based on your situation.
You should preferably index according to some schedule (periodically). The easiest approach is to keep the date of the last index run, query for all changes since then, and then index new records, update changed ones and remove deleted ones. To keep track of removed entries you will need a log of deleted records in the database, with the date each was removed; you can then query by that date for what needs to be removed from Lucene.
Then simply run that job every 2 minutes or so.
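A sketch of such a job against the singleton writer from the earlier answer (the Db.GetChangedRows / Db.GetDeletedIds data-access helpers and the field layout are hypothetical):

using System;
using Lucene.Net.Documents;
using Lucene.Net.Index;

public static class IndexSyncJob
{
    private static DateTime _lastRun = DateTime.MinValue;

    // Schedule this every 2 minutes or so.
    public static void Run()
    {
        var since = _lastRun;
        _lastRun = DateTime.UtcNow;

        // Hypothetical helper: rows inserted or modified since 'since'.
        foreach (var row in Db.GetChangedRows(since))
        {
            var doc = new Document();
            doc.Add(new Field("id", row.Id, Field.Store.YES, Field.Index.NOT_ANALYZED));
            doc.Add(new Field("name", row.Name, Field.Store.YES, Field.Index.ANALYZED));
            // UpdateDocument = delete-by-term + add, so it covers inserts too.
            SearchIndex.Writer.UpdateDocument(new Term("id", row.Id), doc);
        }

        // Hypothetical helper: ids from the deleted-records log since 'since'.
        foreach (var id in Db.GetDeletedIds(since))
        {
            SearchIndex.Writer.DeleteDocuments(new Term("id", id));
        }

        SearchIndex.Writer.Commit();
    }
}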
That said, Lucene.net is not really suited to a web application; you should consider using ElasticSearch, Solr or AzureSearch: basically, a server that can handle load and multithreading better.

Neo4j Optimization Questions for Server Plug-in Queries

I'm trying to optimize a fuzzy search query. It's fairly large, as it searches most properties in the database for a single word. I have some questions about some things I've been doing to improve the search speed.
Test Info: I added about 10,000 nodes and I'm searching on about 40 properties. My query times are about 3-30 seconds depending on the criteria.
MATCH (n) WHERE
(n:Type__Exercise and ( n.description =~ '(?i).*criteria.*' or n.name =~ '(?i).*criteria.*' )) or
(n:Type__Fault and ( n.description =~ '(?i).*criteria.*' or n.name =~ '(?i).*criteria.*' ))
with n LIMIT 100
return count(n)
This is basically my query, but with a lot more OR clauses. I also use parameters when sending the query to the execution engine. I realize it's very expensive to run regular expressions against every single property. I'm hoping I can get good enough performance without doing exact matches, up to a certain amount of data (this application will only have 1-10 users querying at a time). This is a possible interim approach we're investigating until the new label indexes support full-text queries.
First of all, how do I tell if my query was cached? I make a call to my server plug-in via the curl command, and the times I'm seeing are almost identical each time I pass the same criteria (the time is for the entire curl command to finish). I'm using a single instance of the execution engine, created from the GraphDatabaseService that is passed to the plug-in via an @Source parameter. How much of an improvement should I see if a query is cached?
Is there a query size where Neo4j doesn't bother caching the query?
How effective is the LIMIT clause at speeding up queries? I added one, but didn't see a great performance boost (for queries that do have results). Does the execution engine stop once it finds enough nodes?
My queries are read-only; do I still have to wrap my calls in a transaction?
I could split up my query so I only search one property at a time, or say four properties at a time, and then run the whole set of queries via the execution engine. It seems like this would be better for caching, but is there an added cost to running multiple small queries rather than one large one? What if I kicked off 10 threads? Would there be enough of a performance increase to make this worthwhile?
Is there a way to use parameters when using PROFILE in the Neo4j console? I've been trying to use this to see how many db hits I'm getting on my queries.
How effective is the Neo4j browser for comparing times it takes to execute a query?
Does caching happen here?
If I want to warm up Neo4j data for queries - can I run the exact queries I'm expecting? Does the query need to return data, or will a count type query warm the cache? As an alternative, should I just iterate over all the nodes? I'd rather just pull in the nodes that are likely to be searched vs all of them.
I think for the time being you'd be better served using the legacy fulltext indexing facilities; I recently wrote a blog post about it: http://jexp.de/blog/2014/03/full-text-indexing-fts-in-neo4j-2-0/
If you don't want to do that:
I would probably also rewrite your query to turn it around:
MATCH (n)
WHERE
(n:Type__Exercise OR n:Type__Fault) AND
(n.description =~ '(?i).*criteria.*' OR n.name =~ '(?i).*criteria.*' )
You can probably also benefit a bit more by having a secondary "search" field that is just the concatenation of your description and name fields. You probably also want to improve your regexp, e.g. by adding a word boundary \b on the left and right.
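With such a concatenated field (a hypothetical n.search property maintained at write time), the rewritten query becomes:

MATCH (n)
WHERE (n:Type__Exercise OR n:Type__Fault)
AND n.search =~ '(?i).*\\bcriteria\\b.*'
WITH n LIMIT 100
RETURN count(n)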
Regarding your questions:
First of all, how do I tell if my query was cached?
Your query will be cached if you use parameters (for the regexps); there is a configurable query-cache size (defaulting to 100 queries).
Is there a query size where Neo4j doesn't bother caching the query?
Neo4j currently caches all queries that come in regardless of size
My queries are read-only, do I still have to wrap my calls with a transaction?
Cypher will create its own transaction. In general, read transactions are mandatory. For Cypher you need outer transactions if you want multiple queries to participate in the same tx scope.
is there an added cost to running multiple small queries rather than one large one? What if I kicked off 10 threads? Would there be enough of a performance increase to make this worth while?
It depends: smaller queries are executed more quickly (if they touch less of the total dataset), but you have to combine their results in the client.
If they touch the same nodes you do double work.
For bigger queries you have to watch out when you span up cross products or exponential path explosions.
Regarding running smaller queries with many threads:
Good question; it should be faster, but there are currently some bottlenecks that we're about to remove. Just try it out.
Is there a way to use parameters when using PROFILE in the Neo4j console?
You can use the shell variables for that, setting them with export name=value and listing them with env,
e.g.
export name=Lisa
profile match (n:User {name:{name}}) return n;
How effective is the Neo4j browser for comparing times it takes to execute a query?
The browser measures the complete roundtrip, with potentially more data loading, so its timing is not very accurate.
Warmup
Running the exact queries would make sense.
You don't have to return data; it is enough to return count(*), but you should access the properties you want to use so that they are actually loaded.
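For example, a warm-up query in that spirit, touching exactly the properties the search uses (labels and property names taken from the question):

MATCH (n:Type__Exercise)
RETURN count(n.description), count(n.name)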
