Elasticsearch Query Sanitization & Security

I have a dataset hosted in Elasticsearch that I would like to expose to my end users for searching. Basically, I would like to have a search box in my application that lets users write queries that I can then run on the Elasticsearch cluster.
Elasticsearch recommends using the simple query string query whenever user input can't be trusted. This is not optimal for me, as I would like to retain the attribute-based search that the default query syntax can handle.
Example indexed document:
{
"account_id": "1357983",
"first_name": "godzilla",
"age": 36"
}
I would like to support queries like the following to my end users:
# This is not possible with Simple Query String as far as I understand
first_name:"godzilla" AND age:36
I understand that security & performance are the two main reasons not to run user-entered queries directly on the cluster.
Question:
What are some steps I can take to protect the cluster against malicious(?) queries?
Is there some kind of sanitization step I can run on user entered queries before I run them on Elasticsearch?
Performance issues aside, are there any security implications?
Things I am already doing:
The search results are never returned directly to users. Searches are only used to identify documents, each of which has a unique ID associated with it. These IDs are then used to look up the actual records in an RDBMS, and only the records that actually belong to the user are returned to them.
Every query has an explicit timeout set to 5 seconds.
Every query has the size attribute configured to 20 so that only the top 20 matching records are returned (this is enough for my use case).
expand_wildcards is set to false
(Naive approach that I have not tried yet) - Strip out all wildcard expressions (* and ?) from the query before running it, and show an error message to the user if these characters are used; see the sketch below this list. I am only interested in supporting exact matches for now.
Is there a way to disable regex, fuzziness & wildcards so that only exact matches are allowed, and performance does not take a hit when queries like /.*/ are run?
Disabled Expensive queries as defined here
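Roughly what I have in mind for the stripping/whitelisting approach (a sketch only, using the Python client; the connection details, index name and field whitelist are made up and not something I have tested):

from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

FORBIDDEN_CHARS = set("*?~/")  # wildcards, fuzziness operator, regex delimiters

def safe_search(user_query):
    # Reject rather than rewrite, so the user knows the syntax isn't supported.
    if any(ch in FORBIDDEN_CHARS for ch in user_query):
        raise ValueError("Wildcards, fuzzy and regex operators are not supported")
    body = {
        "size": 20,                      # only the top 20 matches are needed
        "timeout": "5s",                 # explicit per-query timeout
        "_source": ["account_id"],       # only the ID; real records come from the RDBMS
        "query": {
            "query_string": {
                "query": user_query,
                "fields": ["account_id", "first_name", "age"],  # whitelist of fields
                "allow_leading_wildcard": False,
                "default_operator": "AND",
            }
        },
    }
    return es.search(index="accounts", body=body)

# e.g. safe_search('first_name:"godzilla" AND age:36')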
Apart from the above mentioned points, what else can I do to protect the ES cluster from a Security & Performance perspective?

Related

Get all `Facets` from Azure Search without item results

Hi all, I'm facing performance issues with Azure Cognitive Search. Currently I have 956 facet fields.
When I load documents from the Azure server, it takes almost 30 to 35 seconds.
But when I remove the facets from the Azure Search request, documents load in 2 to 3 seconds.
So I have created two APIs:
The first API loads the document results from the Azure server.
The second API loads all facets from the Azure server.
Is there any way to load only the facets?
Code to get the documents from the Azure server:
// Resolve the result holder for this request.
DocumentSearchResult<AzureSearchItem> results = null;
ISearchFilterResult searchResult = DependencyResolver.Current.GetService<ISearchFilterResult>();
WriteToFile("Initiate request call for search result ProcessAzureSearch {0}");
// `parameters` (SearchParameters) carries the facet definitions that slow this call down.
results = searchServiceClient.Documents.Search<AzureSearchItem>(searchWord, parameters);
WriteToFile("Response received for search result {0}");
Faceting is an aggregation operation that's performed over the matching results and is quite intensive when there are a lot of distinct buckets. I can't comment on the specific increase in latency but adding facets to the query definitely has a performance impact.
Since faceting computes aggregations over the matching documents, the query still has to run in the backend, but as Gaurav mentioned, specifying top = 0 prevents the actual document retrieval since the documents don't need to be included in the response. This can improve performance, especially if the individual docs are large.
Another possibility is to run just the query first and then use an identifier field to filter the docs when faceting. Since filtering is faster than querying, the overall latency should improve. This only works if you're able to identify the ID groups for the resultant docs from the first API call.
In general I'd recommend using facets judiciously and re-evaluate the design if there is a need to run faceting queries on a field with high cardinality. Here's a document on optimizing search performance that you can take a look at -
https://learn.microsoft.com/en-us/azure/search/search-performance-optimization
SearchParameters has a property called Top which instructs the Search Service to return that number of documents.
Gets or sets the number of search results to retrieve. This can be
used in conjunction with $skip to implement client-side paging of
search results. If results are truncated due to server-side paging,
the response will include a continuation token that can be used to
issue another Search request for the next page of results.
One possible solution would be to set this value to 0 in your Facets API and in that case no documents will be returned by the Search Service.
I am not sure about the performance implication of this approach though. I just tried it with a very small set of data and it worked just fine for me.
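For what it's worth, a rough equivalent of that facets-only call using the Python SDK (azure-search-documents); the original snippet uses the .NET SDK, and the service details and facet field name below are made-up placeholders:

from azure.core.credentials import AzureKeyCredential
from azure.search.documents import SearchClient

client = SearchClient(endpoint="https://<service>.search.windows.net",
                      index_name="<index>",
                      credential=AzureKeyCredential("<api-key>"))

# top=0: no documents are retrieved, but the facet buckets are still computed.
results = client.search(search_text="<user query>",
                        facets=["category,count:50"],
                        top=0)
facet_buckets = results.get_facets()   # dict: facet field -> list of {value, count}

# Optionally, restrict faceting to documents found by a previous query by adding
# a filter on the key field, e.g. filter="search.in(id, 'id1,id2,id3', ',')"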

Cassandra: using one letter as a shard key to reduce load on the cluster

I need to implement a functionality to search users by their nickname.
I know that it's possible to create a SASI index on the nickname and the search will work. However, as far as I understand, the query will be sent to all nodes in the cluster.
I want to modify the table and introduce a shard key which will be the first letter of the nickname. That way, when a user starts a search, we know we only need to forward the query to a specific node (+ replicas).
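Roughly the table I have in mind (just a sketch; the keyspace, table and column names are made up):

from cassandra.cluster import Cluster

session = Cluster(["127.0.0.1"]).connect("my_keyspace")

# Partition key = first letter of the nickname, so a lookup only touches the
# node(s) owning that single partition instead of fanning out to the whole cluster.
session.execute("""
    CREATE TABLE IF NOT EXISTS users_by_nickname (
        first_letter text,
        nickname     text,
        user_id      uuid,
        PRIMARY KEY ((first_letter), nickname)
    )
""")

def search_by_prefix(prefix):
    # Range scan on the clustering column within the single partition.
    return session.execute(
        "SELECT user_id, nickname FROM users_by_nickname "
        "WHERE first_letter = %s AND nickname >= %s AND nickname < %s",
        (prefix[0].lower(), prefix.lower(), prefix.lower() + '\uffff'))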
P.S. I know that this kind of pattern can create a hotspot. However, I think the trade-offs are acceptable, and in practice I should not run into issues because of this hotspot (I don't expect to get a billion users in my system).
What do you think?
Thank you in advance.

How to create Accumulo Embedded Index with Rounds strategy?

I am a beginner in Accumulo and using Accumulo 1.7.2.
As an indexing strategy, I am planning to use the Embedded Index with Rounds strategy (http://accumulosummit.com/program/talks/accumulo-table-designs/ on page 21). I couldn't find any documentation for it anywhere, so I am wondering if any of you could help me here.
My description of that strategy was mostly just to avoid sending a query to all the servers at once by simply querying one portion of the table at a time. Adding rounds to an existing 'embedded index' example might be the easiest place to start.
The Accumulo O'Reilly book includes an example that starts on page 284 in a section called 'Index Partitioned by Document' whose code lives here: https://github.com/accumulobook/examples/tree/master/src/main/java/com/accumulobook/designs/multitermindex
The query portion of that example is in the class WikipediaQueryMultiterm.java. It uses a BatchScanner configured with a single empty range to send the query to all tablet servers. To implement the by-rounds query strategy this could be replaced with something that goes from one tablet server to the next, either in a round-robin fashion, or perhaps going to 1, then if not enough results are found, going to the next 2, then 4 and so on, to mimic what Cassandra does.
Since you can't target servers directly with a query, and since the table uses partitioning IDs, you could configure your scanners to scan all the key-values within the first partition ID, then query the next partition ID, and so on, or perhaps visit the partitions in random order to avoid congestion.
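Purely for illustration, the by-rounds expansion could look something like this (a plain Python sketch; scan_partition is a stand-in for however you scan a single partition ID, e.g. a BatchScanner restricted to that range):

def query_in_rounds(partition_ids, scan_partition, query, wanted):
    """Query one partition first, then 2, then 4, ... until enough results are found."""
    results, start, batch = [], 0, 1
    while start < len(partition_ids) and len(results) < wanted:
        for pid in partition_ids[start:start + batch]:
            results.extend(scan_partition(pid, query))
        start += batch
        batch *= 2          # widen the net each round, similar to what Cassandra does
    return results[:wanted]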
What some others have mentioned, adding additional indexes to help narrow the search space before sending a query to multiple servers hosting an embedded index, is beyond the scope of what I described and is a strategy that I believe is employed by the recently released DataWave project: https://github.com/NationalSecurityAgency/datawave

MongoDB: can I trigger secondary replication only at a given time, or manually?

I'm not a MongoDB expert, so I'm a little unsure about the server setup now.
I have a single instance running MongoDB 3.0.2 with WiredTiger, accepting both read and write ops. It collects logs from clients, so the write load is decent. Once a day I want to process these logs and calculate some metrics using the aggregation framework; the data set to process is roughly all logs from the last month, and the whole calculation takes about 5-6 hours.
I'm thinking about splitting writes and reads to avoid locks on my collections (the server continues to write logs while I'm reading; newly written logs may match my queries, but I can skip them because I don't need 100% accuracy).
In other words, I want a setup with a secondary for reads, where replication does not run continuously but instead starts at a configured time, or better, is triggered before the read operations start.
I do all my processing from Node.js, so one option I see is to export the data created in some period like [yesterday, today], import it into the read instance myself, and run the calculations after the import is done. I looked at replica sets and master/slave replication as possible setups, but I couldn't figure out how to configure them for the described scenario.
So maybe I'm wrong and missing something here? Are there any other options to achieve this?
Your idea of using a replica-set is flawed for several reasons.
First, a replica-set always replicates the whole mongod instance. You can't enable it for individual collections, and certainly not only for specific documents of a collection.
Second, deactivating replication and enabling it before you start your report generation is not a good idea either. When you enable replication, the new slave will not be immediately up to date; it will take a while until it has processed the changes since its last contact with the master. There is no way to tell in advance how long this will take (you can check how far a secondary is behind the primary using rs.status(), comparing the secondary's optimeDate with its lastHeartbeat date).
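For instance, a small sketch with pymongo (the host name is a placeholder); it reads the same fields rs.status() shows in the shell:

from pymongo import MongoClient

status = MongoClient("mongodb://primary-host:27017").admin.command("replSetGetStatus")
members = status["members"]
primary = next(m for m in members if m["stateStr"] == "PRIMARY")

for m in members:
    if m["stateStr"] == "SECONDARY":
        # optimeDate is a datetime, so the difference is the replication lag.
        lag = (primary["optimeDate"] - m["optimeDate"]).total_seconds()
        print("{0} is {1:.0f}s behind the primary".format(m["name"], lag))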
But when you want to perform data-mining on a subset of your documents selected by timespan, there is another solution.
Transfer the documents you want to analyze to a new collection. You can do this with an aggregation pipeline consisting only of a $match stage that selects the documents from the last month, followed by an $out stage. The $out operator specifies that the results of the aggregation are not sent to the application/shell but are instead written to a new collection (which is automatically emptied before this happens). You can then run your reporting on the new collection without locking the actual one. This also has the advantage that you are now operating on a much smaller collection, so queries will be faster, especially those which can't use indexes. And since the data won't change between your aggregations, your reports won't have inconsistencies caused by data changing while they run.
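A minimal sketch of that $match + $out step with pymongo (the database, collection and field names are assumptions):

from datetime import datetime, timedelta
from pymongo import MongoClient

db = MongoClient()["logging"]
one_month_ago = datetime.utcnow() - timedelta(days=30)

# Snapshot last month's logs into a separate collection; $out replaces the
# target collection, so the report always runs against a fresh, static copy.
db.logs.aggregate([
    {"$match": {"created_at": {"$gte": one_month_ago}}},
    {"$out": "logs_report"},
])

# The heavy aggregation now runs against the snapshot, not the live collection.
metrics = list(db.logs_report.aggregate([
    {"$group": {"_id": "$client_id", "events": {"$sum": 1}}},
]))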
When you are certain that you will need a second server for report generation, you can still use replication and perform the aggregation on the secondary. However, I would really recommend you to build a proper replica-set (consisting of primary, secondary and an arbiter) and leave replication active at all times. Not only will that make sure that your data isn't outdated when you generate your reports, it also gives you the important benefit of automatic failover should your primary go down for some reason.

Neo4j Optimization Questions for Server Plug-in Queries

I'm trying to optimize a fuzzy search query. It's fairly large, as it searches most properties in the database for a single word. I have some questions about some things I've been doing to improve the search speed.
Test Info: I added about 10,000 nodes and I'm searching on about 40 properties. My query times are about 3-30 seconds depending on the criteria.
MATCH (n) WHERE
(n:Type__Exercise AND ( n.description =~ '(?i).*criteria.*' OR n.name =~ '(?i).*criteria.*' )) OR
(n:Type__Fault AND ( n.description =~ '(?i).*criteria.*' OR n.name =~ '(?i).*criteria.*' ))
WITH n LIMIT 100
RETURN count(n)
This is basically my query, but with a lot more OR clauses. I also use parameters when sending the query to the execution engine. I realize it's very expensive to use the regular expressions on every single property. I'm hoping I can get good enough performance without doing exact matches up to a certain amount of data (This application will only have 1-10 users querying at a time). This is a possible interim effort we're investigating until the new label indexes support full text queries.
First of all, how do I tell if my query was cached? I call my server plug-in via curl, and the times I'm seeing are almost identical each time I pass the same criteria (the time is for the entire curl command to finish). I'm using a single instance of the execution engine, created from the GraphDatabaseService that is passed into the plug-in via an @Source parameter. How much of an improvement should I see if a query is cached?
Is there a query size where Neo4j doesn't bother caching the query?
How effective is the LIMIT clause at speeding up queries? I added one, but didn't see a great performance boost (for queries that do have results). Does the execution engine stop once it finds enough nodes?
My queries are read-only; do I still have to wrap my calls in a transaction?
I could split up my query so I only search one property at a time, or say 4 properties at a time. Then I could run the whole set of queries via the execution engine. It seems like this would be better for caching, but is there an added cost to running multiple small queries rather than one large one? What if I kicked off 10 threads? Would there be enough of a performance increase to make this worthwhile?
Is there a way to use parameters when using PROFILE in the Neo4j console? I've been trying to use this to see how many db hits I'm getting on my queries.
How effective is the Neo4j browser for comparing times it takes to execute a query?
Does caching happen here?
If I want to warm up Neo4j data for queries - can I run the exact queries I'm expecting? Does the query need to return data, or will a count type query warm the cache? As an alternative, should I just iterate over all the nodes? I'd rather just pull in the nodes that are likely to be searched vs all of them.
I think for the time being you'd be better served using the legacy full-text indexing facilities; I recently wrote a blog post about it: http://jexp.de/blog/2014/03/full-text-indexing-fts-in-neo4j-2-0/
If you don't want to do that:
I would probably also rewrite your query to turn it around:
MATCH (n)
WHERE
(n:Type__Exercise OR n:Type__Fault) AND
(n.description =~ '(?i).*criteria.*' OR n.name =~ '(?i).*criteria.*' )
WITH n LIMIT 100
RETURN count(n)
You can probably also benefit a bit more from having a secondary "search" field that is just the concatenation of your description and name fields. You probably also want to improve your regexp by adding a word boundary \b on the left and right, e.g. '(?i).*\bcriteria\b.*'.
Regarding your questions:
First of all, how do I tell if my query was cached?
Your query will be cached if you use parameters (for the regexps). There is a configurable query-cache size (defaulting to 100 queries).
Is there a query size where Neo4j doesn't bother caching the query?
Neo4j currently caches all queries that come in regardless of size
My queries are read-only, do I still have to wrap my calls with a transaction?
Cypher will create its own transaction. In general, read transactions are mandatory. For Cypher, you need outer transactions if you want multiple queries to participate in the same tx-scope.
is there an added cost to running multiple small queries rather than one large one? What if I kicked off 10 threads? Would there be enough of a performance increase to make this worthwhile?
It depends: smaller queries are executed more quickly (if they touch less of the total dataset), but you have to combine their results in the client.
If they touch the same nodes you do double work.
For bigger queries you have to watch out for cross products or exponential path explosions.
Regarding running smaller queries with many threads
Good question. It should be faster, but there are currently some bottlenecks that we're about to remove. Just try it out.
Is there a way to use parameters when using PROFILE in the Neo4j console?
You can use shell variables for that, with export name=value, and list them with env
e.g.
export name=Lisa
profile match (n:User {name:{name}}) return n;
How effective is the Neo4j browser for comparing times it takes to execute a query?
The browser measures the complete roundtrip, potentially with more data loading, so its timing is not very accurate.
Warmup
The exact queries would make sense.
You don't have to return data; it is enough to return count(*), but you should access the properties you want to query so that they are loaded.
