How to minimise time for any operation in JanusGraph using Gremlin? - cassandra

Every query takes more than five minutes to return a result.
I am running a simple query like the following:
g.V().hasLabel("Label").has("pProperty","vValue").next()
It worked fine when I had fewer nodes, but now that I have more than 1 million nodes the problem shows up.

When using JanusGraph and a Gremlin query to search for a property, if no index has been created for that property, the query falls back to a full scan over the data. Simple and composite indices can be created using the JanusGraph Management API. The Gremlin profile() step will show you whether your query used an index.
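For reference, a minimal sketch (Java) of building a composite index for the property used in the question with the Management API. The configuration file path and index name are placeholders, and on a graph that already holds data the index also needs to be reindexed/enabled before queries will use it:

import org.apache.tinkerpop.gremlin.structure.Vertex;
import org.janusgraph.core.JanusGraph;
import org.janusgraph.core.JanusGraphFactory;
import org.janusgraph.core.PropertyKey;
import org.janusgraph.core.schema.JanusGraphManagement;

public class BuildCompositeIndex {
    public static void main(String[] args) {
        JanusGraph graph = JanusGraphFactory.open("conf/janusgraph-cql.properties"); // placeholder config
        JanusGraphManagement mgmt = graph.openManagement();
        PropertyKey key = mgmt.containsPropertyKey("pProperty")
                ? mgmt.getPropertyKey("pProperty")
                : mgmt.makePropertyKey("pProperty").dataType(String.class).make();
        mgmt.buildIndex("byPProperty", Vertex.class)   // composite index usable for equality lookups
            .addKey(key)
            .buildCompositeIndex();
        mgmt.commit();
        // On a graph that already contains data, the index must additionally be reindexed
        // (e.g. mgmt.updateIndex(..., SchemaAction.REINDEX)) and reach the ENABLED state.
        // Afterwards, verify the index is used with:
        // g.V().hasLabel("Label").has("pProperty","vValue").profile().next()
        graph.close();
    }
}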

Seconding what Kelvin said about adding an index. To make things more efficient, you'll either need to filter on additional indexed properties or make sure that you're designating an appropriate "entry point" for your traversal.

Related

Inconsistent results when sorting documents by _doc

I want to fetch Elasticsearch hits using the sort+search_after paging mechanism.
The elasticsearch documentation states:
_doc has no real use-case besides being the most efficient sort order. So if you don’t care about the order in which documents are returned, then you should sort by _doc. This especially helps when scrolling.
However, when performing the same query multiple times, I get different results. More specifically, the first hit alternates randomly between two different hits, where the returned sort field is 0 for one hit, and some specific number for the other.
This obviously breaks the paging as it relies on the value returned in sorting to be later fed into sort_after for the next query.
No data is being written to the index while I am querying it, so this is not because of refreshes.
My questions are therefore:
Is it wrong to sort by _doc for paging? Seems the results I get are inconsistent.
How does sorting by _doc work internally? The documentation is lacking in this regard as it simply states the sort is performed by "index order".
The data was written to the index in parallel using Spark. I thought the problem might be the parallel write combined with the "index order" sorting; however, I did not manage to replicate this behavior with other indices that were also written to from Spark.
Elasticsearch 7; the index contains 2 shards: one primary and one replica.
cheers.
The reason this happened is that the index consists of 2 shards, one primary and one replica. The documents were not indexed in the same order on the two copies, so the order of the results depends on the shard they were returned from. This is fine when using scrolling, because Elasticsearch keeps internal state for the results, but not with paging, which is stateless.
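For context, the sort + search_after flow described in the question looks roughly like the sketch below (plain HTTP against Elasticsearch 7 via the Java 11 HttpClient; the host, index name, page size and the hard-coded sort value are placeholders). The second request feeds back the sort value returned with the last hit of the previous page:

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class SearchAfterPaging {
    private static String search(String body) throws Exception {
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("http://localhost:9200/my-index/_search")) // placeholder host/index
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(body))
                .build();
        return HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString())
                .body();
    }

    public static void main(String[] args) throws Exception {
        // First page: sort by _doc only.
        System.out.println(search("{ \"size\": 100, \"sort\": [\"_doc\"] }"));
        // Next page: pass the last hit's sort values back via search_after
        // (hard-coded here; in real code parse them out of the previous response).
        System.out.println(search("{ \"size\": 100, \"sort\": [\"_doc\"], \"search_after\": [42] }"));
    }
}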

How to create Accumulo Embedded Index with Rounds strategy?

I am a beginner with Accumulo and am using Accumulo 1.7.2.
As an indexing strategy, I am planning to use an Embedded Index with the Rounds Strategy (http://accumulosummit.com/program/talks/accumulo-table-designs/ on page 21). I couldn't find any documentation for it anywhere, so I am wondering if any of you could help me here.
My description of that strategy was mostly about avoiding sending a query to all the servers at once, by simply querying one portion of the table at a time. Adding rounds to an existing 'embedded index' example might be the easiest place to start.
The Accumulo O'Reilly book includes an example that starts on page 284 in a section called 'Index Partitioned by Document' whose code lives here: https://github.com/accumulobook/examples/tree/master/src/main/java/com/accumulobook/designs/multitermindex
The query portion of that example is in the class WikipediaQueryMultiterm.java. It uses a BatchScanner configured with a single empty range to send the query to all tablet servers. To implement the by-rounds query strategy this could be replaced with something that goes from one tablet server to the next, either in a round-robin fashion, or perhaps going to 1, then if not enough results are found, going to the next 2, then 4 and so on, to mimic what Cassandra does.
Since you can't target servers directly with a query, and since the table uses partition IDs, you could configure your scanners to scan all the key/values within the first partition ID, then query the next partition ID, and so on, or perhaps visit the partitions in random order to avoid congestion; a rough sketch follows.
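A hypothetical sketch of that by-rounds querying over a document-partitioned table (Java, against the Accumulo 1.7 client API). The table name, partition-ID row format, and result threshold are made up for illustration; round one scans a single partition, round two the next two, then four, and so on:

import java.util.ArrayList;
import java.util.List;
import java.util.Map;

import org.apache.accumulo.core.client.Connector;
import org.apache.accumulo.core.client.Scanner;
import org.apache.accumulo.core.data.Key;
import org.apache.accumulo.core.data.Range;
import org.apache.accumulo.core.data.Value;
import org.apache.accumulo.core.security.Authorizations;

public class QueryByRounds {

    public static List<Map.Entry<Key, Value>> query(Connector conn, String table,
                                                    int numPartitions, int wanted) throws Exception {
        List<Map.Entry<Key, Value>> hits = new ArrayList<>();
        int partition = 0;
        int roundSize = 1; // widen the search each round: 1, 2, 4, 8, ... partitions
        while (partition < numPartitions && hits.size() < wanted) {
            for (int i = 0; i < roundSize && partition < numPartitions && hits.size() < wanted; i++, partition++) {
                String partitionId = String.format("%04d", partition); // row key = partition ID (placeholder format)
                Scanner scanner = conn.createScanner(table, Authorizations.EMPTY);
                scanner.setRange(new Range(partitionId)); // all key/values within this one partition
                for (Map.Entry<Key, Value> entry : scanner) {
                    hits.add(entry);
                    if (hits.size() >= wanted) {
                        break;
                    }
                }
            }
            roundSize *= 2;
        }
        return hits;
    }
}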
What some others have mentioned, adding additional indexes to help narrow the search space before sending a query to the multiple servers hosting an embedded index, is beyond the scope of what I described; I believe that strategy is employed by the recently released DataWave project: https://github.com/NationalSecurityAgency/datawave

Cassandra data model too many table

I have a single structured row as input, with a write rate of 10K per second. Each row has 20 columns. A number of queries have to be answered over this input. Because most of the queries need a different WHERE, GROUP BY or ORDER BY, the final data model ended up like this:
primary key for table of query1 : ((column1,column2),column3,column4)
primary key for table of query2 : ((column3,column4),column2,column1)
and so on
I am aware of the limit on the number of tables in a Cassandra data model (200 triggers a warning and 500 would fail).
Because every input row has to be inserted into every table, the total writes per second multiply quickly:
writes per second = 10K (input)
* number of tables (queries)
* replication factor
The main question: am I on the right path? Is it normal to have a table per query even when the input rate is already this high?
Shouldn't I use something like Spark or Hadoop instead of relying on the bare data model? Or even HBase instead of Cassandra?
It could be that Elassandra would resolve your problem.
The query system is quite different from CQL, but the duplication for indexing would automatically be managed by Elassandra on the backend. All the columns of one table will be indexed so the Elasticsearch part of Elassandra can be used with the REST API to query anything you'd like.
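As a rough illustration of that REST side (a sketch only; the host, keyspace/index name, and column names are placeholders), a search against Elassandra's Elasticsearch endpoint could look like this with the Java 11 HttpClient:

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class ElassandraSearch {
    public static void main(String[] args) throws Exception {
        // Term filter on one column, ordered by another; both column names are placeholders.
        String query = "{ \"query\": { \"term\": { \"column3\": \"some-value\" } },"
                     + "  \"sort\":  [ { \"column4\": \"desc\" } ] }";
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("http://localhost:9200/my_keyspace/_search")) // placeholder host/keyspace
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(query))
                .build();
        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.body());
    }
}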
In one of my tests, I pushed a huge amount of data (8 GB) into an Elassandra database non-stop and it never timed out. The search engine also remained responsive pretty much the whole time, which is more or less what you are talking about. The docs say it takes 5 to 10 seconds for newly added data to become available in the Elassandra indexes. I guess that will depend somewhat on your installation, but I think it is more than enough speed for most applications.
The use of Elassandra may sound a bit hairy at first, but once in place it's incredible how fast you can find results. It certainly gives you very powerful WHERE capabilities. The GROUP BY is a bit difficult to put in place. The ORDER BY is simple enough; however, when (re-)ordering you lose some speed, which is something to keep in mind. In my tests, though, even the ORDER BY equivalents were very fast.

Spark Cassandra connector - Range query on partition key

I'm evaluating the spark-cassandra-connector and I'm struggling to get a range query on a partition key to work.
According to the connector's documentation, it is possible to do server-side filtering on a partition key using the equality or IN operator, but unfortunately my partition key is a timestamp, so I cannot use those.
So I tried using Spark SQL with the following query ('timestamp' is the partition key):
select * from datastore.data where timestamp >= '2013-01-01T00:00:00.000Z' and timestamp < '2013-12-31T00:00:00.000Z'
Although the job spawns 200 tasks, the query does not return any data.
I can also assure you that there is data to be returned, since running the query in cqlsh (doing the appropriate conversion using the 'token' function) DOES return data.
I'm using Spark 1.1.0 in standalone mode. Cassandra is 2.1.2 and the connector version is the 'b1.1' branch. The Cassandra driver is the DataStax 'master' branch.
The Cassandra cluster is overlaid on the Spark cluster: 3 servers with a replication factor of 1.
Here is the job's full log
Any clue anyone?
Update: When trying to do server-side filtering based on the partition key (using CassandraRDD.where method) I get the following exception:
Exception in thread "main" java.lang.UnsupportedOperationException: Range predicates on partition key columns (here: timestamp) are not supported in where. Use filter instead.
But unfortunately I don't know what "filter" is...
I think the CassandraRDD error is telling you that the query you are trying to run is not allowed in Cassandra, and that you have to load the whole table into a CassandraRDD and then apply a Spark filter operation over that RDD.
So your code (in Scala) should look something like this:
val fmt = new java.text.SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss.SSSX")
val cassRDD = sc.cassandraTable("keyspace name", "table name").filter(row => !row.getDate("timestamp").before(fmt.parse("2013-01-01T00:00:00.000Z")) && row.getDate("timestamp").before(fmt.parse("2013-12-31T00:00:00.000Z")))
If you are interested in making this type of query, you might have to take a look at other Cassandra connectors, like the one developed by Stratio.
You have several options to get the solution you are looking for.
The most powerful one would be to use the Lucene indexes integrated with Cassandra by Stratio, which allow you to search by any indexed field on the server side. Your write time will increase but, on the other hand, you will be able to query any time range. You can find further information about Lucene indexes in Cassandra here. This extended version of Cassandra is fully integrated into the deep-spark project, so you can take advantage of the Lucene indexes in Cassandra through it. I would recommend using Lucene indexes when you are executing a restricted query that retrieves a small-to-medium result set; if you are going to retrieve a big piece of your data set, you should use the third option below.
Another approach, depending on how your application works, might be to truncate your timestamp field so you can look it up using an IN operator. The problem is that, as far as I know, you can't use the spark-cassandra-connector for that; you would have to use the plain Cassandra driver, which is not integrated with Spark, or have a look at the deep-spark project, where a new feature allowing this is about to be released very soon. Your query would look something like this:
select * from datastore.data where timestamp IN ('2013-01-01', '2013-01-02', '2013-01-03', '2013-01-04', ... , '2013-12-31')
However, as I said before, I don't know whether this fits your needs, since you might not be able to truncate your data and group it by date/time.
The last option you have, but the least efficient, is to bring the full data set into your Spark cluster and apply a filter on the RDD.
Disclaimer: I work for Stratio :-) Don't hesitate to contact us if you need any help.
I hope it helps!

Neo4j Optimization Questions for Server Plug-in Queries

I'm trying to optimize a fuzzy search query. It's fairly large, as it searches most properties in the database for a single word. I have some questions about some things I've been doing to improve the search speed.
Test Info: I added about 10,000 nodes and I'm searching on about 40 properties. My query times are about 3-30 seconds depending on the criteria.
MATCH (n) WHERE
(n:Type__Exercise and ( n.description =~ '(?i).*criteria.*' or n.name =~ '(?i).*criteria.*' )) or
(n:Type__Fault and ( n.description =~ '(?i).*criteria.*' or n.name =~ '(?i).*criteria.*' ))
with n LIMIT 100
return count(n)
This is basically my query, but with a lot more OR clauses. I also use parameters when sending the query to the execution engine. I realize it's very expensive to use the regular expressions on every single property. I'm hoping I can get good enough performance without doing exact matches up to a certain amount of data (This application will only have 1-10 users querying at a time). This is a possible interim effort we're investigating until the new label indexes support full text queries.
First of all, how do I tell if my query was cached? I make a call to my server plug-in via the curl command, and the times I'm seeing are almost identical each time I pass the same criteria (the time is for the entire curl command to finish). I'm using a single instance of the execution engine, created with the GraphDatabaseService that is passed in to the plug-in via an @Source parameter. How much of an improvement should I see if a query is cached?
Is there a query size where Neo4j doesn't bother caching the query?
How effective is the LIMIT clause at speeding up queries? I added one, but didn't see a great performance boost (for queries that do have results). Does the execution engine stop once it finds enough nodes?
My queries are read-only; do I still have to wrap my calls in a transaction?
I could split up my query so I only search one property at a time, or say 4 properties at a time, and then run the whole set of queries via the execution engine. It seems like this would be better for caching, but is there an added cost to running multiple small queries rather than one large one? What if I kicked off 10 threads? Would there be enough of a performance increase to make this worthwhile?
Is there a way to use parameters when using PROFILE in the Neo4j console? I've been trying to use this to see how many db hits I'm getting on my queries.
How effective is the Neo4j browser for comparing times it takes to execute a query?
Does caching happen here?
If I want to warm up Neo4j data for queries - can I run the exact queries I'm expecting? Does the query need to return data, or will a count type query warm the cache? As an alternative, should I just iterate over all the nodes? I'd rather just pull in the nodes that are likely to be searched vs all of them.
I think for the time being you'd be better served using the legacy fulltext indexing facilities; I recently wrote a blog post about it: http://jexp.de/blog/2014/03/full-text-indexing-fts-in-neo4j-2-0/
If you don't want to do that:
I would probably also rewrite your query to turn it around:
MATCH (n)
WHERE
(n:Type__Exercise OR n:Type__Fault) AND
(n.description =~ '(?i).*criteria.*' OR n.name =~ '(?i).*criteria.*' )
You can probably also benefit a bit more from a secondary "search" field that is just the concatenation of your description and name fields. You probably also want to improve your regexp, for example by adding a word boundary \b on the left and right.
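To tie a few of these points together, here is a rough sketch (Java, in the Neo4j 2.x server-plugin style the question describes; the class name and the regex parameter are made up) of running the rewritten query with the regexp passed as a parameter, inside a read transaction, so the plan can be cached and reused:

import java.util.Collections;
import java.util.Map;

import org.neo4j.cypher.javacompat.ExecutionEngine;
import org.neo4j.cypher.javacompat.ExecutionResult;
import org.neo4j.graphdb.GraphDatabaseService;
import org.neo4j.graphdb.Transaction;

public class FuzzySearch {
    private final GraphDatabaseService db;
    private final ExecutionEngine engine; // create once and reuse for every request

    public FuzzySearch(GraphDatabaseService db) {
        this.db = db;
        this.engine = new ExecutionEngine(db);
    }

    public long count(String criteria) {
        String query =
            "MATCH (n) WHERE (n:Type__Exercise OR n:Type__Fault) " +
            "AND (n.description =~ {regex} OR n.name =~ {regex}) " +
            "RETURN count(n) AS c";
        // Word boundaries around the criteria, case-insensitive, as suggested above.
        Map<String, Object> params =
            Collections.<String, Object>singletonMap("regex", "(?i).*\\b" + criteria + "\\b.*");
        try (Transaction tx = db.beginTx()) { // read queries still run inside a transaction in 2.x
            ExecutionResult result = engine.execute(query, params);
            long c = (Long) result.columnAs("c").next();
            tx.success();
            return c;
        }
    }
}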
Regarding your questions:
First of all, how do I tell if my query was cached?
Your query will be cached if you use parameters (for the regexps); there is a configurable query-cache size (defaulting to 100 queries).
Is there a query size where Neo4j doesn't bother caching the query?
Neo4j currently caches all queries that come in regardless of size
My queries are read-only, do I still have to wrap my calls with a transaction?
Cypher will create its own transaction. In general, read transactions are mandatory. For Cypher, you need outer transactions if you want multiple queries to participate in the same transaction scope.
is there an added cost to running multiple small queries rather than one large one? What if I kicked off 10 threads? Would there be enough of a performance increase to make this worth while?
It depends: smaller queries are executed more quickly (if they touch less of the total dataset), but you have to combine their results in the client.
If they touch the same nodes, you do double work.
For bigger queries you have to watch out for cross products or exponential path explosions.
Regarding running smaller queries with many threads:
Good question. It should be faster, but there are currently some bottlenecks that we're about to remove. Just try it out.
Is there a way to use parameters when using PROFILE in the Neo4j console?
You can use shell variables for that, with export name=value, and list them with env
e.g.
export name=Lisa
profile match (n:User {name:{name}}) return n;
How effective is the Neo4j browser for comparing times it takes to execute a query?
The browser measures the complete round trip, with potentially more data loading, so its timing is not very accurate.
Warmup
Running the exact queries would make sense.
You don't have to return data; it is enough to return count(*), but you should access the properties you want warmed up to make sure they are loaded.
