I have more than 3k rows of data to retrieve from Cassandra using an API. I have indexing on it, but it's causing a connection reset issue [closed] - cassandra

I have more than 3k rows of data to retrieve from Cassandra using an API. I have an index on the table, but it is still causing a connection reset.
Should I look at another database for this?
Is there a workaround in Cassandra?
Will providing a limit or a filter between dates in the query help?
(That would put a restriction on the API; is that standard practice?)

So there's a lot missing here that is needed to help diagnose what is going on. Specifically, it'd be great to see the underlying table definition and the actual CQL query that the API is trying to run.
Without that, I can say that to me, it sounds like the API is trying to aggregate the 3000 rows from multiple partitions with a specific date range in the cluster (and is probably using the ALLOW FILTERING directive to accomplish this). Most multi-partition queries will time out, simply because of all the extra network time introduced while polling each node in the cluster.
As with all queries in Cassandra, a table needs to be built to support a specific query. If it's not, this is generally what happens.
Will providing a limit or a filter between dates in the query help?
Yes, breaking this query up into smaller pieces will help. If you can look at the underlying table definition, that might give you a clue as to the right way to properly query the table. But in this case, making 10 queries for 300 rows probably has a higher chance for success than 1 query for 3000 rows.
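To make that concrete, here is a minimal sketch of splitting one big date-range read into several small single-partition queries. It assumes a hypothetical events table partitioned by (sensor_id, event_date) and the Python cassandra-driver; names and schema are illustrative, not taken from the question.

```python
# Sketch only: hypothetical table events(sensor_id, event_date, event_time, payload)
# partitioned by (sensor_id, event_date); requires the cassandra-driver package.
from datetime import date, timedelta

from cassandra.cluster import Cluster

cluster = Cluster(["127.0.0.1"])
session = cluster.connect("my_keyspace")

# Prepared statement that targets a single partition per execution,
# instead of one huge multi-partition query with ALLOW FILTERING.
stmt = session.prepare(
    "SELECT event_time, payload FROM events "
    "WHERE sensor_id = ? AND event_date = ?"
)

def fetch_range(sensor_id, start, end):
    """Issue one small query per day, so no single request has to poll
    every node in the cluster or pull thousands of rows at once."""
    rows = []
    day = start
    while day <= end:
        rows.extend(session.execute(stmt, (sensor_id, day)))
        day += timedelta(days=1)
    return rows

results = fetch_range("sensor-42", date(2024, 1, 1), date(2024, 1, 10))
```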

Related

Which compaction strategy is recommended for a table with minimal updates [closed]

I am looking for a compaction strategy for data with the following characteristics:
We don't need the data after 60-90 days; in extreme scenarios, maybe 180 days.
Ideally only inserts happen and updates never do, but it is realistic to expect duplicate events, which cause updates.
It is indirectly time-series data if you think about it: events that come first are stored first, and once an event is stored it is almost never modified unless duplicate events are published.
Which strategy would be best for this case?
TimeWindowCompactionStrategy is only suitable for time-series use cases, and that is the only reason you'd choose TWCS.
LeveledCompactionStrategy fits only a very limited set of edge cases, and the time I spend helping users troubleshoot LCS because it doesn't suit their needs is hardly worth the supposed benefits.
Unless you have some very specific requirements, SizeTieredCompactionStrategy is almost always the right choice, which is the reason it is the default compaction strategy. Cheers!
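For reference, here is a minimal sketch that follows this recommendation: it keeps the default SizeTieredCompactionStrategy and handles the 60-90 day retention with default_time_to_live. The keyspace, table name, schema, and the use of the Python cassandra-driver are all assumptions for illustration.

```python
# Sketch only: hypothetical keyspace/table; requires the cassandra-driver package.
from cassandra.cluster import Cluster

session = Cluster(["127.0.0.1"]).connect("my_keyspace")

# Leaving compaction unset keeps the default SizeTieredCompactionStrategy,
# as recommended above. default_time_to_live = 7776000 seconds is 90 days,
# covering the "data not needed after 60-90 days" requirement.
session.execute("""
    CREATE TABLE IF NOT EXISTS events (
        source_id  text,
        event_time timestamp,
        payload    text,
        PRIMARY KEY ((source_id), event_time)
    ) WITH default_time_to_live = 7776000
""")

# Only if the workload really were strict, append-only, time-bucketed reads
# would TWCS be worth setting explicitly instead:
# ... WITH compaction = {'class': 'TimeWindowCompactionStrategy',
#                        'compaction_window_unit': 'DAYS',
#                        'compaction_window_size': 7}
```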

Right way to design table(s) on Azure Table Storage [closed]

I am not looking for code showing how to create a table on Azure Table Storage; I am looking for design guidance.
I have a WebJob that runs a long-running process. Each job has more than one task. Each task takes n minutes to complete.
In order to have some visibility into the tasks, I am adding one row per task in a table named "TaskDetails".
If I also want to save job-related information, is it better to repeat the job details in the TaskDetails table, or to create a separate Jobs table and have JobId as one of the fields in the TaskDetails class?
I do not believe there is a way to join multiple Azure Tables, so I am a little confused about the design.
Azure Storage Tables are NOT relational. So, no, you can't do joins - there are no relations. Proper planning for writing to and reading from Azure tables is critical. Really, the number of tables you use to organize your data is irrelevant. Using Azure tables efficiently requires knowing two values: the PartitionKey and the RowKey. If you know those two values, you can access any table value quickly and efficiently. If you search by any other field, you will find yourself iterating over every item in the table, and queries will drag on forever.
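As a rough illustration of that point, here is a sketch that stores one TaskDetails entity per task with JobId as the PartitionKey and TaskId as the RowKey, so per-job reads stay within a single partition. The azure-data-tables package, table name, connection string, and all field names are assumptions, not part of the original question.

```python
# Sketch only: hypothetical "TaskDetails" table; requires the azure-data-tables package.
from azure.data.tables import TableServiceClient

service = TableServiceClient.from_connection_string(conn_str="<connection-string>")
tasks = service.get_table_client("TaskDetails")

# One entity per task, keyed so all tasks of a job share a partition.
tasks.create_entity({
    "PartitionKey": "job-001",    # JobId
    "RowKey": "task-003",         # TaskId
    "Status": "Running",
    "JobName": "nightly-import",  # denormalized job detail, repeated per task
})

# Fast: point lookup by PartitionKey + RowKey.
one_task = tasks.get_entity(partition_key="job-001", row_key="task-003")

# Still efficient: all tasks for a job live in a single partition.
job_tasks = tasks.query_entities("PartitionKey eq 'job-001'")

# Slow: filtering on a non-key field scans the whole table.
running = tasks.query_entities("Status eq 'Running'")
```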

ArangoDB: what's the better way to perform queries? [closed]

What's better for retrieving complex data from ArangoDB: a big query with all collection joins and graph traversals, or multiple queries, one for each piece of data?
I think it depends on several aspects, e.g. the operation(s) you want to perform, the scenario in which the query (or queries) should be executed, or whether you favor performance over maintainability.
AQL provides the ability to write a single non-trivial query that might span the entire dataset and perform complex operation(s). Breaking a big query into multiple smaller ones might improve maintainability and code readability, but on the other hand, separate queries for each piece of data can have a negative performance impact in the form of network latency associated with each request. One should also consider whether the scenario allows working with partial results returned from the database while the next batch of queries is being processed.
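As a rough sketch of the two approaches, under the assumption of hypothetical users and orders collections and the python-arango driver (none of which come from the question):

```python
# Sketch only: collection names and AQL are illustrative; requires python-arango.
from arango import ArangoClient

db = ArangoClient(hosts="http://localhost:8529").db(
    "shop", username="root", password="pass"
)

# Option 1: one non-trivial AQL query that joins everything on the server side.
big = db.aql.execute(
    """
    FOR u IN users
      FILTER u.country == @country
      LET user_orders = (FOR o IN orders FILTER o.user == u._key RETURN o)
      RETURN { user: u, orders: user_orders }
    """,
    bind_vars={"country": "DE"},
)

# Option 2: several small queries; each round trip adds network latency,
# but partial results can be used while the rest is still being fetched.
users = list(db.aql.execute(
    "FOR u IN users FILTER u.country == @country RETURN u",
    bind_vars={"country": "DE"},
))
for u in users:
    u["orders"] = list(db.aql.execute(
        "FOR o IN orders FILTER o.user == @key RETURN o",
        bind_vars={"key": u["_key"]},
    ))
```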

How can search results be cached? [closed]

How do I implement a caching mechanism for search results, as on Stack Overflow?
How do Elasticsearch and Lucene deal with caching?
As of now, you can cache in two different ways within Elasticsearch:
Filter cache - If you offload as many constraints as possible that don't take part in scoring the results into filters, you get segment-level caches for those filters alone. This, together with the warmer API, provides a decent amount of in-memory caching for the applied filters. (Note that warmers have been removed as of Elasticsearch 5.4+; there have been significant improvements to the index that make warmers unnecessary.)
Shard request cache (previously called the shard query cache) - You can cache the results (other than hits) at the query level. This is a fairly new feature and should provide a good amount of caching, but _source still needs to be fetched from the shards.
Within Elasticsearch you can exploit these features to attain a good amount of caching.
You can also explore caching options external to Elasticsearch, such as memcached or other in-memory caches.
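As an illustration, here is a hedged sketch using the Elasticsearch Python client (the index name, fields, and 8.x-style keyword arguments are assumptions): constraints that don't affect scoring go into the bool filter clause so they can be cached, and request_cache=True opts the request into the shard request cache.

```python
# Sketch only: "articles" index and its fields are made up; requires the
# elasticsearch Python client.
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

resp = es.search(
    index="articles",
    query={
        "bool": {
            "must": [{"match": {"title": "caching"}}],          # scored
            "filter": [                                          # cacheable, not scored
                {"term": {"status": "published"}},
                # rounded date math ("/d") keeps the clause cacheable
                {"range": {"published_at": {"gte": "now-30d/d"}}},
            ],
        }
    },
    size=0,               # aggregations/counts only, no hits
    request_cache=True,   # opt in to the shard request cache
)
print(resp["hits"]["total"])
```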

How slow is a call to a local database? [closed]

In general, say you have a small (<16 MB) table in a database running on the same machine as your server. If you need to do lots of querying into this table (100% reads), is it better to:
Get the entire table and do all the searching/querying in the server code.
Make lots of queries into the local database.
If the database is local, can I take advantage of the DBMS's highly efficient internal data structures for querying, or is the delay such that it's faster to map the tables returned by the database into my own data structures?
Thanks.
This is going to depend heavily on what kind of searches you're doing.
If your data is all ID lookups, it's probably faster to have it in RAM.
If your data is all full scans (no indexes), it's probably faster to have it in RAM.
If your data uses indexes, it's probably faster to have it in the DB.
Of course, much of the appeal of a database is indexes and a common query interface, so you have to weigh how valuable those are versus raw speed.
There's no way to really answer this without knowing exactly the nature of the data and queries to be done on it. Over-the-wire time has its cost, as does BSON <-> native marshalling, but indexed searches can be O(log n) as opposed to a dumb O(n) (or worse) search over a simple in-memory data structure.
Have you tried benchmarking?
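If you do want to benchmark it, a toy sketch along these lines is a reasonable starting point; the SQLite table, row count, and lookup pattern are arbitrary placeholders, purely for illustration.

```python
# Sketch only: compares an indexed local-DB lookup against an in-process dict lookup.
import sqlite3
import timeit

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE items (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany(
    "INSERT INTO items VALUES (?, ?)",
    [(i, f"name-{i}") for i in range(100_000)],
)
conn.commit()

# Same data loaded once into a plain dict for in-memory lookups.
in_memory = {i: f"name-{i}" for i in range(100_000)}

def via_db():
    cur = conn.execute("SELECT name FROM items WHERE id = ?", (54_321,))
    return cur.fetchone()[0]

def via_dict():
    return in_memory[54_321]

print("db lookup:  ", timeit.timeit(via_db, number=10_000))
print("dict lookup:", timeit.timeit(via_dict, number=10_000))
```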
