solr-spark: how to increase data read speed? - apache-spark

I am using spark-solr to fetch two or three attributes (id and date attributes) from Solr, but it takes tens of seconds to fetch hundreds of thousands of documents.
My Solr collections have around 10 shards, each with 4 replicas, and they contain between ten million and a hundred million documents.
Regarding the Lucidworks spark-solr connector, I set rows to 10000 and splits to true.
Is this the expected behavior? (I mean, is Solr inherently slow at fetching data?) Or could you help me understand how to configure Solr and this Lucidworks connector to increase the fetch speed? I have hardly found any answers on the internet.
Thank you for your help :)
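
For reference, here is roughly how those options are passed to the connector in PySpark; this is a minimal sketch assuming the Lucidworks spark-solr package is on the classpath, and the ZooKeeper hosts, collection, and field names are placeholders:

    # Minimal sketch of a spark-solr read; all connection details and field
    # names below are placeholders for your environment.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("solr-read").getOrCreate()

    df = (spark.read.format("solr")
          .option("zkhost", "zk1:2181,zk2:2181/solr")  # placeholder ZooKeeper ensemble
          .option("collection", "my_collection")        # placeholder collection name
          .option("query", "*:*")
          .option("fields", "id,date_field")            # fetch only the needed attributes
          .option("rows", "10000")                      # page size per shard request
          .option("splits", "true")                     # split each shard's read into parallel tasks
          .load())

    print(df.count())

One connector-level detail worth checking: when every requested field has docValues enabled, spark-solr can stream results through Solr's /export handler, which is usually far faster than deep paging through /select.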

Related

How to run an incremental query

I am using Cassandra 3.10 and the DataStax driver 3.1.4.
I would like to run a query that returns data in sets of, say, 10000 records until the full dataset has been processed; the aim is to be memory efficient.
You can page the data in most drivers. For your query you specify a fetch size; when you reach the last fetched row in your result set, the driver automatically fetches the next fetchsize rows.
Everything you need to know about the DataStax Java driver's paging is well documented here: https://docs.datastax.com/en/developer/java-driver/2.1/manual/paging/
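
As an illustration, here is what that looks like in code; this is a minimal sketch using the DataStax Python driver (the Java driver's setFetchSize behaves the same way), with placeholder contact point, keyspace, table, and column names:

    # Transparent driver-side paging: fetch_size rows per round trip, and the
    # driver fetches the next page automatically as you iterate.
    from cassandra.cluster import Cluster
    from cassandra.query import SimpleStatement

    session = Cluster(["127.0.0.1"]).connect("my_keyspace")  # placeholder

    stmt = SimpleStatement("SELECT id, payload FROM records", fetch_size=10000)
    for row in session.execute(stmt):
        print(row.id)  # placeholder for your per-row processing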
If you know the size of your dataset in advance (e.g. 10k records), the best you can do is design your tables around that dataset size, e.g. create a table and organize your data a priori into partitions of 10k records each.
This basically aims at following the rule "model around your queries."
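
A hypothetical sketch of that bucketing idea (all names invented): derive a bucket number from each record's position so that one partition holds exactly one 10k-record page, and a page read becomes a single-partition query:

    # Bucketed table: partition key = bucket number, so one partition = one page.
    from cassandra.cluster import Cluster

    session = Cluster(["127.0.0.1"]).connect()  # placeholder; assumes keyspace exists

    session.execute("""
        CREATE TABLE IF NOT EXISTS my_keyspace.records_by_bucket (
            bucket  int,    -- e.g. record_number // 10000
            seq     int,    -- position within the bucket
            payload text,
            PRIMARY KEY ((bucket), seq)
        )
    """)

    # Reading "page 3" of the dataset is then one cheap partition scan:
    rows = session.execute(
        "SELECT payload FROM my_keyspace.records_by_bucket WHERE bucket = %s", (3,)
    )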

Will Cassandra be useful for this scenario

I have around 10 million names, a combination of about 5 files each containing 2 million names, and there are hundreds of users. Each user comes with a file of 1 million numbers.
I need to process these 1 million numbers against the 2 million names, generate the values, and show the values with the names to the user.
Would Cassandra be a good choice here?
Currently I'm using SQL with RoR, but it's quite slow in returning the values.
Cassandra is a NoSQL database, not an RDBMS, so in case you don't know: there are no joins in Cassandra.
If your data is coming back slowly because of I/O, then Cassandra is definitely a good choice.
However, if your results are slow because of a join, then Cassandra cannot help you, because, as I said, there are no joins in Cassandra.
Now, coming to your requirement: more information is needed to form an opinion, such as when you want to process the data to create the values (in batch, or on the fly), and how many records you want to pull and show to the user at a time.
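
To make the "no joins" point concrete, here is a hypothetical sketch of modeling around the query instead: the processing job precomputes each user's (name, value) pairs and writes them into a table keyed by the user, so showing results is a single partition read with no join (every keyspace, table, and column name here is invented):

    # Denormalized results table: one partition per user, no read-time join.
    from cassandra.cluster import Cluster

    session = Cluster(["127.0.0.1"]).connect()  # placeholder; assumes keyspace "demo" exists

    session.execute("""
        CREATE TABLE IF NOT EXISTS demo.results_by_user (
            user_id text,
            name    text,
            value   double,
            PRIMARY KEY ((user_id), name)
        )
    """)

    # The processing job (batch or on the fly) writes precomputed values...
    session.execute(
        "INSERT INTO demo.results_by_user (user_id, name, value) VALUES (%s, %s, %s)",
        ("user42", "Alice", 0.87),
    )

    # ...and serving a user is one partition lookup:
    rows = session.execute(
        "SELECT name, value FROM demo.results_by_user WHERE user_id = %s", ("user42",)
    )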

Solr: improve search speed

In Solr search, how do I optimize to improve search speed? I tried different cache mechanisms, but they did not help. We are searching 65 million records using Solr, and a search takes approximately 45 seconds; I want it to take approximately 5-10 seconds. So, friends, please suggest how I can reduce the search time.
I am using Apache Solr (ver. 5.2.1).
You can create multiple cores and split your data across them. As the data is divided between the cores, each search is limited to one core and its smaller index, which can improve your search speed.
In my case I have data of different categories, so I created a core for each category, named after the category. When a search request comes in for a category, the request is made only to that category's core.
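
As a sketch of that routing (host, port, and core names are placeholders), each category search only hits its own core's HTTP endpoint:

    # Route a search to a single per-category core over Solr's HTTP API.
    import requests

    def search_category(category_core, user_query):
        # Only this core's (smaller) index is searched.
        resp = requests.get(
            f"http://localhost:8983/solr/{category_core}/select",
            params={"q": user_query, "rows": 10, "wt": "json"},
        )
        resp.raise_for_status()
        return resp.json()["response"]["docs"]

    docs = search_category("books", "title:cassandra")  # placeholder core and query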
The second approach is sharding, which again splits the data, this time across shards, with each shard holding part of the index data.
When the data is too large for one node, you can break it up and store it in sections by creating one or more shards. Each shard is a portion of the logical index (or core): the set of all nodes containing that section of the index.
It is highly recommended that you use SolrCloud when you need to scale up or scale out.
Below are links that will help you with SolrCloud:
https://cwiki.apache.org/confluence/display/solr/Distributed+Search+with+Index+Sharding
https://cwiki.apache.org/confluence/display/solr/Shards+and+Indexing+Data+in+SolrCloud
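
For instance, a sharded SolrCloud collection can be created through the Collections API; this is a minimal sketch, and the collection name, shard/replica counts, and host are placeholders:

    # Create a collection split across 4 shards with 2 replicas each.
    import requests

    resp = requests.get(
        "http://localhost:8983/solr/admin/collections",
        params={
            "action": "CREATE",
            "name": "big_collection",   # placeholder collection name
            "numShards": 4,             # index is split across 4 shards
            "replicationFactor": 2,     # 2 copies of each shard
            "wt": "json",
        },
    )
    resp.raise_for_status()
    print(resp.json())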

How many partitions for a Cassandra table?

In a customer table, customerid is the partition key.
Suppose I get 1 million customers in a year, so I have 1 million partitions.
After 10 years I will have 10 million customers or more, so I will have 10 million partitions.
So my question is:
1) If I want to read the customers table (10 million partitions), does that affect read performance?
Note: a single partition may have 50 to 100 columns.
You have the right idea in that you'll want to use data modeling to create a multi-tenant environment. The caveat is that you're not going to want to do full-table/multi-partition scans in Cassandra to retrieve that data. It's pretty well documented as to why, but any time you have a highly distributed environment, you will want to minimize network hops, data shuffling, etc. Can't fight physics :)
Anyway, this sounds like a reporting type of use case, so you're going to need something like Spark, or some kind of map and reduce, to report efficiently on multiple partitions like this.
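
As a minimal sketch of that Spark approach (assuming the DataStax spark-cassandra-connector package is available; keyspace, table, and column names are placeholders):

    # Scan many partitions with Spark rather than through a single coordinator;
    # the connector splits the scan by token range across executors.
    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .appName("customer-report")
             .config("spark.cassandra.connection.host", "127.0.0.1")  # placeholder
             .getOrCreate())

    customers = (spark.read.format("org.apache.spark.sql.cassandra")
                 .options(keyspace="crm", table="customers")  # placeholders
                 .load())

    customers.groupBy("signup_year").count().show()  # placeholder reporting query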

Cassandra multi row selection

Somewhere I heard that multi-row selection in Cassandra is bad because it runs a new query for each row selected, so if I want to fetch 1000 rows at once it would be the same as running 1000 separate queries. Is that true?
And if it is, how bad would it be to keep selecting around 50 rows each time a page is loaded if, say, I have 1000 page views in a single minute? Would that severely slow Cassandra down?
P.S. I'm using PHPCassa for my project.
Yes, running a query for 1000 rows is the same as running 1000 queries (if you use the recommended RandomPartitioner). However, I wouldn't be overly concerned by this: in Cassandra, querying for a row by its key is a very common, very fast operation.
As to your second question, it's difficult to tell ahead of time. Build it and test it. Note that Cassandra does use in-memory caching, so if you keep querying the same rows, they will be cached.
We are using PlayOrm for Cassandra, and there is a "findAll" pattern there which provides support for fetching many rows quickly. Visit
https://github.com/deanhiller/playorm/wiki/Support-for-retrieving-many-entities-in-parallel for more details.
1) I have debugged the Cassandra code base a bit, and from my observation, to query multiple rows at the same time Cassandra provides the multiget() functionality, which is also exposed in PHPCassa.
2) Multiget is optimized to handle batch requests and saves network hops (instead of 1k round trips for 1k rows, it makes one batch request, so it definitely saves the time of 999 round trips).
3) More about multiget() in phpcassa: php cassa multiget()
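
PHPCassa's multiget() sits on the old Thrift API; for comparison, a modern CQL driver gets a similar round-trip saving by running the per-key lookups concurrently. A minimal sketch with the DataStax Python driver (keyspace, table, and column names are invented):

    # Fetch ~50 rows for one page view with concurrent, prepared lookups.
    from cassandra.cluster import Cluster
    from cassandra.concurrent import execute_concurrent_with_args

    session = Cluster(["127.0.0.1"]).connect("my_keyspace")  # placeholder
    stmt = session.prepare("SELECT id, payload FROM rows_by_id WHERE id = ?")

    keys = [(k,) for k in range(50)]  # placeholder row keys for one page
    for success, result in execute_concurrent_with_args(session, stmt, keys, concurrency=50):
        if success:
            for row in result:
                print(row.id, row.payload)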
