I have a row of counters and I want to get its columns sorted by value. Are there any strategies or data models for that?
I'm afraid there is no way of having Cassandra do this for you; you will need to get the entire row from Cassandra (paging for large rows) and sort it in the client.
You could use a periodic MapReduce job to do this for you, and cache the result of the job back into Cassandra, if your solution can cope with results that are not fully up to date.
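For illustration, here is a minimal sketch of the client-side approach using the Python cassandra-driver; the keyspace, table, and column names are hypothetical.

from cassandra.cluster import Cluster
from cassandra.query import SimpleStatement

# Hypothetical counter table, one wide row per page:
#   CREATE TABLE stats.page_counters (
#       page_id text, item text, hits counter,
#       PRIMARY KEY (page_id, item));
cluster = Cluster(["127.0.0.1"])
session = cluster.connect("stats")

# fetch_size makes the driver page through the wide row
# instead of pulling every cell in a single round trip.
query = SimpleStatement(
    "SELECT item, hits FROM page_counters WHERE page_id = %s",
    fetch_size=5000)
rows = session.execute(query, ("home",))

# Cassandra returns cells in clustering order (by item),
# so sorting by counter value has to happen on the client.
top_items = sorted(rows, key=lambda r: r.hits, reverse=True)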
We know that in SQL, an index can be created on a column if it is frequently used for filtering. Is there anything similar I can do in Spark? Let's say I have a big table T containing a column C that I want to filter on. I want to filter tens of thousands of id sets on the column C. Can I sort/orderBy column C, cache the result, and then filter all the id sets against the sorted table? Will that help, like an index would in SQL?
You should absolutely build the table/dataset/dataframe with a sorted id if you will query on it often. It will help predicate pushdown and, in general, give a boost in performance.
When executing queries in the most generic and basic manner, filtering happens very late in the process. Moving filtering to an earlier phase of query execution provides significant performance gains by eliminating non-matches earlier, and therefore saving the cost of processing them at a later stage. This group of optimizations is collectively known as predicate pushdown.
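To make the sorted-id advice concrete, here is a minimal PySpark sketch; the paths, partition count, and the column name C are placeholders, and it assumes the data is stored as Parquet so that per-row-group min/max statistics can be used for skipping.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

t = spark.read.parquet("/data/T")                  # hypothetical source table

# Range-partition and sort by C before writing, so each Parquet file /
# row group covers a narrow range of C. Parquet keeps min/max stats per
# row group, which lets the pushed-down filter skip whole chunks.
(t.repartitionByRange(200, "C")
  .sortWithinPartitions("C")
  .write.mode("overwrite")
  .parquet("/data/T_sorted_by_C"))

sorted_t = spark.read.parquet("/data/T_sorted_by_C")

ids = [101, 202, 303]                              # one of the many id sets
matches = sorted_t.where(F.col("C").isin(ids))
matches.explain()                                  # shows PushedFilters in the scan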
Even if you aren't sorting the data, you may want to look at storing it with 'distribute by' or 'cluster by'. This is very similar to repartition(), and again it only boosts performance if you query the data along the same lines you distributed it.
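As a sketch of the 'cluster by' option (table and column names hypothetical): CLUSTER BY is shorthand for DISTRIBUTE BY plus SORT BY, roughly the SQL equivalent of repartition() followed by sortWithinPartitions().

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Rewrite T distributed and sorted by C; rows with the same C end up
# together, which only pays off if later queries also filter/group on C.
spark.sql("""
    CREATE TABLE t_by_c
    USING parquet
    AS SELECT * FROM T CLUSTER BY C
""")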
If you intend to re-query often then yes, you should cache the data, but in general there are no indexes. (There are file formats that help boost performance if you have specific query needs, e.g. row-based vs. columnar.)
You should also look at the Spark-specific performance tuning options. Adaptive Query Execution is a newer feature that helps boost performance without indexes.
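A short sketch of those two points, assuming Spark 3.x; the path is a placeholder.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# Adaptive Query Execution (Spark 3.x) re-optimizes shuffle partition
# counts and join strategies at runtime from actual statistics.
spark.conf.set("spark.sql.adaptive.enabled", "true")

# If the same dataset is queried over and over, cache it once.
t = spark.read.parquet("/data/T_sorted_by_C")
t.cache()
t.count()                          # materializes the cache
t.where(F.col("C") == 42).show()   # subsequent queries hit the cached data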
If you are working with Hive (note that Hive has its own version of partitions), then depending on how you will query the data you may also want to look at partitioning:
[hive] Partitioning is mainly helpful when we need to filter our data based on specific column values. When we partition tables, subdirectories are created under the table’s data directory for each unique value of a partition column. Therefore, when we filter the data based on a specific column, Hive does not need to scan the whole table; it rather goes to the appropriate partition, which improves the performance of the query. Similarly, if the table is partitioned on multiple columns, nested subdirectories are created based on the order of partition columns provided in our table definition.
Hive partitioning is not a magic bullet and will slow down querying if the pattern of accessing the data is different from the partitioning. It makes a lot of sense to partition by month if you write a lot of queries looking at monthly totals. If, on the other hand, the same table was used to look at sales of product 'x' from the beginning of time, it would actually run slower than if the table weren't partitioned. (It's a tool in your tool shed.)
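Continuing the monthly-totals example, here is a minimal PySpark sketch of a month-partitioned table; the table, path, and column names are made up.

from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

sales = spark.read.parquet("/data/sales_raw")      # hypothetical source

# One subdirectory per distinct month; a filter on 'month' reads only the
# matching partitions, while a filter only on 'product' still scans them all.
(sales.write
      .mode("overwrite")
      .partitionBy("month")
      .saveAsTable("sales_by_month"))

spark.sql("SELECT SUM(amount) FROM sales_by_month WHERE month = '2021-01'").show()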
Another Hive-specific tip:
The other thing you want to think about is keeping your table statistics. The cost-based optimizer uses those statistics to plan your queries, so you should make sure to keep them up to date. (Re-run after roughly 30% of your data has changed.)
ANALYZE TABLE [db_name.]tablename [PARTITION(partcol1[=val1], partcol2[=val2], ...)]
  COMPUTE STATISTICS
  [FOR COLUMNS]      -- (Note: Hive 0.10.0 and later.)
  [CACHE METADATA]   -- (Note: Hive 2.1.0 and later.)
  [NOSCAN];
-- (Note: Fully qualified table names are supported since Hive 1.2.0, see HIVE-10007.)
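For example (table and column names hypothetical), the same kind of statement can also be issued from Spark with Hive support enabled:

from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

# Refresh stats for one partition of the hypothetical sales_by_month table ...
spark.sql("ANALYZE TABLE sales_by_month PARTITION(month='2021-01') COMPUTE STATISTICS")

# ... and column-level stats the cost-based optimizer can use for planning.
spark.sql("ANALYZE TABLE sales_by_month COMPUTE STATISTICS FOR COLUMNS product, amount")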
I have two tables in my database. Each table has 100 million rows.
Is there a way to join these two tables and extract data using Apache Spark in the fastest way?
I would say the most efficient way would be to use DataFrames and call join, followed by any other criteria. The benefit is that certain filters or selections will be pushed down as far as possible to cut down your network load; only the data that is needed will be pulled.
Without more information, that is the best suggestion I can give.
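A minimal sketch of that approach, assuming two Parquet tables sharing a customer_id column; all names and paths are placeholders.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

orders = spark.read.parquet("/data/orders")           # ~100M rows
customers = spark.read.parquet("/data/customers")     # ~100M rows

result = (orders
          .select("customer_id", "amount")             # project early: read only needed columns
          .where(F.col("amount") > 100)                # filter early: pushed down to the scan
          .join(customers.select("customer_id", "country"), "customer_id")
          .groupBy("country")
          .agg(F.sum("amount").alias("total")))

result.explain()    # check that filters/projections were pushed below the join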
I have a situation where I have a large partition/row with many cells/values. I need to query this row for all the cells sorted by a value (one of the keys). This sort value is dynamic and changes often. You can't update any of the primary key columns in Cassandra because that changes how the data is stored. So, how do I do this? Does Cassandra not support queries where the sort order can change at any given moment?
Cassandra does not support queries where the sort order can change at any given moment. You can sort on the client or by using additional tools like Spark.
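For the Spark option, a minimal sketch using the DataStax spark-cassandra-connector (assuming it is on the classpath; keyspace, table, and column names are placeholders):

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

df = (spark.read
        .format("org.apache.spark.sql.cassandra")
        .options(keyspace="myks", table="wide_rows")
        .load())

# Cassandra only returns the partition in its fixed clustering order,
# so the sort on the dynamic value happens in Spark instead.
(df.where(F.col("partition_id") == "p1")
   .orderBy(F.col("score").desc())
   .show())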
We have a table with 15 million records on a 10-node Cassandra cluster. One of the columns has close to 20 distinct, repeated values. Is it advisable to build a secondary index on this column?
Assuming a completely uniform distribution on that column, each column value would map to 750,000 rows. Now, while the DataStax doc on When To Use An Index states that...
built-in indexes are best on a table having many rows that contain the indexed value.
750,000 rows certainly qualifies as "many." But even given that, remember that you're also talking about 14,250,000 rows that Cassandra has to ignore when fulfilling your query.
Also, unless you have an RF of 10 (and I doubt that you would with 10 nodes), you are going to incur network time as Cassandra works between all of the different nodes required to fulfill your query. For 750,000 rows, that's probably going to time out.
The only way I think this could be efficient, would be to first restrict your query by a partition key. Using the secondary index while also restricting with a partition key will help Cassandra find your rows more quickly. Even so, with a dataset that big, I would re-evaluate your data model and try to figure out a different table to fulfill that query without requiring a secondary index.
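A sketch of that partition-key-plus-index pattern with the Python driver; the schema and values are made up for illustration.

from cassandra.cluster import Cluster

cluster = Cluster(["127.0.0.1"])
session = cluster.connect("shop")

session.execute("""
    CREATE TABLE IF NOT EXISTS orders_by_day (
        day      text,
        order_id timeuuid,
        status   text,      -- the low-cardinality column (~20 values)
        amount   decimal,
        PRIMARY KEY (day, order_id))
""")
session.execute("CREATE INDEX IF NOT EXISTS ON orders_by_day (status)")

# Restricting by the partition key first keeps the index lookup local to the
# replicas owning that partition instead of fanning out across the cluster.
rows = session.execute(
    "SELECT order_id, amount FROM orders_by_day WHERE day = %s AND status = %s",
    ("2021-01-15", "SHIPPED"))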
Cassandra doesn't have capped collections (or row size limits), but one way of simulating them is to use an offline MapReduce job to clean up extra entries. Would it be better to have a second table that stores row counts for primary keys in another table? The downside is that you have to scan through the entire row_count table, since counters aren't indexable. Or would it be faster to just scan over the backing table with the real data?
Or is there another technique I should look into?
Edit: I found this: Columns count vs counter column performance. Counting the rows goes over all the data, so I'm leaning away from that.
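For reference, a minimal sketch of the "second table with row counts" idea from the question (all names are hypothetical); it also shows why the clean-up job still has to scan, since the counter can't be indexed.

from cassandra.cluster import Cluster

cluster = Cluster(["127.0.0.1"])
session = cluster.connect("app")

# Counter table keyed the same way as the backing data table.
session.execute("""
    CREATE TABLE IF NOT EXISTS row_counts (
        row_key text PRIMARY KEY,
        n       counter)
""")

# Bump the counter on every write to the backing table ...
session.execute("UPDATE row_counts SET n = n + 1 WHERE row_key = %s", ("user:42",))

# ... and let the offline job scan the counts to find rows over the cap.
for row in session.execute("SELECT row_key, n FROM row_counts"):
    if row.n > 10000:      # hypothetical cap
        pass               # trim the oldest cells from the backing table here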