I have a racesByID table. I also need to find the races by year. What are the pros and cons of using a secondary index on year column over creating a racesByYear table?
It depends on how many race you have per year,
Secondary indexes doesn't perform well if you have low or very high carnality, you will have performance issues in those case. for details check this link : https://docs.datastax.com/en/cql/3.3/cql/cql_using/useWhenIndex.html#useWhenIndex__when-no-index
In most cases separates tables will perform better, the cons is that you have to manage the consistency and keep table in sync
I hope this help.
Related
We know in SQL, an index can be created on a column if it is frequently used for filtering. Is there anything similar I can do in spark? Let's say I have a big table T containing a column C I want to filter on. I want to filter 10s of thousands of id sets on the column C. Can I sort/orderBy column C, cache the result, and then filter all the id sets with the sorted table? Will it help like indexing in SQL?
You should absolutely build the table/dataset/dataframe with a sorted id if you will query on it often. It will help predicate pushdown. and in general give a boost in performance.
When executing queries in the most generic and basic manner, filtering
happens very late in the process. Moving filtering to an earlier phase
of query execution provides significant performance gains by
eliminating non-matches earlier, and therefore saving the cost of
processing them at a later stage. This group of optimizations is
collectively known as predicate pushdown.
Even if you aren't sorting data you may want to look at storing the data in file with 'distribute by' or 'cluster by'. It is very similar to repartitionBy. And again only boosts performance if you intend to query the data as you have distributed the data.
If you intend to requery often yes, you should cache data, but in general there aren't indexes. (There are file types that help boost performance if you have specific query type needs. (Row based/columnar based))
You should look at the Spark Specific Performance tuning options. Adaptive query is a next generation that helps boost performance, (without indexes)
If you are working with Hive: (Note they have their own version of partitions)
Depending on how you will query the data you may also want to look at partitioning or :
[hive] Partitioning is mainly helpful when we need to filter our data based
on specific column values. When we partition tables, subdirectories
are created under the table’s data directory for each unique value of
a partition column. Therefore, when we filter the data based on a
specific column, Hive does not need to scan the whole table; it rather
goes to the appropriate partition which improves the performance of
the query. Similarly, if the table is partitioned on multiple columns,
nested subdirectories are created based on the order of partition
columns provided in our table definition.
Hive Partitioning is not a magic bullet and will slow down querying if the pattern of accessing data is different than the partitioning. It make a lot of sense to partition by month if you write a lot of queries looking at monthly totals. If on the other hand the same table was used to look at sales of product 'x' from the beginning of time, it would actually run slower than if the table wasn't partitioned. (It's a tool in your tool shed.)
Another hive specific tip:
The other thing you want to think about, and is keeping your table stats. The Cost Based Optimizer uses those statistics to query your data. You should make sure to keep them up to date. (Re-run after ~30% of your data has changed.)
ANALYZE TABLE [db_name.]tablename [PARTITION(partcol1[=val1], partcol2[=val2], ...)] -- (Note: Fully support qualified table name
since Hive 1.2.0, see HIVE-10007.)
COMPUTE STATISTICS
[FOR COLUMNS] -- (Note: Hive 0.10.0 and later.)
[CACHE METADATA] -- (Note: Hive 2.1.0 and later.)
[NOSCAN];
Can anyone tell me the differnce between creating a secondary index vs creating an index CF manually in Cassandra
Secondary indexes in Cassandra are stored and maintained on each node. Thus, when you filter by a secondary index, Cassandra will need to do the search on every node, and then return the combined results. Therefore, filtering by secondary indexes can be significantly slower than filtering by partition key (according to my tests it can be 10 times slower, depending on your data and topology).
Maintaining your own index table is more efficient for most use cases, but you need to deal with updating the index table on your own. Also, you will need to do two queries for retrieving your data: one that queries the index table, and another one for retrieving the actual data.
Another solution would be to duplicate your data completely, and create two tables with the same structure, but different keys.
If performance is your key concern, then go for an index table or a duplicated table. If you need simplicity and can afford some performance penalty, use secondary indexes, but I recommend to do some performance testing beforehand.
I have a table like this
CREATE TABLE my_table(
category text,
name text,
PRIMARY KEY((category), name)
) WITH CLUSTERING ORDER BY (name ASC);
I want to write a query that will sort by name through the entire table, not just each partition.
Is that possible? What would be the "Cassandra way" of writing that query?
I've read other answers in the StackOverflow site and some examples created single partition with one id (bucket) which was the primary key but I don't want that because I want to have my data spread across the nodes by category
Cassandra doesn't support sorting across partitions; it only supports sorting within partitions.
So what you could do is query each category separately and it would return the sorted names for each partition. Then you could do a merge of those sorted results in your client (which is much faster than a full sort).
Another way would be to use Spark to read the table into an RDD and sort it inside Spark.
Always model cassandra tables through the access patterns (relational db / cassandra fill different needs).
Up to Cassandra 2.X, one had to model new column families (tables) for each access pattern. So if your access pattern needs a specific column to be sorted then model a table with that column in the partition/clustering key. So the code will have to insert into both the master table and into the projection table. Note depending on your business logic this may be difficult to synchronise if there's concurrent update, especially if there's update to perform after a read on the projections.
With Cassandra 3.x, there is now materialized views, that will allow you to have a similar feature, but that will be handled internally by Cassandra. Not sure it may fit your problem as I didn't play too much with 3.X but that may be worth investigation.
More on materialized view on their blog.
We are evaluating if we can migrate from SQL SERVER to cassandra for OLAP. As per the internal storage structure we can have wide rows. We almost need to access data by the date. We often need to access data within date range as we have financial data. If we use date as Partition key to support filter by date,we end up having less row with huge number of columns.
Will it hamper performance if we have millions of columns for a single row key in future as we process millions of transactions every day.
Do we need to have some changes in the access pattern to have more rows with less number of columns per row.
Need some performance insight to proceed in either direction
Using wide rows is typically fine with Cassandra, there are however a few things to consider:
Ensure that you don't reach the 2 billion column limit in any case
The whole wide row is stored on the same node: it needs to fit on the disk. Also, if you have some dates that are accessed more frequently then other dates (e.g. today) then you can create hotspots on the node that stores the data for that day.
Very wide rows can affect performance however: Aaron Morton from The Last Pickle has an interesting article about this: http://thelastpickle.com/blog/2011/07/04/Cassandra-Query-Plans.html
It is somewhat old, but I believe that the concepts are still valid.
For a good table design decision one needs to know all typical filter conditions. If you have any other fields you typically filter for as an exact match, you could add them to the partition key as well.
I am having a table containing 1 billion rows, fixed row format and using myisam engine in mysql. I am thinking of shardding the table but that development takes time. Are there any temporary solutions for improving the performance?
you can take a look at mysql partitioning. http://dev.mysql.com/doc/refman/5.1/en/partitioning-overview.html
it allows you to distribute portions of individual tables across a file system transparent to your queries
As per your comment if "insert/select ratio = 100:1" is the case, then i don see any reason to have indexes (apart from primary key index if any) on the table. It will further slow down your inserts.
Also, if you can queue inserts to this table then you can try creating a in-memory table (http://dev.mysql.com/doc/refman/5.0/en/memory-storage-engine.html) and direct all the inserts to the table which will be faster and then do a bulk insert/periodic flush in to ur myisam engine based table.
Also you can partition the table on a specific column out of those 4 you have(if there is any good candidate) or go for hash based partition (if you don find any). I am not sure why you are saying sharding is going to take dev time. you can partition an existing non partitioned table too. http://forums.mysql.com/read.php?106,264106,264110