How to optimize a MyISAM table with 1 billion rows and a fixed row format in MySQL?

I have a table containing 1 billion rows, with a fixed row format, using the MyISAM engine in MySQL. I am thinking of sharding the table, but that development takes time. Are there any temporary solutions for improving performance?

You can take a look at MySQL partitioning: http://dev.mysql.com/doc/refman/5.1/en/partitioning-overview.html
It allows you to distribute portions of an individual table across a file system, transparently to your queries.

As per your comment, if the insert/select ratio is 100:1, then I don't see any reason to have indexes on the table (apart from the primary key index, if any). Every extra index has to be updated on each insert, so it will only slow your inserts down further.
Also, if you can queue inserts to this table, you can try creating an in-memory table (http://dev.mysql.com/doc/refman/5.0/en/memory-storage-engine.html) and directing all the inserts to it, which will be faster, and then doing a periodic bulk insert/flush into your MyISAM table.
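To make the "queue, then flush in bulk" idea concrete, here is a minimal sketch. Note the assumptions: it buffers rows in the application instead of in a MEMORY-engine table, it uses Python's built-in sqlite3 only as a stand-in so it runs without a MySQL server, and the `events` table, the `BufferedInserter` class and `flush_size` are made-up names for illustration.

```python
import sqlite3


class BufferedInserter:
    """Collects rows in memory and writes them to the big table in batches."""

    def __init__(self, conn, flush_size=1000):
        self.conn = conn
        self.flush_size = flush_size  # hypothetical batch size, tune for your load
        self.buffer = []

    def insert(self, row):
        # Cheap append; the expensive write happens only once per batch.
        self.buffer.append(row)
        if len(self.buffer) >= self.flush_size:
            self.flush()

    def flush(self):
        if not self.buffer:
            return
        # One transaction and one multi-row insert instead of many single-row inserts.
        with self.conn:
            self.conn.executemany(
                "INSERT INTO events (id, payload) VALUES (?, ?)", self.buffer
            )
        self.buffer.clear()


def demo():
    # sqlite3 is only a stand-in for the real MyISAM table.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE events (id INTEGER, payload TEXT)")
    inserter = BufferedInserter(conn, flush_size=3)
    for i in range(7):
        inserter.insert((i, f"row-{i}"))
    pending = len(inserter.buffer)  # rows still waiting in memory
    inserter.flush()                # periodic/final flush
    total = conn.execute("SELECT COUNT(*) FROM events").fetchone()[0]
    return pending, total
```

The trade-off is the same as with a MEMORY table: anything still sitting in the buffer is lost if the process dies before the next flush.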
You can also partition the table on one of the 4 columns you have (if there is a good candidate), or go for hash-based partitioning (if you don't find one). I am not sure why you are saying sharding is going to take dev time; you can partition an existing non-partitioned table too. http://forums.mysql.com/read.php?106,264106,264110
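For intuition on hash-based partitioning: as far as I know, plain HASH partitioning in MySQL picks the partition as the partitioning expression modulo the number of partitions. The sketch below only models that rule in Python; `hash_partition` and `distribution` are hypothetical helper names, and the integer keys are made-up sample data.

```python
from collections import Counter


def hash_partition(key, num_partitions):
    """Model of modulo-based hash partitioning for a non-negative integer key."""
    return key % num_partitions


def distribution(keys, num_partitions):
    """Count how many keys land in each partition."""
    return Counter(hash_partition(k, num_partitions) for k in keys)
```

With sequential ids the rows spread evenly, which is why hash partitioning is a reasonable fallback when no column is a natural range or list candidate.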

Related

Can sort() and cache() combined in spark increase filter speed like creating index column in SQL?

We know in SQL, an index can be created on a column if it is frequently used for filtering. Is there anything similar I can do in spark? Let's say I have a big table T containing a column C I want to filter on. I want to filter 10s of thousands of id sets on the column C. Can I sort/orderBy column C, cache the result, and then filter all the id sets with the sorted table? Will it help like indexing in SQL?
You should absolutely build the table/dataset/dataframe sorted by id if you will query on it often. It helps predicate pushdown and in general gives a boost in performance.
When executing queries in the most generic and basic manner, filtering
happens very late in the process. Moving filtering to an earlier phase
of query execution provides significant performance gains by
eliminating non-matches earlier, and therefore saving the cost of
processing them at a later stage. This group of optimizations is
collectively known as predicate pushdown.
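Here is a rough model of why sorting helps pushdown. Columnar formats such as Parquet keep min/max statistics per chunk of rows, and a reader can skip any chunk whose range cannot contain the value you filter on. This is a simplified sketch in plain Python, not Spark's actual code; the function names and the chunk size of 100 are assumptions for illustration.

```python
def chunk_stats(values, chunk_size):
    """Split values into fixed-size chunks and record (min, max) for each."""
    chunks = [values[i:i + chunk_size] for i in range(0, len(values), chunk_size)]
    return [(min(c), max(c)) for c in chunks]


def chunks_to_scan(values, chunk_size, target):
    """Count chunks whose [min, max] range could contain the target value."""
    return sum(1 for lo, hi in chunk_stats(values, chunk_size) if lo <= target <= hi)


# 1000 distinct ids in a scrambled order
# (919 is coprime with 1000, so this is a permutation of 0..999).
scrambled = [(i * 919) % 1000 for i in range(1000)]

unsorted_hits = chunks_to_scan(scrambled, 100, 500)        # every chunk overlaps 500
sorted_hits = chunks_to_scan(sorted(scrambled), 100, 500)  # only one chunk overlaps 500
```

On the scrambled data every chunk spans almost the whole id range, so nothing can be skipped; on the sorted data a lookup for id 500 touches a single chunk.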
Even if you aren't sorting the data, you may want to look at storing it in files with 'distribute by' or 'cluster by'. 'Distribute by' is very similar to repartitioning on a column, and 'cluster by' additionally sorts the rows within each partition. Again, this only boosts performance if you query the data the same way you distributed it.
If you intend to requery often then yes, you should cache the data, but in general there aren't indexes. There are file types (row-based or columnar) that help boost performance if you have specific query needs.
You should look at the Spark-specific performance tuning options. Adaptive Query Execution is a newer feature that helps boost performance without indexes.
If you are working with Hive (note that it has its own version of partitions):
Depending on how you will query the data you may also want to look at partitioning or :
[hive] Partitioning is mainly helpful when we need to filter our data based
on specific column values. When we partition tables, subdirectories
are created under the table’s data directory for each unique value of
a partition column. Therefore, when we filter the data based on a
specific column, Hive does not need to scan the whole table; it rather
goes to the appropriate partition which improves the performance of
the query. Similarly, if the table is partitioned on multiple columns,
nested subdirectories are created based on the order of partition
columns provided in our table definition.
Hive partitioning is not a magic bullet and will slow down querying if the pattern of accessing data is different from the partitioning. It makes a lot of sense to partition by month if you write a lot of queries looking at monthly totals. If, on the other hand, the same table is used to look at sales of product 'x' from the beginning of time, the query will actually run slower than if the table wasn't partitioned. (It's a tool in your tool shed.)
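To picture the nested subdirectories and the pruning described above, here is a small sketch that only builds and filters path strings. The `warehouse/sales` location, the `year`/`month` columns and both helper functions are made up for illustration; this is not how Hive is implemented, just the layout idea.

```python
def partition_path(table_dir, partition_cols, row):
    """Build the col=value subdirectory path for a row, in partition-column order."""
    parts = [f"{col}={row[col]}" for col in partition_cols]
    return "/".join([table_dir] + parts)


def prune(paths, **filters):
    """Keep only the partition directories that match every col=value filter."""
    wanted = {f"{col}={val}" for col, val in filters.items()}
    return [p for p in paths if wanted <= set(p.split("/"))]


# Hypothetical table partitioned by (year, month).
all_paths = [
    partition_path("warehouse/sales", ["year", "month"], {"year": y, "month": m})
    for y in (2015, 2016)
    for m in (8, 9)
]
monthly = prune(all_paths, year=2016, month=9)  # one directory to read
```

A filter on a non-partition column (say, product) prunes nothing, so every directory still has to be opened, which is the slowdown the answer warns about.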
Another Hive-specific tip:
The other thing you want to think about is keeping your table stats up to date. The Cost Based Optimizer uses those statistics to plan queries on your data. (Re-run after ~30% of your data has changed.)
ANALYZE TABLE [db_name.]tablename [PARTITION(partcol1[=val1], partcol2[=val2], ...)]  -- (Note: Fully support qualified table name since Hive 1.2.0, see HIVE-10007.)
  COMPUTE STATISTICS
  [FOR COLUMNS]          -- (Note: Hive 0.10.0 and later.)
  [CACHE METADATA]       -- (Note: Hive 2.1.0 and later.)
  [NOSCAN];
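The "~30% changed" rule of thumb is simple arithmetic, sketched here. The function name, its parameters and the way you would obtain the row counts are all assumptions; Hive does not provide this check for you as far as I know.

```python
def stats_are_stale(rows_at_last_analyze, rows_changed_since, threshold=0.30):
    """Return True when the changed fraction reaches the rule-of-thumb threshold."""
    if rows_at_last_analyze == 0:
        # No baseline yet: any data at all means the stats need computing.
        return rows_changed_since > 0
    return rows_changed_since / rows_at_last_analyze >= threshold
```

For example, 400 changed rows against 1000 rows at the last ANALYZE is 40%, so it is time to re-run it; 100 changed rows (10%) is not.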

Trying to visualize how wide and skinny rows are laid out

Can someone explain and show me how the data is laid out when you design your tables for wide vs. skinny rows?
I'm not sure I fully grasp how the data is spread out with a "wide" row.
Is there a difference in how you can fetch the data, or will it be the same, i.e. if it is ordered, it doesn't matter whether the data is organized vertically (skinny) or horizontally (wide)?
Update
Is a table considered wide if the primary key consists of more than one column?
Or will a table have wide rows only if the partition key is a composite partition key?
Wide... Skinny... Terms that make your head explode... I prefer to oversimplify the thing as such:
All tables have wide rows
You simply need to take care of how wide the rows get
This allows me to think of it as follows (mangling the C* terminology a bit):
Number of RECORDS in a partition
1 <--------------------------------------- ... 2 billion
^                                              ^
Skinny rows                                    Wide rows
The fewer records in a partition, the skinnier the "partition" is, and vice versa.
When designing for C* I always keep in mind a couple of things:
I want to use "skinny partitions" when my data can be fetched with one query and it is fully contained in one record of one partition. A typical example is something along the lines of SELECT * FROM table WHERE username = 'xmas79'; where the table has a primary key in the form of PRIMARY KEY (username), which lets me get all the data belonging to a particular username.
I want to use "wide rows" when my data can be fetched with one query and it is fully contained in multiple records of one partition. Typical examples are range queries like SELECT * FROM table WHERE sensor = 'pressure' AND time >= '2016-09-22';, where the table has a primary key in the form of PRIMARY KEY (sensor, time).
So, the first approach is for one-shot queries, the second for range queries. Beware that the second approach has the (major) drawback that you can keep adding data to the partition, and it will get wider and wider, hurting performance.
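A way to picture the range query on a wide partition: all records for one sensor sit together, sorted by the clustering column (time), so "time >= X" is a seek followed by a sequential read. The sketch below models one partition as a sorted Python list; the readings are made-up sample values and `rows_from` is a hypothetical helper, not a Cassandra API.

```python
import bisect

# One partition: every record for sensor = 'pressure', kept sorted by time.
pressure_partition = [
    ("2016-09-20", 1012.0),
    ("2016-09-21", 1009.5),
    ("2016-09-22", 1011.2),
    ("2016-09-23", 1013.8),
]


def rows_from(partition, start_time):
    """Return all records with time >= start_time from a time-sorted partition."""
    # (start_time,) sorts before any (start_time, value) tuple, so bisect_left
    # lands on the first record whose time is >= start_time.
    first = bisect.bisect_left(partition, (start_time,))
    return partition[first:]
```

Calling rows_from(pressure_partition, "2016-09-22") returns the last two records without looking at the earlier ones, which is roughly what makes range queries on a clustering column cheap.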
In order to control how wide your partitions are, you need to add something to the partition key. In the sensor example above, if it doesn't violate your requirements of course, you can "group" measurements by date, e.g. split the measurements into day-by-day groups, making the primary key PRIMARY KEY ((sensor, day), time), where the partition key becomes (sensor, day). With this approach, you have full (well, let's say at least good) control over the width of your partitions.
You only need to find a good compromise between your query capabilities and the desired performance.
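The effect of adding the day to the partition key can be sketched by counting records per partition under both key choices. The measurements are generated sample data and `partition_sizes` is a hypothetical helper; it only groups dictionaries in Python, nothing here talks to Cassandra.

```python
from collections import defaultdict


def partition_sizes(measurements, partition_key):
    """Count how many records fall into each partition for a given key function."""
    sizes = defaultdict(int)
    for m in measurements:
        sizes[partition_key(m)] += 1
    return dict(sizes)


# Made-up data: one sensor, three days, four readings per day.
measurements = [
    {"sensor": "pressure", "time": f"2016-09-{day:02d}T{hour:02d}:00"}
    for day in (20, 21, 22)
    for hour in (0, 6, 12, 18)
]

# PRIMARY KEY (sensor, time): a single partition that keeps growing.
by_sensor = partition_sizes(measurements, lambda m: m["sensor"])

# PRIMARY KEY ((sensor, day), time): one bounded partition per day.
by_sensor_day = partition_sizes(measurements, lambda m: (m["sensor"], m["time"][:10]))
```

The price is on the read side: a range that spans several days now has to query one partition per day.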
I suggest these three readings for further investigation on the details:
1. Wide Rows in Cassandra CQL
2. Does CQL support dynamic columns / wide rows?
3. CQL3 for Cassandra experts
Beware that in the first one there's a mistake in the second-to-last picture: the primary key should be
PRIMARY KEY ((user_id, tweet_id))
with double parentheses around the columns instead of single ones.

Difference between creating a secondary index vs creating an index CF manually in Cassandra

Can anyone tell me the difference between creating a secondary index and creating an index CF manually in Cassandra?
Secondary indexes in Cassandra are stored and maintained on each node. Thus, when you filter by a secondary index, Cassandra will need to do the search on every node, and then return the combined results. Therefore, filtering by secondary indexes can be significantly slower than filtering by partition key (according to my tests it can be 10 times slower, depending on your data and topology).
Maintaining your own index table is more efficient for most use cases, but you need to deal with updating the index table on your own. Also, you will need to do two queries for retrieving your data: one that queries the index table, and another one for retrieving the actual data.
Another solution would be to duplicate your data completely, and create two tables with the same structure, but different keys.
If performance is your key concern, then go for an index table or a duplicated table. If you need simplicity and can afford some performance penalty, use secondary indexes, but I recommend doing some performance testing beforehand.
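Here is what "maintaining your own index table" means in practice, modelled with two Python dictionaries instead of two Cassandra tables. The table names, the `email` column and the assumption that each email belongs to exactly one user are all made up for illustration.

```python
class UsersWithEmailIndex:
    """Main table plus a hand-maintained index table, modelled as two dicts."""

    def __init__(self):
        self.users = {}           # main table: user_id -> row
        self.users_by_email = {}  # index table: email -> user_id (assumes unique emails)

    def upsert(self, user_id, email, name):
        old = self.users.get(user_id)
        # Keeping the index in sync is our job: drop the stale entry first.
        if old is not None and old["email"] != email:
            self.users_by_email.pop(old["email"], None)
        self.users[user_id] = {"email": email, "name": name}
        self.users_by_email[email] = user_id

    def find_by_email(self, email):
        user_id = self.users_by_email.get(email)  # query 1: index table
        if user_id is None:
            return None
        return self.users.get(user_id)            # query 2: main table
```

Both costs from the answer are visible: every write touches two tables, and every lookup by email takes two reads. In a real cluster the two writes are also not atomic unless you take extra measures.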

Dynamic sorting with Cassandra

I have a situation where I have a large partition/row with many cells/values. I need to query this row for all the cells, sorted by a value (one of the keys). This sort value is dynamic and changes often. You can't update any of the primary keys in Cassandra, because that changes how the data is stored. So, how do I do this? Does Cassandra not support normalized queries where the sort can change at any given moment?
Cassandra does not support normalized queries where the sort can change at any given moment. You can sort on the client or use additional tools like Spark.
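Sorting on the client can be as simple as the sketch below: fetch the whole partition once, then sort it by whichever field is needed at that moment. The cell layout and field names are made up, and this assumes the partition fits comfortably in client memory.

```python
from operator import itemgetter


def sorted_cells(cells, sort_field, descending=False):
    """Sort already-fetched cells by any field, chosen at query time."""
    return sorted(cells, key=itemgetter(sort_field), reverse=descending)


# Hypothetical cells as they might come back from a single-partition query.
cells = [
    {"item": "a", "score": 7, "updated": "2016-09-22"},
    {"item": "b", "score": 3, "updated": "2016-09-23"},
    {"item": "c", "score": 9, "updated": "2016-09-21"},
]

by_score = sorted_cells(cells, "score", descending=True)
by_updated = sorted_cells(cells, "updated")
```

If the partition is too large to pull to one client, that is where a tool like Spark comes in.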

"Capped collections" in Cassandra

Cassandra doesn't have capped collections (or row size limits), but one way of simulating them is to use an offline mapreduce job to clean up extra entries. Would it be better to have a second table that stores row counts for primary keys in another table? The downside is that you have to scan through the entire row_count table, since counters aren't indexable. Or would it be faster to just scan over the backing table with the real data?
Or is there another technique I should look into?
Edit: I found this Columns count vs counter column performance. Row counts go over all the data, so I'm leaning away from that.
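For reference, the per-row step of the offline cleanup described above would look something like this sketch: keep the newest `cap` entries and report the rest for deletion. The `ts` field and the `trim_partition` name are assumptions; how the job scans rows and issues deletes is left out.

```python
def trim_partition(rows, cap):
    """Return (kept, dropped): the `cap` newest rows by timestamp, and the rest."""
    newest_first = sorted(rows, key=lambda r: r["ts"], reverse=True)
    return newest_first[:cap], newest_first[cap:]
```

The cleanup job would then issue deletes for everything in the dropped list.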
