Cassandra performance

Cassandra performance - cassandra

I have read all these articles about how fast cassandra can be, for example single row read can take about 5ms.
So far i didn't care to much about my website speeds, but as the site grew bigger some pages started to require quite a few queries, for example one page requires to read 5 different tables and around 50 different rows, and so I have noticed that it takes from 0.7 sec to 2.0 secs which is really slow, so i took a closer look and found out that single query takes about 150ms.
The table that I'm testing is almost empty so size can't be an issue. I have installed APC and it did not help.
I am using PHPCassa and thrift comes along with this library.
Are these speeds normal, maybe php is just not fast enough? What could i do to improve this situation?
Note, I understand that running so many queries is to much and cassandra is optimized for writes not reads, but in some situations I can't find a way to put data in a single table/row.
EDIT I have just found out about optional C extension which should improve performance, and indeed it does, now single row read takes from 50ms to 100ms so that's a major improvement, thou still far away from those 5ms
EDIT2 Sorry for not updating my question with more information, but i have been very busy, and I have actually solved this problem, now 10 row reads from 4 different tables takes just 0.073158 s and average read time is just 0.005575 s so it's way more then I have expected to achieve. For those who are facing the same problem these are the things i would suggest to do:
Install optional C extension, steps to do that can be found here
Install APC
Make sure that right java version is installed, this could be causing slowdown
After installing all these things don't just restart apache, restart the whole server, I didn't do that at first and I only noticed this major speed improvement only after server restart

This still doesn't explain why a column family that is mostly empty performs worse than others. Next time you face that issue, you should give us how you use this table and what kind of query gave you bad performances.
Just a guess: Does this column family contains some frequently deleted data? Because the actual deletion for deleted (tombstoned) values takes a GcGracePeriod of 10 days by default.
So you might face some issues if you perform a lot of writes, reads and deletes on a lot of columns on the same keys.

Related

MariaDB got much faster, and I can't find the cause?

I have a concern about my MariaDB 10.4.12 database query execution time, which is getting much faster without any update to my database schema or data. While a speed-up is always welcome, I am concerned about the root cause of this speed-up, especially since I have not rolled out any changes in the last 24 hours. This specific query has sped up 60x overnight.
I have a NodeJS web application that filters a large dataset into "reporting" pages, which typically take 10-12 seconds to load. My main table has 3.5 million rows and the base query involves many joins, date comparisons, and text comparisons. There is room for fine-tuning the query, but it worked for what it was designed to do and I could live with 10 second load times. I noticed this morning, though, that my queries were executed in less than 1 second, without any recent changes on my part.
The most recent change to the application was pushed out five days ago, which affected the amount of data being pulled into this database. A separate application on the same server reaches out to a data set every 10 minutes and replicates these rows into the same database the "reporting" application communicates with. Up until this update, the query was collecting and inserting ~80,000 rows on average, taking about 8-10 seconds to fully replicate the data into this database. My change five days ago reduced the rows being inserted to ~20,000 on average.
Other clues:
PHPMyAdmin still takes 10-12 seconds to run the query, while the MySQL command-line tool takes in less than 1 second
The MariaDB temp directory was changed to a larger partition 7 days ago
The query was tested to be slow (10-12 seconds) 24 hours ago
The query is still slow on a pre-production server that runs the same application with an identical MySQL instance running (same schema and data)
My current running theory is that the ~80,000 inserts were not being executed in the time range being reported by NodeJS (8-10 seconds for the inserts), and they were instead waiting in the MariaDB temp directory until they could be fully written to the database. That would suggest that the database was constantly bogged down by these writes, and reducing the number to ~20k allowed the database to insert faster, allowing the select queries to run faster this morning.
Should I be concerned about this speed up? Could MariaDB have found a faster way to index my data? Am I going crazy?
Thank you.

Don't worry. This kind of thing can be caused by contention (multiple database clients using the database concurrently) and all sorts of other things.
(Cherish this moment. Performance usually goes the other direction.)
You can test for correctness to increase your confidence level. Check a few older and a few newer records to see if they still contain good data.
Or a full-table-scan query, something like this
SELECT COUNT(*), AVG(some_number_column), MIN(some_text_column) FROM mytable
That will take a while but it will hit every row in the table.
You probably don't need to do this, but it's a way to double check (and tell your boss, "I double checked.)

10 seconds, then 1 second. That is "normal".
The first was run when none of the data was cached in RAM; the second was with all cached.
Run it a third time; it will be 1 second again.
Restart MariaDB and run it again; it will again take 10 seconds.
Walk away from the machine for a long time; don't touch the table. It might be back to 10 seconds. For this, look at size of RAM and innodb_buffer_pool_size. Also look for big table scans that bump everything out of cache.

Temporary speed improvements

In our workflow, we have little ongoing work in the arangodb (~1% cpu use). For about 30 minutes of the day usage spikes and we need it to be more performant (helping do a 3s query to 1s).
Instead of moving up the instance box that it's hosted on, is there a way to get more out of arango temporarily during peak times? Would this be clustering or should we just look into temporarily boosting the instance that it's on.

Accumulating above suggestions plus adding some more that fit the generic nature of this question.
if possible split read/write workload, either in a timely fashion by holding back writes, or by switching to a new collection for the new writes.
make sure indices are properly set (use explain)
try whether query profiling can help you improve the performance

What is the best way to query timeseries data with cassandra?

My table is a time series one. The queries are going to process the latest entries and TTL expire them after successful processing. If they are not successfully processed, TTL will not set.
The only query I plan to run on this is to select all entries for a given entry_type. They will be processed and records corresponding to processed entries will be expired.
This way every time I run this query I will get all records in the table that are not processed and processing will be done. Is this a reasonable approach?
Would using a listenablefuture with my own executor add any value to this considering that the thread doing the select is just processing.
I am concerned about the TTL and tombstones. But if I use clustering key of timeuuid type is this ok?

You are right one important thing getting in your way will be tombstones. By Default you will keep them around for 10 days. Depending on your access patter this might cause significant problems. You can lower this by setting the directly on the table or change it in the cassandra yaml file. Then it will be valid for all the newly created table gc_grace_seconds
http://docs.datastax.com/en/cql/3.1/cql/cql_reference/tabProp.html
It is very important that you make sure you are running the repair on whole cluster once within this period. So if you lower this setting to let's say 2 days, then within two days you have to have one full repair done on the cluster. This is very important because processed data will reaper. I saw this happening multiple times, and is never pleasant especially if you are using cassandra as a queue and it seems to me that you might be using it in your solution. I'll try to give some tips at the end of the answer.
I'm slightly worried about you setting the ttl dynamically depending on result. What would be the point of inserting the ttl-ed data that was successful and keeping forever the data that wasn't. I guess some sort of audit or something similar. Again this is a queue pattern, try to avoid this if possible. Also one thing to keep in mind is that you will almost always insert the data once in the beginning and then once again with the ttl should your processing be o.k.
Also getting all entries might be a bit tricky. For very moderate load 10-100 req/s this might be reasonable but if you have thousands per second getting all the requests every time might not be a good idea. At least not if you put them into single partition.
Separating the workload is also good idea. So yes using listenable future seems totally legit.
Setting clustering key to be timeuuid is usually the case with time series thata and I totally agree with you on this one.
In reality as I mentioned earlier you have to to take into account you will be saving 10 days worth of data (unless you tweak it) no matter what you do, it doesn't matter if you ttl it. It's still going to be ther, and every time cassandra will scan the partition will have to read the ttl-ed columns. In short this is just pain. I would seriously consider actually using something as kafka if I were you because what you are describing simply looks to me like a queue.
If you still want to stick with cassandra then please consider using buckets (adding date info to partitioning key and having a composite partitioning key). Depending on the load you are expecting you will have to bucket by month, week, day, hour even minutes. In some cases you might even want to add artificial columns to reduce load on the cluster. But then again this might be out of scope of this question.
Be very careful when using cassandra as a queue, it's a known antipattern. You can do it, but there are a lot of variables and it extremely depends on the load you are using. I once consulted a team that sort of went down the path of cassandra as a queue. Since basically using cassandra there was a must I recommended them bucketing the data by day (did some calculations that proved this is o.k. time unit) and I also had a look at this solution https://github.com/paradoxical-io/cassieq basically there are a lot of good stuff in this repo when using cassandra as a queue, data models etc. Basically this team had zombie rows, slow reading because of the tombstones etc. etc.
Also the way you described it it might happen that you have "hot rows" basically since you would just have one wide partition where all your data would go some nodes in the cluster might not even be that good utilised. This can be avoided by artificial columns.
When using cassandra as a queue it's very easy to mess a lot of things up. (But it's possible for moderate workloads)

Grails Excel import fails for huge data

I am using grails 2.3.7 and the latest excel-import plugin (1.0.0). My requirement is that I need to copy the contents of an excel sheet completely as it is into the database. My database is mssql server 2012.
I have got the code working for the development version. The code works fine when the number of records are few or may be upto a few hundreds.
But while in production the excel sheet will be having as many as 50,000 rows and over 75 columns.
Initially I faced a data out of memory exception. I increased the heap size to as much as 8GB, but now the thread keeps running on and on without termination. No errors are generated.
Please note that this is a once in while operation and it will be carried out by a person who will ensure that this operation does not hamper other operations running parellely. So need to worry about the huge load of this operation. I can afford to run it.
When the records are upto 10,000 with the same number of columns the data gets copied in around 5 mins. If now I have 50,000 rows then the time taken should ideally be around 5 times more, which is around 25 mins. But the code kept running for more than an hour without termination.
Any idea how to go about this issue. Any help is highly appreciated.

If you load 5 times more data in memory, it doesn't always take 5 times more. I guess that most of 8GB are in virtual memory and the virtual memory is very slow on hardware. Try to decrease the memory, run some memory tests and try to use as much as possible the RAM.

In my experience, a normal problem with large batch operations in Grails. I think you have memory leaks that radically slow down the operation as it proceeds.
My solution has been to use an ETL tool such as Pentaho Kettle for the import, or chunk the import into manageable pieces. See this related question:
Insert 10,000,000+ rows in grails

Not technically an answer to your problem, but have you considered just using CSV instead of of excel?
From a users point of view, saving as a CSV before importing is not a lot of work.
I am loading, validating and saving CSVs with 200-300 000 rows without a hitch.
Just make sure you have the logic in a service so it puts a transaction around it.
A bit more code to decode csv maybe, especially to translate to various primitives, but it should be orders of magnitude faster.

Solr for constantly updating index

I have a news site with 150,000 news articles. About 250 new articles are added daily to the database at an interval of 5-15 minutes. I understand that Solr is optimized for millions of records and my 150K won't be a problem for it. But I am worried the frequent updation will be a problem, since the cache gets invalidated with every update. In my dev server, cold load of a page takes 5-7 seconds to load (since every page runs a few MLT queries).
Will it help, if I split my index into two - An archive index and a latest index. The archive index will be updated once every day.
Can anyone suggest any ways to optimize my installation for a constantly updating index?
Thanks

My answer is: test it! Don't try to optimize yet if you don't know how it performs. Like you said, 150K is not a lot, it should be quick to build an index of that size for your tests. After that, run a couple of MLT queries from a different concurrent threads (to simulate users) while you index more documents to see how it behaves.
One setting that you should keep an eye on is auto-commit. Since you are indexing constantly, you can't commit at each document (you will bring Solr down). The value that you will choose for this setting will let you tune the latency of the system (how many times it takes for new documents to be returned in results) while keeping the system responsive.

Consider using mlt=true in the main query instead of issuing per-result MoreLikeThis queries. You'll save the roundtrips and so it will be faster.

Develop Reference

node.js excel linux python-3.x azure haskell apache-spark rust .htaccess string