Main issue - I have a query joining two indexed tables. When I run the query, even after five hours it is still running; I've never gotten results back. It's like a never-ending query.
This is a new database that we just installed, and this is really the first query we have tried. Performance is very slow. At this point, I'm trying to figure out whether the issue is the server (CPU/network/IO/RAM), the database itself (does the configuration file need tuning?), or the query. When I run an explain plan, I do see indexes being used. Is my volume of data too big for MariaDB to handle, based on the numbers presented below? Where do I begin to troubleshoot?
Any help is greatly appreciated.
Specs -
• We have MariaDB 10.2.19 on Linux (RHEL 7.5).
• Our database is mainly used to store monthly data, which we then report on and use for analytic work.
• We have around 70-plus tables.
• 3-4 users.
• I am running a query that joins two tables; their sizes are listed below. The query is not complicated: a simple inner join, and for both tables I am looking for data from the year/month 201911. The two tables are joined on two columns, and both columns are indexed on both tables. I have tried many variations of the query, and none of them return any rows; even after running for five hours the query never finished and I had to kill it (the query shape is sketched below).
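A hypothetical reconstruction of that query shape, based only on the description above (all column and filter names are assumptions, not the actual query):

-- Simple inner join on two indexed columns, filtered to one year/month.
SELECT t1.*, t2.*
FROM daily_2019 AS t1
INNER JOIN zipcodes AS t2
    ON  t1.join_col1 = t2.join_col1   -- indexed on both tables (hypothetical name)
    AND t1.join_col2 = t2.join_col2   -- indexed on both tables (hypothetical name)
WHERE t1.yearmonth = 201911           -- hypothetical year/month column
  AND t2.yearmonth = 201911;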
Size of (InnoDB) database: 1919.69 GB

Table 1 - daily_2019
Total rows: 1,034,987,628
Rows I'm analyzing (matching the WHERE clause): 73,895,929
Total table size: 230.62 GB

Table 2 - ZIPCODES
Total rows: 68,429,146
Rows I'm analyzing (matching the WHERE clause): 3,998,975
Total table size: 28.70 GB
Explain plan - tables daily_2019 and ZIPCODES each have three indexes on three different fields: col1, col2, and col3 each have an index on both daily_2019 and ZIPCODES. Two keys are used in the explain plan, one from each table.
We have set innodb_buffer_pool_size to 128G.
Explain Plan
+------+-------------+-------+------+----------------------------+----------+---------+-------+---------+------------------------------------+
| id   | select_type | table | type | possible_keys              | key      | key_len | ref   | rows    | Extra                              |
+------+-------------+-------+------+----------------------------+----------+---------+-------+---------+------------------------------------+
|    1 | SIMPLE      | T2    | ref  | ZIP_I2,ZIP_I1,ZIP_I3       | ZIP_I1   | 5       | const | 7994358 | Using where                        |
|    1 | SIMPLE      | T1    | ref  | ADHOC_I2,ADHOC_I1,ADHOC_I3 | ADHOC_I3 | 13      | T2.ID | 12      | Using index condition; Using where |
+------+-------------+-------+------+----------------------------+----------+---------+-------+---------+------------------------------------+
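As a first troubleshooting step, these standard MariaDB statements can show whether the time is going to disk reads or to the join itself (a sketch; the placeholder query must be replaced with the real one):

-- How often InnoDB had to go to disk vs. serving pages from the buffer pool.
SHOW GLOBAL STATUS LIKE 'Innodb_buffer_pool_read%';

-- Confirm the buffer pool size actually in effect.
SHOW GLOBAL VARIABLES LIKE 'innodb_buffer_pool_size';

-- MariaDB's ANALYZE statement executes the query and reports actual vs.
-- estimated row counts; add a LIMIT so it can finish in reasonable time.
ANALYZE SELECT ...;  -- substitute the real query, e.g. with LIMIT 100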
Related
As per the title: in Cassandra, I'm trying to access two different row values belonging to different columns at the same time, to perform an operation on them (like addition).
Elaboration: let's say I have 3 columns and some N rows:

 row_id | start | end
--------+-------+-----
      1 |     3 |   7
      2 |     9 |  11
      3 |    11 |  19
      4 |    22 |  30
I want to subtract the end value of one row from the start value of the next consecutive row.
Any idea how I may approach this in Cassandra?
It isn't possible to do this in Cassandra.
Partitions (records) are distributed randomly across the cluster and are not sorted the way you think they should be in a table. Your idea of a "next consecutive row" will be completely different from the way the data in a table is actually stored.
Your use case is more analytics than OLTP, so you're better off using an ETL/analytics tool like Spark. Cheers!
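For illustration, the consecutive-row difference becomes a one-liner in Spark SQL once the data is loaded into Spark; a minimal sketch, assuming the table is registered as a view named events (a made-up name):

-- Difference between this row's start and the previous row's end, in row_id order.
SELECT row_id,
       `start` - LAG(`end`) OVER (ORDER BY row_id) AS gap
FROM events;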
I have a table in Cassandra with the following structure:
CREATE TABLE example (session text, seq_number int, PRIMARY KEY ((session), seq_number));
In the data I would expect all sequence numbers to be present, starting at 0.
So the table might look like:

 session  | seq_number
----------+------------
 session1 |          0
 session1 |          1
 session1 |          2
 session1 |          4   <- bad row, means missing data

In this example I would like to read only the first three rows. I don't want the fourth row because its value is 4 while it sits at index 3, i.e. the sequence has a gap. Is this possible?
Possible, yes: you can use GROUP BY on session and a user-defined aggregate function. It may be more work than it's worth, though. If you just set the fetch size low (say 100) on queries and then iterate through the result set on the client side, it might save you a lot of work and potentially even be more efficient overall. I would recommend implementing the client-side solution first and benchmarking it to see if the aggregate is even necessary or beneficial.
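A minimal sketch of the client-side variant, using the table from the question (the session value is made up); the stop-at-first-gap check lives in the application:

-- Rows come back sorted by the clustering column seq_number; the client
-- iterates and stops as soon as seq_number differs from the expected index.
SELECT seq_number FROM example WHERE session = 'session1';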
Let's say we have a keyspace named sensors and a table named sensor_per_row.
This table has the following structure:
sensor_id | ts | value
In this case sensor_id is the partition key and ts (the timestamp at which the record was created) is the clustering key.
SELECT sensor_id, value, toDate(ts) AS day, ts FROM sensors.sensor_per_row;
The outcome of this select is
sensor_id | value | day | ts
-----------+-------+------------+---------------
Sensor 2 | 52.7 | 2019-01-04 | 1546640464138
Sensor 2 | 52.8 | 2019-01-04 | 1546640564376
Sensor 2 | 52.9 | 2019-01-04 | 1546640664617
How can I group the data by ts, or more specifically by date, and return the daily average value for each sensor, using cqlsh? For instance:
sensor_id | system.avg(value) | day
-----------+-------------------+------------
Sensor 2 | 52.52059 | 2018-12-11
Sensor 2 | 42.52059 | 2018-12-10
Sensor 3 | 32.52059 | 2018-12-11
One way, I guess, is to use a UDF (user-defined function), but such a function runs on only one row at a time. Is it possible to select data inside a UDF?
Another way is to use Java etc. with multiple queries for each day, or to process the data at some other point, such as a REST web service, but I don't know how efficient that would be... any suggestions?
NoSQL Limitations
While working with NoSQL, we generally have to give up:
Some ACID guarantees.
Consistency from CAP.
Shuffling operations: JOIN, GROUP BY.
You can perform the above operations by reading the data (rows) from the table and doing the summing/averaging on the client side, as sketched below.
You can also refer to the answer MAX(), DISTINCT and group by in Cassandra
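A minimal sketch of that client-side approach, reusing the table from the question (the sensor id is made up):

-- Pull the raw rows for one partition; the application then groups the
-- result by day and averages value.
SELECT sensor_id, toDate(ts) AS day, value
FROM sensors.sensor_per_row
WHERE sensor_id = 'Sensor 2';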
So I found the solution; I will post it in case somebody else has the same question.
From what I have read, data modelling seems to be the answer. Which means:
In Cassandra we have partition keys and clustering keys. Cassandra can handle many simultaneous inserts, which gives us the possibility of inserting the data into more than one table at the same time. That pretty much means we can create different tables for the same data-collection application, used in much the same way as materialized views (in the relational-database sense).
For instance, let's say we have the log schema {sensor_id, region, value}.
The first thing that comes to mind is to generate a table called sensor_per_row:

 sensor_id | value | region | ts
-----------+-------+--------+----

This is a very efficient way of storing the data for a long time, but given the functions Cassandra offers, it is not that simple to visualize the data or gain analytics from it.
Because of that, we can create additional tables with a TTL (TTL stands for "time to live"), which simply controls how long the data will be stored.
For instance, if we want the daily measurements of a specific sensor, we can create a table with day and sensor_id as the partition key and the timestamp as the clustering key, in descending order.
If we add a TTL of 24*60*60 = 86,400 seconds, which stands for one day, the table holds only the current day's data.
So creating, let's say, a table sensor_per_day with the above format and TTL will actually give us the daily measurements, and at the end of the day the table is refreshed with the newer measurements while the data remains stored in the previous table, sensor_per_row.
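A minimal sketch of such a sensor_per_day table (the column types are assumptions based on the question):

-- Partitioned by (day, sensor_id), newest measurement first; rows expire after one day.
CREATE TABLE sensors.sensor_per_day (
    day        date,
    sensor_id  text,
    ts         timestamp,
    value      double,
    PRIMARY KEY ((day, sensor_id), ts)
) WITH CLUSTERING ORDER BY (ts DESC)
   AND default_time_to_live = 86400;  -- 24*60*60 seconds = one day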
I hope I gave you the idea.
I know that Cassandra does not support GROUP BY. But how can I achieve a similar result on a big collection of data?
Let's say I have a table with 1 million rows of clicks, 1 million rows of shares, and a table user_profile. clicks and shares store one operation per row, with a created_at column. On a dashboard I would like to show results grouped by day, for example:
2016-06-01 - 2016-07-01
+-------------+--------+------+
|user_profile | like |share |
+-------------+--------+------+
| John | 34 | 12 |
| Adam | 12 | 4 |
| Bruce | 4 | 2 |
+-------------+--------+------+
The question is, how can I do this the right way:
Create a table user_likes_shares with counters, keyed by date
Create a UDF to group by each column, and join the results in code by merging arrays by key
Select the data from the 3 tables, then group and join them in code by merging arrays by key
Another option?
If you use code to join the results, do you use Apache Spark SQL? Is Spark the right way to go in this case?
Assuming that your dashboard page will show all historical results, grouped by day:
1. 'Group by' in a table: The denormalised approach is the accepted way of doing things in Cassandra, as writes and disk space are cheap. If you can structure your data model (and application writes) to support this, then this is the best approach (see the counter-table sketch after this list).
2. 'Group by' in a UDA: In this blog post, the author notes that all rows are pulled back to the coordinator, reconciled and aggregated there (for CL>1). So even if your clicks and shares tables are partitioned by date, Cassandra will still have to pull all rows for that date back to the coordinator, store them in the JVM heap and then process them. So this approach has reduced scalability.
3. Merging in code: This will be a much slower approach as you will have to transfer a lot more data from the coordinator to your application server.
4. Spark: This is a good approach if you have to make ad-hoc queries (e.g. analysing data rather than populating a web page), and it can be simplified by running your Spark jobs through a notebook application (e.g. Apache Zeppelin). However, in your use case you have the added complexity of having to wait for that job to finish, write the output somewhere, and then display it on the web page.
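To make option 1 concrete, here is a hedged sketch of such a counter table (all names are made up; note that in a Cassandra counter table every non-key column must be a counter):

CREATE TABLE clicks_shares_by_day (
    day           date,
    user_profile  text,
    likes         counter,
    shares        counter,
    PRIMARY KEY (day, user_profile)
);

-- The application bumps a counter for every click/share it records:
UPDATE clicks_shares_by_day SET likes = likes + 1
WHERE day = '2016-06-01' AND user_profile = 'John';

-- The dashboard then reads one partition per day:
SELECT user_profile, likes, shares
FROM clicks_shares_by_day
WHERE day = '2016-06-01';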
I'm trying to do a partial search through a column family in Cassandra, similar to an SQL query like SELECT * FROM columnfamily WHERE col = 'val*', where 'val*' means any value matching at least the first three characters 'val'.
I've read DataStax's documentation on SELECT, but I can't seem to find any support for partial WHERE criteria. Any ideas?
There is no wildcard support like this in Cassandra, but you can model your data in such a way that you could get the same end result.
You would take the column that you want to query on and denormalize it into a second column family. This CF would have a single wide row, with the column names being the values of the col you want to run the wildcard query on. The column value in this CF could be either the row key of the original CF or some other representation of the original row.
Then you would use slicing to get out the values you care about. For example, if this were the wide row to slice on:

+---------+----------+--------+----------+---------+--------+----------+
| RowKey  | aardvark | abacus | abacuses | abandon | accent | accident |
|         +----------+--------+----------+---------+--------+----------+
|         |          |        |          |         |        |          |
|         |          |        |          |         |        |          |
+---------+----------+--------+----------+---------+--------+----------+
Using CQL you could select out everything starting with 'aba*' using this query*:
SELECT 'aba'..'abb' from some_cf where RowKey = some_row_key;
This would give you the columns for 'abacus', 'abacuses', and 'abandon'.
There are some things to be aware of with this strategy:
In the above example, if you have entries with the same column name, you need some way to differentiate between them (otherwise inserting into the wide column family will clobber other valid values). One way to do this is to use a composite column of word:some_unique_value.
The above model only allows wildcards at the end of the string. Wildcards at the beginning of the string could also be handled easily with a few modifications. Wildcards in the middle of a string would be much more challenging.
Remember that Cassandra doesn't give you an easy way to do ad-hoc queries. Instead you need to figure out how you will be using the data and model your CFs accordingly. Take a look at this blog post from Ed Anuff on indexing data in Cassandra for more info on modeling data like this.
*Note that the CQL syntax for slicing columns is changing in an upcoming release of Cassandra.
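For readers on modern Cassandra: the same denormalize-and-slice idea is expressed in CQL 3 with a clustering column and a range restriction. A hedged sketch with made-up names:

-- All words live in one wide partition; word is the clustering column.
CREATE TABLE words_by_prefix (
    bucket        text,
    word          text,
    original_key  text,
    PRIMARY KEY (bucket, word)
);

-- Everything starting with 'aba': slice the clustering column.
SELECT word, original_key
FROM words_by_prefix
WHERE bucket = 'all' AND word >= 'aba' AND word < 'abb';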