5 seconds per select from 80M records in memsql - singlestore

We are using memsql 5.1 for a web-analytics project. There are about 80M records, with roughly 0.5M new records per day. A simple query takes about 5 seconds: how much data was received per domain, geo, and lang for a given day. I feel it should be possible to reduce this time, but I can't find a way. Please tell me how.
The table looks like this:
CREATE TABLE `domains` (
`date` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
`geo` varchar(100) CHARACTER SET utf8 COLLATE utf8_general_ci DEFAULT NULL,
`lang` char(5) CHARACTER SET utf8 COLLATE utf8_general_ci DEFAULT NULL,
`browser` varchar(20) CHARACTER SET utf8 COLLATE utf8_general_ci DEFAULT NULL,
`os` varchar(20) CHARACTER SET utf8 COLLATE utf8_general_ci DEFAULT NULL,
`device` varchar(20) CHARACTER SET utf8 COLLATE utf8_general_ci DEFAULT NULL,
`domain` varchar(200) CHARACTER SET utf8 COLLATE utf8_general_ci DEFAULT NULL,
`ref` varchar(200) CHARACTER SET utf8 COLLATE utf8_general_ci DEFAULT NULL,
`blk_cnt` int(11) DEFAULT NULL,
KEY `date` (`date`,`geo`,`lang`,`domain`) /*!90619 USING CLUSTERED COLUMNSTORE */
/*!90618 , SHARD KEY () */
)
The query looks like this:
memsql> explain SELECT domain, geo, lang, avg(blk_cnt) as blk_cnt, count(*) as cnt FROM domains WHERE date BETWEEN '2016-07-31 0:00' AND '2016-08-01 0:00' GROUP BY domain, geo, lang ORDER BY blk_cnt ASC limit 40;
+-------------------------------------------------------------------------------------------------------------------------------------------------------+
| EXPLAIN |
+-------------------------------------------------------------------------------------------------------------------------------------------------------+
| Project [r0.domain, r0.geo, r0.lang, $0 / CAST(COALESCE($1,0) AS SIGNED) AS blk_cnt, CAST(COALESCE($2,0) AS SIGNED) AS cnt] |
| Top limit:40 |
| GatherMerge [SUM(r0.s) / CAST(COALESCE(SUM(r0.c),0) AS SIGNED)] partitions:all est_rows:40 |
| Project [r0.domain, r0.geo, r0.lang, s / CAST(COALESCE(c,0) AS SIGNED) AS blk_cnt, CAST(COALESCE(cnt_1,0) AS SIGNED) AS cnt, s, c, cnt_1] est_rows:40 |
| TopSort limit:40 [SUM(r0.s) / CAST(COALESCE(SUM(r0.c),0) AS SIGNED)] |
| HashGroupBy [SUM(r0.s) AS s, SUM(r0.c) AS c, SUM(r0.cnt) AS cnt_1] groups:[r0.domain, r0.geo, r0.lang] |
| TableScan r0 storage:list stream:no |
| Repartition [domains.domain, domains.geo, domains.lang, cnt, s, c] AS r0 shard_key:[domain, geo, lang] est_rows:40 est_select_cost:144350216 |
| HashGroupBy [COUNT(*) AS cnt, SUM(domains.blk_cnt) AS s, COUNT(domains.blk_cnt) AS c] groups:[domains.domain, domains.geo, domains.lang] |
| Filter [domains.date >= '2016-07-31 0:00' AND domains.date <= '2016-08-01 0:00'] |
| ColumnStoreScan scan_js_data.domains, KEY date (date, geo, lang, domain) USING CLUSTERED COLUMNSTORE est_table_rows:72175108 est_filtered:18043777 |
+-------------------------------------------------------------------------------------------------------------------------------------------------------+
After applying the recommendations:
time of the original query: 5 s
with the timestamp optimization: 3.7 s
with timestamp + shard key: 2.6 s
Thank you very much!

Executing the GROUP BY is probably the most expensive part of this query. Using a shard key that matches the GROUP BY, i.e. SHARD KEY (domain, geo, lang), will allow the GROUP BY to be executed locally on each partition, avoiding the Repartition step visible in the plan, so it runs faster.
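For reference, a minimal sketch of the table recreated with a matching shard key (this assumes rebuilding the table and reloading the data is acceptable; the character-set clauses are omitted for brevity):
CREATE TABLE `domains` (
`date` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
`geo` varchar(100) DEFAULT NULL,
`lang` char(5) DEFAULT NULL,
`browser` varchar(20) DEFAULT NULL,
`os` varchar(20) DEFAULT NULL,
`device` varchar(20) DEFAULT NULL,
`domain` varchar(200) DEFAULT NULL,
`ref` varchar(200) DEFAULT NULL,
`blk_cnt` int(11) DEFAULT NULL,
-- shard key matching the GROUP BY columns so the aggregation stays local to each partition
SHARD KEY (`domain`, `geo`, `lang`),
KEY `date` (`date`, `geo`, `lang`, `domain`) USING CLUSTERED COLUMNSTORE
);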

Is it possible to create a PERSISTED column that's made up of an array of specific JSON values and if so how?
Simple Example (json column named data):
{ name: "Jerry", age: 91, mother: "Janet", father: "Eustace" }
Persisted Column Hopeful (assuming json column is called 'data'):
ALTER TABLE tablename ADD parents [ data::$mother, data::$father ] AS PERSISTED JSON;
Expected Output
| data (json) | parents (persisted json) |
| -------------------------------------------------------------- | ------------------------- |
| { name: "Jerry", age: 91, mother: "Janet", father: "Eustace" } | [ "Janet", "Eustace" ] |
| { name: "Eustace", age: 106, mother: "Jane" } | [ "Jane" ] |
| { name: "Jim", age: 54, mother: "Rachael", father: "Dom" } | [ "Rachael", "Dom ] |
| -------------------------------------------------------------- | ------------------------- |
The above doesn't work, but hopefully it conveys what I'm trying to accomplish.
There is no PERSISTED ARRAY data type for columns, but there is a JSON column type that can store arrays.
For example:
-- The existing table
create table tablename (
id int primary key AUTO_INCREMENT
);
-- Add the new JSON column
ALTER TABLE tablename ADD column parents JSON;
-- Insert data into the table
INSERT INTO tablename (parents) VALUES
('[ "Janet", "Eustace" ]'),
('[ "Jane" ]');
-- Select table based on matches in the JSON column
select *
from tablename
where JSON_ARRAY_CONTAINS_STRING(parents, 'Jane');
-- Change data in the JSON column
update tablename
set parents = JSON_ARRAY_PUSH_STRING(parents, 'Jon')
where JSON_ARRAY_CONTAINS_STRING(parents, 'Jane');
-- Show changed data
select *
from tablename
where JSON_ARRAY_CONTAINS_STRING(parents, 'Jane');
Check out more examples of pushing and selecting JSON data in the docs at https://docs.memsql.com/v7.0/concepts/json-guide/
Here is a sample table definition where I do something similar with customer and event:
CREATE TABLE `eventsext2` (
`data` JSON COLLATE utf8_bin DEFAULT NULL,
`memsql_insert_time` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP,
`customer` as data::$custID PERSISTED text CHARACTER SET utf8 COLLATE utf8_general_ci,
`event` as data::$event PERSISTED text CHARACTER SET utf8 COLLATE utf8_general_ci,
customerevent as concat(data::$custID,", ",data::$event) persisted text,
`generator` as data::$genID PERSISTED text CHARACTER SET utf8 COLLATE utf8_general_ci,
`latitude` as (substr(data::$longlat from (instr(data::$longlat,'|')+1))) PERSISTED decimal(21,18),
`longitude` as (substr(data::$longlat from 1 for (instr(data::$longlat,'|')-1))) PERSISTED decimal(21,18),
`location` as concat('POINT(',latitude,' ',longitude,')') PERSISTED geographypoint,
KEY `memsql_insert_time` (`memsql_insert_time`)
/*!90618 , SHARD KEY () */
) /*!90623 AUTOSTATS_CARDINALITY_MODE=OFF, AUTOSTATS_HISTOGRAM_MODE=OFF */ /*!90623 SQL_MODE='STRICT_ALL_TABLES' */;
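Following the same data::$field pattern, the question's parents could also be exposed as individual persisted columns rather than an array, which keeps filtering simple. A sketch (the column names are illustrative, and this assumes your MemSQL version supports adding persisted computed columns via ALTER TABLE; otherwise declare them at CREATE TABLE time as in the example above):
-- Hypothetical: expose each parent as its own persisted column extracted from data
ALTER TABLE tablename ADD COLUMN mother AS data::$mother PERSISTED text;
ALTER TABLE tablename ADD COLUMN father AS data::$father PERSISTED text;
-- Filtering then becomes a plain column comparison
SELECT * FROM tablename WHERE mother = 'Janet' OR father = 'Janet';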
Though not your question, denormalizing this table into two tables might be a good choice:
create table parents (
id int primary key auto_increment,
tablenameid int not null,
name varchar(20),
type int not null -- 1=Father, 2=Mother, ideally a foreign key to a lookup table
);
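For example, with the denormalized design a lookup by an individual parent becomes a plain join; a sketch using the hypothetical type codes above:
-- Find all rows whose mother is 'Jane' (type 2 = Mother in the sketch above)
SELECT t.*
FROM tablename t
JOIN parents p ON p.tablenameid = t.id
WHERE p.type = 2 AND p.name = 'Jane';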

MemSQL shows slow query performance when using ORDER BY

I have the following query:
SELECT * FROM scheme.table cont WHERE game_id = 'some-game-id' AND event_arrival_time BETWEEN '2019-12-02 00:00:00' AND '2019-12-31 23:59:59' ORDER BY event_arrival_time
I get a response time of almost 30 seconds.
For the same query, if I remove ORDER BY event_arrival_time, I get the response in a few seconds.
This is the create table query:
CREATE TABLE `cont_event` (
`event_id` varchar(255) CHARACTER SET utf8 COLLATE utf8_general_ci DEFAULT NULL,
`action` varchar(20) CHARACTER SET utf8 COLLATE utf8_general_ci DEFAULT NULL,
`correlation_id` varchar(255) CHARACTER SET utf8 COLLATE utf8_general_ci DEFAULT NULL,
`status` varchar(255) CHARACTER SET utf8 COLLATE utf8_general_ci DEFAULT NULL,
`event_arrival_time` datetime DEFAULT NULL,
`create_time` timestamp(6) NOT NULL DEFAULT CURRENT_TIMESTAMP,
`create_ts` bigint(20) DEFAULT NULL,
`operator_id` varchar(255) CHARACTER SET utf8 COLLATE utf8_general_ci DEFAULT NULL,
`game_id` varchar(255) CHARACTER SET utf8 COLLATE utf8_general_ci DEFAULT NULL,
`player_id` varchar(255) CHARACTER SET utf8 COLLATE utf8_general_ci DEFAULT NULL,
`segment_code` varchar(255) CHARACTER SET utf8 COLLATE utf8_general_ci DEFAULT NULL,
`bet_amount_original` decimal(15,4) DEFAULT NULL,
`bet_amount_converted` decimal(15,4) DEFAULT NULL,
`cont_amount_player` decimal(15,4) DEFAULT NULL,
`cont_amount_operator` decimal(15,4) DEFAULT NULL,
`cont_amount_total` decimal(15,4) DEFAULT NULL,
`operator_income` decimal(15,4) DEFAULT NULL,
`cont_amount_jackpot` decimal(15,4) DEFAULT NULL,
`original_currency` varchar(20) CHARACTER SET utf8 COLLATE utf8_general_ci DEFAULT NULL,
`base_currency` varchar(20) CHARACTER SET utf8 COLLATE utf8_general_ci DEFAULT NULL,
`currency_rate` decimal(20,6) DEFAULT NULL,
`operator_game_code` varchar(255) CHARACTER SET utf8 COLLATE utf8_general_ci DEFAULT NULL,
`funnel_id` varchar(255) CHARACTER SET utf8 COLLATE utf8_general_ci DEFAULT NULL,
`segment_name` varchar(255) CHARACTER SET utf8 COLLATE utf8_general_ci DEFAULT NULL,
`operator_game_name` varchar(255) CHARACTER SET utf8 COLLATE utf8_general_ci DEFAULT NULL,
`description` varchar(255) CHARACTER SET utf8 COLLATE utf8_general_ci DEFAULT NULL,
`extra_fields` longtext CHARACTER SET utf8 COLLATE utf8_general_ci,
`jackpot_game_name` varchar(255) CHARACTER SET utf8 COLLATE utf8_general_ci DEFAULT NULL,
`game_version` varchar(16) CHARACTER SET utf8 COLLATE utf8_general_ci DEFAULT NULL,
`event_type` varchar(50) CHARACTER SET utf8 COLLATE utf8_general_ci NOT NULL,
`event` JSON COLLATE utf8_bin,
KEY `operator_id` (`event_arrival_time`,`action`,`operator_id`,`game_id`,`correlation_id`) /*!90619 USING CLUSTERED COLUMNSTORE */
/*!90618 , SHARD KEY () */
) /*!90621 AUTOSTATS_ENABLED=TRUE */
I do have an index on: event_arrival_time, action, operator_id, game_id, correlation_id.
When running a MemSQL profile I can see the filter works fine (see attached).
I am not sure what I am missing. Any suggestions for optimization?
Here is your first query repeated:
SELECT *
FROM scheme.table cont
WHERE
game_id = 'some-game-id' AND
event_arrival_time BETWEEN '2019-12-02 00:00:00' AND '2019-12-31 23:59:59'
ORDER BY
event_arrival_time
The index you currently have defined looks either wrong or at least sub-optimal to me. Try the following index instead:
CREATE INDEX idx ON cont (event_arrival_time, game_id);
Ideally, MemSQL would be able to scan the above index, in the order of the event arrival time, in the range you defined, and retrieve all matching records.
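If you add that index (or otherwise change the key), it is worth re-running EXPLAIN to confirm that the ordering is satisfied by the index rather than a separate sort step. A quick check, reusing the placeholder names from the question:
EXPLAIN SELECT * FROM scheme.table cont
WHERE game_id = 'some-game-id'
AND event_arrival_time BETWEEN '2019-12-02 00:00:00' AND '2019-12-31 23:59:59'
ORDER BY event_arrival_time;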

Updating to empty set

I just created a new column for my table
alter table user add (questions set<timeuuid>);
Now the table looks like
user (
google_id text PRIMARY KEY,
date_of_birth timestamp,
display_name text,
joined timestamp,
last_seen timestamp,
points int,
questions set<timeuuid>
)
Then I tried to update all those null values to empty sets, by doing
update user set questions = {} where google_id = ?;
for each google id.
However they are still null.
How can I fill that column with empty sets?
A set, list, or map needs to have at least one element because an
empty set, list, or map is stored as a null set.
source
Also, this might be helpful if you're using a client (java for instance).
I've learnt that there's not really such a thing as an empty set, or list, etc.
These display as null in cqlsh.
However, you can still add elements to them, e.g.
> select * from id_set;
set_id | set_content
-----------------------+---------------------------------
104649882895086167215 | null
105781005288147046623 | null
> update id_set set set_content = set_content + {'apple','orange'} where set_id = '105781005288147046623';
> select * from id_set;
set_id | set_content
-----------------------+---------------------------------
104649882895086167215 | null
105781005288147046623 | { 'apple', 'orange' }
So even though it displays as null you can think of it as already containing the empty set.
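Applied to the original user table, that means you can skip the `= {}` update entirely and just add the first element when it arrives. A sketch (the timeuuid and google_id values here are placeholders):
-- Adding an element to the "null" set works directly; no initialization needed
UPDATE user SET questions = questions + {50554d6e-29bb-11e5-b345-feff819cdc9f} WHERE google_id = 'some-google-id';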

Cassandra - query not returning result with frozen type

I have created a Cassandra table using a few primitive data types and a frozen type, address_type.
CREATE TYPE address_type (
first_name text,
last_name text,
address_line1 text,
address_line2 text
);
CREATE TABLE user (
id text,
active_profile boolean,
addresses frozen<address_type>,
PRIMARY KEY (id)
);
And I indexed the addresses column because I want to select rows based on address_type.first_name.
CREATE INDEX ON user (addresses) ;
Finally, this is my query, which returns 0 rows.
select * from user where addresses = {first_name:'test2'};
When I tried
select * from "user" where addresses > {first_name:'test2'};
Which resulted in
code=2200 [Invalid query] message="No secondary indexes on the restricted columns support the provided operators: 'addresses > <value>'"
Can someone help me? Where am I going wrong here?
Let's insert some data:
cqlsh:test> INSERT INTO user (id , addresses) VALUES ('user_0', {first_name:'Ashraful', last_name:'Islam'});
cqlsh:test> INSERT INTO user (id , addresses) VALUES ('user_1', {first_name:'Ashraful'});
cqlsh:test> SELECT * FROM user ;
id | active_profile | addresses
--------+----------------+----------------------------------------------------------------------------------------
user_1 | null | {first_name: 'Ashraful', last_name: null, address_line1: null, address_line2: null}
user_0 | null | {first_name: 'Ashraful', last_name: 'Islam', address_line1: null, address_line2: null}
Since addresses is a frozen type, you can't query on a piece of the frozen field. You have to provide the full value of addresses.
Example :
cqlsh:test> SELECT * FROM user WHERE addresses = {first_name:'Ashraful', last_name:'Islam'} ;
id | active_profile | addresses
--------+----------------+----------------------------------------------------------------------------------------
user_0 | null | {first_name: 'Ashraful', last_name: 'Islam', address_line1: null, address_line2: null}
(1 rows)
cqlsh:test> SELECT * FROM user WHERE addresses = {first_name: 'Ashraful'} ;
id | active_profile | addresses
--------+----------------+-------------------------------------------------------------------------------------
user_1 | null | {first_name: 'Ashraful', last_name: null, address_line1: null, address_line2: null}
(1 rows)
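If you need to search by first_name alone, one common workaround is to duplicate that field into its own column and index that instead of the frozen UDT. A sketch (the application would need to keep first_name in sync with addresses):
-- Duplicate the searchable field into a regular column and index it
ALTER TABLE user ADD first_name text;
CREATE INDEX ON user (first_name);
-- Equality queries on the indexed column now work
SELECT * FROM user WHERE first_name = 'Ashraful';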

MemSQL Weird Insert/Update Behaviour

We were using single-node MemSQL and everything was working fine, but when we tried to move our MemSQL setup to multi-node, the insert/update statements started behaving very weirdly.
My table structures are like below (I have removed many columns to keep it short):
CREATE /*!90618 REFERENCE*/ TABLE `fact_orderitem_hourly_release_update`
(
`order_id` int(11) NOT NULL DEFAULT '0',
`customer_login` varchar(128) CHARACTER SET utf8 COLLATE utf8_general_ci DEFAULT NULL,
`warehouse_id` int(11) DEFAULT NULL,
`city` varchar(100) CHARACTER SET utf8 COLLATE utf8_general_ci DEFAULT NULL,
`store_id` int(11) DEFAULT NULL,
PRIMARY KEY (`order_id`)
);
CREATE TABLE `fact_orderitem_hourly_scale` (
`order_id` int(11) NOT NULL DEFAULT '0',
`order_group_id` int(11) NOT NULL DEFAULT '0',
`item_id` int(11) NOT NULL,
`sku_id` int(11) NOT NULL DEFAULT '0',
`sku_code` varchar(45) CHARACTER SET utf8 COLLATE utf8_general_ci DEFAULT NULL,
`po_type` varchar(20) CHARACTER SET utf8 COLLATE utf8_general_ci DEFAULT NULL,
`store_order_id` varchar(50) CHARACTER SET utf8 COLLATE utf8_general_ci DEFAULT NULL,
`bi_last_modified_on` timestamp NOT NULL DEFAULT '0000-00-00 00:00:00.000000',
PRIMARY KEY (`item_id`,`sku_id`),
/*!90618 SHARD */ KEY `sku_id` (`sku_id`),
KEY `idx_fact_orderitem_hourly_lmd` (`bi_last_modified_on`),
KEY `idx_fact_orderitem_hourly_ord` (`order_id`),
KEY `idx_order_group_id` (`order_group_id`),
KEY `idx_store_order_id` (`store_order_id`)
);
My load script:
mysql -h$LiveMemSQL_DB -u$LiveMemSQL_USER --password=$LiveMemSQL_PASS -P$LiveMemSQL_PORT --verbose reports_and_summary < /home/titan/brand_catalog/upsert_memsql_orl_update.sql
Contents of the .sql file:
--start of .sql file
TRUNCATE TABLE reports_and_summary.fact_orderitem_hourly_release_update;
#Load data into staging
LOAD DATA LOCAL INFILE '/myntra/redshift/delta_files/live_scale_order_release_upd.txt' INTO TABLE reports_and_summary.fact_orderitem_hourly_release_update LINES TERMINATED BY '\n';
#Insert/Update statement
INSERT INTO reports_and_summary.fact_orderitem_hourly_scale
(
item_id,
sku_id,
customer_login,
order_status,
is_realised,
is_shipped,
shipping_charge,
gift_charge,
warehouse_id,
city,
store_id
)
select
fo.item_id,
fo.sku_id,
fr.customer_login,
fr.order_status,
fr.is_realised,
fr.is_shipped,
fr.shipping_charge,
fr.gift_charge,
fr.warehouse_id,
fr.city,
fr.store_id
from fact_orderitem_hourly_release_update fr
join fact_orderitem_hourly_scale fo
on fr.order_id=fo.order_id
ON duplicate key update
customer_login=values(customer_login),
order_status=values(order_status),
is_realised=values(is_realised),
is_shipped=values(is_shipped),
shipping_charge=values(shipping_charge),
gift_charge=values(gift_charge),
warehouse_id=values(warehouse_id),
city=values(city),
store_id=values(store_id);
--End .sql file
When I trigger the above .sql through the mysql command-line client, it works sometimes and fails many other times. Sometimes, if I execute the same .sql file continuously 5-10 times, the updates take effect in one of those runs. Sometimes, for example, there are 3 records with order_id 101 and status SHIPPED, and we get an update in the merge table saying the order status has changed to DELIVERED; ideally the status of all 3 rows should change to DELIVERED, but only one or two of the rows associated with the order get updated. However, if I execute the same .sql file content through MySQL Workbench, it works perfectly fine. I may sound stupid, but this is what is happening, and I have been struggling with this weird behaviour for the last 2 days.
Please find below a screencast where I captured this behaviour: https://www.youtube.com/watch?v=v2HN-n4V0MI&feature=youtu.be
Your staging table is a reference table, and writes to reference tables are replicated asynchronously to the cluster. This is why sometimes your updates work as expected and sometimes they don't.
You can either:
wait for a bit after writing into the reference table, or
make the staging table non-reference (see the sketch below).
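A minimal sketch of the second option, using only the columns shown in the question (this assumes the staging table can be dropped and recreated):
-- Recreate the staging table without the REFERENCE clause so it is a regular
-- sharded table; with no explicit shard key, the primary key is used for sharding,
-- and writes are immediately visible to the INSERT ... SELECT that follows.
CREATE TABLE `fact_orderitem_hourly_release_update`
(
`order_id` int(11) NOT NULL DEFAULT '0',
`customer_login` varchar(128) CHARACTER SET utf8 COLLATE utf8_general_ci DEFAULT NULL,
`warehouse_id` int(11) DEFAULT NULL,
`city` varchar(100) CHARACTER SET utf8 COLLATE utf8_general_ci DEFAULT NULL,
`store_id` int(11) DEFAULT NULL,
PRIMARY KEY (`order_id`)
);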
