Percona Toolkit query digest not reading all queries in slow query log

I have a collection of slow query logs from RDS that I put together into a single file. I'm trying to run it through pt-query-digest, following the instructions here, but it reads the whole file as a single query.
Command:
pt-query-digest --group-by fingerprint --order-by Query_time:sum collider-slow-query.log > slow-query-analyze.txt
Output, showing that it only analyzed one query:
# Overall: 1 total, 1 unique, 0 QPS, 0x concurrency ______________________
Here's just a short snippet containing 2 queries from the file being analyzed to demonstrate there are many queries:
2019-05-03T20:44:21.828Z # Time: 2019-05-03T20:44:21.828954Z
# User@Host: username[username] @ [ipaddress] Id: 19
# Query_time: 17.443164 Lock_time: 0.000145 Rows_sent: 5 Rows_examined: 121380
SET timestamp=1556916261;
SELECT wp_posts.ID FROM wp_posts LEFT JOIN wp_term_relationships ON (wp_posts.ID = wp_term_relationships.object_id) WHERE 1=1 AND wp_posts.ID NOT IN (752921) AND (
wp_term_relationships.term_taxonomy_id IN (40)
) AND wp_posts.post_type = 'post' AND ((wp_posts.post_status = 'publish')) GROUP BY wp_posts.ID ORDER BY wp_posts.post_date DESC LIMIT 0, 5;
2019-05-03T20:44:53.597Z # Time: 2019-05-03T20:44:53.597137Z
# User@Host: username[username] @ [ipaddress] Id: 77
# Query_time: 35.757909 Lock_time: 0.000054 Rows_sent: 2 Rows_examined: 199008
SET timestamp=1556916293;
SELECT post_id, meta_value FROM wp_postmeta
WHERE meta_key = '_wp_attached_file'
AND meta_value IN ( 'family-guy-vestigial-peter-slice.jpg','2015/08/bobs-burgers-image.jpg','2015/08/bobs-burgers-image.jpg' );
Why isn't it reading all the queries? Is there a problem with my concatenation?

I had the same problem. It turns out that an additional timestamp is stored in the log (through this variable: https://dev.mysql.com/doc/refman/8.0/en/server-system-variables.html#sysvar_log_timestamps).
This makes pt-query-digest unable to see the individual queries. It should be easily fixed by either turning off the timestamp variable or removing the timestamps:
2019-10-28T11:17:18.412 # Time: 2019-10-28T11:17:18.412214
# User@Host: foo[foo] @ [192.168.8.175] Id: 467836
# Query_time: 5.839596 Lock_time: 0.000029 Rows_sent: 1 Rows_examined: 0
use foo;
SET timestamp=1572261432;
SELECT COUNT(*) AS `count` FROM `foo`.`invoices` AS `Invoice` WHERE 1 = 1;
Once the leading timestamp (the 2019-10-28T11:17:18.412 part) is removed, it works again.
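If you'd rather not touch the server variable, stripping the leading timestamp before running pt-query-digest also works. A minimal sketch of my own (the file names are placeholders):

import re

# Drop the extra leading timestamp (e.g. "2019-10-28T11:17:18.412 ") that
# sits in front of each "# Time:" header, so pt-query-digest can split the
# entries again.
pattern = re.compile(r'^\d{4}-\d{2}-\d{2}T[\d:.]+Z?\s+(?=# Time:)')

with open('collider-slow-query.log') as src, \
        open('collider-slow-query.cleaned.log', 'w') as dst:
    for line in src:
        dst.write(pattern.sub('', line))

Then point pt-query-digest at the cleaned file as before.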

Related

Parquet data to AWS Redshift slow

I want to insert data from S3 parquet files into Redshift.
The parquet files come from a process that reads JSON files, flattens them out, and stores them as parquet; we use pandas dataframes to do it.
I tried two different approaches. The first one:
COPY schema.table
FROM 's3://parquet/provider/A/2020/11/10/11/'
IAM_ROLE 'arn:aws:iam::XXXX'
FORMAT AS PARQUET;
It returned:
Invalid operation: Spectrum Scan Error
error: Spectrum Scan Error
code: 15001
context: Unmatched number of columns between table and file. Table columns: 54, Data columns: 41
I understand the error, but I don't have an easy way to fix it.
If we have to reload data from two months ago, the file will only have, for example, 40 columns, because at that time we only needed that data, but the table has since grown to 50 columns.
So we need either something automatic, or at least a way to specify the columns.
Then I tried another option, which is to do a SELECT with AWS Redshift Spectrum. We know how many columns the table has from the system tables, and we know the structure of the file by loading it again into a pandas dataframe. Then I can combine both to get the exact same structure and do the INSERT.
It works fine but it is slow.
The select looks like:
SELECT fields
FROM schema.table
WHERE partition_0 = 'A'
AND partition_1 = '2020'
AND partition_2 = '11'
AND partition_3 = '10'
AND partition_4 = '11';
The partitions are already added as I checked using:
select *
from SVV_EXTERNAL_PARTITIONS
where tablename = 'table'
and schemaname = 'schema'
and values = '["A","2020","11","10","11"]'
limit 1;
I have around 170 files per hour, in both the JSON and the parquet paths. The process lists all the files in the S3 JSON path, processes them, and stores the results in the S3 parquet path.
I don't know how to improve the execution time: the INSERT from parquet takes 2 minutes per partition_0 value. I tried the SELECT alone to make sure it's not an INSERT issue, and it also takes 1:50, so the problem is reading the data from S3.
If I select a non-existent value for partition_0 it again takes around 2 minutes, so there is some kind of problem accessing the data. I don't know whether partition_0 and the other names are being treated as Hive partitioning format.
Edit:
AWS Glue Crawler table specification
Edit: Add SVL_S3QUERY_SUMMARY results
step:1
starttime: 2020-12-13 07:13:16.267437
endtime: 2020-12-13 07:13:19.644975
elapsed: 3377538
aborted: 0
external_table_name: S3 Scan schema_table
file_format: Parquet
is_partitioned: t
is_rrscan: f
is_nested: f
s3_scanned_rows: 1132
s3_scanned_bytes: 4131968
s3query_returned_rows: 1132
s3query_returned_bytes: 346923
files: 169
files_max: 34
files_avg: 28
splits: 169
splits_max: 34
splits_avg: 28
total_split_size: 3181587
max_split_size: 30811
avg_split_size: 18825
total_retries: 0
max_retries: 0
max_request_duration: 360496
avg_request_duration: 172371
max_request_parallelism: 10
avg_request_parallelism: 8.4
total_slowdown_count: 0
max_slowdown_count: 0
Add query checks
Query: 37005074 (SELECT in localhost using pycharm)
Query: 37005081 (INSERT in AIRFLOW AWS ECS service)
STL_QUERY shows that both queries take around 2 min
select * from STL_QUERY where query=37005081 OR query=37005074 order by query asc;
Query: 37005074 2020-12-14 07:44:57.164336,2020-12-14 07:46:36.094645,0,0,24
Query: 37005081 2020-12-14 07:45:04.551428,2020-12-14 07:46:44.834257,0,0,3
STL_WLM_QUERY shows no queue time; it is all exec time
select * from STL_WLM_QUERY where query=37005081 OR query=37005074;
Query: 37005074 Queue time 0 Exec time: 98924036 est_peak_mem:0
Query: 37005081 Queue time 0 Exec time: 100279214 est_peak_mem:2097152
SVL_S3QUERY_SUMMARY shows that the query takes 3-4 seconds in S3
select * from SVL_S3QUERY_SUMMARY where query=37005081 OR query=37005074 order by endtime desc;
Query: 37005074 2020-12-14 07:46:33.179352,2020-12-14 07:46:36.091295
Query: 37005081 2020-12-14 07:46:41.869487,2020-12-14 07:46:44.807106
stl_return Comparing the min start to the max end for each query gives 3-4 seconds, as SVL_S3QUERY_SUMMARY says
select * from stl_return where query=37005081 OR query=37005074 order by query asc;
Query:37005074 2020-12-14 07:46:33.175320 2020-12-14 07:46:36.091295
Query:37005081 2020-12-14 07:46:44.817680 2020-12-14 07:46:44.832649
I don't understand why SVL_S3QUERY_SUMMARY shows just 3-4 seconds to run the query in Spectrum while STL_WLM_QUERY says the execution time is around 2 minutes, which is what I see in my localhost and production environments, nor how to improve it, given that stl_return shows the query returns very little data.
EXPLAIN
XN Partition Loop (cost=0.00..400000022.50 rows=10000000000 width=19608)
-> XN Seq Scan PartitionInfo of parquet.table (cost=0.00..22.50 rows=1 width=0)
Filter: (((partition_0)::text = 'A'::text) AND ((partition_1)::text = '2020'::text) AND ((partition_2)::text = '12'::text) AND ((partition_3)::text = '10'::text) AND ((partition_4)::text = '12'::text))
-> XN S3 Query Scan parquet (cost=0.00..200000000.00 rows=10000000000 width=19608)
" -> S3 Seq Scan parquet.table location:""s3://parquet"" format:PARQUET (cost=0.00..100000000.00 rows=10000000000 width=19608)"
svl_query_report
select * from svl_query_report where query=37005074 order by segment, step, elapsed_time, rows;
Just like in your other question you need to change your keypaths on your objects. It is not enough to just have "A" in the keypath - it needs to be "partition_0=A". This is how Spectrum knows that the object is or isn't in the partition.
Also, you need to make sure that your objects are of a reasonable size, or it will be slow if you need to scan many of them. It takes time to open each object, and if you have many small objects the time to open them can be longer than the time to scan them. This is only an issue if you need to scan many, many files.
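As a rough sketch of the keypath change (my own, using boto3; the bucket, prefix and partition names are lifted from the question but should be treated as placeholders), each object under .../A/2020/11/10/11/ gets copied to a Hive-style .../partition_0=A/partition_1=2020/partition_2=11/partition_3=10/partition_4=11/ keypath:

import boto3

s3 = boto3.client('s3')
bucket = 'parquet'      # placeholder bucket name from the question's LOCATION
prefix = 'provider/'    # placeholder prefix
names = ['partition_0', 'partition_1', 'partition_2', 'partition_3', 'partition_4']

paginator = s3.get_paginator('list_objects_v2')
for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
    for obj in page.get('Contents', []):
        key = obj['Key']
        parts = key[len(prefix):].split('/')
        # Expect provider/A/2020/11/10/11/<file>; skip anything else or
        # anything already converted.
        if len(parts) != len(names) + 1 or '=' in key:
            continue
        new_key = prefix + '/'.join(
            '{}={}'.format(n, v) for n, v in zip(names, parts[:-1])
        ) + '/' + parts[-1]
        s3.copy_object(Bucket=bucket, Key=new_key,
                       CopySource={'Bucket': bucket, 'Key': key})

After the copy, the external table's partitions need to be re-registered against the new keypaths (for example by re-running the Glue crawler).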

force replication of replicated tables

Some of my tables are of type REPLICATE. I would like these tables to actually be replicated (not pending) before I start querying my data. This will help me avoid data movement.
I have a script, which I found online, that runs in a loop and does a SELECT TOP 1 on all the tables which are set for replication, but sometimes the script runs for hours. It seems the server sometimes won't trigger replication even if you do a SELECT TOP 1 from foo.
How can I force SQL Data Warehouse to complete the replication?
The script looks something like this:
begin
CREATE TABLE #tbl
WITH
( DISTRIBUTION = ROUND_ROBIN
)
AS
SELECT
ROW_NUMBER() OVER(
ORDER BY
(
SELECT
NULL
)) AS Sequence
, CONCAT('SELECT TOP(1) * FROM ', s.name, '.', t.[name]) AS sql_code
FROM sys.pdw_replicated_table_cache_state AS p
JOIN sys.tables AS t
ON t.object_id = p.object_id
JOIN sys.schemas AS s
ON t.schema_id = s.schema_id
WHERE p.[state] = 'NotReady';
DECLARE @nbr_statements INT=
(
SELECT
COUNT(*)
FROM #tbl
), @i INT= 1;
WHILE @i <= @nbr_statements
BEGIN
DECLARE @sql_code NVARCHAR(4000)= (SELECT
sql_code
FROM #tbl
WHERE Sequence = @i);
EXEC sp_executesql @sql_code;
SET @i+=1;
END;
DROP TABLE #tbl;
SET @i = 0;
WHILE
(
SELECT TOP (1)
p.[state]
FROM sys.pdw_replicated_table_cache_state AS p
JOIN sys.tables AS t
ON t.object_id = p.object_id
JOIN sys.schemas AS s
ON t.schema_id = s.schema_id
WHERE p.[state] = 'NotReady'
) = 'NotReady'
BEGIN
IF @i % 100 = 0
BEGIN
RAISERROR('Replication in progress' , 0, 0) WITH NOWAIT;
END;
SET @i = @i + 1;
END;
END
Henrik, if 'select top 1' doesn't trigger a replicated table build, then that would be a defect. Please file a support ticket.
Without looking at your system, it is impossible to know exactly what is going on. Here are a couple of things that could be factoring into the extended build time:
The replicated tables are large (size, not necessarily rows) requiring long build times.
There are a lot of secondary indexes on the replicated table requiring long build times.
Replicated table builds require staticrc20 (2 concurrency slots). If the concurrency slots are not available, the build will queue behind other running queries.
The replicated tables are constantly being modified with inserts, updates and deletes. Modifications require the table to be built again.
The best way is to run a command like this as part of the job which creates/updates the table:
select top 1 * from <table>
That will force its redistribution at the correct time, without the slow loop through the stored procedure.
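As an illustration only (pyodbc, the connection details and the table names below are mine, not part of the original answer), the load job itself can fire that statement for each replicated table it has just rebuilt:

import pyodbc

# Placeholder connection details for the SQL Data Warehouse instance.
conn = pyodbc.connect(
    "Driver={ODBC Driver 17 for SQL Server};"
    "Server=yourserver.database.windows.net;Database=yourdw;"
    "UID=youruser;PWD=yourpassword",
    autocommit=True)
cursor = conn.cursor()

# Tables this job has just created or updated (placeholder names).
for table in ['dbo.DimCustomer', 'dbo.DimProduct']:
    # SELECT TOP 1 touches the table and triggers the replicated cache build.
    cursor.execute("SELECT TOP (1) * FROM {}".format(table))
    cursor.fetchall()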

cassandra-stress: Unable to find table <keyspace name>.<tablename> at maybeLoadSchemaInfo (StressProfile.java)

Configurations:-
3 node cluster
Node 1 --> 172.27.21.16(Seed Node)
Node 2 --> 172.27.21.18
Node 3 --> 172.27.21.19
cassandra.yaml parameters for all the nodes:-
1) seeds: "172.27.21.16"
2) write_request_timeout_in_ms: 5000
3) listen_address: 172.27.21.1(6,8,9)
4) rpc_address: 172.27.21.1(6,8,9)
This is my code.yaml file:-
keyspace: prutorStress3node
keyspace_definition: |
CREATE KEYSPACE prutorStress3node WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 2};
table: code
table_definition: |
CREATE TABLE code (
id timeuuid,
assignment_id bigint,
user_id text,
contents text,
save_type smallint,
save_time timeuuid,
PRIMARY KEY(assignment_id, id)
) WITH CLUSTERING ORDER BY (id DESC)
columnspec:
- name: assignment_id
population: uniform(1..100000) # Assignment IDs from range 1-100000
- name: id
cluster: fixed(100) # Each assignment id will have at most 100 ids (maximum 100 codes allowed per student)
- name: user_id
size: fixed(50) # user_id comes from account table varchar(50)
- name: contents
size: uniform(1..2000)
- name: save_type
size: fixed(1) # Generated values between 1 to 4
population: uniform(1..4)
- name: save_time
insert:
partitions: fixed(1) # The number of partitions to update per batch
select: fixed(1)/100 # The ratio of rows each partition should insert as a proportion of the total possible rows for the partition (as defined by the clustering distribution columns)
batchtype: UNLOGGED # The type of CQL batch to use. Either LOGGED/UNLOGGED
queries:
query1:
cql: select id,contents,save_type,save_time from code where assignment_id = ?
fields: samerow
query2:
cql: select id,contents,save_type,save_time from code where assignment_id = ? LIMIT 10
fields: samerow
query3:
cql: SELECT contents FROM code WHERE id = ? # Create index for this query
fields: samerow
Following is my script, which I am running from 172.27.21.17 (a node on the same network but in a different cluster):-
#!/bin/bash
echo '***********************************'
echo Deleting old data from prutorStress3node
echo '***********************************'
cqlsh -e "DROP KEYSPACE prutorStress3node" 172.27.21.16
sleep 120
echo '***********************************'
echo Creating fresh data
echo '***********************************'
/bin/cassandra-stress user profile=./code.yaml n=1000000 no-warmup cl=quorum ops\(insert=1\) \
-rate threads=25 -node 172.27.21.16
sleep 120
Issue:-
When I run the script right after deleting the previous data and bringing those nodes back up, it works fine, but on every further run I hit the following error:-
Unable to find prutorStress3node.code
at org.apache.cassandra.stress.StressProfile.maybeLoadSchemaInfo(StressProfile.java:306)
at org.apache.cassandra.stress.StressProfile.maybeCreateSchema(StressProfile.java:273)
at org.apache.cassandra.stress.StressProfile.newGenerator(StressProfile.java:676)
at org.apache.cassandra.stress.StressProfile.printSettings(StressProfile.java:129)
at org.apache.cassandra.stress.settings.StressSettings.printSettings(StressSettings.java:383)
at org.apache.cassandra.stress.Stress.run(Stress.java:95)
at org.apache.cassandra.stress.Stress.main(Stress.java:62)
In StressProfile.java, I saw that the table metadata is being populated as NULL. I tried to make sense of the stack trace, but was not able to make anything of it. Please give me some direction as to what might have gone wrong.
Giving the keyspace and table name in lower case solved the issue for me. Previously I was using mixed case and I was getting the same error.
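For context (my own note, not part of the original answer): unquoted identifiers in CQL are folded to lower case, so a keyspace created as prutorStress3node is actually stored as prutorstress3node, which is presumably why the mixed-case name in code.yaml can't be found. You can confirm what is really stored with the same Python driver used elsewhere in these questions:

from cassandra.cluster import Cluster

# Seed node IP from the question.
cluster = Cluster(['172.27.21.16'])
session = cluster.connect()

# system_schema holds the names exactly as Cassandra stored them.
rows = session.execute(
    "SELECT keyspace_name, table_name FROM system_schema.tables "
    "WHERE keyspace_name = 'prutorstress3node'")
for row in rows:
    print(row.keyspace_name, row.table_name)   # -> prutorstress3node code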

Polybase - maximum reject threshold (0 rows) was reached while reading from an external source: 1 rows rejected out of total 1 rows processed

[Question from customer]
I have following data in a text file. Delimited by |
A | null , ZZ
C | D
When I run this query using HDInsight:
CREATE EXTERNAL TABLE myfiledata(
col1 string,
col2 string
)
row format delimited fields terminated by '|' STORED AS TEXTFILE LOCATION 'wasb://.....';
I get the following result as expected:
A null , ZZ
C D
But when I run the same query using SQL DW PolyBase, it throws an error:
Query aborted-- the maximum reject threshold (0 rows) was reached while reading from an external source: 1 rows rejected out of total 1 rows processed.
How do I fix this?
Here's my script in SQL DW:
-- Creating external data source (Azure Blob Storage)
CREATE EXTERNAL DATA SOURCE azure_storage1
WITH
(
TYPE = HADOOP
, LOCATION ='wasbs://....blob.core.windows.net'
, CREDENTIAL = ASBSecret
)
;
-- Creating external file format (delimited text file)
CREATE EXTERNAL FILE FORMAT text_file_format
WITH
(
FORMAT_TYPE = DELIMITEDTEXT
, FORMAT_OPTIONS (
FIELD_TERMINATOR ='|'
, USE_TYPE_DEFAULT = TRUE
)
)
;
-- Creating external table pointing to file stored in Azure Storage
CREATE EXTERNAL TABLE [Myfile]
(
Col1 varchar(5),
Col2 varchar(5)
)
WITH
(
LOCATION = '/myfile.txt'
, DATA_SOURCE = azure_storage1
, FILE_FORMAT = text_file_format
)
;
We’re currently working on a way to bubble up the reason for reject to the user.
In the meantime, here's what's happening:
The default number of rows allowed to fail schema matching is 0. This means that if even one of the rows you're loading from /myfile.txt doesn't match the schema, the whole query is aborted. In Hive, strings can accommodate an arbitrary number of characters, but varchars cannot. In this case it's failing on the varchar(5) for "null , ZZ" because that is more than 5 characters.
If you'd like to change the REJECT_VALUE in the CREATE EXTERNAL TABLE call, that will let the other row through – more info can be found here: https://msdn.microsoft.com/library/dn935021(v=sql.130).aspx
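For example (a sketch of my own, not from the original answer; pyodbc and the connection details are placeholders), the same external table created with a reject threshold of 1 lets the short row load while the long one is rejected:

import pyodbc

# Placeholder connection details for the SQL DW instance.
conn = pyodbc.connect(
    "Driver={ODBC Driver 17 for SQL Server};"
    "Server=yourserver.database.windows.net;Database=yourdw;"
    "UID=youruser;PWD=yourpassword",
    autocommit=True)

# Same definition as [Myfile], plus REJECT_TYPE/REJECT_VALUE so one bad row
# no longer aborts the query.
conn.execute("""
CREATE EXTERNAL TABLE [MyfileTolerant]
(
    Col1 varchar(5),
    Col2 varchar(5)
)
WITH
(
    LOCATION = '/myfile.txt'
    , DATA_SOURCE = azure_storage1
    , FILE_FORMAT = text_file_format
    , REJECT_TYPE = VALUE
    , REJECT_VALUE = 1
)
""")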
It's due to a dirty record for the respective file format; for example, in the case of parquet, if a column contains '' (an empty string) it won't work and will throw "Query aborted -- the maximum reject threshold ...".
[AZURE.NOTE] A query on an external table can fail with the error "Query aborted-- the maximum reject threshold was reached while reading from an external source". This indicates that your external data contains dirty records. A data record is considered 'dirty' if the actual data types/number of columns do not match the column definitions of the external table or if the data doesn't conform to the specified external file format. To fix this, ensure that your external table and external file format definitions are correct and your external data conform to these definitions. In case a subset of external data records is dirty, you can choose to reject these records for your queries by using the reject options in CREATE EXTERNAL TABLE DDL.

Cassandra terrible performance on select

I have a two-node Cassandra cluster. In order to test Cassandra I built a File table (Fid Integer, Sid Integer), where Fid is the key. I built an index on Sid. The insert rate is about 10,000 per second, but when I select from the table the performance is terrible, and even for a low limit like 1000 it generates an error. Below is my sample code:
from cassandra.cluster import Cluster
cluster = Cluster(['127.0.0.1'])
session = cluster.connect('myk')
rows = session.execute('SELECT * FROM File WHERE sid = 1 limit 1000')
for user_row in rows:
    print(user_row)
Error message is:
Traceback (most recent call last):
File "Test.py", line 5, in <module>
rows = session.execute('SELECT * FROM File WHERE sid = 1 limit 1000')
File "build\bdist.win32\egg\cassandra\cluster.py", line 1065, in execute
File "build\bdist.win32\egg\cassandra\cluster.py", line 2427, in result
cassandra.OperationTimedOut: errors={}, last_host=172.16.47.130
by changing
rows = session.execute('SELECT * FROM File WHERE sid = 1 limit 1000')
to
rows = session.execute('SELECT * FROM File WHERE sid = 1 limit 1000',timeout=20.0)
The error has gone, but why is the performance so slow for fetching 1,000 rows from an 800,000-record table? Any hints?
I built index on Sid
The key to the lack of performance here is your use of secondary indexes in place of what should be either a clustering key or part of a composite key. Secondary indexes in Cassandra are for assisting in full table scans (an expensive operation) for batch analytics or for early development testing. They are not analogous to relational indexes.
So if you want to execute queries like
rows = session.execute('SELECT * FROM File WHERE sid = 1 limit 1000')
then you need a table whose primary key is sid. If you would like to query based on FID as well, then you need two complementary tables, one keyed on FID and one on SID. At insert time you would place the information in both tables.
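A minimal sketch of that two-table layout with the same Python driver (the table names file_by_fid and file_by_sid are mine, purely for illustration):

from cassandra.cluster import Cluster

cluster = Cluster(['127.0.0.1'])
session = cluster.connect('myk')

# One table per query pattern: lookup by fid, and lookup by sid.
session.execute("""
    CREATE TABLE IF NOT EXISTS file_by_fid (
        fid int PRIMARY KEY,
        sid int
    )""")
session.execute("""
    CREATE TABLE IF NOT EXISTS file_by_sid (
        sid int,
        fid int,
        PRIMARY KEY (sid, fid)
    )""")

# Write every record to both tables at insert time.
insert_by_fid = session.prepare("INSERT INTO file_by_fid (fid, sid) VALUES (?, ?)")
insert_by_sid = session.prepare("INSERT INTO file_by_sid (sid, fid) VALUES (?, ?)")
fid, sid = 42, 1
session.execute(insert_by_fid, (fid, sid))
session.execute(insert_by_sid, (sid, fid))

# This now reads a single partition instead of hitting a secondary index
# on every node.
rows = session.execute('SELECT * FROM file_by_sid WHERE sid = 1 LIMIT 1000')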
