Postgres and 1000 simultaneous calls - Node.js

I have a PostgreSQL 11.7 server which is used 100% for local development only.
Hardware: 16-core CPU, 112 GB memory, 3 TB M.2 SSD. (It is running Ubuntu 18.04, but I get about the same speed on my Windows 10 laptop when I run the exact same query locally on it.)
The DB contains ~1,500 tables (all of the same structure).
Every call to the DB is custom and specific, so there is nothing to cache here.
From Node.js I execute a lot of simultaneous calls (via await Promise.all(all 1000 promises)) and afterwards perform a lot of different calculations.
Currently my stats look like this (max_connections set to the default of 100):
1 call ~ 100 ms
1,000 calls ~ 15,000 ms (15 ms/call)
I have tried changing various PostgreSQL settings, for example raising max_connections to 1,000, but nothing really seems to improve performance (and yes, I do remember to restart the PostgreSQL service every time I make a change).
How can I make the execution of the 1,000 simultaneous calls as fast as possible? Should I consider copying all the needed data to an in-memory database like Redis instead?
The DB table looks like this:
CREATE TABLE public.my_table1 (
    id int8 NOT NULL GENERATED ALWAYS AS IDENTITY,
    tradeid int8 NOT NULL,
    matchdate timestamptz NULL,
    price float8 NOT NULL,
    "size" float8 NOT NULL,
    issell bool NOT NULL,
    CONSTRAINT my_table1_pkey PRIMARY KEY (id)
);
CREATE INDEX my_table1_matchdate_idx ON public.my_table1 USING btree (matchdate);
CREATE UNIQUE INDEX my_table1_tradeid_idx ON public.my_table1 USING btree (tradeid);
The simple test query - fetch 30 minutes of data between two timestamps:
select * from my_table1 where '2020-01-01 00:00' <= matchdate AND matchdate < '2020-01-01 00:30'
total_size_incl_toast_and_indexes 21 GB total table size --> 143 bytes/row
live_rows_in_text_representation 13 GB total table size --> 89 bytes/row
My NodeJS code looks like this:
const startTime = new Date();
let allDBcalls = [];
let totalRawTrades = 0;
(async () => {
    for (let i = 0; i < 1000; i++) {
        allDBcalls.push(selectQuery.getTradesBetweenDates(tickers, new Date('2020-01-01 00:00'), new Date('2020-01-01 00:30')).then(function (rawTradesPerTicker) {
            totalRawTrades += rawTradesPerTicker["data"].length;
        }));
    }
    await Promise.all(allDBcalls);
    _wl.info(`Fetched ${totalRawTrades} raw-trades in ${new Date().getTime() - startTime} ms!!`);
})();
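For context on how those 1,000 promises map onto database connections, here is a minimal sketch of the query layer, assuming node-postgres (pg) and a shared pool. getTradesBetweenDates matches the name above, but its signature is simplified to a single table; the pool settings and database name are illustrative assumptions, not the question's actual code:
const { Pool } = require('pg');

// One shared pool for the whole process. `max` caps how many of the
// 1,000 queries actually run in parallel; the remaining promises queue
// inside the pool instead of opening new server connections.
const pool = new Pool({
    host: 'localhost',
    database: 'trades', // hypothetical database name
    max: 20,            // worth benchmarking; more is not always faster
});

async function getTradesBetweenDates(table, from, to) {
    // The table name cannot be a bind parameter, so it must come from a
    // trusted whitelist (there are ~1,500 tables of identical structure).
    const { rows } = await pool.query(
        `SELECT * FROM ${table} WHERE $1 <= matchdate AND matchdate < $2`,
        [from, to]
    );
    return { data: rows }; // shape matches rawTradesPerTicker["data"] above
}
With a pool max far below 1,000, the observed 15 ms/call average is likely dominated by queueing inside the pool plus per-call round-trip overhead rather than server-side execution time.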
I just ran EXPLAIN (ANALYZE, BUFFERS) four times on the query:
EXPLAIN (ANALYZE,BUFFERS) SELECT * FROM public.my_table1 where '2020-01-01 00:00' <= matchdate and matchdate < '2020-01-01 00:30';
Index Scan using my_table1_matchdate_idx on my_table1  (cost=0.57..179.09 rows=1852 width=41) (actual time=0.024..0.555 rows=3013 loops=1)
  Index Cond: (('2020-01-01 00:00:00+04'::timestamp with time zone <= matchdate) AND (matchdate < '2020-01-01 00:30:00+04'::timestamp with time zone))
  Buffers: shared hit=41
Planning Time: 0.096 ms
Execution Time: 0.634 ms

Index Scan using my_table1_matchdate_idx on my_table1  (cost=0.57..179.09 rows=1852 width=41) (actual time=0.018..0.305 rows=3013 loops=1)
  Index Cond: (('2020-01-01 00:00:00+04'::timestamp with time zone <= matchdate) AND (matchdate < '2020-01-01 00:30:00+04'::timestamp with time zone))
  Buffers: shared hit=41
Planning Time: 0.170 ms
Execution Time: 0.374 ms

Index Scan using my_table1_matchdate_idx on my_table1  (cost=0.57..179.09 rows=1852 width=41) (actual time=0.020..0.351 rows=3013 loops=1)
  Index Cond: (('2020-01-01 00:00:00+04'::timestamp with time zone <= matchdate) AND (matchdate < '2020-01-01 00:30:00+04'::timestamp with time zone))
  Buffers: shared hit=41
Planning Time: 0.097 ms
Execution Time: 0.428 ms

Index Scan using my_table1_matchdate_idx on my_table1  (cost=0.57..179.09 rows=1852 width=41) (actual time=0.016..0.482 rows=3013 loops=1)
  Index Cond: (('2020-01-01 00:00:00+04'::timestamp with time zone <= matchdate) AND (matchdate < '2020-01-01 00:30:00+04'::timestamp with time zone))
  Buffers: shared hit=41
Planning Time: 0.077 ms
Execution Time: 0.586 ms
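Each run shows well under 1 ms of execution time with every buffer already in shared memory, so nearly all of the 15 ms/call average must be spent outside the query itself (client, connection, and round-trip overhead). One option, sketched here rather than taken from the question, is to collapse several per-ticker range scans into a single round trip with UNION ALL; buildBatchQuery and tickerTables are hypothetical names, and the table names must come from a trusted whitelist:
// Build one statement that scans N ticker tables in a single round trip;
// $1 and $2 are the same matchdate bounds, reused by every branch.
function buildBatchQuery(tickerTables) {
    return tickerTables
        .map((t) => `SELECT '${t}' AS ticker, * FROM ${t} WHERE $1 <= matchdate AND matchdate < $2`)
        .join('\nUNION ALL\n');
}

// Usage (hypothetical table names):
// const sql = buildBatchQuery(['my_table1', 'my_table2']);
// const { rows } = await pool.query(sql, [new Date('2020-01-01 00:00'), new Date('2020-01-01 00:30')]);
Whether this beats 1,000 separately awaited calls is worth benchmarking; it trades many small round trips for fewer, larger ones.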

Related

Spark stage taking too long - 2 executors doing "all" the work

I've been trying to figure this out for the past day, but have not been successful.
Problem I am facing
I'm reading a parquet file that is about 2 GB. The initial read is 14 partitions, which eventually gets split into 200 partitions. I perform a seemingly simple SQL query that runs for 25+ minutes, about 22 of which are spent on a single stage. Looking at the Spark UI, I see that all computation is eventually pushed to about 2 to 4 executors, with lots of shuffling. I don't know what is going on. I would appreciate any help.
Setup
Spark environment - Databricks
Cluster mode - Standard
Databricks Runtime Version - 6.4 ML (includes Apache Spark 2.4.5, Scala 2.11)
Cloud - Azure
Worker Type - 56 GB, 16 cores per machine. Minimum 2 machines
Driver Type - 112 GB, 16 cores
Notebook
Cell 1: Helper functions
load_data = function(path, type) {
    input_df = read.df(path, type)
    input_df = withColumn(input_df, "dummy_col", 1L)
    createOrReplaceTempView(input_df, "__current_exp_data")

    ## Helper function to run a query, then save the result as a temp view
    transformation_helper = function(sql_query, destination_table) {
        createOrReplaceTempView(sql(sql_query), destination_table)
    }

    ## Transformation 0: Calculate max date, used for calculations later on
    transformation_helper(
        "SELECT 1L AS dummy_col, MAX(Date) max_date FROM __current_exp_data",
        destination_table = "__max_date"
    )

    ## Transformation 1: Make initial column calculations
    transformation_helper(
        "
        SELECT
            cId AS cId
            , date_format(Date, 'yyyy-MM-dd') AS Date
            , date_format(DateEntered, 'yyyy-MM-dd') AS DateEntered
            , eId
            , (CASE WHEN isnan(tSec) OR isnull(tSec) THEN 0 ELSE tSec END) AS tSec
            , (CASE WHEN isnan(eSec) OR isnull(eSec) THEN 0 ELSE eSec END) AS eSec
            , approx_count_distinct(eId) OVER (PARTITION BY cId) AS dc_eId
            , COUNT(*) OVER (PARTITION BY cId, Date) AS num_rec
            , datediff(Date, DateEntered) AS analysis_day
            , datediff(max_date, DateEntered) AS total_avail_days
        FROM __current_exp_data
        JOIN __max_date ON __current_exp_data.dummy_col = __max_date.dummy_col
        ",
        destination_table = "current_exp_data_raw"
    )

    ## Transformation 2: Drop rows where Date is not valid
    transformation_helper(
        "
        SELECT
            cId
            , Date
            , DateEntered
            , eId
            , tSec
            , eSec
            , analysis_day
            , total_avail_days
            , CASE WHEN analysis_day == 0 THEN 0 ELSE floor((analysis_day - 1) / 7) END AS week
            , CASE WHEN total_avail_days < 7 THEN NULL ELSE floor(total_avail_days / 7) - 1 END AS avail_week
        FROM current_exp_data_raw
        WHERE
            isnotnull(Date) AND
            NOT isnan(Date) AND
            Date >= DateEntered AND
            dc_eId == 1 AND
            num_rec == 1
        ",
        destination_table = "main_data"
    )

    cacheTable("current_exp_data_raw")
    cacheTable("main_data")
}

spark_sql_as_data_table = function(query) {
    data.table(collect(sql(query)))
}

get_distinct_weeks = function() {
    spark_sql_as_data_table("SELECT week FROM main_data GROUP BY week")
}
Cell 2: Call helper function that triggers the long running task
library(data.table)
library(SparkR)
spark = sparkR.session(sparkConfig = list())
load_data("/mnt/public-dir/file_0000000.parquet", "parquet")
set.seed(1234)
get_distinct_weeks()
[Screenshot: DAG of the long-running stage]
[Screenshot: stats about the long-running stage]
Logs
I trimmed the logs down and show only entries that appeared multiple times below:
BlockManager: Found block rdd_22_113 locally
CoarseGrainedExecutorBackend: Got assigned task 812
ExternalAppendOnlyUnsafeRowArray: Reached spill threshold of 4096 rows, switching to org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter
InMemoryTableScanExec: Predicate (dc_eId#61L = 1) generates partition filter: ((dc_eId.lowerBound#622L <= 1) && (1 <= dc_eId.upperBound#621L))
InMemoryTableScanExec: Predicate (num_rec#62L = 1) generates partition filter: ((num_rec.lowerBound#627L <= 1) && (1 <= num_rec.upperBound#626L))
InMemoryTableScanExec: Predicate isnotnull(Date#57) generates partition filter: ((Date.count#599 - Date.nullCount#598) > 0)
InMemoryTableScanExec: Predicate isnotnull(DateEntered#58) generates partition filter: ((DateEntered.count#604 - DateEntered.nullCount#603) > 0)
MemoryStore: Block rdd_17_104 stored as values in memory (estimated size <VERY SMALL NUMBER < 10> MB, free 10.0 GB)
ShuffleBlockFetcherIterator: Getting 200 non-empty blocks including 176 local blocks and 24 remote blocks
ShuffleBlockFetcherIterator: Started 4 remote fetches in 1 ms
UnsafeExternalSorter: Thread 254 spilling sort data of <Between 1 and 3 GB> to disk (3 times so far)

Parquet data to AWS Redshift slow

I want to insert data from S3 parquet files into Redshift.
The parquet files come from a process that reads JSON files, flattens them out, and stores them as parquet. To do this we use pandas dataframes.
To do so, I tried two different things. The first one:
COPY schema.table
FROM 's3://parquet/provider/A/2020/11/10/11/'
IAM_ROLE 'arn:aws:iam::XXXX'
FORMAT AS PARQUET;
It returned:
Invalid operation: Spectrum Scan Error
error: Spectrum Scan Error
code: 15001
context: Unmatched number of columns between table and file. Table columns: 54, Data columns: 41
I understand the error, but I don't have an easy way to fix it.
If we have to reload data from 2 months ago, a file will only have, for example, 40 columns, because at that time we only needed that data, but the table has since grown to 50 columns.
So we need something automatic, or at least a way to specify the columns.
Then I tried another option, which is to do a SELECT with AWS Redshift Spectrum. We know how many columns the table has using the system tables, and we know the structure of the file by loading it again into a pandas dataframe. Then I can combine both to get an identical structure and do the insert.
It works fine, but it is slow.
The select looks like:
SELECT fields
FROM schema.table
WHERE partition_0 = 'A'
AND partition_1 = '2020'
AND partition_2 = '11'
AND partition_3 = '10'
AND partition_4 = '11';
The partitions are already added as I checked using:
select *
from SVV_EXTERNAL_PARTITIONS
where tablename = 'table'
and schemaname = 'schema'
and values = '["A","2020","11","10","11"]'
limit 1;
I have around 170 files per hour, in both the JSON and the parquet paths. The process lists all files in the S3 JSON path, processes them, and stores them in the S3 parquet path.
I don't know how to improve the execution time, as the INSERT from parquet takes 2 minutes per partition_0 value. I tried the SELECT alone to make sure it is not an INSERT issue, and it takes 1:50 minutes, so the issue is reading the data from S3.
If I try to select a non-existent value for partition_0, it again takes around 2 minutes, so there is some kind of problem accessing the data. I don't know whether the partition_0 naming (and the others) is recognized as Hive partitioning format.
Edit:
[Screenshot: AWS Glue Crawler table specification]
Edit: Add SVL_S3QUERY_SUMMARY results
step:1
starttime: 2020-12-13 07:13:16.267437
endtime: 2020-12-13 07:13:19.644975
elapsed: 3377538
aborted: 0
external_table_name: S3 Scan schema_table
file_format: Parquet
is_partitioned: t
is_rrscan: f
is_nested: f
s3_scanned_rows: 1132
s3_scanned_bytes: 4131968
s3query_returned_rows: 1132
s3query_returned_bytes: 346923
files: 169
files_max: 34
files_avg: 28
splits: 169
splits_max: 34
splits_avg: 28
total_split_size: 3181587
max_split_size: 30811
avg_split_size: 18825
total_retries: 0
max_retries: 0
max_request_duration: 360496
avg_request_duration: 172371
max_request_parallelism: 10
avg_request_parallelism: 8.4
total_slowdown_count: 0
max_slowdown_count: 0
Edit: Add query checks
Query: 37005074 (SELECT in localhost using pycharm)
Query: 37005081 (INSERT in AIRFLOW AWS ECS service)
STL_QUERY shows that both queries take around 2 minutes:
select * from STL_QUERY where query=37005081 OR query=37005074 order by query asc;
Query: 37005074 2020-12-14 07:44:57.164336,2020-12-14 07:46:36.094645,0,0,24
Query: 37005081 2020-12-14 07:45:04.551428,2020-12-14 07:46:44.834257,0,0,3
STL_WLM_QUERY shows no queue time; everything is exec time:
select * from STL_WLM_QUERY where query=37005081 OR query=37005074;
Query: 37005074 Queue time 0 Exec time: 98924036 est_peak_mem:0
Query: 37005081 Queue time 0 Exec time: 100279214 est_peak_mem:2097152
SVL_S3QUERY_SUMMARY shows that the query takes 3-4 seconds in S3:
select * from SVL_S3QUERY_SUMMARY where query=37005081 OR query=37005074 order by endtime desc;
Query: 37005074 2020-12-14 07:46:33.179352,2020-12-14 07:46:36.091295
Query: 37005081 2020-12-14 07:46:41.869487,2020-12-14 07:46:44.807106
stl_return: comparing the min start to the max end for each query gives 3-4 seconds, as SVL_S3QUERY_SUMMARY says:
select * from stl_return where query=37005081 OR query=37005074 order by query asc;
Query:37005074 2020-12-14 07:46:33.175320 2020-12-14 07:46:36.091295
Query:37005081 2020-12-14 07:46:44.817680 2020-12-14 07:46:44.832649
I don't understand why SVL_S3QUERY_SUMMARY shows just 3-4 seconds to run the query in Spectrum, while STL_WLM_QUERY says the execution time is around 2 minutes, which matches what I see from my localhost and production environments. Nor do I see how to improve it, because stl_return shows that the query returns very little data.
EXPLAIN
XN Partition Loop  (cost=0.00..400000022.50 rows=10000000000 width=19608)
  ->  XN Seq Scan PartitionInfo of parquet.table  (cost=0.00..22.50 rows=1 width=0)
        Filter: (((partition_0)::text = 'A'::text) AND ((partition_1)::text = '2020'::text) AND ((partition_2)::text = '12'::text) AND ((partition_3)::text = '10'::text) AND ((partition_4)::text = '12'::text))
  ->  XN S3 Query Scan parquet  (cost=0.00..200000000.00 rows=10000000000 width=19608)
        ->  S3 Seq Scan parquet.table location:"s3://parquet" format:PARQUET (cost=0.00..100000000.00 rows=10000000000 width=19608)
svl_query_report
select * from svl_query_report where query=37005074 order by segment, step, elapsed_time, rows;
Just like in your other question, you need to change the keypaths on your objects. It is not enough to just have "A" in the keypath; it needs to be "partition_0=A". This is how Spectrum knows whether the object is in the partition.
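For example (illustrative keypaths based on the claim above, not taken from the question):
s3://parquet/provider/A/2020/11/10/11/file0.parquet  <- value-only path
s3://parquet/provider/partition_0=A/partition_1=2020/partition_2=11/partition_3=10/partition_4=11/file0.parquet  <- Hive-style key=value path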
You also need to make sure that your objects are of reasonable size, or scanning many of them will be slow. It takes time to open each object, and if you have many small objects, the time to open them can exceed the time to scan them. This only matters when you need to scan very many files.

Postgresql Query is taking over 30s, Failing on Heroku

I have two tables (tb_accounts and tb_similar_accounts) in my PostgreSQL database.
Each record in tb_accounts has a unique id field.
And in tb_similar_accounts table, we store the relationships between accounts. We have around 10 million records in this table.
I get around 1,000-4,000 accounts in one of my API endpoints, and I want to get all the links between them from tb_similar_accounts:
const ids_str = `${ids.map((id) => `('${id}')`).join(`,`)}`;
const pgQuery = `
    SELECT account_1 AS source, account_2 AS target, strength AS weight
    FROM tb_similar_accounts
    WHERE account_1 = ANY (VALUES ${ids_str}) AND account_2 = ANY (VALUES ${ids_str})
    ORDER BY strength DESC
    LIMIT 10000
`;
const { rows: visEdges } = await pgClient.query(pgQuery);
This is what I'm doing, but it's taking too much time: over 30 seconds, so it's failing on the Heroku server.
Limit  (cost=296527.60..297597.60 rows=10000 width=30) (actual time=25075.977..25078.602 rows=10000 loops=1)
  Buffers: shared hit=19 read=121634
  I/O Timings: read=22104.502
  ->  Gather Merge  (cost=296527.60..297967.40 rows=13456 width=30) (actual time=25075.975..25080.332 rows=10000 loops=1)
        Workers Planned: 1
        Workers Launched: 1
        Buffers: shared hit=110 read=244343
        I/O Timings: read=44196.556
        ->  Sort  (cost=295527.60..295534.33 rows=13456 width=30) (actual time=25070.720..25071.022 rows=5546 loops=2)
              Sort Key: tb_similar_accounts.strength DESC
              Sort Method: top-N heapsort  Memory: 1550kB
              Worker 0:  Sort Method: top-N heapsort  Memory: 1550kB
              Buffers: shared hit=110 read=244343
              I/O Timings: read=44196.556
              ->  Hash Semi Join  (cost=5.74..295343.04 rows=13456 width=30) (actual time=1040.173..25060.553 rows=38449 loops=2)
                    Hash Cond: (tb_similar_accounts.account_1 = "*VALUES*".column1)
                    Buffers: shared hit=63 read=244343
                    I/O Timings: read=44196.556
                    ->  Hash Semi Join  (cost=2.87..295096.26 rows=381936 width=30) (actual time=2.197..25039.864 rows=80874 loops=2)
                          Hash Cond: (tb_similar_accounts.account_2 = "*VALUES*_1".column1)
                          Buffers: shared hit=33 read=244343
                          I/O Timings: read=44196.556
                          ->  Parallel Seq Scan on tb_similar_accounts  (cost=0.00..286491.44 rows=14038480 width=30) (actual time=0.032..23394.824 rows=11932708 loops=2)
                                Buffers: shared hit=33 read=244343
                                I/O Timings: read=44196.556
                          ->  Hash  (cost=1.44..1.44 rows=410 width=32) (actual time=0.241..0.242 rows=410 loops=2)
                                Buckets: 1024  Batches: 1  Memory Usage: 26kB
                                ->  Values Scan on "*VALUES*_1"  (cost=0.00..1.44 rows=410 width=32) (actual time=0.001..0.153 rows=410 loops=2)
                    ->  Hash  (cost=1.44..1.44 rows=410 width=32) (actual time=0.198..0.198 rows=410 loops=2)
                          Buckets: 1024  Batches: 1  Memory Usage: 26kB
                          ->  Values Scan on "*VALUES*"  (cost=0.00..1.44 rows=410 width=32) (actual time=0.002..0.113 rows=410 loops=2)
Planning Time: 3.522 ms
Execution Time: 25081.725 ms
This is for 410 accounts and takes around 25 seconds.
Is there any way to improve this query? (I'm using Node.js and pg module.)
Check whether both your tables are indexed properly, i.e. build indexes on the columns that are used to join the two tables.
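As a sketch of what that could look like from Node.js, assuming node-postgres (which serializes a JavaScript array into a Postgres array bind parameter); the index definition is an illustrative suggestion, not the poster's actual schema:
// Hypothetical supporting index, per the advice above:
//   CREATE INDEX ON tb_similar_accounts (account_1, account_2);
const pgQuery = `
    SELECT account_1 AS source, account_2 AS target, strength AS weight
    FROM tb_similar_accounts
    WHERE account_1 = ANY($1) AND account_2 = ANY($1)
    ORDER BY strength DESC
    LIMIT 10000
`;
// `ids` is bound as one array parameter instead of being interpolated
// into a multi-thousand-item VALUES list, which also avoids quoting issues.
const { rows: visEdges } = await pgClient.query(pgQuery, [ids]);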

Resolving performance issue with skewed partitions in PySpark Window function

I am attempting to calculate some moving averages in Spark, but am running into issues with skewed partitions. Here is the simple calculation I'm trying to perform:
Getting the base data
# Variables
one_min = 60
one_hour = 60 * one_min
one_day = 24 * one_hour
seven_days = 7 * one_day
thirty_days = 30 * one_day

# Column variables
target_col = "target"
partition_col = "partition_col"

df_base = (
    spark
    .sql("SELECT * FROM {base}".format(base=base_table))
)

df_product1 = (
    df_base
    .where(F.col("product_id") == F.lit("1"))
    .select(
        F.col(target_col).astype("double").alias(target_col),
        F.unix_timestamp("txn_timestamp").alias("window_time"),
        "transaction_id",
        partition_col
    )
)

df_product1.persist()
Calculating running averages
window_lengths = {
    "1day": one_day,
    "7day": seven_days,
    "30day": thirty_days
}

# Create window specs for each type
part_windows = {
    time: Window.partitionBy(F.col(partition_col))
                .orderBy(F.col("window_time").asc())
                .rangeBetween(-secs, -one_min)
    for (time, secs) in window_lengths.items()
}

cols = [
    # Note: not using `avg` as I will be smoothing this at some point
    (F.sum(target_col).over(win) / F.count("*").over(win)).alias(
        "{time}_avg_target".format(time=time)
    )
    for time, win in part_windows.items()
]

sample_df = (
    df_product1
    .repartition(2000, partition_col)
    .sortWithinPartitions(F.col("window_time").asc())
    .select(
        "*",
        *cols
    )
)
Now, I can collect a limited subset of these data (say just 100 rows), but if I try to run the full query and, for example, aggregate the running averages, Spark gets stuck on a few particularly large partitions. The vast majority of the partitions have fewer than 1 million records in them; only about 50 have more than 1M records, and only about 150 have more than 500K.
However, a small handful (~10) have more than 2.5M records, and 3 of them have more than 5M records. These partitions have run for more than 12 hours and failed to complete. The skew in these partitions is a natural part of the data, representing higher activity for those distinct values of the partitioning column. I have no control over the definition of the values of this partitioning column.
I am using a SparkSession with dynamic allocation enabled, 32 GB of RAM and 4 cores per executor, and a minimum of 4 executors. I have attempted to go up to 96 GB with 8 cores per executor and a minimum of 10 executors, but the job still does not complete.
This seems like a calculation that shouldn't take 13 hours to complete. The df_product1 DataFrame contains just shy of 300M records.
If there is other information that would be helpful in resolving this problem, please comment below.

What's faster: ranged queries on partition keys or individual equality queries on PKs in Azure Table Storage?

I need to run queries against Table Storage to get specific data from about 10 consecutive partition keys. To be more precise, my Azure table follows a PK/RK pattern in which every PK has about 300 rows, and within each PK I need to retrieve about 100 rows.
I can do either 1 call like this:
var query = table.CreateQuery<Item>()
    .Where(n => string.Compare(n.PartitionKey, fromPk, StringComparison.Ordinal) >= 0 &&
                string.Compare(n.PartitionKey, toPk, StringComparison.Ordinal) <= 0 &&
                string.Compare(n.RowKey, fromRk, StringComparison.Ordinal) >= 0 &&
                string.Compare(n.RowKey, endRk, StringComparison.Ordinal) <= 0
    ).AsTableQuery();
or
10 calls to this:
var query = table.CreateQuery<Item>()
    .Where(n => string.Compare(n.PartitionKey, pk, StringComparison.Ordinal) == 0 &&
                string.Compare(n.RowKey, fromRk, StringComparison.Ordinal) >= 0 &&
                string.Compare(n.RowKey, endRk, StringComparison.Ordinal) <= 0
    ).AsTableQuery();
What's better?
I've not profiled it myself; I'm guessing that in some situations the differences could be pretty small, but the single query is probably the way to go.
The single query is reasonably efficient, but because of the amount of data you have (about 3,000 rows in the result set) and the fact that you will get at most 1,000 rows at a time, it will result in at least 3 calls to the API under the hood.
The individual queries are probably individually faster, you probably won't get a continuation token, and you can run them in parallel. If you can't run the queries in parallel, I would expect the latency of the requests to overwhelm any query performance difference.
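If you do run the per-partition queries in parallel, here is a sketch in JavaScript using the @azure/data-tables SDK (a different SDK and language than the C# snippets above, shown purely as an illustration; client, pks, fromRk, and endRk are assumed to already exist):
const { odata } = require('@azure/data-tables');

// Fire all 10 partition point-queries at once and gather the results.
const perPkRows = await Promise.all(pks.map(async (pk) => {
    const rows = [];
    const entities = client.listEntities({
        queryOptions: {
            filter: odata`PartitionKey eq ${pk} and RowKey ge ${fromRk} and RowKey le ${endRk}`
        }
    });
    for await (const entity of entities) rows.push(entity); // paging handled by the iterator
    return rows;
}));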
