MS Azure Data Factory ADF Copy Activity from BLOB to Azure Postgres Gen5 8 cores fails with connection closed by host error

MS Azure Data Factory ADF Copy Activity from BLOB to Azure Postgres Gen5 8 cores fails with connection closed by host error - azure

I am using ADF copy acivity to copy files on azure blob to azure postgres.. im doing recursive copy i.e. there are multiple files withing the folder.. thats fine.. size of 5 files which i have to copy is total around 6 gb. activity fails after 30-60 min of run. used write batch size from 100- 500 but still fails.
used 4 or 8 orauto DIUS, similarly tried used 1,2,4,8 or auto parallel connections to postgres.normally it seems it uses 1 per source file. azure postgres server has 8 cores and temp buffer size is 8192 kb. max allowed is 16000 something kb. even tried using that but 2 errors which i have been constantly getting. ms support team suggested to use retry option. still awaiting response from there pg team if i get something but below r the errors.
Answer: {
'errorCode': '2200',
'message': ''Type=Npgsql.NpgsqlException,Message=Exception while reading from stream,Source=Npgsql,''Type=System.IO.IOException,Message=Unable to read data from the transport connection: An existing connection was forcibly closed by the remote host.,Source=System,''Type=System.Net.Sockets.SocketException,Message=An existing connection was forcibly closed by the remote host,Source=System,'',
'failureType': 'UserError',
'target': 'csv to pg staging data migration',
'details': []
}
or
Operation on target csv to pg staging data migration failed: 'Type=Npgsql.NpgsqlException,Message=Exception while flushing stream,Source=Npgsql,''Type=System.IO.IOException,Message=Unable to write data to the transport connection: An existing connection was forcibly closed by the remote host.,Source=System,''Type=System.Net.Sockets.SocketException,Message=An existing connection was forcibly closed by the remote host,Source=System

I was also facing this issue recently and contacted our microsoft rep who got back to me with the following update on 2020-01-16:
“This is another issue we found in the driver, we just finished our
deployment yesterday to fix this issue by upgrading driver version.
Now customer can have up to 32767 columns data in one batch size(which
is the limitation in PostgreSQL, we can’t exceed that).
Please let customer make sure that (Write batch size* column size)<
32767 as I mentioned, otherwise they will face the limitation. “
"Column size" refers to the count of columns in the table. The "area" (row write batch size * column count) cannot be greater than 32,767.
I was able to change my ADF write batch size on copy activity to a dynamic formula to ensure optimum batch sizes per table with the following:
#div(32766,length(pipeline().parameters.config)
pipeline().parameters.config refers to an array containing information about columns for the table. the length of the array = number of columns for table.
hope this helps! I was able to populate the database (albeit slowly) via ADF... would much prefer a COPY based method for better performance.

Related

Fix PlayFramework Slick Error - java.util.concurrent.RejectedExecutionException: Task slick.basic.BasicBackend rejected from slick.util.AsyncExecutor

UPDATE 2
More investigation, some code refactoring and many trials later, We have more insight. The issue seems relative to concurrent read-write. IE When trying to read from DB while in the same time write (update) occurs. Not sure if this is managed by Postgres or Slick.
We are facing a random error with Slick 3.
That is the brief history.
We are building an ETL pipeline Spark, Minio as S3 Storage (Minio is an open-source alternative to AWS) and Delta tables. The pipeline has a web interface created using Play Framework (Scala).
Cluster is consisted of:
7 workers nodes of 16 cores and 64GB RAM each configured in client
mode.
1 Storage node
spark.defaut.parallelism and spark.sql.shuffle.partitions are both set to 600
spark.dynamicAllocation is disabled
App data (session data, users data, and some other records in) is saved in PostgreSQL using Slick 3 mapper.
Data processed size is exponentially growing, and now it is around 50 GB. (In production, we aim to process Terabytes of data)
Data processing flow consists essentially in data Aggregation using group-by and saving data into S3 Storage following these steps
Read CSV data from Storage and create read_df dataframe
Read main_db from dtorag and create main_df
Merge read_df with main_df
GroupBy a specfic Key (let’s say user_id)
Save records to Storage to replace main_db. To guarantee data integrity, this stage is split into three phases:
Write records to a temp object referenced by date time
Backup Existing database object main_db (copy to another object)
Rename temp object to main_db (copy and delete)
Then Update PostgreSQL history table with processed job information such as:
time_started, time_ended, number_of_rows_processed, size, etc. And that is where issue occurs.
We are facing a random error, and we noticed it happens when shuffle occurs after groupby. Sometimes, we end up with 1000+ partitions. In those cases, Step 5 is not completed and gives the following Exception:
java.util.concurrent.RejectedExecutionException: Task slick.basic.BasicBackend$DatabaseDef$$anon$3#291fad07 rejected from slick.util.AsyncExecutor$$anon$1$$anon$2#7345bd2c[Terminated, pool size = 0, active threads = 0, queued tasks = 0, completed tasks = 26]
completed tasks value sometimes is lower, sometime reaches hundreds.
Below is code that is executed in step 5
Await.result(mainDbSvc.update("main.delta", mainDb), 600.seconds)
Googling the exception (And we did a ton of research about this), we found that it could be because connections are closed before code is executed when using transactionally. Notice, we don’t use transactionnally in our code. Below is code executed when calling update()
val updateQuery = this.mainDbTable.filter(_.id === id).update(db)
dbConfig.db.run(updateQuery)
That is the actual slick configuration:
connectionPool = "HikariCP"
dataSourceClass = "org.postgresql.ds.PGSimpleDataSource"
numThreads = 100
Initially, before errors starting, it was
numThreads = 20
maxConnections = 20
We tried queueSize = 2000 but not fixed.
Can someone have a solution for us?
Furthermore, we suspect the step5 to be responsible for that connection closed issue because that did not happen when it is turned off. What is the link between threads that read/write from S3 Storage (on another server) and hikari (slick) processes that are killed?
And is there a better way to guarantee data integrity (in case of failure while writing data) without this time-consuming copy-restore-and-delete process ?
Note:
After Aggregation, we repartition() to reduce partitions and avoid skew data before saving results. Coalesce() made driver JVM craches with OOM.
main_df and read_df do not have the same schema so, overwritting using delta in built-in method is not possible.
Update() functions Await time was 10s but following issue, we increased it, but that did not fix the issue.
UPDATE
This is the full trace of exception.
An error has occurred: Task slick.basic.BasicBackend$DatabaseDef$$anon$3#313a2647 rejected from slick.util.AsyncExecutor$$anon$1$$anon$2#77884590[Terminated, pool size = 0, active threads = 0, queued tasks = 0, completed tasks = 21]
java.util.concurrent.RejectedExecutionException: Task slick.basic.BasicBackend$DatabaseDef$$anon$3#5c6d1059 rejected from slick.util.AsyncExecutor$$anon$1$$anon$2#77884590[Terminated, pool size = 0, active threads = 0, queued tasks = 0, completed tasks = 21]
at java.base/java.util.concurrent.ThreadPoolExecutor$AbortPolicy.rejectedExecution(ThreadPoolExecutor.java:2055)
at java.base/java.util.concurrent.ThreadPoolExecutor.reject(ThreadPoolExecutor.java:825)
at java.base/java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1355)
at slick.util.AsyncExecutor$$anon$1$$anon$4.execute(AsyncExecutor.scala:161)
at slick.basic.BasicBackend$DatabaseDef.runSynchronousDatabaseAction(BasicBackend.scala:265)
at slick.basic.BasicBackend$DatabaseDef.runSynchronousDatabaseAction$(BasicBackend.scala:263)
at slick.jdbc.JdbcBackend$DatabaseDef.runSynchronousDatabaseAction(JdbcBackend.scala:37)
at slick.basic.BasicBackend$DatabaseDef.slick$basic$BasicBackend$DatabaseDef$$runInContextInline(BasicBackend.scala:242)
at slick.basic.BasicBackend$DatabaseDef.runInContextSafe(BasicBackend.scala:148)
at slick.basic.BasicBackend$DatabaseDef.runInContext(BasicBackend.scala:142)
at slick.basic.BasicBackend$DatabaseDef.runInContext$(BasicBackend.scala:141)
at slick.jdbc.JdbcBackend$DatabaseDef.runInContext(JdbcBackend.scala:37)
at slick.basic.BasicBackend$DatabaseDef.runInternal(BasicBackend.scala:77)
at slick.basic.BasicBackend$DatabaseDef.runInternal$(BasicBackend.scala:76)
at slick.jdbc.JdbcBackend$DatabaseDef.runInternal(JdbcBackend.scala:37)
at slick.basic.BasicBackend$DatabaseDef.run(BasicBackend.scala:74)
at slick.basic.BasicBackend$DatabaseDef.run$(BasicBackend.scala:74)
at slick.jdbc.JdbcBackend$DatabaseDef.run(JdbcBackend.scala:37)
at modules.load.daos.slick3.JobsDao.update(JobsDao.scala:180)
at modules.load.services.JobService.update(JobService.scala:50)
at modules.load.models.JobSnippet$.updateJob(Job.scala:113)
at modules.load.controllers.Ops.BuildController.$anonfun$jsProcess$6(BuildController.scala:169)
at modules.load.controllers.Ops.BuildController.$anonfun$jsProcess$6$adapted(BuildController.scala:151)
at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:64)
at java.base/java.util.concurrent.ForkJoinTask$RunnableExecuteAction.exec(ForkJoinTask.java:1426)
at java.base/java.util.concurrent.ForkJoinTask.doExec(ForkJoinTask.java:290)
at java.base/java.util.concurrent.ForkJoinPool$WorkQueue.topLevelExec(ForkJoinPool.java:1020)
at java.base/java.util.concurrent.ForkJoinPool.scan(ForkJoinPool.java:1656)
at java.base/java.util.concurrent.ForkJoinPool.runWorker(ForkJoinPool.java:1594)
at java.base/java.util.concurrent.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:183)

How to copy managed database?

AFAIK there is no REST API providing this functionality directly. So, I am using restore for this (there are other ways but those don’t guarantee transactional consistency and are more complicated) via Create request.
Since it is not possible to turn off short time backup (retention has to be at least 1 day) it should be reliable. I am using current time for ‘properties.restorePointInTime’ property in request. This works fine for most databases. But one db returns me this error (from async operation request):
"error": {
"code": "BackupSetNotFound",
"message": "No backups were found to restore the database to the point in time 6/14/2021 8:20:00 PM (UTC). Please contact support to restore the database."
}
I know I am not out of range because if the restore time is before ‘earliestRestorePoint’ (this can be found in GET request on managed database) or in future I get ‘PitrPointInTimeInvalid’ error. Nevertheless, I found some information that I shouldn’t use current time but rather current time - 6 minutes at most. This is also true if done via Azure Portal (where it fails with the same error btw) which doesn’t allow to input time newer than current - 6 minutes. After few tries, I found out that current time - circa 40 minutes starts to work fine. But 40 minutes is a lot and I didn’t find any way to find out what time works before I try and wait for result of async operation.
My question is: Is there a way to find what is the latest time possible for restore?
Or is there a better way to do ‘copy’ of managed database which guarantees transactional consistency and is reasonably quick?
EDIT:
The issue I was describing was reported to MS. It was occuring when:
there is a custom time zone format e.g. UTC + 1 hour.
Backups are skipped for the source database at the desired point in time because the database is inactive (no active transactions).
This should be fixed as of now (25th of August 2021) and I were not able to reproduce it with current time - 10 minutes. Also I was told there should be new API which would allow to make copy without using PITR (no sooner than 1Q/22).

To answer your first question "Is there a way to find what is the latest time possible for restore?"
Yes. Via SQL. The only way to find this out is by using extended event (XEvent) sessions to monitor backup activity.
Process to start logging the backup_restore_progress_trace extended event and report on it is described here https://learn.microsoft.com/en-us/azure/azure-sql/managed-instance/backup-activity-monitor
Including the SQL here in case the link goes stale.
This is for storing in the ring buffer (max last 1000 records):
CREATE EVENT SESSION [Verbose backup trace] ON SERVER
ADD EVENT sqlserver.backup_restore_progress_trace(
WHERE (
[operation_type]=(0) AND (
[trace_message] like '%100 percent%' OR
[trace_message] like '%BACKUP DATABASE%' OR [trace_message] like '%BACKUP LOG%'))
)
ADD TARGET package0.ring_buffer
WITH (MAX_MEMORY=4096 KB,EVENT_RETENTION_MODE=ALLOW_SINGLE_EVENT_LOSS,
MAX_DISPATCH_LATENCY=30 SECONDS,MAX_EVENT_SIZE=0 KB,MEMORY_PARTITION_MODE=NONE,
TRACK_CAUSALITY=OFF,STARTUP_STATE=ON)
ALTER EVENT SESSION [Verbose backup trace] ON SERVER
STATE = start;
Then to see output of all backup events:
WITH
a AS (SELECT xed = CAST(xet.target_data AS xml)
FROM sys.dm_xe_session_targets AS xet
JOIN sys.dm_xe_sessions AS xe
ON (xe.address = xet.event_session_address)
WHERE xe.name = 'Verbose backup trace'),
b AS(SELECT
d.n.value('(#timestamp)[1]', 'datetime2') AS [timestamp],
ISNULL(db.name, d.n.value('(data[#name="database_name"]/value)[1]', 'varchar(200)')) AS database_name,
d.n.value('(data[#name="trace_message"]/value)[1]', 'varchar(4000)') AS trace_message
FROM a
CROSS APPLY xed.nodes('/RingBufferTarget/event') d(n)
LEFT JOIN master.sys.databases db
ON db.physical_database_name = d.n.value('(data[#name="database_name"]/value)[1]', 'varchar(200)'))
SELECT * FROM b
NOTE: This tip came to me via Microsoft support when I had the same issue of point in time restores failing what seemed like randomly. They do not give any SLA for log backups. I found that on a busy database the log backups seemed to happen every 5-10 minutes but on a quiet database hourly. Recovery of a database this way can be slow depending on number of transaction logs and amount of activity to replay etc. (https://learn.microsoft.com/en-us/azure/azure-sql/database/recovery-using-backups)
To answer your second question: "Or is there a better way to do ‘copy’ of managed database which guarantees transactional consistency and is reasonably quick?"
I'd have to agree with Thomas - if you're after guaranteed transactional consistency and speed you need to look at creating a failover group https://learn.microsoft.com/en-us/azure/azure-sql/database/auto-failover-group-overview?tabs=azure-powershell#best-practices-for-sql-managed-instance and https://learn.microsoft.com/en-us/azure/azure-sql/managed-instance/failover-group-add-instance-tutorial?tabs=azure-portal
A failover group for a managed instance will have a primary server and failover server with the same user databases on each kept in synch.
But yes, whether this suits your needs depends on the question Thomas asked of what is the purpose of the copy.

Azure Functions "The operation has timed out." for timer trigger blob archival

I have a Python Azure Functions timer trigger that is run once a day and archives files from a general purpose v2 hot storage container to a general purpose v2 cold storage container. I'm using the Linux Consumption plan. The code looks like this:
container = ContainerClient.from_connection_string(conn_str=hot_conn_str,
container_name=hot_container_name)
blob_list = container.list_blobs(name_starts_with = hot_data_dir)
files = []
for blob in blob_list:
files.append(blob.name)
for file in files:
blob_from = BlobClient.from_connection_string(conn_str=hot_conn_str,
container_name=hot_container_name,
blob_name=file)
data = blob_from.download_blob()
blob_to = BlobClient.from_connection_string(conn_str=cold_conn_str,
container_name=cold_container_name,
blob_name=f'archive/{file}')
try:
blob_to.upload_blob(data.readall())
except ResourceExistsError:
logging.debug(f'file already exists: {file}')
except ResourceNotFoundError:
logging.debug(f'file does not exist: {file}')
container.delete_blob(blob=file)
This has been working for me for the past few months with no problems, but for the past two days I am seeing this error halfway through the archive process:
The operation has timed out.
There is no other meaningful error message other than that. If I manually call the function through the UI, it will successfully archive the rest of the files. The size of the blobs ranges from a few KB to about 5 MB and the timeout error seems to be happening on files that are 2-3MB. There is only one invocation running at a time so I don't think I'm exceeding the 1.5GB memory limit on the consumption plan (I've seen python exited with code 137 from memory issues in the past). Why am I getting this error all of a sudden when it has been working flawlessly for months?
Update
I think I'm going to try using the method found here for archival instead so I don't have to store the blob contents in memory in Python: https://www.europeclouds.com/blog/moving-files-between-storage-accounts-with-azure-functions-and-event-grid

Just summarize the solution from comments for other communities reference:
As mentioned in comments, OP uses start_copy_from_url() method instead to implement the same requirements as a workaround.
start_copy_from_url() can process the file from original blob to target blob directly, it works much faster than using data = blob_from.download_blob() to store the file temporarily and then upload data to target blob.

Aggregate continuous stream of number from a file using hazelcast jet

I am trying to sum continuous stream of numbers from a file using hazelcast jet
pipe
.drawFrom(Sources.fileWatcher)<dir>))
.map(s->Integer.parseInt(s))
.addTimestamps()
.window(WindowDefinition.sliding(10000,1000))
.aggregate(AggregateOperations.summingDouble(x->x))
.drainTo(Sinks.logger());
Few questions
It doesn't give the expected output, my expectation is as soon as new number appears in the file, it should just add it to the existing sum
To do this why i need to give window and addTimestamp method, i just need to do sum of infinite stream
How can we achieve fault tolerance, i. e. if server restarts will it save the aggregated result and when it comes up it will aggregate from the last computed sum?
if the server is down and few numbers come in file now when the server comes up, will it read from last point from when the server went down or will it miss the numbers when it was down and will only read the number it got after the server was up.

Answer to Q1 & Q2:
You're looking for rollingAggregate, you don't need timestamps or windows.
pipe
.drawFrom(Sources.fileWatcher(<dir>))
.rollingAggregate(AggregateOperations.summingDouble(Double::parseDouble))
.drainTo(Sinks.logger());
Answer to Q3 & Q4: the fileWatcher source isn't fault tolerant. The reason is that it reads local files and when a member dies, the local files won't be available anyway. When the job restarts, it will start reading from current position and will miss numbers added while the job was down.
Also, since you use global aggregation, data from all files will be routed to single cluster member and other members will be idle.

RPC timeout in cqlsh - Cassandra

I have 5 nodes in my ring with SimpleTopologyStrategy and replication_factor=3. I inserted 1M rows using stress tool . When am trying to read the row count in cqlsh using
SELECT count(*) FROM Keyspace1.Standard1 limit 1000000;
It fails with error:
Request did not complete within rpc_timeout.
It fetches for limit 100000. Fails even for 500000.
All my nodes are up. Do I need to increase the rpc_timeout?
Please help.

You get this error because the request is timing out on the server side. One should know that this is a very expensive operation in Cassandra as others have pointed out.
Still, if you really want to do this you should update your /etc/cassandra/cassandra.yaml file and change the range_request_timeout_in_ms parameter. This will be valid for all your range queries.
Example to set a 40 second timeout:
range_request_timeout_in_ms: 40000
You will probably have to adjust at the client side as well. When using cqlsh as a client this is accomplished by creating/updating your configuration file for cqlsh under ~/.cassandra/cqlshrc and add the client_timeout parameter to the connection section.
Example to set a 40 second timeout:
[connection]
client_timeout=40

It takes a long time to read in 1M rows so that is probably why it is timing out. You shouldn't use count like this, it is very expensive since it has to read all the data. Use Cassandra counters if you need to count lots of items.
You should also check your Cassandra logs to confirm there aren't any other issues - sometimes exceptions in Cassandra lead to timeouts on the client.

If you can live with an approximate row count, take a look at this answer to Row count of a column family in Cassandra.

Develop Reference

node.js excel linux python-3.x azure haskell apache-spark rust .htaccess string