Azure Batch pool does not allocate nodes

I have created an Azure Batch account using "user subscription" allocation mode in order to control the network my nodes will belong to. The objective is to be able to open some firewalls for the set of IPs that the nodes may take.
I had been using "batch service" allocation mode before without any trouble, but it creates a security hole because you have to open your firewalls to all of Azure if you want to access other services from Batch.
The problem I am facing is that no matter what I try (be it an autoscale formula or just a fixed target node count), I never get any node allocated to my pool.
The only message I get is: AllocationTimedout: Desired number of dedicated nodes could not be allocated as the resize timeout was reached.
I checked the timeout (which is set to the default value of 10 minutes), and I expect Azure to be able to create nodes in less than 10 minutes (in "batch service" mode it is much quicker).
I also checked my virtual machine quota, and it is enough to create at least one node (it could even create more).
I think the timeout itself is not the issue, though; it is the consequence of something not working in the background.
I checked the Activity log of the Batch account and can see two errors:
Write Deployments and Write VirtualMachineScaleSets.
The first seems to be linked to the second, and the second states:
Error code: InvalidParameter
Message: Windows computer name prefix cannot be more than 9 characters long, be entirely numeric, or contain the following characters: ` ~ ! @ # $ % ^ & * ( ) = + _ [ ] { } \ | ; : . ' " , < > / ?.
What am I missing here? The node names are given by Azure Batch, not by me, and they are indeed very long under the standard "batch service" allocation mode.

Related

How to trigger/force Azure Batch pool autoscale formula

I created a pool in Azure Batch (from the Azure portal) with Auto scale activated.
I also defined a formula where the initial number of nodes is set to 0. This number ramps up according to the number of active tasks and goes back to 0 if there are no tasks remaining.
My problem is that the minimum evaluation interval for the formula is 5 minutes, which means that in the worst case I have to wait at least 5 minutes (plus the time it takes for the nodes to boot and execute the start task) before a task can be assigned to a node.
I would like to apply the formula on the pool on demand by using the REST API (for instance right after adding a job).
According to the API documentation:
https://learn.microsoft.com/en-us/rest/api/batchservice/pool/evaluate-auto-scale
You can evaluate a formula, but it will not be applied to the pool.
https://learn.microsoft.com/en-us/rest/api/batchservice/pool/enable-auto-scale
You can enable automatic scaling for a pool but if it is already enabled you have to specify a new auto scale formula and/or a new evaluation interval.
If you specify a new interval, then the existing auto scale evaluation schedule will be stopped and a new auto scale evaluation schedule will be started, with its starting time being the time when this request was issued.
You can disable and then re-enable the autoscale formula, but note the call limits on the enable API. Also note that if you try to do this frequently, on the order of less than the minimum evaluation period, thrashing the pool faster than the underlying infrastructure can allocate resources does not provide any benefit.
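For illustration, a minimal sketch of the disable/re-enable approach using the azure-batch Python SDK (the account name, key, pool ID, and formula are placeholders, and whether the client constructor takes batch_url or base_url depends on the SDK version):

import datetime

from azure.batch import BatchServiceClient
from azure.batch.batch_auth import SharedKeyCredentials

# Placeholder credentials and endpoint
credentials = SharedKeyCredentials("mybatchaccount", "<account-key>")
client = BatchServiceClient(
    credentials, batch_url="https://mybatchaccount.westeurope.batch.azure.com"
)

POOL_ID = "my-pool"  # placeholder pool ID
FORMULA = "$TargetDedicatedNodes = min(10, $PendingTasks.GetSample(1));"  # placeholder formula

# Re-enabling autoscale restarts the evaluation schedule from "now",
# which forces an immediate evaluation of the formula.
client.pool.disable_auto_scale(POOL_ID)
client.pool.enable_auto_scale(
    POOL_ID,
    auto_scale_formula=FORMULA,
    auto_scale_evaluation_interval=datetime.timedelta(minutes=5),
)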

How do I set TTL for oneshot search in Splunk API using Python?

I am intermittently getting the following error back from the Splunk API (about 40% of the time search works as expected):
HTTP 503 Service Unavailable -- Search not executed: This search could
not be dispatched because the role-based disk usage quota of search
artifacts for user "[REDACTED]" has been reached (usage=1067MB,
quota=1000MB). Use the [[/app/search/job_manager|Job Manager]] to
delete some of your search artifacts, or ask your Splunk administrator
to increase the disk quota of search artifacts for your role in
authorize.conf., usage=1067MB, quota=1000MB, user=[REDACTED],
concurrency_category="historical",
concurrency_context="user_instance-wide"
The default TTL for a search in the Splunk API is 10 minutes (at least at my company). I am told that if I lower the TTL for my searches (which are massive), I will stop running out of space. I do not have admin access, so I cannot increase my quota or clear space on the fly (as far as I know). I can find code for lowering the TTL of saved searches, but I use oneshot searches, and it is not reasonable for me to switch.
How do I lower the TTL for oneshot searches?
Here is what I have now, which does not seem to lower the TTL:
import splunklib.client as client
import splunklib.results as results

# Set up the Splunk connection
service = client.connect(
    host=HOST,
    port=PORT,
    username=suser,
    password=spass,
    autologin=True,
)

# Set up the search arguments
kwargs_oneshot = {
    "count": "0",
    "earliest_time": begin,
    "latest_time": end,
    "set_ttl": 60,
}

# Run the oneshot search job
oneshotsearch_results = service.jobs.oneshot(query, **kwargs_oneshot)

# Get the results and display them using the ResultsReader
reader = results.ResultsReader(oneshotsearch_results)
Rather than set_ttl, I believe you need ttl or timeout. See https://docs.splunk.com/Documentation/Splunk/latest/RESTREF/RESTsearch#search.2Fjobs
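As a sketch of that suggestion (reusing service, query, begin, and end from the snippet above; whether the parameter is named ttl or timeout should be verified against the REST reference, so treat the name as an assumption):

kwargs_oneshot = {
    "count": "0",
    "earliest_time": begin,
    "latest_time": end,
    "timeout": 60,  # assumed name: seconds to keep the search artifact after the search completes
}
oneshotsearch_results = service.jobs.oneshot(query, **kwargs_oneshot)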
Also, consider making your searches less massive or running them less often.

YARN + SPARK + Failed to replace a bad datanode on the existing pipeline due to no more good datanodes being available to try

I have a small cluster of 6 data-node machines and am experiencing total failure when running Spark jobs.
The failure:
ERROR [SparkListenerBus][driver][] [org.apache.spark.scheduler.LiveListenerBus] Listener EventLoggingListener threw an exception
java.io.IOException: Failed to replace a bad datanode on the existing pipeline due to no more good datanodes being available to try. (Nodes: current=[DatanodeInfoWithStorage[42.3.44.157:50010,DS-87cdbf42-3995-4313-8fab-2bf6877695f6,DISK], DatanodeInfoWithStorage[42.3.44.154:50010,DS-60eb1276-11cc-4cb8-a844-f7f722de0e15,DISK]], original=[DatanodeInfoWithStorage[42.3.44.157:50010,DS-87cdbf42-3995-4313-8fab-2bf6877695f6,DISK], DatanodeInfoWithStorage[42.3.44.154:50010,DS-60eb1276-11cc-4cb8-a844-f7f722de0e15,DISK]]). The current failed datanode replacement policy is DEFAULT, and a client may configure this via 'dfs.client.block.write.replace-datanode-on-failure.policy' in its configuration.
at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.findNewDatanode(DFSOutputStream.java:1059)
at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.addDatanode2ExistingPipeline(DFSOutputStream.java:1122)
at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.setupPipelineForAppendOrRecovery(DFSOutputStream.java:1280)
at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.processDatanodeError(DFSOutputStream.java:1005)
at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:512)
---T08:18:07.007 ERROR [SparkListenerBus][driver][] [STATISTICS] [onQueryTerminated] queryId:
I found a workaround that involves setting the following values in the HDFS configuration:
dfs.client.block.write.replace-datanode-on-failure.enable=true
dfs.client.block.write.replace-datanode-on-failure.policy=NEVER
The two properties dfs.client.block.write.replace-datanode-on-failure.policy and dfs.client.block.write.replace-datanode-on-failure.enable influence the client-side behavior for pipeline recovery, and they can be added as custom properties in the "hdfs-site" configuration.
Could setting those parameter values be a good solution?
dfs.client.block.write.replace-datanode-on-failure.enable (default: true): If there is a datanode/network failure in the write pipeline, DFSClient will try to remove the failed datanode from the pipeline and then continue writing with the remaining datanodes. As a result, the number of datanodes in the pipeline is decreased. The feature is to add new datanodes to the pipeline. This is a site-wide property to enable/disable the feature. When the cluster size is extremely small, e.g. 3 nodes or less, cluster administrators may want to set the policy to NEVER in the default configuration file or disable this feature. Otherwise, users may experience an unusually high rate of pipeline failures since it is impossible to find new datanodes for replacement. See also dfs.client.block.write.replace-datanode-on-failure.policy.
dfs.client.block.write.replace-datanode-on-failure.policy (default: DEFAULT): This property is used only if the value of dfs.client.block.write.replace-datanode-on-failure.enable is true. ALWAYS: always add a new datanode when an existing datanode is removed. NEVER: never add a new datanode. DEFAULT: let r be the replication number and n the number of existing datanodes; add a new datanode only if r >= 3 and either (1) floor(r/2) >= n, or (2) r > n and the block is hflushed/appended.
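If you only want to change the client-side behavior for your Spark jobs rather than edit hdfs-site.xml cluster-wide, a minimal PySpark sketch (the application name is arbitrary) is to pass the two properties through Spark's spark.hadoop.* prefix, which forwards them to the Hadoop client configuration:

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("pipeline-recovery-settings")
    # Forwarded to the HDFS client via the spark.hadoop.* passthrough
    .config("spark.hadoop.dfs.client.block.write.replace-datanode-on-failure.enable", "true")
    .config("spark.hadoop.dfs.client.block.write.replace-datanode-on-failure.policy", "NEVER")
    .getOrCreate()
)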

MS Azure Data Factory ADF Copy Activity from BLOB to Azure Postgres Gen5 8 cores fails with connection closed by host error

I am using an ADF copy activity to copy files from Azure Blob storage to Azure Postgres. It is a recursive copy, i.e. there are multiple files within the folder; that part is fine. The total size of the 5 files I have to copy is around 6 GB. The activity fails after 30-60 minutes of running. I have used write batch sizes from 100 to 500, but it still fails.
I have used 4, 8, or auto DIUs, and similarly tried 1, 2, 4, 8, or auto parallel connections to Postgres (normally it seems to use 1 per source file). The Azure Postgres server has 8 cores and a temp buffer size of 8192 kB; the maximum allowed is around 16000 kB, and I even tried using that. There are two errors I have been getting constantly. The MS support team suggested using the retry option, and I am still awaiting a response from their PG team, but below are the errors.
{
'errorCode': '2200',
'message': ''Type=Npgsql.NpgsqlException,Message=Exception while reading from stream,Source=Npgsql,''Type=System.IO.IOException,Message=Unable to read data from the transport connection: An existing connection was forcibly closed by the remote host.,Source=System,''Type=System.Net.Sockets.SocketException,Message=An existing connection was forcibly closed by the remote host,Source=System,'',
'failureType': 'UserError',
'target': 'csv to pg staging data migration',
'details': []
}
or
Operation on target csv to pg staging data migration failed: 'Type=Npgsql.NpgsqlException,Message=Exception while flushing stream,Source=Npgsql,''Type=System.IO.IOException,Message=Unable to write data to the transport connection: An existing connection was forcibly closed by the remote host.,Source=System,''Type=System.Net.Sockets.SocketException,Message=An existing connection was forcibly closed by the remote host,Source=System
I was also facing this issue recently and contacted our Microsoft rep, who got back to me with the following update on 2020-01-16:
“This is another issue we found in the driver, we just finished our
deployment yesterday to fix this issue by upgrading driver version.
Now customer can have up to 32767 columns data in one batch size(which
is the limitation in PostgreSQL, we can’t exceed that).
Please let customer make sure that (Write batch size* column size)<
32767 as I mentioned, otherwise they will face the limitation. “
"Column size" refers to the count of columns in the table. The "area" (row write batch size * column count) cannot be greater than 32,767.
I was able to change my ADF write batch size on copy activity to a dynamic formula to ensure optimum batch sizes per table with the following:
@div(32766, length(pipeline().parameters.config))
pipeline().parameters.config refers to an array containing information about the columns of the table; the length of the array equals the number of columns in the table.
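As a quick worked example of the constraint (the 30-column table is hypothetical), the equivalent calculation in Python is:

# write_batch_size * column_count must stay below 32,767
column_count = 30                       # hypothetical table with 30 columns
write_batch_size = 32766 // column_count
print(write_batch_size)                 # 1092 rows per batch (1092 * 30 = 32760 < 32767)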
Hope this helps! I was able to populate the database (albeit slowly) via ADF, but I would much prefer a COPY-based method for better performance.

Understanding Azure SQL Performance

The facts:
1 Azure SQL S0 instance
a few tables, one of them containing ~8.6 million rows and 1 PK
Running a Count-query on this table takes nearly 30 minutes (!) to complete.
Upscaling the instance from S0 to S1 reduces the query time to 13 minutes:
Looking at the Azure portal (new version), the resource usage monitor shows the following:
Questions:
Does anyone else consider even 13 minutes ridiculous for a simple COUNT()?
Does the second screenshot mean that during the 100% period my instance isn't responding to other requests?
Why are my metrics limited to 100% on both S0 and S1? (See under "Which Service Tier is Right for My Database?", which states: "These values can be above 100% (a big improvement over the values in the preview that were limited to a maximum of 100).") I'd expect the S0 to be at 150% or so if the quoted statement is true.
I'm interested in other people's experiences with databases of more than 1,000 records or so. I don't see how an S*-scaled Azure SQL instance at 22-55 € per month could help me with upscaling strategies at the moment.
Azure SQL Database editions provide increasing levels of DTUs from Basic -> Standard -> Premium (CPU, IO, memory, and other resources; see https://msdn.microsoft.com/en-us/library/azure/dn741336.aspx). Once your query reaches its DTU limit (100%) in any of these resource dimensions, it will continue to receive resources at that level (but not more), and that may increase the latency in completing the request. It looks like in your scenario above, the query is hitting its DTU limit (10 DTUs for S0 and 20 for S1). You can see the individual resource usage percentages (CPU, Data IO, or Log IO) by adding these metrics to the same graph, or by querying the DMV sys.dm_db_resource_stats.
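For example, a minimal sketch (server, database, and credentials are placeholders) that pulls the recent resource-usage history from sys.dm_db_resource_stats with pyodbc, so you can see which dimension is pegged at 100%:

import pyodbc

# Placeholder connection details
conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};"
    "SERVER=myserver.database.windows.net;DATABASE=mydb;"
    "UID=myuser;PWD=mypassword"
)

# sys.dm_db_resource_stats keeps roughly an hour of history in 15-second intervals
rows = conn.execute(
    "SELECT TOP 20 end_time, avg_cpu_percent, avg_data_io_percent, avg_log_write_percent "
    "FROM sys.dm_db_resource_stats ORDER BY end_time DESC"
).fetchall()

for r in rows:
    print(r.end_time, r.avg_cpu_percent, r.avg_data_io_percent, r.avg_log_write_percent)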
Here is a blog that provides more information on appropriately sizing your database performance levels. http://azure.microsoft.com/blog/2014/09/11/azure-sql-database-introduces-new-near-real-time-performance-metrics/
To your specific questions
1) As you have 8.6 million rows, the database needs to scan the index entries to get the count back, so it may be hitting the IO limit for the edition here.
2) If you have multiple concurrent queries running against your DB, they will be scheduled appropriately so that no request starves the others. But latencies may increase further for all queries, since you will be hitting the available resource limits.
3) For the older Web/Business editions, you may see metric values going beyond 100% (they are normalized to the limits of an S2 level), as those editions don't have any specific limits and run in a resource-shared environment with other customer loads. For the new editions, metrics will never exceed 100%, because the system guarantees you resources up to 100% of that edition's limits, but no more. This provides a predictable, guaranteed amount of resources for your DB, unlike Web/Business editions, where you may get very little or a lot more at different times depending on other competing customer DB workloads running on the same machine.
Hope this helps.
-- Srini
