We previously noticed that Azure runtime 1.2.3 scans the /home folder, a new behavior compared to 1.1.7. It caused a 20-minute delay during initial configuration, because it was scanning the /home folder simultaneously with another scan. Now, with 1.2.5, we have a real problem: this simultaneous scanning prevents modules from starting.
Question 1: Why does the Azure runtime perform such a scan on the host device?
Question 2: Is it possible to disable the /home folder scan entirely, or to exclude certain subfolders from the scan?
Request: This was not the behavior with Azure Runtime 1.1.x. Will it be fixed or changed in 1.2.x in the near future?
Posting here, as Server Fault doesn't seem to have the detailed Azure knowledge.
I have an Azure storage account with a file share. The file share is connected to an Azure VM through a mapped drive. An FTP server on the VM accepts a stream of files and stores them directly in the file share.
There are no other connections. Only I have Azure admin access; a limited number of support people have access to the VM.
Last week, for unknown reasons, 16 million files, which are nested in many sub-folders (by origin and date), moved instantly into an unrelated subfolder three levels deep.
I'm baffled as to how this can happen. There is a clear, instant cut-off when the files moved.
As a result, I'm seeing increased LRS costs, presumably because Azure storage is internally replicating the change at my expense.
I have attempted to copy the files back using a VM and AzCopy. The process crashed midway through, leaving me with a half-completed copy operation. The failed attempt took days, which makes me confident it wasn't the support guys dragging and dropping a folder by accident.
Questions:
Is it possible to instantly move so many files (and if so, how)?
Is there a solid way I can move the files back, taking into account the half-copied files? I mean an Azure back-end operation rather than writing an app / PowerShell / AzCopy.
Is there a cost-efficient way of doing this? (I'm on the Transaction Optimised tier.)
Do I have a case here to get Microsoft to do something? We didn't move them... I assume something messed up internally.
Thanks
A tool that supports server-side copy (like AzCopy) can move the files quickly because only the metadata is updated. If you want to investigate the root cause, I recommend opening a support case. (To sort this out, your best bet is to connect with the Azure support team by filing a ticket; on a best-effort basis they can help guide you on this matter.)
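As a rough sketch of what such a server-side copy could look like, something along the following lines restores the folder tree without routing the data through a VM. The account, share, folder path, and SAS token are placeholders rather than values from the question, and this is a copy, not a move, so the misplaced originals would still need to be deleted after verifying the result:

# Server-side copy within the same file share; the data does not pass through the client.
# <account>, <share>, and <sas> are placeholders for your own values.
azcopy copy \
  "https://<account>.file.core.windows.net/<share>/wrong/nested/folder?<sas>" \
  "https://<account>.file.core.windows.net/<share>?<sas>" \
  --recursive \
  --overwrite=ifSourceNewer

The --overwrite=ifSourceNewer flag is intended to skip files that the earlier, half-finished attempt already restored; it's worth checking its exact behaviour for your AzCopy version before relying on it.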
/var/mqsi/components/broker_name is, as far as I am aware, the default directory in which the broker called broker_name stores its configuration information. Below this directory are the pid directories containing execution group information, stdout/stderr, and other things.
I don't know whether it's relevant to the issue we're seeing, but for historical reasons we have this directory in a different location for our brokers: /var/broker_name/data/mqsi/components/broker_name.
We are using IIB 9.0.0.7 with multi-instance brokers (so all the directories are on the same HNAS, with different mounts). We have eight brokers, so /var/broker_name_1, /var/broker_name_2, etc. are each mount points, with the directories /data/mqsi/components/broker_name_n below them.
Yesterday afternoon, the directory /var/broker_name_n/data/mqsi/components/broker_name_n disappeared from the file system for two of the brokers. The directory /var/broker_name_n/data/mqsi/components still exists, but the broker_name_n directory below components disappeared.
Because the processes were running in memory, the broker and the applications on its execution groups carried on working until we restarted one of the brokers, at which point all of the execution groups disappeared. The other broker was still running, so I looked and saw that its directory was missing too. Restarting the broker recreated the directory, but not the execution groups.
Is this behaviour something that anyone has seen before? Could it be caused by something in IIB, or is it more likely something on the system? It seems odd that it was that specific directory on both servers if it's a file-system issue.
In /var/log/messages we don't see any issues, except for errors because various processes can't find the files they need - e.g. BIP3108S: Unable to initialize the listener environment. Exception text getListenerParametersFromFile: java.io.FileNotFoundException: /var/broker_name_1/data/mqsi/components/broker_name_1/config/wsplugin6.conf (No such file or directory)
Is there any way to recover from this in IIB without reinstalling everything?
So, I have a collection of Windows Server 2016 virtual machines that are used to run some tests in pairs. To perform these tests, I copy a selection of scripts and files from the network onto the machine.
I'm basically using a set of scripts that have existed around here since before my time, and whilst I would like to use other methods, so much of our infrastructure relies on these scripts that overhauling the system would be a colossal task.
First up, I sort out the mapped drives with:
net use X: \\network\location1 /user:domain\user password
net use Y: \\network\location2 /user:domain\user password
and so on
Soon after, I use rsync to copy files from a location under /cygdrive/y/somewhere to /cygdrive/c/somewhere_else.
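The copy step is essentially the following (the paths and options here are illustrative placeholders, not the exact ones from our scripts):

# Run from Cygwin bash after the drives have been mapped with net use.
rsync -av /cygdrive/y/somewhere/ /cygdrive/c/somewhere_else/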
During the rsync, I get errors saying that files have vanished (I'm currently unable to post the exact error; I will edit this later to include it). When I check what's currently in the /cygdrive directory, all I see is /cygdrive/c; everything else has disappeared.
I've tried making a symbolic link to /cygdrive/y in a different location, I've tried including /persistent:yes on the net use command, and I've changed the power settings on the network card so that it doesn't sleep. None of these work.
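For reference, those two workarounds look roughly like this (the share path, credentials, and link location are placeholders):

# From Cygwin bash: re-map the drive as persistent (quoting keeps bash from eating the backslashes).
net use Y: '\\network\location2' '/user:domain\user' password /persistent:yes

# Symbolic link to the mapped drive in a different location.
ln -s /cygdrive/y ~/y_link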
I'm currently looking into the settings for the virtual machines themselves, but I have my doubts, as we have other virtual Windows machines that do not seem to have this issue.
Has anyone heard of anything similar and/or know of a decent method to troubleshoot this?
Right, so I've been working on this all day and finally noticed a positive change, but since my systems are in VMware's vCloud, this may not work for everyone. It was simply a matter of turning the VM off and upgrading the Virtual Hardware Version to the latest version. I have noticed, though, that upon a restart, one of the first messages that comes up mentions that the computer is "disabling group policies".
I did a bit of research into this and found that Windows 8 and 10 (with no mention of any Windows Server versions) both automatically update Group Policy in the background, disconnecting and reconnecting mapped drives to recreate them.
It's possible that changing the Group Policy drive-mapping action from "Replace" (which deletes and recreates the drive on each refresh) to "Update" would fix this issue, and that the Virtual Hardware upgrade happened to resolve it in a similar manner.
I'm now running a simulation code called CMAQ on a remote cluster. I first ran a benchmark test in serial to see the performance of the software. However, the job always runs for dozens of hours and then crashes with the following "Stale file handle, errno=116" error message:
PBS Job Id: 91487.master.cluster
Job Name: cmaq_cctm_benchmark_serial.sh
Exec host: hs012/0
An error has occurred processing your job, see below.
Post job file processing error; job 91487.master.cluster on host hs012/0
Unknown resource type REJHOST=hs012.cluster MSG=invalid home directory '/home/shangxin' specified, errno=116 (Stale file handle)
This is very strange, because I never modified the home directory, and "/home/shangxin/" is definitely my permanent directory, where the code lives...
Also, in the standard output .log file, the following message is always shown when the job fails:
Bus error
100247.930u 34.292s 27:59:02.42 99.5% 0+0k 16480+0io 2pf+0w
What does this message mean specifically?
I initially thought the error occurred because the job used up all the RAM and this was a memory overflow issue. However, when I logged into the compute node while the job was running and checked memory usage with the "free -m" and "htop" commands, I saw that both RAM and swap usage never exceeded 10%, staying at a very low level, so memory usage is not the problem.
Because I used "tee" to record the job's output to a log file, that file can grow to tens of thousands of lines and over 1 MB in size. To test whether this standard output was overwhelming the cluster, I ran the same job again without the standard output log file. The new job still failed with the same "Stale file handle, errno=116" error after dozens of hours, so the standard output is not the reason either.
I also tried running the job in parallel on multiple cores; it still failed with the same error after dozens of hours.
I'm sure the code I'm using is not the problem, because it finishes successfully on other clusters. The administrator of this cluster is looking into the issue but has not been able to find the specific cause so far.
Has anyone ever run into this weird error? What should we do to fix this problem on the cluster? Any help is appreciated!
On academic clusters, home directories are frequently mounted via NFS on each node in the cluster to give you a uniform experience across all of the nodes. If this were not the case, each node would have its own version of your home directory, and you would have to take explicit action to copy relevant files between worker nodes and/or the login node.
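You can usually confirm this from a login or worker node, for example (assuming standard Linux tools are available there):

# Show the filesystem type backing the home directory (nfs/nfs4 indicates an NFS mount).
df -hT /home/shangxin

# Or look at the mount table directly.
mount | grep /home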
It sounds like the NFS mount of your home directory on the worker node probably failed while your job was running. This isn't a problem you can fix directly unless you have administrative privileges on the cluster. If you need a workaround and cannot wait for the sysadmins to address the problem, you could:
Try using a different network drive on the worker node (if one is available). On clusters I've worked on, there is often scratch space or other NFS-mounted storage directly under the root /. You might get lucky and find an NFS mount that is more reliable than your home directory.
Have your job work in a temporary directory local to the worker node and write all of its output files and logs to that directory, copying everything back to your home directory on the login node only at the end of the job (see the sketch below). This could be difficult with ssh if your keys are in your home directory, and could require you to copy keys to the temporary directory, which is generally a bad idea unless you restrict access to them with file permissions.
Try getting assigned to a different node of the cluster. In my experience, academic clusters often have some nodes that are flakier than others. Depending on local settings, you may be able to request certain nodes directly, or request resources that are only available on the stable nodes. If you can track which nodes are unstable and you find your job assigned to one of them, you could resubmit your job and then cancel the one sitting on the unstable node.
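As a sketch of the temporary-directory approach, a Torque/PBS job script could stage everything through node-local scratch. The scratch path, resource request, and staging details here are assumptions to adapt to your cluster, not known values:

#!/bin/bash
#PBS -N cmaq_cctm_benchmark
#PBS -l nodes=1:ppn=1

# Node-local scratch directory (the exact path depends on how the cluster is set up).
WORKDIR=/tmp/$USER/$PBS_JOBID
mkdir -p "$WORKDIR"

# Stage the run script and inputs from the submit directory onto local scratch.
cp -r "$PBS_O_WORKDIR"/. "$WORKDIR"/
cd "$WORKDIR"

# Run the benchmark, keeping all output on the local disk instead of the NFS home directory.
./cmaq_cctm_benchmark_serial.sh > run.log 2>&1

# Copy results back only at the very end, so the job depends on the
# NFS-mounted home directory for seconds rather than for dozens of hours.
cp -r "$WORKDIR" "$HOME/cmaq_results_$PBS_JOBID"

This doesn't fix the flaky mount, but it narrows the window in which a stale file handle can kill the job.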
The easiest solution is to work with the cluster administrators, but I understand they don't always work on your schedule.
I have an Azure Service Fabric development cluster running locally with two applications.
After a two-week holiday I came back to find that my hard drive is completely full; consequently, nothing really works anymore.
The sfdevcluster\log\traces folder has many *.etl files, all larger than 100 MB.
All kinds of other log files larger than 250 MB are also present.
So my questions: how do I disable tracing/logging on Azure Service Fabric, and are there tools to manage the log files?
The PowerShell script that does the cluster setup magic is:
Program Files\Microsoft SDKs\Service Fabric\ClusterSetup\DevClusterSetup.ps1
Looking inside, there is a function called DeployNodeConfiguration which sets the log and data paths using the PowerShell command New-ServiceFabricNodeConfiguration. Unfortunately, there does not seem to be a way to limit the size of those folders.
I believe that your slowness / freezing is due to insufficient space on the OS drive (it happened to me too, haha). A workaround can be to set the location of those folders to a non-OS drive with a limited amount of space.
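If I recall correctly, the setup script exposes parameters for exactly those locations, so re-creating the dev cluster with both roots on another drive should look roughly like this (run from an elevated PowerShell prompt; the drive letter and folder names are only examples, and the parameter names are worth verifying against your SDK version):

# Re-create the local dev cluster with data and log roots on a non-OS drive.
& "$env:ProgramFiles\Microsoft SDKs\Service Fabric\ClusterSetup\DevClusterSetup.ps1" `
    -PathToClusterDataRoot "D:\SfDevCluster\Data" `
    -PathToClusterLogRoot "D:\SfDevCluster\Log"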
Hope this helps
This turned out to be a bug in Service Fabric; upgrade your local cluster to the latest version, 6.1.472.9494, which fixes the issue. More details here.