Unable to get a container to use glusterfs running through convoy on Rancher - glusterfs

I have completed the installation as per the tutorial on YouTube: installed GlusterFS and Convoy Gluster via the catalogue.
However, when I try to start a container that utilises GlusterFS, it fails to start. It completes all the usual stages, such as networking, but then just gets stuck in the Starting (Starting) state.

In my case the problem turned out to be that I needed to enter convoy-gluster as the storage driver, rather than convoy as shown in the tutorial.
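
For reference, a rough sketch of the equivalent setup outside the Rancher UI, assuming Convoy Gluster is registered as a Docker volume plugin; the volume name, image, and mount path below are placeholders:

docker volume create --driver convoy-gluster mydata    # create a volume backed by the convoy-gluster driver
docker run -d --name web -v mydata:/data nginx         # mount it into a container

The same idea applies in the Rancher UI or a compose file: the driver name has to be convoy-gluster, not convoy.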

Related

Databricks: can't install a new cluster on Azure Databricks

I tried to install a new cluster on Databricks (I lost the one I used; someone deleted it) and it doesn't work. I get the following message:
Time: ....
Message: Cluster terminated. Reason: Network Configuration Failure
The data plane network is misconfigured. Please verify that the network for your data plane is configured correctly.
Instance ID: ...............
Error message: Failed to launch HostedContainer{hostPrivateIP=......,
containerIp=....,
clusterId=...., resources=InstantiatedResources
{memoryMB=9105, ECUs=3.0, cgroupShares=...},
isSpot=false, id=...., instanceId=InstanceId(....),
state=Pending,instanceType=Standard_DS3_v2,
metadata=ContainerMetadata(Standard_DS3_v2)}
Because starting the FUSE daemon timed out.
This may happen because your VMs do not have outbound connectivity to DBFS storage.
Also consider upgrading your cluster to a later spark version.
What can I do? Maybe I have to unmount?
Can you please try playing around with the instance type and see if that helps? Maybe this particular VM instance type is not available in the region. Also check with another version of PySpark. AFAIK unmounting will not help, but I may be wrong.
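
If you want to try a different instance type quickly, here is a rough sketch using the Databricks CLI; every field value below is a placeholder you would adjust for your workspace:

databricks clusters create --json '{
  "cluster_name": "network-test",
  "spark_version": "7.3.x-scala2.12",
  "node_type_id": "Standard_DS4_v2",
  "num_workers": 1
}'

That said, since the error explicitly says the data plane network is misconfigured, verifying that the workspace's VNet/NSG rules allow outbound connectivity to DBFS storage is probably the more important check.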

Does Kubernetes restart a failed container or create a new container when the running container fails for any reason?

I ran a docker container locally and it stores data in a file (currently no volume is mounted). I stored some data using the API. After that I crashed the container using process.exit(1) and started it again. The previously stored data in the container survives (as expected). But when I do the same thing in Kubernetes (minikube) the data is lost.
Posting this as a community wiki for better visibility, feel free to edit and expand it.
As described in the comments, Kubernetes replaces failed containers with new (identical) ones, and this explains why the container's filesystem comes back clean.
Also, as said, containers should be stateless. There are different options for running applications and taking care of their data (see the sketch below the links for persisting data with a volume):
Run a stateless application using a Deployment
Run a stateful application either as a single instance or as a replicated set
Run automated tasks with a CronJob
Useful links:
Kubernetes workloads
Pod lifecycle
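
As a concrete illustration of persisting data, here is a rough sketch (image name, mount path, and sizes are placeholders) that mounts a PersistentVolumeClaim so anything written under /data survives container restarts and pod replacement:

cat <<'EOF' | kubectl apply -f -
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: demo-data
spec:
  accessModes: ["ReadWriteOnce"]
  resources:
    requests:
      storage: 1Gi
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: demo-app
spec:
  replicas: 1
  selector:
    matchLabels:
      app: demo-app
  template:
    metadata:
      labels:
        app: demo-app
    spec:
      containers:
      - name: app
        image: my-registry/my-app:latest   # placeholder image
        volumeMounts:
        - name: data
          mountPath: /data                 # write files here instead of the container layer
      volumes:
      - name: data
        persistentVolumeClaim:
          claimName: demo-data
EOF

For a truly stateful application with more than one replica, a StatefulSet with volumeClaimTemplates (as covered in the links above) is the better fit.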

titan rexster with external cassandra instance

I have a cassandra cluster (2.1.0) running fine.
After installing titan 5.1, and editing titan-cassandra.properties to point to the cluster hostname list rather than localhost, I run the following:
titan.sh -c conf/titan-cassandra.properties start
It is able to recognize the running cassandra instance and it starts Elasticsearch, but it times out while connecting to Rexster.
If I run it with local cassandra, everything runs fine using the following:
titan.sh start
Do I need to make any change in the rexster properties to point to the running cassandra cluster?
Thanks in advance
Titan Server, started by titan.sh, represents a quick way to get started with Titan/Rexster/ES. It is designed to simplify running all those things with one startup script. Once you start breaking things apart (e.g. a separate cassandra cluster), you might not want to use titan.sh anymore because it still forks a cassandra process when it starts up. Presumably you don't need that anymore, given that you have a separate cassandra cluster.
Given the more advanced nature of your environment, I would simply download Rexster and configure it to connect to your cluster.
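
As a rough sketch of what pointing at the external cluster involves (hostnames are placeholders, and the exact keys depend on your Titan/Rexster versions), the graph that Rexster serves needs the same storage settings you already put in titan-cassandra.properties, e.g.:

cat > conf/external-cassandra.properties <<'EOF'
storage.backend=cassandrathrift
storage.hostname=cass-node1,cass-node2,cass-node3
EOF

In standalone Rexster you would then reference these storage settings in the Titan graph section of rexster.xml, instead of relying on the Cassandra instance that titan.sh forks.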

HDInsight word count map reduce program stuck at mapper 100% and reducer 0%

I am new to Hadoop and I have a very similar problem to the one posted here. The only difference is that the OP runs Hadoop on Linux, whereas I am running it on Windows.
I have installed the Hadoop Azure HDInsight Emulator on my local machine. When I run a simple word count program, the mapper job runs perfectly to 100% but the reduce job gets stuck at 0%.
I tried debugging it as suggested by Chris (in response to this question) and found the problem to be the hostname on which the reducer jobs run (which was the exact problem the OP had).
The reducer is not running on localhost; instead it tries to reach the host 192.168.17.213, which is not resolving, so the reducer cannot progress from there.
These are the error logs:
copy failed: attempt_201402111921_0017_m_000000_0 from 192.168.17.213
2014-02-12 01:51:53,073 WARN org.apache.hadoop.mapred.ReduceTask:
java.net.ConnectException: Connection timed out: connect
The OP got that issue resolved by changing the /etc/hosts file setting to localhost.
But that seems to be a Linux config. How do I set my hostname to localhost in my Hadoop Azure HDInsight Emulator?
There is an article showing you how to run the word counting MapReduce program on HDInsight emulator. The article is Get started with HDInsight emulator located at http://www.windowsazure.com/en-us/documentation/articles/hdinsight-get-started-emulator/.
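As for the hosts-file part of the question: the Windows equivalent of /etc/hosts is C:\Windows\System32\drivers\etc\hosts (edit it as Administrator). A rough sketch of the entry that mirrors the OP's Linux fix, with the hostname as a placeholder for whatever name the tasktracker reports:

127.0.0.1    my-machine-name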

I could not submit a job to the executing node in condor apart from the central manager

I have a Condor pool which consists of 4 dedicated machines. One is set up as the central manager, submitting, and executing node, while the other three are set up as executing nodes. I use CentOS 5.4 as the OS for all the machines. My problem is that when I submit a job from the central manager it only runs on the central manager; when I specify in the JDL file that the job should run on any machine apart from the central manager, the job stays on hold and does not run. When I type condor_status all nodes appear. I keep the MASTER and STARTD daemons in the daemon list for the executing nodes. Has anyone come across this problem?
There's not enough information to answer your question, but the first thing to do is to run condor_q -analyze <jobid> and see what it tells you. See the Condor manual Section 2.6.5: Why is the job not running?
One possible cause is that you're not telling Condor to transfer your input/output files for you, and your nodes have different "filesystem domains", so Condor is unable to find a host which shares a common filesystem with your submit host.
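
If the filesystem-domain mismatch turns out to be the cause, here is a minimal submit-file sketch that turns on Condor's file transfer (the executable and file names are placeholders):

cat > job.sub <<'EOF'
universe                = vanilla
executable              = my_job.sh
should_transfer_files   = YES
when_to_transfer_output = ON_EXIT
output                  = job.out
error                   = job.err
log                     = job.log
queue
EOF
condor_submit job.sub
condor_q -analyze    # pass the job id that condor_submit prints

With file transfer enabled, Condor no longer requires the execute nodes to share a filesystem domain with the submit host.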
