Torque Job Stuck with mpirun - linux

I ran the following PBS script to run a job with my newly configured Torque:
#!/bin/sh
#PBS -N asyn
#PBS -q batch
#PBS -l nodes=2:ppn=2
#PBS -l walltime=120:00:00
cd $PBS_O_WORKDIR
cat $PBS_NODEFILE>nodes
mpirun -np 4 gmx_mpi mdrun -deffnm asyn_10ns
The job gets stuck in the R state and its elapsed time stays at 00:00:00. When I check it with tracejob, I get the following details:
[root@headnode ~]# tracejob 11
/var/spool/torque/mom_logs/20170825: No such file or directory
Job: 11.headnode
08/25/2017 13:49:31.230 S enqueuing into batch, state 1 hop 1
08/25/2017 13:49:31.360 S Job Modified at request of root@headnode
08/25/2017 13:49:31.373 L Job Run
08/25/2017 13:49:31.361 S Job Run at request of root@headnode
08/25/2017 13:49:31.374 S Not sending email: User does not want mail of this type.
08/25/2017 13:49:31 A queue=batch
08/25/2017 13:49:31 A user=souparno group=souparno jobname=asyn queue=batch ctime=1503649171 qtime=1503649171 etime=1503649171
start=1503649171 owner=souparno@headnode exec_host=headnode3/0-1+headnode2/0-1 Resource_List.nodes=2:ppn=2
Resource_List.walltime=120:00:00 Resource_List.nodect=2 Resource_List.neednodes=2:ppn=2
The sched_log is giving me the following details:
08/25/2017 13:49:31.373;64; pbs_sched.25166;Job;11.headnode;Job Run
The server_log gives the following output:
08/25/2017 13:49:31.230;256;PBS_Server.25216;Job;11.headnode;enqueuing into batch, state 1 hop 1
08/25/2017 13:49:31.230;08;PBS_Server.25216;Job;perform_commit_work;job_id: 11.headnode
08/25/2017 13:49:31.230;02;PBS_Server.25216;node;close_conn;Closing connection 8 and calling its accompanying function on close
08/25/2017 13:49:31.360;08;PBS_Server.25134;Job;11.headnode;Job Modified at request of root@headnode
08/25/2017 13:49:31.361;08;PBS_Server.25134;Job;11.headnode;Job Run at request of root@headnode
08/25/2017 13:49:31.374;13;PBS_Server.25134;Job;11.headnode;Not sending email: User does not want mail of this type.
08/25/2017 13:50:59.137;02;PBS_Server.25119;Svr;PBS_Server;Torque Server Version = 6.1.1.1, loglevel = 0
What could be causing the job to get stuck? Also note that the file "nodes" is never created.
pbsnodes -a gives the following output:
[root@headnode ~]# pbsnodes -a
headnode2
state = free
power_state = Running
np = 22
ntype = cluster
jobs = 0-1/18.headnode
status = opsys=linux,uname=Linux headnode2 2.6.32-358.el6.x86_64 #1 SMP Tue Jan 29 11:47:41 EST 2013 x86_64,sessions=2406 3695 3699 3701 3731 3733 3757,nsessions=7,nusers=2,idletime=2901,totmem=82448372kb,availmem=80025348kb,physmem=49401852kb,ncpus=24,loadave=23.00,gres=,netload=1677000736,state=free,varattr= ,cpuclock=OnDemand:2301MHz,macaddr=34:40:b5:e5:4a:fa,version=6.1.1.1,rectime=1503919171,jobs=18.headnode
mom_service_port = 15002
mom_manager_port = 15003
headnode3
state = free
power_state = Running
np = 22
ntype = cluster
jobs = 0-1/18.headnode
status = opsys=linux,uname=Linux headnode3 2.6.32-358.el6.x86_64 #1 SMP Tue Jan 29 11:47:41 EST 2013 x86_64,sessions=3570 3574 3576 3602 3604 3628 3803 32545,nsessions=8,nusers=3,idletime=882,totmem=98996200kb,availmem=97047600kb,physmem=65949680kb,ncpus=24,loadave=16.00,gres=,netload=1740623635,state=free,varattr= ,cpuclock=OnDemand:2301MHz,macaddr=34:40:b5:e5:43:52,version=6.1.1.1,rectime=1503919176,jobs=18.headnode
mom_service_port = 15002
mom_manager_port = 15003
headnode4
state = free
power_state = Running
np = 22
ntype = cluster
status = opsys=linux,uname=Linux headnode4 2.6.32-358.el6.x86_64 #1 SMP Tue Jan 29 11:47:41 EST 2013 x86_64,sessions=3592 4057 27119,nsessions=3,nusers=1,idletime=73567,totmem=98991080kb,availmem=96941208kb,physmem=65944560kb,ncpus=24,loadave=23.99,gres=,netload=727722516,state=free,varattr= ,cpuclock=OnDemand:2200MHz,macaddr=34:40:b5:e5:49:8a,version=6.1.1.1,rectime=1503919177,jobs=
mom_service_port = 15002
mom_manager_port = 15003
headnode5
state = free
power_state = Running
np = 22
ntype = cluster
status = opsys=linux,uname=Linux headnode5 2.6.32-358.el6.x86_64 #1 SMP Tue Jan 29 11:47:41 EST 2013 x86_64,sessions=17666,nsessions=1,nusers=1,idletime=2897,totmem=74170352kb,availmem=71840968kb,physmem=49397752kb,ncpus=24,loadave=23.04,gres=,netload=5756452931,state=free,varattr= ,cpuclock=OnDemand:2200MHz,macaddr=34:40:b5:e5:4a:a2,version=6.1.1.1,rectime=1503919174,jobs=
mom_service_port = 15002
mom_manager_port = 15003
headnode6
state = free
power_state = Running
np = 22
ntype = cluster
status = opsys=linux,uname=Linux headnode6 2.6.32-358.el6.x86_64 #1 SMP Tue Jan 29 11:47:41 EST 2013 x86_64,sessions=3678 24197,nsessions=2,nusers=1,idletime=70315,totmem=98991080kb,availmem=97279540kb,physmem=65944560kb,ncpus=24,loadave=16.00,gres=,netload=711846161,state=free,varattr= ,cpuclock=OnDemand:2200MHz,macaddr=34:40:b5:e5:44:52,version=6.1.1.1,rectime=1503919171,jobs=
mom_service_port = 15002
mom_manager_port = 15003

Related

Extract attribute value from log file and store logs in separate file based on attribute value

I am trying to generate a separate log file for each user based on a pattern.
As the log files are quite huge (a couple of hundred GB), I want to use awk or a similar shell command to process them efficiently.
Sample input log looks like:
2019-09-01T06:04:55+00:00 xxxxxxxx CEF: 0|XXX XXXX XXX|YYY-OS|9.0.3|end|TRAFFIC|1|rt=Sep 12 2019 06:04:55 GMT deviceExternalId=11111000000 dvchost=ABCDEFGHIJ src=0.0.0.0 dst=0.0.0.0 sourceTranslatedAddress=0.0.0.0 destinationTranslatedAddress=0.0.0.0 cs1Label=Rule_name cs1=sec_t2u_allow_ssl suser=intra\\test.user1 duser= app=ssl\n
2019-09-01T06:04:55+00:00 xxxxxxxx CEF: 0|XXX XXXX XXX|YYY-OS|9.0.3|end|TRAFFIC|1|rt=Sep 12 2019 06:04:55 GMT deviceExternalId=11111000000 dvchost=ABCDEFGHIJ src=0.0.0.0 dst=0.0.0.0 sourceTranslatedAddress=0.0.0.0 destinationTranslatedAddress=0.0.0.0 cs1Label=Rule_name cs1=sec_t2u_allow_ssl suser=intra\\ts-test.user2 duser= app=ssl\n
2019-09-01T06:04:55+00:00 xxxxxxxx CEF: 0|XXX XXXX XXX|YYY-OS|9.0.3|end|TRAFFIC|1|rt=Sep 12 2019 06:04:55 GMT deviceExternalId=11111000000 dvchost=ABCDEFGHIJ src=0.0.0.0 dst=0.0.0.0 sourceTranslatedAddress=0.0.0.0 destinationTranslatedAddress=0.0.0.0 cs1Label=Rule_name cs1=sec_t2u_allow_ssl suser= duser= app=ssl\n
A username is present in some lines but not in all of them. If a line only has "suser= " (empty), I want to put it in a separate file.
The expected output format would be:
###test.user1.log -->###
2019-09-01T06:04:55+00:00 xxxxxxxx CEF: 0|XXX XXXX XXX|YYY-OS|9.0.3|end|TRAFFIC|1|rt=Sep 12 2019 06:04:55 GMT deviceExternalId=11111000000 dvchost=ABCDEFGHIJ src=0.0.0.0 dst=0.0.0.0 sourceTranslatedAddress=0.0.0.0 destinationTranslatedAddress=0.0.0.0 cs1Label=Rule_name cs1=sec_t2u_allow_ssl suser=intra\\\\test.user1 duser= app=ssl\n
.....
###ts-test.user2.log --> ###
2019-09-01T06:04:55+00:00 xxxxxxxx CEF: 0|XXX XXXX XXX|YYY-OS|9.0.3|end|TRAFFIC|1|rt=Sep 12 2019 06:04:55 GMT deviceExternalId=11111000000 dvchost=ABCDEFGHIJ src=0.0.0.0 dst=0.0.0.0 sourceTranslatedAddress=0.0.0.0 destinationTranslatedAddress=0.0.0.0 cs1Label=Rule_name cs1=sec_t2u_allow_ssl suser=intra\\ts-test.user2 duser= app=ssl\n
.....
###others.log --> ###
2019-09-01T06:04:55+00:00 xxxxxxxx CEF: 0|XXX XXXX XXX|YYY-OS|9.0.3|end|TRAFFIC|1|rt=Sep 12 2019 06:04:55 GMT deviceExternalId=11111000000 dvchost=ABCDEFGHIJ src=0.0.0.0 dst=0.0.0.0 sourceTranslatedAddress=0.0.0.0 destinationTranslatedAddress=0.0.0.0 cs1Label=Rule_name cs1=sec_t2u_allow_ssl suser= duser= app=ssl\n
I tried to find a regex to match the pattern.
The regex "(\bsuser\b)=(.*?(?=\s\w+=|$))" could be used to match patterns like "suser=abcd\\ts-test.user".
Some of the regexp operators used in the question are not available in the awk 4.1.4 (gawk) that I have (for example the lookahead '(?=)'), so the regexp needs some modification.
Here is a small awk script that you can use as a starting point:
awk '
/suser=/ {
    # fall back to a catch-all log file for lines without a usable username
    logfile = "others.log"
    # extract the user ID from the suser= token of $0 into items[1] (gawk 3-argument match)
    if ( match($0, " suser=\\S+\\.(\\w+)", items) ) {
        logfile = items[1] ".log"
    }
    # print NR, logfile   # uncomment for debugging
    print $0 > logfile
}'
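The script reads from standard input, so on a large file you just feed the log straight through it once. As a minimal example, assuming the program above is saved as split_by_user.awk and the input file is called traffic.log (both names are placeholders):
# one pass over the big log; the per-user *.log files are created in the current directory
gawk -f split_by_user.awk traffic.log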

Spark Cloudant error: 'nothing was saved because the number of records was 0!'

I'm using the spark-cloudant library 1.6.3 that is installed by default with the spark service.
I'm trying to save some data to Cloudant:
val df = getTopXRecommendationsForAllUsers().toDF.filter( $"_1" > 6035)
println(s"Saving ${df.count()} ratings to Cloudant: " + new Date())
println(df.show(5))
val timestamp: Long = System.currentTimeMillis / 1000
val dbName: String = s"${destDB.database}_${timestamp}"
df.write.mode("append").json(s"${dbName}.json")
val dfWriter = df.write.format("com.cloudant.spark")
dfWriter.option("cloudant.host", destDB.host)
if (destDB.username.isDefined && destDB.username.get.nonEmpty) dfWriter.option("cloudant.username", destDB.username.get)
if (destDB.password.isDefined && destDB.password.get.nonEmpty) dfWriter.option("cloudant.password", destDB.password.get)
dfWriter.save(dbName)
However, I hit the error:
Starting getTopXRecommendationsForAllUsers: Sat Dec 24 08:50:11 CST 2016
Finished getTopXRecommendationsForAllUsers: Sat Dec 24 08:50:11 CST 2016
Saving 6 ratings to Cloudant: Sat Dec 24 08:50:17 CST 2016
+----+--------------------+
| _1| _2|
+----+--------------------+
|6036|[[6036,2503,4.395...|
|6037|[[6037,572,4.5785...|
|6038|[[6038,1696,4.894...|
|6039|[[6039,572,4.6854...|
|6040|[[6040,670,4.6820...|
+----+--------------------+
only showing top 5 rows
()
Use connectorVersion=1.6.3, dbName=recommendationdb_1482591017, indexName=null, viewName=null,jsonstore.rdd.partitions=5, + jsonstore.rdd.maxInPartition=-1,jsonstore.rdd.minInPartition=10, jsonstore.rdd.requestTimeout=900000,bulkSize=20, schemaSampleSize=1
Name: org.apache.spark.SparkException
Message: Job aborted due to stage failure: Task 2 in stage 642.0 failed 10 times, most recent failure: Lost task 2.9 in stage 642.0 (TID 409, yp-spark-dal09-env5-0049): java.lang.RuntimeException: Database recommendationdb_1482591017: nothing was saved because the number of records was 0!
at com.cloudant.spark.common.JsonStoreDataAccess.saveAll(JsonStoreDataAccess.scala:187)
I know there is data because I also save it to files:
! cat recommendationdb_1482591017.json/*
{"_1":6036,"_2":[{"user":6036,"product":2503,"rating":4.3957030284620355},{"user":6036,"product":2019,"rating":4.351395783537379},{"user":6036,"product":1178,"rating":4.3373212302468165},{"user":6036,"product":923,"rating":4.3328207761734605},{"user":6036,"product":922,"rating":4.320787353937724},{"user":6036,"product":750,"rating":4.307312349612301},{"user":6036,"product":53,"rating":4.304341611330176},{"user":6036,"product":858,"rating":4.297961629128419},{"user":6036,"product":1212,"rating":4.285360675560061},{"user":6036,"product":1423,"rating":4.275255129149407}]}
{"_1":6037,"_2":[{"user":6037,"product":572,"rating":4.578508339835482},{"user":6037,"product":858,"rating":4.247809350206506},{"user":6037,"product":904,"rating":4.1222486445799404},{"user":6037,"product":527,"rating":4.117342524702621},{"user":6037,"product":787,"rating":4.115781026855997},{"user":6037,"product":2503,"rating":4.109861422105844},{"user":6037,"product":1193,"rating":4.088453520710152},{"user":6037,"product":912,"rating":4.085139017248665},{"user":6037,"product":1221,"rating":4.084368219857013},{"user":6037,"product":1207,"rating":4.082536396283374}]}
{"_1":6038,"_2":[{"user":6038,"product":1696,"rating":4.894442132848873},{"user":6038,"product":2998,"rating":4.887752985607918},{"user":6038,"product":2562,"rating":4.740442462948304},{"user":6038,"product":3245,"rating":4.7366090605162094},{"user":6038,"product":2609,"rating":4.736125582066063},{"user":6038,"product":1669,"rating":4.678373819044571},{"user":6038,"product":572,"rating":4.606132758047402},{"user":6038,"product":1493,"rating":4.577140478430046},{"user":6038,"product":745,"rating":4.56568047928448},{"user":6038,"product":213,"rating":4.546054686400765}]}
{"_1":6039,"_2":[{"user":6039,"product":572,"rating":4.685425482619273},{"user":6039,"product":527,"rating":4.291256016077275},{"user":6039,"product":904,"rating":4.27766400846558},{"user":6039,"product":2019,"rating":4.273486883864949},{"user":6039,"product":2905,"rating":4.266371181044469},{"user":6039,"product":912,"rating":4.26006044096224},{"user":6039,"product":1207,"rating":4.259935289367192},{"user":6039,"product":2503,"rating":4.250370780277651},{"user":6039,"product":1148,"rating":4.247288578998062},{"user":6039,"product":745,"rating":4.223697008637559}]}
{"_1":6040,"_2":[{"user":6040,"product":670,"rating":4.682008703927743},{"user":6040,"product":3134,"rating":4.603656534071515},{"user":6040,"product":2503,"rating":4.571906881428182},{"user":6040,"product":3415,"rating":4.523567737705732},{"user":6040,"product":3808,"rating":4.516778146579665},{"user":6040,"product":3245,"rating":4.496176019230939},{"user":6040,"product":53,"rating":4.491020821805015},{"user":6040,"product":668,"rating":4.471757243976877},{"user":6040,"product":3030,"rating":4.464674231353673},{"user":6040,"product":923,"rating":4.446195112198678}]}
{"_1":6042,"_2":[{"user":6042,"product":3389,"rating":3.331488167984286},{"user":6042,"product":572,"rating":3.3312810949271903},{"user":6042,"product":231,"rating":3.2622287749148926},{"user":6042,"product":1439,"rating":3.0988533259613944},{"user":6042,"product":333,"rating":3.0859809743588706},{"user":6042,"product":404,"rating":3.0573976830913203},{"user":6042,"product":216,"rating":3.044620107397873},{"user":6042,"product":408,"rating":3.038302525994588},{"user":6042,"product":2411,"rating":3.0190834747311244},{"user":6042,"product":875,"rating":2.9860048032439095}]}
This is a defect in spark-cloudant 1.6.3 that is fixed in 1.6.4. The pull request is https://github.com/cloudant-labs/spark-cloudant/pull/61
The answer is to upgrade to spark-cloudant 1.6.4. See this answer if you are trying to do that on the IBM Bluemix Spark Service: Spark-cloudant package 1.6.4 loaded by %AddJar does not get used by notebook
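If you manage your own Spark environment rather than the Bluemix service, one way to pull in the newer connector is the --packages flag of spark-shell/spark-submit. The coordinates below are an assumption on my part; check spark-packages.org or the project's release page for the exact 1.6.4 artifact:
# hypothetical package coordinates for the 1.6.4 connector
spark-shell --packages cloudant-labs:spark-cloudant:1.6.4-s_2.10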

Spark master node ip not updated in Zookeeper store after failover

We are working on a Spark HA setup in a multi-node cluster. What we have seen is that after the standby Spark master becomes "Alive", the ZooKeeper data store for the master is not updated; it still has the old machine's IP address. We thought of building an application that would use this information to return the current master in a typical HA setup, but because of this bug we are not able to build it. For example, ZooKeeper has the data:
[zk: localhost:2181(CONNECTED) 2] get /sparkha/master_status
10.49.1.112
cZxid = 0x10000007f
ctime = Wed Sep 14 23:52:52 PDT 2016
mZxid = 0x10000007f
mtime = Wed Sep 14 23:52:52 PDT 2016
pZxid = 0x10000007f
cversion = 0
dataVersion = 0
aclVersion = 0
ephemeralOwner = 0x0
dataLength = 11
numChildren = 0
After failover the IP should change to the new machine's IP. We wanted to know: is this a bug, or are we missing something?
We know about spark-daemon.sh status, but we are focusing more on the ZK store.
Thanks

pending slurm jobs not showing up in sacct

I am experiencing an issue with slurm where sacct does not show pending jobs. Below, you can see that job 110061 doesn't show up in sacct but is clearly pending in squeue. Any ideas as to why this would happen?
[plcmp14evs:/sim/dev/ash/projects/full-trees] 153% sacct -j 110061
JobID JobName Partition Account AllocCPUS State ExitCode
------------ ---------- ---------- ---------- ---------- ---------- --------
[plcmp14evs:/sim/dev/ash/projects/full-trees] 154% squeue -j 110061
JOBID PARTITION STATE USER TIME TIMELIMIT NODES NODELIST(REASON) NAME
110061 eventual PENDING andrew 0:00 UNLIMITED 1 (Priority) [rf] script.8.R
-- Edit --
This is the output of scontrol show config | grep Acc
AccountingStorageBackupHost = (null)
AccountingStorageEnforce = none
AccountingStorageHost = localhost
AccountingStorageLoc = /disks/linux/tmp/slurm_accounting.txt
AccountingStoragePort = 0
AccountingStorageType = accounting_storage/filetxt
AccountingStorageUser = root
AccountingStoreJobComment = YES
JobAcctGatherFrequency = 3 sec
JobAcctGatherType = jobacct_gather/linux
At this point there is nothing to record, since the job has not even started yet.
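With the accounting_storage/filetxt backend nothing is written to the accounting file until the job starts, so in the meantime the pending job can only be inspected through the controller, for example:
# the controller still knows about the pending job even though sacct has no record of it yet
scontrol show job 110061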

Out of memory exception while registering to a WMI event query

Short: What could cause an out-of-memory error when registering a WQL event query (error code 0x80041006)? How can we investigate the cause?
Long: We keep getting an out-of-memory exception when trying to register a specific WQL event query against the MicrosoftDNS provider on a Windows 2003 R2 server.
We can reproduce by registering the following WQL notification query in wbemtest:
select * from __InstanceOperationEvent within 20 where TargetInstance.ContainerName="xyz.com" AND (TargetInstance ISA "MicrosoftDNS_CNAMEType")
Here is the wbemess.log file that corresponds to this query and exception:
(Tue Nov 10 10:19:14 2009.66327484) : Polling query 'select * from MicrosoftDNS_CNAMEType where ContainerName = "xyz.com"' failed with error code 0x80041006. Will retry at next polling interval
(Tue Nov 10 10:19:14 2009.66327484) : Polling query 'select * from MicrosoftDNS_CNAMEType where ContainerName = "xyz.com"' failed on the first try with error code 0x80041006.
Deactivating subscription
Other types (e.g. MicrosoftDNS_AType) seem to work fine.
What could be the cause of such an error? How can we debug / track it down? Are there any throttles / quotas we can try adjusting to find the problem? Any help or pointers would be highly appreciated.
P.S. The full log section corresponding to this repro:
(Tue Nov 10 10:19:13 2009.66326250) : Registering notification sink with query select * from __InstanceOperationEvent within 20 where TargetInstance.ContainerName="xyz.com" AND (TargetInstance ISA "MicrosoftDNS_CNAMEType") in namespace //./root/MicrosoftDNS.
(Tue Nov 10 10:19:13 2009.66326250) : Activating filter 0A2F8D88 with query select * from __InstanceOperationEvent within 20 where TargetInstance.ContainerName="xyz.com" AND (TargetInstance ISA "MicrosoftDNS_CNAMEType") in namespace //./root/MicrosoftDNS.
(Tue Nov 10 10:19:13 2009.66326250) : Activating filter 0A35B658 with query select * from __ClassOperationEvent where TargetClass isa "MicrosoftDNS_CNAMEType" in namespace //./root/MicrosoftDNS.
(Tue Nov 10 10:19:13 2009.66326250) : Activating filter 'select * from __ClassOperationEvent where TargetClass isa "MicrosoftDNS_CNAMEType"' with provider $Core
(Tue Nov 10 10:19:13 2009.66326265) : Activating filter 'select * from __InstanceOperationEvent within 20 where TargetInstance.ContainerName="xyz.com" AND (TargetInstance ISA "MicrosoftDNS_CNAMEType")' with provider $Core
(Tue Nov 10 10:19:13 2009.66326265) : Instituting polling query select * from MicrosoftDNS_CNAMEType where ContainerName = "xyz.com" to satisfy event query select * from __InstanceOperationEvent within 20 where TargetInstance.ContainerName="xyz.com" AND (TargetInstance ISA "MicrosoftDNS_CNAMEType")
(Tue Nov 10 10:19:13 2009.66326265) : Executing polling query 'select * from MicrosoftDNS_CNAMEType where ContainerName = "xyz.com"' in namespace '//./root/MicrosoftDNS'
(Tue Nov 10 10:19:14 2009.66327484) : Polling query 'select * from MicrosoftDNS_CNAMEType where ContainerName = "xyz.com"' failed with error code 0x80041006. Will retry at next polling interval
(Tue Nov 10 10:19:14 2009.66327484) : Polling query 'select * from MicrosoftDNS_CNAMEType where ContainerName = "xyz.com"' failed on the first try with error code 0x80041006.
Deactivating subscription
(Tue Nov 10 10:19:14 2009.66327484) : Deactivating filter 0A35B658
(Tue Nov 10 10:19:14 2009.66327484) : Deactivating filter 0A2F8D88
Try something like this:
1) Run "wbemtest" from a cmd prompt
2) Connect to the "root" namespace (not "root\default", just "root")
3) Select Open Instance, and specify "__ProviderHostQuotaConfiguration=@"
4) Check "Local Only" for easier readability and you will see the threshold values
5) Change the MemoryPerHost value to something greater, e.g. double it (256 MB)
6) Save Property
7) Save Object
8) Exit
9) Restart the WMI service
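If you would rather script the change than click through wbemtest, an untested sketch using wmic might look like the following (values are in bytes; 268435456 = 256 MB, i.e. double the 128 MB default):
rem inspect the current per-host memory quota on the __ProviderHostQuotaConfiguration singleton
wmic /namespace:\\root path __ProviderHostQuotaConfiguration get MemoryPerHost
rem raise the quota, then restart the WMI service for it to take effect
wmic /namespace:\\root path __ProviderHostQuotaConfiguration set MemoryPerHost=268435456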
