Spark master node ip not updated in Zookeeper store after failover - apache-spark

We were working on a Spark HA setup in a multi-node cluster. We have seen that after the standby Spark master becomes "Alive", the ZooKeeper data store for the master is not updated; it still has the old machine's IP address. We thought of building an application that would use this info and return the current master in a typical HA setup, but because of this bug in ZooKeeper we are not able to build that application. For example, ZooKeeper has the data:
[zk: localhost:2181(CONNECTED) 2] get /sparkha/master_status
10.49.1.112
cZxid = 0x10000007f
ctime = Wed Sep 14 23:52:52 PDT 2016
mZxid = 0x10000007f
mtime = Wed Sep 14 23:52:52 PDT 2016
pZxid = 0x10000007f
cversion = 0
dataVersion = 0
aclVersion = 0
ephemeralOwner = 0x0
dataLength = 11
numChildren = 0
After failover, the IP should be updated to the new machine's IP. We wanted to know: is this a bug, or are we missing something?
We know about spark-daemon.sh status, but we are focusing more on the ZooKeeper store.
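For illustration, a minimal sketch (not part of the original setup) of the kind of reader application described above, assuming the kazoo Python client and the /sparkha/master_status znode shown earlier:
from kazoo.client import KazooClient
# Connect to the same ZooKeeper ensemble that Spark uses for HA recovery.
zk = KazooClient(hosts="localhost:2181")
zk.start()
# /sparkha/master_status is the znode shown above; after a failover it is
# expected to hold the address of the currently ALIVE master.
data, stat = zk.get("/sparkha/master_status")
print("current master:", data.decode("utf-8"))
zk.stop()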
Thanks

Related

pypyodbc "Login timeout expired" error on linux, same code working on windows machine

I have the below pypyodbc code running on my Windows computer, connecting to a Windows server with a SQL database.
When I use the exact same code on a Linux machine in the same network I receive a
pypyodbc.OperationalError: ('HYT00', '[HYT00] [Microsoft][ODBC Driver 17 for SQL Server]Login timeout expired')
error. The suggestions in this Stackoverflow question or this question did not help. The query on the Windows machine is successful, and I can ping the Windows server from both machines. Both machines run pypyodbc version 1.3.5.
import pypyodbc as pyodbc
# Raw string so the backslash before the instance name is not treated as an escape.
server = r"XXX.XXX.X.XX\INSTANCENAME"
port = "1433"
database = "DATABASENAME"
user = "sa"
PWD = "XXX"
tcon = "no"
driver = '{ODBC Driver 17 for SQL Server}'
con = pyodbc.connect(Trusted_Connection=tcon, driver=driver, server=server,
                     database=database, UID=user, PWD=PWD)
crsr = con.cursor()
crsr.execute("SELECT * FROM dbo.DigiCenter where Number = '" + str(123456) + "'")
for row in crsr.fetchall():
    print(row)
con.close()
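Not from the original post: one variation sometimes tried when a named instance cannot be resolved from Linux (the \INSTANCENAME lookup relies on the SQL Server Browser service over UDP 1434) is to address the server by host and TCP port directly; note that the port variable above is defined but never used. A sketch under that assumption:
import pypyodbc as pyodbc
# Hypothetical variant: the "host,port" syntax skips the instance-name lookup.
con = pyodbc.connect(driver='{ODBC Driver 17 for SQL Server}',
                     server="XXX.XXX.X.XX,1433",
                     database="DATABASENAME", UID="sa", PWD="XXX")
crsr = con.cursor()
crsr.execute("SELECT @@VERSION")
print(crsr.fetchone())
con.close()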

Spark Cloudant error: 'nothing was saved because the number of records was 0!'

I'm using the spark-cloudant library 1.6.3, which is installed by default with the Spark service.
I'm trying to save some data to Cloudant:
val df = getTopXRecommendationsForAllUsers().toDF.filter( $"_1" > 6035)
println(s"Saving ${df.count()} ratings to Cloudant: " + new Date())
println(df.show(5))
val timestamp: Long = System.currentTimeMillis / 1000
val dbName: String = s"${destDB.database}_${timestamp}"
df.write.mode("append").json(s"${dbName}.json")
val dfWriter = df.write.format("com.cloudant.spark")
dfWriter.option("cloudant.host", destDB.host)
if (destDB.username.isDefined && destDB.username.get.nonEmpty) dfWriter.option("cloudant.username", destDB.username.get)
if (destDB.password.isDefined && destDB.password.get.nonEmpty) dfWriter.option("cloudant.password", destDB.password.get)
dfWriter.save(dbName)
However, I hit the error:
Starting getTopXRecommendationsForAllUsers: Sat Dec 24 08:50:11 CST 2016
Finished getTopXRecommendationsForAllUsers: Sat Dec 24 08:50:11 CST 2016
Saving 6 ratings to Cloudant: Sat Dec 24 08:50:17 CST 2016
+----+--------------------+
| _1| _2|
+----+--------------------+
|6036|[[6036,2503,4.395...|
|6037|[[6037,572,4.5785...|
|6038|[[6038,1696,4.894...|
|6039|[[6039,572,4.6854...|
|6040|[[6040,670,4.6820...|
+----+--------------------+
only showing top 5 rows
()
Use connectorVersion=1.6.3, dbName=recommendationdb_1482591017, indexName=null, viewName=null,jsonstore.rdd.partitions=5, + jsonstore.rdd.maxInPartition=-1,jsonstore.rdd.minInPartition=10, jsonstore.rdd.requestTimeout=900000,bulkSize=20, schemaSampleSize=1
Name: org.apache.spark.SparkException
Message: Job aborted due to stage failure: Task 2 in stage 642.0 failed 10 times, most recent failure: Lost task 2.9 in stage 642.0 (TID 409, yp-spark-dal09-env5-0049): java.lang.RuntimeException: Database recommendationdb_1482591017: nothing was saved because the number of records was 0!
at com.cloudant.spark.common.JsonStoreDataAccess.saveAll(JsonStoreDataAccess.scala:187)
I know there is data because I also save it to files:
! cat recommendationdb_1482591017.json/*
{"_1":6036,"_2":[{"user":6036,"product":2503,"rating":4.3957030284620355},{"user":6036,"product":2019,"rating":4.351395783537379},{"user":6036,"product":1178,"rating":4.3373212302468165},{"user":6036,"product":923,"rating":4.3328207761734605},{"user":6036,"product":922,"rating":4.320787353937724},{"user":6036,"product":750,"rating":4.307312349612301},{"user":6036,"product":53,"rating":4.304341611330176},{"user":6036,"product":858,"rating":4.297961629128419},{"user":6036,"product":1212,"rating":4.285360675560061},{"user":6036,"product":1423,"rating":4.275255129149407}]}
{"_1":6037,"_2":[{"user":6037,"product":572,"rating":4.578508339835482},{"user":6037,"product":858,"rating":4.247809350206506},{"user":6037,"product":904,"rating":4.1222486445799404},{"user":6037,"product":527,"rating":4.117342524702621},{"user":6037,"product":787,"rating":4.115781026855997},{"user":6037,"product":2503,"rating":4.109861422105844},{"user":6037,"product":1193,"rating":4.088453520710152},{"user":6037,"product":912,"rating":4.085139017248665},{"user":6037,"product":1221,"rating":4.084368219857013},{"user":6037,"product":1207,"rating":4.082536396283374}]}
{"_1":6038,"_2":[{"user":6038,"product":1696,"rating":4.894442132848873},{"user":6038,"product":2998,"rating":4.887752985607918},{"user":6038,"product":2562,"rating":4.740442462948304},{"user":6038,"product":3245,"rating":4.7366090605162094},{"user":6038,"product":2609,"rating":4.736125582066063},{"user":6038,"product":1669,"rating":4.678373819044571},{"user":6038,"product":572,"rating":4.606132758047402},{"user":6038,"product":1493,"rating":4.577140478430046},{"user":6038,"product":745,"rating":4.56568047928448},{"user":6038,"product":213,"rating":4.546054686400765}]}
{"_1":6039,"_2":[{"user":6039,"product":572,"rating":4.685425482619273},{"user":6039,"product":527,"rating":4.291256016077275},{"user":6039,"product":904,"rating":4.27766400846558},{"user":6039,"product":2019,"rating":4.273486883864949},{"user":6039,"product":2905,"rating":4.266371181044469},{"user":6039,"product":912,"rating":4.26006044096224},{"user":6039,"product":1207,"rating":4.259935289367192},{"user":6039,"product":2503,"rating":4.250370780277651},{"user":6039,"product":1148,"rating":4.247288578998062},{"user":6039,"product":745,"rating":4.223697008637559}]}
{"_1":6040,"_2":[{"user":6040,"product":670,"rating":4.682008703927743},{"user":6040,"product":3134,"rating":4.603656534071515},{"user":6040,"product":2503,"rating":4.571906881428182},{"user":6040,"product":3415,"rating":4.523567737705732},{"user":6040,"product":3808,"rating":4.516778146579665},{"user":6040,"product":3245,"rating":4.496176019230939},{"user":6040,"product":53,"rating":4.491020821805015},{"user":6040,"product":668,"rating":4.471757243976877},{"user":6040,"product":3030,"rating":4.464674231353673},{"user":6040,"product":923,"rating":4.446195112198678}]}
{"_1":6042,"_2":[{"user":6042,"product":3389,"rating":3.331488167984286},{"user":6042,"product":572,"rating":3.3312810949271903},{"user":6042,"product":231,"rating":3.2622287749148926},{"user":6042,"product":1439,"rating":3.0988533259613944},{"user":6042,"product":333,"rating":3.0859809743588706},{"user":6042,"product":404,"rating":3.0573976830913203},{"user":6042,"product":216,"rating":3.044620107397873},{"user":6042,"product":408,"rating":3.038302525994588},{"user":6042,"product":2411,"rating":3.0190834747311244},{"user":6042,"product":875,"rating":2.9860048032439095}]}
This is a defect in spark-cloudant 1.6.3 that is fixed in 1.6.4. The pull request is https://github.com/cloudant-labs/spark-cloudant/pull/61
The answer is to upgrade to spark-cloudant 1.6.4. See this answer if you are trying to do that on the IBM Bluemix Spark Service: Spark-cloudant package 1.6.4 loaded by %AddJar does not get used by notebook

Percona Xtradb cluster crashing

We have a Percona XtraDB Cluster with 5 nodes and an arbitrator. One of our PHP developers ran a bad query on the cluster, crashing all the nodes. After the crash, we could not collect any error log to tell us what really went wrong, as the entire cluster crashed without performing any logging.
I have always thought that when a single query is executed on the cluster, it is processed by only one of the nodes. So if the query is bad (to the point of killing a DB server), it should only crash the one node that is processing it, leaving the cluster running with the remaining 4 nodes.
This behavior has puzzled us, and we would like to understand what is really going on, especially since this is the second time it has happened. Why would a query running on the cluster, processed by one of the nodes, cause the other nodes to crash when something goes wrong while it is being processed?
Below is our my.cnf config:
#
# Default values.
[mysqld_safe]
flush_caches
numa_interleave
#
#
[mysqld]
back_log = 65535
binlog_format = ROW
character_set_server = utf8
collation_server = utf8_general_ci
datadir = /var/lib/mysql
default_storage_engine = InnoDB
expand_fast_index_creation = 1
expire_logs_days = 7
innodb_autoinc_lock_mode = 2
innodb_buffer_pool_instances = 16
innodb_buffer_pool_populate = 1
innodb_buffer_pool_size = 32G # XXX 64GB RAM, 80%
innodb_data_file_path = ibdata1:64M;ibdata2:64M:autoextend
innodb_file_format = Barracuda
innodb_file_per_table
innodb_flush_log_at_trx_commit = 2
innodb_flush_method = O_DIRECT
innodb_io_capacity = 1600
innodb_large_prefix
innodb_locks_unsafe_for_binlog = 1
innodb_log_file_size = 64M
innodb_print_all_deadlocks = 1
innodb_read_io_threads = 64
innodb_stats_on_metadata = FALSE
innodb_support_xa = FALSE
innodb_write_io_threads = 64
log-bin = mysqld-bin
log-queries-not-using-indexes
log-slave-updates
long_query_time = 1
max_allowed_packet = 64M
max_connect_errors = 4294967295
max_connections = 4096
min_examined_row_limit = 1000
port = 3306
relay-log-recovery = TRUE
skip-name-resolve
slow_query_log = 1
slow_query_log_timestamp_always = 1
table_open_cache = 4096
thread_cache = 1024
tmpdir = /db/tmp
transaction_isolation = REPEATABLE-READ
updatable_views_with_limit = 0
user = mysql
wait_timeout = 60
#
# Galera Variable config
wsrep_cluster_address = gcomm://ip_1,ip_2,ip_3,ip_4,ip_5
wsrep_cluster_name = cluster_db
wsrep_provider = /usr/lib/libgalera_smm.so
wsrep_provider_options = "gcache.size=4G"
wsrep_slave_threads = 32
wsrep_sst_auth = "user:password"
wsrep_sst_donor = "db1"
#wsrep_sst_method = xtrabackup_throttle
wsrep_sst_method = xtrabackup-v2
#
# XXX You *MUST* change!
server-id = 1
Can you post the query? SELECT queries only execute on a single node, but all write queries are replicated to and applied on every node. What's in your error log?
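Not part of the original exchange: as a quick way to check from a script whether each node is still part of the cluster and Synced (and which node's error log to pull), something like the following could be used, assuming the PyMySQL package and placeholder monitoring credentials:
import pymysql  # assumption: PyMySQL installed on the monitoring host
NODES = ["ip_1", "ip_2", "ip_3", "ip_4", "ip_5"]  # placeholders from my.cnf
for host in NODES:
    try:
        conn = pymysql.connect(host=host, user="monitor", password="secret",
                               connect_timeout=3)  # hypothetical credentials
        with conn.cursor() as cur:
            cur.execute("SHOW GLOBAL STATUS LIKE 'wsrep_cluster_size'")
            size = cur.fetchone()
            cur.execute("SHOW GLOBAL STATUS LIKE 'wsrep_local_state_comment'")
            state = cur.fetchone()
        print(host, size, state)
        conn.close()
    except Exception as exc:
        print(host, "unreachable:", exc)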

how to connect to oracle db using OCCI on linux?

I am trying to connect to an Oracle database (PHK01200_SECCOMPAS_APPL.WORLD) from a Linux box (RHEL 6) using OCCI (Oracle Instant Client version 12.1, the latest). I am getting a TNS error on connection, although tnsping works fine. Could you please help with setting up the correct configuration? What am I missing here?
Output
[m499757@hkl20030996 bin]$ ./sqlplus toolkit/******@PHK01200_SECCOMPAS_APPL.WORLD
SQL*Plus: Release 10.2.0.3.0 - Production on Thu May 22 12:32:56 2014
Copyright (c) 1982, 2006, Oracle. All Rights Reserved.
ERROR:
ORA-12505: TNS:listener does not currently know of SID given in connect
descriptor
Config details:
Tnsnames.ora
PHK01200_SECCOMPAS_APPL.WORLD =
(DESCRIPTION =
(ADDRESS_LIST =
(ADDRESS = (PROTOCOL = TCP)(HOST = PHKLOD2001-xxxx.xx.hedani.net)(PORT = 1522))
)
(CONNECT_DATA =
(SID = PHK01200_SECCOMPAS_APPL)
)
)
Odbc.ini
[PHK01200_SECCOMPAS_APPL]
Driver = OracleODBC-12g
DSN = OracleODBC-12g
ServerName = PHK01200_SECCOMPAS_APPL.WORLD
UserID = toolkit
Password = ******
Odbcinst.ini
[OracleODBC-12g]
Description = Oracle ODBC driver for Oracle 12g
Driver = /cs/gat/share/oracle/64/instantclient/libsqora.so.12.1
Driver64 = /cs/gat/share/oracle/64/instantclient/libsqora.so.12.1
FileUsage = 1
Driver Logging = 7
LDAP.ora
# LDAP.ORA Configuration
# Generated by Oracle configuration tools.
DEFAULT_ADMIN_CONTEXT = "dc=uk,dc=csfb,dc=com"
#DEFAULT_ADMIN_CONTEXT = "dc=corpny,dc=csfb,dc=com"
DIRECTORY_SERVERS= (oid_ldap_server_sg.sg.csfb.com:1522:1524,oid_ldap_server_ny.corpny.csfb.com:1522:1524,oid_ldap_server_ln.csfp.co.uk:1522:1524)
DIRECTORY_SERVER_TYPE = OID
sqlnet.ora
AUTOMATIC_IPC = OFF
TRACE_LEVEL_CLIENT = OFF
TCP.NODELAY = YES
NAMES.DIRECTORY_PATH= (TNSNAMES,LDAP,ONAMES,HOSTNAME)
names.default_domain = world
name.default_zone = world
The below worked:
./sqlplus toolkit/******@PHK01200
And in my Python script I used the connection string below:
"DRIVER={OracleODBC-12g}; Dbq=PHKLOD2001-scan.ap.hedani.net:1522/PHK01200_SECCOMPAS_APPL.WORLD"

Out of memory exception while registering to a WMI event query

Short: What could cause an out-of-memory error when registering a WQL event query (error code 0x80041006)? How can we investigate the cause?
Long: We keep getting an out-of-memory exception when trying to register a specific WQL event query in the MicrosoftDNS provider on a Windows 2003 R2 server.
We can reproduce by registering the following WQL notification query in wbemtest:
select * from __InstanceOperationEvent within 20 where TargetInstance.ContainerName="xyz.com" AND (TargetInstance ISA "MicrosoftDNS_CNAMEType")
Here is the wbemess.log file that corresponds to this query and exception:
(Tue Nov 10 10:19:14 2009.66327484) : Polling query 'select * from MicrosoftDNS_CNAMEType where ContainerName = "xyz.com"' failed with error code 0x80041006. Will retry at next polling interval
(Tue Nov 10 10:19:14 2009.66327484) : Polling query 'select * from MicrosoftDNS_CNAMEType where ContainerName = "xyz.com"' failed on the first try with error code 0x80041006.
Deactivating subscription
Other types (e.g. MicrosoftDNS_AType) seem to work fine.
What could be the cause of such an error? How can we debug / track it down? Are there any throttles / quotas we can try adjusting to find the problem? Any help or pointers would be highly appreciated.
P.S. The full log section corresponding to this repro:
(Tue Nov 10 10:19:13 2009.66326250) : Registering notification sink with query select * from __InstanceOperationEvent within 20 where TargetInstance.ContainerName="xyz.com" AND (TargetInstance ISA "MicrosoftDNS_CNAMEType") in namespace //./root/MicrosoftDNS.
(Tue Nov 10 10:19:13 2009.66326250) : Activating filter 0A2F8D88 with query select * from __InstanceOperationEvent within 20 where TargetInstance.ContainerName="xyz.com" AND (TargetInstance ISA "MicrosoftDNS_CNAMEType") in namespace //./root/MicrosoftDNS.
(Tue Nov 10 10:19:13 2009.66326250) : Activating filter 0A35B658 with query select * from __ClassOperationEvent where TargetClass isa "MicrosoftDNS_CNAMEType" in namespace //./root/MicrosoftDNS.
(Tue Nov 10 10:19:13 2009.66326250) : Activating filter 'select * from __ClassOperationEvent where TargetClass isa "MicrosoftDNS_CNAMEType"' with provider $Core
(Tue Nov 10 10:19:13 2009.66326265) : Activating filter 'select * from __InstanceOperationEvent within 20 where TargetInstance.ContainerName="xyz.com" AND (TargetInstance ISA "MicrosoftDNS_CNAMEType")' with provider $Core
(Tue Nov 10 10:19:13 2009.66326265) : Instituting polling query select * from MicrosoftDNS_CNAMEType where ContainerName = "xyz.com" to satisfy event query select * from __InstanceOperationEvent within 20 where TargetInstance.ContainerName="xyz.com" AND (TargetInstance ISA "MicrosoftDNS_CNAMEType")
(Tue Nov 10 10:19:13 2009.66326265) : Executing polling query 'select * from MicrosoftDNS_CNAMEType where ContainerName = "xyz.com"' in namespace '//./root/MicrosoftDNS'
(Tue Nov 10 10:19:14 2009.66327484) : Polling query 'select * from MicrosoftDNS_CNAMEType where ContainerName = "xyz.com"' failed with error code 0x80041006. Will retry at next polling interval
(Tue Nov 10 10:19:14 2009.66327484) : Polling query 'select * from MicrosoftDNS_CNAMEType where ContainerName = "xyz.com"' failed on the first try with error code 0x80041006.
Deactivating subscription
(Tue Nov 10 10:19:14 2009.66327484) : Deactivating filter 0A35B658
(Tue Nov 10 10:19:14 2009.66327484) : Deactivating filter 0A2F8D88
Try something like this:
1) Run "wbemtest" from a cmd prompt
2) Connect to the "root" namespace (not "root\default", just "root")
3) Select Open Instance, and specify "__ProviderHostQuotaConfiguration=@"
4) Check "Local Only" for easier readability and you will see the threshold values
5) Change the MemoryPerHost value to something greater - e.g. double it (256 MB)
6) Save Property
7) Save Object
8) Exit
9) Restart the WMI service
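Not part of the original answer: the same quota change can also be scripted with pywin32 instead of going through wbemtest (run on the affected server; doubling MemoryPerHost mirrors step 5 above):
import win32com.client  # pywin32; Windows only
# Connect to the plain "root" namespace (not root\default) on the local machine.
svc = win32com.client.GetObject(r"winmgmts:\\.\root")
# __ProviderHostQuotaConfiguration is a singleton instance, addressed with "=@".
quota = svc.Get("__ProviderHostQuotaConfiguration=@")
mem = quota.Properties_("MemoryPerHost")
print("Current MemoryPerHost:", mem.Value)
# Double the per-host memory quota and write the object back,
# then restart the WMI service for the change to take effect.
mem.Value = mem.Value * 2
quota.Put_()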
