How to set read timeout configuration in spark-bigquery-connector? - apache-spark

We are using the spark-bigquery-connector to pull data from BigQuery with Spark. Intermittently, we face read timeout issues with the exception com.google.cloud.spark.bigquery.repackaged.com.google.cloud.bigquery.BigQueryException: Read timed out. How do I increase the default timeout value? Below is a sample code snippet showing how we pull the data from BigQuery:
sparkSession.read
  .format("com.google.cloud.spark.bigquery")
  .load("data-set")
  .select("col1", "col2")
  .show(20)
Below are the configurations we set at the sparkConf level:
sparkConf.set("viewsEnabled", "true")
sparkConf.set("parentProject", "<parentProject>")
sparkConf.set("materializationProject", "<materializationProject>")
sparkConf.set("materializationDataset", "<materializationDataset>")
sparkConf.set("credentials", "<>")
If we use the BigQuery client directly, the timeout can be configured as follows:
BigQuery bigquery = BigQueryOptions.getDefaultInstance().toBuilder()
    .setRetrySettings(RetrySettings.newBuilder()
        .setMaxAttempts(10)
        .setRetryDelayMultiplier(1.5)
        .setTotalTimeout(Duration.ofMinutes(5))
        .build())
    .build()
    .getService();
But how can we tune/configure the read timeout value when using sparkSession to read the data?
Exception Trace:
com.google.cloud.spark.bigquery.repackaged.com.google.cloud.bigquery.BigQueryException: Read timed out
at com.google.cloud.spark.bigquery.repackaged.com.google.cloud.bigquery.spi.v2.HttpBigQueryRpc.translate(HttpBigQueryRpc.java:115)
at com.google.cloud.spark.bigquery.repackaged.com.google.cloud.bigquery.spi.v2.HttpBigQueryRpc.getTable(HttpBigQueryRpc.java:287)
at com.google.cloud.spark.bigquery.repackaged.com.google.cloud.bigquery.BigQueryImpl$17.call(BigQueryImpl.java:717)
at com.google.cloud.spark.bigquery.repackaged.com.google.cloud.bigquery.BigQueryImpl$17.call(BigQueryImpl.java:714)
at com.google.cloud.spark.bigquery.repackaged.com.google.api.gax.retrying.DirectRetryingExecutor.submit(DirectRetryingExecutor.java:105)
at com.google.cloud.spark.bigquery.repackaged.com.google.cloud.RetryHelper.run(RetryHelper.java:76)
at com.google.cloud.spark.bigquery.repackaged.com.google.cloud.RetryHelper.runWithRetries(RetryHelper.java:50)
at com.google.cloud.spark.bigquery.repackaged.com.google.cloud.bigquery.BigQueryImpl.getTable(BigQueryImpl.java:713)
at com.google.cloud.spark.bigquery.BigQueryRelationProvider.createRelationInternal(BigQueryRelationProvider.scala:75)
at com.google.cloud.spark.bigquery.BigQueryRelationProvider.createRelation(BigQueryRelationProvider.scala:45)
at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:340)
at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:239)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:227)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:174)
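One possibility, offered as a hedged sketch: newer connector releases appear to document httpConnectTimeout and httpReadTimeout read options (values in milliseconds). Assuming our connector version supports them, the read would look like this; older releases may ignore these options.
sparkSession.read
  .format("com.google.cloud.spark.bigquery")
  .option("httpConnectTimeout", "120000") // time allowed to establish the connection, in ms
  .option("httpReadTimeout", "120000")    // time allowed to read from an established connection, in ms
  .load("data-set")
  .select("col1", "col2")
  .show(20)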

Related

A connection could not be established because the database name '<databaseName>' is larger than the maximum length allowed by the network protocol

We are trying to pull data using a JDBC connection with Spark 3.2.0. Our JDBC call using the Splice driver (com.splicemachine:db-client:2.7.0.1815) looks something like this, with a sample query:
val url_upd = "jdbc:splice://sl73caehdp0225.visa.com:1527/splicedb;user=***;password=a***"
val query = "select top 5 ENCRPT_PYMT_CRD_ACCT_NUM_NORM from optd.ttd_cs_dtl CS where TRAN_CD IN ('01','04','05','06','10','11')"
val jdbcDF = spark.read.format("jdbc").option("url", url_upd)
  .option("driver", "com.splicemachine.db.jdbc.ClientDriver").option("query", query).load()
We are observing a limit of 111 characters on the size of the query. If the query exceeds this limit, we receive the error below:
"A connection could not be established because the database name '' is larger than the maximum length allowed by the network protocol."
The same connection request and query work fine in Spark 2.3.
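Two things might be worth trying, offered purely as a hedged sketch (untested against Splice Machine, and whether either avoids the protocol length limit is an assumption): pass the credentials as separate options so the JDBC URL stays short, and use the classic dbtable-with-subquery form instead of the query option.
val baseUrl = "jdbc:splice://sl73caehdp0225.visa.com:1527/splicedb"
val query = "select top 5 ENCRPT_PYMT_CRD_ACCT_NUM_NORM from optd.ttd_cs_dtl CS " +
  "where TRAN_CD IN ('01','04','05','06','10','11')"
val jdbcDF = spark.read.format("jdbc")
  .option("url", baseUrl)
  .option("driver", "com.splicemachine.db.jdbc.ClientDriver")
  .option("user", "***")
  .option("password", "***")
  .option("dbtable", s"($query) q") // Spark JDBC accepts a parenthesised query with an alias in place of a table
  .load()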

How to read from a static file using a Socket Text Stream with a Batch Interval of 10 seconds in Spark with Python?

I have a static file (log_file) with about 10K records on my local drive (Windows). The structure is as follows:
"date","time","size","r_version","r_arch","r_os","package","version","country","ip_id"
"2012-10-01","00:30:13",35165,"2.15.1","i686","linux-gnu","quadprog","1.5-4","AU",1
I want to read these log records using a socket text stream with a batch interval of 10 seconds, and later perform a few Spark operations with either RDD or DataFrame computations. I have written the code below just to read the data in the time interval, split it into an RDD, and print it.
from pyspark import SparkConf, SparkContext
from pyspark.streaming import StreamingContext
conf = SparkConf().setMaster("local[*]").setAppName("Assignment4")
sc = SparkContext(conf = conf)
ssc = StreamingContext(sc, 10)
data = ssc.socketTextStream("file:///SparkL2/log_file.txt",2222)
linesrdd = data.map(lambda x: x.split(","))
linesrdd.pprint()
ssc.start()
ssc.awaitTermination()
I saved this code and did a spark-submit from the Anaconda command prompt. I am facing an error in the socketTextStream function, probably because I am not using it correctly.
(base) PS C:\Users\HP> cd c:\SparkL2
(base) PS C:\SparkL2> spark-submit Assignment5.py
20/09/09 21:42:42 ERROR ReceiverTracker: Deregistered receiver for stream 0: Error starting receiver 0 - java.net.UnknownHostException: file:///SparkL2/log_file.txt
at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:196)
at java.net.PlainSocketImpl.connect(PlainSocketImpl.java:162)
at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:394)
at java.net.Socket.connect(Socket.java:606)
at java.net.Socket.connect(Socket.java:555)
at java.net.Socket.<init>(Socket.java:451)
at java.net.Socket.<init>(Socket.java:228)
at org.apache.spark.streaming.dstream.SocketReceiver.onStart(SocketInputDStream.scala:61)
at org.apache.spark.streaming.receiver.ReceiverSupervisor.startReceiver(ReceiverSupervisor.scala:149)
at org.apache.spark.streaming.receiver.ReceiverSupervisor.start(ReceiverSupervisor.scala:131)
at org.apache.spark.streaming.scheduler.ReceiverTracker$ReceiverTrackerEndpoint.$anonfun$startReceiver$1(ReceiverTracker.scala:596)
at org.apache.spark.streaming.scheduler.ReceiverTracker$ReceiverTrackerEndpoint.$anonfun$startReceiver$1$adapted(ReceiverTracker.scala:586)
at org.apache.spark.SparkContext.$anonfun$submitJob$1(SparkContext.scala:2242)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
at org.apache.spark.scheduler.Task.run(Task.scala:127)
at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:444)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1377)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:447)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Can anyone help me with this? I am very new to PySpark and want to learn it by myself.
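For what it is worth, socketTextStream expects a hostname and a port, not a file path, which is why the receiver fails with UnknownHostException while trying to resolve the path as a host. Below is a minimal Scala sketch of the two usual alternatives (the same APIs exist in PySpark); the port and the watched directory path are placeholders.
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object Assignment4 {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[*]").setAppName("Assignment4")
    val ssc = new StreamingContext(conf, Seconds(10))

    // Option A: socketTextStream connects to a host and port; something like netcat
    // must already be listening on that port and writing the file's lines to it.
    val lines = ssc.socketTextStream("localhost", 2222)
    lines.map(_.split(",")).print()

    // Option B: watch a directory instead; only files added after the stream starts
    // are picked up, so copy log_file.txt into it once the job is running.
    // val lines = ssc.textFileStream("file:///SparkL2/streaming_input")

    ssc.start()
    ssc.awaitTermination()
  }
}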

Spark Dataframe leftanti Join Fails

We are trying to publish deltas from a Hive table to Kafka. The table in question is a single-partition, single-block file of 244 MB. Our cluster is configured for a 256 MB block size, so we're just about at the maximum for a single file in this case.
Each time that table is updated, a copy is archived, and then we run our delta process.
In the function below, we have isolated the different joins and confirmed that the inner join performs acceptably (about 3 minutes), but the two anti-join DataFrames will not complete -- we keep throwing more resources at the Spark job but continue to see the errors below.
Is there a practical limit on DataFrame sizes for this kind of join?
private class DeltaColumnPublisher(spark: SparkSession, sink: KafkaSink, source: RegisteredDataset)
  extends BasePublisher(spark, sink, source) with Serializable {

  val deltaColumn = "hadoop_update_ts" // TODO: move to the dataset object

  def publishDeltaRun(dataLocation: String, archiveLocation: String): (Long, Long) = {
    val current = spark.read.parquet(dataLocation)
    val previous = spark.read.parquet(archiveLocation)

    // `keys` (the join key columns) is assumed to come from BasePublisher
    val inserts = current.join(previous, keys, "leftanti")
    val updates = current.join(previous, keys).where(current.col(deltaColumn) =!= previous.col(deltaColumn))
    val deletes = previous.join(current, keys, "leftanti")

    val upsertCounter = spark.sparkContext.longAccumulator("upserts")
    val deleteCounter = spark.sparkContext.longAccumulator("deletes")

    logInfo("sending inserts to kafka")
    sink.sendDeltasToKafka(inserts, "U", upsertCounter)
    logInfo("sending updates to kafka")
    sink.sendDeltasToKafka(updates, "U", upsertCounter)
    logInfo("sending deletes to kafka")
    sink.sendDeltasToKafka(deletes, "D", deleteCounter)

    (upsertCounter.value, deleteCounter.value)
  }
}
The errors we're seeing seem to indicate that the driver is losing contact with the executors. We have increased the executor memory to 24 GB, the network timeout to as high as 900s, and the heartbeat interval to as high as 120s.
17/11/27 20:36:18 WARN netty.NettyRpcEndpointRef: Error sending message [message = Heartbeat(1,[Lscala.Tuple2;#596e3aa6,BlockManagerId(1, server, 46292, None))] in 2 attempts
org.apache.spark.rpc.RpcTimeoutException: Futures timed out after [120 seconds]. This timeout is controlled by spark.executor.heartbeatInterval
at ...
Caused by: java.util.concurrent.TimeoutException: Futures timed out after [120 seconds]
at ...
Later in the logs:
17/11/27 20:42:37 WARN netty.NettyRpcEndpointRef: Error sending message [message = Heartbeat(1,[Lscala.Tuple2;#25d1bd5f,BlockManagerId(1, server, 46292, None))] in 3 attempts
org.apache.spark.SparkException: Exception thrown in awaitResult
at ...
Caused by: java.lang.RuntimeException: org.apache.spark.SparkException: Could not find HeartbeatReceiver.
The config switches we have been manipulating (without success) are --executor-memory 24G --conf spark.network.timeout=900s --conf spark.executor.heartbeatInterval=120s
The option I failed to consider is to increase my driver resources. I added --driver-memory 4G and --driver-cores 2 and saw my job complete in about 9 minutes.
It appears that an inner join of these two files (or using the built-in except() method) puts memory pressure on the executors. Partitioning on one of the key columns seems to help ease that memory pressure, but it increases the overall time because more shuffling is involved.
Doing the left-anti join between these two files requires more driver resources. I didn't expect that.
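For reference, the full submit command that ended up working would look roughly like this (the main class and jar names are placeholders, not from the actual job):
spark-submit \
  --class com.example.DeltaColumnPublisherJob \
  --executor-memory 24G \
  --conf spark.network.timeout=900s \
  --conf spark.executor.heartbeatInterval=120s \
  --driver-memory 4G \
  --driver-cores 2 \
  delta-publisher.jar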

columnSimilarities() of RowMatrix returns ERROR Schema: Failed initialising database

Under Spark 2.2.0, I experienced an error using columnSimilarities().
Here is the code to reproduce it:
from pyspark.mllib.linalg.distributed import RowMatrix
rdd = sc.parallelize([[1.0,2.0,1.0],[1.0,5.0,1.0],[1.0,2.0,1.0],[4.0,2.0,4.0]])
mat = RowMatrix(rdd)
sim = mat.columnSimilarities(0.1)
sim.entries.collect()
The error looks like this (truncated, as it is too long; the full log is here):
17/08/13 10:15:19 ERROR Schema: Failed initialising database.
Unable to open a test connection to the given database. JDBC url = jdbc:derby:;databaseName=metastore_db;create=true, username = APP. Terminating connection pool (set lazyInit to true if you expect to start your database after your app). Original Exception: ------
java.sql.SQLException: Failed to start database 'metastore_db' with class loader org.apache.spark.sql.hive.client.IsolatedClientLoader$$anon$1#3234df5e, see the next exception for details.
at org.apache.derby.impl.jdbc.SQLExceptionFactory.getSQLException(Unknown Source)
at org.apache.derby.impl.jdbc.SQLExceptionFactory.getSQLException(Unknown Source)
at org.apache.derby.impl.jdbc.Util.seeNextException(Unknown Source)
at org.apache.derby.impl.jdbc.EmbedConnection.bootDatabase(Unknown Source)
This code works well:
from pyspark.mllib.linalg.distributed import IndexedRow, IndexedRowMatrix
rdd = sc.parallelize([IndexedRow(0, [1.0, 2.0, 1.0]),
                      IndexedRow(1, [1.0, 5.0, 1.0]),
                      IndexedRow(2, [1.0, 2.0, 1.0]),
                      IndexedRow(3, [4.0, 2.0, 4.0])])
mat = IndexedRowMatrix(rdd).toRowMatrix()
sim = mat.columnSimilarities(0.1)
sim.entries.collect()
Is this a bug in Spark?
This is a problem with JDBC connectivity to the Derby metastore, not with columnSimilarities() or MLlib in general.
You might have some work to do to get the Derby connection running. Here is one starting point: https://stackoverflow.com/a/40547664/1056563

Spark Cassandra connector NoHostAvailableException while making multiple reads

While performing multiple selects inside a mapPartitions call, I issue 2 prepared requests per row. For reference, the code looks like this:
source.mapPartitions { partition =>
  lazy val prepared: PreparedStatement = ...
  cc.withSessionDo { session =>
    partition.map { row =>
      session.execute(prepared.bind(row.get("id")))
    }
  }
}
When the batch reaches ~400 rows, it throws:
Caused by: com.datastax.driver.core.exceptions.NoHostAvailableException: All host(s) tried for query failed (tried: /localhost:9042 (com.datastax.driver.core.ConnectionException: [/localhost:9042] Pool is CLOSING))
at com.datastax.driver.core.RequestHandler.reportNoMoreHosts(RequestHandler.java:216)
at com.datastax.driver.core.RequestHandler.access$900(RequestHandler.java:45)
at com.datastax.driver.core.RequestHandler$SpeculativeExecution.sendRequest(RequestHandler.java:276)
at com.datastax.driver.core.RequestHandler.startNewExecution(RequestHandler.java:118)
at com.datastax.driver.core.RequestHandler.sendRequest(RequestHandler.java:94)
at com.datastax.driver.core.SessionManager.execute(SessionManager.java:552)
at com.datastax.driver.core.SessionManager.executeQuery(SessionManager.java:589)
at com.datastax.driver.core.SessionManager.executeAsync(SessionManager.java:97)
... 25 more
I have tried changing configs to see if that helps, but the error still keeps popping up:
.set("spark.cassandra.output.batch.size.rows", "auto")
.set("spark.cassandra.output.concurrent.writes", "500")
.set("spark.cassandra.output.batch.size.bytes", "100000")
.set("spark.cassandra.read.timeout_ms", "120000")
.set("spark.cassandra.connection.timeout_ms" , "120000")
This kind of code should work with the Spark Cassandra connector, but maybe there is something I haven't seen.
After the exception is raised, the next stream batches have no problem connecting to Cassandra.
Did I time out my Cassandra with too many simultaneous requests?
I use Cassandra 2.1.3 with Spark connector 1.4.0-M3 and driver 2.1.7.1.
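One thing worth checking, offered as a hedged sketch rather than a confirmed fix: withSessionDo releases the pooled session once its block returns, but partition.map builds a lazy iterator, so the queries may only run after the pool has begun closing (which would match "Pool is CLOSING"). Materialising the results inside the block keeps the session alive while the queries execute; the CQL string below is a hypothetical placeholder.
source.mapPartitions { partition =>
  cc.withSessionDo { session =>
    // hypothetical query; substitute the real prepared statement
    val prepared: PreparedStatement = session.prepare("SELECT * FROM my_ks.my_table WHERE id = ?")
    partition.map { row =>
      session.execute(prepared.bind(row.get("id")))
    }.toList.iterator // force execution before withSessionDo returns and releases the session
  }
}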
