Teradata extraction with pyspark2 is taking a long time - apache-spark

I am trying to extract the maximum date from a Teradata table through pyspark2. While the same simple query runs in a few seconds directly in Teradata, in Spark it has not returned an answer even after an hour of execution.
I am launching pyspark2 from the CLI, and I have already placed tdgssconfig.jar and terajdbc4.jar in the same location:
pyspark2 --conf spark.ui.port=45321 --jars tdgssconfig.jar,terajdbc4.jar
TD_QUERY = "(select max({a}) as max_date from {b}) as temp".format(a=Partition_Info, b=SOURCE_TABLE_VIEW)
df_td_max_date = spark.read \
    .format("jdbc") \
    .option("url", connection_url) \
    .option("driver", connection_driver) \
    .option("dbtable", TD_QUERY) \
    .option("user", user_name) \
    .option("password", pwd) \
    .load()
max_date_temp = df_td_max_date.collect()
Please let me know if I need to improve any part of this code.
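As a first diagnostic step, it may help to confirm that the max() is actually pushed down to Teradata and that only a single row comes back; a minimal sketch against the DataFrame defined above (standard Spark API calls, not a confirmed fix for the slowness):
# The physical plan should show the "(select max(...) ...) as temp" subquery
# inside the JDBC relation, i.e. the aggregation runs on the Teradata side.
df_td_max_date.explain(True)

# A single-row result does not need collect(); first() returns one Row.
max_date = df_td_max_date.first()["max_date"]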

Related

Running a Spark Streaming job in Zeppelin throws connection refused 8998 error

I'm working in a virtual machine. I run a Spark Streaming job which I basically copied from a Databricks tutorial.
%pyspark
query = (
    streamingCountsDF
        .writeStream
        .format("memory")         # memory = store in-memory table
        .queryName("counts")      # counts = name of the in-memory table
        .outputMode("complete")   # complete = all the counts should be in the table
        .start()
)
Py4JJavaError: An error occurred while calling o101.start.
: java.net.ConnectException: Call From VirtualBox/127.0.1.1 to localhost:8998 failed on connection exception: java.net.ConnectException:
I checked and there is no service listening on port 8998. I learned that this port is associated with the Apache Livy server, which I am not using. Can someone point me in the right direction?
Ok, so I fixed this issue. First, I added 'file://' when specifying the input folder. Second, I added a checkpoint location. See code below:
inputFolder = 'file:///home/sallos/tmp/'

streamingInputDF = (
    spark
        .readStream
        .schema(schema)
        .option("maxFilesPerTrigger", 1)  # Treat a sequence of files as a stream by picking one file at a time
        .csv(inputFolder)
)

streamingCountsDF = (
    streamingInputDF
        .groupBy(
            streamingInputDF.SrcIPAddr,
            window(streamingInputDF.Datefirstseen, "30 seconds"))
        .sum('Bytes')
        .withColumnRenamed("sum(Bytes)", "sum_bytes")
)

query = (
    streamingCountsDF
        .writeStream
        .format("memory")
        .queryName("sumbytes")
        .outputMode("complete")
        .option("checkpointLocation", "file:///home/sallos/tmp_checkpoint/")
        .start()
)
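Since the memory sink registers its results under the query name, the running aggregation can be inspected from the same session; a minimal sketch, assuming the streaming query above is active (the polling loop and interval are arbitrary):
import time

# Poll the in-memory table registered by .queryName("sumbytes").
for _ in range(5):
    spark.sql("SELECT * FROM sumbytes ORDER BY sum_bytes DESC").show(truncate=False)
    time.sleep(10)  # arbitrary polling interval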

Snowflake - PySpark numPartitions support

We're attempting to run a Snowflake query with PySpark; we've set numPartitions to 10 and submitted the Spark query. However, when I checked the Snowflake History tab, as far as I can tell only one query is being executed rather than ten.
Is the numPartitions option supported by the Snowflake-Spark integration? The sample code we executed is shown below.
sfOptions = dict()
sfOptions["url"] ="jdbc:snowflake://**************.privatelink.snowflakecomputing.com"
sfOptions["user"] ="**01d"
sfOptions["private_key_file"] = key_file
sfOptions["private_key_file_pwd"] = key_passphrase
sfOptions["db"] ="**_DB"
sfOptions["warehouse"] ="****_WHS"
sfOptions["schema"] ="***_SHR"
sfOptions["role"] ="**_ROLE"
sfOptions["numPartitions"]="10"
sfOptions["partitionColumn"] = "***_TRANS_ID"
sfOptions["lowerBound"] = lowerbound
sfOptions["upperBound"] = upperbound
print(sfOptions)
df = spark.read.format('jdbc') \
    .options(**sfOptions) \
    .option("query", "select * from ***_shr.SPRK_TST as f") \
    .load()
Need your help and guidance on this. Thanks!
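For reference, a minimal sketch of how Spark's generic JDBC source expects a partitioned read to be expressed; the table name is a placeholder, and the key point is that partitionColumn/lowerBound/upperBound/numPartitions are only honored together with dbtable, not with the query option:
# Spark's generic JDBC source splits the read into numPartitions range queries
# on partitionColumn, but only when the relation is supplied via "dbtable"
# (a table name or a parenthesized subquery), not via "query".
df = (
    spark.read.format("jdbc")
    .options(**sfOptions)                    # connection + partitioning options from above
    .option("dbtable", "***_SHR.SPRK_TST")   # placeholder; "(select ...) as t" also works
    .load()
)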

Spark Structured Streaming Output Mode problem

I am currently looking for a workable solution to the following problem using the Spark Structured Streaming API. I have searched through a lot of blog posts and Stack Overflow; unfortunately, I still can't find a solution, hence I am raising this question to ask for expert help.
Use Case
Let's say I have a Kafka topic (user_creation_log) that carries all real-time user_creation_event records. For users who haven't made any transaction within 10 secs, 20 secs, and 30 secs, we will assign them a certain voucher. (The time windows are shortened for testing purposes.)
Flagging and sending the timed-out rows (more than 10 sec, more than 20 sec, more than 30 sec) to Kafka is the most problematic part. There are too many rules; perhaps I should break the 10 sec, 20 sec and 30 sec cases into separate scripts.
My Tracking Table
I am able to track each user's no_action_sec via the no_action_10sec, no_action_20sec, no_action_30sec flags (shown in the code below). no_action_sec is derived from (current_time - creation_time), which is recalculated in every micro-batch.
Complete Output Mode
outputMode("complete") writes all the rows of a Result Table (and corresponds to a traditional batch structured query).
Update Output Mode
outputMode("update") writes only the rows that were updated (every time there are updates).
In this case, Update output mode seems very suitable because it writes updated rows to the output. However, whenever the flag columns (no_action_10sec, no_action_20sec, no_action_30sec) are updated, the row is not written to the desired location.
from pyspark.sql import SparkSession
from pyspark.sql.types import *
from pyspark.sql.functions import *

spark = SparkSession \
    .builder \
    .appName("Notification") \
    .getOrCreate()

lines = spark \
    .readStream \
    .format("socket") \
    .option("host", "localhost") \
    .option("port", 9999) \
    .load()

split_col = split(lines.value, ' ')
df = lines.withColumn('user_id', split_col.getItem(0))
df = df.withColumn('create_date_time', split_col.getItem(1)) \
    .groupBy("user_id", "create_date_time").count()
df = df.withColumn("create_date_time", col("create_date_time").cast(LongType())) \
    .withColumn("no_action_sec", current_timestamp().cast(LongType()) - col("create_date_time").cast(LongType())) \
    .withColumn("no_action_10sec", when(col("no_action_sec") >= 10, True)) \
    .withColumn("no_action_20sec", when(col("no_action_sec") >= 20, True)) \
    .withColumn("no_action_30sec", when(col("no_action_sec") >= 30, True))

query = df \
    .writeStream \
    .outputMode("update") \
    .format("console") \
    .start()

query.awaitTermination()
Current Output
UserId = 0 disappears in Batch 2. It's supposed to show up because no_action_30sec changes from null to True.
Expected output
The user id should be written to the output 3 times, once each time it triggers the 10 sec, 20 sec and 30 sec flag logic.
Can anyone shed light on this problem? What can I do so that rows are written to the output when no_action_10sec, no_action_20sec, no_action_30sec are flagged to True?
Debug
OutputMode = Complete will output too much redundant data
Mock Data Generator
for i in {0..10000}; do echo "${i} $(date +%s)"; sleep 1; done | nc -lk 9999
Assume that any row that shows up in console mode (.format("console")) will be sent to Kafka for a chained action.
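For the "send to Kafka" part, a minimal sketch of swapping the console sink for a Kafka sink; the broker address, topic name and checkpoint path are placeholders, and the spark-sql-kafka integration package is assumed to be on the classpath:
# Serialize each updated row as a JSON message value keyed by user_id.
kafka_query = (
    df.selectExpr("CAST(user_id AS STRING) AS key",
                  "to_json(struct(*)) AS value")
    .writeStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "localhost:9092")            # placeholder broker
    .option("topic", "voucher_events")                               # placeholder topic
    .option("checkpointLocation", "file:///tmp/kafka_checkpoint/")   # placeholder path
    .outputMode("update")
    .start()
)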

How to set wlm_query_slot_count using Spark-Redshift connector

I am using the spark-redshift connector in order to launch a query from Spark.
val results = spark.sqlContext.read
  .format("com.databricks.spark.redshift")
  .option("url", url_connection)
  .option("query", query)
  .option("aws_iam_role", iam_role)
  .option("tempdir", base_path_temp)
  .load()
I would like to increase the slot count in order to improve the query, because it is disk-based. But I don't know how to run the following statement through the connector:
set wlm_query_slot_count to 3;
I don't see how to do this, since the connector's read path doesn't provide preactions and postactions the way the write path does.
Thanks
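For comparison, this is roughly what the preactions option the question refers to looks like on the connector's write path; a PySpark-flavored sketch with a placeholder target table, and whether a session-level SET issued this way actually carries over to the subsequent operation depends on the connector's connection handling:
# Write-path sketch: "preactions" is a ";"-separated list of SQL statements
# the connector runs before loading. Target table and save mode are placeholders.
(df.write
    .format("com.databricks.spark.redshift")
    .option("url", url_connection)
    .option("dbtable", "some_schema.some_table")
    .option("aws_iam_role", iam_role)
    .option("tempdir", base_path_temp)
    .option("preactions", "set wlm_query_slot_count to 3;")
    .mode("append")
    .save())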

Spark thinks I'm reading DataFrame from a Parquet file

Spark 2.x here. My code:
val query = "SELECT * FROM some_big_table WHERE something > 1"

val df: DataFrame = spark.read
  .option("url",
    s"""jdbc:postgresql://${redshiftInfo.hostnameAndPort}/${redshiftInfo.database}?currentSchema=${redshiftInfo.schema}"""
  )
  .option("user", redshiftInfo.username)
  .option("password", redshiftInfo.password)
  .option("dbtable", query)
  .load()
Produces:
Exception in thread "main" org.apache.spark.sql.AnalysisException: Unable to infer schema for Parquet. It must be specified manually.;
at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$8.apply(DataSource.scala:183)
at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$8.apply(DataSource.scala:183)
at scala.Option.getOrElse(Option.scala:121)
I'm not reading anything from a Parquet file, I'm reading from a Redshift (RDBMS) table. So why am I getting this error?
If you use the generic load function, you should include the format as well:
// Query has to be a subquery
val query = "(SELECT * FROM some_big_table WHERE something > 1) as tmp"

...
  .format("jdbc")
  .option("dbtable", query)
  .load()
Otherwise Spark assumes that you are using the default format which, in the absence of any specific configuration, is Parquet.
Also nothing forces you to use dbtable.
spark.read.jdbc(
  s"jdbc:postgresql://${hostnameAndPort}/${database}?currentSchema=${schema}",
  query,
  props
)
variant is also valid.
And of course, with such a simple query, none of that is needed:
spark.read.jdbc(
  s"jdbc:postgresql://${hostnameAndPort}/${database}?currentSchema=${schema}",
  "some_big_table",
  props
).where("something > 1")
will work the same way, and if you want to improve performance you should consider parallel queries:
How to improve performance for slow Spark jobs using DataFrame and JDBC connection?
Whats meaning of partitionColumn, lowerBound, upperBound, numPartitions parameters?
Spark 2.1 Hangs while reading a huge datasets
Partitioning in spark while reading from RDBMS via JDBC
or, even better, try the Redshift connector.
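To illustrate the parallel-query suggestion in PySpark terms, a minimal sketch of the partitioned JDBC read; the partition column, bounds, and connection details are placeholders that must be adapted to the actual table:
# Partitioned JDBC read: Spark issues numPartitions concurrent range queries
# over the numeric partition column. All values below are placeholders.
df = (
    spark.read.jdbc(
        url="jdbc:postgresql://host:5439/db?currentSchema=schema",
        table="some_big_table",
        column="id",
        lowerBound=1,
        upperBound=1000000,
        numPartitions=8,
        properties={"user": "...", "password": "..."},
    )
    .where("something > 1")
)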
