Writing stream in Databricks with toTable doesn't execute foreachBatch - databricks

The code below works as it should, i.e. data is written to the output table and is selectable from the table within 10 seconds. The problem is that foreachBatch is never executed.
When I tested it with .format("console") and called .start(), foreachBatch was run. So it feels like .toTable() is to blame here.
This code uses the Kafka connector, but the same problem existed with the Event Hubs connector.
If I try to add .start() after toTable() I get the error
'StreamingQuery' object has no attribute 'start'
Here is the code, which works except for foreachBatch:
TOPIC = "myeventhub"
BOOTSTRAP_SERVERS = "myeventhub.servicebus.windows.net:9093"
EH_SASL = "kafkashaded.org.apache.kafka.common.security.plain.PlainLoginModule required username=\"$ConnectionString\" password=\"Endpoint=sb://myeventhub.servicebus.windows.net/;SharedAccessKeyName=mykeyname;SharedAccessKey=mykey;EntityPath=myentitypath;\";"
df = spark.readStream \
.format("kafka") \
.option("subscribe", TOPIC) \
.option("kafka.bootstrap.servers", BOOTSTRAP_SERVERS) \
.option("kafka.sasl.mechanism", "PLAIN") \
.option("kafka.security.protocol", "SASL_SSL") \
.option("kafka.sasl.jaas.config", EH_SASL) \
.option("kafka.request.timeout.ms", "60000") \
.option("kafka.session.timeout.ms", "60000") \
.option("failOnDataLoss", "false") \
.option("startingOffsets", "earliest") \
.load()
n = 100
count = 0
def run_command(batchDF, epoch_id):
    global count
    count += 1
    if count % n == 0:
        spark.sql("OPTIMIZE firstcatalog.bronze.factorydatas3 ZORDER BY (readtimestamp)")
...Omitted code where I transform the data in the value column to strongly typed data...
myTypedDF.writeStream \
.foreachBatch(run_command) \
.format("delta") \
.outputMode("append") \
.option("checkpointLocation", "/tmp/delta/events/_checkpoints/") \
.partitionBy("somecolumn") \
.toTable("myunitycatalog.bronze.mytable")

You can use either foreachBatch or toTable, but not both. You can move the table write inside the foreachBatch function - just make sure the writes are idempotent, because a batch could be restarted. Change your code to this:
def run_command(batchDF, epoch_id):
    global count
    batchDF.write.format("delta") \
        .option("txnVersion", epoch_id) \
        .option("txnAppId", "my_app") \
        .partitionBy("somecolumn") \
        .mode("append") \
        .saveAsTable("myunitycatalog.bronze.mytable")
    count += 1
    if count % n == 0:
        spark.sql("OPTIMIZE myunitycatalog.bronze.mytable ZORDER BY (readtimestamp)")
myTypedDF.writeStream \
.foreachBatch(run_command) \
.outputMode("append") \
.option("checkpointLocation", "/tmp/delta/events/_checkpoints/") \
.start()
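One possible refinement (a sketch, not part of the original answer): the driver-side count starts from zero again whenever the stream restarts, so if the goal is to run OPTIMIZE roughly every n micro-batches, keying off epoch_id is more robust, since it is the checkpointed batch id and keeps increasing across restarts:
def run_command(batchDF, epoch_id):
    batchDF.write.format("delta") \
        .option("txnVersion", epoch_id) \
        .option("txnAppId", "my_app") \
        .partitionBy("somecolumn") \
        .mode("append") \
        .saveAsTable("myunitycatalog.bronze.mytable")
    # epoch_id is persisted in the checkpoint, so this fires on every n-th batch
    # even after a restart (n as defined above)
    if epoch_id % n == 0:
        spark.sql("OPTIMIZE myunitycatalog.bronze.mytable ZORDER BY (readtimestamp)")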

Related

Spark Streaming is reading from Kafka topic and how to convert nested Json format into dataframe

I am able to read data from a Kafka topic and print the data on the console using Spark Streaming.
I want the data to be in a dataframe format.
Here is my code:
spark = SparkSession \
.builder \
.appName("StructuredSocketRead") \
.getOrCreate()
spark.sparkContext.setLogLevel('ERROR')
lines = spark \
.readStream \
.format("kafka") \
.option("kafka.bootstrap.servers","********") \
.option("subscribe","******") \
.option("startingOffsets", "earliest") \
.load()
readable = lines.selectExpr("CAST(value AS STRING)")
query = readable \
.writeStream \
.outputMode("append") \
.format("console") \
.option("truncate", "False") \
.start()
query.awaitTermination()
The output is in JSON format. How do I convert this into a dataframe? Please find the output below:
{"items": [{"SKU": "23565", "title": "EGG CUP MILKMAID HELGA ", "unit_price": 2.46, "quantity": 2}], "type": "ORDER", "country": "United Kingdom", "invoice_no": 154132541847735, "timestamp": "2020-11-02 20:56:01"}
IIUC, you can use explode() and getItem() to create a DataFrame out of the JSON.
Create the dataframe:
a_json = {"items": [{"SKU": "23565", "title": "EGG CUP MILKMAID HELGA ", "unit_price": 2.46, "quantity": 2}], "type": "ORDER", "country": "United Kingdom", "invoice_no": 154132541847735, "timestamp": "2020-11-02 20:56:01"}
df = spark.createDataFrame([(a_json)])
df.show(truncate=False)
+--------------+---------------+-------------------------------------------------------------------------------------+-------------------+-----+
|country |invoice_no |items |timestamp |type |
+--------------+---------------+-------------------------------------------------------------------------------------+-------------------+-----+
|United Kingdom|154132541847735|[[quantity -> 2, unit_price -> 2.46, title -> EGG CUP MILKMAID HELGA , SKU -> 23565]]|2020-11-02 20:56:01|ORDER|
+--------------+---------------+-------------------------------------------------------------------------------------+-------------------+-----+
Logic here:
from pyspark.sql import functions as F

df = df.withColumn("items_array", F.explode("items"))
df = df.withColumn("quantity", df.items_array.getItem("quantity")) \
       .withColumn("unit_price", df.items_array.getItem("unit_price")) \
       .withColumn("title", df.items_array.getItem("title")) \
       .withColumn("SKU", df.items_array.getItem("SKU"))
df.select("country", "invoice_no", "quantity","unit_price", "title", "SKU", "timestamp", "timestamp").show(truncate=False)
+--------------+---------------+--------+----------+-----------------------+-----+-------------------+-------------------+
|country |invoice_no |quantity|unit_price|title |SKU |timestamp |timestamp |
+--------------+---------------+--------+----------+-----------------------+-----+-------------------+-------------------+
|United Kingdom|154132541847735|2 |2.46 |EGG CUP MILKMAID HELGA |23565|2020-11-02 20:56:01|2020-11-02 20:56:01|
+--------------+---------------+--------+----------+-----------------------+-----+-------------------+-------------------+
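For the streaming case itself, a common pattern (a sketch, assuming the JSON layout shown above) is to parse the value column with from_json against an explicit schema and then explode the items array, instead of building a static DataFrame from a sample record:
from pyspark.sql import functions as F
from pyspark.sql.types import (StructType, StructField, StringType,
                               DoubleType, IntegerType, LongType, ArrayType)

item_schema = StructType([
    StructField("SKU", StringType()),
    StructField("title", StringType()),
    StructField("unit_price", DoubleType()),
    StructField("quantity", IntegerType()),
])
order_schema = StructType([
    StructField("items", ArrayType(item_schema)),
    StructField("type", StringType()),
    StructField("country", StringType()),
    StructField("invoice_no", LongType()),
    StructField("timestamp", StringType()),
])

# readable is the streaming dataframe from the question (value cast to string)
parsed = (readable
          .withColumn("json", F.from_json("value", order_schema))
          .withColumn("item", F.explode("json.items"))
          .select("json.country", "json.invoice_no",
                  "item.quantity", "item.unit_price",
                  "item.title", "item.SKU", "json.timestamp"))
parsed can then be written out with writeStream exactly as in the question.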

'sqlContext' is not defined when my function calls another function

I have a function all_purch_spark() that sets up a Spark context and a SQL context and reads five different tables. The same function then successfully runs a SQL query against an AWS Redshift DB. It works great. I am including the entire function below (stripped of sensitive data, of course). Please forgive its length, but I wanted to show it as is, given the problem I am facing.
My problem is with the second function repurch_prep() and how it calls the first function all_purch_spark(). I can't figure out how to avoid errors such as this one: NameError: name 'sqlContext' is not defined
I will show the two functions and error below.
Here is the first function all_purch_spark(). Again I put the whole function here for reference. I know it is long but wasn't sure I could reduce it to a meaningful example.
def all_purch_spark():
config = {
'redshift_user': 'tester123',
'redshift_pass': '*****************',
'redshift_port': "5999",
'redshift_db': 'my_database',
'redshift_host': 'redshift.my_database.me',
}
from pyspark import SparkContext, SparkConf, SQLContext
jars = [
"/home/spark/SparkNotebooks/src/service/RedshiftJDBC42-no-awssdk-1.2.41.1065.jar"
]
conf = (
SparkConf()
.setAppName("S3 with Redshift")
.set("spark.driver.extraClassPath", ":".join(jars))
.set("spark.hadoop.fs.s3a.path.style.access", True)
.set("spark.hadoop.fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
.set("com.amazonaws.services.s3.enableV4", True)
.set("spark.hadoop.fs.s3a.endpoint", f"s3-{config.get('region')}.amazonaws.com")
.set("spark.executor.extraJavaOptions", "-Dcom.amazonaws.services.s3.enableV4=true")
.set("spark.driver.extraJavaOptions", "-Dcom.amazonaws.services.s3.enableV4=true")
)
sc = SparkContext(conf=conf).getOrCreate()
sqlContext = SQLContext(sc)
##Set Schema and table to query
schema1 = 'production'
schema2 = 'X4production'
table1 = 'purchases'
table2 = 'customers'
table3 = 'memberships'
table4 = 'users' #set as users table in both schemas
purchases_df = sqlContext.read \
.format("jdbc") \
.option("url", f"jdbc:postgresql://{config.get('redshift_host')}:{config.get('redshift_port')}/{config.get('redshift_db')}") \
.option("dbtable", f"{schema1}.{table1}") \
.option("user", config.get('redshift_user')) \
.option("password", config.get('redshift_pass')) \
.load()
customers_df = sqlContext.read \
.format("jdbc") \
.option("url", f"jdbc:postgresql://{config.get('redshift_host')}:{config.get('redshift_port')}/{config.get('redshift_db')}") \
.option("dbtable", f"{schema1}.{table2}") \
.option("user", config.get('redshift_user')) \
.option("password", config.get('redshift_pass')) \
.load()
memberships_df = sqlContext.read \
.format("jdbc") \
.option("url", f"jdbc:postgresql://{config.get('redshift_host')}:{config.get('redshift_port')}/{config.get('redshift_db')}") \
.option("dbtable", f"{schema1}.{table3}") \
.option("user", config.get('redshift_user')) \
.option("password", config.get('redshift_pass')) \
.load()
users_df = sqlContext.read \
.format("jdbc") \
.option("url", f"jdbc:postgresql://{config.get('redshift_host')}:{config.get('redshift_port')}/{config.get('redshift_db')}") \
.option("dbtable", f"{schema1}.{table4}") \
.option("user", config.get('redshift_user')) \
.option("password", config.get('redshift_pass')) \
.load()
cusers_df = sqlContext.read \
.format("jdbc") \
.option("url", f"jdbc:postgresql://{config.get('redshift_host')}:{config.get('redshift_port')}/{config.get('redshift_db')}") \
.option("dbtable", f"{schema2}.{table4}") \
.option("user", config.get('redshift_user')) \
.option("password", config.get('redshift_pass')) \
.load()
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('fc_purchases').getOrCreate()
purchases_df.createOrReplaceTempView('purchases')
customers_df.createOrReplaceTempView('customers')
memberships_df.createOrReplaceTempView('memberships')
users_df.createOrReplaceTempView('users')
cusers_df.createOrReplaceTempView('cusers')
all_purch = spark.sql("SELECT \
p_paid.customer_id AS p_paid_user_id \
,p_trial.created_at AS trial_start_date \
,p_paid.created_at \
,cu.graduation_year \
,lower(cu.student_year) AS student_year \
,lower(p_paid.description) as product \
,u.email \
,u.id AS u_user_id \
,cu.id AS cu_user_id \
FROM \
purchases AS p_paid \
INNER JOIN purchases AS p_trial ON p_trial.customer_id = p_paid.customer_id \
INNER JOIN customers AS c on c.id = p_paid.customer_id \
INNER JOIN memberships AS m on m.id = c.membership_id \
INNER JOIN users AS u on u.id = m.user_id \
INNER JOIN cusers AS cu on cu.id = u.id \
WHERE \
p_trial.created_at >= '2018-03-01' \
AND p_paid.created_at >= '2018-03-01' \
AND u.institution_contract = false \
AND LOWER(u.email) not like '%hotmail.me%' \
AND LOWER(u.email) not like '%gmail.com%' \
AND p_trial.description like '% Day Free Trial' \
AND p_paid.status = 'paid' \
GROUP BY \
p_paid_user_id \
,trial_start_date \
,p_paid.created_at \
,u.email \
,cu.graduation_year \
,student_year \
,product \
,cu_user_id \
,u_user_id \
ORDER BY p_paid_user_id")
all_purch.registerTempTable("all_purch_table")
return all_purch
Here is the second function, which calls the above function. It is supposed to select against the registered table view created in the above function:
def repurch_prep():
    all_purch_spark()
    all_repurch = sqlContext.sql("SELECT * FROM all_purch_table WHERE p_paid_user_id IN \
        (SELECT p_paid_user_id FROM all_purch_table GROUP BY p_paid_user_id HAVING COUNT(*) > 1) \
        ORDER BY p_paid_user_id ASC")
    return all_repurch
When I run repurch_prep() it throws the following exception, even though the SQL context is defined in the above function. I have tried returning values, but I can't figure out how to get this to work:
---------------------------------------------------------------------------
NameError Traceback (most recent call last)
in
----> 1 repurch_prep()
~/spark/SparkNotebooks/firecracker/utils_prod_db_spark.py in repurch_prep()
735 #sc = SparkContext().getOrCreate()
736 #sqlContext = SQLContext()
--> 737 all_repurch = sqlContext.sql("SELECT * FROM all_purch_table WHERE p_paid_user_id IN \
738 (SELECT p_paid_user_id FROM all_purch_table GROUP BY p_paid_user_id HAVING COUNT(*) > 1) \
739 ORDER BY p_paid_user_id ASC")
NameError: name 'sqlContext' is not defined
Any help greatly appreciated.
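A minimal illustration of the underlying scoping issue (hypothetical names, not from the original code): a name bound inside one function is local to that function and is not visible from another function, even after the first function has been called.
def make_ctx():
    ctx = "only visible inside make_ctx"   # local binding

def use_ctx():
    make_ctx()
    print(ctx)                             # NameError: name 'ctx' is not defined

use_ctx()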
The solution, per @Lamanus, was to place the variables outside the function, making them global, rather than defining them inside one function (as I did) and calling that function from another.
############### SPARK REDSHIFT GLOBAL CONFIG #####################
config = {
'redshift_user': 'tester123',
'redshift_pass': '*****************',
'redshift_port': "5999",
'redshift_db': 'my_database',
'redshift_host': 'redshift.my_database.me',
}
from pyspark import SparkContext, SparkConf, SQLContext
jars = [
"/home/spark/SparkNotebooks/src/service/RedshiftJDBC42-no-awssdk-1.2.41.1065.jar"
]
conf = (
SparkConf()
.setAppName("S3 with Redshift")
.set("spark.driver.extraClassPath", ":".join(jars))
.set("spark.hadoop.fs.s3a.path.style.access", True)
.set("spark.hadoop.fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
.set("com.amazonaws.services.s3.enableV4", True)
.set("spark.hadoop.fs.s3a.endpoint", f"s3-{config.get('region')}.amazonaws.com")
.set("spark.executor.extraJavaOptions", "-Dcom.amazonaws.services.s3.enableV4=true")
.set("spark.driver.extraJavaOptions", "-Dcom.amazonaws.services.s3.enableV4=true")
)
sc = SparkContext(conf=conf).getOrCreate()
###############################################################
def all_purch_spark():
sqlContext = SQLContext(sc)
##Set Schema and table to query
schema1 = 'production'
schema2 = 'X4production'
table1 = 'purchases'
table2 = 'customers'
table3 = 'memberships'
table4 = 'users' #set as users table in both schemas
purchases_df = sqlContext.read \
.format("jdbc") \
.option("url", f"jdbc:postgresql://{config.get('redshift_host')}:{config.get('redshift_port')}/{config.get('redshift_db')}") \
.option("dbtable", f"{schema1}.{table1}") \
.option("user", config.get('redshift_user')) \
.option("password", config.get('redshift_pass')) \
.load()
customers_df = sqlContext.read \
.format("jdbc") \
.option("url", f"jdbc:postgresql://{config.get('redshift_host')}:{config.get('redshift_port')}/{config.get('redshift_db')}") \
.option("dbtable", f"{schema1}.{table2}") \
.option("user", config.get('redshift_user')) \
.option("password", config.get('redshift_pass')) \
.load()
memberships_df = sqlContext.read \
.format("jdbc") \
.option("url", f"jdbc:postgresql://{config.get('redshift_host')}:{config.get('redshift_port')}/{config.get('redshift_db')}") \
.option("dbtable", f"{schema1}.{table3}") \
.option("user", config.get('redshift_user')) \
.option("password", config.get('redshift_pass')) \
.load()
users_df = sqlContext.read \
.format("jdbc") \
.option("url", f"jdbc:postgresql://{config.get('redshift_host')}:{config.get('redshift_port')}/{config.get('redshift_db')}") \
.option("dbtable", f"{schema1}.{table4}") \
.option("user", config.get('redshift_user')) \
.option("password", config.get('redshift_pass')) \
.load()
cusers_df = sqlContext.read \
.format("jdbc") \
.option("url", f"jdbc:postgresql://{config.get('redshift_host')}:{config.get('redshift_port')}/{config.get('redshift_db')}") \
.option("dbtable", f"{schema2}.{table4}") \
.option("user", config.get('redshift_user')) \
.option("password", config.get('redshift_pass')) \
.load()
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('fc_purchases').getOrCreate()
purchases_df.createOrReplaceTempView('purchases')
customers_df.createOrReplaceTempView('customers')
memberships_df.createOrReplaceTempView('memberships')
users_df.createOrReplaceTempView('users')
cusers_df.createOrReplaceTempView('cusers')
all_purch = spark.sql("SELECT \
p_paid.customer_id AS p_paid_user_id \
,p_trial.created_at AS trial_start_date \
,p_paid.created_at \
,cu.graduation_year \
,lower(cu.student_year) AS student_year \
,lower(p_paid.description) as product \
,u.email \
,u.id AS u_user_id \
,cu.id AS cu_user_id \
FROM \
purchases AS p_paid \
INNER JOIN purchases AS p_trial ON p_trial.customer_id = p_paid.customer_id \
INNER JOIN customers AS c on c.id = p_paid.customer_id \
INNER JOIN memberships AS m on m.id = c.membership_id \
INNER JOIN users AS u on u.id = m.user_id \
INNER JOIN cusers AS cu on cu.id = u.id \
WHERE \
p_trial.created_at >= '2018-03-01' \
AND p_paid.created_at >= '2018-03-01' \
AND u.institution_contract = false \
AND LOWER(u.email) not like '%hotmail.me%' \
AND LOWER(u.email) not like '%gmail.com%' \
AND p_trial.description like '% Day Free Trial' \
AND p_paid.status = 'paid' \
GROUP BY \
p_paid_user_id \
,trial_start_date \
,p_paid.created_at \
,u.email \
,cu.graduation_year \
,student_year \
,product \
,cu_user_id \
,u_user_id \
ORDER BY p_paid_user_id")
all_purch.registerTempTable("all_purch_table")
return all_purch
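To complete the picture (a sketch, assuming the module-level sc and config shown above): with the context created at module scope, the second function can obtain the same active session via getOrCreate() and query the temp view that all_purch_spark() registers:
from pyspark.sql import SparkSession

def repurch_prep():
    all_purch_spark()                             # registers the all_purch_table temp view
    spark = SparkSession.builder.getOrCreate()    # same active session, so the view is visible
    all_repurch = spark.sql("""
        SELECT * FROM all_purch_table
        WHERE p_paid_user_id IN (
            SELECT p_paid_user_id FROM all_purch_table
            GROUP BY p_paid_user_id HAVING COUNT(*) > 1)
        ORDER BY p_paid_user_id ASC""")
    return all_repurch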

pyspark getting distinct values based on groupby column for streaming data

I am trying to get distinct values for a column, based on a groupBy on another column, using a PySpark stream, but I am getting an incorrect count.
Function created:
from pyspark.sql.functions import weekofyear,window,approx_count_distinct
def silverToGold(silverPath, goldPath, queryName):
    (spark.readStream
        .format("delta")
        .load(silverPath)
        .withColumn("week", weekofyear("eventDate"))
        # .groupBy(window(col(("week")).cast("timestamp"), "5 minute")).approx_count_distinct("device_id")
        # .withColumn("WAU", col("window.start"))
        # .drop("window")
        .groupBy("week").agg(approx_distinct.count("device_id").alias("WAU"))
        .writeStream
        .format("delta")
        .option("checkpointLocation", goldPath + "/_checkpoint")
        # .option("streamName", queryName)
        .queryName(queryName)
        .outputMode("complete")
        .start(goldPath)
        # return queryName
    )
Expected Result:
week WAU
1 7
2 4
3 9
4 9
Actual Result:
week WAU
1 7259
2 7427
3 7739
4 7076
Sample input data (in text format):
device_id,eventName,client_event_time,eventDate,deviceType
00007d948fbe4d239b45fe59bfbb7e64,scoreAdjustment,2018-06-01T16:55:40.000+0000,2018-06-01,android
00007d948fbe4d239b45fe59bfbb7e64,scoreAdjustment,2018-06-01T16:55:34.000+0000,2018-06-01,android
0000a99151154e4eb14c675e8b42db34,scoreAdjustment,2019-08-18T13:39:36.000+0000,2019-08-18,ios
0000b1e931d947b197385ac1cbb25779,scoreAdjustment,2018-07-16T09:13:45.000+0000,2018-07-16,android
0003939e705949e4a184e0a853b6e0af,scoreAdjustment,2018-07-17T17:59:05.000+0000,2018-07-17,android
0003e14ca9ba4198b51cec7d2761d391,scoreAdjustment,2018-06-10T09:09:12.000+0000,2018-06-10,ios
00056f7c73c9497180f2e0900a0626e3,scoreAdjustment,2019-07-05T18:31:10.000+0000,2019-07-05,ios
0006ace2d1db46ba94b802d80a43c20f,scoreAdjustment,2018-07-05T14:31:43.000+0000,2018-07-05,ios
000718c45e164fb2b017f146a6b66b7e,scoreAdjustment,2019-03-26T08:25:08.000+0000,2019-03-26,android
000807f2ea524bd0b7e27df8d44ab930,purchaseEvent,2019-03-26T22:28:17.000+0000,2019-03-26,android
Any suggestions on this?
def silverToGold(silverPath, goldPath, queryName):
    return (spark.readStream
        .format("delta")
        .load(silverPath)
        .groupBy(weekofyear('eventDate').alias('week'))
        .agg(approx_count_distinct("device_id", rsd=0.01).alias("WAU"))
        .writeStream
        .format("delta")
        .option("checkpointLocation", goldPath + "/_checkpoint")
        .outputMode("complete")
        .start(goldPath)
    )
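A possible way to invoke it (a sketch; the paths and query name below are placeholders, not from the original post): .start() returns a StreamingQuery, so the caller can block on it. Note that rsd=0.01 tightens the relative error of approx_count_distinct compared with the default of 0.05.
query = silverToGold("/mnt/lake/silver/events",   # hypothetical silver path
                     "/mnt/lake/gold/wau",        # hypothetical gold path
                     "gold_wau")                  # hypothetical query name
query.awaitTermination()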

Spark structured streaming sinks to output is delayed

The Spark Structured Streaming code below collects data from Kafka and windows it into 10-second buckets:
window($"timestamp", "10 seconds")
I was expecting the results to be printed on the console every 10 seconds, but I notice the sink to the console happens only every ~2 minutes or more.
May I know what I am doing wrong?
def streaming(): Unit = {
System.setProperty("hadoop.home.dir", "/Documents/ ")
val conf: SparkConf = new SparkConf().setAppName("Histogram").setMaster("local[8]")
conf.set("spark.eventLog.enabled", "false");
val sc: SparkContext = new SparkContext(conf)
val sqlcontext = new SQLContext(sc)
val spark = SparkSession.builder().config(conf).getOrCreate()
import sqlcontext.implicits._
import org.apache.spark.sql.functions.window
val inputDf = spark.readStream.format("kafka")
.option("kafka.bootstrap.servers", "localhost:9092")
.option("subscribe", "wonderful")
.option("startingOffsets", "latest")
.load()
import scala.concurrent.duration._
val personJsonDf = inputDf.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)", "timestamp")
.withWatermark("timestamp", "500 milliseconds")
.groupBy(
window($"timestamp", "10 seconds")).count()
val consoleOutput = personJsonDf.writeStream
.outputMode("complete")
.format("console")
.option("truncate", "false")
.outputMode(OutputMode.Update())
.start()
consoleOutput.awaitTermination()
}
object SparkExecutor {
val spE: SparkExecutor = new SparkExecutor();
def main(args: Array[String]): Unit = {
println("test")
spE.streaming
}
}
I think that you might be missing the trigger definition for querying personJsonDf during the writeStream operation. The 2-minute period might be a default one (not sure).
The groupBy window that you have defined will be used in the query, but it does not define the output periodicity.
One way to configure this could be:
val consoleOutput = personJsonDf.writeStream
  .outputMode("complete")
  .trigger(Trigger.ProcessingTime("10 seconds")) // needs: import org.apache.spark.sql.streaming.Trigger
  .format("console")
  .option("truncate", "false")
  .outputMode(OutputMode.Update()) // note: this later call overrides the earlier "complete" mode
  .start()
Finally, the Trigger class contains some useful methods you may want to check out.
Hope it helps.

StreamingQueryException: Text data source supports only a single column

I know this question has already been asked multiple times, but none of the answers help in my case.
Below is my Spark code:
class ParseLogs extends java.io.Serializable {
  def formLogLine(logLine: String): (String, String, String, Int, String, String, String, Int, Float, String, String, Float, Int, String, Int, Float, String) = {
    //some logic
    //return value
    (recordKey._2.toString().replace("\"", ""), recordKey._3, recordKey._4, recordKey._5, recordKey._6, recordKey._8, sbcId, recordKey._10, recordKey._11, recordKey._12, recordKey._13.trim(), LogTransferTime, contentAccessed, OTT, dataTypeId, recordKey._14, logCaptureTime1)
  }
}
val inputDf = spark.readStream
.format("kafka")
.option("kafka.bootstrap.servers", brokers)
.option("subscribe", topic)
.option("startingOffsets", "earliest")
.load()
val myDf = inputDf.selectExpr("CAST(value AS STRING)")
val df1 = myDf.map(line => new ParseLogs().formLogLine(line.get(0).toString()))
I get the error below:
User class threw exception: org.apache.spark.sql.streaming.StreamingQueryException: Text data source supports only a single column, and you have 17 columns.;
Use a UDF to convert logLine to what you want. For example:
spark.sqlContext.udf.register("YOURLOGIC", (logLine: String) => {
  //some logic
  (recordKey._2.toString().replace("\"", ""), recordKey._3, recordKey._4, recordKey._5, recordKey._6, recordKey._8, sbcId, recordKey._10, recordKey._11, recordKey._12, recordKey._13.trim(), LogTransferTime, contentAccessed, OTT, dataTypeId, recordKey._14, logCaptureTime1)
})
val inputDf = spark.readStream
.format("kafka")
.option("kafka.bootstrap.servers", brokers)
.option("subscribe", topic)
.option("startingOffsets", "earliest")
.load()
val myDf = inputDf.selectExpr("CAST(value AS STRING)")
val df1 = myDf.selectExpr("YOURLOGIC(value) as result")
val result = df1.select(
  // the struct produced by a tuple-returning UDF has fields named _1 ... _17
  df1("result").getItem("_1"),
  df1("result").getItem("_2"),
  df1("result").getItem("_3"),
  df1("result").getItem("_4"),
  // ...continue through the remaining fields...
  df1("result").getItem("_17"))
