spark streaming write data to spark-sql, but can't query it - apache-spark

I have submitted a streaming application that writes a DataFrame to Spark SQL:
df.show(5)
df.write.mode(SaveMode.Append).saveAsTable(s"industy.test1")
When I send some messages to Kafka, the line df.show(5) is printed in the log file, which means the data was consumed successfully. But when I use Zeppelin or the spark-sql interactive command line to query industry.test1, none of the data I consumed is there, and the log file shows no errors. However, when I reopened a spark-sql interactive command line and queried the table, I got the data. So I want to know why I need to reopen it. Thanks for any help.
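One likely explanation (not confirmed in this thread) is that a Spark SQL session opened before the streaming job appended new files keeps cached table metadata, so it does not see the new files until the metadata is refreshed; reopening spark-sql simply starts a fresh session. A minimal sketch of refreshing the table from the already-open session, using the table name as it is queried in the question:

# Force Spark to re-read the table's metadata and file listing before querying again.
spark.catalog.refreshTable("industry.test1")
spark.sql("select count(*) from industry.test1").show()

The SQL equivalent, REFRESH TABLE industry.test1, can be run directly in the spark-sql shell or a Zeppelin paragraph.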

Related

PySpark job only partly running

I have a PySpark script that I am running locally with spark-submit in Docker. In my script I have a call to toPandas() on a PySpark DataFrame, and afterwards I have various manipulations of the DataFrame, finishing with a call to to_csv() to write the results to a local CSV file.
When I run this script, the code after the call to toPandas() does not appear to run. I have log statements before this method call and afterwards, however only the log entries before the call show up on the spark-submit console output. I have thought that maybe this is due to the rest of the code being run in a separate executor process by Spark, so the logs don't show on the console. If this is true, how can I see my application logs for the executor? I have enabled the event log with spark.eventLog.enabled=true but this seems to show only internal events, not my actual application log statements.
Whether or not that assumption about executor logs is true, I don't see the CSV file written to the path that I expect (/tmp). Further, the history server says No completed applications found! when I start it, configured to read the event log (spark.history.fs.logDirectory for the history server, spark.eventLog.dir for Spark). It does show an incomplete application; the only completed job listed there is for my toPandas() call.
How am I supposed to figure out what's happening? Spark shows no errors at any point, and I can't seem to view my own application logs.
When you use toPandas() to convert your Spark DataFrame to a pandas DataFrame, it's actually a heavy action because it pulls all the records to the driver.
Remember that Spark is a distributed computing engine doing parallel computation. Your data is therefore distributed across different nodes, which is completely different from a pandas DataFrame: pandas works on a single machine while Spark works on a cluster. You can check this post: why does python dataFrames' are localted only in the same machine?
Back to your post, it actually covers 2 questions:
Why there are no logs after toPandas(): As mentioned above, Spark is a distributed computing engine. The event log only saves the job details that appear in the Spark computation DAG. Other, non-Spark logs will not be saved in the Spark event log; if you really need those logs, you need to use an external library like logging to collect them on the driver.
Why there is no CSV saved in the /tmp dir: As you mentioned, the event log shows an incomplete application but not a failed one, so I believe your DataFrame is so large that the collection has not finished and your transformations on the pandas DataFrame have not even started. You can try to collect just a few records, say df.limit(20).toPandas(), to see if it works. If it does, the DataFrame you are converting to pandas is simply very large and it takes time. If it doesn't, maybe you can share more of the error traceback.
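As a rough, untested sketch of the two suggestions above (driver-side logging plus collecting only a small sample first), with placeholder paths and logger name:

import logging
from pyspark.sql import SparkSession

# Standard Python logging runs on the driver, so these messages appear in the
# spark-submit console output rather than in the Spark event log.
logging.basicConfig(level=logging.INFO)
log = logging.getLogger("my_app")

spark = SparkSession.builder.appName("toPandas-debug").getOrCreate()
df = spark.read.csv("/tmp/input.csv", header=True)   # placeholder input path

log.info("before toPandas()")
pdf = df.limit(20).toPandas()                         # collect only a few records first
log.info("after toPandas(), got %d rows", len(pdf))

pdf.to_csv("/tmp/output.csv", index=False)
log.info("CSV written")

If the limited collect finishes quickly, the full toPandas() is probably just too large for the driver and needs either more driver memory or a different approach (for example, writing the CSV with Spark itself).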

In Pyspark Structured Streaming, how can I discard already generated output before writing to Kafka?

I am trying to do Structured Streaming (Spark 2.4.0) on Kafka source data where I am reading latest data and performing aggregations on a 10 minute window. I am using "update" mode while writing the data.
For example, the data schema is as below:
tx_id, cust_id, product, timestamp
My aim is to find customers who have bought more than 3 products in last 10 minutes. Let's say prod is the dataframe which is read from kafka, then windowed_df is defined as:
windowed_df_1 = prod.groupBy(window("timestamp", "10 minutes"), "cust_id").count()
windowed_df = windowed_df_1.filter(col("count") >= 3)
Then I am joining this with a master dataframe from hive table "customer_master" to get cust_name:
final_df = windowed_df.join(customer_master, "cust_id")
And finally, write this dataframe to Kafka sink (or console for simplicity)
query = final_df.writeStream.outputMode("update").format("console").option("truncate",False).trigger(processingTime='2 minutes').start()
query.awaitTermination()
Now, when this code runs every 2 minutes, I want the subsequent runs to discard all the customers who were already part of my output. I don't want them in my output even if they buy a product again.
Can I write the stream output temporarily somewhere (maybe a Hive table) and do an "anti-join" on each execution?
This way I can also have a history maintained in a Hive table.
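Concretely, something like this rough sketch is what I have in mind (alerted_customers is a hypothetical Hive history table of the output already produced, assumed to be created once beforehand):

def publish_new_customers(batch_df, epoch_id):
    # Customers alerted in earlier batches, kept in a Hive history table.
    history = spark.table("alerted_customers")
    fresh = batch_df.join(history, "cust_id", "left_anti")   # drop already-seen customers
    # Remember the newly alerted customers so later batches skip them...
    fresh.select("cust_id").write.mode("append").saveAsTable("alerted_customers")
    # ...and publish them (console here for simplicity).
    fresh.show(truncate=False)

query = (final_df.writeStream
         .outputMode("update")
         .foreachBatch(publish_new_customers)
         .trigger(processingTime="2 minutes")
         .start())
query.awaitTermination()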
I also read somewhere that we can write the output to a memory sink and then use df.write to save it in HDFS/Hive. But what if we terminate the job and re-run ? The in-memory table will be lost in this case I suppose.
Please help as I am new to Structured Streaming.
Update:
I also tried the code below to write the output to a Hive table as well as to the console (or a Kafka sink):
def write_to_hive(df, epoch_id):
    df.persist()
    df.write.format("hive").mode("append").saveAsTable("hive_tab_name")
    pass
final_df.writeStream.outputMode("update").format("console").option("truncate", False).start()
final_df.writeStream.outputMode("update").foreachBatch(write_to_hive).start()
But this only performs the first action, i.e. the write to the console.
If I put the foreachBatch query first, it saves to the Hive table but does not print to the console.
I want to write to 2 different sinks. Please help.
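One commonly suggested pattern (a sketch, not verified against the exact setup above) is to do both writes inside a single foreachBatch, so one streaming query feeds both sinks:

def write_to_hive_and_console(df, epoch_id):
    df.persist()
    df.write.format("hive").mode("append").saveAsTable("hive_tab_name")
    df.show(truncate=False)        # stands in for the console sink
    df.unpersist()

query = (final_df.writeStream
         .outputMode("update")
         .foreachBatch(write_to_hive_and_console)
         .start())
query.awaitTermination()

If two separate writeStream queries are kept instead, both .start() calls must be issued before blocking, and spark.streams.awaitAnyTermination() (rather than awaiting a single query) keeps both running.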

Processing Batch files coming every 2 minutes in system using spark Streaming

I am trying to process my near-real-time batch CSV files using Spark streaming. I am reading these files in batches of 100 files, doing some operations, and writing to output files. I am using the spark.readStream and writeStream functions to read and write the streaming files.
I am trying to find out how I can stop the streaming job.
# note: a streaming CSV source also needs an explicit schema via .schema(...)
stream_df = spark.readStream.option("maxFilesPerTrigger", 10).csv(filepath_directory)
query = (stream_df.writeStream.format("parquet").outputMode("append")
         .option("path", output_filepath)               # output directory
         .option("checkpointLocation", checkpoint_dir)  # required by the file sink
         .start())
I face an issue while the streaming job is running: for some reason my job fails or throws exceptions.
I tried try/except.
I am planning to stop the query in case of any exception and run the same code again.
I used query.stop(); is this the right way to stop a streaming job?
I read one post, but I am not sure whether it applies to DStreams or how to use that approach in my PySpark code:
https://www.linkedin.com/pulse/how-shutdown-spark-streaming-job-gracefully-lan-jiang
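A rough sketch of the stop-and-retry idea for a Structured Streaming query (the retry limit, schema, paths, and sleep time below are placeholders, not from the linked article):

import time
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("csv-stream").getOrCreate()
filepath_directory = "/data/incoming"       # placeholder paths
output_filepath = "/data/output"
checkpoint_dir = "/data/checkpoints"

def start_query():
    stream_df = (spark.readStream
                 .option("maxFilesPerTrigger", 10)
                 .schema("col1 STRING, col2 STRING")    # placeholder schema
                 .csv(filepath_directory))
    return (stream_df.writeStream.format("parquet")
            .outputMode("append")
            .option("path", output_filepath)
            .option("checkpointLocation", checkpoint_dir)
            .start())

for attempt in range(3):                    # placeholder retry limit
    query = start_query()
    try:
        query.awaitTermination()            # blocks until the query stops or fails
        break                               # query stopped cleanly, no retry needed
    except Exception as err:
        print(f"attempt {attempt} failed: {err}")
        query.stop()                        # query.stop() is the supported way to stop the query
        time.sleep(10)                      # short pause before restarting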

Parquet File Output Sink - Spark Structured Streaming

Wondering what triggers a Spark Structured Streaming query (with a Parquet file output sink configured) to write data to the Parquet files, and how to modify that behaviour. I periodically feed the stream input data (using the stream reader to read in files), but it does not write output to a Parquet file for each file provided as input. Once I have given it a few files, it tends to write a Parquet file just fine.
I am wondering how to control this. I would like to be able to force a new write to the Parquet file for every new file provided as input. Any tips appreciated!
Note: I have maxFilesPerTrigger set to 1 on the read stream call. I can also see the streaming query process the single input file; however, a single file on input does not appear to result in the streaming query writing the output to the Parquet file.
After further analysis, and working with the ForEach output sink using the default Append Mode, I believe the issue I was running into was the combination of the Append mode along with the Watermarking feature.
After re-reading https://spark.apache.org/docs/2.2.1/structured-streaming-programming-guide.html#starting-streaming-queries, it appears that when Append mode is used with a watermark set, Spark Structured Streaming will not write aggregation results out to the Result table until the watermark time limit has passed. Append mode does not allow updates to records, so it must wait for the watermark to pass to ensure there are no further changes to the row...
I believe the Parquet file sink does not allow Update mode; however, after switching to the ForEach output sink and using Update mode, I observed data coming out of the sink as I expected. Essentially, for each record in, at least one record out, with no delay (as was observed before).
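For illustration only (the schema, window sizes, and paths below are made up, and foreachBatch is used here as a close relative of the ForEach sink mentioned above), an update-mode query that writes each micro-batch out immediately might look like:

from pyspark.sql import SparkSession
from pyspark.sql.functions import window

spark = SparkSession.builder.appName("update-mode-sketch").getOrCreate()

# Hypothetical file stream with an event_time column.
events = (spark.readStream
          .option("maxFilesPerTrigger", 1)
          .schema("id STRING, event_time TIMESTAMP")
          .csv("/data/in"))                             # placeholder input directory

counts = (events
          .withWatermark("event_time", "10 minutes")
          .groupBy(window("event_time", "5 minutes"))
          .count())

def write_batch(batch_df, epoch_id):
    # In update mode each trigger emits the rows that changed, without
    # waiting for the watermark to close the window as append mode does.
    batch_df.write.mode("append").parquet("/data/out")  # placeholder output directory

query = (counts.writeStream
         .outputMode("update")
         .foreachBatch(write_batch)
         .option("checkpointLocation", "/data/chk")     # placeholder checkpoint path
         .start())
query.awaitTermination()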
Hopefully this is helpful to others.

Hadoop Data Ingestion

I have the below requirement:
There's an upstream system which makes a key-entry in a database table. That entry indicates that a set of data is available in a database table (Oracle). We have to ingest this data and save it as a Parquet file. No processing of the data is required. This ingestion process should start every time a new key-entry is available.
For this problem statement, we have planned to have a database poller which polls for the key-entry. After reading that entry, we need to ingest the data from an Oracle table. Which tool is best for this ingestion: Kafka, Sqoop, Spark SQL, etc.? Please help.
We also need to ingest CSV files. Only when a file is completely written should we start ingesting it. Please let me know how to perform this as well.
For ingesting relational data you can use Sqoop, and for your scenario you can have a look at https://sqoop.apache.org/docs/1.4.2/SqoopUserGuide.html#_incremental_imports
Write a Sqoop incremental job and schedule it using cron; each time the Sqoop job executes you will have updated data in HDFS.
For .csv files you can use Flume. Refer to:
https://www.rittmanmead.com/blog/2014/05/trickle-feeding-webserver-log-files-to-hdfs-using-apache-flume/
With Sqoop you can import data from a database into your Hadoop file system.
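If Spark SQL is chosen instead (one of the options raised in the question), a hedged sketch of pulling the Oracle table over JDBC and saving it as Parquet could look like the following; the connection details, table name, and output path are placeholders:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("oracle-to-parquet").getOrCreate()

# Placeholder Oracle connection details; the Oracle JDBC driver jar must be on the classpath.
df = (spark.read.format("jdbc")
      .option("url", "jdbc:oracle:thin:@//dbhost:1521/ORCLPDB1")
      .option("dbtable", "SOURCE_SCHEMA.SOURCE_TABLE")
      .option("user", "etl_user")
      .option("password", "etl_password")
      .option("driver", "oracle.jdbc.OracleDriver")
      .load())

# No transformation required per the question; write straight to Parquet on HDFS.
df.write.mode("append").parquet("hdfs:///data/ingest/source_table")

The database poller described in the question would submit a job like this whenever a new key-entry appears.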
