Spark Streaming finds file but claims it can't find the file - apache-spark

I have the below - which monitors a directory & pulls in the logs every X seconds.
The issue I have is this:
I set the script running
I then create a file in the directory (let's say testfile.txt)
The script then errors saying testfile.txt does not exist
It found the file and the filename, so it does exist and it finds it.
What I can see is that I define the path with a file:/// and it returns an error that it can't find file:/. So it seems to be missing two // for some reason:
Thanks for any help!!!!
Code
#only files after stream starts
df = spark_session\
.readStream\
.option('newFilesOnly', 'true')\
.option('header', 'true')\
.schema(myschema)\
.text('file:///home/keenek1/analytics/logs/')\
.withColumn("FileName", input_file_name())
Error
FileNotFoundException: File file:/home/keenek1/analytics/logs/loggywoggywoo.txt does not exist

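Please change file:/// to hdfs:/// (this assumes the log directory is on HDFS rather than on the local disk). Use three slashes: with hdfs://home/... the word "home" would be read as the namenode host instead of as part of the path.
# changed file:/// to hdfs:///
df = spark_session\
.readStream\
.option('newFilesOnly', 'true')\
.option('header', 'true')\
.schema(myschema)\
.text('hdfs:///home/keenek1/analytics/logs/')\
.withColumn("FileName", input_file_name())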
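On the "missing two slashes" point raised in the question: file:/ and file:/// normally describe the same location, so the shortened form in the error message is unlikely to be the cause by itself. The sketch below uses Python's standard urllib as a stand-in for Hadoop's own path handling (an assumption, since Spark uses Hadoop's Path class rather than urllib) to show how the number of slashes changes what is read as host and what is read as path.

```python
from urllib.parse import urlparse


def describe(uri):
    """Return (scheme, authority, path) for a URI string."""
    parts = urlparse(uri)
    return parts.scheme, parts.netloc, parts.path


# One slash and three slashes: empty authority, identical path.
one_slash = describe("file:/home/keenek1/analytics/logs/")
three_slashes = describe("file:///home/keenek1/analytics/logs/")

# Two slashes: the first segment after // is parsed as the authority (host).
two_slashes_hdfs = describe("hdfs://home/keenek1/analytics/logs/")
three_slashes_hdfs = describe("hdfs:///home/keenek1/analytics/logs/")

print(one_slash == three_slashes)
print(two_slashes_hdfs)
print(three_slashes_hdfs)
```

If the job runs on a cluster, a more likely cause is that a file:/ path has to exist on every node that reads it, not only on the machine where the file was created.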
Regarding the follow-up question:
If the same log file is overwritten, let's say hourly, the checkpoint doesn't reprocess the file. I need it to say 'if modified time changes, reprocess' - is that possible?
A workaround is to point your Spark streaming job at a different directory and use Spark listeners to check the file timestamps in the actual directory. If a file's timestamp has changed, move that file to your streaming directory under a new name.
Let me know if you want code. I can give it to you in Scala; you may need to convert it to Python.
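A minimal Python sketch of that workaround, under stated assumptions: it polls the directory with plain os calls instead of Spark listeners, it copies rather than moves (so the process that overwrites the log is not disturbed), and the function name stage_modified_files and the seen_mtimes dictionary are hypothetical helpers, not part of any Spark API.

```python
import os
import shutil


def stage_modified_files(source_dir, streaming_dir, seen_mtimes):
    """Copy files whose modification time changed into streaming_dir under a new name.

    seen_mtimes maps file name -> last modification time (ns) already staged;
    it is updated in place. Returns the list of newly created paths.
    """
    staged = []
    for name in sorted(os.listdir(source_dir)):
        src = os.path.join(source_dir, name)
        if not os.path.isfile(src):
            continue
        mtime_ns = os.stat(src).st_mtime_ns
        if seen_mtimes.get(name) == mtime_ns:
            continue  # unchanged since the last scan
        seen_mtimes[name] = mtime_ns
        stem, ext = os.path.splitext(name)
        # The modification time in the name makes every version a "new" file.
        dest = os.path.join(streaming_dir, f"{stem}.{mtime_ns}{ext}")
        # Copy to a temporary name first, then rename, so the final name
        # never points at a half-written file.
        tmp = os.path.join(streaming_dir, f".{name}.tmp")
        shutil.copy2(src, tmp)
        os.replace(tmp, dest)
        staged.append(dest)
    return staged
```

Calling this on a timer (for example once a minute) with the same seen_mtimes dictionary stages each overwritten version exactly once, and the streaming query then sees each version as a file it has not processed before.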

Related

Dealing with overwritten files in Databricks Autoloader

Main topic
I am facing a problem that I am struggling a lot to solve:
Ingest files that already have been captured by Autoloader but were
overwritten with new data.
Detailed problem description
I have a landing folder in a data lake where every day a new file is posted. You can check the image example below:
Each day an automation posts a file with new data. The file name has a suffix indicating the year and month of the current posting period.
This naming convention results in a file that is overwritten each day with the accumulated data extraction of the current month. The number of files in the folder only increases when the current month is closed and a new month starts.
To deal with that I have implemented the following PySpark code using the Autoloader feature from Databricks:
# Import functions
from pyspark.sql.functions import input_file_name, current_timestamp, col
# Define variables used in code below
checkpoint_directory = "abfss://gpdi-files@hgbsprodgbsflastorage01.dfs.core.windows.net/RAW/Test/_checkpoint/sapex_ap_posted"
data_source = f"abfss://gpdi-files@hgbsprodgbsflastorage01.dfs.core.windows.net/RAW/Test"
source_format = "csv"
table_name = "prod_gbs_gpdi.bronze_data.sapex_ap_posted"
# Configure Auto Loader to ingest csv data to a Delta table
query = (
spark.readStream
.format("cloudFiles")
.option("cloudFiles.format", source_format)
.option("cloudFiles.schemaLocation", checkpoint_directory)
.option("header", "true")
.option("delimiter", ";")
.option("skipRows", 7)
.option("modifiedAfter", "2022-10-15 11:34:00.000000 UTC-3") # To ingest files that have a modification timestamp after the provided timestamp.
.option("pathGlobFilter", "AP_SAPEX_KPI_001 - Posted Invoices in *.CSV") # A potential glob pattern to provide for choosing files.
.load(data_source)
.select(
"*",
current_timestamp().alias("_JOB_UPDATED_TIME"),
input_file_name().alias("_JOB_SOURCE_FILE"),
col("_metadata.file_modification_time").alias("_MODIFICATION_TIME")
)
.writeStream
.option("checkpointLocation", checkpoint_directory)
.option("mergeSchema", "true")
.trigger(availableNow=True)
.toTable(table_name)
)
This code allows me to capture each new file and ingest it into a Raw Table.
The problem is that it works fine ONLY when a new file arrives. If the desired file is overwritten in the landing folder, the Autoloader does nothing because it assumes the file has already been ingested, even though the modification time of the file has changed.
Failed attempt
I tried to use the modifiedAfter option in the code. But it appears to serve only as a filter that prevents a file from being ingested if its modification timestamp is before the threshold given in the timestamp string. It does not reingest files that have timestamps before the modifiedAfter threshold.
.option("modifiedAfter", "2022-10-15 14:10:00.000000 UTC-3")
Question
Does anyone know how to detect a file that was already ingested but has a different modified date, and how to reprocess it to load it into a table?
I have figured out a solution to this problem. The Autoloader options list in the Databricks documentation includes an option called cloudFiles.allowOverwrites. If you enable it in the streaming query, then whenever a file is overwritten in the lake the query will ingest it into the target table. Please note that this option will probably duplicate data whenever a file is overwritten, so downstream treatment will be necessary.
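One possible shape for that downstream treatment, sketched in plain Python rather than Spark so the logic is easy to follow: for each source file, keep only the rows that came from its most recent version. The column names _JOB_SOURCE_FILE and _MODIFICATION_TIME are the ones defined in the question's query; the function name keep_latest_version and the sample rows are hypothetical, and a real job would express the same idea with a grouped maximum over the bronze table.

```python
from datetime import datetime


def keep_latest_version(rows):
    """Keep only rows ingested from the newest version of each source file."""
    latest = {}
    for row in rows:
        source = row["_JOB_SOURCE_FILE"]
        modified = row["_MODIFICATION_TIME"]
        if source not in latest or modified > latest[source]:
            latest[source] = modified
    return [
        row for row in rows
        if row["_MODIFICATION_TIME"] == latest[row["_JOB_SOURCE_FILE"]]
    ]


# Hypothetical rows: the same file was ingested twice after being overwritten.
rows = [
    {"invoice": "A1", "_JOB_SOURCE_FILE": "posted_2022_10.csv",
     "_MODIFICATION_TIME": datetime(2022, 10, 14, 9, 0)},
    {"invoice": "A1", "_JOB_SOURCE_FILE": "posted_2022_10.csv",
     "_MODIFICATION_TIME": datetime(2022, 10, 15, 9, 0)},
    {"invoice": "A2", "_JOB_SOURCE_FILE": "posted_2022_10.csv",
     "_MODIFICATION_TIME": datetime(2022, 10, 15, 9, 0)},
]
deduplicated = keep_latest_version(rows)
print(len(deduplicated))
```

This works for the accumulated-extract pattern described in the question, where each overwrite contains everything the previous version had; it would lose data if an overwrite could drop rows that are still wanted.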

Spark: How to overwrite a file on S3 folder and not complete folder

Using Spark I am trying to push some data(in csv, parquet format) to S3 bucket.
df.write.mode("OVERWRITE").format("com.databricks.spark.csv").options(nullValue=options['nullValue'], header=options['header'], delimiter=options['delimiter'], quote=options['quote'], escape=options['escape']).save(destination_path)
In above code piece, destination_path variable holds the S3 bucket location where data needs to be exported.
Eg. destination_path = "s3://some-test-bucket/manish/"
The folder manish of some-test-bucket contains several files and sub-folders. The above command deletes all of them, and Spark then writes the new output files. But I want to overwrite just one file with the new file.
Even being able to overwrite only the contents of this folder while leaving the sub-folders intact would solve the problem to a certain extent.
How can this be achieved?
I tried using append mode instead of overwrite.
In this case the sub-folder names remain intact, but again all the contents of the manish folder and its sub-folders are overwritten.
Short answer: Set the Spark configuration parameter spark.sql.sources.partitionOverwriteMode to dynamic instead of static. This will only overwrite the necessary partitions and not all of them.
PySpark example:
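from pyspark import SparkConf, SparkContext, sql
# "dynamic" overwrites only the partitions present in the data being written
conf = SparkConf().setAppName("test").set("spark.sql.sources.partitionOverwriteMode", "dynamic").setMaster("yarn")
sc = SparkContext(conf=conf)
sqlContext = sql.SQLContext(sc)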
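To make the difference between the two modes concrete, here is a toy model in plain Python. It is not Spark code: the table is just a dictionary from partition value to rows, and the function name overwrite is hypothetical. Note that the setting only matters for partitioned output (for example a write that uses partitionBy), which the question's write would need to adopt.

```python
def overwrite(table, new_rows, mode):
    """Toy model of an overwrite-mode write to a partitioned table.

    table:    dict mapping partition value -> list of rows already stored
    new_rows: list of (partition value, row) pairs being written
    mode:     "static" or "dynamic"
    """
    incoming = {}
    for partition, row in new_rows:
        incoming.setdefault(partition, []).append(row)

    if mode == "static":
        # Every existing partition is dropped; only the new data remains.
        return incoming
    if mode == "dynamic":
        # Only partitions that appear in the new data are replaced.
        result = {partition: list(rows) for partition, rows in table.items()}
        result.update(incoming)
        return result
    raise ValueError(f"unknown mode: {mode}")


existing = {"2017-09-17": ["a", "b"], "2017-09-18": ["c"]}
update = [("2017-09-18", "c2"), ("2017-09-19", "d")]

static_result = overwrite(existing, update, "static")
dynamic_result = overwrite(existing, update, "dynamic")
print(sorted(static_result))
print(sorted(dynamic_result))
```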
The files can be deleted first, and append mode can then be used to insert the data instead of overwriting, which retains the sub-folders. Below is an example in PySpark.
import subprocess
subprocess.call(["hadoop", "fs", "-rm", "{}*.csv.deflate".format(destination_path)])
df.write.mode("append").format("com.databricks.spark.csv").options(nullValue=options['nullValue'], header=options['header'], delimiter=options['delimiter'], quote=options['quote'], escape=options['escape']).save(destination_path)

HDFS and Spark: Best way to write a file and reuse it from another program

I have some results from a Spark application saved in the HDFS as files called part-r-0000X (X= 0, 1, etc.). And, because I want to join the whole content in a file, I'm using the following command:
hdfs dfs -getmerge srcDir destLocalFile
The previous command is used in a bash script which empties the output directory (where the part-r-... files are saved) and, inside a loop, executes the above getmerge command.
The thing is that I need to use the resulting file in another Spark program, which needs the merged file as input in HDFS. So I'm saving it locally and then uploading it to HDFS.
Another option I've thought of is to write the file from the Spark program in this way:
outputData.coalesce(1, false).saveAsTextFile(outPathHDFS)
But I've read coalesce() doesn't help with the performance.
Any other ideas? suggestions? Thanks!
My guess is that you want to merge all the files into a single one so that you can load them all at once into a Spark RDD.
Let the files stay as parts (0, 1, ...) in HDFS.
Why not load them with wholeTextFiles, which does what you need? From the PySpark documentation:
wholeTextFiles(path, minPartitions=None, use_unicode=True)
Read a directory of text files from HDFS, a local file system (available on all nodes), or any Hadoop-supported file system URI. Each file is read as a single record and returned in a key-value pair, where the key is the path of each file, the value is the content of each file.
If use_unicode is False, the strings will be kept as str (encoding as utf-8), which is faster and smaller than unicode. (Added in Spark 1.2)
For example, if you have the following files:
hdfs://a-hdfs-path/part-00000
hdfs://a-hdfs-path/part-00001
...
hdfs://a-hdfs-path/part-nnnnn
Do rdd = sparkContext.wholeTextFiles("hdfs://a-hdfs-path"), then rdd contains:
(a-hdfs-path/part-00000, its content)
(a-hdfs-path/part-00001, its content)
...
(a-hdfs-path/part-nnnnn, its content)
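To illustrate the shape of that result without a cluster, the sketch below mimics it on a local directory in plain Python. It is only an analogy: whole_text_files and merge_contents are hypothetical local helpers, not the Spark method, and they read everything into memory on one machine.

```python
import os


def whole_text_files(directory):
    """Local stand-in: return (path, content) pairs, one per file in directory."""
    pairs = []
    for name in sorted(os.listdir(directory)):
        path = os.path.join(directory, name)
        if os.path.isfile(path):
            with open(path, encoding="utf-8") as handle:
                pairs.append((path, handle.read()))
    return pairs


def merge_contents(pairs):
    """Concatenate the contents in path order, like a merged single file."""
    return "".join(content for _, content in sorted(pairs))
```

With pairs of this form, the second program can work on the part files directly, so the separate getmerge and re-upload step from the question may not be needed.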
Try Spark's bucketBy.
This is a nice feature available via df.write.saveAsTable(), but the resulting format can only be read by Spark. The data shows up in the Hive metastore but cannot be read by Hive or Impala.
The best solution that I've found so far was:
outputData.saveAsTextFile(outPath, classOf[org.apache.hadoop.io.compress.GzipCodec])
Which saves the outputData in compressed part-0000X.gz files under the outPath directory.
And, from the other Spark app, it reads those files using this:
val inputData = sc.textFile(inDir + "part-00*", numPartition)
Where inDir corresponds to the outPath.

File Name and Variable in Flume

Right now I am working on a project where we are trying to read the Tomcat access log using Flume, process the data in Spark, and dump it into a DB in the proper format. The problem is that the Tomcat access log is a daily rolling file, so the file name changes every day. Something like...
localhost_access_log.2017-09-19.txt
localhost_access_log.2017-09-18.txt
localhost_access_log.2017-09-17.txt
and my flume conf file for source section is something like
# Describe/configure the source
flumePullAgent.sources.nc1.type = exec
flumePullAgent.sources.nc1.command = tail -F /tomcatLog/localhost_access_log.2017-09-17.txt
#flumePullAgent.sources.nc1.selector.type = replicating
This runs the tail command on a fixed file name (I used a fixed name for testing only). How can I pass the file name as a parameter in the Flume conf file?
In fact, even if I am somehow able to pass the file name as a parameter, it still will not be an actual solution. Say I start Flume today with some file name (for example "localhost_access_log.2017-09-19.txt"). Tomorrow, when the file name changes (localhost_access_log.2017-09-19.txt to localhost_access_log.2017-09-20.txt), someone has to stop Flume and restart it with the new file name. In that case it will not be a continuous process; I have to stop and start Flume using a cron job or something similar. Another problem is that I will lose some data every day during the switchover (the server we are working on is a high-throughput server, almost 700-800 TPS). By switchover I mean the time it takes to generate the new file name, plus the time to stop Flume, plus the time to start Flume.
Does anyone have an idea how to run Flume with a rolling file name in a production environment? Any help will be highly appreciated...
exec source is not suitable for your task, you can instead use Spooling Directory Source. From Flume user guide:
This source lets you ingest data by placing files to be ingested into a “spooling” directory on disk. This source will watch the specified directory for new files, and will parse events out of new files as they appear.
Then, in config file you'd mention your logs directory like this:
agent.sources.spooling_src.spoolDir = /tomcatLog

Saving a file locally in Databricks PySpark

I am sure there is documentation for this somewhere and/or the solution is obvious, but I've come up dry in all of my searching.
I have a dataframe that I want to export to a text file to my local machine. The dataframe contains strings with commas, so just display -> download full results ends up with a distorted export. I'd like to export out with a tab-delimiter, but I cannot figure out for the life of me how to download it locally.
I have
match1.write.format("com.databricks.spark.csv")
.option("delimiter", "\t")
.save("file:\\\C:\\Users\\user\\Desktop\\NewsArticle.txt")
but clearly this isn't right. I suspect it is writing somewhere else (somewhere I don't want it to be...) because running it again gives me the error that the path already exists. So... what is the correct way?
cricket_007 pointed me along the right path. Ultimately, I needed to save the file to the FileStore of Databricks (not just dbfs), and then download the resulting output through the xxxxx.databricks.com/files/[insert file path here] link.
My resulting code was:
(df.repartition(1)                  # repartitioned to save as one collective file
    .write.format('csv')            # in csv format
    .option("header", True)         # with header
    .option("quote", "")            # get rid of quote escaping
    .option("delimiter", "\t")      # delimiter of choice
    .save('dbfs:/FileStore/df/'))   # saved to the FileStore
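As a small illustration of why the tab delimiter avoids the distorted export described in the question, the sketch below writes and re-reads a few rows with Python's standard csv module. It is plain Python, not Spark, and the sample rows are made up for the example.

```python
import csv
import io

# Hypothetical rows whose text contains commas, as in the question.
rows = [
    ["id", "headline"],
    ["1", "Markets rally, then fall"],
    ["2", "Rain, wind, and snow"],
]

# Write the rows tab-delimited into an in-memory buffer.
buffer = io.StringIO()
writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
writer.writerows(rows)
tsv_text = buffer.getvalue()

# Read them back with the same delimiter: the commas stay inside their fields.
parsed = list(csv.reader(io.StringIO(tsv_text), delimiter="\t"))
print(parsed == rows)
```

Because the fields contain no tabs, nothing needs quoting, which is also why turning quoting off in the Spark write above is safe only as long as the data itself contains no tab characters.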
Check whether the output is present at the location below. There should be multiple part files in that folder.
import os
print(os.getcwd())
If you want to create a single file (not multiple part files), you can use coalesce(). Note that it forces one worker to fetch all the data and write it sequentially, so it is not advisable when dealing with huge data.
df.coalesce(1).write.format("csv").\
option("delimiter", "\t").\
save("<file path>")
Hope this helps!
