I am new to Spark. I am observing that after a Spark program completes, there are directories created with names like:
spark-3505c49a-0402-41ce-8187-c82ea7527e15
blockmgr-37b4b5e6-a97c-4779-9658-a19c194b9a2c
Please help me understand these directories: why are they getting created?
These are temporary directories. The spark- prefixed directory is used for multiple purposes, one of which is the ShuffleManager, and the blockmgr- prefixed directories are used by the BlockManager.
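If you want to control where this scratch space goes, a minimal sketch (assuming you build the session yourself and the cluster manager does not override the setting; the path and app name below are placeholders):

from pyspark.sql import SparkSession

# spark.local.dir controls where Spark writes its scratch space (the spark-*
# and blockmgr-* directories). On YARN/standalone clusters the cluster
# manager's own setting (e.g. SPARK_LOCAL_DIRS) takes precedence.
spark = (SparkSession.builder
         .appName("local-dir-demo")
         .config("spark.local.dir", "/data/spark-scratch")  # placeholder path
         .getOrCreate())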
I'm writing PySpark script in Azure Synapse Notebook. It's supposed to load a long list of CSV files into a dataframe like this:
%%pyspark
path = [
'abfss://mycontainer@mylake.dfs.core.windows.net/sku-934/data.csv',
'abfss://mycontainer@mylake.dfs.core.windows.net/sku-594/data.csv',
'abfss://mycontainer@mylake.dfs.core.windows.net/sku-365/data.csv',
# Many more paths here
]
df = spark.read.options(header=True).csv(path)
However, I cannot guarantee that the files at all of those paths exist. Sometimes they don't. When that happens, the whole script stops with AnalysisException: Path does not exist.
Question: can I instruct Spark in an Azure Synapse Notebook to ignore missing files and load only those that are there?
What I already tried to solve this - googling suggested I could do spark.sql("set spark.sql.files.ignoreCorruptFiles=true"), but for some reason it had no effect. Maybe that doesn't work in Synapse, or it's intended for a different use case. My knowledge about this is very limited, so I can't tell.
What you ask is not possible.
Either write placeholder files with no data to the missing paths before running the app, or build the list of valid paths first.
This is a well-known issue.
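If you go with building the valid list first, here is a minimal sketch using the Hadoop FileSystem API through Spark's JVM gateway (spark._jvm and spark._jsc are internal handles; existing_paths is a hypothetical helper name, and path is the list from the question):

%%pyspark
conf = spark._jsc.hadoopConfiguration()
jvm = spark._jvm

def existing_paths(paths):
    # Keep only the paths whose files actually exist in the lake.
    valid = []
    for p in paths:
        hadoop_path = jvm.org.apache.hadoop.fs.Path(p)
        fs = hadoop_path.getFileSystem(conf)
        if fs.exists(hadoop_path):
            valid.append(p)
    return valid

df = spark.read.options(header=True).csv(existing_paths(path))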
I am using Spark 2.4 in AWS EMR.
I am using Pyspark and SparkSQL for my ELT/ETL and using DataFrames with Parquet input and output on AWS S3.
As of Spark 2.4, as far as I know, there is no way to tag or customize the file names of the output (Parquet) files. Please correct me if I'm wrong.
When I store parquet output files on S3 I end up with file names which look like this:
part-43130-4fb6c57e-d43b-42bd-afe5-3970b3ae941c.c000.snappy.parquet
The middle part of the file name looks like it has an embedded GUID/UUID: 4fb6c57e-d43b-42bd-afe5-3970b3ae941c
I would like to know if I can obtain this GUID/UUID value from a PySpark or Spark SQL function at run time, so I can log/save/display it in a text file.
I need to log this GUID/UUID value because I may need to remove the files with this value in their names later, for manual rollback purposes (for example, I may discover a day or a week later that this data is somehow corrupt and needs to be deleted, so all files tagged with the GUID/UUID can be identified and removed).
I know that I can partition the table manually on a GUID column but then I end up with too many partitions, so it hurts performance. What I need is to somehow tag the files, for each data load job, so I can identify and delete them easily from S3, hence GUID/UUID value seems like one possible solution.
Open for any other suggestions.
Thank you
Is this with the new "S3A-specific committer"? If so, it means they're using Netflix's code/trick of putting a GUID on each file written so as to avoid eventual-consistency problems. That doesn't help much here, though.
Consider offering a patch to Spark which lets you add a specific prefix to a file name.
Or, for Apache Hadoop & Spark (i.e. not EMR), an option for the S3A committers to put that prefix in when they generate temporary filenames.
Short term: you can always list the before-and-after state of the directory tree (tip: use FileSystem.listFiles(path, recursive) for speed) and either remember the new files or rename them (which will be slow; remembering the new filenames is better).
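A hedged sketch of that before-and-after listing idea from PySpark, again via the Hadoop FileSystem API through the JVM gateway (the bucket/table path is a placeholder, and df stands for whatever DataFrame you are writing):

jvm = spark._jvm
conf = spark._jsc.hadoopConfiguration()
out = jvm.org.apache.hadoop.fs.Path("s3a://mybucket/mytable")  # placeholder
fs = out.getFileSystem(conf)

def list_files(fs, path):
    # Recursive listing via FileSystem.listFiles(path, recursive=True).
    files = set()
    if not fs.exists(path):
        return files
    it = fs.listFiles(path, True)
    while it.hasNext():
        files.add(it.next().getPath().toString())
    return files

before = list_files(fs, out)
df.write.mode("append").parquet(out.toString())
after = list_files(fs, out)
new_files = after - before   # record these for a possible rollback later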
Spark already writes files with a UUID in the names. Instead of creating too many partitions, you can set up custom file naming (e.g. add some ID). Maybe this is a solution for you: https://stackoverflow.com/a/43377574/1251549
Not tried yet (but planning to): https://github.com/awslabs/amazon-s3-tagging-spark-util
In theory, you can tag each file with the job ID (or whatever identifier you need) and then run something against those tags later.
Both solutions lead to performing multiple S3 ListObjects API requests, checking tags/filenames, and deleting the files one by one.
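For the filename route, a rough sketch of that list-and-delete pass with boto3 (bucket, prefix, and the UUID value are placeholders; it still has to list everything and filter client-side):

import boto3

s3 = boto3.client("s3")
bucket = "mybucket"                                   # placeholder
prefix = "mytable/"                                   # placeholder
run_id = "4fb6c57e-d43b-42bd-afe5-3970b3ae941c"       # the logged UUID

paginator = s3.get_paginator("list_objects_v2")
to_delete = []
for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
    for obj in page.get("Contents", []):
        if run_id in obj["Key"]:                      # match on file name
            to_delete.append({"Key": obj["Key"]})

# delete_objects accepts at most 1000 keys per call
for i in range(0, len(to_delete), 1000):
    s3.delete_objects(Bucket=bucket, Delete={"Objects": to_delete[i:i + 1000]})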
Does Spark use the same cache location for storing tmp files for each executor?
e.g., if I have two tasks running in one executor and both create a file with the same name, will one of them give a "file exists" error?
I got the answer from another source:
It does use the same cache location, as per its spark.local.dir property (and java.io.tmpdir for Java stuff). If by "create a file" you mean adding a file (addFile), then you could overcome this by also setting spark.files.overwrite to true, which will only work if the current file is different from the newly added one.
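For the addFile case specifically, a minimal sketch (the file path and app name are placeholders; the config must be set before the session/context starts):

from pyspark import SparkFiles
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("addfile-overwrite-demo")
         .config("spark.files.overwrite", "true")   # allow re-adding a changed file
         .getOrCreate())

spark.sparkContext.addFile("/tmp/lookup.csv")        # placeholder local file
local_copy = SparkFiles.get("lookup.csv")            # resolved path on driver/executors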
With a call like
df.write.csv("s3a://mybucket/mytable")
I obviously know where the files/objects are written, but because of S3's eventual consistency guarantees, I can't be 100% sure that getting a listing from that location will return all (or even any) of the files that were just written. If I could get the list of files/objects Spark just wrote, then I could prepare a manifest file for a Redshift COPY command without worrying about eventual consistency. Is this possible, and if so, how?
The spark-redshift library can take care of this for you. If you want to do it yourself you can have a look at how they do it here: https://github.com/databricks/spark-redshift/blob/1092c7cd03bb751ba4e93b92cd7e04cffff10eb0/src/main/scala/com/databricks/spark/redshift/RedshiftWriter.scala#L299
EDIT: I avoid further worry about consistency by using df.coalesce(fileCount) to output a known number of file parts (for Redshift you want a multiple of the slices in your cluster). You can then check how many files are listed in the Spark code and also how many files are loaded in Redshift stl_load_commits.
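A small sketch of that coalesce approach (fileCount and the output path are placeholders; df is the DataFrame from the question):

fileCount = 8   # e.g. a multiple of the Redshift cluster's slice count
(df.coalesce(fileCount)
   .write
   .mode("overwrite")
   .csv("s3a://mybucket/mytable"))
# Afterwards, compare the expected number of part files (fileCount) with what
# S3 lists and with what Redshift reports in stl_load_commits.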
It's good to be aware of the consistency risks; you can hit them in listings, with delayed visibility of newly created objects and with deleted objects still being found.
AFAIK, you can't get a list of the files created, as tasks can generate whatever they want into the task output dir, which is then marshalled (via listing and copy) into the final output dir.
In the absence of a consistency layer atop S3 (S3mper, S3Guard, etc.), you can read & spin for "a bit" to allow the shards to catch up. I have no good idea of what a good value of "a bit" is.
However, if you are calling df.write.csv(), you may also have been caught by listing inconsistencies within the committer used to propagate task output to the job dir, as that's done in S3A via list + copy.
I have a very large set of json files (>1 million files) that I would like to work on with Spark.
But, I've never tried loading this much data into an RDD before, so I actually don't know if it can be done, or rather if it even should be done.
What is the correct pattern for dealing with this amount of data within RDD(s) in Spark?
The easiest way would be to create a directory, copy all the files into it, and pass that directory as the path when reading the data.
If you try to use patterns in the directory path, Spark might run into out-of-memory issues.
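A minimal sketch of that approach, assuming all the JSON files have been copied into one directory (the path is a placeholder):

# Point the reader at the directory rather than at a pattern or a file list.
df = spark.read.json("s3a://mybucket/all-json-files/")
df.printSchema()
print(df.count())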