Adding Spark Partition as a column without reading all files - apache-spark

Using Spark 2.1.
What is the best way to ensure that the partition column used when writing a DataFrame out to Parquet gets added back as a column after reading it in, without globbing /* across all of the files? I just want to read s3a://my/path/part={2018-*} and have the part column I originally partitioned by become available again.
I thought the basePath option takes care of this, by adding any partition directories under that path as columns, but I can't seem to get it to work.
Tried this out:
My files are laid out in the standard way, and I want part added back in as a column:
s3a://my/path/part=20170101
s3a://my/path/part=20170102
This is not working:
spark.read
.option("BasePath", "s3a://my/path/")
.parquet(filePath)
Am I just thinking about this incorrectly, and should I be reading all of the files and then filtering afterwards? I thought a main benefit of partitioning by a column is that you can read just a subset of the files by filtering on the partition.
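For reference, the option Spark's partition discovery documents is spelled basePath, and the read path can target only the partitions you need. A minimal sketch along the lines of the question (the glob is illustrative and may need adjusting to your actual values):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().getOrCreate()

val df = spark.read
  .option("basePath", "s3a://my/path/")    // parent directory that contains the part=... subdirectories
  .parquet("s3a://my/path/part=2017*")     // read only the partitions you need

df.printSchema()                           // `part` should appear as a column via partition discovery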

Related

Spark tagging file names for purpose of possible later deletion/rollback?

I am using Spark 2.4 in AWS EMR.
I am using Pyspark and SparkSQL for my ELT/ETL and using DataFrames with Parquet input and output on AWS S3.
As of Spark 2.4, as far as I know, there is no way to tag or to customize the file names of output files (parquet). Please correct me if I'm wrong.
When I store parquet output files on S3 I end up with file names which look like this:
part-43130-4fb6c57e-d43b-42bd-afe5-3970b3ae941c.c000.snappy.parquet
The middle part of the file name looks like it has embedded GUID/UUID :
part-43130-4fb6c57e-d43b-42bd-afe5-3970b3ae941c.c000.snappy.parquet
I would like to know if I can obtain this GUID/UUID value from the PySpark or SparkSQL function at run-time, to log/save/display this value in a text file?
I need to log this GUID/UUID value because I may later need to remove the files with this value in their names, for manual rollback purposes (for example, I may discover a day or a week later that this data is somehow corrupt and needs to be deleted, so all files tagged with the GUID/UUID can be identified and removed).
I know that I can partition the table manually on a GUID column, but then I end up with too many partitions, which hurts performance. What I need is to somehow tag the files for each data load job, so I can identify and delete them easily from S3; hence the GUID/UUID value seems like one possible solution.
I'm open to any other suggestions.
Thank you
Is this with the new "s3a specific committer"? If so, it means that they're using Netflix's code/trick of putting a GUID on each file written so as to avoid eventual-consistency problems. That doesn't help much though.
Consider offering a patch to Spark which lets you add a specific prefix to a file name.
Or for Apache Hadoop & Spark (i.e. not EMR), an option for the S3A committers to put that prefix in when they generate temporary filenames.
Short term: you can always list the before-and-after state of the directory tree (tip: use FileSystem.listFiles(path, recursive) for speed), and either remember the new files or rename them (renaming will be slow; remembering the new filenames is better).
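A minimal sketch of that before-and-after listing with the Hadoop FileSystem API (bucket and path names are hypothetical):

import org.apache.hadoop.fs.{FileSystem, Path}

val fs = FileSystem.get(new java.net.URI("s3a://my-bucket/"), spark.sparkContext.hadoopConfiguration)

def listAll(dir: String): Set[String] = {
  val it = fs.listFiles(new Path(dir), true)   // recursive listing, faster than per-directory calls on S3
  val files = scala.collection.mutable.Set.empty[String]
  while (it.hasNext) files += it.next().getPath.toString
  files.toSet
}

val before = listAll("s3a://my-bucket/output/")
// ... run the job that writes its output under that directory ...
val after = listAll("s3a://my-bucket/output/")
val newFiles = after -- before   // log these somewhere durable for a possible rollback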
Spark already writes files with a UUID in the names. Instead of creating too many partitions you can set up custom file naming (e.g. add some id). Maybe this is a solution for you - https://stackoverflow.com/a/43377574/1251549
Not tried yet (but planning to) - https://github.com/awslabs/amazon-s3-tagging-spark-util
In theory, you can tag the files with a job id (or whatever) and then run something against those tags later.
Both solutions lead to performing multiple S3 ListObjects API requests, checking tags/filenames, and deleting the files one by one.
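A sketch of the filename-based cleanup both of these approaches end up doing, using the Hadoop FS glob API rather than the raw S3 SDK (the UUID value and paths are hypothetical):

import org.apache.hadoop.fs.{FileSystem, Path}

val badRunId = "4fb6c57e-d43b-42bd-afe5-3970b3ae941c"   // the UUID you logged for the bad load
val fs = FileSystem.get(new java.net.URI("s3a://my-bucket/"), spark.sparkContext.hadoopConfiguration)

fs.globStatus(new Path(s"s3a://my-bucket/output/part-*-$badRunId*"))
  .foreach(status => fs.delete(status.getPath, false))  // delete matching files one by one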

Apache PySpark - Reading a directory without scanning the files

We have a growing data lake of logs we keep on google storage. The data is partitioned by dates (and other stuff such as env=production/staging). Imagine the path gs://bucket/data/env=*/date=*
We begin an app or an analysis by creating DataFrames that can be queried later on for processing. The problem is that creating the DFs takes a long time, even before we run any actions on them. In other words, the following command takes a long time because Spark seems to be scanning all the files inside (and, as I mentioned, the amount of data keeps growing).
df = spark.read.load("gs://bucket/data/", schema=data_schema, format="json")
Note that we provide the schema here. Also, after the data is loaded the partitioning works well; that is, if we filter by day we do get the speed-up that we expect. We don't want to read a specific partition from the get-go; we would like to have everything in one DF and read only what we need later on.

How can I append to same file in HDFS(spark 2.11)

I am trying to store stream data into HDFS using Spark Streaming, but it keeps creating new files instead of appending to one single file or a few files.
If it keeps creating n files, I feel it won't be very efficient.
HDFS FILE SYSTEM
Code
lines.foreachRDD(f => {
  if (!f.isEmpty()) {
    val df = f.toDF().coalesce(1)
    df.write.mode(SaveMode.Append).json("hdfs://localhost:9000/MT9")
  }
})
In my pom.xml I am using the respective dependencies:
spark-core_2.11
spark-sql_2.11
spark-streaming_2.11
spark-streaming-kafka-0-10_2.11
As you already realized, Append in Spark means write-to-existing-directory, not append-to-file.
This is intentional and desired behavior (think about what would happen if a process failed in the middle of "appending", even if the format and file system allowed that).
Operations like merging files should be applied by a separate process, if necessary at all, which ensures correctness and fault tolerance. Unfortunately this requires a full copy which, for obvious reasons, is not desirable on a batch-to-batch basis.
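A minimal sketch of such a separate compaction pass, run occasionally rather than per batch (the source path is from the question, the target path is hypothetical):

import org.apache.spark.sql.SaveMode

val all = spark.read.json("hdfs://localhost:9000/MT9")   // everything written so far
all.coalesce(1)
  .write.mode(SaveMode.Overwrite)
  .json("hdfs://localhost:9000/MT9_compacted")            // rewrite as a small number of files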
It's creating a file for each RDD because you are reinitialising the DataFrame variable every time. I would suggest declaring a DataFrame variable outside the loop, assigned to null, and inside each RDD unioning it with the local DataFrame. After the loop, write using the outer DataFrame.

spark: dataframe.count yields way more rows than printing line by line or show()

New to Spark; using Databricks. Really puzzled.
I have this dataFrame: df.
df.count() yields Long = 5460
But if I print line by line:
df.collect.foreach(println) I get only 541 rows printed out. Similarly, df.show(5460) only shows 1017 rows. What could be the reason?
A related question: how can I save "df" with Databricks? And where does it save to? -- I tried to save before but couldn't find the file afterwards. I load the data by mounting an S3 bucket, if that's relevant.
Regarding your first question, Databricks output truncates by default. This applies both to text output in cells and to the output of display(). I would trust .count().
Regarding your second question, there are four types of places you can save on Databricks:
To Hive-managed tables using df.write.saveAsTable(). These will end up in an S3 bucket managed by Databricks, which is mounted to /user/hive/warehouse. Note that you will not have access to the AWS credentials to work with that bucket. However, you can use the Databricks file utilities (dbutils.fs.*) or the Hadoop filesystem APIs to work with the files, should you need to.
Local SSD storage. This is best done with persist() or cache() but, if you really need to, you can write to, for example, /tmp using df.write.save("/dbfs/tmp/...").
Your own S3 buckets, which you need to mount.
To /FileStore/, which is the only "directory" you can download from directly. This is useful, for example, for writing CSV files you want to bring into Excel immediately. You write the file and output a "Download File" HTML link into your notebook.
For more details see the Databricks FileSystem Guide.
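A minimal sketch of two of those options in a Databricks Scala notebook (table and path names are hypothetical):

df.write.mode("overwrite").saveAsTable("my_db.my_table")   // Hive-managed table under /user/hive/warehouse

df.coalesce(1)
  .write.mode("overwrite")
  .option("header", "true")
  .csv("/FileStore/exports/my_report")                     // downloadable via the /files/ URL of your workspace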
The difference could be bad source data. Spark is lazy by nature, so it's not going to build a bunch of columns and fill them in just to count rows. So the data may not parse when you actually execute against it, or the rows may be null. Or your schema doesn't allow nulls for certain columns and they are null when the data is fully parsed. Or you are modifying the data between your count, collect and show. There is just not enough detail to tell for sure. You can open up a spark shell, create a small piece of data, and test those conditions by turning that data into a dataframe: change the schema to allow and then disallow nulls, add nulls to the source data and then remove them, or make the source data strings while the schema requires integers.
As far as saving your data frame: you create a DataFrameWriter with write, then define the file type you want to save it as, and then the file name. This example saves a parquet file. There are many other file types and write options that are permitted here.
df.write.parquet("s3://myfile")
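For example, a couple of the other formats and options hinted at above (bucket names are hypothetical, and partitionBy assumes the DataFrame has a date column):

df.write.mode("overwrite").option("header", "true").csv("s3://my-bucket/out_csv")
df.write.partitionBy("date").json("s3://my-bucket/out_json")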

Updating values in apache parquet file

I have a quite hefty parquet file where I need to change values for one of the columns. One way to do this would be to update those values in the source text files and recreate the parquet file, but I'm wondering if there is a less expensive and overall easier solution.
Let's start with the basics:
Parquet is a file format that needs to be saved in a file system.
Key questions:
Does parquet support append operations?
Does the file system (namely, HDFS) allow append on files?
Can the job framework (Spark) implement append operations?
Answers:
parquet.hadoop.ParquetFileWriter only supports CREATE and OVERWRITE; there is no append mode. (Not sure but this could potentially change in other implementations -- parquet design does support append)
HDFS allows append on files using the dfs.support.append property
The Spark framework does not support appending to existing parquet files, and there are no plans to; see this JIRA
It is not a good idea to append to an existing file in distributed systems, especially given we might have two writers at the same time.
More details are here:
http://bytepadding.com/big-data/spark/read-write-parquet-files-using-spark/
http://bytepadding.com/linux/understanding-basics-of-filesystem/
There are workarounds, but you need to create your parquet file in a certain way to make it easier to update.
Best practices:
A. Use row groups to create parquet files. You need to optimize how many rows of data can go into a row group before features like data compression and dictionary encoding stop kicking in.
B. Scan row groups one at a time and figure out which row groups need to be updated. Generate new parquet files with amended data for each modified row group. It is more memory efficient to work with one row group's worth of data at a time instead of everything in the file.
C. Rebuild the original parquet file by appending the unmodified row groups together with the modified row groups generated by reading in one parquet file per row group.
It's surprisingly fast to reassemble a parquet file this way, using row groups.
In theory it should be easy to append to an existing parquet file if you just strip the footer (stats info), append new row groups, and add a new footer with updated stats, but there isn't an API/library that supports it.
Look at this nice blog post, which answers your question and provides a method to perform updates using Spark (Scala):
http://aseigneurin.github.io/2017/03/14/incrementally-loaded-parquet-files.html
Copy & Paste from the blog:
when we need to edit the data, in our data structures (Parquet), that are immutable.
You can add partitions to Parquet files, but you can’t edit the data in place.
But ultimately we can mutate the data, we just need to accept that we won’t be doing it in place. We will need to recreate the Parquet files using a combination of schemas and UDFs to correct the bad data.
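A minimal sketch of that recreate-with-a-UDF approach (the column, the correction, and the paths are hypothetical, not from the blog):

import org.apache.spark.sql.functions._

val fixPrice = udf((p: Double) => if (p < 0) 0.0 else p)   // example correction of bad values

spark.read.parquet("s3a://bucket/events/")
  .withColumn("price", fixPrice(col("price")))
  .write.mode("overwrite")
  .parquet("s3a://bucket/events_fixed/")                   // write a corrected copy, then swap it in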
If you want to incrementally append the data in Parquet (you didn't ask this question, but it may still be useful for other readers)
Refer this well written blog:
http://aseigneurin.github.io/2017/03/14/incrementally-loaded-parquet-files.html
Disclaimer: I haven't written those blog posts, I just read them and found they might be useful for others.
You must re-create the file; this is the Hadoop way, especially if the file is compressed.
Another approach (very common in big data) is to write the updates to another Parquet (or ORC) file, then JOIN/UNION at query time.
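A sketch of that query-time pattern: keep the corrections in a second table and resolve the latest version per key when reading (paths, key and timestamp columns are hypothetical; both tables must share a schema):

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

val base    = spark.read.parquet("s3a://bucket/table_base/")
val updates = spark.read.parquet("s3a://bucket/table_updates/")

val latestFirst = Window.partitionBy("id").orderBy(col("updated_at").desc)

val current = base.unionByName(updates)
  .withColumn("rn", row_number().over(latestFirst))   // rank versions of each key, newest first
  .filter(col("rn") === 1)                            // keep only the latest version
  .drop("rn")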
Well, in 2022, I strongly recommend using a lakehouse solution like Delta Lake or Apache Iceberg. They will take care of that for you.
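For instance, with Delta Lake an in-place-looking update is a one-liner (a sketch; requires the Delta Lake dependency, and the path and column names are hypothetical):

import io.delta.tables.DeltaTable
import org.apache.spark.sql.functions._

val table = DeltaTable.forPath(spark, "s3a://bucket/events_delta/")
table.update(
  condition = col("price") < 0,        // rows to change
  set = Map("price" -> lit(0.0))       // new value for the column
)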
