Overwriting a file in PySpark, without affecting others - apache-spark

Overwriting a file in PySpark, without affecting others.
I need to save a dataframe as a parquet file. If a directory for a given file already exists, I need to overwrite it, but upper subdirectories should not be ovewritten.
Example:
root/2021/12/01/file1.parquet
root/2021/12/02/file2.parquet
root/2021/12/03/file3.parquet
If /2021/12/01/file1.parquet is being re-created (or overwritten), the other two files in the root remain as-is. Path /2021/12 is part of the partition structure of these files. Hence, .mode("overwrite") will overwrite the other two files as file1 is being re-created.
How can this be accomplished in PySpark?

df.write.mode("overwrite").parquet("/tmp/output/people.parquet")

Related

pyspark read multiple csv files at once

I'm using SPARK to read files in hdfs. There is a scenario, where we are getting files as chunks from legacy system in csv format.
ID1_FILENAMEA_1.csv
ID1_FILENAMEA_2.csv
ID1_FILENAMEA_3.csv
ID1_FILENAMEA_4.csv
ID2_FILENAMEA_1.csv
ID2_FILENAMEA_2.csv
ID2_FILENAMEA_3.csv
This files are loaded to FILENAMEA in HIVE using HiveWareHouse Connector, with few transformation like adding default values. Similarly we have around 70 tables. Hive tables are created in ORC format. Tables are partitioned on ID. Right now, I'm processing all these files one by one. It's taking much time.
I want to make this process much faster. Files will be in GBs.
Is there is any way to read all the FILENAMEA files at the same time and load it to HIVE tables.
You have two methods to read several CSV files in pyspark. If all CSV files are in the same directory and all have the same schema, you can read then at once by directly passing the path of directory as argument, as follow:
spark.read.csv('hdfs://path/to/directory')
If you have CSV files in different locations or CSV files in same directory but with other CSV/text files in it, you can pass them as string representing a list of path in .csv() method argument, as follow:
spark.read.csv('hdfs://path/to/filename1,hdfs://path/to/filename2')
You can have more information about how to read a CSV file with Spark here
If you need to build this list of paths from the list of files in HDFS directory, you can look at this answer, once you've created your list of paths, you can transform it to a string to pass to .csv() method with ','.join(your_file_list)
Using: spark.read.csv(["path1","path2","path3"...]) you can read multiple files from different paths. But that means you have first to make a list of the paths. A list not a string of comma-separated file paths

Spark: How to overwrite a file on S3 folder and not complete folder

Using Spark I am trying to push some data(in csv, parquet format) to S3 bucket.
df.write.mode("OVERWRITE").format("com.databricks.spark.csv").options(nullValue=options['nullValue'], header=options['header'], delimiter=options['delimiter'], quote=options['quote'], escape=options['escape']).save(destination_path)
In above code piece, destination_path variable holds the S3 bucket location where data needs to be exported.
Eg. destination_path = "s3://some-test-bucket/manish/"
In the folder manish of some-test-bucket if I have several files and sub-folders. Above command will delete all of them and spark will write new output files. But I want to overwrite just one file with this new file.
Even if I am able to overwrite just contents of this folder, but sub-folder remain intact even that would solve the problem to certain extent.
How can this be achieved?
I tried to use mode as append instead of overwrite.
Here in this case subfolder name remains intact but again all the contents of manish folder and its sub-folder are overwritten.
Short answer: Set the Spark configuration parameter spark.sql.sources.partitionOverwriteMode to dynamic instead of static. This will only overwrite the necessary partitions and not all of them.
PySpark example:
conf=SparkConf().setAppName("test).set("spark.sql.sources.partitionOverwriteMode","dynamic").setMaster("yarn")
sc = SparkContext(conf=conf)
sqlContext = sql.SQLContext(sc)
The file's can be deleted first and then use append mode to insert the data instead of overwriting to retain the sub folder's. Below is an example from Pyspark.
import subprocess
subprocess.call(["hadoop", "fs", "-rm", "{}*.csv.deflate".format(destination_path)])
df.write.mode("append").format("com.databricks.spark.csv").options(nullValue=options['nullValue'], header=options['header'], delimiter=options['delimiter'], quote=options['quote'], escape=options['escape']).save(destination_path)

HDFS and Spark: Best way to write a file and reuse it from another program

I have some results from a Spark application saved in the HDFS as files called part-r-0000X (X= 0, 1, etc.). And, because I want to join the whole content in a file, I'm using the following command:
hdfs dfs -getmerge srcDir destLocalFile
The previous command is used in a bash script which makes empty the output directory (where the part-r-... files are saved) and, inside a loop, executes the above getmerge command.
The thing is I need to use the resultant file in another Spark program which need that merged file as input in the HDFS. So I'm saving it as local and then I upload it to the HDFS.
I've thought another option which is write the file from the Spark program in this way:
outputData.coalesce(1, false).saveAsTextFile(outPathHDFS)
But I've read coalesce() doesn't help with the performance.
Any other ideas? suggestions? Thanks!
You wish to merge all the files into a single one so that you can load all the files at once into a Spark rdd, is my guess.
Let the files be in Parts(0,1,....) in HDFS.
Why not load it with wholetextFiles, which actually does what you need.
wholeTextFiles(path, minPartitions=None, use_unicode=True)[source]
Read a directory of text files from HDFS, a local file system (available on all nodes), or any Hadoop-supported file system URI. Each file is read as a single record and returned in a key-value pair, where the key is the path of each file, the value is the content of each file.
If use_unicode is False, the strings will be kept as str (encoding as utf-8), which is faster and smaller than unicode. (Added in Spark 1.2)
For example, if you have the following files:
hdfs://a-hdfs-path/part-00000 hdfs://a-hdfs-path/part-00001 ... hdfs://a-hdfs-path/part-nnnnn
Do rdd = sparkContext.wholeTextFiles(“hdfs://a-hdfs-path”), then rdd contains:
(a-hdfs-path/part-00000, its content) (a-hdfs-path/part-00001, its content) ... (a-hdfs-path/part-nnnnn, its content)
Try SPARK BucketBy.
This is a nice feature via df.write.saveAsTable(), but this format can only be read by SPARK. Data shows up in Hive metastore but cannot be read by Hive, IMPALA.
The best solution that I've found so far was:
outputData.saveAsTextFile(outPath, classOf[org.apache.hadoop.io.compress.GzipCodec])
Which saves the outputData in compressed part-0000X.gz files under the outPath directory.
And, from the other Spark app, it reads those files using this:
val inputData = sc.textFile(inDir + "part-00*", numPartition)
Where inDir corresponds to the outPath.

Spark - Folder with same name as text file automatically created after RDD?

I placed a text file named Linecount2.txt in hdfs and built a simple rdd to count the number of lines using spark.
val lines = sc.textFile("user/root/hdpcd/Linecount2.txt")
lines.count()
This works.
But when I tried using the same text file with the aforementioned path, I receive the error:
"org.apache.hadoop.mapred.InvalidInputException: Input path does not exist:"
When I looked into that path, I could see a folder was created 'Linecount.txt'.Hence the path for the file is now
("user/root/hdpcd/Linecount2.txt/Linecount2.txt")
Then, after defining the path I was able to run it successfully.
The third time I tried this, I got the same error because input path doesn't exist.
When I went through the path,
Why does this happen?
There is a difference between putting an HDFS file at user/root/hdpcd/Linecount2.txt compared to /user/root/hdpcd/Linecount2.txt, (or, more simply hdpcd/Linecount2.txt, when you already are the root user)
The leading slash is very important if you want to place a file in an absolute directory other than your current user account, otherwise, that's the default.
You've not given your hdfs put command, but the issue here is simply the difference between the absolute and relative paths. And it's not Spark specifically that's the issue
Also, hdfs put will say that a file already exists if you try to place it in the same location, so the fact you were able to upload twice should be an indication that your path was incorrect

Parquet file format on S3: which is the actual Parquet file?

Scala 2.12 and Spark 2.2.1 here. I used the following code to write the contents of a DataFrame to S3:
myDF.write.mode(SaveMode.Overwrite)
.parquet("s3n://com.example.mybucket/mydata.parquet")
When I go to com.example.mybucket on S3 I actually see a directory called "mydata.parquet", as well as file called "mydata.parquet_$folder$"!!! If I go into the mydata.parquet directory I see two files under it:
_SUCCESS; and
part-<big-UUID>.snappy.parquet
Whereas I was just expecting to see a single file called mydata.parquet living in the root of the bucket.
Is something wrong here (if so, what?!?) or is this expected with the Parquet file format? If its expected, which is the actual Parquet file that I should read from:
mydata.parquet directory?; or
mydata.parquet_$folder$ file?; or
mydata.parquet/part-<big-UUID>.snappy.parquet?
Thanks!
The mydata.parquet/part-<big-UUID>.snappy.parquet is the actual parquet data file. However, often tools like Spark break data sets into multiple part files, and expect to be pointed to a directory that contains multiple files. The _SUCCESS file is a simple flag indicating that the write operation has completed.
According to the api to save the parqueat file it saves inside the folder you provide. Sucess is incidation that the process is completed scuesffuly.
S3 create those $folder if you write directly commit to s3. What happens is it writes to temporory folders and copies to the final destination inside the s3. The reason is there no concept of rename.
Look at the s3-distcp and also DirectCommiter for performance issue.
The $folder$ marker is used by s3n/amazon's emrfs to indicate "empty directory". ignore.
The _SUCCESS file is, as the others note, a 0-byte file. ignore
all other .parquet files in the directory are the output; the number you end up with depends on the number of tasks executed on the input
When spark uses a directory (tree) as a source of data, all files beginning with _ or . are ignored; s3n will strip out those $folder$ things too. So if you use the path for a new query, it will only pick up that parquet file.

Resources