I placed a text file named Linecount2.txt in HDFS and built a simple RDD to count the number of lines using Spark.
val lines = sc.textFile("user/root/hdpcd/Linecount2.txt")
lines.count()
This works.
But when I tried using the same text file with the aforementioned path, I received the error:
"org.apache.hadoop.mapred.InvalidInputException: Input path does not exist:"
When I looked into that path, I could see that a folder 'Linecount2.txt' had been created. Hence the path for the file is now
("user/root/hdpcd/Linecount2.txt/Linecount2.txt")
Then, after defining the path I was able to run it successfully.
The third time I tried this, I got the same error because the input path doesn't exist. When I went through the path, I found the same thing had happened again. Why does this happen?
There is a difference between putting an HDFS file at user/root/hdpcd/Linecount2.txt and putting it at /user/root/hdpcd/Linecount2.txt (or, more simply, hdpcd/Linecount2.txt, when you already are the root user).
The leading slash is very important if you want to place a file at an absolute location outside your current user account's home directory; otherwise, the path is treated as relative and resolved against /user/<your-user>, which is the default.
You've not given your hdfs put command, but the issue here is simply the difference between absolute and relative paths, and it's not Spark specifically that's the issue.
Also, hdfs put will say that a file already exists if you try to place it in the same location, so the fact that you were able to upload twice should be an indication that your path was incorrect.
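For example (PySpark syntax here; the same calls work in the Scala shell), once the file actually lives at /user/root/hdpcd/Linecount2.txt, both of these point at the same file:
# relative paths are resolved against the user's HDFS home directory, /user/root here
lines_rel = sc.textFile("hdpcd/Linecount2.txt")
# absolute path, spelled out from the HDFS root
lines_abs = sc.textFile("/user/root/hdpcd/Linecount2.txt")
lines_rel.count() == lines_abs.count()   # same file, same count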
Related
I am trying to loop through multiple folders and subfolders in an Azure Blob container and read multiple XML files.
E.g., I have files in YYYY/MM/DD/HH/123.xml format.
Similarly, I have multiple subfolders under month, date, and hour, and multiple XML files at the lowest level.
My intention is to loop through all these folders and read the XML files. I have tried a few Pythonic approaches which did not give me the intended result. Can you please help me with any ideas for implementing this?
import glob, os
for filename in glob.iglob('2022/08/18/08/225.xml'):
    if os.path.isfile(filename):  # code does not enter the for loop
        print(filename)
import os

def get_xml_files(dir='2022/08/19/08/'):
    r = []
    for root, dirs, files in os.walk(dir):  # code not moving past this for loop, no exception
        for name in files:
            filepath = root + os.sep + name
            if filepath.endswith(".xml"):
                r.append(os.path.join(root, name))
    return r
glob is a Python function and it won't recognize the blob folder path directly when the code is in PySpark; we have to give the path from the root for this. Also, make sure to specify recursive=True.
For example, I have checked the above PySpark code in Databricks, and the os code as well. I got no result for either, because for the above we need to give the absolute path from the root, i.e., the root folder.
glob code:
import glob, os
for file in glob.iglob('/path_from_root_to_folder/**/*.xml', recursive=True):
    print(file)
For me in Databricks, the root to access is /dbfs, and I have used CSV files.
Using os:
My blob files are listed from the folders and subfolders.
I have used Databricks for my repro after mounting. Wherever you are trying this code in PySpark, make sure you are giving the root of the folder in the path. When using glob, set recursive=True as well.
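A minimal sketch of the os approach with an absolute path, assuming the container is mounted and visible under /dbfs (the mount name and folder below are hypothetical):
import os

root_dir = '/dbfs/mnt/mycontainer/xmlfiles'   # hypothetical mount; start from the filesystem root
r = []
for root, dirs, files in os.walk(root_dir):
    for name in files:
        if name.endswith('.xml'):
            r.append(os.path.join(root, name))
print(r)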
There is an easier way to solve this problem with PySpark!
The tough part is that all the files have to have the same format. In the Azure Databricks sample directory, there is a /cs100 folder that has a bunch of files that can be read in as text (line by line).
The trick is the option called "recursiveFileLookup". It assumes that the directories were created by Spark, and you cannot mix and match file formats.
I added the name of the input file to the dataframe. Last but not least, I converted the dataframe to a temporary view.
Looking at a simple aggregate query, we have 10 unique files. The biggest has a little more than 1M records.
If you need to cherry-pick files from a mixed directory, this method will not work.
However, I think that is an organizational cleanup task rather than a reading one.
Last but not least, use the correct formatter to read XML.
spark.read.format("com.databricks.spark.xml")
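Putting the pieces together, a sketch might look like this (it assumes the spark-xml library is attached to the cluster; the rowTag value and source path are hypothetical):
from pyspark.sql.functions import input_file_name

df = (spark.read.format("com.databricks.spark.xml")
      .option("rowTag", "record")               # hypothetical record element of the XML files
      .option("recursiveFileLookup", "true")    # walk the YYYY/MM/DD/HH subfolders
      .load("/mnt/mycontainer/xmlfiles"))       # hypothetical root folder

df = df.withColumn("source_file", input_file_name())  # record which file each row came from
df.createOrReplaceTempView("xml_data")                 # query it with SQL as a temporary view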
Overwriting a file in PySpark, without affecting others.
I need to save a dataframe as a parquet file. If a directory for a given file already exists, I need to overwrite it, but the upper subdirectories should not be overwritten.
Example:
root/2021/12/01/file1.parquet
root/2021/12/02/file2.parquet
root/2021/12/03/file3.parquet
If /2021/12/01/file1.parquet is being re-created (or overwritten), the other two files in the root should remain as-is. Path /2021/12 is part of the partition structure of these files, so .mode("overwrite") will also overwrite the other two files while file1 is being re-created.
How can this be accomplished in PySpark?
df.write.mode("overwrite").parquet("/tmp/output/people.parquet")
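One way this is commonly handled (not shown in the thread) is Spark's dynamic partition overwrite mode, assuming the data is written with partitionBy so that the date folders are real partition directories; the column names below are hypothetical:
# Spark 2.3+: with dynamic mode, only the partitions present in df are replaced
spark.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")

(df.write
   .mode("overwrite")
   .partitionBy("year", "month", "day")   # hypothetical partition columns mirroring root/YYYY/MM/DD
   .parquet("/root"))                     # write to the top-level path; other partitions are left alone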
We have recently made changes to how we connect to ADLS from Databricks which have removed mount points that were previously established within the environment. We are using Databricks to find points in polygons, as laid out in the Databricks blog here: https://databricks.com/blog/2019/12/05/processing-geospatial-data-at-scale-with-databricks.html
Previously, a chunk of code read in a GeoJSON file from ADLS into the notebook and then projected it to the cluster(s):
nights = gpd.read_file("/dbfs/mnt/X/X/GeoSpatial/Hex_Nights_400Buffer.geojson")
a_nights = sc.broadcast(nights)
However, the new changes that have been made have removed the mount point and we are now reading files in using the string:
"wasbs://Z#Y.blob.core.windows.net/X/Personnel/*.csv"
This works fine for CSV and Parquet files, but will not load a GeoJSON! When we try this, we get an error saying "File not found". We have checked and the file is still within ADLS.
We then tried to copy the file temporarily to "dbfs" which was the only way we had managed to read files previously, as follows:
dbutils.fs.cp("wasbs://Z#Y.blob.core.windows.net/X/GeoSpatial/Nights_new.geojson", "/dbfs/tmp/temp_nights")
nights = gpd.read_file(filename="/dbfs/tmp/temp_nights")
dbutils.fs.rm("/dbfs/tmp/temp_nights")
a_nights = sc.broadcast(nights)
This works fine on the first use within the code, but then a second GeoJSON run immediately after (which we tried to write to temp_days) fails at the gpd.read_file stage, saying file not found! We have checked with dbutils.fs.ls() and can see the file in the temp location.
So some questions for you kind folks:
Why did we previously have to use "/dbfs/" when reading in GeoJSON but not CSV files, before the changes to our environment?
What is the correct way to read GeoJSON files into Databricks without a mount point set?
Why does our process fail upon trying to read the second created temp GeoJSON file?
Thanks in advance for any assistance - very new to Databricks...!
Pandas uses the local file API for accessing files, and you accessed files on DBFS via /dbfs, which provides that local file API. In your specific case, the problem is that even though you used dbutils.fs.cp, you didn't specify that you want to copy the file locally, so by default it was copied onto DBFS with the path /dbfs/tmp/temp_nights (actually dbfs:/dbfs/tmp/temp_nights), and as a result the local file API doesn't see it - you would need to use /dbfs/dbfs/tmp/temp_nights instead, or copy the file into /tmp/temp_nights.
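For example, keeping the copy on DBFS, the read has to go through the path the file actually landed at (a sketch using the paths from the question):
dbutils.fs.cp("wasbs://Z#Y.blob.core.windows.net/X/GeoSpatial/Nights_new.geojson",
              "/dbfs/tmp/temp_nights")                         # lands at dbfs:/dbfs/tmp/temp_nights
nights = gpd.read_file(filename="/dbfs/dbfs/tmp/temp_nights")  # so the local path gains an extra /dbfs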
But the better way would be to copy the file locally - you just need to specify that the destination is local, which is done with the file:// prefix, like this:
dbutils.fs.cp("wasbs://Z#Y.blob.core.windows.net/...Nights_new.geojson",
"file:///tmp/temp_nights")
and then read the file from /tmp/temp_nights:
nights = gpd.read_file(filename="/tmp/temp_nights")
Scala 2.12 and Spark 2.2.1 here. I used the following code to write the contents of a DataFrame to S3:
myDF.write.mode(SaveMode.Overwrite)
.parquet("s3n://com.example.mybucket/mydata.parquet")
When I go to com.example.mybucket on S3 I actually see a directory called "mydata.parquet", as well as a file called "mydata.parquet_$folder$"!!! If I go into the mydata.parquet directory I see two files under it:
_SUCCESS; and
part-<big-UUID>.snappy.parquet
Whereas I was just expecting to see a single file called mydata.parquet living in the root of the bucket.
Is something wrong here (if so, what?!?) or is this expected with the Parquet file format? If it's expected, which is the actual Parquet file that I should read from:
mydata.parquet directory?; or
mydata.parquet_$folder$ file?; or
mydata.parquet/part-<big-UUID>.snappy.parquet?
Thanks!
The mydata.parquet/part-<big-UUID>.snappy.parquet is the actual parquet data file. However, often tools like Spark break data sets into multiple part files, and expect to be pointed to a directory that contains multiple files. The _SUCCESS file is a simple flag indicating that the write operation has completed.
According to the API, saving the Parquet file saves it inside the folder you provide. _SUCCESS is an indication that the process completed successfully.
S3 creates those $folder$ markers if you commit writes directly to S3. What happens is that it writes to temporary folders and then copies to the final destination inside S3. The reason is that there is no concept of rename in S3.
Look at s3-dist-cp and also the DirectCommitter for performance issues.
The $folder$ marker is used by s3n / Amazon's EMRFS to indicate an "empty directory"; ignore it.
The _SUCCESS file is, as the others note, a 0-byte file; ignore it.
All other .parquet files in the directory are the output; the number you end up with depends on the number of tasks executed on the input.
When Spark uses a directory (tree) as a source of data, all files beginning with _ or . are ignored; s3n will strip out those $folder$ things too. So if you use the path for a new query, it will only pick up that parquet file.
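For example, a follow-up read should point at the directory itself (shown in PySpark; the Scala spark.read.parquet call is the same):
# Spark skips _SUCCESS (and anything else starting with _ or .) automatically
df = spark.read.parquet("s3n://com.example.mybucket/mydata.parquet")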
I had this code running before
df = sc.wholeTextFiles('./dbs-*.json,./uob-*.json').flatMap(lambda x: flattenTransactionFile(json.loads(x[1]))).toDF()
But it appears that now I get
Py4JJavaError: An error occurred while calling o24.partitions.
: org.apache.hadoop.mapreduce.lib.input.InvalidInputException: Input Pattern hdfs://localhost:9000/user/jiewmeng/dbs-*.json matches 0 files
Input Pattern hdfs://localhost:9000/user/jiewmeng/uob-*.json matches 0 files
at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:330)
at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.listStatus(FileInputFormat.java:272)
at org.apache.spark.input.WholeTextFileInputFormat.setMinPartitions(WholeTextFileInputFormat.scala:55)
It looks like Spark is trying to use Hadoop? How can I use a local file? And why the sudden failure, since I managed to use ./dbs-*.json before?
By default, the location of the file is relative to your home directory in HDFS. In order to refer to the local file system, you need to use file:///your_local_path.
For example, in the Cloudera VM, if I say
sc.textFile('myfile')
it will assume the HDFS path
/user/cloudera/myfile
whereas to refer to my local home directory I would say
sc.textFile('file:///home/cloudera/myfile')
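Applied to the original wholeTextFiles call, a sketch would look like this (the local home directory below is an assumption; flattenTransactionFile is the helper from the question):
import json

df = sc.wholeTextFiles('file:///home/jiewmeng/dbs-*.json,file:///home/jiewmeng/uob-*.json') \
       .flatMap(lambda x: flattenTransactionFile(json.loads(x[1]))) \
       .toDF()
# file:///home/jiewmeng is hypothetical - point it at wherever the JSON files actually live locally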