Spark cannot access local file anymore? - apache-spark

I had this code running before
df = sc.wholeTextFiles('./dbs-*.json,./uob-*.json').flatMap(lambda x: flattenTransactionFile(json.loads(x[1]))).toDF()
But it appears that now I get
Py4JJavaError: An error occurred while calling o24.partitions.
: org.apache.hadoop.mapreduce.lib.input.InvalidInputException: Input Pattern hdfs://localhost:9000/user/jiewmeng/dbs-*.json matches 0 files
Input Pattern hdfs://localhost:9000/user/jiewmeng/uob-*.json matches 0 files
at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:330)
at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.listStatus(FileInputFormat.java:272)
at org.apache.spark.input.WholeTextFileInputFormat.setMinPartitions(WholeTextFileInputFormat.scala:55)
It looks like Spark is trying to use Hadoop (HDFS)? How can I use a local file? And why the sudden failure, since I managed to use ./dbs-*.json before?

By default, the location of the file is relative to your directory in HDFS. To refer to the local file system, you need to use file:///your_local_path
For example, in the Cloudera VM, if I say
sc.textFile('myfile')
it will assume the HDFS path
/user/cloudera/myfile
whereas to refer to my local home directory I would say
sc.textFile('file:///home/cloudera/myfile')
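Applied to the question's code, the fix would look roughly like this (the /home/jiewmeng directory is only a guess based on the HDFS user name; substitute wherever the JSON files actually live, and note the files must be readable from every worker if you run on a cluster):
# file:// forces the local file system instead of the default HDFS location
df = sc.wholeTextFiles('file:///home/jiewmeng/dbs-*.json,file:///home/jiewmeng/uob-*.json') \
       .flatMap(lambda x: flattenTransactionFile(json.loads(x[1]))) \
       .toDF()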

Related

Reading GeoJSON in databricks, no mount point set

We have recently made changes to how we connect to ADLS from Databricks which have removed mount points that were previously established within the environment. We are using databricks to find points in polygons, as laid out in the databricks blog here: https://databricks.com/blog/2019/12/05/processing-geospatial-data-at-scale-with-databricks.html
Previously, a chunk of code read in a GeoJSON file from ADLS into the notebook and then projected it to the cluster(s):
nights = gpd.read_file("/dbfs/mnt/X/X/GeoSpatial/Hex_Nights_400Buffer.geojson")
a_nights = sc.broadcast(nights)
However, the new changes that have been made have removed the mount point and we are now reading files in using the string:
"wasbs://Z#Y.blob.core.windows.net/X/Personnel/*.csv"
This works fine for CSV and Parquet files, but will not load a GeoJSON! When we try this, we get an error saying "File not found". We have checked and the file is still within ADLS.
We then tried to copy the file temporarily to "dbfs" which was the only way we had managed to read files previously, as follows:
dbutils.fs.cp("wasbs://Z#Y.blob.core.windows.net/X/GeoSpatial/Nights_new.geojson", "/dbfs/tmp/temp_nights")
nights = gpd.read_file(filename="/dbfs/tmp/temp_nights")
dbutils.fs.rm("/dbfs/tmp/temp_nights")
a_nights = sc.broadcast(nights)
This works fine on the first use within the code, but then a second GeoJSON run immediately after (which we tried to write to temp_days) fails at the gpd.read_file stage, saying file not found! We have checked with dbutils.fs.ls() and can see the file in the temp location.
So some questions for you kind folks:
Why did we previously have to use "/dbfs/" when reading GeoJSON but not CSV files, before the changes to our environment?
What is the correct way to read in GeoJSON files into databricks without a mount point set?
Why does our process fail upon trying to read the second created temp GeoJSON file?
Thanks in advance for any assistance - very new to Databricks...!
Pandas uses the local file API for accessing files, and you accessed files on DBFS via /dbfs, which provides that local file API. In your specific case, the problem is that even though you use dbutils.fs.cp, you didn't specify that you want to copy the file locally, so by default it was copied onto DBFS with the path /dbfs/tmp/temp_nights (actually dbfs:/dbfs/tmp/temp_nights), and as a result the local file API doesn't see it. You would need to use /dbfs/dbfs/tmp/temp_nights instead, or copy the file into /tmp/temp_nights.
The better way, though, is to copy the file locally. You just need to specify that the destination is local, which is done with the file:// prefix, like this:
dbutils.fs.cp("wasbs://Z#Y.blob.core.windows.net/...Nights_new.geojson",
"file:///tmp/temp_nights")
and then read the file from /tmp/temp_nights:
nights = gpd.read_file(filename="/tmp/temp_nights")
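Putting the pieces from the question together, the whole flow might look like this (paths copied from the question; the cleanup step is optional):
import geopandas as gpd
# copy from blob storage to the driver's local /tmp; file:// marks the destination as local
dbutils.fs.cp("wasbs://Z#Y.blob.core.windows.net/X/GeoSpatial/Nights_new.geojson", "file:///tmp/temp_nights")
# geopandas goes through the local file API, so the plain /tmp path works here
nights = gpd.read_file(filename="/tmp/temp_nights")
a_nights = sc.broadcast(nights)
# remove the local copy afterwards; file:// again, so the local file is removed, not a DBFS path
dbutils.fs.rm("file:///tmp/temp_nights")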

Spark overwrite parquet files on aws s3 raise URISyntaxException: Relative path in absolute URI

I am using Spark to write and read parquet files on AWS S3. I have parquet files which are stored at
's3a://mybucket/file_name.parquet/company_name=company_name/record_day=2019-01-01 00:00:00'
partitioned by 'company_name' and 'record_day'
I want to write a basic pipeline to update my parquet files on a regular basis by 'record_day'. To do this, I am going to use overwrite mode:
df.write.mode('overwrite').parquet("s3a://mybucket/file_name.parquet/company_name='company_name'/record_day='2019-01-01 00:00:00'")
But I am getting the unexpected error 'java.net.URISyntaxException: Relative path in absolute URI: key=2019-01-01 00:00:00'.
I spent several hours searching for the problem but found no solution. For some tests, I replaced the 'overwrite' parameter with 'append', and everything works fine. I also made a simple dataframe and overwrite mode works fine on it as well. I know that I can solve my problem in a different way, by deleting and then writing the particular part, but I would like to understand what the cause of the error is.
Spark 2.4.4, Hadoop 2.8.5
Appreciate any help.
I had the same error, and my solution was to remove the ':' characters from the date.
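If the partition value has to keep its time component, one way around the error is to strip the colons before writing, roughly like this (a sketch assuming record_day is a string column; the column and bucket names are taken from the question):
from pyspark.sql import functions as F
# replace ':' in the partition value so the generated S3 key contains no colon
df_clean = df.withColumn('record_day', F.regexp_replace('record_day', ':', '-'))
df_clean.write.mode('overwrite') \
    .partitionBy('company_name', 'record_day') \
    .parquet('s3a://mybucket/file_name.parquet')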

Spark - Folder with same name as text file automatically created after RDD?

I placed a text file named Linecount2.txt in HDFS and built a simple RDD to count the number of lines using Spark.
val lines = sc.textFile("user/root/hdpcd/Linecount2.txt")
lines.count()
This works.
But when I tried using the same text file with the aforementioned path, I received the error:
"org.apache.hadoop.mapred.InvalidInputException: Input path does not exist:"
When I looked into that path, I could see that a folder 'Linecount2.txt' had been created. Hence the path for the file is now
("user/root/hdpcd/Linecount2.txt/Linecount2.txt")
Then, after defining the path I was able to run it successfully.
The third time I tried this, I got the same error because the input path doesn't exist.
Why does this happen?
There is a difference between putting an HDFS file at user/root/hdpcd/Linecount2.txt and at /user/root/hdpcd/Linecount2.txt (or, more simply, hdpcd/Linecount2.txt, since you already are the root user).
The leading slash is very important if you want to place a file in an absolute directory rather than under your current user's home directory; otherwise the home directory is the default.
You've not given your hdfs put command, but the issue here is simply the difference between absolute and relative paths, and it's not Spark specifically that's at fault.
Also, hdfs put will say that a file already exists if you try to place it in the same location, so the fact that you were able to upload twice should be an indication that your path was incorrect.
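In other words (illustrated here in PySpark; the question's snippet is Scala, but path resolution works the same way):
# relative path: resolved against the current user's HDFS home directory,
# so for root this points at /user/root/hdpcd/Linecount2.txt
lines_relative = sc.textFile("hdpcd/Linecount2.txt")
# absolute path: the leading slash pins the location explicitly
lines_absolute = sc.textFile("/user/root/hdpcd/Linecount2.txt")
lines_absolute.count()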

Recursively Read Files Spark wholeTextFiles

I have a directory in an azure data lake that has the following path:
'adl://home/../psgdata/clusters/iptiqadata-prod-cluster-eus2-01/psgdata/mib'
Within this directory there are a number of other directories (50) that have the format 20190404.
The directory 'adl://home/../psgdata/clusters/iptiqadata-prod-cluster-eus2-01/psgdata/mib/20180404' contains 100 or so xml files which I am working with.
I can create an RDD for each of the sub-folders, which works fine, but ideally I want to pass only the top path and have Spark recursively find the files. I have read other SO posts and tried using a wildcard:
pathWild = 'adl://home/../psgdata/clusters/iptiqadata-prod-cluster-eus2-01/psgdata/mib/*'
rdd = sc.wholeTextFiles(pathWild)
rdd.count()
But it just freezes and does nothing at all; it seems to completely kill the kernel. I am working in Jupyter on Spark 2.x. New to Spark. Thanks!
Try this:
pathWild = 'adl://home/../psgdata/clusters/iptiqadata-prod-cluster-eus2-01/psgdata/mib/*/*'
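That is, keeping the rest of the question's code unchanged and adding one more wildcard level so wholeTextFiles descends into each dated sub-folder:
pathWild = 'adl://home/../psgdata/clusters/iptiqadata-prod-cluster-eus2-01/psgdata/mib/*/*'
rdd = sc.wholeTextFiles(pathWild)
rdd.count()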

Naming the csv file in write_df

I am writing a file in SparkR using write.df, and I am unable to specify the file name:
Code:
write.df(user_log0, path = "Output/output.csv",
source = "com.databricks.spark.csv",
mode = "overwrite",
header = "true")
Problem:
I expect a file called 'output.csv' inside the 'Output' folder, but what I get is a folder called 'output.csv' containing a file called 'part-00000-6859b39b-544b-4a72-807b-1b8b55ac3f09.csv'
What am I doing wrong?
P.S: R 3.3.2, Spark 2.1.0 on OSX
Because of the distributed nature of Spark, you can only define the directory into which the files will be saved, and each executor writes its own file using Spark's internal naming convention.
If you see only a single file, it means that you are working with a single partition, i.e. only one executor is writing. This is not the normal Spark behaviour; however, if this fits your use case, you can collect the result to an R dataframe and write to csv from that.
In the more general case where the data is parallelized between multiple executors, you cannot set the specific name for the files.
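For illustration, here is the collect-and-write workaround sketched in PySpark (the question uses SparkR, where collect() plus write.csv() plays the same role; df stands in for the Spark DataFrame):
# only sensible when the result is small enough to fit in driver memory
pdf = df.toPandas()
pdf.to_csv('Output/output.csv', index=False)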
