Get most recent file in S3 via PySpark - apache-spark

Is there any way to get the most recently modified file in an S3 bucket via PySpark?
I managed to do it with Python using this code:
paginator = client.get_paginator('list_objects_v2')
pages = paginator.paginate(Bucket=Bucket, Prefix=Path)
latest = None
for page in pages:
    # Most recently modified object on this page
    page_latest = max(page['Contents'], key=lambda x: x['LastModified'])
    if latest is None or page_latest['LastModified'] > latest['LastModified']:
        latest = page_latest
On Spark, I can't find any documentation for this.
Thank you

You'd just use the Hadoop FileSystem APIs: call listStatusIterator()/listFiles() to get an iterator and scan it; FileStatus.getModificationTime() gives you the last-modified field.
Be aware, though: the S3 timestamp of a large (multipart) upload is the time the upload was started, not completed. A large file that took many minutes to upload would appear older than a small file uploaded in a single PUT while that large upload was in progress.
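For reference, a minimal PySpark sketch of that approach via the JVM gateway; it assumes an existing SparkSession named spark with s3a access configured, and the bucket/prefix are placeholders:
hadoop_conf = spark._jsc.hadoopConfiguration()
path = spark._jvm.org.apache.hadoop.fs.Path("s3a://my-bucket/my/prefix/")
fs = path.getFileSystem(hadoop_conf)

latest = None
it = fs.listFiles(path, True)  # recursive listing
while it.hasNext():
    status = it.next()
    if latest is None or status.getModificationTime() > latest.getModificationTime():
        latest = status

if latest is not None:
    print(latest.getPath().toString(), latest.getModificationTime())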

Related

Reading GeoJSON in databricks, no mount point set

We have recently made changes to how we connect to ADLS from Databricks which have removed mount points that were previously established within the environment. We are using databricks to find points in polygons, as laid out in the databricks blog here: https://databricks.com/blog/2019/12/05/processing-geospatial-data-at-scale-with-databricks.html
Previously, a chunk of code read in a GeoJSON file from ADLS into the notebook and then projected it to the cluster(s):
nights = gpd.read_file("/dbfs/mnt/X/X/GeoSpatial/Hex_Nights_400Buffer.geojson")
a_nights = sc.broadcast(nights)
However, the new changes that have been made have removed the mount point and we are now reading files in using the string:
"wasbs://Z#Y.blob.core.windows.net/X/Personnel/*.csv"
This works fine for CSV and Parquet files, but will not load a GeoJSON! When we try this, we get an error saying "File not found". We have checked and the file is still within ADLS.
We then tried to copy the file temporarily to "dbfs" which was the only way we had managed to read files previously, as follows:
dbutils.fs.cp("wasbs://Z#Y.blob.core.windows.net/X/GeoSpatial/Nights_new.geojson", "/dbfs/tmp/temp_nights")
nights = gpd.read_file(filename="/dbfs/tmp/temp_nights")
dbutils.fs.rm("/dbfs/tmp/temp_nights")
a_nights = sc.broadcast(nights)
This works fine on the first use within the code, but then a second GeoJSON run immediately after (which we tried to write to temp_days) fails at the gpd.read_file stage, saying file not found! We have checked with dbutils.fs.ls() and can see the file in the temp location.
So some questions for you kind folks:
Why were we previously having to use "/dbfs/" when reading in GeoJSON but not csv files, pre-changes to our environment?
What is the correct way to read in GeoJSON files into databricks without a mount point set?
Why does our process fail upon trying to read the second created temp GeoJSON file?
Thanks in advance for any assistance - very new to Databricks...!
Pandas (and GeoPandas) uses the local file API to access files, and you previously accessed files on DBFS via /dbfs, which exposes that local file API. In your specific case, the problem is that even though you used dbutils.fs.cp, you didn't specify that you want to copy the file locally, so by default it was copied onto DBFS at the path /dbfs/tmp/temp_nights (actually dbfs:/dbfs/tmp/temp_nights), and as a result the local file API doesn't see it. You would need to use /dbfs/dbfs/tmp/temp_nights instead, or copy the file into /tmp/temp_nights.
But the better way would be to copy the file locally; you just need to specify that the destination is local, which is done with the file:// prefix, like this:
dbutils.fs.cp("wasbs://Z#Y.blob.core.windows.net/...Nights_new.geojson",
"file:///tmp/temp_nights")
and then read the file from /tmp/temp_nights:
nights = gpd.read_file(filename="/tmp/temp_nights")
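Putting it together, a sketch of the corrected flow from the question (the wasbs path is the question's own placeholder):
# file:// makes the destination local to the driver
dbutils.fs.cp("wasbs://Z#Y.blob.core.windows.net/X/GeoSpatial/Nights_new.geojson",
              "file:///tmp/temp_nights")
nights = gpd.read_file("/tmp/temp_nights")   # read via the local file API
a_nights = sc.broadcast(nights)
dbutils.fs.rm("file:///tmp/temp_nights")     # clean up the local copy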

Downloading S3 files in Google Colab

I am working on a project where some of the data is provided via S3FileSystem. I can read that data using S3FileSystem.open(path), but there are more than 360 files and it takes at least 3 minutes to read a single file. Is there any way of downloading these files to my system and reading them from there, instead of reading them directly from S3? There is another reason: although I can read all those files, once my Colab session reconnects I have to re-read them all, which takes a lot of time. I am using the following code to read files:
import s3fs
import xarray as xr

fs_s3 = s3fs.S3FileSystem(anon=True)
s3path = 'file_name'
remote_file_obj = fs_s3.open(s3path, mode='rb')
ds = xr.open_dataset(remote_file_obj, engine='h5netcdf')
Is there any way of downloading those files?
You can use another s3fs, the FUSE-based s3fs-fuse, to mount the bucket, then copy the files into Colab (see the s3fs-fuse documentation for how to mount).
After mounting, you can run:
!cp /s3/yourfile.zip /content/
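Alternatively, staying with the Python s3fs API you are already using, you can download each file to local disk once and open the local copy afterwards; a minimal sketch, where the S3 path and local path are placeholders:
import s3fs
import xarray as xr

fs_s3 = s3fs.S3FileSystem(anon=True)
s3path = 'bucket-name/path/to/file.nc'   # placeholder
local_path = '/content/file.nc'          # placeholder

# Download once, then open the local copy on every subsequent read.
fs_s3.get(s3path, local_path)
ds = xr.open_dataset(local_path, engine='h5netcdf')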

How to append files in GCS with the same schema?

Is there any way one can append two files in GCS? Suppose file one is a full load and the second file is an incremental load; what's the way to append the two?
Secondly, using gsutil compose will append the two files including the attribute names (header row) as well. In the final file I want just the data of the two files.
You can append two separate files using compose in the Google Cloud Shell, writing the result back over the first object, like this:
gsutil compose gs://bucket/obj1 [gs://bucket/obj2 ...] gs://bucket/obj1
This command is meant for parallel uploads, in which you divide a large file into smaller objects, upload them to Google Cloud Storage, and then compose them to recreate the original file. You can find more information on Composite Objects and Parallel Uploads.
I've come up with two possible solutions:
Google Cloud Function solution
The option I would go for is using a Cloud Function. Doing something like the following:
Create an empty bucket like append_bucket.
Upload the first file.
Create a Cloud Function to be triggered by newly uploaded files on the bucket.
Upload the second file.
Read the first and the second file (you will have to download them as strings first).
Perform the append operation.
Upload the result to the bucket.
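A minimal sketch of that Cloud Function idea, assuming the google-cloud-storage client library; the function name and object names (full_load.csv, combined.csv) are hypothetical placeholders:
from google.cloud import storage

def append_files(event, context):
    # Background Cloud Function triggered by an object uploaded to the bucket.
    name = event['name']
    if name in ('full_load.csv', 'combined.csv'):
        return  # ignore the base file and our own output to avoid re-triggering

    client = storage.Client()
    bucket = client.bucket(event['bucket'])

    base_text = bucket.blob('full_load.csv').download_as_text()
    new_text = bucket.blob(name).download_as_text()

    # Drop the header line of the incremental file so attribute names are not repeated.
    new_lines = new_text.splitlines()[1:]

    combined = base_text.rstrip('\n') + '\n' + '\n'.join(new_lines) + '\n'
    bucket.blob('combined.csv').upload_from_string(combined)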
Google Dataflow solution
You can also do it with Dataflow for BigQuery (keep in mind it’s still in beta).
Create a BigQuery dataset and table.
Create a Dataflow instance, from the template Cloud Storage Text to BigQuery.
Create a JavaScript file with the logic to transform the text.
Upload your files in JSON format to the bucket.
Dataflow will read the JSON files, execute the JavaScript code, and append the new data to the BigQuery table.
At last, export the BigQuery query result to Cloud Storage.

How to download specific number of files/objects from S3 bucket using aws cli command?

Suppose I want to download only 10 files from the bucket; how do we pass 10 as an argument?
The easiest way to do so is to make a Python script that you can run every 30 minutes. I have written Python code that will do the work:
import boto3
import random

s3 = boto3.client('s3')
source = boto3.resource('s3')

keys = []
resp = s3.list_objects_v2(Bucket='bucket_name')
for obj in resp['Contents']:
    keys.append(obj['Key'])

length = len(keys)
for x in range(10):
    # Pick a random index into the list of keys
    hello = random.randint(0, length - 1)
    source.meta.client.download_file('bucket_name', keys[hello], keys[hello])
In the for x in range(10) loop you can change 10 to the number of random files you want to download. Further, if you want the script to execute the task automatically every 30 minutes, you can wrap the above code in a function and use Python's "sched" module to call it repeatedly; you can find example code in the link here:
What is the best way to repeatedly execute a function every x seconds in Python?
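For completeness, a minimal sketch of the "sched" idea, wrapping the download logic in a function (download_batch is just a placeholder name):
import sched
import time

scheduler = sched.scheduler(time.time, time.sleep)

def download_batch():
    # ... run the download logic from the script above ...
    # Re-schedule this function to run again in 30 minutes (1800 seconds).
    scheduler.enter(1800, 1, download_batch)

scheduler.enter(0, 1, download_batch)
scheduler.run()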
Your use case appears to be:
Every 30 minutes
Download 10 random files from Amazon S3
Presumably, these 10 files should not be files previously downloaded.
There is no in-built S3 functionality to download a random selection of files. Instead, you will need to:
Obtain a listing of files from your desired S3 bucket and optional path
Randomly select which files you want to download
Download the selected files
This would be easily done via a programming language (e.g. Python), where you could obtain an array of filenames, randomize it, then loop through the list and download each file.
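A minimal sketch of that approach with boto3; the bucket name and the count of 10 are placeholders, and random.sample avoids picking the same key twice within a batch:
import random
import boto3

s3 = boto3.client('s3')
bucket = 'my-bucket'  # placeholder

# Obtain the full listing (paginate in case there are more than 1000 objects).
keys = []
for page in s3.get_paginator('list_objects_v2').paginate(Bucket=bucket):
    keys.extend(obj['Key'] for obj in page.get('Contents', []))

# Randomly select 10 distinct keys and download each one.
for key in random.sample(keys, k=min(10, len(keys))):
    s3.download_file(bucket, key, key.replace('/', '_'))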
You can also do it in a shell script by calling the AWS Command-Line Interface (CLI) to obtain the listing (aws s3 ls) and to copy the files (aws s3 cp).
Alternatively, you could choose to synchronize ALL the files to your local machine (aws s3 sync) and then select random local files to process.
Try the above steps. If you experience difficulties, post your code and the error/problem you are experiencing and we can assist.

NetSuite SuiteScript: How to work around the 10MB limit?

Hi, and thanks for any help. Is there a way to work with files larger than 10MB? I have to check for updates on items in a file that would be uploaded, but the file contains all items in the system and is approximately 20MB. This 10MB limit is killing me. I see streaming for file save and appending, but not for file reading, so I am open to any suggestions. The provider in this instance doesn't offer the facility to chunk the files. Thanks in advance for your help.
If you are using SuiteScript 2.x to process a file from the File Cabinet and you use file.lines.iterator() to process it, the size limit is 10MB per line rather than for the whole file.
I believe returning a file object from a Map/Reduce script's getInputData stage automatically parses the file into lines.
The 10MB file size limit comes into play if you try to create a file larger than 10MB.
If you are trying to read in an external file via script, then one approach I've used is to proxy the call via an external service, e.g. query an AWS Lambda function that checks for the file and saves it to S3, returning the file path and size to your SuiteScript. The SuiteScript then asks for "pages" of the file that are each less than 10MB and saves those. If you are uploading something like a .csv, the Lambda function can send the header with each paged request.
