Downloading S3 files in Google Colab - python-3.x

I am working on a project where some data is provided via S3FileSystem. I can read that data using S3FileSystem.open(path), but there are more than 360 files and it takes at least 3 minutes to read a single file. Is there any way to download these files to my system and read them from there, instead of reading them directly from the S3 filesystem? There is another reason: although I can read all those files, once my Colab session reconnects I have to re-read them all again, which takes a lot of time. I am using the following code to read the files:
import s3fs
import xarray as xr

fs_s3 = s3fs.S3FileSystem(anon=True)
s3path = 'file_name'
remote_file_obj = fs_s3.open(s3path, mode='rb')
ds = xr.open_dataset(remote_file_obj, engine='h5netcdf')
Is there any way of downloading those files?

You can use another s3fs (the FUSE-based s3fs-fuse tool, not the Python package) to mount the bucket and then copy the files to Colab; see its documentation for how to mount. After mounting, you can copy a file with:
!cp /s3/yourfile.zip /content/
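Alternatively, staying entirely in Python, here is a minimal sketch that downloads the objects once to the Colab VM's local disk with the same s3fs client from the question and then opens them from there (the bucket/prefix, local folder, and file name are placeholders):

import os
import s3fs
import xarray as xr

fs_s3 = s3fs.S3FileSystem(anon=True)

local_dir = '/content/data'                       # placeholder local folder
os.makedirs(local_dir, exist_ok=True)

# list the remote objects under a placeholder bucket/prefix and download each once
for s3path in fs_s3.ls('my-bucket/my-prefix'):
    fs_s3.get(s3path, os.path.join(local_dir, os.path.basename(s3path)))

# later reads come from the fast local disk instead of S3
ds = xr.open_dataset(os.path.join(local_dir, 'file_name.nc'), engine='h5netcdf')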

Related

How to open index.html file in databricks or browser?

I am trying to open an index.html file through Databricks. Can someone please let me know how to deal with it? I am trying to use GX (Great Expectations) with Databricks, and currently Databricks stores this file here: dbfs:/great_expectations/uncommitted/data_docs/local_site/index.html. I want to send the index.html file to stakeholders.
I suspect that you need to copy the whole folder, as there should be images, etc. The simplest way to do that is to use the Databricks CLI fs cp command to access DBFS and copy files to local storage, like this:
databricks fs cp -r 'dbfs:/.....' local_name
To open the file directly in the notebook you can use something like this (note that dbfs:/ should be replaced with /dbfs/):
with open("/dbfs/...", "r") as f:
    data = "".join([l for l in f])
displayHTML(data)
but this will break links to images. Alternatively, you can follow this approach to display the Data Docs inside the notebook.
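If the whole Data Docs site needs to be viewed from the notebook itself, a rough sketch combining the two steps above - copy the local_site folder out of DBFS to the driver's local disk, then render index.html - could look like this (the source path is from the question; the file:///tmp destination is an assumption):

# Sketch: copy the whole data_docs site to the driver's local disk, then render index.html.
# Assumes a Databricks notebook where dbutils and displayHTML are available.
src = "dbfs:/great_expectations/uncommitted/data_docs/local_site"
dst = "file:///tmp/data_docs_local_site"           # local driver path (assumption)

dbutils.fs.cp(src, dst, recurse=True)

with open("/tmp/data_docs_local_site/index.html", "r") as f:
    displayHTML(f.read())                          # relative links to images/CSS may still break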

Reading GeoJSON in databricks, no mount point set

We have recently made changes to how we connect to ADLS from Databricks, which removed mount points that were previously established within the environment. We are using Databricks to find points in polygons, as laid out in the Databricks blog here: https://databricks.com/blog/2019/12/05/processing-geospatial-data-at-scale-with-databricks.html
Previously, a chunk of code read a GeoJSON file from ADLS into the notebook and then broadcast it to the cluster(s):
nights = gpd.read_file("/dbfs/mnt/X/X/GeoSpatial/Hex_Nights_400Buffer.geojson")
a_nights = sc.broadcast(nights)
However, the new changes that have been made have removed the mount point and we are now reading files in using the string:
"wasbs://Z#Y.blob.core.windows.net/X/Personnel/*.csv"
This works fine for CSV and Parquet files, but will not load a GeoJSON! When we try this, we get an error saying "File not found". We have checked and the file is still within ADLS.
We then tried to copy the file temporarily to "dbfs" which was the only way we had managed to read files previously, as follows:
dbutils.fs.cp("wasbs://Z#Y.blob.core.windows.net/X/GeoSpatial/Nights_new.geojson", "/dbfs/tmp/temp_nights")
nights = gpd.read_file(filename="/dbfs/tmp/temp_nights")
dbutils.fs.rm("/dbfs/tmp/temp_nights")
a_nights = sc.broadcast(nights)
This works fine on the first use within the code, but then a second GeoJSON run immediately after (which we tried to write to temp_days) fails at the gpd.read_file stage, saying file not found! We have checked with dbutils.fs.ls() and can see the file in the temp location.
So some questions for you kind folks:
Why were we previously having to use "/dbfs/" when reading in GeoJSON but not csv files, pre-changes to our environment?
What is the correct way to read in GeoJSON files into databricks without a mount point set?
Why does our process fail upon trying to read the second created temp GeoJSON file?
Thanks in advance for any assistance - very new to Databricks...!
Pandas uses the local file API for accessing files, and you accessed files on DBFS via /dbfs, which provides that local file API. In your specific case, the problem is that even though you used dbutils.fs.cp, you didn't specify that you wanted to copy the file locally, so by default it was copied onto DBFS with the path /dbfs/tmp/temp_nights (actually dbfs:/dbfs/tmp/temp_nights). As a result, the local file API doesn't see it - you would need to use /dbfs/dbfs/tmp/temp_nights instead, or copy the file into /tmp/temp_nights.
But the better way is to copy the file locally - you just need to specify that the destination is local, which is done with the file:// prefix, like this:
dbutils.fs.cp("wasbs://Z#Y.blob.core.windows.net/...Nights_new.geojson",
              "file:///tmp/temp_nights")
and then read the file from /tmp/temp_nights:
nights = gpd.read_file(filename="/tmp/temp_nights")
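Putting the whole fix together with the original workflow, a quick sketch (paths as in the question; the os.path.exists check is just a sanity check that the copy really landed on the driver's local disk):

import os
import geopandas as gpd

dbutils.fs.cp(
    "wasbs://Z#Y.blob.core.windows.net/X/GeoSpatial/Nights_new.geojson",
    "file:///tmp/temp_nights",
)
assert os.path.exists("/tmp/temp_nights")   # visible to the local file API, unlike dbfs:/ paths

nights = gpd.read_file("/tmp/temp_nights")
a_nights = sc.broadcast(nights)             # broadcast to the cluster, as in the original code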

Get most recent file in S3 via PySpark

Is there any way to get the most recent file in an S3 bucket via PySpark?
I managed to do it with plain Python using this code:
paginator = client.get_paginator('list_objects_v2')   # client = boto3.client('s3')
pages = paginator.paginate(Bucket=Bucket, Prefix=Path)

latest = None   # track the most recent object across all pages
for page in pages:
    for obj in page['Contents']:
        if latest is None or obj['LastModified'] > latest['LastModified']:
            latest = obj
But for Spark I can't find any documentation.
Thank you
You'd just use the Hadoop FileSystem APIs: call listStatusIterator()/listFiles() to get an iterator and scan; FileStatus.getModificationTime() gives you the last-modified field.
Be aware though: the S3 timestamp of a large file upload is the time the upload was started, not completed. A large file that took many minutes to upload can therefore appear older than a small file uploaded in a single PUT while the large upload was in progress.
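A minimal PySpark sketch of that Hadoop FileSystem approach (assuming an active SparkSession named spark configured with S3 credentials; the s3a:// bucket/prefix is a placeholder):

# Walk an S3 prefix through the Hadoop FileSystem API exposed via the JVM gateway
# and keep the object with the largest modification time.
hadoop_conf = spark.sparkContext._jsc.hadoopConfiguration()
Path = spark.sparkContext._jvm.org.apache.hadoop.fs.Path

root = Path("s3a://my-bucket/my-prefix/")          # placeholder bucket/prefix
fs = root.getFileSystem(hadoop_conf)

latest_path, latest_ts = None, -1
it = fs.listFiles(root, True)                      # recursive iterator of LocatedFileStatus
while it.hasNext():
    status = it.next()
    if status.getModificationTime() > latest_ts:
        latest_ts = status.getModificationTime()
        latest_path = status.getPath().toString()

print(latest_path)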

Linux split of a tar.gz works when the parts are joined locally, but fails after transfer to a remote machine via an S3 bucket

I have a few files which I packed into a tar.gz.
As this file can get too big, I used the Linux split command.
As the pieces needed to be transferred to a different machine, I used an S3 bucket to transfer them. I used the application/octet-stream content type to upload the files.
The downloaded files show exactly the same size as the originals, so no bytes are lost.
Now when I do cat downloaded_files_* > tarball.tar.gz the size is exactly the same as the original file, but only the part with _aa gets extracted.
I checked the type of the files:
file downloaded_files_aa
This is a tar/gzip file (gzip compressed data, from Unix, last modified: Sun May 17 15:00:41 2020), but all the other files are reported as plain data.
I am wondering how I can get the files back.
Note: HTTP upload via API Gateway was used to upload the files to S3.
================================
Just putting my debugging findings here with the hope that they will help someone facing the same problem.
As we wanted to use API Gateway, our upload calls were plain HTTP calls, i.e. not using the regular AWS SDK.
https://docs.aws.amazon.com/AmazonS3/latest/API/sigv4-post-example.html
Code samples: https://docs.aws.amazon.com/AmazonS3/latest/API/samples/AWSS3SigV4JavaSamples.zip
After some debugging, we found this leg was working fine.
As the machine to which we wanted to download the files had direct access to S3, we used the AWS SDK for downloading the files.
This is the URL:
https://docs.aws.amazon.com/AmazonS3/latest/dev/RetrievingObjectUsingJava.html
That code did not work well for us: although it showed exactly the same downloaded file size as was uploaded, the files lost some information. The code also complained about still-pending bytes. Some changes were made to get rid of the error, but it never worked.
The code which I found here is working like magic:
// `object` is the S3Object returned by s3Client.getObject(...)
InputStream reader = new BufferedInputStream(object.getObjectContent());
File file = new File("localFilename");
OutputStream writer = new BufferedOutputStream(new FileOutputStream(file));

// copy byte by byte; the buffered streams keep this reasonably fast
int read = -1;
while ((read = reader.read()) != -1) {
    writer.write(read);
}

writer.flush();
writer.close();
reader.close();
This code also made the download much faster than our previous approach.
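For anyone doing the same download step from Python rather than Java, a minimal boto3 sketch (the bucket name, key prefix, and the part suffixes beyond _aa are placeholders; assumes AWS credentials are configured on the machine that has direct S3 access):

import boto3

s3 = boto3.client("s3")

# placeholder bucket/keys for the split parts uploaded via the API Gateway flow
for suffix in ("aa", "ab", "ac"):
    key = f"uploads/downloaded_files_{suffix}"
    s3.download_file("my-bucket", key, f"downloaded_files_{suffix}")

# then reassemble and extract exactly as before:
#   cat downloaded_files_* > tarball.tar.gz && tar -xzf tarball.tar.gz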

Google Colab is so slow while reading images from Google Drive

I have my own dataset for a deep learning project. I uploaded it to Google Drive and linked it to a Colab notebook. But Colab could read only 2-3 images per second, whereas my computer can read dozens of them per second. (I used imread to read the images.)
There is no speed problem with the Keras model compilation process, only with reading images from Google Drive. Does anybody know a solution? Someone suffered from this problem too, but it's still unsolved: Google Colab very slow reading data (images) from Google Drive. (I know this is kind of a duplicate of the question in the link, but I reposted it because it is still unsolved. I hope this is not a violation of Stack Overflow rules.)
Edit: The code piece that I use for reading images:
import os
import numpy as np
from PIL import Image
from skimage import color

def getDataset(path, classes, pixel=32, rate=0.8):
    X = []
    Y = []
    i = 0
    # getting images:
    for root, _, files in os.walk(path):
        for file in files:
            imagePath = os.path.join(root, file)
            className = os.path.basename(root)
            try:
                image = Image.open(imagePath)
                image = np.asarray(image)
                image = np.array(Image.fromarray(image.astype('uint8')).resize((pixel, pixel)))
                image = image if len(image.shape) == 3 else color.gray2rgb(image)
                X.append(image)
                Y.append(classes[className])
            except:
                print(file, "could not be opened")
    X = np.asarray(X, dtype=np.float32)
    Y = np.asarray(Y, dtype=np.int16).reshape(1, -1)
    return shuffleDataset(X, Y, rate)  # shuffleDataset is defined elsewhere in the notebook
I'd like to provide a more detailed answer about what unzipping the files actually looks like. This is the best way to speed up reading the data, because unzipping the file onto the VM disk is so much faster than reading each file individually from Drive.
Let's say you have the desired images or data on your local machine in a folder Data. Compress Data to get Data.zip and upload it to Drive.
Now, mount your Drive and run the following command:
!unzip "/content/drive/My Drive/path/to/Data.zip" -d "/content"
Simply amend all your image paths to go through /content/Data, and reading your images will be much, much faster.
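For reference, a minimal Colab sketch of that flow in a single cell (the Drive path and Data.zip name are placeholders):

from google.colab import drive
import zipfile

drive.mount('/content/drive')                      # standard Colab Drive mount

# extract once onto the VM's local disk; per-file reads from /content are much
# faster than fetching each image individually from the Drive mount
with zipfile.ZipFile('/content/drive/My Drive/path/to/Data.zip') as zf:
    zf.extractall('/content')

# afterwards, point the loader at the local copy, e.g.:
# X, Y = getDataset('/content/Data', classes)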
I recommend uploading your files to GitHub and then cloning the repository into Colab. It reduced my training time from 1 hour to 3 minutes.
Upload zip files to Drive. After transferring to Colab, unzip them there. The per-file copy overhead is what hurts, so you shouldn't copy masses of individual files; copy a single zip and unzip it instead.
