Reading file from G Drive via Apache Beam - python-3.x

I'm trying to fetch a file from Google Drive using Apache Beam. I tried:
import apache_beam as beam

filenames = ['https://drive.google.com/file/d/<file_id>']
with beam.Pipeline() as pipeline:
    lines = (pipeline | beam.Create(filenames))
    print(lines)
This just prints a string like PCollection[[19]: Create/Map(decode).None].
I need to read a file from Google Drive and write it into a GCS bucket. How can I read a file from Google Drive with Apache Beam?

If you don't have complex transformations to apply, I think it's better not to use Beam in this case.
Solution 1:
You can instead use Google Colab (a Jupyter notebook environment on Google servers), mount your Google Drive, and use the gcloud CLI to copy the files to Cloud Storage.
You can check the following links:
google-drive-to-gcs
stackoverflow-copy-file-from-google-drive-to-gcs
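If it helps, here is a rough sketch of that Colab route in Python, using the google-cloud-storage client instead of the gcloud CLI; the bucket name and file paths are placeholders you would replace:

# Sketch only: mount Drive in Colab and copy one file to a GCS bucket.
# Assumes the Colab session is already authenticated to GCP.
from google.colab import drive
from google.cloud import storage

drive.mount('/content/drive')                    # Drive appears under /content/drive/MyDrive

client = storage.Client()
bucket = client.bucket('my-target-bucket')       # placeholder bucket name
blob = bucket.blob('my_file.csv')                # destination object name
blob.upload_from_filename('/content/drive/MyDrive/my_file.csv')   # placeholder Drive path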
Solution 2:
You can also use APIs to retrieve files from Google Drive and copy them to Cloud Storage.
You can, for example, develop a Python script using the Google Python client libraries and the following packages:
google-api-python-client
google-auth-httplib2
google-auth-oauthlib
google-cloud-storage
This article shows an example.
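As a rough illustration of that approach (a sketch, not a complete script; the service account file, file ID, bucket and object names are placeholders), you could do something like:

# Download a file from Google Drive via the Drive API, then upload it to GCS.
# Assumes the service account has access to the Drive file.
import io
from google.oauth2 import service_account
from googleapiclient.discovery import build
from googleapiclient.http import MediaIoBaseDownload
from google.cloud import storage

creds = service_account.Credentials.from_service_account_file(
    'service_account.json',
    scopes=['https://www.googleapis.com/auth/drive.readonly'])
drive_service = build('drive', 'v3', credentials=creds)

# Download the Drive file into memory.
request = drive_service.files().get_media(fileId='<file_id>')
buffer = io.BytesIO()
downloader = MediaIoBaseDownload(buffer, request)
done = False
while not done:
    _, done = downloader.next_chunk()
buffer.seek(0)

# Upload the downloaded bytes to a Cloud Storage bucket.
storage_client = storage.Client()
storage_client.bucket('my-target-bucket').blob('my_file').upload_from_file(buffer)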

If you want to use Beam for this, you could write a function
def read_from_gdrive_and_yield_records(path):
    ...
and then use it like
filenames = ['https://drive.google.com/file/d/<file_id>']
with beam.Pipeline() as pipeline:
    paths = pipeline | beam.Create(filenames)
    records = paths | beam.FlatMap(read_from_gdrive_and_yield_records)
    records | beam.io.WriteToText('gs://...')
Though as mentioned, unless you have a lot of files, this may be overkill.
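For completeness, a hypothetical sketch of such a function, assuming the pipeline elements are bare Drive file IDs and the files are shared publicly ("anyone with the link"); private files would need the Drive API instead:

import requests   # not part of Beam; used here only to fetch the file

def read_from_gdrive_and_yield_records(file_id):
    # Fetch a publicly shared Drive file via the uc?export=download endpoint
    # and yield one record per line.
    url = 'https://drive.google.com/uc?export=download&id=' + file_id
    response = requests.get(url)
    response.raise_for_status()
    for line in response.text.splitlines():
        yield line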

Related

How to open an index.html file in Databricks or a browser?

I am trying to open an index.html file through Databricks. Can someone please let me know how to deal with it? I am trying to use GX (Great Expectations) with Databricks, and currently Databricks stores this file here: dbfs:/great_expectations/uncommitted/data_docs/local_site/index.html. I want to send the index.html file to stakeholders.
I suspect that you need to copy the whole folder, as there should be images, etc. The simplest way to do that is to use the Databricks CLI fs cp command to access DBFS and copy the files to local storage, like this:
databricks fs cp -r 'dbfs:/.....' local_name
To open the file directly in the notebook you can use something like this (note that dbfs:/ should be replaced with /dbfs/):
with open("/dbfs/...", "r") as f:
data = "".join([l for l in f])
displayHTML(data)
but this will break links to images. Alternatively you can follow this approach to display Data docs inside the notebook.
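Another possibility (just a sketch, with paths taken from the question) is to copy the whole data_docs folder into dbfs:/FileStore from a notebook, since files under FileStore can be downloaded through the workspace's /files/ URL:

# dbutils is available inside Databricks notebooks.
dbutils.fs.cp(
    "dbfs:/great_expectations/uncommitted/data_docs/local_site/",
    "dbfs:/FileStore/data_docs/local_site/",   # destination chosen here as an example
    recurse=True,                              # copy the whole folder: HTML, images, CSS
)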

Downloading S3 files in Google Colab

I am working on a project and it happens that some data is provided in the form of an S3FileSystem. I can read that data using S3FileSystem.open(path), but there are more than 360 files and it takes at least 3 minutes to read a single file. I was wondering: is there any way of downloading these files to my system and reading them from there, instead of reading them directly from the S3FileSystem? There is another reason: although I can read all those files, once my session on Colab reconnects I have to re-read them all again, which takes a lot of time. I am using the following code to read files:
import s3fs
import xarray as xr

fs_s3 = s3fs.S3FileSystem(anon=True)
s3path = 'file_name'
remote_file_obj = fs_s3.open(s3path, mode='rb')
ds = xr.open_dataset(remote_file_obj, engine='h5netcdf')
Is there any way of downloading those files?
You can use another tool, also called s3fs (the FUSE-based one), to mount the bucket and then copy the files into Colab.
how to mount
After mounting, you can run:
!cp /s3/yourfile.zip /content/
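Alternatively, since you already have an s3fs filesystem object, you could skip mounting and download the files once with its get() method; the bucket/prefix and local paths below are placeholders:

import os
import s3fs
import xarray as xr

fs_s3 = s3fs.S3FileSystem(anon=True)

local_dir = '/content/data'                  # local Colab storage
os.makedirs(local_dir, exist_ok=True)

# 'bucket-name/prefix' stands for wherever the ~360 files live.
for s3path in fs_s3.ls('bucket-name/prefix'):
    local_path = os.path.join(local_dir, os.path.basename(s3path))
    fs_s3.get(s3path, local_path)            # download each file once

# From then on, open the local copies instead of streaming from S3.
ds = xr.open_dataset(os.path.join(local_dir, 'some_file.nc'), engine='h5netcdf')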

Use images in s3 with SageMaker without .lst files

I am trying to create (what I thought was) a simple image classification pipeline between s3 and SageMaker.
Images are stored in an s3 bucket with their class labels in their file names currently, e.g.
My-s3-bucket-dir
    cat-1.jpg
    dog-1.jpg
    cat-2.jpg
    ..
I've been trying to leverage several related example .py scripts, but most seem to download data sets that are already in .rec format or that come with special manifest or annotation files I don't have.
All I want is to pass the images from s3 to the SageMaker image classification algorithm that's located in the same region, IAM account, etc. I suppose this means I need a .lst file.
When I try to create the .lst manually, it doesn't seem to be accepted, and the manual work also takes too long to be good practice.
How can I automatically generate the .lst file (or otherwise send the images/classes for training)?
Things I read made it sound like im2rec.py was a solution, but I don't see how. The example I'm working with now is
Image-classification-fulltraining-highlevel.ipynb
but it seems to download the data as .rec,
download('http://data.mxnet.io/data/caltech-256/caltech-256-60-train.rec')
download('http://data.mxnet.io/data/caltech-256/caltech-256-60-val.rec')
which just skips working with the .jpeg files. I found another example that converts them to .rec, but again it essentially already has the .lst (as .json) and just converts it.
I have mostly been working in a Python Jupyter notebook within the AWS console (in my browser) but I have also tried using their GUI.
How can I simply and automatically generate the .lst or otherwise get the data/class info into SageMaker without manually creating a .lst file?
Update
It looks like im2rec.py can't be run against s3. You'd have to completely download everything from all s3 buckets into the notebook's storage...
Please note that [...] im2rec.py is running locally, therefore cannot take input from the S3 bucket. To generate the list file, you need to download the data and then use the im2rec tool. - AWS SageMaker Team
There are 3 options to provide annotated data to the Image Classification algo: (1) packing labels in recordIO files, (2) storing labels in a JSON manifest file ("augmented manifest" option), (3) storing labels in a list file. All options are documented here: https://docs.aws.amazon.com/sagemaker/latest/dg/image-classification.html.
The Augmented Manifest and .lst file options are quick to set up, since they just require you to create an annotation file, usually with a short for loop. RecordIO requires the im2rec.py tool, which is a little more work.
Using .lst files is reasonably easy: you just need to create the annotations with a quick for loop, like this:
# assuming train_index, train_class, train_pics store the pic index, class and path
with open('train.lst', 'a') as file:
    for index, cl, pic in zip(train_index, train_class, train_pics):
        file.write(str(index) + '\t' + str(cl) + '\t' + pic + '\n')
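Since your class labels are encoded in the file names (cat-1.jpg, dog-1.jpg, ...), a variation of that loop could build the .lst directly from an S3 listing. This is only a sketch: the bucket name is a placeholder and the label parsing assumes your "class-number.jpg" naming scheme:

import boto3

s3 = boto3.client('s3')
class_to_id = {}                 # e.g. {'cat': 0, 'dog': 1}, assigned as classes are seen

with open('train.lst', 'w') as lst:
    index = 0
    paginator = s3.get_paginator('list_objects_v2')
    for page in paginator.paginate(Bucket='my-s3-bucket'):
        for obj in page.get('Contents', []):
            key = obj['Key']                            # e.g. 'cat-1.jpg'
            if not key.lower().endswith('.jpg'):
                continue
            label = key.split('/')[-1].split('-')[0]    # 'cat' from 'cat-1.jpg'
            class_id = class_to_id.setdefault(label, len(class_to_id))
            lst.write(str(index) + '\t' + str(class_id) + '\t' + key + '\n')
            index += 1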

How to download specific number of files/objects from S3 bucket using aws cli command?

Suppose I want to download only 10 files from the bucket; how do we pass 10 as an argument?
The easiest way to do this is to write a Python script that you can run every 30 minutes. I have written Python code that will do the work:
import boto3
import random

s3 = boto3.client('s3')
source = boto3.resource('s3')

keys = []
resp = s3.list_objects_v2(Bucket='bucket_name')
for obj in resp['Contents']:
    keys.append(obj['Key'])

for x in range(10):
    pick = random.randint(0, len(keys) - 1)   # random index; stays within the key list
    source.meta.client.download_file('bucket_name', keys[pick], keys[pick])
In the range(10) loop you can change 10 to the number of random files you want to download. Further, if you want the script to execute the task automatically every 30 minutes, you can wrap the code above in a separate function and use Python's "sched" module to call that function repeatedly; you can find example code for that in the link here:
What is the best way to repeatedly execute a function every x seconds in Python?
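If you go that route, a minimal sketch using sched could look like the following, assuming the download code above is wrapped in a function (download_random_files is just a name chosen here):

import sched
import time

def download_random_files():
    pass   # wrap the boto3 download code from above in here

scheduler = sched.scheduler(time.time, time.sleep)

def run_every(interval_seconds):
    download_random_files()
    # re-schedule so the job keeps repeating
    scheduler.enter(interval_seconds, 1, run_every, (interval_seconds,))

scheduler.enter(0, 1, run_every, (30 * 60,))   # every 30 minutes
scheduler.run()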
Your use case appears to be:
Every 30 minutes
Download 10 random files from Amazon S3
Presumably, these 10 files should not be files previously downloaded.
There is no in-built S3 functionality to download a random selection of files. Instead, you will need to:
Obtain a listing of files from your desired S3 bucket and optional path
Randomly select which files you want to download
Download the selected files
This is easily done via a programming language (e.g. Python), where you could obtain an array of filenames, randomize it, then loop through the list and download each file.
You can also do it in a shell script by calling the AWS Command-Line Interface (CLI) to obtain the listing (aws s3 ls) and to copy the files (aws s3 cp).
Alternatively, you could choose to synchronize ALL the files to your local machine (aws s3 sync) and then select random local files to process.
Try the above steps. If you experience difficulties, post your code and the error/problem you are experiencing and we can assist.
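As a small illustration of those steps in Python (the bucket name is a placeholder), random.sample picks distinct keys so no file is selected twice within a run:

import random
import boto3

s3 = boto3.client('s3')

# 1. Obtain a listing of objects in the bucket.
resp = s3.list_objects_v2(Bucket='bucket_name')
keys = [obj['Key'] for obj in resp.get('Contents', [])]

# 2. Randomly select 10 distinct keys.
selection = random.sample(keys, k=min(10, len(keys)))

# 3. Download the selected files.
for key in selection:
    s3.download_file('bucket_name', key, key.split('/')[-1])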

How to export data from a dataframe to a file in Databricks

I'm currently doing the Introduction to Spark course at edX.
Is there a way to save dataframes from Databricks to my computer?
I'm asking this question because the course provides Databricks notebooks which probably won't work after the course.
In the notebook, data is imported using the command:
log_file_path = 'dbfs:/' + os.path.join('databricks-datasets',
                                        'cs100', 'lab2', 'data-001', 'apache.access.log.PROJECT')
I found this solution but it doesn't work:
df.select('year','model').write.format('com.databricks.spark.csv').save('newcars.csv')
Databricks runs a cloud VM and does not have any idea where your local machine is located. If you want to save the CSV results of a DataFrame, you can run display(df) and there's an option to download the results.
You can also save it to the file store and download it via its handle, e.g.
df.coalesce(1).write.format("com.databricks.spark.csv").option("header", "true").save("dbfs:/FileStore/df/df.csv")
You can find the handle in the Databricks GUI by going to Data > Add Data > DBFS > FileStore > your_subdirectory > part-00000-...
Download in this case (for Databricks west europe instance)
https://westeurope.azuredatabricks.net/files/df/df.csv/part-00000-tid-437462250085757671-965891ca-ac1f-4789-85b0-akq7bc6a8780-3597-1-c000.csv
I haven't tested it, but I would assume that the row limit of 1 million rows that you would hit when downloading it via the answer mentioned by @MrChristine does not apply here.
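If you prefer not to hunt for the part file in the GUI, a small sketch using dbutils (available in Databricks notebooks) can list it for you; the path matches the save location above:

# Find the CSV part file written above and print its download handle.
for f in dbutils.fs.ls("dbfs:/FileStore/df/df.csv"):
    if f.name.startswith("part-") and f.name.endswith(".csv"):
        # dbfs:/FileStore/... is served at https://<workspace-url>/files/...
        print(f.path.replace("dbfs:/FileStore", "/files"))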
Try this.
df.write.format("com.databricks.spark.csv").save("file:///home/yphani/datacsv")
This will save the file onto the Unix server.
If you give only /home/yphani/datacsv, it looks for the path on HDFS.
