I have code that moves items from one S3 bucket to another. I am running it locally on my computer. However, it will take a long time to finish because there are many items in the bucket.
import boto3

# Get the S3 resource
s3 = boto3.resource('s3')

# Get references to the buckets
src = s3.Bucket('src')
dst = s3.Bucket('dst')

# Iterate through the items in the source bucket
for item in src.objects.all():
    # Identify the source object to copy
    copy_source = {
        'Bucket': 'src',
        'Key': item.key
    }
    # Place a copy of the item in the destination bucket
    dst.copy(copy_source, 'Images/' + item.key)
Is there any way I can run this code remotely so that I would not have to monitor it? I have tried AWS Lambda, but it has a maximum run time of 15 minutes. Is there something like that I could use, but for a longer time?
You could use a Data Pipeline.
A data pipeline spawns an EC2 instance where you can run your job.
You can schedule the pipeline to run as often as every 15 minutes (but no more frequently than that).
There is also the option to create a pipeline that you can run on demand.
It also offers a console where you can view the jobs and their outcome and have the opportunity to rerun failed jobs.
For this kind of activity you should probably use a ShellCommandActivity:
https://docs.aws.amazon.com/datapipeline/latest/DeveloperGuide/dp-object-shellcommandactivity.html
Another option is to just start an EC2 instance, run your job, and then stop it.
I'm ingesting data with API calls and would like to parameterize it with widgets. In Azure I have the following set up:
I have a list of attribute_codes, read them with a Lookup activity, and pass each one as a parameter into the Databricks notebook code. Code inside Databricks:
data, response = get_data_url(url=f"https://p.cloud.com/api/rest/v1/attributes/{attribute_code}/options",access_token=access_token)
#Removing the folder in Data Lake
dbutils.fs.rm(f'/mnt/bronze/attribute_code/{day}',True)
#Creating the folder in the Data Lake
dbutils.fs.mkdirs(f'/mnt/bronze/attribute_code/{day}')
count = 0
#Putting the response inside of the Data Lake folder
dbutils.fs.put(f'/mnt/bronze/attribute_code/{day}/data_{count}.json', response.text)
My problem is that, since it's in the ForEach loop, every time a new parameter is passed, it deletes the entire folder along with the previously loaded data. Now someone might say to remove the line where I drop and create the specific daily folder, but the pipeline has to run multiple times a day, and I need to drop the data previously loaded that day and load the new data.
My goal is to iterate over the entire list of attribute_codes and load them all into one folder with the name "data_{count}.json".
Instead of using dbutils.fs.rm in your notebook, you can use a Delete activity before the ForEach activity to get the desired results.
With dbutils.fs.rm, the folder is deleted each time the notebook is triggered inside the ForEach loop, deleting previously created files as well.
So, by using a Delete activity before the ForEach loop to delete the folder (it deletes only if the folder exists), you can load the data as required.
For the path, I used the following dynamic content:
attribute/@{formatDateTime(utcNow(),'yyyy-MM-dd')}
And using the following code in my databricks notebook:
#I used similar code
data, response = get_data_url(url=f"https://p.cloud.com/api/rest/v1/attributes/{attribute_code}/options",access_token=access_token)
#Creating the folder in the Data Lake
dbutils.fs.mkdirs(f'/mnt/bronze/attribute_code/{day}')
count = 0
#Putting the response inside of the Data Lake folder
dbutils.fs.put(f'/mnt/bronze/attribute_code/{day}/data_{count}.json', response.text)
Let's say I have the following output from my lookup activity:
When I run the pipeline, it runs successfully. Only the latest lookup data would be loaded.
I have created this console script with Python 3 that uses the requests module to pull information from the website https://api.warframestat.us/ps4/fissures, but I want to run it 24/7 and not on my computer. Is it possible to do this computation online? My program executes every minute because the website changes its information. I'm new to programming, by the way; I know my questions may seem trivial to some of you.
I have not added the ability to send emails, I will do that later.
My code:
import json
import requests
import time

url = 'https://api.warframestat.us/ps4/fissures'
current_survivals = []  # holds current missions until new ones appear
ids_sent = []

while True:
    raw_data = requests.get(url)
    fissures_data = json.loads(raw_data.text)
    # Clear the list to get rid of the old ids
    current_survivals.clear()
    print()
    for x in fissures_data:
        print(x['missionType'])
    print()
    # Collect the ids of the missions we care about
    for mission in fissures_data:
        if mission['missionType'] == 'Capture':
            # add to current_survivals
            current_survivals.append(mission['id'])
    # Make sure not to send an email twice
    for x in current_survivals:
        if x in ids_sent:
            print("Email already sent.")
        else:
            ids_sent.append(x)
            print("Send Email")
            # send an email here
    print("Current_survivals: " + str(current_survivals))
    print("ids_sent: " + str(ids_sent) + "\n")
    time.sleep(60)  # run again after 1 minute
To have it a) not run on your computer and b) run every minute, perhaps the best approach would be to use an AWS Lambda (or the equivalent on Azure, etc.).
You'd need to set up an AWS account, configure a Lambda, and so on, so there is a bit of a learning curve there, but it does seem to be the cleanest way to do it.
The other approach would be a cloud server (e.g. an AWS EC2 instance), but to have it run every minute you'd then have to deal with (and learn) cron, or have your code sleep for a minute and try again. In that case you'd also need to worry about uptime, and some system to ensure your script is working correctly, being launched correctly, and so on.
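If you go the Lambda route, the polling loop itself disappears: a schedule (for example an EventBridge rule that fires once a minute) replaces the while True / time.sleep(60), and the handler does a single pass. Here is a rough sketch under those assumptions; the two storage helpers are stubs you would have to implement yourself (e.g. with DynamoDB or a small file in S3), since Lambda keeps no state between invocations, and urllib is used so no extra package needs to be bundled:
import json
import urllib.request

URL = 'https://api.warframestat.us/ps4/fissures'

def load_ids_sent():
    # Stub: fetch previously notified mission ids from persistent storage
    return set()

def save_ids_sent(ids_sent):
    # Stub: write the updated set back to persistent storage
    pass

def lambda_handler(event, context):
    # One pass of the original loop; the schedule supplies the "every minute" part
    with urllib.request.urlopen(URL) as resp:
        fissures_data = json.loads(resp.read())

    ids_sent = load_ids_sent()
    for mission in fissures_data:
        if mission['missionType'] == 'Capture' and mission['id'] not in ids_sent:
            # Send the email here (e.g. via Amazon SES), then remember the id
            ids_sent.add(mission['id'])
    save_ids_sent(ids_sent)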
I am writing a Python 3 Lambda function which needs to return all of the files that were uploaded to an S3 bucket in the past 30 days from the time the function is run.
How should I approach this? Ideally, I want to iterate over only the files from the past 30 days and nothing else; there are thousands upon thousands of files in the S3 bucket that I am iterating through, and maybe 100 at most will be updated/uploaded per month. It would be very inefficient to iterate through every file and compare dates like that. There is also a 29-second time limit for AWS API Gateway.
Any help would be greatly appreciated. Thanks!
You will need to iterate through the list of objects (sample code: List s3 buckets with its size in csv format) and compare the date within the Python code (sample code: Get day old filepaths from s3 bucket).
There is no filter when listing objects (aside from Prefix).
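For reference, a minimal sketch of that list-and-compare approach (the bucket name is a placeholder). A paginator is used because list_objects_v2 returns at most 1000 keys per call, and LastModified comes back as a timezone-aware datetime, so it can be compared directly:
import boto3
from datetime import datetime, timedelta, timezone

s3 = boto3.client('s3')
cutoff = datetime.now(timezone.utc) - timedelta(days=30)

recent_keys = []
paginator = s3.get_paginator('list_objects_v2')
for page in paginator.paginate(Bucket='my-bucket'):  # placeholder bucket name
    for obj in page.get('Contents', []):
        if obj['LastModified'] >= cutoff:
            recent_keys.append(obj['Key'])

print(recent_keys)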
An alternative is to use Amazon S3 Inventory, which can provide a daily CSV file listing the contents of a bucket. You could parse that CSV instead of listing the objects.
A more extreme option is to keep a separate database of objects, which would need to be updated whenever objects are added/deleted. This could be done via Amazon S3 Events that trigger an AWS Lambda function. Lots of work, though.
I can't give you a 100% answer, since you asked for the upload date, but if you can live with the 'last modified' value, this code snippet should do the job:
import boto3
import datetime

# Paginate the bucket listing and let a JMESPath expression do the date filtering
paginator = boto3.resource('s3').meta.client.get_paginator('list_objects')
date = datetime.datetime.now() - datetime.timedelta(30)
filtered_files = (
    obj['Key']
    for obj in paginator.paginate(Bucket="bucketname").search(
        f"Contents[?to_string(LastModified)>='\"{date}\"']"
    )
)
For filtering I used JMESPath.
From an architect's perspective
The bottleneck is whether you can iterate over all the objects within 30 seconds. If there are simply too many files, there are a few more options you can use:
Create an AWS Lambda function that is triggered by the S3:PutObject event and stores the S3 key and last_modified_at information in DynamoDB (an AWS key-value NoSQL database). Then you can easily query DynamoDB to filter the S3 keys and retrieve the corresponding S3 objects (a sketch of this option follows below).
Create an AWS Lambda function that is triggered by the S3:PutObject event and moves the file to a partitioned S3 key schema such as s3://bucket/datalake/year=${year}/month=${month}/day=${day}/your-file.csv. Then you can easily use the partition information to locate the subset of your objects, which fits within the 30-second hard limit.
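As an illustration of the first option, here is a rough sketch of the indexing Lambda. The table name "s3-object-index" and the partition key "key" are assumptions made for the example, not anything from the question:
import boto3

dynamodb = boto3.resource('dynamodb')
table = dynamodb.Table('s3-object-index')  # hypothetical table name

def lambda_handler(event, context):
    # Triggered by S3:PutObject; each record describes one uploaded object
    for record in event['Records']:
        table.put_item(Item={
            'key': record['s3']['object']['key'],     # partition key (assumed)
            'last_modified_at': record['eventTime'],  # ISO 8601 upload timestamp
        })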
From a programming perspective
Here's a code snippet that solves your problem using the s3pathlib library:
from datetime import datetime, timedelta

from s3pathlib import S3Path

# define a folder
p_dir = S3Path("bucket/my-folder/")

# find the datetime one month ago
now = datetime.utcnow()
one_month_ago = now - timedelta(days=30)

# filter by last modified
for p in p_dir.iter_objects().filter(
    # any filterable attribute can be used for filtering
    S3Path.last_modified_at >= one_month_ago
):
    # do whatever you like
    print(p.console_url)  # click the link to open it in the console and inspect
If you want to use other S3Path attributes for filtering, use other comparators, or even define your own custom filter, you can follow this document.
I’m building out a pipeline that should execute and train fairly frequently. I’m following this: https://learn.microsoft.com/en-us/azure/machine-learning/service/how-to-create-your-first-pipeline
Anyways, I’ve got a Stream Analytics job dumping telemetry into .json files on blob storage (soon to be ADLS Gen2). I want to find all the .json files and use all of them to train with. I could possibly use just the new .json files as well (an interesting option, honestly).
Currently I just have the store mounted to a data lake and available, and the code just iterates the mount for the data files and loads them up.
How can I use data references for this instead?
What do data references do for me that mounting time-stamped data does not?
a. From an audit perspective, I have version control, execution time, and time-stamped read-only data. Admittedly, doing a replay on this would require additional coding, but it is doable.
As mentioned, the input to the step can be a DataReference to the blob folder.
You can use the default store or add your own store to the workspace.
Then add that as an input. When you get a handle to that folder in your train code, just iterate over the folder as you normally would. I wouldn't dynamically add steps for each file; I would just read all the files from your storage in a single step.
from azureml.data.data_reference import DataReference
from azureml.pipeline.steps import PythonScriptStep

ds = ws.get_default_datastore()

blob_input_data = DataReference(
    datastore=ds,
    data_reference_name="data1",
    path_on_datastore="folder1/")

step1 = PythonScriptStep(name="1step",
                         script_name="train.py",
                         compute_target=compute,
                         source_directory='./folder1/',
                         arguments=['--data-folder', blob_input_data],
                         runconfig=run_config,
                         inputs=[blob_input_data],
                         allow_reuse=False)
Then inside your train.py you access the path as
import argparse

parser = argparse.ArgumentParser()
parser.add_argument('--data-folder', type=str, dest='data_folder', help='data folder')
args = parser.parse_args()
print('Data folder is at:', args.data_folder)
Regarding benefits, it depends on how you are mounting. For example if you are dynamically mounting in code, then the credentials to mount need to be in your code, whereas a DataReference allows you to register credentials once, and we can use KeyVault to fetch them at runtime. Or, if you are statically making the mount on the machine, you are required to run on that machine all the time, whereas a DataReference can dynamically fetch the credentials from any AMLCompute, and will tear that mount down right after the job is over.
Finally, if you want to train on a regular interval, it's pretty easy to schedule it to run regularly. For example:
from azureml.pipeline.core import Schedule, ScheduleRecurrence

pub_pipeline = pipeline_run1.publish_pipeline(name="Sample 1", description="Some desc", version="1", continue_on_step_failure=True)

recurrence = ScheduleRecurrence(frequency="Hour", interval=1)
schedule = Schedule.create(workspace=ws, name="Schedule for sample",
                           pipeline_id=pub_pipeline.id,
                           experiment_name='Schedule_Run_8',
                           recurrence=recurrence,
                           wait_for_provisioning=True,
                           description="Scheduled Run")
You could pass a pointer to the folder as an input parameter to the pipeline, and then your step can mount the folder and iterate over the .json files.
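One way to implement that with the v1 SDK is a DataPath wrapped in a PipelineParameter, so the folder can be swapped per submission. This is only a rough sketch, not code from the answer above; the names are illustrative, and ds and compute are assumed to exist as in the earlier snippet:
from azureml.data.datapath import DataPath, DataPathComputeBinding
from azureml.pipeline.core import PipelineParameter
from azureml.pipeline.steps import PythonScriptStep

# Default folder on the datastore; can be overridden when the pipeline is submitted
data_path = DataPath(datastore=ds, path_on_datastore='folder1/')
datapath_param = PipelineParameter(name="input_datapath", default_value=data_path)
datapath_input = (datapath_param, DataPathComputeBinding(mode='mount'))

step = PythonScriptStep(name="train_step",
                        script_name="train.py",
                        compute_target=compute,
                        source_directory='./folder1/',
                        arguments=['--data-folder', datapath_input],
                        inputs=[datapath_input],
                        allow_reuse=False)

Your train.py can then mount and iterate over that folder exactly as before.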
Suppose I want to download only 10 files from the bucket; how do I pass 10 as an argument?
The easiest way to do so is to make a Python script that you can run every 30 minutes. I have written Python code that will do your work:
import boto3
import random

s3 = boto3.client('s3')
source = boto3.resource('s3')

# List all keys in the bucket
keys = []
resp = s3.list_objects_v2(Bucket='bucket_name')
for obj in resp['Contents']:
    keys.append(obj['Key'])

length = len(keys)

# Download 10 randomly chosen files
for x in range(10):
    hello = random.randint(0, length - 1)
    source.meta.client.download_file('bucket_name', keys[hello], keys[hello])
In the for x in range(10) loop you can change the number 10 to define how many random files you want to download. Further, if you want your script to execute the task automatically every 30 minutes, you can define the above code as a separate method and then use Python's "sched" module to call that method repeatedly; you can find code for that in the link here:
What is the best way to repeatedly execute a function every x seconds in Python?
Your use case appears to be:
Every 30 minutes
Download 10 random files from Amazon S3
Presumably, these 10 files should not be files previously downloaded.
There is no in-built S3 functionality to download a random selection of files. Instead, you will need to:
Obtain a listing of files from your desired S3 bucket and optional path
Randomly select which files you want to download
Download the selected files
This would be easily done via a programming language (e.g. Python), where you could obtain an array of filenames, randomize it, then loop through the list and download each file.
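For example, here is a minimal Python sketch of those three steps; the bucket name and the count of 10 are placeholders, and random.sample ensures the same key is not picked twice in one run:
import os
import random

import boto3

s3 = boto3.client('s3')

# 1. Obtain a listing of the files in the bucket (paginated, in case there are many)
keys = []
paginator = s3.get_paginator('list_objects_v2')
for page in paginator.paginate(Bucket='my-bucket'):  # placeholder bucket name
    keys.extend(obj['Key'] for obj in page.get('Contents', []))

# 2. Randomly select which files to download
selected = random.sample(keys, k=min(10, len(keys)))

# 3. Download the selected files into the current directory
for key in selected:
    s3.download_file('my-bucket', key, os.path.basename(key))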
You can also do it in a shell script by calling the AWS Command-Line Interface (CLI) to obtain the listing (aws s3 ls) and to copy the files (aws s3 cp).
Alternatively, you could choose to synchronize ALL the files to your local machine (aws s3 sync) and then select random local files to process.
Try the above steps. If you experience difficulties, post your code and the error/problem you are experiencing and we can assist.