Load/access/mount directory to aws sagemaker from S3 - python-3.x

I am a newbie to AWS S3/SageMaker. I am struggling to access my data [data meaning folders/directories, not any specific file/files] from an S3 bucket in a SageMaker Jupyter notebook.
Say, my URI is:
s3://data/sub/dir/, where dir may contain multiple directories with files. I need to access the directory (e.g., dir) in such a way that I can access any subdirectories/files from it. I tried
!aws s3 cp s3://data/sub/dir tempdata --recursive but it did not work; I get an error like:
fatal error: An error occurred (404) when calling the HeadObject operation: Key "sub/dir" does not exist.
Please advise: how can I access the directories from S3 buckets in my AWS SageMaker JupyterLab?
Or, how can I mount S3 buckets in SageMaker? I also tried this link and it installed with no errors, but s3fs doesn't show up when I run df -h, so that didn't work either. Thanks in advance.

Your cp syntax is correct.
S3 sync could be an alternative way to get the same result, and its error message, if something is wrong, may be more informative: !aws s3 sync s3://data/sub/dir tempdata
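If the CLI keeps failing, a minimal boto3 sketch run from inside the notebook does the same recursive download (the bucket/prefix names mirror the example URI and are placeholders):

import os
import boto3

s3 = boto3.resource("s3")
bucket = s3.Bucket("data")  # bucket name taken from the example URI; replace with yours

for obj in bucket.objects.filter(Prefix="sub/dir/"):  # the example prefix
    target = os.path.join("tempdata", os.path.relpath(obj.key, "sub/dir/"))
    os.makedirs(os.path.dirname(target), exist_ok=True)
    if not obj.key.endswith("/"):  # skip zero-byte "folder" placeholder keys
        bucket.download_file(obj.key, target)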

Related

Cannot load tokenizer from local

I just started to use AWS Lambda and Docker so would appreciate any advice.
I am trying to deploy an ML model to AWS Lambda for inference. The image created from the Dockerfile successfully loads the XLNet model from a local dir; however, it gets stuck when doing the same thing for the tokenizer.
In the pretrained_tokenizer folder, I have 4 files saved from tokenizer.save_pretrained(...) and config.save_pretrained(...)
In Dockerfile, I have tried multiple things, including:
copy the folder: COPY app/pretrained_tokenizer/ opt/ml/pretrained_tokenizer/
copy each file from the folder with separate COPY commands
compress the folder to .tar.gz and use ADD pretrained_tokenizer.tar.gz /opt/ml/ (which is supposed to extract the tar file in the process)
In my Python script, I try to load the tokenizer using tokenizer = XLNetTokenizer.from_pretrained(tokenizer_file, do_lower_case=True), which works on Colab, but not when I invoke the image through sam local invoke -e events/event.json; the error is:
[ERROR] OSError: Can't load tokenizer for 'opt/ml/pretrained_tokenizer/'. If you were trying to load it from 'https://huggingface.co/models', make sure you don't have a local directory with the same name. Otherwise, make sure 'opt/ml/pretrained_tokenizer/' is the correct path to a directory containing all relevant files.
  File ".../tokenization_utils_base.py", line 1768, in from_pretrained
    raise EnvironmentError(
END RequestId: bf011045-bed8-41eb-ac21-f98bfcee475a
I have tried to look through past questions but couldn't really fix anything. I would appreciate any help!
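For reference, a minimal sketch of the local load, assuming the files end up under an absolute path inside the image (the directory name mirrors the question, which mixes opt/ml/... and /opt/ml/...):

import os
from transformers import XLNetTokenizer

# Assumed layout: the tokenizer files were copied to an absolute path inside the image.
tokenizer_dir = "/opt/ml/pretrained_tokenizer"  # note the leading slash; the error shows the relative 'opt/ml/...'
print(os.listdir(tokenizer_dir))  # sanity check that the four saved files are really in the image
tokenizer = XLNetTokenizer.from_pretrained(tokenizer_dir, do_lower_case=True)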

aws CLI: get-job-output erroring with either [Errno 9] Bad file descriptor or [Errno 2] No such file or directory

I'm having some problems with retrieving job output from an AWS glacier vault.
I initiated a job (aws glacier initiate-job), the job is indicated as complete via aws glacier, and then I tried to retrieve the job output:
aws glacier get-job-output --account-id - --vault-name <myvaultname> --job-id <jobid> output.json
However, I receive an error: [Errno 2] No such file or directory: 'output.json'
Thinking that perhaps the file needed to be created first, I created it beforehand (which really doesn't make sense), and then received the [Errno 9] Bad file descriptor error instead.
I'm currently using the following version of the AWS CLI:
aws-cli/2.4.10 Python/3.8.8 Windows/10 exe/AMD64 prompt/off
I tried using the aws CLI from both an Administrative and non-Administrative command prompt with the same result. Any ideas on making this work?
From a related reported issue, you can try running this command in a DOS window:
copy "c:\Program Files\Amazon\AWSCLI\botocore\vendored\requests\cacert.pem" "c:\Program Files\Amazon\AWSCLI\certifi"
It seems to be a certificate error.
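If the CLI keeps tripping over the output file, the same retrieval can also be scripted with boto3 so you control the file handling yourself (vault name and job id are placeholders; this is a sketch, not a guaranteed fix for the certificate issue):

import boto3

glacier = boto3.client("glacier")
resp = glacier.get_job_output(
    accountId="-",            # "-" means the account that owns the credentials
    vaultName="myvaultname",  # placeholder
    jobId="myjobid",          # placeholder
)
with open("output.json", "wb") as f:
    f.write(resp["body"].read())  # 'body' is a streaming object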

An error occurred (MissingAuthenticationTokenException) when calling the UpdateFunctionCode operation Lambda AWS

I have a function in my Lambda named my-s3-function. I need to add this dependency to my Node.js Lambda. I have followed this part to update the script with the dependency included (though I didn't follow the step where I need to zip the folder using zip -r function.zip .; instead I zipped the folder by right-clicking it on my PC).
The zip file's structured like this inside:
|node_modules
|<folders>
|<folders>
|<folders>
... // the list goes on
|index.js
|package_lock.json
Upon running aws lambda update-function-code --function-name my-s3-function --zip-file fileb://function.zip in the terminal, I get the following response:
An error occurred (MissingAuthenticationTokenException) when calling the UpdateFunctionCode operation: Missing Authentication Token
What should I do to resolve this?
Based on the comments, this got resolved by configuring the credentials as described in the documentation.
Try first exporting the credentials as described in Environment variables to configure the AWS CLI. Once you are sure your credentials are correct, you can follow the Configuration and credential file documentation.
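If you'd rather stay in Python, a boto3 sketch that passes credentials explicitly and performs the same update (all credential values and the region are placeholders):

import boto3

session = boto3.Session(
    aws_access_key_id="YOUR_ACCESS_KEY_ID",          # placeholder
    aws_secret_access_key="YOUR_SECRET_ACCESS_KEY",  # placeholder
    region_name="us-east-1",                         # placeholder region
)
lambda_client = session.client("lambda")

with open("function.zip", "rb") as f:
    lambda_client.update_function_code(
        FunctionName="my-s3-function",
        ZipFile=f.read(),
    )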

How to manually load spark-redshift AVRO files into Redshift?

I have a Spark job that failed at the COPY portion of the write. I have all the output already processed in S3, but am having trouble figuring out how to manually load it.
COPY table
FROM 's3://bucket/a7da09eb-4220-4ebe-8794-e71bd53b11bd/part-'
CREDENTIALS 'aws_access_key_id=XXX;aws_secret_access_key=XXX'
format as AVRO 'auto'
In my folder there is a _SUCCESS, _committedxxx and _startedxxx file, and then 99 files all starting with the prefix part-. When I run this I get an stl_load_error -> Invalid AVRO file found. Unexpected end of AVRO file. If I take that prefix off, then I get:
[XX000] ERROR: Invalid AVRO file Detail: ----------------------------------------------- error: Invalid AVRO file code: 8001 context: Cannot init avro reader from s3 file Incorrect Avro container file magic number query: 10882709 location: avropath_request.cpp:432 process: query23_27 [pid=10653] -----------------------------------------------
Is this possible to do? It would be nice to save the processing.
I had the same error from Redshift.
The COPY works after I deleted the _committedxxx and _startedxxx files (the _SUCCESS file is no problem).
If you have many directories in s3, you can use the aws cli to clean them of these files:
aws s3 rm s3://my_bucket/my/dir/ --include "_comm*" --exclude "*.avro" --exclude "*_SUCCESS" --recursive
Note that the CLI seems to have a bug: --include "_comm*" did not work for me on its own, so it attempted to delete all files; adding the --exclude "*.avro" and --exclude "*_SUCCESS" filters does the trick. Be careful and run the command with --dryrun first!
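If the --include/--exclude behaviour feels too fragile, a boto3 sketch that deletes only keys whose file name starts with _committed or _started (bucket and prefix mirror the rm example and are placeholders):

import boto3

s3 = boto3.resource("s3")
bucket = s3.Bucket("my_bucket")  # placeholder, mirroring the rm example

for obj in bucket.objects.filter(Prefix="my/dir/"):  # placeholder prefix
    name = obj.key.rsplit("/", 1)[-1]
    if name.startswith(("_committed", "_started")):
        print("deleting", obj.key)
        obj.delete()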

AWS EMR - Upload file into the application master

I'm using aws cli and I launch a Cluster with the following command:
aws emr create-cluster --name "Config1" --release-label emr-5.0.0 --applications Name=Spark --use-default-role --ec2-attributes KeyName=ChiaveEMR --log-uri 's3://aws-logs-813591802533-us-west-2/elasticmapreduce/' --instance-groups InstanceGroupType=MASTER,InstanceCount=1,InstanceType=m1.medium InstanceGroupType=CORE,InstanceCount=2,InstanceType=m1.medium
after that, I put a file into the master node:
aws emr put --cluster-id j-NSGFSP57255P --key-pair-file "ChiaveEMR.pem" --src "./configS3.txt"
The file is located in /home/hadoop/configS3.txt.
Then I launch a step:
aws emr add-steps --cluster-id ID_CLUSTER --region us-west-2 --steps Type=Spark,Name=SparkSubmit,Args=[--deploy-mode,cluster,--master,yarn,--executor-memory,1G,--class,Traccia2014,s3://tracceale/params/traccia-22-ottobre_2.11-1.0Ale.jar,/home/hadoop/configS3.txt,30,300,2,"s3a://tracceale/Tempi1"],ActionOnFailure=CONTINUE
But I get this error:
17/02/23 14:49:51 ERROR ApplicationMaster: User class threw exception: java.io.FileNotFoundException: /home/hadoop/configS3.txt (No such file or directory)
java.io.FileNotFoundException: /home/hadoop/configS3.txt (No such file or directory)
probably due to the fact that 'configS3.txt' is located on the master and not on the slaves.
How could I pass configS3.txt to the spark-submit script? I've tried from S3 too but it doesn't work. Any solutions? Thanks in advance.
Since you are using "--deploy-mode cluster", the driver runs on a CORE/TASK instance rather than the MASTER instance, so yes, it's because you uploaded the file to the MASTER instance but then the code that's trying to access the file is not running on the MASTER instance.
Given that the error you are encountering is a FileNotFoundException, it sounds like your application code is trying to open it directly, meaning that of course you can't simply use the S3 path directly. (You can't do something like new File("s3://bucket/key") because Java has no idea how to handle this.) My assumption could be wrong though because you have not included your application code or explained what you are doing with this configS3.txt file.
Maurizio: you're still trying to fix your previous problem.
On a distributed system, you need files which are visible on all machines (which the s3:// filestore delivers) and you need to use an API which works with data from the distributed filesystem, which SparkContext.hadoopRDD() delivers. You aren't going to get anywhere by trying to work out how to get a file onto the local disk of every VM, because that's not the problem you need to fix: it's how to get your code to read data from the shared object store.
Sorry
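As an illustration of that point, a minimal PySpark sketch (the asker's job is Scala, but the idea is the same): read the config from a shared s3a:// location instead of a local path. The exact S3 key here is an assumption.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("Traccia2014").getOrCreate()

# Read the configuration from the shared object store so the driver can see it
# no matter which node it runs on; the s3a:// path mirrors the bucket in the question.
config_lines = spark.sparkContext.textFile("s3a://tracceale/params/configS3.txt").collect()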
