# AWS S3

## Data transfer via the AWS CLI tools
> **Note:** You will need to have configured the AWS CLI toolkit before using these commands. See Command-line interface authentication.
> **Note:** If you have more than one profile configured, add the flag `--profile <profile name>` to the end of each command to specify which profile to use.
### List buckets
Get a list of S3 buckets for a given account:
```shell
aws s3 ls
```
### Copy a single file
```shell
# From local to S3
aws s3 cp <local file> s3://<bucket name>/<key>

# From S3 to local
aws s3 cp s3://<bucket name>/<key> <local file>
```
The full "key" or path-like address of a file on S3 can be defined at the time of transfer — the "folders" do not need to exist ahead of time (because they are not real folders). If you want to use a key you have already created and don't want to risk typing it wrong, go to the AWS console and click the "Copy S3 URI" button.
Add the `--dryrun` flag to test the transfer without actually copying anything.
### Copy a collection of files
```shell
# From local to S3
aws s3 sync <local path> s3://<bucket name>/<key prefix>

# From S3 to local
aws s3 sync s3://<bucket name>/<key prefix> <local path>
```
Like `cp`, the path prefix for all transferred files can be defined at the time of transfer. Useful flags:

- `--dryrun` — test the transfer without copying anything
- `--no-follow-symlinks` — prevent copying linked data (recommended to avoid duplicating data)
- `--exclude ".*" --exclude "*/.*"` — exclude hidden files and directories
### Check the number of files before transferring
> **Note:** Data transfer speed is a function of file size and the number of files. Transferring a large number of small files is very slow because S3 must create an object for each file. If you have a large number of files, consider: 1) whether the data need to be kept; and 2) whether they can be packaged into a tar or zip archive.
As an example: a 20 GB directory with ~10 million small image tile files took over 24 hours to transfer. After removing the tile directories, only 3,000 files remained and the transfer finished in under a minute.
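As a sketch of the packaging approach (the directory and file names below are illustrative, not real data), many small files can be bundled into one archive so S3 stores a single object instead of one object per file:

```shell
# Create a small demo directory of "tile" files (an illustrative
# stand-in for a real data directory)
demo_dir=$(mktemp -d)
mkdir -p "$demo_dir/tiles"
for i in 1 2 3 4 5; do
    echo "tile $i" > "$demo_dir/tiles/tile_$i.txt"
done

# Bundle the directory into one compressed archive: one S3 object
# instead of five
tar -czf "$demo_dir/tiles.tar.gz" -C "$demo_dir" tiles

# Confirm the archive contains all five files
tar -tzf "$demo_dir/tiles.tar.gz" | grep -c 'tile_'
```

The archive can then be uploaded with `aws s3 cp` as a single object and unpacked with `tar -xzf` after download.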
Check the number of files in a directory (including subdirectories):
```shell
du --inodes -s <path to directory>
```
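`du --inodes` is specific to GNU coreutils; on systems without it (e.g. macOS), a portable alternative is to count regular files with `find`. A sketch using a throwaway demo directory:

```shell
# Build a small demo directory tree (three files across subdirectories)
demo=$(mktemp -d)
mkdir -p "$demo/a/b"
touch "$demo/one.txt" "$demo/a/two.txt" "$demo/a/b/three.txt"

# Count regular files, including those in subdirectories
find "$demo" -type f | wc -l
```

For a real check, replace `"$demo"` with the path to your data directory.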
## Accessing data in S3 via the SOCA cluster

### Mount S3 with mountpoint-s3
`mountpoint-s3` is a tool that mounts an S3 bucket on the server running your job so you can access files as if they were on a local filesystem. This avoids copying data to the cluster's EFS storage, which is expensive. It is particularly useful for read-heavy workloads.

A description of supported filesystem operations is in the `mountpoint-s3` documentation.
> **Note:** Authentication with `mountpoint-s3` does not currently support the SSO system. You will need to add temporary credentials to `~/.aws/credentials` rather than using the `aws sso login` method.
### Authenticate with temporary credentials
- Navigate to the AWS SSO portal.
- Select the account you want.
- Click "Command line or programmatic access."
- Copy the credentials block under Option 2 (manually add a profile to your credentials file).
- Open `~/.aws/credentials` in a text editor and paste the credentials. Rename the profile header (the text in brackets) to `[default]` if you have a single profile, or to a custom name (e.g. `[datascience]`) if you have multiple profiles.
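After pasting, the credentials file might look like the following sketch (the profile name and all values are placeholders; SSO temporary credentials include a session token alongside the key pair):

```ini
[datascience]
aws_access_key_id = ASIAEXAMPLEKEYID
aws_secret_access_key = examplesecretaccesskey
aws_session_token = examplesessiontoken
```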
### Mount a bucket
In your job file or in an interactive session, mount the bucket in the /scratch area:
```shell
# Optionally create a subdirectory so /scratch remains usable
mkdir -p /scratch/s3

# Mount the bucket
# Replace "datascience" with your credentials file profile name
# Replace "ddpsc-datascience" with your bucket name
mount-s3 --profile datascience ddpsc-datascience /scratch/s3

# Optionally limit the mount to a prefix within the bucket
mount-s3 --profile datascience --prefix shares/datascience/users/nfahlgren/ ddpsc-datascience /scratch/s3
```
### Unmount a bucket
```shell
fusermount -u /scratch/s3
```
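The mount and unmount steps can be combined in a job file. A sketch (the scheduler directive, bucket name, and paths are illustrative; `--read-only` suits the read-heavy workloads this approach is best for):

```shell
#!/bin/bash
#PBS -N s3-analysis   # illustrative scheduler directive

# Mount the bucket read-only for the duration of the job
mkdir -p /scratch/s3
mount-s3 --read-only ddpsc-datascience /scratch/s3

# ... run the analysis against files under /scratch/s3 ...

# Unmount when the job is done
fusermount -u /scratch/s3
```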
## Restoring archived data in S3
For restoring a large number of files, use S3 Batch Operations. Contact the Data Science team for assistance.
### Create an inventory of files to be restored
```shell
# Authenticate
aws sso login

# Create a CSV inventory of all files within a directory and its subdirectories
aws s3 ls --recursive \
    s3://<bucket>/<key prefix> \
    | awk '{print "<bucket>,"$4}' > dir-inventory.csv
```
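To see what the pipeline produces, here is the `awk` step applied to canned `aws s3 ls --recursive` output (the bucket name and keys are made up for illustration):

```shell
# Simulated `aws s3 ls --recursive` output: date, time, size, key
cat > ls-output.txt <<'EOF'
2024-01-15 10:23:45    1048576 shares/data/file1.tif
2024-01-15 10:23:46       2048 shares/data/file2.tif
EOF

# Field 4 is the object key; prepend the bucket name to form the
# "bucket,key" manifest rows that S3 Batch Operations expects
awk '{print "my-bucket,"$4}' ls-output.txt > dir-inventory.csv
cat dir-inventory.csv
```

Note that `$4` will truncate any key that contains spaces, so comparing the inventory's line count against the expected number of files is a worthwhile check.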
Quality-check the inventory file — does it have the number of files you expect?
### Upload the inventory to S3
```shell
aws s3 cp dir-inventory.csv s3://<bucket>/dir-inventory.csv
```
### Use S3 Batch Operations to restore the files
- Go to the SSO login page from Okta
- Select the `S3BatchOperations` role for the account you want to restore files in
- Navigate to S3 > Batch Operations > Create Job
- Manifest: select CSV and navigate to your CSV file, then click Next
- Operation: Restore > Intelligent-Tiering Archive Access tier or Deep Archive Access tier > Standard retrieval, then click Next
- Completion report: select a location to store a report
- Permissions: Choose from existing IAM roles > `S3forBatchOperations`, then click Next
- Click Create job
- Restoration takes 3–24 hours
## Desktop clients for S3

### Cyberduck (macOS/Linux/Windows)
- Installation instructions for macOS and Windows
- Step-by-step guide for connecting to S3
### Transmit (macOS)
First, follow the Command-line interface authentication instructions on your
local computer to create an SSO profile (e.g. default).
- In Transmit, click the + button and select Amazon S3 to create a new server
- Set Protocol to Amazon S3, Address to `s3.amazonaws.com`, and Profile to your profile name (e.g. `default`)
Before connecting (credentials expire every 12 hours):
- Log into the AWS web console via Okta
- Select the group and role (`S3ReadWrite`), then click "Command line or programmatic access"
- Copy the text under Option 2
- Open `~/.aws/credentials`, paste the text, and rename the profile header in brackets to match your Transmit profile name
- Go back to Transmit and connect