Retrieving subfolders names in S3 bucket from boto3
Hire the world's top talent on demand or became one of them at Toptal: https://topt.al/25cXVn
and get $2,000 discount on your first invoice
--------------------------------------------------
Music by Eric Matyas
https://www.soundimage.org
Track title: Magic Ocean Looping
--
Chapters
00:00 Retrieving Subfolders Names In S3 Bucket From Boto3
01:32 Accepted Answer Score 123
03:05 Answer 2 Score 180
03:26 Answer 3 Score 56
03:48 Answer 4 Score 21
04:51 Thank you
--
Full question
https://stackoverflow.com/questions/3580...
--
Content licensed under CC BY-SA
https://meta.stackexchange.com/help/lice...
--
Tags
#python #amazonwebservices #amazons3 #boto3
#avk47
ANSWER 1
Score 180
Below piece of code returns ONLY the 'subfolders' in a 'folder' from s3 bucket.
import boto3
bucket = 'my-bucket'
#Make sure you provide / in the end
prefix = 'prefix-name-with-slash/'
client = boto3.client('s3')
result = client.list_objects(Bucket=bucket, Prefix=prefix, Delimiter='/')
for o in result.get('CommonPrefixes'):
print 'sub folder : ', o.get('Prefix')
For more details, you can refer to https://github.com/boto/boto3/issues/134
ACCEPTED ANSWER
Score 123
S3 is an object storage, it doesn't have real directory structure. The "/" is rather cosmetic. One reason that people want to have a directory structure, because they can maintain/prune/add a tree to the application. For S3, you treat such structure as sort of index or search tag.
To manipulate object in S3, you need boto3.client or boto3.resource, e.g. To list all object
import boto3
s3 = boto3.client("s3")
all_objects = s3.list_objects(Bucket = 'bucket-name')
http://boto3.readthedocs.org/en/latest/reference/services/s3.html#S3.Client.list_objects
In fact, if the s3 object name is stored using '/' separator. The more recent version of list_objects (list_objects_v2) allows you to limit the response to keys that begin with the specified prefix.
To limit the items to items under certain sub-folders:
import boto3
s3 = boto3.client("s3")
response = s3.list_objects_v2(
Bucket=BUCKET,
Prefix ='DIR1/DIR2',
MaxKeys=100 )
Another option is using python os.path function to extract the folder prefix. Problem is that this will require listing objects from undesired directories.
import os
s3_key = 'first-level/1456753904534/part-00014'
filename = os.path.basename(s3_key)
foldername = os.path.dirname(s3_key)
# if you are not using conventional delimiter like '#'
s3_key = 'first-level#1456753904534#part-00014'
filename = s3_key.split("#")[-1]
A reminder about boto3 : boto3.resource is a nice high level API. There are pros and cons using boto3.client vs boto3.resource. If you develop internal shared library, using boto3.resource will give you a blackbox layer over the resources used.
ANSWER 3
Score 56
It took me a lot of time to figure out, but finally here is a simple way to list contents of a subfolder in S3 bucket using boto3. Hope it helps
prefix = "folderone/foldertwo/"
s3 = boto3.resource('s3')
bucket = s3.Bucket(name="bucket_name_here")
FilesNotFound = True
for obj in bucket.objects.filter(Prefix=prefix):
print('{0}:{1}'.format(bucket.name, obj.key))
FilesNotFound = False
if FilesNotFound:
print("ALERT", "No file in {0}/{1}".format(bucket, prefix))
ANSWER 4
Score 21
I had the same issue but managed to resolve it using boto3.client and list_objects_v2 with Bucket and StartAfter parameters.
s3client = boto3.client('s3')
bucket = 'my-bucket-name'
startAfter = 'firstlevelFolder/secondLevelFolder'
theobjects = s3client.list_objects_v2(Bucket=bucket, StartAfter=startAfter )
for object in theobjects['Contents']:
print object['Key']
The output result for the code above would display the following:
firstlevelFolder/secondLevelFolder/item1
firstlevelFolder/secondLevelFolder/item2
Boto3 list_objects_v2 Documentation
In order to strip out only the directory name for secondLevelFolder I just used python method split():
s3client = boto3.client('s3')
bucket = 'my-bucket-name'
startAfter = 'firstlevelFolder/secondLevelFolder'
theobjects = s3client.list_objects_v2(Bucket=bucket, StartAfter=startAfter )
for object in theobjects['Contents']:
direcoryName = object['Key'].encode("string_escape").split('/')
print direcoryName[1]
The output result for the code above would display the following:
secondLevelFolder
secondLevelFolder
If you'd like to get the directory name AND contents item name then replace the print line with the following:
print "{}/{}".format(fileName[1], fileName[2])
And the following will be output:
secondLevelFolder/item2
secondLevelFolder/item2
Hope this helps