The Python Oracle

Retrieving subfolders names in S3 bucket from boto3

--------------------------------------------------
Hire the world's top talent on demand or became one of them at Toptal: https://topt.al/25cXVn
and get $2,000 discount on your first invoice
--------------------------------------------------

Music by Eric Matyas
https://www.soundimage.org
Track title: Magic Ocean Looping

--

Chapters
00:00 Retrieving Subfolders Names In S3 Bucket From Boto3
01:32 Accepted Answer Score 123
03:05 Answer 2 Score 180
03:26 Answer 3 Score 56
03:48 Answer 4 Score 21
04:51 Thank you

--

Full question
https://stackoverflow.com/questions/3580...

--

Content licensed under CC BY-SA
https://meta.stackexchange.com/help/lice...

--

Tags
#python #amazonwebservices #amazons3 #boto3

#avk47



ANSWER 1

Score 180


Below piece of code returns ONLY the 'subfolders' in a 'folder' from s3 bucket.

import boto3
bucket = 'my-bucket'
#Make sure you provide / in the end
prefix = 'prefix-name-with-slash/'  

client = boto3.client('s3')
result = client.list_objects(Bucket=bucket, Prefix=prefix, Delimiter='/')
for o in result.get('CommonPrefixes'):
    print 'sub folder : ', o.get('Prefix')

For more details, you can refer to https://github.com/boto/boto3/issues/134




ACCEPTED ANSWER

Score 123


S3 is an object storage, it doesn't have real directory structure. The "/" is rather cosmetic. One reason that people want to have a directory structure, because they can maintain/prune/add a tree to the application. For S3, you treat such structure as sort of index or search tag.

To manipulate object in S3, you need boto3.client or boto3.resource, e.g. To list all object

import boto3 
s3 = boto3.client("s3")
all_objects = s3.list_objects(Bucket = 'bucket-name') 

http://boto3.readthedocs.org/en/latest/reference/services/s3.html#S3.Client.list_objects

In fact, if the s3 object name is stored using '/' separator. The more recent version of list_objects (list_objects_v2) allows you to limit the response to keys that begin with the specified prefix.

To limit the items to items under certain sub-folders:

    import boto3 
    s3 = boto3.client("s3")
    response = s3.list_objects_v2(
            Bucket=BUCKET,
            Prefix ='DIR1/DIR2',
            MaxKeys=100 )

Documentation

Another option is using python os.path function to extract the folder prefix. Problem is that this will require listing objects from undesired directories.

import os
s3_key = 'first-level/1456753904534/part-00014'
filename = os.path.basename(s3_key) 
foldername = os.path.dirname(s3_key)

# if you are not using conventional delimiter like '#' 
s3_key = 'first-level#1456753904534#part-00014'
filename = s3_key.split("#")[-1]

A reminder about boto3 : boto3.resource is a nice high level API. There are pros and cons using boto3.client vs boto3.resource. If you develop internal shared library, using boto3.resource will give you a blackbox layer over the resources used.




ANSWER 3

Score 56


It took me a lot of time to figure out, but finally here is a simple way to list contents of a subfolder in S3 bucket using boto3. Hope it helps

prefix = "folderone/foldertwo/"
s3 = boto3.resource('s3')
bucket = s3.Bucket(name="bucket_name_here")
FilesNotFound = True
for obj in bucket.objects.filter(Prefix=prefix):
     print('{0}:{1}'.format(bucket.name, obj.key))
     FilesNotFound = False
if FilesNotFound:
     print("ALERT", "No file in {0}/{1}".format(bucket, prefix))



ANSWER 4

Score 21


I had the same issue but managed to resolve it using boto3.client and list_objects_v2 with Bucket and StartAfter parameters.

s3client = boto3.client('s3')
bucket = 'my-bucket-name'
startAfter = 'firstlevelFolder/secondLevelFolder'

theobjects = s3client.list_objects_v2(Bucket=bucket, StartAfter=startAfter )
for object in theobjects['Contents']:
    print object['Key']

The output result for the code above would display the following:

firstlevelFolder/secondLevelFolder/item1
firstlevelFolder/secondLevelFolder/item2

Boto3 list_objects_v2 Documentation

In order to strip out only the directory name for secondLevelFolder I just used python method split():

s3client = boto3.client('s3')
bucket = 'my-bucket-name'
startAfter = 'firstlevelFolder/secondLevelFolder'

theobjects = s3client.list_objects_v2(Bucket=bucket, StartAfter=startAfter )
for object in theobjects['Contents']:
    direcoryName = object['Key'].encode("string_escape").split('/')
    print direcoryName[1]

The output result for the code above would display the following:

secondLevelFolder
secondLevelFolder

Python split() Documentation

If you'd like to get the directory name AND contents item name then replace the print line with the following:

print "{}/{}".format(fileName[1], fileName[2])

And the following will be output:

secondLevelFolder/item2
secondLevelFolder/item2

Hope this helps