The Python Oracle

Loading data from S3 to dask dataframe

--------------------------------------------------
Rise to the top 3% as a developer or hire one of them at Toptal: https://topt.al/25cXVn
--------------------------------------------------

Music by Eric Matyas
https://www.soundimage.org
Track title: Drifting Through My Dreams

--

Chapters
00:00 Loading Data From S3 To Dask Dataframe
00:22 Accepted Answer Score 10
01:05 Answer 2 Score 4
01:33 Answer 3 Score 0
02:06 Thank you

--

Full question
https://stackoverflow.com/questions/5417...

--

Content licensed under CC BY-SA
https://meta.stackexchange.com/help/lice...

--

Tags
#python #dask #daskdistributed

#avk47



ACCEPTED ANSWER

Score 10


The backend that loads the data from S3 is s3fs, and it has a section on credentials here, which mostly points you to boto3's documentation.

The short answer is that there are a number of ways to provide S3 credentials, some of which are automatic: a credentials file in the right place, environment variables (which must be accessible to all workers), or the cluster/instance metadata service.
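
A minimal sketch of the automatic route, assuming credentials are already available through one of those mechanisms (a ~/.aws/credentials file, AWS_ACCESS_KEY_ID / AWS_SECRET_ACCESS_KEY environment variables visible to every worker, or an EC2/ECS metadata service); the bucket path is a placeholder:

import dask.dataframe as dd

# No storage_options needed: s3fs/boto3 resolve credentials automatically
df = dd.read_csv('s3://mybucket/some-big.csv')
print(df.head())  # triggers a small read, which also verifies access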

Alternatively, you can provide your key/secret directly in the call, but that of course means you must trust your execution platform and the communication between workers:

import dask.dataframe as dd
df = dd.read_csv('s3://mybucket/some-big.csv', storage_options={'key': mykey, 'secret': mysecret})

The set of parameters you can pass in storage_options when using s3fs can be found in the API docs.
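
To illustrate, here is a hedged sketch of a couple of other storage_options that s3fs accepts (based on the s3fs API: 'anon' for anonymous access to public buckets, and 'client_kwargs', which is passed through to the underlying botocore client); the bucket names, keys and region are placeholders:

import dask.dataframe as dd

# Public bucket: no credentials at all
public_df = dd.read_csv('s3://some-public-bucket/data.csv',
                        storage_options={'anon': True})

# Private bucket, pinning the region via the botocore client
private_df = dd.read_csv('s3://mybucket/some-big.csv',
                         storage_options={'key': '<s3 key>',
                                          'secret': '<s3 secret>',
                                          'client_kwargs': {'region_name': 'us-east-1'}})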

General reference: http://docs.dask.org/en/latest/remote-data-services.html




ANSWER 2

Score 4


If you're running inside AWS on infrastructure that already has an IAM role attached (for example on EC2 within your VPC), S3 access will likely already be credentialed and you can read the file in without a key:

import dask.dataframe as dd
df = dd.read_csv('s3://<bucket>/<path to file>.csv')

If you aren't credentialed, you can use the storage_options parameter and pass a key pair (key and secret):

import dask.dataframe as dd
storage_options = {'key': <s3 key>, 'secret': <s3 secret>}
df = dd.read_csv('s3://<bucket>/<path to file>.csv', storage_options=storage_options)

Full documentation from Dask can be found here.
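
If you are working with temporary credentials (for example from an assumed role), here is a hedged variant: s3fs also accepts a session token alongside the key/secret pair (an assumption worth checking against the s3fs docs for your version):

import dask.dataframe as dd

storage_options = {'key': '<s3 key>',
                   'secret': '<s3 secret>',
                   'token': '<session token>'}  # token is only needed for temporary credentials
df = dd.read_csv('s3://<bucket>/<path to file>.csv', storage_options=storage_options)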




ANSWER 3

Score 0


Under the hood, Dask uses s3fs/boto3, so you can set up your keys in pretty much all the ways boto3 supports, e.g. a profile-based approach (export AWS_PROFILE=xxxx) or explicitly exporting the access key and secret via environment variables. I would advise against hard-coding your keys, lest you expose your code to the public by mistake.

$ export AWS_PROFILE=your_aws_cli_profile_name

or

https://docs.aws.amazon.com/sdk-for-java/v1/developer-guide/setup-credentials.html
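
The same profile-based approach can also be selected from Python; a sketch assuming s3fs exposes a 'profile' option that it forwards to the underlying boto3 session (check the s3fs docs for your version):

import dask.dataframe as dd

df = dd.read_csv('s3://<bucket>/<path to file>.csv',
                 storage_options={'profile': 'your_aws_cli_profile_name'})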

For S3 you can use a wildcard match to fetch multiple chunked files:

import dask.dataframe as dd

# Given N CSV files located in S3, read them and compute the total record count
s3_url = 's3://<bucket_name>/dask-tutorial/data/accounts.*.csv'
df = dd.read_csv(s3_url)

print(df.head())
print(len(df))
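
As a follow-up sketch, note that len(df) forces a full pass over the data; other aggregations should stay lazy and be materialised once with .compute(). For example, a per-partition row count (the URL is the same placeholder as above):

import dask.dataframe as dd

s3_url = 's3://<bucket_name>/dask-tutorial/data/accounts.*.csv'
df = dd.read_csv(s3_url)

# One row count per partition (roughly one partition per input file here)
rows_per_file = df.map_partitions(len).compute()
print(rows_per_file)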