Loading data from S3 to dask dataframe
Rise to the top 3% as a developer or hire one of them at Toptal: https://topt.al/25cXVn
--------------------------------------------------
Music by Eric Matyas
https://www.soundimage.org
Track title: Drifting Through My Dreams
--
Chapters
00:00 Loading Data From S3 To Dask Dataframe
00:22 Accepted Answer Score 10
01:05 Answer 2 Score 4
01:33 Answer 3 Score 0
02:06 Thank you
--
Full question
https://stackoverflow.com/questions/5417...
--
Content licensed under CC BY-SA
https://meta.stackexchange.com/help/lice...
--
Tags
#python #dask #daskdistributed
#avk47
ACCEPTED ANSWER
Score 10
The backend which loads the data from s3 is s3fs, and it has a section on credentials here, which mostly points you to boto3's documentation.
The short answer is that there are a number of ways of providing S3 credentials, some of which are automatic: a credentials file in the right place, environment variables (which must be accessible to all workers), or the cluster metadata service.
Alternatively, you can provide your key/secret directly in the call, but that of course means you must trust your execution platform and the communication between workers:
df = dd.read_csv('s3://mybucket/some-big.csv', storage_options={'key': mykey, 'secret': mysecret})
The set of parameters you can pass in storage_options when using s3fs can be found in the API docs.
General reference http://docs.dask.org/en/latest/remote-data-services.html
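For reference, a minimal sketch of two other storage_options combinations that s3fs accepts: anonymous access for a public bucket, and temporary (session) credentials that include a token. The bucket paths and the mykey/mysecret/mytoken values are placeholders.
import dask.dataframe as dd

# Public bucket: no credentials, ask s3fs for anonymous access
df_public = dd.read_csv('s3://mybucket/some-public.csv',
                        storage_options={'anon': True})

# Temporary (STS) credentials: pass the session token alongside key/secret
df_private = dd.read_csv('s3://mybucket/some-big.csv',
                         storage_options={'key': mykey,
                                          'secret': mysecret,
                                          'token': mytoken})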
ANSWER 2
Score 4
If you're running within your virtual private cloud (VPC), S3 will likely already be credentialed and you can read the file in without a key:
import dask.dataframe as dd
df = dd.read_csv('s3://<bucket>/<path to file>.csv')
If you aren't credentialed, you can use the storage_options parameter and pass a key pair (key and secret):
import dask.dataframe as dd
storage_options = {'key': <s3 key>, 'secret': <s3 secret>}
df = dd.read_csv('s3://<bucket>/<path to file>.csv', storage_options=storage_options)
Full documentation from dask can be found here
ANSWER 3
Score 0
Under the hood Dask uses boto3, so you can set up your keys in pretty much all the ways boto3 supports, e.g. role-based (export AWS_PROFILE=xxxx) or by explicitly exporting your access key and secret via environment variables. I would advise against hard-coding your keys lest you expose your code to the public by mistake.
$ export AWS_PROFILE=your_aws_cli_profile_name
or
https://docs.aws.amazon.com/sdk-for-java/v1/developer-guide/setup-credentials.html
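As a minimal sketch of the environment-variable route (the variable names are the standard ones botocore reads; the values are placeholders, and in real code you would export them in the shell rather than hard-code them, per the advice above):
import os
import dask.dataframe as dd

# Standard AWS variables picked up by boto3/botocore (and hence by s3fs).
# With a distributed cluster these must be visible to every worker as well.
os.environ['AWS_ACCESS_KEY_ID'] = '<your access key id>'
os.environ['AWS_SECRET_ACCESS_KEY'] = '<your secret access key>'

df = dd.read_csv('s3://<bucket>/<path to file>.csv')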
For S3 you can use a wildcard match to fetch multiple chunked files:
import dask.dataframe as dd
# Given N CSV files located in S3, read them and compute the total record count
s3_url = 's3://<bucket_name>/dask-tutorial/data/accounts.*.csv'
df = dd.read_csv(s3_url)
print(df.head())
print(len(df))  # triggers a full read of all matched files to count rows