Data analysis using R/Python and SSDs
--
Chapters
00:00 Data Analysis Using R/Python And SSDs
00:28 Answer 1 Score 6
00:53 Answer 2 Score 0
01:34 Answer 3 Score 2
01:59 Accepted Answer Score 19
03:45 Thank you
--
Full question
https://stackoverflow.com/questions/4262...
--
Content licensed under CC BY-SA
https://meta.stackexchange.com/help/lice...
--
Tags
#python #r #dataanalysis #solidstatedrive
#avk47
ACCEPTED ANSWER
Score 19
My 2 cents: an SSD only pays off if your applications are stored on it, not your data. And even then, only if a lot of disk access is necessary, as for an OS. People are right to point you to profiling. I can tell you without doing it that almost all of the reading time goes to processing, not to reading from the disk.
It pays off far more to think about the format of your data than about where it's stored. A speedup in reading your data can be obtained by using the right applications and the right format, like R's internal format instead of fumbling around with text files. Make that an exclamation mark: never keep fumbling around with text files. Go binary if speed is what you need.
Due to the overhead, it generally doesn't make a difference whether you have an SSD or a normal disk to read your data from. I have both, and use the normal disk for all my data. I do juggle big datasets around sometimes, and have never run into a problem with it. Of course, if I have to go really heavy, I just work on our servers.
So it might make a difference when we're talking gigs and gigs of data, but even then I doubt very much that disk access is the limiting factor. Unless you're continuously reading from and writing to the disk, but then I'd say you should start thinking again about what exactly you're doing. Instead of spending that money on SSDs, extra memory could be the better option. Or just convince the boss to get you a decent calculation server.
Below is a timing experiment using a bogus data frame, reading and writing in text format vs. binary format, on an SSD vs. a normal disk.
> tt <- 100
> longtext <- paste(rep("dqsdgfmqslkfdjiehsmlsdfkjqsefr",1000),collapse="")
> test <- data.frame(
+ X1=rep(letters,tt),
+ X2=rep(1:26,tt),
+ X3=rep(longtext,26*tt)
+ )
> SSD <- "C:/Temp" # My ssd disk with my 2 operating systems on it.
> normal <- "F:/Temp" # My normal disk, I use for data
> # Write text
> system.time(write.table(test,file=paste(SSD,"test.txt",sep="/")))
user system elapsed
5.66 0.50 6.24
> system.time(write.table(test,file=paste(normal,"test.txt",sep="/")))
user system elapsed
5.68 0.39 6.08
> # Write binary
> system.time(save(test,file=paste(SSD,"test.RData",sep="/")))
user system elapsed
0 0 0
> system.time(save(test,file=paste(normal,"test.RData",sep="/")))
user system elapsed
0 0 0
> # Read text
> system.time(read.table(file=paste(SSD,"test.txt",sep="/"),header=T))
user system elapsed
8.57 0.05 8.61
> system.time(read.table(file=paste(normal,"test.txt",sep="/"),header=T))
user system elapsed
8.53 0.09 8.63
> # Read binary
> system.time(load(file=paste(SSD,"test.RData",sep="/")))
user system elapsed
0 0 0
> system.time(load(file=paste(normal,"test.RData",sep="/")))
user system elapsed
0 0 0
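As a quick follow-up sketch (not part of the original timing run, and assuming the same SSD path and test data frame as above): for a single object, saveRDS() and readRDS() give the same binary-format benefit as save()/load(), and hand the object back directly instead of re-creating the original variable name.
saveRDS(test, file = paste(SSD, "test.rds", sep = "/"))    # write one object in R's binary format
test2 <- readRDS(file = paste(SSD, "test.rds", sep = "/")) # read it back under a new name
identical(test, test2)                                     # should be TRUE: lossless round trip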
ANSWER 2
Score 6
Coding Horror has a good article on SSDs at http://www.codinghorror.com/blog/2010/09/revisiting-solid-state-hard-drives.html, and the comments offer a lot of insight.
It depends on the type of analysis you're doing, i.e. whether it's CPU-bound or IO-bound. Personal experience with regression modelling tells me the former is more often the case, and then an SSD wouldn't be of much use.
In short, it's best to profile your application first.
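A minimal profiling sketch in R, assuming a hypothetical CSV file mydata.csv with columns x and y: Rprof() samples where the time actually goes, so you can see whether a run is dominated by parsing and model fitting (CPU) or by the read itself (IO).
Rprof("profile.out")                   # start the sampling profiler
dat <- read.csv("mydata.csv")          # hypothetical input file
fit <- lm(y ~ x, data = dat)           # hypothetical CPU-bound modelling step
Rprof(NULL)                            # stop profiling
summaryRprof("profile.out")$by.self    # self time per function, largest first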
ANSWER 3
Score 2
I have to second John's suggestion to profile your application. My experience is that it isn't the actual data reads that are the slow part; it's the overhead of creating the programming objects to contain the data, casting from strings, memory allocation, etc.
I would strongly suggest you profile your code first, and consider using alternative libraries (like numpy) to see what improvements you can get before you invest in hardware.
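An R analogue of the same idea (a sketch only, with a hypothetical file name and column types): most of read.table()'s time goes into guessing column types and building the objects, so declaring colClasses up front usually helps more than a faster disk would.
# hypothetical file "big.txt" with a character, an integer and a numeric column
system.time(
  guessed <- read.table("big.txt", header = TRUE)      # column types inferred from the data
)
system.time(
  typed <- read.table("big.txt", header = TRUE,
                      colClasses = c("character", "integer", "numeric"))
)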
ANSWER 4
Score 0
Read and write speeds for SSDs are significantly higher than for standard 7200 RPM disks (an SSD is still worth it compared with a 10k RPM disk; I'm not sure how much of an improvement it is over a 15k). So, yes, you'd get much faster data access times.
The performance improvement is undeniable; then it's a question of economics. 2 TB 7200 RPM disks are $170 apiece, while 100 GB SSDs cost $210. So if you have a lot of data, you may run into a problem.
If you read/write a lot of data, get an SSD. If the application is CPU intensive, however, you'd benefit much more from getting a better processor.