scipy.curve_fit vs. numpy.polyfit different covariance matrices
--
Music by Eric Matyas
https://www.soundimage.org
Track title: Forest of Spells Looping
--
Chapters
00:00 Question
01:45 Accepted answer (Score 2)
03:16 Answer 2 (Score 1)
04:58 Thank you
--
Full question
https://stackoverflow.com/questions/5198...
Accepted answer links:
[this]: https://github.com/numpy/numpy/blob/v1.1...
[Cross Validated]: https://stats.stackexchange.com/
[documentation]: https://docs.scipy.org/doc/scipy/referen...
Answer 2 links:
[tests]: https://symfit.readthedocs.io/en/latest/...
--
Content licensed under CC BY-SA
https://meta.stackexchange.com/help/lice...
--
Tags
#python #numpy #scipy #curvefitting
#avk47
ACCEPTED ANSWER
Score 2
I imagine it has something to do with this, from numpy's polyfit source:
# Some literature ignores the extra -2.0 factor in the denominator, but
# it is included here because the covariance of Multivariate Student-T
# (which is implied by a Bayesian uncertainty analysis) includes it.
# Plus, it gives a slightly more conservative estimate of uncertainty.
if len(x) <= order + 2:
    raise ValueError("the number of data points must exceed order + 2 "
                     "for Bayesian estimate the covariance matrix")
fac = resids / (len(x) - order - 2.0)
if y.ndim == 1:
    return c, Vbase * fac
else:
    return c, Vbase[:,:, NX.newaxis] * fac
Since in your case len(x) - order is 4 and len(x) - order - 2.0 is 2, that would explain why your values differ by a factor of 2.
This answers question 2. The answer to question 3 is likely "get more data": for larger len(x) the difference becomes negligible.
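To make the factor of 2 concrete, here is a minimal sketch (mine, not from the answer) fitting the same quadratic with both routines; with 7 points and 3 parameters, on NumPy versions that include the -2.0 factor every entry of polyfit's covariance comes out twice curve_fit's (newer releases dropped the -2.0, in which case the ratio is ~1):
import numpy as np
from scipy.optimize import curve_fit

rng = np.random.default_rng(0)
x = np.linspace(0, 1, 7)                       # 7 points, 3 parameters
y = 1.0 + 2.0*x + 3.0*x**2 + rng.normal(scale=0.1, size=x.size)

_, cov_poly = np.polyfit(x, y, 2, cov=True)    # scales by resids / (n - order - 2)
popt, cov_cf = curve_fit(lambda x, a, b, c: a*x**2 + b*x + c, x, y)  # resids / (n - p)

print(cov_poly / cov_cf)                       # every entry ~2.0 on affected versions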
Which formulation is correct (question 1) is probably a question for Cross Validated, but I'd assume it is curve_fit's, as that function is explicitly intended to calculate the uncertainties as you state. From the documentation:
pcov : 2d array
The estimated covariance of popt. The diagonals provide the variance of the parameter estimate. To compute one standard deviation errors on the parameters use perr = np.sqrt(np.diag(pcov)).
While the comment in the polyfit code above says its intention is geared more toward a Student-T analysis.
ANSWER 2
Score 1
The two methods compute the covariance in different ways. I'm not exactly sure about the method used by polyfit, but curve_fit estimates the covariance matrix by inverting J.T.dot(J), where J is the Jacobian of the model. Looking at the code of polyfit, it seems to invert lhs.T.dot(lhs), where lhs is defined as the Vandermonde matrix, although I have to admit I do not know the mathematical background of this second method.
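For what it's worth, for a polynomial model the two matrices coincide: the Jacobian of a*x**2 + b*x + c with respect to (a, b, c) is exactly the Vandermonde matrix, so both routines invert the same J.T.dot(J) and differ only in the residual scaling factor discussed in the accepted answer. A quick check (my own sketch, not from the answer):
import numpy as np

x = np.linspace(0, 1, 7)
V = np.vander(x, 3)                              # columns x**2, x**1, x**0
J = np.column_stack([x**2, x, np.ones_like(x)])  # d/da, d/db, d/dc of a*x**2 + b*x + c

print(np.allclose(V, J))                         # True
base_cov = np.linalg.inv(J.T @ J)                # shared unscaled covariance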
Now, as to your question of which is correct, polyfit's code has the following comment:
# Some literature ignores the extra -2.0 factor in the denominator, but
# it is included here because the covariance of Multivariate Student-T
# (which is implied by a Bayesian uncertainty analysis) includes it.
# Plus, it gives a slightly more conservative estimate of uncertainty.
Based on this, and your observation, it would seem that polyfit always gives a larger estimate than curve_fit. This would make sense, because J.T.dot(J) is a first-order approximation to the covariance matrix, so when in doubt, overestimating the error is the safer choice.
However, if you know the measurement errors in your data, I would recommend providing them and calling curve_fit with absolute_sigma=True. From my own tests, doing so matches the analytical results one would expect, so I would be curious to see which of the two performs better when measurement errors are provided.
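As an illustration of that recommendation (my own sketch; the model and error values are made up):
import numpy as np
from scipy.optimize import curve_fit

def model(x, a, b, c):
    return a*x**2 + b*x + c

rng = np.random.default_rng(1)
x = np.linspace(0, 1, 20)
yerr = np.full_like(x, 0.1)                    # known 1-sigma measurement errors
y = model(x, 3.0, 2.0, 1.0) + rng.normal(scale=yerr)

# absolute_sigma=True: pcov reflects yerr directly instead of being
# rescaled by the reduced chi-square of the fit
popt, pcov = curve_fit(model, x, y, sigma=yerr, absolute_sigma=True)
perr = np.sqrt(np.diag(pcov))                  # 1-sigma parameter uncertainties
print(popt, perr)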