scipy.curve_fit vs. numpy.polyfit different covariance matrices
--
Music by Eric Matyas
https://www.soundimage.org
Track title: Forest of Spells Looping
--
Chapters
00:00 Question
01:45 Accepted answer (Score 2)
03:16 Answer 2 (Score 1)
04:58 Thank you
--
Full question
https://stackoverflow.com/questions/5198...
Accepted answer links:
[this]: https://github.com/numpy/numpy/blob/v1.1...
[Cross Validated]: https://stats.stackexchange.com/
[documentation]: https://docs.scipy.org/doc/scipy/referen...
Answer 2 links:
[tests]: https://symfit.readthedocs.io/en/latest/...
--
Content licensed under CC BY-SA
https://meta.stackexchange.com/help/lice...
--
Tags
#python #numpy #scipy #curvefitting
#avk47
ACCEPTED ANSWER
Score 2
I imagine it has something to do with this, from numpy's polyfit source:
# Some literature ignores the extra -2.0 factor in the denominator, but
# it is included here because the covariance of Multivariate Student-T
# (which is implied by a Bayesian uncertainty analysis) includes it.
# Plus, it gives a slightly more conservative estimate of uncertainty.
if len(x) <= order + 2:
    raise ValueError("the number of data points must exceed order + 2 "
                     "for Bayesian estimate the covariance matrix")
fac = resids / (len(x) - order - 2.0)
if y.ndim == 1:
    return c, Vbase * fac
else:
    return c, Vbase[:,:, NX.newaxis] * fac
Since in your case len(x) - order is 4 and len(x) - order - 2.0 is 2, that would explain why your values differ by a factor of 2.
This answers question 2. The answer to question 3 is likely "get more data": for larger len(x) the difference becomes negligible.
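To make the factor of 2 concrete, here is a minimal sketch (mine, not from the answer) fitting the same quadratic with both routines; with 7 points and 3 parameters, on NumPy versions that include the -2.0 factor every entry of polyfit's covariance comes out twice curve_fit's (newer releases dropped the -2.0, in which case the ratio is ~1):
import numpy as np
from scipy.optimize import curve_fit

rng = np.random.default_rng(0)
x = np.linspace(0, 1, 7)                       # 7 points, 3 parameters
y = 1.0 + 2.0*x + 3.0*x**2 + rng.normal(scale=0.1, size=x.size)

_, cov_poly = np.polyfit(x, y, 2, cov=True)    # scales by resids / (n - order - 2)
popt, cov_cf = curve_fit(lambda x, a, b, c: a*x**2 + b*x + c, x, y)  # resids / (n - p)

print(cov_poly / cov_cf)                       # every entry ~2.0 on affected versions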
Which formulation is correct (question 1) is probably a question for Cross Validated, but I'd assume it is curve_fit's, as that function is explicitly intended to calculate the uncertainties as you state. From the documentation:
pcov : 2d array
The estimated covariance of popt. The diagonals provide the variance of the parameter estimate. To compute one standard deviation errors on the parameters use perr = np.sqrt(np.diag(pcov)).
While the comment in the polyfit code above says its intention is geared more toward a Student-T analysis.
ANSWER 2
Score 1
The two methods compute the covariance in different ways. I'm not exactly sure about the method used by polyfit, but curve_fit estimates the covariance matrix by inverting J.T.dot(J), where J is the Jacobian of the model. Looking at the code of polyfit, it seems to invert lhs.T.dot(lhs), where lhs is defined as the Vandermonde matrix, although I have to admit I do not know the mathematical background of this second method.
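For what it's worth, for a polynomial model the two matrices coincide: the Jacobian of a*x**2 + b*x + c with respect to (a, b, c) is exactly the Vandermonde matrix, so both routines invert the same J.T.dot(J) and differ only in the residual scaling factor discussed in the accepted answer. A quick check (my own sketch, not from the answer):
import numpy as np

x = np.linspace(0, 1, 7)
V = np.vander(x, 3)                              # columns x**2, x**1, x**0
J = np.column_stack([x**2, x, np.ones_like(x)])  # d/da, d/db, d/dc of a*x**2 + b*x + c

print(np.allclose(V, J))                         # True
base_cov = np.linalg.inv(J.T @ J)                # shared unscaled covariance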
Now, as to your question of which is correct, polyfit's code has the following comment:
# Some literature ignores the extra -2.0 factor in the denominator, but
# it is included here because the covariance of Multivariate Student-T
# (which is implied by a Bayesian uncertainty analysis) includes it.
# Plus, it gives a slightly more conservative estimate of uncertainty.
Based on this, and your observation, it would seem that polyfit always gives a larger estimate than curve_fit. This would make sense, because J.T.dot(J) is a first-order approximation to the covariance matrix, so when in doubt, overestimating the error is the safer choice.
However, if you know the measurement errors in your data, I would recommend providing them and calling curve_fit with absolute_sigma=True. From my own tests, doing so matches the analytical results one would expect, so I would be curious to see which of the two performs better when measurement errors are provided.
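As an illustration of that recommendation (my own sketch; the model and error values are made up):
import numpy as np
from scipy.optimize import curve_fit

def model(x, a, b, c):
    return a*x**2 + b*x + c

rng = np.random.default_rng(1)
x = np.linspace(0, 1, 20)
yerr = np.full_like(x, 0.1)                    # known 1-sigma measurement errors
y = model(x, 3.0, 2.0, 1.0) + rng.normal(scale=yerr)

# absolute_sigma=True: pcov reflects yerr directly instead of being
# rescaled by the reduced chi-square of the fit
popt, pcov = curve_fit(model, x, y, sigma=yerr, absolute_sigma=True)
perr = np.sqrt(np.diag(pcov))                  # 1-sigma parameter uncertainties
print(popt, perr)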