The Python Oracle

What is the appropriate distance metric when clustering paragraph/doc2vec vectors?

--------------------------------------------------
Hire the world's top talent on demand or become one of them at Toptal: https://topt.al/25cXVn
and get a $2,000 discount on your first invoice
--------------------------------------------------

Music by Eric Matyas
https://www.soundimage.org
Track title: Quirky Dreamscape Looping

--

Chapters
00:00 What Is The Appropriate Distance Metric When Clustering Paragraph/Doc2vec Vectors?
01:35 Accepted Answer Score 1
02:18 Answer 2 Score 2
03:11 Thank you

--

Full question
https://stackoverflow.com/questions/5272...

--

Content licensed under CC BY-SA
https://meta.stackexchange.com/help/lice...

--

Tags
#python #clusteranalysis #distance #doc2vec #hdbscan

#avk47



ANSWER 1

Score 2


The proper similarity metric is the dot product, not cosine.

Word2vec and similar models are trained using the dot product, not similarities normalized by vector length, and you should use exactly what was trained.

People use cosine all the time because it worked well for bag-of-words vectors; as far as I know, the choice is not based on a proper theoretical analysis.

HDBSCAN does not require a metric. The 1 - sim transformation assumes the similarity is bounded by 1, which does not hold for an unnormalized dot product, so that won't reliably work.

I suggest trying the following approaches (a short sketch of the second option follows the list):

  • use negative distances; that may simply work, i.e., d(x, y) = -(x · y)
  • use the max-sim transformation: once you have the dot-product matrix, it is easy to get its maximum value, then d(x, y) = max_sim - (x · y)
  • implement HDBSCAN* to work with a similarity rather than a metric
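
As a minimal sketch of the second option, here is one way to feed a max-sim-transformed dot-product matrix to the hdbscan package via its precomputed-metric mode. The random vectors and parameter values below are purely illustrative stand-ins for real doc2vec output:

    import numpy as np
    import hdbscan

    # Stand-in for real doc2vec vectors: 100 documents, 50 dimensions.
    vectors = np.random.default_rng(0).normal(size=(100, 50))

    # Pairwise dot-product similarity matrix.
    sim = vectors @ vectors.T

    # Max-sim transformation: subtract every similarity from the global
    # maximum so all "distances" are non-negative.
    dist = sim.max() - sim
    np.fill_diagonal(dist, 0.0)  # self-distance must be exactly zero

    # hdbscan accepts a precomputed pairwise distance matrix.
    clusterer = hdbscan.HDBSCAN(metric="precomputed", min_cluster_size=5)
    labels = clusterer.fit_predict(dist)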



ACCEPTED ANSWER

Score 1


I believe cosine-distance is used in practice, despite the corner cases where it's not a proper metric.

You mention that "elements of the resulting docvecs are all in the range [-1,1]". That isn't usually guaranteed to be the case – though it would be if you've already unit-normalized all the raw doc-vectors.

If you have done that unit-normalization, or want to, then euclidean-distance will always give the same rank order of nearest neighbors as cosine-distance. The absolute values, and the relative proportions between them, will vary a little – but all "X is closer to Y than Z" tests will be identical to those based on cosine-distance. So clustering quality should be nearly identical to using cosine-distance directly.
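
As a quick check of that claim, here is a small numpy sketch (with random vectors standing in for real doc-vectors) verifying the identity behind it: for unit vectors, ||x - y||^2 = 2 - 2·cos(x, y), so the two distances rank neighbors identically.

    import numpy as np

    rng = np.random.default_rng(0)
    docvecs = rng.normal(size=(200, 50))  # stand-in for raw doc-vectors

    # Unit-normalize each vector to L2 length 1.
    unit = docvecs / np.linalg.norm(docvecs, axis=1, keepdims=True)

    query, rest = unit[0], unit[1:]
    cos_dist = 1.0 - rest @ query                    # cosine distance
    euc_dist = np.linalg.norm(rest - query, axis=1)  # euclidean distance

    # For unit vectors, ||x - y||^2 = 2 - 2*cos(x, y) = 2*cos_dist, a
    # monotonic relationship, so the neighbor orderings must match.
    assert np.allclose(euc_dist ** 2, 2.0 * cos_dist)
    assert np.array_equal(np.argsort(cos_dist), np.argsort(euc_dist))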