Wednesday, May 6, 2015

Large Clustering Problems


1. The Canopies Approach

  • Two distance metrics
    • cheap & expensive
  • First pass
    • very inexpensive distance metric
    • create overlapping canopies
  • Second pass
    • expensive, accurate distance metric
    • canopies determine which distances are calculated
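
The first pass can be sketched in Python as follows. This is only a minimal illustration, not a definitive implementation: the notes do not say how canopies are formed, so the loose and tight thresholds `t1` and `t2`, the function names, and the toy `cheap_distance` are all assumptions made for the example.

```python
def cheap_distance(a, b):
    # Hypothetical inexpensive metric: compare the first coordinate only.
    return abs(a[0] - b[0])


def make_canopies(points, t1, t2, distance=cheap_distance):
    """Group point indices into (possibly overlapping) canopies.

    t1 is the loose threshold and t2 the tight one; both are assumptions
    of this sketch and must satisfy t1 >= t2 >= 0.
    """
    if t1 < t2:
        raise ValueError("t1 must be at least t2")
    remaining = set(range(len(points)))  # points that may still become centers
    canopies = []
    while remaining:
        center = min(remaining)  # deterministic pick; any remaining point works
        dists = [distance(points[center], p) for p in points]
        # Everything within the loose threshold joins this canopy.
        canopies.append({i for i, d in enumerate(dists) if d <= t1})
        # Points within the tight threshold can no longer start a canopy.
        remaining -= {i for i, d in enumerate(dists) if d <= t2}
    return canopies
```

Because a point within `t1` but outside `t2` of a center stays eligible to start its own canopy, neighbouring canopies can share points, which is where the overlap comes from.
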
2. Using Canopies
  • Calculate expensive distances between points in the same canopy
  • All other distances default to infinity
  • Use only the finite distances and iteratively merge the closest pair
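
A rough sketch of the second pass, assuming the canopies are already available as sets of point indices. The Euclidean `expensive_distance`, the single-link merge rule, and the `max_merge_distance` stopping threshold are assumptions chosen for illustration; the notes do not fix any of them.

```python
import math
from itertools import combinations


def expensive_distance(a, b):
    # Stand-in for the accurate metric: Euclidean distance (Python 3.8+).
    return math.dist(a, b)


def canopy_pair_distances(points, canopies, distance=expensive_distance):
    """Compute the expensive distance only for pairs that share a canopy."""
    dists = {}
    for canopy in canopies:
        for i, j in combinations(sorted(canopy), 2):
            if (i, j) not in dists:  # skip pairs already seen in another canopy
                dists[(i, j)] = distance(points[i], points[j])
    return dists  # any pair missing from the dict counts as infinitely far apart


def merge_closest(n_points, dists, max_merge_distance):
    """Single-link merging: repeatedly join the closest pair of clusters."""
    cluster_of = list(range(n_points))  # each point starts in its own cluster
    for (i, j), d in sorted(dists.items(), key=lambda item: item[1]):
        if d > max_merge_distance:
            break  # every remaining finite distance is too large to merge
        a, b = cluster_of[i], cluster_of[j]
        if a != b:
            cluster_of = [a if label == b else label for label in cluster_of]
    return cluster_of
```

Pairs that never share a canopy are simply absent from `dists`, so they can never trigger a merge, which matches treating their distance as infinity.
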
3. Preserve Good Clustering
  • Small, disjoint canopies
    • big time savings
  • Large, overlapping canopies
    • preserves the original, accurate clustering
  • Goal: fast and accurate
    • For every cluster, there exists a canopy such that all points in the cluster are in the canopy
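
The trade-off and the goal above can be checked with two small helpers. This is a sketch under the assumption that clusters and canopies are both given as sets of point indices; the function names are hypothetical.

```python
from itertools import combinations


def count_expensive_pairs(canopies):
    """Number of distinct pairs that need the expensive metric."""
    pairs = set()
    for canopy in canopies:
        pairs.update(combinations(sorted(canopy), 2))
    return len(pairs)


def canopies_preserve_clusters(clusters, canopies):
    """True if every cluster lies entirely inside at least one canopy."""
    return all(
        any(cluster <= canopy for canopy in canopies)  # subset test
        for cluster in clusters
    )
```

For four points, two disjoint canopies of two points each need only 2 expensive distances, while a single canopy holding all four needs all 6; the second helper tells whether the cheaper choice still keeps each true cluster together.
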
