Published on November 15, 2007
Manifold Discovery Via Metric MDS: Manifold Discovery Via Metric MDS Jeffrey L. Solka Ph.D. David A. Johannsen NSWCDD, Code B20 Staff April 15, 2006 The Topic of Interest Today: The Topic of Interest Today We wish to apply basic techniques of Riemannian geometry to the problem of dimensionality reduction for data analysis. More specifically, we perform metric MDS to various constant (sectional) curvature Riemannian manifolds and examine which gives the lowest value of the loss function (“discovered” manifold structure?). Using the constant curvature metric may regularize data (e.g., “round” vs. “lumpy” sphere). Background: Background “Curse of Dimensionality” (Bellman 1961, Scott 1992) We assume that there are (unknown and non-linear) relationships between observed/measured features, i.e., that the data is sampled from an embedded submanifold (of high codimension). Scott 1992 “Fortunately, statistical data seldom display structure in more than 5 dimensions, so guided projection to those dimensions may be adequate. It is these threshold dimensions from 3 to 5 that are and deserve to be the focus of our visualization efforts.” Background: Background Numerous methods of dimensionality reduction (ranging from classical multidimensional scaling (MDS) to Isometric mapping (ISOMAP), local linear embedding (LLE), and more …) all ultimately project data to Rn for some “small” n. Euclidean space is too restrictive Consider points distributed on a sphere with R2 as the MDS space. We consider a richer (though still very restrictive) class of spaces to which to project: closed and orientable constant curvature manifolds. Feature Geometrization Via the Isometric Figure Mapping (ISOMAP) Procedure: Feature Geometrization Via the Isometric Figure Mapping (ISOMAP) Procedure Steps (naïve): (1) create a graph with observations and calculate k nearest neighbor (epsilon ball distances) Euclidean distances; (2) convert Euclidean distances to geodesic distance using graph theoretic shortest paths (3) apply ordinal MDS to find a configuration of the k-dim. feature vector corresponding to the observations that minimizes a stress function. C R3 Geodesic and ISOMAP Geodesic ISOMAP Geodesic and Associated Nearest Neighbor Graph ISOMAP Stress Plot Swiss Roll Data Analysis Note that the Swiss Roll has a Euclidean structure, so classical MDS works well. Approach - Two Manifolds: Approach - Two Manifolds Closed Orientable Homogeneous Two Manifold Classification : Closed Orientable Homogeneous Two Manifold Classification Universal Covering Space of the Torus: Universal Covering Space of the Torus A convenient way to think of the torus, at least for our implementation is as follows E2/G where G = <(1,0)t, (0,1)t> The Universal Covering Space of the 2 Holed Torus and the Poincare Disk (a Realization of H2): The Universal Covering Space of the 2 Holed Torus and the Poincare Disk (a Realization of H2) Metric MDS - I: Metric MDS - I Given the ISOMAP k neighbor graph obtained from the data, estimate interpoint distances between observations using shortest paths in the graph (i.e., approximate the Riemannian metric). “Project” the data to a model space with a “nice” metric. Accomplished by finding a configuration in the model space that minimizes a loss function (usually a weighted squared difference with interpoint distances obtained using ISOMAP). May be more profitable to think of the process as selecting a new metric on the existent manifold, i.e., regularization. Metric MDS - II: Metric MDS - II So we seek a configuration of points, U, that minimizes “energy” (Sammon map): ISOMAP estimated interpoint distance matrix, The interpoint distance matrix of the configuration computed using the homogeneous metric naturally associated with the space. Metric MDS - III: Metric MDS - III Since MDS spaces are homogeneous and isotropic the minimization methodology is straightforward. Gradient descent minimization Gradient and exponential map computations are independent of the point and direction. Configuration is updated by line search in direction of minus gradient (backtracking with sufficient decrease). Interpoint distances computed in quotient space – this can get expensive for high genus. Spaces are amenable to more efficient minimization techniques (e.g., trust region). These (i.e., space forms) are perhaps the most general class of metric spaces where one can easily do the minimization. Hyperbolic Gradient Equation: Hyperbolic Gradient Equation These computations can be messy. For example, an arbitrary component of the gradient in the Hyperbolic metric (using the Poincaré model of H2) is Exponential Map in the Poincare’ Model: Exponential Map in the Poincare’ Model Exponential map: Geodesic at (x1, x2)T in the direction U = (u1, u2)T and speed ||U||. Exploitation: Exploitation The observations now reside in a space for which we have a “nice” metric, making explicit determination of geodesics easy – all manner of metric data exploration is now possible. Compare with general case of fitting a manifold to the data in which explicit determination of geodesics is impossible. In the 2-manifold case, data visualization is possible. Some Nuisances of the Visualization Process: Some Nuisances of the Visualization Process Visualizing the results of the MDS — for simplicity, restrict attention to the Euclidean case. Of course, the flat torus (i.e., the torus equipped with a flat metric) cannot be isometrically embedded in R3 Most natural way to think of the flat torus is R2/G where G is a discrete subgroup of the full isometry group of that acts discretely and properly discontinuously. For example, let tx = translation by (1,0)T and ty = translation by (0,1)T. Let G = <tx, ty>. Be the group generated by these isometries. The (closure of the) fundamental domain for this action is the unit square — usual way of thinking of T2 as identifying edges of the unit square. So we could do our visualization in this space. Visualization in the Hyperbolic Case: Visualization in the Hyperbolic Case Recall from classification of surfaces that a genus g surface is the identification space of a 4g-gon. If g > 1, we can think of the hyperbolic metric on the surface as being obtained by identifying (via hyperbolic isometries) the edges of a regular hyperbolic 4g-gon. So visualization is accomplished in the fundamental domain. The genus 2 surface obtained by the identification of sides of the hyperbolic octagon. Sample Results Obtained Using Trefoil Knot Data (Flat Torus Case): Sample Results Obtained Using Trefoil Knot Data (Flat Torus Case) Sample Results Obtained Using Data Sampled from S2: Sample Results Obtained Using Data Sampled from S2 Representative Data Visualization in the Hyperbolic Case: Representative Data Visualization in the Hyperbolic Case Presentations and Publications: Presentations and Publications D. Johannsen and J. L. Solka, “Metric MDS to Surfaces,” to be submitted to Journal of Multivariate Analysis,” 2006. D. Johannsen and J. L. Solka, “Generalized MDS for Data Visualization, Clustering, and Classification," presented at and appearing in the Proceedings of Interface 2005: Classification and Clustering 37th Symposium on the Interface, 6/8/2005-6/12/2005. D. Johannsen and J. L. Solka, “Generalized MDS for Data Exploration,” 8th Joint Conference on Information Sciences 2005 (HCIS) Salt Lake City, UT July 21-26, 2005. D. Johannsen and J. L. Solka, “Generalized MDS for Data Visualization, Clustering, and Classification," George Mason University Statistics Colloquium, 3/11/2005. J. L. Solka and D. A. Johannsen, “Modern Geometric Methods for Dimensionality Reduction,” International Federation of Classification Societies, July 15-18, 2004. Questions/Talking Points???: Questions/Talking Points??? What is the utility of these projections especially the hyperbolic case? Clustering Classification Visualization Are there datasets that naturally occur that may manifest hyperbolic structure? What about applications to higher dimensional cases such as T3 for example? What about the general case of three-manifolds? What about the case of involving non-homogeneity of dimensionality? Backups: Backups Grand Plan – Two Manifolds: Grand Plan – Two Manifolds Generalized MDS to the appropriate target space A Birds Eye View of Our Data Analysis Framework: A Birds Eye View of Our Data Analysis Framework Grand Plan – Two Dimensions: Grand Plan – Two Dimensions In low dimensions, geometry and topology of closed and orientable manifolds are intimately related. In two dimensions, the familiar Gauss-Bonnet Theorem tells the whole story For any boundaryless two-dimensional Riemannian manifold, the integral of the Gaussian curvature over the entire manifold with respect to area is 2p times the Euler characteristic of the manifold. The Euler-Poincaré or Euler Characteristic: The Euler-Poincaré or Euler Characteristic Descartes-Euler Polyhedral Formula V + F – E = 2 V = # vertices, F = # faces, and E = # edges c(g) = 2 – 2g g = genus Descartes-Euler Polyhedral Formula corresponds to the case of genus 0 Point Cloud to Simplicial Complex - I: Point Cloud to Simplicial Complex - I Recall a (Euclidean) Simplicial complex, K, is a collection of simplices such that s in K then every face of s is in K The intersection of any two is either empty of a face of each (Local Finiteness) Every point in a simplex of K has a neighborhood that intersects many simplices of K. Examples Euclidean polyhedra Triangulated surfaces Point Cloud to Simplicial Complex - II: Point Cloud to Simplicial Complex - II Proposed methodology Tight Cocone (Dey and Goswami, 2003) is the basis for the current algoritm for “manifold recovery” – Given (strong) assumptions on density, Tight Cocone guarantees a resultant triangulated surface is homeomorphic to the original surface from which the samples were drawn. Point Cloud to Simplicial Complex - III: Point Cloud to Simplicial Complex - III Tight Cocone is a Voronoi-based approach. Construct Voronoi diagram, Vp, and its dual, the Delaunay traingulation, Dp, for P our dataset. For each point p in the dataset determine the pole, p+ (and its associated pole vector, vp = p+ - p), and cocone Cp. Methodology is not in general restricted to codimension 1 one would merely have several normal directions Voronoi Diagram and It’s Dual the Delaunay Triangulation: Voronoi Diagram and It’s Dual the Delaunay Triangulation http://www.cs.berkeley.edu/~jordans/research/classes/meshes/voronoi/images/voronoi_delaunay_v_dual_100.gif Delaunay Triangulation (black) and Voronoi Diagram (red) Point Cloud to Simplicial Complex -IV : Point Cloud to Simplicial Complex -IV Select triangles in Dp whose dual Voronoi edges intersect Cp Repeat for all points in the dataset There are lots of redundant triangles, so use “manifold extraction” algorithm to eliminate redundant triangles. There are likely holes in regions of undersampling and junk triangles in regions of high curvature (where vp is a bad approximation to the normal at p) Anomalous Behavior in Tight Cocone: Anomalous Behavior in Tight Cocone Tamal K. Dey, Samrat Goswami, “Tight cocone: a water-tight surface reconstructor,” ACM Symposium on Solid Modeling and Applications Proceedings of the eighth ACM symposium on Solid modeling and applications, 2003, pp. 127-134. Point Cloud to Simplicial Complex -V: Point Cloud to Simplicial Complex -V Limitations e-sampled is a strong assumption. Currently, the algorithm is implemented for surfaces in R3 and will need to be extended. Computing Voronoi regions is expensive in the ambient dimensionality. Golub Data: Golub Data Golub et al., “Molecular Classification of Cancer: Class Discovery and Class Prediction by ...,” Science 1999 286: 531-537 7129 gene expression values measured on 72 leukemia patients ALL T and B cell variant AML Down selected to 1729 genes based on an absolute expression larger than 20 across all samples Golub MDS Results in Spherical Target Space: Golub MDS Results in Spherical Target Space Orange = ALL Black = AML Golub MDS Results in Toroidal Target Space: Golub MDS Results in Toroidal Target Space Orange = ALL Black = AML Golub MDS Results in Hyperbolic (genus 2) Target Space: Golub MDS Results in Hyperbolic (genus 2) Target Space Blue = ALL Red = AML knn Performance on the Golub Data Under Various Metrics: knn Performance on the Golub Data Under Various Metrics Weighted knn Performance on the Golub Data Under Various Metrics: Weighted knn Performance on the Golub Data Under Various Metrics Grand Plan – Three Dimensions : Grand Plan – Three Dimensions Three-dimensional manifolds possess what is known as a standard two-level decomposition. First, there is the connected sum decomposition, which says that every compact three-manifold is the connected sum of a unique collection of prime three-manifolds. The second decomposition is the Jaco-Shalen-Johannson torus decomposition, which states that irreducible orientable compact 3-manifolds have a canonical (up to isotopy) minimal collection of disjointly embedded incompressible tori such that each component of the 3-manifold removed by the tori is either "atoroidal" or "Seifert-fibered." Grand Plan - Three Dimensions: Grand Plan - Three Dimensions Thurston's conjecture is that, after you split a three-manifold into its connected sum and the Jaco-Shalen-Johannson torus decomposition, the remaining components each admit exactly one of the following geometries: Euclidean geometry Hyperbolic geometry Spherical geometry, The geometry of S2 x R The geometry of H2 x R The geometry of SL2R Nil geometry, or Sol geometry. Many believe that Dr. Grigori (Grisha) Perelman has recently proven this conjecture Developing a Little Intuition on Three Manifolds - I: Developing a Little Intuition on Three Manifolds - I http://www.etsu.edu/physics/etsuobs/starprty/120598bg/section6.htm The three flat torus, T3 Developing a Little Intuition on Three Manifolds - II : Developing a Little Intuition on Three Manifolds - II The universal covering space of the three flat torus T3 Einstein sees his own back in the T3. http://www1.kcn.ne.jp/~iittoo/usDraft_A2.htm Friedman’s Relationships Between Energy Density and Curvature in the Universe: Flat Universe (0 curvature) Friedman’s Relationships Between Energy Density and Curvature in the Universe Elliptical Universe (Positive curvature) Hyperbolic Universe (Negative curvature) Implications of Manifold Theory to Cosmology – The Universe as T3: Implications of Manifold Theory to Cosmology – The Universe as T3 Suppose that the appropriate model of our universe is T3 This space is flat (that is, it has Euclidean geometry and zero curvature) and is finite. If this manifold is the shape of our universe, then we can imagine the big bang as the point at which a 3-torus started expanding. As time goes by, this 3-torus continues to expand. Since this manifold is flat, in this model the universe continues to expand forever (but just barely). This model illustrates that we can have the philosophically desirable property of a finite universe, even though the universe is "open" and will continue to expand forever Implications of Manifold Theory to Cosmology – The Universe as Siefert Weber Space: Implications of Manifold Theory to Cosmology – The Universe as Siefert Weber Space From the Shape of Space by J. Weeks, 2002 Every face of the dodecahedron is glued to the opposite face with a three-tenths clockwise turn Tiling of H3 by Congruent Right Angle Dodecahedron: Tiling of H3 by Congruent Right Angle Dodecahedron William P. Thurston, Three-Dimensional Geometry and Topology – Vol. I, Princeton University Press, 1997. Future Work: Future Work Testing the utility of the ISOMAP-based procedure on synthetic datasets Application of the procedure to sponsor datasets Investigation of the effect of noise on the procedure Investigation/Study of alternative approaches Spectral approaches Formulation of a strategy for the three-manifold case Yikes !!! Terminology: Terminology manifold A manifold is a topological space that is locally Euclidean (i.e., around every point, there is a neighborhood that is topologically the same as the open unit ball in ). More formally, any object that can be "charted" is a manifold. smooth manifold (differentiable manifold) Smooth manifolds (also called differentiable manifolds) are manifolds for which overlapping charts "relate smoothly" to each other, meaning that the inverse of one followed by the other is an infinitely differentiable map from Euclidean space to itself. Manifolds arise naturally in a variety of mathematical and physical applications as "global objects." For example, in order to precisely describe all the configurations of a robot arm or all the possible positions and momenta of a rocket, an object is needed to store all of these parameters. The objects that crop up are manifolds. From the geometric perspective, manifolds represent the profound idea having to do with global versus local properties. mathworld.wolfram.com Terminology: Terminology submanifold A submanifold is a subset of a manifold that is itself a manifold, but has smaller dimension. For example, the equator of a sphere is a submanifold. Many common examples of manifolds are submanifolds of Euclidean space. In fact, Whitney showed in the 1930s that any manifold can be embedded in RN, where N = 2n + 1. Riemannian manifold A smooth manifold with a metric is called a Riemannian manifold. A manifold possessing a metric tensor. For a complete Riemannian manifold, the metric d(x,y) is defined as the length of the shortest curve (geodesic) between x and y. mathworld.wolfram.com Terminology: Terminology Riemannian metric Suppose for every point x in a manifold M, an inner product is defined on a tangent space of M at x. Then the collection of all these inner products is called the Riemannian metric. embedding An embedding is a representation of a topological object, manifold, graph, field, etc. in a certain space in such a way that its connectivity or algebraic properties are preserved. For example, a field embedding preserves the algebraic structure of plus and times, an embedding of a topological space preserves open sets, and a graph embedding preserves connectivity. mathworld.wolfram.com Terminology: Terminology codimension Codimension is a term used in a number of algebraic and geometric contexts to indicate the difference between the dimension of certain objects and the dimension of a smaller object contained in it. This rough definition applies to vector spaces (the codimension of the subspace (4, -1, 10) in R3 is 3-1 = 2) and to topological spaces (with respect to the Euclidean topology and the Zariski topology, the codimension of a sphere in R3 is 3-2 = 1). closed manifold A compact, in the topological sense, manifold without boundary genus A topologically invariant property of a surface defined as the largest number of nonintersecting simple closed curves that can be drawn on the surface without separating it. Roughly speaking, it is the number of holes in a surface. mathworld.wolfram.com Terminology: Terminology Homeomorphism A homeomorphism, also called a continuous transformation, is an equivalence relation and one-to-one correspondence between points in two geometric figures or topological spaces that is continuous in both directions. A homeomorphism which also preserves distances is called an isometry. Affine transformations are another type of common geometric homeomorphism. mathworld.wolfram.com References: References R. E. Bellman, Adaptive Control Processes: A Guided Tour, Princeton University Press, Princeton, NJ, 1961. Scott, D.W. , Multivariate Density Estimation, NY: Wiley, 1992. William P. Thurston, Three-Dimensional Geometry and Topology – Vol. I, Princeton University Press, 1997. Jeffrey R. Weeks, The Shape of Space, 2nd Edition, Marcel Dekker, Inc., 2002.