Metric Distribution to Vector: Constructing Data Representation via Broad-Scale Discrepancies

doi:10.48550/arXiv.2210.00415

Journal Article

Xue Liu, +4 more

- 02 Oct 2022

- arXiv.org

- Vol. abs/2210.00415

TL;DR: A novel embedding strategy named MetricDistribution2vec is presented that extracts distribution characteristics into a vectorial representation of each data instance, for pattern classification of graph-structured data.

Abstract: Graph embedding provides a feasible methodology for pattern classification of graph-structured data by mapping each data instance into a vector space. Many pioneering works are essentially coding methods that concentrate on a vectorial representation of the inner properties of a graph, such as its topological constitution, node attributes, and link relations. However, classifying each target instance is a qualitative issue that depends on understanding the overall discrepancies across the whole dataset. From a statistical point of view, these discrepancies manifest as a metric distribution over the dataset when a distance metric is adopted to measure pairwise similarity or dissimilarity. Therefore, we present a novel embedding strategy named MetricDistribution2vec that extracts such distribution characteristics into the vectorial representation of each data instance. We demonstrate the application and effectiveness of our representation method in supervised prediction tasks on extensive real-world structural graph datasets. The results show unexpected gains over a wide range of baselines on all the datasets, even when lightweight models are used as classifiers. Moreover, we also evaluate the proposed method in few-shot classification scenarios, where it remains discriminative even when inference is based on few training samples.

Figures

TABLE 1 Statistics of the benchmark graph datasets. The columns are the name of the dataset, the number of graphs, the number of classes, the average number of nodes, and the average number of edges. Each dataset is balanced across class labels, and NCI-1, NCI-33, NCI-83, and NCI-109 are randomly sampled from the original datasets because of their large size.

Fig. 6. The visualization of the high-dimensional embedded data derived from MetricDistribution2vec in a plane by t-SNE for 12 datasets. In each subplot, different colored nodes represent different labeled graphs, and similar embedded graphs are clustered nearby on the plot.

Fig. 1. A classification example in two dimensions to illustrate the metric distribution. In the scatter plot, each point is labeled with an allocated class denoted by one of two colors (i.e., blue and orange). All the instances are separated by a segmentation boundary, denoted as the dark curve. In each class, we use dark lines to denote the distance between intra-group points and dark dotted lines to represent the distance between inter-group points. In particular, we take the Euclidean distance as the metric in this case. In addition, the metric distributions for v1, v2, v3, and v4 are also shown. Among these four data instances, v1 and v2 are in one class, while v3 and v4 belong to the other. In each subplot, the histogram reports the distance between the target point and each instance (colored according to its class) within the dataset. The red curve exhibits the overall metric distribution trend. Data belonging to the same class clearly have similar metric distance distributions.
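The idea in Fig. 1 can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the paper's MetricDistribution2vec procedure: it uses toy 2D points with the Euclidean distance (as in the figure), and the function name `metric_distribution`, the fixed bin count, and the toy coordinates are all hypothetical choices made for this example.

```python
import math

def metric_distribution(points, index, n_bins, max_dist):
    """Normalized histogram of distances from points[index] to every other point.

    Hypothetical helper: a fixed-width histogram stands in for the
    'metric distribution' of one instance over the whole dataset.
    """
    target = points[index]
    hist = [0.0] * n_bins
    for j, other in enumerate(points):
        if j == index:
            continue  # skip the zero distance to itself
        d = math.dist(target, other)  # Euclidean distance, as in Fig. 1
        # Clamp so that d == max_dist falls into the last bin.
        b = min(int(d / max_dist * n_bins), n_bins - 1)
        hist[b] += 1.0
    total = sum(hist)
    return [h / total for h in hist]

def l1_distance(u, v):
    """Sum of absolute differences between two equal-length vectors."""
    return sum(abs(a - b) for a, b in zip(u, v))

# Toy data (assumed): a tight class A and a more spread-out class B.
points = [
    (0.0, 0.0), (0.1, 0.0), (0.0, 0.1),   # class A
    (4.0, 4.0), (6.0, 4.0), (5.0, 6.0),   # class B
]
max_dist = max(math.dist(p, q) for p in points for q in points)

# One vector per instance: its distance histogram over the dataset.
vectors = [
    metric_distribution(points, i, n_bins=4, max_dist=max_dist)
    for i in range(len(points))
]

same_class_gap = l1_distance(vectors[0], vectors[1])   # A vs. A
cross_class_gap = l1_distance(vectors[0], vectors[3])  # A vs. B
print(same_class_gap, cross_class_gap)
```

In this toy setting the two class-A instances produce much closer histograms than a class-A and a class-B instance, which is the property the caption describes. Any lightweight classifier could then be trained on `vectors`.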

Fig. 4. The illustration of the optimal transportation between frequent fragment decompositions and between vectorial frequent fragment decompositions. The cluster of red lines denotes the transference plan for this transportation scenario.

Fig. 5. The sensitivity of the classification accuracy of MetricDistribution2vec to the min-sup hyper-parameter, using kNN, Logistic Regression, and SVM (RBF Kernel) as classifiers, is reported with different curves. The blue dotted horizontal line denotes the best result among the baselines from Table 3. The different colored vertical lines reflect the best results and the corresponding values of min-sup for MetricDistribution2vec using different classifiers. In addition, the number of frequent fragments (fgs) under different min-sup values is shown by the blue histograms.

Fig. 8. This figure shows the similarity between different metric distributions of the same graph under different sampling rates. In each subplot, the horizontal axis denotes the index of each graph, and the vertical axis denotes the distance between different metric distributions. There are four types of symbols on each graph to represent the differences between metric distributions derived by 90% sampling rate and 50%, 20%, 5% sampling rates, respectively.
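The comparison in Fig. 8 can be approximated with a small experiment. The sketch below is an assumption-laden illustration rather than the paper's protocol: it uses random 2D points instead of graphs, the Euclidean distance instead of a graph metric, and a 1D earth mover's distance between histograms as the (assumed) measure of difference between metric distributions. All names (`distance_histogram`, `emd_1d`) are hypothetical.

```python
import math
import random

def distance_histogram(target, others, n_bins, max_dist):
    """Normalized histogram of distances from `target` to each point in `others`."""
    hist = [0.0] * n_bins
    for p in others:
        b = min(int(math.dist(target, p) / max_dist * n_bins), n_bins - 1)
        hist[b] += 1.0
    total = sum(hist)
    return [h / total for h in hist]

def emd_1d(p, q):
    """Earth mover's distance between two histograms on the same bins.

    For 1D histograms this equals the summed absolute difference of the
    cumulative distributions; the result is in units of bin width.
    """
    cum, total = 0.0, 0.0
    for a, b in zip(p, q):
        cum += a - b
        total += abs(cum)
    return total

rng = random.Random(0)  # fixed seed so the sketch is repeatable
reference = [(rng.uniform(0, 10), rng.uniform(0, 10)) for _ in range(200)]
target = (5.0, 5.0)
max_dist = math.dist((0.0, 0.0), (10.0, 10.0))  # upper bound inside the square

def sampled_histogram(rate):
    """Metric distribution of `target` against a random subsample of the data."""
    k = max(1, int(rate * len(reference)))
    return distance_histogram(target, rng.sample(reference, k), 10, max_dist)

# Compare the 90% sample against sparser samples, mirroring the caption.
baseline = sampled_histogram(0.9)
gaps = {rate: emd_1d(baseline, sampled_histogram(rate)) for rate in (0.5, 0.2, 0.05)}
for rate, gap in gaps.items():
    print(f"rate={rate:.2f}  EMD to 90% sample: {gap:.4f}")
```

Small gaps would indicate that the metric distribution of an instance is stable when only part of the dataset is available; the actual values depend on the data and the random sample.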

References

•Posted Content

Semi-Supervised Classification with Graph Convolutional Networks

Thomas Kipf, +1 more

- 09 Sep 2016

- arXiv: Learning

TL;DR: A scalable approach for semi-supervised learning on graph-structured data, based on an efficient variant of convolutional neural networks that operates directly on graphs and outperforms related methods by a significant margin.

22.7K

•Proceedings Article•10.1145/2623330.2623732

DeepWalk: online learning of social representations

Bryan Perozzi, +2 more

- 24 Aug 2014

TL;DR: DeepWalk uses local information obtained from truncated random walks to learn latent representations by treating walks as the equivalent of sentences; these representations encode social relations in a continuous vector space that is easily exploited by statistical models.

11.4K

•Posted Content

node2vec: Scalable Feature Learning for Networks

Aditya Grover, +1 more

- 03 Jul 2016

- arXiv: Social and Information Networks

TL;DR: In node2vec, an algorithmic framework for learning continuous feature representations for nodes in networks, a flexible notion of a node's network neighborhood is defined and a biased random walk procedure is designed, which efficiently explores diverse neighborhoods.

6.6K

•Book

Topics in Optimal Transportation

Cédric Villani

- 01 Mar 2003

TL;DR: The book considers the metric side of optimal transportation and a differential point of view on it, and investigates the Kantorovich duality of the optimal transportation problem.

6.1K

•Proceedings Article

Laplacian Eigenmaps and Spectral Techniques for Embedding and Clustering

Mikhail Belkin, +1 more

- 03 Jan 2001

TL;DR: The algorithm provides a computationally efficient approach to nonlinear dimensionality reduction that has locality preserving properties and a natural connection to clustering.

5.3K
