1. How are EA methods ranked by effectiveness?
In Sect.4.2.1, EA methods are ranked by effectiveness using a non-parametric Friedman test and Nemenyi post-hoc test. This methodology statistically evaluates the effectiveness of each method across all datasets in the testbed. The results are then used to compare and analyze the trade-offs between effectiveness and efficiency in Sect.4.2.2, as shown in Table 14 and Fig. 10. Additionally, the training curves and time to reach 90% of the highest MRR for knowledge graph embedding methods are presented in Fig. 11, providing further insights into the effectiveness and efficiency of these methods.
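The "time to reach 90% of the highest MRR" mentioned above can be read off a training curve as follows. This is a minimal sketch: the function name `time_to_threshold` and the checkpoint values are hypothetical and are not taken from Fig. 11.

```python
def time_to_threshold(curve, fraction=0.9):
    """Return the first timestamp at which the metric reaches `fraction` of its maximum.

    `curve` is a list of (elapsed_seconds, mrr) pairs in chronological order.
    """
    if not curve:
        raise ValueError("curve must contain at least one checkpoint")
    best = max(mrr for _, mrr in curve)
    target = fraction * best
    for elapsed, mrr in curve:
        if mrr >= target:
            return elapsed


# Hypothetical checkpoints: (seconds since start, MRR on validation data).
curve = [(10, 0.20), (20, 0.45), (30, 0.56), (40, 0.60), (50, 0.59)]
seconds_needed = time_to_threshold(curve)
print(seconds_needed)
```

Comparing this value across methods gives one number per method for the effectiveness-efficiency trade-off, independent of how long each method keeps training after it plateaus.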
2. What is the entity alignment problem in knowledge graphs?
The entity alignment problem in knowledge graphs involves finding matching pairs of entities between two knowledge graphs. Given a source knowledge graph (KG1) with entities (E1), relations (R1), attributes (A1), literals (L1), and relation/attribute edges (X1, Y1), and a target knowledge graph (KG2) with entities (E2), relations (R2), attributes (A2), literals (L2), and relation/attribute edges (X2, Y2), the task is to find a subset of matching pairs (M) of entities from E1 and E2 that are equivalent. This subset can be used as a seed alignment for training. The problem assumes that every entity is the head of at least one relation edge, and there is a 1-to-1 constraint, meaning each entity in E1 should be matched to exactly one entity in E2, and vice versa. The goal is to find the matches (denoted by dashed edges in a visual representation) between the two knowledge graphs.
3. What are relation-based entity alignment methods?
Relation-based entity alignment methods, such as MTransE, MTransE + RotatE, RDGCN, and RREA, exploit relation edges in knowledge graph embeddings. These methods align entities from different knowledge graphs by encoding them in a common embedding space. They learn low-dimensional vector representations primarily from the relational structure of the KGs, in contrast to attribute-based methods, which also exploit factual information such as entity names and literals. The alignment module then uses techniques like sharing, swapping, and mapping to produce alignment results based on distance metrics. Overall, relation-based methods have proven effective in entity alignment tasks, contributing to improved knowledge graph integration and data fusion.
4. What are the main differences between translational and graph neural network embedding methods?
Translational knowledge graph embedding (KGE) methods, such as TransE and RotatE, learn embeddings from the relational structure of entities and relations. They assume that structural similarity is crucial for entity alignment. On the other hand, Graph Neural Network (GNN) methods like Graph Convolutional Networks (GCNs) and Graph Attention Networks (GATs) exploit the literal values of attributes, entity names, and attribute names. GNN methods consider the graph structure and node features to learn embeddings. The choice between translational and GNN methods depends on the specific characteristics of the knowledge graph and the desired alignment performance. Translational methods are effective when the relational structure is the primary source of information, while GNN methods are suitable when attribute values and entity names play a significant role in entity alignment.
5. What is the key difference among translational methods?
The key difference among translational methods lies in their ability to capture complex graph structures such as cycles. This is achieved by adopting appropriate operators in the scoring function. TransE, for example, represents entities and relations in the same vector space, where the relation is equivalent to the translation of vectors from the head entity to the tail entity. On the other hand, RotatE infers various relation patterns, such as symmetries, by mapping entities and relations to the complex vector space and defining each relation as a rotation from the head entity to the tail entity. Both methods create negative relation edges by replacing the head or tail of positive relation edges randomly, ensuring that the negative edge does not exist in the Knowledge Graph (KG).
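The contrast between translation and rotation can be illustrated with a toy symmetric relation. The sketch below assumes Euclidean norms for both distances and uses hand-picked two-dimensional vectors; the function names are illustrative and do not come from any particular library.

```python
import cmath
import math


def transe_distance(head, relation, tail):
    """Norm of head + relation - tail; small for plausible edges."""
    return math.sqrt(sum((h + r - t) ** 2 for h, r, t in zip(head, relation, tail)))


def rotate_distance(head, phases, tail):
    """Norm of head * exp(i * phase) - tail, with complex-valued entity vectors."""
    return math.sqrt(sum(abs(h * cmath.exp(1j * p) - t) ** 2
                         for h, p, t in zip(head, phases, tail)))


# TransE: a translation that fits (h, r, t) cannot also fit (t, r, h) unless r = 0.
h_real, t_real, r_real = [0.0, 0.0], [1.0, 0.0], [1.0, 0.0]
transe_forward = transe_distance(h_real, r_real, t_real)
transe_backward = transe_distance(t_real, r_real, h_real)

# RotatE: a rotation by pi maps h to -h and -h back to h, so both directions fit.
h_cplx = [1 + 2j, -0.5 + 1j]
t_cplx = [-x for x in h_cplx]
phases = [math.pi, math.pi]
rotate_forward = rotate_distance(h_cplx, phases, t_cplx)
rotate_backward = rotate_distance(t_cplx, phases, h_cplx)

print(transe_forward, transe_backward)
print(rotate_forward, rotate_backward)
```

With the translation, the reversed edge ends up at distance 2, whereas the rotation scores both directions as (numerically) zero, which is the sense in which RotatE can model symmetric relations.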
6. What are Graph Neural Network methods?
Graph Neural Network (GNN) methods are designed to handle complex graph structures that traditional translational methods cannot. GNNs learn entity embeddings by recursively aggregating the representations of neighboring nodes, relying on message passing. They implement different aggregation strategies, with standard graph convolutional networks (GCNs) and graph attention networks (GATs) being core components. GCNs treat the knowledge graph as an undirected graph, using filters to aggregate entity embeddings while preserving structural information. GATs expand GCNs' aggregation function with an attention mechanism, assigning different weights to each neighbor. Both GCNs and GATs are evaluated in RDGCN and RREA methods.
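The difference between the two aggregation strategies can be sketched on a three-node graph. This is a deliberately simplified illustration: it omits the learnable weight matrices and non-linearities of real GCN and GAT layers, uses a plain mean instead of degree-normalised filters, and substitutes a dot-product score for the learned attention function. All names and values are hypothetical.

```python
import math


def mean_aggregate(embeddings, neighbors, node):
    """GCN-style step: average the node's own vector with its neighbours' vectors."""
    group = [node] + neighbors[node]
    dim = len(embeddings[node])
    return [sum(embeddings[n][d] for n in group) / len(group) for d in range(dim)]


def attention_aggregate(embeddings, neighbors, node):
    """GAT-style step: weight each neighbour by a softmax over similarity scores."""
    group = [node] + neighbors[node]
    scores = [sum(a * b for a, b in zip(embeddings[node], embeddings[n])) for n in group]
    peak = max(scores)  # subtract the maximum for numerical stability
    weights = [math.exp(s - peak) for s in scores]
    total = sum(weights)
    dim = len(embeddings[node])
    return [sum(w / total * embeddings[n][d] for w, n in zip(weights, group))
            for d in range(dim)]


embeddings = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [1.0, 1.0]}
neighbors = {"a": ["b", "c"], "b": ["a"], "c": ["a"]}  # undirected toy graph

mean_vec = mean_aggregate(embeddings, neighbors, "a")
attn_vec = attention_aggregate(embeddings, neighbors, "a")
print(mean_vec, attn_vec)
```

In the mean version every neighbour contributes equally, while the attention version down-weights node "b", whose vector is least similar to that of "a".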
7. What are the three techniques for the alignment module S A?
The three techniques for the alignment module S A are sharing, swapping, and mapping. Sharing involves calibrating the axis of the embedding spaces of the two KGs to align entities and preserve the initial structure of the KGs. Swapping unifies multiple embedding spaces and consumes more relation edges for training. Mapping learns a transformation matrix to align entities in separate embedding spaces without assuming spatial similarity of the KGs. These techniques are extensively compared in Sect.2.5.2.
8. How does sharing minimize embedding distance?
Sharing iteratively updates entity embeddings to minimize the distance between each entity and its aligned counterpart in the other KG. By calibrating the axes of the embedding spaces, it makes the vectors of the same entity overlap in a unified embedding space. The technique starts from two KGs encoded in different embedding spaces and ends with both KGs in a unified space. Fig. 5 demonstrates the process: it shows the entity embeddings of KG1 and KG2 in the L1 and L2 embedding spaces, along with the updates to the embeddings of the entities in the seed alignment. Sharing assumes that aligned entities are spatially similar, and this is what allows the entity vectors of the same entity in both KGs to be pulled together.
9. What is swapping in knowledge graph alignment?
Swapping is a variation of sharing that produces extra positive edges while preserving the same objective as sharing. Given two aligned entity pairs and a relation edge, it produces two new positive edges that are fed into the KG embedding model. This increases the training data, which benefits the quality of the embeddings. Swapping does not introduce a new loss function. It is used in knowledge graph embedding methods for entity alignment, as described in Sects.2.5.1.2 and 2.5.6.
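One way to read the swapping step is sketched below: for a relation edge whose head and tail both appear in the seed alignment, each aligned endpoint is replaced in turn by its counterpart from the other KG. The exact edge construction is an assumption based on the description above, and the entity identifiers are made up.

```python
def swap_edges(edge, alignment):
    """Return extra positive edges obtained by swapping aligned endpoints.

    `alignment` maps an entity of one KG to its counterpart in the other KG.
    """
    head, relation, tail = edge
    extra = []
    if head in alignment:
        extra.append((alignment[head], relation, tail))
    if tail in alignment:
        extra.append((head, relation, alignment[tail]))
    return extra


# Hypothetical seed alignment and relation edge.
alignment = {"kg1:Paris": "kg2:Paris", "kg1:France": "kg2:France"}
edge = ("kg1:Paris", "capitalOf", "kg1:France")
new_edges = swap_edges(edge, alignment)
print(new_edges)
```

The new edges are simply added to the positive training set, which is why no additional loss term is needed.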
10. What is the purpose of mapping in entity embeddings?
Mapping aims to learn a matrix M as a linear transformation on entity vectors from L_i to L_j, minimizing the embedding distance between each linearly transformed entity e and its aligned entity e' from the seed alignment. It learns the mappings between two embedding spaces (L1 to L2) without assuming the similarity of spatial emergence. Mapping treats the learned mappings as topological transformations, keeping the two KGs encoded in different embedding spaces. This helps align entities from different knowledge graphs (KGs) by learning the linear transformation that brings them closer in the embedding space, without forcing the entity vectors to overlap.
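The objective described above can be made concrete with a small sketch that only evaluates the loss for a given matrix (it does not train M). It assumes squared Euclidean distance as the embedding distance, and the seed pairs are invented so that the second space is a 90-degree rotation of the first.

```python
def apply_map(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def mapping_loss(matrix, seed_pairs):
    """Sum of squared distances between mapped source vectors and their targets."""
    total = 0.0
    for source, target in seed_pairs:
        mapped = apply_map(matrix, source)
        total += sum((a - b) ** 2 for a, b in zip(mapped, target))
    return total


# Hypothetical seed alignment: (embedding in space 1, embedding in space 2).
seed_pairs = [([1.0, 0.0], [0.0, 1.0]), ([0.0, 2.0], [-2.0, 0.0])]

identity = [[1.0, 0.0], [0.0, 1.0]]   # pretends both spaces already coincide
rotation = [[0.0, -1.0], [1.0, 0.0]]  # 90-degree rotation between the spaces

loss_identity = mapping_loss(identity, seed_pairs)
loss_rotation = mapping_loss(rotation, seed_pairs)
print(loss_identity, loss_rotation)
```

The rotation reaches zero loss even though no aligned vectors coincide in their original spaces, which illustrates why mapping does not need the entity vectors to overlap.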
11. What are relation-based KG embedding methods and how do they utilize structural information?
Relation-based KG embedding methods are supervised techniques that use only the structural information (relation edges) for learning entity embeddings. These methods focus on capturing the structure of knowledge graphs (KGs) by leveraging the relationships between entities. One example is MTransE, which is a translation-based model for multilingual KG embeddings. It utilizes a simplified version of TransE, where no negative relation edges are considered. Another example is MTransE+RotatE, which incorporates RotatE as the loss function for the embedding module. RDGCN is another method that uses Graph Convolutional Networks (GCNs) to incorporate structural information in entity embeddings. It constructs a primal graph by merging two KGs and a dual graph by creating nodes for each relation type. The resulting entity embeddings are refined using a mapping alignment technique. RREA integrates GCNs and GATs with a Relational Reflection Transformation to obtain relation-specific embeddings. It stacks multiple GNN layers to capture multi-hop neighborhood information and reflects entity embeddings across different relational hyperplanes. The final entity embeddings are refined using the sharing alignment technique. These methods aim to minimize a loss function that considers both the embedding module and the alignment module, with the objective of capturing the structure and relationships within the KGs.
12. What are the three different views used to learn entity embeddings in MultiKE?
MultiKE learns entity embeddings by exploring three different views: the name view Θ(1), the relation view Θ(2), and the attribute view Θ(3). The name view Θ(1) is defined as the name of the entity. The relation view Θ(2) adopts TransE to learn the entity embeddings of the two KGs, minimizing a loss function that considers both positive and negative relation edges. The attribute view Θ(3) uses TransE to learn embeddings from the attributes and their values, minimizing a loss function that considers both positive and negative attribute edges. These three views are used to generate a unified embedding space for the final entity embeddings.
13. How does KDCoE utilize weakly aligned KG for semi-supervised entity alignment?
KDCoE leverages a weakly aligned KG for semi-supervised entity alignment by co-training two embedding models, KGEM and DEM, using long textual descriptions of entities. KGEM focuses on the structure of the KG, while DEM processes textual descriptions. Both models iteratively propose new aligned entity pairs, enriching the seed alignment. KGEM uses TransE as its embedding module, and DEM maximizes the log likelihood of aligned entities based on description embeddings. A model proposes an aligned pair if the embedding distance is below a threshold. KDCoE also employs a self-attention GRU to generate entity embeddings from textual descriptions, as described in Chen et al. (2018).
14. What is AttrE method?
AttrE is an unsupervised method that leverages structural embeddings and attribute character embeddings for entity alignment. It minimizes the embedding distance between entities with similar attribute character embeddings, using TransE for structure embeddings and a loss function that combines structure and attribute embeddings. AttrE aligns entities based on a cosine similarity threshold. It consists of four modules: schema alignment, structure embedding, attribute character embedding, and the alignment module. It uses predicate alignment to merge the KGs and rename predicates with similar names. AttrE is effective in aligning entities with similar attributes and structures, as demonstrated in Fig. 7.
15. How do embedding-based entity alignment methods use structural information?
Embedding-based entity alignment methods utilize structural information to learn entity embeddings. They exploit local structural information of entities, ignoring distant neighbors in one-hop methods. More relations per entity lead to better results, as entities can minimize embedding distance with similar entities and learn from multiple features. Multi-hop methods use subgraph structure, focusing on extended neighborhoods and aggregating multi-hop neighbor embeddings. RDGCN and RREA(semi) follow a multihop approach, increasing expressiveness but risking noisy information. BERT_INT compares entity pairs from multi-hop neighbors based on factual information.
16. What are the two methods for learning literal-value embeddings?
There are two methods for learning literal-value embeddings: character-based and word-based. The character-based method uses the characters of the literal to learn the final literal embedding, often exploiting a small part of the literals (e.g., the first few characters) and is mostly used for short literals, such as dates. The word-based method typically exploits more information from the literal value (e.g., the first few words) and is mostly used for longer literals, such as names and descriptions. However, word-based methods assume the existence of the word embeddings in the pre-trained set for all the words that appear in the literal values, which is not always the case, resulting in out-of-vocabulary errors. Hybrid methods combine both character-based and word-based methods to handle out-of-vocabulary words. Regardless of the method used, the size of the substring of the literal that will be used for learning embeddings is an important hyperparameter that defines the literal size.
17. What are the differences between supervised, semi-supervised, and unsupervised methods?
Supervised methods require seed alignment for training and testing, while semi-supervised methods work well with a small percentage of aligned entities and may use auxiliary information. Unsupervised methods do not use seed alignment and rely on literal similarity for alignment. Supervised methods assume exact matches between entities of KG1 and KG2, while unsupervised methods do not. Semi-supervised methods use specific rules for alignment, such as mutual nearest alignment. Each family has its own challenges and requirements for successful alignment in knowledge graphs.
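The mutual nearest rule mentioned above can be sketched as follows: a pair is proposed only when each entity is the other's most similar candidate. The similarity matrix is hypothetical, and how a given method combines this rule with thresholds or iterations is not covered here.

```python
def mutual_nearest_pairs(similarity):
    """Return (row, column) pairs that are each other's highest-similarity match.

    `similarity[i][j]` is the similarity between entity i of one KG and entity j of the other.
    """
    n_rows, n_cols = len(similarity), len(similarity[0])
    best_for_row = [max(range(n_cols), key=lambda j: similarity[i][j])
                    for i in range(n_rows)]
    best_for_col = [max(range(n_rows), key=lambda i: similarity[i][j])
                    for j in range(n_cols)]
    return [(i, j) for i, j in enumerate(best_for_row) if best_for_col[j] == i]


# Hypothetical similarities between three entities of each KG.
similarity = [
    [0.9, 0.2, 0.1],
    [0.8, 0.3, 0.4],
    [0.1, 0.2, 0.7],
]
proposed = mutual_nearest_pairs(similarity)
print(proposed)
```

Entity 1 of the first KG prefers entity 0 of the second, but that entity prefers someone else, so no pair is proposed for it.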
18. What are the two methods for schema alignment?
The two methods for schema alignment are AttrE and MultiKE. AttrE measures similarity based on attribute name similarity, while MultiKE considers both name similarity and structural (semantic) similarity in the embedding space. MultiKE performs better when dealing with heterogeneous naming schemes in Knowledge Graphs (KGs), as it exploits semantic similarity in addition to name similarity. This makes it more effective in finding similar relations and attributes, especially in cases where the initial KGs have different naming conventions. By leveraging both name and semantic similarities, MultiKE provides a more comprehensive approach to schema alignment, enhancing the process of aligning and integrating different KGs.
19. What is negative sampling in KG embedding models?
Negative sampling is the process of generating negative examples, that is, edges that do not exist in the KG. It involves replacing either the head or the tail entity of each positive edge with another random entity or a highly similar neighbor. Truncated negative sampling produces difficult negative samples that contribute more to the learning process. Negative sampling is widely used in KG embedding models to maximize the embedding distance of dissimilar entities. The higher the ratio of negative to positive triples, the better the methods perform. However, a high ratio requires larger KGs and increases training time, which raises scalability issues.
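The uniform variant of this procedure can be sketched in a few lines. The function name `corrupt`, the retry limit, and the toy KG are assumptions made for illustration; truncated sampling, which picks similar neighbours instead of random entities, is not shown.

```python
import random


def corrupt(edge, entities, known_edges, rng, max_tries=100):
    """Replace the head or the tail with a random entity, rejecting edges that exist in the KG."""
    head, relation, tail = edge
    for _ in range(max_tries):
        candidate = rng.choice(entities)
        if rng.random() < 0.5:
            negative = (candidate, relation, tail)  # corrupt the head
        else:
            negative = (head, relation, candidate)  # corrupt the tail
        if negative not in known_edges:
            return negative
    return None  # no valid negative found within the retry budget


# Hypothetical toy KG.
entities = ["a", "b", "c", "d"]
known_edges = {("a", "r", "b"), ("b", "r", "c")}
rng = random.Random(0)
negative = corrupt(("a", "r", "b"), entities, known_edges, rng)
print(negative)
```

Because the positive edge itself is in `known_edges`, a draw that reproduces it is rejected as well, so the returned edge always differs from the positive one in its head or its tail.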
20. What neural network architectures are used in relation-based methods?
Relation-based methods utilize either shallow neural networks or Graph Neural Networks. MTransE and MTransE+RotatE use one layer for embedding entities and one layer for embedding entity relations. RDGCN consists of four Graph Attention Networks, two primal layers, and two dual layers. Both versions of RREA utilize two stacked GATs. These architectures are designed to learn embeddings and interactions between entities and relations effectively.
21. What are the meta-features used in Table 5?
The meta-features in Table 5 are constructed by aggregating statistics from Table 4. They include Avg_Rels_per_Entity, Avg_Attrs_per_Entity, Sole_Rels, Hyper_Rels, and #Ents_Descr. These features represent the average relations and attributes per entity, the proportion of sole and hyper relation types, and the number of entities with textual descriptions. Additionally, #Entity_Pairs, Descr_Sim, Ents_Name_Sim, Lit_Sim, and Pred_Name_Sim are meta-features without aggregation functions. These features measure the similarity of entity names, literals, and predicate names, as well as the average similarity of descriptions, literals, and predicates.
22. What evaluation metrics are used for KG embedding-based EA methods?
KG embedding-based EA methods use rank-based evaluation metrics, influenced by recent literature on embedding-based methods for link prediction. These metrics allow for comparison on an equal basis and re-use of open-source code. During evaluation, embedding-based EA methods calculate similarity between entities in the embedding space using measures like Euclidean distance or cosine similarity. The result is a similarity list, sorted in descending order, aiming to find the index of the aligned entity. In cases of ties, different behaviors are categorized as optimistic, pessimistic, non-deterministic, and realistic. In our experiments, we conducted 5-fold cross-validation to ensure reliable and unbiased results, calculating average scores for each method per dataset and metric.
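The tie-handling categories can be illustrated with a small sketch. It assumes the convention commonly used in rank-based evaluation, where the optimistic rank places the true entity first among equally scored candidates, the pessimistic rank places it last, and the realistic rank is the mean of the two; the non-deterministic case corresponds to an arbitrary order among ties and is not shown. The similarity values are hypothetical.

```python
def tie_aware_ranks(similarities, true_index):
    """Return the rank of the true entity under three tie-breaking conventions."""
    target = similarities[true_index]
    others = [s for i, s in enumerate(similarities) if i != true_index]
    optimistic = 1 + sum(s > target for s in others)
    pessimistic = 1 + sum(s >= target for s in others)
    return {
        "optimistic": optimistic,
        "pessimistic": pessimistic,
        "realistic": (optimistic + pessimistic) / 2,
    }


# Hypothetical similarity list: the true entity (index 1) ties with index 2.
ranks = tie_aware_ranks([0.9, 0.7, 0.7, 0.4], true_index=1)
print(ranks)
```

Without ties the three ranks coincide, so the choice only matters when several candidates receive exactly the same similarity.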
23. What is Hits@k metric?
The Hits@k metric measures the accuracy of alignment methods as the fraction of hits (true entities) that appear in the first k ranks of the sorted similarity lists, so Hits@k ∈ [0, 1]. In the experimental results, Hits@k values are reported as percentages (i.e., Hits@k × 100%). The metric allows an adjustable error tolerance: Hits@10 tolerates a small error, whereas Hits@1 tolerates none. For example, in Fig. 8, Hits@1 equals 1/3 (about 0.33), as only one similarity list out of three contains a true entity pair in the first rank. Hits@k is easy to interpret, but its weakness is that it only considers the first k positions of the similarity list; the remaining positions have no impact on the final score.
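Given the rank of the true entity in each similarity list, Hits@k and the related rank-based metrics MR and MRR reduce to a few lines. The ranks below are hypothetical and are only chosen so that one list out of three has the true entity in first place.

```python
def hits_at_k(ranks, k):
    """Fraction of similarity lists whose true entity appears within the first k ranks."""
    return sum(rank <= k for rank in ranks) / len(ranks)


def mean_rank(ranks):
    """Average rank of the true entity (lower is better)."""
    return sum(ranks) / len(ranks)


def mean_reciprocal_rank(ranks):
    """Average of 1 / rank of the true entity (higher is better)."""
    return sum(1.0 / rank for rank in ranks) / len(ranks)


# Hypothetical ranks of the true entity in three similarity lists.
ranks = [1, 2, 3]
print(hits_at_k(ranks, 1), hits_at_k(ranks, 10))
print(mean_rank(ranks), mean_reciprocal_rank(ranks))
```

Unlike Hits@k, MR and MRR change whenever any rank changes, so they also reflect positions beyond the first k.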
24. What are the pre-processing steps for entity alignment?
The pre-processing steps for entity alignment include satisfying the 1-to-1 mapping assumption, removing relation and attribute edges that are not in the seed alignment, and aligning predicates manually. Additionally, the iterative degree-based sampling (IDS) algorithm is used for sampling real-world KGs, removing entities based on degree distribution and PageRank scores. The algorithm stops when the desired dataset size is reached and the divergence is within 5%.
25. What server specifications were used for experiments?
The experiments were performed on a server with 16 cores (AMD EPYC 7232P @ 3.1 GHz), 64 GB RAM, and one RTX-4090 GPU (24 GB). The server used Ubuntu 18.04.5 LTS operating system. These specifications provided sufficient computational power and memory for the experiments.
26. What are the characteristics of methods analyzed?
In this section, the characteristics of embedding-based entity alignment methods are analyzed. The performance of these methods across all datasets is evaluated to shed light on their characteristics and families. The methods described in Sect. 2 are ranked using statistical significance tests, and their execution time is analyzed over all datasets in the testbed. The time required to reach a relative effectiveness threshold is compared to understand the effectiveness-efficiency trade-off. Additionally, a meta-level analysis is conducted to identify correlations between the methods and the various meta-features extracted from the KGs of the testbed.
27. How does KDCoE utilize textual descriptions in attribute-based EA methods?
KDCoE exploits textual descriptions of entities when available, using them as an additional source of similarity evidence. This allows KDCoE to outperform other methods in datasets rich in textual descriptions. The use of textual descriptions helps KDCoE increase the distance of dissimilar entities in an embedding space with limited dimensions. In datasets like BBC_DB, which features a high number of entities with textual descriptions, KDCoE performs better than other methods. However, in datasets with a lower ratio of entities with similar textual descriptions, KDCoE's performance drops because it is flooded with entity pairs whose textual descriptions are falsely similar. The correlation of KDCoE with specific data characteristics, such as the number of entities with textual descriptions and their similarity, is reported in Sect.4.3.
28. How does PARIS perform in OpenEA datasets?
PARIS outperforms the best embedding-based method in D_W_15K datasets with high literal similarity (0.81). However, in D_Y_15K datasets, BERT_INT exhibits higher Recall and F1-score due to low literal similarity and high textual descriptions similarity. In new datasets, BERT_INT is the clear winner due to the lack of functional relations affecting PARIS's performance. Overall, PARIS's performance varies across different datasets and functional relations.
29. What test is used for statistically significant ranking of EA methods?
The non-parametric Friedman test is used for statistically significant ranking of EA methods. The null hypothesis (H0) states that 'The mean performance for each method is equal', while the alternative hypothesis (Ha) states the opposite. With p-values of 0.004, 0.003, 0.001, and 0.003 for Hits@1, Hits@10, MR, and MRR, respectively, we can reject the null hypothesis at a 10% significance level. The Friedman test is followed by the Nemenyi post-hoc test to compare methods pairwise, using a critical distance (CD) based on the number of datasets, the number of methods, and a constant (q).
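The critical distance is conventionally computed as CD = q · sqrt(k(k + 1) / (6N)) for k methods and N datasets, and two methods differ significantly when their average ranks differ by more than CD. The sketch below applies this to made-up scores for three placeholder methods; the value q = 2.343 is the commonly tabulated constant for three methods at the 5% level and is used here only as an illustrative assumption.

```python
import math


def average_ranks(score_table):
    """Average rank per method; each row maps method -> score (higher is better, no ties)."""
    totals = {}
    for row in score_table:
        ordered = sorted(row, key=row.get, reverse=True)
        for position, method in enumerate(ordered, start=1):
            totals[method] = totals.get(method, 0) + position
    return {method: total / len(score_table) for method, total in totals.items()}


def critical_distance(q, n_methods, n_datasets):
    """Nemenyi critical distance for comparing average ranks."""
    return q * math.sqrt(n_methods * (n_methods + 1) / (6.0 * n_datasets))


# Hypothetical scores of three placeholder methods on four datasets.
score_table = [
    {"A": 0.90, "B": 0.50, "C": 0.10},
    {"A": 0.80, "B": 0.60, "C": 0.20},
    {"A": 0.70, "B": 0.40, "C": 0.30},
    {"A": 0.60, "B": 0.65, "C": 0.20},
]
avg = average_ranks(score_table)
cd = critical_distance(q=2.343, n_methods=3, n_datasets=4)
a_vs_c_significant = abs(avg["A"] - avg["C"]) > cd
b_vs_c_significant = abs(avg["B"] - avg["C"]) > cd
print(avg, cd, a_vs_c_significant, b_vs_c_significant)
```

In this toy example only the gap between A and C exceeds the critical distance, so A and B, as well as B and C, would be reported as statistically indistinguishable.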
30. How does MTransE compare in efficiency to other EA methods?
MTransE is the fastest method among evaluated EA methods due to its lightweight implementation with only two layers of entity and relation embeddings. It outperforms MTransE+RotatE, which also uses two layers but has a more complex scoring function. BERT_INT and MultiKE are competing for the second most efficient method, as they both train and combine multi-view embeddings, resulting in longer training times. Overall, MTransE's efficiency is attributed to its simpler scoring function and lightweight implementation, making it the fastest method in Table 14.
31. What correlation analysis method is used in meta-level analysis of EA methods?
Spearman's correlation is used in the meta-level analysis of EA methods. It measures the degree of association between two ranked variables, with coefficient $r_s = 1 - \frac{6 \sum_i d_i^2}{n(n^2 - 1)}$, where $d_i$ is the difference between the ranks of the two variables and $n$ is the maximum rank. This choice is motivated by the robustness of Spearman's correlation to outliers, as it relies on the ranks of the variables and not on their actual values. A positive Spearman's correlation denotes that when a meta-feature increases, the method-specific metric also increases, while a negative Spearman's correlation denotes the opposite direction of the association.
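The rank-difference formula for Spearman's coefficient can be applied directly when there are no tied values. The sketch below makes that assumption explicit; the meta-feature and metric values are invented for illustration.

```python
def to_ranks(values):
    """Rank values from 1 (smallest) to n (largest); assumes no ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0] * len(values)
    for position, index in enumerate(order, start=1):
        ranks[index] = position
    return ranks


def spearman(x, y):
    """Spearman's r_s = 1 - 6 * sum(d_i^2) / (n * (n^2 - 1)), valid without ties."""
    n = len(x)
    d_squared = sum((rx - ry) ** 2 for rx, ry in zip(to_ranks(x), to_ranks(y)))
    return 1 - 6 * d_squared / (n * (n * n - 1))


# Hypothetical meta-feature values and a method's metric on four datasets.
meta_feature = [0.1, 0.4, 0.3, 0.9]
metric = [10, 30, 40, 50]
r_s = spearman(meta_feature, metric)
print(r_s)
```

Only the ordering of the values enters the computation, so replacing 0.9 with 900 in `meta_feature` would leave the coefficient unchanged, which is the robustness to outliers noted above.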
32. How does seed alignment size affect unsupervised methods?
Increasing the seed alignment size negatively impacts unsupervised methods like AttrE. This is because a larger seed alignment space leads to a higher probability of incorrect alignments. Unsupervised methods rely on the absence of labeled data, and a larger seed alignment can introduce noise and false positives, reducing the effectiveness of these methods. However, it's important to note that the correlation between seed alignment size and unsupervised methods is not causal, and the direction of the association is unknown. The negative correlation suggests that unsupervised methods may struggle to find accurate alignments when the seed alignment size is increased, but further research is needed to understand the underlying reasons and potential solutions.
33. How does the average number of relations per entity affect the performance of different models?
The average number of relations per entity is positively correlated with the performance of MTransE, MTransE+RotatE, RDGCN, RREA(basic), RREA(semi), KDCoE, and MultiKE. This means that as the average number of relations per entity increases, the performance of these models also improves. On the other hand, the average number of relations per entity is negatively correlated with the performance of AttrE and BERT_INT. This indicates that these models perform better with fewer relations per entity. The positive correlation with the sole meta-feature and hyper meta-feature suggests that a higher average number of relations per entity leads to better embeddings, as entities can minimize their embedding distance with multiple similar entities and have more negative samples to consider. In contrast, a lower average number of relations per entity results in lower density and poorer performance. Overall, the average number of relations per entity plays a significant role in determining the performance of different models.
34. How does KDCoE utilize textual descriptions for seed alignment?
KDCoE, as an attribute-based semi-supervised method, leverages textual descriptions to enrich the seed alignment with new entity pairs that have highly similar descriptions. A high similarity of textual descriptions (Descr_Sim) boosts the performance of KDCoE. This positive correlation between Descr_Sim and KDCoE's performance highlights the importance of textual descriptions for the method's effectiveness. Additionally, the method's success is influenced by the number of entities that have textual descriptions (#Ents_Descr), further emphasizing the significance of textual descriptions in KDCoE's alignment process.
35. What factors affect relation-based methods?
Negative sampling and entity neighborhood range are critical factors affecting relation-based methods. Negative sampling helps exploit semantic information in dense KGs, but can harm performance in sparse KGs. Increasing the neighborhood range aggregates information from both close and distant neighbors, but increases embedding space dimensionality. Multi-hop relations risk modeling noisy information, requiring attention mechanisms or high negative samples. Attribute-based methods are less affected by negative sampling and require highly similar attributes.
36. How does exploiting attribute values improve embedding-based entity alignment methods?
Exploiting attribute values in embedding-based entity alignment methods improves effectiveness, especially in dense KGs with low similarity of factual information. In dense KGs, methods that utilize structural neighborhoods achieve competitive performance. However, when factual information similarity is high, attribute-based methods outperform relation-based methods. For example, BERT_INT performs well in dense KGs with high factual similarity. In sparse KGs with high textual descriptions and similar literals, BERT_INT outperforms RREA, which exploits multi-hop neighborhoods and attention mechanisms. Overall, considering attribute values enhances the effectiveness of embedding-based entity alignment methods in various KG characteristics.
37. What factors affect relation-based methods?
Factors affecting relation-based methods include negative sampling, the attention mechanism, and the range of the entity neighborhood. Negative sampling helps distance dissimilar entities in the embedding space. The attention mechanism weights relevant and important neighbors, which affects the performance of relation-based methods. The range of the entity neighborhood, whether one-hop or multi-hop, also affects performance. Multi-hop methods require more negative samples or attention mechanisms to compensate for noisy information from distant neighbors. Additionally, interaction-based GNNs handle noise better and utilize attributes more effectively than aggregation-based GNNs. Overall, these factors play a crucial role in the effectiveness and efficiency of relation-based methods.
38. What is the difference between entity names and textual descriptions in EA literature?
In EA literature, entity names and textual descriptions are differentiated as follows. Entity names are the suffixes of entity identifiers, which are sometimes meaningful. On the other hand, textual descriptions are the values of some pre-determined attributes that typically span several sentences. These descriptions are used to describe all other factual data, which is typically of much shorter length (1-2 words). The string suffix after the last slash of a URI of an entity or an attribute is used to differentiate between entity names and textual descriptions. This distinction helps in organizing and categorizing data in knowledge graphs (KGs).