Proceedings Article, DOI: 10.4043/29707-MS
Technology Intelligence Analysis Based on Document Embedding Techniques for Oil and Gas Domain
Fábio Corrêa Cordeiro,Diogo da Silva Magalhães Gomes,Flávio Antônio Machado Gomes,Renata Cristina Texeira +3 more
- 28 Oct 2019
TL;DR: The novelty of this proposed methodology is the possibility of exploring new insights when correlating different entities in a technology intelligence scenario for the Oil and Gas domain, using a simple yet efficient approach based on document embedding techniques.
Abstract:
We propose a methodology based on document embedding techniques for applying Technology Intelligence Analysis in the Oil and Gas (O&G) domain. We build a specialized O&G corpus and train a Vector Space Model (VSM) that represents each document as a vector, such that the distance between two vectors captures their semantic similarity. We explore different analyses on this VSM to infer relations between documents and obtain new insights in a strategic context.
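To make the idea of "distance between two vectors" concrete, here is a minimal sketch of ranking documents by cosine similarity. The three-dimensional vectors and document names are invented for illustration; the abstract does not state the dimensionality or the training algorithm used, so this only shows the comparison step, not the authors' model.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

# Hypothetical document vectors; a trained model would supply real ones.
doc_vectors = {
    "drilling_fluids": [0.9, 0.1, 0.0],
    "mud_rheology": [0.8, 0.2, 0.1],
    "seismic_inversion": [0.0, 0.2, 0.9],
}

def most_similar(query_id, vectors):
    """Rank every other document by similarity to the query document."""
    query = vectors[query_id]
    scores = [
        (other_id, cosine_similarity(query, vec))
        for other_id, vec in vectors.items()
        if other_id != query_id
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)

ranking = most_similar("drilling_fluids", doc_vectors)
```

With these toy vectors, `ranking` lists `mud_rheology` ahead of `seismic_inversion`, which is the behavior one would hope for from a well-trained domain model.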
The proposed methodology uses Natural Language Processing (NLP) techniques to obtain strategic insights in a technology intelligence analysis scenario. It consists of generating a vector space model (VSM) induced from a domain-specific Oil and Gas corpus, composed of thousands of scientific articles collected from the Elsevier online database. We explore an approach to represent different entities, such as articles, authors and keywords, in the same vector space, making it possible to correlate them and infer similarity relations based on their cosine distance. An evaluation metric is also provided to assist the training process and hyperparameter optimization.
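The abstract does not say how authors and keywords are placed in the same space as articles. One simple possibility, assumed here purely for illustration, is to represent each entity as the mean of the vectors of the articles it is attached to. The article records, field names and helper functions below are hypothetical.

```python
import math
from collections import defaultdict

def mean_vector(vectors):
    """Component-wise average of a non-empty list of equal-length vectors."""
    count = len(vectors)
    return [sum(column) / count for column in zip(*vectors)]

def cosine_distance(u, v):
    """1 - cosine similarity; returns 1.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norm if norm else 1.0

# Hypothetical article records with toy 2-D embeddings.
articles = [
    {"vector": [0.9, 0.1], "authors": ["author_a"], "keywords": ["drilling"]},
    {"vector": [0.7, 0.3], "authors": ["author_a", "author_b"],
     "keywords": ["drilling", "rheology"]},
    {"vector": [0.1, 0.9], "authors": ["author_c"], "keywords": ["seismic"]},
]

def entity_vectors(records, field):
    """Assumed scheme: an entity's vector is the mean of its articles' vectors."""
    grouped = defaultdict(list)
    for record in records:
        for name in record[field]:
            grouped[name].append(record["vector"])
    return {name: mean_vector(vecs) for name, vecs in grouped.items()}

authors = entity_vectors(articles, "authors")
keywords = entity_vectors(articles, "keywords")

# Authors and keywords now live in the same space and can be compared directly.
near = cosine_distance(authors["author_a"], keywords["drilling"])
far = cosine_distance(authors["author_a"], keywords["seismic"])
```

Because every entity vector is built from article vectors, an author can be compared with a keyword, or with another author, using the same distance function.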
The highly technical Oil and Gas vocabulary is a challenge for NLP applications, since some terms take on a completely different meaning from the one they have in general-context text. In this scenario, gathering an Oil and Gas corpus and training specialized vector space models for this specific domain increases the quality of Technology Intelligence Analysis. The most significant finding is that we were able to make explicit the semantic relationships between different entities of interest in the same VSM, and to link these relationships with additional metadata. An interesting application is comparing the publications of authors affiliated with two or more O&G companies at a given time. These non-trivial correlations are important for gaining strategic insights in a Technology Intelligence Analysis scenario.
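As a rough sketch of the company comparison mentioned above, the snippet below groups publication vectors by affiliation and year, averages each group, and reports the cosine similarity between two companies for every year in which both published. The company names, years, vectors and the averaging step are all assumptions made for illustration, not details taken from the paper.

```python
import math
from collections import defaultdict

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

# Hypothetical (affiliation, publication year, document vector) records.
publications = [
    ("company_x", 2017, [0.9, 0.1]),
    ("company_y", 2017, [0.1, 0.9]),
    ("company_x", 2018, [0.6, 0.4]),
    ("company_y", 2018, [0.5, 0.5]),
    ("company_x", 2019, [0.7, 0.3]),  # company_y has no 2019 record
]

def similarity_by_year(records, company_a, company_b):
    """Yearly similarity between the mean publication vectors of two companies."""
    grouped = defaultdict(list)
    for company, year, vector in records:
        grouped[(company, year)].append(vector)

    def mean(vectors):
        return [sum(column) / len(vectors) for column in zip(*vectors)]

    years = {year for (_, year) in grouped}
    result = {}
    for year in sorted(years):
        key_a, key_b = (company_a, year), (company_b, year)
        if key_a in grouped and key_b in grouped:
            result[year] = cosine_similarity(mean(grouped[key_a]), mean(grouped[key_b]))
    return result

trend = similarity_by_year(publications, "company_x", "company_y")
```

In this toy data the two companies' research profiles move closer together from 2017 to 2018, and 2019 is skipped because only one of them has publications that year.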
The novelty of the proposed methodology is the possibility of exploring new insights when correlating different entities in a technology intelligence scenario for the Oil and Gas domain, using a simple yet efficient approach based on document embedding techniques. The method applies advanced NLP techniques to process more than a hundred thousand documents in a few seconds without requiring complex hardware resources, a workload that would be impractical with traditional techniques.
Citations
Finite-element analysis case retrieval based on an ontology semantic tree
TL;DR: A novel method for measuring semantic similarity between FEA cases based on an ontology semantic tree is proposed. The method utilizes named entity recognition technology and a multitree algorithm to retrieve relevant cases.
Portuguese word embeddings for the oil and gas industry: Development and evaluation
Diogo da Silva Magalhães Gomes,Fábio Corrêa Cordeiro,Bernardo Scapini Consoli,Nikolas Lacerda Santos,Viviane Pereira Moreira,Renata Vieira,Silvia Maria Wanderley Moraes,Alexandre G. Evsukoff +9 more
TL;DR: In this paper, a representative set of word embedding models for the specific domain of oil and gas in Portuguese is proposed, and the results suggest that their domain-specific models outperformed the general model on their ability to represent specialized terminology.