TL;DR: The need to develop appropriate and efficient analytical methods to leverage massive volumes of heterogeneous data in unstructured text, audio, and video formats is highlighted and the need to devise new tools for predictive analytics for structured big data is reinforced.
TL;DR: An ontology developed for a cyber security knowledge graph database is described to provide an organized schema that incorporates information from a large variety of structured and unstructured data sources, and includes all relevant concepts within the domain.
Abstract: In this paper we describe an ontology developed for a cyber security knowledge graph database. This is intended to provide an organized schema that incorporates information from a large variety of structured and unstructured data sources, and includes all relevant concepts within the domain. We compare the resulting ontology with previous efforts, discuss its strengths and limitations, and describe areas for future work.
TL;DR: An algorithm that assigns contextual polarity to concepts in text and flows this polarity through the dependency arcs in order to assign a final polarity label to each sentence is presented, which enables a more efficient transformation of unstructured social data into structured information, readily interpretable by machines.
Abstract: Emulating the human brain is one of the core challenges of computational intelligence, which entails many key problems of artificial intelligence, including understanding human language, reasoning, and emotions. In this work, computational intelligence techniques are combined with common-sense computing and linguistics to analyze sentiment data flows, i.e., to automatically decode how humans express emotions and opinions via natural language. The increasing availability of social data is extremely beneficial for tasks such as branding, product positioning, corporate reputation management, and social media marketing. The elicitation of useful information from this huge amount of unstructured data, however, remains an open challenge. Although such data are easily accessible to humans, they are not suitable for automatic processing: machines are still unable to effectively and dynamically interpret the meaning associated with natural language text in very large, heterogeneous, noisy, and ambiguous environments such as the Web. We present a novel methodology that goes beyond mere word-level analysis of text and enables a more efficient transformation of unstructured social data into structured information, readily interpretable by machines. In particular, we describe a novel paradigm for real-time concept-level sentiment analysis that blends computational intelligence, linguistics, and common-sense computing in order to improve the accuracy of computationally expensive tasks such as polarity detection from big social data. The main novelty of the paper consists in an algorithm that assigns contextual polarity to concepts in text and flows this polarity through the dependency arcs in order to assign a final polarity label to each sentence.
Analyzing how sentiment flows from concept to concept through dependency relations allows for a better understanding of the contextual role of each concept in text, to achieve a dynamic polarity inference that outperforms state-of-the-art statistical methods in terms of both accuracy and training time.
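The idea of flowing polarity along dependency arcs can be illustrated with a deliberately small sketch. It is not the authors' algorithm: the lexicon values, the arc labels (`neg`, `nsubj`), and the combination rule below are all assumptions chosen only to make the idea concrete.

```python
# Toy illustration of flowing polarity along dependency arcs.
# The lexicon, arc labels, and combination rule are hypothetical.

# Prior polarity of a few concepts, in [-1, 1] (made-up values).
LEXICON = {"good": 0.8, "bad": -0.8, "phone": 0.0, "battery": 0.0, "not": 0.0}


def sentence_polarity(root, arcs):
    """Return the polarity that reaches `root` of a dependency tree.

    `arcs` maps a head word to a list of (label, dependent) pairs.
    A `neg` arc flips the sign of the head's accumulated polarity;
    any other arc adds the dependent's polarity to the head's.
    """
    score = LEXICON.get(root, 0.0)
    flip = False
    for arc_label, dependent in arcs.get(root, []):
        if arc_label == "neg":
            flip = True
        else:
            score += sentence_polarity(dependent, arcs)
    return -score if flip else score


def label(score):
    """Map a numeric score to a sentence-level polarity label."""
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"
```

For a sentence such as "the battery is not good", a parse with `good` as the root and arcs `{"good": [("nsubj", "battery"), ("neg", "not")]}` yields a negative label, because the negation arc inverts the positive prior of `good`.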
TL;DR: This paper builds on academic and industry discussions from the 2012 and 2013 pre-ICIS events: BI Congress III and the Special Interest Group on Decision Support Systems workshop, respectively, by presenting a big data analytics framework that depicts a process view of the components needed for big data analytics in organizations.
Abstract: This paper builds on academic and industry discussions from the 2012 and 2013 pre-ICIS events: BI Congress III and the Special Interest Group on Decision Support Systems (SIGDSS) workshop, respectively. Recognizing the potential of “big data” to offer new insights for decision making and innovation, panelists at the two events discussed how organizations can use and manage big data for competitive advantage. In addition, expert panelists helped to identify research gaps. While emerging research in the academic community identifies some of the issues in acquiring, analyzing, and using big data, many of the new developments are occurring in the practitioner community. We bridge the gap between academic and practitioner research by presenting a big data analytics framework that depicts a process view of the components needed for big data analytics in organizations. Using practitioner interviews and literature from both academia and practice, we identify the current state of big data research guided by the framework and propose potential areas for future research to increase the relevance of academic research to practice.
TL;DR: In this article, the authors focus on classifying Arabic text documents with a maximum entropy method; they evaluate their approach on real data and compare the results with other existing systems.
Abstract: In organizations, a large amount of information exists in text documents. Therefore, it is important to use text mining to discover knowledge from these unstructured data. Automatic text classification is considered one of the important applications of text mining. It is the process of assigning a text document to one or more predefined categories based on its content. This paper focuses on classifying Arabic text documents. Arabic is a highly inflectional and derivational language, which makes text mining a complex task. In our approach, we first preprocessed the data using natural language processing techniques such as tokenizing, stemming, and part-of-speech tagging. Then, we used the maximum entropy method to classify Arabic documents. We evaluated our approach on real data, then compared the results with other existing systems.
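The classification step can be sketched as follows. A maximum entropy classifier is equivalent to multinomial logistic regression, so this minimal version fits per-label token weights by stochastic gradient ascent on the log-likelihood. The function names, the token-presence features, and the hyperparameters are assumptions for illustration; the paper's actual features and training procedure are not given here, and the tokens are assumed to be already tokenized and stemmed.

```python
import math
from collections import defaultdict


def predict_proba(weights, tokens):
    """Return P(label | tokens) via a softmax over summed feature weights."""
    scores = {
        label: sum(w.get(token, 0.0) for token in set(tokens))
        for label, w in weights.items()
    }
    top = max(scores.values())  # subtract the max for numerical stability
    exps = {label: math.exp(score - top) for label, score in scores.items()}
    total = sum(exps.values())
    return {label: value / total for label, value in exps.items()}


def train_maxent(samples, epochs=200, rate=0.5):
    """Fit maximum entropy (multinomial logistic regression) weights.

    `samples` is a list of (tokens, label) pairs. Features are simple
    token-presence indicators (a hypothetical, minimal feature set).
    """
    labels = sorted({label for _, label in samples})
    weights = {label: defaultdict(float) for label in labels}
    for _ in range(epochs):
        for tokens, gold in samples:
            probs = predict_proba(weights, tokens)
            for label in labels:
                # Log-likelihood gradient: observed minus expected count.
                error = (1.0 if label == gold else 0.0) - probs[label]
                for token in set(tokens):
                    weights[label][token] += rate * error
    return weights


def classify(weights, tokens):
    """Pick the most probable label for a token list."""
    probs = predict_proba(weights, tokens)
    return max(probs, key=probs.get)
```

Training on a handful of labeled token lists and then calling `classify(weights, tokens)` on an unseen document returns the label with the highest posterior probability.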
TL;DR: This paper surveys various Question Answering Systems (QAS) and identifies their shortcomings, so that researchers can propose new systems for complex queries and adapt or reuse QAS techniques for specific research issues.
TL;DR: This paper discusses in detail Hadoop and its components, MapReduce and HDFS, which together form a scalable and fault-tolerant model that hides the complexities of Big Data analytics.
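The MapReduce programming model can be illustrated with a word count that runs in a single process. This is only a sketch of the map, shuffle, and reduce phases: it does not use the Hadoop API, and it has no distributed storage or fault tolerance. The function names are assumptions.

```python
from collections import defaultdict


def map_phase(document):
    """Emit a (word, 1) pair for every word in one input split."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle_phase(pairs):
    """Group emitted values by key, as the framework does between phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Sum the values collected for each key."""
    return {key: sum(values) for key, values in grouped.items()}


def word_count(documents):
    """Run map, shuffle, and reduce in sequence over all input splits."""
    pairs = []
    for document in documents:
        pairs.extend(map_phase(document))
    return reduce_phase(shuffle_phase(pairs))
```

In a real cluster the map calls would run in parallel on separate splits and the shuffle would move data between nodes; the programmer still writes only the map and reduce functions.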
TL;DR: This paper defines a lifecycle for Big Data processing and classifies various available tools and technologies in terms of the lifecycle phases of Big Data, which include data acquisition, data storage, data analysis, and data exploitation of the results.
TL;DR: This survey investigates recent advances in the field of text analysis and covers two basic approaches to text mining, classification and clustering, that are widely used to explore the unstructured text available on the Web.
Abstract: In this survey, we review different text mining techniques for discovering textual patterns from social networking sites. Social network applications create opportunities for interaction among people, leading to mutual learning and sharing of valuable knowledge through channels such as chat, comments, and discussion boards. Data in social networking websites is inherently unstructured and fuzzy in nature. In everyday conversations, people do not care about spelling or the accurate grammatical construction of a sentence, which may lead to different types of ambiguity, such as lexical, syntactic, and semantic. Therefore, analyzing and extracting information patterns from such data sets is more complex. Several surveys have been conducted to analyze different methods for information extraction. Most of these surveys emphasize the application of different text mining techniques to unstructured data sets that reside in the form of text documents, but do not specifically target the data sets in social networking websites. This survey attempts to provide a thorough understanding of different text mining techniques as well as the application of these techniques in social networking websites. This survey investigates recent advances in the field of text analysis and covers two basic approaches to text mining, classification and clustering, that are widely used to explore the unstructured text available on the Web.
TL;DR: This paper presents Personal Data Lake, a unified storage facility for storing, analyzing and querying personal data, and allows third-party plugins so that the unstructured data can be analyzed and queried.
Abstract: This paper presents Personal Data Lake, a unified storage facility for storing, analyzing and querying personal data. A data lake stores data regardless of format and thus provides an intuitive way to store personal data fragments of any type. Metadata management is a central part of the lake architecture. For structured/semi-structured data fragments, metadata may contain information about the schema of the data so that the data can be transformed into queryable data objects when required. For unstructured data, enabling gravity pull means allowing third-party plugins so that the unstructured data can be analyzed and queried.
TL;DR: This paper reviews the latest applications of big data analytics in the logistics and transportation industry and proposes a novel approach to detect and recognize container codes based on a Hadoop big data analytics system.
Abstract: Nowadays, the logistics industry faces many challenges, mainly from the integration of e-commerce and new sources of data such as smartphones, sensors, GPS, and other devices. These new data sources generate a huge quantity of unstructured data daily; to deal with such complex data, the use of big data analytics tools becomes a necessity. In this context, much work has been done recently on integrating big data analytics into the logistics industry. In this paper, we review the latest applications of big data analytics in the logistics and transportation industry and propose a novel approach to detect and recognize container codes based on a Hadoop big data analytics system.
TL;DR: This review presents the central, publicly accessible databases that contain data pertinent to cancer, the resources available for delivering and analyzing information from these databases, as well as databases dedicated to specific types of cancer.
Abstract: Cancer is one of the four major non‑communicable diseases (NCD), responsible for ~14.6% of all human deaths. Currently, there are >100 different known types of cancer and >500 genes involved in cancer. Ongoing research efforts have been focused on cancer etiology and therapy. As a result, there is an exponential growth of cancer‑associated data from diverse resources, such as scientific publications, genome‑wide association studies, gene expression experiments, gene‑gene or protein‑protein interaction data, enzymatic assays, epigenomics, immunomics and cytogenetics, stored in relevant repositories. These data are complex and heterogeneous, ranging from unprocessed, unstructured data in the form of raw sequences and polymorphisms to well‑annotated, structured data. Consequently, the storage, mining, retrieval and analysis of these data in an efficient and meaningful manner pose a major challenge to biomedical investigators. In the current review, we present the central, publicly accessible databases that contain data pertinent to cancer, the resources available for delivering and analyzing information from these databases, as well as databases dedicated to specific types of cancer. Examples for this wealth of cancer‑related information and bioinformatic tools have also been provided.
TL;DR: This study assesses the performance differences between RDBMS and NoSQL and describes the optimal design for enhanced functionality when using NoSQL.
Abstract: With the advancement of social networks and the popularization of mobile devices, processing massive data in existing relational database management systems (RDBMS) has become an issue. NoSQL is a class of database management systems that makes processing of massive and/or unstructured data easier, and many companies today tend to start projects using NoSQL. Moreover, converting the RDBMS of current systems to NoSQL has become a trend. This study assesses the performance differences between RDBMS and NoSQL. The optimal design for enhanced functionality when using NoSQL is described as well. In this study, PostgreSQL and MongoDB have been selected to represent RDBMS and NoSQL, respectively, for comparative analysis.
TL;DR: The overall results suggest that automated text mining methods can be used to reliably process clinical notes to identify personal information and thus providing a crucial step in large-scale de-identification of unstructured data for further clinical and epidemiological studies.
TL;DR: This paper aims to introduce the concepts behind NoSQL, provides a review of relevant literature, highlights the different NoSQL database types, and provides arguments for and against adopting NoSQL.
Abstract: With the emergence of Big Data, the use of NoSQL (Not only SQL) technology is rising rapidly among internet companies and other enterprises. Benefits include simplicity of design, horizontal scaling, and finer control over availability. NoSQL databases are increasingly considered a viable alternative to relational databases, as more organizations recognize that their schema-less data model is a better method for handling the large volumes of structured, semi-structured, and unstructured data being captured and processed today. For example, NoSQL databases are often used to collect and store social media data. This paper aims to introduce the concepts behind NoSQL, provides a review of relevant literature, highlights the different NoSQL database types, and provides arguments for and against adopting NoSQL. A small prototype application has been developed to assess the stated NoSQL benefits and illustrate the differences between the SQL and NoSQL approaches. The last section of the paper offers some conclusions and recommendations for further research to expand upon our research work.
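The contrast between a fixed relational schema and a schema-less document model can be sketched in a few lines. This is not the paper's prototype: the `posts` table, the field names, and the in-memory list standing in for a document store are all assumptions, and no real NoSQL client API is used.

```python
import json
import sqlite3

# Relational approach: the schema is declared before any row is stored.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE posts (id INTEGER PRIMARY KEY, author TEXT, body TEXT)")
conn.execute("INSERT INTO posts (author, body) VALUES (?, ?)", ("alice", "hello"))
sql_rows = conn.execute("SELECT author, body FROM posts").fetchall()

# Document approach: each record carries its own structure, so new
# fields (here `tags` and `likes`) need no schema migration.
documents = [
    {"author": "alice", "body": "hello"},
    {"author": "bob", "body": "hi", "tags": ["greeting"], "likes": 3},
]
stored = [json.dumps(doc) for doc in documents]  # serialized, as a store might keep them

# Query by the presence of an optional field.
tagged = [doc["author"] for doc in map(json.loads, stored) if "tags" in doc]
```

Adding `tags` to the relational version would require an `ALTER TABLE` or a separate table, whereas the document version accepts the heterogeneous record as-is; the trade-off is that queries must tolerate missing fields.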
TL;DR: The current paper aims to develop a holistic model that includes the factors that would affect the success or failure of the implementation of big data in organizations, and examines the opportunities that organizations would attain from implementing big data, as well as the challenges that could hinder this implementation.
Abstract: The term 'big data' has gained huge popularity in recent years among IT professionals and academicians. Big data describes the massive amount of data that can be processed and analyzed using technology to gain business values that will help organizations to achieve competitive advantages. The current paper aims to develop a holistic model that includes the factors that would affect the success or failure of the implementation of big data in organizations. Furthermore, this research examines the opportunities that organizations would attain from implementing big data, as well as the challenges that could hinder this implementation. The proposed model provides IT managers and decision makers the important factors that they need to consider when deciding to implement big data in order to ensure that it achieves the competitive advantage. KEYWORDS: Big Data, opportunities, challenges, implementation INTRODUCTION The interest of big data has increased because of the significant amount of data generated every day. Data is getting bigger because it is continuing to be generated from more devices and more sources such as personal computers, mobile phones, government records, healthcare records, social media, street sensors, climate sensors, airport terminals, hypermarkets' points of sales, etc. These sources generate a massive amount of data and it will continue to generate more and more data as time passes since people are getting more dependent on technology. As anticipated by Cisco Visual Networking Index (VNI) report (2015), mobile data traffic is expected to grow to 24.3 Exabytes per month by 2019 because of increased usage on smartphones. This is nearly a tenfold increase over 2014. A study by Intel also showed that data has increased enormously in the last decade. It showed that humankind has generated five Exabytes until 2003. From 2003 to 2013, data has increased to reach 2.7 Zettabytes (i.e. 500x more data). 
Data will continue to increase to three times bigger than that by 2015. In the same context, Das et al. (2013) pointed to the rapid growth of global data. They noted that it took from the dawn of time until 2003 to create five Exabytes of information, whereas now the same volume of data is created in just two days. This will continue to reach eight Zettabytes by 2015 (i.e., the equivalent of 18 million Libraries of Congress). Although data is increasing enormously, a very small fraction of this data has been exploited; the rest is not tapped yet. According to IBM and Intel, 90% of data is unstructured and is not used. Data can be classified into structured and unstructured data. Structured data refers to data that can be organized and stored in relational databases so it can be easily used and searched efficiently. Unstructured data refers to data that does not have a pre-defined data model or is not organized in a pre-defined manner, such as videos, photos, images, emails, text documents, and blogs. Searching and analyzing unstructured data is more difficult than for structured data. Das et al. (2013) also argued that unstructured data would account for 90% of data in the next decade, where analyzing this massive amount of data would expose new improvements in business that were impossible to determine previously. Indeed, interest in big data has increased because it is supposed to have a significant impact on organizations, and this would be achieved by analyzing the unstructured data. According to a survey conducted by IDG Enterprise (2014) amongst more than 750 IT decision-makers in 2013, the interest in big data continues to rise, as nearly half of the respondents (50%) are implementing or planning to implement big data projects within their organizations. BIG DATA DEFINITION Although the term has gained huge popularity in recent years, it is still poorly defined and there is huge ambiguity regarding its exact meaning (Hartmann et al. …
TL;DR: A global architecture is proposed for QoS based scheduling for big data application to distributed cloud datacenter at two levels which are coarse grained and fine grained, and results indicated better QoS achievement and 33.15 % cost gain of the proposed architecture over traditional Amazon methods.
Abstract: Big data is one of the major technology usages for business operations in today's competitive market. It provides organizations a powerful tool to analyze large unstructured data to make useful decisions. Result quality, time, and price associated with big data analytics are very important aspects for its success. Selection of appropriate cloud infrastructure at coarse and fine grained level will ensure better results. In this paper, a global architecture is proposed for QoS based scheduling for big data application to distributed cloud datacenter at two levels which are coarse grained and fine grained. At coarse grain level, appropriate local datacenter is selected based on network distance between user and datacenter, network throughput and total available resources using adaptive K nearest neighbor algorithm. At fine grained level, probability triplet (C, I, M) is predicted using naive Bayes algorithm which provides probability of new application to fall in compute intensive (C), input/output intensive (I) and memory intensive (M) categories. Each datacenter is transformed into a pool of virtual clusters capable of executing specific category of jobs with specific (C, I, M) requirements using self organized maps. Novelty of study is to represent whole datacenter resources in a predefined topological ordering and executing new incoming jobs in their respective predefined virtual clusters based on their respective QoS requirements. Proposed architecture is tested on three different Amazon EMR datacenters for resource utilization, waiting time, availability, response time and estimated time to complete the job. Results indicated better QoS achievement and 33.15 % cost gain of the proposed architecture over traditional Amazon methods.
TL;DR: The course offerings of a small sample of undergraduate data analytics and data science programs are examined to determine what similarities and differences exist across programs, and discrepancies between skills in the literature and those offered in degree programs are identified.
Abstract: 1. INTRODUCTION Inexpensive data storage and the ever-growing flow of data from a variety of sources increase the amount of data available to organizations. Competing in the era of big data will require analytically-focused employees with the specialized knowledge and skills to extract useful information from this data. Some have expressed great concern that the demand for employees with this skill set will far outstrip supply (see, e.g., Davenport and Patil, 2012). A widely cited report by McKinsey and Company concluded, "The United States alone faces a shortage of 140,000 to 190,000 people with deep analytical skills as well as 1.5 million managers and analysts to analyze big data and make decisions based on their findings" (Manyika et al., 2011, p. 3). Universities are responding to this call to educate the next generation of entry level data savvy professionals. There are a growing number of degree programs, specializations, and certificates in data science and data analytics at both the graduate and undergraduate levels (Davenport and Patil, 2012; Dumbill et al., 2013). However, there are still relatively few full degree programs at the undergraduate level. A recent review of undergraduate degree programs in data analytics and data science identified thirteen such programs across the United States (Aasheim et al., 2014). Since that time, it is likely that more such programs have been developed. However, as more universities expand into this area, little is currently known about the specifics of skills covered in those degree programs or the extent to which skills coverage is comparable across different programs. This paper fills that gap by examining the course offerings of a small sample of undergraduate data analytics and data science programs to determine what similarities and differences exist across programs. In addition, discrepancies between skills in the literature and those offered in degree programs are identified. 
This examination will contribute to the goal of identifying important topics for an undergraduate program in data analytics and data science. The focus will be data analytics programs specifically and how they relate to the traditional information systems program. 2. LITERATURE REVIEW Organizations have collected and analyzed data in an attempt to gain strategic advantage in the market place for many years. However, in recent years, the amount and complexity of available data has exploded, making it more difficult to gain insights from data to improve business decision making. This section presents a review of relevant literature addressing (1) the growth of big data, (2) the evolution of data analytics as a field of study, (3) legal and ethical issues surrounding big data, and (4) implications for academia. 2.1 The Growth of Big Data A number of factors have contributed to the explosion of data. In the latter part of the 20th century, organizations emphasized integrating transactional databases into data warehouses that could then be analyzed to improve business decisions (Eckerson, 2011). As organizations began to realize benefits from this analysis, this trend accelerated. In one example, Walmart was able to identify bestselling products in hurricane-prone areas when storms were approaching; as a result, prior to storm season, Walmart stores stocked-up not only on obvious high-demand staples such as batteries but also the less obvious number two bestselling item--Pop-Tarts (Preimesberger, 2011). Growth in e-commerce and social media has contributed to the increase in data accumulation, particularly as organizations utilize clickstream data and social media comments to track customer sentiment and understand consumer behavior. Organizations also collect unstructured data through sources such as bar codes, QR codes, RFID tags, and sensors. 
United Parcel Service (UPS) installed sensors on more than 46,000 delivery trucks to monitor location, safety, and efficiency related data including speed, direction, and mechanical performance (Davenport, 2013). …
TL;DR: In this article, the authors examine the development, architecture and component functionalities of big data, and identify its capabilities, including traceability, the analysis of unstructured data and patterns of care.
Abstract: To date, the health care industry has paid little attention to the potential benefits to be gained from big data. While most pioneering big data studies have adopted technological perspectives, a better understanding of the strategic implications of big data is urgently needed. To address this lack, this study examines the development, architecture and component functionalities of big data, and identifies its capabilities, including traceability, the analysis of unstructured data and patterns of care, and its predictive capacity to support healthcare managers seeking to formulate more effective big-data-based strategies. Our findings will help healthcare organizations respond strategically to the challenges they face in today's highly competitive healthcare market.
TL;DR: The paper designs a new context intelligence framework to handle industrial informatics regarding location, sensor, and unstructured data for big data mining, and designs a cyber physical system that integrates various existing and proprietary data analytics systems.
Abstract: The purpose of this paper is to provide a comprehensive solution for industry through research and development of an Internet of Things (IoT) based Cyber Physical System for Industrial Informatics Analytics with the following objectives. This study conducted a review regarding big data analytics in industry and designed a cyber physical system with the integration of various existing and proprietary data analytics systems based on their business needs so that the modules can be reconfigurable and interchangeable. The paper designs a new context intelligence framework to handle industrial informatics regarding location, sensor, and unstructured data for big data mining. A case study is used to illustrate the concept of the proposed cyber physical system. Further study on system integration and migration from existing factories to smart factories should be conducted so as to realize the next industrial paradigm shift.
TL;DR: The main ideas and methods of big data and deep learning are introduced, and it is shown how they can be applied to various phases of the traditional modeling and simulation process and how they have the potential to lead to a new generation of modeling and simulation applications that provide computational scientific support on a new scale beyond current capabilities.
Abstract: Big data allows users to cope with data that are huge in regards to volume, velocity, variety, and veracity. It provides methods and tools to extract aggregates and new information out of heterogeneously structured data or even completely unstructured data. Deep learning is a collection of algorithms that allows us to discover correlations and learn --- supervised and unsupervised --- from information provided. This contribution introduces the main ideas and methods of big data and deep learning and shows how they can be applied to various phases of the traditional modeling and simulation process. Big data supports obtaining data for the initialization as well as evaluating the results of the simulation experiment. Deep learning can help with the conceptual modeling phase as well as with the discovery of correlations in the results. Examples of existing applications will be given to prove the feasibility of such ideas. This leads to the observation that big data, deep learning, and modeling and simulation have the potential to lead to a new generation of modeling and simulation applications that provide computational scientific support on a new scale beyond the current capabilities.
TL;DR: It is shown how the singular value decomposition may be used to drastically reduce the size of the document space while also setting the stage for automatic topic extraction, courtesy of the varimax rotation.
Abstract: Text analytics continue to proliferate as mass volumes of unstructured but highly useful data are generated at unbounded rates. Vector space models for text data—in which documents are represented by rows and words by columns—provide a translation of this unstructured data into a format that may be analyzed with statistical and machine learning techniques. This approach gives excellent results in revealing common themes, clustering documents, clustering words, and in translating unstructured text fields (such as an open-ended survey response) to usable input variables for predictive modeling. After discussing the collection and processing of text, we explore properties and transformations of the document-term matrix (DTM). We show how the singular value decomposition may be used to drastically reduce the size of the document space while also setting the stage for automatic topic extraction, courtesy of the varimax rotation. This latent semantic analysis (LSA) approach produces factors that are compatible with graphical exploration and advanced analytics. We also explore Latent Dirichlet Allocation for topic analysis. We reference published R packages to implement the methods and conclude with a summary of other popular open-source and commercial software packages. WIREs Comput Stat 2015, 7:326–340. doi: 10.1002/wics.1361
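The document-term matrix described above, with documents as rows and words as columns, can be built with the standard library alone. This is a minimal sketch under simplifying assumptions: tokenization is a lowercase whitespace split, and the function name is hypothetical. The dimensionality reduction step (singular value decomposition) is not shown and would normally be delegated to a linear algebra library.

```python
from collections import Counter


def document_term_matrix(documents):
    """Return (vocabulary, matrix) with one row of counts per document.

    Tokenization is a plain lowercase whitespace split; a real pipeline
    would also strip punctuation, remove stop words, and stem.
    """
    counts = [Counter(doc.lower().split()) for doc in documents]
    vocabulary = sorted(set().union(*counts))
    matrix = [[c[term] for term in vocabulary] for c in counts]
    return vocabulary, matrix
```

Each column of the resulting matrix is a word's occurrence profile across documents, which is the input that latent semantic analysis then factors into a smaller number of latent dimensions.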
TL;DR: The importance of analyzing unstructured data along with structured data in business to extract holistic insights is emphasized, and the need for appropriate and efficient analytical methods for knowledge discovery from huge volumes of heterogeneous data in unstructured formats is highlighted.
Abstract: Big Data is data of high volume and high variety, produced or generated at high velocity, which cannot be stored, managed, processed, or analyzed using existing traditional software tools, techniques, and architectures. Big data brings many challenges, such as scale, heterogeneity, speed, and privacy, but there are opportunities as well. Potential information is locked in big data which, if properly leveraged, will make a huge difference to business. With the help of big data analytics, meaningful insights can be extracted from big data, which is heterogeneous in nature, comprising structured, unstructured, and semi-structured content. One prime challenge in big data analytics is that nearly 95% of data is unstructured. This paper describes what big data and big data analytics are. A review of different techniques and approaches to analyze unstructured data is given. This paper emphasizes the importance of analyzing unstructured data along with structured data in business to extract holistic insights. The need for appropriate and efficient analytical methods for knowledge discovery from huge volumes of heterogeneous data in unstructured formats is highlighted.
TL;DR: This chapter will focus on methods particularly useful for discovering security breaches and attacks, and which can be implemented with either free or commonly available software.
Abstract: Knowledge of analytical methods and techniques is essential for uncovering hidden patterns in security-related data. Analytical techniques range from simple descriptive statistics and data visualization methods to statistical analysis algorithms such as regression, correlation analysis, and support vector machines.
The field of analytics is broad. This chapter will focus on methods particularly useful for discovering security breaches and attacks, and which can be implemented with either free or commonly available software. As there are unlimited ways that an attacker can compromise a system, analysts also need a toolkit of techniques to be creative in analyzing security data. Among the tools available for creative analysis, we will examine analytical programming languages that allow analysts to customize analytical procedures and applications. The concepts introduced in this chapter will provide you with a framework for security analysis, along with useful methods and tools.
TL;DR: Research shows that the new model developed here is powerful and can adapt to various knowledge expression requirements of electric power big data, and it is expected to be used in more applications.
Abstract: Making use of electric power knowledge is very important for the development of electric power big data technology. A new electric power knowledge theory model is proposed here to address the problem of modeling electric power knowledge in a normalized form for the management and analysis of electric power big data. Current modeling techniques for electric power knowledge are viewed as inadequate because of the complexity and variety of the relationships among electric power system data. Ontology theory and semantic web technologies, used in electric power systems and in many other industry domains, provide a new kind of knowledge modeling method. Based on this, this paper proposes the structure, elements, basic calculations and multidimensional reasoning method of the new knowledge model. A modeling example of the regulations defined in an electric power system operation standard is demonstrated. Different forms of the model and related technologies are also introduced, including electric power system standard modeling, multi-type data management, unstructured data searching, knowledge display and data analysis based on semantic expansion and reduction. Research shows that the new model developed here is powerful and can adapt to various knowledge expression requirements of electric power big data. With the development of electric power big data technology, it is expected that the knowledge model will be improved and will be used in more applications.
TL;DR: In this paper, the authors demonstrate that information extraction from unstructured clinical narratives is essential to most clinical applications, and they perform an empirical study to validate the argument and show that structured data alone is insufficient in resolving eligibility criteria for recruiting patients onto clinical trials for chronic lymphocytic leukemia (CLL) and prostate cancer.
Abstract: Electronic health records capture patient information using structured controlled vocabularies and unstructured narrative text. While structured data typically encodes lab values, encounters and medication lists, unstructured data captures the physician's interpretation of the patient's condition, prognosis, and response to therapeutic intervention. In this paper, we demonstrate that information extraction from unstructured clinical narratives is essential to most clinical applications. We perform an empirical study to validate the argument and show that structured data alone is insufficient in resolving eligibility criteria for recruiting patients onto clinical trials for chronic lymphocytic leukemia (CLL) and prostate cancer. Unstructured data is essential to solving 59% of the CLL trial criteria and 77% of the prostate cancer trial criteria. More specifically, for resolving eligibility criteria with temporal constraints, we show the need for temporal reasoning and information integration with medical events within and across unstructured clinical narratives and structured data.
TL;DR: This paper provides a broad view and discussion of the current state of this subject with a particular focus on data modeling and data analytics, describing and clarifying the main differences between the three main approaches with respect to these aspects, namely: operational databases, decision support databases and Big Data technologies.
Abstract: In recent years we have witnessed tremendous growth in the volume and availability of data. This results primarily from the emergence of a multitude of sources (e.g. computers, mobile devices, sensors or social networks) that continuously produce structured, semi-structured or unstructured data. Database Management Systems and Data Warehouses are no longer the only technologies used to store and analyze datasets, mainly because the volume and complex structure of today's data degrade their performance and scalability. Big Data is one of the recent challenges, since it implies new requirements in terms of data storage, processing and visualization. Despite that, properly analyzing Big Data can offer great advantages because it allows patterns and correlations in datasets to be discovered. Users can use this processed information to gain deeper insights and business advantages. Thus, data modeling and data analytics have evolved so that huge amounts of data can be processed without compromising performance and availability, by instead “relaxing” the usual ACID properties. This paper provides a broad view and discussion of the current state of this subject with a particular focus on data modeling and data analytics, describing and clarifying the main differences between the three main approaches with respect to these aspects, namely: operational databases, decision support databases and Big Data technologies.
TL;DR: The engineering principles and practices for managing Flexible Schema Data (FSD) in RDBMSs to meet its unique requirements and challenges are described, along with the limitations and issues of current practices.
Abstract: RDBMSs are designed to manage well-structured data, requiring users to design a schema before storing and querying data. This is the ‘schema first, data later’ approach. However, there is a significant amount of unstructured and semi-structured data that cannot be effectively modelled this way. Even if certain parts of the data can be modelled using a schema, the inclusion of all fields would typically lead to a very large schema with many optional fields and with frequent schema evolution, as data instances vary widely and evolve fast. These data require the ‘data first, schema later/never’ approach. We call these data Flexible Schema Data (FSD). In this paper, we describe the engineering principles and practices to manage FSD in RDBMSs to meet FSD’s unique requirements and challenges. We describe the limitations and issues of current practices and potential research opportunities. Having a single data platform for managing both well-structured data and FSD is beneficial to users; this approach significantly reduces integration, migration, development, maintenance, and operational issues.
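The contrast between ‘schema first’ and ‘data first’ in the abstract above can be made concrete with a minimal sketch. It is one possible reading and not the practice the paper describes. It assumes an in-memory SQLite database, a hypothetical `events` table whose fixed columns hold the well-structured part, and a text column holding each record's variable fields as JSON. The helper `field_values` and all sample records are invented for illustration.

```python
import json
import sqlite3

# Fixed columns carry the well-structured part; the payload column holds
# whatever fields each record happens to have (assumed layout, for illustration).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE events ("
    "id INTEGER PRIMARY KEY, source TEXT NOT NULL, payload TEXT NOT NULL)"
)

# Records with differing fields are stored without any schema change.
records = [
    ("sensor", {"temperature": 21.5, "unit": "C"}),
    ("web", {"url": "/login", "status": 200, "user": "alice"}),
    ("sensor", {"temperature": 19.0, "humidity": 0.4}),
]
conn.executemany(
    "INSERT INTO events (source, payload) VALUES (?, ?)",
    [(source, json.dumps(doc)) for source, doc in records],
)


def field_values(conn, source, field):
    """Return the values of an optional field, skipping records that lack it."""
    rows = conn.execute(
        "SELECT payload FROM events WHERE source = ? ORDER BY id", (source,)
    )
    values = []
    for (payload,) in rows:
        doc = json.loads(payload)
        if field in doc:
            values.append(doc[field])
    return values


print(field_values(conn, "sensor", "temperature"))
print(field_values(conn, "sensor", "humidity"))
```

The structure is imposed only at query time, when a field is looked up, which is the ‘schema later’ half of the idea. The trade-off is that the database cannot index or validate the payload fields in this simple form.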
TL;DR: This paper studies different supervised classification techniques over big transactional databases and shows their advantages and limitations.
Abstract: Big Data concerns large-volume, growing data sets that are complex and have multiple autonomous sources. Earlier technologies were unable to handle the storage and processing of such huge data, and thus the Big Data concept came into existence. Working with unstructured data is a tedious job for users, so there should be some mechanism that classifies unstructured data into an organized form that helps users easily access the required data. Classification techniques over big transactional databases provide the required data to users from large datasets in a simpler way. There are two main kinds of classification techniques, supervised and unsupervised. In this paper we focus on the study of different supervised classification techniques. Further, this paper shows their advantages and limitations.
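As a rough illustration of what supervised classification means in the abstract above, the sketch below trains on labeled examples and then assigns a label to an unseen record. The abstract does not name specific techniques, so the nearest-centroid rule used here is an assumption chosen for brevity. The transaction features (amount, item count), the labels, and the helper names are all hypothetical.

```python
from collections import defaultdict


def fit_centroids(samples, labels):
    """Average the feature vectors of each class (the supervised training step)."""
    sums = {}
    counts = defaultdict(int)
    for features, label in zip(samples, labels):
        if label not in sums:
            sums[label] = [0.0] * len(features)
        sums[label] = [s + x for s, x in zip(sums[label], features)]
        counts[label] += 1
    return {label: [s / counts[label] for s in total]
            for label, total in sums.items()}


def predict(centroids, features):
    """Assign the label whose centroid is closest in squared Euclidean distance."""
    def squared_distance(label):
        return sum((c - x) ** 2 for c, x in zip(centroids[label], features))
    return min(centroids, key=squared_distance)


# Hypothetical labeled transactions: (amount, number of items).
train_samples = [[20, 1], [25, 2], [500, 8], [650, 9]]
train_labels = ["small", "small", "large", "large"]

centroids = fit_centroids(train_samples, train_labels)
print(predict(centroids, [30, 1]))
print(predict(centroids, [600, 10]))
```

An unsupervised method would receive only `train_samples` and would have to discover the groups itself. The labels in `train_labels` are what make this example supervised.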