Top 466 papers published in the topic of Unstructured data in 2015

Showing papers on "Unstructured data" published in 2015

Journal Article•10.1016/J.IJINFOMGT.2014.10.007•

Beyond the hype

Amir H. Gandomi¹, Murtaza Haider¹•Institutions (1)

01 Apr 2015-International Journal of Information Management

TL;DR: The need to develop appropriate and efficient analytical methods to leverage massive volumes of heterogeneous data in unstructured text, audio, and video formats is highlighted and the need to devise new tools for predictive analytics for structured big data is reinforced.

3,982 citations

Proceedings Article•10.1145/2746266.2746278•

Developing an Ontology for Cyber Security Knowledge Graphs

Michael D. Iannacone¹, Shawn J. Bohn², Grant C. Nakamura², John Gerth³, Kelly M. T. Huffer¹, Robert A. Bridges¹, Erik M. Ferragut¹, John R. Goodall¹•Institutions (3)

Oak Ridge National Laboratory¹, Pacific Northwest National Laboratory², Stanford University³

7 Apr 2015

TL;DR: An ontology developed for a cyber security knowledge graph database is described to provide an organized schema that incorporates information from a large variety of structured and unstructured data sources, and includes all relevant concepts within the domain.

Abstract: In this paper we describe an ontology developed for a cyber security knowledge graph database. This is intended to provide an organized schema that incorporates information from a large variety of structured and unstructured data sources, and includes all relevant concepts within the domain. We compare the resulting ontology with previous efforts, discuss its strengths and limitations, and describe areas for future work.

176 citations

Journal Article•10.1109/MCI.2015.2471215•

Sentiment Data Flow Analysis by Means of Dynamic Linguistic Patterns

Soujanya Poria¹, Erik Cambria², Alexander Gelbukh³, Federica Bisio⁴, Amir Hussain¹•Institutions (4)

University of Stirling¹, Nanyang Technological University², Instituto Politécnico Nacional³, University of Genoa⁴

12 Oct 2015-IEEE Computational Intelligence Magazine

TL;DR: An algorithm that assigns contextual polarity to concepts in text and flows this polarity through the dependency arcs in order to assign a final polarity label to each sentence is presented, which enables a more efficient transformation of unstructured social data into structured information, readily interpretable by machines.

Abstract: Emulating the human brain is one of the core challenges of computational intelligence, which entails many key problems of artificial intelligence, including understanding human language, reasoning, and emotions. In this work, computational intelligence techniques are combined with common-sense computing and linguistics to analyze sentiment data flows, i.e., to automatically decode how humans express emotions and opinions via natural language. The increasing availability of social data is extremely beneficial for tasks such as branding, product positioning, corporate reputation management, and social media marketing. The elicitation of useful information from this huge amount of unstructured data, however, remains an open challenge. Although such data are easily accessible to humans, they are not suitable for automatic processing: machines are still unable to effectively and dynamically interpret the meaning associated with natural language text in very large, heterogeneous, noisy, and ambiguous environments such as the Web. We present a novel methodology that goes beyond mere word-level analysis of text and enables a more efficient transformation of unstructured social data into structured information, readily interpretable by machines. In particular, we describe a novel paradigm for real-time concept-level sentiment analysis that blends computational intelligence, linguistics, and common-sense computing in order to improve the accuracy of computationally expensive tasks such as polarity detection from big social data. The main novelty of the paper consists in an algorithm that assigns contextual polarity to concepts in text and flows this polarity through the dependency arcs in order to assign a final polarity label to each sentence. Analyzing how sentiment flows from concept to concept through dependency relations allows for a better understanding of the contextual role of each concept in text, to achieve a dynamic polarity inference that outperforms state-of-the-art statistical methods in terms of both accuracy and training time.
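
As a rough illustration of the idea in this abstract, the toy sketch below lets polarity flow upward along dependency arcs until it reaches the sentence root. It is not the authors' algorithm: the lexicon `LEXICON`, the arc labels (`neg`, `nsubj`, `cop`), the `Node` class, and the "strongest signal wins" rule are all hypothetical choices made for this example, and the dependency tree is written by hand instead of coming from a parser.

```python
from dataclasses import dataclass, field

# Hypothetical concept-level lexicon with polarities in [-1, 1].
LEXICON = {"good": 0.8, "great": 0.9, "bad": -0.7, "slow": -0.5}


@dataclass
class Node:
    word: str
    label: str = "root"                      # label of the arc from the parent
    children: list = field(default_factory=list)


def flow(node: Node) -> float:
    """Return the polarity that flows up from this subtree to its parent."""
    own = LEXICON.get(node.word, 0.0)
    negated = False
    child_scores = []
    for child in node.children:
        if child.label == "neg":
            negated = True                   # a negation arc flips the result
        else:
            child_scores.append(flow(child))
    # Toy combination rule: the strongest signal (by magnitude) wins.
    score = max([own] + child_scores, key=abs)
    return -score if negated else score


def sentence_polarity(root: Node) -> str:
    """Map the score that reaches the root to a sentence-level label."""
    score = flow(root)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


# "The screen is not good": the neg arc flips the polarity of "good".
not_good = Node("good", children=[
    Node("screen", "nsubj"), Node("is", "cop"), Node("not", "neg"),
])
print(sentence_polarity(not_good))  # negative
```

A word-level count would see only the positive word "good" here; following the `neg` arc is what changes the sentence label.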

170 citations

Journal Article•10.17705/1CAIS.03723•

Business Analytics in the Context of Big Data: A Roadmap for Research

Gloria Phillips-Wren¹, Lakshmi S. Iyer², Uday R. Kulkarni, Thilini Ariyachandra³•Institutions (3)

Loyola University Maryland¹, University of North Carolina at Greensboro², Xavier University³

01 Jan 2015-Communications of the AIS

TL;DR: This paper builds on academic and industry discussions from the 2012 and 2013 pre-ICIS events: BI Congress III and the Special Interest Group on Decision Support Systems workshop, respectively, by presenting a big data analytics framework that depicts a process view of the components needed for big data analytics in organizations.

Abstract: This paper builds on academic and industry discussions from the 2012 and 2013 pre-ICIS events: BI Congress III and the Special Interest Group on Decision Support Systems (SIGDSS) workshop, respectively. Recognizing the potential of “big data” to offer new insights for decision making and innovation, panelists at the two events discussed how organizations can use and manage big data for competitive advantage. In addition, expert panelists helped to identify research gaps. While emerging research in the academic community identifies some of the issues in acquiring, analyzing, and using big data, many of the new developments are occurring in the practitioner community. We bridge the gap between academic and practitioner research by presenting a big data analytics framework that depicts a process view of the components needed for big data analytics in organizations. Using practitioner interviews and literature from both academia and practice, we identify the current state of big data research guided by the framework and propose potential areas for future research to increase the relevance of academic research to practice.

153 citations

Journal Article•

Arabic Text Classification Using Maximum Entropy

Alaa M. El-Halees

05 Dec 2015-IUG Journal of Natural Studies

TL;DR: In this article, the authors focused on classifying Arabic text documents using a maximum entropy method; they tested their approach on real data and compared the results with other existing systems.

Abstract: In organizations, a large amount of information exists in text documents. Therefore, it is important to use text mining to discover knowledge from these unstructured data. Automatic text classification is considered one of the important applications of text mining. It is the process of assigning a text document to one or more predefined categories based on its content. This paper focuses on classifying Arabic text documents. Arabic is a highly inflectional and derivational language, which makes text mining a complex task. In our approach, we first preprocessed the data using natural language processing techniques such as tokenizing, stemming and part-of-speech tagging. Then, we used the maximum entropy method to classify Arabic documents. We tested our approach using real data, then compared the results with other existing systems.
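
For readers unfamiliar with the method, a maximum entropy text classifier is equivalent to multinomial logistic regression over document features. The sketch below is a minimal pure-Python version trained by stochastic gradient ascent on bag-of-words features. It is not the paper's system: the function names, the hyperparameters, and the tiny English training set are assumptions for illustration, and the Arabic preprocessing steps (tokenizing, stemming, part-of-speech tagging) are assumed to have already produced the token lists.

```python
import math
from collections import defaultdict


def predict_proba(weights, tokens):
    """Softmax over per-class sums of token weights."""
    scores = {c: sum(w.get(tok, 0.0) for tok in tokens) for c, w in weights.items()}
    top = max(scores.values())               # subtract the max for numerical stability
    exps = {c: math.exp(s - top) for c, s in scores.items()}
    total = sum(exps.values())
    return {c: e / total for c, e in exps.items()}


def train_maxent(docs, labels, epochs=200, lr=0.5):
    """Fit one weight per (class, token) pair by stochastic gradient ascent."""
    classes = sorted(set(labels))
    weights = {c: defaultdict(float) for c in classes}
    for _ in range(epochs):
        for tokens, gold in zip(docs, labels):
            probs = predict_proba(weights, tokens)
            for c in classes:
                # Log-likelihood gradient: observed indicator minus predicted probability.
                error = (1.0 if c == gold else 0.0) - probs[c]
                for tok in tokens:
                    weights[c][tok] += lr * error
    return weights


def classify(weights, tokens):
    probs = predict_proba(weights, tokens)
    return max(probs, key=probs.get)


# Hypothetical, already-preprocessed training documents.
docs = [
    ["match", "goal", "team"],
    ["team", "player", "goal"],
    ["market", "stock", "price"],
    ["bank", "price", "market"],
]
labels = ["sports", "sports", "economy", "economy"]

model = train_maxent(docs, labels)
print(classify(model, ["goal", "player"]))
print(classify(model, ["stock", "bank"]))
```

The sketch omits regularization, which a practical classifier would normally add to keep the weights from growing without bound on separable data.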

150 citations

Journal Article•10.1016/J.PROCS.2015.12.005•

Question Answering Systems: Survey and Trends☆

Abdelghani Bouziane, Djelloul Bouchiha, Noureddine Doumi, Mimoun Malki¹•Institutions (1)

SIDI¹

01 Jan 2015-Procedia Computer Science

TL;DR: This paper surveys various Question Answering Systems (QAS) so that researchers can see their shortcomings, propose new systems for complex queries, and adapt or reuse QAS techniques for specific research issues.

148 citations

Journal Article•10.1016/J.PROCS.2015.04.108•

Hadoop, MapReduce and HDFS: A developers perspective

Mohd Rehan Ghazi¹, Durgaprasad Gangodkar¹•Institutions (1)

Graphic Era University¹

01 Jan 2015-Procedia Computer Science

TL;DR: This paper discusses in detail Hadoop and its components, MapReduce and HDFS, which together form a scalable and fault-tolerant model that hides the complexities of Big Data analytics.

148 citations

Journal Article•10.1002/CPE.3398•

Emerging trends and technologies in big data processing

Rubén Casado, Muhammad Younas¹•Institutions (1)

Oxford Brookes University¹

10 Jun 2015-Concurrency and Computation: Practice and Experience

TL;DR: This paper defines a lifecycle for Big Data processing and classifies various available tools and technologies in terms of the lifecycle phases of Big Data, which include data acquisition, data storage, data analysis, and data exploitation of the results.

Abstract: Big Data encompasses large volumes of complex structured, semi-structured, and unstructured data, which are beyond the processing capabilities of conventional databases. The processing and analysis of Big Data now play a central role in decision making, forecasting, business analysis, product development, customer experience, and loyalty, to name but a few. In this paper, we examine the distinguishing characteristics of Big Data along the lines of the 3Vs: variety, volume, and velocity. Accordingly, the paper provides an insight into the main processing paradigms in relation to the 3Vs. It defines a lifecycle for Big Data processing and classifies various available tools and technologies in terms of the lifecycle phases of Big Data, which include data acquisition, data storage, data analysis, and data exploitation of the results. This paper is the first of its kind to review and analyze current trends and technologies in relation to the characteristics, evolution, and processing of Big Data. Copyright © 2014 John Wiley & Sons, Ltd.

132 citations

Journal Article•10.1017/S0269888914000277•

A survey on text mining in social networks

Rizwana Irfan¹, Christine K. King¹, Daniel Grages¹, Sam J. Ewen¹, Samee U. Khan¹, Sajjad A. Madani², Joanna Kolodziej, Lizhe Wang³, Dan Chen⁴, Ammar Rayes⁵, Nikos Tziritas³, Cheng-Zhong Xu³, Albert Y. Zomaya⁶, Ahmed Alzahrani⁷, Hongxiang Li⁸•Institutions (8)

North Dakota State University¹, COMSATS Institute of Information Technology², Chinese Academy of Sciences³, China University of Geosciences (Wuhan)⁴, Cisco Systems, Inc.⁵, University of Sydney⁶, King Abdulaziz University⁷, University of Louisville⁸

01 Mar 2015-Knowledge Engineering Review

TL;DR: This survey investigates the recent advancement in the field of text analysis and covers two basic approaches of text mining, such as classification and clustering that are widely used for the exploration of the unstructured text available on the Web.

Abstract: In this survey, we review different text mining techniques to discover various textual patterns from social networking sites. Social network applications create opportunities to establish interaction among people leading to mutual learning and sharing of valuable knowledge, such as chat, comments, and discussion boards. Data in social networking websites is inherently unstructured and fuzzy in nature. In everyday conversations, people do not care about spelling or accurate grammatical construction, which may lead to different types of ambiguities, such as lexical, syntactic, and semantic. Therefore, analyzing and extracting information patterns from such data sets is more complex. Several surveys have been conducted to analyze different methods for information extraction. Most of the surveys emphasized the application of different text mining techniques to unstructured data sets that reside in the form of text documents, but do not specifically target the data sets in social networking websites. This survey attempts to provide a thorough understanding of different text mining techniques as well as the application of these techniques in social networking websites. This survey investigates the recent advancement in the field of text analysis and covers two basic approaches of text mining, such as classification and clustering that are widely used for the exploration of the unstructured text available on the Web.

132 citations

Proceedings Article•10.1109/BDCLOUD.2015.62•

Personal Data Lake with Data Gravity Pull

Coral Walker¹, Hassan H. Alrehamy¹•Institutions (1)

Cardiff University¹

26 Aug 2015

TL;DR: This paper presents Personal Data Lake, a unified storage facility for storing, analyzing and querying personal data, and allows third-party plugins so that the unstructured data can be analyzed and queried.

Abstract: This paper presents Personal Data Lake, a unified storage facility for storing, analyzing and querying personal data. A data lake stores data regardless of format and thus provides an intuitive way to store personal data fragments of any type. Metadata management is a central part of the lake architecture. For structured/semi-structured data fragments, metadata may contain information about the schema of the data so that the data can be transformed into queryable data objects when required. For unstructured data, enabling gravity pull means allowing third-party plugins so that the unstructured data can be analyzed and queried.

108 citations

Proceedings Article•10.1109/ICADLT.2015.7136630•

Big data analytics for logistics and transportation

Abdelkarim Ben Ayed¹, Mohamed Ben Halima¹, Adel M. Alimi¹•Institutions (1)

University of Sfax¹

20 May 2015

TL;DR: This paper reviews the latest applications of big data analytics in the logistics and transportation industry and proposes a novel approach to detecting and recognizing container codes based on a Hadoop big data analytics system.

Abstract: Nowadays, there are many challenges for the logistics industry, mainly with the integration of e-commerce and new sources of data such as smartphones, sensors, GPS and other devices. These new data sources generate a huge quantity of unstructured data daily; to deal with such complex data, the use of big data analytics tools becomes a necessity. In this context, much work has recently been done on integrating big data analytics into the logistics industry. In this paper, we review the latest applications of big data analytics in the logistics and transportation industry and propose a novel approach to detecting and recognizing container codes based on a Hadoop big data analytics system.

Journal Article•10.3892/OR.2014.3579•

Human cancer databases (Review)

Athanasia Pavlopoulou¹, Demetrios A. Spandidos², Ioannis Michalopoulos¹•Institutions (2)

Academy of Athens¹, University of Crete²

01 Jan 2015-Oncology Reports

TL;DR: This review presents the central, publicly accessible databases that contain data pertinent to cancer, the resources available for delivering and analyzing information from these databases, as well as databases dedicated to specific types of cancer.

Abstract: Cancer is one of the four major non‑communicable diseases (NCD), responsible for ~14.6% of all human deaths. Currently, there are >100 different known types of cancer and >500 genes involved in cancer. Ongoing research efforts have been focused on cancer etiology and therapy. As a result, there is an exponential growth of cancer‑associated data from diverse resources, such as scientific publications, genome‑wide association studies, gene expression experiments, gene‑gene or protein‑protein interaction data, enzymatic assays, epigenomics, immunomics and cytogenetics, stored in relevant repositories. These data are complex and heterogeneous, ranging from unprocessed, unstructured data in the form of raw sequences and polymorphisms to well‑annotated, structured data. Consequently, the storage, mining, retrieval and analysis of these data in an efficient and meaningful manner pose a major challenge to biomedical investigators. In the current review, we present the central, publicly accessible databases that contain data pertinent to cancer, the resources available for delivering and analyzing information from these databases, as well as databases dedicated to specific types of cancer. Examples for this wealth of cancer‑related information and bioinformatic tools have also been provided.

Proceedings Article•10.1109/DTA.2015.14•

A Study on Data Input and Output Performance Comparison of MongoDB and PostgreSQL in the Big Data Environment

Min-Gyue Jung¹, Seon-A Youn¹, Jayon Bae¹, Yong-Lak Choi¹•Institutions (1)

Soongsil University¹

1 Nov 2015

TL;DR: This study assesses the performance differences between RDBMS and NoSQL and describes the optimal design for enhanced functionality when using NoSQL.

Abstract: Due to the advancement of social networks and the popularization of mobile devices, the processing of massive data by existing relational database management systems (RDBMS) has become an issue. NoSQL is a class of database management systems that makes processing of massive and/or unstructured data easier, and many companies today tend to start a project using NoSQL. Moreover, converting the RDBMS of current systems to NoSQL has become a trend. This study assesses the performance differences between RDBMS and NoSQL. The optimal design for enhanced functionality when using NoSQL is described as well. In this study, PostgreSQL and MongoDB have been selected to represent RDBMS and NoSQL, respectively, for comparative analysis.

Journal Article•10.1016/J.JBI.2015.06.029•

Combining knowledge- and data-driven methods for de-identification of clinical narratives

Azad Dehghan, Aleksandar Kovačević¹, George Karystianis, John A. Keane², Goran Nenadic•Institutions (2)

University of Novi Sad¹, University of Manchester²

01 Dec 2015-Journal of Biomedical Informatics

TL;DR: The overall results suggest that automated text mining methods can be used to reliably process clinical notes to identify personal information, thus providing a crucial step in large-scale de-identification of unstructured data for further clinical and epidemiological studies.

Proceedings Article•10.1109/WAINA.2015.19•

Handling Big Data Using NoSQL

Jagdev Bhogal¹, Imran Choksi¹•Institutions (1)

Birmingham City University¹

24 Mar 2015

TL;DR: This paper aims to introduce the concepts behind NoSQL, provides a review of relevant literature, highlights the different NoSQL database types, and provides arguments for and against adopting NoSQL.

Abstract: With the emergence of Big Data, the use of NoSQL (Not only SQL) technology is rising rapidly among internet companies and other enterprises. Benefits include simplicity of design, horizontal scaling and finer control over availability. NoSQL databases are increasingly considered a viable alternative to relational databases, as more organizations recognize that their schemaless data model is a better method for handling the large volumes of structured, semi-structured and unstructured data being captured and processed today. For example, NoSQL databases are often used to collect and store social media data. This paper aims to introduce the concepts behind NoSQL, provides a review of relevant literature, highlights the different NoSQL database types, and provides arguments for and against adopting NoSQL. A small prototype application has been developed to assess the stated NoSQL benefits and illustrate the differences between the SQL and NoSQL approaches. The last section of the paper offers some conclusions and recommendations for further research to expand upon our research work.

Journal Article•

Conceptual Model for Successful Implementation of Big Data in Organizations

Mohanad Halaweh¹, Ahmed El Massry•Institutions (1)

California State University, San Bernardino¹

07 Oct 2015-Journal of International Technology and Information Management

TL;DR: The current paper aims to develop a holistic model that includes the factors that would affect the success or failure of the implementation of big data in organizations, and examines the opportunities that organizations would attain from implementing big data, as well as the challenges that could hinder this implementation.

Abstract: The term 'big data' has gained huge popularity in recent years among IT professionals and academicians. Big data describes the massive amount of data that can be processed and analyzed using technology to gain business values that will help organizations to achieve competitive advantages. The current paper aims to develop a holistic model that includes the factors that would affect the success or failure of the implementation of big data in organizations. Furthermore, this research examines the opportunities that organizations would attain from implementing big data, as well as the challenges that could hinder this implementation. The proposed model provides IT managers and decision makers the important factors that they need to consider when deciding to implement big data in order to ensure that it achieves the competitive advantage. KEYWORDS: Big Data, opportunities, challenges, implementation INTRODUCTION The interest of big data has increased because of the significant amount of data generated every day. Data is getting bigger because it is continuing to be generated from more devices and more sources such as personal computers, mobile phones, government records, healthcare records, social media, street sensors, climate sensors, airport terminals, hypermarkets' points of sales, etc. These sources generate a massive amount of data and it will continue to generate more and more data as time passes since people are getting more dependent on technology. As anticipated by Cisco Visual Networking Index (VNI) report (2015), mobile data traffic is expected to grow to 24.3 Exabytes per month by 2019 because of increased usage on smartphones. This is nearly a tenfold increase over 2014. A study by Intel also showed that data has increased enormously in the last decade. It showed that humankind has generated five Exabytes until 2003. From 2003 to 2013, data has increased to reach 2.7 Zettabytes (i.e. 500x more data). 
Data will continue to increase to three times bigger than that by 2015. In the same context, Das et al. (2013) pointed to the rapid growth of global data. They mentioned that it took from the dawn of time to 2003 to create five Exabytes of information whereas now the same volume of data is created in just two days. This will continue to reach eight Zettabytes by 2015 (i.e., the equivalent of 18 million Libraries of Congress). Although data is increasing enormously, a very small fraction of this data has been exploited; the rest is not tapped yet. According to IBM and Intel, 90% of data is unstructured and is not used. Data can be classified into structured and unstructured data. Structured data refers to data that can be organized and stored in relational databases so it can be easily used and searched efficiently. Unstructured data refers to data that does not have a pre-defined data model or is not organized in a pre-defined manner, such as videos, photos, images, emails, text documents and blogs. Searching and analyzing unstructured data is more difficult than for structured data. Das et al. (2013) also argued that unstructured data would account for 90% of data in the next decade where analyzing this massive amount of data would expose new improvements in business that were impossible to determine previously. Indeed, the interest in big data has increased because it is supposed to have a significant impact on organizations, and this would be achieved by analyzing the unstructured data. According to a survey conducted by IDG Enterprise (2014) amongst more than 750 IT decision-makers in 2013, the interest in big data continues to rise, as nearly half of the respondents (50%) are implementing or planning to implement big data projects within their organizations. BIG DATA DEFINITION Although the term has gained huge popularity in recent years, it is still poorly defined and there is huge ambiguity regarding its exact meaning (Hartmann et al. …

Journal Article•10.1007/S10586-014-0416-6•

Scheduling of big data applications on distributed cloud based on QoS parameters

Rajinder Sandhu¹, Sandeep K. Sood¹•Institutions (1)

Guru Nanak Dev University¹

01 Jun 2015-Cluster Computing

TL;DR: A global architecture is proposed for QoS based scheduling for big data application to distributed cloud datacenter at two levels which are coarse grained and fine grained, and results indicated better QoS achievement and 33.15% cost gain of the proposed architecture over traditional Amazon methods.

Abstract: Big data is one of the major technology usages for business operations in today's competitive market. It provides organizations a powerful tool to analyze large unstructured data to make useful decisions. Result quality, time, and price associated with big data analytics are very important aspects for its success. Selection of appropriate cloud infrastructure at coarse and fine grained level will ensure better results. In this paper, a global architecture is proposed for QoS based scheduling for big data application to distributed cloud datacenter at two levels which are coarse grained and fine grained. At coarse grain level, appropriate local datacenter is selected based on network distance between user and datacenter, network throughput and total available resources using adaptive K nearest neighbor algorithm. At fine grained level, probability triplet (C, I, M) is predicted using naive Bayes algorithm which provides probability of new application to fall in compute intensive (C), input/output intensive (I) and memory intensive (M) categories. Each datacenter is transformed into a pool of virtual clusters capable of executing specific category of jobs with specific (C, I, M) requirements using self organized maps. Novelty of study is to represent whole datacenter resources in a predefined topological ordering and executing new incoming jobs in their respective predefined virtual clusters based on their respective QoS requirements. Proposed architecture is tested on three different Amazon EMR datacenters for resource utilization, waiting time, availability, response time and estimated time to complete the job. Results indicated better QoS achievement and 33.15 % cost gain of the proposed architecture over traditional Amazon methods.
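
The fine-grained step in this abstract predicts a probability triplet (C, I, M) with naive Bayes. The sketch below shows how such a triplet could be computed with a categorical naive Bayes classifier and Laplace smoothing. The job features (`cpu`, `io`, `mem`), their values, and the training jobs are hypothetical; the abstract does not say which features or which naive Bayes variant the authors used.

```python
from collections import Counter, defaultdict

CATEGORIES = ("C", "I", "M")   # compute, input/output, memory intensive


def train(jobs):
    """Count class and per-class feature-value frequencies.

    jobs: list of (features_dict, category) pairs.
    """
    class_counts = Counter(cat for _, cat in jobs)
    value_counts = defaultdict(Counter)   # (category, feature) -> value counts
    values_seen = defaultdict(set)        # feature -> distinct values observed
    for features, cat in jobs:
        for name, value in features.items():
            value_counts[(cat, name)][value] += 1
            values_seen[name].add(value)
    return class_counts, value_counts, values_seen


def probability_triplet(model, features):
    """Return (P(C), P(I), P(M)) for a new job, normalized to sum to 1."""
    class_counts, value_counts, values_seen = model
    total = sum(class_counts.values())
    scores = {}
    for cat in CATEGORIES:
        # Laplace-smoothed class prior.
        score = (class_counts[cat] + 1) / (total + len(CATEGORIES))
        for name, value in features.items():
            count = value_counts[(cat, name)][value]
            n_values = len(values_seen[name] | {value})
            # Laplace-smoothed likelihood of this feature value given the class.
            score *= (count + 1) / (class_counts[cat] + n_values)
        scores[cat] = score
    norm = sum(scores.values())
    return tuple(scores[cat] / norm for cat in CATEGORIES)


# Hypothetical historical jobs with known categories.
jobs = [
    ({"cpu": "high", "io": "low", "mem": "low"}, "C"),
    ({"cpu": "high", "io": "low", "mem": "medium"}, "C"),
    ({"cpu": "low", "io": "high", "mem": "low"}, "I"),
    ({"cpu": "low", "io": "high", "mem": "medium"}, "I"),
    ({"cpu": "low", "io": "low", "mem": "high"}, "M"),
    ({"cpu": "medium", "io": "low", "mem": "high"}, "M"),
]

model = train(jobs)
c, i, m = probability_triplet(model, {"cpu": "high", "io": "low", "mem": "low"})
print(f"C={c:.3f} I={i:.3f} M={m:.3f}")
```

In the architecture the abstract describes, a triplet like this would then be matched against virtual clusters with corresponding (C, I, M) requirements; that matching step is not sketched here.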

Journal Article•

Data Analytics vs. Data Science: A Study of Similarities and Differences in Undergraduate Programs Based on Course Descriptions.

Cheryl L. Aasheim, Susan Rebstock Williams, Paige Rutner, Adrian Gardiner

01 May 2015-The Journal of information and systems in education

TL;DR: The course offerings of a small sample of undergraduate data analytics and data science programs are examined to determine what similarities and differences exist across programs, and discrepancies between skills in the literature and those offered in degree programs are identified.

Abstract: 1. INTRODUCTION

Inexpensive data storage and the ever-growing flow of data from a variety of sources increase the amount of data available to organizations. Competing in the era of big data will require analytically-focused employees with the specialized knowledge and skills to extract useful information from this data. Some have expressed great concern that the demand for employees with this skill set will far outstrip supply (see, e.g., Davenport and Patil, 2012). A widely cited report by McKinsey and Company concluded, "The United States alone faces a shortage of 140,000 to 190,000 people with deep analytical skills as well as 1.5 million managers and analysts to analyze big data and make decisions based on their findings" (Manyika et al., 2011, p. 3).

Universities are responding to this call to educate the next generation of entry level data savvy professionals. There are a growing number of degree programs, specializations, and certificates in data science and data analytics at both the graduate and undergraduate levels (Davenport and Patil, 2012; Dumbill et al., 2013). However, there are still relatively few full degree programs at the undergraduate level. A recent review of undergraduate degree programs in data analytics and data science identified thirteen such programs across the United States (Aasheim et al., 2014). Since that time, it is likely that more such programs have been developed.

However, as more universities expand into this area, little is currently known about the specifics of skills covered in those degree programs or the extent to which skills coverage is comparable across different programs. This paper fills that gap by examining the course offerings of a small sample of undergraduate data analytics and data science programs to determine what similarities and differences exist across programs. In addition, discrepancies between skills in the literature and those offered in degree programs are identified. This examination will contribute to the goal of identifying important topics for an undergraduate program in data analytics and data science. The focus will be data analytics programs specifically and how they relate to the traditional information systems program.

2. LITERATURE REVIEW

Organizations have collected and analyzed data in an attempt to gain strategic advantage in the market place for many years. However, in recent years, the amount and complexity of available data has exploded, making it more difficult to gain insights from data to improve business decision making. This section presents a review of relevant literature addressing (1) the growth of big data, (2) the evolution of data analytics as a field of study, (3) legal and ethical issues surrounding big data, and (4) implications for academia.

2.1 The Growth of Big Data

A number of factors have contributed to the explosion of data. In the latter part of the 20th century, organizations emphasized integrating transactional databases into data warehouses that could then be analyzed to improve business decisions (Eckerson, 2011). As organizations began to realize benefits from this analysis, this trend accelerated. In one example, Walmart was able to identify bestselling products in hurricane-prone areas when storms were approaching; as a result, prior to storm season, Walmart stores stocked-up not only on obvious high-demand staples such as batteries but also the less obvious number two bestselling item--Pop-Tarts (Preimesberger, 2011).

Growth in e-commerce and social media has contributed to the increase in data accumulation, particularly as organizations utilize clickstream data and social media comments to track customer sentiment and understand consumer behavior. Organizations also collect unstructured data through sources such as bar codes, QR codes, RFID tags, and sensors. United Parcel Service (UPS) installed sensors on more than 46,000 delivery trucks to monitor location, safety, and efficiency related data including speed, direction, and mechanical performance (Davenport, 2013). …

Proceedings Article•10.1109/HICSS.2015.368•

Beyond a Technical Perspective: Understanding Big Data Capabilities in Health Care

Yichuan Wang¹, LeeAnn Kung¹, Chaochi Ting², Terry Anthony Byrd¹•Institutions (2)

Auburn University¹, IBM²

5 Jan 2015

TL;DR: In this article, the authors examine the development, architecture and component functionalities of big data, and identify its capabilities, including traceability, the analysis of unstructured data and patterns of care.

Abstract: To date, the health care industry has paid little attention to the potential benefits to be gained from big data. While most pioneering big data studies have adopted technological perspectives, a better understanding of the strategic implications of big data is urgently needed. To address this lack, this study examines the development, architecture and component functionalities of big data, and identifies its capabilities, including traceability, the analysis of unstructured data and patterns of care, and its predictive capacity to support healthcare managers seeking to formulate more effective big-data-based strategies. Our findings will help healthcare organizations respond strategically to the challenges they face in today's highly competitive healthcare market.

Proceedings Article•10.1109/IEEM.2015.7385969•

Research on IoT based Cyber Physical System for Industrial big data Analytics

Carman K. M. Lee¹, C.L. Yeung¹, M.N. Cheng¹•Institutions (1)

Hong Kong Polytechnic University¹

1 Dec 2015

TL;DR: The paper designs a new context intelligence framework to handle industrial informatics regarding location, sensor and unstructured data for big data mining, and designs a cyber physical system that integrates various existing and proprietary data analytics systems.

Abstract: The purpose of this paper is to provide a comprehensive solution for industry through research and development of an Internet of Things (IoT) based Cyber Physical System for Industrial Informatics Analytics with the following objectives. This study conducted a review regarding big data analytics in industry and designed a cyber physical system with the integration of various existing and proprietary data analytics systems based on their business needs so that the modules can be reconfigurable and interchangeable. The paper designs a new context intelligence framework to handle industrial informatics regarding location, sensor and unstructured data for big data mining. A case study is used to illustrate the concept of the proposed cyber physical system. Further study on system integration and migration from existing factories to smart factories should be conducted so as to realize the next industrial paradigm shift.

Posted Content•

Beyond a Technical Perspective: Understanding Big Data Capabilities in Health Care

Yichuan Wang¹, LeeAnn Kung², Chaochi Ting³, Terry Anthony Byrd²•Institutions (3)

University of Newcastle¹, Auburn University², IBM³

05 Jan 2015-Social Science Research Network

TL;DR: This study examines the development, architecture and component functionalities of big data, and identifies its capabilities, including traceability, the analysis of unstructured data and patterns of care, and its predictive capacity to support healthcare managers seeking to formulate more effective big-data-based strategies.

10.5555/2874916.2874964•

The next generation of modeling & simulation: integrating big data and deep learning

Andreas Tolk

26 Jul 2015

TL;DR: The main ideas and methods of big data and deep learning are introduced, and it is shown how they can be applied to various phases of the traditional modeling and simulation process. Together they have the potential to lead to a new generation of modeling and simulation applications that provide computational scientific support on a new scale beyond the current capabilities.

Abstract: Big data allows users to cope with data that are huge with regard to volume, velocity, variety, and veracity. It provides methods and tools to extract aggregates and new information out of heterogeneously structured data or even completely unstructured data. Deep learning is a collection of algorithms that allows us to discover correlations and learn (supervised and unsupervised) from information provided. This contribution introduces the main ideas and methods of big data and deep learning and shows how they can be applied to various phases of the traditional modeling and simulation process. Big data supports obtaining data for the initialization as well as evaluating the results of the simulation experiment. Deep learning can help with the conceptual modeling phase as well as with the discovery of correlations in the results. Examples of existing applications will be given to prove the feasibility of such ideas. This leads to the observation that big data, deep learning, and modeling and simulation have the potential to lead to a new generation of modeling and simulation applications that provide computational scientific support on a new scale beyond the current capabilities.

Journal Article•10.1002/WICS.1361•

A practical guide to text mining with topic extraction

Andrew T. Karl, James Wisnowski, W. Heath Rushing

01 Sep 2015-Wiley Interdisciplinary Reviews: Computational Statistics

TL;DR: It is shown how the singular value decomposition may be used to drastically reduce the size of the document space while also setting the stage for automatic topic extraction, courtesy of the varimax rotation.

Abstract: Text analytics continue to proliferate as mass volumes of unstructured but highly useful data are generated at unbounded rates. Vector space models for text data, in which documents are represented by rows and words by columns, provide a translation of this unstructured data into a format that may be analyzed with statistical and machine learning techniques. This approach gives excellent results in revealing common themes, clustering documents, clustering words, and in translating unstructured text fields (such as an open-ended survey response) to usable input variables for predictive modeling. After discussing the collection and processing of text, we explore properties and transformations of the document-term matrix (DTM). We show how the singular value decomposition may be used to drastically reduce the size of the document space while also setting the stage for automatic topic extraction, courtesy of the varimax rotation. This latent semantic analysis (LSA) approach produces factors that are compatible with graphical exploration and advanced analytics. We also explore Latent Dirichlet Allocation for topic analysis. We reference published R packages to implement the methods and conclude with a summary of other popular open-source and commercial software packages. WIREs Comput Stat 2015, 7:326–340. doi: 10.1002/wics.1361
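
The document-term matrix and the role of the singular value decomposition can be illustrated with a small sketch. This is not the authors' code (the abstract points to R packages); it is a Python illustration using only the standard library, and the function names `build_dtm` and `leading_singular_triplet` are hypothetical. It builds a raw-count DTM and recovers only the largest singular value and its vectors by power iteration; a full SVD, the varimax rotation, and Latent Dirichlet Allocation are not shown.

```python
import math
import re
from collections import Counter


def build_dtm(docs):
    """Return (vocab, matrix): one row per document, one column per word."""
    tokenized = [re.findall(r"[a-z]+", d.lower()) for d in docs]
    vocab = sorted({w for toks in tokenized for w in toks})
    index = {w: j for j, w in enumerate(vocab)}
    matrix = [[0.0] * len(vocab) for _ in docs]
    for i, toks in enumerate(tokenized):
        for word, count in Counter(toks).items():
            matrix[i][index[word]] = float(count)
    return vocab, matrix


def leading_singular_triplet(a, iters=200):
    """Power iteration on A^T A: (sigma, u, v) for the largest singular value."""
    n_rows, n_cols = len(a), len(a[0])
    # A DTM is non-negative, so an all-positive start is not orthogonal
    # to the leading right singular vector.
    v = [1.0 / math.sqrt(n_cols)] * n_cols
    for _ in range(iters):
        av = [sum(a[i][j] * v[j] for j in range(n_cols)) for i in range(n_rows)]
        atav = [sum(a[i][j] * av[i] for i in range(n_rows)) for j in range(n_cols)]
        norm = math.sqrt(sum(x * x for x in atav))
        v = [x / norm for x in atav]
    av = [sum(a[i][j] * v[j] for j in range(n_cols)) for i in range(n_rows)]
    sigma = math.sqrt(sum(x * x for x in av))
    u = [x / sigma for x in av]
    return sigma, u, v


# Toy documents, invented for illustration only.
docs = ["storm batteries flashlight", "storm batteries snacks", "survey response text"]
vocab, dtm = build_dtm(docs)
sigma, doc_weights, word_weights = leading_singular_triplet(dtm)
```

Here `doc_weights` scores each document on the dominant factor and `word_weights` scores each word on it, which is the reduced representation that LSA builds on when several factors are kept.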

Proceedings Article•10.1109/ICRITO.2015.7359270•

Unravelling unstructured data: A wealth of information in big data

Mona Tanwar¹, Reena Duggal¹, Sunil Kumar Khatri¹•Institutions (1)

Amity University¹

1 Sep 2015

TL;DR: The importance of analyzing unstructured data along with structured data in business to extract holistic insights is emphasized, and the need for appropriate and efficient analytical methods for knowledge discovery from huge volumes of heterogeneous data in unstructured formats is highlighted.

Abstract: Big Data is data of high volume and high variety, produced or generated at high velocity, which cannot be stored, managed, processed or analyzed using the existing traditional software tools, techniques and architectures. Big data brings many challenges, such as scale, heterogeneity, speed and privacy, but there are opportunities as well. Potential information is locked in big data which, if properly leveraged, will make a huge difference to business. With the help of big data analytics, meaningful insights can be extracted from big data, which is heterogeneous in nature, comprising structured, unstructured and semi-structured content. One prime challenge in big data analytics is that nearly 95% of data is unstructured. This paper describes what big data and big data analytics are. A review of different techniques and approaches to analyze unstructured data is given. This paper emphasizes the importance of analysis of unstructured data along with structured data in business to extract holistic insights. The need for appropriate and efficient analytical methods for knowledge discovery from huge volumes of heterogeneous data in unstructured formats has been highlighted.

Book Chapter•10.1016/B978-0-12-800207-0.00001-0•

Chapter 1 – Analytics Defined

Mark Ryan M. Talabis

1 Jan 2015

TL;DR: This chapter will focus on methods particularly useful for discovering security breaches and attacks, and which can be implemented with either free or commonly available software.

Abstract: Knowledge of analytical methods and techniques is essential for uncovering hidden patterns in security-related data. Analytical techniques range from simple descriptive statistics and data visualization methods to statistical analysis algorithms such as regression, correlation analysis, and support vector machines. The field of analytics is broad. This chapter will focus on methods particularly useful for discovering security breaches and attacks, and which can be implemented with either free or commonly available software. As there are unlimited ways that an attacker can compromise a system, analysts also need a toolkit of techniques to be creative in analyzing security data. Among tools available for creative analysis, we will examine analytical programming languages that allow analysts to customize analytical procedures and applications. The concepts introduced in this chapter will provide you with a framework for security analysis, along with useful methods and tools.
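
As a minimal sketch of the simplest end of that range, descriptive statistics applied to security data, the following flags days whose failed-login count sits far from the mean. It is an illustration only, not taken from the chapter: the function name `flag_outliers`, the z-score threshold, and the counts are all assumptions.

```python
import statistics


def flag_outliers(counts, threshold=2.0):
    """Return the indices whose z-score magnitude exceeds the threshold."""
    mean = statistics.mean(counts)
    spread = statistics.pstdev(counts)  # population standard deviation
    if spread == 0:
        return []  # all values identical, nothing stands out
    return [i for i, c in enumerate(counts) if abs(c - mean) / spread > threshold]


# Hypothetical daily failed-login counts; the last day is a spike.
daily_failed_logins = [3, 4, 2, 5, 3, 4, 40]
suspicious_days = flag_outliers(daily_failed_logins)
```

A single large spike also inflates the mean and standard deviation, so a robust variant based on the median would be a reasonable next step for real data.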

Journal Article•10.17775/CSEEJPES.2015.00003•

Knowledge model for electric power big data based on ontology and semantic web

Huang Yanhao¹, Xiaoxin Zhou¹•Institutions (1)

Electric Power Research Institute¹

29 May 2015-CSEE Journal of Power and Energy Systems

TL;DR: Research shows that the new model developed here is powerful and can adapt to various knowledge expression requirements of electric power big data and will be used in more applications.

Abstract: Making use of electric power knowledge is very important for the development of electric power big data technology. A new electric power knowledge theory model is proposed here to address the problem of modeling electric power knowledge in a normalized way for the management and analysis of electric power big data. Current modeling techniques of electric power knowledge are viewed as inadequate because of the complexity and variety of the relationships among electric power system data. Ontology theory and semantic web technologies used in electric power systems and in many other industry domains provide a new kind of knowledge modeling method. Based on this, this paper proposes the structure, elements, basic calculations and multidimensional reasoning method of the new knowledge model. A modeling example of the regulations defined in electric power system operation standard is demonstrated. Different forms of the model and related technologies are also introduced, including electric power system standard modeling, multi-type data management, unstructured data searching, knowledge display and data analysis based on semantic expansion and reduction. Research shows that the new model developed here is powerful and can adapt to various knowledge expression requirements of electric power big data. With the development of electric power big data technology, it is expected that the knowledge model will be improved and will be used in more applications.
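
The general idea behind ontology-based knowledge modeling can be sketched with subject-predicate-object triples and one inference rule. This is a generic illustration of semantic-web-style reasoning, not the paper's model or its multidimensional reasoning method; the function `infer_types` and the class names are hypothetical.

```python
def infer_types(triples):
    """Add (x, 'type', C2) whenever (x, 'type', C1) and (C1, 'subClassOf', C2) hold."""
    facts = set(triples)
    changed = True
    while changed:  # repeat until no new fact can be derived
        changed = False
        for s, p, o in list(facts):
            if p != "type":
                continue
            for s2, p2, o2 in list(facts):
                if p2 == "subClassOf" and s2 == o and (s, "type", o2) not in facts:
                    facts.add((s, "type", o2))
                    changed = True
    return facts


# Hypothetical mini-ontology for illustration.
knowledge = {
    ("T1", "type", "Transformer"),
    ("Transformer", "subClassOf", "SubstationEquipment"),
    ("SubstationEquipment", "subClassOf", "Equipment"),
}
closed = infer_types(knowledge)
```

After inference, a query for everything of type `Equipment` also returns `T1`, even though that fact was never stored directly.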

Posted Content•

How essential are unstructured clinical narratives and information fusion to clinical trial recruitment

Preethi Raghavan¹, James L. Chen¹, Eric Fosler-Lussier¹, Albert Lai¹•Institutions (1)

Ohio State University¹

13 Feb 2015-arXiv: Computers and Society

TL;DR: In this paper, the authors demonstrate that information extraction from unstructured clinical narratives is essential to most clinical applications, and they perform an empirical study to validate the argument and show that structured data alone is insufficient in resolving eligibility criteria for recruiting patients onto clinical trials for chronic lymphocytic leukemia (CLL) and prostate cancer.

Abstract: Electronic health records capture patient information using structured controlled vocabularies and unstructured narrative text. While structured data typically encodes lab values, encounters and medication lists, unstructured data captures the physician's interpretation of the patient's condition, prognosis, and response to therapeutic intervention. In this paper, we demonstrate that information extraction from unstructured clinical narratives is essential to most clinical applications. We perform an empirical study to validate the argument and show that structured data alone is insufficient in resolving eligibility criteria for recruiting patients onto clinical trials for chronic lymphocytic leukemia (CLL) and prostate cancer. Unstructured data is essential to solving 59% of the CLL trial criteria and 77% of the prostate cancer trial criteria. More specifically, for resolving eligibility criteria with temporal constraints, we show the need for temporal reasoning and information integration with medical events within and across unstructured clinical narratives and structured data.
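
The gap between structured fields and narrative text can be made concrete with a toy sketch. Everything here is hypothetical: the patient record, the three criteria, and the regular expressions, which stand in for real clinical information extraction. The temporal reasoning the abstract calls for is not attempted.

```python
import re

# Hypothetical toy record: structured fields plus a free-text note.
patient = {
    "structured": {"age": 67},
    "note": "Patient reports no prior chemotherapy. ECOG performance status 1.",
}


def age_ok(p):
    age = p["structured"].get("age")
    return None if age is None else age >= 18


def no_prior_chemo(p):
    # Only the narrative states this; a simple pattern stands in for real NLP.
    if re.search(r"\bno prior chemotherapy\b", p["note"], re.IGNORECASE):
        return True
    return None  # unresolved


def ecog_at_most_2(p):
    match = re.search(r"ECOG performance status (\d)", p["note"])
    return None if match is None else int(match.group(1)) <= 2


criteria = {
    "age >= 18": age_ok,
    "no prior chemotherapy": no_prior_chemo,
    "ECOG <= 2": ecog_at_most_2,
}


def resolve(p, use_note=True):
    """Evaluate every criterion; None means it could not be resolved."""
    view = p if use_note else {**p, "note": ""}
    return {name: check(view) for name, check in criteria.items()}


structured_only = resolve(patient, use_note=False)
with_narrative = resolve(patient)
```

In this toy case `structured_only` leaves two of the three criteria unresolved, while `with_narrative` resolves all of them.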

Journal Article•10.4236/JSEA.2015.812058•

Data Modeling and Data Analytics: A Survey from a Big Data Perspective

André Ribeiro, Afonso Silva, Alberto Rodrigues da Silva

24 Dec 2015-Journal of Software Engineering and Applications

TL;DR: This paper provides a broad view and discussion of the current state of this subject with a particular focus on data modeling and data analytics, describing and clarifying the main differences with respect to these aspects between the three main approaches, namely: operational databases, decision support databases and Big Data technologies.

Abstract: In recent years we have been witnessing a tremendous growth in the volume and availability of data. This fact results primarily from the emergence of a multitude of sources (e.g. computers, mobile devices, sensors or social networks) that are continuously producing structured, semi-structured or unstructured data. Database Management Systems and Data Warehouses are no longer the only technologies used to store and analyze datasets, namely because the volume and complex structure of today's data degrade their performance and scalability. Big Data is one of the recent challenges, since it implies new requirements in terms of data storage, processing and visualization. Despite that, properly analyzing Big Data can bring great advantages because it allows discovering patterns and correlations in datasets. Users can use this processed information to gain deeper insights and to get business advantages. Thus, data modeling and data analytics have evolved in a way that we are able to process huge amounts of data without compromising performance and availability, but instead by “relaxing” the usual ACID properties. This paper provides a broad view and discussion of the current state of this subject with a particular focus on data modeling and data analytics, describing and clarifying the main differences with respect to these aspects between the three main approaches, namely: operational databases, decision support databases and Big Data technologies.

Proceedings Article•

Management of Flexible Schema Data in RDBMSs - Opportunities and Limitations for NoSQL -

Zhen Hua Liu¹, Dieter Gawlick²•Institutions (2)

Business International Corporation¹, Oracle Corporation²

1 Jan 2015

TL;DR: The engineering principles and practices for managing FSD in RDBMSs to meet FSD’s unique requirements and challenges are described, along with the limitations and issues of current practices.

Abstract: RDBMSs are designed to manage well-structured data, requiring users to design a schema before storing and querying data. This is the ‘schema first, data later’ approach. However, there is a significant amount of unstructured and semi-structured data that cannot be effectively modelled this way. Even if certain parts of the data can be modelled using a schema, the inclusion of all fields would typically lead to a very large schema with many optional fields and with frequent schema evolution as data instances vary widely and evolve fast. Obviously, these data require the ‘data first, schema later/never’ approach. We call these data Flexible Schema Data (FSD). In this paper, we describe the engineering principles and practices to manage FSD in RDBMSs to meet FSD’s unique requirements and challenges. We describe the limitations and issues of current practices and potential research opportunities. Having a single data platform for managing both well-structured data and FSD is beneficial to users; this approach significantly reduces integration, migration, development, maintenance, and operational issues.
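
A minimal sketch of the 'data first, schema later' idea inside a relational engine follows. It is not the approach described in the paper; it assumes SQLite through Python's standard `sqlite3` module, stores each record as JSON text in one column, and registers a hypothetical helper `json_get` as a SQL function so that fields are interpreted only at query time. The helper handles top-level scalar fields only.

```python
import json
import sqlite3


def json_get(doc, key):
    """Return a top-level scalar field of a JSON document, or None if absent."""
    return json.loads(doc).get(key)


conn = sqlite3.connect(":memory:")
conn.create_function("json_get", 2, json_get)

# One fixed key column and one free-form column for the document itself.
conn.execute("CREATE TABLE items (id INTEGER PRIMARY KEY, doc TEXT NOT NULL)")

# Hypothetical records with different fields, stored in the same table.
records = [
    {"name": "sensor-a", "unit": "C", "reading": 21.5},
    {"name": "note-1", "text": "free-form comment"},
]
conn.executemany(
    "INSERT INTO items (doc) VALUES (?)", [(json.dumps(r),) for r in records]
)

# The "schema" is applied only at query time, to whichever fields exist.
rows = conn.execute(
    "SELECT json_get(doc, 'name') FROM items "
    "WHERE json_get(doc, 'reading') IS NOT NULL"
).fetchall()
```

New fields can appear in later records without any `ALTER TABLE`. The trade-off is that the engine cannot index or type-check those fields unless further machinery is added.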

Posted Content•

A Survey of Classification Techniques in the Area of Big Data

Praful Koturwar, Sheetal Girase, Debajyoti Mukhopadhyay

25 Mar 2015-arXiv: Learning

TL;DR: This paper studies different supervised classification techniques over big transactional databases and presents their advantages and limitations.

Abstract: Big Data concerns large-volume, growing data sets that are complex and have multiple autonomous sources. Earlier technologies were not able to handle the storage and processing of such huge data, and thus the Big Data concept came into existence. Working with unstructured data is a tedious job for users. So, there should be some mechanism which classifies unstructured data into an organized form that helps users easily access the required data. Classification techniques over big transactional databases provide the required data to users from large datasets in a simpler way. There are two main classification techniques, supervised and unsupervised. In this paper we focus on the study of different supervised classification techniques. Further, this paper shows their advantages and limitations.
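
To make "supervised classification of unstructured data" concrete, here is a small multinomial naive Bayes text classifier with add-one smoothing. Naive Bayes is chosen here only as a familiar example of a supervised technique; the abstract does not say which techniques the survey covers. The function names and training texts are hypothetical.

```python
import math
from collections import Counter, defaultdict


def train(examples):
    """examples: list of (text, label). Returns per-label word counts and label counts."""
    word_counts = defaultdict(Counter)
    label_counts = Counter()
    for text, label in examples:
        label_counts[label] += 1
        word_counts[label].update(text.lower().split())
    return word_counts, label_counts


def predict(text, word_counts, label_counts):
    """Return the label with the highest smoothed log-probability."""
    vocab = {w for counts in word_counts.values() for w in counts}
    total = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label, n in label_counts.items():
        denom = sum(word_counts[label].values()) + len(vocab)
        score = math.log(n / total)  # class prior
        for w in text.lower().split():
            score += math.log((word_counts[label][w] + 1) / denom)  # add-one smoothing
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical labeled snippets of free text.
training = [
    ("invoice payment due", "billing"),
    ("payment received invoice", "billing"),
    ("server crash error", "incident"),
    ("error log server down", "incident"),
]
model = train(training)
label = predict("payment invoice", *model)
```

The classifier needs labeled examples up front, which is what distinguishes it from the unsupervised techniques the abstract mentions.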
