TL;DR: A data storage framework not only enabling efficient storing of massive IoT data, but also integrating both structured and unstructured data is proposed, able to combine and extend multiple databases and Hadoop to store and manage diverse types of data collected by sensors and RFID readers.
Abstract: The Internet of Things (IoT) has provided a promising opportunity to build powerful industrial systems and applications by leveraging the growing ubiquity of Radio Frequency IDentification (RFID) and wireless sensor devices. Benefiting from RFID and sensor network technology, common physical objects can be connected, and can be monitored and managed by a single system. Such a network brings a series of challenges for data storage and processing in a cloud platform. IoT data can be generated quite rapidly, the volume of data can be huge, and the types of data can be varied. To address these potential problems, this paper proposes a data storage framework that not only enables efficient storing of massive IoT data, but also integrates both structured and unstructured data. This data storage framework is able to combine and extend multiple databases and Hadoop to store and manage diverse types of data collected by sensors and RFID readers. In addition, some components are developed to extend Hadoop to realize a distributed file repository, which is able to process massive unstructured files efficiently. A prototype system based on the proposed framework is also developed to illustrate the framework's effectiveness.
TL;DR: This work presents AGDISTIS, a novel knowledge-base-agnostic approach for named entity disambiguation that combines the Hypertext-Induced Topic Search (HITS) algorithm with label expansion strategies and string similarity measures and can efficiently detect the correct URIs for a given set of named entities within an input text.
Abstract: Over the last decades, several billion Web pages have been made available on the Web. The ongoing transition from the current Web of unstructured data to the Web of Data yet requires scalable and accurate approaches for the extraction of structured data in RDF (Resource Description Framework) from these websites. One of the key steps towards extracting RDF from text is the disambiguation of named entities. While several approaches aim to tackle this problem, they still achieve poor accuracy. We address this drawback by presenting AGDISTIS, a novel knowledge-base-agnostic approach for named entity disambiguation. Our approach combines the Hypertext-Induced Topic Search (HITS) algorithm with label expansion strategies and string similarity measures. Based on this combination, AGDISTIS can efficiently detect the correct URIs for a given set of named entities within an input text. We evaluate our approach on eight different datasets against state-of-the-art named entity disambiguation frameworks. Our results indicate that we outperform the state-of-the-art approach by up to 29% F-measure.
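The HITS step named in the abstract above can be sketched as follows. This is a minimal, generic HITS iteration, not the AGDISTIS implementation: the graph format (a dict from node to outgoing links), the function name `hits`, and the iteration count are assumptions made for illustration, and the candidate generation, label expansion, and string similarity stages are not shown.

```python
import math


def hits(graph, iterations=20):
    """Generic HITS sketch. `graph` maps each node to the nodes it links to.

    Returns (hub_scores, authority_scores), each L2-normalised.
    The input format is a hypothetical choice for this example.
    """
    nodes = set(graph) | {t for targets in graph.values() for t in targets}
    hubs = {n: 1.0 for n in nodes}
    auths = {n: 1.0 for n in nodes}

    for _ in range(iterations):
        # Authority update: sum the hub scores of all nodes linking in.
        new_auths = {n: 0.0 for n in nodes}
        for src, targets in graph.items():
            for t in targets:
                new_auths[t] += hubs[src]

        # Hub update: sum the (new) authority scores of all nodes linked to.
        new_hubs = {
            n: sum(new_auths[t] for t in graph.get(n, [])) for n in nodes
        }

        # Normalise so the scores stay bounded across iterations.
        a_norm = math.sqrt(sum(v * v for v in new_auths.values())) or 1.0
        h_norm = math.sqrt(sum(v * v for v in new_hubs.values())) or 1.0
        auths = {n: v / a_norm for n, v in new_auths.items()}
        hubs = {n: v / h_norm for n, v in new_hubs.items()}

    return hubs, auths


# Toy graph: two nodes point at one shared target, which therefore
# receives the highest authority score.
toy_graph = {"a": ["c"], "b": ["c"], "c": []}
hub_scores, auth_scores = hits(toy_graph)
best = max(auth_scores, key=auth_scores.get)
```

In a disambiguation setting, the nodes would presumably be candidate URIs and their neighbours in the knowledge base, with the top-ranked candidate per entity being selected; that mapping is an assumption here, not something the abstract specifies.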
TL;DR: Current research that takes advantage of "Big Data" in health and biomedical informatics applications is summarized, highlighting ongoing development of powerful new methods for turning that large-scale, and often complex, data into information that provides new insights into human health, in a range of different areas.
Abstract: Objectives: To summarise current research that takes advantage of “Big Data” in health and biomedical informatics applications. Methods: Survey of trends in this work, and exploration of literature describing how large-scale structured and unstructured data sources are being used to support applications from clinical decision making and health policy, to drug design and pharmacovigilance, and further to systems biology and genetics. Results: The survey highlights ongoing development of powerful new methods for turning that large-scale, and often complex, data into information that provides new insights into human health, in a range of different areas. Consideration of this body of work identifies several important paradigm shifts that are facilitated by Big Data resources and methods: in clinical and translational research, from hypothesis-driven research to data-driven research, and in medicine, from evidence-based practice to practice-based evidence. Conclusions: The increasing scale and availability of large quantities of health data require strategies for data management, data linkage, and data integration beyond the limits of many existing information systems, and substantial effort is underway to meet those needs. As our ability to make sense of that data improves, the value of the data will continue to increase. Health systems, genetics and genomics, population and public health; all areas of biomedicine stand to benefit from Big Data and the associated technologies.
TL;DR: The objective of this paper is to summarize the state-of-the-art efforts in clinical big data analytics and highlight what might be needed to enhance the outcomes of clinical big data analytics tools.
Abstract: The emergence of massive datasets in a clinical setting presents both challenges and opportunities in data storage and analysis. This so called “big data” challenges traditional analytic tools and will increasingly require novel solutions adapted from other fields. Advances in information and communication technology present the most viable solutions to big data analysis in terms of efficiency and scalability. It is vital those big data solutions are multithreaded and that data access approaches be precisely tailored to large volumes of semi-structured/unstructured data. The MapReduce programming framework uses two tasks common in functional programming: Map and Reduce. MapReduce is a new parallel processing framework and Hadoop is its open-source implementation on a single computing node or on clusters. Compared with existing parallel processing paradigms (e.g. grid computing and graphical processing unit (GPU)), MapReduce and Hadoop have two advantages: 1) fault-tolerant storage resulting in reliable data processing by replicating the computing tasks, and cloning the data chunks on different computing nodes across the computing cluster; 2) high-throughput data processing via a batch processing framework and the Hadoop distributed file system (HDFS). Data are stored in the HDFS and made available to the slave nodes for computation. In this paper, we review the existing applications of the MapReduce programming framework and its implementation platform Hadoop in clinical big data and related medical health informatics fields. The usage of MapReduce and Hadoop on a distributed system represents a significant advance in clinical big data processing and utilization, and opens up new opportunities in the emerging era of big data analytics. The objective of this paper is to summarize the state-of-the-art efforts in clinical big data analytics and highlight what might be needed to enhance the outcomes of clinical big data analytics tools. 
This paper is concluded by summarizing the potential usage of the MapReduce programming framework and Hadoop platform to process huge volumes of clinical data in medical health informatics related fields.
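The Map and Reduce tasks described in the abstract above can be illustrated with a small single-process sketch. It shows only the programming model (map, group by key, reduce), not Hadoop, HDFS, replication, or fault tolerance; the function names and the word-count task are assumptions chosen for illustration.

```python
from collections import defaultdict
from functools import reduce


def map_phase(record):
    """Map: turn one input record into a list of (key, value) pairs."""
    return [(word.lower(), 1) for word in record.split()]


def shuffle(pairs):
    """Group values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: fold all values for one key into a single result."""
    return key, reduce(lambda a, b: a + b, values)


def run_job(records):
    """Run map, shuffle, and reduce sequentially in one process."""
    pairs = [pair for record in records for pair in map_phase(record)]
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


# Hypothetical input: free-text notes, one record per line.
counts = run_job(["fever cough", "cough"])
```

In a real cluster the map calls would run in parallel on different nodes over separate data chunks, and the shuffle would move data across the network; here everything runs in one loop to keep the model visible.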
TL;DR: The concept of Big Data and associated analytics are to be taken seriously when approaching the use of vast volumes of both structured and unstructured data in science and health-care.
Abstract: Objectives: As technology continues to evolve and rise in various industries, such as healthcare, science, education, and gaming, a sophisticated concept known as Big Data is surfacing. The concept of analytics aims to understand data. We set out to portray and discuss perspectives of the evolving use of Big Data in science and healthcare and, to examine some of the opportunities and challenges. Methods: A literature review was conducted to highlight the implications associated with the use of Big Data in scientific research and healthcare innovations, both on a large and small scale. Results: Scientists and health-care providers may learn from one another when it comes to understanding the value of Big Data and analytics. Small data, derived by patients and consumers, also requires analytics to become actionable. Connectivism provides a framework for the use of Big Data and analytics in the areas of science and healthcare. This theory assists individuals to recognize and synthesize how human connections are driving the increase in data. Despite the volume and velocity of Big Data, it is truly about technology connecting humans and assisting them to construct knowledge in new ways. Concluding Thoughts: The concept of Big Data and associated analytics are to be taken seriously when approaching the use of vast volumes of both structured and unstructured data in science and health-care. Future exploration of issues surrounding data privacy, confidentiality, and education are needed. A greater focus on data from social media, the quantified self-movement, and the application of analytics to “small data” would also be useful.
TL;DR: Researchers need to study and document use cases that explain how specific, novel data, so-called Big Data, can be used to support decision-making.
Abstract: People and the computers they use are generating large amounts of varied data. The phenomenon of capturing and trying to use all of the semi-structured and unstructured data has been called by vendors and bloggers ‘Big Data’. Organisations can capture and store data of many types from almost any source, but capturing and storing data only adds value when it has a useful purpose. Big Data must be used to provide input to analytics and decision support capabilities if it is to create real value for organisations. Some bloggers, industry leaders and academics have become disillusioned by the term Big Data. It is a marketing term and not a technical term. More descriptive terms like unstructured data, process data and machine data are more useful for information technology (IT) professionals. Researchers need to study and document use cases that explain how specific, novel data, so-called Big Data, can be used to support decision-making.
TL;DR: This work combines four different state-of-the-art approaches by using 15 different algorithms for ensemble learning and evaluates their performance on five different datasets to suggest that ensemble learning can reduce the error rate of state-of-the-art named entity recognition systems by 40%, thereby leading to over 95% f-score in the best run.
Abstract: A considerable portion of the information on the Web is still only available in unstructured form. Implementing the vision of the Semantic Web thus requires transforming this unstructured data into structured data. One key step during this process is the recognition of named entities. Previous works suggest that ensemble learning can be used to improve the performance of named entity recognition tools. However, no comparison of the performance of existing supervised machine learning approaches on this task has been presented so far. We address this research gap by presenting a thorough evaluation of named entity recognition based on ensemble learning. To this end, we combine four different state-of-the-art approaches by using 15 different algorithms for ensemble learning and evaluate their performance on five different datasets. Our results suggest that ensemble learning can reduce the error rate of state-of-the-art named entity recognition systems by 40%, thereby leading to over 95% f-score in our best run.
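To make the idea of combining several named entity recognition tools concrete, the sketch below uses plain majority voting over per-token labels. Majority voting is a simple baseline substituted here for illustration; the abstract does not say which 15 ensemble algorithms were evaluated, and the label names, function name, and tie-breaking rule are all assumptions.

```python
from collections import Counter


def majority_vote(predictions):
    """Combine per-token labels from several taggers by majority vote.

    `predictions` is a list of label sequences, one per tagger, all of
    the same length. On a tie, the label seen first (that is, from the
    earliest tagger in the list) wins.
    """
    if not predictions:
        return []
    length = len(predictions[0])
    if any(len(p) != length for p in predictions):
        raise ValueError("all taggers must label the same number of tokens")

    combined = []
    for labels in zip(*predictions):
        # Counter.most_common keeps first-seen order among equal counts.
        combined.append(Counter(labels).most_common(1)[0][0])
    return combined


# Hypothetical outputs of three taggers for a three-token sentence.
tagger_outputs = [
    ["PER", "O", "LOC"],
    ["PER", "O", "O"],
    ["ORG", "O", "LOC"],
]
merged = majority_vote(tagger_outputs)
```

A learned ensemble would instead train a classifier on the taggers' outputs as features, which is closer to what a supervised ensemble evaluation implies.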
TL;DR: In this article, a system and method for generating reports from unstructured data is described, which can include identifying events matching criteria of an initial search query (each of the events including a portion of raw machine data that is associated with a time), identifying a set of fields, each field defined for one or more of the identified events, causing display of an interactive graphical user interface (GUI).
Abstract: The disclosure relates to certain system and method embodiments for generating reports from unstructured data. In one embodiment, a method can include identifying events matching criteria of an initial search query (each of the events including a portion of raw machine data that is associated with a time), identifying a set of fields, each field defined for one or more of the identified events, causing display of an interactive graphical user interface (GUI) that includes one or more interactive elements enabling a user to define a report for providing information relating to the matching events (each interactive element enabling processing or presentation of information in the matching events using one or more fields in the identified set of fields), receiving, via the GUI, a report definition indicating how to report information relating to the matching events, and generating, based on the report definition, a report including information relating to the matching events.
TL;DR: The issue of Big Data including the four dimensions of Big data and the opportunities and challenges created by them are discussed, as well as various Big Data analytics applications.
Abstract: EXECUTIVE SUMMARY | Today, the amount of data we are able to collect has been exploding. As a result, Big Data have become a new buzzword in information technology. Storing, managing, and analyzing Big Data is challenging, and will soon become a major differentiator between high-performing and low-performing organizations. This article discusses the issue of Big Data including the four dimensions of Big Data and the opportunities and challenges created by them. It also discusses various Big Data analytics applications. Every day, we use several different devices to generate large amounts of data; for example, searching online, making purchases through e-commerce web sites, making transactions in the supermarket, reading data from sensors, using social media to interact with our friends, and using GPS. All the data are accumulated and stored somewhere, which we call "Big Data." WHAT IS BIG DATA? | According to the McKinsey Global Institute, "Big Data refer to datasets whose size is beyond the ability of typical database software tools to capture, store, manage and analyze." The EdTech Report to the nation in 2013 states "Every day, we create 2.5 quintillion (1020) bytes of data - so much that 90% of the data in the world today has been created in the last two years alone. These data come from everywhere: sensors used to gather climate information, posts to social media sites, digital pictures and videos, purchase transaction records, and cell phone GPS signals to name a few. These data are "Big Data." The concept of Big Data actually is not new. We have been accumulating data since the beginning of recorded time. However, as technology advances, data are accumulating at an alarming rate. FOUR DIMENSIONS OF BIG DATA | IBM data scientists break Big Data down into four dimensions: volume, velocity, variety, and veracity (4-Vs). The volume dimension refers to the scale of the data.
From the beginning of recorded time until 2003, we created 5 billion gigabytes (exabytes) of data. In 2011, the same amount was created nearly every two days. In 2013, the same amount of data was created every 10 minutes. Velocity refers to the analysis of streaming data. As data are accumulated every second, data quickly become out-of-date. Therefore, it is important to use the data as fast as possible. The third dimension, variety, refers to different types of data we collect, e.g., structured data, unstructured data, text data, numerical data, image data, and audio and video data. Veracity refers to the uncertainty in the data. The data we collect may contain noise, but we do not know which data are accurate and which have noise. This is why many business leaders do not trust the information generated from them. What's more, according to IBM, poor data quality costs the U.S. economy around $3.1 trillion each year. WHY DO WE CARE ABOUT BIG DATA? | The model of generating/consuming data has changed. The old model was that few companies were generating data; all the others were consuming data. As technology advanced, a new model has evolved. Many companies are now generating and consuming data. But our ultimate goal is not just generating, storing, and managing Big Data. Once data are generated and stored, the next step is to analyze the data to find useful information. Information is then converted into knowledge to make decisions to optimize profit. Figure 1 shows an overview of the data analysis and decision making process. The article by Clay Dillow, published in Fortune magazine on September 4, 2013, states that Big Data are now viewed as "the new oil" to drive the economies in the century ahead, the same way as they did at the beginning of the last century. So, we are experiencing a Big Data employment boom. Dillow's claim is supported by the job trends from indeed.com. Indeed.com is an employment-related meta-search engine for job listings.
According to the job trends from indeed.com, the number of job postings related to Big Data increased exponentially after 2011. …
TL;DR: The epiC framework as mentioned in this paper introduces a general Actor-like concurrent programming model, independent of the data processing models, for specifying parallel computations, which can be automatically parallelized and the runtime system takes care of fault tolerance.
Abstract: The Big Data problem is characterized by the so-called 3V features: Volume - a huge amount of data, Velocity - a high data ingestion rate, and Variety - a mix of structured data, semi-structured data, and unstructured data. The state-of-the-art solutions to the Big Data problem are largely based on the MapReduce framework (aka its open source implementation Hadoop). Although Hadoop handles the data volume challenge successfully, it does not deal with the data variety well since the programming interfaces and its associated data processing model are inconvenient and inefficient for handling structured data and graph data. This paper presents epiC, an extensible system to tackle the Big Data's data variety challenge. epiC introduces a general Actor-like concurrent programming model, independent of the data processing models, for specifying parallel computations. Users process multi-structured datasets with appropriate epiC extensions, the implementation of a data processing model best suited for the data type and auxiliary code for mapping that data processing model into epiC's concurrent programming model. Like Hadoop, programs written in this way can be automatically parallelized and the runtime system takes care of fault tolerance and inter-machine communications. We present the design and implementation of epiC's concurrent programming model. We also present two customized data processing models, an optimized MapReduce extension and a relational model, on top of epiC. Experiments demonstrate the effectiveness and efficiency of our proposed epiC.
TL;DR: This paper conducts some empirical studies to show what are the differences before and after preprocessing the unstructured source code, and shows some interesting phenomena based on using or not using these preprocessing operations.
Abstract: Program comprehension usually focuses on the significance of textual information to capture the programmers’ intent and knowledge in the software, in particular the source code. In the source code, most of the data is unstructured data, such as the natural language text in comments and identifier names. Researchers in the software engineering community have developed many techniques for handling such unstructured data, such as natural language processing (NLP) and information retrieval (IR). Before using an IR technique on unstructured source code, we must preprocess the text of identifiers and comments, since these data are different from the text used in daily life. During this process, several operations, e.g., tokenization, splitting, and stemming, are usually used for preprocessing the unstructured source code. These preprocessing operations will affect the quality of the data used in the IR process. But how these preprocessing operations affect the results of IR is still an open problem. To the best of our knowledge, there are still no studies focusing on this problem. This paper attempts to fill this gap, and conducts some empirical studies to show the differences before and after these preprocessing operations. The results show some interesting phenomena based on using or not using these preprocessing operations.
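The three preprocessing operations named in the abstract above (tokenization, identifier splitting, and stemming) can be sketched as a small pipeline. This is an illustrative sketch only: the regular expressions, the function names, and especially the toy suffix-stripping stemmer are assumptions, and a real study would more likely use an established stemmer such as Porter's.

```python
import re


def tokenize(text):
    """Extract identifier-like tokens from source text or comments."""
    return re.findall(r"[A-Za-z_][A-Za-z0-9_]*", text)


def split_identifier(token):
    """Split snake_case and camelCase identifiers into lowercase words."""
    parts = []
    for chunk in re.split(r"[_\d]+", token):
        # Acronym runs (e.g. HTTP) or a capitalised/lowercase word.
        parts.extend(re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+", chunk))
    return [p.lower() for p in parts if p]


def naive_stem(word):
    """Toy stemmer: strip a few common suffixes (NOT the Porter algorithm)."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def preprocess(text):
    """Tokenize, split identifiers, then stem each resulting word."""
    words = []
    for token in tokenize(text):
        words.extend(split_identifier(token))
    return [naive_stem(w) for w in words]


terms = preprocess("parseFileNames // parsing HTTP_headers")
```

Running an IR model with and without each stage of `preprocess` is one way to set up the kind of before-and-after comparison the abstract describes.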
TL;DR: A technical overview of the emerging "broad data" area, in which the variety of heterogeneous data being used, rather than the scale of the data being analyzed, is the limiting factor in data analysis efforts.
Abstract: More and more, the needs of data analysts are requiring the use of data outside the control of their own organizations. The increasing amount of data available on the Web, the new technologies for linking data across datasets, and the increasing need to integrate structured and unstructured data are all driving this trend. In this article, we provide a technical overview of the emerging "broad data" area, in which the variety of heterogeneous data being used, rather than the scale of the data being analyzed, is the limiting factor in data analysis efforts. The article explores some of the emerging themes in data discovery, data integration, linked data, and the combination of structured and unstructured data.
TL;DR: This paper surveys the rapidly emerging field of data mining, also known as knowledge discovery from data (KDD), covering the fundamentals of data mining; strategies for analyzing data such as classification, estimation, prediction, association rules, and clustering; the data mining process; and its types, such as web, content, and structure mining.
Abstract: Nowadays the internet is a significant place for exchanging data such as text, images, audio, and video, and for sharing information, preferably in digital form. Internet usage gives access to an immense amount of data, which may be unstructured, structured, or semi-structured, so we are storing and processing vast amounts of highly complex data. Researchers from the University of Berkeley estimate that every year about one exabyte (= 1 million terabytes) of data is produced, of which a large portion is in digital form. For example, 100 hours of video are uploaded to YouTube every minute (statistics from YouTube). The question is how to analyze such ample data effectively and efficiently. The answer is data mining. Data mining is the process of analyzing and studying such large quantities of data to find useful patterns and knowledge. Today we have accumulated a large amount of data but lack knowledge. In this paper we survey the rapidly emerging field of data mining, which is also known as knowledge discovery from data (KDD). We cover the fundamentals of data mining; several strategies for analyzing data, such as classification, estimation, prediction, association rules, and clustering; the data mining process; and its types, such as web, content, and structure mining. After covering the basics, we present different data mining models, such as decision trees and neural networks, along with their foundations. We also present real-world applications and the future scope of this dynamic and extraordinary discipline.
TL;DR: To sustain the credibility of OD-based systems more research will be needed to investigate effective existing approaches and to synthesize novel, OD-specific engineering principles.
Abstract: Structured and unstructured data in operational support tools have long been prevalent in software engineering. Similar data is now becoming widely available in other domains. Software systems that utilize such operational data (OD) to help with software design and maintenance activities are increasingly being built despite the difficulties of drawing valid conclusions from disparate and low-quality data and the continuing evolution of operational support tools. This paper proposes systematizing approaches to the engineering of OD-based systems. To prioritize and structure research areas we consider historic developments, such as big data hype; synthesize defining features of OD, such as confounded measures and unobserved context; and discuss emerging new applications, such as diverse and large OD collections and extremely short development intervals. To sustain the credibility of OD-based systems more research will be needed to investigate effective existing approaches and to synthesize novel, OD-specific engineering principles.
TL;DR: In this article, a method and system to evaluate data efficacy across an enterprise is disclosed, which includes the step of indexing a set of data sources that include at least one of structured and unstructured data artifacts.
Abstract: A method and system to evaluate data efficacy across an enterprise is disclosed. The method includes the step of indexing a set of data sources that include at least one of structured and unstructured data artifacts. The method further includes accessing the indexing on the one or more data sources with a computer. The method further includes the step of generating a plurality of analytics about the data sources based on the indexing, wherein the analytics include a plurality of: a document originality analytic, a corpus storage volume analytic, a data source ingest analytic, a document type analytic, and an analysis analytic. The method further includes displaying, on a display device, an interactive visualization of results based on the analytics, wherein the visualization comprises at least one of: a histogram, a graph, a timeline, a panel, a list, a chart, a popup, and a table.
TL;DR: In this article, the authors highlight the Information Security and Data Protection pitfalls that lie in wait - hurdles that have already tripped up market leaders and minnows alike, and suggest that enterprises and IT grapple to take advantage of these trends in order to gain share and drive revenue.
Abstract: The Digital Universe, which consists of all the data created by PC, Sensor Networks, GPS/WiFi Location, Web Metadata, Web-Sourced Biographical Data, Mobile, Smart-Connected Devices and Next-Generation Applications (to name but a few) is altering the way we consume and measure IT and disrupting proven business models. Unprecedented and exponential data growth is presenting businesses with new and unique opportunities and challenges. As the 'Internet of Things' (IoT) and Third Platform continue to grow, the analysis of structured and unstructured data will drive insights that change the way businesses operate, create distinctive value, and deliver services and applications to the consumer and to each other. As enterprises and IT grapple to take advantage of these trends in order to gain share and drive revenue, they must be mindful of the Information Security and Data Protection pitfalls that lie in wait - hurdles that have already tripped up market leaders and minnows alike.
TL;DR: This paper analyzes Big Data applications for information security problems, and defines research directions on Big Data analytics for security intelligence, which promises significant opportunities for prevention and detection of advanced cyber-attacks using correlated internal and external security data.
Abstract: Big Data is related to technologies for collecting, processing, analyzing and extracting useful knowledge from very large volumes of structured and unstructured data generated by different sources at high speed. Big Data creates critical information security and privacy problems, at the same time Big Data analytics promises significant opportunities for prevention and detection of advanced cyber-attacks using correlated internal and external security data. We must address several challenges to realize true potential of Big Data for information security. The paper analyzes Big Data applications for information security problems, and defines research directions on Big Data analytics for security intelligence.
TL;DR: The value of content analysis is discussed in a variety of academic contexts, including individual and collaborative scholarship and academic advising and the utility of this methodology is illustrated using Leximancer, an automated content analysis and concept mapping technology.
Abstract: Scholars in many knowledge domains rely on sophisticated information technologies to search for and retrieve records and publications pertinent to their research interests. But what is a scholar to...
TL;DR: This paper discusses an AaaS tool that performs terms and topics extraction and organization from unstructured data sources such as NoSQL databases, textual contents, and structured sources (e.g. SQL) and shows high accuracy in the mining process.
Abstract: Analytics-as-a-Service (AaaS) has become indispensable because it allows stakeholders to discover knowledge in Big Data. Previously, data stored in data warehouses followed some schema and standardization, which led to efficient data mining. However, the Big Data epoch has witnessed the rise of structured, semi-structured, and unstructured data, a trend that motivated enterprises to employ NoSQL data storages to accommodate the high-dimensional data. Unfortunately, the existing data mining techniques, which are designed for schema-oriented storages, are not applicable to the unstructured data style. Thus, AaaS, though still in its infancy, is gaining widespread attention for its ability to provide novel ways and opportunities to mine the heterogeneous data. In this paper, we discuss our AaaS tool that performs terms and topics extraction and organization from unstructured data sources such as NoSQL databases, textual contents (e.g., websites), and structured sources (e.g. SQL). The tool is built on methodologies such as tagging, filtering, association maps, and adaptable dictionary. The evaluation of the tool shows high accuracy in the mining process.
TL;DR: A proposed VA toolkit extracts data from Bitly and Twitter to predict movie revenue and ratings and is generalizable to other domains involving social media data, such as sales forecasting and advertisement analysis.
Abstract: With over 16 million tweets per hour, 600 new blog posts per minute, and 400 million active users on Facebook, businesses have begun searching for ways to turn real-time consumer-based posts into actionable intelligence. The goal is to extract information from this noisy, unstructured data and use it for trend analysis and prediction. Current practices support the idea that visual analytics (VA) can help enable the effective analysis of such data. However, empirical evidence demonstrating the effectiveness of a VA solution is still lacking. A proposed VA toolkit extracts data from Bitly and Twitter to predict movie revenue and ratings. Results from the 2013 VAST Box Office Challenge demonstrate the benefit of an interactive environment for predictive analysis, compared to a purely statistical modeling approach. The VA approach used by the toolkit is generalizable to other domains involving social media data, such as sales forecasting and advertisement analysis.
TL;DR: This paper proposes the JackHare framework with SQL query compiler, JDBC driver and a systematical method using MapReduce framework for processing the unstructured data in HBase to exploit the HBase as the underlying datastore to execute the ANSI-SQL queries.
Abstract: As data exploration has increased rapidly in recent years, the datastore and data processing are getting more and more attention in extracting important information. Finding a scalable solution for processing large-scale data is a critical issue in both relational database systems and emerging NoSQL databases. With the inherent scalability and fault tolerance of Hadoop, MapReduce is attractive for processing massive data in parallel. Most previous research focuses on developing SQL or SQL-like query translators on top of the Hadoop distributed file system. However, it can be difficult to update data frequently in such a file system. Therefore, we need a flexible datastore such as HBase, not only to place the data over a scale-out storage system, but also to manipulate the changeable data in a transparent way. However, the HBase interface is not friendly enough for most users. A GUI composed of a SQL client application and a database connection to HBase will ease the learning curve. In this paper, we propose the JackHare framework with a SQL query compiler, a JDBC driver, and a systematic method using the MapReduce framework for processing the unstructured data in HBase. After importing the JDBC driver into a SQL client GUI, we can exploit HBase as the underlying datastore to execute ANSI-SQL queries. Experimental results show that our approaches can perform well with efficiency and scalability.
TL;DR: An introduction to the Big Data Era and the New Medical Frontier: Real-Time Wireless Medical Data Acquisition for 21st-Century Healthcare and Data Mining Challenges.
Abstract: Introduction to the Big Data Era (Stephan Kudyba and Matthew Kwatinetz); Information Creation through Analytics (Stephan Kudyba); Big Data Analytics-Architectures, Implementation Methodology, and Tools (Wullianallur Raghupathi and Viju Raghupathi); Data Mining Methods and the Rise of Big Data (Wayne Thompson); Data Management and Model Creation Process of Structured Data for Mining and Analytics (Stephan Kudyba); The Internet: A Source of New Data for Mining in Marketing (Robert Young); Mining and Analytics in E-Commerce (Stephan Kudyba); Streaming Data in the Age of Big Data (Billie Anderson and J. Michael Hardin); Using CEP for Real-Time Data Mining (Steven Barber); Transforming Unstructured Data into Useful Information (Meta S. Brown); Mining Big Textual Data (Ioannis Korkontzelos); The New Medical Frontier: Real-Time Wireless Medical Data Acquisition for 21st-Century Healthcare and Data Mining Challenges (David Lubliner and Stephan Kudyba).
TL;DR: AGDISTIS combines the Hypertext-Induced Topic Search (HITS) algorithm with label expansion strategies and string similarity measures to detect the correct URIs for a given set of named entities within an input text.
Abstract: Over the last decades, several billion Web pages have been made available on the Web. The ongoing transition from the current Web of unstructured data to the Data Web yet requires scalable and accurate approaches for the extraction of structured data in RDF (Resource Description Framework) from these websites. One of the key steps towards extracting RDF from text is the disambiguation of named entities. We address this issue by presenting AGDISTIS, a novel knowledge-base-agnostic approach for named entity disambiguation. Our approach combines the Hypertext-Induced Topic Search (HITS) algorithm with label expansion strategies and string similarity measures. Based on this combination, AGDISTIS can efficiently detect the correct URIs for a given set of named entities within an input text.
TL;DR: The BigBench benchmark is presented and the suitability and relevance of the workload is evaluated from the point of view of enterprise applications, and potential extensions to the proposed specification are discussed in order to cover typical big data processing use cases.
Abstract: Enterprises perceive a huge opportunity in mining information that can be found in big data. New storage systems and processing paradigms are allowing for ever larger data sets to be collected and analyzed. The high demand for data analytics and rapid development in technologies has led to a sizable ecosystem of big data processing systems. However, the lack of established, standardized benchmarks makes it difficult for users to choose the appropriate systems that suit their requirements. To address this problem, we have developed the BigBench benchmark specification. BigBench is the first end-to-end big data analytics benchmark suite. In this paper, we present the BigBench benchmark and analyze the workload from technical as well as business point of view. We characterize the queries in the workload along different dimensions, according to their functional characteristics, and also analyze their runtime behavior. Finally, we evaluate the suitability and relevance of the workload from the point of view of enterprise applications, and discuss potential extensions to the proposed specification in order to cover typical big data processing use cases.
TL;DR: This paper focuses on the predictive capabilities of ERP systems for analyzing current data and historical facts in order to identify potential risks and opportunities for any organization.
Abstract: ERP systems are currently too inflexible to adapt to changing organizational processes. They are required to adjust quickly to changing processes and value-added chains and to streamline their internal organizational structure. Data in the transactional programs of ERP systems is becoming increasingly voluminous. In this scenario, ERP systems are increasingly exposed to big data, wherein the combined analysis of larger amounts of structured and unstructured data from disparate systems takes place in a short amount of time. Big data analytics requires greater use of predictive analytics to uncover hidden patterns and their relationships and to visualize and explore data. The evolution of big data and predictive analytics has opened a new way to explore new frontiers in analytics-driven automation and decision management for high-volume, front-line operational decisions. In this paper, the authors focus on the predictive capabilities of ERP systems to analyze current data and historical facts in order to identify potential risks and opportunities for any organization. Analytical Decision Management and Business Rules are used to deploy decision as a service.
TL;DR: An integrative healthcare analytics system called GEMINI is developed; it enables point-of-care analytics for doctors, who need real-time, usable, and relevant information about their patients in response to the questions they ask about the patients in their care.
Abstract: Healthcare systems around the world are facing the challenge of information overload in caring for patients in an affordable, safe and high-quality manner in a system with limited healthcare resources and increasing costs. To alleviate this problem, we develop an integrative healthcare analytics system called GEMINI, which enables point-of-care analytics for doctors, who need real-time, usable, and relevant information about their patients in response to the questions they ask about the patients in their care. GEMINI extracts data on each patient from various data sources and stores it as information in a patient profile graph. The data sources are complex and varied, consisting of both structured data (such as patients' demographic data, laboratory results and medications) and unstructured data (such as doctors' notes). Hence, the patient profile graph provides holistic and comprehensive information about patients' healthcare profiles, from which GEMINI can infer implicit information useful for administrative and clinical purposes, and extract relevant information for performing predictive analytics. At its core, GEMINI keeps interacting with healthcare professionals as part of a feedback loop to gather, infer, ascertain and enhance the self-learning knowledge base. We present a case study on using GEMINI to predict the risk of unplanned patient readmissions.
TL;DR: This paper shows how research strings that have been extracted from an extensive literature review can be bundled to a new vision for BI that is aligned with new requirements coming from socio-technical macro trends.
Abstract: The body of knowledge generated by Business Intelligence (BI) research is constantly extended by a stream of heterogeneous technological and organizational innovations. This paper shows how these can be bundled to a new vision for BI that is aligned with new requirements coming from socio-technical macro trends. The building blocks of the vision come from five research strings that have been extracted from an extensive literature review: BI and Business Process Management, BI across enterprise borders, new approaches of dealing with unstructured data, agile and user-driven BI, and new concepts for BI governance. The macro trend of the diffusion of cyber-physical systems is used to illustrate the argumentation.
TL;DR: Large Scale and Big Data: Processing and Management provides readers with a central source of reference on the data management techniques currently available for large-scale data processing, and provides an overview of different programming models and cloud-based deployment models.
Abstract: Large Scale and Big Data: Processing and Management provides readers with a central source of reference on the data management techniques currently available for large-scale data processing. Presenting chapters written by leading researchers, academics, and practitioners, it addresses the fundamental challenges associated with Big Data processing tools and techniques across a range of computing environments. The book begins by discussing the basic concepts and tools of large-scale Big Data processing and cloud computing. It also provides an overview of different programming models and cloud-based deployment models. The book's second section examines the usage of advanced Big Data processing techniques in different domains, including semantic web, graph processing, and stream processing. The third section discusses advanced topics of Big Data processing such as consistency management, privacy, and security. Supplying a comprehensive summary from both the research and applied perspectives, the book covers recent research discoveries and applications, making it an ideal reference for a wide range of audiences, including researchers and academics working on databases, data mining, and web scale data processing. After reading this book, you will gain a fundamental understanding of how to use Big Data-processing tools and techniques effectively across application domains. Coverage includes cloud data management architectures, big data analytics visualization, data management, analytics for vast amounts of unstructured data, clustering, classification, link analysis of big data, scalable data mining, and machine learning techniques.
TL;DR: A quick look at how to organize and analyze textual data for extracting insightful customer intelligence from a large collection of documents and for using such information to improve business operations and performance.
Abstract: The proliferation of textual data in business is overwhelming. Unstructured textual data is being constantly generated via call center logs, emails, documents on the web, blogs, tweets, customer comments, customer reviews, and so on. While the amount of textual data is increasing rapidly, businesses’ ability to summarize, understand, and make sense of such data for making better business decisions remains limited. This paper takes a quick look at how to organize and analyze textual data for extracting insightful customer intelligence from a large collection of documents and for using such information to improve business operations and performance. Multiple case studies using real data are presented that demonstrate business applications of text analytics and sentiment mining using SAS® Text Miner and SAS® Sentiment Analysis Studio. While SAS® products are used as tools for demonstration only, the topics and theories covered are generic (not tool specific).