TL;DR: This paper proposes a new data integration architecture, PAYGO, which is inspired by the concept of dataspaces and emphasizes pay-as-you-go data management as a means of achieving web-scale data integration.
Abstract: The World Wide Web is witnessing an increase in the amount of structured content – vast heterogeneous collections of structured data are on the rise due to the Deep Web, annotation schemes like Flickr, and sites like Google Base. While this phenomenon is creating an opportunity for structured data management, dealing with heterogeneity at web scale presents many new challenges. In this paper, we highlight these challenges in two scenarios – the Deep Web and Google Base. We contend that traditional data integration techniques are no longer valid in the face of such heterogeneity and scale. We propose a new data integration architecture, PAYGO, which is inspired by the concept of dataspaces and emphasizes pay-as-you-go data management as a means of achieving web-scale data integration.
TL;DR: A method and apparatus are disclosed for transforming information from one semantic environment to another. A Normalization/Translation (NorTran) Workbench and a SOLx server together provide substantially real-time transformation of electronic messages.
Abstract: A method and apparatus are disclosed for transforming information from one semantic environment to another. In one implementation, a SOLx system (1700) includes a Normalization/Translation (NorTran) Workbench (1702) and a SOLx server (1708). The NorTran Workbench (1702) is used to develop a knowledge base based on information from a source system (1712), to normalize legacy content (1710) according to various rules, and to develop a database (1706) of translated content. During run time, the SOLx server (1708) receives transmissions from the source system (1712), normalizes the transmitted content, accesses the database (1706) of translated content and otherwise translates the normalized content, and reconstructs the transmission to provide substantially real-time transformation of electronic messages.
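The run-time flow described above (normalize the transmitted content, look it up in the database of translated content, then reconstruct the transmission) can be illustrated with a minimal sketch. Everything here is hypothetical and not taken from the patent: the rule list, the `TRANSLATED_CONTENT` dictionary standing in for the database, and the `normalize` and `transform` function names are assumptions, and content with no stored translation is simply passed through in normalized form rather than translated by other means.

```python
import re

# Hypothetical normalization rules (pattern, replacement); not from the source.
NORMALIZATION_RULES = [
    (re.compile(r"\bblk\b", re.IGNORECASE), "black"),  # expand an abbreviation
    (re.compile(r"\s+"), " "),                         # collapse whitespace
]

# Hypothetical stand-in for the database of translated content.
TRANSLATED_CONTENT = {"black pen": "stylo noir"}


def normalize(text):
    """Apply each normalization rule in order, then trim and lowercase."""
    for pattern, replacement in NORMALIZATION_RULES:
        text = pattern.sub(replacement, text)
    return text.strip().lower()


def transform(message):
    """Normalize every field, translate known content, rebuild the message.

    Assumption: fields without a stored translation keep their normalized value.
    """
    transformed = {}
    for field_name, value in message.items():
        normalized = normalize(value)
        transformed[field_name] = TRANSLATED_CONTENT.get(normalized, normalized)
    return transformed
```

For example, `transform({"item": "BLK   pen"})` normalizes the field to `black pen` and then returns the stored translation for it.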
TL;DR: This article introduces a comparatively new method that uses case materials to develop and test hypotheses, explains the key role of the content analysis schedule, and provides an illustration centering on environmental volatility.
Abstract: In this article, we introduce a comparatively new method that uses case materials for the development and testing of hypotheses. After comparing cases to questionnaires as a data source, we explain the key role of the content analysis schedule, and provide an illustration centering on environmental volatility.
TL;DR: In this paper, the authors investigate to what extent a sample of machine learning application papers in social computing, specifically papers from ArXiv and traditional publications performing an ML classification task on Twitter data, give specific details about whether such best practices were followed.
Abstract: Many machine learning projects for new application areas involve teams of humans who label data for a particular purpose, from hiring crowdworkers to the paper's authors labeling the data themselves. Such a task is quite similar to (or a form of) structured content analysis, which is a longstanding methodology in the social sciences and humanities, with many established best practices. In this paper, we investigate to what extent a sample of machine learning application papers in social computing (specifically, papers from arXiv and traditional publications performing an ML classification task on Twitter data) give specific details about whether such best practices were followed. Our team conducted multiple rounds of structured content analysis of each paper, making determinations such as: Does the paper report who the labelers were, what their qualifications were, whether they independently labeled the same items, whether inter-rater reliability metrics were disclosed, what level of training and/or instructions were given to labelers, whether compensation for crowdworkers is disclosed, and whether the training data is publicly available? We find a wide divergence in whether such practices were followed and documented. Much of machine learning research and education focuses on what is done once a "gold standard" of training data is available, but we discuss issues around the equally important aspect of whether such data is reliable in the first place.
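As an illustration of what an inter-rater reliability metric measures when two labelers independently label the same items, here is a minimal sketch of Cohen's kappa. The abstract does not say which metrics the surveyed papers used, so the choice of kappa is an assumption, as are the function name `cohens_kappa` and the example labels.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two labelers who labeled the same items in the same order."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both labelers must label the same non-empty item list")
    n = len(labels_a)

    # Observed agreement: fraction of items given the same label by both.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    # Chance agreement: product of each labeler's marginal label frequencies.
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)

    if expected == 1.0:
        # Both labelers used one identical label throughout; kappa is undefined.
        # Returning 1.0 is a convention chosen for this sketch.
        return 1.0
    return (observed - expected) / (1 - expected)
```

With the hypothetical labels `["pos", "pos", "neg", "neg"]` and `["pos", "neg", "neg", "neg"]`, observed agreement is 0.75 and chance agreement is 0.5, so kappa is 0.5.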
TL;DR: A web-based system, method, and program product are provided for adding content to a content object stored in a data repository as a group of hierarchically related content entities.
Abstract: A web-based system, method and program product are provided for adding content to a content object (e.g., a custom compilation or prepublished work) stored in a data repository as a group of hierarchically related content entities. Each noncontainer content object is preferably stored as a separate entity in the data repository. Each content entity is also stored as a row in a digital library index class as a collection of attributes and references to related content entities and containers. As the user selects desired objects for inclusion in a content object, the system arranges the objects hierarchically, e.g., into volumes, chapters and sections according to the order specified by the user. The system then creates a file object (e.g., a CBO) defining the content object that contains a list or outline of the container and noncontainer entities selected, their identifiers, order and structure. This file object is stored separately in the data repository. Content is removed from the compilation by removing the container or noncontainer identifier from the list or outline. This is achieved through a user interface by providing a mechanism for enabling a user to select a container or noncontainer (e.g., by title) to be removed.
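The outline-based structure described above, where content is removed by deleting its identifier from the outline rather than deleting the stored entity, can be sketched as follows. This is only an illustration under assumptions: the `OutlineEntry` class, its fields, and the `remove_entry` and `flatten` helpers are hypothetical names, and the actual file object format (e.g., a CBO) is not specified in the abstract.

```python
from dataclasses import dataclass, field


@dataclass
class OutlineEntry:
    """One outline row: a container (volume, chapter) or a noncontainer (section)."""
    identifier: str
    title: str
    is_container: bool
    children: list = field(default_factory=list)


def remove_entry(entries, identifier):
    """Return a new outline without the entry that has this identifier.

    Removing a container also drops everything nested under it. The stored
    content entities are untouched; only the outline changes.
    """
    kept = []
    for entry in entries:
        if entry.identifier == identifier:
            continue
        kept.append(OutlineEntry(entry.identifier, entry.title, entry.is_container,
                                 remove_entry(entry.children, identifier)))
    return kept


def flatten(entries, depth=0):
    """List (depth, identifier) pairs in outline order."""
    rows = []
    for entry in entries:
        rows.append((depth, entry.identifier))
        rows.extend(flatten(entry.children, depth + 1))
    return rows


# Hypothetical compilation: one volume, one chapter, two sections.
outline = [
    OutlineEntry("v1", "Volume 1", True, [
        OutlineEntry("c1", "Chapter 1", True, [
            OutlineEntry("s1", "Section 1", False),
            OutlineEntry("s2", "Section 2", False),
        ]),
    ]),
]
```

Calling `remove_entry(outline, "s1")` yields a new outline that keeps `v1`, `c1`, and `s2` in order, while the original `outline` is left unchanged.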