TL;DR: The design and functionality of RCrawler are described, along with the experience of implementing it in R, including optimizations that work around the limitations of R.
TL;DR: The results show that semantic features extracted by the framework can boost duplicate-detection performance; the framework is put into practice by integrating it into Twinder, a search engine for Twitter streams.
Abstract: With more than 340 million messages that are posted on Twitter every day, the amount of duplicate content as well as the demand for appropriate duplicate detection mechanisms is increasing tremendously. Yet there exists little research that aims at detecting near-duplicate content on microblogging platforms. We investigate the problem of near-duplicate detection on Twitter and introduce a framework that analyzes the tweets by comparing (i) syntactical characteristics, (ii) semantic similarity, and (iii) contextual information. Our framework provides different duplicate detection strategies that, among others, make use of external Web resources which are referenced from microposts. Machine learning is exploited in order to learn patterns that help identify duplicate content. We put our duplicate detection framework into practice by integrating it into Twinder, a search engine for Twitter streams. An in-depth analysis shows that it allows Twinder to diversify search results and improve the quality of Twitter search. We conduct extensive experiments in which we (1) evaluate the quality of different strategies for detecting duplicates, (2) analyze the impact of various features on duplicate detection, (3) investigate the quality of strategies that classify to what exact level two microposts can be considered as duplicates, and (4) optimize the process of identifying duplicate content on Twitter. Our results prove that semantic features which are extracted by our framework can boost the performance of detecting duplicates.
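As a rough illustration of the syntactical comparison mentioned in the abstract above, the following sketch computes a few pairwise features for two tweets. The feature names, the regular expressions, and the choice of Jaccard overlap are assumptions made for this example; the abstract does not specify which syntactical features the framework uses, and the semantic and contextual features are not modeled here.

```python
import re

# Hypothetical patterns, chosen for illustration only.
URL_RE = re.compile(r"https?://\S+")
TOKEN_RE = re.compile(r"[#@]?\w+")


def jaccard(a, b):
    """Jaccard overlap of two sets; defined as 0.0 when both are empty."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def tokenize(tweet):
    """Return (tokens, urls) for a tweet, lowercasing the text and removing URLs."""
    urls = set(URL_RE.findall(tweet))
    text = URL_RE.sub(" ", tweet).lower()
    return set(TOKEN_RE.findall(text)), urls


def syntactic_features(tweet_a, tweet_b):
    """Pairwise features that a classifier could take as input (assumed feature set)."""
    tokens_a, urls_a = tokenize(tweet_a)
    tokens_b, urls_b = tokenize(tweet_b)
    hashtags_a = {t for t in tokens_a if t.startswith("#")}
    hashtags_b = {t for t in tokens_b if t.startswith("#")}
    return {
        "word_overlap": jaccard(tokens_a, tokens_b),
        "hashtag_overlap": jaccard(hashtags_a, hashtags_b),
        "shared_url": bool(urls_a & urls_b),
    }
```

A feature dictionary like this could be fed, together with labeled tweet pairs, to any standard classifier to learn duplicate patterns, which is the general role the abstract assigns to machine learning.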
TL;DR: Files in a file system are separated into a set of information components, and all components are analyzed to identify duplicated content. When components with duplicate content are found, the duplicates are deleted from physical storage and indexes are generated to reflect inclusion of the retained copy in a plurality of different files.
Abstract: Provided are methods, apparatus and computer programs for improved data storage and management. The invention can be implemented in a replacement or add-on to existing operating system file systems. Files in a file system are separated into a set of information components and then all information components of the file system are analyzed to identify duplication of information content. When information components with duplicate content are identified, duplicates are deleted from physical storage and indexes are generated to reflect inclusion of the retained copy of an information component in a plurality of different files. Improvements to content searching are enabled, since relevant components can be identified without retrieving whole files and since search results will include fewer duplicate results.
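The component-level deduplication described in the abstract above can be sketched in miniature as follows. This is a simplified illustration under stated assumptions: components are fixed-size byte chunks, duplicates are detected by comparing SHA-256 digests, and deduplication happens when a file is added rather than in a later analysis pass. The class name `ComponentStore` and its methods are hypothetical; the abstract does not say how components are delimited or compared.

```python
import hashlib


class ComponentStore:
    """Toy store that keeps one physical copy of each distinct component."""

    def __init__(self):
        self.components = {}  # digest -> component bytes (single retained copy)
        self.file_index = {}  # file name -> ordered list of component digests

    def add_file(self, name, data, component_size=8):
        """Split data into fixed-size components (an assumption) and index them."""
        digests = []
        for start in range(0, len(data), component_size):
            component = data[start:start + component_size]
            digest = hashlib.sha256(component).hexdigest()
            # Equal digests are treated as equal content; a duplicate is not stored again.
            self.components.setdefault(digest, component)
            digests.append(digest)
        self.file_index[name] = digests

    def read_file(self, name):
        """Reassemble a file from the retained components listed in its index."""
        return b"".join(self.components[d] for d in self.file_index[name])

    def files_containing(self, component):
        """Names of files whose index references the given component."""
        digest = hashlib.sha256(component).hexdigest()
        return sorted(n for n, ds in self.file_index.items() if digest in ds)
```

Because the index maps each file to component digests, a search can locate a matching component once and then list every file that includes it, without reading whole files.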
TL;DR: A system and method for retrieval and transfer of encrypted content from a failed set-top box is presented. A duplicate of the content instance key, encrypted with a public key other than that of the set-top box, allows the stored content to be decrypted after the box fails.
Abstract: A system and method for retrieval and transfer of encrypted content from a failed set-top box. When content is recorded to the storage device of a set-top box, the content is encrypted with a content instance key. This content instance key is encrypted with the public key of the set-top box and a duplicate of the content instance key is encrypted by another public key other than the public key of the set-top. In the event the set-top fails, the encrypted content on the storage device may be retrieved from the storage device by decrypting the duplicate content instance key with the private key that corresponds with the public key that encrypted the duplicate of the content instance key.
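The key arrangement in the abstract above can be illustrated with a toy model. This sketch is not secure and is not the patented implementation: it uses textbook RSA without padding on small well-known Mersenne primes, and a hash-derived XOR keystream as a stand-in for a real content cipher. All function names and the "recovery" key pair are hypothetical, introduced only to show how one content instance key is wrapped under two different public keys.

```python
import hashlib
import secrets


def make_toy_rsa_keypair(p, q, e=65537):
    """Textbook RSA key pair from two primes (toy only, no padding)."""
    n = p * q
    phi = (p - 1) * (q - 1)
    d = pow(e, -1, phi)  # modular inverse; requires Python 3.8+
    return (n, e), (n, d)


def wrap_key(content_key, public_key):
    """Encrypt the content instance key (an integer < n) with a public key."""
    n, e = public_key
    return pow(content_key, e, n)


def unwrap_key(wrapped, private_key):
    """Decrypt a wrapped content instance key with the matching private key."""
    n, d = private_key
    return pow(wrapped, d, n)


def xor_cipher(data, content_key):
    """Toy symmetric cipher: XOR with a SHA-256 counter keystream.

    Applying it twice with the same key returns the original data.
    """
    stream = bytearray()
    counter = 0
    while len(stream) < len(data):
        stream.extend(hashlib.sha256(f"{content_key}:{counter}".encode()).digest())
        counter += 1
    return bytes(a ^ b for a, b in zip(data, stream))


# Key pairs built from Mersenne primes (assumed values for illustration).
box_public, box_private = make_toy_rsa_keypair(2**61 - 1, 2**89 - 1)
recovery_public, recovery_private = make_toy_rsa_keypair(2**107 - 1, 2**127 - 1)

# Recording: encrypt content with a fresh content instance key, then wrap the key twice.
content_key = secrets.randbits(64)
stored_content = xor_cipher(b"recorded programme", content_key)
key_for_box = wrap_key(content_key, box_public)
duplicate_key = wrap_key(content_key, recovery_public)

# Set-top box failure: only the duplicate key and the recovery private key are needed.
recovered_key = unwrap_key(duplicate_key, recovery_private)
recovered_content = xor_cipher(stored_content, recovered_key)
```

The point of the design is that the stored content never has to be re-encrypted: whoever holds the second private key can unwrap the duplicate content instance key and decrypt the recording without the failed box's private key.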
TL;DR: In this paper, an automated system is used to determine whether a digital item includes content from other digital items by comparing computed hash codes with retained hash codes associated with other electronic books.
Abstract: When a digital item is submitted for publication, an automated system may determine whether the digital item includes content from other digital items. In some implementations, when the digital item is an electronic book (eBook), the automated system may select sets of words from the eBook and compute hash codes, such that each hash code corresponds to a set of words. The automated system may compare the computed hash codes with retained hash codes associated with other electronic books to determine whether the digital item includes duplicate content.
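A minimal sketch of the hash comparison described in the abstract above follows. It assumes that the "sets of words" are sliding windows of consecutive words and that SHA-1 is the hash function; neither choice is stated in the abstract, and the function names and the `set_size` parameter are hypothetical.

```python
import hashlib
import re


def word_set_hashes(text, set_size=5):
    """Hash every window of `set_size` consecutive words (assumed selection rule)."""
    words = re.findall(r"\w+", text.lower())
    hashes = set()
    for i in range(max(len(words) - set_size + 1, 0)):
        window = " ".join(words[i:i + set_size])
        hashes.add(hashlib.sha1(window.encode("utf-8")).hexdigest())
    return hashes


def overlap_ratio(submitted_text, retained_hashes, set_size=5):
    """Fraction of the submission's hash codes already present in the retained set."""
    computed = word_set_hashes(submitted_text, set_size)
    if not computed:
        return 0.0
    return len(computed & retained_hashes) / len(computed)


# Usage: retain hashes for an existing book, then score a new submission against them.
retained = word_set_hashes("the quick brown fox jumps over the lazy dog")
score = overlap_ratio("The quick brown fox jumps over the lazy dog!", retained)
```

A publication system could flag a submission for review when the ratio exceeds some threshold; the abstract does not give a threshold, so any value used in practice would have to be chosen empirically.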