Duplicate content

Papers published on a yearly basis

Papers

Journal Article•10.1016/J.SOFTX.2017.04.004•

RCrawler: An R package for parallel web crawling and scraping

31 Dec 2017-SoftwareX

TL;DR: The paper describes the design and functionality of RCrawler and reports on the experience of implementing it in an R environment, including optimizations that work around the limitations of R.

81 citations

Proceedings Article•10.1145/2488388.2488499•

Groundhog day: near-duplicate detection on Twitter

Ke Tao¹, Fabian Abel, Claudia Hauff¹, Geert-Jan Houben¹, Ujwal Gadiraju¹•Institutions (1)

Delft University of Technology¹

13 May 2013

TL;DR: The results show that semantic features extracted by the framework can boost duplicate-detection performance. The framework is put into practice by integrating it into Twinder, a search engine for Twitter streams.

Abstract: With more than 340 million messages that are posted on Twitter every day, the amount of duplicate content as well as the demand for appropriate duplicate detection mechanisms is increasing tremendously. Yet there exists little research that aims at detecting near-duplicate content on microblogging platforms. We investigate the problem of near-duplicate detection on Twitter and introduce a framework that analyzes the tweets by comparing (i) syntactical characteristics, (ii) semantic similarity, and (iii) contextual information. Our framework provides different duplicate detection strategies that, among others, make use of external Web resources which are referenced from microposts. Machine learning is exploited in order to learn patterns that help identify duplicate content. We put our duplicate detection framework into practice by integrating it into Twinder, a search engine for Twitter streams. An in-depth analysis shows that it allows Twinder to diversify search results and improve the quality of Twitter search. We conduct extensive experiments in which we (1) evaluate the quality of different strategies for detecting duplicates, (2) analyze the impact of various features on duplicate detection, (3) investigate the quality of strategies that classify to what exact level two microposts can be considered as duplicates and (4) optimize the process of identifying duplicate content on Twitter. Our results prove that semantic features which are extracted by our framework can boost the performance of detecting duplicates.

65 citations
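
The abstract above lists syntactical characteristics as one of three signals compared between tweets. As a minimal sketch of one such signal, the Python below scores two short messages by word overlap (Jaccard similarity). The tokenizer, the function names, and the 0.7 threshold are assumptions made for illustration; the abstract does not say which syntactic features the framework uses, and this sketch does not cover the semantic or contextual signals.

```python
import re


def tokens(text: str) -> set:
    """Lowercased word tokens; deliberately naive (hypothetical tokenizer)."""
    return set(re.findall(r"[a-z0-9#@']+", text.lower()))


def jaccard(a: str, b: str) -> float:
    """Word-overlap similarity in [0, 1]: |A ∩ B| / |A ∪ B|."""
    ta, tb = tokens(a), tokens(b)
    if not ta and not tb:
        return 1.0  # two empty messages are treated as identical
    return len(ta & tb) / len(ta | tb)


def is_near_duplicate(a: str, b: str, threshold: float = 0.7) -> bool:
    """Flag a pair as near-duplicate; the threshold is an assumed value."""
    return jaccard(a, b) >= threshold
```

A learned classifier, as the abstract describes, would take a score like this as one input feature among several rather than applying a fixed cut-off.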

Patent•

Information component based data storage and management

James S. Luke¹•Institutions (1)

IBM¹

21 Feb 2002

TL;DR: Files in a file system are separated into information components, which are then analyzed to identify duplicated content. When components with duplicate content are found, the duplicates are deleted from physical storage, and indexes are generated to record that the retained copy of a component is included in multiple files.

Abstract: Provided are methods, apparatus and computer programs for improved data storage and management. The invention can be implemented in a replacement or add-on to existing operating system file systems. Files in a file system are separated into a set of information components and then all information components of the file system are analyzed to identify duplication of information content. When information components with duplicate content are identified, duplicates are deleted from physical storage and indexes are generated to reflect inclusion of the retained copy of an information component in a plurality of different files. Improved content searching is enabled, since relevant components can be identified without retrieving whole files and since search results will include fewer duplicate results.

58 citations
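
The storage scheme in the abstract above (split files into components, keep one physical copy of each, index which files contain it) can be sketched roughly as follows. This is an illustration under stated assumptions, not the patented method: the fixed-size split, the use of SHA-256 to decide that two components are identical, and the `ComponentStore` class name are all hypothetical choices that the abstract does not specify.

```python
import hashlib


class ComponentStore:
    """Toy deduplicating store: one physical copy per distinct component."""

    def __init__(self, component_size: int = 4096):
        self.component_size = component_size  # assumed fixed-size split
        self.components = {}   # digest -> component bytes (stored once)
        self.file_index = {}   # file name -> ordered list of digests

    def put(self, name: str, data: bytes) -> None:
        digests = []
        for start in range(0, len(data), self.component_size):
            chunk = data[start:start + self.component_size]
            digest = hashlib.sha256(chunk).hexdigest()
            # Keep only the first copy; later duplicates just reference it.
            self.components.setdefault(digest, chunk)
            digests.append(digest)
        self.file_index[name] = digests

    def get(self, name: str) -> bytes:
        """Rebuild a file from its indexed components."""
        return b"".join(self.components[d] for d in self.file_index[name])
```

Storing `b"AAAABBBB"` and `b"BBBBCCCC"` with a component size of 4 keeps three components rather than four, because the shared `BBBB` component is held once and referenced from both file indexes.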

Patent•

Retrieval and transfer of encrypted hard drive content from DVR set-top boxes

Anthony J. Wasilewski¹•Institutions (1)

Scientific Atlanta¹

18 Aug 2004

TL;DR: A system and method for retrieving and transferring encrypted content from a failed set-top box. A duplicate of the content instance key is encrypted with a second public key, so the content can still be decrypted if the set-top box fails.

Abstract: A system and method for retrieval and transfer of encrypted content from a failed set-top box. When content is recorded to the storage device of a set-top box, the content is encrypted with a content instance key. This content instance key is encrypted with the public key of the set-top box and a duplicate of the content instance key is encrypted by another public key other than the public key of the set-top. In the event the set-top fails, the encrypted content on the storage device may be retrieved from the storage device by decrypting the duplicate content instance key with the private key that corresponds with the public key that encrypted the duplicate of the content instance key.

52 citations

Patent•

Detecting duplicated content among digital items

David R. Azari¹, Denis V. Batalov¹, Tanvi Milind Bhadbhade¹, Lee M. Miller¹, Alan Kipust¹, Theresa M. Hollis¹•Institutions (1)

Amazon.com¹

15 Jun 2012

TL;DR: An automated system determines whether a submitted digital item includes content from other digital items by computing hash codes over sets of words and comparing them with retained hash codes associated with other electronic books.

Abstract: When a digital item is submitted for publication, an automated system may determine whether the digital item includes content from other digital items. In some implementations, when the digital item is an electronic book (eBook), the automated system may select sets of words from the eBook and compute hash codes, such that each hash code corresponds to a set of words. The automated system may compare the computed hash codes with retained hash codes associated with other electronic books to determine whether the digital item includes duplicate content.

41 citations
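
The hash comparison described in the abstract above (select sets of words, hash each set, compare against retained hashes) might look roughly like the sketch below. The sliding window of `k` consecutive words, the SHA-1 hash, and both function names are assumptions for illustration; the abstract does not say how word sets are selected or which hash function is used.

```python
import hashlib


def shingle_hashes(text: str, k: int = 5) -> set:
    """Hash every run of k consecutive words (an assumed selection rule)."""
    words = text.lower().split()
    if not words:
        return set()
    if len(words) < k:
        windows = [words]  # short text: hash it as a single word set
    else:
        windows = [words[i:i + k] for i in range(len(words) - k + 1)]
    return {
        hashlib.sha1(" ".join(w).encode("utf-8")).hexdigest()
        for w in windows
    }


def overlap_fraction(submitted: str, retained_hashes: set, k: int = 5) -> float:
    """Fraction of the submitted item's hashes already seen elsewhere."""
    hashes = shingle_hashes(submitted, k)
    if not hashes:
        return 0.0
    return len(hashes & retained_hashes) / len(hashes)
```

In use, the hashes of previously published items would be accumulated into `retained_hashes`, and a new submission would be flagged for review when `overlap_fraction` exceeds some chosen cut-off.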

Year	Papers
2021	3
2020	3
2019	5
2018	7
2017	4
2016	3
