TL;DR: This paper proposes techniques to semi-automatically separate reputable, good pages from spam, and shows that they can effectively filter out spam from a significant fraction of the web, based on a good seed set of less than 200 sites.
Abstract: Web spam pages use various techniques to achieve higher-than-deserved rankings in a search engine's results. While human experts can identify spam, it is too expensive to manually evaluate a large number of pages. Instead, we propose techniques to semi-automatically separate reputable, good pages from spam. We first select a small set of seed pages to be evaluated by an expert. Once we manually identify the reputable seed pages, we use the link structure of the web to discover other pages that are likely to be good. In this paper we discuss possible ways to implement the seed selection and the discovery of good pages. We present results of experiments run on the World Wide Web indexed by AltaVista and evaluate the performance of our techniques. Our results show that we can effectively filter out spam from a significant fraction of the web, based on a good seed set of less than 200 sites.
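The abstract does not say how trust is propagated from the reviewed seeds, so the following is only a minimal sketch of one plausible reading: a PageRank-style iteration biased toward the good seeds. The function name `propagate_trust`, the damping factor, the iteration count, and the toy graph used to exercise it are all assumptions made for illustration and are not taken from the paper.

```python
def propagate_trust(out_links, good_seeds, damping=0.85, iterations=20):
    """Spread trust from hand-verified good seeds along outgoing links.

    out_links:  dict mapping each page to the list of pages it links to.
    good_seeds: set of pages an expert has judged reputable.
    Returns a dict mapping every page to a trust score (higher = more trusted).
    """
    if not good_seeds:
        raise ValueError("at least one good seed page is required")

    # Collect every page that appears as a source or as a link target.
    pages = set(out_links)
    for targets in out_links.values():
        pages.update(targets)

    # Only the seeds start with trust; it is split evenly among them.
    seed_score = {p: (1.0 / len(good_seeds) if p in good_seeds else 0.0)
                  for p in pages}
    trust = dict(seed_score)

    for _ in range(iterations):
        # Each round, a fixed share of trust is re-injected at the seeds.
        next_trust = {p: (1.0 - damping) * seed_score[p] for p in pages}
        for page, targets in out_links.items():
            if not targets:
                continue
            # A page passes its damped trust equally to the pages it links to.
            share = damping * trust[page] / len(targets)
            for target in targets:
                next_trust[target] += share
        trust = next_trust
    return trust
```

In this sketch, pages that cannot be reached from any seed keep a score of zero, which is why the choice of seed set matters so much for coverage.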
TL;DR: A two-stage approach to improve the performance of common classifiers is described, which first implements a classifier to catch a large portion of the spam in the data and then applies several heuristics to decide whether a node should be relabeled based on the preclassified result and knowledge about its neighborhood.
Abstract: Web spam has been recognized as one of the top challenges in the search engine industry [14]. A lot of recent work has addressed the problem of detecting or demoting web spam, including both content spam [16, 12] and link spam [22, 13]. However, any time an anti-spam technique is developed, spammers will design new spamming techniques to confuse search engine ranking methods and spam detection mechanisms. Machine learning-based classification methods can quickly adapt to newly developed spam techniques. We describe a two-stage approach to improve the performance of common classifiers. We first implement a classifier to catch a large portion of spam in our data. Then we design several heuristics to decide if a node should be relabeled based on the preclassified result and knowledge about the neighborhood. Our experimental results show visible improvements with respect to precision and recall.
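The abstract does not list the relabeling heuristics, so the sketch below illustrates only the general shape of the second stage with one hypothetical rule: flip a node's preclassified label when a large enough fraction of its neighbors received the opposite label. The function name `relabel_by_neighborhood`, the label strings, and the `flip_threshold` value are assumptions for illustration, not the authors' heuristics.

```python
def relabel_by_neighborhood(preclassified, neighbors, flip_threshold=0.75):
    """Second-stage pass over labels produced by a first-stage classifier.

    preclassified: dict mapping node -> "spam" or "normal" (stage-one output).
    neighbors:     dict mapping node -> iterable of neighboring nodes.
    A node is flipped when at least `flip_threshold` of its labeled
    neighbors disagree with its own stage-one label.
    """
    relabeled = {}
    for node, label in preclassified.items():
        # Only neighbors that the first stage actually labeled can vote.
        labeled = [n for n in neighbors.get(node, ()) if n in preclassified]
        if not labeled:
            relabeled[node] = label
            continue
        # Votes use the stage-one labels, so the result does not depend
        # on the order in which nodes are visited.
        disagreeing = sum(1 for n in labeled if preclassified[n] != label)
        if disagreeing / len(labeled) >= flip_threshold:
            relabeled[node] = "normal" if label == "spam" else "spam"
        else:
            relabeled[node] = label
    return relabeled
```

Reading the neighbors' stage-one labels rather than the partially updated ones is a design choice in this sketch; an iterative variant that re-reads updated labels would behave differently.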
TL;DR: The invention dynamically produces alternate referrer pages substantially similar to the pages a visitor previously viewed in a web browser before following a link on one of them to a target web page.
Abstract: The invention dynamically produces alternate referrer pages substantially similar to pages previously viewed, through a web browser, by a visitor who linked to a target web page via a link on the previously viewed pages. When the browser links to the target page, a referrer URL is obtained for the referrer page from which the browser loaded the target page. The referrer URL is stored in a queue. The queue is inspected regularly. If the queue contains an unexamined entry, a request for that entry's referrer URL is executed to obtain the alternate referrer pages. The IP address of the computer running the browser is used to derive a country code corresponding to the IP address. The referrer URL request can be issued through a computer in a geographic region corresponding to the country code so that geographic biasing of the previously viewed pages will be reflected in the alternate pages.
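The queue-driven flow described in the abstract can be sketched roughly as follows. This is an illustrative outline only: the names `ReferrerEntry`, `process_queue`, `country_for_ip`, and `fetch_via_region` are hypothetical, and the IP-to-country lookup and the region-specific fetch are passed in as plain callables because the abstract does not say how either is implemented.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class ReferrerEntry:
    referrer_url: str       # URL of the page the visitor came from
    visitor_ip: str         # IP address of the visitor's computer
    examined: bool = False  # set once the alternate page has been requested


def process_queue(queue, country_for_ip, fetch_via_region):
    """Run one inspection pass over the referrer queue.

    country_for_ip(ip)           -> country code (hypothetical lookup).
    fetch_via_region(url, code)  -> page content requested through a
                                    machine in that region (hypothetical).
    Returns the alternate referrer pages obtained during this pass.
    """
    alternate_pages = []
    for entry in queue:
        if entry.examined:
            continue  # already handled in an earlier pass
        country = country_for_ip(entry.visitor_ip)
        alternate_pages.append(fetch_via_region(entry.referrer_url, country))
        entry.examined = True
    return alternate_pages
```

Calling `process_queue` on a timer would correspond to the regular inspection the abstract mentions; entries already marked as examined are skipped on later passes.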
TL;DR: This paper explores simple and immediately useful approaches to enable referrers to monitor the number of click-throughs for which they should be paid.
Abstract: A web user "clicks through" one web site, the referrer, to another web site, the target, if the user follows a hypertext link to the target's site contained in a web page served from the referrer's site. Numerous click-through payment programs have been established on the web, by which (the webmaster of) a target site pays a referrer site for each click through that referrer to the target. However, typically the referrer has no ability to verify that it is paid for every click-through to the target for which it is responsible. Thus, targets can undetectably omit to pay referrers for some number of click-throughs, a practice called hit shaving. In this paper, we explore simple and immediately useful approaches to enable referrers to monitor the number of click-throughs for which they should be paid.
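The abstract does not describe the monitoring approaches themselves, so the sketch below shows only the bookkeeping a referrer would need once it has some way of observing each click-through (for example, by routing outgoing links through its own server, which is an assumption here). The class name `ClickThroughMonitor` and its methods are hypothetical.

```python
from collections import Counter


class ClickThroughMonitor:
    """Referrer-side tally of click-throughs, used to spot hit shaving."""

    def __init__(self):
        self._observed = Counter()

    def record_click(self, target):
        # Called each time the referrer observes a click through to `target`.
        self._observed[target] += 1

    def shortfall(self, target, paid_clicks):
        # Number of observed click-throughs the target has not paid for.
        # A positive value suggests the target may be shaving hits.
        return max(0, self._observed[target] - paid_clicks)
```

A referrer would compare `shortfall` against each payment statement from the target; a persistent positive shortfall is the signal of interest.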
TL;DR: The results show that it is possible to spam Google Scholar: the authors 'improved' the ranking of articles by manipulating their citation counts and made articles appear in searches for keywords the articles did not originally contain by placing invisible text in modified versions of the articles.
Abstract: In this research-in-progress paper we present the current results of several experiments in which we analyzed whether spamming Google Scholar is possible. Our results show that it is possible: we 'improved' the ranking of articles by manipulating their citation counts, and we made articles appear in searches for keywords the articles did not originally contain by placing invisible text in modified versions of the articles.