TL;DR: This paper develops techniques to detect automatically spun content on the Web. Its approach is based on immutables (words or phrases that spinning tools do not modify when generating spun content), which it uses to identify automatically spun Web articles.
Abstract: Web spam is an abusive search engine optimization technique that artificially boosts the search result rank of pages promoted in the spam content. A popular form of Web spam today relies upon automated spinning to avoid duplicate detection. Spinning replaces words or phrases in an input article to create new versions with vaguely similar meaning but sufficiently different appearance to avoid plagiarism detectors. With just a few clicks, spammers can use automated tools to spin a selected article thousands of times and then use posting services via proxies to spam the spun content on hundreds of target sites. The goal of this paper is to develop effective techniques to detect automatically spun content on the Web. Our approach is directly tied to the underlying mechanism used by automated spinning tools: we use a technique based upon immutables, words or phrases that spinning tools do not modify when generating spun content. We implement this immutable method in a tool called DSpin, which identifies automatically spun Web articles. We then apply DSpin to two data sets of crawled articles to study the extent to which spammers use automated spinning to create and post content, as well as their spamming behavior.
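The immutable idea described above can be illustrated with a minimal sketch. It is not DSpin itself: the synonym dictionary, the tokenizer, and the use of Jaccard similarity over the immutable sets are all assumptions made for illustration, and the names `SYNONYMS`, `immutables`, and `immutable_similarity` are hypothetical.

```python
import re

# Hypothetical synonym dictionary: the words a spinner is able to replace.
# A real spinning tool ships a far larger dictionary; this one is made up.
SYNONYMS = {
    "quick": {"fast", "rapid"},
    "fast": {"quick", "rapid"},
    "rapid": {"quick", "fast"},
    "buy": {"purchase"},
    "purchase": {"buy"},
    "cheap": {"inexpensive"},
    "inexpensive": {"cheap"},
}


def immutables(text, synonyms=SYNONYMS):
    """Return the words in `text` that the spinner has no replacement for."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    return {w for w in words if w not in synonyms}


def immutable_similarity(a, b, synonyms=SYNONYMS):
    """Jaccard similarity of the two texts' immutable sets (an assumed metric)."""
    ia, ib = immutables(a, synonyms), immutables(b, synonyms)
    if not ia and not ib:
        return 0.0
    return len(ia & ib) / len(ia | ib)


original = "Buy cheap widgets with quick delivery"
spun = "Purchase inexpensive widgets with rapid delivery"
unrelated = "Quick tips for growing tomatoes"

print(immutable_similarity(original, spun))       # spun pair shares all immutables
print(immutable_similarity(original, unrelated))  # unrelated pair shares none
```

Because spinning only touches words in the dictionary, the spun pair keeps an identical immutable set even though the surface text differs, which is the property the abstract relies on.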
TL;DR: This study attempts to identify spam websites in a dataset of 2751 websites using bio-inspired outlier detection approaches, and indicates that metrics including Domain Authority, Page Authority, Moz Rank, Links In, External Equity Links, Spam Score, Alexa Rank, Citation Flow, Trust Flow, External Back Links, Referred Domains, SemRush URL Links and SemRush Hostname Links play an important role in identifying spam.
Abstract: With the exponential growth in Internet use, organizations continuously strive for visibility on the Web. This has opened new avenues in influencer marketing. Several portals encourage these marketers to build content for digital marketing. However, when content is built in bulk, the process produces a lot of spam within these websites. This is often done to establish a presence, using techniques including article spinning and keyword stuffing. This study therefore attempts to identify these spam websites in a dataset of 2751 websites using bio-inspired outlier detection approaches. We use publicly available key performance indicators (KPIs) to identify websites that create spam content to boost the amount of text in the domain. A hybrid wolf search algorithm (WSA) and bat algorithm (BA), integrated with K-means, are used to classify these websites as spam. Findings indicate that metrics including Domain Authority, Page Authority, Moz Rank, Links In, External Equity Links, Spam Score, Alexa Rank, Citation Flow, Trust Flow, External Back Links, Referred Domains, SemRush URL Links and SemRush Hostname Links play an important role in identifying spam. The proposed approach may prove beneficial in segregating spam influencer websites for effective influencer marketing.
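The clustering step mentioned above can be sketched with plain K-means on KPI vectors. This is only an illustration under several assumptions: the wolf search and bat algorithm components are omitted entirely, the two features (Domain Authority and Spam Score, scaled to 0..1) and their values are made up, and treating the smallest cluster as the outlier group is an assumed rule rather than the study's. The helper names `kmeans` and `flag_smallest_cluster` are hypothetical.

```python
import math
from collections import Counter


def kmeans(points, k, iters=20):
    """Plain K-means with a deterministic start (first k points as centroids)."""
    centroids = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iters):
        # Assignment step: each point joins its nearest centroid.
        labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return centroids, labels


def flag_smallest_cluster(labels):
    """Assumed outlier rule: members of the smallest cluster are flagged."""
    counts = Counter(labels)
    smallest = min(counts, key=counts.get)
    return [lab == smallest for lab in labels]


# Made-up (domain_authority, spam_score) pairs, both scaled to 0..1.
sites = [
    (0.80, 0.05),
    (0.75, 0.10),
    (0.82, 0.08),
    (0.78, 0.06),
    (0.10, 0.90),  # low authority, high spam score
]

_, labels = kmeans(sites, k=2)
flags = flag_smallest_cluster(labels)
print(flags)
```

A metaheuristic such as WSA or BA would typically replace the fixed starting centroids with a searched set; that part is left out here because the abstract does not describe how the integration is done.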
TL;DR: The authors argue that Web 2.0 needs next-generation search engines and propose a content-hole search system using Wikipedia, which attempts to extract and represent content holes from discussions on SNSs and blogs.
Abstract: SNSs and blogs, both of which are maintained by communities of people, have become popular in Web 2.0. We call such content "community-type content." A community is associated with the content, and those who use or contribute to community-type content are considered members of that community. Occasionally, the members of a community do not understand the theme of the content from multiple viewpoints; hence, the information available is often insufficient. It is therefore useful to present users with the information they have missed. As Web 2.0 has become popular, both the content on the Internet and the types of users have changed. We believe that there is a need for next-generation search engines in Web 2.0. We require a search engine that can search for information users are unaware of; we call such information "content holes." In this paper, we propose a method for searching for content holes in community-type content. We attempt to extract and represent content holes from discussions on SNSs and blogs. Conventional Web search techniques are generally based on similarity. Our content-hole search, in contrast, targets information that users are unaware of. In this paper, we classify and represent a number of images for different searching methods, define content holes, and, as the first step toward realizing our aim, propose a content-hole search system using Wikipedia.