TL;DR: This study proposes writing the destination information when a short URL is generated, so that a user can check whether the destination is a web document or a file, which helps prevent attacks such as “drive-by download” delivered through short URLs.
Abstract: All information and data on the Internet are connected through URLs. Although many people use URLs to share and convey information, a URL is difficult to transmit when it is long or contains special characters. A short URL service transforms a long URL into a short form that still leads to the page holding the necessary information. Recently, attackers have abused short URLs, spreading them through SMS or SNS to distribute malicious code. Because it is difficult to predict the original URL from the short URL alone, short URLs are also vulnerable to phishing attacks. In this study, a method is proposed that writes the destination information when generating a short URL, so that a user is able to check whether the destination is a web document or a file. The short URL service provider monitors the risk of the target page of each generated short URL and decides whether to keep providing the service. By monitoring modifications to the web document, it measures and evaluates the risk of the webpage and decides whether to block the short URL according to a threshold, which prevents attacks such as "drive-by download" through the short URL.
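The flow described in the abstract above (record the destination type at creation time, monitor the target page for modifications, block once a risk threshold is reached) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the abstract does not say how risk is measured, so the fixed per-change increment, the 0.7 threshold, the extension list, and all names (`ShortUrlRecord`, `classify_destination`, `update_risk`) are hypothetical.

```python
import hashlib
from dataclasses import dataclass
from urllib.parse import urlparse

# Assumption: a small illustrative list; a real service would likely
# inspect the Content-Type of the response rather than the extension.
FILE_EXTENSIONS = (".exe", ".apk", ".zip", ".pdf")


@dataclass
class ShortUrlRecord:
    short_code: str
    target_url: str
    destination_type: str  # "web document" or "file", shown to the user
    baseline_hash: str     # hash of the page body at the last check
    risk: float = 0.0
    blocked: bool = False


def classify_destination(url: str) -> str:
    """Label the destination so a user can see it before following the link."""
    path = urlparse(url).path.lower()
    return "file" if path.endswith(FILE_EXTENSIONS) else "web document"


def content_hash(body: bytes) -> str:
    return hashlib.sha256(body).hexdigest()


def create_record(short_code: str, target_url: str, body: bytes) -> ShortUrlRecord:
    """Write the destination information when the short URL is generated."""
    return ShortUrlRecord(
        short_code=short_code,
        target_url=target_url,
        destination_type=classify_destination(target_url),
        baseline_hash=content_hash(body),
    )


def update_risk(record: ShortUrlRecord, new_body: bytes,
                increment: float = 0.4, threshold: float = 0.7) -> ShortUrlRecord:
    """Raise the risk when the page changed; block at or above the threshold.

    Assumption: every detected modification adds a fixed increment.
    The actual risk measure is not specified in the abstract.
    """
    new_hash = content_hash(new_body)
    if new_hash != record.baseline_hash:
        record.risk = min(1.0, record.risk + increment)
        record.baseline_hash = new_hash
    record.blocked = record.risk >= threshold
    return record
```

In this sketch the provider would call `update_risk` each time it re-fetches the target page, and stop redirecting once `blocked` is set.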
TL;DR: This work proposes a novel URL de-duping framework based on sequence-to-sequence (Seq2Seq) neural networks and demonstrates the efficiency of this framework in the real large-scale web environment.
Abstract: Many URLs on the Internet point to identical content, which increases the burden on web crawlers. Techniques that detect such URLs (known as URL de-duping) can greatly save resources such as bandwidth and storage for crawlers. Traditional de-duping methods are usually limited to heavily engineered rule matching strategies. In this work, we propose a novel URL de-duping framework based on sequence-to-sequence (Seq2Seq) neural networks. A single concise translation model can take the place of thousands of explicit rules. Experiments indicate that a vanilla Seq2Seq architecture yields robust and accurate results in detecting duplicate URLs. Furthermore, we demonstrate the efficiency of this framework in a real large-scale web environment.
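To make the contrast concrete, the kind of explicit rule matching that the abstract above describes as the traditional approach might look like the sketch below. It is not the proposed Seq2Seq model; it only illustrates the hand-written rules such a model is meant to replace. The ignored-parameter list and the function names `normalize` and `dedupe` are assumptions made for illustration.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Assumption: parameters that do not change the page content.
IGNORED_PARAMS = {"utm_source", "utm_medium", "utm_campaign", "sessionid"}


def normalize(url: str) -> str:
    """Apply a few hand-written rules to map duplicate URLs to one form."""
    parts = urlsplit(url)
    # Rule 1: drop parameters known not to affect content, then sort the rest.
    query = sorted(
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key.lower() not in IGNORED_PARAMS
    )
    # Rule 2: ignore a trailing slash on the path.
    path = parts.path.rstrip("/") or "/"
    # Rule 3: lowercase the host and discard the fragment.
    return urlunsplit(
        (parts.scheme.lower(), parts.netloc.lower(), path, urlencode(query), "")
    )


def dedupe(urls):
    """Group URLs by their normalized form."""
    groups = {}
    for url in urls:
        groups.setdefault(normalize(url), []).append(url)
    return groups
```

Every site convention needs its own rule in an approach like this, which is the maintenance cost the abstract refers to when it speaks of thousands of explicit rules.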
TL;DR: It is demonstrated that machine learning can be of benefit in predicting, from previously encountered URLs, the likelihood of encountering structured data in an unseen URL, which can be applied within applications that encounter a large number of possible URLs to dereference and must implement some priority scheme to choose relevant URLs.
Abstract: A growing amount of structured data can be found embedded in web pages using formats such as RDFa, JSON-LD and Microdata. Although such data is indexed by search engines and sometimes replicated in centralised knowledge bases, application scenarios exist in which there is a need to discover such data on-the-fly, for example when using the follow-your-nose principle of accessing Linked Open Data, or in applications where the velocity at which data changes can result in centralised repositories being out of date. In this paper, we demonstrate two complementary techniques for aiding such applications by analysing URLs. First, we demonstrate that machine learning can be of benefit in predicting, from previously encountered URLs, the likelihood of encountering structured data in an unseen URL. This can be applied within applications that encounter a large number of possible URLs to dereference and must implement some priority scheme to choose relevant URLs. Second, we demonstrate that association rule mining can be of use in linking existing resources in a knowledge base, such as DBpedia, to URLs that follow common schemes, such as Semantic (search engine friendly) URLs.
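The first technique in the abstract above, predicting from previously seen URLs whether an unseen URL is likely to carry structured data, can be illustrated with a small sketch. The abstract does not name the features or the learning algorithm, so the URL tokenisation, the naive Bayes-style log-odds score, and the names `UrlScorer` and `prioritise` are all assumptions; any labelled URLs passed in are the caller's own.

```python
import math
import re
from collections import Counter


def tokens(url: str) -> list[str]:
    """Split a URL into lowercase alphanumeric tokens."""
    return [t for t in re.split(r"[^a-z0-9]+", url.lower()) if t]


class UrlScorer:
    """Naive Bayes-style log-odds scorer over URL tokens (illustrative only)."""

    def __init__(self) -> None:
        # Per class: number of URLs containing each token, and URL count.
        self.counts = {True: Counter(), False: Counter()}
        self.docs = {True: 0, False: 0}

    def fit(self, labelled):
        """labelled: iterable of (url, has_structured_data) pairs."""
        for url, has_data in labelled:
            self.docs[has_data] += 1
            self.counts[has_data].update(set(tokens(url)))
        return self

    def score(self, url: str) -> float:
        """Log-odds that the URL carries structured data (higher = more likely)."""
        score = math.log((self.docs[True] + 1) / (self.docs[False] + 1))
        for token in set(tokens(url)):
            # Laplace-smoothed probability of seeing the token in each class.
            p_pos = (self.counts[True][token] + 1) / (self.docs[True] + 2)
            p_neg = (self.counts[False][token] + 1) / (self.docs[False] + 2)
            score += math.log(p_pos / p_neg)
        return score


def prioritise(scorer: UrlScorer, urls):
    """Order candidate URLs so the most promising are dereferenced first."""
    return sorted(urls, key=scorer.score, reverse=True)
```

A crawler using this kind of priority scheme would dereference URLs in the order returned by `prioritise` and feed the observed outcomes back into `fit`.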