Top 3 papers published on the topic of URL normalization in 2017

Showing papers on "URL normalization" published in 2017

Journal Article•10.1007/S11277-016-3866-8

Secure Short URL Generation Method that Recognizes Risk of Target URL

Hyung-Jin Mun¹, Yong-Zhen Li²•Institutions (2)

Baekseok University¹, Yanbian University²

01 Mar 2017 - Wireless Personal Communications

TL;DR: A method is proposed that writes the destination information when a short URL is generated, so that a user can check whether the destination is a web document or a file, which prevents attacks such as "drive-by download" through the short URL.

Abstract: Information and data on the Internet are linked through URLs. Although many people use URLs to share and convey information, a URL is difficult to transmit when it is long or contains special characters. A short URL service transforms a long URL into a short one that still leads to the page with the required information. Recently, attackers have abused short URLs sent through SMS or SNS to distribute malicious code. Because the original URL is difficult to predict from the short URL, short URLs are also vulnerable to phishing attacks. In this study, a method is proposed that writes the destination information when generating a short URL, so that a user is able to check whether the destination is a web document or a file. The short URL service provider monitors the risk of the target page of each generated short URL and decides whether to provide the service. By monitoring modifications to the web document, it measures and evaluates the risk of the webpage and decides whether to block the short URL according to a threshold, which prevents attacks such as "drive-by download" through the short URL.

6 citations
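The abstract above describes two steps: recording the destination type when a short URL is generated, and blocking the short URL once a monitored risk value crosses a threshold. The following is a minimal sketch of that flow, not the authors' implementation. The names `ShortUrlRecord`, `destination_type`, `make_record`, and `should_block`, the extension list, the 0-to-1 risk scale, and the default threshold of 0.7 are all assumptions made for illustration; the abstract does not specify how the risk score is computed, so it is treated here as an externally supplied number.

```python
import posixpath
from dataclasses import dataclass
from urllib.parse import urlsplit

# Hypothetical list of extensions treated as downloadable files.
FILE_EXTENSIONS = {".exe", ".apk", ".zip", ".msi"}


def destination_type(url: str) -> str:
    """Classify a target URL as 'file' or 'document' from its path extension."""
    path = urlsplit(url).path
    ext = posixpath.splitext(path)[1].lower()
    return "file" if ext in FILE_EXTENSIONS else "document"


@dataclass
class ShortUrlRecord:
    short_code: str
    target_url: str
    dest_type: str            # written once, when the short URL is generated
    risk_score: float = 0.0   # assumed 0..1 scale, updated by page monitoring


def make_record(short_code: str, target_url: str) -> ShortUrlRecord:
    """Create the record and store the destination type alongside it."""
    return ShortUrlRecord(short_code, target_url, destination_type(target_url))


def should_block(record: ShortUrlRecord, threshold: float = 0.7) -> bool:
    """Block the short URL once the monitored risk reaches the threshold."""
    return record.risk_score >= threshold
```

In this sketch a monitoring job would update `risk_score` whenever the target page changes, and the redirect handler would call `should_block` before forwarding the user.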

Proceedings Article•10.1145/3077136.3080746

De-duping URLs with Sequence-to-Sequence Neural Networks

Keyang Xu¹, Zhengzhong Liu¹, Jamie Callan¹•Institutions (1)

Carnegie Mellon University¹

7 Aug 2017

TL;DR: This work proposes a novel URL de-duping framework based on sequence-to-sequence (Seq2Seq) neural networks and demonstrates the efficiency of this framework in the real large-scale web environment.

Abstract: Many URLs on the Internet point to identical content, which increases the burden on web crawlers. Techniques that detect such URLs (known as URL de-duping) can greatly save resources such as bandwidth and storage for crawlers. Traditional de-duping methods are usually limited to heavily engineered rule-matching strategies. In this work, we propose a novel URL de-duping framework based on sequence-to-sequence (Seq2Seq) neural networks. A single concise translation model can take the place of thousands of explicit rules. Experiments indicate that a vanilla Seq2Seq architecture yields robust and accurate results in detecting duplicate URLs. Furthermore, we demonstrate the efficiency of this framework in the real large-scale web environment.

2 citations
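For contrast with the Seq2Seq approach, the kind of hand-written rule matching the abstract calls traditional can be sketched as a small normalization function. This illustrates the rule-based baseline only, not the paper's model. The function name `normalize` and the contents of `TRACKING_PARAMS` are assumptions; the rules shown (lowercasing scheme and host, dropping default ports and fragments, sorting query parameters) are common normalization steps built on Python's standard `urllib.parse`.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Hypothetical list of parameters that do not change the page content.
TRACKING_PARAMS = {"utm_source", "utm_medium", "utm_campaign", "sessionid"}
DEFAULT_PORTS = {"http": 80, "https": 443}


def normalize(url: str) -> str:
    """Map a URL to a canonical form so that duplicates compare equal.

    Simplifications: userinfo is dropped and IPv6 literals are not handled.
    """
    parts = urlsplit(url)
    scheme = parts.scheme.lower()
    host = (parts.hostname or "").lower()
    port = parts.port
    # Rule 1: drop the port when it is the default for the scheme.
    if port is None or DEFAULT_PORTS.get(scheme) == port:
        netloc = host
    else:
        netloc = f"{host}:{port}"
    # Rule 2: an empty path is equivalent to "/".
    path = parts.path or "/"
    # Rule 3: remove tracking parameters and sort the remainder.
    pairs = sorted(
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key not in TRACKING_PARAMS
    )
    # Rule 4: drop the fragment, which is never sent to the server.
    return urlunsplit((scheme, netloc, path, urlencode(pairs), ""))
```

Two URLs are then treated as duplicates when `normalize(a) == normalize(b)`. Every new site convention needs another rule in a scheme like this, which is the maintenance burden the abstract says a single translation model can replace.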

Proceedings Article•10.1145/3102254.3102265

Analysis of semantic URLs to support automated linking of structured data on the web

Steven Lynden¹•Institutions (1)

National Institute of Advanced Industrial Science and Technology¹

19 Jun 2017

TL;DR: It is demonstrated that machine learning can be of benefit in predicting, from previously encountered URLs, the likelihood of encountering structured data in an unseen URL, which can be applied within applications that encounter a large number of possible URLs to dereference and must implement some priority scheme to choose relevant URLs.

Abstract: A growing amount of structured data can be found embedded in web pages using formats such as RDFa, JSON-LD and Microdata. Although such data is indexed by search engines and sometimes replicated in centralised knowledge bases, application scenarios exist in which there is a need to discover such data on the fly, for example when using the follow-your-nose principle of accessing Linked Open Data, or in applications where the velocity at which data changes can result in centralised repositories being out of date. In this paper we demonstrate two complementary techniques for aiding such applications by analysing URLs. Firstly, we demonstrate that machine learning can be of benefit in predicting, from previously encountered URLs, the likelihood of encountering structured data in an unseen URL. This can be applied within applications that encounter a large number of possible URLs to dereference and must implement some priority scheme to choose relevant URLs. Secondly, we demonstrate that association rule mining can be of use in linking existing resources in a knowledge base, such as DBpedia, to URLs that follow common schemes, such as Semantic (search engine friendly) URLs.
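The priority scheme mentioned in the abstract can be illustrated with a simple frequency estimate standing in for the paper's machine learning model, which the abstract does not describe in detail. In this sketch, the class `StructuredDataPrior`, the functions `url_pattern` and `prioritise`, and the choice of (host, first path segment) as the URL pattern are all hypothetical. The score is a Laplace-smoothed fraction of previously seen URLs with the same pattern that contained structured data.

```python
from collections import defaultdict
from urllib.parse import urlsplit


def url_pattern(url: str) -> tuple:
    """Reduce a URL to a coarse pattern: (host, first path segment)."""
    parts = urlsplit(url)
    segments = [s for s in parts.path.split("/") if s]
    return (parts.hostname or "", segments[0] if segments else "")


class StructuredDataPrior:
    """Estimate, per URL pattern, how often structured data was found."""

    def __init__(self):
        self.seen = defaultdict(int)
        self.hits = defaultdict(int)

    def observe(self, url: str, has_structured_data: bool) -> None:
        """Record one previously dereferenced URL and its outcome."""
        key = url_pattern(url)
        self.seen[key] += 1
        if has_structured_data:
            self.hits[key] += 1

    def score(self, url: str) -> float:
        """Laplace-smoothed likelihood; 0.5 for a pattern never seen."""
        key = url_pattern(url)
        return (self.hits.get(key, 0) + 1) / (self.seen.get(key, 0) + 2)


def prioritise(prior: StructuredDataPrior, urls: list) -> list:
    """Order candidate URLs so the most promising are dereferenced first."""
    return sorted(urls, key=prior.score, reverse=True)
```

A crawler would call `observe` after each fetch and `prioritise` on its frontier of unseen URLs. A learned classifier over richer URL features, as the paper proposes, would replace `score`.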
