TL;DR: A technique called optimal RT-PFL is proposed to classify URLs detected on websites as malicious or non-malicious; its feature selection step efficiently identifies a minimal attribute reduction from the dataset, which improves the classification system's performance.
Abstract: Much criminal activity on the Internet originates from malicious or suspicious websites. Keeping end users away from such websites is therefore a challenging task. In this paper we propose a technique called optimal RT-PFL to classify URLs detected on websites as malicious or non-malicious. The dataset is encoded into both the lexical and the host-based features of each URL in order to generate the feature components. Features are then extracted by the feature extraction process. Optimal URL features are selected using the proposed feature selection approach, a Gray Wolf Optimizer based Rough Set Theory algorithm. This algorithm can efficiently identify a minimal attribute reduction from the dataset, which in turn improves the classification system's performance. Each selected URL is then passed to the classifier, which predicts whether it is benign or malicious. The classification of URLs relies on the newly proposed particle filtering based fuzzy logic approach. The resulting classifier achieves higher accuracy, identifying a large number of malicious URLs from malevolent sites.
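As an illustration of the lexical encoding step mentioned above, the following is a minimal sketch that assumes a small, hypothetical feature set. The abstract does not list the features actually used, so the function name `lexical_features` and every feature key are assumptions. Host-based features (for example WHOIS or DNS data) need network lookups and are omitted.

```python
import ipaddress
from urllib.parse import parse_qsl, urlsplit


def _is_ipv4(host: str) -> bool:
    """Return True if the host is a literal IPv4 address."""
    try:
        ipaddress.IPv4Address(host)
        return True
    except ValueError:
        return False


def lexical_features(url: str) -> dict:
    """Compute a few illustrative lexical features from the URL string alone.

    The feature set is hypothetical; it is not the one used in the paper.
    """
    parts = urlsplit(url)
    host = parts.hostname or ""
    return {
        "url_length": len(url),
        "host_length": len(host),
        "path_length": len(parts.path),
        "num_dots_in_host": host.count("."),
        "num_digits": sum(ch.isdigit() for ch in url),
        "num_query_params": len(parse_qsl(parts.query, keep_blank_values=True)),
        "has_at_symbol": "@" in url,
        "host_is_ipv4": _is_ipv4(host),
    }


features = lexical_features("http://192.168.0.1/login.php?user=a&id=2")
```

A feature selection method such as the one proposed in the paper would then operate on vectors like `features` to keep only the most informative attributes.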
TL;DR: A Web crawler is proposed that crawls within a particular category to remove irrelevant URLs and applies URL normalization to remove duplicate URLs within that category.
Abstract: Searching is an important requirement of web users, and search results depend on the crawler. Users rely on search engines to get the desired information in various forms: text, images, sound, and video. A search engine returns information from an indexed database, which is built from the URLs collected by the crawler. Some URLs lead, directly or indirectly, to the same page. Crawling and indexing URLs with similar content wastes resources. A crawler produces such results because of a poor crawling algorithm, a poor-quality ranking algorithm, or low-level user experience. The challenge is to remove duplicate results and to detect and eliminate near-duplicate documents in order to improve search engine performance. This paper proposes a Web crawler that crawls within a particular category to remove irrelevant URLs and applies URL normalization to remove duplicate URLs within that category. Results are analyzed in terms of total URLs fetched, duplicate URLs, and query execution time.
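The idea of URL normalization for duplicate removal can be sketched as follows. This is a minimal example under assumed rules (lowercase scheme and host, drop default ports and fragments, sort query parameters, use "/" for an empty path); the abstract does not state which rules the proposed crawler applies. The names `normalize_url` and `dedupe` are hypothetical.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

DEFAULT_PORTS = {"http": 80, "https": 443}


def normalize_url(url: str) -> str:
    """Map equivalent URL spellings to one canonical string (assumed rules)."""
    parts = urlsplit(url.strip())
    scheme = parts.scheme.lower()
    host = parts.hostname or ""  # hostname is already lowercased by urlsplit
    netloc = host
    # Keep the port only when it differs from the scheme's default.
    if parts.port is not None and parts.port != DEFAULT_PORTS.get(scheme):
        netloc = f"{host}:{parts.port}"
    path = parts.path or "/"
    # Sort query parameters so that parameter order does not matter.
    query = urlencode(sorted(parse_qsl(parts.query, keep_blank_values=True)))
    # The fragment is dropped because it is not sent to the server.
    # Simplification: user info and IPv6 brackets are not preserved.
    return urlunsplit((scheme, netloc, path, query, ""))


def dedupe(urls):
    """Keep the first URL seen for each normalized form, preserving order."""
    seen = set()
    unique = []
    for url in urls:
        key = normalize_url(url)
        if key not in seen:
            seen.add(key)
            unique.append(url)
    return unique


crawled = [
    "HTTP://Example.com:80/a?b=2&a=1#top",
    "http://example.com/a?a=1&b=2",
    "http://example.com/other",
]
unique_urls = dedupe(crawled)
```

Here the first two entries of `crawled` normalize to the same string, so `dedupe` keeps only the first of them. Syntactic normalization of this kind catches duplicate URLs only; near-duplicate documents served under unrelated URLs would need content-based detection.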