A Crawler Architecture for Harvesting the Clear, Social, and Dark Web for IoT-Related Cyber-Threat Intelligence
TL;DR: In this paper, the authors focus on the information gathering task, and present a novel crawling architecture for transparently harvesting data from security websites in the clear web, security forums in the social web, and hacker forums/marketplaces in the dark web.
Abstract: The clear, social, and dark web have lately been identified as rich sources of valuable cyber-security information that, given the appropriate tools and methods, may be identified, crawled, and subsequently turned into actionable cyber-threat intelligence. In this work, we focus on the information gathering task, and present a novel crawling architecture for transparently harvesting data from security websites in the clear web, security forums in the social web, and hacker forums/marketplaces in the dark web. The proposed architecture adopts a two-phase approach to data harvesting. In the first phase, a machine learning-based crawler directs the harvesting towards websites of interest. In the second phase, state-of-the-art statistical language modelling techniques represent the harvested information in a latent low-dimensional feature space and rank it by its potential relevance to the task at hand. The proposed architecture is realised exclusively with open-source tools, and a preliminary evaluation with crowdsourced results demonstrates its effectiveness.
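The two-phase idea in the abstract can be illustrated with a small, self-contained sketch. It is not the authors' implementation: the in-memory `TOY_WEB` graph, the keyword-based `relevance` function (a stand-in for a trained classifier), and the two-dimensional `TOY_VECTORS` (a stand-in for learned word embeddings) are all hypothetical and exist only to show the control flow.

```python
import heapq
import math

# Hypothetical in-memory "web": URL -> (page text, outgoing links).
TOY_WEB = {
    "seed": ("security news portal", ["a", "b"]),
    "a": ("iot botnet exploit targets camera firmware", ["c"]),
    "b": ("cooking recipes and travel tips", ["d"]),
    "c": ("router firmware vulnerability exploit released", []),
    "d": ("holiday travel photos", []),
}

# Stand-in for a trained classifier: share of tokens that are topic keywords.
TOPIC_TERMS = {"iot", "botnet", "exploit", "firmware", "vulnerability", "security"}

# Toy 2-d word vectors standing in for learned embeddings (assumed values).
TOY_VECTORS = {
    "iot": (0.9, 0.1), "botnet": (0.8, 0.2), "exploit": (0.9, 0.0),
    "firmware": (0.7, 0.3), "vulnerability": (0.8, 0.1),
    "security": (0.6, 0.4), "news": (0.1, 0.9), "portal": (0.0, 1.0),
}


def relevance(text):
    tokens = text.split()
    return sum(t in TOPIC_TERMS for t in tokens) / len(tokens) if tokens else 0.0


def focused_crawl(seed, threshold=0.2):
    """Phase 1: best-first crawl that only follows links from relevant pages."""
    frontier = [(-1.0, seed)]  # heapq is a min-heap, so priorities are negated
    seen = {seed}
    harvested = {}
    while frontier:
        _, url = heapq.heappop(frontier)
        text, links = TOY_WEB[url]
        score = relevance(text)
        if score < threshold:
            continue  # prune: do not store the page or follow its links
        harvested[url] = text
        for link in links:
            if link not in seen:
                seen.add(link)
                # A link inherits the relevance of the page it was found on.
                heapq.heappush(frontier, (-score, link))
    return harvested


def embed(text):
    """Average the vectors of known tokens into one document vector."""
    vecs = [TOY_VECTORS[t] for t in text.split() if t in TOY_VECTORS]
    if not vecs:
        return (0.0, 0.0)
    return tuple(sum(c) / len(vecs) for c in zip(*vecs))


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.hypot(*u) * math.hypot(*v)
    return dot / norm if norm else 0.0


def rank(harvested, topic="iot exploit vulnerability"):
    """Phase 2: order harvested pages by similarity to a topic description."""
    target = embed(topic)
    scored = [(cosine(embed(text), target), url) for url, text in harvested.items()]
    return [url for _, url in sorted(scored, reverse=True)]


if __name__ == "__main__":
    pages = focused_crawl("seed")
    print(rank(pages))
```

In a real deployment the toy graph would be replaced by an HTTP (or Tor) fetcher, and the keyword score and hand-written vectors by trained models. The split into a link-following filter and a separate embedding-based ranker is the part that mirrors the abstract.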
Citations
Robotics cyber security: vulnerabilities, attacks, countermeasures, and recommendations.
TL;DR: This paper reviews the main security vulnerabilities, threats, and risks in the robotics domain, their impacts, and the main attacks, and presents approaches and recommendations for improving the security of robotic systems.
Data Elimination on Repetition using a Blockchain based Cyber Threat Intelligence
S. Smys, Wang Haoxiang
- 05 Jan 2021
TL;DR: A CTI system that uses blockchain to tackle the issues of sustainability, scalability, privacy, and reliability is introduced. It is capable of measuring organizations' contributions, reducing network load, creating a reliable dataset, and collecting CTI data from multiple feeds.
25
On Strengthening SMEs and MEs Threat Intelligence and Awareness by Identifying Data Breaches, Stolen Credentials and Illegal Activities on the Dark Web
George Pantelis, Petros Petrou, Sophia Karagiorgou, Dimitrios Alexandrou
- 17 Aug 2021
TL;DR: In this article, the authors investigate how the Dark Web enables cybercrime and maintains marketplaces with breached enterprise data collections and pawned email accounts. They also examine the maturity and efficiency of technical tools and methods for curbing illegal activities on the dark web by raising awareness through efficient text analytics, visual reporting, and alerting mechanisms.
15
Exploring Dark Web Crawlers: A Systematic Literature Review of Dark Web Crawlers and Their Implementation
01 Jan 2023
TL;DR: In this paper, the authors present a systematic literature review (SLR) that examines the prevalence and characteristics of dark web crawlers and presents a model for crawling and scraping clear and dark websites for the purpose of digital investigations.
14
Black Widow Crawler for TOR network to search for criminal patterns
Sergio Mauricio Martínez Monterrubio, Joseph Eduardo Armas Naranjo, Lorena Isabel Barona López, Ángel Leonardo Valdivieso Caraguay
- 01 Mar 2021
TL;DR: Black Widow is a crawler focused on the Tor network that searches, analyzes, and indexes websites containing criminal patterns. It achieves a 240% improvement in the number of indexed secret services compared to the current crawler.
13
References
•Proceedings Article
Distributed Representations of Words and Phrases and their Compositionality
Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S. Corrado, Jeffrey Dean
- 05 Dec 2013
TL;DR: This paper presents a simple method for finding phrases in text, shows that learning good vector representations for millions of phrases is possible, and describes a simple alternative to the hierarchical softmax called negative sampling.
Web Crawler Architecture.
Marc Najork
- 01 Sep 2009
TL;DR: In order to crawl a substantial fraction of the “surface web” in a reasonable amount of time, web crawlers must download thousands of pages per second, and are typically distributed over tens or hundreds of computers.
The architecture and implementation of an extensible web crawler
Jonathan M. Hsieh, Steven D. Gribble, Henry M. Levy
- 28 Apr 2010
TL;DR: It is argued that the low-latency, high selectivity, and scalable nature of the extensible crawler system makes it a promising platform for taking advantage of emerging real-time streams of data, such as Facebook or Twitter feeds.
•Proceedings Article
Crawling deep web content through query forms
Jun Liu, Zhaohui Wu, Lu Jiang, Qinghua Zheng, Xiao Liu
- 23 Aug 2018
TL;DR: This paper proposes the concept of Minimum Executable Pattern (MEP), and then presents a MEP generation method and a MEP-based Deep Web adaptive query method that outperforms existing methods in terms of query capability and applicability.
18
A Topical Crawler for Uncovering Hidden Communities of Extremist Micro-Bloggers on Tumblr
Swati Agarwal, Ashish Sureka
- 01 Jan 2015
TL;DR: A topical crawler-based approach is proposed that performs several tasks: searching for a blogger, computing its similarity against exemplary documents, filtering hate-promoting bloggers, navigating through links to other bloggers, and managing a queue of such bloggers for social network analysis.
Related Papers (5)
Andreas Hotho, Gerd Stumme
John G. Breslin, Alexandre Passant, Stefan Decker
- 03 Oct 2009