1. What are the different types of spam encountered on the Internet, and how do they affect the community?
Spam can be classified in two ways: by the domain it originates from and by its contents. Text spam consists of unsolicited messages or content sent, in bulk or individually, from known and unknown sources. Spam from unknown sources can be filtered by detecting the domain that generated the message and labelling the message as spam depending on the trustworthiness of that domain. Filtering spam based on the contents of the messages is a different task. Spam affects the community in several ways: it wastes time and lowers productivity, consumes storage space, and can cause harm through malicious infections. According to a report by LocalCircles, mobile users have been inundated with unwanted calls, spam, and advertising messages, with 68% of mobile customers receiving four or more promotional or spam SMSs per day on average. Spam can also lead to phishing or ransomware attacks, since users may click on unknown links and fall victim to scams. Overall, spam is both a nuisance and a threat to the community.
2. What are the key points of study in spam detection from emails and SMS?
1. Most existing works are supervised or semi-supervised; very few address unsupervised data. 2. Standard datasets such as Enron contain labelled text messages received through SMS or e-mail, but no standard dataset is available for text extracted from images received through social media chat platforms. 3. Very few models shed light on spam detection using the text contained in spam digital images. 4. Most works focus on spam detection in e-mail and SMS; very few studies cover social media chat platforms. 5. According to our literature survey, no existing model detects spam on the basis of a reference text.
3. What types of text data are included in the dataset?
The dataset includes text extracted from various sources such as images, posters, and flyers found on social media. It contains festive wishes, sale flyers, normal messages, and English words frequently used for conveying messages. Examples of the text data include 'Happy Birthday', 'Happy Diwali', 'Chhatrapati Shivaji Jayanti', 'Ramzan Mubark', 'autographical', 'Flipkart', 'Big Sale', 'doctor appointment', and quotes from famous celebrities and motivational sources. The dataset comprises 705 words, of which 127 are identified as spam and the rest as non-spam. Accompanying images showcase the text content commonly shared and circulated within social media chat platforms.
4. What is the LCS method used for in spam detection?
The Longest Common Substring (LCS) method is used in spam detection to find the substrings that two sentences have in common. It compares the substrings of sequence 1 with those of sequence 2 and looks for the common substring of maximum length. LCS is suitable for short strings, but not for long texts such as book pages or scraped web content. In spam detection, LCS helps label strings or text messages based on specific information. The method uses a reference word (IW) and referred words (RW) to find spam words: applying LCS, words similar to the reference word are identified as potential spam. Unique spam words found this way are added to the reference word list, and the process repeats until a set iteration limit is reached. Words are marked as spam based on the percentage of common substring they share with the initial word. The stopping point is crucial for accuracy: stopping too early may yield too few spam words, while stopping too late only adds redundant iterations. Overall, the LCS method helps separate spam terms from non-spam words in text data.
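The iterative procedure described above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function names, the 0.75 threshold, the iteration limit of 3, and the choice to measure overlap as LCS length divided by the reference word's length are all assumptions made for the example.

```python
def longest_common_substring(a: str, b: str) -> str:
    """Return the longest contiguous substring shared by a and b."""
    best_len, best_end = 0, 0
    prev = [0] * (len(b) + 1)  # common-suffix lengths for the previous row
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                curr[j] = prev[j - 1] + 1
                if curr[j] > best_len:
                    best_len, best_end = curr[j], i
        prev = curr
    return a[best_end - best_len:best_end]


def expand_spam_words(initial_word, candidates, threshold=0.75, max_iterations=3):
    """Grow a spam word list from one initial word (hypothetical helper).

    A candidate is marked as spam when its LCS with any reference word
    covers at least `threshold` of that reference word (assumed rule).
    """
    references = [initial_word.lower()]
    remaining = [w.lower() for w in candidates]
    spam = []
    for _ in range(max_iterations):
        new_words = [
            word for word in remaining
            if any(
                len(longest_common_substring(ref, word)) / len(ref) >= threshold
                for ref in references
            )
        ]
        if not new_words:
            break  # further iterations would be redundant
        spam.extend(new_words)
        # Newly found spam words become reference words for the next pass.
        references.extend(w for w in new_words if w not in references)
        remaining = [w for w in remaining if w not in new_words]
    return spam


if __name__ == "__main__":
    words = ["Big Sale", "wholesale", "doctor", "resale", "birthday"]
    print(expand_spam_words("sale", words))
```

With the initial word 'sale', the candidates containing 'sale' are marked in the first pass, and the loop stops as soon as a pass adds nothing new. In practice the threshold and iteration limit would need tuning on the actual dataset, since they determine where the method stops.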