Spam Detection using Reference Text: A Preliminary Study for Spam Ground Truth Generation

Question

1. What are the different types of spam encountered in the world of Internet and how do they affect the community?

2. What are the key points of study in spam detection from emails and SMS?

3. What types of text data are included in the dataset?

4. What is the LCS method used for in spam detection?

Accepted Answer

Spam can be classified into two types, based on the domain of generation and on the contents. Text spam involves unsolicited messages or content sent in bulk or individually through known and unknown sources. Spam from unknown sources can be filtered by detecting the domain from which a message originates and labelling the message as spam depending on the trustworthiness of that domain. Filtering spam based on the contents of the messages is a different task. Spam affects the community in several ways: it wastes time and decreases productivity, consumes storage space, and causes harm through malicious infections. According to a report by LocalCircles, mobile users have been inundated with unwanted calls, spam, and advertising messages, with 68% of mobile customers receiving four or more promotional or spam SMSs on average per day. Spam can also lead to phishing or ransomware attacks, as users may click on unknown links and fall victim to scams. Overall, spam is both a nuisance and a threat to the community.

Accepted Answer

1. Most works are supervised or semi-supervised; very few address unsupervised data. 2. Standard datasets such as Enron contain labelled text messages received through SMS or e-mail, but no standard dataset is available with text messages extracted from images received through social media chat platforms. 3. Very few models address spam detection using the text contained in spam digital images. 4. Most works focus on spam detection from e-mail and SMS; very few studies address social media chat platforms. 5. As per our literature survey, no model detects spam on the basis of reference text.

Accepted Answer

The dataset includes text data from various sources such as images, posters, and flyers found on social media. It contains 'festive wishes', sale flyers, normal messages, and frequently used English words for conveying messages. Examples of text data include 'Happy Birthday', 'Happy Diwali', 'Chhatrapati Shivaji Jayanti', 'Ramzan Mubark', 'autographical', 'Flipkart', 'Big Sale', 'doctor appointment', and quotes from famous celebrities and motivational sources. The dataset comprises 705 words, of which 127 are identified as spam and the rest as non-spam. Accompanying images show the kind of text content commonly shared and circulated on social media chat platforms.

Accepted Answer

The Longest Common Substring (LCS) method is used in spam detection to find common substrings between two sentences. It compares the substrings of sequence 1 with those of sequence 2 and keeps the longest substring the two share. LCS is suitable for short strings, but not for long texts such as book pages or scraped web content. In spam detection, LCS helps label strings or text messages based on specific information. The method uses a reference word, the initial word (IW), and referred words (RW) to find spam words. Applying LCS identifies words similar to the reference word as potential spam. Unique spam words found this way are added to the reference word list, and the process repeats until an iteration limit is reached. Words are marked as spam based on the percentage of common substring they share with the initial word. The stopping point is crucial for accuracy: stopping too early may yield fewer spam words, while stopping too late adds redundant iterations. Overall, the LCS method helps separate spam terms from non-spam words in text data.
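
The iterative procedure described in this answer can be sketched roughly as follows. This is a minimal illustration, not the paper's implementation: the function names, the normalisation (LCS length divided by the length of the longer string), the `factor` threshold semantics, and the sample words are all assumptions made for the example.

```python
def longest_common_substring(a: str, b: str) -> str:
    """Return the longest contiguous substring shared by a and b (dynamic programming)."""
    best_len, best_end = 0, 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                # Extend the common run that ended at a[i-2], b[j-2].
                curr[j] = prev[j - 1] + 1
                if curr[j] > best_len:
                    best_len, best_end = curr[j], i
        prev = curr
    return a[best_end - best_len:best_end]


def common_fraction(reference: str, candidate: str) -> float:
    """LCS length over the longer string's length (an assumed normalisation)."""
    if not reference or not candidate:
        return 0.0
    lcs = longest_common_substring(reference.lower(), candidate.lower())
    return len(lcs) / max(len(reference), len(candidate))


def expand_spam_words(initial_words, dataset, factor=0.4, max_iterations=5):
    """Flag words similar to any reference word, then reuse them as new references."""
    references = list(initial_words)
    spam = []
    for _ in range(max_iterations):
        newly_flagged = []
        for word in dataset:
            if word in spam or word in newly_flagged:
                continue
            if any(common_fraction(ref, word) >= factor for ref in references):
                newly_flagged.append(word)
        if not newly_flagged:
            break  # further iterations would be redundant
        spam.extend(newly_flagged)
        references.extend(w for w in newly_flagged if w not in references)
    return spam


# Hypothetical sample words, loosely modelled on the examples quoted above.
sample = ["Happy Diwali", "Ganesh Chaturthi", "doctor appointment", "Happy Holi"]
print(expand_spam_words(["Happy Birthday"], sample))
```

The early `break` reflects the point about stopping: once an iteration adds no new words, continuing only repeats the same comparisons.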

Accepted Answer

Accuracy dips once the 'Factor' value rises beyond 0.415 because of its effect on the common sub-string percentage between the initial word and the words in the dataset. As the factor increases, the common sub-string percentage a word must share with the initial word also increases, so fewer words from the dataset are marked as spam and genuine spam words are missed, which lowers accuracy. When the factor is too low, the opposite happens: many words from the dataset are marked as spam. The starting value of the factor was chosen between 0.315 and 0.435 to avoid accuracy dips and optimize results.
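
To make the role of the factor concrete, here is a small sketch that treats it as a minimum common sub-string fraction, which is an assumption about how the paper applies it. It uses Python's standard `difflib.SequenceMatcher.find_longest_match` in place of a hand-written LCS routine; the helper names, the normalisation, the word list, and the factor values are hypothetical.

```python
from difflib import SequenceMatcher


def lcs_fraction(reference: str, candidate: str) -> float:
    """Longest common substring length over the longer string's length (assumed normalisation)."""
    a, b = reference.lower(), candidate.lower()
    if not a or not b:
        return 0.0
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    match = matcher.find_longest_match(0, len(a), 0, len(b))
    return match.size / max(len(a), len(b))


def flagged(initial_word: str, words, factor: float):
    """Words whose common sub-string fraction with the initial word reaches the factor."""
    return [w for w in words if lcs_fraction(initial_word, w) >= factor]


# Hypothetical word list; not the paper's dataset.
WORDS = ["Happy Diwali", "Happy Holi", "Big Sale", "doctor appointment"]

for factor in (0.1, 0.4, 0.7):
    print(factor, flagged("Happy Birthday", WORDS, factor))
```

Under these assumptions, a very low factor flags nearly everything and a very high factor flags nothing, which is why the useful range is a narrow band in between.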

Accepted Answer

The factor value affects both precision and recall. As the factor value increases, precision also increases, reaching its peak at the maximum factor value, while recall generally decreases. This trend is consistent across different numbers of initial words. The factor value influences the number of 'True Positive', 'True Negative', 'False Positive', and 'False Negative' outcomes, which in turn determine precision and recall. A low factor value admits words with only a small common sub-string percentage, leading to more false positives and lower precision. Conversely, a high factor value may reduce the detection of actual spam words, which lowers recall. Striking a balance between minimizing false positives and maximizing spam detection is crucial.
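
For reference, the standard definitions linking the four outcome counts to precision, recall, and accuracy can be written out as below. The function names and the toy word labels are hypothetical and only illustrate the arithmetic; they are not results from the paper.

```python
def count_outcomes(flagged_words, true_spam, all_words):
    """Count TP, TN, FP, FN for a set of flagged words against ground-truth spam."""
    flagged_set, spam_set = set(flagged_words), set(true_spam)
    tp = len(flagged_set & spam_set)                     # spam, flagged
    fp = len(flagged_set - spam_set)                     # not spam, flagged
    fn = len(spam_set - flagged_set)                     # spam, missed
    tn = len(set(all_words) - flagged_set - spam_set)    # not spam, not flagged
    return tp, tn, fp, fn


def classification_metrics(tp, tn, fp, fn):
    """Precision, recall, and accuracy, returning 0.0 where a denominator is zero."""
    total = tp + tn + fp + fn
    return {
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
        "accuracy": (tp + tn) / total if total else 0.0,
    }


# Toy example with made-up labels: 10 words, 4 truly spam, 4 flagged.
all_words = [f"w{i}" for i in range(10)]
true_spam = ["w0", "w1", "w2", "w3"]
flagged_words = ["w0", "w1", "w2", "w7"]

print(classification_metrics(*count_outcomes(flagged_words, true_spam, all_words)))
```

Raising the factor typically moves words from the flagged set to the unflagged set, trading false positives for false negatives, which is the precision/recall trade-off described above.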

Accepted Answer

The number of initial words does not significantly affect accuracy, precision, or recall. The final outcome relies on the spam words collected during each iteration, which are interdependent. Words remain unflagged when their common sub-string percentage with the initial word is low. The choice of initial words therefore plays no significant role in determining these values.

Accepted Answer

Our method achieves nearly the same accuracy as Multinomial Naive Bayes, surpasses it in precision, and lags slightly behind in recall. The precision gain is significant even though our method is unsupervised; the slightly lower recall is a consequence of that unsupervised nature. Our method also offers a notable advantage: it can be tailored through the reference words provided, allowing specific genres to be identified. The longest common substring (LCS) technique selectively collects words that share a substantial proportion of common substring with the reference, excluding words with low common substring percentages. With phrases like 'Happy Birthday' or 'Happy Diwali' as the initial word, words such as 'Ganesh Chaturthi' are not falsely flagged as spam. The choice of the initial word is determined by the prominence of specific holidays or festive greetings. Our method thus maintains precision and avoids mislabelling legitimate words as spam, unlike other methods that rely on pre-classified data.

Accepted Answer

Text spam detection faces challenges due to the evolving nature of spam messages and sophisticated tactics used by spammers. Traditional text-based spam detection methods struggle with image spam, which uses images or graphics to convey messages. Understanding the contents of digital images and handling them appropriately is crucial. Additionally, identifying spam images targeting specific user groups is essential, as spam classification has changed significantly in response to data transfer volume. Continued research and development are necessary to improve text spam detection systems and enhance online user safety and security.

Most frequently asked questions

1. What are the different types of spam encountered in the world of Internet and how do they affect the community?

2. What are the key points of study in spam detection from emails and SMS?

3. What types of text data are included in the dataset?

4. What is the LCS method used for in spam detection?

5. Why does accuracy dip when 'Factor' value increases beyond 0.415?

6. What is the relationship between factor value, precision, and recall?

7. How does the number of initial words affect accuracy, precision, and recall values?

8. How does our method compare to Multinomial Naive Bayes?

9. What challenges exist in text spam detection?
