{"id":2063,"date":"2021-01-03T22:00:08","date_gmt":"2021-01-03T22:00:08","guid":{"rendered":"https:\/\/thenextweb.com\/?p=1332086"},"modified":"2021-01-03T22:00:08","modified_gmt":"2021-01-03T22:00:08","slug":"how-ai-weeds-the-spam-out-of-our-inboxes","status":"publish","type":"post","link":"https:\/\/www.londonchiropracter.com\/?p=2063","title":{"rendered":"How AI weeds the spam out of our inboxes"},"content":{"rendered":"\n<p>Of more<span>&nbsp;<\/span><a href=\"https:\/\/www.statista.com\/statistics\/456500\/daily-number-of-e-mails-worldwide\/\" target=\"_blank\" rel=\"nofollow noopener noreferrer\">than 300 billion emails<\/a><span>&nbsp;<\/span>sent every day,<span>&nbsp;<\/span><a href=\"https:\/\/www.statista.com\/statistics\/420391\/spam-email-traffic-share\/\" target=\"_blank\" rel=\"nofollow noopener noreferrer\">at least half<\/a><span>&nbsp;<\/span>are spam. Email providers have the huge task of filtering out&nbsp;spam and making sure their users receive the messages that matter.<\/p>\n<p>Spam detection is messy. The line between spam and non-spam messages is fuzzy, and the criteria change over time. From various efforts to automate spam detection,<span>&nbsp;<\/span><a href=\"https:\/\/bdtechtalks.com\/2017\/08\/28\/artificial-intelligence-machine-learning-deep-learning\/\" target=\"_blank\" rel=\"nofollow noopener noreferrer\">machine learning<\/a><span>&nbsp;<\/span>has so far proven to be the most effective and favored approach by email providers. Although we still see spammy emails, a quick look at the junk folder will show how much spam gets weeded out of our inboxes every day thanks to machine learning algorithms.<\/p>\n<p>How does machine learning determine which emails are spam and which are not? Here\u2019s an overview of how machine learning-based&nbsp;spam detection works.<\/p>\n<h2>The challenge<\/h2>\n<p>Spam email comes in different flavors. Many are just annoying messages aiming to draw attention to a cause or spread false information. 
Some of them are<span>&nbsp;<\/span><a href=\"https:\/\/bdtechtalks.com\/2017\/04\/21\/what-is-phishing-and-spear-phishing\/\" target=\"_blank\" rel=\"nofollow noopener noreferrer\">phishing emails<\/a><span>&nbsp;<\/span>with the intent of luring the recipient into clicking on a malicious link or downloading malware.<\/p>\n<p>The one thing they have in common is that they are irrelevant to the needs of the recipient. A spam-detector algorithm must find a way to filter out spam without flagging authentic messages that users want to see in their inbox. And it must do it in a way that can match evolving trends such as panic caused by pandemics, election news, sudden interest in cryptocurrencies, and others.<\/p>\n<p>Static rules can help. For instance, too many BCC recipients, very short body text, and all caps subjects are some of the hallmarks of spam emails. Likewise, some sender domains and email addresses can be associated with spam. But for the most part, spam detection relies on analyzing the content of the message.<\/p>\n<h2>Na\u00efve Bayes machine learning<\/h2>\n<figure class=\"wp-block-embed-youtube wp-block-embed is-type-video is-provider-youtube wp-embed-aspect-16-9 wp-has-aspect-ratio\">\n<p><iframe loading=\"lazy\" src=\"https:\/\/www.youtube.com\/embed\/HZGCoVF3YvM\" width=\"560\" height=\"315\" frameborder=\"0\" allowfullscreen=\"allowfullscreen\">[embedded content]<\/iframe><\/p>\n<\/figure>\n<p>Machine learning algorithms use statistical models to classify data. In the case of spam detection, a trained machine learning model must be able to determine whether the sequence of words found in an email is closer to those found in spam emails or safe ones.<\/p>\n<p>Different machine learning algorithms can detect spam, but one that has gained appeal is the \u201cna\u00efve Bayes\u201d algorithm. 
As the name implies, na\u00efve Bayes is based on \u201c<a href=\"https:\/\/en.wikipedia.org\/wiki\/Bayes%27_theorem\" target=\"_blank\" rel=\"nofollow noopener noreferrer\">Bayes\u2019 theorem<\/a>,\u201d which describes the probability of an event based on prior knowledge.<\/p>\n<figure class=\"wp-block-image size-large\">\n<p><figure class=\"post-image post-mediaBleed aligncenter\"><img decoding=\"async\" loading=\"lazy\" class=\"jetpack-lazy-image jetpack-lazy-image--handled wp-image-8889 lazy\" src=\"https:\/\/i1.wp.com\/bdtechtalks.com\/wp-content\/uploads\/2020\/11\/Bayes-theorem.png?resize=696%2C225&amp;ssl=1\" sizes=\"(max-width: 696px) 100vw, 696px\" alt=\"Bayes theorem\" width=\"696\" height=\"225\" data-attachment-id=\"8889\" data-permalink=\"https:\/\/bdtechtalks.com\/2020\/11\/30\/machine-learning-spam-detection\/bayes-theorem\/\" data-orig-file=\"https:\/\/i1.wp.com\/bdtechtalks.com\/wp-content\/uploads\/2020\/11\/Bayes-theorem.png?fit=2026%2C654&amp;ssl=1\" data-orig-size=\"2026,654\" data-comments-opened=\"1\" data-image-meta=\"{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}\" data-image-title=\"Bayes theorem\" data-image-description data-medium-file=\"https:\/\/i1.wp.com\/bdtechtalks.com\/wp-content\/uploads\/2020\/11\/Bayes-theorem.png?fit=300%2C97&amp;ssl=1\" data-large-file=\"https:\/\/i1.wp.com\/bdtechtalks.com\/wp-content\/uploads\/2020\/11\/Bayes-theorem.png?fit=696%2C225&amp;ssl=1\" data-recalc-dims=\"1\" data-lazy-loaded=\"1\" data-lazy=\"true\" data-srcset=\"https:\/\/i1.wp.com\/bdtechtalks.com\/wp-content\/uploads\/2020\/11\/Bayes-theorem.png?resize=1024%2C331&amp;ssl=1 1024w, 
https:\/\/i1.wp.com\/bdtechtalks.com\/wp-content\/uploads\/2020\/11\/Bayes-theorem.png?resize=300%2C97&amp;ssl=1 300w, https:\/\/i1.wp.com\/bdtechtalks.com\/wp-content\/uploads\/2020\/11\/Bayes-theorem.png?resize=768%2C248&amp;ssl=1 768w, https:\/\/i1.wp.com\/bdtechtalks.com\/wp-content\/uploads\/2020\/11\/Bayes-theorem.png?resize=1536%2C496&amp;ssl=1 1536w, https:\/\/i1.wp.com\/bdtechtalks.com\/wp-content\/uploads\/2020\/11\/Bayes-theorem.png?resize=696%2C225&amp;ssl=1 696w, https:\/\/i1.wp.com\/bdtechtalks.com\/wp-content\/uploads\/2020\/11\/Bayes-theorem.png?resize=1068%2C345&amp;ssl=1 1068w, https:\/\/i1.wp.com\/bdtechtalks.com\/wp-content\/uploads\/2020\/11\/Bayes-theorem.png?resize=1301%2C420&amp;ssl=1 1301w, https:\/\/i1.wp.com\/bdtechtalks.com\/wp-content\/uploads\/2020\/11\/Bayes-theorem.png?resize=1920%2C620&amp;ssl=1 1920w, https:\/\/i1.wp.com\/bdtechtalks.com\/wp-content\/uploads\/2020\/11\/Bayes-theorem.png?w=2026&amp;ssl=1 2026w, https:\/\/i1.wp.com\/bdtechtalks.com\/wp-content\/uploads\/2020\/11\/Bayes-theorem.png?w=1392&amp;ssl=1 1392w\"><figcaption><a href=\"https:\/\/thenextweb.com\/neural\/2021\/01\/03\/how-ai-weeds-the-spam-out-of-our-inboxes-syndication\/#\" data-url=\"https:\/\/twitter.com\/intent\/tweet?url=https%3A%2F%2Fthenextweb.com%2Fneural%2F2021%2F01%2F03%2Fhow-ai-weeds-the-spam-out-of-our-inboxes-syndication%2F&amp;via=thenextweb&amp;related=thenextweb&amp;text=Check out this picture on: Bayes\u2019 theorem\" data-title=\"Share Bayes\u2019 theorem on Twitter\" data-width=\"685\" data-height=\"500\" class=\"post-image-share popitup\" title=\"Share Bayes\u2019 theorem on Twitter\"><i class=\"icon icon--inline icon--twitter--dark\"><\/i><\/a>Bayes\u2019 theorem<\/figcaption><\/figure>\n<\/p>\n<\/figure>\n<p>The reason it is called \u201cna\u00efve\u201d is that it assumes features of observations are independent. Let\u2019s say you want to use na\u00efve Bayes machine learning to predict whether it will rain or not. 
In this case, your features could be temperature and humidity, and the event you\u2019re predicting is rainfall.<\/p>\n<figure class=\"wp-block-image size-large\" readability=\"4\">\n<p><figure class=\"post-image post-mediaBleed aligncenter\"><img decoding=\"async\" loading=\"lazy\" class=\"jetpack-lazy-image jetpack-lazy-image--handled wp-image-8890 lazy\" src=\"https:\/\/i1.wp.com\/bdtechtalks.com\/wp-content\/uploads\/2020\/11\/Naive-Bayes-rain-prediction.png?resize=696%2C63&amp;ssl=1\" sizes=\"(max-width: 696px) 100vw, 696px\" alt=\"Naive Bayes rain prediction\" width=\"696\" height=\"63\" data-attachment-id=\"8890\" data-permalink=\"https:\/\/bdtechtalks.com\/2020\/11\/30\/machine-learning-spam-detection\/naive-bayes-rain-prediction\/\" data-orig-file=\"https:\/\/i1.wp.com\/bdtechtalks.com\/wp-content\/uploads\/2020\/11\/Naive-Bayes-rain-prediction.png?fit=1784%2C160&amp;ssl=1\" data-orig-size=\"1784,160\" data-comments-opened=\"1\" data-image-meta=\"{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}\" data-image-title=\"Naive Bayes rain prediction\" data-image-description data-medium-file=\"https:\/\/i1.wp.com\/bdtechtalks.com\/wp-content\/uploads\/2020\/11\/Naive-Bayes-rain-prediction.png?fit=300%2C27&amp;ssl=1\" data-large-file=\"https:\/\/i1.wp.com\/bdtechtalks.com\/wp-content\/uploads\/2020\/11\/Naive-Bayes-rain-prediction.png?fit=696%2C63&amp;ssl=1\" data-recalc-dims=\"1\" data-lazy-loaded=\"1\" data-lazy=\"true\" data-srcset=\"https:\/\/i1.wp.com\/bdtechtalks.com\/wp-content\/uploads\/2020\/11\/Naive-Bayes-rain-prediction.png?resize=1024%2C92&amp;ssl=1 1024w, 
https:\/\/i1.wp.com\/bdtechtalks.com\/wp-content\/uploads\/2020\/11\/Naive-Bayes-rain-prediction.png?resize=300%2C27&amp;ssl=1 300w, https:\/\/i1.wp.com\/bdtechtalks.com\/wp-content\/uploads\/2020\/11\/Naive-Bayes-rain-prediction.png?resize=768%2C69&amp;ssl=1 768w, https:\/\/i1.wp.com\/bdtechtalks.com\/wp-content\/uploads\/2020\/11\/Naive-Bayes-rain-prediction.png?resize=1536%2C138&amp;ssl=1 1536w, https:\/\/i1.wp.com\/bdtechtalks.com\/wp-content\/uploads\/2020\/11\/Naive-Bayes-rain-prediction.png?resize=696%2C62&amp;ssl=1 696w, https:\/\/i1.wp.com\/bdtechtalks.com\/wp-content\/uploads\/2020\/11\/Naive-Bayes-rain-prediction.png?resize=1068%2C96&amp;ssl=1 1068w, https:\/\/i1.wp.com\/bdtechtalks.com\/wp-content\/uploads\/2020\/11\/Naive-Bayes-rain-prediction.png?w=1784&amp;ssl=1 1784w, https:\/\/i1.wp.com\/bdtechtalks.com\/wp-content\/uploads\/2020\/11\/Naive-Bayes-rain-prediction.png?w=1392&amp;ssl=1 1392w\"><figcaption><a href=\"https:\/\/thenextweb.com\/neural\/2021\/01\/03\/how-ai-weeds-the-spam-out-of-our-inboxes-syndication\/#\" data-url=\"https:\/\/twitter.com\/intent\/tweet?url=https%3A%2F%2Fthenextweb.com%2Fneural%2F2021%2F01%2F03%2Fhow-ai-weeds-the-spam-out-of-our-inboxes-syndication%2F&amp;via=thenextweb&amp;related=thenextweb&amp;text=Check out this picture on: Na\u00efve Bayes is a very efficient and fast machine learning algorithm, which lent to its popularity in many fields.\" data-title=\"Share Na\u00efve Bayes is a very efficient and fast machine learning algorithm, which lent to its popularity in many fields. on Twitter\" data-width=\"685\" data-height=\"500\" class=\"post-image-share popitup\" title=\"Share Na\u00efve Bayes is a very efficient and fast machine learning algorithm, which lent to its popularity in many fields. 
on Twitter\"><i class=\"icon icon--inline icon--twitter--dark\"><\/i><\/a>Na\u00efve Bayes is a very efficient and fast machine learning algorithm, which has contributed to its popularity in many fields.<\/figcaption><\/figure>\n<\/p>\n<\/figure>\n<p>In the case of spam detection, things get a bit more complicated. Our target variable is whether a given email is \u201cspam\u201d or \u201cnot spam\u201d (also called \u201cham\u201d). The features are the words or word combinations found in the email\u2019s body. In a nutshell, we want to calculate the probability that an email message is spam based on its text.<\/p>\n<p>The catch here is that our features are not necessarily independent. For instance, consider the terms \u201cgrilled,\u201d \u201ccheese,\u201d and \u201csandwich.\u201d They can have separate meanings depending on whether they appear successively or in different parts of the message. Another example is the words \u201cnot\u201d and \u201cinteresting.\u201d In this case, the meaning can be completely different depending on where they appear in the message. 
But even though feature independence is complicated in text data, the na\u00efve Bayes classifier has proven to be efficient in<span>&nbsp;<\/span><a href=\"https:\/\/bdtechtalks.com\/2018\/02\/20\/ai-machine-learning-nlg-nlp\/\" target=\"_blank\" rel=\"nofollow noopener noreferrer\">natural language processing<\/a><span>&nbsp;<\/span>tasks if you configure it properly.<\/p>\n<figure class=\"wp-block-image size-large\"><img decoding=\"async\" loading=\"lazy\" class=\"wp-image-8891 jetpack-lazy-image jetpack-lazy-image--handled\" src=\"https:\/\/i2.wp.com\/bdtechtalks.com\/wp-content\/uploads\/2020\/11\/Naive-Bayes-spam-detection.png?resize=696%2C82&amp;ssl=1\" sizes=\"(max-width: 696px) 100vw, 696px\" srcset=\"https:\/\/i2.wp.com\/bdtechtalks.com\/wp-content\/uploads\/2020\/11\/Naive-Bayes-spam-detection.png?resize=1024%2C121&amp;ssl=1 1024w, https:\/\/i2.wp.com\/bdtechtalks.com\/wp-content\/uploads\/2020\/11\/Naive-Bayes-spam-detection.png?resize=300%2C35&amp;ssl=1 300w, https:\/\/i2.wp.com\/bdtechtalks.com\/wp-content\/uploads\/2020\/11\/Naive-Bayes-spam-detection.png?resize=768%2C91&amp;ssl=1 768w, https:\/\/i2.wp.com\/bdtechtalks.com\/wp-content\/uploads\/2020\/11\/Naive-Bayes-spam-detection.png?resize=1536%2C182&amp;ssl=1 1536w, https:\/\/i2.wp.com\/bdtechtalks.com\/wp-content\/uploads\/2020\/11\/Naive-Bayes-spam-detection.png?resize=696%2C82&amp;ssl=1 696w, https:\/\/i2.wp.com\/bdtechtalks.com\/wp-content\/uploads\/2020\/11\/Naive-Bayes-spam-detection.png?resize=1068%2C126&amp;ssl=1 1068w, https:\/\/i2.wp.com\/bdtechtalks.com\/wp-content\/uploads\/2020\/11\/Naive-Bayes-spam-detection.png?w=1606&amp;ssl=1 1606w, https:\/\/i2.wp.com\/bdtechtalks.com\/wp-content\/uploads\/2020\/11\/Naive-Bayes-spam-detection.png?w=1392&amp;ssl=1 1392w\" alt=\"Naive Bayes spam detection\" width=\"696\" height=\"82\" data-attachment-id=\"8891\" data-permalink=\"https:\/\/bdtechtalks.com\/2020\/11\/30\/machine-learning-spam-detection\/naive-bayes-spam-detection\/\" 
data-orig-file=\"https:\/\/i2.wp.com\/bdtechtalks.com\/wp-content\/uploads\/2020\/11\/Naive-Bayes-spam-detection.png?fit=1606%2C190&amp;ssl=1\" data-orig-size=\"1606,190\" data-comments-opened=\"1\" data-image-meta=\"{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}\" data-image-title=\"Naive Bayes spam detection\" data-image-description data-medium-file=\"https:\/\/i2.wp.com\/bdtechtalks.com\/wp-content\/uploads\/2020\/11\/Naive-Bayes-spam-detection.png?fit=300%2C35&amp;ssl=1\" data-large-file=\"https:\/\/i2.wp.com\/bdtechtalks.com\/wp-content\/uploads\/2020\/11\/Naive-Bayes-spam-detection.png?fit=696%2C82&amp;ssl=1\" data-recalc-dims=\"1\" data-lazy-loaded=\"1\"><\/figure>\n<h2>The data<\/h2>\n<p>Spam detection is a<span>&nbsp;<\/span><a href=\"https:\/\/bdtechtalks.com\/2020\/02\/10\/unsupervised-learning-vs-supervised-learning\/\" target=\"_blank\" rel=\"nofollow noopener noreferrer\">supervised machine learning<\/a><span>&nbsp;<\/span>problem. This means you must provide your machine learning model with a set of examples of spam and ham messages and let it find the relevant patterns that separate the two different categories.<\/p>\n<p>Most email providers have their own vast data sets of labeled emails. For instance, every time you flag an email as spam in your Gmail account, you\u2019re providing Google with training data for its machine learning algorithms. 
(Note: Google\u2019s spam detection algorithm is much more complicated than what we\u2019re examining here, and the company has mechanisms to prevent abuse of its \u201cReport Spam\u201d feature.)<\/p>\n<div class=\"wp-block-image\">\n<figure class=\"aligncenter size-large is-resized\"><img decoding=\"async\" loading=\"lazy\" class=\"wp-image-8892 jetpack-lazy-image jetpack-lazy-image--handled\" src=\"https:\/\/i2.wp.com\/bdtechtalks.com\/wp-content\/uploads\/2020\/11\/gmail-spam-report.png?resize=467%2C133&amp;ssl=1\" sizes=\"(max-width: 467px) 100vw, 467px\" srcset=\"https:\/\/i2.wp.com\/bdtechtalks.com\/wp-content\/uploads\/2020\/11\/gmail-spam-report.png?resize=1024%2C293&amp;ssl=1 1024w, https:\/\/i2.wp.com\/bdtechtalks.com\/wp-content\/uploads\/2020\/11\/gmail-spam-report.png?resize=300%2C86&amp;ssl=1 300w, https:\/\/i2.wp.com\/bdtechtalks.com\/wp-content\/uploads\/2020\/11\/gmail-spam-report.png?resize=768%2C219&amp;ssl=1 768w, https:\/\/i2.wp.com\/bdtechtalks.com\/wp-content\/uploads\/2020\/11\/gmail-spam-report.png?resize=696%2C199&amp;ssl=1 696w, https:\/\/i2.wp.com\/bdtechtalks.com\/wp-content\/uploads\/2020\/11\/gmail-spam-report.png?w=1036&amp;ssl=1 1036w\" alt=\"gmail spam report\" width=\"467\" height=\"133\" data-attachment-id=\"8892\" data-permalink=\"https:\/\/bdtechtalks.com\/2020\/11\/30\/machine-learning-spam-detection\/gmail-spam-report\/\" data-orig-file=\"https:\/\/i2.wp.com\/bdtechtalks.com\/wp-content\/uploads\/2020\/11\/gmail-spam-report.png?fit=1036%2C296&amp;ssl=1\" data-orig-size=\"1036,296\" data-comments-opened=\"1\" data-image-meta=\"{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}\" 
data-image-title=\"gmail spam report\" data-image-description data-medium-file=\"https:\/\/i2.wp.com\/bdtechtalks.com\/wp-content\/uploads\/2020\/11\/gmail-spam-report.png?fit=300%2C86&amp;ssl=1\" data-large-file=\"https:\/\/i2.wp.com\/bdtechtalks.com\/wp-content\/uploads\/2020\/11\/gmail-spam-report.png?fit=696%2C199&amp;ssl=1\" data-recalc-dims=\"1\" data-lazy-loaded=\"1\"><\/figure>\n<\/div>\n<p>There are some open-source data sets, such as the spambase data set of the University of California, Irvine, and the Enron spam data set. But these data sets are for educational and test purposes and aren\u2019t of much use in creating production-level machine learning models.<\/p>\n<p>Companies that host their own email servers can easily create specialized data sets that tune their machine learning models to the specific language of their line of work. For instance, the data set of a company that provides financial services will look much different from that of a construction company.<\/p>\n<h2>Training the machine learning model<\/h2>\n<figure class=\"wp-block-image size-large\"><img decoding=\"async\" loading=\"lazy\" class=\"wp-image-4227 jetpack-lazy-image jetpack-lazy-image--handled\" src=\"https:\/\/i1.wp.com\/bdtechtalks.com\/wp-content\/uploads\/2019\/01\/AI-robot-artificial-intelligence-natural-language-processing-nlp.jpg?resize=696%2C522&amp;ssl=1\" sizes=\"(max-width: 696px) 100vw, 696px\" srcset=\"https:\/\/i1.wp.com\/bdtechtalks.com\/wp-content\/uploads\/2019\/01\/AI-robot-artificial-intelligence-natural-language-processing-nlp.jpg?resize=1024%2C768&amp;ssl=1 1024w, https:\/\/i1.wp.com\/bdtechtalks.com\/wp-content\/uploads\/2019\/01\/AI-robot-artificial-intelligence-natural-language-processing-nlp.jpg?resize=300%2C225&amp;ssl=1 300w, https:\/\/i1.wp.com\/bdtechtalks.com\/wp-content\/uploads\/2019\/01\/AI-robot-artificial-intelligence-natural-language-processing-nlp.jpg?resize=768%2C576&amp;ssl=1 768w, 
https:\/\/i1.wp.com\/bdtechtalks.com\/wp-content\/uploads\/2019\/01\/AI-robot-artificial-intelligence-natural-language-processing-nlp.jpg?resize=80%2C60&amp;ssl=1 80w, https:\/\/i1.wp.com\/bdtechtalks.com\/wp-content\/uploads\/2019\/01\/AI-robot-artificial-intelligence-natural-language-processing-nlp.jpg?resize=265%2C198&amp;ssl=1 265w, https:\/\/i1.wp.com\/bdtechtalks.com\/wp-content\/uploads\/2019\/01\/AI-robot-artificial-intelligence-natural-language-processing-nlp.jpg?resize=696%2C522&amp;ssl=1 696w, https:\/\/i1.wp.com\/bdtechtalks.com\/wp-content\/uploads\/2019\/01\/AI-robot-artificial-intelligence-natural-language-processing-nlp.jpg?resize=1068%2C801&amp;ssl=1 1068w, https:\/\/i1.wp.com\/bdtechtalks.com\/wp-content\/uploads\/2019\/01\/AI-robot-artificial-intelligence-natural-language-processing-nlp.jpg?resize=560%2C420&amp;ssl=1 560w, https:\/\/i1.wp.com\/bdtechtalks.com\/wp-content\/uploads\/2019\/01\/AI-robot-artificial-intelligence-natural-language-processing-nlp.jpg?resize=1920%2C1440&amp;ssl=1 1920w, https:\/\/i1.wp.com\/bdtechtalks.com\/wp-content\/uploads\/2019\/01\/AI-robot-artificial-intelligence-natural-language-processing-nlp.jpg?w=1392&amp;ssl=1 1392w, https:\/\/i1.wp.com\/bdtechtalks.com\/wp-content\/uploads\/2019\/01\/AI-robot-artificial-intelligence-natural-language-processing-nlp.jpg?w=2088&amp;ssl=1 2088w\" alt=\"machine learning natural language processing\" width=\"696\" height=\"522\" data-attachment-id=\"4227\" data-permalink=\"https:\/\/bdtechtalks.com\/robot-learning-or-machine-learning\/\" data-orig-file=\"https:\/\/i1.wp.com\/bdtechtalks.com\/wp-content\/uploads\/2019\/01\/AI-robot-artificial-intelligence-natural-language-processing-nlp.jpg?fit=5000%2C3750&amp;ssl=1\" data-orig-size=\"5000,3750\" data-comments-opened=\"1\" data-image-meta=\"{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;3d rendering robot learning or machine learning with 
alphabets&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;robot learning or machine learning&quot;,&quot;orientation&quot;:&quot;1&quot;}\" data-image-title=\"machine learning natural language processing\" data-image-description=\"\n\n<p>machine learning natural language processing<\/p>\n<p> &#8221; data-medium-file=&#8221;https:\/\/i1.wp.com\/bdtechtalks.com\/wp-content\/uploads\/2019\/01\/AI-robot-artificial-intelligence-natural-language-processing-nlp.jpg?fit=300%2C225&amp;ssl=1&#8243; data-large-file=&#8221;https:\/\/i1.wp.com\/bdtechtalks.com\/wp-content\/uploads\/2019\/01\/AI-robot-artificial-intelligence-natural-language-processing-nlp.jpg?fit=696%2C522&amp;ssl=1&#8243; data-recalc-dims=&#8221;1&#8243; data-lazy-loaded=&#8221;1&#8243;><\/figure>\n<p>Although natural language processing has seen a lot of exciting advances in recent years,<span>&nbsp;<\/span><a href=\"https:\/\/bdtechtalks.com\/2018\/10\/22\/ai-deep-learning-human-language\/\" target=\"_blank\" rel=\"nofollow noopener noreferrer\">artificial intelligence algorithms still don\u2019t understand language<\/a><span>&nbsp;<\/span>in the way we do.<\/p>\n<p>Therefore, one of the key steps in developing a spam-detector machine learning model is preparing the data for statistical processing. Before training your na\u00efve Bayes classifier, the corpus of spam and ham emails must go through certain steps.<\/p>\n<p>Consider a data set containing the following sentences:<\/p>\n<p><em>Steve wants to buy grilled cheese sandwiches for the party<\/em><\/p>\n<p><em>Sally is grilling some chicken for dinner<\/em><\/p>\n<p><em>I bought some cream cheese for the cake<\/em><\/p>\n<p>Text data must be \u201ctokenized\u201d before being fed to machine learning algorithms, both when training your models and later when making predictions on new data. 
In essence, tokenization means splitting your text data into smaller parts. If you split the above data set by single words (also called unigrams), you\u2019ll have the following vocabulary. Note that I\u2019ve only included each word once.<\/p>\n<p><em>Steve, wants, to, buy, grilled, cheese, sandwiches, for, the, party, Sally, is, grilling, some, chicken, dinner, I, bought, cream, cake<\/em><\/p>\n<p>We can remove words that appear in both spam and ham emails and don\u2019t help in telling the difference between the two classes. These are called \u201c<a href=\"https:\/\/en.wikipedia.org\/wiki\/Stop_word\" target=\"_blank\" rel=\"nofollow noopener noreferrer\">stop words<\/a>\u201d and include terms such as<span>&nbsp;<\/span><em>the<\/em>,<span>&nbsp;<\/span><em>for<\/em>,<span>&nbsp;<\/span><em>is, to,<span>&nbsp;<\/span><\/em>and<span>&nbsp;<\/span><em>some<\/em>. In the above data set, removing stop words will reduce the size of our vocabulary by five words.<\/p>\n<p>We can also use other techniques such as<span>&nbsp;<\/span><a href=\"https:\/\/nlp.stanford.edu\/IR-book\/html\/htmledition\/stemming-and-lemmatization-1.html\" target=\"_blank\" rel=\"nofollow noopener noreferrer\">\u201cstemming\u201d and \u201clemmatization,\u201d<\/a><span>&nbsp;<\/span>which transform words to their base forms. For instance, in our example data set,<span>&nbsp;<\/span><em>buy<\/em><span>&nbsp;<\/span>and<span>&nbsp;<\/span><em>bought<\/em><span>&nbsp;<\/span>have a common root, as do<span>&nbsp;<\/span><em>grilled<\/em><span>&nbsp;<\/span>and<span>&nbsp;<\/span><em>grilling.<span>&nbsp;<\/span><\/em>Stemming and lemmatization can help further simplify our machine learning model.<\/p>\n<p>In some cases, you should consider using bigrams (two-word tokens), trigrams (three-word tokens), or larger n-grams. 
For instance, tokenizing the above data set in bigram form will give us terms such as \u201ccheese cake,\u201d and using trigrams will produce \u201cgrilled cheese sandwich.\u201d<\/p>\n<p>Once you\u2019ve processed your data, you\u2019ll have a list of terms that define the features of your machine learning model. Now you must determine which words or, if you\u2019re using n-grams, word sequences are relevant to each of your spam and ham classes.<\/p>\n<p>When you train your machine learning model on the training data set, each term is assigned a weight based on how many times it appears in spam and ham emails. For instance, if \u201cwin big money prize\u201d is one of your features and only appears in spam emails, then it will be given a larger probability of being spam. If \u201cimportant meeting\u201d is only mentioned in ham emails, then its inclusion in an email will increase the probability of that email being classified as not spam.<\/p>\n<p>Once you have processed the data and assigned the weights to the features, your machine learning model is ready to filter spam. When a new email comes in, the text is tokenized and run against the Bayes formula. Each term in the message body is multiplied by its weight, and the sum of the weights determines the probability that the email is spam. 
(In reality, the calculation is a bit more complicated, but to keep things simple, we\u2019ll stick to the sum of weights.)<\/p>\n<h2>Advanced spam detection with machine learning<\/h2>\n<figure class=\"wp-block-image size-large\"><img decoding=\"async\" loading=\"lazy\" class=\"wp-image-5312 jetpack-lazy-image jetpack-lazy-image--handled\" src=\"https:\/\/i0.wp.com\/bdtechtalks.com\/wp-content\/uploads\/2019\/08\/neural-networks-deep-learning-stochastic-gradient-descent.jpg?resize=696%2C464&amp;ssl=1\" sizes=\"(max-width: 696px) 100vw, 696px\" srcset=\"https:\/\/i0.wp.com\/bdtechtalks.com\/wp-content\/uploads\/2019\/08\/neural-networks-deep-learning-stochastic-gradient-descent.jpg?resize=1024%2C683&amp;ssl=1 1024w, https:\/\/i0.wp.com\/bdtechtalks.com\/wp-content\/uploads\/2019\/08\/neural-networks-deep-learning-stochastic-gradient-descent.jpg?resize=300%2C200&amp;ssl=1 300w, https:\/\/i0.wp.com\/bdtechtalks.com\/wp-content\/uploads\/2019\/08\/neural-networks-deep-learning-stochastic-gradient-descent.jpg?resize=768%2C512&amp;ssl=1 768w, https:\/\/i0.wp.com\/bdtechtalks.com\/wp-content\/uploads\/2019\/08\/neural-networks-deep-learning-stochastic-gradient-descent.jpg?resize=696%2C464&amp;ssl=1 696w, https:\/\/i0.wp.com\/bdtechtalks.com\/wp-content\/uploads\/2019\/08\/neural-networks-deep-learning-stochastic-gradient-descent.jpg?resize=1068%2C712&amp;ssl=1 1068w, https:\/\/i0.wp.com\/bdtechtalks.com\/wp-content\/uploads\/2019\/08\/neural-networks-deep-learning-stochastic-gradient-descent.jpg?resize=630%2C420&amp;ssl=1 630w, https:\/\/i0.wp.com\/bdtechtalks.com\/wp-content\/uploads\/2019\/08\/neural-networks-deep-learning-stochastic-gradient-descent.jpg?resize=1920%2C1280&amp;ssl=1 1920w, https:\/\/i0.wp.com\/bdtechtalks.com\/wp-content\/uploads\/2019\/08\/neural-networks-deep-learning-stochastic-gradient-descent.jpg?w=1392&amp;ssl=1 1392w, 
https:\/\/i0.wp.com\/bdtechtalks.com\/wp-content\/uploads\/2019\/08\/neural-networks-deep-learning-stochastic-gradient-descent.jpg?w=2088&amp;ssl=1 2088w\" alt=\"neural networks deep learning stochastic gradient descent\" width=\"696\" height=\"464\" data-attachment-id=\"5312\" data-permalink=\"https:\/\/bdtechtalks.com\/2019\/08\/20\/ai-adversarial-examples-hierarchical-random-switching\/neural-networks-deep-learning-stochastic-gradient-descent\/\" data-orig-file=\"https:\/\/i0.wp.com\/bdtechtalks.com\/wp-content\/uploads\/2019\/08\/neural-networks-deep-learning-stochastic-gradient-descent.jpg?fit=3600%2C2400&amp;ssl=1\" data-orig-size=\"3600,2400\" data-comments-opened=\"1\" data-image-meta=\"{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;1&quot;}\" data-image-title=\"neural networks deep learning stochastic gradient descent\" data-image-description=\"\n\n<p>neural networks deep learning stochastic gradient descent<\/p>\n<p> &#8221; data-medium-file=&#8221;https:\/\/i0.wp.com\/bdtechtalks.com\/wp-content\/uploads\/2019\/08\/neural-networks-deep-learning-stochastic-gradient-descent.jpg?fit=300%2C200&amp;ssl=1&#8243; data-large-file=&#8221;https:\/\/i0.wp.com\/bdtechtalks.com\/wp-content\/uploads\/2019\/08\/neural-networks-deep-learning-stochastic-gradient-descent.jpg?fit=696%2C464&amp;ssl=1&#8243; data-recalc-dims=&#8221;1&#8243; data-lazy-loaded=&#8221;1&#8243;><\/figure>\n<p>Simple as it sounds, the na\u00efve Bayes machine learning algorithm has proven to be effective for many text classification tasks, including spam detection.<\/p>\n<p>But this does not mean that it is perfect.<\/p>\n<p>Like other machine learning algorithms, 
na\u00efve Bayes<span>&nbsp;<\/span><a href=\"https:\/\/bdtechtalks.com\/2019\/10\/07\/rebooting-ai-gary-marcus-ernest-davis\/\" target=\"_blank\" rel=\"nofollow noopener noreferrer\">does not understand the context of language<\/a><span>&nbsp;<\/span>and relies on statistical relations between words to determine whether a piece of text belongs to a certain class. This means that, for instance, a na\u00efve Bayes spam detector can be fooled into overlooking a spam email if the sender just adds some non-spam words at the end of the message or replaces spammy terms with other closely related words.<\/p>\n<p>Na\u00efve Bayes is not the only machine learning algorithm that can detect spam. Other popular algorithms include<span>&nbsp;<\/span><a href=\"https:\/\/bdtechtalks.com\/2020\/06\/08\/what-is-recurrent-neural-network-rnn\/\" target=\"_blank\" rel=\"nofollow noopener noreferrer\">recurrent neural networks (RNN)<\/a><span>&nbsp;<\/span>and transformers, which are efficient at processing sequential data like email and text messages.<\/p>\n<p>A final thing to note is that spam detection is always a work in progress. As developers use AI and other technology to detect and filter out noisome messages from emails, spammers find new ways to game the system and get their junk past the filters. That is why email providers always rely on the help of users to improve and update their spam detectors.<\/p>\n<p><i><span>This article was originally published by&nbsp;<a class=\"author url fn\" title=\"Posts by Ben Dickson\" href=\"https:\/\/bdtechtalks.com\/author\/bendee983\/\" rel=\"nofollow noopener noreferrer\" target=\"_blank\">Ben Dickson<\/a> on <\/span><\/i><a href=\"https:\/\/bdtechtalks.com\/\" target=\"_blank\" rel=\"nofollow noopener noreferrer\"><i><span>TechTalks<\/span><\/i><\/a><i><span>, a publication that examines trends in technology, how they affect the way we live and do business, and the problems they solve. 
But we also discuss the evil side of technology, the darker implications of new tech and what we need to look out for. You can read the original article here. [LINK]<\/span><\/i><\/p>\n<p class=\"c-post-pubDate\"> Published January 3, 2021 \u2014 22:00 UTC <\/p>\n<p> <a href=\"https:\/\/thenextweb.com\/neural\/2021\/01\/03\/how-ai-weeds-the-spam-out-of-our-inboxes-syndication\/\">Source<\/a><\/p>\n","protected":false},"excerpt":{"rendered":"<p>Of more&nbsp;than 300 billion emails&nbsp;sent every day,&nbsp;at least half&nbsp;are spam. Email providers have the huge task of filtering out&nbsp;spam and making sure their users receive the messages that matter. Spam detection is&#8230;<\/p>\n","protected":false},"author":1,"featured_media":2064,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":[],"categories":[1],"tags":[],"_links":{"self":[{"href":"https:\/\/www.londonchiropracter.com\/index.php?rest_route=\/wp\/v2\/posts\/2063"}],"collection":[{"href":"https:\/\/www.londonchiropracter.com\/index.php?rest_route=\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.londonchiropracter.com\/index.php?rest_route=\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.londonchiropracter.com\/index.php?rest_route=\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/www.londonchiropracter.com\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=2063"}],"version-history":[{"count":0,"href":"https:\/\/www.londonchiropracter.com\/index.php?rest_route=\/wp\/v2\/posts\/2063\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.londonchiropracter.com\/index.php?rest_route=\/wp\/v2\/media\/2064"}],"wp:attachment":[{"href":"https:\/\/www.londonchiropracter.com\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=2063"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.londonchiropracter.com\/index.php?rest_route=%2Fwp%2Fv2%2Fcategories&post=2063"},{"taxonomy":"pos
t_tag","embeddable":true,"href":"https:\/\/www.londonchiropracter.com\/index.php?rest_route=%2Fwp%2Fv2%2Ftags&post=2063"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}