Typical anti spam checks

Today, let’s talk about spam checks: the mechanisms used to decide whether a message is spam. This is an overview of the common spam-checking algorithms that are out there.
So what types of mechanisms are out there?
Generally, even before a message comes in, somebody connects to you from another IP, so there are IP checks that you can perform. Then you can do reverse DNS checks on that same IP. After that, you can do message and header checks; some spam filters check the message and the headers together, and some check them separately. You can check whether the message is malformed, whether a header is malformed or fake, whether the message contains spam content, and so on. And if you didn’t catch anything in the previous steps, you can catch it with artificial intelligence checks. Those are filters that can be taught to distinguish spam from non-spam.
So let’s review IP checks…
IP checks are generally done with what are called RBLs and DNSBLs. So what are they?
RBL stands for Real-time Blackhole List, often read as Real-time Black List. Some call it a black list, some call it a block list; it doesn’t really matter. It’s just a list of IP addresses that, for one reason or another, are blocked by these particular services. There are a lot of services on the market that do this, such as SORBS and many others. The lists change dynamically over time, and most of them employ sophisticated robots to find out whether a particular server is an open relay, an open proxy, and so on. So they basically list all the open relays and all the proxies out there. The logic behind it is that if it’s an open relay, spammers will eventually find it and send spam through it.
DNSBL stands for DNS Block List, but the two are generally the same. They are just two different names for the same thing. Anytime you initiate a connection to a server, or somebody initiates a connection to your server, the receiving server checks the connecting IP against one or more real-time databases. That is almost always the case; I have yet to find a provider that does not do it. Different providers obviously use different real-time block lists, but most of them do it.
Each database has its own criteria for inclusion. Even within one and the same organization, SORBS for example, there are five or six different lists, and each covers different items. One list may contain open relays only, another open proxies only, and so on.
So how does an RBL operate?
Let’s say a source SMTP server connects to your server, and you note the IP address it connects from; in this example it’s 123.21.15.12. Now you want to formulate a request to a SORBS database. There are many of them, and there is also an aggregate SORBS list that includes all the lists.
Let’s say you want to ask, “Is this IP blacklisted or not?” You formulate a request to their zone name; the general zone name that includes all the lists is dnsbl.sorbs.net. You take the IP from the previous step (remember, it is 123.21.15.12) and reverse it to get 12.15.21.123, then append the zone name dnsbl.sorbs.net and send 12.15.21.123.dnsbl.sorbs.net as a regular DNS request. Since it is a DNS request, it goes to your local DNS. Your local DNS might know where dnsbl.sorbs.net is, but it will not know where 12.15.21.123.dnsbl.sorbs.net is, because it’s a compound name. It forwards the request to what is called the “authoritative name server”, which in this case is located at SORBS. That server checks its database to find out whether the IP is blacklisted and returns a status. For example, it sends back an IP address (typically in the 127.0.0.x range) whose last digit is 1, which means that this IP is blacklisted.
If the IP is not blacklisted, some implementations send back 0, while others return no record at all, so the name simply does not resolve. It really depends on the implementation. Also depending on the implementation, you can send an additional TXT request, which is a different type of DNS request, to find out the reason for the listing.
For example, if the server is blacklisted because it is an open relay, the TXT answer will say something like “this IP is an open relay”, or, more typically, it gives you a link to the SORBS website where you can find out in detail why that particular IP has been blacklisted.
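The lookup described above can be sketched in a few lines of Python. This is only a minimal sketch under some assumptions: it handles IPv4 only, it assumes a list that returns no record for an IP that is not listed, and the function names dnsbl_query_name and is_listed are made up for illustration. The zone dnsbl.sorbs.net is the one from the example.

```python
import ipaddress
import socket


def dnsbl_query_name(ip, zone="dnsbl.sorbs.net"):
    """Build the DNS name to query: the reversed IPv4 octets plus the zone."""
    # IPv4Address() raises ValueError for anything that is not a valid IPv4 address.
    octets = str(ipaddress.IPv4Address(ip)).split(".")
    return ".".join(reversed(octets)) + "." + zone


def is_listed(ip, zone="dnsbl.sorbs.net", resolve=socket.gethostbyname):
    """Return True if the list answers with an address for this IP.

    Assumption: the list returns no record when the IP is not listed.
    Caveat: any DNS failure also ends up in the except branch, so an
    outage of the list looks exactly like "not listed".
    """
    try:
        resolve(dnsbl_query_name(ip, zone))
        return True
    except socket.gaierror:
        return False


if __name__ == "__main__":
    print(dnsbl_query_name("123.21.15.12"))  # 12.15.21.123.dnsbl.sorbs.net
```

The resolve parameter is there so the logic can be tried without touching the network; by default it performs a regular DNS lookup through the local resolver, just as described above.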
Now let’s discuss the pros and cons of RBLs. There are definitely both.
PROS:
It is very quick to determine whether an IP is listed. The answer is always either yes or no.
It blocks about 80% of all the dumb spam when set up right. By dumb spam I mean this: there are lots and lots of spammers who are not really sophisticated. They just have an enormous database of a few billion email addresses, of which something like 99% are dead, and they spray and pray. They shoot these emails out by the millions, hoping that if you throw a ton of dirt against a wall, some of it will stick. But obviously, 99.999% of it is not going to stick; it is going to be rejected and never delivered.
They use all the open relays and all the proxies they can find. They know that 90% of those are blocked anyway, but they still do it, hoping that some administrator didn’t block one, that at least some emails will get delivered, and that some very tiny percentage of the recipients will buy the crappy product they are selling.
It also conserves a lot of traffic, because the rejection happens very early, before you have received the entire message. If we assume that a message is 200 KB, you will not receive the message, you will not receive the envelope, and you will not go through the SMTP exchange at all. You just get the IP address and reject right away.
It is fairly easy to set up. Most modern servers have facilities for making those DNS requests; you just specify which lists you want to use, and that’s about it.
Lastly, one thing that is both good and bad: the list is maintained elsewhere. You don’t maintain the list, so you don’t have to worry about it, but you also don’t control it. That makes it a double-edged sword.
CONS:
Some list managers are overzealous, and you have to watch out. They block left and right; they block everybody they can think of.
You entrust part of your email management policy to somebody else. You don’t control this part anymore.
If the list’s server fails, you may never know, because all that happens is that your DNS request fails.
You never know how good the lists are. There are some ways to find out, and tests are run on the big providers, but they are not extremely reliable and not very scientific.

Now we come to reverse DNS checks. What are they?

Again, remember, we noted the IP address of whoever connected to us. The big reason most checks happen at the IP level is to conserve traffic. Think of busy servers like those of Gmail or Yahoo.com: they receive millions and millions of emails per day. They have to be very conservative and work very quickly, otherwise the servers would just blow up. They don’t have the capacity to check every single message that comes in, especially since 90% of it is spam anyway. So all the IP checks are done first. When somebody connects, you note the IP, just like before. Then the SMTP exchange starts, and the SMTP protocol requires the client to say “HELO”. It is not “HELLO” with a double “L” but “HELO” with a single “L”. It is not a mistake; it is actually spelled this way. Or the client sends what is called extended hello, “EHLO”. Either way, the client is obligated to pass its domain name.

A lot of SMTP servers check what you passed as your “HELO” name and compare it to the IP you connected from. I already know your IP, because you are connected to me, and now you are passing me your domain name. So I can resolve this domain name and compare the result to the IP that I just got. If it matches, maybe I trust you more. If it does not match, I either don’t trust you or I trust you less. So you pass something like EHLO mail.company.com.
Now you have the IP address and the domain name, and you can resolve the domain name and see whether the IP address you get is the same as the IP of the connection. You can also go the other way.
You can resolve the IP into a domain name. This is called a reverse DNS lookup, sometimes a PTR request: you get the name from the IP address. Some very sophisticated servers do both and say, “OK, I convert the domain name to an IP and the IP to a domain name, and see whether both of them match.”
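Here is a rough Python sketch of those two comparisons, using only the standard socket module. The function names and the decision to return two separate booleans are assumptions made for illustration; how much trust to give or take away for each result is left to the server’s policy.

```python
import socket


def reverse_lookup(ip):
    """PTR request: IP address -> host name."""
    return socket.gethostbyaddr(ip)[0]


def forward_lookup(name):
    """Regular lookup: host name -> list of IPv4 addresses."""
    return socket.gethostbyname_ex(name)[2]


def check_client(ip, helo_name, forward=forward_lookup, reverse=reverse_lookup):
    """Return (helo_matches, ptr_matches) for a connecting client.

    helo_matches: the HELO/EHLO name resolves to the connecting IP.
    ptr_matches:  the PTR name of the IP resolves back to the same IP.
    """
    def resolves_to_ip(name):
        try:
            return ip in forward(name)
        except OSError:  # socket.gaierror and socket.herror are OSError subclasses
            return False

    # A trailing dot, as in "mail.company.com.", is dropped before resolving.
    helo_matches = resolves_to_ip(helo_name.rstrip("."))
    try:
        ptr_name = reverse(ip)
    except OSError:
        ptr_name = None
    ptr_matches = ptr_name is not None and resolves_to_ip(ptr_name)
    return helo_matches, ptr_matches
```

A server could, for example, trust a client more when both values are True and less when either is False. The forward and reverse parameters only exist so the logic can be exercised with canned answers instead of live DNS.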

Now we increase the complexity and come to checking the message itself. There are lots of spam checks that look at just the message, and there are also those that check the headers and the message together: the entire message with all its headers, like Subject, Date, and Received, including the headers appended by other servers along the way.

Inside the message itself, there are a lot of stop words that spammers use. A lottery scam would use stuff like “you won the lottery”, “become a millionaire”, or “get rich quick”. The stop words are hunted like crazy. There are also lists of header patterns that can indicate a fake mail client. Each mail client has a certain way of positioning its headers and parts inside the message. Let’s say you have Outlook: it would always encode the message so that there is first a plain-text portion stripped of HTML and any formatting like fonts, followed by a MIME-encoded HTML rich-text portion, and it is always like this. Spam checks definitely know it.

If they see a header that says “Yes, I am Outlook Express” but the position and order of the headers do not correspond, they will say, “OK, they say they are Outlook Express, but they don’t look like Outlook Express, so we will either reject the message completely or treat it as suspicious.”

Some filters block content like exe files or zip files, and some block attachments altogether. They say, “OK, we do not accept any attachments. You can send plain text or rich-text HTML. If you want to attach something, upload it to a website and send the link. Our mail servers will not pass any exe files.” Gmail does this and Yahoo does this: they don’t accept exe files and so on.
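To make the stop-word and attachment checks a bit more concrete, here is a small Python sketch built on the standard email module. The phrase list is just the lottery examples from above, the blocked extensions are the exe and zip cases mentioned, and the function name content_problems is made up; a real filter would use far larger lists and look at much more than this.

```python
import re
from email import message_from_string

# Illustrative lists only, taken from the examples in the text.
STOP_PHRASES = ["you won the lottery", "become a millionaire", "get rich quick"]
BLOCKED_EXTENSIONS = (".exe", ".zip")


def content_problems(raw_message):
    """Return a list of findings for one raw message (headers plus body)."""
    msg = message_from_string(raw_message)
    problems = []
    for part in msg.walk():
        filename = part.get_filename()
        if filename and filename.lower().endswith(BLOCKED_EXTENSIONS):
            problems.append("blocked attachment: " + filename)
        elif part.get_content_type() == "text/plain" and not filename:
            payload = part.get_payload(decode=True) or b""
            # Lower-case and collapse whitespace so a line break cannot hide a phrase.
            text = re.sub(r"\s+", " ", payload.decode("utf-8", "replace").lower())
            problems.extend("stop phrase: " + p for p in STOP_PHRASES if p in text)
    return problems
```

This sketch only reports what it finds. Whether a finding means outright rejection or just a more suspicious message is a policy decision for the filter.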

We will now proceed to AI filters, the most sophisticated checks.

By now we have already filtered out most of the dumb spam. I used to call it stupid spam, because anyone who actually hopes to deliver something this way is pretty stupid: he already knows that 99.9999% is going to be rejected. After this dumb spam has been rejected by the previous filters, we are left with far fewer messages. 99% of the messages were rejected by the previous filters, and now the AI filters come into play.

Filters that require training are usually AI filters, and AI (Artificial Intelligence) filters almost always require training. This means that the filters are trained by feeding them spam and ham content, that is, spam and not spam. Let us say a filter classifies a message as spam. You, as a human, look at the message and say “Yes, it is spam” or “No, it is not spam.” You classify the thing correctly. This is what training is all about.
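To give a feel for what “training” means in code, here is a toy Python sketch. Note that it uses a naive Bayes word counter, which is not one of the filter types discussed in this talk; I picked it only because it is about the simplest trainable classifier there is. The class and method names are made up for illustration.

```python
import math
import re
from collections import Counter


def _words(text):
    return re.findall(r"[a-z']+", text.lower())


class TrainableFilter:
    """Toy naive Bayes filter: learns word counts from labelled messages."""

    def __init__(self):
        self.words = {"spam": Counter(), "ham": Counter()}
        self.messages = {"spam": 0, "ham": 0}

    def train(self, text, label):
        """label is "spam" or "ham", as decided by a human."""
        self.words[label].update(_words(text))
        self.messages[label] += 1

    def spam_probability(self, text):
        vocab = len(set(self.words["spam"]) | set(self.words["ham"])) or 1
        total = sum(self.messages.values())
        scores = {}
        for label in ("spam", "ham"):
            # Log prior and log likelihoods, with add-one smoothing so that
            # a word never seen for this label does not zero everything out.
            score = math.log((self.messages[label] + 1) / (total + 2))
            count = sum(self.words[label].values())
            for word in _words(text):
                score += math.log((self.words[label][word] + 1) / (count + vocab))
            scores[label] = score
        # Clamp the difference so math.exp cannot overflow on long messages.
        diff = max(-700.0, min(700.0, scores["ham"] - scores["spam"]))
        return 1.0 / (1.0 + math.exp(diff))


if __name__ == "__main__":
    f = TrainableFilter()
    f.train("you won the lottery claim your prize", "spam")
    f.train("meeting moved to monday see agenda", "ham")
    print(f.spam_probability("claim your lottery prize"))
```

Every call to train plays the role of a user saying “this is spam” or “this is not spam”. With only a handful of messages the numbers mean very little; the filter has nothing to go on until it has seen a lot of labelled mail.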

There are many types of complex filters. Some of them depend on fuzzy logic, some on neural networks, and some are expert systems like model-based reasoning systems; very complex stuff. There are pros and cons to each. I will not expand on this subject, because there are so many different types and they are very complex.

The pros: it will be pretty accurate after a lot of training by users. That means this kind of filter is most useful in large user environments. It’s perfect for Gmail, for example.

Gmail has hundreds of millions of users, and at least some percentage of them press the built-in buttons like “this is spam” or “this is not spam”. Each time you push one of these buttons, you train the filter to be more and more sophisticated. Imagine the community effect of millions of users training this artificial intelligence filter. It is very useful in this type of environment, a large corporation or a big provider where every user is allowed to train the filter. This filter will catch the last 5% of sophisticated spam, the spam that got past the previous checks.

The cons: it requires expert setup. It is very complex; with some of these filters you can spend days and days setting them up. It also requires a lot of training by the users. It will not be very useful, or even accurate, if you use the email only occasionally and you are the only user of the email system with this particular filter.

If your training database is not good enough, I warn you, it is going to be very, very hard. It takes thousands of messages to train the filter correctly; without them it just won’t work. Most providers, when they start something new, never phase in filters like this immediately. They put in the buttons that say “this is spam” and “this is not spam”, and users classify their email, but the server does not react to it. It just accumulates the database until it reaches the critical mass needed to tell spam from non-spam accurately. Then, and only then, do they roll out the AI filters, and by then they are very accurate.

Here’s a summary of how most modern filters work.

Most modern filters are combinations of other filters. Almost none of them consist of a single filter; none of them include only the RBL, for example. Most of them include something else. They use the checks above as weights or scores. This is a scoring system. Each filter that the message passes through assigns a score, then all the scores are summed up, and at the end a decision is made based on whether the total is above a particular threshold. For example, if the message is above 50% probability of spam, it is rejected. If it is below 30%, it is accepted. If it is between 30% and 50%, it is sent to the spam box, like the spam box that you have in your regular email. Some messages are sent to the spam box, while some are rejected outright; most spam never even reaches the spam box.

Each of the filters adds or removes weight. Some of them actually remove weight. For example, if you have a list of contacts and the From address of the email is in it, the filter assigns a negative score, meaning you know this person, so this is probably legitimate correspondence. It does not say that the message is 100% not spam; it just assigns a large score with a negative value, and if the message is still fake, the positive scores will outweigh the negative one.

After all the checks have been executed, the system obtains the final number, on a scale such as 0 to 100 depending on the system, and makes its decision.
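The scoring scheme can be sketched in Python like this. The 30 and 50 thresholds are the ones from the example above; the individual filter names and weights are invented purely for illustration.

```python
REJECT_ABOVE = 50   # total above this: reject outright
SPAM_BOX_FROM = 30  # total from this value up to REJECT_ABOVE: spam box


def decide(scores):
    """Sum the per-filter scores and map the total to a decision.

    scores maps a filter name to the weight it adds (positive) or
    removes (negative), e.g. a sender found in the contact list.
    """
    total = sum(scores.values())
    if total > REJECT_ABOVE:
        return "reject"
    if total >= SPAM_BOX_FROM:
        return "spam box"
    return "accept"


if __name__ == "__main__":
    # Hypothetical weights: a listed IP and a stop phrase, offset by a known contact.
    print(decide({"rbl": 40, "stop_phrases": 25, "known_contact": -30}))  # spam box
```

Note how the negative weight for a known contact pulls a message down from “reject” to “spam box”, but a message that also trips a fake-header check would still end up rejected.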
