How Heuristic Scoring Works
A heuristic spam filter doesn't make a binary decision based on a single check. It runs a message through a battery of pattern tests and accumulates a score. Each test that triggers adds weight to the total. When the composite score crosses a configurable threshold, the message is classified as spam.
This matters because no single signal is reliable in isolation. Exclamation marks in a subject line appear in legitimate marketing email. Mentions of money are commonplace in business correspondence. An IP-based URL might appear in a network monitoring alert from your own systems. It's the combination of signals — the message that has an ALL CAPS subject, mentions a cash prize, uses an IP-based link, and has no plain-text part — that reliably identifies spam while minimizing false positives.
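The accumulation logic can be sketched in a few lines of Python. This is a minimal illustration, not Indition Spam Killer's actual code: the check names, the weights, the 5.0 threshold, and the dictionary fields describing a message are all assumptions made up for the example.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Check:
    name: str
    weight: float
    test: Callable[[dict], bool]


# Hypothetical checks and weights, chosen only to illustrate accumulation.
CHECKS = [
    Check("subject_all_caps", 1.5, lambda m: m["subject"].isupper()),
    Check("mentions_cash_prize", 1.0, lambda m: "cash prize" in m["body"].lower()),
    Check("ip_based_link", 2.0, lambda m: m["has_ip_url"]),
    Check("no_plain_text_part", 1.0, lambda m: not m["has_plain_text"]),
]

THRESHOLD = 5.0  # assumed value; the real threshold is configurable


def score_message(message: dict) -> tuple[float, list[str]]:
    """Return the composite score and the names of the checks that fired."""
    hits = [check for check in CHECKS if check.test(message)]
    return sum(check.weight for check in hits), [check.name for check in hits]


def is_spam(message: dict, threshold: float = THRESHOLD) -> bool:
    score, _ = score_message(message)
    return score >= threshold
```

With these assumed weights, a monitoring alert that contains an IP-based link scores 2.0 and passes, while a message that trips all four checks scores 5.5 and is classified as spam.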
Indition Spam Killer runs 89 heuristic checks across seven categories. Here's what each category is looking for.
1. Spam Phrases in Subject and Body
The most direct heuristic: does the message contain language statistically associated with spam? This category examines both the subject line and the full message body for phrases that appear overwhelmingly in spam and rarely in legitimate mail.
This isn't a simple keyword blocklist. Keyword filters are trivial to evade — spammers learned to write "Vi@gra" and "fr33 m0ney" decades ago. Modern phrase matching uses weighted patterns that account for context, word proximity, and frequency within the message. A single mention of "earn money" in the body of a business proposal scores very differently than a message whose entire body repeats variations on the phrase across multiple paragraphs.
The phrase database is partitioned by message zone: subject-only phrases carry different weights than body phrases. "You've been selected" in a subject line is a much stronger spam signal than the same phrase buried in a newsletter footer.
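A rough sketch of zone-dependent weighting follows. The phrases, weights, and repeat cap are invented for illustration, and the sketch covers only zone and frequency; it does not attempt the context or word-proximity analysis described above.

```python
import re

# Hypothetical weights: the same phrase scores higher in the subject zone.
PHRASE_WEIGHTS = {
    "subject": {"you've been selected": 2.5, "earn money": 1.5},
    "body": {"you've been selected": 0.5, "earn money": 0.4},
}
MAX_REPEATS = 5  # assumed cap so one repeated phrase cannot dominate the score


def normalize(text: str) -> str:
    """Lowercase, straighten curly apostrophes, and collapse whitespace."""
    text = text.lower().replace("\u2019", "'")
    return re.sub(r"\s+", " ", text)


def phrase_score(subject: str, body: str) -> float:
    total = 0.0
    for zone, text in (("subject", subject), ("body", body)):
        clean = normalize(text)
        for phrase, weight in PHRASE_WEIGHTS[zone].items():
            occurrences = clean.count(phrase)
            total += weight * min(occurrences, MAX_REPEATS)
    return total
```

Under these assumptions, "You've been selected" contributes 2.5 in a subject line but only 0.5 in a body, and a body that repeats "earn money" four times scores four times as much as one that mentions it once.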
2. Subject Manipulation
How a subject line is formatted tells you a lot about the intent behind it. Spammers know that the subject line is the first thing a recipient sees, and they've developed a toolkit of manipulation techniques to trigger urgency, greed, or fear.
Checks in this category include:
- ALL CAPS subjects or subjects with a high ratio of uppercase characters. Legitimate business email rarely uses all-caps subject lines; spam does it constantly to manufacture urgency.
- Excessive punctuation — multiple consecutive exclamation marks ("You've won!!!!!"), question marks used for emphasis, or punctuation patterns that wouldn't appear in normal writing.
- Dollar signs and currency symbols in the subject. "Save $500 TODAY" is a strong spam signal; a legitimate invoice subject might say "Invoice #1234 — $500.00" but this scores differently because of overall context.
- Excessive emoji. A single emoji in a marketing subject is normal. A subject line composed primarily of emoji characters indicates either spam or a very aggressive marketing sender who might warrant rate limiting.
- Misleading Re: or Fwd: prefixes. A message that begins with "Re:" but does not belong to any existing thread is a classic spam and phishing technique: it makes a cold message look like a reply in an ongoing conversation.
3. URL Tricks
URLs in spam messages are one of the highest-signal categories because spammers have limited options for getting a clickable link into a message — and their tricks leave recognizable patterns.
IP-based URLs are a strong signal. Legitimate web services are accessed via domain names; a link like http://203.0.113.44/landing/offer suggests the sender is hiding or rotating infrastructure. There's essentially no legitimate reason for a mass-mailed message to link to an IP address directly.
URL shorteners in bulk mail are suspicious. While shorteners have legitimate uses, using them in email hides the destination URL from both the recipient and spam filters. Messages that contain shortened URLs — particularly from free services — score higher.
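Punycode domains are a common phishing trick. Punycode is the ASCII encoding, recognizable by the xn-- prefix, that lets domain names carry non-ASCII characters; a link that renders as paypal.com to a human might actually be a punycode-encoded domain that swaps in a visually similar character from another script. Internationalized domains have legitimate uses, but a punycode domain that imitates a well-known brand is almost always an impersonation attack.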
Unusual ports in URLs are another signal. A link to http://example.com:8080/ is rare in email. Public-facing web services almost always run on the standard ports, 80 and 443; a non-standard port suggests the sender is running their own infrastructure outside normal hosting channels.
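All four URL signals can be checked with the Python standard library. In this sketch the shortener list is a small illustrative sample, the flag names are made up, and the punycode check only looks for the xn-- label prefix without judging whether the domain imitates anything.

```python
import ipaddress
from urllib.parse import urlsplit

# Illustrative sample only; a real list would be much longer.
SHORTENER_HOSTS = {"bit.ly", "tinyurl.com", "t.co"}
STANDARD_PORTS = {None, 80, 443}  # None means no explicit port in the URL


def url_flags(url: str) -> list[str]:
    """Return the names of the URL heuristics that fire for one link."""
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()
    flags = []

    try:
        ipaddress.ip_address(host)  # raises ValueError for domain names
        flags.append("ip_based_url")
    except ValueError:
        pass

    if host in SHORTENER_HOSTS:
        flags.append("url_shortener")

    # Punycode-encoded labels always start with "xn--".
    if any(label.startswith("xn--") for label in host.split(".")):
        flags.append("punycode_domain")

    try:
        port = parts.port
    except ValueError:  # malformed port; treat it as unusual
        port = -1
    if port not in STANDARD_PORTS:
        flags.append("unusual_port")

    return flags
```

For example, `url_flags("http://203.0.113.44/landing/offer")` reports the IP-based link, and `url_flags("http://example.com:8080/")` reports the unusual port.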
4. Authentication and Header Tricks
Email has multiple layers of sender identity: the envelope sender (used for bounces), the From header (what the user sees), the Reply-To header (where replies go), and the domain in the message body. Mismatches between these layers are strong spam and phishing signals.
A From/envelope mismatch means the domain shown in the From header doesn't match the domain used for the SMTP envelope. Legitimate bulk senders sometimes have subtle differences here (e.g., sending from a subdomain), but stark mismatches — a message claiming to be from a major bank but enveloped from a generic hosting domain — score heavily.
A Reply-To mismatch is particularly important for phishing detection. A message that displays From: security@yourbank.com but has Reply-To: accounts@random-domain.net is almost certainly a phishing attempt. The attacker wants you to see a trusted sender but have your reply go to their collection address.
5. Display Tricks: Hidden Text and Visual Manipulation
HTML email gives spammers tools to manipulate what a human sees versus what a content filter sees. This category catches attempts to hide content from filters while showing something different to the recipient — or attempts to hide content from the recipient while polluting filter statistics.
Hidden text is the classic evasion: white text on a white background, text styled with display:none, or text positioned off-screen with absolute positioning. Spammers inject blocks of random legitimate-looking text into messages to make them appear content-rich and throw off Bayesian classifiers. The HTML parser detects these structures directly.
Tiny fonts — CSS font sizes below 5px — serve the same purpose. The text is in the document but invisible to the human reader, intended only to influence content-based filters.
Image-only messages with no substantive text are a different kind of display manipulation: putting all the "content" in a single image so there's nothing for a text-based filter to analyze. This technique scores heavily on its own because nearly all legitimate messages contain readable text.
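A partial sketch of these display checks, using Python's built-in HTML parser, is shown below. It inspects inline style attributes only, so it catches display:none, visibility:hidden, and sub-5px fonts, but not white-on-white text, off-screen positioning, or rules defined in a stylesheet. It also assumes reasonably well-nested markup. The class and flag names are invented for the example.

```python
import re
from html.parser import HTMLParser

HIDDEN_STYLE = re.compile(r"display\s*:\s*none|visibility\s*:\s*hidden", re.I)
FONT_SIZE = re.compile(r"font-size\s*:\s*(\d+(?:\.\d+)?)px", re.I)
MIN_FONT_PX = 5.0  # the tiny-font cutoff described above
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class HiddenTextFinder(HTMLParser):
    def __init__(self):
        super().__init__()
        self.stack = []  # one bool per open element: is its text hidden?
        self.hidden_chars = 0
        self.visible_chars = 0
        self.image_count = 0

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            self.image_count += 1
        if tag in VOID_TAGS:
            return  # void elements never contain text
        style = dict(attrs).get("style") or ""
        size = FONT_SIZE.search(style)
        tiny = size is not None and float(size.group(1)) < MIN_FONT_PX
        hidden = bool(HIDDEN_STYLE.search(style)) or tiny
        inherited = self.stack[-1] if self.stack else False
        self.stack.append(hidden or inherited)

    def handle_endtag(self, tag):
        if tag not in VOID_TAGS and self.stack:
            self.stack.pop()

    def handle_data(self, data):
        length = len(data.strip())
        if self.stack and self.stack[-1]:
            self.hidden_chars += length
        else:
            self.visible_chars += length


def display_flags(html: str) -> list[str]:
    finder = HiddenTextFinder()
    finder.feed(html)
    finder.close()
    flags = []
    if finder.hidden_chars:
        flags.append("hidden_text")
    if finder.image_count and finder.visible_chars == 0:
        flags.append("image_only")
    return flags
```

Hidden state is inherited down the element stack, so text nested inside a hidden container is counted as hidden even when its own element carries no style.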
6. Content Structure Anomalies
Legitimate email follows predictable structural conventions. Deviations from these conventions are themselves signals, independent of any content within the message.
Well-formed HTML messages usually carry both a text/plain and a text/html part, packaged together as multipart/alternative. HTML-only messages with no plain-text alternative are a common spam pattern: spam tools often generate only HTML because it allows more visual manipulation, while mainstream email clients generate both parts automatically when composing HTML mail. A missing text/plain part adds to the composite score.
Image-to-text ratio is another structural check. A message that is 90% image content and 10% text is suspicious even if both parts are present. Legitimate newsletters and marketing emails contain substantive text alongside imagery; messages designed to evade text analysis minimize readable content.
Encoding anomalies — unusual character encodings, nested encoding layers, or encoding schemes applied selectively to portions of the message — are structural red flags indicating deliberate obfuscation.
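The missing plain-text check is the easiest of these to demonstrate with Python's standard email package. The sketch below covers only that one check; the image-to-text ratio and encoding anomalies are left out, and the function and flag names are assumptions for the example.

```python
from email.message import EmailMessage, Message


def structure_flags(msg: Message) -> list[str]:
    """Flag HTML mail that ships without a text/plain alternative."""
    content_types = [part.get_content_type() for part in msg.walk()]
    flags = []
    if "text/html" in content_types and "text/plain" not in content_types:
        flags.append("html_without_plain_text")
    return flags


# Two small messages to show the difference.
html_only = EmailMessage()
html_only["Subject"] = "Limited offer"
html_only.set_content("<p>Limited offer</p>", subtype="html")

both_parts = EmailMessage()
both_parts["Subject"] = "Limited offer"
both_parts.set_content("Limited offer")  # text/plain part
both_parts.add_alternative("<p>Limited offer</p>", subtype="html")

print(structure_flags(html_only))
print(structure_flags(both_parts))
```

The first message is flagged because its only part is text/html. The second is a multipart/alternative message with both parts and raises nothing.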
7. Dangerous Content: Active Code in Email
The final category covers content that is dangerous regardless of spam score: active code elements that should never appear in legitimate email.
JavaScript in email is flagged unconditionally. Modern email clients refuse to execute scripts, but some older clients and webmail implementations have had vulnerabilities that let them run. A message containing <script> tags is not legitimate marketing mail; it is a malware delivery attempt or an exploit probe.
HTML forms embedded in email — <form> elements with action attributes — are used in credential harvesting attacks. A message containing an embedded login form is almost certainly a phishing attempt; legitimate services direct users to their websites to log in, they don't embed the login form in the email body.
Base64-encoded URLs in HTML attributes are an evasion technique: the attacker encodes their malicious link in base64 hoping that content filters will fail to decode and check it. The HTML parser decodes all attributes before checking, catching this reliably.
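These three checks can be sketched together with Python's built-in HTML parser. This is an illustration under simplifying assumptions, not the product's parser: it only treats an attribute as a hidden link when the value, or the portion after a "base64," marker, decodes cleanly to something starting with http:// or https://. The names are hypothetical.

```python
import base64
import re
from html.parser import HTMLParser

URL_PREFIX = re.compile(rb"^\s*https?://", re.I)


def hides_url(value: str) -> bool:
    """True if the attribute value is base64 that decodes to an http(s) URL."""
    # Handle both raw base64 and data: URIs such as "data:text/html;base64,...".
    candidate = value.rpartition("base64,")[2].strip()
    try:
        decoded = base64.b64decode(candidate, validate=True)
    except ValueError:  # not valid base64 (binascii.Error is a ValueError)
        return False
    return bool(URL_PREFIX.match(decoded))


class ActiveContentFinder(HTMLParser):
    def __init__(self):
        super().__init__()
        self.flags = set()

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self.flags.add("script_tag")
        if tag == "form" and any(name == "action" for name, _ in attrs):
            self.flags.add("form_with_action")
        if any(value and hides_url(value) for _, value in attrs):
            self.flags.add("base64_encoded_url")


def dangerous_flags(html: str) -> list[str]:
    finder = ActiveContentFinder()
    finder.feed(html)
    finder.close()
    return sorted(finder.flags)
```

An ordinary URL in an attribute is never mistaken for base64 here, because characters such as ":" fall outside the base64 alphabet and strict decoding rejects them.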
Understanding what these 89 checks are actually looking for transforms spam filtering from a black box into a transparent system you can reason about. When a message gets flagged, the score breakdown tells you exactly which patterns it matched. When a legitimate message is caught, you understand why and can tune accordingly. That transparency is what distinguishes a configurable heuristic filter from a vendor's opaque cloud model.