Re: Is this a good idea?

spamass-milt-list

From:	Nate Schindler
Subject:	Re: Is this a good idea?
Date:	Tue, 28 Sep 2004 14:31:39 -0700


> -----Original Message-----
> From: address@hidden
> [mailto:address@hidden
> Behalf Of Thomas Cameron
> Sent: Tuesday, September 28, 2004 1:59 PM
> To: address@hidden
> Subject: Re: Is this a good idea?
> 
> 
> Questions and answers inline, below:
> 
> > ----- Original Message -----
> > From: "Nate Schindler" <address@hidden>
> > To: <address@hidden>
> > Sent: Tuesday, September 28, 2004 12:34 PM
> > Subject: RE: Is this a good idea?
> >
> > Sendmail is using a fairly generic configuration.  There are no local
> > mail users on that machine, except for a spamtrap account that feeds
> > mail directly back into 'sa-learn --spam'.
> 
> But doesn't SA need at least 200 hams and 200 spams before Bayes kicks in?
> And don't you really want to have about the same number of spams and hams
> to train Bayes with?

Yes.  The minimum is configurable, but I'd recommend against changing the 
default of 200.  SA will learn those first 400 messages (200 ham and 200 spam) 
on its own if you let it: autolearning still works, but it won't apply Bayes 
scores until it has both sets.  If you have some ham/spam that you can feed it 
manually, that'd be a better start.
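
To make the rule concrete, here's a minimal Python sketch of the gate as 
described above.  It is not SpamAssassin's code; the function and constant 
names are made up, and the only thing taken from this thread is the default 
of 200 for each corpus.

```python
# Hypothetical names throughout; this mirrors the rule described in the
# thread, not SpamAssassin's actual implementation.
MIN_HAM = 200   # default minimum number of learned hams
MIN_SPAM = 200  # default minimum number of learned spams

def bayes_is_active(ham_learned, spam_learned,
                    min_ham=MIN_HAM, min_spam=MIN_SPAM):
    """Return True once BOTH corpora have reached their minimums."""
    return ham_learned >= min_ham and spam_learned >= min_spam
```

Under this reading, 350 spams and 50 hams would not be enough even though the 
total is 400, which is one more reason to hand-feed whichever side is short.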

> 
> > SpamAss-Milter is configured to block mail when SA tells it to.
> 
> Same here.  We tag as spam at 4 and reject at 6.  I *love* that feature.
> 
> > It's configured to pass the mail alias (what's before the @ sign) as the
> > user to spamc so that custom thresholds can be looked up by spamd.
> > SpamAssassin is set up to take the alias from SpamAss-Milter, and check
> > it against a MySQL userpref table to see if there are any custom rules
> > for this particular user.
> 
> Did you set up the MySQL stuff?  I am not a DBA so this might not be easy
> for me.  Also, how do your customers modify their settings in the DB?  Do
> you have to do it for them?

Yes, and I'm not a DBA either.  MySQL is very easy to use, especially with such 
a simple database.  It's like 1 table with 5 or so fields.
We don't have "customers" per se.  I'm an admin for a single corporation, so I 
make the changes myself, but there are PHP scripts that people have contributed 
so that what you're asking for can be done from a straightforward, web-based 
interface.  I played with it, but had no use for it, so I never fully 
implemented that part of it.
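
To show how little there is to the database side, here's a rough Python sketch 
of a per-user threshold lookup.  It uses an in-memory SQLite database purely 
so it runs anywhere; the real setup is MySQL queried by spamd itself.  The 
column names, the "@GLOBAL" fallback row, the "ceo" alias, and the default of 
6 are all assumptions for illustration.  Only the table name, the 
"required_hits" preference, and the CEO's value of 100 come from this thread.

```python
import sqlite3

# Assumed layout: one row per (username, preference) pair.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE userpref ("
    " prefid INTEGER PRIMARY KEY,"
    " username TEXT NOT NULL,"
    " preference TEXT NOT NULL,"
    " value TEXT NOT NULL)"
)
conn.executemany(
    "INSERT INTO userpref (username, preference, value) VALUES (?, ?, ?)",
    [
        ("@GLOBAL", "required_hits", "6"),  # placeholder site-wide default
        ("ceo", "required_hits", "100"),    # effectively never reject
    ],
)

def required_hits(conn, alias):
    """Return the alias's threshold, falling back to the global row."""
    row = conn.execute(
        "SELECT value FROM userpref"
        " WHERE preference = 'required_hits'"
        " AND username IN (?, '@GLOBAL')"
        " ORDER BY username = '@GLOBAL'"  # per-user row sorts first
        " LIMIT 1",
        (alias,),
    ).fetchone()
    return float(row[0])
```

Changing someone's threshold is then a one-row INSERT or UPDATE, which is all 
the contributed web front ends would need to do.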

> 
> > Our CEO, for example, wants no mail sent to
> > him blocked... so in the userpref table, he's got a "required_hits"
> > entry of 100.  This is convenient, because SpamAssassin and SpamAss-Milt
> > still properly tag the message with X-Spam-Level.  The CEO has Outlook
> > configured to move messages with more than 5 stars to a Spam folder in
> > his mailbox.  So, for people like him, it still separates spam from ham.
> > For everybody else, it just rejects the spam.
> > I also have custom rules defined for SpamAssassin to read the MessageWall
> > score, and adjust its own score according to MessageWall's suggestion.
> > As far as what you were saying about copying spam and ham to separate
> > mailboxes for learning purposes, the bayes_auto_learn option of
> > SpamAssassin facilitates this.
> 
> Right, that's why I was not sure it was a good idea at all.  I guess what I
> really need is a way for the users to somehow forward false negatives and
> positives back to the relay server.

We have spamtrap and hamtrap aliases, which can be used for this purpose.  You 
have to tell Bayes to ignore certain headers to keep things from getting 
screwed up.  For example, if a user forwards a piece of spam to a spamtrap, 
it's now FROM that user.  You don't want that user's e-mail address to be 
scored as spammy, so you have to tell Bayes to ignore the From field.  I think 
you may be able to set that for just the spamtrap, and not globally.
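
The effect of ignoring a header can be pictured with a small Python sketch.  
This is not SpamAssassin's tokenizer; header_tokens and IGNORED_HEADERS are 
made-up names, and the "header:word" token format is invented for 
illustration.

```python
from email import message_from_string

# Hypothetical ignore list: headers that should contribute no tokens.
IGNORED_HEADERS = {"from"}

def header_tokens(raw_message, ignored=IGNORED_HEADERS):
    """Return simple 'header:word' tokens, skipping ignored headers."""
    msg = message_from_string(raw_message)
    tokens = []
    for name, value in msg.items():
        if name.lower() in ignored:
            continue  # e.g. the forwarding user's own address
        tokens.extend(
            f"{name.lower()}:{word.lower()}" for word in value.split()
        )
    return tokens

# A forwarded spam: the From line now belongs to the innocent user.
forwarded = (
    "From: user@example.com\n"
    "Subject: Cheap pills\n"
    "\n"
    "spam body here\n"
)
```

With From in the ignore list, the forwarding user's address never becomes a 
token, so it can't pick up a spammy score from the forwarded message.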

> 
> > The only problem with it is that you don't have the original messages
> > which trained the database.... however the concept of how it works is the
> > same as you described - exceptionally spammy messages are automatically
> > learned as spam, and exceptionally hammy messages are learned as ham.
> > The scores used to make the decision of whether or not SpamAssassin
> > should learn a message are configurable.
> 
> Yeah, I guess I could start tweaking those values to get the same results.
> 
> > After this, final delivery finally takes place.
> > In Exchange, I have a couple public folders set up - Spam, and Ham.  Users
> > know that if they receive a false negative, they can copy it to the Spam
> > folder, and I use it to train the filter periodically.
> 
> But doesn't the extra header info that Exchange adds screw things up?

It doesn't seem to.  Not being a master of how the SpamAssassin team 
implemented the Bayesian classifier, I can't give you an authoritative answer 
on how header information is treated, but it doesn't seem to matter.

Here's an example:
A user (our webmaster) screws up and puts something from his personal Spam 
folder into the public Spam folder, not realizing that he didn't need to.
I don't notice the mistake, and I give the message to sa-learn.  
sa-learn will probably say "Learned from 0 message(s) (1 message(s) examined)."
That tells me that although the message had passed through the Exchange server, 
and extra headers and such were added, SA still recognized the message as 
something it had already learned.
I think it checksums the body or something.  I'm pretty sure that's also how 
Pyzor/Razor/DCC work.  Headers change from server to server, so they had to be 
more specific about identifying a message as unique.
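
Here's a quick Python sketch of that guess, just to show why a body checksum 
would survive a trip through another server.  It illustrates the idea only: I 
am not claiming this is how SA, Pyzor, Razor, or DCC actually identify 
messages, and body_fingerprint, learn_once, and the sample messages are all 
made up.

```python
import hashlib

def body_fingerprint(raw_message):
    """Hash only the body, so added or rewritten headers don't matter."""
    text = raw_message.replace("\r\n", "\n")
    # Headers end at the first blank line; everything after is the body.
    _headers, _sep, body = text.partition("\n\n")
    return hashlib.sha1(body.strip().encode("utf-8")).hexdigest()

seen = set()  # fingerprints of messages already learned

def learn_once(raw_message):
    """Return True if the message is new, False if it was seen before."""
    fingerprint = body_fingerprint(raw_message)
    if fingerprint in seen:
        return False
    seen.add(fingerprint)
    return True

original = "From: a@example.com\nSubject: hi\n\nBuy now!\n"
# The same message after a relay has prepended its own headers.
relayed = "Received: by exchange.example.com\nX-Extra: 1\n" + original
```

Feeding original and then relayed to learn_once accepts the first and skips 
the second, which matches the "Learned from 0 message(s)" behaviour I 
described, whatever SA really does internally.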
As far as shooting yourself in the foot by telling it to learn messages that 
have your server's header information, it doesn't seem to matter.  I'm not sure 
if stuff like the server name in the header is cancelled out by having both ham 
and spam in the Bayes system which reference it, or if Bayes just ignores it 
entirely.

I guess all I can say for sure is that I don't know how they implemented Bayes, 
but it seems well-behaved.

> 
> > That's my story, and I'm sticking to it.  If any portion of this chain of
> > stuff seems interesting to you, I can show you how it's configured.
> 
> It is very interesting but I think I will do things a bit differently.  Not
> sure just how yet, but I will get back to the list when I figure it out.

Good luck, not that you'll need it. ;)
Let us know how it goes.

> 
> Thanks!
> TC 
> 
> 
> 

N8

> _______________________________________________
> Spamass-milt-list mailing list
> address@hidden
> http://lists.nongnu.org/mailman/listinfo/spamass-milt-list
>
