bsf-devel

Re: address@hidden: Evaluating bsf as a GNU project]


From: Cristian Gutierrez
Subject: Re: address@hidden: Evaluating bsf as a GNU project]
Date: Thu, 3 Jul 2003 15:14:33 -0400
User-agent: Mutt/1.4i

Error log for Alvaro Herrera; dumped on Thu, Jul 03, 2003 at 01:59:18AM -0400:
> On Thu, Jul 03, 2003 at 12:17:28AM -0400, Cristian Gutierrez wrote:
> > Error log for Alvaro Herrera; dumped on Wed, Jul 02, 2003 at 04:31:35PM -0400:
> 
> > I agree. I don't even believe that it's going to be approved in
> > its current state... although I think they're actually more committed to
> > checking the disclaimer, licenses of required packages and other legal
> > stuff rather than performing a complete QA process... ;-)
> 
> Sure.  What do we depend on?  Perl is already GPL-compatible, so we need
> not worry about that.  I think Aldrin's modified version used
> BerkeleyDB; it should be pretty trivial to change it to GDBM.
> 
> What we urgently need is a way to retrain the score table incrementally.
> Maybe we should find a way to do it easily within the MUA -- it's very
> easy to do with mutt, don't know about other MUAs.

(That's unfair... almost anything is very easy to do with Mutt! ;-)

I started a little script that reads two score files, so we could later
add some more code to generate a "unified" version of the score
files. A good algorithm for this (perhaps with a nice theoretical
background... Aldrin?) is really missing.
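
To make the "unified" idea concrete, here is a minimal sketch, written in
Python for brevity rather than in bsf's Perl. It assumes (hypothetically;
the real bsf score file format may well differ) that a score file is plain
text with one "token good_count bad_count" line per token. The merge simply
adds the raw counts of both tables, which is the obvious naive approach and
not the theoretically grounded algorithm that is still missing. All function
names are made up for illustration.

```python
from collections import defaultdict


def parse_scores(text):
    """Parse lines of the ASSUMED form 'token good_count bad_count'."""
    table = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        token, good, bad = line.split()
        table[token] = (int(good), int(bad))
    return table


def merge_scores(a, b):
    """Naive merge: add the per-token counts of both tables."""
    merged = defaultdict(lambda: (0, 0))
    for table in (a, b):
        for token, (good, bad) in table.items():
            g, s = merged[token]
            merged[token] = (g + good, s + bad)
    return dict(merged)


# Two tiny made-up tables, only to show the call sequence.
mine = parse_scores("free 2 40\nmutt 30 0\n")
yours = parse_scores("free 1 25\nperl 12 1\n")
unified = merge_scores(mine, yours)
```

Adding raw counts keeps tokens that appear in only one table and lets the
probabilities be recomputed afterwards from the combined counts; merging
already-computed probabilities would need the corpus sizes as well.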

> Well, I usually get no false negatives in a day or two, followed by five
> or six the next day :-)  But I get at most one false positive a week.  I
> keep no whitelists however (all my mail is routed through BSF's
> scoretable).

That's something I avoided... 'cause I have enough information to
correctly classify the mailing-list traffic. After all, I had to
code some recipes to put the lists in different folders, so just placing
those recipes before bsf's own recipe did the job. I'm using no
other 'whitelist' for non-mailing-list email, though.

> > The one that made it through the filter uses an html-comment obfuscation
> > technique, so their triggering words are effectively masked. See an
> > example:
> 
> Yeah, I've seen that.  With my score table it was correctly classified
> however.

Mmmmh... that's rather beyond my understanding... what's the threshold
score you have been using? (mine is 90 and most of this obfuscated spam
gets about 40, or even less sometimes).

> > So I guess we fell short with this implementation (dunno about Aldrin's
> > improved one). As you may already guess, this is easily fixed using an
> > actual text-only HTML renderer (rather than removing the HTML part and
> > hoping there is a text-only one), like "lynx -dump". But we also know
> > that it's certainly too heavy a dependency on another piece of
> > software... so, what Perl modules are we left with? HTML::Parser or
> > something similar?
> 
> Looks like some research and experimentation is in order.  Yes,
> HTML::Parser would probably be fine, though it _will_ be slow.  I'm less
> sure about what to do with attachments.  Maybe we should take a few
> attributes and classify that (filename, probably extension, content
> type, what does file(1) report on it), ignoring the rest of the file.
> It's of no use anyway.

That's a really nice idea, even something I had in mind a long time
ago. It could even evolve into a [naive] virus scanner!
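
A minimal sketch of the "classify a few attributes, ignore the rest" idea,
in Python rather than bsf's Perl, purely for illustration. It walks the MIME
tree with the standard email package and turns each named attachment into a
few tokens (content type, filename, extension). The "att-..." token prefixes
and the function name are invented here, and the file(1) check is left out.

```python
import os
from email.message import EmailMessage


def attachment_tokens(msg):
    """Return a few descriptive tokens per named attachment.

    The attachment body itself is never looked at.
    """
    tokens = []
    for part in msg.walk():
        filename = part.get_filename()
        if filename is None:
            continue  # container or inline part without a filename
        ext = os.path.splitext(filename)[1].lower()
        tokens.append("att-type:" + part.get_content_type())
        tokens.append("att-name:" + filename.lower())
        if ext:
            tokens.append("att-ext:" + ext)
    return tokens


# A made-up message with one suspicious attachment.
sample = EmailMessage()
sample.set_content("See the attached invoice.")
sample.add_attachment(b"MZ\x90\x00", maintype="application",
                      subtype="octet-stream", filename="invoice.pdf.exe")
tokens = attachment_tokens(sample)
```

These tokens could then go through the same score table as ordinary words,
so that an extension like ".exe" earns its own spam probability.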

How are MIME parts and HTML content handled by both "::Parser"s? Could
we reach a more general solution to handle *all* combinations of
txt/html/binary attachments with rather simple code? (see the tree view
of spam messages with 'v' in Mutt). I think that is my biggest nightmare
(don't even think of unparseable messages composed with lousy MUAs,
breaking more standards than some 2 & 3-letter companies ;-).
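
A rough sketch of the "walk every part" approach, again in Python rather
than Perl and only as an illustration of the shape such code could take. It
visits every leaf of the MIME tree, keeps text/plain as is, and runs
text/html through the standard library's HTML parser, keeping only character
data. Comments are dropped, so a word split by the html-comment trick is
glued back together. The class and function names are hypothetical, and the
sketch makes no attempt at handling badly broken messages or skipping
script/style content.

```python
from email.message import EmailMessage
from html.parser import HTMLParser


class TextOnly(HTMLParser):
    """Collect character data; tags and comments are silently dropped."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def html_to_text(html):
    parser = TextOnly()
    parser.feed(html)
    parser.close()
    return "".join(parser.chunks)


def message_text(msg):
    """Concatenate the text of all text/plain and text/html leaf parts."""
    texts = []
    for part in msg.walk():
        if part.is_multipart() or part.get_filename():
            continue  # skip containers and named attachments
        ctype = part.get_content_type()
        if ctype not in ("text/plain", "text/html"):
            continue
        raw = part.get_payload(decode=True) or b""
        charset = part.get_content_charset() or "latin-1"
        try:
            body = raw.decode(charset, errors="replace")
        except LookupError:  # unknown charset name in the message
            body = raw.decode("latin-1")
        texts.append(html_to_text(body) if ctype == "text/html" else body)
    return "\n".join(texts)


# Made-up multipart/alternative message using the comment trick.
sample = EmailMessage()
sample.set_content("plain part")
sample.add_alternative("<p>Buy V<!-- zq -->iag<!-- 17 -->ra now</p>",
                       subtype="html")
text = message_text(sample)
```

Because the parser hands over "V", "iag" and "ra" as separate data chunks,
joining them with an empty string restores the masked word before it ever
reaches the tokenizer.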

> 
> > Like Alvaro, in about 2 weeks I'll also be more available to the project
> > (and to the rest of my life, indeed). It may be convenient to have a
> > 'light' meeting within this period to discuss further development lines,
> > so when we have enough time on our hands we can use it effectively. Any
> > suggestions about this?
> 
> I will probably be visiting Claudio Gutierrez some day (rather sooner!),
> so we can meet then.  Do you have a cell phone or something?
> 

09-6707204 -- mobile phone
6890792 -- $HOME

No email-sms gateway due to some unpleasant market practices of my cell
phone company :-(

> (BTW, I've been using bogofilter at work with _no_ false negatives, _no_
> false positives, for some two weeks now -- and it's blindingly fast)
> 
> Mental note: we may want to do something like SpamAssassin's spamd/spamc
> combo.
> 

That reminds me of another player in the Bayesian filtering game, though
not as successful as those two (or even us, I guess!). Go see the project
site at savannah (no url at hand, d'oh!) and read the *only* bug report
we got: "see tss.sf.net". It's a very similar project, also linked to
Paul Graham's writings about spam, and with similar objectives. Now
that's a *bug*! :-D

> 
> > "It is practically impossible to teach good programming style to students
> > that have had prior exposure to BASIC; as potential programmers they are
> > mentally mutilated beyond hope of regeneration." -- Dijkstra
> 
> Hey!  I had _lots_ of training in BASIC before introduced into
> procedural programming.  I don't feel mutilated, let alone "beyond
> hope".

Then, according to Dijkstra, you exist on a theoretical plane of
existence... untouched by the banalities of mundane *practical* life ;-) (I
might add that you are the eventuality of an anomaly in an otherwise
complete harmony of mathematical precision, and so on... but let's keep
it for another occasion :-P)

-- 
Cristian Gutierrez                                 Linux user #298162
address@hidden           http://www.dcc.uchile.cl/~crgutier

/* You are not expected to understand this */ 



