bsf-devel

Re: address@hidden: Evaluating bsf as a GNU project]


From: Alvaro Herrera
Subject: Re: address@hidden: Evaluating bsf as a GNU project]
Date: Thu, 3 Jul 2003 01:59:18 -0400
User-agent: Mutt/1.4i

On Thu, Jul 03, 2003 at 12:17:28AM -0400, Cristian Gutierrez wrote:
> Error log for Alvaro Herrera; dumped on Wed, Jul 02, 2003 at 04:31:35PM -0400:

> > _However_ I think we have to improve the software _A_LOT_ before it's
> > "ready to be a GNU package".  I can put in some effort after my final
> > report for cc69f is done, maybe in two weeks.  I'd be really ashamed
> > to put a lousy package on GNU.
> 
> I agree. I don't even believe it would be approved in its current
> state... although I think they're actually more concerned with checking
> the disclaimer, the licenses of required packages and other legal stuff
> than with performing a complete QA process... ;-)

Sure.  What do we depend on?  Perl is already GPL-compatible, so we need
not worry about that.  I think Aldrin's modified version used
BerkeleyDB; it should be pretty trivial to change it to GDBM.

What we urgently need is a way to retrain the score table incrementally.
Maybe we should find a way to do it easily from within the MUA -- it's
very easy to do with mutt; I don't know about other MUAs.
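
A minimal sketch of what incremental retraining could look like, in
Python rather than BSF's Perl.  Everything here is an assumption made for
illustration: the ScoreTable class, its method names, the tokenizer and
the smoothed per-token score are hypothetical and are not how BSF
actually stores or computes its score table.  The only point is that
keeping raw per-token counts lets a single message be added, removed or
moved between classes without re-reading the whole mailboxes.

```python
import re
from collections import Counter

# Hypothetical tokenizer: runs of letters, digits and a few symbols.
TOKEN_RE = re.compile(r"[A-Za-z0-9$'-]+")


def tokenize(text):
    return [t.lower() for t in TOKEN_RE.findall(text)]


class ScoreTable:
    """Hypothetical score table: per-token message counts per class."""

    def __init__(self):
        self.counts = {"spam": Counter(), "legit": Counter()}
        self.messages = {"spam": 0, "legit": 0}

    def train(self, text, label):
        # Fold one message into the table; each token counts once per message.
        self.counts[label].update(set(tokenize(text)))
        self.messages[label] += 1

    def untrain(self, text, label):
        # Undo a previous train() call for the same message and label.
        self.counts[label].subtract(set(tokenize(text)))
        self.counts[label] = +self.counts[label]  # drop zero/negative entries
        self.messages[label] = max(0, self.messages[label] - 1)

    def correct(self, text, wrong_label, right_label):
        # Move a misclassified message to the right class.
        self.untrain(text, wrong_label)
        self.train(text, right_label)

    def token_score(self, token):
        # Smoothed share of spam among messages containing the token;
        # 0.5 means "no evidence either way".
        spam = (self.counts["spam"][token] + 1) / (self.messages["spam"] + 2)
        legit = (self.counts["legit"][token] + 1) / (self.messages["legit"] + 2)
        return spam / (spam + legit)


table = ScoreTable()
table.train("cheap pills now", "spam")
table.train("meeting notes attached", "legit")
```

From mutt, a macro bound to <pipe-message> could feed the current
message to a small script wrapping train() or correct(); that script
would also have to load and save the table, which this sketch leaves
out.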

> BTW, I should let you know that I've re-trained the version Ricardo
> currently has installed on anakena (dunno about version number, *sigh*),
> with my 47 MiB legit and 26 MiB spam mailboxes (it took an entire
> night on a 1GHz PIII), and I was happy to see an entire day with no
> false positives and just _one_ false negative (out of about 10 legit
> and 20 spam emails).

Well, I usually get no false negatives in a day or two, followed by five
or six the next day :-)  But I get at most one false positive a week.  I
keep no whitelists however (all my mail is routed through BSF's
scoretable).


> The one that made it through the filter uses an html-comment obfuscation
> technique, so its trigger words are effectively masked. See an
> example:

Yeah, I've seen that.  With my score table it was correctly classified
however.
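
For illustration, here is a rough sketch of how comment obfuscation
hides words from a tokenizer and how dropping the comments recovers
them.  It uses Python's standard html.parser module, not the Perl
HTML::Parser; the VisibleText class, the visible_text() helper and the
sample string are made up for this example and are not taken from the
actual spam or from BSF.

```python
from html.parser import HTMLParser


class VisibleText(HTMLParser):
    """Collect character data only; tags and comments are discarded."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        # Called for text between tags.  Comments go to handle_comment(),
        # which is left as the default no-op, so they simply vanish.
        # Limitation: <script>/<style> contents also arrive here.
        self.parts.append(data)


def visible_text(html):
    parser = VisibleText()
    parser.feed(html)
    parser.close()  # flush any text still buffered at end of input
    # Join without a separator so that words split by comments are glued
    # back together.
    return "".join(parser.parts)


# Made-up sample: each word is split by an HTML comment.
sample = "As se<!-- x1 -->en on N<!-- q -->BC"
print(visible_text(sample))
```

A plain tokenizer run on the raw sample sees fragments such as "se" and
"en"; run on the output of visible_text() it sees "seen" and "NBC"
again.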

> It's actually rendered as: "As seen on NBC, CBS, CNN an even Oprah."
> (don't even ask what has been seen so much, it's quite embarrassing to
> describe!)

Let me guess... :-D

> So I guess we fell short with this implementation (dunno about Aldrin's
> improved one). As you may already guess, this is easily fixed by using
> an actual text-only HTML renderer (rather than removing the HTML part
> and hoping there is a text-only one), like "lynx -dump". But we also
> know that it's certainly too heavy a dependency on another piece of
> software... so, what Perl modules are we left with? HTML::Parser or
> something similar?

Looks like some research and experimentation is in order.  Yes,
HTML::Parser would probably be fine, though it _will_ be slow.  I'm less
sure about what to do with attachments.  Maybe we should take a few
attributes and classify those (filename, probably extension, content
type, what file(1) reports on it), ignoring the rest of the file.
It's of no use anyway.
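
A possible shape for the attachment idea, sketched with Python's
standard email package.  The attachment_tokens() function and the
"att-name:", "att-ext:" and "att-type:" token prefixes are hypothetical
names chosen for this example.  The file(1) check is left out, since it
would mean writing the payload to disk and running an external program.

```python
import os
from email.message import EmailMessage


def attachment_tokens(msg):
    """Turn each attachment into a few tokens, ignoring its contents."""
    tokens = []
    for part in msg.walk():
        if part.is_multipart():
            continue  # containers carry no attachment data themselves
        filename = part.get_filename()
        if filename is None:
            continue  # body parts without a filename are not attachments
        extension = os.path.splitext(filename)[1].lower()
        tokens.append("att-name:" + filename.lower())
        if extension:
            tokens.append("att-ext:" + extension)
        tokens.append("att-type:" + part.get_content_type())
    return tokens


# Made-up message with one attachment, only to exercise the function.
msg = EmailMessage()
msg["Subject"] = "invoice"
msg.set_content("see attached")
msg.add_attachment(b"\x00\x01\x02", maintype="application",
                   subtype="octet-stream", filename="Invoice.EXE")

print(attachment_tokens(msg))
```

The resulting tokens could then be scored like any other word in the
message, so an extension that shows up mostly in spam would pick up a
high score by itself.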

> Like Alvaro, in about 2 weeks I'll also be more available to the project
> (and to the rest of my life, indeed). It may be convenient to have a
> 'light' meeting within this period to discuss further development lines,
> so when we have enough time on our hands we can use it effectively. Any
> suggestions about this?

I will probably be visiting Claudio Gutierrez some day (sooner rather
than later!), so we can meet then.  Do you have a cell phone or something?

(BTW, I've been using bogofilter at work with _no_ false negatives, _no_
false positives, for some two weeks now -- and it's blindingly fast)

Mental note: we may want to do something like SpamAssassin's spamd/spamc
combo.


> "It is practically impossible to teach good programming style to students
> that have had prior exposure to BASIC; as potential programmers they are
> mentally mutilated beyond hope of regeneration." -- Dijkstra

Hey!  I had _lots_ of training in BASIC before being introduced to
procedural programming.  I don't feel mutilated, let alone "beyond
hope".

-- 
Alvaro Herrera (<alvherre[a]dcc.uchile.cl>)
Oh, oh, the Galacian girls will do it for pearls,
And those of Arrakis for water! But if you seek ladies
Who burn up like flames, try a daughter of Caladan! (Gurney Halleck)