From: Jason Rennie
Subject: Re: [Ifile-discuss] iFile as help-desk front-end
Date: Fri, 31 Oct 2003 08:44:41 -0500

address@hidden said:
> 1. What do the numbers reported by ifile -q really mean?
> I believe that for this system, simply giving up and routing to a
> human would be better than guessing wrong, so I'd like to have an
> "unknown" bin that collects the stuff that isn't matched well by
> ifile. I was under the impression that the numbers reported were a
> "quality of match" metric, but in cases where nothing matches
> (feeding Jabberwocky to ifile when it's been trained on an OS X FAQ),
> it returns 0 for all categories. Is this a special case? If I get
> exactly zero, or some very negative number, should I assume the match
> is poor?
These are log-likelihoods, one for each class model, with the best match
listed first. A good indicator of the "quality of match" is the ratio of
the first two numbers. Both numbers are negative and the first is the
least negative, so dividing the first by the second always gives a ratio
less than one. If the ratio is very close to one (e.g. 0.99), it means
that ifile had a hard time differentiating between the top two classes.
If it is smaller (e.g. 0.9), you can be fairly confident of the
prediction.
For example, ifile -q on your e-mail gives me:
ifile/discuss -777.66423702
mlists/spam -945.67046261
...
The ratio is 0.822, so there's not much doubt that it got the
classification correct. The best thing for you to do would be to train
up ifile, feed it test messages, and look at the resulting ratios to
get a sense of which values correspond to confident and not-so-confident
predictions; a small script like the sketch below can automate that.
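If it helps, here is a minimal Python sketch of that routing logic. It
assumes the two-column "ifile -q" output shown above (category, then
log-likelihood); the exact invocation and the 0.95 cutoff are
illustrative guesses you would tune on your own test messages.

import subprocess

def classify(message_path, cutoff=0.95):
    """Return (category, confident) for one message.

    Parses lines of the form "<category> <log-likelihood>"; adjust
    the parsing if your ifile build prints something different.
    """
    out = subprocess.run(["ifile", "-q", message_path],
                         capture_output=True, text=True,
                         check=True).stdout
    scores = []
    for line in out.splitlines():
        parts = line.split()
        if len(parts) == 2:
            scores.append((parts[0], float(parts[1])))
    # Best match first: least negative log-likelihood on top.
    scores.sort(key=lambda s: s[1], reverse=True)
    (best_cat, best_ll), (_, second_ll) = scores[0], scores[1]
    ratio = best_ll / second_ll  # e.g. -777.66 / -945.67 = 0.822
    # A ratio near 1.0 means the top two classes were nearly
    # indistinguishable: route those messages to a human.
    return best_cat, ratio < cutoff

Messages where classify() returns (category, False) would go to your
"unknown" bin for a human to handle.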
address@hidden said:
> Is there any advantage to this approach, or am I better off letting
> ifile sort things out over a large number of bins?
The tiered representation can help, especially if the decisions in the
tree are very clear-cut. Another thing that would improve ifile's
performance is what's known as ECOC (error-correcting output coding)
classification, where you build many different binary classifiers by
randomly grouping categories together and then classify a message by
seeing which category best agrees with the classifiers' combined votes.
Here's a writeup:
http://www.ai.mit.edu/~jrennie/papers/aimemo2001.pdf
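To sketch the idea in Python (this is not ifile's own code, and the
category names are made up for the help-desk setting): each column of a
random code matrix splits the categories into two groups, one binary
classifier is trained per column, and at query time you pick the
category whose code word agrees with the most classifier votes, so a
few wrong votes can be "corrected" by the rest.

import random

def make_code_matrix(categories, n_columns, seed=0):
    """Randomly assign each category to +1 or -1 in every column;
    each column defines one binary classification problem."""
    rng = random.Random(seed)
    return {c: [rng.choice((-1, 1)) for _ in range(n_columns)]
            for c in categories}

def ecoc_decode(code, votes):
    """Return the category whose code word agrees with the most
    votes; votes holds one +/-1 prediction per binary classifier
    (training those classifiers is up to you)."""
    def agreement(cat):
        return sum(1 for cw, v in zip(code[cat], votes) if cw == v)
    return max(code, key=agreement)

# Example: 4 help-desk bins, 8 binary classifiers.
code = make_code_matrix(["hardware", "software", "network", "other"], 8)
print(ecoc_decode(code, votes=[1, -1, 1, 1, -1, -1, 1, -1]))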
Feel free to send me questions.
Jason