Re: [Pan-users] Scoring based on arbitrary headers?


From: Duncan
Subject: Re: [Pan-users] Scoring based on arbitrary headers?
Date: Sat, 10 Jan 2015 11:30:24 +0000 (UTC)
User-agent: Pan/0.140 (Chocolate Salty Balls; GIT 2786476)

Jim Henderson posted on Thu, 08 Jan 2015 19:02:20 +0000 as excerpted:

> On Thu, 08 Jan 2015 05:19:20 +0000, Duncan wrote:
> 
>> Jim Henderson posted on Wed, 07 Jan 2015 00:34:20 +0000 as excerpted:
>> 
>>>> Meanwhile, just to confirm, arbitrary header scoring did work, but
>>>> only after downloading the messages and possibly manually triggering
>>>> a rescore,
>>>> correct?
>>> 
>>> Hmmm, I didn't try a manual rescore, but the scoring that applied to a
>>> post that should have been affected didn't show up when I went to look
>>> at the rules applied.
>> 
>> OK, so we do /not/ have confirmation that pan actually does arbitrary
>> header scoring, but we /do/ have confirmation that /if/ it does, it
>> doesn't do it automatically after the download, and requires a manual
>> rescore.
> 
> I'm not sure that that's an accurate summary of what my testing found -
> I ended up not getting a score based on an arbitrary header.
> 
> Checking the score on a message that I know matches my arbitrary scoring
> rule, it doesn't show the score item I added.
> 
> The lines I added to ~/News/Score were:
> 
> %BOS %Score created by JSH
> [*opensuse.org*]
> Score:: =9999
> X-Forwarded-For: ^[address redacted]$
> %EOS
> 
> Where [address redacted] is a valid IP address.  I followed the format
> used for the From: score that appears above it in the file.

With my own testing (as mentioned in a post yesterday) demonstrating that 
arbitrary-header scoring does work, and that pan appears to score on 
download without a manual rescore, provided it has already loaded that 
score, we're left with the following possibilities:

Either:

1) Your regex somehow failed to match,

OR

2) Pan hadn't yet reloaded the scorefile after you edited it, so it 
didn't know about your new score when it downloaded your test messages.

OR

3) An absolute =nnnn score (as opposed to an additive nnnn, no =) that 
happened to match that message appeared before your test score in the 
scorefile.  Because absolute scores are intended to be absolute, no 
further scoring is done after the first absolute match is found -- that 
first match is applied and that's it -- so unlike additive scores, 
absolute score order MATTERS.


Here's what I did for my test.  I used gmane as my test server and tested 
in gmane.* groups.  Due to the way gmane works, messages thru gmane have 
a header that looks like this (obvious obfuscation applied to avoid gmane 
email munging):

Approved: news at gmane dot org

Posts on gmane also have a header like this (picking your post as an 
example):

Archived-At: 
 <http://permalink.gmane.org/gmane.comp.gnome.apps.pan.user/14813>


Since these are unlikely to be in the overview (tho I didn't actually 
check) but are extremely common (pretty much every post) on gmane, I 
decided they'd make good arbitrary-header scoring test material.

So:

[*]
Score:: 100 %testing arbheaders
        Approved: gmane\.org

Score:: 200 %arbheaders test2
        Archived-at: gmane\.org

Now those are additive scores and went below my normal scoring, so if any 
absolute scores applied, pan would never get to these, but otherwise, 
assuming no further additive scores applied, basically all "current" gmane 
messages should get a score of 100+200=300.

Some things to note altho they'll be review for those familiar with the 
scorefile format:

A % at the start of a line marks that line as a comment.  All those 
%BOS/%EOS lines that pan adds are purely that, comments, and do nothing 
to change the actual scoring.  Knowing that, for me those comments are 
mostly noise and I don't use 'em, tho I do have my own explanatory 
comments when necessary, and do tend to keep an originating date on any 
/expiring/ score, just so I know how long I intended it to run before 
expiring.

Similarly, on a score line, a % after the score value indicates a comment 
and can be used to give the score a name, exactly as you see in my 
example.

The [] starts a scoring section as well as indicating the newsgroups that 
section applies to.  Newsgroup entries use * wildcards, not the regex 
syntax that applies to the content of most headers.  So the tested [*] 
says to apply the following scores regardless of the group name, which 
was fine for my tests.
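
To make the wildcard-vs-regex distinction concrete, here's a small Python 
sketch.  It uses Python's fnmatch as a stand-in for the scorefile's group 
wildcards, which is an assumption for illustration only; pan's own 
matching code may differ in details such as case handling.  The function 
name section_applies and the opensuse group name are made up.

```python
from fnmatch import fnmatchcase


def section_applies(group, patterns):
    """True if any wildcard pattern from a [ ... ] line matches the group name."""
    return any(fnmatchcase(group, pattern) for pattern in patterns)


# [*] applies to every group.
print(section_applies("gmane.comp.gnome.apps.pan.user", ["*"]))               # True

# [*opensuse.org*] applies only to groups whose name contains opensuse.org.
print(section_applies("gmane.comp.gnome.apps.pan.user", ["*opensuse.org*"]))  # False
print(section_applies("forums.opensuse.org.example", ["*opensuse.org*"]))     # True (made-up group name)
```

Note that in a wildcard pattern the dot is just a dot, so no \. escaping 
is needed on the [] line, unlike in the header regexes.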

If I had set the first one to =100 instead of a bare 100, it would have 
been an absolute score, and any match at that point would prevent pan 
from even getting to the next score with that post, since an absolute 
score is just that, absolute, and the first such matching score applies, 
period.  (Of course this is one of the possibilities I list above for why 
your test didn't seem to work, that an earlier absolute score match 
prevented pan from ever reaching the test score.)
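
Here's a rough Python model of that additive-vs-absolute behavior, using 
the two test scores above.  It's only a sketch of the rules as described 
in this post, not pan's actual code: the Rule class and score_article 
function are invented for illustration, and the case-insensitive header 
name lookup is an assumption (suggested by Archived-at matching the 
Archived-At header in my test).

```python
import re
from dataclasses import dataclass


@dataclass
class Rule:
    header: str             # header name, e.g. "Approved"
    pattern: str            # regex searched for in that header's value
    value: int              # score value
    absolute: bool = False  # True models "=nnn", False models a bare "nnn"


def score_article(headers, rules):
    """Return the score for one article, walking the rules in file order."""
    # Assumption: header names are compared case-insensitively.
    lowered = {name.lower(): value for name, value in headers.items()}
    total = 0
    for rule in rules:
        text = lowered.get(rule.header.lower())
        if text is None or re.search(rule.pattern, text) is None:
            continue  # header missing, or the regex didn't match
        if rule.absolute:
            return rule.value  # first absolute match wins; later rules never run
        total += rule.value
    return total


article = {
    "Approved": "news@gmane.org",
    "Archived-At": "<http://permalink.gmane.org/gmane.comp.gnome.apps.pan.user/14813>",
}

additive = [
    Rule("Approved", r"gmane\.org", 100),
    Rule("Archived-at", r"gmane\.org", 200),
]
with_absolute = [
    Rule("Approved", r"gmane\.org", 100, absolute=True),
    Rule("Archived-at", r"gmane\.org", 200),
]

print(score_article(article, additive))       # 300
print(score_article(article, with_absolute))  # 100
```

With both scores additive the article ends up at 300.  Make the first one 
absolute and the second rule is never reached, so the result is 100.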

For my testing purposes at least, I didn't need to match the entire 
header, just verify that it was there, and that it contained the gmane.org 
bit.  Thus I didn't need the ^ and $ string beginning and ending 
anchors.  And of course the \. forces the dot to be matched as a literal 
dot, not the "any character" that a dot metacharacter will normally match 
in regex.
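
The anchoring and escaping points can be checked quickly in Python.  This 
uses Python's re module purely for illustration; I'm assuming pan's regex 
engine agrees on these basics (^, $, and \.), and the helper name matches 
is made up.

```python
import re


def matches(pattern, value):
    """True if the regex is found anywhere in the header value."""
    return re.search(pattern, value) is not None


archived_at = "<http://permalink.gmane.org/gmane.comp.gnome.apps.pan.user/14813>"

# Unanchored: it's enough that gmane.org appears somewhere in the value.
print(matches(r"gmane\.org", archived_at))    # True

# Anchored: the whole value would have to be exactly "gmane.org".
print(matches(r"^gmane\.org$", archived_at))  # False

# An unescaped dot matches any character, so this matches more than intended.
print(matches(r"gmane.org", "gmane-org"))     # True
print(matches(r"gmane\.org", "gmane-org"))    # False
```

So with ^ and $ in place, any stray character in the header value (extra 
whitespace, a second address, and so on) is enough to make the match 
fail, which is why starting unanchored is the easier test.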


Now, after adding that to my scorefile and saving, I had to tell pan 
about the scorefile changes.  So I selected a message and hit Articles, 
Edit Article's Watch/Ignore Score.  In the resulting dialog box I simply 
hit Close and Rescore, to get pan to reload the changed scorefile.

That did it.  Most (cached) posts on gmane now appeared with a 300 
score.  Again, a few posts did not, because they matched some previous 
absolute score and thus were assigned that score and never reached my 
test scoring.

Switching groups with that setup was when I really noticed the slowdown 
of those arbitrary-header scores, because now pan had to go thru all 
cached messages on the new group, checking each one to see if the 
arbitrary-header scores matched and scoring as appropriate.

Then I tried downloading new "headers" (really overviews) in subscribed 
groups, to check scoring on new messages.  That's when pan crashed, 
since (as I explained in yesterday's reply) it meant pan scanned a 
message in another group that is known to crash pan.

After restarting pan and figuring out what happened (verifying the crash 
on getting new headers in subscribed groups another time or two in the 
process), I tried getting new headers in /selected/ groups, without the 
problem group selected.  That worked without crashing!

And as expected, the new scores didn't apply to the just fetched 
"headers" (overviews), because the overviews didn't contain the headers I 
was trying to score on.

But as soon as I downloaded the actual messages, the new scores applied 
as the content was actually there to match against, now. =:^)
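
The overview-vs-full-message difference amounts to something like the 
following Python sketch.  The dictionaries are simplified stand-ins (a real 
overview carries a few more fields, such as date and message-id), and 
header_matches is a made-up helper, not anything from pan.

```python
import re


def header_matches(headers, name, pattern):
    """True if the named header is present and the regex matches its value."""
    value = headers.get(name)
    return value is not None and re.search(pattern, value) is not None


# After fetching "headers": only overview fields are available (simplified).
overview = {
    "Subject": "Re: [Pan-users] Scoring based on arbitrary headers?",
    "From": "Duncan",
}

# After downloading the message itself: the remaining headers are there too.
full_article = dict(overview)
full_article["Archived-At"] = (
    "<http://permalink.gmane.org/gmane.comp.gnome.apps.pan.user/14813>"
)

print(header_matches(overview, "Archived-At", r"gmane\.org"))      # False
print(header_matches(full_article, "Archived-At", r"gmane\.org"))  # True
```

The rule itself is the same both times; it simply has nothing to match 
against until the full message is in the cache.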

But again, while I didn't see any in my short test (I couldn't get 
headers in subscribed groups or in the single affected group, without 
crashing, remember, and I didn't like the scanning delay when I switched 
groups either, so I had no interest in prolonging the test), had any of 
the new messages matched an absolute score reached before my test scores, 
of course the test scores wouldn't have applied here either.


Sooo...

What I'd suggest you try next is a more general match, as I did.  If you 
use gmane you can duplicate my test scores and verify that they're 
working for you too, before proceeding.

Once you get something general obviously applying, then home in on your 
objective.  First try a score like this:

[*]
Score:: =500 % test x-forwarded-for
        X-Forwarded-For: .*

That's absolute 500, to hopefully distinguish it from all the absolute 
9999/watch scores, assuming you have score-colors set appropriately, and 
the score column set to display.

And it should match ANY post in ANY group, that has ANY x-forwarded-for 
header set, no matter the content.

Once that is verified to work as expected, narrow it down one factor at a 
time:

[*opensuse.org*]
...

First the newsgroup, matching any group name containing opensuse.org.

...
        X-Forwarded-For: somedomain\.net
...

Then try a simple general domain match.

That might be narrow enough right there, without a full string match.  If 
not, continue to narrow it down, until you get a positive match without 
too many false-positives.

Of course somewhere in there you can set your desired score, as well.  
But don't forget, with absolute scores involved, order matters!  So order 
accordingly. =:^)

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman



