RE: [Pan-users] Composing regex for Pan


From: Paul Hudson
Subject: RE: [Pan-users] Composing regex for Pan
Date: Sun, 14 Mar 2004 14:01:47 -0000

> > 
> >  \b[:upper:]{2,}\b
> This dumped all replies. The regex animal book doesn't 
> explain those constructs very well (nor have any of the web 
> sites I've looked at). 

Have a look at the link I sent - all the info's in there somewhere, I think
:)

> > http://www.pcre.org/pcre.txt).


> > (?-i)\b[A-Z]{2,}\b
> This works, sort-of, if I select NONE OF:, but things like 
> "!?&" in the string break it.

(All the below untested as before)

So, I'm unclear what you want. How about keeping things with at least one
word with at least one lower case letter in the middle of it?

(?-i)\b.+[a-z].+\b
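
A rough Python sketch of that "keep" test, for experimenting outside Pan. Python's re module is not PCRE and rejects a bare (?-i), so the sketch relies on Python being case-sensitive by default; the name has_inner_lowercase and the sample subjects are made up for illustration.

```python
import re

# Python matches case-sensitively unless re.IGNORECASE is passed,
# so the (?-i) prefix from the PCRE pattern is simply left out here.
INNER_LOWER = re.compile(r"\b.+[a-z].+\b")

def has_inner_lowercase(subject: str) -> bool:
    """True if a lower-case letter occurs with at least one character on each side."""
    return INNER_LOWER.search(subject) is not None

print(has_inner_lowercase("Just a test post"))   # True
print(has_inner_lowercase("JUST A TEST POST"))   # False
print(has_inner_lowercase("Re: HELLO"))          # True
```

Note the last case: because .+ also matches spaces and punctuation, the "e" in a "Re:" prefix is enough to satisfy the pattern, so a shouted reply would still be kept.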

> What I've been reading says that the ? refers to "zero or more times"
> (this must be my "snake & necklace" problem again).

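It's the ( followed by ? that is important here: (? introduces an option
setting, and (?-i) turns case-insensitive matching off. In other contexts ?
is a quantifier, though it means zero or one (it is * that means zero or
more).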
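
A quick way to see the two roles of ? is to try them in Python. This is only a sketch: Python's re is not the PCRE library, and it wants the scoped form (?-i:...) where PCRE also accepts a bare (?-i). The "colour" patterns are made-up examples.

```python
import re

# As a quantifier, ? makes the preceding item optional (zero or one).
optional_u = re.compile(r"colou?r")
print(bool(optional_u.fullmatch("color")))     # True
print(bool(optional_u.fullmatch("colour")))    # True
print(bool(optional_u.fullmatch("colouur")))   # False

# * is the quantifier that means zero or more.
any_u = re.compile(r"colou*r")
print(bool(any_u.fullmatch("colouur")))        # True

# Directly after "(", ? introduces an option setting instead.
# Here -i switches case-insensitivity off inside the group,
# even though the pattern as a whole is compiled with IGNORECASE.
caps_run = re.compile(r"(?-i:[A-Z]{2,})", re.IGNORECASE)
print(bool(caps_run.search("HELLO there")))    # True
print(bool(caps_run.search("hello there")))    # False
```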
> 
> I want to dump as many of the annoying spam, troll and 
> AOL-keyboard posts as I can, which I think, will require 
> parsing the string's individual characters, multiple times 
> (maybe my approach is flawed?) Once for ALL CAPS (if true, 
> dump the post, regardless of additional characters in the 
> string).

So dump lines that don't match

(?-i)[a-z]

maybe (i.e. lines that don't contain at least one lower case character)
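
To illustrate the all-caps test, here is a small Python sketch: a subject counts as "all caps" when a case-sensitive search for [a-z] finds nothing. The function name is_all_caps and the sample subjects are hypothetical, and Python's default case-sensitive matching stands in for (?-i).

```python
import re

# Case-sensitive by default in Python, so this only finds lower-case ASCII letters.
LOWER = re.compile(r"[a-z]")

def is_all_caps(subject: str) -> bool:
    """True when the subject contains no lower-case ASCII letter at all."""
    return LOWER.search(subject) is None

subjects = ["BUY NOW!?&", "Just a test post", "Re: HELLO"]
kept = [s for s in subjects if not is_all_caps(s)]
print(kept)   # ['Just a test post', 'Re: HELLO']
```

Punctuation such as "!?&" makes no difference here, since the test only looks for lower-case letters. The flip side is that a subject made up solely of digits or punctuation is also treated as all caps.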

>After that, it gets interesting. Now we should have 
> mixed-case alpha and/or alpha-numeric (or "should" have).

So, don't do anything with these (leave them with the default score which
means they'll be shown)

> Next, filter on multiple instances (2 or more to start) of 
> any non-alpha, printable characters, anywhere in the string. 

Do you mean the same character repeated? This one's interesting. I think we
can use backreferences here....

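Keep lines that don't match

([[:punct:]])\1

(the parentheses capture one punctuation character so that \1 can require
the same character again, and the POSIX class has to sit inside its own
pair of square brackets)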
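
The backreference idea can be tried out in Python as a sketch: capture one punctuation character in a group, then require the same character again with \1. Python's re has no POSIX classes, so a class built from string.punctuation stands in for PCRE's punctuation class; that substitution and the name has_doubled_punct are my assumptions.

```python
import re
import string

# Stand-in for PCRE's POSIX punctuation class: every ASCII punctuation character.
PUNCT_CLASS = "[" + re.escape(string.punctuation) + "]"

# Group 1 captures a single punctuation character; \1 must repeat that same character.
DOUBLED_PUNCT = re.compile("(" + PUNCT_CLASS + r")\1")

def has_doubled_punct(subject: str) -> bool:
    """True if any punctuation character appears twice in a row."""
    return DOUBLED_PUNCT.search(subject) is not None

print(has_doubled_punct("Free money!!"))        # True
print(has_doubled_punct("Hello!?"))             # False
print(has_doubled_punct("Just a Test Post #10"))  # False
```

"Hello!?" is not caught because the two punctuation characters differ. Catching any two punctuation characters in a row would need the class twice rather than a backreference.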

> Dump the matches. Then filter those results against any other 
> specific criteria until what remains are subjects that look 
> "normal" as in: Just a test post | Just A Test Post | Just a 
> Test Post #10 | any of the previous, prefixed by "Re:", ect.

These should be straightforward?

What are you setting the score to for each of these?
