From: | Manuel Collado |
Subject: | Re: [bug-gawk] Use gawk and regex to stip comments from html files regex problem |
Date: | Thu, 30 Oct 2014 10:41:02 +0100 |
User-agent: | Mozilla/5.0 (Windows NT 5.1; rv:17.0) Gecko/17.0 Thunderbird/17.0 |
On 29/10/2014 22:54, frank ernest wrote:

Hello, I'm trying to use the lazy star to strip comments from html files but it's not working:

    gawk '{ gsub(/<!--.*?-->/, "", $0); print $0}'

I have to use a non-greedy method so that this:

    <!-- comment one --> Important text <!-- comment2 -->

does not become this:

See? But, despite the fact that several docs on extended regexes mention that the lazy star works, it does not work in gawk. I know that I might use some other tool like lynx, but I wanted to do it with gawk and I don't see why a perfectly fine programming language should fail for so simple a task.
The repeated inner subexpression should match only text fragments that do not contain the closing comment mark. So you can use:

    { gsub(/<!--([^-]|-[^-]|--[^>])*?-->/, "", $0)
      print }

A bit cumbersome, but required by the greedy matching policy of awk regexps.

And please take into account that comments can start on one line and end on a different line. But that is another story.
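As a rough sketch of the idea, here is the pattern wrapped in two small shell functions. The names strip_comments and strip_comments_multiline are hypothetical, chosen only for this illustration. The sketch drops the `?` after `*`, since POSIX extended regular expressions have no lazy quantifiers. The second function is one possible way to deal with comments that span lines, assuming the whole input fits in memory: it collects every line into one string and runs the same gsub once at the end, relying on `[^-]` also matching a newline.

```shell
# Hypothetical helper: remove comments that open and close on the same line.
strip_comments() {
    awk '{ gsub(/<!--([^-]|-[^-]|--[^>])*-->/, ""); print }'
}

# Hypothetical helper: remove comments that may span several lines.
# Assumption: the input is small enough to buffer entirely in memory.
strip_comments_multiline() {
    awk '{ buf = buf $0 "\n" }
         END { gsub(/<!--([^-]|-[^-]|--[^>])*-->/, "", buf); printf "%s", buf }'
}

# The example from the question: only the two comments are removed.
printf '%s\n' '<!-- comment one --> Important text <!-- comment2 -->' | strip_comments

# A comment that starts on one line and ends on the next.
printf '%s\n' 'before <!-- first' 'second --> after' | strip_comments_multiline
```

The inner group can consume a single `-` or a `--` only when it is not the start of `-->`, so each match has to stop at the first closing mark even though the star itself stays greedy.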
Thanks
Regards. -- Manuel Collado - http://lml.ls.fi.upm.es/~mcollado