bug-gawk
Re: [bug-gawk] Use gawk and regex to stip comments from html files regex


From: Manuel Collado
Subject: Re: [bug-gawk] Use gawk and regex to stip comments from html files regex problem
Date: Thu, 30 Oct 2014 10:41:02 +0100
User-agent: Mozilla/5.0 (Windows NT 5.1; rv:17.0) Gecko/17.0 Thunderbird/17.0

On 29/10/2014 22:54, frank ernest wrote:
Hello, I'm trying to use the lazy star to strip comments from html files but 
it's not working:
gawk '{ gsub(/<!--.*?-->/, "", $0); print $0}'
I have to use a non greedy method so that this:
<!-- comment one --> Important text <!-- comment2 -->
does not become this:

See? But although several docs on extended regexes say that the
lazy star works, it does not work in gawk. I
know that I might use some other tool like lynx, but I wanted to do it
with gawk and I don't see why a perfectly fine programming language
should fail for so simple a task.

The repeated inner subexpression should match only text fragments that do not contain the closing comment mark (-->). So you can use:

{
   gsub(/<!--([^-]|-[^-]|--[^>])*-->/, "", $0)
   print
}

A bit cumbersome, but required by the greedy (leftmost-longest) matching of awk regexps.
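
As a quick check, here is a minimal sketch that runs the pattern on the sample line from the question. The function name strip_comments is only illustrative, and the quantifier is written as a plain *, since POSIX EREs have no lazy quantifier. It calls plain awk on the assumption that nothing gawk-specific is needed; substituting gawk should behave the same.

```shell
# Strip HTML comments that open and close on the same line.
# Reads stdin, writes stdout. (Illustrative helper, not part of gawk.)
strip_comments() {
  awk '{
    # The group matches any text that cannot contain the closing "-->":
    # a non-dash, a dash followed by a non-dash, or "--" followed by a non-">".
    gsub(/<!--([^-]|-[^-]|--[^>])*-->/, "")
    print
  }'
}

printf '%s\n' '<!-- comment one --> Important text <!-- comment2 -->' | strip_comments
```

Each comment is removed on its own, so the text between them survives. One caveat: a comment closed with three or more dashes, such as --->, is not matched by this pattern.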

Also keep in mind that a comment can start on one line and end on a later one, which a line-by-line gsub does not handle. But that is another story.
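
For what it is worth, one possible way to cover comments that span lines is to collect the whole input into a single string and apply the same pattern once at the end. This is only a sketch under that assumption: the function name is illustrative, the whole file is held in memory, and it relies on a bracket expression such as [^-] matching a newline, which is how awk regexps behave.

```shell
# Strip HTML comments even when they span several lines.
# Reads stdin, writes stdout. (Illustrative helper, not part of gawk.)
strip_comments_multiline() {
  awk '
    # Append every input line, with its newline, to one buffer.
    { text = text $0 "\n" }
    END {
      # Same pattern as before, applied to the whole buffer at once.
      gsub(/<!--([^-]|-[^-]|--[^>])*-->/, "", text)
      printf "%s", text
    }'
}

printf 'a <!-- one\ntwo --> b\n' | strip_comments_multiline
```

Lines that held only part of a comment are joined by the removal, so the output can have fewer lines than the input.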


Thanks


Regards.

--
Manuel Collado - http://lml.ls.fi.upm.es/~mcollado



