Re: [bug-gawk] Use gawk and regex to stip comments from html files regex
From: Davide Brini
Subject: Re: [bug-gawk] Use gawk and regex to stip comments from html files regex problem
Date: Thu, 30 Oct 2014 11:29:23 +0100
On Wed, 29 Oct 2014 22:54:25 +0100, "frank ernest" <address@hidden> wrote:
> Hello, I'm trying to use the lazy star to strip comments from html files
> but it's not working: gawk '{ gsub(/<!--.*?-->/, "", $0); print $0}'
> I have to use a non greedy method so that this:
> <!-- comment one --> Important text <!-- comment2 -->
> does not become this:
>
> See? But, despite the fact that several docs on extended regexes mention
> the fact that the lazy star works it does not work in gawk. I know that I
> might use some other tool like lynx, but I wanted to do it with gawk and
> I don't see why a perfectly fine programming language should fail for so
> simple a task.
First of all, it's not at all a "simple task". Keep in mind that
parsing *ML with regex-based tools is fragile and very hard to do right.
It's much better to use dedicated tools, of which plenty exist.
That being said, try this (not guaranteed to work in 100% of cases):
gawk -v RS='<!--|-->' 'NR%2' file.html
Optionally you can set ORS="" for a more conservative output format
(otherwise a newline is printed wherever a comment was removed).
(And good luck.)
--
D.