Re: [Lynx-dev] Issues with -dont_wrap_pre and -nomargins
From: Thomas Dickey
Subject: Re: [Lynx-dev] Issues with -dont_wrap_pre and -nomargins
Date: Wed, 16 Sep 2009 04:59:17 -0400 (EDT)
On Wed, 16 Sep 2009, Claus Strommer wrote:
Hello all. I am using lynx to convert an archive of html files into
plaintext for information retrieval. The command that I use is
lynx -nounderline -notitle -nocolor -nomargins -nolist -nobold -nonumbers
-force_html -dump -dont_wrap_pre <file>
It works almost perfectly, except for one minor issue; I am not sure if it
is a bug or something I am doing wrong. When I parse the attached a.html
file, some of the words are printed without a whitespace separator:
I can reproduce this
"...However, in order to fully develop our vision of the next version of
Twingle, we needed more control over the fine nuances of searching through
email. And, asthe next phase of the Twingle development is to include a
downloadable versionof the software, we needed it to make it easier for
people to install - when the lead developer gave up after 6 hours of trying
to get it all working on his own machine at home we knew we had a
problem!..."
'asthe' should be 'as the', 'versionof' -> 'version of', and so on. AFAIK,
this is not an input error - the words are separated when I skip either of the
-dont_wrap_pre or -nomargins options. As these errors occur near the n*80th
characters in a paragraph, I can only assume that some part of the parsing is
going awry there. The errors occur in the 1.8.6-rel5 (macports), 1.8.6-rel4
(ubuntu) and the latest 1.8.8 builds.
s/1.8/2.8/
That sounds like a bug, for instance in how lynx is storing some hidden
characters for , etc.
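One way to check the observation that the fused words fall near multiples of 80 characters is to measure where each one starts within its paragraph. The following is only a minimal sketch and is not part of lynx: the helper names fused_offset and distance_to_wrap are made up for illustration, the width of 80 is an assumption taken from the report, and the code counts bytes rather than display columns.

```c
#include <string.h>

/*
 * Hypothetical helper (not from the lynx sources): return the byte offset
 * at which `fused` first occurs in `paragraph`, or -1 if it does not occur.
 * The paragraph is expected to be the dumped text with its newlines removed.
 */
long fused_offset(const char *paragraph, const char *fused)
{
    const char *hit = strstr(paragraph, fused);

    return (hit == NULL) ? -1L : (long) (hit - paragraph);
}

/*
 * Distance from `offset` to the nearest multiple of `width` (assumed to be
 * 80 in the report).  A small result is consistent with the "near the
 * n*80th character" observation; it does not prove the cause.
 */
long distance_to_wrap(long offset, long width)
{
    long rem = offset % width;

    return (rem < width - rem) ? rem : width - rem;
}
```

Calling fused_offset() on a paragraph for a word such as "asthe", then passing the result and 80 to distance_to_wrap(), gives a quick number to compare across the affected paragraphs.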
So my question is: Is there anything I can do to work around this? I would
...other than fixing the bug - perhaps not. (I'm working on xterm and
mawk at the moment, intending to go back to lynx next...).
Just reading the code: It looks as if -nomargins goes to the no_margins
variable, and _that_ is used in only a few places:
DefaultStyle.c:466: if (no_margins) {
DefaultStyle.c:482: if (no_margins) {
LYGlobalDefs.h:394: extern BOOLEAN no_margins;
LYMain.c:393:BOOLEAN no_margins = FALSE;
LYMain.c:3622: "nomargins", 4|SET_ARG, no_margins,
LYOptions.c:35:#define MARGIN_STR (no_margins ? "" : " ")
LYOptions.c:36:#define MARGIN_LEN (no_margins ? 0 : 2)
LYrcFile.h:159:#define RC_NO_MARGINS "no_margins"
LYReadCFG.c:1494: PARSE_SET(RC_NO_MARGINS, no_margins),
The uses in DefaultStyle.c and LYOptions.c are simple to change and see if
the bug's behavior changes predictably. For instance, making -nomargins
do a single character rather than none might make it usable for your
script.
The derived variables are used in GridText.c's split_line() function,
which is where the boundary check is most likely off. It's complicated,
since there are long expressions such as
    spare = WRAP_COLS(text)
        - (int) style->rightIndent
        - indent
        + ctrl_chars_on_previous_line
        - LYstrExtent2(previous->data, previous->size);
But that's the area where the fix would probably be made - split_line.
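To make the arithmetic in that expression easier to follow, here is a standalone restatement with every operand passed in as a plain int. It is a sketch for reasoning about the boundary case only: the function name spare_columns and its parameter names are stand-ins rather than lynx identifiers, and the sample values (an 80-column width, zero indents with -nomargins) are assumptions. It does not show where split_line() actually goes wrong.

```c
/*
 * Standalone restatement of the quoted expression.  In lynx the operands
 * come from WRAP_COLS(text), style->rightIndent, indent,
 * ctrl_chars_on_previous_line and
 * LYstrExtent2(previous->data, previous->size); here they are plain
 * parameters so the arithmetic can be tried in isolation.
 */
int spare_columns(int wrap_cols, int right_indent, int indent,
                  int ctrl_chars, int previous_extent)
{
    return wrap_cols - right_indent - indent + ctrl_chars - previous_extent;
}
```

With these assumed values, spare_columns(80, 0, 0, 0, 80) is 0 and spare_columns(80, 0, 0, 0, 79) is 1, so a previous line that exactly fills the width leaves no spare column. Any hidden control characters on the previous line are added back, so spare_columns(80, 0, 0, 2, 80) is 2.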
very much like to keep using these two options, as it is important to me to
be able to distinguish between lines and paragraphs. I am even willing to
use other tools, if you can suggest any - but as far as I've seen, lynx is
the only one which gives the desired options. Also, I'd like to stay away
from the -width option (it does not allow me to specify infinite width, AND
it breaks with tables - the attached b.html, for example).
--
Thomas E. Dickey
http://invisible-island.net
ftp://invisible-island.net