bug-gnu-utils

new fmt feature


From: Bitt Faulk
Subject: new fmt feature
Date: Fri, 16 Mar 2001 16:45:58 -0500 (EST)

As per Jim Meyering, I'm submitting this patch here for comments and
review.

fmt usually determines paragraph boundaries based on blank lines found in
the input.  This works well if the input is formatted that way.  It also
has an option (-s) that splits long lines without joining short ones, in
effect treating each line as its own paragraph, which makes sense for
some types of input.  But I often come across text that has already been
formatted in some way, where the author did not leave blank lines between
paragraphs, leaving it to the reader to determine paragraph boundaries
from context.  fmt cannot interpret that context: either it collapses all
of the paragraphs into one or, if the -s option is used, it breaks each
overlong line into a full line plus a short remainder, leaving a jagged
right margin.  Neither result is correct.

I have devised a new feature for fmt that takes such input and (in
addition to its normal blank line delineations) determines paragraph
breaks based on whether a line is significantly shorter than the average
line in the document.  That is, often, final lines in paragraphs have
fewer characters in them than other lines, leaving a (correct) ragged
right margin.  If my patched fmt is given the option to look for such
lines and sees one, it assumes that that is the end of the paragraph.

Shortness is defined as a percentage of a running average line length.
This means that the input must already be formatted to some extent, and
that the user wants to run fmt to re-format it.  It also means that lines
early in the document are more likely to be formatted incorrectly than
those later on, because the running average has seen only a few lines at
that point.  The default threshold is 70%, but the user can change it by
giving an optional argument to the option that turns on this feature.
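
To make the heuristic concrete, here is a minimal sketch in Python.  It
is not the code from the attached patch (which is C, against fmt.c); the
function name split_paragraphs and the parameter percent are made up for
illustration.  It also assumes details the description above leaves
open: blank lines are not counted in the average, the line being tested
is added to the average only after the test, and the very first line is
never treated as short.

```python
def split_paragraphs(lines, percent=70):
    """Group lines into paragraphs (hypothetical sketch, not the patch).

    A paragraph ends at a blank line, or after a line shorter than
    `percent` percent of the running average length of the non-blank
    lines seen before it.
    """
    paragraphs = []
    current = []
    total_length = 0  # sum of lengths of non-blank lines seen so far
    line_count = 0    # number of non-blank lines seen so far

    for raw in lines:
        line = raw.rstrip()
        if not line:
            # Blank line: the usual paragraph boundary.
            if current:
                paragraphs.append(current)
                current = []
            continue

        current.append(line)
        # Assumption: compare against the average of the *earlier* lines,
        # so the first line of the input can never count as short.
        is_short = (
            line_count > 0
            and len(line) * 100 < percent * total_length / line_count
        )
        total_length += len(line)
        line_count += 1
        if is_short:
            # A short line is taken to be the last line of its paragraph.
            paragraphs.append(current)
            current = []

    if current:
        paragraphs.append(current)
    return paragraphs


if __name__ == "__main__":
    sample = ["x" * 60, "x" * 60, "x" * 20, "x" * 60, "x" * 60, "x" * 25]
    print([len(p) for p in split_paragraphs(sample)])  # prints [3, 3]
```

With the default of 70 the sample splits after each short line; lowering
percent to 30 makes the test stricter, and the same input then stays a
single paragraph.  Each resulting group would then be refilled as fmt
normally refills a paragraph.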

This method cannot be guaranteed to be totally accurate, but it does a
much better job with the type of input it is intended to operate on than
either of the currently existing options.

I've coded this new option in the attached patch as `-n'/`--new-split'.
This is not a very good name, and I'd appreciate input.  I was going to
make it `-c'/`--context-split', but `-c' is already in use.

Also, I'm not 100% happy with the implementation, as there are some
optimizations that could be made, especially in the ridiculously long
line length determination on line 646.  Also, skipping the calculation of
the average line length when the new option is not in use would be
preferable.  However, it does work, and I'm interested in getting some
feedback first.

Attached are the patch (in diff -u format) and several test cases.  The
diff is against textutils-2.0.12, which appeared to be the newest version
on alpha.gnu.org.  Please let me know what you think of it.

-Bitt

Attachment: fmt.diff
Description: Text document

Attachment: canon
Description: Text document

Attachment: testcase1
Description: Text document

Attachment: testcase2
Description: Text document

Attachment: check.sh
Description: Text document

