From: | Bitt Faulk |
Subject: | new fmt feature |
Date: | Fri, 16 Mar 2001 16:45:58 -0500 (EST) |
As per Jim Meyering, I'm submitting this patch here for comments and review.

fmt usually determines paragraph boundaries based on blank lines found in the input. This works well if the input is formatted that way. It also has an option (-s) that treats each line as a new paragraph, which makes sense for some types of input. But I often come across text that has already been formatted in some way, where the author did not leave blank lines between paragraphs and left it to the reader to determine paragraph boundaries from context. fmt cannot interpret that context, so it either collapses all of the paragraphs into one or, if the -s option is used, splits each long line into a full line followed by a partial one, leaving a jagged right margin. Neither result is correct.

I have devised a new feature for fmt that takes such input and (in addition to its normal blank-line delineation) determines paragraph breaks based on whether a line is significantly shorter than the average line in the document. That is, the final line of a paragraph often has fewer characters than the other lines, leaving a (correct) ragged right margin. If my patched fmt is given the option to look for such lines and sees one, it assumes that the line is the end of a paragraph.

Shortness is determined as a percentage of a running average line length. This means that the input must already be formatted to some extent and that the user wants to run fmt to re-format it. It also means that lines early in the document are more likely to be incorrectly formatted than those later on, because the average is a running one and is based on fewer lines at that point. The default percentage is 70%, but the user can change this by giving an optional argument to the option that turns on the new feature.

This method cannot be guaranteed to be totally accurate, but it does a much better job on the type of input it is intended for than either of the currently existing options.
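To make the heuristic concrete, here is a minimal Python sketch of the idea described above. It is not the code from the attached patch (which is C, against fmt.c). The function name `split_paragraphs`, the `threshold` parameter, and the choice to compare each line against the average of the non-blank lines seen before it are all assumptions made for illustration. The patch may compute the running average differently.

```python
def split_paragraphs(lines, threshold=0.70):
    """Group lines into paragraphs.

    A paragraph ends at a blank line (fmt's normal rule) or after a line
    whose length is below `threshold` times the running average length of
    the non-blank lines seen so far (the heuristic from the message).
    Hypothetical helper; not taken from the patch.
    """
    paragraphs = []
    current = []
    total_len = 0   # sum of lengths of non-blank lines seen so far
    count = 0       # number of non-blank lines seen so far

    for raw in lines:
        line = raw.rstrip()
        if not line.strip():
            # Blank line: close the current paragraph, if any.
            if current:
                paragraphs.append(current)
                current = []
            continue

        current.append(line)

        # Assumption: the very first line can never count as "short",
        # because there is no average to compare it against yet.
        is_short = count > 0 and len(line) < threshold * (total_len / count)

        # Update the running average after the comparison.
        total_len += len(line)
        count += 1

        if is_short:
            # A short line is taken to be the last line of its paragraph.
            paragraphs.append(current)
            current = []

    if current:
        paragraphs.append(current)
    return paragraphs
```

Because the average is built up as lines are read, a short line near the top of the input is compared against very little data, which matches the caveat above about early lines being more likely to be misjudged.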
I've coded this new option in the attached patch as `-n'/`--new-split'. This is not a very good name, and I'd appreciate input. I was going to make it `-c'/`--context-split', but `-c' is already in use.

Also, I'm not 100% happy with the implementation, as there are some optimizations that could be made, especially in the ridiculously long line-length determination on line 646. It would also be preferable to skip the calculation of the average line length entirely when the new option is not in use. However, it does work, and I'm interested in getting some feedback first.

Attached are the patch (in diff -u format) and several test cases. The diff is against textutils-2.0.12, which appeared to be the newest version on alpha.gnu.org.

Please let me know what you think of it.

-Bitt
fmt.diff
Description: Text document
canon
Description: Text document
testcase1
Description: Text document
testcase2
Description: Text document
check.sh
Description: Text document