new fmt feature

bug-gnu-utils

From:	Bitt Faulk
Subject:	new fmt feature
Date:	Wed, 21 Mar 2001 17:02:07 -0500

As per Jim Meyering, I'm submitting this patch here for comments and
review.

fmt usually determines paragraph boundaries based on blank lines found
in the input.  This works well if the input is formatted this way.  It
also has the option (-s) to mark each line as a new paragraph, which
makes sense given some types of input.  However, I often come across
text that has already been formatted in some way, but whose author did
not leave blank lines between paragraphs, leaving it to the reader to
determine paragraph boundaries based on context.  fmt cannot interpret
that context, so either it collapses all of the paragraphs into one or,
if the -s option is used, it splits each long line into a full line
followed by a short remainder, leaving a jagged right margin.  Neither
of these is correct.

I have devised a new feature for fmt that takes such input and (in
addition to its normal blank line delineations) determines paragraph
breaks based on whether a line is significantly shorter than the average
line in the document.  That is, often, final lines in paragraphs have
fewer characters in them than other lines, leaving a (correct) ragged
right margin.  If my patched fmt is given the option to look for such
lines and sees one, it assumes that that is the end of the paragraph.

The shortness is determined by a percentage of a running average line
length.  This obviously means that the input must already be formatted
to some extent, and the user wants to run fmt to re-format it.  It also
means that lines early in the document are more likely to be incorrectly
formatted than those later on, due to the running aspect of the line 
length average.  The default percentage is 70%, but this can be changed
by the user by giving an optional argument to the option to turn on this
new feature.
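
To make the idea concrete, here is a minimal sketch of the test described
above.  It is an illustration only, not the code in the patch: the names
line_stats, is_short_line and count_line are hypothetical, and the
comparison is scaled by 100 so that integer arithmetic suffices.

```c
#include <stdbool.h>

/* Hypothetical helper type, not part of the patch: running totals of
   the line lengths seen so far.  */
struct line_stats
{
  long total_length;            /* sum of the lengths of all counted lines */
  long num_lines;               /* number of lines counted */
};

/* Return true if a line of LENGTH characters is short enough to end a
   paragraph, i.e. no longer than PERCENTAGE percent of the running
   average.  With no lines counted yet there is no average, so nothing
   is treated as short.  */
static bool
is_short_line (const struct line_stats *s, long length, int percentage)
{
  if (s->num_lines == 0)
    return false;
  /* Integer average; both sides are scaled to avoid floating point.  */
  return length * 100 <= (s->total_length / s->num_lines) * percentage;
}

/* Add a line of LENGTH characters to the running average.  */
static void
count_line (struct line_stats *s, long length)
{
  s->total_length += length;
  s->num_lines++;
}
```

Because the average only covers lines already read, the first few lines
of a document are judged against very little data, which is the weakness
mentioned above.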

This method cannot be guaranteed to be totally accurate, but it does a
much better job with the type of input it is intended to operate on than
either of the currently existing options.

I've coded this new option in the attached patch as `-n'/`--new-split'.
This is not a very good name, and I'd appreciate input.  I was going to
make it `-c'/`--context-split', but `-c' is already in use.
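
For anyone unfamiliar with optional option arguments, the following is a
rough sketch of how such an option can be parsed with getopt_long.  It is
not the code in the patch: split_opts, parse_percentage and
parse_split_opts are hypothetical names, and plain strtol stands in for
the xstrtol call that the patch uses.

```c
#include <errno.h>
#include <getopt.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical result type for this sketch; not part of the patch.  */
struct split_opts
{
  bool new_split;               /* true if -n / --new-split was given */
  int percentage;               /* threshold, 1..99; 70 unless overridden */
};

/* Parse ARG as a percentage in the range 1..99.  Return true on success.  */
static bool
parse_percentage (const char *arg, int *result)
{
  char *end;
  long value;

  errno = 0;
  value = strtol (arg, &end, 10);
  if (errno != 0 || end == arg || *end != '\0' || value <= 0 || value > 99)
    return false;
  *result = (int) value;
  return true;
}

/* Scan ARGV for -n[PCT] / --new-split[=PCT].  Return false on an
   unknown option or an invalid percentage.  */
static bool
parse_split_opts (int argc, char **argv, struct split_opts *opts)
{
  static const struct option long_options[] = {
    {"new-split", optional_argument, NULL, 'n'},
    {NULL, 0, NULL, 0}
  };
  int optchar;

  opts->new_split = false;
  opts->percentage = 70;
  while ((optchar = getopt_long (argc, argv, "n::", long_options, NULL)) != -1)
    {
      if (optchar != 'n')
        return false;
      opts->new_split = true;
      /* optarg is NULL when the optional argument was omitted.  */
      if (optarg != NULL && !parse_percentage (optarg, &opts->percentage))
        return false;
    }
  return true;
}
```

Note that with an optional argument the value has to be attached to the
option, as in `-n90' or `--new-split=90'; in `-n 90' the 90 is not taken
as the argument to -n.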

Also, I'm not 100% happy with the implementation, as there are some
optimizations that could be made, especially in the ridiculously long
line length determination on line 646.  Also, skipping the calculation of
the average line length when the new option is not in use would be
preferable.  However, it does work, and I'm interested in getting some
feedback first.
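
As a starting point for discussion, the long condition added to
same_para() can be unfolded into a helper along the following lines.
This is only a sketch: line_continues_para is a hypothetical name, the
variables below are stand-ins for the file-scope variables in the patch,
and the side effect of fix_avg() is written inline.

```c
#include <stdbool.h>

/* Stand-ins for the patch's file-scope variables.  */
static int cur_line_length;
static int last_cur_line_length = -1;
static int total_line_length;
static int num_lines;
static int percentage = 70;

/* Return true if the current line is long enough, relative to the
   running average, to continue the current paragraph.  */
static bool
line_continues_para (void)
{
  /* Nothing has been counted yet, so there is no average to compare to.  */
  if (total_line_length == 0)
    return true;
  if (num_lines <= 0)
    return false;

  /* Long enough: the paragraph continues.  */
  if (cur_line_length * 100 > (total_line_length / num_lines) * percentage)
    return true;

  /* Short line: the paragraph ends.  As fix_avg() does in the patch,
     take the previous line back out of the totals, but only when a
     previous line has been recorded.  */
  if (last_cur_line_length != -1)
    {
      --num_lines;
      total_line_length -= last_cur_line_length;
    }
  return false;
}
```

In same_para() the new term would then read
`(new_split ? line_continues_para () : TRUE)'.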

Attached are the patch (in diff -u format) and several test cases.  The
diff is against textutils-2.0.12, which appeared to be the newest
version on alpha.gnu.org.  Please let me know what you think of it. 

-Bitt

--- textutils-2.0.12/src/fmt.c.orig     Fri Mar 16 14:50:32 2001
+++ textutils-2.0.12/src/fmt.c  Fri Mar 16 14:43:26 2001
@@ -166,6 +166,7 @@
 static void put_line PARAMS ((WORD *w, int indent));
 static void put_word PARAMS ((WORD *w));
 static void put_space PARAMS ((int space));
+static bool fix_avg PARAMS ((void));
 
 /* The name this program was run with.  */
 const char *program_name;
@@ -192,6 +193,14 @@
    longer than this will each comprise a single word.  */
 static int max_width;
 
+/* If TRUE, attempt to determine new paragraphs based on short final
+   lines */
+static bool new_split;
+
+/* User-supplied percentage of the average line width used to figure
+   lines that should end a paragraph for the new_split option */
+static int percentage = 70;
+
 /* Values derived from the option values.  */
 
 /* The length of prefix minus leading space.  */
@@ -262,6 +271,19 @@
    paragraphs chosen by fmt_paragraph().  */
 static int last_line_length;
 
+/* The number of characters in the current line of input */
+static int cur_line_length;
+
+/* The number of characters in the previous line of input */
+static int last_cur_line_length;
+
+/* The number of characters in the input as read so far */
+static int total_line_length = 0;
+
+/* The number of lines of input read */
+static int num_lines = 0;
+
+
 void
 usage (int status)
 {
@@ -277,6 +299,7 @@
 \n\
 Mandatory arguments to long options are mandatory for short options too.\n\
   -c, --crown-margin        preserve indentation of first two lines\n\
+  -n, --new-split[=%WIDTH]  base paragraphs on length of end lines\n\
   -p, --prefix=STRING       combine only lines having STRING as prefix\n\
   -s, --split-only          split long lines, but do not refill\n\
   -t, --tagged-paragraph    indentation of first line different from second\n\
@@ -297,6 +320,7 @@
 static const struct option long_options[] =
 {
   {"crown-margin", no_argument, NULL, 'c'},
+  {"new-split", optional_argument, NULL, 'n'},
   {"prefix", required_argument, NULL, 'p'},
   {"split-only", no_argument, NULL, 's'},
   {"tagged-paragraph", no_argument, NULL, 't'},
@@ -319,7 +343,7 @@
 
   atexit (close_stdout);
 
-  crown = tagged = split = uniform = FALSE;
+  crown = tagged = split = uniform = new_split = FALSE;
   max_width = WIDTH;
   prefix = "";
   prefix_length = prefix_lead_space = prefix_full_length = 0;
@@ -340,7 +364,7 @@
       argc--;
     }
 
-  while ((optchar = getopt_long (argc, argv, "0123456789cstuw:p:",
+  while ((optchar = getopt_long (argc, argv, "0123456789cn::stuw:p:",
                                 long_options, NULL))
         != -1)
     switch (optchar)
@@ -378,6 +402,20 @@
        }
        break;
 
+      case 'n':
+       {
+         long int tmp_long;
+         new_split = TRUE;
+         if (optarg != NULL) {
+           if (xstrtol (optarg, NULL, 10, &tmp_long, "") != LONGINT_OK
+               || tmp_long <= 0 || tmp_long > 99)
+             error (EXIT_FAILURE, 0, _("invalid percentage: `%s'"),
+                    optarg);
+           percentage = (int) tmp_long;
+         }
+       }
+       break;
+
       case 'p':
        set_prefix (optarg);
        break;
@@ -479,6 +517,9 @@
   last_line_length = 0;
   c = next_char;
 
+  cur_line_length = 0;
+  last_cur_line_length = -1;
+
   /* Scan (and copy) blank lines, and lines not introduced by the prefix.  */
 
   while (c == '\n' || c == EOF
@@ -514,6 +555,10 @@
          other_indent = in_column;
          do
            {                   /* for each line till the end of the para */
+             num_lines++;
+             total_line_length += cur_line_length;
+             last_cur_line_length = cur_line_length;
+             cur_line_length = 0;
              c = get_line (f, c);
            }
          while (same_para (c) && in_column == other_indent);
@@ -528,6 +573,10 @@
          other_indent = in_column;
          do
            {                   /* for each line till the end of the para */
+             num_lines++;
+             total_line_length += cur_line_length;
+             last_cur_line_length = cur_line_length;
+             cur_line_length = 0;
              c = get_line (f, c);
            }
          while (same_para (c) && in_column == other_indent);
@@ -545,7 +594,13 @@
     {
       other_indent = first_indent;
       while (same_para (c) && in_column == other_indent)
-       c = get_line (f, c);
+       {
+         num_lines++;
+         total_line_length += cur_line_length;
+         last_cur_line_length = cur_line_length;
+         cur_line_length = 0;
+         c = get_line (f, c);
+       }
     }
   (word_limit - 1)->period = (word_limit - 1)->final = TRUE;
   next_char = c;
@@ -574,6 +629,7 @@
     {
       putchar (c);
       c = getc (f);
+      cur_line_length++;
     }
   return c;
 }
@@ -587,9 +643,18 @@
 {
   return (next_prefix_indent == prefix_indent
          && in_column >= next_prefix_indent + prefix_full_length
+         && (new_split?( total_line_length == 0 || ( last_cur_line_length == -1 && ( num_lines > 0 ) && ( ( cur_line_length * 100 ) > ( ((total_line_length)/num_lines) * percentage ) ) ) || ( last_cur_line_length != -1 && num_lines > 0 && ( ( cur_line_length * 100 ) > ( ((total_line_length)/num_lines) * percentage ) || fix_avg() ) ) ):TRUE)
          && c != '\n' && c != EOF);
 }
 
+static bool
+fix_avg ()
+{
+  --num_lines;
+  total_line_length -= last_cur_line_length;
+  return FALSE;
+}
+
 /* Read a line from input file F, given first non-blank character C
    after the prefix, and the following indent, and break it into words.
    A word is a maximal non-empty string of non-white characters.  A word
@@ -620,6 +685,7 @@
            flush_paragraph ();
          *wptr++ = c;
          c = getc (f);
+         cur_line_length++;
        }
       while (c != EOF && !ISSPACE (c));
       in_column += word_limit->length = wptr - word_limit->text;
@@ -656,6 +722,7 @@
 
   in_column = 0;
   c = get_space (f, getc (f));
+  cur_line_length++;
   if (prefix_length == 0)
     next_prefix_indent = prefix_lead_space < in_column ?
       prefix_lead_space : in_column;
@@ -668,6 +735,7 @@
            return c;
          in_column++;
          c = getc (f);
+         cur_line_length++;
        }
       c = get_space (f, c);
     }
@@ -692,6 +760,7 @@
       else
        return c;
       c = getc (f);
+      cur_line_length++;
     }
 }
 
--- textutils-2.0.12/doc/textutils.texi.orig    Fri Mar 16 14:55:31 2001
+++ textutils-2.0.12/doc/textutils.texi Fri Mar 16 14:42:16 2001
@@ -885,6 +885,23 @@
 lines within a paragraph, and align the left margin of each subsequent
 line with that of the second line.
 
+@item address@hidden
+@itemx address@hidden
+@opindex -n
+@opindex --new-split
+Determine end of paragraph based on line lengths.  By default, @code{fmt}
+determines where new paragraphs begin based on blank lines.  This option
+adds the ability for @code{fmt} to decide that the end of a paragraph has
+been reached by noticing that its last line is significantly shorter than
+the average line.  A line counts as a final line when it is shorter than a
+given percentage of the average.  The default is 70%, but this can be changed
+by passing an argument to the option in the form of 100 times the fraction.
+That is, 90% would be signified by 90, not 0.9.  This feature only works for
+already formatted text, and can still fail often, but this problem occurs
+often enough that even a rudimentary start can be helpful.  Also, the average
+line length is determined as the file is parsed, so irregularities early in
+the file can have more significant effects than those later on.
+
 @item -t
 @itemx --tagged-paragraph
 @opindex -t

This is a canonicalization of the example text where fmt -s will work properly.  Also, the three paragraphs are, in order, the first and the beginning of the second paragraph of the United States' Declaration of Independence and the Preamble to the Constitution of the United States.  Vive la mindless nationalism!

When in the Course of human events, it becomes necessary for one people to dissolve the political bands which have connected them with another, and to assume among the powers of the earth, the separate and equal station to which the Laws of Nature and of Nature's God entitle them, a decent respect to the opinions of mankind requires that they should declare the causes which impel them to the separation.
We hold these truths to be self-evident, that all men are created equal, that they are endowed by their Creator with certain unalienable Rights, that among these are Life, Liberty and the pursuit of Happiness.
We the People of the United States, in Order to form a more perfect Union, establish Justice, insure domestic Tranquility, provide for the common defence, promote the general Welfare, and secure the Blessings of Liberty to ourselves and our Posterity, do ordain and establish this Constitution for the United States of America.

This is an example of where the default fmt will not work properly, but my new feature, with
standard options does.

When in the Course of human events, it becomes necessary for one people to dissolve the political
bands which have connected them with another, and to assume among the powers of the earth, the
separate and equal station to which the Laws of Nature and of Nature's God entitle them, a decent
respect to the opinions of mankind requires that they should declare the causes which impel them
to the separation.
We hold these truths to be self-evident, that all men are created equal, that they are endowed by
their Creator with certain unalienable Rights, that among these are Life, Liberty and the pursuit
of Happiness.
We the People of the United States, in Order to form a more perfect Union, establish Justice,
insure domestic Tranquility, provide for the common defence, promote the general Welfare, and
secure the Blessings of Liberty to ourselves and our Posterity, do ordain and establish this
Constitution for the United States of America.

This is an example of where the default fmt will not work properly, but my new feature,
with a percentage of 90 does, but, with the standard 70%, does not.

When in the Course of human events, it becomes necessary for one people to dissolve the
political bands which have connected them with another, and to assume among the powers
of the earth, the separate and equal station to which the Laws of Nature and of Nature's
God entitle them, a decent respect to the opinions of mankind requires that they
should declare the causes which impel them to the separation.
We hold these truths to be self-evident, that all men are created equal, that they are
endowed by their Creator with certain unalienable Rights, that among these are Life,
Liberty and the pursuit of Happiness.
We the People of the United States, in Order to form a more perfect Union, establish
Justice, insure domestic Tranquility, provide for the common defence, promote the
general Welfare, and secure the Blessings of Liberty to ourselves and our Posterity, do
ordain and establish this Constitution for the United States of America.

check.sh
Description: Bourne shell script
