From 5e303b170cd8f6a0a34a30d724e4c7d7a444ef32 Mon Sep 17 00:00:00 2001 From: Paul Eggert Date: Sun, 23 Feb 2014 22:49:27 -0800 Subject: [PATCH 2/2] diff: remove TOO_EXPENSIVE heuristic Problem reported by Vincent Lefevre in . The simplest solution is to remove the TOO_EXPENSIVE heuristic that I added to GNU diff in 1993. Although appropriate for circa-1993 hardware, these days the heuristic seems to be more trouble than it's worth. * NEWS: Document this. * doc/diffutils.texi (Overview): Modernize citations. Remove mention of TOO_EXPENSIVE heuristic. * src/analyze.c (diff_2_files): Adjust to TOO_EXPENSIVE-related API changes in gnulib's diffseq module. --- NEWS | 7 +++++++ doc/diffutils.texi | 18 +++++++++--------- src/analyze.c | 10 +--------- 3 files changed, 17 insertions(+), 18 deletions(-) diff --git a/NEWS b/NEWS index aff7a9d..58a7cbb 100644 --- a/NEWS +++ b/NEWS @@ -13,6 +13,13 @@ GNU diffutils NEWS -*- outline -*- consider two Asian file names to be the same merely because they contain no English characters. +** Performance changes + + diff's default algorithm has been adjusted to output higher-quality + results at somewhat greater computational cost, as CPUs have gotten + faster since the algorithm was last tweaked in diffutils-2.6 (1993). + + * Noteworthy changes in release 3.3 (2013-03-24) [stable] ** New features diff --git a/doc/diffutils.texi b/doc/diffutils.texi index 1d48f54..7b08cd4 100644 --- a/doc/diffutils.texi +++ b/doc/diffutils.texi @@ -142,26 +142,26 @@ use diffs to update files. David Hayes, Richard Stallman, and Len Tower. Wayne Davison designed and implemented the unified output format. The basic algorithm is described by Eugene W. Myers in ``An O(ND) Difference Algorithm and its Variations'', address@hidden Vol.@: 1 No.@: 2, 1986, pp.@: 251--266; and in ``A File address@hidden Vol.@: 1, 1986, pp.@: 251--266, address@hidden://dx.doi.org/10.1007/BF01840446}; and in ``A File Comparison Program'', Webb Miller and Eugene W. Myers, address@hidden and Experience} Vol.@: 15 No.@: 11, 1985, -pp.@: 1025--1040. address@hidden and Experience} Vol.@: 15, 1985, +pp.@: 1025--1040, address@hidden://dx.doi.org/10.1002/spe.4380151102}. @c From: "Gene Myers" @c They are about the same basic algorithm; the Algorithmica @c paper gives a rigorous treatment and the sub-algorithm for @c delivering scripts and should be the primary reference, but @c both should be mentioned. -The algorithm was independently discovered as described by E. Ukkonen in +The algorithm was independently discovered as described by Esko Ukkonen in ``Algorithms for Approximate String Matching'', address@hidden and Control} Vol.@: 64, 1985, pp.@: 100--118. address@hidden and Control} Vol.@: 64, 1985, pp.@: 100--118, address@hidden://dx.doi.org/10.1016/S0019-9958(85)80046-2}. @c From: "Gene Myers" @c Date: Wed, 29 Sep 1993 08:27:55 MST @c Ukkonen should be given credit for also discovering the algorithm used @c in GNU diff. -Unless the @option{--minimal} option is used, @command{diff} uses a -heuristic by Paul Eggert that limits the cost to @math{O(N^1.5 log N)} -at the price of producing suboptimal output for large inputs with many -differences. Related algorithms are surveyed by Alfred V. Aho in +Related algorithms are surveyed by Alfred V. Aho in section 6.3 of ``Algorithms for Finding Patterns in Strings'', @cite{Handbook of Theoretical Computer Science} (Jan Van Leeuwen, ed.), Vol.@: A, @cite{Algorithms and Complexity}, Elsevier/MIT Press, diff --git a/src/analyze.c b/src/analyze.c index 1886e7e..f062fd3 100644 --- a/src/analyze.c +++ b/src/analyze.c @@ -532,7 +532,6 @@ diff_2_files (struct comparison *cmp) { struct context ctxt; lin diags; - lin too_expensive; /* Allocate vectors for the results of comparison: a flag for each line of each file, saying whether that line @@ -564,18 +563,11 @@ diff_2_files (struct comparison *cmp) ctxt.heuristic = speed_large_files; - /* Set TOO_EXPENSIVE to be approximate square root of input size, - bounded below by 256. */ - too_expensive = 1; - for (; diags != 0; diags >>= 2) - too_expensive <<= 1; - ctxt.too_expensive = MAX (256, too_expensive); - files[0] = cmp->file[0]; files[1] = cmp->file[1]; compareseq (0, cmp->file[0].nondiscarded_lines, - 0, cmp->file[1].nondiscarded_lines, minimal, &ctxt); + 0, cmp->file[1].nondiscarded_lines, &ctxt); free (ctxt.fdiag - (cmp->file[1].nondiscarded_lines + 1)); -- 1.8.5.3