bug-gnu-utils
[Top][All Lists]
Advanced

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: msgmerge speedup: fstrcmp and diffseq improvements


From: Bruno Haible
Subject: Re: msgmerge speedup: fstrcmp and diffseq improvements
Date: Mon, 15 Sep 2008 01:50:52 +0200
User-agent: KMail/1.5.4

Hello Ralf,

> I see some possibilities for improvement:
> 
> 1) Avoid computing distances that are known to be worse than the current
> best one.

Implemented, thanks to your patch. A big boost!

> 2) Abort the distance computation as soon as the result is known to be
> suboptimal.

Implemented, thanks to your second patch.

> 3) When computing one string pair distance, avoid computing the path
> (i.e., the edit script).

We are already doing this optimization: The NOTE_INSERT and NOTE_DELETE
macros don't store any object in memory; they only increment a counter.

> 4) Avoid quadratic scaling in the total number of msgids (i.e., number
> of strings in .po file times number of strings in the .pot file).
> There are several ways to do this, I'll only give a couple of ideas that
> could be tried out when this issue becomes pressing.  ...
> 
> You have two sets S1 and S2 (of msgids), and a metric to compute
> distances between any two members of the union of both sets (in this
> case the metric is L, not r).  For each member of the first set (from
> the .po file), you search for the nearest neighbor in the second set.
> A mapping of the metric space into another one in which the sets may
> be partitioned efficiently, may admit fast nearest neighbor search
> algorithms, which exist even for very high-dimensional spaces.

Reducing the closest-neighbor problem to a clustering problem is probably
not promising: algorithms for clustering are known to be efficient when
the domain provides some partitioning criteria, but if it doesn't - like
here in the case of arbitrary strings - the clustering is hard.

But here is a different idea. What you have implemented is the "alpha"-
pruning of a chess playing program: stop analyzing moves which are worse
than the best move known so far. The second trick that chess programs use
is to sort the order in which they consider the moves so that the good
moves get likely analyzed first - with the consequence that more mediocre
moves can be discarded quickly. When I apply this idea to msgmerge (attached
draft patch), I get once more a 2x speedup:
  "msgmerge af.po coreutils.pot"
  16.5 sec. with all your patches,
   7.6 sec. with this additional sorting.

> 5) Use a different similarity measure.  Not considered here.

It is the different similarity measure (in this case: from a 4-gram index)
which can help sorting the list of message to be considered.

Bruno


*** msgl-fsearch.h      7 Oct 2007 19:35:29 -0000       1.2
--- msgl-fsearch.h      14 Sep 2008 23:33:16 -0000
***************
*** 1,5 ****
  /* Fast fuzzy searching among messages.
!    Copyright (C) 2006 Free Software Foundation, Inc.
     Written by Bruno Haible <address@hidden>, 2006.
  
     This program is free software: you can redistribute it and/or modify
--- 1,5 ----
  /* Fast fuzzy searching among messages.
!    Copyright (C) 2006, 2008 Free Software Foundation, Inc.
     Written by Bruno Haible <address@hidden>, 2006.
  
     This program is free software: you can redistribute it and/or modify
***************
*** 36,42 ****
     inside it must not be modified while the returned fuzzy index is in use.  
*/
  extern message_fuzzy_index_ty *
         message_fuzzy_index_alloc (const message_list_ty *mlp,
!                                 const char *canon_charset);
  
  /* Find a good match for the given msgctxt and msgid in the given fuzzy index.
     The match does not need to be optimal.  */
--- 36,43 ----
     inside it must not be modified while the returned fuzzy index is in use.  
*/
  extern message_fuzzy_index_ty *
         message_fuzzy_index_alloc (const message_list_ty *mlp,
!                                 const char *canon_charset,
!                                 bool heuristic);
  
  /* Find a good match for the given msgctxt and msgid in the given fuzzy index.
     The match does not need to be optimal.  */
*** msgl-fsearch.c      14 Sep 2008 21:41:31 -0000      1.4
--- msgl-fsearch.c      14 Sep 2008 23:33:17 -0000
***************
*** 23,28 ****
--- 23,29 ----
  #include "msgl-fsearch.h"
  
  #include <math.h>
+ #include <stdbool.h>
  #include <stdlib.h>
  
  #include "xalloc.h"
***************
*** 191,196 ****
--- 192,198 ----
    message_ty **messages;
    character_iterator_t iterator;
    hash_table gram4;
+   bool heuristic;
    size_t firstfew;
    message_list_ty *short_messages[SHORT_MSG_MAX + 1];
  };
***************
*** 200,206 ****
     inside it must not be modified while the returned fuzzy index is in use.  
*/
  message_fuzzy_index_ty *
  message_fuzzy_index_alloc (const message_list_ty *mlp,
!                          const char *canon_charset)
  {
    message_fuzzy_index_ty *findex = XMALLOC (message_fuzzy_index_ty);
    size_t count = mlp->nitems;
--- 202,209 ----
     inside it must not be modified while the returned fuzzy index is in use.  
*/
  message_fuzzy_index_ty *
  message_fuzzy_index_alloc (const message_list_ty *mlp,
!                          const char *canon_charset,
!                          bool heuristic)
  {
    message_fuzzy_index_ty *findex = XMALLOC (message_fuzzy_index_ty);
    size_t count = mlp->nitems;
***************
*** 294,302 ****
        }
    }
  
!   findex->firstfew = (int) sqrt ((double) count);
!   if (findex->firstfew < 10)
!     findex->firstfew = 10;
  
    /* Setup lists of short messages.  */
    for (l = 0; l <= SHORT_MSG_MAX; l++)
--- 297,312 ----
        }
    }
  
!   findex->heuristic = heuristic;
! 
!   if (heuristic)
!     {
!       findex->firstfew = (int) sqrt ((double) count);
!       if (findex->firstfew < 10)
!       findex->firstfew = 10;
!     }
!   else
!     findex->firstfew = count;
  
    /* Setup lists of short messages.  */
    for (l = 0; l <= SHORT_MSG_MAX; l++)
***************
*** 541,564 ****
                    size_t count;
                    struct mult_index *ptr;
                    message_ty *best_mp;
                    double best_weight;
  
!                   count = findex->firstfew;
!                   if (count > accu.nitems)
                      count = accu.nitems;
  
                    best_weight = FUZZY_THRESHOLD;
                    best_mp = NULL;
                    for (ptr = accu.item; count > 0; ptr++, count--)
                      {
!                       message_ty *mp = findex->messages[ptr->index];
                        double weight =
                          fuzzy_search_goal_function (mp, msgctxt, msgid,
                                                      best_weight);
  
!                       if (weight > best_weight)
                          {
                            best_weight = weight;
                            best_mp = mp;
                          }
                      }
--- 551,587 ----
                    size_t count;
                    struct mult_index *ptr;
                    message_ty *best_mp;
+                   index_ty best_index;
                    double best_weight;
  
!                   if (findex->heuristic)
!                     {
!                       count = findex->firstfew;
!                       if (count > accu.nitems)
!                         count = accu.nitems;
!                     }
!                   else
                      count = accu.nitems;
  
                    best_weight = FUZZY_THRESHOLD;
                    best_mp = NULL;
+                   best_index = 0;
                    for (ptr = accu.item; count > 0; ptr++, count--)
                      {
!                       index_ty index = ptr->index;
!                       message_ty *mp = findex->messages[index];
                        double weight =
                          fuzzy_search_goal_function (mp, msgctxt, msgid,
                                                      best_weight);
  
!                       if (weight > best_weight
!                           || (!findex->heuristic
!                               && weight == best_weight
!                               && best_mp != NULL
!                               && index < best_index))
                          {
                            best_weight = weight;
+                           best_index = index;
                            best_mp = mp;
                          }
                      }
*** msgmerge.c  14 Sep 2008 21:41:31 -0000      1.61
--- msgmerge.c  14 Sep 2008 23:33:18 -0000
***************
*** 679,684 ****
--- 679,691 ----
       from the compendiums.  Each message list has a built-in hash table,
       for speed when doing the exact searches.  */
    message_list_list_ty *lists;
+ 
+   /* A fuzzy index of the first among the message lists.  */
+   message_fuzzy_index_ty *currfindex;
+   /* A once-only execution guard for the initialization of the fuzzy index.
+      Needed for OpenMP.  */
+   gl_lock_define(, currfindex_init_lock)
+ 
    /* A fuzzy index of the compendiums, for speed when doing fuzzy searches.
       Used only if use_fuzzy_matching is true and compendiums != NULL.  */
    message_fuzzy_index_ty *findex;
***************
*** 696,706 ****
--- 703,737 ----
    message_list_list_append (definitions->lists, NULL);
    if (compendiums != NULL)
      message_list_list_append_list (definitions->lists, compendiums);
+   definitions->currfindex = NULL;
+   gl_lock_init (definitions->currfindex_init_lock);
    definitions->findex = NULL;
    gl_lock_init (definitions->findex_init_lock);
    definitions->canon_charset = canon_charset;
  }
  
+ /* Return the current list of non-compendium messages.  */
+ static inline message_list_ty *
+ definitions_current_list (const definitions_ty *definitions)
+ {
+   return definitions->lists->item[0];
+ }
+ 
+ /* Create the current fuzzy index.
+    Used only if use_fuzzy_matching is true.  */
+ static inline void
+ definitions_init_currfindex (definitions_ty *definitions)
+ {
+   /* Protect against concurrent execution.  */
+   gl_lock_lock (definitions->currfindex_init_lock);
+   if (definitions->currfindex == NULL)
+     definitions->currfindex =
+       message_fuzzy_index_alloc (definitions_current_list (definitions),
+                                definitions->canon_charset,
+                                false);
+   gl_lock_unlock (definitions->currfindex_init_lock);
+ }
+ 
  /* Create the fuzzy index.
     Used only if use_fuzzy_matching is true and compendiums != NULL.  */
  static inline void
***************
*** 727,749 ****
  
        /* Create the fuzzy index from it.  */
        definitions->findex =
!       message_fuzzy_index_alloc (all_compendium, definitions->canon_charset);
      }
    gl_lock_unlock (definitions->findex_init_lock);
  }
  
- /* Return the current list of non-compendium messages.  */
- static inline message_list_ty *
- definitions_current_list (const definitions_ty *definitions)
- {
-   return definitions->lists->item[0];
- }
- 
  /* Set the current list of non-compendium messages.  */
  static inline void
  definitions_set_current_list (definitions_ty *definitions, message_list_ty 
*mlp)
  {
    definitions->lists->item[0] = mlp;
  }
  
  /* Exact search.  */
--- 758,779 ----
  
        /* Create the fuzzy index from it.  */
        definitions->findex =
!       message_fuzzy_index_alloc (all_compendium, definitions->canon_charset,
!                                  true);
      }
    gl_lock_unlock (definitions->findex_init_lock);
  }
  
  /* Set the current list of non-compendium messages.  */
  static inline void
  definitions_set_current_list (definitions_ty *definitions, message_list_ty 
*mlp)
  {
    definitions->lists->item[0] = mlp;
+   if (definitions->currfindex != NULL)
+     {
+       message_fuzzy_index_free (definitions->currfindex);
+       definitions->currfindex = NULL;
+     }
  }
  
  /* Exact search.  */
***************
*** 760,768 ****
  definitions_search_fuzzy (definitions_ty *definitions,
                          const char *msgctxt, const char *msgid)
  {
!   message_ty *mp1 =
      message_list_search_fuzzy (definitions_current_list (definitions),
                               msgctxt, msgid);
    if (compendiums != NULL)
      {
        message_ty *mp2;
--- 790,807 ----
  definitions_search_fuzzy (definitions_ty *definitions,
                          const char *msgctxt, const char *msgid)
  {
!   message_ty *mp1;
! 
!   /* Create the fuzzy index lazily.  */
!   if (definitions->currfindex == NULL)
!     definitions_init_currfindex (definitions);
! #if 0
!   mp1 =
      message_list_search_fuzzy (definitions_current_list (definitions),
                               msgctxt, msgid);
+ #else
+   mp1 = message_fuzzy_index_search (definitions->currfindex, msgctxt, msgid);
+ #endif
    if (compendiums != NULL)
      {
        message_ty *mp2;
***************
*** 788,793 ****
--- 827,834 ----
  definitions_destroy (definitions_ty *definitions)
  {
    message_list_list_free (definitions->lists, 2);
+   if (definitions->currfindex != NULL)
+     message_fuzzy_index_free (definitions->currfindex);
    if (definitions->findex != NULL)
      message_fuzzy_index_free (definitions->findex);
  }





reply via email to

[Prev in Thread] Current Thread [Next in Thread]