emacs-devel

Extending the ecomplete.el data store.


From: Karl Fogel
Subject: Extending the ecomplete.el data store.
Date: Sun, 04 Feb 2018 00:16:32 -0600
User-agent: Gnus/5.13 (Gnus v5.13) Emacs/27.0.50 (gnu/linux)

This post's primary audience is Lars Ingebrigtsen -- we agreed to move this 
thread over here from Emacs Tangents [1] -- though of course anyone's welcome 
to join in.

Some context for everyone else: after I wrote mailaprop [2] to do prioritized 
autofill for email addresses, Lars mentioned his ecomplete.el, which is part of 
Emacs.  Ecomplete offers similar functionality, although its UI is 
minibuffer-based rather than tooltip-based, and it uses a different address 
prioritization algorithm from mailaprop.

Lars, I'd like to propose extending the data stored by ecomplete.el so that it 
supports the union of the data needed by ecomplete and that needed by 
mailaprop.  (Mostly what mailaprop stores is a superset of what ecomplete 
stores, with one exception; more on that below.)

An ecomplete record looks like this:

  (KEY  TIMES_USED  LAST_TIME_USED  STRING)

Here is an example (the `mail' at the front is so that you could have one alist 
of things for `mail' and another for, say, `twitter', etc):

  ((mail
   ("address@hidden" 381 1516109510 "Lars Ingebrigtsen <address@hidden>")
   ("address@hidden" 10 1516065455 "Karl Fogel <address@hidden>")
   ...
   ))
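
As a rough sketch of how a record in this shape can be read (this is not code from ecomplete.el; the `my-' names and the example.org addresses are made up for illustration):

```elisp
;; Placeholder database in the shape shown above.
(defvar my-sample-ecomplete-db
  '((mail
     ("lars@example.org" 381 1516109510 "Lars <lars@example.org>")
     ("karl@example.org" 10 1516065455 "Karl <karl@example.org>"))))

(defun my-ecomplete-lookup (db type key)
  "Return the (KEY TIMES_USED LAST_TIME_USED STRING) record for KEY.
DB is an ecomplete-style alist; TYPE is a symbol such as `mail'."
  ;; `assq' picks the sub-alist for TYPE; `assoc' compares KEY with `equal'.
  (assoc key (cdr (assq type db))))

(let ((rec (my-ecomplete-lookup my-sample-ecomplete-db 'mail
                                "karl@example.org")))
  (list (nth 1 rec) (nth 2 rec) (nth 3 rec)))
;; => (10 1516065455 "Karl <karl@example.org>")
```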

Meanwhile, a mailaprop on-disk record looks like this:

  (KEY
   ((VARIANT  LAST_TIME_USED  SENT_COUNT  RECEIVED_COUNT)
    ...))

Here's an example of a key with three variants:

  ("address@hidden"
   (("address@hidden"                       "2017 Jun 12"  29 31)
    ("A. Szymanowski <address@hidden>"      "2017 Sep 03"   1  0)
    ("Abilene Szymanowski <address@hidden>" "2018 Jan 15"   8  7)))

Let's ignore the fact that ecomplete stores dates as seconds-since-epoch while 
mailaprop uses human-readable strings; I'd be happy to switch mailaprop to the 
ecomplete way for that.  We'll just focus on substantive differences here.

At the individual record level, the mailaprop information is a superset of the 
ecomplete information in two ways:

* Mailaprop remembers all the real-name variations and case variations 
individually, including case variations in the email address portion as well as 
in the real name portion.  So each variation gets its own record, but they're 
all tied together under the same case-folded KEY so they can be scored 
together.  (Contrast with ecomplete, where I believe `ecomplete-add-item' just 
remembers the most recently-seen variant for a given key.)

* Mailaprop splits TIMES_USED into SENT_COUNT and RECEIVED_COUNT, that is, 
the number of times the user has sent to the address in question, and the 
number of times the user has received mail from that address.
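
To make the variant grouping concrete, here is a minimal sketch that sums the sent and received counts over every variant filed under one key.  It only illustrates the record shape; it is not mailaprop's actual scoring code, and the function name and addresses are hypothetical:

```elisp
(defun my-mailaprop-totals (record)
  "Return (SENT . RECEIVED) summed over every variant in RECORD.
RECORD has the shape (KEY ((VARIANT LAST_TIME_USED SENT RECEIVED) ...))."
  (let ((sent 0) (received 0))
    (dolist (variant (nth 1 record))
      (setq sent (+ sent (nth 2 variant))
            received (+ received (nth 3 variant))))
    (cons sent received)))

(my-mailaprop-totals
 '("abilene@example.org"
   (("abilene@example.org" "2017 Jun 12" 29 31)
    ("A. Szymanowski <abilene@example.org>" "2017 Sep 03" 1 0)
    ("Abilene Szymanowski <abilene@example.org>" "2018 Jan 15" 8 7))))
;; => (38 . 38)
```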

At the next level up, ecomplete stores a piece of information that mailaprop 
does not:

* Ecomplete starts the alist with a symbol that offers the possibility of 
multiple types of records, e.g., `mail', `twitter', etc.

So, here's a proposal for a unified format that supports both packages -- this 
format is more verbose but more extensible:

  (KEY          ; string: downcased email addr
    ((VARIANT   ; string: case-preserving address w/ real name
       (TYPE                                  ; symbol: `mail', etc
         (last-sent  LAST_TIME_SENT_TO)       ; int: seconds since epoch
         (last-recv  LAST_TIME_RECEIVED_FROM) ; int: seconds since epoch
         (sent-count SENT_COUNT)              ; int: total times sent
         (recv-count RECEIVED_COUNT)          ; int: total times received
       )
       ...further TYPEs could go here...
     )
     ...further VARIANTs here...
    )
    ...[reserved, in case we ever need something other than VARIANTs]...
  )

That's the format for one record; the master record file is just a list of 
elements of the above type.
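
For illustration, here is one record written out in that format, plus a small accessor.  This is only a sketch: the field names are written as bare symbols on the assumption that the file is read as data with `read', and the timestamps, addresses, and `my-' names are placeholders:

```elisp
(defvar my-unified-record
  '("abilene@example.org"
    (("A. Szymanowski <abilene@example.org>"
      (mail (last-sent 1504396800) (last-recv 1504310400)
            (sent-count 1) (recv-count 2)))
     ("Abilene Szymanowski <abilene@example.org>"
      (mail (last-sent 1515974400) (last-recv 1515888000)
            (sent-count 8) (recv-count 7)))))
  "One placeholder record in the proposed unified format.")

(defun my-unified-field (record variant type field)
  "Return FIELD for TYPE under VARIANT in RECORD, or nil if absent."
  (let* ((variants (nth 1 record))
         (types (cdr (assoc variant variants)))
         (fields (cdr (assq type types))))
    (cadr (assq field fields))))

(my-unified-field my-unified-record
                  "Abilene Szymanowski <abilene@example.org>"
                  'mail 'sent-count)
;; => 8
```

Because every level is looked up by name, a new TYPE or a new field can be added without disturbing readers that don't know about it.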

This format offers many possibilities for creative scoring mechanisms, and is 
more easily extensible than either package's current format.

If we unify the format, we should probably unify on one default record file 
too.  Right now, `ecomplete-database-file' defaults to ~/.ecompleterc or 
~/.emacs.d/ecompleterc, whereas `mailaprop-address-file' has no default and 
the user must set it manually.  Email addresses are pretty private, and I 
didn't want to guess about what locations would be confidential enough.  
I'd be happy to just have mailaprop use ecomplete's defaults for the database 
file, though.  The privacy concern can be addressed with documentation.

Now, about database maintenance:

Mailaprop adds new addresses to the database using a different mechanism than 
ecomplete uses.  Mailaprop users run an asynchronous script that reads all of 
their email and generates the database.  Ecomplete watches email as it comes 
and goes in Emacs, and automagically keeps its database up-to-date.  (I don't 
think ecomplete has any way to "catch up to the present" when you start using 
it; you just start out with no email addresses, and it watches everything you 
do from then on.)

These two methods of database maintenance are basically compatible.  In fact, 
one could use mailaprop's script to generate the database the first time, and 
then depend on ecomplete to keep it up-to-date after that.  As long as we 
document what's going on, and each package uses its current defaults, I think 
we're fine.  Those who use ecomplete will still get what they've been getting, 
and those who use mailaprop can either use the mailaprop way of periodically 
updating the database, or they can ask ecomplete to maintain it in real time 
for them (this might necessitate a trivial flag in ecomplete to get it to 
maintain the database while not offering completion, for those who want a 
mailaprop-style popup-autofill UI, but that's easy to do).
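
A real-time update against the unified format might look roughly like the sketch below.  It is not ecomplete's code; the function name is hypothetical, and using 0 for "never seen" timestamps is an assumption made only for this example:

```elisp
(defun my-note-sent (db key variant now)
  "Record in DB that mail was sent to VARIANT (filed under KEY) at NOW.
DB is a list of unified-format records.  It is modified in place and
the possibly-extended list is returned."
  (let ((record (assoc key db)))
    (unless record                      ; first time we see this address
      (setq record (list key nil))
      (setq db (cons record db)))
    (let ((entry (assoc variant (nth 1 record))))
      (unless entry                     ; first time we see this variant
        (setq entry (list variant))
        (setcar (cdr record) (cons entry (nth 1 record))))
      (let ((fields (assq 'mail (cdr entry))))
        (unless fields                  ; no `mail' data yet: start at zero
          (setq fields (list 'mail
                             (list 'last-sent 0) (list 'last-recv 0)
                             (list 'sent-count 0) (list 'recv-count 0)))
          (setcdr entry (cons fields (cdr entry))))
        (setcar (cdr (assq 'last-sent (cdr fields))) now)
        (let ((cell (cdr (assq 'sent-count (cdr fields)))))
          (setcar cell (1+ (car cell))))))
    db))

(let ((db nil))
  (setq db (my-note-sent db "karl@example.org" "Karl <karl@example.org>" 100))
  (setq db (my-note-sent db "karl@example.org" "Karl <karl@example.org>" 200))
  db)
;; => (("karl@example.org"
;;      (("Karl <karl@example.org>"
;;        (mail (last-sent 200) (last-recv 0)
;;              (sent-count 2) (recv-count 0))))))
```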

I guess we would also switch to UTF-8 for the coding system for the database?  
(Right now `ecomplete-database-file-coding-system' defaults to `iso-2022-7bit'.)

Note that ecomplete would have to add code to convert the new on-disk format to 
the in-memory format that ecomplete currently uses.  That is, this function...

  (defun ecomplete-setup ()
    "Read the .ecompleterc file."
    (when (file-exists-p ecomplete-database-file)
      (with-temp-buffer
        (let ((coding-system-for-read ecomplete-database-file-coding-system))
          (insert-file-contents ecomplete-database-file)
          (setq ecomplete-database (read (current-buffer)))))))
  
...would need to be supplemented with something that does what 
`mailaprop-digest-raw-addresses' does in mailaprop, and the reverse for writing 
the data out.  Obviously, this proposed new format is pretty easily convertible 
to and from ecomplete's in-memory representation.
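
One possible shape for the read-side conversion is sketched below.  The folding policy is an assumption, not something either package does today: TIMES_USED is taken as the sum of all sent and received counts, LAST_TIME_USED as the newest timestamp, and STRING as the most recently used variant.  The function name is hypothetical:

```elisp
(defun my-unified-to-ecomplete (records type)
  "Convert unified-format RECORDS to one ecomplete alist entry for TYPE.
Each result element is (KEY TIMES_USED LAST_TIME_USED STRING)."
  (let (result)
    (dolist (record records)
      (let ((times 0) (latest 0) (string nil))
        (dolist (variant (nth 1 record))
          (let ((fields (cdr (assq type (cdr variant)))))
            (when fields
              (let ((used (max (cadr (assq 'last-sent fields))
                               (cadr (assq 'last-recv fields)))))
                (setq times (+ times
                               (cadr (assq 'sent-count fields))
                               (cadr (assq 'recv-count fields))))
                ;; Keep the variant with the newest timestamp.
                (when (or (null string) (> used latest))
                  (setq latest used
                        string (car variant)))))))
        ;; Skip keys that have no data for TYPE at all.
        (when string
          (push (list (car record) times latest string) result))))
    (cons type (nreverse result))))

(my-unified-to-ecomplete
 '(("abilene@example.org"
    (("A. Szymanowski <abilene@example.org>"
      (mail (last-sent 100) (last-recv 50) (sent-count 1) (recv-count 2)))
     ("Abilene Szymanowski <abilene@example.org>"
      (mail (last-sent 300) (last-recv 200) (sent-count 8) (recv-count 7))))))
 'mail)
;; => (mail ("abilene@example.org" 18 300
;;           "Abilene Szymanowski <abilene@example.org>"))
```

Going in this direction discards the per-variant detail, so writing the data back out would need to start from the richer structure rather than from the folded one.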

Whew, okay, those are my thoughts.  I'm not sure whether it will ever make 
sense to unify the two packages themselves, but in any case using the same 
on-disk format would be a good move.

Modifications or counterproposals welcome of course, and it's also perfectly 
okay to say "Thanks, but this isn't worth the trouble." :-).  These two 
packages are so close in functionality and data that it seems a shame for them 
not to share a datastore, but we may just decide it's too much effort.  If we 
decide that, we should at least put pointers in each package mentioning the 
other, and this thread, so future programmers at least have their attention 
drawn to the redundancy before making further enhancements.

Best regards,
-Karl

[1] https://lists.gnu.org/archive/html/emacs-tangents/2018-01/msg00023.html

[2] https://lists.gnu.org/archive/html/emacs-tangents/2018-01/msg00003.html


