[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]
[bug#73220] [PATCH] ui: Add more nuance to relevance scoring.
From: |
Simon Tournier |
Subject: |
[bug#73220] [PATCH] ui: Add more nuance to relevance scoring. |
Date: |
Fri, 13 Sep 2024 16:12:19 +0200 |
Hi,
On Fri, 13 Sep 2024 at 03:02, aurtzy <aurtzy@gmail.com> wrote:
> Fixes <https://issues.guix.gnu.org/70689>.
Thanks!
> | Keyword(s) with poor | Expectations |
> | results before | |
> |-----------------------+-----------------------------------------------|
> | dig | ~bind~ near top. |
Hum, indeed and I do not know if we can improve here. Well, it’s hard
to improve for short terms, BTW.
--8<---------------cut here---------------start------------->8---
$ ./pre-inst-env guix search dig | recsel -p name,relevance | head -8
name: go-go-uber-org-dig
relevance: 104
name: rust-num-bigint-dig
relevance: 78
name: rust-num-bigint-dig
relevance: 78
--8<---------------cut here---------------end--------------->8---
Compared to current:
--8<---------------cut here---------------start------------->8---
$ guix search dig | recsel -p name,relevance | head -8
name: sysdig
relevance: 24
name: texlive-pedigree-perl
relevance: 13
name: ruby-net-http-digest-auth
relevance: 13
--8<---------------cut here---------------end--------------->8---
Indeed, 17th position is better than 609th. But if you add a term as
’dns’, bang! :-) Well, BTW the description of ’bind’ could be a bit
improved because the word network does not appear. Anyway. :-)
Hum, why this:
guix search ' dig$' dig | recsel -p name,relevance | head -8
does not return the package ’bind’?
> | rsh | ~inetutils~ near top. |
--8<---------------cut here---------------start------------->8---
$ ./pre-inst-env guix search rsh | recsel -p name,relevance | head -8
name: inetutils
relevance: 26
name: emacs-tramp
relevance: 26
name: rust-borsh-schema-derive-internal
relevance: 22
--8<---------------cut here---------------end--------------->8---
Compared to current:
--8<---------------cut here---------------start------------->8---
$ guix search rsh | recsel -p name,relevance | head -8
name: go-sigs-k8s-io-yaml
relevance: 14
name: python-pymarshal
relevance: 13
name: emacs-powershell
relevance: 13
--8<---------------cut here---------------end--------------->8---
> | c | C language related. |
> | c compiler | Compiler-related C stuff. |
This cannot be improved.
> | r | R language related. |
Usually, I add the prefix ^r\- and I do not have issue with search for r
packages. For instance, search ^r\- keyword and it works well.
$ guix search ^r\- cyto | recsel -CP name | cut -f1 -d'-' | uniq -c
29 r
Somehow, I do not think we can improve here. I mean, the improvement is
to document the usage of prefixes. Similarly for ghc (haskell), ocaml,
python, etc.
> | tor | Tor related; ~torbrowser~ somewhere near top. |
--8<---------------cut here---------------start------------->8---
$ ./pre-inst-env guix search tor | recsel -p name,relevance | head -8
name: tor
relevance: 208
name: tor-client
relevance: 169
name: torsocks
relevance: 103
--8<---------------cut here---------------end--------------->8---
Compared to current:
--8<---------------cut here---------------start------------->8---
$ guix search tor | recsel -p name,relevance | head -8
name: tor
relevance: 47
name: ghc-storablevector
relevance: 29
name: tor-client
relevance: 28
--8<---------------cut here---------------end--------------->8---
However, the position move from 225th to 19th.
$ guix search tor | recsel -P name | grep -n torbrowser
225:torbrowser
$ ./pre-inst-env guix search tor | recsel -P name | grep -n torbrowser
19:torbrowser
Similarly as ’dig’, the description of ’torbrowser’ package could be
improvement. Because ’guix search tor browser’ returns nothing.
> | gcc | ~gcc-toolchain~ near top. |
Indeed, something is unexpected. Well, first:
$ guix search gcc | recsel -CP name | uniq | head -8
gccgo
gfortran-toolchain
gdc-toolchain
gcc-toolchain
gcc-cross-x86_64-w64-mingw32-toolchain
gcc-cross-or1k-elf-toolchain
gcc-cross-i686-w64-mingw32-toolchain
gcc-cross-avr-toolchain
$ guix search gcc | recsel -CP name | uniq -c | sort -rn | head -8
18 llvm
12 gcc-toolchain
6 libgccjit
6 gccgo
3 isl
2 libstdc++-doc
2 java-commons-cli
2 gdc-toolchain
Other said, the packages with multi-versions decrease the experience.
Well, that had already by “improved” [1] with some SEO. ;-) Indeed,
maybe the relevance should be improved.
Second, gccgo has a relevance score of 22 with the only term ’gcc’,
compared to gcc-toolchain scoring at 15.
gccgo gcc-toolchain
4 * 1 * 1 4 * 1 * 1
+ 2 * 5 * 1 + 2 * 1 * 1
+ 1 * 0 + 1 * 0
+ 3 * 1 * 1 + 3 * 1 * 1
+ 2 * 0 + 2 * 1 * 3
+ 1 * 5 * 1 + 1 * 0
= 22 = 15
This is unexpected. And, IMHO that’s bug! In the description of
gcc-toolchain, the term ’gcc’ appears 3 times but it only score with ’1’
instead of ’5’.
As the patch try to address, the main issue is:
(define (score regexp str)
(fold-matches regexp str 0
(lambda (m score)
(+ score
(if (string=? (match:substring m) str)
5 ;exact match
1)))))
Here the exact match does not consider a substring exact match. For
instance, one would consider that the term ’gcc’ exactly matches in
“some GCC thing”. Considering the current implementation, that’s not
the case. For instance, a snippet as the procedure ’scoring’:
--8<---------------cut here---------------start------------->8---
scheme@(guix-user)> ,use(ice-9 regex)
scheme@(guix-user)> (define regexp (make-regexp "gcc" regexp/icase))
scheme@(guix-user)> (define str "some GCC thing")
scheme@(guix-user)> (fold-matches regexp str 0
(lambda (m res)
(+ res
(if (string=? (match:substring m) str)
5 1))))
$2 = 1
--8<---------------cut here---------------end--------------->8---
See v2 for my proposal fixing this.
Please note that this v2 gives the same ranking for torbrowser. And
also improve the situation with gcc-toolchain.
--8<---------------cut here---------------start------------->8---
$ ./pre-inst-env guix search gcc | recsel -CP name | grep -n gcc-toolchain
1:gcc-toolchain
2:gcc-toolchain
3:gcc-toolchain
4:gcc-toolchain
5:gcc-toolchain
6:gcc-toolchain
7:gcc-toolchain
8:gcc-toolchain
9:gcc-toolchain
10:gcc-toolchain
11:gcc-toolchain
12:gcc-toolchain
$ ./pre-inst-env guix search tor | recsel -CP name | grep -n torbrowser
7:torbrowser
$ ./pre-inst-env guix search dig | recsel -CP name | grep -n bind
44:bind
--8<---------------cut here---------------end--------------->8---
However, inetutils is still at 44th with the only one term ’rsh’. I
would suggest to do some tweak with the description.
Bah maybe it is then a bit slower on cold caches? Hum?! Well, I have
not investigated, neither with your patch. :-) Well, that something that
could be investigated; especially the performance of ’char-set’
operations.
1: https://issues.guix.gnu.org/43342
> I opted to switch to counting a maximum of one match per field, which helps
> with cases where a common subword matches /many/ times in packages with longer
> descriptions, pushing more relevant packages down. In multi-term searches,
> the unique terms - which are naturally rarer - also contribute to a larger
> percentage of the score as a result of these changes.
> Having matches with only one word boundary be scored as 2 instead of 1 was
> done with the reasoning that a term is more likely to be part of a compound
> word name (and thus more relevant) if it is a prefix or suffix; for example,
> "gl" in OpenGL, "borg" in borgmatic, and "tor" in torbrowser.
[...]
> Closing this message on an unrelated note for future work: I stumbled on an
> interesting idea while looking for test cases which suggested reducing the
> score of a programming library when its language is not included in search
> terms. It's out of scope for the current issue, but I thought I'd mention it
> anyways for potential further improvements.
Well, years ago I thought about implementing TF-IDF [2,3]. Other ideas
[4] are floating around. Then, we spent some time for making “guix
search” faster [5] and today my TODO is about having an extension
relying on Guile-Xapian.
Therefore, I would prefer keep the ’relevance’ more or less predictable
by only counting the number of occurrences and apply some weights.
Else, for what my opinion is worth, the direction would not be to
re-invent an algorithm but maybe implement some already well-known ones.
TF-IDF [3] is one or Okapi-BM25 is another one, etc. In all in all,
that what Xapian provides. ;-) And it does it very well! That’s why I
would be tempted to have a Guix extension relying on Guile-Xapin for
indexing and searching (fast!).
Cheers,
simon
2: Re: Organizing packages
zimoun <zimon.toutoune@gmail.com>
Tue, 16 Jul 2019 19:04:26 +0200
id:CAJ3okZ0LaJzWDBA7bjqZew_jAmtt1rj9PJhevwrtBiA_COXENg@mail.gmail.com
https://lists.gnu.org/archive/html/guix-devel/2019-07
https://yhetil.org/guix/CAJ3okZ0LaJzWDBA7bjqZew_jAmtt1rj9PJhevwrtBiA_COXENg@mail.gmail.com
3: https://en.wikipedia.org/wiki/Tf%E2%80%93idf
4: Inverted index to accelerate guix package search
Arun Isaac <arunisaac@systemreboot.net>
Sun, 12 Jan 2020 20:33:51 +0530
id:cu7h810emy0.fsf@systemreboot.net
https://lists.gnu.org/archive/html/guix-devel/2020-01
https://yhetil.org/guix/cu7h810emy0.fsf@systemreboot.net
5: [bug#39258] Faster guix search using an sqlite cache
Arun Isaac <arunisaac@systemreboot.net>
Fri, 24 Jan 2020 01:21:57 +0530
id:cu7pnfaar36.fsf@systemreboot.net
https://issues.guix.gnu.org/39258
https://issues.guix.gnu.org/msgid/cu7pnfaar36.fsf@systemreboot.net
https://yhetil.org/guix/cu7pnfaar36.fsf@systemreboot.net
6: https://en.wikipedia.org/wiki/Okapi_BM25