silpa-discuss

Re: [silpa-discuss] Changes made as part of GSoC - Call for Suggestions


From: Jerin Philip
Subject: Re: [silpa-discuss] Changes made as part of GSoC - Call for Suggestions
Date: Thu, 1 Sep 2016 10:31:05 +0530

On 01-Sep-2016 04:19, "Balasankar C" <address@hidden> wrote:
>
> Hi,
>
> I participated in Google Summer of Code program under Indic Project and
> modified the stemmer and spell checker modules of LibIndic so as to incorporate
> inflection handling ability to the spell checker. There were some other
> structural changes that I brought to these repositories, which I intend to
> inform the list through this mail. These changes are listed below.
>
> 0. Rename packages to include the 'libindic' keyword
>
> The package spell checker was having a very general name "spellchecker" which
> was ambiguous. It didn't communicate the information that the module was
> specific for Indic languages or was under the LibIndic library. So, I renamed
> the package to "libindic-spellchecker". Following the pattern, I renamed
> indicstemmer to "libindic-stemmer" and indicngram to "libindic-ngram".
>
>
> 1. Use namespace packages (PEP 420)
>
> I made the package structure to make use of the concept of namespace packages.
> Changes caused due to this are
>         a. Packages will get installed inside a single folder named libindic. Example:
> /usr/local/lib/python2.7/dist-packages/libindic/ngram ,
> /usr/local/lib/python2.7/dist-packages/libindic/stemmer,
> /usr/local/lib/python2.7/dist-packages/libindic/spellchecker etc.
>         b. Import statements will have the term "libindic" in them - More visibility
> to LibIndic. Example: from libindic.stemmer import Malayalam
>
> I believe these changes bring more visibility to the project and reduce
> ambiguity about the packages. If it is ok with everyone, I intend to do
> this for other modules also.
>
> * If the phrase "indic" already exists in the name, it will be removed. The
> prefix "libindic-" will be added to the package names.
> * For modules where code is different for different languages, the structure
> will be libindic.<module>.<language> (eg: libindic.stemmer.Malayalam).
> * For modules where the code is same for all languages, like chardetails, the
> structure will be libindic.<module>.<Module> (eg: libindic.chardetails.CharDetails)
>

I suggest that unless we're storing state, the object-oriented approach is overkill. chardetails should ideally be a plain function, preferably in a shared libindic.util module.
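
A rough sketch of what I mean, not the current chardetails code: the function name char_details and the returned fields are hypothetical, and the lookup relies only on the standard unicodedata module.

```python
import unicodedata


def char_details(ch):
    """Return basic Unicode details for a single character.

    Hypothetical stateless replacement for a CharDetails class;
    the name and the returned fields are illustrative only.
    """
    if len(ch) != 1:
        raise ValueError("expected a single character")
    return {
        "char": ch,
        "code_point": "U+{:04X}".format(ord(ch)),
        "name": unicodedata.name(ch, "UNKNOWN"),
        "category": unicodedata.category(ch),
    }


# U+0D05 is MALAYALAM LETTER A.
print(char_details("\u0d05"))
```

Since nothing is stored between calls, there is no object to construct; callers just import the function and call it.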

In my opinion, something like stemmer(lang='ml') or stemmer.ml is shorter than libindic.stemmer.Malayalam and still readable.
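
One possible shape for the stemmer(lang='ml') form, as a sketch only: the registry, the _stem_ml placeholder and the error message are assumptions of mine, and the placeholder does no real stemming.

```python
def _stem_ml(word):
    # Placeholder only: a real Malayalam stemmer would strip inflections here.
    return word


# Hypothetical registry mapping language codes to stemming functions.
_STEMMERS = {"ml": _stem_ml}


def stemmer(lang):
    """Return the stemming function registered for a language code."""
    try:
        return _STEMMERS[lang]
    except KeyError:
        raise ValueError("no stemmer available for language: {!r}".format(lang))


stem = stemmer(lang="ml")
print(stem("word"))
```

Adding a language would then mean registering one more function, with no new class per language.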

I also propose that we move the modules to Python 3, drop backwards compatibility, and restructure libindic.

Let me elaborate a little on the restructuring. For example, spellchecker has its own implementation of Levenshtein distance. We could move this to a single util file, since it's a commonly used metric and is likely to be duplicated across modules.
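
For illustration, a shared helper could look roughly like this. It is a standard dynamic-programming edit distance that I wrote for this mail, not the code currently in spellchecker, and where it would live (say, a libindic.util module) is only a suggestion.

```python
def levenshtein(a, b):
    """Return the Levenshtein (edit) distance between sequences a and b."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter sequence in the inner loop
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution
            ))
        previous = current
    return previous[-1]


print(levenshtein("kitten", "sitting"))  # 3
```

Note that this compares code points, so a Malayalam conjunct counts as several units; whether that is the right granularity for the spell checker is a separate question.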

Alternatively, we can avoid a lot of headaches and wheel-reinventing by using existing functions in NLTK, which already implements basic NLP building blocks such as distance metrics and n-grams. This spares us the trouble of writing and maintaining our own code and documentation for them, and makes things easier for new contributors who already know the library.
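
As a quick illustration, assuming NLTK is installed, both of these are available from the top-level nltk package and need no corpus downloads:

```python
from nltk import edit_distance, ngrams

# Levenshtein distance between two strings.
distance = edit_distance("kitten", "sitting")

# Character bigrams; ngrams() yields tuples, so collect them into a list.
bigrams = list(ngrams("abcd", 2))

print(distance)
print(bigrams)
```

The same ngrams call works on a list of words as well as on a string of characters.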

I'm requesting the move to Python 3 because I have run into issues with unicode, io and similar areas while trying to keep code compatible with both Python 2 and Python 3.
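
A small Python 3 example of the kind of thing I mean (the sample word is just an illustration). In Python 2 the same text held in a plain str is a byte string, so length and slicing work on bytes unless unicode is used consistently everywhere.

```python
# "Malayalam" written in Malayalam script, spelled with escapes so the
# code points are unambiguous regardless of the file's encoding.
word = "\u0d2e\u0d32\u0d2f\u0d3e\u0d33\u0d02"

# In Python 3, str is a sequence of Unicode code points.
code_points = len(word)

# Bytes appear only when we encode explicitly.
utf8_bytes = len(word.encode("utf-8"))

print(code_points, utf8_bytes)  # 6 18
```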

>
> Please share your thoughts, suggestions and modifications about this. If there
> is no objection, I intend to make these changes from 10th September.
>
> --
> Balasankar C
> http://balasankarc.in
>

Thanks,
Jerin Philip

