Re: [Bug-apl] Spell corrector - APL
From: Xiao-Yong Jin
Subject: Re: [Bug-apl] Spell corrector - APL
Date: Fri, 9 Sep 2016 22:58:23 -0500
This seems like good motivation to support quad equal: ⌸
See the Key operator in Dyalog:
http://help.dyalog.com/15.0/Content/Language/Primitive%20Operators/Key.htm
On the other hand, pattern matching on A[n]←x to do the update
in place seems like a good way to go.
I'm not sure whether that is possible in GNU APL.
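
For what it's worth, the counting step can be sketched without ⌸ and without the outer product, using only primitives GNU APL already has. This is a rough sketch, not tested against big.txt: hist2 and its local names are made up for illustration, and it assumes the argument is a nested vector of words like the w in the original post. It maps every word to its index in the unique list, sorts the indices, and reads the counts off the run lengths, so no unique-by-words matrix is built.

```apl
∇z←hist2 w;u;i;s;b;p;c;⎕io
 ⎕io←1
 u←∪w                 ⍝ unique words, in order of first appearance
 i←u⍳w                ⍝ index of every word in the unique list
 s←i[⍋i]              ⍝ sorted indices: equal indices become adjacent runs
 b←1,(1↓s)≠¯1↓s       ⍝ 1 marks the start of each run
 p←b/⍳⍴s              ⍝ start position of each run
 c←(1↓p,1+⍴s)-p       ⍝ run lengths, i.e. the count for each unique word
 z←(⍪u),c             ⍝ two columns: word, count (same shape as hist)
∇
```

For example, hist2 'the' 'cat' 'the' should pair 'the' with 2 and 'cat' with 1. The ∪ and ⍳ steps still compare whole words, so this may not be fast on the full file; it only addresses the size of the intermediate result.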
> On Sep 9, 2016, at 10:27 PM, Christian Robert <address@hidden> wrote:
>
>
> I got to maybe 2% of the work with this:
>
> alpha_only←{(⍵∊'abcdefghijklmnopqrstuvwxyz ')/⍵←tolower ⍵}
> remove_blank_lines←{(∊0≠⍴¨⍵)/⍵}
> tolower←{('abcdefghijklmnopqrstuvwxyz',⎕av)[('ABCDEFGHIJKLMNOPQRSTUVWXYZ',⎕av)⍳⍵]}
>
>
>
> )sic
> )erase readfile_fast
> ∇z←readfile_fast name;fd;lines;⎕io
> ⎕io←1 ⍝ Bring a file into a vector of strings, utf8 aware for both name and contents.
> →(0≠"r" ⎕fio[31] 18 ⎕cr name)/Error ⍝ Can not read file ? → Error
> z←⎕fio[26] 18 ⎕cr name ⍝ First pass, read the whole file
> lines←⍳+/((↑"\n")=z) ⍝ Compute the iota for each line
> z←(⍴lines)⍴⍬ ⍝ Preallocate "z" to the right size
> fd←⎕fio[3] 18 ⎕cr name ⍝ Open the file
> ⊣ {⊣z[⍵]←⊂19 ⎕cr ⎕ucs ¯1↓⎕fio[8] fd} ⍤0 lines ⍝ Put each line in the preallocated "z"
> ⊣ ⎕fio[4] fd ⋄ →0 ⍝ Close the file and return
> Error: ⎕ES ∊'Error on file "',name,'": ',⎕fio[2] | ⎕fio[1] ''
> ∇
>
>
> alpha_only←{(⍵∊'abcdefghijklmnopqrstuvwxyz ')/⍵←tolower ⍵}
> remove_blank_lines←{(∊0≠⍴¨⍵)/⍵}
> tolower←{('abcdefghijklmnopqrstuvwxyz',⎕av)[('ABCDEFGHIJKLMNOPQRSTUVWXYZ',⎕av)⍳⍵]}
> vertical←{,[⍳0]⍵}
> words_only←{(⍵∊'abcdefghijklmnopqrstuvwxyz ')/⍵←tolower ⍵}
>
> ⍝ then ...
>
> z←remove_blank_lines alpha_only ¨ tolower ¨ readfile_fast 'big.txt'
>
> ⍴ z
> 103561
> ⍝ here you have 103,561 lines, no empty ones, clean of special characters (but may have several blanks between each word).
>
> ⌊/⍴¨z ⍝ minimum line length, probably "I"
> 1
>
> ⌈/⍴¨z ⍝ maximum line length; a line of 2488 characters may contain 400 to 600 words
> 2488
>
> ⍝ at this point you have to iterate (rank operator?) over those 103,561 lines
> ⍝ to extract all the words in each line, saving them (unique) and counting the occurrences of
> ⍝ each word.
>
> ⍝ since APL can't do things like count['abc'] = 0 or count['abc'] += 1 (indexing a vector with a string)
> ⍝ it's a nearly endless job (i.e. very difficult to do, but not impossible)
>
> ⍝ you will NEVER win a race against a language like "awk", which has string-indexed arrays as *part* of the basic language.
>
> my 2 cents,
>
> Xtian.
>
> On 2016-09-09 17:39, Ala'a Mohammad wrote:
>> Hi,
>>
>> I'm trying to create a simple spell corrector (Norvig at
>> http://norvig.com/spell-correct.html) in APL.
>> I tried but stumbled upon the frequency/count stage and could not move
>> further. The stopper was either WS Full, or apl process killed. I'm
>> assuming the main issue is 'lack of experience with APL', and thus the
>> inefficient coding.
>>
>> ftxt ← { ⎕FIO[26] ⍵ }
>> a ← 'abcdefghijklmnopqrstuvwxyz'
>> A ← 'ABCDEFGHIJKLMNOPQRSTUVWXYZ'
>> downcase ← { (a,⎕AV)[(A,⎕AV)⍳⍵] }
>> nl ← ⎕UCS 10
>> cr ← ⎕UCS 13
>> tab ← ⎕UCS 9
>> nonalpha ← nl, cr, tab, ' 0123456789()[]!?%$,.:;/+*=<>-_#"`~@&'
>> alphamask ← { ~ ⍵ ∊ nonalpha }
>> hist ← { (⍪∪⍵),+/∨/¨(∪⍵)∘.⍷⍵ }
>> fhist ← { hist (alphamask txt) ⊂ downcase txt ← ftxt ⍵ }
>> ⍝ file ← '/misc/small.txt' ~ 28K
>> ⍝ file ← '/misc/xaa' ~ 1.3M
>> file ← '/misc/big.txt' ⍝ ~ 6.2M
>> ⍝ following 2 lines for debugging
>> ⎕ ← ⍴w ← (alphamask txt) ⊂ downcase txt ← ftxt file
>> ⎕ ← ⍴u ← ∪w
>> fhist file
>>
>> The errors happened inside the 'hist' function, and I presume they are mostly due
>> to the jot dot find (if I understand correctly, it operates on a matrix of
>> size equal to unique-length * words-length).
>>
>> Is there any way to fix the issue and then proceed to complete the solution?
>>
>> Also, is this the way to create a simple spell corrector in APL (that is,
>> one which capitalizes on APL's strength as an array language)?
>>
>> I'm using
>> LinuxMint 17.1 (kernel 3.13.0-37-generic #64-Ubuntu)
>> Gnu APL 1.6 (794)
>> Zsh 5.0.2
>> Emacs 25.1.50.1
>>
>> Best,
>>
>> Ala'a
>>
>> P.S.: I hoped that I could create the solution in APL and then get some
>> whacks on the head from fellow experienced APL programmers before
>> submitting it as 'another solution in X language', but the hope
>> stopped short before even getting to the probability stage.
>>
>>
>