bug-apl

[Bug-apl] Fwd: Re: Spell corrector - APL


From: Christian Robert
Subject: [Bug-apl] Fwd: Re: Spell corrector - APL
Date: Fri, 9 Sep 2016 23:59:31 -0400
User-agent: Mozilla/5.0 (Windows NT 10.0; WOW64; rv:45.0) Gecko/20100101 Thunderbird/45.2.0

Bug report,

While testing this I hit a "Killed" (out of memory) issue.

While I was running my tests, the last function raised an error. "⍵" was very
large at that point (100,000+).

At that point I did:

      )copy Util type

and the process/virtual machine was completely stuck for more than 10 minutes,
and I think I reset it.


There is no ")copy Util type" (I mistyped it); the one I really needed was
")copy Util typeof".

It is hard to reproduce. Why is )copy using so much memory/CPU?

Xtian.
--- Begin Message ---
Subject: Re: [Bug-apl] Spell corrector - APL
Date: Fri, 9 Sep 2016 23:27:22 -0400
User-agent: Mozilla/5.0 (Windows NT 10.0; WOW64; rv:45.0) Gecko/20100101 Thunderbird/45.2.0

I got to maybe 2% of the work with this:


      )sic
      )erase readfile_fast
      ∇z←readfile_fast name;fd;lines;⎕io
⎕io←1 ⍝ Bring a file into a vector of strings, utf8 aware for both name and contents.
→(0≠"r" ⎕fio[31] 18 ⎕cr name)/Error           ⍝ Can not read file ? → Error
z←⎕fio[26] 18 ⎕cr name                        ⍝ First pass, read the whole file
lines←⍳+/((↑"\n")=z)                          ⍝ Compute the iota for each line
z←(⍴lines)⍴⍬                                  ⍝ Preallocate "z" to the right size
fd←⎕fio[3] 18 ⎕cr name                        ⍝ Open the file
⊣ {⊣z[⍵]←⊂19 ⎕cr ⎕ucs ¯1↓⎕fio[8] fd} ⍤0 lines ⍝ Put each line in the preallocated "z"
⊣ ⎕fio[4] fd ⋄ →0                             ⍝ Close the file and return
Error: ⎕ES ∊'Error on file "',name,'": ',⎕fio[2] | ⎕fio[1] ''
∇


alpha_only←{(⍵∊'abcdefghijklmnopqrstuvwxyz ')/⍵←tolower ⍵}
remove_blank_lines←{(∊0≠⍴¨⍵)/⍵}
tolower←{('abcdefghijklmnopqrstuvwxyz',⎕av)[('ABCDEFGHIJKLMNOPQRSTUVWXYZ',⎕av)⍳⍵]}
vertical←{,[⍳0]⍵}
words_only←{(⍵∊'abcdefghijklmnopqrstuvwxyz ')/⍵←tolower ⍵}

      ⍝ then ...

      z←remove_blank_lines alpha_only ¨ tolower ¨ readfile_fast 'big.txt'

      ⍴ z
103561
      ⍝ here you have 103,561 lines, no empty ones, clean of special characters
      ⍝ (but may have several blanks between each word).

      ⌊/⍴¨z  ⍝ minimum line length, probably the word "I"
 1

      ⌈/⍴¨z  ⍝ maximum line length; a line of 2488 characters may contain 400 to 600 words
 2488

      ⍝ at this point you have to iterate (rank operator?) over those 103,561 lines
      ⍝ to extract all the words in each line, saving them (unique) and counting
      ⍝ the occurrences of each word.

      ⍝ since APL can't do things like count['abc'] = 0   or count['abc'] += 1
      ⍝ (indexing a vector with a string), it's a near no-end issue
      ⍝ (i.e. very difficult to do, but not impossible)

      ⍝ you will NEVER win a race against a language like "awk", which has
      ⍝ string-indexed arrays as *part* of the basic language.
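
For illustration only: a string-keyed counter can be emulated with dyadic ⍳
over a nested vector of keys. This is a minimal sketch, not a tuned solution.
The names keys, count and i are made up for the example, and it assumes ⎕io←1.

```apl
keys←'abc' 'def' 'ghi'        ⍝ nested vector of known words (hypothetical data)
count←(⍴keys)⍴0               ⍝ one counter per key, like count['abc'] = 0
i←keys⍳⊂'abc'                 ⍝ position of 'abc' in keys
count[i]←count[i]+1           ⍝ like count['abc'] += 1
```

A word that is not yet in keys gives i equal to 1+⍴keys, so it would have to be
appended to keys (and a 0 appended to count) before the update. How fast the ⍳
lookup is depends on the interpreter, so this says nothing about beating awk on
speed.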

my 2 cents,

Xtian.

On 2016-09-09 17:39, Ala'a Mohammad wrote:
Hi,

I'm trying to create a simple spell corrector (Norvig's, at
http://norvig.com/spell-correct.html) in APL.
I tried but got stuck at the frequency/count stage and could not move
further. The blocker was either WS FULL or the apl process being killed. I'm
assuming the main issue is my lack of experience with APL, and thus
inefficient coding.

ftxt ← { ⎕FIO[26] ⍵ }
a ← 'abcdefghijklmnopqrstuvwxyz'
A ← 'ABCDEFGHIJKLMNOPQRSTUVWXYZ'
downcase ← { (a,⎕AV)[(A,⎕AV)⍳⍵] }
nl ← ⎕UCS 10 ⍝ line feed
cr ← ⎕UCS 13 ⍝ carriage return
tab ← ⎕UCS 9
nonalpha ← nl, cr, tab, ' 0123456789()[]!?%$,.:;/+*=<>-_#"`~@&'
alphamask ← { ~ ⍵ ∊ nonalpha }
hist ← { (⍪∪⍵),+/∨/¨(∪⍵)∘.⍷⍵ }
fhist ← { hist (alphamask txt) ⊂ downcase txt ← ftxt ⍵ }
⍝ file ← '/misc/small.txt' ~ 28K
⍝ file ← '/misc/xaa' ~ 1.3M
file ← '/misc/big.txt' ⍝ ~ 6.2M
⍝ following 2 lines for debugging
⎕ ← ⍴w ← (alphamask txt) ⊂ downcase txt ← ftxt file
⎕ ← ⍴u ← ∪w
fhist file

The errors happened inside the 'hist' function, and I presume they are mostly
due to the jot-dot find (if I understand correctly, it operates on a matrix of
size unique-length * words-length).
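
To make the size estimate concrete: an outer product over the unique words and
all words builds one cell per (unique word, word) pair, and with ⍷ each cell
holds a boolean vector. With, purely hypothetically, 1,000,000 words and 30,000
unique ones, that is 3×10^10 cells. Note also that ⍷ reports a hit whenever a
unique word occurs inside a longer word ('the' inside 'them'), so ≡ is probably
closer to what is wanted. One possible way to avoid the outer product is to
sort the ⍳ indices and measure run lengths. The sketch below is only that: the
name hist2 and its locals are hypothetical, and it assumes ⎕IO←1 and a
non-empty nested vector of words.

```apl
∇z←hist2 w;u;i;s;b;c
 ⍝ w: nested vector of words. Result: 2-column matrix (word, count).
 u←∪w                    ⍝ unique words, in order of first appearance
 i←u⍳w                   ⍝ index into u for every word (one integer per word)
 s←i[⍋i]                 ⍝ sort the indices so equal words are adjacent
 b←(1,(1↓s)≠¯1↓s)/⍳⍴s    ⍝ start position of each run of equal indices
 c←(1↓b,1+⍴s)-b          ⍝ run lengths, i.e. the count for each unique word
 z←u,[1.5]c              ⍝ laminate words and counts into two columns
∇

      hist2 'the' 'cat' 'the' 'dog' 'the'
⍝ count column is 3 1 1 for the, cat, dog
```

Memory use here grows with the number of words rather than with the product of
the two lengths. Whether u⍳w itself is fast enough on the full file has not been
measured and depends on the interpreter.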

Is there any way to fix the issue, so that I can proceed to complete the solution?

Also, is this the way to create a simple spell corrector in APL (that is,
one which capitalizes on APL's strength as an array language)?

I'm using
LinuxMint 17.1 (kernel 3.13.0-37-generic #64-Ubuntu)
GNU APL 1.6 (794)
Zsh 5.0.2
Emacs 25.1.50.1

Best,

Ala'a

P.S.: I hoped that I could create the solution in APL and then get some
whacks on the head from fellow experienced APL programmers before
submitting it as 'another solution in X language', but the hope
stopped short before even getting to the probability stage.





--- End Message ---
