bug-apl

Re: [Bug-apl] Spell corrector - APL


From: Ala'a Mohammad
Subject: Re: [Bug-apl] Spell corrector - APL
Date: Mon, 12 Sep 2016 21:34:21 +0400

Thanks for the pointer.

About ⎕FIO ¯1: I cannot find it in the displayed help (i.e. using
⎕FIO ''). What is the output resolution? Is it similar to
GetTickCount on Windows, or gettimeofday on Linux?

Regards,

Ala'a


On Mon, Sep 12, 2016 at 2:23 PM, Juergen Sauermann
<address@hidden> wrote:
> Hi again,
>
> sorry, I meant:
>
>       downcase←{ ⎕UCS (32×(⍵≥65)∧⍵≤90)+⍵←⎕UCS ⍵ }
>
> /// Jürgen
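
For readers who do not use APL, the ⎕UCS-based downcase above can be read as: convert each character to its code point, add 32 where the code point lies in 65..90 (ASCII A-Z), and convert back. Below is a minimal Python sketch of that same idea, not the APL code itself; the function name and the sample string are made up for illustration. Characters outside A-Z pass through unchanged, so there is no lookup table that an unexpected character could fall outside of.

```python
def downcase(text: str) -> str:
    """Lowercase only ASCII A-Z by shifting the code point by 32."""
    out = []
    for ch in text:
        code = ord(ch)  # character -> code point
        # 65 <= code <= 90 plays the role of (⍵≥65)∧⍵≤90 in the APL version
        shift = 32 if 65 <= code <= 90 else 0
        out.append(chr(code + shift))  # code point -> character
    return "".join(out)


# Illustrative input only; non-ASCII characters are left untouched.
print(downcase("Sherlock HOLMES, café №5"))  # prints: sherlock holmes, café №5
```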
>
>
> On 09/12/2016 12:10 PM, Juergen Sauermann wrote:
>
> Hi Ala'a,
>
> you can use ⎕FIO ¯1 to find out where the time is spent, e.g.:
>
> T←⎕FIO ¯1
> file ← 'test.txt'
> 'T1:' ((T←⎕FIO ¯1)-T)
> ⎕ ← ⍴w ← words ftxt file
> 'T2:' ((T←⎕FIO ¯1)-T)
> ⎕ ← ⍴u ← ∪w
> 'T3:' ((T←⎕FIO ¯1)-T)
> desc 39 2 ⍴ (⍪u),{+/(⊂⍵)∘.≡w}¨u
> 'T4:' ((T←⎕FIO ¯1)-T)
>
> Your downcase function fails on my machine:
>
>       ⎕ ← ⍴w ← words ftxt file
> INDEX ERROR+
> λ1[1]  λ←(a,⎕AV)[(A,⎕AV)⍳⍵]
>          ^      ^
>
>       )MORE
> ⎕IO=1 offending index=282 max index=282
>
> probably due to a character in my testfile that is not contained in ⎕AV.
> You should use ⎕UCS instead of ⎕AV to avoid that:
>
>       downcase←{ ⎕UCS (32×(T≥65)∧T≤90)+⍵←⎕UCS ⍵ }
>
> /// Jürgen
>
>
> On 09/11/2016 08:23 PM, Ala'a Mohammad wrote:
>
> Just an update for reference: I'm now able to parse the big.txt file
> (without WS FULL or a killed process), but it takes around 2 hours and
> 20 minutes, ±10 minutes (around 1M words, of which 30K are unique). The
> process reaches 1GiB (after parsing the words) and adds another
> 100MiB during the sequential 'Each' (thus a max of 1.1GiB).
>
> The only change is scanning each unique word against the whole words vector.
>
> Below is the code with a sample timed run.
>
> Regards,
>
> Ala'a
>
> ⍝ fhist.apl
> a ← 'abcdefghijklmnopqrstuvwxyz' ◊ A ← 'ABCDEFGHIJKLMNOPQRSTUVWXYZ'
> downcase ← { (a,⎕AV)[(A,⎕AV)⍳⍵] }
> cr ← ⎕UCS 13 ◊ nl ← ⎕UCS 10 ◊ tab ← ⎕UCS 9
> nonalpha ← nl, cr, tab, ' 0123456789()[]!?%$,.:;/+*=<>-_#"`~@&'
> alphamask ← { ~ ⍵ ∊ nonalpha }
> words ← { (alphamask ⍵) ⊂ downcase ⍵ }
> desc ← {⍵[⍒⍵[;2];]}
> ftxt ← { ⎕FIO[26] ⍵ }
>
> file ← '/misc/big.txt' ⍝ ~ 6.2M
> ⎕ ← ⍴w ← words ftxt file
> ⎕ ← ⍴u ← ∪w
> desc 39 2 ⍴ (⍪u),{+/(⊂⍵)∘.≡w}¨u
> )OFF
>
> : time apl -s -f fhist.apl
> 1098281
> 30377
>  the            80003
>  of             40025
>  to             28760
>  in             22048
>  for             6936
>  by              6736
>  be              6154
>  or              5349
>  all             4141
>  this            4058
>  are             3627
>  other           1488
>  before          1363
>  should          1297
>  over            1282
>  your            1276
>  any             1204
>  our             1065
>  holmes           450
>  country          417
>  world            355
>  project          286
>  gutenberg        262
>  laws             233
>  sir              176
>  series           128
>  sure             123
>  sherlock         101
>  ebook             85
>  copyright         69
>  changing          44
>  check             38
>  arthur            30
>  adventures        17
>  redistributing     7
>  header             7
>  doyle              5
>  downloading        5
>  conan              4
>
> apl -s -f fhist.apl  8901.96s user 5.78s system 99% cpu 2:28:38.61 total
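
A rough count of the work in the run above: each of the 30377 unique words is matched against all 1098281 words, which is about 3.3 × 10^10 word comparisons, and that may account for much of the run time. The Python sketch below is not APL and is not a claim about what GNU APL does internally; it only contrasts the pairwise scan with a single-pass tally on a tiny made-up word list. The names hist_pairwise, hist_single_pass and desc are illustrative.

```python
from collections import Counter


def hist_pairwise(words):
    """Count each unique word by scanning the whole list (unique x words work)."""
    unique = list(dict.fromkeys(words))  # first-appearance order, like ∪
    return [(u, sum(1 for w in words if w == u)) for u in unique]


def hist_single_pass(words):
    """Count every word in one pass using a hash-based tally."""
    return list(Counter(words).items())


def desc(rows):
    """Sort (word, count) rows by descending count, keeping ties in order."""
    return sorted(rows, key=lambda row: row[1], reverse=True)


# Made-up sample text, standing in for the parsed word vector.
sample = "the cat and the dog and the bird".split()
assert hist_pairwise(sample) == hist_single_pass(sample)
print(desc(hist_single_pass(sample)))
# prints: [('the', 3), ('and', 2), ('cat', 1), ('dog', 1), ('bird', 1)]
```

Both functions return the same table; only the amount of work differs, since the single-pass version touches each word once instead of once per unique word.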
>
> On Sat, Sep 10, 2016 at 12:02 PM, Ala'a Mohammad <address@hidden> wrote:
>
> Thanks to all for the input,
>
> Replacing Find and Each OR with Match helped; now I'm parsing a 159K
> (~1545 lines) text file (a sample chunk from the big.txt).
>
> What I'm trying to understand is why the APL
> process (when fed the 159K text file) starts allocating memory until it
> reaches 2.7GiB and then, after printing the result, settles down to 50MiB.
> Why do I need 2.7GiB? Are there any memory utilities (e.g. a garbage
> collection utility) that can be used to mitigate this issue?
>
> Here is the updated code:
>
> a ← 'abcdefghijklmnopqrstuvwxyz'
> A ← 'ABCDEFGHIJKLMNOPQRSTUVWXYZ'
> downcase ← { (a,⎕AV)[(A,⎕AV)⍳⍵] }
> cr ← ⎕UCS 13 ◊ nl ← ⎕UCS 10 ◊ tab ← ⎕UCS 9
> nonalpha ← nl, cr, tab, ' 0123456789()[]!?%$,.:;/+*=<>-_#"`~@&'
> alphamask ← { ~ ⍵ ∊ nonalpha }
> words ← { (alphamask ⍵) ⊂ downcase ⍵ }
> hist ← { (⍪∪⍵),+/(∪⍵)∘.≡⍵ } ⍝ as suggested by Kacper
> desc ← {⍵[⍒⍵[;2];]}
> ftxt ← { ⎕FIO[26] ⍵ }
> fhist ← { hist words ftxt ⍵ }
>
> file ← '/misc/llaa' ⍝ llaa contains 1546 text lines
> ⎕ ← ⍴w ← words ftxt file
> ⎕ ← ⍴u ← ∪w
> desc 39 2 ⍴ fhist file
>
> And here is a sample run
> : apl -s -f fhist.apl
> 30186
> 4155
>  the            1560
>  to              804
>  of              781
>  in              493
>  for             219
>  be              173
>  holmes          164
>  your            132
>  this            114
>  all              99
>  by               97
>  are              97
>  or               73
>  other            56
>  over             51
>  our              48
>  should           47
>  before           43
>  sherlock         39
>  any              35
>  sir              26
>  sure             13
>  country           9
>  project           6
>  gutenberg         6
>  ebook             5
>  adventures        5
>  world             5
>  arthur            4
>  conan             4
>  doyle             4
>  series            2
>  copyright         2
>  laws              2
>  check             2
>  header            2
>  changing          1
>  downloading       1
>  redistributing    1
>
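
A back-of-the-envelope estimate may help with the 2.7GiB question. In the run above, (∪⍵)∘.≡⍵ in hist produces one result element per (unique word, word) pair, i.e. 4155 × 30186 elements. The Python sketch below only does that arithmetic; the bytes-per-element values are hypothetical, since the thread does not state how much memory the interpreter uses per element.

```python
def outer_product_cells(n_unique: int, n_words: int) -> int:
    """Number of elements in a unique-by-words outer product."""
    return n_unique * n_words


def gib(n_bytes: int) -> float:
    """Convert a byte count to GiB."""
    return n_bytes / 2**30


# Counts taken from the sample run: 30186 words, 4155 unique.
cells = outer_product_cells(4155, 30186)
print(cells)  # prints: 125422830

# Hypothetical per-element sizes (assumptions, not taken from the thread).
for bytes_per_cell in (1, 8, 24):
    print(f"{bytes_per_cell:2d} bytes/element -> {gib(cells * bytes_per_cell):.2f} GiB")
```

Under the 24-byte assumption the intermediate matrix alone would be of the same order as the observed 2.7GiB, which would also fit the drop back to 50MiB once the matrix has been reduced with +/ and is no longer needed.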
> The sample input file is also attached.
>
> Regards,
>
> On Sat, Sep 10, 2016 at 9:20 AM, Kacper Gutowski <address@hidden> wrote:
>
> On 9 September 2016 at 23:39, Ala'a Mohammad wrote:
>
> The errors happened inside the 'hist' function, and I presume they are
> mostly due to the jot dot find (if I understand correctly, it operates on
> a matrix of size equal to unique-length * words-length).
>
> Try (∪⍵)∘.≡⍵ instead of ∨/¨(∪⍵)∘.⍷⍵.
>
> -k
>
>
>


