Re: [Bug-apl] Spell corrector - APL
From: Ala'a Mohammad
Subject: Re: [Bug-apl] Spell corrector - APL
Date: Mon, 12 Sep 2016 21:34:21 +0400
Thanks for the pointer.
About ⎕FIO ¯1: I cannot find it in the help listing displayed by
⎕FIO ''. What is the resolution of its output? Is it similar to
GetTickCount on Windows or gettimeofday on Linux?
Regards,
Ala'a
On Mon, Sep 12, 2016 at 2:23 PM, Juergen Sauermann
<address@hidden> wrote:
> Hi again,
>
> sorry, I meant:
>
> downcase←{ ⎕UCS (32×(⍵≥65)∧⍵≤90)+⍵←⎕UCS ⍵ }
>
> /// Jürgen
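
A small variant of the same ⎕UCS idea, as a sketch only: it marks the upper-case letters by membership in an explicit alphabet instead of comparing code points, so it needs no assignment inside the lambda. The name lower is hypothetical and not from the thread, and only the 26 ASCII capitals are handled.

```apl
A ← 'ABCDEFGHIJKLMNOPQRSTUVWXYZ'
⍝ lower is a hypothetical name: add 32 to the code point of every
⍝ character found in A, leave all other characters unchanged
lower ← { ⎕UCS (⎕UCS ⍵) + 32 × ⍵∊A }
lower 'Hello, World'   ⍝ hello, world
```

Because the result is built from code points, characters outside ⎕AV pass through untouched instead of raising an INDEX ERROR.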
>
>
> On 09/12/2016 12:10 PM, Juergen Sauermann wrote:
>
> Hi Ala'a,
>
> you can use ⎕FIO ¯1 to find out where the time is spent, e.g.:
>
> T←⎕FIO ¯1
> file ← 'test.txt'
> 'T1:' ((T←⎕FIO ¯1)-T)
> ⎕ ← ⍴w ← words ftxt file
> 'T2:' ((T←⎕FIO ¯1)-T)
> ⎕ ← ⍴u ← ∪w
> 'T3:' ((T←⎕FIO ¯1)-T)
> desc 39 2 ⍴ (⍪u),{+/(⊂⍵)∘.≡w}¨u
> 'T4:' ((T←⎕FIO ¯1)-T)
>
> Your downcase function fails on my machine:
>
> ⎕ ← ⍴w ← words ftxt file
> INDEX ERROR+
> λ1[1] λ←(a,⎕AV)[(A,⎕AV)⍳⍵]
> ^ ^
>
> )MORE
> ⎕IO=1 offending index=282 max index=282
>
> probably due to a character in my testfile that is not contained in ⎕AV.
> You should use ⎕UCS instead of ⎕AV to avoid that:
>
> downcase←{ ⎕UCS (32×(T≥65)∧T≤90)+⍵←⎕UCS ⍵ }
>
> /// Jürgen
>
>
> On 09/11/2016 08:23 PM, Ala'a Mohammad wrote:
>
> Just an update for reference: I am now able to parse the big.txt file
> (without WS FULL or a killed process), but it takes around 2 hours and
> 20 minutes, ±10 minutes (around 1M words, of which 30K are unique). The
> process reaches 1GiB after parsing the words and grows by a further
> 100MiB during the sequential 'Each' (thus a maximum of about 1.1GiB).
>
> The only change is that each unique word is now scanned against the
> whole words vector.
>
> Below is the code with a sample timed run.
>
> Regards,
>
> Ala'a
>
> ⍝ fhist.apl
> a ← 'abcdefghijklmnopqrstuvwxyz' ◊ A ← 'ABCDEFGHIJKLMNOPQRSTUVWXYZ'
> downcase ← { (a,⎕AV)[(A,⎕AV)⍳⍵] }
> nl ← ⎕UCS 10 ◊ cr ← ⎕UCS 13 ◊ tab ← ⎕UCS 9
> nonalpha ← nl, cr, tab, ' 0123456789()[]!?%$,.:;/+*=<>-_#"`~@&'
> alphamask ← { ~ ⍵ ∊ nonalpha }
> words ← { (alphamask ⍵) ⊂ downcase ⍵ }
> desc ← {⍵[⍒⍵[;2];]}
> ftxt ← { ⎕FIO[26] ⍵ }
>
> file ← '/misc/big.txt' ⍝ ~ 6.2M
> ⎕ ← ⍴w ← words ftxt file
> ⎕ ← ⍴u ← ∪w
> desc 39 2 ⍴ (⍪u),{+/(⊂⍵)∘.≡w}¨u
> )OFF
>
> : time apl -s -f fhist.apl
> 1098281
> 30377
> the 80003
> of 40025
> to 28760
> in 22048
> for 6936
> by 6736
> be 6154
> or 5349
> all 4141
> this 4058
> are 3627
> other 1488
> before 1363
> should 1297
> over 1282
> your 1276
> any 1204
> our 1065
> holmes 450
> country 417
> world 355
> project 286
> gutenberg 262
> laws 233
> sir 176
> series 128
> sure 123
> sherlock 101
> ebook 85
> copyright 69
> changing 44
> check 38
> arthur 30
> adventures 17
> redistributing 7
> header 7
> doyle 5
> downloading 5
> conan 4
>
> apl -s -f fhist.apl 8901.96s user 5.78s system 99% cpu 2:28:38.61 total
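
The run above compares each unique word against the whole of w. One possible alternative, sketched here and not taken from the thread, is to count via sorted indices. It assumes ⎕IO=1 and that dyadic ⍳ compares whole words when applied to nested vectors (APL2 semantics). The tiny w below is made-up sample data.

```apl
w ← 'the' 'cat' 'the' 'dog' 'the' 'cat'   ⍝ made-up sample words
u ← ∪w                     ⍝ unique words, in order of first appearance
i ← u⍳w                    ⍝ position in u of every word in w
s ← i[⍋i]                  ⍝ sorted, so equal indices are adjacent
ends ← (s≠1↓s,0)/⍳⍴s       ⍝ last position of each run of equal indices
counts ← ends-0,¯1↓ends    ⍝ run lengths, aligned with u: 3 2 1
(⍪u),counts                ⍝ two-column table: word, count
```

Every index from 1 to ⍴u occurs at least once in i, so counts lines up with u. Whether this is actually faster depends on how the interpreter implements ⍳ on nested arrays, but it never builds a unique-by-words matrix.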
>
> On Sat, Sep 10, 2016 at 12:02 PM, Ala'a Mohammad <address@hidden> wrote:
>
> Thanks to all for the input.
>
> Replacing Find and Each-OR with Match helped; I am now parsing a 159K
> (~1545 lines) text file (a sample chunk of big.txt).
>
> What I am still trying to understand is why the APL process, when fed
> the 159K text file, keeps allocating memory until it reaches 2.7GiB
> and then, after printing the result, settles down to 50MiB. Why does
> it need 2.7GiB? Are there any memory utilities (e.g. a garbage
> collection utility) that could be used to mitigate this?
>
> Here is the updated code:
>
> a ← 'abcdefghijklmnopqrstuvwxyz'
> A ← 'ABCDEFGHIJKLMNOPQRSTUVWXYZ'
> downcase ← { (a,⎕AV)[(A,⎕AV)⍳⍵] }
> nl ← ⎕UCS 10 ◊ cr ← ⎕UCS 13 ◊ tab ← ⎕UCS 9
> nonalpha ← nl, cr, tab, ' 0123456789()[]!?%$,.:;/+*=<>-_#"`~@&'
> alphamask ← { ~ ⍵ ∊ nonalpha }
> words ← { (alphamask ⍵) ⊂ downcase ⍵ }
> hist ← { (⍪∪⍵),+/(∪⍵)∘.≡⍵ } ⍝ as suggested by Kacper
> desc ← {⍵[⍒⍵[;2];]}
> ftxt ← { ⎕FIO[26] ⍵ }
> fhist ← { hist words ftxt ⍵ }
>
> file ← '/misc/llaa' ⍝ llaa contains 1546 text lines
> ⎕ ← ⍴w ← words ftxt file
> ⎕ ← ⍴u ← ∪w
> desc 39 2 ⍴ fhist file
>
> And here is a sample run
> : apl -s -f fhist.apl
> 30186
> 4155
> the 1560
> to 804
> of 781
> in 493
> for 219
> be 173
> holmes 164
> your 132
> this 114
> all 99
> by 97
> are 97
> or 73
> other 56
> over 51
> our 48
> should 47
> before 43
> sherlock 39
> any 35
> sir 26
> sure 13
> country 9
> project 6
> gutenberg 6
> ebook 5
> adventures 5
> world 5
> arthur 4
> conan 4
> doyle 4
> series 2
> copyright 2
> laws 2
> check 2
> header 2
> changing 1
> downloading 1
> redistributing 1
>
> The sample input file is also attached.
>
> Regards,
>
> On Sat, Sep 10, 2016 at 9:20 AM, Kacper Gutowski <address@hidden> wrote:
>
> On 9 September 2016 at 23:39, Ala'a Mohammad wrote:
>
> the errors happened inside the 'hist' function, and I presume mostly due
> to the jot-dot Find (if I understand correctly, it operates on a matrix
> of size unique-length × words-length)
>
> Try (∪⍵)∘.≡⍵ instead of ∨/¨(∪⍵)∘.⍷⍵.
>
> -k
>
>
>