Re: [Bug-apl] Spell corrector

From:	Ala'a Mohammad
Subject:	Re: [Bug-apl] Spell corrector - APL
Date:	Tue, 13 Sep 2016 20:08:36 +0400

My apologies!
It was my mistake, rushing to a conclusion (a fat-finger typo).

It works, and thanks a lot! The total time went down to about 4 minutes,
give or take a few seconds (compared to the previous ~2 hours 30
minutes). Total memory use also dropped by about 100 MiB.

Thanks again.

Below is the summary (the latest version of the program, with some
kludged timing functionality, is attached at the end).

#Words: 1,098,281 (in 3.775 sec)
#Unique: 30,377 (in 2 min 36.497 sec)
Index Of: 1,098,281 (in 1 min 22.155 sec)
Group Indexes: 30,377 (in 1.436 sec)
Tally Each: 30,377 (in 13 msec)
Count Table: 30,377 (in 15 msec)
Total Time: 4 min 3 sec
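
For reference, the steps timed above can be sketched on a toy vector.
This is only an illustration: the six-word vector is made up, and the
expressions are the ones used in hist.apl, assuming GNU APL's partition
semantics for dyadic ⊂ and the default ⎕IO←1.

```apl
⍝ Toy input (hypothetical); the real run uses: w ← words ftxt file
w ← 'the' 'cat' 'the' 'dog' 'cat' 'the'
u ← ∪w            ⍝ unique words, in order of first appearance: the cat dog
x ← u⍳w           ⍝ index of each word in u: 1 2 1 3 2 1
j ← ⊂⍨x[⍋x]       ⍝ sort the indexes, then partition equal runs: (1 1 1)(2 2)(3)
y ← ≢¨j           ⍝ tally of each group: 3 2 1
i ← (⍪u),y        ⍝ 3-by-2 count table, one row per unique word
desc i            ⍝ rows by descending count (needs desc from hist.apl)
```

Because every element of u occurs at least once in w, group k in j
lines up with u[k], so the tallies can be placed next to u directly.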

I hope I did not miss anything or make an error. I will continue working
to finish the program (the spelling corrector).

Thanks for all the input.

Regards,

Ala'a

*** Program ***
⍝ hist.apl
⍝ Timing functions
∇t←ticks
  ⍝t←ts2ms ⎕TS
  t←+/0 0 0 (60×60×1000) (60×1000)  1000 1 × ⎕TS
∇
∇tireset
  tcount ← 0    ⍝ Timestamp Label counter.
  prevticks ← 0 ⍝ Previous Timestamp.
∇
∇ tsi ;t;diff;min;sec;msec;i ⍝ TimeStamp Interval
  t←ticks
  →(prevticks=0)/initial
  diff←t-prevticks
  sec←60|i←1000÷⍨diff-msec←1000|diff
  min←⌊i÷60
  tcount←tcount+1
  ⎕ ← 'T', ⍕tcount, ':', min , sec, msec, '(', t, ') -', ⎕TS
  →update
initial:
  ⎕ ← 'T0 : 0 0 0 ( 0 ) - ', ⎕TS
update:
  prevticks ← t
∇
tireset

⍝ String functions
a ← 'abcdefghijklmnopqrstuvwxyz' ◊ A ← 'ABCDEFGHIJKLMNOPQRSTUVWXYZ'
⍝ downcase ← { (a,⎕AV)[(A,⎕AV)⍳⍵] }
downcase←{ ⎕UCS (32×(⍵≥65)∧⍵≤90)+⍵←⎕UCS ⍵ } ⍝ by Juergen Sauermann
nl ← ⎕UCS 10 ◊ cr ← ⎕UCS 13 ◊ tab ← ⎕UCS 9 ⍝ 10 is line feed, 13 is carriage return
nonalpha ← nl, cr, tab, ' 0123456789()[]!?%$,.:;/+*=<>-_#"`~@&'
alphamask ← { ~ ⍵ ∊ nonalpha }
words ← { (alphamask ⍵) ⊂ downcase ⍵ }

⍝ Utils functions
desc ← {⍵[⍒⍵[;2];]}
ftxt ← { ⎕FIO[26] ⍵ }

⍝
⍝ Sample Run
⍝
file ← '/misc/big.txt' ⍝ ~ 6.2M
⍝ file ← '/misc/llaa' ⍝ ~ 159K
tsi
⎕ ← '#Words:', ⍴w ← words ftxt file
tsi
⎕ ← '#Unique:', ⍴u ← ∪w
tsi
⍝⍝ j ← {+/(⊂⍵)∘.≡w}¨u (previous solution)
⎕ ← 'Index Of:', ⍴x ← u⍳w
tsi
⎕ ← 'Group Indexes:', ⍴j ← ⊂⍨x[⍋x]
tsi
⎕ ← 'Tally Each:', ⍴y ← ≢¨j
tsi
⎕ ← 'Count Table:', ⍴i ← (⍪u),y
tsi
desc 39 2 ⍴ i
tsi

)OFF



*** Sample Run ***
** Note: the timing is 3 numbers (minutes, seconds, milliseconds) since
the last 'tsi' call, plus total ticks and the current timestamp
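
As an aside, the minutes/seconds/milliseconds split that tsi computes by
hand could also be written with encode (⊤). This is a sketch, not what
hist.apl does; the name diff is reused from tsi, and the sample values
are the T2 interval and the six step timings from the summary.

```apl
diff ← 156497            ⍝ interval in milliseconds (the T2 step)
0 60 1000 ⊤ diff         ⍝ minutes seconds milliseconds: 2 36 497

⍝ Sum of the six timed steps from the summary, in milliseconds
0 60 1000 ⊤ +/ 3775 156497 82155 1436 13 15    ⍝ 4 3 891
```

The second result agrees with the reported total of 4 min 3 sec.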

: apl -s -q -f hist.apl
T0 : 0 0 0 ( 0 ) -  2016 9 13 19 56 50 499
#Words: 1098281
T1 : 0 3 775 ( 71814274 ) - 2016 9 13 19 56 54 274
#Unique: 30377
T2 : 2 36 497 ( 71970771 ) - 2016 9 13 19 59 30 771
Index Of: 1098281
T3 : 1 22 155 ( 72052926 ) - 2016 9 13 20 0 52 926
Group Indexes: 30377
T4 : 0 1 436 ( 72054362 ) - 2016 9 13 20 0 54 362
Tally Each: 30377
T5 : 0 0 13 ( 72054375 ) - 2016 9 13 20 0 54 375
Count Table: 30377 2
T6 : 0 0 15 ( 72054390 ) - 2016 9 13 20 0 54 390
 the            80003
 of             40025
 to             28760
 in             22048
 for             6936
 by              6736
 be              6154
 or              5349
 all             4141
 this            4058
 are             3627
 other           1488
 before          1363
 should          1297
 over            1282
 your            1276
 any             1204
 our             1065
 holmes           450
 country          417
 world            355
 project          286
 gutenberg        262
 laws             233
 sir              176
 series           128
 sure             123
 sherlock         101
 ebook             85
 copyright         69
 changing          44
 check             38
 arthur            30
 adventures        17
 redistributing     7
 header             7
 doyle              5
 downloading        5
 conan              4
T7 : 0 0 6 ( 72054396 ) - 2016 9 13 20 0 54 396

On Tue, Sep 13, 2016 at 7:25 PM, Jay Foad <address@hidden> wrote:
> This looks like you are applying desc to an array that does not have rank 2.
> I don't see how that can happen if you entered this exactly, since the
> argument of desc must have shape 39 2:
>
> desc 39 2⍴(⍪u),≢¨⊂⍨x[⍋x←u⍳w]
>
> Jay.
>
> On 12 September 2016 at 18:34, Ala'a Mohammad <address@hidden> wrote:
>>
>> Thanks for the alternative, I'd tried to run it, but got Rank Error
>>
>> RANK ERROR
>> λ1[1]  λ←⍵[⍒⍵[;2];]
>>             ^    ^
>>
>> How can I help debug this?
>>
>> Regards,
>>
>> Ala'a
>>
>> On Mon, Sep 12, 2016 at 5:32 PM, Jay Foad <address@hidden> wrote:
>> > Hi Ala'a,
>> >
>> > How about replacing the last line with this? It runs in about 1
>> > minute on my machine:
>> >
>> > desc 39 2⍴(⍪u),≢¨⊂⍨x[⍋x←u⍳w]
>> >
>> > Jay.
>> >
>> > On 11 September 2016 at 19:23, Ala'a Mohammad <address@hidden> wrote:
>> >>
>> >> Just an update as a reference, I'm now able to parse the big.txt file
>> >> (without WS full or killed process), but it takes around 2 Hours and
>> >> 20 Minutes +-10 minutes. (around 1M words, 30K are unique). The
>> >> process reach 1GiB (after parsing the words), and tops that with
>> >> 100MiB during the sequential 'Each' (thus a max of 1.1GiB).
>> >>
>> >> The only change is scanning each unique word against the whole words
>> >> vector.
>> >>
>> >> Below is the code with a sample timed run.
>> >>
>> >> Regards,
>> >>
>> >> Ala'a
>> >>
>> >> ⍝ fhist.apl
>> >> a ← 'abcdefghijklmnopqrstuvwxyz' ◊ A ← 'ABCDEFGHIJKLMNOPQRSTUVWXYZ'
>> >> downcase ← { (a,⎕AV)[(A,⎕AV)⍳⍵] }
>> >> nl ← ⎕UCS 13 ◊ cr ← ⎕UCS 10 ◊ tab ← ⎕UCS 9
>> >> nonalpha ← nl, cr, tab, ' 0123456789()[]!?%$,.:;/+*=<>-_#"`~@&'
>> >> alphamask ← { ~ ⍵ ∊ nonalpha }
>> >> words ← { (alphamask ⍵) ⊂ downcase ⍵ }
>> >> desc ← {⍵[⍒⍵[;2];]}
>> >> ftxt ← { ⎕FIO[26] ⍵ }
>> >>
>> >> file ← '/misc/big.txt' ⍝ ~ 6.2M
>> >> ⎕ ← ⍴w ← words ftxt file
>> >> ⎕ ← ⍴u ← ∪w
>> >> desc 39 2 ⍴ (⍪u),{+/(⊂⍵)∘.≡w}¨u
>> >> )OFF
>> >>
>> >> : time apl -s -f fhist.apl
>> >> 1098281
>> >> 30377
>> >>  the            80003
>> >>  of             40025
>> >>  to             28760
>> >>  in             22048
>> >>  for             6936
>> >>  by              6736
>> >>  be              6154
>> >>  or              5349
>> >>  all             4141
>> >>  this            4058
>> >>  are             3627
>> >>  other           1488
>> >>  before          1363
>> >>  should          1297
>> >>  over            1282
>> >>  your            1276
>> >>  any             1204
>> >>  our             1065
>> >>  holmes           450
>> >>  country          417
>> >>  world            355
>> >>  project          286
>> >>  gutenberg        262
>> >>  laws             233
>> >>  sir              176
>> >>  series           128
>> >>  sure             123
>> >>  sherlock         101
>> >>  ebook             85
>> >>  copyright         69
>> >>  changing          44
>> >>  check             38
>> >>  arthur            30
>> >>  adventures        17
>> >>  redistributing     7
>> >>  header             7
>> >>  doyle              5
>> >>  downloading        5
>> >>  conan              4
>> >>
>> >> apl -s -f fhist.apl  8901.96s user 5.78s system 99% cpu 2:28:38.61 total
>> >>
>> >> On Sat, Sep 10, 2016 at 12:02 PM, Ala'a Mohammad <address@hidden>
>> >> wrote:
>> >> > Thanks to all for the input,
>> >> >
>> >> > Replacing Find and Each OR with Match helped, now I'm parsing a 159K
>> >> > (~1545 lines) text file (a sample chunk from the big.txt).
>> >> >
>> >> > The strange thing for me that I'm trying to understand is that the
>> >> > APL process (when fed the 159K text file) start allocating memory
>> >> > until it reaches 2.7GiB, then after printing the result settle down
>> >> > to 50MiB. Why do I need 2.7GiB? is there any memory utils (i.e.
>> >> > Garbage collection utility) which can be used to mitigate this
>> >> > issue?
>> >> >
>> >> > Here is the updated code:
>> >> >
>> >> > a ← 'abcdefghijklmnopqrstuvwxyz'
>> >> > A ← 'ABCDEFGHIJKLMNOPQRSTUVWXYZ'
>> >> > downcase ← { (a,⎕AV)[(A,⎕AV)⍳⍵] }
>> >> > nl ← ⎕UCS 13 ◊ cr ← ⎕UCS 10 ◊ tab ← ⎕UCS 9
>> >> > nonalpha ← nl, cr, tab, ' 0123456789()[]!?%$,.:;/+*=<>-_#"`~@&'
>> >> > alphamask ← { ~ ⍵ ∊ nonalpha }
>> >> > words ← { (alphamask ⍵) ⊂ downcase ⍵ }
>> >> > hist ← { (⍪∪⍵),+/(∪⍵)∘.≡⍵ } ⍝ as suggested by Kacper
>> >> > desc ← {⍵[⍒⍵[;2];]}
>> >> > ftxt ← { ⎕FIO[26] ⍵ }
>> >> > fhist ← { hist words ftxt ⍵ }
>> >> >
>> >> > file ← '/misc/llaa' ⍝ llaa contains 1546 text lines
>> >> > ⎕ ← ⍴w ← words ftxt file
>> >> > ⎕ ← ⍴u ← ∪w
>> >> > desc 39 2 ⍴ fhist file
>> >> >
>> >> > And here is a sample run
>> >> > : apl -s -f fhist.apl
>> >> > 30186
>> >> > 4155
>> >> >  the            1560
>> >> >  to              804
>> >> >  of              781
>> >> >  in              493
>> >> >  for             219
>> >> >  be              173
>> >> >  holmes          164
>> >> >  your            132
>> >> >  this            114
>> >> >  all              99
>> >> >  by               97
>> >> >  are              97
>> >> >  or               73
>> >> >  other            56
>> >> >  over             51
>> >> >  our              48
>> >> >  should           47
>> >> >  before           43
>> >> >  sherlock         39
>> >> >  any              35
>> >> >  sir              26
>> >> >  sure             13
>> >> >  country           9
>> >> >  project           6
>> >> >  gutenberg         6
>> >> >  ebook             5
>> >> >  adventures        5
>> >> >  world             5
>> >> >  arthur            4
>> >> >  conan             4
>> >> >  doyle             4
>> >> >  series            2
>> >> >  copyright         2
>> >> >  laws              2
>> >> >  check             2
>> >> >  header            2
>> >> >  changing          1
>> >> >  downloading       1
>> >> >  redistributing    1
>> >> >
>> >> > Also attached the sample input file
>> >> >
>> >> > Regards,
>> >> >
>> >> > On Sat, Sep 10, 2016 at 9:20 AM, Kacper Gutowski <address@hidden>
>> >> > wrote:
>> >> >> On 9 September 2016 at 23:39, Ala'a Mohammad wrote:
>> >> >>> the errors happened inside 'hist' function, and I presume mostly
>> >> >>> due to the jot dot find (if understand correctly, operating on a
>> >> >>> matrix of length equal to : unique-length * words-length)
>> >> >>
>> >> >> Try (∪⍵)∘.≡⍵ instead of ∨/¨(∪⍵)∘.⍷⍵.
>> >> >>
>> >> >> -k
>> >>
>> >
>
>
