Re: [Bug-apl] Spell corrector - APL
From: Ala'a Mohammad
Subject: Re: [Bug-apl] Spell corrector - APL
Date: Tue, 13 Sep 2016 20:08:36 +0400
My apologies!
It was my mistake: I rushed to a conclusion (a fat-finger typo).
It works, and thanks a lot! The total time went down to about 4 minutes,
give or take a few seconds (compared with the previous 2 hours and
~30 minutes). Total memory use also dropped by about 100 MiB.
Thanks again.
Below is the summary (the latest version of the program, with some
kludged timing functionality, is attached at the end).
#Words: 1,098,281 (in 3.775 sec)
#Unique: 30,377 (in 2 min 36.497 sec)
Index Of: 1,098,281 (in 1 min 22.155 sec)
Group Indexes: 30,377 (in 1.436 sec)
Tally Each: 30,377 (in 13 msec)
Count Table: 30,377 (in 15 msec)
Total Time: 4 min 3 sec
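The stages timed above can be illustrated on a tiny input. This is only a
sketch: the six-word vector is made up for illustration and is not taken
from big.txt, the name s is introduced here, and the values in the comments
assume the default ⎕IO←1.

```apl
⍝ Hypothetical six-word sample (not from big.txt)
w ← 'the' 'cat' 'the' 'dog' 'cat' 'the'
u ← ∪w          ⍝ unique words in order of first appearance: the cat dog
x ← u⍳w         ⍝ position of each word in u: 1 2 1 3 2 1
s ← x[⍋x]       ⍝ sorted positions, equal values adjacent: 1 1 1 2 2 3
j ← s⊂s         ⍝ same as ⊂⍨s: a new group starts wherever s increases
y ← ≢¨j         ⍝ group sizes, one per unique word: 3 2 1
⎕ ← (⍪u),y      ⍝ two-column count table (word, count)
```

Unlike the earlier outer-product version, no step here builds a
unique-by-words matrix; the intermediate results are vectors no longer
than w.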
I hope I have not missed anything or made a mistake. I will now continue
with the rest of the program (the spelling corrector).
Thanks for all the input.
Regards,
Ala'a
*** Program ***
⍝ hist.apl
⍝ Timing functions
∇t←ticks
⍝t←ts2ms ⎕TS
t←+/0 0 0 (60×60×1000) (60×1000) 1000 1 × ⎕TS
∇
∇tireset
tcount ← 0 ⍝ Timestamp Label counter.
prevticks ← 0 ⍝ Previous Timestamp.
∇
∇ tsi ;t;diff;min;sec;msec;i ⍝ TimeStamp Interval
t←ticks
→(prevticks=0)/initial
diff←t-prevticks
sec←60|i←1000÷⍨diff-msec←1000|diff
min←⌊i÷60
tcount←tcount+1
⎕ ← 'T', ⍕tcount, ':', min , sec, msec, '(', t, ') -', ⎕TS
→update
initial:
⎕ ← 'T0 : 0 0 0 ( 0 ) - ', ⎕TS
update:
prevticks ← t
∇
tireset
⍝ String functions
a ← 'abcdefghijklmnopqrstuvwxyz' ◊ A ← 'ABCDEFGHIJKLMNOPQRSTUVWXYZ'
⍝ downcase ← { (a,⎕AV)[(A,⎕AV)⍳⍵] }
downcase←{ ⎕UCS (32×(⍵≥65)∧⍵≤90)+⍵←⎕UCS ⍵ } ⍝ by Juergen Sauermann
nl ← ⎕UCS 10 ◊ cr ← ⎕UCS 13 ◊ tab ← ⎕UCS 9 ⍝ LF is 10, CR is 13
nonalpha ← nl, cr, tab, ' 0123456789()[]!?%$,.:;/+*=<>-_#"`~@&'
alphamask ← { ~ ⍵ ∊ nonalpha }
words ← { (alphamask ⍵) ⊂ downcase ⍵ }
⍝ Utils functions
desc ← {⍵[⍒⍵[;2];]}
ftxt ← { ⎕FIO[26] ⍵ }
⍝
⍝ Sample Run
⍝
file ← '/misc/big.txt' ⍝ ~ 6.2M
⍝ file ← '/misc/llaa' ⍝ ~ 159K
tsi
⎕ ← '#Words:', ⍴w ← words ftxt file
tsi
⎕ ← '#Unique:', ⍴u ← ∪w
tsi
⍝⍝ j ← {+/(⊂⍵)∘.≡w}¨u (previous solution)
⎕ ← 'Index Of:', ⍴x ← u⍳w
tsi
⎕ ← 'Group Indexes:', ⍴j ← ⊂⍨x[⍋x]
tsi
⎕ ← 'Tally Each:', ⍴y ← ≢¨j
tsi
⎕ ← 'Count Table:', ⍴i ← (⍪u),y
tsi
desc 39 2 ⍴ i
tsi
)OFF
*** Sample Run ***
** Note: each timing line shows 3 numbers (minutes, seconds, milliseconds)
elapsed since the last 'tsi' call, then the total ticks in parentheses
(milliseconds since midnight), then the current timestamp.
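As a worked check of that format, the T2 line of the run below can be
reproduced by hand from the T1 and T2 timestamps. This is a sketch only: the
names ts1, ts2, wt, t1 and t2 are introduced here for illustration, while the
arithmetic follows the ticks and tsi functions in the listing.

```apl
⍝ Timestamps copied from the T1 and T2 lines of the sample run
ts1 ← 2016 9 13 19 56 54 274
ts2 ← 2016 9 13 19 59 30 771
wt ← 0 0 0 3600000 60000 1000 1   ⍝ ms per hour, minute, second, ms; date ignored
t1 ← +/wt×ts1                     ⍝ 71814274
t2 ← +/wt×ts2                     ⍝ 71970771
diff ← t2-t1                      ⍝ 156497 ms between the two calls
msec ← 1000|diff                  ⍝ 497
i ← (diff-msec)÷1000              ⍝ 156 whole seconds
sec ← 60|i                        ⍝ 36
min ← ⌊i÷60                       ⍝ 2
⎕ ← min sec msec                  ⍝ 2 36 497
```

Because the year, month and day carry a weight of 0, an interval that
crosses midnight would come out negative with this scheme.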
: apl -s -q -f hist.apl
T0 : 0 0 0 ( 0 ) - 2016 9 13 19 56 50 499
#Words: 1098281
T1 : 0 3 775 ( 71814274 ) - 2016 9 13 19 56 54 274
#Unique: 30377
T2 : 2 36 497 ( 71970771 ) - 2016 9 13 19 59 30 771
Index Of: 1098281
T3 : 1 22 155 ( 72052926 ) - 2016 9 13 20 0 52 926
Group Indexes: 30377
T4 : 0 1 436 ( 72054362 ) - 2016 9 13 20 0 54 362
Tally Each: 30377
T5 : 0 0 13 ( 72054375 ) - 2016 9 13 20 0 54 375
Count Table: 30377 2
T6 : 0 0 15 ( 72054390 ) - 2016 9 13 20 0 54 390
the 80003
of 40025
to 28760
in 22048
for 6936
by 6736
be 6154
or 5349
all 4141
this 4058
are 3627
other 1488
before 1363
should 1297
over 1282
your 1276
any 1204
our 1065
holmes 450
country 417
world 355
project 286
gutenberg 262
laws 233
sir 176
series 128
sure 123
sherlock 101
ebook 85
copyright 69
changing 44
check 38
arthur 30
adventures 17
redistributing 7
header 7
doyle 5
downloading 5
conan 4
T7 : 0 0 6 ( 72054396 ) - 2016 9 13 20 0 54 396
On Tue, Sep 13, 2016 at 7:25 PM, Jay Foad <address@hidden> wrote:
> This looks like you are applying desc to an array that does not have rank 2.
> I don't see how that can happen if you entered this exactly, since the
> argument of desc must have shape 39 2:
>
> desc 39 2⍴(⍪u),≢¨⊂⍨x[⍋x←u⍳w]
>
> Jay.
>
> On 12 September 2016 at 18:34, Ala'a Mohammad <address@hidden> wrote:
>>
>> Thanks for the alternative, I'd tried to run it, but got Rank Error
>>
>> RANK ERROR
>> λ1[1] λ←⍵[⍒⍵[;2];]
>> ^ ^
>>
>> How can I help debug this?
>>
>> Regards,
>>
>> Ala'a
>>
>> On Mon, Sep 12, 2016 at 5:32 PM, Jay Foad <address@hidden> wrote:
>> > Hi Ala'a,
>> >
>> > How about replacing the last line with this? It runs in about 1 minute
>> > on my machine:
>> >
>> > desc 39 2⍴(⍪u),≢¨⊂⍨x[⍋x←u⍳w]
>> >
>> > Jay.
>> >
>> > On 11 September 2016 at 19:23, Ala'a Mohammad <address@hidden> wrote:
>> >>
>> >> Just an update as a reference, I'm now able to parse the big.txt file
>> >> (without WS full or killed process), but it takes around 2 Hours and
>> >> 20 Minutes +-10 minutes. (around 1M words, 30K are unique). The
>> >> process reach 1GiB (after parsing the words), and tops that with
>> >> 100MiB during the sequential 'Each' (thus a max of 1.1GiB).
>> >>
>> >> The only change is scanning each unique word against the whole words
>> >> vector.
>> >>
>> >> Below is the code with a sample timed run.
>> >>
>> >> Regards,
>> >>
>> >> Ala'a
>> >>
>> >> ⍝ fhist.apl
>> >> a ← 'abcdefghijklmnopqrstuvwxyz' ◊ A ← 'ABCDEFGHIJKLMNOPQRSTUVWXYZ'
>> >> downcase ← { (a,⎕AV)[(A,⎕AV)⍳⍵] }
>> >> nl ← ⎕UCS 13 ◊ cr ← ⎕UCS 10 ◊ tab ← ⎕UCS 9
>> >> nonalpha ← nl, cr, tab, ' 0123456789()[]!?%$,.:;/+*=<>-_#"`~@&'
>> >> alphamask ← { ~ ⍵ ∊ nonalpha }
>> >> words ← { (alphamask ⍵) ⊂ downcase ⍵ }
>> >> desc ← {⍵[⍒⍵[;2];]}
>> >> ftxt ← { ⎕FIO[26] ⍵ }
>> >>
>> >> file ← '/misc/big.txt' ⍝ ~ 6.2M
>> >> ⎕ ← ⍴w ← words ftxt file
>> >> ⎕ ← ⍴u ← ∪w
>> >> desc 39 2 ⍴ (⍪u),{+/(⊂⍵)∘.≡w}¨u
>> >> )OFF
>> >>
>> >> : time apl -s -f fhist.apl
>> >> 1098281
>> >> 30377
>> >> the 80003
>> >> of 40025
>> >> to 28760
>> >> in 22048
>> >> for 6936
>> >> by 6736
>> >> be 6154
>> >> or 5349
>> >> all 4141
>> >> this 4058
>> >> are 3627
>> >> other 1488
>> >> before 1363
>> >> should 1297
>> >> over 1282
>> >> your 1276
>> >> any 1204
>> >> our 1065
>> >> holmes 450
>> >> country 417
>> >> world 355
>> >> project 286
>> >> gutenberg 262
>> >> laws 233
>> >> sir 176
>> >> series 128
>> >> sure 123
>> >> sherlock 101
>> >> ebook 85
>> >> copyright 69
>> >> changing 44
>> >> check 38
>> >> arthur 30
>> >> adventures 17
>> >> redistributing 7
>> >> header 7
>> >> doyle 5
>> >> downloading 5
>> >> conan 4
>> >>
>> >> apl -s -f fhist.apl 8901.96s user 5.78s system 99% cpu 2:28:38.61 total
>> >>
>> >> On Sat, Sep 10, 2016 at 12:02 PM, Ala'a Mohammad <address@hidden> wrote:
>> >> > Thanks to all for the input,
>> >> >
>> >> > Replacing Find and Each OR with Match helped, now I'm parsing a 159K
>> >> > (~1545 lines) text file (a sample chunk from the big.txt).
>> >> >
>> >> > The strange thing for me that I'm trying to understand is that the APL
>> >> > process (when fed the 159K text file) start allocating memory until it
>> >> > reaches 2.7GiB, then after printing the result settle down to 50MiB.
>> >> > Why do I need 2.7GiB? is there any memory utils (i.e. Garbage
>> >> > collection utility) which can be used to mitigate this issue?
>> >> >
>> >> > Here is the updated code:
>> >> >
>> >> > a ← 'abcdefghijklmnopqrstuvwxyz'
>> >> > A ← 'ABCDEFGHIJKLMNOPQRSTUVWXYZ'
>> >> > downcase ← { (a,⎕AV)[(A,⎕AV)⍳⍵] }
>> >> > nl ← ⎕UCS 13 ◊ cr ← ⎕UCS 10 ◊ tab ← ⎕UCS 9
>> >> > nonalpha ← nl, cr, tab, ' 0123456789()[]!?%$,.:;/+*=<>-_#"`~@&'
>> >> > alphamask ← { ~ ⍵ ∊ nonalpha }
>> >> > words ← { (alphamask ⍵) ⊂ downcase ⍵ }
>> >> > hist ← { (⍪∪⍵),+/(∪⍵)∘.≡⍵ } ⍝ as suggested by Kacper
>> >> > desc ← {⍵[⍒⍵[;2];]}
>> >> > ftxt ← { ⎕FIO[26] ⍵ }
>> >> > fhist ← { hist words ftxt ⍵ }
>> >> >
>> >> > file ← '/misc/llaa' ⍝ llaa contains 1546 text lines
>> >> > ⎕ ← ⍴w ← words ftxt file
>> >> > ⎕ ← ⍴u ← ∪w
>> >> > desc 39 2 ⍴ fhist file
>> >> >
>> >> > And here is a sample run
>> >> > : apl -s -f fhist.apl
>> >> > 30186
>> >> > 4155
>> >> > the 1560
>> >> > to 804
>> >> > of 781
>> >> > in 493
>> >> > for 219
>> >> > be 173
>> >> > holmes 164
>> >> > your 132
>> >> > this 114
>> >> > all 99
>> >> > by 97
>> >> > are 97
>> >> > or 73
>> >> > other 56
>> >> > over 51
>> >> > our 48
>> >> > should 47
>> >> > before 43
>> >> > sherlock 39
>> >> > any 35
>> >> > sir 26
>> >> > sure 13
>> >> > country 9
>> >> > project 6
>> >> > gutenberg 6
>> >> > ebook 5
>> >> > adventures 5
>> >> > world 5
>> >> > arthur 4
>> >> > conan 4
>> >> > doyle 4
>> >> > series 2
>> >> > copyright 2
>> >> > laws 2
>> >> > check 2
>> >> > header 2
>> >> > changing 1
>> >> > downloading 1
>> >> > redistributing 1
>> >> >
>> >> > Also attached the sample input file
>> >> >
>> >> > Regards,
>> >> >
>> >> > On Sat, Sep 10, 2016 at 9:20 AM, Kacper Gutowski <address@hidden> wrote:
>> >> >> On 9 September 2016 at 23:39, Ala'a Mohammad wrote:
>> >> >>> the errors happened inside 'hist' function, and I presume mostly due
>> >> >>> to the jot dot find (if understand correctly, operating on a matrix of
>> >> >>> length equal to : unique-length * words-length)
>> >> >>
>> >> >> Try (∪⍵)∘.≡⍵ instead of ∨/¨(∪⍵)∘.⍷⍵.
>> >> >>
>> >> >> -k
>> >>
>> >
>
>