Hi again,
sorry, I meant:
downcase←{ ⎕UCS (32×(⍵≥65)∧⍵≤90)+⍵←⎕UCS ⍵ }
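As a quick check of the ⎕UCS approach, here is a minimal sketch that moves the code-point arithmetic into a helper, so that nothing is assigned to ⍵. The name lower is made up for this example. It only folds the ASCII range A-Z; accented capitals such as Ü pass through unchanged.

```apl
⍝ lower works on Unicode code points: add 32 to anything in 65..90 (A..Z)
lower ← { ⍵+32×(⍵≥65)∧⍵≤90 }
⍝ convert characters to code points, fold, convert back to characters
downcase ← { ⎕UCS lower ⎕UCS ⍵ }
downcase 'Hello, World'
```

The last line should print hello, world. The comma and the blank are left alone because their code points (44 and 32) lie outside 65..90.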
/// Jürgen
On 09/12/2016 12:10 PM, Juergen Sauermann wrote:
Hi Ala'a,
you can use ⎕FIO ¯1 to find out where the time is spent, e.g.:
T←⎕FIO ¯1
file ← 'test.txt'
'T1:' ((T←⎕FIO ¯1)-T)
⎕ ← ⍴w ← words ftxt file
'T2:' ((T←⎕FIO ¯1)-T)
⎕ ← ⍴u ← ∪w
'T3:' ((T←⎕FIO ¯1)-T)
desc 39 2 ⍴ (⍪u),{+/(⊂⍵)∘.≡w}¨u
'T4:' ((T←⎕FIO ¯1)-T)
Your downcase function fails on my machine:
⎕ ← ⍴w ← words ftxt file
INDEX ERROR+
λ1[1] λ←(a,⎕AV)[(A,⎕AV)⍳⍵]
^ ^
)MORE
⎕IO=1 offending index=282 max index=282
This is probably due to a character in my test file that is not contained in ⎕AV.
You should use ⎕UCS instead of ⎕AV to avoid that:
downcase←{ ⎕UCS (32×(T≥65)∧T≤90)+⍵←⎕UCS ⍵ }
/// Jürgen
On 09/11/2016 08:23 PM, Ala'a Mohammad wrote:
Just an update for reference: I'm now able to parse the big.txt file
(without WS FULL or a killed process), but it takes around 2 hours and
20 minutes, ±10 minutes (around 1M words, of which 30K are unique). The
process reaches 1GiB after parsing the words and adds another 100MiB
during the sequential 'Each' (thus a maximum of 1.1GiB).
The only change is scanning each unique word against the whole words vector.
Below is the code with a sample timed run.
Regards,
Ala'a
⍝ fhist.apl
a ← 'abcdefghijklmnopqrstuvwxyz' ◊ A ← 'ABCDEFGHIJKLMNOPQRSTUVWXYZ'
downcase ← { (a,⎕AV)[(A,⎕AV)⍳⍵] }
nl ← ⎕UCS 10 ◊ cr ← ⎕UCS 13 ◊ tab ← ⎕UCS 9
nonalpha ← nl, cr, tab, ' 0123456789()[]!?%$,.:;/+*=<>-_#"`~@&'
alphamask ← { ~ ⍵ ∊ nonalpha }
words ← { (alphamask ⍵) ⊂ downcase ⍵ }
desc ← {⍵[⍒⍵[;2];]}
ftxt ← { ⎕FIO[26] ⍵ }
file ← '/misc/big.txt' ⍝ ~ 6.2M
⎕ ← ⍴w ← words ftxt file
⎕ ← ⍴u ← ∪w
desc 39 2 ⍴ (⍪u),{+/(⊂⍵)∘.≡w}¨u
)OFF
: time apl -s -f fhist.apl
1098281
30377
the 80003
of 40025
to 28760
in 22048
for 6936
by 6736
be 6154
or 5349
all 4141
this 4058
are 3627
other 1488
before 1363
should 1297
over 1282
your 1276
any 1204
our 1065
holmes 450
country 417
world 355
project 286
gutenberg 262
laws 233
sir 176
series 128
sure 123
sherlock 101
ebook 85
copyright 69
changing 44
check 38
arthur 30
adventures 17
redistributing 7
header 7
doyle 5
downloading 5
conan 4
apl -s -f fhist.apl 8901.96s user 5.78s system 99% cpu 2:28:38.61 total
On Sat, Sep 10, 2016 at 12:02 PM, Ala'a Mohammad <address@hidden> wrote:
Thanks to all for the input,
Replacing Find and Each OR with Match helped; I'm now parsing a 159K
(~1545 lines) text file (a sample chunk from big.txt).
What I'm trying to understand is that the APL process, when fed the
159K text file, keeps allocating memory until it reaches 2.7GiB, then
settles down to 50MiB after printing the result.
Why do I need 2.7GiB? Are there any memory utilities (e.g. a garbage
collection utility) that can be used to mitigate this issue?
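A back-of-the-envelope estimate, using only the sizes printed by the sample run further down (30186 words, 4155 unique), hints at where the memory may go: (∪⍵)∘.≡⍵ produces one result element for every pair of unique word and word. The bytes-per-element figure below is an inference from the reported 2.7GiB peak, not a measured property of the interpreter.

```apl
⍝ number of elements in the outer product for the 159K sample
cells ← 4155 × 30186
cells
⍝ the reported 2.7GiB peak (2.7 × 2^30 bytes) spread over those elements
(2.7 × 1073741824) ÷ cells
```

That works out to about 125 million elements and, if the whole peak were this one matrix, roughly 23 bytes per element. The matrix is only needed until +/ has reduced it, which would be consistent with the process settling down afterwards.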
Here is the updated code:
a ← 'abcdefghijklmnopqrstuvwxyz'
A ← 'ABCDEFGHIJKLMNOPQRSTUVWXYZ'
downcase ← { (a,⎕AV)[(A,⎕AV)⍳⍵] }
nl ← ⎕UCS 10 ◊ cr ← ⎕UCS 13 ◊ tab ← ⎕UCS 9
nonalpha ← nl, cr, tab, ' 0123456789()[]!?%$,.:;/+*=<>-_#"`~@&'
alphamask ← { ~ ⍵ ∊ nonalpha }
words ← { (alphamask ⍵) ⊂ downcase ⍵ }
hist ← { (⍪∪⍵),+/(∪⍵)∘.≡⍵ } ⍝ as suggested by Kacper
desc ← {⍵[⍒⍵[;2];]}
ftxt ← { ⎕FIO[26] ⍵ }
fhist ← { hist words ftxt ⍵ }
file ← '/misc/llaa' ⍝ llaa contains 1546 text lines
⎕ ← ⍴w ← words ftxt file
⎕ ← ⍴u ← ∪w
desc 39 2 ⍴ fhist file
And here is a sample run
: apl -s -f fhist.apl
30186
4155
the 1560
to 804
of 781
in 493
for 219
be 173
holmes 164
your 132
this 114
all 99
by 97
are 97
or 73
other 56
over 51
our 48
should 47
before 43
sherlock 39
any 35
sir 26
sure 13
country 9
project 6
gutenberg 6
ebook 5
adventures 5
world 5
arthur 4
conan 4
doyle 4
series 2
copyright 2
laws 2
check 2
header 2
changing 1
downloading 1
redistributing 1
I have also attached the sample input file.
Regards,
On Sat, Sep 10, 2016 at 9:20 AM, Kacper Gutowski <address@hidden> wrote:
On 9 September 2016 at 23:39, Ala'a Mohammad wrote:
The errors happened inside the 'hist' function, and I presume they are mostly
due to the jot-dot find (if I understand correctly, it operates on a matrix of
size unique-length × words-length)
Try (∪⍵)∘.≡⍵ instead of ∨/¨(∪⍵)∘.⍷⍵.
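The difference between the two expressions can be seen on a toy input. This is a small sketch with a made-up word list, not the data from the thread: Match yields a single 0 or 1 per pair of words, whereas Find yields a whole Boolean vector per pair that then has to be reduced with ∨/¨.

```apl
⍝ toy word list; 'then' contains 'the' as a substring
w ← 'the' 'cat' 'the' 'then'
u ← ∪w                  ⍝ 'the' 'cat' 'then'
⍝ Match: one scalar 0 or 1 per (unique word, word) pair
+/ u ∘.≡ w              ⍝ 2 1 1
⍝ Find: one Boolean vector per pair, reduced afterwards with ∨/¨
+/ ∨/¨ u ∘.⍷ w
```

Besides the larger intermediate result, the Find version would be expected to count 'the' three times on this input, because 'the' also occurs inside 'then'. Match compares whole words only.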
-k