Internal storage costs...

Is it more efficient to store a lexeme as a list of character codes or as a list of single-character atoms?

I don't know the C code beyond what I have seen through the FFI. Is it more compact to store a list of integers, which presumably represent themselves, or a list of single-character atoms?

I am guessing that the Flyweight pattern, or something similar, is used, which would mean that single-character atoms are actually pointers to the atom. A list of one hundred 'a'-s would then be a list of one hundred pointers into the atom store. But is the pointer size bigger than the character code size?
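A rough way to measure this rather than guess is to compare global stack (heap) usage before and after building each kind of list. This is only a sketch: n_copies/3 and heap_cost/3 are hypothetical helper names, and it assumes that statistics(global_stack, [Used, Free]) reports sizes in bytes and that no garbage collection runs between the two readings.

```prolog
% n_copies(+N, +X, -List): List holds N copies of X.
n_copies(0, _, []) :- !.
n_copies(N, X, [X|T]) :-
    N > 0,
    N1 is N - 1,
    n_copies(N1, X, T).

% heap_cost(+N, +X, -Bytes): growth of the global stack while building
% a list of N copies of X. Assumes statistics/2 reports [Used, Free] in
% bytes. The figure includes a small constant overhead for the argument
% list of the second statistics/2 call, the same for every measurement.
heap_cost(N, X, Bytes) :-
    statistics(global_stack, [Before, _]),
    n_copies(N, X, _List),
    statistics(global_stack, [After, _]),
    Bytes is After - Before.
```

A query such as heap_cost(100, a, Atoms), heap_cost(100, 0'a, Codes) then gives the two figures side by side. If they come out the same, each list element costs one cell whether it holds an atom or an integer.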

I ask because my lexer is working and producing output like this:

| ?- feltlex('small.felt',X).

X = [comment(block,pos(1,1),[' ','S',t,r,i,n,g,' ',t,e,s,t,i,n,g,'.','\n','\n',' ',' ',' ','A',l,l,o,w,' ',b,a,c,k,s,l,a,s,h,e,d,' ',d,e,l,i,m,t,e,r,' ',i,n,' ',t,h,e,' ',s,e,q,u,e,n,c,e,'.','.','.','.','\n']),chr(/),comment(single,pos(6,1),[' ','D',o,u,b,l,e,' ',q,u,o,t,e,d,' ',s,t,r,i,n,g,s,'.','.','.']),string(double,pos(7,1),[c,h,e,e,s,\,'"',e,b,u,r,g,e,r]),string(double,pos(8,1),[c,h,e,e,s,\,'''',e,b,u,r,g,e,r]),comment(single,pos(10,1),[' ','S',i,n,g,l,e,' ',q,u,o,t,e,d,' ',s,t,r,i,n,g,s,'.','.','.']),string(single,pos(11,1),[c,h,e,e,s,e,\,'"',b,u,r,g,e,r]),string(single,pos(12,1),[c,h,e,e,s,e,\,'''',b,u,r,g,e,r])]

That's from a source file:

/* String testing.

Allow backslashed delimter in the sequence....

; Double quoted strings...

"chees\"eburger"

"chees\'eburger"

; Single quoted strings...

'cheese\"burger'

'cheese\'burger'

Not a brilliant example, but it was for testing the comment handling and the string consumption, which allows a backslashed single or double quote to be part of the string. The lexer reads using get_char/peek_char with LA(1), and that lets me cope well enough for now. It is s-expression based.
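For illustration, string consumption of that kind might look roughly like the sketch below. It is not the actual lexer code: read_string_chars/3, string_rest/4 and escaped/4 are hypothetical names, and it assumes the opening delimiter has already been consumed. It keeps both the backslash and the escaped character, as in the output shown earlier, and needs only get_char/2, since no lookahead is required inside a string.

```prolog
% read_string_chars(+Stream, +Quote, -Chars): read characters up to the
% closing Quote, which is consumed but not included in Chars.
read_string_chars(S, Q, Chars) :-
    get_char(S, C),
    string_rest(C, S, Q, Chars).

string_rest(end_of_file, _, _, []) :- !.      % unterminated string: stop
string_rest(Q, _, Q, []) :- !.                % closing delimiter reached
string_rest('\\', S, Q, ['\\'|T]) :- !,       % backslash: keep it, and
    get_char(S, Next),                        % take the next character
    escaped(Next, S, Q, T).                   % without testing for Quote
string_rest(C, S, Q, [C|T]) :-                % ordinary character
    read_string_chars(S, Q, T).

escaped(end_of_file, _, _, []) :- !.          % input ended after backslash
escaped(C, S, Q, [C|T]) :-
    read_string_chars(S, Q, T).
```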

For a really large source file, I want to make sure that I am being as efficient with internal storage as possible. Once I have completed the lexer, I have to be able to create an AST from its output and then translate that into something else, and I have recently found out that GNU Prolog segfaults under OS X when dealing with large amounts of in-memory data.
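One option that may be worth comparing, independent of the atoms-versus-codes question, is to collapse each character list into a single atom with atom_chars/2 once a token is complete. The sketch below assumes the token shapes shown in the output above; pack_token/2 and pack_tokens/2 are hypothetical names. Atoms live in a separate atom table with its own limits, so this is a trade-off rather than a guaranteed saving.

```prolog
% pack_token(+Token, -Packed): replace the character list of a comment
% or string token with one atom; other tokens pass through unchanged.
pack_token(comment(Kind, Pos, Chars), comment(Kind, Pos, Text)) :- !,
    atom_chars(Text, Chars).
pack_token(string(Kind, Pos, Chars), string(Kind, Pos, Text)) :- !,
    atom_chars(Text, Chars).
pack_token(Token, Token).

% pack_tokens(+Tokens, -PackedTokens): apply pack_token/2 to each token.
pack_tokens([], []).
pack_tokens([T|Ts], [P|Ps]) :-
    pack_token(T, P),
    pack_tokens(Ts, Ps).
```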

So, does anybody know which is the more space-compact representation: atoms or character codes?

Thanks,

Sean.

From:	Sean Charles
Subject:	Internal storage costs...
Date:	Wed, 24 Jul 2013 23:35:02 +0100