
Re: [Bug-apl] Use with word2vec


From: Leslie S Satenstein
Subject: Re: [Bug-apl] Use with word2vec
Date: Sat, 29 Apr 2017 01:50:15 +0000 (UTC)

Hi Fred,

Following up on Xiao-Yong Jin's response.

You did not mention whether you need the data in real time or whether you can
work at APL interpreter speed. Do you have a structure for your data? You
mentioned a format of [text][floats] without specifying the size of the text
or the number of floats. Is your data clean, or does it need to be vetted
(NaNs excluded)?

I believe you should create a data dictionary constructed with SQLite. The
data would be loaded into SQLite by some C, C++, or Python code and
subsequently read via shared variables. APL is an interpreter: what would take
hours in APL alone could take a few minutes by loading the SQL database
externally and then using APL for presentation. A sketch of such a loader
follows below.
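
Here is a minimal sketch of the kind of loader I mean, in Python. The file
name, table layout, and vector width are illustrative assumptions only, since
you have not yet specified your format:

    # Hypothetical loader: reads "word v1 v2 ... vN" text lines into SQLite.
    # File name, table name, and DIM are assumptions for this sketch.
    import sqlite3
    import struct

    DIM = 300  # assumed number of floats per word

    conn = sqlite3.connect("vectors.db")
    conn.execute("CREATE TABLE IF NOT EXISTS vectors"
                 " (word TEXT PRIMARY KEY, vec BLOB)")

    rows = []
    with open("vectors.txt", encoding="utf-8") as f:
        for line in f:
            parts = line.split()
            word, floats = parts[0], [float(x) for x in parts[1:]]
            if len(floats) != DIM:           # vetting: drop malformed rows
                continue
            if any(x != x for x in floats):  # vetting: NaN != NaN, drop NaNs
                continue
            rows.append((word, struct.pack(f"{DIM}f", *floats)))

    conn.executemany("INSERT OR REPLACE INTO vectors VALUES (?, ?)", rows)
    conn.commit()
    conn.close()

APL would then pull individual rows back through the shared-variable bridge
instead of parsing the whole multi-gigabyte file itself.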

It's an interesting idea you have. Can you put out a more formal draft starter
document? Something to fill in the topics below:
Aim:
Data descriptions/quantities:
Vetting and filtering:
Processing speed:
Frequency of use:

Since you propose to do the work, who can estimate the cost?

From: Xiao-Yong Jin <address@hidden>
To: address@hidden
Cc: GNU APL <address@hidden>
Sent: Friday, April 28, 2017 9:32 PM
Subject: Re: [Bug-apl] Use with word2vec



If shared variables can go through SHMEM, you can probably interface
CUDA that way without much of a bottleneck.
But with the way GNU APL is implemented now, there are just too many
other limitations on performance with arrays of that size.
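
As a rough illustration of the SHMEM idea (a minimal Python sketch using POSIX
shared memory; the segment name and array shape are arbitrary assumptions, and
a real bridge would sit inside GNU APL's shared-variable code):

    # Producer: place a float32 array in POSIX shared memory so another
    # process (e.g. a GPU-side consumer) can map the same bytes, no copy.
    # Segment name and size are assumptions for this sketch (Python 3.8+).
    import numpy as np
    from multiprocessing import shared_memory

    shm = shared_memory.SharedMemory(create=True, size=300 * 4,
                                     name="apl_vec")
    vec = np.ndarray((300,), dtype=np.float32, buffer=shm.buf)
    vec[:] = np.arange(300, dtype=np.float32)  # stand-in for real data

    # A consumer in another process attaches by name, sees the same bytes:
    #   shm2 = shared_memory.SharedMemory(name="apl_vec")
    #   vec2 = np.ndarray((300,), dtype=np.float32, buffer=shm2.buf)

    shm.close()
    shm.unlink()  # remove the segment when done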

> On Apr 28, 2017, at 9:19 PM, Fred Weigel <address@hidden> wrote:
>
> Jürgen, and other GNU APL experts.
>
> I am exploring neural nets, word2vec, and some other AI-related areas.
>
> Right now, I want to tie in Google's word2vec trained models (the
> billion-word one, GoogleNews-vectors-negative300.bin.gz).
>
> This is a binary file containing a lot of floating point data -- about
> 3.5GB in all. These are words, each followed by cosine distances. I
> could attempt to feed this in the slow way and put it into an APL
> workspace. But... I also intend to attempt feeding the data to a GPU.
> So, what I am looking for is a modification to GNU APL (and yes, I am
> willing to do the work) -- to allow for the complete suppression of
> normal C++ allocations, etc., and to allow the introduction of simple
> float/double vectors or matrices (it would be helpful to allow "C"-ish
> or UTF-8-ish strings: the data is (C string containing word name)
> (fixed number of floating point values)... repeated LOTS of times).
>
> The data set(s) may be compressed, so I don't want to read them
> directly -- possibly from a shared memory region (64-bit systems only,
> of course), or perhaps using shared variables... but I don't think
> that would be fast enough.
>
> Anyway, this begins the push into "big data" and AI applications.
> Just looking for some input and ideas here.
>
> Many thanks
> Fred Weigel
>
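
For reference, the layout Fred describes (a word as a C string, then a fixed
number of floats, repeated) can be walked like this. A minimal Python sketch,
assuming the standard word2vec .bin layout, which also carries an ASCII
"vocab_size dim" header line, after decompressing the .gz:

    # Minimal reader for the word2vec binary format described above.
    # Assumes the standard layout: an ASCII header "vocab_size dim\n", then
    # for each entry a space-terminated word plus dim float32 values
    # (the original word2vec tool also writes a newline after each vector).
    import struct

    def read_word2vec_bin(path, limit=None):
        with open(path, "rb") as f:
            vocab_size, dim = map(int, f.readline().split())
            n = vocab_size if limit is None else min(limit, vocab_size)
            for _ in range(n):
                word = bytearray()
                while (c := f.read(1)) != b" ":
                    if c != b"\n":      # skip the newline between entries
                        word += c
                vec = struct.unpack(f"{dim}f", f.read(4 * dim))
                yield word.decode("utf-8", errors="replace"), vec

    # Peek at the first entry without loading all 3.5GB:
    for word, vec in read_word2vec_bin("GoogleNews-vectors-negative300.bin",
                                       limit=1):
        print(word, vec[:5])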




