gnumed-devel

Re: [Gnumed-devel] experiments with gnumed - multiusers vnc, importing a


From: Tim Churches
Subject: Re: [Gnumed-devel] experiments with gnumed - multiusers vnc, importing an au emr
Date: Tue, 25 Apr 2006 13:23:50 +1000
User-agent: Thunderbird 1.5.0.2 (Windows/20060308)

Syan Tan wrote:
> i've processed 360,000 rows of clin.clin_narrative and parsed out all the
> words containing letters. I was thinking of using a stoplist method where
> any word appearing on the stoplist will be replaced by 'xxxx'. The stoplist
> would also include all the names listed out from dem.names.lastnames and
> dem.names.firstnames.
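For illustration, the stoplist idea quoted above might look roughly like
this in Python. This is only a sketch under assumptions: the function name
`scrub` and the sample stoplist are made up, a "word" is taken to be a run
of ASCII letters, and in practice the stoplist would be loaded from the
name tables mentioned above.

```python
import re

# Hypothetical stoplist for illustration only; a real one would be built
# from the name tables plus any other words judged to be identifying.
STOPLIST = {"smith", "mary"}

# Assumption: a "word" is a run of ASCII letters.
WORD_RE = re.compile(r"[A-Za-z]+")


def scrub(text, stoplist=STOPLIST, mask="xxxx"):
    """Replace every whole word found on the stoplist with the mask."""
    def replace(match):
        word = match.group(0)
        # Compare case-insensitively; leave unlisted words untouched.
        return mask if word.lower() in stoplist else word
    return WORD_RE.sub(replace, text)


print(scrub("Mary Smith seen for asthma review."))
```

Matching is on whole words only, so a longer word that merely contains a
listed name (e.g. "Smithson") is left alone unless it is itself on the list.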
> 
> BTW - what about a secondary structure for clin.clin_narrative, where the
> narrative consists of a list of indexes pointing into a table of words. this
> is the simplest step before having some sort of semantic linking at the word
> level (but not at the phrase level).
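A rough sketch of the secondary structure suggested above, with each
narrative stored as a list of indexes into a shared word table. The names
`encode` and `decode` are hypothetical, splitting on whitespace is an
assumption, and in the database this would be tables rather than Python
lists.

```python
def encode(narratives):
    """Return (words, encoded): a shared word table and, for each
    narrative, the list of indexes into that table."""
    words = []      # word table: index -> word
    index_of = {}   # reverse lookup: word -> index
    encoded = []
    for text in narratives:
        row = []
        for word in text.split():  # assumption: whitespace-separated words
            if word not in index_of:
                index_of[word] = len(words)
                words.append(word)
            row.append(index_of[word])
        encoded.append(row)
    return words, encoded


def decode(words, row):
    """Rebuild a narrative from its list of indexes."""
    return " ".join(words[i] for i in row)


words, encoded = encode(["chest pain", "no chest pain"])
print(words)
print(encoded)
print(decode(words, encoded[1]))
```

With such a structure, substituting a word in the table changes it in every
narrative that points to it.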
> 
> whilst trying to recreate the gnumed database using a pg_dump, the dump
> reload seems to stall; I tried to turn off logging, table constraints,
> removing internal log table data, and fsync, which all finally worked, but
> I'm not sure what causes the stall.
>
> On Mon Apr 24 18:53, Karsten Hilbert sent:
> 
>     On Thu, Apr 20, 2006 at 09:47:54AM +0800, Syan Tan wrote:
> 
>      > thinking about it, the only correct thing to do seems to be to preserve
>      > the structure of the instance data and the health issue + episode
>      > headings, but to scramble the text with word substitution, as well as
>      > name substitution, date fudging, and address random relinking. would
>      > that be de-identified enough ?
>     Well, I tend to think that "de-identified enough" is a range
>     from "acceptably so" to "beyond use" rather than a cutoff.
>     The exact value used within that range depends on what sort
>     of protection you need.
> 
>     Yes, if you want to hide a patient's data securely from your
>     fellow doctor next door you will have to scramble the medical
>     content, too, as she might be able to match "real patient"
>     to "problems/operations listed" by her own medical skills
>     and thereby gain knowledge via the now re-identified EMR.
> 
>     But if you want to protect a patient's privacy from, say,
>     me, it's enough to falsify the identities. I do not have
>     access to your patients. I also have no idea how to find out
>     who your patients actually are in order to start matching
>     EMRs to patients. Hence proper protection is ensured, I dare
>     say. It is akin to not storing patient names with any
>     medical data and holding the EMR ID <-> patient identity
>     mapping elsewhere in a secure space (say, the patient's
>     brain).
> 
>     In a recent discussion on the openhealth list this topic was
>     chanced upon and the OpenEHR guys thought the latter
>     approach would be the most secure that's practically useful
>     - and they were talking real live patient data in actual
>     care.

I didn't mention it on the openEHR list (maybe I should) but merely
removing the direct identifiers (names, DOB etc) does not de-identify or
anonymise that data. For example, if the record reveals "32 yr old male,
with medical visits on 23/4/04, 12/6/05 and 14/01/06" then that record
has a very high probability of being unique to an individual in even a
large population. Hence if I know your age and sex (easily discovered or
ascertained) and I know that you had medical appointments on those dates
(e.g. if I had access to your work leave records, as staff in the
personnel department of your employer may have), then I can fairly
easily work out which record belongs to you. Disclosure control in microdata
almost always involves some degree of obfuscation, perturbation or
allocation to broad categories - in other words, a lot of detail needs
to be removed to make real data truly anonymous (in that it cannot be
re-identified). Also, anonymity of data is a continuum - it is not
dichotomous, and often it comes down to a risk judgement and some
assumptions about what additional information an 'attacker' who might
try to re-identify records might possess. If the data are to be made
publicly available, you can't make any assumptions about what an
attacker might or might not already know about a person, so you need to
be very conservative.
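
To make the uniqueness point concrete, here is a minimal Python sketch that
counts how many records are the only one with a given combination of
attributes. The function name `unique_fraction` and the four records are
invented purely for illustration; they are not real data.

```python
from collections import Counter


def unique_fraction(records, keys):
    """Fraction of records whose combination of `keys` occurs only once."""
    if not records:
        return 0.0
    combos = [tuple(r[k] for k in keys) for r in records]
    counts = Counter(combos)
    return sum(1 for c in combos if counts[c] == 1) / len(records)


# Made-up records: no names or dates of birth, only age, sex, visit dates.
records = [
    {"age": 32, "sex": "M",
     "visits": ("2004-04-23", "2005-06-12", "2006-01-14")},
    {"age": 32, "sex": "M", "visits": ("2004-05-02",)},
    {"age": 32, "sex": "M", "visits": ("2005-11-30", "2006-02-01")},
    {"age": 57, "sex": "F", "visits": ("2004-04-23",)},
]

print(unique_fraction(records, ["age", "sex"]))
print(unique_fraction(records, ["age", "sex", "visits"]))
```

In this toy set, age and sex alone single out only one record in four, but
adding the visit dates makes every record unique. Coarsening the dates (say,
to the year) or perturbing them is the kind of detail removal that brings
that fraction back down.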

Tim C



