DATA RETENTION: WHAT TO KEEP & HOW LONG FOR
By Dr Paul S. Ganney FIAP
The question of Data Retention has been around almost as long as storage has existed, but over the last couple of decades, the driver has changed.
Originally it was the cost of the media: it was standard practice for organisations such as the BBC to re-use the tapes that they recorded programmes on, rather than keep buying new ones. There have been many news stories about episodes of classic TV programmes suffering that fate but, somehow, someone had a copy or had recorded it from the broadcast. Now the drivers are two-fold: the cost of storing the media that the data is stored on and the difficulty in finding what has been stored.
To those in the first scenario, the concept of storage regimes such as Git (as hosted on GitHub) and Apple’s Time Machine (both mechanisms for keeping earlier versions of files, not just the most recent one) would have seemed inefficient and extremely expensive. To those in the second scenario, the concept of not keeping something is alien, and not just because scientists have a tendency to be hoarders.
In the NHS, where I spent most of my career, data retention has always been important. From the warehouses full of paper notes and X-ray films to today’s big server farms and warehouses full of backup tapes, we generate a lot of data and tend to want to keep it all (and not just because we’re hoarders). We need to keep data mainly for three reasons: so we can use patient history to treat our patients better; so we can respond correctly to accusations of negligence; and so we can perform longitudinal research in order to better treat our patients. You’ll be pleased to hear that the first and third of these reasons are the major uses of this historical data.
The problem is that we generate a lot of it, from written records of patient consultations to the detailed medical images that we have become accustomed to. Initially, we adopted a “keep it all” policy, partly because digital storage takes up much less space than physical storage (so seemed a leap forward), and partly because we didn’t know what we’d need in the future. Storage capacity has increased massively, but medical imaging has also advanced and now produces even larger data sets. It has been estimated that 80% of PACS images are never viewed again. However, as no reliable method for identifying that 80% has yet been found, all images must be kept, and keeping them all online (on fast and therefore expensive storage) is not a sensible option. Thus old images are generally archived, either onto other media (there may be several “layers” of such storage, each slower to access than the previous, eventually reaching a removable media layer such as tape) or onto a slower, less expensive system, and the original data is deleted to free up space. There are several algorithms for identifying data suitable for archiving, but the most common is based on age: not the age of the data, but the time since it was last accessed. To implement such a system it is therefore imperative that each access updates the record of when the data was last used, either in the database (for single items) or by the operating system (in the case of files).
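The "time since last access" rule described above can be sketched in a few lines of Python. This is a minimal illustration rather than how any particular PACS or archive product works: the function name `find_archive_candidates`, the `max_idle_days` threshold and the optional `now` parameter are all assumptions made for the example. It relies on the operating system's last-access timestamp (`st_atime`), which is only trustworthy if the file system really does update it on every read; some systems are configured not to, for performance reasons.

```python
import time
from pathlib import Path


def find_archive_candidates(root, max_idle_days, now=None):
    """Return files under `root` whose last access is older than
    `max_idle_days`, ordered from longest idle to shortest.

    `now` (seconds since the epoch) is a hypothetical parameter added
    so that the cut-off can be fixed, e.g. for testing.
    """
    now = time.time() if now is None else now
    cutoff = now - max_idle_days * 86400  # 86400 seconds per day

    candidates = []
    for path in Path(root).rglob("*"):
        if path.is_file():
            last_access = path.stat().st_atime  # set by the OS on access
            if last_access < cutoff:
                candidates.append((last_access, path))

    # Longest idle (smallest timestamp) first
    candidates.sort(key=lambda item: item[0])
    return [path for _, path in candidates]
```

A call such as `find_archive_candidates("/data/images", max_idle_days=365)` (the path is purely illustrative) would list the files that nobody has opened for a year; moving them to the slower storage layer and deleting the originals would be a separate step.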
But what about data that is no longer needed at all? The NHS wanted to address the question of how long data should be kept and so gathered together all the guidance that it had and compiled it into one document. This 110-page document was current until 2016, when it became an Excel spreadsheet with 118 entries; it is currently a PDF with 38 pages of guidance, plus many more of supporting information. This latest move has been a good one, as the two previous versions, whilst each better than what came before, were difficult to navigate and contained some inconsistencies and contradictions. So now we know how long we should keep each type of medical record for.
One important point of data retention is that when records identified for disposal are destroyed, a register of these records needs to be kept.
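As a rough sketch of how a retention schedule and a disposal register fit together, the Python below checks each record against a per-type retention period and writes a register row for every record it destroys. Everything named here is hypothetical: the record types and periods in `RETENTION_YEARS` are placeholders and are not taken from the NHS guidance, and the `Record` fields, the `destroy` callback and the register's CSV columns are assumptions made for illustration.

```python
import csv
from dataclasses import dataclass
from datetime import date

# Placeholder values for illustration only; NOT real NHS retention periods.
RETENTION_YEARS = {"example_type_a": 8, "example_type_b": 20}


@dataclass
class Record:
    record_id: str
    record_type: str
    closed_on: date  # assumed start of the retention period


def is_due_for_disposal(record, today):
    """True once the retention period for this record type has elapsed."""
    years = RETENTION_YEARS[record.record_type]
    try:
        due = record.closed_on.replace(year=record.closed_on.year + years)
    except ValueError:
        # closed_on was 29 February and the target year is not a leap year
        due = record.closed_on.replace(year=record.closed_on.year + years, day=28)
    return today >= due


def dispose(records, today, register_path, destroy):
    """Destroy every record past its retention period and append one row
    per destroyed record to the disposal register (a CSV file here)."""
    disposed_ids = []
    with open(register_path, "a", newline="") as f:
        writer = csv.writer(f)
        for record in records:
            if is_due_for_disposal(record, today):
                destroy(record)  # caller supplies the actual deletion
                writer.writerow([
                    record.record_id,
                    record.record_type,
                    record.closed_on.isoformat(),
                    today.isoformat(),  # date of destruction
                ])
                disposed_ids.append(record.record_id)
    return disposed_ids
```

The register row is written only after `destroy` returns, so a failed deletion does not leave a false entry. What a real register must contain is a matter for the applicable guidance, not this sketch.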
An interesting question is what happens should data be kept beyond the recommended retention period. If it contains personal data then the GDPR’s principle (e) (storage limitation) applies and the data must either be deleted or anonymised; otherwise, there is the possibility of a fine. The exception to this is when it is being kept for public interest archiving, scientific or historical research, or statistical purposes.
If the data does not contain personal data then there appear to be no ramifications to keeping data beyond the retention limits, aside from the cost of doing so (as mentioned earlier).
The final question around data retention is this: if you have the old data, can you still read it? We discovered that all of our archived patient records for one of our services were stored on 5.25” floppy discs. Finding a machine anywhere in our hospital that would read these proved impossible. Fortunately, one of our scientists conformed to the “hoarder” stereotype and had one in his garage at home, which we borrowed in order to move the data onto something more useful. As a follow-up to this, we gave one of our trainee scientists a project to create a PC with all the data devices we had now realised we needed (ZIP drives, 3” floppies, optical media etc.). He needed two PCs to complete the task, which shows how many obsolete media formats we had in storage.
As I glance at the shelves in my study, I see DATs, 8-track ½” tapes, 3.5” Atari format floppy discs and VHS tapes. I guess I conform to the hoarder stereotype, but I also wish that I’d converted the media while I still had something to read them with. If not for the vinyl revival, which allowed me to replace my defunct player, I’m sure I’d have shelves of old LPs with the music also trapped in time, serving only as a memory of what I misspent my youth on (together with the 8” floppy discs that occupy another box).
Dr Paul S. Ganney FIAP is an almost-retired Consultant Clinical Scientist and author of ‘Introduction to Bioinformatics and Clinical Scientific Computing’, Boca Raton and London: CRC Press, 2023.