Banish the thumb drive: a call for better internal data sharing

Aug 17, 2020

Some people’s pet peeve is loud chewing; others hate the Oxford comma. Here’s mine: thumb drives. 

First of all, can we choose one name for these things? They’re also referred to as “flash drives,” “jump drives,” and “USB drives,” plus a few other names. Let’s choose one!

Second, and more importantly: for years, whenever I’ve seen someone walking around the office with a thumb drive, I’ve gotten nervous. What data is on that thumb drive? Why is the data being transmitted this way? Who is getting access to that data?

Of course, these days far fewer of us are walking around the office (stay safe, everyone!), and with so many people working remotely, the thumb drive is being used less often. Since we have to share data differently anyway, it’s a good time to consider better ways to do it: ways that protect security and privacy while allowing the organization to better understand what data is critical to its operations.

So what’s wrong with thumb drives? Well, here are a few things that come to mind:

  • They leave no record of the sharing. If I email you a file, there’s a record of that. If I download it from a server, there may be a log. Such evidence is useful both for accountability and for understanding which datasets are of most interest to other users (so that we can prioritize those datasets for data quality improvement).
  • They offer no opportunity to enforce rules about privacy. There’s nothing to stop me from copying and sharing the source file, individual-level identifiable data and all.
  • They are asking for trouble with version control. If I give you an export for your analysis and then later update the source, we’re going to have trouble reconciling our numbers.
  • They don’t come with an instruction manual. Most likely, the person sharing will give a brief verbal explanation. Will this be clear? Will the person receiving remember the details? Maybe not!
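To make the first point concrete, here is a minimal Python sketch of what a sharing log makes possible. The `ShareEvent` record, its fields, and the sample entries are all hypothetical, invented for illustration; a real log would more likely come from a file server or database than from a hand-built list.

```python
from collections import Counter
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class ShareEvent:
    """One record of a dataset being shared internally (hypothetical schema)."""
    shared_on: date
    dataset: str
    requester: str
    provider: str


def most_requested(events, top_n=3):
    """Return the top_n (dataset, count) pairs, most requested first."""
    counts = Counter(event.dataset for event in events)
    return counts.most_common(top_n)


# Made-up log entries, purely for illustration.
log = [
    ShareEvent(date(2020, 8, 3), "enrollment", "ana", "ben"),
    ShareEvent(date(2020, 8, 5), "enrollment", "cho", "ben"),
    ShareEvent(date(2020, 8, 7), "budget", "ana", "dee"),
]

print(most_requested(log))  # [('enrollment', 2), ('budget', 1)]
```

Even a tally this simple tells you which datasets deserve data quality attention first. A thumb drive gives you nothing to count.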

Before we go much further, let’s just consider: are email attachments better? Not much. There’s a record of the sharing, but there are still issues with privacy and version control. It’s also a pain to go back and find the right email message with the right attachment. Is it the one I sent you three months ago with the attachment named quarterlyReport(1).csv, or the one I sent two days later with the attachment named quarterlyPull_FINAL.csv? Who knows!

So sharing using thumb drives is a problem. Why do we do it? (And I will 100 percent fess up that I have done it.)

  • People want to share data... This is actually a good thing! One person is asking for data that they need in order to do their work, and they’ve found the right person with the right data. This is kind of a win!
  • ...but there’s not a better way to share. It makes far more sense to share a dataset within the system of record instead of exporting it, copying it to a thumb drive, and walking it to your office, only to find that you needed a .csv instead of an .xls... but the ideal is not always an option.
  • There’s a lack of clarity about policies for data access and sharing. Maybe someone asked me to share data and I’m not completely sure whether it’s OK, but I want to be helpful. If there’s not a clear policy that tells me what to do, I might decide to share, but under the table.

So, what can we do about this? Ideally, this is something to address with your data governance group. Developing a policy for internal data sharing is a classic example of what governance can do. A few places to start:

  • Investigate platform options: Whether it’s network drives or databases, consider what tools are available to control user access by role. Make sure your chosen tools include metadata so that users better understand what they’re working with.
  • Draft a policy that includes:
    • Directions on how to ask for data
    • Criteria for evaluating requests received
    • Options for sharing in a responsible way
    • Details on access restrictions, as appropriate
    • Rules for aggregation and de-identification to protect privacy
    • A means for logging requests and a routine for analyzing them
  • Consult your data inventory: It helps you apply these rules systematically and serves as a reference on the privacy requirements of each dataset.
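As one illustration of what a rule for aggregation might look like in practice, here is a minimal Python sketch that shares group counts instead of individual rows and hides any group too small to report safely. The function name, the record layout, and the threshold of 5 are assumptions made for this example; your governance group would set the real threshold, and small-cell suppression on its own is not a complete de-identification method.

```python
from collections import Counter

MIN_CELL_SIZE = 5  # assumed threshold; the real value is a policy decision


def aggregate_counts(records, group_key, min_cell_size=MIN_CELL_SIZE):
    """Count records per group, suppressing groups below the threshold.

    Returns a dict mapping each group to its count, or to None when the
    group is too small to report.
    """
    counts = Counter(record[group_key] for record in records)
    return {
        group: (n if n >= min_cell_size else None)
        for group, n in counts.items()
    }


# Made-up individual-level records, purely for illustration.
records = (
    [{"name": f"person{i}", "department": "Finance"} for i in range(6)]
    + [{"name": f"other{i}", "department": "Legal"} for i in range(2)]
)

summary = aggregate_counts(records, "department")
print(summary)  # {'Finance': 6, 'Legal': None}
```

The person asking gets the numbers they need for their analysis, and the names never leave the source.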

If we do this effectively, we stand to gain a lot. We can foster collaboration among staff while avoiding issues with privacy and version control. We can prompt people to develop data dictionaries and documentation, ideally in concert with consistent data standards. We can establish an environment where people find and use data based not on informal connections but on agreed-upon rules. Colleagues will share data with confidence, knowing that they’re doing the right thing. And the thumb drive can join the floppy disk in the museum of outdated technology.