Helping Data Leave Home
22 March 2018
Luigi Guarino | Director of Science and Programs
The relationship between scientists and their data can be a lot like that between overprotective parents and their children. You know you’ll have to let them go eventually, but it’s a real struggle bringing yourself to do it. They’re not ready. They’ll be misunderstood, or taken advantage of. I still need them. There’s always a good reason to put off the fateful day, the empty nest.
That’s certainly the case with data on crop diversity in genebanks. There’s much more data on the performance of germplasm accessions out there, on computers and in desk drawers, than has found its way into genebank databases. Way more. Often, that’s because it was generated by the genebank’s partners rather than by the genebank itself: by breeders and researchers in other institutes, perhaps in other countries. And, well, they love their data. They want to make sure it’s as good as it can be, that it will be used in the right way and properly understood by others, and that the source will be properly acknowledged.
At the Crop Trust, we want to help them with all that. It may not sound very exciting, but if we want genebanks to realize their potential, having solid procedures for publishing research data on their holdings will be absolutely critical. If there’s one thing users of Genesys, the online portal to genebank data, keep telling us, it’s that it needs to include as much performance data as possible. We need to make it as easy as possible for that to happen.
So, with support from the German government, we’re working with four national and two international genebanks to strengthen the way they gather data on their collections and make it available. We hope they’ll be the first of many.
That doesn’t just mean unearthing a long-lost spreadsheet and plonking it online, although even that can be challenging. It means recording who did the experiment that generated the data, who recorded the data, and who checked it; and when the experiment was done, where, and how. It means explaining what all the numbers in the spreadsheet actually mean: does a score of 10 mean high susceptibility to disease X, or high resistance? And it means making sure everyone knows who needs to be given credit when the data is used. This data about data is called metadata, and it’s a big part of the project, which has about six months to run.
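For the more code-minded, here is one rough way to picture such a metadata record. It is only a sketch: the field names, the `DatasetMetadata` class and the `missing_fields` helper are made up for illustration, and are not the Genesys schema or the project’s actual procedures. It simply lists the who, when, where and how described above, and reports which pieces are still blank.

```python
from dataclasses import dataclass, field, fields


@dataclass
class DatasetMetadata:
    """Hypothetical record of the 'data about data' for one dataset."""
    experiment_by: str = ""   # who did the experiment
    recorded_by: str = ""     # who recorded the data
    checked_by: str = ""      # who checked it
    when: str = ""            # when the experiment was done
    where: str = ""           # where it was done
    method: str = ""          # how it was done
    credit: str = ""          # who must be acknowledged when the data is used
    # What the numbers mean, per trait (placeholder wording, not a real scale)
    scales: dict = field(default_factory=dict)


def missing_fields(meta: DatasetMetadata) -> list:
    """Return the names of fields that are still empty."""
    return [f.name for f in fields(meta) if not getattr(meta, f.name)]


# All values below are placeholders, not real project data.
record = DatasetMetadata(
    experiment_by="Partner institute (placeholder)",
    recorded_by="Field technician (placeholder)",
    when="2016 growing season (placeholder)",
    where="Trial site (placeholder)",
    method="Field evaluation (placeholder)",
    credit="Partner institute (placeholder)",
    scales={"disease X": "1 = high resistance, 10 = high susceptibility"},
)

print(missing_fields(record))
```

In this made-up example nobody has been named as having checked the data, so `checked_by` is the one field flagged as missing. A check like this could run before a dataset is published, though what counts as required would be for each genebank to decide.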
The project coordinator, Dr Nora Castañeda-Álvarez, has been working hard with all the partners for over a year to prepare datasets for publication on Genesys, and to develop what we call “standard operating procedures” (SOPs) for data publication. These will be used by the genebanks to prepare additional datasets for uploading to Genesys, which we have been modifying in parallel to accept the new data, and the associated metadata.
Nora has visited them all, but the partners had never met each other until we organized a workshop here in Bonn. They updated everyone on their progress, and continued to develop their plans for the final few months of the project. It was great to see such excitement about metadata management. It may seem a nerdy, esoteric subject, but it’s not. Desterio Nyamongo, the head of the national genebank in Kenya, said at the beginning of the meeting that the project was a way of assuring researchers who are considering parting with their data that the genebank would take good care of it.
It can be liberating when the kids fly the coop, but that doesn’t mean you never want to hear from them again.