For my dissertation research, I have been collecting any and all available datasets of ecological networks. My main interest has been in food webs, but I have also been searching for other network types (plant/pollinator, plant/seed disperser, parasite/host, etc.). As a result, I have found datasets in a variety of places: in large databases like the Interaction Web Database (IWDB), Ecological Archives, or Dryad, through personal correspondence with Jennifer Dunne of the PEaCE Lab, and on authors’ websites like Robert Ulanowicz’s.
Let me first say that it was very easy for me to find and get these datasets, and I am very appreciative of the people who have made them available. But I also have to say that dealing with other people’s data is frustrating, as anyone who has taken a look at Twitter’s #otherpeoplesdata may know. When you are dealing with data from multiple sources, there is no set standard for how those data are stored. With network data you typically get one of two forms (setting aside the file type, .xls vs .csv): (1) an adjacency matrix with row/column names that are either node identities or numbers, or (2) an edge list where the first column is the predator and the second is the prey.
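To make the two forms concrete, here is a minimal sketch of converting between them. It is written in plain Python rather than R purely for illustration, the species names are made up, and the orientation of the matrix (rows as predators, columns as prey) is my assumption; real datasets may well use the opposite convention, so check before trusting it.

```python
def edges_to_matrix(edges):
    """Build a binary adjacency matrix from (predator, prey) pairs."""
    # Every species that appears in either column, in sorted order.
    species = sorted({name for pair in edges for name in pair})
    index = {name: i for i, name in enumerate(species)}
    matrix = [[0] * len(species) for _ in species]
    for predator, prey in edges:
        # Assumption: rows are predators, columns are prey.
        matrix[index[predator]][index[prey]] = 1
    return species, matrix


def matrix_to_edges(species, matrix):
    """Recover (predator, prey) pairs from a binary adjacency matrix."""
    return [
        (species[i], species[j])
        for i, row in enumerate(matrix)
        for j, value in enumerate(row)
        if value
    ]


# Hypothetical three-species food chain, stored as an edge list.
edges = [("fox", "rabbit"), ("rabbit", "grass")]
species, matrix = edges_to_matrix(edges)
round_trip = matrix_to_edges(species, matrix)
```

Going from the edge list to the matrix and back returns the same set of links, which is a handy sanity check when standardizing data from several sources.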
The good part is that R can read and understand both of these forms. The bad part is that the IWDB data, which come as matrices, have some strangely formatted “species names”, and the PEaCE Lab data (edge lists) have three columns instead of two, which presented its own challenges (although I was able to use R to reformat the data into a more directly usable form, and I may post on that later). Then there are data formatted for a specific program, such as the data provided by Ulanowicz, which are in SCOR format for use with the particular ecosystem network analysis programs he uses. Hooray for R and the enaR package for actually being able to read that format; I just had to go through and re-save each network as its own separate file, since they are all provided in one .txt file. So basically it is the small idiosyncrasies in the data that make using them a little more difficult, and I would think that in today’s era of open and reproducible science you would want your data to be as easy to use and understand as possible.
Now I want to mention two in-progress efforts to make this type of data acquisition as painless as possible. The first is an effort by Tim Poisot, who introduced mangal (first here, then examples here, here, and here), a web-based means of storing network data in a new and imaginative way. He has also, conveniently, released an R package to interface with mangal via an API, currently available on GitHub.
I have not yet been able to play around with Tim’s rmangal package, although I can definitely see a huge benefit in being able to interact with the web API directly from R. For someone like me it would make the workflow much smoother and reproducibility simpler, since everything from downloading the data to the analysis could live in one script. Maybe I should stop making excuses and try it out already. My only questions are how much data is available through mangal, and whether those wonderful people who put food web/ecological network data together are going to adopt this formatting. As it stands, it is pretty easy to put together a csv file that is either an adjacency matrix or an edge list. From what I have read of Tim’s mangal database, it seems he wants to put the data in a much more flexible and informative framework (as a theoretician I am all for this).
The second effort is a recently released website/database from the Bascompte Lab, the Web of Life. Bascompte’s Web of Life currently has 89 mutualistic networks: 59 are plant-pollinator (20 of those are weighted, the rest are binary), and the remaining 30 are seed dispersal (12 of them weighted). I really like the visualization on this website. The homepage is a network, with the options for the different network types as nodes around a hub displaying the site name. When you go to the network page you get a map of Earth, and each network of the selected type is represented as a point on the map. That way you can get a quick idea of where these data are coming from. You can also subset the available networks by type (e.g., pollinator vs. seed dispersal), weighted vs. binary, number of species, and number of interactions.
Once you have selected the subset of available networks that you want, there is a convenient button that allows you to download all of your selected networks into a single folder. Most importantly, you can get your data in one of three forms: csv, Excel, or Pajek. For each of these you can choose whether or not to include species names. I found this useful because csv adjacency matrices are very easy to work with, especially when they are standardized in format.
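As a rough illustration of why a standardized csv adjacency matrix is so easy to work with, here is a small sketch in plain Python. The layout is my assumption, not a description of the actual Web of Life files: a header row of animal names, one row per plant with its name in the first column, and interaction weights in the cells. The sample data and names are invented.

```python
import csv
import io


def read_adjacency_csv(handle):
    """Read a labelled adjacency matrix into (row_names, col_names, rows)."""
    reader = csv.reader(handle)
    header = next(reader)
    col_names = header[1:]  # the first cell is the corner above the row labels
    row_names, rows = [], []
    for record in reader:
        if not record:  # skip blank lines
            continue
        row_names.append(record[0])
        rows.append([float(cell) for cell in record[1:]])
    return row_names, col_names, rows


def binarize(rows):
    """Turn a weighted matrix into a binary (presence/absence) one."""
    return [[1 if value > 0 else 0 for value in row] for row in rows]


# Hypothetical weighted plant-pollinator network, standing in for a file.
sample = io.StringIO(
    ",Bee A,Fly B\n"
    "Plant X,3,0\n"
    "Plant Y,1,2\n"
)
plants, animals, weights = read_adjacency_csv(sample)
binary = binarize(weights)
n_links = sum(map(sum, binary))
# Bipartite connectance: realized links over possible plant-animal pairs.
connectance = n_links / (len(plants) * len(animals))
```

For a real file, the `io.StringIO` object would be replaced by something like `open(path, newline="")`, and the same function could then be looped over every file in the downloaded folder.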
I am definitely looking forward to seeing how these two projects develop and grow. While I do think both are very promising, I wonder whether they will be able to get the people who hold the data on board, and which of the two will be more successful at doing so.