The overarching theme of my research for the past two decades is interoperability in information systems, spanning the full spectrum of the technical and human components that are critical to creating networked information systems that really work. Although the results of this research apply across a variety of information contexts, the primary thread of my research explores information systems to support scholarship and knowledge production. The principal question we ask is: how do we build and sustain cyberinfrastructure that supports the full cycle of scholarship and that respects the cultural, methodological, and social variance among fields of scholarship? Some relevant subquestions of this primary issue are: what are the methodological approaches and theoretical foundations relevant to understanding the nuanced variations in scholars’ attitudes towards technical cyberinfrastructure? And: how do we design technical cyberinfrastructure that supports large-scale interoperability while respecting the diversity of scholarship? These are essential questions at a time of significant investment in cyberinfrastructure by governments and funders and of the emergence of critical scientific questions, such as climate change and global pandemics, the investigation of which requires cross-disciplinary collaboration and cyberinfrastructure support. Some current and pending research projects under this broad umbrella are listed below.

Analyzing and comparing field-specific patterns of openness and closure in the sciences

An understanding of the reasons underlying field-specific attitudes towards openness and closure is vital to understanding and advancing innovative science in the data-driven “fourth paradigm” world, critical for the successful investigation of complex problems facing society, and necessary to ensure a return on expensive cyberinfrastructure investments. In collaboration with Theresa Velden, we have been developing and experimenting with a methodology that integrates ethnographic and network analysis techniques, which together capture both scope and detail. This is a significant advance over existing approaches, which either do not scale sufficiently to support broad conclusions or miss the detail needed to capture the nuances underlying these field-specific differences. To date, we have used this methodology to analyze collaborative patterns and attitudes towards openness and closure in two subfields of chemistry and physics, and we are planning, pending approval of funding, to extend this analysis to researchers of ecological systems. We also plan, subject to funding, to extend the methodology to account for temporal dynamics such as changes in collaborative patterns over time.

Leveraging novice and expert knowledge in citizen science

Citizen science has emerged as an important component of scientific discovery, especially in the fields of ornithology, astronomy, and climate change. Unlike data collected by experienced PhD scientists and members of research teams, citizen science data often suffers in integrity because contributors have varying levels of expertise. It is often difficult to determine whether an extraordinary data point, such as the observation of a highly unusual bird at a specific location, is a false report or indeed an extraordinary event. Most existing citizen science projects rely on experts to cross-check contributed data, which is time-consuming and expensive. Pending NSF funding, we are planning to explore and experiment with innovative algorithms that combine human and machine computation to evaluate and improve human expertise and to refine the integrity of the knowledge base contributed by citizen scientists.

Specification for a resource synchronization protocol

Increasingly, large-scale digital collections are available from multiple hosting locations or are cached on multiple servers. In addition, high-profile information portals rely on resources originating in many distributed repositories. The need has arisen for methods to keep these heterogeneous systems that rely on each other’s resources in sync, ways to ensure the freshness of content, and mechanisms to know in real time when and how resources are changing. Although synchronization methods exist, they are generally ad hoc, arranged by the individuals involved, and cannot be universally deployed. There is a pressing need for a well-defined mechanism for resource synchronization that scales to the existing size of digital collections and their distribution patterns. In collaboration with NISO and members of the Open Archives Initiative team, we plan to research, develop, prototype, test, and deploy mechanisms for the large-scale synchronization of web resources. This work will build on our successful earlier OAI-PMH work. The end products of the work will be a specification, vetted by experts, that details an approach to synchronizing web resources at scale in an interoperable manner, together with reference implementations, code libraries, and tools.

Comprehensive Extensible Data Documentation and Access Repository (CED²AR)

The era of public-use micro-datasets as a cornerstone of empirical research in the social sciences is coming to an end. While it is still feasible to create such data without breaching confidentiality, scholars are pursuing research programs that mandate inherently identifiable data, such as geospatial relations, exact genome data, networks of all sorts, and linked administrative records. These researchers acquire authorized restricted access to the confidential identifiable data and perform their analyses in secure environments. The researcher is allowed to publish results that have been filtered through a statistical disclosure limitation protocol. Scientific scrutiny is hampered because the researcher cannot effectively implement a data-management plan that permits sharing these restricted-access data with other scholars. The data-custody problem is impeding the “acquire, archive, and curate” model that dominated social science data preservation in the era of public-use micro-data. This project bridges the transition to restricted-access data and offers the scholar, the scientific community, and the custodial agency a feasible path to long-term data preservation. We are building a Comprehensive Extensible Data Documentation and Access Repository (CED²AR) designed to improve the documentation and discoverability of both public and restricted data from the federal statistical system. CED²AR will be based upon leading metadata standards such as the Data Documentation Initiative (DDI) and Statistical Data and Metadata eXchange (SDMX) and will be flexibly designed to ingest documentation from a variety of source files.