Data Hosting

Data Hosting Program

A major practical issue in collecting and releasing updated cross-national and cross-time data sets such as those collected by the Correlates of War (COW) Project has been the amount of resources needed to simultaneously maintain a large number of data sets. As a result, COW has implemented a distributed system of data set hosting based on the notion of “coordinated decentralization.” The goal is for each COW data set to obtain a semi-permanent “home” and “host,” that is, an institution and an individual who will agree to maintain a data set and the related documentation for a period of time. The care given to a data set by its host follows a set of guidelines designed to ensure continued consistency with COW standards. The Director and the COW Advisory Board are responsible for monitoring data sets and hosts.

Guidelines

There are several guidelines for the adoption of a data set by an institution:

  1. The host agrees to comply with the standards set by the COW project for data collection procedures, coding rules, the structure and format of the data set, and documentation procedures.
  2. The host of a COW data set takes responsibility for revising and routinely updating the data set, documentation, and related archival material in his or her care for a period of 3-5 years. The host will keep track of reported errors and questions, and will release new revised versions at regular intervals, typically every six months if minor errors are discovered and corrected, and at longer intervals as appropriate for major revisions and updates.
  3. Data set hosts must be experienced with the collection of quantitative data sets, and should have experience with the data set in question. Sufficient institutional resources should be available to support the hosting, possibly including relevant computer resources, research support, or (especially in the case of junior faculty) assurance that proper credit will be given to the host.
  4. The host agrees to serve as the primary contact person and deal with substantive questions concerning the data set (i.e., the host will be listed on the COW website, with his or her email address, as the person to contact with data set questions).
  5. COW data sets will be released only through the COW website (not by individual hosts), and only after the data are final. Procedures for data set review are described on the website. The purpose of this rule is to avoid a proliferation of partial, unofficial, or inconsistent data sets through the research community.
  6. The host agrees not to publish any analytical results based on the resulting updated COW data set before the data are officially released by the COW project. Exceptions may be made for descriptive conference papers and for dissertations, but it must be noted that such results represent analysis based on work in progress and on possibly incomplete data sets, and cannot be said to use official COW data. The purpose of this rule is to avoid a proliferation of non-replicable or frequently revised results through the research community.
  7. When a major revision or update of a data set is complete, the host agrees to compose and publish an “article of record” concerning the new data set (for instance, in a professional journal). We expect all scholars who use the resulting data set to cite this article of record and to clearly state the data set version used for analysis.
  8. The host agrees to submit a yearly status report on the data set to the Director.
  9. The host shall maintain all available documentation associated with the data set, and that documentation should be accessible to the Director, Associate Director, and Advisory Board.
  10. Hosts are expected to attend periodic COW meetings during annual meetings of the professional associations.
  11. A subset of data hosts serve on the COW Advisory Board on a rotating basis.
  12. Hosts are expected to work with the Director to secure external grant funds for data collection and updating.
  13. The Advisory Board reserves the right to remove a data host if there are significant problems with data collection or updating.

Data Hosting Standards

Data set hosts must agree to the following basic standards of data collection and data set management before agreeing to host a data set.

  1. Data collection procedures must be carefully documented, and actual data collection must follow these procedures.  Methods used in the prior data collection (where documentation identifies those procedures) and coding rules for the prior data set must be followed where possible to ensure cross-time reliability of the data.  Theoretical and substantive issues, including problems in coding particular cases, must be clearly noted in the documentation.
  2. Units of analysis must remain consistent with the current version of the data set. If they are changed to reflect better ways of structuring data sets, the changes must be fully documented and old data converted to the new format.
  3. New variables will only be made available in new versions of data sets if coded for the entire set of states and years.
  4. Data sources must be clearly identified. Documentation and/or the data set should contain information allowing identification of the source of each newly collected data point.  Archival material (e.g. copies of pages from source materials) will be given to the central COW office for permanent archiving.
  5. Each data set released will have a unique version number to maintain a chronological and developmental record of each data set.
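
The version-number standard above could be checked mechanically before a release. The following is a minimal sketch only: it assumes version labels of the form "v2.1", which is an illustrative convention and not a format specified by COW, and the function names `parse_version` and `validate_release` are hypothetical.

```python
def parse_version(label):
    """Turn a label such as "v4.1" into a sortable tuple such as (4, 1).

    The "vMAJOR.MINOR" label format is an assumption made for this sketch.
    """
    return tuple(int(part) for part in label.lstrip("v").split("."))


def validate_release(history, new_label):
    """Return True if new_label is later than every prior release.

    Being strictly later than all earlier versions also guarantees that
    the new version number is unique.
    """
    new = parse_version(new_label)
    return all(new > parse_version(old) for old in history)


# Invented release history, for illustration only.
history = ["v1.0", "v1.1", "v2.0"]
print(validate_release(history, "v2.1"))  # True
print(validate_release(history, "v2.0"))  # False (already used)
print(validate_release(history, "v1.2"))  # False (out of chronological order)
```

Comparing parsed tuples rather than raw strings keeps the ordering correct when a minor number reaches two digits (for example, "v4.10" sorts after "v4.9").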

Data Hosting Review Procedures

The criteria for release of an updated data set by COW include basic data standards of internal data set consistency, comparability to and compatibility with existing data sets, and high-quality data. These criteria can be met by carefully following the coding rules defined at the beginning of a project, and by working with the Director and Associate Director to ensure consistency of the data set format and structure.

When a host believes that a data set is ready for release as an updated COW data set, he or she will submit the data to the COW Director and Associate Director. A series of checks will then be undertaken before a version number is assigned and the updated data released.

  1. A series of automated checks will be conducted to ensure that all countries and years have been included in the data set (where a data set is cross-national and cross-time), that all data points are unique (no duplicate records or values), and that country codes and data points included in the data set match the Correlates of War National System Membership lists.
  2. Variable names and value codes will be examined for uniqueness, descriptive accuracy, and consistency. For example, whenever possible variable names must match names from prior data sets, and must accurately describe the variables’ content. Dummy variables will be coded as 0=no and 1=yes. Missing value codes will be consistent (typically -9 when possible) and clearly described in the documentation. Names and categories deemed unique will be checked for uniqueness.
  3. A review of procedures will be done to ensure that coding rules have been followed.
  4. Spot checks of individual data points collected by the individual host will be conducted to verify data values and source identification.
  5. Documentation will be reviewed, and source lists will be examined to ensure that every new data point can be traced to a point of origin.
  6. The format of the data set (e.g. unit of analysis [country-year, monad-year], file type [Excel file, Access file, flat text]) will be examined and made consistent with other data sets.
  7. In case of problems, the data set may be updated by COW, or may be returned to the host for further work.
  8. The COW Advisory Board may be routinely consulted on issues of data set structure, coding rules, case coding, and other issues that arise in the course of data set review.
  9. In the case of disagreement between the host and COW about the release status of the data set, whether such disagreements concern issues of format or substantive coding decisions, the COW Advisory Board is available for consultation and problem resolution.
  10. The target for final data set release is no more than six months after a candidate final release data set is submitted.
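
As an illustration of the automated checks in items 1 and 2 above, a host might run something like the following before submitting a data set. This is a sketch under stated assumptions, not the procedure COW actually uses: the field names `ccode` and `year`, the function names, and the simplified membership table (one membership period per state) are all hypothetical, and every code and year in the sample data is invented.

```python
from collections import Counter

MISSING = -9  # the "typical" missing-value code mentioned in item 2


def check_country_years(records, membership):
    """Return (missing, duplicates, not_in_system) for country-year records.

    records: list of dicts with hypothetical keys "ccode" and "year".
    membership: dict mapping ccode -> (first_year, last_year); a simplified
    stand-in for a system membership list, assumed for this sketch.
    """
    expected = {
        (ccode, year)
        for ccode, (start, end) in membership.items()
        for year in range(start, end + 1)
    }
    counts = Counter((r["ccode"], r["year"]) for r in records)
    observed = set(counts)
    missing = sorted(expected - observed)            # country-years left out
    duplicates = sorted(k for k, n in counts.items() if n > 1)
    not_in_system = sorted(observed - expected)      # outside membership list
    return missing, duplicates, not_in_system


def check_dummy(records, name):
    """Return records whose dummy variable is not 0, 1, or the missing code."""
    allowed = {0, 1, MISSING}
    return [r for r in records if r[name] not in allowed]


# Invented sample data, for illustration only.
membership = {2: (1990, 1992), 20: (1991, 1992)}
records = [
    {"ccode": 2, "year": 1990, "war": 0},
    {"ccode": 2, "year": 1991, "war": 1},
    {"ccode": 2, "year": 1991, "war": 1},        # duplicate record
    {"ccode": 20, "year": 1990, "war": 0},       # before membership begins
    {"ccode": 20, "year": 1991, "war": 2},       # invalid dummy value
    {"ccode": 20, "year": 1992, "war": MISSING},
]

missing, duplicates, not_in_system = check_country_years(records, membership)
print(missing)        # [(2, 1992)]
print(duplicates)     # [(2, 1991)]
print(not_in_system)  # [(20, 1990)]
print(check_dummy(records, "war"))  # [{'ccode': 20, 'year': 1991, 'war': 2}]
```

An empty list from each check would mean the data set passes that check. A real membership list may record more than one membership period per state, in which case the `membership` mapping would need to hold a list of periods per code.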