Introduction

The purpose of the International Spinal Cord Injury (SCI) Data Sets, namely to facilitate comparisons of injuries, treatments and outcomes between patients, centers and countries, has been described in previous publications (1-8). These data sets appear on the web sites of the International Spinal Cord Society (http://www.iscos.org.uk) and the American Spinal Injury Association (http://www.asia-spinalinjury.org).

The National Institute of Neurological Disorders and Stroke (NINDS) Common Data Elements (CDE) Project was undertaken to facilitate the development of neurological data standards and to develop a web site (http://www.CommonDataElements.ninds.nih.gov) containing these data standards and accompanying tools. It is intended to help investigators and study staff to collect data with a ‘universal language’ in their clinical studies.

The purpose of the present project is to develop consistent variable names for the data elements included in the International SCI Data Sets and to develop a common database structure. This process will facilitate the adoption of these variables by both clinicians and researchers who are developing research projects or clinical databases. These data set variables have already been through a rigorous consensus, review and approval process within and among individuals and organizations interested in clinical and research work related to SCI (9). Free access to these variables will allow researchers and clinicians to avoid the laborious process of defining variables for their questionnaires or databases and should facilitate harmonization across clinical studies.

Materials and methods

Staff members from the NINDS CDE team (NINDS Program Directors along with their contractor, KAI Research, Inc.) approached the Executive Committee of the International SCI Standards and Data Sets committees (ECSCI) after they became aware of the work performed by the various SCI data set working groups. In subsequent discussions between the committee and the NINDS CDE team, a decision was made to cooperate in developing variable names for each variable in the data sets. The ECSCI and the NINDS CDE team decided to assign variable names that were at most eight characters long in order to accommodate a variety of database software/platform options, keeping the simplest type of data system in mind. They were mindful that limiting the length of the variable names to eight characters ensured compatibility with the SAS® Transport format. The SAS XPORT Transport format currently serves as a US Food and Drug Administration standard format for data sets in electronic submissions (http://www.fda.gov/drugs/developmentapprovalprocess/formssubmissionrequirements/electronicsubmissions/ucm085361.htm).

In the autumn of 2008, the NINDS CDE team began the data variable naming process with the International SCI Core Data Set (1). First, variable names of no more than eight characters in length were created for each data element in the International SCI Core Data Set. These variable names were sent by e-mail for review by members of ECSCI. Following the review, a teleconference was held with the involved individuals to discuss possible acceptance or modification of the proposed variable names. After this process was established as acceptable, the International SCI Basic Lower Urinary Tract Data Set (2) was reviewed in the same manner, followed by the International SCI Basic Urodynamic Data Set (3). The process continued with adjustments as the group learned more about how the eight character variable names needed to be structured to be as logical and consistent as possible across the various data sets. The NINDS CDE team subsequently developed a list of conventions to ensure that all variable names were created consistently (Table 1). The NINDS CDE team and the members of ECSCI also made sure that the variable names for all non-key data elements were unique across the data sets. The NINDS CDE team set up a simple database to help them verify the uniqueness of the variable names as the data sets they worked with evolved and increased in number.
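The two checks described above (a length limit of eight characters and uniqueness of non-key names across data sets) can be sketched in a few lines of Python. This is only an illustration and not the database the NINDS CDE team used. The variable names below are hypothetical placeholders rather than names taken from the data sets. The pattern also assumes the usual SAS transport rule that a name starts with a letter or underscore and contains only letters, digits and underscores.

```python
import re
from collections import defaultdict

# Assumed rule: 1 to 8 characters, first a letter or underscore,
# the rest letters, digits or underscores.
NAME_PATTERN = re.compile(r"^[A-Za-z_][A-Za-z0-9_]{0,7}$")


def check_variable_names(data_sets, key_names=frozenset()):
    """Return (invalid, duplicates) for a mapping of data set -> variable names.

    invalid    : list of (data_set, name) pairs that break the naming rule
    duplicates : non-key names that occur in more than one data set
    """
    invalid = []
    seen = defaultdict(set)
    for data_set, names in data_sets.items():
        for name in names:
            if not NAME_PATTERN.match(name):
                invalid.append((data_set, name))
            # Key variables (e.g. a patient identifier) may repeat across tables.
            if name.upper() not in key_names:
                seen[name.upper()].add(data_set)
    duplicates = {n: sorted(s) for n, s in seen.items() if len(s) > 1}
    return invalid, duplicates


# Hypothetical names, for illustration only.
example = {
    "core": ["PATID", "BIRTHDT", "INJURYDT"],
    "lower_urinary_tract": ["PATID", "EXAMDT", "INJURYDT", "BLADDEREMPTYING"],
}
invalid, duplicates = check_variable_names(example, key_names={"PATID"})
print(invalid)     # names that are too long or malformed
print(duplicates)  # non-key names reused across data sets
```

Treating the comparison as case-insensitive is a design choice made here because many database packages do not distinguish case in column names.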

Table 1 Conventions adopted during the construction of the eight character variable names for the data elements in the various International Spinal Cord Injury Data Sets (9)

While working to assign standard variable names to the data sets, the ECSCI and the NINDS CDE team soon became aware that the way they assigned the variables for a data set often depended upon the structure of the table(s) that would store the information in a relational database. (A relational database is a collection of data items organized as a set of tables from which data can be accessed or reassembled in many different ways without having to reorganize the database tables. Each table (which is sometimes called a relation) contains one or more data categories in columns. Each row contains a unique instance of data for the categories defined by the columns. The relational database was invented by EF Codd at IBM in 1970 (10).) It was therefore decided to also propose how the variables could be stored in an appropriate database structure to facilitate both analysis and sharing of data across studies. The proposed database structure is compatible with various relational database software packages, including Microsoft® Access®, SAS, Microsoft SQL®, Oracle® and so on.

Relational data tables linked by common patient identifiers were established for each data set, which could be used for either cross-sectional or longitudinal studies. With each new data set the ECSCI and the NINDS CDE team defined whether the data set would be captured in a single data table or more than one data table. With this approach, investigators can create limited data subsets of selected variables from multiple data sets for analysis. For example, needed information on patient characteristics could be easily merged with data from the lower urinary tract data set. Moreover, use of common data files will facilitate the combining of data sets collected at multiple locations. Besides determining the number of data tables for each data set, the group also needed to decide whether each data table would have a more horizontal (short and wide) or vertical (tall and narrow) structure. Of note, the proposed database structure offers one way of using the standard variable names in a database, but is not the only structure that could work based on the defined SCI CDE.
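As a minimal sketch of how tables linked by a common patient identifier can be merged into a limited data subset, the following Python example uses the standard sqlite3 module. The table names, column names, codes and patient records are all invented for illustration and are not taken from the International SCI Data Sets.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Hypothetical tables: one row per patient, and one row per examination.
CREATE TABLE core (PATID TEXT PRIMARY KEY, GENDER TEXT, INJURYDT TEXT);
CREATE TABLE lut  (PATID TEXT, EXAMDT TEXT, EMPTYMTH TEXT,
                   PRIMARY KEY (PATID, EXAMDT));
INSERT INTO core VALUES ('P001', 'F', '2008-11-03'), ('P002', 'M', '2009-02-14');
INSERT INTO lut  VALUES ('P001', '2009-01-10', 'IC'), ('P002', '2009-03-01', 'REFLEX');
""")

# Select a few variables from each table and link them on the patient identifier.
rows = conn.execute("""
    SELECT c.PATID, c.GENDER, l.EXAMDT, l.EMPTYMTH
    FROM core AS c
    JOIN lut AS l ON l.PATID = c.PATID
    ORDER BY c.PATID, l.EXAMDT
""").fetchall()

for row in rows:
    print(row)
```

The same join would work for data pooled from several sites only if patient identifiers are unique across sites, which this sketch simply assumes.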

The project has involved approximately monthly teleconferences and e-mail correspondence for more than one and a half years. In addition, a face-to-face meeting between the NINDS CDE team and members of ECSCI was held during the 35th Annual Scientific Meeting of the American Spinal Injury Association, September 2009, in Dallas, Texas.

Once the NINDS CDE team and the members of ECSCI had reached agreement through this iterative adjustment process, the result was presented to the working group responsible for the data set in question. After the working group had reviewed the eight character variable names and the database structure, the suggested final revisions and adjustments were made.

At a later stage in the process of working with the International SCI Data Sets, it was decided to include the International Standards for Neurological Classification of SCI (http://www.asia-spinalinjury.org/publications/2006_Classif_worksheet.pdf) (11), so variable names and a database structure were developed in the same way.

Results

The following data sets have been through the complete process described above and will be posted with the eight character variable names and the suggested relational database structure on the web sites of the International Spinal Cord Society (http://www.iscos.org.uk) and the American Spinal Injury Association (http://www.asia-spinalinjury.org) as well as the NINDS CDE project web site (http://www.CommonDataElements.ninds.nih.gov):

  • International SCI Core Data Set (1)

  • International SCI Basic Lower Urinary Tract Data Set (2)

  • International SCI Basic Urodynamic Data Set (3)

  • International SCI Basic Urinary Tract Imaging Data Set (7)

  • International SCI Basic Bowel Function Data Set (5)

  • International SCI Extended Bowel Function Data Set (6)

  • International SCI Basic Female Sexual and Reproductive Function Data Set

  • International SCI Basic Male Sexual Function Data Set

  • International SCI Basic Cardiovascular Function Data Set (8)

  • International SCI Basic Pain Data Set (4)

  • International Standards for Neurological Classification of SCI (http://www.asia-spinalinjury.org/publications/2006_Classif_worksheet.pdf) (11)

These web sites include an explanation of the purpose of the project and the standard variable names as well as the proposed database structure. The naming conventions described in Table 1 are also provided.

As an example, the original International SCI Core Data Set form (1) is shown in Figure 1 with the eight character variable names included along with notes for the division of the data set into two tables. Those variables that are designed to be collected only once are contained in Figure 1, TABLE #1. The core neurological data are included in Figure 1, TABLE #2 in which each time point of data collection is stored in a separate record to facilitate longitudinal analyses. In fact, this approach would allow more than the collection of admission and discharge data simply by adding additional records reflecting other times post-injury. Each record would be distinguished by its date of data collection, which would be part of the record key.
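The two-table arrangement described above can be sketched as follows, again with Python's sqlite3 module. The table and column names and the values are hypothetical stand-ins for the actual variables shown in Figure 1. The point of the sketch is the composite record key (patient identifier plus date of data collection), which lets further time points be added as new records while rejecting a second record for the same patient and date.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Table 1 (hypothetical columns): variables collected only once per patient.
CREATE TABLE core_once (
    PATID   TEXT PRIMARY KEY,
    BIRTHDT TEXT
);
-- Table 2 (hypothetical columns): one record per time point of data collection.
CREATE TABLE core_neuro (
    PATID    TEXT NOT NULL REFERENCES core_once(PATID),
    EXAMDT   TEXT NOT NULL,
    NEUROLVL TEXT,
    PRIMARY KEY (PATID, EXAMDT)   -- the date is part of the record key
);
""")

conn.execute("INSERT INTO core_once VALUES ('P001', '1980-05-17')")
# Admission, discharge and a later follow-up are simply three records.
conn.executemany(
    "INSERT INTO core_neuro VALUES (?, ?, ?)",
    [
        ("P001", "2009-01-10", "C5"),
        ("P001", "2009-04-02", "C6"),
        ("P001", "2010-04-02", "C6"),
    ],
)

# A second record for the same patient and date violates the key.
try:
    conn.execute("INSERT INTO core_neuro VALUES ('P001', '2009-01-10', 'C4')")
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

record_count = conn.execute(
    "SELECT COUNT(*) FROM core_neuro WHERE PATID = 'P001'"
).fetchone()[0]
print(duplicate_rejected, record_count)
```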

Figure 1 The International SCI Core Data Set form (1) as it appears with the eight character variable names and division into two tables.

Discussion

In the process of developing the standard variable names, a priority was to make these as clinically meaningful as possible within the eight character limit, but consideration also was given to making the variable names for similar types of variables as consistent as possible across the various data sets. This process to establish consistency has been lengthy and continues to undergo modification as the authors of each newly reviewed data set experience challenges that need special resolution. This iterative process often requires re-review of data sets for which variable names have already been assigned to ensure full consistency across the entire bank of data sets. As soon as other International SCI Data Sets are completed and approved, these will likewise be added with standard variable names and a proposed database structure.

The ECSCI and the NINDS CDE team gave just as much thought to establishing relational data tables for the data sets as they did to developing standard variable names. As previously illustrated with the Core Data Set, the decision to break a data set into more than one data table was often dictated by whether groups of data elements in the data set could be collected at disparate time points from a patient. In general, a horizontal database structure was chosen to facilitate statistical analyses that usually require all variables to be included in a single record with results compared across patients. However, a vertical approach, with each time of measurement stored as a separate record, was selected when the unit of analysis would more likely be the individual times of measurement and multiple measurements could be obtained from each person at potentially inconsistent times after study enrollment. The vertical structure was preferred in these cases because of its inherent flexibility to accommodate repeated measurements. This approach is similar to the US Model Systems Database, in which initial data are contained in a single table while annual follow-ups are in a second table (12). In the work to develop data tables for the International SCI Data Sets, the ECSCI and NINDS CDE team tried to assign consistent structures across the data sets so as to make it easier to assemble a study database and to share data from multiple sites/studies.
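To make the distinction between the two layouts concrete, the sketch below converts vertical records (one per patient and time of measurement) into horizontal records (one per patient) in plain Python. The variable names, visit labels and values are invented for illustration. Concatenating the variable name and the visit label into a new column name is one possible convention assumed here, not one taken from Table 1.

```python
from collections import defaultdict


def to_horizontal(vertical_rows, value_name):
    """Pivot tall records into one wide record per patient.

    Each input row is a dict with PATID, VISIT and the measured variable.
    The output column name is the variable name followed by the visit label.
    """
    wide = defaultdict(dict)
    for row in vertical_rows:
        column = f"{value_name}{row['VISIT']}"
        wide[row["PATID"]][column] = row[value_name]
    return [{"PATID": patid, **columns} for patid, columns in sorted(wide.items())]


# Vertical (tall and narrow): repeated measurements are extra rows.
vertical = [
    {"PATID": "P001", "VISIT": "ADM", "MOTOR": 40},
    {"PATID": "P001", "VISIT": "DIS", "MOTOR": 55},
    {"PATID": "P002", "VISIT": "ADM", "MOTOR": 62},
]

# Horizontal (short and wide): repeated measurements become extra columns.
horizontal = to_horizontal(vertical, "MOTOR")
for record in horizontal:
    print(record)
```

Note that the second patient has no discharge column in the wide form, which illustrates why the vertical layout accommodates irregular measurement times more easily.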

The data collection forms were originally designed to facilitate data collection rather than efficient data storage and analysis. As a result, there is no one-to-one correspondence between the data collection forms and the database structure. For example, ‘unknown’ may be a single check box on the paper form, but is a choice in multiple code lists in the data table. Rather than creating a unique variable for ‘unknown’, checking the unknown box would result in automatically assigning all appropriate variables the ‘unknown’ response. This explains why the ‘annotated forms’ included on the web sites (http://www.iscos.org.uk; http://www.asia-spinalinjury.org; http://www.CommonDataElements.ninds.nih.gov) have the eight character variable tags superimposed on the form.
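The handling of the single 'unknown' check box can be sketched as a small data entry helper. The variable names and the numeric 'unknown' codes below are hypothetical; in practice the code for 'unknown' would come from the code list of each variable in the relevant data set.

```python
# Hypothetical mapping: variables covered by one 'unknown' check box on the
# form, each with the 'unknown' code from its own code list.
UNKNOWN_CODES = {"EMPTYM1": 9, "EMPTYM2": 9, "EMPTYM3": 99}


def apply_unknown_checkbox(record, unknown_checked, unknown_codes=UNKNOWN_CODES):
    """Return a copy of the record with 'unknown' assigned where appropriate.

    No separate 'unknown' variable is stored; ticking the box sets every
    covered variable to the 'unknown' code of its own code list.
    """
    updated = dict(record)
    if unknown_checked:
        for variable, code in unknown_codes.items():
            updated[variable] = code
    return updated


form_entry = {"PATID": "P001", "EMPTYM1": None, "EMPTYM2": None, "EMPTYM3": None}
stored = apply_unknown_checkbox(form_entry, unknown_checked=True)
print(stored)
```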

Once those responsible for the development of each International SCI Data Set approve and release the variable names and database structures, clinical or research institutions may freely use them to write data entry software programs either for Internet or local data entry. Simple quality control procedures can also be incorporated into the data entry software or as stand-alone programs.
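As one illustration of what a simple quality control procedure might look like, the sketch below checks a single record against code lists and basic date rules. The variable names, the code list and the specific rules are assumptions made for this example and are not part of the published data sets.

```python
from datetime import date

# Hypothetical code list and date variables, for illustration only.
CODE_LISTS = {"GENDER": {"M", "F", "U"}}
DATE_FIELDS = ("BIRTHDT", "INJURYDT")


def quality_check(record):
    """Return a list of problems found in one record (empty if none)."""
    problems = []
    # Every coded variable must hold a value from its code list.
    for variable, allowed in CODE_LISTS.items():
        if record.get(variable) not in allowed:
            problems.append(f"{variable}: value {record.get(variable)!r} not in code list")
    # Dates must be valid YYYY-MM-DD strings.
    parsed = {}
    for field in DATE_FIELDS:
        try:
            parsed[field] = date.fromisoformat(record.get(field, ""))
        except (TypeError, ValueError):
            problems.append(f"{field}: not a valid YYYY-MM-DD date")
    # Cross-field rule: injury cannot precede birth.
    if len(parsed) == 2 and parsed["INJURYDT"] < parsed["BIRTHDT"]:
        problems.append("INJURYDT: earlier than BIRTHDT")
    return problems


good = quality_check({"GENDER": "M", "BIRTHDT": "1980-05-17", "INJURYDT": "2008-11-03"})
bad = quality_check({"GENDER": "X", "BIRTHDT": "1980-05-17", "INJURYDT": "1979-01-01"})
print(good)
print(bad)
```

Such a function could be called from data entry software before a record is saved, or run as a stand-alone program over an existing table.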

Although this work will greatly facilitate the combining of data from multiple sites, it is important to understand that data should not be combined without a thorough understanding of their origins. There must be an underlying research design and sampling frame, comparable case ascertainment and data collection procedures, methods to assess data quality at each location, methods to avoid duplicate patient entry and so on. Otherwise, there would be no way to assess representativeness or generalizability of the data, as well as the direction and magnitude of any potential bias that might be present, thereby making results difficult if not impossible to interpret.

Conclusion

Variable names and database structures have now been developed for each published International SCI Data Set and its associated CDEs. This process will continue as additional International SCI Data Sets fulfill the requirements of the development and approval process and are ready for implementation. Additional work is now needed to develop data entry and quality control software that would facilitate the use of these data sets.