Centralized Life Sciences Data (CLSD)
The Problem
Biomedical research increasingly requires integrating data from a wide variety of sources. Assembling the data a researcher needs takes considerable work, and each researcher has to repeat the same tasks:
- Locating each data source
- Iteratively selecting data from each source
- Manually running programs such as BLAST
The Solution
Public datasets are downloaded and prepared for local use at IU. User applications draw data through a single, centralized interface called CLSD. CLSD is implemented as a DB2 database on our IBM Research SP supercomputer. The database has been enhanced with IBM's Information Integrator so that users can access both local and external data.
Maintenance
The clsd-update and clsd-monitor programs work together to automate the process of downloading data from FTP sites, parsing it into relational form, and loading it into DB2 for use by CLSD, while checking and reporting on any errors in the process.
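The download, parse, load, and report cycle described above can be sketched roughly as follows. This is only an illustration of the general pattern, not the actual clsd-update or clsd-monitor code: the names `UpdateReport`, `parse_records`, and `update_source`, the two-column tab-delimited record format, and the `records` table are all hypothetical, the download step is passed in as a function rather than tied to a particular FTP site, and SQLite stands in for DB2 so that the example runs on its own.

```python
import sqlite3
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


@dataclass
class UpdateReport:
    """Outcome of one update run (hypothetical; a monitor could inspect this)."""
    source: str
    rows_loaded: int = 0
    errors: List[str] = field(default_factory=list)

    @property
    def ok(self) -> bool:
        return not self.errors


def parse_records(text: str) -> Tuple[List[Tuple[str, str]], List[str]]:
    """Turn assumed 'accession<TAB>description' lines into relational rows."""
    rows, errors = [], []
    for line_no, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue  # ignore blank lines
        parts = line.split("\t")
        if len(parts) != 2 or not parts[0]:
            errors.append(f"line {line_no}: malformed record")
            continue
        rows.append((parts[0], parts[1]))
    return rows, errors


def update_source(source: str,
                  fetch: Callable[[str], str],
                  conn: sqlite3.Connection) -> UpdateReport:
    """Download, parse, and load one source, collecting errors along the way."""
    report = UpdateReport(source)
    try:
        text = fetch(source)  # e.g. a wrapper around an FTP download
    except OSError as exc:
        report.errors.append(f"download failed: {exc}")
        return report

    rows, parse_errors = parse_records(text)
    report.errors.extend(parse_errors)

    try:
        with conn:  # commits on success, rolls back if the load fails
            conn.execute(
                "CREATE TABLE IF NOT EXISTS records "
                "(accession TEXT PRIMARY KEY, description TEXT)"
            )
            conn.executemany(
                "INSERT OR REPLACE INTO records VALUES (?, ?)", rows
            )
        report.rows_loaded = len(rows)
    except sqlite3.Error as exc:
        report.errors.append(f"load failed: {exc}")
    return report


if __name__ == "__main__":
    # Stand-in for a real download: returns canned text with one bad line.
    def fake_fetch(_source: str) -> str:
        return "A1\tfirst record\nbroken line\nA2\tsecond record\n"

    connection = sqlite3.connect(":memory:")
    result = update_source("demo-source", fake_fetch, connection)
    print(result.source, result.rows_loaded, result.errors)
```

Passing the download step in as a function keeps the parse and load logic testable without network access. In this sketch a malformed record is reported but does not block the valid rows from loading; whether a real update should instead abort on the first error is a design choice the description above does not settle.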
