Research project · Bioinformatics · Data ingestion

RAIDGBS Research Project

Design and implementation of a system capable of retrieving DNA sequence data from public biological databases and storing it in a structured local database.

Period: February 2010 – June 2010
Role: Main developer during my final year internship
Organization: CBMN — Centre de Biophysique Moléculaire Numérique, Gembloux
Main stack: Perl, SQL, Bash, MySQL, Linux

Associated experience

CBMN

Perl Analyst Developer · 2010


Context

This project was carried out as part of my final-year thesis for my Bachelor’s degree in Computer Science.

The work took place at the CBMN, the Centre de Biophysique Moléculaire Numérique in Gembloux, and was part of the RAIDGBS research program.

RAIDGBS focused on the analysis of genetic sequences of Group B Streptococcus, a bacterium that can cause serious infections in newborns when transmitted from the mother during pregnancy or childbirth.

The broader objective of the research program was to better understand the genetic variability of these bacteria in order to contribute to the development of faster and more affordable diagnostic tests.

My work focused on building the data processing system that collected and structured the biological sequence data used in this research.

Objective

Design and implement a system capable of retrieving DNA sequence data from public biological databases and storing it in a structured local database.

The system needed to automate the retrieval and processing of these datasets so that researchers could easily access and work with the collected information.

My role

I was the main developer responsible for implementing the system during my final year internship.

The project was carried out under the supervision of researcher Sven Steinhauer, who had previously defined the research objectives, the data sources and the technological approach.

My work focused on the technical implementation of the system, including database design, data processing scripts and the automation of the data ingestion workflow.

Responsibilities

Implementation of the data ingestion system
Database schema design
Development of Perl scripts to retrieve and process biological datasets
Generation of SQL files used to populate the database
Implementation of the Bash script orchestrating the ingestion workflow
Testing and validation of the data ingestion process
Documentation of the system as part of the academic thesis

Technical challenges

Learning Perl at the beginning of the project in order to develop the data processing scripts
Retrieving and processing biological datasets from public databases
Designing a database schema adapted to store DNA sequence information
Automating the ingestion pipeline using Perl and Bash scripts
Handling performance issues when processing large datasets
Optimizing SQL queries to significantly reduce the overall execution time of the ingestion process

Technologies used

Languages

Perl
SQL
Bash

Database

MySQL

Data sources

Public biological databases

Environment

Linux

Tools

NetBeans
phpMyAdmin

Technical implementation

The system implemented a data ingestion pipeline retrieving biological datasets from public databases.

Perl scripts were responsible for retrieving the datasets, transforming the data and generating SQL files containing the insertion queries.
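The original Perl scripts are not reproduced here. As a minimal sketch of the same transform step, written in Python rather than Perl, the example below assumes the input is in FASTA format and that the target is a hypothetical `sequence` table with `accession`, `description` and `residues` columns; none of these names come from the actual project.

```python
from typing import Iterable, Iterator, List, Tuple

Record = Tuple[str, str, str]  # (accession, description, sequence)


def _record(header: str, chunks: List[str]) -> Record:
    # The first word of the header is treated as the accession (assumption).
    accession, _, description = header.partition(" ")
    return accession, description, "".join(chunks)


def parse_fasta(lines: Iterable[str]) -> Iterator[Record]:
    """Yield one record per FASTA entry, joining wrapped sequence lines."""
    header = None
    chunks: List[str] = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield _record(header, chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line.upper())
    if header is not None:
        yield _record(header, chunks)


def sql_quote(value: str) -> str:
    """Simplistic MySQL string literal: escape backslashes and single quotes."""
    return "'" + value.replace("\\", "\\\\").replace("'", "''") + "'"


def to_insert(record: Record) -> str:
    """Build one INSERT statement for the hypothetical `sequence` table."""
    accession, description, residues = record
    return (
        "INSERT INTO sequence (accession, description, residues) VALUES "
        f"({sql_quote(accession)}, {sql_quote(description)}, {sql_quote(residues)});"
    )
```

Writing the output of `to_insert` for every parsed record to a `.sql` file gives the kind of insertion script described above. The quoting helper is deliberately simple and would need hardening before being used on untrusted input.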

A Bash script orchestrated the entire workflow, including the creation of the database schema, the creation of the database tables, the execution of the SQL insertion scripts and the application of integrity constraints once the data had been inserted.

This allowed the database to be recreated from scratch and ensured the full ingestion process could be executed automatically.
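The orchestration itself was a Bash script. As a rough Python sketch of the same ordering (schema, tables, data, then constraints), the example below assumes hypothetical file names (`01_schema.sql`, `02_tables.sql`, `insert_*.sql`, `99_constraints.sql`) and assumes that the `mysql` command-line client finds its credentials in an option file.

```python
import subprocess
from pathlib import Path
from typing import Callable, List


def run_sql_file(sql_file: Path) -> None:
    """Feed one SQL file to the mysql client on standard input."""
    with sql_file.open("rb") as handle:
        # check=True raises on a non-zero exit code, stopping the pipeline.
        subprocess.run(["mysql"], stdin=handle, check=True)


def ingestion_plan(sql_dir: Path) -> List[Path]:
    """Order the SQL files so that constraints are applied last."""
    inserts = sorted(sql_dir.glob("insert_*.sql"))
    return [
        sql_dir / "01_schema.sql",       # create the database
        sql_dir / "02_tables.sql",       # create the tables, without foreign keys
        *inserts,                        # generated insertion scripts
        sql_dir / "99_constraints.sql",  # integrity constraints, after the data
    ]


def run_pipeline(sql_dir: Path, run: Callable[[Path], None]) -> List[str]:
    """Execute every step in order and return the names of the files run."""
    executed = []
    for sql_file in ingestion_plan(sql_dir):
        run(sql_file)
        executed.append(sql_file.name)
    return executed
```

Calling `run_pipeline(Path("sql"), run_sql_file)` would rebuild the database from an empty server. Passing the runner as a parameter keeps the ordering logic checkable without a MySQL instance.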

Performance optimization

As the datasets grew larger, the execution time of the ingestion process became an issue.

Several improvements were implemented: removing unnecessary generated SQL queries, using batch inserts instead of many individual insert statements, and postponing the creation of foreign key constraints until after the bulk data insertion.

These optimizations significantly reduced the execution time of the ingestion pipeline, bringing it down from several hours to only a few minutes.
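To illustrate two of these techniques, here is a minimal Python sketch (the project itself used Perl) that groups rows into multi-row `INSERT` statements and emits the foreign key as a separate `ALTER TABLE` statement to run after the load. The table names, column names and batch size are hypothetical, and the quoting helper is simplistic.

```python
from itertools import islice
from typing import Iterable, Iterator, List, Sequence


def sql_quote(value: str) -> str:
    """Simplistic MySQL string literal: escape backslashes and single quotes."""
    return "'" + value.replace("\\", "\\\\").replace("'", "''") + "'"


def batched(rows: Iterable[Sequence[str]], size: int) -> Iterator[List[Sequence[str]]]:
    """Split an iterable of rows into lists of at most `size` rows."""
    iterator = iter(rows)
    while True:
        chunk = list(islice(iterator, size))
        if not chunk:
            return
        yield chunk


def multi_row_inserts(table: str, columns: Sequence[str],
                      rows: Iterable[Sequence[str]],
                      batch_size: int = 500) -> Iterator[str]:
    """Yield one INSERT statement per batch instead of one per row."""
    column_list = ", ".join(columns)
    for chunk in batched(rows, batch_size):
        values = ",\n  ".join(
            "(" + ", ".join(sql_quote(v) for v in row) + ")" for row in chunk
        )
        yield f"INSERT INTO {table} ({column_list}) VALUES\n  {values};"


def deferred_foreign_key(child: str, column: str, parent: str,
                         parent_column: str = "id") -> str:
    """Statement that adds the constraint once the bulk load has finished."""
    return (
        f"ALTER TABLE {child} ADD CONSTRAINT fk_{child}_{column} "
        f"FOREIGN KEY ({column}) REFERENCES {parent} ({parent_column});"
    )
```

Fewer, larger statements reduce per-statement overhead, and adding the constraint afterwards lets the database validate the references in one pass rather than on every inserted row.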

This was one of my first practical experiences dealing with database performance and data ingestion optimization.

Outcomes / impact

The project resulted in a working prototype capable of importing experimental datasets into a structured database.

The system automated a large part of the data ingestion process and provided a foundation for further research data analysis.

Key learnings

First experience designing a database schema from scratch
Implementation of automated data processing scripts
Collaboration with researchers in a scientific environment
Exposure to data engineering concepts before entering the web development industry