Research project · Bioinformatics · Data ingestion

RAIDGBS Research Project

Design and implementation of a system capable of retrieving DNA sequence data from public biological databases and storing it in a structured local database.

Associated experience

CBMN

Perl Analyst Developer · 2010

Context

I carried out this project as my final-year thesis for my Bachelor’s degree in Computer Science.

The work took place at the CBMN, the Centre de Biophysique Moléculaire Numérique in Gembloux, and was part of the RAIDGBS research program.

RAIDGBS focused on the analysis of genetic sequences of Group B Streptococcus, a bacterium that can cause serious infections in newborns when transmitted from the mother during pregnancy or childbirth.

The broader objective of the research program was to better understand the genetic variability of these bacteria in order to contribute to the development of faster and more affordable diagnostic tests.

My work focused on building the data processing system that collected and structured the biological sequence data used in this research.

Objective

Design and implement a system capable of retrieving DNA sequence data from public biological databases and storing it in a structured local database.

The system needed to automate the retrieval and processing of these datasets so that researchers could easily access and work with the collected information.

My role

I was the main developer responsible for implementing the system during my final-year internship.

The project was carried out under the supervision of researcher Sven Steinhauer, who had previously defined the research objectives, the data sources and the technological approach.

My work focused on the technical implementation of the system, including database design, data processing scripts and the automation of the data ingestion workflow.

Technologies used

Languages

  • Perl
  • SQL
  • Bash

Database

  • MySQL

Data sources

  • Public biological databases

Environment

  • Linux

Tools

  • NetBeans
  • phpMyAdmin

Technical implementation

The system implemented a data ingestion pipeline that retrieved biological datasets from public databases.

Perl scripts were responsible for retrieving the datasets, transforming the data and generating SQL files containing the insertion queries.
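The original Perl scripts are not reproduced here. As a rough illustration of the transform-and-generate step, the following Python sketch parses sequence records and emits one INSERT statement per record. It assumes FASTA-formatted input and a hypothetical `sequence` table with `accession` and `residues` columns; neither the input format nor the schema is taken from the actual project.

```python
def parse_fasta(text):
    """Yield (identifier, sequence) pairs from FASTA-formatted text."""
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if any.
            if header is not None:
                yield header, "".join(chunks)
            parts = line[1:].split()
            header, chunks = (parts[0] if parts else ""), []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def sql_quote(value):
    """Quote a string as an SQL literal (backslashes and quotes escaped)."""
    return "'" + value.replace("\\", "\\\\").replace("'", "''") + "'"


def to_insert_statements(records, table="sequence"):
    """Yield one INSERT statement per record (hypothetical table layout)."""
    for accession, residues in records:
        yield (
            f"INSERT INTO {table} (accession, residues) "
            f"VALUES ({sql_quote(accession)}, {sql_quote(residues)});"
        )


def write_sql_file(path, statements):
    """Write the generated statements to an SQL file, one per line."""
    with open(path, "w", encoding="utf-8") as handle:
        for statement in statements:
            handle.write(statement + "\n")
```

Writing the statements to a file rather than executing them directly mirrors the approach described above: the generated SQL can be inspected, versioned and replayed independently of the retrieval step.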

A Bash script orchestrated the entire workflow, including the creation of the database schema, the creation of the database tables, the execution of the SQL insertion scripts and the application of integrity constraints once the data had been inserted.

This allowed the database to be recreated from scratch and ensured the full ingestion process could be executed automatically.
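The ordering of the workflow can be sketched as follows. This is a Python illustration rather than the original Bash script, and the file names and the `execute` callback are hypothetical. The point it shows is that the constraints step is always scheduled last and that any failing step aborts the run.

```python
def build_plan(schema_file, tables_file, insert_files, constraints_file):
    """Return (description, sql_file) steps in the order they should run."""
    plan = [
        ("create database schema", schema_file),
        ("create tables", tables_file),
    ]
    plan += [("insert data", path) for path in insert_files]
    # Constraints come last, once every row is in place.
    plan.append(("apply integrity constraints", constraints_file))
    return plan


def run_plan(plan, execute):
    """Run each step in order; an exception raised by `execute` aborts the run.

    `execute` is a caller-supplied function that runs one SQL file
    (an assumption of this sketch, not part of the original system).
    """
    completed = []
    for description, path in plan:
        execute(path)
        completed.append(description)
    return completed
```

In practice, `execute` could wrap a call to the MySQL command-line client. Because the plan starts with schema creation, running it again rebuilds the database from scratch.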

Performance optimization

As the datasets grew larger, the execution time of the ingestion process became an issue.

Several improvements were implemented: reduction of unnecessary generated SQL queries, use of batch insert strategies instead of executing many individual insert statements, and postponing the creation of foreign key constraints until after the bulk data insertion.
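To make the last two ideas concrete, here is a minimal Python sketch (not the original Perl code) that groups rows into multi-row INSERT statements and generates the foreign key as a separate ALTER TABLE statement to be run after loading. Table names, column names and the batch size of 500 are illustrative assumptions.

```python
def sql_literal(value):
    """Render a Python value as an SQL literal (None, numbers, strings)."""
    if value is None:
        return "NULL"
    if isinstance(value, (int, float)):
        return str(value)
    text = str(value).replace("\\", "\\\\").replace("'", "''")
    return "'" + text + "'"


def batched(rows, size):
    """Yield lists of at most `size` rows."""
    batch = []
    for row in rows:
        batch.append(row)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        yield batch


def multi_row_inserts(table, columns, rows, batch_size=500):
    """Yield one INSERT statement per batch instead of one per row."""
    column_list = ", ".join(columns)
    for batch in batched(rows, batch_size):
        values = ",\n  ".join(
            "(" + ", ".join(sql_literal(v) for v in row) + ")" for row in batch
        )
        yield f"INSERT INTO {table} ({column_list}) VALUES\n  {values};"


def deferred_foreign_key(table, column, ref_table, ref_column):
    """Build the ALTER TABLE statement to run once all rows are loaded."""
    return (
        f"ALTER TABLE {table} ADD CONSTRAINT fk_{table}_{ref_table} "
        f"FOREIGN KEY ({column}) REFERENCES {ref_table} ({ref_column});"
    )
```

With a batch size of 500, loading 100,000 rows takes 200 statements instead of 100,000. Adding the foreign key afterwards means the referential check is performed once for the whole table rather than on every inserted row.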

These optimizations reduced the execution time of the ingestion pipeline from several hours to a few minutes.

This was one of my first practical experiences dealing with database performance and data ingestion optimization.

Outcomes / impact

The project resulted in a working prototype capable of importing experimental datasets into a structured database.

The system automated a large part of the data ingestion process and provided a foundation for further research data analysis.

Key learnings