RAIDGBS Research Project
Design and implementation of a system capable of retrieving DNA sequence data from public biological databases and storing it in a structured local database.
Associated experience
CBMN
Perl Analyst Developer · 2010
Context
This project was carried out as part of my final year thesis during my Bachelor’s degree in Computer Science.
The work took place at the CBMN, the Centre de Biophysique Moléculaire Numérique in Gembloux, and was part of the RAIDGBS research program.
RAIDGBS focused on the analysis of genetic sequences of Group B Streptococcus, a bacterium that can cause serious infections in newborns when transmitted from the mother during pregnancy or childbirth.
The broader objective of the research program was to better understand the genetic variability of these bacteria in order to contribute to the development of faster and more affordable diagnostic tests.
My work focused on building the data processing system that collected and structured the biological sequence data used in this research.
Objective
Design and implement a system capable of retrieving DNA sequence data from public biological databases and storing it in a structured local database.
The system needed to automate the retrieval and processing of these datasets so that researchers could easily access and work with the collected information.
My role
I was the main developer responsible for implementing the system during my final year internship.
The project was carried out under the supervision of researcher Sven Steinhauer, who had previously defined the research objectives, the data sources and the technological approach.
My work focused on the technical implementation of the system, including database design, data processing scripts and the automation of the data ingestion workflow.
Responsibilities
- Implementation of the data ingestion system
- Database schema design
- Development of Perl scripts to retrieve and process biological datasets
- Generation of SQL files used to populate the database
- Implementation of the Bash script orchestrating the ingestion workflow
- Testing and validation of the data ingestion process
- Documentation of the system as part of the academic thesis
Technical challenges
- Learning Perl at the beginning of the project in order to develop the data processing scripts
- Retrieving and processing biological datasets from public databases
- Designing a database schema adapted to store DNA sequence information
- Automating the ingestion pipeline using Perl and Bash scripts
- Handling performance issues when processing large datasets
- Optimizing SQL queries to significantly reduce the overall execution time of the ingestion process
Technologies used
Languages
- Perl
- SQL
- Bash
Database
- MySQL
Data sources
- Public biological databases
Environment
- Linux
Tools
- NetBeans
- phpMyAdmin
Technical implementation
The system implemented a data ingestion pipeline that retrieved biological datasets from public databases.
Perl scripts were responsible for retrieving the datasets, transforming the data and generating SQL files containing the insertion queries.
A Bash script orchestrated the entire workflow: creating the database schema and tables, executing the generated SQL insertion scripts, and applying integrity constraints once the data had been inserted.
This allowed the database to be recreated from scratch and ensured the full ingestion process could be executed automatically.
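The orchestration order described above can be sketched as a small shell script. This is a minimal illustration, not the project's actual script: the database name and SQL file names (schema.sql, data_*.sql, constraints.sql) are assumptions, and the real pipeline would pipe each file into mysql rather than record the step.

```shell
#!/bin/sh
# Sketch of the ingestion orchestration: schema first, bulk data next,
# integrity constraints last. File names are illustrative.
set -eu

DB_NAME="raidgbs"   # assumed database name
STEPS=""

run_sql() {
    # Real pipeline would run: mysql "$DB_NAME" < "$1"
    # Here we only record the order of execution for illustration.
    STEPS="$STEPS $1"
}

run_sql schema.sql           # 1. create schema and tables (no foreign keys yet)
for f in data_genes.sql data_sequences.sql; do
    run_sql "$f"             # 2. bulk-load the generated insertion files
done
run_sql constraints.sql      # 3. apply integrity constraints after the load

echo "ingestion order:$STEPS"
```

Running every step from one script is what made the database reproducible: dropping and rebuilding it required no manual intervention.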
Performance optimization
As dataset sizes grew, the execution time of the ingestion process became an issue.
Several improvements were implemented: reduction of unnecessary generated SQL queries, use of batch insert strategies instead of executing many individual insert statements, and postponing the creation of foreign key constraints until after the bulk data insertion.
These optimizations significantly reduced the execution time of the ingestion pipeline, bringing it down from several hours to only a few minutes.
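The batch insert strategy can be illustrated with a small shell helper that folds many single-row inserts into one multi-row INSERT statement, which is what the generated SQL files moved towards. The table and column names here are hypothetical, not the project's actual schema.

```shell
#!/bin/sh
# batch_insert TABLE TUPLE...
# Emits a single multi-row INSERT instead of one statement per row,
# reducing the number of queries the database has to parse and execute.
# Column names (name, sequence) are illustrative.
batch_insert() {
    table=$1; shift
    printf 'INSERT INTO %s (name, sequence) VALUES\n' "$table"
    sep='  '
    for tuple in "$@"; do
        printf '%s%s\n' "$sep" "$tuple"
        sep=', '
    done
    printf ';\n'
}

batch_insert genes "('gbs001', 'ATGCATTG')" "('gbs002', 'GGCATTAC')"
```

Combined with deferring foreign key creation until after the bulk load, this kind of batching is what cut the ingestion time from hours to minutes.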
This was one of my first practical experiences dealing with database performance and data ingestion optimization.
Outcomes / impact
The project resulted in a working prototype capable of importing experimental datasets into a structured database.
The system automated a large part of the data ingestion process and provided a foundation for further research data analysis.
Key learnings
- First experience designing a database schema from scratch
- Implementation of automated data processing scripts
- Collaboration with researchers in a scientific environment
- Exposure to data engineering concepts before entering the web development industry