Nov 8-12, 2021. BioHackathon Europe, Barcelona

Plazi will participate in BioHackathon Europe. Each November, ELIXIR organizes BioHackathon Europe, which brings together bioinformaticians from around the world. The event takes place in different locations around Europe. The BioHackathon offers an intense week of hacking, with over 160 international participants who work on diverse and exciting projects. The week starts with a half-day symposium to introduce these projects, followed by five days of hacking with a single aim: coding to address problems in bioinformatics.

Plazi will participate in project 15: CAB2: A step towards Biodiversity data enrichment. Guido Sautter will be participating on site.

Abstract Project 15: CAB2: A step towards Biodiversity data enrichment

Linking molecular data to taxonomic names and their extensive taxonomic treatments represents a fundamental component in biodiversity assessment. Voucher specimens for sequenced data can be the key nodes to make these connections. During BioHackathon 2020, several projects investigated how sequence (meta)data could be retrieved from ENA and connected to taxonomic treatment or specimen databases like TreatmentBank and GBIF.

With this proposal, we aim to link more voucher specimens to sequences by applying machine learning techniques to specimen images in order to retrieve sequencing metadata physically present on the specimen, which can facilitate and maximize the linking process. We will then use these metadata to improve the ENA linking process, allowing wider data discovery and enhancement. We also aim to develop a standard module that compares ENA, GBIF, and TreatmentBank (TB) geographical data related to specific taxa and returns the results in an interactive data exploration dashboard. The improvements will also address the gap-filling of gene names embedded in scientific papers relative to the accession numbers.

Results obtained in this project will reflect the importance of integrating different data sources in order to deliver consistent and complete biodiversity data to the scientific community and feed into European biodiversity projects such as Bioscan, BiCIKL and ERGA.

Expected outcomes

An adaptable workflow which finds sequenced specimens, captures sequencing data and uses this information to find the sequences. Voucher specimen records with explicit connections to DNA sequence records. Publication in BioHackRxiv.

Expected audience

Participants: Maarten Trekels, Steven Verstockt, Sofie Meeus, Kenzo Milleville, Krishna Kumar Thirukokaranam Chandrasekar, Bachir Balech, Donat Agosti, Alberto Brusati, Anna Sandionigi, Dario Pescini and Marcus Guidoti.

Skillsets: sequence and specimen databases, image analysis, text detection (OCR, HTR), text mining and matching, scientific literature mining

Number of expected hacking days: 4