FAIR annotation and evaluation of RNA sequencing and microarray data – Elucidata
How to unlock the value of RNA sequencing (RNA-Seq) and microarray data from a public repository Gene Expression Omnibus (GEO)
- Using machine-learning assisted curation, guided by the FAIR principles
- Leverage ontologies and controlled vocabularies for annotation
GEO  is a large public-funded repository that primarily supports microarray and RNA-seq data submitted by the research community. These data are publicly available and contain a huge wealth of information. There are over 150,000 series, i.e, records that contain a collection of related samples with a unified research focus, in GEO. However, a majority of these series lack machine-readable standardized metadata at study and sample-levels, limiting reuse. Only ~4300 or 2.9% of series have been curated retrospectively and are available as datasets for advanced analyses. In this use case, we used a subset of the FAIR guiding principles to define customized metrics and establish a baseline FAIR score for GEO. We then defined a machine-learning assisted curation infrastructure to curate microarray and RNA-Seq data at scale and quantified the improvement in FAIR score post curation.
Adapting FAIR guiding principles for microarray and RNA-Seq data
We converted the general FAIR guiding principles as defined by Wilkinson et al.  into a set of 8 measurable metrics specifically suited to microarray and RNA-Seq data, as described in Table 1.
|FAIR Principle||Metric||Metric_ID||Relevant GEO metadata field|
|F2. data are described with rich metadata||Metadata contains biological keywords that describe the study||F2||!Series_summary|
|F3. metadata clearly and explicitly include the identifier of the data it describes||Metadata include machine actionable identifier to processed data||F3||!Sample_supplementary_file
|I2. (meta)data use vocabularies that follow FAIR principles||(meta)data metadata are described by biological keywords and their corresponding unique identifier from a reference ontology||I2||Not applicable (Metadata field for ontology tags not available)|
|R1.2. (meta)data are associated with detailed provenance||metadata provide information regarding the overall design of the study||R1_2_1||!Series_overall_design|
|metadata provide contact information of the creator(s)||R1_2_2||!Series_contact_email|
|metadata specify the publication identifier||R1_2_3||!Series_pubmed_id|
|metadata contain the platform ID||R1_2_4||!Platform_geo_accession|
|the processed data associated with the study contain counts||R1_2_5||!Sample_data_processing|
Calculating baseline FAIR score for GEO
GEO uses a format called Simple Omnibus in Text Format (SOFT), which is a compact, simple, line-based, ASCII text format that incorporates experimental data and metadata.
To calculate a baseline FAIR score for data on GEO, we identified relevant metadata fields for each of the metrics except I2. A scoring strategy (called as s) was then defined as follows:
- If the field contained relevant information, such as keywords and unique machine actionable identifiers, the metric was scored as 1.0 (PASS), if the information was absent, the metric was scored as 0 (FAIL).
- If there were more than one field in a metric, then each field was scored as 0 or 1 and an overall metric score was assigned as the average of each individual field score.
- I2 was scored as zero for all the data on GEO since they do not follow any ontology
Examples of keywords in metadata
|GEO ID||Metadata Tag
!Sample_data_processing = Supplementary_files_format_and_content
|FAIR metric evaluation
|GSE100007||TSV files include TPM (transcripts per million) for Ensembl transcripts from protein coding genes in release 84.||Pass||TPM,
|GSE100009||RNA-seq read counts were calculated using HTSeq 0.6.1p1||Pass|
|GSE100400||bigwig format, files contain mapped reads on chromosomes Y, 15, 3, 1 for visualization of each investigated marker||Fail|
Each of the metrics was also assigned a weight based on their perceived importance to an in-house team of bioinformaticians, with highest weights assigned to F2, F3, I2 and R1_2_5. The FAIR score for a series was calculated as the weighted average score for the 8 metrics and normalized to a maximum score of 100 (called a composite FAIR score). The distribution of calculated scores for one hundred random studies on GEO is shown in subsequent sections,
where wi = weight of ith metric and s = evaluation score.
We built a ML-assisted curation infrastructure that takes a series and its corresponding samples from GEO, stores the metadata and processed data together in a consistent tabular format, then tags the study and its samples with standard biomolecular ontologies (see listing below) using state-of the-art named entity recognition models. We have used this curation infrastructure to curate ~60000 microarray and RNA-Seq series from GEO, and other public gene expression data repositories LINCS, TCGA, Depmap etc. We have also extended the curation to other omics data such as metabolomics, proteomics and lipidomics.
A feature of the curation infrastructure was that all the metadata and data were indexed and queryable.
Examples of sample tags
A tag is a reference to a term in biomolecular ontologies
- GSM2752553 is a cell line sample that has been treated with a drug, it has two tags – Lu-65 (CVCL:1392) and trametinib (CHEBI:75998)
- GSE107650 is a diet-study involving tissue samples taken from obese patients with NASH, it has the tags Obesity (MESH:D009765), Non-alcoholic Fatty Liver Disease (MESH:D065626) and liver (BTO:0000759)
We use both public and proprietary ontologies for tagging data. Here’s a list of the types of tags we assign (the list is not exhaustive), along with the public ontologies we use:
- Organism: (NCBI Taxonomy) Organism from which the samples originated.
- Disease: (MeSH) Disease(s) being studied in the experiment.
- Tissue: (Brenda Tissue Ontology) The tissue from which the samples originated.
- Cell type: (Cell Ontology) Cell type of the samples within the study.
- Cell line: (The Cellosaurus) Cell line from which the samples were extracted.
- Drug: (CHEBI) Drugs that have been used in the treatment of the samples or relate to the experiment in some other way.
For keeping track of quality of tags we use metrics like accuracy and F1 score. These metrics are calculated w.r.t a gold standard corpus*. We calculate the metrics for 3 sets of tags.
- Human: Tags assigned by human curators
- Polly: Tags assigned by Polly ML models (developed in-house)
- Benchmark: Tags from publicly available ML models on the same metadata .
Figure 1: The “Human” F1 score indicates how accurate a human curator is on average.
* The gold standard corpus refers to a small set of GEO studies that were curated by in-house experts and went through several quality control checks. These studies contain tags that we use as the ground truth.
The curation infrastructure and process is described in detail in Figure 2.
Figure 2: Curation process and main constituents of the infrastructure
Improvement in FAIR score post curation
The score for each study was calculated again after curation with the help of Polly ML-assisted curation infrastructure. The minimum ~80% score for Polly GEO (post curation) is nearly equal to the maximum score for GEO. The difference between FAIR scores for GEO and Polly is significant, Polly has a higher score for most of the datasets with average score being 88% as compared to source’s score of 60% (See figure 3 below).
Figure 3: The plot represents a distribution of percentage increase of composite FAIR scores on Polly from randomly sampled 100 GEO studies.
- A method to quantify the impact of curation on microarray and RNA Seq data from GEO
- Quantification of FAIR score pre and post curation on GEO
- An ML-assisted curation infrastructure for FAIRification of molecular omics data
- Improved querying of data from a rich publicly accessible resource
References and Resources
- Ron Edgar, Michael Domrachev, Alex E. Lash, Gene Expression Omnibus: NCBI gene expression and hybridization array data repository, Nucleic Acids Research, Volume 30, Issue 1, 1 January 2002, Pages 207–210, https://doi.org/10.1093/nar/30.1.207
- Wilkinson, M. D. et al. 2016 The FAIR Guiding Principles for scientific data management and stewardship. Data 3, 160018. doi.org/10.1038/sdata.2016.18
- Wang, Z., Monteiro, C. D., Jagodnik, K. M., Fernandez, N. F., Gundersen, G. W., … & Ma’ayan, A. (2016) Extraction and Analysis of Signatures from the Gene Expression Omnibus by the Crowd. Nature Communications doi: 10.1038/ncomms12846
- Matthew N Bernstein, AnHai Doan, Colin N Dewey, MetaSRA: normalized human sample-specific metadata for the Sequence Read Archive, Bioinformatics, Volume 33, Issue 18, 15 September 2017, Pages 2914–2923, https://doi.org/10.1093/bioinformatics/btx334
At a Glance
- Scientific Owner (2)
- Machine Learning Engineer (1)
- Data Scientist (2)
- Data Curators (10)
- Software Developers (2)
Benefits and deliverables
- Enabled the enhancement of 60,000 datasets from GEO to improve of FAIR implementation
- Gave ready access to analysis-ready data by machine (e.g. Polly tagger)
- Mukund Chaudhry, Data Engineer
- Pawan Verma, Bioinformatics Engineer (LinkedIn)