A population birth cohort was established on the Isle of Wight (IoW), UK, to prospectively study the natural history of allergic diseases among children. Of the 1536 pregnancies between January 1, 1989, and February 28, 1990, parents of all infants born over this period were contacted at birth, and subsequently, 1456 infants were enrolled following informed consent and exclusion. Follow‐ups for survey and clinical data were conducted at ages 1, 2, 4, 10, and 18 years. The IoW birth cohort (IOWBC) has been described in detail elsewhere . Detailed interviews and examinations were completed for each child at each follow-up.
Data collection—outcome, exposures, covariates
The International Study of Asthma and Allergy in Childhood (ISAAC) questionnaire was used to obtain information regarding asthma at 18 years . The questions used to assess asthma were: ‘History of physician diagnosed asthma?’, ‘Wheezing or whistling in the chest in the last 12 months?’ and ‘Asthma treatment in the last 12 months?’ Based on responses to these questions, a participant was determined to have asthma if she/he had experienced recurrent wheezing in the last 12 months and been given a clinical diagnosis of asthma by the physician with or without being treated with asthma medications.
Asthma incidence at 18 years was the outcome of interest for this study and was defined as not having asthma by age 10 years but developed asthma by age 18 years (no → yes). Subjects with asthma at ages 1, 2, 4 or 10 years were excluded to focus on the association of persistent BMI patterns in childhood with young adult asthma incidence. Height and weight of each participant was assessed at ages 1, 2, 4, and 10 years. Body mass index (BMI) was calculated using weight in kilograms divided by height in meters-squared at each age. Information regarding sex, maternal smoking during pregnancy, duration of breastfeeding, maternal and paternal disease status of asthma and age at specific pubertal events, i.e., age at onset of voice deepening in males and age at onset of menarche in females, was extracted from questionnaire data. Socio-economic status (SES) was defined based on household income, number of rooms and maternal education. Active smoking status at 18 years was recorded as either never smoker, current smoker or past smoker. Second-hand smoking exposure was determined using information obtained for tobacco smoke exposure from mother, father, or others at ages 1, 2, and 4 years.
DNA was extracted from whole blood samples collected at 10 years of age using a standard salting out procedure . One microgram of DNA was bisulfite-treated for cytosine to thymine conversion using the EZ 96-DNA methylation kit (Zymo Research, Irvine, CA, USA) for each sample, following the manufacturer’s standard protocol. Genome-wide DNAm for each CpG was assessed using either Illumina Infinium HumanMethylation450 BeadChips or the Methylation EPIC BeadChip (Illumina, Inc, San Diego, CA, USA), which interrogate > 484,000 and > 850,000 CpG sites, respectively. Arrays were processed using a standard protocol as described elsewhere , with multiple identical control samples assigned to each bisulfite conversion batch to assess assay variability. The BeadChips were scanned using a BeadStation, and the methylation level (beta (β) value) was calculated for each queried CpG locus using Methylation module of BeadStudio software.
DNAm data were preprocessed using the CPACOR pipeline for data from both platforms (HumanMethylation450 and MethylationEPIC) . Specifically, the DNAm intensity data were quantile‐normalized using the R package, minfi . Beta values were calculated representing proportions of intensity of methylated (M) over the sum of methylated and unmethylated (U) sites/probes (β = M/ [c + M + U], where c is a constant to prevent zero in the denominator if M + U is too small). Beta values close to 0 or 1 tend to suffer from severe heteroscedasticity, and it has been demonstrated that base-2 logit transformed beta values (denoted as M-values) perform better in differential analysis of methylation levels . Therefore, M-values were used to represent methylation levels in the analysis.
Principal components (PCs) inferred based on control probes to represent latent chip‐to‐chip and technical variations were generated. Since DNAm data were from two different platforms, PCs were determined based on DNAm at shared control probes. In total, 195 control probes were shared between the two platforms (450 K and EPIC) and used to calculate the control probe PCs and the top 15 were used to represent latent batch factors . These 15 PCs were included in subsequent analyses. Probes not reaching a detection p-value of 10–16 in at least 95% of samples were excluded. A comparable criterion was applied to exclude samples with a low quality of DNAm measurement. CpGs on the sex chromosomes were excluded to avoid bias in our analyses. Probes that contained single nucleotide polymorphisms (SNPs) within 10 base pairs of a targeted CpG site with a minor allele in at least 0.7% subjects (corresponding to at least 10 subjects in IoW with n = 1456) were excluded due to their influence on DNAm. After preprocessing, a total of 442,475 CpGs in common between the two platforms (450 K and EPIC) were included in the analyses.
Since blood is a mixture of functionally and developmentally distinct cell populations , adjusting for cell type compositions potentially mitigates the possibility of confounding cell heterogeneities in DNAm measured from blood samples . To this end, we estimated cell type proportions using the method proposed by Jaffe and Irizarry , adapted from Houseman et al. , using the Bioconductor minfi package . The estimated cell type proportions of CD4 + T cells, natural killer cells, neutrophil, B cells, monocytes, and eosinophil cells were included in the analyses as confounding factors.
Genome-wide RNA-seq gene expression data generation
Gene expression levels from peripheral blood samples collected at 26 years from the IOWBC was determined using paired-end (2 × 75 bp) RNA sequencing with the Illumina Tru-Seq Stranded mRNA Library Preparation Kit with IDT for Illumina Unique Dual Index (UDI) barcode primers following the manufacturer’s recommendations. All samples were sequenced twice using the identical protocol and for each sample the output from both runs were combined. FASTQC was run to assess the quality of the FASTQ files (https://www.bioinformatics.babraham.ac.uk/projects/fastqc/). Reads were mapped against Human Genome (GRch37 version 75) using HISAT2 (v2.1.0) aligner . The alignment files, produced in the Sequence Alignment Map (SAM) format, were converted into the Binary Alignment Map (BAM) format using SAMtools (v1.3.1) . HTseq (v0.11.1) was used to count the number of reads mapped to each gene in the same reference genome used for alignment . Normalized read count FPKM (Fragments Per Kilobase of transcript per Million mapped reads) were calculated using the countToFPKM package (https://github.com/AAlhendi1707/countToFPKM) and the log-transformed values were used for data analysis.
To examine whether the analytic sample (n = 224) reasonably represents the complete cohort (n = 1456), chi-square tests for categorical variables and one-sample t-tests for continuous variables were applied, stratified by sex. In addition, all the subsequent analyses were stratified by sex, considering gender reversal in asthma prevalence.
BMI trajectories were determined separately for both the sexes using their BMI values at ages 1, 2, 4 and 10 years. Our study focused on temporal patterns of BMI. Thus, the less informative standardized BMI (or Z-BMI) was not needed . A group-based trajectory modelling, also referred to as a semiparametric mixture model [38, 39], was applied using PROC TRAJ in SAS  to identify BMI developmental paths in the form of trajectories across ages 1, 2, 4 and 10 years. The group-based trajectory method presumes that the data comprises of latent distinct groups (trajectories) that best summarize the distinct features as parsimonious as possible [38, 39]. Models with one to three groups were estimated for linear and quadratic terms. The selection of a best fit model was based on the smallest Bayesian Information Criterion value. Individuals were assigned to one of the trajectories/groups based on their highest estimated group-membership probabilities.
Association of BMI trajectories with asthma incidence
Subjects with asthma at ages 1, 2, 4 and 10 years were excluded from the analysis. We used multivariable logistic regression to evaluate the association of BMI-trajectory (independent variable) with asthma incidence at 18 years (dependent variable) along with covariates and confounders potentially associated with asthma incidence in the model: SES, active smoking status at 18 years, height at 10 years, maternal smoking during pregnancy, duration of breastfeeding (in weeks), parental history of asthma and age at pubertal events. For males, age at voice deepening, and for females, age at menarche were included in the model.
Screening for CpGs related to BMI trajectories
We regressed the M-values of DNAm at each CpG site on the aforementioned 15 PCs obtained from control probes and the 6 cell type proportions  to obtain batch- and cell-type-adjusted DNAm (residuals) for each sex. These residuals were batch- and cell types-adjusted DNAm and used in subsequent analyses. We applied an R package, ttScreening, to screen CpGs at 10 years with DNAm potentially associated with BMI trajectory groups . In the selection process, the method implemented in the package utilizes training and testing data in robust linear regressions. The screening was performed separately for each sex. Following the guideline , the minimum frequency of selecting an informative CpG sites was set at 50%, i.e., a CpG site gained statistical significance in at least 50% of the randomly selected training and testing data set pairs. For CpGs that passed the screening, they were treated as potential BMI-trajectory-associated-CpGs.
Screening for BMI-trajectory-related CpGs associated with asthma incidence
Subjects with asthma at ages 1, 2, 4 and 10 years were excluded from the analysis. We used multivariable logistic regression to evaluate the association of potential BMI-trajectory-associated-CpGs (independent variable) with asthma incidence at 18 years (dependent variable) along with covariates and confounders potentially associated with asthma incidence in the model: BMI trajectory groups [42, 43], SES, active smoking status at 18 years, and age at pubertal events. For males, age at voice deepening, and for females, age at menarche were included in the model. Statistical significance for this in-depth screening process was set at 0.05.
Using path analyses, we explored the association between BMI trajectories at 1, 2, 4, 10 years and asthma incidence at 18 years, and whether the relationship between these variables was mediated by DNAm at 10 years (Fig. 1), with potential confounders included in each path. Goodness of fit criteria using chi-square test p-value > 0.05, RMSEA < 0.05, CFI > 0.95 was used. The path coefficients (direct and indirect estimates) represent the partial correlation between the independent and dependent variables after adjusting for confounders and covariates used in the model above. An R package, MplusAutomation, was utilized to iteratively call MPlus from R to perform path analyses with each of the CpGs as a mediator (Fig. 1) .
Association of DNAm with gene expression
To evaluate the biological relevance of the identified mediating CpGs, association between DNAm (in M-values) and expression of genes within a 500 kilo base pairs (kbps) window (250kbps upstream and 250kbps downstream of the CpG site) was evaluated at 26 years using linear regressions. Gene expression (n = 140) was the dependent variable, and DNAm was the independent variables.
Replication cohort—the Avon Longitudinal Study of Children and Parents (ALSPAC) cohort
CpGs shown to mediate the association of BMI trajectory groups with asthma incidence in the IOWBC were further assessed in an independent cohort, the Avon Longitudinal Study of Children and Parents (ALSPAC) [45, 46]. Women residing in the South West of England who were pregnant and expecting to deliver between April 1, 1991 and December 31, 1992 were eligible to be recruited. Of the 14,541 pregnant women eligible for recruitment, 13,761 were included in the study with 10,321 participants having their DNA sampled. DNAm in the ALSPAC cohort was assessed using the Infinium HumanMethylation450 BeadChip. The pre-processing of DNAm was performed by correcting for batch effects using the minfi package  and removing CpGs with detection p-value ≥ 0.01. Samples were flagged that contained sex-mismatch based on X-chromosome methylation. Estimated cell type proportions of CD4 + T cells, natural killer cells, CD8 + T cells, B cells, monocytes, and granulocytes cells were used in the analyses to adjust for cell heterogeneity. DNAm at 7 years was included in our study and its residuals were calculated by regressing M-values on cell type proportions. Please note that the study website contains details of all the data that is available through a fully searchable data dictionary and variable search tool (http://www.bristol.ac.uk/alspac/researchers/our-data/).
BMI trajectories were modeled at ages 1, 2, 4 and 7 years separately in both sexes. Subjects with asthma at 7 and 10 years were excluded, and asthma incidence was assessed at 17 years. Identical path analysis models as those applied in the IoW cohort were used with comparable covariates available in ALSPAC, including SES, active smoking status at 17 years and pubertal events. Secondhand smoking at 7 years was not considered due to low counts in asthma incidence in one of the secondhand smoking categories.
For the CpGs showing mediation effects, the genes annotated to the CpGs were summarized along with information such as gene location, chromosome number based on Illumina's manifest file and SNIPPER (https://csg.sph.umich.edu/boehnke/snipper/) version 1.2.