Main

Somatic mutations in cancer genomes are caused by mutational processes of both exogenous and endogenous origin that operate during the cell lineage between the fertilized egg and the cancer cell16. Each mutational process may involve components of DNA damage or modification, DNA repair and DNA replication (which may be normal or abnormal), and generates a characteristic mutational signature that potentially includes base substitutions, small insertions and deletions (indels), genome rearrangements and chromosome copy-number changes1. The mutations in an individual cancer genome may have been generated by multiple mutational processes, and thus incorporate multiple superimposed mutational signatures. Therefore, to systematically characterize the mutational processes that contribute to cancer, mathematical methods have previously been used to decipher mutational signatures from somatic mutation catalogues, estimate the number of mutations that are attributable to each signature in individual samples and annotate each mutation class in each tumour with the probability that it arose from each signature6,9,17,18,19,20,21,22,23,24,25,26,27.

Previous studies of multiple types of cancer have identified more than 30 single-base substitution (SBS) signatures, some of known—but many of unknown—aetiologies, some ubiquitous and others rare, some part of normal cell biology and others associated with abnormal exposures or neoplastic progression3,4,5,7,8,9,10,11,12,13,14,15. Genome rearrangement signatures have also previously been described11,25,28,29,30. However, the analysis of other classes of mutation has been relatively limited3,11,31,32,33.

Mutational signature analysis has predominantly used cancer exome sequences. However, the many-fold-greater numbers of somatic mutations in whole genomes provide substantially increased power for signature decomposition, enabling the better separation of partially correlated signatures and the extraction of signatures that contribute relatively small numbers of mutations. Furthermore, technical artefacts and differences in sequencing technologies and mutation-calling algorithms can themselves generate mutational signatures. Therefore, the uniformly processed and highly curated sets of all classes of somatic mutations from the 2,780 cancer genomes of the PCAWG project2, combined with most other suitable cancer genomes (accession code syn11801889, available at https://www.synapse.org/#!Synapse:syn11801889), present a notable opportunity to establish the repertoire of mutational signatures and determine their activities across different types of cancer. The timing of these signatures during the evolution of individual cancers and the repertoire of signatures of structural variation have been explored in other PCAWG analyses30,34.

Mutational signature analysis

The 23,829 samples (which include most types of cancer, and comprise the 2,780 PCAWG whole genomes2, 1,865 additional whole genomes and 19,184 exomes) yielded 79,793,266 somatic SBSs, 814,191 doublet-base substitutions (DBSs) and 4,122,233 small indels that were analysed for mutational signatures, about tenfold more mutations than any previous study of which we are aware (syn11801889)6.

We developed classifications for each type of mutation. For SBSs, the primary classification comprised 96 classes (available at https://cancer.sanger.ac.uk/cosmic/signatures/SBS) constituted by the 6 base substitutions C>A, C>G, C>T, T>A, T>C and T>G (in which the mutated base is represented by the pyrimidine of the base pair), plus the flanking 5′ and 3′ bases. In some analyses, two flanking bases 5′ and 3′ to the mutated base were considered (producing 1,536 classes) or mutations within transcribed genome regions were selected and classified according to whether the mutated pyrimidine fell on the transcribed or untranscribed strand (producing 192 classes). We also derived a classification for DBSs (78 classes; available at https://cancer.sanger.ac.uk/cosmic/signatures/DBS). Indels were classified as deletions or insertions and—when of a single base—as C or T, and according to the length of the mononucleotide repeat tract in which they occurred. Longer indels were classified as occurring at repeats or with overlapping microhomology at deletion boundaries, and according to the size of indel, repeat and microhomology (83 classes; available at https://cancer.sanger.ac.uk/cosmic/signatures/ID).
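The SBS class counts quoted above (96, 1,536 and 192) follow directly from the combinatorics of substitution type and flanking context. The following is a minimal sketch of that enumeration, not the authors' code; the label format `A[C>T]G` and the helper names `sbs_classes` and `to_pyrimidine` are assumptions made for illustration.

```python
from itertools import product

BASES = "ACGT"
# The six substitution types, written from the pyrimidine of the base pair.
SUBSTITUTIONS = ["C>A", "C>G", "C>T", "T>A", "T>C", "T>G"]
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def sbs_classes(flank=1):
    """List SBS classes with `flank` bases of context on each side."""
    classes = []
    for sub in SUBSTITUTIONS:
        for context in product(BASES, repeat=2 * flank):
            five = "".join(context[:flank])
            three = "".join(context[flank:])
            # Label format is an assumption of this sketch, e.g. A[C>T]G.
            classes.append(f"{five}[{sub}]{three}")
    return classes


def reverse_complement(seq):
    """Complement every base and reverse the order."""
    return seq.translate(COMPLEMENT)[::-1]


def to_pyrimidine(five, ref, alt, three):
    """Express a mutation from the strand on which the mutated base is C or T."""
    if ref in "CT":
        return f"{five}[{ref}>{alt}]{three}"
    # Purine reference: switch strands, so the flanks swap sides.
    return (f"{reverse_complement(three)}"
            f"[{reverse_complement(ref)}>{reverse_complement(alt)}]"
            f"{reverse_complement(five)}")


trinucleotide = sbs_classes(flank=1)    # one flanking base on each side
pentanucleotide = sbs_classes(flank=2)  # two flanking bases on each side
# Strand-aware classification for mutations in transcribed regions.
stranded = [(c, s) for c in trinucleotide
            for s in ("transcribed", "untranscribed")]

print(len(trinucleotide), len(pentanucleotide), len(stranded))
```

The counts arise as 6 × 4² = 96, 6 × 4⁴ = 1,536 and 96 × 2 = 192.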

The PCAWG whole-genome sequences, the additional whole-genome sequences and the exome sequences were each analysed separately (syn11801889)2. Signatures were extracted from each type of cancer individually, from all cancer types together, as separate SBS, DBS and indel signatures, and as composite signatures of all three types of mutation (Supplementary Note 2).

We used two methods based on nonnegative matrix factorization (NMF): SigProfiler, an elaborated version of the framework used for the previous ‘Catalogue Of Somatic Mutations In Cancer’ (COSMIC) compendium of mutational signatures (COSMIC v.2, available at https://cancer.sanger.ac.uk/cosmic/signatures_v2)11,17, and SignatureAnalyzer, which is based on a Bayesian variant of NMF9,27,35. NMF determines the signature profiles and contributions of each signature to each cancer genome as part of its factorization of the input matrix of mutation spectra. However, with many signatures and/or heterogeneous mutation burdens across samples, the mutations observed in a particular sample can be reconstructed in multiple ways—often with small and/or biologically implausible contributions from many signatures. Therefore, each method has developed a separate procedure for estimating the contributions of signatures to each sample (Methods).

We tested SignatureAnalyzer and SigProfiler on 11 sets of synthetic data (including 64,400 synthetic samples), generated from known signature profiles (Methods, Supplementary Note 2). Both methods performed well in re-extracting known signatures from realistically complex data. Extracted signatures that were discordant from the known input usually arose from difficulties in selecting the correct number of signatures. The results confirm that use of NMF-based approaches for extracting mutational signatures is not a purely algorithmic process, but also requires consideration of evidence from experimentally determined mutational signatures and the DNA damage and repair literature, prior evidence of biological plausibility and human-guided sensitivity analysis confirming that extractions from different groupings of tumours yield consistent results. We used these types of evidence and approaches in determining the signature profiles reported here. The findings are consistent with results regarding NMF, and the related areas of probabilistic topic modelling and latent Dirichlet allocation, in multiple problem domains36,37. It is widely understood that the choice of the number of latent variables (for our purposes, the number of mutational signatures) is rarely amenable to complete automation.

The results from our SigProfiler and SignatureAnalyzer analyses of cancer data exhibited many similarities, and we assigned the same identifiers to similar signatures extracted using the two methods (syn12016215). However, there were also noteworthy differences. The numbers of SBS signatures found in PCAWG tumours with a low mutation burden (94.4% of cases that contain 47% of mutations) were similar: 31 using SigProfiler and 35 using SignatureAnalyzer. However, the numbers of additional SBS signatures extracted from hypermutated PCAWG samples (5.6% of cases, containing 53% of mutations) were different: 13 using SigProfiler and 25 using SignatureAnalyzer. There were also differences in SBS signature profiles, including among signatures found in cases with a low mutation burden. The latter primarily involved relatively featureless (‘flat’) signatures, which are mathematically challenging to deconvolute. Finally, there were differences in signature attributions to individual samples. SignatureAnalyzer used more signatures to reconstruct the mutational profiles (Extended Data Fig. 1) (syn12169204 and syn12177011) and attributions to flat signatures were different (Extended Data Fig. 2a, b) (syn12169204). The DBS and indel signatures were generally similar between the two methods (Extended Data Fig. 2c, d).

The final reference mutational signatures were determined from the PCAWG set, supplemented by additional signatures from the other datasets (COSMIC, available at https://cancer.sanger.ac.uk/cosmic/signatures). Each signature was allocated an identifier consistent with, and extending, the COSMIC v.2 annotation. Some previous signatures split into multiple constituent signatures: these were numbered as in the previous annotation, but with additional letter suffixes (for example, SBS17 was split into SBS17a and SBS17b). DNA sequencing and analysis artefacts also generate mutational signatures. We indicate which signatures are possible artefacts but do not present them below (full information is available at https://cancer.sanger.ac.uk/cosmic/signatures). The results of both SignatureAnalyzer and SigProfiler were used throughout the study. However, for brevity and for continuity with the signature set previously displayed in COSMIC v.2—which has been widely used as a reference—SigProfiler results are outlined here, and SignatureAnalyzer results are provided in Extended Data Figs. 3, 4 and at syn11738307.

Single-base substitution signatures

There were substantial differences in the numbers of SBSs between samples (ranging from hundreds to millions) and between cancer types38 (Fig. 1). In total, 67 SBS mutational signatures were extracted, of which 49 were considered likely to be of biological origin (Fig. 2, Methods; available at https://cancer.sanger.ac.uk/cosmic/signatures/SBS/). Except for signature SBS25, all signatures reported in COSMIC v.2 (ref. 6) were confirmed; the median cosine similarity between the newly derived signatures and those on COSMIC v.2 was 0.95, excluding the ‘split’ signatures (discussed below). SBS25 was previously found in cell lines derived from Hodgkin lymphomas treated with chemotherapy, and no primary cancers of this type were available. The newly derived signatures showed much improved separation from each other and more-distinct signature profiles, as compared with COSMIC v.2 signatures (see ‘Better separation compared to COSMIC v.2 signatures’ in Supplementary Note 2 for more information).

Fig. 1: Mutation burdens of SBSs, DBSs and small indels across PCAWG tumour types.

The numbers of cases of each tumour type are shown next to the labels. Each dot represents one tumour. Tumour types are ordered by the median numbers of single-base substitutions. Only tumour types with >20 samples are shown. AdenoCA, adenocarcinoma; BNHL, B-cell non-Hodgkin lymphoma; ChRCC, chromophobe renal cell carcinoma; CLL, chronic lymphocytic leukaemia; CNS, central nervous system; ColoRect, colorectal; Eso, oesophageal; GBM, glioblastoma; HCC, hepatocellular carcinoma; Medullo, medulloblastoma; MH, microhomology; MPN, myeloproliferative neoplasm; Osteosarc, osteosarcoma; Panc, pancreatic; PiloAstro, pilocytic astrocytoma; Prost, prostate; RCC, renal cell carcinoma; SCC, squamous cell carcinoma; TCC, transitional cell carcinoma; Thy, thyroid.

Fig. 2: Profiles of SBS, DBS and small indel mutational signatures.

The classifications of each mutation type (SBS, 96 classes; DBS, 78 classes; and indels, 83 classes) are described in the main text. Magnified versions of signatures SBS4, DBS2 and ID3 (all of which are associated with tobacco smoking) are shown to illustrate the positions of each mutation subtype on each plot. The plotted data are available in digital form (along with the x axis labels) at syn12025148.

Thirteen of the SBS signatures we extracted (excluding those due to signature splitting) represent newly identified and probably real signatures, not present in COSMIC v.2. Some were rare (SBS31, SBS32, SBS35, SBS36, SBS42 and SBS44). Others were more common, but contributed relatively few mutations and/or were similar to previously discovered signatures (SBS38, SBS39 and SBS40). Notably, SBS40 is a flat signature similar to SBS5. It contributes to multiple types of cancer, but its similarity to SBS5 renders the extent of this contribution uncertain. For some of the newly identified signatures, there were plausible underlying aetiologies (Fig. 3, Extended Data Figs. 4, 5): for SBS31 and SBS35, platinum compound chemotherapy39; for SBS32, azathioprine therapy; for SBS36, inactivating germline or somatic mutations in MUTYH (which encodes a component of the base excision repair machinery)40,41; for SBS38, additional effects of exposure to ultraviolet (UV) light; for SBS42, occupational exposure to haloalkanes13; and for SBS44, defective DNA mismatch repair42.

Fig. 3: The number of mutations contributed by each mutational signature to the PCAWG tumours.

The size of each dot represents the proportion of samples of each tumour type that shows the mutational signature. The colour of each dot represents the median mutation burden of the signature in samples that show the signature. Tumours that had few mutations or that were poorly reconstructed by the signature assignment were excluded. The contributions of composite signatures to the PCAWG cancers, and SBS signatures to the complete set of cancer samples analysed, are shown in Extended Data Figs. 4 and 5, respectively. AML, acute myeloid leukaemia; liposarc, liposarcoma; MDS, myelodysplastic syndrome.

Three previously characterized base substitution signatures (SBS7, SBS10 and SBS17) split into multiple constituent signatures (Fig. 2). Signature splitting probably reflects the existence of multiple distinct mutational processes initiated by the same exposure that have closely—but not perfectly—correlated activities. We previously regarded SBS7 as a single signature composed predominantly of C>T at CCN and TCN trinucleotides (the mutated base is underlined) together with many fewer T>N mutations. It was found in malignant melanomas and squamous skin carcinomas, and is probably due to the UV-light-induced formation of pyrimidine dimers, followed by translesion DNA synthesis by error-prone polymerases predominantly inserting A opposite to damaged cytosines. SBS7 has now been decomposed into four constituent signatures. SBS7a and SBS7b (consisting mainly of C>T at TCN and C>T at CCN, respectively) may reflect different pyrimidine-dimer photoproducts. SBS7c and SBS7d (consisting predominantly of T>A at NTT and T>C at NTT, respectively43) may be due to low frequencies of the misincorporation of T and G opposite to thymines in pyrimidine dimers. The splitting of SBS10 and SBS17 is described at https://cancer.sanger.ac.uk/cosmic/signatures/SBS/.

Several base substitution signatures showed transcriptional strand bias, which may be attributable to transcription-coupled nucleotide excision repair acting on DNA damage and/or to an excess of DNA damage on untranscribed strands of genes44. Both mechanisms result in more mutations of damaged bases on untranscribed than on transcribed strands of genes. Assuming that either mechanism is responsible for the observed transcriptional strand biases, DNA damage to cytosine (SBS7a and SBS7b), guanine (SBS4, SBS8, SBS19, SBS23, SBS24, SBS31, SBS32, SBS35 and SBS42), thymine (SBS7c, SBS7d, SBS21, SBS26 and SBS33) and adenine (SBS5, SBS12, SBS16, SBS22 and SBS25) may underlie these mutational signatures (plots of strand bias are available at https://cancer.sanger.ac.uk/cosmic/signatures/SBS/). The likely DNA-damaging agents are known for SBS4 (tobacco mutagens), SBS7a, SBS7b, SBS7c and SBS7d (UV light), SBS22 (aristolochic acid), SBS24 (aflatoxin), SBS25 (chemotherapy), SBS31 and SBS35 (platinum compounds), SBS32 (azathioprine) and SBS42 (haloalkanes).

Using the SBS classification of 1,536 mutation types, which uses the sequence context two bases 5′ and two bases 3′ to each mutated base, yielded signatures that are largely consistent with those based on substitutions in trinucleotide contexts. Notably, however, two forms of both SBS2 and SBS13 were extracted, one with mainly a pyrimidine and the other with mainly a purine at the −2 base (the second base 5′ to the mutated cytosine). These may represent the activities of the cytidine deaminases APOBEC3A and APOBEC3B, respectively45. If so, APOBEC3A accounts for many more mutations than APOBEC3B in cancers with high APOBEC activity. Other signatures showed nonrandom sequence contexts at +2 and −2 positions (for example, SBS17a, SBS17b and SBS9), but sequence context effects were generally much stronger for bases immediately 5′ and 3′ to mutated bases.

SBS signatures showed substantial variation in the numbers of cancer types and cancer samples in which they were found, and in the mutations attributed per cancer sample (Fig. 3). Almost all individual cancer samples exhibited multiple signatures, with a mode of three in the PCAWG set (syn12169204). The assigned signatures reconstruct the mutational spectra of the cancer samples well (in PCAWG samples, the median cosine similarity was 0.97; 96.3% of samples had a cosine similarity >0.90); Fig. 4 shows illustrative examples.
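Reconstruction quality here is measured by the cosine similarity between a sample's observed mutational spectrum and the spectrum rebuilt from its assigned signatures. The sketch below shows that calculation under simplifying assumptions: the four-class profiles and counts are invented toy values, and `reconstruct` is a hypothetical helper name.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length spectra."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for an all-zero spectrum")
    return dot / (norm_a * norm_b)


def reconstruct(signatures, activities):
    """Sum the signature profiles, each weighted by its attributed mutation count."""
    n_classes = len(signatures[0])
    return [sum(act * sig[i] for sig, act in zip(signatures, activities))
            for i in range(n_classes)]


# Toy data (illustrative only): two signatures over four mutation classes.
signatures = [[0.7, 0.1, 0.1, 0.1],
              [0.1, 0.1, 0.2, 0.6]]
activities = [100, 50]        # mutations attributed to each signature
observed = [78, 14, 19, 41]   # observed counts per mutation class

similarity = cosine_similarity(observed, reconstruct(signatures, activities))
print(f"cosine similarity: {similarity:.4f}")
```

Because cosine similarity ignores overall scale, it compares the shape of two spectra rather than their total mutation burdens.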

Fig. 4: Illustrative examples of mutational spectra of individual cancer samples.

The contributory SBS, DBS and small indel mutational signatures in two tumours are shown.

Some mutational processes generate base substitutions that cluster in small genomic regions. The limited numbers of such mutations may result in a failure to detect their signatures using standard methods. We therefore identified clustered mutations in each genome and analysed them separately (Methods). Four main clustered mutational signatures were identified (Fig. 2), as previously reported4,27,32. Two, which are found in multiple types of cancer, were similar to SBS2 and SBS13 (which have been attributed to APOBEC enzyme activity) and represent foci of kataegis3,32,46. Two further clustered signatures, one characterized by C>T and C>G mutations at (A or G)C(C or T) trinucleotides47 and the other by T>A and T>C mutations at (A or T)T(A or T), were found in lymphoid neoplasms; they probably represent the direct and indirect consequences of activation-induced cytidine deaminase mutagenesis and translesion DNA synthesis by error-prone polymerases (SBS84 and SBS85, respectively)27.

Doublet-base substitution signatures

Tandem doublet, triplet, quadruplet, quintuplet and sextuplet base substitutions (syn11801938 and syn11726620) were observed at about 1% of the prevalence of SBSs. In most cancer genomes, the number of DBSs was considerably higher than would be expected from the random adjacency of SBSs (syn12177057), indicating the existence of commonly occurring, single mutagenic events that cause substitutions at neighbouring bases. There was substantial variation in the number of DBSs, ranging from 0 to 20,818 in a sample. The numbers of DBSs were generally proportional to the numbers of SBSs (Fig. 1), although colorectal adenocarcinomas had fewer than expected, and lung cancers and melanomas had more (Extended Data Table 1). We extracted eleven DBS signatures (Fig. 2), of which three have previously been reported33,48.

Signature DBS1 was characterized by CC>TT mutations (Fig. 2), contributed hundreds to tens of thousands of mutations in malignant melanomas with SBS7a and SBS7b (Fig. 3), exhibited transcriptional strand bias consistent with damage to cytosines (syn12177063) and is a known consequence of DNA damage induced by UV light33,49. Excluding cancers associated with exposure to UV light also yielded a signature (DBS11) that was characterized predominantly by CC>TT mutations, but only contributing tens of mutations in many samples from multiple types of cancer (Figs. 2, 3). DBS11 was associated with SBS2, which is due to APOBEC activity: APOBEC activity may, therefore, also generate DBS11.

DBS2 was composed predominantly of CC>AA mutations, with smaller numbers of CC>AG and CC>AT mutations, and contributed hundreds to thousands of mutations in lung adenocarcinoma, lung squamous and head and neck squamous carcinomas, which are often caused by tobacco smoking33 (Figs. 2, 3). DBS2 showed transcriptional strand bias indicative of guanine damage (syn12177064) and was associated with SBS4, which is caused by exposure to tobacco smoke. It is likely, therefore, that DBS2 can be a consequence of DNA damage by tobacco-smoke mutagens.

A signature similar to DBS2 contributed hundreds of mutations to liver cancers and tens of mutations to other types of cancer without evidence of exposure to tobacco smoke. A pattern resembling DBS2 also dominates DBSs in healthy mouse cells50. The nature of the mutational processes that underlie these signatures in human cancers that are unrelated to smoking, and in healthy mice, is unknown. However, in experimental systems, acetaldehyde exposure has been shown to generate a mutational signature characterized primarily by CC>AA mutations, and lower burdens of CC>AG and CC>AT mutations, together with C>A SBSs48. Acetaldehyde is an oxidation product of alcohol and a constituent of cigarette smoke. The role of acetaldehyde, and perhaps other aldehydes, in generating DBS2 merits further investigation51.

DBS3, DBS7, DBS8 and DBS10 showed hundreds to thousands of mutations in rare colorectal, stomach and oesophageal cancers, some of which showed evidence of defective DNA mismatch repair (DBS7 and DBS10) or polymerase epsilon exonuclease domain mutations (DBS3) that generate hypermutator phenotypes (Figs. 2, 3). DBS5 was found in cancers exposed to platinum chemotherapy, and is associated with SBS31 and SBS35.

Small insertion-and-deletion signatures

Indels were usually present at about 10% of the frequency of base substitutions (Fig. 1). There was substantial variation between cancer genomes in the number of indels, even when cancers with evidence of defective DNA mismatch repair were excluded. Overall, the numbers of deletions and insertions were similar, but there was variation between cancer types: some cancers showed more deletions and others more insertions of various subtypes (Fig. 1). We extracted 17 indel mutational signatures (Fig. 2).

Indel signature 1 (ID1) was composed predominantly of insertions of thymine and ID2 was composed predominantly of deletions of thymine, both at long (≥5) thymine mononucleotide repeats (Fig. 2). Tens to hundreds of mutations of both signatures were found in most samples of most types of cancer, but were particularly common in colorectal, stomach, endometrial and oesophageal cancers and in diffuse large B cell lymphoma (Fig. 3). Together, ID1 and ID2 accounted for 97% and 45% of indels in hypermutated and non-hypermutated cancer genomes, respectively (Extended Data Table 2). They are probably due to slippage of either the nascent (ID1) or template strand (ID2) during DNA replication of long mononucleotide tracts.

ID3 was characterized predominantly by deletions of cytosine at short (≤5-bp long) mononucleotide cytosine repeats and exhibited hundreds of mutations in cancers of the lung, head and neck that are associated with tobacco smoking (Figs. 2, 3). There was transcriptional strand bias of mutations, with more guanine deletions than cytosine deletions on the untranscribed strands of genes, which is compatible with transcription-coupled nucleotide excision repair of damaged guanine (syn12177065 and syn12177066). The numbers of ID3 mutations positively correlated with the numbers of SBS4 and DBS2 mutations, which we have shown are associated with tobacco smoking (Extended Data Figs. 6, 7). Thus, DNA damage by components of tobacco smoke probably underlies ID3.

ID13 was characterized predominantly by deletions of thymine at thymine–thymine dinucleotides and exhibited large numbers of mutations in malignant melanomas of the skin (Figs. 2, 3). The numbers of ID13 mutations correlated with the numbers of SBS7a, SBS7b and DBS1 mutations, which we have attributed to DNA damage induced by UV light (Extended Data Figs. 6, 7). However, deletions of cytosine at cytosine–cytosine dinucleotides did not feature strongly in ID13, which may reflect the predominance of thymine compared to cytosine dimers induced by UV light52.

ID6 and ID8 were both characterized predominantly by ≥5-bp deletions (Fig. 2). ID6 exhibited overlapping microhomology at deletion boundaries with a mode of 2 bp (and often longer stretches) and correlated with SBS3, which we have attributed to defective homologous-recombination-based repair (Extended Data Figs. 6, 7). By contrast, ID8 deletions showed shorter or no microhomology at deletion boundaries and did not strongly correlate with SBS3. Both deletion patterns may be characteristic of DNA double-strand-break repair by non-homologous-recombination-based end-joining mechanisms and—if so—this suggests that at least two distinct forms are operative in human cancer53.

A small fraction of cancers exhibited very large numbers of ID1 and ID2 mutations (>10,000) (Fig. 3) (shown at https://cancer.sanger.ac.uk/cosmic/signatures/ID). These were usually accompanied by SBS6, SBS14, SBS15, SBS20, SBS21, SBS26 and/or SBS44, which are associated with deficiency in DNA mismatch repair—sometimes combined with POLE or POLD1 proofreading deficiency (SBS14 and SBS20)35. Occasional cases with these signatures additionally showed large numbers of indels attributed to ID7 (syn11738668), and rare samples showed large numbers of ID4, ID11, ID14, ID15, ID16 or ID17 mutations but did not show large numbers of ID1 and ID2 mutations or the SBS signatures associated with deficiency in DNA mismatch repair.

Correlations with age

A positive correlation between age of cancer diagnosis and the number of mutations attributable to a signature suggests that the underlying mutational process has been operative (at a more or less constant rate) throughout the cell lineage from fertilized egg to cancer cell, and thus in the normal cells from which that type of cancer develops6,54. Confirming previous reports6,54, the numbers of SBS1 and SBS5 mutations correlate with age, and exhibit different rates in different types of tissue (q values provided in syn12030687, syn20317940 and syn12217988). SBS40 also correlated with age in multiple types of cancer, although, given its similarity to SBS5, misattribution cannot be excluded. DBS2 and DBS4 correlated with age, consistent with activity in normal cells, and, when combined, their profiles closely resemble the spectrum of DBS mutations found in normal mouse cells50. ID1, ID2, ID5 and ID8 showed correlations with age in multiple tissues. ID1 and ID2 indels are probably due to slippage at poly T repeats during DNA replication and correlated with the numbers of SBS1 substitutions, which have previously been proposed to reflect the number of mitoses a cell has experienced6. Thus, SBS1, ID1 and ID2 may all be generated during DNA replication at mitosis. The number of ID5 mutations correlated with the number of SBS40 mutations, and the mutational processes that underlie these two age-correlated signatures may therefore contain common components. ID8, which is predominantly composed of ≥5-bp deletions with no or 1 bp of microhomology at their boundaries, is probably due to DNA double-strand breaks repaired by a non-homologous-end-joining mechanism. The results indicate that multiple mutational processes operate in normal cells.

Discussion

There are important constraints, limitations and assumptions in the analytic frameworks used here to characterize mutational signatures. Signatures extracted from sample sets in which multiple processes are operative remain mathematical approximations, with profiles that are potentially influenced by the mathematical approach used and other factors. For conceptual and practical simplicity, we assume that a single signature is associated with each mutational process and provide an average reference signature to represent it. However, we do not discount the possibility that further nuances and variations of signature profiles exist. We have estimated the contributions from each signature to the mutation burden in each sample. However, with increasing numbers of signatures and differences of multiple orders of magnitude in mutation burdens between some signatures, prior knowledge has helped to avoid biologically implausible results. Thus, the further development of methods for deciphering and attributing mutational signatures is warranted, ideally supported by signatures derived from experimental systems in which the causes are known. Nevertheless, signatures with many similarities and some differences can be found by different mathematical approaches, and these can be confirmed in several ways, including experimentally elucidated signatures5,31,39,42,43,54,55,56,57,58,59,60,61,62 and tumours dominated by a single signature (syn12016215).

This analysis includes most publicly available exome and whole-genome cancer sequences. Some rare or geographically restricted signatures may not have been captured, signatures conferring limited mutation burdens may have been missed and signatures of therapeutic mutagenic exposures have not been exhaustively explored. Nevertheless, it is likely that a substantial proportion of the naturally occurring mutational signatures found in human cancer have now been described. This comprehensive repertoire provides a foundation for research into the aetiologies of geographical and temporal differences in cancer incidence, the mutational processes that operate in healthy tissues and non-neoplastic disease states, clinical and public health applications of signatures and mechanistic understanding of the mutational processes that underlie carcinogenesis.

Methods

No statistical methods were used to predetermine sample size. The experiments were not randomized and investigators were not blinded to allocation during experiments and outcome assessment.

These online methods contain an abridged description of the methodology used in the current manuscript; extensive details about the methodology we used are provided in Supplementary Note 2. Importantly, two independently developed computational frameworks (SigProfiler and SignatureAnalyzer) based on NMF were applied separately to the examined sets of mutational catalogues. SigProfiler and SignatureAnalyzer take different approaches for deciphering mutational signatures and for assigning each signature to each sample. By using two methods, we aimed to provide a perspective on the effect that different methodologies can have on the numbers of signatures generated, signature profiles and attributions. In addition to applying SigProfiler and SignatureAnalyzer to cancer data, the tools were also applied to realistic synthetic data with known solutions.

Analysis of mutational signatures with SigProfiler

SigProfiler incorporates two distinct steps for identification of mutational signatures, based on the previously described methodology6,11,17 (Extended Data Fig. 8). The first step (SigProfilerExtraction) encompasses a hierarchical de novo extraction of mutational signatures based on somatic mutations and their immediate sequence context, and the second step (SigProfilerAttribution) focuses on accurately estimating the number of somatic mutations associated with each extracted mutational signature in each sample. SigProfilerExtraction is an extension of a previous framework for the analysis of mutational signatures11,17. In brief, for a given set of mutational catalogues, the algorithm deciphers a minimal set of mutational signatures that optimally explains the proportion of each mutation type and estimates the contribution of each signature to each sample. More specifically, for each NMF iteration, SigProfilerExtraction minimizes a generalized Kullback–Leibler divergence constrained for nonnegativity (Supplementary Note 2). The algorithm uses multiple NMF iterations (in most cases 1,024) to identify the matrix of mutational signatures and the matrix of the activities of these signatures, as previously described17. The unknown number of signatures is determined by human assessment of the stability and accuracy of solutions for a range of values, as previously described17. The framework is applied hierarchically to increase its ability to find mutational signatures that generate few mutations or are present in few samples.
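To make the factorization step concrete, the sketch below runs a single NMF fit that minimizes the generalized Kullback–Leibler divergence using the classical multiplicative update rules. This is an illustrative stand-in rather than SigProfilerExtraction itself: it omits the repeated iterations, the stability assessment and the hierarchical application described above, and the function names and toy matrices are assumptions.

```python
import math
import random


def matmul(W, H):
    """Multiply W (classes x k) by H (k x samples)."""
    return [[sum(W[i][a] * H[a][j] for a in range(len(H)))
             for j in range(len(H[0]))] for i in range(len(W))]


def kl_divergence(V, R, eps=1e-12):
    """Generalized Kullback-Leibler divergence D(V || R)."""
    total = 0.0
    for v_row, r_row in zip(V, R):
        for v, r in zip(v_row, r_row):
            if v > 0:
                total += v * math.log(v / (r + eps))
            total += r - v
    return total


def nmf_kl(V, k, iterations=200, seed=0, eps=1e-12):
    """Factorize V ~ W H with nonnegative W (signatures) and H (activities)."""
    rng = random.Random(seed)
    m, n = len(V), len(V[0])
    W = [[rng.random() + 0.1 for _ in range(k)] for _ in range(m)]
    H = [[rng.random() + 0.1 for _ in range(n)] for _ in range(k)]
    for _ in range(iterations):
        R = matmul(W, H)
        for a in range(k):  # multiplicative update of the activities
            col_sum = sum(W[i][a] for i in range(m)) + eps
            for j in range(n):
                H[a][j] *= sum(W[i][a] * V[i][j] / (R[i][j] + eps)
                               for i in range(m)) / col_sum
        R = matmul(W, H)
        for a in range(k):  # multiplicative update of the signatures
            row_sum = sum(H[a][j] for j in range(n)) + eps
            for i in range(m):
                W[i][a] *= sum(H[a][j] * V[i][j] / (R[i][j] + eps)
                               for j in range(n)) / row_sum
    return W, H


# Toy catalogue: 4 mutation classes x 5 samples built from 2 known signatures.
W_true = [[0.7, 0.1], [0.1, 0.1], [0.1, 0.2], [0.1, 0.6]]
H_true = [[100, 20, 50, 0, 80], [10, 200, 50, 120, 0]]
V = matmul(W_true, H_true)

W, H = nmf_kl(V, k=2)
print("final divergence:", kl_divergence(V, matmul(W, H)))
```

The multiplicative form keeps every entry nonnegative without an explicit projection step. The extracted columns of `W` are not normalized here, so each would need rescaling to sum to 1 before being read as a signature profile.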

After signatures are discovered by SigProfilerExtraction, SigProfilerAttribution estimates their contributions to individual samples. For each examined sample, the estimation algorithm finds the minimum of the Frobenius norm of a constrained function, using a nonlinear convex optimization solver based on the interior-point algorithm63. See Supplementary Note 2 and Extended Data Fig. 8b for further details.
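
The attribution problem for one sample can be sketched as a nonnegative least-squares fit. The code below substitutes projected gradient descent for the interior-point solver named above, purely to stay self-contained; it minimizes the same Frobenius-norm objective under a nonnegativity constraint but is not the SigProfilerAttribution implementation, and any further constraints described in Supplementary Note 2 are not modelled. The function name attribute_signatures is hypothetical.

```python
def attribute_signatures(signatures, catalogue, iterations=500):
    """Estimate nonnegative exposures x minimizing ||S x - m||_2.

    signatures: list of signature profiles, each a list over mutation types.
    catalogue:  observed mutation counts m for one sample, same type order.
    """
    n_sig, n_types = len(signatures), len(catalogue)
    # Step 1 / ||S||_F^2 is no larger than 1 / lambda_max(S^T S), which
    # guarantees convergence of projected gradient descent on this problem.
    step = 1.0 / sum(p * p for sig in signatures for p in sig)
    x = [sum(catalogue) / n_sig] * n_sig  # start from an even split
    for _ in range(iterations):
        recon = [sum(signatures[k][i] * x[k] for k in range(n_sig))
                 for i in range(n_types)]
        resid = [recon[i] - catalogue[i] for i in range(n_types)]
        grad = [sum(signatures[k][i] * resid[i] for i in range(n_types))
                for k in range(n_sig)]
        # Gradient step followed by projection onto the nonnegative orthant.
        x = [max(0.0, x[k] - step * grad[k]) for k in range(n_sig)]
    return x
```

For two toy signatures [0.7, 0.2, 0.1] and [0.1, 0.3, 0.6] and a catalogue [75, 35, 40], the fit recovers exposures of approximately 100 and 50 mutations.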

Analysis of mutational signatures with SignatureAnalyzer

SignatureAnalyzer uses a Bayesian variant of NMF that infers the number of signatures through the automatic relevance determination technique and delivers highly interpretable and sparse representations for both signature profiles and attributions that strike a balance between data fitting and model complexity. Further details of the implementation of the computational approach have previously been published9,27,64. SignatureAnalyzer was applied with a two-step signature extraction strategy, using 1,536 pentanucleotide contexts for SBSs, 83 indel features and 78 DBS features. In addition to the separate extraction of SBS, indel and DBS signatures, we performed a ‘COMPOSITE’ signature extraction based on all 1,697 features (1,536 SBS + 78 DBS + 83 indel). For SBSs, the 1,536 SBS COMPOSITE signatures are preferred; for DBSs and indels, the separately extracted signatures are preferred.

In step 1 of the two-step extraction process, global signature extraction was performed for the samples with a low mutation burden (n = 2,624). These excluded hypermutated tumours: those with putative polymerase epsilon (POLE) defects or mismatch repair defects (microsatellite instable tumours), skin tumours (which had intense UV-light mutagenesis) and one tumour with temozolomide (TMZ) exposure. Because the underlying algorithm of SignatureAnalyzer performs a stochastic search, different runs can produce different results. In step 1, we ran SignatureAnalyzer ten times and selected the solution with the highest posterior probability. In step 2, additional signatures unique to hypermutated samples were extracted (again selecting the solution with the highest posterior probability over ten runs), while allowing all signatures found in the samples with a low mutation burden to explain some of the spectra of hypermutated samples. This approach was designed to minimize the well-known ‘signature bleeding’ effect, or bias of hyper- or ultramutated samples on the signature extraction. In addition, this approach provided information about which signatures are unique to the hypermutated samples, which was later used when attributing signatures to samples.

A similar strategy was used for signature attribution: we performed a separate attribution process for low- and hypermutated samples for all COMPOSITE, SBS, DBS and indel signatures. For downstream analyses, we preferred to use the COMPOSITE attributions for SBSs and the separately calculated attributions for DBSs and indels. Signature attribution in samples with a low mutation burden was performed separately in each tumour type (for example, Biliary–AdenoCA, Bladder–TCC, Bone–Osteosarc, and so on). Attribution was also performed separately in the combined microsatellite instable tumours (n = 39), POLE (n = 9), skin melanoma (n = 107) and TMZ-exposed samples (syn11738314). In both groups, signature availability (which signatures were active, or not) was primarily inferred through the automatic relevance determination process applied to the activity matrix H only, while fixing the signature matrix W. The attribution in samples with a low mutation burden was performed using only signatures found in step 1 of the signature extraction. Two additional rules were applied in SBS signature attribution to enforce biological plausibility and minimize signature bleeding: (i) allow SBS4 (smoking signature) only in lung and head-and-neck cases; and (ii) allow SBS11 (TMZ signature) only in a single GBM sample. This was enforced by introducing a binary, signature-by-sample indicator matrix Z (1, allowed; 0, not allowed), which was multiplied by the H matrix in every multiplicative update of H. No additional rules were applied to indel or DBS signature attributions, except that signatures found in hypermutated samples were not allowed in samples with a low mutation burden.
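
The effect of the indicator matrix Z can be illustrated with a minimal sketch. The update rule below is the generic multiplicative update for the Kullback-Leibler objective with W held fixed, which is an assumption made for illustration; the Bayesian updates with automatic relevance determination that SignatureAnalyzer uses are not reproduced. Only the masking step, in which Z multiplies H in every update, follows the description above. The function name masked_update_H is hypothetical.

```python
def masked_update_H(V, W, H, Z, eps=1e-12):
    """One multiplicative update of H with W fixed, masked by Z.

    V: counts (types x samples); W: signatures (types x rank);
    H: activities (rank x samples); Z: binary mask (rank x samples),
    1 where a signature is allowed in a sample and 0 where it is not.
    """
    n_types, rank, n_samples = len(W), len(H), len(H[0])
    WH = [[sum(W[i][k] * H[k][j] for k in range(rank)) + eps
           for j in range(n_samples)] for i in range(n_types)]
    new_H = []
    for k in range(rank):
        col_sum = sum(W[i][k] for i in range(n_types)) + eps
        row = []
        for j in range(n_samples):
            num = sum(W[i][k] * V[i][j] / WH[i][j] for i in range(n_types))
            # Multiplying by Z[k][j] forces disallowed activities to zero;
            # once zero, a multiplicative update cannot make them positive.
            row.append(Z[k][j] * H[k][j] * num / col_sum)
        new_H.append(row)
    return new_H
```

Repeating this update leaves every entry of H with a zero in Z at exactly zero, so the corresponding signature cannot absorb mutations in that sample.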

Application of SigProfiler and SignatureAnalyzer to synthetic data

Our goal was to evaluate SignatureAnalyzer and SigProfiler on realistic synthetic data to identify any potential limitations of these two methods. SignatureAnalyzer and SigProfiler were tested on 11 sets of synthetic data, encompassing a total of 64,400 synthetic samples, in which known signature profiles were used to generate catalogues of synthetic mutational spectra. We operationally defined ‘realistic’ data as those based on the characteristics of either SignatureAnalyzer’s or SigProfiler’s analysis of the PCAWG genome data. SignatureAnalyzer’s reference signature profiles were based on COMPOSITE signatures, consisting of 1,536 types of strand-agnostic SBSs in pentanucleotide context, 78 types of DBSs and 83 types of small indels, for a total of 1,697 mutation types. SigProfiler’s reference analysis was based on strand-agnostic SBSs in the context of one 5′ and one 3′ base. For each test, we generated two sets of realistic data: SigProfiler-realistic (based on SigProfiler’s reference signatures and attributions) and SignatureAnalyzer-realistic (based on SignatureAnalyzer’s reference signatures and attributions), as well as two other types of data that involved using SignatureAnalyzer profiles with SigProfiler attributions and vice versa. A detailed description of each of the 11 sets of synthetic data and the results from applying SigProfiler and SignatureAnalyzer are provided in Supplementary Note 2.
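
The general idea of generating a synthetic spectrum from known signature profiles and attributions can be sketched as follows. The sampling model, in which each signature contributes its attributed number of mutations drawn at random according to its profile, is an assumption made here for illustration; the procedure actually used for the 11 data sets is described in Supplementary Note 2. The function name synthetic_spectrum is hypothetical.

```python
import random

def synthetic_spectrum(signatures, exposures, seed=0):
    """Draw one synthetic mutational spectrum.

    signatures: list of profiles, each a list of probabilities over types.
    exposures:  integer number of mutations attributed to each signature.
    """
    rng = random.Random(seed)  # fixed seed keeps the example reproducible
    n_types = len(signatures[0])
    spectrum = [0] * n_types
    for profile, n_mutations in zip(signatures, exposures):
        # Assumed noise model: each mutation's type is sampled independently
        # with probability given by the signature profile.
        drawn = rng.choices(range(n_types), weights=profile, k=n_mutations)
        for mutation_type in drawn:
            spectrum[mutation_type] += 1
    return spectrum
```

Because the generating signatures and exposures are known, the output of an extraction method applied to such spectra can be compared directly with the ground truth.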

Analysis of clustered mutational signatures

Somatic SBSs were considered clustered if they had intermutational distances of <1,000 bp. More specifically, for each sample, an SBS mutational catalogue was generated for substitutions that were <1,000 bp from another substitution. Subsequently, the set of SBS mutational catalogues containing clustered mutations underwent de novo extraction of mutational signatures. Any novel mutational signature (one that was not previously observed in the complete SBS catalogues) was reported as a clustered mutational signature.
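
The clustering criterion can be sketched as a simple filter on sorted positions. The input format, a list of (chromosome, position) pairs for one sample, and the function name clustered_substitutions are assumptions made for illustration; distances are only compared within a chromosome.

```python
def clustered_substitutions(mutations, max_distance=1000):
    """Return substitutions lying < max_distance bp from another one.

    mutations: list of (chromosome, position) tuples for a single sample.
    """
    by_chrom = {}
    for chrom, pos in mutations:
        by_chrom.setdefault(chrom, []).append(pos)
    clustered = []
    for chrom, positions in by_chrom.items():
        positions.sort()
        for idx, pos in enumerate(positions):
            # After sorting, the nearest substitution is an adjacent entry.
            near_prev = idx > 0 and pos - positions[idx - 1] < max_distance
            near_next = (idx + 1 < len(positions)
                         and positions[idx + 1] - pos < max_distance)
            if near_prev or near_next:
                clustered.append((chrom, pos))
    return clustered
```

The substitutions returned for each sample would then be counted by mutation type to form the clustered catalogue that is passed to de novo extraction.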

Better separation compared to COSMIC v.2 signatures

As described in the manuscript, all mutational signatures previously reported in COSMIC v.2 were confirmed in the new set of analyses, with a median cosine similarity of 0.95. However, the COSMIC v.2 mutational signatures (https://cancer.sanger.ac.uk/cosmic/signatures_v2) are much less well separated from one another than the mutational signatures reported here. For example, in COSMIC v.2, signatures 5 and 16 had a cosine similarity of 0.90, making them hard to distinguish from one another. By contrast, in the current analysis, SBS5 and SBS16 have a cosine similarity of 0.65. This allows us to unambiguously assign SBS5 and SBS16 to different samples. In the current analysis, the larger number of samples has reduced bleeding between signatures and yielded more distinct and easily distinguishable signatures. One can evaluate the overall separation of a set of mutational signatures by examining the distribution of cosine similarities between the signatures in the set. The signatures in COSMIC v.2 had a median cosine similarity of 0.238. By contrast, the current signatures have a much lower median cosine similarity of 0.098. This more than twofold reduction in similarity is highly statistically significant (P = 9.1 × 10^−25) and indicates a better separation between the signatures in the current analysis.
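
The separation measure described above can be computed as follows. This is a minimal sketch in which each signature is a list of values over the same ordered mutation types; the function names are hypothetical, and the statistical test behind the reported P value is not reproduced because it is not specified here.

```python
import math
import statistics
from itertools import combinations

def cosine_similarity(a, b):
    """Cosine of the angle between two signature profiles."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def median_pairwise_similarity(signatures):
    """Median cosine similarity over all unordered pairs of signatures."""
    return statistics.median(
        cosine_similarity(a, b) for a, b in combinations(signatures, 2)
    )
```

A lower median over all pairs indicates that the signatures in the set are, on the whole, easier to tell apart.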

Correlations of mutational signature activity with age

Before evaluating the association between age and the activity of a mutational signature, all outliers for both age and numbers of mutations attributed to a signature in a cancer type were removed from the data. An outlier was defined as any value more than three standard deviations from the mean value. A robust linear regression model was then fitted using the MATLAB function robustfit (https://www.mathworks.com/help/stats/robustfit.html) with default parameters, to estimate the slope of the line and to test whether this slope was significantly different from zero (F test; P value < 0.05). The P values from the F tests were corrected using the Benjamini–Hochberg procedure for false discovery rates. Results are available at syn12030687 and syn20317940.
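
Two of the steps above, outlier removal and Benjamini-Hochberg correction, can be sketched as follows. The robust regression itself was run with MATLAB's robustfit and is not reproduced. The use of the sample standard deviation and the function names remove_outliers and benjamini_hochberg are assumptions made for illustration.

```python
import statistics

def remove_outliers(pairs, n_sd=3.0):
    """Drop (age, mutation_count) pairs with either value > n_sd SDs from its mean."""
    def within(values):
        mean = statistics.mean(values)
        sd = statistics.stdev(values)  # sample SD; an assumption
        return [abs(v - mean) <= n_sd * sd for v in values]
    keep_age = within([age for age, _ in pairs])
    keep_count = within([count for _, count in pairs])
    return [p for p, ka, kc in zip(pairs, keep_age, keep_count) if ka and kc]

def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted P values in the input order."""
    n = len(p_values)
    order = sorted(range(n), key=lambda i: p_values[i])
    adjusted = [0.0] * n
    running_min = 1.0
    # Walk from the largest P value down, enforcing monotonicity.
    for rank in range(n, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * n / rank)
        adjusted[idx] = running_min
    return adjusted
```

For example, benjamini_hochberg([0.01, 0.04, 0.03, 0.5]) gives adjusted values of 0.04, 0.0533, 0.0533 and 0.5 (to three significant figures).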

Reporting summary

Further information on research design is available in the Nature Research Reporting Summary linked to this paper.