How many times has SARS-CoV-2 mutated?

According to

Coronavirus has mutated at least once

The novel coronavirus that has infected thousands of people across the world may have mutated at least once - meaning there may be two different types of the virus causing illnesses, a new study conducted by Chinese scientists suggests.

Scientists with Peking University's School of Life Sciences and the Institut Pasteur of Shanghai in a preliminary study found that one strain - type “L” - of the virus was more aggressive and accounted for about 70 percent of the strains analyzed. The second - type “S” - was less aggressive and accounted for about 30 percent of analyzed strains.

Several web pages have also theorized that the virus mutated before passing from animals to humans.

The page "What we know about the Wuhan virus" says:

This virus belongs to a family of viruses known as coronaviruses. Named for the crown-like spikes on their surfaces, they infect mostly bats, pigs and small mammals. But they mutate easily and can jump from animals to humans


Where did the new coronavirus come from? The new virus likely came originally from bats, scientists say. It isn't known exactly where or how it jumped to humans, though. Viruses from bats often infect another mammal first and then mutate to become more transmissible to humans.


Coronaviruses can also jump directly to humans, without mutating or passing through an intermediate species.

Is anything more known about this? Is it possible that the virus has mutated twice within a period of months?

This question makes a number of incorrect assumptions and I don't have time to correct them. The short answer is that the virus has mutated probably hundreds of times since it entered humans in late 2019.

The lower figure on the ncov page, "Diversity", shows the known mutations that have been identified so far. As I look at it now, there are maybe 100-200 shown there, but that will change daily and I can't be bothered to count them.

Is it possible the virus has mutated twice in a period of time of months? Of course it is. That's what we expect from coronaviruses. It would be shocking if it did not.

The default assumption (based on a vast amount of experience with coronaviruses and many other viruses) is that these mutations are neutral and do not affect the virus in terms of fitness, virulence, or transmissibility in any way; they are occasionally useful in tracking sources, but unless you have spent a lot of time looking at virus phylogenetic trees for several years, your interpretations of these mutations are almost certainly wrong.

An example of people making claims based on unfamiliarity with virus evolution is the "two different types of the virus" claim. That claim is addressed by MacLean and colleagues in their Response to “On the origin and continuing evolution of SARS-CoV-2”. Summary:

Two of the key claims made by this paper appear to have been reached by misunderstanding and over-interpretation of the SARS-CoV-2 data, with an additional analysis suffering from methodological limitations

How fast can the coronavirus mutate?

The new coronavirus, like all other viruses, mutates, or undergoes small changes in its genome. A recently published study suggested that the new coronavirus, SARS-CoV-2, had already mutated into one more and one less aggressive strain. But experts aren't convinced.

In the study, a group of researchers in China analyzed the genomes of coronaviruses taken from 103 patients with COVID-19, the disease caused by SARS-CoV-2, in Wuhan, China, the epicenter of the outbreak. The team found differences in the genomes, which they said could be categorized into two "strains" of the coronavirus: the "L" type and the "S" type, the researchers wrote in the study, which was published Tuesday (March 3) in the journal National Science Review.

The researchers found the "L" type, which they deemed the more aggressive type, in 70% of the virus samples. They also found that the prevalence of this strain decreased after early January. The more commonly found type today is the older, "S" type, because "human intervention" such as quarantines may have reduced the ability of the "L" type to spread, researchers wrote in the paper.

However, Nathan Grubaugh, an epidemiologist at the Yale School of Public Health who was not part of the study, said the authors' conclusions are "pure speculation." For one thing, the mutations the study authors referenced were incredibly small, on the order of a couple of nucleotides, the basic building blocks of genes, he said. (SARS-CoV-2 is about 30,000 nucleotides long.)

These slight changes likely wouldn't have a major impact, if any at all, on the functioning of the virus, so it would be "inaccurate" to say that these differences mean there are different strains, he said. In addition, the researchers looked at only 103 cases. "It's a very small sample set of the total virus population," Grubaugh told Live Science. Figuring out the mutations that a virus underwent worldwide takes "a nontrivial amount of effort and sometimes takes years to complete," he said.

Other scientists agree. The finding that the coronavirus has mutated into two strains, with the L strain leading to more severe disease, "is most likely a statistical artifact," Richard Neher, a biologist and physicist at the University of Basel in Switzerland, wrote on Twitter. This statistical effect is probably due to early sampling of the L group in Wuhan, resulting in a "higher apparent" case fatality rate, he wrote.

When there's a rapidly growing local outbreak, scientists quickly sample the virus genomes from patients, resulting in the overrepresentation of some variants of the virus, Neher wrote. The authors of the paper acknowledge that the data in their study is "still very limited" and that they need to follow up with larger data sets to better understand how the virus is evolving.


During the initial outbreak in Wuhan, China, various names were used for the virus; names used by different sources included "coronavirus" and "Wuhan coronavirus". [25] [26] In January 2020, the World Health Organization recommended "2019 novel coronavirus" (2019-nCoV) [5] [27] as the provisional name for the virus. This was in accordance with WHO's 2015 guidance [28] against using geographical locations, animal species, or groups of people in disease and virus names. [29] [30]

On 11 February 2020, the International Committee on Taxonomy of Viruses adopted the official name "severe acute respiratory syndrome coronavirus 2" (SARS‑CoV‑2). [31] To avoid confusion with the disease SARS, the WHO sometimes refers to SARS‑CoV‑2 as "the COVID-19 virus" in public health communications [32] [33] and the name HCoV-19 was included in some research articles. [8] [9] [10]

Human-to-human transmission of SARS‑CoV‑2 was confirmed on 20 January 2020, during the COVID-19 pandemic. [15] [34] [35] [36] Transmission was initially assumed to occur primarily via respiratory droplets from coughs and sneezes within a range of about 1.8 metres (6 ft). [37] [38] Laser light scattering experiments suggest that speaking is an additional mode of transmission [39] [40] and a far-reaching [41] and under-researched [42] one, indoors, with little air flow. [43] [44] Other studies have suggested that the virus may be airborne as well, with aerosols potentially being able to transmit the virus. [45] [46] [47] During human-to-human transmission, an average 1000 infectious SARS‑CoV‑2 virions are thought to initiate a new infection. [48] [49]

Indirect contact via contaminated surfaces is another possible cause of infection. [50] Preliminary research indicates that the virus may remain viable on plastic (polypropylene) and stainless steel (AISI 304) for up to three days, but does not survive on cardboard for more than one day or on copper for more than four hours. [10] The virus is inactivated by soap, which destabilises its lipid bilayer. [51] [52] Viral RNA has also been found in stool samples and semen from infected individuals. [53] [54]

The degree to which the virus is infectious during the incubation period is uncertain, but research has indicated that the pharynx reaches peak viral load approximately four days after infection [55] [56] or the first week of symptoms, and declines after. [57] The duration of SARS-CoV-2 RNA shedding is generally between 3 and 46 days after symptom onset. [58]

A study by a team of researchers from the University of North Carolina found that the nasal cavity is seemingly the dominant initial site for infection with subsequent aspiration-mediated virus seeding into the lungs in SARS‑CoV‑2 pathogenesis. [59] They found that there was an infection gradient from high in proximal towards low in distal pulmonary epithelial cultures, with a focal infection in ciliated cells and type 2 pneumocytes in the airway and alveolar regions respectively. [59]

There is some evidence of human-to-animal transmission of SARS‑CoV‑2, including examples in felids. [60] [61] Some institutions have advised those infected with SARS‑CoV‑2 to restrict contact with animals. [62] [63]

Asymptomatic transmission

On 1 February 2020, the World Health Organization (WHO) indicated that "transmission from asymptomatic cases is likely not a major driver of transmission". [64] One meta-analysis found that 17% of infections are asymptomatic, and asymptomatic individuals were 42% less likely to transmit the virus. [65]

However, an epidemiological model of the beginning of the outbreak in China suggested that "pre-symptomatic shedding may be typical among documented infections" and that subclinical infections may have been the source of a majority of infections. [66] That may explain how out of 217 onboard a cruise liner that docked at Montevideo, only 24 of 128 who tested positive for viral RNA showed symptoms. [67] Similarly, a study of ninety-four patients hospitalized in January and February 2020 estimated patients shed the greatest amount of virus two to three days before symptoms appear and that "a substantial proportion of transmission probably occurred before first symptoms in the index case". [68]


There is uncertainty about reinfection and long-term immunity. [69] It is not known how common reinfection is, but reports have indicated that it is occurring with variable severity. [69]

The first reported case of reinfection was a 33-year-old man from Hong Kong who first tested positive on 26 March 2020, was discharged on 15 April 2020 after two negative tests, and tested positive again on 15 August 2020 (142 days later), which was confirmed by whole-genome sequencing showing that the viral genomes between the episodes belong to different clades. [70] The findings had the implications that herd immunity may not eliminate the virus if reinfection is not an uncommon occurrence and that vaccines may not be able to provide lifelong protection against the virus. [70]

Another case study described a 25-year-old man from Nevada who tested positive for SARS‑CoV‑2 on 18 April 2020 and on 5 June 2020 (separated by two negative tests). Since genomic analyses showed significant genetic differences between the SARS‑CoV‑2 variant sampled on those two dates, the case study authors determined this was a reinfection. [71] The man's second infection was symptomatically more severe than the first infection, but the mechanisms that could account for this are not known. [71]

The first known infections from SARS‑CoV‑2 were discovered in Wuhan, China. [17] The original source of viral transmission to humans remains unclear, as does whether the virus became pathogenic before or after the spillover event. [19] [72] [9] Because many of the early infectees were workers at the Huanan Seafood Market, [73] [74] it has been suggested that the virus might have originated from the market. [9] [75] However, other research indicates that visitors may have introduced the virus to the market, which then facilitated rapid expansion of the infections. [19] [76] A March 2021 WHO report on a joint WHO–China study stated that human spillover via an intermediate animal host was the most likely explanation, with direct spillover from bats next most likely. Introduction through the food supply chain and the Huanan Seafood Market was considered another possible, but less likely, explanation. [77]

The mutation rate estimated from early cases of SARS-CoV-2 was 6.54 × 10⁻⁴ per site per year. [77] Its viral evolution is slowed by the RNA proofreading capability of its replication machinery. [78]
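To give a feel for what a per-site rate means for a whole genome, the short sketch below multiplies the quoted rate by the approximate genome length of 30,000 bases mentioned elsewhere in this text. Treating both figures as fixed constants, and treating every site as evolving at the same rate, are simplifying assumptions made only for illustration.

```python
# Figures quoted in the text; using them as exact constants is an assumption.
MUTATION_RATE_PER_SITE_PER_YEAR = 6.54e-4
GENOME_LENGTH_NT = 30_000  # approximate genome length in nucleotides


def expected_substitutions(rate_per_site_per_year, genome_length, years):
    """Expected number of substitutions along one lineage over `years`."""
    return rate_per_site_per_year * genome_length * years


per_year = expected_substitutions(
    MUTATION_RATE_PER_SITE_PER_YEAR, GENOME_LENGTH_NT, 1.0
)
per_month = expected_substitutions(
    MUTATION_RATE_PER_SITE_PER_YEAR, GENOME_LENGTH_NT, 1.0 / 12.0
)

print(f"about {per_year:.1f} substitutions per genome per year")
print(f"about {per_month:.1f} substitutions per genome per month")
```

Under these assumptions a single lineage would be expected to pick up on the order of one to two substitutions per month, which is consistent with the point made earlier that mutation over a span of months is expected rather than surprising.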

Research into the natural reservoir of the virus that caused the 2002–2004 SARS outbreak has resulted in the discovery of many SARS-like bat coronaviruses, most originating in the Rhinolophus genus of horseshoe bats. Phylogenetic analysis indicates that samples taken from Rhinolophus sinicus show a resemblance of 80% to SARS‑CoV‑2. [79] [80] [81] Phylogenetic analysis also indicates that a virus from Rhinolophus affinis, collected in Yunnan province and designated RaTG13, has a 96% resemblance to SARS‑CoV‑2. [17] [82] The RaTG13 virus sequence is the closest known sequence to SARS-CoV-2. [77] Other closely-related sequences were also identified in samples from local bat populations. [83]

Bats are considered the most likely natural reservoir of SARS‑CoV‑2, [84] [85] but differences between the bat coronavirus and SARS‑CoV‑2 suggest that humans were infected via an intermediate host [75] although the source of introduction into humans remains unknown. [86]

Although the role of pangolins as an intermediate host was initially posited (a study published in July 2020 suggested that pangolins are an intermediate host of SARS‑CoV‑2-like coronaviruses [87] [88] ), subsequent studies have not substantiated their contribution to the spillover. [77] Evidence against this hypothesis includes the fact that pangolin virus samples are too distant to SARS-CoV-2: isolates obtained from pangolins seized in Guangdong were only 92% identical in sequence to the SARS‑CoV‑2 genome. In addition, despite similarities in a few critical amino acids, [89] pangolin virus samples exhibit poor binding to the human ACE2 receptor. [90]

SARS‑CoV‑2 belongs to the broad family of viruses known as coronaviruses. [26] It is a positive-sense single-stranded RNA (+ssRNA) virus, with a single linear RNA segment. Coronaviruses infect humans, other mammals, and avian species, including livestock and companion animals. [91] Human coronaviruses are capable of causing illnesses ranging from the common cold to more severe diseases such as Middle East respiratory syndrome (MERS, fatality rate about 34%). SARS-CoV-2 is the seventh known coronavirus to infect people, after 229E, NL63, OC43, HKU1, MERS-CoV, and the original SARS-CoV. [92]

Like the SARS-related coronavirus implicated in the 2003 SARS outbreak, SARS‑CoV‑2 is a member of the subgenus Sarbecovirus (beta-CoV lineage B). [93] [94] Coronaviruses also undergo frequent recombination. [95] The SARS‑CoV‑2 RNA sequence is approximately 30,000 bases in length, [96] relatively long for a coronavirus (coronaviruses in turn carry the largest genomes among all RNA virus families). [97] Its genome consists nearly entirely of protein-coding sequences, a trait shared with other coronaviruses. [95]

A distinguishing feature of SARS‑CoV‑2 is its incorporation of a polybasic site cleaved by furin, [89] which appears to be an important element enhancing its virulence. [98] The furin protease recognizes the canonical peptide sequence RX[R/K]R↓X, where the cleavage site is indicated by a down arrow and X is any amino acid. [99] [100] In SARS-CoV-2 the recognition site is formed by the incorporated 12-nucleotide sequence CCT CGG CGG GCA, which corresponds to the amino acid sequence PRRA. [101] This sequence is upstream of an arginine and a serine, which form the S1/S2 cleavage site (PRRAR↓S) of the spike protein. [102] Although such sites are a common naturally-occurring feature of other viruses, [101] including some members of the Beta-CoV genus and other genera of coronaviruses, [103] SARS-CoV-2 is unique among members of its subgenus for such a site. [89]
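The correspondence between the inserted nucleotides and the PRRA amino acids can be checked by translating the four codons with the standard genetic code. The sketch below builds the standard codon table from its conventional TCAG ordering; the helper name `translate` is introduced here for illustration and is not taken from any cited tool.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order (TTT, TTC, TTA, ...).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna):
    """Translate a coding DNA (or RNA) string into one-letter amino acids."""
    dna = dna.replace(" ", "").upper().replace("U", "T")
    if len(dna) % 3 != 0:
        raise ValueError("sequence length must be a multiple of three")
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, len(dna), 3))


insert = "CCT CGG CGG GCA"  # sequence quoted in the text
nucleotides = len(insert.replace(" ", ""))
print(nucleotides, "nucleotides ->", translate(insert))
```

The quoted sequence spans four codons (12 nucleotides), one per residue of PRRA.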

Viral genetic sequence data can provide critical information about whether viruses separated by time and space are likely to be epidemiologically linked. [104] With a sufficient number of sequenced genomes, it is possible to reconstruct a phylogenetic tree of the mutation history of a family of viruses. By 12 January 2020, five genomes of SARS‑CoV‑2 had been isolated from Wuhan and reported by the Chinese Center for Disease Control and Prevention (CCDC) and other institutions; [96] [105] the number of genomes increased to 42 by 30 January 2020. [106] A phylogenetic analysis of those samples showed they were "highly related with at most seven mutations relative to a common ancestor", implying that the first human infection occurred in November or December 2019. [106] Examination of the topology of the phylogenetic tree at the start of the pandemic also found high similarities between human isolates. [107] As of 7 May 2020, 4,690 SARS‑CoV‑2 genomes sampled on six continents were publicly available. [108]

On 11 February 2020, the International Committee on Taxonomy of Viruses announced that according to existing rules that compute hierarchical relationships among coronaviruses based on five conserved sequences of nucleic acids, the differences between what was then called 2019-nCoV and the virus from the 2003 SARS outbreak were insufficient to make them separate viral species. Therefore, they identified 2019-nCoV as a virus of Severe acute respiratory syndrome–related coronavirus. [109]

In July 2020, scientists reported that a more infectious SARS‑CoV‑2 variant with spike protein variant G614 has replaced D614 as the dominant form in the pandemic. [110] [111]

Coronavirus genomes and subgenomes encode six open reading frames (ORFs). [112] In October 2020, researchers discovered a possible overlapping gene named ORF3d, in the SARS‑CoV‑2 genome. It is unknown if the protein produced by ORF3d has any function, but it provokes a strong immune response. ORF3d has been identified before, in a variant of coronavirus that infects pangolins. [113] [114]

Phylogenetic tree

A phylogenetic tree based on whole-genome sequences of SARS-CoV-2 and related coronaviruses is: [115] [116] [117]

Pangolin SARSr-CoV-GX, 89% to SARS-CoV-2, Manis javanica, smuggled from Southeast Asia [120]

Pangolin SARSr-CoV-GD, 91% to SARS-CoV-2, Manis javanica, smuggled from Southeast Asia [121]


There are many thousands of variants of SARS-CoV-2, which can be grouped into much larger clades. [123] Several different clade nomenclatures have been proposed. Nextstrain divides the variants into five clades (19A, 19B, 20A, 20B, and 20C), while GISAID divides them into seven (L, O, V, S, G, GH, and GR). [124]

Several notable variants of SARS-CoV-2 emerged in late 2020. The World Health Organization has currently declared four variants of concern, which are as follows: [125]

  • Alpha: Lineage B.1.1.7 emerged in the United Kingdom in September 2020, with evidence of increased transmissibility and virulence. Notable mutations include N501Y and P681H.
    • An E484K mutation in some lineage B.1.1.7 virions has been noted and is also tracked by various public health agencies.

    Other notable variants include six other WHO-designated variants under investigation and Cluster 5, which emerged among mink in Denmark and resulted in a mink euthanasia campaign that rendered it virtually extinct. [126]


    Each SARS-CoV-2 virion is 50–200 nanometres in diameter. [74] Like other coronaviruses, SARS-CoV-2 has four structural proteins, known as the S (spike), E (envelope), M (membrane), and N (nucleocapsid) proteins; the N protein holds the RNA genome, and the S, E, and M proteins together create the viral envelope. [127] Coronavirus S proteins are glycoproteins that are divided into two functional parts (S1 and S2). [91] In SARS-CoV-2, the spike protein, which has been imaged at the atomic level using cryogenic electron microscopy, [128] [129] is the protein responsible for allowing the virus to attach to and fuse with the membrane of a host cell; [127] specifically, its S1 subunit catalyzes attachment and its S2 subunit fusion. [130]


    SARS-CoV-2 has a linear, positive-sense, single-stranded RNA genome about 30,000 bases long. [91] Like other coronaviruses, its genome has a bias against cytosine (C) and guanine (G) nucleotides. [131] The genome has the highest composition of U (32.2%), followed by A (29.9%), and a similar composition of G (19.6%) and C (18.3%). [132] The nucleotide bias arises from the mutation of guanines and cytosines to adenosines and uracils, respectively. [133] The mutation of CG dinucleotides is thought to arise to avoid the zinc finger antiviral protein related defense mechanism of cells, [134] and to lower the energy to unbind the genome during replication and translation (adenosine and uracil base pair via two hydrogen bonds, cytosine and guanine via three). [133] The depletion of CG dinucleotides in its genome has led the virus to have a noticeable codon usage bias. For instance, arginine's six different codons have a relative synonymous codon usage of AGA (2.67), CGU (1.46), AGG (0.81), CGC (0.58), CGA (0.29), and CGG (0.19). [132] A similar codon usage bias trend is seen in other SARS-related coronaviruses. [135]
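The composition and codon-usage figures above can be sanity-checked with a few lines of arithmetic. The sketch assumes the usual definition of relative synonymous codon usage (RSCU): the observed count of a codon divided by the mean count over all synonymous codons for the same amino acid, so that the values for an amino acid with six codons sum to six. The function name `rscu` and the toy counts in the test are illustrative only.

```python
def rscu(codon_counts):
    """RSCU for one amino acid: count / mean count over its synonymous codons."""
    mean_count = sum(codon_counts.values()) / len(codon_counts)
    return {codon: n / mean_count for codon, n in codon_counts.items()}


# Nucleotide composition (percent) quoted in the text.
composition = {"U": 32.2, "A": 29.9, "G": 19.6, "C": 18.3}
gc_percent = composition["G"] + composition["C"]

# Arginine RSCU values quoted in the text.
arg_rscu = {"AGA": 2.67, "CGU": 1.46, "AGG": 0.81,
            "CGC": 0.58, "CGA": 0.29, "CGG": 0.19}

# Dividing each RSCU value by the number of synonymous codons gives the
# share of arginine positions that use that codon.
arg_share = {codon: value / len(arg_rscu) for codon, value in arg_rscu.items()}

print(f"GC content: {gc_percent:.1f}%")
print(f"sum of arginine RSCU values: {sum(arg_rscu.values()):.2f}")
for codon, share in arg_share.items():
    print(f"{codon}: {share:.1%} of arginine codons")
```

Read this way, an RSCU of 2.67 for AGA means that codon is used well over twice as often as it would be if all six arginine codons were used equally, while the CG-containing codons fall far below one.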

    Replication cycle

    Virus infections start when viral particles bind to host surface cellular receptors. [136] Protein modeling experiments on the spike protein of the virus soon suggested that SARS‑CoV‑2 has sufficient affinity to the receptor angiotensin converting enzyme 2 (ACE2) on human cells to use them as a mechanism of cell entry. [137] By 22 January 2020, a group in China working with the full virus genome and a group in the United States using reverse genetics methods independently and experimentally demonstrated that ACE2 could act as the receptor for SARS‑CoV‑2. [17] [138] [139] [140] Studies have shown that SARS‑CoV‑2 has a higher affinity to human ACE2 than the original SARS virus. [128] [141] SARS‑CoV‑2 may also use basigin to assist in cell entry. [142]

    Initial spike protein priming by transmembrane protease, serine 2 (TMPRSS2) is essential for entry of SARS‑CoV‑2. [23] The host protein neuropilin 1 (NRP1) may aid the virus in host cell entry using ACE2. [143] After a SARS‑CoV‑2 virion attaches to a target cell, the cell's TMPRSS2 cuts open the spike protein of the virus, exposing a fusion peptide in the S2 subunit, and the host receptor ACE2. [130] After fusion, an endosome forms around the virion, separating it from the rest of the host cell. The virion escapes when the pH of the endosome drops or when cathepsin, a host cysteine protease, cleaves it. [130] The virion then releases RNA into the cell and forces the cell to produce and disseminate copies of the virus, which infect more cells. [144]

    SARS‑CoV‑2 produces at least three virulence factors that promote shedding of new virions from host cells and inhibit immune response. [127] Whether they include downregulation of ACE2, as seen in similar coronaviruses, remains under investigation (as of May 2020). [145]

    Based on the low variability exhibited among known SARS‑CoV‑2 genomic sequences, health authorities likely detected the virus within weeks of its emergence among the human population in late 2019. [19] [146] The earliest case of infection currently known is dated to 1 December 2019, although an earlier case could have occurred on 17 November 2019. [147] [148] The pandemic onset was estimated by tMRCA analysis to have occurred before the end of December 2019, but this statistical inference does not provide definitive proof of the time of origin. [77] The virus subsequently spread to all provinces of China and to more than 150 other countries across the world. [149] Human-to-human transmission of the virus has been confirmed in all these regions. [150] On 30 January 2020, SARS‑CoV‑2 was designated a Public Health Emergency of International Concern by the WHO, [151] [12] and on 11 March 2020 the WHO declared it a pandemic. [13] [152]

    Retrospective tests collected within the Chinese surveillance system revealed no clear indication of substantial unrecognized circulation of SARS‑CoV‑2 in Wuhan during the latter part of 2019. [153]

    A meta-analysis from November 2020 estimated the basic reproduction number (R0) of the virus to be between 2.39 and 3.44. [20] This means each infection from the virus is expected to result in 2.39 to 3.44 new infections when no members of the community are immune and no preventive measures are taken. The reproduction number may be higher in densely populated conditions such as those found on cruise ships. [154] Many forms of preventive efforts may be employed in specific circumstances to reduce the propagation of the virus. [112]
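As a rough illustration of what that range implies, the sketch below computes the expected number of new infections in successive generations of transmission seeded by a single case. It assumes a fully susceptible population, no preventive measures, and a constant reproduction number, which is a deliberate oversimplification; the function name is introduced here for illustration.

```python
def expected_new_infections(r0, generations):
    """Expected new infections per generation, starting from one case.

    Generation 0 is the index case; generation g contains r0 ** g expected
    infections under the idealized assumptions described above.
    """
    return [r0 ** g for g in range(generations + 1)]


R0_LOW, R0_HIGH = 2.39, 3.44  # range quoted in the text

for r0 in (R0_LOW, R0_HIGH):
    sizes = expected_new_infections(r0, 4)
    print(f"R0 = {r0}: " + ", ".join(f"{s:.1f}" for s in sizes))
```

Even the difference between the two ends of the estimated range compounds quickly over a handful of generations, which is one reason the estimate matters for planning.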

    There have been about 96,000 confirmed cases of infection in mainland China. [149] While the proportion of infections that result in confirmed cases or progress to diagnosable disease remains unclear, [155] one mathematical model estimated that 75,815 people were infected on 25 January 2020 in Wuhan alone, at a time when the number of confirmed cases worldwide was only 2,015. [156] Before 24 February 2020, over 95% of all deaths from COVID-19 worldwide had occurred in Hubei province, where Wuhan is located. [157] [158] As of 27 June 2021, the percentage had decreased to 0.082%. [149]

    As of 27 June 2021, there have been 180,948,026 total confirmed cases of SARS‑CoV‑2 infection in the ongoing pandemic. [149] The total number of deaths attributed to the virus is 3,919,969. [149]

    A New Map Catalogs the Effects of Coronavirus Mutations

    Scientists have analyzed every possible mutation to one key part of the coronavirus. The data could help guide vaccine and drug development and hint at how the virus might spread.

    As the novel coronavirus spreads, it’s picking up new mutations – for better and for worse.

    Now, Howard Hughes Medical Institute Investigator Jesse Bloom and his colleagues have cataloged how nearly 4,000 different mutations alter SARS-CoV-2’s ability to bind to human cells.

    Their data, publicly available online as an interactive map, is a new resource for researchers developing antiviral drugs and vaccines to fight COVID-19, the infectious disease caused by SARS-CoV-2. The work also reveals how individual mutations may affect the virus’s behavior, the team reports August 11, 2020 in the journal Cell.

    “We don’t know how the virus will evolve, but now we have a way to look at the mutations that can occur and see their effects,” says Bloom, a virologist at the Fred Hutchinson Cancer Research Center.

    Each time a virus replicates, it can pick up new genetic mutations. Many of these mutations have no effect on a virus’s behavior. Others could make the virus better or worse at infecting people. To what extent mutations might be making SARS-CoV-2 more dangerous has been an open – and controversial – question. Doctors and scientists have analyzed genetic differences in virus samples collected from COVID-19 patients around the world, hunting for clues to the disease’s spread. But until now, no one had comprehensively linked potential mutations to their functional effect on SARS-CoV-2.

    The new study focused on mutations to a key part of SARS-CoV-2 – its “spike protein.” This protein binds to a protein on human cells called ACE2, a necessary step for infection. Mutations in the spike protein could change how well SARS-CoV-2 sticks to – and thus infects – human cells.

    Bloom’s team bred yeast cells to display a fragment of the spike protein on their surface. This fragment, called the receptor binding domain, makes direct contact with ACE2. The researchers systematically created thousands of versions of the fragment – each with different mutations. Then they measured how well these mutated fragments stuck to ACE2. That let them assess how various mutations might affect the function of the binding domain.

    The data show that many possible mutations could make the virus bind to human cells more strongly. But those mutations don’t seem to be gaining a foothold in circulating versions of the virus.

    “This would suggest that there’s some sort of sweet spot, where if the virus can bind ACE2 pretty well, then it’s able to infect humans,” Bloom says. “Maybe there’s no evolutionary need for it to get better.”

    Other mutations made it harder for the spike protein to bind to cells or prevented the protein from properly folding into its final shape, the team found. Versions of the virus with these mutations might be less likely to gain a foothold because they can’t infect cells as effectively. The team’s targeted lab tests aren’t a perfect proxy for how mutations will affect the virus in the wild, where many other factors influence how effectively it can spread – but they’re a useful starting place.

    The data will also be valuable for researchers designing drugs and vaccines to fight COVID-19, says Tyler Starr, a postdoc in Bloom’s lab who led the project alongside graduate student Allie Greaney. Understanding the consequences of different mutations can guide the development of drugs that will continue to work as the virus changes over time. Plus, Starr says, “it’s becoming clear that antibodies that stick to this part of the virus are really good, protective antibodies that we would want to elicit with a vaccine.”

    Study coauthor Neil King and his lab at the University of Washington are already working on such vaccines. His team is designing artificial proteins that mimic components of the virus. As part of a vaccine, such proteins could potentially train people’s immune systems to produce antibodies that target the coronavirus. The researchers modify the artificial proteins to make them more stable and easier to produce in large quantities than the natural versions of the proteins.

    The data from Bloom’s team offers a roadmap to making those modifications. “Normally, when we’re trying to figure out how to make a protein better, we’re shooting in the dark,” says Daniel Ellis, a graduate student in King’s lab. “The information they’ve given us is kind of like a cheat sheet. It makes our lives amazingly easier.”


    Data collection and pre-processing

    On 5 January 2020, the complete genome sequence of SARS-CoV-2 was first released on GenBank (accession number: NC_045512.2) [5]. Since then, there has been a rapid accumulation of SARS-CoV-2 genome sequences. In this work, 45,494 complete, high-coverage genome sequences of SARS-CoV-2 strains from infected individuals around the world were downloaded from the GISAID database [2] as of 11 September 2020. Incomplete records and those without an exact submission date in GISAID were not considered. To rearrange the complete genome sequences according to the reference SARS-CoV-2 genome, multiple sequence alignment (MSA) was carried out using Clustal Omega [59] with default parameters.

    The amino acid sequences of NSP2, NSP12, NSP13, spike protein, ORF3a, ORF8, and nucleocapsid were downloaded from GenBank [60]. The three-dimensional (3D) structures of NSP12, spike protein, and ORF3a used in this work were extracted from the Protein Data Bank, denoted as 7BTF, 6VYB, and 6XDC, respectively. The 3D structures of NSP2, ORF8, NSP13, and nucleocapsid were generated by the I-TASSER model [61]. The 3D structure graphs were created using PyMOL [39].

    Single-nucleotide polymorphism calling

    Single-nucleotide polymorphism (SNP) calling measures the genetic variations between different members of a species. Applying SNP calling to the investigation of genotype changes during the transmission and evolution of SARS-CoV-2 is of great importance 9,10 . By analyzing the rearranged genome sequences, SNP profiles, which record all of the SNPs in terms of the nucleotide changes and their corresponding positions, can be constructed. The SNP profile of a given SARS-CoV-2 genome isolated from a COVID-19 patient captures all the differences from a complete reference genome sequence and can be considered as the genotype of the individual SARS-CoV-2.
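
    A minimal sketch of how such an SNP profile could be built from one aligned genome, assuming the reference and the sample come from the same multiple sequence alignment. The function name `snp_profile`, the choice to skip gaps and ambiguous bases, and the toy sequences are assumptions made for illustration, not details given in the text. Positions are alignment columns, counted from 1.

```python
def snp_profile(reference, aligned):
    """Return the set of (position, reference_base, sample_base) differences.

    Both strings must come from the same alignment, so they have equal length.
    Gaps ('-') and ambiguous bases ('N') are skipped; that is an assumption
    of this sketch, not a rule stated in the text.
    """
    if len(reference) != len(aligned):
        raise ValueError("sequences must be aligned to the same length")
    skip = {"-", "N"}
    profile = set()
    pairs = zip(reference.upper(), aligned.upper())
    for position, (ref_base, alt_base) in enumerate(pairs, start=1):
        if ref_base != alt_base and ref_base not in skip and alt_base not in skip:
            profile.add((position, ref_base, alt_base))
    return profile


# Toy sequences, invented for the example.
reference = "ATGCCGTA"
sample = "ATGTCGTG"
print(sorted(snp_profile(reference, sample)))
```

    The resulting set plays the role of the genotype described above: two genomes with the same set of differences from the reference are treated as the same variant.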

    Distance of SNP variants

    In this work, we use the Jaccard distance to measure the similarity between SNP variants and compare the difference between the SNP variant profiles of SARS-CoV-2 genomes.

    The Jaccard similarity coefficient of two sets A and B is defined as the size of their intersection divided by the size of their union 62 :

    $$J(A, B) = \frac{|A \cap B|}{|A \cup B|}$$

    The Jaccard distance of two sets A and B is one minus the Jaccard similarity coefficient, and it is a metric on the collection of all finite sets:

    $$d_J(A, B) = 1 - J(A, B) = \frac{|A \cup B| - |A \cap B|}{|A \cup B|}$$

    Therefore, the genetic distance of two genomes corresponds to the Jaccard distance of their SNP variants.
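
    The definition can be sketched in a few lines of Python, treating each genome as the set of positions where it differs from the reference. The function name and the example positions are illustrative assumptions, and returning 0 for two empty sets is a convention chosen here because the ratio is undefined in that case.

```python
def jaccard_distance(a, b):
    """Jaccard distance between two sets of SNP positions."""
    union = a | b
    if not union:
        return 0.0  # assumed convention: two empty profiles are identical
    return 1.0 - len(a & b) / len(union)


# Illustrative SNP position sets (not taken from the study's data).
genome_1 = {241, 3037, 14408, 23403}
genome_2 = {241, 3037, 14408, 23403, 28881}
genome_3 = {8782, 28144}

print(jaccard_distance(genome_1, genome_2))  # share 4 of 5 positions: small distance
print(jaccard_distance(genome_1, genome_3))  # share nothing: distance 1
```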

    In principle, the Jaccard distance of SNP variants takes account of the ordering of SNP positions, i.e., transmission trajectory, when an appropriate reference sample is selected. However, one may fail to identify the infection pathways from the mutual Jaccard distances of multiple samples. In this case, the dates of the sample collection provide key information. Additionally, clustering techniques, such as k-means, UMAP, and t-distributed stochastic neighbor embedding (t-SNE), enable us to characterize the spread of COVID-19 across communities.

    K-means clustering

    K-means clustering aims at partitioning a given data set $X = \{x_1, x_2, \cdots, x_n, \cdots, x_N\}$, $x_n \in \mathbb{R}^d$, into k clusters $\{C_1, C_2, \cdots, C_k\}$, $k \le N$, such that a specific clustering criterion is optimized. The standard k-means algorithm picks k points as cluster centers at random at the beginning and assigns each data point to its nearest cluster center. The k cluster centers are then updated iteratively by minimizing the within-cluster sum of squares (WCSS):

    $$\mathrm{WCSS} = \sum_{j=1}^{k} \sum_{x_n \in C_j} \| x_n - \mu_j \|_2^2$$

    where $\mu_j = \frac{1}{n_j} \sum_{x_n \in C_j} x_n$ is the mean of the points located in the jth cluster $C_j$ and $n_j$ is the number of points in $C_j$. Here, ∥ ⋅ ∥ 2 denotes the L2 distance.
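
    The procedure described above (random initial centers, nearest-center assignment, mean update) can be sketched in plain Python as follows. This is a bare-bones illustration rather than the implementation used in the study; the function names, the fixed random seed, the iteration cap, and the toy points are all assumptions.

```python
import random


def squared_l2(p, q):
    """Squared L2 distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def mean(points):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def kmeans(points, k, iterations=100, seed=0):
    """Return (centers, clusters, wcss) for a list of tuple points."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # k data points chosen at random
    for _ in range(iterations):
        # Assignment step: each point goes to its nearest center.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda j: squared_l2(p, centers[j]))
            clusters[nearest].append(p)
        # Update step: each center moves to the mean of its cluster
        # (an empty cluster keeps its previous center).
        new_centers = [mean(c) if c else centers[j] for j, c in enumerate(clusters)]
        if new_centers == centers:
            break
        centers = new_centers
    wcss = sum(squared_l2(p, centers[j]) for j, c in enumerate(clusters) for p in c)
    return centers, clusters, wcss


# Two well-separated toy groups, invented for the example.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centers, clusters, wcss = kmeans(points, k=2)
print(sorted(len(c) for c in clusters), round(wcss, 4))
```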

    The aforementioned algorithm offers an optimized partition into k clusters for a given k. However, it is more important to find the best number of clusters for the given set of SNP variants. Therefore, the Elbow method is employed. A set of WCSS values is calculated in the k-means clustering process by varying the number of clusters k, and the WCSS is then plotted against the number of clusters. The optimal number of clusters is located at the elbow of this plot. The WCSS measures the variability of the points within each cluster, which is influenced by the number of points N. Therefore, as the total number of points N increases, the value of WCSS becomes larger. Additionally, the performance of k-means clustering depends on the selection of the specific distance metric.
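
    The elbow is usually read off the plot by eye. One common way to automate it, shown here as an assumed heuristic rather than the procedure used in the study, is to pick the k whose point on the WCSS curve lies farthest from the straight line joining the first and last points. The WCSS values below are invented for illustration.

```python
def elbow_k(ks, wcss):
    """Return the k whose (k, WCSS) point is farthest from the chord
    joining the first and last points of the curve."""
    x1, y1 = ks[0], wcss[0]
    x2, y2 = ks[-1], wcss[-1]

    def chord_distance(x, y):
        # Perpendicular distance to the chord up to a constant factor;
        # the constant is dropped because only the maximum matters.
        return abs((y2 - y1) * x - (x2 - x1) * y + x2 * y1 - y2 * x1)

    best_k, _ = max(zip(ks, wcss), key=lambda point: chord_distance(*point))
    return best_k


# Made-up WCSS values for k = 1..6: a steep drop, then a plateau.
ks = [1, 2, 3, 4, 5, 6]
wcss = [1000.0, 400.0, 150.0, 120.0, 100.0, 90.0]
print(elbow_k(ks, wcss))
```

    Because the two axes have different scales, the result of this heuristic can shift if either axis is rescaled, so it is best treated as a starting point to check against the plot.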

    In this work, we implement k-means clustering with the Elbow method for analyzing the optimal number of the subtypes of SARS-CoV-2 SNP variants. The Jaccard distance-based representation is considered as the input features for the k-means clustering method. If we have a total of N SNP variants with respect to a reference genome in a SARS-CoV-2 sample, the location of the mutation sites for each SNP variant will be saved in the set Si, i = 1, 2, ⋯ , N. The Jaccard distance between two different sets (or samples) Si, Sj is denoted as dJ(Si, Sj). Therefore, the N × N Jaccard distance-based representation is

    $$D_J(i, j) = d_J(S_i, S_j), \quad i, j = 1, 2, \cdots, N.$$

    This representation is used in our k-means clustering.
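
    As a rough sketch of this representation, the matrix can be assembled from a list of SNP position sets, with row i serving as the feature vector of sample i. The function names and the three toy samples are assumptions for illustration.

```python
def jaccard_distance(a, b):
    """Jaccard distance between two sets (0.0 for two empty sets, by convention)."""
    union = a | b
    return 0.0 if not union else 1.0 - len(a & b) / len(union)


def jaccard_matrix(samples):
    """Return the symmetric N x N matrix of pairwise Jaccard distances."""
    n = len(samples)
    matrix = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d = jaccard_distance(samples[i], samples[j])
            matrix[i][j] = matrix[j][i] = d
    return matrix


# Three toy samples: the first two are identical, the third shares nothing.
samples = [{241, 23403}, {241, 23403}, {8782}]
for row in jaccard_matrix(samples):
    print(row)
```

    Each row can then be passed to a clustering routine as one N-dimensional point.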

    Topology-based machine learning prediction of protein–protein binding free energy changes following mutations

    The topology-based network tree (TopNetTree) model was developed by integrating a topological representation with a network tree (NetTree) to predict the binding free energy changes ΔΔG of protein–protein interactions (PPIs) following mutation 28 . TopNetTree is applied here to predict the binding free energy changes upon mutations that occurred on the RBD of SARS-CoV-2. Algebraic topology 30 is utilized to simplify the structural complexity of protein–protein complexes and embed vital biological information into topological invariants. NetTree integrates the advantages of convolutional neural networks (CNNs) and gradient-boosting trees (GBTs), such that the CNN is treated as an intermediate model that converts vectorized element- and site-specific persistent homology features into a higher-level abstract feature, and the GBT uses the upstream features and other biochemistry features for prediction. The performance test of tenfold cross-validation on the SKEMPI 2.0 dataset 63 was carried out using gradient boosted regression trees (GBRTs). On the SKEMPI 2.0 dataset, the model achieves a Pearson correlation coefficient (Rp) of 0.85 and a root mean square error (RMSE) of 1.11 kcal/mol 28 .

    Topology-based machine learning prediction of protein folding stability changes following mutation

    In this work, the prediction of protein folding stability changes upon mutation is carried out using a topology-based mutation predictor (TML-MP), which was introduced in the literature 27 . The folding stability change following mutation ΔΔG = ΔGw−ΔGm measures the difference between the folding free energies of the wild type ΔGw and the mutant type ΔGm. More specifically, a positive folding stability change ΔΔG indicates that the mutation will stabilize the structure of the protein and vice versa. The essential biological information is revealed by persistent homology 30 . The machine learning features are generated by the element-specific persistent homology and biochemistry techniques. The dataset includes 2648 mutation cases in 131 proteins provided by Dehouck et al. 64 , and the model is trained with gradient boosted regression trees (GBRTs). On this dataset, previous work reports a Pearson correlation coefficient (Rp) of 0.79 and a root mean square error (RMSE) of 0.91 kcal/mol 27 .
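
    For readers unfamiliar with the two reported metrics, the sketch below shows how Rp and RMSE are computed from paired predicted and experimental ΔΔG values. The numbers are invented for the example and have no connection to the datasets cited above; the function names are likewise illustrative.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    norm_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    norm_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (norm_x * norm_y)


def rmse(predicted, observed):
    """Root mean square error between predictions and observations."""
    squared = sum((p - o) ** 2 for p, o in zip(predicted, observed))
    return math.sqrt(squared / len(observed))


# Hypothetical ddG values in kcal/mol, made up for illustration only.
experimental = [-1.2, 0.4, -0.3, 1.1, -2.0]
predicted = [-0.9, 0.1, -0.5, 0.8, -1.4]
print(round(pearson_r(predicted, experimental), 3))
print(round(rmse(predicted, experimental), 3))
```

    A high Rp means the predictions rank mutations in nearly the same order as experiment, while RMSE reports the typical size of the error in the same units as ΔΔG.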

    Persistent homology is widely applied in a variety of practical feature generation problems 30 . It has also been successful in predicting protein folding stability changes upon mutation 27 . The key idea in TML-MP is the use of element-specific persistent homology (ESPH), which distinguishes different element types of biomolecules when building persistent homology barcodes. Commonly occurring protein element types include C, N, O, S, and H. Hydrogen and sulfur are excluded because hydrogen atoms are often absent from PDB data and sulfur atoms are too few in most proteins to be statistically important. Thus, C, N, and O elements are considered in the ESPH for protein characterization. Features are extracted from the different dimensions of persistent homology barcodes by dividing the barcodes into several equally spaced bins, which is called the binned barcode representation. Auxiliary features, such as geometry, electrostatics, amino acid type composition, and amino acid sequence, are included for machine learning training as well. In TML-MP, gradient boosted regression trees (GBRTs) 29 are employed to train on the dataset, chosen for their suitability to the size of the training dataset, their resistance to overfitting, their lack of any need for feature normalization, and their ability to capture nonlinear properties 27 .
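
    To make the binned barcode idea concrete, the sketch below turns a list of (birth, death) bars into per-bin counts. Counting births, deaths, and bars that persist through each bin is an assumption about what such a representation typically records; the exact features used by TML-MP are defined in its reference, and the function name, bin range, and toy bars here are hypothetical.

```python
def binned_barcode(bars, r_max, n_bins):
    """Count births, deaths and persisting bars in equally spaced bins on [0, r_max).

    bars is a list of (birth, death) pairs; death may be float('inf') for a
    bar that never dies within the filtration.
    """
    width = r_max / n_bins
    births = [0] * n_bins
    deaths = [0] * n_bins
    alive = [0] * n_bins
    for birth, death in bars:
        for i in range(n_bins):
            lo, hi = i * width, (i + 1) * width
            if lo <= birth < hi:
                births[i] += 1
            if lo <= death < hi:
                deaths[i] += 1
            if birth < hi and death > lo:  # the bar overlaps this bin
                alive[i] += 1
    return births, deaths, alive


# Toy barcode with three bars, invented for the example.
bars = [(0.0, 1.5), (0.5, 3.2), (2.0, float("inf"))]
births, deaths, alive = binned_barcode(bars, r_max=4.0, n_bins=4)
print(births, deaths, alive)
```

    Concatenating the three count vectors gives a fixed-length feature vector regardless of how many bars the barcode contains, which is what makes the representation convenient for tree-based learners.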

    Graph network models

    Graph networks can model interactions, and their strengths, between pairs of units in molecules. These approaches are employed to understand mutation-induced structural changes. The biological and chemical properties are measured by comparing descriptors on different networks. In this work, the network consists of a set S of Cα atoms from every residue of the protein structure except the target mutation residue, such that a Cα atom is included if it is within 16 Å of any atom of the target mutation residue. The total atom set T is defined as the atoms (C, N, and O) of the target residue together with the Cα atoms of the network set S. Moreover, two vertices are connected in the network if their distance is <8 Å. Thus the adjacency matrix A can be defined as a matrix of 0s and 1s such that A(i, j) = 0 if the ith and jth atoms are disconnected and A(i, j) = 1 if the ith and jth atoms are connected. The two graph network models employed in this work are described below.
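
    The construction above can be sketched with plain coordinate tuples: one helper selects the Cα atoms within 16 Å of the target residue, and another builds the 0/1 adjacency matrix with the 8 Å cutoff. The function names, the strict inequalities, the zero diagonal (no self-loops), and the toy coordinates are assumptions made for illustration.

```python
import math


def select_network_atoms(ca_coords, target_atoms, radius=16.0):
    """Keep the C-alpha atoms lying within `radius` of any target-residue atom."""
    return [
        ca for ca in ca_coords
        if any(math.dist(ca, atom) < radius for atom in target_atoms)
    ]


def adjacency_matrix(coords, cutoff=8.0):
    """Return A with A[i][j] = 1 if atoms i and j are closer than `cutoff`."""
    n = len(coords)
    adj = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            if math.dist(coords[i], coords[j]) < cutoff:
                adj[i][j] = adj[j][i] = 1
    return adj


# Toy coordinates in angstroms, invented for the example.
target = [(0.0, 0.0, 0.0)]
ca_atoms = [(5.0, 0.0, 0.0), (12.0, 0.0, 0.0), (30.0, 0.0, 0.0)]
network = select_network_atoms(ca_atoms, target)
print(network)
print(adjacency_matrix(target + network))
```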

    Flexibility-rigidity index

    FRI was introduced to study the flexibility of protein molecules 25,26 . The single residue molecular rigidity index measures its influence on the set S which is given as

    where α = w or m stands for the wild type w or the mutant type m, NS is the number of Cα atoms of the set S, and NT is the number of atoms in total atom set T. Here, ∥ri − rj∥ is the distance between atoms at ri and rj.

    The molecular FRI rigidity Rη measures the topological connectivity and the geometric compactness of the network consisting of Cα at each residue and the heavy atoms involved in the mutant.
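
    A rough sketch of a rigidity index of this kind is given below. It sums a distance-decaying kernel over all pairs of one atom from S and one atom from T. The generalized exponential kernel exp(−(r/η)^κ) is one of the kernels used in the FRI literature, but its use here, the default κ = 2, and the function name are assumptions of the sketch, since the formula itself is not reproduced above.

```python
import math


def rigidity_index(s_atoms, t_atoms, eta, kappa=2.0):
    """Sum an assumed generalized exponential kernel over all (S, T) atom pairs.

    s_atoms and t_atoms are lists of coordinate tuples; eta is the kernel's
    length scale. Both the kernel form and kappa are assumptions.
    """
    return sum(
        math.exp(-((math.dist(ri, rj) / eta) ** kappa))
        for ri in s_atoms
        for rj in t_atoms
    )


# Toy coordinates in angstroms, invented for the example.
s_atoms = [(0.0, 0.0, 0.0), (4.0, 0.0, 0.0)]
t_atoms = [(0.0, 3.0, 0.0)]
print(rigidity_index(s_atoms, t_atoms, eta=8.0))
```

    Comparing the value computed on the wild-type structure with the value computed on the mutant structure gives a simple measure of how much the mutation changes local packing.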

    Average subgraph centrality

    Average subgraph centrality is built on the exponential of the adjacency matrix, E = e^A, where A is the aforementioned adjacency matrix. The subgraph centrality of a node is the weighted sum of closed walks of all lengths starting and ending at that node 11,31 . Thus the average subgraph centrality reveals the average participation rate of each vertex in all subgraphs and network motifs, which is given as

    where IT is the index set of the mutation residue.
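
    A minimal sketch of this quantity follows, assuming that the average is taken as the mean of the diagonal entries E(i, i) over the index set; that normalization is an assumption, since the formula is not reproduced above. The matrix exponential is approximated with a truncated Taylor series to keep the example dependency-free; for real structures a library routine such as `scipy.linalg.expm` would be the more robust choice.

```python
def mat_mul(a, b):
    """Multiply two square matrices given as lists of lists."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def mat_exp(a, terms=30):
    """Approximate e^A by the truncated Taylor series sum of A^m / m!."""
    n = len(a)
    result = [[float(i == j) for j in range(n)] for i in range(n)]  # identity
    term = [row[:] for row in result]
    for m in range(1, terms):
        term = [[x / m for x in row] for row in mat_mul(term, a)]  # A^m / m!
        result = [[r + t for r, t in zip(r_row, t_row)]
                  for r_row, t_row in zip(result, term)]
    return result


def average_subgraph_centrality(adj, index_set):
    """Mean of the diagonal entries of e^A over the given vertex indices."""
    e = mat_exp(adj)
    return sum(e[i][i] for i in index_set) / len(index_set)


# Path graph on three vertices; vertex 1 is the middle one.
adj = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
print(average_subgraph_centrality(adj, [1]))
print(average_subgraph_centrality(adj, [0, 2]))
```

    The middle vertex takes part in more closed walks than the end vertices, so its centrality is larger.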

    Reporting summary

    Further information on research design is available in the Nature Research Reporting Summary linked to this article.


    We thank the following individuals for productive feedback on this manuscript: Uri Alon, Niv Antonovsky, David Baltimore, Rachel Banks, Arren Bar Even, Naama Barkai, Molly Bassette, Menalu Berihoon, Biana Bernshtein, Pamela Bjorkman, Cecilia Blikstad, Julia Borden, Bill Burkholder, Griffin Chure, Lillian Cohn, Bernadeta Dadonaite, Emmie De wit, Ron Diskin, Ana Duarte, Tal Einav, Avigdor Eldar, Elizabeth Fischer, William Gelbart, Alon Gildoni, Britt Glausinger, Shmuel Gleizer, Dani Gluck, Soichi Hirokawa, Greg Huber, Christina Hueschen, Amit Huppert, Shalev Itzkovitz, Martin Jonikas, Leeat Keren, Gilmor Keshet, Marc Kirschner, Roy Kishony, Amy Kistler, Liad Levi, Sergei Maslov, Adi Millman, Amir Milo, Elad Noor, Gal Ofir, Alan Perelson, Steve Quake, Itai Raveh, Andrew Rennekamp, Tom Roeschinger, Daniel Rokhsar, Alex Rubinsteyn, Gabriel Salmon, Maya Schuldiner, Eran Segal, Ron Sender, Alex Sigal, Maya Shamir, Arik Shams, Mike Springer, Adi Stern, Noam Stern-Ginossar, Lubert Stryer, Dan Tawfik, Boris Veytsman, Aryeh Wides, Tali Wiesel, Anat Yarden, Yossi Yovel, Dudi Zeevi, Mushon Zer Aviv, and Alexander Zlokapa.

    The COVID-19 Virus Is Mutating. What Does That Mean for Vaccines?

    As we enter the second year of living with the new coronavirus SARS-CoV-2, the virus is celebrating its invasion of the world’s population with yet more mutated forms that help it to spread more easily from person to person.

    One, first detected in the U.K. in December, has already raised alarms about whether the COVID-19 virus is now escaping from the protection that vaccines just being rolled out now might provide. The variant has also been found in the U.S. Already, U.K. officials have tightened lockdowns in England, Scotland and Wales, and over the holidays, more than 40 countries banned travelers from the region in an effort to keep the new strain from spreading to other parts of the world. Health officials are also concerned about a different strain found in South Africa that could become more resistant to vaccine protection. This variant includes a few mutations in key areas that antibodies, generated by the vaccine, target.

    Exactly how the new strains affect people who are infected, such as whether they develop more severe symptoms and whether the strains can lead to more hospitalizations and deaths, isn’t clear yet. But scientists are ramping up efforts to genetically sequence more samples from infected patients to learn how widespread they are. So far, there are enough hints to worry public health experts.

    The fact that SARS-CoV-2 is morphing into potentially more dangerous strains isn’t a surprise. Viruses mutate. They must, in order to make up for a critical omission in their makeup. Unlike other pathogens such as bacteria, fungi and parasites, viruses have none of the machinery needed to make more copies of themselves, so they cannot reproduce on their own. They rely fully on hijacking the reproductive tools of the cells they infect in order to generate their progeny.

    Being such freeloaders means they can’t be picky about their hosts, and must make do with whatever cellular equipment they can find. That generally leads to a flurry of mistakes when they sneak in to copy their genetic code; as a result, viruses have among the sloppiest genomes among microbes. The bulk of these mistakes are meaningless (false starts and dead ends) and have no impact on humans. But as more mistakes are made, the chances that one will make the virus better at slipping from one person to another, or pumping out more copies of itself, increase dramatically.

    Fortunately, coronaviruses in particular generate these genetic mistakes more slowly than their cousins like influenza and HIV: scientists sequencing thousands of samples of SARS-CoV-2 from COVID-19 patients found that the virus makes about two errors a month. Still, that’s led so far to about 12,000 known mutations in SARS-CoV-2, according to GISAID, a public genetic database of the virus. And some, by sheer chance, end up creating a greater public health threat.

    Just a few months after SARS-CoV-2 was identified in China last January, for example, a new variant, called D614G, superseded the original strain. This new version became the dominant one that infected much of Europe, North America and South America. Virus experts are still uncertain over how important D614G, named for where the mutation is located on the viral genome, has been when it comes to human disease. But so far, blood samples from people infected with the strain show that the virus can still be neutralized by the immune system. That means that the current vaccines being rolled out around the world can also protect against this strain, since the shots were designed to generate similar immune responses in the body. “If the public is concerned about whether vaccine immunity is able to cover this variant, the answer is going to be yes,” says Ralph Baric, professor of epidemiology, microbiology and immunology at University of North Carolina Chapel Hill, who has studied coronaviruses for several decades.

    The so-called N501Y variant (some health officials are also calling it B.1.1.7), which was recently detected in the U.K. and the U.S., may be a different story. Based on lab and animal studies, researchers believe this strain can spread more easily between people. That’s not a surprise, says Baric, since to this point, most of the world’s population has not been exposed to SARS-CoV-2. That means that for now, the strains that are better at hopping from one person to another will have the advantage in spreading their genetic code. But as more people get vaccinated and protected against the virus, that may change. “Selection conditions for virus evolution right now favor rapid transmission,” he says. “But as more and more of the human population become immune, the selection pressures change. And we don’t know which direction the virus will go.”

    In a worst case scenario, those changes could push the virus to become resistant to the immune cells generated by currently available vaccines. The current mutants are the virus’s first attempts to maximize its co-opting of the human population as viral copying machines. But they could also serve as a backbone on which SARS-CoV-2 builds a more sustained and stable takeover. Like a prisoner planning a jailbreak, the virus is biding its time and chipping away at the defenses the human immune system has constructed. For example, the virus may mutate in a way that changes the makeup of its spike proteins, the part of the virus that the immune system’s antibodies attempt to stick to in order to neutralize it. And that one mutation may not be enough to protect the virus from those antibodies. But two or three might.

    The biggest concern right now, says Baric, is that there are already two or three variants of SARS-CoV-2 that have mutations in just such places, “where additional mutations can make a more significant change in terms of transmissibility or virulence.”

    The best way to monitor that evolution is by sequencing the virus in as many infected people as possible, as often as possible. Only by tracking how SARS-CoV-2 is changing can scientists hope to stay ahead of the most dangerous and potentially more lethal mutations. In November, the U.S. Centers for Disease Control and Prevention (CDC) launched a sequencing program that will ask each state to send 10 samples every other week from people who have been infected, in order to more consistently track any changes in SARS-CoV-2’s genome. But it’s a voluntary program. “It’s still not a national effort, it’s voluntary, and there is no dedicated funding for it,” says Baric. “Come on, we’re in the 21st century. Let’s enter the 21st century.”

    Without substantial federal funding dedicated specifically to sequencing SARS-CoV-2 genomes, most of the work in the U.S. is currently being done by scientists at academic centers like the Broad Institute of MIT and Harvard and the University of Washington. Since early last year, the CDC has been working to better characterize SARS-CoV-2 viruses from patient samples in partnership with some of these academic labs, as well as state and local health departments and commercial diagnostic companies, in the SARS-CoV-2 Sequencing for Public Health Emergency Response, Epidemiology and Surveillance (SPHERES) consortium. “If we sequence one out of 200 cases then we’re missing a lot of information,” says Baric. “If we’re sequencing about 20% of cases, then we might start to see something and we would be in the ball game to find new variants. We probably could be doing a better job of that here in the U.S.”

    Other countries are also working on this effort. The U.K. has long been a leader in genetic sequencing, and likely because of their efforts were able to identify the new variant relatively quickly after it emerged. Globally, scientists have also been posting genetic sequences from SARS-CoV-2 to the public GISAID database.

    Dr. Anthony Fauci, director of the National Institute of Allergy and Infectious Diseases, and chief medical advisor to President-elect Joe Biden, says that his teams are sequencing and studying the new variants to better understand what effect they might have on disease, how close they might be to causing more severe illness and, more importantly as more people get vaccinated, whether the new variants can escape the protection of the vaccines we know work today.

    The good news is that if the mutant strains do become resistant to the current vaccines, the mRNA technology behind the Pfizer-BioNTech and Moderna vaccines should enable the companies to develop new shots without the same lengthy development and testing that the originals required. “The mRNA platform is eminently flexible to turn around,” says Fauci. If a new vaccine were needed, it would be treated by the Food and Drug Administration as a strain change in the virus target, similar to how flu shots are modified every year. “You could get that out pretty quickly,” says Fauci, after showing in tests with a few dozen people that the new vaccine produced satisfactory amounts of antibodies and protection against the mutant virus.

    Tracking every change the virus makes will be critical to buying the time needed to shift vaccine targets before SARS-CoV-2 leaps too far ahead for scientists to catch up. “We are taking [these variants] seriously and will be following them closely to make sure we don’t miss anything,” says Fauci.

    SARS-CoV-2 is mutating slowly, and that's a good thing

    Viruses evolve over time, undergoing genetic changes, or mutations, in their quest to survive. Some viruses produce many variations, others only a few. Fortunately, SARS-CoV-2, the novel coronavirus that causes COVID-19, is among the latter. This is good news for scientists trying to create an effective vaccine against it.

    "The virus has had very few genetic changes since it emerged in late 2019," says Peter Thielen, a molecular biologist at the Johns Hopkins Applied Physics Laboratory and JHU Doctor of Engineering candidate, who, with colleagues from other areas of the Hopkins research community, has been sequencing the viral genome to better understand its makeup. "Designing vaccines and therapeutics for a single strain is much more straightforward than a virus that is changing quickly."

    SARS-CoV-2 first appeared in China in December before rapidly spreading around the world. In a scant few months, it has sickened more than 7.25 million people worldwide, killing more than 410,000 and counting. In many affected countries, including the United States, the pandemic has prompted extreme mitigation measures to contain it, including lockdowns, extensive quarantines, and social distancing and mask-wearing—restrictions many experts believe will not entirely disappear anytime soon.

    "It isn't going to be possible for us to truly be able to return to normal until we have a vaccine," says Winston Timp, assistant professor of biomedical engineering in the Whiting School of Engineering, and who, along with Professor of Medicine Stuart Ray, is leading the Hopkins viral genomics effort. "The low mutation rate of the virus means it should be possible to generate a successful vaccine," he says, adding it also could boost efforts to develop potential treatments for the disease.

    Coronaviruses—of which there are hundreds, most of them occurring in animals—typically mutate more slowly than many other viruses. Influenza, for example, mutates quickly, which is why people must be inoculated annually against changing flu strains.

    Data from SARS-CoV-2 samples the researchers have examined from the Baltimore and Washington region are similar to those from other parts of the world. "So far, the genetic changes accumulating as the virus spreads are not resulting in different strains of the virus," Thielen says.

    This is important because a successful vaccine strategy must account for mutations in order to provide broad protection. "Influenza has a lot of very unique ways of changing over a short period of time, and it does so on local and global scales every flu season," Thielen says. "SARS-CoV-2 is almost the opposite so far—it is changing slowly, and because there is no existing immunity to the virus, it doesn't have any evolutionary pressure to change as it spreads through the population."

    Timp agrees. "It's hard to make the right vaccine for flu because there are different strains that circulate every year," he says. "With SARS-CoV-2, there are some small mutations, but nothing to lead us to suspect that if you have immunity here in Maryland that you won't have it anywhere else."

    Scientists are focusing on the "spike" protein of the virus, the part that docks with human cells and allows entry. "A vaccine that blocks the virus's ability to infect a cell would be highly effective, since there would be no ability for the virus to generate an active infection to cause or spread disease," Thielen says. "There are very small regions of the spike protein that make direct contact with the receptor on a human cell, and these are the highest likelihood targets for vaccine developers. To date, no changes have been observed to these parts of the virus in any of the more than 20,000 samples that have been sequenced globally."

    When a virus replicates—makes copies of its genome within a cell—it uses an enzyme called a polymerase. Some viruses have very accurate versions of these, while others, such as HIV or influenza, do not, Thielen says. "Coronavirus polymerases have something we call 'proofreading activity,' which is exactly like it sounds," he says. "Once a genome is replicated, the enzyme will identify mistakes it has made and correct them. Sometimes it still gets things wrong, however, and small changes can occur."

    JHU scientists say they have seen fewer than two dozen mutations between the current versions they are studying and original viral isolates from China, which is a very small number. "This means a vaccine will probably work against all of them," Timp says.

    It's still unclear, however, how long immunity will last against this virus, regardless of whether it arises from having recovered from illness or through vaccination.

    "The first step is to understand the immune responses required to clear the virus, or protect against it," says Heba Mostafa, an assistant professor of pathology in the School of Medicine who is providing viral material to APL researchers as well as sequencing viral samples in her own lab. She also developed a coronavirus screening test with JHU microbiologist Karen Carroll. "With a stable virus, reinfection will be less likely, or could be less aggressive. This is all helpful, especially in designing a vaccine."

    It will be important to monitor mutations that accumulate over time, especially as immunity increases in the population, Thielen says. To that end, APL scientists are working as part of a large group across JHU to characterize the virus's genomic diversity and share information through a multidisciplinary effort, the COVID-19 Research Response Program.

    In addition to APL, the initiative includes participants from Johns Hopkins Hospital, the Bloomberg School of Public Health, the Whiting School of Engineering's Department of Computer Science and Department of Biomedical Engineering, which is shared by the engineering school and the School of Medicine. "Data generated from our sequencing efforts gets uploaded to global data repositories, so that all researchers can study the virus at the same time," Thielen says.

    The researchers are aware that most people won't feel safe until there is a vaccine. Thielen has friends who have become sick, or who have lost loved ones. And he has a family at home—including two young children—to protect. Nevertheless, based on evidence about the virus generated thus far, he believes a vaccine will be forthcoming.

    "As time goes on, it is likely that all of us will have a personal connection to someone who has been impacted by the virus," he says. "What we observe in the local and global data provides excellent reassurance that we are on the right track."

    How Genetic Mutations Turned the Coronavirus Deadly

    Long before the first reports of a new flu-like illness in China’s Hubei province, a bat, or perhaps a whole colony of them, was flying around the region carrying a new type of coronavirus. At the time, the virus was not yet dangerous to humans. Then, around the end of November, it underwent a slight additional mutation, evolving into the viral strain we now call SARS-CoV-2. With that flip of viral RNA, so began the COVID-19 pandemic.

    As in almost every outbreak, the mutations that set off this global crisis went undetected at first, even though the family of coronaviruses was already known to cause a variety of human diseases. “These viruses have long been understudied and have not been given the attention or funding they have deserved,” Craig Wilen, a virologist at Yale University, told me.

    A bat coronavirus caused the SARS outbreak that terrified much of the world and killed 774 people in 2002 and 2003 before it was contained. Since then, there have been regular flare-ups of Middle East respiratory syndrome, or MERS, caused by another bat coronavirus that passes through camels; since 2012, it has killed 884 people. Most research on potential pandemics nevertheless continued to focus on influenza viruses, such as bird flu, because they carry a significant annual death toll. COVID-19 is exposing the dangers of such a single-minded approach.

    DO NO HARM: Bats (like the horseshoe bat seen here, common in China) carry dozens of viruses in the coronavirus family. Most of the viruses live in harmony with their hosts and cause no harm. Taylor, Stoffberg, Monadjem, Schoeman, Bayliss & Cotterill / Wikimedia

    A few scientists tried to sound the alarm. In a 2015 study, epidemiologist Ralph Baric and his colleagues at the University of North Carolina analyzed the genomes of bat coronaviruses and warned, “Our work suggests a potential risk of SARS-CoV re-emergence from viruses currently circulating in bat populations.” 1 A second paper from the same group the next year warned that another SARS-like disease from bat coronaviruses was “poised for human emergence.” 2

    Bats are well known as a reservoir for potential new human diseases. The animals carry dozens, perhaps hundreds, of members of the coronavirus family. Most of those viruses are part of the bats’ normal microbiome, living in harmony with their hosts and causing no harm. But coronaviruses, like all forms of life, accumulate random genetic changes as they reproduce. Occasionally the mutations allow the viruses to infect other animals (including humans) and to score the big win in natural selection: producing ever-more descendants.

    A win for the virus, that is. For us, not so much.

    Two critical mutations in the bat coronavirus set us on the path to the COVID-19 pandemic. The first modified the structure of the spike-like structures that protrude from the virus. Those protrusions give the virus its family name: “Corona” means “crown” in Latin. The altered spikes allow the virus to latch onto a protein called ACE2, which lines the respiratory tract. 3 The related virus responsible for the SARS epidemic employs a similar infection mechanism, as does another bat coronavirus that causes common colds in humans.

    The second key mutation allowed the coronavirus to grow a protein dagger called a furin, which can slice through other proteins to make the virus bind tightly to throat and lung cells. 4 The furin protein is what made the COVID-19 virus so infectious and deadly to humans. In that sense, SARS-CoV-2 is similar to anthrax and various bird flus that also rely on furins to carry out their infection.

    Those mutations could have occurred while the virus was circulating in bats. It’s also possible that one or both mutations could have erupted in a person who was infected by an earlier version of the virus, but who showed no symptoms. Most likely, there was an intermediate host between bats and humans. The pangolin, a creature prized in China for its meat and for the alleged medicinal value of its scales, is a strong candidate. Epidemiologists suspect that someone bought a pangolin at one of the “wet markets” in Wuhan and got infected consuming it, setting off the chain of transmission.

    For now, this story of the emergence of COVID-19 is hypothesis. “I suspect we will never know,” Wilen said, because we don’t have enough samples of the viruses in bats, people, and potential intermediate hosts. But the narrative agrees with the ways that most pathogens transfer between species. This process of species-jumping, called zoonosis, accounts for 60 percent of human disease and 75 percent of newly emerging infections.

    The narrative agrees, too, with detailed new studies of the virus using the tools of genetic epidemiology. 5 Nature provides a convenient clock that aids the researchers trying to reconstruct the evolutionary history and development of viruses such as the one that causes COVID-19. Every genome mutates at a predictable background rate. Most mutations yield no significant effects on the biology of the organism, but the tick-tick-tick of genetic changes allows scientists to reconstruct the order in which different strains or species diverged from each other. With that information in hand, they can then construct a phylogenetic tree, a branching diagram that depicts the evolutionary relationships.
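To make the "clock" idea concrete, here is a minimal sketch assuming a strict molecular clock, which is a simplification of what real phylogenetic tools do. The function names, the toy sequences, and the substitution rate are all hypothetical and chosen only for illustration; they are not measured values for SARS-CoV-2.

```python
def count_differences(seq_a: str, seq_b: str) -> int:
    """Count positions at which two aligned sequences differ (Hamming distance)."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(1 for a, b in zip(seq_a, seq_b) if a != b)


def estimate_years_since_split(differences: int,
                               genome_length: int,
                               subs_per_site_per_year: float) -> float:
    """Estimate time since two lineages shared an ancestor.

    Assumes a strict clock: both lineages accumulate changes at the same
    constant rate, so the observed differences are spread over 2 * t years.
    """
    if genome_length <= 0 or subs_per_site_per_year <= 0:
        raise ValueError("genome_length and rate must be positive")
    return differences / (2 * subs_per_site_per_year * genome_length)


# Toy, made-up aligned fragments (not real viral sequences).
sample_a = "ACGTACGTAC"
sample_b = "ACGTTCGTAA"

diffs = count_differences(sample_a, sample_b)
# Illustrative rate only; a real analysis would estimate it from dated samples.
years = estimate_years_since_split(diffs, len(sample_a), subs_per_site_per_year=0.01)
print(diffs, years)
```

The point is only that a count of differences plus an assumed rate gives a rough divergence time. Real tools also correct for repeated changes at the same site and for rate variation between lineages.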

    On January 10, a group of Chinese and Australian researchers published the first sequence of the (then-unnamed) new virus from the Wuhan outbreak on GenBank, a publicly accessible genetic data source. Other sequences from Wuhan quickly followed, allowing Trevor Bedford at the Fred Hutchinson Cancer Research Center in Seattle to start constructing a phylogenetic tree of the virus that we now call SARS-CoV-2.

    If the outbreak was caused by several viruses that had emerged from multiple sources, there would be much genetic diversity. In reality, Bedford saw little. “From very early on, it was clear that the nCoV genomes lacked the expected genetic diversity that would occur with repeated zoonotic events from a diverse animal reservoir,” he said. His research progressed rapidly. On January 13, the first cases of COVID-19 appeared outside of China, in Thailand. Adding those two cases into his database, Bedford concluded that the world was facing an outbreak of a single, deadly new disease.
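The diversity argument can be sketched in a few lines. This is a toy illustration under stated assumptions, not Bedford's actual method: the helper name `mean_pairwise_differences` and both sets of sequences are invented for the example. It shows why genomes descended from one recent introduction differ at few sites, while genomes from several independent animal sources would differ at many.

```python
from itertools import combinations


def mean_pairwise_differences(sequences: list[str]) -> float:
    """Average number of differing sites over all pairs of aligned sequences."""
    pairs = list(combinations(sequences, 2))
    if not pairs:
        raise ValueError("need at least two sequences")
    total = sum(
        sum(1 for a, b in zip(first, second) if a != b)
        for first, second in pairs
    )
    return total / len(pairs)


# Made-up fragments: one cluster of near-identical genomes...
single_source = ["ACGTACGT", "ACGTACGA", "ACGTACGT"]
# ...versus genomes that would have come from unrelated animal sources.
multi_source = ["ACGTACGT", "TCGAACCT", "AGGTTCGA"]

low = mean_pairwise_differences(single_source)
high = mean_pairwise_differences(multi_source)
print(low < high)
```

A low average like the first one is what you would expect if every sampled genome traces back to a single jump into humans.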

    “The conclusion of sustained human-to-human spread became difficult to ignore on January 17, when novel coronavirus genomes from two Thai travel cases that reported no market exposure showed the same, limited genetic diversity,” he said. “This genomic data represented one of the first and strongest indications of sustained, epidemic spread. As this became clear to me, I spent the week of Jan 20 alerting every public health official I know.”

    Almost no one listened. “I remember being flabbergasted by the continued narrow case definition and restricted testing,” Bedford recalled. The United States did not declare a state of emergency until March 13.

    Bedford is now trying to assist with the next stages in containing and combating the COVID-19 pandemic. He and his team run a project called Nextstrain that tracks multiple pathogens, including flu, tuberculosis, West Nile virus, and now SARS-CoV-2. As of March 20, the phylogenetic tree for SARS-CoV-2 included 855 genome sequences of different strains. The genetic sequences catalogued in Nextstrain and other genomic databases will allow infectious disease specialists to monitor any worrisome changes to the virus.

    So far, there is no sign that the virus is becoming any more deadly or infectious—although neither is it becoming any less so. Near-stasis is typical for a new human pathogen. From an evolutionary perspective, SARS-CoV-2 is already doing a great job of reproducing itself. It’s therefore feeling little evolutionary pressure to change. The viruses will just keep doing their thing until they are contained, or until they have killed their hosts.

    “Although none of the COVID-19 mutations look particularly interesting, there are a few things to watch for,” Bedford said. He is particularly vigilant in monitoring any mutations in the spike protein, which would have big implications for vaccine development.

    Flu season illustrates the potential impact of viral mutations. Influenza viruses manage to infect people continually because the proteins on their surfaces keep changing. Your immune system fails to recognize the viruses’ reshuffled proteins, so you need a new flu shot every year. Fortunately, SARS-CoV-2 is quite a bit different. Flu viruses have a much smaller set of genes, and they circulate constantly between multiple species—pigs, birds, and people. Both attributes make mutations more likely in influenza than in coronaviruses, which contain one of the biggest RNA viral genomes and which jump across species barriers far less often.

    A better model for how COVID-19 might change comes from the biggest pandemic of modern times: HIV/AIDS. Because of some of the most elegant genomic epidemiology work ever performed, we know that the ancestor of HIV had lived harmlessly in monkeys before jumping to chimps, probably through an ill-advised meal. 6 In chimps, the virus evolved into something close to modern HIV. Around 1931, AIDS first appeared in a human in southwestern Cameroon, possibly when someone butchering a chimpanzee for food got cut. HIV stayed rare among humans in rural Africa until entering the city of Kinshasa in what is now the Democratic Republic of the Congo. Once it reached an urban setting, transmission of the virus exploded, first there and then around the world.

    When HIV was discovered in 1981, many people feared it would mutate into something even more deadly, possibly an airborne form. No one knew then that the virus had already been around for four decades without changing much. Another four decades later, HIV has not become any more deadly or infectious, yet it has still infected 75 million people and killed 32 million. Those are sobering numbers as we confront another disease that has jumped from animals to humans. Even if the COVID-19 virus proves to be slow-changing, like HIV, it may take years to get fully under control. And many more outbreaks of species-crossing disease are likely due to population growth and human encroachment on wild environments.

    Will we learn our lessons? If you are a virologist, you will surely have a far easier time getting funds to study coronaviruses—for a while, at least. Almost every country is talking about beefing up its preparedness for another pandemic. And soon after the current outbreak began, the Chinese government shut down the sales of live animals for food and medicinal purposes in the loosely regulated wet markets.

    Then again, China did the same after the SARS outbreak, when epidemiologists deduced that the virus started in bats and spread through a cat-like animal called a civet, also sold in the wet markets. Over the years, tradition and corruption quietly allowed the markets to reopen.

    Robert Bazell is adjunct professor of Molecular, Cellular, and Developmental Biology at Yale. For 38 years, he was chief science correspondent for NBC News.

    1. Menachery, V.D., et al. A SARS-like cluster of circulating bat coronaviruses shows potential for human emergence. Nature Medicine 21, 1508-1513 (2015).

    2. Menachery, V.D., et al. SARS-like WIV1-CoV poised for human emergence. Proceedings of the National Academy of Sciences 113, 3048-3053 (2016).

    3. Zhang, T., Wu, Q., & Zhang, Z. Probable pangolin origin of SARS-CoV-2 associated with the COVID-19 outbreak. Current Biology (2020). doi:10.1016/j.cub.2020.03.022

    4. Wang, Q., et al. A unique protease cleavage site predicted in the spike protein of the novel pneumonia coronavirus (2019-nCoV) potentially related to viral transmissibility. Virologica Sinica (2020).

    Scientists: the Coronavirus Has Already Mutated Into 30+ Strains

    New research suggests that SARS-CoV-2, the virus that causes COVID-19, could have already mutated into more than 30 separate strains.

    The study found that some strains can generate vastly higher viral loads than others, the South China Morning Post reports, making them far more dangerous.

    One strain, for example, appeared to generate 270 times the viral load of the least potent strain, meaning the infected person produces 270 times as much of the virus.

    That makes it far harder to fight off infections and facilitates spread, hypothetically explaining why some cases of COVID-19 are significantly worse than others.



    “Sars-CoV-2 has acquired mutations capable of substantially changing its pathogenicity,” Li Lanjuan, one of China’s most prolific epidemiologists and a researcher at Zhejiang University, wrote in the study, which was shared online on the preprint server medRxiv on Sunday but hasn’t yet been vetted by the peer-review process or published in an academic journal.

    In the research, Li isolated different strains and, under laboratory conditions, measured how quickly and effectively they could infect and kill off host cells.

    The paper also traced different strains to outbreaks in different parts of the world, finding that the versions of SARS-CoV-2 that spread across Europe and New York were far more efficient killers than the one that hit other parts of the U.S. such as Washington State.

    “Drug and vaccine development, while urgent, need to take the impact of these accumulating mutations… into account to avoid potential pitfalls,” Li and her colleagues wrote.

    What isn’t known

    There are still many mysteries about this virus and coronaviruses in general – the nuances of how they cause disease, the way they interact with proteins inside the cell, the structure of the proteins that form new viruses and how some of the basic virus-copying machinery works.

    Another unknown is how COVID-19 will respond to changes in the seasons. The flu tends to follow cold weather, both in the northern and southern hemispheres. Some other human coronaviruses spread at a low level year-round, but then seem to peak in the spring. But nobody really knows for sure why these viruses vary with the seasons.

    What is amazing so far in this outbreak is all the good science that has come out so quickly. The research community learned about structures of the virus spike protein and the ACE2 protein with part of the spike protein attached just a little over a month after the genetic sequence became available. I spent my first 20 or so years working on coronaviruses without the benefit of either. This bodes well for better understanding, preventing and treating COVID-19.
