WO2014124301A1

WO2014124301A1 - Self-assembling protein nanostructures

Info

Publication number: WO2014124301A1
Application number: PCT/US2014/015371
Authority: WO
Inventors: David Baker; Neil King; Jacob BALE; William Sheffler
Original assignee: University Of Washington Through Its Center For Commercialization
Priority date: 2013-02-07
Filing date: 2014-02-07
Publication date: 2014-08-14
Also published as: US10248758B2; US20240038331A1; US20190341124A1; US20150356240A1

Abstract

Synthetic nanostructures, proteins that are useful, for example, in making synthetic nanostructures, and methods for designing such synthetic nanostructures are disclosed herein.

Description

Self-Assembling Protein Nanostructures

CROSS-REFERENCE TO RELATED APPLICATIONS

The present application claims priority to U.S. Provisional Patent Application No. 61/762,194 entitled "General Method for Designing Multi-Component Protein Materials" filed February 7, 2013, which is entirely incorporated by reference herein for all purposes.

BACKGROUND OF THE INVENTION

Unless otherwise indicated herein, the materials described in this section are not prior art to the claims in this application and are not admitted to be prior art by inclusion in this section.

Molecular self-assembly is an elegant and powerful approach to patterning matter on the atomic scale. Recent years have seen advances in the development of self-assembling biomaterials, particularly those composed of nucleic acids. DNA has been used to create, for example, nanoscale shapes and patterns, molecular containers, and three-dimensional macroscopic crystals. Methods for designing self-assembling proteins have progressed more slowly, yet the functional and physical properties of proteins make them attractive as building blocks for the development of advanced functional materials.

In any self-assembling structure, interactions between the subunits are required to drive assembly. Previous approaches to designing self-assembling proteins have satisfied this requirement in various ways, including the use of relatively simple and well-understood coiled-coil and helical bundle interactions, engineered disulfide bonds, chemical cross-links, metal-mediated interactions, templating by non-biological materials in conjunction with computational protein interface design, or genetic fusion of multiple protein domains or fragments which naturally self-associate.

In some scenarios, computational modeling and design of molecules can aid researchers in investigating the molecules. For example, computational protein design can provide valuable reagents for biomedical and biochemical research, identify sequences compatible with a given protein backbone, and design protein folds.

SUMMARY

In one aspect, isolated nanostructures are provided, comprising

(a) a plurality of first proteins that self-interact to form a first multimeric substructure comprising at least one axis of rotational symmetry;

(b) a plurality of second proteins that self-interact to form a second multimeric substructure comprising at least one axis of rotational symmetry;

wherein multiple copies of the first multimeric substructure and the second multimeric substructure interact with each other at one or more symmetrically repeated, non-natural, non-covalent protein-protein interfaces that orient the first multimeric substructures and the second multimeric substructures such that their symmetry axes are aligned with symmetry axes of the same kind in a designated mathematical symmetry group.

The nanostructures of the invention may, for example, have a mathematical symmetry group selected from the group consisting of tetrahedral point group symmetry, octahedral point group symmetry, and icosahedral point group symmetry. In one embodiment, the first multimeric substructure comprises a dimer, trimer, tetramer, or pentamer of the first protein, and the second multimeric substructure comprises a dimer or trimer of the second protein. In another embodiment, the first multimeric substructure comprises a trimer of the first protein, and the second multimeric substructure comprises a dimer of the second protein. In a further embodiment, the first multimeric substructure comprises a trimer of the first protein, and the second multimeric substructure comprises a trimer of the second protein. In another embodiment, the first protein and the second protein may be between 30 and 250 amino acids in length. In a still further embodiment, each symmetrically repeated instance of the non-natural, non-covalent protein-protein interface between the first multimeric substructure and the second multimeric substructure buries between 1000 and 2000 Å² of solvent-accessible surface area (SASA) on the first multimeric substructure and the second multimeric substructure. In another embodiment, each symmetrically repeated, non-natural, non-covalent protein-protein interface between the first multimeric substructure and the second multimeric substructure has a shape complementarity value between 0.5 and 0.8. In a further embodiment, at least 50% of the atomic contacts comprising each symmetrically repeated, non-natural, non-covalent protein-protein interface between the first multimeric substructure and the second multimeric substructure are formed from amino acid residues residing in elements of alpha helix and/or beta strand secondary structure. Exemplary first and second proteins are disclosed herein.

In another aspect, the invention provides isolated proteins comprising an amino acid sequence selected from the group consisting of SEQ ID NOS: 1-40, multimeric assemblies comprising a plurality of identical isolated protein monomers, recombinant nucleic acids encoding the isolated proteins, recombinant expression vectors comprising the recombinant nucleic acids operatively linked to a promoter, and recombinant host cells comprising the recombinant expression vectors of the invention, as well as kits comprising one or more of the compositions of the invention.

In a further aspect, a method is provided. A computing device generates a plurality of representations of a first protein building block. The computing device generates a plurality of representations of a second protein building block, where the first protein building block differs from the second protein building block. The computing device generates an arrangement of the plurality of representations of the first protein building block and the plurality of representations of the second protein building block according to symmetric operations of a designated mathematical symmetry group. The computing device computationally determines a docked configuration of the arrangement of the plurality of representations of the first protein building block and the plurality of representations of the second protein building block by at least generating at least one interface for each protein building block of the arrangement that is suitable for computational protein-protein interface design. The computing device computationally modifies amino acid sequences of the plurality of representations of the first protein building block and the plurality of representations of the second protein building block in the docked configuration to specify a plurality of representations of protein-protein interfaces. The plurality of representations of protein-protein interfaces include one or more representations of protein-protein interfaces between the first protein building block and the second protein building block that are energetically favorable to drive self-assembly of the protein building blocks comprising the modified amino acid sequences to the docked configuration. The computing device generates an output that is based on at least one representation of the group consisting of: a representation of the docked configuration, at least one representation of the plurality of representations of the protein-protein interfaces, and at least one representation of the representations of the first protein building block and the representations of the second protein building block having modified amino acid sequences.

BRIEF DESCRIPTION OF THE DRAWINGS

Figure 1 is a flow chart of an example method.

Figures 2A through 2E each show an example designed protein interface.

Figure 3 depicts example protein architectures.

Figures 4A-4F show a method for building protein architectures.

Figures 5A, 5B, and 5C show three different symmetric fold tree representations using an example two-component architecture with D3 symmetry.

Figure 6A shows SEC chromatograms of designed pairs of proteins and wild-type oligomeric proteins.

Figure 6B shows a native PAGE analysis of in vitro-assembled T32-28 and T33-15 in cell lysates.

Figures 6C-6G show respective native PAGE analyses of in vitro-assembled T32-28, T33-09, T33-15, T33-21, and T33-28 in cell lysates.

Figures 7A and 7B each show electron micrographs of designed two-component protein nanomaterials.

Figure 8 shows computational design models and crystal structures of designed two-component protein nanomaterials.

Figure 9 is a block diagram of an example computing network.

Figure 10A is a block diagram of an example computing device.

Figure 10B depicts an example cloud-based server system.

Figures 11A-11D show the amino acid sequence of an exemplary protein (T32-28A) of the invention (SEQ ID NOs: 1, 11, 21, 31).

Figures 12A-12D show the amino acid sequence of an exemplary protein (T32-28B) of the invention (SEQ ID NOs: 2, 12, 22, 32).

Figures 13A-13B show the amino acid sequence of an exemplary protein (T33-09A) of the invention (SEQ ID NOs: 3, 13, 23, 33).

Figures 14A-14C show the amino acid sequence of an exemplary protein (T33-09B) of the invention (SEQ ID NOs: 4, 14, 24, 34).

Figures 15A-15D show the amino acid sequence of an exemplary protein (T33-15A) of the invention (SEQ ID NOs: 5, 15, 25, 35).

Figures 16A-16C show the amino acid sequence of an exemplary protein (T33-15B) of the invention (SEQ ID NOs: 6, 16, 26, 36).

Figures 17A-17D show the amino acid sequence of an exemplary protein (T33-21A) of the invention (SEQ ID NOs: 7, 17, 27, 37).

Figures 18A-18C show the amino acid sequence of an exemplary protein (T33-21B) of the invention (SEQ ID NOs: 8, 18, 28, 38).

Figures 19A-19C show the amino acid sequence of an exemplary protein (T33-28A) of the invention (SEQ ID NOs: 9, 19, 29, 39).

Figures 20A-20C show the amino acid sequence of an exemplary protein (T33-28B) of the invention (SEQ ID NOs: 10, 20, 30, 40).

DETAILED DESCRIPTION

Natural protein assemblies are most often held together by many weak, noncovalent interactions which together form large, highly complementary, low energy protein-protein interfaces. Such interfaces spontaneously self-assemble and allow precise definition of the orientation of subunits relative to one another, which is critical for obtaining the desired material with high accuracy. Designing assemblies with these properties has been difficult due to the complexities of modeling protein structures and energetics.

A general computational method for designing self-assembling protein materials is disclosed, involving symmetrical docking of protein building blocks in a target symmetric architecture.

In some embodiments of the general computational method, the protein building blocks can include two or more distinct protein building blocks. Then, classes of nanomaterials can be constructed from docked configurations of the two or more distinct protein building blocks. Using multiple distinct protein building blocks can provide greater control over the assembly process and new functions. The nanomaterials can be engineered to encapsulate biomolecules of interest and deliver them to the cytosol of cultured cells to demonstrate their potential as next-generation targeted delivery vehicles.

The methods described herein can be used to design nanomaterials that combine several features of fundamental importance for their use in therapeutic applications. The nanomaterials can be designed with atomic-level accuracy that 1) underlies protein structure-function relationships, 2) is critical for the design of function, and 3) is currently inaccessible to other classes of materials such as synthetic nanoparticles or liposomes. Modular materials can be derived from these methods that enable the facile development of a variety of sophisticated functionalities. The nanomaterials can be "smart" materials that can respond in vitro or in vivo to therapeutically relevant environmental cues such as changes in pH.

Multi-component materials can enable design of larger cage-like assemblies with greater internal loading capacities, control over the initiation of assembly through mixing of separately purified components, and independent functionalization of each component.

These three features are important for many potential downstream applications, including targeted delivery, vaccine design, and biosynthetic pathway engineering.

Software can simultaneously model multiple distinct subunit types in all of the symmetry groups relevant to protein structure, including helical, point group, layer group, and space group symmetries. The software can contain functionality for designing symmetric nanostructures, efficiently calculating scores, and sampling symmetric degrees of freedom.

Example Operations

Figure 1 is a flow chart of an example method 100. Method 100 can begin at block 110, where a computing device, such as computing device 1000 described below in the context of at least Figure 10A, can generate a plurality of representations of a first protein building block. At block 120, the computing device can generate a plurality of representations of a second protein building block, where the first protein building block differs from the second protein building block. In some embodiments, each of the first and second protein building blocks can include a synthetic polypeptide.

At block 130, the computing device can generate an arrangement of the plurality of representations of the first protein building block and the plurality of representations of the second protein building block according to symmetric operations of a designated mathematical symmetry group. In some embodiments, each of the plurality of the first and second protein building blocks can include a protein that shares an axis of symmetry with the designated mathematical symmetry group. In other embodiments, the designated mathematical symmetry group can conform to a symmetry selected from tetrahedral point group symmetry, octahedral point group symmetry, and icosahedral point group symmetry. In still other embodiments, generating the arrangement of the plurality of representations of the first protein building block and the plurality of representations of the second protein building block can include computationally aligning symmetry axes of the first protein building block and the second protein building block with at least one axis in the designated mathematical symmetry group.

At block 140, the computing device can computationally determine a docked configuration of the arrangement of the plurality of representations of the first protein building block and the plurality of representations of the second protein building block by at least generating at least one interface for each protein building block of the arrangement that is suitable for computational protein-protein interface design.

In some embodiments, determining a docked configuration of the plurality of the first and second protein building blocks can additionally include sampling rotational degrees of freedom and translational degrees of freedom for each of the first and second protein building blocks. In particular of these embodiments, sampling the rotational degrees of freedom and the translational degrees of freedom can include: selecting a rotational value for a rotational degree of freedom for each of the first and second protein building blocks; selecting a translational value for a translational degree of freedom for each of the first and second protein building blocks; determining a sampled representation of the first protein building block based on the selected rotational value for the first protein building block and the selected translational value for the first protein building block; determining a sampled representation of the second protein building block based on the selected rotational value for the second protein building block and the selected translational value for the second protein building block; and determining a designability measure for the docked configuration using the sampled representation of the first protein building block and the sampled representation of the second protein building block.
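The sampling steps listed above can be sketched as a simple grid search. This is a minimal illustration under stated assumptions, not the disclosed implementation: building blocks are reduced to lists of 3D points, each block has one rotational and one translational degree of freedom about a unit-length axis through the origin, and the function names (`rotate_about_axis`, `place`, `sample_docked_configurations`) and the caller-supplied `measure` callback are hypothetical.

```python
import math
from itertools import product


def rotate_about_axis(point, axis, angle_deg):
    """Rotate a 3D point about a unit-length axis through the origin (Rodrigues' formula)."""
    x, y, z = point
    kx, ky, kz = axis
    theta = math.radians(angle_deg)
    c, s = math.cos(theta), math.sin(theta)
    dot = kx * x + ky * y + kz * z
    cross = (ky * z - kz * y, kz * x - kx * z, kx * y - ky * x)
    return (
        x * c + cross[0] * s + kx * dot * (1 - c),
        y * c + cross[1] * s + ky * dot * (1 - c),
        z * c + cross[2] * s + kz * dot * (1 - c),
    )


def place(coords, axis, angle_deg, shift):
    """Rotate every point about the axis, then slide it along the axis by shift."""
    placed = []
    for point in coords:
        x, y, z = rotate_about_axis(point, axis, angle_deg)
        placed.append((x + shift * axis[0], y + shift * axis[1], z + shift * axis[2]))
    return placed


def sample_docked_configurations(block1, axis1, block2, axis2, angles, shifts, measure):
    """Try every combination of rotational and translational values for both blocks.

    measure is a hypothetical callback that scores a pair of sampled
    representations (higher is better). Returns (score, (angle1, shift1,
    angle2, shift2)) tuples sorted from best to worst.
    """
    scored = []
    for a1, d1, a2, d2 in product(angles, shifts, angles, shifts):
        sampled1 = place(block1, axis1, a1, d1)
        sampled2 = place(block2, axis2, a2, d2)
        scored.append((measure(sampled1, sampled2), (a1, d1, a2, d2)))
    return sorted(scored, key=lambda item: item[0], reverse=True)
```

An exhaustive grid is used here only because it is easy to follow; the source does not specify the step sizes or the search strategy.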

In more particular of these embodiments, determining the designability measure of the docked configuration can include determining a number of beta carbon contacts within a specified distance threshold between the sampled representation of the first protein building block and the sampled representation of the second protein building block in the docked configuration based on the values of the selected rotational and translational degrees of freedom.
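A contact-count designability measure of this kind can be sketched in a few lines. The sketch assumes that each sampled representation has already been reduced to a list of beta carbon coordinates in angstroms; the function name and the default threshold of 10.0 are illustrative assumptions, since the source does not state a threshold value.

```python
import math


def count_cb_contacts(cb_first, cb_second, threshold=10.0):
    """Count beta carbon pairs, one from each building block, within threshold.

    cb_first and cb_second are lists of (x, y, z) beta carbon coordinates.
    The default threshold is an assumed placeholder, not a value from the source.
    """
    return sum(
        1
        for a in cb_first
        for b in cb_second
        if math.dist(a, b) <= threshold
    )
```

A higher count suggests a larger candidate interface between the two sampled building blocks. This all-pairs loop scales with the product of the two list lengths, which is acceptable for a sketch but would likely be replaced by a spatial index in practice.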

At block 150, the computing device can computationally modify amino acid sequences of the plurality of representations of the first protein building block and the plurality of representations of the second protein building block in the docked configuration to specify a plurality of representations of protein-protein interfaces. The plurality of representations of protein-protein interfaces can include one or more representations of protein-protein interfaces between the first protein building block and the second protein building block that are energetically favorable to drive self-assembly of the protein building blocks comprising the modified amino acid sequences to the docked configuration using the computing device.

In some embodiments, computationally modifying the amino acid sequences of the plurality of representations of the first protein building block and the plurality of representations of the second protein building block can include selecting a selected representation of one or more amino acid sequences associated with a representation of at least one protein building block of the plurality of representations of the first protein building block and the plurality of representations of the second protein building block. In particular of these embodiments, computationally modifying the amino acid sequences of the plurality of representations of the first protein building block and the plurality of representations of the second protein building block can include computationally mutating an amino acid sequence of the selected representation of one or more amino acid sequences. In other embodiments, computationally modifying the amino acid sequences of the plurality of representations of the first protein building block and the plurality of representations of the second protein building block can include evaluating an energy of an amino acid mutation using a computational score function.
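The mutate-and-score idea described above can be illustrated with a toy greedy search. Everything here is a hypothetical stand-in: sequences are plain strings, `greedy_interface_design` is not the disclosed design algorithm, and `toy_score` is a placeholder for a real computational score function (it simply rewards hydrophobic residues and has no physical meaning).

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def mutate(sequence, position, residue):
    """Return a copy of sequence with one position replaced by residue."""
    return sequence[:position] + residue + sequence[position + 1:]


def greedy_interface_design(sequence, interface_positions, score):
    """Visit each interface position once and keep any mutation that lowers the energy.

    score is a caller-supplied function mapping a sequence to an energy
    (lower is more favorable).
    """
    best_seq, best_energy = sequence, score(sequence)
    for position in interface_positions:
        for residue in AMINO_ACIDS:
            candidate = mutate(best_seq, position, residue)
            energy = score(candidate)
            if energy < best_energy:
                best_seq, best_energy = candidate, energy
    return best_seq, best_energy


def toy_score(sequence):
    """Hypothetical placeholder energy: each hydrophobic residue lowers it by 1."""
    return -sum(1.0 for residue in sequence if residue in "AFILMVW")
```

Only the positions passed as `interface_positions` are ever changed, which mirrors the idea of restricting sequence changes to residues near the designed interface.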

At block 160, the computing device can generate an output of the computing device that is based on at least one representation of the group consisting of: a representation of the docked configuration, at least one representation of the plurality of representations of the protein-protein interfaces, and at least one representation of the representations of the first protein building block and the representations of the second protein building block having modified amino acid sequences.

Figure 3 depicts example protein architectures that can be designed using the method, in accordance with an example embodiment. Figure 3 shows ten example architectures arranged roughly into two columns, with architectures labeled O3, O42, I32, T33, and I52 in the left column, and labeled O32, O43, T32, I53, and T3 in the right column. The right column of architectures also includes a reference line indicating a reference distance of 15 nanometers within Figure 3.

An architecture is labeled in Figure 3 as either Xy or Xyz, where X is a letter, and y and z are numbers. The letter X represents the symmetry of the architecture: T for tetrahedral symmetry, O for octahedral symmetry, and I for icosahedral symmetry. The number y indicates a number of monomers in a first building block used to build the architecture and the number z indicates a number of monomers in a second building block used to build the architecture. If the number z is not present, then only one type of building block is used to build the structure. For example, the O3 architecture at the top of the left column of Figure 3 is an octahedral structure made up of one trimeric building block, the T33 architecture toward the bottom of the left column of Figure 3 is a tetrahedral structure made up of two different types of trimer building blocks, and the I52 architecture at the bottom of the left column of Figure 3 is an icosahedral structure made up of two multimer building blocks: a pentamer and a dimer.

Figure 3 also indicates how many of each type of building block are utilized to build the structure. In the T32 architecture shown in the middle of the right column of Figure 3 and in Figure 4F, the structure is assembled from 4 trimers aligned along the tetrahedral three-fold symmetry axes and 6 dimers aligned along the two-fold symmetry axes. The T33 architecture (also shown in Figures 4A-4E) is constructed from four copies of one trimer and four copies of a second trimer, with the three-fold symmetry axis of each trimer aligned at opposite poles of each tetrahedral three-fold symmetry axis.
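The Xy / Xyz naming scheme and the building block counts given above can be captured in a short parser. The sketch relies on a standard group-theory fact rather than on anything stated in the source: the rotation groups of the tetrahedron, octahedron, and icosahedron have 12, 24, and 60 elements, so a cyclic n-mer aligned on a matching axis appears in (group order / n) copies. This reproduces the 4 trimers and 6 dimers of T32 and the 4 + 4 trimers of T33. The function name and the returned dictionary layout are hypothetical, and dihedral labels such as D32 are deliberately not handled.

```python
import re

ROTATION_ORDER = {"T": 12, "O": 24, "I": 60}
SYMMETRY_NAME = {"T": "tetrahedral", "O": "octahedral", "I": "icosahedral"}
# Rotational axes available in each point group.
ALLOWED_FOLDS = {"T": {2, 3}, "O": {2, 3, 4}, "I": {2, 3, 5}}


def parse_architecture(label):
    """Parse a label such as 'T32' into its symmetry and building block counts."""
    match = re.fullmatch(r"([TOI])(\d)(\d)?", label)
    if match is None:
        raise ValueError(f"not an Xy or Xyz architecture label: {label!r}")
    letter, y, z = match.groups()
    folds = [int(y)] + ([int(z)] if z is not None else [])
    for n in folds:
        if n not in ALLOWED_FOLDS[letter]:
            raise ValueError(f"{SYMMETRY_NAME[letter]} symmetry has no {n}-fold axis")
    order = ROTATION_ORDER[letter]
    return {
        "symmetry": SYMMETRY_NAME[letter],
        "building_blocks": [{"monomers": n, "copies": order // n} for n in folds],
        "total_subunits": order * len(folds),
    }
```

For both T32 and T33 the total comes to 24 subunits, twelve of each protein.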

Accurate design of coassembling multi-component protein nanomaterials

The self-assembly of proteins into highly ordered nanoscale architectures is a hallmark of biological systems. Compared to homooligomers, assemblies formed from multiple distinct components offer a wider range of possible structures due to their combinatorial nature, greater control over the timing of assembly, and enhanced modularity through independently addressable building blocks. Disclosed is a general computational method for designing protein nanomaterials in which two distinct types of subunits coassemble to a target symmetric architecture. The information necessary to direct assembly is encoded in designed protein-protein interfaces that precisely define the relative orientations of the building blocks. This method has been used to design five novel 24-subunit cage-like protein nanomaterials in two distinct symmetric architectures. The designed pairs of proteins self-assemble to form highly homogeneous nanocages when co-expressed in E. coli, and the assembly of two of the materials can be initiated upon demand by mixing independently produced components. Crystal structures of the materials are in close agreement with the computational design models at the level of both the designed interfaces and the overall architectures. The accuracy of the method and the universe of two-component materials that it makes accessible pave the way for the design of functional protein nanomaterials tailored to specific applications.

The level of structural complexity available to self-assembled nanomaterials generally increases with the number of unique molecular components used to construct the material. DNA nanotechnology provides an extreme example of this phenomenon: strategies have been developed for encoding specific and directional interactions between hundreds of distinct DNA strands, allowing the construction of nanoscale objects with essentially arbitrary structures.

Here the structural and functional range of designed protein materials is expanded with a general computational method for designing two-component coassembling protein nanomaterials with high accuracy.

Software can be used to model multi-component systems; that is, systems consisting of multiple distinct protein subunits, each associated with a distinct symmetry group. Within the updated framework we disclose herein, each distinct subunit can be modified independently of one another, with the changes propagated to all symmetrically related copies.

Figures 4A-4F show a method for building two-component symmetric protein nanostructures. Figures 4A-4E illustrate this method using a dual tetrahedral architecture (designated in Figure 3 and Figure 4A as T33) as an example. In the T33 architecture, four copies each of two distinct, naturally trimeric building blocks are aligned at opposite poles of the three-fold symmetry axes of a tetrahedron as shown in Figure 4A. This alignment places one set of building blocks at the vertices of the T33 tetrahedron and the second set of building blocks at the centers of the faces of the T33 tetrahedron, totaling twelve subunits of each protein.

Each trimeric building block is allowed to rotate around and translate along its three-fold symmetry axis as indicated in Figure 4B; other rigid body moves are disallowed because they would lead to asymmetry. These four degrees of freedom can be systematically explored during docking to identify configurations with interfaces that are suitable for design, as shown in Figure 4C. The docking score function can maximize the number of inter-building block neighbors per residue and can favor residues in highly anchored regions of the protein structure that are unlikely to change conformation upon mutation of surface side chains as shown in Figure 4D. A design algorithm, such as but not limited to the RosettaDesign™ algorithm, can be used to sample the identities and configurations of the side chains near the inter-building block interface, generating interfaces with features resembling those found in natural protein assemblies such as well-packed hydrophobic cores surrounded by polar rims, such as shown in Figure 4E. The end result is a pair of new amino acid sequences, one for each building block, predicted to stabilize the modeled interface and therefore spontaneously drive assembly to the specific target configuration.

These docking and design procedures were implemented in software to enable the simultaneous modeling of multiple distinct symmetrically arranged protein components. In particular, different components can be moved independently of one another while maintaining their internal degrees of freedom. This enables the design strategy described above to be generalized to a wide variety of symmetric architectures in which multiple symmetric building blocks are combined in geometrically specific ways. Combining even two symmetry elements can give rise to a large number of distinct symmetric architectures with a range of possible morphologies, including those with dihedral and cubic point group symmetries, as well as helical, layer, and space group symmetries.

As shown in the non-limiting examples that follow, the designed interfaces can drive assembly of cage-like nanomaterials that closely match the computational design models: the backbone RMSD over all 24 subunits in each material ranges from 1.0 to 2.6 Å. The precise control over interface geometry offered by our method thus enables the design of two-component protein nanomaterials with diverse nanoscale features such as surfaces, pores, and internal volumes with high accuracy.
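For readers unfamiliar with the metric, the root-mean-square deviation (RMSD) quoted above can be computed as follows. This is a minimal sketch that assumes the design model and the crystal structure have already been superimposed and that their backbone atoms are listed in the same order; the superposition step itself is not shown, and the source does not describe which alignment procedure was used.

```python
import math


def rmsd(model, reference):
    """Root-mean-square deviation between two equally long lists of (x, y, z) points.

    Assumes the two coordinate sets are already superimposed and matched
    atom for atom; the result is in the same unit as the inputs.
    """
    if len(model) != len(reference) or not model:
        raise ValueError("coordinate lists must be non-empty and of equal length")
    squared = sum(
        (a - b) ** 2
        for p, q in zip(model, reference)
        for a, b in zip(p, q)
    )
    return math.sqrt(squared / len(model))
```

With coordinates in angstroms, a value of 1.0 means that matched atoms are displaced by 1 Å in the root-mean-square sense.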

The method described here can provide a general route to designing multi-component protein-based nanomaterials and molecular machines with programmable structures and functions. The capability to design highly homogeneous protein nanostructures with atomic-level accuracy and controllable assembly can open new opportunities in targeted drug delivery, vaccine design, plasmonics, and other applications that can benefit from the precise patterning of matter on the sub-nanometer to hundred nanometer scale.

Multi-component symmetric modeling

The herein-described methods and techniques are not limited to use of RosettaDesign™, the Rosetta™ software suite, or any other specific software package. For example, other software programs could be used in conjunction with this method to model multi-component symmetric protein nanostructures. As will be understood by those of skill in the art, the implementation of the design methods of the invention described below is non-limiting, and the methods are in no way limited to the implementation disclosed herein.

As an example embodiment, the Rosetta™ software package was modified for multi-component symmetric modeling. Rosetta's symmetric modeling framework was updated to enable modeling of multi-component systems; that is, systems consisting of multiple distinct protein subunits, each associated with a distinct symmetry group. Within this updated framework, each distinct subunit can be modified independently of one another, with the changes propagated to all symmetrically related copies. All of Rosetta's design and modeling functionality accessible to one-component symmetries is now accessible for multi-component symmetries as well, including efficient scoring calculations and sampling of symmetric degrees of freedom. These changes to Rosetta's symmetry machinery are illustrated in Figures 5A-5C and described briefly below. In both the one-component examples shown in Figures 5A and 5B and the multi-component example of Figure 5C, the symmetry of a given target architecture is passed to Rosetta in the form of a symmetry definition file.

Figures 5A, 5B, and 5C show three different symmetric fold tree representations of an example D32 architecture. In each of Figures 5A, 5B, and 5C, the D32 architecture is made up of two trimeric building blocks, each shown in a relatively dark gray color, and three dimeric building blocks each shown in a relatively light gray color, arranged with D3 point group symmetry. Following the strategy described above, arranging the building blocks with D3 point group symmetry is accomplished by aligning the three-fold symmetry axes of the trimeric building blocks along the three-fold axis in D3 point group symmetry and the two-fold symmetry axes of the dimeric building blocks along the two-fold axes in D3 point group symmetry. In the examples shown in Figures 5A-5C, rigid body degrees of freedom (RB DOFs) are shown using gray lines. Figures 5A and 5B show examples with one-component symmetry. Figure 5A shows RB DOF JD3 connecting the master dimer subunit to the master trimer subunit. RB DOF JD3 is a child of RB DOFs JD1 and JD2 controlling the master dimer subunit; in this case the positions of the trimeric subunits depend on the positions of the dimeric subunits. That is, the RB DOFs of the trimeric building blocks shown in Figure 5A depend on the RB DOFs of the dimeric building blocks.

In the example shown in Figure 5B, RB DOF JT3 connecting the master trimer subunit to the master dimer subunit is a child of RB DOFs JT1 and JT2 controlling the master trimer subunit. Then, in Figure 5B, the positions, and thus the RB DOFs, of the dimeric subunits depend on the positions, and thus the RB DOFs, of the trimeric subunits. Figure 5C shows an example with multi-component symmetry. With multi-component symmetric modeling the RB DOFs controlling the master trimer subunit (JT1 and JT2) are independent of the RB DOFs controlling the master dimer subunit (JD1 and JD2); in the example of Figure 5C, the positions of the dimeric subunits do not depend on the positions of the trimeric subunits and vice versa.

In some embodiments, only a single connection was allowed from the symmetric fold tree into the asymmetric unit. Thus, when modeling a system with multiple distinct symmetric components, only one such component could have its internal DOFs preserved. For example, in the D32 system shown in Figures 5A and 5B, if only one connection into the asymmetric unit is allowed, then one must choose to connect the two subunits in the asymmetric unit to either the three-fold axis (middle panel) or the two-fold axis (left panel). If both are connected to the three-fold axis, rotations around this connection will correctly preserve the internal DOFs of the trimer, but disrupt the internal DOFs of the dimer such as shown in Figure 5B.

Other embodiments can enable multiple connections from the symmetric fold tree into the asymmetric unit, as the multi-component extension of symmetric modeling in Rosetta allows the asymmetric unit to be broken down into substructures that are independently managed by the symmetric fold tree. Using a multi-component symmetric fold tree in our D32 example allows the trimer to connect directly to the three-fold axis and the dimer to connect directly to the two-fold axis, so that any motions allowed by the symmetric architecture preserve the internal DOFs of both building blocks, as shown in Figure 5C.
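As a toy illustration of the dependency described above, the following Python sketch contrasts a one-component fold tree in the style of Figure 5B with a multi-component fold tree in the style of Figure 5C. It is not Rosetta code: only translations are modeled, the three-fold axis is assumed to lie along z and one two-fold axis along x, and the function names and numeric values are hypothetical.

```python
def one_component_tree(trimer_shift_z, dimer_offset):
    """Figure 5B style: the dimer jump is a child of the trimer jumps."""
    trimer = (0.0, 0.0, trimer_shift_z)
    # The child inherits every displacement applied to its parent.
    dimer = tuple(t + d for t, d in zip(trimer, dimer_offset))
    return trimer, dimer


def multi_component_tree(trimer_shift_z, dimer_shift_x):
    """Figure 5C style: each component connects directly to its own axis."""
    trimer = (0.0, 0.0, trimer_shift_z)  # slides along the three-fold (z, assumed)
    dimer = (dimer_shift_x, 0.0, 0.0)    # slides along a two-fold (x, assumed)
    return trimer, dimer


# Hypothetical displacements in arbitrary length units.
_, dimer_5b = one_component_tree(5.0, (20.0, 0.0, 0.0))
_, dimer_5c = multi_component_tree(5.0, 20.0)
print(dimer_5b)  # z is nonzero: the dimer has been dragged off its two-fold axis
print(dimer_5c)  # z is zero: the dimer stays on its two-fold axis
```

In the one-component case, translating the trimer moves the dimer away from the two-fold axis, which is the disruption of internal DOFs that the multi-component fold tree avoids.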

In both the one-component and multi-component cases, the symmetry of a given target architecture (e.g., the T32 and T33 architectures) can be passed to Rosetta in the form of a symmetry definition file. The multi-component symmetry definition file syntax can be largely the same as the one-component syntax, with the additional requirement that the jumps connecting the protein subunits to the fold tree must specify which component is connected to each symmetry element.

Herein we define a symmetric architecture as a conceptual representation of a known mathematical symmetry group comprising at least one element of rotational symmetry, in which one or more of the elements of symmetry are explicitly considered; along each of the considered symmetry elements, multimeric protein building blocks with matching elements of symmetry can be aligned such that the symmetry elements of the building blocks and the designated symmetry group are collinear. Known mathematical symmetry groups with multiple different types of symmetry elements can be considered (for instance, octahedral point group symmetry contains two-fold, three-fold, and four-fold rotational symmetry elements); modeling nanostructures possessing these symmetries can require multiple distinct multimeric protein building blocks with distinct symmetries. In this way, a symmetric architecture defines: 1) the overall symmetry of the nanostructure being modeled, 2) the symmetries of the one or more distinct multimeric building blocks making up the symmetric nanostructure, and 3) the relative orientations of the symmetry axes of the one or more multimeric building blocks.

As a non-limiting example, a symmetric framework can be provided to model systems in two different symmetric architectures with tetrahedral point group symmetry. In one architecture, the assembly can be constructed from 4 trimers aligned along the tetrahedral three-fold symmetry axes and 6 dimers aligned along the two-folds; this architecture can be referred to as T32 (tetrahedron constructed from 3mers and 2mers). The second architecture, T33, can be constructed from four copies of one trimer and four copies of a second trimer, with the three-fold of each trimer aligned at opposite poles of each tetrahedral three-fold. Throughout the docking and design process the relative orientation of each of the two subunits within the trimers and/or dimers was maintained while allowing the trimeric or dimeric building blocks to translate along and rotate about the tetrahedral three-fold or two-fold symmetry axes.
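The building block counts of the two tetrahedral architectures can be checked with a short Python sketch. The coordinate frame (three-fold axes along the cube body diagonals, two-fold axes along x, y, and z) is one conventional choice assumed here for illustration, and the variable names are hypothetical.

```python
from itertools import product

# Poles of the four tetrahedral three-fold axes: the (+/-1, +/-1, +/-1)
# directions with an even number of minus signs (assumed frame).
three_fold_poles = [v for v in product((1, -1), repeat=3) if v[0] * v[1] * v[2] == 1]

# Opposite poles of the same four axes, occupied by the second trimer in T33.
opposite_poles = [tuple(-c for c in v) for v in three_fold_poles]

# Both poles of the three two-fold axes (x, y, z): six dimer sites in T32.
two_fold_poles = [
    tuple(s if i == k else 0 for i in range(3)) for k in range(3) for s in (1, -1)
]

t32_subunits = 3 * len(three_fold_poles) + 2 * len(two_fold_poles)
t33_subunits = 3 * len(three_fold_poles) + 3 * len(opposite_poles)
print(len(three_fold_poles), len(two_fold_poles))  # 4 6
print(t32_subunits, t33_subunits)                  # 24 24
```

Under these assumptions, both architectures contain 24 subunits in total, 12 from each of the two components.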

The method disclosed herein can be used to model and design synthetic nanostructures possessing a wide variety of symmetries. In addition to the two-component tetrahedral symmetric architectures discussed above, nanostructures possessing octahedral or icosahedral point group symmetries can be modeled using the method, as well as nanostructures possessing dihedral point group symmetries, helical or line group symmetries, plane or layer group symmetries, or space group symmetries. In each symmetry, multimeric protein building blocks can be aligned along a subset of one or more of the elements of symmetry in the symmetry group in order to generate a synthetic nanostructure with the desired overall symmetry. The relative orientations of symmetry elements in all of the aforementioned symmetry groups are known, and the symmetry definition file disclosed herein provides one general and non-limiting mechanism for providing this information to the computational design method.

Two-component symmetric docking

The herein-described methods and techniques are discussed in the context of an example embodiment of the Rosetta™ software suite. However, the herein-described methods and techniques are not limited to use of RosettaDesign™, the Rosetta™ software suite, or any other specific software package. For example, other software programs could be used in conjunction with this method to computationally dock multi-component symmetric protein nanostructures. As will be understood by those of skill in the art, the implementation of the design methods of the invention described below is non-limiting, and the methods are in no way limited to the implementation disclosed herein. An application, tc_dock, was written within Rosetta™ to dock two distinct oligomeric building blocks into higher order symmetries in order to identify docked configurations predicted to be suitable for interface design. The required inputs for the tc_dock application are one PDB file containing a single subunit of the first scaffold component and a second PDB file containing a single subunit of the second scaffold component.

Sets of homodimeric and homotrimeric protein structures were curated to be input to our docking and design protocol. First, the PISA database was searched for all homodimeric or homotrimeric proteins passing the default criteria for dissociation energy, accessible surface area, buried surface area, percent buried surface area, and average chain length. The IDs obtained from PISA were then provided as input for the advanced search tool in the Protein Data Bank to select proteins clustered at 90% sequence identity with: 1) X-ray resolution less than 2 Å, 2) chain lengths of 75 to 200 amino acids, and 3) Escherichia coli as the host organism for protein expression. One trimeric protein that did not pass our automated selection criteria, PDB ID 3FTT, was added because of previous experience indicating it may serve as a successful design scaffold.

Coordinates for each of the selected PDB IDs were downloaded from the biological assemblies in the PDB and standardized for input to Rosetta. For biological assemblies containing multiple models with one chain per model, each model was treated as a separate chain. For assemblies containing multiple models with multiple chains per model, only the first model was considered. Alternative side chains and HETATM records were removed, selenomethionines were replaced with methionines, and the chain with the lowest average RMSD (as calculated by the super command in PyMOL) to all other chains was selected to be the input chain for design. Residues with missing main chain atoms were removed from the design input chain and its residues were renumbered starting from 1. A new biological assembly was created in PyMOL by superimposing copies of the design input chain onto all other chains, and the assembly's symmetry axis was aligned along the vector [0, 0, 1] and its center of mass translated to the origin. Assemblies that were found to be too asymmetric, as assessed by the dispersion of symmetry axes implied by each tuple of symmetrically related atoms, were discarded. For PDB IDs with multiple biological assemblies, the assembly with the lowest biological unit number found to match the expected C2 or C3 symmetry was chosen for design. The final set of 1,161 homodimeric proteins is listed in Table 1 below, and the final set of 200 homotrimeric proteins is listed in Table 2 below.
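The selection of the design input chain described above can be sketched in Python as follows. This is a minimal illustration only: the pairwise RMSD values are hypothetical and would in practice come from a superposition tool such as the PyMOL super command named above, and the function name is an assumption.

```python
def pick_input_chain(pairwise_rmsd):
    """Return the chain whose mean RMSD to all other chains is lowest."""
    mean_rmsd = {
        chain: sum(others.values()) / len(others)
        for chain, others in pairwise_rmsd.items()
    }
    return min(mean_rmsd, key=mean_rmsd.get)


# Hypothetical RMSD values (in angstroms) for a trimer with chains A, B, and C.
example = {
    "A": {"B": 0.30, "C": 0.50},
    "B": {"A": 0.30, "C": 0.25},
    "C": {"A": 0.50, "B": 0.25},
}
print(pick_input_chain(example))  # B
```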

lalx 1 la3c 1 lag9_ 1 lalu_ 1 lalv 1 laz5 1 lbOx 1 lb8z 1 lbgf 1 lbm9 1 lbtk 1 lbuo 1 lbyf_ 1 lbyr 1 lc02 1 lcdc 1 lci4 1 lciz 2 lcoz 1 lcxq 1 ld7j_ 3 ld9c_lldnl 1 ldz3 1 le71 1 lees 1 leeq 1 lek3_ 1 lepO_ 1 lesr 1 letx 1 levx 1 lex2 llext 1 leyv 1 lfle 1 lflg 1 Iflm 1

If9f 1 If 9z 1 lfit 1 lfmb 1 lfux 1 lfzv llg2i 1 lg2q 1 lg8q 1 lgd7_ 1 lgvj_ 1 lgvp 1 lgy6 1 lgy7_ 1 lgyx 1 lh8x 1 lhe7 llhgx 1 lht9 1 lhur 1 liOr 1 lil2 1 li3c_ 1 li4y 1 li9d 2 lic2 2 lihr 1 lilk lliq6 1 lis6_ 1 lixl 2 lizm 1 lj24 1 lj27 2 lj2r 2 lj3m 1 lj3q 1 lj55 _1 lj7g_llj8b_ 1

1 j 98_ 1 ljc4_ 1 ljhc_ 1 ljhg_ 1 ljk3 2 ljr8 1 ljrl 2 ljya 1 lk04 1 lk2e llk3s 1 lk66 1 lk8u 1 lk9u 1 lkcq 2 lkl9 2 lkll 1 lkso 1 lllq 1

113p_ 1 118s_ lllgp 1 llj9_l llq9_ 1 llyl 1 ImOd 1 1ml f 1 lm2d 1 lm4i 1 lmi8 1 lmj h 1 lmk4 1

lmka 1 lmkk 1 Imp 9 1 lmsc 1 lmxi 1 lmy6 1 lmy7 1 ln99 1 ln91 2 lng2 2 lnjh llnki 1 lnp6 1 lnqd 1 lnrz 5 lns5 1 lnu3 1 lnxm 1 lnzn 2 lo22 1 lo3u 1 lo4t_llo4w 1 lo50_ 1 lo6a 1 lo6d 2 lohO 1 lohp 1 loiv 1 lon2 1 loqc 1 loru 1 lovs 21p6o 1 lpbj 2 lpdo 1 lpsr 1 lpuc 1 lpvm 1 lpy9_ 1 lpzw 1 lq08_ 1 lq7s 3 lq8b 21q98 1 lq9u 1 lqip 1 lqou 1 lqto 1 lqwi 1 lrlt 1 lrlu 1 lr29 1 lr5q 2 lr7j llr71 1 lr9c 1 lrdo 1 lrfy 1 lrlk 2 lrxq 2 ls4k 1 ls67_ 1 ls7i 1 ls7z 2 ls99 llsd4 1 lsei 1 lsgm 1 lsh8_ 1 Isj w 2 lsjy_ 1 lsk4 2 lsl8 2 lsnd 1 lt82 1 lt92 _lltc5_l ltfe_ 1 ltgj_ 1 lto4_ 1 ltul_ 1 ltuh_ 1 ltuv 2 ltuw 1 ltvd 1 ltwu 1 lu2w llu3y_ 1 lu5f_ 2 lu69_ 2 lu7i 1 luat 2 ludv 1 lues 1 lukk 1 lusc 1 lusm 1 lusp llut7 1 luww 1 luz3 1 lv05_ 1 lv2z 1 lv70 1 lv8y 1 lv96 1 lv9y 1 lvcl 1 lvh5 1

lvia 1 lvj2 1 lvj e 1 lvjl 1 lvkc 1 lvki 1 lvl7 1 lvq3 2 lvr7 1 lvzg 1 lw53_ llwc9 1 lwkq 1 lwlt 1 lwn2 1 lwoc 1 lwpn 1 lwu9 1 lwwc 2 lwwi 2 lwz3 1 lxOj_llx2i 1 lx6i 1 1x82 1 lx8d 1 lxel 2 lxfs 1 lxhn 1 lxqa 1 lxrk 1 lxsO_ 1 lxso 1 lxsq 1 lxty 1 lxvq 2 lyOb 1 lyOh 1 lyOu_ 1 ly5h_ 1 ly7r 1 ly9q_ 1 ly9w 1 lyb3 21ybx 1 lybz 2 lyfu 1 lygt_l lyhf 2 lyib 1 lylk 1 lylm 1 lyo3_ 1 lyoa 1 lyrO llysp 2 lzOp 1 lz2w 3 lz4e 1 lz9n 1 lz9p_ 1 lzb9_ 1 1 zdn 3 1 zhq 1 lzhv 2 lzj6 21zlj 1 lzn8 1 lzo2 1

1 zop 1 1 zps 1 lzpv 1 1 zpw 2 lztd 1 lzva 1 lzwy 1 lzxk 12al5 1 2a67_ 1 a6c 1 2a72 1 2a8n 1 2a9s_1 2aan 2 2aao 3 2akp 3 2aps 1 2aqs 12asf 1auw 2 2b06_ 1 2b0a_ 1 2b0v 2 2bl8 1 2bly 1 2b3s 3 2b5a 1 2b5g 1b6h 22b8m 1 2b9a_l 2bbe 2 2bdr 1 2bnl 1 2bsj 1 2bzl 1 2c2i 1 2c9q_ 1car 1 2cvd 1

cvi 1 2cwz 1 2cyy_ 1 2d37 1 2d4p 2 2d4u 1 2d5m 1 2d7v 1 2d8d 1dc3_ 1 2dc4 12dlb 1 2dm9 1 2dob 2 2dp9 1 2dpf 1 2dql 1 2duy 2 2dvk 1dxq 1 2elf 2 2eln 12e6u 2 2e8e 1 2ebl 1 2ebb 1 2ecu 1 2een 1 2ef8_ 1efv 1 2egd 1 2eh3 1 2ehp 12ei5 1 2eiq 3 2ejn 1 2eo4 1 2erb 3 2esu 2f22 1 2f4p 1 2f5g 1 2f62 1 2f99 22f9h 1 2fal 1 2fa5 1 2fbh 1 2fbn 1fck 1 2fd5_ 1 2fe3_ 1 2fex 1 2fhq 1 2fip 12fiu 1 2fj9 2 2fjt 1 2f14 1 fpr 1 2fq4 1 2fr2 2 2fre 1 2ftr 1 2fu4 1 2fyq 12fyx 1 2g0c 2 2g0i 1glu 1 2g3a 1 2g3r 2 2g7s 1 2g84 1 2gax 3 2gbt 1 2ge7 12gen 1 2gff 1glz 1 2goj 1 2gpc 1 2gu9_ 1 2gux 1 2gxg 1 2gyq 1 2gzv 2 2hlt 12h2b 1h8e 1 2h9u 1 2ha8 1 2hbo 1 2hcm 1 2hhg 2 2hhz 1 2hiq 1 2hkv 1hl0 12hlj 1 2hng 1 2hq9 1 2hql 1 2hsl 1 2hsb 1 2htd 1 2huh 1 2hur 2hyt_ 1 2hzc 2

hzt 2 2i02 1 2i51 1 2i7a 1 2i7d 1 2i8b 1 2i8d 1 2i8t 1 2ial 1ict_ 2 2idl 12iek 1 2if5 1 2ifx 1 2ig6 1 2igi 1 2ikk 1 2imj 1 2iml 1ims 1 2imz 1 2inb 12isy 1 2iu5 1 2ivy 1 2iwq 1 2ixk 1 2j6b 1 2j 6y_ 1j7j_ 1 2j 8m 1 2j ar 1 2jba 12jdj 1 2je3 1 2jlj 1 21ig 1 2nlv 1 2nrk 1nsa 2 2nwv 1 2nx4 1 2nx8 1 2nyb 12nyc 1 2nyi 1 2nz7 1 2nzo 1 2o08_ 1o28 1 2o38_ 1 2o4t_ 1 2o6f 1 2o70 1 2o7m 12o95 1 2o99 1 2oa2 1 2oai 1o 5_ 1 2od4 1 2od6_ 1 2oda 1 2oee 1 2ogi 1 2oik 12okf 1 2oku 1 2olm 1omo 1 2onf 1 2oo2 1 2ooj 1 2ook 1 2opo 1 2oqk 2 2oqm 12oso 1 2ou3 1ou5 1 2ou6 1 2ouf 1 2ovs 1 2owp 1 2oyn 1 2oyz 1 2ozh 1 2ozj 12p08 1p09_ 1 2pl2 1 2p25 1 2p3w 1 2p5q 1 2p7o 1 2p84 1 2p8g 1 2p8i 1p92 12pa7 1 2pey 1 2pfb 1 2pfi 1 2pfw 1 2pjs 1 2pk8 1 2pkh 1 2pmr 1pn0 1 2pn2 1

pq3 2 2pqv 1 2prx 1 2pwo 1 2pyt 1 2q03 1 2q0y 1 2q20 1 2q24 1q2f 1 2q2h 12q2i 1 2q30 1 2q3p 1 2q3t 1 2q3x 1 2q4n 1 2q5c 3 2q79_ 1q82 1 2q8o 1 2q9k 12q9r 1 2qe9 1 2qhk 1 2qjw 1 2qkp 1 2ql8 1 2qml 2qmm 1 2qnd 3 2qnl 1 2qnt 12qqz 1 2qrr 1 2qsi 1 2qsw 1 2qtr 1 2qud 1qvm 2 2qx0 1 2r0x 1 2rli 1 2r47 12r4i 1 2r6u 1 2r6v 1 2r78 1 2rbb 1rc3 1 2rcz 1 2rey 1 2rh0 1 2rhm 1 2ril 12riq 1 2rk3 1 2rk9 1 2rkf 1rkh 1 2uv4 1 2v57 1 2v90 1 2vez 1 2vkl 1 2voc 12vpk 1 2vs0 1 2vsv 1vvp 1 2vvw 1 2wlr 1 2w2a 1 2w31 1 2w4e 1 2w7w 1 2wb6 12wce 1 2wcr 1 wcu 1 2wcw 1 2wfc 1 2wnx 1 2wp7 1 2wra 1 2wtg 1 2wzo 1 2x3g 12x5c 1x5h 1 2x5r 1 2x7 z 1 2xbq 1 2xdp 1 2xfl 1 2xhf 1 2xr4 1 2xrh 1xxc _12y0o_l 2y39_l 2y6w 1 2y78 1 2yfd 1 2ykz 1 2yqy 3 2ysk 1 2yvo 1ywl 1 2yxh 1

yzl 3 2yzk 1 2zl0 1 2z6d 1 2z8u 1 2z98 1 2zcm 1 2zdo 1 2 zdp 1 zej 1 2 zgl 12znd 1 2 zpm 2 2zvy 1 2zw2 1 2zxy 1 3a2y 2 3a5p 1 3a6r 1a6s 3 3acd 1 3agx 13ah7 1 3aly_3 3b02_l 3b09_l 3b33_l 3b47 1 3b5g 1b5t_ 1 3b76_ 1 3b7c_ 1 3b7h 13b9c 1 3bb9 1 3bcw 1 3bde 1 3bln 1 3bml 1bm7 1 3bmz 1 3bn7 1 3bpj 1 3bpv 13bqx 1 3bri 1 3bs3 1 3but_l 3by8_ 2byr 1 3bzh 1 3bzt_ 2 3c0f_ 1 3cld 3 3clq 13c3m 1 3c97 1 3can 1 3cb0_ 1cby 1 3cel 1 3cex 1 3cj d 1 3cje 1 3cjn 1 3cm3 13cng 1 3cnk 1 3cnu 1cp3 1 3ct6_ 1 3cu3 1 3czt_ 1 3d00 1 3d0f 1 3d0j 1 3d0w 13d5p_l 3d7a_ 1db7_ 1 3dcm 1 3df8_ 1 3dib 2 3dlo 1 3dm8 1 3dmc 1 3dn7 1 3dnx 13do8_ 1dpj 1 3dr6_ 1 3dsb 1 3dz8_ 1 3el0_l 3el7_l 3e2c_l 3e39_ 1 3e4v 1e5h_ _13e8o_l 3ebt_l 3ec6 1 3ec9 1 3ecf 1 3f3x 1 3f43 1 3f7e_l 3f71 1f8h_ 1 3f8x_ 1

f9s_ 1 3fcd_ 1 3fcn 1 3fd7 3 3ff0 1 3ffy 1 3fg9 1 3fgv 1 3fgy_ 1fhl 1 3fjs_ 13fkc 2 3flj_l 3fm2 1 3fm5 1 3fmb 1 3fn7 1 3fov 1 3fqm 1frq 1 3ful_ 1 3fv6_ 13fwz 1 3fx7 1 3fxh 1 3fyb 1 3fyn 1 3g0k 1 3gl3_ 1gl4_ 1 3gl6_ 1 3g26 1 3g2b 13g46 1 3g7p 1 3g8g 1 3g8k 1 3g8z 1 3gby 1gdw 1 3gfa 1 3ggq 1 3ggu 1 3ghj 13gla 1 3glv 1 3gm5 1 3gpv 1 3grd 1guz 1 3gwk 1 3gwn 1 3gxh 3 3gya 2 3gyd 13gzr 1 3h05 1 3h0n 1 3h0x 2hls_ 1 3h2d 1 3h36_ 1 3h3h 1 3h4o 1 3h4y 1 3h51 13h6q 2 3h8h 2 3h8u_ 1h95_ 1 3ha2 1 3ha9_ 2 3hcz 2 3hdc 1 3hf5 1 3hhv 1 3hiu 13hix 5 3hk4 2hm4 1 3hmf 2 3hmz 1 3hoi 1 3hqx 1 3hr7 1 3htl 1 3huh 1 3hup 13hvv 1hx9 1 3hyq 2 3hzb 1 3hzp 1 3i24 1 3i3g 1 3ial 1 3ia8 3 3ibm 1ifj_ 33ift_2 3igr 2 3iis 1 3ijm 1 3ilx 1 3in8 1 3inq 1 3ip0 2 3ir3_ 1itf_ 1 3ix3 1

j rz 1 3jtf_ 1 3j tw 1 3jtz_ 1 3jum 1 3jx9 1 3k0z 1 3kle 1 3k21 1k2v 1 3k3v 23k67 1 3k69 1 3k86 1 3kb5 2 3kbe 1 3kbq 1 3kby_l 3kg0_ 2kgz 1 3kk4 1 3kkg 13kll 1 3kol 1 3kor 1 3kpc 1 3ksh 1 3ksv 1 3kuv 1kwk 1 3kyz 1 3118_ 1 311e_ 1311n 2 3134 1 313u 1 3146 1 317h 1 317x 118u_ 1 319y_ 1 31ag 1 31as_ 1 31b5 131by 1 31e4 1 31e5 1 31eq 2 31f6_ 11fh 1 31fp_ 1 31fr 1 31hc 1 31hr 1 31in 131io 1 311v 2 31mo 1 31qn 11qy_ 2 31r0_ 1 31te_ 1 31w3 1 31wc 1 31x7 1 31yd 131yg 1 31yh_l 31yx_ 11za 1 31zl 1 3mle 1 3m5b 1 3m6j 1 3m8e 1 3m9z 1 3mcw 13mdp 1 3mgd 1 3mgm 1 3mhx 1 3mmh 1 3mng 1 3msh 1 3mti 1 3mtq 1 3mws 1 3myf 13nls 1

3n4j_ 1 3n4w 1 3n6y 3 3n8b_ 1 3nad 1 3nbc 1 3neu 1 3nfc 1 3nj2 1

3nj c 13nl9 1 3nqn 1 3nrl 1 3nrh 3 3nrp 1 3ny5 1 3nym 1 3o0m 1 3ol0_ 1

3olc 1 3o2e 2

3o2r 1 3o79_ 3 3oa4_ 1 3obh 1 3oga 1 3ogh 1 3ohe _1 3oj7_ 1 3oj i 1

3okx 1 3oms 13on4 1 3oni 2 3oop 1 3ose 2 3ov8 1 3oxp 1 3p0t_ 1 3p2t_ 2

3pc6 1 3pg6 1 3pmd 13pn3 3 3pp9 1 3pr6 2 3pu7 1 3q20 1 3q34_ 1 3q3y 3

3q62 1 3q63 1 3q64_ 1 3q6a 13q7r 1 3q8i 2 3q90 1 3qbm 1 3qdo 1 3qfl_ 1

3qh6 2 3qmq 1 3qoo 1 3qp4 1 3qp8 13qs2 1 3qui 1 3qzx 1 3r0n 1 3r5g 1

3r68_ 2 3r6a 1 3r6f_ 1 3rcp 2 3rdl 1 3rem 13rfi 2 3rkc 1 3rmh 1 3rmu 1

3rob 1 3rqi 1 3rt2_ 2 3s2r 1 3s45_l 3s6f_l 3s8i _13s9f_l 3sbl_ 1 3sd2_ 1

3sk2 1 3sl2_ 2 3sl7_ 1 3slz 1 3smd 2 3smj 3 3son 1 3soy 13svi 2 3sxm 1

3sz7 2 3szj 1 3tls_ 1 3t43_ 1 3t46_2 3t8r_l 3t90 _1 3t9y_ 1 3td4 43teq_ 1

3tgn 1 3tgv 1 3tj 8_ 1 3tk0_ 1 3tnj_l 3tol_l 3trc 1 3typ 1 3tys 1

3u04_ _13ul5_l 3uld_l 3u2a 1 3u5v_l 3u6g_l 3u80_ 1 3ub6_l 3ucb 1 3ucg 1

3ufe 1 3uh9_ 1

3uie 1 3ulb_ 1 3ups 1 3urr 1 3uv0 1 3ux2 1 3vj z 1 3vk6 1 3vp5 1

3vql 1 3vub 13zrd 1 3zve 1 3zw5 1 3zxc 1 3zxq 1 3zy7 1 4ali 1 4a5k_ 1

4a5n_ 1 4ae4_ 1 4aeq 14ag7 1 4agh 1 4alg 1 4avp 1 4ax2 1 4b4p 1 4b6i 1

4di0 1 4duq 1 4e08_ 1 4e0h 14e2g 1 4e74 1 4e7p 1 4eae_l 4egu 1 4em8 1

4err 1 4esl_ 1 4eun 1 4ew5 1 4ew7 14exo 1 4exr 1 4ezg 1 4f82 1 4f8y_ 1

4fak 1 4fiv 1 4flb 1 4fld 1 4g5a 1 4g6x 14gdh 1 4ghj 1 4giw 1 4go7_ 1

4gs3_ 2 4gwb 1

Table 1

lbuu 1 ldbf 1 ldg6_ 1 ldi6_ 1 lf71 1 lfth 1 lgr3 1 lgu9_ 1 lgxl 1 lh7z 1 lh9m llhfo 1 lidp 1 liv2 1 ljdl_l ljlj_l ljqO_l 1jw8 2 lknb 1 lkr4 1 llrO 2 ln2m llnog 1 lnza 1 lo5j_l lo91 1 locy 1 loni 1 lox3 1 lpll 2 lpwb 1 lq5h 1 lq5x llqul 1 lrlh 2 ls55_l lseh 1 Isj n 1 ltOa_ 1 ltcz_ 1 ltd4_ 1 lu5x 1 lu9d_ 2 lufy_lluku 1 lusn 2 luuy 1 luxa 1 lv3w 1 lveO_ 1 lvfj 1 lvhf 2 lvmf 1 lvmh 1 lvph llwoz 1 lwyl 1 1x25 1 lxhd 2 lyq5_ 1 2aal 1 2bcm 1 2brj 1 2bt9_ 1 2bzv 1 2chc 12cu5 1 2cvl 1 2dt4_ 1

2e7a 1 2ed6_ 1 2eg2 1 2f0c_ 1 2fb6 2 2fvh 1 2g2d 1 2gdg 12gr8 1 2gw8 1

2h61 1 2hx0 1 2ibl 1 2ieq 1 2ig8 1 2is8 1 2j2j_ 1 2j9c_ 1 2jb7 — 12nuh 2 2oj 6 1 2oll 1 2otm 1 2p2o 1 2p6c 1 2p6h 1 2p6y 1 2p9o_1 2pii 1

2qg8 12qih 1 2r32 1 2r6q 1 2rfr 1 2rie 1 2tnf 1 2uyk 1 2vnl 1 2w5p 1

2wds 1 2wh7 1

2wkb 1 2wpq 1 2wq4 1 2x4j 1 2xcz 1 2xdh 1 2xdj 1 2xx6 1 2y75 2

2yzj_ 1 2 zhz 13aqe 7 3b64_l 3b81_l 3bsw 1 3bzq 1 3c6v 1 3cc0_ 1 3ci3 1

3cpl 1 3d01_ 1 3d9x_ 13da0_l 3djh_l 3e6q 1 3eby 1 3efg 1 3ehw 1 3ej c 1

3ej v 1 3emf 1 3f09_ 1 3f0d_ 13f4f 1 3fq3_3 3ftt_l 3fuy_l 3fwt_ 1 3fwu 1

3gqh 1 3h5i_ 1 3h6x 1 3htn 1 3hwu 13hza 1 3i3f_l 3i7t_l 3ixc 1 3jvl_ 1

3k6a_ 1 3kan 1 3kjj_ 1 31aa_ 1 31qw 1 3mlx 13mc3 1 3mci 1 3mdx 1 3mf7 1

3mhy 1 3mko 1 3mqh 1 3mxu 2 3n79 1 3ne3 2 3nfd 13o46 1 3oiu 2 3opk 1

3p48_ 1 3pzy 1 3qc7 1 3qr7 1 3quw 1 3rlw 1 3r3r 2 3rwn 13so2 1 3ta2_ 1

3tio 1 3tq5_ 1 3tqz 1 3uv9_ 2 3v4d 1 3vi6 2 3zw0 14aff 1 4g2k 14gb5 1

4gdz 1

Table 2

The subunits can be arranged at the origin according to the symmetry specified by command-line options or through a user-provided symmetry definition file. Then the full space of contacting symmetric configurations can be sampled by systematically varying the translational and rotational degrees of freedom (DOFs) in the system. In order to test all four possible orientations of the two building blocks (inside/inside, inside/outside, outside/inside, outside/outside) two separate docking runs can be performed in which the orientation of one of the building blocks is reversed by setting the Rosetta command-line option tcdock::reverse to true. Configurations in which backbone or beta carbon atoms from different building blocks clash (distance between backbone amide nitrogen and carbonyl oxygen atoms <= 2.6 Å; distance between all other backbone/beta carbon atom pairs <= 3.0 Å) can be discarded.
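The clash criterion stated above can be expressed as a small Python sketch. The atom representation (a name paired with coordinates, using the PDB atom names N and O for the backbone amide nitrogen and carbonyl oxygen) and the function name are assumptions for illustration; this is not the tc_dock implementation.

```python
import math


def clashes(atom_a, atom_b):
    """Each atom is (name, (x, y, z)); the atoms belong to different building blocks."""
    is_n_o_pair = {atom_a[0], atom_b[0]} == {"N", "O"}
    # 2.6 angstroms for amide N / carbonyl O pairs, 3.0 angstroms for all other pairs.
    cutoff = 2.6 if is_n_o_pair else 3.0
    return math.dist(atom_a[1], atom_b[1]) <= cutoff


print(clashes(("N", (0.0, 0.0, 0.0)), ("O", (2.8, 0.0, 0.0))))    # False
print(clashes(("CA", (0.0, 0.0, 0.0)), ("CB", (2.8, 0.0, 0.0))))  # True
```

The looser cutoff for nitrogen/oxygen pairs leaves room for backbone hydrogen bonds between building blocks.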

In each non-clashing configuration, a designability score can be calculated. For example, the designability score can be calculated as the sum of the number of beta carbon contacts between building blocks (where a contact is defined as two beta carbon atoms within 12 Å), weighted by the type of secondary structures on which the contacting positions exist (by setting the Rosetta tcdock::cb_weight_secstruct command-line option to true) and the average degree of connectivity (the number of amino acid positions within a user-specified distance threshold within the multimeric building block) of the contacting positions (by setting the Rosetta tcdock::cb_weight_average_degree command-line option to true). This designability measure favors the selection of docked configurations with large numbers of contacting residues on well-anchored regions of protein structure. In addition to inter-component contacts, which can be contacts between building blocks of the two different components, two-component systems can also possess intra-component contacts or contacts between building blocks of the same component. The Rosetta command-line options tcdock::intra, tcdock::intra1, and tcdock::intra2 control the contribution to the designability score of intra-component contacts for both components, for component 1, and for component 2, respectively.
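A minimal Python sketch of the unweighted part of this score, the count of beta carbon contacts within 12 Å, is given below. The secondary structure and average degree weights are deliberately omitted, the coordinates are hypothetical, and the function name is an assumption rather than part of the tc_dock application.

```python
import math


def count_cb_contacts(block_1_cb, block_2_cb, cutoff=12.0):
    """Count beta carbon pairs (one atom from each building block) within the cutoff."""
    return sum(
        1 for a in block_1_cb for b in block_2_cb if math.dist(a, b) <= cutoff
    )


# Hypothetical beta carbon coordinates (in angstroms) for two building blocks.
block_1 = [(0.0, 0.0, 0.0), (30.0, 0.0, 0.0)]
block_2 = [(5.0, 0.0, 0.0), (0.0, 11.0, 0.0), (50.0, 0.0, 0.0)]
print(count_cb_contacts(block_1, block_2))  # 2
```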

Data and PDB files can be output for a user-defined number of top scoring configurations (set by the Rosetta tcdock::topx command-line option). The data, which can be saved by redirecting the output of the run to a log file, includes the rigid body DOFs, the designability score, the number of carbon beta contacts between building blocks, the number of contacting residues between building blocks, the average score per carbon beta contact, and the average score per contacting residue.

In one example, the 1,161 dimers and 200 trimers in the scaffold sets listed in Tables 1 and 2 provided 232,200 unique pairwise combinations of trimers with dimers, and 19,900 unique pairwise combinations of trimers. Docking was carried out for each of these unique combinations with or without the tcdock::reverse option set to true, for a total of 504,200 independent docking trajectories. The tcdock::intra option was set to false such that intra-component contacts were not included in the calculated scores.
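The counts in this example follow directly from the sizes of the two scaffold sets, as the following Python check illustrates (the variable names are chosen here for illustration):

```python
from math import comb

n_dimers, n_trimers = 1161, 200

t32_pairs = n_trimers * n_dimers  # every trimer paired with every dimer
t33_pairs = comb(n_trimers, 2)    # unordered pairs of two distinct trimers
# Each combination is docked twice: with and without the reverse option.
trajectories = 2 * (t32_pairs + t33_pairs)

print(t32_pairs, t33_pairs, trajectories)  # 232200 19900 504200
```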

For each unique scaffold combination, the 3 top scoring T33 docks were selected. This set of 59,700 distinct configurations was ranked by the average designability score per residue and the top 1,000 were used as input for interface design. For T32, data was output for the 40 top scoring docked configurations per docking trajectory. This set of 18,576,000 distinct configurations was filtered to remove all configurations with fewer than 80 contacting residues between building blocks and ranked by the average designability score per residue. This set was filtered to retain only the one top ranked configuration for each unique scaffold pair and the top 1,000 configurations were used as input for interface design.
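The T32 selection steps described above (filter by contacting residues, rank, keep one configuration per scaffold pair, take the top configurations) can be sketched in Python as follows. The record layout, field names, and example values are hypothetical, and the sketch assumes that a higher average designability score per residue ranks better.

```python
def select_t32_inputs(docks, min_contacting_residues=80, top_n=1000):
    """docks: dicts with 'pair', 'contacting_residues', and 'score_per_residue'."""
    kept = [d for d in docks if d["contacting_residues"] >= min_contacting_residues]
    kept.sort(key=lambda d: d["score_per_residue"], reverse=True)
    best_per_pair = {}
    for dock in kept:
        # The first configuration seen for a pair is its top ranked one.
        best_per_pair.setdefault(dock["pair"], dock)
    return list(best_per_pair.values())[:top_n]


# Hypothetical docked configurations.
docks = [
    {"pair": ("t1", "d1"), "contacting_residues": 90, "score_per_residue": 1.2},
    {"pair": ("t1", "d1"), "contacting_residues": 95, "score_per_residue": 1.5},
    {"pair": ("t2", "d3"), "contacting_residues": 70, "score_per_residue": 2.0},
    {"pair": ("t2", "d4"), "contacting_residues": 85, "score_per_residue": 1.4},
]
selected = select_t32_inputs(docks)
print([d["score_per_residue"] for d in selected])  # [1.5, 1.4]

# Set sizes quoted above: 3 docks per T33 pair; 40 docks per T32 trajectory,
# with two trajectories (reversed and not reversed) per T32 pair.
print(3 * 19900, 40 * 2 * 232200)  # 59700 18576000
```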

Two-Component Symmetric Interface Design

The herein-described methods and techniques are not limited to use of RosettaDesign, the Rosetta software suite, or any other specific software package. For example, other software programs could be used in conjunction with this method to design new amino acid sequences at protein-protein interfaces. As will be understood by those of skill in the art, the implementation of the design methods of the invention described below is non-limiting, and the methods are in no way limited to the implementation disclosed herein.

A set of protein-protein interface design protocols was developed within Rosetta to identify mutations predicted to drive assembly of two distinct protein building blocks into higher order symmetric complexes. The design functionality was broken into modular components and implemented within the RosettaScripts™ framework in order to facilitate future code development and to provide users the ability to modify each step of the design process without having to change the underlying C++ code.

The design process can have four stages: I) interface design, II) shape complementarity optimization, III) automated reversion, and IV) resfile-based refinement. The protocols used in each stage can take as input a symmetry definition file and a PDB file containing a single subunit of both scaffold proteins; the latter can be produced by concatenating the two scaffold protein PDB files used as input for docking and changing the chain of the second subunit to be "B". In addition, initial values for the translational and rotational symmetric rigid body DOFs can be specified through user-defined variables. All design calculations can be performed on the two independent subunits and propagated symmetrically.

Stage I. Interface design can involve carrying out multiple design trajectories for each docked configuration. At the start of each trajectory, the symmetric rigid body DOFs can be perturbed in order to sample nearby docked configurations. The behavior of these perturbations can be set by the user, including specifying whether to sample values from a user-defined grid of angles and displacements or randomly from user-defined uniform or Gaussian distributions of angles and displacements. Trajectories yielding docked configurations with clashing backbones (distance between backbone amide nitrogen and carbonyl oxygen atoms <= 2.6 Å; distance between all other backbone/beta carbon atom pairs <= 3.0 Å) can be discarded prior to interface design based on user-defined cutoff values for the number of clashing atoms.

In each of the remaining trajectories, interface residues can be selected according to some or all of the following three criteria: 1) the residue has a beta carbon (alpha carbon in the case of glycine) within a user-defined cutoff distance to a beta carbon (alpha carbon in the case of glycine) in a different building block (in this study the default 10 Å cutoff was used), 2) the residue has a nonzero solvent accessible surface area when the protein subunits are in the unbound state, and 3) with the exception of residues that have high Lennard-Jones repulsive scores (fa_rep), the residue does not make contacts (any heavy atoms within 5 Å) with other subunits in the same oligomeric building block. Residues matching all three criteria can be considered designable, with the exception of proline and glycine, which are restricted to repacking. In some scenarios, criterion 3 is not enforced.

Residues fulfilling criteria 1 and 2 can be termed "interface positions" and residues fulfilling criteria 1, 2, and 3 can be termed "design positions". Thus, all design positions are also interface positions, but not all interface positions are design positions. These positions can be updated at multiple points throughout design Stages I through IV, appending any positions that newly satisfy the selection criteria to the previously defined sets. All residues not in the selected sets remain fixed throughout the design process. In addition, mutations to proline, glycine, or cysteine are prohibited unless explicitly specified otherwise by the user via a resfile (see Stage IV). Optionally, a reduced amino acid set can be used during Stage I such that only the native amino acid and mutations to a subset of the 20 common amino acids are allowed at each design position.
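The relationship between interface positions and design positions can be illustrated with a short Python sketch. The function and argument names are hypothetical, each criterion is assumed to have been evaluated beforehand and is passed in as a simple value, and the restriction of proline and glycine to repacking is not modeled.

```python
def classify_position(cb_near_other_block, unbound_sasa, contacts_own_block,
                      high_fa_rep=False, enforce_criterion_3=True):
    """Label a residue as 'design', 'interface only', or 'fixed'."""
    criterion_1 = cb_near_other_block  # beta carbon within the cutoff of another block
    criterion_2 = unbound_sasa > 0.0   # solvent exposed in the unbound state
    # Criterion 3 is waived for residues with high fa_rep scores.
    criterion_3 = (not contacts_own_block) or high_fa_rep
    if not (criterion_1 and criterion_2):
        return "fixed"
    if criterion_3 or not enforce_criterion_3:
        return "design"
    return "interface only"


print(classify_position(True, 12.5, contacts_own_block=False))  # design
print(classify_position(True, 12.5, contacts_own_block=True))   # interface only
print(classify_position(True, 0.0, contacts_own_block=False))   # fixed
```

Every position labeled "design" here also satisfies the two interface criteria, which mirrors the statement that all design positions are interface positions but not the reverse.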

Once the design positions have been selected, an initial round of design can be carried out using the standard RosettaDesign™ algorithm and a version of the Rosetta™ scorefunction, soft_rep, in which the Lennard-Jones repulsive term (fa_rep) is down-weighted to favor tightly packed interfaces. The scorefunction can then be set to score12 and the Rosetta energy minimized through a series of small changes to the design position side chain configurations and the symmetric rigid body DOFs (i.e., the side chains and rigid body DOFs are symmetrically minimized). Designs with contacting interface areas not meeting user-defined thresholds can be discarded. For those designs passing the interface area cutoffs, the design positions can be updated and a second round of interface design carried out using the standard RosettaDesign™ algorithm with the score12 scorefunction. The design position side chains can be repacked and the interface position side chains and rigid body DOFs can be subjected to at least one round of minimization.

Several metrics can be used to gauge the quality of the interfaces resulting from this first stage of design and to select designs to carry forward to shape complementarity optimization in Stage II. These metrics include, but are not limited to: 1) the number of buried unsatisfied hydrogen bonds at the designed interface, 2) the shape complementarity of the designed interface, and 3) the predicted binding energy of the interface, defined as the difference in energy between the bound and unbound (individual building blocks) state following repacking of the side chains at the design positions and minimization of the side chains at the interface positions in the unbound state. For each passing design, the values of the final rigid body DOFs can be output to a scorefile along with the metric values and the standard score12 score terms, and a resfile can be generated containing each of the design positions and their amino acid identities.

In one example, 100 independent design trajectories were run for each of the top 1,000 docked T32 and T33 configurations (vide supra). At the start of each trajectory, the building blocks were displaced 2 Å away from the assembly's center of mass along their symmetry axes, the translational rigid body DOFs were perturbed by sampling randomly from a Gaussian distribution with a standard deviation of 0.75 Å, and the rotational rigid body DOFs were perturbed by sampling randomly from a Gaussian distribution with a standard deviation of 2 degrees. Trajectories yielding more than 8 clashing backbone atoms were removed from further design considerations. A reduced amino acid set was employed during this stage of the design process such that only mutations to the following 8 amino acids were allowed: alanine, aspartate, isoleucine, leucine, asparagine, serine, threonine, and valine. Additionally, during all RosettaDesign™ steps in all stages, the chi2 angle for aromatic side chains being repacked or designed was restricted to between 70 and 110 degrees.
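The starting perturbation used in this example can be sketched in Python with the standard library random module. The function name, the one-translation/one-rotation representation of a building block's rigid body DOFs, the sign convention (a positive translation moves the block away from the center of mass), and the starting values are all assumptions for illustration.

```python
import random


def perturb_dofs(translation, rotation_deg, rng, displacement=2.0,
                 translation_sd=0.75, rotation_sd=2.0):
    """Displace a building block outward, then add Gaussian noise to both DOFs."""
    new_translation = translation + displacement + rng.gauss(0.0, translation_sd)
    new_rotation = rotation_deg + rng.gauss(0.0, rotation_sd)
    return new_translation, new_rotation


rng = random.Random(0)  # fixed seed so the sketch is reproducible
# 100 trajectories starting from one hypothetical docked configuration.
starts = [perturb_dofs(30.0, 15.0, rng) for _ in range(100)]
print(len(starts))
```

Sampling close to the docked values in this way lets each trajectory explore a slightly different nearby configuration before design begins.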

T32 design trajectories yielding contacting interface areas of less than 1,100 Å² or greater than 2,000 Å² following the first round of design were discarded. The passing T32 designs were further filtered at the end of Stage I to remove those that had more than 45 mutations or 8 buried unsatisfied hydrogen bonds at the designed interface, a predicted binding energy greater than -12 REU, or a shape complementarity score of less than 0.60. The T33 design trajectories were filtered based on contacting interface areas at the end of Stage I rather than after the first round of design, discarding those that yielded contacting interface areas of less than 600 Å². The passing T33 designs were further filtered to remove those with more than 100 mutations or 10 buried unsatisfied hydrogen bonds at the designed interface, a predicted binding energy greater than -12 REU, or a shape complementarity of less than 0.55. The resulting 1,292 T32 designs and 593 T33 designs were subjected to the protocol described in Stage II below.
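The T32 thresholds listed above can be collected into a single predicate, sketched here in Python. The dictionary keys and example values are hypothetical, and for brevity the sketch applies the interface area check together with the end-of-stage checks, whereas in the example above the area check was applied after the first round of design.

```python
def passes_t32_stage_1(design):
    """Return True if a T32 design survives the Stage I thresholds quoted above."""
    return (
        1100.0 <= design["interface_area"] <= 2000.0   # square angstroms
        and design["mutations"] <= 45
        and design["buried_unsat_hbonds"] <= 8
        and design["binding_energy"] <= -12.0          # REU
        and design["shape_complementarity"] >= 0.60
    )


# Hypothetical design metrics.
candidate = {
    "interface_area": 1350.0,
    "mutations": 22,
    "buried_unsat_hbonds": 3,
    "binding_energy": -18.4,
    "shape_complementarity": 0.66,
}
print(passes_t32_stage_1(candidate))  # True
```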

Stage II: Stage II involves regenerating the initial design from the two input scaffolds: 1) the rigid body DOFs output from Stage I are used to reposition the subunits in the fully assembled state, 2) the interface positions are re-selected using the same criteria as before, with the exception that all positions specified in the input resfile are included regardless of whether or not they fulfill the criteria in the input state, 3) the resfile output from Stage I is used as input to the RosettaDesign algorithm to reintroduce the initial design mutations, and 4) the interface position side chains are subjected to one or more rounds of minimization and/or repacking.

Then, optimization techniques, such as greedy optimization, can test individual reversions to native amino acids at all mutated residues. A custom reversion score can be used in which individual mutations are filtered to remove those that increase the number of buried unsatisfied hydrogen bonds at the designed interface and scored according to the sum of the predicted binding energy, the total Rosetta energy, and a residue type constraint energy favoring the native amino acid. The potential reversions can be combined one at a time proceeding from the individually best scoring to worst scoring reversions at each position, only accepting those that do not increase the number of buried unsatisfied hydrogen bonds at the designed interface and improve the reversion score in the context of all previously accepted mutations. In some embodiments, the buried unsatisfied hydrogen bond criterion is optional; for example, this criterion was used for the T32 designs, but not T33.

Following another one or more rounds of interface position side chain minimization and/or repacking, optimization techniques are used to increase the shape complementarity of the designed interfaces. Mutations to all amino acids except cysteine, glycine, and proline can be tested individually at each design position as defined by the input resfile. Each mutation can be ranked by the shape complementarity of the design interface if the mutation does not: 1) increase the total Rosetta energy by more than 2.0 Rosetta energy units (REU), 2) decrease the predicted binding energy by 1.0 REU, 3) introduce any new unsatisfied hydrogen bonds, or 4) increase the fa_dun component of the score, which can be an internal energy of side chain rotamers as defined by statistics from the Dunbrack library, by more than 2.5 REU (the fa_dun criterion is optional; it was used for the T32 designs, but not T33). Next, mutations can be combined one at a time proceeding from the best scoring to worst scoring individual mutations, only accepting those that still pass the same three or four criteria and improve the shape complementarity in the context of all previously accepted mutations. During both the reversion and shape complementarity optimization, all of the interface positions can be subjected to at least one round of minimization, repacking, and minimization prior to evaluating the effects of each mutation.
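The one-at-a-time combination step described above is a greedy procedure, which can be sketched generically in Python. This is an illustrative toy, not the RosettaScripts implementation: the mutation labels, the additive scoring model, and the single pass criterion shown (total energy may not rise by more than 2.0 REU relative to the currently accepted set) are assumptions, and the remaining criteria would be added to the passes function in the same way.

```python
def greedy_combine(ranked_mutations, evaluate, passes):
    """Accept mutations in ranked order if they pass and improve shape complementarity."""
    accepted = []
    current = evaluate(accepted)
    for mutation in ranked_mutations:
        trial = evaluate(accepted + [mutation])
        if passes(trial, current) and trial["sc"] > current["sc"]:
            accepted.append(mutation)
            current = trial
    return accepted


# Toy additive model with hypothetical mutations and values.
SC_GAIN = {"A10L": 0.03, "B22V": 0.02, "A15I": 0.01}
ENERGY_COST = {"A10L": 0.5, "B22V": 1.0, "A15I": 3.0}


def evaluate(mutations):
    return {
        "sc": 0.60 + sum(SC_GAIN[m] for m in mutations),
        "total_energy": -500.0 + sum(ENERGY_COST[m] for m in mutations),
    }


def passes(trial, current):
    # Criterion 1 only: total energy may not increase by more than 2.0 REU.
    return trial["total_energy"] - current["total_energy"] <= 2.0


result = greedy_combine(["A10L", "B22V", "A15I"], evaluate, passes)
print(result)  # ['A10L', 'B22V']
```

Because each candidate is judged in the context of the mutations already accepted, a mutation that looked favorable on its own can still be rejected once earlier mutations are in place.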
In addition to the standard Rosetta scores, the following metrics, and perhaps others, can used to assess the quality of each design following one or more rounds of interface position side chain minimization and/or repacking: 1) the total number of mutations, 2) the number of buried unsatisfied hydrogen bonds at the interface, 3) the average degree of each design position, 4) the RosettaHoles packing score, 5) the average total Rosetta energy, fa_atr, fa_rep, and fa_dun for each filter position, 6) the contacting interface area, 7) the predicted binding energy, 8) the shape complementarity, and 9) the change in predicted binding energy resulting from individual mutations of each interface side chain to alanine (i.e., a computational alanine scan of the designed interface). Those designs passing a set of user-defined thresholds for each metric are subsequently subjected to visual inspection to further filter the designs. A scorefile with the metric values and the standard scorel2 score terms, and a resfile containing the design positions and their amino acid identities is generated for each design at the end of Stage II.

In one example, the T32 designs resulting from Stage II were filtered to remove those with a shape complementarity score less than 0.65, predicted binding energies of greater than -25 REU, a positive RosettaHoles score for the designed interface, an interface area less than 1,200 Å², or more than 1 buried unsatisfied hydrogen bond at the designed interface. The 283 passing T32 designs were visually inspected and manually curated down to a list of 68 designs that were subjected to the reversion protocol outlined in Stage III. The T33 designs resulting from Stage II were filtered, visually inspected and manually curated down to a list of 38 designs that were subjected to the reversion protocol outlined in Stage III.
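
The example T32 filter thresholds above can be expressed as a simple predicate. The sketch below is illustrative only; the class and field names (`DesignMetrics`, `passes_t32_filters`, and so on) are hypothetical and do not correspond to Rosetta scorefile column names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DesignMetrics:
    shape_complementarity: float
    binding_energy_reu: float   # predicted binding energy, in REU
    holes_score: float          # RosettaHoles score for the designed interface
    interface_area_a2: float    # interface area, in square Angstroms
    buried_unsat_hbonds: int


def passes_t32_filters(m: DesignMetrics) -> bool:
    """Return True if a design survives the example T32 thresholds."""
    return (
        m.shape_complementarity >= 0.65      # remove if less than 0.65
        and m.binding_energy_reu <= -25.0    # remove if greater than -25 REU
        and m.holes_score <= 0.0             # remove if positive
        and m.interface_area_a2 >= 1200.0    # remove if less than 1,200
        and m.buried_unsat_hbonds <= 1       # remove if more than 1
    )
```

Designs for which the predicate returns True would then proceed to visual inspection and manual curation.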

Stage III: The third stage in the design process can identify, via an automated computational process, mutated residues predicted not to be critical for assembly and revert them back to their native amino acid identities. This helps to minimize the number of mutations being made to the scaffold proteins and reduces the amount of refinement required in Stage IV.

Stage III can begin by regenerating the design from the two input scaffolds using the rigid body DOFs from Stage I and the resfile output from Stage II: 1) the rigid body DOFs can be used to reposition the subunits in the fully assembled state, 2) the interface positions can be re-selected in the same manner as in Stage II, 3) the resfile can be used as input to the RosettaDesign algorithm to reintroduce the initial design mutations, and 4) at least one round of interface position side chain and rigid body DOF minimization, side chain repacking, and minimization is performed. Next, greedy optimization or another optimization algorithm can be used to revert mutations to the native amino acid identities as follows. During the first part of the optimization algorithm, each reversion can be tested individually and ranked by the change in shape complementarity if the reversion does not: 1) decrease the predicted binding energy by more than 2.0 REU, 2) increase the number of buried unsatisfied hydrogen bonds at the interface, or 3) decrease the shape complementarity of the interface by more than 0.02. During the second part of the optimization algorithm, reversions that passed the first part can be combined one at a time proceeding from the best scoring to the worst scoring individual mutations, only accepting those that still pass the three criteria above in the context of all previously accepted mutations. Then, optimization can be terminated if a mutation passes these criteria but causes the predicted binding energy to be greater than a user-defined threshold (in one example, -15 REU was used for T32 designs and -17 REU for T33 designs) or the shape complementarity to be less than 0.65. During both parts of optimization, all interface positions can be subjected to at least one round of minimization, repacking, and minimization prior to evaluating the effects of each mutation. Furthermore, during the second part, the reference structure for measuring the change in shape complementarity can be reset after each accepted mutation.
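
The second part of the Stage III optimization, including the termination test and the resetting of the reference after each accepted reversion, can be sketched as follows. The sketch rests on several labeled assumptions: the names `State`, `passes`, `combine_reversions`, and `evaluate` are hypothetical; a "decrease" in predicted binding energy is read as binding becoming less favorable (the value rising); all three criteria are measured against the reset reference; and the reversion that triggers termination is discarded rather than kept.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence


@dataclass(frozen=True)
class State:
    ddg: float   # predicted binding energy in REU (more negative = tighter)
    sc: float    # shape complementarity of the interface
    unsat: int   # buried unsatisfied hydrogen bonds at the interface


def passes(new: State, ref: State) -> bool:
    """The three Stage III criteria, measured against a reference state."""
    return (
        new.ddg - ref.ddg <= 2.0       # assumed sign convention, see lead-in
        and new.unsat <= ref.unsat
        and ref.sc - new.sc <= 0.02
    )


def combine_reversions(
    ranked: Sequence[str],
    evaluate: Callable[[Sequence[str]], State],
    ddg_max: float = -15.0,   # example T32 threshold; -17.0 was used for T33
    sc_min: float = 0.65,
) -> List[str]:
    """Combine individually passing reversions, best ranked first."""
    accepted: List[str] = []
    ref = evaluate(accepted)
    for rev in ranked:
        trial = evaluate(accepted + [rev])
        if not passes(trial, ref):
            continue
        if trial.ddg > ddg_max or trial.sc < sc_min:
            break  # assumption: the terminating reversion is not kept
        accepted.append(rev)
        ref = trial  # reset the reference after each accepted reversion
    return accepted
```

Because the reference is reset after each acceptance, several small losses can accumulate across reversions; the absolute thresholds `ddg_max` and `sc_min` are what bound the total drift.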

Following at least one round of rigid body and side chain minimization, side chain repacking, and minimization, the full suite of additional metrics can be evaluated (as outlined at the end of Stage II) with the additional calculation of a Boltzmann weighted estimation of the probability of each designed side chain configuration in the bound versus the unbound state. For each design, the values of the final rigid body DOFs are output to a score file along with the additional metrics and the standard score12 score terms, and a resfile is generated containing the interface positions and their amino acid identities.
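
A Boltzmann-weighted probability of the kind mentioned above can be computed, in general form, as shown below. The text does not give the formula or parameters used, so this is the standard Boltzmann expression only: the function name, the set of alternative side chain states, their energies, and the value of `kT` are all assumptions made for illustration.

```python
import math
from typing import Sequence


def boltzmann_probability(energies: Sequence[float], index: int, kT: float = 1.0) -> float:
    """Probability of the state at `index` among all states with the given energies.

    `energies` holds one energy (e.g., in REU) per alternative side chain
    configuration; `kT` is an assumed temperature factor in the same units.
    """
    e_min = min(energies)  # shift energies for numerical stability
    weights = [math.exp(-(e - e_min) / kT) for e in energies]
    return weights[index] / sum(weights)
```

Evaluating the designed configuration once with energies from the bound state and once with energies from the unbound state gives the two probabilities to be compared.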

In one example, all 68 T32 designs and 38 T33 designs resulting from Stage III were run through the resfile-based refinement protocols outlined in Stage IV below.

Stage IV: Stage IV of the design process can involve one or more iterations of resfile-based redesign with user-guided mutations. In each iteration of the process, a combination of visual inspection and analysis of the design metrics can be used to generate modified resfiles for each design, with each modified resfile containing a small number of user-defined mutations relative to a corresponding resfile output from Stage III. Two different protocols, resfile_optimize and resfile_design, can be used to test the user-defined mutations. In both protocols, the starting configuration can be generated from the two input scaffolds using the rigid body DOFs from the previous round of design.

The resfile_optimize protocol uses greedy optimization to test the user-defined mutations. First, the reverted design resulting from Stage III can be regenerated using the unmodified resfile output from Stage III together with the standard RosettaDesign™ algorithm, and the side chains specified in the resfile are minimized, repacked, and minimized. Next, user-defined mutations can be tested individually at each design position. Each mutation can be ranked by the change in shape complementarity of the designed interface, if the mutation does not decrease the predicted binding energy by greater than 2.0 REU or decrease the shape complementarity of the designed interface by more than 0.02. The passing mutations are then combined one at a time proceeding from the best ranked to the worst ranked individual mutations, only accepting those that still do not decrease the binding energy by more than 2.0 REU or the shape complementarity by more than 0.02 in the context of all previously accepted mutations. Optimization can be terminated if a mutation passes these criteria, but causes the predicted binding energy to be greater than -15 REU or the shape complementarity to be less than 0.63. All positions specified in the input resfile can be subjected to at least one round of minimization, repacking, and minimization prior to evaluating the effects of each mutation. Furthermore, during the combining stage, the reference structure for measuring the change in predicted binding energy and the change in the shape complementarity can be reset after each accepted mutation.

The resfile_design protocol involves taking the starting design configuration generated using the rigid body DOFs from the previous round of design and applying the standard RosettaDesign algorithm with the user-defined resfile.

In both protocols, the symmetric rigid body DOFs and the side chains specified in the input resfile are minimized, side chains repacked, and minimized prior to calculating the full suite of design metrics. This process can be iterated until designs are obtained which are deemed suitable for experimental testing or until the user decides the designs are no longer worth pursuing.

Example Computing Environment

Figure 9 is a block diagram of an example computing network. Some or all of the above-mentioned techniques disclosed herein, such as but not limited to techniques disclosed as part of and/or being performed by software, the Rosetta™ software suite, RosettaDesign, Rosetta applications, and/or other herein-described computer software and computer hardware, can be part of and/or performed by a computing device. For example, Figure 9 shows protein design system 902 configured to communicate, via network 906, with client devices 904a, 904b, and 904c and protein database 908. In some embodiments, protein design system 902 and/or protein database 908 can be a computing device configured to perform some or all of the herein described methods and techniques, such as but not limited to, method 100 and functionality described as being part of or related to the Rosetta™ software suite. Protein database 908 can, in some embodiments, store information related to and/or used by the Rosetta™ software suite.

Network 906 may correspond to a LAN, a wide area network (WAN), a corporate intranet, the public Internet, or any other type of network configured to provide a communications path between networked computing devices. Network 906 may also correspond to a combination of one or more LANs, WANs, corporate intranets, and/or the public Internet.

Although Figure 9 only shows three client devices 904a, 904b, 904c, distributed application architectures may serve tens, hundreds, or thousands of client devices. Moreover, client devices 904a, 904b, 904c (or any additional client devices) may be any sort of computing device, such as an ordinary laptop computer, desktop computer, network terminal, wireless communication device (e.g., a cell phone or smart phone), and so on. In some embodiments, client devices 904a, 904b, 904c can be dedicated to problem solving / using the Rosetta software suite. In other embodiments, client devices 904a, 904b, 904c can be used as general purpose computers that are configured to perform a number of tasks and need not be dedicated to problem solving. In still other embodiments, part or all of the functionality of protein design system 902 and/or protein database 908 can be incorporated in a client device, such as client device 904a, 904b, and/or 904c.

Computing Device Architecture

Figure 10A is a block diagram of an example computing device (e.g., system). In particular, computing device 1000 shown in Figure 10A can be configured to: include components of and/or perform one or more functions of protein design system 902, client device 904a, 904b, 904c, network 906, and/or protein database 908 and/or carry out part or all of any herein-described methods and techniques, such as but not limited to method 100. Computing device 1000 may include a user interface module 1001, a network-communication interface module 1002, one or more processors 1003, and data storage 1004, all of which may be linked together via a system bus, network, or other connection mechanism 1005.

User interface module 1001 can be operable to send data to and/or receive data from external user input/output devices. For example, user interface module 1001 can be configured to send and/or receive data to and/or from user input devices such as a keyboard, a keypad, a touch screen, a computer mouse, a track ball, a joystick, a camera, a voice recognition module, and/or other similar devices. User interface module 1001 can also be configured to provide output to user display devices, such as one or more cathode ray tubes (CRT), liquid crystal displays (LCD), light emitting diodes (LEDs), displays using digital light processing (DLP) technology, printers, light bulbs, and/or other similar devices, either now known or later developed. User interface module 1001 can also be configured to generate audible output(s), such as a speaker, speaker jack, audio output port, audio output device, earphones, and/or other similar devices.

Network-communications interface module 1002 can include one or more wireless interfaces 1007 and/or one or more wireline interfaces 1008 that are configurable to communicate via a network, such as network 906 shown in Figure 9. Wireless interfaces 1007 can include one or more wireless transmitters, receivers, and/or transceivers, such as a Bluetooth transceiver, a Zigbee transceiver, a Wi-Fi transceiver, a WiMAX transceiver, and/or other similar type of wireless transceiver configurable to communicate via a wireless network. Wireline interfaces 1008 can include one or more wireline transmitters, receivers, and/or transceivers, such as an Ethernet transceiver, a Universal Serial Bus (USB) transceiver, or similar transceiver configurable to communicate via a twisted pair, one or more wires, a coaxial cable, a fiber-optic link, or a similar physical connection to a wireline network.

In some embodiments, network communications interface module 1002 can be configured to provide reliable, secured, and/or authenticated communications. For each communication described herein, information for ensuring reliable communications (i.e., guaranteed message delivery) can be provided, perhaps as part of a message header and/or footer (e.g., packet/message sequencing information, encapsulation header(s) and/or footer(s), size/time information, and transmission verification information such as CRC and/or parity check values). Communications can be made secure (e.g., be encoded or encrypted) and/or decrypted/decoded using one or more cryptographic protocols and/or algorithms, such as, but not limited to, DES, AES, RSA, Diffie-Hellman, and/or DSA. Other cryptographic protocols and/or algorithms can be used as well or in addition to those listed herein to secure (and then decrypt/decode) communications.

Processors 1003 can include one or more general purpose processors and/or one or more special purpose processors (e.g., digital signal processors, application specific integrated circuits, etc.). Processors 1003 can be configured to execute computer-readable program instructions 1006 contained in data storage 1004 and/or other instructions as described herein. Data storage 1004 can include one or more computer-readable storage media that can be read and/or accessed by at least one of processors 1003. The one or more computer-readable storage media can include volatile and/or non-volatile storage components, such as optical, magnetic, organic or other memory or disc storage, which can be integrated in whole or in part with at least one of processors 1003. In some embodiments, data storage 1004 can be implemented using a single physical device (e.g., one optical, magnetic, organic or other memory or disc storage unit), while in other embodiments, data storage 1004 can be implemented using two or more physical devices.

Data storage 1004 can include computer-readable program instructions 1006 and perhaps additional data. For example, in some embodiments, data storage 1004 can store part or all of data utilized by a protein design system and/or a protein database; e.g., protein design system 902, protein database 908. In some embodiments, data storage 1004 can additionally include storage required to perform at least part of the herein-described methods and techniques and/or at least part of the functionality of the herein-described devices and networks.

Figure 10B depicts a network 906 of computing clusters 1009a, 1009b, 1009c arranged as a cloud-based server system in accordance with an example embodiment. Data and/or software for protein design system 902 can be stored on one or more cloud-based devices that store program logic and/or data of cloud-based applications and/or services. In some embodiments, protein design system 902 can be a single computing device residing in a single computing center. In other embodiments, protein design system 902 can include multiple computing devices in a single computing center, or even multiple computing devices located in multiple computing centers located in diverse geographic locations.

In some embodiments, data and/or software for protein design system 902 can be encoded as computer readable information stored in tangible computer readable media (or computer readable storage media) and accessible by client devices 904a, 904b, and 904c, and/or other computing devices. In some embodiments, data and/or software for protein design system 902 can be stored on a single disk drive or other tangible storage media, or can be implemented on multiple disk drives or other tangible storage media located at one or more diverse geographic locations.

Figure 10B depicts a cloud-based server system in accordance with an example embodiment. In Figure 10B, the functions of protein design system 902 can be distributed among three computing clusters 1009a, 1009b, and 1009c. Computing cluster 1009a can include one or more computing devices 1000a, cluster storage arrays 1010a, and cluster routers 1011a connected by a local cluster network 1012a. Similarly, computing cluster 1009b can include one or more computing devices 1000b, cluster storage arrays 1010b, and cluster routers 1011b connected by a local cluster network 1012b. Likewise, computing cluster 1009c can include one or more computing devices 1000c, cluster storage arrays 1010c, and cluster routers 1011c connected by a local cluster network 1012c.

In some embodiments, each of the computing clusters 1009a, 1009b, and 1009c can have an equal number of computing devices, an equal number of cluster storage arrays, and an equal number of cluster routers. In other embodiments, however, each computing cluster can have different numbers of computing devices, different numbers of cluster storage arrays, and different numbers of cluster routers. The number of computing devices, cluster storage arrays, and cluster routers in each computing cluster can depend on the computing task or tasks assigned to each computing cluster.

In computing cluster 1009a, for example, computing devices 1000a can be configured to perform various computing tasks of protein design system 902. In one embodiment, the various functionalities of protein design system 902 can be distributed among one or more of computing devices 1000a, 1000b, and 1000c. Computing devices 1000b and 1000c in computing clusters 1009b and 1009c can be configured similarly to computing devices 1000a in computing cluster 1009a. On the other hand, in some embodiments, computing devices 1000a, 1000b, and 1000c can be configured to perform different functions.

In some embodiments, computing tasks and stored data associated with protein design system 902 can be distributed across computing devices 1000a, 1000b, and 1000c based at least in part on the processing requirements of protein design system 902, the processing capabilities of computing devices 1000a, 1000b, and 1000c, the latency of the network links between the computing devices in each computing cluster and between the computing clusters themselves, and/or other factors that can contribute to the cost, speed, fault-tolerance, resiliency, efficiency, and/or other design goals of the overall system architecture.

The cluster storage arrays 1010a, 1010b, and 1010c of the computing clusters 1009a, 1009b, and 1009c can be data storage arrays that include disk array controllers configured to manage read and write access to groups of hard disk drives. The disk array controllers, alone or in conjunction with their respective computing devices, can also be configured to manage backup or redundant copies of the data stored in the cluster storage arrays to protect against disk drive or other cluster storage array failures and/or network failures that prevent one or more computing devices from accessing one or more cluster storage arrays.

Similar to the manner in which the functions of protein design system 902 can be distributed across computing devices 1000a, 1000b, and 1000c of computing clusters 1009a, 1009b, and 1009c, various active portions and/or backup portions of these components can be distributed across cluster storage arrays 1010a, 1010b, and 1010c. For example, some cluster storage arrays can be configured to store one portion of the data and/or software of protein design system 902, while other cluster storage arrays can store a separate portion of the data and/or software of protein design system 902. Additionally, some cluster storage arrays can be configured to store backup versions of data stored in other cluster storage arrays.

The cluster routers 1011a, 1011b, and 1011c in computing clusters 1009a, 1009b, and 1009c can include networking equipment configured to provide internal and external communications for the computing clusters. For example, the cluster routers 1011a in computing cluster 1009a can include one or more internet switching and routing devices configured to provide (i) local area network communications between the computing devices 1000a and the cluster storage arrays 1010a via the local cluster network 1012a, and (ii) wide area network communications between the computing cluster 1009a and the computing clusters 1009b and 1009c via the wide area network connection 1013a to network 906.

Cluster routers 1011b and 1011c can include network equipment similar to the cluster routers 1011a, and cluster routers 1011b and 1011c can perform similar networking functions for computing clusters 1009b and 1009c that cluster routers 1011a perform for computing cluster 1009a.

In some embodiments, the configuration of the cluster routers 1011a, 1011b, and 1011c can be based at least in part on the data communication requirements of the computing devices and cluster storage arrays, the data communications capabilities of the network equipment in the cluster routers 1011a, 1011b, and 1011c, the latency and throughput of local networks 1012a, 1012b, 1012c, the latency, throughput, and cost of wide area network links 1013a, 1013b, and 1013c, and/or other factors that can contribute to the cost, speed, fault-tolerance, resiliency, efficiency and/or other design goals of the system architecture.

Nanostructures and Proteins

The present invention provides synthetic nanostructures comprising

wherein multiple copies of the first multimeric substructure and the second multimeric substructure interact with each other at symmetrically repeated, non-natural, non-covalent protein-protein interfaces that orient the first multimeric substructures and the second multimeric substructures such that their symmetry axes are aligned with symmetry axes of the same kind in a designated mathematical symmetry group.

The nanostructures of the invention can be used for any suitable purpose, including but not limited to delivery vehicles, as the nanostructures can encapsulate molecules of interest and/or the first and second proteins can be modified to bind to molecules of interest (diagnostics, therapeutics, detectable molecules for imaging and other applications, etc.).

The nanostructures of the invention are synthetic, in that they are not naturally occurring. The first protein and the second protein are non-naturally occurring proteins that can be produced by any suitable means, including recombinant production or chemical synthesis. Each member of the plurality of first proteins is identical to each other, and each member of the plurality of second proteins is identical to each other. The first proteins and the second proteins are different. There are no specific primary amino acid sequence requirements for the first and second proteins. As described in detail herein, the inventors disclose methods for designing the synthetic nanostructures of the invention, where the nanostructures are not dependent on specific primary amino acid sequences of the first and second proteins that make up the multimeric structures that interact to form the nanostructures of the invention. As will be understood by those of skill in the art, the design methods of the invention can produce a wide variety of nanostructures made of a wide variety of subunit proteins, and the methods are in no way limited to the subunit proteins disclosed herein.

As used herein, a "plurality" means at least two; in various embodiments, there are at least 2, 3, 4, 5, 6 or more first proteins in the first multimeric substructure and second proteins in the second multimeric substructure.

The number of first proteins in the first multimeric substructure may be the same as or different from the number of second proteins in the second multimeric substructure. In one exemplary embodiment, the first multimeric substructure comprises a trimer of the first protein, and the second multimeric substructure comprises a dimer of the second protein. In a further exemplary embodiment, the first multimeric substructure comprises a trimer of the first protein, and the second multimeric substructure comprises a trimer of the second protein.

The first and second proteins may be of any suitable length for a given purpose of the resulting nanostructure. In one embodiment, the first protein and the second protein are typically between 30-250 amino acids in length; the length of the first protein and the second protein may be the same or different. In various further embodiments, the first protein and the second protein are between 30-225, 30-200, 30-175, 50-250, 50-225, 50-200, 50-175, 75-250, 75-225, 75-200, 75-175, 100-250, 100-225, 100-200, 100-175, 125-250, 125-225, 125-200, 125-175, 150-250, 150-225, 150-200, or 150-175 amino acids in length.

In another embodiment, the first protein and the second protein comprise or consist of proteins selected from the following pairs of first and second proteins:

(a) T32-28A (SEQ ID NO: 11) and T32-28B (SEQ ID NO: 12);

(b) T33-09A (SEQ ID NO: 13) and T33-09B (SEQ ID NO: 14);

(c) T33-15A (SEQ ID NO: 15) and T33-15B (SEQ ID NO: 16);

(d) T33-21A (SEQ ID NO: 17) and T33-21B (SEQ ID NO: 18); and

(e) T33-28A (SEQ ID NO: 19) and T33-28B (SEQ ID NO: 20).

Figures 11-20 show the primary amino acid sequences of the proteins noted and allowable substitutions. Each figure includes four columns, which show:

1) The residue position in the protein

2) The identity of that residue in the designed sequence

3) The allowed amino acids at that position within our genus (labeled 1-4, indicating the AAs at that position in the different SEQ ID NOs for the relevant protein); and

4) The solvent-accessible surface area (SASA) of that residue in crystal structures (T32-28, T33-15, T33-21, and T33-28) or computationally designed models (T33-09) of the nanostructures.

In some embodiments certain residues can be any amino acid residue ("any"); such residues with a solvent-accessible surface area of greater than 50 square Angstroms are defined as being present on the polypeptide surface, and thus can be substituted with a different amino acid as desired for a given purpose without disruption of protein structure or multimer assembly (for example, SEQ ID NOS: 11-20). In various other embodiments, these same residues can be modified by conservative substitutions (for example, SEQ ID NOS: 21-30).

As further shown in the table, certain other residues can only be substituted with conservative amino acid substitutions. Such residues have a solvent-accessible surface area of less than or equal to 50 square Angstroms and are present in the polypeptide interior, and thus can be modified only by conservative substitutions to maintain overall protein structure to permit multimer assembly. As used here, "conservative amino acid substitution" means that:

o hydrophobic amino acids (Ala, Cys, Gly, Pro, Met, Sce, Sme, Val, Ile, Leu) can only be substituted with other hydrophobic amino acids;

o hydrophobic amino acids with bulky side chains (Phe, Tyr, Trp) can only be substituted with other hydrophobic amino acids with bulky side chains;

o amino acids with positively charged side chains (Arg, His, Lys) can only be substituted with other amino acids with positively charged side chains;

o amino acids with negatively charged side chains (Asp, Glu) can only be substituted with other amino acids with negatively charged side chains; and

o amino acids with polar uncharged side chains (Ser, Thr, Asn, Gln) can only be substituted with other amino acids with polar uncharged side chains.

Certain other residues in the proteins are invariant; these residues have one or more atoms within 5 Angstroms of one or more atoms across the interface between the first and second multimeric substructures, and are therefore directly involved in self-assembly.
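
The three residue classes described above (invariant interface residues, surface residues, and interior residues limited to conservative substitutions) can be summarized in a short sketch. The function and parameter names are hypothetical, only the twenty standard amino acids are represented (in one-letter code), "within 5 Angstroms" is read as less than or equal to 5.0, and the interface test is assumed to take precedence over the surface area test.

```python
# Conservative substitution groups as defined in the text (standard residues only).
CONSERVATIVE_GROUPS = [
    frozenset("ACGPMVIL"),  # hydrophobic
    frozenset("FYW"),       # hydrophobic with bulky side chains
    frozenset("RHK"),       # positively charged side chains
    frozenset("DE"),        # negatively charged side chains
    frozenset("STNQ"),      # polar uncharged side chains
]
ALL_AMINO_ACIDS = frozenset("ACDEFGHIKLMNPQRSTVWY")


def allowed_substitutions(
    residue: str,
    sasa: float,                   # solvent-accessible surface area, square Angstroms
    min_interface_distance: float, # closest atom-atom distance across the interface
    surface_any: bool = True,      # False: surface positions get conservative changes only
) -> frozenset:
    """Return the set of amino acids permitted at one position."""
    if min_interface_distance <= 5.0:
        return frozenset(residue)          # invariant: involved in self-assembly
    if sasa > 50.0 and surface_any:
        return ALL_AMINO_ACIDS             # surface residue: any amino acid
    for group in CONSERVATIVE_GROUPS:
        if residue in group:
            return group                   # conservative substitutions only
    raise ValueError(f"unknown residue: {residue!r}")
```

For example, `allowed_substitutions("L", 10.0, 12.0)` returns the hydrophobic group, while `allowed_substitutions("W", 80.0, 3.0)` returns only tryptophan because the residue lies at the interface.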

As used herein, the amino acid residues are abbreviated as follows: alanine (Ala; A), asparagine (Asn; N), aspartic acid (Asp; D), arginine (Arg; R), cysteine (Cys; C), glutamic acid (Glu; E), glutamine (Gln; Q), glycine (Gly; G), histidine (His; H), isoleucine (Ile; I), leucine (Leu; L), lysine (Lys; K), methionine (Met; M), phenylalanine (Phe; F), proline (Pro; P), serine (Ser; S), threonine (Thr; T), tryptophan (Trp; W), tyrosine (Tyr; Y), and valine (Val; V).

In a further embodiment, the first protein and the second protein comprise or consist of proteins selected from the following pairs of first and second proteins:

(a) T32-28A (SEQ ID NO: 21) and T32-28B (SEQ ID NO: 22);

(b) T33-09A (SEQ ID NO: 23) and T33-09B (SEQ ID NO: 24);

(c) T33-15A (SEQ ID NO: 25) and T33-15B (SEQ ID NO: 26);

(d) T33-21A (SEQ ID NO: 27) and T33-21B (SEQ ID NO: 28); and

(e) T33-28A (SEQ ID NO: 29) and T33-28B (SEQ ID NO: 30);

(a) T32-28A (SEQ ID NO: 31) and T32-28B (SEQ ID NO: 32);

(b) T33-09A (SEQ ID NO: 33) and T33-09B (SEQ ID NO: 34);

(c) T33-15A (SEQ ID NO: 35) and T33-15B (SEQ ID NO: 36);

(d) T33-21A (SEQ ID NO: 37) and T33-21B (SEQ ID NO: 38); and

(e) T33-28A (SEQ ID NO: 39) and T33-28B (SEQ ID NO: 40).

In one embodiment, the first protein and the second protein comprise or consist of proteins selected from the following pairs of first and second proteins:

(a) T32-28A (SEQ ID NO: 11, 21, or 31) and T32-28B (SEQ ID NO: 12, 22, or 32), wherein the first protein is at least 70% identical to the amino acid sequence of SEQ ID NO: 1 and the second protein is at least 70% identical to the amino acid sequence of SEQ ID NO: 2;

(b) T33-09A (SEQ ID NO: 13, 23, or 33) and T33-09B (SEQ ID NO: 14, 24, or 34), wherein the first protein is at least 70% identical to the amino acid sequence of SEQ ID NO: 3 and the second protein is at least 70% identical to the amino acid sequence of SEQ ID NO: 4;

(c) T33-15A (SEQ ID NO: 15, 25, or 35) and T33-15B (SEQ ID NO: 16, 26, or 36), wherein the first protein is at least 70% identical to the amino acid sequence of SEQ ID NO: 5 and the second protein is at least 70% identical to the amino acid sequence of SEQ ID NO: 6;

(d) T33-21A (SEQ ID NO: 17, 27, or 37) and T33-21B (SEQ ID NO: 18, 28, or 38), wherein the first protein is at least 70% identical to the amino acid sequence of SEQ ID NO: 7 and the second protein is at least 70% identical to the amino acid sequence of SEQ ID NO: 8; and

(e) T33-28A (SEQ ID NO: 19, 29, or 39) and T33-28B (SEQ ID NO: 20, 30, or 40), wherein the first protein is at least 70% identical to the amino acid sequence of SEQ ID NO: 9 and the second protein is at least 70% identical to the amino acid sequence of SEQ ID NO: 10.

In various further embodiments, the first and second proteins are at least 75%, 80%, 85%, 90%, 91%, 92%, 93%, 94%, 95%, 96%, 97%, or 98% identical to the amino acid sequence of the designed protein.
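
A percent identity figure of the kind recited above can be computed from a pairwise alignment, for example as sketched below. The text does not specify an alignment method or the denominator to use, so both are assumptions here: the two sequences are taken to be already aligned to equal length (with "-" marking gaps), and identity is reported relative to the number of alignment columns. The function name is hypothetical.

```python
def percent_identity(aligned_a: str, aligned_b: str) -> float:
    """Percent of alignment columns holding the same residue in both sequences.

    Both inputs must already be aligned to the same length; '-' marks a gap.
    Gap columns count toward the length but never count as matches.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("sequences must be aligned to the same length")
    if not aligned_a:
        raise ValueError("sequences must be non-empty")
    matches = sum(
        1 for a, b in zip(aligned_a, aligned_b) if a == b and a != "-"
    )
    return 100.0 * matches / len(aligned_a)
```

A candidate protein would satisfy, for instance, the 70% embodiment if the returned value is at least 70.0 against the corresponding designed sequence.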

In various further embodiments, the first and second proteins comprise or consist of proteins selected from the following pairs of first and second proteins:

(a) T32-28A (SEQ ID NO: 1)

MGEVPIGDPKELNGMEIAAVYLQPIEMEPRGIDLAASLADIHLEADIHALKNNPNGFP EGFWMPYLTIAYALANADTGAIKTGTLMPMVADDGPHYGANIAMEKDKKGGFGVG TYALTFLISNPEKQGFGRHVDEETGVGKWFEPFVVTYFFKYTGTPK; and

T32-28B (SEQ ID NO:2)

MSQAIGILELTSIAKGMELGDAMLKSANVDLLVSKTISPGKFLLMLGGDIGAIQQAIE TGTSQAGEMLVDSLVLANIHPSVLPAISGLNSVDKRQAVGIVETWSVAACISAADLA VKGSNVTLVRVHMAFGIGGKCYMVVAGDVLDVAAAVATASLAAGAKGLLVYASII PRPHEAMWRQMVEG;

(b) T33-09A (SEQ ID NO:3)

MEEVVLITVPSALVAVKIAHALVEERLAACVNIVPGLTSIYRWQGSVVSDHELLLLV KTTTHAFPKLKERVKALHPYTVPEIVALPIAEGNREYLDWLRENTG; and

T33-09B (SEQ ID NO:4)

MVRGIRGAITVEEDTPAAILAATIELLLKMLEANGIQSYEELAAVIFTVTEDLTSAFPA EAARLIGMHRVPLLSAREVPVPGSLPRVIRVLALWNTDTPQDRVRHVYLNEAVRLRP DLESAQ;

(c) T33-15A (SEQ ID NO:5)

MSKAKIGIVTVSDRASAGITADISGKAIILALNLYLTSEWEPIYQVIPDEQDVIETTLIK MADEQDCCLIVTTGGTGPAKRDVTPEATEAVCDRMMPGFGELMRAESLKEVPTAIL SRQTAGLRGDSLIV LPGDPASISDCLLAVFPAIPYCIDLMEGPYLECNEAMIKPFRPK AK; and

T33-15B (SEQ ID NO:6)

MVRGIRGAITV SDTPTSIIIATILLLEKMLEANGIQSYEELAAVIFTVTEDLTSAFPAE AARQIGMHRVPLLSAREVPVPGSLPRVIRVLALW TDTPQDRVRHVYLSEAVRLRPD LESAQ;

(d) T33-21A (SEQ ID NO:7)

MRITTKVGDKGSTRLFGGEEVWKDSPIIEANGTLDELTSFIGEAKHYVDEEMKGILEE IQNDIYKIMGEIGSKGKIEGISEERIAWLLKLILRYMEMV LKSFVLPGGTLESAKLDV CRTIARRALRKVLTVTREFGIGAEAAAYLLALSDLLFLLARVIEIEKNKLKEVRS; and

T33-21B (SEQ ID NO:8)

MPHLVIEATANLRLETSPGELLEQANKALFASGQFGEADIKSRFVTLEAYRQGTAAV ERAYLHACLSILDGRDIATRTLLGASLCAVLAEAVAGGGEEGVQVSVEVREMERLSY AKRVVARQR; and

(e) T33-28A (SEQ ID NO:9)

MESV TSFLSPSLVTIRDFDNGQFAVLRIGRTGFPADKGDIDLCLDKMIGVRAAQIFL GDDTEDGFKGPHIRIRCVDIDDKHTY AMVYVDLIVGTGASEVERETAEEEAKLALR VALQVDIADEHSCVTQFEMKLREELLSSDSFHPDKDEYYKDFL; and

T33-28B (SEQ ID NO: 10)

MPVIQTFVSTPLDHHKRLLLAIIYRIVTRVVLGKPEDLVMMTFHDSTPMHFFGSTDPV ACVRVEALGGYGPSEPEKVTSIVTAAITAVCGIVADRIFVLYFSPLHCGWNGTNF.

As shown in the examples that follow, these non-naturally occurring protein pairs self-interact to form multimeric substructures, which can interact to form the nanostructures of the invention. As will be understood by those of skill in the art, the design methods of the invention can produce a wide variety of nanostructures made of a wide variety of subunit proteins, and the methods are in no way limited to these particular protein pairs; they are merely exemplary.

The plurality of the first proteins self-interact to form a first multimeric substructure and the plurality of the second proteins self-interact to form a second multimeric substructure, where each multimeric substructure comprises at least one axis of rotational symmetry. As will be understood by those of skill in the art, the self-interaction is a non-covalent protein-protein interaction. Any suitable non-covalent interaction(s) can drive self-interaction of the proteins to form the multimeric substructure, including but not limited to one or more of electrostatic interactions, π-effects, van der Waals forces, hydrogen bonding, and hydrophobic effects. The self-interaction in each of the two different multimeric substructures may be natural or synthetic in origin; that is, the synthetic proteins making up the nanostructures of the invention may be synthetic variations of natural proteins that self-interact to form multimeric substructures, or they may be fully synthetic proteins that have no amino acid sequence relationships to known natural proteins.

As used herein, "at least one axis of rotational symmetry" means at least one axis of symmetry around which the substructure can be rotated without changing the appearance of the substructure. In one embodiment, one or both of the substructures have cyclic symmetry, meaning rotation about a single axis (for example, a three-fold axis in the case of a trimeric protein; generally, multimeric substructures with n subunits and cyclic symmetry will have n-fold rotational symmetry, sometimes denoted as C_n symmetry). In other embodiments, one or both substructures possess symmetries comprising multiple rotational symmetry axes, including but not limited to dihedral symmetry (cyclic symmetry plus an orthogonal two-fold rotational axis) and the cubic point group symmetries including tetrahedral, octahedral, and icosahedral point group symmetry (multiple kinds of rotational axes). The first multimeric substructure and the second multimeric substructure may comprise the same or different rotational symmetry properties. In one non-limiting embodiment, the first multimeric substructure comprises a dimer, trimer, tetramer, or pentamer of the first protein, and wherein the second multimeric substructure comprises a dimer or trimer of the second protein. In a further non-limiting embodiment, the first multimeric protein comprises a trimeric protein, and the second multimeric protein comprises a dimeric protein. In another non-limiting embodiment, the first multimeric protein comprises a trimeric protein, and the second multimeric protein comprises a different trimeric protein.

In the nanostructures of the invention, there are at least two identical copies of the first multimeric substructure and at least two identical copies of the second multimeric substructure in the nanostructure. In general, the number of copies of each of the first and second multimeric substructures is dictated by the number of symmetry axes in the designated mathematical symmetry group of the nanostructure that match the symmetry axes in each multimeric substructure. This relationship arises from the requirement that the symmetry axes of each copy of each multimeric substructure must be aligned to symmetry axes of the same kind in the synthetic nanostructure. By way of non-limiting example, a synthetic nanostructure with tetrahedral point group symmetry can comprise exactly four copies of a first trimeric substructure aligned along the exactly four three-fold symmetry axes passing through the center and vertices of a tetrahedron. Likewise, the same non-limiting example tetrahedral nanostructure can comprise six (but not five, seven, or any other number) copies of a dimeric substructure aligned along the six two-fold symmetry axes passing through the center and edges of the tetrahedron (an example of a synthetic nanostructure with this symmetric architecture, referred to here as T32, is shown in Figure 4F). In general, although every copy of each multimeric substructure must have its symmetry axes aligned to symmetry axes of the same kind in the synthetic nanostructure, not all symmetry axes in the synthetic nanostructure must have a multimeric building block aligned to them. By way of non-limiting example, we can consider a synthetic nanostructure with icosahedral point group symmetry comprising multiple copies of each of a first multimeric substructure and a second multimeric substructure.
There are 30 two-fold, 20 three-fold, and 12 five-fold rotational symmetry axes in icosahedral point group symmetry. The nanostructures of the invention are those in which two different multimeric substructures are aligned along all instances of two types of symmetry axes in a designated mathematical symmetry group. Therefore, the nanostructures in this non-limiting example could include icosahedral nanostructures comprising 30 dimeric substructures and 20 trimeric substructures, or 30 dimeric substructures and 12 pentameric substructures, or 20 trimeric substructures and 12 pentameric substructures. In each case, one of the three types of symmetry axes is left unoccupied by multimeric substructures.
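The copy numbers quoted above are consistent with a simple counting argument: a cyclic C_n substructure sitting on an n-fold axis is mapped onto itself by n of the rotations in the point group, so the number of symmetry-related copies equals the order of the rotation group divided by n. The following is a minimal sketch of that arithmetic, not a description of the design method; the table and function names are hypothetical, and only the three cubic point groups named in the text are covered.

```python
# Order of the rotation group and the kinds of rotational axes it contains.
POINT_GROUPS = {
    "tetrahedral": {"order": 12, "axes": {2, 3}},
    "octahedral": {"order": 24, "axes": {2, 3, 4}},
    "icosahedral": {"order": 60, "axes": {2, 3, 5}},
}


def copies_of_substructure(group: str, n_fold: int) -> int:
    """Copies of a C_n substructure when every n-fold axis position is occupied.

    Assumes the substructure sits on an n-fold axis of the group and is
    mapped onto itself by exactly n rotations.
    """
    info = POINT_GROUPS[group]
    if n_fold not in info["axes"]:
        raise ValueError(f"{group} symmetry has no {n_fold}-fold axis")
    return info["order"] // n_fold


def total_subunits(group: str, n_first: int, n_second: int) -> int:
    """Protein chains in a two-component nanostructure (each C_n oligomer has n chains)."""
    return (copies_of_substructure(group, n_first) * n_first
            + copies_of_substructure(group, n_second) * n_second)
```

With these definitions, `copies_of_substructure("tetrahedral", 3)` gives 4 and `copies_of_substructure("tetrahedral", 2)` gives 6, matching the T32 example, while the icosahedral case gives 30, 20, and 12 for dimers, trimers, and pentamers. A fully occupied two-component structure always contains twice the group order in chains, so a tetrahedral trimer-plus-dimer architecture has 24 subunits.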

The interaction between the first and second multimeric substructures is a non-natural (e.g., not an interaction seen in a naturally occurring protein multimer), non-covalent interaction; this can comprise any suitable non-covalent interaction(s), including but not limited to one or more of electrostatic interactions, π-effects, van der Waals forces, hydrogen bonding, and hydrophobic effects. The interaction occurs at multiple identical interfaces (symmetrical) between the first and second multimeric substructures, wherein the interfaces can be continuous or discontinuous. This symmetric repetition of the non-covalent protein-protein interfaces between the first and second multimeric substructures results from the overall symmetry of the subject nanostructures; because each protein molecule of each of the first and second multimeric substructures is in a symmetrically equivalent position in the nanostructure, the interactions between them are also symmetrically equivalent.

Non-covalent interactions between the first multimeric substructures and the second multimeric substructures orient the substructures such that their symmetry axes are aligned with symmetry axes of the same kind in a designated mathematical symmetry group as described above. This feature provides for the formation of regular, defined nanostructures, as opposed to irregular or imprecisely defined structures or aggregates. Several structural features of the non-covalent interactions between the first multimeric substructures and the second multimeric substructures help to provide a specific orientation between substructures. Generally, large interfaces that are complementary both chemically and geometrically and comprise many individually weak atomic interactions tend to provide highly specific orientations between protein molecules. In one embodiment of the subject invention, therefore, each symmetrically repeated instance of the non-natural, non-covalent protein-protein interface between the first multimeric substructure and the second multimeric substructure may bury between 1000-2000 Å² of solvent-accessible surface area (SASA) on the first multimeric substructure and the second multimeric substructure combined. SASA is a standard measurement of the surface area of molecules commonly used by those skilled in the art; many computer programs exist that can calculate both SASA and the change in SASA upon burial of a given interface for a given protein structure. A commonly used measure of the geometrical complementarity of protein-protein interfaces is the Shape Complementarity (S_c) value of Lawrence and Colman (J. Mol. Biol. 234:946-50 (1993)). In a further embodiment, each symmetrically repeated, non-natural, non-covalent protein-protein interface between the first multimeric substructure and the second multimeric substructure has an S_c value between 0.5-0.8.
Finally, in order to provide a specific orientation between the first multimeric substructures and the second multimeric substructures, in many embodiments the interface between them may be formed by relatively rigid portions of each of the protein substructures. This feature ensures that flexibility within each protein molecule does not lead to imprecisely defined orientations between the first and second multimeric substructures. Secondary structures in proteins, that is alpha helices and beta strands, generally make a large number of atomic interactions with the rest of the protein structure and therefore occupy a rigidly fixed position. Therefore, in one embodiment, at least 50% of the atomic contacts comprising each symmetrically repeated, non-natural, non-covalent protein-protein interface between the first multimeric substructure and the second multimeric substructure are formed from amino acid residues residing in elements of alpha helix and/or beta strand secondary structure.
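The three interface criteria described above (buried SASA, S_c, and the fraction of contacts made by secondary-structure elements) can be checked mechanically once the underlying quantities have been computed by external structure-analysis software. The sketch below assumes those quantities are already available; the class, field, and key names are hypothetical, the range bounds are treated as inclusive, and the numbers in the usage note are made up purely for illustration.

```python
from dataclasses import dataclass


@dataclass
class InterfaceMetrics:
    sasa_first_alone: float        # Å², first substructure in isolation
    sasa_second_alone: float       # Å², second substructure in isolation
    sasa_complex: float            # Å², the two substructures together
    shape_complementarity: float   # S_c value (Lawrence and Colman)
    contacts_total: int            # atomic contacts across the interface
    contacts_in_helix_or_strand: int  # contacts from helix/strand residues


def buried_sasa(m: InterfaceMetrics) -> float:
    """Surface area buried on both partners combined upon forming the interface."""
    return m.sasa_first_alone + m.sasa_second_alone - m.sasa_complex


def check_interface(m: InterfaceMetrics) -> dict:
    """Evaluate each of the three embodiments separately."""
    if m.contacts_total > 0:
        fraction = m.contacts_in_helix_or_strand / m.contacts_total
    else:
        fraction = 0.0
    return {
        "buried_sasa_1000_to_2000": 1000.0 <= buried_sasa(m) <= 2000.0,
        "sc_0.5_to_0.8": 0.5 <= m.shape_complementarity <= 0.8,
        "half_of_contacts_in_secondary_structure": fraction >= 0.5,
    }
```

For instance, an illustrative interface with isolated SASA values of 12000 and 9000 Å², a complex SASA of 19600 Å², an S_c of 0.65, and 28 of 40 contacts from helices or strands buries 1400 Å² and passes all three checks. Returning the checks separately reflects that the text presents them as distinct embodiments rather than a single combined requirement.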

The nanostructures of the invention are capable of forming a variety of different structural classes based on the designated mathematical symmetry group of each nanostructure. As the teachings above indicate, the nanostructures comprise multiple copies of a first multimeric substructure and multiple copies of a second multimeric substructure that interact at one or more symmetrically repeated, non-covalent protein-protein interfaces that orient the first multimeric substructures and the second multimeric substructures such that their symmetry axes are aligned with symmetry axes of the same kind in a designated mathematical symmetry group. There are many symmetry groups that comprise multiple types of symmetry axes, including but not limited to dihedral symmetries, cubic point group symmetries, line or helical symmetries, plane or layer symmetries, and space group symmetries. Collectively, the nanostructures of the invention may possess any symmetry that comprises at least two types of symmetry axes; however, each individual nanostructure possesses a single, mathematically defined symmetry that results from the interface between the first and second multimeric substructures orienting them such that their symmetry axes align to those in a designated mathematical symmetry group. Individual nanostructures possessing different symmetries may find use in different applications; for instance, nanostructures possessing cubic point group symmetries may form hollow shell- or cage-like structures that could be useful, for example, for packaging or encapsulating molecules of interest, while nanostructures possessing plane group symmetries will tend to form regularly repeating two-dimensional protein layers that could be used, for example, to array molecules, nanostructures, or other functional elements of interest at regular intervals.

In one embodiment, the mathematical symmetry group is selected from the group consisting of tetrahedral point group symmetry, octahedral point group symmetry, and icosahedral point group symmetry.

As will be apparent to those of skill in the art, the ability to widely modify surface amino acid residues without disruption of the protein structure permits many types of modifications to endow the resulting self-assembled multimers with a variety of functions. In one non-limiting embodiment, the protein can be modified to facilitate covalent linkage to a "cargo" of interest. In one non-limiting example, the protein can be modified, such as by introduction of various cysteine residues at defined positions to facilitate linkage to one or more antigens of interest, such that an assembly of the protein would provide a scaffold to provide a large number of antigens for delivery as a vaccine to generate an improved immune response (similar to the use of virus-like particles). In another non-limiting embodiment, the protein of the invention may be modified by linkage (covalent or non-covalent) with a moiety to help facilitate "endosomal escape." For applications that involve delivering molecules of interest to a target cell, such as targeted delivery, a critical step can be escape from the endosome - a membrane-bound organelle that is the entry point of the delivery vehicle into the cell. Endosomes mature into lysosomes, which degrade their contents. Thus, if the delivery vehicle does not somehow "escape" from the endosome before it becomes a lysosome, it will be degraded and will not perform its function. There are a variety of lipids or organic polymers that disrupt the endosome and allow escape into the cytosol. Thus, in this embodiment, the first or second protein can be modified, for example, by introducing cysteine residues that will allow chemical conjugation of such a lipid or organic polymer to the monomer or resulting multimer surface.

In a further aspect, the present invention provides isolated proteins, comprising or consisting of an amino acid sequence selected from the group consisting of:

(a) T32-28A (SEQ ID NO: 11);

(b) T32-28B (SEQ ID NO: 12);

(c) T33-09A (SEQ ID NO: 13);

(d) T33-09B (SEQ ID NO: 14);

(e) T33-15A (SEQ ID NO: 15);

(f) T33-15B (SEQ ID NO: 16);

(g) T33-21A (SEQ ID NO: 17);

(h) T33-21B (SEQ ID NO: 18);

(i) T33-28A (SEQ ID NO: 19); and

(j) T33-28B (SEQ ID NO: 20).

The isolated proteins of the invention can be used, for example, to prepare the nanostructures of the invention. In some embodiments, the isolated proteins may be produced in the same time and place; for instance, they may be expressed recombinantly in the same bacterial or eukaryotic cell. In other embodiments, each protein may be produced separately from the other, either by recombinant expression in separate bacterial or eukaryotic cells or by protein synthesis in separate vessels. The isolated proteins of the invention can be modified in a number of ways, including but not limited to the ways described above, either before or after assembly of the nanostructures of the invention. As a non-limiting example, the T33-15A protein and the T33-15B protein could be produced by recombinant expression in separate cultures of bacterial cells and purified independently of one another. Prior to mixing the two proteins, each protein could be modified chemically to introduce additional functionality as described above. The modified proteins could then be mixed to initiate assembly of a modified T33-15 nanostructure that comprises multiple copies of each of the T33-15A and T33-15B proteins. Alternatively, the T33-15A and T33-15B proteins could be produced recombinantly in the same cell to produce the assembled T33-15 nanostructure of the invention, which could then be modified as desired.

Figures 11-20 show the primary amino acid sequences of the proteins noted and allowable substitutions, as discussed above. In another embodiment, the isolated proteins comprise or consist of an amino acid sequence selected from the group consisting of:

(a) T32-28A (SEQ ID NO: 21);

(b) T32-28B (SEQ ID NO: 22);

(c) T33-09A (SEQ ID NO: 23);

(d) T33-09B (SEQ ID NO: 24);

(e) T33-15A (SEQ ID NO: 25);

(f) T33-15B (SEQ ID NO: 26);

(g) T33-21A (SEQ ID NO: 27);

(h) T33-21B (SEQ ID NO: 28);

(i) T33-28A (SEQ ID NO: 29); and

(j) T33-28B (SEQ ID NO: 30).

In another embodiment, the isolated proteins comprise or consist of an amino acid sequence selected from the group consisting of:

(a) T32-28A (SEQ ID NO: 31);

(b) T32-28B (SEQ ID NO: 32);

(c) T33-09A (SEQ ID NO: 33);

(d) T33-09B (SEQ ID NO: 34);

(e) T33-15A (SEQ ID NO: 35);

(f) T33-15B (SEQ ID NO: 36);

(g) T33-21A (SEQ ID NO: 37);

(h) T33-21B (SEQ ID NO: 38);

(i) T33-28A (SEQ ID NO: 39); and

(j) T33-28B (SEQ ID NO: 40).

In another embodiment, the isolated proteins comprise or consist of an amino acid sequence:

(A) T32-28A (SEQ ID NO: 11, 21, or 31), wherein the protein is at least 70% identical to the amino acid sequence of SEQ ID NO: 1;

(B) T32-28B (SEQ ID NO: 12, 22, or 32), wherein the protein is at least 70% identical to the amino acid sequence of SEQ ID NO: 2;

(C) T33-09A (SEQ ID NO: 13, 23, or 33), wherein the protein is at least 70% identical to the amino acid sequence of SEQ ID NO: 3;

(D) T33-09B (SEQ ID NO: 14, 24, or 34), wherein the protein is at least 70% identical to the amino acid sequence of SEQ ID NO: 4;

(E) T33-15A (SEQ ID NO: 15, 25, or 35), wherein the protein is at least 70% identical to the amino acid sequence of SEQ ID NO: 5;

(F) T33-15B (SEQ ID NO: 16, 26, or 36), wherein the protein is at least 70% identical to the amino acid sequence of SEQ ID NO: 6;

(G) T33-21A (SEQ ID NO: 17, 27, or 37), wherein the protein is at least 70% identical to the amino acid sequence of SEQ ID NO: 7;

(H) T33-21B (SEQ ID NO: 18, 28, or 38), wherein the protein is at least 70% identical to the amino acid sequence of SEQ ID NO: 8;

(I) T33-28A (SEQ ID NO: 19, 29, or 39), wherein the protein is at least 70% identical to the amino acid sequence of SEQ ID NO: 9; and

(J) T33-28B (SEQ ID NO: 20, 30, or 40), wherein the protein is at least 70% identical to the amino acid sequence of SEQ ID NO: 10.

In various further embodiments, the protein of any one of (A)-(J) is at least 75%, 80%, 85%, 90%, 91%, 92%, 93%, 94%, 95%, 96%, 97%, or 98% identical to the amino acid sequence of the designed protein.

In a further embodiment, the isolated protein comprises or consists of an amino acid sequence selected from the group consisting of: SEQ ID NOS: 1-10. As used throughout the present application, the term "protein" is used in its broadest sense to refer to a sequence of subunit amino acids. The polypeptides of the invention may comprise L-amino acids, D-amino acids (which are resistant to L-amino acid-specific proteases in vivo), or a combination of D- and L-amino acids. The polypeptides described herein may be chemically synthesized or recombinantly expressed. The polypeptides may be linked to any other moiety as deemed useful for a given purpose. Such linkage can be covalent or non-covalent as is understood by those of skill in the art.

In one non-limiting embodiment, the protein can be modified to facilitate covalent linkage to a "cargo" of interest. In one non-limiting example, the protein can be modified, such as by introduction of various cysteine residues at defined positions to facilitate linkage to one or more antigens of interest, such that an assembly of the protein would provide a scaffold to provide a large number of antigens for delivery as a vaccine to generate an improved immune response (similar to the use of virus-like particles). In another non-limiting embodiment, the protein of the invention may be modified by linkage (covalent or non-covalent) with a moiety to help facilitate "endosomal escape."

In a further aspect, the present invention provides multimers, comprising a plurality of identical protein monomers according to any embodiment or combination of embodiments of the proteins of the invention. As is disclosed herein, proteins of the invention are capable of self-interacting into multimeric substructures (i.e., dimers, trimers, tetramers, pentamers, hexamers, etc.) formed from self-assembly of a plurality of a single protein monomer of the invention (i.e., "homo-multimeric assemblies"). As used herein, a "plurality" means 2 or more. In various embodiments, the multimeric assembly comprises 2, 3, 4, 5, 6, or more identical protein monomers. The multimeric assemblies can be used for any purpose, including but not limited to creating the nanostructures of the present invention.

In another aspect, the present invention provides isolated nucleic acids encoding a protein of the present invention. The isolated nucleic acid sequence may comprise RNA or DNA. As used herein, "isolated nucleic acids" are those that have been removed from their normal surrounding nucleic acid sequences in the genome or in cDNA sequences. Such isolated nucleic acid sequences may comprise additional sequences useful for promoting expression and/or purification of the encoded protein, including but not limited to polyA sequences, modified Kozak sequences, and sequences encoding epitope tags, export signals, and secretory signals, nuclear localization signals, and plasma membrane localization signals. It will be apparent to those of skill in the art, based on the teachings herein, what nucleic acid sequences will encode the proteins of the invention.
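To illustrate how a protein sequence determines candidate coding sequences, the following minimal sketch back-translates a protein into one of the many DNA sequences that would encode it under the standard genetic code. The choice of a single codon per amino acid is arbitrary and is an assumption made for illustration; it is not the codon usage of the synthetic genes described in this application, and the names `CODON_FOR` and `back_translate` are hypothetical.

```python
# One arbitrarily chosen codon per amino acid (standard genetic code).
CODON_FOR = {
    "A": "GCT", "R": "CGT", "N": "AAC", "D": "GAT", "C": "TGC",
    "Q": "CAG", "E": "GAA", "G": "GGT", "H": "CAC", "I": "ATC",
    "L": "CTG", "K": "AAA", "M": "ATG", "F": "TTC", "P": "CCG",
    "S": "TCT", "T": "ACC", "W": "TGG", "Y": "TAC", "V": "GTT",
}
STOP_CODON = "TAA"


def back_translate(protein: str, add_stop: bool = True) -> str:
    """Return one DNA sequence encoding `protein` (one-letter amino acid codes)."""
    codons = []
    for residue in protein.upper():
        if residue not in CODON_FOR:
            raise ValueError(f"unsupported residue: {residue!r}")
        codons.append(CODON_FOR[residue])
    if add_stop:
        codons.append(STOP_CODON)
    return "".join(codons)
```

Applied to `MGEV`, the first four residues of SEQ ID NO: 1 as printed above, the function returns `ATGGGTGAAGTTTAA`. Because most amino acids are encoded by several codons, many other nucleic acid sequences encode the same protein; a practical design would typically also account for host codon preferences and the flanking control sequences.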

In a further aspect, the present invention provides recombinant expression vectors comprising the isolated nucleic acid of any embodiment or combination of embodiments of the invention operatively linked to a suitable control sequence. "Recombinant expression vector" includes vectors that operatively link a nucleic acid coding region or gene to any control sequences capable of effecting expression of the gene product. "Control sequences" operably linked to the nucleic acid sequences of the invention are nucleic acid sequences capable of effecting the expression of the nucleic acid molecules. The control sequences need not be contiguous with the nucleic acid sequences, so long as they function to direct the expression thereof. Thus, for example, intervening untranslated yet transcribed sequences can be present between a promoter sequence and the nucleic acid sequences and the promoter sequence can still be considered "operably linked" to the coding sequence. Other such control sequences include, but are not limited to, polyadenylation signals, termination signals, and ribosome binding sites. Such expression vectors can be of any type known in the art, including but not limited to plasmid and viral-based expression vectors. The control sequence used to drive expression of the disclosed nucleic acid sequences in a mammalian system may be constitutive (driven by any of a variety of promoters, including but not limited to, CMV, SV40, RSV, actin, EF) or inducible (driven by any of a number of inducible promoters including, but not limited to, tetracycline, ecdysone, steroid-responsive). The construction of expression vectors for use in transfecting prokaryotic cells is also well known in the art, and thus can be accomplished via standard techniques. (See, for example, Sambrook, Fritsch, and Maniatis, in: Molecular Cloning, A Laboratory Manual, Cold Spring Harbor Laboratory Press, 1989; Gene Transfer and Expression Protocols, pp. 109-128, ed. E.J. Murray, The Humana Press Inc., Clifton, N.J.; and the Ambion 1998 Catalog (Ambion, Austin, TX).) The expression vector must be replicable in the host organisms either as an episome or by integration into host chromosomal DNA. In a preferred embodiment, the expression vector comprises a plasmid. However, the invention is intended to include other expression vectors that serve equivalent functions, such as viral vectors.

In another aspect, the present invention provides host cells that have been transfected with the recombinant expression vectors disclosed herein, wherein the host cells can be either prokaryotic or eukaryotic. The cells can be transiently or stably transfected. Such transfection of expression vectors into prokaryotic and eukaryotic cells can be accomplished via any technique known in the art, including but not limited to standard bacterial transformations, calcium phosphate co-precipitation, electroporation, or liposome mediated-, DEAE dextran mediated-, polycationic mediated-, or viral mediated transfection. (See, for example, Molecular Cloning: A Laboratory Manual (Sambrook, et al., 1989, Cold Spring Harbor Laboratory Press); Culture of Animal Cells: A Manual of Basic Technique, 2nd Ed. (R.I. Freshney, 1987, Liss, Inc., New York, NY).) A method of producing a polypeptide according to the invention is an additional part of the invention. The method comprises the steps of (a) culturing a host according to this aspect of the invention under conditions conducive to the expression of the polypeptide, and (b) optionally, recovering the expressed polypeptide.

In a further aspect, the present invention provides kits comprising:

(a) one or more of the isolated proteins, multimeric protein assemblies, or nanostructures of the invention;

(b) one or more recombinant nucleic acids of the invention;

(c) one or more recombinant expression vectors comprising recombinant nucleic acids of the invention; and/or

(d) one or more recombinant host cells comprising recombinant expression vectors of the invention.

Nanostructure and Protein Examples

Two distinct example tetrahedral architectures have been considered in detail: the T33 architecture described above and the T32 architecture shown in Figures 3 and 4F, in which the materials are formed from four trimeric and six dimeric building blocks aligned along the three-fold and two-fold tetrahedral symmetry axes. In an experiment, all pairwise combinations of a set of 1,161 dimeric and 200 trimeric protein building blocks of known structure were docked in the T32 and T33 architectures. This resulted in a large set of potential novel nanomaterials: 232,200 and 19,900 docked protein pairs, respectively, with a given pair often yielding several distinct promising docked configurations. Interface sequence design calculations were carried out on the 1,000 highest scoring docked configurations in each architecture, and the designs were evaluated based on the predicted binding energy, shape complementarity, size, and number of buried unsatisfied hydrogen bonding groups (vide supra). After filtering on these criteria, 30 T32 and 30 T33 materials were selected for experimental characterization. The 60 designs were derived from 39 distinct trimeric and 19 dimeric proteins, and contained an average of 19 amino acid mutations per pair of subunits compared to the native sequences. The designed interfaces reside mostly on elements of secondary structure, both α-helices and β-strands, with nearby loops often making minor contributions.
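The docked-pair counts quoted above can be reproduced with simple combinatorics, which also clarifies how the pairs appear to have been enumerated. The sketch below assumes that a T32 pair is any dimer combined with any trimer, and that a T33 pair is an unordered pair of two different trimers; the second assumption is inferred from the reported number rather than stated in the text, and the variable names are hypothetical.

```python
from math import comb

n_dimers = 1161   # dimeric building blocks of known structure
n_trimers = 200   # trimeric building blocks of known structure

# T32: one trimer paired with one dimer.
t32_pairs = n_trimers * n_dimers

# T33: two different trimers, order not mattering (assumed from the count).
t33_pairs = comb(n_trimers, 2)

print(t32_pairs, t33_pairs)
```

Both values agree with the figures reported in the text (232,200 and 19,900), which supports reading the dimer set size as 1,161.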

Five example designed interfaces are shown in Figures 2A through 2E, with Figure 2A showing a designed interface for T32-28, Figure 2B showing a designed interface for T33-09, Figure 2C showing a designed interface for T33-15, Figure 2D showing a designed interface for T33-21, and Figure 2E showing a designed interface for T33-28. In each of Figures 2A through 2E, the designed interface of each component is shown at left or right, and a side view of the interface as a whole is shown at center. Each image is oriented such that a vector originating at the center of the tetrahedral material and passing through the center of mass of the designed interface would pass vertically through the center of the image. The side chains of all amino acids considered as "interface positions" are shown in stick representation; these positions were allowed to repack and minimize during the interface design procedure. The alpha carbon atoms of positions that were mutated during design are shown as spheres, and the mutations are indicated by the labels. To highlight the morphologies of the contacting surfaces, atoms within 5 Å of the opposite building block are shown in semi-transparent surface representation.

Synthetic genes encoding each designed pair of proteins were cloned in tandem in a single expression vector to allow inducible co-expression in E. coli. Polyacrylamide gel electrophoresis (PAGE) under non-denaturing (native) conditions was used to rapidly screen the assembly state of the designed proteins in clarified cell lysates. Several designed protein pairs yielded single bands that migrated more slowly than the wild-type proteins from which they were derived, suggesting assembly to higher-order species. These proteins were subcloned to introduce a hexahistidine tag at the C terminus of one of the two subunits and purified by nickel affinity chromatography and size exclusion chromatography (SEC). Five pairs of designed proteins, one T32 design (T32-28) and four T33 designs (T33-09, T33-15, T33-21, and T33-28), co-purified off of the nickel column and yielded dominant peaks at the expected size of approximately 24 subunits when analyzed by SEC, such as shown in Figure 6A. Figure 6A shows SEC chromatograms of the designed pairs of proteins (solid lines) and the wild-type oligomeric proteins from which they were derived (dashed and dotted lines). The co-expressed designed proteins elute at the volumes expected for the target 24-subunit nanomaterials, while the wild-type proteins elute as dimers or trimers. The T33-15 in vitro panel shows chromatograms for the individually produced and purified designed components (T33-15A and T33-15B [dashed and dotted lines]) as well as a stoichiometric mixture of the two components (solid line).

Figure 6B shows a native PAGE analysis of in vitro-assembled T32-28 (left panel) and T33-15 (right panel) in cell lysates. In Figure 6B, lysates of the co-expressed design components (lanes 5-6) contain slowly migrating species (arrows) not present in lysates of the wild-type and individually expressed components (lanes 1-4). Mixing equal volumes (e.v.) of crude lysates containing the individual designed components yields the same assemblies (lane 7), although some unassembled building blocks remain due to unequal levels of expression (particularly for T33-15). When the differences in expression levels are accounted for by mixing adjusted volumes of lysates (a.v.), more efficient assembly is observed (lane 8).

Figures 6C-6G respectively show native PAGE analyses of in vitro-assembled T32-28, T33-09, T33-15, T33-21, and T33-28 in cell lysates. In Figures 6C-6G, lane 1 is from cells expressing the wild-type scaffold for component A and lane 2 the wild-type scaffold for component B. Lanes 3-4 are from cells expressing the individual design components and lanes 5-6 the co-expressed components. Lanes 7-8 are from samples mixed as crude lysates (cr.e.v. or cr.a.v.), while lanes 9-10 are from samples mixed as cleared lysates (cl.e.v. or cl.a.v.). Lanes 7 and 9 are from lysates mixed with equal volumes (cr.e.v. or cl.e.v.), while lanes 8 and 10 are from lysates mixed with adjusted volumes (cr.a.v. or cl.a.v.). Lane 5 is from cells expressing the C-terminally A1-tagged constructs; all other lanes are from cells expressing the C-terminally His-tagged constructs. An arrow is positioned next to each gel indicating the migration of 24-subunit assemblies and the gel regions containing unassembled building blocks are bracketed. Each gel was stained with GelCode™ Blue (Thermo Scientific).

The ability of each material to assemble in vitro was tested by expressing the two components in separate E. coli cultures and mixing them at various points after cell lysis. Native PAGE revealed that in two cases, T33-15 shown in Figure 6E and T32-28 shown in Figure 6C, the two separately expressed components efficiently assembled to the designed materials in vitro when equal volumes of cell lysates were mixed as indicated in Figures 6B, 6C, and 6E. Adjusting the volume of each lysate in the mixture to account for differences in the level of soluble expression of the two components allowed for more quantitative assembly. In the case of T33-15, the two components of the material could also be purified independently: T33-15A and T33-15B each eluted from the SEC column as trimers in isolation. After mixing the two purified components in a 1:1 molar ratio and incubating for two hours at room temperature, the mixture eluted from the SEC column as predominantly the 24mer assembly, with small amounts of residual trimeric building blocks remaining as shown in Figure 6A. The assembly of our designed materials can thus be controlled by simply mixing the two components.
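The volume adjustment described above can be illustrated with a short calculation. The following is a minimal sketch only: the function name, the use of relative soluble concentrations as inputs, and the example numbers are hypothetical and are not taken from the experiments described herein.

```python
def adjusted_volumes(conc_a, conc_b, total_volume):
    """Return (vol_a, vol_b) that contribute equal amounts of components A and B.

    conc_a and conc_b are the relative soluble concentrations of the two
    components in their respective lysates (any consistent unit). These
    inputs are hypothetical placeholders for illustration.
    """
    if conc_a <= 0 or conc_b <= 0:
        raise ValueError("concentrations must be positive")
    # Equal amounts require conc_a * vol_a == conc_b * vol_b,
    # subject to vol_a + vol_b == total_volume.
    vol_a = total_volume * conc_b / (conc_a + conc_b)
    vol_b = total_volume - vol_a
    return vol_a, vol_b


# Hypothetical example: component A is expressed at twice the level of B,
# so half as much A lysate as B lysate is used.
vol_a, vol_b = adjusted_volumes(conc_a=2.0, conc_b=1.0, total_volume=30.0)
```

Mixing equal volumes corresponds to ignoring the concentration difference, which leaves the more highly expressed component in excess.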

Figures 7A and 7B show electron micrographs of designed two-component protein nanomaterials. Figure 7A shows negative stain electron micrographs of the five designed materials, which confirmed that they assemble specifically to the target architectures (Figure 3). For each material, fields of remarkably monodisperse particles of the expected size and symmetry were observed, confirming the homogeneity of the materials suggested by SEC. Particle averaging yielded images that recapitulate features of the computational design models at low resolution. For example, class averages of T33-09 revealed roughly square or triangle-shaped structures with well-defined internal cavities that closely resemble projections calculated from the computational design model along its two-fold and three-fold axes as shown in Figure 3, T33-09 inset.

Figure 7B shows electron micrographs of in vitro-assembled T33-15 (unpurified) and of T33-15A and T33-15B in isolation. Negative stain electron micrographs of independently purified T33-15 components (left and middle panels) and unpurified, in vitro-assembled T33-15 (right panel) are shown to scale (scale bar: 25 nm). Micrographs of T33-15 assembled in vitro as described above were indistinguishable from those of co-expressed T33-15 as shown in Figures 7A and 7B, demonstrating that the same material is obtained using both methods.

X-ray crystal structures were solved for four of the designed materials (T32-28, T33-15, T33-21, and T33-28) to resolutions ranging from 2.1 to 4.5 Å. Table 3 provides crystallographic statistics for T32-28, T33-15, and T33-28 data collection and refinement, where statistics in parentheses refer to the highest resolution shell.

| Statistic | T32-28 | T33-15 | T33-28 |
| --- | --- | --- | --- |
| Resolution range (Å) | 93.93 - 4.5 (4.66 - 4.5) | 75.49 - 2.8 (2.901 - 2.8) | 94.21 - 3.5 (3.625 - 3.5) |
| Space group | P3₁21 | F432 | P2₁ |
| Unit cell [a/b/c (Å)] | 246.01 246.01 290.94 | 213.52 213.52 213.52 | 124.91 189.25 376.83 |
| Unit cell [α/β/γ (°)] | 90 90 120 | 90 90 90 | 90 90.02 90 |
| Total reflections | 436516 (44096) | 146590 (14934) | 808494 (85695) |
| Unique reflections | 59814 (5903) | 10783 (1045) | 217956 (21869) |
| Multiplicity | 7.3 (7.5) | 13.6 (14.3) | 3.7 (3.9) |
| Completeness (%) | 98.31 (97.93) | 99.91 (100.00) | 98.80 (99.57) |
| Mean I/sigma(I) | 13.20 (2.17) | 19.80 (2.16) | 8.95 (2.39) |
| Wilson B-factor | 184 | 79.32 | 90.49 |
| R-merge | 0.1383 (0.9457) | 0.1234 (1.767) | 0.144 (0.6014) |
| R-meas | 0.1492 | 0.1282 | 0.1683 |
| CC1/2 | 0.997 (0.586) | 0.999 (0.718) | 0.994 (0.685) |
| CC* | 0.999 (0.859) | 1 (0.914) | 0.998 (0.902) |
| R-work | 0.2971 (0.3574) | 0.2020 (0.3181) | 0.2614 (0.3126) |
| R-free | 0.3429 (0.3937) | 0.2515 (0.3765) | 0.2987 (0.3639) |
| Number of non-hydrogen atoms | 20307 | 2011 | 88861 |
| Macromolecule atoms | 20307 | 2008 | 88861 |
| Ligand atoms | 0 | 1 | 0 |
| Water atoms | 0 | 2 | 0 |
| Protein residues | 4075 | 285 | 12686 |
| RMS(bonds) | 0.003 | 0.003 | 0.002 |
| RMS(angles) | 0.55 | 0.77 | 0.49 |
| Ramachandran favored (%) | 97 | 98 | 97 |
| Ramachandran outliers (%) | 0.15 | 0 | 0 |
| Clashscore | 0.89 | 2.26 | 4.61 |
| Average B-factor | 216.2 | 72.6 | 91.7 |
| Average B-factor, macromolecules | 216.2 | 72.6 | 91.7 |
| Average B-factor, ligands | | 111.5 | |
| Average B-factor, solvent | | 56.6 | |

Table 3
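The CC* values in Table 3 are numerically consistent with the commonly used relation CC* = sqrt(2·CC1/2 / (1 + CC1/2)). The sketch below assumes that relation; the function name is hypothetical, and the data processing software used for the reported statistics is not reproduced here.

```python
import math


def cc_star(cc_half):
    """Estimate CC* from CC1/2 using CC* = sqrt(2*CC1/2 / (1 + CC1/2)).

    Assumes 0 < cc_half <= 1, which holds for the values in Table 3.
    """
    if not 0.0 < cc_half <= 1.0:
        raise ValueError("cc_half must be in the interval (0, 1]")
    return math.sqrt(2.0 * cc_half / (1.0 + cc_half))


# Highest-resolution-shell CC1/2 values from Table 3
# (T32-28, T33-15, T33-28).
shell_cc_half = [0.586, 0.718, 0.685]
shell_cc_star = [cc_star(value) for value in shell_cc_half]
```

Small differences in the last digit relative to the tabulated CC* values can arise because the tabulated CC1/2 values are themselves rounded.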

Table 4 shows crystallographic statistics for T33-21 data collection and refinement, where statistics in parentheses refer to the highest resolution shell.

| Statistic | T33-21 R32 (PDB ID 4NWP) | T33-21 F4₁32 (PDB ID 4NWQ) |
| --- | --- | --- |
| Wavelength (Å) | 1.0393 | 0.9716 |
| Resolution range (Å) | 93.78 - 2.1 (2.175 - 2.1) | 96.23 - 2.8 (2.9 - 2.8) |
| Space group | R32 | F4₁32 |
| Unit cell [a/b/c (Å)] | 113.35 113.35 634.88 | 272.18 272.18 272.18 |
| Unit cell [α/β/γ (°)] | 90 90 120 | 90 90 90 |
| Total reflections | 901047 (89024) | 431476 (43290) |
| Unique reflections | 92425 (9127) | 21830 (2129) |
| Multiplicity | 9.7 (9.8) | 19.8 (20.3) |
| Completeness (%) | 99.94 (99.97) | 99.99 (99.95) |
| Mean I/sigma(I) | 14.46 (2.48) | 20.89 (3.14) |
| Wilson B-factor | 37.68 | 69 |
| R-merge | 0.1123 (1.179) | 0.1215 (1.203) |
| R-meas | 0.1187 | 0.1248 |
| CC1/2 | 0.998 (0.749) | 0.999 (0.878) |
| CC* | 1 (0.925) | 1 (0.967) |
| R-work | 0.1879 (0.3925) | 0.1815 (0.3340) |
| R-free | 0.2183 (0.4478) | 0.1958 (0.3804) |
| Number of non-hydrogen atoms | 8248 | 2112 |
| Macromolecule atoms | 7882 | 2041 |
| Ligand atoms | 141 | 55 |
| Water atoms | 225 | 16 |
| Protein residues | 1046 | 269 |
| RMS(bonds) | 0.004 | 0.001 |
| RMS(angles) | 0.67 | 0.41 |
| Ramachandran favored (%) | 100 | 99 |
| Ramachandran outliers (%) | 0 | 0 |
| Clashscore | 1.87 | 1.2 |
| Average B-factor | 42.5 | 73.1 |
| Average B-factor, macromolecules | 42.2 | 72.5 |
| Average B-factor, ligands | 64.5 | 98.8 |
| Average B-factor, solvent | 40.6 | 64 |

In the provided cases, the structures can reveal that the inter-building block interfaces were designed with high accuracy: comparing a pair of chains from each structure to the computationally designed model yields backbone root mean square deviations (RMSD) between 0.5 and 1.2 Å, as indicated on the right side of Figure 8 and Table 5 below.

For Table 5, global RMSDs were calculated over all 24 subunits of each design model and corresponding subunits in each crystal structure and 2 chain RMSDs were calculated over chains A and B of each design model and corresponding subunits in each crystal structure. 24 subunits composing one complete cage were derived from each crystal structure as indicated and the chains renamed to match the corresponding names in the design models. In the case of T33-28, four different sets of RMSD calculations were carried out; one for each of the four cages contained in the asymmetric unit of 4NWR.
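The RMSD values referred to above are root mean square deviations over paired atomic coordinates. The following is a minimal sketch of that final step only. It assumes the two coordinate sets have already been optimally superimposed and paired atom by atom; the superposition procedure and software actually used are not specified here, and the function name and example coordinates are hypothetical.

```python
import math


def rmsd(coords_model, coords_crystal):
    """Root mean square deviation between two paired lists of (x, y, z) points.

    Assumes both structures are already superimposed and that the i-th
    point in each list is the same backbone atom.
    """
    if len(coords_model) != len(coords_crystal) or not coords_model:
        raise ValueError("coordinate lists must be non-empty and of equal length")
    squared_sum = 0.0
    for (xm, ym, zm), (xc, yc, zc) in zip(coords_model, coords_crystal):
        squared_sum += (xm - xc) ** 2 + (ym - yc) ** 2 + (zm - zc) ** 2
    return math.sqrt(squared_sum / len(coords_model))


# Hypothetical two-atom example in which every atom is displaced by 1 Å.
model = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
crystal = [(0.0, 0.0, 1.0), (1.0, 0.0, 1.0)]
example_rmsd = rmsd(model, crystal)
```

A two-chain RMSD and a 24-subunit RMSD differ only in which atoms are included in the paired lists and in the superposition.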

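In the structures with resolutions that permit detailed analysis of side chain configurations (T33-15 and two independent crystal forms of T33-21), 87/113 side chains at the designed interfaces can adopt the predicted conformations as indicated in Tables 6 and 7 below. Table 6 shows a side chain chi value comparison of the T33-15 crystal structure (PDB ID 4NWO) with the design model. The numbers reported are the differences in the value of each side chain chi angle for each amino acid resolved in the crystal structure.

| Residue | Δchi1 | Δchi2 | Δchi3 | Δchi4 | Δchi5 |
| --- | --- | --- | --- | --- | --- |
| I9 | 1.6 | 4.2 | | | |
| T10 | 4.8 | | | | |
| V11 | 1.1 | | | | |
| N12 | -0.8 | -3.4 | | | |
| S13 | -8.3 | | | | |
| T15 | -0.5 | | | | |
| P16 | 2.0 | -0.7 | -0.9 | 2.4 | |
| T17 | 135.3 | | | | |
| S18 | 116.3 | | | | |
| I20 | 1.3 | -13.5 | | | |
| I21 | 4.8 | -1.4 | | | |
| I24 | -7.1 | -4.4 | | | |
| L25 | -1.1 | -6.8 | | | |
| E28 | -103.8 | -14.8 | 79.8 | | |
| K29 | -16.6 | - | - | - | |
| E32 | - | - | - | | |
| Q64 | 90.3 | 145.9 | -25.5 | | |
| I65 | -1.8 | 1.7 | | | |
| R86 | -5.2 | -8.2 | -1.3 | -3.6 | 7.5 |
| L108 | 1.6 | -0.6 | | | |
| S109 | 3.9 | | | | |
| E110 | - | - | - | | |
| T140 | - | | | | |
| I143 | 2.2 | -9.0 | | | |
| K146 | 100.0 | 3.9 | - | - | |
| I149 | -98.7 | -11.1 | | | |
| L150 | 0.8 | -9.1 | | | |
| N153 | 13.3 | 4.6 | | | |
| L154 | 3.5 | 3.4 | | | |
| E226 | - | - | - | | |
| S227 | -1.3 | | | | |
| K229 | - | - | - | - | |
| E230 | - | - | - | | |
| V231 | - | | | | |
| D255 | -9.8 | 4.6 | | | |
| P256 | -4.9 | 3.3 | -0.2 | -3.0 | |
| S258 | -10.3 | | | | |
| S260 | 0.5 | | | | |
| D261 | 2.4 | 18.8 | | | |
| L264 | 103.8 | 108.6 | | | |
| N285 | -6.7 | -7.6 | | | |
| M288 | -8.4 | 2.0 | -10.2 | | |
| I289 | -4.5 | -3.3 | | | |
| Pass | 29 | 23 | 4 | 3 | 1 |
| Fail | 7 | 2 | 2 | 0 | 0 |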

Table 6

In Table 6, residue numbers refer to positions in the T33-15 design model, the "pass" values are the number of residues where |Δchi| < 25°, and the "fail" values are the number of residues where |Δchi| > 25°. Residues with missing atoms in the crystal structure, for which a Δchi value could not be determined, are indicated with a dash. All Δchi values are reported in degrees.
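The pass and fail tallies can be reproduced from a column of chi differences as sketched below. This is a minimal illustration under stated assumptions: the function names are hypothetical, undetermined values (dashes) are represented as None, differences are wrapped into the interval (-180°, 180°] before comparison, and a difference of exactly 25° is counted as a pass, a boundary case the text above does not address.

```python
def wrap_angle(delta):
    """Wrap an angular difference in degrees into the interval (-180, 180]."""
    wrapped = (delta + 180.0) % 360.0 - 180.0
    return 180.0 if wrapped == -180.0 else wrapped


def tally_chi(delta_chis, threshold=25.0):
    """Return (passes, fails) for a list of chi differences in degrees.

    Entries equal to None (undetermined, shown as a dash in the tables)
    are skipped. The 25 degree default follows the threshold stated above.
    """
    passes = 0
    fails = 0
    for delta in delta_chis:
        if delta is None:
            continue
        if abs(wrap_angle(delta)) > threshold:
            fails += 1
        else:
            passes += 1
    return passes, fails


# The chi3 column of Table 6 (P16, E28, K29, E32, Q64, R86, E110, K146,
# E226, K229, E230, P256, M288); dashes are given as None.
chi3_column = [-0.9, 79.8, None, None, -25.5, -1.3, None, None,
               None, None, None, -0.2, -10.2]
chi3_pass, chi3_fail = tally_chi(chi3_column)
```

For this column the tally gives 4 passes and 2 fails, matching the chi3 entries in the "Pass" and "Fail" rows of Table 6.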

Table 7 shows side chain chi value comparison of T33-21 crystal structures (PDB IDs 4NWP and 4NWQ) with the design model.

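| Residue | 4NWP Δchi1 | 4NWP Δchi2 | 4NWP Δchi3 | 4NWP Δchi4 | 4NWP Δchi5 | 4NWQ Δchi1 | 4NWQ Δchi2 | 4NWQ Δchi3 | 4NWQ Δchi4 | 4NWQ Δchi5 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| T109 | -7.4 | | | | | -9.8 | | | | |
| R110 | -102.3 | 14.3 | -4.1 | 39.6 | 0.1 | -106.9 | 19.8 | -1.9 | 17.6 | 0.4 |
| I114 | -6.7 | -20.3 | | | | -4.2 | 4.6 | | | |
| E117 | 0.3 | 168.2 | -55.5 | | | -16.4 | 160.2 | 5.0 | | |
| L123 | -2.2 | 1.4 | | | | -2.2 | 1.8 | | | |
| D127 | -9.4 | 2.5 | | | | -9.8 | 14.0 | | | |
| D145 | - | - | | | | - | - | | | |
| K175 | 3.5 | - | - | - | | -1.8 | -5.0 | 2.6 | -0.6 | |
| D221 | 13.3 | -21.7 | | | | 19.4 | -17.7 | | | |
| I222 | 109.5 | 110.2 | | | | 100.6 | -9.5 | | | |
| T224 | -4.6 | | | | | -4.7 | | | | |
| R225 | -15.7 | -6.2 | 0.4 | -9.4 | 1.0 | -12.3 | -1.4 | -6.0 | 2.4 | 0.2 |
| T226 | -5.1 | | | | | -1.8 | | | | |
| L227 | 4.5 | 0.5 | | | | 5.3 | 0.3 | | | |
| S231 | 111.7 | | | | | -4.9 | | | | |
| C233 | -7.9 | | | | | -2.1 | | | | |
| V235 | -1.6 | | | | | -6.8 | | | | |
| E238 | - | - | - | | | -11.4 | 0.6 | 16.9 | | |
| E258 | 14.0 | -7.2 | 62.5 | | | -1.1 | -13.0 | 84.3 | | |
| R259 | -4.5 | 3.5 | -52.7 | 73.8 | -1.2 | 2.6 | -5.9 | -120.7 | -85.6 | 0.0 |
| L260 | 20.9 | -12.0 | | | | 6.8 | -3.3 | | | |
| S261 | 0.9 | | | | | 4.1 | | | | |
| Y262 | -11.7 | 2.4 | | | | -5.4 | -6.2 | | | |
| K264 | 12.2 | -1.4 | -9.0 | 7.4 | | 6.6 | -1.7 | -129.1 | 7.8 | |
| R265 | 90.0 | -33.2 | 139.5 | -16.2 | 0.1 | 3.3 | 2.0 | 11.5 | -6.7 | 0.1 |
| Pass | 31 | 23 | 6 | 5 | 6 | 34 | 28 | 12 | 9 | 6 |
| Fail | 6 | 3 | 5 | 3 | 0 | 6 | 4 | 5 | 1 | 0 |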

Table 7

In Table 7, residue numbers refer to positions in the T33-21 design model, the "pass" values are the number of residues where |Δchi| < 25°, and the "fail" values are the number of residues where |Δchi| > 25°. Residues with missing atoms in the crystal structure, for which a Δchi value could not be determined, are indicated with a dash. All Δchi values are reported in degrees.

As intended, the designed interfaces can drive assembly of cage-like nanomaterials that closely match the computational design models: the backbone RMSDs over all 24 subunits in each material range from 1.0 to 2.6 Å. The precise control over interface geometry offered by our method thus enables the design of two-component protein nanomaterials with diverse nanoscale features such as surfaces, pores, and internal volumes with high accuracy.

The method described here can provide a general route to designing multi-component protein-based nanomaterials and molecular machines with programmable structures and functions. The capability to design highly homogeneous protein nanostructures with atomic-level accuracy and controllable assembly can open new opportunities in targeted drug delivery, vaccine design, plasmonics, and other applications that can benefit from the precise patterning of matter on the sub-nanometer to hundred nanometer scale.

Experimental Methods

Amino acid sequences. Enumerated below are the amino acid sequences for the five successful designs that were characterized in detail in this study (T32-28, T33-09, T33-15, T33-21, and T33-28) along with the wild-type proteins from which these designs were derived (referred to by their Protein Data Bank accession numbers followed by the suffix "-wt"). As described in the main text, each designed material comprises a pair of designed proteins. The two components are referred to here by the name of the designed material followed by the suffix "A" or "B". The amino acid sequences of the two C-terminal tags used in this study are also presented.

3lzl-wt (dimeric scaffold for T32-28A) (SEQ ID NO: 41)

MGEVPIGDPKELNGMEIAAVYLQPIEMEPRGIDLAASLADIHLEADIHALKNN PNGFPEGFWMPYLTIAYELKNTDTGAIKRGTLMPMVADDGPHYGANIAMEKDKKG GFGVGNYELTFYISNPEKQGFGRHVDEETGVGKWFEPFKVDYKFKYTGTPK

3n79-wt (trimeric scaffold for T32-28B) (SEQ ID NO: 42)

MSQAIGILELTSIAKGMELGDAMLKSANVDLLVSKTICPGKFLLMLGGDIGAI QQAIETGTSQAGEMLVDSLVLANIHPSVLPAISGLNSVDKRQAVGIVETWSVAACISA ADRAVKGSNVTLVRVHMAFGIGGKCYMVVAGDVSDVNNAVTVASESAGEKGLLV YRSVIPRPHEAMWRQMVEG

1nza-wt (trimeric scaffold for T33-09A) (SEQ ID NO: 43)

MEEVVLITVPSEEVARTIAKALVEERLAACV IVPGLTSIYRWQGEVVEDQEL LLLVKTTTHAFPKLKERVKALHPYTVPEIVALPIAEGNREYLDWLRENTG

1ufy-wt (trimeric scaffold for T33-09B and T33-15B) (SEQ ID NO: 44)

MVRGIRGAITVEEDTPEAIHQATRELLLKMLEANGIQSYEELAAVIFTVTEDLT SAFPAEAARQIGMHRVPLLSAREVPVPGSLPRVIRVLALWNTDTPQDRVRHVYLREA VRLRPDLESAQ

3k6a-wt (trimeric scaffold for T33-15A) (SEQ ID NO: 46)

MSKAKIGIVTVSDRASAGIYEDISGKAIIDTLNDYLTSEWEPIYQVIPDEQDVIE TTLIKMADEQDCCLIVTTGGTGPAKRDVTPEATEAVCDRMMPGFGELMRAESLKFV PTAILSRQTAGLRGDSLIVNLPGKPKSIRECLDAVFPAIPYCIDLMEGPYLECNEAVIKP FRPKAK

1wy1-wt (trimeric scaffold for T33-21A) (SEQ ID NO: 47)

MRITTKVGDKGSTRLFGGEEVWKDSPIIEANGTLDELTSFIGEAKHYVDEEMK GILEEIQNDIYKIMGEIGSKGKIEGISEERIKWLEGLISRYEEMVNLKSFVLPGGTLESA KLDVCRTIARRAERKVATVLREFGIGKEALVYLNRLSDLLFLLARVIEIEKNKLKEVR S

3e6q-wt (trimeric scaffold for T33-21B) (SEQ ID NO: 48)

MPHLVIEATANLRLETSPGELLEQANAALFASGQFGEADIKSRFVTLEAYRQG TAAVERAYLHACLSILDGRDAATRQALGESLCEVLAGAVAGGGEEGVQVSVEVRE MERASYAKRVVARQR

3fuy-wt (trimeric scaffold for T33-28A) (SEQ ID NO: 49)

MESVNTSFLSPSLVTIRDFDNGQFAVLRIGRTGFPADKGDIDLCLDKMKGVRD AQQSIGDDTEFGFKGPHIRIRCVDIDDKHTYNAMVYVDLIVGTGASEVERETAEELA KEKLRAALQVDIADEHSCVTQFEMKLREELLSSDSFHPDKDEYYKDFL

3fwu-wt (trimeric scaffold for T33-28B) (SEQ ID NO: 50)

MPVIQTFVSTPLDHHKRENLAQVYRAVTRDVLGKPEDLVMMTFHDSTPMHF FGSTDPVACVRVEALGGYGPSEPEKVTSIVTAAITKECGIVADRIFVLYFSPLHCGW GTNF

T32-28A (SEQ ID NO: 1)

MGEVPIGDPKELNGMEIAAVYLQPIEMEPRGIDLAASLADIHLEADIHALKNN PNGFPEGFWMPYLTIAYALANADTGAIKTGTLMPMVADDGPHYGANIAMEKDKKG GFGVGTYALTFLISNPEKQGFGRHVDEETGVGKWFEPFVVTYFFKYTGTPK

T32-28B (SEQ ID NO: 2)

MSQAIGILELTSIAKGMELGDAMLKSANVDLLVSKTISPGKFLLMLGGDIGAI QQAIETGTSQAGEMLVDSLVLANIHPSVLPAISGLNSVDKRQAVGIVETWSVAACISA ADLAVKGSNVTLVRVHMAFGIGGKCYMVVAGDVLDVAAAVATASLAAGAKGLLV YASIIPRPHEAMWRQMVEG

T33-09A (SEQ ID NO: 3)

MEEVVLITVPSALVAVKIAHALVEERLAACVNIVPGLTSIYRWQGSVVSDHEL LLLVKTTTHAFPKLKERVKALHPYTVPEIVALPIAEGNREYLDWLRENTG

T33-09B (SEQ ID NO: 4)

MVRGIRGAITVEEDTPAAILAATIELLLKMLEANGIQSYEELAAVIFTVTEDLT SAFPAEAARLIGMHRVPLLSAREVPVPGSLPRVIRVLALWNTDTPQDRVRHVYLNEA VRLRPDLESAQ

T33-15A (SEQ ID NO: 5)

MSKAKIGIVTVSDRASAGITADISGKAIILALNLYLTSEWEPIYQVIPDEQDVIE TTLIKMADEQDCCLIVTTGGTGPAKRDVTPEATEAVCDRMMPGFGELMRAESLKEV PTAILSRQTAGLRGDSLIVNLPGDPASISDCLLAVFPAIPYCIDLMEGPYLECNEAMIK PFRPKAK

T33-15B (SEQ ID NO: 6)

MVRGIRGAIT V SDTPT SIIIATILLLEKMLEANGIQ S YEELAA VIFT VTEDLT S A FPAEAARQIGMHRVPLLSAREVPVPGSLPRVIRVLALWNTDTPQDRVRHVYLSEAVR LRPDLESAQ

T33-21A (SEQ ID NO: 7)

MRITTKVGDKGSTRLFGGEEVWKDSPIIEANGTLDELTSFIGEAKHYVDEEMK GILEEIQNDIYKIMGEIGSKGKIEGISEERIAWLLKLILRYMEMVNLKSFVLPGGTLESA KLDVCRTIARRALRKVLTVTREFGIGAEAAAYLLALSDLLFLLARVIEIEKNKLKEVR S

T33-21B (SEQ ID NO: 8)

MPHLVIEATANLRLETSPGELLEQANKALFASGQFGEADIKSRFVTLEAYRQG TAAVERAYLHACLSILDGRDIATRTLLGASLCAVLAEAVAGGGEEGVQVSVEVREM ERLSYAKRVVARQR

T33-28A (SEQ ID NO: 9)

MESVNTSFLSPSLVTIRDFDNGQFAVLRIGRTGFPADKGDIDLCLDKMIGVRA AQIFLGDDTEDGFKGPHIRIRCVDIDDKHTYNAMVYVDLIVGTGASEVERETAEEEA KLALRVALQVDIADEHSCVTQFEMKLREELLSSDSFHPDKDEYYKDFL

T33-28B (SEQ ID NO: 10)

MPVIQTFVSTPLDHHKRLLLAIIYRIVTRVVLGKPEDLVMMTFHDSTPMHFFG STDPVACVRVEALGGYGPSEPEKVTSIVTAAITAVCGIVADRIFVLYFSPLHCGWNGT NF

A1 tag (for fluorescent labeling and lysate screening) (SEQ ID NO: 45)

LEGGDSLDMLEWSL

hexahistidine tag (for purification) (SEQ ID NO: 51)

LEHHHHHH
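Each designed sequence above can be compared position by position with its wild-type scaffold to enumerate the substitutions introduced by design. The following is a minimal sketch of such a comparison; the function name is hypothetical, the comparison assumes equal-length sequences with no insertions or deletions, and only the first 53 residues of 3n79-wt and T32-28B as listed above are used as example input.

```python
def list_substitutions(wild_type, designed):
    """Return substitutions as strings such as 'C38S' (wild type, position, design).

    Whitespace is removed before comparison. Assumes the two sequences
    have the same length and are already aligned.
    """
    wt = "".join(wild_type.split()).upper()
    ds = "".join(designed.split()).upper()
    if len(wt) != len(ds):
        raise ValueError("sequences must have the same length")
    return [
        f"{wt_residue}{position}{ds_residue}"
        for position, (wt_residue, ds_residue) in enumerate(zip(wt, ds), start=1)
        if wt_residue != ds_residue
    ]


# First 53 residues of 3n79-wt (SEQ ID NO: 42) and T32-28B (SEQ ID NO: 2).
WT_3N79_NTERM = "MSQAIGILELTSIAKGMELGDAMLKSANVDLLVSKTICPGKFLLMLGGDIGAI"
T32_28B_NTERM = "MSQAIGILELTSIAKGMELGDAMLKSANVDLLVSKTISPGKFLLMLGGDIGAI"

substitutions = list_substitutions(WT_3N79_NTERM, T32_28B_NTERM)
```

Over this segment the only difference is a cysteine to serine substitution at position 38.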

Protein expression, lysate screening, and purification. Codon-optimized genes encoding the designed and corresponding wild-type proteins were either purchased (Gen9) or constructed from sets of purchased oligonucleotides (Integrated DNA Technologies) by recursive PCR. All genes were cloned using the Gibson assembly method into a variant of the pET29b expression vector (Novagen) that had been digested by NdeI and XhoI restriction endonucleases. The genes encoding the wild-type proteins were each cloned into the vector individually, while the genes encoding the designed proteins were cloned in pairs along with the following intergenic region derived from the pETDuet-1 vector (Novagen):

5'-TAATGCTTAAGTCGAACAGAAAGTAATCGTATTGTACACGGCCGCATAA TCGAAATTAATACGACTCACTATAGGGGAATTGTGAGCGGATAACAATTCCCCAT CTTAGTATATTAGTTAAGTATAAGAAGGAGATATACAT-3' (SEQ ID NO: 52)

The constructs for the designed protein pairs thus possessed the following set of elements from 5' to 3': NdeI restriction site, upstream gene, intergenic region, downstream gene, XhoI restriction site. The upstream genes encoded components denoted with the suffix "A" above; the downstream genes encoded the "B" components. This allowed for co-expression of the designed protein pairs in which both the upstream and downstream gene had their own T7 promoter/lac operator and ribosome binding site.
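The 5' to 3' element order described above can be sketched as a simple concatenation. This is a schematic illustration only: the function name and the short placeholder sequences are hypothetical, the recognition sequences CATATG (NdeI) and CTCGAG (XhoI) are the standard sites for those enzymes, and details such as the overlap between the NdeI site and the start codon, stop codons, and tag sequences are deliberately ignored.

```python
NDEI_SITE = "CATATG"  # NdeI recognition sequence
XHOI_SITE = "CTCGAG"  # XhoI recognition sequence


def build_coexpression_insert(upstream_gene, intergenic_region, downstream_gene):
    """Concatenate the construct elements in 5' to 3' order.

    Whitespace is removed and sequences are upper-cased. Inputs are
    plain DNA strings; any character other than A, C, G, or T is rejected.
    """
    elements = [NDEI_SITE, upstream_gene, intergenic_region,
                downstream_gene, XHOI_SITE]
    cleaned = ["".join(element.split()).upper() for element in elements]
    for element in cleaned:
        if set(element) - set("ACGT"):
            raise ValueError("sequences may contain only A, C, G, and T")
    return "".join(cleaned)


# Hypothetical placeholder sequences, not the actual genes or SEQ ID NO: 52.
insert = build_coexpression_insert(
    upstream_gene="atgaaa",       # stands in for the "A" component gene
    intergenic_region="ttt",      # stands in for the intergenic region
    downstream_gene="atgccc",     # stands in for the "B" component gene
)
```

In practice the intergenic region of SEQ ID NO: 52 would be supplied in place of the placeholder, with whitespace from line wrapping removed by the function.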

The pET29b variant used for the initial constructs appended the A1 peptide tag (vide supra) to the C terminus of each wild-type gene and to the downstream gene of each designed protein pair for fluorescent labeling via the AcpS system. For purification purposes, vectors encoding C-terminally His-tagged versions of the designed protein pairs, the individual protein components, and the corresponding wild-types were subsequently constructed by subcloning (via Gibson assembly) into the standard pET29b vector between the NdeI and XhoI restriction sites. As with the A1 peptide tag, the hexahistidine tag was only appended to the downstream component in the co-expression constructs.

Expression plasmids were transformed into BL21 (DE3) E. coli cells. Cells were grown in LB medium supplemented with 50 mg L⁻¹ of kanamycin (Sigma) at 37 °C until an OD₆₀₀ of 0.8 was reached. Protein expression was induced by addition of 0.5 mM isopropyl-thio-β-D-galactopyranoside (Sigma) and allowed to proceed for either 5 h at 22 °C or 3 h at 37 °C before cells were harvested by centrifugation.

The designed proteins were screened for assembly by subjecting cleared lysates to native (non-denaturing) PAGE as described previously in the context of at least Figures 6A-6G. Single bands for each of the five successful materials were visible when stained with GelCode Blue (Thermo Scientific). In these initial screens, all constructs were tested under both the 22 °C and the 37 °C expression conditions. Based on these results, in all subsequent work T32-28, T33-28, and the corresponding wild-type proteins were expressed at 22 °C, while T33-09, T33-15, T33-21, and the corresponding wild-type proteins were expressed at 37 °C.

For purification, cells were lysed by sonication in 50 mM TRIS pH 8.0, 250 mM NaCl, 1 mM DTT, 20 mM imidazole supplemented with 1 mM phenylmethanesulfonyl fluoride, and the lysates were cleared by centrifugation and filtered through 0.22 μm filters (Millipore). The proteins were purified from the filtered supernatants by nickel affinity chromatography on HisTrap™ HP columns (GE Life Sciences) and eluted using a linear gradient of imidazole (0.02-0.5 M). Fractions containing pure protein(s) of interest were pooled, concentrated using centrifugal filter devices (Sartorius Stedim Biotech), and further purified on a Superdex™ 200 10/300 gel filtration column (GE Life Sciences) using 25 mM TRIS pH 8.0, 150 mM NaCl, 1 mM DTT as running buffer. Gel filtration fractions containing pure protein in the desired assembly state were pooled, concentrated, and stored at room temperature or 4 °C for subsequent use in analytical size exclusion chromatography, in vitro mixing, electron microscopy, and X-ray crystallography.

Analytical size exclusion chromatography. Analytical SEC was performed on a Superdex™ 200 10/300 gel filtration column (GE Life Sciences) using 25 mM TRIS pH 8.0, 150 mM NaCl, 1 mM DTT as the running buffer. The designed materials were loaded onto the column with each component present at a subunit concentration of 50 μM. Individual designed components and wild-type proteins were loaded at a concentration of 50 μM. The apparent molecular weights of the designed proteins were estimated by comparison to the corresponding wild-type proteins and a set of globular protein standards.
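One common way to estimate an apparent molecular weight from globular standards is to fit log10(molecular weight) against elution volume and interpolate. The sketch below assumes that approach, which is not stated to be the procedure used here; the function names and the standard values are hypothetical and chosen only to make the arithmetic easy to follow.

```python
import math


def fit_calibration(standards):
    """Least-squares fit of log10(MW) versus elution volume.

    standards is a list of (elution_volume_mL, molecular_weight_kDa) pairs.
    Returns (slope, intercept) of the fitted line.
    """
    if len(standards) < 2:
        raise ValueError("at least two standards are required")
    volumes = [volume for volume, _ in standards]
    log_mws = [math.log10(mw) for _, mw in standards]
    n = len(standards)
    mean_v = sum(volumes) / n
    mean_l = sum(log_mws) / n
    sxx = sum((v - mean_v) ** 2 for v in volumes)
    if sxx == 0:
        raise ValueError("standards must have distinct elution volumes")
    sxy = sum((v - mean_v) * (l - mean_l) for v, l in zip(volumes, log_mws))
    slope = sxy / sxx
    intercept = mean_l - slope * mean_v
    return slope, intercept


def apparent_mw(elution_volume, slope, intercept):
    """Apparent molecular weight (kDa) at a given elution volume (mL)."""
    return 10 ** (intercept + slope * elution_volume)


# Hypothetical standards lying exactly on a line in log space.
hypothetical_standards = [(10.0, 1000.0), (12.0, 100.0), (14.0, 10.0)]
slope, intercept = fit_calibration(hypothetical_standards)
estimate = apparent_mw(13.0, slope, intercept)
```

Such a calibration is only meaningful within the fractionation range spanned by the standards, and hollow cage-like assemblies may deviate from the behavior of compact globular proteins.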

In vitro mixing. Individual components of the five successful designs were expressed from pET29b vectors encoding C-terminally His-tagged versions of each component (under the same induction conditions outlined above). Lysates containing corresponding pairs of designed components were mixed either immediately following lysis (crude lysates) or after clearance by centrifugation (cleared lysates). Each was mixed with either a one-to-one volumetric ratio or with adjusted volumetric ratios intended to account for observed differences in expression levels of the two components in each designed pair. After incubating for two hours at room temperature, insoluble material was cleared by centrifugation and the samples were subjected to native PAGE analysis. For comparison, these samples were analyzed together with cleared lysates of unmixed components A and B, and cleared lysates from co-expressed A1-tagged designs, co-expressed His-tagged designs, and corresponding His-tagged wild-types. Bands corresponding to the assembled state were clearly visible in the crude lysate mixtures of T32-28 and T33-15. Corresponding bands for T32-28 and T33-15 were also visible in the cleared lysate mixtures, although noticeably less intense in the case of T32-28. It is also noteworthy that while the A1-tagged co-expression construct of T33-09 yielded a visible band for the assembled material, the His-tagged co-expression construct did not. While the His-tagged construct also provided low yield from purification, it did clearly express and assemble (as shown by size exclusion chromatography and electron microscopy). Thus the concentration of the His-tagged assembly appears to be below the detection limit of our native PAGE analysis.

Based on the results from the mixed lysates experiments, T32-28 and T33-15 were additionally subjected to in vitro mixing experiments from purified components. Each of the C-terminally His-tagged components was purified by nickel affinity and gel filtration chromatography, and the purified components were mixed in a 1:1 molar ratio with each component present at a subunit concentration of 50 μM. Following incubation for two hours at room temperature, the mixtures were subjected to analytical size exclusion chromatography. The purifications and size exclusion chromatography were carried out as described above with the exception that 5% (v/v) glycerol was added to all buffers. While T33-15 assembled efficiently from the independently purified components, T32-28 yielded only a small peak for the assembly product. The purified T32-28A component eluted significantly earlier than 3lzl-wt, indicating that lack of assembly in this case may be due to aggregation of the T32-28A component in the absence of T32-28B.

For T32-28A and 3lzl-wt containing samples, DTT was excluded from all buffers and 1 mM CuSO₄ was added to the lysis buffer. This was done in accordance with previous work on the 3lzl-wt protein, which revealed copper binding sites at the dimeric interface and putative copper-dependent dimerization. While T32-28 did yield a native PAGE band and a size exclusion peak corresponding to the 24mer assembly without these modifications to the buffers, the purified assemblies were found to partially dissociate upon dilution (as assessed by size exclusion chromatography). In contrast, lysis and purification with the modified buffers yielded stable assemblies with no detectable disassembly upon dilution.

Negative stain electron microscopy. 2-3 μL of purified T32-28, T33-09, T33-15, T33-21, and T33-28 samples at concentrations ranging from 0.01 mg/mL to 5 mg/mL were applied to negatively glow discharged, carbon coated 200-mesh copper grids (Ted Pella, Inc.), washed with Milli-Q™ water and stained with 0.075% uranyl formate. Grids were visualized for oligomer validation and optimized for data collection. Screening and data collection were performed on a 120 kV Tecnai Spirit™ T12 transmission electron microscope (FEI, Hillsboro, OR). All images were recorded using a Tietz CMOS 4k camera at either 49,000x (T33-21 and T33-28) or 60,000x (T32-28, T33-09 and T33-15) magnification.

Coordinates for 3,910 (T32-28), 29,153 (T33-09), 18,197 (T33-15), 5,478 (T33-21) and 13,715 (T33-28) unique particles were obtained for averaging using either Ximdisp™ or EMAN™. Extracted frames of these particles were used to obtain class averages by refinement in either SPIDER™ or IMAGIC™ using multiple rounds of MSA (multivariate statistical analysis) and MRA (multi-reference alignment). A low-resolution (17-30 Å) volume was obtained from the design .pdb files output by Rosetta3 using SPIDER™ and validated using UCSF Chimera. Back-projection images were obtained by calculation using SPIDER™ on the low-resolution volumes and visualized using WEB.

Separated, purified components (T33-15A and T33-15B) were screened as above. T33-15A and T33-15B were then mixed in a 1:1 ratio, and grids of the mixture were prepared after 5 minutes, 1 hour, and 2 hours at room temperature and screened as above.

Crystallization of T32-28. T32-28 was crystallized with hanging drop vapor diffusion at room temperature. Crystals were formed within four days by mixing 1 μL of 11.7 mg mL⁻¹ protein and 1 μL of a 500 μL well solution containing only 1.675 M D,L-malic acid at pH 7.0. The crystals were cryo-protected in 2.0 M lithium sulfate and soaked for 20 seconds. The crystals diffracted to at least 4.5 Å and the asymmetric unit contained 12 molecules of T32-28A and 12 molecules of T32-28B in space group P3₁21.

Crystallization of T33-15. Crystals of T33-15 were grown as described above within one week by mixing 1 μL of 7.6 mg mL⁻¹ protein and 1 μL of a 500 μL well solution containing 100 mM sodium cacodylate at pH 6.5, 200 mM calcium acetate, and 28% (v/v) PEG 300. Crystals were cryo-protected by successive 30-second soaks in 10 μL solutions of mother liquor with glycerol added at final concentrations of 5%, 10%, 15%, and 20%. The crystals diffracted to at least 2.8 Å and the asymmetric unit contained one molecule each of T33-15A and T33-15B in space group F432.

Crystallization of T33-21 in space groups R32 and F4₁32. T33-21 was crystallized similarly as described above. Crystals grew within three weeks following the mixing of 1 μL of 8.6 mg mL⁻¹ protein and 0.5 μL of a 200 μL well solution containing 100 mM citric acid at pH 5.0 and 800 mM ammonium sulfate. Crystals were cryo-protected with 2.0 M lithium sulfate as described above. The crystals diffracted to at least 2.0 Å and the asymmetric unit contained 4 molecules each of T33-21A and T33-21B in space group R32.

Alternatively, crystals also grew within one week by mixing 0.5 μL of 8.6 mg mL⁻¹ protein and 1 μL of a 200 μL well solution containing 100 mM Bis-Tris at pH 5.5 and 2.12 M ammonium sulfate. Cryo-protection was performed with 2.0 M lithium sulfate as described above. These crystals diffracted to at least 2.6 Å and the asymmetric unit contained one molecule each of T33-21A and T33-21B in space group F4₁32.

Crystallization of T33-28. T33-28 was crystallized as described above. Crystals grew within three days in hanging drops containing 0.5 μL of 15.8 mg mL⁻¹ protein and 0.5 μL of a 200 μL well solution containing 100 mM sodium citrate tribasic dihydrate at pH 5.6, 200 mM ammonium acetate, and 24% (v/v) (+/-)-2-methyl-2,4-pentanediol. Cryo-protection involved passage of the crystal through drops of paratone-N oil until no more mother liquor appeared present around the crystal. The crystals diffracted to at least 3.5 Å and the asymmetric unit contained 48 molecules each of T33-28A and T33-28B in space group P2₁.

Crystallographic data collection and structure determination. Diffraction data sets were collected at the Advanced Photon Source (APS) beamline 24-ID-C equipped with a Pilatus™-6M detector. All data were collected at 100 K. Data were collected for T32-28, T33-15, T33-21 (space group R32), T33-21 (space group F4₁32), and T33-28 at detector distances of 650 mm, 450 mm, 300 mm, 300 mm, and 575 mm; with oscillations of 0.5°, 0.5°, 0.2°, 0.5°, and 0.5°; and at wavelengths of 0.9793 Å, 0.9792 Å, 1.0393 Å, 0.9716 Å, and 0.9793 Å, respectively.

Data reduction, integration, and scaling were performed with XDS/XSCALE™. The program PHASER™ was used to determine all crystal structures by molecular replacement (MR). For T33-15 and T33-21 structures, the MR search models were the original PDB scaffolds for each computationally-designed component. The MR search models for the structures of T33-28 and T32-28 were models of the tetrahedral assemblies with and without side-chain atoms beyond β-carbons, respectively.

The X-ray diffraction data collected for T32-28 underwent additional processing in XSCALE™ to visualize anomalous scattering from copper ions anticipated in the T32-28A subunits. The data was scaled with unmerged Friedel mates and the resultant electron density map was used to calculate an anomalous Fourier map with the refined model in PHENIX™. The anomalous peaks in the calculated map were not used to model copper ions in the final structure due to unmodeled, coordinating side chains. All deposited structure factors used for refinement were scaled with merged Friedel mates.

Crystallographic refinement. All refinement steps were run using the phenix.refine module of PHENIX™. Molecular replacement solutions were first refined with rigid body refinement, and then underwent individual coordinate refinement in addition to other strategies. Refinement strategies were tested comparing grouped and individual atomic displacement parameter (ADP) refinement, translation libration screw-motion (TLS) group definitions, and simulated annealing. Each refinement protocol was iteratively run while the quality of the model between runs was assessed in COOT™ using the 2mFo-DFc map (with unfilled Fobs) and the mFo-DFc difference map. Subsequent cycles of alternating refinement and model adjustment in COOT were performed to obtain the final refined models.

T32-28, T33-15, T33-21 (space group F4₁32), and T33-28 were refined with individual isotropic ADP parameterization with 1 TLS group per polypeptide chain. T32-28 was refined as a model comprised of glycine, alanine, proline, and all other side chains truncated to the β-carbon due to poor electron density visibility in regions occupied by side chains. T33-15 was refined with reference model restraints assigned to T33-15B from chain A of PDB entry 1UFY. T33-21 (space group R32) was refined with individual isotropic ADP parameterization and 3-8 TLS group definitions per chain determined near residual minimization from the TLSMD server.

Model quality was assessed during and after refinement using geometric validation and MolProbity™ tools as a part of the PHENIX™ suite. Structures of T33-15, T33-21, and T33-28 contain 97-100% of the residues within the most favored regions of the Ramachandran plot. Residues in the disallowed regions of the Ramachandran plot are found in T32-28 at positions where the phi and psi angles of the scaffold protein are also disallowed. T32-28, T33-15, and both T33-21 structures have ERRAT scores of 97.0%, 96.6%, 99.4%, and 98.2%, respectively. ERRAT scores indicate the percentage of residues that fall below the 95% confidence limit for erroneous modeling. The large asymmetric unit of the T33-28 structure was inspected with VERIFY3D due to incompatibility with ERRAT, and yielded a passing score, with greater than 80% of residues scoring greater than or equal to 0.2 in the 3D/1D profile. The coordinates of the final models and the merged structure factors have been deposited in the Protein Data Bank with PDB codes 4NW , 4NWO, 4NWR, 4NWP, and 4NWQ.

The particulars shown herein are by way of example and for purposes of illustrative discussion of the preferred embodiments of the present invention only and are presented in the cause of providing what is believed to be the most useful and readily understood description of the principles and conceptual aspects of various embodiments of the invention. In this regard, no attempt is made to show structural details of the invention in more detail than is necessary for the fundamental understanding of the invention, the description taken with the drawings and/or examples making apparent to those skilled in the art how the several forms of the invention may be embodied in practice.

The above definitions and explanations are meant and intended to be controlling in any future construction unless clearly and unambiguously modified in the following examples or when application of the meaning renders any construction meaningless or essentially meaningless. In cases where the construction of the term would render it meaningless or essentially meaningless, the definition should be taken from Webster's Dictionary, 3rd Edition or a dictionary known to those of skill in the art, such as the Oxford Dictionary of Biochemistry and Molecular Biology (Ed. Anthony Smith, Oxford University Press, Oxford, 2004).

As used herein and unless otherwise indicated, the terms "a" and "an" are taken to mean "one", "at least one" or "one or more". Unless otherwise required by context, singular terms used herein shall include pluralities and plural terms shall include the singular.

Unless the context clearly requires otherwise, throughout the description and the claims, the words 'comprise', 'comprising', and the like are to be construed in an inclusive sense as opposed to an exclusive or exhaustive sense; that is to say, in the sense of "including, but not limited to". Words using the singular or plural number also include the plural or singular number, respectively. Additionally, the words "herein," "above" and "below" and words of similar import, when used in this application, shall refer to this application as a whole and not to any particular portions of this application.

The above description provides specific details for a thorough understanding of, and enabling description for, embodiments of the disclosure. However, one skilled in the art will understand that the disclosure may be practiced without these details. In other instances, well-known structures and functions have not been shown or described in detail to avoid unnecessarily obscuring the description of the embodiments of the disclosure. The description of embodiments of the disclosure is not intended to be exhaustive or to limit the disclosure to the precise form disclosed. While specific embodiments of, and examples for, the disclosure are described herein for illustrative purposes, various equivalent modifications are possible within the scope of the disclosure, as those skilled in the relevant art will recognize.

All of the references cited herein are incorporated by reference. Aspects of the disclosure can be modified, if necessary, to employ the systems, functions and concepts of the above references and application to provide yet further embodiments of the disclosure. These and other changes can be made to the disclosure in light of the detailed description.

Specific elements of any of the foregoing embodiments can be combined or substituted for elements in other embodiments. Furthermore, while advantages associated with certain embodiments of the disclosure have been described in the context of these embodiments, other embodiments may also exhibit such advantages, and not all embodiments need necessarily exhibit such advantages to fall within the scope of the disclosure.

The above detailed description describes various features and functions of the disclosed systems, devices, and methods with reference to the accompanying figures. In the figures, similar symbols typically identify similar components, unless context dictates otherwise. The illustrative embodiments described in the detailed description, figures, and claims are not meant to be limiting. Other embodiments can be utilized, and other changes can be made, without departing from the spirit or scope of the subject matter presented herein. It will be readily understood that the aspects of the present disclosure, as generally described herein, and illustrated in the figures, can be arranged, substituted, combined, separated, and designed in a wide variety of different configurations, all of which are explicitly contemplated herein.

With respect to any or all of the ladder diagrams, scenarios, and flow charts in the figures and as discussed herein, each block and/or communication may represent a processing of information and/or a transmission of information in accordance with example embodiments. Alternative embodiments are included within the scope of these example embodiments. In these alternative embodiments, for example, functions described as blocks, transmissions, communications, requests, responses, and/or messages may be executed out of order from that shown or discussed, including substantially concurrent or in reverse order, depending on the functionality involved. Further, more or fewer blocks and/or functions may be used with any of the ladder diagrams, scenarios, and flow charts discussed herein, and these ladder diagrams, scenarios, and flow charts may be combined with one another, in part or in whole.

A block that represents a processing of information may correspond to circuitry that can be configured to perform the specific logical functions of a herein-described method or technique. Alternatively or additionally, a block that represents a processing of information may correspond to a module, a segment, or a portion of program code (including related data). The program code may include one or more instructions executable by a processor for implementing specific logical functions or actions in the method or technique. The program code and/or related data may be stored on any type of computer readable medium such as a storage device including a disk or hard drive or other storage medium.

The computer readable medium may also include non-transitory computer readable media such as computer-readable media that stores data for short periods of time like register memory, processor cache, and random access memory (RAM). The computer readable media may also include non-transitory computer readable media that stores program code and/or data for longer periods of time, such as secondary or persistent long term storage, like read only memory (ROM), optical or magnetic disks, compact-disc read only memory (CD-ROM), for example. The computer readable media may also be any other volatile or nonvolatile storage systems. A computer readable medium may be considered a computer readable storage medium, for example, or a tangible storage device.

Moreover, a block that represents one or more information transmissions may correspond to information transmissions between software and/or hardware modules in the same physical device. However, other information transmissions may be between software modules and/or hardware modules in different physical devices.

Numerous modifications and variations of the present disclosure are possible in light of the above teachings.

Claims

1. An isolated nanostructure, comprising

2. The nanostructure of claim 1, in which the mathematical symmetry group is selected from the group consisting of tetrahedral point group symmetry, octahedral point group symmetry, and icosahedral point group symmetry.

3. The nanostructure of claim 1, in which the mathematical symmetry group is tetrahedral point group symmetry.

4. The nanostructure of any one of claims 1-3, wherein the first multimeric substructure comprises a dimer, trimer, tetramer, or pentamer of the first protein, and wherein the second multimeric substructure comprises a dimer or trimer of the second protein.

5. The nanostructure of any one of claims 1-3, wherein the first multimeric substructure comprises a trimer of the first protein, and wherein the second multimeric substructure comprises a dimer of the second protein.

6. The nanostructure of any one of claims 1-3, wherein the first multimeric substructure comprises a trimer of the first protein, and wherein the second multimeric substructure comprises a trimer of the second protein.

7. The nanostructure of any one of claims 1-6, wherein the first protein and the second protein are between 30-250 amino acids in length.

8. The nanostructure of any one of claims 1-7, wherein each symmetrically repeated instance of the non-natural, non-covalent protein-protein interface between the first multimeric substructure and the second multimeric substructure buries between 1000-2000 Å² of solvent-accessible surface area (SASA) on the first multimeric substructure and the second multimeric substructure.

9. The nanostructure of any one of claims 1-8, wherein each symmetrically repeated, non-natural, non-covalent protein-protein interface between the first multimeric substructure and the second multimeric substructure has a shape complementarity value between 0.5-0.8.

10. The nanostructure of any one of claims 1-9, wherein at least 50% of the atomic contacts comprising each symmetrically repeated, non-natural, non-covalent protein-protein interface between the first multimeric substructure and the second multimeric substructure are formed from amino acid residues residing in elements of alpha helix and/or beta strand secondary structure.

11. The nanostructure of any one of claims 1-10, wherein the first protein and the second protein comprise proteins selected from the following pairs of first and second proteins:

(a) T32-28A (SEQ ID NO: 11) and T32-28B (SEQ ID NO: 12);

(b) T33-09A (SEQ ID NO: 13) and T33-09B (SEQ ID NO: 14);

(c) T33-15A (SEQ ID NO: 15) and T33-15B (SEQ ID NO: 16);

(d) T33-21A (SEQ ID NO: 17) and T33-21B (SEQ ID NO: 18); and

(e) T33-28A (SEQ ID NO: 19) and T33-28B (SEQ ID NO: 20).

12. The nanostructure of any one of claims 1-10, wherein the first protein and the second protein comprise proteins selected from the following pairs of first and second proteins:

(a) T32-28A (SEQ ID NO: 21) and T32-28B (SEQ ID NO: 22);

(b) T33-09A (SEQ ID NO: 23) and T33-09B (SEQ ID NO: 24);

(c) T33-15A (SEQ ID NO: 25) and T33-15B (SEQ ID NO: 26);

(d) T33-21A (SEQ ID NO: 27) and T33-21B (SEQ ID NO: 28); and

(e) T33-28A (SEQ ID NO: 29) and T33-28B (SEQ ID NO: 30).

13. The nanostructure of any one of claims 1-10, wherein the first protein and the second protein comprise proteins selected from the following pairs of first and second proteins:

(a) T32-28A (SEQ ID NO: 31) and T32-28B (SEQ ID NO: 32);

(b) T33-09A (SEQ ID NO: 33) and T33-09B (SEQ ID NO: 34);

(c) T33-15A (SEQ ID NO: 35) and T33-15B (SEQ ID NO: 36);

(d) T33-21A (SEQ ID NO: 37) and T33-21B (SEQ ID NO: 38); and

(e) T33-28A (SEQ ID NO: 39) and T33-28B (SEQ ID NO: 40).

14. The nanostructure of any one of claims 1-10, wherein the first protein and the second protein comprise proteins selected from the following pairs of first and second proteins:

(a) T32-28A (SEQ ID NO: 11, 21, or 31) and T32-28B (SEQ ID NO: 12, 22, or 32), wherein the first protein is at least 70% identical to the amino acid sequence of SEQ ID NO: 1 and the second protein is at least 70% identical to the amino acid sequence of SEQ ID NO: 2;

(e) T33-28A (SEQ ID NO: 19, 29, or 39) and T33-28B (SEQ ID NO: 20, 30, or 40), wherein the first protein is at least 70% identical to the amino acid sequence of SEQ ID NO: 9 and the second protein is at least 70% identical to the amino acid sequence of SEQ ID NO: 10.
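By way of illustration only, the "at least 70% identical" condition can be checked with a short Python sketch. The sketch assumes the two sequences have already been aligned to equal length with `-` marking gaps, and it assumes identity is counted over aligned columns; other denominators are in common use, and the function names `percent_identity` and `meets_identity_threshold` are hypothetical.

```python
def percent_identity(aligned_a: str, aligned_b: str) -> float:
    """Percent of aligned columns holding the same residue in both sequences.

    Assumes the inputs are pre-aligned (equal length, '-' for gaps).
    Columns that are gaps in both sequences are ignored.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("sequences must be aligned to the same length")
    columns = [(a, b) for a, b in zip(aligned_a, aligned_b)
               if not (a == "-" and b == "-")]
    if not columns:
        return 0.0
    matches = sum(1 for a, b in columns if a == b and a != "-")
    return 100.0 * matches / len(columns)


def meets_identity_threshold(aligned_a: str, aligned_b: str,
                             threshold: float = 70.0) -> bool:
    """True when the aligned pair is at least `threshold` percent identical."""
    return percent_identity(aligned_a, aligned_b) >= threshold
```

The placeholder strings used to exercise such a function would be short peptides invented for testing, not sequences from the sequence listing.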

15. The nanostructure of any one of claims 1-10, wherein the first protein and the second protein comprise proteins selected from the following pairs of first and second proteins:

(a) SEQ ID NOS: 1-2;

(b) SEQ ID NOS: 3-4;

(c) SEQ ID NOS: 5-6;

(d) SEQ ID NOS: 7-8; and

(e) SEQ ID NOS: 9-10.

16. An isolated protein, comprising an amino acid sequence selected from the group consisting of:

(a) T32-28A (SEQ ID NO: 11);

(b) T32-28B (SEQ ID NO: 12);

(c) T33-09A (SEQ ID NO: 13);

(d) T33-09B (SEQ ID NO: 14);

(e) T33-15A (SEQ ID NO: 15);

(f) T33-15B (SEQ ID NO: 16);

(g) T33-21A (SEQ ID NO: 17);

(h) T33-21B (SEQ ID NO: 18);

(i) T33-28A (SEQ ID NO: 19); and

(j) T33-28B (SEQ ID NO: 20).

17. The isolated protein of claim 16, comprising an amino acid sequence selected from the group consisting of:

(a) T32-28A (SEQ ID NO: 21);

(b) T32-28B (SEQ ID NO: 22);

(c) T33-09A (SEQ ID NO: 23);

(d) T33-09B (SEQ ID NO: 24);

(e) T33-15A (SEQ ID NO: 25);

(f) T33-15B (SEQ ID NO: 26);

(g) T33-21A (SEQ ID NO: 27);

(h) T33-21B (SEQ ID NO: 28);

(i) T33-28A (SEQ ID NO: 29); and

(j) T33-28B (SEQ ID NO: 30).

18. The isolated protein of claim 16, comprising an amino acid sequence selected from the group consisting of:

(a) T32-28A (SEQ ID NO: 31);

(b) T32-28B (SEQ ID NO: 32);

(c) T33-09A (SEQ ID NO: 33);

(d) T33-09B (SEQ ID NO: 34);

(e) T33-15A (SEQ ID NO: 35);

(f) T33-15B (SEQ ID NO: 36);

(g) T33-21A (SEQ ID NO: 37);

(h) T33-21B (SEQ ID NO: 38);

(i) T33-28A (SEQ ID NO: 39); and

(j) T33-28B (SEQ ID NO: 40).

19. The isolated protein of claim 16, comprising an amino acid sequence selected from the group consisting of:

(A) T32-28A (SEQ ID NO: 11, 21, or 31), wherein the protein is at least 70% identical to the amino acid sequence of SEQ ID NO: 1;

(B) T32-28B (SEQ ID NO: 12, 22, or 32), wherein the protein is at least 70% identical to the amino acid sequence of SEQ ID NO: 2;

20. An isolated protein, comprising an amino acid sequence selected from the group consisting of SEQ ID NOS: 1-10.

21. A multimeric assembly, comprising a plurality of identical protein monomers according to any one of claims 16-20.

22. A recombinant nucleic acid encoding the isolated protein of claim 21.

23. A recombinant expression vector comprising the recombinant nucleic acid of claim 22 operatively linked to a promoter.

24. A recombinant host cell, comprising the recombinant expression vector of claim 23.

25. A kit, comprising

(a) one or more isolated nanostructures according to any one of claims 1-15;

(b) one or more of the isolated proteins of any one of claims 16-20 or the multimeric assembly of claim 21;

(c) one or more recombinant nucleic acids of claim 22;

(d) one or more recombinant expression vectors of claim 23; and/or

(e) one or more recombinant host cells of claim 24.

26. A method, comprising:

generating a plurality of representations of a first protein building block using a computing device;

generating a plurality of representations of a second protein building block using the computing device, wherein the first protein building block differs from the second protein building block;

generating an arrangement of the plurality of representations of the first protein building block and the plurality of representations of the second protein building block according to symmetric operations of a designated mathematical symmetry group using the computing device;

computationally determining a docked configuration of the arrangement of the plurality of representations of the first protein building block and the plurality of representations of the second protein building block by at least generating at least one interface for each protein building block of the arrangement that is suitable for computational protein-protein interface design using the computing device;

computationally modifying amino acid sequences of the plurality of representations of the first protein building block and the plurality of representations of the second protein building block in the docked configuration to specify a plurality of representations of protein-protein interfaces, wherein the plurality of representations of protein-protein interfaces comprise one or more representations of protein-protein interfaces between the first protein building block and the second protein building block that are energetically favorable to drive self-assembly of the protein building blocks comprising the modified amino acid sequences to the docked configuration using the computing device; and

generating an output of the computing device that is based on at least one representation of the group consisting of: a representation of the docked configuration, at least one representation of the plurality of representations of the protein-protein interfaces, and at least one representation of the representations of the first protein building block and the representations of the second protein building block having modified amino acid sequences.
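By way of illustration only, the step of arranging representations "according to symmetric operations of a designated mathematical symmetry group" can be sketched in Python for the tetrahedral point group. The sketch assumes a coordinate frame in which the two-fold axes lie along x, y, and z and a three-fold axis lies along (1, 1, 1); the generator matrices and the names `generate_group` and `apply_rotation` are illustrative choices and are not taken from the disclosure.

```python
# Rotation matrices are stored as tuples of integer rows so they are exact
# and hashable. The frame and generators below are assumptions of this sketch.
IDENTITY = ((1, 0, 0), (0, 1, 0), (0, 0, 1))
C2_Z = ((-1, 0, 0), (0, -1, 0), (0, 0, 1))    # 180 degree rotation about z
C3_111 = ((0, 0, 1), (1, 0, 0), (0, 1, 0))    # 120 degree rotation about (1, 1, 1)


def matmul(a, b):
    """Product of two 3x3 matrices given as tuples of rows."""
    return tuple(
        tuple(sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3))
        for i in range(3)
    )


def apply_rotation(rot, point):
    """Apply a 3x3 rotation matrix to an (x, y, z) point."""
    return tuple(sum(rot[i][k] * point[k] for k in range(3)) for i in range(3))


def generate_group(generators):
    """Close a set of generators under multiplication to enumerate the group."""
    group = {IDENTITY}
    frontier = [IDENTITY]
    while frontier:
        current = frontier.pop()
        for gen in generators:
            candidate = matmul(gen, current)
            if candidate not in group:
                group.add(candidate)
                frontier.append(candidate)
    return sorted(group)


tetrahedral_ops = generate_group([C2_Z, C3_111])
# One copy of a reference point per symmetry operation.
copies = [apply_rotation(op, (1, 2, 3)) for op in tetrahedral_ops]
```

Applying each enumerated operation to the coordinates of one building block representation yields the symmetry-related copies that make up the arrangement.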

27. The method of claim 26, where each of the first and second protein building blocks comprise a synthetic polypeptide.

28. The method of claim 26 or 27, where each of the first and second protein building blocks comprise a protein multimer that shares an axis of symmetry with the designated mathematical symmetry group.

29. The method of any one of claims 26-28, where the designated mathematical symmetry group conforms to a symmetry selected from tetrahedral point group symmetry, octahedral point group symmetry, and icosahedral point group symmetry.

30. The method of any one of claims 26-29, where generating the arrangement of the plurality of representations of the first protein building block and the plurality of representations of the second protein building block comprises computationally aligning symmetry axes of the first protein building block and the second protein building block with at least one axis in the designated mathematical symmetry group.

31. The method of claim 30, wherein determining a docked configuration of the plurality of the first and second protein building blocks further comprises: sampling rotational degrees of freedom and translational degrees of freedom for each of the first and second protein building blocks.

32. The method of claim 31, wherein sampling the rotational degrees of freedom and the translational degrees of freedom comprises:

selecting a rotational value for a rotational degree of freedom for each of the first and second protein building blocks;

selecting a translational value for a translational degree of freedom for each of the first and second protein building blocks;

determining a sampled representation of the first protein building block based on the selected rotational value for the first protein building block and the selected translational value for the first protein building block;

determining a sampled representation of the second protein building block based on the selected rotational value for the second protein building block and the selected translational value for the second protein building block; and

determining a designability measure for the docked configuration using the sampled representation of the first protein building block and the sampled representation of the second protein building block.
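By way of illustration only, selecting rotational and translational values and determining a sampled representation can be sketched in Python. The sketch assumes each building block has been placed with its symmetry axis along z, so that one rotation about z and one translation along z describe its remaining freedom; the grid values and the names `rotate_translate_about_z` and `dof_grid` are hypothetical.

```python
import math
from itertools import product


def rotate_translate_about_z(coords, angle_deg, shift):
    """Rotate (x, y, z) points about the z axis, then translate along z."""
    theta = math.radians(angle_deg)
    c, s = math.cos(theta), math.sin(theta)
    return [(c * x - s * y, s * x + c * y, z + shift) for x, y, z in coords]


def dof_grid(angles_deg, shifts):
    """Yield every (angle_1, shift_1, angle_2, shift_2) combination.

    The first pair applies to the first building block and the second
    pair to the second building block.
    """
    return product(angles_deg, shifts, angles_deg, shifts)


# Hypothetical coordinates and grid, used only to exercise the functions.
block_one = [(1.0, 0.0, 0.0), (2.0, 0.0, 1.0)]
block_two = [(0.0, 1.0, 0.0)]
samples = []
for a1, t1, a2, t2 in dof_grid([0.0, 120.0, 240.0], [10.0, 20.0]):
    sampled_one = rotate_translate_about_z(block_one, a1, t1)
    sampled_two = rotate_translate_about_z(block_two, a2, t2)
    samples.append((a1, t1, a2, t2, sampled_one, sampled_two))
```

Each entry of `samples` pairs a set of selected values with the two sampled representations, which could then be passed to whatever designability measure is in use.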

33. The method of claim 32, wherein determining the designability measure of the docked configuration comprises determining a number of beta carbon contacts within a specified distance threshold between the sampled representation of the first protein building block and the sampled representation of the second protein building block in the docked configuration based on the values of the selected rotational and translational degrees of freedom.
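By way of illustration only, a count of beta carbon contacts within a distance threshold can be sketched in Python. The sketch assumes each sampled representation has been reduced to a list of beta carbon (x, y, z) coordinates in angstroms; the function name `count_cb_contacts` and the threshold value used in the example are hypothetical.

```python
import math


def count_cb_contacts(cb_first, cb_second, threshold):
    """Count beta carbon pairs, one from each block, within `threshold`.

    `cb_first` and `cb_second` are lists of (x, y, z) beta carbon
    coordinates; a pair counts when its distance is <= threshold.
    """
    return sum(
        1
        for a in cb_first
        for b in cb_second
        if math.dist(a, b) <= threshold
    )


# Hypothetical coordinates and threshold, for illustration only.
first_block = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
second_block = [(3.0, 4.0, 0.0), (30.0, 0.0, 0.0)]
contacts = count_cb_contacts(first_block, second_block, threshold=10.0)
```

The all-pairs loop is quadratic in the number of residues; a spatial grid or tree would be a natural substitute for large building blocks, at the cost of extra bookkeeping.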

34. The method of any one of claims 26-33, wherein computationally modifying the amino acid sequences of the plurality of representations of the first protein building block and the plurality of representations of the second protein building block comprises selecting a selected representation of one or more amino acid sequences associated with a representation of at least one protein building block of the plurality of representations of the first protein building block and the plurality of representations of the second protein building block.

35. The method of claim 34, wherein computationally modifying the amino acid sequences of the plurality of representations of the first protein building block and the plurality of representations of the second protein building block comprises computationally mutating an amino acid sequence of the selected representation of one or more amino acid sequences.

36. The method of any one of claims 26-35, wherein computationally modifying the amino acid sequences of the plurality of representations of the first protein building block and the plurality of representations of the second protein building block comprises evaluating an energy of an amino acid mutation using a computational score function.
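By way of illustration only, evaluating an energy of an amino acid mutation with a score function can be sketched in Python as a difference between the score of the mutated sequence and the score of the starting sequence. The score function is passed in as a parameter; `toy_score` below is a hypothetical stand-in that merely counts hydrophobic residues and is not the score function of the disclosure or of any particular software package.

```python
def mutate(sequence, position, new_residue):
    """Return a copy of `sequence` with one residue replaced (0-based index)."""
    if not 0 <= position < len(sequence):
        raise IndexError("position outside sequence")
    return sequence[:position] + new_residue + sequence[position + 1:]


def mutation_energy_change(sequence, position, new_residue, score_fn):
    """Score of the mutant minus score of the original; negative is favorable."""
    mutant = mutate(sequence, position, new_residue)
    return score_fn(mutant) - score_fn(sequence)


def toy_score(sequence):
    """Hypothetical placeholder: lower (better) with more hydrophobic residues."""
    return -float(sum(sequence.count(res) for res in "AILMFVW"))


# Accept the mutation only if it lowers the placeholder score.
delta = mutation_energy_change("GSG", 1, "L", toy_score)
accepted = delta < 0.0
```

In practice the placeholder would be replaced by a structure-based energy calculation over the docked configuration; only the compare-and-accept pattern is intended to carry over.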