KR20010083870A

KR20010083870A - Methods of populating data structures for use in evolutionary simulations

Info

Publication number: KR20010083870A
Application number: KR1020017001735A
Authority: KR
Inventors: Sergey A. Selifonov; Willem P. C. Stemmer
Original assignee: To be submitted later; Maxygen, Inc.
Priority date: 1999-01-18
Filing date: 2000-01-18
Publication date: 2001-09-03

Abstract

The present invention provides novel methods of populating data structures for use in evolutionary modeling. More specifically, the invention provides methods of populating a data structure with a plurality of character strings. The methods involve encoding two or more biological molecules into character strings to provide a collection of two or more different initial character strings; selecting at least two substrings from the character strings; concatenating the substrings to form one or more product strings about the same length as one or more of the initial character strings; adding the product strings to a collection of strings; and optionally repeating these steps using one or more of the product strings as initial strings in the collection of initial character strings.

Description

[0001] METHODS OF POPULATING DATA STRUCTURES FOR USE IN EVOLUTIONARY SIMULATIONS

There is a long history of using computers to investigate and/or simulate living systems, including the genetic and/or phenotypic systems of individuals and populations. The "motor" that drives most artificial life (Alife) simulations is an algorithm that allows artificial creatures to adapt to and/or evolve in their environment. The underlying algorithms fall into two broad categories: learning algorithms (typified by neural networks) and evolutionary algorithms (typified, for example, by genetic algorithms).

Many artificial life researchers, particularly those interested in relatively high-level processes such as learning and adaptation, endow their artificial creatures with a neural network that serves as an artificial brain (see, e.g., Touretzky (1988-1991) Neural Information Processing Systems, volumes 1-4, Morgan Kaufmann). A neural network is a learning algorithm. For example, a network can be trained to classify images into categories. A typical task is recognizing which letter a given handwritten character represents.

A neural network is composed of a collection of input-output devices, called neurons, organized in a (highly connected) network. Typically the network is organized into layers: an input layer that receives sensory input, any number of so-called hidden layers that perform the actual computation, and an output layer that reports the result of that computation. Training a neural network involves adjusting the strengths of the connections between the neurons in the network.

The other major class of biologically inspired algorithm is the evolutionary algorithm. Metaphorically speaking, whereas learning algorithms (e.g., neural networks) are based on learning processes within an individual organism, evolutionary algorithms are inspired by evolutionary change in populations of individuals. Compared with neural networks, evolutionary algorithms have only recently gained wide acceptance in academia and industry.

Evolutionary algorithms are generally iterative. One iteration is typically called a "generation". A basic evolutionary algorithm traditionally begins with a population of randomly chosen individuals. In each generation, the individuals "compete" among themselves to solve a given problem. Individuals that perform relatively well are more likely to "survive" to the next generation. Those that survive may be subjected to small random modifications. If the algorithm is set up correctly, and the problem is one that is in fact amenable to solution in this manner, then as the iterations proceed the population will contain solutions of increasingly high quality.

The most popular evolutionary algorithm is the genetic algorithm of J. Holland (J. H. Holland (1992) Adaptation in Natural and Artificial Systems, University of Michigan Press 1975, reprinted by MIT Press). Genetic algorithms are widely used in practical settings (e.g., financial market forecasting, management science). They are particularly well suited to multivariate problems whose solution space is discontinuous ("rugged") and poorly understood. To apply a genetic algorithm, one defines 1) a mapping from the set of parameter values into the set of bit strings (e.g., character strings), and 2) a mapping from bit strings into the reals, the so-called fitness function.

In most evolutionary algorithms, a randomly selected set of bit strings makes up the initial population. In the basic genetic algorithm, one cycle consists of evaluating the fitness of each individual in the population and generating copies of individuals in proportion to their fitness; the cycle is then repeated. Such evolutionary algorithms thus typically take a randomly selected set of bit strings as their starting point. Using a random or haphazard ("arbitrary") starting population can keep the algorithm from providing an efficient, accurate, or concise solution to a given problem, particularly when the algorithm is used to model or analyze biological history or evolution. In effect, the only "force" driving the algorithm toward any particular solution is the fitness determination and its associated selection pressure. Even if a solution is eventually reached, the process started from a random (e.g., arbitrary) initial state in which the population members bear no relationship to one another, so the dynamics of the population during the run reveal little or nothing about the dynamics of the system being simulated.
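The cycle described above (evaluate fitness, copy individuals in proportion to fitness, repeat) can be sketched as follows. This is only an illustration and is not taken from the specification: the bit-counting `fitness` function, the population size, and the string length are assumptions chosen for the example.

```python
import random


def fitness(bits: str) -> float:
    # Illustrative fitness (an assumption): count of 1 bits, plus 1 so that
    # every individual has a positive selection weight.
    return bits.count("1") + 1.0


def next_generation(population: list[str], rng: random.Random) -> list[str]:
    # Copy individuals in proportion to their fitness (roulette-wheel selection).
    weights = [fitness(ind) for ind in population]
    return rng.choices(population, weights=weights, k=len(population))


rng = random.Random(0)
# A random ("arbitrary") starting population of 8-bit strings.
population = ["".join(rng.choice("01") for _ in range(8)) for _ in range(20)]
for _ in range(10):  # ten generations
    population = next_generation(population, rng)
```

Nothing in this loop relates the starting strings to one another, which is the limitation the paragraph above points out.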

In addition, evolutionary algorithms are typically high-level simulations and therefore provide population-level information. Specific genetic information, if present at all, generally exists as alleles (typically represented by a single character) or as an abstract representation of allele frequencies. Consequently, evolutionary algorithms provide little or no information about events at the molecular level.

Similarly, neural networks and/or cellular automata take essentially artificial constructs as their starting point and use internal rules (algorithms) to mimic biological processes. As a result, such models generally imitate processes or metaprocesses, but they too provide little or no information or insight into events at the molecular level.

<Summary of the Invention>

The present invention provides novel methods of generating "initial" populations suitable for further computation, e.g., by genetic/evolutionary algorithms. The members of populations generated by the methods of the invention possess varying degrees of "relatedness" or "similarity" to one another, reflecting the degree of covariance found in naturally occurring populations. Moreover, unlike the populations used as input to conventional evolutionary algorithms, the populations generated by these methods contain detailed information about their individual members, and this information is generally fine-grained enough to provide a "continuous" (rather than binary) measure of the variation and/or relatedness between members. In effect, the methods of the invention provide a way to encode, in detail, molecular-level information about the individuals in the populations so generated.

Thus, in one embodiment, the invention provides methods of populating a data structure with character strings (e.g., generating a library or collection of character strings). The methods preferably involve: i) encoding two or more biological molecules into character strings to provide a collection of two or more different initial character strings, wherein each of the biological molecules comprises at least 10 subunits; ii) selecting at least two substrings from the character strings; iii) concatenating the substrings to form one or more product strings about the same length as one or more of the initial character strings; iv) adding the product strings to a collection of strings (a data structure); and v) optionally repeating steps i) or ii) through iv) using one or more of the product strings as an initial string in the collection of initial character strings. In a particularly preferred embodiment, the "encoding" step comprises encoding one or more nucleic acid sequences and/or one or more amino acid sequences into the character strings. The nucleic acid and/or amino acid sequences may be unknown and/or chosen haphazardly, but preferably encode known protein(s). In one preferred embodiment, the biological molecules are selected so that they have at least about 30%, preferably at least about 50%, more preferably about 75%, and most preferably at least about 85%, 90%, or even 95% sequence identity with one another.
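Steps i) through v) above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the implementation described in the specification: the names `encode`, `recombine`, and `populate` are hypothetical, molecules are assumed to arrive as one-letter sequences, and each product is built from just two substrings cut at a single random position.

```python
import random


def encode(molecule: str) -> str:
    # Step i (assumption): the molecule is already a one-letter sequence,
    # so encoding only strips whitespace and upper-cases it.
    return "".join(molecule.split()).upper()


def recombine(parent_a: str, parent_b: str, rng: random.Random) -> str:
    # Steps ii and iii: select a leading substring of one parent and a trailing
    # substring of the other, cut at the same position, and concatenate them.
    # The product has the same length as parent_b.
    cut = rng.randint(1, min(len(parent_a), len(parent_b)) - 1)
    return parent_a[:cut] + parent_b[cut:]


def populate(molecules, n_products, rounds=1, seed=0):
    rng = random.Random(seed)
    initial = [encode(m) for m in molecules]
    if len(set(initial)) < 2 or any(len(s) < 10 for s in initial):
        raise ValueError("need two or more different strings of at least 10 subunits")
    collection = list(initial)              # the data structure being populated
    for _ in range(rounds):                 # step v: optional repetition
        products = []
        for _ in range(n_products):
            a, b = rng.sample(initial, 2)   # two entries of the initial collection
            products.append(recombine(a, b, rng))
        collection.extend(products)         # step iv: add products to the collection
        initial = initial + products        # products serve as initial strings next round
    return collection


library = populate(["ATGGCTAGCTAGGCTA", "ATGGCAAGCTTGGCTA"], n_products=5, rounds=2)
```

Because every product is assembled from pieces of the encoded molecules, the members of `library` are related to one another by construction, unlike a randomly generated starting population.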

In one embodiment, the substring(s) are selected such that the ends of the substrings occur in string regions of about 3 to about 300, preferably about 6 to about 20, more preferably about 10 to about 100, and most preferably about 20 to about 50 characters that have higher sequence identity with the corresponding region of another of the initial character strings than the overall sequence identity between the same two strings. In another embodiment, the selecting step can involve selecting substrings such that the ends of the substrings occur in predefined motifs of about 4 to about 100, preferably about 4 to about 50, more preferably about 4 to about 10, still more preferably about 6 to about 30, and most preferably about 6 to about 20 characters.
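The motif-based variant can be illustrated with a short sketch. The function names and the example motif `GSGS` are hypothetical, and the sketch assumes a literal motif and a cut placed at the end of each motif occurrence, which is only one of several ways an end point could "occur in" a motif.

```python
import re


def motif_cut_points(s: str, motif: str) -> list[int]:
    # Index just past each occurrence of the literal motif. The lookahead
    # lets overlapping occurrences be found as well.
    pattern = f"(?=({re.escape(motif)}))"
    return [m.end(1) for m in re.finditer(pattern, s)]


def split_at_motifs(s: str, motif: str) -> list[str]:
    # Substrings whose end points fall at the motif-derived cut positions.
    cuts = sorted(set(motif_cut_points(s, motif)))
    bounds = [0] + [c for c in cuts if 0 < c < len(s)] + [len(s)]
    return [s[i:j] for i, j in zip(bounds, bounds[1:])]


pieces = split_at_motifs("MKTGSGSAAGSGSLL", "GSGS")
```

Here `pieces` is `["MKTGSGS", "AAGSGS", "LL"]`, and joining the pieces restores the original string.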

In one embodiment, the selecting and concatenating steps involve concatenating substrings selected from two different initial strings such that the concatenation occurs in a region of about 3 to about 20 characters having higher sequence identity between the two different initial strings than the overall sequence identity between those two strings. The selecting step can also involve aligning two or more of the initial character strings to maximize pairwise identity between two or more substrings of the character strings, and selecting a character that is a member of an aligned pair as the end of a substring.
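As a rough sketch of joining substrings inside a region of elevated local identity, the code below scans two strings with a fixed window and keeps the windows whose identity exceeds the overall identity. It assumes the two strings are already aligned and of equal length (no gaps); the alignment step itself is not shown, and the names `identity`, `crossover_sites`, and `crossover` are hypothetical.

```python
def identity(a: str, b: str) -> float:
    # Fraction of aligned positions carrying the same character.
    if len(a) != len(b) or not a:
        raise ValueError("strings must be aligned to the same non-zero length")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def crossover_sites(a: str, b: str, window: int = 5) -> list[int]:
    # Positions at the centre of windows whose local identity is higher than
    # the overall identity of the two strings.
    overall = identity(a, b)
    sites = []
    for start in range(len(a) - window + 1):
        local = identity(a[start:start + window], b[start:start + window])
        if local > overall:
            sites.append(start + window // 2)
    return sites


def crossover(a: str, b: str, site: int) -> str:
    # Concatenate the leading substring of a with the trailing substring of b.
    return a[:site] + b[site:]


a = "AAAACCAAAAGGAAAA"
b = "AAAATTAAAACCAAAA"
sites = crossover_sites(a, b, window=4)
chimera = crossover(a, b, sites[1])
```

With these inputs the overall identity is 0.75, the qualifying sites are `[2, 8, 14]`, and cutting at site 8 yields `AAAACCAAAACCAAAA`, which combines the first variable block of `a` with the second variable block of `b`.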

In certain embodiments, the "adding" step involves calculating the theoretical pI, pK, molecular weight, hydrophobicity, secondary structure, and/or other properties of the protein encoded by the character string. In one preferred embodiment, product strings are added to the collection only if they have at least 30%, preferably at least 50%, more preferably at least 75% or 85% sequence identity with the initial strings.
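The identity threshold on the "adding" step might look like the sketch below. Two points are assumptions rather than statements from the text: identity is computed position by position without alignment, and a product is accepted if it meets the threshold against at least one initial string (the text does not say whether one or all initial strings must be matched).

```python
def percent_identity(a: str, b: str) -> float:
    # Position-by-position identity, as a percentage of the longer string.
    longest = max(len(a), len(b))
    if longest == 0:
        return 100.0
    return 100.0 * sum(x == y for x, y in zip(a, b)) / longest


def add_if_similar(product, initial_strings, collection, threshold=30.0):
    # Add the product only if it reaches the threshold against some initial string.
    best = max(percent_identity(product, s) for s in initial_strings)
    if best >= threshold:
        collection.append(product)
        return True
    return False


initial = ["MKTAYIAKQR", "MKSAYLAKQR"]
collection = []
kept = add_if_similar("MKTAYLAKQR", initial, collection)      # 90% identical to the first
rejected = add_if_similar("WWWWWWWWWW", initial, collection)  # 0% identical
```

Other property checks named in the text (pI, molecular weight, hydrophobicity, and so on) could be added as further predicates alongside the identity test.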

The methods can further involve randomly altering one or more characters of the character strings. This can be accomplished in many ways including, but not limited to, introducing a random string into the initial string collection and/or applying a stochastic operator as described below. In a particularly preferred embodiment, the above operations are performed in a computer.
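One simple way to randomly alter characters is a per-position point mutation. The sketch below is an assumption about what such an operator could look like, not the stochastic operator the specification goes on to describe; the `rate` and `alphabet` parameters are hypothetical.

```python
import random


def mutate(s: str, rate: float, alphabet: str, rng: random.Random) -> str:
    # Replace each character with a different one from the alphabet
    # with probability `rate`; otherwise keep it unchanged.
    out = []
    for ch in s:
        if rng.random() < rate:
            out.append(rng.choice([c for c in alphabet if c != ch]))
        else:
            out.append(ch)
    return "".join(out)


rng = random.Random(42)
original = "ACGTACGTAC"
unchanged = mutate(original, 0.0, "ACGT", rng)
scrambled = mutate(original, 1.0, "ACGT", rng)
```

A rate of 0 returns the string unchanged and a rate of 1 changes every position; intermediate rates give the small random modifications typical of evolutionary algorithms.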

In another embodiment, the invention provides a computer program product comprising computer code that: i) encodes two or more biological molecules, each comprising at least about 10 subunits, into character strings to provide a collection of two or more different initial character strings; ii) selects at least two substrings from the character strings; iii) concatenates the substrings to form one or more product strings about the same length as one or more of the initial character strings; iv) adds the product strings to a collection of strings; and v) optionally repeats steps (i) or (ii) through (iv) using one or more of the product strings as an initial string in the collection of initial character strings. In other words, the computer program product comprises computer code that performs the operations described above. The program code can be provided as source code, object code, an executable, or in another compiled form. The program can be provided on any convenient medium, e.g., magnetic media, optical media, electronic media, magneto-optical media, and the like. The code can also reside on a computer, e.g., in memory (dynamic or static) or on a hard drive.

In another embodiment, the invention provides a system for generating labels (tags) and/or music derived from the sequences of biological molecules. The system comprises: an encoder for encoding two or more initial strings from biological molecules (e.g., nucleic acids and/or proteins); an isolator for identifying and selecting substrings from the two or more strings; a concatenator for concatenating the substrings; a data structure for storing the concatenated substrings as a collection of strings; a comparator for measuring the number and variability of the collection of strings and determining whether sufficient strings are present in the collection; and a command writer for writing the collection of strings to a raw string file. In a preferred embodiment, the isolator comprises a comparator for aligning the two or more initial strings and determining regions of identity between them. Similarly, the comparator can comprise a means for calculating sequence identity, and the isolator and comparator can optionally share this means. In a preferred embodiment, the isolator selects substrings such that the ends of the substrings occur in string regions of about 3 to about 100 characters that have higher sequence identity with the corresponding region of another of the initial character strings than the overall sequence identity between the same two strings.
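The comparator and writer roles in this system can be sketched as below. The text does not define how "variability" is measured, so the mean pairwise fraction of differing positions used here is an assumption, as are the function names and the one-string-per-line layout of the raw string file.

```python
import io
from itertools import combinations


def mean_pairwise_difference(strings) -> float:
    # Assumed variability measure: average fraction of differing positions
    # over all pairs, counting any length difference as mismatches.
    pairs = list(combinations(strings, 2))
    if not pairs:
        return 0.0
    total = 0.0
    for a, b in pairs:
        longest = max(len(a), len(b))
        mismatches = sum(x != y for x, y in zip(a, b)) + abs(len(a) - len(b))
        total += mismatches / longest if longest else 0.0
    return total / len(pairs)


def is_sufficient(strings, min_count: int, min_variability: float) -> bool:
    # Comparator: are there enough strings, and are they varied enough?
    return (len(strings) >= min_count
            and mean_pairwise_difference(strings) >= min_variability)


def write_raw_strings(strings, handle) -> None:
    # Writer: one string per line to any file-like object.
    for s in strings:
        handle.write(s + "\n")


collection = ["AAAA", "AAAT", "AATT"]
buffer = io.StringIO()          # stands in for the raw string file
if is_sufficient(collection, min_count=3, min_variability=0.3):
    write_raw_strings(collection, buffer)
```

In a full pipeline the encoder, isolator, and concatenator would keep producing strings until `is_sufficient` returns true, at which point the collection is written out.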

In another embodiment, the isolator selects substrings such that the ends of the substrings occur in predefined motifs of about 4 to about 100, preferably about 4 to about 50, more preferably about 4 to about 10, still more preferably about 6 to about 30, and most preferably about 6 to about 20 characters. In one embodiment, the isolator and concatenator, individually or in combination, concatenate substrings selected from two different initial strings such that the concatenation occurs in a region of about 3 to about 300, preferably about 5 to about 200, and most preferably about 10 to about 100 characters having higher sequence identity between the two different initial strings than the overall sequence identity between those two strings. In one preferred embodiment, the isolator aligns two or more of the initial character strings to maximize pairwise identity between two or more substrings of the character strings, and selects a character that is a member of an aligned pair as the end of a substring.

The comparator can impose a variety of selection criteria. Thus, in various embodiments, the comparator can calculate the theoretical pI, pK, molecular weight, hydrophobicity, secondary structure, and/or other properties of the protein encoded by a character string. In one preferred embodiment, the comparator adds strings to the data structure only if they have at least 30% sequence identity with the initial strings.

The system can optionally include an operator that randomly alters one or more characters of the character strings. In certain embodiments, such an operator can randomly select and alter one or more occurrences of a particular preselected character in the character strings. Preferred data structures in the system store encoded (or deconvolved) nucleic acid sequences and/or encoded or deconvolved amino acid sequences.
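An operator that targets occurrences of one preselected character could be sketched as follows. The function name, its parameters, and the choice to draw replacements uniformly from a supplied set are all assumptions made for illustration.

```python
import random


def alter_occurrences(s: str, target: str, replacements: str,
                      n: int, rng: random.Random) -> str:
    # Randomly pick up to n positions holding the preselected character
    # and replace each with a character drawn from `replacements`.
    positions = [i for i, ch in enumerate(s) if ch == target]
    chosen = rng.sample(positions, min(n, len(positions)))
    chars = list(s)
    for i in chosen:
        chars[i] = rng.choice(replacements)
    return "".join(chars)


rng = random.Random(7)
source = "ACGTACGTAC"                       # contains three 'A' characters
altered = alter_occurrences(source, "A", "CGT", 2, rng)
```

Only positions that held the target character can change, so the rest of the string is preserved exactly.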

A further understanding of the invention can be had from the detailed discussion of specific embodiments below. For purposes of clarity, this discussion refers to devices, methods, and concepts in terms of specific examples. However, the methods of the invention can operate in a wide variety of logic devices. The invention is not intended to be limited except as provided in the attached claims (which are to be interpreted in light of the doctrine of equivalents).

Furthermore, logic systems can include a wide variety of different components and different functions in a modular fashion. Different embodiments of a system can include different mixtures of elements and functions, and may group various functions as parts of various elements. For purposes of clarity, the invention is described in terms of systems that include many different innovative components and innovative combinations of components. No inference should be taken to limit the invention to combinations containing all of the innovative components listed in any illustrative embodiment in this specification.

<DEFINITIONS>

The terms "character string", "word", "binary string", and "encoded string" refer to any entity capable of storing sequence information (e.g., the subunit structure of a biological molecule, such as the nucleotide sequence of a nucleic acid, the amino acid sequence of a protein, or the sugar sequence of a polysaccharide). In one embodiment, the character string can be a simple sequence of characters (letters, numbers, or other symbols), or it can be a numeric representation of such information in tangible or intangible (e.g., electronic, magnetic) form. The character string need not be "linear" but can exist in a number of other forms, e.g., a linked list.

A "character", when used in reference to a character of a character string, refers to a subunit of the string. In a preferred embodiment, one character of a character string encodes one subunit of the encoded biological molecule. Thus, for example, in a preferred embodiment where the encoded biological molecule is a protein, one character of the string encodes one amino acid.

A "motif" refers to a pattern of subunits making up a biological molecule. The motif can refer to a subunit pattern of the unencoded biological molecule or to a subunit pattern of an encoded representation of a biological molecule.

A "substring" refers to a string that is found within another string. The substring can include the full length of the "parent" string, but typically represents a subset of the full-length string.

A "data structure" refers to the organization, and optionally the associated device, for the storage of information, typically multiple "pieces" of information. The data structure can be a simple recordation of the information (e.g., a list), or it can contain additional information (e.g., annotations) regarding the information it holds, establish relationships between the various "members" (information "pieces") of the data structure, and provide pointers or links to resources external to the data structure. The data structure can be intangible, but is rendered tangible when stored or represented in a tangible medium. The data structure can represent various information architectures including, but not limited to, simple lists, linked lists, indexed lists, data tables, indexes, hash indices, flat file databases, relational databases, local databases, distributed databases, thin client databases, and the like. In preferred embodiments, the data structure provides fields sufficient for the storage of one or more character strings. The data structure is preferably organized to permit alignment of the character strings and, optionally, to store information regarding the alignment and/or the similarity and/or the differences between strings. In one embodiment, this information takes the form of an alignment "score" (e.g., a similarity index) and/or an alignment map showing the alignment of individual subunits (e.g., nucleotides in the case of a nucleic acid).

An "encoded character string" refers to a representation of a biological molecule that preserves the desired sequence/structural information regarding that molecule.
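To make the definition of a data structure concrete, the sketch below stores character strings together with annotations, parent relationships, and cached pairwise similarity scores. It is one possible layout among the many architectures listed above; the class names, fields, and the unaligned position-by-position score are assumptions for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class StringRecord:
    sequence: str                                      # the encoded character string
    annotations: dict = field(default_factory=dict)    # extra information about it
    parents: tuple = ()                                # indices of the strings it came from


@dataclass
class StringCollection:
    records: list = field(default_factory=list)
    scores: dict = field(default_factory=dict)         # (i, j) -> similarity index

    def add(self, sequence: str, parents=(), **annotations) -> int:
        self.records.append(StringRecord(sequence, dict(annotations), tuple(parents)))
        return len(self.records) - 1

    def score(self, i: int, j: int) -> float:
        # Similarity index between two members, computed once and cached.
        key = (min(i, j), max(i, j))
        if key not in self.scores:
            a, b = self.records[i].sequence, self.records[j].sequence
            matches = sum(x == y for x, y in zip(a, b))
            self.scores[key] = matches / max(len(a), len(b), 1)
        return self.scores[key]


collection = StringCollection()
i = collection.add("ACGT", source="geneA")
j = collection.add("ACGA", source="geneB")
k = collection.add("ACGA", parents=(i, j))
similarity = collection.score(i, j)
```

The `parents` field records relationships between members, and the `scores` table plays the role of the stored alignment "score"; an alignment map could be added as a further field.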

"유사도"란, 본 명세서에서는 분자의 부호화된 표현(예, 초기 캐릭터 열)(들) 간 또는 이러한 부호화된 캐릭터 열에 의해 표현되는 분자들 간의 유사성 지수(measurement)의 의미로 사용된다.The term " similarity degree " is used herein to mean a similarity measure between molecules expressed by an encoded representation of a molecule (e.g., initial character string) (s) or by such a coded character string.

When referring to operations on strings (e.g., insertions, deletions, transformations), it will be appreciated that the operation can be performed on the encoded representation of a biological molecule, or it can be performed on the "molecule" prior to encoding so that the encoded representation captures the operation.

The term "subunit", when used in reference to a biological molecule, refers to the characteristic "monomer" of which the biological molecule is composed. Thus, the subunit of a nucleic acid is a nucleotide, the subunit of a polypeptide is an amino acid, and the subunit of a polysaccharide is a sugar.

The terms "pool" and "collection" are used interchangeably when referring to strings.

A "biological molecule" refers to a molecule typically found in a biological organism. Preferred biological molecules include biological macromolecules that are polymeric in nature, being composed of multiple subunits. Typical biological molecules include, but are not limited to, nucleic acids (composed of nucleotide subunits), proteins (composed of amino acid subunits), polysaccharides (composed of sugar subunits), and the like.

The phrase "encoding a biological molecule" refers to generating a representation of that biological molecule that preferably contains, and can therefore be used to recreate, the information content of the original biological molecule.

The term "nucleic acid" refers to a deoxyribonucleotide or ribonucleotide polymer in either single- or double-stranded form and, unless otherwise limited, encompasses known analogs of natural nucleotides that can function in a manner similar to naturally occurring nucleotides.

A "nucleic acid sequence" refers to the order and identity of the nucleotides comprising a nucleic acid.

The terms "polypeptide", "peptide", and "protein" are used interchangeably herein to refer to a polymer of amino acid residues. The terms apply to naturally occurring amino acid polymers as well as to amino acid polymers in which one or more amino acid residues is an artificial chemical analog of a corresponding naturally occurring amino acid.

A "polypeptide sequence" refers to the order and identity of the amino acids comprising a polypeptide.

The phrase "adding the product strings to a collection of strings", as used herein, does not require a mathematical addition. Rather, it refers to a process of identifying one or more strings as being included within a set of strings. This can be accomplished by a variety of means including, but not limited to, copying or moving the string(s) in question into a data structure that is a collection of strings, setting or providing a pointer from the string to a data structure that represents a collection of strings, setting a flag associated with the string indicating its inclusion in a particular set, or simply designating a rule that the string(s) so produced are included in the collection.

This invention relates to the field of computer modeling and simulation. More specifically, this invention provides novel methods of populating data structures for use in evolutionary modeling.

FIG. 1 is a flow chart illustrating one embodiment of the methods of this invention.

FIG. 2 illustrates the selection and concatenation of subsequences according to the methods of this invention.

FIG. 3 illustrates the collection and concatenation of subsequences according to the methods of this invention, using an alignment algorithm to fix the order of the substrings.

FIG. 4 illustrates a representative digital device 700 according to this invention.

FIG. 5 is a chart and relationship tree showing the percent similarity between different subtilisins (an example of a set of initial character strings).

FIG. 6 illustrates pairwise dot-plot alignments showing regions of homology between different subtilisins.

FIG. 7 illustrates pairwise dot-plot alignments showing regions of homology between seven different parent subtilisins.

I. Generating a population of character strings

This invention provides novel computational methods for generating representations of actual or theoretical populations of entities suitable for use as initial (or mature/advanced) populations in evolutionary models, in particular evolutionary models typified by genetic algorithms. When initialized to reflect the features of particular biological organisms, each entity generated by the methods of this invention contains significant information about the underlying molecular biology (e.g., representative amino acid or nucleic acid sequences). Models based on genetic or other algorithms can therefore provide unprecedented levels of information about evolutionary processes at the molecular level.

In a particularly preferred embodiment, the methods of this invention generate a population of character strings in which each string represents one or more biological molecules. Using only a few strings as "seeds", the methods generate large populations of strings that bear an "evolutionary" relationship to the initial seed members. In contrast to conventional genetic algorithms, in which the initial member set is arbitrary, random/haphazard, or chosen for mathematical or representational convenience, the populations generated by the methods of this invention are, in preferred embodiments, derived from known, existing biological "precursors" (e.g., particular nucleic acid sequences and/or polypeptide sequences).

In a preferred embodiment, the methods of this invention comprise the following steps:

1) identifying/selecting two or more biological molecules;

2) encoding the biological molecules into character strings;

3) selecting at least two substrings from the character strings;

4) concatenating the substrings to form one or more product strings about the same length as one or more of the initial character strings;

5) adding the product strings to a collection of strings, which can be the set of initial strings or a separate set;

6) optionally, introducing additional variation into the resulting set of strings;

7) optionally, imposing selection pressure on the resulting set of strings; and

8) optionally, repeating steps 2) or 3) through 7) using one or more of the product strings as an initial string in the collection of initial character strings.

Each of these operations is described in more detail below.
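Steps 3) through 5), together with the optional repetition in step 8), can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the procedure of the specification itself: the parent strings are assumed to be already encoded and of equal length, substrings are cut at fixed uniform boundaries, and each product string takes each segment from a randomly chosen member of the pool. The function names `make_products` and `populate` are hypothetical.

```python
import random


def make_products(parents, n_segments, n_products, rng=None):
    """Select one substring per segment from randomly chosen parents and
    concatenate them into product strings of the parental length."""
    rng = rng or random.Random()
    length = len(parents[0])
    if any(len(p) != length for p in parents):
        raise ValueError("this sketch assumes equal-length parent strings")
    if not 1 <= n_segments <= length:
        raise ValueError("n_segments must be between 1 and the string length")
    # Segment boundaries shared by all parents (assumption: no alignment step).
    bounds = [i * length // n_segments for i in range(n_segments + 1)]
    products = []
    for _ in range(n_products):
        pieces = [rng.choice(parents)[a:b] for a, b in zip(bounds, bounds[1:])]
        products.append("".join(pieces))
    return products


def populate(parents, n_segments, n_products, rounds=1, rng=None):
    """Add product strings to the collection; each later round reuses
    earlier products as initial strings (step 8)."""
    pool = list(parents)
    for _ in range(rounds):
        pool.extend(make_products(pool, n_segments, n_products, rng))
    return pool
```

For example, `populate(["AAAAAAAA", "CCCCCCCC"], n_segments=4, n_products=5, rounds=2)` returns the two seeds followed by ten product strings, each eight characters long and built only from segments of strings already in the pool. Variation (step 6) and selection pressure (step 7) are deliberately omitted from this sketch.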

II. Encoding one or more biological molecules into character strings

The methods of this invention typically use one or more "seed" members. The "seed" members are preferably representations of one or more biological molecules. Thus, the initial steps of preferred embodiments of this invention involve selecting two or more biological molecules and encoding those molecules into one or more character strings.

A) Identifying/selecting "seed/initial" biological molecules

Virtually any biological molecule can be used in the methods of this invention. However, preferred biological molecules are "polymeric" biological macromolecules comprising multiple "subunits". Biological macromolecules particularly well suited to the methods of this invention include, but are not limited to, nucleic acids (e.g., DNA, RNA, etc.), proteins, glycoproteins, carbohydrates, polysaccharides, fatty acids, and the like.

When nucleic acids are selected, the nucleic acid can be single stranded or double stranded (although a single strand is sufficient to represent/encode a double-stranded nucleic acid). The nucleic acids are preferably known nucleic acids. The sequences of such nucleic acids are readily available from a variety of sources including, but not limited to, public databases (e.g., GenBank), proprietary databases (e.g., Incyte databases), scientific publications, commercial or private sequencing laboratories, in-house sequencing laboratories, and the like.

The nucleic acids can include genomic nucleic acids, cDNAs, mRNAs, artificial sequences, natural sequences having modified nucleotides, and the like.

In one embodiment, the two or more biological molecules are "related" to each other but not identical. Thus, the nucleic acids may represent the same gene or genes but differ in the strain, species, genus, family, order, phylum, or kingdom from which they are derived. Similarly, in one embodiment, the protein, polysaccharide, or other molecule(s) are the same protein, polysaccharide, or other molecule(s), with the differences between them arising because they are selected from different strains, species, genera, families, orders, phyla, or kingdoms.

The biological molecules can represent a single gene product (e.g., an mRNA, a cDNA, a protein, etc.), or they can represent a collection of gene products and/or non-coding nucleic acids. In certain embodiments, the biological molecules will represent members of one or more particular metabolic pathways (e.g., regulatory, signaling, or synthetic pathways). Thus, for example, the biological molecules can include the members of an entire operon or a complete biosynthetic pathway (e.g., the lac operon, the protein: B-DNA gal operon, the colicin A operon, the lux operon, polyketide synthesis pathways, etc.).

In certain preferred embodiments, the biological molecules can include any number of different genes, proteins, etc. Thus, in certain embodiments, the biological molecules could include the total nucleic acid (e.g., genomic DNA, cDNA, or mRNA), the total protein, or the total lipid, etc., of one or more individuals of the same or different species.

In certain embodiments, the biological molecules can reflect a "representation" of the total population of that species of molecule. High-order representation of populations of molecules has been accomplished in the laboratory and, according to the methods of this invention, can be performed in silico. Methods of representing complex molecules or populations of molecules are seen in Representational Difference Analysis (RDA) and related techniques (see, e.g., Lisitsyn (1995) Trends Genet. 11(8): 303-307; Risinger et al. (1994) Mol. Carcinog. 11(1): 13-18; and Michiels et al. (1998) Nucleic Acids Res. 26(15): 3608-3610, and references cited therein).

Biological molecules particularly suited to encoding and manipulation according to the methods of this invention include proteins and/or nucleic acids encoding proteins of various classes of molecules, such as therapeutic proteins like erythropoietin (EPO) and insulin; peptide hormones such as human growth hormone; growth factors and cytokines such as neutrophil-activating peptide-78, GROα/MGSA, GROβ, GROγ, MIP-1α, MIP-16, MCP-1, epidermal growth factor, fibroblast growth factor, hepatocyte growth factor, insulin-like growth factor, the interferons, the interleukins, keratinocyte growth factor, leukemia inhibitory factor, oncostatin M, PD-ECSF, PDGF, pleiotropin, SCF, and c-kit ligand; angiogenesis factors (e.g., the vascular endothelial growth factors VEGF-A, VEGF-B, VEGF-C, and VEGF-D, placental growth factor (PLGF), etc.); growth factors (e.g., G-CSF, GM-CSF); and soluble receptors (e.g., IL-4R, IL-13R, IL-10R, soluble T-cell receptors, etc.).

Other preferred molecules for encoding include, but are not limited to, transcription and expression activators. Transcription and expression activators include genes and/or proteins that modulate cell growth, differentiation, regulation, and the like, and are found in prokaryotes, viruses, and eukaryotes, including fungi, plants, and animals. Expression activators include cytokines, inflammatory molecules, growth factors, growth factor receptors, and oncogene products, e.g., interleukins (e.g., IL-1, IL-2, IL-8, etc.), interferons, FGF, IGF-I, IGF-II, FF, PDGF, TNF, TGF-α, TGF-β, EGF, KGF, SCF/c-kit, CD40L/CD40, VLA-4/VCAM-1, ICAM-1/LFA-1, and hyalurin/CD44; signal transduction molecules and corresponding oncogene products (e.g., Mos, Ras, Raf, and Met); transcriptional activators and suppressors (e.g., p53, Tat, Fos, Myc, Jun, Myb, Rel); and steroid hormone receptors such as those for estrogen, progesterone, testosterone, aldosterone, the LDL receptor ligand, and corticosterone.

Preferred molecules for encoding in the methods of this invention also include proteins from infectious agents or pathogens, such as proteins characteristic of Aspergillus sp., Candida sp., E. coli, Staphylococci sp., Streptococci sp., Clostridia sp., Neisseria sp., Enterobacteriaceae sp., Helicobacter sp., Vibrio sp., Campylobacter sp., Pseudomonas sp., Ureaplasma sp., Legionella sp., Spirochetes, Mycobacteria sp., Actinomyces sp., Nocardia sp., Chlamydia sp., Rickettsia sp., Coxiella sp., Ehrlichia sp., Rochalimaea, Brucella, Yersinia, Francisella, and Pasteurella; protozoa; and viruses such as (+) RNA viruses, (-) RNA viruses, orthomyxoviruses, dsDNA viruses, retroviruses, and the like.

Still other suitable molecules include nucleic acids and/or proteins that act as inhibitors of transcription, toxins of crop pests, industrially important enzymes (e.g., proteases, nucleases, and lipases), and the like.

Preferred molecules include members of related "families" of nucleic acids or their encoded proteins. Relatedness (e.g., inclusion in or exclusion from the "family") can be determined by protein function and/or by sequence identity with other members of the family. Sequence identity can be determined as described above. Family members preferably share at least 30% sequence identity, more preferably at least 50%, and most preferably at least 80%. In certain instances, it is desirable to include molecules that have low sequence identity (e.g., less than 30%) but nevertheless have significant relatedness. Methods for identifying such molecules are well known in bioinformatics and typically relate molecular folding patterns to sequence/similarity information. One common implementation of this approach is the "threading algorithm". Threading algorithms detect remote homology by comparing sequences to structural templates. If the structural similarity between target and template is sufficiently large, their relationship can be detected even in the absence of significant sequence similarity. Threading algorithms are well known to those of skill in the art and can be found, for example, in the NCBI Structure Group Threading Package (available from the National Center for Biotechnology Information; see http://www.ncbi.nlm.nih.gov/Structure/RESEARCH/threading.html) and in SeqFold (Molecular Simulations, Inc.).
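The identity thresholds above can be made concrete with a small Python sketch. It assumes, for simplicity, that the two encoded strings are already aligned and of equal length with no gap characters; real family assignment would rely on a proper alignment method. The names `percent_identity` and `is_family_member`, and the default threshold parameter, are hypothetical.

```python
def percent_identity(a, b):
    """Percent of positions at which two pre-aligned, equal-length
    strings carry the same character."""
    if len(a) != len(b) or not a:
        raise ValueError("this sketch assumes non-empty, equal-length strings")
    matches = sum(x == y for x, y in zip(a, b))
    return 100.0 * matches / len(a)


def is_family_member(candidate, reference, threshold=30.0):
    """Apply an identity cutoff (30% is the least stringent of the
    preferred thresholds named in the text)."""
    return percent_identity(candidate, reference) >= threshold
```

For instance, `percent_identity("ACGT", "ACGA")` evaluates to 75.0, so the pair passes the 30% and 50% cutoffs but not the 80% cutoff.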

B) Encoding the biological molecule(s) into character strings

The biological molecule(s) are encoded into character strings. In the simplest case, the character string is identical to the character code conventionally used to represent the biological molecule. Thus, for example, the character string can comprise the characters A, C, G, T, or U where a nucleic acid is encoded. Similarly, the standard amino acid nomenclature can be used to represent a polypeptide sequence. The encoding scheme is, to a large extent, arbitrary. Thus, for example, in a nucleic acid the characters A, C, G, T, and U can be represented by the integers 1, 2, 3, 4, and 5, respectively, and the nucleic acid can be represented as a string of these integers, which is itself a single (typically very large) integer. Other encoding schemes are also possible. For example, the biological molecule can be encoded into a character string in which each "subunit" of the molecule is encoded as a multi-character string. Various compressed representations are also possible (e.g., a recurring motif can be represented once, with appropriate pointers identifying each occurrence).
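The integer encoding described in this paragraph can be sketched in Python. The mapping A, C, G, T, U to 1 through 5 is taken from the text; the function names (`encode_nucleic_acid`, `decode_nucleic_acid`, `as_integer`) are hypothetical and the code is only an illustration of one possible scheme.

```python
# Mapping from the text: A, C, G, T, U -> 1, 2, 3, 4, 5.
NUCLEOTIDE_TO_DIGIT = {"A": "1", "C": "2", "G": "3", "T": "4", "U": "5"}
DIGIT_TO_NUCLEOTIDE = {d: n for n, d in NUCLEOTIDE_TO_DIGIT.items()}


def encode_nucleic_acid(sequence):
    """Encode a nucleotide sequence as a string of digits."""
    return "".join(NUCLEOTIDE_TO_DIGIT[base] for base in sequence.upper())


def decode_nucleic_acid(digits):
    """Recover the nucleotide sequence from its digit string."""
    return "".join(DIGIT_TO_NUCLEOTIDE[d] for d in digits)


def as_integer(sequence):
    """Read the digit string as one (typically very large) integer."""
    return int(encode_nucleic_acid(sequence))
```

Because no nucleotide maps to 0, the digit string never has a leading zero, so converting it to an integer and back loses no information. For example, `encode_nucleic_acid("GATTACA")` gives `"3144121"`, and `decode_nucleic_acid("3144121")` recovers `"GATTACA"`.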

Nor need the biological molecules be encoded into data structures that are discrete/single strings. More complex data structures (e.g., arrays, linked lists, and indexed structures including, but not limited to, databases or data tables) can also be used to encode the biological molecule(s).

Essentially any data structure that permits input, storage, and retrieval of a representation of the biological molecule(s) is suitable. While these operations can be performed manually (e.g., with pencil and paper or a card file), preferred data structures are those that can be manipulated optically and/or electronically and/or magnetically and thus permit automated input, storage, and output operations (e.g., by a computer).

III. Selecting substrings

In a preferred embodiment, the character strings of the encoded biological molecule(s) provide an initial population of strings from which substrings are selected. Typically, at least two substrings are selected, one from each of the initial strings. Where there are more than two initial character strings, it is not necessary that every initial character string provide a substring, as long as at least two initial character strings each provide one. In preferred embodiments, however, at least one substring will be selected from each initial string.

A) Substring length

There is essentially no limit on the maximum number of substrings that can be selected from an initial string, other than the theoretical maximum number of strings that can be generated from a given string. Thus, for example, the maximum number of substrings selected from one initial string is the number of strings generated by a complete permutation of the initial string(s).

Even for initial strings of relatively modest length, however, the number of permutations is very large. Thus, in a preferred embodiment, the substrings selected from any one initial string are chosen so that they do not overlap. In other words, the substrings selected from any one initial string are chosen such that, if concatenated in the correct order, they would reproduce the complete initial string from which they were selected.

The substrings are also preferably selected so as not to be unduly short. Typically, a substring will be no shorter than the minimum string length needed to represent one subunit of the encoded biological molecule. Thus, for example, where the encoded biological molecule is a nucleic acid, the substring will be long enough to encode at least one nucleotide. Similarly, where the encoded biological molecule is a polypeptide, the substring will be long enough to encode at least one amino acid.

In preferred embodiments, a selected substring encodes at least two subunits of the encoded biological molecule, preferably at least 4, more preferably at least 10, still more preferably at least 20, and most preferably at least 50, 100, or 1000 subunits.

The substring length can be chosen to capture a particular level of biological organization. For example, a substring can be selected that encodes an entire gene, cDNA, or mRNA. At a "higher" level of organization, a substring can be selected that encodes a series of related genes, cDNAs, mRNAs, etc., such as those found in an operon or in a regulatory or synthetic pathway. At a still higher level, a substring can be selected that encodes the total nucleic acid (e.g., genomic DNA, total mRNA, total cDNA) of an individual. There is essentially no limit to the "level of organization" captured by the substring, as long as the initial string from which the substring is selected encodes a higher level of organization than the substring. Thus, where the substring is selected to encode an individual gene, the initial string may encode an entire metabolic pathway; where the substring is selected to encode the total nucleic acid of an individual, the initial string may encode the total nucleic acid of a population.

Conversely, substrings can be selected that encode subunits of a particular level of biological organization. Thus, for example, a substring can be selected that encodes a particular domain of a protein or a particular region of a chromosome (e.g., a region that is characteristically amplified, deleted, or translocated).

B) Substring selection algorithms

Any of a wide variety of approaches can be used to select the substring(s); the particular approach is determined by the problem being modeled. Preferred selection approaches include, but are not limited to, random substring selection, uniform substring selection, motif-based selection, alignment-based selection, and frequency-based selection. The same substring selection method need not be applied to every initial character string; different methods can be used for different initial strings. It is also possible to apply more than one substring selection method to any given initial character string.

1. Random substring selection

Most simply, substrings can be selected at random. Many approaches are available for selecting substrings "randomly". For example, where substrings of minimum length "L" are to be selected from an encoded character string of length "M", "cleavage points" can be selected using a random number generator that produces numbers (indicating positions along the string) ranging from L to M-L (to avoid short terminal strings). "Internal" substrings of length less than L are discarded.
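The cleavage-point approach can be sketched in Python as follows. The parameters `min_len` (L) and the string length (M) follow the text; the number of cleavage points `n_cuts`, the function name `random_cleavage_substrings`, and the handling of strings too short to cleave are assumptions made for illustration.

```python
import random


def random_cleavage_substrings(string, min_len, n_cuts, rng=None):
    """Cut `string` at random positions drawn from L..M-L and discard
    fragments shorter than L."""
    rng = rng or random.Random()
    m = len(string)
    if m < 2 * min_len:
        # No admissible cleavage point exists (assumed behavior).
        return [string]
    # Drawing from L..M-L guarantees both terminal fragments have length >= L.
    cuts = sorted({rng.randint(min_len, m - min_len) for _ in range(n_cuts)})
    bounds = [0] + cuts + [m]
    fragments = [string[a:b] for a, b in zip(bounds, bounds[1:])]
    # Internal fragments between close cleavage points may be too short.
    return [f for f in fragments if len(f) >= min_len]
```

Note that once short internal fragments are discarded, the surviving substrings no longer concatenate back to the full initial string, so this variant trades completeness for a guaranteed minimum length.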

In another approach, each position along the character string is addressed (e.g., in a character string of length N, each position can be addressed by an integer ranging from 1 to N). A minimum substring length "L" and a maximum substring length "M" are first selected, and a random number generator is then used to produce a number "V" between L and M. A substring is selected spanning positions 1 through V; position V+1 is then treated as position 1, and the selection is repeated on the remainder of the string. The process is repeated until the entire initial string has been addressed.
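This sequential approach can be sketched in Python. Here `min_len` and `max_len` correspond to L and M in the paragraph above; the function name `sequential_random_substrings` is hypothetical, and keeping a final fragment that may be shorter than L is an assumption, since the text does not say how the tail of the string is handled.

```python
import random


def sequential_random_substrings(string, min_len, max_len, rng=None):
    """Walk along `string`, repeatedly cutting off a substring whose
    length V is drawn at random from min_len..max_len."""
    if not 1 <= min_len <= max_len:
        raise ValueError("require 1 <= min_len <= max_len")
    rng = rng or random.Random()
    substrings = []
    start = 0
    while start < len(string):
        v = rng.randint(min_len, max_len)
        # Slicing past the end simply yields the (possibly short) tail.
        substrings.append(string[start:start + v])
        start += v  # position V+1 becomes the new position 1
    return substrings
```

Because nothing is discarded, the substrings are non-overlapping and concatenate in order to reproduce the initial string exactly.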

Other methods of randomly selecting substrings can readily be devised. For the purposes of this invention, a "random" selection method need not meet formal statistical requirements for randomness. Pseudorandom or haphazard selection is sufficient in this context.

2. Uniform substring selection

In uniform substring selection, the number of substrings to be obtained from each initial string is first determined. The initial string is then divided uniformly into the desired number of substrings. Where the initial string has a length that does not permit uniform division, one or more substrings that are shorter or longer than the average can be allowed.
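Uniform selection can be sketched in Python as shown below. The function name `uniform_substrings` is hypothetical, and giving the extra characters to the leading substrings when the length is not evenly divisible is one arbitrary choice among several the text would permit.

```python
def uniform_substrings(string, n_substrings):
    """Divide `string` into `n_substrings` pieces of near-equal length."""
    if not 1 <= n_substrings <= len(string):
        raise ValueError("n_substrings must be between 1 and len(string)")
    base, extra = divmod(len(string), n_substrings)
    substrings = []
    start = 0
    for i in range(n_substrings):
        # The first `extra` substrings are one character longer than the rest.
        size = base + (1 if i < extra else 0)
        substrings.append(string[start:start + size])
        start += size
    return substrings
```

For example, `uniform_substrings("ACGTACGTAC", 3)` returns `["ACGT", "ACG", "TAC"]`: a 10-character string cannot be split evenly into three, so one substring is longer than the others.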

3. Motif-based selection method

Substrings can be selected from the initial strings using motif-based selection. In this approach, the initial character string(s) are scanned for occurrences of a particular preselected motif. The substring is then selected so that its endpoint(s) occur in a predefined relationship to the motif. Thus, for example, the end of the substring may lie within the motif, or a preselected number of subunits "upstream" or "downstream" of the end of the motif.

The motif may be completely arbitrary, or it may reflect the properties of a physical agent or biological molecule. Thus, for example, where the encoded biological molecule is a nucleic acid, the motif can be chosen to reflect the binding specificity of a restriction endonuclease (e.g., EcoRV, HindIII, BamHI, PvuII), a protein binding site, a particular intron/exon junction, a transposon, and the like. Similarly, where the encoded biological molecule is a protein, the motif can reflect a protease binding site, a protein binding site, a receptor binding site, a particular ligand, a complementarity determining region, an epitope, and the like.

Similarly, polysaccharides can contain particular sugar motifs, and glycoproteins can have particular sugar motifs and/or particular amino acid motifs.
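As a hedged illustration of motif-based selection on an encoded nucleic acid, the following Python sketch places a substring endpoint at a fixed offset from the start of each motif occurrence. The function name `motif_substrings` and the `offset` parameter are assumptions of this sketch. The usage example uses the BamHI recognition sequence GGATCC with an offset of 1, mimicking a cut after the first G.

```python
def motif_substrings(s, motif, offset=0):
    """Cut s at (start of each motif occurrence + offset).

    offset=0 cuts immediately upstream of the motif; an offset inside
    0..len(motif) places the endpoint within the motif; larger or
    negative offsets place it downstream or upstream of the motif.
    """
    cuts = set()
    start = s.find(motif)
    while start != -1:
        cut = start + offset
        if 0 < cut < len(s):          # ignore cuts that fall outside the string
            cuts.add(cut)
        start = s.find(motif, start + 1)
    bounds = [0] + sorted(cuts) + [len(s)]
    return [s[a:b] for a, b in zip(bounds, bounds[1:])]


fragments = motif_substrings("AAGGATCCTTTGGATCCAA", "GGATCC", offset=1)
# fragments == ["AAG", "GATCCTTTG", "GATCCAA"]
```

A different motif, or several motifs applied in turn, can be handled by calling the same function with other arguments.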

The motif need not reflect the primary structure of the encoded biological molecule. Secondary and tertiary structure motifs are also possible and can be used to delineate substring endpoints. Thus, for example, an encoded protein may contain a particular α-helix, β-sheet, α-helix motif, and an occurrence of this motif can be used to delineate a substring endpoint.

Another "higher-order" kind of motif is a "meta-motif," as represented by a "fragmentation digest." In this approach, substring endpoints are not determined by the occurrence of a single motif but are delineated by the combined pattern and spacing of one or more motifs.

Motifs that reflect the information content of particular regions of a character string, rather than strictly reflecting sequence patterns, can also be selected and used. Thus, for example, U.S. Patent No. 5,867,402 describes a computer system and computational method that process sequence signals by transforming them into an information content weight matrix, denoted R_i(b,l). A second transformation that recognizes a particular sequence signal is then applied to the information content weight matrix R_i(b,l), producing an R_i value comprising the individual information content of that particular sequence signal. Other approaches to determining the information content of character strings are also well known (Staden (1984) Nucleic Acids Res. 12: 505-519; Schneider (1994) Nanotechnology 5: 1-8; Herman et al. (1992) J. Bacteriol. pp. 3558-3560; Schneider et al. (1990) Nucleic Acids Res. 18(20): 6097-6100; Berg et al. (1988) J. Mol. Biol. 200(4): 709-723).

Other motifs that reflect biological signals are also contemplated. Thus, for example, a motif delineating the end of a substring can be a stop codon or a start codon in the case of an encoded nucleic acid, or a methionine or polyadenylation signal in the case of a protein.

The same motif need not be applied to every initial sequence. In addition, multiple motifs, meta-motifs, and/or motif/meta-motif combinations can be applied to any sequence.

4. Alignment-based selection

In another approach, substrings are selected by aligning two or more initial character strings, identifying regions of high identity between the initial strings, and choosing the substring endpoint(s) within those regions. Thus, for example, after sequence alignment, substrings can be selected so that their endpoint(s) fall within (e.g., in the middle of) a region having at least 30%, preferably at least 50%, more preferably at least 70%, or even at least 99% sequence identity over a window at least 5 subunits long, preferably at least 10, more preferably at least 20, still more preferably at least 30, and most preferably at least 50, 100, 200, 500, or 1000 subunits long.
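The window criterion can be illustrated with a short Python sketch. It assumes the two strings have already been aligned and padded to equal length by some alignment procedure, and it places a candidate endpoint at the middle of every window that meets the identity threshold. The function name `high_identity_cut_points` and its default values are assumptions of this sketch.

```python
def high_identity_cut_points(a, b, window=5, min_identity=0.7):
    """Return candidate endpoint positions for two pre-aligned strings.

    A position qualifies when it is the midpoint of a window of length
    `window` in which the fraction of identical subunits is at least
    `min_identity`.
    """
    if len(a) != len(b):
        raise ValueError("aligned strings must have equal length")
    cuts = []
    for start in range(len(a) - window + 1):
        same = sum(x == y for x, y in zip(a[start:start + window],
                                          b[start:start + window]))
        if same / window >= min_identity:
            cuts.append(start + window // 2)  # middle of the qualifying window
    return cuts
```

Any of the returned positions can then serve as a substring endpoint in both parent strings.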

The terms "sequence identity," "percent sequence identity," "percent identity," and "percent homology," in the context of two or more biological macromolecules (e.g., nucleic acids or polypeptides), refer to two or more sequences or subsequences that are the same, or in which a specified percentage of the subunits (e.g., amino acid residues or nucleotides) are the same, when compared and aligned for maximum correspondence, as measured using a sequence comparison algorithm or by visual inspection.

For sequence comparison, one sequence typically acts as a reference sequence to which test sequences are compared. In a preferred embodiment, when a sequence comparison algorithm is used, the test and reference sequences are input into a computer, subsequence coordinates are designated, and, if necessary, sequence algorithm program parameters are designated. The sequence comparison algorithm then calculates the percent sequence identity of the test sequence(s) relative to the reference sequence, based on the designated program parameters.

Alignment and sequence comparison algorithms are well known to those skilled in the art. For example, optimal alignment of sequences for comparison can be conducted by, but is not limited to, the local homology algorithm of Smith & Waterman (1981) Adv. Appl. Math. 2:482; the homology alignment algorithm of Needleman & Wunsch (1970) J. Mol. Biol. 48:443; the search for similarity method of Pearson & Lipman (1988); implementations of these algorithms in commercial modules and/or commercial software packages (e.g., the Wisconsin Genetics Software Package, Genetics Computer Group, 575 Science Dr., Madison, WI); or visual inspection (see Ausubel et al., cited above).
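For orientation, a textbook-style sketch of global alignment by dynamic programming, in the spirit of the Needleman & Wunsch method cited above, is given below. It is not the implementation of any of the named packages. The scoring values (match +1, mismatch -1, linear gap penalty -1) and the function name `needleman_wunsch` are assumptions chosen for brevity.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-1):
    """Globally align a and b; return (aligned_a, aligned_b, score)."""
    n, m = len(a), len(b)
    # score[i][j] = best score for aligning a[:i] with b[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)
    # Trace back from the bottom-right corner to recover one optimal alignment.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        sub = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1
    return "".join(reversed(out_a)), "".join(reversed(out_b)), score[n][m]
```

The aligned strings returned here have equal length, so they can be passed directly to a window-based identity test when choosing substring endpoints.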

One example of a useful algorithm is PILEUP. PILEUP creates a multiple sequence alignment from a group of related sequences using progressive, pairwise alignments. It also plots a tree or dendrogram showing the clustering relationships used to create the alignment. PILEUP uses a simplification of the progressive alignment method of Feng & Doolittle (1987) J. Mol. Evol. 35:351-360. The method used is similar to that described by Higgins & Sharp (1989) CABIOS 5:151-153. The program can align up to 300 sequences, each with a maximum length of 5,000 nucleotides or amino acids. The multiple alignment procedure begins with the pairwise alignment of the two most similar sequences, producing a cluster of two aligned sequences. This cluster is then aligned to the next most similar sequence or cluster of aligned sequences. Two clusters of sequences are aligned by a simple extension of the pairwise alignment of two individual sequences. The final alignment is achieved by a series of progressive, pairwise alignments. The program is run by designating specific sequences and their amino acid or nucleotide coordinates for regions of sequence comparison and by designating the program parameters. For example, a reference sequence can be compared to other test sequences to determine the percent sequence identity relationship using the following parameters: default gap weight (3.00), default gap length weight (0.10), and weighted end gaps.

Another example of an algorithm suitable for determining percent sequence identity and sequence similarity is the BLAST algorithm, which is described in Altschul et al. (1990) J. Mol. Biol. 215:403-410. Software for performing BLAST analyses is available through the National Center for Biotechnology Information (http://www.ncbi.nlm.nih.gov/). The algorithm first identifies high scoring sequence pairs (HSPs) by identifying short words of length W in the query sequence that either match or satisfy some positive-valued threshold score T when aligned with a word of the same length in a database sequence. T is referred to as the neighborhood word score threshold (Altschul et al., cited above). These initial neighborhood word hits act as seeds for initiating searches to find longer HSPs containing them. The word hits are then extended in both directions along each sequence for as far as the cumulative alignment score can be increased. Extension of the word hits in each direction is halted when the cumulative alignment score falls off by the quantity X from its maximum achieved value; when the cumulative score goes to zero or below because of the accumulation of one or more negative-scoring residue alignments; or when the end of either sequence is reached. The BLAST algorithm parameters W, T, and X determine the sensitivity and speed of the alignment. The BLAST program uses as defaults a word length (W) of 11, the BLOSUM62 scoring matrix (see Henikoff & Henikoff (1989) Proc. Natl. Acad. Sci. USA 89:10915), alignments (B) of 50, expectation (E) of 10, M=5, N=-4, and a comparison of both strands.
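The three halting conditions for word-hit extension can be made concrete with a simplified, one-directional, ungapped sketch. This is not the BLAST implementation. It assumes the seed word is an exact match, scores nucleotides with the reward M=5 and penalty N=-4 mentioned above, and uses a hypothetical function name `extend_right` and a hypothetical default for the drop-off X.

```python
def extend_right(query, subject, q_pos, s_pos, word_len,
                 x_drop=20, match=5, mismatch=-4):
    """Extend an exact seed hit to the right without gaps.

    Returns (length, score) of the best-scoring extension found.
    Extension halts when the score falls x_drop below its maximum,
    when the score reaches zero or below, or at the end of either sequence.
    """
    score = best = word_len * match   # assumed: the seed itself matches exactly
    best_len = word_len
    i = word_len
    while q_pos + i < len(query) and s_pos + i < len(subject):
        score += match if query[q_pos + i] == subject[s_pos + i] else mismatch
        i += 1
        if score > best:
            best, best_len = score, i          # new maximum: remember its extent
        elif best - score >= x_drop or score <= 0:
            break                              # fell off by X, or dropped to zero
    return best_len, best
```

A full search would extend each hit in both directions and keep only those HSPs whose score passes a significance cutoff; those steps are omitted here.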

In addition to calculating percent sequence identity, the BLAST algorithm also performs a statistical analysis of the similarity between two sequences (see, e.g., Karlin & Altschul (1993) Proc. Natl. Acad. Sci. USA 90:5873-5787). One measure of similarity provided by the BLAST algorithm is the smallest sum probability (P(N)), which indicates the probability that a match between two nucleotide or amino acid sequences would occur by chance. For example, a test nucleic acid is considered similar to a reference sequence if the smallest sum probability in a comparison of the test nucleic acid to the reference nucleic acid is less than 0.1, more preferably less than 0.01, and most preferably less than 0.001.

The above similarity algorithms are illustrative and not limiting. Similarity may be determined over the entire length of the initial character strings, or the similarity determination may be restricted to particular subregions.

5. Frequency-based selection

In frequency-based substring selection, substrings are selected so that their endpoint(s) occur in a particular relationship to regions of the string that meet a particular preselected frequency criterion. For example, where it is desired to exclude encoded biological molecules containing highly repetitive subunit patterns (e.g., in a nucleic acid, a high concentration of AC repeats such as "ACACACACACAC"), substring selection can be designed so that the endpoint occurs before a particular repeat density of a particular subunit or subunit motif is reached. In this example, the repeat density is the number of subunits per character string length measured in subunits, or the number of subunit motifs per character string length measured in subunit motifs.

Thus, in the above example, a substring can be selected so that its endpoint falls before a character string region in which the AC motif occurs at a frequency of 0.5 (50%) or greater over at least 4 motif lengths (in this case 8 subunits).
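The AC example can be sketched as a sliding-window density test. In this sketch the repeat density of a window is taken to be the number of non-overlapping motif occurrences divided by the window length measured in motif lengths, which is one reading of the definition above. The function name `endpoint_before_repeat` and its parameters are assumptions.

```python
def endpoint_before_repeat(s, motif="AC", motifs_per_window=4, max_density=0.5):
    """Return the index at which a substring should end so that it stops
    before the first window whose motif density reaches max_density.

    Returns len(s) when no window reaches the threshold.
    """
    window = motifs_per_window * len(motif)   # e.g. 4 motif lengths = 8 subunits
    for start in range(len(s) - window + 1):
        # str.count counts non-overlapping occurrences of the motif
        density = s[start:start + window].count(motif) / motifs_per_window
        if density >= max_density:
            return start
    return len(s)


end = endpoint_before_repeat("GGTTGGTTGGACACACACAC")
# end == 6, so the selected substring is "GGTTGG"
```

Setting `motif` to a single subunit and `max_density` to 1.0 gives the case of a subunit occurring at 100% frequency over a run of X subunits.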

In another example, substrings are selected based on the occurrence of a particular subunit at 100% frequency over at least X subunits. Thus, for example, where the encoded biological molecule is a nucleic acid and the subunit is adenosine "A", frequency-based selection can set a substring endpoint where a polyadenylation signal (e.g., AAAAAAA) occurs. Depending on how the frequency-based selection criteria are designed, the result can be identical to that obtained with the motif-based selection described above.

6. Other criteria

Various other criteria can be used to influence and/or determine the selection of particular substrings. Such criteria include the predicted hydrophobicity and/or pI and/or pK of the molecule encoded by the substring. Other criteria include the cross-over number, the desired fragment size, the substring length distribution, and/or rational information regarding the folding of the molecule(s) encoded by the substring(s).

IV. Concatenation of substrings

Once a population of substrings has been selected from the initial strings, the substrings are concatenated to produce new strings of approximately or exactly the same length as the parent initial strings. String concatenation can be accomplished in a wide variety of ways.

In one embodiment, substrings are concatenated at random to produce "recombined" strings. In one such "random" concatenation scheme, each substring is assigned a unique identifier (e.g., an integer or other identifier). Identifiers are drawn at random from the pool (e.g., using a random number generator), and the substrings corresponding to those identifiers are joined to form a concatenated string. When the length of the joined substrings approximately or exactly matches the length of the initial character string(s), the process starts afresh to build another concatenated string. The process is repeated until the substrings are exhausted. Alternatively, substrings can be selected without removing them from the "substring pool," in which case the process is repeated until the desired number of "standard-length" strings has been obtained.
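Both variants of random concatenation might be sketched as follows. The function names are hypothetical, list indices play the role of the unique identifiers, and "approximately the same length" is interpreted as "stop as soon as the target length is reached or exceeded," which is an assumption of this sketch.

```python
import random

def concatenate_without_replacement(substrings, target_len, rng=None):
    """Draw each substring once, in random order, until the pool is exhausted."""
    rng = rng or random.Random()
    pool = list(substrings)
    rng.shuffle(pool)                 # equivalent to drawing identifiers at random
    products, current = [], ""
    for piece in pool:
        current += piece
        if len(current) >= target_len:
            products.append(current)  # target length reached: start a new string
            current = ""
    if current:
        products.append(current)      # leftover pieces form a shorter final string
    return products

def concatenate_with_replacement(substrings, target_len, n_products, rng=None):
    """Draw substrings without removing them; stop after n_products strings.

    Assumes `substrings` is a non-empty list of non-empty strings.
    """
    rng = rng or random.Random()
    products = []
    for _ in range(n_products):
        current = ""
        while len(current) < target_len:
            current += rng.choice(substrings)
        products.append(current)
    return products
```

The first function consumes the pool, so the number of products is bounded by the total length of the substrings. The second can produce any requested number of strings.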

In a preferred embodiment, however, it is desirable that the substrings forming the concatenated strings retain the relative order they had in the initial strings. This can be accomplished in many ways. For example, each substring selected from a particular parent string can be "tagged" with an identifier (e.g., a pointer) indicating its position relative to the other substrings derived from that parent string. Substrings derived from corresponding positions of other initial strings are assigned similar position identifiers. This scheme is illustrated in FIG. 2, in which each of three initial strings (designated A, B, and C) yields five substrings numbered 1 through 5. The substrings are identified as shown (e.g., A1, A2, ... A5, B1, B2, ... B5, C1, C2, ... C5). A concatenated string is produced by randomly selecting one substring from pool 1 (consisting of A1, B1, and C1), one substring from pool 2 (consisting of A2, B2, and C2), and so on through pool 5.
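A compact sketch of the order-preserving scheme of FIG. 2, assuming every parent has been divided into the same number of substrings so that the position in each parent's list serves as the position identifier. The function name `ordered_recombine` is hypothetical, and substrings are copied rather than removed from their pools.

```python
import random

def ordered_recombine(parent_substrings, rng=None):
    """Pick one substring per position pool, preserving parental order.

    parent_substrings: one list of substrings per parent, all lists the
    same length. Pool k holds the k-th substring of every parent.
    """
    rng = rng or random.Random()
    pools = list(zip(*parent_substrings))   # pool 1 = (A1, B1, C1), and so on
    return [rng.choice(pool) for pool in pools]


parents = [[f"{name}{k}" for k in range(1, 6)] for name in "ABC"]
product = ordered_recombine(parents, random.Random(0))
# product is a list such as [X1, X2, X3, X4, X5] with each X drawn from A, B, C
```

Joining the returned list with `"".join(...)` gives the concatenated string.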

In this concatenation scheme, once a substring has been concatenated it is removed from the pool. If, however, concatenation is performed by "copying" the substring into the concatenated string, the substring remains available for subsequent concatenations. This permits a greater variety of concatenations.

In another embodiment, various alignment and/or similarity algorithms can be used to maintain the relative order of the substrings during concatenation. In this approach, substrings are assigned their relative positions in the concatenated string by joining regions of high similarity (see, e.g., FIG. 3).

In a preferred embodiment, the initial encoded biological molecules are related to one another to some degree. Thus, for example, where the encoded molecules represent members of a particular enzyme family, or individuals belonging to a particular population, the substrings are expected to share regions of considerable similarity. In addition, critical functional domains tend to be conserved, which further increases the similarity of particular regions of the substrings. Thus, aligning regions of high similarity between substrings reconstructs a relative order of the substrings that reflects their order in the initial strings.

The order need not be perfectly established in every concatenated character string. It is preferred that a fraction of the concatenated strings (e.g., preferably at least 1 percent, more preferably at least 10 percent, still more preferably at least 20 percent, and most preferably at least 40, 60, or 80 percent) preserve the original order.

The use of similarity measures to reorder the substrings is analogous to sequencing by hybridization (SBH), in which similarity algorithms are used to reconstruct a nucleic acid sequence from fragments of the complete sequence (see, e.g., Barinaga (1991) Science 253: 1489; Bains (1992) Bio/Technology 10: 757-758; Drmanac and Crkvenjakov, Yugoslav Patent Application #570/87, 1987; Drmanac et al. (1989) Genomics 4: 114; Strezoska et al. (1991) Proc. Natl. Acad. Sci. USA 88: 10089; and Drmanac and Crkvenjakov, U.S. Pat. No. 5,202,231).

Any concatenation operation alone, or selection and concatenation operations together, can be represented by a particular operator. Operators of this kind are well known in genetic algorithms. Thus, for example, a "crossing over" (reciprocal translocation) operator can be defined as exchanging subsequences located at analogous positions in two different initial sequences. Similarly, "linkage" operators can be defined that link particular subsequences to one another so that they cross over together, whether or not they are adjacent. Other operators will be apparent to those skilled in the art from the description herein.
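A reciprocal exchange of this kind can be written as a small operator on two strings. The sketch assumes that "analogous positions" means the same index range in both parents; the function name `crossover` is hypothetical.

```python
def crossover(parent_a, parent_b, start, end):
    """Exchange the subsequences parent_a[start:end] and parent_b[start:end]."""
    child_a = parent_a[:start] + parent_b[start:end] + parent_a[end:]
    child_b = parent_b[:start] + parent_a[start:end] + parent_b[end:]
    return child_a, child_b


children = crossover("AAAAAA", "CCCCCC", 2, 4)
# children == ("AACCAA", "CCAACC")
```

A linkage operator could be built on top of this by applying the same exchange to every index range in a linked group.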

V. Adding the product strings to the collection of strings

The concatenated strings produced by the methods of this invention are added to the collection of strings that forms the "populated dataset." The strings in this collection can be used as initial strings in further iterations of the methods described above (see FIG. 1). "Adding," as used here, means identifying one or more strings as included in the set of strings. This can be accomplished in a variety of ways, including but not limited to: copying or moving the string(s) in question into a data structure that is the collection of strings; providing a pointer from the string to the data structure that represents the collection; setting a flag associated with the string that indicates membership in a particular set; or simply specifying a rule under which the generated string(s) are included in the collection.

Once one or more character strings have been generated, a selection criterion can optionally be imposed to determine whether the concatenated strings are to be included in the collection of strings (e.g., as initial strings for a second iteration and/or as elements of the populated data structure). A wide variety of selection criteria can be used.

In one embodiment, a similarity index can be used as the selection criterion. Newly generated concatenated character strings are then required to share a predefined degree of similarity (e.g., at least 10%, preferably at least 20% or 30%, more preferably at least 40% or 50%, and most preferably at least 60%, 70%, 80%, or even 90%) with one another, and/or with the initial strings (or encoded molecules), and/or with one or more "reference" strings.
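A minimal sketch of a similarity-index filter against a single reference string. The similarity measure used here is a deliberately simple position-by-position identity, divided by the longer length, and is an assumption of this sketch; an alignment-based measure could be substituted. The function names are hypothetical.

```python
def identity(a, b):
    """Fraction of positions at which a and b carry the same subunit."""
    if not a or not b:
        return 0.0
    matches = sum(x == y for x, y in zip(a, b))
    return matches / max(len(a), len(b))

def select_by_similarity(candidates, reference, threshold=0.6):
    """Keep the candidates whose identity to the reference meets the threshold."""
    return [c for c in candidates if identity(c, reference) >= threshold]


kept = select_by_similarity(
    ["ACGTACGTAC", "ACGTACGTTT", "TTTTTTTTTT"], "ACGTACGTAC", threshold=0.6)
# kept == ["ACGTACGTAC", "ACGTACGTTT"]
```

Strings that pass the filter can then be added to the collection, for instance by appending them to a list or inserting them into a set.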

Selection can also employ algorithms that assess "relatedness" even where sequence identity is very low. Such methods include "threading" algorithms and/or covariance measures.

Another selection criterion may require that the molecule(s) represented by the concatenated strings satisfy certain computationally predicted properties. Thus, for example, the selection criterion may require a minimum or maximum molecular weight, a minimum or maximum free energy in a particular buffer system, a minimum or maximum contact surface with a particular target molecule or surface, a particular net charge in a given buffer system, a predicted pK, pI, binding avidity, particular secondary or tertiary forms, and the like.

Yet another selection criterion may require that the molecule(s) represented by the concatenated strings satisfy certain empirically determined physical properties. Thus, for example, a selection criterion may require that the molecule represented by the concatenated string have a certain temperature stability, a certain level of enzymatic activity, the ability to produce a solution of a particular pH, a particular temperature and/or pH optimum, a minimum or maximum solubility in a particular solvent system, the ability to bind a target molecule with a minimum or maximum affinity, and the like. Using a physical property as a selection criterion generally requires that the molecule(s) represented by the concatenated string(s) be synthesized (e.g., chemically or by recombinant methods) or isolated.

The application of such selection criteria in physical systems is known to those skilled in the art (see, e.g., Stemmer et al. (1999) Tumor Targeting 4: 1-4; Ness et al. (1999) Nature Biotechnology 17: 893-896; Chang et al. (1999) Nature Biotechnology 17: 793-797; Minshull and Stemmer (1999) Current Opinion in Chemical Biology 3: 284-290; Christians et al. (1999) Nature Biotechnology 15: 436-438; Zhang et al. (1997) Proc. Natl. Acad. Sci. USA 94: 4504-4509; Patten et al. (1997) Curr. Opin. Biotech. 8: 724-733; Crameri et al. (1996) Nature Med. 2: 100-103; Crameri et al. (1996) Nature Biotechnology 14: 315-319; Gates et al. (1996) J. Mol. Biol. 255: 373-386; Stemmer (1996); Crameri and Stemmer (1995) BioTechniques 18: 194-195; U.S. Patents 5,605,793, 5,811,238, 5,830,721, 5,834,252, and 5,837,458; WO 95/22625; WO 97/0078; WO 97/35966; WO 99/41402; WO 99/41383; WO 99/41369; WO 9941368; EP 0934999; EP 0932670; WO 9923107; WO 9921979; WO 9831837; WO 9827230; and WO 9813487).

VI. Introduction of Additional Variation

In certain instances it is desirable to introduce additional variation into the population. This is particularly desirable when an iteration of an evolutionary algorithm using the initial population generated by the methods of this invention fails to provide a solution to the modeled problem (e.g., no member meets the selection criteria).

Many methods can be used to introduce variation into the population of strings produced by the methods of this invention. Variation can be introduced into the initial string(s) (the input to the method) or into the concatenated string(s) (the output). Such variation is preferably introduced before the selection step, although in certain cases it may be introduced after selection (e.g., before a second iteration).

In one approach, a stochastic operator is introduced into the algorithm that randomly/haphazardly alters one or more subunits of the encoded molecule. It will be appreciated that variation can be introduced into the molecule before encoding (the altered molecule then being encoded into a character string) and/or directly into the encoded character string. A stochastic operator typically involves two selection processes: one determines which subunit(s) to alter, and the other selects/determines how those subunit(s) are altered. Both selection processes can be stochastic, or only one of them. Thus, for example, the selection of the subunit(s) to "mutate" can be random/haphazard while the mutation always yields the same new/replacement subunit. Alternatively, the particular subunits to be mutated can be predetermined while the mutated/resulting subunit is random/haphazard. In yet another embodiment, both the selection of the subunit to mutate and the result of the mutation are random/haphazard.

In a preferred embodiment, the stochastic operator takes as an input or parameter a "mutation frequency" that sets the average frequency at which "mutations" occur. Thus, for example, if the mutation frequency is set to 10%, the operator mutates, on average, only one out of every 10 subunits contained in the initial strings. The mutation frequency can also be set as a range (e.g., 5%-10%).

The stochastic operator need not be applied to every initial string or to every substring of an initial string. Thus, in certain embodiments, application of the stochastic operator is restricted to particular initial strings and/or to particular substrings (e.g., domains) of one or more initial strings.
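The stochastic operator described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the implementation of the invention: the function name `stochastic_mutate`, the `region` parameter, and the nucleotide alphabet are assumptions introduced here. Both selection processes are stochastic in this version (which subunit is altered, and what it becomes), and the mutation frequency is the per-subunit probability of alteration.

```python
import random

def stochastic_mutate(s, alphabet, frequency, rng, region=None):
    """Alter each subunit in `region` with probability `frequency`.

    `region` is a hypothetical (start, end) pair restricting the operator
    to one substring/domain; None means the whole string.
    Assumes `alphabet` contains at least two distinct subunits.
    """
    start, end = region if region is not None else (0, len(s))
    out = list(s)
    for i in range(start, end):
        if rng.random() < frequency:  # first choice: whether to mutate here
            # second choice: which different subunit replaces the old one
            out[i] = rng.choice([a for a in alphabet if a != out[i]])
    return "".join(out)

rng = random.Random(0)  # seeded only so that runs are repeatable
mutant = stochastic_mutate("ACGTACGTAC", "ACGT", 0.10, rng)
```

With a frequency of 0.10 the operator alters one subunit in ten on average; any single run may alter more or fewer.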

When the selection criteria of the stochastic operator are fixed, the operator is no longer stochastic and instead introduces a "directed mutation". An example is an operator that converts every subunit "A" it encounters into subunit "B". Such a directed mutation operator can also take a mutation frequency as a parameter/attribute/input. As described above, the mutation frequency then limits how many of the subunits that the operator "encounters" are actually transformed.
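A directed mutation operator of the "A to B" kind can be sketched as follows. The function name and signature are assumptions made for illustration. The mutation frequency is treated as the probability that an encountered subunit is actually converted, which is one reasonable reading of the text rather than the only one.

```python
import random

def directed_mutate(s, source, target, frequency=1.0, rng=None):
    """Convert subunit `source` to `target` wherever it is encountered.

    Each encountered `source` subunit is converted with probability
    `frequency`; frequency=1.0 converts every occurrence.
    """
    rng = rng or random.Random()
    return "".join(
        target if ch == source and rng.random() < frequency else ch
        for ch in s
    )

converted = directed_mutate("AACA", "A", "B")  # every "A" becomes "B"
```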

It will also be appreciated that the stochastic operator described above can alter more than one encoded subunit. In certain embodiments, the operator alters multiple encoded subunits or entire substrings/domains.

Variation can also be introduced using insertion or deletion operators. An insertion or deletion operator is essentially a variant of the "stochastic mutation" operator. Instead of transforming one or more subunits, the deletion operator removes one or more subunits, while the insertion operator inserts one or more subunits. Deletion and insertion operators again involve two selection processes: one selects the site of the insertion or deletion, and the other selects the size of the deletion or the identity of the insertion. When both selection processes are predetermined (not stochastic), the operator becomes a directed insertion operator or a directed deletion operator. As with the stochastic operator, an insertion or deletion operator can take a mutation frequency as a parameter/attribute/input.
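The two selection processes of the insertion and deletion operators can be illustrated with the sketch below. All names (`delete_op`, `insert_op`, `stochastic_indel`, `max_size`) are hypothetical. The first two functions are the directed forms, with site and size/content supplied by the caller; the third makes both choices at random.

```python
import random

def delete_op(s, position, size):
    """Directed deletion: remove `size` subunits starting at `position`."""
    return s[:position] + s[position + size:]

def insert_op(s, position, payload):
    """Directed insertion: insert the subunits in `payload` at `position`."""
    return s[:position] + payload + s[position:]

def stochastic_indel(s, alphabet, rng, max_size=3):
    """Stochastic form: both the site and the size/content are random."""
    position = rng.randrange(len(s) + 1)  # choice 1: where
    size = rng.randint(1, max_size)       # choice 2: how much
    if rng.random() < 0.5:
        return delete_op(s, position, size)
    payload = "".join(rng.choice(alphabet) for _ in range(size))
    return insert_op(s, position, payload)
```

Note that these operators change string length, so a later step may need to tolerate product strings of slightly different lengths.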

In another embodiment, variation can be increased by adding one or more initial strings that are randomly or haphazardly generated and bear no necessary relationship to the initial strings derived from the biological molecule(s). The variation-introducing initial string(s) can be generated entirely at random or, in certain embodiments, according to predetermined criteria (e.g., frequency of occurrence of particular subunits, minimum and/or maximum similarity to the encoded strings, etc.). The variation-introducing initial strings need not be full-length strings and may simply comprise one or more substrings. Strings or substrings of this kind can also be used to reduce variation. Thus, where a particular molecular domain is "favored", strings or substring(s) encoding that domain can be added to the population of initial strings.
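As a minimal sketch of generating variation-introducing strings under one predetermined criterion (the frequency of occurrence of particular subunits), the following uses the standard-library `random.Random.choices` method. The function name `random_strings` and the weight values are assumptions for illustration only.

```python
import random

def random_strings(n, length, weights, rng=None):
    """Generate `n` random strings of `length` subunits.

    `weights` maps each subunit to its relative frequency of occurrence;
    at least one weight must be positive.
    """
    rng = rng or random.Random()
    symbols = list(weights)
    probs = [weights[sym] for sym in symbols]
    return [
        "".join(rng.choices(symbols, weights=probs, k=length))
        for _ in range(n)
    ]

# Example: GC-rich strings to add to a population of nucleic acid strings
extra = random_strings(5, 20, {"A": 1, "C": 3, "G": 3, "T": 1})
```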

VII. Populating a Data Structure

In one embodiment, all of the concatenated string(s) produced by the methods of this invention are used to populate a data structure and/or are used as initial strings in a further iteration of the methods described above. In another embodiment, selection criteria are imposed as described above, and only the concatenated strings that meet the selection criteria are used as initial strings and/or used to populate the data structure. The data structure can be populated with the concatenated representations of the encoded molecule(s) used in the processes described above. Alternatively, the concatenated strings can be partially deconvolved to regenerate a simpler encoded or direct representation of the encoded biological molecules, and these deconvolved strings can be used to populate the data structure.
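The two uses of the concatenated strings (populating the data structure and seeding a further iteration) can be sketched as a simple loop. The names `iterate`, `recombine` and `criterion` are hypothetical. `recombine` stands in for whatever substring-selection and concatenation procedure is used, and the data structure here is just a Python list.

```python
def iterate(initial, recombine, criterion=None, rounds=2):
    """Populate a list with concatenated strings over several iterations.

    `recombine` maps a list of initial strings to a list of concatenated
    strings. If `criterion` is given, only strings that satisfy it are
    kept and reused as the initial strings of the next iteration.
    """
    data_structure = []
    current = list(initial)
    for _ in range(rounds):
        products = recombine(current)
        kept = [s for s in products if criterion is None or criterion(s)]
        data_structure.extend(kept)
        if kept:  # keep the previous initial strings if nothing passed
            current = kept
    return data_structure
```

Passing `criterion=None` corresponds to the embodiment in which every concatenated string is used.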

In one embodiment, the data structure can be as simple as a sheet of paper on which the concatenated strings are written, or a collection of cards each listing one or more concatenated strings. In a preferred embodiment, the data structure is embodied in a medium (e.g., mechanical and/or optical and/or magnetic and/or electrical) that permits manipulation of the data structure by an appropriately designed computer. In a particularly preferred embodiment, the data structure is formed in computer memory (e.g., dynamic, static, read-only) and/or in an optical, magnetic, or magneto-optical storage medium.

Even in computer-accessible form, the data structure can simply provide a list of the concatenated strings. Alternatively, the data structure can be organized to preserve relationships between the various "entries". At a simple level, this may entail maintaining only the identity and/or order of the entries. More sophisticated data structures are also available and can provide ancillary structures for indexing and/or sorting and/or maintaining relationships among one or more entries (e.g., concatenated strings) in the data structure. The data structure can additionally contain annotations on an entry, or links between an entry and an external data source. Preferred data structures include, but are not limited to, linked lists, tables, hash tables and other indexes, flat-file databases, relational databases, and local or distributed computing systems. In a particularly preferred embodiment, the data structure is a data file stored on conventional (e.g., magnetic and/or optical) media or read into computer memory.
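One way to picture a data structure that preserves entry identity and order, carries annotations, and offers a hash-table index is sketched below. The class and field names (`Entry`, `StringPopulation`, `annotations`) are assumptions introduced for illustration; the text itself does not prescribe any particular layout.

```python
from dataclasses import dataclass, field

@dataclass
class Entry:
    identity: int  # position in insertion order
    string: str    # the concatenated string
    annotations: dict = field(default_factory=dict)  # notes or external links

class StringPopulation:
    """Ordered list of entries plus a hash-table index by string."""

    def __init__(self):
        self.entries = []  # preserves identity and order
        self.index = {}    # string -> list of entry identities

    def add(self, string, **annotations):
        entry = Entry(len(self.entries), string, dict(annotations))
        self.entries.append(entry)
        self.index.setdefault(string, []).append(entry.identity)
        return entry.identity

    def find(self, string):
        return [self.entries[i] for i in self.index.get(string, [])]
```

For example, `pop.add("ACGTTG", parents="s1+s2")` stores a string together with a note on its origin, and `pop.find("ACGTTG")` retrieves every entry holding that string.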

VIII. Embodiment in a Programmed Digital Apparatus

The invention may be embodied in a fixed medium or transmissible program component containing logic instructions and/or data that, when loaded into an appropriately configured computing device, cause that device to populate a data structure (e.g., generate a pool/collection of concatenated strings) according to the methods of this invention.

FIG. 4 shows a digital device 700 that may be understood as a logical apparatus that can read instructions from media 717 and/or network port 719. Apparatus 700 can then use those instructions to direct the encoding of biological molecules, the manipulation of the encoded representation(s) of the molecules, and the population of the data structure. One type of logical apparatus that may embody the invention is a computer system 700 containing CPU 707, optional input devices 709 and 711, disk drives 715, and optional monitor 705. Fixed media 717 may be used to program such a system and may be a disk-type optical or magnetic medium, or a memory. Communication port 719 may also be used to program such a system and may represent any type of communication connection.

The invention may also be embodied in an ASIC (application specific integrated circuit) or a PLD (programmable logic device). In such a case, the invention may be embodied in a computer-understandable descriptor language that can be used to create an ASIC or PLD that operates as described herein.

The invention may also be embodied in the circuitry or logic processes of other digital apparatus, such as cameras, displays, image editing equipment, and the like.

IX. Embodiment in a Web Site

The methods of this invention can be implemented in a local or distributed computing environment. In a distributed environment, the methods may be run on a single computer comprising multiple processors or on a multiplicity of computers. The computers can be linked, e.g., through a common bus, but more preferably the computer(s) are nodes on a network. The network can be a general-purpose or dedicated, local or wide-area network, and in certain preferred embodiments the computers are components of an intranet or the Internet.

In a preferred Internet embodiment, a client system typically executes a Web browser and is coupled to a server computer executing a Web server. The Web browser is typically a program such as IBM's Web Explorer, Netscape, or Mosaic. The Web server is typically, but not necessarily, a program such as IBM's HTTP Daemon or another WWW daemon. The client computer is bi-directionally coupled, over a wired or wireless connection, with a website that provides access to software implementing the methods of this invention.

A user of a client computer connected to an intranet or the Internet can cause the client to request resources that are part of the web site hosting the application(s) that implement the methods of this invention. The server program(s) process the request and return the specified resources (assuming they are available). A standard naming convention known as the Uniform Resource Locator (URL) has been adopted. This convention encompasses several types of location names, presently including subclasses such as http (Hypertext Transfer Protocol), ftp (File Transfer Protocol), gopher, and WAIS (Wide Area Information Servers). When a resource is downloaded, it may include the URLs of additional resources. Thus, the user of the client can easily learn of the existence of new resources that were not specifically requested.
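The URL naming convention mentioned above can be illustrated with Python's standard `urllib.parse` module, which splits a URL into its scheme (the protocol subclass such as http or ftp), host, and path. The `describe` helper and the `example.org` addresses are placeholders chosen for this sketch and do not refer to any real site hosting the methods.

```python
from urllib.parse import urlsplit

def describe(url):
    """Return the (scheme, host, path) components of a URL."""
    parts = urlsplit(url)
    return parts.scheme, parts.netloc, parts.path

# Placeholder URLs, for illustration only
web_page = describe("http://example.org/app/start.html")
file_copy = describe("ftp://example.org/pub/sequences.txt")
```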

The software implementing the method(s) of this invention can run locally on the server hosting the website, in a true client-server architecture. In that case, the client computer posts requests to the host server, which runs the requested process(es) locally and then downloads the results to the client. Alternatively, the methods of this invention can be implemented in a "multi-tier" format, in which a component of the method(s) runs locally on the client. This can be implemented by software downloaded from the server on request by the client (e.g., a Java application), or by software installed on the client.

In one embodiment, the application(s) implementing the methods of this invention are divided into frames. In this approach, it is helpful to view an application not so much as a collection of features or functionality but as a collection of discrete frames or views. A typical application, for instance, includes a set of menu items, each of which invokes a particular frame, that is, a form that presents certain functionality of the application. From this perspective, an application is viewed not as a monolithic body of code but as a collection of applets, or bundles of functionality. In this manner, from within a browser, a user can select a Web page link that in turn invokes a particular frame of the application (i.e., a sub-application). Thus, for example, one or more frames may provide functionality for inputting and/or encoding biological molecule(s) into one or more character strings, while another frame provides tools for generating and/or increasing the diversity of the encoded character string(s).

In addition to being expressed as a collection of frames, an application is also expressed as a location on the intranet and/or Internet: a URL address pointing to the application. Each URL preferably has two characteristics: the content data for the URL (i.e., whatever data is stored on the server) and a data type, or MIME (Multipurpose Internet Mail Extensions) type. The data type allows a Web browser to determine how it should interpret data received from a server (e.g., interpreting a .gif file as a bitmap image). In effect, the data type serves as a description of what to do with the data once it is received at the browser. If a stream of binary data is received as type HTML, the browser renders it as an HTML page. If it is received as type bitmap, the browser renders it as a bitmap image.
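The idea that a data type tells the receiver what to do with the data can be sketched as a small dispatch table. This is an illustrative sketch only: the `HANDLERS` table and `handle` function are hypothetical, and the type is guessed from the file name with the standard `mimetypes` module, whereas a real browser normally relies on the type reported by the server.

```python
import mimetypes

# Hypothetical handlers keyed by MIME type
HANDLERS = {
    "text/html": lambda data: "render as HTML page",
    "image/gif": lambda data: "render as bitmap image",
}

def handle(filename, data=b""):
    """Pick a handler for `data` based on the MIME type of `filename`."""
    mime_type, _encoding = mimetypes.guess_type(filename)
    handler = HANDLERS.get(mime_type)
    if handler is None:
        return "no handler registered for %s" % mime_type
    return handler(data)
```

An application-specific MIME type would be handled the same way, by registering one more entry in the table.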

In Microsoft Windows, different techniques exist for allowing a host application to register an interest in a data object (i.e., data of a particular type). One technique is to allow the application to register with Windows an interest in a particular file extension (e.g., .doc, "Microsoft Word Document"); this is the technique most commonly employed by Windows applications. Another approach, employed in Microsoft Object Linking and Embedding (OLE), is the use of a class Globally Unique Identifier (GUID), a 16-byte identifier indicating the particular server application to invoke (for hosting the document having the GUID). The class ID is registered on a particular machine as being connected to a particular DLL (Dynamic Link Library) or application server.

In one embodiment of particular interest, a host application is associated with a document through the use of MIME types. MIME provides a standardized technique for packaging a document object. The package includes a MIME header indicating which application is appropriate for hosting the document, all contained in a format suitable for transmission across the Internet.

In one preferred embodiment, the methods of this invention are implemented, in part, through the use of a MIME type specific to the use of these methods. The MIME type contains the information necessary to create a document (e.g., a Microsoft ActiveX Document) locally on the client and, in addition, the information necessary to find and download the program code for rendering the view of the document, if necessary. If the program code is already present on the client, it need only be downloaded for the purpose of updating the local copy. This defines a new document type that includes information supporting downloadable program code for rendering a view of the document.

The MIME type may be associated with a file extension of .APP. A file with the .APP extension is an OLE Document, implemented by an OLE DocObject. Because an .APP file is a file, it can be placed on a server and linked to using an HTML HREF. The .APP file preferably contains the following pieces of data: (1) the CLSID of an ActiveX object, which is an OLE Document Viewer implemented as one or more forms appropriate to the use of the methods of this invention; (2) the URL of the codebase where the object's code can be found; and (3) (optionally) a requested version number. Once the .APP DocObject handler code is installed and has registered the APP MIME type, it can be used to download an .APP file into the user's Web browser.

On the server side, because the .APP file really is a file, the Web server simply receives the request and returns the file to the client. When the .APP file has been downloaded, the .APP DocObject handler asks the operating system to download the codebase for the object specified in the .APP file. This system functionality is available in Windows through the CoGetClassObjectFromURL function. After the ActiveX object's codebase has been downloaded, the .APP DocObject handler asks the browser to create a view on itself, for instance by calling the ActivateMe method on the Explorer document site. Internet Explorer then calls the DocObject back to instantiate a view, which the DocObject does by creating an instance of the ActiveX view object from the downloaded code. Once created, the ActiveX view object is activated in place within Internet Explorer, which creates the appropriate form and all its child controls.

Once the form has been created, it can establish connections back to any remote server objects it needs in order to perform its functions. At this point the user can interact with the form, which appears embedded in the Internet Explorer frame. When the user changes to a different page, the browser assumes responsibility for eventually closing and destroying the form (and relinquishing any outstanding connections to the remote servers).

In one preferred embodiment, from an end user's desktop, the entry point to the system is the corporate home page or the home page of another particular web site. The page can optionally include, in a conventional manner, a number of links. When the user clicks on a particular link to an application page (e.g., a page providing the functionality of the methods of this invention), the Web browser responds by connecting to the application page (file) residing on the server.

In one embodiment, when the user requests access to the methods of this invention, the user is directed to a particular page type, e.g., an application (i.e., .APP document) page for in-place execution, in the Web browser, of an application implementing one or more elements of the methods of this invention. Because each application page is located using a URL, other pages can have hyperlinks to it. Multiple application pages can be grouped together by making a catalog page that contains hyperlinks to the application pages. When the user selects a hyperlink that points to an application page, the Web browser downloads the application code and executes the page inside the browser.

When the browser downloads the application page, the browser (based on the defined MIME type) invokes a local handler, i.e., a handler for documents of that type. More particularly, the application page preferably includes a Globally Unique Identifier (GUID) and a codebase URL identifying a remote (downloadable) application to invoke for hosting the document. Given the document object and the GUID that arrive with the application page, the local handler checks the client machine to see whether the hosting application already resides locally (e.g., by examining the Windows 95/NT registry). At this point the local handler can choose to invoke a local copy (if one exists) or to download the latest version of the host application.

Different models for downloading code are commonly available. When code is downloaded, a "code base" specification (file) is initially requested from the server. The code base itself can range from a simple DLL file to a Cabinet file (Microsoft .cab file) containing multiple compressed files. In addition, an information (e.g., Microsoft .inf) file can be employed to instruct the client system how to install the downloaded application. These mechanisms afford great flexibility in choosing which components of an application are downloaded, and when.

For a preferred embodiment, the machinery employed for actually downloading the program code relies on standard Microsoft ActiveX API (Application Programming Interface) calls. Although the ActiveX API does not provide native support for Web-delivered applications, it can be invoked to locate the correct version of the program code, copy it to the local machine, verify its integrity, and register it with the client's operating system. Once the code has been downloaded, the handler can proceed to invoke the now-present hosting application to render the document object (in a manner similar to invoking the hosting application through the registry if it were already installed).

Once the hosting application (OLE server) is present on the client, the client system can employ the OLE document view architecture to render the application correctly within the browser. This includes using conventional OLE methodology to add the application's menu to that of the browser and to resize the application correctly when the browser is resized (as opposed to the conventional limitation of requiring the application to execute within a single ActiveX control rectangle). Once the application is executing on the client, it can execute remote logic, for example by using RPC (Remote Procedure Call) methodology. In this manner, logic that is preferably implemented as remote procedure(s) can still be used.

In a particularly preferred embodiment, the methods of the present invention are implemented as one or more frames providing the following functions: encoding two or more biological molecules into character strings to provide a collection of two or more different initial character strings, where each of the biological molecules comprises at least about 10 subunits; selecting at least two substrings from the character strings; concatenating the substrings to form one or more product strings about the same length as one or more of the initial character strings; and adding (placing) the product strings into the collection of strings.
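
The four functions listed above can be illustrated with a minimal Python sketch. It is an illustration under stated assumptions, not the implementation described in this specification: the function names (`encode`, `make_product_string`, `populate`), the choice of substrings by random cut points at matching positions, and the fixed random seed are all hypothetical choices made for the example.

```python
import random


def encode(molecule: str) -> str:
    """Encode a molecule as a character string of one-letter subunit codes."""
    return molecule.strip().upper()


def make_product_string(initial, rng, n_substrings=2):
    """Select substrings from the initial strings and concatenate them.

    Assumption: substrings are taken from matching positions, so the product
    string has the length of the shortest initial string.
    """
    length = min(len(s) for s in initial)
    # Random cut points divide positions 0..length into n_substrings spans.
    cuts = sorted(rng.sample(range(1, length), n_substrings - 1))
    bounds = [0] + cuts + [length]
    # Each span is copied from a randomly chosen initial string.
    pieces = [rng.choice(initial)[start:end]
              for start, end in zip(bounds, bounds[1:])]
    return "".join(pieces)


def populate(molecules, n_products, seed=0):
    """Return the initial strings followed by n_products product strings."""
    rng = random.Random(seed)
    initial = [encode(m) for m in molecules]
    if len(set(initial)) < 2:
        raise ValueError("need at least two different initial strings")
    if any(len(s) < 10 for s in initial):
        raise ValueError("each molecule needs at least about 10 subunits")
    collection = list(initial)
    for _ in range(n_products):
        collection.append(make_product_string(initial, rng))
    return collection


# Hypothetical input sequences, used only to exercise the sketch.
strings = populate(["ATGGCCATTGTA", "ATGGCGCTTGTT"], n_products=5)
```

Passing one or more of the product strings back in as `molecules` repeats the procedure with those strings as initial strings.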

The function for encoding two or more biological molecules preferably provides one or more windows in which the user can insert representation(s) of biological molecules. In addition, the encoding function optionally provides access to private and/or public databases, accessible through a local network and/or intranet, from which one or more sequences can be input into the methods of the present invention. Thus, for example, in one embodiment where the end user inputs a nucleic acid sequence into the encoding function, the user can optionally request a GenBank search and input one or more of the sequences returned by that search into the encoding and/or diversity-generating function.

Methods of implementing intranet and/or Internet embodiments of computational and/or data access processes are well known to those of skill in the art and are documented in great detail (see, e.g., Cluer et al. (1992) A General Framework for the Optimization of Object-Oriented Queries, Proc. SIGMOD International Conference on Management of Data, San Diego, California, Jun. 2-5, 1992, SIGMOD Record, vol. 21, Issue 2, Jun. 1992; Stonebraker, M., Editor; ACM Press, pp. 383-392; ISO-ANSI, Working Draft, "Information Technology-Database Language SQL", Jim Melton, Editor, International Organization for Standardization and American National Standards Institute, Jul. 1992; Microsoft Corporation, "ODBC 2.0 Programmer's Reference and SDK Guide. The Microsoft Open Database Standard for Microsoft Windows™ and Windows NT™, Microsoft Open Database Connectivity™ Software Development Kit", 1992, 1993, 1994, Microsoft Press, pp. 3-30 and 41-56; ISO Working Draft, "Database Language SQL-Part 2: Foundation (SQL/Foundation)", CD9075-2:199χSQL, Sep. 11, 1997; and the like).

Those of skill in the art will recognize that many modifications can be made to this configuration without departing from the scope of the present invention. For example, in a two-tier configuration, the server system performing the WWW gateway function may also perform the functions of the web server. Likewise, any of the embodiments described above can be modified to accept requests from users/user terminals in a format other than a URL. Modifications suited to a multi-manager environment may also be included.

X. Incorporating physical evaluation and a feedback loop

As indicated above, in certain preferred embodiments the selection criteria can require that the molecule(s) represented by the concatenated strings meet actual physical properties. Verifying such properties requires obtaining the encoded molecules. To accomplish this, the molecule(s) represented by the concatenated string(s) are physically synthesized (e.g., chemically or recombinantly) or isolated.

Physical synthesis of the genes, proteins, and polysaccharides encoded by the collection(s) of character strings produced according to the present invention is an important tool for creating a physical representation of the matter in question that is suitable for physical assay of one or more desired properties.

In a preferred embodiment, gene synthesis technology is used to construct libraries in a manner that is consistent with, and faithful to, the sequence representations provided in the collection of concatenated strings produced by the methods of the present invention.

Preferred gene synthesis methods allow rapid construction of libraries of 10⁴-10⁹ gene/protein variants. Libraries of this size are typically adequate for screening/selection protocols, because larger libraries are more difficult to make and maintain and sometimes cannot be sampled completely by physical assay or selection methods. For example, physical assay methods existing in the art (including, e.g., "life-and-death" selection methods) generally allow sampling of about 10⁹ variants or fewer by a particular screen of a particular library, and many assays are limited to sampling about 10⁴-10⁵ members. Thus, because large libraries cannot easily be sampled completely, building several smaller libraries is the preferred approach. Larger libraries can nevertheless be sampled, e.g., using high-throughput screening methods.
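
The arithmetic behind preferring several smaller libraries can be sketched as follows. This is a back-of-the-envelope illustration only; the function names are hypothetical, and the coverage estimate assumes that clones are drawn uniformly at random with replacement, which is an assumption of this example rather than a statement from the specification.

```python
import math


def sublibraries_needed(total_variants: int, assay_capacity: int) -> int:
    """Number of libraries, each small enough to be screened completely."""
    return -(-total_variants // assay_capacity)  # ceiling division on integers


def expected_coverage(library_size: int, clones_screened: int) -> float:
    """Expected fraction of distinct variants observed at least once.

    Assumption: clones are sampled uniformly at random with replacement.
    """
    if library_size < 2:
        raise ValueError("library_size must be at least 2")
    # 1 - (1 - 1/N)^n, computed with log1p/expm1 to stay accurate for large N.
    return -math.expm1(clones_screened * math.log1p(-1.0 / library_size))


# A 10^9-member library screened with an assay limited to 10^5 members.
n_libraries = sublibraries_needed(10**9, 10**5)
coverage_large = expected_coverage(10**9, 10**5)
# A 10^5-member library screened with the same assay.
coverage_small = expected_coverage(10**5, 10**5)
```

Under these assumptions, an assay limited to 10⁵ members sees only a tiny fraction of a 10⁹-member library, while the same effort samples most of a 10⁵-member library.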

Many methods exist for synthesizing genes, polysaccharides, proteins, and the like with well-defined sequences, and the field is growing rapidly. For clarity, the following discussion focuses on only one of the many known and available methods for producing biological molecules.

The well-known and mature phosphoramidite chemistry, which permits those of skill in the art to prepare oligonucleotides efficiently, represents the current state of the art in polynucleotide synthesis. It is possible, but somewhat impractical, to use this chemistry for routine synthesis of oligonucleotides much longer than 100 bp, because synthesis yield decreases and the need for purification increases. Oligonucleotides of a "typical" 40-80 bp size can be obtained routinely and directly with high purity.

Oligonucleotides and even complete synthetic (double-stranded or single-stranded) genes can be ordered from many commercial sources, such as The Midland Certified Reagent Company (mcrc@oligos.com), The Great American Gene Company (http://www.genco.com), ExpressGen, Inc. (www.expressgen.com), Operon Technologies Inc. (Alameda, CA), and others. Similarly, peptides can be ordered from a variety of sources such as PeptidoGenic (pkim@ccnet.com), HTI Bio-products, Inc. (http://www.htibio.com), BMA Biomedicals, Ltd. (U.K.), Bio-Synthesis, Inc., and others.

An appropriate demonstration of total gene synthesis from small fragments, in a form readily amenable to optimization, parallelism, and high throughput, was provided by Dillon and Rosen (1990) Biotechniques, 9(3): 298-300. They describe a simple and rapid PCR-based process for assembling genes from a set of partially overlapping single-stranded oligonucleotides without the use of ligase. Several groups have described successful applications of variations of the same PCR-based gene assembly approach to the synthesis of various genes of increasing size, demonstrating its general applicability and its combinatorial nature for the synthesis of libraries of mutated genes (see, e.g., Sandhu et al. (1992) Biotechniques, 12(10): 15-16, Prodomou and Pearl (1992) Protein Engin., 5(8): 827-829, Chen et al. (1994) JACS, 1194(11): 8799-8800, Hayashi et al. (1994) Biotechniques, 17: 310-314, and others).

More recently, Stemmer et al. (1995) Gene 164: 49-53 provided evidence that PCR-based assembly methods are useful for building larger genes, up to at least 2.7 kb, from dozens or even hundreds of synthetic 40 bp oligonucleotides. These authors also showed that, of the four steps of the PCR-based gene synthesis protocol (oligonucleotide synthesis, gene assembly, gene amplification, and, typically, cloning), the gene amplification step can be omitted if "circular" assembly PCR is used.
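
To make the idea of assembling a gene from partially overlapping 40 bp oligonucleotides concrete, the following Python sketch tiles a target sequence into oligonucleotides on both strands. It is only an illustration: the function names are hypothetical, and the particular layout (top-strand oligonucleotides placed end to end, bottom-strand oligonucleotides shifted by half an oligonucleotide so that each bridges two top-strand neighbours) is an assumption of this example, not a protocol taken from the cited references.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence (A, C, G, T only)."""
    return seq.translate(COMPLEMENT)[::-1]


def design_oligos(gene: str, oligo_len: int = 40):
    """Split a gene into overlapping top- and bottom-strand oligonucleotides.

    Assumed layout: top-strand oligos tile the gene end to end; bottom-strand
    oligos are offset by oligo_len // 2, so each one overlaps the second half
    of one top-strand oligo and the first half of the next.
    """
    if oligo_len % 2:
        raise ValueError("oligo_len must be even")
    half = oligo_len // 2
    top = [gene[i:i + oligo_len] for i in range(0, len(gene), oligo_len)]
    bottom = [reverse_complement(gene[i:i + oligo_len])
              for i in range(half, len(gene), oligo_len)]
    return top, bottom


# Hypothetical 120 bp target used only to exercise the sketch.
gene = "ATGC" * 30
top, bottom = design_oligos(gene)
```

The last oligonucleotide on either strand may be shorter than `oligo_len` when the gene length is not a multiple of it.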

Once prepared, the gene(s) are inserted into vectors, and the vectors are used to transfect host cells and express the encoded protein(s) according to routine methods well known to those of skill in the art. Cloning methods for this purpose, and sequencing methods to verify the sequence of nucleic acids, are well known in the art. Examples of appropriate cloning and sequencing techniques, and instructions sufficient to direct persons of skill through many cloning exercises, are found in Berger and Kimmel, Guide to Molecular Cloning Techniques, Methods in Enzymology Vol. 152, Academic Press, Inc., San Diego, CA (Berger); Sambrook et al. (1989) Molecular Cloning: A Laboratory Manual (2nd ed.) Vol. 1-3, Cold Spring Harbor Laboratory, Cold Spring Harbor Press, NY; and Current Protocols in Molecular Biology, F.M. Ausubel et al., eds., Current Protocols, a joint venture between Greene Publishing Associates, Inc. and John Wiley & Sons, Inc. (1994 Supplement). Product information from manufacturers of biological reagents and experimental equipment also provides information useful in known biological methods. Such manufacturers include the SIGMA chemical company (Saint Louis, MO), R&D systems (Minneapolis, MN), Pharmacia LKB Biotechnology (Piscataway, NJ), CLONTECH Laboratories, Inc. (Palo Alto, CA), Chem Genes Corp., Aldrich Chemical Company (Milwaukee, WI), Glen Research, Inc., GIBCO BRL Life Technologies, Inc. (Gaithersburg, MD), Fluka Chemica-Biochemika Analytika (Fluka Chemie AG, Buchs, Switzerland), Invitrogen (San Diego, CA), and Applied Biosystems (Foster City, CA), as well as many other commercial sources known to those of skill in the art.

Once synthesized, the physical molecules can be assayed for one or more properties to determine whether they meet the selection criteria. The character strings encoding molecules that meet the physical selection criteria are then selected as described above. Numerous assays for physical properties (e.g., binding specificity and/or avidity, enzymatic activity, molecular weight, charge, thermal stability, temperature optima, pH optima, etc.) are well known to those of skill in the art.

In certain embodiments, the physical molecules can be subjected to one or more "shuffling" procedures, and optionally screened for particular physical properties, to generate new molecules that can then be encoded and processed according to the methods described above.

A variety of "shuffling methods" are known, including those contributed by the inventors and their co-workers, e.g., Stemmer et al. (1994) Nature 370: 389-391, Stemmer (1994) Proc. Natl. Acad. Sci. USA, 91: 10747-10751, Stemmer, U.S. Patent No: 5,603,793, Stemmer et al., U.S. Patent No: 5,830,721, Stemmer et al., U.S. Patent No: 5,811,238, Minshull et al., U.S. Patent No: 5,837,458, Crameri et al. (1996) Nature Med., 2(1): 100-103, and PCT Publications WO 95/22625, WO 97/20078, WO 96/33207, WO 97/33957, WO 98/27230, WO 07/35966, WO 98/31837, WO 98/13487, WO 98/13485, and WO 98/42832. In addition, several co-pending applications describe important DNA shuffling methods (see, e.g., co-pending USSN 09/116,188, filed July 15, 1998; USSN 60/102,362; and Selifonov and Stemmer, "Methods for making character strings, polynucleotides & polypeptides having desired characteristics", filed 02/05/1999, USSN 60/118,854).

In addition, the methods described above can be carried out in a parallel mode, in which each member of a library comprising a plurality of genes, proteins, polysaccharides, etc., is synthesized in a spatially separated vessel or array of vessels for subsequent physical screening, or in a poolwise manner, in which all or part of the desired plurality of molecules is synthesized in a single vessel. Many other synthetic approaches are known, and those of skill in the art can readily determine their relative merits.

The processes discussed are also amenable to production using high-throughput systems. High-throughput (e.g., robotic) systems are commercially available (see, e.g., Zymark Corp., Hopkinton, MA; Air Technical Industries, Mentor, OH; Beckman Instruments, Inc., Fullerton, CA; Precision Systems, Inc., Natick, MA, etc.). These systems typically automate entire procedures, including all sample and reagent pipetting, liquid dispensing, timed incubations, and final readings of the microplate in detector(s) appropriate for the assay. These configurable systems provide high throughput and rapid start-up as well as a high degree of flexibility and customization. The manufacturers of such systems provide detailed protocols for the various high-throughput procedures. Thus, for example, Zymark Corp. provides technical bulletins describing the use of high-throughput systems for cloning, expression, and screening of chemically or recombinantly produced products.

XI. Uses of the generated string population(s)

A) Use in genetic/evolutionary algorithms

In one embodiment, the methods of the present invention provide a population of character strings. Particularly preferred character strings represent encoded biological molecules, and the encoded molecules are generally related to some degree because they reflect a level of biological organization. Thus, rather than reflecting a random or haphazard selection from a uniform sequence space, the character strings generated by the methods of the present invention show a degree of relatedness (or variation) that reflects a particular level of organization found in nature (e.g., gene, gene family, individual, subpopulation, etc.). The collections of character strings (e.g., populated data structures) produced by the methods of the present invention therefore provide a useful starting point for various evolutionary models and are convenient for use in evolutionary algorithms (evolutionary computation).

When used in such models, a population (collection of character strings) produced by the methods of the present invention provides far more information than is obtained when the evolutionary algorithm is run from an arbitrary population.

For example, when an evolutionary algorithm uses a population of random or arbitrary members as its starting point, the dynamics of the simulation reflect the progression from that arbitrary starting point to a particular solution (e.g., the distribution of properties in the final population(s)). Because the starting point is arbitrary and bears essentially no relation to populations produced by natural variation, these dynamics provide no information about the dynamics of natural processes/populations.

In contrast, the collections of character strings produced by the methods of the present invention contain considerably more information than the randomly generated starting points used in conventional evolutionary algorithms. First, each member of the population carries considerable information regarding molecular structure. Thus, members are distinguished from one another not simply as "self/other" (i.e., an allelic representation) but by varying degrees of relatedness/similarity. The members of a population produced by the methods of the present invention reflect varying degrees of covariation.

Moreover, because a population produced by the methods of the present invention reflects the fine structure characteristic of the level of biological organization encoded in the initial strings, the early dynamics of a simulation run that uses these strings as its starting set reflect the dynamics of "real world" populations and provide considerable insight into evolutionary processes.

In addition, because the members generated using the methods of the present invention represent particular molecules, running evolutionary algorithms on these data structures provides practical information about molecular evolution and/or the design of new and useful molecular properties.

B) Use in generating indices

In another embodiment, the data structures produced by the methods of the present invention can be used as tags (indices) for indexing essentially any kind of information. In this approach, more closely related pieces of information are tagged with more similar members (character strings) of the data structure, while less closely related information is tagged with less similar members. In a preferred embodiment, the similarity between the character strings used to tag two different pieces of data reflects (is proportional to) the similarity of the tagged information.

When a search is performed, initial hits are identified using traditional search techniques. When closely related information is desired, the data structure is searched for similar members using any of the well-known similarity algorithms described above. These similarity algorithms were designed to search vast data spaces completely, rapidly, and efficiently. Once members (indices) having the desired similarity are identified, they point to the tagged data and thereby provide the user with the related information.
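
The tag-based lookup described in this subsection can be sketched in a few lines of Python. This is a hedged illustration: per-position identity between equal-length strings is used here as a simple stand-in for the similarity algorithms referred to in the text, and the names `identity` and `related_items`, the threshold, and the sample index are hypothetical.

```python
def identity(a: str, b: str) -> float:
    """Fraction of positions at which two equal-length tags agree."""
    if len(a) != len(b):
        raise ValueError("tags must have the same length")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def related_items(index: dict, hit_tag: str, min_identity: float) -> list:
    """Return items whose tags are similar to the tag of an initial hit.

    index maps each tag (a character string from the data structure) to the
    piece of information it labels. Results are ordered most similar first.
    """
    scored = [(identity(hit_tag, tag), tag, item)
              for tag, item in index.items() if tag != hit_tag]
    scored.sort(reverse=True)
    return [item for score, _tag, item in scored if score >= min_identity]


# Hypothetical index: similar documents carry similar tags.
index = {
    "AAAAAAAAAA": "document A",
    "AAAAAAAAAT": "document B",
    "TTTTTTTTTT": "document C",
}
neighbours = related_items(index, "AAAAAAAAAA", min_identity=0.8)
```

Here the initial hit is the item tagged `AAAAAAAAAA`; lowering `min_identity` widens the search to less closely related items.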

C) Use as reference objects in database searches

In a related application, the data structures produced by the methods of the present invention, or members of these data structures (i.e., character strings), can be used as reference objects in database searches. For example, initially known information (e.g., a molecular structure, or index strings from a knowledge database as described above) is encoded and varied according to the methods described herein. This produces a new data structure that is related to the initially encoded information but exhibits non-obvious variation.

The resulting information (e.g., members of the data structure) can be deconvolved to identify actual or theoretical molecule(s), which can then be used to search general databases for identical or related molecules. Where the encoded information came from a database index, the members of the data structure can be used to search the original database or a new one to identify pertinent/related information.

D) Identification of structural motifs conferring particular molecular properties

It is often of interest to identify the regions of a molecule (e.g., a protein) that are responsible for a particular property, such as carrying out a function. Such identification has traditionally been made using structural information, usually obtained by X-ray crystallography.

The sequences of naturally occurring enzymes that catalyze similar or even identical reactions can vary widely; the sequences may be only 50% identical, or less. While a family of such enzymes can catalyze the same reaction, other properties of these enzymes can differ considerably. These include physical properties such as stability to temperature and organic solvents, pH optima, solubility, the ability to retain activity when not dissolved in solution, and ease of expression in different host systems. They also include catalytic properties, including activity (k_cat and K_m), the range of accepted substrates, and the range of chemistries performed. Wherever multiple functional dimensions are encoded by "homologous" sequences, the methods described herein can also be applied to non-catalytic proteins (e.g., ligands such as cytokines) and even to nucleic acid sequences (such as promoters that are inducible by many different ligands).

Because of the extensive differences between enzymes with similar catalytic functions, it is not always possible to correlate specific properties precisely with individual amino acids; there are simply too many amino acid differences. However, libraries of variants can be prepared from a family of homologous natural sequences according to the methods of the present invention, by encoding members of the family into initial strings and then selecting and concatenating substrings to populate a data structure with encoded variants.

The encoded or deconvolved variants can be tested for desired properties in silico, and/or the encoded variants can be deconvolved and the corresponding molecules physically synthesized as described above. The synthesized molecules can then be screened for one or more desired properties.

When members of the data structure are tested under specific conditions for a particular property, the optimal combination of sequences from the data structure (or the initial string collection) for those conditions can be determined. If only one parameter of the assay conditions is then changed, a different member of the library (data structure) may be identified as the best performer. Because the assay conditions are very similar, most amino acids will be conserved between the two sets of best performers (the best performers of the initial string collection (set 1) and the best performers of the populated data structure (set 2)). Comparing the sequences of the best enzymes under the two different conditions identifies the sequence differences responsible for the differences in performance.
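
The comparison described above can be illustrated with a short Python sketch: pick the best performer under each of two assay conditions and list the positions at which their sequences differ. All sequences, scores, condition names, and function names below are invented for the example and are not data from this specification.

```python
def best_performer(scores: dict) -> str:
    """Return the name of the library member with the highest assay score."""
    return max(scores, key=scores.get)


def sequence_differences(seq_a: str, seq_b: str) -> list:
    """List (position, residue_a, residue_b) where two aligned sequences differ.

    Positions are 1-based; the sequences are assumed to be pre-aligned and of
    equal length.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    return [(i + 1, a, b)
            for i, (a, b) in enumerate(zip(seq_a, seq_b)) if a != b]


# Hypothetical library members and assay results under two conditions that
# differ in a single parameter (here, pH).
library = {"v1": "MKTAYIAKQR", "v2": "MKTAYLAKQR", "v3": "MKSAYIAKQR"}
activity_ph7 = {"v1": 0.9, "v2": 0.4, "v3": 0.5}
activity_ph9 = {"v1": 0.3, "v2": 0.8, "v3": 0.5}

best_ph7 = best_performer(activity_ph7)
best_ph9 = best_performer(activity_ph9)
diffs = sequence_differences(library[best_ph7], library[best_ph9])
```

In this toy data the two best performers differ at a single position, which would be the first candidate for explaining the change in performance between the two conditions.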

Principal component analysis (e.g., using Partek-type software) is one of many multivariate tools useful for such analysis.

E) Use in music generation

In yet another embodiment, the methods of the present invention can be used to generate music. Using one of several well-known programs, biological molecules (e.g., DNA, proteins, etc.) can be encoded as musical notes. This involves mapping particular subunits onto particular notes. The timing and timbre of the notes are determined by the motifs and/or secondary structure in which the subunits occur.

Thus, for example, the program SS-midi has been used to encode various nucleic acid and amino acid sequences. In the DNA calypso approach, purines are played 3/2 as long as pyrimidines, the bases C, T, G, and A are converted to the notes C, F, G, and A, and the first strand is played on a jazz organ while the other strand is played on a bass. In another approach, a note is played longer when the corresponding subunit is in a helix than when it is in a beta-sheet. Other variations are of course possible.
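The DNA calypso mapping described above can be sketched as follows. This is not the SS-midi program; it is a minimal illustration that assumes "3/2" refers to note duration and that a pyrimidine lasts one arbitrary time unit. The function and constant names are hypothetical.

```python
from fractions import Fraction

# Base-to-note mapping as stated in the text: C, T, G, A -> C, F, G, A.
NOTE_FOR_BASE = {"C": "C", "T": "F", "G": "G", "A": "A"}
PURINES = {"A", "G"}


def dna_to_notes(dna, pyrimidine_duration=Fraction(1)):
    """Map a DNA string to (note, duration) pairs; purines last 3/2 as long."""
    notes = []
    for base in dna.upper():
        duration = pyrimidine_duration
        if base in PURINES:
            duration = pyrimidine_duration * Fraction(3, 2)
        notes.append((NOTE_FOR_BASE[base], duration))
    return notes


print(dna_to_notes("CTGA"))
```

Instrument assignment (jazz organ for one strand, bass for the other) would be handled by whatever synthesizer consumes these pairs and is left out of the sketch.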

In the methods of the present invention, biological molecules are encoded into strings as described above, and substrings are selected and concatenated to populate a data structure. When the populated data structure is fed into a program that converts it to music (e.g., SS-midi), the program renders the new sequences encoded in the data structure as music. The data structure can be repopulated using feedback, as described above, to generate variations on musical phrases.

F) Use in driving synthesis machines

As described above, a data structure produced by the methods of the present invention can be used to drive machines for the chemical synthesis of the encoded molecules (e.g., polypeptides, nucleic acids, polysaccharides, etc.). Starting from only a few initial sequences ("seed members"), the methods of the present invention can provide literally tens, hundreds, thousands, tens of thousands, hundreds of thousands, or even millions of different encoded molecules. When the final data structure, or members of it, are used to direct chemical (or recombinant) synthesis, "recombinant" libraries of desired molecules of virtually any size can be prepared. Such "recombinant" libraries are in wide demand as screenable systems for therapeutics, industrial process molecules, particular enzymes, and the like.
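How a handful of seed members can grow into a large collection may be easier to see in code. The sketch below is only one possible reading of the select, concatenate, add, and repeat cycle: it assumes seeds of similar length, a single randomly chosen junction per product string, and no homology constraint on where the junction falls. The function name, parameters, and seed strings are hypothetical.

```python
import random


def populate(seeds, target_size, rng=None, max_attempts=10_000):
    """Grow a collection of strings by joining a prefix of one member to a suffix of another."""
    rng = rng or random.Random()
    population = list(dict.fromkeys(seeds))  # keep order, drop duplicate seeds
    seen = set(population)
    attempts = 0
    while len(population) < target_size and attempts < max_attempts:
        attempts += 1
        first, second = rng.sample(population, 2)
        cut = rng.randrange(1, min(len(first), len(second)))
        product_string = first[:cut] + second[cut:]  # same length as `second`
        if product_string not in seen:
            seen.add(product_string)
            population.append(product_string)  # products can act as parents later
    return population


library = populate(["AAAAAAAA", "CCCCCCCC", "GGGGGGGG"], target_size=20,
                   rng=random.Random(0))
print(len(library))
```

Because product strings are returned to the pool, later rounds can combine material from more than two seeds, which is what lets the collection grow far beyond the number of seed pairs.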

EXAMPLES

The following examples are offered by way of illustration and are not intended to limit the scope of the claimed invention.

Example 1: Subtilisin family model

The nucleic acid sequences are aligned (codon usage can be optimized on retrotranslation for a preferred expression system, and the number of oligonucleotides to be synthesized can be minimized). Dot plots of all possible pairwise alignments of the seven parents are shown (FIGS. 5, 6, and 7). Pair 6 and 7 shows 95% identity per window of 7 aa or more, whereas all other pairs show 80% identity per window of 7 aa or more. The stringency of alignment (and the subsequent representation of crossovers between parents) can be set individually for each pair, so that low-homology crossovers can be represented owing to the highly homologous parents. No structural bias or active-site bias is incorporated in this model.
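The per-window percent identity quoted above can be computed along the following lines. This is a simplified sketch, not the dot-plot software used for the figures: it assumes the two sequences are already aligned without gaps and slides a fixed window of 7 positions along the alignment. The function name is hypothetical.

```python
def window_identities(seq_a, seq_b, window=7):
    """Percent identity for every window of `window` aligned positions."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to equal length")
    matches = [a == b for a, b in zip(seq_a, seq_b)]
    return [100.0 * sum(matches[i:i + window]) / window
            for i in range(len(matches) - window + 1)]


print(window_identities("ABCDEFGH", "ABCDEFGX"))
```

A per-pair stringency setting then amounts to choosing a different identity threshold for each pair of parents when deciding which windows may host a crossover.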

Example 2: Design process for crossover oligonucleotides for the synthesis of chimeric polynucleotides

First, substrings are identified and selected in the parent (initial) strings so that a crossover operator can be applied to them to form chimeric junctions. This is done by:

a) identifying all or some of the pairwise homology regions between all of the parent character strings;

b) selecting all or some of the identified pairwise homology regions for indexing at least one crossover point within each selected pairwise homology region; and

c) selecting all or some of the pairwise non-homology regions for indexing at least one crossover point within each selected pairwise non-homology region (step "c" is optional and may be omitted; it is also a step at which structure-activity based elitism can be applied).

This provides a description of a set of positionally indexed and parent-indexed regions (substrings) of the parent character strings that are suitable for the further selection of crossover points.

Second, crossover points are further selected within each substring of the set of substrings selected in the first step. This includes:

a) randomly selecting at least one crossover point in each selected substring; and/or

b) selecting at least one crossover point in each selected substring using one or more annealing-simulation-based models to determine the probability of crossover point selection within each selected substring; and/or

c) selecting one crossover point approximately in the middle of each selected substring.

This creates a set of pairwise crossover points, each of which is indexed to the corresponding character positions in the parent strings that form the chimeric junction at that point.
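The first two steps can be sketched for a single pair of parents as shown here. The sketch makes several simplifying assumptions that are not in the text: the parents are pre-aligned without gaps, a "homology region" is taken to be a run of identical characters of at least `min_len`, and only the random and midpoint options are implemented (the annealing-based option is omitted). All names are hypothetical.

```python
import random


def homology_regions(parent_a, parent_b, min_len=3):
    """Half-open (start, end) runs where two pre-aligned parents are identical."""
    regions, start = [], None
    for i, (a, b) in enumerate(zip(parent_a, parent_b)):
        if a == b:
            if start is None:
                start = i
        else:
            if start is not None and i - start >= min_len:
                regions.append((start, i))
            start = None
    end = min(len(parent_a), len(parent_b))
    if start is not None and end - start >= min_len:
        regions.append((start, end))
    return regions


def crossover_points(regions, mode="middle", rng=None):
    """One crossover point per region: the midpoint, or a random position."""
    rng = rng or random.Random()
    if mode == "middle":
        return [(start + end) // 2 for start, end in regions]
    return [rng.randrange(start, end) for start, end in regions]


regions = homology_regions("AAAATCCCCGGGG", "AAAAGCCCCTGGG")
print(regions, crossover_points(regions))
```

For more than two parents the same functions would be applied to every pair, with each resulting point stored together with the identifiers of the two parents it joins.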

Third, optional codon usage adjustments are performed. The procedure varies with the kind of strings used to determine homology (strings encoding DNA or amino acids (AA)). For example, if DNA sequences were used: a) codons are adjusted for the selected expression system in every parent string, and b) codons can be adjusted among the parents to standardize codon usage for every given amino acid at every corresponding position. This can significantly reduce the total number of oligonucleotides needed for gene library synthesis, and is particularly advantageous when AA homology is higher than DNA homology or when working with families of highly homologous genes (e.g., 80%+ identical).

This option must be exercised with caution, because it is in essence an expression of an elitism mutation operator. The benefit of reducing oligonucleotide cost should therefore be weighed against the bias this option introduces, which can have undesirable consequences. Most commonly, the codon used is the one that encodes the amino acid at the given position in the majority of parents.
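The majority-codon rule for standardizing codon usage among parents can be illustrated as follows. The sketch assumes the parents are supplied as equal-length lists of codons aligned position by position, uses the standard genetic code, and leaves positions alone where parents encode different amino acids (each amino acid is standardized separately). Function and variable names are hypothetical.

```python
from collections import Counter
from itertools import product

# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ...
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def standardize_codons(parents):
    """Replace each codon by the most common synonymous codon at that position."""
    result = [list(p) for p in parents]
    for pos in range(len(parents[0])):
        usage = {}  # amino acid -> Counter of codons used by the parents here
        for parent in parents:
            codon = parent[pos]
            usage.setdefault(CODON_TABLE[codon], Counter())[codon] += 1
        for row in result:
            amino_acid = CODON_TABLE[row[pos]]
            row[pos] = usage[amino_acid].most_common(1)[0][0]
    return result


parents = [["GCT", "AAA"], ["GCC", "AAA"], ["GCT", "AAG"]]
print(standardize_codons(parents))
```

After standardization, parents that encode the same amino acids over a stretch share identical DNA there, so the oligonucleotides covering that stretch become redundant across parents.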

If AA sequences were used: a) the sequences are retrotranslated to degenerate DNA; and b) the degenerate nucleotides are defined by position-by-position reference to codon usage in the original DNA (of the majority of parents or of the corresponding parent), and/or codon adjustments are made to suit the selected expression system in which the physical assay will be performed.
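Retrotranslation to degenerate DNA, step a) above, might look like the sketch below. It assumes the standard genetic code and IUPAC ambiguity codes, and collapses all codons of an amino acid into a single degenerate codon position by position. That is a deliberate simplification: for amino acids with six codons (Leu, Ser, Arg) the collapsed codon also covers codons for other amino acids, which is one reason step b) refines the degenerate positions. All names are hypothetical.

```python
from itertools import product

BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

# IUPAC nucleotide ambiguity codes.
IUPAC = {
    frozenset("A"): "A", frozenset("C"): "C", frozenset("G"): "G", frozenset("T"): "T",
    frozenset("AG"): "R", frozenset("CT"): "Y", frozenset("CG"): "S", frozenset("AT"): "W",
    frozenset("GT"): "K", frozenset("AC"): "M", frozenset("CGT"): "B", frozenset("AGT"): "D",
    frozenset("ACT"): "H", frozenset("ACG"): "V", frozenset("ACGT"): "N",
}


def degenerate_codon(amino_acid):
    """Collapse every codon for one amino acid into one IUPAC-coded codon."""
    codons = [c for c, aa in CODON_TABLE.items() if aa == amino_acid]
    if not codons:
        raise ValueError(f"unknown amino acid: {amino_acid!r}")
    return "".join(IUPAC[frozenset(c[i] for c in codons)] for i in range(3))


def retrotranslate(protein):
    """Retrotranslate a protein string into degenerate DNA."""
    return "".join(degenerate_codon(aa) for aa in protein)


print(retrotranslate("MAKW"))
```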

This step can also be used to introduce restriction sites within the coding parts of the genes for subsequent identification, QA, deconvolution, or manipulation of library entries.

Fourth, an oligonucleotide arrangement is selected for gene assembly. This involves several decision steps:

Uniform 40-60 mer oligonucleotides are typically used (using longer oligonucleotides reduces the number of oligonucleotides needed to build the parents, but requires additional dedicated oligonucleotides to represent closely spaced crossovers/mutations).

Select whether shorter or longer oligonucleotides are allowed (i.e., a yes/no decision). A "yes" decision reduces the total number of oligonucleotides for highly homologous genes of different lengths that have gaps (deletions/insertions), especially gaps of 1-2 aa.

Select the overlap length (typically 15-20 bases; it may be symmetric or asymmetric).

Select whether degenerate oligonucleotides are allowed (yes/no). This is another powerful cost-cutting measure and also a powerful way to obtain additional sequence diversity. Partial degeneracy schemes and minimized degeneracy schemes are particularly useful for building mutagenic libraries.

If software tools are used for these operations, several variations of the parameters can be run to maximize library complexity and minimize cost. Complex assembly schemes that use oligonucleotides of various lengths considerably complicate the indexing process and the subsequent assembly of the library in positionally encoded parallel or partially pooled formats. If the design is done without sophisticated software, a simple, uniform scheme can be used (e.g., all oligonucleotides 40 bases long with a 20-base overlap).

Fifth, "convenience sequences" are designed in front of and behind the parent strings. Ideally the same set is built onto the ends of every library entry. Convenience sequences include restriction sites, primer sequences for identifying assembled products, ribosome binding sites (RBS), leader peptides, and other special or desired features. In principle, the convenience sequences can be defined at a later stage, and a "dummy" set of appropriate length can be used at this stage, for example a substring made of easily recognizable forbidden characters.

Sixth, an indexed matrix of oligonucleotide strings for building every parent is generated according to the selected scheme. The index of every oligonucleotide includes a parent identifier (parent ID), an indication of coding or complementary chain, and position numbers. Crossover points are determined for the indexed coding string of every parent, including its head and tail convenience substrings. A complementary chain is generated for every string. Every coding string is split according to the assembly PCR scheme selected in the fourth step (e.g., in increments of 40 bp). Every complementary string is split according to the same scheme (e.g., 40 bp with a 20 bp shift).
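An indexed oligonucleotide matrix of this kind could be represented as in the sketch below. It assumes the uniform scheme mentioned in the text (40-base oligonucleotides, complementary chain shifted by 20 bases), 0-based start positions measured on the coding chain, and complementary oligonucleotides written 5' to 3'. The record layout and all names are hypothetical.

```python
from typing import NamedTuple

COMPLEMENT = str.maketrans("ACGT", "TGCA")


class Oligo(NamedTuple):
    parent_id: str
    strand: str    # "coding" or "complement"
    start: int     # 0-based start position on the coding chain
    sequence: str  # written 5' to 3'


def reverse_complement(dna):
    return dna.translate(COMPLEMENT)[::-1]


def oligo_matrix(parent_id, coding, length=40, shift=20):
    """Split one parent into indexed coding and complementary oligonucleotides."""
    oligos = []
    for start in range(0, len(coding), length):
        oligos.append(Oligo(parent_id, "coding", start, coding[start:start + length]))
    for start in range(shift, len(coding), length):
        segment = coding[start:start + length]
        oligos.append(Oligo(parent_id, "complement", start, reverse_complement(segment)))
    return oligos


for oligo in oligo_matrix("P1", "ACGT" * 30):
    print(oligo.parent_id, oligo.strand, oligo.start, len(oligo.sequence))
```

With this layout every complementary oligonucleotide bridges two neighbouring coding oligonucleotides by 20 bases each, which is what allows the set to assemble by PCR.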

Seventh, an indexed matrix of oligonucleotides is generated for every pairwise crossover operation. First, all oligonucleotides that carry pairwise crossover markers are determined. Second, all sets of oligonucleotides that share the same position and the same pair of parent crossover markers (four per crossover point) are determined. Third, for every set of four oligonucleotide strings labeled with the same crossover marker, a derived set of four chimeric oligonucleotide strings is made, comprising characters that encode two coding and two complementary chains (e.g., with a 20 bp shift in a 40=20+20 scheme). Two coding strings are possible, each having the forward-end substring of one parent followed, after the crossover point, by the backward-end substring of the second parent. The complementary strings are designed in the same way, yielding a complete indexed inventory of strings encoding oligonucleotides suitable for gene library assembly by PCR.
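The derived set of four chimeric oligonucleotides for one crossover point can be sketched as follows. The sketch assumes two parents that share coordinates (pre-aligned, no gaps), a 40-base oligonucleotide window on each chain, and that the caller supplies the window start positions that contain the crossover point. Names and the toy parent sequences are hypothetical.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    return dna.translate(COMPLEMENT)[::-1]


def chimera(head_parent, tail_parent, crossover, start, length=40):
    """Oligo window [start, start+length): head parent up to the crossover, tail parent after."""
    end = start + length
    if not start <= crossover <= end:
        raise ValueError("crossover point lies outside the oligo window")
    return head_parent[start:crossover] + tail_parent[crossover:end]


def chimeric_set(parent_a, parent_b, crossover, coding_start, complement_start, length=40):
    """Two coding and two complementary chimeric oligos for one crossover point."""
    coding = [chimera(parent_a, parent_b, crossover, coding_start, length),
              chimera(parent_b, parent_a, crossover, coding_start, length)]
    complement = [
        reverse_complement(chimera(parent_a, parent_b, crossover, complement_start, length)),
        reverse_complement(chimera(parent_b, parent_a, crossover, complement_start, length)),
    ]
    return coding + complement


for oligo in chimeric_set("A" * 80, "C" * 80, crossover=30,
                          coding_start=0, complement_start=20):
    print(oligo)
```

Each chimeric oligonucleotide would inherit the index of the window it replaces plus the identifiers of both parents, so that the inventory stays positionally addressable.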

Optionally, the inventory can be refined further by detecting all redundant oligonucleotides, counting them, and deleting the duplicates from the list, while recording the count in an "abundance=amount" field in the index of each oligonucleotide string. This can be a very useful step because it reduces the total number of oligonucleotides needed for library synthesis, particularly when the parent sequences are highly homologous.
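The redundancy-removal step can be sketched with a simple counter. The sketch assumes the inventory is available as a list of oligonucleotide sequences and reduces the index to two fields; the "abundance" key mirrors the field named in the text, while the function name and record layout are hypothetical.

```python
from collections import Counter


def deduplicate(oligo_sequences):
    """Collapse identical oligos into one entry each, recording how often each occurred."""
    counts = Counter(oligo_sequences)  # preserves first-seen order
    return [{"sequence": seq, "abundance": n} for seq, n in counts.items()]


print(deduplicate(["ACGT", "TTGA", "ACGT"]))
```

The abundance value can later be used to scale how much of each oligonucleotide is added to the assembly reaction.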

The methods and materials described above can be modified without departing from the spirit and scope of the invention as claimed, and the invention can be put to many different uses, including:

The use of an integrated system to generate shuffled nucleic acids and/or to test shuffled nucleic acids, including in an iterative process.

An assay, kit, or system that uses any of the selection strategies, materials, components, methods, or substrates described above. Kits optionally further comprise instructions for performing the methods or assays, packaging materials, one or more containers holding assay, device, or system components, and the like.

In a further aspect, the present invention provides kits embodying the methods and apparatus described above. Kits of the invention optionally comprise one or more of the following: (1) a shuffled component as described above; (2) instructions for practicing the methods described above and/or for operating the selection procedures described above; (3) one or more assay components; (4) a container for holding nucleic acids or enzymes, other nucleic acids, transgenic plants, animals, cells, or the like; (5) packaging materials; and (6) software for performing any of the process and/or decision steps described above.

In a further aspect, the present invention provides for the use of any component or kit described above, for the practice of any method or assay described above, and/or for the use of any apparatus or kit to practice any assay or method described above.

The examples and embodiments described above are for illustrative purposes only. Various modifications or changes in light of them will be apparent to those skilled in the art and are included within the spirit and scope of this application and the scope of the appended claims. All publications, patents, and patent applications cited herein are incorporated by reference.

Claims

A method of populating a data structure with a plurality of character strings, the method comprising:

i) encoding two or more biological molecules, each containing at least about 10 subunits, into character strings to provide a set of two or more different initial character strings;

ii) selecting at least two substrings from the character strings;

iii) concatenating the substrings to form one or more product strings having approximately the same length as one or more of the initial character strings;

iv) adding the product strings to a collection of strings; and

v) optionally repeating steps i) or ii) through iv) using one or more of the product strings as an initial string in the set of initial character strings.

The method according to claim 1,

Wherein the encoding step comprises encoding one or more nucleic acid sequences into the character strings.

3. The method of claim 2,

Wherein the one or more nucleic acid sequences comprise a nucleic acid sequence encoding a known protein.

The method according to claim 1,

Wherein the encoding step comprises encoding one or more amino acid sequences into the character strings.

5. The method of claim 4,

Wherein the one or more amino acid sequences comprise a nucleic acid sequence encoding a known protein.

The method according to claim 1,

Wherein the biological molecules have a sequence identity of at least 30%.

The method according to claim 1,

Wherein the selecting comprises selecting the substrings such that the ends of the substrings occur in string regions of about 3 to about 20 characters that have higher sequence identity with the corresponding region of another of the initial character strings than the overall sequence identity between the same two strings.

The method according to claim 1,

Wherein the selecting comprises selecting the substrings such that the ends of the substrings occur within predefined motifs of about 4 to about 8 characters.

The method according to claim 1,

Wherein the selecting and concatenating comprises concatenating substrings selected from two different initial strings such that the concatenation occurs in a region of about 3 to about 20 characters having higher sequence identity between the two different initial strings than the overall sequence identity between the two different initial strings.

The method according to claim 1,

Wherein the selecting comprises aligning two or more of the initial character strings to maximize pairwise identity between two or more substrings of the character strings, and selecting a character that is a member of an aligned pair as the end of one substring.

The method according to claim 1,

Wherein the product strings are added to the collection only if they have at least 30% sequence identity with the initial strings.

The method according to claim 1,

Further comprising: randomly modifying one or more characters of the character strings.

13. The method of claim 12,

Further comprising randomly selecting and altering one or more occurrences of a particular preselected character in the character strings.

A computer program product comprising computer code for:

i) encoding two or more biological molecules, each comprising at least about 10 subunits, into character strings, thereby providing a set of two or more different initial character strings;

ii) selecting at least two substrings from the character strings;

iii) concatenating the substrings to form one or more product strings having approximately the same length as at least one of the initial character strings;

iv) adding the product strings to a collection of strings; and

v) optionally repeating steps i) or ii) through iv) using one or more of the product strings as an initial string in the set of initial character strings.

15. The method of claim 14,

Wherein the at least two biological molecules are nucleic acid sequences.

15. The method of claim 14,

Wherein the at least two biological molecules are nucleic acid sequences of known proteins.

15. The method of claim 14,

Characterized in that the two or more biological molecules are amino acid sequences.

15. The method of claim 14,

Wherein said biological molecules have at least 30% sequence identity.

15. The method of claim 14,

Wherein the code selects the substrings such that the ends of the substrings occur in string regions of about 3 to about 20 characters that have higher sequence identity with the corresponding region of another of the initial character strings than the overall sequence identity between the same two strings.

15. The method of claim 14,

Wherein the code selects the substrings such that the ends of the substrings occur within predefined motifs of about 4 to about 8 characters.

15. The method of claim 14,

Wherein the code selects and concatenates substrings from two different initial strings such that the concatenation occurs in a region of about 3 to about 20 characters having higher sequence identity between the two different initial strings than the overall sequence identity between the two different initial strings.

15. The method of claim 14,

Wherein the code aligns two or more of the initial character strings to maximize pairwise identity between two or more substrings of the character strings, and selects a character that is a member of an aligned pair as the end of one substring.

15. The method of claim 14,

The method further comprising randomly changing one or more characters of the character strings.

25. The method of claim 24,

Wherein the method further comprises randomly selecting and modifying one or more occurrences of a pre-selected character in the character strings.

15. The method of claim 14,

Wherein the code is stored in a medium selected from the group consisting of a magnetic medium, an optical medium, and a magneto-optical medium.

15. The method of claim 14,

Wherein the code is in a dynamic or static memory of the computer.

A label generation system for generating a plurality of associated labels, the system comprising:

an encoder for encoding two or more initial strings from biological molecules;

an isolator for identifying and selecting substrings from the two or more strings;

a concatenator for concatenating the substrings;

a data structure for storing the concatenated substrings as a collection of strings;

a comparator for measuring the number and variability of the collection of strings and determining whether sufficient strings are present in the collection; and

a command register for writing the collection of strings to a raw string file.

29. The method of claim 28,

Wherein the isolator comprises a comparator for aligning the two or more initial strings and determining regions of identity between them.

29. The method of claim 28,

Wherein the encoder includes means for encoding a nucleic acid sequence into a character string.

29. The method of claim 28,

Wherein the encoder includes means for encoding an amino acid sequence into a character string.

29. The method of claim 28,

Wherein the comparator comprises means for computing sequence identity.

29. The method of claim 28,

Wherein the isolator selects the substrings such that the ends of the substrings occur in string regions of about 3 to about 100 characters that have higher sequence identity with the corresponding region of another of the initial character strings than the overall sequence identity between the same two strings.

29. The method of claim 28,

Wherein the isolator selects the substrings such that the ends of the substrings occur within predefined motifs of about 4 to about 8 characters.

29. The method of claim 28,

Wherein the isolator and the concatenator, individually or in combination, concatenate substrings selected from two different initial strings such that the concatenation occurs in a region of about 3 to about 100 characters having higher sequence identity between the two different initial strings than the overall sequence identity between the two different initial strings.

29. The method of claim 28,

Wherein the isolator aligns two or more of the initial character strings to maximize pairwise identity between two or more substrings of the character strings, and selects a character that is a member of an aligned pair as the end of one substring.

29. The method of claim 28,

Wherein the comparator adds strings to the data structure only if they have at least 30% sequence identity with the initial strings.

29. The method of claim 28,

Further comprising an operator to randomly change one or more characters of the character strings.

39. The method of claim 38,

Wherein the operator randomly selects and alters one or more occurrences of a particular preselected character in the character strings.

29. The method of claim 28,

Wherein the data structure is a data structure that stores encoded nucleic acid sequences.

29. The method of claim 28,

Wherein the data structure is a data structure for storing encoded amino acid sequences.