
CN114936296B - Indexing method, system and computer equipment for super-large-scale knowledge map storage - Google Patents


Info

Publication number
CN114936296B
CN114936296B (application CN202210874965.8A)
Authority
CN
China
Prior art keywords
input
vector
entity
knowledge graph
triple
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202210874965.8A
Other languages
Chinese (zh)
Other versions
CN114936296A (en)
Inventor
王文广 (Wang Wenguang)
陈运文 (Chen Yunwen)
纪达麒 (Ji Daqi)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Daguan Data Chengdu Co ltd
Original Assignee
Daguan Data Chengdu Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Daguan Data Chengdu Co ltd filed Critical Daguan Data Chengdu Co ltd
Priority to CN202210874965.8A
Publication of CN114936296A
Application granted
Publication of CN114936296B
Legal status: Active
Anticipated expiration


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F 16/36 Creation of semantic tools, e.g. ontology or thesauri
    • G06F 16/367 Ontology
    • G06F 16/31 Indexing; Data structures therefor; Storage structures
    • G06F 16/316 Indexing structures
    • G06F 16/325 Hash tables
    • G06F 16/33 Querying
    • G06F 16/3331 Query processing
    • G06F 16/334 Query execution
    • G06F 16/3344 Query execution using natural language analysis
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/06 Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N 3/061 Physical realisation using biological neurons, e.g. biological neurons connected to an integrated circuit
    • G06N 3/08 Learning methods
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D 10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Biomedical Technology (AREA)
  • Health & Medical Sciences (AREA)
  • Databases & Information Systems (AREA)
  • Software Systems (AREA)
  • Molecular Biology (AREA)
  • Artificial Intelligence (AREA)
  • General Health & Medical Sciences (AREA)
  • Mathematical Physics (AREA)
  • Computing Systems (AREA)
  • Evolutionary Computation (AREA)
  • Neurology (AREA)
  • Animal Behavior & Ethology (AREA)
  • Microelectronics & Electronic Packaging (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention relates to an indexing method for super-large-scale knowledge graph storage, which specifically comprises the following steps: dividing the input of the index into three types (entity, relation triple and attribute triple); encoding the three types of input with a BERT-compatible model and outputting a vector representation for each; regressing, with a multilayer perceptron, the starting position of data storage and the physical storage length from the received vector representation; and accessing and maintaining the knowledge graph data on the physical storage device according to the starting position and physical storage length, realizing intelligent indexing of super-large-scale knowledge graph storage. The invention also relates to an indexing system and computer equipment for intelligent super-large-scale knowledge graph storage. The indexing method, system and computer equipment are suited to intelligent indexing of large-scale semantic knowledge graphs, improving retrieval efficiency and providing more convenient service for knowledge-graph-based intelligent reasoning.

Description

Indexing method, system and computer equipment for super-large scale knowledge graph storage
Technical Field
The invention relates to the field of artificial intelligence, and in particular to an indexing method, system and computer equipment for super-large-scale knowledge graph storage.
Background
As knowledge graph applications widen and deepen, large-scale enterprises are dedicated to constructing huge knowledge graphs from ubiquitous knowledge and providing knowledge-based applications in different scenarios. The entities of these knowledge graphs can number in the billions, while the relation and attribute triples can scale to hundreds of billions, or even trillions. In such ultra-large-scale knowledge graph storage, performing efficient retrieval is a huge challenge. Real-time entity retrieval, online multi-hop query and relationship analysis, and second-level complex analysis are urgent needs of super-large-scale knowledge graph engineering practice and industrial application.
Conventional knowledge graph storage typically employs graph databases or relational databases, whose physical models typically employ B+ trees or hash algorithms with simple arithmetic mapping relationships. For a small-scale knowledge graph, the existing common indexing modes are practical enough, and an intelligent indexing method is not needed. For an ultra-large-scale knowledge graph, however, the existing indexing methods are inefficient or even infeasible, so a more practical, intelligent indexing mode is needed.
Disclosure of Invention
To this end, the invention provides a super-large-scale knowledge graph intelligent indexing method and system based on deep-learning intelligent hashing. The method and system are suited to intelligent indexing of large-scale semantic knowledge graphs, improving retrieval efficiency and providing more convenient service for knowledge-graph-based intelligent reasoning.
To achieve the purpose of the invention, the technical scheme provided by this patent is as follows:
The invention first provides an indexing method for super-large-scale knowledge graph storage, where "super-large-scale" means that the number of triples in the knowledge graph reaches billions, hundreds of billions or even trillions. During indexing of the stored knowledge graph, hash calculation is realized on the basis of a deep learning model to obtain the starting position and storage length of the physical storage. The method specifically comprises the following steps:
First, dividing the input of the index into three types (entity, relation triple and attribute triple), and designing an intelligent hash algorithm based on these three input types, the intelligent hash algorithm structurally comprising a BERT-compatible model, an aggregation network and a multilayer perceptron;
Second, encoding and learning each of the three types of input with the BERT-compatible model, and sending the learned vectors to the aggregation network;
Third, in the aggregation network: for an entity, aggregating all of its adjacent vertices and associated edges and outputting the vector representation of the entity; for relation triples and attribute triples, learning the triple and outputting the vector representation of the relation triple or attribute triple respectively;
Fourth, feeding the vector representations obtained from the aggregation network into the multilayer perceptron, which regresses the starting position of data storage and the physical storage length;
Fifth, accessing and maintaining the knowledge graph data on the physical storage device according to the output starting position and physical storage length, realizing intelligent indexing of super-large-scale knowledge graph storage.
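The five steps can be sketched end to end as follows. This is an illustrative outline only: the function names and toy computations are assumptions standing in for the trained models described in the patent.

```python
# Illustrative sketch of the five-step intelligent hash pipeline.
# encode / aggregate / regress_pos / regress_len are assumed placeholder
# names, not APIs from the patent.

def encode(tokens):
    # Stand-in for the BERT-compatible encoder: map each lemma to a
    # small deterministic integer feature.
    return [sum(ord(c) for c in t) % 7 for t in tokens]

def aggregate(vec):
    # Stand-in for the aggregation network (identity here).
    return vec

def regress_pos(vec):
    # Stand-in for the position multilayer perceptron.
    return sum(vec) * 10

def regress_len(vec):
    # Stand-in for the length multilayer perceptron.
    return len(vec) * 4

def intelligent_hash(tokens):
    v = aggregate(encode(tokens))
    return regress_pos(v), regress_len(v)  # (pos, len) of physical storage

pos, length = intelligent_hash(["head", "relation", "tail"])
pos_start, pos_end = pos, pos + length     # pos_end = pos + len, as in the text
```

The real system would access the byte range [pos_start, pos_end) on the physical storage device in the fifth step.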
In the method for super-large-scale knowledge graph intelligent index storage, in the first step, each index input of the three input types is expressed in the form (h, r, t), specifically:
if it is an entity, the input is h, and r and t are empty;
if it is a relation triple, h is the head entity, r is the relation, and t is the tail entity;
if it is an attribute triple, h is the entity, r is the attribute name, and t is the attribute value.
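As a sketch, the three input forms can be modelled as plain (h, r, t) tuples with empty slots marked None; the helper name and example values are illustrative assumptions.

```python
# Minimal sketch of the three index-input types as (h, r, t) tuples,
# following the convention in the text; None marks an empty slot.

def make_index_input(kind, h, r=None, t=None):
    """Build an (h, r, t) input. kind is 'entity', 'relation' or 'attribute'."""
    if kind == "entity":
        return (h, None, None)   # r and t are empty for a bare entity
    if kind == "relation":
        return (h, r, t)         # head entity, relation, tail entity
    if kind == "attribute":
        return (h, r, t)         # entity, attribute name, attribute value
    raise ValueError(f"unknown input kind: {kind}")

e = make_index_input("entity", "Beijing")
rt = make_index_input("relation", "Beijing", "capital_of", "China")
at = make_index_input("attribute", "Beijing", "population", "21,893,095")
```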
In the indexing method for super-large-scale knowledge graph storage, in the second step, the encoding process of the BERT-compatible model is as follows:
S21, segmenting the text corresponding to the entity or relation into a sequence of lemmas: Chinese input is split character by character, and English words are split directly on whitespace;
S22, adding position information to the lemma sequence, i.e. the serial number of each lemma in the sequence; if the input also carries upper/lower-sentence codes, these are set to 0;
S23, for each input, obtaining the respective vector representations by embedding, and summing the vectors to obtain the input vector of the model;
S24, the model performs representation learning on the input vector, and the learned vector is finally taken from the model's [CLS] position and recorded as v.
In the indexing method for super-large-scale knowledge graph storage, if the input of the BERT-compatible model is an entity, the output v is the vector representation of that entity; if the input is a relation, the output v is the vector representation of the corresponding relation. The output vector v serves as the input to the aggregation network in the next step.
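A toy sketch of steps S21–S24 under stated assumptions: random embedding tables stand in for a trained BERT-compatible model, the vocabulary lookup is a hash of character codes, and the vector is read off the [CLS] position without running any transformer layers.

```python
import numpy as np

def tokenize(text):
    # Per the text: Chinese is split character by character; inputs
    # containing English words are split on whitespace.
    return text.split() if " " in text else list(text)

rng = np.random.default_rng(0)
DIM, VOCAB, MAX_LEN = 8, 97, 16
tok_emb = rng.normal(size=(VOCAB, DIM))    # lemma (token) embeddings
pos_emb = rng.normal(size=(MAX_LEN, DIM))  # position embeddings
seg_emb = np.zeros((2, DIM))               # upper/lower-sentence ids fixed to 0

def encode(text):
    toks = ["[CLS]"] + tokenize(text)
    ids = [sum(map(ord, t)) % VOCAB for t in toks]  # toy vocabulary lookup
    x = tok_emb[ids] + pos_emb[:len(ids)] + seg_emb[0]
    # A trained model would now run transformer layers over x; here we
    # simply read the vector off the [CLS] position, as in step S24.
    return x[0]

v = encode("intelligent index")
```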
In the indexing method for super-large-scale knowledge graph storage, in the third step, for an entity, the aggregation network aggregates the information of all adjacent vertices and associated edges to realize deep semantic learning. Given v_e, the vector representation of the entity (vertex) e obtained by the model, the aggregation is:

v̄_e = (1/|N(e)|) · Σ_{e_i ∈ N(e)} (v_e + v_{r_i} + v_{e_i})

where N(e) denotes the set of all adjacent vertices of e, |N(e)| denotes the number of adjacent vertices, e_i denotes a vertex adjacent to e, and r_i denotes the relation between e and e_i. The final output v̄_e is the vector representation of the corresponding entity in the output of the aggregation network.
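The entity aggregation can be sketched numerically as a mean over adjacent (relation, vertex) pairs. The exact combination of the entity's own vector with the neighbour and edge vectors is an assumption reconstructed from the surrounding description, not a formula confirmed by the source.

```python
import numpy as np

def aggregate_entity(v_e, neighbours):
    """Aggregate an entity vector with its neighbourhood.

    v_e        -- vector of the entity e itself
    neighbours -- list of (v_relation, v_neighbour) pairs, one per
                  adjacent vertex e_i in N(e)
    """
    acc = np.zeros_like(v_e)
    for v_r, v_n in neighbours:
        acc += v_e + v_r + v_n        # combine entity, edge and neighbour
    return acc / len(neighbours)      # divide by |N(e)|

v_e = np.array([1.0, 0.0])
nbrs = [(np.array([0.0, 1.0]), np.array([1.0, 1.0])),
        (np.array([2.0, 0.0]), np.array([0.0, 2.0]))]
v_bar = aggregate_entity(v_e, nbrs)
```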
In the indexing method for super-large-scale knowledge graph storage, in the third step, for triples, the mean of the triple's component vectors is computed directly:

v = (v_h + v_r + v_t) / 3

where, for a relation triple, v_h is the vector representation of the head entity h, v_r is the vector representation of the relation r, and v_t is the vector representation of the tail entity; for an attribute triple, v_h is the vector representation of the entity h, v_r is the vector representation of the attribute name r, and v_t is the vector representation of the attribute value.
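The triple case is a plain element-wise mean of the three component vectors:

```python
import numpy as np

# The triple vector is the mean of its three component vectors, for
# relation triples (head, relation, tail) and attribute triples
# (entity, attribute name, attribute value) alike.

def triple_vector(v_h, v_r, v_t):
    return (v_h + v_r + v_t) / 3.0

v = triple_vector(np.array([3.0, 0.0]),
                  np.array([0.0, 3.0]),
                  np.array([3.0, 3.0]))
```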
In the indexing method for super-large-scale knowledge graph storage, the vector representations obtained from the aggregation network are fed into a position multilayer perceptron and a length multilayer perceptron, which respectively regress the starting position and the length of data storage.
The invention also relates to an intelligent indexing system for super-large-scale knowledge graph storage, in which the super-large-scale knowledge graph is stored on a physical storage device. The system passes input data through a deep learning model to obtain the starting position pos of the physical storage and the physical storage length len of the data, so that the required knowledge graph data is read from the starting position pos_start = pos to the ending position pos_end = pos + len. The system comprises a BERT-compatible model, an aggregation network module and a multilayer perceptron, wherein:
the BERT-compatible model encodes each index input (one of the three types: entity, relation triple, attribute triple), obtains its vector representation, and sends it to the aggregation network;
the aggregation network module, according to the characteristics of the knowledge graph, aggregates the information of all adjacent vertices and associated edges of an entity, realizing deep semantic learning and outputting the vector representation of the entity; for a triple, it computes the mean of the triple's component vectors, obtaining the vector representation of the relation triple or attribute triple respectively;
the multilayer perceptron takes the vector representations obtained from the aggregation network and regresses the starting position and length of data storage, which serve as the basis for accessing, reading and storing the knowledge graph data on the physical storage device.
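Reading a record once pos and len are known reduces to an ordinary seek-and-read over the byte range [pos_start, pos_end); the in-memory buffer below stands in for a disk or SSD file, and the sample record content is illustrative.

```python
import io

def read_record(storage, pos, length):
    # pos_start = pos; reading `length` bytes ends at pos_end = pos + len.
    storage.seek(pos)
    return storage.read(length)

# Demonstration with an in-memory stand-in for a physical storage device.
device = io.BytesIO(b"....(Beijing,capital_of,China)....")
data = read_record(device, 4, 26)
```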
Through practical application, the deep-learning-based intelligent hashing method and system for super-large-scale knowledge graph indexing obtain the following technical advantages:
1. The method and system are suited to intelligent indexing of large-scale semantic knowledge graphs, improving retrieval efficiency and providing more convenient service for knowledge-graph-based intelligent reasoning.
2. By adopting the intelligent hash algorithm and fully utilizing deep learning's understanding of semantics, the method and system can realize extremely efficient retrieval in super-large-scale knowledge graph storage, including simple retrieval, complex multi-hop retrieval, and complex analysis with tasks such as knowledge reasoning.
3. The designed intelligent hash algorithm architecture, through the BERT-compatible model and the aggregation network module, realizes fast, efficient response to index inputs under super-large-scale knowledge graph storage, greatly improving indexing efficiency.
Drawings
FIG. 1 is a schematic diagram of the indexing method for super-large-scale knowledge graph storage according to the present invention.
FIG. 2 is a schematic diagram of the implementation of the intelligent hash algorithm in the indexing method for super-large-scale knowledge graph storage according to the present invention.
FIG. 3 is a schematic diagram of the BERT-compatible encoding process in the indexing method for super-large-scale knowledge graph storage of the present invention.
Detailed Description
The present invention will now be described in further detail with reference to the drawings and embodiments, so that the structural composition of the intelligent indexing system for super-large-scale knowledge graph storage and the working process of the intelligent indexing method can be understood more clearly; the scope of the invention, however, should not be limited thereby.
The scheme provided by the invention is directed at super-large-scale knowledge graphs. A knowledge graph is a multi-relational graph formed by entities (nodes) and relations (edges of different types); each edge connects a head entity and a tail entity and is usually represented as an SPO triple (subject, predicate, object), called a fact. "Super-large-scale" means that the knowledge graph contains billions, hundreds of billions or even trillions of triples. For a small-scale knowledge graph, the existing common indexing modes are practical enough, and an intelligent indexing method is not needed; for a super-large-scale knowledge graph, the existing indexing methods are inefficient or even infeasible, so an intelligent indexing method is required. In such super-large-scale knowledge graph storage, performing efficient retrieval is a huge challenge; real-time entity retrieval, online multi-hop query and relationship analysis, and second-level complex analysis are urgent needs of super-large-scale knowledge graph engineering practice and industrial application.
As a brand-new intelligent indexing method for super-large-scale knowledge graph storage, the invention realizes hash calculation on the basis of a deep learning model during indexing, computing the starting position and storage length of the physical storage so that the required knowledge graph data can be quickly retrieved from the physical storage device. In selecting the deep learning algorithm, the data characteristics and application characteristics of the knowledge graph are fully considered, so as to provide applications such as highly efficient storage retrieval and complex analysis.
The intelligent hash system architecture provided by the invention is shown in Fig. 1. The invention adopts a deep learning model to realize the hash calculation: the input data is passed through the deep learning model to obtain the starting position pos of the physical storage and the physical storage length len of the data, giving the starting position pos_start = pos and the ending position pos_end = pos + len; based on the starting position and the physical storage length, the knowledge graph data on the physical storage device is then accessed and maintained, realizing intelligent indexing of super-large-scale knowledge graph storage. The intelligent hash algorithm makes full use of deep learning's understanding of semantics, and can realize highly efficient retrieval in super-large-scale knowledge graph storage, including simple retrieval, complex multi-hop retrieval, and complex analysis with tasks such as knowledge reasoning.
The method specifically comprises the following steps:
First, the index input is divided into three types (entity, relation triple and attribute triple), and the intelligent hash algorithm, structurally comprising a BERT-compatible model, an aggregation network and a multilayer perceptron, is designed based on these three input types. Each index input is expressed in the form (h, r, t), specifically:
if it is an entity, the input is h, and r and t are empty;
if it is a relation triple, h is the head entity, r is the relation, and t is the tail entity;
if it is an attribute triple, h is the entity, r is the attribute name, and t is the attribute value.
Second, the BERT-compatible model encodes and learns each of the three types of input, and the learned vectors are sent to the aggregation network.
Third, in the aggregation network: for an entity, all of its adjacent vertices and associated edges are aggregated and the vector representation of the entity is output; for relation triples and attribute triples, the triple is learned and the vector representation of the relation triple or attribute triple is output respectively.
Fourth, the vector representations obtained from the aggregation network are fed into the multilayer perceptron, which regresses the starting position of data storage and the physical storage length. Specifically, the vector representations are fed into the position multilayer perceptron, which regresses the starting position of data storage, and the length multilayer perceptron, which regresses the physical storage length.
Fifth, the knowledge graph data on the physical storage device is accessed and maintained according to the output starting position and physical storage length, realizing intelligent indexing of super-large-scale knowledge graph storage.
The core of the intelligent index is the design of the intelligent hash algorithm architecture, which comprises a BERT-compatible model, an aggregation network and a multilayer perceptron. The architecture combines the latest deep learning results with the characteristics of the knowledge graph. The three different types of index input are expressed in the form (h, r, t): if it is an entity, the input is h, and r and t are empty; if it is a relation triple, h is the head entity, r is the relation, and t is the tail entity; if it is an attribute triple, h is the entity, r is the attribute name, and t is the attribute value. The intelligent hash algorithm makes full use of deep learning's understanding of semantics, so that extremely efficient retrieval can be realized in super-large-scale knowledge graph storage, including simple retrieval, complex multi-hop retrieval, and complex analysis with tasks such as knowledge reasoning.
In the method for super-large-scale knowledge graph intelligent index storage, the three kinds of input are encoded with a BERT-compatible model, selected according to the available computing power. When computing power is abundant, BERT or a similar large model can be chosen for better effect; when computing power is tight, BERT-tiny or a similar small model can be chosen, saving computing resources while still obtaining acceptable results. The BERT-compatible model is not specifically limited in this patent; newly developed models may be used in the future in place of the models currently in wide use.
Specifically, as shown in Fig. 3, the encoding process of the BERT-compatible model is as follows:
S21, the text corresponding to the entity or relation ("intelligent index" in Fig. 3) is segmented into a sequence of lemmas ("intelligent" and "index" in Fig. 3); Chinese input is split character by character, and English words are split directly on whitespace.
S22, position information is added to the lemma sequence, i.e. the serial number of each lemma in the sequence; if the BERT-compatible input also carries upper/lower-sentence codes, these are set to 0.
S23, for each input, the respective vector representations are obtained by embedding, and the vectors are summed to obtain the input vector of the model, i.e. the inputs at the [CLS] position and the lemma positions in Fig. 3.
S24, the model performs representation learning on the input vector, and the learned vector is finally taken from the model's [CLS] position and recorded as v. In this way, each of the three input types yields input data carrying lemma, upper/lower-sentence and position information.
In the method for super-large-scale knowledge graph intelligent index storage, in the second step, if the input of the BERT-compatible model is an entity, the output vector v is the vector representation of that entity; if the input is a triple, the output v is the vector representation of the corresponding relation triple or attribute triple. The output vector v serves as the input to the aggregation network in the next step.
In the method for super-large-scale knowledge graph intelligent index storage, the aggregation network module is designed to make full use of the characteristics of the knowledge graph to learn more appropriate vector representations. Specifically, for an entity, the aggregation network aggregates the information of all adjacent vertices and associated edges to realize deep semantic learning; for relation triples and attribute triples, only the triple itself is learned, to reduce the amount of computation.
For an entity, according to the characteristics of the knowledge graph, the aggregation network aggregates the information of all adjacent vertices and associated edges, realizing deep semantic learning. Given v_e, the vector representation of the entity (vertex) e obtained by the model, the aggregation is:

v̄_e = (1/|N(e)|) · Σ_{e_i ∈ N(e)} (v_e + v_{r_i} + v_{e_i})

where N(e) denotes the set of all adjacent vertices of e, |N(e)| denotes the number of adjacent vertices, e_i denotes a vertex adjacent to e, and r_i denotes the relation between e and e_i. The final output v̄_e is the vector representation of the corresponding entity in the output of the aggregation network.
For a triple, the vector values of the triple are directly averaged by the following formula:

h = (v_1 + v_2 + v_3) / 3

where, for a relation triple, v_1 is the vector representation of the head entity, v_2 is the vector representation of the relation, and v_3 is the vector representation of the tail entity; for an attribute triple, v_1 is the vector representation of the entity, v_2 is the vector representation of the attribute name, and v_3 is the vector representation of the attribute value.
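The averaging step for triples is simple enough to show directly (the function and variable names are illustrative only):

```python
import numpy as np

def triple_vector(v1, v2, v3):
    """Average the three component vectors of a relation or attribute triple.

    v1: head entity (or entity) vector
    v2: relation (or attribute name) vector
    v3: tail entity (or attribute value) vector
    """
    return (v1 + v2 + v3) / 3.0

v = triple_vector(np.array([3.0, 0.0]),
                  np.array([0.0, 3.0]),
                  np.array([3.0, 3.0]))
# v == [2.0, 2.0]
```

Because no neighborhood traversal is involved, this path is much cheaper than the entity aggregation, which is exactly the computation-saving point the text makes.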
In the intelligent-index storage method for the super-large-scale knowledge graph, the convergence network is followed by a simple multilayer perceptron (MLP), which regresses the starting position pos of data storage and the length len of physical storage. The multilayer perceptron is among the simplest neural networks. Besides the input layer and the output layer, an MLP may have several hidden layers in between; the simplest MLP contains exactly one hidden layer, i.e., a three-layer structure in which the lowest layer is the input layer, the middle layer is the hidden layer, and the last layer is the output layer.
The layers of the multilayer perceptron are fully connected, i.e., every neuron in one layer is connected to all neurons in the next layer. The input layer receives an N-dimensional vector and therefore has N neurons. Assuming the input is the vector X, the output of the hidden layer is f(W1·X + b1), where W1 is a weight matrix (also called connection coefficients), b1 is a bias, and f may be a commonly used sigmoid or tanh function. The mapping from the hidden layer to the output layer can be regarded as multi-class logistic regression, i.e., softmax regression, so the output of the output layer is softmax(W2·X1 + b2), where X1 denotes the hidden-layer output f(W1·X + b1). Summarized as a single formula, with the function G being softmax, the three-layer MLP is:
Figure 82793DEST_PATH_IMAGE029
Therefore, the parameters of the MLP are the connection weights and biases between the layers: W1, b1, W2, and b2. Solving for these parameters is an optimization problem; the simplest approach is stochastic gradient descent (SGD): all parameters are first initialized randomly and then trained iteratively, with gradients computed and parameters updated continuously until a stopping condition is met. These are conventional techniques in artificial-intelligence algorithms, are not the innovative point of this patent, and are not elaborated here. During training, the whole network learns pos and len simultaneously in a multi-task learning manner; at inference time, the shared backbone network is fully exploited, so pos and len can be computed simultaneously and very efficiently. The core of the invention is the output of pos and len: once the starting position pos of physical storage and the physical storage length len of the data are obtained, the start position pos_start = pos and the end position pos_end = pos + len follow, and a mature file-system API can be used to access the knowledge-graph data stored on the physical device (such as a disk, SSD, memory, or even tape).
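Once pos and len are regressed, the final lookup is an ordinary seek-and-read on the file system. A hedged sketch (the record layout and file name are invented for the demonstration; a temporary file stands in for the physical store):

```python
import os
import tempfile

def read_record(path, pos, length):
    """Read `length` bytes starting at offset `pos`, i.e. the half-open
    range [pos_start, pos_end) produced by the model, using only
    standard file-system calls."""
    pos_start = pos
    pos_end = pos + length  # pos_end = pos + len, as in the text
    with open(path, "rb") as f:
        f.seek(pos_start)
        return f.read(pos_end - pos_start)

# Toy demonstration: a record embedded at a known offset.
tmp = tempfile.NamedTemporaryFile(delete=False)
tmp.write(b"....<entity:alpha>....")
tmp.close()
record = read_record(tmp.name, 4, 14)  # bytes 4..18
os.unlink(tmp.name)
```

Because only `seek` and `read` are needed, any device exposed through a file-system API (disk, SSD, memory, tape) can back the store without changing the index.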
The invention also provides an intelligent-index storage system for the super-large-scale knowledge graph, in which the knowledge graph is stored on a physical storage device. The system passes the input data through a deep learning model to obtain the starting position pos of physical storage and the physical storage length len of the data, giving the start position pos_start = pos and the end position pos_end = pos + len. The system comprises a BERT compatible model, a convergence network module, and a multilayer perceptron. The BERT compatible model encodes the three types of input (entity, relation triple, and attribute triple; the input of each index is necessarily one of these three), obtains the vector representation of the corresponding type, and sends it to the convergence network. The convergence network module aggregates, for an entity, the information of all adjacent vertices and associated edges according to the characteristics of the knowledge graph, thereby realizing deep semantic learning and outputting the vector representation of the entity; for a triple, it averages the component vectors of the triple, obtaining the vector representation of the relation triple or the attribute triple respectively. The multilayer perceptron takes the vector representations obtained by the convergence network as input and regresses the starting position of data storage and the length of physical storage. In the convergence network module, the characteristics of the knowledge graph are fully exploited to learn more suitable vector representations: for an entity, all adjacent vertices and associated edges are aggregated; for relation triples and attribute triples, only the triple itself is learned, so as to reduce the amount of computation.
The core of the invention is the output starting position pos and physical storage length len. Once these are obtained, a mature file-system API can be used to access the data stored on the physical device (such as a disk, SSD, memory, or even tape), thereby realizing intelligent indexing of the super-large-scale knowledge-graph storage.
On the basis of the above method and system, a computer device is provided with the intelligent indexing system. The intelligent indexing system executes the intelligent hash algorithm described above, realizing intelligent indexing of the super-large-scale knowledge graph stored on the physical storage device, so that the knowledge graph can be indexed efficiently and in a short time within the data on the physical device (such as a disk, SSD, memory, or even tape).
The method and system provided by the invention are suited to intelligent indexing of large-scale semantic knowledge graphs and are applicable to all fields. Their core is to provide an intelligent index for the super-large-scale knowledge graph, so as to improve retrieval efficiency and provide more convenient service for knowledge-graph-based intelligent reasoning.

Claims (9)

1. A method for indexing super-large-scale knowledge-graph storage, characterized in that, during indexing of the stored super-large-scale knowledge graph, a hash computation is realized on the basis of a deep learning model to obtain the starting position of physical storage and the storage length, specifically comprising the following steps:
firstly, dividing the input of the index into three types, namely entity, relation triple, and attribute triple, and designing an intelligent hash algorithm based on the three input types, the intelligent hash algorithm structurally comprising a BERT compatible model, a convergence network, and a multilayer perceptron;
secondly, encoding and learning the three types of input respectively with the BERT compatible model, and sending the learned vectors to the convergence network;
thirdly, in the convergence network, for an entity, aggregating all of the entity's adjacent vertices and associated edges and outputting the vector representation of the corresponding entity; for relation triples and attribute triples, learning the triple itself and outputting the vector representation of the corresponding relation triple or attribute triple respectively;
fourthly, inputting the vector representations obtained by the convergence network into the multilayer perceptron and regressing the starting position of data storage and the length of physical storage;
and fifthly, accessing and maintaining the knowledge-graph data on the physical storage device according to the output starting position and physical storage length, realizing intelligent indexing of the super-large-scale knowledge-graph storage.
2. The method of claim 1, wherein in the first step, each index input is expressed with three slots x1, x2, and x3, filled according to the three input types as follows:
if the input is an entity, the input is x1, and x2 and x3 are empty;
if it is a relation triple, x1 is the head entity, x2 is the relation, and x3 is the tail entity;
if it is an attribute triple, x1 is the entity, x2 is the attribute name, and x3 is the attribute value.
3. The method of claim 1, wherein in the second step, if the input of the BERT compatible model is an entity, the output v is the vector representation of the corresponding entity; if the input is a relation, the output v is the vector representation of the corresponding relation; the output vector v serves as the input to the convergence network in the next step.
4. The method of claim 3, wherein in the second step, the BERT compatible model performs encoding as follows:
S21, segmenting the text corresponding to the entity or relation into a token sequence, Chinese input being segmented character by character, and any English words in the input being segmented directly on whitespace;
S22, adding position information to the token sequence, namely the serial number of each token within the sequence, and if the input also includes upper/lower-sentence encodings, setting the upper/lower-sentence input to 0;
S23, for each input, obtaining its vector representation by embedding, and summing the vectors to obtain the input vector of the model;
S24, performing representation learning on the input vector with the model, and finally obtaining the learned vector from the [CLS] position of the model output, recorded as v.
5. The method of claim 1, wherein in the third step, for an entity e, the convergence network aggregates the information of all adjacent vertices and associated edges to achieve deep semantic learning, h_e denoting the vector representation of the entity obtained by the model:
h_e = (1 / |N(e)|) · Σ_{e_i ∈ N(e)} (v_{e_i} + v_{r_i})
where N(e) denotes the set of all vertices adjacent to e, |N(e)| denotes the number of adjacent vertices, e_i is a vertex adjacent to e, and r_i denotes the relation between e and e_i; the final output h_e is the vector representation of the corresponding entity in the output of the convergence network.
6. The method for indexing super-large-scale knowledge-graph storage according to claim 1, wherein in the third step, for a triple, the vector values of the triple are directly averaged by the following formula:
h = (v_1 + v_2 + v_3) / 3
where, for a relation triple, v_1 is the vector representation of the head entity, v_2 is the vector representation of the relation, and v_3 is the vector representation of the tail entity; for an attribute triple, v_1 is the vector representation of the entity, v_2 is the vector representation of the attribute name, and v_3 is the vector representation of the attribute value.
7. The method for indexing super-large-scale knowledge-graph storage according to claim 1, wherein in the fourth step, the vector representations obtained by the convergence network are input into a position multilayer perceptron and a length multilayer perceptron respectively, which respectively regress the starting position and the length of the data in physical storage.
8. An indexing system for super-large-scale knowledge-graph storage, characterized in that the super-large-scale knowledge graph is stored on a physical storage device, and the system computes, through a deep learning model applied to the input data, the starting position pos of physical storage and the physical storage length len of the data, thereby reading the required knowledge graph according to the start position pos_start = pos and the end position pos_end = pos + len; the system comprises a BERT compatible model, a convergence network module, and a multilayer perceptron, wherein,
the BERT compatible model encodes the input of the index, obtains its vector representation, and sends it to the convergence network, each index input being one of three types: entity, relation triple, or attribute triple;
the convergence network module, for an entity, aggregates the information of all adjacent vertices and associated edges, thereby realizing deep semantic learning and outputting the vector representation of the entity; for a triple, it averages the component vectors of the triple, obtaining the vector representation of the relation triple or the attribute triple respectively;
and the multilayer perceptron takes the vector representation obtained by the convergence network as input and regresses the starting position of data storage and the length of physical storage, the starting position and physical storage length serving as the basis for accessing and maintaining the knowledge-graph data on the physical storage device.
9. A computer device, characterized in that the computer device is provided with an intelligent indexing system which executes the method of claim 1, realizing intelligent indexing of the super-large-scale knowledge graph stored on the physical storage device.
CN202210874965.8A 2022-07-25 2022-07-25 Indexing method, system and computer equipment for super-large-scale knowledge map storage Active CN114936296B (en)


Publications (2)

Publication Number Publication Date
CN114936296A CN114936296A (en) 2022-08-23
CN114936296B (en) 2022-11-08

Family

ID=82869128


Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111339313A (en) * 2020-02-18 2020-06-26 北京航空航天大学 Knowledge base construction method based on multi-mode fusion
CN113094449A (en) * 2021-04-09 2021-07-09 天津大学 Large-scale knowledge map storage scheme based on distributed key value library
CN114064931A (en) * 2021-11-29 2022-02-18 新疆大学 A method and system for first aid knowledge question answering based on multimodal knowledge graph
CN114625830A (en) * 2022-03-16 2022-06-14 中山大学·深圳 Chinese dialogue semantic role labeling method and system

Family Cites Families (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109376864A (en) * 2018-09-06 2019-02-22 电子科技大学 A Knowledge Graph Relational Reasoning Algorithm Based on Stacked Neural Networks
US11687733B2 (en) * 2020-06-25 2023-06-27 Sap Se Contrastive self-supervised machine learning for commonsense reasoning
US11755838B2 (en) * 2020-09-14 2023-09-12 Smart Information Flow Technologies, LLC Machine learning for joint recognition and assertion regression of elements in text
US20220129621A1 (en) * 2020-10-26 2022-04-28 Adobe Inc. Bert-based machine-learning tool for predicting emotional response to text
CN112765991B (en) * 2021-01-14 2023-10-03 中山大学 Knowledge enhancement-based deep dialogue semantic role labeling method and system
CN113378573B (en) * 2021-06-24 2025-01-10 北京华成智云软件股份有限公司 Small sample relationship extraction method and device for content big data
CN113988079B (en) * 2021-09-28 2025-03-14 浙江大学 A dynamic enhanced multi-hop text reading recognition processing method for low-data
CN114064926B (en) * 2021-11-24 2025-02-11 国家电网有限公司大数据中心 Multimodal power knowledge graph construction method, device, equipment and storage medium
CN114168749A (en) * 2021-12-06 2022-03-11 北京航空航天大学 Question generation system based on knowledge graph and question word drive




Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant