CN102004631A - Method and device for processing information document - Google Patents
- Publication number
- CN102004631A (application CN201010519869A)
- Authority
- CN
- China
- Prior art keywords
- current
- structure element
- subscript
- processing
- submodule
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Landscapes
- Management, Administration, Business Operations System, And Electronic Commerce (AREA)
Abstract
The invention discloses a method and a device for processing an information document. The method comprises the following steps: A, mapping the tag information of an Extensible Markup Language (XML) document into an XML document vector model that can be accessed by subscript; and B, accessing and processing the structural elements of the XML document vector model by subscript according to parallel processing policy information, wherein the structural elements that need parallel processing are processed in parallel. The device is used for implementing the method. With the method and the device, the processing result of an XML document can be returned faster, the execution efficiency of software is improved, and the time a user waits for the processing result of the XML document is shortened.
Description
Technical field
The present invention relates to the field of microcomputer data processing, and in particular to a method and device for processing Extensible Markup Language (XML) information documents.
Background
An XML document is a general-purpose, flexible file format for structured data and is now widely used in the computer software industry. Existing approaches to processing XML documents mainly use the event-driven mode of the Simple API for XML (SAX) and process the document directly while events are being triggered. In the SAX event-driven mode, the XML document is handled as a stream: each time an element is encountered, an event is triggered and passed to an event handler, and the document content is processed and the result returned directly inside that handler. The advantage of this approach is that the event parser reads the XML document sequentially and does not load the whole document into memory, so processing is fast.
However, this approach has a drawback. The XML document is read from beginning to end and can only be processed by jumping from one tag to the next, which is a blocking process that cannot be interrupted. Processing the whole document therefore takes a long time and results are returned slowly. When a document with a relatively large amount of content is opened (for example, when reading a web page or an office document format), the program is blocked by the XML parsing process and the document is only fully processed after a long time. The user waits a long time for the result, which seriously degrades the execution efficiency of the software.
Summary of the invention
In view of this, the main object of the present invention is to provide a method for processing an information document that speeds up the return of XML document processing results and improves the execution efficiency of computer software.
A further object of the present invention is to provide a device for processing an information document that speeds up the return of XML document processing results and improves the execution efficiency of computer software.
To achieve the above objects, the technical solution of the present invention is realized as follows:
A method for processing an information document, the method comprising:
A. mapping the tag information of an Extensible Markup Language (XML) document into an XML document vector model that can be accessed by subscript;
B. according to parallel processing policy information, accessing and processing the structural elements of the XML document vector model by subscript, wherein the structural elements that need parallel processing are processed in parallel.
In a preferred embodiment, in step A the information contained in each tag of the XML document is mapped into one structural element of the XML document vector model, and the structural element to which a tag is mapped contains the following information:
1) the sequence number of the current tag in the XML document;
2) the distance from the current tag to the sequence number of the parent tag that contains it;
3) the number of child tags contained in the current tag;
4) the tag name of the current tag;
5) the attribute information of the current tag;
6) the initial text content contained in the current tag;
7) the end text content contained in the current tag.
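The seven items above can be pictured as one record per tag. The following is a minimal Python sketch, not the patent's own definition: the field names `m_local`, `m_distance`, `m_count`, `m_name`, `mp_attributelist` and `m_beginContent` are the ones used later in the description, while `m_endContent` and the class name `StructElement` are assumed here for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class StructElement:
    m_local: int                 # 1) sequence number of the tag in the document
    m_distance: int              # 2) distance back to the parent's sequence number
    m_name: str                  # 4) tag name
    mp_attributelist: Dict[str, str] = field(default_factory=dict)  # 5) attributes
    m_count: int = 0             # 3) number of tags nested inside this tag
    m_beginContent: str = ""     # 6) initial text content
    m_endContent: str = ""       # 7) end text content (field name assumed)


# The vector model is simply a list that can be indexed by subscript.
vector: List[StructElement] = [
    StructElement(0, 0, "A", {"name": "a"}, m_count=2),
    StructElement(1, 1, "B", {"name": "b1"}, m_count=1,
                  m_beginContent="BeginContent1", m_endContent="EndContent1"),
    StructElement(2, 1, "C", {"name": "c1"}),
]

# The parent of any element is reached by subtracting its distance.
parent_of_c = vector[2].m_local - vector[2].m_distance
```

Storing the parent as a relative distance rather than a pointer keeps every element addressable by plain integer arithmetic, which is what makes random access and splitting of the document possible.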
In a preferred embodiment, step A is performed by event-driven parsing and specifically comprises the following event-driven processing:
Start-document event handling: create or clear a data structure that can be accessed by subscript to serve as the XML document vector model, and initialize the current processing tag subscript;
Start-tag event handling: build a structural element for the new tag; assign this new structural element its sequence number, the distance from the new tag to its parent tag's sequence number, its tag name and its attributes; add the newly built structural element to the subscript-accessible data structure; and change the current processing subscript to the sequence number of the current structural element;
Content event handling: determine whether the content just encountered is initial content or end content; if it is initial content, assign it to the initial content of the current structural element, otherwise assign it to the end content of the current structural element;
End-tag event handling: assign the number of child elements contained in the structural element currently being processed, and assign the subscript of its parent element to the current processing tag subscript.
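The four event handlers above map naturally onto a SAX content handler. The sketch below uses Python's standard `xml.sax` module and is only an illustration under stated assumptions: the class name `VectorModelBuilder`, the dictionary layout, the `m_endContent` key and the `seen_child` helper flag are hypothetical, whitespace-only text is ignored, and `m_count` is taken to count every tag nested inside an element.

```python
import xml.sax


class VectorModelBuilder(xml.sax.ContentHandler):
    def startDocument(self):
        # Start-document event: create the subscript-accessible structure.
        self.vector = []
        self.current = -1          # current processing tag subscript (-1: none yet)

    def startElement(self, name, attrs):
        # Start-tag event: build a new structural element and make it current.
        local = len(self.vector)
        distance = local - self.current if self.current >= 0 else 0
        if self.current >= 0:
            self.vector[self.current]["seen_child"] = True
        self.vector.append({
            "m_local": local, "m_distance": distance, "m_count": 0,
            "m_name": name, "mp_attributelist": dict(attrs),
            "m_beginContent": "", "m_endContent": "",
            "seen_child": False,   # helper flag, not part of the described model
        })
        self.current = local

    def characters(self, content):
        # Content event: text before the first child is initial content,
        # text after a child is end content.
        if self.current < 0 or not content.strip():
            return
        el = self.vector[self.current]
        key = "m_endContent" if el["seen_child"] else "m_beginContent"
        el[key] += content

    def endElement(self, name):
        # End-tag event: record the nested tag count, then return to the parent.
        el = self.vector[self.current]
        el["m_count"] = len(self.vector) - 1 - el["m_local"]
        self.current = el["m_local"] - el["m_distance"] if el["m_local"] > 0 else -1


builder = VectorModelBuilder()
xml.sax.parseString(
    b'<A name="a"><B name="b1">BeginContent1<C name="c1"/>EndContent1</B></A>',
    builder,
)
```

After parsing, `builder.vector` holds one entry per tag in document order, so any tag can be reached directly by its subscript instead of by replaying the event stream.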
In a preferred embodiment, step B specifically comprises:
B1. Take the first structural element of the XML document vector model as the current structural element, and build the processing context corresponding to this structural element as the current processing context;
B2. In the current processing context, perform the start-phase processing of the current structural element;
B3. In the current processing context, process the initial content of the current structural element, and assign a value to the current child structural element subscript;
B4. Determine whether the current child structural element subscript lies within the child structural element range of the current structural element; if so, go to step B9, otherwise go to step B5;
B5. In the current processing context, process the end content of the current structural element;
B6. In the current processing context, perform the end-phase processing of the current structural element;
B7. Determine whether the current structural element is the root structural element; if so, end the flow, otherwise go to step B8;
B8. Set the current child structural element subscript to the subscript of the sibling structural element of the current structural element, return to the processing context of the parent structural element of the current structural element, and go to step B4;
B9. According to the preset parallel processing policy information, determine whether the child structural elements of the current structural element can be processed in parallel; if so, go to step B10, otherwise go to step B13;
B10. Traverse to find all child structural elements under the current structural element that are siblings of one another;
B11. Process all the child structural elements found in parallel;
B12. Wait until all child structural elements have been processed in parallel, then go to step B5;
B13. Handle the current processing context accordingly, take the current child structural element as the structural element currently being processed, build the processing context corresponding to it as the current processing context, and return to step B2.
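The sequential path through these steps (B1 to B8 plus B13, leaving out the parallel branch B9 to B12) can be sketched as a loop over subscripts. This is a hedged illustration, not the patented implementation: the function name `walk` and the dictionary keys are assumptions, "processing" is reduced to re-emitting tags and text, and the sibling and parent formulas are the ones given later in the embodiment (`s = c + m_count + 1`, `c = c - m_distance`), with `m_count` taken to cover every nested tag.

```python
def walk(vector):
    """Sequential traversal of the vector model by subscript (no parallel branch)."""
    out = []
    c = 0                                              # B1: first element is current
    while True:
        out.append(f"<{vector[c]['m_name']}>")         # B2: start phase
        out.append(vector[c]["m_beginContent"])        # B3: initial content
        s = c + 1                                      # B3: first possible child
        while True:
            if s <= c + vector[c]["m_count"]:          # B4: s inside child range?
                c = s                                  # B13: descend into the child
                break
            out.append(vector[c]["m_endContent"])      # B5: end content
            out.append(f"</{vector[c]['m_name']}>")    # B6: end phase
            if c == 0:                                 # B7: root finished
                return "".join(out)
            s = c + vector[c]["m_count"] + 1           # B8: next sibling subscript
            c = c - vector[c]["m_distance"]            # B8: back to the parent


def el(name, distance, count, begin="", end=""):
    # Helper for building test data; key names follow the description.
    return {"m_name": name, "m_distance": distance, "m_count": count,
            "m_beginContent": begin, "m_endContent": end}


vector = [el("A", 0, 4), el("B", 1, 1, "x", "y"), el("C", 1, 0),
          el("B", 3, 1, "p", "q"), el("C", 1, 0)]
result = walk(vector)
```

Because the only state is the pair of integers `c` and `s`, the same loop can be started at any subscript, which is what allows a subtree to be handed to a separate thread.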
In a preferred embodiment, step B10 specifically comprises:
B401. Record the current child structural element subscript;
B402. Set the current child structural element subscript to the subscript of its next sibling structural element;
B403. Determine whether the current child structural element subscript lies within the child structural element range of the current structural element; if so, go to step B401, otherwise go to step B404;
B404. Take the child structural element subscripts recorded in step B401 as the subscript set of sibling child structural elements that can be processed in parallel, and end this traversal and search flow.
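Steps B401 to B404 amount to hopping from sibling to sibling by index arithmetic. A minimal sketch follows, assuming a list of dictionaries with an `m_count` key that counts every tag nested inside an element; the function name `collect_sibling_subscripts` is hypothetical.

```python
def collect_sibling_subscripts(vector, c):
    """Return the subscripts of the direct children of element c."""
    siblings = []
    s = c + 1                                  # first child, if any
    while s <= c + vector[c]["m_count"]:       # B403: still inside c's child range
        siblings.append(s)                     # B401: record the subscript
        s = s + vector[s]["m_count"] + 1       # B402: jump to the next sibling
    return siblings                            # B404: the recorded subscript set


# A with three B children, each B containing one C.
vector = [{"m_count": 6}, {"m_count": 1}, {"m_count": 0},
          {"m_count": 1}, {"m_count": 0}, {"m_count": 1}, {"m_count": 0}]
top_level = collect_sibling_subscripts(vector, 0)
```

Each hop skips a whole subtree in one addition, so finding the siblings costs time proportional to the number of siblings rather than to the size of the document.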
In a preferred embodiment, the processing in step B11 of each child structural element found specifically comprises:
B41. Take one child structural element subscript found in step B10 that needs parallel processing as the current child structural element subscript;
B42. Take the current child structural element subscript as the subscript of the structural element currently being processed, build the processing context corresponding to this structural element, and use it as the current processing context;
B43. In the current processing context, perform the start-phase processing of the current structural element;
B44. In the current processing context, process the initial content of the current structural element, and assign a value to the current child structural element subscript;
B45. Determine whether the current child structural element subscript lies within the child structural element range of the current structural element; if so, go to step B42, otherwise go to step B46;
B46. In the current processing context, process the end content of the current structural element;
B47. In the current processing context, perform the end-phase processing of the current structural element;
B48. Determine whether the current structural element subscript is the child structural element subscript of step B41; if so, end the flow, otherwise go to step B49;
B49. Set the current child structural element subscript to the subscript of the sibling structural element of the current structural element, return to the processing context of the parent structural element of the current structural element, and return to step B45.
In a preferred embodiment, the parallel processing policy information in step B comprises: setting a parallel processing flag for the tags that need parallel processing; if a tag carries the parallel processing flag, the child tags within that tag that are siblings of one another are judged to be processable in parallel.
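One simple way to hold such flags is a lookup table keyed by tag name. The sketch below is only an assumed representation; the table contents and the names `PARALLEL_FLAGS` and `children_parallelizable` are hypothetical and not taken from the patent.

```python
# Hypothetical policy table: True means the sibling child tags of this tag
# may be processed in parallel.
PARALLEL_FLAGS = {"A": True, "B": False}


def children_parallelizable(tag_name, flags=PARALLEL_FLAGS):
    """Return True if the tag carries the parallel processing flag."""
    return flags.get(tag_name, False)   # tags without an entry default to sequential
```

Defaulting unknown tags to sequential processing is the conservative choice, since parallel processing is only safe where the document specification says siblings are independent.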
A device for processing an information document, the device comprising:
a first module, configured to map the tag information of an Extensible Markup Language (XML) document into an XML document vector model that can be accessed by subscript;
a second module, configured to access and process the structural elements of the XML document vector model by subscript according to parallel processing policy information, wherein the structural elements that need parallel processing are processed in parallel.
In a preferred embodiment, the first module specifically comprises the following event-driven processing modules:
a start-document event handling module, configured to create or clear a data structure that can be accessed by subscript to serve as the XML document vector model, and to initialize the current processing tag subscript;
a start-tag event handling module, configured to build a structural element for the new tag; to assign this new structural element its sequence number, the distance from the new tag to its parent tag's sequence number, its tag name and its attributes; to add the newly built structural element to the subscript-accessible data structure; and to change the current processing subscript to the sequence number of the current structural element;
a content event handling module, configured to determine whether the content just encountered is initial content or end content; if it is initial content, to assign it to the initial content of the current structural element, otherwise to assign it to the end content of the current structural element;
an end-tag event handling module, configured to assign the number of child elements contained in the structural element currently being processed, and to assign the subscript of its parent element to the current processing tag subscript.
In a preferred embodiment, the second module specifically comprises the following submodules:
a B1 submodule, configured to take the first structural element of the XML document vector model as the current structural element, and to build the processing context corresponding to this structural element as the current processing context;
a B2 submodule, configured to perform, in the current processing context, the start-phase processing of the current structural element;
a B3 submodule, configured to process, in the current processing context, the initial content of the current structural element, and to assign a value to the current child structural element subscript;
a B4 submodule, configured to determine whether the current child structural element subscript lies within the child structural element range of the current structural element; if so, control passes to the B9 submodule, otherwise to the B5 submodule;
a B5 submodule, configured to process, in the current processing context, the end content of the current structural element;
a B6 submodule, configured to perform, in the current processing context, the end-phase processing of the current structural element;
a B7 submodule, configured to determine whether the current structural element is the root structural element; if so, the flow ends, otherwise control passes to the B8 submodule;
a B8 submodule, configured to set the current child structural element subscript to the subscript of the sibling structural element of the current structural element, to return to the processing context of the parent structural element of the current structural element, and to pass control to the B4 submodule;
a B9 submodule, configured to determine, according to the preset parallel processing policy information, whether the child structural elements of the current structural element can be processed in parallel; if so, control passes to the B10 submodule, otherwise to the B13 submodule;
a B10 submodule, configured to traverse to find all child structural elements under the current structural element that are siblings of one another;
a B11 submodule, configured to process all the child structural elements found in parallel;
a B12 submodule, configured to wait until all child structural elements have been processed in parallel, and then to pass control to the B5 submodule;
a B13 submodule, configured to handle the current processing context accordingly, to take the current child structural element as the structural element currently being processed, to build the processing context corresponding to it as the current processing context, and to return to the B2 submodule.
In a preferred embodiment, the B10 submodule specifically comprises:
a B401 submodule, configured to record the current child structural element subscript;
a B402 submodule, configured to set the current child structural element subscript to the subscript of its next sibling structural element;
a B403 submodule, configured to determine whether the current child structural element subscript lies within the child structural element range of the current structural element; if so, control passes to the B401 submodule, otherwise to the B404 submodule;
a B404 submodule, configured to take the child structural element subscripts recorded by the B401 submodule as the subscript set of sibling child structural elements that can be processed in parallel, and to end this traversal and search flow.
In a preferred embodiment, the B11 submodule comprises:
a B41 submodule, configured to take one child structural element subscript found by the B10 submodule that needs parallel processing as the current child structural element subscript;
a B42 submodule, configured to take the current child structural element subscript as the subscript of the structural element currently being processed, to build the processing context corresponding to this structural element, and to use it as the current processing context;
a B43 submodule, configured to perform, in the current processing context, the start-phase processing of the current structural element;
a B44 submodule, configured to process, in the current processing context, the initial content of the current structural element, and to assign a value to the current child structural element subscript;
a B45 submodule, configured to determine whether the current child structural element subscript lies within the child structural element range of the current structural element; if so, control passes to the B42 submodule, otherwise to the B46 submodule;
a B46 submodule, configured to process, in the current processing context, the end content of the current structural element;
a B47 submodule, configured to perform, in the current processing context, the end-phase processing of the current structural element;
a B48 submodule, configured to determine whether the current structural element subscript is the child structural element subscript of the B41 submodule; if so, the flow ends, otherwise control passes to the B49 submodule;
a B49 submodule, configured to set the current child structural element subscript to the subscript of the sibling structural element of the current structural element, to return to the processing context of the parent structural element of the current structural element, and then to pass control to the B45 submodule.
Because the present invention first maps all the information of the XML document into an XML document vector model that can be accessed by subscript, access to this vector model no longer has to be streamed element by element; any structural element in it can be accessed at random. The invention can therefore set a parallel processing policy according to the specification and characteristics of the XML document and, for tags that can be processed in parallel, make full use of multithreading resources to process multiple child tag elements at the same moment. This reduces the time needed to process the XML document, improves XML processing performance, speeds up the return of processing results, and improves the execution efficiency of the software. The invention is particularly suitable for multi-core CPUs: the processing of the XML document can be distributed effectively across different CPUs during processing, which makes maximal use of system resources and further improves software execution efficiency.
Description of drawings
Fig. 1 shows the core processing flow of the method of the invention;
Fig. 2a is a flowchart of start-document event handling in the first stage of the invention;
Fig. 2b is a flowchart of start-tag event handling in the first stage of the invention;
Fig. 2c is a flowchart of content event handling in the first stage of the invention;
Fig. 2d is a flowchart of end-tag event handling in the first stage of the invention;
Fig. 3 is a detailed flowchart of the second stage in one embodiment of the invention;
Fig. 4a is a flowchart, in that embodiment, of traversing to find all child structural elements under the current structural element that are siblings of one another;
Fig. 4b is a flowchart, in that embodiment, of the processing applied to each child structural element during parallel processing;
Fig. 5 is a schematic diagram of existing SAX event-driven processing of an example XML document;
Fig. 6 is a mapping diagram of the subscript-accessible data structure obtained by mapping this example XML document in the first stage of the invention;
Fig. 7 is a schematic diagram of the second stage of the invention processing the XML document vector model.
Embodiment
The present invention is described in further detail below with reference to specific embodiments and the accompanying drawings.
Fig. 1 shows the core processing flow of the method of the invention. The flow comprises:
Step 101, the first stage: map the tag information of the XML document into an XML document vector model that can be accessed by subscript;
Step 102, the second stage: according to the parallel processing policy information, access and process the structural elements of the XML document vector model by subscript, wherein the structural elements that need parallel processing are processed in parallel.
In step 101, the information contained in each tag of the XML document is mapped into one structural element of the XML document vector model, and the structural element to which a tag is mapped contains the following information:
1) the sequence number of the current tag in the XML document;
2) the distance from the current tag to the sequence number of the parent tag that contains it;
3) the number of child tags contained in the current tag;
4) the tag name of the current tag;
5) the attribute information of the current tag;
6) the initial text content contained in the current tag;
7) the end text content contained in the current tag.
Table 1 below gives an example of the content of the structural element corresponding to one tag:
Table 1
In addition, the content of a tag may also include an identifier, bool m_bBegin, which indicates whether the tag currently being processed is in its start-tag phase.
Step 101 is performed by event-driven parsing. In a preferred embodiment, for example, the event-driven processing can be based on the event-driven mode of the Simple API for XML (SAX). In this SAX event-driven mode, the event handlers do not process the content and return a result directly; instead, they first turn the XML document content into the XML document vector model.
Step 101 specifically comprises the event-driven processing shown in Fig. 2a to Fig. 2d.
Fig. 2a is a flowchart of start-document event handling. Referring to steps 211 to 212 of Fig. 2a, the flow comprises: creating or clearing a data structure that can be accessed by subscript to serve as the XML document vector model, and initializing the current processing tag subscript variable.
Fig. 2b is a flowchart of start-tag event handling. Referring to steps 221 to 226 of Fig. 2b, the flow comprises: building a new structural element; assigning this new structural element its sequence number, the distance from the new tag to its parent tag's sequence number, its tag name and its attributes; adding the newly built structural element to the subscript-accessible data structure; and changing the current processing subscript to the sequence number of the current structural element.
Fig. 2c is a flowchart of content event handling. Referring to steps 231 to 233 of Fig. 2c, the flow comprises: determining whether the content just encountered is initial content or end content; if it is initial content, assigning it to the initial content of the current structural element, otherwise assigning it to the end content of the current structural element.
Fig. 2d is a flowchart of end-tag event handling. Referring to steps 241 to 242 of Fig. 2d, the flow comprises: assigning the number of child elements contained in the structural element currently being processed, and assigning the subscript of its parent element to the current processing tag subscript.
By building such an XML document vector model, the information of the XML document is recorded in the above data structure, so that the information of any tag can be accessed at random by subscript, and the document can also be split at arbitrary points according to the information held in the structural elements. In step 102, a subscript is mapped to a structural element in the document vector model, and the content information of each tag in the document is obtained in this way.
After step 101, the full content of the XML document has been recorded in the data structure of the XML document vector model. In step 102, therefore, all processing of the XML document becomes processing of the XML document vector model, and access to the content of each tag of the XML document becomes access, by subscript, to the corresponding structural element in the vector model.
Fig. 3 is a detailed flowchart of the second stage, i.e. step 102, in one embodiment of the invention. Suppose that in this embodiment an XML document vector model comprising n structural elements, vector[n], has been built, that c is the subscript of the structural element currently being processed, and that s is the current child structural element subscript. Referring to Fig. 3, the flow specifically comprises:
Step 32: in the current processing context, perform the start-phase processing of the current structural element; this may include processing the attribute information of the current structural element (vector[c]->mp_attributelist) and other related processing.
Step 33: in the current processing context, process the initial content of the current structural element and assign a value to the current child structural element subscript, i.e. process the information in vector[c]->m_beginContent and assign s the value c+1.
Step 37: determine whether the current structural element is the root structural element, i.e. whether c is 0; if so, end the flow and return the processing result to the user, otherwise go to step 38.
Step 38: set the current child structural element subscript to the sibling structural element of the current structural element, i.e. s=c+vector[c]->m_count+1, and return to the processing context of the parent structural element of the current structural element, i.e. c=c-vector[c]->m_distance; then go to step 34.
Here, the parallel processing policy information can be set according to the characteristics of the XML document itself. For a specific XML document, the content generally has a corresponding tag specification. For example, for an Open Document Format (ODF) document there is a standard technical specification recognized by the International Organization for Standardization (ISO). This specification defines the tag information corresponding to a row, a paragraph, a page and so on, and also defines how information is associated between the corresponding tags. The implementer of each technical specification classifies and evaluates XML documents according to that specification and generates the corresponding parallel processing policy. In a spreadsheet, for example, the sibling tag elements between tables are parallel and can be processed in parallel; a parallel processing flag can then be set in the parallel processing policy information for the parent tag of these sibling tags, indicating that the child tags under this parent tag that are siblings of one another can be processed in parallel.
In step 310 above, the specific flow for traversing to find all child structural elements under the current structural element that are siblings of one another is shown in Fig. 4a. The flow comprises:
In step 311 above, during parallel processing, the processing flow applied to each child structural element found is shown in Fig. 4b and specifically comprises:
Through step 102 above, i.e. the second stage, when parallel tag elements with no mutual dependency are encountered, their child tags can be processed by multiple threads in parallel, so that the processing tasks are reasonably distributed across several threads at once. This reduces the processing time of the XML document and increases document processing speed.
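The parallel branch of the second stage can be sketched with a thread pool from Python's standard `concurrent.futures` module. This is an assumption-laden illustration rather than the patented flow: recursion stands in for the explicit processing contexts, "processing" is reduced to re-emitting tags and text, the names `process`, `child_subscripts` and `parallel_tags` are hypothetical, and `m_count` is taken to count every nested tag.

```python
from concurrent.futures import ThreadPoolExecutor


def child_subscripts(vector, c):
    """Yield the subscripts of the direct children of element c."""
    s, last = c + 1, c + vector[c]["m_count"]
    while s <= last:
        yield s
        s += vector[s]["m_count"] + 1          # skip the whole subtree of s


def process(vector, c, parallel_tags):
    """Process the subtree rooted at subscript c and return its output."""
    el = vector[c]
    kids = list(child_subscripts(vector, c))
    if el["m_name"] in parallel_tags and len(kids) > 1:
        # Sibling subtrees are handed to worker threads; leaving the
        # with-block waits until every one of them has finished.
        with ThreadPoolExecutor() as pool:
            parts = list(pool.map(lambda k: process(vector, k, parallel_tags), kids))
    else:
        parts = [process(vector, k, parallel_tags) for k in kids]
    name = el["m_name"]
    return f"<{name}>{el['m_beginContent']}{''.join(parts)}{el['m_endContent']}</{name}>"


def el(name, count, begin="", end=""):
    # Helper for building test data; key names follow the description.
    return {"m_name": name, "m_count": count,
            "m_beginContent": begin, "m_endContent": end}


vector = [el("A", 4), el("B", 1, "b1", "e1"), el("C", 0),
          el("B", 1, "b2", "e2"), el("C", 0)]
parallel_result = process(vector, 0, {"A"})    # children of A run in parallel
sequential_result = process(vector, 0, set())  # same work, one thread
```

`pool.map` returns results in input order, so the output is identical to the sequential run. Note that in CPython, threads mainly help when per-tag processing waits on I/O or releases the interpreter lock; a process pool would be the usual substitute for CPU-bound work.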
The process of each stage of the invention is described in further detail below using the processing of an example XML document. The XML document is as follows:
<A name="a">
<B name="b1">BeginContent1<C name="c1"/>EndContent1</B>
<B name="b2">BeginContent2<C name="c2"/>EndContent2</B>
<B name="b3">BeginContent3<C name="c3"/>EndContent3</B>
<B name="b4">BeginContent4<C name="c4"/>EndContent4</B>
<B name="b5">BeginContent5<C name="c5"/>EndContent5</B>
<B name="b6">BeginContent6<C name="c6"/>EndContent6</B>
</A>
Fig. 5 is a schematic diagram of existing SAX event-driven processing of the above XML document example. Referring to Fig. 5, when such an XML document is processed in the existing SAX event-driven process, the flow must jump continuously from one tag to the next; it is a streamed, event-driven process that blocks and cannot be interrupted. Faced with many repeated tags, if the processing of each <B> tag takes a long time, the processing of the whole document becomes very long, yet the existing approach cannot skip any of these steps.
With the method of the present invention, however, when an XML document such as the above is processed (for example in word processing), the specific procedure is as follows.
First, in the first phase, SAX event-driven parsing produces a data structure that can be accessed by subscript, as shown in Fig. 6, namely the XML document vector model. Following the flows shown in Fig. 2a to Fig. 2d, each event is handled as follows:
SAX event-driven parsing is performed on the XML document. The start-document event is triggered first: a data structure that can be accessed by subscript is created or cleared, and the subscript variable of the tag currently being processed is initialized, for example to 0.
Next, start tag <A name="a"> is encountered and the start-tag event is triggered: a structural element is built for the new tag; the serial number m_local of this new object (the newly built structural element) is set to 0, its distance to the parent tag m_distance is set to 0, its tag name m_name is set to A, and its attribute list mp_attributelist is set to name="a"; the newly built structural element is added to the subscript-accessible data structure, as in the row of Fig. 6 whose m_local is 0; and the current processing subscript is changed to the serial number m_local of the current object, 0.
Next, start tag <B name="b1"> is encountered and the start-tag event is triggered: a structural element is built for the new tag; the serial number of this new object is set to 1, its distance to the parent tag is set to 1, its tag name is set to B, and its attribute is set to name="b1"; the newly built structural element is added to the subscript-accessible data structure, as in the row of Fig. 6 whose m_local is 1; and the current processing subscript is changed to the serial number of the current object, 1.
Parsing then continues and the content BeginContent1 is encountered, which triggers the content event: it is judged whether the content currently encountered is beginning content or ending content. Here it is beginning content, so it is assigned to the beginning content of the current structural element; that is, the beginning content of the structural element with subscript 1 is set to BeginContent1.
Next, start tag <C name="c1"> is encountered and the start-tag event is triggered: a structural element is built for the new tag; the serial number of this new object is set to 2, its distance to the parent tag is set to 1, its tag name is set to C, and its attribute is set to name="c1"; the newly built structural element is added to the subscript-accessible data structure; and the current processing subscript is changed to the serial number of the current object, 2.
Next, the end tag of <C> is encountered and the end-tag event is triggered: the number of child elements m_count of the structural element currently being processed (subscript 2) is set to 0, and the subscript of its parent element, 1, is assigned to the subscript of the tag currently being processed.
Next, the content EndContent1 is encountered, which triggers the content event: it is judged whether the content currently encountered is beginning content or ending content. Here it is ending content, so it is assigned to the ending content of the structural element currently being processed; that is, the ending content of the structural element with subscript 1 is set to EndContent1.
Next, end tag </B> is encountered and the end-tag event is triggered: the number of child elements m_count of the structural element currently being processed (subscript 1) is set to 1, and the subscript of its parent element, 0, is assigned to the subscript of the tag currently being processed.
Next, start tag <B name="b2"> is encountered and the start-tag event is triggered: a structural element is built for the new tag; the serial number of this new object is set to 3, its distance to the parent tag is set to 3, its tag name is set to B, and its attribute is set to name="b2"; the newly built structural element is added to the subscript-accessible data structure; and the current processing subscript is changed to the serial number of the current object, 3.
Next, the content BeginContent2 is encountered, which triggers the content event: it is judged whether the content currently encountered is beginning content or ending content. Here it is beginning content, so it is assigned to the beginning content of the structural element currently being processed; that is, the beginning content of the structural element with subscript 3 is set to BeginContent2.
Processing continues in the same manner until SAX event-driven parsing finishes, yielding the complete XML document vector model shown in Fig. 6.
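The first phase can be sketched with Python's standard `xml.sax` module. This is only an illustrative sketch, not the patented implementation: the class names `StructElem` and `VectorBuilder`, the field names `begin_content` and `end_content`, and the rule for computing `m_count` at the end tag (last subscript in the vector minus the current subscript) are assumptions chosen to reproduce the values in the walkthrough above. The six-element sample document is likewise reconstructed from the subscripts 0 to 12 mentioned in the text.

```python
import xml.sax
from dataclasses import dataclass


@dataclass
class StructElem:
    m_local: int             # serial number, equal to the element's subscript
    m_distance: int          # distance back to the parent's subscript (0 for the root)
    m_name: str              # tag name
    mp_attributelist: dict   # attributes of the tag
    m_count: int = 0         # number of contained elements, set at the end tag
    begin_content: str = ""  # text before the first child (assumed field name)
    end_content: str = ""    # text after the last child (assumed field name)


class VectorBuilder(xml.sax.ContentHandler):
    def startDocument(self):
        self.vector = []   # subscript-accessible vector model
        self.current = 0   # subscript of the tag currently being processed

    def startElement(self, name, attrs):
        new = len(self.vector)
        distance = new - self.current if self.vector else 0
        self.vector.append(StructElem(new, distance, name, dict(attrs.items())))
        self.current = new

    def characters(self, content):
        text = content.strip()
        if not text or not self.vector:
            return
        elem = self.vector[self.current]
        # If elements were appended after the current one, a child has already
        # been seen, so this text is ending content; otherwise it is beginning content.
        if len(self.vector) - 1 > self.current:
            elem.end_content += text
        else:
            elem.begin_content += text

    def endElement(self, name):
        elem = self.vector[self.current]
        elem.m_count = len(self.vector) - 1 - self.current  # assumed rule
        self.current -= elem.m_distance                      # back to the parent


def build_vector(xml_text):
    handler = VectorBuilder()
    xml.sax.parseString(xml_text.encode("utf-8"), handler)
    return handler.vector


doc = '<A name="a">' + "".join(
    f'<B name="b{i}">BeginContent{i}<C name="c{i}"/>EndContent{i}</B>'
    for i in range(1, 7)
) + "</A>"
vector = build_vector(doc)
print([(e.m_local, e.m_distance, e.m_count) for e in vector[:4]])
# [(0, 0, 12), (1, 1, 1), (2, 1, 0), (3, 3, 1)]
```

Storing only the distance to the parent and the element count keeps every element a flat record, so later stages can move between parents, children and siblings with subscript arithmetic alone.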
Then the second phase begins. Following the flow shown in Fig. 3, processing of the XML document vector model starts from the structural element with subscript 0; a schematic diagram of one such processing procedure is shown in Fig. 7. Here it can be seen that the child tags <B> of tag element <A> carry no information that depends on one another, and that they can be divided by subscripts 1, 3, 5, 7, 9 and 11 into six small regions, so the child elements of tag <A> can be processed by concurrent multi-thread scheduling.
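As a minimal sketch of how those six subscripts could be derived from the vector model, the helper below walks from one child to its next sibling using only the `m_count` values. The function name `child_subscripts` is hypothetical, and the `m_count` list is written out by hand for the 13-element example, assuming `m_count` holds the number of elements contained in each element.

```python
def child_subscripts(m_count, c):
    """Return the subscripts of the direct children of element c."""
    children = []
    s = c + 1                                # first child, if any
    while c + 1 <= s <= c + m_count[c]:      # still inside c's range
        children.append(s)
        s = s + m_count[s] + 1               # jump over s's subtree to its sibling
    return children


# m_count per subscript for <A> (0), six <B> (odd) and six <C> (even) elements
m_count = [12, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0]
print(child_subscripts(m_count, 0))  # [1, 3, 5, 7, 9, 11]
```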
Therefore, in the parallel processing policy information, a parallel processing flag can be set for tag element <A> to indicate that its child tags can be processed in parallel. In step 39 of the flow of Fig. 3, if the subscript of the current structural element is c=0, that is, the subscript of element <A>, its child structural elements are judged to be processable in parallel. All child structural elements of this structural element that are siblings of one another are then found by the method of step 310, namely the child structural elements with subscripts 1, 3, 5, 7, 9 and 11. These child structural elements, together with their own child structural elements, are processed in parallel; that is, the flow of Fig. 4b is executed for each of them in parallel. For example, the child structural element with subscript 3 is processed according to Fig. 4b as follows:
In step 41, subscript 3 is taken as the current child structural element subscript s, i.e. s=3. In step 42, the current child structural element subscript 3 is taken as the subscript of the structural element currently being processed, i.e. c=s=3, and the processing context corresponding to this structural element (subscript 3) is built and used as the current processing context. In step 43, the start-stage processing of the current structural element (subscript 3) is performed in the current processing context. In step 44, the beginning content of the current structural element (subscript 3) is processed in the current processing context, and the current child structural element subscript is assigned s=c+1=4.
Step 45 is entered. Now s=4, which is greater than or equal to c+1=4 and less than or equal to c+vector[c]->m_count=4, so the flow goes to step 42: the current child structural element subscript 4 is taken as the subscript of the structural element currently being processed, i.e. c=s=4, and the processing context corresponding to this structural element (subscript 4) is built and used as the current processing context. In step 43, the start-stage processing of the current structural element (subscript 4) is performed in the current processing context. In step 44, the beginning content of the current structural element (subscript 4) is processed in the current processing context, and the current child structural element subscript is assigned s=c+1=5.
Step 45 is entered again. Now s=5, which is greater than or equal to c+1=5 but greater than c+vector[c]->m_count=4, so the flow goes to step 46, where the ending content of the current structural element (subscript 4) is processed in the current processing context; then, in step 47, the end-stage processing of the current structural element (subscript 4) is performed in the current processing context. In step 48, it is judged that the current structural element subscript (4) is not the child structural element subscript of step 41 (3), so step 49 is entered. In step 49, the current child structural element subscript is set to the subscript of the sibling of the current structural element, i.e. s=c+vector[c]->m_count+1=5, and processing returns to the processing context of the parent of the current structural element, i.e. c=c-vector[c]->m_distance=3.
Step 45 is then entered once more. Now s=5, which is greater than or equal to c+1=4 but greater than c+vector[c]->m_count=4, so the flow goes to step 46. In step 46, the ending content of the current structural element (subscript 3) is processed in the current processing context. In step 47, the end-stage processing of the current structural element (subscript 3) is performed in the current processing context.
Step 48 is entered, where it is judged that the current structural element subscript (3) is the child structural element subscript of step 41 (3), so the flow ends.
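The walkthrough of steps 41 to 49 can be checked with a short sketch. The function name `process_subtree` and the event labels are hypothetical; the actual per-stage processing and the processing contexts are replaced here by appending a label to a list, and the `m_count` and `m_distance` lists are written out by hand for the 13-element example.

```python
def process_subtree(m_count, m_distance, start):
    """Run the Fig. 4b loop for the subtree rooted at `start`; return the event order."""
    events = []
    s = start                                    # step 41
    while True:
        c = s                                    # step 42: s becomes the current element
        events.append(("start", c))              # step 43: start-stage processing
        events.append(("begin_content", c))      # step 44: beginning content
        s = c + 1
        # step 45: loop until s falls inside c's range (then descend via step 42)
        while not (c + 1 <= s <= c + m_count[c]):
            events.append(("end_content", c))    # step 46: ending content
            events.append(("end", c))            # step 47: end-stage processing
            if c == start:                       # step 48: back at the subtree root
                return events
            s = c + m_count[c] + 1               # step 49: sibling subscript
            c = c - m_distance[c]                #          return to the parent


m_count = [12, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0]
m_distance = [0, 1, 1, 3, 1, 5, 1, 7, 1, 9, 1, 11, 1]
for event in process_subtree(m_count, m_distance, 3):
    print(event)
```

For `start=3` the events appear in the order start 3, beginning content 3, start 4, beginning content 4, ending content 4, end 4, ending content 3, end 3, which matches the sequence described in the text.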
By analogy, executing the above flow for the child structural elements with subscripts 1, 3, 5, 7, 9 and 11 processes the following in parallel: the structural element with subscript 1 and its child structural element (subscript 2), the structural element with subscript 3 and its child structural element (subscript 4), the structural element with subscript 5 and its child structural element (subscript 6), the structural element with subscript 7 and its child structural element (subscript 8), the structural element with subscript 9 and its child structural element (subscript 10), and the structural element with subscript 11 and its child structural element (subscript 12). After the structural elements with subscripts 1, 3, 5, 7, 9 and 11 and their child structural elements have all been processed, the end processing is performed on the structural element with subscript 0 (tag <A>).
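The dispatch-and-wait pattern described here might look as follows with Python's `concurrent.futures`. This is a sketch under stated assumptions: `handle_subtree` is a hypothetical stand-in for the Fig. 4b flow that simply lists the subscripts it would touch, and `process_root` is a hypothetical name for the part of the Fig. 3 flow that dispatches the children and then finishes the parent.

```python
from concurrent.futures import ThreadPoolExecutor


def handle_subtree(names, m_count, b):
    # Stand-in for the Fig. 4b flow: a subtree rooted at b owns the
    # contiguous subscripts b .. b + m_count[b], so workers never overlap.
    return [f"{names[i]}@{i}" for i in range(b, b + m_count[b] + 1)]


def process_root(names, m_count, children):
    log = [f"start {names[0]}"]
    with ThreadPoolExecutor(max_workers=len(children)) as pool:
        # map() returns results in input order; leaving the with-block
        # waits until every child subtree has finished.
        results = list(pool.map(lambda b: handle_subtree(names, m_count, b), children))
    log.extend(results)
    log.append(f"end {names[0]}")   # end processing of the parent comes last
    return log


names = ["A"] + ["B", "C"] * 6
m_count = [12, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0]
log = process_root(names, m_count, [1, 3, 5, 7, 9, 11])
print(log[0], log[1], log[-1])  # start A ['B@1', 'C@2'] end A
```

Threads are used only to illustrate the scheduling; in standard CPython builds, CPU-bound work would typically need a process pool to occupy several cores at once.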
Through the second-phase processing described above, on a multi-core machine the CPU resources can be fully used to process several child tag <B> elements at the same time, which reduces the time needed to process the XML document and improves XML document processing performance.
The above are only preferred embodiments of the present invention and are not intended to limit its scope of protection.
Claims (12)
1. A method for processing an information document, characterized in that the method comprises:
A. mapping the tag information of an Extensible Markup Language (XML) document into an XML document vector model that can be accessed by subscript;
B. accessing and processing the structural elements of the XML document vector model by subscript according to parallel processing policy information, wherein structural elements that need parallel processing are processed in parallel.
2. The method according to claim 1, characterized in that, in step A, the information contained in each tag of the XML document is mapped into one structural element of the XML document vector model, and the structural element mapped from a tag contains the following information:
1) the serial number of the current tag in the XML document;
2) the distance from the current tag to the serial number of the parent tag that contains it;
3) the number of child tags contained in the current tag;
4) the tag name of the current tag;
5) the attribute information of the current tag;
6) the beginning text content contained in the current tag;
7) the ending text content contained in the current tag.
3. The method according to claim 2, characterized in that step A is performed by event-driven parsing and specifically comprises the following event handling:
Start-document event handling: creating or clearing a data structure that can be accessed by subscript to serve as the XML document vector model, and initializing the subscript of the tag currently being processed;
Start-tag event handling: building a structural element for a new tag; assigning the serial number of the new structural element, the distance from the new tag to the serial number of its parent tag, and the tag name and attributes of the new structural element; adding the newly built structural element to the subscript-accessible data structure; and changing the current processing subscript to the serial number of the current structural element;
Content event handling: judging whether the content currently encountered is beginning content or ending content; if it is beginning content, assigning it to the beginning content of the current structural element, and otherwise assigning it to the ending content of the current structural element;
End-tag event handling: assigning the number of child elements contained in the structural element currently being processed, and assigning the subscript of its parent element to the subscript of the tag currently being processed.
4. The method according to claim 1, characterized in that step B specifically comprises:
B1. taking the first structural element of the XML document vector model as the current structural element, and building the processing context corresponding to this structural element as the current processing context;
B2. performing the start-stage processing of the current structural element in the current processing context;
B3. processing the beginning content of the current structural element in the current processing context, and assigning the current child structural element subscript;
B4. judging whether the current child structural element subscript lies within the range of child structural elements of the current structural element; if so, going to step B9, and otherwise going to step B5;
B5. processing the ending content of the current structural element in the current processing context;
B6. performing the end-stage processing of the current structural element in the current processing context;
B7. judging whether the current structural element is the root structural element; if so, ending the flow, and otherwise going to step B8;
B8. setting the current child structural element subscript to the subscript of the sibling of the current structural element, returning to the processing context of the parent of the current structural element, and going to step B4;
B9. judging, according to preset parallel processing policy information, whether the child structural elements of the current structural element can be processed in parallel; if so, going to step B10, and otherwise going to step B13;
B10. traversing to find all child structural elements under the current structural element that are siblings of one another;
B11. processing all the child structural elements found in parallel;
B12. waiting until all the child structural elements have been processed in parallel, then going to step B5;
B13. handling the current processing context accordingly, taking the current child structural element as the structural element currently being processed, building the processing context corresponding to it as the current processing context, and returning to step B2.
5. The method according to claim 4, characterized in that step B10 specifically comprises:
B401. recording the current child structural element subscript;
B402. setting the current child structural element subscript to the subscript of its next sibling structural element;
B403. judging whether the current child structural element subscript lies within the range of child structural elements of the current structural element; if so, going to step B401, and otherwise executing step B404;
B404. determining the child structural element subscripts recorded in step B401 to be the subscript set of sibling child structural elements that can be processed in parallel, and ending the traversal.
6. The method according to claim 4, characterized in that the processing of each child structural element found, as described in step B11, specifically comprises:
B41. taking a child structural element subscript found in step B10 that needs parallel processing as the current child structural element subscript;
B42. taking the current child structural element subscript as the subscript of the structural element currently being processed, building the processing context corresponding to this structural element, and using it as the current processing context;
B43. performing the start-stage processing of the current structural element in the current processing context;
B44. processing the beginning content of the current structural element in the current processing context, and assigning the current child structural element subscript;
B45. judging whether the current child structural element subscript lies within the range of child structural elements of the current structural element; if so, going to step B42, and otherwise going to step B46;
B46. processing the ending content of the current structural element in the current processing context;
B47. performing the end-stage processing of the current structural element in the current processing context;
B48. judging whether the current structural element subscript is the child structural element subscript of step B41; if so, ending the flow, and otherwise going to step B49;
B49. setting the current child structural element subscript to the subscript of the sibling of the current structural element, returning to the processing context of the parent of the current structural element, and returning to step B45.
7. The method according to claim 1 or 4, characterized in that the parallel processing policy information in step B comprises: setting a parallel processing flag for each tag that needs parallel processing; if a tag has the parallel processing flag, it is determined that the sibling child tags within that tag can be processed in parallel.
8. A device for processing an information document, characterized in that the device comprises:
a first module, configured to map the tag information of an Extensible Markup Language (XML) document into an XML document vector model that can be accessed by subscript;
a second module, configured to access and process the structural elements of the XML document vector model by subscript according to parallel processing policy information, wherein structural elements that need parallel processing are processed in parallel.
9. The device according to claim 8, characterized in that the first module specifically comprises the following event handling modules:
a start-document event handling module, configured to create or clear a data structure that can be accessed by subscript to serve as the XML document vector model, and to initialize the subscript of the tag currently being processed;
a start-tag event handling module, configured to build a structural element for a new tag; to assign the serial number of the new structural element, the distance from the new tag to the serial number of its parent tag, and the tag name and attributes of the new structural element; to add the newly built structural element to the subscript-accessible data structure; and to change the current processing subscript to the serial number of the current structural element;
a content event handling module, configured to judge whether the content currently encountered is beginning content or ending content; if it is beginning content, to assign it to the beginning content of the current structural element, and otherwise to assign it to the ending content of the current structural element;
an end-tag event handling module, configured to assign the number of child elements contained in the structural element currently being processed, and to assign the subscript of its parent element to the subscript of the tag currently being processed.
10. The device according to claim 8, characterized in that the second module specifically comprises the following submodules:
a B1 submodule, configured to take the first structural element of the XML document vector model as the current structural element, and to build the processing context corresponding to this structural element as the current processing context;
a B2 submodule, configured to perform the start-stage processing of the current structural element in the current processing context;
a B3 submodule, configured to process the beginning content of the current structural element in the current processing context, and to assign the current child structural element subscript;
a B4 submodule, configured to judge whether the current child structural element subscript lies within the range of child structural elements of the current structural element; if so, control passes to the B9 submodule, and otherwise to the B5 submodule;
a B5 submodule, configured to process the ending content of the current structural element in the current processing context;
a B6 submodule, configured to perform the end-stage processing of the current structural element in the current processing context;
a B7 submodule, configured to judge whether the current structural element is the root structural element; if so, the flow ends, and otherwise control passes to the B8 submodule;
a B8 submodule, configured to set the current child structural element subscript to the subscript of the sibling of the current structural element, and to return to the processing context of the parent of the current structural element, after which control passes to the B4 submodule;
a B9 submodule, configured to judge, according to preset parallel processing policy information, whether the child structural elements of the current structural element can be processed in parallel; if so, control passes to the B10 submodule, and otherwise to the B13 submodule;
a B10 submodule, configured to traverse to find all child structural elements under the current structural element that are siblings of one another;
a B11 submodule, configured to process all the child structural elements found in parallel;
a B12 submodule, configured to wait until all the child structural elements have been processed in parallel, after which control passes to the B5 submodule;
a B13 submodule, configured to handle the current processing context accordingly, to take the current child structural element as the structural element currently being processed, and to build the processing context corresponding to it as the current processing context, after which control returns to the B2 submodule.
11. The device according to claim 10, characterized in that the B10 submodule specifically comprises:
a B401 submodule, configured to record the current child structural element subscript;
a B402 submodule, configured to set the current child structural element subscript to the subscript of its next sibling structural element;
a B403 submodule, configured to judge whether the current child structural element subscript lies within the range of child structural elements of the current structural element; if so, control passes to the B401 submodule, and otherwise to the B404 submodule;
a B404 submodule, configured to determine the child structural element subscripts recorded by the B401 submodule to be the subscript set of sibling child structural elements that can be processed in parallel, and to end the traversal.
12. The device according to claim 10, characterized in that the B11 submodule comprises:
a B41 submodule, configured to take a child structural element subscript found by the B10 submodule that needs parallel processing as the current child structural element subscript;
a B42 submodule, configured to take the current child structural element subscript as the subscript of the structural element currently being processed, to build the processing context corresponding to this structural element, and to use it as the current processing context;
a B43 submodule, configured to perform the start-stage processing of the current structural element in the current processing context;
a B44 submodule, configured to process the beginning content of the current structural element in the current processing context, and to assign the current child structural element subscript;
a B45 submodule, configured to judge whether the current child structural element subscript lies within the range of child structural elements of the current structural element; if so, control passes to the B42 submodule, and otherwise to the B46 submodule;
a B46 submodule, configured to process the ending content of the current structural element in the current processing context;
a B47 submodule, configured to perform the end-stage processing of the current structural element in the current processing context;
a B48 submodule, configured to judge whether the current structural element subscript is the child structural element subscript of the B41 submodule; if so, the flow ends, and otherwise control passes to the B49 submodule;
a B49 submodule, configured to set the current child structural element subscript to the subscript of the sibling of the current structural element, and to return to the processing context of the parent of the current structural element, after which control passes to the B45 submodule.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN2010105198699A CN102004631A (en) | 2010-10-19 | 2010-10-19 | Method and device for processing information document |
Publications (1)
Publication Number | Publication Date |
---|---|
CN102004631A true CN102004631A (en) | 2011-04-06 |
Family
ID=43812015
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN2010105198699A Pending CN102004631A (en) | 2010-10-19 | 2010-10-19 | Method and device for processing information document |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN102004631A (en) |
Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN1456990A (en) * | 2002-05-09 | 2003-11-19 | 日本电气株式会社 | Applied program parallel processing system and method |
CN1575464A (en) * | 1999-06-18 | 2005-02-02 | 奔流系统公司 | Segmentation and processing of continuous data streams using transactional semantics |
US20060015816A1 (en) * | 2004-07-14 | 2006-01-19 | International Business Machines Corporation | Framework for development and customization of web services deployment descriptors |
CN1825306A (en) * | 2005-10-31 | 2006-08-30 | 北京神舟航天软件技术有限公司 | XML data storage and access method based on relational database |
CN101329665A (en) * | 2007-06-18 | 2008-12-24 | 国际商业机器公司 | Method for analyzing marking language document and analyzer |
CN101350007A (en) * | 2007-06-26 | 2009-01-21 | 英特尔公司 | Method and apparatus for parallel XSL transformation with low contention and load balancing |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
AD01 | Patent right deemed abandoned | Effective date of abandoning: 20110406 |
C20 | Patent right or utility model deemed to be abandoned or is abandoned |