GB2417349A - Digital-content extraction using multiple algorithms; adding and rating new ones - Google Patents
Digital-content extraction using multiple algorithms; adding and rating new ones
- Publication number
- GB2417349A (application GB0523074A)
- Authority
- GB
- United Kingdom
- Prior art keywords
- data
- digital
- content
- algorithms
- extraction
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Withdrawn
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/40—Information retrieval; Database structures therefor; File system structures therefor of multimedia data, e.g. slideshows comprising image and additional audio data
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/907—Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
-
- G06F17/30017—
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Databases & Information Systems (AREA)
- Data Mining & Analysis (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Library & Information Science (AREA)
- Multimedia (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
A digital-content extractor (214) comprises a data-acquisition device (12) configured to generate a digital representation of a source, and a data-extraction engine (330) communicatively coupled to the data-acquisition device (12), the data-extraction engine (330) configured to apply a combination of a plurality of digital-content extraction algorithms (332) over the source, wherein the data-extraction engine (330) is configured to automatically accommodate new data-extraction algorithms (315). A method for improving the accuracy of extracted digital content comprises reading a digital source (402, FIG. 4), identifying the digital source by type (404), generating an acceptance level for each of a plurality of digital-content extraction algorithms based on a confidence value and a credibility rating associated with the accuracy of each of the plurality of digital-content extraction algorithms (410), and applying a combination of at least two of the plurality of digital-content extraction algorithms based on the acceptance level to thereby generate extracted digital content of the digital source (412).
Description
SYSTEMS AND METHODS FOR IMPROVED ACCURACY OF EXTRACTED DIGITAL CONTENT
TECHNICAL FIELD
[0001] The present disclosure generally relates to systems and methods for generating data from a digital information source. More particularly, the invention relates to systems and methods for improving the accuracy of extracted digital content.
BACKGROUND OF THE INVENTION
[0002] Digital-content extraction (DCE) is a catch phrase that encompasses the concept of deriving useful data (e.g., metadata) from a digital source. A digital source can be any of a variety of digital media, including but not limited to voice (i.e., speech), music, and other auditory data; images, including film and other two-dimensional data images; three-dimensional graphics; and the like.
[0003] Metadata is data about data. Metadata may describe how, when, and sometimes by whom, a particular set of data was collected, how the data is formatted, etc. Metadata is essential for understanding information stored in data warehouses. Metadata is used by search engines to locate pertinent data related to search terms and/or other descriptors used to describe or characterize the underlying content.
[0004] There are numerous algorithms that can be used for extracting content from documents. Many of these are public domain, available on the Internet at various university, commercial, and even personal Web sites. Many algorithms designed to perform digital content extractions are proprietary. The following are representative examples of DCE algorithms: a) speech recognition algorithms; b) optical character recognition (OCR), or text recognition, algorithms; c) page/document analysis algorithms; d) forms recognition packages; e) document template matching algorithms; f) search engines, semantic-based and otherwise, including Web spiders and "bots" (i.e., robots); and g) intelligent agents (e.g., expert systems).
[0005] A variety of highly developed, and therefore high-value, algorithms exist to resolve issues related to specific DCE problems. Intuitively, one ought to be able to combine the results from select data-extraction algorithms to improve the performance (i.e., the accuracy) of the resulting metadata. However, programmatic application of these algorithms is piecemeal. Consequently, the results often offer no improvement to an end user. For example, the combination of two or more OCR engines using a "voting scheme" or other simple combination mechanism often results in little or no improvement in performance. In some situations, DCE algorithm combination methodologies may even result in a decrease in performance when one compares the results of the algorithms separately executed over the data (i.e., a printed page) with the results from the combined algorithm. Conventional DCE algorithm combinations are often limited due to the nature of their designs.
SUMMARY OF THE INVENTION
[0006] An embodiment of a digital-content extractor comprises a data-acquisition device configured to generate a digital representation of a source, and a data-extraction engine communicatively coupled to the data-acquisition device, the data-extraction engine configured to apply a combination of a plurality of digital-content extraction algorithms over the source, wherein the data-extraction engine is configured to accommodate new data-extraction algorithms.
[0007] An embodiment of a method for improving the accuracy of extracted digital content comprises reading a digital source, identifying the digital source by type, generating an acceptance level for each of a plurality of digital-content extraction algorithms based on a confidence value and a credibility rating associated with the accuracy of each of the plurality of digital-content extraction algorithms, and applying a combination of at least two of the plurality of digital-content extraction algorithms based on the acceptance level to thereby generate extracted digital content of the digital source.
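The method summarized above can be illustrated with a minimal Python sketch. The disclosure does not state how the confidence value and credibility rating are combined, so the product used here is purely an assumption, as are the names `Candidate`, `acceptance_level`, and `combine`, and the weighted-vote combination step.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    name: str           # hypothetical label for one DCE algorithm
    confidence: float   # confidence the algorithm reports for this source, 0..1
    credibility: float  # rating earned against ground-truthed media, 0..1
    output: str         # content the algorithm extracted


def acceptance_level(c: Candidate) -> float:
    # Assumption: a simple product. The source gives no formula.
    return c.confidence * c.credibility


def combine(candidates, keep=2):
    """Keep the `keep` highest-acceptance algorithms and return the output
    whose supporters have the largest summed acceptance level."""
    if keep < 2 or len(candidates) < keep:
        raise ValueError("need at least two algorithms to combine")
    ranked = sorted(candidates, key=acceptance_level, reverse=True)[:keep]
    totals = {}
    for c in ranked:
        totals[c.output] = totals.get(c.output, 0.0) + acceptance_level(c)
    return max(totals, key=totals.get)
```

With three candidates, two mid-rated algorithms that agree can outweigh a single higher-rated one, which is one way the combination can differ from simply trusting the best individual engine.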
BRIEF DESCRIPTION OF THE DRAWINGS
[0008] Systems and methods for improving the accuracy of extracted digital content are illustrated by way of example and not limited by the implementations in the following drawings. The components in the drawings are not necessarily to scale; emphasis instead is placed upon clearly illustrating the principles of the present invention. Moreover, in the drawings, like reference numerals designate corresponding parts throughout the several views.
[0009] FIG. 1 is a schematic diagram illustrating a possible operational environment for embodiments of a data assessment system according to the present invention.
[0010] FIG. 2 is a functional block diagram of the computing device of FIG. 1.
[0011] FIG. 3 is a functional block diagram of an embodiment of an intelligent digital content extractor operable on the computing device of FIG. 2 according to the present invention.
[0012] FIG. 4 is a flow chart illustrating a method for improving the accuracy of extracted digital content that may be realized by the intelligent digital content extractor of FIG. 3.
[0013] FIG. 5 is a flow chart illustrating an embodiment of a method for generating an optimal interpretation of a particular aspect of a source document leading to the production of metadata that may be realized by the intelligent digital content extractor of FIG. 3.
[0014] FIG. 6 is a flow chart illustrating an embodiment of a method for integrating a digital-content extraction algorithm in the intelligent digital content extractor of FIG. 3.
DETAILED DESCRIPTION
[0015] An improved data assessment system having been summarized above, reference will now be made in detail to the description of the invention as illustrated in the drawings. For clarity of presentation, the data assessment system and an embodiment of the underlying intelligent digital content extractor (IDCE) will be exemplified and described with focus on the generation of useful data from a two-dimensional digital source or "document." A document can be obtained from an image acquisition device such as a scanner or a digital camera, or read into memory from a data storage device (e.g., in the form of a file).
[0016] Embodiments of the IDCE rely on several levels of data extraction sophistication, a broad set of intellect "elements," and the ability to compare and contrast information across each of these levels. Each resulting network of digital content extraction algorithms can, in essence, think for itself, thus providing an automatic assessment capability that allows the IDCE to continue improving its data extraction capabilities.
[0017] Turning now to the drawings, wherein like-referenced numerals designate corresponding parts throughout the drawings, reference is made to FIG. 1, which illustrates a schematic of an exemplary operational environment suited for a data assessment system. In this regard, a data assessment system is generally denoted by reference numeral 10 and may include a computing device 16 communicatively coupled with a scanner 17 and a local data storage device 18. As further illustrated in the schematic of FIG. 1, the data assessment system may include a remotely located data-acquisition device 12 and a remote data storage device 14 associated with the computing device 16 via local area network (LAN)/wide area network (WAN) 15.
[0018] The data assessment system 10 includes at least one data-acquisition device 12 (e.g., scanner 17) communicatively coupled with the computing device 16. In this regard, the data-acquisition device 12 can be any device capable of generating a digital representation of a source document. While the computing device 16 is associated with the scanner 17 in the illustration of FIG. 1, it should be appreciated that there are a host of image acquisition devices that may be communicatively coupled with the computing device 16 in order to transfer a digital representation of a document to the computing device 16. For example, the image acquisition device could be a digital camera, a video camera, a portable (i.e., handheld) scanner, etc. In other embodiments, the underlying source data can take other forms than a two-dimensional document. For example, in some cases, the data may take the form of an audio recording (e.g., speech, music, and other auditory data); images, including film and other two-dimensional data images; three-dimensional graphics; and the like.
[0019] The network 15 can be any local area network (LAN) or wide area network (WAN). When the network 15 is configured as a LAN, the LAN could be configured as a ring network, a bus network, and/or a wireless local network. When the network 15 takes the form of a WAN, the WAN could be the public-switched telephone network, a proprietary network, and/or the public access WAN commonly known as the Internet.
[0020] Regardless of the actual network used in particular embodiments, data can be exchanged over the network 15 using various communication protocols. For example, transmission control protocol/Internet protocol (TCP/IP) may be used if the network 15 is the Internet. Proprietary image data communication protocols may be used when the network 15 is a proprietary LAN or WAN. While the data assessment system 10 is illustrated in FIG. 1 in connection with the network-coupled data-acquisition device 12 and data storage device 14, the data assessment system 10 is not dependent upon network connectivity.
[0021] Those skilled in the art will appreciate that various portions of the data assessment system 10 can be implemented in hardware, software, firmware, or combinations thereof. In a preferred embodiment, the data assessment system 10 is implemented using a combination of hardware and software or firmware that is stored in memory and executed by a suitable instruction execution system. If implemented solely in hardware, as in an alternative embodiment, the data assessment system 10 can be implemented with any or a combination of technologies which are well-known in the art (e.g., discrete logic circuits, application specific integrated circuits (ASICs), programmable gate arrays (PGAs), field programmable gate arrays (FPGAs), etc.), or technologies later developed.
[0022] In a preferred embodiment, the data assessment system 10 is implemented via the combination of a computing device 16, a scanner 17, and a local data storage device 18. In this regard, local data storage device 18 can be an internal hard-disk drive, a magnetic tape drive, a compact-disk drive, and/or other data storage devices now known or later developed that can be made operable with computing device 16. In some embodiments, software instructions and/or data associated with the intelligent digital content extractor (IDCE) may be distributed across several of the above-mentioned data storage devices.
[0023] In a preferred embodiment, the IDCE is implemented in a combination of software and data executed and stored under the control of a computing processor. It should be noted, however, that the IDCE is not dependent upon the nature of the underlying computer in order to accomplish designated functions.
[0024] Reference is now directed to FIG. 2, which illustrates a functional block diagram of the computing device 16 of FIG. 1. Generally, in terms of hardware architecture, as shown in FIG. 2, the computing device 16 may include a processor 200, memory 210, data acquisition interface(s) 230, input/output device interface(s) 240, and LAN/WAN interface(s) 250 that are communicatively coupled via local interface 220. The local interface 220 can be, for example but not limited to, one or more buses or other wired or wireless connections, as is known in the art or may be later developed. The local interface 220 may have additional elements, which are omitted for simplicity, such as controllers, buffers (caches), drivers, repeaters, and receivers, to enable communications. Further, the local interface may include address, control, and/or data connections to enable appropriate communications among the aforementioned components.
[0025] In the embodiment of FIG. 2, the processor 200 is a hardware device for executing software that can be stored in memory 210. The processor 200 can be any custom-made or commercially-available processor, a central processing unit (CPU) or an auxiliary processor among several processors associated with the computing device 16, and a semiconductor-based microprocessor (in the form of a microchip) or a macroprocessor.
[0026] The memory 210 can include any one or combination of volatile memory elements (e.g., random access memory (RAM, such as dynamic RAM or DRAM, static RAM or SRAM, etc.)) and nonvolatile memory elements (e.g., read-only memory (ROM), hard drives, tape drives, compact discs (CD-ROM), etc.). Moreover, the memory 210 may incorporate electronic, magnetic, optical, and/or other types of storage media now known or later developed. Note that the memory 210 can have a distributed architecture, where various components are situated remote from one another, but can be accessed by processor 200.
[0027] The software in memory 210 may include one or more separate programs, each of which comprises an ordered listing of executable instructions for implementing logical functions. In the example of FIG. 2, the software in the memory 210 includes IDCE 214 that functions as a result of and in accordance with operating system 212. The operating system 212 preferably controls the execution of other computer programs, such as the intelligent digital content extractor (IDCE) 214, and provides scheduling, input-output control, file and data management, memory management, and communication control and related services.
[0028] In a preferred embodiment, IDCE 214 is one or more source programs, executable programs (object code), scripts, or other collections each comprising a set of instructions to be performed. It will be well understood by one skilled in the art, after having become familiar with the teachings of the invention, that IDCE 214 may be written in a number of programming languages now known or later developed.
[0029] The input/output device interface(s) 240 may take the form of human/machine device interfaces for communicating via various devices, such as but not limited to, a keyboard, a mouse or other suitable pointing device, a microphone, etc. Furthermore, the input/output device interface(s) 240 may also include known or later developed output devices, for example but not limited to, a printer, a monitor, an external speaker, etc.
[0030] LAN/WAN interface(s) 250 may include a host of devices that may establish one or more communication sessions between the computing device 16 and LAN/WAN 15 (FIG. 1). LAN/WAN interface(s) 250 may include, but are not limited to, a modulator/demodulator or modem (for accessing another device, system, or network); a radio frequency (RF) or other transceiver; a telephonic interface; a bridge; an optical interface; a router; etc. For simplicity of illustration and explanation, these aforementioned two-way communication devices are not shown.
[0031] When the computing device 16 is in operation, the processor 200 is configured to execute software stored within the memory 210, to communicate data to and from the memory 210, and to generally control operations of the computing device 16 pursuant to the software. The IDCE 214 and the operating system 212, in whole or in part, but typically the latter, are read by the processor 200, perhaps buffered within the processor 200, and then executed.
[0032] The IDCE 214 can be embodied in any computer-readable medium for use by or in connection with an instruction execution system, apparatus, or device, such as a computer-based system, processor-containing system, or other system that can fetch the instructions from the instruction execution system, apparatus, or device, and execute the instructions. In the context of this disclosure, a "computer-readable medium" can be any means that can store, communicate, propagate, or transport a program for use by or in connection with the instruction execution system, apparatus, or device. The computer-readable medium can be, for example but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, device, or propagation medium now known or later developed. Note that the computer-readable medium could even be paper or another suitable medium upon which the program is printed, as the program can be electronically captured, via for instance optical scanning of the paper or other medium, then compiled, interpreted or otherwise processed in a suitable manner if necessary, and then stored in a computer memory.
[0033] Reference is now directed to FIG. 3, which presents an embodiment of a functional block diagram of IDCE 214. As illustrated in FIG. 3, the IDCE 214 may comprise a user interface 320 and a data-extraction engine 330. IDCE 214 may receive data via various data input devices 310. When the input data originates from a printed document, the input device 310 may take the form of a scanner, such as the flatbed scanner 17 of FIG. 1. The scanner 17 may be used to acquire a digital representation of the printed document that is communicated to the data-extraction engine 330.
[0034] As further illustrated in the functional block diagram of FIG. 3, the data-extraction engine 330 may comprise a data discriminator 331, a plurality of DCE algorithms 332, an algorithm accuracy recorder 336, a statistical comparator 337, a key information identifier 338, and logic 400. Furthermore, the data-extraction engine 330 records various data values or scores based on interim processing performed by the data discriminator 331, the DCE algorithms 332, the statistical comparator 337, and logic 400. For example, the data-extraction engine 330 records ground-truthing (GT) correlation data 333, categorization data 334, and acceptance level data values 335. Logic 400 coordinates data distribution to each of the various functional algorithms. Logic 400 also coordinates inter-algorithm processing and data transfers both between the data-extraction engine 330 and external devices (e.g., input devices 310) and between the various internal functional algorithms (e.g., the data discriminator 331, the DCE algorithms 332, the statistical comparator 337, and the like) and the various data types (e.g., the GT correlation data 333, the categorization data 334, the acceptance level 335, and the like).
[0035] The functional block diagram of FIG. 3 further illustrates that the data-extraction engine 330 may generate an optimized digital content extraction result 340 that may be forwarded to one or more output devices 350 to convey various data extraction results 355 to an operator of the IDCE 214.
[0036] To effectively communicate between the various DCE algorithms 332, logic 400 is configured to accept and process a set of common data-interchange standards. The data-interchange standards provide a framework of recognizable data types that each of the DCE algorithms 332 may use to define a data source (e.g., a document). These standards can include standards for zoning, layout, data and/or document type, and text standards, among others. Note that the data-interchange standards employed between a plurality of DCE algorithms 332 may vary depending on the specific DCE algorithms 332 that are communicating underlying document data.
[0037] Zoning is the classification and segmentation of various regions that may together comprise a data source. Various regions of a document may comprise areas containing text, photos, and specialized graphics such as a border or the like. In the case of a "scanned" magazine article, a single page may contain some or all of the aforementioned features. In order to accurately identify and classify the underlying data content, the various DCE algorithms 332 should be appropriately matched to portions of the data. In this regard, zoning is a method for targeting the application of the various DCE algorithms 332 over portions or segments of the underlying digital data where required. Electronically-formatted data such as .html, .xml, .doc and .pdf files, for example, should not require zoning. However, even fully electronically generated documents may benefit from zoning for repurposing of their content for other domains (e.g., PDF to DHTML/HTML/XML/XSLT, etc.).
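As a rough illustration of matching algorithms to zones, the following Python sketch routes each zoned region to a handler chosen by region type. The `Zone` record, the handler functions, and the `HANDLERS` table are hypothetical stand-ins introduced here, not elements named in the disclosure.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Tuple


@dataclass
class Zone:
    region_type: str                 # e.g. "text", "photo", "graphic"
    bbox: Tuple[int, int, int, int]  # left, top, right, bottom (assumed pixels)


def extract_text(zone: Zone) -> str:
    # Stand-in for an OCR-style algorithm applied to a text region.
    return f"text@{zone.bbox}"


def describe_photo(zone: Zone) -> str:
    # Stand-in for an image-analysis algorithm applied to a photo region.
    return f"photo@{zone.bbox}"


HANDLERS: Dict[str, Callable[[Zone], str]] = {
    "text": extract_text,
    "photo": describe_photo,
}


def dispatch(zones: List[Zone], handlers=HANDLERS) -> List[Tuple[Zone, Optional[str]]]:
    """Apply the handler matched to each zone; zones without a match yield None."""
    results = []
    for zone in zones:
        handler = handlers.get(zone.region_type)
        results.append((zone, handler(zone) if handler else None))
    return results
```

Regions with no matching handler, such as a decorative border, are passed through with a `None` result rather than being forced through an ill-suited algorithm.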
[0038] Layout can be described as the relative relationship between the underlying data. For example, in the context of a document, layout may include information reflective of such features as articles, columns within articles, titles separating articles, sub-titles separating portions of an article, and the like.
[0039] Data type can include a classification of the media upon which the acquired digital data originated. By way of example, digital documents may have been scanned or otherwise acquired from various media types, such as a "magazine page," a "slide," a "transparency," etc. It should be appreciated that information reflective of the media type may be used to select a particular DCE algorithm 332 that is well suited for extracting digital content from that particular media type. In other cases, it may be possible to fine tune or otherwise adjust a DCE algorithm 332 in order to achieve more accurate results.
[0040] Text standards can include optical character recognition (OCR), synopses, grammar tagging, language identification, purpose of the text (e.g., photo credit, title, caption, etc.), text formatting, translation into other languages, and the like. Many of these standards exist already in public formats, such as HTML for rendering of text on web pages, PDF for rendering of pages to the screen and printers, DOC for rendering Microsoft Word documents, etc. However, the IDCE 214 herein described may use an abstract set of text-based standards that are independent of any particular format.
[0041] By using an abstract set of data-interchange standards, the IDCE 214 enables any algorithm that is useful in one of these areas (zoning, layout, document typing, and text), or in a subset of one of these areas, to interact in a cooperative-yet-competitive fashion with other DCE algorithms 332 populating the same set of abstract interchange data (e.g., ground-truthing correlation data 333, categorization data 334, and acceptance level 335). Looking back to the data assessment system 10 illustrated in FIG. 1, it should be appreciated that the DCE algorithms 332 and the various other elements of the data-extraction engine 330 may be stored and operative on a single computing device or distributed among several memory devices under the coordination of a computing device.
[0042] Moreover, various information, such as but not limited to, the ground-truthing correlation data 333, the categorization data 334, the acceptance levels 335, and data in an algorithm accuracy recorder 336 illustrated in the functional block diagram of FIG. 3, may form a data-extraction engine knowledge base 339. Regardless of the actual implementation, the data-extraction engine knowledge base 339 contains the information that logic 400 uses to select and combine various DCE algorithms 332 to reach a data extraction result with improved accuracy.
[0043] In alternative embodiments, e.g., when the source data takes the form of an audio file, the data-interchange standards described above may be replaced in their entirety by a set of appropriate data-interchange standards suited for characterizing digital audio data rather than digital representations of print media. Other data-interchange standards may be selected for specific types of image-based data (photos, film, graphics, etc.). Regardless of the underlying media and the data-interchange standards selected, in order for two or more DCE algorithms 332 and/or other portions of the data-extraction engine 330 to interface, the data-interchange standard selected preferably subscribes to at least one element that is commonly used by both algorithms.
[0044] As also illustrated in the functional block diagram of FIG. 3, the IDCE 214 may integrate new extraction algorithms 315 for use in the data-extraction engine 330. In this regard, the IDCE 214 may automatically accommodate new DCE algorithms 315 as they become available to the IDCE 214. For the purposes of this disclosure, "accommodate" is defined to encompass one or more of at least the following features: a) the data-extraction engine 330 is configured such that new extraction algorithms 315 can subscribe to any subsets of the overall set of metadata that can be created; b) the data-extraction engine 330 can automatically compare the accuracy of any new extraction algorithms 315 to existing DCE algorithms 332 for any digital source; c) the data-extraction engine 330 is configured to accept and apply metrics describing a particular new extraction algorithm's performance (e.g., absolute and comparative) as new data enters the system; d) the data-extraction engine 330 can integrate each new extraction algorithm 315 into the IDCE 214 without affecting any of the DCE algorithms 332 already in the system.
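Features a) through d) above can be pictured as a small registry. The following Python sketch is only an illustration under stated assumptions: the class `ExtractionEngine`, its method names, and the use of a mean per-document score as the accuracy metric are all hypothetical and do not come from the disclosure.

```python
class ExtractionEngine:
    """Toy registry: algorithms subscribe to metadata fields and are
    ranked by recorded accuracy, without touching existing entries."""

    def __init__(self):
        self._algorithms = {}  # name -> (subscribed fields, callable)
        self._scores = {}      # name -> list of per-document accuracy scores

    def register(self, name, fields, func):
        # Feature d): adding a new algorithm never modifies existing ones.
        if name in self._algorithms:
            raise ValueError(f"algorithm {name!r} is already registered")
        self._algorithms[name] = (frozenset(fields), func)
        self._scores[name] = []

    def subscribers(self, field):
        # Feature a): an algorithm subscribes to any subset of metadata fields.
        return sorted(n for n, (f, _) in self._algorithms.items() if field in f)

    def record_accuracy(self, name, score):
        # Feature c): metrics accumulate as new data enters the system.
        self._scores[name].append(score)

    def mean_accuracy(self, name):
        scores = self._scores[name]
        return sum(scores) / len(scores) if scores else None

    def ranking(self):
        # Feature b): compare new and existing algorithms on recorded accuracy.
        rated = [(n, self.mean_accuracy(n)) for n in self._algorithms if self._scores[n]]
        return [n for n, _ in sorted(rated, key=lambda p: p[1], reverse=True)]
```

An algorithm with no recorded scores is left out of the ranking rather than being assigned a default, so a newly added algorithm cannot displace an established one until it has been measured.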
[0045] While the functional block diagram presented in FIG. 3 illustrates an IDCE 214 having a single centrally-located data-extraction engine 330 with co-located logic 400 and functional elements, it should be appreciated that the various functional elements of the IDCE 214 may be distributed across multiple locations (e.g., with J2EE, .NET, enterprise Java beans, or other distributed computing technology). For example, various DCE algorithms 332 can exist in different locations, on different servers, on different operating systems, and in different computing environments because of the flexibility provided in that they interact via only common interchange data.
[0046] Because the highest levels of the interchange standards are concerned with the synopses (i.e., abstracts) of different documents and the correlation and interaction between documents, random queries, based on key phrases or other information extracted and/or generated in response to the documents, can be run against the knowledge base in automated attempts to formulate new relationships among the data. In turn, these newfound relationships may be recorded, tested, and, where proven accurate, reflected in updates to the knowledge base of the IDCE 214. In this way, the IDCE 214 may continuously improve or "learn" over time.
[0047] The IDCE 214 may also generate new information via the use of coordinated searches for new correlations among documents. For example, related information in documents that are otherwise unrelated can be cross-correlated without the manual instantiation of a query or "search." Coordinated searches could be triggered periodically based on time, date, the number of documents processed since the last cross-correlation check, or some other initiating criteria. Recently processed documents could be analyzed for key words, phrases, or other data. The key words, phrases, or other data could be used in a comparison with previously-processed documents. Any discovered matches result in a cross-correlation link between the source documents. Such correlations are stored within the IDCE system as invisible links (as opposed to visible links such as hyperlinks), or associations that exist but are not visible to the user.
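The cross-correlation check described above might be sketched as follows in Python. The key-term heuristic (lowercased words of at least `min_len` characters) and the `min_shared` threshold are assumptions made for illustration; the disclosure does not say how key words or phrases are chosen.

```python
from itertools import combinations


def key_terms(text, min_len=5):
    """Crude key-word extraction: lowercase words of at least min_len letters."""
    words = (w.strip(".,;:!?\"'()").lower() for w in text.split())
    return {w for w in words if len(w) >= min_len}


def cross_correlate(documents, min_shared=2):
    """documents maps a document id to its text. Returns a dict mapping each
    linked pair of ids to the sorted key terms the two documents share."""
    terms = {doc_id: key_terms(text) for doc_id, text in documents.items()}
    links = {}
    for a, b in combinations(sorted(terms), 2):
        shared = terms[a] & terms[b]
        if len(shared) >= min_shared:
            links[(a, b)] = sorted(shared)
    return links
```

The returned mapping plays the role of the "invisible links": it records associations between documents without altering the documents themselves.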
Data-Extraction Engine Operation
[0048] The IDCE 214 has several levels of interaction; each of the levels is scalable, easily updated, and incrementally improved over time as each subsequent document is added to the knowledge base. The various levels of interaction include the following:
Ground-Truthing
[0049] An initial pool of representative digital media are hand-analyzed and "proofed" to obtain fully "ground-truthed" representations. Ground-truthing is the manual analysis that results in a highly accurate description of the interchange data for a particular document. The primary purpose of ground-truthing is to determine baseline data for comparing algorithm-generated accuracy reporting statistics to establish accurate comparisons of the effectiveness of DCE algorithms 332. Ground-truthing data may include but are not limited by the following:
(a) Zoning: Zoning information that may be readily obtained from the user interface during ground-truthing includes the region boundary (polygonal), page boundary (which provides border and margin information), the region type (text, photo, drawing, table, etc.), region skew, orientation, z-order, and color content.
(b) Layout: Layout elements may include groupings (articles, associated regions such as business graphics and photo credits, etc.), columns, headings, reading order, and a few specific types of text (e.g., address, signature, list, etc., where possible). Abstracts and non-text-region associated text (text written over another region, like a photo or film frame) may prove useful in layout ground-truthing as well.
(c) Document Typing: Where possible, the document will be tagged as a specific type of document from a list that may include types such as "photo," "transparency," "journal article," etc. Typing may further include subcategories, for example, a color photo, a black and white photo, a glossy-finished photo, etc., as may prove useful.
(d) Text: The language and individual words, lines, and paragraphs of text may be identified by OCR and/or other methods and manual inspection of the OCR results. Synopses, outlines, abstracts, and the like may be checked for accuracy. Where possible, grammar tags and translations will be ground-truthed.
Formatting (e.g., font family, style, etc.) may be eliminated from the ground-truth for text, as text formatting is a presentational issue important for final rendering.
[0050] Note that the relative usefulness of each of these ground-truthing data can be assessed by principal component analysis of the correlation matrices obtained for the correlation of algorithms with ground-truth results. In this way, non-useful correlates can be dropped and useful correlates that are clustered can be represented by a single correlation.
[0051] Ground-truth is an absolute measure of DCE algorithm 332 accuracy and effectiveness. It is, however, a manual process, and as such is expensive, poorly scalable, and may suffer value degradation as the number of documents in the corpus or database grows, and as the number of sub-categories grows.
[0052] Ground-truthing establishes a baseline performance statistic, as well as a credibility rating for the DCE algorithms 332, as described below. DCE algorithms 332 subscribing to a set of data-interchange standards may be tested against fully ground-truthed media to see how well they perform. They may also be rated for the sub-categories of media types, as described in the following section.
Categorization
[0053] Categorization or identification of the digital-media types is a useful step in the selective application/generation of an improved digital-content extraction. The utility of ground-truthing (see above), performance statistics, and credibility ratings (see below) is enhanced when the overall set of digital media is subdivided or pre-categorized. Some pre-categorization can be done based on the media type (e.g., file extension, hardware source, etc.) via the data discriminator 331.
[0054] Sub-categorization may be performed within the data-extraction engine 330 for refinement of scope. Digital media can be sub-categorized based on their media type, their classification/segmentation characteristics, their layout, etc. Even simple classification, segmentation, layout, etc., schemes can be used for this sub-categorization. An example is the use of a simple zoning algorithm that consists solely of a non-overlapping ("Manhattan layout") segmentation algorithm ("segmenter"), a "text" vs. "non-text solid" vs. "non-text non-solid" region classifier, and a simple column/title layout scheme. While such a simple zoning/layout algorithm is not generally very useful for extracting metadata from digital documents, it is useful in sub-categorization. The embodiment of an IDCE 214 described herein uses such a "reduced" or "partial" zoning+layout scheme to sub-categorize incoming documents, in addition to the media-format typing described above.
[0055] Further sub-categorization can be achieved using simple relative document classification schemes such as a document clustering scheme, neural network classification, super-threshold pixel centroids and moments, and/or other public-domain techniques. The data discriminator 331 may also perform these and other sub-categorization or sorting operations.
[0056] Applicable document-clustering schemes include but are not limited to thresholding, smearing, region-distribution profiling, etc. These and other sub-categorization techniques allow the refinement of the statistics described below. For example, a certain layout algorithm may perform well on journal articles but poorly on magazine articles, the two of which are unlikely to be clustered together. The specific layout algorithm will therefore have higher performance and credibility statistics generated for its "journal article" sub-category than for its "magazine article" sub-category.
[0057] It should be appreciated that the data discriminator 331 enables the automatic localization of the various DCE algorithms 332 designed to extract information from specific data sources, thus prohibiting the application of a DCE algorithm 332 designed to extract information from an audio recording to a data source identified as a printed document. Consequently, the IDCE 214 may apply DCE algorithms 332 designed to extract information from a printed document only to appropriate data sources.
[0058] The DCE algorithms 332 may be readily adapted and applied to documents of any language. There are no language-specific limitations. However, in the case of OCR data extractors, it is preferred to match the printed language with the language of the OCR engine. This can be accomplished by finding the highest percentage of matched words to dictionaries for each of the languages in the set, or by other methods.
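The dictionary-matching idea for choosing an OCR language can be sketched as follows. This is a minimal illustration, not the patented implementation: the function names and the toy dictionaries are assumptions introduced here, and a real system would use full word lists.

```python
def match_fraction(words, dictionary):
    """Fraction of words found in the dictionary (case-insensitive)."""
    if not words:
        return 0.0
    hits = sum(1 for word in words if word.lower() in dictionary)
    return hits / len(words)


def pick_ocr_language(words, dictionaries):
    """Return the language whose dictionary matches the highest share of words."""
    return max(dictionaries, key=lambda lang: match_fraction(words, dictionaries[lang]))


# Toy dictionaries (hypothetical, for illustration only).
DICTIONARIES = {
    "en": {"the", "keystone", "of", "arch"},
    "de": {"der", "schlussstein", "des", "bogens"},
}

if __name__ == "__main__":
    print(pick_ocr_language("The keystone of the arch".split(), DICTIONARIES))
```

Ties are resolved by dictionary order here; the text leaves the tie-breaking rule open.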
Published Performance Statistics
[0059] The data-extraction engine 330 is constructed to post a confidence statistic for each DCE algorithm 332. This statistical baseline for performance can be described as a p-value [p range 0 to 1], where p=1.00 implies that the algorithm is 100% confident in its results. DCE algorithms 332 that are (a) not public domain, (b) not readily retrofitted to generate such statistics, or (c) innately poor at comparing their results for different cases, can be assigned a default p-value (e.g., a default p-value of 0.50 is suggested, but any value greater than zero and less than or equal to 1.00 will suffice). It should be appreciated that the posted confidence statistic for each particular DCE algorithm 332 may be specific to each category and/or sub-category.
Consequently, a plurality of posted confidence statistics may be applicable for each DCE algorithm 332. Regardless of the specific number of posted confidence statistic values associated with each particular DCE algorithm 332, logic 400 may apply the appropriate statistic as indicated by the data discriminator 331.
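One way to picture the per-category lookup of posted confidence statistics is a table keyed by algorithm, category, and sub-category, with the suggested default of 0.50 as a fallback. The key layout, the fallback order, and all names below are assumptions made for illustration.

```python
DEFAULT_P_VALUE = 0.50  # default suggested in the text for algorithms that post nothing


def posted_confidence(published, algorithm, category, sub_category=None):
    """Return the most specific posted p-value, falling back to the default."""
    lookup_order = [
        (algorithm, category, sub_category),  # sub-category statistic, if posted
        (algorithm, category, None),          # category-wide statistic
    ]
    for key in lookup_order:
        if key in published:
            return published[key]
    return DEFAULT_P_VALUE


# Hypothetical posted statistics for one layout algorithm.
PUBLISHED = {
    ("layout_a", "printed_document", "journal_article"): 0.90,
    ("layout_a", "printed_document", None): 0.70,
}

if __name__ == "__main__":
    print(posted_confidence(PUBLISHED, "layout_a", "printed_document", "magazine_article"))
```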
Credibility Ratings
[0060] Sophisticated DCE algorithms 332 will have the ability to assess their "published statistics" or p-value in light of each new media instance (e.g., for each new document). Less sophisticated DCE algorithms 332, as described in the preceding section, will have the same published statistics irrespective of the document. Unfortunately, a poorly-characterized DCE algorithm 332 may report a default statistic or a higher statistic than is appropriate, while a well-characterized DCE algorithm 332, in making an honest assessment, may report a lower statistic even when it will surely outperform the poorly-characterized DCE algorithm 332.
[0061] To account for possible discrepancies between the "published statistic" or p-value and the actual ability of a particular DCE algorithm 332 to perform on a particular document, a credibility rating may be generated for each algorithm. The existence of ground-truthed documents can be used to generate the credibility rating. New extraction algorithms 315, upon entry into the IDCE 214, are automatically compared to ground-truth results by performing a "test" analysis on ground-truthed documents. It should be appreciated that both the ground-truth correlation data 333 and the published p-value for the new extraction algorithm 315 can be used as an estimate of the expected performance of the new extraction algorithm(s) 315. This correlation of the new extraction-algorithm performance with ground-truth can be performed on each sub-category of documents in the ground-truth set. The correlation with ground-truth information can be used to generate the credibility rating of the new extraction algorithm 315. In the absence of sufficient ground-truth information, correlating partial algorithms and/or inter-algorithm comparison (both described below) may be used to automatically improve the estimate of credibility.
Acceptance Levels
[0062] The data-extraction engine 330 is constructed to generate an acceptance-level statistic for each DCE algorithm 332. This statistical derivation for expected data-extraction accuracy of performance is generated as a function of the credibility rating and the published confidence statistic of the particular DCE algorithm 332. In its simplest form, the acceptance level 335 is a simple mathematical combination of the credibility rating and the published confidence statistic. In one embodiment, the acceptance level 335 may be a multiplication of the published confidence level and the credibility rating (see above).
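The multiplicative embodiment of the acceptance level can be sketched in a few lines. The function names and the sample figures are illustrative assumptions; only the product of confidence and credibility comes from the text.

```python
def acceptance_level(confidence, credibility):
    """Acceptance level as the product of published confidence and credibility."""
    if not (0.0 <= confidence <= 1.0 and 0.0 <= credibility <= 1.0):
        raise ValueError("confidence and credibility must lie in [0, 1]")
    return confidence * credibility


def rank_by_acceptance(stats):
    """Sort {name: (confidence, credibility)} by acceptance level, highest first."""
    levels = {name: acceptance_level(*pair) for name, pair in stats.items()}
    return sorted(levels.items(), key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    # Hypothetical algorithms: one honest, one that over-reports its confidence.
    for name, level in rank_by_acceptance({"honest": (0.80, 0.95), "boastful": (1.00, 0.60)}):
        print(f"{name}: {level:.3f}")
```

In this made-up case the honest algorithm ranks first even though it publishes the lower confidence, which is the corrective effect the credibility rating is meant to provide.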
[0063] Despite the corrective nature of the acceptance level 335, further normalization of the published statistics is contemplated. This normalization, like other aspects of the IDCE 214, is readily updated as more and more documents are added to the system. Essentially, the normalization accounts for DCE algorithms 332 that over-report their expected performance in their published confidence statistics or p-values. Note that each DCE algorithm 332 may have a plurality of p-values associated with various categories and/or sub-categories of source data types.
Preferably, the DCE algorithms' p-values are adjusted to have the same mean published statistic when averaged over all of the documents in the corpus. In this way, the credibility rating still dictates which DCE algorithms 332 have overall higher credibility. It will be understood by those skilled in the art of the present invention that IDCE 214 may apply a confirmed confidence statistic as an alternative to normalizing a published confidence statistic that incorrectly reflects the effectiveness of the respective DCE algorithm 332.
[0064] For example, suppose algorithm (A) has a mean credibility rating of 0.95, and algorithm (B) has a mean credibility rating of 0.85. For the purposes of this example, algorithm (A) is also sophisticated enough to rate its published statistics relatively (from 0.00 to 1.00, with a mean of 0.75), while algorithm (B) decides that it will always post a statistic of 1.00. Relative to algorithm (A), then, algorithm (B)'s published statistic should be adjusted by a factor of 0.75. This adjustment can be implemented as described above by applying the adjustment factor to the published statistic, or alternatively by correcting (i.e., replacing) the published statistic with a more accurate value.
[0065] Now, suppose a document is tested by both algorithms. Algorithm (A) publishes a statistic of 0.85 and has a credibility rating of 0.9 for this particular document. Algorithm (B) publishes the p-value of 1.00 (as it always does) and has a credibility rating of 0.9 for this document. The acceptance level of (A) is 0.85 x 0.9 = 0.765, while that of (B) is 1.00 x 0.9 x 0.75 (the latter normalizing factor to account for its credibility) = 0.675.
[0066] Each of the previously described data-extraction engine elements enables a methodology to optimally analyze digital sources to extract information for the generation of useful metadata. In this methodology, new extraction algorithms 315 are seamlessly integrated into the IDCE 214, cooperating with and competing with existing DCE algorithms 332 in the determination of the most accurate metadata description for the particular data source. As previously described, each of the data-extraction engine elements functions via commonality in a set of data-interchange standards that bridge the gaps between each of the particular elements and the other elements of the data-extraction engine 330.
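The two-algorithm example above, with algorithms (A) and (B), can be reproduced with a short sketch. It assumes the normalization factor is the ratio of the mean published statistics, taking algorithm (A) as the reference; the posting histories are invented so that their means match the example (0.75 for A, 1.00 for B), and the function names are hypothetical.

```python
from statistics import mean


def normalization_factor(reference_history, target_history):
    """Factor that scales the target's published p-values to the reference mean."""
    return mean(reference_history) / mean(target_history)


def normalized_acceptance(published, credibility, factor=1.0):
    """Acceptance level with an optional normalizing factor applied."""
    return published * credibility * factor


# Invented posting histories whose means match the example.
HISTORY_A = [0.50, 0.75, 1.00]   # mean 0.75
HISTORY_B = [1.00, 1.00, 1.00]   # always posts 1.00

if __name__ == "__main__":
    factor_b = normalization_factor(HISTORY_A, HISTORY_B)
    print(round(normalized_acceptance(0.85, 0.9), 3))            # algorithm (A)
    print(round(normalized_acceptance(1.00, 0.9, factor_b), 3))  # algorithm (B)
```

The text states the preferred goal as equal mean published statistics across the corpus; scaling relative to a single reference algorithm is just the simplest way to show the arithmetic.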
[0067] Partial or correlating algorithms share some similarities with the "sub-categorization schemes" described above. These partial or correlating algorithms provide predictive behavior for the complete or "full" algorithms when ground-truthing is either not possible, feasible, or desirable (i.e., in most cases!). These partial algorithms can in some cases provide a statistical indication of how well any algorithm (e.g., DCE algorithms 332 and/or new extraction algorithms 315) that has been entered into the IDCE 214 will perform on a previously-unexamined document. This is possible especially if there is a correlation between the "full" algorithm and the partial algorithm and when there is a correlation between the "full" algorithm and the ground-truth data.
[0068] However, partial algorithms will not always provide useful predictive value for the correlation of a "full" algorithm with ground-truth. In such cases, the partial algorithms can be useful for winnowing out "full" algorithms that are likely to be the most accurate in their analysis. Partial algorithms solve a simplified subset of the metadata generation problem, and in doing so, can identify "full" algorithm failures.
[0069] Using the Manhattan segmenter again, for example, is illustrative. A Manhattan segmenter simplifies the segmentation by forming non-overlapping rectangles. Thus, in even moderately complex page layouts, a Manhattan segmenter results in a simplification of segmentation, since any regions that may overlap another region's rectangular bounding box get added to the region until no rectangles overlap.
Often, for magazine pages, etc., this results in columns or even an entire magazine page being reduced to a single region. Thus, if a full algorithm provides a region that overlaps two or more Manhattan regions, it is highly likely that this is because the full algorithm has erred and inadvertently smeared two regions together.
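The Manhattan check described above can be sketched with axis-aligned rectangles. This is an assumed minimal model, not the patent's segmenter: rectangles are `(x0, y0, x1, y1)` tuples, overlapping bounding boxes are merged until none overlap, and a full-algorithm region is flagged when it overlaps two or more of the merged regions.

```python
def overlaps(a, b):
    """True if axis-aligned rectangles a and b share interior area."""
    return a[0] < b[2] and b[0] < a[2] and a[1] < b[3] and b[1] < a[3]


def union(a, b):
    """Smallest rectangle containing both a and b."""
    return (min(a[0], b[0]), min(a[1], b[1]), max(a[2], b[2]), max(a[3], b[3]))


def manhattan_segment(boxes):
    """Merge overlapping bounding boxes until no two rectangles overlap."""
    regions = list(boxes)
    merged = True
    while merged:
        merged = False
        for i in range(len(regions)):
            for j in range(i + 1, len(regions)):
                if overlaps(regions[i], regions[j]):
                    regions[i] = union(regions[i], regions[j])
                    del regions[j]
                    merged = True
                    break
            if merged:
                break
    return regions


def suspicious_regions(full_regions, manhattan_regions):
    """Full-algorithm regions that overlap two or more Manhattan regions."""
    return [region for region in full_regions
            if sum(overlaps(region, m) for m in manhattan_regions) >= 2]


if __name__ == "__main__":
    manhattan = manhattan_segment([(0, 0, 10, 10), (8, 8, 20, 20), (30, 0, 40, 10)])
    print(manhattan)
    print(suspicious_regions([(0, 0, 10, 10), (15, 0, 35, 10)], manhattan))
```

The quadratic merge loop is adequate for the handful of regions on a page; a production segmenter would likely work on pixel data instead of precomputed boxes.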
[0070] A priori, it would seem likely that if enough DCE algorithms 332 populate a given data-interchange standard area, such as layout determination for example, they would tend to "cluster" on an optimal solution. This may well be the case in certain areas, such as OCR. However, for difficult documents, it is likely that many, if not most, algorithms will tend to fail because of similar misconceptions or design choices. In these cases, it may actually be the algorithms that do not cluster that provide the best solution for the problem. In these situations, the existence of ground-truth data will be of use. How the different algorithms cluster and correlate for similarly-structured (or "sub-categorized") documents can be determined by looking at the ground-truth set. These tendencies, which are automatically updated as new algorithms or new ground-truthed documents are entered into the system, can then be used to winnow out the appropriate algorithms during an "inter-algorithm consideration" stage.
[0071] A comment on combining algorithms may prove useful here. In some cases (e.g., zoning and text analysis), regions and words (respectively) may be formed that did not exist in any of the individual algorithms. Using text extractors as an example, suppose the sentence "The Mormon keystone" was analyzed by one OCR engine as "Themor monkey stone" and by another OCR engine as "The Morm on keystone." When the two algorithms are analyzed by logic 400 for combining, the sentence may be broken down into its most basic (e.g., the shortest) text pieces based on where word breaks (i.e., spaces) were found in any of the OCR engines: "The mor m on key stone." From this last arrangement, new words not originally present in either OCR interpretation, such as "Mormon" and "onkey," can be formed, providing a means to correctly parse the sentence not separately available in either OCR engine.
[0072] A similar "emergent" region is possible for zoning. Suppose a document comprises two text columns, referred to here as regions "1" and "2," and a photo, referred to here as region "3," is located between regions 1 and 2 (overlapping their rectangular bounds). Suppose one zoning algorithm smears the photo together with region 1, and the other with region 2. That is, one zoning algorithm segments the document into two regions, "1+3" and "2." The other zoning algorithm segments the document into regions "1" and "2+3," respectively. The new region emerges by subtracting the second algorithm's "1" from the first algorithm's "1+3" and/or by subtracting the first algorithm's "2" from the second algorithm's "2+3." This method for combining the results from multiple algorithms is referred to as "atomize and cluster."
[0073] The IDCE 214 offers an opportunity for synergistic improvement in performance over that possible by simply selecting the most accurate single DCE algorithm 332 available for a particular source-data type. As described above, the "atomize and cluster" method for combining algorithms offers the possibility for solving problems that no single algorithm can solve.
Many combining techniques, such as voting for OCR, may improve the overall accuracy of a set of algorithms by continually selecting the "best" of multiple existing results. However, this atomize and cluster technique provides the emergent capability of providing more accurate results even when no single DCE algorithm has in fact found the correct result. The examples given above for "The Mormon keystone" and zoning regions "1," "2," and "3" are testament to this.
[0074] While the full implementation of the optimized statistical combination of DCE algorithms 332 is very complex, in concept it is straightforward. Since all algorithms publish their statistical confidences in their findings, differences between different algorithms can be statistically compared and an optimized solution (e.g., using a cost function based on the data-interchange standards of the algorithms) involving results, where appropriate, from any subscribing algorithms, can be crafted. Such a solution is made possible by the use of statistical publishing by each of the DCE algorithms 332.
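The "atomize and cluster" idea can be sketched for both the OCR sentence and the two-column-and-photo zoning example. Everything here is an illustrative assumption: text is atomized at the union of the word breaks from all readings and regrouped against a small hypothetical dictionary, preferring the fewest words; page regions are modeled as sets of cell identifiers.

```python
from collections import defaultdict


def break_offsets(reading):
    """Offsets in the space-free character stream where a reading has word breaks."""
    offsets, position = set(), 0
    for word in reading.split():
        position += len(word)
        offsets.add(position)
    return offsets


def atomize(readings):
    """Cut the character stream at every word break found in any reading."""
    stream = "".join(readings[0].split())
    for reading in readings[1:]:
        if "".join(reading.split()).lower() != stream.lower():
            raise ValueError("readings disagree on the characters themselves")
    cuts = sorted(set().union(*(break_offsets(r) for r in readings)))
    atoms, start = [], 0
    for cut in cuts:
        atoms.append(stream[start:cut])
        start = cut
    return atoms


def cluster(atoms, dictionary):
    """Regroup adjacent atoms into dictionary words, using as few words as possible."""
    best = {len(atoms): []}
    for i in range(len(atoms) - 1, -1, -1):
        for j in range(i + 1, len(atoms) + 1):
            word = "".join(atoms[i:j])
            if word.lower() in dictionary and j in best:
                candidate = [word] + best[j]
                if i not in best or len(candidate) < len(best[i]):
                    best[i] = candidate
    return best.get(0)  # None when no complete parse exists


def atomize_regions(segmentations):
    """Group page cells that every segmentation assigns to the same regions."""
    groups = defaultdict(set)
    cells = set().union(*(region for seg in segmentations for region in seg))
    for cell in cells:
        signature = tuple(
            next((i for i, region in enumerate(seg) if cell in region), None)
            for seg in segmentations
        )
        groups[signature].add(cell)
    return sorted(groups.values(), key=min)


if __name__ == "__main__":
    atoms = atomize(["Themor monkey stone", "The Morm on keystone"])
    print(atoms)
    print(cluster(atoms, {"the", "mormon", "keystone", "monkey", "key", "stone", "on"}))
    # Cells 0-1: column 1, cells 2-3: photo, cells 4-5: column 2.
    print(atomize_regions([[{0, 1, 2, 3}, {4, 5}], [{0, 1}, {2, 3, 4, 5}]]))
```

The fewest-words preference is a stand-in for the statistical cost function the text alludes to; a real combiner would presumably weight candidates by each engine's acceptance level.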
[0075] As new documents are added to the knowledge base of the IDCE 214, high-weight or high-priority keywords may be generated from the text, if any exists, of the new documents. These keywords may trigger automatic queries into the knowledge base to generate a correlation analysis among various documents. This process may be automated, can be run at any time (e.g., during spare processor cycles, in "batch mode," etc.), and can be used to generate new data not located in any single document within the corpus, or knowledge base 339.
[0076] Reference is now directed to the flow chart illustrated in FIG. 4. In this regard, the various steps shown in the flow chart present a method for improving the accuracy of extracted digital content that may be realized by IDCE 214. As illustrated in FIG. 4, the method 400 may begin by reading and/or otherwise acquiring source data as shown in step 402. Next, the source data received in step 402 may be analyzed and one or more categories/sub-categories may be associated with the source data as illustrated in step 404.
[0077] After having received and identified the source data in steps 402 and 404, the IDCE 214 may read a confidence value as indicated in step 406. The IDCE 214 may also read a credibility rating as illustrated in step 408. After having read a confidence value and a credibility rating for each of a plurality of applicable DCE algorithms 332 when applied to the identified source data, as illustrated in steps 406 and 408, the IDCE 214 may generate an acceptance level for each DCE algorithm 332 as indicated in step 410. After having generated an acceptance level responsive to the confidence value and credibility rating of steps 406 and 408, the IDCE 214 may generate an optimal interpretation of the source data as illustrated in step 412.
[0078] As previously explained, an optimal interpretation of the source data may comprise the interaction of a data discriminator 331, a plurality of DCE algorithms 332, ground-truthing correlation data 333, categorization data 334, the acceptance level generated in step 410, an algorithm accuracy recorder 336, a statistical comparator 337, and a key information identifier 338. As also described above, the various elements that interact to generate the optimal interpretation of the source data may each interact with the other elements via commonality in a set of data-interchange standards that bridge the gaps between each of the particular elements and the other elements of the data-extraction engine 330. Moreover, the optimal interpretation may be responsive to partial or correlating algorithms, inter-algorithm considerations, statistical analysis and combination, and generation of metadata.
[0079] FIG. 5 is a flow chart illustrating an embodiment of a method for generating an optimal interpretation of a source document that may be realized by IDCE 214. In this regard, the various steps shown in the flow chart present a method for combining DCE algorithms for improving the accuracy of extracted digital content that may be realized by IDCE 214. As illustrated in FIG. 5, the method 500 may begin by reading and/or otherwise acquiring performance statistics associated with each of the various DCE algorithms that may be applied over a particular document of interest as shown in step 502. Next, the IDCE 214 may be programmed to rank the various DCE algorithms in order based on their respective acceptance level as shown in step 504.
[0080] After having identified and ranked the various DCE algorithms in steps 502 and 504, the IDCE 214 (FIG. 3) may perform a statistical test on the obtained statistics to determine which, if any, of the various DCE algorithms is statistically dissimilar from the others. As illustrated in step 506, the IDCE 214 may be programmed to select statistically similar DCE algorithms.
[0081] One way that this can be accomplished is to calculate a t-value and apply the t-value to a standard t-test to determine if results from the DCE algorithms are statistically different from one another. The t-test assesses whether the means of two groups are statistically different from each other. This analysis is appropriate whenever you want to compare the means of two groups. The t-value can be determined from the following equation:
$$t = \frac{\bar{X}_1 - \bar{X}_2}{\sqrt{\dfrac{Var_1}{n_1 - 1} + \dfrac{Var_2}{n_2 - 1}}} \qquad \text{Eq. (1)}$$
where $\bar{X}$ is the mean, $Var$ is the variance, and $n$ is the number of samples for each of the respective DCE algorithms, and the subscript "1" identifies the corresponding values from the top-ranked DCE algorithm. For situations where results from more than two DCE algorithms need to be compared, the top-ranked DCE algorithm may be compared to results from subsequent DCE algorithms one at a time. As is evident from equation (1) above, the t-value will be positive if the first mean is larger than the second, and negative when it is smaller.
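A minimal sketch of the t-value calculation follows. It assumes Eq. (1) is the difference of the two means divided by the square root of Var1/(n1 - 1) + Var2/(n2 - 1), with Var taken as the population variance; the function names and the sample accuracy scores are invented for illustration.

```python
from math import sqrt
from statistics import mean, pvariance


def t_value(top_ranked, other):
    """t-value comparing the top-ranked algorithm's results with another's."""
    n1, n2 = len(top_ranked), len(other)
    if n1 < 2 or n2 < 2:
        raise ValueError("each group needs at least two samples")
    # Assumption: Var is the population variance, divided by (n - 1).
    spread = pvariance(top_ranked) / (n1 - 1) + pvariance(other) / (n2 - 1)
    return (mean(top_ranked) - mean(other)) / sqrt(spread)


def degrees_of_freedom(top_ranked, other):
    """Samples in both groups minus 2, for use with a significance table."""
    return len(top_ranked) + len(other) - 2


if __name__ == "__main__":
    scores_a = [0.90, 0.80, 0.85]  # hypothetical per-document accuracies
    scores_b = [0.60, 0.70, 0.65]
    print(t_value(scores_a, scores_b), degrees_of_freedom(scores_a, scores_b))
```

The sketch does not guard against two groups with zero variance, which would make the denominator zero. Comparing the result against a critical value still requires a significance table or a statistics package.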
[0082] Generally, once the t-value has been computed it may be compared to a table of significance to test whether the ratio is large enough to indicate that the difference between the results generated by the DCE algorithms is not likely to have been a chance finding. In order to test the t-value against a table of significance, the number of degrees of freedom is preferably computed and a risk level (i.e., an alpha level) selected. In the t-test, the degrees of freedom is equivalent to the sum of the samples in both groups minus 2. In most social research, the "rule of thumb" is to set the risk level at 0.05. With a risk level of 0.05, five times out of a hundred the t-test would identify a statistically significant difference between the means even if there was none (i.e., by "chance").
[0083] Given the risk or alpha level, the degrees of freedom, and the t-value, one can look the t-value up in a standard table of significance (often available as an appendix in the back of most statistics texts) to determine whether the t-value is large enough to be significant. When it is, the difference between the means for the two groups is statistically significant (even given the variability). Statistical-analysis computer programs routinely provide the significance test results. After having statistically identified similar DCE algorithms as described above, the IDCE 214 may be programmed to combine the similar DCE algorithms as indicated in step 508.
[0084] Reference is now directed to the flow chart illustrated in FIG. 6, which illustrates an embodiment of a method for integrating digital-content extraction algorithms in the intelligent digital content extractor of FIG. 3. In this regard, DCE algorithm integration logic herein illustrated as method 600 may begin with step 602 where a user of the IDCE 214 identifies one or more DCE algorithms 332 (see FIG. 3) that the user desires to add to the IDCE 214. Next, in step 604, the integration logic may set a counter, N,
equal to the number of DCE algorithms 332 that the user desires to integrate with the IDCE 214. As illustrated in step 606, the integration logic may read a published confidence value. It should be appreciated that in some cases, the new DCE algorithm may publish a confidence value for a number of various source data types. For example, an algorithm designed to extract digital content from a digital photo may provide confidence values for various digital photograph file formats.
[0085] Next, as illustrated in step 608, the integration logic may search for the number of ground-truthed data sources in the IDCE knowledge base related to the present DCE algorithm. Once the integration logic has identified the type of data source that the DCE algorithm 332 is designed to extract from, the integration logic may begin reading each of the ground-truthed data files or documents as shown in step 610. The integration logic may proceed by applying the underlying DCE algorithm 332 to the ground-truthed data presently in memory as shown in step 612. As illustrated in step 614, the results of comparison to the ground-truthed data may be used to update the GT correlation data. Similarly, as illustrated in step 616, the integration logic can update the credibility data.
[0086] Thereafter, as illustrated in step 618, the integration logic may query the knowledge base if further ground-truthed data source examples are available. If the response to the query of step 618 is affirmative, i.e., more ground-truthed data sources exist, the integration logic may update a counter as shown in step 620 and return to step 610. As shown in the flow chart of FIG. 6, the integration logic may perform steps 610 through 620 until a determination has been made that the entire set of ground-truthed data sources has been processed.
[0087] Otherwise, if the response to the query of step 618 is negative, i.e., the set of ground-truthed data sources that match the type of data that the DCE algorithm is targeted to extract information from has been fully processed, the integration logic may perform a second query as illustrated in step 622. As illustrated in the flow chart of FIG. 6, if there are more DCE algorithms to integrate into the IDCE 214, as indicated by the negative branch exiting the query of step 622, the integration logic may decrement a counter as shown in step 624 and repeat steps 606 through 624 to assimilate the remaining DCE algorithms identified for integration. As is also illustrated in the flow chart of FIG. 6, if the response to the query of step 622 is affirmative, i.e., all the new algorithms have been added to the system, the integration logic may terminate.
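The integration loop of FIG. 6 can be pictured with the following simplified sketch. The data layout (plain dictionaries), the field names, and the use of the exact-match fraction as the credibility figure are all assumptions made for illustration; the text does not specify how the correlation with ground truth is scored.

```python
def integrate_algorithms(new_algorithms, knowledge_base):
    """Assimilate new algorithms against ground-truthed data (steps 602-624, simplified)."""
    for algorithm in new_algorithms:                      # counter loop: steps 604, 622, 624
        entry = {"confidence": algorithm["confidence"],    # step 606: published confidence
                 "gt_matches": []}
        matching = [doc for doc in knowledge_base["ground_truth"]
                    if doc["type"] == algorithm["source_type"]]   # step 608
        for doc in matching:                               # steps 610, 618, 620
            output = algorithm["extract"](doc["data"])     # step 612: apply the algorithm
            entry["gt_matches"].append(output == doc["truth"])    # step 614: GT comparison
        if entry["gt_matches"]:                            # step 616: credibility update
            entry["credibility"] = sum(entry["gt_matches"]) / len(entry["gt_matches"])
        else:
            entry["credibility"] = None                    # no ground truth for this type
        knowledge_base["algorithms"][algorithm["name"]] = entry
    return knowledge_base


# Hypothetical knowledge base and a toy "extractor" that upper-cases text.
KNOWLEDGE_BASE = {
    "algorithms": {},
    "ground_truth": [
        {"type": "text", "data": "hello world", "truth": "HELLO WORLD"},
        {"type": "text", "data": "a-b", "truth": "A B"},
        {"type": "audio", "data": "ignored", "truth": "ignored"},
    ],
}
NEW_ALGORITHMS = [
    {"name": "upper", "source_type": "text", "confidence": 1.0, "extract": str.upper},
]

if __name__ == "__main__":
    print(integrate_algorithms(NEW_ALGORITHMS, KNOWLEDGE_BASE)["algorithms"]["upper"])
```

Filtering ground-truthed documents by type mirrors the role of the data discriminator: the audio document is never offered to the text extractor.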
[0088] It should be appreciated that the integration logic may report or otherwise communicate with other elements of the IDCE 214. In this regard, the integration logic may forward identifiers of the newly-integrated DCE algorithms, together with published confidence values, credibility values, etc. In this way, IDCE 214 can integrate any number of algorithms.
[0089] As described above, each new DCE algorithm 315 (see FIG. 3) integrated with IDCE 214 may not accurately report its own absolute credibility. Stated another way, the IDCE 214 uses the ground-truthing information and various pertinent information resident in the knowledge base 339 to derive a normalized credibility rating. It is significant to note that sophisticated DCE algorithms 332 can still report relative statistics that indicate their relative effectiveness on different types of documents.
[0090] In addition to the ability to integrate new DCE algorithms 332, as illustrated and described in association with the flow chart of FIG. 6, it should be appreciated that as new documents (i.e., data sources) are entered into the IDCE 214, and as new ground-truthing is performed, the knowledge base 339 of the IDCE 214 is further expanded. For example, information responsive to data source categorization and/or sub-categorizations may be automatically updated. Where appropriate, ground-truthing, credibility statistics, acceptance levels, and query-generated statistics may be updated, further changing the IDCE 214 knowledge base 339.
[0091] Any process descriptions or blocks in the flow charts presented in FIGs. 4, 5, and 6 should be understood to represent modules, segments, or portions of code or logic, which include one or more executable instructions for implementing specific logical functions or steps in the associated process. Alternate implementations are included within the scope of the present invention in which functions may be executed out of order from that shown or discussed, including substantially concurrently or in reverse order, depending on the functionality involved, as would be understood by those reasonably skilled in the art after having become familiar with the teachings of the present invention.
Claims (3)
- 1. A method for extracting digital content, comprising: reading a digital source (402); identifying the digital source by type (404); generating an acceptance level for each of a plurality of digital-content extraction algorithms based on a confidence value and a credibility rating associated with the accuracy of each of the plurality of digital-content extraction algorithms (410); and applying a combination of at least two of the plurality of digital-content extraction algorithms based on the acceptance level to thereby generate extracted digital content of the digital source (412).
- 2. The method of claim 1, wherein generating an acceptance level (410) comprises a normalization of the relative accuracy of the associated digital-content extraction algorithm (332) when applied to a verified source of the digital-source type.
- 3. A method for assimilating a digital-content extraction algorithm in an intelligent digital-content extractor, comprising: identifying a digital-content extraction algorithm intended for integration with the intelligent digital-content extractor (602); reading a confidence value purporting the expected accuracy of the identified digital-content extraction algorithm when applied to a particular type of source data (606); applying the digital-content extraction algorithm over source data (612); generating a measure of the realized accuracy of the digital-content extraction algorithm over the source data (614); and updating a knowledge base reflective of previously integrated digital-content extraction algorithms with a result of the generating step (616).
- 4. The method of claim 3, wherein updating (616) comprises modifying ground-truthed correlation data (333).
- 5. The method of claim 3, wherein updating (616) comprises generating an acceptance value (335).
- 6. A digital-content extractor, comprising: a data-acquisition device (12) configured to generate a digital representation of a source; a data-extraction engine (330) communicatively coupled to the data-acquisition device (12), the data-extraction engine (330) configured to apply a combination of a plurality of digital-content extraction algorithms (332) over the source, wherein the data-extraction engine (330) is configured to automatically accommodate new data-extraction algorithms (315).
- 7. The extractor of claim 6, wherein the data-extraction engine (330) determines a more accurate interpretation of digital content (340) within the source than can be realized by separately applying each respective digital-content extraction algorithm.
- 8. The extractor of claim 6, wherein the data-extraction engine (330) compares the relative effectiveness of the plurality of digital-content extraction algorithms (332) in response to a verification that the combined digital-content extraction algorithms (332) share a common data type identified in a data-interchange standard.
- 9. The extractor of claim 6, wherein the data-extraction engine (330) applies the combination of the plurality of digital-content extraction algorithms (332) in response to information in a knowledge base (339).
- 10. The extractor of claim 9, wherein the knowledge base (339) comprises an acceptance level (335) reflective of each individual digital-content extraction algorithm's verified ability to correctly interpret content within the source.
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US10/199,530 US20040015775A1 (en) | 2002-07-19 | 2002-07-19 | Systems and methods for improved accuracy of extracted digital content |
GB0316633A GB2391087A (en) | 2002-07-19 | 2003-07-16 | Content extraction configured to automatically accommodate new raw data extraction algorithms |
Publications (2)
Publication Number | Publication Date |
---|---|
GB0523074D0 GB0523074D0 (en) | 2005-12-21 |
GB2417349A GB2417349A (en) | 2006-02-22 |
Family
ID=35688866
Family Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
GB0523074A Withdrawn GB2417349A (en) | 2002-07-19 | 2003-07-16 | Digital-content extraction using multiple algorithms; adding and rating new ones |
Country Status (1)
Country | Link |
---|---|
GB (1) | GB2417349A (en) |
Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JPH0561848A (en) * | 1991-09-02 | 1993-03-12 | Hitachi Ltd | Device and method for selecting and executing optimum algorithm |
EP0594196A1 (en) * | 1992-10-22 | 1994-04-27 | Digital Equipment Corporation | Address lookup in packet data communications link, using hashing and content-addressable memory |
WO2001031463A1 (en) * | 1999-10-22 | 2001-05-03 | Yodlee.Com, Inc. | Method and apparatus for providing calculated and solution-oriented personalized summary-reports to a user through a single user-interface |
US6321224B1 (en) * | 1998-04-10 | 2001-11-20 | Requisite Technology, Inc. | Database search, retrieval, and classification with sequentially applied search algorithms |
US20020046002A1 (en) * | 2000-06-10 | 2002-04-18 | Chao Tang | Method to evaluate the quality of database search results and the performance of database search algorithms |
US6397212B1 (en) * | 1999-03-04 | 2002-05-28 | Peter Biffar | Self-learning and self-personalizing knowledge search engine that delivers holistic results |
- 2003-07-16: GB application GB0523074A, published as GB2417349A (en), not active, Withdrawn
Also Published As
Publication number | Publication date |
---|---|
GB0523074D0 (en) | 2005-12-21 |
Similar Documents
Publication | Title | Publication Date |
---|---|---|
US20040015775A1 (en) | Systems and methods for improved accuracy of extracted digital content | |
US8488916B2 (en) | Knowledge acquisition nexus for facilitating concept capture and promoting time on task | |
US12222980B2 (en) | Generating congruous metadata for multimedia | |
US8452132B2 (en) | Automatic file name generation in OCR systems | |
US11106718B2 (en) | Content moderation system and indication of reliability of documents | |
US10572528B2 (en) | System and method for automatic detection and clustering of articles using multimedia information | |
US8260062B2 (en) | System and method for identifying document genres | |
US8064703B2 (en) | Property record document data validation systems and methods | |
US20070226321A1 (en) | Image based document access and related systems, methods, and devices | |
US11854285B2 (en) | Neural network architecture for extracting information from documents | |
CN112132710B (en) | Legal element processing method and device, electronic equipment and storage medium | |
RU61442U1 (en) | SYSTEM OF AUTOMATED ORDERING OF UNSTRUCTURED INFORMATION FLOW OF INPUT DATA | |
US20230334885A1 (en) | Neural Network Architecture for Classifying Documents | |
JP2007172077A (en) | Image search system, method thereof, and program thereof | |
CN117493645B (en) | Big data-based electronic archive recommendation system | |
WO2022231943A1 (en) | Intelligent data extraction | |
Myasnikov et al. | Detection of sensitive textual information in user photo albums on mobile devices | |
CN117612182A (en) | Document classification method, device, electronic equipment and medium | |
CN116980646A (en) | Video data processing method, device, equipment and readable storage medium | |
GB2417349A (en) | Digital-content extraction using multiple algorithms; adding and rating new ones | |
US20240338659A1 (en) | Machine learning systems and methods for automated generation of technical requirements documents | |
US20240135739A1 (en) | Method of classifying a document for a straight-through processing | |
Flynn | Document classification in support of automated metadata extraction from heterogeneous collections | |
US20250014374A1 (en) | Out of distribution element detection for information extraction | |
CN114049639B (en) | Image processing method and device |
Legal Events
Code | Title |
---|---|
WAP | Application withdrawn, taken to be withdrawn or refused ** after publication under section 16(1) |