CN120873875B - Machine learning driven intelligent classification method and system for email attachments

Info

Publication number: CN120873875B
Application number: CN202511390275.5A
Authority: CN
Inventors: 许可; 韩帅; 曹浪平
Original assignee: Zhejiang Jiumu Holding Group Co ltd
Current assignee: Zhejiang Jiumu Holding Group Co ltd
Priority date: 2025-09-26
Filing date: 2025-09-26
Publication date: 2025-12-16
Anticipated expiration: 2045-09-26
Also published as: CN120873875A

Abstract

The invention relates to the technical field of machine learning and email management, and discloses a machine learning driven intelligent classification method and system for email attachments. The method comprises: obtaining an email attachment original data stream, extracting business keywords through three-level slicing, calculating semantic concentration indexes to obtain semantic features, mapping the semantic features to a semantic feature set space, and generating semantic fingerprint codes; obtaining three-level regions with a modal identification algorithm, extracting a structured feature set, and calculating association strength values to construct a fused feature set; and extracting a context element set, calculating a total association strength value, and setting hierarchical decision rules to construct a directed association set. The method addresses the difficulty of converting unstructured attachments, insufficient multimodal understanding, weak context association, low classification accuracy, and low query efficiency in the classification of email attachments.

Description

Machine learning driven intelligent classification method and system for email attachments

Technical Field

The invention relates to the technical field of machine learning and email management, in particular to an intelligent email attachment classification method and system driven by machine learning.

Background

With the deepening of enterprise digital transformation and the spread of distributed office work, email has become a core hub for internal and external business communication and information transfer. Email attachments carry high-value business data such as contracts, invoices, and reports, and are an important carrier for data access in a distributed environment. Existing mail systems process only the message text in a simple way, and attachments depend on manual downloading and archiving. When multiple parties work on the same attachments, this easily leads to version confusion and inconsistent state, as well as subjective classification, broken time sequences, and information islands. Static rules based on file extensions and keywords cannot resolve these state and timing problems, so the shortcomings are prominent and need to be addressed.

In the prior art, the Chinese patent with authorized announcement number CN120388388B discloses an automatic identification method and system for mail association regulations, whose core flow comprises file uploading, preprocessing, character recognition, chapter recognition, content extraction, tree structure construction, and result display. That method focuses on image preprocessing and character recognition and builds a tree structure based on the document hierarchy, so that the whole flow is automated and the regulation information is displayed intuitively to the user. The Chinese patent with authorized announcement number CN118779459B discloses a mail classification method and system based on a multimodal large language model. That method preprocesses the multimodal text content of the mail data to be classified, randomly samples labeled mails of preset categories as samples, analyzes the samples with a multimodal large language model to obtain category feature texts, feeds the mail text to be classified and the category feature texts into the model through a prompt template, and determines the mail category. It can detect junk mail and classify mail by industry or discipline, which meets mail data management requirements, improves classification accuracy, and reduces cost.

However, although these two prior-art schemes have some value for identifying regulations associated with mail and for multimodal mail classification, neither solves the core problems of email attachment management. The patent with authorized announcement number CN120388388B focuses on identifying regulations associated with mail, on image preprocessing, and on tree structure construction. It does not involve multimodal feature fusion for attachments, does not integrate the three-dimensional context of mail text, personnel relationships, and time, cannot resolve attachment version conflicts when multiple parties operate in parallel, and does not manage timing accuracy. The patent with authorized announcement number CN118779459B focuses on mail classification. It designs no consistency mechanism for attachment state, does not break down information islands, and does not manage the timing features of attachments. Neither provides attachment state coordination or a time management scheme. Their classification is either detached from the business scene or leads to version confusion, so they cannot meet the need for trustworthy, stable, and orderly attachment management in distributed office work.

Disclosure of Invention

The invention is suitable for email attachment management in enterprises of all sizes, for example attachment-intensive business scenes such as financial invoice archiving, personnel resume screening, and purchase contract review, and it meets the needs of compatible management of multi-format attachments and collaboration across enterprise departments. Through three-level slicing, semantic fingerprint coding, and context element sets, it pursues the dual goals of converting the unstructured content of email attachments into a structured semantic representation and classifying attachments accurately. By combining the fusion feature matrix with the calculation of the total association strength value, it turns the subjective judgment of attachment classification into quantitative analysis, accurately identifies core and auxiliary semantic features, resolves attachment version confusion and broken time sequences under multi-party operation, and reduces the misclassification rate. By combining the directed association set with hierarchical decision rules, it associates information on demand and keeps attachment classification matched to the business scene, which provides accurate data support for enterprise information retrieval, shortens attachment search time, and reduces the operation and maintenance cost of email attachment management.

In order to achieve the above purpose, the present invention provides the following technical solutions:

an intelligent classification method for machine learning driven email attachments comprising:

Acquiring an electronic mail attachment original data stream, performing three-level slicing processing on the electronic mail attachment original data stream, obtaining semantic features according to the three-level slicing, mapping the semantic features to a semantic feature set space, and obtaining semantic fingerprint codes;

identifying semantic fingerprint codes and three-level slices to obtain three-level regions, obtaining a structured feature set according to the three-level regions, and converting the structured feature set to obtain a fused structured feature set;

And extracting a context element set according to the original data stream of the email attachment, and calculating the context element set to obtain a total association strength value, and formulating a directed association set for the total association strength value.

Further, the three-stage slicing includes:

The first-level slice identifies page separators in the email attachment original data stream, divides the data stream into a plurality of independent page slices along the page boundaries, and assigns a unique page number to each page slice;

The second-level slice identifies paragraph terminators in all page slices obtained by the first-level slice, divides each page slice into a plurality of independent paragraph slices with the terminators as boundaries, assigns a paragraph serial number to each paragraph slice, and associates each paragraph serial number with the page number of the page to which it belongs;

The third-level slice identifies sentence terminators in all paragraph slices obtained by the second-level slice, divides each paragraph slice into a plurality of independent sentence slices with the sentence terminators as boundaries, assigns a sentence serial number to each sentence slice in text reading order, and associates each sentence serial number with the page number and paragraph serial number to which it belongs.

Further, the method for obtaining semantic features according to the three-level slice comprises the following steps:

identifying service keywords in each sentence slice, and calculating the service keywords to obtain semantic concentration indexes of the service keywords;

and setting a concentration threshold for the semantic concentration index, and comparing the semantic concentration index with the concentration threshold to obtain semantic features.

Further, the method for identifying the semantic fingerprint code and the tertiary slice to obtain the tertiary region comprises the following steps:

Selecting a target slice from the three-level slices, detecting to obtain the number of identifiable characters and the total number of pixels, judging the number of identifiable characters and the total number of pixels to obtain a text region, extracting the target slice to obtain edge pixels, obtaining edge density according to the edge pixels, calculating the target slice to obtain a color entropy value, combining the edge density and the color entropy value to obtain an image region, detecting the target slice to obtain a regular grid, and obtaining a table region according to the regular grid;

The table area, image area and text area of all target slices are combined into a three-level area.

Further, the method for obtaining the structured feature set according to the three-level region comprises the following steps:

extracting a text region of the three-level region to obtain a text structured feature set;

extracting an image area of the three-level area to obtain an image structural feature set;

extracting a table area of the three-level area to obtain a table structured feature set;

and forming the text structured feature set, the image structured feature set and the table structured feature set which are obtained through extraction into a structured feature set.

Further, the method for converting the structured feature set to obtain the fused structured feature set comprises the following steps:

Normalizing the structured feature set to obtain a normalized value of the text labeling density, a normalized value of the numerical cell occupation ratio and a normalized value of the definition;

And calculating the association strength values of each two of the text, the image and the table based on the structural feature set, and forming a fusion structural feature set by all the association strength values among the three.

Further, the method for extracting the context element set according to the original data stream of the email attachment comprises the following steps:

Identifying an original data stream of the electronic mail attachment to obtain a basic verb set and intention classification;

Determining a business scene according to the semantic feature set and the text structured feature set, and obtaining a text semantic element analysis result according to the intention classification;

acquiring personnel identity attributes from the original data stream of the email attachment, and acquiring personnel relationship network element analysis results according to the personnel identity attributes;

acquiring basic time information of the mail from an E-mail attachment original data stream, and acquiring a time mode feature element analysis result according to the basic time information;

the text semantic element analysis result and the personnel relationship network element analysis result and the time mode feature element analysis result form a context element set.

Further, the method for calculating the context element set to obtain the total association strength value comprises the following steps:

converting the intention classification and the business scene into a set representation, and calculating the set to obtain content correlation;

Acquiring a specific scene of personnel work and a personnel history cooperative relationship through personnel identity attributes, and acquiring personnel correlation according to the specific scene and the personnel history cooperative relationship;

setting a time correlation score according to a service scene to obtain a time correlation;

and carrying out weighted calculation on the content relevance, the personnel relevance and the time relevance to obtain a total association strength value.

Further, the set of directed associations includes:

The directed association set comprises a strong association scene, a medium association scene and a weak association scene;

In a strong association scene, when an element has to be judged and the conclusion inferred from the context conflicts with a single, isolated piece of other information, the context inference result is adopted;

In a medium association scene, the conclusions output by the context inference and by the content analysis are compared after weighting according to the total association strength value, and the conclusion with the higher weighted share is adopted;

In a weak association scene, the judgment relies on the analysis of the attachment content.
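
The weighted total association strength and the three-tier decision rule described above can be illustrated with a minimal Python sketch. The weights, the strong and weak thresholds, and the way the medium tier weights the two conclusions are all assumptions chosen for illustration; the text only states that the three relevances are weighted and that the decision depends on the tier.

```python
def total_association_strength(content, personnel, time, weights=(0.5, 0.3, 0.2)):
    """Weighted sum of content, personnel and time relevance (weights are assumed)."""
    w_content, w_personnel, w_time = weights
    return w_content * content + w_personnel * personnel + w_time * time


def decide(total, context_label, content_label,
           context_confidence, content_confidence,
           strong=0.7, weak=0.3):
    """Pick a label according to the association tier (thresholds are assumed)."""
    if total >= strong:
        # Strong association: the context inference result wins.
        return context_label
    if total < weak:
        # Weak association: rely on the attachment content analysis.
        return content_label
    # Medium association: weight both conclusions by the total strength
    # (assumed scheme) and keep the one with the larger weighted share.
    context_score = total * context_confidence
    content_score = (1.0 - total) * content_confidence
    return context_label if context_score >= content_score else content_label
```

For example, `decide(0.5, "contract", "invoice", 0.3, 0.8)` falls in the medium tier and returns the content-analysis label, because its weighted share is larger.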

A machine learning driven e-mail attachment intelligent sorting system for implementing the machine learning driven e-mail attachment intelligent sorting method described above, the system comprising:

The semantic fingerprint module is used for acquiring an electronic mail attachment original data stream, performing three-level slicing processing on the electronic mail attachment original data stream, obtaining semantic features according to the three-level slicing, mapping the semantic features to a semantic feature set space, and obtaining semantic fingerprint codes;

The fusion feature module is used for identifying semantic fingerprint codes and three-level slices to obtain three-level areas, obtaining a structured feature set according to the three-level areas, and converting the structured feature set to obtain a fusion structured feature set;

And the directed association module is used for extracting a context element set according to the original data stream of the attachment of the electronic mail, calculating the context element set to obtain a total association strength value, and formulating a directed association set for the total association strength value.

Compared with the prior art, the invention has the beneficial effects that:

Through three-level slicing, semantic fingerprint coding, and domain dictionary extraction, the invention converts the unstructured content of email attachments into a structured semantic representation, which addresses the low efficiency of manual classification and the high misjudgment rate of static rules based on file extensions or keywords in traditional schemes. The fusion feature matrix turns the fuzzy understanding of multimodal attachments into quantitative association analysis and accurately identifies the core associations among attachment content of different modalities, which addresses the weak handling of compound attachments containing text, images, and tables in the prior art. Combined with the weighted calculation of the total association strength value, the invention makes context-aware classification decisions, which addresses the problem that traditional classification ignores the mail context, so that classification deviates from the business scene and state becomes inconsistent under multi-party operation. The invention also associates and retrieves information on demand, which breaks down the information islands created by scattered attachment storage, provides accurate data support for operation and maintenance, shortens attachment search and fault location time, and reduces the operation and maintenance cost of enterprise email attachment management.

Drawings

In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the following description will briefly explain the drawings used in the embodiments or the description of the prior art, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and other drawings can be obtained according to these drawings without inventive effort for a person skilled in the art.

FIG. 1 is a flow chart of a machine learning driven intelligent classification method for email attachments provided by an embodiment of the present invention;

FIG. 2 is a schematic diagram of step S10 according to an embodiment of the present invention;

FIG. 3 is a schematic diagram of step S20 according to an embodiment of the present invention;

FIG. 4 is a schematic diagram of step S30 according to an embodiment of the present invention;

FIG. 5 is a functional block diagram of a machine learning driven e-mail attachment intelligent classification system according to an embodiment of the present invention.

Detailed Description

The following description of the embodiments of the present invention will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present invention, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.

Example 1:

Referring to fig. 1, the present embodiment provides a machine learning driven intelligent classification method for email attachments, which includes:

step S10, an electronic mail attachment original data stream is obtained, three-level slicing processing is carried out on the electronic mail attachment original data stream, semantic features are obtained according to the three-level slicing, the semantic features are mapped to a semantic feature set space, and semantic fingerprint codes are obtained.

Further, as shown in fig. 2, step S10 includes:

Step S11, acquiring an electronic mail attachment original data stream, and performing three-level slicing processing on the electronic mail attachment original data stream.

The original attachment is extracted from the mail system, and the underlying file interface of the operating system is called on the original attachment to obtain the email attachment original data stream, which comprises file header information data, content body data, and metadata. The file header information data represents the format and basic attribute data of the original attachment. The content body data represents the core part of the original attachment, such as text content, image data, and table information. The metadata represents auxiliary data on the basic attributes of the attachment itself, such as creation time, modification time, and attachment size.

The file header information data is checked with a magic number tool to determine the real file type of the attachment, which avoids the misjudgment caused by relying on the file extension alone.
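
As a minimal sketch of the magic number check, the following Python compares the leading bytes of the data stream with a small table of well-known file signatures. The table, the function names, and the extension aliases are illustrative assumptions, not the tool used by the invention; a practical table would cover many more formats.

```python
# Small illustrative signature table (assumption: only a few common formats).
MAGIC_NUMBERS = [
    (b"%PDF-", "pdf"),
    (b"\x89PNG\r\n\x1a\n", "png"),
    (b"\xff\xd8\xff", "jpeg"),
    (b"PK\x03\x04", "zip"),  # also the container of .docx/.xlsx/.pptx
    (b"\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1", "ole2"),  # legacy .doc/.xls
]

# Hypothetical mapping from declared extensions to container types.
EXTENSION_ALIASES = {
    "jpg": "jpeg", "docx": "zip", "xlsx": "zip", "pptx": "zip",
    "doc": "ole2", "xls": "ole2",
}


def detect_real_type(header: bytes) -> str:
    """Return the file type implied by the leading bytes, or 'unknown'."""
    for signature, file_type in MAGIC_NUMBERS:
        if header.startswith(signature):
            return file_type
    return "unknown"


def extension_is_consistent(filename: str, header: bytes) -> bool:
    """Check whether the declared extension agrees with the magic number."""
    declared = filename.rsplit(".", 1)[-1].lower() if "." in filename else ""
    return EXTENSION_ALIASES.get(declared, declared) == detect_real_type(header)
```

A file named `invoice.pdf` whose stream starts with `PK\x03\x04` would be flagged as inconsistent, which is the kind of mismatch an extension-only rule would miss.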

The attachment, with its real file type determined, then undergoes three-level slicing, whose purpose is to convert unstructured attachment content into structured data units. The first-level slice uses document structure analysis to identify page separators in the content body data, divides the content body data into a plurality of independent page slices along the page boundaries, and assigns each page slice a unique page number, for example P1 for the first page; the page boundaries are set according to the page margins of the document type. The second-level slice identifies paragraph terminators, such as double line feeds and paragraph marks, in all page slices obtained by the first-level slice, divides each page slice into a plurality of independent paragraph slices with the terminators as boundaries, and assigns each paragraph slice a paragraph serial number in left-to-right, top-to-bottom order, for example S1 for the first paragraph. Each paragraph serial number is associated with the page number of its page, for example P2-S3 for the third paragraph on the second page. The third-level slice identifies sentence terminators, such as periods, question marks, and exclamation marks, in all paragraph slices obtained by the second-level slice, divides each paragraph slice into a plurality of independent sentence slices, and assigns each sentence slice a sentence serial number in text reading order, for example L1 for the first sentence. Each sentence serial number is associated with the page number and paragraph serial number to which it belongs, for example P2-S3-L5 for the fifth sentence of the third paragraph on the second page.
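
The three-level slicing and its P-S-L identifiers can be sketched in a few lines of Python for plain text input. The choice of a form feed as page separator, the restart of paragraph numbering on each page, and the restart of sentence numbering in each paragraph are assumptions made for this illustration; real attachments would need a format-specific document structure parser.

```python
import re

PAGE_SEPARATOR = "\f"  # assumption: a form feed marks a page boundary
PARAGRAPH_TERMINATOR = re.compile(r"\n\s*\n")  # double line feed
# Split after a sentence terminator (Western or Chinese), dropping whitespace.
SENTENCE_TERMINATOR = re.compile(r"(?<=[.?!。？！])\s*")


def three_level_slice(text):
    """Return a list of (identifier, sentence) pairs such as ('P2-S3-L5', ...)."""
    slices = []
    for p, page in enumerate(text.split(PAGE_SEPARATOR), start=1):
        paragraphs = [x for x in PARAGRAPH_TERMINATOR.split(page) if x.strip()]
        for s, paragraph in enumerate(paragraphs, start=1):
            sentences = [x.strip() for x in SENTENCE_TERMINATOR.split(paragraph)
                         if x.strip()]
            for l, sentence in enumerate(sentences, start=1):
                slices.append((f"P{p}-S{s}-L{l}", sentence))
    return slices
```

Note that this simple terminator pattern would also split on the period in a decimal number such as 3.5, so a production splitter would need extra rules.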

Step S12, extracting service keywords in the sentence slices by adopting a domain dictionary method based on the three-level slices, calculating the service keywords to obtain semantic concentration indexes, and judging the semantic concentration indexes to obtain semantic features of the sentence slices.

For all sentence slices of the three-level slicing, the domain dictionary method is used to identify the business keywords in each sentence slice, such as receivables, balance sheet, unreliability, and default liability.

The occurrence frequency F of each business keyword in the sentence slices and its distribution density D in the sentence slices are counted. The occurrence frequency is the number of times the same business keyword occurs across all sentence slices. The distribution density is the number of consecutive sentence slice groups in which the business keyword is located; for example, if "unreliability" appears in both the first sentence and the second sentence, these form one consecutive sentence slice group. A semantic concentration index C of the business keyword is then calculated from F, D, and the total number of sentence slices, where the total number of sentence slices is the total number of sentence slices in a single attachment. The semantic concentration index measures how important the business keyword is in the attachment file.
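
The two counts F and D can be sketched in Python as follows. The sketch stops at F and D because the formula that combines them into C is not reproduced in this text. Counting a run of a single sentence as one group is an assumption; the example in the text only shows a run of two sentences.

```python
def keyword_statistics(sentences, keyword):
    """Return (F, D) for one keyword over an ordered list of sentence slices.

    F: total number of occurrences of the keyword in all sentence slices.
    D: number of maximal runs of consecutive sentence slices that contain
       the keyword (assumption: a run of length one also counts as a group).
    """
    frequency = sum(sentence.count(keyword) for sentence in sentences)
    groups = 0
    previous_hit = False
    for sentence in sentences:
        hit = keyword in sentence
        if hit and not previous_hit:
            groups += 1  # a new consecutive group starts here
        previous_hit = hit
    return frequency, groups
```

For instance, if a keyword occurs in sentences 1, 2, and 4 of an attachment, the function reports two groups: sentences 1 and 2 form one group, and sentence 4 forms another.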

Labeled core words and non-core words are collected from historical documents by a statistical method, and a concentration threshold A is set for the semantic concentration index. When C is above the threshold A, the sentence slice corresponding to the business keyword is treated as a core semantic feature; when C is below the threshold A, the sentence slice corresponding to the business keyword is treated as an auxiliary semantic feature. A core semantic feature represents the main content and purpose of the attachment, such as a payment mode, and an auxiliary semantic feature represents background information of the attachment, such as a company name. The core semantic features and the auxiliary semantic features together form the semantic features.

And S13, mapping the semantic features of the sentence slices to a semantic feature set space, and generating semantic fingerprint codes by applying a local sensitive hash algorithm.

The semantic feature set space comprises four subspaces, namely a document type subspace, a content theme subspace, a structure subspace, and a style subspace. The document type subspace encodes the category attribute of the document, such as contract, report, or invoice. The content theme subspace encodes the specific content of the document, such as financial data or project progress. The structure subspace encodes the organization of the document, such as the number of chapters and the distribution of paragraphs. The style subspace encodes the writing style of the document and is used to identify the writing scene, such as a contract in the legal field.

The semantic features are analyzed with the domain dictionary method to obtain their attribute types, each semantic feature is assigned to a subspace according to its attribute type, and a mapping value is assigned within that subspace. The mapping value is set according to the semantic concentration index: when the semantic feature is a core semantic feature, the mapping value is the semantic concentration index; when it is an auxiliary semantic feature, the mapping value is the product of the semantic concentration index and an attenuation coefficient. The attenuation coefficient is obtained by cross-validation on historical documents of each field, taking the value that gives the highest accuracy and the best resistance to interference. When several semantic features are mapped to the same dimension, they are combined by a weighted average, with the weights determined from the metadata: the earlier the time, the higher the weight of the semantic feature.
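
The mapping rule and the weighted average can be sketched as two small Python functions. The attenuation coefficient and the weights are passed in as plain numbers here, which is an assumption: the text derives the coefficient from historical documents and the weights from attachment metadata, and neither derivation is shown in this sketch.

```python
def mapping_value(concentration, is_core, attenuation):
    """Core features keep C; auxiliary features are damped by the coefficient."""
    return concentration if is_core else concentration * attenuation


def merge_dimension(values_and_weights):
    """Weighted average of several mapping values that share one dimension.

    values_and_weights: list of (mapping_value, weight) pairs; the weights are
    assumed to be precomputed from metadata by the caller.
    """
    total_weight = sum(weight for _, weight in values_and_weights)
    if total_weight == 0:
        raise ValueError("at least one weight must be non-zero")
    return sum(value * weight for value, weight in values_and_weights) / total_weight
```

With an assumed attenuation coefficient of 0.5, an auxiliary feature with C = 0.4 maps to 0.2, while a core feature with C = 0.8 maps to 0.8 unchanged.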

The semantic feature set space is converted into a semantic fingerprint code by a locality-sensitive hashing algorithm. The semantic fingerprint code comprises a document type code, a content theme code, a structure feature code, and a style feature code, which correspond one-to-one to the four subspaces, so that partial matching can be performed according to different requirements and query efficiency is improved.
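
A minimal Python sketch of a per-subspace fingerprint follows. The text does not name a specific locality-sensitive hash family, so this sketch swaps in random-hyperplane hashing (sign of random projections) as one common choice; the subspace key names, the 16-bit code length, and the seeding scheme are hypothetical.

```python
import random

# Hypothetical key names for the four subspaces.
SUBSPACE_ORDER = ("document_type", "content_theme", "structure", "style")


def hyperplane_lsh(vector, bits=16, seed=0):
    """Random-hyperplane LSH: one bit per random projection sign."""
    rng = random.Random(seed)  # fixed seed so equal inputs give equal codes
    code = 0
    for _ in range(bits):
        plane = [rng.gauss(0.0, 1.0) for _ in vector]
        dot = sum(p * v for p, v in zip(plane, vector))
        code = (code << 1) | (1 if dot >= 0 else 0)
    return code


def semantic_fingerprint(subspaces, bits=16):
    """Return one code per subspace so that each part can be matched alone."""
    return tuple(hyperplane_lsh(subspaces[name], bits, seed=index)
                 for index, name in enumerate(SUBSPACE_ORDER))


def hamming_distance(code_a, code_b):
    """Number of differing bits; small distances suggest similar vectors."""
    return bin(code_a ^ code_b).count("1")
```

Because each subspace has its own code, a query for "documents of the same type" only needs to compare the first component of two fingerprints, which is the partial matching the paragraph above describes.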

Through three-level slicing, semantic features, the semantic feature set space, and semantic fingerprint coding, step S10 addresses the current attachment classification problem and converts unstructured attachments into a structured semantic representation. The three-level slicing serves not only to locate content but also carries implicit structural information about the document: the organization of the document can be inferred from the distribution pattern of the identifiers. For example, frequent changes of the paragraph serial number in the identifiers indicate short paragraphs, which suggests a clause-based document such as a contract. The semantic concentration index is robust to noise when distinguishing core features, so the core business content can still be identified accurately even when the document contains a large amount of irrelevant content. The semantic fingerprint coding also brings advantages in storage and transmission.

And step S20, identifying the semantic fingerprint codes and the three-level slices to obtain three-level areas. And obtaining a structured feature set according to the three-level region, and converting the structured feature set to obtain a fused structured feature set.

Further, as shown in fig. 3, step S20 includes:

And S21, establishing a modal identification algorithm, and identifying semantic fingerprint codes and three-level slices to obtain three-level areas.

A modal identification algorithm is established, comprising text modal identification, image modal identification, and table modal identification. A target slice is selected from the three-level slices according to the required precision; for example, a page slice is selected when a macroscopic judgment is needed. For text modal identification, the target slice is examined with optical character recognition and document structure analysis to obtain the number of recognizable characters and the total number of pixels. The number of recognizable characters is divided by the total number of pixels to obtain a text modal ratio, and the target slice is judged to be a text region only when the text modal ratio is greater than a text modal threshold, which avoids misjudging a non-text target slice as a text region. The text modal threshold is determined from the character density data of labeled text regions and non-text regions in historical documents.

Image modal identification extracts edge pixels from the target slice with an edge detection algorithm and divides the number of edge pixels by the total number of pixels of the target slice to obtain the edge density, which reflects how sparsely edges are distributed in the region. The color entropy value of the target slice is calculated with the information entropy formula. The target slice is judged to be an image region only when the color entropy value is greater than a color threshold and the edge density is smaller than an edge threshold, which excludes interference from text and tables. The edge threshold is set according to the edge density data of labeled image regions and non-image regions in historical documents, and the color threshold is set according to the color entropy data of labeled image regions and non-image regions in historical documents.
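
The two measurements used for image modal identification can be sketched in plain Python on a grayscale pixel grid. The entropy is the standard Shannon entropy of the pixel value histogram. The edge test is a simple neighbor-difference check standing in for the edge detection algorithm, which the text does not name; the gradient threshold of 32 and both decision thresholds are assumptions.

```python
import math
from collections import Counter


def color_entropy(pixels):
    """Shannon entropy (in bits) of the pixel value histogram."""
    flat = [value for row in pixels for value in row]
    total = len(flat)
    return -sum((count / total) * math.log2(count / total)
                for count in Counter(flat).values())


def edge_density(pixels, gradient_threshold=32):
    """Share of pixels whose right/down difference exceeds the threshold."""
    height, width = len(pixels), len(pixels[0])
    edges = 0
    for y in range(height):
        for x in range(width):
            right = pixels[y][min(x + 1, width - 1)]
            down = pixels[min(y + 1, height - 1)][x]
            gradient = abs(right - pixels[y][x]) + abs(down - pixels[y][x])
            if gradient > gradient_threshold:
                edges += 1
    return edges / (height * width)


def is_image_region(pixels, color_threshold, edge_threshold):
    """Image region: high color entropy together with low edge density."""
    return (color_entropy(pixels) > color_threshold
            and edge_density(pixels) < edge_threshold)
```

A uniform slice has zero entropy and zero edge density, so it fails the entropy condition and is not classified as an image region.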

Table modal identification detects straight lines in the horizontal and vertical directions of the target slice by the Hough transform, so that only straight lines are judged and interference from images and text is excluded. The detected lines form a regular grid, and the target slice is judged to be a table region only when the regular grid exceeds a grid threshold. The grid threshold is set according to the grid structure data of labeled table regions and non-table regions in historical documents.

And S22, extracting the three-level region to obtain a structured feature set.

And extracting five dimensions of the text region to form a text structured feature set. The five dimensions comprise a subject word set, emotion tendency values, term density, sentence complexity and paragraph structure.

The content of the text region is processed with the TF-IDF algorithm to obtain candidate core words. Each core word receives a differentiated position weight according to where it appears, with weights assigned from high to low to the title, the first paragraph, the last paragraph, and other positions, and the TF-IDF values are calculated to form the subject word set. The emotion tendency value is calculated with a hybrid dictionary-and-rule method and identifies the emotional attitude of the content of the text region. The term density is obtained by identifying professional terms in the content with a domain dictionary and measuring their density, which indicates how specialized the content is. The sentence complexity is obtained through syntactic analysis and reflects the complexity of the sentence structure of the content. The paragraph structure is obtained by statistical analysis of the independent paragraphs in the text region and characterizes how the paragraphs in the text region are organized.
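
The position-weighted TF-IDF step can be sketched as follows. The numeric position weights, the way they scale the term counts, and the smoothed inverse document frequency are assumptions for illustration; the text only fixes the ordering title, first paragraph, last paragraph, other positions.

```python
import math
from collections import Counter

# Assumed weights; only their ordering is taken from the text.
POSITION_WEIGHTS = {"title": 3.0, "first": 2.0, "last": 1.5, "other": 1.0}


def weighted_tf_idf(tokens_by_position, corpus):
    """Score the terms of one document.

    tokens_by_position: dict mapping a position name to a list of tokens.
    corpus: list of token sets, one per reference document.
    """
    weighted_counts = Counter()
    for position, tokens in tokens_by_position.items():
        for token in tokens:
            weighted_counts[token] += POSITION_WEIGHTS[position]
    total = sum(weighted_counts.values())
    n_docs = len(corpus)
    scores = {}
    for term, weight in weighted_counts.items():
        doc_freq = sum(1 for doc in corpus if term in doc)
        # Smoothed IDF so that unseen or ubiquitous terms stay finite.
        idf = math.log((1 + n_docs) / (1 + doc_freq)) + 1.0
        scores[term] = (weight / total) * idf
    return scores
```

Terms that appear in the title are lifted above equally rare terms that appear only in the body, which is the effect the differentiated position weights are meant to achieve.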

The image structured feature set comprises chart-class features, photo-class features, and icon-class features. The image content of the image area is judged by a computer vision tool to obtain the image type, such as chart, photo, or icon. A clustering algorithm, a contrast formula, and the variance of the Laplacian operator are then applied to obtain the main tone, the contrast, and the definition of the image, where the main tone represents the core color attribute of the image, the contrast represents the degree of brightness difference in the image, and the definition represents the sharpness of the image details.
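The contrast and definition measures can be sketched on a small grayscale grid without any imaging library. The text does not say which contrast formula is used, so Michelson contrast is an assumption here, as is the 4-neighbour Laplacian kernel; the main tone obtained by clustering is not shown.

```python
from statistics import pvariance

def laplacian_variance(gray):
    """Variance of the 4-neighbour Laplacian over the interior pixels.

    gray: 2D list of grayscale values, at least 3 x 3 pixels.
    A higher variance indicates sharper detail (higher definition).
    """
    h, w = len(gray), len(gray[0])
    responses = [
        gray[y - 1][x] + gray[y + 1][x] + gray[y][x - 1] + gray[y][x + 1]
        - 4 * gray[y][x]
        for y in range(1, h - 1)
        for x in range(1, w - 1)
    ]
    return pvariance(responses)

def michelson_contrast(gray):
    """Contrast as (max - min) / (max + min); an assumed choice of formula."""
    pixels = [p for row in gray for p in row]
    lo, hi = min(pixels), max(pixels)
    return 0.0 if hi + lo == 0 else (hi - lo) / (hi + lo)
```

A uniform image yields zero for both measures, while a checkerboard of black and white pixels yields a large Laplacian variance and a contrast of 1.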

The table structured feature set comprises table structure features, data attribute features, and business rule features. A table parsing tool extracts basic structure information from the table area, and the basic structure information is split to obtain the table structure features, which include the number of rows, the number of columns, the row-to-column ratio, and the proportion of merged cells. The data attribute features are obtained by regular-expression matching statistics and include the number and proportion of numerical cells, the number and proportion of text cells, and the distribution of numerical types. The business rule features include the keyword matching degree, obtained through a domain dictionary, and the number of summary rows. The keyword matching degree represents how closely the table content is associated with a specific business domain, and the number of summary rows reflects the summarizing and analytical nature of the table data.
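A minimal sketch of these table features follows, assuming the table has already been parsed into a list of rows of cell strings. The regular expression `NUMERIC`, the summary markers, and the feature names are illustrative assumptions; merged cells and numerical type distribution are not covered because a plain list of rows does not carry that information.

```python
import re

# Matches plain numbers such as "300", "1,200.50", "-12", or "15%".
NUMERIC = re.compile(r"^[+-]?\d[\d,]*(\.\d+)?%?$")

def table_features(rows, domain_terms, summary_markers=("total", "subtotal")):
    """Compute simple structure, data attribute, and business rule features.

    rows: list of rows, each a list of cell strings (hypothetical parser output).
    domain_terms: lowercase keywords taken from a domain dictionary.
    """
    cells = [c.strip() for row in rows for c in row]
    numeric = [c for c in cells if NUMERIC.match(c)]
    text = [c.lower() for c in cells if c and not NUMERIC.match(c)]
    n_rows = len(rows)
    n_cols = max(len(row) for row in rows)
    keyword_hits = sum(1 for c in text if any(t in c for t in domain_terms))
    summary_rows = sum(
        1 for row in rows
        if any(m in c.lower() for c in row for m in summary_markers)
    )
    return {
        "rows": n_rows,
        "cols": n_cols,
        "row_col_ratio": n_rows / n_cols,
        "numeric_cells": len(numeric),
        "numeric_ratio": len(numeric) / len(cells),
        "text_cells": len(text),
        "text_ratio": len(text) / len(cells),
        "keyword_match": keyword_hits / len(text) if text else 0.0,
        "summary_rows": summary_rows,
    }
```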

The table structured feature set, the image structured feature set, and the text structured feature set constitute a structured feature set.

Step S23, converting the structured feature set by a normalization method to obtain a fused structured feature set.

Quantitative indicators in the structured feature set, such as the image contrast and the proportion of numerical cells in a table, are mapped to the [0,1] interval by a normalization method to obtain the text label density normalized value, the numerical cell proportion normalized value, and the definition normalized value. Non-quantitative indicators are converted into numerical form by a label encoding method. This avoids differences in numerical range. The text label density normalized value represents the relative density of text labels in an image: the closer it is to 1, the higher the proportion of text labels in the image, and the closer it is to 0, the lower that proportion. The numerical cell proportion normalized value represents the relative proportion of numerical cells in the table: the closer it is to 1, the more the table is dominated by numerical data, and the closer it is to 0, the more it is dominated by text description cells. The definition normalized value represents the relative sharpness of the image: the closer it is to 1, the richer the image detail, and the closer it is to 0, the more blurred the image.
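The two conversions can be sketched as follows. The text names only "a normalization method", so min-max scaling is an assumption, and the helper names `min_max` and `label_encode` are hypothetical.

```python
def min_max(values):
    """Map a list of numbers to [0, 1] by min-max scaling (assumed method)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]  # no spread, so every value maps to 0
    return [(v - lo) / (hi - lo) for v in values]

def label_encode(labels):
    """Convert category labels to integers, in sorted label order.

    Returns the encoded list and the label-to-integer mapping.
    """
    mapping = {label: i for i, label in enumerate(sorted(set(labels)))}
    return [mapping[label] for label in labels], mapping
```

For example, `min_max([10, 20, 30])` gives `[0.0, 0.5, 1.0]`, and the image types `["photo", "chart", "icon", "chart"]` are encoded as `[2, 0, 1, 0]`.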

Based on the structured feature set, the association strength value is calculated for each pair among the text, the image, and the table.

The association strength value between the text and the image is obtained as follows. The subject word set and the image type are converted by a word set model into semantic sets, each a numerical array of fixed dimension, and the cosine of the angle between the two semantic sets is calculated. The cosine value lies in the range [-1, 1], and a value less than or equal to 0 indicates weak semantic association between the two semantic sets. The text label content in the image area is extracted by optical character recognition, the number of text label pixels and the total number of image pixels are counted, and the ratio of the two gives the original text label density value, which is normalized to obtain the text label density normalized value. Multiplying the cosine value by the text label density normalized value gives the association strength value between the text and the image. Combining the cosine value with the text label density normalized value allows the closeness of the association between the text and the image to be judged accurately at the semantic level and excludes interference from images that have no text labels or whose labels do not match.
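The product described above can be sketched with plain lists standing in for the two semantic sets. The vectors would come from an embedding model that is not shown; the bounds `d_min` and `d_max` used to normalize the density are hypothetical parameters, since the text does not give the normalization range.

```python
import math

def cosine(a, b):
    """Cosine of the angle between two equal-length vectors, in [-1, 1]."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 0.0 if norm_a == 0 or norm_b == 0 else dot / (norm_a * norm_b)

def text_image_strength(text_vec, image_vec, label_pixels, total_pixels,
                        d_min=0.0, d_max=1.0):
    """Cosine similarity multiplied by the normalized text label density."""
    density = label_pixels / total_pixels            # original density value
    density_norm = (density - d_min) / (d_max - d_min)
    density_norm = min(1.0, max(0.0, density_norm))  # clamp to [0, 1]
    return cosine(text_vec, image_vec) * density_norm
```

An image with no text labels yields a strength of 0 regardless of the cosine value, which matches the stated aim of excluding unlabeled images.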

The association strength value between the text and the table is obtained as follows. The number of numerical matches between the text structured feature set and the table structured feature set is counted; for example, a match is counted when 500,000 yuan is mentioned in the text structured feature set and 500,000 yuan also appears in the table of the table structured feature set. The ratio of the number of numerical matches to the total number of numerical values in the text gives the numerical match ratio. The total number of cells and the number of numerical cells in the table area are extracted by a table parsing tool, and the ratio of the number of numerical cells to the total number of cells gives the original numerical cell proportion value, which is normalized to obtain the numerical cell proportion normalized value. Multiplying the numerical cell proportion normalized value by the numerical match ratio gives the association strength value between the text and the table. Combining the two values effectively verifies the consistency of the text and the table at the data level and avoids the influence of tables with a low proportion of data or with unmatched values.
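A minimal sketch of this calculation follows, assuming the body text is a string and the table is a list of rows of cell strings. The number pattern `NUMBER` is an illustrative assumption, and the raw numerical cell proportion is used directly as its normalized value because it already lies in [0, 1]; the source may normalize it differently.

```python
import re

# Digits with optional thousands separators and an optional decimal part.
NUMBER = re.compile(r"\d[\d,]*(?:\.\d+)?")

def parse_numbers(text):
    """Return every number found in the text as a float."""
    return [float(tok.replace(",", "")) for tok in NUMBER.findall(text)]

def text_table_strength(text, table_rows):
    """Numerical match ratio multiplied by the numerical cell proportion."""
    text_values = parse_numbers(text)
    cells = [c.strip() for row in table_rows for c in row]
    numeric_cells = [c for c in cells if NUMBER.fullmatch(c)]
    if not text_values or not cells:
        return 0.0
    table_values = {float(c.replace(",", "")) for c in numeric_cells}
    matches = sum(1 for v in text_values if v in table_values)
    match_ratio = matches / len(text_values)
    numeric_ratio = len(numeric_cells) / len(cells)
    return match_ratio * numeric_ratio
```

With a body text that mentions 500,000 and 20,000 and a six-cell table containing 500,000 and 30,000, one of two text values matches and two of six cells are numerical, so the strength is 0.5 multiplied by 1/3.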

The association strength value between the image and the table is obtained as follows. For the image types in the image structured feature set, row and column dividing lines are detected by the Hough transform to obtain the visual row and column counts, which represent the row and column layout of a table screenshot as it appears visually; for example, if a screenshot is detected to contain 5 rows and 3 columns, the visual row and column counts are 5 rows and 3 columns. The fitness is calculated from the visual row and column counts and the basic structure information in the table structured feature set according to the fitness formula [fitness formula missing from the source text]. The image area is processed with the Laplacian operator to obtain the Laplacian variance, which is the original definition value, and the original definition value is normalized to obtain the definition normalized value. The product of the fitness and the definition normalized value is the association strength value between the image and the table. Combining the fitness with the definition normalized value ensures that the association strength value is calculated only from images and tables that match reliably both visually and structurally, which avoids association bias caused by blurred images or structural misalignment.

The association strength values among the three form the fused structured feature set.

Step S20 solves the problem of insufficient multi-dimensional understanding of attachment content through the modal identification algorithm, the three-level region, the structured feature set, and the fused structured feature set, and converts unstructured attachment layout information into structured fused features. The modal identification algorithm not only distinguishes the text, image, and table modes but also accurately excludes interference from non-target modes, and it ensures the accuracy of region identification through threshold judgment; for example, the text mode excludes non-text regions by the text mode ratio, and the image mode excludes text and table interference through the dual judgment of color entropy and edge density. The structured feature set mines the information unique to each mode during feature extraction, so that even when an attachment contains multiple modes the core features of each mode can still be extracted completely. The fused structured feature set improves the efficiency of subsequent classification calculation and the accuracy of association analysis, and supports deep understanding of multi-modal attachments.

Step S30, extracting a context element set according to the original data stream of the email attachment, and calculating the context element set to obtain a total association strength value, and formulating a directed association set for the total association strength value.

Further, as shown in fig. 4, step S30 includes:

Step S31, extracting a set of context elements based on the email attachment original data stream.

Intent verb recognition is performed on the content subject data of the email attachment original data stream to obtain a basic verb set and an intent classification. The basic verb set comprises all-scenario general action verbs and cross-scenario basic processing verbs. All-scenario general action verbs denote the most basic actions, such as submit, send, receive, and view. Cross-scenario basic processing verbs denote verbs that carry basic processing functions in multiple business scenarios, such as approve, verify, and summarize. The intent classification includes the instruction class, the information class, and the request class. The instruction class indicates that the action has an explicit target and must be executed, for example approve or sign. The information class indicates that the action conveys information and requires no execution, for example for reference or for information. The request class indicates that the action initiates a demand to which others must respond, for example assist or schedule. Using the data extracted from the email attachment original data stream, the attachment file is analyzed in terms of three elements: text semantic elements, personnel relationship network elements, and time pattern feature elements.
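Intent verb recognition could be approximated by a dictionary lookup, as in the sketch below. The verb lists repeat the examples given above in English; the lookup approach, the mapping `INTENT_BY_VERB`, and the function name `recognize_intent` are assumptions, and a real system would more likely use a tokenizer and a trained tagger.

```python
import re

# Verb sets taken from the examples in the text (English stand-ins).
GENERAL_VERBS = {"submit", "send", "receive", "view"}
PROCESSING_VERBS = {"approve", "verify", "summarize"}

# Hypothetical mapping from a verb to its intent class.
INTENT_BY_VERB = {
    "approve": "instruction", "sign": "instruction",
    "refer": "information", "note": "information",
    "assist": "request", "schedule": "request",
}
KNOWN_VERBS = GENERAL_VERBS | PROCESSING_VERBS | set(INTENT_BY_VERB)

def recognize_intent(subject):
    """Return the known verbs in a subject line and their intent classes."""
    tokens = re.findall(r"[a-z]+", subject.lower())
    verbs = [t for t in tokens if t in KNOWN_VERBS]
    intents = sorted({INTENT_BY_VERB[v] for v in verbs if v in INTENT_BY_VERB})
    return verbs, intents
```

For the subject "Please approve and send the reimbursement form", the sketch returns the verbs `["approve", "send"]` and the intent class `["instruction"]`.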

In the text semantic element analysis, the semantic feature set and the text structured feature set are combined to determine the business scenario. The basic type is identified through the file type subspace in the semantic feature set, and a business scenario is preliminarily matched; for example, if the basic type is a contract-type attachment, the preliminary match is a legal signing scenario. The preliminarily matched business scenario is then verified a second time against the professional term density in the text structured feature set: if the professional term density is greater than a preset term density threshold, the specific business scenario is confirmed, and the scenario is associated with the verbs. The term density threshold is set based on the total number of words identified as professional terms in the attachment. The text semantic element analysis yields the basic verb set, the intent classification, and the association between verbs and scenarios.

In the personnel relationship network element analysis, the mailbox domain name associated with the attachment file is extracted with a regular expression to determine the organizational affiliation. The personnel identity attributes for that organization, for example department, job level, and reporting relationship, are obtained through the query interface of an organizational structure database and are used to generate personnel collaboration relationships with type labels, such as superior to subordinate, subordinate to superior cooperation, and external business. The personnel relationship network element analysis yields the personnel identity attributes and the personnel collaboration relationships.
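The domain extraction step can be sketched with a regular expression. The pattern `EMAIL`, the function name `organization_of`, and the idea of comparing against a set of internal domains are illustrative assumptions; the lookup against the organizational structure database is not shown.

```python
import re

# Captures the domain part of the first email address found in a string.
EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@([A-Za-z0-9.-]+\.[A-Za-z]{2,})")

def organization_of(address, internal_domains):
    """Return (domain, affiliation) for a sender or recipient address.

    internal_domains: set of lowercase domains owned by the organization
    (a hypothetical stand-in for the organizational structure database).
    """
    match = EMAIL.search(address)
    if match is None:
        return None, "unknown"
    domain = match.group(1).lower()
    affiliation = "internal" if domain in internal_domains else "external"
    return domain, affiliation
```

An external affiliation would then suggest the external business label, while internal addresses would be resolved further into department, job level, and reporting relationship.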

In the time pattern feature element analysis, the basic information of the mail is extracted from the metadata of the email attachment original data stream, such as the sending timestamp, sender address, recipient list, copy list, mail subject, body content, and mail priority flag. The time dimension feature is determined according to a predefined time interval, the business cycle and business deadline are analyzed according to industry characteristics, and a time sensitivity marker is generated according to the time distance and the business importance. The predefined time interval is determined by business requirements; for example, to process monthly sales data efficiently, the finance department of an e-commerce business must complete the previous month's sales reconciliation and refund statistics before the 5th of the following month, so the predefined time interval is set to the 1st through the 5th of each month. The time pattern feature element analysis yields the basic time information, the predefined time interval, and the time sensitivity marker.

The data from the three element analyses are combined into a set of contextual elements.

Step S32, calculating the context element set to obtain a total association strength value.

Calculating the content relevance: the intent verb and the business scenario are converted into a set representation, for example approval and a financial reimbursement form are converted into the set representation [approval-financial scenario], which is converted by a pre-trained word set model, for example a BERT model, into a vector set in a computable structured form. The cosine of the angle between the two elements of the vector set is calculated. The cosine value ranges from -1 to 1; a value greater than 0 indicates higher semantic similarity between the two vectors, and a value less than or equal to 0 indicates lower semantic similarity.

Calculating the personnel relevance: the specific work scenario of the personnel and their historical collaboration relationships are obtained from the personnel identity attributes, and a base score is set according to the degree of relevance between the body text and the attachment for each relationship type. From high to low, the degrees of relevance are strongly related, moderately related, weakly related, and unrelated. Strongly related means that the attachment fulfills the purpose of the body text; for example, the body text asks for approval of a supplier payment and the attachment is the supplier payment application form. Moderately related means that the body text and the attachment can be cross-referenced; for example, the body text asks to check a budget deviation and the attachment is a budget execution comparison table. Weakly related means that the body text and the attachment are only indirectly related; for example, the body text concerns synchronizing project documents and the attachment is the project team address book. Unrelated means that the body text and the attachment have no relation; for example, the body text concerns employee training and the attachment is a customer return detail list.

Calculating the time relevance: based on business rule analysis and on analysis of large-scale historical data, each scenario is matched with a specific time interval according to the inherent time patterns of different business scenarios, and a time relevance score in the range 0 to 1 is set in combination with the business urgency of the scenario. For example, in the financial scenario the month-end closing reimbursement approval period is the last 3 working days of the month; if the time falls within the last three working days of the month, the match succeeds, otherwise the match fails. The business urgency is set according to the time constraint for achieving the business objective and the impact of missing the deadline.
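The month-end example can be sketched with the standard library. Treating Monday to Friday as working days and ignoring public holidays is an assumption, as is returning the urgency value itself as the score on a successful match; the text only states that the score lies between 0 and 1.

```python
import calendar
from datetime import date

def last_working_days(year, month, n=3):
    """Return the last n working days of a month as a set of dates.

    Working days are assumed to be Monday to Friday; holidays are ignored.
    """
    last_day = calendar.monthrange(year, month)[1]
    days = [date(year, month, d) for d in range(last_day, 0, -1)]
    working = [d for d in days if d.weekday() < 5]
    return set(working[:n])

def time_relevance(sent, urgency=1.0):
    """Score in [0, 1]: the urgency if the date matches, otherwise 0."""
    matched = sent in last_working_days(sent.year, sent.month)
    return urgency if matched else 0.0
```

For September 2025 the last three working days are 26, 29, and 30 September, so a mail sent on 26 September matches while one sent on 25 September does not.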

The total association strength value is calculated by weighting the content relevance, the personnel relevance, and the time relevance, and its value range is [0,1]. The weight coefficients of the content relevance, the personnel relevance, and the time relevance are each set according to the misjudgment cost of the corresponding relevance.
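A weighted combination of this kind can be sketched as below. The weight values, the requirement that they sum to 1, and the clamping of each input to [0, 1] (which discards negative cosine values) are assumptions chosen so that the result stays within the stated range; the original formula and symbols are not available in this text.

```python
def total_association_strength(content, personnel, time, weights=(0.5, 0.3, 0.2)):
    """Weighted sum of the three relevance values, kept within [0, 1].

    weights: hypothetical coefficients for content, personnel, and time;
    in the described method they follow the misjudgment cost of each one.
    """
    if abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    # Clamp each relevance to [0, 1]; an assumed way to bound the result.
    values = [min(1.0, max(0.0, v)) for v in (content, personnel, time)]
    return sum(w * v for w, v in zip(weights, values))
```

With a content relevance of 0.8, a personnel relevance of 0.5, and a time relevance of 1.0, the default weights give 0.4 + 0.15 + 0.2 = 0.75.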

Step S33, formulating a hierarchical decision rule for the total association strength value, and constructing a directed association set.

The directed association set is formulated according to the total association strength value and comprises a strong association scenario, a medium association scenario, and a weak association scenario. When the total association strength value is smaller than the median of the value range, the scenario is a weak association scenario; when the total association strength value is greater than or equal to the median of the value range and smaller than the strong association threshold, the scenario is a medium association scenario; and when the total association strength value is greater than the strong association threshold, the scenario is a strong association scenario. The strong association threshold is set according to industry standards and the risk tolerance of the actual business.
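The three-way rule can be sketched as a small function. The threshold value 0.8 is a hypothetical default, and because the rule as stated does not cover a value exactly equal to the strong association threshold, the sketch assigns that case to the strong association scenario as an assumption.

```python
def association_scenario(total, strong_threshold=0.8, value_range=(0.0, 1.0)):
    """Classify a total association strength value as weak, medium, or strong."""
    low, high = value_range
    median = (low + high) / 2  # midpoint of the value range
    if total < median:
        return "weak"
    if total < strong_threshold:
        return "medium"
    return "strong"  # includes total == strong_threshold (assumption)
```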

In a strong association scenario, when a decision is required and both a context inference conclusion and other single, isolated information are available, the context inference result is adopted.

In a medium association scenario, the conclusions output by context inference and by content analysis are compared by weight according to the total association strength value, and the conclusion with the higher confidence weight is adopted. For context inference, the weight is set mainly according to context consistency and information completeness: if the context includes attachment information that forms a logical closed loop, a high weight is assigned, for example a coherent flow from purchase application to supplier quotation to contract confirmation. For the conclusion output by content analysis, the weight is set according to the feature matching degree and semantic consistency: for example, if the keywords extracted from the content are value-added tax and reimbursement items and the financial report format matches the identified scenario, a high weight can be given, whereas if only scattered generic words such as file and data are matched, the weight is low.

In weakly associated scenarios, the attachment content analysis is primarily relied upon.
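The three decision strategies can be summarized in one sketch. Representing each conclusion as a (label, weight) pair and resolving ties in favor of context inference are assumptions made for illustration; how the weights themselves are derived is described above and not reproduced here.

```python
def decide(scenario, context_result, content_result):
    """Pick the classification conclusion for an attachment.

    scenario: "strong", "medium", or "weak".
    context_result, content_result: (label, weight) pairs, where the weight
    is a hypothetical confidence value in [0, 1].
    """
    if scenario == "strong":
        return context_result[0]   # context inference is adopted
    if scenario == "weak":
        return content_result[0]   # rely on attachment content analysis
    if scenario == "medium":
        # Adopt the conclusion with the higher weight; ties favor context.
        if context_result[1] >= content_result[1]:
            return context_result[0]
        return content_result[0]
    raise ValueError(f"unknown scenario: {scenario}")
```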

Step S30 solves the problems of insufficient use of context information and single-factor association judgment in attachment classification through the context element set, the total association strength value, and the directed association set, and converts isolated attachment information into multi-dimensional context association decisions. The context element set not only covers the semantics of the attachment content, personnel collaboration, and time characteristics, but also builds complete business scenario logic through association analysis among the elements, for example by matching intent verbs with business scenarios to locate the business purpose of the attachment accurately. In its calculation, the total association strength value balances the influence of the three dimensions of content, personnel, and time through differentiated weights, so that even if the information in a single dimension is biased, the accuracy of the association judgment can be ensured by the weighted calculation. The directed association set, following the hierarchical decision rule on the total association strength value, provides a targeted decision strategy for each association scenario, avoids the pitfall of one-size-fits-all classification, significantly improves the accuracy and reliability of attachment classification in complex business scenarios, and provides a scientific basis for subsequent intelligent classification decisions.

Example 2:

This embodiment provides a machine learning driven intelligent classification system for email attachments, as shown in fig. 5, based on embodiment 1, comprising:

In the semantic fingerprint module, acquiring the email attachment original data stream, performing three-level slicing on the email attachment original data stream, obtaining semantic features according to the three-level slices, mapping the semantic features to a semantic feature set space, and obtaining semantic fingerprint codes comprises:

Step S11, acquiring an email attachment original data stream, and performing three-level slicing processing on the email attachment original data stream;

In the fusion feature module, identifying the semantic fingerprint codes and the three-level slices to obtain a three-level region, obtaining a structured feature set according to the three-level region, and converting the structured feature set to obtain a fused structured feature set comprises:

Step S21, a modal identification algorithm is established, semantic fingerprint codes and three-level slices are identified, and three-level areas are obtained;

Step S22, extracting the three-level region to obtain a structured feature set;

In the directed association module, extracting a context element set according to the email attachment original data stream, calculating the context element set to obtain a total association strength value, and formulating a directed association set for the total association strength value comprises:

Step S31, extracting a context element set based on the original data stream of the email attachment;

Step S32, calculating the context element set to obtain a total association strength value;

Claims

1. A machine learning driven intelligent classification method for email attachments, the method comprising:

Extracting a context element set according to the original data stream of the email attachment, and calculating the context element set to obtain a total association strength value, and formulating a directed association set for the total association strength value;

The three-stage slice comprises:

2. The machine learning driven intelligent classification method for email attachments of claim 1, wherein the method for deriving semantic features from three-level slicing comprises:

3. The machine learning driven intelligent classification method for email attachments of claim 2, wherein the method for identifying semantic fingerprint codes and tertiary slices to obtain tertiary regions comprises:

4. The machine learning driven intelligent classification method for email attachments of claim 3 wherein said method for deriving a structured feature set from a three-level region comprises:

5. The machine learning driven intelligent classification method for email attachments of claim 4, wherein the method for transforming structured feature sets to obtain fused structured feature sets comprises:

6. The machine learning driven intelligent classification method for email attachments of claim 5, wherein the method of extracting a set of context elements from an email attachment raw data stream comprises:

7. The machine learning driven intelligent classification method for email attachments of claim 6, wherein the method for computing a set of context elements to obtain a total associated strength value comprises:

8. The machine learning driven intelligent classification method for email attachments of claim 7, wherein the directed association set comprises:

9. A machine learning driven intelligent classification system for email attachments for implementing the machine learning driven intelligent classification method for email attachments of any of claims 1-8, the system comprising: