CN120745646B

CN120745646B - Method and system for auditing network chat sensitive words based on multidimensional identification

Info

Publication number: CN120745646B
Application number: CN202511189528.2A
Authority: CN
Inventors: 朱屹涛
Original assignee: Beijing Link Star Technology Co ltd
Current assignee: Beijing Link Star Technology Co ltd
Priority date: 2025-08-25
Filing date: 2025-08-25
Publication date: 2025-12-23
Anticipated expiration: 2045-08-25
Also published as: CN120745646A

Abstract

The invention belongs to the technical field of network chat auditing and provides a method and a system for auditing network chat sensitive words based on multidimensional identification. The method comprises: acquiring multi-source data in real time, extracting core keywords, acquiring the texts containing the core keywords and the interaction data corresponding to those texts, identifying hot events by analyzing the interaction data, and extracting temporary sensitive words based on the hot events. By collecting multi-source data in real time and dynamically extracting the temporary sensitive words related to hot events, the invention mitigates the response lag of a traditional static sensitive word stock to the emerging sensitive words generated by hot events. Because the relevance score is calculated from semantic similarity and co-occurrence counts, explicit associated words are covered and clusters of implicit reference words are identified, which reduces missed judgments caused by changing word forms and improves the timeliness and comprehensiveness of sensitive word identification.

Description

Method and system for auditing network chat sensitive words based on multidimensional identification

Technical Field

The invention belongs to the technical field of network chat auditing, and particularly relates to a method and a system for auditing a network chat sensitive word based on multidimensional identification.

Background

Network chat sensitive word auditing is the process of monitoring, identifying, and processing the chat content of users on network platforms (such as social media, instant messaging tools, and forums) by technical means and rule systems, and of screening out words or expressions related to specific sensitive information, so as to regulate network communication and prevent the spread of risk. Its core aim is to balance freedom of online speech with information security and to avoid harm to society or individuals caused by the spread of sensitive content (such as illegal information and privacy disclosure);

However, multidimensional identification usually depends on the completeness of the sensitive word stock, while the dynamic nature and the boundaries of that word stock are easily overlooked. In particular, a social hot event can quickly generate new, temporary sensitive words. These words are time-limited: the names and code names of specific events in emergencies and public opinion disputes, for example, may be sensitive only while the event is developing. If word stock updates depend on manual labeling, they lag behind the speed of network propagation and cause missed judgments;

therefore, the invention provides a method and a system for auditing the network chat sensitive words based on multidimensional identification.

Disclosure of Invention

The invention aims to overcome the deficiencies of the prior art and to solve at least one of the technical problems presented in the background.

The technical scheme adopted for solving the technical problems is that the method for auditing the network chat sensitive words based on multidimensional identification comprises the following steps:

Acquiring multi-source data in real time, extracting core keywords, acquiring texts with the core keywords and interactive data corresponding to the texts, identifying hot events by analyzing the interactive data, and extracting temporary sensitive words based on the hot events;

Constructing a temporary word stock based on the temporary sensitive words, setting a scene validation rule by combining the temporary word stock and the hot event tag, and auditing the network chat;

Monitoring the heat data of the hot events in real time, analyzing the heat data and outputting a comprehensive heat value, analyzing the change of the comprehensive heat value in the two dimensions of the conventional trend and the reburning trend, and evaluating the timeliness of the temporary sensitive words according to the result of that change analysis;

And executing hierarchical adjustment on the auditing strength of the temporary sensitive words according to the timeliness evaluation result of the temporary sensitive words.

A system for auditing a network chat sensitive word based on multi-dimensional recognition, the system comprising:

The temporary sensitive word acquisition module acquires multi-source data in real time and extracts core keywords, acquires texts with the core keywords and interactive data corresponding to the texts, identifies hot events by analyzing the interactive data, and extracts temporary sensitive words based on the hot events;

The chat auditing module is used for constructing a temporary word stock based on the temporary sensitive words, setting a scene validation rule by combining the temporary word stock and the hot event label, and auditing the network chat;

The timeliness analysis module monitors the heat data of the hot events in real time, analyzes and outputs a heat comprehensive value, realizes the change analysis of the heat comprehensive value based on the double dimensionalities of the conventional trend and the reburning trend, and evaluates the timeliness of the temporary sensitive words according to the change analysis result of the heat comprehensive value;

And the auditing strength adjustment module is used for performing grading adjustment on the auditing strength of the temporary sensitive words according to the timeliness evaluation result of the temporary sensitive words.

The beneficial effects of the invention are as follows:

By collecting multi-source data in real time and dynamically extracting the temporary sensitive words related to hot events, the invention mitigates the response lag of a traditional static sensitive word stock to the emerging sensitive words generated by hot events. Because the relevance score is calculated from semantic similarity and co-occurrence counts, explicit associated words are covered and clusters of implicit reference words are identified, which reduces missed judgments caused by changing word forms and improves the timeliness and comprehensiveness of sensitive word identification;

By monitoring the heat data in real time and dynamically evaluating the timeliness of the temporary sensitive words, the invention adjusts the auditing strength according to trend changes and reburning, forming a full-cycle management mechanism. This mechanism strengthens risk prevention and control while an event is active and reduces unnecessary intervention while it declines, improving the flexibility, accuracy, and efficiency of network chat sensitive word auditing.

Drawings

The invention is further described below with reference to the accompanying drawings.

FIG. 1 is a flow chart of steps of a method for auditing a network chat sensitive word based on multi-dimensional recognition according to the present invention;

FIG. 2 is a flowchart of the steps for acquiring a hot event in a method for auditing a network chat sensitive word based on multi-dimensional recognition;

FIG. 3 is a block diagram of a system for auditing a network chat sensitive word based on multi-dimensional recognition in accordance with the present invention.

Detailed Description

The invention is further described in connection with the following detailed description in order to make the technical means, the creation characteristics, the achievement of the purpose and the effect of the invention easy to understand.

Example 1

Referring to fig. 1 and 2, the method for auditing the network chat sensitive words based on multidimensional identification according to the embodiment of the invention includes the following steps:

firstly, acquiring multi-source data in real time, analyzing the multi-source data, extracting core keywords to identify hot events, and extracting temporary sensitive words based on the hot events;

in some embodiments, it is first required to define that a hot event refers to an event or topic that is rapidly propagated through a network platform (such as social media, news websites, forums, instant messaging tools, etc.) within a certain period of time, and causes high public attention, broad discussion, and a large amount of interactions (such as praise, comments, forwarding, sharing, etc.);

in the first step, the process of collecting the multi-source data in real time is as follows:

Web crawlers and API interfaces are used to capture the data of each platform in real time, and the collected data is buffered in a message queue (such as Kafka) so that it is processed in order;

The acquisition sources of the multi-source data include, but are not limited to, social media platforms, news websites, forums, instant messaging tools;

Social media platforms: capture the high-frequency words under event-related topics and the hot words in comment areas, such as personal names, code names, and specific expressions in "#event#" topics;

News websites: capture the core entities in mainstream media reports, such as personal names, place names, and event abbreviations, and the reference words spontaneously coined by users in comment areas, such as implicit references like "a certain melon" and "that thing" in a specific event;

Instant messaging tools: select the event-related chat records in public group chats and extract new words or abnormal expressions that occur at high frequency;

in the first step, the process of extracting the core keywords is as follows:

preprocessing the acquired multi-source data, including data cleaning and word segmentation;

The data cleaning comprises removing noise such as special symbol and HTML label;

splitting the text into words, and extracting keywords by using a TF-IDF (word frequency-inverse document frequency) algorithm;

specifically, TF-IDF measures importance by the frequency of words in a single text and the scarcity of words in all texts;

TF = (number of times the word appears in the current text) / (total number of words in the text); IDF = log(total number of documents / (number of documents containing the word + 1)); TF-IDF = TF × IDF;

For a batch of texts, the candidate keywords whose TF-IDF value is greater than or equal to the TF-IDF threshold are extracted as core keywords;

The batch of texts is determined by a person skilled in the art according to experience and event characteristics, for example the set of texts from one platform within 1 hour;
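The keyword extraction described above can be sketched as follows. This is a minimal illustration, not the patented implementation: it assumes the texts are already cleaned and segmented into word lists, it reads the IDF expression as log(N / (df + 1)), and the function names `tf_idf` and `core_keywords` are hypothetical.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {word: TF-IDF score} dict per tokenized document."""
    n_docs = len(docs)
    # Document frequency: in how many documents each word appears.
    df = Counter(word for doc in docs for word in set(doc))
    scores = []
    for doc in docs:
        counts = Counter(doc)
        total = len(doc)
        scores.append({
            # TF = count / total words; IDF = log(N / (df + 1)) (assumed reading).
            word: (count / total) * math.log(n_docs / (df[word] + 1))
            for word, count in counts.items()
        })
    return scores


def core_keywords(docs, threshold):
    """Collect words whose TF-IDF reaches the threshold in any document of the batch."""
    selected = set()
    for doc_scores in tf_idf(docs):
        selected.update(w for w, s in doc_scores.items() if s >= threshold)
    return selected


batch = [
    ["leak", "plant", "leak"],
    ["weather", "plant"],
    ["weather", "sunny"],
    ["plant", "news"],
]
print(core_keywords(batch, threshold=0.4))
```

With this smoothing, a word that appears in almost every document of the batch gets an IDF near zero or below, so it never reaches a positive threshold.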

in the first step, the process of identifying the hot event is:

extracting relevant texts with core keywords, and obtaining interaction data of each text;

The interaction data comprises the like count, comment count, forwarding count, sharing count, and collection count;

The like, comment, forwarding, sharing, and collection counts are each summed over all the texts, and the five sums are averaged to obtain the absolute interaction amount;

The ratio of the interaction amount of the current period to the interaction amount of the previous period is taken as the relative interaction amount;

An absolute interaction amount threshold and a relative interaction amount threshold are set, and the texts whose absolute interaction amount is greater than the absolute threshold and whose relative interaction amount is greater than the relative threshold are extracted as a hot event;

The two thresholds are set by a person skilled in the art according to experience and event characteristics, and the period is preset by a person skilled in the art;

Illustratively, take a social event as an example:

The HTML tag <p> and the special symbol "#" are removed from the news comments, and word segmentation yields keywords such as "event", "related enterprise", "potential safety hazard", and "investigation";

The TF-IDF values of "related enterprise" and "potential safety hazard" are 0.45 and 0.42 respectively, and both are selected as core keywords;

All texts containing the core keywords are extracted; the total like count is 325678, the total comment count is 128905, the total forwarding count is 876543, the total sharing count is 456789, and the total collection count is 98765;

The absolute interaction amount is (325678 + 128905 + 876543 + 456789 + 98765) / 5 = 377336. If the average interaction amount of similar events in the previous period (yesterday) is 150000, the relative interaction amount is 377336 / 150000 ≈ 2.5156. Since the absolute interaction amount (377336) exceeds the absolute threshold (100000) and the relative interaction amount (2.5156) exceeds the relative threshold (2.0), both thresholds are met and the event is identified as a hot event;
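The arithmetic of this example can be checked with a short sketch. The function names are hypothetical; the figures and thresholds are the ones given in the example.

```python
def absolute_interaction(likes, comments, forwards, shares, collections):
    """Mean of the five summed interaction counts."""
    return (likes + comments + forwards + shares + collections) / 5


def is_hot_event(current, previous, abs_threshold, rel_threshold):
    """Apply both thresholds; returns (is_hot, relative_interaction)."""
    relative = current / previous
    return current > abs_threshold and relative > rel_threshold, relative


absolute = absolute_interaction(325678, 128905, 876543, 456789, 98765)
hot, relative = is_hot_event(absolute, previous=150000,
                             abs_threshold=100000, rel_threshold=2.0)
print(absolute, round(relative, 4), hot)
```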

In step one, the process of extracting the temporary sensitive word based on the hot event is as follows:

taking the identified hot event as a core, and extracting all texts containing core keywords of the hot event;

Preprocessing the screened text, including data cleaning and word segmentation;

And (3) cleaning data, namely removing noise in the text, such as special symbols and HTML labels, and ensuring the purity of the text.

Word segmentation, namely splitting a text into independent words for extracting sensitive words subsequently;

Words highly related to the hot event are extracted from the preprocessed text as temporary sensitive words, including but not limited to the specific names, code names, disputed expressions, and implicit reference words generated while the event develops;

For example, an offensive nickname or an event-related implicit reference word extracted from the texts of a hot event may be determined to be a temporary sensitive word;

The temporary sensitive words are screened through relevance, and the process is as follows:

For any core keyword, a pre-trained word vector model converts the core keyword and the candidate words in the preprocessed text into multidimensional vectors, and the semantic similarity between the vector of each word in the preprocessed text and the core keyword vector is calculated as their cosine similarity;

The pre-training Word vector model includes but is not limited to Word2Vec, BERT;

The words that co-occur with the core keyword in the preprocessed text are identified, and the number of co-occurrences is counted for each;

For any candidate word in the preprocessed text, the semantic similarity and the number of co-occurrences are summed and averaged, and the result is output as the relevance score;

extracting candidate words with the relevance scores larger than the relevance score limit value as temporary sensitive words;
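A minimal sketch of the relevance screening follows. Several points are assumptions rather than details given in the text: word vectors are supplied as a plain dictionary (in practice they would come from a model such as Word2Vec or BERT), a co-occurrence is counted once per text that contains both words, and the score is the plain mean of similarity and count with no extra normalization. All names are hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def co_occurrence_count(texts, keyword, candidate):
    """Number of tokenized texts that contain both words (assumed definition)."""
    return sum(1 for t in texts if keyword in t and candidate in t)


def relevance_score(similarity, co_count):
    """Mean of semantic similarity and co-occurrence count, as described."""
    return (similarity + co_count) / 2


def temporary_sensitive_words(texts, vectors, keyword, limit):
    """Candidates whose relevance score exceeds the limit, in sorted order."""
    candidates = {w for t in texts for w in t if w != keyword and w in vectors}
    result = []
    for word in sorted(candidates):
        sim = cosine_similarity(vectors[keyword], vectors[word])
        score = relevance_score(sim, co_occurrence_count(texts, keyword, word))
        if score > limit:
            result.append(word)
    return result


# Toy vectors and texts, invented purely for illustration.
vectors = {"scandal": [1.0, 0.0], "melon": [0.8, 0.6], "weather": [0.0, 1.0]}
texts = [["scandal", "melon"], ["melon", "scandal", "weather"], ["weather"]]
print(temporary_sensitive_words(texts, vectors, "scandal", limit=1.0))
```

Because the co-occurrence count is unbounded while cosine similarity lies in [-1, 1], the count dominates the mean on large batches; an implementation might scale the count first, which the text does not specify.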

Extracting temporary sensitive words in this way has two benefits. First, the temporary sensitive words are strongly related to the hot event, which reduces interference from irrelevant vocabulary. Second, semantic similarity identifies implicit reference words while the co-occurrence count reinforces explicit associated words, so both explicit expressions and implicit references are covered; this adapts to the varied expressions used in network chat and reduces missed judgments caused by changing word forms;

Constructing a temporary word library based on the acquired temporary sensitive words, and setting a scene validation rule by combining the temporary word library and the hot event tag to audit the network chat;

in some embodiments, temporary sensitive words are acquired, and a temporary word stock is constructed and used as a core audit object;

adding a multidimensional attribute tag for each temporary sensitive word, wherein the multidimensional attribute tag comprises an associated event tag and a risk type tag;

The associated event label identifies the hot event to which the word belongs, such as "#a certain food safety event#" or "#a certain celebrity scandal#";

Risk type labels mark the sensitive type of the word, such as "privacy class", "dispute class", and "violation class", which correspond to, for example, an "unpublished address", an "offensive nickname", and a "homophone of a prohibited word" respectively;

Multidimensional attribute labels are also added to each hot event, including an event type label, a propagation platform label, and a sensitive trigger point label;

The event type label distinguishes the nature of the event, such as social livelihood events, entertainment gossip events, and emergency events;

The propagation platform label marks the platform on which the event mainly spreads, such as "mainly microblog", "group chat diffusion", or "news comment area fermentation";

The sensitive trigger point label records the core conflict in the event that is most likely to provoke disputes, such as a privacy leakage risk;

setting a scene effective rule to audit the network chat, specifically:

When a temporary sensitive word appears in network chat content, the associated event label of the temporary sensitive word, the event type label of the hot event, and the propagation platform label are matched, and a specific auditing rule is triggered only when all three match;

Requiring all three to match reduces the possibility of cross-event and cross-platform misjudgment;

For example, the temporary sensitive word "melon", with the associated event label "#a certain celebrity scandal#", the event type label "entertainment gossip", and the propagation platform label "group chat diffusion", triggers auditing only in the group chat scene corresponding to the hot event "#a certain celebrity scandal#";
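One way to read the scene validation rule is sketched below. The data classes, field names, and the function `matched_event_type` are hypothetical; the sketch assumes that the word's associated event label selects the hot event, that the chat platform must equal the event's propagation platform label, and that the event type label then selects which auditing rule applies.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class TemporaryWord:
    text: str
    event_label: str      # associated event label


@dataclass(frozen=True)
class HotEvent:
    event_label: str
    event_type: str       # event type label
    platform_label: str   # propagation platform label


def matched_event_type(word: TemporaryWord, event: HotEvent,
                       chat_text: str, chat_platform: str) -> Optional[str]:
    """Return the event type whose rule should fire, or None if the scene does not match."""
    if word.text not in chat_text:
        return None
    if word.event_label != event.event_label:
        return None
    if event.platform_label != chat_platform:
        return None
    return event.event_type


word = TemporaryWord("melon", "#a certain celebrity scandal#")
event = HotEvent("#a certain celebrity scandal#", "entertainment gossip", "group chat")
print(matched_event_type(word, event, "have you seen the melon?", "group chat"))
print(matched_event_type(word, event, "have you seen the melon?", "microblog"))
```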

Auditing is differentiated by event type label, with one mode for social livelihood or emergency events and another for entertainment gossip events;

When the event type label is a social livelihood event or an emergency event, a strict auditing mode is triggered:

If the propagation platform is the full platform (including microblog, news comment areas, group chat, and the like), content is intercepted in real time as soon as a temporary sensitive word (such as the code name of the incident location or unpublished casualty data) appears, the content is simultaneously pushed to a manual review queue (review to be completed within 3 minutes), and the sender's IP and account information are recorded for tracing;

For example, the temporary sensitive word "leakage at a certain chemical plant" (associated event label "#a certain chemical safety event#", event type label "emergency", propagation platform label "full platform diffusion") is intercepted automatically when it appears in microblog posts, group chats, or news comments, and after manual review confirms whether the information is false, the content is either released or permanently blocked;

When the event type label is "entertainment gossip", a loose auditing mode is triggered:

If the propagation platform is microblog or social media, a single piece of content in which a temporary sensitive word (such as an offensive nickname) appears 3 or more times is intercepted, and content in which it appears 1 to 2 times triggers a pop-up reminding the user to use civil language;

If the propagation platform is group chat, delayed interception is triggered only when the word is reported by 2 or more group members (the content is blocked if it is not withdrawn within 30 minutes);
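The two auditing modes can be summarized as a small decision function. This is a sketch under assumptions: the action names, the platform strings, and the function `audit_action` are hypothetical, "entertainment gossip" stands for the entertainment event type described above, and the timing details (manual review deadline, 30-minute withdrawal window) are left to the caller.

```python
def audit_action(event_type, platform, occurrences, reports=0):
    """Pick an action for one piece of content containing a temporary sensitive word.

    occurrences: times the word appears in this single piece of content.
    reports: number of group members who reported it (group chat only).
    """
    if occurrences < 1:
        return "pass"
    if event_type in ("social livelihood", "emergency"):
        # Strict mode: intercept at once on every platform and queue for manual review.
        return "intercept_and_manual_review"
    if event_type == "entertainment gossip":
        if platform == "group chat":
            # Loose mode, group chat: act only after 2 or more member reports.
            return "delayed_intercept" if reports >= 2 else "pass"
        # Loose mode, microblog or social media.
        return "intercept" if occurrences >= 3 else "popup_reminder"
    return "pass"


print(audit_action("emergency", "microblog", occurrences=1))
print(audit_action("entertainment gossip", "microblog", occurrences=2))
print(audit_action("entertainment gossip", "group chat", occurrences=1, reports=2))
```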

The core of the scene validation rule is to match each auditing rule to its proper scene: by matching the associated event label of the temporary sensitive word, the event type label of the hot event, and the propagation platform label, the range of scenes in which auditing takes effect is limited, which reduces the side effects of a one-size-fits-all approach;

this embodiment has at least the following effects:

firstly, the problem of response lag of a traditional static word stock to temporary sensitive words is optimized, and words strongly related to hot events are timely brought into an auditing range by calculating a relevance score in real time;

Secondly, the dominant associated vocabulary and the recessive reference vocabulary are covered simultaneously, so that missed judgment caused by diversification of network expression is reduced, strong association between temporary sensitive words and hot events is ensured, and a foundation is laid for follow-up accurate auditing;

thirdly, limiting an audit effective scene, and reducing cross-event and cross-platform misjudgment;

fourth, differentiated rules are formulated for different event types and propagation scenes, so that risk is prevented effectively while the excessive interference with normal user communication caused by one-size-fits-all auditing is reduced, balancing information security control against the need for freedom of online speech.

Example 2

Referring to fig. 1, the method for auditing the network chat sensitive words based on multidimensional identification according to the embodiment of the invention further comprises the following steps:

A temporary sensitive word spreads rapidly when its hot event breaks out and loses relevance as the heat of the event falls; it is therefore time-limited, and the temporary word stock needs to be updated dynamically;

Step three, monitoring the heat data of the hot event in real time, analyzing the change trend of the heat data, and evaluating the timeliness of the temporary sensitive words based on that trend;

The timeliness of a temporary sensitive word is strongly related to the heat of the event and the public opinion cycle, but heat decay can be dynamic and nonlinear: an event may cool down after a week and then suddenly reburn because of a new development. Analyzing the change trend of the heat data is therefore essential. Without it, expiry management of temporary sensitive words lacks a sound basis, and both missed judgments and excessive auditing become more likely;

In some embodiments, the heat data of the hot events are monitored in real time, wherein the monitored objects take the identified hot events as cores, and the related heat data are collected in real time, and the heat data comprise absolute interaction quantity, relative interaction quantity and interaction quality value;

The absolute interaction amount and the relative interaction amount are calculated in the same manner as in step one and are not described again here;

The comments whose word count exceeds the average comment length are extracted and counted, and the ratio of this count to the total number of comments is taken as the comment depth;

The comments that have been replied to by other users are extracted and counted as secondary interaction comments, and the ratio of this count to the total number of comments is taken as the secondary interaction ratio;

The comment depth and the secondary interaction ratio are fused by weighting, and the result is output as the interaction quality value;

As known to those skilled in the art, the comment depth is generally weighted higher than the secondary interaction ratio; the weight of the comment depth is generally 0.6 and the weight of the secondary interaction ratio is generally 0.4;

The absolute interaction amount, the relative interaction amount, and the interaction quality value are normalized and then fused by weighting, and the result is output as the comprehensive heat value;

As known to those skilled in the art, the weight of the normalized absolute interaction amount is 0.4, the weight of the normalized relative interaction amount is 0.3, and the weight of the interaction quality value is 0.3;
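The weighted fusion above can be sketched as follows. The text does not say how the interaction amounts are normalized, so min-max scaling against caller-supplied bounds is an assumption here, as are all function and parameter names.

```python
def interaction_quality(comment_lengths, replied_flags, w_depth=0.6, w_secondary=0.4):
    """Weighted fusion of comment depth and secondary interaction ratio."""
    total = len(comment_lengths)
    if total == 0:
        return 0.0
    mean_length = sum(comment_lengths) / total
    # Comment depth: share of comments longer than the average length.
    depth = sum(1 for n in comment_lengths if n > mean_length) / total
    # Secondary interaction ratio: share of comments replied to by other users.
    secondary = sum(1 for replied in replied_flags if replied) / total
    return w_depth * depth + w_secondary * secondary


def min_max(value, low, high):
    """Scale value into [0, 1]; the bounds are an assumed normalization choice."""
    if high == low:
        return 0.0
    return min(1.0, max(0.0, (value - low) / (high - low)))


def comprehensive_heat(abs_norm, rel_norm, quality):
    """Weights 0.4 / 0.3 / 0.3 as stated in the text."""
    return 0.4 * abs_norm + 0.3 * rel_norm + 0.3 * quality


quality = interaction_quality([10, 20, 30, 40], [True, False, False, False])
heat = comprehensive_heat(min_max(250000, 0, 500000), min_max(2.5, 0, 5), quality)
print(round(quality, 2), round(heat, 2))
```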

for any one hot event, the two-dimensional analysis is carried out through the conventional trend and the reburning trend, and the specific process is as follows:

the conventional trend is analyzed through trend direction, period division and conventional prediction, namely, the natural trend type is identified based on the heat comprehensive value of 3-5 continuous time periods;

Wherein the time period is preset by a person skilled in the art;

The deviation of the comprehensive heat value between each pair of adjacent time periods is calculated to obtain a sequence of comprehensive heat value deviations;

The positive deviations are extracted and counted, and the ratio of their number to the total number of deviations is taken as the direction consistency index;

Each deviation is assigned a weight according to its time, and the weighted sum of the deviations is taken as the amplitude index;

The weight increases with the period, that is, recent deviations have a larger influence on the trend: the weight of the K-th deviation is K/(n-1), where n is the total number of deviations;

The maximum number of consecutive deviations in the same direction in the sequence is counted and taken as the continuous trend index;

The direction consistency index, the amplitude index, and the continuous trend index are normalized and then fused by weighting to obtain a first comprehensive trend value;

The direction consistency index C is normalized as C' = 2C - 1; the amplitude index W is normalized as W' = W/|max(W)|; the continuous trend index S is normalized as S' = S/max(S). Weights are assigned according to the importance of each index: 0.4 for the direction consistency index, 0.3 for the amplitude index, and 0.3 for the continuous trend index;

The trend is judged to be upward when the first comprehensive trend value is ≥ the first judgment value (0.2), fluctuating when it is > the second judgment value (-0.2) and < the first judgment value (0.2), and downward when it is ≤ the second judgment value (-0.2);

When the trend is upward, the temporary sensitive word has strong timeliness; when the trend is downward, it has weak timeliness; when the trend is fluctuating, it has stable timeliness;
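The conventional trend analysis can be sketched as below. The text does not define max(W) or max(S), so they are passed in as the hypothetical parameters `w_max` and `s_max` (for instance historical maxima); the weight K/(n-1) is applied exactly as written, and a value strictly between the two judgment values is treated as fluctuating. Function names are illustrative only.

```python
def first_trend_value(heat_values, w_max, s_max):
    """Fuse direction consistency, amplitude, and continuity into one trend value."""
    deviations = [b - a for a, b in zip(heat_values, heat_values[1:])]
    n = len(deviations)
    if n < 2:
        raise ValueError("need at least three periods of comprehensive heat values")
    # Direction consistency C: share of positive deviations.
    c = sum(1 for d in deviations if d > 0) / n
    # Amplitude W: weighted sum, the K-th deviation weighted K / (n - 1).
    w = sum(k / (n - 1) * d for k, d in enumerate(deviations, start=1))
    # Continuous trend S: longest run of deviations with the same sign.
    s = run = 1
    for prev, cur in zip(deviations, deviations[1:]):
        same_direction = (prev > 0 and cur > 0) or (prev < 0 and cur < 0)
        run = run + 1 if same_direction else 1
        s = max(s, run)
    c_norm = 2 * c - 1
    w_norm = w / abs(w_max)
    s_norm = s / s_max
    return 0.4 * c_norm + 0.3 * w_norm + 0.3 * s_norm


def classify_trend(value, first_judgment=0.2, second_judgment=-0.2):
    if value >= first_judgment:
        return "upward"
    if value <= second_judgment:
        return "downward"
    return "fluctuating"


TIMELINESS = {"upward": "strong", "fluctuating": "stable", "downward": "weak"}

rising = first_trend_value([0.2, 0.3, 0.5, 0.6], w_max=1.0, s_max=3)
print(classify_trend(rising), TIMELINESS[classify_trend(rising)])
```

Note that S' is unsigned, so a long falling run still contributes a positive term; with these weights a steadily falling series lands only slightly below the second judgment value.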

The reburning trend is analyzed as follows. First, 50% of the initial comprehensive heat value of the hot event is used as the decay baseline, and when the comprehensive heat value falls below the decay baseline, the heat trend is judged to have entered the decay period;

30% of the decay baseline is taken as the reburning trigger baseline, and reburning analysis is started when the comprehensive heat value rises back to the reburning trigger baseline. The reburning analysis verifies the reburning trend as follows:

After the comprehensive heat value rises back to the reburning trigger baseline, it is monitored whether the value stays at or above that baseline for at least 2 consecutive time periods. If only 1 period meets the standard, the rise is judged to be a short-term fluctuation and an invalid reburning; if 2 or more consecutive periods meet the standard, the next verification step is entered;

The interaction quality value of the reburning period is calculated and compared with the average interaction quality value of the decay period;

The interaction quality value is calculated in the same manner as described above and is not described again here.

If the interaction quality value of the reburning period is ≥ 1.2 times the average value of the decay period, the rebound is accompanied by high-quality discussion rather than low-quality flooding and is an effective reburning; otherwise it is an invalid reburning;

A second comprehensive trend value is calculated after reburning, in the same manner as the first comprehensive trend value in the conventional trend analysis, and is not described again here;

If the reburning is effective and the second comprehensive trend value is ≥ the first judgment value (0.2), a complete reburning is judged, and the timeliness of the temporary sensitive word is adjusted from the original weak timeliness (decay period) to strong timeliness;

If the reburning is effective but the second comprehensive trend value is ≤ the second judgment value (-0.2), the reburning is not judged to be complete, and the timeliness of the temporary sensitive word remains weak (decay period);

If the reburning is effective and the second comprehensive trend value is > the second judgment value (-0.2) and < the first judgment value (0.2), a fluctuating reburning is judged, and the timeliness of the temporary sensitive word is adjusted from the original weak timeliness (decay period) to stable timeliness;
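The reburning verification can be sketched as a single decision function. Several readings here are assumptions: the trigger baseline is taken literally as 30% of the decay baseline, the stability check counts consecutive periods starting from the rebound, the fluctuating case is read as applying to an effective reburning, and an invalid reburning leaves the word at weak timeliness. The function name and parameters are hypothetical.

```python
def reburn_timeliness(initial_heat, heat_after_rebound, reburn_quality,
                      decay_quality_mean, second_trend_value,
                      decay_ratio=0.5, trigger_ratio=0.3):
    """Return the timeliness ("strong", "stable", "weak") after a rebound in the decay period."""
    decay_baseline = decay_ratio * initial_heat
    trigger_baseline = trigger_ratio * decay_baseline  # literal reading of the text
    # Stability: consecutive periods at or above the trigger baseline since the rebound.
    streak = 0
    for heat in heat_after_rebound:
        if heat < trigger_baseline:
            break
        streak += 1
    if streak < 2:
        return "weak"  # short-term fluctuation, invalid reburning
    # Quality: the rebound must carry substantive discussion.
    if reburn_quality < 1.2 * decay_quality_mean:
        return "weak"  # invalid reburning
    # Effective reburning: grade it by the second comprehensive trend value.
    if second_trend_value >= 0.2:
        return "strong"  # complete reburning
    if second_trend_value <= -0.2:
        return "weak"
    return "stable"  # fluctuating reburning


print(reburn_timeliness(1.0, [0.20, 0.25], reburn_quality=0.6,
                        decay_quality_mean=0.4, second_trend_value=0.3))
```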

This has the following effects. By monitoring the heat data of hot events in real time, calculating the comprehensive heat value, and analyzing the conventional trend, the timeliness of the temporary sensitive words is matched to the trend, which addresses the inability of conventional static timeliness judgment to adapt to the dynamic change of hot events;

Meanwhile, by setting a decay baseline and a reburning trigger baseline and combining consecutive-period stability, interaction quality comparison, and the comprehensive trend value, the reburning trend is verified and short-term fluctuation is distinguished from substantive reburning. This reduces cases in which an accidental rise in heat during the decay period is misjudged as reburning or a real reburning is overlooked, keeps the timeliness assessment of temporary sensitive words consistent with the actual propagation state of the event, and provides a sound basis for the subsequent adjustment of auditing rules;

Auditing resources are thereby allocated dynamically in an indirect way: resources are concentrated on auditing while the event is rising (strong timeliness), and unnecessary intervention is reduced in the decay period (weak timeliness), reducing the waste of computing power and manpower;

Step four: based on the timeliness evaluation result of the temporary sensitive words, perform hierarchical adjustment of the auditing strength for the temporary sensitive words;

In some embodiments, when the temporary sensitive word has strong timeliness, meaning that the hot event associated with it is in a highly active period or is heating up rapidly after reburning, the highest-level audit is started, including but not limited to:

Social livelihood or emergency event types: the word is intercepted across the whole platform upon 1 occurrence, a manual recheck is carried out within 5 minutes, and tracing information is recorded;

For example, after a chemical safety event reburns, the temporary sensitive word "new leak at a certain factory" (strong timeliness) is intercepted upon 1 occurrence in microblogs and group chats; it is manually checked whether the content is false information, and if so, it is permanently blocked and traced;

Entertainment gossip event types: a reminder is issued when the single occurrence count is 2 or more, and the content is intercepted when the single occurrence count is 4 or more;

When the temporary sensitive word has stable timeliness, a mid-level audit is started, including but not limited to:

Social livelihood or emergency event types: the word is recorded when it occurs 2 or more times across the whole platform, and a reminder is sent if it is repeated within 1 hour;

Entertainment gossip event types: a reminder is issued when the single occurrence count is 3 or more, and the content is intercepted when the single occurrence count is 6 or more;

When the temporary sensitive word has weak timeliness, a low-level audit is started, including but not limited to:

All event types: the word is recorded only when occurrences are concentrated at high frequency (for example, 10 or more times within 12 hours), manual checking is performed within 24 hours, and private chat scenarios are not interfered with.
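The tiered rules above lend themselves to a table-driven check. The sketch below is illustrative only: the dictionary keys, the action names, and the function `audit_action` are hypothetical, and `count` is assumed to be the occurrence count already measured in whatever scope and time window the corresponding tier specifies (the 5-minute recheck, 1-hour repeat, and 24-hour check windows are not modeled).

```python
from typing import Optional

# Count thresholds copied from the tiers above; keys and action names are hypothetical.
AUDIT_RULES = {
    ("strong", "livelihood"): [(1, "intercept")],
    ("strong", "gossip"): [(2, "remind"), (4, "intercept")],
    ("stable", "livelihood"): [(2, "record")],
    ("stable", "gossip"): [(3, "remind"), (6, "intercept")],
}

WEAK_RECORD_THRESHOLD = 10  # "10 or more times within 12 hours", all event types


def audit_action(timeliness: str, event_type: str, count: int) -> Optional[str]:
    """Return the strictest action whose count threshold is met, or None."""
    if timeliness == "weak":
        # Weak timeliness: record only when occurrences are highly concentrated.
        return "record" if count >= WEAK_RECORD_THRESHOLD else None
    rules = AUDIT_RULES.get((timeliness, event_type), [])
    # Check the highest threshold first so interception wins over a reminder.
    for threshold, action in sorted(rules, reverse=True):
        if count >= threshold:
            return action
    return None
```

Keeping the thresholds in data rather than in branching code makes it straightforward to add further event types, which the "including but not limited to" wording leaves open.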

Example 3

Based on the same inventive concept as the method for auditing the network chat sensitive words based on the multi-dimensional recognition in the foregoing embodiment, as shown in fig. 3, the present application provides an auditing system for auditing the network chat sensitive words based on the multi-dimensional recognition, wherein the system specifically includes:

The temporary sensitive word acquisition module is used for acquiring and analyzing multi-source data in real time, extracting core keywords to identify hot events, and extracting temporary sensitive words based on the hot events;

This module collects multi-source data from social media, news websites, forums, instant messaging tools and the like in real time. After cleaning and word segmentation, it extracts core keywords using the TF-IDF algorithm, and identifies hot events that meet the absolute interaction volume and relative interaction volume thresholds by analyzing the interaction data of texts containing the core keywords. Based on the hot events, it calculates semantic similarity (using a pre-trained word vector model) and co-occurrence counts, and screens out temporary sensitive words whose relevance meets the standard, mitigating the lag of a traditional word library in responding to emerging sensitive words;

The chat auditing module is used for constructing a temporary word library based on the acquired temporary sensitive words, and for setting scenario-based activation rules that combine the temporary word library with hot event tags in order to audit network chat;

Scenario-based auditing rules are set by matching the associated event tag of each temporary sensitive word against the hot event type and the propagation platform tag. For example, social livelihood events trigger strict interception, while entertainment gossip events trigger lenient reminders, which reduces cross-scenario misjudgment;

The timeliness analysis module is used for monitoring the heat data of the hot events in real time, analyzing the change trend of the heat data and evaluating timeliness of the temporary sensitive words based on the change trend of the heat data;

This module monitors the heat data of hot events in real time, calculates the heat comprehensive value, and analyzes its trend. The first comprehensive trend value is used to judge whether the event shows a rising, fluctuating, or falling trend, and the corresponding temporary sensitive word is accordingly judged to have strong, stable, or weak timeliness;

The auditing strength adjustment module is used for performing grading adjustment on the auditing strength of the temporary sensitive word based on the timeliness evaluation result of the temporary sensitive word;

The auditing strength is adjusted by level according to the timeliness evaluation result of the temporary sensitive words. Under strong timeliness, social livelihood events are intercepted across the whole platform upon 1 occurrence and manually rechecked, and entertainment gossip events trigger a reminder at a single occurrence count of 2 and interception at 4. Under stable timeliness (for example, a fluctuating trend), social livelihood events are recorded at 2 occurrences with a reminder on repetition, and entertainment gossip events trigger a reminder at 3 and interception at 6. Under weak timeliness (for example, the decay period), only concentrated high-frequency occurrences are recorded (for example, 10 times within 12 hours) and private chat scenarios are not interfered with. This balances auditing accuracy and user experience.

The foregoing has shown and described the basic principles, principal features and advantages of the invention. It will be understood by those skilled in the art that the present invention is not limited to the embodiments described above, and that the above embodiments and descriptions are merely illustrative of the principles of the present invention, and various changes and modifications may be made without departing from the spirit and scope of the invention, which is defined in the appended claims. The scope of the invention is defined by the appended claims and equivalents thereof.

Claims

1. A method for reviewing sensitive words in online chat based on multi-dimensional identification, characterized by the following steps:

Real-time collection of multi-source data and extraction of core keywords; acquisition of text containing core keywords and corresponding interactive data; identification of hot events by analyzing interactive data; extraction of temporary sensitive words based on hot events.

A temporary thesaurus is built based on the temporary sensitive words; the temporary thesaurus is combined with hot event tags to set scenario-based rules governing when the words take effect, and online chat is reviewed under those rules;

The popularity data of trending events is monitored in real time, and a comprehensive popularity value is computed and output; changes in the comprehensive popularity value are analyzed along two dimensions, the conventional trend and the resurgence trend, and the timeliness of the temporary sensitive words is evaluated based on the results of that analysis;

The process of conventional trend analysis is as follows:

Extract the comprehensive popularity value for N consecutive time periods, calculate the deviation value of the comprehensive popularity value between adjacent time periods to obtain a sequence of deviation values, and analyze that sequence to calculate the first comprehensive trend value.

When the first comprehensive trend value is greater than or equal to the first judgment value, it is determined to be an upward trend, and the temporary sensitive word is highly time-sensitive; when the first comprehensive trend value is greater than the second judgment value and the first comprehensive trend value is less than the first judgment value, it is determined to be a fluctuating trend, and the temporary sensitive word is stable and time-sensitive; when the first comprehensive trend value is less than or equal to the second judgment value, it is determined to be a downward trend, and the temporary sensitive word is weakly time-sensitive.

The first comprehensive trend value is calculated as follows:

Extract the positive deviation values of the comprehensive popularity value, count them, and take the ratio of that count to the total number of deviation values to obtain the directional consistency indicator;

Assign time-based weights to the deviation values of the comprehensive popularity value and compute their weighted sum to obtain the amplitude indicator;

The weight of the Kth deviation value of the comprehensive popularity value is K/(n-1), where n is the total number of deviation values of the comprehensive popularity value;

Count the maximum number of consecutive same-direction deviation values in the deviation sequence of the comprehensive popularity value, and use it as the continuous trend indicator;

The directional consistency indicator, amplitude indicator, and continuous trend indicator are standardized and then fused to output the first comprehensive trend value.

Based on the timeliness assessment results of the temporary sensitive words, the review intensity for the temporary sensitive words is adjusted by level.
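The conventional trend analysis of claim 1 can be sketched as follows. This is a minimal illustration, not the claimed implementation. The three indicators follow the claim wording, but the claim does not say how they are standardized or fused, so the rescaling to [-1, 1] and the unweighted mean below are assumptions, as are the function names. The default judgment values 0.2 and -0.2 are the ones quoted in the description.

```python
from typing import List


def first_trend_value(heat: List[float]) -> float:
    """Fuse direction, amplitude and continuity of popularity changes into one score."""
    devs = [b - a for a, b in zip(heat, heat[1:])]  # adjacent-period deviations
    n = len(devs)
    if n < 2:
        raise ValueError("need at least 3 periods (2 deviation values)")

    # Directional consistency: share of positive deviation values.
    consistency = sum(d > 0 for d in devs) / n

    # Amplitude: weighted sum, weight K/(n-1) for the K-th deviation value.
    weights = [k / (n - 1) for k in range(1, n + 1)]
    amplitude = sum(w * d for w, d in zip(weights, devs))

    # Continuity: longest run of consecutive same-direction deviation values.
    best_run, best_sign, run, prev = 0, 0, 0, 0
    for d in devs:
        sign = (d > 0) - (d < 0)
        if sign != 0 and sign == prev:
            run += 1
        else:
            run = 1 if sign != 0 else 0
        prev = sign
        if run > best_run:
            best_run, best_sign = run, sign

    # ASSUMED standardization: rescale each indicator to [-1, 1].
    max_abs = max(abs(d) for d in devs)
    norm_consistency = 2 * consistency - 1
    norm_amplitude = amplitude / (sum(weights) * max_abs) if max_abs else 0.0
    norm_continuity = best_sign * best_run / n

    # ASSUMED fusion: unweighted mean of the three standardized indicators.
    return (norm_consistency + norm_amplitude + norm_continuity) / 3


def timeliness(trend: float, first: float = 0.2, second: float = -0.2) -> str:
    """Apply the two judgment values to a comprehensive trend value."""
    if trend >= first:
        return "strong"   # upward trend
    if trend <= second:
        return "weak"     # downward trend
    return "stable"       # fluctuating trend
```

Under these assumptions a steadily rising series such as `[1, 2, 3, 4]` scores 1.0 and a steadily falling one scores -1.0, so the two judgment values split the range into the three timeliness levels.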

2. The method for reviewing sensitive words in online chat based on multi-dimensional identification as described in claim 1, characterized in that: the process of extracting core keywords is as follows:

Segment the collected multi-source data into words, and extract keywords using the TF-IDF algorithm;

TF is: the number of times a word appears in the current text / the total number of words in the text; IDF is: log(total number of documents / (number of documents containing the word + 1)); TF-IDF is: the product of TF and IDF;

Extract words whose TF-IDF value ranks before the ranking percentage threshold as candidate keywords; for batch text, extract candidate keywords whose TF-IDF value is greater than or equal to the TF-IDF threshold as core keywords.
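The TF-IDF definition in claim 2 can be sketched in plain Python. This is an illustrative sketch: the function names are hypothetical, documents are assumed to be already segmented into word lists, the "+ 1" is read as applying to the document count in the denominator, and the ranking percentage is passed in as a fraction because the claim does not fix its value.

```python
import math
from collections import Counter
from typing import Dict, List


def tf_idf(docs: List[List[str]]) -> List[Dict[str, float]]:
    """Per-document TF-IDF scores for pre-segmented documents."""
    n_docs = len(docs)
    # Number of documents containing each word.
    doc_freq = Counter(word for doc in docs for word in set(doc))
    scores = []
    for doc in docs:
        counts = Counter(doc)
        total = len(doc)
        scores.append({
            # TF = count / words in text; IDF = log(N / (df + 1)).
            w: (c / total) * math.log(n_docs / (doc_freq[w] + 1))
            for w, c in counts.items()
        })
    return scores


def top_candidates(score: Dict[str, float], top_fraction: float) -> List[str]:
    """Keep the words ranked within the top fraction by TF-IDF (at least one)."""
    ranked = sorted(score, key=score.get, reverse=True)
    keep = max(1, int(len(ranked) * top_fraction))
    return ranked[:keep]
```

With this smoothing, a word that appears in every document receives a slightly negative IDF, so ubiquitous words naturally fall to the bottom of the ranking.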

3. The method for reviewing sensitive words in online chat based on multi-dimensional identification as described in claim 1, characterized in that: the process of identifying hot events is as follows:

Interaction data includes: number of likes, number of comments, number of reposts, number of shares, and number of favorites;

Sum the number of likes, comments, reposts, shares, and favorites for all texts, and then average them to obtain the absolute interaction volume.

The ratio of the current period's interaction volume to the previous period's interaction volume is used as the relative interaction volume.

Texts with both absolute and relative interaction volumes exceeding the limit are extracted and designated as trending events.
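The hot event test of claim 3 can be sketched as below. The class and function names are hypothetical, the two limits are left as parameters because the claim does not fix their values, and the handling of a zero previous-period volume is an assumed convention.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Interactions:
    likes: int = 0
    comments: int = 0
    reposts: int = 0
    shares: int = 0
    favorites: int = 0

    def total(self) -> int:
        return self.likes + self.comments + self.reposts + self.shares + self.favorites


def absolute_volume(texts: List[Interactions]) -> float:
    """Sum all five counts over all texts, then average per text."""
    return sum(t.total() for t in texts) / len(texts) if texts else 0.0


def relative_volume(current_total: float, previous_total: float) -> float:
    """Ratio of current-period to previous-period interaction volume."""
    if previous_total == 0:
        # ASSUMED convention: any activity after a silent period counts as unbounded growth.
        return float("inf") if current_total > 0 else 0.0
    return current_total / previous_total


def is_hot_event(current: List[Interactions], previous: List[Interactions],
                 abs_limit: float, rel_limit: float) -> bool:
    """Both the absolute and the relative volume must exceed their limits."""
    cur_total = sum(t.total() for t in current)
    prev_total = sum(t.total() for t in previous)
    return (absolute_volume(current) > abs_limit
            and relative_volume(cur_total, prev_total) > rel_limit)
```

Requiring both conditions filters out topics that are large but stagnant as well as topics that grow quickly from a negligible base.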

4. The method for reviewing sensitive words in online chat based on multi-dimensional identification as described in claim 1, characterized in that: the process of extracting temporary sensitive words is as follows:

Extract the texts containing the core keywords of each hot event and preprocess them. For any core keyword, use a pre-trained word vector model to transform the core keyword and the candidate words in the preprocessed text into multi-dimensional vectors. Calculate the semantic similarity between the vector of each word in the preprocessed text and the vector of the core keyword using cosine similarity.

Analyze the words that co-occur with the core keywords in the preprocessed text and count the number of times they co-occur.

For any candidate word in the preprocessed text, the semantic similarity and the number of times they co-occur are summed and averaged to output a comprehensive relevance score; candidate words with a relevance score greater than the limit are extracted as temporary sensitive words.
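The relevance score of claim 4 can be sketched as follows. This is an illustration under assumptions: the word vectors are passed in as a plain dictionary standing in for a pre-trained model, co-occurrence is counted per text (a word co-occurs with the keyword when both appear in the same text), and the score is the literal average of similarity and co-occurrence count as the claim words it. All function names are hypothetical.

```python
import math
from collections import Counter
from typing import Dict, List, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity of two vectors; 0.0 if either has zero length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def cooccurrence_counts(texts: List[List[str]], keyword: str) -> Counter:
    """Count, per word, the texts in which it appears together with the keyword."""
    counts: Counter = Counter()
    for words in texts:
        if keyword in words:
            counts.update(set(words) - {keyword})
    return counts


def relevance_scores(texts: List[List[str]], keyword: str,
                     vectors: Dict[str, Sequence[float]]) -> Dict[str, float]:
    """Average of semantic similarity and co-occurrence count per candidate word."""
    co = cooccurrence_counts(texts, keyword)
    candidates = {w for words in texts for w in words} - {keyword}
    return {
        w: (cosine(vectors[w], vectors[keyword]) + co[w]) / 2
        for w in candidates if w in vectors
    }


def temporary_sensitive_words(scores: Dict[str, float], limit: float) -> List[str]:
    """Candidates whose relevance score exceeds the limit."""
    return sorted(w for w, s in scores.items() if s > limit)
```

Because raw co-occurrence counts are unbounded while cosine similarity lies in [-1, 1], a practical system might rescale the counts before averaging; the claim does not state this, so the sketch leaves the counts as they are.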

5. The method for reviewing sensitive words in online chat based on multi-dimensional identification as described in claim 1, characterized in that: the process of obtaining the comprehensive popularity value is as follows:

Real-time monitoring of trending events' popularity data; popularity data includes: absolute interaction volume, relative interaction volume, and interaction quality value;

After the absolute interaction volume, relative interaction volume, and interaction quality value are normalized and standardized, they are fused to output the comprehensive popularity value.

6. The method for reviewing sensitive words in online chat based on multi-dimensional identification as described in claim 5, characterized in that: the calculation process of the interaction quality value is as follows:

Extract comments that exceed the average word count and count their number as the number of comments exceeding the word count limit. Calculate the ratio of the number of comments exceeding the word count limit to the total number of comments as the comment depth.

Extract comments that have been replied to by other users and count their number as the number of secondary interaction comments. Calculate the ratio of the number of secondary interaction comments to the total number of comments as the secondary interaction percentage.

The interaction quality value is output by combining the comment depth with the secondary interaction percentage.
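The interaction quality calculation of claim 6 can be sketched as below. The `Comment` class and function name are hypothetical, each comment is assumed to carry a precomputed word count, and since the claim does not say how the two ratios are combined, the unweighted mean used here is an assumption.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Comment:
    word_count: int
    replied_by_others: bool = False


def interaction_quality(comments: List[Comment]) -> float:
    """Combine comment depth and secondary interaction percentage."""
    if not comments:
        return 0.0
    total = len(comments)
    mean_words = sum(c.word_count for c in comments) / total
    # Comment depth: share of comments longer than the average word count.
    depth = sum(c.word_count > mean_words for c in comments) / total
    # Secondary interaction percentage: share of comments replied to by other users.
    secondary = sum(c.replied_by_others for c in comments) / total
    # ASSUMED combination: unweighted mean of the two ratios.
    return (depth + secondary) / 2
```

For four comments with word counts 10, 2, 3 and 25, of which the first and last received replies, the depth is 0.25, the secondary interaction percentage is 0.5, and this sketch returns 0.375.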

7. The method for reviewing sensitive words in online chat based on multi-dimensional identification according to claim 1, characterized in that: the analysis process of the resurgence trend is as follows:

When the comprehensive popularity value falls below the decline baseline, the popularity trend is determined to have entered a decline period; when the comprehensive popularity value rises back to the resurgence trigger baseline, resurgence analysis is initiated.

After the comprehensive popularity value rises back to the resurgence trigger baseline, it is determined whether it remains at or above the baseline for at least two consecutive time periods. If it meets the standard for only one period, it is judged a short-term fluctuation and an invalid resurgence. If it meets the standard for two or more consecutive periods, it proceeds to the next step of verification.

Calculate the interaction quality value during the resurgence period and compare it with the mean interaction quality value during the decline period;

If the interaction quality value during the resurgence period is greater than or equal to 1.2 times the mean value during the decline period, it is considered a valid resurgence; otherwise, it is considered an invalid resurgence.

Calculate the second comprehensive trend value after the resurgence;

If it is a valid resurgence, and the second comprehensive trend value is greater than or equal to the first judgment value, it is judged a complete resurgence, and the temporary sensitive word has strong timeliness;

If it is a valid resurgence, and the second comprehensive trend value is less than or equal to the second judgment value, it is judged a decline resurgence, and the temporary sensitive word keeps its weak timeliness;

If it is an invalid resurgence, and the second comprehensive trend value is greater than the second judgment value and less than the first judgment value, it is judged a fluctuating resurgence, and the temporary sensitive word has stable timeliness.
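The verification steps of claim 7 that precede the final classification (baseline crossing, two-period persistence, and the 1.2x quality comparison) can be sketched as follows. The function names and parameters are hypothetical, the baselines are passed in because the claim does not fix their values, and the popularity values are assumed to be sampled once per time period.

```python
from statistics import mean
from typing import Optional, Sequence


def find_resurgence(heat: Sequence[float], decline: float, trigger: float) -> Optional[int]:
    """Index where popularity first climbs back to the trigger after a decline, else None."""
    in_decline = False
    for i, h in enumerate(heat):
        if not in_decline:
            in_decline = h < decline  # entered the decline period
        elif h >= trigger:
            return i                  # resurgence (reignition) analysis starts here
    return None


def is_valid_resurgence(heat: Sequence[float], start: int, trigger: float,
                        resurgence_quality: float,
                        decline_qualities: Sequence[float],
                        min_periods: int = 2, quality_ratio: float = 1.2) -> bool:
    """Apply the persistence check, then the interaction quality check."""
    # Persistence: stay at or above the trigger for at least two consecutive periods.
    run = 0
    for h in heat[start:]:
        if h < trigger:
            break
        run += 1
    if run < min_periods:
        return False  # short-term fluctuation
    # Quality: at least 1.2 times the mean quality of the decline period.
    return resurgence_quality >= quality_ratio * mean(decline_qualities)
```

A caller would first locate the candidate period with `find_resurgence`, then pass the resulting index to `is_valid_resurgence`; the outcome, together with the second comprehensive trend value, feeds the three classification rules stated above.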

8. A multi-dimensional identification system for reviewing sensitive words in online chat, characterized in that the system is used to perform the method described in any one of claims 1-7, the system comprising:

Temporary sensitive word acquisition module: Real-time collection of multi-source data and extraction of core keywords, acquisition of text containing core keywords and corresponding interactive data, identification of hot events by analyzing interactive data, and extraction of temporary sensitive words based on hot events;

Chat moderation module: A temporary thesaurus is built based on the temporary sensitive words; the temporary thesaurus is combined with hot event tags to set scenario-based rules governing when the words take effect, and online chat is reviewed under those rules;

Timeliness analysis module: Monitors the popularity data of trending events in real time, computes and outputs a comprehensive popularity value, analyzes changes in the comprehensive popularity value along two dimensions, the conventional trend and the resurgence trend, and evaluates the timeliness of the temporary sensitive words based on the results of that analysis;

Review intensity adjustment module: Based on the timeliness assessment results of temporary sensitive words, the review intensity of temporary sensitive words is adjusted in stages.