CN120745646B - Method and system for auditing network chat sensitive words based on multidimensional identification - Google Patents

Method and system for auditing network chat sensitive words based on multidimensional identification

Info

Publication number
CN120745646B
Authority
CN
China
Prior art keywords
value
temporary
comprehensive
trend
sensitive
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202511189528.2A
Other languages
Chinese (zh)
Other versions
CN120745646A (en
Inventor
朱屹涛
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Link Star Technology Co ltd
Original Assignee
Beijing Link Star Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Link Star Technology Co ltd
Priority to CN202511189528.2A
Publication of CN120745646A
Application granted
Publication of CN120745646B
Legal status: Active

Links

Classifications

    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/30 Semantic analysis
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/205 Parsing
    • G06F40/216 Parsing using statistical methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Probability & Statistics with Applications (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Machine Translation (AREA)

Abstract

The invention belongs to the technical field of network chat auditing and provides a method and a system for auditing network chat sensitive words based on multidimensional identification. The method comprises: acquiring multi-source data in real time and extracting core keywords; acquiring the texts containing the core keywords and the interaction data corresponding to those texts; identifying hot events by analyzing the interaction data; and extracting temporary sensitive words based on the hot events. By collecting multi-source data in real time and dynamically extracting temporary sensitive words related to hot events, the invention mitigates the response lag of a traditional static sensitive word library toward new sensitive words generated by hot events. Because the relevance score is calculated from semantic similarity and co-occurrence count, explicit associated words are covered and implicit reference word clusters are identified, which reduces missed judgments caused by changeable word forms and improves the timeliness and comprehensiveness of sensitive word identification.

Description

Method and system for auditing network chat sensitive words based on multidimensional identification
Technical Field
The invention belongs to the technical field of network chat auditing, and particularly relates to a method and a system for auditing a network chat sensitive word based on multidimensional identification.
Background
Network chat sensitive word auditing is the process of monitoring, identifying and processing the chat content of users on network platforms (such as social media, instant messaging tools and forums) through technical means and rule systems, and screening out words or expressions related to specific sensitive information, so as to regulate network communication behavior and prevent the spread of risk. Its core aim is to balance freedom of online speech against information security and to avoid the adverse effects on society or individuals caused by the spread of sensitive content (such as illegal information and privacy disclosure);
However, multidimensional identification usually depends on the completeness of the sensitive word library, while the dynamic nature and boundary definition of that library are easily overlooked. This includes temporary sensitive words generated by hot events: a social hot event can quickly give rise to new sensitive words that are time-limited, such as the names and code names of specific events in emergencies and public opinion disputes, which may be sensitive only while the event is developing. If library updates depend on manual labeling, they can lag behind the speed of network propagation and cause missed judgments;
Therefore, the invention provides a method and a system for auditing network chat sensitive words based on multidimensional identification.
Disclosure of Invention
The invention aims to overcome the deficiencies of the prior art and to solve at least one technical problem presented in the background.
The technical solution adopted to solve the technical problem is a method for auditing network chat sensitive words based on multidimensional identification, comprising the following steps:
acquiring multi-source data in real time, extracting core keywords, acquiring the texts containing the core keywords and the interaction data corresponding to those texts, identifying hot events by analyzing the interaction data, and extracting temporary sensitive words based on the hot events;
constructing a temporary word library based on the temporary sensitive words, setting scene validation rules by combining the temporary word library with the hot event labels, and auditing the network chat;
monitoring the heat data of the hot events in real time, analyzing it and outputting a heat comprehensive value, analyzing the change of the heat comprehensive value in the two dimensions of the conventional trend and the reburning trend, and evaluating the timeliness of the temporary sensitive words according to the result of that change analysis;
and performing graded adjustment of the auditing strength of the temporary sensitive words according to the timeliness evaluation result of the temporary sensitive words.
A system for auditing a network chat sensitive word based on multi-dimensional recognition, the system comprising:
a temporary sensitive word acquisition module, which acquires multi-source data in real time and extracts core keywords, acquires the texts containing the core keywords and the interaction data corresponding to those texts, identifies hot events by analyzing the interaction data, and extracts temporary sensitive words based on the hot events;
a chat auditing module, which constructs a temporary word library based on the temporary sensitive words, sets scene validation rules by combining the temporary word library with the hot event labels, and audits the network chat;
a timeliness analysis module, which monitors the heat data of the hot events in real time, analyzes it and outputs a heat comprehensive value, analyzes the change of the heat comprehensive value in the two dimensions of the conventional trend and the reburning trend, and evaluates the timeliness of the temporary sensitive words according to the result of that change analysis;
and an auditing strength adjustment module, which performs graded adjustment of the auditing strength of the temporary sensitive words according to the timeliness evaluation result of the temporary sensitive words.
The beneficial effects of the invention are as follows:
By collecting multi-source data in real time and dynamically extracting temporary sensitive words related to hot events, the invention mitigates the response lag of a traditional static sensitive word library toward new sensitive words generated by hot events. Because the relevance score is calculated from semantic similarity and co-occurrence count, explicit associated words are covered and implicit reference word clusters are identified, which reduces missed judgments caused by changeable word forms and improves the timeliness and comprehensiveness of sensitive word identification;
By monitoring heat data in real time and dynamically evaluating the timeliness of temporary sensitive words, the invention adjusts the auditing strength according to trend changes and reburning conditions, forming a full-cycle management mechanism. This mechanism strengthens risk prevention and control during the active period of an event and reduces unnecessary intervention during the decline period, improving the flexibility, accuracy and efficiency of network chat sensitive word auditing.
Drawings
The invention is further described below with reference to the accompanying drawings.
FIG. 1 is a flow chart of steps of a method for auditing a network chat sensitive word based on multi-dimensional recognition according to the present invention;
FIG. 2 is a flowchart of the steps for acquiring a hot event in a method for auditing a network chat sensitive word based on multi-dimensional recognition;
FIG. 3 is a block diagram of a system for auditing a network chat sensitive word based on multi-dimensional recognition in accordance with the present invention.
Detailed Description
The invention is further described below with reference to specific embodiments, so that its technical means, inventive features, purpose and effects are easy to understand.
Example 1
Referring to fig. 1 and 2, the method for auditing the network chat sensitive words based on multidimensional identification according to the embodiment of the invention includes the following steps:
Step one, acquiring multi-source data in real time, analyzing the multi-source data, extracting core keywords to identify hot events, and extracting temporary sensitive words based on the hot events;
in some embodiments, a hot event is first defined as an event or topic that spreads rapidly through network platforms (such as social media, news websites, forums and instant messaging tools) within a certain period of time and attracts high public attention, broad discussion and a large amount of interaction (such as likes, comments, forwards and shares);
in the first step, the process of collecting the multi-source data in real time is as follows:
web crawlers and API interfaces are used to capture data from each platform in real time, and the collected data is buffered in a message queue (such as Kafka) to ensure that the data is processed in order;
the acquisition sources of the multi-source data include, but are not limited to, social media platforms, news websites, forums and instant messaging tools;
social media platforms: capture high-frequency words under event-related topics and hot words in comment areas, such as the names, code names and specific expressions in a #event# topic;
news websites: capture the core entities in mainstream media reports, such as names of people, place names and event abbreviations, as well as reference words coined spontaneously by users in comment areas, such as the implicit references 'a certain melon' and 'that thing' in a specific event;
instant messaging tools: select event-related chat records in public group chats and extract new words or abnormal expressions that occur at high frequency;
in the first step, the process of extracting the core keywords is as follows:
preprocessing the acquired multi-source data, including data cleaning and word segmentation;
data cleaning comprises removing noise such as special symbols and HTML tags;
word segmentation splits the text into words, after which keywords are extracted using the TF-IDF (term frequency-inverse document frequency) algorithm;
specifically, TF-IDF measures the importance of a word by its frequency in a single text and its rarity across all texts;
TF = (number of times the word appears in the current text) / (total number of words in the text); IDF = log(total number of documents / (number of documents containing the word + 1)); TF-IDF = TF × IDF;
for a batch of texts, candidate keywords whose TF-IDF value is greater than or equal to the TF-IDF threshold are extracted as core keywords;
the batch of texts is determined by a person skilled in the art according to experience and event characteristics, for example the set of texts on a given platform within 1 hour;
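The keyword extraction described above can be sketched as follows. This is a minimal illustration, not the patented implementation: the function name `core_keywords` is hypothetical, the input is assumed to be already cleaned and segmented, and the "+1" in the IDF expression is read as smoothing of the denominator.

```python
import math
from collections import Counter


def core_keywords(batch, threshold):
    """Return {word: best TF-IDF score} for words whose score meets the threshold.

    batch: list of tokenized texts (each a list of words), e.g. one platform-hour.
    """
    n_docs = len(batch)
    # Document frequency: in how many texts each word appears at least once.
    doc_freq = Counter(word for text in batch for word in set(text))
    selected = {}
    for text in batch:
        if not text:
            continue  # an empty text has no term frequencies
        for word, count in Counter(text).items():
            tf = count / len(text)
            # Assumption: the "+1" in the description smooths the denominator.
            idf = math.log(n_docs / (doc_freq[word] + 1))
            score = tf * idf
            if score >= threshold and score > selected.get(word, float("-inf")):
                selected[word] = score
    return selected
```

Under this reading of the formula, a word that appears in every text of the batch receives a negative IDF and is never selected, which matches the stated intent of rewarding words that are rare across texts.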
in the first step, the process of identifying the hot event is:
extracting the texts that contain the core keywords and obtaining the interaction data of each text;
the interaction data comprises the numbers of likes, comments, forwards, shares and favorites;
the numbers of likes, comments, forwards, shares and favorites are each summed over all the texts, and the mean of the five totals is taken as the absolute interaction quantity;
the ratio of the interaction quantity of the current period to that of the previous period is taken as the relative interaction quantity;
an absolute interaction quantity threshold and a relative interaction quantity threshold are set, and texts whose absolute interaction quantity exceeds the absolute threshold and whose relative interaction quantity exceeds the relative threshold are extracted as a hot event;
the two thresholds are set by a person skilled in the art according to experience and event characteristics, and the period is likewise preset by a person skilled in the art;
Illustratively, take a social event as an example:
the HTML tag <p> and the special symbol "#" are cleaned from the news comments, and word segmentation yields keywords such as "event", "related enterprise", "potential safety hazard" and "investigation";
the TF-IDF values of "related enterprise" and "potential safety hazard" are 0.45 and 0.42 respectively, and they are selected as core keywords;
all texts containing the core keywords are extracted; the likes total 325678, the comments total 128905, the forwards total 876543, the shares total 456789 and the favorites total 98765;
the absolute interaction quantity is therefore (325678 + 128905 + 876543 + 456789 + 98765) / 5 = 377336; if the mean interaction quantity of similar events in the previous period (yesterday) is 150000, the relative interaction quantity is 377336 / 150000 ≈ 2.5156; since the absolute interaction quantity (377336) exceeds the absolute threshold (100000) and the relative interaction quantity (2.5156) exceeds the relative threshold (2.0), the event is identified as a hot event;
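The worked example above can be reproduced with a short sketch. The class and function names (`InteractionTotals`, `is_hot_event`) are hypothetical and introduced only for illustration; the thresholds are the ones used in the example.

```python
from dataclasses import dataclass


@dataclass
class InteractionTotals:
    """Per-type interaction totals summed over all texts with the core keywords."""
    likes: int
    comments: int
    forwards: int
    shares: int
    favorites: int


def absolute_interaction(totals):
    # Mean of the five per-type totals.
    return (totals.likes + totals.comments + totals.forwards
            + totals.shares + totals.favorites) / 5


def is_hot_event(current, previous_quantity, abs_threshold, rel_threshold):
    """Return (is_hot, absolute quantity, relative quantity)."""
    if previous_quantity <= 0:
        raise ValueError("previous-period interaction quantity must be positive")
    absolute = absolute_interaction(current)
    relative = absolute / previous_quantity
    # Both thresholds must be exceeded for the texts to count as a hot event.
    return absolute > abs_threshold and relative > rel_threshold, absolute, relative


totals = InteractionTotals(325678, 128905, 876543, 456789, 98765)
hot, absolute, relative = is_hot_event(totals, 150000, 100000, 2.0)
print(hot, absolute, round(relative, 4))  # True 377336.0 2.5156
```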
In step one, the process of extracting the temporary sensitive word based on the hot event is as follows:
taking the identified hot event as the core, all texts containing the core keywords of the hot event are extracted;
the screened texts are preprocessed, including data cleaning and word segmentation;
data cleaning removes noise from the text, such as special symbols and HTML tags, to keep the text clean;
word segmentation splits the text into individual words for the subsequent extraction of sensitive words;
words highly related to the hot event are extracted from the preprocessed text as temporary sensitive words, which include, but are not limited to, specific names, code names, disputed expressions and implicit reference words generated as the event develops;
For example, in a hot event such as a blessing, an offensive nickname extracted from the related text, an event-related implicit reference word, etc., may be determined as a temporary sensitive word;
The temporary sensitive words are screened through relevance, and the process is as follows:
for any core keyword, a pre-trained word vector model is used to convert the core keyword and the candidate words in the preprocessed text into multidimensional vectors, and the semantic similarity between the vector of each word in the preprocessed text and the core keyword vector is calculated by cosine similarity;
the pre-trained word vector model includes, but is not limited to, Word2Vec and BERT;
the words that co-occur with the core keyword in the preprocessed text are identified, and the number of co-occurrences is counted;
for any candidate word in the preprocessed text, the semantic similarity and the number of co-occurrences are summed and averaged to output the relevance score;
candidate words whose relevance score is greater than the relevance score limit are extracted as temporary sensitive words;
extracting temporary sensitive words in this way has two advantages: first, the temporary sensitive words are strongly related to the hot event, which reduces interference from irrelevant vocabulary; second, semantic similarity identifies implicit reference words while the co-occurrence count reinforces explicit associated words, so both explicit expressions and implicit references are covered, which adapts to the diverse modes of expression in network chat and reduces missed judgments caused by changeable word forms;
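The relevance screening can be sketched as below. This is an illustration under stated assumptions: the word vectors are supplied as a plain dictionary standing in for a pre-trained model, co-occurrence is counted as the number of texts in which both words appear, and all function names are hypothetical. The description averages the raw co-occurrence count with the similarity, and the sketch follows that literally, so the count can dominate the score when it is large.

```python
import math


def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def co_occurrence_count(core, candidate, texts):
    # Assumption: one co-occurrence per text that contains both words.
    return sum(1 for text in texts if core in text and candidate in text)


def relevance_score(core, candidate, vectors, texts):
    similarity = cosine_similarity(vectors[core], vectors[candidate])
    count = co_occurrence_count(core, candidate, texts)
    return (similarity + count) / 2  # mean of the two signals, as described


def temporary_sensitive_words(core, vectors, texts, limit):
    candidates = {word for text in texts for word in text} - {core}
    return sorted(
        word for word in candidates
        if word in vectors and relevance_score(core, word, vectors, texts) > limit
    )
```

For instance, with toy vectors `{"scandal": [1, 0], "melon": [0.8, 0.6], "weather": [0, 1]}` and three tokenized texts in which "melon" appears twice alongside "scandal", "melon" scores about 1.4 and "weather" 0.5, so only "melon" passes a limit of 1.0.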
Step two, constructing a temporary word library based on the acquired temporary sensitive words, and setting scene validation rules by combining the temporary word library with the hot event labels to audit the network chat;
in some embodiments, the temporary sensitive words are acquired and a temporary word library is constructed as the core audit object;
a multidimensional attribute label is added to each temporary sensitive word, comprising an associated event label and a risk type label;
the associated event label identifies the hot event to which the word belongs, such as "#a certain food safety event#" or "#a certain celebrity scandal#";
the risk type label marks the sensitive type of the word, such as "privacy", "dispute" or "violation", corresponding respectively to "undisclosed address", "offensive nickname" and "homophone of a prohibited word";
a multidimensional attribute label is also added to each hot event, comprising an event type label, a propagation platform label and a sensitive trigger point label;
the event type label distinguishes the nature of the event, such as social livelihood events, entertainment gossip events and emergencies;
the propagation platform label marks the platform on which the event mainly spreads, such as "mainly microblog", "group chat diffusion" or "news comment area fermentation";
the sensitive trigger point label records the core conflict most likely to cause dispute in the event, such as privacy leakage risk;
the scene validation rules for auditing the network chat are set as follows:
when a temporary sensitive word appears in network chat content, the associated event label of the temporary sensitive word, the event type label of the hot event and the propagation platform label are matched, and a specific auditing rule is triggered only when all three match;
requiring all three to match reduces the possibility of cross-event and cross-platform misjudgment;
for example, the temporary sensitive word 'melon', with the associated event label "#a certain celebrity scandal#", the event type label "entertainment gossip" and the propagation platform label "group chat diffusion", triggers auditing only in the group chat scene corresponding to the hot event "#a certain celebrity scandal#";
differentiated auditing based on the event type label covers two cases: the event type label is social livelihood or emergency, and the event type label is entertainment gossip;
when the event type label is a social livelihood event or an emergency, a strict auditing mode is triggered:
if the propagation platform is all platforms (including microblog, news comment areas, group chats and the like), any occurrence of a temporary sensitive word (such as the code name of the incident location or unpublished casualty data) is intercepted in real time and simultaneously pushed to a manual review queue (review to be completed within 3 minutes), and the sender's IP and account information are recorded for tracing;
for example, for the temporary sensitive word "leak at a certain chemical plant" (associated event label "#a certain chemical safety event#", event type label "emergency", propagation platform label "full platform diffusion"), any occurrence in microblogs, group chats or news comments is intercepted automatically, and after manual review confirms whether the information is false, the content is either released or permanently blocked;
when the event type label is "entertainment gossip", a loose auditing mode is triggered:
if the propagation platform is microblog or social media, a single piece of content is intercepted only when a temporary sensitive word (such as an offensive nickname) appears in it 3 or more times, while 1 to 2 occurrences trigger a pop-up window reminding the user to keep their language civil;
if the propagation platform is group chat, delayed interception is triggered only when the word is reported by 2 or more group members (the content is blocked if it is not withdrawn within 30 minutes);
the core of scene validation is to match each auditing rule to the scene it belongs to: matching the associated event label of the temporary sensitive word, the event type label of the hot event and the propagation platform label limits the range of scenes in which auditing takes effect and reduces the impact of a "one-size-fits-all" approach;
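The scene validation and differentiated auditing rules can be sketched as a small decision function. This is one possible reading, not the patented implementation: the data classes, label strings, returned action names and the convention that the platform label "all" matches any platform are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TempWord:
    word: str
    event_label: str   # associated event label
    risk_type: str     # e.g. "privacy", "dispute", "violation"


@dataclass(frozen=True)
class HotEvent:
    event_label: str
    event_type: str    # "social livelihood", "emergency" or "entertainment gossip"
    platform: str      # "all", "microblog", "group chat", ...


STRICT_TYPES = {"social livelihood", "emergency"}


def audit(message, platform, word, event, reports=0):
    """Return the audit action for one message on one platform."""
    occurrences = message.count(word.word)
    if occurrences == 0:
        return "pass"
    # Scene validation: the word must belong to this event and the message
    # must come from the platform on which the event is spreading.
    if word.event_label != event.event_label:
        return "pass"
    if event.platform not in ("all", platform):
        return "pass"
    if event.event_type in STRICT_TYPES:
        return "intercept_and_review"   # real-time interception plus manual review
    if event.event_type == "entertainment gossip":
        if platform == "group chat":
            return "delayed_intercept" if reports >= 2 else "pass"
        return "intercept" if occurrences >= 3 else "remind"
    return "pass"
```

With this sketch, the word 'melon' tagged for a gossip event spreading in group chat produces `"delayed_intercept"` only in a group chat with two or more reports, and `"pass"` on a microblog, which mirrors the example of limiting the rule to its own scene.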
this embodiment has at least the following effects:
first, the response lag of a traditional static word library toward temporary sensitive words is mitigated, and words strongly related to hot events are brought into the auditing range in time by calculating the relevance score in real time;
second, explicit associated words and implicit reference words are covered at the same time, which reduces missed judgments caused by the diversity of network expression, ensures a strong association between temporary sensitive words and hot events, and lays a foundation for subsequent accurate auditing;
third, the scenes in which auditing takes effect are limited, which reduces cross-event and cross-platform misjudgment;
fourth, differentiated rules are formulated for different event types and propagation scenes, so that risk is prevented effectively while the excessive intervention in normal user communication caused by one-size-fits-all auditing is reduced, balancing information security control against freedom of online speech.
Example 2
Referring to fig. 1, the method for auditing the network chat sensitive words based on multidimensional identification according to the embodiment of the invention further comprises the following steps:
because temporary sensitive words spread rapidly when a hot event breaks out and lose their sensitivity as the heat of the event falls, they are time-limited, and the temporary word library therefore needs to be updated dynamically;
Step three, monitoring the heat data of the hot event in real time, analyzing the change trend of the heat data, and evaluating the timeliness of the temporary sensitive words based on that trend;
the timeliness of a temporary sensitive word is strongly related to the event heat and the public opinion cycle, but heat attenuation can be dynamic and nonlinear: for example, an event may cool down after a week and then suddenly reburn because of a new development. Analyzing the change trend of the heat data is therefore particularly important; without it, the expiry management of temporary sensitive words lacks a sound basis, and the likelihood of both missed judgments and excessive auditing increases;
in some embodiments, the heat data of the hot events is monitored in real time: taking the identified hot events as the core, the related heat data is collected in real time, comprising the absolute interaction quantity, the relative interaction quantity and the interaction quality value;
the absolute interaction quantity and the relative interaction quantity are calculated in the same way as in step one, which is not repeated here;
comments whose word count exceeds the average comment length are extracted and counted as the number of over-length comments, and the ratio of the number of over-length comments to the total number of comments is calculated as the comment depth;
comments that have been replied to by other users are extracted and counted as the number of secondary interaction comments, and the ratio of the number of secondary interaction comments to the total number of comments is calculated as the secondary interaction ratio;
the comment depth and the secondary interaction ratio are fused by weighting to output the interaction quality value;
the comment depth is generally weighted higher than the secondary interaction ratio; typically the comment depth weight is 0.6 and the secondary interaction ratio weight is 0.4;
the absolute interaction quantity, the relative interaction quantity and the interaction quality value are normalized and then fused by weighting to output the heat comprehensive value;
typically the normalized absolute interaction quantity weight is 0.4, the normalized relative interaction quantity weight is 0.3, and the interaction quality value weight is 0.3;
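The two weighted fusions above can be sketched as follows. The function names are hypothetical, and because the description does not specify the normalization method, the min-max helper shown here is an assumption; any normalization that maps the inputs to a common scale could be substituted.

```python
def interaction_quality(comment_lengths, replied_flags,
                        depth_weight=0.6, secondary_weight=0.4):
    """Fuse comment depth and secondary interaction ratio into one value.

    comment_lengths: word count of each comment.
    replied_flags:   True for each comment that another user replied to.
    """
    total = len(comment_lengths)
    if total == 0:
        return 0.0
    mean_length = sum(comment_lengths) / total
    # Comment depth: share of comments longer than the average length.
    depth = sum(1 for n in comment_lengths if n > mean_length) / total
    # Secondary interaction ratio: share of comments that received a reply.
    secondary = sum(1 for replied in replied_flags if replied) / total
    return depth_weight * depth + secondary_weight * secondary


def min_max(value, low, high):
    """Assumed normalization: clamp (value - low) / (high - low) to [0, 1]."""
    if high == low:
        return 0.0
    return min(1.0, max(0.0, (value - low) / (high - low)))


def heat_value(abs_norm, rel_norm, quality_norm, weights=(0.4, 0.3, 0.3)):
    """Weighted fusion of the three normalized heat components."""
    w_abs, w_rel, w_quality = weights
    return w_abs * abs_norm + w_rel * rel_norm + w_quality * quality_norm
```

As a usage illustration, four comments with lengths 10, 20, 30 and 60, of which two received replies, give a comment depth of 0.25, a secondary interaction ratio of 0.5 and an interaction quality value of about 0.35.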
for any hot event, a two-dimensional analysis is carried out through the conventional trend and the reburning trend, as follows:
the conventional trend is analyzed through trend direction, period division and conventional prediction, that is, the natural trend type is identified from the heat comprehensive values of 3 to 5 consecutive time periods;
the time period is preset by a person skilled in the art;
the difference between the heat comprehensive values of adjacent time periods is calculated to obtain a sequence of heat comprehensive value deviations;
the positive deviations are extracted and counted, and the ratio of the number of positive deviations to the total number of deviations is calculated as the direction consistency index;
a weight is assigned to each deviation according to time, and the weighted sum is taken as the amplitude index;
the weights increase with the period, that is, more recent deviations have a larger influence on the trend: the weight of the K-th deviation is K/(n-1), where n is the total number of deviations;
the maximum number of consecutive deviations in the same direction in the deviation sequence is counted as the continuous trend index;
the direction consistency index, the amplitude index and the continuous trend index are standardized and then fused by weighting to obtain the first comprehensive trend value;
the direction consistency index C is standardized as C' = 2C - 1; the amplitude index W is standardized as W' = W / |max(W)|; the continuous trend index S is standardized as S' = S / max(S); weights are assigned according to the importance of each index: direction consistency index 0.4, amplitude index 0.3, continuous trend index 0.3;
when the first comprehensive trend value is greater than or equal to the first judgment value (0.2), the trend is judged to be upward; when it is greater than the second judgment value (-0.2) and less than the first judgment value (0.2), the trend is judged to be fluctuating; when it is less than or equal to the second judgment value (-0.2), the trend is judged to be downward;
when the trend is upward, the temporary sensitive words have strong timeliness; when the trend is downward, weak timeliness; and when the trend is fluctuating, stable timeliness;
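The conventional trend analysis can be sketched as below. Several points are assumptions rather than statements of the description: max(W) is passed in as a reference amplitude `w_max` (for example from historical data), max(S) is taken to be the number of deviations, a deviation is the later value minus the earlier one, and the function names are hypothetical. The weight K/(n-1) is applied literally as written.

```python
def longest_same_direction_run(deviations):
    """Length of the longest run of consecutive deviations with the same sign."""
    best = run = 0
    previous_sign = 0
    for d in deviations:
        sign = (d > 0) - (d < 0)
        if sign != 0 and sign == previous_sign:
            run += 1
        else:
            run = 1 if sign != 0 else 0
        previous_sign = sign
        best = max(best, run)
    return best


def trend_value(heat_values, w_max, weights=(0.4, 0.3, 0.3)):
    """Comprehensive trend value from 3 to 5 consecutive heat comprehensive values."""
    deviations = [b - a for a, b in zip(heat_values, heat_values[1:])]
    n = len(deviations)
    if n < 2:
        raise ValueError("at least 3 heat values are needed")
    c = sum(1 for d in deviations if d > 0) / n                 # direction consistency
    w = sum(k / (n - 1) * d for k, d in enumerate(deviations, start=1))  # amplitude
    s = longest_same_direction_run(deviations)                   # continuous trend
    c_std = 2 * c - 1
    w_std = w / abs(w_max)   # assumption: w_max is a reference amplitude
    s_std = s / n            # assumption: max(S) is the number of deviations
    return weights[0] * c_std + weights[1] * w_std + weights[2] * s_std


def classify(value, upper=0.2, lower=-0.2):
    """Map a comprehensive trend value to (trend, timeliness)."""
    if value >= upper:
        return "upward", "strong"
    if value <= lower:
        return "downward", "weak"
    return "fluctuating", "stable"
```

For the series 0.2, 0.3, 0.45, 0.6 with `w_max = 0.5`, the sketch yields a trend value of about 0.955 and the classification `("upward", "strong")`.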
the reburning trend is analyzed as follows: first, 50% of the initial heat comprehensive value of the hot event is taken as the decay baseline, and when the heat comprehensive value falls below the decay baseline, the heat trend is judged to have entered the decay period;
30% of the decay baseline is taken as the reburning trigger baseline, and reburning analysis starts when the heat comprehensive value rises back to the reburning trigger baseline; the reburning analysis verifies the reburning trend as follows:
after the heat comprehensive value rises back to the reburning trigger baseline, monitor whether it stays at or above that baseline for at least 2 consecutive time periods; if only 1 period meets the standard, the rise is judged to be a short-term fluctuation and an invalid reburning; if 2 or more consecutive periods meet the standard, proceed to the next verification step;
calculate the interaction quality value of the reburning period and compare it with the mean interaction quality value of the decay period;
the interaction quality value is calculated in the same way as described above, which is not repeated here;
if the interaction quality value of the reburning period is greater than or equal to 1.2 times the mean of the decay period, the rebound is accompanied by high-quality discussion rather than low-quality screen flooding and is a valid reburning; otherwise it is an invalid reburning;
the second comprehensive trend value after reburning is calculated in the same way as the first comprehensive trend value in the conventional trend analysis, which is not repeated here;
if the effective reburning is carried out and the second comprehensive trend value is more than or equal to the first judgment value (0.2), the complete reburning is judged, and the timeliness of the temporary sensitive word is adjusted from the original weak timeliness (decay period) to the strong timeliness;
If the effective reburning is carried out, and the second comprehensive trend value is less than or equal to a second judgment value (-0.2), judging that the effective reburning is carried out, and the timeliness of the temporary sensitive words still keeps the weak timeliness (decay period);
If the temporary sensitive word is invalid and reburning, if the second comprehensive trend value is more than the second judgment value (-0.2) and the second comprehensive trend value is less than the first judgment value (0.2), the temporary sensitive word is judged to be fluctuation reburning, and the timeliness of the temporary sensitive word is adjusted from the original weak timeliness (decay period) to stable timeliness;
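The verification steps above can be sketched as a single decision function. This is a hedged illustration, not the patented implementation: all names are hypothetical, the heat sequence is assumed to start with the period in which the trigger baseline was first reached, and combinations the text does not cover are returned as "unspecified" with the weak timeliness left unchanged (an assumption).

```python
from typing import Sequence

FIRST_JUDGMENT_VALUE = 0.2
SECOND_JUDGMENT_VALUE = -0.2


def baselines(initial_heat: float) -> tuple[float, float]:
    """Return (decay baseline, reignition trigger baseline) as stated in the text."""
    decay = 0.5 * initial_heat   # 50% of the initial heat comprehensive value
    trigger = 0.3 * decay        # 30% of the decay baseline
    return decay, trigger


def assess_reignition(
    heat_after_trigger: Sequence[float],
    trigger_baseline: float,
    reignition_quality: float,
    decay_quality_mean: float,
    second_trend_value: float,
) -> tuple[str, str]:
    """Return (reignition verdict, resulting timeliness)."""
    # Count consecutive periods at or above the trigger baseline.
    sustained = 0
    for heat in heat_after_trigger:
        if heat < trigger_baseline:
            break
        sustained += 1

    # Valid only if sustained for 2+ periods AND quality is at least 1.2x the decay mean.
    valid = sustained >= 2 and reignition_quality >= 1.2 * decay_quality_mean

    if valid and second_trend_value >= FIRST_JUDGMENT_VALUE:
        return "complete reignition", "strong"
    if valid and second_trend_value <= SECOND_JUDGMENT_VALUE:
        return "declining reignition", "weak"
    if not valid and SECOND_JUDGMENT_VALUE < second_trend_value < FIRST_JUDGMENT_VALUE:
        return "fluctuating reignition", "stable"
    # Not covered by the text; keeping weak timeliness is an assumption.
    return "unspecified", "weak"
```

For example, a rebound that holds for two periods with an interaction quality of 1.3 against a decay-period mean of 1.0 and a second trend value of 0.25 would be classed as a complete reignition.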
The beneficial effects are as follows: by monitoring the heat data of hot events in real time, calculating the heat comprehensive value, and analyzing the conventional trend, the timeliness of the temporary sensitive word is matched to the event, which addresses the inability of traditional static timeliness judgment to adapt to the dynamic changes of hot events;
Meanwhile, by setting a decay baseline and a reignition trigger baseline and combining consecutive-period stability, interaction quality comparison, and the comprehensive trend value, the reignition trend is verified and short-term fluctuation is distinguished from substantive reignition. This reduces cases in which an accidental heat rise during the decay period is misjudged as reignition or a real reignition is overlooked, keeps the timeliness evaluation of the temporary sensitive word consistent with the actual propagation state of the event, and provides a basis for subsequent adjustment of the auditing rules;
Auditing resources are also allocated dynamically as an indirect result: resources are concentrated on auditing during the rising period of an event (strong timeliness), and unnecessary intervention is reduced during the decay period (weak timeliness), which reduces wasted computing power and manpower;
Step four: based on the timeliness evaluation result of the temporary sensitive word, the auditing strength of the temporary sensitive word is adjusted in grades;
In some embodiments, when the temporary sensitive word has strong timeliness, meaning that the hot event associated with it is in a highly active period or is heating up rapidly after reignition, the highest level of auditing is started, including but not limited to:
Social livelihood or emergency event types: the word is intercepted platform-wide upon 1 occurrence, manually rechecked within 5 minutes, and traceability information is recorded;
For example, after a chemical safety event reignites, the temporary sensitive word "new leakage at a certain factory" (strong timeliness) is intercepted upon 1 occurrence in microblogs and group chats; it is then manually checked for false information, and if it is false, it is permanently blocked and traced to its source;
Entertainment gossip event types: 2 or more appearances at one time trigger a reminder, and 4 or more trigger interception;
When the temporary sensitive word has stable timeliness, a medium level of auditing is started, including but not limited to:
Social livelihood or emergency event types: 2 or more occurrences platform-wide are recorded, and a reminder is sent if the word is repeated within 1 hour;
Entertainment gossip event types: 3 or more appearances at one time trigger a reminder, and 6 or more trigger interception;
When the temporary sensitive word has weak timeliness, a low level of auditing is started, including but not limited to:
All event types: occurrences are recorded only when they are concentrated at high frequency (for example, 10 or more times within 12 hours) and are manually checked within 24 hours; private chat scenarios are not interfered with.
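The graded rules above amount to a lookup table keyed by timeliness and event type. The sketch below is one hypothetical way to encode it; the class, key, and function names are illustrative, and the time windows (5 minutes, 1 hour, 12 hours, 24 hours) are carried only as notes rather than enforced.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class AuditRule:
    record_at: Optional[int] = None     # occurrence count that triggers recording
    remind_at: Optional[int] = None     # occurrence count that triggers a reminder
    intercept_at: Optional[int] = None  # occurrence count that triggers interception
    note: str = ""                      # time windows and follow-up from the text


# Thresholds taken from the text; the key names are hypothetical.
POLICY = {
    ("strong", "social"): AuditRule(
        intercept_at=1, note="manual recheck within 5 minutes; record trace info"),
    ("strong", "entertainment"): AuditRule(remind_at=2, intercept_at=4),
    ("stable", "social"): AuditRule(
        record_at=2, note="send a reminder if repeated within 1 hour"),
    ("stable", "entertainment"): AuditRule(remind_at=3, intercept_at=6),
}
# Weak timeliness applies one rule to all event types.
WEAK_RULE = AuditRule(
    record_at=10, note="within 12 hours; manual check within 24 hours; private chat exempt")


def decide_action(timeliness: str, event_type: str, count: int) -> str:
    """Return the strictest action whose threshold the count has reached."""
    rule = WEAK_RULE if timeliness == "weak" else POLICY[(timeliness, event_type)]
    if rule.intercept_at is not None and count >= rule.intercept_at:
        return "intercept"
    if rule.remind_at is not None and count >= rule.remind_at:
        return "remind"
    if rule.record_at is not None and count >= rule.record_at:
        return "record"
    return "allow"
```

Keeping the thresholds in data rather than in branching code makes it straightforward to add further event types, which the text allows for with "including but not limited to".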
Example 3
Based on the same inventive concept as the method for auditing network chat sensitive words based on multi-dimensional recognition in the foregoing embodiments, as shown in Fig. 3, the present application provides a system for auditing network chat sensitive words based on multi-dimensional recognition, the system specifically comprising:
The temporary sensitive word acquisition module is used for collecting and analyzing multi-source data in real time, extracting core keywords to identify hot events, and extracting temporary sensitive words based on the hot events;
Multi-source data from social media, news websites, forums, instant messaging tools, and the like is collected in real time. After cleaning and word segmentation, core keywords are extracted with the TF-IDF algorithm. Hot events that meet the absolute interaction volume and relative interaction volume thresholds are identified by analyzing the interaction data of texts containing the core keywords. Based on the hot events, semantic similarity and co-occurrence counts are calculated with a pre-trained word vector model, and temporary sensitive words whose relevance meets the standard are screened out, which mitigates the lag of traditional word libraries in covering emerging sensitive words;
The chat auditing module is used for constructing a temporary word library based on the acquired temporary sensitive words, and setting scenario-based validation rules that combine the temporary word library with hot event labels to audit network chat;
Scenario-based auditing rules are set by matching the associated event label of each temporary sensitive word against the hot event type and the propagation platform label; for example, social livelihood events trigger strict interception while entertainment gossip events trigger lenient reminders, which reduces cross-scenario misjudgment;
The timeliness analysis module is used for monitoring the heat data of the hot events in real time, analyzing the change trend of the heat data, and evaluating the timeliness of the temporary sensitive words based on that trend;
The heat data of a hot event is monitored in real time, the heat comprehensive value is calculated, and the trend is analyzed. The first comprehensive trend value is used to judge whether the event is in an upward, fluctuating, or downward trend, which corresponds to strong, stable, or weak timeliness of the temporary sensitive word;
The auditing strength adjustment module is used for adjusting the auditing strength of the temporary sensitive word in grades based on the timeliness evaluation result of the temporary sensitive word;
According to the timeliness evaluation result of the temporary sensitive word, the auditing strength is adjusted in grades. Under strong timeliness, social livelihood events are intercepted platform-wide and manually rechecked upon 1 occurrence, and entertainment gossip events trigger a reminder at 2 appearances at one time and interception at 4. Under stable timeliness (e.g., a fluctuating trend), social livelihood events are recorded at 2 occurrences with a reminder on repetition, and entertainment gossip events trigger a reminder at 3 and interception at 6. Under weak timeliness (e.g., the decay period), only high-frequency concentrations are recorded (e.g., 10 times within 12 hours) and private chat scenarios are not interfered with, which balances auditing accuracy and user experience.
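The timeliness analysis module depends on the first comprehensive trend value, which claim 1 builds from three indicators. A minimal Python sketch of those three raw indicators follows. Several points are assumptions rather than statements from the source: the function name is hypothetical, a deviation is taken to be the later value minus the earlier one, and the final standardization and fusion step is omitted because this section does not specify it.

```python
from typing import Sequence


def trend_indicators(heat_values: Sequence[float]) -> dict[str, float]:
    """Compute the three raw indicators from heat comprehensive values of N periods."""
    if len(heat_values) < 3:
        # At least 2 deviations are needed so that the weight K/(n-1) is defined.
        raise ValueError("need heat values for at least 3 consecutive periods")

    # Assumption: deviation = later period minus earlier period.
    deviations = [b - a for a, b in zip(heat_values, heat_values[1:])]
    n = len(deviations)

    # Direction consistency: share of positive deviations.
    direction_consistency = sum(d > 0 for d in deviations) / n

    # Amplitude: weighted sum with weight K/(n-1) for the K-th deviation (K from 1).
    amplitude = sum((k / (n - 1)) * d for k, d in enumerate(deviations, start=1))

    # Sustained trend: longest run of consecutive deviations with the same sign.
    longest = run = 0
    prev_sign = 0
    for d in deviations:
        sign = (d > 0) - (d < 0)
        if sign != 0 and sign == prev_sign:
            run += 1
        else:
            run = 1 if sign != 0 else 0
        prev_sign = sign
        longest = max(longest, run)

    return {
        "direction_consistency": direction_consistency,
        "amplitude": amplitude,
        "sustained_trend": float(longest),
    }
```

With the weight K/(n-1) as written in the claim, later deviations count more than earlier ones, and the last weight is n/(n-1), slightly above 1.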
The foregoing has shown and described the basic principles, principal features, and advantages of the invention. Those skilled in the art will understand that the invention is not limited to the embodiments described above; the embodiments and descriptions merely illustrate its principles, and various changes and modifications may be made without departing from its spirit and scope. The scope of the invention is defined by the appended claims and their equivalents.

Claims (8)

1. A method for auditing network chat sensitive words based on multi-dimensional identification, characterized by comprising the following steps:
collecting multi-source data in real time and extracting core keywords, acquiring the texts containing the core keywords and the interaction data corresponding to the texts, identifying hot events by analyzing the interaction data, and extracting temporary sensitive words based on the hot events;
constructing a temporary word library based on the temporary sensitive words, setting scenario-based validation rules by combining the temporary word library with hot event labels, and auditing network chat;
monitoring the heat data of the hot events in real time, analyzing and outputting a heat comprehensive value, analyzing changes in the heat comprehensive value along two dimensions, the conventional trend and the reignition trend, and evaluating the timeliness of the temporary sensitive words according to the result of that analysis;
the conventional trend analysis being as follows:
extracting the heat comprehensive values of N consecutive time periods, calculating the deviation between the heat comprehensive values of adjacent time periods to obtain a sequence of heat comprehensive value deviations, and analyzing the sequence to obtain a first comprehensive trend value;
when the first comprehensive trend value ≥ the first judgment value, the trend is judged to be upward and the temporary sensitive word has strong timeliness; when the first comprehensive trend value > the second judgment value and the first comprehensive trend value < the first judgment value, the trend is judged to be fluctuating and the temporary sensitive word has stable timeliness; when the first comprehensive trend value ≤ the second judgment value, the trend is judged to be downward and the temporary sensitive word has weak timeliness;
the first comprehensive trend value being obtained as follows:
extracting the positive heat comprehensive value deviations, counting them, and calculating the ratio of that count to the total number of heat comprehensive value deviations to obtain a direction consistency index;
assigning time-based weights to the heat comprehensive value deviations and computing their weighted sum to obtain an amplitude index;
the weight of the K-th heat comprehensive value deviation being K/(n-1), where n is the total number of heat comprehensive value deviations;
taking the maximum number of consecutive same-direction deviations in the sequence of heat comprehensive value deviations as a sustained trend index;
standardizing the direction consistency index, the amplitude index, and the sustained trend index, and then fusing them to output the first comprehensive trend value;
adjusting the auditing strength of the temporary sensitive words in grades according to the timeliness evaluation result of the temporary sensitive words.

2. The method for auditing network chat sensitive words based on multi-dimensional identification according to claim 1, characterized in that the core keywords are extracted as follows:
performing word segmentation on the collected multi-source data, and extracting keywords using the TF-IDF algorithm;
TF is the number of times a word appears in the current text / the total number of words in the text; IDF is log(total number of documents / number of documents containing the word + 1); TF-IDF is the product of TF and IDF;
extracting the words whose TF-IDF values rank within the ranking proportion threshold as candidate keywords; and, for batch text, extracting the candidate keywords whose TF-IDF values are greater than or equal to the TF-IDF threshold as core keywords.

3. The method for auditing network chat sensitive words based on multi-dimensional identification according to claim 1, characterized in that the hot events are identified as follows:
the interaction data includes the number of likes, comments, reposts, shares, and favorites;
summing the numbers of likes, comments, reposts, shares, and favorites of all texts respectively and averaging them to obtain the absolute interaction volume;
taking the ratio of the interaction volume of the current period to the interaction volume of the previous period as the relative interaction volume;
extracting the texts whose absolute interaction volume and relative interaction volume both exceed their thresholds as hot events.

4. The method for auditing network chat sensitive words based on multi-dimensional identification according to claim 1, characterized in that the temporary sensitive words are extracted as follows:
extracting the texts of the core keywords corresponding to each hot event; for any core keyword, using a pre-trained word vector model to convert the core keyword and the candidate words in the preprocessed text into multi-dimensional vectors, and calculating, by cosine similarity, the semantic similarity between the vector of every word in the preprocessed text and the vector of the core keyword;
counting the words that co-occur with the core keyword in the preprocessed text and the number of co-occurrences;
for any candidate word in the preprocessed text, summing the semantic similarity and the number of co-occurrences and taking the mean to output a relevance score; and extracting the candidate words whose relevance score is greater than the relevance score limit as temporary sensitive words.

5. The method for auditing network chat sensitive words based on multi-dimensional identification according to claim 1, characterized in that the heat comprehensive value is obtained as follows:
monitoring the heat data of the hot events in real time, the heat data including the absolute interaction volume, the relative interaction volume, and the interaction quality value;
normalizing the absolute interaction volume, the relative interaction volume, and the interaction quality value, and then fusing them to output the heat comprehensive value.

6. The method for auditing network chat sensitive words based on multi-dimensional identification according to claim 5, characterized in that the interaction quality value is calculated as follows:
extracting the comments whose word count exceeds the average word count and counting them as the number of over-length comments, and calculating the ratio of the number of over-length comments to the total number of comments as the comment depth;
extracting the comments that have been replied to by other users and counting them as the number of secondary-interaction comments, and calculating the ratio of the number of secondary-interaction comments to the total number of comments as the secondary interaction proportion;
fusing the comment depth and the secondary interaction proportion to output the interaction quality value.

7. The method for auditing network chat sensitive words based on multi-dimensional identification according to claim 1, characterized in that the reignition trend is analyzed as follows:
when the heat comprehensive value falls below the decay baseline, the heat trend is judged to have entered the decay period; when the heat comprehensive value rises back to the reignition trigger baseline, reignition analysis is started;
monitoring whether the heat comprehensive value, after rising back to the reignition trigger baseline, stays at or above that baseline for at least 2 consecutive time periods; if only 1 period meets the standard, the rise is judged to be a short-term fluctuation and an invalid reignition; if 2 or more consecutive periods meet the standard, proceeding to the next verification step;
calculating the interaction quality value of the reignition period and comparing it with the average interaction quality value of the decay period;
if the interaction quality value of the reignition period ≥ 1.2 times the decay-period average, the reignition is valid; otherwise, the reignition is invalid;
calculating the second comprehensive trend value after reignition;
if the reignition is valid and the second comprehensive trend value ≥ the first judgment value, it is judged to be a complete reignition, and the temporary sensitive word has strong timeliness;
if the reignition is valid and the second comprehensive trend value ≤ the second judgment value, it is judged to be a declining reignition, and the temporary sensitive word keeps its weak timeliness;
if the reignition is invalid and the second comprehensive trend value > the second judgment value and the second comprehensive trend value < the first judgment value, it is judged to be a fluctuating reignition, and the temporary sensitive word has stable timeliness.

8. A system for auditing network chat sensitive words based on multi-dimensional identification, characterized in that the system is used to perform the method according to any one of claims 1-7, the system comprising:
a temporary sensitive word acquisition module: collecting multi-source data in real time and extracting core keywords, acquiring the texts containing the core keywords and the interaction data corresponding to the texts, identifying hot events by analyzing the interaction data, and extracting temporary sensitive words based on the hot events;
a chat auditing module: constructing a temporary word library based on the temporary sensitive words, setting scenario-based validation rules by combining the temporary word library with hot event labels, and auditing network chat;
a timeliness analysis module: monitoring the heat data of the hot events in real time, analyzing and outputting a heat comprehensive value, analyzing changes in the heat comprehensive value along two dimensions, the conventional trend and the reignition trend, and evaluating the timeliness of the temporary sensitive words according to the result of that analysis;
an auditing strength adjustment module: adjusting the auditing strength of the temporary sensitive words in grades according to the timeliness evaluation result of the temporary sensitive words.
CN202511189528.2A 2025-08-25 2025-08-25 Method and system for auditing network chat sensitive words based on multidimensional identification Active CN120745646B (en)


Publications (2)

Publication Number Publication Date
CN120745646A CN120745646A (en) 2025-10-03
CN120745646B true CN120745646B (en) 2025-12-23

Family

ID=97196830


Country Status (1)

Country Link
CN (1) CN120745646B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN121211509B (en) * 2025-11-28 2026-01-30 山东征途信息科技股份有限公司 Digital rural population information safe storage method and system

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105787049A (en) * 2016-02-26 2016-07-20 浙江大学 Network video hotspot event finding method based on multi-source information fusion analysis
CN106445998A (en) * 2016-05-26 2017-02-22 达而观信息科技(上海)有限公司 Text content auditing method and system based on sensitive word

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR101560456B1 (en) * 2013-11-01 2015-10-15 황성봉 Extraction and Estimation Method of Trend Information with the Analasis of Vocabularies
CN108717408B (en) * 2018-05-11 2023-08-22 杭州排列科技有限公司 A sensitive word real-time monitoring method, electronic equipment, storage medium and system
CN113378565B (en) * 2021-05-18 2022-11-04 北京邮电大学 Event analysis method, device, device and storage medium for multi-source data fusion



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant