
CN118765394A - Near real-time in-meeting content item suggestions - Google Patents


Info

Publication number
CN118765394A
Authority
CN
China
Prior art keywords
meeting
user
content items
content item
natural language
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202380023672.1A
Other languages
Chinese (zh)
Inventor
K·T·莫伊尼汉
A·阿伦
I·加布列李德斯
G·S·阿南德
X·刘
S·马尔其斯卡
A·莫罗
J·D·潘达约
G·塞尔瓦代伊
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Microsoft Technology Licensing LLC
Original Assignee
Microsoft Technology Licensing LLC
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Priority claimed from US 17/813,685 (US12272362B2)
Application filed by Microsoft Technology Licensing LLC filed Critical Microsoft Technology Licensing LLC
Priority claimed from PCT/US2023/010723 (WO2023167758A1)
Publication of CN118765394A


Landscapes

  • Information Transfer Between Computers (AREA)

Abstract

Embodiments discussed herein relate to improving the prior art by causing one or more indications of one or more content items to be presented to one or more user devices associated with one or more conference participants during or prior to a conference based at least in part on one or more natural language utterances associated with the conference, the context of the conference, and/or the context associated with one or more conference participants. In other words, particular embodiments automatically recommend related content items in response to real-time natural language utterances and/or other contexts in a meeting.

Description

Near real-time in-meeting content item suggestions
Background
Computer-implemented techniques can assist users in communicating with each other over a communication network. For example, some teleconferencing techniques use a conference bridge component that communicatively connects a plurality of user devices over a communication network so that users can conference or otherwise talk to each other in near real-time. In another example, a conferencing software application can include instant messaging, chat functionality, or audiovisual exchange functionality via a webcam and microphone for electronic communications. However, these and other prior art techniques do not provide intelligent functionality for automatically recommending relevant content items (such as files) during a meeting based on near real-time natural language utterances in the meeting. In addition, these techniques have drawbacks in terms of computer information security, user privacy, and computer resource consumption (such as disk I/O, network bandwidth, and network latency), among other drawbacks.
Disclosure of Invention
This Summary is provided to introduce a selection of concepts in a simplified form that are further described below in the Detailed Description. This Summary is not intended to identify key features or essential features of the claimed subject matter, nor is it intended to be used in isolation as an aid in determining the scope of the claimed subject matter.
Various embodiments discussed herein relate to improving the prior art by causing one or more indications of one or more content items (such as files) to be presented to one or more user devices associated with one or more conference participants during a conference based at least in part on one or more natural language utterances associated with the conference (such as a conference participant speaking a filename), a context of the conference (such as a conference ID or topic), and/or a context associated with one or more conference participants (such as a pattern of particular historical files shared between participants of meetings with the same name). In other words, particular embodiments automatically recommend relevant content items in response to real-time natural language utterances in a meeting, as well as other contexts.
In operation, some embodiments first detect a first natural language utterance of one or more participants associated with the meeting, wherein the one or more participants include a first participant. For example, a microphone may receive near real-time audio data and an associated user device may then send the near real-time audio data to a speech-to-text service over a computer network, enabling the speech-to-text service to encode the audio data into text data, and then perform Natural Language Processing (NLP) to detect that the user has uttered an utterance.
In addition, some embodiments determine a plurality of content items associated with the meeting or a primary participant. For example, some embodiments perform a computer read of a network graph to select the nodes representing those content items that are closest in distance to the node representing the primary participant or the meeting.
Based on the first natural language utterance and at least one of: a first context associated with the meeting, a second context associated with the first meeting participant, and/or a third context associated with another meeting participant of the meeting, some embodiments determine a score for each of a plurality of content items. For example, particular embodiments can concatenate various data into a feature vector, such as a first identifier that identifies the first participant, the first natural language utterance, a second set of identifiers that each identify a respective meeting participant of the meeting, and a third identifier that identifies the meeting, which are then used as inputs to a weakly supervised machine learning model, so that the machine learning model predicts which content items are the most relevant content items for a particular time during the meeting. And based on the score, particular embodiments rank each of the plurality of content items.
Based at least in part on the ranking, particular embodiments cause at least one indication of a first content item of the plurality of content items to be presented during the meeting to a first user device associated with the first participant. For example, the model may predict that the first content item (a document) is the most relevant content item because it matches the user intent of what the meeting attendees are currently talking about (e.g., the attendees are explicitly referencing the document) and because the same document was attached to the meeting invitation in preparation for the meeting. Thus, particular embodiments will automatically cause presentation of the document (e.g., without a manual user request) as a suggestion for user access, and selectively avoid causing presentation of other documents because they do not match the user intent or otherwise lack an associated meeting or meeting-participant context.
Drawings
The invention is described in detail below with reference to the attached drawing figures, wherein:
FIG. 1 is a block diagram illustrating an example operating environment suitable for implementing some embodiments of the present disclosure;
FIG. 2 is a block diagram depicting an example computing architecture suitable for implementing some embodiments of the present disclosure;
FIG. 3 is a schematic diagram illustrating different models or layers, each of their inputs, and each of their outputs, according to some embodiments;
FIG. 4 is a schematic diagram illustrating how a neural network performs specific training and deployment predictions given specific inputs, according to some embodiments;
FIG. 5 is a schematic diagram of an example network diagram, according to some embodiments;
FIG. 6 is an example screen shot illustrating the presentation of an indication (link) of a content item in accordance with some embodiments;
FIG. 7 is an example screen shot illustrating the presentation of multiple indications of a content item of a natural language utterance according to the particular timestamp spoken, in accordance with some embodiments;
FIG. 8 is a schematic diagram illustrating a real world conferencing environment and highlighting relevant portions of content items, in accordance with some embodiments;
FIG. 9A is an example screenshot illustrating zero query presentation of indications (links and filenames) of content items (files) according to some embodiments;
FIG. 9B is a screen shot representing the completion of the natural language utterance of FIG. 9A, according to some embodiments;
FIG. 10 is a flowchart of an example process for training a weakly supervised machine learning model, according to some embodiments;
FIG. 11 is a flowchart of an example process for causing presentation of an indication of a content item based at least in part on a natural language utterance of a meeting, according to some embodiments;
FIG. 12 is a flowchart of an example process for presenting an indication of an agenda document or a pre-read document prior to a meeting, according to some embodiments; and
FIG. 13 is a block diagram of an example computing device suitable for use in implementing some embodiments described herein.
Detailed Description
The subject matter of aspects of the present disclosure is described with specificity herein to meet statutory requirements. However, the description itself is not intended to limit the scope of this patent. Rather, the inventors have contemplated that the claimed subject matter might also be embodied in other ways, to include different steps or combinations of steps similar to the ones described in this document, in conjunction with other present or future technologies. Furthermore, although the terms "step" and/or "block" may be used herein to connote different elements of methods employed, the terms should not be interpreted as implying any particular order among or between various steps herein disclosed unless and except when the order of individual steps is explicitly described. Each of the methods described herein may include a computing process that may be performed using any combination of hardware, firmware, and/or software. For example, various functions may be performed by a processor executing instructions stored in memory. The methods may also be embodied as computer-useable instructions stored on a computer storage medium. These methods may be provided by a stand-alone application, a service, or a hosted service (alone or in combination with another hosted service) or a plug-in to another product, to name a few.
As described above, the prior art fails to intelligently recommend or provide content items (such as documents) during a meeting based on real-time natural language utterances in the meeting. For example, some prior art techniques (such as email applications or meeting applications) are configured to store manually user-attached files in computer memory before the meeting begins. If a user wishes to view a file, these techniques require an explicit user query or other user activity (such as clicking) to manually search for or display the file. For example, a first user may send a meeting invitation in a calendar application, along with several documents to be discussed in the meeting. When the meeting begins, or when a user begins talking about a particular file in the meeting, the user may be required to manually retrieve that file in the email application via a search query. Not only do all of these actions negatively impact the user experience, but the corresponding user interface is static in nature. Because these existing applications require the user to manually retrieve data, the user must laboriously drill down into various user interface pages to find the appropriate files, or issue a query, which still requires the computer to generate correct search results and requires the user to identify a particular document, thereby negatively impacting accuracy and the user experience. Furthermore, the human-machine interaction is static in nature: if a user desires a particular file, the user is required to issue a basic query or selection in order for the computer to retrieve the file. The computer does not automatically retrieve the file during the meeting based on near real-time natural language utterances of the meeting, nor does it responsively select the file based on user input.
The prior art also fails to intelligently and automatically cause presentation of a content item (or an indication thereof, such as a link to the content item), or generation of a content item, before the meeting begins. For example, if the user wishes to create an agenda document or a pre-read document, prior art techniques (such as word processing techniques) require the user to manually enter each character sequence, which is not only time consuming but also increases storage device I/O. After a certain amount of time, these techniques require the storage manager to contact a storage device, such as a disk, to store the content that the user has generated, which often happens multiple times while the user is generating a single document. Contacting the disk multiple times is expensive, because it requires the read/write head to mechanically identify the correct disk and sector multiple times, which eventually wears out the read/write head. Even when the user has already generated an agenda document or a pre-read document and wishes to retrieve it before the meeting, the user still has to manually open an operating system dialog box or the like to display the document, which is cumbersome and requires unnecessary drill-down or query requests.
The prior art also has drawbacks in terms of computer information security and user privacy. For example, particular meeting applications use a supervised machine learning model to predict which utterances in the meeting correspond to action items and the like. To make such predictions, these models require human annotators (such as subject matter experts) to view private plain-text user emails, chats, and other documents so that they can label them as action items or not, in order to establish ground truth for the model. However, this obviously puts users at risk, because human annotators or malicious remote users can steal sensitive information located in these documents, such as phone numbers, social security numbers, credit card information, and the like. Furthermore, the prior art fails to incorporate access control mechanisms to prevent users from accessing content items that they should not view.
The prior art also consumes unnecessary amounts of computing resources, such as network bandwidth, network latency, and I/O, when searching for content items. For example, as described above, some conferencing applications predict whether a particular natural language utterance corresponds to an action item or another type. To make such predictions, the prior art traverses an entire decision tree or other data structure, or communicates with various services over a network, to search for content items that provide clues for action item detection. For example, each node in a graph can represent a signal or data source to poll or monitor in order to detect whether a natural language utterance is an action item. Polling all data sources increases storage device I/O (excessive physical read/write head movement on non-volatile disks), because each time a node is traversed, the component must repeatedly contact the storage device to perform a read operation, which is time consuming, error prone, and eventually wears out components such as the read/write head. Furthermore, polling all of these data sources increases network latency and reduces available bandwidth, because the same application is also performing real-time processing of the meeting's utterances, which is computationally expensive. Since many bits are dedicated to finding content items for the prediction, significantly fewer bits are available for handling the conference's utterances, which reduces available bandwidth. This loss of bandwidth also causes jitter or latency problems in processing utterances, which means that the entire signal (a series of TCP/IP packets) is delayed, resulting in segmented or delayed utterances and making it difficult to understand or hear what the user is saying.
Various embodiments of the present disclosure provide one or more technical solutions to these technical problems, as well as other problems as described herein. For example, particular embodiments relate to causing one or more indications (such as links) of one or more content items (such as files) to be presented to one or more user devices associated with one or more conference participants during a conference based at least in part on one or more natural language utterances, a context of the conference (such as a conference ID or theme), and/or a context associated with one or more conference participants (such as a pattern of particular files shared between the participants). In other words, particular embodiments automatically recommend related content items during a meeting based at least in part on real-time natural language utterances in the meeting.
In operation, some embodiments first detect a first natural language utterance of one or more participants associated with a meeting, wherein the one or more participants include a first participant. For example, a microphone may receive near real-time audio data and an associated user device may then send the near real-time audio data to a speech-to-text service over a computer network, enabling the speech-to-text service to encode the audio data into text data, and then perform Natural Language Processing (NLP) to detect that the user has uttered an utterance.
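The utterance-detection step described above can be illustrated with a minimal sketch. A real system would stream audio to a speech-to-text service and run a trained NLP model over the transcript; here a hypothetical `detect_file_reference` helper uses a simple regex over already-transcribed text (the function name and pattern are illustrative, not from the patent):

```python
import re

def detect_file_reference(transcript_segment: str) -> list[str]:
    """Return filename-like tokens mentioned in a transcribed utterance.

    Toy stand-in for the NLP step: a production system would use a trained
    entity-recognition model on the speech-to-text output, not a regex.
    """
    # Match tokens that look like common document filenames, e.g. "budget.xlsx".
    pattern = r"\b[\w\-]+\.(?:docx?|xlsx?|pptx?|pdf)\b"
    return re.findall(pattern, transcript_segment, flags=re.IGNORECASE)
```

In a pipeline like the one described, the user device would send near real-time audio to the speech-to-text service over the network, and detection of this kind would run on each returned transcript segment.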
In addition, some embodiments determine a plurality of content items associated with the meeting or a primary participant. For example, some embodiments perform a computer read of a network graph to select the nodes representing those content items that are closest in distance to the node representing the primary participant or the meeting.
Based on the first natural language utterance and at least one of: a first context associated with the meeting, a second context associated with the first meeting participant, and/or a third context associated with another meeting participant of the meeting, some embodiments determine a score for each of a plurality of content items. For example, particular embodiments can concatenate various data into a feature vector (such as a first identifier that identifies the first participant, the first natural language utterance, a second set of identifiers that each identify a respective meeting participant of the meeting, and a third identifier that identifies the meeting) and then use the feature vector as input to a weakly supervised machine learning model, so that the machine learning model predicts which content items are the most relevant content items for a particular time during the meeting. And based on the score, particular embodiments rank each of the plurality of content items.
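The scoring-and-ranking step can be sketched as follows. The feature weights (2.0, 1.5, 0.5) and the token-overlap feature are invented stand-ins for what the patent's weakly supervised model would learn from its concatenated feature vector; the data shapes are hypothetical:

```python
def score_and_rank(utterance, meeting_id, participant_ids, items):
    """Score each candidate content item against the utterance and meeting
    context, then rank by descending score.

    Hand-rolled stand-in for the weakly supervised model; all weights are
    made up for illustration.
    """
    utterance_tokens = set(utterance.lower().split())
    ranked = []
    for item in items:
        title_tokens = set(item["title"].lower().replace(".", " ").split())
        overlap = len(utterance_tokens & title_tokens)          # utterance signal
        meeting_match = 1.0 if item.get("meeting_id") == meeting_id else 0.0
        shared = len(set(item.get("participants", [])) & set(participant_ids))
        ranked.append((item["title"], 2.0 * overlap + 1.5 * meeting_match + 0.5 * shared))
    ranked.sort(key=lambda pair: pair[1], reverse=True)
    return ranked
```

The top-ranked entry would then drive which indication (e.g., a link) is surfaced to the participant's device.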
Based at least in part on the ranking, particular embodiments cause at least one indication of a first content item of the plurality of content items to be presented during the meeting to a first user device associated with the first participant. For example, the model may predict that the first content item (a document) is the most relevant content item because it matches the user intent of what the meeting attendees are currently talking about (e.g., the attendees are explicitly referencing the document) and because the same document was attached to the meeting invitation in preparation for the meeting. Thus, particular embodiments will automatically cause presentation of the document (e.g., without a manual user request) as a suggestion for user access, and selectively avoid causing presentation of other documents because their rank is not high enough.
Particular embodiments improve upon the prior art in that they score or rank each of a plurality of content items and/or in that they cause an indication of the content items to be presented during the meeting based on the scoring or ranking. For example, scoring and presentation can be based on factors such as real-time natural language utterances in the conference and/or other context, such as the conference subject or a participant ID. Particular embodiments do not require explicit user queries or other user activities (such as clicking) to manually search for or present content items, but rather automatically provide such content items based on unique rules or factors (e.g., providing content items that match the natural language utterances of the meeting, or providing content items based on the user having downloaded such content items as attachments to previous emails). For example, using the illustration above, if a first user issues a meeting invitation in a calendar application, and several documents are to be discussed in the meeting, particular embodiments may automatically score each of these documents based on near real-time natural language utterances (such as meeting participants explicitly referencing the documents) and the ID of the meeting (meeting context). In some cases, the generated score is itself a technical solution to these problems, as the most relevant content items are presented. Rather than requiring the user to manually retrieve a particular file in an email application via a search query, particular embodiments will automatically cause an indication (such as a link) of the particular file to be presented, based on the score, when the meeting begins or when the user begins talking about that file. Such a presentation is itself an additional technical solution to these technical problems.
Particular embodiments improve user interfaces and human-machine interactions by automatically causing presentation of indications of content items during a meeting, thereby eliminating the need for a user to laboriously drill down into various pages to find the appropriate files or issue queries. Instead, for example, a tile, toast notification, or other user interface element that automatically surfaces at least the first content item can be presented to the user. Particular embodiments do not require a user to issue a static query or selection to the computer to retrieve each of the plurality of files; rather, the computer automatically retrieves each file (or other content item) during the meeting based on near real-time natural language utterances of the meeting, and responsively selects a content item based on user input. For example, a toast notification can be automatically presented to the user device along with a ranked, score-based list of content items. Based on receiving an indication that the user has selected an indicator referencing a particular content item in the ranked list, particular embodiments select that content item and cause presentation of an indication of the content item, thereby improving human-machine interaction: the computer automatically presents various candidate content items but selects a candidate content item for presentation only based on the user's selection, rather than presenting each content item to the user based on a manual, explicit computer query or selection.
Some embodiments improve upon the prior art by intelligently and automatically causing an indication of a content item to be presented to a user, or generating a content item, before the meeting begins. For example, if a user wishes to create an agenda document or a pre-read document, particular embodiments automatically generate the contents of such an agenda or pre-read document based on the context associated with the meeting (such as the meeting topic, the particular meeting participants, and existing emails discussing the meeting). For example, particular embodiments can locate historical emails and files that discuss the subject matter of a meeting and copy particular content from both sources into a single document that summarizes the agenda items. Such actions reduce storage device I/O because certain embodiments perform a single write (or fewer writes) to the storage device to generate the document, rather than repeatedly storing or writing manual user input to the storage device as required by the prior art. Thus, for example, certain embodiments contact the disk fewer times, which causes the read/write head to mechanically identify the disk and/or sector fewer times, resulting in less wear of the read/write head. Even when the user has already generated an agenda document or a pre-read document and wishes to retrieve it before the meeting, various embodiments can automatically cause such a document to be rendered, which is much simpler and requires less drill-down, as the user does not have to manually open an operating system dialog box or the like to display the document.
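The single-write agenda generation described above might look like this sketch, assuming a hypothetical email store shaped as a list of dicts (the field names `subject` and `summary` are illustrative):

```python
def draft_agenda(meeting_topic: str, emails: list[dict]) -> str:
    """Compose a draft agenda from historical emails whose subjects mention
    the meeting topic, building the whole document in memory so it can be
    committed to storage in a single write (rather than many keystroke-driven
    writes, as discussed above)."""
    relevant = [e for e in emails if meeting_topic.lower() in e["subject"].lower()]
    lines = [f"Agenda: {meeting_topic}", ""]
    for email in relevant:
        lines.append(f"- {email['subject']}: {email['summary']}")
    return "\n".join(lines)  # one buffer -> one write to the storage device
```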
Various embodiments also improve computer information security and user privacy over the prior art. For example, particular embodiments use a weakly supervised model, rather than a supervised machine learning model, for prediction. A weakly supervised model is a model that can use any flexible (noisy, imprecise, or limited) data source and programmatically or heuristically label training data in a supervised setting without using human annotators. As described above, in order to make predictions, existing supervised models require human annotators to view private user emails, chats, and other documents so that they can label them as action items or not, in order to establish ground truth for the model. However, particular embodiments improve upon these models by programmatically assigning particular labels without a human annotator. In this way, no human annotator can view or steal private data, such as credit card information, telephone numbers, and the like. In addition, some embodiments encrypt such personal information so that other remote users cannot access the information.
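Programmatic labeling of this kind can be sketched with Snorkel-style labeling functions: small heuristics that each vote on an utterance, so no human annotator ever reads the private text. The heuristics below are invented examples, not the patent's actual rules:

```python
POSITIVE, NEGATIVE, ABSTAIN = 1, 0, -1  # noisy label values

def lf_mentions_file_extension(utterance: str) -> int:
    # Utterances naming a document are likely content-item references.
    return POSITIVE if any(ext in utterance.lower()
                           for ext in (".docx", ".pdf", ".xlsx")) else ABSTAIN

def lf_share_verb(utterance: str) -> int:
    # Verbs like "share" or "open" suggest a content-item request.
    return POSITIVE if any(v in utterance.lower()
                           for v in ("share", "open", "pull up")) else ABSTAIN

def lf_small_talk(utterance: str) -> int:
    # Chit-chat is unlikely to reference a content item.
    return NEGATIVE if any(w in utterance.lower()
                           for w in ("weather", "weekend")) else ABSTAIN

def weak_label(utterance: str) -> int:
    """Majority vote over the labeling functions; ABSTAIN if none fires."""
    votes = [lf(utterance) for lf in
             (lf_mentions_file_extension, lf_share_verb, lf_small_talk)]
    votes = [v for v in votes if v != ABSTAIN]
    return max(set(votes), key=votes.count) if votes else ABSTAIN
```

The resulting noisy labels serve as programmatic ground truth for training the model, replacing the human annotation step.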
Furthermore, particular embodiments improve security and user privacy by incorporating access control mechanisms to prevent users from accessing content items that they should not access. For example, during a meeting, particular embodiments cause presentation of a private email only to user devices associated with users who have permission to access it, and avoid causing presentation of the private email to a secondary participant based on the secondary participant not having permission to access it. In other words, while particular embodiments automatically recommend or cause presentation of relevant content items (such as files) during a meeting based on real-time natural language utterances in the meeting (e.g., users explicitly talking about files), such recommendation or presentation does not occur at the expense of user privacy: content items that may be private to a given user are not caused to be presented to the user device of a user without access rights.
One of the access control mechanisms that improves upon the prior art is causing an indication of a content item to be presented to a user in response to receiving a request to share that content item from a user who has access to it. For example, particular embodiments may cause presentation of a private email to the user device of a primary participant based on a real-time conversation in the meeting regarding content within the email. Some embodiments cause a prompt to be displayed on the user device asking whether the primary participant would like to share the email with other participants of the conference. Particular embodiments then receive a request from the primary participant to share the email with a secondary participant of the meeting. In response to receiving the request, some embodiments cause presentation of the email to a second user device associated with the secondary participant.
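The permission-gated sharing flow might look like the following sketch, using an in-memory ACL (the data shapes and names are hypothetical):

```python
def can_present(item_id: str, user: str, acl: dict, shared: dict) -> bool:
    """A content item may be presented to a user who has access under the
    ACL, or to whom an authorized user has explicitly shared it."""
    return user in acl.get(item_id, set()) or user in shared.get(item_id, set())

def share(item_id: str, owner: str, recipient: str, acl: dict, shared: dict) -> None:
    """Record a share request; only a user who already has access may share."""
    if owner not in acl.get(item_id, set()):
        raise PermissionError("only a user with access may share this item")
    shared.setdefault(item_id, set()).add(recipient)
```

Under a scheme like this, the private email is never pushed to the secondary participant's device until the primary participant's share request succeeds.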
Particular embodiments also improve other computing resource consumption, such as network bandwidth, network latency, and I/O, when searching for content items. In particular, particular embodiments improve computing resource consumption by determining a plurality of content items associated with a primary participant or meeting (or determining that the content items are actually associated with the primary participant or meeting), which are candidates for presentation during the meeting. Rather than traversing an entire decision tree or other data structure when determining content items, particular embodiments are able to determine that a subset of content items may be relevant to a meeting or particular attendee. For example, determining the plurality of content items can include performing a computer read of a network graph, in which a number of nodes represent content items to be analyzed, and selecting the plurality of content items from among other content items. Embodiments can "prune" or remove from the graph those nodes that do not represent the content items most relevant to the meeting participant or meeting. For example, only nodes representing content items that are within a threshold distance of the node representing the user may be selected. In another example, only content items whose edges indicate a relationship strength exceeding a threshold (e.g., via the thickness or weight of the edge) are considered. In this way, it is not necessary to traverse the entire graph or, more generally, to consider or monitor content items that are not relevant to a particular meeting or user.
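The pruned graph traversal described above can be sketched as a bounded breadth-first search that skips weak edges, so only nearby, strongly connected content items become candidates. The graph encoding and thresholds here are illustrative:

```python
from collections import deque

def candidate_items(edges, node_types, anchor, max_hops=2, min_weight=0.5):
    """Bounded BFS from the meeting/participant node.

    edges: {node: [(neighbor, weight), ...]}; node_types: {node: "item" | ...}.
    Edges below min_weight are pruned, and traversal stops after max_hops,
    so irrelevant parts of the graph are never read.
    """
    seen = {anchor}
    frontier = deque([(anchor, 0)])
    items = set()
    while frontier:
        node, depth = frontier.popleft()
        if depth == max_hops:
            continue
        for neighbor, weight in edges.get(node, []):
            if weight < min_weight or neighbor in seen:
                continue  # prune weak edges; skip already-visited nodes
            seen.add(neighbor)
            if node_types.get(neighbor) == "item":
                items.add(neighbor)
            frontier.append((neighbor, depth + 1))
    return items
```

Only the returned subset would then be scored and ranked, rather than every content item in the graph.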
Thus, this reduces storage device I/O (excessive physical read/write head movement on non-volatile disks), because traversal of the graph occurs over fewer nodes and fewer content items are analyzed; embodiments therefore contact the storage device fewer times to perform read/write operations, which results in less wear on the read/write head. Furthermore, this reduces network latency and bandwidth consumption, as fewer data sources, nodes, or content items are considered. Because there are fewer content items under consideration, fewer bits are dedicated to finding content items for the prediction than in the prior art. Thus, significantly more bits are available for processing the conference's natural language utterances, which increases available bandwidth. Such bandwidth savings reduce jitter and other latency issues in processing utterances, which means that the complete signal is less likely to be delayed, resulting in less segmented or less delayed utterances and making it easier to understand or hear what the user is saying.
Turning now to FIG. 1, a block diagram is provided that illustrates an example operating environment 100 in which some embodiments of the present disclosure may be employed. It should be understood that this and other arrangements described herein are set forth only as examples. Other arrangements and elements (e.g., machines, interfaces, functions, orders, and groupings of functions) may be used in addition to or instead of those shown, and some elements may be omitted entirely for clarity. Furthermore, many of the elements described herein are functional entities that may be implemented as discrete or distributed components or in combination with other components, and in any suitable combination and location. The various functions described herein as being performed by an entity may be performed by hardware, firmware, and/or software. For example, some functions may be performed by a processor executing instructions stored in memory.
Among other components not shown, the example operating environment 100 includes a plurality of user devices, such as user devices 102a and 102 b-102 n; a plurality of data sources (e.g., databases or other data stores), such as data sources 104a and 104 b-104 n; a server 106; sensors 103a and 107; and network(s) 110. It should be appreciated that the environment 100 shown in FIG. 1 is an example of one suitable operating environment. Each of the components shown in fig. 1 may be implemented via any type of computing device, such as computing device 1300 as described in connection with fig. 13. These components may communicate with each other via network(s) 110, which network 110 may include, but is not limited to: local Area Networks (LANs) and/or Wide Area Networks (WANs). In some implementations, network(s) 110 include the internet and/or a cellular network, as well as any of a variety of possible public and/or private networks.
It should be appreciated that any number of user devices, servers, and data sources may be employed within the operating environment 100 within the scope of the present disclosure. Each may comprise a single device or multiple devices cooperating in a distributed environment. For example, the server 106 may be provided via a plurality of devices arranged in a distributed environment, which collectively provide the functionality described herein. In addition, other components not shown may also be included within the distributed environment.
The user devices 102a and 102 b-102 n can be client devices on a client side of the operating environment 100, while the server 106 can be on a server side of the operating environment 100. The server 106 can include server-side software designed to work in conjunction with client-side software on the user devices 102a and 102 b-102 n to implement any combination of features and functions discussed in this disclosure. This division of the operating environment 100 is provided to illustrate one example of a suitable environment and does not require that any combination of the server 106 and the user devices 102a and 102 b-102 n be maintained as separate entities for each implementation. In some embodiments, one or more servers 106 represent one or more nodes in a cloud computing environment. According to various embodiments, a cloud computing environment includes a network-based distributed data processing system that provides one or more cloud computing services. Further, a cloud computing environment can include many computers (hundreds or thousands or more) disposed within one or more data centers and configured to share resources over one or more networks 110.
In some embodiments, the user device 102a or the server 106 alternatively or additionally includes one or more web servers and/or application servers to facilitate the delivery of web or online content to a browser installed on the user device 102 b. In general, the content may include static content and dynamic content. When a client application (such as a web browser) requests a website or web application via a URL or search term, the browser typically contacts a web server to request static content or basic components (e.g., HTML pages, image files, video files, etc.) of the website or web application. The application server typically delivers any dynamic portion of the network application or a portion of the business logic in the network application. Business logic can be described as a function of managing communications between user devices and data stores (e.g., databases). Such functionality can include business rules or workflows (e.g., code indicating conditional if/then statements, while statements, etc. to represent the order of the flow).
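The role of business logic described above can be illustrated with a small sketch. The function, role names, and data-store shape below are hypothetical, invented only to show if/then business rules mediating between a client request and a data store; they are not part of any actual application server.

```python
# Hypothetical sketch: business logic as conditional if/then workflow rules
# that mediate between a client request and a data store. All names here
# are illustrative assumptions.

def handle_request(user, action, data_store):
    """Apply simple if/then business rules before touching the data store."""
    if action == "read":
        # Any user may read their team's records.
        return data_store.get(user["team"], [])
    if action == "write":
        # Only managers may write; the rule rejects everyone else.
        if user["role"] == "manager":
            data_store.setdefault(user["team"], []).append("new-record")
            return data_store[user["team"]]
        return "denied"
    return "unknown action"

store = {}
print(handle_request({"team": "eng", "role": "manager"}, "write", store))  # ['new-record']
print(handle_request({"team": "eng", "role": "intern"}, "write", store))   # denied
```

An application server would typically wrap rules of this kind around each request before forwarding it to the database.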
The user devices 102a and 102 b-102 n may comprise any type of computing device capable of being used by a user. For example, in one embodiment, the user devices 102 a-102 n may be a type of computing device described herein with respect to fig. 13. By way of example, and not limitation, a user device may be embodied as: a Personal Computer (PC), a notebook computer, a mobile phone or mobile device, a smart phone, a tablet computer, a smart watch, a wearable computer, a Personal Digital Assistant (PDA), a music player or MP3 player, a Global Positioning System (GPS) or device, a video player, a handheld communication device, a gaming device or system, an entertainment system, a vehicle computing system, an embedded system controller, a camera, a remote control, a bar code scanner, a computerized measurement device, an appliance, a consumer electronics device, a workstation, or any combination of these delineated devices, or any other suitable computer device.
The data sources 104a and 104 b-104 n may include data sources and/or data systems configured to make data available to any of the various components of the operating environment 100 or system 200 described in connection with fig. 2. Examples of data source(s) 104 a-104 n may be one or more of the following: databases, files, data structures, corpora, or other data stores. The data sources 104a and 104 b-104 n may be separate from the user devices 102a and 102 b-102 n and the server 106, or may be incorporated and/or integrated into at least one of these components. In one embodiment, the data sources 104 a-104 n include sensors (such as sensors 103a and 107) that may be integrated into the user device(s) 102a, 102b, or 102n or server 106 or associated with the user device 102a, 102b, or 102n or server 106.
The operating environment 100 can be used to implement one or more of the components of the system 200 described in fig. 2, including components for scoring and causing presentation of indicated candidate items during or prior to a meeting, as described herein. The operating environment 100 can also be used to implement aspects of the processes 1000, 1100, and/or 1200 described in connection with fig. 10, 11, and 12, as well as any other functionality as described in connection with fig. 2-13.
Referring now to FIG. 2, in conjunction with FIG. 1, a block diagram is provided that illustrates aspects of an example computing system architecture suitable for implementing some embodiments of the present disclosure and designated generally as system 200. System 200 represents only one example of a suitable computing system architecture. Other arrangements and elements may be used in addition to or in place of those shown and some elements may be omitted entirely for clarity. Moreover, as with operating environment 100, many of the elements described herein are functional entities that may be implemented as discrete or distributed components or in combination with other components, and in any suitable combination and location.
Example system 200 includes network 110, which is described in connection with FIG. 1 and which communicatively couples components of system 200, including conference monitor 250, user data collection component 210, presentation component 220, content item generator 260, and storage 225. For example, these components may be embodied as a set of compiled computer instructions or functions, program modules, computer software services, or an arrangement of processes executing on one or more computer systems, such as computing device 1300 described in connection with FIG. 13.
In one embodiment, the functions performed by the components of system 200 are associated with one or more personal assistant applications, services, or routines. In particular, such applications, services, or routines may operate on one or more user devices (such as user device 102 a), servers (such as server 106), may be distributed across one or more user devices and servers, or implemented in the cloud. Further, in some embodiments, these components of system 200 may be distributed across a network in the cloud, including one or more servers (such as server 106) and client devices (such as user device 102 a), or may reside on user devices (such as user device 102 a). Further, these components, the functions performed by these components, or the services implemented by these components may be implemented at an appropriate abstraction layer (such as the operating system layer, application layer, hardware layer of a computing system). Alternatively or additionally, the functions of these components and/or embodiments described herein may be performed, at least in part, by one or more hardware logic components. For example, but not limited to, exemplary types of hardware logic components that can be used include Field Programmable Gate Arrays (FPGAs), application Specific Integrated Circuits (ASICs), application Specific Standard Products (ASSPs), systems On Chip (SOCs), complex Programmable Logic Devices (CPLDs). Further, although functionality is described herein with respect to particular components shown in the example system 200, it is contemplated that the functionality of these components may be shared or distributed among other components in some embodiments.
With continued reference to FIG. 2, the user data collection component 210 is generally responsible for accessing or receiving (and in some cases also identifying) user data from one or more data sources, such as the data sources 104a and 104b-104n of FIG. 1. In some embodiments, the user data collection component 210 may be used to facilitate the accumulation of user data for a particular user (or in some cases, multiple users, including crowd-sourced data) for the meeting monitor 250 or content item generator 260. In some embodiments, a "user" as described herein may be replaced with the term "meeting participant" of a meeting. The data can be received (or accessed) by the user data collection component 210 and optionally accumulated, reformatted, and/or combined, and stored in one or more data stores, such as storage 225, where it can be used by other components of the system 200. For example, user data may be stored in or associated with a user profile 240, as described herein. In some embodiments, any personally identifying data (i.e., user data that specifically identifies a particular user) is not uploaded or otherwise provided with user data from the one or more data sources, is not permanently stored, and/or is not made available to the components or subcomponents of system 200. In some embodiments, a user may opt in to or out of the services provided by the technologies described herein and/or select which user data and/or which user data sources are to be used by these technologies.
User data may be received from a variety of sources, where the data may be available in a variety of formats. For example, in some embodiments, user data received via user data collection component 210 may be determined via one or more sensors, which may be located on or associated with one or more user devices (such as user device 102a), servers (such as server 106), and/or other computing devices. As used herein, a sensor may include a function, routine, component, or combination thereof for sensing, detecting, or otherwise obtaining information, such as user data, from the data source 104a, and may be embodied in hardware, software, or both. By way of example and not limitation, user data may include data sensed or determined from one or more sensors (referred to herein as sensor data), such as: location information of the mobile device(s); attributes or features of the user device(s) (such as device status, charging data, date/time, or other information derived from a user device such as a mobile device); user activity information (e.g., application usage; online activity; searches; voice data such as automatic speech recognition; activity logs; communication data including calls, texts, instant messages, and email; website posts; and other user data associated with communication events), in some embodiments including user activity occurring on more than one user device; user history; conversation logs; application data; contact data; calendar and scheduling data; notification data; social network data; news (including trending items on search engines or social networks); online gaming data; e-commerce activity (including data from online accounts, such as video streaming or gaming service accounts); user account data (which may include data from user preferences or settings associated with a personal assistant application or service); home sensor data; appliance data; GPS data; vehicle signal data; traffic data; weather data (including forecasts); wearable device data; other user device data (which may include device settings, profiles, and network-related information, such as network names or IDs, domain information, workgroup information, connection data, Wi-Fi network data, configuration data (data about the model, firmware, or device), device pairings (such as a user pairing a mobile phone with a Bluetooth headset), or other network-related information); gyroscope data; accelerometer data; payment or credit card usage data (which may include information from a user's PayPal account); purchase history data (such as information from a user's Xbox Live, Amazon.com, or eBay account); other sensor data (including data derived from sensor components associated with the user, such as location, motion, orientation, position, user access, user activity, network access, device charging, or other data capable of being provided by one or more sensor components); location data that may be derived from Wi-Fi, cellular network, or IP address data; and virtually any other data source that may be sensed or determined as described herein.
User data can be received by the user data collection component 210 from one or more sensors and/or computing devices associated with a user. Although it is contemplated that the user data may be processed, for example by the sensors or other components not shown, for interpretability by user data collection component 210, embodiments described herein do not limit the user data to processed data, and the user data may include raw data. In some embodiments, the user data collection component 210 or other components of system 200 may determine interpretive data from the received user data. Interpretive data corresponds to data used by the components of system 200 to interpret user data. For example, interpretive data can be used to provide context to user data, which can support determinations or inferences made by the components or subcomponents of system 200, such as venue information from a location, a text corpus from user speech (i.e., speech-to-text), or aspects of spoken language understanding. Moreover, it is contemplated that for some embodiments, the components or subcomponents of system 200 may use user data alone and/or user data in combination with interpretive data to achieve the objectives of the subcomponents described herein.
In some aspects, user data may be provided in user data streams or signals. A "user signal" can be a feed or stream of user data from a corresponding data source. For example, a user signal may come from a smartphone, a home sensor device, a smart speaker, a GPS device (e.g., location coordinates), a vehicle sensor device, a wearable device, a user device, a gyroscope sensor, an accelerometer sensor, a calendar service, an email account, a credit card account, or another data source. In some embodiments, the user data collection component 210 receives or accesses user-related data continuously, periodically, as it becomes available, or as needed.
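A user signal of this kind can be illustrated as a simple stream of timestamped readings. The following minimal sketch is an assumption-laden illustration (the reading values, field layout, and function names are invented) showing a GPS-style feed being consumed periodically:

```python
# Illustrative sketch of a "user signal": a feed of timestamped readings
# from one data source, consumed as it becomes available. The field layout
# (timestamp, latitude, longitude) is an assumption for illustration.

def gps_signal():
    """Yield a stream of (timestamp, latitude, longitude) readings."""
    readings = [(0, 47.64, -122.13), (60, 47.65, -122.14), (120, 47.66, -122.15)]
    for reading in readings:
        yield reading

def collect(signal, since=0):
    """Accumulate only the readings that became available at or after `since`."""
    return [r for r in signal if r[0] >= since]

print(collect(gps_signal(), since=60))  # [(60, 47.65, -122.14), (120, 47.66, -122.15)]
```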
With continued reference to FIG. 2, the example system 200 includes a conference monitor 250. Conference monitor 250 includes conference activity monitor 252, context information determiner 254, conference content assembler 256, and natural language utterance detector 257. Conference monitor 250 is generally responsible for determining and/or detecting conference features from online conferences and/or face-to-face conferences and making the conference features available to other components of system 200. For example, such monitored activity can be a meeting location (e.g., as determined by the geolocation of a user device), the subject of the meeting, the invitees to the meeting, the meeting participants, whether the meeting is recurring, related deadlines, projects, and so forth. In some aspects, conference monitor 250 determines and provides a set of conference features (such as described below) for a particular conference and for each user associated with the conference. In some aspects, the meeting may be a past (or historical) meeting or a current meeting. Further, it should be appreciated that conference monitor 250 may be responsible for monitoring any number of conferences, for example, each online conference associated with system 200. Accordingly, the features corresponding to an online meeting determined by meeting monitor 250 can be used to analyze multiple meetings and determine corresponding patterns.
In some embodiments, the input to conference monitor 250 is sensor data and/or user device data for one or more users in the event, and/or contextual information from a conference invitation and/or email or other device activity of the user in the conference. In some embodiments, this includes user data collected by user data collection component 210 (which can be accessed via user profile 240).
Conference activity monitor 252 is generally responsible for monitoring conference events (such as user activity) via one or more sensors (such as microphones, video), devices, chat, presented content, and the like. In some embodiments, conference activity monitor 252 outputs records or activities that occur during the conference. For example, the activity or content may be time stamped or otherwise related to the meeting transcription. In an illustrative example, conference activity monitor 252 may indicate clock times for the start and end of a conference. In some embodiments, meeting activity monitor 252 monitors user activity information from a plurality of user devices associated with the user and/or from cloud-based services (such as email, calendar, social media, or similar information sources) associated with the user, and may include contextual information associated with the transcription or content of the event. For example, an email may detail conversations between two participants that provide context to a transcription of a meeting by describing details of the meeting (such as the purpose of the meeting). In some embodiments, meeting activity monitor 252 may determine current or near real-time user activity information and may also determine historical user activity information, which may be determined based on collecting observations of user activity over time and/or accessing a user log of past activity (e.g., such as browsing history). Further, in some embodiments, the conference activity monitor may determine user activity (which may include historical activity) from other similar users (i.e., crowdsourcing).
In embodiments that use context information associated with a user device (such as via the context information determiner 254), the user device may be identified by the conference activity monitor 252 by detecting and analyzing characteristics of the user device, such as device hardware, software such as the OS, network-related characteristics, user accounts accessed via the device, and the like. For example, as previously described, information about a user device may be determined using functionality of many operating systems that provides information about the hardware, the OS version, network connection information, installed applications, and the like. In some embodiments, a device name or identification (device ID) may be determined for each device associated with the user. This information about the identified user devices associated with the user may be stored in a user profile associated with the user, such as in user account(s) and device(s) 244 of user profile 240. In one embodiment, a user device may be polled, queried, or otherwise analyzed to determine context information about the device. This information may be used to determine a label or identification for the device (such as a device ID) so that user activity on one user device may be recognized and distinguished from user activity on another user device. Further, as previously described, in some embodiments, a user may declare or register a user device, such as by logging into an account via the device, installing an application on the device, connecting to an online service that queries the device, or otherwise providing information about the device to an application or service. In some embodiments, logging into an account associated with the user (such as an email account, a social network account, or the like) is identified and determined to be associated with the user.
In some embodiments, meeting activity monitor 252 monitors user data associated with the user devices and other relevant information on a user device, across multiple computing devices (e.g., associated with all participants in the meeting), or in the cloud. Information about the user's devices may be determined from the user data made available via the user data collection component 210 and may be provided to the content item generator 260 and other components of system 200 to make predictions as to whether a character sequence or other content refers to a content item. In some implementations of meeting activity monitor 252, a user device may be identified by detecting and analyzing characteristics of the user device (such as device hardware, software such as the OS, network-related characteristics, user accounts accessed via the device, and the like), as described above. For example, information about a user device may be determined using functionality of many operating systems that provides information about the hardware, the OS version, network connection information, installed applications, and the like. Similarly, some embodiments of meeting activity monitor 252 or a subcomponent thereof can determine a device name or identification (device ID) for each device associated with the user.
The context information extractor/determiner 254 is generally responsible for determining context information (also referred to herein as "context") associated with a conference and/or one or more conference participants. This information may be metadata or other data that is not the actual meeting content or payload itself but that describes related information. For example, the context may include who attended or was invited to attend the meeting, the subject of the meeting, whether the meeting is recurring, the location of the meeting, the date of the meeting, relationships to other projects or other meetings, and information about the invited or actual attendees of the meeting (such as company roles, whether the attendees are from the same company, etc.). In some embodiments, the context information extractor/determiner 254 determines some or all of this information by reading information within the user profile 240 or meeting profile 270, as described in more detail below.
Natural language utterance detector 257 is generally responsible for detecting one or more natural language utterances from one or more participants of a meeting or other event. For example, in some embodiments, natural language utterance detector 257 detects natural language via a speech-to-text service. For example, an activated microphone at a user device can pick up or capture near-real-time utterances of a user, and the user device can send the voice data over the network(s) 110 to a speech-to-text service that encodes or converts the audio speech into text data using natural language processing. In another example, natural language utterance detector 257 can detect natural language utterances (such as chat messages) via Natural Language Processing (NLP) alone, for example, by parsing each word, labeling each word with a part-of-speech (POS) tag, and the like, to determine the syntactic or semantic context. In these embodiments, the input may not be audio data but may instead be written natural language utterances, such as chat messages. In some embodiments, the NLP includes using an NLP model, such as Bidirectional Encoder Representations from Transformers (BERT) (e.g., via Next Sentence Prediction (NSP) or Masked Language Modeling (MLM)), in order to convert audio data into text data in a document.
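The parse-and-tag step for written utterances can be sketched with a toy tagger. The tiny lexicon below is an illustrative stand-in for a real NLP model such as BERT, and the tag assignments are assumptions, not the output of any actual model:

```python
# Minimal sketch of parsing a chat message and labeling each word with a
# part-of-speech (POS) tag, as a detector might do for written input.
# The lexicon and default tag are invented for illustration.

LEXICON = {
    "please": "ADV", "share": "VERB", "open": "VERB",
    "the": "DET", "that": "DET", "budget": "NOUN",
    "slides": "NOUN", "deck": "NOUN", "file": "NOUN",
}

def pos_tag(message):
    """Tokenize on whitespace and tag each word, defaulting to NOUN."""
    tokens = message.lower().strip(".!?").split()
    return [(tok, LEXICON.get(tok, "NOUN")) for tok in tokens]

print(pos_tag("Please share the budget slides"))
# [('please', 'ADV'), ('share', 'VERB'), ('the', 'DET'), ('budget', 'NOUN'), ('slides', 'NOUN')]
```

A real detector would feed the tagged tokens into downstream intent and entity resolution rather than stopping at tags.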
In some embodiments, natural language utterance detector 257 detects natural language utterances using speech recognition or voice recognition functionality via one or more models. For example, the natural language utterance detector 257 can detect and attribute a natural language utterance to a given participant using one or more models, such as a Hidden Markov Model (HMM), a Gaussian Mixture Model (GMM), long short-term memory (LSTM), BERT, and/or other ranking or natural language processing models. For example, an HMM can learn one or more speech patterns for a particular participant, such as by determining patterns in the amplitude, frequency, and/or wavelength values of particular tones of one or more speech utterances (such as phonemes) that the user has made. In some embodiments, the inputs used by these one or more models include speech input samples as collected by user data collection component 210. For example, the one or more models can receive historical phone calls, smart speaker utterances, videoconference auditory data, and/or any sample of a particular user's voice. In various cases, these speech input samples are pre-labeled or categorized as the speech of a particular user prior to training in a supervised machine learning context. In this way, particular weights associated with particular features of the user's speech can be learned and associated with the user, as described in more detail herein. In some embodiments, these speech input samples are not labeled but are instead clustered or otherwise predicted in an unsupervised context.
An HMM is a computational tool for representing probability distributions over sequences of observations. For example, an HMM can calculate the probability that an audio input belongs to a particular category (such as human speech or a particular participant) rather than to other categories of sounds (e.g., different speech input samples or portions of a single speech input sample). These tools model time series data. For example, at a first time window, a user may utter a first set of phonemes at a particular pitch and volume level, which are recorded as particular amplitude, frequency, and/or wavelength values. As described herein, "pitch" refers to the frequency of a sound (e.g., in hertz), which indicates whether speech sounds low or high. A "phoneme" is the smallest unit of sound that distinguishes one word (or word element, such as a syllable) from another. At a second time window subsequent to the first time window, the user may utter another set of phonemes having another set of sound values.
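The amplitude/frequency analysis described above can be illustrated with a small pitch estimator. This is a hedged sketch, assuming a clean synthetic 200 Hz tone in place of a real recorded phoneme; real speech would additionally require framing, windowing, and noise handling:

```python
# Sketch: estimating pitch (fundamental frequency) of a voiced frame via
# autocorrelation. A pure synthetic tone stands in for real speech, and the
# search range (80-400 Hz) is an assumed range for human voices.
import math

def estimate_pitch(samples, rate, fmin=80, fmax=400):
    """Return a pitch estimate in Hz from the best autocorrelation lag."""
    lo, hi = int(rate / fmax), int(rate / fmin)
    best_lag = max(
        range(lo, hi + 1),
        key=lambda lag: sum(
            samples[i] * samples[i - lag] for i in range(lag, len(samples))
        ),
    )
    return rate / best_lag

rate = 8000
tone = [math.sin(2 * math.pi * 200 * n / rate) for n in range(800)]
print(round(estimate_pitch(tone, rate)))  # 200
```

The autocorrelation peaks at a lag equal to one period of the tone (40 samples at 8 kHz), so the estimate recovers the 200 Hz pitch.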
An HMM augments a Markov chain. A Markov chain is a model that provides insight into the probabilities of sequences of random variables, or states, each of which takes on a value from some set of data. The assumption of a Markov chain is that any prediction is based only on the current state, not on the states before it; the states before the current state have no effect on future states. HMMs are well suited to analyzing speech data because speech phenomena such as the pitch or tone of an utterance tend to fluctuate (depending on emotion or goal) and do not necessarily depend on previous utterances before the current state (such as a current 10-second window of a single speech input sample). In various cases, the event or feature of interest is hidden because it cannot be observed directly. For example, the hidden event of interest can be an utterance or the identity of the user associated with a speech input sample. In another example, the hidden event of interest can simply be an identification of whether a sound corresponds to a natural language utterance of a human (as opposed to other sounds). While the speech or voice input data (such as frequency, amplitude, and wavelength values) is directly observed, the identity of the user making the speech or voice input sample is unknown (it is hidden).
HMMs allow a model to use both observed events (the speech input samples) and hidden events (such as the identities of the various attendees) that are essentially causal factors in the probabilistic algorithm. An HMM is specified by the following components: a set of N states Q = q1 q2 … qN; a transition probability matrix A = a11 … aij … aNN, each aij representing the probability of moving from state i to state j, such that Σj=1..N aij = 1 for all i; a sequence of T observations O = o1 o2 … oT, each drawn from a vocabulary V = v1, v2, …, vV; a sequence of observation likelihoods B = bi(ot) (also referred to as emission probabilities), each expressing the probability of observation ot being generated from state i; and an initial probability distribution π = π1 π2 … πN over the states, where πi is the probability that the Markov chain will start in state i. Some states j may have πj = 0, meaning that they cannot be initial states.
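Under the assumption of a small, invented two-speaker model (all probabilities below are illustrative, not learned from data), the components above can be written down directly, and the forward algorithm then computes the probability of an observation sequence by summing over all hidden state paths:

```python
# Sketch of the HMM components (states Q, initial distribution pi,
# transitions A, emissions B) as plain Python, with the forward algorithm
# computing P(observations) for a toy two-speaker model.

states = ["alice", "bob"]                  # Q: hidden speaker identities
pi = {"alice": 0.6, "bob": 0.4}            # initial probability distribution
A = {                                      # transition probabilities a_ij
    "alice": {"alice": 0.7, "bob": 0.3},
    "bob": {"alice": 0.4, "bob": 0.6},
}
B = {                                      # emission probabilities b_i(o_t)
    "alice": {"low": 0.2, "high": 0.8},    # observed pitch class per frame
    "bob": {"low": 0.9, "high": 0.1},
}

def forward(obs):
    """Return P(obs) under the model, summing over hidden state paths."""
    alpha = {s: pi[s] * B[s][obs[0]] for s in states}
    for o in obs[1:]:
        alpha = {
            j: sum(alpha[i] * A[i][j] for i in states) * B[j][o]
            for j in states
        }
    return sum(alpha.values())

print(round(forward(["high", "high", "low"]), 4))  # 0.1259
```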
The probability of a particular state (such as the identity of the user who uttered a first sequence of phonemes) depends only on the previous state (such as the identity of the user who uttered another particular sequence of phonemes prior to the first sequence), thereby introducing the Markov assumption: P(qi | q1 … qi-1) = P(qi | qi-1). The probability of an output observation oi depends only on the state qi that produced the observation, and not on any other states or any other observations, thereby yielding output independence: P(oi | q1 … qi … qT, o1, …, oi, …, oT) = P(oi | qi). This allows the component to declare that, given an observation sequence O (such as a first subsection of a speech input sample comprising a set of speech frequency values), the algorithm is able to find the hidden sequence of Q states (such as the identities of the one or more participants who uttered each segment of each speech input sample).
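Given those two assumptions, the most likely hidden state sequence can be recovered with the Viterbi algorithm. The sketch below uses an invented two-speaker model (all probabilities are illustrative) and attributes each observed segment to a speaker:

```python
# Sketch of recovering the hidden state sequence (which participant uttered
# each segment) with the Viterbi algorithm. The model numbers are invented
# for illustration only.

states = ["alice", "bob"]
pi = {"alice": 0.6, "bob": 0.4}
A = {"alice": {"alice": 0.7, "bob": 0.3}, "bob": {"alice": 0.4, "bob": 0.6}}
B = {"alice": {"low": 0.2, "high": 0.8}, "bob": {"low": 0.9, "high": 0.1}}

def viterbi(obs):
    """Return the most likely hidden speaker sequence for `obs`."""
    # Each entry maps state -> (best path probability, best path so far).
    v = {s: (pi[s] * B[s][obs[0]], [s]) for s in states}
    for o in obs[1:]:
        v = {
            j: max(
                ((v[i][0] * A[i][j] * B[j][o], v[i][1] + [j]) for i in states),
                key=lambda t: t[0],
            )
            for j in states
        }
    return max(v.values(), key=lambda t: t[0])[1]

print(viterbi(["high", "low", "low"]))  # ['alice', 'bob', 'bob']
```

Unlike the forward algorithm, which sums over all paths, Viterbi keeps only the single best path into each state at each step.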
In various embodiments, an HMM or other model is provided for each participant (e.g., each participant of an organization or meeting) and trained on their daily calls or other voice samples in order to "learn" their specific voice (such as by learning the hidden variables of the HMM). Some embodiments retrain the speech model after each new call (or each newly ingested speech input sample), which enables these embodiments to continually improve the user's speech model. Some embodiments alternatively or additionally use other models, such as LSTM and/or GMM, as described in more detail herein.
Conference content assembler 256 receives the conference content, the relevant context information (such as via context information determiner 254), and the natural language utterances detected via natural language utterance detector 257, and generates a rich conference activity timeline. In some embodiments, the timeline is a transcription document that includes tags and/or other associated content. For example, the timeline can include structured data (such as a database) that includes records, where each record includes a dialog line or natural language utterance and a timestamp indicating when the natural language utterance started and stopped. The records can alternatively or additionally include contextual information, such as information about the meeting participants or the meeting itself (such as the subject of the meeting, a file, a slide show, or any information in the user profile 240 or the meeting profile 270). The rich conference activity timeline can be the output of conference monitor 250.
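A timeline of this shape can be sketched as a list of structured records. The field names and sample values below are assumptions for illustration, not the actual schema of conference content assembler 256:

```python
# Sketch of a rich meeting-activity timeline as structured records: one
# record per natural language utterance, with timestamps and attached
# context. All field names and values are invented for illustration.
from dataclasses import dataclass, field

@dataclass
class TimelineRecord:
    start: float                 # seconds from meeting start
    stop: float
    speaker: str
    utterance: str
    context: dict = field(default_factory=dict)

def assemble_timeline(events):
    """Order detected utterances by start time to form the timeline."""
    return sorted(events, key=lambda r: r.start)

timeline = assemble_timeline([
    TimelineRecord(30.0, 34.5, "bob", "Can you open the budget deck?",
                   context={"topic": "Q3 planning"}),
    TimelineRecord(0.0, 4.2, "alice", "Let's get started."),
])
print([r.speaker for r in timeline])  # ['alice', 'bob']
```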
User profile 240 generally refers to data about a particular user or meeting participant, such as learned information about the meeting participant, personal preferences of the meeting participant, and the like. The user profile 240 includes user meeting activity information 242, user preferences 244, and user accounts and devices 246. The user meeting activity information 242 can include indications, identified via patterns in previous meetings, of when a meeting participant or speaker tends to mention content items, how the meeting participant refers to those content items (e.g., via a particular name), and who they were talking to when referring to the content items. For example, a particular meeting participant may always reference a content item during the last 5 minutes of a meeting. This information can be used by the content item ranker 264 to rank content items for presentation, as described in more detail below. The user profile 240 may also include how a meeting participant or speaker references a content item. For example, historical meeting events may indicate that a particular user always uses "Xt5" to reference the name of a document. This can help the content item ranker 264 determine that the intent of a natural language utterance is to refer to the corresponding content item.
The user profile 240 can include user preferences 244, which typically include user settings or preferences associated with the meeting monitor 250. By way of example and not limitation, such settings may include: user preferences regarding particular meetings (and related information) or categories of events that the user wishes to have explicitly monitored or not monitored; crowdsourcing preferences, such as whether to use crowdsourced information or whether the user's event information may be shared as crowdsourced data; preferences regarding which event consumers can consume the user's event pattern information; and threshold and/or notification preferences, as described herein. In some embodiments, the user preferences 244 may be or include, for example, a communication channel selected by a particular user over which a content item is to be transmitted (e.g., SMS text, instant chat, email, video, etc.).
User accounts and devices 246 generally refer to device IDs (or other attributes, such as CPU, memory, or device type) belonging to a user, as well as account information such as name, business unit, team members, role, and the like. In some embodiments, a role corresponds to a meeting attendee's corporate title or other ID. For example, an attendee role can be or include one or more job titles of an attendee, such as software engineer, marketing director, CEO, CIO, managing software engineer, associate legal counsel, vice president of internal business, and the like. In some embodiments, user profile 240 includes a participant role for each participant in the conference. The participant roles can help determine the score or ranking of a given content item, as described with respect to content item ranker 264. This is because particular content items (such as files) are more or less likely to be displayed to an attendee depending on the role of that attendee.
Conference profile 270 corresponds to conference data (such as that collected by user data collection component 210) and associated metadata. Conference profile 270 includes meeting name 272, meeting location 274, meeting participant data 276, and external data 278. The meeting name 272 corresponds to the title or topic (or sub-topic) of the event, or an identifier that identifies the meeting. Content items can be determined or ranked based at least in part on the meeting name 272, as described with respect to 262 and 264. This is because particular content items may be more or less relevant for a particular meeting and its associated subject matter. For example, for a meeting whose topic is the accuracy of a machine learning model, documents about model details (such as providing more test data, reducing error rates, etc.) are more likely to be presented than they would be in, for example, a meeting whose topic is a sales strategy based on gestures and other body language habits.
Meeting location 274 corresponds to the geographic location or type of meeting. For example, meeting location 274 can indicate a physical address of the meeting or a building/room identifier of the meeting location. Meeting location 274 can alternatively or additionally indicate whether the meeting is a virtual or online meeting or a face-to-face meeting. Meeting location 274 can also be a signal for determining or ranking content items, as described with respect to 262 and 264. This is because a particular meeting location may be associated with a particular topic and, based at least in part on the location or topic, particular content is more or less likely to be relevant as a content item. For example, if it is determined that the meeting is in building B (the building where engineering testing occurs), then certain documents (such as those describing instructions for the testing, the building, etc.) are more likely to be relevant than other documents.
Meeting participant data 276 indicates the names or other identifiers of the meeting participants at a particular meeting. In some embodiments, meeting participant data 276 includes relationships between the meeting participants at the meeting. For example, meeting participant data 276 can include a graph or hierarchical tree structure indicating the highest management position at the top or root node, intermediate managers at the branches directly below that position, and staff members at the leaf level below the intermediate managers. In some embodiments, the name or other identifier of a meeting participant at the meeting is determined automatically or in near real-time as the user speaks (e.g., based on a voice recognition algorithm), or can be determined based on manual input by a meeting participant, an invitee, or an administrator of the meeting. In some embodiments, in response to determining meeting participant data 276, system 200 then retrieves or generates a user profile 240 for each participant of the meeting.
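As a minimal sketch of such a hierarchical tree structure, using an adjacency-list representation with hypothetical participant identifiers:

```python
# Assumed example hierarchy: root management node, intermediate managers,
# and staff at the leaves.
hierarchy = {
    "ceo-1": ["mgr-1", "mgr-2"],
    "mgr-1": ["eng-1", "eng-2"],
    "mgr-2": ["eng-3"],
}

def reports_under(node, tree):
    """All participants (managers and staff) below a node in the tree."""
    out = []
    for child in tree.get(node, []):
        out.append(child)
        out.extend(reports_under(child, tree))
    return out

under_ceo = reports_under("ceo-1", hierarchy)
```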
The external data 278 corresponds to any other suitable information that can be used to determine or rank the content items via 262 or 264. In some embodiments, the external data 278 includes any non-personalized data that can still be used to make predictions. For example, the external data 278 can include habit information learned over several conferences, even though the participant pool for the current event differs from the participant pools of the historical conferences. This information can be obtained via a remote source such as a blog, a social media platform, or another data source unrelated to the current meeting. In one illustrative example, it can be determined over time that a particular type of content item is always generated during the last 10 minutes of meetings for a particular organization or business unit. Thus, during the last 10 minutes of a current meeting whose particular pool of participants has never been seen before, a candidate is more likely to be predicted as a content item to be presented in the meeting based on the history of the particular organization or business unit.
With continued reference to fig. 2, the system 200 includes a content item generator 260. The content item generator 260 is generally responsible for selecting one or more content items for presentation to a particular meeting participant or user during a meeting or prior to the beginning of the meeting. The content item generator 260 includes a content item generator 261, a content item candidate determiner 262, a content item ranker 264, an access control component 266, and an attribution component 268. In some embodiments, the functions performed by the content item generator 260 are based on information contained in the user profile 240, the meeting profile 270, information determined via the meeting monitor 250, and/or data collected via the user data collection component 210, as described in more detail below.
The content item generator 261 is generally responsible for generating content and/or formatting content items. For example, content item generator 261 can generate words, sentences, paragraphs, bullets, titles, and the like. Such generation can involve the creation of a completely new content item (such as a document) that did not previously exist. In some embodiments, for example, the content item generator 261 generates an agenda document or a pre-read document. An "agenda document" is a document that describes each item or topic that will be discussed at a given meeting. A "pre-read document" is a document (or set of documents) that gives contextual information, summaries, and/or background details for a particular meeting. For example, a meeting may discuss sales figures for a particular business unit across multiple geographic areas. The pre-read may include a number of documents corresponding to the specific sales per geographic area for that business unit. The contextual and background information may be information or documents that provide definitions, graphics, or other information needed to better prepare for or understand the meeting.
In some embodiments, content item generator 261 generates content item content based on information contained in user profile 240 and/or meeting profile 270. In an illustrative example, content item generator 261 can include or use a model, such as a weakly supervised model, to learn which content items are relevant (and irrelevant) via information contained in user profile 240 or meeting profile 270, generate a network graph based on the relevance, and then walk the network graph within a threshold distance of the node representing the meeting to discover candidate content items, such as emails discussing the meeting, documents attached to meeting invitations, and so forth. Such models and graphs are described in more detail below. In some embodiments, content item generator 261 extracts selected information or content from one or more candidate content items and generates a new document. For example, content item generator 261 may extract, from multiple emails of different users, different natural language tags corresponding to different topics to be discussed in the meeting and then insert the tags into a new format (e.g., bullets next to each topic where bullets did not previously exist) to create an agenda document.
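The extract-and-reformat step (inserting bullets next to extracted topics) can be sketched as follows, with hypothetical topic strings standing in for text extracted from emails:

```python
def build_agenda(topic_fragments):
    """Assemble extracted topic strings into a bulleted agenda document."""
    lines = ["Agenda"]
    for topic in topic_fragments:
        lines.append(f"- {topic.strip()}")  # add a bullet where none existed
    return "\n".join(lines)

agenda = build_agenda(["Q3 sales review ", "Model accuracy", "Hiring plan"])
```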
To identify "topics" or otherwise understand the resulting document (such as filling in missing words or text), some embodiments use natural language processing functions such as Named Entity Recognition (NER), NSP, or MLM. For example, text extracted from an email or other content item may include sentence fragments or incomplete sentences. Thus, some embodiments are able to complete sentence fragments or incomplete sentences via training encoders using NSP and MLM.
The content item candidate determiner 262 is generally responsible for determining a plurality of content items associated with a meeting participant and/or meeting. A "content item" as described herein refers to any suitable unit of data, such as a file or a link to a file, a document or a link to a document, an image (such as a digital photograph) or a link to an image, an email, a notification, a message, and the like. A content item typically represents some data external to the conference participant utterances of the current conference. Accordingly, a content item generally excludes any natural language utterances that occur during the meeting for which one or more content items are to be presented. In some embodiments, the determined content items can exist within a larger set of content items, some of which are not relevant to the meeting or the particular user, such that only the determined set of content items is analyzed, as described herein. In some embodiments, the content item candidate determiner 262 determines which content items are associated with the meeting participants and/or meetings based on information contained in the user profile 240, the meeting profile 270, and/or detected by the natural language utterance detector 257.
In some embodiments, the content item candidate determiner 262 determines the plurality of content items based on training and/or using one or more machine learning models (such as a supervised machine learning model, an unsupervised machine learning model, a semi-supervised machine learning model, a classification-based model, a clustering model, and/or a regression-based model). For example, such a model can be a weakly supervised neural network model trained to learn which content items are attached to meeting invitations, or otherwise associated with particular meetings, as described in more detail below.
In some embodiments, the content item candidate determiner 262 additionally or alternatively determines the plurality of content items based on invoking or accessing one or more data structures (such as a network graph). For example, a first node of the network graph may represent a meeting participant or conference. In some embodiments, the content item candidate determiner 262 walks the graph up to a predetermined distance from the first node to discover other nodes corresponding to the determined plurality of content items, thereby selecting only a certain number of content items closest to the first node. The network graph is described in more detail below.
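A minimal sketch of such a bounded graph walk, using breadth-first search over an assumed adjacency-list graph (node names are hypothetical):

```python
from collections import deque

def nearby_content_items(graph, start, max_hops):
    """Breadth-first walk up to max_hops from the node representing the
    meeting or participant; returns reachable candidate nodes with their
    hop distance from the start node."""
    seen = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if seen[node] >= max_hops:
            continue
        for neighbor in graph.get(node, []):
            if neighbor not in seen:
                seen[neighbor] = seen[node] + 1
                queue.append(neighbor)
    del seen[start]           # the meeting node itself is not a candidate
    return seen               # node -> hop distance

# assumed edges: meeting linked to an email and an attached document
g = {
    "meeting-1": ["email-7", "doc-3"],
    "email-7": ["doc-9"],
}
candidates = nearby_content_items(g, "meeting-1", max_hops=2)
```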
The content item ranker 264 is generally responsible for determining or generating a score (such as an integer or confidence value) for, and ranking, each content item determined by the content item candidate determiner 262. In some embodiments, such scores are driven heuristically or statistically based on a set of programmed rules. For example, a policy may indicate that: if the natural language utterance detected via 257 includes a description of a document that matches the document's name, a score in a data structure can be incremented to a first score (with no increment or decrement when there is no match); the first score can be increased to a second score if the document is attached to the meeting invitation for the meeting (whereas without such an attachment, the score does not change or is lower); and the second score can be increased to a higher score if the document was shared by the user to whom the embodiment presents the content item (or a lower score can be given to documents not shared by the user).
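A rule-based policy of this kind can be sketched as follows; the rule weights and field names are illustrative assumptions, not values specified by any embodiment:

```python
def heuristic_score(doc, utterance, meeting):
    """Rule-driven score for one candidate document (illustrative policy)."""
    score = 0
    if doc["name"].lower() in utterance.lower():
        score += 10   # utterance mentions the document by name
    if doc["id"] in meeting.get("invitation_attachments", []):
        score += 5    # attached to the meeting invitation
    if doc.get("shared_by_presenting_user"):
        score += 3    # shared by the user receiving the suggestion
    return score

doc = {"id": "doc-3", "name": "XJ5", "shared_by_presenting_user": True}
meeting = {"invitation_attachments": ["doc-3"]}
s = heuristic_score(doc, "Let's take a look at XJ5.", meeting)
```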
Alternatively or additionally, the score is based on an output of a machine learning model, such that the score reflects a confidence level, classification, or other prediction of the most relevant content item. For example, using a given natural language utterance, user ID, meeting ID, and other meeting participant IDs as input, the model may predict from the natural language utterance that a first content item is the most relevant content item to cause to be presented. The machine learning model is described in more detail below.
In some embodiments, the content item ranker 264 then ranks each content item according to its score. For example, with integer-based scores, content items may be ranked from the highest integer score to the lowest. For instance, the first-ranked content item may have a score of 20, the second-ranked content item a score of 15, and the third- and last-ranked content item a score of 8.
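The ranking step itself is a simple sort by score, as in the following sketch (item names and scores mirror the example above):

```python
def rank_content_items(scored):
    """Order (item, score) pairs from highest to lowest score."""
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

ranked = rank_content_items([("doc-b", 15), ("doc-c", 8), ("doc-a", 20)])
```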
In some embodiments, with respect to confidence scores, the content items are ranked from the highest to the lowest confidence score. For example, given a particular near real-time natural language utterance detected via natural language utterance detector 257, an ID of the meeting, and one or more attendees, the highest ranked document may be one for which the model is 90% confident that the intent of the natural language utterance refers to that document, which the user can access (such as via access control component 266), and/or which is otherwise relevant given the context (such as the meeting context and user context). The second highest ranked document may be one for which the model is 80% confident that it is relevant given the context, even though the model is less confident that the intent of the natural language utterance refers to it. In an illustrative example, the first or highest ranked document may be the actual document referenced by the natural language utterance in the near real-time meeting, while the second or lower ranked document may be a different document than the one explicitly referenced in the natural language utterance, but still relevant given the meeting context or other information within the user profile 240 or the meeting profile 270.
In some embodiments, the content item ranker 264 weights the individual scores or content items based on the individual features or factors that make up the score (such as by increasing the score). For example, a detected intent to reference a document via a natural language utterance detected by natural language utterance detector 257 may be weighted highest, meaning that it is the most important ranking factor. This may be important because some embodiments may only cause a document to be presented in near real time relative to when the document is referenced in a natural language utterance of the meeting. For example, a user may say, "we discussed sales last meeting." Particular embodiments may cause the document specifying those sales to be presented, as the highest ranked document, in near real time relative to the time the sentence was spoken. It should be appreciated that although the various examples herein describe causing the content items themselves to be presented, indications of those content items may alternatively be caused to be presented.
In some embodiments, the content item ranker 264 may further weight the content item with the highest personal affinity for the user to whom the content item is to be presented with the second highest weight. For example, documents with more user activity or engagement (such as clicks, views, or queries) by a particular user may be given a higher weight for that user than other documents with little or no activity by the same user. In some embodiments, documents associated with a particular meeting or a meeting participant of the meeting (such as documents attached to a meeting invitation) may also be given a particular weight, but this factor may be less important than a near real-time document reference, as such documents may not be as significant relative to the time at which a meeting participant is speaking or producing a natural language utterance. In an illustrative example, each document attached to a meeting invitation or other email referencing the meeting may be given a higher weight or score than documents not attached to the meeting invitation or email.
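One way to realize this factor weighting is a weighted sum over per-factor scores; the particular weights and factor names below are assumptions chosen only to mirror the ordering described above (utterance reference highest, personal affinity second, invitation attachment third):

```python
# Assumed weights reflecting the described factor importance ordering.
WEIGHTS = {
    "referenced_in_utterance": 0.5,
    "personal_affinity": 0.3,
    "attached_to_invitation": 0.2,
}

def weighted_score(factors):
    """Combine per-factor scores (each in [0, 1]) into one weighted score."""
    return sum(WEIGHTS[name] * value for name, value in factors.items())

score = weighted_score({
    "referenced_in_utterance": 1.0,   # document named in the utterance
    "personal_affinity": 0.5,         # moderate clicks/views by this user
    "attached_to_invitation": 1.0,    # attached to the meeting invitation
})
```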
The access control component 266 is generally responsible for determining whether a particular user or meeting participant meets accessibility criteria for accessing a given content item (such as opening a link to, or viewing, a content item in the ranked list generated by the content item ranker 264). In some embodiments, the access control component 266 acts as a gatekeeper to strictly allow or deny (via a binary yes or no value) access to content items based on the accessibility criteria, regardless of the ranking of the content items via the content item ranker 264. In some embodiments, such accessibility criteria are defined in a data structure and define a set of rules that a user must pass in order to gain access. For example, a first rule may specify that a first document can be accessed only when the user has a particular corporate role or higher, such as a level 2 manager or above. A second rule may specify that a second document is accessible only if the user device requesting the second document is associated with a particular business unit. In these embodiments, the device ID may be mapped to the user ID and the business unit in the data structure. In some embodiments, the accessibility criteria may additionally or alternatively include whether a given author of a content item has explicitly allowed others to view the content item.
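Such a rule set can be sketched as follows; the role names, levels, and rule fields are hypothetical, chosen only to illustrate the two example rules above:

```python
# Assumed role hierarchy: higher number = higher corporate role.
ROLE_LEVELS = {"employee": 1, "manager-1": 2, "manager-2": 3, "vp": 4}

def can_access(user, rules):
    """Return True only if the user passes every accessibility rule
    (binary allow/deny, independent of content item ranking)."""
    min_role = rules.get("min_role")
    if min_role and ROLE_LEVELS[user["role"]] < ROLE_LEVELS[min_role]:
        return False
    unit = rules.get("business_unit")
    if unit and user["business_unit"] != unit:
        return False
    return True

rules = {"min_role": "manager-2", "business_unit": "engineering"}
allowed = can_access({"role": "vp", "business_unit": "engineering"}, rules)
denied = can_access({"role": "employee", "business_unit": "engineering"}, rules)
```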
In some embodiments, the attribution component 268 is generally responsible for attributing a particular content item to a particular user or attendee in preparation for selecting and causing presentation of the content item to that particular user. This allows different content items to be presented to different user devices associated with different conferees, based on access control mechanisms and/or the relevance of a content item to the different conferees for a given conference, as described with respect to content item ranker 264. For example, for a first participant, a first document may be ranked highest and caused to be presented to the first participant. However, a second participant may not have access control rights to the first document, or the first document may not be the highest ranked for the second participant. Thus, the first document may be presented as belonging to the first participant but not the second participant.
In some embodiments, the attribution component 268 alternatively or additionally attributes or maps each selected or ranked content item to a particular natural language utterance detected via the natural language utterance detector 257. In this way, the user can easily identify which content items are associated with or belong to which natural language utterances, such as in a user interface. For example, a meeting may include 5 natural language utterances, each referencing or otherwise associated with a different content item. Thus, at a first time and in near real-time relative to the time at which the first natural language utterance was made (or received), particular embodiments result in a first set of ranked content items being presented next to an indicator that recites the first natural language utterance. At a second time subsequent to the first time and in near real-time relative to the time when the second natural language utterance was spoken in the same meeting, particular embodiments cause a second set of ranked content items to be presented next to a second indicator that recites the second natural language utterance. In this way, different content items can be displayed continuously in near real time, based on the spoken natural language utterance.
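The utterance-to-suggestion mapping can be sketched as follows; `fake_ranker` is a hypothetical stand-in for the content item ranker 264, used only so the example is self-contained:

```python
def attribute_suggestions(utterances, rank_for):
    """Map each detected utterance to its own ranked suggestion list, so a
    UI can show which content items belong to which utterance, in order."""
    return [(utterance, rank_for(utterance)) for utterance in utterances]

def fake_ranker(utterance):
    # hypothetical stand-in for content item ranker 264
    return ["doc-sales"] if "sales" in utterance else ["doc-general"]

attributed = attribute_suggestions(
    ["We discussed sales last meeting.", "Next agenda item."], fake_ranker)
```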
The example system 200 also includes a presentation component 220 that is generally responsible for presenting content and related information to a user, such as one or more content items (or indications thereof) ranked via content item ranker 264. The presentation component 220 can include one or more applications or services on a user device, across multiple user devices, or in the cloud. For example, in one embodiment, presentation component 220 manages the presentation of content to a user across multiple user devices associated with that user. Based on content logic, device characteristics, associated logical hubs, inferred logical user locations, and/or other user data, presentation component 220 can determine on which user device(s) to present content, as well as the context of the presentation, such as how (in what format and how much content, which may depend on the user device or context) and when to present content. In particular, in some embodiments, presentation component 220 applies content logic to device features, associated logical hubs, inferred logical locations, or sensed user data to determine aspects of content presentation. For example, clarification and/or feedback requests can be presented to the user via presentation component 220.
In some embodiments, presentation component 220 generates user interface features associated with a content item. Such features can include interface elements (such as graphical buttons, sliders, menus, audio prompts, alerts, alarms, vibrations, pop-up windows, notification-bar or status-bar items, in-application notifications, or other similar features for interfacing with a user), queries, and prompts. In some embodiments, a personal assistant service or application operating in conjunction with presentation component 220 determines when and how to present the content. In such embodiments, the content, including content logic, may be understood as a recommendation to the presentation component 220 (and/or personal assistant service or application) for when and how to present the notification, which may be overridden by the personal assistant application or presentation component 220.
The example system 200 also includes storage 225. Storage 225 generally stores information including data, computer instructions (e.g., software program instructions, routines, or services), data structures, and/or models used in embodiments of the technologies described herein. By way of example and not limitation, data included in storage 225, as well as any user data that may be stored in user profile 240 or meeting profile 270, may generally be referred to as data. Any such data may be sensed or determined from sensors (referred to herein as sensor data), such as location information of mobile device(s); smartphone data (such as phone state, charging data, date/time, or other information derived from a smartphone); user activity information (e.g., app usage; online activity; searches; voice data such as automatic speech recognition; activity logs; communications data including calls, texts, instant messages, and emails; website posts; other records associated with events; or other activity-related information), including user activity that occurs over more than one user device; user history; session logs; application data; contact data; record data; notification data; social network data; news (including popular or trending items on search engines or social networks); home sensor data; appliance data; Global Positioning System (GPS) data; vehicle signal data; traffic data; weather data (including forecasts); wearable device data; other user device data (which may include, for example, device settings, profiles, network connections such as Wi-Fi network data, or configuration data regarding the model number, firmware, or device, and device pairings, such as a user pairing a mobile phone with a Bluetooth headset); gyroscope data; accelerometer data; other sensor data that may be sensed or otherwise detected by a sensor (or other detector) component, including data derived from a sensor component associated with the user (including location, motion, orientation, position, user access, user activity, network access, user device charging, or other data capable of being provided by a sensor component); data derived based on other data (e.g., location data that may be derived from Wi-Fi, cellular network, or IP address data); and nearly any other data source that may be sensed or determined as described herein. In some aspects, data or information (e.g., requested content) may be provided in user signals. A user signal can be a feed of various data from a corresponding data source. For example, a user signal may come from a smartphone, a home sensor device, a GPS device (e.g., for location coordinates), a vehicle sensor device, a wearable device, a user device, a gyroscope sensor, an accelerometer sensor, a calendar service, an email account, a credit card account, or another data source. Some embodiments of storage 225 may have stored thereon computer logic (not shown) comprising the rules, conditions, associations, classification models, and other criteria to execute the functionality of any of the components, modules, analyzers, generators, and/or engines of system 200.
FIG. 3 is a schematic diagram illustrating different models or layers, the inputs of each, and the outputs of each, according to some embodiments. At a first time, the text generation model/layer 311 receives the document 307 and/or the audio data 305. In some embodiments, the document 307 is an original document or data object, such as an image of a tangible paper document or a particular file with a particular extension (e.g., PNG, JPEG, GIF). In some embodiments, the document is any suitable data object, such as a web page (such as a chat page), application activity, and so forth. The audio data 305 may be any data representing sound in which sound waves from one or more audio signals have been encoded into another form, such as digital sound or audio. The resulting form can be recorded in any suitable format, such as WAV, Audio Interchange File Format (AIFF), MP3, etc. The audio data may include natural language utterances as described herein.
At a second time, after the first time, text generation model/layer 311 converts or encodes document 307 into a machine-readable document and/or converts or encodes the audio data into a document (either of which may be referred to herein as an "output document"). In some embodiments, text generation model/layer 311 represents or includes functionality as described with respect to natural language utterance detector 257 and conference content assembler 256. For example, in some embodiments, text generation model/layer 311 performs OCR on document 307 (an image) to generate a machine-readable document. Alternatively or additionally, text generation model/layer 311 performs speech-to-text functionality to convert audio data 305 into a transcribed document and performs NLP, as described with respect to natural language utterance detector 257.
At a third time after the second time, the speaker intent model/layer 313 receives as input the output document (e.g., a speech-to-text document) generated by the text generation model/layer 311, the conference context 309, and/or the user context 303, to determine an intent of one or more natural language utterances within the output document. In some embodiments, the speaker intent model/layer 313 is included in the content item ranker 264 and/or the content item candidate determiner 262. "Intent" as described herein refers to classifying or otherwise predicting that a particular natural language utterance carries a particular semantic meaning. For example, a first intent of a natural language utterance may be to open a first document, while a second intent may be to praise the user who created the first document. In some embodiments, intents to present a content item are given a higher weight or are considered for downstream content item suggestion prediction. Some embodiments use one or more natural language models to determine intent, such as an intent recognition model, BERT, WORD2VEC, and the like. Such models may not only be pre-trained to understand basic human language (e.g., via MLM and NSP), but can also be fine-tuned to understand natural language via meeting context 309 and user context 303. For example, as described with respect to user meeting activity information 242, a user may always discuss a particular document at a particular time during a monthly meeting, which is a particular user context 303. Thus, given that the meeting is a monthly meeting, the user is speaking, and the particular time has arrived, the speaker intent model/layer 313 may determine that the intent is to surface the particular document. In another example, a particular business unit may have a document named "XJ5," as indicated in meeting context 309. Thus, such a name can be detected in the phrase "let us see XJ5," and, by fine-tuning the BERT model on that term, it can be determined that the intent is to display the XJ5 document.
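A heavily simplified, keyword-based sketch of this name-detection step (the document names and IDs are hypothetical; a real embodiment would use a fine-tuned language model rather than string matching):

```python
# Assumed mapping of known document names (from meeting/user context)
# to content item IDs.
KNOWN_DOCS = {"xj5": "doc-xj5", "q3 report": "doc-q3"}

def detect_present_intent(utterance):
    """If an utterance contains a known document name, infer the intent
    to surface that document; otherwise report no actionable intent."""
    lowered = utterance.lower()
    for name, doc_id in KNOWN_DOCS.items():
        if name in lowered:
            return {"intent": "present_document", "doc": doc_id}
    return {"intent": "none"}

intent = detect_present_intent("Let us see XJ5 before we continue.")
```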
In some embodiments, meeting context 309 refers to any data described with respect to meeting profile 270. In some embodiments, user context 303 refers to any data described with respect to user profile 240. In some embodiments, meeting context 309 and/or user context additionally or alternatively represent any data collected via user data collection component 210 and/or obtained via meeting monitor 250.
In some embodiments, the intent is explicit. For example, a user may directly request or ask for a content item in a document to be output. However, in alternative embodiments, the intent is implicit. For example, the user may not directly request or ask for the content item, but the meeting context 309 and/or the user context 303 indicates or suggests that a document would be useful to the user. For instance, an attendee may say, "the last email I sent you describes an example … of the issue I am talking about." The attendee may not have explicitly told the other attendees to open the email. However, the intent may still be to present the email, as it may be useful.
At a fourth time after the third time, the content item ranking model/layer 315 takes as input the intent predicted by the speaker intent model/layer 313, the conference context 309, the user context 303, and/or a particular natural language utterance of the output document, in order to predict the relevant content item as the final output. In some embodiments, the content item ranking model/layer 315 represents or includes functionality as described with respect to the content item ranker 264.
Fig. 4 is a schematic diagram illustrating how a neural network 405 performs specific training and deployment predictions given specific inputs, according to some embodiments. In one or more embodiments, the neural network 405 represents or includes functionality as described with respect to the content item ranking model/layer 315 of fig. 3, the content item ranker 264 of fig. 2, and/or the speaker intent model/layer 313 of fig. 3.
In various embodiments, the neural network 405 is trained using one or more data sets of training data input(s) 415 in order to make training prediction(s) 407 at an acceptable loss, which will later help make the correct inference prediction(s) 409 in deployment. In some embodiments, training data input(s) 415 and/or deployment input(s) 403 represent raw data. As such, the data may be converted, structured, or otherwise altered before it is fed to the neural network 405 so that the neural network 405 can process it. For example, various embodiments normalize the data, scale the data, impute the data, perform data sorting, perform data wrangling, and/or apply any other preprocessing technique to prepare the data for processing by the neural network 405.
In one or more embodiments, learning or training can include minimizing a loss function between the target variable (e.g., a relevant content item) and the actual predicted variable (e.g., a non-relevant content item). Based on the loss determined by the loss function (e.g., Mean Squared Error Loss (MSEL), cross-entropy loss, etc.), the neural network 405 learns to reduce the prediction error over multiple epochs or training sessions, such that the neural network 405 learns which features and weights indicate the correct inference given the input. Thus, it may be desirable to come as close to 100% confidence as possible in a particular classification or inference in order to reduce prediction error. In an illustrative example, the neural network 405 can learn, over several epochs, that for a given transcribed document (or natural language utterance within a transcribed document) or application item (such as a calendar item), as shown in the training data input(s) 415, the likely or predicted correct content item is a particular email, file, or document.
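The training described above (iteratively reducing a loss over multiple epochs) can be sketched with a minimal logistic model trained by gradient descent on cross-entropy loss. The two features (a mention count and a recency measure) and the data values are assumptions for illustration, not features the embodiments prescribe:

```python
import math

# Minimal sketch, assuming two hypothetical features per content item:
# [times_mentioned, days_since_last_access]. Label 1 = relevant, 0 = not.
data = [([3.0, 0.2], 1), ([0.0, 9.0], 0), ([2.0, 1.0], 1), ([0.5, 8.0], 0)]
w, b, lr = [0.0, 0.0], 0.0, 0.1

def predict(x):
    """Sigmoid score: probability that the content item is relevant."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1.0 / (1.0 + math.exp(-z))

for epoch in range(200):              # epochs / "training sessions"
    for x, y in data:
        err = predict(x) - y          # gradient of cross-entropy w.r.t. z
        for i in range(len(w)):
            w[i] -= lr * err * x[i]   # step each weight against the gradient
        b -= lr * err

print(predict([3.0, 0.5]) > 0.5)      # frequently mentioned, fresh item → True
```

The real model is a neural network rather than a single logistic unit, but the loop structure (predict, compute loss gradient, update weights, repeat across epochs) is the same idea.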
After a first round of training (e.g., processing training data input(s) 415), the neural network 405 may make predictions that may or may not be at an acceptable loss level. For example, the neural network 405 may process the meeting invitation item(s) of the training input(s) 415 (which are examples of application items). Subsequently, the neural network 405 may predict that no particular content item is (or will be) attached to the meeting invitation. This process may then be repeated over a number of iterations or epochs until the best or correct predictor(s) are learned (e.g., by maximizing rewards and minimizing losses) and/or the loss function reduces the prediction error to an acceptable confidence level. For example, using the above illustration, the neural network 405 may learn that a particular meeting invitation item is associated with, or likely will include, a particular file.
In one or more embodiments, the neural network 405 converts or encodes the deployment input(s) 403 and the training data input(s) 415 into corresponding feature vectors in feature space (e.g., via convolutional layer(s)). A "feature vector" (also referred to as a "vector") as described herein may include one or more real numbers, such as a series of floating point values or integers (e.g., [0, 1, 0]), and/or natural language (e.g., English) words or other character sequences (e.g., symbols (e.g., @, !, #), phrases, sentences, etc.) that represent one or more other real numbers. Such natural language words and/or character sequences correspond to a set of features and are encoded or converted into corresponding feature vectors so that the computer can process the corresponding extracted features. For example, for a given detected natural language utterance of a given meeting, and for a given suggested user, embodiments can parse, tokenize, and encode each deployment input 403 value—the suggested meeting participant's ID, the natural language utterance (and/or intent of such utterance), the speaking meeting participant's ID, the application item associated with the meeting, the meeting's ID, the documents associated with the meeting, the email associated with the meeting, the chat associated with the meeting, and/or other metadata (e.g., time of file creation, time of last modification of the file, time the meeting participant last accessed the file)—all encoded into a single feature vector.
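The encoding step described above can be sketched as follows. This toy one-hot scheme, its field names, and its vocabulary are assumptions for illustration; the embodiments would typically use learned embeddings rather than hand-built vectors:

```python
# Illustrative sketch: encode a few meeting signals into one feature vector.
# Field names ("utterance", "speaker_id", ...) are hypothetical.
def encode(signal, vocab):
    """One-hot-style encoding: 1.0 when a vocabulary term appears in the
    utterance, followed by pass-through numeric ID/metadata features."""
    tokens = signal["utterance"].lower().split()
    vec = [1.0 if term in tokens else 0.0 for term in vocab]
    vec.append(float(signal["speaker_id"]))
    vec.append(float(signal["minutes_since_file_modified"]))
    return vec

vocab = ["sales", "xj5", "deadline"]
signal = {"utterance": "let us talk about XJ5",
          "speaker_id": 7, "minutes_since_file_modified": 42}
print(encode(signal, vocab))  # → [0.0, 1.0, 0.0, 7.0, 42.0]
```

The point is only that heterogeneous inputs (text, IDs, metadata) end up in a single numeric vector the network can process.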
In some embodiments, the neural network 405 learns parameters or weights via training so that similar features are closer to each other in feature space (e.g., via Euclidean distance or cosine distance), by minimizing the loss via a loss function (e.g., triplet loss or GE2E loss). Such training occurs based on one or more training data inputs 415 fed to the neural network 405. For example, if several meeting invitations for the same meeting or meeting topic (e.g., a monthly sales meeting) all have the same file attached, each meeting invitation will be close to the others in vector space, indicating that the next time such a meeting invitation is shared, the corresponding file is likely to be attached or otherwise related to the meeting.
Similarly, in another illustrative example of training, some embodiments learn embeddings of feature vectors based on learning (e.g., deep learning) to detect similar features between training data input(s) 415 in feature space using a distance measure, such as cosine (or Euclidean) distance. For example, the training data input 415 is converted from a string or other form into a vector (e.g., a set of real numbers), where each value or set of values represents an individual feature (e.g., a historical document, email, or chat) in a feature space. The feature space (or vector space) may include a set of feature vectors, each oriented or embedded in the space based on the aggregate similarity of the features of the feature vector. The features that are predictive for each target can be learned or weighted during various training phases or epochs. For example, given training input(s) 415 for a particular user or meeting ID, the neural network 405 can learn that a particular content item is always associated with the meeting or particular user. For example, over 90% of the time that the natural language sequence "let us talk about XJ5…" is spoken, conference participants open the corresponding document. Thus, this pattern can be weighted (e.g., node connections are emphasized to a value near 1, while other node connections (e.g., representing other documents) are attenuated to a value near 0). In this way, embodiments learn weights corresponding to different features so that similar features found in the input contribute positively to the prediction.
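The cosine-distance comparison described above can be sketched directly. The three vectors below are hypothetical encodings of two invitations for the same monthly sales meeting and one unrelated item:

```python
import math

# Sketch of cosine similarity between feature vectors. The vector values
# are assumptions for illustration.
def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

monthly_sales_1 = [1.0, 0.9, 0.0]   # hypothetical invitation embedding
monthly_sales_2 = [0.9, 1.0, 0.1]   # another invitation, same topic
unrelated       = [0.0, 0.1, 1.0]   # embedding of an unrelated item

# Similar meetings sit close together in feature space:
print(cosine_similarity(monthly_sales_1, monthly_sales_2) >
      cosine_similarity(monthly_sales_1, unrelated))       # → True
```

Cosine similarity near 1 corresponds to a small cosine distance, i.e., two items that the model should treat as related.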
One or more embodiments can determine one or more feature vectors representing the input(s) 415 in the vector space by aggregating (e.g., mean/median or dot product) feature vector values to arrive at a particular point in the feature space. For example, using the above illustration, each meeting invitation may be part of a separate feature vector (as it is a separate event or for a different meeting). Some embodiments aggregate all of these related feature vectors because they represent the same type of conference.
In one or more embodiments, the neural network 405 learns features from the training data input(s) 415 and responsively applies weights to them during training. "Weights" in the context of machine learning may represent the importance or significance of a feature or feature value for a prediction. For example, each feature may be associated with an integer or other real number, where the higher the real number, the more important the feature is for its prediction. In one or more embodiments, weights in a neural network or other machine learning application can represent the strength of connections between nodes or neurons from one layer (input) to the next (output). A weight of 0 may mean that the input will not change the output, while a weight higher than 0 changes the output. The higher the value of the input, or the closer the weight is to 1, the more the output will change or increase. Weights can also be negative. A negative weight may scale down the value of the output: the more the value of the input increases, the more the value of the output decreases. Negative weights may result in a negative score. In some embodiments, the strength of such weights or connections represents the weights described above with respect to content item ranker 264, where, for example, at the first layer of the neural network, nodes representing near real-time utterances are weighted higher than nodes representing other characteristics (such as personal affinity), because one goal may be to generate content items related to what is currently being said to the attendees. In another example, at a second layer of the neural network, a particular content item is weighted higher based on the strength of its relationship or affinity to a particular user or meeting, as described with respect to fig. 5.
In some embodiments, such training includes using a weak supervision model. Manually labeling supervised training data is impractical when sensitive data, such as enterprise data, is used. Some embodiments therefore define heuristics to programmatically label training and evaluation data. For example, some embodiments assign positive labels to emails and files that are attached to a meeting invitation or that are shared/presented in an actual meeting, and assign negative labels to all emails and files that a user (such as a meeting organizer) originally attached or shared but later removed.
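The labeling heuristic just described can be sketched as a small labeling function. The record fields below are assumptions chosen for illustration:

```python
# Sketch of the weak-supervision heuristic: attached/shared items get a
# positive label, items initially attached but later removed get a negative
# label, everything else stays unlabeled. Field names are hypothetical.
def label(item):
    if item["attached_to_invite"] or item["shared_in_meeting"]:
        return 1   # positive training label
    if item["initially_attached"]:
        return 0   # negative label: attached at first, then removed
    return -1      # unlabeled: excluded from training/evaluation

items = [
    {"attached_to_invite": True,  "shared_in_meeting": False, "initially_attached": True},
    {"attached_to_invite": False, "shared_in_meeting": False, "initially_attached": True},
    {"attached_to_invite": False, "shared_in_meeting": False, "initially_attached": False},
]
print([label(i) for i in items])  # → [1, 0, -1]
```

Because the labels come from user behavior rather than manual annotation, no human ever needs to read the sensitive enterprise content to build the training set.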
In one or more embodiments, after training of the neural network 405, the neural network 405 receives (e.g., in a deployed state) one or more deployment inputs 403. When a machine learning model is deployed, it has typically already been trained, tested, and packaged so that it can process data it has never processed before. In response, in one or more embodiments, the deployment input(s) 403 are automatically converted to one or more feature vectors and mapped into the same feature space as the vector(s) representing the training data input(s) 415 and/or training predictions. In response, one or more embodiments determine a distance (e.g., Euclidean distance) between the one or more feature vectors and the other vectors representing the training data input(s) 415 or predictions, which is used to generate the one or more inference predictions 409.
In an illustrative example, the neural network 405 may concatenate all deployment input(s) 403 values representing each feature into a feature vector. The neural network 405 may then match the user ID or other IDs (such as a meeting ID) with IDs stored in a data store to retrieve the appropriate user context, as indicated in the training data input(s) 415. In this manner, and in some embodiments, the training data input(s) 415 represent training data for a particular meeting participant or conference. The neural network 405 may then determine a distance (e.g., Euclidean distance) between the vector representing the deployment input(s) 403 and each vector represented in the training data input(s) 415. Based on the distance being within a threshold distance, particular embodiments determine that, for the given detected natural language utterances and/or intents, meeting, user IDs, and all corresponding deployment data (documents, emails, chats, metadata), the most relevant content item is Y. Thus, inference prediction(s) 409 can include such content item Y. A "suggested attendee ID" refers to the ID of the user/attendee to whom the content item is to be presented.
In particular embodiments, inference prediction(s) 409 may be hard (e.g., membership in a category is a binary "yes" or "no") or soft (e.g., there is a probability or likelihood associated with the label). Alternatively or additionally, transfer learning may occur. Transfer learning is the concept of reusing a pre-trained model for a new, related problem (e.g., new video encoders, new feedback, etc.).
Fig. 5 is a schematic diagram of an example network diagram 500, according to some embodiments. In some embodiments, the network diagram 500 represents a data structure used by the content item candidate determiner 262 to generate candidates and/or the content item ranker 264 to rank content items. A network graph is a visualization for a set of objects, where pairs of objects are connected by links or "edges. The interconnected objects are represented by points called "vertices" and the links connecting the vertices are called "edges". Each node or vertex represents a particular location in one, two, three (or any other dimension) space. A vertex is a point at which one or more edges intersect. The edges connect the two vertices. Specifically, the network graph 500 (undirected graph) includes nodes or vertices of: "user a", "user B", "file X", "conference a", "application Y" and "user E". The network diagram also includes edges K, I, H, J-1, J-2 and G-1, G-2, G-3, G-4.
In particular, network diagram 500 illustrates the relationships between multiple users, meetings, and content items (such as file X and application Y). It will be appreciated that these content items are merely representative. As such, a content item may alternatively or additionally be a particular file, image, email, chat session in which the user has participated, text message that the user has sent or received, etc. In some embodiments, with respect to relationships between users and content items, edges represent or illustrate specific user interactions (such as downloads, shares, saves, modifications, or any other read/write operations) with specific content items. In some embodiments, with respect to the relationship between meeting A and a content item, an edge represents the degree of association between the meeting and the content item. For example, the more times file X has been attached to meeting invitations associated with meeting A, the thicker the edge (or the more edges) between the corresponding nodes. In some embodiments, regarding the relationship between meeting A and a particular user, an edge represents the frequency with which the particular user attended (or was invited to attend) the meeting, or otherwise represents the degree of association between the corresponding nodes.
Representing computer resources as vertices allows users, meetings, and content items to be linked in ways that they otherwise might not be. For example, application Y may represent a group container (such as MICROSOFT TEAMS) in which electronic messages are exchanged between group members. Thus, the network diagram 500 may illustrate which users are members of the same group. In another illustrative example, network diagram 500 may indicate that user A downloaded file X at a first time (represented by edge G-1), a second time (represented by edge G-2), a third time (represented by edge G-3), and a fourth time (represented by edge G-4). The diagram 500 may also illustrate that user B also downloaded file X, as represented by edge J-1, and wrote to file X at another time, as represented by edge J-2. Thus, based on the illustrated edge instances between the respective nodes, network diagram 500 illustrates a stronger relationship between user A and file X relative to user B (e.g., user A downloaded file X more times than user B did). In other embodiments, the thickness of a single edge indicates the degree of relationship strength. For example, rather than indicating 4 edges between user A and file X, there may be a single line between user A and file X that is thicker than any other edge between another user and file X, indicating the strongest relationship.
In general, network diagram 500 indicates that user A has interacted with file X multiple times, and that user B has interacted with file X as well. Network diagram 500 also indicates that file X and application Y each have a strong relationship to meeting A. The network diagram 500 also indicates that user E has interacted with application Y.
In various embodiments, the network map 500 is used to determine or rank particular candidate content items associated with one or more particular users (user A, user B, or user E) and/or associated with meeting A. For example, some embodiments determine that file X is most relevant to user A based on the number of edges and/or the distance. In some embodiments, the determination or ranking of content items is performed, for example, by selecting the N nodes representing particular content items that are closest to meeting A or user A (such as 3 content items within a particular distance threshold). For example, using network diagram 500, user A may have been the only user in diagram 500 invited to the meeting (not user B or user E). Thus, network diagram 500 may represent a network diagram of user A. One or more network map rules may specify that the two closest candidate items for user A are selected, namely file X and application Y.
In various embodiments, closeness is determined based on distances in the network map. In some embodiments, with respect to the network graph, a "distance" corresponds to the number of edges (or edge sets) in the shortest path between vertex U and vertex V. In some embodiments, if there are multiple paths connecting two vertices, the shortest path is considered the distance between the two vertices. Thus, the distance can be denoted d(U, V). For example, the distance between user A and file X is 1 (because edges G-1 through G-4 form a single edge set directly connecting the two vertices), the distance between user A and user B (and meeting A) is 2, and the distance between user A and user E is 4 (because the shortest path between user A and user E traverses 4 edge sets). In some embodiments, content items are instead determined or ranked based solely on distance, regardless of the actual number of connections (and thus not based on the "N" closest nodes, as described above). For example, one or more network graph rules may specify that all vertices or users at or within a distance of 4 from user A are selected as candidates.
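The distance computation described above can be sketched with breadth-first search. The adjacency structure below is an assumption reconstructed from the figure description (parallel edges such as G-1 through G-4 collapse to a single hop):

```python
from collections import deque

# Sketch of graph "distance": fewest edge sets on the shortest path
# between two vertices, found via BFS. The adjacency lists are an
# illustrative reconstruction of the Fig. 5 description, not figure data.
graph = {
    "user A":        {"file X"},
    "user B":        {"file X"},
    "file X":        {"user A", "user B", "meeting A"},
    "meeting A":     {"file X", "application Y"},
    "application Y": {"meeting A", "user E"},
    "user E":        {"application Y"},
}

def distance(graph, start, goal):
    """d(U, V): number of hops on the shortest path in an undirected graph."""
    seen, queue = {start}, deque([(start, 0)])
    while queue:
        node, hops = queue.popleft()
        if node == goal:
            return hops
        for neighbor in graph[node]:
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append((neighbor, hops + 1))
    return None  # no path between the vertices

print(distance(graph, "user A", "file X"))  # → 1
print(distance(graph, "user A", "user E"))  # → 4
```

With this structure, d(user A, file X) = 1, d(user A, user B) = d(user A, meeting A) = 2, and d(user A, user E) = 4, matching the distances given in the text.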
Some embodiments additionally or alternatively determine or rank content items by selecting the top N event-related content items with which the suggested attendee (such as user A) has interacted most (as determined by the number of edges between vertices). For example, one or more network graph rules may specify selecting only those content items having two or more edges between them and the user or meeting, which in the illustration of FIG. 5 is only file X, not application Y.
Alternatively or additionally, some embodiments determine or rank content items by selecting the N content items closest to the "center point" of meeting A and/or particular users. In some embodiments, a "center point" refers to the geometric center of a set of objects (such as the average location of nodes in network map 500). For example, if only user B and user E are invited to the meeting (instead of user A), then the average location of B and E may be file X. One or more network map rules may specify that only content items within a threshold distance of the center point—here, file X—are selected.
In some embodiments, there may be a similar but different network graph for each participant. This means that different users can be shown different content items even if they are part of the same meeting and even if the same natural language utterance has been spoken. For example, network diagram 500 may represent a diagram of user A. Since user A has accessed file X the most times (as represented by the number of edges) for a given meeting A, particular embodiments may rank file X highest for presentation to user A. However, the network map of user E may indicate that user E never downloaded or otherwise accessed file X for meeting A, but instead engaged in the most user activity with application Y. Thus, for the same meeting or natural language utterance, particular embodiments cause application Y to be presented instead of file X.
In alternative embodiments, the same network map exists for all users or for a given meeting, such as in a meeting network map. In this way, the same content items can be generated for each participant in the conference. For example, some embodiments traverse graph 500 to search for files common to all meeting attendees or meeting graphs (such as via a Jaccard index), which may be file X and application Y. Such common files can be based on all users invited to meeting A, project names, the title of the meeting, whether group members report to the same director, etc.
In some embodiments, the network diagram 500 serves as an input to a machine learning model (such as the neural network 405), the content item ranking model/layer 315, and/or the content item ranker 264, so that the model can learn the relationships between content items, meetings, and attendees even without explicit links. Similarly, in some embodiments, the network map 500 is used to set the weights of various neural network connections. For example, some embodiments weight nodes representing content items (or words contained therein) according to personal affinity for a particular user. For example, if network diagram 500 represents a network diagram of user A, then the closest content item is file X (or the most edges appear between user A and file X), and file X is therefore given the highest weight relative to application Y. In another example, a weight can be assigned to each person relative to user A. User A may talk most with user B (e.g., because of a manager/report relationship). Then, at the ranking layer, files associated with user B will get a higher weight because user A interacts with user B more than with user E (based on the number of edges J-1 and J-2).
Turning now to fig. 6, an example screen shot 600 illustrates the presentation of an indication 606 (link) of a content item in accordance with some embodiments. In some embodiments, the presentation of links 606 represents the output of system 200 of FIG. 2, content item ranking model/layer 315 of FIG. 3, and/or inference prediction(s) 409 of FIG. 4. For example, the link 606 (or a file referenced by the link 606) represents the content selected or ranked highest by the content item generator 260 of FIG. 2. In some embodiments, screenshot 600 (and fig. 7-9B) particularly represents content that results in display by presentation component 220 of fig. 2. In some embodiments, screenshot 600 represents a page or other instance of a consumer application (such as MICROSOFT TEAMS), where users are able to collaborate and communicate with each other (e.g., via instant chat, video conferencing, etc.).
With continued reference to fig. 6, at a first time, conference participant 620 speaks natural language utterance 602—"The sales numbers for June are higher than expected…" In some embodiments, in response to such natural language utterance 602, natural language utterance detector 257 detects natural language utterance 602. In some embodiments, in response to detection of the natural language utterance, various functions may occur automatically as described herein, such as functions described with respect to one or more components of the content item generator 260, the text generation model/layer 311, the speaker intent model/layer 313, the content item ranking model/layer 315, the neural network 405, and/or traversal of the network graph 500 to rank content items. In response to determining that a particular email is ranked highest or is otherwise best or most suitable for presentation, presentation component 220 automatically causes presentation of window 604 during the meeting, along with the embedded text and corresponding link 606—"Here is the link to the email you sent on 08/03, which discusses the sales figures Alek just referenced."
Window 604 also includes additional text 612 ("Would you like to share the email with the group?") so that the user devices of the other participants in the group (participants 618, 620) do not automatically receive the email; unlike those participants, participant 622 automatically receives the indication of content item 606. This is because, for example, the email may be private to meeting participant 622 or may otherwise contain sensitive information. In response to receiving an indication that meeting participant 622 has selected the "yes" button 607, particular embodiments cause the link 606 to be presented to each of the user devices associated with the other meeting participants.
Turning now to fig. 7, an example screen shot 700 illustrates a presentation of multiple indications of a content item according to a particular time-stamped natural language utterance, in accordance with some embodiments. In some embodiments, presentation of the indication of the content item represents the output of the system 200 of fig. 2, the content item ranking model/layer 315 of fig. 3, and/or the inference prediction(s) 409 of fig. 4. For example, for a time-stamped natural language utterance 14:02, file A, file B, and File C represent content selected or ranked by the content item generator 260 of FIG. 2. In some embodiments, screenshot 700 represents a page or other instance of a consumer application in which users are able to collaborate and communicate with each other (e.g., via instant chat, video conferencing, etc.).
Fig. 7 illustrates that, for each natural language utterance (or detection of such utterance), when the intent of the utterance is to produce one or more content items, the content items are caused to be presented in near real-time in the meeting. Transcript 704 indicates a number of time-stamped natural language utterances and corresponding content items (also referred to as content item suggestions). In some embodiments, when the intent is not to reference or present any content item, the natural language utterance is not mapped or otherwise associated with a particular content item suggestion, as illustrated in transcript 704. For example, this may be the case at 14:03 and 14:49, because the attendees may have been talking about personal matters, such as picking up a child, a ball game, or other things unrelated to the meeting or to any particular content item. In this way, some embodiments filter from transcript 704 those natural language utterances whose intent (as determined by the speaker intent model/layer 313) is not to produce a content item.
At a first time, 14:02, Jane states, "We did very well on the project last week…" In some embodiments, natural language utterance detector 257 detects this natural language utterance. In some embodiments, in response to detection of the natural language utterance, various functions occur automatically as described herein, such as functions described with respect to one or more components of the content item generator 260, the text generation model/layer 311, the speaker intent model/layer 313, the content item ranking model/layer 315, the neural network 405, and/or traversal of the network graph 500 to rank content items. In response to determining that file A, file B, and file C are most relevant for presentation, presentation component 220 automatically presents file A, file B, and file C during the meeting. In some embodiments, the positioning of the content items within screenshot 700 indicates a particular ranking of the content items. For example, file A may be ranked highest and thus presented as the top-most content item. File B may be ranked second highest, or have the second highest score, and thus be presented directly under file A. And file C may be ranked last (or be the last-ranked relevant content item) and thus presented directly under file B. For the timestamps 14:04 and 14:49, the same process occurs—for 14:04, the most relevant content items may be file D and file E, whereas for timestamp 14:49, the most relevant content items may be file F and file G.
Turning now to fig. 8, a schematic diagram illustrates a real-world conference environment and the highlighting of relevant portions of a content item, in accordance with some embodiments. In some embodiments, the presentation of content item 808, including highlighting 810, represents the output of system 200 of FIG. 2, content item ranking model/layer 315 of FIG. 3, and/or inference prediction(s) 409 of FIG. 4. In some embodiments, the environment within fig. 8 illustrates a real-world room or other geographic area including real-world conference participants 802 and 812 (as opposed to a video conference or conference application as illustrated in fig. 6 and 7).
At a first time, virtual assistant device 806 (such as a smart speaker and/or microphone) receives an audio signal corresponding to natural language utterance 804—"Do we know when the expiration date is?" In response to virtual assistant device 806 receiving natural language utterance 804, virtual assistant device 806 causes transmission of natural language utterance 804 over network(s) 110 to another computing device, such as a server, and natural language utterance detector 257 detects natural language utterance 804. In some embodiments, in response to detection of natural language utterance 804, various functions occur automatically as described herein, such as functions described with respect to one or more components of content item generator 260, text generation model/layer 311, speaker intent model/layer 313, content item ranking model/layer 315, neural network 405, and/or traversal of network graph 500 to rank content items.
In response to determining that the document 808 is most relevant for presentation, the presentation component 220 automatically causes the document 808 to be presented during the meeting, along with highlighted text 810, the highlighted text 810 being directly relevant to the question indicated in utterance 804. In this way, meeting participant 812 can quickly view the highlighted text 810 to answer the question via utterance 814. This has utility because meeting participant 812 does not have to manually search, open, and/or scroll through relevant information within document 808, which would be expensive because meeting participant 812 may be expected to quickly find or know the information. For example, document 808 may be 20 pages long, and thus manually scrolling or drilling down would be inefficient or waste valuable time.
Highlighting refers to underlining, changing the font, changing the color, and/or otherwise changing the appearance of particular text relative to other text in the content item. Some embodiments use natural language modeling and/or string matching algorithms to detect the location to highlight. For example, some embodiments detect that the intent of utterance 804 is to find a document that indicates the expiration date of a particular item X, as indicated in a previous email and in documents related to the meeting. In response to finding the correct document, an encoder, transformer, or other BERT component may cause a computer to read the text within document 808 to search for semantically similar text related to utterance 804 (e.g., "expiration date" is semantically similar to "completion") and for keywords or keyword formats (based on using syntactic rules or components), such as dates ("Friday, November 16").
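The highlight-location step described above can be sketched with a toy keyword-plus-synonym match standing in for the BERT-based semantic search. The synonym table and document text are assumptions for illustration:

```python
import re

# Toy stand-in for semantic matching: pick the sentence of a document most
# related to the question so it can be highlighted. The synonym table and
# the document text are hypothetical examples.
SYNONYMS = {"expiration": {"completion", "due", "deadline"},
            "date": {"friday", "november"}}

def best_sentence(question, document):
    """Return the sentence sharing the most (expanded) terms with the question."""
    q_terms = set(re.findall(r"[a-z]+", question.lower()))
    expanded = set(q_terms)
    for term in q_terms:
        expanded |= SYNONYMS.get(term, set())
    sentences = [s.strip() for s in document.split(".") if s.strip()]
    return max(sentences,
               key=lambda s: len(expanded & set(re.findall(r"[a-z]+", s.lower()))))

doc = ("The kickoff went well. Completion is due Friday, November 16. "
       "Budget is fixed.")
print(best_sentence("When is the expiration date?", doc))
# → 'Completion is due Friday, November 16'
```

A real implementation would compare sentence embeddings instead of a hand-built synonym table, but the output is the same kind of span: the text to highlight within document 808.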
Turning now to fig. 9A, an example screenshot 900 illustrates zero query presentation of an indication 906 (links and filenames) of a content item (file) in accordance with some embodiments. In some embodiments, the presentation of indication 906 represents the output of system 200 of fig. 2, content item ranking model/layer 315 of fig. 3, and/or inference prediction(s) 409 of fig. 4. In some embodiments, screenshot 900 represents a page or other instance of a consumer application in which users are able to collaborate and communicate with each other (e.g., via instant chat, video conferencing, etc.).
At a first time, meeting participant 920 makes a natural language utterance 902: "good, let us turn attention to Friday…" In some embodiments, natural language utterance detector 257 detects natural language utterance 902. In some embodiments, in response to detection of natural language utterance 902, various functions occur automatically as described herein, such as functions described with respect to one or more components of content item generator 260, text generation model/layer 311, speaker intent model/layer 313, content item ranking model/layer 315, neural network 405, and/or traversal of network graph 500 to rank content items.
As illustrated in natural language utterance 902, it may not be clear from the utterance alone what will be discussed on Friday. In addition, there is no explicit query or other request to render any documents. Further, natural language utterance 908 indicates that participant 922 is interrupting or otherwise saying something that results in natural language utterance 902 being incomplete, such that participants may not understand what is important about Friday. However, in some embodiments, the speaker intent model/layer 313 determines, based on the meeting context or user context, that the implicit intent of the natural language utterance 902 is to discuss a particular ORIAN transaction closing (e.g., to find the meeting attachment for discussing the ORIAN transaction closing on Friday). In other words, embodiments are able to determine what content participants are talking about (or will talk about in the future) even if they do not explicitly reference it in a natural language utterance or query. Thus, some embodiments use the context of the user's meetings, emails, files, and/or near real-time natural language utterances to create a zero-query suggested content item, such as the indication 906 of the ORIAN protocol, as indicated in window 904. In response to determining that the content item associated with indication 906 is most relevant to natural language utterance 902, presentation component 220 automatically causes presentation of indication 906 during the meeting.
Fig. 9B is a screenshot representing the completion of the natural language utterance 902 of fig. 9A, in accordance with some embodiments. Thus, fig. 9B illustrates a point in time in the meeting after the point in time of fig. 9A. Here, the meeting participant 920 may say "as you know, Friday is the day we complete the ORIAN transaction," as indicated at 910. However, as illustrated by the content included in indication 906 ("ORIAN_protocol") in fig. 9A, prior to the making of natural language utterance 910 of fig. 9B, particular embodiments have already determined the intent and have caused the presentation of the relevant indication 906. Thus, particular embodiments provide zero-query content item suggestions to users.
FIG. 10 is a flowchart of an example process 1000 for training a weakly supervised machine learning model, according to some embodiments. Process 1000 (and/or any of the functions described herein, such as processes 1100 and 1200) may be performed by processing logic that includes hardware (e.g., circuitry, dedicated logic, programmable logic, microcode, etc.), software (e.g., instructions run on a processor to perform hardware simulation), firmware, or a combination thereof. Although particular blocks described in this disclosure are referenced in a particular number and in a particular order, it should be understood that any block may occur substantially in parallel with, before, or after any other block. Furthermore, there may be more (or fewer) blocks than shown. Added blocks may include blocks embodying any of the functions described herein (e.g., as described with respect to fig. 1-13). The computer-implemented methods, systems (which include at least one computing device having at least one processor and at least one computer-readable storage medium), and/or computer-readable media described herein may perform or be caused to perform process 1000 or any other function described herein.
In some embodiments, process 1000 represents training of neural network 405 of fig. 4 via training data input 415 and training predictions 507. Various embodiments receive a plurality of application items, as per block 1002. An "application item" as described herein refers to any suitable information element, application process(s) and/or application routine(s) associated with an application. For example, the application items can be or include calendar items of a personal information manager application (such as OUTLOOK), video conferencing sessions or events (such as a particular meeting in MICROSOFT TEAMS), wherein users participate in natural language utterance audio exchanges and can visually see each other, chat sessions of a chat application, and so forth. Thus, each videoconference session or event may include multiple recorded natural language utterances and/or video recordings of videoconference sessions or events.
"Calendar item" as described herein refers to any portion of an application workflow (such as a subset of program processes or routines) that allows a user to schedule tasks, plan meetings, set reminders for upcoming events, schedule meetings, send email notifications to meeting attendees, and so forth. For example, the calendar item may include a meeting invitation, which may be an email sent to meeting invitees to invite them to the meeting. Such emails may typically include attachments to other content items, such as files to be discussed in the corresponding meeting.
In response to receiving the plurality of application items, some embodiments programmatically (without a human annotator) assign positive labels to one or more content items associated with the application items for each of the plurality of application items, per block 1004. A content item "associated with" a particular application item refers to a content item attached to the application item (such as a file attached in a meeting invitation email), a content item shared or referenced in a meeting or other videoconference event, a content item that has been mapped to a particular application item (such as a network graph in which a first node represents a meeting and a second set of nodes within a threshold distance represents various content items related to the meeting), a content item shared in a chat session, or any other content item referenced by a user of the application. In the illustrative example of block 1004, some embodiments assign a positive label to each file attached to a meeting invitation or other calendar item of a particular meeting.
In response to receiving the plurality of application items, some embodiments programmatically (without a human annotator) assign negative labels to one or more other content items not associated with the application item for each application item, per block 1006. A content item "unassociated" with a particular application item refers to a content item that is not attached to the application item, a content item that is not shared or referenced in a meeting or other videoconference event, a content item that is not mapped to a particular application item, a content item that is not shared in a chat session, or any other content item that is not referenced by a user of the application. For example, using the illustration above with respect to block 1004, some embodiments assign a negative label to each file not attached to a meeting invitation or other calendar item of a particular meeting. In other words, these embodiments determine a pool of content items that might otherwise be attached to a meeting invitation but were never attached by any user.
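The programmatic labeling of blocks 1004 and 1006 can be sketched as follows. This is a hedged, minimal illustration; the file names and data shapes are assumptions, not the patented implementation.

```python
def weak_labels(attached_files: set[str], candidate_pool: set[str]) -> dict[str, int]:
    """Assign a positive label (1) to each content item attached to the
    application item (e.g., a meeting invitation) and a negative label (0)
    to each pool item that was never attached, with no human annotator."""
    labels = {}
    for item in candidate_pool | attached_files:
        labels[item] = 1 if item in attached_files else 0
    return labels

labels = weak_labels(
    attached_files={"orian_agreement.docx", "q3_forecast.xlsx"},
    candidate_pool={"orian_agreement.docx", "vacation_photos.zip", "old_notes.txt"},
)
```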
Based on programmatically assigning the positive and negative labels, particular embodiments extract features and determine a ground truth, per block 1008. In an illustrative example, particular embodiments receive various historical meeting invitations associated with various meetings or meeting types, each with a positive or negative label indicating whether a particular content item was attached to the meeting invitation. In response, particular embodiments convert or encode such labeled data into one or more feature vectors to represent features of the data for particular labels, which represents the ground truth.
Some embodiments identify pairs of application items and content items, as per block 1010. In other words, each application item of the plurality of application items is paired with a corresponding or associated content item and/or a non-corresponding or non-associated content item. For example, a meeting invitation may be paired with each file that was attached to the meeting invitation as an application item and content item pair. Additionally or alternatively, the meeting invitation may be paired with each file that was never attached to the meeting invitation as another application item and content item pair.
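The pairing of block 1010 can be sketched as follows (an illustrative sketch; meeting identifiers and the triple layout are assumptions):

```python
def make_pairs(meetings: dict[str, set[str]], pool: set[str]) -> list[tuple[str, str, int]]:
    """Yield (application_item_id, content_item, label) triples: label 1 for
    items attached to that meeting's invitation, 0 for pool items never
    attached to it."""
    pairs = []
    for meeting_id, attached in meetings.items():
        for item in attached:
            pairs.append((meeting_id, item, 1))
        for item in pool - attached:
            pairs.append((meeting_id, item, 0))
    return pairs

pairs = make_pairs(
    meetings={"standup": {"notes.docx"}},
    pool={"notes.docx", "budget.xlsx"},
)
```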
Some embodiments train a weakly supervised machine learning model based on learning weights associated with the features, as per block 1012. In other words, the machine learning model takes as input the pairs identified at block 1010 and determines the patterns associated with each pair to ultimately learn an embedding or the specific features for a given content item and set of application items representing the ground truth. In this way, over multiple iterations or epochs, the model learns which features are present and which are not for a given ground truth. And in this way, embodiments learn, based on the labels, which content items are associated with a given application item. Training predictions can continue until the loss function is acceptable with respect to the ground truth, such that each appropriate node weight or node path of the neural network is appropriately activated or deactivated, as described with respect to fig. 4.
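The training loop of block 1012 can be illustrated with a minimal stand-in. The document describes a neural network (FIG. 4); the sketch below instead trains a two-weight logistic model on weakly labeled pairs, iterating until predictions separate attached from never-attached items, which mirrors the loss-driven weight learning described above. All features and data are hypothetical.

```python
import math

def train_logistic(examples, epochs=500, lr=0.5):
    """Stochastic gradient descent on a logistic model over labeled
    (feature_vector, label) examples. Returns the learned weights."""
    dim = len(examples[0][0])
    w = [0.0] * dim
    for _ in range(epochs):
        for x, y in examples:
            z = sum(wi * xi for wi, xi in zip(w, x))
            p = 1.0 / (1.0 + math.exp(-z))
            for i in range(dim):
                w[i] += lr * (y - p) * x[i]  # gradient step on the log loss
    return w

# feature[0]: "was attached to a similar past meeting"; feature[1]: bias term.
data = [([1.0, 1.0], 1), ([0.0, 1.0], 0), ([1.0, 1.0], 1), ([0.0, 1.0], 0)]
w = train_logistic(data)

def predict(x):
    """Probability that the content item is associated with the application item."""
    return 1.0 / (1.0 + math.exp(-sum(wi * xi for wi, xi in zip(w, x))))
```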
FIG. 11 is a flowchart of an example process 1100 for causing presentation of an indication of a content item based at least in part on a natural language utterance of a meeting, according to some embodiments. Some embodiments detect a first natural language utterance of one or more participants associated with a meeting, as per block 1103. Examples and more specific details are described with respect to natural language utterance detector 257 of fig. 2 and text generation model/layer 311 of fig. 3. In some embodiments, the first natural language utterance is among a plurality of natural language utterances associated with the meeting. For example, a video conference may include storing a record (audio file) of each natural language utterance of various participants for the duration of the conference.
In some embodiments, the detection of the first natural language utterance includes encoding audio speech into first text data of a transcription document (such as described with respect to meeting content assembler 256), and performing natural language processing of the first text data to determine the first natural language utterance. Further details and examples of this are described with respect to the text generation model/layer 311 of FIG. 3, which may encode audio data 305 into an output document. In other embodiments, detecting the natural language utterance may include reading a data object (such as a chat page) and parsing, tokenizing, and tagging (via POS tags) the natural language text via natural language processing. In some embodiments, the transcription document includes second text data indicative of the plurality of natural language utterances, and the transcription document further includes a plurality of name identifiers, wherein each name identifier indicates the particular meeting participant that uttered a corresponding natural language utterance of the plurality of natural language utterances.
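Parsing a transcription document that carries per-utterance name identifiers, as described above, can be sketched as follows. The bracketed-speaker line format is an assumption for illustration only.

```python
import re

def parse_transcript(transcript: str) -> list[tuple[str, str]]:
    """Split a transcription document into (speaker, utterance) pairs using
    the per-line name identifiers."""
    utterances = []
    for line in transcript.strip().splitlines():
        m = re.match(r"\[(?P<speaker>[^\]]+)\]\s*(?P<text>.+)", line)
        if m:
            utterances.append((m.group("speaker"), m.group("text")))
    return utterances

doc = """[John] Good, let us turn attention to Friday.
[Jane] Quick question before that."""
parsed = parse_transcript(doc)
```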
According to block 1105, some embodiments determine a plurality of content items associated with the meeting and/or a first participant, such as the meeting participant whose device is to be presented with an indication of a content item at block 1111. In some embodiments, the plurality of content items excludes the plurality of natural language utterances. In some embodiments, such exclusion means that the content items do not include any natural language utterance that occurs in the meeting in which the first natural language utterance has been detected. For example, a meeting may include utterances from John, Jane, and Mary. The actual voice or audio data from these participants is not a content item.
In some embodiments, each content item is a candidate for presentation to a user device associated with the first participant during the meeting. In some embodiments, content items that are candidates for presentation also include an indication (such as a link) of the content item. In this case, the indication is a candidate for presentation rather than the content item itself. Similarly, in some embodiments, even if an indication (such as a link or file name) is actually presented to the user instead of the actual content item, the content item is still considered a candidate for presentation because the user is still able to access the content item from the indication.
In some embodiments, determining the plurality of content items at block 1105 includes performing a computer read of a network graph associated with the first participant, wherein a first node of the network graph represents the meeting and a second set of nodes of the network graph represents at least one of: a respective content item of the plurality of content items and other content items, the first participant, and another participant associated with the meeting. Examples of this and more details are described with respect to network graph 500 of fig. 5. For example, an embodiment can select the N closest nodes representing content items (in terms of edge distance) from the node representing the meeting.
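The N-closest-nodes selection described above can be sketched with a breadth-first traversal. The adjacency-dict representation and node names are assumptions for illustration.

```python
from collections import deque

def closest_content_items(graph, meeting_node, content_nodes, n):
    """Breadth-first traversal from the meeting node; return the n
    content-item nodes with the smallest edge distance."""
    dist = {meeting_node: 0}
    queue = deque([meeting_node])
    while queue:
        node = queue.popleft()
        for neighbor in graph.get(node, []):
            if neighbor not in dist:
                dist[neighbor] = dist[node] + 1
                queue.append(neighbor)
    reachable = [c for c in content_nodes if c in dist]
    return sorted(reachable, key=lambda c: dist[c])[:n]

graph = {
    "meeting": ["john", "orian_agreement.docx"],
    "john": ["meeting", "budget.xlsx"],
    "orian_agreement.docx": ["meeting"],
    "budget.xlsx": ["john"],
}
top = closest_content_items(graph, "meeting", ["orian_agreement.docx", "budget.xlsx"], 2)
```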
In some embodiments, the plurality of content items includes one or more of a data file (also referred to herein as a "file") or a message. For example, the plurality of content items can include a plurality of data files, a plurality of messages, and/or a combination of different data files and messages. A "data file" is a data object (such as a container) that stores data. For example, a file can be an image file (such as a digital photograph), a document file (such as a WORD or PDF document), any email attachment, and the like. A "message" may refer to one or more natural language words or characters, excluding the natural language utterances of the meeting. For example, a message can be a chat message phrase entered by a particular user in a chat session. In some embodiments, the message includes a notification, e.g., information useful to the attendees, such as "the expiration date for the item John is currently talking about is 11/16." In some embodiments, the message comprises an email. An email (or other message) may refer to a file that includes an email received or sent in the format of an email application. Alternatively or additionally, an email may refer to text copied from an email in a changed format relative to the email application (such as each word of the email copied into a pop-up window without the to/from fields or other features). In some embodiments, each content item is pre-existing or has been generated (such as an email having been sent and received) prior to detecting the first natural language utterance.
According to block 1107, some embodiments determine (such as generate) a score for each of the plurality of content items based on the first natural language utterance and at least one of: a first context associated with the meeting (such as meeting context 309), a second context associated with the first participant (such as described in user context 303), and/or a third context associated with another participant of the meeting (such as described in user context 303). Examples of determining the score at block 1107 are described with respect to content item ranker 264 of fig. 2, content item ranking model/layer 315 of fig. 3, and/or inference prediction(s) 409 of fig. 4. However, in alternative embodiments, such scores are determined based on the first context, the second context, and/or the third context alone; that is, the score can be generated without regard to the detected natural language utterance.
In the illustrative example of block 1107, some embodiments first determine an intent of the first natural language utterance via natural language processing (as described with respect to speaker intent model/layer 313) based on the meeting context and/or the user context. Some embodiments responsively determine that the intent is to reference (or otherwise is associated with) a particular content item. Particular embodiments then rank each content item based on the first natural language utterance, the meeting context, and/or the user context (as described with respect to content item ranking model/layer 315). For example, the highest ranked content item can be one of the particular content items indicated in the intent.
In some embodiments, generating (or determining) the score for each content item includes predicting, via a weakly supervised machine learning model, that the first content item is the most relevant content item relative to the other content items. An example of this is described with respect to neural network 405 of fig. 4. In some embodiments, the prediction is based on concatenating one or more of the following into a feature vector that is used as input to the weakly supervised machine learning model: a first identifier identifying the first content item, the first natural language utterance, a second set of identifiers each identifying a respective meeting participant of the meeting, and a third identifier identifying the meeting. Examples of the same, additional, or alternative inputs (such as intent) are described with respect to deployment input(s) 403 and/or training input(s) 415 of fig. 4.
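The feature-vector concatenation described above can be sketched as follows. As a self-contained stand-in for learned embeddings, each identifier or token is hashed into a one-hot slot; the dimension, ordering, and encoding are assumptions, not the patented implementation.

```python
import hashlib

def hash_embed(token: str, dim: int = 8) -> list[float]:
    """Deterministic stand-in for a learned embedding: hash a token into a
    one-hot slot of a fixed-size vector."""
    slot = int(hashlib.md5(token.encode("utf-8")).hexdigest(), 16) % dim
    vec = [0.0] * dim
    vec[slot] = 1.0
    return vec

def build_feature_vector(content_id, utterance_tokens, participant_ids, meeting_id):
    """Concatenate the content item identifier, the utterance tokens, the
    participant identifiers, and the meeting identifier into one model input."""
    vec = hash_embed(content_id)
    for token in utterance_tokens:
        vec += hash_embed(token)
    for pid in participant_ids:
        vec += hash_embed(pid)
    vec += hash_embed(meeting_id)
    return vec

fv = build_feature_vector("doc_42", ["friday", "deal"], ["john", "jane"], "mtg_7")
```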
In some embodiments, the score determined at block 1107 is based on training the weakly supervised model by programmatically assigning, without a human annotator, a first label (such as a positive label) to each content item associated with an application item (such as explicitly referenced in or attached to the application item) and a second label (such as a negative label) to each content item not associated with the application item (such as not explicitly referenced in or attached to the application item), and learning, based on the first and second labels, which content items are associated with the application item. In some embodiments, these steps include the process 1000 for training a machine learning model as described with respect to fig. 10.
According to block 1109, some embodiments rank each of the plurality of content items based at least in part on the score. In some embodiments, such ranking includes functionality as described with respect to content item ranking model/layer 315 and/or content item ranker 264.
According to block 1111, some embodiments cause an indication of at least a first content item of the plurality of content items to be presented during the meeting and to a first user device associated with the first participant based at least in part on the ranking at block 1109. However, in some embodiments, this causing of presentation is based at least in part on the score (block 1107), instead of or in addition to the ranking. In some embodiments, an "indication" in the context of block 1111 refers to a link (such as a hyperlink referencing a document or otherwise selectable to open a document), a file name (such as the name under which a file is stored), the content item itself, a hash, or other data representing or associated with the content item. For example, the indication can be a link to a file. Examples of block 1111 are described with respect to the presentation of link 606 of FIG. 6, the presentation of content item suggestions (such as File A, File B, and File C) at 704 of FIG. 7, the presentation of document 808 of FIG. 8, and the presentation of link and filename 906 of FIG. 9A.
In some embodiments, such causing presentation includes causing presentation of a document having highlighted characters, wherein highlighting of the characters is based at least in part on the first natural language utterance. In some embodiments, the functionality represents or includes the functionality as described with respect to fig. 8, in which highlighted text 810 is presented.
In some embodiments, causing presentation includes causing presentation of an indication of a file (or other content item) while selectively avoiding causing presentation of indications of other files (or content items). In some embodiments, such selective avoidance is based on content items falling below a score (such as a confidence level) or ranking threshold. For example, referring back to fig. 7, for the natural language utterance at time stamp 14:02, only file A may be presented, and not files B and C, because files B and C do not exceed a particular scoring threshold (such as an 80% confidence level of relevance).
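The threshold-based selective avoidance above can be sketched as follows (scores and file names are illustrative assumptions; the 80% figure follows the example above):

```python
def items_to_present(scored_items: dict[str, float], threshold: float = 0.8) -> list[str]:
    """Keep only content items whose relevance confidence meets the threshold,
    ordered from highest to lowest score; the rest are selectively withheld."""
    ranked = sorted(scored_items.items(), key=lambda kv: kv[1], reverse=True)
    return [item for item, score in ranked if score >= threshold]

shown = items_to_present({"file A": 0.93, "file B": 0.41, "file C": 0.62})
```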
In some embodiments, for the same first natural language utterance and the same meeting as described with respect to process 1100 of fig. 11, different content items may be determined and scored for different meeting participants of the meeting. In this way, each presented content item is personalized for a particular meeting participant of the meeting. For example, some embodiments determine a second plurality of content items associated with a second participant of the meeting, wherein each content item is also a candidate for presentation to a second user device associated with the second participant during the meeting. Based at least in part on the first natural language utterance and another context associated with the second participant, some embodiments generate a second score for each of the second plurality of content items. And based at least in part on the second scores, some embodiments rank each of the second plurality of content items. Based at least in part on the ranking of each of the second plurality of content items, particular embodiments cause presentation of another indication of at least a second content item of the second plurality of content items during the meeting and to the second user device.
In an illustrative example, a speaker of a meeting may refer to a sales figure. In response, particular embodiments cause the presentation of a first email sent by a first participant at a first user device and simultaneously cause the presentation of a second email sent by a second participant at a second user device, wherein both emails describe or reference the sales figure indicated by the speaker, but both are private data and are therefore sent only to the respective participant. In other words, for example, some embodiments avoid causing presentation of the first content item to the second user device based on the second participant not having access to the first content item.
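The access-controlled, per-participant presentation described above can be sketched as follows. The ACL structure and names are hypothetical; the point is that each participant's device receives only content items that participant is authorized to access.

```python
def personalized_presentation(candidates, acl, participants):
    """For each participant, keep only the candidate content items that the
    participant can access, mirroring the private-email example above.
    `acl` maps a content item to the set of users authorized to read it."""
    return {
        user: [item for item in candidates if user in acl.get(item, set())]
        for user in participants
    }

plan = personalized_presentation(
    candidates=["email_from_alice", "email_from_bob"],
    acl={"email_from_alice": {"alice"}, "email_from_bob": {"bob"}},
    participants=["alice", "bob"],
)
```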
In some embodiments, after the presentation at block 1111, some embodiments receive, via the first user device, a request for the first participant to share the first content item with a second participant. For example, referring back to fig. 6, some embodiments receive an indication that the first participant has selected the "yes" button 607 in the prompt asking whether the first participant "wants to share the e-mail with the group." In response to receipt of the request, some embodiments cause the first content item to be presented to a second user device associated with the second participant, e.g., as described with respect to fig. 6.
Some embodiments additionally cause a second content item to be presented prior to the meeting based at least in part on a context associated with the meeting (and/or a context associated with one or more meeting participants), wherein the second content item includes at least one of a pre-read document or an agenda document, as described, for example, with respect to content item generator 261 of fig. 2.
FIG. 12 is a flowchart of an example process 1200 for presenting an indication of an agenda document or a pre-read document prior to a meeting, according to some embodiments. According to block 1202, some embodiments determine at least one of: a first context associated with the meeting and a second context associated with one or more invitees to the meeting. In some embodiments, the first context includes the functions and data described with respect to meeting context 309 of fig. 3, meeting profile 270, and/or meeting monitor 250 of fig. 2. In some embodiments, the second context includes the functions and data described with respect to user context 303 of fig. 3, user profile 240, and/or user data collection component 210 of fig. 2.
Some embodiments generate or access an agenda document or a pre-read document based on the first context and/or the second context, per block 1204. In some embodiments, such "generation" of a document includes functionality as described with respect to content item generator 261. In some embodiments, such "accessing" of a document includes accessing a data record (such as a database record) comprising the document from a data store (such as RAM or disk). In these embodiments, the document has been generated and stored in computer memory and accessed, for example, in response to block 1202.
According to block 1206, some embodiments cause an indication of the agenda document or the pre-read document to be presented prior to the meeting beginning and at a user device associated with the invitee to the meeting. In some embodiments, the timing of such presentation of the document prior to the start of the meeting is based on one or more predetermined rules or policies, such as 10 minutes prior to the start of the meeting or 5 minutes prior to the start of the meeting, where the start time of the meeting is derived from the meeting context (e.g., meeting context 309).
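The predetermined-rule timing described above can be sketched as follows (the 10-minute lead time follows the example in the text; the function name is an assumption):

```python
from datetime import datetime, timedelta

def presentation_time(meeting_start: datetime, lead_minutes: int = 10) -> datetime:
    """Apply a predetermined rule: surface the agenda or pre-read document a
    fixed number of minutes before the meeting start time derived from the
    meeting context."""
    return meeting_start - timedelta(minutes=lead_minutes)

when = presentation_time(datetime(2023, 11, 16, 9, 0))
```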
Other embodiments
Accordingly, various aspects of techniques are described herein for systems and methods for near real-time in-meeting content item suggestion. It will be understood that the various features, subcombinations, and modifications of the embodiments described herein have utility and may be used in other embodiments without reference to other features or subcombinations. Furthermore, the order and sequence of steps illustrated in the example flowcharts are not meant to limit the scope of the present disclosure in any way, and indeed, within embodiments of the present disclosure, the steps may occur in a variety of different orders. Such variations and combinations thereof are also considered to be within the scope of embodiments of the present disclosure.
In some embodiments, a computerized system (such as the computerized system described in any of the embodiments above) includes at least one processor and one or more computer storage media storing computer-useable instructions that, when used by the at least one processor, cause the at least one processor to perform operations. The operations include: detecting a first natural language utterance associated with one or more participants of a meeting, the one or more participants including a first participant, the first natural language utterance being among a plurality of natural language utterances associated with the meeting; determining a plurality of content items associated with the first participant, the plurality of content items excluding the plurality of natural language utterances, each content item of the plurality of content items being a candidate for presentation to a user device associated with the first participant during the meeting; generating a score for each of the plurality of content items based at least in part on the first natural language utterance and at least one of: a first context associated with the meeting and a second context associated with the first participant; ranking each content item of the plurality of content items based at least in part on the score; and during the meeting and based at least in part on the ranking, causing, at least in part in response to detecting the first natural language utterance, presentation of an indication of at least a first content item of the plurality of content items to a first user device associated with the first participant.
Advantageously, these and other embodiments, as described herein, improve upon the prior art in that scoring and presentation may be based on factors such as real-time natural language utterances in a meeting and/or other contexts, such as meeting subject or meeting participant ID. Rather than requiring an explicit user query or other user activity (such as clicking) to manually search or present content items, particular embodiments automatically provide such content items based on unique rules or factors (e.g., providing content items that match the natural language utterances of the meeting, or providing content items based on the user downloading these content items as attachments in a previous email). The generated score is itself a technical solution to these problems, as the most relevant content items are revealed. Rather than requiring the user to manually retrieve a particular file in an email application via a search query, particular embodiments will automatically cause an indication (such as a link) to be presented for the particular file based on the score when the meeting begins or when the user begins talking about the particular file. Such a presentation is itself an additional technical solution to these technical problems.
Moreover, as described herein, these and other embodiments improve upon the prior art by improving user interfaces and human-machine interactions by automatically causing presentation of indications of content items during a meeting, thereby eliminating the need for a user to laboriously go deep into various pages to find appropriate files or issue queries. Moreover, as described herein, these and other embodiments improve upon the prior art by intelligently and automatically causing an indication of a content item to be presented to a user or generating a content item prior to the start of a meeting to reduce storage device I/O, as particular embodiments perform a single write (or fewer writes) to a storage device to generate a document, rather than repeatedly storing or writing manual user input to the storage device as required by the prior art.
Moreover, these and other embodiments improve computer information security and user privacy over the prior art by programmatically assigning specific labels without human annotators using a weakly supervised model. In this way, no human annotator can view or steal private data, such as credit card information, telephone numbers, and the like. In addition, some embodiments encrypt such personal information such that other remote users cannot access the information. Furthermore, particular embodiments improve security and user privacy by incorporating access control mechanisms to prevent users from accessing content items that they should not access. For example, during a meeting, particular embodiments cause presentation of a content item only to user devices associated with users who have access to the content item, and avoid causing presentation of the content item to a second meeting participant based on the second meeting participant not having access to the content item. One access control mechanism that improves upon the prior art is causing an indication of a content item to be presented to a user in response to receiving a request, from a user that has access to the content item, to share the content item.
Moreover, these and other embodiments also improve other computing resource consumption, such as network bandwidth, network latency, and I/O when searching for content items, by determining a plurality of content items associated with a first participant or meeting that is a candidate for presentation during the meeting (or determining that the content item is actually associated with the first participant or meeting). Particular embodiments are able to determine that a subset of content items may be relevant to a meeting or particular attendee, rather than traversing an entire decision tree or other data structure when determining content items. This reduces storage device I/O because the number of accesses to the storage device is less when performing read/write operations, which reduces wear on the read/write head. Furthermore, this reduces network latency and reduces bandwidth as fewer data sources, nodes or content items are considered.
In any combination of the above embodiments, detecting the first natural language utterance includes encoding audio speech as first text data at the transcribed document, and performing natural language processing of the first text data to determine the first natural language utterance.
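By way of illustration only, the transcribe-then-parse step above could be sketched as follows. This is a minimal, hypothetical sketch (all function and variable names are invented here, not part of the claimed embodiments), in which naive sentence segmentation of already-transcribed text stands in for full speech encoding and natural language processing:

```python
import re

def detect_utterances(transcript_text):
    """Split transcribed speech into sentence-like natural language
    utterances for downstream scoring (a stand-in for full NLP)."""
    # Naive segmentation on terminal punctuation; a production system
    # would use a trained NLP pipeline over the transcribed document.
    fragments = re.split(r"[.!?]+\s*", transcript_text)
    return [f.strip() for f in fragments if f.strip()]

print(detect_utterances("Let's review the Q3 budget file. Did everyone read the agenda?"))
```

Each returned fragment would then serve as a candidate "first natural language utterance" that triggers scoring of content items.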
In any combination of the above embodiments of the computerized system, determining the plurality of content items associated with the primary participant includes performing a computer reading of a network graph associated with the primary participant, and selecting the plurality of content items among other content items, a first node of the network graph representing a meeting, a second set of nodes of the network graph representing at least one of: the respective ones of the plurality of content items and other content items, a first participant, and another participant associated with the meeting.
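As an informal illustration of the network-graph read described above, the graph could be modeled as an adjacency list whose nodes represent the meeting, participants, and content items, with candidate content items selected from the one-hop neighborhoods of the participant and meeting nodes. All identifiers below are hypothetical:

```python
# Hypothetical network graph: edges encode "is associated with".
graph = {
    "meeting:standup": ["user:alice", "user:bob", "doc:agenda"],
    "user:alice":      ["meeting:standup", "doc:budget", "msg:thread-42"],
    "user:bob":        ["meeting:standup", "doc:roadmap"],
}

def candidate_content_items(graph, participant, meeting):
    """Walk one hop from the participant node and the meeting node,
    keeping only content-item nodes (documents and messages)."""
    neighbors = set(graph.get(participant, [])) | set(graph.get(meeting, []))
    return sorted(n for n in neighbors if n.startswith(("doc:", "msg:")))

print(candidate_content_items(graph, "user:alice", "meeting:standup"))
```

Restricting scoring to this neighborhood, rather than traversing the entire graph, is what yields the I/O and bandwidth savings discussed elsewhere in this disclosure.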
In any combination of the above embodiments of the computerized system, the plurality of content items comprises one or more of a data file or a message, and wherein the presented indication comprises a link to the data file or a link to the message.
In any combination of the above embodiments of the computerized system, generating the score for each content item includes predicting, via a weakly supervised machine learning model, that the first content item is the most relevant content item relative to other content items of the plurality of content items.
In any combination of the above embodiments of the computerized system, the predicting comprises concatenating one or more of the following into feature vectors for use as input to a weakly supervised machine learning model: a first identifier identifying a first participant, a first natural language utterance, a second set of identifiers each identifying a respective participant of the meeting, and a third identifier identifying the meeting.
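To make the concatenation step concrete, the identifiers and utterance tokens could be hashed into one fixed-width feature vector, as in the toy sketch below. Feature hashing is used here purely as an illustrative stand-in for whatever embedding the weakly supervised model actually consumes; the dimensionality and all names are assumptions:

```python
import hashlib

DIM = 16  # toy feature-vector width, chosen arbitrarily for illustration

def feature_vector(participant_id, utterance, attendee_ids, meeting_id):
    """Concatenate a participant ID, meeting ID, attendee IDs, and
    utterance tokens into one fixed-width vector via feature hashing."""
    vec = [0.0] * DIM
    tokens = [participant_id, meeting_id, *attendee_ids, *utterance.lower().split()]
    for tok in tokens:
        h = int(hashlib.md5(tok.encode()).hexdigest(), 16)
        vec[h % DIM] += 1.0  # one count per hashed token
    return vec

v = feature_vector("user:alice", "review the budget",
                   ["user:alice", "user:bob"], "meeting:standup")
print(len(v), sum(v))
```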
In any combination of the above embodiments of the computerized system, the operations further comprise: the weakly supervised model is trained by programmatically assigning a first label to each content item associated with an application item and a second label to each content item not associated with an application item without a human annotator, and learning which content items are associated with an application item based on the first and second labels.
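The programmatic labeling step can be illustrated with a few lines of code: each content item receives a first label (here `1`) if it is associated with an application item, and a second label (`0`) otherwise, with no human in the loop. The association signal and all identifiers are hypothetical examples:

```python
def weak_labels(content_items, app_item_ids):
    """Programmatically assign training labels without a human annotator:
    1 if the content item is associated with an application item
    (e.g., was attached or opened in the meeting application), else 0."""
    return {item: (1 if item in app_item_ids else 0) for item in content_items}

labels = weak_labels(
    ["doc:budget", "doc:roadmap", "msg:thread-42"],
    app_item_ids={"doc:budget"},
)
print(labels)
```

The resulting label dictionary would then be fed to the model so it can learn which content items are associated with application items, without any annotator ever seeing the underlying private data.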
In any combination of the above embodiments of the computerized system, causing the presentation includes causing the presentation of a document having highlighted characters, the highlighting of the characters based at least in part on the first natural language utterance.
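A minimal sketch of utterance-driven highlighting follows, using `**...**` markers as a stand-in for whatever visual highlighting the user interface actually renders. The matching strategy (whole-word, case-insensitive) is an assumption for illustration:

```python
import re

def highlight(document_text, utterance):
    """Mark characters in a document that match terms from the spoken
    utterance; **...** stands in for UI highlighting."""
    for term in set(utterance.lower().split()):
        document_text = re.sub(
            rf"\b({re.escape(term)})\b",
            r"**\1**",
            document_text,
            flags=re.IGNORECASE,
        )
    return document_text

print(highlight("The budget forecast is attached.", "budget forecast"))
```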
In any combination of the above embodiments of the computerized system, causing presentation includes causing presentation of the file or the link to the file, and selectively avoiding causing presentation of the other file or the link to the other file, each of the other files representing a respective content item of the plurality of content items, the file representing the first content item.
In any combination of the above embodiments of the computerized system, the operations further comprise: determining a second plurality of content items associated with a second participant of the meeting, each content item of the second plurality of content items being a candidate for presentation to a second user device associated with the second participant during the meeting; generating a second score for each of the second plurality of content items based at least in part on the first natural language utterance and another context associated with the second participant; ranking each content item of the second plurality of content items based at least in part on the second score; and causing, based at least in part on the ranking of each of the second plurality of content items, presentation of another indication of at least a second one of the plurality of content items during the meeting and to a second user device associated with the second participant.
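The per-participant score-and-rank pipeline described above can be sketched with a deliberately simple scoring rule: token overlap between the utterance and a content item's title, plus a bonus for items in that participant's recent context. The rule itself is a hypothetical placeholder for the learned model:

```python
def score(item_title, utterance, participant_context):
    """Toy relevance score: utterance/title token overlap plus a bonus
    for items already present in the participant's recent context."""
    overlap = len(set(item_title.lower().split()) & set(utterance.lower().split()))
    bonus = 1 if item_title in participant_context.get("recent", []) else 0
    return overlap + bonus

def rank_items(items, utterance, participant_context):
    """Rank candidate content items for one participant, best first."""
    return sorted(items, key=lambda t: score(t, utterance, participant_context),
                  reverse=True)

items = ["budget forecast", "roadmap review", "travel policy"]
ranked = rank_items(items, "let's open the budget forecast",
                    {"recent": ["travel policy"]})
print(ranked[0])
```

Running the same pipeline with a second participant's context would generally yield a different ranking, which is why each participant's device may be shown a different content item.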
In any combination of the above embodiments of the computerized system, the operations further comprise avoiding causing presentation of the indication of the first content item to the second user device based on the second participant not having access to the first content item.
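The access-control gate can be reduced to a single membership check against an access-control list, as in this hypothetical sketch (the ACL representation is an assumption):

```python
def presentable_to(item, participant, acl):
    """Access-control gate: a content item may be presented only to
    participants who hold access rights to it."""
    return participant in acl.get(item, set())

acl = {"doc:budget": {"user:alice"}}
print(presentable_to("doc:budget", "user:alice", acl))
print(presentable_to("doc:budget", "user:bob", acl))
```

An item that fails this check for a given participant is simply filtered out of that participant's candidate list before ranking, so no indication of it is ever rendered on that device.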
In any combination of the above embodiments of the computerized system, the operations further comprise: receiving, via the first user device, a request for the first participant to share the first content item with a second participant of the conference; and in response to receiving the request, causing presentation of the first content item to a second user device associated with the second participant.
In any combination of the above embodiments of the computerized system, the operations further comprise causing presentation of an indication of a second content item of the plurality of content items prior to the meeting based at least in part on a context associated with the meeting, and wherein the plurality of content items comprises one or more of a pre-read document and an agenda document associated with the meeting.
In some embodiments, a computer-implemented method (such as the computer-implemented method described in any of the embodiments above) includes detecting a first natural language utterance of one or more participants associated with a meeting, the one or more participants including the first participant. The computer-implemented method may further include determining a plurality of content items associated with the meeting. The computer-implemented method may further include determining a score for each of the plurality of content items based on the first natural language utterance and at least one of: a first context associated with the meeting, a second context associated with the first meeting participant, and a third context associated with another meeting participant of the meeting. The computer-implemented method may further include ranking each of the plurality of content items based at least in part on the score. The computer-implemented method may also include causing, during the meeting and based at least in part on the ranking, presentation of an indication of at least a first content item of the plurality of content items to a first user device associated with the first meeting participant. Advantageously, these and other embodiments, as described herein, improve upon the prior art in that scoring and presentation may be based on factors such as real-time natural language utterances in a meeting and/or other contexts, such as meeting subject or meeting participant ID. Rather than requiring an explicit user query or other user activity (such as clicking) to manually search or present content items, particular embodiments automatically provide such content items based on unique rules or factors (e.g., providing content items that match the natural language utterances of the meeting, or providing content items based on the user downloading these content items as attachments in a previous email). 
The generated score is itself a technical solution to these problems, as the most relevant content items are revealed. Rather than requiring the user to manually retrieve a particular file in an email application via a search query, particular embodiments will automatically cause an indication (such as a link) to be presented for the particular file based on the score when the meeting begins or when the user begins talking about the particular file. This presentation is itself an additional technical solution to these technical problems.
Moreover, as described herein, these and other embodiments improve upon prior-art user interfaces and human-machine interaction by automatically causing presentation of indications of content items during a meeting, thereby eliminating the need for the user to laboriously drill down through various pages to find the appropriate files or to issue queries. Moreover, as described herein, these and other embodiments improve upon the prior art by intelligently and automatically causing an indication of a content item to be presented to a user, or generating a content item, prior to the start of a meeting to reduce storage device I/O, as particular embodiments perform a single write (or fewer writes) to a storage device to generate a document, rather than repeatedly storing or writing manual user input to the storage device as required by the prior art.
Moreover, these and other embodiments improve computer information security and user privacy over the prior art by using a weakly supervised model to programmatically assign specific labels without human annotators. In this way, no human annotator can view or steal private data, such as credit card information, telephone numbers, and the like. In addition, some embodiments encrypt such personal information so that other remote users cannot access it. Furthermore, particular embodiments improve security and user privacy by incorporating access control mechanisms to prevent users from accessing content items that they should not access. For example, during a meeting, particular embodiments cause presentation of a content item only to user devices associated with users that have access to the content item, and avoid causing presentation of the content item to a secondary participant based on the secondary participant not having access to the content item. One of the access control mechanisms that improves upon the prior art is causing an indication of a content item to be presented to a user in response to receiving a request to share the content item from a user that has access to the content item.
In addition, these and other embodiments also improve other computing resource consumption, such as network bandwidth, network latency, and I/O when searching for content items, by determining a plurality of content items associated with a first participant or meeting that is a candidate for presentation during the meeting (or determining that the content item is actually associated with the first participant or meeting). Particular embodiments may determine that a subset of content items may be relevant to a meeting or particular attendee, rather than traversing an entire decision tree or other data structure when determining content items. This reduces storage device I/O because the number of accesses to the storage device is less when performing read/write operations, which reduces wear on the read/write head. Furthermore, this reduces network latency and reduces bandwidth as fewer data sources, nodes or content items are considered.
In any combination of the above embodiments of the computer-implemented method, causing presentation includes causing presentation of an indication of the first content item to the first user device during the meeting, and selectively avoiding causing presentation of an indication of any other content item of the plurality of content items.
In any combination of the above embodiments of the computer-implemented method, the method further comprises causing a second indication of a second content item to be presented to the user device prior to the meeting beginning, and wherein the second content item comprises one of a pre-read document and an agenda document.
In any combination of the above embodiments of the computer-implemented method, generating the score for each content item includes predicting, via a weakly supervised machine learning model, that the first content item is the most relevant content item relative to other content items of the plurality of content items.
In any combination of the above embodiments of the computer-implemented method, the method further comprises: determining a second plurality of content items associated with a second participant of the meeting, each content item of the second plurality of content items being a candidate for presentation to a second user device associated with the second participant; determining a second score for each of the second plurality of content items based at least in part on the first natural language utterance and another context associated with the second participant; ranking each content item of the second plurality of content items based at least in part on the second score; and causing another indication of at least a second content item of the plurality of content items to be presented to a second user device associated with the second meeting participant based at least in part on the ranking of each content item of the second plurality of content items.
In any combination of the above embodiments of the computer-implemented method, the method further comprises avoiding causing presentation of an indication of the first content item to the second user device based on the second participant not having access to the first content item.
In some embodiments, one or more computer storage media (e.g., one or more of the computer storage media described in any of the embodiments above) includes computer-executable instructions embodied thereon that, when executed by one or more processors, cause the one or more processors to perform operations comprising: a first natural language utterance of one or more participants associated with a meeting is detected, the one or more participants including a first participant. The operations may also include determining a plurality of content items associated with at least one of the meeting or primary participant. The operations may also include determining a score for each of the plurality of content items based at least in part on at least one of the first natural language utterance, a first context associated with the meeting, a second context associated with the first meeting participant, and a third context associated with another meeting participant of the meeting. The operations may also include causing, during the meeting and based at least in part on the score, presentation of an indication of at least a first content item of the plurality of content items to a first user device associated with the first participant. Advantageously, these and other embodiments, as described herein, improve upon the prior art in that scoring and presentation may be based on factors such as real-time natural language utterances in a meeting and/or other contexts, such as meeting subject or meeting participant ID. Rather than requiring an explicit user query or other user activity (such as clicking) to manually search or display the content items, particular embodiments automatically provide such content items based on unique rules or factors (e.g., providing content items that match the natural language utterances of the meeting, or providing content items based on the user downloading these content items as attachments in a previous email). 
The generated score is itself a technical solution to these problems, as the most relevant content items are revealed. Rather than requiring the user to manually retrieve a particular file in an email application via a search query, particular embodiments will automatically cause an indication (such as a link) to be presented for the particular file based on the score when the meeting begins or when the user begins talking about the particular file. This presentation is itself an additional technical solution to these technical problems.
Moreover, as described herein, these and other embodiments improve upon prior-art user interfaces and human-machine interaction by automatically causing presentation of indications of content items during a meeting, thereby eliminating the need for the user to laboriously drill down through various pages to find the appropriate files or to issue queries. Moreover, as described herein, these and other embodiments improve upon the prior art by intelligently and automatically causing an indication of a content item to be presented to a user, or generating a content item, prior to the start of a meeting to reduce storage device I/O, as particular embodiments perform a single write (or fewer writes) to a storage device to generate a document, rather than repeatedly storing or writing manual user input to the storage device as required by the prior art.
Moreover, these and other embodiments improve computer information security and user privacy over the prior art by using a weakly supervised model to programmatically assign specific labels without human annotators. In this way, no human annotator can view or steal private data, such as credit card information, telephone numbers, and the like. In addition, some embodiments encrypt such personal information so that other remote users cannot access it. Furthermore, particular embodiments improve security and user privacy by incorporating access control mechanisms to prevent users from accessing content items that they should not access. For example, during a meeting, particular embodiments cause presentation of a content item only to user devices associated with users that have access to the content item, and avoid causing presentation of the content item to a secondary participant based on the secondary participant not having access to the content item. One of the access control mechanisms that improves upon the prior art is causing an indication of a content item to be presented to a user in response to receiving a request to share the content item from a user that has access to the content item.
In addition, these and other embodiments also improve other computing resource consumption, such as network bandwidth, network latency, and I/O when searching for content items, by determining a plurality of content items associated with a first participant or meeting that is a candidate for presentation during the meeting (or determining that the content item is actually associated with the first participant or meeting). Particular embodiments may determine that a subset of content items may be relevant to a meeting or particular attendee, rather than traversing an entire decision tree or other data structure when determining content items. This reduces storage device I/O because the number of accesses to the storage device is less when performing read/write operations, thereby reducing wear of the read/write head. Furthermore, this reduces network latency and reduces bandwidth as fewer data sources, nodes or content items are considered.
Overview of an exemplary operating Environment
Having described various embodiments of the present disclosure, an exemplary computing environment suitable for implementing embodiments of the present disclosure will now be described. With reference to FIG. 13, an exemplary computing device 1300, commonly referred to as computing device 1300, is provided. Computing device 1300 is but one example of a suitable computing environment and is not intended to suggest any limitation as to the scope of use or functionality of the disclosure. Neither should the computing device 1300 be interpreted as having any dependency or requirement relating to any one or combination of components illustrated.
Embodiments of the present disclosure may be described in the general context of computer code or machine-useable instructions, including computer-useable or computer-executable instructions such as program modules, being executed by a computer or other machine, such as a smart phone, tablet PC, or other mobile device, server, or client device. Generally, program modules (including routines, programs, objects, components, data structures, etc.) refer to code that perform particular tasks or implement particular abstract data types. Embodiments of the present disclosure may be practiced in various system configurations, including mobile devices, consumer electronics, general-purpose computers, more specialized computing devices, and the like. Embodiments of the disclosure may also be practiced in distributed computing environments where tasks are performed by remote processing devices that are linked through a communications network. In a distributed computing environment, program modules may be located in both local and remote computer storage media including memory storage devices.
Some embodiments may include an end-to-end software-based system that may operate within the system components described herein to operate computer hardware to provide system functionality. At a low level, a hardware processor may execute instructions selected from a given processor's machine language (also referred to as machine code or native) instruction set. The processor recognizes native instructions and performs corresponding low-level functions, such as functions related to logic, control, and memory operations. Low-level software written in machine code may provide more complex functionality to higher-level software. Thus, in some embodiments, computer-executable instructions may include any software, including low-level software written in machine code, high-level software such as application software, and any combination thereof. In this regard, the system components may manage resources and provide services for system functions. Embodiments of the present disclosure contemplate any other variations and combinations thereof.
With reference to FIG. 13, a computing device 1300 includes a bus 10 that directly or indirectly couples the following devices: memory 12, one or more processors 14, one or more presentation components 16, one or more input/output (I/O) ports 18, one or more I/O components 20, and an illustrative power supply 22. Bus 10 represents what may be one or more busses (such as an address bus, data bus, or combination thereof). Although the various blocks of FIG. 13 are shown with lines for the sake of clarity, in reality, these blocks represent logical, not necessarily actual, components. For example, a presentation component such as a display device may be considered an I/O component. In addition, the processor has a memory. The inventors hereof recognize that such is the nature of the art, and reiterate that the diagram of FIG. 13 is merely illustrative of an exemplary computing device that can be used in connection with one or more embodiments of the present disclosure. No distinction is made between such categories as "workstation," "server," "notebook," "handheld," or other computing devices, as all of these categories are within the scope of FIG. 13 and are referred to as a "computing device."
Computing device 1300 typically includes a variety of computer-readable media. Computer readable media can be any available media that can be accessed by computing device 1300 and includes both volatile and nonvolatile media, removable and non-removable media. By way of example, and not limitation, computer readable media may comprise computer storage media and communication media. Computer storage media includes volatile and nonvolatile, removable and non-removable media implemented in any method or technology for storage of information such as computer readable instructions, data structures, program modules or other data. Computer storage media includes, but is not limited to, RAM, ROM, EEPROM, flash memory or other memory technologies, CD-ROM, digital versatile disks (DVDs) or other optical disk storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium which can be used to store the desired information and which can be accessed by computing device 1300. Computer storage media does not itself comprise signals. Communication media typically embodies computer readable instructions, data structures, program modules or other data in a modulated data signal such as a carrier wave or other transport mechanism and includes any information delivery media. The term "modulated data signal" means a signal that has one or more of its characteristics set or changed in such a manner as to encode information in the signal. By way of example, and not limitation, communication media includes wired media such as a wired network or direct-wired connection, and wireless media such as acoustic, RF, infrared and other wireless media. Combinations of any of the above should also be included within the scope of computer readable media.
Memory 12 includes computer storage media in the form of volatile and/or nonvolatile memory. The memory may be removable, non-removable, or a combination thereof. Exemplary hardware devices include solid state memory, hard drives, optical drives, or other hardware. Computing device 1300 includes one or more processors 14 that read data from various entities such as memory 12 or I/O components 20. Presentation component 16 presents data indications to a user or other device. Exemplary presentation components include display devices, speakers, printing components, vibration components, and the like.
The I/O ports 18 allow the computing device 1300 to be logically coupled to other devices, including I/O components 20, some of which may be built-in. Illustrative components include microphones, joysticks, game pads, satellite dishes, scanners, printers, wireless devices, and the like. The I/O component 20 may provide a Natural User Interface (NUI) that processes air gestures, voice, or other physiological input generated by a user. In some cases, the input may be sent to an appropriate network element for further processing. NUI may implement any combination of speech recognition, touch and stylus recognition, facial recognition, biometric recognition, on-screen and near-screen gesture recognition, air gesture, head and eye tracking, and touch recognition associated with a display on computing device 1300. Computing device 1300 may be equipped with depth cameras, such as stereoscopic camera systems, infrared camera systems, RGB camera systems, and combinations thereof, for gesture detection and recognition. Furthermore, computing device 1300 may be equipped with an accelerometer or gyroscope capable of detecting motion. The output of the accelerometer or gyroscope may be provided to a display of the computing device 1300 to present immersive augmented reality or virtual reality.
Some embodiments of computing device 1300 may include one or more radios 24 (or similar wireless communication components). Radio 24 sends and receives radio or wireless communications. Computing device 1300 can be a wireless terminal adapted to receive communications and media over a variety of wireless networks. Computing device 1300 may communicate with other devices via wireless protocols such as code division multiple access ("CDMA"), global system for mobile ("GSM"), or time division multiple access ("TDMA"), among other protocols. The radio communication may be a short-range connection, a long-range connection, or a combination of both short-range and long-range wireless telecommunications connections. When we refer to "short" and "long" types of connections, we do not refer to the spatial relationship between two devices. Instead, we generally refer to short-range and long-range as different classes or types of connections (i.e., a primary connection and a secondary connection). By way of example and not limitation, a short-range connection may include a Wi-Fi connection to a device providing access to a wireless communication network (e.g., a mobile hotspot), such as a WLAN connection using the 802.11 protocol; a Bluetooth connection to another computing device is a second example of a short-range connection, as is a near-field communication connection. By way of example and not limitation, a long-range connection may include a connection using one or more of the CDMA, GPRS, GSM, TDMA, and 802.16 protocols.
Having identified the various components used herein, it should be understood that any number of components and arrangements may be employed to achieve the desired functionality within the scope of the present disclosure. For example, components in the embodiments depicted in the figures are shown with lines for the sake of conceptual clarity. Other arrangements of these and other components may also be implemented. For example, although some components are described as single components, many of the elements described herein may be implemented as discrete or distributed components, or in combination with other components, and in any suitable combination and location. Some elements may be omitted entirely. Further, various functions described herein as being performed by one or more entities may be carried out by hardware, firmware, and/or software, as described below. For example, various functions may be performed by a processor executing instructions stored in memory. Thus, other arrangements and elements (e.g., machines, interfaces, functions, orders, and groupings of functions, etc.) can be used in addition to those shown.
The embodiments of the present disclosure have been described with the intent to be illustrative rather than restrictive. The embodiments described in the preceding paragraphs may be combined with one or more of the specifically described alternatives. In particular, in the alternative, a claimed embodiment may comprise a reference to more than one other embodiment. The claimed embodiments may specify further limitations of the claimed subject matter. Alternative embodiments will become apparent to the reader of this disclosure after reading this disclosure and as a result of reading this disclosure. Alternative means of accomplishing the foregoing may be accomplished without departing from the scope of the claims. Certain features and subcombinations may be employed without reference to other features and subcombinations and are contemplated within the scope of the claims.
As used herein, the term "set" may be used to refer to an ordered (i.e., sequential) or unordered (i.e., non-sequential) collection of objects (or elements), such as, but not limited to, data elements (e.g., events, clusters of events, and the like). A set may include N elements, where N is any non-negative integer. That is, a set may include 0, 1, 2, 3, … N objects and/or elements, where N is a positive integer with no upper bound. Therefore, as used herein, a set may be a null set (i.e., an empty set) that includes no elements. A set may include only a single element. In other embodiments, a set may include a number of elements that is significantly greater than one, two, or three elements. As used herein, the term "subset" is a set that is included in another set. A subset may be, but is not necessarily, a proper or strict subset of the other set that includes the subset. That is, if set B is a subset of set A, then in some embodiments set B is a proper or strict subset of set A. In other embodiments, set B is a subset of set A, but not a proper or strict subset of set A.
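For instance, Python's built-in set type mirrors these definitions directly (`<` tests for a proper/strict subset, `<=` for any subset), which gives a quick concrete check of the distinctions above:

```python
a = {1, 2, 3}
b = {1, 2}
empty = set()

# b is a proper (strict) subset of a; a is a subset of itself but not
# a proper subset of itself; the empty (null) set is a subset of every set.
print(b < a, a <= a, a < a, empty <= b)
```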

Claims (15)

1. A system comprising:
at least one computer processor; and
one or more computer storage media storing computer-usable instructions that, when used by the at least one computer processor, cause the at least one computer processor to perform operations comprising:
detecting a first natural language utterance associated with one or more attendees of a meeting, the one or more attendees including a first attendee, the first natural language utterance being among a plurality of natural language utterances associated with the meeting;
determining a plurality of content items associated with the first attendee, the plurality of content items excluding the plurality of natural language utterances, each content item of the plurality of content items being a candidate for presentation, during the meeting, to a user device associated with the first attendee;
generating a score for each content item of the plurality of content items based at least in part on the first natural language utterance and at least one of: a first context associated with the meeting and a second context associated with the first attendee;
ranking each content item of the plurality of content items based at least in part on the score; and
during the meeting and based at least in part on the ranking, and at least partially in response to the detecting of the first natural language utterance, causing an indication of at least a first content item of the plurality of content items to be presented to the first user device associated with the first attendee.

2. The system of claim 1, wherein the detecting of the first natural language utterance further comprises: encoding audio speech into first text data at a transcription document, and performing natural language processing on the first text data to determine the first natural language utterance.

3. The system of claim 1, wherein the determining of the plurality of content items associated with the first attendee further comprises: performing a computer read of a network graph associated with the first attendee and selecting the plurality of content items from among other content items, a first node of the network graph representing the meeting, and a second set of nodes of the network graph representing at least one of: respective content items of the plurality of content items and the other content items, the first attendee, and another attendee associated with the meeting.

4. The system of claim 1, wherein the plurality of content items comprises one or more of a data file or a message, and wherein the presented indication comprises a link to the data file or a link to the message.

5. The system of claim 1, wherein the generating of the score for each content item further comprises: predicting, via a weakly supervised machine learning model, that the first content item is the most relevant content item relative to other content items of the plurality of content items.

6. The system of claim 5, wherein the predicting comprises concatenating one or more of the following into a feature vector used as input to the weakly supervised machine learning model: a first identifier identifying the first attendee, the first natural language utterance, a second set of identifiers each identifying a respective attendee of the meeting, and a third identifier identifying the meeting.

7. The system of claim 5, wherein the operations further comprise: training the weakly supervised model by programmatically, without a human annotator, assigning a first label to each content item associated with an application item and assigning a second label to each content item not associated with the application item, and learning which content items are associated with the application item based on the first label and the second label.

8. The system of claim 1, wherein the causing presentation comprises causing presentation of a document having highlighted characters, the highlighting of the characters being based at least in part on the first natural language utterance.

9. The system of claim 1, wherein the causing presentation further comprises causing presentation of a file or a link to the file, and selectively refraining from causing presentation of other files or links to the other files, each of the other files representing a respective content item of the plurality of content items, the file representing the first content item.

10. The system of claim 1, wherein the operations further comprise:
determining a second plurality of content items associated with a second attendee of the meeting, each content item of the second plurality of content items being a candidate for presentation, during the meeting, to a second user device associated with the second attendee;
generating a second score for each content item of the second plurality of content items based at least in part on the first natural language utterance and another context associated with the second attendee;
ranking each content item of the second plurality of content items based at least in part on the second score; and
based at least in part on the ranking of each content item of the second plurality of content items, causing another indication of at least a second content item of the plurality of content items to be presented, during the meeting, to the second user device associated with the second attendee.

11. The system of claim 10, wherein the operations further comprise: based on the second attendee not having access rights to the first content item, refraining from causing the indication of the first content item to be presented to the second user device.

12. The system of claim 1, wherein the operations further comprise:
receiving, via the first user device, a request for the first attendee to share the first content item with a second attendee of the meeting; and
in response to the receiving of the request, causing the first content item to be presented to a second user device associated with the second attendee.

13. The system of claim 1, wherein the operations further comprise: based at least in part on the context associated with the meeting, causing an indication of a second content item of the plurality of content items to be presented prior to the meeting, and wherein the plurality of content items comprises one or more of a pre-read document and an agenda document associated with the meeting.

14. A computer-implemented method comprising:
detecting a first natural language utterance of one or more attendees associated with a meeting, the one or more attendees including a first attendee;
determining a plurality of content items associated with the meeting;
determining a score for each content item of the plurality of content items based on the first natural language utterance and at least one of: a first context associated with the meeting, a second context associated with the first attendee, and a third context associated with another attendee of the meeting;
ranking each content item of the plurality of content items based at least in part on the score; and
during the meeting and based at least in part on the ranking, causing an indication of at least a first content item of the plurality of content items to be presented to the first user device associated with the first attendee.

15. The computer-implemented method of claim 14:
wherein the causing presentation further comprises: causing the indication of the first content item to be presented, during the meeting, to the first user device, and selectively refraining from causing presentation of an indication of any other content item of the plurality of content items; and
wherein the generating of the score for each content item further comprises: predicting, via a weakly supervised machine learning model, that the first content item is the most relevant content item relative to other content items of the plurality of content items.
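Claims 1, 9, and 11 together describe scoring candidate content items against a detected utterance plus meeting and attendee context, ranking them, and surfacing only the top item(s) the attendee is permitted to access while withholding the rest. The sketch below is a minimal, illustrative rendering of that flow; the token-overlap scorer, the context dictionary keys, and the link format are assumptions for demonstration, not the patent's actual (e.g. weakly supervised, machine-learned) implementation.

```python
from dataclasses import dataclass, field

@dataclass
class ContentItem:
    item_id: str
    title: str
    # Attendee IDs permitted to view this item (cf. claim 11's access check).
    access: set = field(default_factory=set)

def score_item(item, utterance, meeting_ctx, user_ctx):
    """Toy relevance score: token overlap between the detected utterance
    and the item title, plus boosts from meeting context (first context)
    and attendee context (second context). A production system would use
    a trained model here instead of hand-set weights."""
    utterance_tokens = set(utterance.lower().split())
    title_tokens = set(item.title.lower().split())
    score = len(utterance_tokens & title_tokens)
    if item.item_id in meeting_ctx.get("agenda_items", []):
        score += 1
    if item.item_id in user_ctx.get("recent_items", []):
        score += 1
    return score

def suggest(items, utterance, meeting_ctx, user_ctx, attendee_id, top_k=1):
    """Filter to items the attendee may access, rank by score, and return
    indications (links) for the top-k; all other candidates are withheld."""
    visible = [i for i in items if attendee_id in i.access]
    ranked = sorted(
        visible,
        key=lambda i: score_item(i, utterance, meeting_ctx, user_ctx),
        reverse=True,
    )
    return [{"item_id": i.item_id, "link": f"/items/{i.item_id}"}
            for i in ranked[:top_k]]

items = [
    ContentItem("doc-1", "Q3 budget review deck", access={"alice", "bob"}),
    ContentItem("doc-2", "Team offsite photos", access={"alice"}),
]
suggestions = suggest(
    items,
    utterance="let's pull up the budget review deck",
    meeting_ctx={"agenda_items": ["doc-1"]},
    user_ctx={"recent_items": ["doc-2"]},
    attendee_id="alice",
)
print(suggestions)  # single top-ranked, access-filtered suggestion
```

Note how the access filter runs before ranking, so an attendee without rights to an item (e.g. "bob" and `doc-2`) never receives an indication of it, matching the refraining behavior recited in claim 11.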
CN202380023672.1A 2022-03-01 2023-01-12 Near real-time in-meeting content item suggestions Pending CN118765394A (en)

Applications Claiming Priority (4)

Application Number Priority Date Filing Date Title
EP22382183.6 2022-03-01
US17/813,685 US12272362B2 (en) 2022-03-01 2022-07-20 Near real-time in-meeting content item suggestions
US17/813,685 2022-07-20
PCT/US2023/010723 WO2023167758A1 (en) 2022-03-01 2023-01-12 Near real-time in-meeting content item suggestions

Publications (1)

Publication Number Publication Date
CN118765394A true CN118765394A (en) 2024-10-11

Family

ID=92951120

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202380023672.1A Pending CN118765394A (en) 2022-03-01 2023-01-12 Near real-time in-meeting content item suggestions

Country Status (1)

Country Link
CN (1) CN118765394A (en)

Similar Documents

Publication Publication Date Title
CN114503115B (en) Generate rich action items
US12272362B2 (en) Near real-time in-meeting content item suggestions
CN114556354B (en) Automatically determine and present personalized action items from events
EP4173275B1 (en) Detecting user identity in shared audio source contexts
US11721093B2 (en) Content summarization for assistant systems
US11816609B2 (en) Intelligent task completion detection at a computing device
CN110892395B (en) Virtual assistant that provides enhanced communication session services
CN110869969B (en) Virtual assistant for generating personalized responses within communication sessions
CN110235154B (en) Use characteristic keywords to associate meetings with projects
CN114600114A (en) On-device Convolutional Neural Network Models for Assistant Systems
US20230385778A1 (en) Meeting thread builder
TW202307643A (en) Auto-capture of interesting moments by assistant systems
US20230419270A1 (en) Meeting attendance prompt
CN119148856A (en) Processing multi-modal user input for an assistant system
CN118657476A (en) Preventing accidental activation of assistant system based on donning/docking detection
US20250292778A1 (en) Near real-time in-meeting content item suggestions
CN118765394A (en) Near real-time in-meeting content item suggestions

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination