EP1831804A1

EP1831804A1 - Relational compressed data bank images (for accelerated interrogation of data banks)

Info

Publication number: EP1831804A1
Application number: EP05850178A
Authority: EP
Inventors: Michael Haft; Oliver Mihatsch; Reimar Hofmann
Original assignee: Panoratio Database Images GmbH
Current assignee: Panoratio Database Images GmbH
Priority date: 2004-12-24
Filing date: 2005-12-19
Publication date: 2007-09-12
Also published as: US20080133573A1; WO2006066556A2; WO2006066556A8

Abstract

The invention relates to a database query system in which two or more database tables are linked by means of a common key, or by several keys that are each common to at least two database tables. Given an analysis query and a selection of data records in the first database table, a corresponding selection of data records in the second database table is determined according to the common key, and the analysis query is answered using the data records thus selected in the second database table.

Description


Relational compressed database images (for accelerated query of databases)

The invention relates to a database query system and a method for computer-aided database query.

The systematic collection of information about business operations is widespread. Such information can, after it has been captured in the form of data and stored, be used as appropriate to the type of information, for example for economic and/or strategic marketing purposes.

For example, information about customers shopping in a hardware store is collected, and the collected data, such as the age and place of residence of the customers, are analyzed in order to adjust the assortment offered by the DIY store or to better estimate which advertising strategies could be successful.

However, a statistical statement based on such collected data only has a high significance if very many data or data records have been recorded.

For example, it does not make sense for a DIY store to change its product range just because eight out of a total of ten customers surveyed gave corresponding answers during a survey.

Therefore, to obtain a meaningful and significant result, it is necessary to acquire a large amount of data, to structure and store it, that is, to file it in a database, and to analyze it, that is, to evaluate it statistically.

Despite the relatively powerful computer systems available today, this is a non-trivial task.

In terms of memory requirements, the time required to access the data stored in the database, and costs, it is of considerable importance to store and manage databases efficiently.

Furthermore, in conventional database systems, certain queries cannot be answered at all or only with great effort.

For example, a hardware store might have a customer database table in which information about the customers of the hardware store is stored in the form of customer records. A customer record contains, for example, the customer number of the customer, the gender of the customer and the year of birth of the customer.

The hardware store could also have a transaction database table in which information about transactions, i.e. sales transactions, is stored in the form of transaction records. A transaction record could contain, for example, a transaction number, a specification of the product sold in the transaction, the revenue of the transaction, the date on which the transaction was made, the customer number of the customer involved in the transaction, and the payment method used by the customer (cash payment, card payment).

Let's assume that a DIY store sales manager would like to know what the age distribution of the customers who bought bedding and balcony plants in January is.

However, the sales manager cannot answer this question by querying the first database table alone or the second database table alone.

By querying the first database table, the sales manager cannot answer the question because the first database table contains no information about the products purchased by a customer.

By querying the second database table, the sales manager cannot answer the question because the second database table contains no information about the age of the customers who made transactions.

All standard relational databases offer the option of linking multiple database tables using common key fields (in the example above, the customer number). However, such so-called "JOIN" operations are often computationally expensive, and many database systems in use today are at or beyond their limits in terms of response times and workloads. Much of this load is caused by queries that link multiple database tables and contain complicated selection criteria spanning multiple database tables. Queries that concern only a single database table can be handled by a so-called "full table scan", i.e. the entire database table is read once from the hard disk (or another memory) into working memory and each data record is processed individually. The runtime of such queries thus has a natural upper bound. Linking multiple database tables means that this simple approach no longer works, and potentially very long query times can arise.
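As a point of reference, the conventional approach can be sketched in Python as a hash join over the customer number. This is only an illustration: the table contents, the field names and the reference year 2005 used to derive an age from the year of birth are assumptions and do not come from the source.

```python
from collections import Counter

# Hypothetical example data; field names and values are assumptions.
customers = [
    {"customer_no": 1, "gender": "f", "birth_year": 1950},
    {"customer_no": 2, "gender": "m", "birth_year": 1980},
    {"customer_no": 3, "gender": "f", "birth_year": 1975},
]
transactions = [
    {"transaction_no": 10, "product": "bedding and balcony plants", "month": 1, "customer_no": 1},
    {"transaction_no": 11, "product": "screws", "month": 1, "customer_no": 2},
    {"transaction_no": 12, "product": "bedding and balcony plants", "month": 1, "customer_no": 3},
    {"transaction_no": 13, "product": "bedding and balcony plants", "month": 2, "customer_no": 2},
]


def age_distribution_via_join(customers, transactions, product, month, reference_year):
    """Conventional JOIN over the common key field customer_no."""
    by_key = {c["customer_no"]: c for c in customers}  # build side of the hash join
    buyers = {
        t["customer_no"]
        for t in transactions  # probe side: scan all transactions
        if t["product"] == product and t["month"] == month
    }
    # Each distinct buyer is looked up once in the customer table.
    return Counter(reference_year - by_key[k]["birth_year"] for k in buyers)


ages = age_distribution_via_join(
    customers, transactions, "bedding and balcony plants", 1, 2005
)
print(dict(ages))  # two buyers, aged 55 and 30
```

Neither table alone suffices here: the product and month come from the transaction table, the year of birth from the customer table.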

A possible way out, which is partly pursued in the context of data warehousing, is to change the structuring of the information in the different database tables so that all the information needed for a query is ultimately contained in a single database table.

The question above could be answered by querying the first database table if each customer record contained the information as to whether the customer to whom the customer record belongs purchased bedding and balcony plants in January. For example, a customer record could have a field that contains a first value if the customer purchased bedding and balcony plants in January and a second value if the customer did not.

It can be seen that, for such a query, the structure of the database table must already be chosen before the query is made. In this example, the customer database table must be designed so that each customer record contains the information as to whether the corresponding customer purchased bedding and balcony plants in January. However, this is not readily possible, since it is typically not evident when designing the database table which queries will be made to it in the future.

The customer database table could be designed to answer a variety of queries. For example, each customer record could include information as to whether the customer purchased bedding and balcony plants in January, whether the customer bought bedding and balcony plants in February, and so on for all months, and whether the customer bought screws in January, whether the customer bought screws in February, and so on for all products and months.

However, this approach results in a customer database table of unacceptable size.
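The growth can be estimated with simple arithmetic. The sketch below uses invented figures (20,000 products, 12 months, 500,000 customers, one bit per flag); none of these numbers come from the source.

```python
# Hypothetical sizes; none of these figures come from the source.
n_products = 20_000
n_months = 12
n_customers = 500_000

# One yes/no field per (product, month) combination in every customer record.
flags_per_customer = n_products * n_months
total_flags = flags_per_customer * n_customers
size_gib = total_flags / 8 / 2**30  # at one bit per flag

print(flags_per_customer)  # 240000
print(round(size_gib, 1))  # size of the flag fields alone, in GiB
```

Even under these modest assumptions every customer record would carry 240,000 flag fields, almost all of which would hold the "did not purchase" value.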

The customer database table also grows significantly if a list of the products purchased by each customer is included in each customer record. In order to answer the above question, the month of sale would also have to be stored in such a list for each product purchased. Furthermore, if queries are to be expected concerning the payment method used by the customer when purchasing the product, the corresponding information must also be included in the customer database table. Depending on the queries expected, a customer database table of unacceptable size may therefore also result in this case if a so-called flat data structure is used for the customer database table. In particular, storing a list of products and additional information is problematic because the length of this product list may vary greatly from customer to customer, whereas database tables usually provide a fixed number of fields for all records. Either a very large number of fields must be provided (1st product, ..., 100th product), so that everything can be stored even for customers with extensive purchases, or the product list is cut off for some customers, i.e. not stored completely, or the list is stored using a field of an appropriate data type that supports a variable length of the product list (e.g. a field of a string data type). The latter, however, has the disadvantage that queries relating to this field are complex and inefficient to process, especially if additional attributes of the products are involved (for example, the query "Show all customers who bought a technology product costing more than 100 euros in August").

An acceptable size of the customer database table can be achieved if information from the transaction database table is aggregated into the customer database table, for example if each customer record contains information about whether the customer made any transaction in January, made any transaction in February, and so on. In this way, however, the above query cannot be answered, because the information is not contained in the customer database table with sufficient accuracy.

In summary, conventional relational database systems either store the data in a so-called normalized schema using different database tables, with the disadvantage that (analytical) queries are very inefficient, or they build a flat, "denormalized" data schema with only one or a few database tables, which accelerates analysis but is very memory-intensive, inflexible and difficult to maintain.

In [1] probabilistic models such as Bayesian networks and Markov networks are described.

[2] discloses methods for learning dependency structures underlying a data set using Bayesian networks and Markov networks.

In [3] various statistical learning methods are described.

[4] discloses a method for arithmetic coding of data.

In [5] a method is described in which a Gaussian mixture model is used for a database with continuous entries in order to answer queries to the database approximately.

[6] discloses the generation of a statistical clustering model for a database, by means of which queries to the database can be answered efficiently and approximately.

Various methods are known which enable the structuring, efficient storage and analysis of data:

In [7] Z-Ordering is described.

In [8] K * trees are described. In [9] the IGrid index is described.

In [10] inference methods are described.

In [11] a method is described in which a first statistical image is formed for a database, which represents the statistical relationships of the data elements contained in the first database. The first statistical image is then stored in a server computer and transferred from there via a communication network to a client computer. The received first statistical image is further processed by the client computer.

Reference [12] discloses a method for managing data by means of a multi-dimensional database. A data aggregation server is set up to deliver requested aggregated data to client devices.

The invention is based on the problem of providing a way to determine the results of queries that require data from several database tables more efficiently, with less computation and with less memory than in the prior art.

The problem is solved by a database query system and a method for computer-aided database query with the features according to the independent claims.

A database query system is provided with a first database image of a first database table having a first plurality of data records and a second database image of a second database table having a second plurality of data records. Each record of the first plurality of records and each record of the second plurality of records is associated with a value of a database key. The database query system has an input device configured to receive an analysis request to the second database image; a selection device configured to select a part of the first plurality of data records according to a first selection; a determination device configured to determine a second selection of a part of the second plurality of data records, wherein, according to the second selection, those records are selected whose assigned database key values are each associated with at least one record selected according to the first selection; and a processing device configured to determine the result of the analysis request based on the selected part of the second plurality of records.

Also provided is a method for computer-aided database query corresponding to the database query system described above.

Illustratively, the data records of the first database table and the data records of the second database table that contain related information are linked by means of a database key and stored in compressed form as database images. The database images store the database key values for the records. Related information is information relating to the same person or thing. For example, the second database table contains records with information about customers of a hardware store, and the first database table contains information about transactions performed in the hardware store. In this example, a record of the second database table and a record of the first database table contain related information if the record of the first database table contains information about a transaction that was performed by the customer about whom the record of the second database table contains information. The database key that links the two records could, in this example, be the customer number of the customer, which is contained in both records.

A database key may consist of a single data field of a database table (e.g. a customer number that uniquely identifies a customer in a customer table) or of a combination of multiple data fields (e.g. the combination of a store number and a customer number within the store).

Illustratively, a request to the second database table, that is, a request to the second database image, whose answer also requires information from the first database table, is answered by selecting records in the first database image according to the required information, that is, by selecting the records for which a particular condition is met. Subsequently, the corresponding data records of the second database image are selected, that is to say, the data records in the second database image that correspond to the selected data records of the first database image according to the association by means of the database key. Based on the selected data records, the query can be answered, because the required information from the first database image was used to generate the selection of the data records of the second database image.

An idea underlying the invention can be seen in the fact that, for each database table involved, a database image is created which contains certain information from the database table in compressed form. This database image is typically much smaller than the original database table and, because of its structure, is better suited to certain operations. This makes it possible to answer certain database queries faster on the basis of the database image (or a combination of information from the database image and a remaining, simpler query to the database) than from the original database alone. In particular, the following describes how database images can be linked together (illustratively, with a result corresponding to a JOIN operation on two database tables). In such cases the advantages are particularly great, since these operations can be particularly expensive in conventional databases.

Illustratively, the first database image and the second database image, which are linked by means of the database key as explained, form a compressed relational structure.

By using database images rather than the database tables themselves, faster access is achieved, since the first database image and the second database image can be stored in a quickly accessible memory, such as the working memory (main memory) of a computer. Along with the described methods of speeding up queries in relational structures, a method is described that enables relational queries to be triggered efficiently in a graphical user interface, making use of the accelerated query times.

The first database table and the second database table can be two database tables created from two different perspectives in terms of database architecture. As in the example above, the first database table contains, for example, one record for each customer of the DIY store, containing information about the respective customer, and the second database table contains one record for each transaction carried out in the hardware store, containing information about the transaction.

For example, as above, the second database table contains records with information about customers of a hardware store, including the age of the respective customer but not when the customer made a transaction in the hardware store, and the first database table contains information about transactions made in the hardware store, including the date of each transaction but not the age of the customer who made the transaction. A query to the second database table for the average age of the customers who made a transaction in May requires information from the first database table about which transactions were performed in May. These transactions are selected, and by means of the database key the records of the second database table that contain information about customers who made a transaction in May are determined. Subsequently, the query can be answered on the basis of the selected data records of the second database table. In this way, it is possible to answer queries to the second database table whose answers require information from the first database table without copying that information into the second database table, for example in the form of a list or additional entries in the records of the second database table.
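The May example can be sketched in Python as follows. This is a minimal illustration under assumed data: plain lists of dictionaries stand in for the compressed database images, and all names and values are invented.

```python
# Plain lists stand in for the two database images (assumption for illustration).
transaction_image = [  # "first" image: has the month, but no age
    {"customer_no": 1, "month": 5},
    {"customer_no": 2, "month": 4},
    {"customer_no": 3, "month": 5},
    {"customer_no": 1, "month": 5},
]
customer_image = [  # "second" image: has the age, but no transaction date
    {"customer_no": 1, "age": 40},
    {"customer_no": 2, "age": 70},
    {"customer_no": 3, "age": 60},
]


def keys_of_selection(image, predicate, key="customer_no"):
    """First selection: collect the key values of all records meeting the condition."""
    return {rec[key] for rec in image if predicate(rec)}


def select_by_keys(image, keys, key="customer_no"):
    """Second selection: records whose key value occurs in the first selection."""
    return [rec for rec in image if rec[key] in keys]


may_keys = keys_of_selection(transaction_image, lambda r: r["month"] == 5)
may_customers = select_by_keys(customer_image, may_keys)
average_age = sum(r["age"] for r in may_customers) / len(may_customers)
print(average_age)  # 50.0
```

Customer 1 is counted once although two May transactions refer to that customer, because the key values of the first selection are collected as a set.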

Thus, complicated statistical analyses can be performed efficiently and simply for the user.

Illustratively, when evaluating the second database table, it is not necessary to continually look up additional information in the first database table by means of a database key. In this way, a considerable amount of computation can be saved, which yields a considerable efficiency advantage over conventional databases for queries of this kind.

The first database table and the second database table may be stored in a storage device of the database retrieval system. In particular, they can be stored distributed, for example by means of a plurality of data server computers, which are coupled by means of a communication network.

In this case of distributed database tables, the use of the invention is particularly advantageous since, as explained above, it is not necessary when evaluating the second database table to continually access additional information in the first database table, which in the case of distributed database tables would require considerable effort, in particular communication costs.

In one embodiment, evaluations and/or selections may be made simultaneously in the first database table and the second database table. With a selection in the first database table and a simultaneous (additional) selection in the second database table, a query is answered on the basis of the data records corresponding to both selections. In the above example, all transactions (or the corresponding transaction records) in which bedding and balcony plants were sold could be selected in the first database table. In addition, all customers (or the corresponding customer data records) older than 59 years could be selected in the second database table. A query to the first database table and/or to the second database table is then answered on the basis of the transaction records corresponding to transactions in which a customer older than 59 years bought (at least) one bedding and balcony plant, or on the basis of the customer records corresponding to customers over the age of 59 who have purchased at least one bedding and balcony plant.

Illustratively, each database table exports a list of database keys corresponding to its existing ("own") selection and imports the list of the respective other database table, which is combined with its "own" selection.
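The exchange of key lists can be sketched as follows. The data, the field names and the helper functions `export_keys` and `combine` are hypothetical and serve only to illustrate the principle; the source does not prescribe a data structure for the key lists.

```python
transactions = [
    {"customer_no": 1, "product": "bedding and balcony plants"},
    {"customer_no": 2, "product": "screws"},
    {"customer_no": 3, "product": "bedding and balcony plants"},
]
customers = [
    {"customer_no": 1, "age": 65},
    {"customer_no": 2, "age": 72},
    {"customer_no": 3, "age": 35},
]


def is_plant_sale(transaction):
    return transaction["product"] == "bedding and balcony plants"


def is_over_59(customer):
    return customer["age"] > 59


def export_keys(table, own_selection):
    """Export the key values of the records in the table's own selection."""
    return {rec["customer_no"] for rec in table if own_selection(rec)}


def combine(table, own_selection, imported_keys):
    """Combine the own selection with the key list imported from the other table."""
    return [
        rec for rec in table
        if own_selection(rec) and rec["customer_no"] in imported_keys
    ]


selected_transactions = combine(
    transactions, is_plant_sale, export_keys(customers, is_over_59)
)
selected_customers = combine(
    customers, is_over_59, export_keys(transactions, is_plant_sale)
)
print(selected_transactions)  # plant sales to customers older than 59
print(selected_customers)     # customers older than 59 who bought plants
```

In this sketch both tables end up with a selection that reflects both conditions, so a subsequent analysis can be run on either side.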

In one embodiment, more than two database tables are linked in an analogous manner. This can be done using one database key common to all database tables or by means of several database keys, each shared by a pair of tables. For example, a customer table and a checkout table could be linked by means of a customer number, and the checkout table could be linked with a transaction table by means of a checkout code number. Illustratively, a common database key must exist for each link between two database tables, and all database tables must be linked directly (by means of a common database key) or indirectly (via the "detour" of another database table).
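An indirect link via an intermediate table can be sketched as two successive key propagations. The three tables, the field name `checkout_no` and the helper `propagate` are assumptions made for illustration only.

```python
customers = [{"customer_no": 1, "age": 65}, {"customer_no": 2, "age": 30}]
checkouts = [
    {"checkout_no": 100, "customer_no": 1},
    {"checkout_no": 101, "customer_no": 2},
    {"checkout_no": 102, "customer_no": 1},
]
transactions = [
    {"checkout_no": 100, "product": "screws"},
    {"checkout_no": 101, "product": "paint"},
    {"checkout_no": 102, "product": "bedding and balcony plants"},
]


def propagate(source_rows, target_rows, key):
    """Select the target rows whose key value occurs among the source rows."""
    keys = {row[key] for row in source_rows}
    return [row for row in target_rows if row[key] in keys]


# Selection in the customer table, carried over two links to the transactions.
older = [c for c in customers if c["age"] > 59]
older_checkouts = propagate(older, checkouts, "customer_no")
older_transactions = propagate(older_checkouts, transactions, "checkout_no")
products = sorted(t["product"] for t in older_transactions)
print(products)  # ['bedding and balcony plants', 'screws']
```

The customer table and the transaction table share no key here; the selection reaches the transactions only via the detour through the intermediate table.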

The most common type of database system is the relational database. A relational database is typically understood to mean a software system that manages one or more database tables in a database. Each database table may contain many records (for example, a customer table contains one record per customer, a transaction table one record per transaction). Each record in a database table contains values for the same fields (for example, customer number, age, gender).

Illustratively, the invention relates to the combination of several such database tables. The database tables can come from the same database, but also from different databases.

Preferred developments of the invention will become apparent from the dependent claims. The further embodiments of the invention, which are described in connection with the database query system, apply mutatis mutandis to the method for computer-aided database query.

It is preferred that the first compressed database image and/or the second compressed database image is generated according to a statistical model.

In one embodiment, the first compressed database image and the second compressed database image are independently created database images.

Preferably, the statistical model is a graphical probability model. For example, a Bayesian network is used as the probabilistic model.

In the embodiment described below, not only a small memory overhead can be achieved by means of the database images, but also the structure of the database images can be used for efficient and quick access.

It is further preferred that the input device is further adapted to receive a selection instruction and that the selection device is arranged to select the part of the first plurality of data records according to the selection instruction.

Illustratively, by selecting records, a user can specify a query more precisely and determine results for complicated queries.

It is further preferred that the database query system further comprises a display device configured to display a screen display showing possible values of at least one random variable for which values are contained in the first plurality of data records, that the selection instruction is the selection of the display of at least one possible value of the random variable, and that the first selection is the selection of all records of the first plurality of records for which the random variable takes one of the at least one selected possible values.

In this way, a user can easily select records, for example by clicking on a value of a random variable using a computer mouse.

It is further preferred that the display device is further configured to display a further screen display having an indication of the result of the analysis request, and that the display device is further configured to switch between the screen display and the further screen display.

Illustratively, a user can thus use the screen display to select data records and then switch to the further screen display so that the analysis results corresponding to the selection are displayed.

It is further preferred that the database query system has an access device which is set up to access the second database table and to determine data contained in the data records of the second database table selected according to the second selection, and that the processing device is set up to determine the result of the analysis request using these data.

Illustratively, if the second database image does not contain sufficient information to answer the analysis request, the underlying second database table is used. However, it is not necessary to access the entire second database table, but only the records selected according to the second selection.

This is particularly advantageous if only a small part of the data records satisfies the selection criteria of the second selection, so that only a few data records have to be retrieved from the second database table. Access to the second database table is considerably slower than access to the second database image, because the second database table, owing to its memory requirement, typically has to be stored in a memory that is much slower to access than the memory in which the second database image is stored.
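This use of the image can be sketched as follows. The class `SlowTable` merely stands in for a table on slow storage and counts record reads; it, the `position` field and the sample data are assumptions made for illustration and are not taken from the source.

```python
class SlowTable:
    """Stands in for the full database table on slow storage; counts record reads."""

    def __init__(self, rows):
        self._rows = rows
        self.reads = 0

    def fetch(self, position):
        self.reads += 1
        return self._rows[position]


# Full customer table, including a field that the image does not contain.
table = SlowTable([
    {"customer_no": 1, "age": 65, "city": "Munich"},
    {"customer_no": 2, "age": 30, "city": "Berlin"},
    {"customer_no": 3, "age": 71, "city": "Hamburg"},
    {"customer_no": 4, "age": 45, "city": "Munich"},
])
# In-memory image: only some fields, plus the position of each record in the table.
image = [
    {"position": 0, "customer_no": 1, "age": 65},
    {"position": 1, "customer_no": 2, "age": 30},
    {"position": 2, "customer_no": 3, "age": 71},
    {"position": 3, "customer_no": 4, "age": 45},
]

selected_keys = {1, 3}  # second selection, e.g. obtained via the database key
positions = [r["position"] for r in image if r["customer_no"] in selected_keys]
cities = [table.fetch(p)["city"] for p in positions]  # only selected records are read
print(cities, table.reads)  # ['Munich', 'Hamburg'] 2
```

Only two of the four records are read from the slow table; the selection itself is evaluated entirely on the in-memory image.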

Illustratively, the second database image is used as a multidimensional index of the second database table. This will be explained in more detail below.

It is further preferred that in the first database image the first plurality of data records is grouped into a first plurality of segments (clusters) and/or in the second database image the second plurality of data records is grouped into a second plurality of segments.

Illustratively, the first database image and/or the second database image are generated according to a statistical clustering model.

Preferably, the value of the database key of a record of the first database image (that is, a record of the first plurality of records) consists of the number of the segment in which the record is contained and the number of the record according to a numbering of the records of the segment.

Preferably, the value of the database key of a record of the second database image (that is, a record of the second plurality of records) consists of the number of the segment in which the record is contained and the number of the record according to a numbering of the records of the segment.

Illustratively, a "natural key" is used as the database key, which results naturally from the division into clusters, with the records numbered consecutively within the clusters.

Illustratively, the "natural key" is used to link the first database image and the second database image instead of a database key used in the first database table or in the second database table (for example, a customer number).
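A key of this kind can be sketched as a pair of segment number and record number within the segment. The segment contents, the zero-based numbering and the helper names are assumptions made for illustration.

```python
# Records grouped into segments (clusters); contents are invented for illustration.
segments = [
    [{"age": 25, "gender": "f"}, {"age": 27, "gender": "f"}],  # segment 0
    [
        {"age": 66, "gender": "m"},
        {"age": 70, "gender": "m"},
        {"age": 68, "gender": "f"},
    ],  # segment 1
]


def natural_keys(segments):
    """Yield ((segment number, number within segment), record) pairs."""
    for seg_no, segment in enumerate(segments):
        for rec_no, record in enumerate(segment):
            yield (seg_no, rec_no), record


def lookup(segments, key):
    """Resolve a natural key directly to its record, without any search."""
    seg_no, rec_no = key
    return segments[seg_no][rec_no]


keys = [key for key, _ in natural_keys(segments)]
print(keys)                      # [(0, 0), (0, 1), (1, 0), (1, 1), (1, 2)]
print(lookup(segments, (1, 2)))  # {'age': 68, 'gender': 'f'}
```

Because the key encodes where the record sits in the image, it does not have to be stored as a separate field in the image and can be resolved by direct indexing.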

It is further preferred that for each record of the first plurality of data records the value of the database key is stored in the first database table and / or for each record of the second plurality of data records the value of the database key is stored in the second database table.

This is particularly important if the "natural key" described above is used for the records. In this case, the "natural key" is used to link the first database image and the second database image. When the first database table or the second database table is used, for example in the context of the above-mentioned use as a multidimensional index, it is necessary to map the value of the "natural key" to the value of the database key stored in the first database table (for example, transaction number) or the second database table (for example, customer number). This is made possible by the fact that for each record the value of the "natural key" is stored in the first database table or the second database table.

Independently of the above database query system, or as an alternative to it, in one embodiment a method is provided for generating a compressed image of a database table containing a plurality of records, each record including a value of at least one statistical variable, the method comprising the steps:

Determining a statistical probability model for describing the relative frequencies of the values of the at least one statistical variable in the records of the database table and for grouping the data records, each into one segment of a plurality of segments;

Determining, for each segment of the plurality of segments, a representative value of the at least one statistical variable according to the relative frequencies of the values of the at least one statistical variable in the data records of the segment;

Assigning, for each segment of the plurality of segments, a first coding value to the representative value of the respective segment;

Assigning, for each record, a second coding value to the value of the statistical variable contained in the record if the value contained in the record differs from the representative value of the segment in which the record is contained.

Furthermore, an arrangement, a computer-readable storage medium and a computer program element corresponding to the above-described method for generating a compressed image of a database table are provided.

The assignment of the first coding value to the representative value and the assignment of the second coding value to the value of the statistical variable contained in the data record may illustratively comprise a compression of the representative value or, respectively, of the value of the statistical variable contained in the record. In particular, the second coding value is preferably stored.

Illustratively, a database table is divided into a large number of segments. For each segment and for each statistical variable for which each record contained in the segment contains a value, a representative value, i.e. a default value, of the statistical variable is determined. The representative value is a value of the statistical variable that occurs with high relative frequency within the segment, that is, in the data records contained in the segment. For each record contained in the segment, it is then assumed that the value contained in the data record corresponds to the representative value, and accordingly the value contained in the data record is coded only if it deviates from the representative value.

Clearly, the value of a random variable is explicitly stored / encoded only if that value deviates from the value that one would expect based on statistical modeling (i.e., on the representative value). In the simplest case, the expected value is the most common value in a database table or in the segment of a database table. For higher compression, one can also choose the value that is the most likely value based on the prediction of a statistical model as the default value.
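The principle of storing only deviations from a default value can be sketched for one variable in one segment. The sketch takes the most frequent value as the representative value, which the text names as the simplest case; the assignment of coding values is omitted, and the data and function names are assumptions.

```python
from collections import Counter


def compress_segment(values):
    """Return (representative, deviations): only values that differ from the
    representative are stored, keyed by their position in the segment."""
    representative = Counter(values).most_common(1)[0][0]
    deviations = {i: v for i, v in enumerate(values) if v != representative}
    return representative, deviations


def decompress_segment(representative, deviations, length):
    """Rebuild the original values from the representative and the deviations."""
    return [deviations.get(i, representative) for i in range(length)]


payment = ["card", "card", "cash", "card", "card", "card"]  # one variable, one segment
rep, dev = compress_segment(payment)
print(rep, dev)  # card {2: 'cash'}
restored = decompress_segment(rep, dev, len(payment))
print(restored == payment)  # True
```

The more homogeneous a segment is, the fewer deviations have to be stored, which is why grouping similar records into segments pays off.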

It is preferred that the representative value be determined based on the description given by the statistical probability model of the relative frequencies of the values of the at least one statistical variable in the data records of the segment.

Illustratively, the statistical probability model is used to determine which value qualifies as a representative value for the statistical variable in the segment.

In this way, the representative value can be determined with little computational effort.

For example, the value is chosen as the representative value for which the statistical probability model indicates a high relative frequency within the segment. Preferably, the representative value corresponds to an expression of the statistical variables that occurs in the data sets contained in the segment with a relative frequency that is above a predetermined threshold.

For example, in one embodiment, the value of the statistical variable that occurs with the highest relative frequency within the segment is chosen as the representative value.

In this case, only very few values need to be coded, since most of the data records contained in the segment have the representative value as the value of the statistical variable. Thus, a high compression can be achieved.
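The default-value coding described above can be sketched as follows. This is a minimal illustration, not the coding used by the invention: it assumes the simplest case named in the text (the most frequent value of the segment is the representative value), and the names `encode_segment`, `decode_segment` and the sample data are hypothetical.

```python
from collections import Counter

def encode_segment(values):
    """Return (representative, length, exceptions) for one variable of a segment."""
    # Assumption: the representative value is simply the most frequent value.
    representative, _ = Counter(values).most_common(1)[0]
    # Only positions whose value deviates from the representative are stored.
    exceptions = {i: v for i, v in enumerate(values) if v != representative}
    return representative, len(values), exceptions

def decode_segment(representative, length, exceptions):
    """Rebuild the original values from the default value and the exceptions."""
    return [exceptions.get(i, representative) for i in range(length)]

# Hypothetical segment: income classes of eight similar customer records.
income_class = [7, 7, 7, 6, 7, 7, 5, 7]
rep, n, exc = encode_segment(income_class)
restored = decode_segment(rep, n, exc)
```

Here only two of the eight values are stored explicitly; the more homogeneous the segment, the fewer exceptions remain.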

Preferably, the statistical probability model is a graphical probability model. For example, a Bayesian network is used as the probabilistic model.

It is preferred that those values of the statistical variables which are contained in data records of the same segment and which differ from the representative value of the segment are encoded by means of an arithmetic coding method and/or a run-length encoding method.

Illustratively, in one embodiment, the data records are encoded efficiently by grouping them into segments of similar records and storing them in a data structure organized according to those segments; the similarity of the records within a segment permits more efficient coding by statistical methods (e.g. run-length coding, arithmetic coding).

In this case, the data of each segment can be stored row by row (that is, all values of the same data record are stored next to each other, i.e. at adjacent memory locations). Alternatively, the data can be stored column by column (that is, illustratively, field by field: for example, the values of the first field of all data records are stored at adjacent memory locations).
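As a rough illustration of why a column-by-column layout combines well with run-length coding, the following sketch transposes a segment stored row by row and encodes each column as (value, run length) pairs. The function names and the sample rows are assumptions made for this example only.

```python
from itertools import groupby

def run_length_encode(column):
    """Collapse consecutive equal values into (value, run_length) pairs."""
    return [(value, len(list(run))) for value, run in groupby(column)]

def run_length_decode(pairs):
    """Expand (value, run_length) pairs back into the original column."""
    return [value for value, count in pairs for _ in range(count)]

# Hypothetical segment stored row by row: (income class, gender) per record.
rows = [(7, "m"), (7, "m"), (7, "f"), (6, "f")]

# Column-by-column layout: all values of one field next to each other.
columns = [list(col) for col in zip(*rows)]
encoded = [run_length_encode(col) for col in columns]
decoded = [run_length_decode(pairs) for pairs in encoded]
```

Because similar records are grouped into one segment, equal values tend to be adjacent within a column, which is what makes the runs long.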

Further, independently of the database query system described above, or as an alternative to it, in one embodiment a computing arrangement for analyzing data is provided, comprising:

- a display device configured to display at least a first window and a second window, the first window having a first display element that displays a designation of a first analysis result relating to a first statistical variable and/or the first analysis result itself, and the second window having a second display element that displays a designation of a second analysis result relating to a second statistical variable and/or the second analysis result itself;

- a selection device by means of which a user can select the first display element and move it to the location of the second display element;

- a detection device configured to detect whether the first display element has been moved to the location of the second display element;

- a calculating device configured to calculate, if the first display element has been moved to the location of the second display element, a third analysis result relating to the first statistical variable and the second statistical variable;

- wherein the display device is configured to display the third analysis result.

Illustratively, a user can move the first display element onto the second display element by drag and drop on a graphical user interface and thereby control the computing arrangement so that the third analysis result is determined.

A display element that displays a designation of an analysis result relating to a statistical variable and/or the analysis result itself is, for example:

- a label field of a window on the screen, the window containing the relative frequencies of the expressions of a statistical variable occurring in a database table;

- the display, for example the displayed value, of a relative frequency of an expression of a statistical variable occurring in a database table, or the display of another analysis result;

- the designation of an expression of a statistical variable or the designation of a group of expressions of a statistical variable;

- the name of a statistical variable or the name of a group of statistical variables.

Illustratively, an improved usability concept is provided, in particular for operating computer programs that allow databases to be queried and data stored in a database to be analyzed statistically.

It is preferred that the first analysis result is based on data contained in a first database table and that the second analysis result is based on data contained in a second database table.

The first window thus serves to analyze the first database table and the second window to analyze the second database table. The user can therefore generate analysis results across the windows, based in particular on data contained in the first database table and on data contained in the second database table.

For example, the first database table is a transaction database table that contains data about the transactions in a DIY store. The second database table is a customer database table containing data about the customers of the DIY store. In a first window, a user can view as the first analysis result the distribution of the random variable "total turnover of the customers" (relative frequency of the customers' total turnover). The first window shows, for example, in a table that in 2004, 30% of the customers of the DIY store generated a total turnover between 100 and 150 euros through transactions (and corresponding values for other ranges of the total turnover). For example, the first table has the title "Total customer revenue". In a second window, a second analysis result relating to the transaction database table is displayed, for example the relative frequency of the products purchased, in a second table titled "Products". For example, the second table contains the entries that balcony plants were bought in 3% of all transactions, garden furniture in 7% of all transactions, etc.

The user can now, for example, have the customers' total turnover broken down by product, i.e. generate and display an analysis result containing, for example, the information that 25% of the customers generated a total turnover between 100 euros and 150 euros through purchases of bedding and balcony plants (and other values for other ranges of total turnover and other products). The user achieves this by, for example, selecting the title bar of the first window, for example a field with the string "total sales of the customers", and moving it into the second window by drag and drop.

The display device is preferably a computer screen.

The selection device is preferably a computer mouse.

However, a touch screen, for example, can also be used as the display device, in which case the user can select and move the first display element by touching the touch screen. Accordingly, the selection device is then an element of the touch screen.

Embodiments of the invention are illustrated in the figures and are explained in more detail below.

FIG. 1 shows a computer arrangement according to an embodiment of the invention.

FIG. 2 shows a first screen display of an Explorer computer program according to an exemplary embodiment of the invention.

FIG. 3 shows a second screen display of an Explorer computer program according to an exemplary embodiment of the invention.

FIG. 4 shows a third screen display of an Explorer computer program according to an exemplary embodiment of the invention.

FIG. 5 shows a fourth screen display of an Explorer computer program according to an exemplary embodiment of the invention.

FIG. 6 shows a fifth screen display of an Explorer computer program according to an embodiment of the invention.

FIG. 7 shows a sixth screen display of an Explorer computer program according to an embodiment of the invention.

FIG. 8 illustrates a cluster hierarchy corresponding to a database image according to an embodiment of the invention.

FIG. 9 illustrates a cluster according to an exemplary embodiment of the invention.

FIG. 1 shows a computer arrangement 100 according to an embodiment of the invention. A computer system 101 is coupled to a database system 102.

The computer system 101 according to this embodiment is a personal computer (PC) but may also be another computer, for example a workstation.

The computer system 101 includes a screen 110, a microprocessor 103, a memory 104, and various input devices 111, such as a keyboard and a computer mouse.

The database system 102 is a computer system for storing database tables. The database system 102 may accordingly be a computer equipped with a high memory capacity and is coupled to the computer system 101, for example by means of an Ethernet interface or wirelessly, for example by means of Bluetooth. For example, the database system may function as an Oracle database, a Microsoft Access database, a Lotus 1-2-3 database, or a dBase database.

In the database system 102, a customer database table 105 and a transaction database table 106 are stored, which are described in more detail below.

In the memory 104 of the computer system 101, a customer database table image 107, that is, a compressed image of the customer database table 105, and a transaction database table image 108, that is, a compressed image of the transaction database table 106, are stored. The customer database table image 107 and the transaction database table image 108 are, illustratively, data structures that contain the data from the customer database table 105 and the transaction database table 106, respectively, in compressed form.

The type of compression and the structure of the customer database table image 107 and the transaction database table image 108 will be described in detail below.

In another embodiment, the database system 102 is part of the computer system 101. For example, the computer system 101 has a hard disk on which the customer database table 105 and the transaction database table 106 are stored, and further has a working memory in which the customer database table image 107 and the transaction database table image 108 are stored, so that in particular the customer database table image 107 and the transaction database table image 108 can be accessed quickly.

Also stored in the memory 104 is an Explorer computer program 109, executed by the microprocessor 103, that allows results of a statistical analysis of the customer database table image 107 (and thus of the customer database table 105) and of the transaction database table image 108 (and thus of the transaction database table 106) to be displayed graphically on the screen 110.

This will be explained in more detail below.

FIG. 2 shows a first screen display 200 of an Explorer computer program according to an embodiment of the invention.

The first screen display 200 shows results of a statistical analysis of the customer database table image 107 and thus results of a statistical analysis of the customer database table 105.

The customer database table 105 contains information about the customers of a DIY store. For example, the customer database table contains, for each customer of the DIY store (or for each registered customer of the DIY store), a customer data record that contains the customer number, the gender, the income class and the year of birth of the customer. The customer records in the customer database table 105 may contain a variety of further information about the respective customer; in this example, however, it is assumed that they contain only the information listed above.

The customer database table image 107 accordingly contains this information about the customers of the hardware store in compressed form, as explained below.

The Explorer computer program 109 allows the analysis of the data contained in the customer database table image 107 and the graphical display of results of such analysis.

In this embodiment, the Explorer computer program 109 has been used to examine the gender distribution of the customers of the DIY store, and the result is shown in a first window 201 of the first screen display 200. It can be seen that 68.65% of the DIY store customers are male and that 31.33% of the DIY store customers are female.

Illustratively, the Explorer computer program 109 performs this analysis by counting all customer records that contain the information that the corresponding customer is male, counting all customer records that contain the information that the corresponding customer is female, and relating the count results to the total number of customer records.
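The counting step described above can be sketched as follows. The sketch works on plain, uncompressed records; the field names and the sample records are hypothetical and do not reproduce the figures of the embodiment.

```python
from collections import Counter

def relative_frequencies(records, field):
    """Percentage of records per value of `field`, relative to all records."""
    counts = Counter(record[field] for record in records)
    total = len(records)
    return {value: 100.0 * n / total for value, n in counts.items()}

# Hypothetical customer records (customer number, gender, income class, birth year).
customers = [
    {"customer_no": 1, "gender": "male", "income_class": 7, "birth_year": 1935},
    {"customer_no": 2, "gender": "male", "income_class": 7, "birth_year": 1961},
    {"customer_no": 3, "gender": "female", "income_class": 6, "birth_year": 1972},
    {"customer_no": 4, "gender": "male", "income_class": 5, "birth_year": 1948},
]

gender_distribution = relative_frequencies(customers, "gender")
```

The same function applied to `"income_class"` would yield the kind of distribution shown in the third window 203.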

Furthermore, the age distribution of the customers of the building market was analyzed by means of the Explorer computer program 109 by counting customer data records which contain the information that the birth year of the corresponding customer is within a certain range.

The result of this age distribution analysis is displayed in a second window 202 of the first screen 200 on the screen 110.

Furthermore, the distribution of the income classes among the customers of the DIY store was examined by means of the Explorer computer program 109, and the result of this analysis is displayed in a third window 203 of the first screen display 200. It can be seen that most of the DIY store customers (70.14%) belong to income class 7.

The analyses whose results are displayed in the first window 201, in the second window 202 and in the third window 203 are based on all customer records. For example, all customer records containing the information that the corresponding customer is male were counted and related to the total number of customer records to determine the corresponding analysis result (68.65%).

Since all customer data records have been used as the basis for the analyzes, 100% is displayed in a selection information field 204. The selection information field 204, in another embodiment, further includes the total number of customer records that underlie the analyzes.

The first screen display 200, like all other screen displays shown in FIGS. 3 to 7, has a first selection window 205 and a second selection window 206. The first selection window 205 and the second selection window 206 allow the user to set additional windows to be displayed in the area adjacent to the first selection window 205 and the second selection window 206, for example, windows having analysis results analogous to the first window 201, the second window 202 and the third window 203, which relate to other statistical variables, such as the sales of customers of the construction market.

As mentioned, the transaction database table image 108, and thus the transaction database table 106, can also be analyzed by means of the Explorer computer program 109. The analysis results may also be displayed on the screen 110; FIG. 3 shows a corresponding display.

FIG. 3 shows a second screen display 300 of an Explorer computer program according to an embodiment of the invention. For example, switching between the first screen display 200 and the second screen display 300 can be accomplished by operating (clicking) an icon in a toolbar.

In this embodiment, the transaction database table 106 includes a plurality of transaction records. Each transaction record corresponds to a transaction, i.e. a sales transaction in the DIY store, and contains a transaction number that uniquely identifies the transaction, a specification of the product sold during the transaction, the gross sales value of the transaction, the date of the transaction, and the customer number of the customer involved in the transaction, that is, the customer who bought the product sold. This information is correspondingly included in the transaction database table image 108 in compressed form.

The second screen display 300 shows in a first window 301 the results of an analysis of how often certain products were purchased by customers in DIY store transactions, relative to all DIY store transactions.

For example, technology products were bought in 24.07% of all DIY store transactions. The product groups such as "technology", "ambiance" and "garden" are subdivided further; for example, the product group "garden" has the subgroup "garden / fences and accessories" and the subgroup "plants". The subgroup "plants" is further subdivided into "bedding and balcony plants", "nursery products", "indoor plants", etc. From the first window it can be seen that bedding and balcony plants were sold in 6.68% of all DIY store transactions.

This analysis result is obtained by counting all transaction records that contain the information that bedding and balcony plants were sold in the respective transaction. The count result is related to the total number of transaction records, giving the percentage value (6.68%).

A second window 302 displays the result of an analysis of how the number of transactions is distributed over the year.

For example, it can be seen that 9.1% of all transactions were carried out in March. This result is obtained by determining the number of transaction records containing the information that the transaction was carried out on a day in March, which can be established by evaluating the date of the transaction, and relating this number to the total number of transaction records.

A third window 303 displays the result of an analysis of the distribution of the gross sales value over the transactions. For example, it can be seen that for 13.72% of all transactions, the gross sales value was between 10 and 25 euros.

The analyses whose results are displayed in the first window 301, in the second window 302 and in the third window 303 are based on all transaction records, which is why, analogously to FIG. 2, the value 100% is displayed in a selection information field 304. In the following, an example is explained in which an analysis is based on only a part of the transaction records.

FIG. 4 shows a third screen display 400 of an Explorer computer program according to an embodiment of the invention.

The third screen display 400 results from the second screen display 300 when a user, by means of one of the input devices 111, selects bedding and balcony plants in the first window 301 of the second screen display 300, which corresponds to a first window 401, and selects March 2003 in the second window 302 of the second screen display 300, which corresponds to a second window 402.

For example, the user clicks on the value 6.68 in the first window 301 of the second screen display 300 by means of a computer mouse, whereupon the value is replaced by a first bar 404 and the value 100, as shown in the first window 401. Analogously, it is assumed that the user has clicked on the value 9.01 in the second window 302 of the second screen display 300, for example by means of a computer mouse, whereupon this value is replaced by a second bar 405 and the value 100, as shown in the second window 402.

The first bar 404 indicates that now only those transaction records are selected that contain the information that a bedding and balcony plant was sold during the transaction. The second bar 405, which, like the first bar 404, is displayed in a distinctive color, such as red, indicates that only those transaction records are selected that contain the information that the corresponding transaction was made in March 2003.

Overall, all transaction records are then selected that contain the information that the corresponding transactions were made in March 2003 and that a bedding and balcony plant was sold as part of the transaction.

Accordingly, only a fraction of the total number of transaction records is selected. In this example, 1.3% of all transaction records correspond to transactions in which a bedding and balcony plant was sold in March 2003. This is shown in a selection information field 406 corresponding to the selection information field 304 in the second screen display 300.
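The combined selection and the percentage shown in the selection information field can be sketched as follows. The helper `select`, the field names and the sample transactions are hypothetical; the sketch operates on uncompressed records rather than on the database image.

```python
from datetime import date

def select(records, *predicates):
    """Keep only the records that satisfy every predicate (logical AND)."""
    return [r for r in records if all(p(r) for p in predicates)]

def is_bedding_plant(record):
    return record["product"] == "bedding and balcony plants"

def in_march_2003(record):
    return (record["date"].year, record["date"].month) == (2003, 3)

# Hypothetical transaction records.
transactions = [
    {"transaction_no": 1, "product": "bedding and balcony plants", "date": date(2003, 3, 14)},
    {"transaction_no": 2, "product": "bedding and balcony plants", "date": date(2003, 4, 2)},
    {"transaction_no": 3, "product": "drills", "date": date(2003, 3, 20)},
    {"transaction_no": 4, "product": "bedding and balcony plants", "date": date(2003, 3, 28)},
    {"transaction_no": 5, "product": "garden furniture", "date": date(2003, 7, 1)},
]

selected = select(transactions, is_bedding_plant, in_march_2003)
# Value for the selection information field: share of selected records in percent.
selected_share = 100.0 * len(selected) / len(transactions)
```

All subsequent analyses would then be run on `selected` instead of `transactions`.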

The analyses whose results are displayed in the first window 401, in the second window 402 and in the third window 403 are based on the selected data records.

Since all of the selected transaction records contain the information that a bedding and balcony plant was sold during the respective transaction, bedding and balcony plants were sold in 100% of all selected transactions, i.e. the transactions corresponding to the selected transaction records, which is indicated by the value 100 in the first bar 404. Analogously, in accordance with the selection of the transaction records, 100% of all selected transactions were carried out in March 2003, which is indicated by the number 100 in the second bar 405.

A nontrivial analysis result, however, is shown in the third window 403.

For example, it can be seen that in 82.45% of all selected transactions, the gross sales value is below 5 euros. This means that for 82.45% of all transactions that took place in March 2003 and in which a bedding and balcony plant was sold, the gross sales value was less than 5 euros.

Now assume that a DIY store sales manager wants to analyze the age distribution of those customers who bought at least one bedding and balcony plant in March 2003. The sales manager may want to conduct this analysis to determine whether it is worth starting a "geranium for retiree" discount sale next March.

The sales manager starts the Explorer computer program 109 on the basis of the customer database table image 107, so that the first screen display 200 is displayed on the screen 110.

Then, the sales manager launches a new instance of the Explorer computer program 109 (or opens another window in the Explorer computer program 109) based on the transaction database table image 108, so that the second screen display 300 is displayed on the screen 110. Subsequently, the sales manager selects, as described above with reference to FIG. 4, bedding and balcony plants in the first window 301 of the second screen display 300 and March 2003 in the second window 302 of the second screen display 300, so that the second screen display 300 changes into the third screen display 400.

Subsequently, the sales manager changes, for example by clicking on a corresponding icon, to the first screen display 200, which, in accordance with the selection, has however changed into the fourth screen display 500 shown in FIG. 5.

FIG. 5 shows a fourth screen display 500 of an Explorer computer program according to an embodiment of the invention.

In accordance with the selection of all transactions carried out in March 2003 in which a bedding and balcony plant was sold, the analyses whose results are shown in a first window 501, which corresponds to the first window 201 of the first screen display 200, in a second window 502, which corresponds to the second window 202 of the first screen display 200, and in a third window 503, which corresponds to the third window 203 of the first screen display 200, are based on exactly those customer records that correspond to customers who bought a bedding and balcony plant in March 2003.

This is done by determining, in the transaction database table image 108, all those customer numbers that belong to a transaction record of a transaction which was completed in March 2003 and in which a customer (namely the customer specified by the customer number) bought a bedding and balcony plant. The analyses whose results are displayed in the first window 501, in the second window 502 and in the third window 503 are now based on exactly those customer records that contain one of the customer numbers thus determined. These customer records are referred to below as the selected customer records.

Illustratively, the customer number is used as a database key that links related customer records and transaction records together.
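The transfer of a selection from the transaction table to the customer table via the common key can be sketched as follows. This is a simplified illustration on uncompressed records; the function name `propagate_selection` and the sample data are assumptions made for this example.

```python
def propagate_selection(selected_transactions, customers, key="customer_no"):
    """Select the customer records whose key occurs in the selected transactions."""
    # Step 1: collect the key values (customer numbers) of the selected transactions.
    keys = {t[key] for t in selected_transactions}
    # Step 2: keep exactly the customer records that carry one of these keys.
    return [c for c in customers if c[key] in keys]

# Hypothetical data: customer 10 appears in two selected transactions.
selected_transactions = [
    {"transaction_no": 1, "customer_no": 10},
    {"transaction_no": 4, "customer_no": 10},
    {"transaction_no": 9, "customer_no": 12},
]
customers = [
    {"customer_no": 10, "birth_year": 1935},
    {"customer_no": 11, "birth_year": 1960},
    {"customer_no": 12, "birth_year": 1972},
]

selected_customers = propagate_selection(selected_transactions, customers)
```

Using a set for the key values means a customer with several matching transactions is still selected only once.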

In accordance with the selection of the customer data records, the proportion of the selected customer data records in the total number of customer data records is displayed in a selection information field 504 corresponding to the selection information field 204 of the first screen display 200, in this example 1.02%. This means that 1.02% of the (registered) customers of the DIY store bought at least one bedding and balcony plant in March 2003.

The selected customer data records form the basis of the analyses whose results are displayed in the first window 501, in the second window 502 and in the third window 503, respectively.

For example, from the first window 501 it can be seen that 57.93% of all customers who bought at least one bedding and balcony plant in March 2003 are male.

It can be seen from the third window 503 that 79.41% of the selected customers, that is to say the customers to whom the selected customer records belong, are in income class 7.

However, in this example, the sales manager is interested in the analysis whose result is displayed in the second window 502.

It can be seen that 19.25% of all customers who bought at least one bedding and balcony plant in March 2003 were born between 1930 and 1939.

A comparison with the second window 202 of the first screen display 200 shows that the share of customers born between 1930 and 1939 among all customers who bought at least one bedding and balcony plant in March 2003 (19.25%) is larger than the share of customers born between 1930 and 1939 among all customers of the DIY store (10.95%).
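The comparison of the two shares can be made explicit with a short calculation. The ratio below is not a quantity named in the description; it is merely one way, assumed here, of expressing how strongly the birth cohort is over-represented among the selected customers.

```python
def share_ratio(share_in_selection, share_overall):
    """Ratio of a group's share within the selection to its share overall."""
    return share_in_selection / share_overall

# Shares of customers born between 1930 and 1939, taken from the text above.
ratio = share_ratio(19.25, 10.95)
rounded_ratio = round(ratio, 2)
```

A ratio above 1 means the group is over-represented in the selection; here the cohort is roughly 1.76 times as frequent among the buyers of bedding and balcony plants as among all customers.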

The sales manager might conclude that it might well be worth starting a "Geranium for Retiree" discount campaign next March.

Illustratively, in the embodiment described above the data are not present in the form of a so-called flat data structure, that is, in a single database table, but are distributed over multiple database tables, in this example the customer database table 105 and the transaction database table 106. The customer database table 105 and the transaction database table 106 are in a 1:n relationship via the customer number, because in this example a customer may be involved in multiple transactions. In other embodiments, m:n relationships are also conceivable, for example if a customer may be involved in multiple transactions and multiple customers can carry out a transaction together.
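The 1:n relationship can be pictured as an index from each customer number to the transactions carrying that number. The sketch below is only an illustration with hypothetical names and data; it says nothing about how the database images actually store the link.

```python
from collections import defaultdict

def index_by_customer(transactions):
    """Map each customer number to the list of its transaction numbers (1:n)."""
    index = defaultdict(list)
    for t in transactions:
        index[t["customer_no"]].append(t["transaction_no"])
    return dict(index)

# Hypothetical transactions: customer 10 is involved in two of them.
transactions = [
    {"transaction_no": 1, "customer_no": 10},
    {"transaction_no": 2, "customer_no": 11},
    {"transaction_no": 3, "customer_no": 10},
]

transactions_of = index_by_customer(transactions)
```

For an m:n relationship, each transaction would instead carry several customer numbers, and the loop would add the transaction under each of them.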

In one embodiment, when a selection has been made in accordance with FIG. 4, a further window is displayed in the first screen display 200, by means of which the user can select whether the selection according to FIG. 4 is to be used as the basis for the analyses whose results are displayed in the first window 201, in the second window 202 and in the third window 203. For example, the further window can be set to the state "yes", which means that the selection according to FIG. 4 is used as the basis for the analyses. In the further window, this state may (instead of "yes") also be labeled, for example, "Customer has transactions that match the selection in the other database table" or "Customer has done transactions with product = bedding plants, gross sales value <5, transaction month = March03". Correspondingly, the further window has a state "no" (or a correspondingly labeled state). The user, in this example the sales manager, can put the further window into one of the two states, for example by means of a computer mouse, i.e. select one of the two states and thereby determine whether the selections currently made in the other database table are to be taken into account when evaluating this database table.

The further window may optionally retain its designation and the effect of selections made in it when the selection in the second screen display is changed, or it may adjust automatically. Depending on this, the further window will either continue to refer to bedding plants (for example, if the "keep" mode is activated) or switch to drills if the selection in the second screen display is changed from bedding plants to drills.

Furthermore (and assuming that "yes" was selected in the further window described above, i.e. the selection according to FIG. 4 was adopted), a further selection, in this case of customers, can be made by means of the fourth screen display 500, analogously to the third screen display 400. In accordance with this selection, the transactions on which the analyses shown in the third screen display are based can be selected by means of the common key (customer number) of the transaction database table image 108 and the customer database table image 107. For example, in the fourth screen display 500, the user could select the customers who bought at least one bedding and balcony plant in March 2003 and who belong to income class six, for example by clicking on the value 2.87 in the third window 503.

If the mode of the further window is set to "keep", the selection of customers defined in the previous paragraph, which arose from the interplay between the transaction table and the customer table, can be transferred back to the transaction side, so that more can be learned about the other transactions of this customer group beyond the previously defined purchases of bedding and balcony plants in March. To do this, the selections in the third screen display 400 are first removed (which, in accordance with the "keep" mode, has no effect on the fourth screen display 500), and the state "yes" is selected in the further window displayed there, whereby the customer list currently active in the fourth screen display 500 is transferred to the third screen display 400. Accordingly, the third screen display 400 would change, and the third window 403 would show the distribution of the gross sales values of the transactions made by customers who are in income class six and bought at least one bedding and balcony plant in March 2003.

The selection can now be continued. In this way, complex questions can be answered, such as the question "What do customers buy in September who bought garden fences in May?". This can be strategically exploited by a sales manager, for example, to decide whether or not to sell garden fences in the fall if a lot of garden fences were sold in one year in the spring.
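The example question can be answered by passing a selection from the transactions to the customers and back again. The following sketch does this on a handful of hypothetical, uncompressed transaction records; the field names and products other than garden fences are invented for illustration.

```python
from collections import Counter

# Hypothetical transactions with the month of purchase.
transactions = [
    {"customer_no": 1, "product": "garden fences", "month": 5},
    {"customer_no": 1, "product": "wood stain", "month": 9},
    {"customer_no": 2, "product": "garden fences", "month": 5},
    {"customer_no": 2, "product": "wood stain", "month": 9},
    {"customer_no": 2, "product": "drills", "month": 9},
    {"customer_no": 3, "product": "drills", "month": 9},
]

# Step 1: select transactions (garden fences in May), collect their customer numbers.
fence_buyers = {
    t["customer_no"]
    for t in transactions
    if t["product"] == "garden fences" and t["month"] == 5
}

# Step 2: go back to the transactions of exactly these customers, restricted to September.
september_purchases = Counter(
    t["product"]
    for t in transactions
    if t["month"] == 9 and t["customer_no"] in fence_buyers
)
```

Customer 3 bought drills in September but no fence in May, so that purchase is not counted.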

In the embodiment described above, two database images are combined that illustratively represent different views. Thus, the customer database table image 107 corresponds to a view of the customers of the DIY store and the transaction database table image 108 to a view of the transactions made in the DIY store.

In the following, with reference to FIG. 6 and FIG. 7, further screen displays representing results of analyses performed by the Explorer computer program 109 will be explained.

FIG. 6 shows a fifth screen display 600 of an Explorer computer program according to an embodiment of the invention.

The fifth screen display 600 results from the third screen display 400.

The fifth screen display 600 includes (partially) a first window 601 corresponding to the first window 301 of the second screen display 300. The fifth screen display 600 further includes (partially) a second window 602 that corresponds to the third window 303 of the second screen display 300.

A third window 603 shows the result of an analysis in which it was determined, for each product group, how high the proportion of transactions in which a product from the respective product group was sold and in which the gross sales value was less than 5 euros is, relative to all transactions in which a product of the respective product group was sold.

For example, a first bar 604 shows that in about 60% of all transactions in which a product from the product group "Technology" was sold, the gross sales value was below 5 euros. Corresponding bars are shown for the product groups "Ambiente", "Garten", "Baustoffe / Sanitär", etc.
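The breakdown shown in the third window 603 can be sketched as a conditional relative frequency per product group. The function `breakdown`, the field names and the sample values are hypothetical and do not reproduce the percentages of the figure.

```python
from collections import defaultdict

def breakdown(records, group_field, predicate):
    """Per group: percentage of its records that satisfy the predicate."""
    totals = defaultdict(int)
    hits = defaultdict(int)
    for r in records:
        totals[r[group_field]] += 1
        if predicate(r):
            hits[r[group_field]] += 1
    return {group: 100.0 * hits[group] / totals[group] for group in totals}

# Hypothetical transactions with product group and gross sales value in euros.
transactions = [
    {"group": "Technology", "gross": 4.0},
    {"group": "Technology", "gross": 12.0},
    {"group": "Garden", "gross": 3.0},
    {"group": "Garden", "gross": 2.5},
    {"group": "Garden", "gross": 30.0},
    {"group": "Garden", "gross": 4.9},
]

below_5_by_group = breakdown(transactions, "group", lambda r: r["gross"] < 5)
```

Each resulting percentage corresponds to the height of one bar, such as the first bar 604.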

Illustratively, the expression "less than 5 euros" of the random variable "gross sales value" is broken down by product group. The user of the Explorer computer program 109 can generate the fifth screen display 600 from the third screen display 400 by clicking, with a computer mouse, on the value (65.84) for the expression "<5" in the third window 403 of the third screen display 400, keeping the mouse button pressed and dragging the value into the first window 401 of the third screen display 400 (drag and drop).

In general, an expression of a first random variable can be broken down by a second random variable by dragging the value for the relative frequency of the expression of the first random variable into a window in which the relative frequencies of the expressions of the second random variable are displayed. This can also be done across screen displays. For example, the user may click on the value (65.84) for the expression "<5" in the third window 403 of the third screen display 400 with a computer mouse, change to the fourth screen display 500 by means of a corresponding command, and drag the value into the first window 501. Accordingly, the expression "below 5 euros" of the random variable "gross sales value" would be broken down by gender and, for example, a bar would appear stating that in 40% of all transactions made by a male customer the gross sales value was below 5 euros (and correspondingly a further bar for the female customers).

In this example, the first random variable is the gross sales value and the second random variable is the product group. In another embodiment, a three-dimensional diagrammatic representation can also be generated in a similar way, for example likewise by drag and drop. For example, all product groups (occurrences of a first random variable) are represented along a first coordinate axis, as is the case in the third window 603, and ranges of gross sales values, for example "<5", "5-10", etc. (occurrences of a second random variable), are represented along a second coordinate axis. At the location of the grid formed by the first coordinate axis and the second coordinate axis that corresponds to a certain product group and a certain gross sales value range, a bar along a third coordinate axis could show the proportion of transactions in which a product of the product group was sold and the gross sales value lies in the gross sales value range, relative to all transactions in which a product from the product group was sold.

Illustratively, this corresponds to showing the analysis result of the third window 603 for all gross sales value ranges (and not just for the gross sales value range "<5") by extending the representation shown in the third window 603 along a further coordinate axis (the above-mentioned second coordinate axis), so that a two-dimensional array of bars is created.

FIG. 7 shows a sixth screen display 700 of an Explorer computer program according to an embodiment of the invention.

The sixth screen 700 has (partially) a first window 701 corresponding to the first window 301 of the second screen display 300. The sixth screen 700 further includes (partially) a second window 702 corresponding to the third window 303 of the second screen display 300.

In a third window 703, the result of another analysis is shown. The analysis determined the average gross sales value of all transaction records that correspond to a transaction where a product from a particular product group was sold, and performed accordingly for multiple product groups.

For example, a marker 704 shows that the average of all gross sales values for transactions in which a product from the product group "Technology" was sold is about 8 euros. Corresponding further markers, which display the respective average gross sales values for other product groups, are likewise shown in the third window 703, in this example for the product groups "Ambiente", "Garten", "Baustoffe / Sanitär" etc.

Clearly, the average gross sales value (the gross sales values from all transaction records) is broken down across the different product groups.

The user may generate the sixth screen display 700 from the second screen display 300 by, for example, dragging and dropping the field with the string "percentage values" from the third window 303 into the first window 301. In this case, the user could be presented with a selection menu offering several options. For example, the user may choose to display, instead of the third window 703, a window that indicates for each product group not the average gross sales value but the sum of all gross sales values contained in transaction records corresponding to transactions in which a product from the respective product group was sold. In this case, for example, another marker (analogous to the marker 704) could be displayed that indicates the sum of all gross sales values from transaction records corresponding to transactions in which a product from the product group "Technology" was sold.

Clearly, the total sales are broken down into different product groups.

In the analyses whose results are shown in the third window 603 of the fifth screen display 600 or in the third window 703 of the sixth screen display 700, it was assumed that all transaction records were used as a basis. However, it is also possible to base the analyses on only a part of the transaction records by performing a selection of specific transaction records, as explained above with reference to FIG. 4 and FIG.

Analogous to the breakdown of the average across different product groups as shown in FIG. 7, other statistical quantities may also be broken down by the occurrences of a random variable. For example, for each product group, the variance of the gross sales values could be determined over all transactions in which a product from the respective product group was sold.

In another embodiment, all analyses can also be based on weighted data records. For example, a customer data record is weighted with the revenue generated so far with the corresponding customer. In that case a first age range would be shown with a higher customer share than a second age range, for example in the second window 202 of the first screen display, if the customers in the first age range generated more revenue than the customers in the second age range, even if the number of customers in the first age range is not higher than the number of customers in the second age range (since the weighting is taken into account when counting the corresponding customer data records). This presupposes that each customer data record contains information about the revenue of the respective customer.

Likewise, in analyzes involving transaction database table 106, transactions may be weighted according to their share of revenue.
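The effect of weighting can be illustrated with a small sketch that counts each record with a weight instead of with one. The field names `age_range` and `revenue`, the function name `weighted_shares` and the sample records are hypothetical.

```python
from collections import defaultdict

def weighted_shares(records, key, weight=None):
    """Relative frequency of each value of `key`; if `weight` names a field,
    every record counts with that field's value instead of with 1."""
    totals = defaultdict(float)
    for rec in records:
        totals[rec[key]] += rec[weight] if weight else 1.0
    grand_total = sum(totals.values())
    return {value: t / grand_total for value, t in totals.items()}

# Hypothetical customer records.
customers = [
    {"age_range": "20-30", "revenue": 900.0},
    {"age_range": "30-40", "revenue": 50.0},
    {"age_range": "30-40", "revenue": 50.0},
]
by_count = weighted_shares(customers, "age_range")
by_revenue = weighted_shares(customers, "age_range", weight="revenue")
```

In this sample the age range "20-30" contains fewer customers than "30-40" but receives the larger share once the records are weighted by revenue.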

If a selection of customers is made, as explained above with reference to FIG. 4, then a window in which the selected customers are broken down according to the occurrence of a random variable can be displayed in the screen display relating to the customer database table 105.

According to the above example, in which all customers are selected who bought a bedding and balcony plant in March 2003, another window could be displayed in the fourth screen display 500 which shows, for different revenue ranges (for example by means of bars), the proportion of customers who generated revenue in the respective range and bought a bedding and balcony plant in March 2003, relative to all customers who bought a bedding and balcony plant in March 2003.

The following describes the form and structure of a database image of a database table according to an embodiment of the invention, for example the customer database table image 107.

The database table has several data records which, written one below the other, illustratively form the database table. For example, as in the example described above, there is one record for each (registered) customer of a hardware store. For example, each record has a database table entry that contains the age of the respective customer. Illustratively, the data records form rows in which the age of the customer corresponding to the respective row is indicated in an "age" column.

The attribute age (and other existing attributes such as income, gender, etc.) of the customer is interpreted as a random variable. Depending on the customer, this random variable assumes a certain value (state, occurrence), for example the value 23 if the corresponding customer is 23 years old. The possible values of the random variables occur with a relative frequency in the database table. For example, if one quarter of all (registered) customers of the DIY store are 23 years old, the relative frequency of the value (state) 23 of the random variable age is 0.25 or 25%.

To create the database image of the database table, a statistical model of the data in the database table is generated. The statistical model is illustratively an approximation of the joint probability distribution of the random variables of the database table.

For example, in the above example, when generating a statistical model of the database table, it is determined that the probability that a customer is 23 is 0.25, which can be formally written as follows:

P(customer is 23) = 0.25

The statistical model is "learned" by a learning process from the database table entries, that is, using the database table entries, preferably using a maximum likelihood approach. The probabilities present within the framework of the statistical model of the database table describe, as mentioned, the relative frequencies of the states of the database table entries, depending on the procedure exactly or approximately. The database table entries may assume a variety of states, which states may occur with different relative frequencies.

Once a statistical model has been generated, it can be used to study the dependencies between the states of random variables, that is, the correlation of random variables.

For example, the relative frequencies (probabilities) of the states of certain random variables are specified as a predetermined condition, and the relative frequencies of the states of further random variables that depend on them (and are thus correlated with them) are determined for this condition.

A graphical probabilistic model, as described for example in [1], can be used as the statistical model. Graphical probabilistic models include in particular Bayesian networks (also called belief networks) and Markov networks.

A statistical model can be generated, for example, by structural learning in Bayesian networks, as described, for example, in [2].

Another possibility is to learn (that is, to determine) the parameters of the statistical model for a fixed structure, as described for example in [3].

In a variety of learning techniques, a likelihood function is used as an optimization criterion for the parameters of the model. A particular implementation here is the expectation-maximization (EM) learning method, which is described in more detail below with reference to a specific model.

Typically, it is not a high generalizability of the statistical model that is important, but a good adaptation of the statistical model to the data contained in the database table, that is to say a good match of the probabilities of the random variables specified by the statistical model with the relative frequencies given by the database table entries.

A statistical clustering model, in particular a Bayesian clustering model, which divides the data into a plurality of clusters (also called segments), is preferably used as the statistical model.

By using a clustering model, the database table is divided into several smaller parts (clusters, segments), which in turn can be considered as separate database tables and, because of their smaller size, can be handled more efficiently.

A more efficient statistical evaluation of the database table using a clustering model can be achieved, for example, by checking during the evaluation whether the statistical model indicates that all data satisfying a given selection condition are contained in a single cluster or in a subset of the clusters. If this is the case, the evaluation can be restricted to these clusters. Likewise, the evaluation can be restricted to those clusters in which the data satisfying the given condition are contained with at least a certain relative frequency. The remaining clusters, in which data satisfying the given condition are contained only in a smaller proportion, can be neglected if only approximate statements are desired. A Bayesian clustering model (a model with a discrete latent variable), for example, is used as the statistical clustering model.

This will be described in more detail below.

Let a set (K-tuple) of random variables (statistical variables) $X = (X_1, \ldots, X_K)$ be given. The possible states of the random variables are denoted by the respective lower-case letters. The i-th random variable $X_i$ ($1 \le i \le K$) can, for example, assume the states $x_{i,1}, x_{i,2}, \ldots, x_{i,L_i}$, where $L_i$ is a natural number greater than or equal to one.

Both discrete and continuous (real-valued) random variables can be used.

In this embodiment, continuous states are discretized using corresponding discretization intervals. Accordingly, it is assumed that the states of the random variables $X_i$ (for all i with $1 \le i \le K$) are discrete.

A record in the database table contains a value (occurrence) for each of the random variables $X_1, \ldots, X_K$. The π-th data record of the database table can accordingly be written in the form

$$x^{\pi} = (x_1^{\pi}, \ldots, x_K^{\pi})$$

where $x_i^{\pi} \in \{x_{i,1}, \ldots, x_{i,L_i}\}$ for all $1 \le i \le K$.

Written one below the other, the data records illustratively form a database table that has a column for each random variable.

It is assumed that the table has M entries. Thus, the entire database table can be written as a matrix $D$ with M rows and K columns whose π-th row is the data record $x^{\pi}$.

When using a clustering model, a so-called hidden variable (cluster variable), which is denoted by Ω, is additionally used. The cluster variable has one of the values $\omega_i$ ($i = 1, \ldots, R$) for each data record of the database table. The value of the variable Ω for a record indicates which cluster (segment) the record is associated with as part of the clustering model. In this example, therefore, there are R different clusters.

$P(\Omega \mid \Theta)$ denotes the a priori distribution of the clusters, where $P(\omega_i \mid \Theta = \theta)$ is the a priori weight of the i-th cluster. That is, $P(\omega_i \mid \Theta = \theta)$ is the probability that a (random) record of the database table belongs to the i-th cluster. The a priori distribution describes what proportion of the data is assigned to the respective clusters. The random variable Θ takes the possible parameter vectors θ of the statistical model as its values.

Let $P(X \mid \Omega = \omega_i, \Theta = \theta)$ be the conditional probability distribution within the i-th cluster, i.e. the probability distribution of the random variables $X = (X_1, \ldots, X_K)$ within the i-th cluster.

The a priori distribution $P(\Omega \mid \Theta)$ and the conditional probability distributions $P(X \mid \Omega = \omega_i, \Theta = \theta)$ (one for each cluster) together form a probability model $P(X, \Omega \mid \Theta)$ for $(X_1, \ldots, X_K, \Omega)$.

The probability model is given by the product of the a priori distribution and the conditional probability distribution, that is:

$$P(X, \Omega \mid \Theta) = P(\Omega \mid \Theta) \cdot P(X \mid \Omega, \Theta)$$

or, summed over the clusters,

$$P(X \mid \Theta) = \sum_{i=1}^{R} P(\Omega = \omega_i \mid \Theta) \cdot P(X \mid \Omega = \omega_i, \Theta),$$

this means

$$P(X = (x_1, \ldots, x_K) \mid \Theta = \theta) = \sum_{i=1}^{R} P(\Omega = \omega_i \mid \Theta = \theta) \cdot P(X = (x_1, \ldots, x_K) \mid \Omega = \omega_i, \Theta = \theta).$$

The probability $P(\Omega = \omega_i \mid \Theta = \theta)$ is called the weight of the i-th cluster (segment).

The logarithmic likelihood function L of the parameter vector θ for the data set $D$ is given by

$$L(\theta) = \log P(D \mid \Theta = \theta) = \sum_{1 \le \pi \le M} \log P(X = x^{\pi} \mid \Theta = \theta)$$

In the context of expectation-maximization (EM) learning, a sequence of parameter vectors $\theta^{(t)}$ is now constructed according to the following general rule:

$$\theta^{(t+1)} = \arg\max_{\theta} \sum_{1 \le \pi \le M} \; \sum_{1 \le i \le R} P(\omega_i \mid x^{\pi}, \theta^{(t)}) \cdot \log P(x^{\pi}, \omega_i \mid \theta)$$

By means of this iteration rule, the likelihood function is maximized step by step and a suitable parameter vector θ, which specifies the statistical model, is determined. Each iteration step consists of an E-step and an M-step. The E-step corresponds to the right-hand part of the above equation: for each of the M data records, the expected value, that is, the a posteriori probability $P(\Omega \mid X = x^{\pi}, \Theta = \theta^{(t)})$ of the cluster variable Ω, is calculated based on the current parameters, i.e. the cluster affiliation of the record is estimated. In the M-step, the new parameters are then set according to the above equation.
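The alternation of E-step and M-step can be sketched as follows. The sketch makes assumptions that are not stated in the text: within a cluster the K variables are taken to be independent, so that P(x | ω_i) is a product of per-variable tables; the initial tables are random; a fixed number of iterations replaces a convergence test; and a small constant (1e-9) keeps the tables strictly positive. All names are illustrative.

```python
import math
import random

def em_clustering(data, num_states, R, iterations=50, seed=0):
    """Fit a clustering model with R clusters to discrete records by EM.

    data: list of K-tuples of state indices; num_states[k] is L_k.
    Assumption: P(x | w_i) = prod_k P(x_k | w_i) within each cluster.
    """
    rng = random.Random(seed)
    K, M = len(num_states), len(data)
    weights = [1.0 / R] * R                      # a priori weights P(w_i)
    cond = []                                    # cond[i][k][s] = P(X_k = s | w_i)
    for _ in range(R):
        tables = []
        for L in num_states:
            raw = [rng.random() + 0.5 for _ in range(L)]
            total = sum(raw)
            tables.append([v / total for v in raw])
        cond.append(tables)

    def posterior(x):
        """A posteriori distribution P(w_i | x) under the current parameters."""
        joint = [weights[i] * math.prod(cond[i][k][x[k]] for k in range(K))
                 for i in range(R)]
        total = sum(joint)
        return [j / total for j in joint]

    for _ in range(iterations):
        # E-step: estimate the cluster affiliation of every record.
        resp = [posterior(x) for x in data]
        # M-step: re-estimate weights and tables from the soft assignments.
        for i in range(R):
            weights[i] = sum(r[i] for r in resp) / M
            for k in range(K):
                counts = [1e-9] * num_states[k]  # smoothing, an assumption
                for x, r in zip(data, resp):
                    counts[x[k]] += r[i]
                total = sum(counts)
                cond[i][k] = [c / total for c in counts]
    return weights, cond, posterior

# Hypothetical table with two clearly separated groups of records.
data = [(0, 0)] * 10 + [(1, 1)] * 10
weights, cond, posterior = em_clustering(data, num_states=[2, 2], R=2)
```

Note that the smoothing constant means a posteriori weights in this sketch never become exactly zero; an implementation relying on exact zeros would have to omit it and handle empty clusters separately.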

After learning the parameter vector θ (after the convergence of the above iteration), each record $x^{\pi}$ is assigned to a cluster (segment). The assignment takes place by means of the a posteriori distribution $P(\Omega \mid X = x^{\pi}, \Theta = \theta)$. The data record $x^{\pi}$ is assigned to the i-th cluster if the a posteriori weight of this cluster is highest, that is to say when

$$P(\omega_i \mid X = x^{\pi}, \Theta = \theta) = \max_{1 \le j \le R} P(\omega_j \mid X = x^{\pi}, \Theta = \theta).$$

The cluster membership of each record can be stored in an additional field of the record in the database table and appropriate indexes can be prepared to quickly access the data belonging to a particular cluster.

For example, if a statistical query of the form "Give all data records with $X_1 = x_{1,1}$ and $X_2 = x_{2,3}$ and the corresponding distributions over $X_3$ and $X_4$ (i.e. $P(X_3 \mid X_1 = x_{1,1}, X_2 = x_{2,3})$ and $P(X_4 \mid X_1 = x_{1,1}, X_2 = x_{2,3})$)" is put to the database table, the procedure is as follows:

First, the a posteriori distribution $P(\Omega \mid X_1 = x_{1,1}, X_2 = x_{2,3})$ is determined. This distribution shows (possibly only approximately) what proportion of the data satisfying the specified condition is to be found in which clusters of the database table. Depending on the desired accuracy, it is thus possible to restrict the evaluation to those parts (clusters) of the database table that have a high a posteriori weight according to $P(\Omega \mid X_1 = x_{1,1}, X_2 = x_{2,3})$ and thus illustratively contain a large part of the data relevant according to the condition. An ideal case is given if $P(\omega_i \mid X_1 = x_{1,1}, X_2 = x_{2,3}) = 1$ for one i and accordingly $P(\omega_j \mid X_1 = x_{1,1}, X_2 = x_{2,3}) = 0$ for all $j \ne i$, that is, when all data satisfying the condition are contained in a single cluster.

In such a case, a restriction on the i-th cluster can be made without a loss of accuracy in the further evaluation. In this case, the property of the cluster models described here is exploited that the a posteriori probability of a cluster for a selection condition is 0 only if no single data record satisfying the condition is contained in the cluster. In this respect, the models are exact.

In addition to identifying the relevant clusters, the statistical model can also be used to directly calculate certain desired probabilities (possibly approximately). For example, to determine probability distributions for $X_3$ and $X_4$, the desired distributions $P(X_3 \mid X_1 = x_{1,1}, X_2 = x_{2,3})$ and $P(X_4 \mid X_1 = x_{1,1}, X_2 = x_{2,3})$ can be determined approximately from the parameters of the model, for example according to

$$P(X_3 \mid X_1 = x_{1,1}, X_2 = x_{2,3}) = \sum_{1 \le i \le R} P(X_3 \mid \Omega = \omega_i, X_1 = x_{1,1}, X_2 = x_{2,3}, \Theta = \theta) \cdot P(\Omega = \omega_i \mid X_1 = x_{1,1}, X_2 = x_{2,3}, \Theta = \theta)$$

Alternatively, however, the statistical model can also be used only to determine the clusters relevant for the current request.
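The two uses of the model for such a query, namely determining the a posteriori distribution over the clusters and mixing the per-cluster distributions of a target variable, can be sketched as below. The sketch assumes that the variables are independent within a cluster, so that P(X_3 | ω_i, X_1, X_2) reduces to P(X_3 | ω_i); this assumption, the parameter values and all names are illustrative and not taken from the text.

```python
def cluster_posterior(weights, cond, evidence):
    """P(w_i | evidence), with evidence = {variable index: state index}.
    cond[i][k][s] is P(X_k = s | w_i); weights[i] is P(w_i)."""
    joint = []
    for w, tables in zip(weights, cond):
        p = w
        for k, state in evidence.items():
            p *= tables[k][state]
        joint.append(p)
    total = sum(joint)
    if total == 0.0:
        raise ValueError("the model assigns probability zero to the condition")
    return [p / total for p in joint]

def conditional_distribution(weights, cond, target, evidence):
    """P(X_target | evidence) as a posterior-weighted mixture over clusters."""
    post = cluster_posterior(weights, cond, evidence)
    n_states = len(cond[0][target])
    return [sum(post[i] * cond[i][target][s] for i in range(len(weights)))
            for s in range(n_states)]

# Hypothetical parameters: R = 2 clusters, K = 3 binary variables.
weights = [0.5, 0.5]
cond = [
    [[0.9, 0.1], [0.8, 0.2], [0.7, 0.3]],  # tables of the first cluster
    [[0.0, 1.0], [0.5, 0.5], [0.2, 0.8]],  # tables of the second cluster
]
evidence = {0: 0, 1: 0}                    # condition on the first two variables
post = cluster_posterior(weights, cond, evidence)
dist = conditional_distribution(weights, cond, target=2, evidence=evidence)
```

With these parameters the second cluster contains no record satisfying the condition, so the whole a posteriori weight falls on the first cluster, which corresponds to the ideal case described above.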

After restricting to the relevant clusters, more accurate methods can be used within the clusters. For example, statistics can be counted exactly within a cluster, for instance if the data were organized (and possibly compressed) according to cluster affiliation in memory or on disk, or with the aid of an additional index on the cluster affiliation. Within the clusters, simple counting methods in main memory, conventional database reporting methods or OLAP (on-line analytical processing) methods can be used, or other statistical models specially adapted to the clusters can be used. A close integration with OLAP is particularly advantageous, since the sparsity of the data in high dimensions is exploited by the statistical clustering model and OLAP methods are used only within the effectively lower-dimensional clusters.

The restriction to relevant clusters is particularly advantageous if the clusters are stored in compressed form within a database image, as explained below. In this case, the entire database image, that is, all clusters, need not be decompressed on a request.

The trade-off between speed and accuracy in the evaluation results from the amount of data excluded from the evaluation: the more clusters are excluded, the faster, but also the less accurate, the answer to a statistical request will be. The user can be given the opportunity to determine the trade-off between accuracy and speed himself. In addition, more accurate methods may be initiated automatically if the evaluation of the model yields insufficient accuracy.

In general, clusters whose a posteriori weight is below a certain minimum weight are excluded from the evaluation. Exact results can be obtained by excluding only those clusters having an a posteriori weight of zero.
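The exclusion rule can be sketched as a simple filter on the a posteriori weights. The function names and the example weights are hypothetical; `covered_mass` is added only to make the accuracy side of the trade-off visible.

```python
def select_clusters(posterior, min_weight=0.0):
    """Indices of the clusters kept for the evaluation.
    Clusters with weight zero are always dropped; with min_weight = 0.0
    nothing else is dropped, so the result stays exact."""
    return [i for i, p in enumerate(posterior) if p > 0.0 and p >= min_weight]

def covered_mass(posterior, selected):
    """Share of the relevant data contained in the selected clusters."""
    return sum(posterior[i] for i in selected)

# Hypothetical a posteriori weights of four clusters for some condition.
post = [0.70, 0.25, 0.05, 0.0]
exact = select_clusters(post)          # only the empty cluster is excluded
approx = select_clusters(post, 0.10)   # faster, but only approximate
```

The value returned by `covered_mass(post, approx)` indicates how much of the data satisfying the condition is still taken into account after the restriction.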

Overtraining of a clustering model is not a concern here, because the goal is to reproduce historical data as accurately as possible rather than to make forecasts for the future. On the contrary, strongly over-trained clustering models tend to assign queries to clusters as unambiguously as possible, so that further operations can very quickly be restricted to small parts of the database table.

On the data storage medium used, the data belonging to a cluster are advantageously stored in a manner corresponding to the cluster membership.

For example, the data associated with a cluster may be stored in a portion of the memory 104 so that the associated data may be read in blocks quickly.

As mentioned, random variables that take on continuous values can be discretized. For example, an "income" random variable, that is, a random variable that corresponds to the indication of the income of the respective customer in the customer records, is divided into income classes. The division into income classes can be finer or coarser, according to the analytical requirements, that is, according to the accuracy with which the database image is to reproduce the database table, that is to say contain the information from the database table.

For a very accurate representation of an initially continuous quantity, the variable may first be discretized into intervals. In addition to the resulting discrete variable (which is compressed as in the methods described herein), the mean of each interval may be stored and, for each discrete value, its deviation from this mean. Since only small differences then have to be stored, this can be done very memory-efficiently.
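A minimal sketch of this representation follows. It assumes that "the mean of each interval" is the mean of the values that fall into the interval (the interval midpoint would be another reading); the interval edges, the sample incomes and the function names are hypothetical.

```python
import bisect

def discretize(values, edges):
    """Split continuous values into intervals given by sorted `edges`.
    Returns the interval index per value, the mean per interval and the
    deviation of each value from the mean of its interval."""
    bins = [bisect.bisect_right(edges, v) for v in values]
    sums, counts = {}, {}
    for b, v in zip(bins, values):
        sums[b] = sums.get(b, 0.0) + v
        counts[b] = counts.get(b, 0) + 1
    means = {b: sums[b] / counts[b] for b in sums}
    deviations = [v - means[b] for b, v in zip(bins, values)]
    return bins, means, deviations

def reconstruct(bins, means, deviations):
    """Recover the original values from interval index, mean and deviation."""
    return [means[b] + d for b, d in zip(bins, deviations)]

# Hypothetical incomes and income classes (<2000, 2000-4000, >=4000).
incomes = [1200.0, 1800.0, 2500.0, 2700.0, 5200.0]
bins, means, deviations = discretize(incomes, edges=[2000.0, 4000.0])
restored = reconstruct(bins, means, deviations)
```

The interval indices in `bins` play the role of the discrete variable that is compressed; the deviations are small compared with the original values and can be stored separately.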

Occurrences of categorical variables are coded accordingly; for example, for a "gender" random variable the occurrence "male" is coded by a zero and the occurrence "female" by a one.

If a categorical random variable has a large number of occurrences in the database table, these can be grouped into classes when the database image is created, as long as the requirements for the database image allow this.

For example, the product index of the above-mentioned DIY store could be organized hierarchically; for example, the product titled "M4 screw" could be part of the product group "machine screws". The product group "machine screws" could in turn be assigned to the product group "screws", which in turn is assigned to the product group "tool accessories", "tool accessories" itself being a product subgroup of the product group "tools". According to the requirements for the database image, it might now be sufficient not to differentiate between different machine screws, but to combine them into a class "machine screws". Accordingly, for example, each transaction record in the transaction database table image 108 has the entry "machine screws" (or a value assigned to this occurrence) in the field corresponding to the product specification if the corresponding transaction record in the transaction database table 106 contains the specification of any machine screw in the field corresponding to the product specification.
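Grouping occurrences into classes along such a hierarchy can be sketched as a walk from a product towards the root until a class chosen for the database image is reached. The dictionary `PARENT` only mirrors the example above; the function name `generalize` and the set of chosen classes are hypothetical.

```python
# Child -> parent relation of the example product hierarchy.
PARENT = {
    "M4 screw": "machine screws",
    "machine screws": "screws",
    "screws": "tool accessories",
    "tool accessories": "tools",
}

def generalize(product, classes, parent=PARENT):
    """Walk up the hierarchy until a node contained in `classes` is reached.
    If no such node exists, the root of the hierarchy is returned."""
    node = product
    while node not in classes and node in parent:
        node = parent[node]
    return node

# Classes kept in the (hypothetical) database image.
image_classes = {"machine screws"}
entry = generalize("M4 screw", image_classes)
```

A coarser image would simply use a class set closer to the root, for example `{"tools"}`.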

A query to the database image can now be processed based on this categorical variable's categorization. If a more precise classification of the values of the categorical variable (for example a differentiation between different machine screws) is required to answer the request, the database table is used. In this case, however, typically only a few details have to be queried from the database table.

Illustratively, the database image can be used to provide approximate answers to statistical queries.

In one embodiment, the database image is constructed hierarchically. Illustratively, the clusters generated as described above are themselves regarded as database tables and subdivided into segments analogously to the entire database table, that is to say each data record in the i-th cluster is assigned to a j-th subcluster of a plurality of subclusters of the i-th cluster. Continuing analogously, a tree of clusters and subclusters is illustratively constructed, in which each data record of the j-th subcluster of the i-th cluster is in turn assigned to a k-th subcluster of a plurality of subclusters of the j-th subcluster of the i-th cluster, and so on.

The resulting cluster hierarchy is shown in FIG. 8.

FIG. 8 illustrates a cluster hierarchy 800 corresponding to a database image according to an embodiment of the invention.

The cluster hierarchy 800 is in the form of a tree.

The database table 801 is symbolized by the root of the tree. According to the above example, the database table has M records, each containing values of the random variables $X = (X_1, \ldots, X_K)$.

For the database table 801, a statistical clustering model is determined.

The probability distribution of the random variables $X = (X_1, \ldots, X_K)$ for all data records (according to the particular statistical clustering model) is denoted by $P(X)$. (In contrast to the above, the parameter vector θ and, accordingly, the random variable Θ are omitted here. It is assumed that the statistical clustering model is specified by a corresponding set of parameters.)

According to the statistical clustering model, the database table 801 is divided into a first plurality of $R_1$ clusters 802. The probability distribution for the data records in the i-th cluster of the first plurality of clusters 802 is given by $P(X \mid \omega_i)$. The i-th cluster of the first plurality of clusters 802 contains $N_i$ data records. The probability that a data record belongs to the i-th cluster of the first plurality of clusters 802 is $P(\omega_i)$, where $\omega_i$ is the value of the cluster variable Ω corresponding to the i-th cluster of the first plurality of clusters 802.

The clusters of the first plurality of clusters 802 are in turn clustered, so that a second plurality of clusters 803 is formed. The i-th cluster of the first plurality of clusters 802 is thereby divided into $R_{2,i}$ (sub-)clusters.

The j-th subcluster (which is one of the clusters of the second plurality of clusters 803) of the i-th cluster of the first plurality of clusters 802 is assigned the value $\omega_{i,j}$ of the cluster variable Ω.

The probability distribution for the records in the j-th subcluster of the i-th cluster of the first plurality of clusters 802 is given by $P(X \mid \omega_{i,j})$. The j-th subcluster of the i-th cluster of the first plurality of clusters 802 contains $N_{i,j}$ records. The probability that a data record belongs to the j-th subcluster of the i-th cluster of the first plurality of clusters 802 is $P(\omega_{i,j})$.

The clusters of the second plurality of clusters 803 are further subdivided into clusters analogously to the first plurality of clusters 802, so that a third plurality of clusters 804 is created, for which the quantities $P(X \mid \omega_{i,j,k})$, $P(\omega_{i,j,k})$ and $N_{i,j,k}$ are defined analogously to the above.

The records in the lowest level of the cluster hierarchy 800 are compressed and stored, for example, in the memory 104 as a database image. (In addition to the stored records, the database image contains further data, such as the parameter set of the statistical (clustering) model that was determined.)

In the following, it will be explained with reference to FIG. 9 how the records of a cluster are compressed and stored.

FIG. 9 illustrates a cluster 900 according to an embodiment of the invention.

The cluster 900 is shown in the form of a table. Each row of a plurality of N rows 901, 902 corresponds to a record contained in the cluster 900.

Each column of a plurality of K columns 903, 904 corresponds to a random variable.

The following is explained by way of example with reference to the πth row 902 and the ith column 903.

The cluster 900 corresponds to the value ω of the cluster variable Ω.

The π-th data record has, as above, the form

$$x^{\pi} = (x_1^{\pi}, \ldots, x_K^{\pi})$$

where $x_i^{\pi} \in \{x_{i,1}, \ldots, x_{i,L_i}\}$ for all $1 \le i \le K$. The values $x_{i,1}, x_{i,2}, \ldots, x_{i,L_i}$ (for all i with $1 \le i \le K$) are the possible occurrences of the random variable $X_i$, and $L_i$ is their number. A data record thus corresponds to a K-tuple of possible occurrences, wherein the K-tuple has one of the possible values of the i-th random variable $X_i$ at the i-th position.

Let the probability distribution of the random variables for the records in cluster 900, that is, the relative frequencies of the K-tuples of occurrences in cluster 900, be given by $P(X \mid \omega)$ (possibly only approximately, depending on how accurate the particular statistical model is).

As above, it is assumed that the $x_{i,1}, x_{i,2}, \ldots, x_{i,L_i}$ (for all i with $1 \le i \le K$) are discrete values. If the records of the underlying database table, that is, the database table from which the database image was generated, contain continuous values, these are discretized, so that a value $x_{i,j}$ may correspond to a discretization interval.

According to the above-described determination of a clustering model, the cluster hierarchy 800 is formed so that the data within the clusters of the cluster hierarchy 800 are more homogeneous than the entire data in the underlying database table. In particular, for every random variable there is a value (an occurrence) that is contained most frequently (or relatively frequently) in the data records of cluster 900 and thus in the plurality of rows 901, 902. This distinguished value for the i-th random variable $X_i$ (also referred to as the default value of the i-th random variable or as the representative value) is denoted by $x_i^{*}$. The default value can be calculated using the statistical model, so the occurrences contained in the data records do not have to be counted in each case in order to determine their relative frequency.

For a default value it holds, illustratively, that the conditional probability $P(X_i = x_i^{*} \mid \omega)$ is relatively high, that is, within the cluster it can be assumed that the i-th random variable has the value $x_i^{*}$.

For example, 90% of all (registered) male customers between the ages of 30 and 40 of the above-mentioned hardware store may have a call money account (to recognize this, the customer database table 105 must contain the information as to whether the customers have a call money account). For this class of customers, it can therefore be assumed with a high degree of certainty that each of them owns a call money account. If the generation of the clustering model now also shows that a cluster predominantly consists of customers of this type, for example if 85% of the customers in this cluster are male, 95% are between 30 and 40, and 92% have a call money account, then the default value "yes" is used for the call money account random variable, i.e. for the entry indicating whether the corresponding customer has a call money account ("yes" being coded, for example, by the value 1).

Thus, illustratively, the value of the cluster variable Ω for a cluster can be used to predict the data records in the cluster, in this example the value of the random variable that indicates whether the corresponding customer has a call money account.

In this embodiment, the data sets in the cluster 900 are compressed based on the basic principle that only the deviation of an occurrence of a random variable from the corresponding default value is always stored. This is done, for example, by means of runlength coding.

Clearly, information is only coded if it deviates from the expectations of the statistical model.

In the following, the column-wise runlength coding of the data records contained in the cluster 900 will be explained.

The i-th column is run-length encoded. For example, the i-th column contains the values

$\hat{x}_i, \hat{x}_i, x_{i,5}, x_{i,2}, \hat{x}_i, \hat{x}_i, \hat{x}_i, \hat{x}_i, x_{i,1}, \hat{x}_i, \hat{x}_i, \hat{x}_i, x_{i,4}$

where $\hat{x}_i$ denotes the default value of the i-th column. Here $L_i > 5$ was assumed. For example, $\hat{x}_i = x_{i,0}$ could hold.

In run-length encoding according to this embodiment of the invention, the default value $\hat{x}_i$ itself is not encoded; only the number of consecutive lines in which it occurs is encoded. Accordingly, the i-th column is encoded as

$2, x_{i,5}, 0, x_{i,2}, 4, x_{i,1}, 3, x_{i,4}$

In another embodiment, one is added to the number of consecutive lines in which the default value is contained, so that the coded column has the form

$3, x_{i,5}, 1, x_{i,2}, 5, x_{i,1}, 4, x_{i,4}$
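The two run-length variants described above can be illustrated by the following minimal sketch. It is not the implementation of the invention; the function name `rle_encode`, the `offset` parameter and the string labels ("d" for the default value, "x5" for the value with index 5, and so on) are assumptions made for the example. The handling of a trailing run of default values is likewise an assumption, since the text does not discuss it.

```python
def rle_encode(column, default, offset=0):
    """Run-length encode a column relative to its default value.

    For every non-default value, the number of default values that
    precede it (plus `offset`) is emitted, followed by the value itself.
    """
    encoded = []
    run = 0  # number of consecutive default values seen so far
    for value in column:
        if value == default:
            run += 1
        else:
            encoded.extend([run + offset, value])
            run = 0
    if run:  # assumption: a trailing run of defaults is stored as a bare count
        encoded.append(run + offset)
    return encoded


# The example column: "d" stands for the default value.
column = ["d", "d", "x5", "x2", "d", "d", "d", "d", "x1", "d", "d", "d", "x4"]
first_variant = rle_encode(column, "d")             # run lengths as counted
second_variant = rle_encode(column, "d", offset=1)  # run lengths plus one
```

With `offset=0` the sketch yields the first coded form, with `offset=1` the second.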

For quick access to the encoded column, it is not necessary to decode it. Clearly, the data can be processed directly in coded form, so that queries can be answered faster than if the compression had to be reversed for each query (which would entail a high computational effort).

The following are some examples of accessing the coded column.

For example, without decoding the coded column, it may be determined which records in the ith column contain a value other than the default value. In case of a request, the result will be delivered according to Table 1.

Likewise, it can be determined without decoding the coded column, which records in the i-th column contain the default value. If requested, the result will be shown in Table 2.

Table 2

Furthermore, without decoding the coded column, it can be determined, for example, which data records contain the value $x_{i,1}$ in the i-th column. In case of a request, the result will be delivered according to Table 3.

Table 3
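The three kinds of access described above can be sketched as follows, operating directly on a run-length-coded column without rebuilding the column. This is only an illustration under assumptions: rows are numbered from zero, the coded form is the one without the added one, and the function names are hypothetical.

```python
def non_default_rows(encoded):
    """Yield (row, value) for every stored deviation, walking the code only."""
    row = 0
    # The code alternates: run length, value, run length, value, ...
    for run, value in zip(encoded[0::2], encoded[1::2]):
        row += run   # skip the run of default values
        yield row, value
        row += 1     # the deviating value occupies one row itself


def rows_with_value(encoded, wanted):
    """Rows whose entry equals one particular non-default value."""
    return [row for row, value in non_default_rows(encoded) if value == wanted]


def default_rows(encoded, n_rows):
    """Rows that hold the default value, i.e. all rows without a stored deviation."""
    deviating = {row for row, _ in non_default_rows(encoded)}
    return [row for row in range(n_rows) if row not in deviating]


encoded = [2, "x5", 0, "x2", 4, "x1", 3, "x4"]
deviations = list(non_default_rows(encoded))
```

The cost of the first two queries depends on the number of stored deviations rather than on the number of rows, which is where the speed advantage over decoding comes from.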

In another embodiment, the cluster 900 is arithmetically coded column by column.

Arithmetic coding (see, for example, [4]) is a compression method in which a data stream is converted into a bit representation of a real interval. In doing so, a given probability distribution is used.

The probability distribution is used to calculate the probability P(next value = x) that the next value in the data stream is the value x.

In the present case, the data stream is given by the i-th column 904 (or by all columns written one after the other). The probability P(next value = x) is determined by means of the determined statistical clustering model. The compression is then performed by means of an arithmetic compressor.
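The principle of arithmetic coding can be illustrated with exact rational arithmetic. The following is a didactic sketch, not a practical compressor: a real arithmetic coder (see [4]) works with finite precision and emits bits incrementally. The fixed distribution `model` is an assumption standing in for the probabilities supplied by the statistical clustering model, and all names are hypothetical.

```python
from fractions import Fraction


def cumulative(probabilities):
    """Map each symbol to its sub-interval [low, high) of [0, 1)."""
    intervals, low = {}, Fraction(0)
    for symbol, p in probabilities.items():
        intervals[symbol] = (low, low + p)
        low += p
    return intervals


def arithmetic_encode(stream, probabilities):
    """Narrow [0, 1) once per symbol; any number in the result identifies the stream."""
    intervals = cumulative(probabilities)
    low, width = Fraction(0), Fraction(1)
    for symbol in stream:
        s_low, s_high = intervals[symbol]
        low, width = low + width * s_low, width * (s_high - s_low)
    return low, low + width


def arithmetic_decode(point, length, probabilities):
    """Recover `length` symbols from a number inside the encoded interval."""
    intervals = cumulative(probabilities)
    stream = []
    for _ in range(length):
        for symbol, (s_low, s_high) in intervals.items():
            if s_low <= point < s_high:
                stream.append(symbol)
                point = (point - s_low) / (s_high - s_low)  # rescale to [0, 1)
                break
    return stream


# Assumed model: the default value "d" is very likely, deviations are rare.
model = {"d": Fraction(9, 10), "x1": Fraction(1, 20), "x2": Fraction(1, 20)}
column = ["d", "d", "x1", "d", "d", "d", "x2", "d"]
low, high = arithmetic_encode(column, model)
decoded = arithmetic_decode(low, len(column), model)
```

The width of the final interval equals the product of the symbol probabilities, so a column that matches the model's expectations yields a wide interval and therefore needs few bits.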

In this embodiment, however, it is necessary to decode the coded column to answer inquiries (such as the above).

In another embodiment, a combination of runlength coding and arithmetic coding is used.

In a first step, the i-th column, given for example by

$\hat{x}_i, \hat{x}_i, x_{i,5}, x_{i,2}, \hat{x}_i, \hat{x}_i, \hat{x}_i, \hat{x}_i, x_{i,1}, \hat{x}_i, \hat{x}_i, \hat{x}_i, x_{i,4}$

is encoded analogously to the above as $3, x_{i,5}, 1, x_{i,2}, 5, x_{i,1}, 4, x_{i,4}$, where, as above, the values 3, 1, 5 and 4 each indicate the run length of the default value $\hat{x}_i$ plus one at the corresponding point in the data stream. Subsequently, the data stream $3, x_{i,5}, 1, x_{i,2}, 5, x_{i,1}, 4, x_{i,4}$ is further compressed by means of arithmetic coding. The probability distribution used for this is given as follows. Probabilities for the values that specify the run length are given by

$P(\text{run length} = n) = P(\text{next value in the data stream} = \hat{x}_i)^{\,n-1} \cdot \bigl(1 - P(\text{next value in the data stream} = \hat{x}_i)\bigr)$

Probabilities $P'$ for values $x_i \neq \hat{x}_i$ in the run-length-coded data stream are given by

$P'(\text{next value} = x_i) = \dfrac{P(\text{next value in the data stream} = x_i)}{1 - P(\text{next value in the data stream} = \hat{x}_i)}$
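The two distributions used for the combined coding can be written down directly: a geometric distribution for the run lengths (counted as run of default values plus one) and the model distribution renormalised over the non-default values. The following sketch only illustrates these formulas; the function names and the example model are assumptions.

```python
from fractions import Fraction


def run_length_probability(n, p_default):
    """P(run length = n) for n >= 1, where n is the run of defaults plus one."""
    return p_default ** (n - 1) * (1 - p_default)


def deviation_probabilities(model, default):
    """Renormalise the model distribution over the non-default values."""
    p_default = model[default]
    return {v: p / (1 - p_default) for v, p in model.items() if v != default}


# Assumed model for one column: "d" is the default value.
model = {"d": Fraction(9, 10), "x1": Fraction(3, 40), "x2": Fraction(1, 40)}
p_run_3 = run_length_probability(3, model["d"])
p_values = deviation_probabilities(model, "d")
```

Both distributions sum to one over their respective ranges, which is what an arithmetic coder requires of the probabilities it is given.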

However, in this embodiment as well, it is necessary to decode the coded column to answer queries (such as the above).

In another embodiment, the procedure is not column by column, but by rows. Analogous to the column-wise procedure, the above options are available (run-length coding, arithmetic coding, combination of run-length coding and arithmetic coding).

If arithmetic coding is used line by line, the compression rate can be further increased by using conditional probabilities for the probability distribution used for the arithmetic coding. If, for example, the π-th row $x^\pi = (x_1^\pi, \ldots, x_N^\pi)$ is compressed, the probability that the i-th component $x_i^\pi$ has the value $x_i$ can be taken to be the conditional probability

$P(X_i = x_i \mid X_1 = x_1^\pi, \ldots, X_{i-1} = x_{i-1}^\pi)$

which can be determined by means of the determined statistical clustering model.

In summary, the database table is clearly compressed using the determined statistical (clustering) model (provided that the space saved is greater than the space required to store the statistical model). The cluster hierarchy 800, as shown in FIG. 8, is preferably built up only to the point at which further segmentation (that is, subdivision into clusters) of the lowest level of clusters (in FIG. 8, the third plurality of clusters 804) no longer saves storage space, because the space required to store the statistical model offsets the additional compression achieved.

Regardless of which method is used to compress the cluster 900, the cluster 900 can then be compressed in a second step by means of a further compression method, for example by means of a Lempel-Ziv compression method, in order to eliminate possibly existing redundancies. Since compression of the cluster has already been achieved by means of one of the abovementioned compression methods, complex compression methods can be used in the second step without requiring unacceptable computational overhead in compression and / or decompression.

Furthermore, methods for coding sparsely populated tables (sparse coding) can be used.

The statistical compression methods and the data structures built up thereby not only have a positive effect on the size of a database image. The data structures can also readily be used to accelerate analytical queries. If, for example, a value is coded for a variable only when it deviates from the default value, then, when statistics about the different values are determined for all the data records currently selected, only corrections to a default statistic need to be made, one for each coded deviation from the default value.
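The idea of correcting a default statistic can be sketched as follows. The example counts the values of one run-length-coded column by first assuming that every record holds the default value and then applying one correction per stored deviation. It is a simplified illustration that considers all records of a cluster rather than a sub-selection; the function name and the coded form (run length, value, run length, value, ...) are assumptions.

```python
from collections import Counter


def value_counts(encoded, n_rows, default):
    """Frequency of each value in a run-length-coded column."""
    # Default statistic: every row is assumed to hold the default value.
    counts = Counter({default: n_rows})
    # One correction per coded deviation; the run lengths are not needed.
    for value in encoded[1::2]:
        counts[default] -= 1
        counts[value] += 1
    return counts


encoded = [2, "x5", 0, "x2", 4, "x1", 3, "x4"]
counts = value_counts(encoded, n_rows=13, default="d")
```

The work is proportional to the number of deviations, not to the number of records.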

The coding of the cluster 900, or of the data sets contained in the cluster, for example according to one of the embodiments explained above, makes it possible to store in the database image, for each data record contained in the cluster 900, a key by means of which the corresponding data record in the underlying database table can be found.

Each record in the underlying database table has a key associated with it. The database image of the database table contains this key for each compressed record stored as explained above.

However, a "natural key" of the segmentation may also be used as the key stored for each record in the database image. That is, the key to a record in the cluster 900 is a combination of a first key, which specifies the cluster number of the cluster 900, and a second key, which is the number of the record according to a numbering of the records contained in the cluster 900. The second key is thus, illustratively, the number of the record within the cluster 900. The cluster number of the cluster 900 may be a hierarchical cluster number formed according to the cluster hierarchy 800. For example, the subclusters of a cluster can be numbered consecutively, and the subclusters of such a subcluster can again be numbered consecutively, so that, for example, a hierarchical cluster number of the form 1/3/2 results for the cluster 900 if the cluster 900 is the second subcluster (in the third plurality of clusters 804) of the third subcluster (in the second plurality of clusters 803) of the first cluster of the first plurality of clusters 802.

The second key, which corresponds to a number of the record corresponding to a numbering of the records contained in the cluster 900, can typically be chosen to be very short (one byte or few bytes in length) because only a few records are contained in the cluster 900 due to the segmentation.

The use of this "natural key" has the advantage that storing the keys for the records in the database image requires only a small amount of memory.

The assignment of the "natural keys" to the keys used in the underlying database table (which is required to find, in the database table, the record corresponding to a record in the database image) can itself be stored in the form of a database table in the database that contains the underlying database table, and can be consulted accordingly when the database table or the database is accessed. If a plurality of database tables and corresponding database images exist, for example, according to FIG. 1, a transaction database table image 108 for a transaction database table 106 and a customer database table image 107 for a customer database table 105, keys for the respective data records are stored in the database images.
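A "natural key" composed of a hierarchical cluster number and a record number within the cluster, together with an allocation table to the keys of the underlying database table, could look as follows. This is a sketch under assumptions: the class name, the "#" separator in the textual form and the customer numbers are hypothetical.

```python
from typing import NamedTuple, Tuple


class NaturalKey(NamedTuple):
    cluster_path: Tuple[int, ...]  # hierarchical cluster number, e.g. (1, 3, 2) for 1/3/2
    record_number: int             # number of the record within that cluster

    def __str__(self):
        path = "/".join(str(part) for part in self.cluster_path)
        return f"{path}#{self.record_number}"


# Hypothetical allocation table: natural key -> key in the underlying database table.
allocation = {
    NaturalKey((1, 3, 2), 0): "C-10482",
    NaturalKey((1, 3, 2), 1): "C-00917",
}

key = NaturalKey((1, 3, 2), 1)
original_key = allocation[key]
```

Because the record number only has to be unique within its cluster, it can be stored in very few bytes.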

In the example of FIG. 1, as explained with reference to FIG. 4 and FIG. 5, upon selection of transaction records in the transaction database table image 108 (eg, FIG. 4), corresponding customer records in the customer database table image 107 are selected. This is done by means of a common key of the customer database table 105 and the transaction database table 106, for example by the customer number of a customer corresponding to a customer record or a customer involved in a transaction corresponding to a transaction record.

Upon selection of transaction records in the transaction database table image 108 (e.g., as shown in FIG. 4), the corresponding transaction records in the transaction database table 106 can be identified (for example, by means of the keys stored in the transaction database table image 108 and a corresponding allocation table). By means of the customer numbers, the correspondingly selected customer data records in the customer database table 105 can now be determined. By means of an allocation table that assigns the keys of the customer data records of the customer database table image 107 to the keys of the customer data records of the customer database table 105, the correspondingly selected customer records in the customer database table image 107 can then be determined, and the corresponding selection (for example according to FIG. 5) can be applied.

So that access to the customer database table 105 and the transaction database table 106 is not required for determining the corresponding selection of the customer records in the customer database table image 107, the transaction database table image 108 and the customer database table image 107 can themselves contain a common key (for example, customer numbers). Upon selection of transaction records in the transaction database table image 108, this key makes it possible to determine the corresponding selection of customer records in the customer database table image 107 analogously to the procedure described above.
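The transfer of a selection from one database image to the other via a common key can be sketched as follows. The miniature images, the field names and the function name are assumptions for illustration; compressed images would of course not be stored as lists of dictionaries.

```python
def propagate_selection(selected_transactions, customer_image, key="customer_no"):
    """Select the customer records whose key value occurs in the selected transactions."""
    selected_keys = {t[key] for t in selected_transactions}
    return [c for c in customer_image if c[key] in selected_keys]


# Hypothetical miniature images; both contain the common key "customer_no".
transaction_image = [
    {"customer_no": 1, "product_group": "flower bulbs", "month": 1},
    {"customer_no": 2, "product_group": "machine screws", "month": 1},
    {"customer_no": 3, "product_group": "flower bulbs", "month": 6},
    {"customer_no": 1, "product_group": "machine screws", "month": 2},
]
customer_image = [
    {"customer_no": 1, "age": 34, "gender": "m"},
    {"customer_no": 2, "age": 51, "gender": "f"},
    {"customer_no": 3, "age": 27, "gender": "f"},
]

# First selection in the transaction image: flower bulbs bought in January.
first_selection = [
    t for t in transaction_image
    if t["product_group"] == "flower bulbs" and t["month"] == 1
]
# Second selection in the customer image, determined via the common key.
second_selection = propagate_selection(first_selection, customer_image)
```

An analysis request on the customer image, for instance an age distribution, would then be evaluated on `second_selection` only, without touching the underlying database tables.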

Thus, the proposed method has the following advantages, in particular in the context of relational queries (that is, queries involving multiple database tables). The compression allows the database images to be kept in a small but fast memory (in main memory). At the same time, the database images are designed so that keys can be stored in the compressed images while still allowing (almost) random access. This allows different database images (like the originally different tables (database tables) in the relational database) to be linked via keys and thus allows relational queries to be answered. This gives a considerable speed gain for the following reasons:

• The speed of the main memory is much greater than that of large mass storage devices (hard disks).

• The database images are constructed in such a way that segmentation allows fast access to the data and fast counting.

• The main memory offers so-called random access (as opposed to hard disks), which is especially beneficial when relational queries must use keys to access specific elements in different images.

A further increase in efficiency is obtained in an embodiment in which one database image (for example, the transaction database table image 108) contains references to the data records in the other database image (e.g., the customer database table image 107).

In another embodiment, an increase in efficiency is achieved in that the two database images are not generated independently of each other, but that the grouping of data sets to clusters for generating one of the two database images takes place with regard to the other database image.

For example, the transaction database table image 108 is generated with respect to the customer database table image 107 by mapping all transaction records that correspond to the same customer record, that is, correspond to the transactions in which the same customer was involved, to the same cluster. This makes it possible, for example when selecting customer records in the customer database table image 107, to quickly access the corresponding transaction records in the transaction database table image 108, since they are all assigned to the same cluster of the transaction database table image 108. This is particularly advantageous when the clusters of the transaction database table image 108 are compressed and must be decompressed on access. In a grouping as above, therefore, only a few clusters need to be decompressed on a request.

A coordinated cluster structure can be achieved, for example, by first generating a clustering of one table (i.e., database table) as usual by means of a learning process. All the data from the second table that correspond, via the keys, to a cluster of the first table are then combined into a cluster for the second table without a learning process. In the example, the customers are first grouped into typical customer classes (i.e., a clustering of the customer database table data records is performed). The transaction records for all the transactions that belong to the customers of a customer class are then combined into a cluster for the transaction data. Accordingly, learning takes place only on the first table. The clustering of the second table depends on the clusters of the first table.
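The derivation of the clustering of the second table from the clustering of the first table can be sketched as follows. The mapping `cluster_of_customer` stands in for the result of the learning process on the customer table; it and all other names and values are assumptions for illustration.

```python
from collections import defaultdict


def cluster_second_table(cluster_of_customer, transactions, key="customer_no"):
    """Put every transaction into the cluster of the customer it belongs to."""
    clusters = defaultdict(list)
    for transaction in transactions:
        clusters[cluster_of_customer[transaction[key]]].append(transaction)
    return dict(clusters)


# Assumed result of clustering the customer table: customer number -> customer class.
cluster_of_customer = {1: "A", 2: "B", 3: "A"}

transactions = [
    {"customer_no": 1, "product_group": "flower bulbs"},
    {"customer_no": 2, "product_group": "machine screws"},
    {"customer_no": 3, "product_group": "paint"},
    {"customer_no": 1, "product_group": "paint"},
]
transaction_clusters = cluster_second_table(cluster_of_customer, transactions)
```

No learning takes place on the transaction table; all transactions of one customer necessarily end up in the same cluster.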

Advantageously, a common clustering can also be achieved through joint learning. A common clustering can be achieved, for example, through common EM steps in an EM learning process, using a common cluster variable. As described above, in an EM learning process, the cluster affiliations are first estimated (E-step). In a common EM learning process, the affiliation of, for example, a customer from a customer table to a cluster is determined not only on the basis of the customer's characteristics but also on the basis of the customer's transactions (stored in the transaction table). Conversely, for the transactions belonging to a customer, there are not different a posteriori estimates of the cluster affiliation, but a common assignment.

More concretely, the common clustering can be carried out, for example, as follows. To obtain the a posteriori estimate of the latent variable (the cluster variable) for a customer, first, as in known inference techniques (see, e.g., the inference methods using message-passing algorithms described in [10]), a message is sent to the cluster variable from each of the customer's known variables (or variable groups or cliques) from the customer table. As usual, the probability tables according to the structure of the chosen customer model are used for this. In an additional step, a message is now sent to the cluster variable from each entry in the transaction table belonging to the customer just considered, in order to take the information from the transaction table into account in the a posteriori estimate of the customer's cluster affiliation. For each transaction that belongs to a customer, the probability tables of a selected "transaction model" (a common probabilistic model for the variables from the transaction table and the latent variable) can be used. The resulting a posteriori estimate of the cluster variable can form the basis for the M-step. In the customer model, this is the usual M-step, using the jointly calculated posterior for each customer and calculating the "sufficient statistics" (see [1] and [3]) as the sum across all customers. In the transaction model, the sufficient statistics for the M-step can be calculated as the sum over all transactions of a customer with the associated posterior, and as an additional sum across all customers.
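The joint E-step for a single customer can be illustrated by the following sketch. It assumes the simplest possible model structure, in which all customer variables and all transactions are conditionally independent given the cluster variable, so that each message is just a column of a probability table. The structure of the models actually chosen may differ; the function name, table layout and numbers are assumptions.

```python
from fractions import Fraction as F


def joint_posterior(prior, customer_tables, transaction_table, customer, transactions):
    """A posteriori distribution of the cluster variable for one customer."""
    scores = {}
    for cluster, p_cluster in prior.items():
        score = p_cluster
        for variable, value in customer.items():   # messages from the customer variables
            score *= customer_tables[variable][cluster][value]
        for product in transactions:               # one message per transaction
            score *= transaction_table[cluster][product]
        scores[cluster] = score
    total = sum(scores.values())
    return {cluster: score / total for cluster, score in scores.items()}


prior = {"A": F(1, 2), "B": F(1, 2)}
customer_tables = {
    "gender": {"A": {"m": F(4, 5), "f": F(1, 5)}, "B": {"m": F(2, 5), "f": F(3, 5)}},
}
transaction_table = {
    "A": {"screws": F(7, 10), "bulbs": F(3, 10)},
    "B": {"screws": F(1, 5), "bulbs": F(4, 5)},
}

customer_only = joint_posterior(prior, customer_tables, transaction_table, {"gender": "m"}, [])
with_transactions = joint_posterior(
    prior, customer_tables, transaction_table, {"gender": "m"}, ["bulbs", "bulbs"]
)
```

In this example the customer characteristics alone favour cluster A, while the two transactions shift the estimate towards cluster B, which is the effect the joint E-step is meant to capture.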

If a database image contains keys as described above, the database image can be used as a multidimensional index for a database. This will be explained below. In particular, several database images linked to one another allow multidimensional access to a database in which conditions are placed on dimensions from different database tables.

For a database table, an index can be created for a column of the database table that makes it possible to quickly find those records of the database table for which the quantity stored in the column assumes a certain value. For example, the customer database table 105 could have a column indicating the nationality of the customers, that is, each customer record has a field that specifies the nationality of the corresponding customer. If country-specific queries of the customer database table 105 are made often, it is advantageous to group the keys of the customer records corresponding to customers of a particular nationality into an index (that is, a list). In this way, the customer records that correspond to customers of a particular nationality can be found quickly in the database table. In this manner, an index can be created for each column of the database table. However, if the database table has a large number of columns, a considerable outlay arises, which in particular leads to performance difficulties. In extreme cases, for example, it is not possible for performance reasons to generate an index for each column of the database table.

A database image can be used as a "multidimensional" index for the database table if, as explained above, keys are stored for the records in the database image that make it possible to find the corresponding records in the underlying database table. Thus, for each selection of records in the database image according to predefined properties, the corresponding data records can be found in the underlying database table without having to check the specified conditions for all data records of the database table.

This is particularly advantageous if only a small part of the data meets the selection criteria, so that only a few records have to be retrieved from the database table, whereas without the database image all records would have to be scanned to check whether they meet the selection conditions. For example, the customer database table contains, for each (registered) customer of the hardware store, a customer record that contains the customer's address in addition to the age of the customer, the customer number, the gender of the customer, and so on. In the customer database table image 107, there is a customer record for each customer which contains only a portion of this information, for example the gender and the age of the corresponding customer, but in particular not the address of the corresponding customer. At the end of a planning process, a target group could have been determined, for example all customers between the ages of 30 and 40 with a certain income who are unmarried. The customer database table image 107 can now be used as a multidimensional index for the customer database table 105 in the sense that the customer data records of the customer database table 105 that correspond to the target group can be determined quickly by means of the keys stored in the customer database table image 107. The customer database table image outputs the corresponding keys, and the keys are passed to the database. On the basis of the keys, the database can directly retrieve the addresses of the customers of the target group from the customer database table 105, without having to check the condition defining the target group against all customer data records in a costly process.
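The use of a database image as a multidimensional index can be sketched as follows: the condition is evaluated on the image only, and the database table is accessed solely by key. The dictionaries stand in for the image and the table; the field names, the income band and the placeholder addresses are assumptions.

```python
# Hypothetical image: key -> the attributes kept in the image (no address).
customer_image = {
    101: {"age": 34, "married": False, "income_band": "mid"},
    102: {"age": 37, "married": True, "income_band": "mid"},
    103: {"age": 52, "married": False, "income_band": "mid"},
    104: {"age": 31, "married": False, "income_band": "mid"},
}
# Hypothetical underlying table: key -> full record (only the address is shown).
customer_table = {key: {"address": f"address of customer {key}"} for key in customer_image}


def matching_keys(image, condition):
    """Evaluate the selection condition on the image and return the keys."""
    return [key for key, record in image.items() if condition(record)]


def fetch(table, keys, column):
    """Direct access to the table by key; no condition is checked here."""
    return [table[key][column] for key in keys]


target_group = matching_keys(
    customer_image,
    lambda r: 30 <= r["age"] <= 40 and not r["married"] and r["income_band"] == "mid",
)
addresses = fetch(customer_table, target_group, "address")
```

Any combination of the attributes held in the image can serve as the condition, which is what distinguishes this from a conventional index on a single column.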

Using the database keys of relationally linked database images, data sets (target groups) that are defined by a condition involving several database tables of a database can similarly be retrieved from the database very quickly. For example, the addresses of all customers who are between 30 and 40 years old (= condition on a field from the database table with the customer master data) and who bought flower bulbs in January (= condition on a field from the transaction table) can be determined very quickly.

As mentioned above, for a categorical random variable, the occurrences present in the database can be grouped in the database image, thus requiring less memory, in particular for the database image, since fewer different occurrences have to be encoded. For example, as explained above, all possible machine screws are combined to form a product group "machine screws". Analogously, the database image may contain discretizations of occurrences existing in the database, or different values may be combined in value ranges in the database image.

For example, the customer database table 105 contains in each customer record the month in which the corresponding customer was born, so that the age of the corresponding customer is known to the month. In order to keep the storage requirement of the customer database table image 107 low, the customer data records of the customer database table image 107 specify the age of the corresponding customer only to the year.

If a request is made to the database image that requires exact information contained only in the underlying database table, a preselection of the records is made by means of the database image. Using the keys stored in the database image, the records of the underlying database table that correspond to the preselection are determined, and the request is then answered by accessing the database table. Only the records of the database table corresponding to the preselection have to be taken into account, whereby a speed advantage is achieved.

For example, a request is made to the customer database table image 107 that refers to all customers under 17.5 years. In the customer database table image 107, according to the above example, the age of the customers is known only to the year. By means of the customer database table image 107, the request can be answered for all customers under the age of 17, since the corresponding data records can be determined unambiguously. In addition, by means of the customer database table image 107, the keys of those customer data records are determined for which the corresponding customers are between 17 and 18 years old. Using these keys, it can now be determined by accessing the customer database table 105 which of these customer records actually correspond to customers who are under 17.5 years old. Once these have been determined, the request can be answered completely.
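The preselection by means of the coarser database image, followed by a check of the borderline records in the database table, can be sketched as follows. The image is assumed to hold the age in completed years and the table the exact age in years; the function name and the values are hypothetical.

```python
def customers_under(limit_years, image, table):
    """Keys of all customers younger than `limit_years`.

    image: key -> age in completed years (coarse)
    table: key -> exact age in years (only consulted for borderline records)
    """
    whole = int(limit_years)
    certain = [k for k, age in image.items() if age < whole]      # decided by the image alone
    if limit_years == whole:
        return sorted(certain)
    candidates = [k for k, age in image.items() if age == whole]  # exact value required
    refined = [k for k in candidates if table[k] < limit_years]
    return sorted(certain + refined)


image = {1: 16, 2: 17, 3: 17, 4: 18}
table = {1: 16.9, 2: 17.25, 3: 17.75, 4: 18.1}
under_17_5 = customers_under(17.5, image, table)
```

Here only the two customers recorded as 17 in the image are looked up in the table; the others are accepted or rejected on the basis of the image alone.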

The function as a multidimensional index is particularly advantageous if several database tables are involved in the query, for example if the addresses of all customers who are under 18 years old and bought flower bulbs in January are to be queried. In the database query language SQL, such queries are called "JOIN". Such queries, which require linking multiple database tables, are often slow in databases. A list of the IDs (identifications, for example customer numbers) of such customers can, as described in detail in the preceding embodiments, be determined very efficiently by combining two suitable database images which, for example through statistical modeling, achieve a compression that makes it possible to calculate the list entirely in main memory.

In particular, a database image can clearly be used as a transparent accelerator for a database. Instead of a user using a user interface, for example, a program sends a request to the database. The request is answered quickly using the database image, as explained above, with the database being accessed only when necessary because the data in the database image is insufficient. For example, as above, the address of a customer is not stored in the database image, but only in the database table underlying the database image. This is transparent in that, for the program transmitting the request, it makes no difference whether the request is answered directly by accessing the underlying database table or by using the database image of the database table.

Thus, requests from other software are clearly received by the database image instead of the database, evaluated, and then either answered independently on the basis of the information stored in the database image (or in multiple database images), or, if certain required information is not contained in the database image, a possibly optimized request is forwarded to the database, the results are retrieved, possibly processed further, and the result is transmitted to the requesting software. Optimizations made may consist, for example, in removing selection criteria from the query and instead accessing individual records directly by means of a list of keys generated from the database image according to the selection.
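The behaviour of such a transparent accelerator can be sketched as follows. The class answers a request from the image if all requested columns are held there and otherwise forwards a request that contains only a list of keys. This is a schematic illustration and not an SQL front end; the class name, the dictionaries standing in for image and database, and the counter `table_reads` are assumptions.

```python
class ImageAccelerator:
    """Answers requests from the image where possible and forwards the rest."""

    def __init__(self, image, table):
        self.image = image    # key -> record with a subset of the columns
        self.table = table    # key -> full record (stands in for the database)
        self.table_reads = 0  # number of direct record accesses in the database

    def query(self, condition, columns):
        # Assumption: the selection condition only uses columns held in the image.
        keys = [k for k, record in self.image.items() if condition(record)]
        if all(c in self.image[k] for k in keys for c in columns):
            return [{c: self.image[k][c] for c in columns} for k in keys]
        # Forwarded request: no selection criteria, only direct access by key.
        self.table_reads += len(keys)
        return [{c: self.table[k][c] for c in columns} for k in keys]


image = {1: {"age": 34, "gender": "m"}, 2: {"age": 51, "gender": "f"}, 3: {"age": 36, "gender": "f"}}
table = {k: dict(r, address=f"address of customer {k}") for k, r in image.items()}

accelerator = ImageAccelerator(image, table)
aged_30_to_40 = lambda r: 30 <= r["age"] <= 40
genders = accelerator.query(aged_30_to_40, ["gender"])
reads_after_first = accelerator.table_reads
addresses = accelerator.query(aged_30_to_40, ["address"])
```

The calling program uses the same `query` call in both cases and cannot tell whether the database was consulted.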

In particular, the invention can accept and answer queries in the query language SQL (structured query language).

In particular, one of the interface standards JDBC (Java Database Connectivity) or ODBC (Open Database Connectivity) can be used to communicate the SQL query from the requesting software to the invention and to return the results.

In particular, the invention can be used transparently as an accelerator, i.e., in such a way that application software designed for direct access to the database can be accelerated by the invention without any modification of the application software.

This document cites the following publications:

[1] Castillo, Jose Manuel Gutierrez, Ali S. Hadi: Expert Systems and Probabilistic Network Models, Springer, New York

[2] Reimar Hofmann: "Learning the Structure of Nonlinear Dependencies with Graphical Models", Dissertation, Berlin, or David Heckermann, A tutorial on learning Bayesian networks, Technical Report MSR-TR-95-06, Microsoft Research

[3] Martin A. Tanner: "Tools for Statistical Inference," Springer, New York, 1996

[4] Moffat, A., Neal, R. M., and Witten, I. H.: "Arithmetic coding revisited", ACM Transactions on Information Systems, vol. 16, pp. 256-294, 1995

[5] WO 00/65479

[6] WO 02/101581

[7] A. Orenstein: Spatial query processing in an object oriented database system, in SIGMOD, Washington, D. C, pp. 326-236, 1986.

[8] Ramakrishnan Raghu: "Database Management Systems," McGraw-Hill, 2002

[9] Charu C. Aggarwal, Philip S. Yu: "The IGrid index: reversing the dimensionality curse for similarity indexing in high dimensional space", Proceedings of the Sixth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 119-129, ACM Press, New York, NY, USA, 2000

[10] Finn V. Jensen: "An Introduction to Bayesian Networks," Springer, 1996, chap. 4

[11] DE 102 52 445 A1

[12] US 2002/0029207 A1

LIST OF REFERENCE NUMBERS

100 computer arrangement

101 computer system

102 database system

103 microprocessor

104 memory

105 customer database

106 transaction database

107 Customer Database Image

108 transaction database image

109 Explorer computer program

110 screen

111 input devices

200 screen display

201-203 Screen with analysis results

204 Selection information field

205, 206 selection window

300 screen display

301-303 Screen with analysis results

304 Selection information field

400 screen display

401-403 Screen with analysis results

404, 405 bars

406 Selection information field

500 screen display

501-503 Screen with analysis results

504 Selection information field

600 screen display

601-603 Screen with analysis results

604 bars

700 screen

701-703 Screen with analysis results

704 mark

800 cluster hierarchy

801 database

802 plurality of clusters

803 plurality of clusters

804 plurality of clusters

900 cluster

901, 902 lines

903, 904 columns

Claims

1. Database query system with

a first database image of a first database table having a first plurality of data records and a second database image of a second database table having a second plurality of data records, wherein each record of the first plurality of records and each record of the second plurality of records is assigned a value of a database key;

an input device configured to receive an analysis request to the second database image;

a selection device, which is set up to select a part of the first plurality of data records according to a first selection;

a determination device which is set up to determine a second selection of a part of the second plurality of data records, wherein, according to the second selection, those data records are selected which are assigned values of the database key that are in each case also assigned to at least one data record which is selected according to the first selection;

a processing means arranged to determine the result of the analysis request on the basis of the part of the second plurality of data records.

2. The database query system of claim 1, wherein the first database image and/or the second database image is generated according to a statistical model.

3. The database query system of claim 2, wherein the statistical model is a graphical probability model.

4. The database query system according to one of claims 1 to 3, wherein the input device is further arranged to receive a selection instruction and the selection device is arranged to select the part of the first plurality of data records according to the selection instruction.

5. The database query system of claim 4, further comprising a display device that is configured to display a screen display having an indication of possible values of at least one random variable for which each of the first plurality of records contains a value, wherein the selection instruction is the selection of the indication of at least one possible value of the random variable, and the first selection is that all records of the first plurality of records containing the selected at least one possible value are selected.

6. The database query system according to claim 5, wherein the display device is further configured to display another screen display having an indication of the result of the analysis request, and wherein the display device is further configured to switch between the screen display and the further screen display.

7. The database query system of claim 1, further comprising an access device configured to access the second database table and to determine data contained in those records of the second database table that are selected according to the second selection, wherein the processing device is set up to determine the result of the analysis request using the data.

8. The database query system of any one of claims 1 to 7, wherein in the first database image the first plurality of data sets are grouped into a first plurality of segments and/or in the second database image the second plurality of data sets are grouped into a second plurality of segments.

9. The database query system of claim 8, wherein the value of the database key for a record of the first database image consists of a number of the segment in which the record is contained and a number of the record according to a numbering of the records of the segment.

10. The database query system of claim 8, wherein the value of the database key for a record of the second database image consists of a number of the segment in which the record is contained and a number of the record according to a numbering of the records of the segment.

11. The database query system according to claim 9 or 10, wherein for each record of the first plurality of records the value of the database key is stored in the first database table and/or for each record of the second plurality of records the value of the database key is stored in the second database table.

12. A method for computer-aided database query using a first database table with a first plurality of records and a second database table with a second plurality of records, wherein each record of the first plurality of records and each record of the second plurality of records is assigned a value of a database key, comprising the steps:

Receiving an analysis request to the second database table;

Selecting a part of the first plurality of data sets according to a first selection;

Determining a second selection of a part of the second plurality of data records, wherein, according to the second selection, those data records are selected which are assigned values of the database key which are also respectively assigned to at least one data record which is selected according to the first selection;

Determining the result of the analysis request based on the portion of the second plurality of records.