Tracking End-Users in Web Databases

Boris Rozenberg, Yaron Gonen, Ehud Gudes, Nurit Gal-Oz, and Erez Shmueli
Deutsche Telekom Laboratories at BGU,
Department of Computer Science and Department of Information System Engineering,
Ben Gurion University, Beer Sheva 84105, Israel

Abstract: When a database is accessed via a web application, users usually receive a pooled connection to the database. From the database's point of view, such a connection is always established by the same user (i.e., the web application), and specific data on the end user is not available. As a consequence, users' specific transactions cannot be audited and fine-grained access control cannot be enforced at the database level. In this paper we propose a method and a system which provide the ability to track end users in web databases. The new method can be applied to legacy web applications without requiring any changes in their existing infrastructure. Furthermore, the new user-tracking ability provides a basis for native database protection mechanisms and intrusion detection systems.

Keywords: DB users tracking; web users tracking; auditing

I. INTRODUCTION

In today's web application world, the HTTP protocol is used to connect users to application servers rather than directly to the database. This has important implications for application security. Unfortunately, HTTP, while well suited to its original purpose of serving static content, is stateless and in many cases not optimal for today's demands. Stateless means that a new connection is established for every request; the protocol itself cannot maintain the state of individual, recurring connections. The stateless aspect of HTTP was solved by application servers for client-to-application-server communication. The solution relies on browser cookies and the ability of the application server to create and maintain transparent sessions and state for each client. However, this addresses only part of the architecture: the connection from the application server to the database remains a challenge.

The application server connects to the database on the user's behalf. The way this is done makes a large difference for two critical reasons: security and performance. The ideal solution is to connect users to the database using a 1:1 mapping model, i.e., every end user has a distinct database account or connection. This is secure because the DBMS knows who is connected and can employ its security mechanisms, specifically access control and auditing.

However, in the web environment the user is almost never directly connected to the database. The user is connected to the application, and the application is connected to the database. Web applications typically support many end users, and connecting each user to a private database account or connection could exhaust memory and computing resources on the application server, the database, or both. The solution is to use a connection pool: a pool of pre-established database connections that links the application to a database schema. The application opens, and keeps open, several physical database connections. When an incoming request arrives, the application grabs a connection from the pool, issues a query, and then returns the connection to the pool. For many applications, performance is so important that connection pooling is the only viable solution. Additionally, web applications usually connect all end users to the same database account, which holds the union of all users' privileges. In this model, the DBMS cannot distinguish between the SQL statements of different users. This implies the following security limitations:

- Auditing: The user's identity is not known, and their individual actions are untraceable as well as unregulated.

- Access Control: The principle of minimal privilege is violated. It is impossible to authorize the web application user with appropriate privileges at the database level, since all application users have access to the same data. This means that roles cannot be defined, and therefore fine-grained access control to the database cannot be enforced. Ensuring that only the right privileges are available to the user is left to the application alone.

- Intrusion Detection: Not only does the database not identify the user who accesses it, it is also impossible to follow the SQL statements of the same user. Attackers can exploit this flaw to access sensitive data or use unauthorized functions. There exist attacks, such as SQL injection [26] or business logic violations [2, 3], which can be seen only at the session level (composed of multiple transactions), and thus cannot be detected at the database level.

User-tracking, which means associating the user with its corresponding database SQL statements, therefore becomes an essential task for solving the above problems.
978-1-4577-0460-4/11/$26.00 ©2011 IEEE
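The pooling pattern described above can be sketched in a few lines. The class and names below are our own illustration, not part of the paper's system; SQLite is used only to keep the sketch self-contained:

```python
import queue
import sqlite3

class ConnectionPool:
    """A minimal fixed-size pool of pre-established database connections."""

    def __init__(self, size: int, dsn: str):
        self._pool = queue.Queue(maxsize=size)
        for _ in range(size):
            # All connections are opened up front by the application,
            # under a single database account.
            self._pool.put(sqlite3.connect(dsn, check_same_thread=False))

    def acquire(self):
        # Blocks until a connection is free; the DBMS cannot tell
        # which end user the borrowed connection is serving.
        return self._pool.get()

    def release(self, conn):
        self._pool.put(conn)

pool = ConnectionPool(size=3, dsn=":memory:")
conn = pool.acquire()
conn.execute("CREATE TABLE IF NOT EXISTS t (x INTEGER)")
pool.release(conn)
```

Every request borrows whichever connection happens to be free, which is exactly why the database sees one anonymous "application" user rather than the real end users.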
Not much academic literature exists on the user-tracking problem. A proposal to solve the problem of users' identification was presented in [2], but the proposed approach, which is based on parameterized views, requires changes in both the web application and the database, making it impractical for legacy systems. In terms of commercial products, a few methods for passing users' identity from the web application to the database exist in the latest Oracle versions, but again, they require changes in legacy systems. Also, several third-party vendors, such as Imperva [24], suggest a solution to this problem. However, there is no published technical information on how they do it, nor an evaluation of their performance.
The latest versions of Oracle support several ways to pass the user identity through the web application to the database. For users known to the database, the proxy authentication in OCI, for example, can be used. OCI enables a web application to set up, within a single database connection, a number of "lightweight" user sessions, each of which uniquely identifies a connected user. The application can switch between these sessions as required to process transactions on behalf of users. However, in order to support fine-grained authorization, Oracle must maintain all the application users. Since these users are created and dropped dynamically by the web application, and the number of such users may be enormous, this solution is not practical for a typical web application. Furthermore, in order to use this method, legacy applications must be modified.
In this paper, we propose a new method and system that links each SQL statement in the database to the end user who performed it (via the web application), thus making the web application's access to databases more traceable. Furthermore, the new user-tracking ability provides a basis for fine-grained access control.
The rest of the paper is structured as follows: in Section 2
we outline the related work; in Section 3 we present the overall
system architecture as well as a detailed description of the
developed algorithms; in Section 4 we describe our
experimental evaluation process, and we conclude in Section 5
with directions for future work.
II. RELATED WORK

Most DBMS authorization mechanisms (usually implemented using roles, Grant/Revoke operations, and views [10, 11]) rely heavily on the identity of the user accessing the database. When this information is lacking, these authorization mechanisms cannot be applied, and intrusions may become easier.

Several papers have suggested methods to detect attacks on web databases, such as SQL injection [1, 3, 26]. However, all these papers assume that the problem of user-tracking is solved, which is the major goal of the present paper.

Much effort has been invested in developing methods for detecting intrusions into databases. Various IDSs for databases that also rely on user identity have been proposed in the literature [3-9].

As explained in [3], intrusions are usually identified within a logical session. A few methods for session boundary identification that rely on a known user identity have been proposed [14-17]. The session identification task involves identifying session boundaries from the log of database events. The importance of session identification is derived from the following motivations: it enables grouping all user activities within one session, which can be useful for performance optimization by utilizing the prediction of future queries; such a grouping basically defines a role that can be used to derive the actual access rights of a user; and a deviation from such a grouping may indicate an attempted intrusion, so it can be used for intrusion detection.

A pre-requisite of identifying logical sessions is identifying users. However, as explained previously, the user identity is unknown at the database level in the web environment. In spite of the importance of this problem, to the best of our knowledge, only a small number of published works have tried to address it. The work most relevant to ours, by Roichman and Gudes [2], proposed a parameterized views method that allows transferring the web user identity to the database, thus enabling the use of the database's built-in access control mechanisms. The drawback of that approach is that it requires changes in both the web application and the database.

Several third-party products that address the problem of user identification at the database level have been proposed. Imperva [24], for example, has developed multiple methods for identifying end users regardless of how these users access the database. According to the vendor, Imperva's SecureSphere product supports architectures that use connection pooling to aggregate multiple user sessions into a single database connection. SecureSphere tracks end users without requiring any changes in existing databases and web applications. However, the algorithms used by SecureSphere have not been published, and no information is supplied regarding their quality or efficiency.

III. OUR SYSTEM

We propose a system that provides the ability to monitor and track queries from the end user through the Web application to the database. The general conceptual view is depicted in Fig. 1.

Figure 1. The UT system conceptual view.

As can be seen in Fig. 1, the UT (Users Tracking) system receives as input the users-to-WebServer and WebServer-to-DB network traffic, or the WebServer and DB log files (if available). The output of the system is a mapping of users to the database SQL statements that they performed.
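For concreteness, the system's output could be represented by records such as the following; the field names are our own illustration, as the paper does not prescribe a schema:

```python
from dataclasses import dataclass
from datetime import datetime

# Illustrative record for the UT system's output: each SQL statement
# from the DB log, tagged with the end user who caused it.
@dataclass
class TaggedStatement:
    timestamp: datetime
    username: str
    sql: str

record = TaggedStatement(
    timestamp=datetime(2011, 1, 1, 12, 0, 0),
    username="alice",
    sql="SELECT * FROM trade WHERE status = 'request'",
)
print(record.username)  # alice
```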
A. The System Architecture

Fig. 2 provides a detailed view of the system. The UT system contains two layers, namely the Network Management layer and the Users Tracking layer. Following is a brief description of each layer.

Figure 2. The UT system architecture.

The Network Management layer includes the modules that handle the acquisition and management of network packets. It handles the reconstruction of TCP sessions and the parsing of the HTTP and DB protocols. The output of this layer is a full session between the Web Server and a user (HTTP requests) and a full session between the Web Server and the database (SQL statements). The implementation of this layer is a technical issue and will not be discussed further in this paper.

The Users Tracking layer includes the modules that handle the identification of Web Application users and the mapping of users' actions to SQL statements in the database. This layer is responsible for discovering the actions available in the Web Application, discovering the SQL statements possible in the system, and matching the discovered actions with SQL statements. The layer can be operated in two modes: Learning and Real-Time. In the Learning mode, the output of this layer is the set of actions existing in a certain Web Application, the set of SQL statements usable in the database, and, for each action, its mapping to a set of SQL statements in the database. In the Real-Time mode, the output of this layer is two log files: the first contains a log of users' actions in terms of Web Application actions, and the second contains a log of SQL statements with a username attached.

Two more modules are identified in Fig. 2: the Logical Sessions Identification module and the Role Mining module. Once user identification is complete, one can divide the physical sessions into logical sessions, where each logical session corresponds to a set of SQL statements that are all performed by a single role. The identification of logical sessions has two important uses: intrusion detection (when one logical session contains statements performed by different roles), and role mining (which reflects the actual privileges and actual roles of users in the system). The design of these two modules, which rely heavily on the Users Tracking module, will be reported in future papers.

As mentioned above, the system can be operated in one of two modes: Learning and Real-Time. In the Learning mode, the system automatically learns the login request/response of the protected Web Application, the Web Application functionality, and the mapping of users' requests to the actual transactions in the database. In the Real-Time mode, the system creates a log of transactions with the actual user name attached to each transaction.

Figure 3. Users Tracking – main stages.

Fig. 3 describes the main stages of the Users Tracking task, and the following subsections give a detailed description of the algorithms developed for each stage. We assume that the only information available is from log files, and that no additional semantic information is provided.

B. Web Application Actions Discovery

Each Web application accomplishes its functionality through an ordered sequence of actions that users can perform. For example, in a banking Web application, in order to transfer money to another account, the user should request a transfer form (the first action), fill in the form (the second action), and submit the form (the third action). Since users can only perform the operations available at the application interface, and no other operation is available to end users, the main question we tackle in this section is which actions are supported by an arbitrary Web application. Typically, these actions are invoked by HTTP GET or POST requests from the user's browser. Let us look at the following HTTP request:

GET http://my.dns/act.php?lname=Smit&fname=Jon

It contains the PHP script name (other script engines such as ASP may be considered) followed by the list of parameters for the script. So, if we assume that there is a different script for each of the actions in the Web application, we can harvest all the scripts and compile a list of all the actions. However, let us look at the following example:

GET http://my.dns/act.php?lname=Smit&fname=Jon&op=add
GET http://my.dns/act.php?lname=Smit&fname=Jon&op=remove

We face a problem: the script name is the same for both requests, but the action itself resides in a parameter, whose name is of course arbitrary. As can be guessed, the script act.php adds the input name to a list if the parameter op equals "add", and removes the input name from the list if op equals "remove".

Suppose, as a running example, we have the following log file (only one script, Trade.php, is considered):
1. GET http://my.dns/Trade.php?act=request&type=buy&tradeId=21&cnt=10
2. GET http://my.dns/Trade.php?tradeId=22&type=buy&act=request&cnt=5
3. GET http://my.dns/Trade.php?act=request&tradeId=23&cnt=7&type=buy
4. GET http://my.dns/Trade.php?cnt=8&act=request&type=sell&tradeId=24
5. GET http://my.dns/Trade.php?act=request&type=buy&tradeId=28&cnt=1
6. GET http://my.dns/Trade.php?tradeId=29&type=buy&act=request&cnt=2
Our approach is to define an action as the concatenation of the script name and some relevant parameters of the script. The key question is how to identify the relevant parameters.

We propose a method comprising two phases: a training phase and a pruning phase. We assume that we have a Web Server log file with all HTTP requests submitted by users to a certain Web Application during a specified period of time (it is the task of the Network Monitoring layer of our system to create such a log file). First we split this log into two parts, one for training and one for pruning. Next we run the following algorithm (Fig. 4):

1. For each PHP script rk in the Training Data (1 ≤ k ≤ NumOfPHPScripts)
2.   Let Srk = {p1, p2, ..., pn} be the set of candidate parameters to comprise the action for rk
3.   For each parameter pi in Srk (1 ≤ i ≤ n)
4.     Obtain the set Spi of different values of pi
5. For each PHP script rk in the Pruning Data
6.   For each parameter pi in rk
7.     if pi.value ∉ Spi
8.       Srk = Srk - {pi}

Figure 4. Web application actions discovery - Algorithm 1.

In line 2 of the algorithm, for each PHP script in the log file we generate the set Srk of the script's parameters. Initially, each parameter in Srk is a candidate to comprise the action for this script (together with the script name and other parameters). In lines 3-4, for each parameter in Srk we create the set of different values that this parameter receives over the training part of the log file. Next, in lines 5-7, we test on the pruning part of the log whether some parameter receives a previously unseen value. If it does, that parameter cannot be considered a candidate to comprise the action, and it is removed from the set of candidates in line 8. The assumption behind this approach is that in a typical Web application, all the actions should appear during a certain time interval (a domain-dependent parameter). For example, if we have a bank Web application log for a one-month period, it is reasonable to assume that all possible actions will appear during the first week of the log. Thus, in our approach, we could split the one-month log into two parts: one week for training and three weeks for pruning. In the training phase, we consider every distinct value of each parameter of each script as a candidate to comprise the action. If a parameter indeed comprises the action, it should not receive unseen values in the rest of the log; if it does, we remove it from the list of candidates. Finally, for each PHP script, all remaining parameters (in the set Srk), together with the script name, comprise the action. Following is an example of the algorithm's operation on the log above.

We use the first four lines for training and the last two lines for pruning (for the purpose of the example only). In general, the percentage of data used for training is application-dependent and should be provided by the system administrator. In line 2 of the algorithm we create the set STrade.php of all the parameters of the script:

STrade.php = {act, type, tradeId, cnt}

In lines 3-4 we create the set of different values for each parameter in STrade.php, as follows:

Sact = {request}
Stype = {buy, sell}
StradeId = {21, 22, 23, 24}
Scnt = {10, 5, 7, 8}
Next, in lines 6-8, for each parameter we test whether it receives new values over the pruning part of the log, by comparing the actual values with the previously obtained value sets. We find that 28 is not in the StradeId set and 1 is not in the Scnt set. At this point we reduce the set STrade.php to {act, type}. This means that an action is defined as the concatenation of "Trade.php", the value of parameter act, and the value of parameter type. Thus we have two actions in this example:

1. Trade.php+act=request+type=buy
2. Trade.php+act=request+type=sell
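The two-phase discovery procedure above (Algorithm 1) can be sketched in Python. Function and variable names are ours, and the URL parsing is simplified relative to a real Web Server log:

```python
from urllib.parse import urlparse, parse_qsl

def discover_actions(training, pruning):
    """Sketch of Algorithm 1: for each script, keep only the parameters
    whose values never leave the value set observed during training."""
    # Training phase: collect the set of observed values per (script, param).
    candidates = {}   # script path -> {param name: set of observed values}
    for url in training:
        parsed = urlparse(url)
        values = candidates.setdefault(parsed.path, {})
        for p, v in parse_qsl(parsed.query):
            values.setdefault(p, set()).add(v)
    # Pruning phase: drop any parameter that takes a previously unseen value.
    for url in pruning:
        parsed = urlparse(url)
        values = candidates.get(parsed.path, {})
        for p, v in parse_qsl(parsed.query):
            if p in values and v not in values[p]:
                del values[p]
    # The remaining parameters, with the script name, define the actions.
    return {script: sorted(values) for script, values in candidates.items()}

training = [
    "http://my.dns/Trade.php?act=request&type=buy&tradeId=21&cnt=10",
    "http://my.dns/Trade.php?tradeId=22&type=buy&act=request&cnt=5",
    "http://my.dns/Trade.php?act=request&tradeId=23&cnt=7&type=buy",
    "http://my.dns/Trade.php?cnt=8&act=request&type=sell&tradeId=24",
]
pruning = [
    "http://my.dns/Trade.php?act=request&type=buy&tradeId=28&cnt=1",
    "http://my.dns/Trade.php?tradeId=29&type=buy&act=request&cnt=2",
]
print(discover_actions(training, pruning))  # {'/Trade.php': ['act', 'type']}
```

On the running example, tradeId and cnt are pruned (values 28 and 1 were unseen), leaving act and type as the action-defining parameters, as in the text.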
C. User Name Identification

The goal of this task is to detect when a user logs in to a Web Application and what his/her ID (username) is. Since the name of the login parameter varies and is application-dependent, we can choose one of the following methods to identify the username:

Keywords - maintain a list of words that might indicate a user logging in.

Manual - every Web application that is to be analyzed must describe its login process, so that the system can identify the username field in HTTP requests.
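The keyword method might look as follows; the keyword list and function names are illustrative assumptions, not part of the paper:

```python
from urllib.parse import urlparse, parse_qsl

# Hypothetical keyword list; a deployment would tune it per application.
LOGIN_KEYWORDS = {"user", "username", "login", "uid", "email"}

def extract_username(http_request_url):
    """Return the username if this request looks like a login, else None."""
    for name, value in parse_qsl(urlparse(http_request_url).query):
        if name.lower() in LOGIN_KEYWORDS:
            return value
    return None

print(extract_username("http://my.dns/login.php?username=alice&pwd=x"))  # alice
print(extract_username("http://my.dns/Trade.php?act=request&type=buy"))  # None
```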
D. Web Users Tracking

After a user login event has been detected, the goal of the Users Tracking task is to attach the user ID to all the actions the user has performed with the Web application. This can be achieved by associating the user ID with the HTTP session ID, which is typically generated and sent from a web server to a client to identify the current interaction session. The client usually stores and sends the session ID as an HTTP cookie and/or sends it as a parameter in GET or POST queries [25]. This approach allows us to track all user activity in terms of Web application actions. Note that after this stage we have a log of web actions, with each action associated with the user ID that performed it.
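A minimal sketch of this association step, assuming session IDs have already been extracted from cookies or query parameters (all names here are our own):

```python
# Once a login is detected, remember the session-ID -> username binding
# and tag every later action that carries that session ID.
session_owner = {}          # session ID -> username

def on_login(session_id, username):
    session_owner[session_id] = username

def tag_action(session_id, action):
    """Return the action tagged with the user who owns the session."""
    return (session_owner.get(session_id, "unknown"), action)

on_login("JSESSIONID=abc123", "alice")
print(tag_action("JSESSIONID=abc123", "Trade.php+act=request+type=buy"))
# ('alice', 'Trade.php+act=request+type=buy')
```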
E. SQL Statements Identification

Our goal in this step is to find (and fingerprint) the set of SQL statements that are possible in the specific Web application, given a log of SQL statements for some time period. We employ an approach presented in [1] for the purpose of SQL injection detection: we learn the structure of each SQL statement possible in the system and fingerprint that structure. There is a large number of such possible statements, but they usually differ only in their parameter values. If we replace the parameters in each statement with variables, we get a high-level representation of the SQL statement, called the fingerprint.

In order to do so, we first use an SQL lexical analyzer [19] to create the parse tree. Since the SQL statements in the real database log do not contain "?" as parameters, but the values themselves, we need a schema description in order to differentiate schema objects from parameters. Using such a schema description, given by the user or extracted from the database, we automatically extract the parameters from the statement and create a fingerprint. Fig. 5 gives an example of an SQL statement and its corresponding fingerprint.

Figure 5. Select statement fingerprint – example.

F. Mapping from Web Actions to SQL Statements

This is a major step in the methodology. The main goal is to attach the username to each SQL query performed by the Web application on the database. So far we are able to attach the username to each action in terms of the Web application. What we need in order to complete the task is the mapping function from each Web application action to the actual SQL statements. Let us first define the problem formally:

Given a set of actions supported by the Web application, A = {a1, a2, ..., ak}, a set of SQL statements supported by the database, Q = {q1, q2, ..., ql}, a sequence of actions performed by the Web application during some specified period of time, SA = <a1', a2', ..., an'>, ai' ∈ A, and a sequence of corresponding SQL statements, Sq = <q1, q2, ..., qm>, qi ∈ Q, we should produce a function f : A → Q* that maps every action to a sequence of SQL statements. For example, f(a5) = <q1, q10, q2> means that the action a5 is mapped to the sequence <q1, q10, q2>.

The main technique we use is based on frequent sequence mining. First, using the timestamps present in the logs, we identify the SQL statements that can be mapped to each action. Note that this set is not unique, since several actions may be performed concurrently by the Web application. Then we find the subset of mapped SQL statements by using a frequent sequence mining technique. The intuition is that a frequent sequence corresponds to the actual set of mapped statements. Algorithm 2 is used to construct this mapping:

1. For each action ai' in SA
2.   Tstart = start time of ai'
3.   Tstop = stop time of ai'
4.   Let SQai be the sequence of all the queries from the DB log with Tstart ≤ QtimeStamp ≤ Tstop
5.   Add SQai to the dataset DSQai
6. For each dataset DSQai
7.   f(ai') = maxSeq[CAMLS(DSQai)]

Figure 6. Map Function Creation - Algorithm 2.

For each action ai' in SA we know its start and stop time (from the previous stage). (Note that it can be executed in parallel with other actions.) We can separate out all the queries executed on the database during this period of time; denote this sequence of queries SQai (line 4 of the algorithm). The sequence SQai contains the queries executed by ai' (and possibly some other queries). In line 5 of the algorithm, for each action ai' we create a dataset DSQai of all the sequences SQai (one sequence for each instance of the action ai' in SA), and for each such dataset we find the longest frequent sequence (not necessarily contiguous) using the CAMLS algorithm [20] (line 7 of the algorithm). Each such sequence represents the set of SQL queries that the corresponding action ai' is mapped to.

We chose CAMLS among the existing algorithms for frequent sequence mining. Given a database and a minimal support value, CAMLS efficiently generates all sequences that repeat (i.e., are frequent) in the database with a support equal to or greater than the minimal support value. In our implementation, we enumerate frequent sequences (with a very high support value, such as 100%) for each dataset DSQai separately, and take the longest sequence as the output.

This approach works well when each web action has exactly one mapping to a sequence of transactions, but some web actions have more than one mapping. For example, in our domain (see the evaluation section), if a user buys his/her first security, the system performs an insert (tID = 14) into the Holdings table; if the user already holds securities, the system performs an update (tID = 13) on the Holdings table. In both cases the user performs the same action (aID = 6) on the Web application with the same parameters. This is depicted in the log shown in Fig. 7.

Figure 7. Multiple Mapping - Example.

The mapping problem in this case is much more complex. We cannot use high support values anymore. We tried one of the bipartite graph matching algorithms [27], but the results were disappointing. Therefore we propose Algorithm 3 (Fig. 8) for solving the mapping problem in this case:

1. Set minSupport to 100%
2. Run Algorithm 2 with the current minSupport value
3. Tag the (still untagged) queries in the DBLog with the action ID, using the map function created by Algorithm 2 /* if there are multiple choices, compare the parameters of the action with the parameters of the SQL queries; if the parameters are the same, or there are no parameters, don't tag */
4. Select all untagged queries in the DBLog and create the set Ps of all possible web actions that the untagged queries can belong to (using the timestamps)
5. minSupport = minSupport - delta
6. while (Ps ≠ ∅ && minSupport ≥ minSupportThreshold)
7.   execute steps 2-5 for the new minSupport value and the action set A = Ps (only the problematic actions)
8. for each action a in the WebLog
9.   select its start and stop times
10.   if there is no other action in the interval between TStart and TStop
11.     add all transactions tagged as belonging to a to f(a)

Figure 8. Map Function Creation - Algorithm 3.

We run Algorithm 2 iteratively, each time reducing the minSupport value (starting from 100%) by a predefined small fraction (delta), and we try to tag the DBLog with the action ID using the map function obtained so far (line 3). If all the transactions have been tagged, or minSupportThreshold has been reached, we stop and create the final map function from the tagged DBLog in lines 8-11 of the algorithm.

G. DB Users Tracking

This is the last step of the algorithm, and it is mainly needed in the Run-Time phase. Given the mapping function f : A → Q* from the previous section, a tagged sequence of actions S'A = <a1^u1, a2^u2, ..., an^un>, ai ∈ A (every action is tagged with a user ID), and a sequence of SQL queries Sq = <q1, q2, ..., qm>, qi ∈ Q, we should produce a tagged sequence of SQL queries S'q = <q1^u1, q2^u2, ..., qm^um>, in which every query is tagged with the ID of the user on whose behalf it was executed.

Note that the first query in the sequence Sq belongs to the first action in the sequence S'A. We can eliminate (or mark) all the queries in Sq that belong to the first action in S'A (using f). We are then left with S'A minus the first action and Sq minus all the queries belonging to the first action, and we can continue in the same way recursively. Marking all the queries might be tricky: for example, what happens when two or more actions contain the same queries on the database? In this case, in order to know which query to eliminate, we need to examine the parameters of the query and compare them to the action's parameters in the Web application (see Fig. 7).

In Fig. 9, two examples are shown. In example (a), a single user (u1) executes the action a1, with f(a1) = <t1, t10, t2>; in example (b), two users (u1, u2) execute the same action a1 in parallel, with f(a1) = <t1, t10, t2>. The two actions a1 can match the two transactions t1 in two different ways. The exact match is determined by checking the equality of the action parameters and the SQL statement parameters.

Figure 9. DB Users Tracking - Example.
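To make the time-window step of Algorithm 2 concrete, here is a simplified sketch. A plain ordered intersection of the windows stands in for the CAMLS miner at 100% support, and all names are our own:

```python
def candidate_windows(action_instances, db_log):
    """For each instance of a web action, collect the fingerprints of all
    queries executed between its start and stop timestamps.
    db_log is a list of (timestamp, fingerprint) pairs."""
    dataset = []
    for t_start, t_stop in action_instances:
        dataset.append([q for ts, q in db_log if t_start <= ts <= t_stop])
    return dataset

def map_action(dataset):
    """Stand-in for CAMLS at 100% support: keep, in order, the queries
    that occur in every window. A real deployment would use a frequent
    sequence miner such as CAMLS instead."""
    common = set(dataset[0])
    for window in dataset[1:]:
        common &= set(window)
    return [q for q in dataset[0] if q in common]

db_log = [(1, "q1"), (2, "q7"), (3, "q10"), (4, "q2"),
          (10, "q1"), (11, "q10"), (12, "q2"), (13, "q9")]
# Two instances of the same web action, with interleaved noise queries.
instances = [(1, 4), (10, 13)]
print(map_action(candidate_windows(instances, db_log)))  # ['q1', 'q10', 'q2']
```

The noise queries q7 and q9 appear in only one window each, so they drop out, and the action is mapped to <q1, q10, q2>, mirroring the f(a5) example above.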
IV.
TABLE I.
EVALUATION
In order to evaluate our system we need a comprehensive
environment with thousands of users working in. Since such
environment was not available to us, we decided to simulate it
on a real-life scenario based on a benchmark which is
commonly used to evaluate DBMS performance. Our
simulation was divided into two phases. During the first phase
we evaluated the Network Management layer only and during
the second step we evaluated the Users Tracking Layer
separately on a simulated input.
Recall that the goal of the Network Management Layer is to
reconstruct a full session between the Web Server and a user
(HTTP requests) and a full session between the Web Server
and the database (SQL statements) from the network traffic.
We ran this module on several Web applications and recorded
its output to the log files. These logs were tested manually for
the correctness. From this point we assume that we have this
layer working correctly and move to the second phase. The
second phase simulates a comprehensive environment with
thousands of users, and logs out both the web server log and
the database log.
Our log generator simulates the operation of a web
application and of a database server, and therefore outputs two
logs accordingly. The log generator is build of two parts - the
standard TPC-E emulator, and a Java emulator, developed by
us. TPC Benchmark™ E (TPC-E) [18] is an On-Line
Transaction Processing (OLTP) workload which is commonly
used to evaluate the performance of various database systems.
It is a mixture of intensive transactions that simulate the
activities found in complex OLTP application environments.
The database schema, data population, transactions, and
implementation rules have been designed to be broadly
representative of modern OLTP systems. The TPC-E emulator,
which the log generator is based upon, models the activity of a
brokerage firm that manages customer accounts, execute
customer trade orders, and be responsible for the interactions of
customers with financial markets. It is composed of four
different roles: customers, brokers, newsmen and market. We
omit here the database design and the transaction flow of the
business model. TPC-E generates all the events in the system
and passes them to our emulator which creates a web log
record and a series of SQL statements associated with each
web-action according to Table 1. These statements are logged
in DB log records. We have generated 10 triples of logs: <Web
application log, DB log, tagged DB log> (tagged DB Log is
identical to DB log, but also contains the username attached to
each query) with the following parameters:
TABLE 1. MAPPING FROM WEB ACTIONS TO SQL QUERIES.

Role     | Web action              | SQL queries
customer | trade request           | select on security table (also join with code tables); insert to trade table with status 'request'
broker   | submit trade            | update trade status to 'submitted'
broker   | show trade requests     | select on accounts only this broker handles; select on trade table for requests (also join with code tables)
market   | show submitted requests | select for submitted or pending transactions
market   | handle trade            | select if holding exists; insert/update/delete holdings; check for pending condition (select on security table); insert/update/delete holdings, or nothing
newsman  | insert news             | select on company table; insert to news_items table
all      | read news               | select on news_items for titles; select on news_items and company tables
all      | read single news item   | select on news_items for title and item body; select on news_items and company tables
all      | query security details  | select on security table
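The emulator step that Table 1 describes can be sketched as a lookup from a web action to its associated SQL statements, emitting one web-log record and a series of DB-log records per event. The mapping below covers only a subset of the table, and the concrete SQL text (table and column names such as `co_id`) is an assumption; the paper gives only query descriptions.

```python
# Illustrative subset of Table 1: web action -> SQL statements.
ACTION_TO_SQL = {
    "trade request": [
        "SELECT * FROM security",                       # query security details
        "INSERT INTO trade (status) VALUES ('request')",
    ],
    "submit trade": [
        "UPDATE trade SET status = 'submitted' WHERE trade_id = ?",
    ],
    "read news": [
        "SELECT title FROM news_items",
        "SELECT n.title, c.name FROM news_items n JOIN company c ON n.co_id = c.co_id",
    ],
}

def emulate_event(t, session_id, action):
    """Turn one TPC-E event into a web-log record plus its DB-log records."""
    web_record = (t, session_id, action)
    # Each SQL statement is logged slightly after the web action it serves.
    db_records = [(t + 0.001 * i, sql)
                  for i, sql in enumerate(ACTION_TO_SQL[action], start=1)]
    return web_record, db_records
```

For example, a single "trade request" event yields one web-log record and two DB-log records, mirroring the first row of Table 1.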
The logs were generated with the following parameters:

  MaxNumOfUsers: 500-1000 (from the set of 4 roles)
  MaxNumOfWebActions: 16500-17500 (from the set of 16 different actions)
  MaxNumberOfSQLQueries: 19000-21600 (from the set of 14 different transactions)
  MaxNumOfPhysicalSessions: 2000-2500
  MaxNumOfConcurrentUsers: 1-10

Note that in our evaluation, the web actions and map functions are known in advance. To evaluate the web-action identification and map-function creation algorithms, we ran the corresponding modules of the system and compared their results with the expected ones. To evaluate the users tracking algorithm, we ran our User Tracking module on the pairs <Web application log, DB log> and compared its output with the tagged DB log. The results of this evaluation were very encouraging. In all runs where each web action appeared at least twice in the training set, the output was identical to the expected one. When a web action appeared only once, or not at all, in the training set, the corresponding SQL statements were not tagged with a user name (because a map function for these actions was not learned).
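The comparison against the tagged DB log can be sketched as a per-statement check of the assigned username against the ground truth, counting untagged statements separately. The function name and the use of `None` for untagged statements are illustrative assumptions.

```python
def evaluate_tracking(tracked, tagged):
    """Compare usernames assigned by the tracking module (None = untagged)
    against the ground truth from the tagged DB log, one entry per SQL
    statement, in log order."""
    assert len(tracked) == len(tagged)
    correct = sum(got == want for got, want in zip(tracked, tagged))
    untagged = sum(got is None for got in tracked)
    return correct / len(tagged), untagged

# e.g. the third statement's web action was unseen in training, so it stays untagged:
accuracy, untagged = evaluate_tracking(["alice", "bob", None],
                                       ["alice", "bob", "carol"])
```

In the runs described above, every statement whose web action appeared at least twice in training would score as correct, and the only mismatches would be untagged statements.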
V. CONCLUSION AND FUTURE WORK
In this paper we presented algorithms and a prototype system that provide users-tracking functionality for web databases. The system makes it possible to track all activities from the end user, through the web application, to the database, without requiring any changes in existing databases and web applications.
Users' tracking is only a first step towards two important functionalities we plan to add to our system: logical session identification [14-17] and mining of actual users' privileges/roles [21-23]. For both of them, user identification is a prerequisite.
REFERENCES
[1] Buehrer, T., Weide, B.W., Sivilotti, P.A.G., 2005. Using Parse Tree Validation to Prevent SQL Injection Attacks. In Proceedings of the 5th International Workshop on Software Engineering and Middleware, Portugal.
[2] Roichman, A., Gudes, E., 2007. Fine-grained Access Control to Web Databases. In Proceedings of the 12th SACMAT Symposium.
[3] Roichman, A., Gudes, E., 2008. DIWeDa - Detecting Intrusions in Web Databases. In Proceedings of the 22nd Annual IFIP Conference, U.K.
[4] Halfond, W., Orso, A., 2006. Preventing SQL Injection Attacks Using AMNESIA. In Proceedings of the 28th International Conference on Software Engineering, China.
[5] Low, W.L., Lee, S.Y., Teoh, P., 2002. DIDAFIT: Detecting Intrusions in Databases Through Fingerprinting. In Proceedings of the 4th International Conference on Enterprise Information Systems.
[6] Bertino, E., Terzi, E., Kamra, A., Vakali, A., 2005. Intrusion Detection in RBAC-administered Databases. In Proceedings of the 21st Annual Computer Security Applications Conference, USA.
[7] Chung, C., Gertz, M., Levitt, K., 1999. A Misuse Detection System for Database Systems. In Proceedings of the IFIP TC11 WG11.5 Third Working Conference.
[8] Hu, Y., Panda, B., 2004. A Data Mining Approach for Database Intrusion Detection. In Proceedings of the ACM Symposium on Applied Computing, Cyprus.
[9] Srivastava, A., Reddy, S.R., 2005. Intertransaction Data Dependency for Intrusion Detection in Database Systems. Part of Information and System Security course, School of Information Technology, IIT Kharagpur, India.
[10] Bertino, E., Samarati, P., Jajodia, S., 1997. An Extended Authorization Model for Relational Databases. IEEE Transactions on Knowledge and Data Engineering, 9(1): 85-101.
[11] Griffiths, P.P., Wade, B.W., 1976. An Authorization Mechanism for a Relational Database System. ACM Transactions on Database Systems.
[12] Gulutzan, P., Pelzer, T., 1999. SQL-99 Complete, Really: An Example-Based Reference Manual of the New Standard. R&D Books, Miller Freeman, Inc.
[13] Ramakrishnan, R., Gehrke, J., 2001. Database Management Systems, Chapter 17.1, Introduction to Database Security. Second Edition.
[14] Gupta, K.K., Nath, B., Ramamohanarao, K., 2008. User Session Modeling for Effective Application Intrusion Detection. SEC 2008: 269-284.
[15] He, D., Goker, A., 2000. Detecting Session Boundaries from Web User Logs. In 22nd Annual Colloquium on IR Research, IRSG 2000, Cambridge, UK, 57-66.
[16] Yao, Q., An, A., Huang, X., 2005. Finding and Analyzing Database User Sessions. DASFAA 2005: 851-862.
[17] Yao, Q., An, A., Huang, X., 2006. Applying Language Modeling to Session Identification from Database Trace Logs. Knowl. Inf. Syst. 10(4): 473-504.
[18] TPC-E Benchmark Web Site. http://www.tpc.org/tpce/
[19] http://zql.sourceforge.net/
[20] Gonen, Y., Gal-Oz, N., Yahalom, R., Gudes, E., 2010. CAMLS: A Constraint-based Apriori Algorithm for Mining Long Sequences. DASFAA 2010: 63-77.
[21] Molloy, I., Chen, H., Li, T., Wang, Q., Li, N., Bertino, E., Calo, S., Lobo, J., 2008. Mining Roles with Semantic Meanings. In Proc. ACM Symposium on Access Control Models and Technologies (SACMAT).
[22] Molloy, I., Li, N., Li, T., Mao, Z., Wang, Q., Lobo, J., 2009. Evaluating Role Mining Algorithms. In Proc. ACM Symposium on Access Control Models and Technologies (SACMAT).
[23] Ni, Q., Lobo, J., Calo, S.B., Rohatgi, P., Bertino, E., 2009. In Proc. ACM Symposium on Access Control Models and Technologies (SACMAT).
[24] Imperva Web Site. http://www.imperva.com
[25] http://en.wikipedia.org/wiki/Session_(computer_science)
[26] Kamra, A., Terzi, E., Bertino, E., 2008. Detecting Anomalous Access Patterns in Relational Databases. VLDB J. 17(5): 1063-1077.
[27] Cormen, T.H., Leiserson, C.E., Rivest, R.L., Stein, C., 2009. Introduction to Algorithms. The MIT Press.