Profiling Users in the UNIX OS Environment
Vu N. P. Dao 1
dao1@llnl.gov
Rao Vemuri 1,2
rvemuri@ucdavis.edu
Steven J. Templeton 2
templets@cs.ucdavis.edu
[1] Lawrence Livermore National Laboratory, 7000 East Ave., Livermore, CA 94551
[2] University of California, Davis, One Shields Ave., Davis, CA 95616
Abstract
This paper presents results obtained by using a
method of profiling a user based on the login host,
the login time, the command set, and the command
set execution time of the profiled user. It is
assumed that the user is logging onto a UNIX host
on a computer network.
The paper concentrates on two areas: shortterm and long-term profiling. In short-term
profiling the focus is on profiling the user at a
given session where user characteristics do not
change much. In long-term profiling, the duration
of observation is over a much longer period of
time. The latter is more challenging because of a
phenomenon called concept or profile drift. Profile
drift occurs when a user logs onto a host for an
extended period of time (over several sessions)
causing his profile to change.
I.
Introduction
Profiling is a technique of grouping
individuals or things into groups or categories
based on certain features such as appearance, traits,
situations, etc. The term profiling probably strikes
a negative feeling in many people because most are
aware of negative applications of profiling in news
headlines. Nevertheless there are many benefits
and useful applications in user profiling.
Following are a few examples of profiling.
Constructive examples of profiling include grocers
buying vegetable or fruit products based on color
and firmness of the produce, while destructive
examples include management not promoting
employees based on color, race and gender. Other
examples of profiling include law enforcement
officers stopping certain types of people for
questioning after a crime has occurred. In a
nutshell, profiling is a classification procedure that
groups pertinent information of an event or
situation together so that people can make better
decisions pertaining to that event or situation. In
many ways, the results obtained from profiling
proved accurate, although one can debate the legal
and ethical issues involved in many profiling
applications.
The science of profiling has been successfully
used in many important application areas – most
notably in law enforcement. A recent high-profile
case was that of identifying the ‘Unabomber’.
After going over a manifesto purportedly authored
by the ‘Unabomber’ [1], the FBI came up with a
profile. Most of the profile characteristics proved
to be correct when the ‘Unabomber’ was
apprehended. Other examples include identifying
the author of a piece of litertary work based on that
author’s usage of words, grammar, and so on.
Profiling (or, equivalently classifying) has
many applications in the realm of modern
computer and information technology. By profiling
users one can have a better understanding of the
users’ computer usage patterns. The results can
then be used to allocate system resources more
efficiently and to provide better services within a
networked (or collaborative) environment. In other
areas, an ability to infer user preferences from user
behavior patterns has many applications in
Internet-based commerce. An ability to infer user
intentions from user behavior has applications in
detecting and arresting computer-based crimes.
The long-term goal of our work is to use user
profiling as one of the ingredients in detecting
intruders into a secure networked environment.
This paper is organized as follows. A brief
review of the literature on profiling computer users
and the direction of this paper is given in section II.
Section III discusses the system resources that can
be used for profiling. To set the stage for the
experimental results presented in the other sections,
this section discusses the topology of the computer
network, and describes how the data were
collected. Section IV discusses the essential
parameters used to profile the users. Section V
covers short term user profiling, or more
specifically user profile within a few sessions on an
individual host. Section VI covers long term user
profiling, or user profiling over many sessions of a
host. This section further dwells into the drift in
user profiling known in the literature as profile or
concept drift. Section VII provides a summary of
the work presented in this paper. Section VIII
concludes the paper. This section listed the results
found and discusses future work in user profiling
especially in the area of computer security. Section
IX listed the references used in this paper, and
Appendix A listed the results.
II.
Literatures Reviewed and the
Direction of this Paper on Profiling
Computer Users
The area of profiling computer users for
detecting intrusions was first mentioned in
Denning’s paper [2] on building a model for
intrusion detection in 1987. From that time many
others elaborated to include different ways of
profiling the users. Some of these included those
of Obaidat and Sadoun [3], whose work
concentrated on identifying computer users through
the keystroke dynamics. Lane and Brodley [4]
concentrated their works in monitoring the UNIX
commands that the user typed. They introduced
“concept drift” [5] to take into account changes in
the user profile.
Warrender, Forrest, and
Pearlmutter [6] used system calls into the kernel of
an operating system to profile user usages for
intrusion detection. Profiling computer users for
applications in computer security has thrived in
recent years.
Aside from the research works mentioned
above, many others worked in the practical aspects
of intrusion detection also pointed out the need for
an accurate user profile model. Among these were
Bace [7], and Northcutt [8]; both authors talked
about ways of detecting intruders logging into their
networks through the use of user profiling. From
their experiences, both authors classified computer
break-ins into two main categories – inside and
outside intruders. Inside intruders have authorized
use of the computer network; whereas outside
intruders do not have authorization. These two
authors concluded that many applications in
computer science, especially computer security
(i.e. intrusion detection) can benefit from computer
user profile.
The direction of this paper is to build a
foundation on user profiling for future works in
intrusion detection. To build a reliable intrusion
detection system, Bass [9] suggested that
multisensors should be used. Multisensing is a
method of combining data from multiple and
diverse sensors and sources in order to make
inferences about events, activities, and situation.
Thinking along this line, this paper presents a
method of profiling a user through multiple
parameters from the process accounting log of the
system. The multiple parameters user profile
obtained here will be used as one of the many
components for our subsequent work in intrusion
detection.
III.
System Resources for User Profiling
Before any work in user profiling is done, it is
important to focus on the essential data based on
the system resources that one has, and the system
available to users. This section begins with a
description of the computer network topology in
service. Next the section discussed how the user
data was monitored and logged.
(a) Computer Network Topology
The topology of our computer network at the
University of California, Davis, Computer Security
Laboratory consists of a number of computer hosts
from a variety of manufacturers. All hosts run one
of four operating systems: Sun Solaris, Sun OS,
Free BSD, and LINUX. Both the Sun Solaris and
Sun OS run on the workstations, while the Free
BSD and LINUX run on the PCs. There are a few
Macintosh personal computers connected to the lab
network but we limited our work to the UNIX
workstations and PCs only.
Since all the workstations are connected
together, most run the same software applications.
Nevertheless, a few workstations were dedicated to
run specially licensed software applications to save
cost on network licensing. On the other hand, the
PCs have separately installed software applications
running on them. The PCs share one commonality
with the workstations – the UNIX operating
system.
Any user with a valid account can log onto any
workstation or host within the computer network
through two means. The first login means is to log
in from any physical host in the lab. The second
method of accessing the lab network is through dial
up modems. Once connected to any host within
the lab, the user has the capability to remotely log
on to other computers within the laboratory
network.
(b) Data Monitoring and Logging
In the UNIX operating system, there is a
process accounting program that is running in the
background [10,11] of the operating system. The
process accounting program keeps tab of the
computer resources. Some of these computer
resources are keyboard input, time of keyboard
usage, CPU usage, memory usage, cache memory
usage, buffer usage, etc.
By invoking the process accounting program
to run [12] and piping the output to a logfile, we
were able to collect the inputs each user typed on
the keyboard.
IV.
Profiling Users Through Their Most
Important Characteristics
One of the key difficulties in user profiling in
the context of networked environment is that there
are many different kinds of users and these can be
categorized using a vast number of variables. Some
of the variables, such as gender, physical and
intellectual capabilities, and communication skills
do not change at all or change slowly. Some
variables such as stress, fatigue, computer-related
experience, skills at typing at the keyboard,
preferences for certain types of information,
propensity to use certain commands, may show
some drift with time and experience. User profiles
can also be developed on the basis of interaction
features a user prefers (menu-based interaction,
command line interaction, via function keys, etc).
Researchers have also used machine learning
techniques to track user actions and construct
models for user preference [13].
Each user is a unique individual with a unique
set of characteristics. When faced with the same
problem or situation, each individual has a unique
perspective of solving or looking at that situation.
The hypothesis is that these individual behavioral
characteristics can be extracted from the log files
of each user.
Our
approach
depends
on
learning
characteristic sequences of actions generated by
users. The underlying hypothesis is that a user
responds in a similar manner to similar situations,
leading to repeated sequences of actions. Indeed,
the existence of command alias mechanisms in
many UNIX command interpreters supports the
idea that users tend to perform many repeated sets
of actions, and that these sequences differ on a peruser basis. It is the differences in characteristic
sequences that we attempt to use to profile users in
order to differentiate users.
With the above ideas in mind, we chose four
parameters in the captured logfile to profile the
users. Although these four parameters are not an
exhaustive list of parameters in the logfile, we
found empirically that they contain more than the
adequate information we need to profile the user
on. These four user profile features are the login
host, the login time, the command set, and the
command execution time. We will go in detail on
each of these features below:
(a) Login Host
In our network, as well as most UNIX
operating system networks, the users have the
freedom to connect to any host on the network.
Many times, the users want to keep separate
applications running on separate hosts because of
software license agreements or have a preferred
host that they want to log on to (e.g. the user’s
personal workstation or assigned host).
On
different occasions, users might want to keep the
work related to certain applications or projects
specific to a host on the network. Thus, it is
important to keep track of the host that a user is
logged on because the same user can have a
different user characteristic from host to host.
(b) Login Time
Most users tend to have a preferred time
window to do their work. For instance, a nocturnal
person whose normal computer login time is
between 10:00PM to 2:00AM is unlikely to log
onto the network at 8:00AM in the morning except
in some unexpected situations. Likewise, a user
whose regular schedule is from 10:00 AM to
6:00PM rarely logs in at night, say from 12:00AM
to 6:00AM. Thus the login time is considered to be
a useful profiling parameter.
(c) Command Sets
The command set is perhaps the most
important parameter to profile a user on. It is a
more distinguished trait that makes a user unique.
However the UNIX command set available to users
is large. To profile a user based on all commands
in the UNIX command set is an overwhelming
task. Yet, almost all UNIX users utilize a portion
of the available command set.
To simplify our task, we went over all the
logfiles and selected a set of 100 UNIX commands
that all users most frequently used. The result, in
the order of the frequency of their usage, is listed in
Table 1 (i. e. the command used most frequently is
listed first). Table 1 is used as a secondary
command set to profile a user.
The primary command set to profile a user is
his owns. In determining a user’s command set,
the following rule applies. If the number of
commands in the user’s repertoire of commands is
greater than 100, then only the most used 100
commands in his log file are used. However, if the
user command set is N, (i.e., N < 100) then all of
the N commands will be used. In addition (100 N) top commands in Table 1 will be used in that
user’s command set. The conditions below re-state
the logic discussed.
Given:
CS = Commands in the user’s
Command Set
N = The number of command
if (CS > 100)
CS = 100 most used commands
else if (CS < 100)
CS = N + [(100-N) Table 1 commands]
end
1 sh
2 stty
3 sed
4 mail
5 [
6 dtfile
7 frm
8 in.telne
9 gen.pl
10 groff
11 date
12 sendmail
13 hostname
14 uudemon.
15 tty
16 tetex.cr
17 perl
18 emacs
19 dot
20 grotty
21 row
22 tcsh
23 cat
24 su
25 test
26 more
27 utmp_upd
28 whoami
29 top
30 troff
31 column
32 grids
33 updatedb
34 find
35 ps
36 frcode
37 lpNet
38 mkdir
39 w
40 file
41 dtexec
42 chmod
43 logger
44 bash
45 pt_chmod
46 logrotat
47 xterm
48 rdate
49 gzip
50 xhost
51 dtscreen
52 rm
53 vi
54 less
55 grep
56 tmpwatch
57 rlogin
58 tr
59 ln
60 top-sun4
61 msgchk
62 gabriel_
63 in.rshd
64 sort
65 amd
66 finger
67 telnet
68 tput
69 resize
70 gtbl
71 rsh
72 mv
73 id
74 clear
75 crond
76 uuxqt
77 quota
78 pwd
79 domainna
80 mesg
81 uname
82 ptbl
83 cp
84 id.pl
85 run-part
86 uusched
87 ping
88 df
89 xlock
90 lpr
91 awk
92 ls
93 login
94 chown
95 atrun
96 man
97 movemail
98 gunzip
99 last
100 in.rlogi
Table 1: 100 Frequently Used Command Set by all
users
(d) Command Execution Time
The final parameter that was monitored for
user profiling is the execution time of each
command.
The command execution time
parameter tracks how much time a command is
required to run after a user hits the return key. In
UNIX, any user can modify a command or creating
an alias command to do a series of tasks. For
instance a directory listing is typically defined as
‘ls’, however the same command can be used to list
files in the current directory with different options,
as in ‘ls – la’. ‘ls – la’ would do a long listing of
all files in the current directory with a complete
listing of when the files were created, and their
size, etc. Furthermore, any experienced UNIX user
can use the same command to do other tasks such
as deleting that directory, (i.e. define ‘ls’ to do ‘rm’
of the directory or the hard drive). The latter one is
known as a trojan command – i.e. the command is
defined for doing a specific task [14] other than
intended. Most trojan commands are malicious in
nature.
To prevent unexpected execution time of
commands outside of their scopes, the tracking of
the execution of these commands would isolate
those commands that took more CPU cycle time to
run than normal.
V.
Host User Profile
Using the logfiles referred in section III, we
then proceeded on profiling the user according to
the four features – the login host, the login time,
the command set, and the command execution
time.
As mentioned in subsection IVa, some hosts
on our computer network have different
applications running on them. We decided to
profile the users on each host individually. In the
short-term profiling case, several steps were
involved. The first step was to parse the data into
each host that the profiled user had logged on.
Here we selected the command set according to the
rules presented in the command set subsection
(subsection IVc). In the second step, we divided
the commands into a one-week period and counted
the number of occurrence of each command in that
one-week interval. In the third step, we determined
when the users logged onto the lab network by the
login time. Finally we looked at the command
execution time to see if any alias (i.e. trojan)
commands were run.
From the above four steps, we generated a
command set and a login time for each user. For
illustration purposes, we presented eight different
plots from two different users that we profiled on
in Appendix A. The command execution time
profile did not vary much (i.e. no trojan commands
were detected), thus was not plotted. However, it
is necessary to keep track of the time each
command was run.
Figures 1a and 1b illustrate the command set
and login time of User 1 on Host A, while figures
1c and 1d illustrate the same profiled User 1 on
Host B. Figures 2 (a & b) and figures 2 (c & d)
represent the User 2 on Host C and D respectively.
In figures 1a, 1c, 2a, and 2b the X-axis indicates
the command set of the profiled user; the Y-axis
indicates the profiled user’s time login in a oneweek increments.
The Z-axis indicates the
percentage of occurrence of each command.
Similarly, in figures 1b, 1d, 2b, and 2d the X-axis
indicates the time of log in (starting from 12:00AM
and ending at 11:00PM). The Y-axis represents the
weekly increment in time, and the Z-axis
represents the percentage of usage in a one-hour
duration when the user is logged in.
Figures 1a and 1c are the command plots of
the same user (User 1) on host A and B. From
observation, one can see that these two plots do not
exhibit any similar pattern. Likewise figures 1b
and 1d are the time plot of the same user on host A
and B do not show any similar correlation. After
inspecting the plots for User 1 on other hosts, we
can see some similarity of the command and or
time plots to either figures 1a and 1b or figures 1c
and 1d.
In a similar comparison, the profile of User 2
(figures 2a and 2b, and 2c and 2d) exhibits some
similarity in the same command set and or time
usage pattern across the two hosts (Hosts A & B).
The results of Users (1 & 2) above are two
cases of the 28 users that we profiled in our
network. Of all of the users we had profiled in a
short-term case, we learned that a user profile is a
function of the host. The same user can have
different or similar profiles on different hosts. This
can be attributed to the fact that a user used a
computer host for a specific application or need. In
other words, if the same user is using the different
applications on different hosts, then his profiles on
these hosts will be the same. However if the same
user is using different applications on different
hosts, then his profiles on these hosts will be
different.
VI.
Host User Profile with Concept or
Profile Drift
When a person works in the same environment
for an extended period of time, it is highly likely
that he will adapt and change his style to fit his
environment. This is because user is more familiar
with the system, or that he has discovered a better
way of doing thing [5]. The same can be said of
doing one’s work. The nature of the job might
change over time. As a result, a user modifies his
command set to fit his new situation. These
changes in user’s behavior correspond to the longterm profiling. The change in work habit, work
environment or application in one’s account is
known in the literature as concept or profile drift.
Two situations contribute to profile drift –
natural profile drift and forced profile drift.
Natural profile drift occurs when a user slowly
adapts to his environment through learning or
experiences. This happens when the user learned a
new command or a new way of doing the same
thing. For instance, after working on a project for
a while, most workers will become more
experienced and perform their job better. Natural
profile drift is gradual and constantly occurring in
the user profile. On the other hand, forced profile
drift occurs when the user abruptly changes his
profile. This usually happens when the user
changes one or more of the followings – work
environment, responsibility or job. The user is
forced to change in order to accommodate his new
role. Usually when the forced profile drift occurs,
the user will subsequently undergo a natural
progression drift once he is comfortable in his new
environment.
Profile drift is evident in the figures 1 (a,b,c,d)
and 2 (a,b,c,d). In the natural profile drift, the
command figures are more linear while the time
figures stay constant. In the second case of forced
profile drift, both the command and time figures
have abrupt changes. After that more gradual
changes would occur in the command and time
plots.
VII.
Summary
We used the built-in process accounting log of
the UNIX operating systems on the hosts to log the
users’ usage.
From the users’ logfiles we
determined the four most essential parameters to
profile them on. The four profile parameters were
the login host, the login time, the command set,
and the command execution time. The login host
and time correspond to the host and the time that
the profiled user logs onto the network. The
command set is the 100 most frequently used
commands that the profiled user uses. If the
profiled user uses less than 100 commands, then
we appended his command set with the most
frequently used commands in Table 1. We found
no noticeably different in the user’s command
execution time and decided to include this feature
in our future work in intrusion detection.
The login host, login time and command set
were adequate in profiling users in both short term
and long term profiling sessions. Moreover, in
short term profiling, we found that the user profile
is dependent on the host that he logs on. In long
term profiling there also exists profile drift. Thus
in the long term profiling case a user profile is a
function of the host and is a function of time.
VIII.
Conclusion and Future Work
This paper has demonstrated that the host, the
login time, and the UNIX command set can be
used to profile a user with a high degree of
accuracy. Two important points were learned.
First, the user profile is host dependent. The same
user could have different profiles on different
hosts. This is due to the fact that user profiling is a
function of the applications residing on the host.
The second point was that the profile drift occurred
over time. Profile drift occurs in two ways –
natural profile drift and forced profile drift. Both
are due to the fact that users will change their
profile to fit their environment.
Further work in this area can be a monitoring
of different system parameters such as memory
usage, page fault usage, buffer over, etc. Perhaps
an entirely new process accounting system to track
the desired parameters for user profiling is also
possible. Also the results obtained in this paper
were in the form of figures and observations. The
conclusions we made were by observing those
figures in the appendix. It is important to come up
with a quantitative measurement of these results.
This can be accomplished if the users’ profiles are
used in actual applications such as intrusion
detection.
Furthermore, as we have been making the
connection of the work presented here to that of
intrusion detection throughout this paper, it is
important to point in that direction for future
research. The work in this paper represents one
foundation in a multisensing system to be used in
detecting intruders logging onto a computer
network.
IX.
References
[1] Unabomber, “Industrial Society and Its
Future”, The Washington Post, Sept. 19, 1995.
[2] D. Denning, “An Intrusion Detection Model”,
IEEE Transactions on Software Engineering,
1987.
[3] M. Obaidat, B. Sadoun, “Verification of
Computer Users Using Keystroke Dynamics,”
IEEE Trans. On Systems, Man and
Cybernetics, Vol. 27, No. 2, pp. 261-269, Apr.
1997.
[4] T. Lane, C. Brodley, “An Application of
Machine Learning to Anomaly Detection”,
http://mow.ecn.purdue.edu/~terran/facts/resea
rch/research.html, 1997.
[5] T. Lane, C. Brodley, “Temporal Sequence
Learning and Data Reduction for Anomaly
Detection”,
http://mow.ecn.purdue.edu/~terran/facts/resea
rch/research.html, 1998.
[6] C. Warrender, S. Forrest, B. Pearlmutter,
“Detecting Intrusions Using System Calls:
Alternative Data Models”, IEEE, Nov. 1999.
[7] R. Bace, Intrusion Detection, Macmillan
Technical Publishing, 2000.
[8] S. Northcutt, Network Intrusion Detection —
An Analyst’s Handbook, New Riders
Publishing, 1999.
[9] T. Bass, “Intrusion Detection Systems and
Multisensor Data Fusion”, Communications of
the ACM, Vol. 43, No. 4, Apr. 2000.
[10] S. Garfinkel, G. Spafford, Practical UNIX &
Internet Security, O’Reilly, 1996.
[11] S. Coffin, UNIX System V Release IV: The
Complete Reference, McGraw Hill, 1990.
[12] E. Nemeth, G. Snyder, S. Seebass, T. Hein,
UNIX System Administration Handbook, 2nd
ed. Prentice Hall PTR, 1995
[13] P. Maes, “Agents that Reduce Work and
Information Overload”, Communications of
the ACM, Vol. 37, July 94.
[14] L. Klander, Hacker Proof – The Ultimate
Guide To Network Security, Jamsa Press,
1997.
Appendix A – Results
Figure 1a: Command Plot of User 1 – Host A
X axis – Command Set
Y axis – Weekly Increment in Time
Z axis -- % of Command Usage
Figure 1c: Command Plot of user 1 – Host B
X axis – Command Set
Y axis – Weekly Increment in Time
Z axis -- % of Command Usage
Figure 1b: Time Plot of User 1 – Host A
X axis – Time Login [12:00AM - 11:00PM]
Y axis – Weekly Increment in Time
Z axis -- % of Command Usage
Figure 1d: Time Plot of User 1 – Host B
X axis – Time Login [12:00AM - 11:00PM]
Y axis – Weekly Increment in Time
Z axis -- % of Command Usage
Figure 2a: Command Plot of User 2 – Host C
X axis – Command Set
Y axis – Weekly Increment in Time
Z axis -- % of Command Usage
Figure 2c: Command Plot of User 2 – Host D
X axis – Command Set
Y axis – Weekly Increment in Time
Z axis -- % of Command Usage
Figure 2b: Time Plot of User 2 – Host C
X axis – Time Login [12:00AM - 11:00PM]
Y axis – Weekly Increment in Time
Z axis -- % of Command Usage
Figure 2d: Time Plot of User 2 – Host D
X axis – Time Login [12:00AM - 11:00PM]
Y axis – Weekly Increment in Time
Z axis -- % of Command Usage