DAT Manual PDF
Part I Semester I
2013 – 2014
M.Sc in Information Technology Part I
Practical Problems
Prepared and Implemented by
&
Compiled By
INDEX
Practical 1
Practical 2
Practical 3
Practical 4
Practical 5
Practical 6
Practical 7
Practical 8
Practical 9
Practical 10
References
List of Practicals
3. Graph Plotting
a. Gnu plot for plotting vectors 1
b. Gnu plot for plotting vectors 2
c. Gnu plot for plotting vectors 3
Continuous distributions
a) Normal distribution
b) Lognormal distribution
c) Gamma distribution
d) Exponential distribution
e) Beta distribution
Installation of Cygwin
8) Extract the apophenia library and change into its directory by typing:
tar xvzf apophenia-0.99-09_Jul_13.tgz
cd apophenia-0.99
9) Configure the library by typing:
./configure
To test:
1. Once the Cygwin installation is complete, verify it by compiling and running
a test program.
2. Save the test program as "abc.c".
3. Run the following commands in bash:
gcc -std=gnu99 abc.c -o abc.out -lapophenia -lgsl -lsqlite3
./abc.out
Ubuntu Installation as per the free download.
3) Open the terminal. It opens in the current directory. Extract the package
sqlite-autoconf-3071700.tar.gz.
4) Extraction creates a folder named sqlite-autoconf-3071700 in the current
directory.
jayesh@jayesh-G31M-S2L:~$ cd sqlite-autoconf-3071700
jayesh@jayesh-G31M-S2L:~/sqlite-autoconf-3071700$
jayesh@jayesh-G31M-S2L:~/sqlite-autoconf-3071700$ ./configure
It will ask for the password; type the password and press the Enter key.
9) jayesh@jayesh-G31M-S2L:~/sqlite-autoconf-3071700$ sudo ldconfig
3) Open the terminal. It opens in the current directory. Extract the package
gsl-1.16.tar.gz.
4) Extraction creates a folder named gsl-1.16 in the current directory.
jayesh@jayesh-G31M-S2L:~$ cd gsl-1.16
jayesh@jayesh-G31M-S2L:~/gsl-1.16$
jayesh@jayesh-G31M-S2L:~/gsl-1.16$ ./configure
It will ask for the password; type the password and press the Enter key.
8) After make has finished, gsl needs to be installed.
9) jayesh@jayesh-G31M-S2L:~/gsl-1.16$ sudo ldconfig
3) Open the terminal. It opens in the current directory. Extract the package
apophenia-0.99.tar.gz.
4) Extraction creates a folder named apophenia-0.99 in the current directory.
jayesh@jayesh-G31M-S2L:~$ cd apophenia-0.99
jayesh@jayesh-G31M-S2L:~/apophenia-0.99$
jayesh@jayesh-G31M-S2L:~/apophenia-0.99$ ./configure
It will ask for the password; type the password and press the Enter key.
Practical No.1 - SQL queries based on Unit I
sqlite> .databases
seq name file
--- --------------- ----------------------------------------------------------
0 main /home/jayesh/testDB.db
sqlite>
Problem statement :
To execute SQL queries in order to store and retrieve the data under study in a
database. SQLite is used for executing the queries.
i) Queries for performing DDL commands.
DDL commands are used to create, modify and delete database
objects. The data is stored in an RDBMS in the form of tables.
The following queries perform DDL commands in
SQLite:
sqlite> CREATE TABLE DEPARTMENT(
ID INT PRIMARY KEY NOT NULL,
DEPT CHAR(50) NOT NULL,
EMP_ID INT NOT NULL
);
You can verify that your table has been created successfully using the SQLite
.tables command:
sqlite> .tables
COMPANY DEPARTMENT
ii) Inserting values into the COMPANY and DEPARTMENT tables
INSERT INTO DEPARTMENT (ID, DEPT, EMP_ID)
VALUES (3, 'Finance', 7 );
iii) The SELECT clause is a data manipulation command used for retrieving
data in the desired format from database objects. The various forms of
the SELECT clause and their purposes are given below:
a) List all the records where AGE is greater than or equal to
25 AND SALARY is greater than or equal to 65000.00:
sqlite> SELECT * FROM COMPANY WHERE AGE >= 25 AND SALARY >=
65000;
a) List all the records where AGE is greater than or equal to
25 OR SALARY is greater than or equal to 65000.00:
sqlite> SELECT * FROM COMPANY WHERE AGE >= 25 OR SALARY >=
65000;
List all the records where AGE is not NULL, which means all the records,
because none of the records has AGE equal to NULL:
sqlite> SELECT * FROM COMPANY WHERE AGE IS NOT NULL;
List all the records where NAME starts with 'Ki', regardless of what comes
after 'Ki':
sqlite> SELECT * FROM COMPANY WHERE NAME LIKE 'Ki%';
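List all the records where the AGE value is either 25 or 27:
sqlite> SELECT * FROM COMPANY WHERE AGE IN (25, 27);
List all the records where the AGE value is neither 25 nor 27:
sqlite> SELECT * FROM COMPANY WHERE AGE NOT IN (25, 27);
List all the records where the AGE value is between 25 and 27:
sqlite> SELECT * FROM COMPANY WHERE AGE BETWEEN 25 AND 27;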
Find all the records with the AGE field having SALARY > 65000
Find the total amount of salary for each name
b) Order by Clause
SELECT NAME, SUM(SALARY) FROM COMPANY GROUP BY NAME ORDER
BY NAME;
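c) The following example displays the records for which the name count is
less than 2:
sqlite> SELECT * FROM COMPANY GROUP BY NAME HAVING COUNT(NAME) < 2;
d) The following example sorts the result in ascending order by SALARY:
sqlite> SELECT * FROM COMPANY ORDER BY SALARY ASC;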
sqlite> SELECT * FROM COMPANY LIMIT 3 OFFSET 2;
g) Joins
sqlite> SELECT EMP_ID, NAME, DEPT FROM COMPANY INNER JOIN
DEPARTMENT
ON COMPANY.ID = DEPARTMENT.EMP_ID;
sqlite> SELECT EMP_ID, NAME, DEPT FROM COMPANY LEFT OUTER JOIN
DEPARTMENT
ON COMPANY.ID = DEPARTMENT.EMP_ID;
Practical 2
i) Multiplication Table
#include <apop.h>
int main(){
    gsl_matrix *m = gsl_matrix_alloc(20,15);
    gsl_matrix_set_all(m, 1);
    /* scale row i by i+1 ... */
    for (int i=0; i< m->size1; i++){
        Apop_matrix_row(m, i, one_row);
        gsl_vector_scale(one_row, i+1);
    }
    /* ... then scale column i by i+1, so that m[i][j] = (i+1)*(j+1) */
    for (int i=0; i< m->size2; i++){
        Apop_matrix_col(m, i, one_col);
        gsl_vector_scale(one_col, i+1);
    }
    apop_matrix_show(m);
    gsl_matrix_free(m);
}
jayesh@jayesh-G31M-S2L:~$ ./multiplicationtable.out
ii) The function calc_taxes takes a double indicating taxable income and
returns the US income taxes owed, assuming a head of household with two
dependents taking the standard deduction.
#include <apop.h>

/* calc_taxes() maps taxable income to tax owed; its definition is not
   reproduced in this listing. */
double calc_taxes(double income);

int main(){
    apop_db_open("data-census.db");
    strncpy(apop_opts.db_name_column, "geo_name", 100);
    apop_data *d = apop_query_to_data("select geo_name, "
            "Household_median_in as income "
            "from income where sumlevel = '040' "
            "order by household_median_in desc");
    Apop_col_t(d, "income", income_vector);
    d->vector = apop_vector_map(income_vector, calc_taxes);
    apop_name_add(d->names, "tax owed", 'v');
    apop_data_show(d);
}
jayesh@jayesh-G31M-S2L:~$ ./taxes.out
Practical 3
Plotting a vector
#include <apop.h>
/* plot_matrix_now() is a helper that is not defined in this listing */
int main(){
    apop_db_open("data-climate.db");
    plot_matrix_now(apop_query_to_matrix("select (year*12+month)/12., temp "
                                         "from temp"));
}
Eigen vector
#include "eigenbox.h"
apop_data *query_data(){
apop_db_open("data-census.db");
return apop_query_to_data(" select postcode as row_names, "
" m_per_100_f, population/1e6 as population, median_age "
" from geography, income,demos,postcodes "
" where income.sumlevel= '040' "
" and geography.geo_id = demos.geo_id "
" and income.geo_name = postcodes.state "
" and geography.geo_id = income.geo_id ");
}
int main(){
apop_plot_lattice(query_data(), "out");
}
Query out the month, average, and variance, and plot the data using errorbars.
The program prints to stdout, so pipe the output through Gnuplot.
#include <apop.h>
int main(){
    apop_db_open("data-climate.db");
    apop_data *d = apop_query_to_data("select \
        (yearmonth/100. - round(yearmonth/100.))*100 as month, \
        avg(tmp), stddev(tmp) \
        from precip group by month");
    printf("set xrange[0:13]; plot '-' with errorbars\n");
    apop_matrix_show(d->matrix);
}
Practical 4
Discrete distributions
1. Bernoulli distribution
2. Binomial distribution
3. Poisson distribution
4. Multinomial distribution
5. Hypergeometric distribution
Continuous distributions
1. Normal distribution
2. Lognormal distribution
3. Gamma distribution
4. Exponential distribution
5. Beta distribution
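Bernoulli distribution
#include <stdio.h>
#include <gsl/gsl_randist.h>
int
main (void)
{
    unsigned int i;
    double p = 0.6;
    double sum = 0;
    /* prints probability distribution table: k, P(X = k), cumulative sum
       (loop reconstructed; the original body is missing from the source) */
    for (i = 0; i <= 1; i++)
    {
        double pmf = gsl_ran_bernoulli_pdf (i, p);
        sum += pmf;
        printf ("%u\t%f\t%f\n", i, pmf, sum);
    }
    printf("\n");
    return 0;
}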
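Binomial distribution
#include <stdio.h>
#include <gsl/gsl_randist.h>
int
main (void)
{
    int i, n = 5;
    double p = 0.6;
    double sum = 0;
    /* prints probability distribution table: k, P(X = k), cumulative sum
       (loop reconstructed; the original body is missing from the source) */
    for (i = 0; i <= n; i++)
    {
        double pmf = gsl_ran_binomial_pdf (i, p, n);
        sum += pmf;
        printf ("%d\t%f\t%f\n", i, pmf, sum);
    }
    printf("\n");
    return 0;
}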
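Poisson distribution (poi.c)
#include <stdio.h>
#include <gsl/gsl_randist.h>
int
main (void)
{
    int i, n = 10;
    double mu = 3.0;
    double sum = 0;
    /* prints probability distribution table: k, P(X = k), cumulative sum
       (loop reconstructed; the original body is missing from the source) */
    for (i = 0; i <= n; i++)
    {
        double pmf = gsl_ran_poisson_pdf (i, mu);
        sum += pmf;
        printf ("%d\t%f\t%f\n", i, pmf, sum);
    }
    printf("\n");
    return 0;
}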
Uniform distribution (uniform.c)
#include <stdio.h>
#include <gsl/gsl_randist.h>
int
main (void)
{
    double x, k;
    int a, b;
    printf("enter value for x, a, b\n");
    scanf("%lf",&x);   /* x is a double, so it must be read with %lf */
    scanf("%d",&a);
    scanf("%d",&b);
    /* density of the uniform distribution on [a, b) evaluated at x */
    k = gsl_ran_flat_pdf (x, a, b);
    printf("%f\t\t%f\n",x,k);
    return 0;
}
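Multinomial distribution
#include <stdio.h>
#include <gsl/gsl_randist.h>
int
main (void)
{
    int k=3;
    const double p[]={0.2,0.4,0.4};
    const unsigned int n[]={2,3,4};
    /* probability of the counts n given the outcome probabilities p
       (this call is reconstructed; it is missing from the source) */
    double pmf = gsl_ran_multinomial_pdf (k, p, n);
    /* prints probability */
    printf("%3.9f\n",pmf);
    return 0;
}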
The following formula gives the probability of obtaining a specific set of
outcomes when there are three possible outcomes for each event:
\[ p = \frac{n!}{n_1! \, n_2! \, n_3!} \, p_1^{n_1} \, p_2^{n_2} \, p_3^{n_3} \]
where
p is the probability,
n is the total number of events
n1 is the number of times Outcome 1 occurs,
n2 is the number of times Outcome 2 occurs,
n3 is the number of times Outcome 3 occurs,
p1 is the probability of Outcome 1
p2 is the probability of Outcome 2, and
p3 is the probability of Outcome 3.
The formula for k outcomes is
\[ p = \frac{n!}{n_1! \, n_2! \cdots n_k!} \, p_1^{n_1} \, p_2^{n_2} \cdots p_k^{n_k} \]
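Hypergeometric distribution
#include <stdio.h>
#include <gsl/gsl_randist.h>
int
main (void)
{
    int x,s,f,n;
    double pmf;
    n=6;
    x=2;//random variable
    s=13;//success
    f=39;//failure
    /* probability of x successes in n draws without replacement
       (this call is reconstructed; it is missing from the source) */
    pmf = gsl_ran_hypergeometric_pdf (x, s, f, n);
    /* prints probability */
    printf("%d %3.6f\n",x,pmf);
    return 0;
}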
Continuous distributions (contdist.c)
#include <stdio.h>
#include <math.h>
#include <gsl/gsl_rng.h>
#include <gsl/gsl_randist.h>
#include <gsl/gsl_cdf.h>
void normal();
void beta();
void gamma1();
void exponential();
void lognormal();
int main()
{
int choice;
printf("Continuous distributions\n");
printf("-----------------------\n");
printf("1:Normal distribution\n");
printf("2:Gamma distribution\n");
printf("3:Exponential distribution\n");
printf("4:Beta distribution\n");
printf("5:Lognormal distribution\n");
printf("enter your choice\n");
scanf("%d",&choice);
switch(choice)
{case 1:
normal();
break;
case 2:
gamma1();
break;
case 3:
exponential();
break;
case 4:
beta();
break;
case 5:
lognormal();
break;
default:
printf("wrong choice\n");
}
return 0;
}
void normal()
{
double P, Q;
double x = 10;
double sigma=5;
double pdf;
printf("Normal distribution :x=%f sigma=%f\n",x,sigma);
pdf = gsl_ran_gaussian_pdf (x,sigma);
printf ("prob(x = %f) = %f\n", x, pdf);
P = gsl_cdf_gaussian_P (x,sigma);
printf ("prob(x < %f) = %f\n", x, P);
Q = gsl_cdf_gaussian_Q (x,sigma);
printf ("prob(x > %f) = %f\n", x, Q);
x = gsl_cdf_gaussian_Pinv (P,sigma);
printf ("Pinv(%f) = %f\n", P, x);
x = gsl_cdf_gaussian_Qinv (Q,sigma);
printf ("Qinv(%f) = %f\n", Q, x);
}
void gamma1()
{
double P, Q;
double x = 1.5;
double a=1;
double b=2;
double pdf;
printf("Gamma distribution :x=%f a=%f b=%f\n",x,a,b);
pdf = gsl_ran_gamma_pdf (x,a,b);
printf ("prob(x = %f) = %f\n", x, pdf);
P = gsl_cdf_gamma_P (x,a,b);
printf ("prob(x < %f) = %f\n", x, P);
Q = gsl_cdf_gamma_Q (x,a,b);
printf ("prob(x > %f) = %f\n", x, Q);
x = gsl_cdf_gamma_Pinv (P,a,b);
printf ("Pinv(%f) = %f\n", P, x);
x = gsl_cdf_gamma_Qinv (Q,a,b);
printf ("Qinv(%f) = %f\n", Q, x);
}
void exponential()
{
double P, Q;
double x = 0.05;
double lambda=2; /* note: the GSL exponential functions take the mean as this parameter, not the rate */
double pdf;
printf("Exponential distribution :x=%f lambda=%f\n",x,lambda);
pdf = gsl_ran_exponential_pdf (x,lambda);
printf ("prob(x = %f) = %f\n", x, pdf);
P = gsl_cdf_exponential_P (x,lambda);
printf ("prob(x < %f) = %f\n", x, P);
Q = gsl_cdf_exponential_Q (x,lambda);
printf ("prob(x > %f) = %f\n", x, Q);
x = gsl_cdf_exponential_Pinv (P,lambda);
printf ("Pinv(%f) = %f\n", P, x);
x = gsl_cdf_exponential_Qinv (Q,lambda);
printf ("Qinv(%f) = %f\n", Q, x);
}
void beta()
{
double P, Q;
double x = 0.8;
double a=0.5;
double b=0.5;
double pdf;
printf("Beta distribution :x=%f a=%f b=%f\n",x,a,b);
pdf = gsl_ran_beta_pdf (x,a,b);
printf ("prob(x = %f) = %f\n", x, pdf);
P = gsl_cdf_beta_P (x,a,b);
printf ("prob(x < %f) = %f\n", x, P);
Q = gsl_cdf_beta_Q (x,a,b);
printf ("prob(x > %f) = %f\n", x, Q);
x = gsl_cdf_beta_Pinv (P,a,b);
printf ("Pinv(%f) = %f\n", P, x);
x = gsl_cdf_beta_Qinv (Q,a,b);
printf ("Qinv(%f) = %f\n", Q, x);
}
void lognormal()
{
double P, Q;
double x = 4;
double zeta=2;
double sigma=1.5;
double pdf;
printf("Lognormal distribution :x=%f zeta=%f sigma=%f\n",x,zeta,sigma);
pdf = gsl_ran_lognormal_pdf (x,zeta,sigma);
printf ("prob(x = %f) = %f\n", x, pdf);
P = gsl_cdf_lognormal_P (x,zeta,sigma);
printf ("prob(x < %f) = %f\n", x, P);
Q = gsl_cdf_lognormal_Q (x,zeta,sigma);
printf ("prob(x > %f) = %f\n", x, Q);
x = gsl_cdf_lognormal_Pinv (P,zeta,sigma);
printf ("Pinv(%f) = %f\n", P, x);
x = gsl_cdf_lognormal_Qinv (Q,zeta,sigma);
printf ("Qinv(%f) = %f\n", Q, x);
}
Practical No. 5 Implement regression and goodness of fit
Implementing regression
Steps :
Functions used :
int gsl_fit_wlinear (const double * x, const size_t xstride, const double * w,
const size_t wstride, const double * y, const size_t ystride, size_t n, double * c0,
double * c1, double * cov00, double * cov01, double * cov11, double * chisq)
This function computes the best-fit linear regression coefficients (c0,c1) of the
model Y = c_0 + c_1 X for the weighted dataset (x, y), two vectors of
length n with strides xstride and ystride. The vector w, of length n and
stride wstride, specifies the weight of each datapoint. The weight is the
reciprocal of the variance for each datapoint in y.
The covariance matrix for the parameters (c0, c1) is computed using the
weights and returned via the parameters (cov00, cov01, cov11). The weighted
sum of squares of the residuals from the best-fit line, \chi^2, is returned
in chisq.
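int gsl_fit_linear_est (double x, double c0, double c1, double cov00,
double cov01, double cov11, double * y, double * y_err)
This function uses the best-fit linear regression coefficients c0, c1 and their
covariance cov00, cov01, cov11 to compute the fitted function y and its
standard deviation y_err for the model Y = c_0 + c_1 X at the point x.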
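#include <stdio.h>
#include <gsl/gsl_fit.h>
int
main (void)
{
  int i, n = 4;
  double x[4] = { 1970, 1980, 1990, 2000 };
  double y[4] = { 12, 11, 14, 13 };
  double w[4] = { 0.1, 0.2, 0.3, 0.4 };
  double c0, c1, cov00, cov01, cov11, chisq;
  gsl_fit_wlinear (x, 1, w, 1, y, 1, n,
                   &c0, &c1, &cov00, &cov01, &cov11,
                   &chisq);
  /* report the fitted line and the weighted sum of squared residuals */
  printf ("best fit: Y = %g + %g X\n", c0, c1);
  printf ("chisq = %g\n", chisq);
  printf ("\n");
  /* evaluate the fitted line and its error on a grid around the data */
  for (i = -30; i < 130; i++)
    {
      double xf = x[0] + (i / 100.0) * (x[n - 1] - x[0]);
      double yf, yf_err;
      gsl_fit_linear_est (xf,
                          c0, c1,
                          cov00, cov01, cov11,
                          &yf, &yf_err);
      printf ("fit: %g %g +/- %g\n", xf, yf, yf_err);
    }
  return 0;
}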
B. Implementing goodness of fit Chi Square
apop_db_open: If you want to use a database on the hard drive instead of
memory, call this function once and only once before using any other database
utilities.
Parameters:
Returns:
0: everything OK
1: database did not open.
apop_model * apop_estimate ( apop_data * d,
apop_model m
)
Estimate the parameters of a model given data. This function copies the input
model, preps it, and calls m.estimate(d,&m). If your model has
no estimate method, then I assume apop_maximum_likelihood(d, m), with the
default MLE params.
Parameters:
d The data
m The model
apop_model_to_pmf, whose parameters include:
apop_data * binspec,
int bin_count,
gsl_rng * rng
Make random draws from an apop_model, and bin them using a binspec in the
style of apop_data_to_bins. If you have a data set that used the same binspec,
you now have synced histograms, which you can plot or sensibly test
hypotheses about.
Parameters:
draws The number of random draws to make. (arbitrary default = 10,000)
An apop_pmf model.
#include <apop.h>
int main(){
apop_db_open("data-climate.db");
apop_data *precip = apop_query_to_data("select PCP from precip");
apop_model *est = apop_estimate(precip, apop_normal);
apop_data *precip_binned = apop_data_to_bins(precip/*,
.bin_count=180*/);
apop_model *datahist = apop_estimate(precip_binned, apop_pmf);
apop_model *modelhist = apop_model_to_pmf(.model=est,
.binspec=apop_data_get_page(precip_binned, "<binspec>"), .draws=1e5);
double scaling = apop_sum(datahist->data->weights)/apop_sum(modelhist->data->weights);
gsl_vector_scale(modelhist->data->weights, scaling);
apop_data_show(apop_histograms_test_goodness_of_fit(datahist,
modelhist));
}
Practical No. 6 Implement testing with likelihood
1. Building an optimized model and then solving it for the maximum (a
function can be provided in this case).
#include <apop.h>
#include "sinsq.c"
int main(){
system ("mkdir -p localmax_out; rm -f localmax_out/*.gplot");
apop_opts.verbose ++;
do_search(APOP_SIMPLEX_NM, "N-M Simplex", "simplex");
do_search(APOP_CG_FR, "F-R Conjugate gradient", "fr");
do_search(APOP_SIMAN, "Simulated annealing", "siman");
do_search(APOP_RF_NEWTON, "Root-finding", "root");
fflush(NULL);
system("sed -i \"1iplot '-'\" localmax_out/*.gplot");
}
#include <apop.h>
apop_model_show(out);
return out;
}
#ifndef TESTING
int main(){
apop_db_open("data-metro.db");
printf("With constant dummies:\n"); dummies(0);
printf("With slope dummies:\n"); dummies(1);
}
#endif
#define TESTING
#include "dummies.c"
int main(){
apop_db_open("data-metro.db");
apop_model *m0 = dummies(0);
apop_model *m1 = dummies(1);
show_normal_test(m0, m1, m0->data->matrix->size1);
}
Practical No. 7 Generate random numbers using the Monte Carlo method with
1. Exponential distribution
2. Uniform distribution
3. Binomial distribution
Some functions used for random number generation
The functions used for random number generation are declared in the header
file gsl_rng.h.
const gsl_rng_type * T : holds static information about each type of
generator.
gsl_rng_env_setup() : This function reads the environment
variables GSL_RNG_TYPE and GSL_RNG_SEED and uses their values to
set the corresponding library
variables gsl_rng_default and gsl_rng_default_seed.
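#include <stdio.h>
#include <gsl/gsl_rng.h>
int
main (void)
{
  const gsl_rng_type * T;
  gsl_rng * r;
  gsl_rng_env_setup();
  T = gsl_rng_default;
  r = gsl_rng_alloc (T);
  /* report which generator and seed are in effect */
  printf ("generator type: %s\n", gsl_rng_name (r));
  printf ("seed = %lu\n", gsl_rng_default_seed);
  printf ("first value = %lu\n", gsl_rng_get (r));
  gsl_rng_free (r);
  return 0;
}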
Running the program without any environment variables uses the initial
defaults, an mt19937 generator with a seed of 0.
By setting the two variables on the command line we can change the default
generator and the seed, for example (assuming the executable is named a.out):
GSL_RNG_TYPE="taus" GSL_RNG_SEED=123 ./a.out
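Generating random numbers using the exponential distribution
#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <gsl/gsl_rng.h>
#include <gsl/gsl_randist.h>
int main (void)
{
  /* setup reconstructed: the original lines are missing; n and mu are assumed values */
  int i, n = 10;
  double x, mu = 2.0;
  gsl_rng_env_setup();
  gsl_rng * r = gsl_rng_alloc (gsl_rng_default);
  for (i = 0; i < n; i++) {
    x = gsl_ran_exponential (r, mu);   /* draw with mean mu */
    printf(" %2.4f \n",x);
  }
  gsl_rng_free (r);
  return(0);
}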
Generating uniform random numbers in the range [0.0, 1.0) using uniform
distribution
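#include <stdio.h>
#include <gsl/gsl_rng.h>
int
main (void)
{
  const gsl_rng_type * T;
  gsl_rng * r;
  int i, n = 10;
  gsl_rng_env_setup();
  T = gsl_rng_default;
  r = gsl_rng_alloc (T);
  /* print n uniform random numbers in the range [0.0, 1.0) */
  for (i = 0; i < n; i++)
    {
      double u = gsl_rng_uniform (r);
      printf ("%.5f\n", u);
    }
  gsl_rng_free (r);
  return 0;
}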
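Generating random numbers using the binomial distribution
#include <stdio.h>
#include <gsl/gsl_rng.h>
#include <gsl/gsl_randist.h>
int
main (void)
{
  const gsl_rng_type * T;
  gsl_rng * r;
  int i, n = 10;
  unsigned int trials = 20;   /* assumed number of trials; the value is missing from the source */
  gsl_rng_env_setup();
  T = gsl_rng_default;
  r = gsl_rng_alloc (T);
  float p=0.3;
  /* print n random variates chosen from
     the binomial distribution with success
     probability p */
  for (i = 0; i < n; i++)
    {
      unsigned int k = gsl_ran_binomial (r, p, trials);
      printf (" %u", k);
    }
  printf ("\n");
  gsl_rng_free (r);
  return 0;
}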
Practical No. 8 Implementing Parametric testing
1. t test
#include <apop.h>
int main(){
    apop_db_open("data-census.db");
    gsl_vector *n = apop_query_to_vector("select in_per_capita from income "
        "where state= (select state from geography where name ='North Dakota')");
    gsl_vector *s = apop_query_to_vector("select in_per_capita from income "
        "where state= (select state from geography where name ='South Dakota')");
    apop_data *t = apop_t_test(n,s);
    apop_data_show(t); //show the whole output set...
    printf ("\n confidence: %g\n", apop_data_get(t, .rowname="conf.*2 tail")); //...or just one value.
}
2. F test
apop_data* apop_f_test ( apop_model * est,
apop_data * contrast
)
Runs an F-test specified by q and c. Your best bet is to see the chapter on
hypothesis testing in Modeling With Data, p 309. It will tell you that:
and that's what this function is based on.
Parameters:
est An apop_model that you have already calculated. (No default)
contrast The matrix Q and the vector c, where each row represents a
hypothesis. (Defaults: if matrix is NULL, it is set to the identity
matrix with the top row missing. If the vector is NULL, it is set
to a zero matrix of length equal to the height of the contrast
matrix. Thus, if the entire apop_data set is NULL or omitted,
we are testing the hypothesis that all coefficients but the first are zero.)
Returns:
An apop_data set with a few variants on the confidence with which we
can reject the joint hypothesis.
Todo:
There should be a way to get OLS and GLS to store . In fact, if you
did GLS, this is invalid, because you need , and I didn't ask for .
Exceptions:
out->error='a' Allocation error.
out->error='d' dimension-matching error.
out->error='i' matrix inversion error.
out->error='m' GSL math error.
#include "eigenbox.h"
int main(){
double line[] = {0, 0, 0, 1};
apop_data *constr = apop_line_to_data(line, 1, 1, 3);
apop_data *d = query_data();
apop_model *est = apop_estimate(d, apop_ols);
apop_model_show(est);
apop_data_show(apop_f_test(est, constr));
}
Practical No. 9 Drawing an Inference
Obtaining the mean, standard error and p-value for the given data.
#include <apop.h>
void one_boot(gsl_vector *base_data, gsl_rng *r, gsl_vector* boot_sample);
Practical No. 10 Implement Non-parametric Testing
1. Anova
#include <apop.h>
int main(){
apop_db_open("data-metro.db");
char joinedtab[] = "(select year, riders, line \
from riders, lines \
where riders.station = lines.station)";
apop_data_show(apop_anova(joinedtab, "riders", "line", "year"));
}
References
1. Modeling with Data, Ben Klemens, Princeton University Press
2. Computational Statistics, James E. Gentle, Springer
3. Computational Statistics, Second Edition, Geof H. Givens and
Jennifer A. Hoeting, Wiley Publications
4. www.cygwin.com
5. http://apophenia.info/