
DAT Manual PDF

This document provides instructions and examples for practical exercises involving SQL queries and data analysis tools. It outlines 10 practical exercises involving: 1) SQL queries including DDL commands, select statements, joins, and subqueries; 2) implementing GSL matrices and vectors; 3) graph plotting with GNUPLOT; 4) statistical distributions; 5) regression and goodness of fit; 6) maximum likelihood; 7) Monte Carlo simulation; 8) parametric testing; 9) statistical inference; and 10) non-parametric testing. It also provides instructions for installing relevant software like Cygwin, SQLite, GSL, and GNUPLOT on Windows and Ubuntu systems.

Uploaded by

Pranav Lakde

M.Sc. I.T.

Part I Semester I

Data Analysis Tools

MANUAL FOR PRACTICAL

2013 – 2014

M.Sc in Information Technology Part I

Course III : Data Analysis Tools

Practical based on the Book “Modelling with Data”

Practical Problems
Prepared and Implemented by

Mr. Mahesh Naik, Valia College, Andheri

&

Mr. Jayesh Shinde, UDIT, Santacruz

Compiled By

R. Srivaramangai, UDIT, Santacruz

INDEX

S.NO  DESCRIPTION
1     List of Practical
2     Installation procedure for cygwin
3     Installation procedure for ubuntu
4     Practical 1
5     Practical 2
6     Practical 3
7     Practical 4
8     Practical 5
9     Practical 6
10    Practical 7
11    Practical 8
12    Practical 9
13    Practical 10
14    References

List of Practical

1. SQL queries based on Unit I


a. DDL commands of SQL
b. Select clause
i. Simple select
ii. Select queries with where clause
iii. Select queries with arithmetic, relational and logical
operators
iv. Select queries with order by, group by, having, limit and
offset
v. Select queries with aggregation functions and distinct
vi. Select queries with sub queries and Joins

2. Implementing gsl matrices and vectors


a. Illustration of gsl Matrix multiplication
b. Illustration of gsl vector with database query embedded

3. Graph Plotting
a. Gnu plot for plotting vectors 1
b. Gnu plot for plotting vectors 2
c. Gnu plot for plotting vectors 3

4. Implementing Statistical Distributions


Discrete distributions
a) Bernoulli distribution
b) Binomial distribution
c) Poisson distribution
d) Multinomial distribution
e) Hypergeometric distribution

Continuous distributions
a) Normal distribution
b) Lognormal distribution
c) Gamma distribution
d) Exponential distribution

e) Beta distribution

5. Implementing Regression and goodness of fit


a. Implementing OLS regression
b. Implementing goodness of fit –chi square
6. Illustrating the maximum likelihood
7. Generating random numbers with Monte Carlo method using
a. Exponential distribution
b. Uniform distribution
c. Binomial distribution
8. Implementing Parametric testing
a. Using t-test
b. Using f-test
9. Illustrating the method of Inference
10.Implementing non-parametric testing - ANOVA

Installation of cygwin

1) Download the Cygwin software from http://www.cygwin.com/
The most recent version of the Cygwin DLL is 1.7.20-1.
2) Download the Apophenia library of functions from the website
http://apophenia.info/
3) Install Cygwin by running its setup.exe.
4) Cygwin offers numerous packages, so select those which are required
for the practicals, namely the gcc compiler, make, gsl, gnuplot and
sqlite.
5) The Apophenia library now has to be added to Cygwin. Installing
Cygwin creates a cygwin folder on the C: drive. Within the cygwin
folder, go to the home directory and then to your user subdirectory,
for example C:\cygwin\home\yourname (here C:\cygwin\home\Jayesh).
6) Copy the Apophenia archive into that directory (Jayesh).
7) Double-click the Cygwin terminal icon. The terminal window opens
and displays the present working directory.

8) Unpack the Apophenia library and change into its directory by typing:
tar xvzf apophenia-0.99-09_Jul_13.tgz
cd apophenia-0.99
9) Configure the library:
./configure

To test:
1. Once the Cygwin installation is complete, check it by running a test
program.
2. Suppose the test program is "abc.c".
3. Run the following commands in bash:
gcc -std=gnu99 abc.c -o abc.out -lapophenia -lgsl -lsqlite3
./abc.out

Ubuntu Installation as per the free download.

How to install the Sqlite on ubuntu 13.04

1) Download the SQLite source archive sqlite-autoconf-3071700.tar.gz from
http://www.sqlite.org.

2) Copy the downloaded sqlite-autoconf-3071700.tar.gz package into the Home
folder of Ubuntu 13.04.

3) Open the Terminal. It opens in the current directory. Extract the package
sqlite-autoconf-3071700.tar.gz by typing the command:

tar xvfz sqlite-autoconf-3071700.tar.gz

4) Extraction creates a folder named sqlite-autoconf-3071700 in the current
directory.

5) Move into the new folder:

jayesh@jayesh-G31M-S2L:~$ cd sqlite-autoconf-3071700

jayesh@jayesh-G31M-S2L:~/sqlite-autoconf-3071700$

6) Configure the build for the files in the sqlite-autoconf-3071700 folder by
typing the command:

jayesh@jayesh-G31M-S2L:~/sqlite-autoconf-3071700$ ./configure

7) After the configuration has been done, build the package by typing the
command:

jayesh@jayesh-G31M-S2L:~/sqlite-autoconf-3071700$ sudo make

It will ask for the password; type the password and press the Enter key.

8) Install the compiled SQLite files using the following command:

jayesh@jayesh-G31M-S2L:~/sqlite-autoconf-3071700$ sudo make install

9) Refresh the shared library cache:

jayesh@jayesh-G31M-S2L:~/sqlite-autoconf-3071700$ sudo ldconfig

How to install GSL on Ubuntu 13.04

1) Download the GSL source archive gsl-1.16.tar.gz from
http://www.gnu.org/s/gsl/

2) Copy the downloaded gsl-1.16.tar.gz package into the Home folder of
Ubuntu 13.04.

3) Open the Terminal. It opens in the current directory. Extract the package
gsl-1.16.tar.gz by typing the command:

tar xvfz gsl-1.16.tar.gz

4) Extraction creates a folder named gsl-1.16 in the current directory.

5) Move into the new folder:

jayesh@jayesh-G31M-S2L:~$ cd gsl-1.16

jayesh@jayesh-G31M-S2L:~/gsl-1.16$

6) Configure the build for the files in the gsl-1.16 folder by typing the
command:

jayesh@jayesh-G31M-S2L:~/gsl-1.16$ ./configure

7) After the configuration has been done, build the package by typing the
command:

jayesh@jayesh-G31M-S2L:~/gsl-1.16$ sudo make

It will ask for the password; type the password and press the Enter key.

8) After the build has finished, install GSL:

jayesh@jayesh-G31M-S2L:~/gsl-1.16$ sudo make install

9) Refresh the shared library cache:

jayesh@jayesh-G31M-S2L:~/gsl-1.16$ sudo ldconfig

How to install Apophenia on Ubuntu 13.04

1) Download the Apophenia source archive apophenia-0.99.tar.gz from
http://apophenia.info/

2) Copy the downloaded apophenia-0.99.tar.gz package into the Home folder of
Ubuntu 13.04.

3) Open the Terminal. It opens in the current directory. Extract the package
apophenia-0.99.tar.gz by typing the command:

tar xvfz apophenia-0.99.tar.gz

4) Extraction creates a folder named apophenia-0.99 in the current directory.

5) Move into the new folder:

jayesh@jayesh-G31M-S2L:~$ cd apophenia-0.99

jayesh@jayesh-G31M-S2L:~/apophenia-0.99$

6) Configure the build for the files in the apophenia-0.99 folder by typing
the command:

jayesh@jayesh-G31M-S2L:~/apophenia-0.99$ ./configure

7) After the configuration has been done, build and install the library by
typing the command:

jayesh@jayesh-G31M-S2L:~/apophenia-0.99$ sudo make install

It will ask for the password; type the password and press the Enter key.

8) Refresh the shared library cache:

jayesh@jayesh-G31M-S2L:~/apophenia-0.99$ sudo ldconfig

Installation of GNUPLOT On Ubuntu 13.04

sudo apt-get install gnuplot-x11

Practical No.1 - SQL queries based on Unit I

For all database related practical, create a database in Sqlite3

jayesh@jayesh-G31M-S2L:~$ sqlite3 testDB.db


SQLite version 3.7.17 2013-05-20 00:56:22
Enter ".help" for instructions
Enter SQL statements terminated with a ";"
To check whether the database has been created:

sqlite> .databases
seq name file
--- --------------- ----------------------------------------------------------
0 main /home/jayesh/testDB.db
sqlite>

Problem statement :

To execute SQL queries in order to store and retrieve the data under study in a
database. Sqlite is used for executing the queries.
i) Queries for performing DDL commands.
DDL commands are used to create, modify and delete database
objects. The data is stored in an RDBMS in the form of tables.
Following are the queries to be performed for DDL commands in
Sqlite

sqlite> CREATE TABLE COMPANY(
ID INT PRIMARY KEY NOT NULL,
NAME TEXT NOT NULL,
AGE INT NOT NULL,
ADDRESS CHAR(50),
SALARY REAL
);

sqlite> CREATE TABLE DEPARTMENT(
ID INT PRIMARY KEY NOT NULL,
DEPT CHAR(50) NOT NULL,
EMP_ID INT NOT NULL
);

You can verify that the tables have been created successfully using the
SQLite .tables command:
sqlite>.tables
COMPANY DEPARTMENT
ii) Inserting values into the COMPANY and DEPARTMENT tables

INSERT INTO COMPANY (ID,NAME,AGE,ADDRESS,SALARY)
VALUES (1, 'Paul', 32, 'California', 20000.00 );

INSERT INTO COMPANY (ID,NAME,AGE,ADDRESS,SALARY)
VALUES (2, 'Allen', 25, 'Texas', 15000.00 );

INSERT INTO COMPANY (ID,NAME,AGE,ADDRESS,SALARY)
VALUES (3, 'Teddy', 23, 'Norway', 20000.00 );

INSERT INTO COMPANY (ID,NAME,AGE,ADDRESS,SALARY)
VALUES (4, 'Mark', 25, 'Rich-Mond ', 65000.00 );

INSERT INTO COMPANY (ID,NAME,AGE,ADDRESS,SALARY)
VALUES (5, 'David', 27, 'Texas', 85000.00 );

INSERT INTO COMPANY (ID,NAME,AGE,ADDRESS,SALARY)
VALUES (6, 'Kim', 22, 'South-Hall', 45000.00 );

INSERT INTO COMPANY VALUES (7, 'James', 24, 'Houston', 10000.00 );

INSERT INTO DEPARTMENT (ID, DEPT, EMP_ID)
VALUES (1, 'IT Billing', 1 );

INSERT INTO DEPARTMENT (ID, DEPT, EMP_ID)
VALUES (2, 'Engineering', 2 );

INSERT INTO DEPARTMENT (ID, DEPT, EMP_ID)
VALUES (3, 'Finance', 7 );

iii) The select clause is a data manipulation command used for retrieving
data in the desired format from the database objects. The syntax of the
various forms of the select clause and their purposes are given below:

Select * from company;

a) List all the records where AGE is greater than or equal to 25 AND
SALARY is greater than or equal to 65000.00:

sqlite> SELECT * FROM COMPANY WHERE AGE >= 25 AND SALARY >= 65000;

List all the records where AGE is greater than or equal to 25 OR
SALARY is greater than or equal to 65000.00:

sqlite> SELECT * FROM COMPANY WHERE AGE >= 25 OR SALARY >= 65000;

list down all the records where AGE is not NULL which means all the
records because none of the record is having AGE equal to NULL:

sqlite> SELECT * FROM COMPANY WHERE AGE IS NOT NULL;

list down all the records where NAME starts with 'Ki', does not matter
what comes after 'Ki'.
sqlite> SELECT * FROM COMPANY WHERE NAME LIKE 'Ki%';

list down all the records where AGE value is either 25 or 27:

sqlite> SELECT * FROM COMPANY WHERE AGE IN ( 25, 27 );

list down all the records where AGE value is neither 25 nor 27:

sqlite> SELECT * FROM COMPANY WHERE AGE NOT IN ( 25, 27 );

list down all the records where AGE value is in BETWEEN 25 AND 27:

sqlite> SELECT * FROM COMPANY WHERE AGE BETWEEN 25 AND 27;

List the AGE of every record, provided that at least one record has
SALARY > 65000. The EXISTS subquery is not correlated with the outer query,
so it acts as an all-or-nothing condition rather than a per-row filter:

sqlite> SELECT AGE FROM COMPANY
WHERE EXISTS (SELECT AGE FROM COMPANY WHERE SALARY > 65000);

Find the total salary for each name:

sqlite> SELECT NAME, SUM(SALARY) FROM COMPANY GROUP BY NAME;

Add more records so that the COMPANY table has several records with the same
name:

INSERT INTO COMPANY VALUES (8, 'Paul', 24, 'Houston', 20000.00 );
INSERT INTO COMPANY VALUES (9, 'James', 44, 'Norway', 5000.00 );
INSERT INTO COMPANY VALUES (10, 'James', 45, 'Texas', 5000.00 );

b) Order by clause

SELECT NAME, SUM(SALARY) FROM COMPANY GROUP BY NAME ORDER BY NAME;

c) The following example displays the records for which the name count is
less than 2:

SELECT * FROM COMPANY GROUP BY name HAVING count(name) < 2;

The following example displays the records for which the name count is
greater than 2:

sqlite> SELECT * FROM COMPANY GROUP BY name HAVING count(name) > 2;

d) The following example sorts the result in ascending order by SALARY:

sqlite> SELECT * FROM COMPANY ORDER BY SALARY ASC;

e) The following example sorts the result in descending order by NAME:

sqlite> SELECT * FROM COMPANY ORDER BY NAME DESC;

f) The following example limits the result to the number of rows you want to
fetch from the table:

sqlite> SELECT * FROM COMPANY LIMIT 6;

With OFFSET, the first 2 rows are skipped and the next 3 are returned:

sqlite> SELECT * FROM COMPANY LIMIT 3 OFFSET 2;

g) Joins

sqlite> SELECT EMP_ID, NAME, DEPT FROM COMPANY CROSS JOIN DEPARTMENT;

sqlite> SELECT EMP_ID, NAME, DEPT FROM COMPANY INNER JOIN DEPARTMENT
ON COMPANY.ID = DEPARTMENT.EMP_ID;

sqlite> SELECT EMP_ID, NAME, DEPT FROM COMPANY LEFT OUTER JOIN DEPARTMENT
ON COMPANY.ID = DEPARTMENT.EMP_ID;

Practical 2

i) Multiplication Table

#include <apop.h>

int main(){
    gsl_matrix *m = gsl_matrix_alloc(20,15);
    gsl_matrix_set_all(m, 1);
    for (int i=0; i< m->size1; i++){
        Apop_matrix_row(m, i, one_row);
        gsl_vector_scale(one_row, i+1);
    }
    for (int i=0; i< m->size2; i++){
        Apop_matrix_col(m, i, one_col);
        gsl_vector_scale(one_col, i+1);
    }
    apop_matrix_show(m);
    gsl_matrix_free(m);
}

jayesh@jayesh-G31M-S2L:~$ gcc -std=gnu99 multiplicationtable.c -o multiplicationtable.out -lapophenia -lgsl -lsqlite3

jayesh@jayesh-G31M-S2L:~$ ./multiplicationtable.out
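
The effect of the two scaling loops can be checked without GSL. The sketch below is a plain C illustration, not part of the original practical: it fills a row-major array the same way, starting every entry at 1, scaling row i by i+1 and column j by j+1, so that entry (i, j) ends up as (i+1)(j+1). The function name fill_multiplication_table is an assumption made for this example.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: fill a rows x cols row-major array so that
   entry (i, j) holds (i+1)*(j+1), mirroring the two scaling loops. */
void fill_multiplication_table(double *m, size_t rows, size_t cols)
{
    assert(m != NULL);
    for (size_t k = 0; k < rows * cols; k++)
        m[k] = 1.0;                        /* like gsl_matrix_set_all(m, 1) */
    for (size_t i = 0; i < rows; i++)      /* scale each row by i+1 */
        for (size_t j = 0; j < cols; j++)
            m[i * cols + j] *= (double)(i + 1);
    for (size_t j = 0; j < cols; j++)      /* scale each column by j+1 */
        for (size_t i = 0; i < rows; i++)
            m[i * cols + j] *= (double)(j + 1);
}
```

Called with a 20 x 15 array, this should reproduce the values that the listing prints.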

ii) The function calc_taxes takes a double indicating taxable income and
returns the US income taxes owed, assuming a head of household with two
dependents taking the standard deduction.

#include <apop.h>

double calc_taxes(double income){
    double cutoffs[] = {0, 11200, 42650, 110100, 178350, 349700, INFINITY};
    double rates[] = {0, 0.10, .15, .25, .28, .33, .35};
    double tax = 0;
    int bracket = 1;
    income -= 7850; //Head of household standard deduction
    income -= 3400*3; //exemption: self plus two dependents.
    while (income > 0){
        //tax the part of the income that falls inside this bracket
        tax += rates[bracket] * GSL_MIN(income, cutoffs[bracket]-cutoffs[bracket-1]);
        //remove only the width of this bracket, so the rest moves up to the next one
        income -= cutoffs[bracket]-cutoffs[bracket-1];
        bracket ++;
    }
    return tax;
}

int main(){
    apop_db_open("data-census.db");
    strncpy(apop_opts.db_name_column, "geo_name", 100);
    apop_data *d = apop_query_to_data("select geo_name, Household_median_in as income \
        from income where sumlevel = '040' \
        order by household_median_in desc");
    Apop_col_t(d, "income", income_vector);
    d->vector = apop_vector_map(income_vector, calc_taxes);
    apop_name_add(d->names, "tax owed", 'v');
    apop_data_show(d);
}

jayesh@jayesh-G31M-S2L:~$ gcc -std=gnu99 taxes.c -o taxes.out -lapophenia -lgsl -lsqlite3

jayesh@jayesh-G31M-S2L:~$ ./taxes.out
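
The bracket arithmetic can be worked through by hand. The sketch below is a stand-alone illustration in standard C, using the same cutoffs and rates as the listing; the function name marginal_tax is an assumption, and its argument is the income already reduced by the deduction and exemptions. Each pass taxes the slice of income inside one bracket and then removes that bracket's width. For a taxable income of 50000 this gives 0.10 x 11200 + 0.15 x 31450 + 0.25 x 7350 = 7675.

```c
#include <assert.h>
#include <math.h>

/* Hypothetical stand-alone version of the bracket loop. `taxable` is the
   income after deductions and exemptions. */
double marginal_tax(double taxable)
{
    const double cutoffs[] = {0, 11200, 42650, 110100, 178350, 349700, INFINITY};
    const double rates[]   = {0, 0.10, 0.15, 0.25, 0.28, 0.33, 0.35};
    double tax = 0;
    assert(!isnan(taxable));
    for (int b = 1; b <= 6 && taxable > 0; b++) {
        double width = cutoffs[b] - cutoffs[b - 1];      /* size of bracket b */
        double slice = taxable < width ? taxable : width; /* part taxed at rates[b] */
        tax += rates[b] * slice;
        taxable -= width;          /* what remains for the higher brackets */
    }
    return tax;
}
```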

Practical III
Plotting a vector

#include <apop.h>

void plot_matrix_now(gsl_matrix *data){
    static FILE *gp = NULL;
    if (!gp)
        gp = popen("gnuplot -persist", "w");
    if (!gp){
        printf("Couldn't open Gnuplot.\n");
        return;
    }
    fprintf(gp,"reset; plot '-' \n");
    apop_matrix_print(data, .output_pipe=gp);
    fflush(gp);
}

int main(){
    apop_db_open("data-climate.db");
    plot_matrix_now(apop_query_to_matrix("select (year*12+month)/12., temp from temp"));
}

jayesh@jayesh-G31M-S2L:~$ gcc -std=gnu99 pipeplot.c -o pipeplot.out -lapophenia -lgsl -lsqlite3
jayesh@jayesh-G31M-S2L:~$ ./pipeplot.out
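
What travels down the pipe is ordinary text: a plot command followed by the data rows. The sketch below, which uses only standard C, writes that text to any FILE pointer so it can be inspected. The function name write_inline_plot and the "with lines" style are assumptions for illustration; the closing line containing only e is the way Gnuplot marks the end of inline data read from '-'.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical helper: write a plot command and n (x, y) pairs to `out`
   in the form Gnuplot expects for inline data. Returns the number of
   data rows written. */
size_t write_inline_plot(FILE *out, const double *x, const double *y, size_t n)
{
    assert(out != NULL);
    fprintf(out, "reset; plot '-' with lines\n");
    for (size_t i = 0; i < n; i++)
        fprintf(out, "%g %g\n", x[i], y[i]);   /* one data point per line */
    fprintf(out, "e\n");       /* a line holding only 'e' ends the inline data */
    fflush(out);
    return n;
}
```

Passing a pipe opened with popen("gnuplot -persist", "w") as out should draw the points; passing stdout shows the exact text instead.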

Eigen vector

#include "eigenbox.h"

apop_data *query_data(){
    apop_db_open("data-census.db");
    return apop_query_to_data(" select postcode as row_names, "
        " m_per_100_f, population/1e6 as population, median_age "
        " from geography, income,demos,postcodes "
        " where income.sumlevel= '040' "
        " and geography.geo_id = demos.geo_id "
        " and income.geo_name = postcodes.state "
        " and geography.geo_id = income.geo_id ");
}

void show_projection(gsl_matrix *pc_space, apop_data *data){
    fprintf(stderr,"The eigenvectors:\n");
    apop_matrix_print(pc_space, .output_pipe=stderr);
    apop_data *projected = apop_dot(data, apop_matrix_to_data(pc_space));
    printf("plot '-' using 2:3:1 with labels\n");
    apop_data_show(projected);
}

int main(){
    apop_plot_lattice(query_data(), "out");
}

jayesh@jayesh-G31M-S2L:~$ gcc -std=gnu99 eigenbox.c -o eigenbox.out -lapophenia -lgsl -lsqlite3
jayesh@jayesh-G31M-S2L:~$ ./eigenbox.out
jayesh@jayesh-G31M-S2L:~$ gnuplot -persist < out

Query out the month, average, and variance, and plot the data using errorbars.
Prints to stdout, so pipe the output through Gnuplot.

#include <apop.h>

int main(){
    apop_db_open("data-climate.db");
    apop_data *d = apop_query_to_data("select \
        (yearmonth/100. - round(yearmonth/100.))*100 as month, \
        avg(tmp), stddev(tmp) \
        from precip group by month");
    printf("set xrange [0:13]; plot '-' with errorbars\n");
    apop_matrix_show(d->matrix);
}

jayesh@jayesh-G31M-S2L:~$ gcc -std=gnu99 errorbars.c -o errorbars.out -lapophenia -lgsl -lsqlite3
jayesh@jayesh-G31M-S2L:~$ ./errorbars.out | gnuplot -persist

Practical 4

Implement the statistical distributions

Discrete distributions
1. Bernoulli distribution
2. binomial distribution
3. Poisson distribution
4. Multinomial distribution
5. hypergeometric distribution

Continuous distributions
1. Normal distribution
2. Lognormal distribution
3. Gamma distribution
4. Exponential distribution
5. Beta distribution

bernoulli distribution (bernoulli.c)

#include <stdio.h>
#include <gsl/gsl_randist.h>

int
main (void)
{
    int i;
    double p = 0.6;
    float sum=0;
    /* prints the probability distribution table */
    printf("random variable|||probability |||cumulative prob.\n");
    printf("-------------------------------------------------------\n");
    for (i = 0; i <= 1; i++)
    {
        float k = gsl_ran_bernoulli_pdf (i,p);
        sum=sum+k;
        printf("%d\t\t%f\t\t%f\n",i,k,sum);
    }

    printf("\n");
    return 0;
}

binomial distribution (binomial.c)

#include <stdio.h>
#include <gsl/gsl_randist.h>

int
main (void)
{
    int i,n=5;
    double p = 0.6;
    float sum=0;
    /* prints the probability distribution table */
    printf("random variable|||probability |||cumulative prob.\n");
    printf("-------------------------------------------------------\n");
    for (i = 0; i <= n; i++)
    {
        float k = gsl_ran_binomial_pdf (i,p,n);
        sum=sum+k;
        printf("%d\t\t%f\t\t%f\n",i,k,sum);
    }

    printf("\n");
    return 0;
}

Poisson distribution (poi.c)
#include <stdio.h>
#include <gsl/gsl_randist.h>

int
main (void)
{
    int i, n = 10;
    double mu = 3.0;
    float sum=0;
    /* prints the probability distribution table */
    printf("random variable|||probability |||cumulative prob.\n");
    printf("-------------------------------------------------------\n");
    for (i = 0; i <= n; i++)
    {
        float k = gsl_ran_poisson_pdf (i,mu);
        sum=sum+k;
        printf("%d\t\t%f\t\t%f\n",i,k,sum);
    }

    printf("\n");
    return 0;
}

Uniform distribution(uniform.c)

#include <stdio.h>
#include <gsl/gsl_randist.h>

int
main (void)
{
    double x;
    int a,b ;
    printf("enter value for x ,a,b \n");
    scanf("%lf",&x);   /* x is a double, so it must be read with %lf */
    scanf("%d",&a);
    scanf("%d",&b);
    float sum=0;
    /* prints the probability density at x */
    printf("random variable|||probability \n");
    printf("-------------------------------------------------------\n");

    float k = (float)gsl_ran_flat_pdf (x,a,b);

    printf("%f\t\t%f\n",x,k);

    return 0;
}

Multinomial distribution (multinomial.c)

#include <stdio.h>
#include <gsl/gsl_randist.h>

int main (void)
{
    int k=3;                           /* number of outcome categories */
    const double p[]={0.2,0.4,0.4};    /* probability of each category */
    const unsigned int n[]={2,3,4};    /* count observed in each category */

    /* prints probability */
    printf("random variable|||probability \n");
    printf("-------------------------------------------------------\n");

    double pmf =gsl_ran_multinomial_pdf(k,p,n);

    printf("%3.9f\n",pmf);

    return 0;
}

The following formula gives the probability of obtaining a specific set of
outcomes when there are three possible outcomes for each event:

$$ p = \frac{n!}{n_1!\, n_2!\, n_3!}\; p_1^{n_1}\, p_2^{n_2}\, p_3^{n_3} $$

where

p is the probability,
n is the total number of events
n1 is the number of times Outcome 1 occurs,
n2 is the number of times Outcome 2 occurs,
n3 is the number of times Outcome 3 occurs,
p1 is the probability of Outcome 1
p2 is the probability of Outcome 2, and
p3 is the probability of Outcome 3.

For the chess example,

n = 12 (12 games are played),
n1 = 7 (number won by Player A),
n2 = 2 (number won by Player B),
n3 = 3 (the number drawn),
p1 = 0.40 (probability Player A wins)
p2 = 0.35 (probability Player B wins)
p3 = 0.25 (probability of a draw)

The formula for k outcomes is

$$ p = \frac{n!}{n_1!\, n_2! \cdots n_k!}\; p_1^{n_1}\, p_2^{n_2} \cdots p_k^{n_k} $$

Hypergeometric distribution (hyper.c)


#include <stdio.h>
#include <gsl/gsl_randist.h>

int main (void)
{
    int x,s,f,n;
    n=6;  //number of items drawn
    x=2;  //random variable
    s=13; //success
    f=39; //failure

    /* prints probability */
    printf("random variable|||probability \n");
    printf("-----------------------------------\n");

    double pmf =gsl_ran_hypergeometric_pdf(x,s,f,n);

    printf("%d %3.6f\n",x,pmf);

    return 0;
}

Continuous distributions (contdist.c)

#include <stdio.h>
#include <math.h>
#include <gsl/gsl_rng.h>
#include <gsl/gsl_randist.h>
#include <gsl/gsl_cdf.h>
void normal();
void beta();
void gamma1();
void exponential();
void lognormal();
int main()
{
int choice;
printf("continuous distributions\n");
printf("-----------------------\n");
printf("1:Normal distribution\n");
printf("2:Gamma distribution\n");
printf("3:Exponential distribution\n");
printf("4:Beta distribution\n");
printf("5:Lognormal distribution\n");
printf("enter your choice\n");
scanf("%d",&choice);
switch(choice)
{case 1:
normal();
break;
case 2:
gamma1();
break;
case 3:
exponential();

break;
case 4:
beta();
break;
case 5:
lognormal();
break;
default:
printf("wrong choice\n");
}
return 0;
}

void normal()
{
double P, Q;
double x = 10;
double sigma=5;
double pdf;
printf("Normal distribution :x=%f sigma=%f\n",x,sigma);
pdf = gsl_ran_gaussian_pdf (x,sigma);
printf ("prob(x = %f) = %f\n", x, pdf);

P = gsl_cdf_gaussian_P (x,sigma);
printf ("prob(x < %f) = %f\n", x, P);

Q = gsl_cdf_gaussian_Q (x,sigma);
printf ("prob(x > %f) = %f\n", x, Q);

x = gsl_cdf_gaussian_Pinv (P,sigma);
printf ("Pinv(%f) = %f\n", P, x);

x = gsl_cdf_gaussian_Qinv (Q,sigma);
printf ("Qinv(%f) = %f\n", Q, x);
}

void gamma1()
{
double P, Q;
double x = 1.5;
double a=1;
double b=2;
double pdf;

printf("Gamma distribution :x=%f a=%f b=%f\n",x,a,b);
pdf = gsl_ran_gamma_pdf (x,a,b);
printf ("prob(x = %f) = %f\n", x, pdf);

P = gsl_cdf_gamma_P (x,a,b);
printf ("prob(x < %f) = %f\n", x, P);

Q = gsl_cdf_gamma_Q (x,a,b);
printf ("prob(x > %f) = %f\n", x, Q);

x = gsl_cdf_gamma_Pinv (P,a,b);
printf ("Pinv(%f) = %f\n", P, x);

x = gsl_cdf_gamma_Qinv (Q,a,b);
printf ("Qinv(%f) = %f\n", Q, x);
}

void exponential()
{
double P, Q;
double x = 0.05;
double lambda=2; /* GSL treats this argument as the mean (mu) of the distribution, not the rate */
double pdf;
printf("Exponential distribution :x=%f lambda=%f\n",x,lambda);
pdf = gsl_ran_exponential_pdf (x,lambda);
printf ("prob(x = %f) = %f\n", x, pdf);

P = gsl_cdf_exponential_P (x,lambda);
printf ("prob(x < %f) = %f\n", x, P);

Q = gsl_cdf_exponential_Q (x,lambda);
printf ("prob(x > %f) = %f\n", x, Q);

x = gsl_cdf_exponential_Pinv (P,lambda);
printf ("Pinv(%f) = %f\n", P, x);

x = gsl_cdf_exponential_Qinv (Q,lambda);
printf ("Qinv(%f) = %f\n", Q, x);
}

void beta()
{
double P, Q;
double x = 0.8;
double a=0.5;
double b=0.5;
double pdf;
printf("Beta distribution :x=%f a=%f b=%f\n",x,a,b);
pdf = gsl_ran_beta_pdf (x,a,b);
printf ("prob(x = %f) = %f\n", x, pdf);

P = gsl_cdf_beta_P (x,a,b);
printf ("prob(x < %f) = %f\n", x, P);

Q = gsl_cdf_beta_Q (x,a,b);
printf ("prob(x > %f) = %f\n", x, Q);

x = gsl_cdf_beta_Pinv (P,a,b);
printf ("Pinv(%f) = %f\n", P, x);

x = gsl_cdf_beta_Qinv (Q,a,b);
printf ("Qinv(%f) = %f\n", Q, x);
}

void lognormal()
{
double P, Q;
double x = 4;
double zeta=2;
double sigma=1.5;
double pdf;
printf("Lognormal distribution :x=%f zeta=%f sigma=%f\n",x,zeta,sigma);
pdf = gsl_ran_lognormal_pdf (x,zeta,sigma);
printf ("prob(x = %f) = %f\n", x, pdf);

P = gsl_cdf_lognormal_P (x,zeta,sigma);
printf ("prob(x < %f) = %f\n", x, P);

Q = gsl_cdf_lognormal_Q (x,zeta,sigma);
printf ("prob(x > %f) = %f\n", x, Q);

x = gsl_cdf_lognormal_Pinv (P,zeta,sigma);
printf ("Pinv(%f) = %f\n", P, x);

x = gsl_cdf_lognormal_Qinv (Q,zeta,sigma);
printf ("Qinv(%f) = %f\n", Q, x);
}

Practical No. 5 Implement regression and goodness of fit

Implementing regression
Steps :

Functions used :
int gsl_fit_wlinear (const double * x, const size_t xstride, const double * w,
const size_t wstride, const double * y, const size_t ystride, size_t n, double * c0,
double * c1, double * cov00, double * cov01, double * cov11, double * chisq)

This function computes the best-fit linear regression coefficients (c0,c1) of the
model Y = c_0 + c_1 X for the weighted dataset (x, y), two vectors of
length n with strides xstride and ystride. The vector w, of length n and
stride wstride, specifies the weight of each datapoint. The weight is the
reciprocal of the variance for each datapoint in y.

The covariance matrix for the parameters (c0, c1) is computed using the
weights and returned via the parameters (cov00, cov01, cov11). The weighted
sum of squares of the residuals from the best-fit line, \chi^2, is returned
in chisq.

int gsl_fit_linear_est (double x, double c0, double c1, double cov00,
double cov01, double cov11, double * y, double * y_err)

This function uses the best-fit linear regression coefficients c0, c1 and their
covariance cov00, cov01, cov11 to compute the fitted function y and its
standard deviation y_err for the model Y = c_0 + c_1 X at the point x.

The following program computes a least squares straight-line fit to a simple
dataset, and outputs the best-fit line and its associated one standard-deviation
error bars.
#include <stdio.h>
#include <math.h>   /* for sqrt */
#include <gsl/gsl_fit.h>

int
main (void)
{
    int i, n = 4;
    double x[4] = { 1970, 1980, 1990, 2000 };
    double y[4] = { 12, 11, 14, 13 };
    double w[4] = { 0.1, 0.2, 0.3, 0.4 };

    double c0, c1, cov00, cov01, cov11, chisq;

    gsl_fit_wlinear (x, 1, w, 1, y, 1, n,
                     &c0, &c1, &cov00, &cov01, &cov11,
                     &chisq);

    printf ("# best fit: Y = %g + %g X\n", c0, c1);
    printf ("# covariance matrix:\n");
    printf ("# [ %g, %g\n# %g, %g]\n",
            cov00, cov01, cov01, cov11);
    printf ("# chisq = %g\n", chisq);

    /* each weight is 1/variance, so the error bar is 1/sqrt(w) */
    for (i = 0; i < n; i++)
        printf ("data: %g %g %g\n",
                x[i], y[i], 1/sqrt(w[i]));

    printf ("\n");

    /* evaluate the fitted line from 30% before the first x to 30% past the last */
    for (i = -30; i < 130; i++)
    {
        double xf = x[0] + (i/100.0) * (x[n-1] - x[0]);
        double yf, yf_err;

        gsl_fit_linear_est (xf,
                            c0, c1,
                            cov00, cov01, cov11,
                            &yf, &yf_err);

        printf ("fit: %g %g\n", xf, yf);
        printf ("hi : %g %g\n", xf, yf + yf_err);
        printf ("lo : %g %g\n", xf, yf - yf_err);
    }
    return 0;
}
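
The coefficients of a weighted straight-line fit have a closed form: with weighted means xbar and ybar, the slope is c1 = sum w (x - xbar)(y - ybar) / sum w (x - xbar)^2 and the intercept is c0 = ybar - c1 xbar. The sketch below implements just that in standard C, as a rough illustration of the quantity being fitted. The function name weighted_line_fit is an assumption, and the covariance terms that gsl_fit_wlinear also reports are not computed here.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: weighted least squares fit of y = c0 + c1*x.
   Returns the weighted sum of squared residuals. */
double weighted_line_fit(const double *x, const double *w, const double *y,
                         size_t n, double *c0, double *c1)
{
    assert(n >= 2 && c0 != NULL && c1 != NULL);
    double W = 0, xbar = 0, ybar = 0;
    for (size_t i = 0; i < n; i++) {       /* weighted means of x and y */
        W += w[i];
        xbar += w[i] * x[i];
        ybar += w[i] * y[i];
    }
    xbar /= W;
    ybar /= W;
    double sxx = 0, sxy = 0;
    for (size_t i = 0; i < n; i++) {       /* weighted (co)variation about the means */
        double dx = x[i] - xbar;
        sxx += w[i] * dx * dx;
        sxy += w[i] * dx * (y[i] - ybar);
    }
    *c1 = sxy / sxx;
    *c0 = ybar - *c1 * xbar;
    double chisq = 0;
    for (size_t i = 0; i < n; i++) {       /* weighted squared residuals */
        double r = y[i] - (*c0 + *c1 * x[i]);
        chisq += w[i] * r * r;
    }
    return chisq;
}
```

On the four data points of the listing, the returned coefficients should agree with the c0 and c1 printed by the GSL program up to rounding.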

B. Implementing goodness of fit Chi Square

int apop_db_open ( char const * filename )

If you want to use a database on the hard drive instead of memory, then call
this once and only once before using any other database utilities.

When you are done doing your database manipulations, be sure to call
apop_db_close if writing to disk.

Parameters:
filename   The name of a file on the hard drive on which to store the database.

Returns:
0: everything OK
1: database did not open.

apop_model* apop_estimate ( apop_data * d, apop_model m )

Estimate the parameters of a model given data. This function copies the input
model, preps it, and calls m.estimate(d,&m). If your model has
no estimate method, then apop_maximum_likelihood(d, m) is assumed, with the
default MLE params.

Parameters:
d   The data
m   The model

Returns: A pointer to an output model, which typically matches the input
model but has its parameters element filled in.

apop_model* apop_model_to_pmf ( apop_model * model, apop_data * binspec,
long int draws, int bin_count, gsl_rng * rng )

Make random draws from an apop_model, and bin them using a binspec in the
style of apop_data_to_bins. If you have a data set that used the same binspec,
you now have synced histograms, which you can plot or sensibly test
hypotheses about.

The output is normalized to integrate to one.

Parameters:
model      The model to be drawn from. Because this function works via
           random draws, the model needs to have a draw method. (No default)
binspec    A description of the bins in which to place the draws;
           see apop_data_to_bins. (default: as in apop_data_to_bins.)
draws      The number of random draws to make. (arbitrary default = 10,000)
bin_count  If no bin spec, the number of bins to use (default: as
           per apop_data_to_bins)
rng        The gsl_rng used to make random draws. (default: see note
           on Auto-allocated RNGs)

Returns:
An apop_pmf model.

This function uses the Designated initializers syntax for inputs.

#include <apop.h>

int main(){
    apop_db_open("data-climate.db");
    apop_data *precip = apop_query_to_data("select PCP from precip");
    apop_model *est = apop_estimate(precip, apop_normal);
    apop_data *precip_binned = apop_data_to_bins(precip/*, .bin_count=180*/);
    apop_model *datahist = apop_estimate(precip_binned, apop_pmf);
    apop_model *modelhist = apop_model_to_pmf(.model=est,
        .binspec=apop_data_get_page(precip_binned, "<binspec>"), .draws=1e5);
    double scaling = apop_sum(datahist->data->weights)/apop_sum(modelhist->data->weights);
    gsl_vector_scale(modelhist->data->weights, scaling);
    apop_data_show(apop_histograms_test_goodness_of_fit(datahist, modelhist));
}

Prac 6. Implement testing with likelihood

1. Building an optimized model and then solving it for the maximum (a
function can be provided in this case).

The optimization methods used in this practical are:

APOP_SIMPLEX_NM   Nelder-Mead simplex (gradient handling rule is irrelevant)
APOP_CG_FR        Conjugate gradient (Fletcher-Reeves) (default)
APOP_SIMAN        Simulated annealing
APOP_RF_NEWTON    Find a root of the derivative via Newton's method

#include <apop.h>

double sin_square(apop_data *data, apop_model *m){
    double x = apop_data_get(m->parameters, 0, -1);
    return -sin(x)*gsl_pow_2(x);
}

apop_model sin_sq_model ={"-sin(x) times x^2",1, .p = sin_square};

#include "sinsq.c"

void do_search(int number, char *name, char *trace){
    apop_model *out;
    double p[] = {0};
    double result;
    char *outf;
    asprintf(&outf, "localmax_out/%s.gplot", trace);
    Apop_model_add_group(&sin_sq_model, apop_mle,
        .starting_pt= p,
        .method= number, .tolerance= 1e-4,
        .mu_t= 1.25, .trace_path= outf);
    out = apop_estimate(NULL, sin_sq_model);
    result = gsl_vector_get(out->parameters->vector, 0);
    printf("The %s algorithm found %g.\n", name, result);
    Apop_settings_rm_group(&sin_sq_model, apop_mle);
}

int main(){
    system ("mkdir -p localmax_out; rm -f localmax_out/*.gplot");
    apop_opts.verbose ++;
    do_search(APOP_SIMPLEX_NM, "N-M Simplex", "simplex");
    do_search(APOP_CG_FR, "F-R Conjugate gradient", "fr");
    do_search(APOP_SIMAN, "Simulated annealing", "siman");
    do_search(APOP_RF_NEWTON, "Root-finding", "root");
    fflush(NULL);
    system("sed -i \"1iplot '-'\" localmax_out/*.gplot");
}

2. Comparing 2 models using likelihood ratio

#include <apop.h>

apop_model * dummies(int slope_dummies){
    apop_data *d = apop_query_to_mixed_data("mmt", "select riders, year-1977, line \
        from riders, lines \
        where riders.station=lines.station");
    apop_data *dummified = apop_data_to_dummies(d, 0, 't', .append='y', .remove='y');
    if (slope_dummies){
        Apop_col(d, 1, yeardata);
        for(int i=0; i < dummified->matrix->size2; i ++){
            Apop_col(dummified, i, c);
            gsl_vector_mul(c, yeardata);
        }
    }
    apop_model *out = apop_estimate(dummified, apop_ols);
    apop_model_show(out);
    return out;
}

#ifndef TESTING
int main(){
    apop_db_open("data-metro.db");
    printf("With constant dummies:\n"); dummies(0);
    printf("With slope dummies:\n"); dummies(1);
}
#endif

#define TESTING
#include "dummies.c"

void show_normal_test(apop_model *unconstrained, apop_model *constrained, int n){
    double statistic = (apop_data_get(unconstrained->info, .rowname="log likelihood")
                      - apop_data_get(constrained->info, .rowname="log likelihood"))/sqrt(n);
    double confidence = gsl_cdf_gaussian_P(fabs(statistic), 1); //one-tailed.
    printf("The Normal statistic is: %g, so reject the null of no difference between models "
           "with %g%% confidence.\n", statistic, confidence*100);
}

int main(){
    apop_db_open("data-metro.db");
    apop_model *m0 = dummies(0);
    apop_model *m1 = dummies(1);
    show_normal_test(m0, m1, m0->data->matrix->size1);
}

Prac 7. Generate random numbers using the Monte Carlo method using

1. Exponential distribution
2. Uniform distribution
3. Binomial distribution

Some functions used for random number generation

The functions used for random number generation are declared in the header
file `gsl_rng.h'.
• const gsl_rng_type * T : holds static information about each type of
  generator.
• gsl_rng_env_setup() : reads the environment variables GSL_RNG_TYPE and
  GSL_RNG_SEED and uses their values to set the corresponding library
  variables gsl_rng_default and gsl_rng_default_seed.

Program to create a global generator using the environment
variables GSL_RNG_TYPE and GSL_RNG_SEED:

#include <stdio.h>
#include <gsl/gsl_rng.h>

gsl_rng * r; /* global generator */

int
main (void)
{
    const gsl_rng_type * T;

    gsl_rng_env_setup();

    T = gsl_rng_default;
    r = gsl_rng_alloc (T);

    printf ("generator type: %s\n", gsl_rng_name (r));
    printf ("seed = %lu\n", gsl_rng_default_seed);
    printf ("first value = %lu\n", gsl_rng_get (r));

    gsl_rng_free (r);
    return 0;
}

Running the program without any environment variables uses the initial
defaults: an mt19937 generator with a seed of 0.

Setting the two variables on the command line changes the default
generator and the seed, for example:

GSL_RNG_TYPE="taus" GSL_RNG_SEED=123 ./a.out

Using exponential distribution

#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <gsl/gsl_rng.h>
#include <gsl/gsl_randist.h>

/* usage: program n alpha */
int main(int argc, char *argv[])
{
    int i,n;
    float x,alpha;
    if (argc < 3)
    {
        fprintf(stderr, "usage: %s n alpha\n", argv[0]);
        return 1;
    }
    gsl_rng *r=gsl_rng_alloc(gsl_rng_mt19937); /* initialises GSL RNG */
    n=atoi(argv[1]);
    alpha=atof(argv[2]);
    x=0;
    for (i=0;i<n;i++)
    {
        /* previous value scaled by alpha, plus an exponential variate with mean 1 */
        x=alpha*x + gsl_ran_exponential(r,1);
        printf(" %2.4f \n",x);
    }
    gsl_rng_free(r);
    return(0);
}

Generating uniform random numbers in the range [0.0, 1.0) using uniform
distribution

#include <stdio.h>
#include <gsl/gsl_rng.h>

int
main (void)
{
    const gsl_rng_type * T;
    gsl_rng * r;

    int i, n = 10;

    gsl_rng_env_setup();

    T = gsl_rng_default;
    r = gsl_rng_alloc (T);

    for (i = 0; i < n; i++)
    {
        double u = gsl_rng_uniform (r);
        printf ("%.5f\n", u);
    }

    gsl_rng_free (r);

    return 0;
}

Using binomial distribution

#include <stdio.h>
#include <gsl/gsl_rng.h>
#include <gsl/gsl_randist.h>

int
main (void)
{
    const gsl_rng_type * T;
    gsl_rng * r;

    int i, n = 10;

    /* create a generator chosen by the
       environment variable GSL_RNG_TYPE */

    gsl_rng_env_setup();

    T = gsl_rng_default;
    r = gsl_rng_alloc (T);
    float p=0.3;

    /* print n random variates chosen from
       the binomial distribution with success
       probability p and n trials (n is used
       both as the number of draws and as the
       number of trials) */

    for (i = 0; i < n; i++)
    {
        unsigned int k = gsl_ran_binomial(r, p,n);
        printf (" %u", k);
    }

    printf ("\n");
    gsl_rng_free (r);
    return 0;
}
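
A binomial variate counts the successes in n independent trials that each succeed with probability p. The plain-C sketch below makes that definition concrete by simulating the trials one by one. It is an illustration under assumptions (a toy generator, hypothetical function names) and is not a description of how gsl_ran_binomial works internally.

```c
#include <stdint.h>

/* Toy 64-bit linear congruential generator (illustrative only).
   Returns a uniform double in [0, 1) built from the top 53 bits. */
double toy_uniform(uint64_t *state) {
    *state = *state * 6364136223846793005ULL + 1442695040888963407ULL;
    return (double)(*state >> 11) / 9007199254740992.0;   /* divide by 2^53 */
}

/* Count the successes in n Bernoulli(p) trials. */
unsigned int binomial_by_trials(uint64_t *state, double p, unsigned int n) {
    unsigned int successes = 0;
    for (unsigned int i = 0; i < n; i++)
        if (toy_uniform(state) < p)     /* this trial succeeds with probability p */
            successes++;
    return successes;
}
```

With p = 0.3 and n = 10, as in the program above, the average of many such draws should be close to n*p = 3.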

The following functions can be used to generate random numbers from different
distributions, given the required parameters.

Practical No. 8 Implementing Parametric testing

1. t test
#include <apop.h>

int main(){
    apop_db_open("data-census.db");
    gsl_vector *n = apop_query_to_vector("select in_per_capita from income "
            "where state= (select state from geography where name ='North Dakota')");
    gsl_vector *s = apop_query_to_vector("select in_per_capita from income "
            "where state= (select state from geography where name ='South Dakota')");
    apop_data *t = apop_t_test(n,s);
    apop_data_show(t); //show the whole output set...
    printf ("\n confidence: %g\n", apop_data_get(t, .rowname="conf.*2 tail")); //...or just one value.
}

2. F test
apop_data* apop_f_test ( apop_model * est,
                         apop_data * contrast
                       )

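Runs an F-test specified by q and c. Your best bet is to see the chapter on
hypothesis testing in Modeling With Data, p 309. It will tell you that:

$$\frac{N-K}{q}\;\frac{(Q'\hat\beta - c)'\,[Q'(X'X)^{-1}Q]^{-1}\,(Q'\hat\beta - c)}{u'u} \sim F_{q,\,N-K},$$

and that's what this function is based on.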

Parameters:
est       An apop_model that you have already calculated. (No default)
contrast  The matrix $Q$ and the vector $c$, where each row represents a
          hypothesis. (Defaults: if the matrix is NULL, it is set to the
          identity matrix with the top row missing. If the vector is NULL,
          it is set to a zero vector of length equal to the height of the
          contrast matrix. Thus, if the entire apop_data set is NULL or
          omitted, we are testing the hypothesis that all but $\beta_1$
          are zero.)
Returns:
An apop_data set with a few variants on the confidence with which we
can reject the joint hypothesis.
Todo:
There should be a way to get OLS and GLS to store $(X'X)^{-1}$. In fact, if you
did GLS, this is invalid, because you need $(X'\Sigma X)^{-1}$, and I didn't ask for $\Sigma$.

• There are two approaches to an F-test: the ANOVA approach, which is
  typically built around the claim that all effects but the mean are zero;
  and the more general regression form, which allows for any set of linear
  claims about the data. If you send a NULL contrast set, I will generate the
  set of linear contrasts that are equivalent to the ANOVA-type approach.
  Readers of Modeling with Data, note that there's a bug in the book
  that claims that the traditional ANOVA approach also checks that the
  coefficient for the constant term is zero; this is not the custom and
  doesn't produce the equivalence presented in that and other textbooks.

Exceptions:
out->error='a' Allocation error.
out->error='d' dimension-matching error.
out->error='i' matrix inversion error.
out->error='m' GSL math error.

#include "eigenbox.h"

int main(){
    /* one contrast: a vector element followed by one matrix row of three columns */
    double line[] = {0, 0, 0, 1};
    apop_data *constr = apop_line_to_data(line, 1, 1, 3);
    apop_data *d = query_data();
    apop_model *est = apop_estimate(d, apop_ols);
    apop_model_show(est);
    apop_data_show(apop_f_test(est, constr));
}

Practical No. 9 Drawing an Inference

Obtaining the mean, standard error and p value for the given data.

#include <apop.h>
void one_boot(gsl_vector *base_data, gsl_rng *r, gsl_vector* boot_sample);

/* Fill boot_sample by drawing from base_data with replacement. */
void one_boot(gsl_vector * base_data, gsl_rng *r, gsl_vector* boot_sample){
    for (int i =0; i< boot_sample->size; i++)
        gsl_vector_set(boot_sample, i,
            gsl_vector_get(base_data, gsl_rng_uniform_int(r, base_data->size)));
}

int main(){
    int rep_ct = 10000;
    gsl_rng *r = apop_rng_alloc(0);
    apop_db_open("data-census.db");
    gsl_vector *base_data = apop_query_to_vector("select in_per_capita from income "
            "where sumlevel+0.0 =40");
    double RI = apop_query_to_float("select in_per_capita from income "
            "where sumlevel+0.0 =40 and geo_id2+0.0=44");
    gsl_vector *boot_sample = gsl_vector_alloc(base_data->size);
    gsl_vector *replications = gsl_vector_alloc(rep_ct);
    for (int i=0; i< rep_ct; i++){
        one_boot(base_data, r, boot_sample);
        gsl_vector_set(replications, i, apop_mean(boot_sample));
    }
    double stderror = sqrt(apop_var(replications));
    double mean = apop_mean(replications);
    printf("mean: %g; standard error: %g; (RI-mean)/stderr: %g; p value: %g\n",
           mean, stderror, (RI-mean)/stderror,
           2*gsl_cdf_gaussian_Q(fabs(RI-mean), stderror));
}

Practical No. 10 Implement Non-parametric Testing

1. Anova

apop_data* apop_anova ( char * table,
                        char * data,
                        char * grouping1,
                        char * grouping2
                      )

This function produces a traditional one- or two-way ANOVA table.

It works from data in an SQL table, using queries of the form select data
from table group by grouping1, grouping2.

Parameters:
table      The table to be queried. Anything that can go in an SQL from
           clause is OK, so this can be a plain table name or a temp table
           specification like (select ... ), with parens.
data       The name of the column holding the count or other such data.
grouping1  The name of the first column by which to group data.
grouping2  If this is NULL, then the function will return a one-way ANOVA.
           Otherwise, the name of the second column by which to group data
           in a two-way ANOVA.

#include <apop.h>

int main(){
    apop_db_open("data-metro.db");
    char joinedtab[] = "(select year, riders, line \
            from riders, lines \
            where riders.station = lines.station)";
    apop_data_show(apop_anova(joinedtab, "riders", "line", "year"));
}

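To show what the F statistic in a one-way ANOVA table is built from, the plain-C sketch below computes it for data held in a flat array with one group label per value. It is an illustration under assumptions (a hypothetical function name, labels numbered 0 to k-1, every group non-empty) and does not reproduce the table layout that apop_anova returns.

```c
#include <stdlib.h>

/* One-way ANOVA F statistic.
   values[i] belongs to group labels[i], with labels in 0..k-1.
   Returns (SS_between/(k-1)) / (SS_within/(n-k)),
   or -1 on allocation failure. */
double anova_f(const double *values, const int *labels, int n, int k) {
    double *sum = calloc(k, sizeof *sum);
    int *count = calloc(k, sizeof *count);
    if (!sum || !count) { free(sum); free(count); return -1.0; }

    /* group totals, group sizes and the grand mean */
    double grand = 0.0;
    for (int i = 0; i < n; i++) {
        sum[labels[i]] += values[i];
        count[labels[i]]++;
        grand += values[i];
    }
    grand /= n;

    /* variation of group means around the grand mean */
    double ss_between = 0.0, ss_within = 0.0;
    for (int g = 0; g < k; g++) {
        double gm = sum[g] / count[g];
        ss_between += count[g] * (gm - grand) * (gm - grand);
    }
    /* variation of each value around its own group mean */
    for (int i = 0; i < n; i++) {
        double gm = sum[labels[i]] / count[labels[i]];
        ss_within += (values[i] - gm) * (values[i] - gm);
    }
    free(sum);
    free(count);
    return (ss_between / (k - 1)) / (ss_within / (n - k));
}
```

For three made-up groups {1, 2, 3}, {2, 3, 4} and {3, 4, 5}, the between-group sum of squares is 6 on 2 degrees of freedom and the within-group sum of squares is 6 on 6 degrees of freedom, so the statistic is 3.
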
References
1. Modeling with Data, Ben Klemens, Princeton University Press
2. Computational Statistics, James E. Gentle, Springer
3. Computational Statistics, Second Edition, Geof H. Givens and
Jennifer A. Hoeting, Wiley
4. www.cygwin.com
5. http://apophenia.info/
