Addis Ababa University Center for African and Asian Studies
Assignment II
 Simple Linear Regression Analysis and Hypothesis Testing
   Partial Fulfillment for the Course Advanced Research Methods
                              (ASCM 701)
               Submitted to Dr. Kidist Gebreselassie
    Submitted by Maria Mamo, Yared Zekiros and Tewabe Tadesse
                         Outline
• Meaning, applications and examples of Student’s t-
  distribution, F-distribution, Chi² (X²) distribution; the
  t-test, F-test, X² (Chi²) test (i.e., X² goodness of fit test,
  other X² tests), ANOVA test and Likert Scale
• Exercise on Hypothesis Testing
           T-DISTRIBUTION
Developed in 1908 by William Sealy Gosset
Resembles the normal distribution but has heavier tails
It is data that follow a bell curve when plotted on a
graph
Population standard deviation is unknown
Greatest number of observations close to the mean
and fewer observations in the tails
It is bell-shaped and uni-modal (For example, a uni-
modal distribution could be a set of test scores
where most students scored around the same value,
resulting in a single peak in the distribution).
T-distribution chart
                   Cont’d
Its shape depends on the sample size n. As the
sample size n becomes larger, the t-distribution
gets closer to the standard normal distribution
Statistical analysis on some studies which can’t be
done using the normal distribution can be done
using the t-distribution
The t-statistic is typically used when n is less than 30
and the population standard deviation is unknown
                    Cont’d
Example (one sample t-test)
• Suppose we have a sample of 10 students, and
  we want to test if their average test score is
  significantly different from the population
  mean. The sample mean is 75, the population
  mean is 70, and the sample standard
  deviation is 8.
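 Remember: t = (x̄ - μ) / (s / √n), where x̄ is the sample
 mean, μ the population mean, s the sample standard
 deviation and n the sample size.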
                  Cont’d
  Step 1: Calculate the standard error of the
mean The standard error of the mean (SE) is
calculated as the sample standard deviation
divided by the square root of the sample size:
 SE = sample standard deviation / √(sample
size) = 8 / √(10) ≈ 2.53
                     Cont’d
Step 2: Calculate the t-statistic The t-statistic is
calculated as the difference between the
sample mean and the population mean,
divided by the standard error of the mean:
t = (sample mean - population mean) / SE =
(75 - 70) / 2.53 ≈ 1.98
                   Cont’d
  Step 3: Determine the degrees of freedom
The degrees of freedom for a t-test with a
sample size of 10 is 10 - 1 = 9.
 Step 4: Find the critical t-value If we want to
test at a 95% confidence level (two-tailed
test), we would need to find the critical t-value
for 9 degrees of freedom. Using a t-
distribution table or calculator, the critical t-
value is approximately ±2.262
                    Cont’d
  Step 5: Make a decision Since our calculated
 t-value (1.98) is less than the critical t-value
 (2.262), we would fail to reject the null
 hypothesis at the 95% confidence level.
 This means that we do not have enough
 evidence to conclude that the sample mean is
 significantly different from the population
 mean.
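The five steps above can be sketched in a few lines of Python. This is only an illustrative sketch: the function name one_sample_t is an assumption, and the critical value 2.262 is taken from the t-table as in Step 4 rather than computed.

```python
import math

def one_sample_t(sample_mean, pop_mean, sample_sd, n):
    """Return (standard error, t-statistic, degrees of freedom)."""
    se = sample_sd / math.sqrt(n)        # Step 1: standard error of the mean
    t = (sample_mean - pop_mean) / se    # Step 2: t-statistic
    df = n - 1                           # Step 3: degrees of freedom
    return se, t, df

# Values from the example: 10 students, sample mean 75, population mean 70, s = 8
se, t, df = one_sample_t(sample_mean=75, pop_mean=70, sample_sd=8, n=10)

# Step 4: critical value read from a t-table (two-tailed, alpha = 0.05, df = 9)
T_CRITICAL = 2.262

# Step 5: reject H0 only if |t| exceeds the critical value
reject_null = abs(t) > T_CRITICAL
print(f"SE = {se:.2f}, t = {t:.2f}, df = {df}, reject H0: {reject_null}")
# prints: SE = 2.53, t = 1.98, df = 9, reject H0: False
```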
                        Cont’d
Example (two sample t-test)
Let's say we have two samples of data:
Sample 1: 5, 7, 9, 11, 13
Sample 2: 6, 8, 10, 12, 14
   We want to calculate the t-statistic for these two
  samples using the following formula:
  t = (x̄1 - x̄2) / √((s1² / n1) + (s2² / n2))
  Where: x̄1 and x̄2 are the means of Sample 1 and
  Sample 2, respectively; s1 and s2 are the standard
  deviations of Sample 1 and Sample 2, respectively; n1
  and n2 are the sample sizes of Sample 1 and Sample 2,
  respectively
                      Cont’d
    First, let's calculate the means and standard
 deviations of the two samples:
Mean of Sample 1 : (x̄1) = (5 + 7 + 9 + 11 + 13) / 5 =
9
Mean of Sample 2 : (x̄2) = (6 + 8 + 10 + 12 + 14) / 5
= 10
  Standard deviation of Sample 1 : (s1) = √((1/4) *
((5-9)² + (7-9)² + (9-9)² + (11-9)² + (13-9)²)) = √10 ≈ 3.16
Standard deviation of Sample 2 : (s2) = √((1/4) *
((6-10)² + (8-10)² + (10-10)² + (12-10)² + (14-10)²))
 = √10 ≈ 3.16
                     Cont’d
Now, we can plug these values into the t-statistic
formula:
  t = (9 - 10) / √((3.16² / 5) + (3.16² / 5))
  t = -1 / √((10 / 5) + (10 / 5))
  t = -1 / √(2 + 2)
  t = -1 / √4
  t = -1 / 2
  t = -0.5
So, the t-statistic for these two samples is -0.5.
                       Cont’d
To find the critical t-value, we need to know the
degrees of freedom and the desired confidence level.
For this example, let's assume a 95% confidence level
and calculate the critical t-value for a two-tailed test.
 First, we need to calculate the degrees of freedom
(df) using the formula:
      df = (n1 + n2) - 2
In this case, the sample size for both samples is 5, so:
     df = (5 + 5) - 2
     df = 8
                      Cont’d
  Next, we can find the critical t-value using a t-
 distribution table or statistical software. For
 a 95% confidence level and 8 degrees of
 freedom, the critical t-value is approximately
 ±2.306.
      Since our calculated t-value (-0.5) is
 within the range of the critical t-values (-2.306,
 2.306), we would fail to reject the null
 hypothesis at the 95% confidence level.
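As a cross-check, the two-sample calculation can be redone from the raw data. The sketch below uses Python's standard statistics module, whose variance function divides by n - 1 as in the hand calculation. The critical value 2.306 is the table value for 8 degrees of freedom and is entered by hand, not computed.

```python
import math
import statistics

sample_1 = [5, 7, 9, 11, 13]
sample_2 = [6, 8, 10, 12, 14]

mean_1, mean_2 = statistics.mean(sample_1), statistics.mean(sample_2)
# statistics.variance uses the n - 1 denominator (sample variance)
var_1, var_2 = statistics.variance(sample_1), statistics.variance(sample_2)
n_1, n_2 = len(sample_1), len(sample_2)

# t = (mean_1 - mean_2) / sqrt(var_1 / n_1 + var_2 / n_2)
t_stat = (mean_1 - mean_2) / math.sqrt(var_1 / n_1 + var_2 / n_2)
df = n_1 + n_2 - 2

T_CRITICAL = 2.306  # t-table value, two-tailed, alpha = 0.05, df = 8
reject_null = abs(t_stat) > T_CRITICAL
print(f"s1^2 = {var_1}, s2^2 = {var_2}, t = {t_stat:.3f}, df = {df}, reject H0: {reject_null}")
```

Because both samples have the same size and the same variance, this formula and the pooled-variance form give the same t-statistic here.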
           F-DISTRIBUTION
Named in honor of R.A. Fisher who studied it in 1924
Ronald Aylmer Fisher was a prominent British
statistician and geneticist. In 1924, Fisher introduced
the F-distribution as part of his work on statistical
hypothesis testing and analysis of variance.
Used for comparing the variances of two populations
The F-distribution is either zero or positive, so there
are no negative values for F
Used for smaller sample sizes, where the variance in
the data is unknown
                   Cont’d
It is skewed to the right and its shape depends on the
degrees of freedom
It is a continuous probability distribution defined by
two degrees-of-freedom parameters
The F-distribution is always non-negative
The F-distribution has two parameters: degrees of
freedom for the numerator and degrees of freedom
for the denominator
The F-distribution is used to test the equality of
variances in two populations
It is also used in the F-test to compare the variances
of two samples
                    Cont’d
• The F-distribution is commonly used in
  statistics to test hypotheses about population
  variances and to compare the fits of different
  statistical models
• In regression analysis, the F-distribution is
  used to test the overall significance of the
  model and the significance of individual
  regression coefficients
                   Cont’d
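Note: unlike the right-skewed F-distribution, the t-distribution
is symmetric. It gives a lower probability to the center and a
higher probability to the tails than the standard
normal distribution, which makes it a more conservative
form of the standard normal distribution.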
               F-distribution chart
                          Cont’d
F Distribution can be used for several types of
applications, including:
• Testing hypotheses about the equality of two
    population variances
• Testing the validity of a multiple regression equation
• Comparing the fit of different models in regression
    analysis
• Constructing confidence intervals and testing
    hypotheses about population variances
• One-factor analysis of variance (ANOVA)
                      Cont’d
Example
Suppose we want to compare the variances of two
samples, Sample 1 and Sample 2, to determine if
they are significantly different. We can use the F-
distribution to perform this comparison.
Let's assume the following data:
  Sample 1 variance (s1²) = 10
  Sample 2 variance (s2²) = 5
  Sample 1 degrees of freedom (df1) = 5
  Sample 2 degrees of freedom (df2) = 8
                    Cont’d
Now, we can calculate the F-statistic as the ratio of
the two sample variances, with the larger variance in
the numerator:
      F = s1² / s2²
      F = 10 / 5
      F = 2
                      Cont’d
Conclusion
• The calculated F-statistic is 2.
• We compare this value to the critical F-value
  based on the degrees of freedom and the
  desired significance level.
• If the calculated F-statistic is greater than the
  critical F-value, we would reject the null
  hypothesis and conclude that the variances of
  the two samples are significantly different.
                      Cont’d
• If the calculated F-statistic is less than the
   critical F-value, we would fail to reject the
   null hypothesis, indicating that there is no
   significant difference in the variances of the
   two samples.
In this case, assuming a significance level of 0.05
(commonly used in statistics), we can refer to an F-
distribution table or use statistical software to
find the critical F-value for degrees of freedom 5
and 8 at a 0.05 significance level.
                      Cont’d
With a significance level of 0.05, the critical F-
value for degrees of freedom 5 and 8 is
approximately 3.69, and the calculated F-value
is 2. Comparing the calculated F-value to the
critical F-value, we can make the following
conclusion regarding the null hypothesis:
                      Cont’d
Since the calculated F-value (2) is less than
 the critical F-value (3.69), we fail to reject the
 null hypothesis.
This indicates that there is no significant
 difference in the variances of the two
 samples at the 0.05 significance level.
 Therefore, based on this analysis, we do not
 have enough evidence to conclude that the
 variances of the two samples are significantly
 different.
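A minimal sketch of the variance-ratio test, assuming the standard form in which F is the larger sample variance divided by the smaller one. The helper name variance_ratio_f is hypothetical, and the critical value (about 3.69 for 5 and 8 degrees of freedom at the 0.05 level) is a table value entered by hand.

```python
def variance_ratio_f(var_a, df_a, var_b, df_b):
    """Return (F, numerator df, denominator df) with the larger variance on top."""
    if var_a >= var_b:
        return var_a / var_b, df_a, df_b
    return var_b / var_a, df_b, df_a

# Values from the example: s1^2 = 10 (df = 5), s2^2 = 5 (df = 8)
f_stat, df_num, df_den = variance_ratio_f(10.0, 5, 5.0, 8)

F_CRITICAL = 3.69  # F-table value for alpha = 0.05 with (5, 8) degrees of freedom
reject_null = f_stat > F_CRITICAL
print(f"F = {f_stat:.2f} with ({df_num}, {df_den}) df, reject H0: {reject_null}")
```

The degrees of freedom do not enter the statistic itself; they only determine which critical value to look up.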
         X²-DISTRIBUTION
Takes only positive values
Skewed to the right
Specified by giving its degrees of freedom
    Chi-square chart
                                      T-TEST
Compares the means of data gathered from two similar
   or different groups
T-test is applicable for a smaller sample size
 Only valid and should be done when the mean or
   average of only two categories or groups needs
   to be compared
   Assumptions of T-Test
      •   The measurement scale of the data is continuous or ordinal.
      •   The test is based on random sampling: each observation is drawn at random
          from the population and is independent of the others.
      •   When plotted, the data should approximately follow a normal distribution,
          giving a bell-shaped curve.
      •   For a clearer bell curve, the sample size needs to be bigger.
      •   The variances should be homogeneous, i.e., the standard deviations of the
          samples are almost equal.
      •   There should be no extreme outliers in the differences.
        Example – One Sample T-Test
• A claim is made that the average number of days a
  person spends on vacation is more than 5 days (the
  hypothesized population mean). The claim is tested with a
  sample of 16 people whose mean came out to be 9 days.
  – Null Hypothesis (H0): The average number of days a
    person spends on vacation is equal to 5 days.
    Mathematically, H0: μ=5.
  – Alternative Hypothesis (Ha): The average number of
    days a person spends on vacation is more than 5
    days. Mathematically, Ha: μ>5.
•   A sample of 16 persons is taken. The mean number of days spent on vacation by the
    persons in the sample is found to be 9 days, with a sample standard deviation of
    3 days. The confidence level is 95%.
•   Formula:
                      x̄ = 9, μ = 5, s = 3, n = 16
                      t =(9-5)/(3/ √16) = 5.33
•   The critical t-value for a one-tailed test with
    n - 1 = 16 - 1 = 15 degrees of freedom at
    the alpha level of 0.05 is 1.753.
•   Comparing the calculated t-value (5.33) with the
    critical t-value (1.753), we can make the following
    conclusions about the null hypothesis:
Conclusion: Since the calculated t-value
(5.33) is greater than the critical t-value
(1.753), there is sufficient evidence to reject
the null hypothesis at the 0.05 significance level.
Interpretation: There is a statistically
significant difference between the sample
mean and the hypothesized population
mean, and the sample provides enough
evidence to support the claim that the
average number of days a person spends
on vacation is more than 5 days.
                    F-TEST
• The F-test is a statistical test that is used to
  compare the variances of two or more groups
  or samples. It is based on the F-distribution,
  which is a probability distribution that arises
  when comparing the variability between
  groups to the variability within groups
                      Cont’d
Purposes of F-Test
• Testing equality of variances: When
  comparing two or more groups, the F-test can
  be used to determine if their variances are
  statistically equal. This is useful, for example,
  in assessing whether different treatments or
  interventions have similar levels of variability.
                      Cont’d
• Comparing means: In certain situations, such
  as in analysis of variance (ANOVA), the F-test
  can be utilized to compare means across
  multiple groups.
  This involves using a "between-groups"
  estimate of variance and a "within-groups"
  estimate of variance to calculate an F-statistic.
                Chi Square-TEST
• Used to determine whether the association between two
  qualitative variables is statistically significant.
Example
• A survey was conducted among randomly selected individuals in a
  shopping mall to determine if educational attainment is related to
  gender.
• First organize the data file into
  cross-tabulation of the two
  qualitative (nominal) variables to
  obtain the frequencies for each
  category, which can be done using
  statistical software, especially for a
  very large sample.
• Formulate the hypotheses
   Null Hypothesis:
   – H0: There is no significant association between gender and education
      level.
   Alternative Hypothesis:
   – Ha: There is a significant association between gender and education
      level.
• Specify the expected values for each cell of the table (when the null
  hypothesis is true)
  The expected values specify what the values of each cell of the table
  would be if there was no association between the two variables.
•   To see if the data give convincing evidence against the null hypothesis,
    compare the observed counts from the sample with the expected counts,
    assuming H0 is true.
    Statistical software such as SPSS, Datatab etc…will compute both the
    expected and observed counts for each cell when conducting a chi-square
    test.
•   Compute the chi-square test statistic.
     Chi-Square Test – Test Statistic
    If these values are entered into the formula for the chi-square test statistic, the
    value obtained is 0.504.
•   Decide if chi-square is statistically significant
     – The final step of the chi-square test of significance is to determine if the value
        of the chi-square test statistic is large enough to reject the null hypothesis.
     – The significance level is 5%, i.e., the chosen alpha level is 0.05.
     – Statistical software makes this determination much easier. X² = 0.504
                        Result
                                    Chi2                 0.504
                                    df                   3
                                    p-value              0.918
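    Interpretation: A Chi-Square test was performed between gender and education level.
    No expected cell frequencies were less than 5. Since the p-value (0.918) is greater
    than 0.05, we fail to reject the null hypothesis: there is no statistically
    significant association between gender and education level.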
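The chi-square procedure (expected counts, test statistic, degrees of freedom) can be sketched as follows. The survey's cross-tabulation is not reproduced in these slides, so the counts below are hypothetical and chosen only to illustrate the mechanics; they are not the survey data and do not reproduce the value 0.504. The function name is also an assumption.

```python
def chi_square_independence(observed):
    """Return (chi-square statistic, degrees of freedom, expected counts)."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    grand_total = sum(row_totals)

    chi2 = 0.0
    expected = []
    for i, row in enumerate(observed):
        expected_row = []
        for j, obs in enumerate(row):
            # Expected count under H0 (no association between the variables)
            exp = row_totals[i] * col_totals[j] / grand_total
            expected_row.append(exp)
            chi2 += (obs - exp) ** 2 / exp
        expected.append(expected_row)

    df = (len(observed) - 1) * (len(observed[0]) - 1)
    return chi2, df, expected

# HYPOTHETICAL counts: rows = gender (2 categories), columns = 4 education levels
observed = [
    [10, 20, 30, 40],
    [20, 20, 30, 30],
]
chi2, df, expected = chi_square_independence(observed)

CHI2_CRITICAL = 7.815  # chi-square table value for alpha = 0.05, df = 3
reject_null = chi2 > CHI2_CRITICAL
print(f"chi2 = {chi2:.3f}, df = {df}, reject H0: {reject_null}")
```

A 2 x 4 table gives (2 - 1) x (4 - 1) = 3 degrees of freedom, the same df as in the result table above.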
                   ANOVA TEST
• Used to analyze whether there are statistically significant
  differences among the means of three or more groups. It is
  often used to compare means across different levels of a
  categorical variable.
• It cannot tell you which specific groups were statistically
  significantly different from each other, only that at least two of
  the groups were.
• Example: customer satisfaction ratings by level of employee training
  (beginner, intermediate and advanced).
   – Null hypothesis: the mean rating is the same for all employee categories
   – Alternative hypothesis: the mean rating differs for at least one employee
      category
             Types of ANOVA Test
• One-way ANOVA: testing differences between three or more groups
  based on one independent variable. Example: comparing the sales
  performance of different stores in a retail chain.
• Two-way ANOVA: two independent variables. Example: the impact of both
  advertising spend and product placement on sales revenue.
• Factorial ANOVA: more than two independent variables. Example: a business
  might examine the combined effects of age, income and education level on
  consumer purchasing habits.
• Welch’s F-test ANOVA: used when the assumption of equal variances is not
  met. Example: a company might use it to compare the job satisfaction levels
  of employees in different departments, where each department has a
  different variance in job satisfaction scores.
    Assumptions of ANOVA Test
 Normality: The first assumption is that the groups each fall
  into what is called a normal distribution. This means that the
  groups should have a bell-curve distribution with few or no
  outliers.
 Homogeneity of variance: Also known as homoscedasticity,
  this means that the variances between each group are the
  same.
 Independence: The final assumption is that each value is
  independent from each other. This means, for example, that
  unlike a conjoint analysis the same person shouldn’t be
  measured multiple times.
          Example of ANOVA Test
Let's consider a simple example to demonstrate the calculations for a one-
way ANOVA by hand. Suppose we have three groups of participants, each
following a different workout program, and we want to compare their
average weight loss. The data is as follows:
 • Group A: 10, 12, 15, 11, 13 (Sample size = 5)
 • Group B: 8, 9, 11, 10, 12 (Sample size = 5)
 • Group C: 6, 7, 9, 8, 10 (Sample size = 5)
Step 1: Calculate the mean for each group
 • Mean of Group A = (10 + 12 + 15 + 11 + 13) / 5 = 12.2
 • Mean of Group B = (8 + 9 + 11 + 10 + 12) / 5 = 10
 • Mean of Group C = (6 + 7 + 9 + 8 + 10) / 5 = 8
Step 2: Calculate the overall mean (Grand Mean)
•Grand Mean = (12.2 + 10 + 8) / 3 = 10.07
Step 3: Calculate the Sum of Squares Total (SST)
 – SST = (10-10.07)² + (12-10.07)² + (15-10.07)² + (11-10.07)² + (13-10.07)² + (8-10.07)²
   + (9-10.07)² + (11-10.07)² + (10-10.07)² + (12-10.07)² + (6-10.07)² + (7-10.07)²
   + (9-10.07)² + (8-10.07)² + (10-10.07)² = 78.93
Step 4: Calculate the Sum of Squares Between (SSB)
•SSB = 5 * (12.2 - 10.07)² + 5 * (10 - 10.07)² + 5 * (8 - 10.07)² = 44.13
Step 5: Calculate the Sum of Squares Within (SSW)
 – SSW = (10-12.2)² + (12-12.2)² + (15-12.2)² + (11-12.2)² + (13-12.2)² + (8-10)²
   + (9-10)² + (11-10)² + (10-10)² + (12-10)² + (6-8)² + (7-8)² + (9-8)²
   + (8-8)² + (10-8)² = 34.8
 – Check: SSB + SSW = 44.13 + 34.8 = 78.93 = SST
Step 6: Calculate the Degrees of Freedom
 • Degrees of Freedom (df) between = k - 1 = 3 - 1 = 2
 • Degrees of Freedom (df) within = N - k = 15 - 3 = 12
 • Degrees of Freedom (df) total = N - 1 = 15 - 1 = 14
Step 7: Calculate the Mean Squares
 • Mean Square (MS) between = SSB / df between = 44.13 / 2 = 22.07
 • Mean Square (MS) within = SSW / df within = 34.8 / 12 = 2.9
Step 8: Calculate the F-Statistic
    •F = MS between / MS within = 22.07 / 2.9 = 7.61
Step 9: Compare to Critical Value
 • We compare the calculated F-value to the critical F-value for the
   chosen significance level and degrees of freedom.
 • In this example, at the 0.05 significance level with 2 and 12 degrees of
   freedom, the critical F-value from an F-distribution table is approximately
   3.89. Since the calculated F-statistic (7.61) is greater than 3.89, we reject
   the null hypothesis: there are significant differences in the mean weight
   loss between the workout programs.
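The hand calculation for the one-way ANOVA can be checked with a short script. This is a sketch using only the three groups given in the example. The critical value of about 3.89 for 2 and 12 degrees of freedom at the 0.05 level is a table value entered by hand.

```python
groups = {
    "A": [10, 12, 15, 11, 13],
    "B": [8, 9, 11, 10, 12],
    "C": [6, 7, 9, 8, 10],
}

all_values = [x for values in groups.values() for x in values]
n_total = len(all_values)          # N = 15
k = len(groups)                    # k = 3 groups
grand_mean = sum(all_values) / n_total

def mean(values):
    return sum(values) / len(values)

# Between-group, within-group and total sums of squares
ssb = sum(len(v) * (mean(v) - grand_mean) ** 2 for v in groups.values())
ssw = sum((x - mean(v)) ** 2 for v in groups.values() for x in v)
sst = sum((x - grand_mean) ** 2 for x in all_values)

ms_between = ssb / (k - 1)
ms_within = ssw / (n_total - k)
f_stat = ms_between / ms_within

F_CRITICAL = 3.89  # F-table value for alpha = 0.05 with (2, 12) degrees of freedom
reject_null = f_stat > F_CRITICAL
print(f"SSB = {ssb:.2f}, SSW = {ssw:.2f}, SST = {sst:.2f}, F = {f_stat:.2f}")
print(f"Reject H0: {reject_null}")
```

A useful check in any ANOVA table is that SSB + SSW equals SST; if it does not, one of the sums of squares contains an arithmetic error.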
              LIKERT SCALE
• A psychometric response scale
• Used in questionnaires to obtain participant’s preferences or
  degree of agreement with a statement or set of statements.
• Indicating level of agreement with a given statement by way
  of an ordinal scale
• Used to measure peoples’ attitudes, opinions, or perceptions.
• Used in psychology, sociology, education, marketing research
  etc.
• Examples: customer satisfaction, public opinion research,
  brand affinity, political beliefs, etc.
       Types of LIKERT SCALE
By question:
1. Agree to Disagree Likert Scale: Strongly Disagree/Disagree/Neither
   agree nor disagree/Agree/Strongly Agree
2. Satisfaction      Likert     Scale:     Very     dissatisfied/Somewhat
   dissatisfied/Neither dissatisfied or satisfied/Somewhat satisfied/Very
   satisfied
3. Likelihood Likert Scale: Very unlikely/Somewhat unlikely/Neither likely
   nor unlikely/Somewhat likely/Very likely
4. Good to bad Likert Scale: Very poor/Poor/Average/Good/Excellent
5. Frequency Likert Scale: Never/Rarely/Sometimes/Often/Always
By number:
6. Even Likert Scale: 4- and 8-point Likert scales
7. Odd Likert Scale: 5-, 7- and 9-point Likert scales
          Example of LIKERT SCALE
• A bank wants to know the customer satisfaction with its newly introduced
  ATM machine. It administered the following questionnaire to 100 users
  of the new ATM at the place where it is installed and compared the results
  with their satisfaction with the machine it has replaced. Customer rating of
  the old machine, using a 5-point Likert Scale question, was 35% very poor
  and poor, 50% average and 15% good and excellent.
• Using the same tool the survey found the following result in the table.
Question                                    1        2        3       4         5
                                        Very poor   Poor   Average   Good   Excellent
How would you rate the service of the      9        12       42       30       7
new ATM machine?
Total %                                    9        12       42       30       7
                           Cont.
Conclusion: The finding of the survey shows that out of the 100
customers 9% rated the service of the new ATM as very poor, 12%
as poor, 42% as average, 30% as good and 7% as excellent.
The proportion of customers who rated the new machine as
generally poor is lower (21%) than for the old machine (35%), and
the share rating it as average is also lower (42% vs 50%). In turn,
37% rated the new machine as good or excellent, compared to 15%
for the old machine.
Interpretation: A larger portion of the customers found the new
machine to perform better than the old one. Therefore, it was a
good decision to replace the old machine with the new one.
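The comparison in the conclusion can be reproduced by collapsing the five response categories into three bands. The sketch below uses the counts from the table and the old-machine percentages quoted in the example; the band names ("poor", "average", "good") are labels chosen here for illustration.

```python
# Responses for the new ATM (scale point -> number of users), from the table
new_counts = {1: 9, 2: 12, 3: 42, 4: 30, 5: 7}

# Percentages reported for the old machine, already grouped into three bands
old_percent = {"poor": 35, "average": 50, "good": 15}

total = sum(new_counts.values())
new_percent = {
    "poor": 100 * (new_counts[1] + new_counts[2]) / total,     # very poor + poor
    "average": 100 * new_counts[3] / total,
    "good": 100 * (new_counts[4] + new_counts[5]) / total,     # good + excellent
}

# Change in percentage points, new machine minus old machine
change = {band: new_percent[band] - old_percent[band] for band in old_percent}

for band in old_percent:
    print(f"{band:8s} old {old_percent[band]:5.1f}%  new {new_percent[band]:5.1f}%  change {change[band]:+.1f}")
```

Grouping into bands keeps the comparison on the ordinal scale; it does not assume equal distances between the scale points.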
   Exercise on Hypothesis Testing
•Drawing scatterplots, estimating the best-fit line, testing hypothesis and
making predictions
A. Take a sample data of at least 30 observations for two related variables,
   ‘Y’ and ‘x’ (Y is a dependent variable and ‘x’ is a potential explanatory
   variable). The data could be secondary or primary. Indicate which
   variable is dependent and which one is explanatory.
B. State the source of your idea for the potential relationship between ‘x’
   and ‘y’ and cite relevant sources.
C. Present the data in tabular form, including the units of measurement for
   each variable, and state the type of data (as cross-sectional, time series,
   pooled), and the source of data.
                                       Cont’d
•   Cross-sectional data for the year 2019 on gross domestic product (GDP) per capita
    and film production of 38 countries, compiled by the United Nations Educational,
    Scientific and Cultural Organization (UNESCO), were used.
•   Source: UNESCO (2021). The African Film Industry: Trends, Challenges and
    Opportunities for Growth. UNESCO.
•   N.B. The total number of films produced is 10,204.
•   GDP AND FILM PRODUCTION OF SELECTED AFRICAN COUNTERIES IN 2019.docx
•   The selection of the dependent and independent variables is based on the Keynesian
    Economic Theory of Consumption and Income.
•   The theory states that income positively affects the consumption pattern. In
    other words, greater income in the hands of people leads to increased
    consumption, even if the increase won’t be as much as the increase in income.
•   Therefore, film production, taken as a proxy indicator of film consumption or
    general consumption, is the dependent variable, whereas GDP per capita
    is taken as a proxy indicator for income.
    D. Plot the data onto a scatterplot. What does the pattern of points suggest to you
              about the nature of the relationship between the two variables?
    Scatterplot: Film Production (vertical axis) against GDP (horizontal axis)
•     The data show that the 38 countries produced a total of 10,204 films.
•     Most of the data points are clustered near the origin, indicating that there are many
      countries with low GDP and low film production.
•     There are a few scattered points extending out towards higher GDP and higher film
      production values.
•     This graph implies that there is some positive relationship between the two variables.
 E. Calculate the mean, median, range, standard deviation, standardized inter-quartile
 deviation, correlation coefficient, and coefficient of variation for the given data on the
  dependent and explanatory variables and interpret the results. Does the sign of the
estimated correlation coefficient confirm the pattern of relationship depicted under ‘d’?
                                      Descriptive Statistics
                                             Mean
                                                        N           Mean
                 FILM_PRODUCTION                               38      268.53
                 GDP                                           38    5618.026
                 Valid N (listwise)                            38
   •   These mean values provide a central tendency measure for the
       FILM_PRODUCTION and GDP variables in the dataset.
   •   The mean FILM_PRODUCTION value of 268.53 suggests the average level
       of film production, while the mean GDP value of 5618.026 represents the
       average level of GDP across the observations.
                              Median
                             Descriptive Statistics
                                   Median
                                      FILM_PRODUCTION          GDP
        N          Valid                         38             38
                   Missing                        0              0
        Median                                58.50       3081.500
• The median FILM_PRODUCTION value is 58.50, indicating that half of the
  observations have a FILM_PRODUCTION value of 58.50 or lower, and the
  other half have a value of 58.50 or higher.
• Similarly, the median GDP value is 3,081.500, suggesting that half of the
  observations have a GDP value of 3081.500 or lower, and the other half
  have a value of 3,081.500 or higher.
                                         Range
                                  Descriptive Statistics
                                         Range
                                                    N           Range
             FILM_PRODUCTION                               38        2589
             GDP                                           38     28742.0
             Valid N (listwise)                            38
•   The range for FILM_PRODUCTION is 2,589, indicating the difference between the
    highest and lowest values in the dataset for FILM_PRODUCTION.
•   The range for GDP is 28,742.0, representing the difference between the highest
    and lowest values in the GDP dataset.
                     Standard deviation
                                  Descriptive Statistics
                                  Standard deviation
                                                N          Std. Deviation
             FILM_PRODUCTION                         38            543.125
             GDP                                     38         6049.4112
             Valid N (listwise)                      38
•The standard deviation for FILM_PRODUCTION is 543.125, indicating the average
amount of variation or dispersion of the FILM_PRODUCTION values around the mean.
A higher standard deviation suggests that the values are more spread out from the
mean.
•The standard deviation for GDP is 6,049.4112 which suggests the average amount of
variation or dispersion of the GDP values around the mean.
Standard inter-quartile deviation
                                            Descriptive Statistics
                                      Standard inter-quartile deviation
                                                    FILM_PRODUCTION            GDP
              N             Valid                                         38              38
                            Missing                                        0               0
              Percentiles   25                                        28.75          1700.000
                            50                                        58.50          3081.500
                            75                                       188.75          7808.250
•The IQR provides a measure of the spread of the middle 50% of the data.
    A larger IQR indicates a greater spread of values within that middle 50%.
•The IQR for FILM_PRODUCTION is the difference between Q3 and Q1:
    IQR = Q3 - Q1 = 188.75 - 28.75 = 160
•The IQR for GDP is the difference between Q3 and Q1:
    IQR = Q3 - Q1 = 7808.250 - 1700.000 = 6108.250
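The inter-quartile range and the coefficient of variation asked for under ‘E’ can be computed directly from the summary statistics reported in the tables. This is a sketch based only on those reported values, not on the raw data, and it assumes the usual definition CV = (standard deviation / mean) x 100.

```python
# Summary statistics copied from the descriptive tables above
summary = {
    "FILM_PRODUCTION": {"mean": 268.53, "sd": 543.125, "q1": 28.75, "q3": 188.75},
    "GDP": {"mean": 5618.026, "sd": 6049.4112, "q1": 1700.0, "q3": 7808.25},
}

results = {}
for name, s in summary.items():
    iqr = s["q3"] - s["q1"]               # spread of the middle 50% of the data
    cv = 100 * s["sd"] / s["mean"]        # coefficient of variation, in percent
    results[name] = {"iqr": iqr, "cv": cv}
    print(f"{name}: IQR = {iqr:.2f}, CV = {cv:.1f}%")
```

A coefficient of variation above 100% means the standard deviation exceeds the mean, which is consistent with the few very large values seen in the scatterplot.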
                                 Correlations
                                                    Correlations
                                                                   FILM_PRODUCTION       GDP
               FILM_PRODUCTION    Pearson Correlation                                1         .055
                                  Sig. (2-tailed)                                              .741
                                  N                                              38             38
               GDP                Pearson Correlation                           .055             1
                                  Sig. (2-tailed)                               .741
                                  N                                              38             38
•   The correlation between FILM_PRODUCTION and GDP is 0.055. The p-value for this correlation
    is 0.741.
•   A correlation of 0.055 suggests a very weak positive relationship between FILM_PRODUCTION
    and GDP. Additionally, the p-value of 0.741 indicates that this correlation is not statistically
    significant at the conventional significance level of 0.05.
•   In summary, based on these results, there is no strong evidence to suggest a significant linear
    relationship between FILM_PRODUCTION and GDP in this dataset.
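The significance of a Pearson correlation can also be checked by hand with the statistic t = r√(n - 2) / √(1 - r²), which has n - 2 degrees of freedom. The sketch below applies it to the reported r = 0.055 and n = 38. The function name is an assumption, and the critical value of about 2.028 for 36 degrees of freedom is a table value entered by hand.

```python
import math

def correlation_t(r, n):
    """Return (t-statistic, degrees of freedom) for testing H0: no linear correlation."""
    df = n - 2
    t = r * math.sqrt(df) / math.sqrt(1 - r ** 2)
    return t, df

t_stat, df = correlation_t(r=0.055, n=38)

T_CRITICAL = 2.028  # t-table value, two-tailed, alpha = 0.05, df = 36
significant = abs(t_stat) > T_CRITICAL
print(f"t = {t_stat:.2f}, df = {df}, significant at 0.05: {significant}")
```

The resulting t of about 0.33 is far below the critical value, in line with the reported p-value of 0.741.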
F. Assuming a linear relationship between the variables ‘x’ and ‘y’, and normal
distribution, estimate the linear regression equation (i.e., the best-fit line) depicting ‘y’
as a function of ‘x’ for the given sample data (do this by making use of one of the
software packages and annex the software output).
         . regress FILM_PRODUCTION GDP
               Source         SS           df         MS       Number of obs    =         38
                                                               F(1, 36)         =       0.11
                Model    33400.4894         1     33400.4894   Prob > F         =     0.7415
             Residual      10881029        36     302250.805   R-squared        =     0.0031
                                                               Adj R-squared    =    -0.0246
                Total    10914429.5        37      294984.58   Root MSE         =     549.77
         FILM_PRODU~N   Coefficient   Std. err.       t    P>|t|     [95% conf. interval]
                  GDP     .0049666    .0149407      0.33   0.741    -.0253344       .0352677
                _cons     240.6236     122.472      1.96   0.057    -7.761118       489.0084
                     Linear Regression
F 1. State the equation
Y = β0 + β1X1
where Y is the dependent variable (FILM_PRODUCTION),
β0 is the constant (intercept),
β1 is the coefficient of GDP, and
X1 is GDP.
From the regression output above, the estimated regression equation is
Y = 240.6236 + 0.0049666 GDP
                                Interpretation
•   The coefficient for "GDP" is 0.0049666, indicating that for every one-unit increase in GDP,
    there is a predicted increase of approximately 0.0049666 units in FILM_PRODUCTION. However,
    since the p-value associated with this coefficient is 0.741 (P>|t|), which is greater than
    the typical significance level of 0.05, we fail to reject the null hypothesis that there is
    no relationship between GDP and FILM_PRODUCTION.
•   The constant term, represented by "_cons", has a coefficient of 240.6236 with a p-value of
    0.057. This suggests that when GDP is zero (or very close to zero), the estimated mean value
    of FILM_PRODUCTION would be around 240.6236.
•   The R-squared value for this model is low at 0.0031, indicating that only about 0.31% of the
    variability in FILM_PRODUCTION can be explained by changes in GDP.
                                        T-Test
•   The t-test in this regression output is used to test the null hypothesis that the coefficient
    for "GDP" is equal to zero. Here are the relevant values for the t-test:
         The coefficient for "GDP" is 0.0049666.
         The standard error for this coefficient is 0.0149407.
         The t-value, which measures how many standard errors the coefficient is away
         from zero, is calculated as coefficient / standard error.
         In this case, 0.0049666 / 0.0149407 is approximately 0.33.
•   The p-value associated with this t-value, labeled as P>|t|, is 0.741.
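The ratio described above, and the 95% confidence interval reported in the regression output, can be reproduced with a few lines. This is a minimal sketch: the critical value 2.0281 is an assumed t-table value for 36 degrees of freedom (two-tailed, 0.05 level), and the variable names are illustrative.

```python
coef = 0.0049666   # coefficient of GDP from the regression output
se = 0.0149407     # its standard error

# Assumed two-tailed 5% critical value for 36 df (n - 2 = 38 - 2), from a t-table
T_CRIT_36 = 2.0281

# t-value: number of standard errors the coefficient lies away from zero
t_value = coef / se

# 95% confidence interval: coefficient +/- critical value * standard error
ci_low = coef - T_CRIT_36 * se
ci_high = coef + T_CRIT_36 * se

print(f"t = {t_value:.2f}")
print(f"95% CI = [{ci_low:.7f}, {ci_high:.7f}]")
```

The interval contains zero, which is another way of seeing that the coefficient is not statistically significant at the 0.05 level.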
Interpretation
• The t-value of 0.33 suggests that there is not a statistically significant relationship
    between "GDP" and "FILM_PRODUCTION".
• The p-value of 0.741 indicates that there is insufficient evidence to reject the null
    hypothesis at a typical significance level of 0.05.
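• In conclusion, there does not appear to be a statistically significant relationship between
    GDP and FILM_PRODUCTION based on these results from the t-test. The point estimate implies
    that a one-unit increase in GDP is associated with an increase of about 0.0049666 units of
    film production, but this estimate cannot be distinguished from zero.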
     F2) Interpret the values of the estimated parameters, the R2 (coefficient of
               determination) and the F-test (or goodness of fit test).
•   F 2. R2 is 0.0031. That means only about 0.31% of the variation in the dependent
    variable, film production, is explained by the independent variable, GDP. The
    remaining 99.69% is left unexplained by the model.
•   Prob > F is 0.7415, which is greater than the assumed rejection value of 0.05.
    The F-test is therefore not statistically significant: the model with GDP does not
    fit the data better than a model with the constant alone.
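The summary statistics in the regression output follow directly from the sums of squares in its ANOVA table. The sketch below recomputes them as a check, using the textbook formulas for simple regression; the variable names are illustrative.

```python
import math

# Sums of squares and degrees of freedom from the regression output
ss_model = 33400.4894
ss_resid = 10881029
df_model, df_resid = 1, 36

ss_total = ss_model + ss_resid
df_total = df_model + df_resid

# R-squared: share of total variation explained by the model
r2 = ss_model / ss_total

# F statistic: mean square of the model over mean square of the residuals
ms_model = ss_model / df_model
ms_resid = ss_resid / df_resid
f_stat = ms_model / ms_resid

# Adjusted R-squared penalises for the number of regressors
adj_r2 = 1 - (1 - r2) * df_total / df_resid

# Root MSE: standard deviation of the residuals
root_mse = math.sqrt(ms_resid)

print(f"R2 = {r2:.4f}, F = {f_stat:.2f}, Adj R2 = {adj_r2:.4f}, Root MSE = {root_mse:.2f}")
```

A negative adjusted R-squared, as obtained here, indicates that the model explains less variation than would be expected by chance from adding one regressor.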
     F3) Test the hypothesis that the variable ‘x’ affects ‘y’ (state the null
    hypothesis, test the hypothesis, indicate your decision and discuss the
    results) by using the critical value (t-test) and the p-value approaches.
•   Test of hypothesis based on P-value approach
    Hypothesis                  Decision rule                Result                     Decision
    H0: There is no             P < 0.05: statistically      P = 0.741, which is        We fail to reject the null
    relationship between film   significant (reject H0)      greater than 0.05          hypothesis because the
    production and GDP                                                                  p-value is greater than 0.05.
    (β1 = 0)                    P > 0.05: not statistically                             There is no evidence that
                                significant (fail to                                    GDP is related to film
                                reject H0)                                              production.
Test of Hypothesis based on Critical Value T- Test
• The calculated t value in the above table is 0.33, and the two-tailed critical t value at the
  0.05 significance level with 36 degrees of freedom (n - 2 = 38 - 2) is approximately 2.028.
• The calculated t value is less than the critical t value. Thus, the coefficient of GDP is not
  statistically significant.
 We therefore fail to reject the null hypothesis that GDP has no effect on film production.
F4) test the hypothesis that the intercept is significantly different from zero (state
 the null hypothesis, test the hypothesis, indicate your decision and discuss the
      results) by using the critical value (t-test) and the p-value approaches.
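•   Null hypothesis: H0: β0 = 0 (the intercept is zero). Alternative: H1: β0 ≠ 0.
•   The y-intercept of the regression equation y = 240.6236 + 0.0049666X1 is the value of y
    when X1 is 0. Therefore, the estimated y-intercept is 240.6236.
•   P-value approach: the p-value of the intercept is 0.057, which is greater than 0.05.
•   Critical value approach: the calculated t value is 1.96, which is less than the critical
    t value of about 2.028 at 36 degrees of freedom.
•   Decision: we fail to reject the null hypothesis. The intercept is not significantly
    different from zero at the 0.05 level, although it would be at the 0.10 level. The 95%
    confidence interval (-7.76 to 489.01) includes zero.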
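The intercept test can be checked from the reported coefficient and standard error of "_cons". This is a minimal sketch, again assuming a t-table critical value of 2.0281 for 36 degrees of freedom (two-tailed, 0.05 level); the names are illustrative.

```python
cons = 240.6236     # estimated intercept (_cons)
se_cons = 122.472   # its standard error

# Assumed two-tailed 5% critical value for 36 df, from a t-table
T_CRIT_36 = 2.0281

# Critical value approach: compare |t| with the critical value
t_cons = cons / se_cons
reject_h0 = abs(t_cons) > T_CRIT_36

# 95% confidence interval for the intercept
ci_cons = (cons - T_CRIT_36 * se_cons, cons + T_CRIT_36 * se_cons)

print(f"t = {t_cons:.2f}, reject H0 at 5%: {reject_h0}")
print(f"95% CI = [{ci_cons[0]:.2f}, {ci_cons[1]:.2f}]")
```

The t value of about 1.96 falls just short of the critical value, and the interval includes zero, so the intercept is not significantly different from zero at the 0.05 level.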
                               N      Minimum    Maximum      Mean      Std. Deviation
       FILM_PRODUCTION         38        10        2599      268.53       543.125
       GDP                     38       314.0     29056.0   5618.026     6049.4112
F5) Use the regression line to make predictions around the mean, the maximum, and
 the minimum values of ‘x’ and interpret your results; and explain any deviations of
                       estimated values from raw data values.
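The predictions can be obtained by substituting the minimum, mean, and maximum GDP values from the descriptive statistics table into the estimated equation. The sketch below does this; the function name `predict` is illustrative.

```python
B0 = 240.6236     # estimated intercept
B1 = 0.0049666    # estimated coefficient of GDP

def predict(gdp):
    """Predicted FILM_PRODUCTION for a given GDP value."""
    return B0 + B1 * gdp

# GDP values from the descriptive statistics table
gdp_points = {"minimum": 314.0, "mean": 5618.026, "maximum": 29056.0}

predictions = {label: predict(x) for label, x in gdp_points.items()}

for label, y_hat in predictions.items():
    print(f"GDP at {label} ({gdp_points[label]}): predicted film production = {y_hat:.2f}")
```

At the mean GDP, the prediction (about 268.53) equals the mean of FILM_PRODUCTION, as expected for a least-squares line. Across the whole GDP range, the predictions vary only from about 242 to about 385, while the observed film production ranges from 10 to 2599. These large deviations reflect the very low R-squared (0.0031) and the large Root MSE (549.77): GDP explains almost none of the variation in film production.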
                                            •   The histogram, overlaid with a curve,
                                                shows the distribution of the
                                                regression standardized residuals for
                                                the dependent variable
                                                "FILM_PRODUCTION." This type of
                                                graph is commonly used to check
                                                whether the residuals of a regression
                                                analysis are normally distributed.
                                            •   The x-axis shows the regression
                                                standardized residual values, ranging
                                                from approximately -3 to 4. The
                                                y-axis shows the frequency of
                                                these residuals.
                                Cont.
• The most notable feature is a tall blue bar at around the 0 mark on the
  x-axis, which indicates that the highest frequency of residuals is close
  to zero. This bar reaches a frequency of almost 30. The rest
  of the bars are much shorter, indicating lower frequencies for other
  residual values. The distribution of the bars suggests that the majority of
  data points are close to the mean, with fewer cases having high positive
  or negative residuals.
• The curve overlaying the bars is smooth and seems to be a fitted normal
  distribution curve, suggesting that the residuals might be approximately
  normally distributed, which is an assumption in many regression analyses.
  However, the actual residuals appear to be slightly right-skewed, as there
  are more bars on the right side of the peak than the left.
                 Regression line
• The graph depicts the relationship between GDP and film production. The
  data and the low R-squared value (0.0031) suggest that there is no
  strong linear relationship between the two variables in this particular
  dataset.
Thank You !