Hypothesis Testing

Using computer simulation. Based on examples from the infer package. Code for Quiz 13.

Load the R package we will use.

Question: t-test

The data this quiz is a subset of HR

set.seed(123)

hr_3_tidy.csv is the name of your data subset

hr  <- read_csv("https://estanny.com/static/week13/data/hr_3_tidy.csv", 
                col_types = "fddfff") 

use the skim to summarize the data in hr

skim(hr)
Table 1: Data summary
Name hr
Number of rows 500
Number of columns 6
_______________________
Column type frequency:
factor 4
numeric 2
________________________
Group variables None

Variable type: factor

skim_variable n_missing complete_rate ordered n_unique top_counts
gender 0 1 FALSE 2 fem: 253, mal: 247
evaluation 0 1 FALSE 4 bad: 148, fai: 138, goo: 122, ver: 92
salary 0 1 FALSE 6 lev: 98, lev: 87, lev: 87, lev: 86
status 0 1 FALSE 3 fir: 196, pro: 172, ok: 132

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
age 0 1 39.41 11.33 20 29.9 39.35 49.1 59.9 ▇▇▇▇▆
hours 0 1 49.68 13.24 35 38.2 45.50 58.8 79.9 ▇▃▃▂▂

Q: Is the mean number of hours worked per week 48?

specify that hours is the variable of interest

hr  %>% 
  specify(response = hours)
Response: hours (numeric)
# A tibble: 500 x 1
   hours
   <dbl>
 1  49.6
 2  39.2
 3  63.2
 4  42.2
 5  54.7
 6  54.3
 7  37.3
 8  45.6
 9  35.1
10  53  
# ... with 490 more rows

hypothesize that the average hours worked is 48

hr  %>% 
  specify(response = hours)  %>% 
  hypothesize(null = "point", mu = 48)
Response: hours (numeric)
Null Hypothesis: point
# A tibble: 500 x 1
   hours
   <dbl>
 1  49.6
 2  39.2
 3  63.2
 4  42.2
 5  54.7
 6  54.3
 7  37.3
 8  45.6
 9  35.1
10  53  
# ... with 490 more rows

generate 1000 replicates representing the null hypothesis

hr %>% 
  specify(response = hours)  %>% 
  hypothesize(null = "point", mu = 48)  %>% 
  generate(reps = 1000, type = "bootstrap") 
Response: hours (numeric)
Null Hypothesis: point
# A tibble: 500,000 x 2
# Groups:   replicate [1,000]
   replicate hours
       <int> <dbl>
 1         1  34.5
 2         1  33.6
 3         1  35.6
 4         1  78.2
 5         1  52.7
 6         1  77.0
 7         1  37.1
 8         1  41.9
 9         1  62.7
10         1  38.8
# ... with 499,990 more rows

The output has 500000 rows

calculate the distribution of statistics from the generated data

null_t_distribution  <- hr  %>% 
  specify(response = age)  %>% 
  hypothesize(null = "point", mu = 48)  %>% 
  generate(reps = 1000, type = "bootstrap")  %>% 
  calculate(stat = "t")

null_t_distribution
# A tibble: 1,000 x 2
   replicate    stat
       <int>   <dbl>
 1         1  0.929 
 2         2  0.480 
 3         3 -0.0136
 4         4  0.435 
 5         5 -0.810 
 6         6 -1.06  
 7         7 -0.0470
 8         8  0.809 
 9         9  0.986 
10        10  0.199 
# ... with 990 more rows

visualize the simulated null distribution

null_t_distribution %>% 
  visualise()

calculate the statistic from your observed data

observed_t_statistic  <- hr  %>%
  specify(response = hours)  %>% 
  hypothesize(null = "point", mu = 48)  %>%
  calculate(stat = "t")

observed_t_statistic
# A tibble: 1 x 1
   stat
  <dbl>
1  2.83

get_p_value from the simulated null distribution and the observed statistic

null_t_distribution  %>% 
  get_p_value(obs_stat = observed_t_statistic , direction = "two-sided")
# A tibble: 1 x 1
  p_value
    <dbl>
1   0.012

shade_p_value on the simulated null distribution

null_t_distribution  %>% 
  visualize() +
  shade_p_value(obs_stat = observed_t_statistic, direction = "two-sided")

If the p-value < 0.05? Yes (yes/no)

Does your analysis support the null hypothesis that the true mean number of hours worked was 48? No (yes/no)

Question: 2 sample t-test

hr_2_tidy.csv is the name of your data subset

hr_2 <- read_csv("https://estanny.com/static/week13/data/hr_2_tidy.csv", 
                col_types = "fddfff") 

Q: Is the average number of hours worked the same for both genders?

use skim to summarize the data in hr_2 by gender

hr_2 %>% 
  group_by(gender)  %>% 
  summarize(obs_stat = mean(hours))
# A tibble: 2 x 2
  gender obs_stat
  <fct>     <dbl>
1 male       49.3
2 female     49.5

Use geom_boxplot to plot distributions of hours worked by gender

hr_2 %>% 
  ggplot(aes(x = gender, y = hours)) + 
  geom_boxplot() 

specify the variables of interest are hours and gender

hr_2 %>% 
  specify(response = hours, explanatory = gender)
Response: hours (numeric)
Explanatory: gender (factor)
# A tibble: 500 x 2
   hours gender
   <dbl> <fct> 
 1  78.1 male  
 2  35.1 female
 3  36.9 female
 4  38.5 male  
 5  36.1 male  
 6  78.1 female
 7  76   female
 8  35.6 female
 9  35.6 male  
10  56.8 male  
# ... with 490 more rows

hypothesize that the number of hours worked and gender are independent

hr_2  %>% 
  specify(response = hours, explanatory = gender)  %>% 
  hypothesize(null = "independence")
Response: hours (numeric)
Explanatory: gender (factor)
Null Hypothesis: independence
# A tibble: 500 x 2
   hours gender
   <dbl> <fct> 
 1  78.1 male  
 2  35.1 female
 3  36.9 female
 4  38.5 male  
 5  36.1 male  
 6  78.1 female
 7  76   female
 8  35.6 female
 9  35.6 male  
10  56.8 male  
# ... with 490 more rows

generate 1000 replicates representing the null hypothesis

hr_2 %>% 
  specify(response = hours, explanatory = gender)  %>% 
  hypothesize(null = "independence")  %>% 
  generate(reps = 1000, type = "permute") 
Response: hours (numeric)
Explanatory: gender (factor)
Null Hypothesis: independence
# A tibble: 500,000 x 3
# Groups:   replicate [1,000]
   hours gender replicate
   <dbl> <fct>      <int>
 1  47.8 male           1
 2  60.3 female         1
 3  46.5 female         1
 4  37.2 male           1
 5  74.1 male           1
 6  35.9 female         1
 7  35.6 female         1
 8  54.5 female         1
 9  55.6 male           1
10  44.1 male           1
# ... with 499,990 more rows

calculate the distribution of statistics from the generated data

null_distribution_2_sample_permute <-  hr_2 %>% 
  specify(response = hours, explanatory = gender)  %>% 
  hypothesize(null = "independence")  %>% 
  generate(reps = 1000, type = "permute")  %>% 
  calculate(stat = "t", order = c("female", "male"))

null_distribution_2_sample_permute
# A tibble: 1,000 x 2
   replicate   stat
       <int>  <dbl>
 1         1  0.505
 2         2 -0.650
 3         3  0.279
 4         4  0.435
 5         5  1.73 
 6         6 -0.139
 7         7 -2.14 
 8         8  0.274
 9         9  0.766
10        10  1.52 
# ... with 990 more rows

*visualize the simulated null distribution

null_distribution_2_sample_permute %>% 
  visualize()

calculate the statistic from your observed data

_ Assign the output observed_t_2_sample_stat

_ Display observed_t_2_sample_stat

observed_t_2_sample_stat <- hr_2 %>%
  specify(response = hours, explanatory = gender)  %>% 
  calculate(stat = "t", order = c("female", "male"))

observed_t_2_sample_stat
# A tibble: 1 x 1
   stat
  <dbl>
1 0.160

get_p_value from the simulated null distribution and the observed statistic

null_t_distribution  %>% 
  get_p_value(obs_stat = observed_t_2_sample_stat , direction = "two-sided")
# A tibble: 1 x 1
  p_value
    <dbl>
1   0.854

shade_p_value on the simulated null distribution

null_t_distribution  %>% 
  visualize() +
  shade_p_value(obs_stat = observed_t_2_sample_stat, direction = "two-sided")

If the p-value < 0.05? No (yes/no)

Does your analysis support the null hypothesis that the true mean number of hours worked by female and male employees was the same? Yes (yes/no)

Question: ANOVA

hr_3_tidy.csv is the name of your data subset

hr_anova <- read_csv("https://estanny.com/static/week13/data/hr_3_tidy.csv", 
                col_types = "fddfff") 

Q: Is the average number of hours worked the same for all three status (fired, ok and promoted) ?

use skim to summarize the data in hr_anova by status

hr_anova %>% 
  group_by(status)  %>% 
  summarize(obs_stat = mean(hours))
# A tibble: 3 x 2
  status   obs_stat
  <fct>       <dbl>
1 promoted     59.3
2 fired        42.4
3 ok           48.0

Use geom_boxplot to plot distributions of hours worked by status

hr_anova %>% 
  ggplot(aes(x = status, y = hours)) + 
  geom_boxplot()

specify the variables of interest are hours and status

hr_anova %>% 
  specify(response = hours, explanatory = status)
Response: hours (numeric)
Explanatory: status (factor)
# A tibble: 500 x 2
   hours status  
   <dbl> <fct>   
 1  49.6 promoted
 2  39.2 fired   
 3  63.2 promoted
 4  42.2 promoted
 5  54.7 promoted
 6  54.3 fired   
 7  37.3 fired   
 8  45.6 promoted
 9  35.1 fired   
10  53   promoted
# ... with 490 more rows

hypothesize that the number of hours worked and status are independent

hr_anova  %>% 
  specify(response = hours, explanatory = status)  %>% 
  hypothesize(null = "independence")
Response: hours (numeric)
Explanatory: status (factor)
Null Hypothesis: independence
# A tibble: 500 x 2
   hours status  
   <dbl> <fct>   
 1  49.6 promoted
 2  39.2 fired   
 3  63.2 promoted
 4  42.2 promoted
 5  54.7 promoted
 6  54.3 fired   
 7  37.3 fired   
 8  45.6 promoted
 9  35.1 fired   
10  53   promoted
# ... with 490 more rows

generate 1000 replicates representing the null hypothesis

hr_anova %>% 
  specify(response = hours, explanatory = status)  %>% 
  hypothesize(null = "independence")  %>% 
  generate(reps = 1000, type = "permute") 
Response: hours (numeric)
Explanatory: status (factor)
Null Hypothesis: independence
# A tibble: 500,000 x 3
# Groups:   replicate [1,000]
   hours status   replicate
   <dbl> <fct>        <int>
 1  38.4 promoted         1
 2  41   fired            1
 3  66.1 promoted         1
 4  46.1 promoted         1
 5  65.1 promoted         1
 6  43.7 fired            1
 7  35.1 fired            1
 8  40.9 promoted         1
 9  38.4 fired            1
10  67.7 promoted         1
# ... with 499,990 more rows

The output has 500000 rows

calculate the distribution of statistics from the generated data

_ Assign the output null_distribution_anova

_ Display null_distribution_anova

null_distribution_anova  <- hr_anova %>% 
  specify(response = hours, explanatory = gender)  %>% 
  hypothesize(null = "independence")  %>% 
  generate(reps = 1000, type = "permute")  %>% 
  calculate(stat = "F")

null_distribution_anova
# A tibble: 1,000 x 2
   replicate     stat
       <int>    <dbl>
 1         1 0.00878 
 2         2 0.359   
 3         3 1.16    
 4         4 0.00524 
 5         5 2.25    
 6         6 0.259   
 7         7 2.61    
 8         8 0.0457  
 9         9 0.0830  
10        10 0.000339
# ... with 990 more rows

visualize the simulated null distribution

null_distribution_anova %>% 
  visualize()

calculate the statistic from your observed data

observed_f_sample_stat  <- hr_anova %>%
  specify(response = hours, explanatory = status)  %>% 
  calculate(stat = "F")

observed_f_sample_stat
# A tibble: 1 x 1
   stat
  <dbl>
1  110.

get_p_value from the simulated null distribution and the observed statistic

null_distribution_anova  %>% 
  get_p_value(obs_stat = observed_f_sample_stat, direction = "greater")
# A tibble: 1 x 1
  p_value
    <dbl>
1       0

**shade_p_value on the simulated null distribution

null_t_distribution  %>% 
  visualize() +
  shade_p_value(obs_stat = observed_f_sample_stat, direction = "greater")

If the p-value < 0.05? Yes (yes/no)

Does your analysis support the null hypothesis that the true means of the number of hours worked for those that were “fired”, “ok” and “promoted” were the same? No (yes/no)

ggsave(filename = "preview.png",
       path = here::here("_posts", "2021-05-18-hypothesis-testing"))