# CLRES 2020/Biostatistics 2041

Biostatistics: Statistical Approaches in Clinical Research

Lab 5, Created by Fiona Callaghan

GSCC 126, Monday 1-5pm, August 9, 2004

Course instructors:

Joyce C. H. Chang, PhD

Maria K. Mor, PhD

Doris M. Rubio, PhD

Mark S. Roberts, MD, MPP

Teaching Assistants:

Fiona Callaghan, MS

Bill Clark

David Corcoran

Vinay Mehta

# Goals for Lab 5

1. Confidence intervals for Binomial and Poisson Data.
2. One sample test for Binomial proportion.
3. Two sample test for Binomial proportion.
4. Sample size and Power for Binomial tests.
5. Chi-squared test, Fisher's exact test.
6. Matched pair data (McNemar's test).
7. Measures of agreement (kappa statistic).
8. 2×C tables test of trend.

Whenever you see a check-mark, that means that you are required to perform some action.  Whenever some words are in `this font`, it means that these are commands that you should type in the command window of STATA.  And whenever you see a >, it refers to going through a series of drop-down menus, as in "All Programs>Mathematics>STATA".  There are generally two ways to do most things in STATA:  using commands that you type in the command window, or using drop-down menus, as in SPSS.  Whenever possible, we will give you both ways of doing things in STATA, but you are only required to do it the way you feel most comfortable.  On the back of this handout is some space for you to answer questions about the lab material.

The questions that you have to answer to get credit for this lab are enclosed in a box like this.

You will answer these questions as you go through the lab and hand them in at the end for credit, so remember to write your name on them!  If you experience trouble at any time, just raise your hand to let a TA or an instructor know that you need help.  Let's get started!

# Getting Started

First we will log on to the computer.  To do this you will need your University of Pittsburgh user id and your password.

• You should see a space on the screen to enter your user id.  Type it in and press return.
• Now enter your password and press return.  You should now be logged on to the computer.

We will open a folder in which to save our work, and then we will open STATA and enter our data sets into STATA.

• Right-click somewhere on the desktop and select "New Directory".  Name your folder "Lab5".  We will save all our work in this folder.
• Go to the web page:  http://www.pitt.edu/~changj/CLRES2020/main.html
• Scroll down to find the data sets, right-click on "popular.dta" and select "Save Link As...".
• We want to save the file in "/scratch/username/Desktop/Lab5".  The "username" is your University of Pittsburgh email id (the part of your University of Pittsburgh email address that comes before the "@"; e.g. "fmc2" is the id from the email address fmc2@pitt.edu), so on my computer I would save it in "/scratch/fmc2/Desktop/Lab5".  To do this, double-click on "Desktop" and then "Lab5" in the main window (you should only have to do this once; the computer will remember where you are saving your files later on).  Click "Save".
• Save the data set casecontrol.dta in the same way.
• Your data sets should now be in your "Lab5" folder on the Desktop.  Open your "Lab5" folder to check that they are there, by double-clicking on the "Lab5" icon on your desktop.  If things do not look right, contact a TA.

Now we will open STATA.

• To open STATA, click on the icon in the bottom left of your screen (this is the "Start Applications" menu), go up to "Mathematics" and then move the mouse right onto "STATA" to highlight it.  Click on STATA and it should open.
• We wish to tell STATA to save anything we do from now on in our "Lab5" folder.  To do this, in the command window type: `cd "/scratch/username/Desktop/Lab5"`
• Now open the log file.  Type `log using log5.log` or go to File>Log>Begin... .  You will have to give the log file a name, so type in "log5".  Next we have to make sure that STATA saves it as a ".log" file and not a ".smcl" file;  go to the drop-down menu next to "Save as type:  Stata SMCL Document (*.smcl)" and select "Stata log (*.log)".  Then save in your Lab5 folder (you may have to double-click on Desktop to find the Lab5 folder).
• Type `use popular` in the command window of STATA, and press return.  You can also load your data using a drop-down menu.  Go to "File>Open...", select the popular.dta data set and click "Open".  Your data set should now be in STATA.

Datafile Name: Popular Kids

Datafile Subjects: Psychology , Social science

Story Names: Students' Goals , What Makes Kids Popular

Reference: Chase, M. A., and Dummer, G. M. (1992), "The Role of Sports as a Social Determinant for Children," Research Quarterly for Exercise and Sport, 63, 418-424

Authorization: Contact authors

Description: Subjects were students in grades 4-6 from three school districts in Ingham and Clinton Counties, Michigan. Chase and Dummer stratified their sample, selecting students from urban, suburban, and rural school districts with approximately 1/3 of their sample coming from each district. Students indicated whether good grades, athletic ability, or popularity was most important to them. They also ranked four factors: grades, sports, looks, and money, in order of their importance for popularity. The questionnaire also asked for gender, grade level, and other demographic information.

Number of cases: 478

Variable Names:

1. Gender: Boy or girl
2. Grade: 4, 5 or 6
3. Age: Age in years
4. Race: White, Other
5. Urban/Rural: Rural, Suburban, or Urban school district
6. School: Brentwood Elementary, Brentwood Middle, Ridge, Sand, Eureka, Brown, Main, Portage, Westdale Middle
7. Goals: Student's choice in the personal goals question where options were 1 = Make Good Grades, 2 = Be Popular, 3 = Be Good in Sports
8. Grades: Rank of "make good grades" (1=most important for popularity, 4=least important)
9. Sports: Rank of "being good at sports" (1=most important for popularity, 4=least important)
10. Looks: Rank of "being handsome or pretty" (1=most important for popularity, 4=least important)
11. Money: Rank of "having lots of money" (1=most important for popularity, 4=least important)

First we will have to create a binomial variable from this data.  Let's make a new variable out of the variable "looks".  Let's say that if the child answered 1 or 2, then that child thought that being handsome or pretty was "important" (=1) for popularity, and if the child answered 3 or 4 they thought that being handsome or pretty was "not important" (=0).

• Type `generate looks2 = 1 if (looks == 1 | looks == 2) & looks ~= .`  This says that if "looks" is a 1 or a 2, and it is not a missing value, then make our new variable (looks2) equal to 1.
• Type `replace looks2 = 0 if (looks == 3 | looks == 4) & looks ~= .`  This says that if "looks" is a 3 or a 4, and it is not a missing value, then looks2 = 0.  Now we have a binary (binomial) variable.
• Type `ci looks2, binomial level(90)` and this will give us an exact 90% binomial confidence interval for p = proportion of kids who think that looks are important for popularity.  The confidence interval is (0.6131, 0.6868).  You could also go to Statistics>Summaries, tables, and tests>Summary statistics>Confidence intervals.  Click on the "Binomial 0/1 variables; compute exact confidence intervals" box and enter looks2 in the "Variables" box.  Click "OK".

Question 1:  Generate a new variable that is binary and indicates whether the child thought that being good at sports was important or unimportant for popularity.  Make "important" = 1 and "unimportant" = 0.  Call this variable "sports2".  How many thought sports was important and how many thought sports was not important?

Question 2:  Calculate an exact 90% binomial confidence interval for "sports2".  Interpret the confidence interval.

We can also use the STATA confidence interval calculator if we are given some summary statistics but not the whole data set.  Suppose we know that 10 people got a particular disease after being vaccinated, out of a total of 231 vaccinated people, and we want to find the 99% confidence interval for the probability of getting the disease.  In this case we have 10 "successes" and 231 is the number of "trials" (STATA calls this the "sample size").  The point estimate of p is 10/231 = 0.0433.

• Type `cii 231 10, level(99)` or go to Statistics>Summaries, tables, and tests>Summary statistics>Binomial CI calculator and type "231" in the "Sample size" box, "10" in the "Successes" box and "99" in the "Confidence level" box.  You should get (0.016, 0.090).
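If you want to see where an exact interval comes from, the following Python sketch (an optional cross-check outside STATA, not part of the lab) finds the two limits by bisection on the binomial distribution. It assumes that STATA's exact interval is the Clopper-Pearson interval; the function names are made up for this illustration.

```python
from math import comb

def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k + 1))

def solve(f, target, increasing):
    """Bisection on (0, 1) for the p at which f(p) equals target."""
    lo, hi = 0.0, 1.0
    for _ in range(60):
        mid = (lo + hi) / 2
        if (f(mid) < target) == increasing:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

def exact_binomial_ci(n, k, level):
    """Clopper-Pearson interval for k successes in n trials (level in percent)."""
    tail = (1 - level / 100) / 2
    # Lower limit: P(X >= k | p) = tail; this probability increases with p.
    lower = 0.0 if k == 0 else solve(lambda p: 1 - binom_cdf(k - 1, n, p), tail, True)
    # Upper limit: P(X <= k | p) = tail; this probability decreases with p.
    upper = 1.0 if k == n else solve(lambda p: binom_cdf(k, n, p), tail, False)
    return lower, upper

vaccine_ci = exact_binomial_ci(231, 10, 99)   # compare with cii 231 10, level(99)
looks_ci = exact_binomial_ci(478, 311, 90)    # compare with ci looks2, binomial level(90)
print([round(x, 3) for x in vaccine_ci])
print([round(x, 4) for x in looks_ci])
```

The two printed intervals should agree with the STATA results quoted above up to rounding.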

Question 3:  Find a confidence interval for p when we have 8 successes out of 21 trials.

### Normal approximation confidence interval for Binomial

There are no drop-down menus or explicit commands to calculate an approximate confidence interval, but STATA can help with the arithmetic.  First we need to check whether npq > 5.  The formula for the interval is:

$$\hat{p} \pm z_{1-\alpha/2}\sqrt{\hat{p}\hat{q}/n}$$

• Type `summarize looks2` to get a value for the point estimate of p.  It is about 0.6506.  We can also see that the sample size is 478.
• Type `display 478*0.6506*(1-0.6506)`.  If this is greater than 5 then we can use the normal approximation (you should get about 108.7).
• Find the z critical value.  For a 95% confidence interval it is $z_{1-\alpha/2} = z_{0.975} = 1.96$.  To confirm this, type `display invnorm(0.975)`
• Type `display 0.6506+1.96*sqrt(0.6506*(1-0.6506)/478)` and `display 0.6506-1.96*sqrt(0.6506*(1-0.6506)/478)` to get the upper and lower bounds of the confidence interval.  You should get about (0.6079, 0.6933).  The exact interval you calculated previously was a 90% interval, so to compare like with like, rerun it at the 95% level with `ci looks2, binomial`.  The two should be pretty close.

Note that sometimes you might get a confidence interval that includes values less than 0 or greater than 1, e.g. (0.20, 1.02) or (-0.03, 0.50).  Obviously the normal approximation has broken down here, and we would truncate the confidence intervals to make them (0.20, 1.00) or (0.00, 0.50).
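The steps above can be sketched in a few lines of Python, offered only as a cross-check outside STATA. The function name `wald_ci` and the 19-out-of-20 sample are hypothetical; the looks2 figures are the ones used in the lab.

```python
from math import sqrt
from statistics import NormalDist

def wald_ci(p_hat, n, level):
    """Normal-approximation interval for a proportion, truncated to [0, 1]."""
    z = NormalDist().inv_cdf(1 - (1 - level / 100) / 2)
    half_width = z * sqrt(p_hat * (1 - p_hat) / n)
    return max(0.0, p_hat - half_width), min(1.0, p_hat + half_width)

n, p_hat = 478, 0.6506
npq = n * p_hat * (1 - p_hat)
print(round(npq, 1))                     # rule of thumb: we want npq > 5
lower, upper = wald_ci(p_hat, n, 95)
print(round(lower, 4), round(upper, 4))  # compare with (0.6079, 0.6933)

# Hypothetical small sample (19 successes in 20 trials) to show the truncation at 1.
print(wald_ci(19 / 20, 20, 95))
```

In the small-sample case npq is well below 5, which is exactly the situation in which the untruncated upper limit runs past 1.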

Question 4: Calculate npq where p is the sample proportion of kids who think that sporting ability is important for popularity.  Can we use a normal approximation?

Question 5: What is the 93% confidence interval for the proportion of kids who think that sporting ability is important for popularity, using a normal approximation?

Question 6:  Suppose you had a sample of 600 people, and the number of them with a disease was 594.  Calculate npq.  Is a normal approximation justified?

Question 7:  Calculate the 99% confidence interval for the previous question using the normal approximation.  Is anything wrong with this confidence interval?

Question 8: Calculate the exact confidence interval for this problem.

### Exact confidence interval for Poisson

We can also find confidence intervals for Poisson data.

Using the confidence interval calculator for Poisson data, we need to know the "exposure", which is usually the time period over which the data were collected, and the number of events that occurred during that time.  Suppose we had 17 cases of food poisoning over a period of 3 years in a particular town, and we needed to calculate an 80% confidence interval for the rate of food poisoning.  The point estimate of the rate is 17/3 = 5.667 cases per year.

• Type `cii 3 17, poisson level(80)` or go to Statistics>Summaries, tables, and tests>Summary statistics>Poisson CI calculator and type "3" in the "sample exposure" box, "17" in "sample events" and "80" in the "confidence level" box.  You should get (3.99, 7.87).  This is a confidence interval for the average rate of disease for one year.
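As an optional cross-check outside STATA, the exact Poisson limits can be found by bisection on the Poisson distribution and then divided by the exposure. This is a sketch under the assumption that the calculator uses the standard exact (tail-probability) limits; the function names and the search ceiling of `5 * events + 20` are choices made for this illustration.

```python
from math import exp

def poisson_cdf(k, mu):
    """P(X <= k) for X ~ Poisson(mu)."""
    term = total = exp(-mu)
    for i in range(1, k + 1):
        term *= mu / i
        total += term
    return total

def solve_mean(f, target, increasing, ceiling):
    """Bisection on (0, ceiling) for the mean at which f(mean) equals target."""
    lo, hi = 0.0, float(ceiling)
    for _ in range(80):
        mid = (lo + hi) / 2
        if (f(mid) < target) == increasing:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

def exact_poisson_ci(events, exposure, level):
    """Exact interval for a Poisson rate per unit of exposure (level in percent)."""
    tail = (1 - level / 100) / 2
    ceiling = 5 * events + 20          # generous search bound, an ad hoc choice
    lower = 0.0 if events == 0 else solve_mean(
        lambda m: 1 - poisson_cdf(events - 1, m), tail, True, ceiling)
    upper = solve_mean(lambda m: poisson_cdf(events, m), tail, False, ceiling)
    return lower / exposure, upper / exposure

rate_lower, rate_upper = exact_poisson_ci(17, 3, 80)   # compare with cii 3 17, poisson level(80)
print(round(rate_lower, 2), round(rate_upper, 2))
```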

## One-sample test for proportions (Binomial Data)

Suppose we want to test whether the proportion of kids who think that good looks are important for popularity is less than 0.70 (Hint:  If you look at the confidence interval that you calculated for this variable, you can already see whether we are going to reject or fail to reject the null hypothesis).  So our Ho: p = 0.70, Ha: p < 0.70.  Suppose α = 0.05.

• Type `bitest looks2 == 0.7` to get the exact test.  You can also go to Statistics>Summaries, tables, & tests>Classical tests of hypotheses>Binomial probability test.  Enter "looks2" in the "Name of 0/1 variable" box, and "0.70" in the "Probability of success" box.  The output is:

    . bitest looks2 == 0.7

        Variable |        N   Observed k   Expected k   Assumed p   Observed p
    -------------+------------------------------------------------------------
          looks2 |      478        311        334.6       0.70000      0.65063

      Pr(k >= 311)             = 0.991267  (one-sided test)
      Pr(k <= 311)             = 0.011313  (one-sided test)
      Pr(k <= 311 or k >= 358) = 0.021567  (two-sided test)

• The p-value for Ha: p < 0.70 is 0.011313.  We reject the null hypothesis at the 5% level.  Note that 311 is the number of kids who think that looks are important to popularity, so P(sample proportion ≤ 0.6506) = P(k ≤ 311), where k = number of successes.
• Type `prtest looks2 == 0.7` to get the normal approximation test.  You can also go to Statistics>Summaries, tables, & tests>Classical tests of hypotheses>One sample proportion test.  The variable is "looks2" and the "hypothesized proportion" is 0.70.  The output looks like:

    . prtest looks2 == 0.7, level(95)

    One-sample test of proportion                 looks2: Number of obs =      478
    ------------------------------------------------------------------------------
        Variable |       Mean   Std. Err.                     [95% Conf. Interval]
    -------------+----------------------------------------------------------------
          looks2 |   .6506276    .021807                      .6078866    .6933686
    ------------------------------------------------------------------------------

                        Ho: proportion(looks2) = .7

        Ha: looks2 < .7         Ha: looks2 != .7          Ha: looks2 > .7
          z = -2.356               z = -2.356               z = -2.356
       P < z = 0.0092         P > |z| = 0.0185           P > z = 0.9908

• We get a p-value of 0.0092 so we reject the null hypothesis at the 5% level.  Note that the confidence interval is the normal approximation confidence interval for p.  This provides an easy way to calculate the normal approximation of the CI of p.  We can calculate the critical value by typing `display invnorm(0.05)`, which gives us the value -1.645.  Our z statistic is -2.356, which is less than -1.645, so we reject Ho.
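The z statistic in this output can be reproduced by hand. The Python sketch below (an optional cross-check outside STATA; the function name is made up) uses the usual one-sample formula, in which the standard error is computed from the hypothesized proportion p0 rather than from the sample proportion.

```python
from math import sqrt
from statistics import NormalDist

def one_sample_prop_z(successes, n, p0):
    """z test of Ho: p = p0 using the normal approximation."""
    p_hat = successes / n
    z = (p_hat - p0) / sqrt(p0 * (1 - p0) / n)   # standard error under Ho
    phi = NormalDist().cdf(z)
    p_less = phi                      # Ha: p < p0
    p_two_sided = 2 * min(phi, 1 - phi)  # Ha: p != p0
    p_greater = 1 - phi               # Ha: p > p0
    return z, p_less, p_two_sided, p_greater

z, p_less, p_two_sided, p_greater = one_sample_prop_z(311, 478, 0.70)
print(round(z, 3), round(p_less, 4), round(p_two_sided, 4), round(p_greater, 4))
```

Note that the standard error printed in the STATA table (.021807) is based on the sample proportion and is what the confidence interval uses, while the test statistic uses p0 = 0.70.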

Question 9: Perform an exact test of Ho: p = 0.70 against Ha: p < 0.70 for the variable "sports2", with α = 0.03.  Is the true proportion of kids who think that sporting ability is important for popularity less than 0.70?

Question 10: Perform a normal approximation test of Ho: p = 0.70 against Ha: p < 0.70 for the variable "sports2", with α = 0.03.  Is the true proportion of kids who think that sporting ability is important for popularity less than 0.70?

Again, we can use a proportion test calculator to do the tests for us if we only have the summary statistics.  Suppose we know that our sample had 15 successes and n = 27 and we wish to test Ho: p = 0.50 against Ha: p ≠ 0.50.  Our sample estimate for p is 15/27 = 0.556.  n×p0×q0 = 27×0.50×0.50 = 6.75, which is greater than 5, so we can use either the normal approximation or the exact test.  Say α = 0.08.

• Type `prtesti 27 0.556 0.5, level(92)` or go to Statistics>Summaries, tables, & tests>Classical tests of hypotheses>One sample proportion calculator.  Your "sample proportion" is 0.556 and your "hypothesized proportion" is 0.50.  This is the normal approximation test.

    . prtesti 27 0.556 0.5, level(92)

    One-sample test of proportion                      x: Number of obs =       27
    ------------------------------------------------------------------------------
        Variable |       Mean   Std. Err.                     [92% Conf. Interval]
    -------------+----------------------------------------------------------------
               x |       .556   .0956196                      .3886001    .7233999
    ------------------------------------------------------------------------------

                           Ho: proportion(x) = .5

        Ha: x < .5              Ha: x != .5               Ha: x > .5
          z =  0.582               z =  0.582               z =  0.582
       P < z =  0.7197        P > |z| =  0.5606          P > z =  0.2803

• We get a p-value of 0.5606 so we fail to reject Ho.  We can find our critical z values by typing `display invnorm(0.08/2)`, which gives -1.75, so $z_{\alpha/2} = -1.75$ and $z_{1-\alpha/2} = 1.75$.  Our z = 0.582 lies between them, so we fail to reject Ho.
• Type `bitesti 27 15 0.5` or go to Statistics>Summaries, tables, & tests>Classical tests of hypotheses>Binomial probability test calculator.  Under "Probability of success" type 0.50.  This is p0.

    . bitesti 27 15 0.5

            N   Observed k   Expected k   Assumed p   Observed p
    ------------------------------------------------------------
           27         15         13.5       0.50000      0.55556

      Pr(k >= 15)            = 0.350554  (one-sided test)
      Pr(k <= 15)            = 0.778966  (one-sided test)
      Pr(k <= 12 or k >= 15) = 0.701108  (two-sided test)

• We get a p-value of 0.7011 so we fail to reject Ho.
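The three probabilities in the exact-test output are just sums of binomial probabilities. The Python sketch below (an optional cross-check outside STATA) recomputes them. The two-sided rule used here, adding up every outcome that is no more likely than the observed one, is an assumption about how the two-sided value is formed; it is consistent with the "k <= 12 or k >= 15" region shown in the output.

```python
from math import comb

def binom_pmf(i, n, p):
    """P(X = i) for X ~ Binomial(n, p)."""
    return comb(n, i) * p**i * (1 - p)**(n - i)

def exact_binomial_test(n, k, p0):
    """One-sided and two-sided exact p-values for k successes in n trials."""
    pmf = [binom_pmf(i, n, p0) for i in range(n + 1)]
    p_upper = sum(pmf[k:])         # Pr(K >= k)
    p_lower = sum(pmf[:k + 1])     # Pr(K <= k)
    # Two-sided: every outcome no more likely than the observed one
    # (the small factor guards against floating-point ties).
    cutoff = pmf[k] * (1 + 1e-7)
    p_two_sided = sum(q for q in pmf if q <= cutoff)
    return p_upper, p_lower, p_two_sided

calc_upper, calc_lower, calc_two = exact_binomial_test(27, 15, 0.5)
print(round(calc_upper, 6), round(calc_lower, 6), round(calc_two, 6))

looks_upper, looks_lower, looks_two = exact_binomial_test(478, 311, 0.7)
print(round(looks_lower, 6))       # compare with Pr(k <= 311) from bitest looks2 == 0.7
```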

Question 11: Perform a one-sample test of Binomial proportions where n = 50 and sample p = 0.16.  Test whether p is different from 0.10 and use α = 0.11.  Use a normal approximation if appropriate.

## Two sample test of binomial proportions

Suppose we want to test whether the proportion/probability that the kids think sports is important for popularity is the same for girls as it is for boys.  We will assume these samples are independent.  We would test this as follows, using α = 0.01.

• First we must generate a new variable called "gender2" that takes a 1 when the child is a boy and 2 when the child is a girl.  This is so that STATA can tell the difference between the groups (at the moment the data is text and STATA has trouble telling apart the word labels "girl" and "boy"; STATA works better with numbers).  Type `generate gender2 = 1 if gender == "boy"` and `replace gender2 = 2 if gender == "girl"`.  Any child whose gender is neither "boy" nor "girl" is left with a missing gender2.
• Type `prtest sports2, by(gender2) level(99)` or go to Statistics>Summaries, tables, & tests>Classical tests of hypotheses>Group proportion test.  Your "Variable name" is "sports2", your "Group variable name" is "gender2", and your confidence level is 99.
• You get the following output.

    . prtest sports2, by( gender2 ) level(99)

    Two-sample test of proportion                      1: Number of obs =      227
                                                       2: Number of obs =      251
    ------------------------------------------------------------------------------
        Variable |       Mean   Std. Err.      z    P>|z|     [99% Conf. Interval]
    -------------+----------------------------------------------------------------
               1 |   .8237885   .0252879                      .7586513    .8889257
               2 |   .5258964   .0315174                      .4447131    .6070798
    -------------+----------------------------------------------------------------
            diff |   .2978921   .0404082                      .1938076    .4019767
                 |  under Ho:   .0431549     6.90   0.000
    ------------------------------------------------------------------------------

                      Ho: proportion(1) - proportion(2) = diff = 0

        Ha: diff < 0            Ha: diff != 0             Ha: diff > 0
          z =  6.903               z =  6.903               z =  6.903
       P < z =  1.0000        P > |z| =  0.0000          P > z =  0.0000

• Our Ho: p1 = p2 and Ha: p1 ≠ p2.  Our p-value is very low (< 0.00005) so we reject Ho.
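The "under Ho" line of the output uses a pooled standard error, because under the null hypothesis both groups share one proportion. A short Python sketch of that calculation follows (an optional cross-check outside STATA; the function name is made up, and the proportions and sample sizes are read off the output above).

```python
from math import sqrt
from statistics import NormalDist

def two_sample_prop_z(p1, n1, p2, n2):
    """Two-sample z test of Ho: p1 = p2 with a pooled standard error."""
    pooled = (n1 * p1 + n2 * p2) / (n1 + n2)
    se_null = sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se_null
    p_two_sided = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, se_null, p_two_sided

z, se_null, p_value = two_sample_prop_z(0.8237885, 227, 0.5258964, 251)
print(round(z, 3), round(se_null, 7), p_value)
```

The confidence interval for the difference, by contrast, uses the unpooled standard error (.0404082 in the output), since it does not assume that the two proportions are equal.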

Using the calculator to do the same test we get:

• Type prtesti 227 0.8238 251 0.5259, level(99) or go to Statistics>Summaries, tables, & tests>Classical tests of hypotheses>Two sample proportion calculator.  With a little rounding error, you will get the same results as above.

Question 12:  Suppose we want to perform a two-sample test to see if the importance of "looks" is different from boys to girls.  Copy the 2 by 2 table into your answers.  Calculate npq for sample 1 and sample 2.  Is a normal approximation test justified?

Question 13:  Is the proportion of kids who think that looks are important different for girls and boys?  Use a normal approximation test and α = 0.05.

## Sample size and power for Binomial tests

The calculations are very similar to those for sample size and power for t-tests.

Retrospectively, we can calculate the power for the tests we have done so far.  Suppose we wanted to calculate the power of the two-sample test comparing the importance of sports for boys and for girls.

• Type `sampsi 0.8238 0.5259, alpha(0.01) n1(227) n2(251)`.  Remember that if we had done a one-sided test we would have to type `onesided` at the end of the line.  Or we could go to Statistics>Summaries, tables, & tests>Classical tests of hypotheses>Sample size and power determination.  Click on the box for "Two-sample comparison of proportions" and enter 0.8238 and 0.5259 in the "proportion 1" and "proportion 2" boxes.  Then go to "Options".  Click "Compute power", enter your α level as 0.01, enter "227" for "Sample one size", click on the box for "Sample two size" and enter "251".  Make sure that the drop-down menu at the bottom of the screen says "Two-sided test".  Finally, click OK.
• The power is 1.0000, so we were almost certain of detecting a difference as large as the one we found.

    . sampsi 0.8238 0.5259, alpha(0.01) n1(227) n2(251)

    Estimated power for two-sample comparison of proportions

    Test Ho: p1 = p2, where p1 is the proportion in population 1
                        and p2 is the proportion in population 2
    Assumptions:

             alpha =   0.0100  (two-sided)
                p1 =   0.8238
                p2 =   0.5259
    sample size n1 =      227
                n2 =      251
             n2/n1 =     1.11

    Estimated power:

             power =   1.0000

If we wanted to perform this study again (two-sided, α = 0.01) to get a power of 0.85, with equal sample sizes and assuming p1 = 0.8238 and p2 = 0.5259, we could type the following:

• `sampsi 0.8238 0.5259, alpha(0.01) power(0.85) ratio(1)` (remember that for a one-sided test you would need to specify `onesided` at the end of the line).  Or we could enter the appropriate numbers using the Statistics>Summaries, tables, & tests>Classical tests of hypotheses>Sample size and power determination menus.  Be careful to enter all the information you need to (and none that you don't):  ratio = 1, α = 0.01, power = 0.85, you wish to compute sample size (not power), it is for a two-sided test, and your two-sample proportions are 0.8238 and 0.5259.  You should find the sample size to be 70 children for each sample.

    . sampsi 0.8238 0.5259, alpha(0.01) power(0.85)

    Estimated sample size for two-sample comparison of proportions

    Test Ho: p1 = p2, where p1 is the proportion in population 1
                        and p2 is the proportion in population 2
    Assumptions:

             alpha =   0.0100  (two-sided)
             power =   0.8500
                p1 =   0.8238
                p2 =   0.5259
             n2/n1 =   1.00

    Estimated required sample sizes:

                n1 =       70
                n2 =       70
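For the curious, the sample size of 70 can be reproduced with the textbook formula for comparing two proportions with equal group sizes, followed by a continuity correction. The Python sketch below is an optional cross-check outside STATA; it assumes that `sampsi` applies the continuity correction by default, and the function name is made up for this illustration.

```python
from math import ceil, sqrt
from statistics import NormalDist

def n_per_group(p1, p2, alpha, power):
    """Per-group n for a two-sided test of p1 = p2 with equal allocation.

    Returns (n without continuity correction, n with continuity correction).
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    p_bar = (p1 + p2) / 2
    delta = abs(p1 - p2)
    numerator = (z_alpha * sqrt(2 * p_bar * (1 - p_bar))
                 + z_beta * sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2
    n_plain = numerator / delta ** 2
    # Continuity correction for equal group sizes.
    n_corrected = n_plain / 4 * (1 + sqrt(1 + 4 / (n_plain * delta))) ** 2
    return ceil(n_plain), ceil(n_corrected)

plain, corrected = n_per_group(0.8238, 0.5259, alpha=0.01, power=0.85)
print(plain, corrected)
```

The corrected figure is the one that matches the STATA output; the uncorrected figure is a little smaller.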

Question 14:  What was the power of the test that compared the importance of looks to boys and girls?

Question 15:  What is the sample size required to perform the same test, but with power of 0.75?

## Chi-squared and Fisher's exact tests

We use these tests when analyzing tabular data:  usually when analyzing the relationship between 2 or more discrete variables.  The chi-squared test is an excellent approximate test if the expected values of the cells are all above 5.  Fisher's exact test is more difficult to calculate (for us humans as well as for the computer).  In fact, it is so problematic that it may cause STATA to crash (if the counts get over 100 or so and the number of rows or columns is greater than 2), so always check the expected cell counts first to see if a chi-squared test will do.  If you really need the accuracy of Fisher's exact test then borrow a friend's computer to do it on...

Suppose we want to investigate whether there is a relationship between the children rating money as an important factor in popularity and the children attending an urban, suburban or rural school.  First we will do a chi-squared test.

• Type `tabulate money urbanrural, chi2 expected`.  This asks for a table of "money" versus "urbanrural" and the chi-squared test, and puts the expected counts in the cells (under the observed counts).  You can also go to "Statistics>Summaries, tables, & tests>Tables>Two-way tables with measure of association" and you will see many options to choose from.  Under "Test statistics" check "Pearson's chi-squared" and under "Cell contents" check "Expected frequencies".  You can play around with the other options but do NOT do Fisher's exact test yet, unless you would like STATA to crash.
• You should get the following output:

    . tabulate money urbanrural, chi2 expected

    +--------------------+
    | Key                |
    |--------------------|
    |     frequency      |
    | expected frequency |
    +--------------------+

               |           Urban/Rural
         Money |     Rural   Suburban      Urban |     Total
    -----------+---------------------------------+----------
             1 |        14          8         12 |        34
               |      10.6       10.7       12.7 |      34.0
    -----------+---------------------------------+----------
             2 |        25         26         24 |        75
               |      23.4       23.7       27.9 |      75.0
    -----------+---------------------------------+----------
             3 |        43         43         46 |       132
               |      41.1       41.7       49.2 |     132.0
    -----------+---------------------------------+----------
             4 |        67         74         96 |       237
               |      73.9       74.9       88.3 |     237.0
    -----------+---------------------------------+----------
         Total |       149        151        178 |       478
               |     149.0      151.0      178.0 |     478.0

              Pearson chi2(6) =   4.3719   Pr = 0.626

• We can see that our p-value is 0.626 (not significant at α = 0.05) so there does not appear to be a relationship between location of school and how importantly the children rate money.  Our chi-squared statistic is 4.37.  Our degrees of freedom = 6 = (4-1)×(3-1).
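To see how the expected counts and the statistic are built, here is a Python sketch that recomputes them from the observed table (an optional cross-check outside STATA). Each expected count is row total × column total / grand total. The p-value helper uses a closed-form expression that is valid only when the degrees of freedom are even, which happens to be the case here; both function names are made up for this illustration.

```python
from math import exp

observed = [
    [14, 8, 12],    # money ranked 1: rural, suburban, urban
    [25, 26, 24],   # money ranked 2
    [43, 43, 46],   # money ranked 3
    [67, 74, 96],   # money ranked 4
]

def pearson_chi2(table):
    """Pearson chi-squared statistic, degrees of freedom and expected counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    expected = [[r * c / n for c in col_totals] for r in row_totals]
    chi2 = sum((o - e) ** 2 / e
               for obs_row, exp_row in zip(table, expected)
               for o, e in zip(obs_row, exp_row))
    df = (len(row_totals) - 1) * (len(col_totals) - 1)
    return chi2, df, expected

def chi2_upper_tail_even_df(x, df):
    """P(chi-squared with df degrees of freedom > x); closed form for even df only."""
    half = x / 2
    term = total = 1.0
    for j in range(1, df // 2):
        term *= half / j
        total += term
    return exp(-half) * total

chi2, df, expected = pearson_chi2(observed)
smallest_expected = min(min(row) for row in expected)
p_value = chi2_upper_tail_even_df(chi2, df)
print(round(chi2, 4), df, round(p_value, 3), round(smallest_expected, 1))
```

The smallest expected count is well above 5, which is why the chi-squared approximation is acceptable for this table.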

To illustrate Fisher's exact test, we will use a smaller table.  Suppose we wanted to see if there was an association between gender and whether the children thought that achieving good grades was important for popularity.  First we must make a binary variable, "grades2", out of "grades" so we can make a (small) 2 by 2 table; follow the same recipe that was used for "looks2" above.

• Now that we have our binary variable, we can do a Fisher's exact test with "grades2" versus "gender".  Type `tabulate gender grades2, exact expected` to get the Fisher's exact probability.  You could also go to "Statistics>Summaries, tables, & tests>Tables>Two-way tables with measure of association" and under "Test Statistics" choose "Fisher's exact test".
• Your output should look like this:

    . tabulate  gender   grades2, exact expected

    +--------------------+
    | Key                |
    |--------------------|
    |     frequency      |
    | expected frequency |
    +--------------------+

        Gender |         0          1 |     Total
    -----------+----------------------+----------
           boy |       127        100 |       227
               |     123.9      103.1 |     227.0
    -----------+----------------------+----------
          girl |       134        117 |       251
               |     137.1      113.9 |     251.0
    -----------+----------------------+----------
         Total |       261        217 |       478
               |     261.0      217.0 |     478.0

               Fisher's exact =                 0.582
       1-sided Fisher's exact =                 0.319

Fisher's test does not have a statistic as such; it simply calculates the probability of seeing the observed table, or one more extreme, under the null hypothesis of no association.  Here the probability is 0.582, so this is not extreme enough to warrant rejecting the null hypothesis.  We would reject if the probability was less than α.
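For a 2 by 2 table, "the observed table or more extreme" can be made concrete: with the row and column totals held fixed, the top-left count follows a hypergeometric distribution. The Python sketch below (an optional cross-check outside STATA) adds up those probabilities for the gender by grades2 table. The two-sided rule, summing every table no more likely than the observed one, and the choice of the smaller tail as the one-sided value are assumptions about how the reported probabilities are formed; the function name is made up.

```python
from math import comb

def fisher_exact_2x2(a, b, c, d):
    """Fisher's exact probabilities for the 2x2 table [[a, b], [c, d]].

    Returns (two-sided probability, smaller one-sided probability).
    """
    row1, row2, col1 = a + b, c + d, a + c
    total = comb(row1 + row2, col1)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Hypergeometric probability of each possible top-left count, margins fixed.
    prob = {x: comb(row1, x) * comb(row2, col1 - x) / total
            for x in range(lo, hi + 1)}
    cutoff = prob[a] * (1 + 1e-7)   # small factor guards against floating-point ties
    two_sided = sum(p for p in prob.values() if p <= cutoff)
    upper = sum(p for x, p in prob.items() if x >= a)
    lower = sum(p for x, p in prob.items() if x <= a)
    return two_sided, min(upper, lower)

two_sided, one_sided = fisher_exact_2x2(127, 100, 134, 117)
print(round(two_sided, 3), round(one_sided, 3))
```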

Question 16:  Perform a chi-squared test and a Fisher's exact test on whether the children's goals (see the description of the "goals" variable) differ by gender.  Copy the 2 by 3 table into your answers.  Is a chi-squared test justified or is a Fisher's exact test more appropriate?

Question 17:  In which cells is there a big difference between expected and observed counts?  What would you say is the biggest difference between girls and boys concerning their goals?

You can also use a STATA "calculator" to do these tests if you only know the summary data.  Suppose you know that the two-by-two table looks like the following:

    . table gender  looks2

    ----------------------
              |   looks2
       Gender |    0     1
    ----------+-----------
          boy |  109   118
         girl |   58   193
    ----------------------

Then we can perform a chi-squared test to see if there is an association between gender and whether a child thinks that "looking good" is important for popularity.

• Type `tabi 109 118 \ 58 193, chi2 exact expected` to do a Pearson's chi-squared test and a Fisher's exact test on the table.  Or, go to "Statistics>Summaries, tables, & tests>Tables>Table calculator" and type 109 118 \ 58 193 in the "User supplied cell frequencies" box.  Check the "Test Statistic" boxes for Pearson's chi-squared and Fisher's exact, and under "Cell contents" check Expected frequencies.  The p-value is very small (it displays as zero to the printed precision), showing us that there is a strong association.

## Matched Pairs Data (McNemar's test)

For a matched pair analysis, typically, we have some people in our study and the people are paired up so that people in the same pair are both the same age, weight, smoking status, etc.; the pairs are matched by whatever variables seem relevant.  Then we give the two people in each pair different treatments.  The idea is that any differences we see should be a result of the treatment we gave them, and not because of some socio-economic, health or other factor (confounder).

The following data is completely fictitious.  McNemar's test can be used for a 1-1 matched case-control study.  Suppose that in a small town, many people are employed at a chemical plant, and it is also the case that this town has a high rate of leukemia.  Researchers want to know if there is a connection.  Suppose we have 25 people with leukemia, and we match them with 25 people who do not have leukemia (matched by gender, race, age).  Then we check to see if they have ever worked at the chemical plant (exposure).  We give a person a "0" if they have not worked at the plant, and a "1" if they have worked at the plant.  The coding is important, because STATA will automatically interpret a "1" as "exposed" and a "0" as "not exposed".  The data is on the course web site and is called "casecontrol.dta".

• Type `use casecontrol, clear`
• Type `mcc case control` or go to Statistics>Observational/Epi. analysis>Tables for epidemiologists>Matched case control studies.  Your "Exposed case variable" is case and your "Exposed control variable" is control.  The Ho is "no relationship between exposure and disease".  Your output should look like:

    . mcc case control

                     | Controls               |
    Cases            |   Exposed   Unexposed  |     Total
    -----------------+------------------------+----------
             Exposed |        12           6  |        18
           Unexposed |         3           4  |         7
    -----------------+------------------------+----------
               Total |        15          10  |        25

    McNemar's chi2(1) =      1.00    Prob > chi2 = 0.3173
    Exact McNemar significance probability       = 0.5078

    Proportion with factor
            Cases            .72
            Controls          .6     [95% Conf. Interval]
                         ---------     --------------------
            difference       .12     -.1504438   .3904438
            ratio            1.2      .8390229   1.716282
            rel. diff.        .3     -.1919471   .7919471

            odds ratio         2      .4271342   12.35923   (exact)

• The p-value is 0.3173 for McNemar's test. However, the number of discordant pairs (6 + 3 = 9) is less than 20, so we should not use the version of McNemar's test based on the chi-squared statistic. McNemar's test has an exact version, so we should use that instead. Its p-value is 0.5078, so we still fail to reject the null hypothesis at α = 0.05.
• We can use a calculator version too. Go to "Statistics>Observational/Epi. analysis>Tables for epidemiologists>Matched case control calculator" and enter the appropriate cell counts, or type `mcci 12 6 3 4`
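
For anyone who wants to see where these numbers come from, here is a minimal sketch that recomputes them by hand from the two discordant cells of the table above. It is written in Python rather than STATA, purely as a cross-check; the variable names are illustrative and are not part of the lab files.

```python
from math import comb, erfc, sqrt

# Discordant pair counts taken from the mcc table above
n_case_only = 6     # case exposed, control unexposed
n_control_only = 3  # case unexposed, control exposed
n_disc = n_case_only + n_control_only  # 9 discordant pairs

# McNemar chi-squared statistic (no continuity correction) on 1 df
chi2 = (n_case_only - n_control_only) ** 2 / n_disc
# Upper tail of a chi-squared(1) variable: P(|Z| > sqrt(chi2))
p_chi2 = erfc(sqrt(chi2 / 2))

# Exact version: under H0 the discordant pairs split as Binomial(n_disc, 0.5)
k = min(n_case_only, n_control_only)
lower_tail = sum(comb(n_disc, i) for i in range(k + 1)) / 2 ** n_disc
p_exact = min(1.0, 2 * lower_tail)  # two-sided

# Matched-pairs odds ratio is the ratio of the discordant cells
odds_ratio = n_case_only / n_control_only

print(f"chi2 = {chi2:.2f}, p = {p_chi2:.4f}")
print(f"exact p = {p_exact:.4f}, odds ratio = {odds_ratio:.1f}")
```

Only the 6 and the 3 enter the calculation. The concordant pairs (12 and 4) do not affect McNemar's test or the matched odds ratio.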

Question 18: Perform a McNemar's analysis on the matched pair data in "case2" and "control2". Copy the table into your answer sheet.

Question 19: Is a McNemar's chi-squared test appropriate? Does "exposure" increase the risk of disease?

• FYI, another way to do a McNemar test is to type `symmetry case control, exact`, which gives the McNemar test of homogeneity (both the chi-squared and the exact version). The results are exactly the same as before; the only difference is that the table STATA produces flips the rows and columns. You can also go to "Statistics>Observational/Epi. analysis>Other>Symmetry & marginal homogeneity tests" and the corresponding "Statistics>Observational/Epi. analysis>Other>Symmetry & marginal homogeneity tests calculator". If your matched pairs data have more than 2 levels of exposure, both of these commands can cope with that. Be careful when asking for the exact test: it may stall your computer for a long time when the table is larger than 2 by 2.

## Kappa Statistic – Measure of Agreement

We use the kappa statistic when we wish to know how reliable a diagnostic test is, whether people will give the same response to survey questions from week to week, or to what degree two independent physicians agree on the diagnosis of the same patients.

Suppose two testing labs are each asked to classify cell tissue from the same 25 patients as "healthy" or "not healthy". Let "healthy" be indicated with a "1" and "not healthy" with a "0". These data are kept in the variables "lab1" and "lab2".

• Type `table lab1 lab2` to get the table, and then type `kap lab1 lab2` to get the kappa statistic. You can also go to "Statistics>Observational/Epi. analysis>Other>Interrater agreement, two unique raters". You should get the following:

    . table lab1 lab2

    ----------------------
              |    lab2
         lab1 |    0     1
    ----------+-----------
            0 |    3     4
            1 |    1    17
    ----------------------

    . kap lab1 lab2

                 Expected
    Agreement   Agreement     Kappa   Std. Err.         Z      Prob>Z
    -----------------------------------------------------------------
      80.00%      64.96%     0.4292     0.1879       2.28      0.0112

• The kappa statistic is 0.4292 and the p-value is 0.0112. The agreement is (moderately) good (see p. 410 of Rosner for guidelines for evaluating kappa) and is significantly better than chance agreement at the 5% level.
• There does not appear to be a test calculator in STATA for the kappa statistic.
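
As a cross-check on the kap output, the following sketch recomputes the agreement figures by hand from the 2 by 2 table above. It is written in Python rather than STATA, and the variable names are illustrative. The standard error uses the usual large-sample formula for kappa under the null hypothesis of chance agreement; that this is the formula behind the printed "Std. Err." is an assumption, although it does reproduce the values shown.

```python
from math import erfc, sqrt

# Agreement table from above: rows are lab1 (0, 1), columns are lab2 (0, 1)
table = [[3, 4],
         [1, 17]]

n = sum(sum(row) for row in table)
row_p = [sum(row) / n for row in table]        # lab1 marginal proportions
col_p = [sum(col) / n for col in zip(*table)]  # lab2 marginal proportions

# Observed agreement: proportion of patients on the diagonal
p_obs = sum(table[i][i] for i in range(2)) / n
# Agreement expected by chance if the two labs were independent
p_exp = sum(row_p[i] * col_p[i] for i in range(2))

kappa = (p_obs - p_exp) / (1 - p_exp)

# Standard error of kappa under H0: kappa = 0 (assumed formula, see lead-in)
s = sum(row_p[i] * col_p[i] * (row_p[i] + col_p[i]) for i in range(2))
se0 = sqrt(p_exp + p_exp ** 2 - s) / ((1 - p_exp) * sqrt(n))

z = kappa / se0
p_one_sided = 0.5 * erfc(z / sqrt(2))  # upper-tail normal probability

print(f"agreement = {p_obs:.2%}, expected = {p_exp:.2%}, kappa = {kappa:.4f}")
print(f"se = {se0:.4f}, z = {z:.2f}, p = {p_one_sided:.4f}")
```

The p-value is one-sided because the alternative of interest is agreement better than chance (kappa greater than 0).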

Question 20: In this data set, there are results from two fictitious diagnostic tests (in variables "test1" and "test2") that are applied to the same patients. Copy the 2 by 2 table into your answer sheet. Evaluate the kappa statistic to find out if there is good agreement between the tests and whether the agreement is significant.

To the best of my knowledge STATA does not explicitly calculate the following:

• Test of trend using Chi-square
• Sample size or power for matched pairs (dependent samples) data.

Here is a note about how your data is set out.  A table like this:

| disease (rows) / quality of life (columns) | 1 | 2 | 3 |
|---|---|---|---|
| 1 | 234 | 23 | 53 |
| 0 | 66 | 120 | 107 |

could be entered into a database in a number of different ways. There are 603 people in this study in total, and it is a table of disease status (0-1) by quality of life (1-2-3). The data could be like this:

    . list

         +------------------------------+
         | disease   quality   freque~y |
         |------------------------------|
      1. |       1         1        234 |
      2. |       1         2         23 |
      3. |       1         3         53 |
      4. |       0         1         66 |
      5. |       0         2        120 |
         |------------------------------|
      6. |       0         3        107 |
         +------------------------------+

or like this:

    . list

          +-------------------+
          | disease   quality |
          |-------------------|
       1. |       1         1 |
       2. |       0         1 |
       3. |       1         2 |
       4. |       1         3 |
       5. |       0         3 |
          |-------------------|
          .
          .
          .
     601. |       1         1 |
     602. |       0         2 |
     603. |       0         1 |
          +-------------------+

We have run the tests in this lab assuming the data are laid out in the second manner, but we can run most of the tests if the data are laid out in the first manner, with frequencies. STATA calls frequencies "weights". So to do a chi-squared test of association, for example, you would go to "Statistics>Summaries, tables, & tests>Tables>Two-way tables with measures of association" and check all the statistics that you would like STATA to calculate. Then click on the "Weights" window, check the "Frequency weights" box, and enter the variable "frequency". A lot of commands (but not all) have a "Weights" window. In the command line, you would type `tabulate disease quality [fweight=frequency], chi2` to get the chi-squared test of association.
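
To make the link between the two layouts concrete, here is a small sketch, in Python rather than STATA and with illustrative variable names, that expands the six frequency rows into one record per person and then computes the Pearson chi-squared statistic directly from the frequencies. It is meant only as an illustration of what a frequency weight stands for, not as a replacement for the tabulate command.

```python
from math import exp

# First layout: (disease, quality, frequency), copied from the listing above
freq_rows = [(1, 1, 234), (1, 2, 23), (1, 3, 53),
             (0, 1, 66), (0, 2, 120), (0, 3, 107)]

# Second layout: one (disease, quality) record per person
people = [(d, q) for d, q, f in freq_rows for _ in range(f)]

# Row and column totals from the frequencies
n = sum(f for _, _, f in freq_rows)
row_tot, col_tot = {}, {}
for d, q, f in freq_rows:
    row_tot[d] = row_tot.get(d, 0) + f
    col_tot[q] = col_tot.get(q, 0) + f

# Pearson chi-squared: sum of (observed - expected)^2 / expected over all cells
chi2 = 0.0
for d, q, f in freq_rows:
    expected = row_tot[d] * col_tot[q] / n
    chi2 += (f - expected) ** 2 / expected

df = (len(row_tot) - 1) * (len(col_tot) - 1)
# For df = 2 the chi-squared upper tail has the closed form exp(-x / 2)
p_value = exp(-chi2 / 2)

print(f"{len(people)} people, chi2({df}) = {chi2:.2f}, p = {p_value:.3g}")
```

Both layouts describe the same 603 people, which is why a test run with frequency weights on the first layout should agree with the same test run on the second.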

The End.

# Saving the Lab

At the end of the session, use the following procedure to save any files you may want to review later (e.g. your log file). These instructions are for saving your files onto a floppy disk. If you have a zip disk, follow the same steps but use the "Zip" folder on the Desktop rather than the "Floppy" folder.

• Type log close and your log file is automatically saved and closed.  You can also go to File>Log>Close.
• Insert floppy disk (or zip disk).
• Right click on the "Floppy" icon on the Desktop and select "Mount". We can now save files onto this disk. If you do not "Mount" the disk, then your files may not save properly.
• Close your "Lab5" folder if it is open.  Click on the "Lab5" icon on the Desktop and drag the whole folder to the floppy disk icon on your Desktop.   You should get a small menu giving you a choice to "Move" or "Copy" the documents.  Click on "Copy".   Your files should now be on your floppy disk.
• Double click on the floppy disk icon to check that there is now a "Lab5" folder on your floppy disk.
• Now close the floppy disk window, and right click on the floppy disk icon and select "Unmount".  You must do this in order to take your disk out of these machines and still have your files saved.
• Now press the button on your computer to eject the floppy disk.

It is very important to save a backup on the university computer in case something happens to the disk.

• Click on the "Lab5" folder icon and drag the whole folder to the "AFS" folder on your desktop. You should get a small menu giving you a choice to "Move" or "Copy" the documents. Click on "Copy". Your files are now stored on the University of Pittsburgh computer system and can be accessed from any computer with an internet connection. See the instructions below on how to access these documents from your home computer.
• You have finished -- see you for the next lab!

### Accessing the files from home from the University of Pittsburgh computer system

Here are some instructions, FYI, to help you access your backup copy after you leave the lab, in case there is a problem with your floppy or zip disk. To access your backup copies from your home or office computer, do the following:

• Open Netscape Navigator or Internet Explorer. Type ftp://username@unixs.cis.pitt.edu and go to this destination. (E.g., using my username, I would type ftp://fmc2@unixs.cis.pitt.edu )
• After the screen has loaded, you should see a list of files, and one of them should be your "Lab5" folder. Just click and drag it to wherever you want to put it on your home computer. Then close the browser.

## Answer Sheet – Lab5 CLRES 2020 Summer 04.

NAME and DATE:

Question 1

| # sports important | # sports not important | # total |
|---|---|---|
|  |  |  |

Question 2

Question 3:

Question 4:

Question 5

Question 6

Question 7

Question 8

Question 9

Question 10

Question 11

Question 12

Question 13

Question 14

Question 15

Question 16

Question 17

Question 18

Question 19

Question 20
