# Missing values

Sometimes, a data set may have “holes” in it, that is, missing values. Some statistical procedures such as regression analysis will not work as well, or at all, on a data set with missing values. The observations with missing values have to be deleted or the missing values have to be substituted in order for a statistical procedure to produce meaningful results. Most statistical programs (including SAS, SPSS, and Stata) will automatically remove such cases from any analysis you run (without deleting the cases from the dataset). This is why the “n” often varies from analysis to analysis, even if the dataset is the same. Different variables have different amounts of missing data and hence, changing the variables in a model changes the number of cases with complete data on all the variables in the model. Because the software drops cases with missing values for us, it is very easy to “forget” about missing data entirely. However, the presence of missing data can influence our results, especially when a dataset or even a single variable, has a high percentage of values missing. Thus it is always a good idea to check a dataset for missing data, and to think about how the missing data may influence our analyses. This page shows a few methods of looking at missing values in a dataset, this information can be used to make better informed decisions about how to handle the missing values.

Before we begin, we need some data with missing values, the code below inputs a small dataset into Stata, and then displays that data. In a small dataset, like the one below, it is very easy to look at the raw data and see where values are missing. However, when datasets are large, we need a more systematic way to examine our dataset for missing values. Below we show you some ways to do that, using the data below as an example.

1

2

3

4

5

6

7

8

9

10

11

12

13

14

15

16

17

18

19

20

21

22

23

24

25

26

27

28

29

30

31

32

33

34

35

36

37

38

39

40

41

42 clear

input landval improval totval salepric saltoapr

30000 64831 94831 118500 1.25

30000 50765 80765 93900 .

46651 18573 65224 . 1.16

45990 91402 . 184000 1.34

42394 . 40575 168000 1.43

. 3351 51102 169000 1.12

63596 2182 65778 . 1.26

56658 53806 10464 255000 1.21

51428 72451 . . 1.18

93200 . 4321 422000 1.04

76125 78172 54297 290000 1.14

. 61934 16294 237000 1.10

65376 34458 . 286500 1.43

42400 . 57446 . .

40800 92606 33406 168000 1.26

end

list

+---------------------------------------------------+

| landval improval totval salepric saltoapr |

|---------------------------------------------------|

1. | 30000 64831 94831 118500 1.25 |

2. | 30000 50765 80765 93900 . |

3. | 46651 18573 65224 . 1.16 |

4. | 45990 91402 . 184000 1.34 |

5. | 42394 . 40575 168000 1.43 |

|---------------------------------------------------|

6. | . 3351 51102 169000 1.12 |

7. | 63596 2182 65778 . 1.26 |

8. | 56658 53806 10464 255000 1.21 |

9. | 51428 72451 . . 1.18 |

10. | 93200 . 4321 422000 1.04 |

|---------------------------------------------------|

11. | 76125 78172 54297 290000 1.14 |

12. | . 61934 16294 237000 1.1 |

13. | 65376 34458 . 286500 1.43 |

14. | 42400 . 57446 . . |

15. | 40800 92606 33406 168000 1.26 |

+---------------------------------------------------+

## 1. Number of missing values vs. number of non missing values

The first thing we are going to do is determine which variables have a lot of missing values. We have created a small Stata program called **mdesc** that counts the number of missing values in both numeric and character variables. You can download **mdesc** from within Stata by typing **search mdesc** (see How can I use the search command to search for programs and get additional help? for more information about using **finidit**).

Then you can run **mdesc** for one or more variables as illustrated below.

1

2

3

4

5

6

7

8

9 mdesc

Variable Missing Total Missing/Total

------------------------------------------------------------

landval 2 15 .133333

improval 3 15 .2

totval 3 15 .2

salepric 4 15 .266667

saltoapr 2 15 .133333

Now we know the number of missing values in each variable. For instance, variable **salepric** has four missing values and **saltoapr** has two missing values.

## 2. Obtaining the number of missing values per observation

We can also look at the distribution of missing values across observations. The code below creates a variable called **nmis** that gives the number of missing values for each observation. The function **rmiss2()** used here is an extension to the egen function **rmiss()**. It counts the number of missing values in the varlist.**rmiss2()** accepts both string and numeric variables. (Stata’s **rmiss()** only accepts numeric variables.) You can download **rmiss2() **over the internet from within Stata by typing **search rmiss2** (see How can I use the search command to search for programs and get additional help? for more information about using **search**).

1 egen nmis=rmiss2(landval improval totval salepric saltoapr)

Below we tabulate the variable we just created. Looking at the frequency table we know that there are four observations with no missing values, nine observations with one missing values, one observation with two missing values and one observation with three missing values.

1

2

3

4

5

6

7

8

9

10 tab nmis

nmis | Freq. Percent Cum.

------------+-----------------------------------

0 | 4 26.67 26.67

1 | 9 60.00 86.67

2 | 1 6.67 93.33

3 | 1 6.67 100.00

------------+-----------------------------------

Total | 15 100.00

## 3. Patterns of missing values

We can also look at the patterns of missing values. You can download **mvpatterns **over the internet from within Stata by typing **search mvpatterns** (see How can I use the search command to search for programs and get additional help? for more information about using **search**). The command **mvpatterns** produces output for all variables in the dataset, for missing data patterns across a subset of variables, a variable list can be included, for example, **mvpatterns landval improval totval**.

The output produced by **mvpatterns** is shown below. The first table lists the variables, their storage type (type), the number of observations (obs), the number of missing values (mv), and the variable label if the variables has one. The second table contains the information on the pattern of missing values. The first block of columns in the output shows the patterns of missing data. Within the block, each variable is represented by a column, a “+” indicates that values of that variable are present in a given missing data pattern, a “.” indicates that they are missing. The columns follow the same order as the variable list in the first table, so that the first column in the output below represents **landval**, the second **improval**, and so on. The missing data patterns are listed in descending frequency, here the most common missing data pattern is complete data (“+++++”). The table also shows the number of missing values in that pattern (_mv), and the number of cases with that missing data pattern (_freq). Based on the information in the second table we know that there are four observations with no missing values, two cases missing on just the variable **salepric**, and one observation with missing values on **improval**, **salepric** and **saltoapr**.

1

2

3

4

5

6

7

8

9

10

11

12

13

14

15

16

17

18

19

20

21

22

23

24

25

26

27 mvpatterns

Variable | type obs mv variable label

-------------+-----------------------------------

landval | float 13 2

improval | float 12 3

totval | float 12 3

salepric | float 11 4

saltoapr | float 13 2

-------------------------------------------------

Patterns of missing values

+------------------------+

| _pattern _mv _freq |

|------------------------|

| +++++ 0 4 |

| +++.+ 1 2 |

| ++.++ 1 2 |

| +.+++ 1 2 |

| .++++ 1 2 |

|------------------------|

| ++++. 1 1 |

| ++..+ 2 1 |

| +.+.. 3 1 |

+------------------------+

## 4. When all the variables of interest are numeric

The examples above all work regardless of whether the variables of interest (i.e. the variables whose missing data patterns you want to examine) are numeric or string. When all of the variables you wish to check for missing values are numeric we can use a program called **misschk** to simplify the steps of examining the missing data in our dataset. (Note: numeric variables include those with value labels that are strings, as long as the actual values of the variables are stored as numbers.) You can download **misschk** from within Stata by typing **search misschk **(see How can I use the search command to search for programs and get additional help? for more information about using **search**).

Below is the command for **misschk**. We have listed all five of the variables in our dataset in the variable list after the **misschk** command. However, we could have just left the list of variables blank (i.e. used only **misschk , gen(miss)** instead), if we had, **misschk** would have run using all the variables in our dataset. The variable list is only necessary if we want to run **misschk** on only some of the variables in our dataset. The **gen(miss)** option tells **misschk** that we want it to create two new variables, both of which start with “miss”. These two variables will be named **misspattern** and **missnumber**. The variable **misspattern** indicates which of the missing data patterns each case follows. The variable **missnumber** indicates the number of missing values for each case.

1 misschk landval improval totval salepric saltoapr, gen(miss)

The output for **misschk** consists of three tables. The first table lists the number of missing values, as well as percent missing for each variable, this is similar to the table produced by **mdesc** in part 1 above. This table also contains a column labeled “#” which assigns each variable a number that is used to identify the variable later on in the output. The second table shows the distribution of missing values. The pattern of missingness is described using the variable numbers from the first table, and underscores (“_”). The numbers indicate which variables are missing in that pattern, the underscores represent non-missing observations. For example, from the second table we see that two cases have missing values on variable 1 (**landval**), but complete data on all other variables, and that one case is missing data on variables 2, 4, and 5. The bottom row shows that four cases are not missing any values at all (all underscores). This table shows the same information generated in part three above, but in a slightly different format. The missing data pattern for each case is described in the variable **misspattern**. Finally, the third table shows the distribution of the number of missing values per case. This is the same information discussed above in part 2. The number of variables each case is missing is also contained in the variable **missnumber**.

1

2

3

4

5

6

7

8

9

10

11

12

13

14

15

16

17

18

19

20

21

22

23

24

25

26

27

28

29

30

31

32

33

34

35 Variables examined for missing values

# Variable # Missing % Missing

--------------------------------------------

1 landval 2 13.3

2 improval 3 20.0

3 totval 3 20.0

4 salepric 4 26.7

5 saltoapr 2 13.3

Missing for |

which |

variables? | Freq. Percent Cum.

------------+-----------------------------------

1____ | 2 13.33 13.33

_2_45 | 1 6.67 20.00

_2___ | 2 13.33 33.33

__34_ | 1 6.67 40.00

__3__ | 2 13.33 53.33

___4_ | 2 13.33 66.67

____5 | 1 6.67 73.33

_____ | 4 26.67 100.00

------------+-----------------------------------

Total | 15 100.00

Missing for |

how many |

variables? | Freq. Percent Cum.

------------+-----------------------------------

0 | 4 26.67 26.67

1 | 9 60.00 86.67

2 | 1 6.67 93.33

3 | 1 6.67 100.00

------------+-----------------------------------

Total | 15 100.00