Sometimes data is not obviously wrong or missing, but contains a more subtle error that can be found with knowledge of our data and how it was collected.
Data Files:
Data Source Information:

National Health and Nutrition Examination Survey (NHANES) main site

NHANES 1999-2000 information

Demographics Documentation/Codebook

Body Measurements Documentation/Codebook

Occupation Documentation/Codebook
Python and Pandas Resources:
Sometimes data isn't obviously wrong or
missing; it has a more subtle error.
0:00
Many of these errors can only be found
by using our knowledge of the data and
0:05
how it was collected.
0:08
We'll be continuing to
fix our example data set.
0:10
Hopefully you saved the changes you
made in the previous lessons, but
0:13
if not, the updated versions
are available in the teacher's notes.
0:16
There's also a link to the code I'll be
using, if you'd like to download that.
0:20
Go ahead and pause this video while
you download any files you need and
0:24
set up your programming environment.
0:28
Ready? Great.
0:30
We're going to start off by talking
about nonsensical or impossible data.
0:32
These are data entries that
are nonsensical or impossible, based on
0:36
the way that the data set was created or
the nature of what was recorded.
0:40
First, we want to load in the libraries
we'll need, numpy and pandas.
0:44
We'll also be working with all three
files again, so I'll load those as well.
0:50
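A minimal sketch of that setup, assuming the cleaned files from the earlier lessons were saved as CSVs; the file names here are placeholders, so substitute whatever names you used:

```python
import numpy as np
import pandas as pd

# File names are placeholders; use the names you saved in the previous lessons.
demographics = pd.read_csv('demographics.csv')
body_measurements = pd.read_csv('body_measurements.csv')
occupation = pd.read_csv('occupation.csv')
```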
Now, let's find some impossible data.
1:07
Weight is a good example for
impossible data.
1:10
We know a person can never
have a negative weight.
1:12
So let's print a description
of our weight column.
1:15
According to the summary
of the weight column,
1:21
the lowest recorded weight
is negative 149 kilograms.
1:23
We're not sure what the weight was
supposed to have been, so
1:27
we can just remove all
values less than zero.
1:30
First, we find all rows with
the weight less than zero.
1:33
Then we replace all those
values with not a number.
1:39
Now, we can view the description
of our data again.
1:47
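A sketch of that check, assuming the weight column in the body measurements file is the NHANES column BMXWT and using the DataFrames loaded above:

```python
# Summarize the weight column; the minimum is a negative weight.
print(body_measurements['BMXWT'].describe())

# Find the rows where the weight is less than zero...
negative_weight = body_measurements['BMXWT'] < 0

# ...and replace those weights with NaN (not a number).
body_measurements.loc[negative_weight, 'BMXWT'] = np.nan

# Describe the column again to confirm the minimum is now plausible.
print(body_measurements['BMXWT'].describe())
```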
Now our lowest weight is just over three
kilograms, a feasible value for a baby.
1:53
This is a good sanity check for
weight, but it's very simplistic.
1:58
After all,
we have all ages in our data set.
2:02
And a feasible weight range for
2:05
a two year old would be drastically
different than one for a 50 year old.
2:06
If we wanted to be more careful, we
could design more complicated checks and
2:10
take a participant's age into account.
2:14
More complicated checks are outside
the scope of this course.
2:16
But something to keep in mind as
you're cleaning your own data.
2:19
Categorical variables can also
have impossible values.
2:23
Let's take a look at the possible
entries for weight comment.
2:26
The column for
2:29
weight comment has four possible values,
coded as one through four.
2:30
Any other entry in that
column is an error.
2:36
Let's take a look at the unique
entries in the weight comment column.
2:39
It looks like the numbers 7 and
11 have also been entered.
2:47
Since they aren't valid entries,
we need to remove them from our data set.
2:50
Again, we'll find all the invalid entries
and replace them with not a number.
2:54
Let's see what our new
data entries are now.
3:07
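Roughly like this, assuming the weight comment column is the NHANES column BMIWT:

```python
# See which codes appear in the weight comment column.
print(body_measurements['BMIWT'].unique())

# Codes 1 through 4 are valid; flag everything else.
invalid_comment = ~body_measurements['BMIWT'].isin([1, 2, 3, 4])

# Replace the invalid codes with NaN.
body_measurements.loc[invalid_comment, 'BMIWT'] = np.nan

# Confirm that only 1 through 4 and NaN remain.
print(body_measurements['BMIWT'].unique())
```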
There we have it, only entries of
one through four, or not a number.
3:16
Often data entries are problematic
because they have extreme values.
3:21
According to Guinness World Records,
3:24
the heaviest person in the world weighed
635 kilograms at their heaviest.
3:26
Let's take another look
at our weight column.
3:31
Our data set has a maximum
weight of over 12,000 kilograms.
3:37
That's much higher than the world record.
3:41
Something must be wrong
with those entries.
3:43
Since they don't make sense,
let's remove all weights heavier than 635.
3:44
We can do the same thing as before.
3:50
Find all the rows with a weight above 635.
3:53
Then replace those rows with not a number.
3:58
And then finally we can
look at our data again.
4:05
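A sketch of that step, using the same assumed BMXWT column:

```python
# Find the rows with a weight above the 635 kilogram world record...
too_heavy = body_measurements['BMXWT'] > 635

# ...and replace those weights with NaN.
body_measurements.loc[too_heavy, 'BMXWT'] = np.nan

# Describe the column again; the maximum should now be reasonable.
print(body_measurements['BMXWT'].describe())
```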
Now the maximum weight is a much
more reasonable 193 kilograms.
4:10
Some analyses will also remove outliers,
or values too far from the average.
4:15
Values too far from normal
could bias the data set, or
4:20
there may be something wrong with
the data that caused those values.
4:23
On the other hand, if we don't have a good
reason to suspect a problem with them,
4:27
these outliers may be valid,
although unusual, data entries.
4:31
In that case, removing them eliminates
important information from the data set.
4:35
Whether you remove outliers or not will
depend on your particular circumstances.
4:40
One way to find outliers is to determine
how many standard deviations a value
4:44
is from the average for that group.
4:48
This is sometimes called the z-score.
4:50
To find the z-score of the maximum and
minimum weight values,
4:52
we first calculate the mean and
standard deviation in the entire column.
4:56
Note that we need to use nanmean and
nanstd so that the functions can deal with
5:00
the presence of not a number
values correctly.
5:05
Let's find the z-score for
these maximum and minimum weights.
5:09
Like with the mean and standard deviation,
we use nanmin and nanmax to find
5:13
the maximum and minimum values,
while ignoring not a number entries.
5:17
To get the z-score for the lowest weight,
5:29
we subtract the mean
from our minimum value.
5:31
Then we divide by the standard deviation.
5:37
We do the same thing to get
the z-score for the maximum weight.
5:40
And then let's print out
the z-scores we just calculated.
5:48
And I misspelled high as height.
6:14
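Put together, the z-score calculation might look like this, again assuming the weight column is BMXWT:

```python
# Mean and standard deviation of the weight column, ignoring NaN entries.
weight_mean = np.nanmean(body_measurements['BMXWT'])
weight_std = np.nanstd(body_measurements['BMXWT'])

# Minimum and maximum weights, also ignoring NaN entries.
weight_min = np.nanmin(body_measurements['BMXWT'])
weight_max = np.nanmax(body_measurements['BMXWT'])

# A z-score is the number of standard deviations a value is from the mean.
low_zscore = (weight_min - weight_mean) / weight_std
high_zscore = (weight_max - weight_mean) / weight_std

print('z-score of lowest weight:', low_zscore)
print('z-score of highest weight:', high_zscore)
```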
There we go.
Our maximum weight is four standard
6:17
deviations from the mean.
6:20
And our minimum weight is one
standard deviation below the mean.
6:21
Finally, let's look for saturated data.
6:25
According to the code book,
all ages above 85 were recorded as 85.
6:28
This means that the age
value saturates at 85;
6:33
anything above that value is a mistake.
6:36
Also, we don't know the true
age of anyone recorded as 85,
6:38
only that they are at least 85.
6:42
If a person's exact age is important,
6:45
we may want to remove all saturated
entries or run a separate analysis for
6:47
people 85 and older,
one that does not include age as a term.
6:52
In other cases,
6:56
grouping all those over 84 into a single
age may not affect the analysis.
6:57
Let's see what our maximum age is.
7:02
We have at least one entry
above the saturation limit,
7:11
with a maximum listed at 109.
7:14
If we feel confident that someone just
forgot to cap ages at a maximum of 85,
7:16
we can replace all higher values with 85.
7:21
If we're concerned that some
other error caused the problem,
7:24
it may be better to remove
the values entirely.
7:27
Let's replace all values
above 85 with a value of 85.
7:30
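A sketch of that replacement, assuming the age column in the demographics file is the NHANES column RIDAGEYR:

```python
# Check the maximum recorded age; anything above 85 exceeds the saturation limit.
print(demographics['RIDAGEYR'].max())

# Replace every age above 85 with the saturation value of 85.
over_limit = demographics['RIDAGEYR'] > 85
demographics.loc[over_limit, 'RIDAGEYR'] = 85

# Confirm the maximum age is now 85.
print(demographics['RIDAGEYR'].max())
```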
Remember to save all the changes
we've made to the data files.
7:44
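One way to save the changes, writing each DataFrame back to the same placeholder file names used above:

```python
# Write the cleaned DataFrames back to disk; adjust the file names to match yours.
demographics.to_csv('demographics.csv', index=False)
body_measurements.to_csv('body_measurements.csv', index=False)
occupation.to_csv('occupation.csv', index=False)
```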
That's it for
the video portion of this lesson.
8:11
Now it's time for you to try out
what you've learned on your own.
8:13
The teacher's notes have a link to
the practice notebook for this lesson.
8:16
There are two practice exercises for
you this time.
8:19
You'll need to check some of
the columns in the occupation file for
8:22
nonsensical values.
8:25
Then you'll find the z-score
of the maximum and
8:26
minimum values of column
OCQ180 in occupation.
8:29
Make sure you don't count the values 7777
or 9999, since those
8:33
are extra code values and
not actually the number of hours worked.
8:36
Remember to save your data
again when you're finished.
8:42
Afterwards, it's time for another quiz.
8:44
When you've completed the quiz,
8:46
we'll talk about ways to help
automate the data cleaning process.
8:48