Normalizing SASsy Data Using Log Transformations

by Chris Rucker

Most data analysts know that data is dirty, and SAS data is no exception.  It is often unstructured, lacks primary or foreign keys, and contains duplicate observations.

One best practice before performing an exploratory data analysis is to transform your data so that it is roughly symmetrical, like a normal distribution or bell curve.  In a normal distribution, approximately 68 percent of the data falls within one standard deviation of the mean.  A logarithmic function (log) compresses large values more than small ones, so applying it to positive, right-skewed data pulls in the long tail and reduces the influence of noisy, extreme values.

The SAS programming language has a common logarithm (base-10) function, LOG10, for transforming skewed, dirty data into something more symmetrical.  A logarithm answers the question "to what power must the base be raised to produce this number?"  For example, the base-10 log of 100 is 2 because 10 raised to the power of 2 is 100.
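
As a quick illustration of that question, the sketch below writes a few base-10 logs to the SAS log.  The input values and the variable names x and y are illustrative assumptions, not part of the original example.

```sas
/* Sketch: LOG10 returns the power to which 10 must be raised. */
data _null_;
  do x = 1, 10, 100, 1000;   /* 10**0, 10**1, 10**2, 10**3 */
    y = log10(x);
    put x= y=;               /* write each pair to the SAS log */
  end;
run;
```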

This example uses the Sashelp.cars dataset because of its relative simplicity and small number of observations.  The following minimal SAS code applies a base-10 log transformation to the "Cylinders" variable and outputs a parallel log variable called "LOGVAR".

SAS Code:

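/* Copy Sashelp.cars and add a base-10 log of Cylinders */
data cars_log_transformed;
  set sashelp.cars;
  /* LOG10 needs a positive argument; a missing input gives a missing LOGVAR */
  LOGVAR=log10(cylinders);
run;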

Partial SAS Dataset:

Make      Cylinders   LOGVAR
Acura     6           0.778151
Acura     4           0.60206
Acura     4           0.60206
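
One way to check whether the transformation helped is to compare skewness before and after.  The following is a minimal sketch, assuming the cars_log_transformed dataset created above; a skewness value closer to zero suggests a more symmetrical distribution.

```sas
/* Sketch: compare skewness of the raw and log-transformed variables. */
proc means data=cars_log_transformed n nmiss mean std skewness maxdec=3;
  var cylinders logvar;
run;
```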

What Does It All Mean?

Graphing our two variables shows the distribution of the Cylinders variable after transformation.
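
The article does not say how its graph was produced.  As one possible approach, the sketch below draws a histogram of LOGVAR with a fitted normal curve overlaid, assuming the cars_log_transformed dataset from the code above; the axis label is an illustrative choice.

```sas
/* Sketch: histogram of LOGVAR with a fitted normal curve overlaid. */
proc sgplot data=cars_log_transformed;
  histogram logvar;
  density logvar / type=normal;
  xaxis label="log10(Cylinders)";
run;
```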

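Figure 1 shows the majority of the data (shaded area) centered on the mean (~0.75).  Approximately 68 percent of the data fell within one standard deviation of the mean, between log values of ~0.66 and ~0.83.  That matches the 68 percent rule, so we have an approximately normal distribution!  (Cylinders takes only a handful of distinct values, so the fit is rough at best.)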
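
The 68 percent figure can be checked directly rather than read off a graph.  The sketch below is one way to do it, assuming the cars_log_transformed dataset from the code above; the dataset names stats and within_one_sd and the variables mu, sigma, and within are hypothetical names chosen for illustration.  The last step adds the normality tests that PROC UNIVARIATE provides.

```sas
/* Sketch: share of LOGVAR values within one standard deviation of the mean. */
proc means data=cars_log_transformed noprint;
  var logvar;
  output out=stats mean=mu std=sigma;   /* one row holding mean and std dev */
run;

data within_one_sd;
  if _n_ = 1 then set stats(keep=mu sigma);   /* mu and sigma are retained */
  set cars_log_transformed;
  if missing(logvar) then delete;             /* skip rows with no log value */
  within = (abs(logvar - mu) <= sigma);       /* 1 if inside one std dev, else 0 */
run;

/* The mean of a 0/1 flag is the proportion of rows flagged 1. */
proc means data=within_one_sd n mean maxdec=3;
  var within;
run;

/* Formal normality tests for LOGVAR. */
proc univariate data=cars_log_transformed normal;
  var logvar;
run;
```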

The result includes a 95 percent confidence interval with a 3.61 percent margin of error.  In other words, if the sampling were repeated many times, about 95 percent of the intervals built this way (the statistic plus or minus 3.61 percentage points) would contain the real population value.
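
The article does not show how the interval was computed, so the following is only a sketch of one standard approach: a 95 percent confidence interval for the mean of LOGVAR using the CLM keyword of PROC MEANS, assuming the cars_log_transformed dataset from the code above.  The margin of error is half the width of the reported interval; this sketch is not claimed to reproduce the 3.61 figure.

```sas
/* Sketch: 95 percent confidence limits for the mean of LOGVAR. */
proc means data=cars_log_transformed n mean stderr clm alpha=0.05;
  var logvar;
run;
```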

Now we have a more symmetrical variable, with less influence from noise and extreme values!

Chris Rucker is a Data Scientist and analyzes data for a large MCO.
GOMAB
