Thursday, March 17, 2005

Data cleanup and trendline madness

Looking at the graph of raw MAF values it is obvious that some points contain wrong data:



So I decided to come up with an easy method of cleaning it up, since visually there is a very strong basic trend and only a few clear outliers.
I created a bunch of different methods for it in Excel, but they all seemed to run into the same problems:
1. very manual and not automatable
2. they don't work universally for the whole range: they'd work for either the low or the high range, but never for both
3. too simple to self-scale for different samples, as some datasets are dirtier than others

Thus, instead of working on a wacky function with simple math, I figured that bringing the function into a more manageable form first would allow me to use simpler methods. Having seen how a lot of relationships in nature are better viewed on logarithmic scales, I started graphing first one axis, then both axes, as LOG10(freq) and LOG10(flow).
Here's the amazing result:



The datapoints form a line! So I quickly put in a linear trendline, and it's off, but it just seems to be thrown off by some values that are still clearly way out there. So I proceed to throw out TWO datapoints with zero values, just to see what happens. Look at the trendline: it shifts significantly, and R^2 indicates a MUCH better fit, just from tossing 2 points out of more than 22.7k of them! And I thought large data samples eliminated small mistakes...
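To make the effect of those two points concrete, here is a minimal Python sketch. It does not use the real MAF log: the data is synthetic, generated from an assumed power law (the coefficient 0.002 and the exponent 2.5 are made up), the sample is only 2,000 points so the effect is easier to see, and `linear_fit` is a hypothetical helper standing in for Excel's linear trendline.

```python
import math
import random

def linear_fit(xs, ys):
    """Ordinary least squares; returns (slope, intercept, r_squared)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept, (sxy * sxy) / (sxx * syy)

# Synthetic stand-in for the logged data (NOT the real measurements):
# an assumed power law with +/- 2% multiplicative noise.
random.seed(0)
freq = [random.uniform(1000, 10000) for _ in range(2000)]
flow = [0.002 * f ** 2.5 * random.uniform(0.98, 1.02) for f in freq]
log_f = [math.log10(f) for f in freq]
log_q = [math.log10(q) for q in flow]

# Two bogus samples that sit at zero on the vertical axis.
dirty_f = log_f + [3.2, 3.8]
dirty_q = log_q + [0.0, 0.0]

slope_dirty, _, r2_dirty = linear_fit(dirty_f, dirty_q)
slope_clean, _, r2_clean = linear_fit(log_f, log_q)
print(f"with the two bad points: slope={slope_dirty:.3f}  R^2={r2_dirty:.4f}")
print(f"without them:            slope={slope_clean:.3f}  R^2={r2_clean:.4f}")
```

Comparing the two printed lines shows how a couple of far-off points can drag R^2 down even when thousands of good points surround them, because each one contributes a huge squared residual.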



Curiosity sets in, and I create a VERY dirty hack for cleaning the data and set it up for +/- 20% error (everything beyond that gets discarded), which brings the dataset down from 22.7k to 20.7k data points. This is the result:



Fewer outliers, better fit.
Then I shorten the leash of my 'clean' filter to just +/- 10% of each other, cutting the graph down to 18.9k points:
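The filter itself isn't spelled out here, so the following is only one possible reading of a "+/- tolerance" cleanup, sketched in Python: fit a line in log-log space, then discard every point whose flow differs from the fitted value by more than the tolerance. The helper names `fit_loglog` and `filter_by_trend`, the power-law test data, and the corrupted samples are all assumptions for illustration, not the original Excel hack.

```python
import math

def fit_loglog(freq, flow):
    """Least-squares line through (log10 freq, log10 flow): (slope, intercept)."""
    xs = [math.log10(f) for f in freq]
    ys = [math.log10(q) for q in flow]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return slope, my - slope * mx

def filter_by_trend(freq, flow, tol):
    """Keep points whose flow is within +/- tol (a fraction) of the fitted trend."""
    # Non-positive readings cannot be logged, so they are dropped up front.
    pairs = [(f, q) for f, q in zip(freq, flow) if f > 0 and q > 0]
    slope, intercept = fit_loglog(*zip(*pairs))
    kept = []
    for f, q in pairs:
        predicted = 10 ** (intercept + slope * math.log10(f))
        if abs(q - predicted) <= tol * predicted:
            kept.append((f, q))
    return kept

# Made-up test data: an assumed power law with four deliberately bad samples.
freq = [1000 + 45 * i for i in range(200)]
flow = [0.002 * f ** 2.5 for f in freq]
flow[10] *= 0.4    # reads far too low
flow[50] *= 3.0    # reads far too high
flow[120] = 0.0    # dead reading
flow[150] *= 1.15  # mildly off

kept_20 = filter_by_trend(freq, flow, 0.20)
kept_10 = filter_by_trend(freq, flow, 0.10)
print(len(freq), len(kept_20), len(kept_10))
```

Note that this single pass fits the trend with the outliers still in it, so on very dirty data the line itself may be skewed; running the filter a second time on its own output is one simple way to reduce that.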



Observations and conclusions:
  • how does a logarithmic scale 'linearize' a 3rd order poly function?
  • ridiculous values produce ridiculous results
  • large samples of data aren't the automatic protection from dirt we'd hope they would be
  • data cleanup is apparently much more important than I ever thought
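On the first question, one possible explanation, offered as a sketch and not verified against this data: a pure power law is exactly a straight line on log-log axes, while a general cubic is not. The cubic's local log-log slope changes with x, so it only looks straight when that slope stays roughly constant over the measured range, for example when one term dominates. The symbols a, b and c_0 through c_3 below are generic placeholders, not fitted values.

```latex
% Pure power law: exactly a straight line on log-log axes
y = a x^{b}
\quad\Longrightarrow\quad
\log_{10} y = \log_{10} a + b \log_{10} x

% General cubic: the local log-log slope depends on x
y = c_0 + c_1 x + c_2 x^2 + c_3 x^3
\quad\Longrightarrow\quad
\frac{d \log_{10} y}{d \log_{10} x}
  = \frac{x \, y'(x)}{y(x)}
  = \frac{c_1 x + 2 c_2 x^2 + 3 c_3 x^3}{c_0 + c_1 x + c_2 x^2 + c_3 x^3}
```

If the real curve is close to a power law, the slope of the log-log trendline is an estimate of the exponent b.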
Please comment, I'm befuddled. More to come once I figure more of it out.

