Data cleanup and trendline madness
Looking at the graph of raw MAF values it is obvious that some points contain wrong data:
So I decided to come up with an easy method of cleaning it up, since visually there is a very strong basic trend and only a few clear outliers.
I tried a bunch of different methods in Excel, but they all ran into the same problems:
1. very manual and not automatable
2. don't work across the whole range: they'd work for either the low or the high range, but never for both
3. too simple to self-scale for different samples, as some datasets are dirtier than others
Thus, instead of working on a wacky function with simple math, I figured that bringing the function into a more manageable form first would let me use simpler methods. Having seen how a lot of relationships in nature are better viewed on logarithmic scales, I started graphing first one, then both axes as LOG10(freq) and LOG10(flow).
Here's the amazing result:
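For anyone who would rather do this outside Excel, here is a minimal Python sketch of the same log-log transform. The function name and the sample numbers are made up for illustration, and dropping non-positive readings is my assumption (log10 is undefined there); the spreadsheet may have handled those differently.

```python
import numpy as np

def to_log_log(freq, flow):
    """Return (log10(freq), log10(flow)) for the points where both values are positive."""
    freq = np.asarray(freq, dtype=float)
    flow = np.asarray(flow, dtype=float)
    # log10 is undefined for zero and negative readings, so those points are masked out.
    valid = (freq > 0) & (flow > 0)
    return np.log10(freq[valid]), np.log10(flow[valid])

# Made-up sample values, only to show the call; the third point has a zero frequency.
log_freq, log_flow = to_log_log([100.0, 1000.0, 0.0], [1.0, 10.0, 5.0])
```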
The datapoints form a line! So I quickly put in a linear trendline, and it's off, but it just seems to be thrown off by some values that are still clearly way out there. Thus I proceed to throw out TWO datapoints with zero values, just to see what happens. Look at the trendline: it shifts significantly, and R^2 indicates a MUCH better fit just by tossing 2 points out of over 22.7k of them! And I thought large data samples eliminated small mistakes...
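The same effect is easy to reproduce on synthetic data. The sketch below is not the MAF dataset: it builds an artificial straight line with a little noise, plants two absurd values in it, and compares R^2 of a least-squares line with and without them. All the numbers (slope, noise level, the -50.0 readings) are arbitrary assumptions.

```python
import numpy as np

def linear_fit_r2(x, y):
    """Least-squares line y = slope*x + intercept, plus the R^2 of that fit."""
    slope, intercept = np.polyfit(x, y, 1)
    residuals = y - (slope * x + intercept)
    ss_res = np.sum(residuals ** 2)
    ss_tot = np.sum((y - np.mean(y)) ** 2)
    return slope, intercept, 1.0 - ss_res / ss_tot

# Synthetic "log-log" data: a straight line with small noise.
rng = np.random.default_rng(0)
x = rng.uniform(2.0, 4.0, 1000)
y = 1.5 * x - 3.0 + rng.normal(0.0, 0.02, x.size)

# Plant two wildly wrong readings among the 1000 points.
y_dirty = y.copy()
y_dirty[:2] = -50.0

_, _, r2_dirty = linear_fit_r2(x, y_dirty)
_, _, r2_clean = linear_fit_r2(x[2:], y[2:])
print(f"R^2 with the two bad points: {r2_dirty:.4f}, without them: {r2_clean:.4f}")
```

Least squares penalizes the square of each residual, so a couple of points that are far enough off can outweigh thousands of good ones.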
Curiosity sets in, and I create a VERY dirty hack for cleaning the data, and set it up for +/- 20% error (everything more than that gets discarded), bringing the dataset down from 22.7k to 20.7k data points. This is the result:
Fewer outliers, better fit.
Then I shorten the leash of my 'clean' filter to just +/- 10% of each other, cutting the graph down to 18.9k points:
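The hack itself isn't spelled out here, so the following is only one way such a tolerance filter could look, not the actual spreadsheet logic. It assumes "error" means the relative deviation of each flow reading from what a log-log trendline predicts; `filter_by_tolerance` and its `tol` parameter are hypothetical names, and the sample data is synthetic.

```python
import numpy as np

def filter_by_tolerance(freq, flow, tol=0.20):
    """Keep points whose flow is within +/- tol of a log-log trendline's prediction."""
    freq = np.asarray(freq, dtype=float)
    flow = np.asarray(flow, dtype=float)
    positive = (freq > 0) & (flow > 0)  # log10 needs positive values
    log_freq = np.log10(freq[positive])
    slope, intercept = np.polyfit(log_freq, np.log10(flow[positive]), 1)
    # Convert the fitted line back to linear units before comparing.
    predicted = 10.0 ** (slope * log_freq + intercept)
    rel_error = np.abs(flow[positive] - predicted) / predicted
    keep = np.zeros(freq.shape, dtype=bool)  # non-positive readings stay discarded
    keep[positive] = rel_error <= tol
    return freq[keep], flow[keep]

# Synthetic power-law data with one deliberately bad reading.
freq = np.linspace(1000.0, 10000.0, 200)
flow = 1e-4 * freq ** 1.5
flow[10] *= 3.0
clean_freq, clean_flow = filter_by_tolerance(freq, flow, tol=0.20)
```

Tightening the leash is then just a matter of calling it with `tol=0.10`. One caveat of this sketch: the trendline is fitted on the dirty data, so with very bad outliers it may be worth fitting, filtering, and fitting again.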
Observations and conclusions:
- how does a logarithmic scale 'linearize' a 3rd-order polynomial function?
- ridiculous values produce ridiculous results
- a large data sample isn't the automatic protection from dirt we'd hope it would be
- data cleanup is apparently much more important than I ever thought
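On the first question, one possible explanation, assuming the MAF curve behaves roughly like a power law over the measured range (that is an assumption, not something the graphs prove): taking the log of both sides of a power law gives a straight line whose slope is the exponent. Here a and b are just placeholder constants.

```latex
\text{flow} = a \cdot \text{freq}^{\,b}
\quad\Longrightarrow\quad
\log_{10}(\text{flow}) = \log_{10}(a) + b \, \log_{10}(\text{freq})
```

A general 3rd-order polynomial with several terms of comparable size is not exactly straight on log-log axes; it only looks straight over a range where a single term dominates.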