Great post. A few points to echo what you wrote, and to add some nuance from my industry. First, I like to say that statistics is primarily for understanding data, not, as it's often used, for proving (inferring, really) cause and effect. It's about understanding relationships between variables. This, to me, is foundational for data science in general, and I've used Anscombe’s quartet as an example to demonstrate it. For some of its four datasets there is clearly a better model than simple linear regression, and plotting the data is enough to tell you that (or that there is a problem with the data). Just throwing models at a problem without visually exploring it first can be a waste of time and is prone to error. Plotting can also show when multicollinearity might be a problem. Quickly plot all your regressors, and when you see that many are correlated with each other and with the variable of interest, you immediately start to explore how the variables relate to one another: which might be the best predictor, which might add information, which might be biased, etc. And this sort of visual-to-explanatory inference is still something that humans do better than machines, if they know enough to do it in the first place.
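To make the Anscombe point concrete, here is a minimal sketch in plain Python. The data are the published quartet values; the helper names (`fit_line`, `max_abs_residual`, `SUMMARY`) are my own and purely illustrative. It fits a straight line to each dataset and prints the slope, intercept, correlation, and largest residual.

```python
from math import sqrt

# Anscombe's quartet: four x/y datasets with near-identical summary
# statistics but very different shapes when plotted.
X123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
QUARTET = {
    "I": (X123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (X123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (X123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
           [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}


def fit_line(x, y):
    """Ordinary least squares for y = intercept + slope * x, plus Pearson r."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept, sxy / sqrt(sxx * syy)


def max_abs_residual(x, y, slope, intercept):
    """Largest absolute distance between an observation and the fitted line."""
    return max(abs(yi - (intercept + slope * xi)) for xi, yi in zip(x, y))


# name -> (slope, intercept, r, largest residual)
SUMMARY = {}
for name, (x, y) in QUARTET.items():
    slope, intercept, r = fit_line(x, y)
    SUMMARY[name] = (slope, intercept, r, max_abs_residual(x, y, slope, intercept))
    print(f"{name:>3}: slope={slope:.3f} intercept={intercept:.3f} "
          f"r={r:.3f} max|resid|={SUMMARY[name][3]:.2f}")
```

The fitted line and correlation come out essentially the same for all four datasets, so the summary numbers alone cannot tell you which ones a straight line actually suits. The residuals start to hint at it (the single outlier in dataset III stands out), but a scatter plot shows it immediately.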
With that said, my reasoning falls apart with some modern analyses. It's much harder to explore variables when you have hundreds of them. Relationships between variables don't have to be visibly obvious to humans, and machines can disentangle them, given enough data to do so. The challenge is that machine learning approaches are prone to becoming black boxes, so on the whole I still think starting with simpler models and data exploration is the best way to avoid overfitting, especially in scenarios where proper cross-validation is not available. This is not uncommon in geology, where datasets are often small, biased, and messy, and where subjective interpretation really needs to play a role. I think the state of the art in geology right now is often overfit models with poor predictive power, especially outside of a small area of interest. And there's a lot more I could write on that, but this reply is already way too long. Maybe it will inspire me to do a follow-up for geology!
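As a toy illustration of the overfitting worry on small datasets, here is a sketch comparing a straight-line fit with a 1-nearest-neighbour predictor using leave-one-out error. Everything here is an assumption for illustration: the ten synthetic points (a linear trend plus alternating ±1 "noise"), the choice of 1-nearest-neighbour as the stand-in for a flexible model, and all function names. It is not a claim about any particular geological workflow, and leave-one-out is itself optimistic when samples are spatially correlated.

```python
def linear_predict(train_x, train_y, x_new):
    """Fit y = intercept + slope * x by least squares, then predict at x_new."""
    n = len(train_x)
    mean_x, mean_y = sum(train_x) / n, sum(train_y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in train_x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(train_x, train_y))
    slope = sxy / sxx
    return mean_y + slope * (x_new - mean_x)


def nearest_neighbour_predict(train_x, train_y, x_new):
    """Predict with the y value of the closest training x (ties: first found)."""
    nearest = min(range(len(train_x)), key=lambda i: abs(train_x[i] - x_new))
    return train_y[nearest]


def training_mse(predict, x, y):
    """Mean squared error when the model is scored on its own training data."""
    return sum((predict(x, y, xi) - yi) ** 2 for xi, yi in zip(x, y)) / len(x)


def leave_one_out_mse(predict, x, y):
    """Mean squared error when each point is predicted from all the others."""
    total = 0.0
    for i in range(len(x)):
        rest_x = x[:i] + x[i + 1:]
        rest_y = y[:i] + y[i + 1:]
        total += (predict(rest_x, rest_y, x[i]) - y[i]) ** 2
    return total / len(x)


# Synthetic data (assumption): linear trend y = 2x with alternating +1/-1 noise.
X = [float(i) for i in range(10)]
Y = [2.0 * xi + (1.0 if i % 2 == 0 else -1.0) for i, xi in enumerate(X)]

RESULTS = {
    name: (training_mse(model, X, Y), leave_one_out_mse(model, X, Y))
    for name, model in [("linear", linear_predict),
                        ("1-nearest-neighbour", nearest_neighbour_predict)]
}
for name, (train_err, loo_err) in RESULTS.items():
    print(f"{name:>20}: training MSE={train_err:.3f}  leave-one-out MSE={loo_err:.3f}")
```

The nearest-neighbour model reproduces its training data perfectly, which looks great until it is asked about a point it has not seen; the simple line has a worse training error but holds up better on held-out points. With only a handful of samples, the training error of a flexible model tells you very little.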
Thanks for the comment, Alex!
"With that said, my reasoning falls apart with some modern analyses. It's much harder to explore variables when you have hundreds of them."
This is true for many fields, but in ones where we have the privilege of plotting our variables on a map, I think a well-trained human can still use fundamental principles of geography and spatial reasoning to do a good job.
Also, in EO these features are often either variations of particular statistical aggregations or time-lagged values of bands, and are not independent of one another. Ultimately, in every field it serves us to think about the underlying physical process generating the data/signal of interest, and to reason from that, rather than trying to work backwards from the data. Geology in particular is a field where this is common; see, for example, variogram choice in kriging.
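For anyone unfamiliar with the variogram reference, here is a minimal sketch of an empirical semivariogram for a regularly spaced 1-D transect, using the standard estimator γ(h) = Σ(z_i − z_{i+h})² / (2·N(h)). The function name and the `TRANSECT` values are hypothetical, made up purely for illustration; real work would use 2-D coordinates, distance bins, and a dedicated geostatistics library.

```python
def empirical_semivariogram(values, max_lag):
    """Return {lag: gamma} for samples taken at regular spacing along a line.

    gamma(lag) is half the mean squared difference between all pairs of
    samples that are `lag` steps apart.
    """
    gammas = {}
    for lag in range(1, max_lag + 1):
        pairs = [(values[i], values[i + lag]) for i in range(len(values) - lag)]
        if not pairs:  # lag is longer than the transect
            break
        gammas[lag] = sum((a - b) ** 2 for a, b in pairs) / (2 * len(pairs))
    return gammas


# Hypothetical measurements along a transect (illustrative values only).
TRANSECT = [1.0, 1.2, 1.1, 1.6, 2.0, 2.3, 2.1, 2.8, 3.0, 3.4]

for lag, gamma in empirical_semivariogram(TRANSECT, max_lag=4).items():
    print(f"lag {lag}: gamma = {gamma:.3f}")
```

The empirical curve is only the starting point. Which model you then fit to it (spherical, exponential, Gaussian, with or without a nugget) is exactly the kind of choice that should come from reasoning about the physical process, not just from whichever curve hugs the points most closely.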