I founded The New York Burger Appreciation Society and eat ~8 lbs of burgers a week.

My name is Michael E. Gruen, and in between meals, I'm a data-driven technologist living in NYC, stationed at Work Market.

  • How do you source and vet data that's purposeful to your projects?

    Do you have methods of both finding data and determining what data should be analyzed for various projects?

    Good analysis relies on good data, and the methods for determining what data is useful depend on the questions being asked. Since I'm usually operating on data internal to an organization, sourcing is fairly straightforward.

    Externally focused projects are trickier.

    I like to start with large, widely used data sets to get the macro picture. When munging those together with new, unproven data, I learn what normal looks like and have a baseline to compare against. For instance, if I'm calculating the most popular routes from FAA flight data, I might also look at population densities and check whether there are plenty of edges between major cities, which helps validate and contextualize the flight data. (See also: https://xkcd.com/1138/)
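    To make that concrete, here's a rough sketch of the kind of cross-check I mean. The file names, column names, and the gravity-model-style expectation are placeholders for illustration, not a prescription.

    ```python
    # Rough sketch: cross-check an unproven flight-route data set against a
    # known baseline (city populations). Names below are made up.
    import pandas as pd

    routes = pd.read_csv("routes.csv")  # origin_city, dest_city, flight_count
    pops = pd.read_csv("populations.csv").set_index("city")["population"]

    # Attach a population figure to both endpoints of each route.
    routes["origin_pop"] = routes["origin_city"].map(pops)
    routes["dest_pop"] = routes["dest_city"].map(pops)

    # Crude expectation: traffic between two cities should roughly scale with
    # the product of their populations (a gravity-model-style sanity check).
    routes["expected"] = routes["origin_pop"] * routes["dest_pop"]

    # If the data set is plausible, observed counts and the population-based
    # expectation should be positively correlated across routes.
    corr = routes["flight_count"].corr(routes["expected"], method="spearman")
    print(f"Spearman correlation with the population baseline: {corr:.2f}")
    ```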

    When the data doesn't line up, it might be because I've found an insight worth sharing. Or maybe (more often than not) it's a poor data set, though not always. Per the earlier example, if it's missing the major routes (e.g. LAX <> JFK), maybe it's small-plane flight data. Or maybe it's just bad.

    In practice, though, unless I'm looking for something specific, I ditch data that refutes the normal. Yes, I'll miss stuff. But, you know, pragmatism.
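    Continuing the same sketch, those follow-up checks might look something like this. The airport pairs and the percentile trim are placeholder assumptions, not hard rules.

    ```python
    # Are the routes everyone knows are busy actually present? The pairs
    # below are placeholders for whatever "obvious" routes apply.
    expected_major = {("LAX", "JFK"), ("JFK", "LAX"), ("ORD", "LGA")}
    present = set(zip(routes["origin_city"], routes["dest_city"]))
    missing = expected_major - present
    if missing:
        # Either the data set is scoped differently (small planes only?)
        # or it's just bad.
        print(f"Missing expected major routes: {sorted(missing)}")

    # The pragmatic default: drop rows that wildly contradict the population
    # baseline, accepting that a few genuine outliers get thrown away too.
    ratio = routes["flight_count"] / routes["expected"]
    trimmed = routes[ratio.between(ratio.quantile(0.01), ratio.quantile(0.99))]
    ```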

    This is one of the reasons that data is hard.

    As for finding data, municipalities and government agencies have some great data sets, as does kaggle.com. Beyond that, internet searching is surprisingly fruitful and, using the aforementioned approach, it's usually easy to tell whether new data sets are worth further exploration.