SRCCON SESSION
======================================================================
How NOT to Skew with Statistics: Drafting, Data Bulletproofing, and Tools for Well-*Developed* Stories
http://srccon.org/sessions/
======================================================================

ATTENDANCE
Aurelia Moser, @auremoser
Justin Myers, @myersjustinc
Kio Stark @kiostark
Tyler Machado @tylermachado
Latoya Peterson @latoyapeterson
Jacob Harris (@harrisj)
Ameen Soleimani (@ameensol)
Helga Salinas, @helga_salinas
Jeremy B. Merrill, @jeremybmerrill
Sandhya Kambhampati, @sandhya__k
Lauren Rabaino @laurenrabaino
Sarah Squire (@sarahjsquire)
Armand Emamdjomeh (@emamd)
Tasneem Raja @tasneemraja
Jeff Larson (@thejefflarson)
Allison McCann (@atmccann)
Robinson Meyer (@yayitsrob)
Nadja Popovich (@popovichN)
Noah Veltman (@veltman)
Matthew Pleasant (@matthewpleasant)
Chris Williams (@enactd)
Rebecca Lai (@kkrebeccalai)
Tom Meagher (@ultracasual)
AmyJo Brown (@amyjobrown)
Rigoberto Carvajal (@rcarvajal85)
Sonya Song (@sonya2song)
Sara Schnadt (@SaraSchnadt)
Emma Carew Grovum (@emmacarew)

>>>>> survey FTW: http://bit.ly/1n7YWb0 <<<<<<<
SURVEY, YO ^ i found it.

CASE STUDIES
* 538 - Nigeria Kid-mapping - General Data-Verifying Error: http://53eig.ht/1jgrkpo | http://bit.ly/1sU2MEB
* Vox - Kansas Watch - Population Representation Error: http://bit.ly/1h5TMYX
* Spaciality - Stop + Frisk Cartography - Distribution Error: http://bit.ly/1kVTEd8

BASIC QUESTIONS
* How do you diagnose your data, pre-processing?
* How do you proceed to process it?
* How do you present your information in the appropriate visualization?
* Can we make a checklist/guide that teases out red-flags and data issues

======================================================================
CONSIDERATIONS
======================================================================

DATA INTERVIEW
==============
* Spend 30 minutes brainstorming, looking for an angle to
* Interview the source who collected the data: how the data is collected and why it is collected this way. Come back to this person as you have more questions.
    * Find the nerd in the basement, be critical of his/her collection practice
    * Even just try to _understand_ that collection practice. Why are things the way they are?
    * Healthy distrust of authority figures. Talk to people who are disgruntled or at least tired and ready to RANT
* Sketch it out
    * "it" == EVERYTHING.
    * Spend a lot of time disambiguating multiple people's interpretations of what's going on
    * Drawing it out helps you ensure you're on the same page
* Understand the cultural bubble and find a "cultural bridge" to the outside world who can connect you to the institution you're sourcing data from, and the associated culture that supports it.
* Source 3 or 4 vantage points on the same dataset if you can
* KNOW WHEN TO SAY NO
    * Scream "NO THIS WILL BE PUBLISHING LIES"
    * News judgment is just as important in data projects as it is in other reporting.
    * The bad data may be a story itself: http://www.nhregister.com/general-news/20140512/scope-of-nationwide-heroin-epidemic-unknown-drug-related-death-overdose-data-lacking
* When working on deadline, focus on one or two facts.
* Lowball: "I can't give you that, but how about this instead?"
* Set expectations up front: Here's the kind of thing we can do in 15 minutes, here's what's possible in an hour, these things take an afternoon, etc.
* Start with a minimum viable thingie, and add/elaborate as/if time allows

MATH / STATS
===========
* Proxy
    * "In statistics, a proxy is a variable that is used when it’s impossible to measure something directly"
    * Per-capita GDP is often used as a proxy for measures of standard of living or quality of life.
    * Likewise, country of origin or birthplace might be used as a proxy for race, or vice versa.
    * Known sloppy proxies
        * Geocoded tweets?
            * A minute fraction of all tweets contain useful location data
* Confirmation bias
    * Confirmation bias is the tendency of people to favor information that confirms their beliefs or hypotheses.
* Correlation does not equal causation
    * https://en.wikipedia.org/wiki/Correlation_does_not_imply_causation
* Qualitative
    * Qualitative properties are properties that are observed and can generally not be measured with a numerical result.
    * Example:
* Quantitative
    * Have numerical characteristics that can be compared in terms of "more", "less" or "equal", or, by assigning a numerical value in terms of a unit of measurement.
    * Example:
* Categorical data
    * In statistics, a categorical variable is a variable that can take on one of a limited, and usually fixed, number of possible values
    * Example: The blood type of a person: A, B, AB or O.
* Continuous data
* Artifacts
    * Discrete data of continuous stuff ("discretized" data, i.e. when police officers write down that traffic incidents happened vs when they happened): smoothing or jittering is a type of preprocessing data that allows you to "unbin" the data and avoid the inference of patterns that don't really exist
    * "shift changes" in the police data are another kind of artifact. removing artifacts is sometimes a case-by-case thing. still thinking about this...
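A rough sketch of the jittering idea in the Artifacts notes, in Python. Everything here is an assumption for illustration: the function name `jitter`, the made-up incident times, and the premise that each time was rounded to the nearest hour (so the true value sits within half a bin of what was written down). It only spreads points within their bin; it does not recover the real times.

```python
import random


def jitter(values, bin_width, seed=None):
    """Spread binned values uniformly across the bin they were rounded into.

    Assumes each value was rounded to the nearest multiple of bin_width,
    so the true value lies within bin_width / 2 of the recorded one.
    """
    rng = random.Random(seed)  # seeded so the result can be reproduced
    half = bin_width / 2
    return [v + rng.uniform(-half, half) for v in values]


# Hypothetical incident times in minutes after midnight, all logged on the hour.
recorded = [480, 480, 480, 540, 540, 600]

# Same incidents, no longer stacked on 8:00, 9:00 and 10:00.
unbinned = jitter(recorded, bin_width=60, seed=1)
```

Fixing the seed keeps the jittered chart identical between runs, which matters if someone is fact-checking it. Say in the methodology that the points were jittered.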
* Dichotomizing Continuous Variables in Statistical Analysis
    * https://stats.stackexchange.com/questions/16565/what-is-the-effect-of-dichotomising-variables
    * When is it valuable to flatten data
    * When can it get you into trouble
* Precision vs Accuracy
* Mean, Median and Mode
    * When is Mean best used?
    * When is Median best used?
    * When is Mode best used?
    * Compare mean and median: if they are far apart, the distribution is skewed or has outliers, and the mean can mislead
* Labeling data with units and being transparent about the population you are representing
* Normalization of data for population
    * 50% dropout rate at a school with 4 students is different: that is only 2 students

DATA TYPING + DESIGN
====================
* Geo Data
    * Geocoding and the accuracy of it
    * Mapping population density
        * Watch out for population maps: http://xkcd.com/1138/
    * Understanding distributions
    * Different ways of describing same geography (e.g. Hong Kong named 4 different ways in the same spreadsheet)
    * Data getting "rounded": crime coded to the police station, crashes coded to the nearest intersection, etc.
    * _Old_ projections/other outdated standards, e.g., http://en.wikipedia.org/wiki/Public_Land_Survey_System
* Time Series Data
    * Line graphs are ideal for continuous data
* Graph/Relational Data
    * When to use a scatterplot, when to use a graph, when those things can look messy and how to control for that.
    * What does the resulting graph tell you?
* Zero-basing charts? Column/bar charts? But time series don't need to?
* Scoping charts: how to figure out the scale/time frame you want to illustrate

BIG PITFALLS
============
* Sanity checks: describe your idea to another person before you've sunk a ton of time into something
* Don't pity the generalists: your expertise might blind you to your own assumptions.
Describe your project to a naive third party and beg for/benefit from their "dumb questions"

======================================================================
BIBLIOGRAPHY
======================================================================

CHECKLISTS
===========
* Civic Software Checklist: http://civicpatterns.org/checklist/
* Bulletproofing Checklist: https://github.com/propublica/guides/blob/master/data-bulletproofing.md
* News Apps Styleguide: https://github.com/propublica/guides/blob/master/news-apps.md
* Data Smells: Ensuring Accuracy in Data Journalism: https://github.com/nikeiubel/data-smells/wiki/Ensuring-Accuracy-in-Data-Journalism

REFERENCES
============
* How to Lie with Statistics: http://amzn.to/IFodHy
* Distrust Your Data: http://bit.ly/1sYG9Na
* Naked Statistics: http://amzn.to/1kW4EHA
* Data Cuisine: http://mzl.la/1pcWiMS
* Spurious Correlations: http://www.tylervigen.com/

======================================================================
GROUP WORK
======================================================================

1.
Choose an example or consider one of your own and add it here:
    * 512 Paths to the White House - New York Times: https://nyti.ms/OlWyxL
    * The Sexperience 1000 - Channel 4: http://bit.ly/1rpAOkz
    * Front Row to Fashion Week - New York Times: http://nyti.ms/KTkJ57
    * Previously, On Arrested Development - NPR: http://n.pr/18Anhe0
    * Buildings in the Netherlands by year of construction: http://bit.ly/1pBTdrh
    * Earth : an animated map of global wind and weather: http://bit.ly/INiShj
    * Planet Money Makes A T-Shirt - NPR: http://bit.ly/INiShj
    * Newsmap: http://newsmap.jp/
    * Data diary: we posted a reporter's process on github, to serve as fact-checking document, public-facing methodology, etc: https://github.com/motherjones/west-texas-data-diary - Mother Jones @tasneemraja @jaeahjlee
    * Deadly Day in Baghdad http://nyti.ms/1jXwJTy we had to handcheck each data point on the top map and pull out locations, disambiguate; lower maps raw data but errors considered fine for illustrating general trends @harrisj
2. Break down the data process in your interactive of choice. Try to find the data, diagnose the representation, critique/compliment it and define a few takeaways that could contribute to a list of considerations (4 bullets max) under
    * Math/Stats
    * Design
    * UX
   Check out this gist if you get stuck: http://bit.ly/1nDtFwT.
3. Record your bullets in the GROUPs below.
4. We'll reconvene and aggregate everyone's bullets into a bigger checklist doc, push it to a repo of resources, and email you when that's ready for consultations and versioning!

======================================================================
WRAP UP NOTES
======================================================================

GROUP I:
GROUP II:
GROUP III:
GROUP IV: