SRCCON SESH
============
<><><><><><><><><><><><><><><>
<> HERE IS OUR COLLAB SHEET <>
https://etherpad.mozilla.org/bOwBSAeLe5
info fish: https://docs.google.com/forms/d/1jiwSoV53z3uIZ7RQCvkVPfWiV4GavUJ1HXVvmMQYKbk/edit
<><><><><><><><><><><><><><><>
How NOT to Skew with Statistics: Drafting, Data Bulletproofing, and Tools for Well-*Developed* Stories
(http://srccon.org/sessions/proposals/#p3467)
crowdsource a document and collaborate on a glossary and forkable guide for how to tackle data
From email:
I thought about it more after I wrote a blog post about 538's treatment of the Nigerian kidnappings, and about how hard it can be to bulletproof your data and make visualizations/interactives that are meaningful and not confusing. I'd thought about breaking it down into chapters like How to Lie With Statistics, and building a conversation around misleading pitfalls to avoid.
Worth making mention in the description that we hope to come away with a checklist of known data smells/tips for folks/definitions of methods/terms...?
CASE STUDIES
=============
* 538 - Nigeria Kid-mapping: http://aureliamoser.com/2014/05/15/538-errors-plotting-crises-and-the-protocol-of-reprocessing-data/
* Vox - Kansas Watch: http://www.vox.com/2014/4/21/5636040/whats-the-matter-with-kansas-and-porn
* Distribution Error: http://spatialityblog.com/2012/07/27/nyc-stop-frisk-cartographic-observations/
Definitions of methods/terms
================
* How do you diagnose your data, pre-processing?
* How do you proceed to process it?
* How do you present your information in the appropriate visualization?
* "Checklist of smart tests to run against a dataset, and common red flags."
I. Geo Data
* Geocoding and the accuracy of it
* Mapping population density
* Understanding distributions
II. Time Series Data
* Line graphs are ideal for continuous data
III. Graph/Relational Data
* When to use a scatterplot.
* What does the resulting graph tell you?
* Zero-basing charts? Column/bar charts? But time series don't need to?
* Scoping charts.
IV. Misc
* Proxy
* "In statistics, a proxy is a variable that is used when it’s impossible to measure something directly"
* Per-capita GDP is often used as a proxy for measures of standard of living or quality of life.
* Likewise, country of origin or birthplace might be used as a proxy for race, or vice versa.
* Known sloppy proxies
* Geocoded tweets?
* A minute fraction of all tweets contain useful location data
* Confirmation bias
* Confirmation bias is the tendency of people to favor information that confirms their beliefs or hypotheses.
* Correlation does not equal causation
* https://en.wikipedia.org/wiki/Correlation_does_not_imply_causation
* Qualitative
* Qualitative properties are properties that are observed and can generally not be measured with a numerical result.
* Example:
* Quantitative
* Quantitative properties have numerical characteristics that can be compared in terms of "more", "less" or "equal", or measured by assigning a numerical value in a unit of measurement.
* Example:
* Categorical data
* In statistics, a categorical variable is a variable that can take on one of a limited, and usually fixed, number of possible values
* Example: The blood type of a person: A, B, AB or O.
* Continuous data
* Dichotomizing Continuous Variables in Statistical Analysis
* https://stats.stackexchange.com/questions/16565/what-is-the-effect-of-dichotomising-variables
* When is it valuable to flatten data
* When can it get you into trouble
* Precision vs Accuracy
* Mean, Median and Mode
* When is Mean best used?
* When is Median best used?
* When is Mode best used?
* Labeling data with units and being transparent about the population you are representing
* Normalization of data for population
* A 50% dropout rate at a school with 4 students (that is, 2 students) is different from the same rate at a much larger school
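Two of the items above (mean vs. median vs. mode, and rates on tiny populations) can be sketched quickly in Python. This is only an illustration: the salary figures, the 4,000-student school, and the helper name `dropout_rate` are made up for the example and do not come from any of the case studies.

```python
from statistics import mean, median, mode

# Hypothetical salaries (in thousands). One outlier pulls the mean upward,
# while the median and mode stay close to what a typical person earns.
salaries = [30, 32, 32, 35, 38, 40, 400]

summary = {
    "mean": mean(salaries),      # sensitive to the outlier
    "median": median(salaries),  # middle value, robust to the outlier
    "mode": mode(salaries),      # most common value
}


def dropout_rate(dropouts, enrolled):
    """Return the rate together with its counts so the denominator is never hidden."""
    return {"rate": dropouts / enrolled, "dropouts": dropouts, "enrolled": enrolled}


# Same 50% rate, very different stories (both schools are hypothetical).
small = dropout_rate(2, 4)
large = dropout_rate(2000, 4000)

print(summary)
print(small)
print(large)
```

Reporting the count next to the rate is one possible way to keep the population visible; a minimum-population cutoff before publishing a rate would be another.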
Dissection of a Few Newsapps (small groups)
=================================
Choose an example:
* 512 Paths to the White House - New York Times : https://nyti.ms/OlWyxL
* The Sexperience 1000 - Channel 4 img1, img2, img3: http://bit.ly/1rpAOkz
* Front Row to Fashion Week - New York Times: http://nyti.ms/KTkJ57
* Previously, On Arrested Development - NPR: http://n.pr/18Anhe0
* Buildings in the Netherlands by year of construction: http://bit.ly/1pBTdrh
* Earth :: an animated map of global wind and weather: http://bit.ly/INiShj
* Planet Money Makes A T-Shirt - NPR: http://bit.ly/INiShj
* Newsmap: http://newsmap.jp/
Try to find the data, diagnose the representation, critique/compliment it, and define a few takeaways (4 bullets max) under:
* Math/Stats
* Design
* UX
We'll reconvene and aggregate everyone's bullets into a bigger checklist doc
Checklist for Bulletproofing
===================
Civic Software Checklist: http://civicpatterns.org/checklist/
Bulletproofing Checklist: https://github.com/propublica/guides/blob/master/data-bulletproofing.md
News Apps Styleguide: https://github.com/propublica/guides/blob/master/news-apps.md
Known Problems/Smells:
===================
* 65,536 row limit in Excel pre-2007 (65,535 data rows plus a header). Was the file truncated?
* Row count is a nice round number like 100,000. Was the download paginated?
* Geocoding errors (geographic centroids)
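The smells above could be turned into a few automated checks. The following is a rough sketch, not a vetted tool: the function names, the "multiple of 1,000" rule for round numbers, and the `min_repeats` cutoff are all assumptions chosen for illustration.

```python
from collections import Counter

XLS_ROW_LIMIT = 65536  # maximum rows in a pre-2007 Excel (.xls) worksheet


def row_count_flags(n_rows):
    """Return human-readable warnings about a suspicious row count."""
    flags = []
    # The header row may or may not have been counted, so check both.
    if n_rows in (XLS_ROW_LIMIT, XLS_ROW_LIMIT - 1):
        flags.append("row count matches the old Excel limit; file may be truncated")
    # Assumed rule of thumb: large exact multiples of 1,000 deserve a second look.
    if n_rows >= 1000 and n_rows % 1000 == 0:
        flags.append("row count is a round number; was the download paginated?")
    return flags


def repeated_coordinates(points, min_repeats=3):
    """Return (lat, lon) pairs that appear at least min_repeats times.

    Many records on one exact point can mean the geocoder fell back to a
    centroid (of a city, ZIP code, or country) instead of a real address.
    """
    counts = Counter((round(lat, 5), round(lon, 5)) for lat, lon in points)
    return {point: n for point, n in counts.items() if n >= min_repeats}


print(row_count_flags(65536))
print(row_count_flags(100000))
print(repeated_coordinates([(40.0, -75.0)] * 3 + [(41.1, -74.2)]))
```

A flag here is only a prompt to go ask the data provider; plenty of legitimate datasets have round row counts or many records at one address.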
Getting to Know Data:
===================
* Who published?
* Just like the sources we interview have motivations, so too do data providers
* Advocacy groups
* Studies
* Trade groups
* When was it relevant?
* Is the information current?
* What does the data encompass?
* How was it compiled?
* Ask about the methodology and how calculations were made
* If receiving from a database, ask for a record layout or schema
* Ask someone to walk you through the findings
* What does it tell you?
* What doesn't it tell you?
* Spot check everything
* Check for weird or conflicting values
* Incorrect percentages and calculations
* Multiple spellings of people, places and things
* How can it be benchmarked against other data sets?
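Some of the spot checks above (multiple spellings, incorrect percentages) can be partly automated. Here is a minimal sketch using only the Python standard library; the function names, the 0.85 similarity threshold, and the 0.5-point rounding tolerance are assumptions for illustration and would need tuning on real data.

```python
from difflib import SequenceMatcher
from itertools import combinations


def normalize(name):
    """Lowercase, drop punctuation, and collapse whitespace."""
    kept = "".join(ch for ch in name.lower() if ch.isalnum() or ch.isspace())
    return " ".join(kept.split())


def similar_spellings(names, threshold=0.85):
    """Return pairs of distinct values that may be one thing spelled two ways."""
    unique = sorted(set(names))
    pairs = []
    for a, b in combinations(unique, 2):
        score = SequenceMatcher(None, normalize(a), normalize(b)).ratio()
        if score >= threshold:
            pairs.append((a, b))
    return pairs


def percentages_add_up(parts, tolerance=0.5):
    """Check that percentage shares sum to roughly 100, allowing for rounding."""
    return abs(sum(parts) - 100) <= tolerance


cities = ["New York", "new york.", "Boston"]
print(similar_spellings(cities))
print(percentages_add_up([33.3, 33.3, 33.3]))
print(percentages_add_up([40, 40, 40]))
```

The pairwise comparison is quadratic in the number of distinct values, so on a large column it is worth running only on the values that appear rarely. The output is a list of candidates for a human to review, not a list of confirmed errors.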
Diagnosing Data for Representation:
==============================
* Stats/Math
* How are numbers contextualized in the representation?
* How is the reference point for visual read introduced (baseline of 0 on charts? visual order of the image to suggest legibility)
* How is math done in the viz? (are they explicit about % vs. absolute values and how faithfully the #s represent their images)
* How are outliers, anomalies, awkward data points explained/handled?
* Design
* How many font styles/types are used?
* How is color used? To convey data or for decor? Where could it be simplified? Do that.
* UX
* Is the representation of data an effortless read?
* How are graphs and numbers incremented, what is the baseline/reference point for charts?
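On the "% vs. absolute values" question above, one way to stay explicit is to always compute and show both. The sketch below assumes a hypothetical helper, `describe_change`, and made-up counts; it is meant as an illustration of the idea, not a recommended API.

```python
def describe_change(old, new):
    """Return both the absolute and the percent change between two counts."""
    absolute = new - old
    # Percent change is undefined when the starting value is zero.
    percent = None if old == 0 else 100 * absolute / old
    return {"absolute": absolute, "percent": percent}


# A doubling that involves only two more cases...
tiny = describe_change(2, 4)
# ...versus a modest percentage that involves a hundred.
big = describe_change(2000, 2100)

print(tiny)
print(big)
```

Printing the two side by side makes it harder for a headline like "cases double" to hide that the underlying change was from 2 to 4.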
References:
=========
How to Lie with Statistics: http://www.amazon.com/How-Lie-Statistics-Darrell-Huff/dp/0393310728
Distrust Your Data: https://source.opennews.org/en-US/learning/distrust-your-data/
Naked Statistics: http://www.amazon.com/dp/039334777X
Data Cuisine: https://etherpad.mozilla.org/vYClsM1qVR
Spurious Correlations: http://www.tylervigen.com/