Thanks Adrienne, great question.
The questions are (credit to Brian Boyer): Who are our users, what do they want, and what story are we hoping they get out of this? Plus, because it's ProPublica, we add: what impact might this app have in the real world?
Naturally, we kick the tires on the data to make sure what we want to do is possible, but it can sometimes be hard to see that clearly before you start trying to use it (he said with a pained expression). As for projects we considered but ultimately nixed -- we went into our redistricting project thinking it would be about ultra-sophisticated GIS that could secretly create gerrymandered districts that looked compact, so we tooled up for that story. But the shenanigans turned out to be pretty rudimentary and old school, and we ended up not needing the "beautiful machine" we'd planned on building to detect hidden stuff.
One project we probably would have killed if we had known how hard it would be is Dollars for Docs. It seemed like a really easy problem -- scrape 11 PDFs and websites and merge the data. A child could do it! Six months later...
But I'm thrilled we didn't know what we were getting into!
Narrative journalism teaches us a lot. Start with the general and go to the specific. Provide good examples. Show, don't tell. Our apps all start with what we call a "far" view -- typically the main page of the app -- that lets you know why you should care, what the story is, what the national phenomenon is and how places compare to each other. We then try to walk you through levels of abstraction down to the very specific -- your town, your street, your school. So if we're successful, you now know the national picture, why you should care, and what it's got to do with you.
What doesn't work is what Amanda Cox calls "here's the data, hope you find something." We've all seen examples of that.
But the ideal is that each reader tells themselves their own story, and it *just so happens* to coincide with the story we're trying to tell them.
Lots of things. PDFs turn out to be deceptively difficult to parse. They're written in a page-description language tuned entirely for presentation, so the structure of the data tables is typically removed, and not always in the same way, in favor of placing each column and row at particular coordinates on the page. So that's a big hurdle, and a different one for each pharmaceutical company, and in fact one that needs different handling between reports from the same company. Also, there's no standard way of naming things, so what one pharmaceutical company calls "travel" another will call "reimbursement" (and then change that word the next quarter, too).
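To make the coordinate problem concrete, here is a minimal Python sketch of rebuilding table rows from positioned text. It is an illustration only, not the code used for Dollars for Docs: the fragment list, the `COLUMN_EDGES` values, and the `rebuild_rows` helper are all hypothetical, and it assumes some PDF text extractor has already produced `(x, y, text)` tuples.

```python
# Hypothetical fragments as a PDF text extractor might report them:
# (x, y, text), with y growing down the page. All values are invented.
FRAGMENTS = [
    (300.0, 100.2, "Speaking"),
    (72.0, 100.0, "Dr. A. Example"),
    (450.0, 99.8, "$1,500"),
    (72.0, 114.1, "Dr. B. Sample"),
    (450.0, 114.0, "$250"),
    (300.0, 113.9, "Travel"),
]

# Hand-tuned left edge of each column for this one made-up report.
COLUMN_EDGES = [72.0, 300.0, 450.0]


def rebuild_rows(fragments, column_edges, y_tolerance=2.0):
    """Group fragments into rows by y, then place each one in a column by x."""
    rows = []  # each entry is (y of the row's first fragment, list of cells)
    for x, y, text in sorted(fragments, key=lambda f: (f[1], f[0])):
        if rows and abs(y - rows[-1][0]) <= y_tolerance:
            cells = rows[-1][1]  # close enough vertically: same row
        else:
            cells = [""] * len(column_edges)
            rows.append((y, cells))
        # The column is the right-most edge at or to the left of x
        # (with half a point of slack); fall back to the first column.
        col = max(
            (i for i, edge in enumerate(column_edges) if edge <= x + 0.5),
            default=0,
        )
        cells[col] = (cells[col] + " " + text).strip()
    return [cells for _, cells in rows]


for row in rebuild_rows(FRAGMENTS, COLUMN_EDGES):
    print(row)
```

The tolerance and the column edges are exactly the kind of values that would need re-tuning for each company, and sometimes for each report from the same company.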
So we have dozens of quarterly disclosures that can be thousands of pages of PDF data, and each one needs hand-tuning to parse properly. It ends up being a massive plate-spinning exercise.
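The naming problem lends itself to a similar sketch. The Python below assumes a hand-maintained lookup keyed by company and quarter; the company names, quarters, labels, and the `normalize_category` helper are all invented for illustration and do not come from any real disclosure.

```python
# Hypothetical label maps, one per (company, quarter). Every entry is invented.
CATEGORY_ALIASES = {
    ("Company A", "Q1"): {"travel": "travel"},
    ("Company B", "Q1"): {"reimbursement": "travel"},
    ("Company B", "Q2"): {"expenses": "travel"},  # the word changed next quarter
}


def normalize_category(company, quarter, raw_label):
    """Map a report's own label onto one shared category name."""
    aliases = CATEGORY_ALIASES.get((company, quarter), {})
    key = raw_label.strip().lower()
    if key not in aliases:
        # Fail loudly so a renamed label gets reviewed by hand
        # instead of being silently dropped or miscounted.
        raise KeyError(f"unmapped label {raw_label!r} for {company} {quarter}")
    return aliases[key]


print(normalize_category("Company B", "Q1", " Reimbursement "))
```

Raising an error on an unknown label is a deliberate choice in this sketch: when a company renames a category, the parser stops and someone adds the new mapping by hand.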
But it's all worth it, I hasten to add. Dollars for Docs is a public service we're enormously proud of. I should also add that, in theory, the Physician Payments Sunshine Act will require pharma companies to disclose payment data to the federal government in a structured way, which we hope will make this kind of project much simpler.
That's all the questions we have time for today. Scott, thanks so much for joining us.
Our pleasure! We’re big fans of ProPublica here at Thunderdome, and appreciate all the great work you do.