Investigative reporting 3.0–or, Web stalking

The following is based on an Investigative Reporters and Editors seminar this weekend in Birmingham, Alabama.

IRE talked about how the Web--both the Surface ("visible") and Deep ("invisible") Webs--can help reporters address the occupational hazard of having to know everything about anything at any given moment.

The hour-long presentation, Effective use of the Internet, was fittingly framed by the first word in the title. Mark Horvit, IRE’s executive director, began by emphasizing that reporters should approach online research armed with a strategy (i.e., key words to search and a general idea of what’s available and desirable) to avoid getting distracted by the Web’s potentially cavernous detours. Step one, Horvit said, is not to log on, but to sketch out a plan.

Every investigative journalist should know that a search engine does not look through the live Internet. A Google search, for example, searches Google’s servers, which are stocked with information that the company’s Web “crawlers” have found and stored.
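
To make that concrete, here is a toy sketch in Python of the idea. It is not how Google works internally; the page names, the text and the "skip any URL with a question mark" crawl rule are all made up for illustration. The point is only that a search consults the stored index, so a page the crawler never visited cannot turn up.

```python
from collections import defaultdict

def build_index(crawled_pages):
    """Map each word to the set of URLs where the crawler saw it."""
    index = defaultdict(set)
    for url, text in crawled_pages.items():
        for word in text.lower().split():
            index[word].add(url)
    return index

def search(index, word):
    """Look a word up in the stored index, never on the live Web."""
    return sorted(index.get(word.lower(), set()))

# Hypothetical "live Web": the second page is generated on demand
# from a database, so this toy crawler never stores it.
live_web = {
    "example.com/news": "county budget hearing",
    "example.com/db?id=7": "county budget records",
}

# Assumption for illustration: the crawler skips URLs with query strings.
crawled = {url: text for url, text in live_web.items() if "?" not in url}
index = build_index(crawled)

print(search(index, "budget"))   # only the crawled page can match
print(search(index, "records"))  # exists online, but not in the index
```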

What they’re missing – eye-opening stats:

  • Google searches far less than half of what’s out there
  • Total shared results of any two search engines: 8.9 percent
  • Any three search engines: 2.2 percent
  • Above figures from 2007 study by Dogpile, Penn State and Queensland University of Technology
  • Some estimate the “invisible” Web is 550 times bigger than the “visible” Web.
  • Google says more than 1,000 federal government sites can’t be crawled.

If (way) more than half the Web isn’t showing up in a search engine result, then it is important for investigative reporters to know where to go to find it. Here are some of the principles behind efficiently conducting those searches, with both superficial tools and subterraneous means.

Surface Web – Savvy searching tips:

  • Treat info online as one would any source (confirm)
  • Find out who owns the Web site
  • Know Google advanced search options (esp. domain and file type)
  • Archived Web: Gone doesn’t mean forever. (Google cache, Wayback Machine)
  • Consult at least two other search engines–each has its own strengths and weaknesses.
  • People finders (e.g., http://www.pipl.com, http://www.whitepages.com)
  • Social media searches (e.g., http://www.whostalkin.com… Who’s Talkin’, not Who Stalkin’… or so they say)
  • Use Wikipedia for the footnotes only
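
Two of the tips above, Google's domain and file type options and the archived Web, can be sketched in a few lines of Python. This is a rough illustration, not something from the presentation: the function names, the search terms and the alabama.gov domain are my own picks. The `site:` and `filetype:` operators are Google's, and the lookup address is the Internet Archive's Wayback Machine availability endpoint. The code only builds the URLs; it does not fetch anything.

```python
from urllib.parse import urlencode

def build_query(terms, site=None, filetype=None):
    """Combine search terms with Google's site: and filetype: operators."""
    parts = [terms]
    if site:
        parts.append(f"site:{site}")
    if filetype:
        parts.append(f"filetype:{filetype}")
    return " ".join(parts)

def google_url(query):
    """Turn a query string into a Google search URL."""
    return "https://www.google.com/search?" + urlencode({"q": query})

def wayback_lookup_url(page_url, timestamp=None):
    """Build a Wayback Machine availability lookup for a page.

    timestamp is an optional YYYYMMDD string; the archive reports the
    stored copy closest to that date.
    """
    params = {"url": page_url}
    if timestamp:
        params["timestamp"] = timestamp
    return "https://archive.org/wayback/available?" + urlencode(params)

# Hypothetical example: PDF reports on a state government domain.
query = build_query('"bridge inspection" report', site="alabama.gov", filetype="pdf")
print(query)
print(google_url(query))
print(wayback_lookup_url("http://www.example.com", timestamp="20100101"))
```

Pasting the first printed line straight into the Google search box does the same job as the advanced search form.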

The session then took Web searches to the next level… well, at least a step above what amateur voyeurs might use to get information.

Deep Web – Search like a pro:

  • Know what search engines typically miss (databases, content behind firewalls and registration screens, ASP/dynamically generated pages, pages excluded by robots.txt)
  • The information is out there, but the key is to find organizations that make it more easily accessible. Bookmark these!
  • Directories by and for journalists (‘Net Tour and Reporter’s Desktop)
  • Know the gateways to public records
  • Pipl actually claims to access the Deep Web. Try it. Pipl yourself. It’s scary how much information it digs up with just a name.
  • The census is your friend, especially in 2010
  • To get fully submerged… go to IRE’s Web site!
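
One item in the list above is easy to check for yourself: whether a site's robots.txt file tells crawlers to stay out of a given page. Here is a minimal sketch using Python's standard `urllib.robotparser` module. The robots.txt contents and the example.com addresses are invented for illustration; a real file lives at the root of the site, e.g. /robots.txt.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: every crawler is told to skip the
# records database and the on-site search pages.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /records/
Disallow: /search
"""

def is_crawlable(robots_txt, url, user_agent="*"):
    """Return True if robots_txt lets user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)

for url in (
    "https://www.example.com/about",
    "https://www.example.com/records/deeds?id=42",
):
    status = "crawlable" if is_crawlable(SAMPLE_ROBOTS_TXT, url) else "excluded"
    print(url, "->", status)
```

A page marked excluded here is still open to anyone with a browser. It just won't show up in a well-behaved search engine, which is exactly why it pays to go looking directly.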

I’m not going to copy and paste into this post all of the useful links for discovering the “hidden Web” and the “dead Web.” They were hyperlinked in the PowerPoint presentation, which Mark offered to send to anybody at the day-long seminar who asked for it. All of this material is available at the organization’s site, and I can see what the nominal membership fees pay for, seriously.
