The following is based on an Investigative Reporters and Editors seminar this weekend in Birmingham, Alabama.
The hour-long presentation, Effective use of the Internet, was fittingly framed by the first word in the title. Mark Horvit, IRE’s executive director, began by emphasizing that reporters should approach online research armed with a strategy (i.e., key words to search and a general idea of what’s available and desirable) to avoid getting distracted by the Web’s potentially cavernous detours. Step one, Horvit said, is not to log on, but to sketch out a plan.
Every investigative journalist should know that a search engine does not search the live Internet. A Google search, for example, looks through Google’s own servers, which are stocked with information that the company’s Web “crawlers” have found and stored.
What they’re missing – eye-opening stats:
- Google searches far less than half of what’s out there
- Total shared results of any two search engines: 8.9 percent
- Any three search engines: 2.2 percent
- Above figures from 2007 study by Dogpile, Penn State and Queensland University of Technology
- Some estimate the “invisible” Web is 550 times bigger than the “visible” Web.
- Google says more than 1,000 federal government sites can’t be crawled.
If (way) more than half the Web isn’t showing up in a search engine result, then it is important for investigative reporters to know where to go to find it. Here are some of the principles behind efficiently conducting those searches, with both superficial tools and subterraneous means.
Surface Web – Savvy searching tips:
- Treat info online as one would any source (confirm)
- Find out who owns the Web site
- Know Google advanced search options (esp. domain and file type)
- Archived Web: Gone doesn’t mean forever. (Google cache, Wayback Machine)
- Consult at least two other search engines – each has its own strengths and weaknesses.
- People finders (e.g., http://www.pipl.com, http://www.whitepages.com)
- Social media searches (e.g., http://www.whostalkin.com… Who’s Talkin’, not Who Stalkin’… or so they say)
- Use Wikipedia for the footnotes only
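To make the advanced-search and archive tips concrete, here is a minimal Python sketch (not something shown in the presentation). It builds a Google query that uses the `site:` and `filetype:` operators, and a lookup URL for the Internet Archive's Wayback Machine availability endpoint. The function names, the search terms and the census.gov example are my own illustrative assumptions, and the code only builds URLs; it does not fetch anything.

```python
from urllib.parse import urlencode


def build_search_url(terms, site=None, filetype=None):
    """Build a Google search URL, optionally limited by domain and file type.

    Hypothetical helper: it relies on the "site:" and "filetype:" operators
    that Google's advanced search exposes.
    """
    parts = [terms]
    if site:
        parts.append(f"site:{site}")          # limit results to one domain
    if filetype:
        parts.append(f"filetype:{filetype}")  # limit results to one file type
    return "https://www.google.com/search?" + urlencode({"q": " ".join(parts)})


def wayback_lookup_url(page_url):
    """Build a Wayback Machine availability lookup URL for a page.

    Hypothetical helper: it assumes the archive.org availability endpoint
    and returns the URL to request, without requesting it.
    """
    return "https://archive.org/wayback/available?" + urlencode({"url": page_url})


if __name__ == "__main__":
    # Illustrative values only: spreadsheets about population on census.gov.
    print(build_search_url("population estimates", site="census.gov", filetype="xls"))
    print(wayback_lookup_url("example.com/page"))
```

Pasting the first printed URL into a browser should run the narrowed search; the second asks the archive whether it holds a saved copy of the page.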
The session then took Web searches to the next level… well, at least a step above what amateur voyeurs might use to get information.
Deep Web – Search like a pro:
- Know what search engines typically miss (databases, content behind firewalls and registration screens, ASP/dynamically generated pages, pages excluded by robots.txt)
- The information is out there, but the key is to find organizations that make it more easily accessible. Bookmark these!
- Directories by and for journalists (‘Net Tour and Reporter’s Desktop)
- Know the gateways to public records
- Pipl actually claims to access the Deep Web. Try it. Pipl yourself. It’s scary how much information it digs up with just a name.
- The census is your friend, especially in 2010
- To get fully submerged… go to IRE’s Web site!
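One reason pages go missing from search results is a site's robots.txt file, which tells crawlers which paths to skip. As a rough sketch (again, not from the presentation), Python's standard `urllib.robotparser` module can show whether a well-behaved crawler would be allowed to index a given page. The sample rules and the `agency.example` addresses below are made up for illustration.

```python
from urllib.robotparser import RobotFileParser

# Made-up robots.txt rules: every crawler is asked to stay out of /records/.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /records/
"""


def is_crawlable(robots_txt, url, user_agent="*"):
    """Return True if the given robots.txt rules let user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())  # parse the rules from text, no network call
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for url in ("https://agency.example/about.html",
                "https://agency.example/records/2009.html"):
        print(url, is_crawlable(SAMPLE_ROBOTS_TXT, url))
```

A page that fails this check can still be perfectly public; it just will not turn up in a search engine that honors the exclusion, so a reporter has to go to the site directly.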
I’m not going to copy-paste into this post all of the useful links for discovering the “hidden Web” and the “dead Web.” They were hyperlinked in the PowerPoint presentation, which Mark offered to send to anybody at the day-long seminar who asked for it. All of this stuff is available at the organization’s site, and seriously, I can see what the nominal membership fees pay for.