Waldo Jaquith discovered that the FDLP (Federal Depository Library Program) appears to have an allergic reaction to people downloading its data with basic command-line tools.
fdlp․gov blocks requests from cURL with a 403 and a “malware detected” error. >:-/
— Waldo Jaquith (@waldojaquith) January 2, 2013
The CSV’s URL (linked from this post) is not blocked by their robots.txt. Using an alternate tool, wget, worked fine. My colleague Thom Neale humorously noted that having curl identify itself to FDLP.gov with the user agent “microsoft-malware-professional-2013” also worked, but Waldo found that “Mozilla/5.0” did not. So FDLP has some weird, specific logic around who is approved to download their data and who isn’t.
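For anyone who wants to repeat this kind of check against their own server, here is a minimal sketch in Python. It is not what any of us actually ran: the `probe` helper, the `curl/7.28.0` and `Wget/1.14` version strings, and the URL you pass in are all assumptions for illustration. It sends one request per user agent string and records the HTTP status that comes back.

```python
import urllib.error
import urllib.request

# User agent strings to compare. The curl and wget version strings are
# assumed stand-ins for those tools' defaults; the other two are the
# strings mentioned in the post.
USER_AGENTS = [
    "curl/7.28.0",
    "Wget/1.14",
    "Mozilla/5.0",
    "microsoft-malware-professional-2013",
]


def build_request(url, user_agent):
    """Build a GET request that identifies itself with the given user agent."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


def fetch_status(request):
    """Send the request and return the HTTP status code, including 4xx/5xx."""
    try:
        with urllib.request.urlopen(request, timeout=10) as response:
            return response.status
    except urllib.error.HTTPError as err:
        # urllib raises on error statuses; the code is what we want to record.
        return err.code


def probe(url, user_agents=USER_AGENTS, fetch=fetch_status):
    """Return a dict mapping each user agent string to the status it received."""
    return {ua: fetch(build_request(url, ua)) for ua in user_agents}
```

Calling `probe(url)` with the address of a file you are allowed to fetch returns a dictionary of status codes keyed by user agent. If the codes differ when nothing but the user agent changes, the server is profiling clients by name rather than by behavior.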
In the course of verifying all this, after trying to download the CSV only a handful of times, FDLP blocked the entire Sunlight Foundation office from any access to FDLP.gov. This was two weeks ago, and Sunlight staff are still blocked when they visit FDLP.gov.
Setting aside the ridiculousness of permanently blocking us after so few requests, treating non-browser requests to download structured data as “malware” is seriously backwards thinking, especially for a government agency.
Restricting abusive behavior is obviously fine, but that abuse should be measured by behavior, not by user agent profiling. If you host structured data at a public, permanent link, expect people to want to obtain that data through a great variety of reasonable means.
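To make "measured by behavior" concrete, here is a rough sketch of a sliding-window rate limiter in Python. It says nothing about how FDLP's servers actually work: the `SlidingWindowLimiter` class and its default limits are hypothetical. The point is that the only inputs are who is asking (an IP address, say) and how often. The user agent never enters into it.

```python
import time
from collections import defaultdict, deque


class SlidingWindowLimiter:
    """Allow at most max_requests per client within any window_seconds span."""

    def __init__(self, max_requests=60, window_seconds=60.0, clock=time.monotonic):
        # The default limits are illustrative, not a recommendation.
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self.clock = clock
        self._hits = defaultdict(deque)  # client id -> timestamps of allowed requests

    def allow(self, client_id):
        """Return True if this client may make a request right now."""
        now = self.clock()
        hits = self._hits[client_id]
        # Forget requests that have aged out of the window.
        while hits and now - hits[0] >= self.window_seconds:
            hits.popleft()
        if len(hits) >= self.max_requests:
            return False  # too many recent requests: throttle, don't ban
        hits.append(now)
        return True
```

A client that exceeds the limit is turned away only until its older requests age out of the window, so a handful of downloads never becomes a permanent block.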