Instant APIs: no code necessary, just add mouseclicks
APIs and mashups are the bread and butter of the Sunlight Labs. In fact, the official name of the Labs is actually The Sunlight Mashup Labs, so it is perhaps not surprising that we headed off to MashupCamp 3 at MIT in Cambridge a couple of weeks ago to geek out and to see the latest happenings, trends and players in the world of mashups and technology.
There was a lot of cool stuff demoed, debated and hacked at the camp, but one theme really stood out for me: tools that make scraping easier and more effective than ever before. In short, tools to create instant APIs without programming — yes, really.
To be more concrete: suppose there is a website whose headlines you read every day. You would prefer to read these stories in your favorite RSS reader, but the website does not provide an RSS feed. With these tools, you can create such a feed yourself.
Or, suppose you want to buy a second-hand iBook from Craigslist and are not in a rush; you would rather wait for the right deal. You could manually check the listings every day, but you would rather be alerted when an iBook comes up for sale within a certain distance from home and under a certain price. To do this programmatically, you need to grab the listings for computer ads and then parse them to see if any match your criteria. With these tools, you can create an API for Craigslist computer ads, under a search query of “iBook”, in a matter of minutes.
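To see what these tools save you from, here is a rough Python sketch of the hand-rolled iBook check. The search URL and listing markup are my own assumptions for illustration, not Craigslist's actual format, and the pattern would need tuning against the real pages:

```python
# A hand-rolled version of the iBook alert, for contrast. The URL and
# the listing markup are illustrative assumptions, not Craigslist's
# actual format.
import re
import urllib.request

SEARCH_URL = "http://boston.craigslist.org/sya/"  # computers-for-sale listings (assumed)
MAX_PRICE = 400

html = urllib.request.urlopen(SEARCH_URL).read().decode("utf-8", "replace")

# Assume each listing looks roughly like: <a href="...">iBook G4 - $350</a>
for title, price in re.findall(r'<a href="[^"]+">([^<]*iBook[^<]*)\$(\d+)', html, re.I):
    if int(price) <= MAX_PRICE:
        print(title.strip(" -"), "for $" + price)
```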
Anything presented on a webpage, certainly a static one, can be scraped. The problem is that scraping is not easy. The quality of HTML structure varies considerably. While some sites use valid W3C XHTML, CSS with descriptive class names, and sensible divs that divide the different sections of the page, many do not. Do a view source of a random MySpace page and you will see what I mean: it is a huge mess.
So, scraping a page that lacks a clear or consistent structure, or one that requires a login (say, your AOL buddy list), is not at all easy. It can be done, but it takes time and, moreover, who knows how long the scraper will last: the content provider changes the page structure and your scraper has to be reworked. Furthermore, one has to be a coder: you need tools such as curl or PHP to fetch the page contents, you may need an HTML parser to navigate the content, and you may need competency in regular expressions to pull out precisely what you need. Finally, you may actually need a server to display the final, desired content, and not everyone has such resources. (We will see below how one can dispense with this last requirement.) Any tool that provides instant APIs from arbitrary webpages, especially for non-programmers, has to be good news.
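That last step, publishing what you scraped, is easy to underestimate. As a rough Python sketch (the headlines and feed details are placeholders I invented for illustration), republishing scraped headlines as RSS looks something like this, and you still need somewhere to host the file:

```python
# Sketch: turn a list of scraped headlines into a bare-bones RSS 2.0 feed.
# The headlines here are placeholders for whatever a scraper extracted.
from xml.sax.saxutils import escape

headlines = [("Some headline", "http://example.com/story1"),
             ("Another headline", "http://example.com/story2")]

items = "".join(
    "<item><title>%s</title><link>%s</link></item>" % (escape(t), escape(u))
    for t, u in headlines
)
rss = ('<?xml version="1.0"?><rss version="2.0"><channel>'
       '<title>My scraped feed</title><link>http://example.com/</link>'
       '<description>Headlines scraped by hand</description>'
       '%s</channel></rss>') % items

open("feed.xml", "w").write(rss)  # ...and now find a web server to serve it
```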
Two organizations in this space that demoed at MashupCamp are Dapper and OpenKapow. Now that I’ve had a chance to play around with these technologies, I want to take this opportunity to review them and give an unbiased critique.
Dapper
Dapper is a US startup based in Israel, founded by Eran Shir and Jon Aizen, and is currently in open beta. While all services are free at the moment, large projects or commercial uses of Dapper may be charged in the future. However, if Dapper works as well as claimed, then I can see many organizations making good use of it and saving on development costs even under a fee-based structure. Why do I say this? What precisely is Dapper?
Dapper is a web-based tool and service for scraping: it creates APIs from potentially any webpage and then generates output from those APIs in a variety of formats, including HTML, XML, RSS and even Google Maps.
One starts by entering a URL and the webpage is displayed within Dapper. One can click through different pages and add each page to a “basket”. Thus, one could add, say, pages 1, 2, 3 and 4 of a blog. Dapper then analyzes these pages to work out the structure: say, what is a static header and footer and what is dynamic content. One then gets to “play”. That is, clicking on a story title should highlight all other story titles on the page. This gives one confidence that Dapper will grab the correct content from the page when one specifies precisely what the API, or “Dapp”, should do.
I have to say that I had mixed success with this. While Dapper correctly identified the story titles for techcrunch.com and http://www.followthemoney.org/Newsroom/index.phtml, it did not do so on sunlightfoundation.com and it could not work out my intention of grabbing rows or columns on http://opensecrets.org/orgs/list.asp?order=A (but it could get the org titles). However, this is still a beta and Dapper certainly does work: there are hundreds of user-created Dapps that one can browse and use.
After playing, one can then progress to creating the API proper: one clicks on a desired element, gives it a name, and can then define a group, e.g. specify that this story title, this author name and this number of diggs are all related as one unit. After that, one can preview the API, i.e. check what it pulls out. If all is well, one is done. Then the real fun begins…
Dapper provides the API in an impressive variety of formats: XML, HTML, RSS, alerts, iCalendar (transforming the output of a Dapp into a calendar that can be used in Google Calendar, Sunbird, iCal and other programs), Google Maps (placing locations directly onto a map), Google Gadgets, Netvibes, image loop, email, a link to another Dapp, CSV, JSON, YAML, XSL, or a fork as another Dapp. These APIs are published and hosted at Dapper (hence one does not need one’s own server to provide an API). Dapper also allows one to define parameters for the API, such as a {query} or {page}.
Some of the resultant URLs are a mess, though. So Dapper allows one to define a service that provides a nice clean URL. That is, instead of, say, http://www.dappit.com/RunDapp?dappName=sunlightlabs&v=1&thisparam=y&thatparam=z… one can provide users with http://www.dappit.com/services/sunlightlabs.
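Either way, these URLs make Dapps easy to consume from a script. Here is a hedged Python sketch: the Dapp name, the {query} parameter and the “title” field below are illustrative, so substitute whatever Dapper generates for your own Dapp:

```python
# Fetch a Dapp's XML output and pull out one field. The Dapp name,
# the "query" parameter and the "title" field are all illustrative;
# substitute the values Dapper shows for your own Dapp.
import urllib.parse
import urllib.request
import xml.dom.minidom

params = urllib.parse.urlencode({"dappName": "sunlightlabs", "v": "1", "query": "earmarks"})
url = "http://www.dappit.com/RunDapp?" + params

doc = xml.dom.minidom.parseString(urllib.request.urlopen(url).read())
for node in doc.getElementsByTagName("title"):  # field name is whatever you chose in Dapper
    if node.firstChild is not None:
        print(node.firstChild.data)
```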
Another nice feature is AggregatorAid, which allows one to pool Dapps into a single service. Thus, if one has a suite of individual search Dapps (Google, digg, reddit, etc.) one can provide a single query that will aggregate the results as a single API. The only restriction is that each individual Dapp must have the same query parameter name (I hope this is relaxed in later versions).
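In effect, AggregatorAid automates what you would otherwise script by hand: fanning the same query out to each Dapp and pooling the responses. A rough sketch of that hand-rolled version, with Dapp names invented for illustration:

```python
# What AggregatorAid automates: call several search Dapps with the same
# query parameter and pool the raw responses. The Dapp names are made up.
import urllib.parse
import urllib.request

DAPPS = ["googleSearch", "diggSearch", "redditSearch"]  # hypothetical Dapps

def pooled_search(query):
    results = []
    for name in DAPPS:
        # Hence the restriction: every Dapp must name its parameter "query".
        url = "http://www.dappit.com/RunDapp?" + urllib.parse.urlencode(
            {"dappName": name, "v": "1", "query": query})
        results.append(urllib.request.urlopen(url).read())
    return results  # e.g. pooled_search("sunlight") returns one response per Dapp
```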
Finally, Dapper provides secure, authenticated login into sites. Thus, if you want to scrape your AOL buddies or comments on your Facebook page, Dapper allows you to create an API without exposing your username or password.
Overall, Dapper is a slick instant-API generator with some very nice features: it is web-based, has a clean interface, and provides API output in most of the formats anyone would want. It is beta, it is not perfect, but then it is free. One really can get a basic API scraped from a page, published, and up and running in literally five minutes. This is why I say I can see a valid business model here. Organizations, especially non-profits such as ourselves, can try Dapper first to create a given API, and only if it doesn’t work resort to creating the API from scratch or outsourcing it, either of which will certainly be more time-consuming and expensive.
OpenKapow
OpenKapow, also in beta, founded by Stefan Andreasen in Denmark, is part of the larger Kapow Technologies, which appears to have a large range of corporate clients and a number of offices on both sides of “the pond”.
OpenKapow has the same goal of instant APIs as Dapper but takes a different approach. For a start, while Dapper is web-based, OpenKapow requires a large “RoboMaker” download (100+ MB) for Windows or Linux. [So, as a Mac guy, that cuts me out. Take a look around at any hacker or mashup conference and you will find a sea of Macs, and these are the people most likely to build such mashups. While Macs are UNIX underneath, the Linux version does not install on them.]

Installing RoboMaker on Windows, one is presented with a sophisticated Java-based tool for mashups. As in Dapper, a webpage is loaded and displayed within the tool and one can select various elements of the page. RoboMaker has a nice DOM tree view in which one can explore the structure of the page; clicking an element highlights it and, at the same time, shows the selected item in an HTML viewer. Together, the three panels provide a better understanding of the page structure than Dapper does.
On the right are a series of controls for selecting tags based on name, type, conditions, etc. There really are a very large number of features and controls that (I would imagine) provide a very granular approach to scraping. Therein, however, lies the problem.
There were so many controls, so many right-click context menus, that it was not at all obvious for a newbie where to begin. I failed to select all the h1 tags of the page despite trying various combinations of select *.h1 and the like. I saw OpenKapow in action at MashupCamp and watched Stefan demo the creation of an API for Google search. It looked very easy…if you know what you are doing. I really do believe that OpenKapow is a more sophisticated tool that can probably deal with edge cases that Dapper cannot. It is a tool for hackers and programmers who must invest some time in understanding it (and in getting the thing downloaded and installed), but it is certainly not a quick and easy hacker tool. While both will generate APIs without true coding, I feel that Dapper is a quick-and-dirty, intuitive 90% solution while OpenKapow is a 90%+ tool for more serious projects.
Perhaps I have a short attention span, but I have to admit that I gave up trying to get something basic working in OpenKapow. Besides, even if I did get it working, the output formats for OpenKapow are far more limited: (X)HTML, XML, REST, and RSS.
Conclusions
Dapper especially allows one to try a quick hack. If that doesn’t work, one can resort to writing the scraper oneself or using a tool such as OpenKapow. OpenKapow looks as though it would be worth the effort of learning all of its features if one were going to do lots of scraping and mashups, or in a corporate environment that demands a very granular mashup that works all the time. I think both of these represent an exciting new era for mashups. Both are programming-free tools for instant, shareable APIs, and both provide a reasonable GUI for selecting elements rather than requiring one to poke around the underlying HTML itself. As such, the bar for mashing up is significantly lowered.