import.io & Web Scraping for the Average Joe or Jill

Why should just the big-shots have access to an unlimited flow of data from the unstructured web?

Data, data everywhere, but not a byte to crunch. This expertly-crafted opening sentence captures the sentiment of many an aspiring analyst as he or she crawls through the tubes of the internet, desperate for tabular, structured data to feed into a project. While companies might have more structured data than they know how to handle, the opposite is true for most people outside of a firm’s org structure.

 

This is understandable: data is a competitive advantage, so sharing ought to be done, if at all, judiciously. But in so doing, data owners take the limitless power of the crowd down a peg.

Until the storied data singularity occurs, then, crawler & scraper technology fills the gap. Borrowing a page* from search engines, these automatons wander the web, downloading & storing whatever data they’re instructed to. With a little Python & a lot of moxie, users can build datasets in a fraction of the time that manual harvesting would take.
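For a sense of what “a little Python” looks like, here is a minimal sketch of the scraping half of the job: turning an HTML table into rows of plain text using only the standard library. The `TableScraper` class, the `scrape_table` helper, and the sample listings page are illustrative assumptions, not taken from any real site or tool; a real crawler would also need to download pages and follow links.

```python
from html.parser import HTMLParser


class TableScraper(HTMLParser):
    """Collect the text of every <td>/<th> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []     # finished rows, each a list of cell strings
        self._row = None   # row currently being built, or None
        self._cell = None  # text fragments of the current cell, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
            self._cell = None
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Text outside of a cell (e.g. whitespace between tags) is ignored.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
            self._cell = None


def scrape_table(html):
    """Return the table cells found in `html` as a list of rows."""
    parser = TableScraper()
    parser.feed(html)
    parser.close()
    return parser.rows


# Hypothetical page content, standing in for a downloaded listings page.
SAMPLE_PAGE = """
<table>
  <tr><th>Title</th><th>Price</th></tr>
  <tr><td>Sunny studio</td><td>$1,200</td></tr>
  <tr><td>Two-bed near park</td><td>$2,100</td></tr>
</table>
"""

rows = scrape_table(SAMPLE_PAGE)
for row in rows:
    print(row)
```

Even this toy version hints at why code-based scraping takes practice: the parser has to be told exactly which tags matter, and a page laid out differently would need different rules.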

Some companies have fed their cash cows on the grass of web scraping tech. In a well-known lawsuit, internet artifact craigslist.org set their legal team on 3Taps, a company that scraped all Craigslist postings & sold them to third parties (most famously, padmapper.com built its reputation on Craigslist data scraped by 3Taps).

To be clear: scraping is not in-and-of-itself illegal. In fact, the technology does little more than replicate the same requests of servers that a user’s browser might. So long as a scraper doesn’t overload the server with requests (a popular weapon used in “Denial of Service” attacks, often used by ne’er-do-wells to bring down webservers), the entire process is on the up-and-up.
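One common way to stay on the right side of that line is to pause between requests. The sketch below is an assumption about how a polite scraper might do it, not a description of any particular tool: `Throttle`, `fetch_all`, and the caller-supplied `fetch` function are hypothetical names, and the right pause length depends on the target site.

```python
import time


class Throttle:
    """Enforce a minimum pause between consecutive requests to one server."""

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval  # seconds between requests
        self._clock = clock               # injectable so it can be tested
        self._sleep = sleep
        self._last = None                 # time of the previous request

    def wait(self):
        """Sleep just long enough to honor min_interval; return seconds slept."""
        delay = 0.0
        if self._last is not None:
            elapsed = self._clock() - self._last
            delay = max(0.0, self.min_interval - elapsed)
            if delay > 0:
                self._sleep(delay)
        self._last = self._clock()
        return delay


def fetch_all(urls, fetch, throttle):
    """Call fetch(url) for each URL, pausing between calls via the throttle."""
    results = []
    for url in urls:
        throttle.wait()
        results.append(fetch(url))
    return results
```

A caller might write `fetch_all(urls, my_download_function, Throttle(min_interval=2.0))`, where `my_download_function` is whatever actually retrieves a page.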

Where 3Taps went wrong, then, was monetizing the asset on the back end: data to which Craigslist held exclusive rights**. But the fruit of the data tree proved too sweet for other companies to resist a taste in the wake of the Craigslist fiasco.

Enter “import.io.” This young company has two notable components to their business model: the first is a user-friendly, powerful, and graphical utility for crawling the web. Using import.io solutions for the first time, this author was able to harvest thousands of posts from a forum with just 15 minutes of setup: no lines of code, and only a few trips to the documentation. To put that in perspective, using some of the more popular code-based scraper solutions, only an expert user could have built that sort of scraper in so little time.

https://youtube.com/watch?v=cdmsTxu45-c

The second component of import.io is where the company got especially clever. Much in the same way that Duolingo provides free language lessons in exchange for crowd-sourced translation services, import.io puts its army of users to work. Their enterprise package boasts that “every day our powerful infrastructure collects millions of data records from the web,” access to which they sell at a premium.

For a single company to offer just this second component would be a tremendous burden. Every site’s structure is different, and that typically calls for a custom scraper program. No company could write & maintain scrapers for every site on the web. So, import.io doesn’t. With every user building scrapers, tabularizing & organizing the data, and (most importantly) updating their scraper when a site changes, import.io has an unprecedented army of data harvesters at their fingertips.

For import.io and the few competitors in this space, the future is ostensibly bright. Because their critical mass of data likely doesn’t depend on any one site (like 3Taps & padmapper did), they’re able to adjust to any one company’s response to scraping. This is clearly on their radar, as their documentation pretty explicitly states that they respect target sites’ wishes.
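How import.io honors those wishes is not spelled out here. One widely used convention, offered purely as an illustration, is a site’s robots.txt file, which Python’s standard `urllib.robotparser` module can read. The robots.txt content, the example.com URLs, and the `example-scraper` user-agent name below are all made up for the sketch.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt, standing in for one downloaded from a target site.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 5
"""

AGENT = "example-scraper"  # hypothetical user-agent name


def build_rules(robots_txt):
    """Parse robots.txt text into a rule set that can be queried per URL."""
    rules = RobotFileParser()
    rules.parse(robots_txt.splitlines())
    return rules


rules = build_rules(SAMPLE_ROBOTS_TXT)

# Ask before fetching: public listings are allowed, /private/ is not.
allowed_public = rules.can_fetch(AGENT, "https://example.com/listings/42")
allowed_private = rules.can_fetch(AGENT, "https://example.com/private/notes")

# The site also asks crawlers to wait this many seconds between requests.
requested_delay = rules.crawl_delay(AGENT)

print(allowed_public, allowed_private, requested_delay)
```

A scraper built this way can change its behavior whenever a site changes its robots.txt, without any change to the code.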

The most obvious room for growth for import.io is the flexibility of their scraper solution. Programmatic scraping might be difficult, but it’s almost infinitely extensible, something a graphical tool can rarely claim. Until big-dog open-source solutions like Scrapy & Selenium are no longer necessary, import.io will not be able to address the entire market.

*Pun.

**There’s more to it, including 3Taps’ circumvention of Craigslist’s efforts to block them, but this is sufficient for our purposes.

Student comments on import.io & Web Scraping for the Average Joe or Jill

  1. So does import.io openly acknowledge breaking terms of service of all the websites it scrapes from by selling the data? I’m curious about their long term strategy and how they’re thinking about the legal side of things.

    1. I think they’re taking the classic platform approach: all they do is build tools, it’s up to users to apply them ethically.

      To wit, their (T&C)[https://import.io/terms-and-conditions] reads:

      8.2. If you wish to use the Service to convert any Web Data into a table or data or a structured API (or any other functionality offered by the Service) that you do not own, you must obtain the consent of or an appropriate licence from the licensors or owners of such Web Data before you process all of or any portion of such Web Data through the Service. You must comply with requests from third party rights holders to cease to deal in any way with any Web Data that they own when you do not possess appropriate licences to deal with such Web Data.

      1. oops, inverted my markdown syntax; here’s the link: [T&C]()
