{"id":2736,"date":"2015-11-22T22:40:11","date_gmt":"2015-11-23T03:40:11","guid":{"rendered":"https:\/\/digital.hbs.edu\/platform-digit\/submission\/import-io-web-scraping-for-the-average-joe-or-jill\/"},"modified":"2015-11-22T22:43:12","modified_gmt":"2015-11-23T03:43:12","slug":"import-io-web-scraping-for-the-average-joe-or-jill","status":"publish","type":"hck-submission","link":"https:\/\/d3.harvard.edu\/platform-digit\/submission\/import-io-web-scraping-for-the-average-joe-or-jill\/","title":{"rendered":"import.io &amp; Web Scraping for the Average Joe or Jill"},"content":{"rendered":"<p>Data, data everywhere, but not a byte to crunch.\u00a0 This expertly-crafted opening sentence captures the sentiment of many an aspiring analyst as he or she crawls through the tubes of the internet, desperate for tabular, structured data to feed into a project.\u00a0 While companies might have more structured data than they know how to handle, the opposite is true for most people outside of a firm\u2019s org structure.<\/p>\n<p>This is understandable:\u00a0 data is a competitive advantage, so sharing ought to be done, if at all, judiciously.\u00a0 But in so doing, data owners take the limitless power of the crowd down a peg.<\/p>\n<p>Until the storied data singularity occurs, then, crawler &amp; scraper technology fills the gap.\u00a0 Borrowing a page* from search engines, these automatons wander the web, downloading &amp; storing whatever data they\u2019re instructed to.\u00a0 With a little Python &amp; a lot of moxie, users can build datasets in a fraction of the time that manual harvesting would take.<\/p>\n<p>Some companies have fed their cash cows on the grass of web scraping tech.\u00a0 In a well-known lawsuit, internet artifact craigslist.org set its legal team on 3Taps, a company that scraped all Craigslist postings &amp; sold them to third parties (most famously, padmapper.com, which built its reputation on Craigslist data scraped by 3Taps).<\/p>\n<p>To be 
clear:\u00a0 scraping is not in and of itself illegal.\u00a0 In fact, the technology does little more than replicate the requests that a user\u2019s browser would send to a server.\u00a0 So long as a scraper doesn\u2019t overload the server with requests (the tactic behind \u201cDenial of Service\u201d attacks, which ne\u2019er-do-wells use to bring down web servers), the entire process is on the up-and-up.<\/p>\n<p>Where 3Taps went wrong, then, was monetizing the asset on the back end: data to which Craigslist held exclusive rights**.\u00a0 But the fruit of the data tree proved too sweet for other companies to resist a taste in the wake of the Craigslist fiasco.<\/p>\n<p>Enter \u201cimport.io.\u201d\u00a0 This young company has two notable components to their business model:\u00a0 the first is a user-friendly, powerful, and graphical utility for crawling the web.\u00a0 Using import.io solutions for the first time, this author was able to harvest thousands of posts from a forum with just 15 minutes of setup&#8211;no lines of code, and only a few trips to the documentation.\u00a0 To put that in perspective, with some of the more popular code-based scraper solutions, only an expert user could have built that sort of scraper in so little time.<\/p>\n<p>https:\/\/youtube.com\/watch?v=cdmsTxu45-c<\/p>\n<p>The second component of import.io is where the company got especially clever.\u00a0 Much in the same way that Duolingo provides free language lessons in exchange for crowd-sourced translation services, import.io puts its army of users to work.\u00a0 Their enterprise package boasts that \u201cevery day our powerful infrastructure collects millions of data records from the web,\u201d access to which they sell at a premium.<\/p>\n<p>For a single company to offer just this second component would be a tremendous burden. 
Every site\u2019s structure is different, and that typically calls for a custom scraper program.\u00a0 No company could write &amp; maintain scrapers for every site on the web.\u00a0 So, import.io doesn\u2019t.\u00a0 With every user building scrapers, tabularizing &amp; organizing the data, and (most importantly) updating their scraper when a site changes, import.io has an unprecedented army of data harvesters at their fingertips.<\/p>\n<p>For import.io and the few competitors in this space, the future is ostensibly bright.\u00a0 Because their critical mass of data likely doesn\u2019t depend on any one site (as 3Taps\u2019s &amp; padmapper\u2019s did), they\u2019re able to adjust to individual companies\u2019 responses to scraping.\u00a0 This is clearly on their radar, as their documentation pretty explicitly states that they respect target sites\u2019 wishes.<\/p>\n<p>The most obvious room for growth for import.io is the flexibility of their scraper solution.\u00a0 Programmatic scraping might be difficult, but it\u2019s almost infinitely extensible, something a graphical tool can rarely claim.\u00a0 Until big-dog open-source solutions like Scrapy &amp; Selenium are no longer necessary, import.io will not be able to address the entire market.<\/p>\n<p>*Pun.<\/p>\n<p>**There\u2019s more to it, including 3Taps\u2019s circumvention of Craigslist\u2019s efforts to block them, but this is sufficient for our purposes.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Why should just the big-shots have access to an unlimited flow of data from the unstructured 
web?<\/p>\n","protected":false},"author":68,"featured_media":2737,"comment_status":"open","ping_status":"closed","template":"","categories":[655,1006],"class_list":["post-2736","hck-submission","type-hck-submission","status-publish","has-post-thumbnail","hentry","category-data","category-web-scraping"],"connected_submission_link":"https:\/\/d3.harvard.edu\/platform-digit\/assignment\/data-driven-value-creation-value-capture-and-operating-models\/","yoast_head":"<!-- This site is optimized with the Yoast SEO plugin v27.3 - https:\/\/yoast.com\/product\/yoast-seo-wordpress\/ -->\n<title>import.io &amp; Web Scraping for the Average Joe or Jill - Digital Innovation and Transformation<\/title>\n<meta name=\"robots\" content=\"index, follow, max-snippet:-1, max-image-preview:large, max-video-preview:-1\" \/>\n<link rel=\"canonical\" href=\"https:\/\/d3.harvard.edu\/platform-digit\/submission\/import-io-web-scraping-for-the-average-joe-or-jill\/\" \/>\n<meta property=\"og:locale\" content=\"en_US\" \/>\n<meta property=\"og:type\" content=\"article\" \/>\n<meta property=\"og:title\" content=\"import.io &amp; Web Scraping for the Average Joe or Jill - Digital Innovation and Transformation\" \/>\n<meta property=\"og:description\" content=\"Why should just the big-shots have access to an unlimited flow of data from the unstructured web?\" \/>\n<meta property=\"og:url\" content=\"https:\/\/d3.harvard.edu\/platform-digit\/submission\/import-io-web-scraping-for-the-average-joe-or-jill\/\" \/>\n<meta property=\"og:site_name\" content=\"Digital Innovation and Transformation\" \/>\n<meta property=\"article:modified_time\" content=\"2015-11-23T03:43:12+00:00\" \/>\n<meta property=\"og:image\" content=\"https:\/\/d3.harvard.edu\/platform-digit\/wp-content\/uploads\/sites\/2\/2015\/11\/enterprise.png\" \/>\n\t<meta property=\"og:image:width\" content=\"600\" \/>\n\t<meta property=\"og:image:height\" content=\"228\" \/>\n\t<meta property=\"og:image:type\" content=\"image\/png\" 
\/>\n<meta name=\"twitter:card\" content=\"summary_large_image\" \/>\n<meta name=\"twitter:label1\" content=\"Est. reading time\" \/>\n\t<meta name=\"twitter:data1\" content=\"4 minutes\" \/>\n<script type=\"application\/ld+json\" class=\"yoast-schema-graph\">{\"@context\":\"https:\\\/\\\/schema.org\",\"@graph\":[{\"@type\":\"WebPage\",\"@id\":\"https:\\\/\\\/d3.harvard.edu\\\/platform-digit\\\/submission\\\/import-io-web-scraping-for-the-average-joe-or-jill\\\/\",\"url\":\"https:\\\/\\\/d3.harvard.edu\\\/platform-digit\\\/submission\\\/import-io-web-scraping-for-the-average-joe-or-jill\\\/\",\"name\":\"import.io &amp; Web Scraping for the Average Joe or Jill - Digital Innovation and Transformation\",\"isPartOf\":{\"@id\":\"https:\\\/\\\/d3.harvard.edu\\\/platform-digit\\\/#website\"},\"primaryImageOfPage\":{\"@id\":\"https:\\\/\\\/d3.harvard.edu\\\/platform-digit\\\/submission\\\/import-io-web-scraping-for-the-average-joe-or-jill\\\/#primaryimage\"},\"image\":{\"@id\":\"https:\\\/\\\/d3.harvard.edu\\\/platform-digit\\\/submission\\\/import-io-web-scraping-for-the-average-joe-or-jill\\\/#primaryimage\"},\"thumbnailUrl\":\"https:\\\/\\\/d3.harvard.edu\\\/platform-digit\\\/wp-content\\\/uploads\\\/sites\\\/2\\\/2015\\\/11\\\/enterprise.png\",\"datePublished\":\"2015-11-23T03:40:11+00:00\",\"dateModified\":\"2015-11-23T03:43:12+00:00\",\"breadcrumb\":{\"@id\":\"https:\\\/\\\/d3.harvard.edu\\\/platform-digit\\\/submission\\\/import-io-web-scraping-for-the-average-joe-or-jill\\\/#breadcrumb\"},\"inLanguage\":\"en-US\",\"potentialAction\":[{\"@type\":\"ReadAction\",\"target\":[\"https:\\\/\\\/d3.harvard.edu\\\/platform-digit\\\/submission\\\/import-io-web-scraping-for-the-average-joe-or-jill\\\/\"]}]},{\"@type\":\"ImageObject\",\"inLanguage\":\"en-US\",\"@id\":\"https:\\\/\\\/d3.harvard.edu\\\/platform-digit\\\/submission\\\/import-io-web-scraping-for-the-average-joe-or-jill\\\/#primaryimage\",\"url\":\"https:\\\/\\\/d3.harvard.edu\\\/platform-digit\\\/wp-content\\\/uploa
ds\\\/sites\\\/2\\\/2015\\\/11\\\/enterprise.png\",\"contentUrl\":\"https:\\\/\\\/d3.harvard.edu\\\/platform-digit\\\/wp-content\\\/uploads\\\/sites\\\/2\\\/2015\\\/11\\\/enterprise.png\",\"width\":600,\"height\":228},{\"@type\":\"BreadcrumbList\",\"@id\":\"https:\\\/\\\/d3.harvard.edu\\\/platform-digit\\\/submission\\\/import-io-web-scraping-for-the-average-joe-or-jill\\\/#breadcrumb\",\"itemListElement\":[{\"@type\":\"ListItem\",\"position\":1,\"name\":\"Home\",\"item\":\"https:\\\/\\\/d3.harvard.edu\\\/platform-digit\\\/\"},{\"@type\":\"ListItem\",\"position\":2,\"name\":\"Submissions\",\"item\":\"https:\\\/\\\/d3.harvard.edu\\\/platform-digit\\\/submission\\\/\"},{\"@type\":\"ListItem\",\"position\":3,\"name\":\"import.io &amp; Web Scraping for the Average Joe or Jill\"}]},{\"@type\":\"WebSite\",\"@id\":\"https:\\\/\\\/d3.harvard.edu\\\/platform-digit\\\/#website\",\"url\":\"https:\\\/\\\/d3.harvard.edu\\\/platform-digit\\\/\",\"name\":\"Digital Innovation and Transformation\",\"description\":\"MBA Student Perspectives\",\"potentialAction\":[{\"@type\":\"SearchAction\",\"target\":{\"@type\":\"EntryPoint\",\"urlTemplate\":\"https:\\\/\\\/d3.harvard.edu\\\/platform-digit\\\/?s={search_term_string}\"},\"query-input\":{\"@type\":\"PropertyValueSpecification\",\"valueRequired\":true,\"valueName\":\"search_term_string\"}}],\"inLanguage\":\"en-US\"}]}<\/script>\n<!-- \/ Yoast SEO plugin. 
-->","yoast_head_json":{"title":"import.io &amp; Web Scraping for the Average Joe or Jill - Digital Innovation and Transformation","robots":{"index":"index","follow":"follow","max-snippet":"max-snippet:-1","max-image-preview":"max-image-preview:large","max-video-preview":"max-video-preview:-1"},"canonical":"https:\/\/d3.harvard.edu\/platform-digit\/submission\/import-io-web-scraping-for-the-average-joe-or-jill\/","og_locale":"en_US","og_type":"article","og_title":"import.io &amp; Web Scraping for the Average Joe or Jill - Digital Innovation and Transformation","og_description":"Why should just the big-shots have access to an unlimited flow of data from the unstructured web?","og_url":"https:\/\/d3.harvard.edu\/platform-digit\/submission\/import-io-web-scraping-for-the-average-joe-or-jill\/","og_site_name":"Digital Innovation and Transformation","article_modified_time":"2015-11-23T03:43:12+00:00","og_image":[{"width":600,"height":228,"url":"https:\/\/d3.harvard.edu\/platform-digit\/wp-content\/uploads\/sites\/2\/2015\/11\/enterprise.png","type":"image\/png"}],"twitter_card":"summary_large_image","twitter_misc":{"Est. 
reading time":"4 minutes"},"schema":{"@context":"https:\/\/schema.org","@graph":[{"@type":"WebPage","@id":"https:\/\/d3.harvard.edu\/platform-digit\/submission\/import-io-web-scraping-for-the-average-joe-or-jill\/","url":"https:\/\/d3.harvard.edu\/platform-digit\/submission\/import-io-web-scraping-for-the-average-joe-or-jill\/","name":"import.io &amp; Web Scraping for the Average Joe or Jill - Digital Innovation and Transformation","isPartOf":{"@id":"https:\/\/d3.harvard.edu\/platform-digit\/#website"},"primaryImageOfPage":{"@id":"https:\/\/d3.harvard.edu\/platform-digit\/submission\/import-io-web-scraping-for-the-average-joe-or-jill\/#primaryimage"},"image":{"@id":"https:\/\/d3.harvard.edu\/platform-digit\/submission\/import-io-web-scraping-for-the-average-joe-or-jill\/#primaryimage"},"thumbnailUrl":"https:\/\/d3.harvard.edu\/platform-digit\/wp-content\/uploads\/sites\/2\/2015\/11\/enterprise.png","datePublished":"2015-11-23T03:40:11+00:00","dateModified":"2015-11-23T03:43:12+00:00","breadcrumb":{"@id":"https:\/\/d3.harvard.edu\/platform-digit\/submission\/import-io-web-scraping-for-the-average-joe-or-jill\/#breadcrumb"},"inLanguage":"en-US","potentialAction":[{"@type":"ReadAction","target":["https:\/\/d3.harvard.edu\/platform-digit\/submission\/import-io-web-scraping-for-the-average-joe-or-jill\/"]}]},{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/d3.harvard.edu\/platform-digit\/submission\/import-io-web-scraping-for-the-average-joe-or-jill\/#primaryimage","url":"https:\/\/d3.harvard.edu\/platform-digit\/wp-content\/uploads\/sites\/2\/2015\/11\/enterprise.png","contentUrl":"https:\/\/d3.harvard.edu\/platform-digit\/wp-content\/uploads\/sites\/2\/2015\/11\/enterprise.png","width":600,"height":228},{"@type":"BreadcrumbList","@id":"https:\/\/d3.harvard.edu\/platform-digit\/submission\/import-io-web-scraping-for-the-average-joe-or-jill\/#breadcrumb","itemListElement":[{"@type":"ListItem","position":1,"name":"Home","item":"https:\/\/d3.harvard.edu\/platfo
rm-digit\/"},{"@type":"ListItem","position":2,"name":"Submissions","item":"https:\/\/d3.harvard.edu\/platform-digit\/submission\/"},{"@type":"ListItem","position":3,"name":"import.io &amp; Web Scraping for the Average Joe or Jill"}]},{"@type":"WebSite","@id":"https:\/\/d3.harvard.edu\/platform-digit\/#website","url":"https:\/\/d3.harvard.edu\/platform-digit\/","name":"Digital Innovation and Transformation","description":"MBA Student Perspectives","potentialAction":[{"@type":"SearchAction","target":{"@type":"EntryPoint","urlTemplate":"https:\/\/d3.harvard.edu\/platform-digit\/?s={search_term_string}"},"query-input":{"@type":"PropertyValueSpecification","valueRequired":true,"valueName":"search_term_string"}}],"inLanguage":"en-US"}]}},"_links":{"self":[{"href":"https:\/\/d3.harvard.edu\/platform-digit\/wp-json\/wp\/v2\/hck-submission\/2736","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/d3.harvard.edu\/platform-digit\/wp-json\/wp\/v2\/hck-submission"}],"about":[{"href":"https:\/\/d3.harvard.edu\/platform-digit\/wp-json\/wp\/v2\/types\/hck-submission"}],"author":[{"embeddable":true,"href":"https:\/\/d3.harvard.edu\/platform-digit\/wp-json\/wp\/v2\/users\/68"}],"replies":[{"embeddable":true,"href":"https:\/\/d3.harvard.edu\/platform-digit\/wp-json\/wp\/v2\/comments?post=2736"}],"version-history":[{"count":0,"href":"https:\/\/d3.harvard.edu\/platform-digit\/wp-json\/wp\/v2\/hck-submission\/2736\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/d3.harvard.edu\/platform-digit\/wp-json\/wp\/v2\/media\/2737"}],"wp:attachment":[{"href":"https:\/\/d3.harvard.edu\/platform-digit\/wp-json\/wp\/v2\/media?parent=2736"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/d3.harvard.edu\/platform-digit\/wp-json\/wp\/v2\/categories?post=2736"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}