Sorting and securing at scale: machine learning at Dropbox
By Kariba Voy
Student
How Dropbox leverages machine learning in its quest to help users maintain focus.
As the steward of hundreds of billions of documents belonging to over 500 million users [1], Dropbox is starting to rely on machine learning more heavily than ever. Dropbox researchers have invested years of study into modern work and how people spend their time doing that work. Having found that people waste a significant amount of productive time on three specific activities (organization, contextualization, and prioritization) [2], Dropbox views machine learning as a critical tool to help users avoid productivity potholes and maintain their focus.
This objective is difficult to achieve because Dropbox needs to sift through and make sense of vast amounts of content while providing an experience tailored to each and every user. For example, while most web search engines take into account a user’s search habits (e.g. Google search history), Dropbox must go a step further and determine which documents each user is allowed to see [1]. Further complicating this task is the constantly changing nature of these documents. Because Dropbox is a hub for creative collaboration [3], its documents are continuously being updated by multiple collaborators [4]. This dynamic process means that when Dropbox indexes a file under certain search criteria, edits by various users may render those indexes irrelevant within a few seconds, and new indexes must be assigned.
Given these conditions, machine learning is going to become ever more important for Dropbox.
How machine learning helps address these issues:
Dropbox’s first notable feature that harnessed machine learning was its document scanner. “More than 20 billion image and PDF files have been stored in Dropbox, and of those, 10–20% are photos of documents” [5]. Images of documents pose an issue because the text within them cannot be searched. As far as the computer is concerned, this “text” is just a group of pixels, not text [6]. Machine learning offered a solution. Dropbox built an in-house Optical Character Recognition tool [7] that leveraged machine learning to recognize, extract, and index the text in these images so that users can search for it.
In the short term, Dropbox is continuing to find ways to use machine learning in the features it builds. For instance, it recently released a redesigned search engine called “Nautilus” [7], which uses machine learning to solve the search problems described previously. Further down the line, Dropbox appears committed to investing in machine learning expertise and to making it a foundational component of the company. Dropbox’s jobs page clearly illustrates this emphasis, with open listings for machine learning engineers, product managers, and PhDs, as well as college interns [8].

Recommendations:
A natural extension of Dropbox’s current use of machine learning is in securing user data. As the guardian of immense amounts of private user data, Dropbox is an attractive target for hackers. Machine learning, with its ability to rapidly process information at scale, could be expected to recognize and block suspicious entities attempting to access accounts more quickly and, accordingly, should be pursued in the immediate and short term. According to The Times of Israel, Dropbox is hoping to shift the focus of its Tel Aviv team to security and machine learning, and also potentially acquire a startup to further this endeavor [9]. In addition, I would advise Dropbox to investigate how machine learning could enhance individual document and data security processes, for example in connection with password and verification best practices.
Machine learning relies on large datasets and constant learning opportunities in order to evolve and improve. While Dropbox’s user base offers great scale, Dropbox’s approach to product releases may delay its ability to give its machine learning models many learning opportunities. While some companies in the technology industry have been known for moving fast and iterating once a product is live (for example, Facebook noted this strategy in its IPO filings [10]), Dropbox is known for holding every product to a very high bar, only releasing it when the majority of the kinks have been worked out. (This mindset is demonstrated in its values, such as “Sweat the details” and “Be worthy of trust” [11].) This approach of only releasing products when they’re highly evolved is in tension with machine learning’s need to be exposed to lots of use before it can evolve. One way I recommend bridging this gap would be to use simulations of historic user activity to test and iterate on new features that rely on machine learning.
However, changing the company’s philosophy on product launches is a major strategic decision and far from a sure success. While this careful approach to product releases has served Dropbox well so far, will the benefits of machine learning push it to release products earlier in the development process? Moreover, with machine learning becoming a focus for many technology companies, how can Dropbox expect to outcompete its rivals for machine learning talent?