How to nurture Data Scientists

by Ben Lorica (last updated July/2010)

Among technology startups, data scientist is an increasingly common term used to refer to data geeks able to bridge traditionally separate functional areas of data intelligence. A data scientist is someone who's comfortable performing several (if not all) aspects of data intelligence projects:

1. data acquisition: this might entail writing custom parsers and web crawlers, or scripts that target specific web services or APIs for non-traditional data sources.

2. data management: ETL, plus manipulating, querying, and maintaining data in databases, key-value stores, or Hadoop.

3. information visualization: uncovering patterns through the use of static visualization toolkits and/or interactive platforms based on Flash, JavaScript, or Processing.

4. analytics: this can range from simple to complex techniques in multivariate statistics, machine-learning, and NLP.

5. insight: extract, summarize, and present key findings to a broad audience.

There are many tools, skills, and technical details, and one can spend years mastering each of the items listed above. While a data scientist may not possess true expert knowledge in any of the areas, he/she is comfortable skipping back and forth and performing basic tasks in all of them. The result is a data geek nimble enough to quickly investigate a data project and produce answers to (high-level) questions from management.

To nurture data scientists, companies need to focus more on culture and organizational structure. Many data workers have enough skills and training to quickly become productive in multiple areas of data intelligence. The problem is that most don't work in environments that encourage them to become data scientists. They're stuck in silos and limited to one or two areas of data intelligence. Often, they're restricted to using tools "approved" by their managers.

After working in companies both large and small, it's clear to me that the strict separation of tasks is the major obstacle faced by data scientists. The most common manifestation is the separation between data analysis and data management. In many large companies, most analysts/statisticians have to wait for data from a designated data warehousing team, and in a lot of cases they wait for data from multiple owners of different data warehouses.

For the moment, data scientists thrive in smaller startups, internet companies, and other organizations where there is less emphasis on defined roles and tasks. But there really is no reason why large and mature organizations can't join the fun. (There's no reason why your statisticians can't learn how to write simple web scrapers and why your database people can't learn simple statistics and visualization.) Here are a few suggestions on how to make it happen:

Embrace non-traditional data sources

One way to get people to think beyond their traditional roles is to use data sources outside those controlled by existing data warehouse groups. Many companies limit data intelligence to data from ERP systems or data vendors (or a variety of "log" files). The web is awash with data, much of which might be useful for your business analysis if you had a team of data scientists.

Start with a small team

Once you commit to forming a team of data scientists, you can start by identifying current employees who might fit the profile. They have to be open-minded, team-oriented, and have some programming skills in one of the areas described above. Ideally you would have a mix of people from computer science, statistics/quantitative, or data-oriented backgrounds. Team members need to be willing to share simple tools, hacks, and techniques with one another. Cross-fertilization will happen naturally if team members get excited about learning from each other. Employees who are reluctant to share techniques, tools, and ideas would hinder progress.

Allow the use of new tools and techniques

Many I.T. departments are quite strict on what employees can install and use. Many of the favorite tools used by data scientists are free and/or open source, and might be unfamiliar to the I.T. department. (Many come from very recent work done by academics.) New data sources may also require the use of web crawlers and services that may not be to the liking of those who maintain your existing firewalls and filters. Vendors will start offering tools that cover multiple areas of data intelligence, thus reducing context switching and enabling flow. But for the moment, data scientists use a variety of tools, and in each of the areas described above the choices range from simple to advanced. Simple tools are a great way to introduce basic skills that can form the basis for more advanced learning.

Start with simple projects, and experiment

Rapid iteration and experimentation are important as you're starting out. Pose simple and concrete hypotheses. Start slowly, perhaps leveraging simple tools, web services, and free data sources. Instead of crawling large web sites or taking on complex text parsing and NLP tasks, take advantage of semi-structured data available through web services and APIs, while slowly expanding your set of non-traditional data sources. Rather than jumping into Hadoop or a NoSQL database, it might be wise to go with more familiar SQL databases: e.g., Greenplum has a free, single-node version of its MPP SQL database. Static visualization toolkits like those in R, and free interactive visualization tools from Google Docs (or the Google Viz API), offer a variety of infoviz choices.
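
To make the "start simple" advice concrete, here is a minimal sketch in Python of the kind of first project described above: take semi-structured (JSON) records of the sort a web service might return, load them into a familiar SQL database, and answer one simple question with a query. Everything specific here is an assumption for illustration: the sample payload, the table name "listings", and the helper functions are hypothetical, and SQLite (from the Python standard library) simply stands in for whichever SQL database you already know.

```python
import json
import sqlite3

# Hypothetical payload standing in for a web-service response.
# In a real project this string would come from an API call.
SAMPLE_RESPONSE = """
[
  {"city": "Austin", "category": "cafe", "rating": 4.5},
  {"city": "Austin", "category": "cafe", "rating": 3.5},
  {"city": "Boston", "category": "cafe", "rating": 4.0}
]
"""


def load_records(conn, payload):
    """Parse a JSON array of records and insert them into a SQL table."""
    records = json.loads(payload)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS listings "
        "(city TEXT, category TEXT, rating REAL)"
    )
    # Named placeholders map directly onto the keys of each JSON record.
    conn.executemany(
        "INSERT INTO listings (city, category, rating) "
        "VALUES (:city, :category, :rating)",
        records,
    )
    conn.commit()
    return len(records)


def average_rating_by_city(conn):
    """Answer one simple, concrete question with plain SQL."""
    rows = conn.execute(
        "SELECT city, AVG(rating) FROM listings GROUP BY city ORDER BY city"
    ).fetchall()
    return dict(rows)


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")  # throwaway in-memory database
    load_records(conn, SAMPLE_RESPONSE)
    print(average_rating_by_city(conn))
```

The point of the sketch is the shape of the workflow rather than the particular tools: a small acquisition step, a familiar SQL store, and a single query that tests one hypothesis. Each piece can later be swapped for something heavier once the team has outgrown it.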

Shield your data scientists from (middle) managers

Once managers get a whiff that there's a team playing with new data sources, they might try to put obstacles up ("What about data integrity? They're not using the proper machine-learning/statistical techniques! The experimental design is wrong! How can they combine that with our data?"). Without political support, your team of data scientists is going to encounter (non) friendly fire. New things tend to be perceived as threats, so it's best to reassure managers quickly that the data scientists complement what they do. Insights uncovered by your small team of data scientists can be used to inform more formal data/analytic projects. Data scientists aren't going to eliminate the need for statisticians, but they may point them towards different data sets and questions.

Use your initial team of data scientists as evangelists

If you picked your initial team of data scientists correctly, they should be comfortable presenting their findings to others in your company. Better yet, they would be enthusiastic about it! Use them to influence how the rest of the company views data intelligence and to slowly knock down those silos.

I'm not saying that new training and enterprise tools won't eventually be needed as you form your in-house team of data scientists. But I think that by addressing cultural and organizational structures, many companies can use their own employees, along with free tools, to seed a small team of data scientists. I speak from experience having worked for large companies -- the talent is there and the techniques aren't that hard to learn, but the organizational silos are hard to overcome. The ranks of these companies already include a pool of talent ready to shine, if not for the rigid corporate structures that limit what they can do.

[Originally posted in The Practical Quant]

NOTE: Reproduction & reuse allowed under Creative Commons Attribution.