Defining the Right Data Strategy
Executive Summary
Designing and implementing a good data strategy is essential. Not only does it improve the performance of AI-based systems by providing high-quality data, it also significantly saves time, cost and resources. However, the importance of a data strategy is very often underestimated, and defining one is perceived as a time-consuming and labor-intensive task.
In this white paper, we describe how to design a customized data strategy for your organization. We show that an initial, properly conducted, comprehensive analysis does not require excessive effort and provides the insights needed to define the best data strategy.
In section 1, “Background”, we present the foundation on which the data strategy is built by discussing the iterative nature of the AI-based system lifecycle. In section 2, “Let’s Start with the Basics”, we review the concept of data quality and show that it is multidimensional. We also discuss the importance of these two data aspects and their impact on the data collection and annotation process. We conclude that implementing a data strategy that includes a continuous data production process, proper quality measures and a clear specification of the quality objectives leads to the best AI model performance.
In section 3, “Defining the Best Data Strategy”, we describe the steps to design the data strategy and the points that need to be taken into account. In particular, we show how to identify the data project requirements and constraints, and provide a method to estimate the amount of training data that needs to be produced. We also weigh the pros and cons of outsourcing versus producing the data internally, and compare different approaches to data annotation as a function of the data security and privacy requirements. Finally, we discuss the requirements of the optimal data platform, so that data can be collected, selected, annotated, stored and managed in an efficient and intuitive way, and provide a list of data strategy tasks to facilitate the decision on the data strategy.
Index
1. Background
2. Let’s Start with the Basics
2.1. What is Training Data and Why is Data Quality so Important?
2.2. What is Data Quality?
2.3. Project Scope and Training Data Requirements
3. Defining the Best Data Strategy
3.1. Data Project Definition
3.1.1. Project Analysis
3.1.2. Project Limitations
3.2. How Much Training Data is Required?
3.2.1. Previous Research Results
3.2.2. Insights from Domain Expertise and the Performance Curve
3.3. Internalization vs Outsourcing
3.4. Data Annotation: Remote, in the Office or in High-Security Facilities?
3.5. The Data Platform
3.6. Decision on the Best Data Strategy
4. Conclusions
5. References