The organization should establish and implement data governance and management practices for the training, validation, and testing datasets used in high-risk AI systems. These practices must be tailored to the intended purpose of each AI system.
Particular attention should be paid to data-preparation processing operations, which include:
- Annotation and labelling: Defining clear guidelines and procedures to ensure accuracy, consistency, and quality of data annotations and labels.
- Cleaning: Implementing methods and tools to identify, correct, and mitigate errors, inconsistencies, or missing values within the datasets.
- Updating: Establishing processes for the timely and accurate integration of new data or modifications to existing data, ensuring the dataset remains current and relevant.
- Enrichment: Managing the process of adding valuable, relevant information to existing datasets from various sources, ensuring data integrity and quality.
- Aggregation: Defining appropriate methods for combining data from multiple sources, ensuring the aggregated data is accurate, consistent, and fit for purpose.
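As an illustration of how some of the operations listed above might be checked automatically, the following is a minimal sketch and not a prescribed method. The field names, the allowed label set, and the `check_records` helper are all hypothetical; a real system would derive them from its own annotation guidelines and intended purpose.

```python
from collections import Counter

# Hypothetical label set and required fields (assumptions for illustration).
ALLOWED_LABELS = {"approve", "reject"}
REQUIRED_FIELDS = ("id", "text", "label")


def check_records(records):
    """Return a list of (record_id, issue) pairs found in the dataset."""
    issues = []
    id_counts = Counter(r.get("id") for r in records)
    for r in records:
        rid = r.get("id")
        # Cleaning: flag missing or empty values.
        for field in REQUIRED_FIELDS:
            if r.get(field) in (None, ""):
                issues.append((rid, f"missing {field}"))
        # Annotation and labelling: flag labels outside the guideline.
        label = r.get("label")
        if label not in (None, "") and label not in ALLOWED_LABELS:
            issues.append((rid, f"unknown label {label!r}"))
        # Aggregation: flag ids that occur more than once after merging sources.
        if rid is not None and id_counts[rid] > 1:
            issues.append((rid, "duplicate id"))
    return issues


sample = [
    {"id": 1, "text": "ok", "label": "approve"},
    {"id": 2, "text": "", "label": "reject"},
    {"id": 3, "text": "x", "label": "maybe"},
    {"id": 3, "text": "y", "label": "reject"},
]
for issue in check_records(sample):
    print(issue)
```

Checks of this kind only surface candidate problems. Deciding whether to correct, remove, or keep a flagged record remains a documented, human-reviewed step.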
The organization should ensure that personnel involved in these data preparation activities are adequately trained and that the processes are documented and regularly reviewed for effectiveness and compliance.
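One possible way to support the documentation and review requirement is to record each data preparation step in a structured log. The sketch below assumes a hypothetical `PreparationLogEntry` schema and a hypothetical `log_step` helper; the fields shown are examples rather than a required format.

```python
import hashlib
import json
from dataclasses import dataclass, asdict
from datetime import datetime, timezone


@dataclass
class PreparationLogEntry:
    """One documented data preparation step (hypothetical schema)."""
    dataset: str
    operation: str       # e.g. "cleaning", "enrichment", "aggregation"
    performed_by: str
    description: str
    content_sha256: str  # fingerprint of the dataset after the step
    timestamp: str       # UTC time at which the entry was created


def fingerprint(records):
    """Hash a canonical JSON form of the records so changes are detectable."""
    canonical = json.dumps(records, sort_keys=True).encode("utf-8")
    return hashlib.sha256(canonical).hexdigest()


def log_step(dataset, operation, performed_by, description, records):
    """Build a log entry describing one operation applied to a dataset."""
    return PreparationLogEntry(
        dataset=dataset,
        operation=operation,
        performed_by=performed_by,
        description=description,
        content_sha256=fingerprint(records),
        timestamp=datetime.now(timezone.utc).isoformat(),
    )


entry = log_step(
    "training-v1", "cleaning", "analyst-a",
    "Removed rows with empty text", [{"id": 1, "text": "ok"}],
)
print(json.dumps(asdict(entry), indent=2))
```

Storing a content fingerprint with each entry lets a later review confirm which version of a dataset a given step produced, without keeping a full copy in the log.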