Documenting Your AI Systems — What the EU AI Act Requires and How to Start

guide 19 min read Updated 2026-03-23

Technical documentation is the most intimidating compliance requirement for high-risk AI systems. It sounds like something only AI researchers do. In reality, it’s a structured record of what your AI system is, what it does, what data it uses, and what could go wrong. You don’t need to be a data scientist to create it. You need clarity about the system and discipline in documenting decisions.

This guide walks through what the EU AI Act requires, what documentation actually looks like, and how to start when you’re beginning from scratch.

What the EU AI Act Requires

For high-risk AI systems, you must create and maintain technical documentation that includes:

  1. General description of the AI system
  2. Intended purpose and geographical scope
  3. Technical specifications (architecture, inputs, outputs)
  4. Training data description and governance
  5. Testing and validation results
  6. Accuracy metrics and known limitations
  7. Human oversight mechanisms
  8. Instructions for users and operators

This documentation must be thorough enough that a regulator could review it and understand how your system works, whether it complies with requirements, and what risks it presents. It’s not about perfect polish. It’s about clarity and completeness.

Part 1: General Description

What to document: A clear, non-technical summary of what your AI system does.

Why it matters: A regulator should be able to understand your system from the description alone, before reading technical details.

What to include:

  • System name and version: Include the specific version deployed, since documentation is version-specific. If you update the system, update the documentation.
  • Date deployed: When did the system go live?
  • Provider/vendor: Who built it? Is it commercial (e.g., “Workable” ATS with AI screening) or custom-built?
  • Function in one sentence: “Automatically ranks job applicants based on résumé analysis and keyword matching”
  • End-to-end process: Who uses it? What’s the workflow? Example:
    • “Hiring manager uploads job posting. System automatically reviews submitted CVs. System ranks candidates on a 0-100 scale. Hiring manager reviews the ranked list and decides which candidates to interview.”

Example for an ATS:

System: Workable ATS v7.2 with built-in CV Screening
Version: 7.2.1 (deployed January 2026)
Provider: Workable Inc.
Function: Automatically analyse job applications and rank candidates

Process:
1. Job requisition created in Workable
2. Candidates submit applications (CV + cover letter)
3. AI screening runs on all submissions within 4 hours
4. System produces ranked shortlist (top 20 candidates by match score)
5. Hiring manager reviews shortlist
6. Hiring manager selects candidates to interview

The system does not make final hiring decisions. A human hiring manager reviews the AI's ranked list and decides who to invite to interview.

Part 2: Intended Purpose and Geographical Scope

What to document: What the system is designed to do, for whom, and where it operates.

Why it matters: This establishes the system’s legal context. If a system is used for hiring in the EU, EU AI Act requirements apply. If it’s used internally in the UK only, UK GDPR and domestic rules apply. Being clear about scope is the foundation of everything else.

What to include:

  • Primary intended purpose: What problem does it solve? What decision does it support? Example:

    • “Support recruitment teams in identifying qualified candidates quickly”
    • “Assess creditworthiness of loan applicants”
    • “Predict which customer support tickets are urgent”
  • Secondary purposes (if any): Does it have uses beyond the primary purpose? Should it? If your hiring system is also being used for internal performance evaluation, document that (it changes the risk profile).

  • Geographical scope: Where is the system deployed? Which individuals does it affect?

    • “Affects all job applicants to our UK offices” — UK/EU GDPR applies
    • “Affects EU customers and UK employees” — both EU and UK rules apply
    • “Deployed in the EU” — Full EU AI Act applies
  • Target users: Who operates the system? Who is affected?

    • “Hiring managers use the system to review applications. Job applicants are affected by AI screening.”

Example:

Intended Purpose:
This system supports recruitment teams in reviewing applications quickly and fairly. It is designed to identify candidates who meet core job requirements (education, experience, keywords), so hiring managers can focus their time on evaluating cultural fit and soft skills through interviews.

The system is intended to assist human decision-making, not to make final hiring decisions. Humans retain authority to override the AI's ranking or to interview candidates ranked low by the AI.

Geographical Scope:
- Deployed in: UK and EU
- Affects: Job applicants in the UK and EU (approx. 500-1,000 per year)
- Operates on: CVs, cover letters, and application data provided by applicants

Users:
- Hiring managers (internal, ~5 users)
- Applicants (external; they see the AI's ranking indirectly if shortlisted)

Part 3: Technical Specifications

What to document: The technical architecture and design of the system.

Why it matters: This allows reviewers to understand what the system actually does, what data it processes, and what its limits are.

What to include (at a level appropriate to your system):

  • Model type: What kind of AI? Is it machine learning, rule-based, deep learning? If you’re using a vendor’s system, what does the vendor say about the architecture?

    • Example: “Commercial SaaS applicant tracking system with proprietary machine learning screening module”
  • Inputs: What data does the system take in? Format, types, examples?

    • “Uploads: PDF CV, plain text cover letter, structured application form responses”
    • “Data processed: Education, work experience, skills, keywords, years of experience”
  • Outputs: What does the system produce? What is the user given?

    • “Output: Candidate ranking (0-100 score for each applicant), shortlist (top 20 candidates)”
    • “Displayed to: Hiring manager in Workable dashboard”
  • Processing: How does the system turn inputs into outputs? (You don’t need deep technical detail, but be clear about the logic.)

    • “The system compares applicant skills and experience against job requirements using keyword matching and pattern recognition. It produces a numerical score for each applicant.”
    • If you don’t understand the vendor’s system well, that’s itself important to document: “We use Vendor X’s proprietary AI scoring mechanism. Vendor documentation states it uses machine learning trained on successful hiring outcomes.”
  • Key parameters and thresholds: If the system has configurable settings, what are they?

    • “Hiring manager can set minimum score threshold for shortlist inclusion”
    • “System currently configured to include top 20 candidates regardless of score”

Example:

Technical Specifications:

Model Type: Commercial SaaS machine learning system (Workable proprietary CV screening)

Inputs:
- File uploads: PDF CV, plain text cover letter
- Structured data: Education level, years of experience, job title field
- Format: Unstructured (CV text) and structured (form fields)
- Data processed: ~2-5 MB per application

Outputs:
- Ranking score: 0-100 scale for each applicant
- Shortlist: Ordered list of top 20 candidates by score
- Format: CSV export available; viewable in Workable dashboard

Processing:
Workable's CV screening module uses machine learning to evaluate how closely applicant experience and skills match the job requirements. The system:
1. Parses CV/cover letter text
2. Extracts key attributes (education, experience, skills)
3. Compares extracted attributes against job requirements
4. Produces a match score (0-100)
5. Ranks all applicants by score

Thresholds and Configuration:
- Default: Shortlist shows top 20 candidates
- Configurable: Hiring manager can adjust score threshold (e.g., show only candidates scoring >60)
- Currently used: Default configuration (top 20 without score threshold)
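Workable's actual model is proprietary, but the keyword-matching logic described above can be illustrated with a toy sketch. Everything here is hypothetical (the function names, the scoring formula, the skill lists) and is not the vendor's implementation; it only shows the shape of "parse, match, score, rank":

```python
# Toy illustration of keyword-match scoring and ranking.
# Hypothetical logic for explanation only -- not Workable's algorithm.

def match_score(cv_text: str, required_skills: list[str]) -> int:
    """Score a CV 0-100 by the fraction of required skills it mentions."""
    cv_lower = cv_text.lower()
    hits = sum(1 for skill in required_skills if skill.lower() in cv_lower)
    return round(100 * hits / len(required_skills))

def shortlist(applications: dict[str, str], required_skills: list[str], top_n: int = 20):
    """Rank applicants by match score and return the top N, highest first."""
    scored = {name: match_score(cv, required_skills) for name, cv in applications.items()}
    return sorted(scored.items(), key=lambda item: item[1], reverse=True)[:top_n]
```

Even this toy version makes the documented limitations visible: a qualified candidate who describes the same skill in different words scores zero for it, which is exactly the keyword dependence flagged in Part 6.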

Part 4: Training Data Description and Governance

What to document: What data was used to build or train the system, and how decisions about that data were made.

Why it matters: High-risk systems are required to avoid bias in training data. Documenting what the system was trained on and what curation decisions were made is essential for compliance and for identifying bias risks.

What to include:

  • Training data source: Where did the data come from?

    • “Commercial AI vendor’s proprietary training dataset (Workable does not disclose details)”
    • “Our company’s historical hiring data (2020-2025): 3,500 applicants, 280 hired”
    • “Combination of public datasets and proprietary data”
  • Data description: How much data? What does it represent?

    • “3,500 applicants across 50 job roles (engineers, designers, marketers, operations). Represents hiring from UK, EU, and US.”
    • “Training data weighted toward successful hires (people who were hired and stayed >1 year)”
  • Data curation decisions: What did you do to the data before using it?

    • “Removed applications from candidates rejected for conduct violations”
    • “Balanced dataset by job category to avoid overweighting engineering roles”
    • “Removed identifying information (names, addresses) to reduce bias risk”
  • Bias testing: What did you check for? What risks did you identify?

    • “Tested for gender bias: Training data showed 60% male representation. System was tested for gender-skewed scoring.”
    • “Tested for age discrimination: Checked whether the system penalizes older candidates (no direct penalty found, but the system downweights early education dates, which could bias against older candidates)”
    • “Test results: No major bias found; minor recommendations implemented (see §7)”
  • Limitations acknowledged:

    • “Training data is UK/EU-focused. Performance may differ in other geographies.”
    • “Training data includes only technical roles applied 2020-2025. System’s accuracy for new role types is unknown.”

Example:

Training Data Description and Governance:

Data Source:
Workable's proprietary training dataset. Workable does not publicly disclose the composition or size of their training data. Based on Workable's documentation, the system was trained on successful hiring outcomes across thousands of companies using the platform.

Our Company's Deployment Data (2020-2025):
- Total applications: 3,500
- Hires: 280
- Roles: 50 different job titles
- Geography: UK (60%), EU (30%), US (10%)
- Tenure of hires: Average 2.3 years employed

Data Curation Decisions:
1. Removed 40 applications from candidates rejected for conduct/integrity violations (not included in training because they're not predictive of job performance)
2. Balanced dataset across job categories to avoid over-representing engineering roles (which had higher applicant volume)
3. Removed personally identifiable information from training data (names, addresses, photos) to reduce bias risk

Bias Testing Conducted:
- Gender representation in training data: 60% male, 40% female. Tested system output for gender-skewed scoring.
- Age discrimination testing: Analyzed whether system penalizes candidates with older graduation dates. Found minor bias: system downweights education dates, which can disadvantage older candidates. Mitigation: documented in limitations and discussed with hiring team.
- Ethnicity: Could not test directly (ethnicity data is not collected, and inferring it from CVs would itself introduce bias). Risk flagged and monitoring plan established.

Bias Test Results:
- No major bias identified
- Recommendation: Document with hiring managers that system may have minor age-related bias; continue monitoring

Limitations Acknowledged:
- Training data from 2020-2025 only; system's accuracy on older hiring patterns unknown
- System trained primarily on technical roles; accuracy for non-technical roles untested
- UK/EU-focused data; performance in other geographies unknown

Part 5: Testing and Validation Results

What to document: How did you test whether the system works as intended?

Why it matters: High-risk systems must be validated before deployment and monitored after. This section shows that you tested the system and confirmed it does what you claim.

What to include:

  • Pre-deployment testing: What validation did you do before going live?

    • “Tested on 500 historical applications to confirm system produces reasonable rankings”
    • “Compared system rankings to human hiring manager rankings; correlation: 0.78 (reasonable but not perfect)”
    • “Tested on applications in different languages; system performed inconsistently on non-English applications”
  • Test results: What did you find?

    • “System accuracy: 82% of the system’s top 20 candidates were rated ‘good candidate’ by hiring managers in blind evaluation”
    • “False positives: 18% of top-20 ranked candidates were rated ‘not qualified’ by hiring managers”
    • “False negatives: System ranked some qualified candidates low; ~5% of eventually hired candidates were ranked below the shortlist threshold by the system”
  • Performance benchmarks: How accurate is the system?

    • “Precision (how many recommended candidates were qualified): 82%”
    • “Recall (how many qualified candidates did the system identify): 78%”
    • “Consistency: System ranking varies by <5% on duplicate applications”

Example:

Testing and Validation:

Pre-Deployment Testing:
- Test set: 500 historical applications (2024)
- Blind evaluation: Human hiring managers rated system's top 20 recommendations as "qualified," "maybe," or "not qualified" without knowing system's ranking
- Comparison: System's rankings vs. human hiring managers' evaluations

Test Results:
- Accuracy (top 20 candidates rated qualified): 82%
- False positives (recommended but not qualified): 18%
- False negatives (not recommended but later hired): ~5% of hired candidates ranked outside top 20

Performance by Candidate Type:
- Native English speakers: 85% accuracy
- Non-native English speakers: 71% accuracy (system performs worse on CVs with non-standard English)
- Career changers: 71% accuracy (system has lower accuracy for candidates with non-linear career progression)
- Older candidates (40+): 76% accuracy (lower than younger candidates)
- Younger candidates (<30): 84% accuracy

Consistency Testing:
- Submitted same application twice; system ranking varied <2%
- System is consistent and reproducible

Limitations Found:
- Language bias: System performs worse on non-English CVs
- Career bias: System penalizes non-linear career progression
- Age bias: Moderately lower accuracy for older candidates
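Metrics like the ones above are simple counts over a labelled test set. A minimal sketch of the precision and recall calculation (set names are illustrative; in practice these would be candidate IDs from your blind evaluation):

```python
def precision_recall(recommended: set[str], qualified: set[str]) -> tuple[float, float]:
    """Precision: share of recommended candidates who were qualified.
    Recall: share of qualified candidates the system recommended."""
    true_positives = len(recommended & qualified)
    precision = true_positives / len(recommended)
    recall = true_positives / len(qualified)
    return precision, recall
```

Running this per subgroup (native vs. non-native English speakers, age bands, career paths) is what produces the "Performance by Candidate Type" breakdown documented here.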

Part 6: Accuracy Metrics and Known Limitations

What to document: How accurate is the system? What doesn’t it do well?

Why it matters: This is the heart of transparency. Regulators and affected individuals need to know: is this system reliable? When might it fail?

What to include:

  • Overall accuracy: How often is the system right?

    • “Identifies qualified candidates with 82% accuracy”
    • “Correctly ranks candidates into skill tiers 78% of the time”
  • Accuracy by subgroup: Does the system work equally well for all people?

    • “Accuracy for women: 81% / Accuracy for men: 83%”
    • “Accuracy for candidates over 40: 76% / Accuracy for candidates under 40: 84%”
    • “Accuracy for English speakers: 85% / Accuracy for non-native English: 71%”
  • Known limitations: What is this system NOT good at?

    • “Does not evaluate soft skills, cultural fit, or communication ability”
    • “Relies on CV keywords; misses qualified candidates with non-traditional backgrounds”
    • “Cannot evaluate portfolio work or samples; only evaluates documented experience”
    • “Language-dependent; less accurate for non-English CVs”
  • Failure modes: When is the system most likely to be wrong?

    • “Career changers with no direct industry experience: system ranks them lower than their actual capabilities”
    • “Candidates from non-traditional educational backgrounds: system may underweight relevant experience”
    • “Specialized roles with few direct precedents: system has less training data and lower confidence”

Example:

Accuracy Metrics and Known Limitations:

Overall Accuracy:
- Correctly identifies "qualified candidates" (later evaluated as hireable): 82%
- Correctly ranks candidates into skill tiers: 78%
- False positive rate: 18% of recommendations are later evaluated as not qualified
- False negative rate: 5% of hired candidates ranked outside top 20

Accuracy by Subgroup:
- Gender: Accuracy for women 81%, men 83% (minor difference; acceptable)
- Age: Accuracy for 40+ 76%, under 40 84% (moderate difference; documented as limitation)
- Language: Accuracy for native English 85%, non-native English 71% (significant difference; flagged for monitoring)
- Career path: Linear careers 84%, non-linear/career changes 71% (documented as limitation)

Known Limitations:
1. Cannot evaluate soft skills. CV-based screening cannot assess communication, leadership, cultural fit, or personality traits. Human interviews are essential for these factors.
2. Keyword-dependent. System matches job requirements to CV content; if candidate used different terminology, system may rank them lower despite qualifications.
3. Non-traditional backgrounds. Candidates from bootcamps, internships, or non-traditional paths may be ranked lower if their background doesn't match standard keywords.
4. Language bias. System performs worse on non-English CVs or CVs with non-standard English.
5. No portfolio evaluation. System cannot assess portfolio work, GitHub, samples, or other evidence of capability not documented in CV.
6. Career progression assumptions. System assumes linear career progression; penalizes gaps, career changes, or unconventional paths.

Failure Modes:
- Career changers: System ranks 20-30% lower than their actual capability
- Older candidates: Consistent 6-8% accuracy reduction
- Non-English speakers: 14% accuracy reduction
- Specialized roles: System has less training data for niche specialties; confidence intervals wider

Mitigation Strategy:
All candidates ranked below shortlist threshold by system are subject to human review if hiring manager has reason to believe they may be qualified. This process catches ~70% of false negatives.

Part 7: Human Oversight Mechanisms

What to document: How do humans stay in control? What are the oversight procedures?

Why it matters: The EU AI Act requires human oversight for high-risk systems. You must demonstrate that a person is actually reviewing outputs and has authority to intervene.

What to include:

  • Responsible person: Who oversees the system?

    • “Title: Head of Recruitment”
    • “Name: [Optional, can be role-based]”
    • “Authority: Can override system, pause system, exclude candidates”
    • “Reporting line: Reports to CEO”
  • Oversight frequency: How often is the system reviewed?

    • “Daily: System output reviewed before shortlist is sent to hiring managers”
    • “Weekly: System performance metrics reviewed; accuracy trends monitored”
    • “Monthly: Full audit of system outputs; analysis of any outliers or unexpected rankings”
    • “Quarterly: Bias testing and accuracy assessment”
  • Oversight procedures: What specifically does the person do?

    • “Daily: Spot-check 5-10 applications to verify system ranking makes sense”
    • “If system score >80, candidate automatically shortlisted (no additional review needed)”
    • “If system score 40-80, candidate reviewed by human for context; system may be overridden”
    • “If system score <40, candidate rejected unless hiring manager explicitly requests human review”
  • Escalation and override: When can the system be overridden?

    • “Hiring managers can request human review of any candidate ranked below shortlist”
    • “Head of Recruitment can exclude candidates from system consideration if they have legitimate reasons (referrals, internal candidates, etc.)”
    • “System can be paused or disabled if accuracy metrics fall below threshold (accuracy below 75% triggers review)”
  • Documentation: What do you record?

    • “Maintain log of overrides: which candidates were overridden, why, and outcome”
    • “Track accuracy metrics: % of system recommendations that result in hire, % of rejected candidates later identified as qualified”
    • “Document any system failures, unusual patterns, or discrimination concerns”

Example:

Human Oversight Mechanisms:

Responsible Person:
- Title: Head of Recruitment
- Authority: Can override system, pause system, exclude candidates, modify scoring thresholds
- Reporting: To CEO
- Backup: Operations Manager covers during absence

Oversight Frequency:
- Daily (when applications arrive): System output reviewed before shortlist; spot-check 5 applications for reasonableness
- Weekly: Metrics review; check system accuracy trend, false positive/negative rate, comparison to hiring outcomes
- Monthly: Full audit of system outputs; analysis of any outliers, demographic patterns, or unexpected rankings
- Quarterly: Bias testing; accuracy by subgroup; review of override logs for patterns

Oversight Procedures:

Tier 1 (System score >80): Automatically shortlisted
- Action: No additional review needed; candidate enters interview process
- Logic: High confidence in system score; human oversight focuses on overall quality assurance rather than case-by-case

Tier 2 (System score 40-80): Human review required
- Action: Head of Recruitment reviews CV and system score; makes judgment call
- Decision options: Shortlist (override system), reject, or request additional context
- Frequency: ~30% of applications fall in this tier
- Documentation: Record override decisions and reasoning

Tier 3 (System score <40): Rejected unless escalated
- Default: Rejected
- Exception: Hiring manager can request human review if they have specific reason to believe candidate may be qualified
- Re-review: If requested, application reviewed outside of system; decision made by human judgment
- Frequency: ~5% of rejected candidates requested for re-review
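The three tiers above amount to a simple routing rule. A sketch of that logic (the function name is hypothetical; the 80 and 40 thresholds are taken from the example configuration):

```python
def route_application(score: int, escalated: bool = False) -> str:
    """Route a candidate by system score per the tiered oversight procedure.
    Thresholds (80, 40) are from the example configuration above."""
    if score > 80:
        return "auto-shortlist"  # Tier 1: quality assurance only, no case-by-case review
    if score >= 40:
        return "human-review"    # Tier 2: Head of Recruitment makes the call
    # Tier 3: rejected by default, human review only if a hiring manager escalates
    return "human-review" if escalated else "reject"
```

Writing the rule down this explicitly is useful documentation in itself: it makes clear exactly which candidates never receive case-by-case human attention, which is what a regulator will probe when assessing oversight.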

System Intervention:
- If system accuracy falls below 75% (2+ consecutive weeks), system is paused and reviewed
- If system bias detected (>10% accuracy variance across demographic groups), system is paused pending bias mitigation
- If system produces discriminatory outcome, candidate affected is re-evaluated by human without system input

Documentation Maintained:
- Override log: Date, candidate, system score, human decision, reasoning
- Accuracy metrics: Weekly tracking of precision, recall, false positive/negative rates
- Demographic analysis: Monthly accuracy by gender, age, language, career type
- Incident log: Any system failures, unexpected outputs, discrimination concerns
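The pause triggers in the "System Intervention" section can be expressed as a weekly monitoring check. A sketch, assuming accuracy is tracked as weekly fractions and per demographic group (the 75% floor and 10-point variance limit are the example's thresholds; the function name is hypothetical):

```python
def pause_reasons(weekly_accuracy: list[float], group_accuracy: dict[str, float]) -> list[str]:
    """Return any pause triggers fired by current metrics (empty list = keep running)."""
    reasons = []
    # Trigger 1: accuracy below 75% for 2+ consecutive weeks
    if len(weekly_accuracy) >= 2 and all(week < 0.75 for week in weekly_accuracy[-2:]):
        reasons.append("accuracy below 75% for 2+ consecutive weeks")
    # Trigger 2: >10 percentage-point accuracy variance across demographic groups
    if max(group_accuracy.values()) - min(group_accuracy.values()) > 0.10:
        reasons.append("demographic accuracy variance above 10 points")
    return reasons
```

Whether you automate this or run it by hand in a spreadsheet matters less than recording, each week, that the check happened and what it found.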

Part 8: Instructions for Users and Operators

What to document: How should people actually use this system? What are the guardrails?

Why it matters: High-risk systems can be misused. Clear instructions help ensure the system is used for its intended purpose and within appropriate bounds.

What to include:

  • Intended users: Who should operate this system?

    • “Intended users: Hiring managers with recruitment training; recruitment team members”
    • “Not intended for: Automated decision-making without human review; candidate evaluation by non-recruitment staff”
  • Proper use:

    • “System is designed to narrow the applicant pool, not to make hiring decisions”
    • “System output should inform, not replace, human judgment”
    • “Hiring managers must interview candidates to assess soft skills, cultural fit, and other factors not captured by CV screening”
  • Improper use / prohibited uses:

    • “Do not make hiring decisions based solely on system score”
    • “Do not use system to evaluate internal employees for promotion or termination (system not designed or validated for this)”
    • “Do not disclose system rankings directly to applicants (it may create liability)”
  • Training requirements:

    • “All hiring managers receive training on how to interpret system scores”
    • “Training covers: accuracy limitations, bias awareness, proper use of override function”
    • “Training required before first use; refresher training annually”
  • Maintenance and monitoring:

    • “System performance is monitored weekly”
    • “If accuracy degrades, system is paused pending investigation”
    • “System is retrained/updated [frequency]”

Example:

Instructions for Users and Operators:

Intended Users:
- Primary: Recruitment team members and hiring managers with recruitment training
- Prerequisites: User has completed training on system use, limitations, and bias awareness
- Not intended: Non-recruitment staff, HR generalists without recruitment experience, automated systems

Proper Use:
1. System is a screening tool, not a decision tool. It narrows the applicant pool.
2. System output (ranking and score) should inform your judgment; it should not replace your judgment.
3. Use system as part of a multi-step process: Application review → System screening → Human judgment → Interview → Final hiring decision.
4. For candidates in the "human review tier" (score 40-80), apply your own judgment. The system is one input; consider CV, cover letter, background, and any special circumstances.
5. Interview all shortlisted candidates. Interviews are essential for assessing soft skills, culture fit, communication, and other factors not captured by CV screening.

Improper Use / Prohibited:
- Do not make hiring decisions based solely on system score without human review
- Do not use system for internal employee evaluation, promotion, or termination (system is not designed or validated for this use)
- Do not share system scores directly with candidates (may create legal liability; may demotivate qualified candidates)
- Do not attempt to manipulate system scores by adjusting job descriptions or requirements (system ranking should reflect actual job requirements)

Training and Certification:
- All users must complete 1-hour training module before first use
- Training covers: System capabilities, accuracy and limitations, bias awareness, proper use of override, interpreting scores
- Refresher training required annually
- Optional advanced training: Deep dive on bias testing, interpreting metrics

Maintenance and Monitoring:
- System performance monitored weekly by Head of Recruitment
- If accuracy falls below threshold (75%), system paused pending investigation
- System updated [quarterly / as needed] based on hiring outcomes
- Bias testing conducted quarterly; if >10% accuracy variance across demographics, investigation required

Getting Started

If you have a high-risk AI system and no documentation, here’s the practical path:

  1. Start with Parts 1-2 (Description and Purpose). These require no technical knowledge, just clarity about what the system does. This is 30 minutes of work.

  2. Move to Part 3 (Technical Specs). If you don’t understand the system well (it’s a vendor’s SaaS, proprietary algorithm), document that. Say “The system uses [vendor’s description of their technology]. We understand it this way: [your description].”

  3. Tackle Parts 4-5 (Training Data and Testing). If the vendor doesn’t disclose training data, document that you don’t have this information and acknowledge it as a gap. If you have historical performance data from using the system, use that for testing results.

  4. Parts 6-8 are the most important from a compliance perspective. Focus here on clarity and honesty about limitations and how oversight actually works.

You don’t need a perfect, final document. You need an honest, thorough record of what the system is, what it does, and what could go wrong. That’s the spirit of the requirement.

What’s Next

Once you have documentation, the next step is conformity assessment: have you actually met the EU AI Act’s requirements? The checklist walks through all high-risk requirements.

If you want guidance on all of this — documentation, conformity assessment, compliance gaps, and a roadmap to August 2026 — Bartram AI screens your systems and delivers a prioritised action plan.


Cross-References

For understanding what makes an AI system high-risk, see risk classification. For the full high-risk compliance checklist, see the checklist. For regulatory context, see EU AI Act explained.
