Create Scoring Data
Everything you need to know about creating Scoring Data to train a Custom AI Scoring Algorithm.
The quality of the scoring dataset plays a key role in the success of retraining a scoring algorithm.
The whole recruitment process can be seen from two different points of view:
- Recruiter viewpoint: finding the best profiles for a job offer
- Applicant viewpoint: finding interesting jobs for a candidate
Use-case #1: Recruiter Viewpoint - Scoring Profiles for a Job
For this use case, we put ourselves in a recruiter's shoes: scoring and ranking candidate profiles for a given job offer.
Basic scoring algorithms such as "Blue & White Collars", available in the Hrflow.ai marketplace, come with 91% accuracy. Retraining this scoring algorithm on your custom dataset brings the following benefits:
- Significantly reduces the number of false-positive results returned by the scoring engine
- Increases the overall accuracy by several percentage points on your own data

Quality of the Training Data
The success of a Hrflow.ai scoring algorithm retraining strongly relies on the quality of the data.
On the one hand, particular attention must be paid to statistical biases in the dataset. These arise from the overrepresentation of some categories of data: for example, a dataset containing mostly IT profiles, senior profiles, or more men than women. Without precautions, a deep learning model naturally tends to leverage these biases to better fit the data.
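As a quick sanity check, a sketch along these lines can surface overrepresented categories before training. The "category" metadata key and the dataset path are assumptions; adapt them to whatever side information your resume metadata files actually store:

import json
from collections import Counter
from pathlib import Path

# Count resume categories to spot overrepresented groups before training.
# The "category" key is an assumed metadata field, not a required one.
def category_distribution(resumes_dir):
    counts = Counter()
    for meta_file in Path(resumes_dir).glob("*.json"):
        metadata = json.loads(meta_file.read_text())
        counts[metadata.get("category", "unknown")] += 1
    return counts

distribution = category_distribution("dataset/resumes")
total = sum(distribution.values()) or 1
for category, count in distribution.most_common():
    print(f"{category}: {count} ({count / total:.1%})")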
On the other hand, leveraging the hiring process in order to improve a scoring algorithm depends upon:
- Process steps: listing and carefully describing all the different steps of your hiring process
- Step links: providing all the possible transitions between steps
A diagram summarizing your hiring process is a good way to depict the links between its steps. For our use case, the steps flow as follows:
screening → interview or rejected_after_screening
interview → hired or rejected_after_interview
Prerequisites
In order to successfully retrain an unbiased scoring engine:
- Provide relevant side information about the applicants
- Provide relevant side information about the jobs
- Provide enough diversity of data in regard to all the side information available
- Provide detailed information about your hiring process
Format of the Training Data
Your training data should be stored in a folder that contains:
- resumes: a folder containing resumes in any supported format (PDF, DOCX, images, and more), along with metadata about each resume stored as JSON (e.g. the category of the resume, if it has been categorized)
- jobs: a folder containing job objects stored as JSON files
- step.json: a file linking each resume to a job offer and recording the hiring step reached
- process.json: a file describing the various steps of your hiring process and the links between each step
<dataset_directory>/
├── resumes/
| ├── 00.pdf
| ├── 00.json
| ├── 01.png
| ├── 01.json
| ├── ...
| ├── <resume_id>.<resume_extension>
├── jobs/
| ├── 00.json
| ├── 01.json
| ├── ...
| ├── <job_id>.json
├── process.json
└── step.json
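As a sanity check, a minimal sketch like the following can verify this layout before training. The "dataset" path is a placeholder:

from pathlib import Path

# Minimal layout check for the structure above: step.json and
# process.json at the root, a metadata JSON next to every resume file,
# and jobs stored as JSON.
def check_layout(root):
    base = Path(root)
    for name in ("resumes", "jobs"):
        assert (base / name).is_dir(), f"missing {name}/ folder"
    for name in ("step.json", "process.json"):
        assert (base / name).is_file(), f"missing {name}"
    resumes = [p for p in (base / "resumes").iterdir() if p.suffix != ".json"]
    for resume in resumes:
        assert resume.with_suffix(".json").is_file(), f"missing metadata for {resume.name}"
    jobs = list((base / "jobs").glob("*.json"))
    print(f"{len(resumes)} resumes and {len(jobs)} jobs found")

check_layout("dataset")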
Example step.json, linking each resume to a job and the step reached:
[
{"job_id": "00", "resume_id": "00", "step": "screening"},
{"job_id": "00", "resume_id": "01", "step": "screening"},
{"job_id": "01", "resume_id": "02", "step": "screening"},
{"job_id": "01", "resume_id": "03", "step": "interview"},
{"job_id": "03", "resume_id": "04", "step": "hired"},
{"job_id": "03", "resume_id": "05", "step": "rejected_after_screening"},
{"job_id": "03", "resume_id": "06", "step": "rejected_after_interview"}
]
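If your application history lives in an ATS export, a sketch like this can serialize it into the step.json layout above. The tuples are placeholder data:

import json

# Serialize (job_id, resume_id, step) tuples, e.g. from an ATS export,
# into the step.json layout shown above.
applications = [
    ("00", "00", "screening"),
    ("01", "03", "interview"),
    ("03", "04", "hired"),
]

records = [
    {"job_id": job_id, "resume_id": resume_id, "step": step}
    for job_id, resume_id, step in applications
]

with open("dataset/step.json", "w") as f:
    json.dump(records, f, indent=2)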
Example process.json, describing the steps and their transitions:
[
{"step": "screening", "description": "the applicant's resume is about to be reviewed by a recruiter", "next_step": ["interview", "rejected_after_screening"]},
{"step": "interview", "description": "the applicant's interview is about to be scheduled with a recruiter", "next_step": ["hired", "rejected_after_interview"]},
{"step": "hired", "description": "the applicant succesfully passed the whole recruitment process"},
{"step": "rejected_after_screening", "description": "the applicant has been rejected after a recruiter read his resume"},
{"step": "rejected_after_interview", "description": "the applicant failed the tests of the interview"}
]
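Before uploading, it is worth cross-checking the two files against each other. A minimal sketch, assuming the dataset/ paths used above:

import json

# Every step used in step.json must be declared in process.json,
# and every next_step must reference a declared step.
with open("dataset/step.json") as f:
    applications = json.load(f)
with open("dataset/process.json") as f:
    process = json.load(f)

declared = {entry["step"] for entry in process}

for entry in process:
    for nxt in entry.get("next_step", []):
        assert nxt in declared, f"unknown next_step: {nxt}"

for application in applications:
    assert application["step"] in declared, f"unknown step: {application['step']}"

print(f"{len(applications)} applications checked against {len(declared)} steps")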
Prerequisites
In addition, when retraining a scoring engine, the following minimums are highly recommended:
- 20k unique candidates
- ~500 unique jobs
- 1k applications with hired status
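A rough volume check against these minimums can be derived from step.json alone. This sketch assumes the dataset/ path used above and a "hired" step name matching your process.json:

import json

# Compare dataset volumes with the recommended minimums.
with open("dataset/step.json") as f:
    applications = json.load(f)

candidates = {a["resume_id"] for a in applications}
jobs = {a["job_id"] for a in applications}
hired = [a for a in applications if a["step"] == "hired"]

print(f"unique candidates: {len(candidates)} (recommended >= 20000)")
print(f"unique jobs: {len(jobs)} (recommended >= 500)")
print(f"hired applications: {len(hired)} (recommended >= 1000)")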
Use-case #2: Applicant Viewpoint - Scoring Jobs for a Profile
For this use case, we put ourselves in a job seeker's shoes: scoring and ranking job offers for a given profile.
Basic scoring algorithms like the "Aerospace Industry", available in the Hrflow.ai marketplace, come with 96% accuracy. Retraining this scoring algorithm on your custom dataset brings the following benefits:
- Significantly reduces the number of false-positive results returned by the scoring engine
- Increases the overall accuracy by several percentage points on your own data

Quality of the Training Data
The success of a Hrflow.ai scoring algorithm retraining strongly relies on the quality of the data.
On the one hand, particular attention must be paid to statistical biases in the dataset. These arise from the overrepresentation of some categories of data: for example, a dataset containing mostly IT profiles, senior profiles, or more men than women. Without precautions, a deep learning model naturally tends to leverage these biases to better fit the data.
On the other hand, leveraging the application process in order to improve a scoring algorithm depends upon:
- Process steps: listing and carefully describing all the different steps of the application process
- Step links: providing all the possible transitions between steps
A diagram summarizing your application process is a good way to depict the links between its steps. For our use case, the steps flow as follows:
searched → visualized → applied
Prerequisites
In order to successfully retrain an unbiased scoring engine:
- Provide relevant side information about the job seekers, if available
- Provide relevant side information about the jobs
- Provide enough diversity of data in regard to all the side information available
- Provide detailed information about your application process
Format of the Training Data
Your training data should be stored in a folder that contains:
- resumes: a folder containing resumes in any supported format (PDF, DOCX, images, and more), along with metadata about each resume stored as JSON (e.g. the category of the resume, if it has been categorized)
- jobs: a folder containing job objects stored as JSON files
- step.json: a file linking each resume to a job offer and recording the application step reached
- process.json: a file describing the various steps of your application process and the links between each step
<dataset_directory>/
├── resumes/
| ├── 00.pdf
| ├── 00.json
| ├── 01.png
| ├── 01.json
| ├── ...
| ├── <resume_id>.<resume_extension>
├── jobs/
| ├── 00.json
| ├── 01.json
| ├── ...
| ├── <job_id>.json
├── process.json
└── step.json
Example step.json, linking each resume to a job and the step reached:
[
{"job_id": "00", "resume_id": "00", "step": "searched"},
{"job_id": "00", "resume_id": "01", "step": "searched"},
{"job_id": "01", "resume_id": "02", "step": "applied"},
{"job_id": "01", "resume_id": "03", "step": "visualized"},
{"job_id": "03", "resume_id": "04", "step": "applied"},
{"job_id": "03", "resume_id": "05", "step": "visualized"},
{"job_id": "03", "resume_id": "06", "step": "applied"}
]
Example process.json, describing the steps and their transitions:
[
{"step": "searched", "description": "the job offer was returned in the job seeker's search results", "next_step": ["visualized"]},
{"step": "visualized", "description": "the job seeker viewed the job offer", "next_step": ["applied"]},
{"step": "applied", "description": "the job seeker applied to the job offer"}
]
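For this funnel, a quick step breakdown from step.json can confirm the data covers all three stages. A minimal sketch, assuming the same dataset/ layout as above:

import json
from collections import Counter

# Count how many records reached each stage of the
# searched -> visualized -> applied funnel.
with open("dataset/step.json") as f:
    applications = json.load(f)

counts = Counter(a["step"] for a in applications)
for step in ("searched", "visualized", "applied"):
    print(f"{step}: {counts.get(step, 0)}")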