Large Scale Scoring Data

Everything you need to know about creating Scoring Data for large scale training

In this guide, we will focus on submitting large-scale datasets to the HrFlow.ai team. For context, refer to the relevant use case based on your viewpoint.


Table of Contents

  1. Profiles Format
    1. Parquet DataFrames
    2. HDF5 Files
  2. Jobs Format
    1. Parquet DataFrames
    2. HDF5 Files
  3. Trackings Format
  4. Agents Format
  5. Submit Data

The dataset format used by HrFlow.ai separates actual data from indexation tables.

  • Metadata Storage:

    • Profiles.parquet and Jobs.parquet store metadata for profiles and jobs.
    • Profiles_Tags.parquet and Jobs_Tags.parquet store profile and job tags.
  • Object Storage (HDF5 format):

    • job_objects.h5 contains job JSON objects in an array-like format.
    • profile_objects/0000000-0100000.h5 stores profile JSON objects for the range 0000000-0100000 in an array-like format.
  • Relations & Training:

    • Trackings.parquet describes profile-job interactions (e.g., profile i applied to job j).
    • Agents.parquet defines the training tree for algorithm training.

Here is an example of the resulting folder architecture:

.
β”œβ”€β”€ Agents.parquet
β”œβ”€β”€ job_objects.h5
β”œβ”€β”€ Jobs.parquet
β”œβ”€β”€ Jobs_Tags.parquet
β”œβ”€β”€ profile_objects
β”‚   β”œβ”€β”€ 0000000-0100000.h5
β”‚   β”œβ”€β”€ 0100001-0200000.h5
β”‚   β”œβ”€β”€ 0200001-0300000.h5
β”‚   β”œβ”€β”€ 0300001-0400000.h5
β”‚   └── 0400001-0500000.h5
β”œβ”€β”€ Profiles.parquet
β”œβ”€β”€ Profiles_Tags.parquet
└── Trackings.parquet

1. Profiles Format

Profiles are typically the largest and heaviest data in the set, so we adopted a partitioned storage approach. The partition name 0000000-0100000 indicates that it contains the JSON objects of elements with IDs from 0 to 100,000 (both included). This requires building a new index in which profile IDs start at 0 and increase by one (0, 1, 2, ...) up to the total number of profiles minus one.
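
For illustration, here is a small helper that maps a profile id to the partition name under this scheme (the function name is ours, not part of any HrFlow.ai tooling):

def partition_for_id(profile_id: int, p_size: int = 100_000, p_digits: int = 7) -> str:
    # The first partition is inclusive on both ends: ids 0 to p_size
    if profile_id <= p_size:
        start, end = 0, p_size
    else:
        k = (profile_id - 1) // p_size
        start, end = k * p_size + 1, (k + 1) * p_size
    return f"{start:0{p_digits}d}-{end:0{p_digits}d}"

assert partition_for_id(0) == "0000000-0100000"
assert partition_for_id(100_000) == "0000000-0100000"
assert partition_for_id(223_000) == "0200001-0300000"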

In addition to this storage, two Parquet dataframes are created to ensure easy and efficient access to profiles.

i. Parquet DataFrames:

Two Parquet dataframes are used for profiles: Profiles.parquet and Profiles_Tags.parquet.

πŸ“˜

Requirements

In addition to pandas, you will need to install either the fastparquet or the pyarrow library.

Indexation dataframe (Profiles.parquet)

The indexation dataframe enables fast and efficient profile searches and links profiles to jobs through trackings. The id column serves as the table's key and must be unique and incremental, starting from 0 (0, 1, 2, ...). The reference is equally important, allowing profiles to be traced back to the original client database if needed.

The rnd and split values separate training and testing profiles. A deterministic approach is recommended, such as generating a uniform random number between 0 and 1 using the profile's hash.

This dataframe should be a simple Pandas dataframe saved with the to_parquet method. It must include the following columns:

profiles_columns = [
    "id",                       # Id on this dataset, starts from 0 to len(profiles) - 1
    "sql_id",                   # Optional : Id on the original database
    "item_key",                 # Optional : Key inside the profile JSON object
    "provider_key",             # Optional : Key of the source containing the profile, else : None
    "reference",                # Reference of the profile, might be the same as the sql_id
    "partition",                # Partition where the profile is stored (see the HDF5 part)
    "created_at",               # Date of creation of the profile
    "location_lat",             # Optional : Latitude of the profile
    "location_lng",             # Optional : Longitude of the profile
    "experiences_duration",     # Optional : Duration of the experiences, included in HrFlow.ai profile objects
    "educations_duration",      # Optional : Duration of the educations, included in HrFlow.ai profile objects
    "gender",                   # Gender of the profile
    "synthetic_model",          # Optional : None or model key used to create synthetic sample (ex : hugging face model_key)
    "translation_model",        # Optional : None or model key used to translate the sample (ex : hugging face model_key)
    "tagging_rome_jobtitle",    # Optional : Rome job title, else : None
    "tagging_rome_category",	  # Optional : Rome category, else : None
    "tagging_rome_subfamily",   # Optional : Rome subfamily, else : None
    "tagging_rome_family",      # Optional : Rome family, else : None
    "text",                     # Optional : Text of the profile (raw text of the parsed profile)
    "text_language",            # Optional : Language of the text
    "split",                    # Optional : train, test (based on rnd, ex : rnd < 0.8 is train)
    "rnd",                      # Optional : random number (between 0 and 1) used to split the dataset
]
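
As an illustration, here is a minimal sketch of building and saving this dataframe with pandas. The profiles_metadata iterable, its fields, and the hash-based rnd are assumptions on our side, not a prescribed implementation; partition_for_id refers to the helper sketched earlier.

import hashlib

import pandas as pd

def deterministic_rnd(reference) -> float:
    # Stable pseudo-uniform value in [0, 1] derived from the profile reference
    digest = hashlib.sha256(str(reference).encode("utf-8")).hexdigest()
    return int(digest[:8], 16) / 0xFFFFFFFF

records = []
for i, profile in enumerate(profiles_metadata):      # profiles_metadata: your own iterable of dicts
    rnd = deterministic_rnd(profile["reference"])
    records.append({
        "id": i,
        "reference": profile["reference"],
        "partition": partition_for_id(i),            # see the helper above
        "created_at": profile["created_at"],
        "gender": profile.get("gender"),
        "split": "train" if rnd < 0.8 else "test",
        "rnd": rnd,
        # Remaining (optional) columns are left to NaN/None when unknown
    })

profiles_df = pd.DataFrame(records, columns=profiles_columns)
profiles_df.to_parquet("Profiles.parquet", index=False)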

Tags dataframe (Profiles_Tags.parquet)

Tags are values associated with a profile. They can describe the candidate, such as experience level or certifications, or their preferences, like remote work or salary expectations. Tags are flexibleβ€”a single candidate can have multiple tags.

Each tag consists of a name and a value. The name represents the tagging referential (e.g., experience_level, remote_work), while the value indicates its state (e.g., senior or mid_level for experience_level).

This dataframe should be a simple Pandas dataframe saved with the to_parquet method. It must include the following columns:

tags_columns = [
    "profile_id",               # Id of the profile, foreign key from the Profiles.parquet
    "name",                     # Name of the tag referential, ex : education_level, industry, can_work_in_EU, ...
    "value",                    # Value of the tag, ex : Master, IT, True, ...
]
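
A minimal construction sketch, assuming your own source data attaches a list of {"name", "value"} tags to each profile (profiles_metadata is the same hypothetical iterable as above):

import pandas as pd

tag_records = []
for i, profile in enumerate(profiles_metadata):
    for tag in profile.get("tags", []):              # e.g. {"name": "remote_work", "value": "full"}
        tag_records.append({
            "profile_id": i,                         # must match the id column in Profiles.parquet
            "name": tag["name"],
            "value": str(tag["value"]),              # keep values as strings for a homogeneous column
        })

pd.DataFrame(tag_records, columns=tags_columns).to_parquet("Profiles_Tags.parquet", index=False)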

ii. HDF5 Files:

This step requires HrFlow.ai profile parsing. There are two cases:

Profiles Not Parsed by HrFlow.ai:

There is no need to provide HDF5 files. Instead, submit a ZIP archive containing the raw CVs. If needed, the archive can be split into multiple chunks.

🚧

It is mandatory to name the raw CVs with either the profile id or the reference used in the Parquet tables.

❗️

Security

We prioritize security and do not entrust data to any third party. Therefore, please ensure that the ZIP archive is encrypted with a robust password.
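
One possible way to produce such an archive is the third-party pyzipper library, which supports AES encryption; this is only a sketch under that assumption, and any tool producing a strongly encrypted archive is fine.

import os

import pyzipper   # third-party: pip install pyzipper

cv_folder = "cvs"                  # CVs named <id>.<ext> or <reference>.<ext>
archive_path = "cvs.zip"
password = b"use-a-long-random-password"

with pyzipper.AESZipFile(archive_path, "w",
                         compression=pyzipper.ZIP_DEFLATED,
                         encryption=pyzipper.WZ_AES) as zf:
    zf.setpassword(password)
    for filename in sorted(os.listdir(cv_folder)):
        zf.write(os.path.join(cv_folder, filename), arcname=filename)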

Profiles Parsed by HrFlow.ai:

If profiles are parsed by HrFlow.ai, there is no need to verify the profile JSON formats. They can be downloaded directly from their source or retrieved by providing the item_key (equal to the key in the profile JSON object).

Below is a Python code snippet to create the HDF5 files:

import json
import os
from math import ceil

import h5py

# Partition parameters
p_size = 100_000          # Partition chunk size
p_digits = 7              # Number of digits, 7 means a max number of 9_999_999 elements
path_to_data = "."        # Root folder where the dataset is assembled (adjust as needed)

# Profiles
profiles = [                            # The index = id in the Profiles.parquet
    json.dumps({"reference": 2200001, "key": "b3368cec590bfc7e32d73902ca785a4f20a5197b", ...}),
    json.dumps({"reference": 2200002, "key": "14678cec590bfc7e32d73902ca785a48287ca97b", ...}),
    json.dumps({"reference": 2230003, "key": "12349ec590bfc7e32d73902ca78876463ba6377b", ...}),
    ...
]

# Build partitions
os.makedirs(os.path.join(path_to_data, "profile_objects"), exist_ok=True)
for partition_number in range(ceil((len(profiles) - 1) / p_size)):
    if partition_number == 0:
        # The first partition is inclusive on both ends: ids 0 to p_size
        partition_name = f"{0:0{p_digits}d}-{p_size:0{p_digits}d}"
        partition_data = profiles[0:p_size + 1]
    else:
        # Subsequent partitions start at partition_number * p_size + 1 so ids match the partition name
        partition_name = f"{partition_number*p_size + 1:0{p_digits}d}-{(partition_number + 1)*p_size:0{p_digits}d}"
        partition_data = profiles[partition_number*p_size + 1:(partition_number + 1)*p_size + 1]
    partition_path = os.path.join(path_to_data, f"profile_objects/{partition_name}.h5")
    with h5py.File(partition_path, "w") as h5py_file:
        h5py_file.create_dataset("objects", data=partition_data)

And here is a Python code snippet to check the loading:

import json
import os

import h5py

# Profile Id
profile_id = 223_000                                # Same as the id in the Profiles.parquet
partition = "0200001-0300000"                       # Same as the partition in the Profiles.parquet

# Local partition Id
partition_start_id = int(partition.split("-")[0])
local_profile_id = profile_id - partition_start_id  # Arrays always start at 0, first element is 0

# Load the profile
partition_path = os.path.join(path_to_data, f"profile_objects/{partition}.h5")
with h5py.File(partition_path, "r") as f:
    profile = json.loads(f["objects"][local_profile_id])
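
Before submitting, it can be worth cross-checking that Profiles.parquet and the HDF5 partitions agree. Below is a minimal verification sketch, assuming the files were produced as above:

import os

import h5py
import pandas as pd

profiles_df = pd.read_parquet(os.path.join(path_to_data, "Profiles.parquet"))

for partition, group in profiles_df.groupby("partition"):
    partition_start_id = int(partition.split("-")[0])
    partition_end_id = int(partition.split("-")[1])
    partition_path = os.path.join(path_to_data, f"profile_objects/{partition}.h5")
    with h5py.File(partition_path, "r") as f:
        n_objects = f["objects"].shape[0]
    # Every id assigned to this partition must fall inside the declared range ...
    assert group["id"].between(partition_start_id, partition_end_id).all(), partition
    # ... and must map to a valid local index in the HDF5 dataset
    assert (group["id"] - partition_start_id).max() < n_objects, partition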


2. Jobs Format

Jobs are typically lighter than profiles and are stored in a single job_objects.h5 file containing all job JSON objects. This eliminates the need for partitioning.

In addition to this storage, two Parquet dataframes are created to ensure easy and efficient access to jobs.

i. Parquet DataFrames:

Two Parquet dataframes are used for jobs: Jobs.parquet and Jobs_Tags.parquet.

πŸ“˜

Requirements

In addition to pandas, you will need to install either the fastparquet or the pyarrow library.

Indexation dataframe (Jobs.parquet)

The indexation dataframe enables fast and efficient jobs searches and links jobs to profiles through trackings. The id column serves as the table's key and must be unique and incremental, starting from 0 (0, 1, 2, ...). The reference is equally important, allowing jobs to be traced back to the original client database if needed.

The rnd and split values separate training and testing jobs. A deterministic approach is recommended, such as generating a uniform random number between 0 and 1 using the job's hash.

This dataframe should be a simple Pandas dataframe saved with the to_parquet method. It must include the following columns:

job_columns = [
    "id",                       # Id on this dataset, starts from 0 to len(jobs) - 1
    "sql_id",                   # Optional : Id on the original database
    "item_key",                 # Optional : Key inside the job JSON object
    "provider_key",             # Optional : Key of the source containing the job, else : None
    "reference",                # Reference of the job, might be the same as the sql_id
    "created_at",               # Date of creation of the job
    "location_lat",             # Optional : Latitude of the job
    "location_lng",             # Optional : Longitude of the job
    "synthetic_model",          # Optional : None or model key used to create synthetic sample (ex : hugging face model_key)
    "translation_model",        # Optional : None or model key used to translate the sample (ex : hugging face model_key)
    "tagging_rome_jobtitle",    # Optional : Rome job title, else : None
    "tagging_rome_category",    # Optional : Rome category, else : None
    "tagging_rome_subfamily",   # Optional : Rome subfamily, else : None
    "tagging_rome_family",      # Optional : Rome family, else : None
    "text",                     # Optional : Text of the job (raw text of the parsed job)
    "text_language",            # Optional : Language of the text
    "split",                    # Optional : train, test (based on rnd, ex : rnd < 0.8 is train)
    "rnd",                      # Optional : random number (between 0 and 1) used to split the dataset
]

Tags dataframe (Jobs_Tags.parquet)

Tags are values associated with a job. They can describe the job itself, such as industry or contract type, or specific requirements like remote availability. Tags are flexibleβ€”a single job can have multiple tags.

Each tag consists of a name and a value. The name represents the tagging referential (e.g., industry, contract_type), while the value indicates its state (e.g., tech or finance for industry).

This dataframe should be a simple Pandas dataframe saved with the to_parquet method. It must include the following columns:

tags_columns = [
    "job_id",               # Id of the job, foreign key from the Jobs.parquet
    "name",                 # Name of the tag referential, ex : industry, contract_type, ...
    "value",                # Value of the tag, ex : IT, Temporary, ...
]

ii. HDF5 File:

Jobs should follow HrFlow.ai's Job Object structure. More information is available at this link.

In addition to the documentation, below is an example of an HrFlow.ai Job Object:

{
  "name": "Regulatory Quality Assurance Manager M/F",
  "key": "12343bc6c1b9f47e54567898765432aabcde3293",
  "reference": "2983AAU930",
  "url": null,
  "summary": "Experienced Regulatory Quality Assurance Manager M/F with a demonstrated history of ...",
  "created_at": "2021-09-07T14:22:27.000+0000",
  "sections": [
    {
      "name": "job-description",
      "title": "Job description",
      "description": "..."
    },
    {
      ...
    }
  ],
  "culture": "XX Company is a global leader in the field of ...",
  "responsibilities": "You will be responsible for ...",
  "requirements": "You have a degree in ...",
  "benefits": "We offer a competitive salary ...",
  "location": {
    "text": "Paris, France",
    "lat": 48.8566,
    "lng": 2.3522
  },
  "skills": [],
  "languages": [],
  "tasks": [],
  "certifications": [],
  "courses": [],
  "tags": [
    {
      "name": "experience",
      "value": "5-10 years"
    },
    {
      "name": "trial-period",
      "value": "3 months"
    },
    {
      "name": "country",
      "value": "France"
    }
  ]
}

We highly recommend prioritizing the fields culture, responsibilities, requirements, and benefits, but they can be left as null. If any job information does not fit into these fields, the sections field can be used instead: it is flexible and can contain any kind of section, though it is deprecated.

Below is a Python code snippet to create the HDF5 file from the job JSON objects:

import json
import os

import h5py

# Jobs
jobs = [							# The index = id in the Jobs.parquet
    json.dumps({"reference": 2200001, "key": "b3368cec590bfc7e32d73902ca785a4f20a5197b", ...}),
    json.dumps({"reference": 2200002, "key": "14678cec590bfc7e32d73902ca785a48287ca97b", ...}),
    json.dumps({"reference": 2230003, "key": "12349ec590bfc7e32d73902ca78876463ba6377b", ...}),
    ...
]

# Save jobs in HDF5
hdf5_path = os.path.join(path_to_data, "job_objects.h5")
with h5py.File(hdf5_path, "w") as h5py_file:
    h5py_file.create_dataset("objects", data=jobs)

And here is a Python code snippet to check the loading:

import json
import os

import h5py

# Job Id
job_id = 22_000                                     # Same as the id in the Jobs.parquet

# Load the job
hdf5_path = os.path.join(path_to_data, "job_objects.h5")
with h5py.File(hdf5_path, "r") as f:
    job = json.loads(f["objects"][job_id])

3. Trackings Format

Trackings represent interactions between jobs and profiles. A tracking consists of an action and metadata, such as the role or author ID. We distinguish between two cases:

  • Candidate viewpoint: The tracking action reflects the candidate's interaction with the job offer (e.g., view, apply, accepted). The tracking role is candidate, and the author_id is the profile_id or candidate's email.
  • Recruiter viewpoint: The tracking action reflects the recruiter's actions on the candidate's application (e.g., internet_application, first_interview, technical_interview). The tracking role is recruiter, and the author_id is the recruiter's ID or email.

The duration of the tracking depends on the action type and is not always defined. Some actions, such as a candidate viewing a job offer or a technical interview, have durations. For example, a candidate viewing a job offer for 10 seconds differs from viewing it for 120 seconds.

The Trackings.parquet dataframe should be a simple Pandas dataframe saved with the to_parquet method. It must include the following columns:

tracking_columns = [
    "team_id",              # Optional : Team id if know, else : None 
    "author_id",            # Optional : identifier (id or email) of the author of the action
    "profile_id",           # Id of the profile, foreign key from the Profiles.parquet
    "job_id",               # Id of the job, foreign key from the Jobs.parquet
    "action",               # Interaction happening between the profile and the job, ex : view, apply, recruited, ...
    "duration",             # Optional : Duration of the action in seconds, ex: the action view lasted 10 seconds 
    "role",                 # Role of the author of the action, could be one of : recruiter, candidate, employee.
    "comment",              # Additional comment on the action, ex : reason of the rejection
    "timestamp",            # timestamp of the creation in seconds, format : 1641004216
    "date_edition",         # timestamp of the edition in seconds, format : 1641004216
    "rnd",                  # Optional : random number (between 0 and 1) used to split the dataset
]
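
A minimal construction sketch; the interactions iterable and its fields are placeholders for your own event log, not an HrFlow.ai API:

import pandas as pd

tracking_records = []
for event in interactions:                         # hypothetical iterable of interaction events
    tracking_records.append({
        "team_id": None,
        "author_id": event.get("author_email"),
        "profile_id": event["profile_id"],         # id from Profiles.parquet
        "job_id": event["job_id"],                 # id from Jobs.parquet
        "action": event["action"],                 # e.g. view, apply, recruited
        "duration": event.get("duration"),         # seconds, None when not applicable
        "role": event["role"],                     # recruiter, candidate or employee
        "comment": event.get("comment"),
        "timestamp": int(event["created_at"].timestamp()),     # assuming datetime objects
        "date_edition": int(event["updated_at"].timestamp()),
        "rnd": None,
    })

pd.DataFrame(tracking_records, columns=tracking_columns).to_parquet("Trackings.parquet", index=False)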

4. Agents Format

To understand the agentic approach, let's first consider the following example of an agent configuration: each block represents a recruiter action, and the block's color indicates whether the action is negative (red), positive (green), or neutral (white). This configuration defines the agent setup, which determines the scoring algorithm.

This is the role of the Agents.parquet file. It is a dataframe containing at least one row, where each row represents a unique agent configuration. The column training_labels_tracking summarizes this configuration. For this example, the variable is set as follows:

[
  {
    "id": 0,
    "label": "Screening",
    "value": 0,
    "principal_input": true,                     // Starting node is the principal input
  "leaf": {
      "label": "Rejected After Screening",
      "value": -1
    }
  },
  {
    "id": 1,
    "label": "Interview",
    "value": 0,
    "principal_input": false,
    "leaf": {
      "label": "Rejected After Interview",
      "value": -1
    }
  },
  {
    "id": 2,
    "label": "Hired",
    "value": 1,
    "principal_input": false,
    "leaf": null
  }
]

The Agents.parquet dataframe should be a simple Pandas dataframe saved with the to_parquet method. It must include the following columns:

agent_columns = [
    "sql_id",                       # Optional : Id on the original database, else: None.
    "agent_key",                    # Optional : Key identifying the agent.
    "training_labels_tracking",     # training labels trackings as described above, list of nodes.
    "labeler_type",                 # Role of the action author, could be : recruiter, candidate or employee.
]
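
Parquet columns are flat, so one simple option (an assumption on our side, to be confirmed with the HrFlow.ai team if a nested type is preferred) is to serialize the node list to a JSON string. A minimal sketch:

import json

import pandas as pd

training_labels_tracking = [
    {"id": 0, "label": "Screening", "value": 0, "principal_input": True,
     "leaf": {"label": "Rejected After Screening", "value": -1}},
    {"id": 1, "label": "Interview", "value": 0, "principal_input": False,
     "leaf": {"label": "Rejected After Interview", "value": -1}},
    {"id": 2, "label": "Hired", "value": 1, "principal_input": False, "leaf": None},
]

agents_df = pd.DataFrame([{
    "sql_id": None,
    "agent_key": "my-agent",                                           # placeholder key
    "training_labels_tracking": json.dumps(training_labels_tracking),  # serialized node list
    "labeler_type": "recruiter",
}], columns=agent_columns)

agents_df.to_parquet("Agents.parquet", index=False)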

5. Submit Data

As explained in Part 1.ii, there are two possible cases, depending on whether the CVs are parsed by HrFlow.ai or not. The resulting data folder structure differs slightly between them:

  • Profiles Not Parsed by HrFlow.ai:

The resulting folder will have an architecture similar to the following:

.
β”œβ”€β”€ Agents.parquet
β”œβ”€β”€ job_objects.h5
β”œβ”€β”€ Jobs.parquet
β”œβ”€β”€ Jobs_Tags.parquet
β”œβ”€β”€ cvs							# CVs are named with the id or the reference
β”‚   β”œβ”€β”€ 0.pdf
β”‚   β”œβ”€β”€ 1.png
β”‚   β”œβ”€β”€ 2.pdf
β”‚   β”œβ”€β”€ ...
β”‚   └── n.pdf
β”œβ”€β”€ Profiles.parquet
β”œβ”€β”€ Profiles_Tags.parquet
└── Trackings.parquet
  • Profiles Parsed by HrFlow.ai:

The resulting folder will have an architecture similar to the following:

.
β”œβ”€β”€ Agents.parquet
β”œβ”€β”€ job_objects.h5
β”œβ”€β”€ Jobs.parquet
β”œβ”€β”€ Jobs_Tags.parquet
β”œβ”€β”€ profile_objects
β”‚   β”œβ”€β”€ 0000000-0100000.h5
β”‚   β”œβ”€β”€ 0100001-0200000.h5
β”‚   β”œβ”€β”€ 0200001-0300000.h5
β”‚   β”œβ”€β”€ 0300001-0400000.h5
β”‚   └── 0400001-0500000.h5
β”œβ”€β”€ Profiles.parquet
β”œβ”€β”€ Profiles_Tags.parquet
└── Trackings.parquet