Dataset (Hugging Face)

Compile Collection data into a Hugging Face Dataset

The data we hold in collections also lets us capture insights that eventually become part of our organizational knowledge, and that can also contribute to SLM (small language model) development.

All contracts that C-level executives rejected over the last two years can be used for training alongside company policies.

Risk reports and audit evidence can be used as training material for the Risk Management team.
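As a minimal sketch of this compilation step, assuming collection records can be exported as plain dicts (the field names below are illustrative, not the platform's actual export schema), the records can be written as JSON Lines, a format the Hugging Face `datasets` library loads directly:

```python
import json

# Hypothetical records exported from a Datalog collection; the field names
# are illustrative, not the platform's actual export schema.
records = [
    {"text": "Contract rejected: missing indemnity clause.", "source": "contracts"},
    {"text": "Risk finding: driver safety training overdue.", "source": "audits"},
]

# Write JSON Lines, which Hugging Face `datasets` can load directly, e.g.:
#   from datasets import load_dataset
#   ds = load_dataset("json", data_files="collection.jsonl")
with open("collection.jsonl", "w", encoding="utf-8") as f:
    for rec in records:
        f.write(json.dumps(rec) + "\n")

print(sum(1 for _ in open("collection.jsonl")))  # 2
```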

1. Introduction

1.1 Purpose of the document

This handbook provides detailed guidance on configuring dataset generation forms in the Datalog platform. It explains each configuration option, when to use specific settings, and how they impact your training data quality.

1.2 Who this guide is for (target audience)

This guide is intended for:

  • Data generation teams and interns who create training datasets for language models.

  • AI developers who build systems to help users discover and understand compliance information.

  • Annotation teams who need to label and validate training data.

  • Quality assurance specialists who review generated data for accuracy and realism.

1.3 Overview of the dataset generation feature

The dataset generation feature creates training data from source documents using AI to simulate realistic user interactions. Users approach the system with zero knowledge of the source document contents by asking natural questions or making requests related to the document's subject matter (e.g., compliance procedures, forms, requirements, processes). The AI assistant responds by naturally introducing relevant information from the documents, helping users progressively discover what they need to know through authentic interactions.

Key characteristics:

  • Users ask from their work context (e.g., "How do I use my car for work?"), not from prior awareness of the document.

  • Questions progress from broad or vague openers to specific follow-ups based on assistant responses.

  • All specific data is replaced with placeholders to prevent memorization of actual details.

  • Realistic scenarios covering multiple user personas (employees, managers, HR coordinators, property managers, etc.)

  • Natural language patterns reflecting how people actually ask questions at work
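The placeholder rule above can be sketched with simple regex substitutions. The patterns and placeholder names here are assumptions for illustration, not the platform's actual conventions:

```python
import re

# Illustrative redaction rules; the platform's actual placeholder
# conventions may differ.
RULES = [
    (re.compile(r"\b[A-Z][a-z]+ \d{1,2}, \d{4}\b"), "[DATE]"),  # e.g. May 10, 2024
    (re.compile(r"\b\d+(?:\.\d+)?%"), "[PERCENTAGE]"),          # e.g. 12.5%
    (re.compile(r"\$\d[\d,]*(?:\.\d+)?"), "[AMOUNT]"),          # e.g. $1,200.50
]

def redact(text: str) -> str:
    """Replace specific data with placeholders to prevent memorization."""
    for pattern, placeholder in RULES:
        text = pattern.sub(placeholder, text)
    return text

print(redact("On May 10, 2024 the violation rate fell to 12.5% and fines of $1,200 were waived."))
# On [DATE] the violation rate fell to [PERCENTAGE] and fines of [AMOUNT] were waived.
```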

2. Prerequisites

2.1 Required permissions/access

  • User account with login access to the Datalog platform

  • Membership or ownership of a project within Datalog

  • Permission to create datasets within projects

  • A project that already contains collections of assets

2.2 Any dependencies or setup needed before starting

Step-by-step setup workflow:

  1. Log in to the data catalog with appropriate credentials

  2. Create a new project (if not already existing)

  • Navigate to project creation

  • Name the project appropriately (e.g., "Compliance_Training_Data")

  3. Create a collection within the project

  • Collections organize and store your source documents

  • Name the collection descriptively (e.g., "Documents_Collection")

  4. Upload compliance PDF files to the collection

  • Upload all source documents as file assets

  • Ensure PDFs are properly formatted and readable

  • Verify documents contain the compliance content needed

  5. Prepare Document Parsing Instructions that specify:

  • How to extract and interpret content from documents

  • Formatting rules (e.g., "Extract text from tables, ignore headers and footers, bold text indicates key information")

  • Instructions for handling compliance-specific elements (findings, recommendations, violation rates, etc.)

  6. Define your Data Generation Instructions including:

  • User scenarios (zero-knowledge approach)

  • Placeholder naming conventions

  • Rules and patterns

  • Response requirements

  7. Create the dataset within the project

  • Select the collection containing your uploaded documents

  • Configure all generation settings

  • Input your data generation instructions

  8. Configure dataset settings:

  • Dataset Name: Follow your naming convention

  • Data Source Collection: Select the collection with uploaded PDFs

  • Target Data Points: Recommended 100+ for adequate coverage

  • Target Data Point Length: Leave empty for diverse lengths
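The final configuration step might be captured as a plain settings object; the key names below are hypothetical, chosen only to mirror the fields listed above:

```python
# Hypothetical settings object mirroring the workflow's final step; key
# names are illustrative, not the platform's API.
dataset_config = {
    "project": "Compliance_Training_Data",
    "collection": "Documents_Collection",
    "dataset_name": "compliance_dataset_v1",   # hypothetical name
    "target_data_points": 100,                 # 100+ recommended for coverage
    "target_data_point_length": None,          # empty => diverse lengths
}
print(dataset_config["dataset_name"])
```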

3. Quick Start

Creating a dataset involves configuring four main sections:

  1. General Settings - Define dataset name, data source collection, and targets

  2. Data Source - Choose between document processing or column-based processing

  3. Generation Function - Select schema type and provide generation instructions

  4. Document Parsing - (Documents only) Configure processing quality and modes

4. Configuration Options

4.1 General Settings

Field name: Dataset Name

Description: A unique identifier for your training dataset

Accepted values: Text (1 - 255 characters)

Default value: None

Required or optional: Required

Example usage: compliance___v1 or DSH__training_data

Field name: Data Source Collection

Description: The collection containing your uploaded compliance PDF documents that will be used as source material for generating training data

Accepted values: Dropdown selection from available collections in your project

Default value: None

Required or optional: Required

Example usage: Select "Documents_Collection" from the dropdown

Field name: Target Data Points

Description: The desired number of examples to generate. The actual number may be less depending on source document size and content variety.

Accepted values: Integer (positive number)

Default value: N/A

Required or optional: Required

Example usage: 100 for initial testing, 500-1000 for production training datasets

Field name: Target Data Point Length

Description: The desired length of each data point in characters. Leave empty to create diverse data points ranging from short to long for better training variety.

Accepted values: Integer (characters), or empty for variable length

Default value: Empty (variable length)

Required or optional: Optional

Example usage: Leave empty for natural variety, or specify 512 for consistent context windows.

4.2 Data Source

Field name: Type

Description: Allows you to choose the type of data source for your dataset generation

Accepted values:

  • Asset Documents - Process uploaded asset documents in the collection

  • Best for: PDF files and scanned documents; reports, articles, and contracts; multi-page documents with structured content

  • Configuration needed: document processing mode (Fast/Balanced/Manual), data generation quality, document parsing instructions (optional)

  • Performance: standard processing speed, comprehensive content extraction

  • Asset Columns - Process pre-extracted column data in the collection

  • Best for: text already extracted to database columns; CSV files imported into collections; structured data with clear text fields; cases where you need faster processing

  • Configuration needed: column selection (required), chunk separator (defaults to “Lines”)

  • Performance: ~60% faster than document processing

Default value: Asset Documents

Required or optional: Required

Example usage: Select "Asset Documents" for compliance PDFs

Field name: Chunk Separator

Description: Determines how documents are split into smaller pieces (chunks) for processing. This separator determines where splits happen, directly affecting the size and context of each data point and the quality of your training data.

Accepted values:

  • None - Don't split, use entire window content as one chunk. Best for short windows or complete document sections

  • Paragraphs - Split with empty lines. Good for articles and documents where paragraphs are separated by blank lines

  • Lines - Split line by line. Perfect for lists, CSV files, or any content where each line is a separate item

  • Sentences - Split at sentence endings by period (.). Best for detailed text analysis where each sentence matters.

  • Custom - Define your own split pattern. Examples: pipe |, dashes ---, or any custom marker

Default value: Paragraphs

Required or optional: Required

Example usage: Select "Paragraphs" for compliance documents where sections are separated by blank lines.

4.3 Generation Function

Field name: Type

Description: Select how data points should be structured and generated for model training. This determines the input-output format and the learning objective for the model.

Accepted values:

  • Text Classification - Categorize text content by assigning predefined labels or tags

  • Preference Ranking - Compare multiple responses and rank them by quality or preference

  • Conversation - Assess dialogue quality and identify issues

  • Hybrid-Knowledge Domain Management - Localize an AI model, based on the iGOT AI patent for managing raw data

  • Text Generation - Generate new text content based on input prompts or context

Default value: None

Required or optional: Required

4.3.1 Type - Text Classification

Field name: Data Generation Instructions

Description: Provide specific guidelines for generating data that align with your annotation goals. These comprehensive instructions define user scenarios, patterns, placeholder usage, and all rules for creating realistic training data.

Accepted values: Free-form text (detailed multi-line instructions)

Default value: None

Required or optional: Required

Example usage: (Paste the complete Data Generation Instructions document - Sections I through IV)

I. MANDATORY RULES
[Full content from your data generation instructions]
II. CLASSIFICATION CATEGORIES
[Full content from your data generation instructions]
III. TEXT SAMPLE REQUIREMENTS
[Full content from your data generation instructions]
IV. CATEGORY ASSIGNMENT RULES
[Full content from your data generation instructions]

Field name: Classification Labels (only appears for Text Classification type)

Description: Create classification categories that annotators will use to label text samples. Use names that match your domain and objectives.

Accepted values: List of label names (text)

Default value: None

Required or optional: Required (for Text Classification type only)

Example usage: For compliance documents: driver_safety, property_disposal, documentation, training, procurement

Field name: Allow multiple labels per item (only appears for Text Classification type)

Description: Allow text to belong to multiple categories simultaneously when content can span across different classification types.

Accepted values: Checkbox (enabled/disabled)

Default value: Disabled

Required or optional: Optional

Example usage: Enable if a single conversation finding can relate to multiple categories (e.g., both "driver_safety" and "documentation")
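One generated classification data point with multiple labels might look like the following; field names are hypothetical, not the platform's export format:

```python
import json

# Hypothetical shape of one multi-label classification data point; field
# names are illustrative, not the platform's export format.
data_point = {
    "text": "[EMPLOYEE_NAME] drove a pool vehicle without completing the "
            "required safety course, and the trip log was left blank.",
    "labels": ["driver_safety", "documentation"],  # multi-label enabled
}
record = json.dumps(data_point)
print(len(json.loads(record)["labels"]))  # 2
```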

Field name: Reference Context

Description: Include additional columns as reference context to improve data quality. Selected columns will provide supplementary information alongside each data point, helping the AI generate more accurate and contextually relevant content. When using these columns in your instructions, reference them with the <reference_context> tag.

Accepted values: Dropdown selection from available columns in the selected collection

Default value: None

Required or optional: Optional

Example usage: Select additional context columns if your collection has pre-extracted metadata like "text_generation_type", "agency_name", "date_range" that can improve generation quality.

4.3.2 Type - Preference Ranking

Field name: Data Generation Instructions

Description: Provide specific guidelines for generating data that align with your annotation goals. These comprehensive instructions define user scenarios, patterns, placeholder usage, and all rules for creating realistic training data.

Accepted values: Free-form text (detailed multi-line instructions)

Default value: None

Required or optional: Required

Example usage: (Paste the complete Data Generation Instructions document - Sections I through IV)

I. MANDATORY RULES
[Full content from your data generation instructions]
II. RESPONSE GENERATION GUIDELINES
[Full content from your data generation instructions]
III. RANKING SCENARIO REQUIREMENTS
[Full content from your data generation instructions]
IV. QUALITY DIFFERENTIATION RULES
[Full content from your data generation instructions]

Field name: Ranking Criteria

Description: Define evaluation standards for comparing and ranking different responses. Choose aspects most relevant to your use case and quality goals.

Accepted values: Text labels (add multiple criteria individually)

Default value: None

Required or optional: Required

Example usage:

accuracy
helpfulness
clarity
relevance
completeness
tone

Field name: Number of Response Options

Description: How many responses to generate for comparison. Higher numbers enable more granular comparisons but require longer processing time.

Accepted values: Integer (1-10)

Default value: 3

Required or optional: Required
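A preference-ranking data point with the default three response options might look like this; the structure and field names are illustrative assumptions:

```python
# Hypothetical shape of one preference-ranking data point with the default
# three response options; field names are illustrative.
ranking_point = {
    "prompt": "How do I report a damaged pool vehicle?",
    "responses": [
        {"text": "File form [FORM_ID] with your supervisor within [N] days.", "rank": 1},
        {"text": "Tell your supervisor.", "rank": 2},
        {"text": "Ignore it unless someone asks.", "rank": 3},
    ],
    "criteria": ["accuracy", "helpfulness", "clarity"],
}
best = min(ranking_point["responses"], key=lambda r: r["rank"])
print(best["rank"])  # 1
```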

Field name: Reference Context

Description: Include additional columns as reference context to improve data quality. Selected columns will provide supplementary information alongside each data point, helping the AI generate more accurate and contextually relevant content. When using these columns in your instructions, reference them with the <reference_context> tag.

Accepted values: Column selection from collection

Default value: None

Required or optional: Optional

Example usage: Select columns containing source documents, user profiles, or domain-specific context relevant to the responses being ranked.

4.3.3 Type - Conversation

Field name: Data Generation Instructions

Description: Provide specific guidelines for generating data that align with your annotation goals. These comprehensive instructions define user scenarios, patterns, placeholder usage, and all rules for creating realistic training data.

Accepted values: Free-form text (detailed multi-line instructions)

Default value: None

Required or optional: Required

Example usage: (Paste the complete Data Generation Instructions document - Sections I through VII)

I. MANDATORY RULES
[Full content from your data generation instructions]
II. QUESTION CONTENT SCOPE (USER SCENARIOS)
[Full content from your scenarios section]
III. PERSONA & KNOWLEDGE LEVEL
[Full content]
IV. INTENTS REQUIRED IN DATASET
[Full content]
V. ASSISTANT RESPONSE REQUIREMENTS
[Full content]
VI. NUMBER OF EXCHANGES
[Full content]
VII. ANNOTATION DIMENSIONS
[Full content]

Field name: Annotation Dimensions

Description: Specify conversation elements that annotators will label in each dialogue (e.g., topic category, user role, knowledge level). Focus on aspects important for your analysis needs. These dimensions help categorize and analyze the quality and characteristics of generated conversations.

Accepted values: List of dimension names (text labels)

Default value: None

Required or optional: Required

Example usage:

violation_type - Identifies the category (driver_safety, surplus_property, travel_policies, documentation, training, procurement, financial_controls, policy_violations)

content_scope - Marks location of content being discussed (finding_1, finding_2, recommendation_1, agency_response, legal_citations, deadlines, corrective_actions, compliance_requirements)

knowledge_level - Measures user expertise level (novice, intermediate, expert)

requires_clarification - Indicates if additional information is needed (yes, no)

compliance_accuracy - Evaluates accuracy level of compliance understanding (accurate, partially_accurate, inaccurate)
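Putting the dimensions above together, a single generated conversation record might look like the following sketch (field names and placeholder tokens are illustrative, not the platform's export format):

```python
# Hypothetical shape of one generated conversation with annotation
# dimensions; field names are illustrative, not the platform's export format.
conversation = {
    "messages": [
        {"role": "user", "content": "How do I use my car for work?"},
        {"role": "assistant", "content": "Personal vehicle use requires prior "
         "approval via form [FORM_ID]; mileage is reimbursed at [RATE]."},
        {"role": "user", "content": "Where do I send the form?"},
        {"role": "assistant", "content": "Submit it to [DEPARTMENT] before travel."},
    ],
    "annotations": {
        "violation_type": "travel_policies",
        "knowledge_level": "novice",
        "requires_clarification": "no",
        "compliance_accuracy": "accurate",
    },
}
print(len(conversation["messages"]) // 2)  # 2 exchanges
```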

4.3.4 Hybrid-Knowledge Domain

Field name: Data Generation Instructions

Description: Provide specific guidelines for generating data that align with your annotation goals. These comprehensive instructions define user scenarios, patterns, placeholder usage, and all rules for creating realistic training data.

Accepted values: Free-form text (detailed multi-line instructions)

Default value: None

Required or optional: Required

Example usage: (Paste the complete Data Generation Instructions document - Sections I through IV)

I. MANDATORY RULES
[Full content from your data generation instructions]
II. ENTITY EXTRACTION GUIDELINES
[Full content from your data generation instructions]
III. RELATIONSHIP MAPPING REQUIREMENTS
[Full content from your data generation instructions]
IV. DOMAIN STRUCTURING RULES
[Full content from your data generation instructions]

Field name: Knowledge Domain Types

Description: Define domain-specific categories to extract from text based on your field and requirements. Leave empty for automatic detection of all types.

Accepted values: Text labels (add multiple types individually)

Default value: None (automatic detection)

Required or optional: Optional

Example usage:

person
location
organization
product
contract
date
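A hybrid-knowledge data point built from these domain types might pair typed entities with relationships between them, as in this illustrative sketch (field names assumed):

```python
# Hypothetical shape of one hybrid-knowledge data point: entities typed
# with the domain types above plus relationships between them.
knowledge_point = {
    "text": "[ORGANIZATION] signed contract [CONTRACT_ID] on [DATE].",
    "entities": [
        {"span": "[ORGANIZATION]", "type": "organization"},
        {"span": "[CONTRACT_ID]", "type": "contract"},
        {"span": "[DATE]", "type": "date"},
    ],
    "relationships": [
        {"head": "[ORGANIZATION]", "relation": "signed", "tail": "[CONTRACT_ID]"},
    ],
}
types = {e["type"] for e in knowledge_point["entities"]}
print(sorted(types))  # ['contract', 'date', 'organization']
```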

Field name: Reference Context

Description: Include additional columns as reference context to improve data quality. Selected columns will provide supplementary information alongside each data point, helping the AI generate more accurate and contextually relevant content. When using these columns in your instructions, reference them with the <reference_context> tag.

Accepted values: Column selection from collection

Default value: None

Required or optional: Optional

Example usage: Select columns containing source documents, existing entity databases, or domain ontologies relevant to the knowledge extraction task.

4.3.5 Text Generation

Field name: Data Generation Instructions

Description: Provide specific guidelines for generating new text content based on input prompts or context. These instructions define how to create coherent, contextually appropriate text for various content types like explanations, summaries, and instructions.

Accepted values: Free-form text (detailed multi-line instructions)

Default value: None

Required or optional: Required (when Text Generation type is selected)

Example usage:

I. MANDATORY RULES
[Full content from your data generation instructions]
II. PROMPT TYPES
[Full content]
III. OUTPUT REQUIREMENTS
[Full content]
IV. CONTENT VARIATION
[Full content]

Field name: Generation Types

Description: Select one or more types to control what kind of text generation will be created. If you don't select any types (leave all checkboxes unchecked and don't add custom types), the system will automatically create diverse generation types including summary, Q&A, rewrite, explanation, and more.

Accepted values:

  • Summary - Generate condensed versions of content

  • Rewrite - Rephrase content in different words

  • Question & Answer - Create Q&A pairs from content

  • Explanation - Generate detailed explanations of concepts

  • Custom types - Add your own generation type names

Default value: All types (if none selected)

Required or optional: Optional

Example usage:

For compliance audit documents, select:

  • Summary - to create brief overviews of findings and recommendations

  • Question & Answer - to generate Q&A about compliance requirements

  • Explanation - to explain procedures and regulations

Or add custom types:

  • procedure_steps - Generate step-by-step procedures

  • requirement_list - Create lists of compliance requirements

  • checklist - Generate compliance checklists
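Generated points for a few of these types might look like the following sketch; the record shapes are assumptions, while the type names mirror the options above:

```python
# Hypothetical shape of text-generation data points, one per generation
# type; field names are illustrative, not the platform's export format.
generation_points = [
    {"type": "summary",
     "input": "[FINDING_1] ... [RECOMMENDATION_1] ...",
     "output": "The audit found [FINDING_1] and recommends [RECOMMENDATION_1]."},
    {"type": "question_answer",
     "input": "Drivers must complete training by [DATE].",
     "output": "Q: When must drivers complete training? A: By [DATE]."},
    {"type": "checklist",  # custom type
     "input": "Vehicle use policy section.",
     "output": "- Approval form filed\n- Training completed\n- Mileage logged"},
]
print(sorted({p["type"] for p in generation_points}))
# ['checklist', 'question_answer', 'summary']
```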

Field name: Reference Context

Description: Include additional columns as reference context to improve data quality. Selected columns will provide supplementary information alongside each data point, helping the AI generate more accurate and contextually relevant content. When using these columns in your instructions, reference them with the <reference_context> tag.

Accepted values: Dropdown selection from available columns in the selected collection

Default value: None

Required or optional: Optional

Example usage: Select additional context columns if your collection has pre-extracted metadata like "document_type", "audit_category", "date_range" that can improve generation quality

4.4 Document Parsing

Field name: Processing Configuration

Description: Documents are processed in sliding windows - each window contains multiple pages with overlap to maintain context. Choose a preset mode or customize the window size and overlap to match your document complexity.

Accepted values:

  • Fast - Quick processing for faster results (3-page windows, 1-page overlap). Good for: simple layouts, standard documents, clean text, straightforward content

  • Balanced - Balances processing speed and accuracy (2-page windows, 1-page overlap). Good for: complex layouts, documents with tables, dense content, technical documents

  • Advanced Configuration - Customize window size and overlap to match your document requirements

Default value: Fast

Required or optional: Required

Example usage: Select "Balanced" for compliance documents with tables, findings sections, and dense technical content
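The sliding-window behavior described above can be sketched as page-range arithmetic; this mirrors the preset numbers but is not the platform's actual implementation:

```python
def sliding_windows(num_pages: int, window_size: int,
                    overlap: int) -> list[tuple[int, int]]:
    """Return 1-indexed, inclusive page ranges for sliding-window
    processing (sketch; assumes overlap < window_size)."""
    step = window_size - overlap
    windows = []
    start = 1
    while start <= num_pages:
        windows.append((start, min(start + window_size - 1, num_pages)))
        if start + window_size - 1 >= num_pages:
            break
        start += step
    return windows

# "Fast" preset: 3-page windows with 1-page overlap over a 7-page document.
print(sliding_windows(7, 3, 1))  # [(1, 3), (3, 5), (5, 7)]
```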

Field name: Data Generation Quality

Description: Select the AI model quality for generating your dataset content. This controls the sophistication and accuracy of the generated data points, affecting the final training quality of your models.

Accepted values:

  • Standard (gemini-2.5-flash) - Fast, balance between quality and cost. Good for: most datasets, straightforward content, general use cases

  • Standard (o4-mini-batch) - Fast, balance between quality and cost using Azure AI. Good for: most datasets, straightforward content, general use cases

  • Premium (gemini-2.5-pro) - Highest quality, trade-off speed and cost. Good for: complex datasets, detailed requirements, critical accuracy

  • Premium (gpt-4.1-batch) - Highest quality using Azure AI, trade-off speed and cost. Good for: complex datasets, detailed requirements, critical accuracy

Default value: Standard (gemini-2.5-flash)

Required or optional: Required

Example usage: Use "Standard (gemini-2.5-flash)" for most compliance datasets; upgrade to "Premium (gemini-2.5-pro)" when critical accuracy is required for complex regulatory scenarios

Field name: Document Parsing Instructions

Description: Specify how to extract and interpret content from your documents, including formatting rules. These instructions guide how the AI reads and processes your PDF files.

Accepted values: Free-form text (detailed multi-line instructions)

Default value: None

Required or optional: Required

Example usage:

Extract all text content from compliance PDF documents. Follow these rules:
1. Tables: Extract all text from tables while preserving row/column structure
2. Headers/Footers: Ignore page headers, footers, and page numbers
3. Bold text: Mark bold text to indicate key findings and important information
4. Lists: Preserve bullet points and numbering
5. Dates: Keep date formats as-is (e.g., May 10, 2024)
6. Section markers: Identify and mark sections like [FINDING], [RECOMMENDATION], [RESPONSE]
7. Numerical data: Preserve all statistics and percentages exactly

5. Use Cases / Scenarios

Scenario 1: Customer Service Sentiment Analysis

Goal: Classify customer support tickets by sentiment

Configuration:

  • Data Source: Asset Columns (if tickets in database)

  • Schema: Text Classification

  • Labels: Positive, Neutral, Negative

  • Multi-label: Disabled

  • Target Points: 1,000

  • Instructions: “Focus on customer feedback tone regarding issue resolution”

  • Reference Context: - category (e.g., billing, technical) - priority_level - product_name

Scenario 2: Legal Document Summarization

Goal: Generate summaries of legal contracts

Configuration:

  • Data Source: Asset Documents (PDF contracts)

  • Processing: Balanced mode (complex layouts)

  • Quality: Premium (accuracy critical)

  • Schema: Text Generation

  • Generation Types: summary

  • Parsing Instructions: “Extract from tables, ignore headers, bold text indicates key terms”

  • Target Points: 500

  • Instructions: “Create summaries suitable for non-lawyers, highlighting obligations and termination clauses”

Scenario 3: Chatbot Response Quality Evaluation

Goal: Compare and rank chatbot responses for RLHF

Configuration:

  • Data Source: Asset Documents (chat transcripts)

  • Schema: Preference Ranking

  • Criteria: helpfulness, accuracy, tone, resolution

  • Number of Responses: 3

  • Target Points: 2,000

  • Instructions: “Evaluate customer service responses for billing and technical support inquiries”

Scenario 4: Product Review Topic Classification

Goal: Categorize product reviews by topics

Configuration:

  • Data Source: Asset Columns (review database)

  • Schema: Text Classification

  • Labels: Quality, Shipping, Price, Customer Service, Product Features

  • Multi-label: Enabled (reviews mention multiple topics)

  • Target Points: 3,000

  • Reference Context: - product_category - rating - verified_purchase

Scenario 5: Knowledge Base Construction

Goal: Extract entities and relationships from technical docs

Configuration:

  • Data Source: Asset Documents (technical manuals)

  • Processing: Balanced mode (tables, diagrams)

  • Quality: Premium (technical accuracy)

  • Schema: Hybrid-Knowledge Domain (entity and relationship extraction)

  • Knowledge Domain Types: component, process, specification, troubleshooting_step

  • Target Points: 1,500

Scenario 6: Fast Text Processing from Database

Goal: Quick processing of pre-extracted text

Configuration:

  • Data Source: Asset Columns (60% faster)

  • Schema: Text Generation

  • Generation Types: Leave empty (auto-diverse)

  • Target Points: 5,000

  • Chunk Separator: Lines (for structured data)

6. Best Practices

6.1 Start Small, Scale Up

  • Begin with 100-500 data points for testing

  • Review quality and adjust configuration

  • Scale to 1,000-5,000+ for production

6.2 Be Specific with Instructions

  • Vague: “Generate good data”

  • Specific: “Generate customer service scenarios focusing on billing inquiries, prioritizing clear resolution steps”

6.3 Choose Right Data Source

  • Use Documents for: PDFs, reports, contracts, articles

  • Use Columns for: Database content, CSV imports, pre-extracted text

  • Columns are ~60% faster when applicable

6.4 Match Schema to Use Case

  • Text Classification: labeling and categorization tasks

  • Preference Ranking: response comparison and RLHF

  • Conversation: dialogue quality assessment

  • Hybrid-Knowledge Domain: entity and relationship extraction

  • Text Generation: summaries, Q&A, rewrites, and explanations

6.5 Optimize Chunk Separators

  • Paragraphs: Default for most documents

  • Lines: Best for lists and structured data

  • Sentences: For sentence-level analysis

  • None: For short, complete sections

6.6 Leverage Reference Context

  • Include metadata when it helps understanding

  • Don’t overuse (max 10 columns)

  • Use in generation instructions with <reference_context> tag

6.7 Balance Quality and Speed

  • Standard quality for initial testing

  • Premium quality for final production data

  • Fast processing for simple documents

  • Balanced/Manual for complex layouts

6.8 Validate Early

  • Use Preview button before generating full dataset

  • Check sample data points for quality

  • Adjust configuration based on preview results

7. Troubleshooting / FAQ

Issue: Actual data points less than target

Causes:

- Source data exhausted

- Chunk separator created fewer chunks

- Document processing filtered content

Solutions:

- Add more source data

- Adjust chunk separator (use smaller chunks)

- Reduce target number

Issue: Data points too short/long

Solution:

- Set Target Data Point Length explicitly

- Adjust chunk separator (None for longer, Sentences for shorter)

Issue: Poor quality generated content

Solutions:

- Upgrade to Premium quality mode

- Improve Data Generation Instructions (be more specific)

- Add Reference Context for better understanding

- Use Parsing Instructions for complex documents

Issue: Processing too slow

Solutions:

- Switch from Documents to Asset Columns (~60% faster)

- Use Fast processing mode instead of Balanced

- Switch from Premium to Standard quality

- Reduce Target Data Points for testing

Issue: Not enough context in data points

Solutions:

- Increase Window Size (Manual mode)

- Increase Overlap between windows

- Add Reference Context columns

- Change chunk separator to create larger chunks
