Dataset (Hugging Face)

Compile Collection data into a Hugging Face Dataset

The data we hold in collections also lets us capture insights that eventually become part of our organizational knowledge, and that can also contribute to SLM (small language model) development.

All contracts that C-level executives rejected over the last two years can be used for training alongside company policies.

Risk reports and audit evidence can be used as training material for the Risk Management team.
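As a minimal sketch of this compilation step, assuming collection records can be exported as plain dicts (the field names below are illustrative, not the platform's actual export schema), the records can be written as JSON Lines, a format the Hugging Face `datasets` library loads directly:

```python
import json

# Hypothetical records exported from a Datalog collection; the field names
# are illustrative, not the platform's actual export schema.
records = [
    {"text": "Contract rejected: missing indemnity clause.", "source": "contracts"},
    {"text": "Risk finding: driver safety training overdue.", "source": "audits"},
]

# Write JSON Lines, which Hugging Face `datasets` can load directly, e.g.:
#   from datasets import load_dataset
#   ds = load_dataset("json", data_files="collection.jsonl")
with open("collection.jsonl", "w", encoding="utf-8") as f:
    for rec in records:
        f.write(json.dumps(rec) + "\n")

print(sum(1 for _ in open("collection.jsonl")))  # 2
```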

1. Introduction

1.1 Purpose of the document

This handbook provides detailed guidance on configuring dataset generation forms in the Datalog platform. It explains each configuration option, when to use specific settings, and how they impact your training data quality.

1.2 Who this guide is for (target audience)

This guide is intended for:

  • Data generation teams and interns who create training datasets for language models.

  • AI developers who build systems to help users discover and understand compliance information.

  • Annotation teams who need to label and validate training data.

  • Quality assurance specialists who review generated data for accuracy and realism.

1.3 Overview of the dataset generation feature

The dataset generation feature creates training data from source documents using AI to simulate realistic user interactions. Users approach the system with zero knowledge of the source document contents by asking natural questions or making requests related to the document's subject matter (e.g., compliance procedures, forms, requirements, processes). The AI assistant responds by naturally introducing relevant information from the documents, helping users progressively discover what they need to know through authentic interactions.

Key characteristics:

  • Users ask from their work context (e.g., "How do I use my car for work?"), not from prior awareness of the document.

  • Questions progress from broad or vague openers to specific follow-ups based on assistant responses.

  • All specific data is replaced with placeholders to prevent memorization of actual details.

  • Realistic scenarios covering multiple user personas (employees, managers, HR coordinators, property managers, etc.)

  • Natural language patterns reflecting how people actually ask questions at work
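The placeholder rule above can be sketched with simple regex substitutions. The patterns and placeholder names here are assumptions for illustration, not the platform's actual conventions:

```python
import re

# Illustrative redaction rules; the platform's actual placeholder
# conventions may differ.
RULES = [
    (re.compile(r"\b[A-Z][a-z]+ \d{1,2}, \d{4}\b"), "[DATE]"),  # e.g. May 10, 2024
    (re.compile(r"\b\d+(?:\.\d+)?%"), "[PERCENTAGE]"),          # e.g. 12.5%
    (re.compile(r"\$\d[\d,]*(?:\.\d+)?"), "[AMOUNT]"),          # e.g. $1,200.50
]

def redact(text: str) -> str:
    """Replace specific data with placeholders to prevent memorization."""
    for pattern, placeholder in RULES:
        text = pattern.sub(placeholder, text)
    return text

print(redact("On May 10, 2024 the violation rate fell to 12.5% and fines of $1,200 were waived."))
# On [DATE] the violation rate fell to [PERCENTAGE] and fines of [AMOUNT] were waived.
```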

2. Prerequisites

2.1 Required permissions/access

  • User account with login access to the Datalog platform

  • Membership or ownership of a project within Datalog

  • Permission to create datasets within projects

  • A project that already contains collections of assets

2.2 Any dependencies or setup needed before starting

Step-by-step setup workflow:

  1. Log in to the data catalog with appropriate credentials

  2. Create a new project (if not already existing)

  • Navigate to project creation

  • Name the project appropriately (e.g., "Compliance_Training_Data")

  3. Create a collection within the project

  • Collections organize and store your source documents

  • Name the collection descriptively (e.g., "Documents_Collection")

  4. Upload compliance PDF files to the collection

  • Upload all source documents as file assets

  • Ensure PDFs are properly formatted and readable

  • Verify documents contain the compliance content needed

  5. Prepare Document Parsing Instructions that specify:

  • How to extract and interpret content from documents

  • Formatting rules (e.g., "Extract text from tables, ignore headers and footers, bold text indicates key information")

  • Instructions for handling compliance-specific elements (findings, recommendations, violation rates, etc.)

  6. Define your Data Generation Instructions including:

  • User scenarios (zero-knowledge approach)

  • Placeholder naming conventions

  • Rules and patterns

  • Response requirements

  7. Create the dataset within the project

  • Select the collection containing your uploaded documents

  • Configure all generation settings

  • Input your data generation instructions

  8. Configure dataset settings:

  • Dataset Name: Follow your naming convention

  • Data Source Collection: Select the collection with uploaded PDFs

  • Target Data Points: Recommended 100+ for adequate coverage

  • Target Data Point Length: Leave empty for diverse lengths
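The final configuration step might be captured as a plain settings object; the key names below are hypothetical, chosen only to mirror the fields listed above:

```python
# Hypothetical settings object mirroring the workflow's final step; key
# names are illustrative, not the platform's API.
dataset_config = {
    "project": "Compliance_Training_Data",
    "collection": "Documents_Collection",
    "dataset_name": "compliance_dataset_v1",   # hypothetical name
    "target_data_points": 100,                 # 100+ recommended for coverage
    "target_data_point_length": None,          # empty => diverse lengths
}
print(dataset_config["dataset_name"])
```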

3. Quick Start

Creating a dataset involves configuring four main sections:

  1. General Settings - Define dataset name, data source collection, and targets

  2. Data Source - Choose between document processing or column-based processing

  3. Generation Function - Select schema type and provide generation instructions

  4. Document Parsing - (Documents only) Configure processing quality and modes

4. Configuration Options

4.1 General Settings

Field name: Dataset Name

Description: A unique identifier for your training dataset

Accepted values: Text (1 - 255 characters)

Default value: None

Required or optional: Required

Example usage: compliance___v1 or DSH__training_data

Field name: Data Source Collection

Description: The collection containing your uploaded compliance PDF documents that will be used as source material for generating training data

Accepted values: Dropdown selection from available collections in your project

Default value: None

Required or optional: Required

Example usage: Select "Documents_Collection" from the dropdown

Field name: Target Data Points

Description: The desired number of examples to generate. The actual number may be less depending on source document size and content variety.

Accepted values: Integer (positive number)

Default value: N/A

Required or optional: Required

Example usage: 100 for initial testing, 500-1000 for production training datasets

Field name: Target Data Point Length

Description: The desired length of each data point in characters. Leave empty to create diverse data points ranging from short to long for better training variety.

Accepted values: Integer (characters), or empty for variable length

Default value: Empty (variable length)

Required or optional: Optional

Example usage: Leave empty for natural variety, or specify 512 for consistent context windows.

4.2 Data Source

Field name: Type

Description: Allows you to choose the type of data source for your dataset generation

Accepted values:

  • Asset Documents - Process uploaded asset documents in the collection

  • Best for: PDF files and scanned documents; reports, articles, and contracts; multi-page documents with structured content

  • Configuration needed: document processing mode (Fast/Balanced/Manual), data generation quality, document parsing instructions (optional)

  • Performance: standard processing speed, comprehensive content extraction

  • Asset Columns - Process pre-extracted column data in the collection

  • Best for: text already extracted to database columns; CSV files imported into collections; structured data with clear text fields; cases where you need faster processing

  • Configuration needed: column selection (required), chunk separator (defaults to “Lines”)

  • Performance: ~60% faster than document processing

Default value: Asset Documents

Required or optional: Required

Example usage: Select "Asset Documents" for compliance PDFs

Field name: Chunk Separator

Description: Determines how documents are split into smaller pieces (chunks) for processing. This separator determines where splits happen, directly affecting the size and context of each data point and the quality of your training data.

Accepted values:

  • None - Don't split, use entire window content as one chunk. Best for short windows or complete document sections

  • Paragraphs - Split with empty lines. Good for articles and documents where paragraphs are separated by blank lines

  • Lines - Split line by line. Perfect for lists, CSV files, or any content where each line is a separate item

  • Sentences - Split at sentence endings by period (.). Best for detailed text analysis where each sentence matters.

  • Custom - Define your own split pattern. Examples: pipe |, dashes ---, or any custom marker

Default value: Paragraphs

Required or optional: Required

Example usage: Select "Paragraphs" for compliance documents where sections are separated by blank lines.

4.3 Generation Function

Field name: Type

Description: Select how data points should be structured and generated for model training. This determines the input-output format and the learning objective for the model.

Accepted values:

  • Text Classification - Categorize text content by assigning predefined labels or tags

  • Preference Ranking - Compare multiple responses and rank them by quality or preference

  • Conversation - Assess dialogue quality and identify issues

  • Hybrid-Knowledge Domain Management - Localize an AI model, based on the iGOT AI patent for managing raw data

  • Text Generation - Generate new text content based on input prompts or context

Default value: None

Required or optional: Required

4.3.1 Type - Text Classification

Field name: Data Generation Instructions

Description: Provide specific guidelines for generating data that align with your annotation goals. These comprehensive instructions define user scenarios, patterns, placeholder usage, and all rules for creating realistic training data.

Accepted values: Free-form text (detailed multi-line instructions)

Default value: None

Required or optional: Required

Example usage: (Paste the complete Data Generation Instructions document - Sections I through IV)

I. MANDATORY RULES
[Full content from your data generation instructions]
II. CLASSIFICATION CATEGORIES
[Full content from your data generation instructions]
III. TEXT SAMPLE REQUIREMENTS
[Full content from your data generation instructions]
IV. CATEGORY ASSIGNMENT RULES
[Full content from your data generation instructions]

Field name: Classification Labels (only appears for Text Classification type)

Description: Create classification categories that annotators will use to label text samples. Use names that match your domain and objectives.

Accepted values: List of label names (text)

Default value: None

Required or optional: Required (for Text Classification type only)

Example usage: For compliance documents: driver_safety, property_disposal, documentation, training, procurement

Field name: Allow multiple labels per item (only appears for Text Classification type)

Description: Allow text to belong to multiple categories simultaneously when content can span across different classification types.

Accepted values: Checkbox (enabled/disabled)

Default value: Disabled

Required or optional: Optional

Example usage: Enable if a single conversation finding can relate to multiple categories (e.g., both "driver_safety" and "documentation")
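One generated classification data point with multiple labels might look like the following; field names are hypothetical, not the platform's export format:

```python
import json

# Hypothetical shape of one multi-label classification data point; field
# names are illustrative, not the platform's export format.
data_point = {
    "text": "[EMPLOYEE_NAME] drove a pool vehicle without completing the "
            "required safety course, and the trip log was left blank.",
    "labels": ["driver_safety", "documentation"],  # multi-label enabled
}
record = json.dumps(data_point)
print(len(json.loads(record)["labels"]))  # 2
```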

Field name: Reference Context

Description: Include additional columns as reference context to improve data quality. Selected columns will provide supplementary information alongside each data point, helping the AI generate more accurate and contextually relevant content. When using these columns in your instructions, reference them with the <reference_context> tag.

Accepted values: Dropdown selection from available columns in the selected collection

Default value: None

Required or optional: Optional

Example usage: Select additional context columns if your collection has pre-extracted metadata like "text_generation_type", "agency_name", "date_range" that can improve generation quality.

4.3.2 Type - Preference Ranking

Field name: Data Generation Instructions

Description: Provide specific guidelines for generating data that align with your annotation goals. These comprehensive instructions define user scenarios, patterns, placeholder usage, and all rules for creating realistic training data.

Accepted values: Free-form text (detailed multi-line instructions)

Default value: None

Required or optional: Required

Example usage: (Paste the complete Data Generation Instructions document - Sections I through IV)

I. MANDATORY RULES
[Full content from your data generation instructions]
II. RESPONSE GENERATION GUIDELINES
[Full content from your data generation instructions]
III. RANKING SCENARIO REQUIREMENTS
[Full content from your data generation instructions]
IV. QUALITY DIFFERENTIATION RULES
[Full content from your data generation instructions]

Field name: Ranking Criteria

Description: Define evaluation standards for comparing and ranking different responses. Choose aspects most relevant to your use case and quality goals.

Accepted values: Text labels (add multiple criteria individually)

Default value: None

Required or optional: Required

Example usage:

accuracy
helpfulness
clarity
relevance
completeness
tone

Field name: Number of Response Options

Description: How many responses to generate for comparison. Higher numbers enable more granular comparisons but require longer processing time.

Accepted values: Integer (1-10)

Default value: 3

Required or optional: Required
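A preference-ranking data point with the default three response options might look like this; the structure and field names are illustrative assumptions:

```python
# Hypothetical shape of one preference-ranking data point with the default
# three response options; field names are illustrative.
ranking_point = {
    "prompt": "How do I report a damaged pool vehicle?",
    "responses": [
        {"text": "File form [FORM_ID] with your supervisor within [N] days.", "rank": 1},
        {"text": "Tell your supervisor.", "rank": 2},
        {"text": "Ignore it unless someone asks.", "rank": 3},
    ],
    "criteria": ["accuracy", "helpfulness", "clarity"],
}
best = min(ranking_point["responses"], key=lambda r: r["rank"])
print(best["rank"])  # 1
```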

Field name: Reference Context

Description: Include additional columns as reference context to improve data quality. Selected columns will provide supplementary information alongside each data point, helping the AI generate more accurate and contextually relevant content. When using these columns in your instructions, reference them with the <reference_context> tag.

Accepted values: Column selection from collection

Default value: None

Required or optional: Optional

Example usage: Select columns containing source documents, user profiles, or domain-specific context relevant to the responses being ranked.

4.3.3 Type - Conversation

Field name: Data Generation Instructions

Description: Provide specific guidelines for generating data that align with your annotation goals. These comprehensive instructions define user scenarios, patterns, placeholder usage, and all rules for creating realistic training data.

Accepted values: Free-form text (detailed multi-line instructions)

Default value: None

Required or optional: Required

Example usage: (Paste the complete Data Generation Instructions document - Sections I through VII)

I. MANDATORY RULES
[Full content from your data generation instructions]
II. QUESTION CONTENT SCOPE (USER SCENARIOS)
[Full content from your scenarios section]
III. PERSONA & KNOWLEDGE LEVEL
[Full content]
IV. INTENTS REQUIRED IN DATASET
[Full content]
V. ASSISTANT RESPONSE REQUIREMENTS
[Full content]
VI. NUMBER OF EXCHANGES
[Full content]
VII. ANNOTATION DIMENSIONS
[Full content]

Field name: Annotation Dimensions

Description: Specify conversation elements that annotators will label in each dialogue (e.g., topic category, user role, knowledge level). Focus on aspects important for your analysis needs. These dimensions help categorize and analyze the quality and characteristics of generated conversations.

Accepted values: List of dimension names (text labels)

Default value: None

Required or optional: Required

Example usage:

violation_type - Identifies the category (driver_safety, surplus_property, travel_policies, documentation, training, procurement, financial_controls, policy_violations)

content_scope - Marks location of content being discussed (finding_1, finding_2, recommendation_1, agency_response, legal_citations, deadlines, corrective_actions, compliance_requirements)

knowledge_level - Measures user expertise level (novice, intermediate, expert)

requires_clarification - Indicates if additional information is needed (yes, no)

compliance_accuracy - Evaluates accuracy level of compliance understanding (accurate, partially_accurate, inaccurate)
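Putting the dimensions above together, a single generated conversation record might look like the following sketch (field names and placeholder tokens are illustrative, not the platform's export format):

```python
# Hypothetical shape of one generated conversation with annotation
# dimensions; field names are illustrative, not the platform's export format.
conversation = {
    "messages": [
        {"role": "user", "content": "How do I use my car for work?"},
        {"role": "assistant", "content": "Personal vehicle use requires prior "
         "approval via form [FORM_ID]; mileage is reimbursed at [RATE]."},
        {"role": "user", "content": "Where do I send the form?"},
        {"role": "assistant", "content": "Submit it to [DEPARTMENT] before travel."},
    ],
    "annotations": {
        "violation_type": "travel_policies",
        "knowledge_level": "novice",
        "requires_clarification": "no",
        "compliance_accuracy": "accurate",
    },
}
print(len(conversation["messages"]) // 2)  # 2 exchanges
```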

4.3.4 Hybrid-Knowledge Domain

Field name: Data Generation Instructions

Description: Provide specific guidelines for generating data that align with your annotation goals. These comprehensive instructions define user scenarios, patterns, placeholder usage, and all rules for creating realistic training data.

Accepted values: Free-form text (detailed multi-line instructions)

Default value: None

Required or optional: Required

Example usage: (Paste the complete Data Generation Instructions document - Sections I through IV)

I. MANDATORY RULES
[Full content from your data generation instructions]
II. ENTITY EXTRACTION GUIDELINES
[Full content from your data generation instructions]
III. RELATIONSHIP MAPPING REQUIREMENTS
[Full content from your data generation instructions]
IV. DOMAIN STRUCTURING RULES
[Full content from your data generation instructions]

Field name: Knowledge Domain Types

Description: Define domain-specific categories to extract from text based on your field and requirements. Leave empty for automatic detection of all types.

Accepted values: Text labels (add multiple types individually)

Default value: None (automatic detection)

Required or optional: Optional

Example usage:

person
location
organization
product
contract
date
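A hybrid-knowledge data point built from these domain types might pair typed entities with relationships between them, as in this illustrative sketch (field names assumed):

```python
# Hypothetical shape of one hybrid-knowledge data point: entities typed
# with the domain types above plus relationships between them.
knowledge_point = {
    "text": "[ORGANIZATION] signed contract [CONTRACT_ID] on [DATE].",
    "entities": [
        {"span": "[ORGANIZATION]", "type": "organization"},
        {"span": "[CONTRACT_ID]", "type": "contract"},
        {"span": "[DATE]", "type": "date"},
    ],
    "relationships": [
        {"head": "[ORGANIZATION]", "relation": "signed", "tail": "[CONTRACT_ID]"},
    ],
}
types = {e["type"] for e in knowledge_point["entities"]}
print(sorted(types))  # ['contract', 'date', 'organization']
```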

Field name: Reference Context

Description: Include additional columns as reference context to improve data quality. Selected columns will provide supplementary information alongside each data point, helping the AI generate more accurate and contextually relevant content. When using these columns in your instructions, reference them with the <reference_context> tag.

Accepted values: Column selection from collection

Default value: None

Required or optional: Optional

Example usage: Select columns containing source documents, existing entity databases, or domain ontologies relevant to the knowledge extraction task.

4.3.5 Text Generation

Field name: Data Generation Instructions

Description: Provide specific guidelines for generating new text content based on input prompts or context. These instructions define how to create coherent, contextually appropriate text for various content types like explanations, summaries, and instructions.

Accepted values: Free-form text (detailed multi-line instructions)

Default value: None

Required or optional: Required (when Text Generation type is selected)

Example usage:

I. MANDATORY RULES
[Full content from your data generation instructions]
II. PROMPT TYPES
[Full content]
III. OUTPUT REQUIREMENTS
[Full content]
IV. CONTENT VARIATION
[Full content]

Field name: Generation Types

Description: Select one or more types to control what kind of text generation will be created. If you don't select any types (leave all checkboxes unchecked and don't add custom types), the system will automatically create diverse generation types including summary, Q&A, rewrite, explanation, and more.

Accepted values:

  • Summary - Generate condensed versions of content

  • Rewrite - Rephrase content in different words

  • Question & Answer - Create Q&A pairs from content

  • Explanation - Generate detailed explanations of concepts

  • Custom types - Add your own generation type names

Default value: All types (if none selected)

Required or optional: Optional

Example usage:

For compliance audit documents, select:

  • Summary - to create brief overviews of findings and recommendations

  • Question & Answer - to generate Q&A about compliance requirements

  • Explanation - to explain procedures and regulations

Or add custom types:

  • procedure_steps - Generate step-by-step procedures

  • requirement_list - Create lists of compliance requirements

  • checklist - Generate compliance checklists
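Generated points for a few of these types might look like the following sketch; the record shapes are assumptions, while the type names mirror the options above:

```python
# Hypothetical shape of text-generation data points, one per generation
# type; field names are illustrative, not the platform's export format.
generation_points = [
    {"type": "summary",
     "input": "[FINDING_1] ... [RECOMMENDATION_1] ...",
     "output": "The audit found [FINDING_1] and recommends [RECOMMENDATION_1]."},
    {"type": "question_answer",
     "input": "Drivers must complete training by [DATE].",
     "output": "Q: When must drivers complete training? A: By [DATE]."},
    {"type": "checklist",  # custom type
     "input": "Vehicle use policy section.",
     "output": "- Approval form filed\n- Training completed\n- Mileage logged"},
]
print(sorted({p["type"] for p in generation_points}))
# ['checklist', 'question_answer', 'summary']
```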

Field name: Reference Context

Description: Include additional columns as reference context to improve data quality. Selected columns will provide supplementary information alongside each data point, helping the AI generate more accurate and contextually relevant content. When using these columns in your instructions, reference them with the <reference_context> tag.

Accepted values: Dropdown selection from available columns in the selected collection

Default value: None

Required or optional: Optional

Example usage: Select additional context columns if your collection has pre-extracted metadata like "document_type", "audit_category", "date_range" that can improve generation quality

4.4 Document Parsing

Field name: Processing Configuration

Description: Documents are processed in sliding windows - each window contains multiple pages with overlap to maintain context. Choose a preset mode or customize the window size and overlap to match your document complexity.

Accepted values:

  • Fast - Quick processing for faster results (3-page windows, 1-page overlap). Good for: simple layouts, standard documents, clean text, straightforward content

  • Balanced - Balances processing speed and accuracy (2-page windows, 1-page overlap). Good for: complex layouts, documents with tables, dense content, technical documents

  • Advanced Configuration - Customize window size and overlap to match your document requirements

Default value: Fast

Required or optional: Required

Example usage: Select "Balanced" for compliance documents with tables, findings sections, and dense technical content
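The sliding-window behavior described above can be sketched as page-range arithmetic; this mirrors the preset numbers but is not the platform's actual implementation:

```python
def sliding_windows(num_pages: int, window_size: int,
                    overlap: int) -> list[tuple[int, int]]:
    """Return 1-indexed, inclusive page ranges for sliding-window
    processing (sketch; assumes overlap < window_size)."""
    step = window_size - overlap
    windows = []
    start = 1
    while start <= num_pages:
        windows.append((start, min(start + window_size - 1, num_pages)))
        if start + window_size - 1 >= num_pages:
            break
        start += step
    return windows

# "Fast" preset: 3-page windows with 1-page overlap over a 7-page document.
print(sliding_windows(7, 3, 1))  # [(1, 3), (3, 5), (5, 7)]
```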

Field name: Data Generation Quality

Description: Select the AI model quality for generating your dataset content. This controls the sophistication and accuracy of the generated data points, affecting the final training quality of your models.

Accepted values:

  • Standard (gemini-2.5-flash) - Fast, balance between quality and cost. Good for: most datasets, straightforward content, general use cases

  • Standard (o4-mini-batch) - Fast, balance between quality and cost using Azure AI. Good for: most datasets, straightforward content, general use cases

  • Premium (gemini-2.5-pro) - Highest quality, trade-off speed and cost. Good for: complex datasets, detailed requirements, critical accuracy

  • Premium (gpt-4.1-batch) - Highest quality using Azure AI, trade-off speed and cost. Good for: complex datasets, detailed requirements, critical accuracy

Default value: Standard (gemini-2.5-flash)

Required or optional: Required

Example usage: Use "Standard (gemini-2.5-flash)" for most compliance datasets; upgrade to "Premium (gemini-2.5-pro)" when critical accuracy is required for complex regulatory scenarios

Field name: Document Parsing Instructions

Description: Specify how to extract and interpret content from your documents, including formatting rules. These instructions guide how the AI reads and processes your PDF files.

Accepted values: Free-form text (detailed multi-line instructions)

Default value: None

Required or optional: Required

Example usage:

Extract all text content from compliance PDF documents. Follow these rules:
1. Tables: Extract all text from tables while preserving row/column structure
2. Headers/Footers: Ignore page headers, footers, and page numbers
3. Bold text: Mark bold text to indicate key findings and important information
4. Lists: Preserve bullet points and numbering
5. Dates: Keep date formats as-is (e.g., May 10, 2024)
6. Section markers: Identify and mark sections like [FINDING], [RECOMMENDATION], [RESPONSE]
7. Numerical data: Preserve all statistics and percentages exactly

5. Use Cases / Scenarios

Scenario 1: Customer Service Sentiment Analysis

Goal: Classify customer support tickets by sentiment

Configuration:

  • Data Source: Asset Columns (if tickets in database)

  • Schema: Text Classification

  • Labels: Positive, Neutral, Negative

  • Multi-label: Disabled

  • Target Points: 1,000

  • Instructions: “Focus on customer feedback tone regarding issue resolution”

  • Reference Context: - category (e.g., billing, technical) - priority_level - product_name

Scenario 2: Legal Document Summarization

Goal: Generate summaries of legal contracts

Configuration:

  • Data Source: Asset Documents (PDF contracts)

  • Processing: Balanced mode (complex layouts)

  • Quality: Premium (accuracy critical)

  • Schema: Text Generation

  • Generation Types: summary

  • Parsing Instructions: “Extract from tables, ignore headers, bold text indicates key terms”

  • Target Points: 500

  • Instructions: “Create summaries suitable for non-lawyers, highlighting obligations and termination clauses”

Scenario 3: Chatbot Response Quality Evaluation

Goal: Compare and rank chatbot responses for RLHF

Configuration:

  • Data Source: Asset Documents (chat transcripts)

  • Schema: Preference Ranking

  • Criteria: helpfulness, accuracy, tone, resolution

  • Number of Responses: 3

  • Target Points: 2,000

  • Instructions: “Evaluate customer service responses for billing and technical support inquiries”

Scenario 4: Product Review Topic Classification

Goal: Categorize product reviews by topics

Configuration:

  • Data Source: Asset Columns (review database)

  • Schema: Text Classification

  • Labels: Quality, Shipping, Price, Customer Service, Product Features

  • Multi-label: Enabled (reviews mention multiple topics)

  • Target Points: 3,000

  • Reference Context: - product_category - rating - verified_purchase

Scenario 5: Knowledge Base Construction

Goal: Extract entities and relationships from technical docs

Configuration:

  • Data Source: Asset Documents (technical manuals)

  • Processing: Balanced mode (tables, diagrams)

  • Quality: Premium (technical accuracy)

  • Schema: Hybrid-Knowledge Domain (entity and relationship extraction)

  • Knowledge Domain Types: component, process, specification, troubleshooting_step

  • Target Points: 1,500

Scenario 6: Fast Text Processing from Database

Goal: Quick processing of pre-extracted text

Configuration:

  • Data Source: Asset Columns (60% faster)

  • Schema: Text Generation

  • Generation Types: Leave empty (auto-diverse)

  • Target Points: 5,000

  • Chunk Separator: Lines (for structured data)

6. Best Practices

6.1 Start Small, Scale Up

  • Begin with 100-500 data points for testing

  • Review quality and adjust configuration

  • Scale to 1,000-5,000+ for production

6.2 Be Specific with Instructions

  • Vague: “Generate good data”

  • Specific: “Generate customer service scenarios focusing on billing inquiries, prioritizing clear resolution steps”

6.3 Choose Right Data Source

  • Use Documents for: PDFs, reports, contracts, articles

  • Use Columns for: Database content, CSV imports, pre-extracted text

  • Columns are ~60% faster when applicable

6.4 Match Schema to Use Case

  • Text Classification: labeling and categorization tasks

  • Preference Ranking: response comparison and RLHF

  • Conversation: dialogue quality assessment

  • Hybrid-Knowledge Domain: entity and relationship extraction

  • Text Generation: summaries, Q&A, rewrites, and explanations

6.5 Optimize Chunk Separators

  • Paragraphs: Default for most documents

  • Lines: Best for lists and structured data

  • Sentences: For sentence-level analysis

  • None: For short, complete sections

6.6 Leverage Reference Context

  • Include metadata when it helps understanding

  • Don’t overuse (max 10 columns)

  • Use in generation instructions with <reference_context> tag

6.7 Balance Quality and Speed

  • Standard quality for initial testing

  • Premium quality for final production data

  • Fast processing for simple documents

  • Balanced/Manual for complex layouts

6.8 Validate Early

  • Use Preview button before generating full dataset

  • Check sample data points for quality

  • Adjust configuration based on preview results

7. Troubleshooting / FAQ

Issue: Actual data points less than target

Causes:

- Source data exhausted

- Chunk separator created fewer chunks

- Document processing filtered content

Solutions:

- Add more source data

- Adjust chunk separator (use smaller chunks)

- Reduce target number

Issue: Data points too short/long

Solution:

- Set Target Data Point Length explicitly

- Adjust chunk separator (None for longer, Sentences for shorter)

Issue: Poor quality generated content

Solutions:

- Upgrade to Premium quality mode

- Improve Data Generation Instructions (be more specific)

- Add Reference Context for better understanding

- Use Parsing Instructions for complex documents

Issue: Processing too slow

Solutions:

- Switch from Documents to Asset Columns (~60% faster)

- Use Fast processing mode instead of Balanced

- Switch from Premium to Standard quality

- Reduce Target Data Points for testing

Issue: Not enough context in data points

Solutions:

- Increase Window Size (Manual mode)

- Increase Overlap between windows

- Add Reference Context columns

- Change chunk separator to create larger chunks
