Dataset (Huggingface)
Compile Collection data to Huggingface Dataset
The data we hold in Collections also lets us capture insights that eventually become part of our organizational knowledge, and it can also contribute to SLM development.
All contracts that C-levels rejected over the last two years can be used for training alongside the company policies.
Risk Reports and Evidence from Audits can be used as training data for the Risk Management team.
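As a rough sketch of that compilation step, the snippet below loads exported collection records and publishes them as a Hugging Face Dataset. The export file, field contents, and repository name are assumptions for illustration, not Datalog's actual export format.

```python
# Minimal sketch: compile exported collection records into a Hugging Face
# Dataset. Assumes the collection was exported as a JSON list of dicts
# (e.g., rejected contracts paired with the policies they violated).
import json

from datasets import Dataset

def compile_collection(path: str) -> Dataset:
    with open(path, encoding="utf-8") as f:
        records = json.load(f)  # expected: list of {"contract": ..., "policy": ...}
    return Dataset.from_list(records)

ds = compile_collection("rejected_contracts.json")   # hypothetical export file
ds.push_to_hub("your-org/contract-policy-training")  # requires a HF auth token
```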
1. Introduction
1.1 Purpose of the document
This handbook provides detailed guidance on configuring dataset generation forms in the Datalog platform. It explains each configuration option, when to use specific settings, and how they impact your training data quality.
1.2 Who this guide is for (target audience)
This guide is intended for:
Data generation teams and interns who create training datasets for language models.
AI developers who build systems to help users discover and understand compliance information.
Annotation teams who need to label and validate training data.
Quality assurance specialists who review generated data for accuracy and realism.
1.3 Overview of the dataset generation feature
The dataset generation feature creates training data from source documents, using AI to simulate realistic user interactions. Users approach the system with zero knowledge of the source documents' contents, asking natural questions or making requests related to the documents' subject matter (e.g., compliance procedures, forms, requirements, processes). The AI assistant responds by naturally introducing relevant information from the documents, helping users progressively discover what they need to know through authentic interactions.
Key characteristics:
Users ask from their work context (e.g., "How do I use my car for work?"), not from prior awareness of the documents.
Conversations progress from broad or vague questions to specific follow-ups based on the assistant's responses.
All specific data is replaced with placeholders to prevent memorization of actual details.
Realistic scenarios covering multiple user personas (employees, managers, HR coordinators, property managers, etc.)
Natural language patterns reflecting how people actually ask questions at work
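To make these characteristics concrete, here is a hypothetical generated data point that follows the pattern above; the field names and structure are illustrative, not the platform's actual output schema.

```python
# Hypothetical data point: a zero-knowledge user progresses from a broad
# question to a specific follow-up, and specifics become placeholders.
data_point = {
    "persona": "employee",
    "messages": [
        {"role": "user", "content": "How do I use my car for work?"},
        {"role": "assistant", "content": (
            "Personal vehicle use is covered by the driver safety policy. "
            "You need authorization from [SUPERVISOR_NAME] and proof of "
            "insurance with at least [MIN_COVERAGE_AMOUNT] in coverage."
        )},
        {"role": "user", "content": "Where do I send the proof of insurance?"},
        {"role": "assistant", "content": (
            "Submit it to [FLEET_COORDINATOR_EMAIL] before your first trip."
        )},
    ],
}
```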
2. Prerequisites
2.1 Required permissions/access
User account with login access to the Datalog platform
Membership or ownership of a project within Datalog
Permission to create datasets within projects
The project already contains collections of assets.
2.2 Any dependencies or setup needed before starting
Step-by-step setup workflow:
1. Log in to the Datalog platform with appropriate credentials
2. Create a new project (if one does not already exist)
   - Navigate to project creation
   - Name the project appropriately (e.g., "Compliance_Training_Data")
3. Create a collection within the project
   - Collections organize and store your source documents
   - Name the collection descriptively (e.g., "Documents_Collection")
4. Upload compliance PDF files to the collection
   - Upload all source documents as file assets
   - Ensure PDFs are properly formatted and readable
   - Verify the documents contain the compliance content you need
5. Prepare Document Parsing Instructions that specify:
   - How to extract and interpret content from documents
   - Formatting rules (e.g., "Extract text from tables, ignore headers and footers, bold text indicates key information")
   - How to handle compliance-specific elements (findings, recommendations, violation rates, etc.)
6. Define your Data Generation Instructions, including:
   - User scenarios (zero-knowledge approach)
   - Placeholder naming conventions
   - Rules and patterns
   - Response requirements
7. Create the dataset within the project
   - Select the collection containing your uploaded documents
   - Configure all generation settings
   - Input your data generation instructions
8. Configure dataset settings:
   - Dataset Name: follow your naming convention
   - Data Source Collection: select the collection with the uploaded PDFs
   - Target Data Points: 100+ recommended for adequate coverage
   - Target Data Point Length: leave empty for diverse lengths
3. Quick Start
Creating a dataset involves configuring four main sections:
General Settings - Define dataset name, data source collection, and targets
Data Source - Choose between document processing or column-based processing
Generation Function - Select schema type and provide generation instructions
Document Parsing - (Documents only) Configure processing quality and modes
4. Configuration Options
4.1 General Settings
Field name: Dataset Name
Description: A unique identifier for your training dataset
Accepted values: Text (1 - 255 characters)
Default value: None
Required or optional: Required
Example usage: compliance___v1 or DSH__training_data
Field name: Data Source Collection
Description: The collection containing your uploaded compliance PDF documents that will be used as source material for generating training data
Accepted values: Dropdown selection from available collections in your project
Default value: None
Required or optional: Required
Example usage: Select "Documents_Collection" from the dropdown
Field name: Target Data Points
Description: The desired number of examples to generate. The actual number may be less depending on source document size and content variety.
Accepted values: Integer (positive number)
Default value: N/A
Required or optional: Required
Example usage: 100 for initial testing, 500-1000 for production training datasets
Field name: Target Data Point Length
Description: The desired length of each data point in characters. Leave empty to create diverse data points ranging from short to long for better training variety.
Accepted values: Integer (characters), or empty for variable length
Default value: Empty (variable length)
Required or optional: Optional
Example usage: Leave empty for natural variety, or specify 512 for consistent context windows.
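Taken together, the General Settings fields might be expressed as the following hypothetical configuration; Datalog's actual API or export schema is not documented in this handbook.

```python
# Hypothetical General Settings payload (field names assumed for illustration).
general_settings = {
    "dataset_name": "compliance_training_v1",          # required, 1-255 characters
    "data_source_collection": "Documents_Collection",  # required, from project collections
    "target_data_points": 100,                         # required, positive integer
    "target_data_point_length": None,                  # optional; None = variable length
}
```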
4.2 Data Source
Field name: Type
Description: Lets you choose the type of data source for your dataset generation
Accepted values:
Asset Documents - Process uploaded asset documents in the collection
- Best for: PDF files and scanned documents; reports, articles, contracts; multi-page documents with structured content
- Configuration needed: document processing mode (Fast/Balanced/Manual); data generation quality; document parsing instructions (optional)
- Performance: standard processing speed, comprehensive content extraction
Asset Columns - Process pre-extracted column data in the collection
- Best for: text already extracted to database columns; CSV files imported into collections; structured data with clear text fields; when you need faster processing
- Configuration needed: column selection (required); chunk separator (defaults to "Lines")
- Performance: ~60% faster than document processing
Default value: Asset Documents
Required or optional: Required
Example usage: Select "Asset Documents" for compliance PDFs
Field name: Chunk Separator
Description: Determines how documents are split into smaller pieces (chunks) for processing. This separator determines where splits happen, directly affecting the size and context of each data point and the quality of your training data.
Accepted values:
None - Don't split, use entire window content as one chunk. Best for short windows or complete document sections
Paragraphs - Split with empty lines. Good for articles and documents where paragraphs are separated by blank lines
Lines - Split line by line. Perfect for lists, CSV files, or any content where each line is a separate item
Sentences - Split at sentence endings by period (.). Best for detailed text analysis where each sentence matters.
Custom - Define your own split pattern. Examples: pipe |, dashes ---, or any custom marker
Default value: Paragraphs
Required or optional: Required
Example usage: Select "Paragraphs" for compliance documents where sections are separated by blank lines.
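The sketch below approximates how each separator choice splits a window's text. The platform's exact splitting logic is not documented here, so treat this as an illustration of the behavior described above.

```python
# Approximate the chunking behavior of each separator option (illustrative).
import re

def split_chunks(text: str, separator: str, custom: str | None = None) -> list[str]:
    if separator == "none":
        return [text]                              # whole window as one chunk
    if separator == "paragraphs":
        parts = re.split(r"\n\s*\n", text)         # split on blank lines
    elif separator == "lines":
        parts = text.splitlines()                  # one chunk per line
    elif separator == "sentences":
        parts = re.split(r"(?<=\.)\s+", text)      # split after each period
    elif separator == "custom" and custom:
        parts = text.split(custom)                 # e.g. "|" or "---"
    else:
        raise ValueError(f"unknown separator: {separator}")
    return [p.strip() for p in parts if p.strip()]

print(split_chunks("Finding 1.\n\nFinding 2.", "paragraphs"))
# -> ['Finding 1.', 'Finding 2.']
```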
4.3 Generation Function
Field name: Type
Description: Select how data points should be structured and generated for model training. This determines the input-output format and the learning objective for the model.
Accepted values:
Text Classification - Categorize text content by assigning predefined labels or tags
Preference Ranking - Compare multiple responses and rank them by quality or preference
Conversation - Assess dialogue quality and identify issues
Hybrid-Knowledge Domain - Extract entities, relationships, and domain structure from text; localizes the AI model based on the iGOT AI patent for managing raw data
Text Generation - Generate new text content based on input prompts or context
Default value: None
Required or optional: Required
4.3.1 Type - Text Classification
Field name: Data Generation Instructions
Description: Provide specific guidelines for generating data that align with your annotation goals. These comprehensive instructions define user scenarios, patterns, placeholder usage, and all rules for creating realistic training data.
Accepted values: Free-form text (detailed multi-line instructions)
Default value: None
Required or optional: Required
Example usage: (Paste the complete Data Generation Instructions document - Sections I through IV)
I. MANDATORY RULES
[Full content from your data generation instructions]
II. CLASSIFICATION CATEGORIES
[Full content from your data generation instructions]
III. TEXT SAMPLE REQUIREMENTS
[Full content from your data generation instructions]
IV. CATEGORY ASSIGNMENT RULES
[Full content from your data generation instructions]
Field name: Classification Labels (only appears for Text Classification type)
Description: Create classification categories that annotators will use to label text samples. Use names that match your domain and objectives.
Accepted values: List of label names (text)
Default value: None
Required or optional: Required (for Text Classification type only)
Example usage: For compliance documents: driver_safety, property_disposal, documentation, training, procurement
Field name: Allow multiple labels per item (only appears for Text Classification type)
Description: Allow text to belong to multiple categories simultaneously when content can span across different classification types.
Accepted values: Checkbox (enabled/disabled)
Default value: Disabled
Required or optional: Optional
Example usage: Enable if a single conversation finding can relate to multiple categories (e.g., both "driver_safety" and "documentation")
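For illustration, here are hypothetical classification data points showing the difference this checkbox makes; the field names are assumed.

```python
# With multi-label disabled, each sample carries exactly one label;
# with it enabled, a sample can span several categories (labels from above).
single_label = {
    "text": "Drivers must complete the defensive driving course before [DATE].",
    "labels": ["driver_safety"],
}
multi_label = {
    "text": "Missing trip logs meant driver authorizations could not be verified.",
    "labels": ["driver_safety", "documentation"],  # one sample, two categories
}
```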
Field name: Reference Context
Description: Include additional columns as reference context to improve data quality. Selected columns will provide supplementary information alongside each data point, helping the AI generate more accurate and contextually relevant content. When using these columns in your instructions, reference them with the <reference_context> tag.
Accepted values: Dropdown selection from available columns in the selected collection
Default value: None
Required or optional: Optional
Example usage: Select additional context columns if your collection has pre-extracted metadata like "text_generation_type", "agency_name", "date_range" that can improve generation quality.
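As an example of how instructions might refer to such columns, the snippet below embeds the <reference_context> tag in a generation prompt. The exact tag semantics are an assumption based on this guide, so verify against your platform version.

```python
# Hypothetical instruction excerpt pointing the generator at two
# reference-context columns selected in the dropdown above.
instructions = """
Classify the text sample. Use the agency named in
<reference_context>agency_name</reference_context> and the period in
<reference_context>date_range</reference_context> to resolve ambiguous findings.
"""
```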
4.3.2 Type - Preference Ranking
Field name: Data Generation Instructions
Description: Provide specific guidelines for generating data that align with your annotation goals. These comprehensive instructions define user scenarios, patterns, placeholder usage, and all rules for creating realistic training data.
Accepted values: Free-form text (detailed multi-line instructions)
Default value: None
Required or optional: Required
Example usage: (Paste the complete Data Generation Instructions document - Sections I through IV)
I. MANDATORY RULES
[Full content from your data generation instructions]
II. RESPONSE GENERATION GUIDELINES
[Full content from your data generation instructions]
III. RANKING SCENARIO REQUIREMENTS
[Full content from your data generation instructions]
IV. QUALITY DIFFERENTIATION RULES
[Full content from your data generation instructions]
Field name: Ranking Criteria
Description: Define evaluation standards for comparing and ranking different responses. Choose aspects most relevant to your use case and quality goals.
Accepted values: Text labels (add multiple criteria individually)
Default value: None
Required or optional: Required
Example usage:
accuracy
helpfulness
clarity
relevance
completeness
tone
Field name: Number of Response Options
Description: How many responses to generate for comparison. Higher numbers enable more granular comparisons but require longer processing time.
Accepted values: Integer (1-10)
Default value: 3
Required or optional: Required
Field name: Reference Context
Description: Include additional columns as reference context to improve data quality. Selected columns will provide supplementary information alongside each data point, helping the AI generate more accurate and contextually relevant content. When using these columns in your instructions, reference them with the <reference_context> tag.
Accepted values: Column selection from collection
Default value: None
Required or optional: Optional
Example usage: Select columns containing source documents, user profiles, or domain-specific context relevant to the responses being ranked.
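A hypothetical preference-ranking data point, combining the instructions, criteria, and response count above, might look like this (field names assumed):

```python
# Three generated response options ranked best-first against the chosen criteria.
ranking_point = {
    "prompt": "What happens if I miss the corrective action deadline?",
    "responses": [
        "Missing the deadline triggers escalation to [OVERSIGHT_BODY]; "
        "file an extension request with [CONTACT_ROLE] before [DATE].",  # complete, accurate
        "You should file an extension request.",                         # helpful but thin
        "Deadlines are usually flexible, so don't worry about it.",      # inaccurate tone
    ],
    "ranking": [0, 1, 2],  # indices of responses, ordered best to worst
    "criteria": ["accuracy", "helpfulness", "completeness", "tone"],
}
```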
4.3.3 Type - Conversation
Field name: Data Generation Instructions
Description: Provide specific guidelines for generating data that align with your annotation goals. These comprehensive instructions define user scenarios, patterns, placeholder usage, and all rules for creating realistic training data.
Accepted values: Free-form text (detailed multi-line instructions)
Default value: None
Required or optional: Required
Example usage: (Paste the complete Data Generation Instructions document - Sections I through VII)
I. MANDATORY RULES
[Full content from your data generation instructions]
II. QUESTION CONTENT SCOPE (USER SCENARIOS)
[Full content from your scenarios section]
III. PERSONA & KNOWLEDGE LEVEL
[Full content]
IV. INTENTS REQUIRED IN DATASET
[Full content]
V. ASSISTANT RESPONSE REQUIREMENTS
[Full content]
VI. NUMBER OF EXCHANGES
[Full content]
VII. ANNOTATION DIMENSIONS
[Full content]
Field name: Annotation Dimensions
Description: Specify conversation elements that annotators will label in each dialogue (e.g., topic category, user role, knowledge level). Focus on aspects important for your analysis needs. These dimensions help categorize and analyze the quality and characteristics of generated conversations.
Accepted values: List of dimension names (text labels)
Default value: None
Required or optional: Required
Example usage:
violation_type - Identifies the category (driver_safety, surplus_property, travel_policies, documentation, training, procurement, financial_controls, policy_violations)
content_scope - Marks location of content being discussed (finding_1, finding_2, recommendation_1, agency_response, legal_citations, deadlines, corrective_actions, compliance_requirements)
knowledge_level - Measures user expertise level (novice, intermediate, expert)
requires_clarification - Indicates if additional information is needed (yes, no)
compliance_accuracy - Evaluates accuracy level of compliance understanding (accurate, partially_accurate, inaccurate)
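Putting the dimensions together, a hypothetical annotated conversation record might look like this (structure assumed for illustration):

```python
# One generated conversation plus the annotation dimensions defined above.
conversation = {
    "messages": [
        {"role": "user", "content": "Do I need training before driving a state vehicle?"},
        {"role": "assistant", "content": (
            "Yes. [TRAINING_NAME] must be completed and renewed every [RENEWAL_PERIOD]."
        )},
    ],
    "annotations": {
        "violation_type": "driver_safety",
        "content_scope": "compliance_requirements",
        "knowledge_level": "novice",
        "requires_clarification": "no",
        "compliance_accuracy": "accurate",
    },
}
```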
4.3.4 Hybrid-Knowledge Domain
Field name: Data Generation Instructions
Description: Provide specific guidelines for generating data that align with your annotation goals. These comprehensive instructions define user scenarios, patterns, placeholder usage, and all rules for creating realistic training data.
Accepted values: Free-form text (detailed multi-line instructions)
Default value: None
Required or optional: Required
Example usage: (Paste the complete Data Generation Instructions document - Sections I through IV)
I. MANDATORY RULES
[Full content from your data generation instructions]
II. ENTITY EXTRACTION GUIDELINES
[Full content from your data generation instructions]
III. RELATIONSHIP MAPPING REQUIREMENTS
[Full content from your data generation instructions]
IV. DOMAIN STRUCTURING RULES
[Full content from your data generation instructions]
Field name: Knowledge Domain Types
Description: Define domain-specific categories to extract from text based on your field and requirements. Leave empty for automatic detection of all types.
Accepted values: Text labels (add multiple types individually)
Default value: None (automatic detection)
Required or optional: Optional
Example usage:
person
location
organization
product
contract
date
Field name: Reference Context
Description: Include additional columns as reference context to improve data quality. Selected columns will provide supplementary information alongside each data point, helping the AI generate more accurate and contextually relevant content. When using these columns in your instructions, reference them with the <reference_context> tag.
Accepted values: Column selection from collection
Default value: None
Required or optional: Optional
Example usage: Select columns containing source documents, existing entity databases, or domain ontologies relevant to the knowledge extraction task.
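For illustration, a hypothetical extraction result using the domain types above could be structured like this (the actual output schema is not documented here):

```python
# Entities and relationships extracted from one placeholdered source chunk.
knowledge_point = {
    "source_text": "[VENDOR_NAME] signed contract [CONTRACT_ID] with [AGENCY_NAME] on [DATE].",
    "entities": [
        {"text": "[VENDOR_NAME]", "type": "organization"},
        {"text": "[CONTRACT_ID]", "type": "contract"},
        {"text": "[AGENCY_NAME]", "type": "organization"},
        {"text": "[DATE]", "type": "date"},
    ],
    "relationships": [
        {"subject": "[VENDOR_NAME]", "predicate": "signed", "object": "[CONTRACT_ID]"},
        {"subject": "[CONTRACT_ID]", "predicate": "effective_on", "object": "[DATE]"},
    ],
}
```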
4.3.5 Text Generation
Field name: Data Generation Instructions
Description: Provide specific guidelines for generating new text content based on input prompts or context. These instructions define how to create coherent, contextually appropriate text for various content types like explanations, summaries, and instructions.
Accepted values: Free-form text (detailed multi-line instructions)
Default value: None
Required or optional: Required (when Text Generation type is selected)
Example usage:
I. MANDATORY RULES
[Full content from your data generation instructions]
II. PROMPT TYPES
[Full content]
III. OUTPUT REQUIREMENTS
[Full content]
IV. CONTENT VARIATION
[Full content]
Field name: Generation Types
Description: Select one or more types to control what kind of text generation will be created. If you don't select any types (leave all checkboxes unchecked and don't add custom types), the system will automatically create diverse generation types including summary, Q&A, rewrite, explanation, and more.
Accepted values:
Summary - Generate condensed versions of content
Rewrite - Rephrase content in different words
Question & Answer - Create Q&A pairs from content
Explanation - Generate detailed explanations of concepts
Custom types - Add your own generation type names
Default value: All types (if none selected)
Required or optional: Optional
Example usage:
For compliance audit documents, select:
Summary - to create brief overviews of findings and recommendations
Question & Answer - to generate Q&A about compliance requirements
Explanation - to explain procedures and regulations
Or add custom types:
procedure_steps - Generate step-by-step procedures
requirement_list - Create lists of compliance requirements
checklist - Generate compliance checklists
Field name: Reference Context
Description: Include additional columns as reference context to improve data quality. Selected columns will provide supplementary information alongside each data point, helping the AI generate more accurate and contextually relevant content. When using these columns in your instructions, reference them with the <reference_context> tag.
Accepted values: Dropdown selection from available columns in the selected collection
Default value: None
Required or optional: Optional
Example usage: Select additional context columns if your collection has pre-extracted metadata like "document_type", "audit_category", "date_range" that can improve generation quality
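Hypothetical data points for two of the generation types above might look like this (field names assumed):

```python
# One summary-type and one Q&A-type data point generated from audit content.
summary_point = {
    "generation_type": "summary",
    "input": "Full text of Finding 1 and its recommendation...",
    "output": "Finding 1: trip logs were incomplete; the agency must adopt "
              "a standard log form by [DATE].",
}
qa_point = {
    "generation_type": "question_answer",
    "input": "Section on surplus property disposal...",
    "output": {
        "question": "Who approves surplus property disposal?",
        "answer": "Approval rests with [APPROVING_OFFICIAL] under [POLICY_ID].",
    },
}
```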
4.4 Document Parsing
Field name: Processing Configuration
Description: Documents are processed in sliding windows - each window contains multiple pages with overlap to maintain context. Choose a preset mode or customize the window size and overlap to match your document complexity.
Accepted values:
Fast - Quick processing for faster results (3-page windows, 1-page overlap). Good for: simple layouts, standard documents, clean text, straightforward content
Balanced - Balances processing speed and accuracy (2-page windows, 1-page overlap). Good for: complex layouts, documents with tables, dense content, technical documents
Advanced Configuration - Customize window size and overlap to match your document requirements
Default value: Fast
Required or optional: Required
Example usage: Select "Balanced" for compliance reports with tables, findings sections, and dense technical content
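The window math is simple enough to sketch: with a 3-page window and a 1-page overlap (Fast), consecutive windows share one page so context carries across the split. A small illustration, not the platform's actual implementation:

```python
# Compute sliding page windows for a document (illustrative only).
def page_windows(total_pages: int, window_size: int = 3, overlap: int = 1):
    windows, start = [], 1
    while True:
        end = min(start + window_size, total_pages + 1)
        windows.append(list(range(start, end)))
        if start + window_size > total_pages:  # last window reached the end
            break
        start += window_size - overlap         # step forward, keeping the overlap
    return windows

print(page_windows(7))        # Fast:     [[1, 2, 3], [3, 4, 5], [5, 6, 7]]
print(page_windows(7, 2, 1))  # Balanced: [[1, 2], [2, 3], [3, 4], [4, 5], [5, 6], [6, 7]]
```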
Field name: Data Generation Quality
Description: Select the AI model quality for generating your dataset content. This controls the sophistication and accuracy of the generated data points, affecting the final training quality of your models.
Accepted values:
Standard (gemini-2.5-flash) - Fast, with a good balance of quality and cost. Good for: most datasets, straightforward content, general use cases
Standard (o4-mini-batch) - Fast, with a good balance of quality and cost, using Azure AI. Good for: most datasets, straightforward content, general use cases
Premium (gemini-2.5-pro) - Highest quality, trading off speed and cost. Good for: complex datasets, detailed requirements, critical accuracy
Premium (gpt-4.1-batch) - Highest quality using Azure AI, trading off speed and cost. Good for: complex datasets, detailed requirements, critical accuracy
Default value: Standard (gemini-2.5-flash)
Required or optional: Required
Example usage: Use "Standard (gemini-2.5-flash)" for most compliance datasets; upgrade to "Premium (gemini-2.5-pro)" when critical accuracy is required for complex regulatory scenarios
Field name: Document Parsing Instructions
Description: Specify how to extract and interpret content from your documents, including formatting rules. These instructions guide how the AI reads and processes your PDF files.
Accepted values: Free-form text (detailed multi-line instructions)
Default value: None
Required or optional: Required
Example usage:
Extract all text content from compliance PDF documents. Follow these rules:
1. Tables: Extract all text from tables while preserving row/column structure
2. Headers/Footers: Ignore page headers, footers, and page numbers
3. Bold text: Mark bold text to indicate key findings and important information
4. Lists: Preserve bullet points and numbering
5. Dates: Keep date formats as-is (e.g., May 10, 2024)
6. Section markers: Identify and mark sections like [FINDING], [RECOMMENDATION], [RESPONSE]
7. Numerical data: Preserve all statistics and percentages exactly
5. Use Cases / Scenarios
Scenario 1: Customer Service Sentiment Analysis
Goal: Classify customer support tickets by sentiment
Configuration:
Data Source: Asset Columns (if tickets in database)
Schema: Text Classification
Labels: Positive, Neutral, Negative
Multi-label: Disabled
Target Points: 1,000
Instructions: “Focus on customer feedback tone regarding issue resolution”
Reference Context:
- category (e.g., billing, technical)
- priority_level
- product_name
Scenario 2: Legal Document Summarization
Goal: Generate summaries of legal contracts
Configuration:
Data Source: Asset Documents (PDF contracts)
Processing: Balanced mode (complex layouts)
Quality: Premium (accuracy critical)
Schema: Text Generation
Generation Types: summary
Parsing Instructions: “Extract from tables, ignore headers, bold text indicates key terms”
Target Points: 500
Instructions: “Create summaries suitable for non-lawyers, highlighting obligations and termination clauses”
Scenario 3: Chatbot Response Quality Evaluation
Goal: Compare and rank chatbot responses for RLHF
Configuration:
Data Source: Asset Documents (chat transcripts)
Schema: Preference Ranking
Criteria: helpfulness, accuracy, tone, resolution
Number of Responses: 3
Target Points: 2,000
Instructions: “Evaluate customer service responses for billing and technical support inquiries”
Scenario 4: Product Review Topic Classification
Goal: Categorize product reviews by topics
Configuration:
Data Source: Asset Columns (review database)
Schema: Text Classification
Labels: Quality, Shipping, Price, Customer Service, Product Features
Multi-label: Enabled (reviews mention multiple topics)
Target Points: 3,000
Reference Context:
- product_category
- rating
- verified_purchase
Scenario 5: Knowledge Base Construction
Goal: Extract entities and relationships from technical docs
Configuration:
Data Source: Asset Documents (technical manuals)
Processing: Balanced mode (tables, diagrams)
Quality: Premium (technical accuracy)
Schema: Hybrid-Knowledge Domain (entity and relationship extraction)
Entity Types: component, process, specification, troubleshooting_step
Target Points: 1,500
Scenario 6: Fast Text Processing from Database
Goal: Quick processing of pre-extracted text
Configuration:
Data Source: Asset Columns (~60% faster)
Schema: Text Generation
Generation Types: Leave empty (auto-diverse)
Target Points: 5,000
Chunk Separator: Lines (for structured data)
6. Best Practices
6.1 Start Small, Scale Up
Begin with 100-500 data points for testing
Review quality and adjust configuration
Scale to 1,000-5,000+ for production
6.2 Be Specific with Instructions
Vague: “Generate good data”
Specific: “Generate customer service scenarios focusing on billing inquiries, prioritizing clear resolution steps”
6.3 Choose Right Data Source
Use Documents for: PDFs, reports, contracts, articles
Use Columns for: Database content, CSV imports, pre-extracted text
Columns are ~60% faster when applicable
6.4 Match Schema to Use Case
Text Classification: labeling content with categories (e.g., sentiment, review topics)
Preference Ranking: comparing and ranking responses (e.g., RLHF)
Conversation: assessing dialogue quality and characteristics
Hybrid-Knowledge Domain: extracting entities and relationships for knowledge bases
Text Generation: producing summaries, Q&A pairs, rewrites, and explanations
6.5 Optimize Chunk Separators
Paragraphs: Default for most documents
Lines: Best for lists and structured data
Sentences: For sentence-level analysis
None: For short, complete sections
6.6 Leverage Reference Context
Include metadata when it helps understanding
Don’t overuse (max 10 columns)
Use in generation instructions with <reference_context> tag
6.7 Balance Quality and Speed
Standard quality for initial testing
Premium quality for final production data
Fast processing for simple documents
Balanced/Manual for complex layouts
6.8 Validate Early
Use Preview button before generating full dataset
Check sample data points for quality
Adjust configuration based on preview results
7. Troubleshooting / FAQ
Issue: Actual data points fewer than target
Causes:
- Source data exhausted
- Chunk separator created fewer chunks
- Document processing filtered content
Solutions:
- Add more source data
- Adjust chunk separator (use smaller chunks)
- Reduce target number
Issue: Data points too short/long
Solution:
- Set Target Data Point Length explicitly
- Adjust chunk separator (None for longer, Sentences for shorter)
Issue: Poor quality generated content
Solutions:
- Upgrade to Premium quality mode
- Improve Data Generation Instructions (be more specific)
- Add Reference Context for better understanding
- Use Parsing Instructions for complex documents
Issue: Processing too slow
Solutions:
- Switch from Documents to Asset Columns (~60% faster)
- Use Fast processing mode instead of Balanced
- Switch from Premium to Standard quality
- Reduce Target Data Points for testing
Issue: Not enough context in data points
Solutions:
- Increase Window Size (Manual mode)
- Increase Overlap between windows
- Add Reference Context columns
- Change chunk separator to create larger chunks