🧹 DBClean Documentation

Complete guide to using DBClean for AI-powered CSV data cleaning, standardization, and ML data preparation. This documentation walks you through turning messy datasets into clean, production-ready data.

What is DBClean?

DBClean is a powerful command-line tool that automatically cleans, standardizes, and restructures your CSV data using advanced AI models. Perfect for data scientists, analysts, and anyone working with messy datasets.

Whether you're preparing data for machine learning, cleaning customer databases, or standardizing business data, DBClean provides the tools you need. Get started with our interactive platform or jump straight to the installation guide.

πŸ“ Project Structure

After processing, your workspace will look like this:

bash
your-project/
├── data.csv                  # Your original input file
├── data/
│   ├── data_cleaned.csv      # After preclean step
│   ├── data_deduped.csv      # After duplicate removal
│   ├── data_stitched.csv     # Final cleaned dataset
│   ├── train.csv             # Training set (70%)
│   ├── validate.csv          # Validation set (15%)
│   └── test.csv              # Test set (15%)
├── settings/
│   ├── instructions.txt      # Custom AI instructions
│   └── exclude_columns.txt   # Columns to skip in preclean
└── outputs/
    ├── architect_output.txt  # AI schema design
    ├── column_mapping.json   # Column transformations
    ├── cleaned_columns/      # Individual column results
    ├── cleaner_changes_analysis.html
    └── dedupe_report.txt

✨ Features

🤖 AI-Powered Cleaning

Uses advanced language models to intelligently clean and standardize data

🏗️ Schema Design

Automatically creates optimal database schemas from your data

🔍 Duplicate Detection

AI-powered duplicate identification and removal

🎯 Outlier Detection

Uses Isolation Forest to identify and remove anomalies

✂️ Data Splitting

Automatically splits cleaned data into training, validation, and test sets

🔄 Full Pipeline

Complete automation from raw CSV to clean, structured data

📊 Column-by-Column Processing

Detailed cleaning and standardization of individual columns

🎯 Model Selection

Choose from multiple AI models for different tasks

📋 Custom Instructions

Guide the AI with your specific cleaning requirements

💰 Credit-Based Billing

Pay only for what you use with transparent pricing

💳 Credit System

DBClean uses a transparent, pay-as-you-go credit system.
  • Free Tier: 5 free requests per month for new users
  • Minimum Balance: $0.01 required for paid requests
  • Precision: 4 decimal places (charges as low as $0.0001)
  • Pricing: Based on actual AI model costs with no markup
  • Billing: Credits deducted only after successful processing

Check your balance anytime with dbclean credits or get a complete overview with dbclean account.
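
For example, a quick balance and usage check before kicking off a large run:

bash
# Confirm you have credits available before a big job
dbclean credits

# Full account overview: credits, usage, status
dbclean account

# Per-service and per-model breakdown of spending
dbclean usage --detailed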

🚀 Quick Start

1. Install DBClean

bash
npm install -g @dbclean/cli

2. Initialize Your Account

bash
dbclean init

Enter your email and API key when prompted. Don't have an account? Sign up at dbclean.dev.

3. Verify Setup

bash
dbclean test-auth
dbclean account

4. Process Your Data

bash
# Place your CSV file as data.csv in your current directory
dbclean run

Your cleaned data will be available in data/data_stitched.csv 🎉

📖 Command Reference

🔧 Setup & Authentication

Command           | Description
----------------- | ---------------------------------------
dbclean init      | Initialize with your email and API key
dbclean test-auth | Verify your credentials are working
dbclean logout    | Remove stored credentials
dbclean status    | Check API key status and account info

💰 Account Management

Command                  | Description
------------------------ | ---------------------------------------------------
dbclean account          | Complete account overview (credits, usage, status)
dbclean credits          | Check your current credit balance
dbclean usage            | View API usage statistics
dbclean usage --detailed | Detailed breakdown by service and model
dbclean models           | List all available AI models

📊 Data Processing Pipeline

Command           | Description
----------------- | ----------------------------------------------------
dbclean run       | Execute complete pipeline (recommended)
dbclean preclean  | Clean CSV data (remove newlines, special chars)
dbclean architect | AI-powered schema design and standardization
dbclean dedupe    | AI-powered duplicate detection and removal
dbclean cleaner   | AI-powered column-by-column data cleaning
dbclean stitcher  | Combine all changes into final CSV
dbclean isosplit  | Detect outliers and split into train/validate/test

🔄 Complete Pipeline

The recommended approach is to use the full pipeline with dbclean run.

bash
# Basic full pipeline
dbclean run

# With custom AI model
dbclean run -m "gemini-2.0-flash-exp"

# Different models for different steps
dbclean run --model-architect "gemini-2.0-flash-thinking" --model-cleaner "gemini-2.0-flash-exp"

# With custom instructions and larger sample
dbclean run -i -x 10

# Skip certain steps
dbclean run --skip-preclean --skip-dedupe

Pipeline Steps

  1. Preclean - Prepares raw CSV by removing problematic characters and formatting
  2. Architect - AI analyzes your data structure and creates optimized schema
  3. Dedupe - AI identifies and removes duplicate records intelligently
  4. Cleaner - AI processes each column to standardize and clean data
  5. Stitcher - Combines all improvements into final dataset
  6. Isosplit - Removes outliers and splits data for machine learning
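
If you prefer to run the stages individually, for example to review the architect output before the cleaner runs, the same sequence can be executed step by step:

bash
# Equivalent to `dbclean run`, executed one stage at a time
dbclean preclean    # 1. Prepare the raw CSV
dbclean architect   # 2. Design the schema from a sample
dbclean dedupe      # 3. Remove duplicate records
dbclean cleaner     # 4. Clean each column
dbclean stitcher    # 5. Combine changes into the final CSV
dbclean isosplit    # 6. Remove outliers and split for ML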

πŸŽ›οΈ Command Options

Model Selection

-m <model> - Use same model for all AI steps

--model-architect <model> - Specific model for architect step

--model-cleaner <model> - Specific model for cleaner step

Processing Options

-x <number> - Sample size for architect analysis (default: 5)

-i - Use custom instructions from settings/instructions.txt

--input <file> - Specify input CSV file (default: data.csv)

Skip Options

--skip-preclean - Skip data preparation step

--skip-architect - Skip schema design step

--skip-dedupe - Skip duplicate detection step

--skip-cleaner - Skip column cleaning step

--skip-isosplit - Skip outlier detection and data splitting
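
These options can be combined freely. A couple of examples:

bash
# Larger architect sample, custom instructions, non-default input file
dbclean run -x 10 -i --input customer_data.csv

# Re-run only the column cleaning and stitching on already-prepared data
dbclean run --skip-preclean --skip-architect --skip-dedupe --skip-isosplit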

🤖 AI Models

Recommended Models

Model                     | Best For                         | Speed | Cost
------------------------- | -------------------------------- | ----- | ------
gemini-2.0-flash-exp      | General purpose, fast processing | ⚡⚡⚡   | 💲
gemini-2.0-flash-thinking | Complex data analysis            | ⚡⚡    | 💲💲
gemini-1.5-pro            | Large, complex datasets          | ⚡     | 💲💲💲

Model Selection Tips

  • For speed and cost: Use gemini-2.0-flash-exp
  • For complex, messy data: Use gemini-2.0-flash-thinking for architect
  • For mixed workloads: Use different models per step with --model-architect and --model-cleaner

bash
# List all available models
dbclean models

πŸ“ Custom Instructions

Create custom cleaning instructions to guide the AI.

  1. For architect step: Use the -i flag with a settings/instructions.txt file.
  2. Example instructions:
    txt
    - Standardize all phone numbers to E.164 format (+1XXXXXXXXXX)
    - Convert all dates to YYYY-MM-DD format
    - Normalize company names (remove Inc, LLC, etc.)
    - Flag any entries with missing critical information
    - Ensure email addresses are properly formatted

bash
dbclean run -i  # Uses instructions from settings/instructions.txt

💡 Usage Examples

Basic Processing

bash
# Process a CSV file with default settings
dbclean run

# Use a specific input file
dbclean run --input customer_data.csv

Advanced Processing

bash
# High-quality processing with larger sample
dbclean run -m "gemini-2.0-flash-thinking" -x 15 -i

# Fast processing for large datasets
dbclean run -m "gemini-2.0-flash-exp" --skip-dedupe

# Custom pipeline - architect only
dbclean run --skip-preclean --skip-cleaner --skip-dedupe --skip-isosplit

Individual Steps

bash
# Run architect with custom model and sample size
dbclean architect -m "gemini-2.0-flash-thinking" -x 10 -i

# Clean data with specific model
dbclean cleaner -m "gemini-2.0-flash-exp"

# Remove duplicates with AI analysis
dbclean dedupe

🎯 Best Practices

1. Start Small and Iterate

bash
# Test with small sample first
dbclean architect -x 3

# Review outputs, then run full pipeline
dbclean run

2. Choose the Right Models

bash
# For complex schema design
dbclean run --model-architect "gemini-2.0-flash-thinking" --model-cleaner "gemini-2.0-flash-exp"

3. Use Custom Instructions

Create settings/instructions.txt with domain-specific requirements:

txt
Finance data requirements:
- Currency amounts in USD format ($X,XXX.XX)
- Account numbers must be 10-12 digits
- Transaction dates in YYYY-MM-DD format

4. Monitor Your Usage

bash
# Check account status regularly
dbclean account

# Monitor detailed usage
dbclean usage --detailed

❗ Troubleshooting

Common Issues

Authentication Problems

bash
dbclean init      # Re-enter credentials
dbclean test-auth # Verify connection

Data File Issues

  • Ensure data.csv exists in current directory
  • Use --input <file> for different file names
  • Check file permissions and encoding

API Limits

  • Check credit balance: dbclean credits
  • View usage: dbclean usage
  • Free tier: 5 requests per month, then paid credits required

Model Availability

bash
dbclean models   # See available models

Getting Help

bash
dbclean --help              # General help
dbclean run --help          # Command-specific help
dbclean help-commands       # Detailed command reference

📊 Output Files

After processing, you'll have:

  • data/data_stitched.csv - Your final, cleaned dataset
  • data/train.csv - Training data (70%)
  • data/validate.csv - Validation data (15%)
  • data/test.csv - Test data (15%)
  • outputs/cleaner_changes_analysis.html - Visual changes report
  • outputs/architect_output.txt - AI schema analysis
  • outputs/column_mapping.json - Column transformation details
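
A quick look with standard shell tools confirms the split sizes and lets you spot-check the final output:

bash
# Row counts for the final dataset and the train/validate/test splits
wc -l data/data_stitched.csv data/train.csv data/validate.csv data/test.csv

# Preview the first few rows of the cleaned data
head -n 5 data/data_stitched.csv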

❓ Frequently Asked Questions

Why would anyone need AI to clean CSVs or text files? Why would anyone pay for this service?

While traditional data cleaning approaches work well for structured datasets with predictable patterns, many organizations, particularly smaller companies, encounter datasets that are fundamentally unstructured or semi-structured. Our solution emerged from a real-world scenario where a small company had collected data through Google Forms, resulting in a dataset that was "insanely messy with no structure, basically unusable."

For these organizations, traditional enterprise-grade data cleaning solutions are often cost-prohibitive and unnecessarily complex. Our AI-powered approach provides an accessible middle ground, offering intelligent data structuring capabilities that can extract meaningful information from chaotic datasets. For instance, if a column contains age data embedded within surrounding text, our AI can intelligently parse and extract the numerical age values while preserving data integrity.

Isn't data cleanliness a "solved" problem with existing tools?

Data cleanliness is indeed well-addressed for structured, predictable datasets through rule-based systems and established ETL processes. However, the challenge lies in semi-structured or completely unstructured data where patterns are inconsistent or non-obvious.

Our approach handles standard cleaning operations (whitespace removal, basic formatting) through traditional pre-processing steps without AI involvement. The AI component focuses specifically on the complex task of imposing structure on unstructured data, a problem that remains challenging for rule-based systems when dealing with highly variable input formats.

Isn't it concerning to let AI autonomously restructure datasets? Won't it hallucinate improvements?

This concern is absolutely valid and represents one of our core engineering challenges. We rely on careful prompt engineering and validation mechanisms, and we take a conservative approach: when the AI cannot confidently process a value, it leaves it unchanged and flags it for subsequent regex validation. We maintain a comprehensive audit log that documents every change made and every change that was rejected for failing validation checks. This transparency ensures that users can review and verify all modifications made to their data.

Wouldn't rule-based methods with logical constraints perform better?

Rule-based systems excel when data patterns are predictable and well-defined. For many standard cleaning operations, they are indeed more efficient and reliable. However, when dealing with truly unstructured data where patterns vary significantly, rule-based approaches require extensive manual configuration and often fail to handle edge cases.

Our hybrid approach leverages the strengths of both methodologies: we use traditional methods for standard cleaning operations and apply AI specifically to pattern recognition and structure extraction tasks that would be prohibitively complex to encode as explicit rules.

Doesn't AI processing create excessive overhead, especially for large datasets?

Computational efficiency is a legitimate concern with transformer-based approaches. We've addressed this through a multi-stage optimization strategy designed to minimize AI processing overhead:

  • Stage 1: Schema development and initial cleaning on a representative sample (not the full dataset)
  • Stage 2: Fuzzy deduplication targeting only potential duplicates
  • Stage 3: Column-specific cleaning for data that failed regex validation, processed in concurrent batches

This approach ensures that the most computationally expensive AI operations run only on subsets of data that actually require intelligent processing, rather than the entire dataset.

What are the speed and scalability limitations of your AI models?

Processing speed remains one of our primary optimization challenges. Our current architecture is designed with the following performance characteristics:

  • Stage 1: Always operates on a fixed sample size regardless of dataset size
  • Stage 2: Processes only identified duplicate candidates
  • Stage 3: Supports concurrent batch processing for column cleaning

While we optimize for datasets exceeding one million rows, we acknowledge that extensive performance testing and validation are still required. Speed optimization while maintaining accuracy represents our most significant ongoing engineering challenge.

What is your target market and use case?

We've positioned our solution primarily for machine learning researchers and data scientists who are comfortable with command-line tools and need to quickly structure messy datasets for analysis. Our goal is not to replace comprehensive data engineering teams, but rather to accelerate their initial data preparation phases and reduce manual intervention requirements.

The technology is not yet capable of 100% comprehensive cleaning, and experienced data engineers will likely achieve superior results for complex scenarios. However, our solution aims to provide a significant head start, particularly for teams that need to quickly assess and prepare datasets for exploratory analysis.

How do you ensure data quality and prevent AI hallucination?

We've implemented several safeguards to maintain data integrity:

  • Schema instruction capability: Users can pre-define rules and constraints that guide AI processing
  • Conservative processing: When uncertain, the AI preserves original values rather than guessing
  • Multi-stage validation: Each stage includes validation checks appropriate to its function
  • Comprehensive audit logging: Complete transparency of all changes and rejections
  • Regex verification: Final validation layer to catch processing errors

These measures ensure that while AI handles complex pattern recognition, traditional validation methods maintain data quality standards.
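
For a sense of what a final regex layer of this kind looks for, here is a rough, standalone illustration (not DBClean's internal implementation; the column number and date format are assumptions made for the example):

bash
# Illustration only, not DBClean internals: assume column 3 should hold
# YYYY-MM-DD dates, and print any row where the pattern does not match.
awk -F',' 'NR > 1 && $3 !~ /^[0-9]{4}-[0-9]{2}-[0-9]{2}$/ {print "row " NR ": " $3}' data/data_stitched.csv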

🤝 Support

  • Documentation
  • GitHub Issues
  • API Status: Check real-time status and get your API key

This project is licensed under the MIT License - see the LICENSE file for details.

Ready to clean your data? Start with dbclean init and transform your messy CSV files into pristine datasets! 🚀