🧹 DBClean Documentation
Complete guide to using DBClean for AI-powered CSV data cleaning, standardization, and ML data preparation. Use this documentation to turn messy datasets into clean, production-ready data.
What is DBClean?
DBClean is a powerful command-line tool that automatically cleans, standardizes, and restructures your CSV data using advanced AI models. Perfect for data scientists, analysts, and anyone working with messy datasets.
Whether you're preparing data for machine learning, cleaning customer databases, or standardizing business data, DBClean provides the tools you need. Get started with our interactive platform or jump straight to the installation guide.
📁 Project Structure
After processing, your workspace will look like this:
your-project/
├── data.csv                 # Your original input file
├── data/
│   ├── data_cleaned.csv     # After preclean step
│   ├── data_deduped.csv     # After duplicate removal
│   ├── data_stitched.csv    # Final cleaned dataset
│   ├── train.csv            # Training set (70%)
│   ├── validate.csv         # Validation set (15%)
│   └── test.csv             # Test set (15%)
├── settings/
│   ├── instructions.txt     # Custom AI instructions
│   └── exclude_columns.txt  # Columns to skip in preclean
└── outputs/
    ├── architect_output.txt # AI schema design
    ├── column_mapping.json  # Column transformations
    ├── cleaned_columns/     # Individual column results
    ├── cleaner_changes_analysis.html
    └── dedupe_report.txt
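The settings files are plain text and can be created by hand before a run. As a minimal sketch for skipping columns during preclean (the one-column-name-per-line format and the column names shown are assumptions for illustration, not documented behavior):
mkdir -p settings
# Hypothetical column names; one name per line is an assumed format
printf '%s\n' "internal_notes" "raw_comments" > settings/exclude_columns.txt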
✨ Features
- Uses advanced language models to intelligently clean and standardize data
- Automatically creates optimal database schemas from your data
- AI-powered duplicate identification and removal
- Uses Isolation Forest to identify and remove anomalies
- Automatically splits cleaned data into training, validation, and test sets
- Complete automation from raw CSV to clean, structured data
- Detailed cleaning and standardization of individual columns
- Choose from multiple AI models for different tasks
- Guide the AI with your specific cleaning requirements
- Pay only for what you use, with transparent pricing:
  - Free Tier: 5 free requests per month for new users
  - Minimum Balance: $0.01 required for paid requests
  - Precision: 4 decimal places (charges as low as $0.0001)
  - Pricing: Based on actual AI model costs with no markup
  - Billing: Credits deducted only after successful processing
Check your balance anytime with dbclean credits, or get a complete overview with dbclean account.
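Because billing tracks four decimal places with no markup, a run whose underlying model usage comes to, say, $0.0023 deducts exactly $0.0023 (the figure is illustrative, not an actual DBClean price). You can verify deductions around a run:
dbclean credits           # balance before the run
dbclean run
dbclean credits           # balance after; the difference is the model cost of the run
dbclean usage --detailed  # per-service and per-model breakdown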
🚀 Quick Start
1. Install DBClean
npm install -g @dbclean/cli
2. Initialize Your Account
dbclean init
Enter your email and API key when prompted. Don't have an account? Sign up at dbclean.dev
3. Verify Setup
dbclean test-auth
dbclean account
4. Process Your Data
# Place your CSV file as data.csv in your current directory
dbclean run
Your cleaned data will be available in data/data_stitched.csv 🎉
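A quick way to sanity-check the results from the shell (standard tools, nothing DBClean-specific):
ls data/                                               # cleaned dataset plus train/validate/test splits
head -n 5 data/data_stitched.csv                       # peek at the final cleaned file
wc -l data/train.csv data/validate.csv data/test.csv   # roughly a 70/15/15 split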
📚 Command Reference
🔧 Setup & Authentication
Command | Description |
---|---|
dbclean init | Initialize with your email and API key |
dbclean test-auth | Verify your credentials are working |
dbclean logout | Remove stored credentials |
dbclean status | Check API key status and account info |
💰 Account Management
Command | Description |
---|---|
dbclean account | Complete account overview (credits, usage, status) |
dbclean credits | Check your current credit balance |
dbclean usage | View API usage statistics |
dbclean usage --detailed | Detailed breakdown by service and model |
dbclean models | List all available AI models |
📊 Data Processing Pipeline
Command | Description |
---|---|
dbclean run | Execute complete pipeline (recommended) |
dbclean preclean | Clean CSV data (remove newlines, special chars) |
dbclean architect | AI-powered schema design and standardization |
dbclean dedupe | AI-powered duplicate detection and removal |
dbclean cleaner | AI-powered column-by-column data cleaning |
dbclean stitcher | Combine all changes into final CSV |
dbclean isosplit | Detect outliers and split into train/validate/test |
🔄 Complete Pipeline
The recommended approach is to use the full pipeline with dbclean run.
# Basic full pipeline
dbclean run
# With custom AI model
dbclean run -m "gemini-2.0-flash-exp"
# Different models for different steps
dbclean run --model-architect "gemini-2.0-flash-thinking" --model-cleaner "gemini-2.0-flash-exp"
# With custom instructions and larger sample
dbclean run -i -x 10
# Skip certain steps
dbclean run --skip-preclean --skip-dedupe
Pipeline Steps
1. Preclean - Prepares the raw CSV by removing problematic characters and formatting
2. Architect - AI analyzes your data structure and creates an optimized schema
3. Dedupe - AI identifies and removes duplicate records intelligently
4. Cleaner - AI processes each column to standardize and clean data
5. Stitcher - Combines all improvements into the final dataset
6. Isosplit - Removes outliers and splits data for machine learning
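If you prefer to run the stages yourself rather than using dbclean run, the individual commands can be executed in the same order (a sketch; dbclean run handles sequencing and intermediate files for you):
dbclean preclean
dbclean architect
dbclean dedupe
dbclean cleaner
dbclean stitcher
dbclean isosplit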
🎛️ Command Options
Option | Description |
---|---|
-m <model> | Use same model for all AI steps |
--model-architect <model> | Specific model for architect step |
--model-cleaner <model> | Specific model for cleaner step |
-x <number> | Sample size for architect analysis (default: 5) |
-i | Use custom instructions from settings/instructions.txt |
--input <file> | Specify input CSV file (default: data.csv) |
--skip-preclean | Skip data preparation step |
--skip-architect | Skip schema design step |
--skip-dedupe | Skip duplicate detection step |
--skip-cleaner | Skip column cleaning step |
--skip-isosplit | Skip outlier detection and data splitting |
🤖 AI Models
Recommended Models
Model | Best For | Speed | Cost |
---|---|---|---|
gemini-2.0-flash-exp | General purpose, fast processing | ⚡⚡⚡ | 💲 |
gemini-2.0-flash-thinking | Complex data analysis | ⚡⚡ | 💲💲 |
gemini-1.5-pro | Large, complex datasets | ⚡ | 💲💲💲 |
Model Selection Tips
- For speed and cost: Use gemini-2.0-flash-exp
- For complex, messy data: Use gemini-2.0-flash-thinking for architect
- For mixed workloads: Use different models per step with --model-architect and --model-cleaner
# List all available models
dbclean models
📝 Custom Instructions
Create custom cleaning instructions to guide the AI.
- For the architect step: Use the -i flag with a settings/instructions.txt file.
- Example instructions:
  - Standardize all phone numbers to E.164 format (+1XXXXXXXXXX)
  - Convert all dates to YYYY-MM-DD format
  - Normalize company names (remove Inc, LLC, etc.)
  - Flag any entries with missing critical information
  - Ensure email addresses are properly formatted
dbclean run -i # Uses instructions from settings/instructions.txt
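A minimal sketch of setting this up from the shell, using the example instructions above:
mkdir -p settings
cat > settings/instructions.txt << 'EOF'
- Standardize all phone numbers to E.164 format (+1XXXXXXXXXX)
- Convert all dates to YYYY-MM-DD format
- Normalize company names (remove Inc, LLC, etc.)
- Flag any entries with missing critical information
- Ensure email addresses are properly formatted
EOF
dbclean run -i   # architect step reads settings/instructions.txt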
💡 Usage Examples
Basic Processing
# Process a CSV file with default settings
dbclean run
# Use a specific input file
dbclean run --input customer_data.csv
Advanced Processing
# High-quality processing with larger sample
dbclean run -m "gemini-2.0-flash-thinking" -x 15 -i
# Fast processing for large datasets
dbclean run -m "gemini-2.0-flash-exp" --skip-dedupe
# Custom pipeline - architect only
dbclean run --skip-preclean --skip-cleaner --skip-dedupe --skip-isosplit
Individual Steps
# Run architect with custom model and sample size
dbclean architect -m "gemini-2.0-flash-thinking" -x 10 -i
# Clean data with specific model
dbclean cleaner -m "gemini-2.0-flash-exp"
# Remove duplicates with AI analysis
dbclean dedupe
🎯 Best Practices
# Test with small sample first
dbclean architect -x 3
# Review outputs, then run full pipeline
dbclean run
# For complex schema design
dbclean run --model-architect "gemini-2.0-flash-thinking" --model-cleaner "gemini-2.0-flash-exp"
Create settings/instructions.txt with domain-specific requirements:
Finance data requirements:
- Currency amounts in USD format ($X,XXX.XX)
- Account numbers must be 10-12 digits
- Transaction dates in YYYY-MM-DD format
# Check account status regularly
dbclean account
# Monitor detailed usage
dbclean usage --detailed
❗ Troubleshooting
Common Issues
Authentication Problems
dbclean init # Re-enter credentials
dbclean test-auth # Verify connection
Data File Issues
- Ensure data.csv exists in the current directory
- Use --input <file> for different file names
- Check file permissions and encoding
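A few quick shell checks for these issues (standard tools, not DBClean commands except the last line):
ls -l data.csv                           # confirm the file exists and is readable
file data.csv                            # inspect file type and encoding
dbclean run --input customer_data.csv    # point DBClean at a differently named file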
API Limits
- Check credit balance: dbclean credits
- View usage: dbclean usage
- Free tier: 5 requests per month, then paid credits required
Model Availability
dbclean models # See available models
Getting Help
dbclean --help # General help
dbclean run --help # Command-specific help
dbclean help-commands # Detailed command reference
📄 Output Files
After processing, you'll have:
- data/data_stitched.csv - Your final, cleaned dataset
- data/train.csv - Training data (70%)
- data/validate.csv - Validation data (15%)
- data/test.csv - Test data (15%)
- outputs/cleaner_changes_analysis.html - Visual changes report
- outputs/architect_output.txt - AI schema analysis
- outputs/column_mapping.json - Column transformation details
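To review the reports after a run (a sketch; python3 is used only to pretty-print the JSON, and no particular structure of the mapping file is assumed):
less outputs/architect_output.txt                  # read the AI schema analysis
python3 -m json.tool outputs/column_mapping.json   # pretty-print the column mapping
open outputs/cleaner_changes_analysis.html         # macOS; use xdg-open on Linux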
❓ Frequently Asked Questions
Why would anyone need AI to clean CSVs or text files? Why would anyone pay for this service?
While traditional data cleaning approaches work well for structured datasets with predictable patterns, many organizations, particularly smaller companies, encounter datasets that are fundamentally unstructured or semi-structured. Our solution emerged from a real-world scenario where a small company had collected data through Google Forms, resulting in a dataset that was "insanely messy with no structure, basically unusable."
For these organizations, traditional enterprise-grade data cleaning solutions are often cost-prohibitive and unnecessarily complex. Our AI-powered approach provides an accessible middle ground, offering intelligent data structuring capabilities that can extract meaningful information from chaotic datasets. For instance, if a column contains age data embedded within surrounding text, our AI can intelligently parse and extract the numerical age values while preserving data integrity.
Isn't data cleanliness a "solved" problem with existing tools?
Data cleanliness is indeed well-addressed for structured, predictable datasets through rule-based systems and established ETL processes. However, the challenge lies in semi-structured or completely unstructured data where patterns are inconsistent or non-obvious.
Our approach handles standard cleaning operations (whitespace removal, basic formatting) through traditional pre-processing steps without AI involvement. The AI component focuses specifically on the complex task of imposing structure on unstructured data, a problem that remains challenging for rule-based systems when dealing with highly variable input formats.
Isn't it concerning to let AI autonomously restructure datasets? Won't it hallucinate improvements?
This concern is absolutely valid and represents one of our core engineering challenges. We rely on careful prompt engineering and validation mechanisms and take a conservative approach: when the AI cannot confidently process a value, it leaves it unchanged and flags it for subsequent regex validation. We maintain a comprehensive audit log that documents every change made and every change that was rejected for failing validation checks. This transparency ensures that users can review and verify all modifications made to their data.
Wouldn't rule-based methods with logical constraints perform better?
Rule-based systems excel when data patterns are predictable and well-defined. For many standard cleaning operations, they are indeed more efficient and reliable. However, when dealing with truly unstructured data where patterns vary significantly, rule-based approaches require extensive manual configuration and often fail to handle edge cases.
Our hybrid approach leverages the strengths of both methodologies: we use traditional methods for standard cleaning operations and apply AI specifically to pattern recognition and structure extraction tasks that would be prohibitively complex to encode as explicit rules.
Doesn't AI processing create excessive overhead, especially for large datasets?
Computational efficiency is a legitimate concern with transformer-based approaches. We've addressed this through a multi-stage optimization strategy designed to minimize AI processing overhead:
- Stage 1: Schema development and initial cleaning on a representative sample (not the full dataset)
- Stage 2: Fuzzy deduplication targeting only potential duplicates
- Stage 3: Column-specific cleaning for data that failed regex validation, processed in concurrent batches
This approach ensures that the most computationally expensive AI operations run only on subsets of data that actually require intelligent processing, rather than the entire dataset.
What are the speed and scalability limitations of your AI models?
Processing speed remains one of our primary optimization challenges. Our current architecture is designed with the following performance characteristics:
- Stage 1: Always operates on a fixed sample size regardless of dataset size
- Stage 2: Processes only identified duplicate candidates
- Stage 3: Supports concurrent batch processing for column cleaning
While we optimize for datasets exceeding one million rows, we acknowledge that extensive performance testing and validation are still required. Speed optimization while maintaining accuracy represents our most significant ongoing engineering challenge.
What is your target market and use case?
We've positioned our solution primarily for machine learning researchers and data scientists who are comfortable with command-line tools and need to quickly structure messy datasets for analysis. Our goal is not to replace comprehensive data engineering teams, but rather to accelerate their initial data preparation phases and reduce manual intervention requirements.
The technology is not yet capable of 100% comprehensive cleaning, and experienced data engineers will likely achieve superior results for complex scenarios. However, our solution aims to provide a significant head start, particularly for teams that need to quickly assess and prepare datasets for exploratory analysis.
How do you ensure data quality and prevent AI hallucination?
We've implemented several safeguards to maintain data integrity:
- Schema instruction capability: Users can pre-define rules and constraints that guide AI processing
- Conservative processing: When uncertain, the AI preserves original values rather than guessing
- Multi-stage validation: Each stage includes validation checks appropriate to its function
- Comprehensive audit logging: Complete transparency of all changes and rejections
- Regex verification: Final validation layer to catch processing errors
These measures ensure that while AI handles complex pattern recognition, traditional validation methods maintain data quality standards.