Neurelo’s approach to mock data generation: Neurelo has developed technology for generating realistic mock data directly from database schemas, addressing key challenges in database testing and development.
- The company’s solution works with popular databases including MongoDB, MySQL, and Postgres, generating realistic data automatically without requiring user input.
- Neurelo prioritized low cost and fast response times, implementing the generator in native Rust for performance.
Initial challenges and pivots: The path to developing this technology was not without obstacles, prompting Neurelo to adapt their approach.
- An initial attempt to have Large Language Models (LLMs) generate the data-generation code directly failed due to poor code quality and unrealistic output data.
- The team identified a critical challenge: determining the correct insertion order for tables with foreign key relationships, which is essential for maintaining referential integrity.
Innovative solutions to complex problems: To overcome the insertion order challenge, Neurelo employed sophisticated algorithmic approaches.
- The team implemented Kahn’s algorithm for topological sorting, building a directed acyclic graph of table relationships (see the sketch after this list).
- This solution ensures that data is inserted in the correct order, maintaining referential integrity across the database.
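The source does not include Neurelo’s implementation, but the idea can be sketched in a few lines of Rust: represent the schema as a map from each table to the tables it references, then apply Kahn’s algorithm to produce a valid insertion order or detect a cycle. The table names below are hypothetical.

```rust
use std::collections::{HashMap, VecDeque};

/// Determine a safe insertion order for tables given foreign-key edges.
/// `deps` maps each table to the tables it references (which must be inserted first).
/// Returns None if the dependency graph contains a cycle.
fn insertion_order(deps: &HashMap<&str, Vec<&str>>) -> Option<Vec<String>> {
    // In-degree = number of tables a table references; dependents maps a
    // referenced table to the tables that reference it.
    let mut in_degree: HashMap<&str, usize> = HashMap::new();
    let mut dependents: HashMap<&str, Vec<&str>> = HashMap::new();

    for (&table, refs) in deps {
        in_degree.entry(table).or_insert(0);
        for &parent in refs {
            in_degree.entry(parent).or_insert(0);
            *in_degree.entry(table).or_insert(0) += 1;
            dependents.entry(parent).or_default().push(table);
        }
    }

    // Kahn's algorithm: start with tables that reference nothing.
    let mut queue: VecDeque<&str> = in_degree
        .iter()
        .filter(|(_, &d)| d == 0)
        .map(|(&t, _)| t)
        .collect();

    let mut order = Vec::new();
    while let Some(table) = queue.pop_front() {
        order.push(table.to_string());
        if let Some(children) = dependents.get(table) {
            for &child in children {
                let d = in_degree.get_mut(child).unwrap();
                *d -= 1;
                if *d == 0 {
                    queue.push_back(child);
                }
            }
        }
    }

    // If not every table was emitted, a cycle prevents a valid order.
    if order.len() == in_degree.len() { Some(order) } else { None }
}

fn main() {
    // Hypothetical schema: `orders` references `users` and `products`.
    let mut deps = HashMap::new();
    deps.insert("users", vec![]);
    deps.insert("products", vec![]);
    deps.insert("orders", vec!["users", "products"]);
    println!("{:?}", insertion_order(&deps)); // users and products before orders
}
```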
Technical implementation details: Neurelo’s approach combines multiple technologies and techniques to achieve accurate and efficient mock data generation.
- LLMs are utilized to map column names and types to appropriate faker methods, enabling the generation of realistic data for each field.
- The team developed a Rust-based faker module, analogous to Python’s faker library, to keep data generation within their native Rust implementation (a simplified sketch of the column-to-method dispatch follows this list).
- Careful handling of foreign key and primary key mapping, as well as unique constraints and references, ensures the generated data maintains proper relationships and uniqueness.
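As a rough illustration of that dispatch, the classification step can be thought of as producing a column-to-method mapping that a Rust generator then consumes. The category names and generators below are assumptions for illustration, not Neurelo’s actual module.

```rust
use std::collections::HashMap;

/// A generated-value category, as an LLM classifier might assign it
/// to a column based on its name and type (hypothetical labels).
#[derive(Clone, Copy)]
enum FakerMethod {
    FirstName,
    Email,
    Timestamp,
    Integer,
}

/// Produce a fake value for one column given the method chosen for it.
/// A real faker module would draw from word lists and random generators;
/// this sketch returns fixed samples just to show the dispatch shape.
fn generate(method: FakerMethod, row: usize) -> String {
    match method {
        FakerMethod::FirstName => format!("User{row}"),
        FakerMethod::Email => format!("user{row}@example.com"),
        FakerMethod::Timestamp => format!("2024-01-{:02}T00:00:00Z", (row % 28) + 1),
        FakerMethod::Integer => row.to_string(),
    }
}

fn main() {
    // Mapping that an LLM classification pass might return for a `users` table (assumed).
    let mapping: HashMap<&str, FakerMethod> = HashMap::from([
        ("first_name", FakerMethod::FirstName),
        ("email", FakerMethod::Email),
        ("created_at", FakerMethod::Timestamp),
        ("login_count", FakerMethod::Integer),
    ]);

    for row in 1..=3_usize {
        let record: Vec<String> = mapping
            .iter()
            .map(|(col, &m)| format!("{col}={}", generate(m, row)))
            .collect();
        println!("{}", record.join(", "));
    }
}
```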
Continuous improvement and version 2.0 enhancements: Neurelo has continued to refine and improve their technology, introducing significant upgrades in version 2.0.
- The classification pipeline now integrates table names, providing additional context for more accurate data generation.
- A “Genesis Point Strategy” was developed, using cross products to generate unique values efficiently and addressing the difficulty of maintaining uniqueness across large datasets (illustrated after this list).
- Zero-shot learning techniques were implemented to classify columns that don’t fit into existing categories, expanding the system’s ability to handle diverse schema structures.
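Neurelo has not published the details of the Genesis Point Strategy, but the underlying cross-product idea can be sketched as follows: crossing small seed pools (here, hypothetical name and domain lists plus a numeric suffix) yields a large supply of values that are unique by construction, with no bookkeeping of previously emitted values.

```rust
/// Seed pools whose cross product yields many unique combinations (illustrative data).
static NAMES: [&str; 4] = ["alice", "bob", "carol", "dave"];
static DOMAINS: [&str; 3] = ["example.com", "example.org", "example.net"];

/// Emit `count` unique email-like values by crossing the seed pools with a
/// numeric suffix, so uniqueness falls out of the construction rather than
/// from tracking what has already been generated.
fn unique_emails(count: usize) -> Vec<String> {
    (0u64..)
        .flat_map(|suffix| {
            NAMES.iter().flat_map(move |name| {
                DOMAINS
                    .iter()
                    .map(move |domain| format!("{name}{suffix}@{domain}"))
            })
        })
        .take(count)
        .collect()
}

fn main() {
    for email in unique_emails(5) {
        println!("{email}");
    }
}
```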
Future directions and ongoing development: Neurelo recognizes the evolving nature of database technologies and is committed to further enhancing their mock data generation capabilities.
- The team is working on optimizing for unique constraints, aiming to improve efficiency in scenarios requiring large numbers of unique values.
- Support for composite types and multi-schemas is in development, expanding the technology’s applicability to more complex database structures.
- Neurelo is exploring cheaper LLM strategies to reduce operational costs further while maintaining output quality.
Broader implications for database testing and development: Neurelo’s mock data generation technology has the potential to significantly impact database-related workflows across industries.
- By automating the creation of realistic test data, developers can more efficiently test and validate database-driven applications, potentially accelerating development cycles.
- The technology’s ability to work with multiple database types and generate data without user input could streamline cross-platform development and testing processes.
- As data privacy concerns continue to grow, tools like Neurelo’s that can generate realistic mock data may become increasingly valuable for testing and development scenarios where using real data poses risks.