AI Automates Weeks of Manual Data Labeling
Based on research by Shahar Levy, Eliya Habba, Reshef Mintz, Barak Raveh, Renana Keydar
Answering complex research questions often requires digging through massive document collections to find structured evidence. Traditionally, this means manually designing a labeling scheme and tagging every document by hand, a process that is slow and prone to human error. Researchers have now introduced ScheMatiQ, a tool that uses large language models to build these labeling systems automatically, from scratch.
ScheMatiQ takes a natural-language question and a corpus of text, then generates a custom schema and a database whose entries are grounded in the source documents. A built-in web interface lets users steer the extraction process and revise results in real time, so they can correct errors without deep technical expertise. The result is structured data, derived from raw text, that is ready for analysis.
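To make the idea of a schema and a grounded database concrete, here is a minimal Python sketch. It is not ScheMatiQ's code or API: the class names (SchemaField, Cell, Row), the is_grounded check, and the sample question and document are all hypothetical, and the extracted row is written by hand instead of being produced by a language model. The sketch assumes that "grounded" means each extracted value carries a verbatim quote from its source document.

```python
from dataclasses import dataclass, field


@dataclass
class SchemaField:
    """One column of a generated schema (hypothetical structure)."""
    name: str
    description: str


@dataclass
class Cell:
    """An extracted value plus the verbatim quote that supports it."""
    value: str
    evidence: str


@dataclass
class Row:
    """All extracted cells for one source document."""
    doc_id: str
    cells: dict[str, Cell] = field(default_factory=dict)


def is_grounded(row: Row, corpus: dict[str, str]) -> bool:
    """True if every cell has a non-empty quote found verbatim in its document."""
    text = corpus.get(row.doc_id, "")
    return all(
        bool(cell.evidence) and cell.evidence in text
        for cell in row.cells.values()
    )


# Illustrative inputs: a question, a tiny corpus, and a schema a model might propose.
question = "Which court decided each case, and what was the outcome?"
corpus = {
    "case-001": "The District Court dismissed the claim for lack of standing.",
}
schema = [
    SchemaField("court", "Court that issued the decision"),
    SchemaField("outcome", "How the case was resolved"),
]

# A row as an extraction step might return it (hand-written here, no model call).
row = Row(
    doc_id="case-001",
    cells={
        "court": Cell("District Court", "The District Court dismissed the claim"),
        "outcome": Cell("dismissed", "dismissed the claim for lack of standing"),
    },
)

print(is_grounded(row, corpus))
```

A check of this kind is one simple way to flag rows for human review: a value whose supporting quote cannot be found in the source text is a candidate for correction.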
The tool has already been applied in law and computational biology, where it supports real-world analysis alongside domain experts. By automating work that used to take weeks of manual labor, ScheMatiQ puts rigorous data extraction within reach of anyone with a research question and a document collection. It is released as open source, and experts in other disciplines are invited to apply it to their own data.
The takeaway: the gap between asking a question and getting structured answers is narrowing. With ScheMatiQ openly available, researchers can spend more time on discovery and less on tedious data preparation, in science and law alike.