Tamr Patents
Our commitment to innovation extends beyond mere words. We're inspired to bring cutting-edge solutions that transform industries and make a tangible difference. Our extensive portfolio of patents is a testament to this commitment. These patents demonstrate our relentless pursuit of original ideas. They stand as proof of our continuous innovation, our propensity to challenge the status quo, and our resolve to drive the future.
Insights
Answer the questions “what changed?” and “how accurate is my ML model?”
Review and Curation of Record Clustering (1 patent in family)
One of the most basic questions when updating a data product is “what has changed since the last version?” Tamr’s innovation in automated survivorship at scale answers this by reliably tracking changes, enabling the automated creation and maintenance of TamrIDs - one of our most popular features. This capability allows customers to visualize and review changes driven by source data updates, machine learning models, or human curation. Tamr’s method computes cluster changes at large scale, aligning and comparing clusterings of overlapping data. It enables scalable re-clustering and propagates cluster IDs through changes for efficient review and feedback.
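As a rough illustration of propagating cluster IDs through changes, the Python sketch below aligns a new clustering with an old one by record overlap. It is a minimal example under stated assumptions, not Tamr's patented method: the function name `propagate_cluster_ids`, the greedy largest-overlap rule, and the `new-N` naming for fresh IDs are all hypothetical.

```python
from collections import Counter


def propagate_cluster_ids(old_assignment, new_clusters):
    """Carry persistent IDs from an old clustering over to a new one.

    old_assignment: dict mapping record ID -> old cluster ID
    new_clusters:   list of sets of record IDs (the new clustering)
    Returns a dict mapping each new cluster's list index -> cluster ID.
    """
    # Count how many records each new cluster shares with each old cluster.
    candidates = []
    for idx, members in enumerate(new_clusters):
        overlap = Counter(
            old_assignment[r] for r in members if r in old_assignment
        )
        for old_id, shared in overlap.items():
            candidates.append((shared, old_id, idx))

    # Greedy alignment (an assumption): largest overlap wins, and each old ID
    # and each new cluster is used at most once.
    candidates.sort(key=lambda c: (-c[0], str(c[1]), c[2]))
    assigned, used_old = {}, set()
    for shared, old_id, idx in candidates:
        if idx in assigned or old_id in used_old:
            continue
        assigned[idx] = old_id
        used_old.add(old_id)

    # New clusters that inherited nothing get a fresh, hypothetical ID.
    fresh = 0
    for idx in range(len(new_clusters)):
        if idx not in assigned:
            assigned[idx] = f"new-{fresh}"
            fresh += 1
    return assigned
```

With an old assignment of `{"r1": "A", "r2": "A", "r3": "A", "r4": "B"}` and new clusters `[{"r1", "r2"}, {"r3", "r4"}, {"r5"}]`, the first cluster keeps ID `A`, the second inherits `B`, and the third receives a fresh ID. A production system would also need to guarantee that fresh IDs never collide with existing ones.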
Unbiased Cluster Accuracy Metrics (1 patent in family)
A major challenge in AI-powered data unification is reliably assessing model accuracy. Existing methods often introduce biases or fall short in practical scenarios, making it difficult to gauge true performance. Tamr tackles this problem with a method that estimates clustering accuracy using minimal human input. Our approach overcomes the limitations of prior techniques, providing a more accurate, unbiased measure of how well records are clustered, leading to improved data unification results. This patent describes a robust, record-based metric for measuring clustering accuracy, ensuring consistent evaluation and monitoring across both training and production workflows. It helps organizations maintain high data quality standards while reducing manual effort.
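To make the idea of a record-based metric concrete, the Python sketch below averages per-record precision and recall over a small sample of records that a reviewer has verified. It is an illustration under assumptions, not the patented metric: the function name `record_level_accuracy`, the input shapes, and the choice of precision and recall as the per-record scores are hypothetical, and the sample is assumed to be drawn uniformly at random.

```python
def record_level_accuracy(predicted, verified):
    """Estimate clustering accuracy from a sample of verified records.

    predicted: dict record ID -> set of record IDs in its predicted cluster
               (each set includes the record itself)
    verified:  dict sampled record ID -> set of record IDs a reviewer
               confirmed belong with it (each set includes the record itself)
    Returns (mean precision, mean recall) over the sampled records.
    """
    precisions, recalls = [], []
    for rec, truth in verified.items():
        pred = predicted[rec]
        correct = len(pred & truth)
        # Precision: share of the predicted cluster that truly belongs.
        precisions.append(correct / len(pred))
        # Recall: share of the true cluster that the prediction captured.
        recalls.append(correct / len(truth))
    n = len(verified)
    return sum(precisions) / n, sum(recalls) / n
```

Because each sampled record contributes one score, large clusters do not dominate the estimate the way they can in pair-based metrics, and the reviewer only has to judge the records that were sampled.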
Curation
Curate large, diverse data sets at scale
Large Scale Data Curation (3 patents in family)
When Tamr was founded at MIT, technologies existed to address two out of the three V’s of big data - volume (with systems like Vertica) and velocity (with frameworks like Kafka). However, no solution existed for managing data variety. Recognizing the limitations of both human effort and machine learning alone, Tamr pioneered a new approach: closely coupling machine learning with subject-matter supervision to balance efficiency with trust and reliability. Tamr developed a scalable, cost-effective system for large-scale data curation. The system combines schema mapping, deduplication, and expert feedback, enabling it to curate millions or billions of records with high accuracy.
Curation with Version Control (1 patent in family)
Tamr is a pioneer in integrating manual data curation with version control, providing a principled approach to tracking how data, model, and curation changes come together to form a version of a data product. This ensures transparency and allows teams to understand the impact of changes to models, the data, and its structure while retaining the benefits of manual curation. Tamr’s approach supports high-level workflows composed of low-level curation components like tokenization, blocking, and candidate generation. It also offers advanced capabilities such as restarting and rolling back workflows, ensuring flexibility in managing data curation processes.
Reusing Transformations for Evolving Schema Mapping (2 patents in family)
Human input is the most costly aspect of data curation. To address this issue, Tamr developed innovations that enable its platform to reuse human input beyond the original data source. For example, when a human describes how to modify source data to fit a unified schema, the platform can apply that transformation to other, differing data sources. This patented approach extends schema mapping to include data transformation, enabling the identification of reusable sequences of modifications. It allows Tamr to efficiently scale schema mapping across large datasets, even if the other datasets differ from the original in content and/or structure.
Feedback
Capture human feedback in context and use it to train the model
In-Situ Data Issue Reporting, Presentation, and Resolution (1 patent in family)
Tamr recognized that users are more likely to provide feedback on data when they can do so from within the application they are using. Tamr’s innovations provide an interface for immediate feedback that gathers contextual information about the data in view - including datasets and versions, filters applied, and selected data elements - from within a web page, spreadsheet, or visualization platform. This approach not only makes it simpler for users to provide feedback, but it also makes it easier for curators to view that feedback in context and take corrective action. These same techniques also provide users with greater visibility and insight into data quality, including whether or not the data they are viewing has open issues.
Using Clusters to Train Supervised Entity Resolution (1 patent in family)
Tamr recognizes that data users are more likely to provide feedback on data if they can do so within the application where they consume data. To streamline this, Tamr developed an interface for immediate feedback, gathering contextual information - such as datasets, versions, filters, and selected data elements - directly from within a webpage or visualization tool. This makes it easier for users to report issues and for curators to take corrective action with full context, enhancing data quality and visibility. Tamr’s approach, through its Steward tool, effectively manages data issues across high-volume, high-variety datasets, addressing the key challenges that typically hinder data quality issue tracking.
AI/ML Mastering
Make meaningful connections and translate them into active learning
Scalable Binning for Big Data Deduplication (2 patents in family)
Matching a record against a corpus of millions - or even billions - of other records can lead to wasted time comparing irrelevant records. Tamr developed innovations that enable the machine to quickly focus on comparing relevant records, making large-scale deduplication both feasible and practical. Tamr’s patented approach uses distributed systems to scale deduplication efforts, employing advanced binning and blocking techniques to tackle the N² deduplication challenge. This ensures efficiency even at very large scales.
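The general idea behind blocking can be shown in a few lines of Python: records are grouped into bins by a cheap key, and only records that share a bin are compared. This is a single-machine sketch of the generic technique, not Tamr's distributed implementation; the key function `blocking_key` and the record fields `id`, `name`, and `zip` are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def blocking_key(record):
    # Hypothetical key: first three letters of the lowercased name plus the
    # postal code. Real systems typically combine several keys.
    return (record["name"].strip().lower()[:3], record["zip"])


def candidate_pairs(records):
    """Return the ID pairs worth comparing: only records that share a bin."""
    bins = defaultdict(list)
    for rec in records:
        bins[blocking_key(rec)].append(rec["id"])

    pairs = set()
    for ids in bins.values():
        # All pairs inside one bin; records in different bins are never paired.
        pairs.update(combinations(sorted(ids), 2))
    return pairs
```

For four records where only "Acme Corp" and "ACME Corporation" share a key, an all-pairs comparison would examine six pairs, while `candidate_pairs` yields just one. The trade-off is that a key that is too strict can separate true duplicates, which is why the choice of key matters.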
System for Scalable Hierarchical Classification Using Blocking and Active Learning Method (2 patents in family)
Tamr’s next innovations in AI/ML mastering introduce a practical approach to hierarchical classification for large taxonomies, such as UNSPSC, which contains over 100,000 categories. The system reduces training time by using data multiple times in different ways, significantly boosting accuracy. It also enables the machine to quickly focus on meaningful categories, finding the best-match category for each record. This patented system handles large-scale classification using distributed systems, overcoming challenges with large taxonomies, many records, and sparse training data through techniques like record pre-grouping and taxonomy candidate generation.
Geospatial Binning (1 patent in family)
Matching geospatial features such as roads, building footprints, and points of interest can provide strong signals for record matching. But traditional methods often require dedicated geospatial databases that sacrifice accuracy. Tamr’s innovative approach computes similarity across different feature types, such as points of interest and building footprints, without relying on projection, avoiding the usual accuracy trade-offs. This approach supports large-scale deduplication, advancing geospatial binning by computing proximity without a central index, and scales effectively to extremely large datasets using distributed systems.
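A simplified view of geospatial binning, for point features only, is sketched below in Python. Points are assigned to latitude/longitude grid cells, and distances are computed directly on unprojected coordinates with the haversine formula, but only between points in the same or adjacent cells. Everything here is an assumption for illustration rather than Tamr's method: the names `haversine_m` and `nearby_pairs`, the fixed-degree grid, and the spherical Earth radius. The sketch also assumes the search radius is smaller than a cell edge and ignores the poles and the antimeridian. In a distributed setting the cell key could serve as a partitioning key instead of an in-memory dictionary.

```python
import math
from collections import defaultdict

EARTH_RADIUS_M = 6_371_000.0  # mean Earth radius, spherical approximation


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two lat/lon points."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def nearby_pairs(points, radius_m, cell_deg=0.01):
    """points: dict ID -> (lat, lon). Returns ID pairs within radius_m."""
    # Bin every point into a grid cell keyed by (row, column).
    cells = defaultdict(list)
    for pid, (lat, lon) in points.items():
        cells[(math.floor(lat / cell_deg), math.floor(lon / cell_deg))].append(pid)

    pairs = set()
    for (cy, cx), ids in cells.items():
        # Compare only against the same cell and its eight neighbors.
        for dy in (-1, 0, 1):
            for dx in (-1, 0, 1):
                for a in ids:
                    for b in cells.get((cy + dy, cx + dx), ()):
                        if a < b and haversine_m(*points[a], *points[b]) <= radius_m:
                            pairs.add((a, b))
    return pairs
```

Two points a few meters apart land in the same cell and are paired, while a point a couple of kilometers away falls in a non-adjacent cell and is never compared at all.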
Active Learning When Using Clusters for Supervised ML (1 patent in family)
Active learning aims to improve the accuracy of machine learning by strategically selecting the questions with the greatest impact. Tamr’s innovation refines this approach by translating the technical needs of the active learning system into practical questions that a data expert can answer. This method enables the system to build a highly accurate model from minimal expert input. Tamr’s system identifies high-impact clusters for experts to verify during entity resolution training. It efficiently selects and manages these clusters, incorporating an intuitive user interface for verification.
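One common way to pick high-impact questions is uncertainty sampling, sketched below in Python at the cluster level. This is a generic illustration, not the patented selection method: the names `pair_uncertainty` and `select_clusters_for_review`, the input shape, and the scoring rule (mean uncertainty of the model's match probabilities for pairs inside each cluster) are all assumptions.

```python
def pair_uncertainty(p):
    """1.0 when the model is undecided (p = 0.5), 0.0 when it is certain."""
    return 1.0 - abs(2.0 * p - 1.0)


def select_clusters_for_review(cluster_pair_probs, k):
    """Pick the k clusters whose member pairs the model is least sure about.

    cluster_pair_probs: dict cluster ID -> list of match probabilities for
                        the record pairs inside that cluster
    """
    scores = {
        cid: sum(pair_uncertainty(p) for p in probs) / len(probs)
        for cid, probs in cluster_pair_probs.items()
        if probs  # skip single-record clusters, which have no pairs
    }
    # Highest mean uncertainty first; ties broken by cluster ID.
    return sorted(scores, key=lambda cid: (-scores[cid], cid))[:k]
```

Given `{"c1": [0.99, 0.98], "c2": [0.55, 0.45, 0.6], "c3": [0.9, 0.5]}` and `k=2`, the confidently matched cluster `c1` is skipped and the expert is asked to verify `c2` and `c3`, where a yes-or-no answer teaches the model the most.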