Fuzzy Data Matching with SQL: Enhancing Data Quality and Query Performance
by: Jim Lehmer (Author)
Publisher finelybook 出版社: O’Reilly Media; (November 7, 2023)
Language 语言: English
Print Length 页数: 282 pages
ISBN-10: 1098152271
ISBN-13: 9781098152277
Book Description
If you were handed two different but related sets of data, what tools would you use to find the matches? What if all you had was SQL SELECT access to a database? In this practical book, author Jim Lehmer provides best practices, techniques, and tricks to help you import, clean, match, score, and think about heterogeneous data using SQL.
DBAs, programmers, business analysts, and data scientists will learn how to identify and remove duplicates, parse strings, extract data from XML and JSON, generate SQL using SQL, regularize data and prepare datasets, and apply data quality and ETL approaches for finding the similarities and differences between various expressions of the same data.
Full of real-world techniques, the examples in the book contain working code. You’ll learn how to:
Identity and remove duplicates in two different datasets using SQL
Regularize data and achieve data quality using SQL
Extract data from XML and JSON
Generate SQL using SQL to increase your productivity
Prepare datasets for import, merging, and better analysis using SQL
Report results using SQL
Apply data quality and ETL approaches to finding similarities and differences between various expressions of the same data
Fuzzy Data Matching with SQL: Enhancing Data Quality and Query Performance
From the Preface
This book contains the patterns, practices, techniques, and tricks I’ve picked up over the decades, usually around the problem of “Is this list of data related to anything in that table?” The canonical example is a cold-call list pulled from…“somewhere”, and now Marketing (it’s always Marketing) wants you to match this list they probably paid for against the company’s existing customer database. Why?
There are lots of reasons, but these are the chief two:
Identify new prospects
Filter out existing customers from the list and send the new prospects down a low-cost, standardized route with perhaps a cold call, mailing list, etc. Think Glengarry Glen Ross.
Upsell and cross-sell existing customers
Filter out new prospects, and if an existing customer shows up at a trade show and expresses interest, perhaps they’re not aware of your entire product offering or haven’t been “touched” in a while to determine their needs. Get a salesperson in front of them with some incentives, pronto.
There are other reasons, of course. Two companies merging and wanting to combine their customer relationship management (CRM) systems and eliminate duplicates is another common reason. Often the same types of techniques are used to de-duplicate even within a single dataset, like a CRM database. Or when someone comes to you with data found on the “dark web” attributed to your company. Is it yours? Is it complete? Accurate? Current?
Don’t deal with data about humans? The second case study, covered in Chapter 13, will show the same techniques used to do some impact analysis on code—that is, treating source code as data to be fuzzy-matched against. I have also used the patterns in this book to parse text-based log files. You may be asked to do work in any or all of these areas and more.
For the purposes of this book, I will typically refer to “your customer data” or “your CRM data” as meaning something like your production dataset that presumably is in decent shape in terms of schema design, data quality, etc. (decent, not perfect—we will cover that). I will then talk about the “incoming data” or “imported data” as the data you are being asked to match against the customer data. In your world, “customer data” could be “patient data” or “subject data” or some such, but most of this book will cover aspects of matching human demographic attributes such as name, address, and phone number. The general techniques are useful across other areas, but you will have to do your own cognitive mapping if your data deals with something like products or pandemics.
About the Author
James Lehmer has been “in computers” for over three decades in various software development roles – programmer, systems programmer, software engineer, team lead, and software architect. He has worked on a variety of operating systems with a number of programming languages. James currently works in a Windows shop coding primarily in C#, but with his background in cross-platform development, he often gets tapped to deal with any *IX boxes that enter his environment.Amazon page