Modern analytics depend on high-effort tasks such as data preparation and data cleaning to produce accurate results. This talk describes recent work on making routine data preparation tasks, data cleaning in particular, dramatically easier. I will first introduce a formal probabilistic framework for describing the quality of structured data and demonstrate how this framework allows us to cast data cleaning as a statistical learning and inference problem. I will then show how this connection allows us to obtain formal guarantees on automated data cleaning and describe how it forms the basis of the HoloClean framework, a state-of-the-art ML-based solution for managing noisy structured data. I will close with additional examples of how a statistical learning view of noisy data management can lead to new solutions to classical database problems, such as the discovery of functional dependencies in structured data.