Angelina Carvalho, Chiranjit Chakraborty and Georgia Latsi at Bank Underground: “Policy makers have access to more and more detailed datasets. These can be joined together to give an unprecedentedly rich description of the economy. But the data are often noisy and individual entries are not uniquely identifiable. This leads to a trade-off: very strict matching criteria may result in a limited and biased sample; making them too loose risks inaccurate data. The problem gets worse when joining large datasets as the potential number of matches increases exponentially. Even with today’s astonishing computer power, we need efficient techniques. In this post we describe a bipartite matching algorithm on such big data to deal with these issues. Similar algorithms are often used in online dating, closely modelled as the stable marriage problem.
The home-mover problem
The housing market matters and affects almost everything that central banks care about. We want to know why, when and how people move home. And a lot do move: one in nine UK households in 2013/4 according to the Office for National Statistics (ONS). Fortunately, it is also a market that we have an increasing amount of information about. We are going to illustrate the use of the matching algorithm in the context of identifying the characteristics of these movers and the mortgages that many of them took out.
A potential solution
The FCA’s Product Sales Data (PSD) on owner-occupied mortgage lending contains loan-level product, borrower and property characteristics for all loans originated in the UK since Q2 2005. This dataset captures the attributes of each loan at the point of origination but does not follow the borrowers afterwards. Hence, it does not meaningfully capture whether a loan was transferred to another property or closed for some reason. There is also no unique borrower identifier, so we cannot easily monitor whether a borrower repaid their old mortgage and took out a new one against another property.
However, the dataset identifies whether a borrower is a first-time buyer or a home-mover, together with other information. Even though we do not have information before 2005, we can still try to use this dataset to identify some of the owners’ moving patterns. We try to find where a home-mover may have moved from (the origination point) and who moved into their vacated property. If we can successfully track the movers, it will also help us to remove the corresponding old mortgages when calculating the stock of mortgages from our flow data. A previous Bank Underground post showed how probabilistic record linkage techniques can be used to join related datasets that do not have unique common identifiers. We have used bipartite graph matching techniques here to extend those ideas….(More)”
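To make the idea of bipartite matching concrete, here is a minimal sketch in Python. It is not the authors' method: the excerpt does not say which algorithm or matching criteria they used. The sketch assumes a simple unweighted maximum-cardinality matching found with augmenting paths, and the field names (`id`, `dob`, `year`), the sample records and the linking rule (same date of birth, earlier origination year) are all hypothetical.

```python
from typing import Dict, List, Set

# Hypothetical records: earlier loans (properties that may have been vacated)
# and later home-mover loans. All field names and values are illustrative.
old_loans = [
    {"id": "old1", "dob": "1975-03-02", "year": 2006},
    {"id": "old2", "dob": "1975-03-02", "year": 2007},
    {"id": "old3", "dob": "1980-11-20", "year": 2008},
]
mover_loans = [
    {"id": "mov1", "dob": "1975-03-02", "year": 2010},
    {"id": "mov2", "dob": "1975-03-02", "year": 2007},
    {"id": "mov3", "dob": "1980-11-20", "year": 2007},
]

# Assumed linking rule: a mover loan may correspond to an old loan when the
# borrower's date of birth agrees and the old loan was originated earlier.
candidates: Dict[str, List[str]] = {
    m["id"]: [
        o["id"]
        for o in old_loans
        if o["dob"] == m["dob"] and o["year"] < m["year"]
    ]
    for m in mover_loans
}


def max_bipartite_matching(edges: Dict[str, List[str]]) -> Dict[str, str]:
    """Return a maximum-cardinality matching as {left_id: right_id}."""
    match_of_right: Dict[str, str] = {}

    def try_assign(left: str, visited: Set[str]) -> bool:
        for right in edges.get(left, []):
            if right in visited:
                continue
            visited.add(right)
            # Use the right-hand node if it is free, or if its current
            # partner can be moved to another candidate (augmenting path).
            if right not in match_of_right or try_assign(
                match_of_right[right], visited
            ):
                match_of_right[right] = left
                return True
        return False

    for left in edges:
        try_assign(left, set())
    return {left: right for right, left in match_of_right.items()}


matching = max_bipartite_matching(candidates)
print(matching)
```

In this toy data `mov2` can only link to `old1`, so the algorithm re-routes `mov1` to `old2` rather than leaving `mov2` unmatched, while `mov3` has no admissible candidate and stays unmatched. With noisy real data one would more plausibly score each candidate pair and solve a weighted matching, but that goes beyond what the excerpt describes.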