
Zillow: Machine learning and data disrupt real estate


Learn how big data and the Zillow Zestimate disrupted real estate. It’s an important case study in the power of machine learning models and digital innovation.
Anyone buying or selling a house knows about Zillow. In 2006, the company introduced the Zillow Estimate, or Zestimate for short, which uses a variety of data sources and models to create an approximate value for residential properties.
The impact of Zillow’s Zestimate on the real estate industry has been considerable, to say the least.
From the home buyer perspective, Zillow’s Zestimate enables significant transparency around prices and information that historically was available only to brokers. The company has genuinely democratized real estate information and adds tremendous value to consumers.
For real estate brokers, on the other hand, Zillow is fraught with more difficulty. I asked a top real estate broker working in Seattle, Zillow’s home turf, for his view of the company. Edward Krigsman sells multimillion-dollar homes in the city and described some of the challenges the company creates for brokers.
Zillow’s market impact on the real estate industry is large, and the company’s data is an important influence on many home transactions.
Zillow offers a textbook example of how data can change established industries, relationships, and economics. The parent company, Zillow Group, runs several real estate marketplaces that together generate about $1 billion in revenue, with a reported 75 percent share of the online real estate audience.
As part of the CXOTALK series of conversations with disruptive innovators, I invited Zillow’s Chief Analytics Officer (who is also its Chief Economist), Stan Humphries, to take part in episode 234.
The conversation offers a fascinating look at how Zillow thinks about data, models, and its role in the real estate ecosystem.
Check out the video embedded above and read a complete transcript on the CXOTALK site. In the meantime, here is an edited and abridged segment from our detailed and lengthy conversation.
There’s always been a lot of data floating around real estate, though a lot of that data was largely [hidden], so it had unrealized potential. As a data person, you love to find that space.
Travel, which a lot of us were in before, was a similar space, dripping with data, but people had not done much with it. It meant that a day wouldn’t go by where you wouldn’t come up with “Holy crap! Let’s do this with the data!”
In real estate, multiple listing services had arisen, shared among different agents and brokers on the real estate side, which listed the homes that were for sale.
However, the public record system was completely independent of that, and there were two public records systems: one for deeds and liens on real property, and then another for the tax rolls.
All of that was disparate information. We tried to solve for the fact that all of this was offline.
We had the sense that, from a consumer’s perspective, it was like the Wizard of Oz, where it was all behind this curtain. You weren’t allowed behind the curtain and really [thought], “Well, I’d really like to see all the sales myself and figure out what’s going on.” You’d like the website to show you both the for-sale listings and the for-rent listings.
But of course, the people selling you the homes didn’t want you to see the rentals alongside them because maybe you might rent a home rather than buy. And we’re like, “We should put everything together, everything online.”
We had faith that type of transparency was going to benefit the consumer.
You still find that agency representation is very important because it’s a very expensive transaction. For most Americans, it is the most expensive transaction they will ever make and the most expensive financial asset they will ever own. So, there continues to be a reasonable reliance on an agent to help hold the consumer’s hand as they either buy or sell real estate.
But what has changed is that consumers now have access to the same information that their representation has, on either the buy or sell side. That has enriched the dialogue and helped the agents and brokers who serve those consumers. Now a consumer comes to the agent with a lot more awareness and knowledge, as a smarter consumer. They work with the agent as a partner: the consumer has a lot of data, and the agent has a lot of insight and experience. Together, we think they make better decisions than they did before.
When we first rolled out in 2006, the Zestimate was a valuation that we placed on every single home we had in our database at that time, which was 43 million homes. To create that valuation on 43 million homes, it ran about once a month, and we pushed a couple of terabytes of data through about 34 thousand statistical models, which was, compared to what had been done previously, an enormously more computationally sophisticated process.
I should just give you some context on what our accuracy was back then. Back in 2006 when we launched, we were at about a 14 percent median absolute percent error on 43 million homes.
Since then, we’ve gone from 43 million homes to 110 million homes; we put valuations on all 110 million homes. And we’ve driven that error down to about 5 percent today, which, from a machine learning perspective, is quite impressive.
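For readers unfamiliar with the metric, here is a minimal sketch of how a median absolute percent error like those figures can be computed. The Python function and the toy numbers below are purely illustrative; they are not Zillow’s code or data.

import numpy as np

def median_abs_pct_error(estimates, sale_prices):
    """Median of |estimate - sale price| / sale price, as a percentage."""
    estimates = np.asarray(estimates, dtype=float)
    sale_prices = np.asarray(sale_prices, dtype=float)
    pct_errors = np.abs(estimates - sale_prices) / sale_prices
    return 100.0 * np.median(pct_errors)

# Toy example: three homes whose model valuations are compared with the
# prices they actually sold for.
estimates = [410_000, 285_000, 1_150_000]
sale_prices = [450_000, 300_000, 1_100_000]
print(f"MdAPE: {median_abs_pct_error(estimates, sale_prices):.1f}%")  # -> MdAPE: 5.0%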
Those 43 million homes that we started with in 2006 tended to be in the largest metropolitan areas where there was much transactional velocity. There were a lot of sales and price signals with which to train the models. As we went from 43 million to 110, you’re now getting out into places like Idaho and Arkansas where there are just fewer sales to look at.
It would have been impressive if we had kept our error rate at 14 percent while getting out to places that are harder to estimate. But not only did we more than double our coverage from 43 million to 110 million homes, we also nearly tripled our accuracy, driving the error rate from 14 percent down to 5 percent.
The hidden story behind achieving that is collecting enormously more data and getting a lot more sophisticated algorithmically, which requires us to use more computers.
Just to give some context: when we launched, we built 34 thousand statistical models every month. Today, we update the Zestimate every single night, generating somewhere between 7 and 11 million statistical models each night. Then, when we’re done with that process, we throw them away and repeat it the next night. So, it’s a big data problem.
We never go above the county level for the modeling system. In large counties with many transactions, we break the county down into smaller regions, where the algorithms try to find homogeneous sets of homes at the sub-county level to train a modeling framework. That modeling framework itself contains an enormous number of models.
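As a rough illustration of that sub-county partitioning, the sketch below clusters a made-up table of homes into homogeneous groups by location and housing characteristics. The Python code, column names, and cluster count are assumptions for illustration only, not Zillow’s actual framework.

import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Hypothetical per-county table of homes; the columns are illustrative only.
rng = np.random.default_rng(0)
homes = pd.DataFrame({
    "latitude": rng.uniform(47.5, 47.7, 500),
    "longitude": rng.uniform(-122.4, -122.2, 500),
    "sqft": rng.integers(700, 4500, 500),
    "bedrooms": rng.integers(1, 6, 500),
    "year_built": rng.integers(1900, 2020, 500),
})

# Scale the features so location and size contribute comparably, then group
# the homes into (here, arbitrarily) 10 homogeneous sub-county regions.
features = StandardScaler().fit_transform(homes)
homes["region"] = KMeans(n_clusters=10, n_init=10, random_state=0).fit_predict(features)

# Each region would then get its own modeling framework trained on its homes.
print(homes.groupby("region").size())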
The framework incorporates a bunch of different ways to think about values of homes combined with statistical classifiers. So maybe it’s a decision tree, thinking about it from what you may call a “hedonic” or housing characteristics approach, or maybe it’s a support vector machine looking at prior sale prices.
The combination of the valuation approach and the classifier together creates a model, and a bunch of these models are generated at that sub-county geography. There are also a bunch of models that become meta-models, whose job is to put together these sub-models into a final consensus opinion, which is the Zestimate.
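To make that structure concrete, here is a minimal sketch of the pattern Humphries describes: a decision tree trained on housing characteristics, a support vector machine trained on prior-sale signals, and a meta-model that learns to combine their outputs into one consensus estimate. The Python code uses scikit-learn’s stacking regressor on synthetic data; the feature layout and model settings are illustrative assumptions, not Zillow’s production system.

import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import StackingRegressor
from sklearn.linear_model import RidgeCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR
from sklearn.tree import DecisionTreeRegressor

# Hypothetical feature layout for one sub-county region:
# columns 0-2 are housing characteristics (e.g., sqft, bedrooms, bathrooms),
# columns 3-4 are prior-sale signals (e.g., last sale price, years since sale).
rng = np.random.default_rng(0)
X = rng.normal(size=(400, 5))
y = 300_000 + 50_000 * X[:, 0] + 40_000 * X[:, 3] + rng.normal(scale=20_000, size=400)

# Sub-model 1: a decision tree on the "hedonic" housing-characteristic columns.
hedonic_tree = Pipeline([
    ("select", ColumnTransformer([("chars", "passthrough", [0, 1, 2])])),
    ("tree", DecisionTreeRegressor(max_depth=6, random_state=0)),
])

# Sub-model 2: a support vector machine on the prior-sale columns.
prior_sale_svm = Pipeline([
    ("select", ColumnTransformer([("prior", "passthrough", [3, 4])])),
    ("scale", StandardScaler()),
    ("svm", SVR(C=100_000.0)),
])

# Meta-model: learns how to weight the sub-models' predictions into a single
# consensus estimate per home.
consensus = StackingRegressor(
    estimators=[("hedonic_tree", hedonic_tree), ("prior_sale_svm", prior_sale_svm)],
    final_estimator=RidgeCV(),
)
consensus.fit(X, y)
print(consensus.predict(X[:3]))

In Zillow’s case, as described above, this kind of consensus is produced at vastly larger scale, with millions of models rebuilt nightly.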
We believe advertising dollars follow consumers. We want to help consumers the best we can.
We have constructed, in economic language, a two-sided marketplace where we’ve got consumers coming in who want to access inventory and get in touch with professionals. On the other side of that marketplace, we’ve got professionals — be it real estate brokers or agents, mortgage lenders, or home improvers — who want to help those consumers do things.
