How A Mapping Project Is Pointing The Way For More Open Data Use

Overture Maps Foundation Executive Director Marc Prioleau Talks About His Project, Where Open Data Can Make A Difference, And How It Can Catch On.

The Overture Maps Foundation, a project to standardize map data, was founded in December 2022 by Amazon, Meta, Microsoft and TomTom, and is hosted by the Linux Foundation. I talked to Marc Prioleau about the project he’s working on, which relies on open data from a number of sources, and what the future holds for open data. This conversation has been edited for length, clarity and continuity.

A shorter version of this interview was in Thursday’s Forbes CIO newsletter.

Tell me about the Overture Maps Foundation.

Prioleau: What we’re doing is open map data. And that seems like a weirdly specific thing, but to put it in context, maps have become a key part of infrastructure, but especially mobile infrastructure. If you think about going around with a mobile device, whether it’s an enterprise app or a consumer app, the map is one of those things that is up there in the top horizontal platforms. When you look at a mapping application, whether you’re looking to search for something or get directions for something, there’s a layer of software, which is what you’re looking at—the routing app—but it sits on this really incredible database of map data. That data, if you think about it, is the digital representation of the physical world. As the physical world is big and varied and changing all the time, the data that represents it has to be big and varied and changing all the time.

When you think about map data, the software is interesting, but the data is the part that is really, really hard. If you go back to before digital maps, the expectation was that you got a map at the gas station and kept it in your glove box for 10 years. If you drove down a road and something was different from the map, you just said, ‘Well, that’s the way life is. I got the map 10 years ago.’ Now, to use an example, if [a map app] routes you over [Baltimore’s] Francis Scott Key Bridge [which collapsed on March 26], that would be unacceptable performance. Think about the many other places where roads have changed. The customer expectation has gotten really, really high.

One of the reasons Overture got started is because maps are important. The hardest part about maps is this data layer, and so the thought was: There’s so much built into that data, but could we all agree on some common baseline maps? Then everyone can start to collaborate on building those, and add the finer points of mapping on top of that. It became a great use case.

What are some other examples of places where open data is important and should be used?

Open data is fairly nascent compared to open code; it’s a newer entity. A big area is environmental and climate data, which is a massive dataset that changes a lot. No one has all of the data, but the contributions of many people can build it.

As with everything, the hot topic is AI, and AI models need data for training. Then the question is: Who owns the data that you’re doing the training on? The nice thing about open data is you can use it for whatever you want. Or is it not open data? Is it copyrighted data? Then it gets a lot more complex.

Why do you think open data is such a brand new thing?

I think that there are a lot of similarities between open code and open data, but there are some key differences. One of the things where open code and open data are similar is that a bunch of people agree they would be better off, they would have a better quality product, if they collaborated on building something. Usually what happens in open code or open data is there’s an “open” part, but then people still have the ability to create business models on top of that. I think open code took a while to figure that out, but that’s pretty established now. Open data is going through the same thing. What are the areas where we can collaborate on data and it could be open, and still leave room for business models on top of that?

There are some differences. Let’s say you’re a software engineer. When you write code in an open source project, that code is open as you contribute it. Data is a little different, because data represents facts. It has to come from somewhere. It didn’t just spring from your brain. It was a measurement of something. In mapping, that can be a measurement of: Is the road still there? Is that business still there? Did things change? What’s upstream of open data, and different from open code, is data collection and synthesis and analysis, and that usually is not open. I think that’s a fairly different business proposition than open code.

How can this data transition from being something that you collect because it’s valuable to you and your company, to something that can be open? What does that do to a company’s value of its data?

Companies really have to think about what is the data that is valuable in terms of creating competitive advantage for them, and what is the data that actually doesn’t create competitive advantage and is just really good information. I’ll give you an example. One part of map data is places data. It’s the businesses that are around, and everyone in mapping hates places data because places go in and out of business. About 20% of them that were here last year will be gone next year. Roads get built and then they stay there for decades, but places come and go.

Places data is proprietary. It’s hard to build. But more and more, what’s happening in the market is companies are thinking about which parts of places data are important. If you think about a company like Meta, they don’t really put a high premium on knowing where all the places are. What they really create value on is the social media signals that go on top of that. Someone like that might look at it and say: We’ll collaborate if it helps us know where all the places in the world are, the name and the address and the latitude and longitude. But we’re not going to share our social media signals with our competitors. So they start dividing the data into what’s below the line, which might be shared and held collaboratively, and what’s above that line. That’s a conversation I have repeatedly.

That line is different for almost every company, I find. Over time, I think that line moves up.

Open source software and code has vulnerability issues. Is the same true for open data?

It could be. If you think about any data, especially big datasets, you’re going to have some errors in there. Data is reflective of facts, and unless you have some magical system in which all your facts are 100% right, some of those facts will be wrong. All maps have errors in them. Then the question becomes: How do you improve the quality of that? One case is that the quality is bad because of a mistake. The other is that the quality is bad because some bad actor has created bad data. That’s the case for any type of data. You don’t want bad data in there.

One of the things open data starts looking at is how you develop statistical models to identify that bad data. As an example, let’s say you and I have competing businesses, and I go in and edit the map to say your business is four blocks over from where it actually is. You don’t want me to have the power to do that to you. So what you might do is say: I’m not going to move a business unless I have a statistically significant amount of data, or unless the owner tells me and I know you’re the owner. You’re going to want to build something in there that gives you some statistical sense that the data is accurate.
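The gating Prioleau describes, requiring either owner verification or enough independent corroboration before accepting an edit, can be sketched in a few lines. Everything here (the `EditProposal` fields, the threshold value) is a hypothetical illustration, not Overture’s actual pipeline:

```python
from dataclasses import dataclass

# Hypothetical sketch of edit gating for places data. The field names and
# the threshold are illustrative; a real system would tune them per dataset.

MIN_INDEPENDENT_REPORTS = 5  # assumed bar for a "statistically significant" move

@dataclass(frozen=True)
class EditProposal:
    place_id: str
    new_location: tuple[float, float]  # (lat, lon)
    reporter_ids: frozenset[str]       # distinct sources reporting the change
    owner_verified: bool               # the verified owner confirmed the change

def should_accept(edit: EditProposal) -> bool:
    """Accept a move only with owner sign-off or enough independent reports."""
    if edit.owner_verified:
        return True
    return len(edit.reporter_ids) >= MIN_INDEPENDENT_REPORTS
```

Under this rule, a lone competitor’s edit (one reporter, no owner sign-off) is simply ignored, while a change corroborated by many independent sources, or confirmed by the verified owner, goes through.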

For the data we’re looking at, one of the reasons we’re interested in building a broad coalition is that the more different signals we can get on all our data, the better it is. Take the example of places data: one company delivered a package there, another saw a social media post there, another picked up food there, another did something else. If I can get 20 of those signals, now I have a fairly good set of signals telling me my data is accurate.
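That multi-signal idea can be sketched as a simple confidence score over distinct, independent signal types. The signal names and weights below are illustrative assumptions, not Overture’s schema:

```python
# Hypothetical sketch: estimate confidence that a place still exists by
# combining the distinct, independent signal types observed there.
# Weights are made up for illustration.

SIGNAL_WEIGHTS = {
    "package_delivered": 1.0,
    "food_pickup": 1.0,
    "street_imagery": 0.8,
    "social_post": 0.5,  # weaker evidence: posts can mention closed places
}

def existence_confidence(observed: set[str]) -> float:
    """Return a 0..1 confidence from the distinct signal types seen."""
    score = sum(SIGNAL_WEIGHTS.get(s, 0.0) for s in observed)
    return min(1.0, score / sum(SIGNAL_WEIGHTS.values()))
```

The design point matches the quote: no single signal is decisive, but several independent kinds of evidence together give a fairly good sense that the record is accurate.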

How much maintenance does it take to keep open data up to date?

I think it depends on the data set and the churn or the variability of the data. The answer can be a lot. If you have real-time data, looking at air quality, that data is changing every day. You need a bunch of sensors to keep it up. In our world, there’s some data that stays relatively static. If you think about state boundaries as a type of map data, that’s relatively static. But on the other hand, traffic data is highly volatile. Today, there’s traffic somewhere, and at 11 o'clock and 12 o'clock, it might be gone. I think the maintenance of it can be very high. And that’s a big difference.

If we’re writing code, conceivably you and I could write some code, it works, and it could be stable for months without needing to be changed, until someone comes in and makes a little modification. If data has to be updated, you need almost a production line, right? Whereas code, you could write it and it could be stable for a while.

What do you think it would take to get more businesses and other entities interested in open data?

The thing I find is that conceptually, everyone understands the strategy around open data. They get it, they nod their heads, they do everything. The challenge becomes bringing a group together who actually will collaborate. One of the problems in any open source project is how do you share participation? Do you have people who are the so-called free riders, the people who are using the output of an open source project, but aren’t contributing into it? Ultimately, if no one puts into it, the project goes away. I think the key part is really developing a mechanism where people can contribute and have incentives to contribute.

Strategically, at least for our project, I can describe it to people and they understand the strategy, the rationale, the reason you’d want to do it. The reason it makes sense. I don’t find that being a hard part at all. But, you know, then people have to do it. They have to go the distance. We started with four members. Now we’re at 27. Part of it is the benefit that they get if this happens, and the alternative if it doesn’t happen. The alternatives aren’t good.

In five years, do you think open data will be seen more like open code and open software today?

I think it’s going to be a related but distinct field. Today, people think open source is open source, whether it’s code or data. I think what you’ll see five years from now is that open data will become a distinct entity, because of these things we talked about. Data has to come from somewhere, so you need people contributing things that are often proprietary and have value. Data has to be accurate. Data needs to be updated on a certain frequency to be accurate and up to date. The other one is that data has bulk, right? We’re dealing with terabytes of data and petabytes of data. Now you’ve got cloud computing costs.

I think companies will start to develop specialties around that. I think open data foundations like ours are really going to have to address those unique things in unique ways.
