Finding Real Estate Opportunities in Cambridge, MA: 90 Minutes in a Jupyter Notebook

How 90 minutes of exploration in a Jupyter notebook saved days of debugging

Jupyter · real estate · ArcGIS · APIs · AI Agents

I've rented the same apartment in Cambridge for 8 years, and for me this neighborhood, one block from Mass Ave and tucked between Harvard and Porter, is like Paris, a paradise aptly located behind a restaurant called Nirvana. But like many people, I can't really afford to buy here, at least not in the traditional way. That changed in February 2025 when Cambridge passed a landmark zoning reform: four-story buildings by-right, citywide, with no minimum lot sizes, no unit caps, and no parking requirements. Overnight, thousands of parcels became viable for small-scale development.

As it happens, I have a real estate license, I've built and renovated many homes, and I actually enjoy dealing with subcontractors. I'm also a data nerd who can build AI agents. So I saw an opportunity: find undervalued parcels where I can build 3-9 unit condo projects, sell most units, and keep one for myself. Live in Cambridge, create value, own something I built. But first, I need to answer a seemingly simple question: Which parcels are actually viable?

The data is scattered across three different platforms, each with its own API and quirks. Cambridge's ArcGIS REST services hold the parcel geometry, a property assessment layer with owner information and values, a driveways layer for parking feasibility, and historic district overlays. Cambridge Open Data (Socrata API) has the curb-cut permits. The MBTA API v3 provides transit station locations. I've never touched any of these APIs before, each with its own authentication, query patterns, and data formats, and I need to join them all together, score each parcel, and build a map showing the opportunities.
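A quick sketch of what those three query patterns look like from a notebook cell. The ArcGIS service path and the Socrata dataset ID are placeholders rather than the real endpoints, and the MBTA filter is just one plausible way to pull subway stops:

```python
import requests

# Placeholder endpoints -- the real service paths and dataset IDs differ,
# but the three query styles are representative.
ARCGIS_PARCELS = "https://gis.cambridgema.gov/arcgis/rest/services/<service>/MapServer/0/query"
SOCRATA_CURB_CUTS = "https://data.cambridgema.gov/resource/<dataset-id>.json"
MBTA_STOPS = "https://api-v3.mbta.com/stops"

# ArcGIS REST: SQL-ish 'where' clause, explicit output format
parcels = requests.get(ARCGIS_PARCELS, params={
    "where": "1=1",      # no filter -- everything
    "outFields": "*",    # all attribute columns
    "f": "geojson",      # geometry and attributes in one response
}).json()

# Socrata: SoQL-style parameters prefixed with '$'
curb_cuts = requests.get(SOCRATA_CURB_CUTS, params={"$limit": 100}).json()

# MBTA v3: JSON:API-style filters; route_type 1 is heavy rail (e.g. the Red Line)
stops = requests.get(MBTA_STOPS, params={"filter[route_type]": "1"}).json()
```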

I could have started writing an ETL pipeline, guessing field names and debugging for days when nothing worked. Instead, I spent 90 minutes in a Jupyter notebook first, answering some basic questions. First, could I even access the data I'd need? I tried multiple Cambridge GIS URLs before finding one that didn't 404; that alone would've cost hours in a production debugging cycle. Then came the schema reality check: my intuitive guesses about field naming conventions were close, but wrong enough to break everything. The actual field names used capitalization patterns and abbreviation styles different from what I expected.
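Both checks fit in a couple of lines: ask the layer for its metadata and print the fields it actually exposes, instead of guessing. The layer URL here is a placeholder for whichever candidate endpoint responds:

```python
import requests

# Placeholder layer URL -- substitute whichever candidate actually responds.
LAYER_URL = "https://gis.cambridgema.gov/arcgis/rest/services/<service>/MapServer/0"

# Liveness check: a 404 here means move on to the next candidate URL.
resp = requests.get(LAYER_URL, params={"f": "json"}, timeout=10)
resp.raise_for_status()

# The schema as the service actually reports it -- real names, not guesses.
for field in resp.json().get("fields", []):
    print(field["name"], field["type"])
```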

The join logic presented its own discoveries. The parcels and assessment data live in separate layers that need to be joined, but the obvious identifier wasn't the right one. After testing different join keys, I found the working combination and confirmed it with a quick merge that returned results instantly. Spatial operations needed validation too—I tested whether I could actually intersect parcels with the driveways layer and got immediate confirmation that about a quarter of parcels have existing driveways.
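In notebook form, the two checks are a one-line merge and a one-line spatial join. The GeoDataFrames are assumed to have been loaded in earlier cells, and the join-key names below are stand-ins rather than the combination that actually worked:

```python
import geopandas as gpd

# parcels_gdf, assessments_df, and driveways_gdf come from earlier cells.
# The key names are hypothetical stand-ins for the pair that actually joined.
merged = parcels_gdf.merge(
    assessments_df,
    left_on="map_lot",    # hypothetical parcel-side key
    right_on="gis_id",    # hypothetical assessment-side key
    how="inner",
)
print(f"{len(merged)} of {len(parcels_gdf)} parcels matched an assessment record")

# Spatial sanity check: how many parcels intersect an existing driveway?
with_driveway = gpd.sjoin(parcels_gdf, driveways_gdf, how="inner", predicate="intersects")
share = with_driveway.index.nunique() / len(parcels_gdf)
print(f"{share:.0%} of parcels intersect a driveway")
```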

The notebook also surfaced several gotchas that would've been painful to discover mid-implementation: responses are paginated at 2000 records per request, some direct download URLs have moved (though the REST APIs still work), and the coordinate reference system isn't set explicitly on some datasets and has to be added manually.
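Those constraints are easy to handle once you know about them. Here is a sketch of the paging-and-CRS pattern, assuming the layer supports the standard resultOffset/resultRecordCount parameters and returns WGS84 GeoJSON:

```python
import geopandas as gpd
import requests

def fetch_all_features(query_url, page_size=2000):
    """Page through an ArcGIS REST layer that caps each response at 2000 records."""
    features, offset = [], 0
    while True:
        resp = requests.get(query_url, params={
            "where": "1=1",
            "outFields": "*",
            "f": "geojson",
            "resultOffset": offset,
            "resultRecordCount": page_size,
        })
        resp.raise_for_status()
        page = resp.json().get("features", [])
        if not page:
            break
        features.extend(page)
        offset += page_size

    gdf = gpd.GeoDataFrame.from_features(features)
    # The GeoJSON is WGS84, but no CRS gets attached to the frame --
    # declare it explicitly before reprojecting or doing spatial joins.
    return gdf.set_crs("EPSG:4326")

# Usage, with a placeholder URL:
# parcels_gdf = fetch_all_features("<parcel-layer-url>/query")
```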

The workflow emerged naturally: ninety minutes of interactive exploration, trying URLs and seeing immediate results, inspecting actual schemas rather than trusting documentation, testing joins to catch issues early. The notebook ended up documenting reality—the real field names, the actual join keys that work, the real constraints like pagination limits and missing coordinate systems. When I moved to building the actual ETL pipeline, I was working from validated facts rather than debugging mystery APIs and wrong assumptions mid-implementation. Those 90 minutes made the difference between a clean implementation and days of frustration.