Building a data pipeline to defend New York from cyber threats

April 13, 2019 TH Author

Defending New York City’s IT infrastructure is a daunting task. The city has 8.6 million residents in its five boroughs alone and hosts hundreds of web applications so residents can track and use services like street plowing, as well as the popular NYC.gov site. With more than 330,000 city employees and 400,000 endpoints to keep track of, spread across federated agencies such as the NYPD and Immigrant Affairs, the attack surface is huge.

It’s a responsibility that falls to New York City Cyber Command, an 18-month-old agency charged with defending the city from cyber threats and enabling New Yorkers to lead safe digital lives.

Given the scale and federation of New York City’s IT infrastructure, the agency decided to build its own data pipeline. It wanted a secure, cloud-based security log aggregation platform for city systems, one that enabled alerting, visualization and analysis for cybersecurity professionals. The pipeline also had to allow the agency to scale non-linearly as demand on city services and the volume of cybersecurity threats grow.

“We built it because we needed to solve a New York City-sized challenge… with a new, cutting-edge, cloud-first approach that enabled the latest tools and technology to be applied at scale against our problem,” Colin Ahern, NYC’s deputy CISO, told ZDNet. “One that would allow us to evolve and stay ahead of the threat.”

Speaking at the Google Cloud Next conference in San Francisco, Ahern described how the pipeline works, noting that the city relied primarily on open source components “because of our government’s commitment to being a thought leader… in how we provide services, and our desire over time to give back to the community.”

Built primarily on the Apache Beam open-source data programming model, the data pipeline leverages “zero trust” security and “zero touch” infrastructure as code for rapid scalability. The system was built to be modular and flexible, Ahern said — “We can take things out and combine them very rapidly.”

And it was intentionally built to be fast. NYC Cyber Command processes billions of events per day, with an average processing time of less than 10 milliseconds per event — about the speed of a camera shutter.
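
To put those figures in perspective, here is a rough back-of-the-envelope sketch in Python. It assumes a daily volume of exactly one billion events (the article says only "billions") and treats 10 milliseconds as the per-event processing time, then applies Little's law (in-flight work = arrival rate x latency). The function names are illustrative and do not come from NYC Cyber Command.

```python
# Back-of-the-envelope capacity estimate. The one-billion-per-day volume
# is an assumed illustration; the article only says "billions".

SECONDS_PER_DAY = 24 * 60 * 60


def events_per_second(events_per_day: float) -> float:
    """Average arrival rate implied by a daily event volume."""
    return events_per_day / SECONDS_PER_DAY


def in_flight_events(events_per_day: float, latency_seconds: float) -> float:
    """Little's law: average number of events being processed at once."""
    return events_per_second(events_per_day) * latency_seconds


if __name__ == "__main__":
    daily_volume = 1e9   # assumption: one billion events per day
    latency = 0.010      # 10 ms average processing time, from the article
    print(f"{events_per_second(daily_volume):,.0f} events per second")
    print(f"{in_flight_events(daily_volume, latency):.1f} events in flight")
```

Under these assumptions the pipeline would see roughly 11,600 events per second on average, with about 116 events in flight at any moment. Real traffic is bursty, so peak figures would be higher.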

The system uses a publish and subscribe framework “so the right data goes to the right analytical process at the right time,” Ahern said. “Generally speaking, the right time is right now.”
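
As a minimal illustration of the publish and subscribe pattern Ahern describes, the Python sketch below routes each event to every handler subscribed to its topic. It is an in-process toy, not the city's implementation: the article does not say which messaging service the pipeline uses, and the `Broker` class, the topic names ("auth", "dns") and the event fields are all hypothetical.

```python
from collections import defaultdict
from typing import Callable, Dict, List

Event = Dict[str, str]
Handler = Callable[[Event], None]


class Broker:
    """Toy in-memory broker: delivers each event to its topic's subscribers."""

    def __init__(self) -> None:
        self._subscribers: Dict[str, List[Handler]] = defaultdict(list)

    def subscribe(self, topic: str, handler: Handler) -> None:
        """Register a handler to receive every event published to a topic."""
        self._subscribers[topic].append(handler)

    def publish(self, topic: str, event: Event) -> int:
        """Deliver an event synchronously; return how many handlers got it."""
        handlers = self._subscribers.get(topic, [])
        for handler in handlers:
            handler(event)
        return len(handlers)


def demo() -> tuple:
    """Wire two hypothetical analytical processes to hypothetical topics."""
    alerts: List[Event] = []    # stand-in for an alerting process
    archive: List[Event] = []   # stand-in for a long-term log store

    broker = Broker()
    broker.subscribe("auth", alerts.append)
    broker.subscribe("auth", archive.append)
    broker.subscribe("dns", archive.append)

    broker.publish("auth", {"host": "example-host", "action": "login_failed"})
    broker.publish("dns", {"host": "example-host", "query": "example.com"})
    return alerts, archive
```

The publisher never needs to know which analytical processes exist. Adding a new consumer is a single `subscribe` call, which fits the modular design described in the article. A production system would deliver messages asynchronously and durably rather than through direct function calls.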

He added, “We want the analyst to be at the speed of their fastest tool, not be held to the speed of their slowest tool.”

Speed is important, Ahern continued, given that “it’s not just the good guys using machine learning and automation, it’s the bad guys.” Ransomware and other types of attacks are now mostly programmatic — they sweep wide swaths of the internet. That means the city must operate at machine speed as well.

When analysts respond to events and make a decision like taking an asset offline, they’re making a tradeoff, Ahern noted. “They’re degrading the usability of that system in some fundamental way.”

“We want this process to happen as rapidly and effectively as possible,” he said.
