Meta shared details about its AI and networking advances at this week’s 2025 OCP Global Summit in San Jose, Calif.
Facebook was a founding member of Open Compute Project (OCP) in 2011, and its now-parent-company Meta has used the annual conference to showcase systems it’s developing to stay on the bleeding edge of technology. At this week’s event, that meant detailing Meta’s AI and networking advances.
“The advent of AI has changed all our assumptions on how to scale our infrastructure. Building infrastructure for AI requires innovation at every layer of the stack, from hardware and software, to our networks, to our data centers themselves,” wrote Yee Jiun Song, vice president, and Kaushik Veeraraghavan, software engineer, Infra Foundation for Meta, in a blog post about Meta’s AI networking efforts.
One of Meta’s themes over the years has been to support open systems development, and it’s continuing that effort.
“We have a long way to go in continuing to push open standards. We need standardization of systems, racks and power as rack power density continues to increase. We need standardization of the scale up and scale out network that these AI clusters use so that customers can mix/match different GPUs and accelerators to always use the latest and more cost-effective hardware,” Song and Veeraraghavan wrote.
“We need software innovation and standards to allow us to run jobs across heterogeneous hardware types that may be spread in different geographic locations. These open standards need to exist all the way through the stack, and there are massive opportunities to eliminate friction that is slowing down the build out of AI infrastructure.”
ESUN initiative
As part of its standardization efforts, Meta said it would be a key player in the new Ethernet for Scale-Up Networking (ESUN) initiative, which brings together AMD, Arista, Arm, Broadcom, Cisco, HPE Networking, Marvell, Microsoft, Nvidia, OpenAI and Oracle to advance the networking technology needed to handle the growing scale-up domain for AI systems.
ESUN will focus solely on open, standards-based Ethernet switching and framing for scale-up networking, excluding host-side stacks, non-Ethernet protocols, application-layer solutions, and proprietary technologies. The group will focus on the development and interoperability of XPU network interfaces and Ethernet switch ASICs for scale-up networks, the OCP wrote in a blog post.
ESUN will actively engage with other organizations, such as the Ultra Ethernet Consortium (UEC) and the long-standing IEEE 802.3 Ethernet working group, to align open standards, incorporate best practices, and accelerate innovation, the OCP stated.
Data center networking milestones
The launch of ESUN is just one of the AI networking developments Meta shared at the event. Meta engineers also announced three data center networking innovations aimed at making its infrastructure more flexible, scalable, and efficient:
- The evolution of Meta’s Disaggregated Scheduled Fabric (DSF) to support scale-out interconnect for large AI clusters that span entire data center buildings.
- A new Non-Scheduled Fabric (NSF) architecture based entirely on shallow-buffer, disaggregated Ethernet switches that will support Meta’s largest AI clusters, such as Prometheus.
- The addition of Minipack3N, based on Nvidia’s Spectrum-4 Ethernet ASIC, to Meta’s portfolio of 51Tbps OCP switches that use OCP’s Switch Abstraction Interface (SAI) and Meta’s Facebook Open Switching System (FBOSS) software stack.
DSF is Meta’s open networking fabric that completely separates switch hardware, NICs, endpoints, and other networking components from the underlying network, using OCP-SAI and FBOSS to achieve that, according to Meta. It supports RDMA over Converged Ethernet (RoCE) to endpoints, accelerators and NICs from multiple vendors, such as Nvidia, AMD and Broadcom, as well as Meta’s own MTIA accelerator stack. Between endpoints, it uses scheduled fabric techniques, particularly Virtual Output Queuing (VOQ) for traffic scheduling, to proactively avoid congestion rather than just react to it, according to Meta.
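To make the idea of Virtual Output Queuing concrete, here is a minimal Python sketch of a credit-style scheduler. It is an illustration only, not Meta’s DSF implementation: the class `IngressPort`, the function `schedule_round`, and the per-round `egress_capacity` credits are all hypothetical names and simplifications. Each ingress port keeps a separate queue per egress port, and a packet enters the fabric only when its egress port has granted a credit, so traffic never arrives faster than the destination can drain it.

```python
from collections import deque


class IngressPort:
    """Hypothetical ingress port holding one virtual output queue per egress port."""

    def __init__(self, num_egress):
        self.voqs = [deque() for _ in range(num_egress)]

    def enqueue(self, egress, packet):
        # Packets wait at the ingress side, sorted by destination.
        self.voqs[egress].append(packet)


def schedule_round(ingress_ports, egress_capacity):
    """Run one scheduling round.

    egress_capacity[e] is the number of credits (packets) egress port e
    can accept this round. Credits are handed out round-robin across
    ingress ports that have traffic queued for that egress port.
    """
    delivered = {e: [] for e in range(len(egress_capacity))}
    for egress, credits in enumerate(egress_capacity):
        while credits > 0:
            progressed = False
            for port in ingress_ports:
                if credits == 0:
                    break
                if port.voqs[egress]:
                    delivered[egress].append(port.voqs[egress].popleft())
                    credits -= 1
                    progressed = True
            if not progressed:
                break  # nothing left queued for this egress port
    return delivered


# Two ingress ports, two egress ports.
p0, p1 = IngressPort(2), IngressPort(2)
p0.enqueue(0, "a0")
p0.enqueue(0, "a1")
p0.enqueue(1, "b0")
p1.enqueue(0, "c0")

first = schedule_round([p0, p1], egress_capacity=[2, 1])
second = schedule_round([p0, p1], egress_capacity=[2, 1])
```

In this toy run, packet `b0` reaches egress port 1 in the first round even though `a1` is still waiting for egress port 0, which is the head-of-line blocking that per-destination queues are meant to avoid.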
“Over the last year, we have evolved DSF to a 2-stage architecture, scaling to support a non-blocking fabric that interconnects up to 18,432 XPUs,” wrote a group of Meta engineers in a co-authored blog post about the new advances. “These clusters are a fundamental building block for constructing AI clusters that span regions (and even multiple regions) in order to meet the increased capacity and performance demands of Meta’s AI workloads.”
Alongside DSF, Meta has added a new architecture called the Non-Scheduled Fabric (NSF), which it says is based on shallow-buffer OCP Ethernet switches to deliver low round-trip latency, the engineers wrote.
The NSF architecture is a three-tier fabric that uses adaptive routing to balance load across the network. That helps minimize congestion and keep GPU utilization high, which is critical for maximizing performance in Meta’s largest AI factories. As the engineers put it: “NSF supports adaptive routing for effective load-balancing, ensuring optimal utilization and minimizing congestion and serves as foundational building block for Gigawatt-scale AI clusters such as Meta’s Gigawatt-scale AI cluster, Prometheus.”
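The difference between adaptive routing and static, hash-based path selection can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not a description of how NSF switches work: the functions `pick_uplink`, `route_adaptively`, and `route_static` are hypothetical, and queue depth in arbitrary units stands in for whatever congestion signal real hardware uses.

```python
def pick_uplink(queue_depths):
    """Return the index of the least-loaded uplink (ties go to the lowest index)."""
    return min(range(len(queue_depths)), key=lambda i: queue_depths[i])


def route_adaptively(packet_sizes, num_uplinks):
    """Place each packet on whichever uplink currently has the shortest queue."""
    depths = [0] * num_uplinks
    for size in packet_sizes:
        depths[pick_uplink(depths)] += size
    return depths


def route_static(packet_sizes, flow_ids, num_uplinks):
    """Place each packet on an uplink fixed by its flow ID, ignoring load."""
    depths = [0] * num_uplinks
    for size, flow_id in zip(packet_sizes, flow_ids):
        depths[flow_id % num_uplinks] += size
    return depths


sizes = [4, 4, 1, 1]
adaptive = route_adaptively(sizes, num_uplinks=2)
static = route_static(sizes, flow_ids=[0, 2, 1, 3], num_uplinks=2)
```

With these inputs, the adaptive policy ends with the two uplinks evenly loaded, while the static mapping happens to place both large packets on the same uplink. Uneven load of that kind is what leaves some links congested and others idle.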
Going forward, Meta will use both DSF and NSF depending on needs. DSF, for example, will provide a high-efficiency, highly scalable network for large, but still modular, AI clusters, while NSF will be targeted at the extreme demands of its largest, gigawatt-scale AI factories such as Prometheus, where low latency and robust adaptive routing are paramount.
Meta targeted the optical networking world as well. Last year, it introduced 2x400G FR4 BASE (3-km) optics, the primary solution supporting next-generation 51T platforms across both backend and frontend networks and DSFs. These optics have now been widely deployed throughout Meta’s data centers, the engineers stated:
“This year, we are expanding our portfolio with the launch of 2x400G FR4 LITE (500-m) optics. FR4 LITE is optimized for the majority of intra–data center use cases, supporting fiber links up to 500 meters. This new variant is designed to accelerate optics cost reduction while maintaining robust performance for shorter-reach applications.”
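The reach figures above suggest a simple selection rule, sketched here in Python. The table `OPTICS_REACH_M` and the function `pick_optic` are hypothetical; the sketch assumes that the shortest-reach module covering a link is the preferred, lower-cost choice, which follows from the cost-reduction goal stated for FR4 LITE but is not a documented Meta policy.

```python
# Maximum supported fiber length in meters, taken from the figures in the article.
OPTICS_REACH_M = {
    "2x400G FR4 LITE": 500,
    "2x400G FR4 BASE": 3000,
}


def pick_optic(link_length_m):
    """Return the shortest-reach optic that covers the link, or None if none does."""
    candidates = [
        (reach, name)
        for name, reach in OPTICS_REACH_M.items()
        if link_length_m <= reach
    ]
    if not candidates:
        return None
    return min(candidates)[1]  # smallest sufficient reach


short_link = pick_optic(300)
long_link = pick_optic(2000)
```

Under this rule, a 300-meter link maps to FR4 LITE and a 2,000-meter link to FR4 BASE, while anything beyond 3 km has no match in the table.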
In addition, Meta added the 400G DR4 OSFP-RHS optics — its first-generation DR4 package for AI host-side NIC connectivity. Complementing this, the new 2x400G DR4 OSFP optics are being deployed on the switch side, providing connectivity from host to switch.