Reducing Network Latency for Low-cost Beowulf Clusters
|Institution:||University of Cincinnati|
|Department:||Engineering and Applied Science: Computer Engineering|
|Keywords:||Computer Engineering; High Performance Computing; Cluster; Parallel Discrete Event Simulation; Infiniband; RoCE; Time Warp|
|Full text PDF:||http://rave.ohiolink.edu/etdc/view?acc_num=ucin1406880971|
Parallel Discrete Event Simulation (PDES) is a fine-grained parallel application that canbe difficult to optimize on distributed Beowulf clusters. A significant challenge onthese compute platforms is the relatively high network latency compared to the high CPUperformance on each node. The frequent communications and high network latency means thatevent information communicated between nodes can arrive after a significant delay wherethe processing node is either waiting for the event to arrive (conservatively synchronizedsolutions) or prematurely processing events while the transmitted event is in transit(optimistically synchronized solutions). Thus, solutions to reduce network latency arecrucial to the deployment of PDES.Conventional attacks on network latency in cluster environments are to use high pricedhardware such as Infiniband and/or lightweight messaging layers other than TCP/IP.However, clusters are generally high cost systems (tens to hundreds of thousands of dollars)that, by necessity, must be shared. The use of lower latency hardware such as Infinibandcan nearly double the hardware cost and the replacement of the TCP/IP network stack on ashared platform is generally infeasible as other users of the shared platform (withcoarse-grained parallel computations) are well served by the TCP/IP stack and unwilling torewrite their applications to use the APIs of alternate network stacks. Furthermore,configuring the hardware with multiple messaging transport layers is also quite difficultto setup and not generally supported.Low cost, small-form factor compute nodes with multi-core processing chips are becomingwidely available. These solutions have lower performing compute nodes and yet often stillsupport 100Mb/1Gb Ethernet hardware (reducing the network latency/processor performancedisparity). The much lower per node costs (on the order of $200 per node) can enable thedeployment of non-shared, dedicated clusters and thus, may be an attractive alternativefor network customization and use to support PDES applications. This thesis explores thisoption of using an ODROID compute node for the cluster. The conventional TCP/IPnetworking stack is replaced with the (publicly available) RDMA over Converged Ethernet(RoCE) networking layer which has significantly lower latency costs. We find that RoCEsolution is capable of reducing end-to-end small message latency by more than 30%. Thistranslates to a performance improvement of greater than 10% (compared to the TCP/IPsolution) for PDES applications using Rensselaer's Optimistic Simulation System (ROSS).However, when comparing the ODROID-based cluster performance for cost, both in terms ofoperations per second and Parallel Discrete Event Simulation performance, we find that itsperformance does not justify its price for either application.