Published on January 16, 2008
Bus Structures in Network-on-Chips
Interconnect-Centric Design for Advanced SoC and NoC, Chapter 8
Erno Salminen, 11.10.2004

Presentation Outline
- Design choices
- Problems and solutions
- SoC examples
- Conclusion
- (References)

Bus
- A (shared) bus is a set of signals connected to all devices
- Shared resource: one connection between devices reserves the whole interconnection
- Most available SoC communication networks are buses
- Low implementation cost, simple
- Bandwidth is shared among the devices
- Long signal lines are problematic in deep sub-micron (DSM) technologies

Hierarchical Bus
- Several bus segments connected with bridges
- Fast access as long as the target is in the same segment
- Requires locality of accesses
- Theoretical maximum speed-up = number of segments
- Segments are either circuit- or packet-switched together
- Packet switching provides more parallelism at the cost of added buffering

Signal Resolution
Figure 1. Signal resolution: a) three-state, b) mux-based, c) AND-OR (M = master, S = slave)

Structure
1. Hierarchical structures
2. Unidirectional ('U') or bidirectional ('B') links
3. Shared ('S') or point-to-point ('P') signals
   Exceptions:
   * In CoreConnect, data lines are shared but control lines form a ring
   ** In SiliconBackplane, data lines are shared but control flags are point-to-point
4. Synchronous ('S') or asynchronous ('A') transfers
5. Support for multiple clock domains
6. Test structures
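The AND-OR resolution of Figure 1c can be sketched in software: each master's output is masked (ANDed) with its grant signal, and the masked outputs are ORed onto the global bus. This is a minimal illustrative model, not taken from any of the cited buses; the function name is my own.

```python
def and_or_bus(master_outputs, grants):
    """AND-OR signal resolution: each master's output word is ANDed with
    its one-hot grant, and the results are ORed onto the shared bus.
    At most one grant should be active; all-zero grants yield an idle bus."""
    bus = 0
    for data, grant in zip(master_outputs, grants):
        # AND stage: a master without the grant drives all-zeros
        masked = data if grant else 0
        # OR stage: combine all masked outputs onto the global bus line
        bus |= masked
    return bus

# Master 1 holds the grant, so its data reaches the slaves unchanged.
print(and_or_bus([0xAB, 0xCD, 0xEF], [0, 1, 0]))  # -> 0xCD (205)
```

Compared with three-state resolution, this scheme needs no bidirectional drivers, which is why it maps well onto standard-cell logic.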
Transfers (1)
- Pipelined transfer: the address is transferred before the data
  - More time for address decoding
  - The address can be interleaved with the last data of the previous transfer
- Split transfer: a read operation is split into two write operations
  - Agent A sends a read request to agent B
  - The bus is released while agent B prepares the data
  - When agent B is ready, it writes the data to agent A
(Timing diagram: pipelined transfer vs. split transaction)

Transfers (2)
- Handshaking provides support for multiple clock domains
  - Slower devices can stretch the transfer
  - No additional delay when the agents are fast enough
  - Mandatory in asynchronous systems

Transfers (3)
1. Dedicated bus control signals used for handshaking
   Exception: * v.1 does not use them, v.2 does
2. Split transfers
3. Pipelined transfers
4. Broadcast support

Arbitration / Decoding
- Arbitration decides which master may use the shared resource (e.g. the bus)
  - A single-master system does not need arbitration
  - E.g. priority, round-robin, TDMA
  - Two-level: e.g. TDMA + priority
- Decoding is needed to determine the target
- Both can be centralized or distributed

Centralized / Distributed
Figure 2. Centralized vs. distributed control (M = master, S = slave)

Reconfiguration
- Not all communication can be estimated beforehand
  - Communication varies dynamically
  - Arbitration may perform poorly
- Dynamic reconfiguration can be used to change the key parameters
  - Communication can be tuned to better meet the current requirements

Arbitration and reconfiguration
1. Application-specific ('as'), one-level ('1'), or two-level ('2') arbitration scheme
2. Arbitration done during the previous transfer (pipelined arbitration)
3. Centralized ('C') or distributed ('D') arbitration
4. Dynamic reconfiguration
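The priority and round-robin policies named in the Arbitration / Decoding slide can be sketched as follows. This is a software illustration under my own naming, not the arbiter of any cited bus: a real arbiter is combinational or registered logic, but the selection rule is the same.

```python
def priority_arbiter(requests, priorities):
    """Grant the requesting master with the highest priority.
    Gives minimum latency to high priorities, but low ones can starve."""
    requesting = [m for m, req in enumerate(requests) if req]
    if not requesting:
        return None  # idle bus: nobody requests
    return max(requesting, key=lambda m: priorities[m])

def round_robin_arbiter(requests, last_grant):
    """Fair policy: grant the next requesting master after the previous
    winner, wrapping around, so no master is permanently starved."""
    n = len(requests)
    for offset in range(1, n + 1):
        m = (last_grant + offset) % n
        if requests[m]:
            return m
    return None

# Masters 0 and 2 request; master 2 has the higher priority value.
print(priority_arbiter([1, 0, 1], priorities=[1, 3, 2]))  # -> 2
# Round-robin: the last winner was 0, so the scan grants master 2.
print(round_robin_arbiter([1, 0, 1], last_grant=0))       # -> 2
```

A two-level scheme such as TDMA + priority would first check whether the current time slot's owner requests, and fall back to one of the policies above otherwise.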
Problem 1: Bandwidth
Figure 3. Bus structures: a) single bus, b) hierarchical bus, c) multiple bus, d) split bus (A = agent, B = bridge)

Problem 2: Signaling (1)
- Estimated edge-to-edge propagation delay across a 50 nm chip is 6-10 clock cycles
- Wires have a notable capacitance
- Asynchronous techniques, e.g. the Marble bus
  - Four-phase handshaking
  - Uses two signals for each bit: "01" = low, "10" = high, "00" and "11" = illegal
- Split-bus technique
  - If the target is near, only the necessary switches are on, so the effective wire capacitance is smaller
  - Smaller power, parallel transfers, smaller delay (the delay benefit applies to asynchronous buses only)
  - More complex arbitration

Problem 2: Signaling (2)
- Latency-insensitive protocols
  - Long signal lines are pipelined with relay stations
  - Originally developed for point-to-point networks
- Multiple clock domains
  - Globally Asynchronous, Locally Synchronous (GALS)
  - Simplifies system design and clock-tree generation
  - Power saving in the global clock is often stated ("hyped") as the main reason
  - According to [Malley, ISVLSI '03], GALS may even increase power consumption
  - Power saving by lowering the frequency of some parts seems more probable
(Figure: agents connected by a long line pipelined with relay stations r)

Problem 2: Signaling (3)
- Bus encoding for low power
  - Invert the data if that reduces signal-line activity
  - Reported power saving ~25%

Problem 3: Reliability
- Long parallel lines increase the fault rate due to crosstalk and dynamic delay
  - Long wires have a large coupling capacitance: narrow (for high density) and thick (for smaller resistance)
- Error detection / correction
  - Bus coding
  - Bus guardians
  - Detection + retransfer seems more energy-efficient than correction
  - Layered approach (see Chapter 6)

Problem 4: Quality-of-Service (1)
- Guaranteed bandwidth / latency
- Arbitration schemes:
  - Round-robin: fair
  - Priority: minimum latency for high priorities, but starvation is possible
  - Time Division Multiple Access (TDMA): most versatile, but requires a common notion of time
- Centralized control favors QoS; however, scalability (among other reasons) does not favor centralized control
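The low-power bus encoding described under Problem 2: Signaling (3) is bus-invert coding (Stan and Burleson, cited in the references). A minimal software sketch of the encoder, assuming an 8-bit bus plus one extra invert line:

```python
def bus_invert_encode(prev, word, width=8):
    """Bus-invert coding: if sending `word` would toggle more than half of
    the bus lines relative to the previous value `prev`, send the inverted
    word and assert the extra invert line instead.
    Returns (encoded_word, invert_flag)."""
    toggles = bin(prev ^ word).count("1")   # Hamming distance to prev value
    if toggles > width // 2:
        mask = (1 << width) - 1
        return word ^ mask, 1   # inverted word toggles only width - toggles lines
    return word, 0

# 0x00 -> 0xFF would toggle all 8 lines; the encoder sends 0x00 with the
# invert flag set, so no data line toggles at all.
print(bus_invert_encode(0x00, 0xFF))  # -> (0, 1)
print(bus_invert_encode(0x00, 0x0F))  # -> (15, 0)  four toggles: no invert
```

The decoder simply XORs the received word with the invert flag replicated across the bus width, so the overhead is one wire and a majority-style comparison per transfer.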
Problem 4: Quality-of-Service (2)
- Multiple priorities for data (virtual channels)
  - E.g. HIBI currently supports 2 priorities
  - Usually requires more buffering
- Reconfiguration
  - Set priorities, TDMA slots, etc. at runtime
  - The hardest part is deciding when to reconfigure

Problem 5: Interface Standardization
- The number of different (incompatible) bus protocols approaches infinity
- Virtual Component Interface (VCI)
- Open Core Protocol (OCP)
  - Derived from VCI
  - TUT is a member of OCP-IP
  - Masters and slaves
- Wrapper ideology
  - Translates protocols
  - The underlying network is 'wrapped' so that the interface is the same

SoC Examples
- Amulet3i by Univ. of Manchester: asynchronous microcontroller, a single Marble bus
- MoVA by ETRI: MPEG-4 video codec, AMBA ASB and APB buses
- Viper by Philips: set-top box SoC, three PI buses and a memory bus

Amulet3i - Asynchronous microcontroller
- 0.35 um, 7 x 3.5 mm2
- 120 MIPS, 215 mW @ 85 MHz

MoVA - MPEG-4 codec
- 0.35 um, 110.25 mm2
- 220k NAND2 gates, 412 Kb SRAM, 1.7 Mgates total
- 3.3 V, 0.5 W @ 27 MHz
- 30 fps QCIF, 15 fps CIF

Viper - Set-top box SoC
- 0.18 um, 2 processors + 50 cores
- 8M NAND2 gates total, 750 Kb SRAM
- 82 clock domains
- 1.8 V, 4.5 W @ 143/150/200 MHz

HIBI
- Heterogeneous IP Block Interconnection, developed at TUT
- Hierarchical bus NoC
- Parameterizable, scalable
- QoS, run-time reconfiguration
- Efficient protocol
- Automated communication-centric design flow

HIBI Network Example
Figure 7. Example of a hierarchical HIBI network (IP blocks connected by bus segments)
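The wrapper ideology from the Problem 5: Interface Standardization slide can be illustrated in software: an adapter presents one common interface while translating calls to a core's native protocol. The class and method names below are hypothetical illustrations, not taken from the OCP or VCI specifications.

```python
class CommonBusInterface:
    """Hypothetical common interface that every wrapped core presents
    to the network, regardless of its native protocol."""
    def read(self, addr):
        raise NotImplementedError
    def write(self, addr, data):
        raise NotImplementedError

class LegacyCore:
    """A core with its own, incompatible native protocol."""
    def __init__(self):
        self.mem = {}
    def native_load(self, addr):
        return self.mem.get(addr, 0)
    def native_store(self, addr, data):
        self.mem[addr] = data

class LegacyCoreWrapper(CommonBusInterface):
    """Wrapper: translates the common protocol into the core's native
    one, so the underlying core is 'wrapped' behind the same interface."""
    def __init__(self, core):
        self.core = core
    def read(self, addr):
        return self.core.native_load(addr)
    def write(self, addr, data):
        self.core.native_store(addr, data)

agent = LegacyCoreWrapper(LegacyCore())
agent.write(0x10, 42)
print(agent.read(0x10))  # -> 42
```

In hardware the wrapper is interface logic around the IP block rather than a class, but the design intent is the same: only the wrapper changes when the core's protocol does.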
H.263 Video Encoder
- Objective: show how easily HIBI scales
- 2-10 ARM7 processors
- Processor-independent C source code
- Master + a scalable number of processors generated automatically
- Verified with HW/SW co-simulation

Conclusions
- No general network suits every application
  - The ratio between achieved and maximum throughput is small
- A heterogeneous network addresses these problems
  - Local and global communication are separated
  - Use a bus for local communication
  - Use an application-specific network for global communication

References
D. Sylvester and K. Keutzer, “Impact of small process geometries on microarchitectures in systems on a chip,” Proceedings of the IEEE, Vol. 89, No. 4, Apr. 2001, pp. 467-489.
P. Wielage and K. Goossens, “Networks on silicon: blessing or nightmare?,” Symp. Digital System Design, Dortmund, Germany, 4-6 Sep. 2002, pp. 196-200.
R. Ho, K.W. Mai, and M.A. Horowitz, “The future of wires,” Proceedings of the IEEE, Vol. 89, No. 4, Apr. 2001, pp. 490-504.
D.B. Gustavson, “Computer buses - a tutorial,” in Advanced Multiprocessor Bus Architectures, Janusz Zalewski (ed.), IEEE Computer Society Press, 1995, pp. 10-25.
ARM, AMBA Specification, Rev 2.0, ARM Limited, 1999.
IBM, 32-bit Processor Local Bus Architecture Specification, Version 2.9, IBM Corporation, 2001.
B. Cordan, “An efficient bus architecture for system-on-chip design,” IEEE Custom Integrated Circuits Conference, San Diego, California, 16-19 May 1999, pp. 623-626.
K. Kuusilinna et al., “Low latency interconnection for IP-block based multimedia chips,” IASTED Int’l Conf. Parallel and Distributed Computing and Networks, Brisbane, Australia, 14-16 Dec. 1998, pp. 411-416.
V. Lahtinen et al., “Interconnection scheme for continuous-media systems-on-a-chip,” Microprocessors and Microsystems, Vol. 26, No. 3, Apr. 2002, pp. 123-138.
W.J. Bainbridge and S.B. Furber, “MARBLE: an asynchronous on-chip macrocell bus,” Microprocessors and Microsystems, Vol.
24, No. 4, Aug. 2000, pp. 213-222.
OMI, PI-Bus VHDL Toolkit, Version 3.1, Open Microprocessor Systems Initiative, 1997.
Sonics, Sonics Networks Technical Overview, Sonics Inc., June 2000.
B. Ackland et al., “A single-chip, 1.6-billion, 16-b MAC/s multiprocessor DSP,” IEEE Journal of Solid-State Circuits, Vol. 35, No. 3, Mar. 2000, pp. 412-424.
Silicore, Wishbone System-on-Chip (SoC) Interconnection Architecture for Portable IP Cores, Revision B.1, Silicore Corporation, 2001.
E. Salminen et al., “Overview of bus-based system-on-chip interconnections,” Int’l Symp. Circuits and Systems, Scottsdale, Arizona, 26-29 May 2002, pp. II-372-II-375.
S. Dutta, R. Jensen, and A. Rieckmann, “Viper: a multiprocessor SoC for advanced set-top box and digital TV systems,” IEEE Design and Test of Computers, Vol. 18, No. 5, Sep./Oct. 2001, pp. 21-31.
K. Lahiri, A. Raghunathan, and G. Lakshminarayana, “LOTTERYBUS: a new high-performance communication architecture for system-on-chip designs,” Design Automation Conference, Las Vegas, Nevada, 18-22 June 2001, pp. 15-20.
VSIA, Virtual Component Interface Specification (OCB 2 1.0), VSI Alliance, 1999.
OCP International Partnership, Open Core Protocol Specification, Release 1.0, OCP-IP Association, 2001.
L. Benini and G. De Micheli, “Networks on chips: a new SoC paradigm,” Computer, Vol. 35, No. 1, Jan. 2002, pp. 70-78.
A. Boxer, “Where buses cannot go,” IEEE Spectrum, Vol. 32, No. 2, Feb. 1995, pp. 41-45.
L.P. Carloni and A.L. Sangiovanni-Vincentelli, “Coping with latency in SoC design,” IEEE Micro, Vol. 22, No. 5, Sep./Oct. 2002, pp. 24-35.
Cheng-Ta Hsieh and M. Pedram, “Architectural energy optimization by bus splitting,” IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, Vol. 21, No. 4, Apr. 2002, pp. 408-414.
M.R. Stan and W.P. Burleson, “Bus-invert coding for low-power I/O,” IEEE Transactions on Very Large Scale Integration (VLSI) Systems, Vol. 3, No. 1, Mar. 1995, pp. 49-58.
M. Lajolo, “Bus guardians: an effective solution for online detection and correction of faults affecting system-on-chip buses,” IEEE Transactions on Very Large Scale Integration (VLSI) Systems, Vol. 9, No. 6, Dec. 2001, pp. 974-982.
A.B. Kahng, S. Muddu, and E. Sarto, “Interconnect optimization strategies for high-performance VLSI designs,” Int’l Conf. VLSI Design, Goa, India, 7-10 Jan. 1999, pp. 464-469.
W.O. Cesario et al., “Multiprocessor SoC platforms: a component-based design approach,” IEEE Design and Test of Computers, Vol. 19, No. 6, Nov./Dec. 2002, pp. 52-63.
J.D. Garside et al., “AMULET3i - an asynchronous system-on-chip,” Int’l Symp. Advanced Research in Asynchronous Circuits and Systems, Eilat, Israel, 2-6 Apr. 2000, pp. 162-175.
J.H. Park et al., “MPEG-4 video codec on an ARM core and AMBA,” Workshop and Exhibition on MPEG-4, San Jose, California, 18-20 June 2001, pp. 95-98.