概念 中断由硬件产生,并发送到中断控制器,中断控制器再发送中断到CPU,CPU检测到中断信号后,会中断当前的工作,每个中断都有IRQ(中断请求),基于IRQ,CPU将中断请求分发到对应的硬件驱动上通知操作系统,操作系统会对中断进行处理。
中断控制器 常见的中断控制器有两种:可编程中断控制器8259A和高级可编程中断控制器(APIC)。传统的8259A只适合单CPU的情况,现在都是多CPU、多核心的SMP体系,所以为了充分利用SMP体系结构,把中断传递给系统上的每个CPU以便更好实现并行和提高性能,Intel引入了高级可编程中断控制器(APIC)。
光有高级可编程中断控制器的硬件支持还不够,Linux内核还必须能利用这些硬件的特质,所以只有kernel 2.4以后的版本才支持把不同的硬件中断请求(IRQs)分配到特定的CPU核心上,这个绑定技术被称为SMP IRQ Affinity。
在设置网卡中断的cpu core时,有一个限制就是,IO-APIC 有两种工作模式:logic 和 physical,在 logic 模式下 IO-APIC 可以同时分布同一种 IO 中断到8颗 CPU (core) 上(受到 bitmask 寄存器的限制,因为 bitmask 只有8位长。);在 physical 模式下不能同时分布同一中断到不同 CPU 上,比如,不能让 eth0 中断同时由 CPU0 和 CPU1 处理,这个时候只能定位 eth0 到 CPU0、eth1 到 CPU1,也就是说 eth0 中断不能像 logic 模式那样可以同时由多个 CPU 处理。
软中断和硬中断 为了解决中断处理程序执行时间过长和中断丢失的问题,Linux系统将中断分为上半部和下半部。
上半部在中断禁止模式下运行,用来快速处理中断,主要用来处理跟硬件密切相关的工作。
下半部处理上半部未完成的工作,通常以内核线程的方式运行。
以Linux接收网卡数据包为例进行说明:
网卡接收到一个数据包后,会通过硬件中断的方式通知内核新的数据到了。内核会调用中断处理程序进行处理。
上半部将网卡中的数据写入到内存中,并更新一下硬件寄存器的状态,最后发送一个软中断信号,通知下半部进一步的处理。
下半部被软中断信号唤醒后,从内存中读到数据,按照网络协议栈对数据进行解析和处理,并发送给应用程序。
上面所说的上半部即硬中断,下半部即软中断,但一些内核自定义的事件也属于软中断,比如内核调度和RCU锁等。
硬中断:由外设产生,用来通知操作系统外设状态的变化。在处理中断的同时要关闭中断。特点为处理要尽可能的快。
软中断:为了满足实时性要求,硬中断处理时间都比较短,将时间比较长的中断放到软中断中来完成,称为下半部。int就是软中断指令,中断向量表是中断号和中断处理程序的对应表。每个CPU对应一个软中断内核线程ksoftirqd/cpu编号
。
1 2 3 4 5 [root@120-14-29-SH-1037-B07 ~]# ps -ef | grep ksoft root 3 2 0 2016 ? 01:47:12 [ksoftirqd/0] root 21 2 0 2016 ? 00:47:51 [ksoftirqd/1] root 26 2 0 2016 ? 00:47:34 [ksoftirqd/2] root 31 2 0 2016 ? 00:47:46 [ksoftirqd/3]
中断嵌套:硬中断可以嵌套,即新的硬中断可以打断正在执行的中断,但同种中断不可以。软中断不可嵌套,但相同类型中断可在不同的cpu上执行。
相关命令 mpstat 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 # 显示cpu处理的中断数量 [root@103-17-164-sh-100-k07 ~]# mpstat -I SUM 1 Linux 3.10.0-327.10.1.el7.x86_64 (103-17-164-sh-100-k07.yidian.com) 09/09/2017 _x86_64_ (4 CPU) 05:27:59 PM CPU intr/s 05:28:00 PM all 61274.00 05:28:01 PM all 61712.00 05:28:02 PM all 62315.00 05:28:03 PM all 59280.00 ^C Average: all 61145.25 # 显示每个核处理的中断数量 [root@103-17-164-sh-100-k07 ~]# mpstat -I SUM 1 -P ALL Linux 3.10.0-327.10.1.el7.x86_64 (103-17-164-sh-100-k07.yidian.com) 09/09/2017 _x86_64_ (4 CPU) 05:30:30 PM CPU intr/s 05:30:31 PM all 61446.00 05:30:31 PM 0 40489.00 05:30:31 PM 1 6839.00 05:30:31 PM 2 6935.00 05:30:31 PM 3 7185.00 # 显示更详细的信息 [root@120-14-31-SH-1037-B07 ~]# mpstat -P ALL 1 Linux 3.10.0-327.el7.x86_64 (120-14-31-SH-1037-B07.yidian.com) 09/10/2017 _x86_64_ (4 CPU) 01:15:09 AM CPU %usr %nice %sys %iowait %irq %soft %steal %guest %gnice %idle 01:15:10 AM all 6.35 0.00 11.64 0.00 0.00 7.67 0.00 0.00 0.00 74.34 01:15:10 AM 0 5.05 0.00 11.11 0.00 0.00 0.00 0.00 0.00 0.00 83.84 01:15:10 AM 1 6.38 0.00 12.77 0.00 0.00 9.57 0.00 0.00 0.00 71.28 01:15:10 AM 2 8.70 0.00 10.87 0.00 0.00 14.13 0.00 0.00 0.00 66.30 01:15:10 AM 3 6.32 0.00 12.63 0.00 0.00 7.37 0.00 0.00 0.00 73.68
lspci 可以来查看网卡型号,驱动等信息,内容较多
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 [root@103-17-164-sh-100-k07 ~]# lspci -vvv | more 00:00.0 Host bridge: Intel Corporation Xeon E7 v2/Xeon E5 v2/Core i7 DMI2 (rev 04) Subsystem: Super Micro Computer Inc Device 0668 Control: I/O- Mem- BusMaster- SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx- Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx- Interrupt: pin A routed to IRQ 0 Capabilities: [90] Express (v2) Root Port (Slot-), MSI 00 DevCap: MaxPayload 128 bytes, PhantFunc 0 ExtTag- RBE+ DevCtl: Report errors: Correctable- Non-Fatal- Fatal- Unsupported- RlxdOrd- ExtTag- PhantFunc- AuxPwr- NoSnoop- MaxPayload 128 bytes, MaxReadReq 128 bytes DevSta: CorrErr- UncorrErr- FatalErr- UnsuppReq- AuxPwr- TransPend- LnkCap: Port #0, Speed 5GT/s, Width x4, ASPM not supported, Exit Latency L0s <64ns, L1 <16us ClockPM- Surprise+ LLActRep+ BwNot+ LnkCtl: ASPM Disabled; RCB 64 bytes Disabled- CommClk- ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt- LnkSta: Speed unknown, Width x0, TrErr- Train- SlotClk- DLActive- BWMgmt- ABWMgmt- RootCtl: ErrCorrectable- ErrNon-Fatal- ErrFatal- PMEIntEna- CRSVisible- RootCap: CRSVisible- RootSta: PME ReqID 0000, PMEStatus- PMEPending- DevCap2: Completion Timeout: Range BCD, TimeoutDis+, LTR-, OBFF Not Supported ARIFwd- DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis-, LTR-, OBFF Disabled ARIFwd- LnkCtl2: Target Link Speed: 2.5GT/s, EnterCompliance- SpeedDis- Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS- Compliance De-emphasis: -6dB LnkSta2: Current De-emphasis Level: -6dB, EqualizationComplete-, EqualizationPhase1- EqualizationPhase2-, EqualizationPhase3-, LinkEqualizationRequest- Capabilities: [e0] Power Management version 3 Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0+,D1-,D2-,D3hot+,D3cold+) Status: D0 NoSoftRst- PME-Enable- DSel=0 DScale=0 PME- Capabilities: [100 v1] Vendor Specific Information: ID=0002 Rev=0 Len=00c <?> Capabilities: [144 v1] Vendor Specific Information: ID=0004 Rev=1 Len=03c <?> Capabilities: [1d0 v1] Vendor Specific Information: ID=0003 Rev=1 Len=00a <?> Capabilities: [280 v1] Vendor Specific Information: ID=0005 Rev=3 Len=018 <?> ...
用来查看网卡信息
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 [root@103-17-164-sh-100-k07 ~]# ethtool eth3 Settings for eth3: Supported ports: [ FIBRE ] Supported link modes: 1000baseT/Full 10000baseT/Full Supported pause frame use: No Supports auto-negotiation: Yes Advertised link modes: 1000baseT/Full 10000baseT/Full Advertised pause frame use: No Advertised auto-negotiation: Yes Speed: 10000Mb/s Duplex: Full Port: FIBRE PHYAD: 0 Transceiver: external Auto-negotiation: on Supports Wake-on: d Wake-on: d Current message level: 0x00000007 (7) drv probe link Link detected: yes # 可查看Ring buffer的大小 [root@103-17-164-sh-100-k07 ~]# ethtool -g eth3 Ring parameters for eth3: Pre-set maximums: RX: 4096 RX Mini: 0 RX Jumbo: 0 TX: 4096 Current hardware settings: RX: 512 RX Mini: 0 RX Jumbo: 0 TX: 512 # 列出信息较多,包含网卡的统计信息,包括丢包量信息 [root@103-17-164-sh-100-k07 ~]# ethtool -S eth3 | more NIC statistics: rx_packets: 680955795162 tx_packets: 27260701850 rx_bytes: 248285654162670 tx_bytes: 195321924245892 rx_pkts_nic: 683081802539 tx_pkts_nic: 27260700665 rx_bytes_nic: 251132784690871 tx_bytes_nic: 195447730645152 lsc_int: 11 tx_busy: 0 non_eop_descs: 1811095697 rx_errors: 67381 tx_errors: 0 rx_dropped: 0 tx_dropped: 0 multicast: 1025690658 broadcast: 206937242 rx_no_buffer_count: 0 collisions: 0 rx_over_errors: 0 rx_crc_errors: 67302 rx_frame_errors: 0 hw_rsc_aggregated: 2358414213 hw_rsc_flushed: 232402327 fdir_match: 10634169417 fdir_miss: 669277016191 fdir_overflow: 3321 rx_fifo_errors: 0 rx_missed_errors: 2538 tx_aborted_errors: 0 tx_carrier_errors: 0 tx_fifo_errors: 0 tx_heartbeat_errors: 0 tx_timeout_count: 0 tx_restart_queue: 0 rx_long_length_errors: 80264 rx_short_length_errors: 0 tx_flow_control_xon: 1 rx_flow_control_xon: 0 tx_flow_control_xoff: 51 rx_flow_control_xoff: 0 rx_csum_offload_errors: 0 alloc_rx_page_failed: 0 alloc_rx_buff_failed: 0 rx_no_dma_resources: 0 os2bmc_rx_by_bmc: 0 os2bmc_tx_by_bmc: 0 os2bmc_tx_by_host: 0 os2bmc_rx_by_host: 0 # 查看网卡多队列的支持情况,当前网卡支持8个队列,使用了8个队列 [root@103-17-6-sh-100-j11 ~]# ethtool -l eth0 Channel parameters for eth0: Pre-set maximums: RX: 0 TX: 0 Other: 1 Combined: 8 Current hardware settings: RX: 0 TX: 0 Other: 1 Combined: 8 # 设置网卡当前使用的多队列,当前使用的网卡数量不能超过最大值8,该值跟网卡的中断数量一一对应,即/proc/interrupts中看到的eth0的中断数量 [root@103-17-6-sh-100-j11 ~]# ethtool -L eth0 combined 2 [root@103-17-6-sh-100-j11 ~]# ethtool -l eth0 Channel parameters for eth0: Pre-set maximums: RX: 0 TX: 0 Other: 1 Combined: 8 Current hardware settings: RX: 0 TX: 0 Other: 1 Combined: 2
sar 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 # 列出网卡的接收包信息,比iftop更直观 [root@103-17-164-sh-100-k07 ~]# sar -n DEV 1 Linux 3.10.0-327.10.1.el7.x86_64 (103-17-164-sh-100-k07.yidian.com) 09/09/2017 _x86_64_ (4 CPU) 05:06:12 PM IFACE rxpck/s txpck/s rxkB/s txkB/s rxcmp/s txcmp/s rxmcst/s 05:06:13 PM eth0 0.00 0.00 0.00 0.00 0.00 0.00 0.00 05:06:13 PM eth1 0.00 0.00 0.00 0.00 0.00 0.00 0.00 05:06:13 PM eth2 0.00 0.00 0.00 0.00 0.00 0.00 0.00 05:06:13 PM eth3 77893.00 4328.00 31413.92 30134.16 0.00 0.00 46.00 05:06:13 PM lo 0.00 0.00 0.00 0.00 0.00 0.00 0.00 # 可列出错误包的相关信息 [root@103-17-164-sh-100-k07 ~]# sar -n EDEV 1 Linux 3.10.0-327.10.1.el7.x86_64 (103-17-164-sh-100-k07.yidian.com) 09/09/2017 _x86_64_ (4 CPU) 05:07:13 PM IFACE rxerr/s txerr/s coll/s rxdrop/s txdrop/s txcarr/s rxfram/s rxfifo/s txfifo/s 05:07:14 PM eth0 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 05:07:14 PM eth1 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 05:07:14 PM eth2 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 05:07:14 PM eth3 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 05:07:14 PM lo 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
通过中断可以看到网卡包含四个中断55-58,均位于cpu0上。
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 [root@103-17-164-sh-100-k07 ~]# cat /proc/interrupts CPU0 CPU1 CPU2 CPU3 0: 44 0 0 0 IR-IO-APIC-edge timer 1: 3 0 0 0 IR-IO-APIC-edge i8042 8: 42 0 0 0 IR-IO-APIC-edge rtc0 9: 1 0 0 0 IR-IO-APIC-fasteoi acpi 12: 4 0 0 0 IR-IO-APIC-edge i8042 16: 99 0 0 0 IR-IO-APIC-fasteoi ehci_hcd:usb1 18: 0 0 0 0 IR-IO-APIC-fasteoi i801_smbus 23: 83 0 0 0 IR-IO-APIC-fasteoi ehci_hcd:usb2 37: 12019917 0 0 0 IR-PCI-MSI-edge 0000:00:1f.2 48: 0 0 0 0 DMAR_MSI-edge dmar0 55: 746827548 0 0 0 IR-PCI-MSI-edge eth3-TxRx-0 56: 2674693551 0 0 0 IR-PCI-MSI-edge eth3-TxRx-1 57: 2341522223 0 0 0 IR-PCI-MSI-edge eth3-TxRx-2 58: 3587929355 0 0 0 IR-PCI-MSI-edge eth3-TxRx-3 59: 3334 0 0 0 IR-PCI-MSI-edge eth3 61: 2 0 0 0 IR-PCI-MSI-edge ioat-msix 63: 2 0 0 0 IR-PCI-MSI-edge ioat-msix 64: 2 0 0 0 IR-PCI-MSI-edge ioat-msix 65: 2 0 0 0 IR-PCI-MSI-edge ioat-msix 66: 2 0 0 0 IR-PCI-MSI-edge ioat-msix 67: 2 0 0 0 IR-PCI-MSI-edge ioat-msix 68: 2 0 0 0 IR-PCI-MSI-edge ioat-msix 69: 2 0 0 0 IR-PCI-MSI-edge ioat-msix NMI: 1100069 693493 635982 615953 Non-maskable interrupts LOC: 1120358899 146726541 4134846029 168005659 Local timer interrupts SPU: 0 0 0 0 Spurious interrupts PMI: 1100069 693493 635982 615953 Performance monitoring interrupts IWI: 57255892 115292160 113458706 112987848 IRQ work interrupts RTR: 0 0 0 0 APIC ICR read retries RES: 525229423 2791640970 427214674 1986396041 Rescheduling interrupts CAL: 4294536344 4294488238 4294478163 4294552026 Function call interrupts TLB: 65239533 57937650 55104990 52690662 TLB shootdowns TRM: 0 0 0 0 Thermal event interrupts THR: 0 0 0 0 Threshold APIC interrupts MCE: 0 0 0 0 Machine check exceptions MCP: 160132 160132 160132 160132 Machine check polls ERR: 0 MIS: 0
修改中断的cpu分配 echo “2” > /proc/irq/49/smp_affinity
其中2表示cpu1, 49表示中断号。
查看软中断 1 2 3 4 5 6 7 8 9 10 11 12 [root@120-14-31-SH-1037-B07 ~]# cat /proc/softirqs CPU0 CPU1 CPU2 CPU3 HI: 1 3 0 1 TIMER: 1795378091 3617740778 2674553229 1524071492 NET_TX: 202188392 22218135 17427628 17205883 NET_RX: 3388060179 60871361 68145291 38670323 BLOCK: 4741950 2422 1309 1489 BLOCK_IOPOLL: 0 0 0 0 TASKLET: 2102131738 3720214 3104944 1942912 SCHED: 526046585 612421231 496815061 456047989 HRTIMER: 0 0 0 0 RCU: 3147020579 4237695975 3820083676 3267816268
软中断包括10个类别,NET_RX(网络接收中断)、NET_TX(网络发送中断)
查看数据包统计 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 [root@c1-g08-120-166-30 ~]# netstat -s Ip: 7586513384 total packets received 0 forwarded 741 with unknown protocol 0 incoming packets discarded 7586512643 incoming packets delivered 7948370396 requests sent out Icmp: 23 ICMP messages received 0 input ICMP message failed. ICMP input histogram: destination unreachable: 1 echo requests: 19 echo replies: 3 46 ICMP messages sent 0 ICMP messages failed ICMP output histogram: destination unreachable: 9 echo request: 18 echo replies: 19 IcmpMsg: InType0: 3 InType3: 1 InType8: 19 OutType0: 19 OutType3: 9 OutType8: 18 Tcp: 561299810 active connections openings 2005002 passive connection openings 8 failed connection attempts 282644949 connection resets received 725 connections established 7585817181 segments received 13957880471 segments send out 1742807 segments retransmited 136 bad segments received. 523811266 resets sent Udp: 445457 packets received 3 packets to unknown port received. 0 packet receive errors 553840367 packets sent 0 receive buffer errors 0 send buffer errors UdpLite: InErrors: 3 TcpExt: 341304 invalid SYN cookies received 1 resets received for embryonic SYN_RECV sockets 1 ICMP packets dropped because they were out-of-window 1997421 TCP sockets finished time wait in fast timer 222787 delayed acks sent 1819 delayed acks further delayed because of locked socket Quick ack mode was activated 178186 times 13 packets directly queued to recvmsg prequeue. 3280604069 packet headers predicted 1972740996 acknowledgments not containing data payload received 798190042 predicted acknowledgments 642444 times recovered from packet loss by selective acknowledgements Detected reordering 23 times using FACK Detected reordering 1618 times using SACK 59 congestion windows fully recovered without slow start 16 congestion windows partially recovered using Hoe heuristic 17089 congestion windows recovered without slow start by DSACK 9594 congestion windows recovered without slow start after partial ack TCPLostRetransmit: 2849 4926 timeouts after SACK recovery 672451 fast retransmits 28946 forward retransmits 680 retransmits in slow start 5024 other TCP timeouts TCPLossProbes: 1390602 TCPLossProbeRecovery: 997235 4723 SACK retransmits failed 178189 DSACKs sent for old packets 979863 DSACKs received 62 DSACKs for out of order packets received 282645553 connections reset due to unexpected data 9 connections reset due to early user close TCPDSACKIgnoredNoUndo: 936061 TCPSpuriousRTOs: 5150 TCPSackShifted: 233309 TCPSackMerged: 733778 TCPSackShiftFallback: 1768422 IPReversePathFilter: 1 TCPRetransFail: 11 TCPRcvCoalesce: 1290965687 TCPOFOQueue: 1753072 TCPChallengeACK: 136 TCPSYNChallenge: 136 TCPSpuriousRtxHostQueues: 73 TCPAutoCorking: 272452106 TCPSynRetrans: 4538 TCPOrigDataSent: 9260789170 TCPHystartTrainDetect: 3721895 TCPHystartTrainCwnd: 64417412 TCPHystartDelayDetect: 80 TCPHystartDelayCwnd: 2579 TCPACKSkippedSynRecv: 4 IpExt: InMcastPkts: 249744 OutMcastPkts: 83317 InBcastPkts: 226 InOctets: 12312144006244 OutOctets: 12449967070063 InMcastOctets: 22309104 OutMcastOctets: 9664620 InBcastOctets: 104240 InNoECTPkts: 7586513474 InECT0Pkts: 47
irqbalance 查看是否运行:systemctl status irqbalance
irqbalance根据系统中断负载的情况,自动迁移中断保持中断的平衡,同时会考虑到省电因素等等。 但是在实时系统中会导致中断自动漂移,对性能造成不稳定因素,在高性能的场合建议关闭。
irqbalance用于优化中断分配,它会自动收集系统数据以分析使用模式,并依据系统负载状况将工作状态置于 Performance mode 或 Power-save mode。处于Performance mode 时,irqbalance 会将中断尽可能均匀地分发给各个 CPU core,以充分利用 CPU 多核,提升性能。 处于Power-save mode 时,irqbalance 会将中断集中分配给第一个 CPU,以保证其它空闲 CPU 的睡眠时间,降低能耗。
ref