11490 SRS ring polling disabled for VLANs
11491 Want DLS bypass for VLAN traffic
11492 add VLVF bypass to ixgbe core
2869 duplicate packets with vnics over aggrs
11489 DLS stat delete and aggr kstat can deadlock
Portions contributed by: Theo Schlossnagle <jesus@omniti.com>
Reviewed by: Patrick Mooney <patrick.mooney@joyent.com>
Reviewed by: Robert Mustacchi <rm@joyent.com>
Reviewed by: Dan McDonald <danmcd@joyent.com>
--- old/usr/src/uts/common/io/mac/mac_sched.c
+++ new/usr/src/uts/common/io/mac/mac_sched.c
1 1 /*
2 2 * CDDL HEADER START
3 3 *
4 4 * The contents of this file are subject to the terms of the
5 5 * Common Development and Distribution License (the "License").
6 6 * You may not use this file except in compliance with the License.
7 7 *
8 8 * You can obtain a copy of the license at usr/src/OPENSOLARIS.LICENSE
9 9 * or http://www.opensolaris.org/os/licensing.
10 10 * See the License for the specific language governing permissions
11 11 * and limitations under the License.
12 12 *
13 13 * When distributing Covered Code, include this CDDL HEADER in each
[ 13 lines elided ]
14 14 * file and include the License file at usr/src/OPENSOLARIS.LICENSE.
15 15 * If applicable, add the following below this CDDL HEADER, with the
16 16 * fields enclosed by brackets "[]" replaced with your own identifying
17 17 * information: Portions Copyright [yyyy] [name of copyright owner]
18 18 *
19 19 * CDDL HEADER END
20 20 */
21 21 /*
22 22 * Copyright 2010 Sun Microsystems, Inc. All rights reserved.
23 23 * Use is subject to license terms.
24 - * Copyright 2017 Joyent, Inc.
24 + * Copyright 2018 Joyent, Inc.
25 25 * Copyright 2013 Nexenta Systems, Inc. All rights reserved.
26 26 */
27 27
28 28 /*
29 29 * MAC data path
30 30 *
31 31 * The MAC data path is concerned with the flow of traffic from mac clients --
32 32 * DLS, IP, etc. -- to various GLDv3 device drivers -- e1000g, vnic, aggr,
33 33 * ixgbe, etc. -- and from the GLDv3 device drivers back to clients.
34 34 *
35 35 * -----------
36 36 * Terminology
37 37 * -----------
38 38 *
39 39 * MAC uses a lot of different, but related terms that are associated with the
40 40 * design and structure of the data path. Before we cover other aspects, first
41 41 * let's review the terminology that MAC uses.
42 42 *
43 43 * MAC
44 44 *
45 45 * This driver. It interfaces with device drivers and provides abstractions
46 46 * that the rest of the system consumes. All data links -- things managed
47 47 * with dladm(1M) -- are accessed through MAC.
48 48 *
49 49 * GLDv3 DEVICE DRIVER
50 50 *
51 51 * A GLDv3 device driver refers to a driver, both for pseudo-devices and
52 52 * real devices, which implement the GLDv3 driver API. Common examples of
53 53 * these are igb and ixgbe, which are drivers for various Intel networking
54 54 * cards. These devices may or may not have various features, such as
55 55 * hardware rings and checksum offloading. For MAC, a GLDv3 device is the
56 56 * final point for the transmission of a packet and the starting point for
57 57 * the receipt of a packet.
58 58 *
59 59 * FLOWS
60 60 *
61 61 * At a high level, a flow refers to a series of packets that are related.
62 62 * Oftentimes the term is used in the context of TCP to indicate a unique
63 63 * TCP connection and the traffic over it. However, a flow can exist at
64 64 * other levels of the system as well. MAC has a notion of a default flow
65 65 * which is used for all unicast traffic addressed to the address of a MAC
66 66 * device. For example, when a VNIC is created, a default flow is created
67 67 * for the VNIC's MAC address. In addition, flows are created for broadcast
68 68 * groups and a user may create a flow with flowadm(1M).
69 69 *
70 70 * CLASSIFICATION
71 71 *
72 72 * Classification refers to the notion of identifying an incoming frame
73 73 * based on its destination address and optionally its source addresses and
74 74 * doing different processing based on that information. Classification can
75 75 * be done in both hardware and software. In general, we usually only
76 76 * classify based on the layer two destination, e.g. for Ethernet, the
77 77 * destination MAC address.
78 78 *
79 79 * The system also will do classification based on layer three and layer
80 80 * four properties. This is used to support things like flowadm(1M), which
81 81 * allows setting QoS and other properties on a per-flow basis.
82 82 *
83 83 * RING
84 84 *
85 85 * Conceptually, a ring represents a series of framed messages, often in a
86 86 * contiguous chunk of memory that acts as a circular buffer. Rings come in
87 87 * a couple of forms. Generally they are either a hardware construct (hw
88 88 * ring) or they are a software construct (sw ring) maintained by MAC.
89 89 *
90 90 * HW RING
91 91 *
92 92 * A hardware ring is a set of resources provided by a GLDv3 device driver
93 93 * (even if it is a pseudo-device). A hardware ring comes in two different
94 94 * forms: receive (rx) rings and transmit (tx) rings. An rx hw ring is
95 95 * something that has a unique DMA (direct memory access) region and
96 96 * generally supports some form of classification (though it isn't always
97 97 * used), as well as a means of generating an interrupt specific to that
98 98 * ring. For example, the device may generate a specific MSI-X for a PCI
99 99 * express device. A tx ring is similar, except that it is dedicated to
100 100 * transmission. It may also be a vector for enabling features such as VLAN
101 101 * tagging and large transmit offloading. It usually has its own dedicated
102 102 * interrupts for transmit being completed.
103 103 *
104 104 * SW RING
105 105 *
106 106 * A software ring is a construction of MAC. It represents the same thing
107 107 * that a hardware ring generally does, a collection of frames. However,
108 108 * instead of being in a contiguous ring of memory, they're instead linked
109 109 * by using the mblk_t's b_next pointer. Each frame may itself be multiple
110 110 * mblk_t's linked together by the b_cont pointer. A software ring always
111 111 * represents a collection of classified packets; however, it varies as to
112 112 * whether it uses only layer two information, or a combination of that and
113 113 * additional layer three and layer four data.
114 114 *
115 115 * FANOUT
116 116 *
117 117 * Fanout is the idea of spreading out the load of processing frames based
118 118 * on the source and destination information contained in the layer two,
119 119 * three, and four headers, such that the data can then be processed in
120 120 * parallel using multiple hardware threads.
121 121 *
122 122 * A fanout algorithm hashes the headers and uses that to place different
123 123 * flows into a bucket. The most important thing is that packets that are
124 124 * in the same flow end up in the same bucket. If they do not, performance
125 125 * can be adversely affected. Consider the case of TCP. TCP severely
126 126 * penalizes a connection if the data arrives out of order. If a given flow
127 127 * is processed on different CPUs, then the data will appear out of order,
128 128 * hence the invariant that fanout always hashes a given flow to the same
129 129 * bucket, ensuring that it is processed on the same CPU.
130 130 *
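As a sketch of that invariant (with simplified stand-in types; the real hashing is performed by mac_pkt_hash() on the actual frame headers), a fanout bucket function might look like:

    #include <stdint.h>

    /* Simplified 5-tuple; the real code extracts these from the frame. */
    typedef struct flow_tuple {
        uint32_t ft_saddr;   /* source IPv4 address */
        uint32_t ft_daddr;   /* destination IPv4 address */
        uint16_t ft_sport;   /* source port */
        uint16_t ft_dport;   /* destination port */
        uint8_t  ft_proto;   /* IP protocol number */
    } flow_tuple_t;

    /*
     * Hash a flow to one of nbuckets. Because the hash depends only on
     * fields that are constant for the life of a connection, every frame
     * of a given flow lands in the same bucket, and therefore on the
     * same CPU, preserving ordering.
     */
    static uint32_t
    flow_fanout_bucket(const flow_tuple_t *ft, uint32_t nbuckets)
    {
        uint32_t hash;

        hash = ft->ft_saddr ^ ft->ft_daddr ^ ft->ft_proto ^
            (((uint32_t)ft->ft_sport << 16) | ft->ft_dport);
        hash ^= hash >> 16;   /* mix the halves */
        return (hash % nbuckets);
    }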
131 131 * RECEIVE SIDE SCALING (RSS)
132 132 *
133 133 *
134 134 * Receive side scaling is a term that isn't common in illumos, but is used
135 135 * by vendors and was popularized by Microsoft. It refers to the idea of
136 136 * spreading the incoming receive load out across multiple interrupts which
137 137 * can be directed to different CPUs. This allows a device to leverage
138 138 * hardware rings even when it doesn't support hardware classification. The
139 139 * hardware uses an algorithm to perform fanout that ensures the flow
140 140 * invariant is maintained.
141 141 *
142 142 * SOFT RING SET
143 143 *
144 144 * A soft ring set, commonly abbreviated SRS, is a collection of rings and
145 145 * is used for both transmitting and receiving. It is maintained in the
146 146 * structure mac_soft_ring_set_t. A soft ring set is usually associated
147 147 * with flows, and coordinates both the use of hardware and software rings.
148 148 * Because the use of hardware rings can change as devices such as VNICs
149 149 * come and go, we always ensure that the set has software classification
150 150 * rules that correspond to the hardware classification rules from rings.
151 151 *
152 152 * Soft ring sets are also used for the enforcement of various QoS
153 153 * properties. For example, if a bandwidth limit has been placed on a
154 154 * specific flow or device, then that will be enforced by the soft ring
155 155 * set.
156 156 *
157 157 * SERVICE ATTACHMENT POINT (SAP)
158 158 *
159 159 * The service attachment point is a DLPI (Data Link Provider Interface)
160 160 * concept; however, it comes up quite often in MAC. Most MAC devices speak
161 161 * a protocol that has some notion of different channels or message type
162 162 * identifiers. For example, Ethernet defines an EtherType which is a part
163 163 * of the Ethernet header and defines the particular protocol of the data
164 164 * payload. If the EtherType is set to 0x0800, then it defines that the
165 165 * contents of that Ethernet frame is IPv4 traffic. For Ethernet, the
166 166 * EtherType is the SAP.
167 167 *
168 168 * In DLPI, a given consumer attaches to a specific SAP. In illumos, the ip
169 169 * and arp drivers attach to the EtherTypes for IPv4, IPv6, and ARP. Using
170 170 * libdlpi(3LIB) user software can attach to arbitrary SAPs. With the
171 171 * exception of 802.1Q VLAN tagged traffic, MAC itself does not directly
172 172 * consume the SAP; however, it uses that information as part of hashing
173 173 * and it may be used as part of the construction of flows.
174 174 *
175 175 * PRIMARY MAC CLIENT
176 176 *
177 177 * The primary mac client refers to a mac client whose unicast address
178 178 * matches the address of the device itself. For example, if the system has
179 179 * instances of the e1000g driver such as e1000g0, e1000g1, etc., the
180 180 * primary mac client is the one named after the device itself. VNICs that
181 181 * are created on top of such devices are not the primary client.
182 182 *
183 183 * TRANSMIT DESCRIPTORS
184 184 *
185 185 * Transmit descriptors are a resource that most GLDv3 device drivers have.
186 186 * Generally, a GLDv3 device driver takes a frame that's meant to be output
187 187 * and puts a copy of it into a region of memory. Each region of memory
188 188 * usually has an associated descriptor that the device uses to manage
189 189 * properties of the frames. Devices have a limited number of such
190 190 * descriptors. They get reclaimed once the device finishes putting the
191 191 * frame on the wire.
192 192 *
193 193 * If the driver runs out of transmit descriptors, for example, because the
194 194 * OS is generating more frames than it can put on the wire, then it will
195 195 * return the unsent frames to the MAC layer.
196 196 *
197 197 * ---------------------------------
198 198 * Rings, Classification, and Fanout
199 199 * ---------------------------------
200 200 *
201 201 * The heart of MAC is made up of rings, and not those that Elven-kings wear.
202 202 * When receiving a packet, MAC breaks the work into two different, though
203 203 * interrelated phases. The first phase is generally classification and then the
204 204 * second phase is generally fanout. When a frame comes in from a GLDv3 Device,
205 205 * MAC needs to determine where that frame should be delivered. If it's a
206 206 * unicast frame (say a normal TCP/IP packet), then it will be delivered to a
207 207 * single MAC client; however, if it's a broadcast or multicast frame, then MAC
208 208 * may need to deliver it to multiple MAC clients.
209 209 *
210 210 * On transmit, classification isn't quite as important, but may still be used.
211 211 * Unlike with the receive path, the classification is not used to determine
212 212 * devices that should transmit something, but rather is used for special
213 213 * properties of a flow, e.g. bandwidth limits for a given IP address, device, or
214 214 * connection.
215 215 *
216 216 * MAC employs a software classifier and leverages hardware classification as
217 217 * well. The software classifier can leverage the full layer two information,
218 218 * source, destination, VLAN, and SAP. If the SAP indicates that IP traffic is
219 219 * being sent, it can classify based on the IP header, and finally, it also
220 220 * knows how to classify based on the local and remote ports of TCP, UDP, and
221 221 * SCTP.
222 222 *
223 223 * Hardware classifiers vary in capability. Generally all hardware classifiers
224 224 * provide the capability to classify based on the destination MAC address. Some
225 225 * hardware has additional filters built in for performing more in-depth
226 226 * classification; however, it often has much more limited resources for these
227 227 * activities as compared to the layer two destination address classification.
228 228 *
229 229 * The modus operandi in MAC is to always ensure that we have software-based
230 230 * capabilities and rules in place and then to supplement that with hardware
231 231 * resources when available. In general, simple layer two classification is
232 232 * sufficient and nothing else is used, unless a specific flow is created with
233 233 * tools such as flowadm(1M) or bandwidth limits are set on a device with
234 234 * dladm(1M).
235 235 *
236 236 * RINGS AND GROUPS
237 237 *
238 238 * To get into how rings and classification play together, it's first important
239 239 * to understand how hardware devices commonly associate rings and allow them to
240 240 * be programmed. Recall that a hardware ring should be thought of as a DMA
241 241 * buffer and an interrupt resource. Rings are then collected into groups. A
242 242 * group itself has a series of classification rules. One or more MAC addresses
243 243 * are assigned to a group.
244 244 *
245 245 * Hardware devices vary in terms of what capabilities they provide. Sometimes
246 246 * they allow for a dynamic assignment of rings to a group and sometimes they
247 247 * have a static assignment of rings to a group. For example, the ixgbe driver
248 248 * has a static assignment of rings to groups such that every group has exactly
249 249 * one ring and the number of groups is equal to the number of rings.
250 250 *
251 251 * Classification and receive side scaling both come into play with how a device
252 252 * advertises itself to MAC and how MAC uses it. If a device supports layer two
253 253 * classification of frames, then MAC will assign MAC addresses to a group as a
254 254 * form of primary classification. If a single MAC address is assigned to a
255 255 * group, a common case, then MAC will consider packets that come in from rings
256 256 * on that group to be fully classified and will not need to do any software
257 257 * classification unless a specific flow has been created.
258 258 *
259 259 * If a device supports receive side scaling, then it may advertise or support
260 260 * groups with multiple rings. In those cases, then receive side scaling will
261 261 * come into play and MAC will use that as a means of fanning out received
262 262 * frames across multiple CPUs. This can also be combined with groups that
263 263 * support layer two classification.
264 264 *
265 265 * If a device supports dynamic assignments of rings to groups, then MAC will
266 266 * change around the way that rings are assigned to various groups as devices
267 267 * come and go from the system. For example, when a VNIC is created, a new flow
268 268 * will be created for the VNIC's MAC address. If a hardware ring is available,
269 269 * MAC may opt to reassign it from one group to another.
270 270 *
271 271 * ASSIGNMENT OF HARDWARE RINGS
272 272 *
273 273 * This is a bit of a complicated subject that varies depending on the device,
274 274 * the use of aggregations, and the special nature of the primary mac client.
275 275 * This section deserves to be fleshed out.
276 276 *
277 277 * FANOUT
278 278 *
279 279 * illumos uses fanout to help spread out the incoming processing load of chains
280 280 * of frames away from a single CPU. If a device supports receive side scaling,
281 281 * then that provides an initial form of fanout; however, what we're concerned
282 282 * with here all happens after a given set of frames has been classified
283 283 * to a soft ring set.
284 284 *
285 285 * After frames reach a soft ring set and any bandwidth-related accounting
286 286 * has been performed, they may be fanned out based on one of the following
287 287 * three modes:
288 288 *
289 289 * o No Fanout
290 290 * o Protocol level fanout
291 291 * o Full software ring protocol fanout
292 292 *
[ 258 lines elided ]
293 293 * MAC makes the determination as to which of these modes a given soft ring set
294 294 * obtains based on parameters such as whether or not it's the primary mac
295 295 * client, whether it's on a 10 GbE or faster device, user controlled dladm(1M)
296 296 * properties, and the nature of the hardware and the resources that it has.
297 297 *
298 298 * When there is no fanout, MAC does not create any soft rings for a device and
299 299 * the device has frames delivered directly to the MAC client.
300 300 *
301 301 * Otherwise, all fanout is performed by software. MAC divides incoming frames
302 302 * into one of three buckets -- IPv4 TCP traffic, IPv4 UDP traffic, and
303 - * everything else. Note, VLAN tagged traffic is considered other, regardless of
304 - * the interior EtherType. Regardless of the type of fanout, these three
305 - * categories or buckets are always used.
303 + * everything else. Regardless of the type of fanout, these three categories
304 + * or buckets are always used.
306 305 *
307 306 * The difference between protocol level fanout and full software ring protocol
308 307 * fanout is the number of software rings that end up getting created. The
309 308 * system always uses the same number of software rings per protocol bucket. So
310 309 * in the first case when we're just doing protocol level fanout, we just create
311 310 * one software ring each for IPv4 TCP traffic, IPv4 UDP traffic, and everything
312 311 * else.
313 312 *
314 313 * In the case where we do full software ring protocol fanout, we generally use
315 314 * mac_compute_soft_ring_count() to determine the number of rings. There are
316 315 * other combinations of properties and devices that may send us down other
317 316 * paths, but this is a common starting point. If it's a non-bandwidth enforced
318 317 * device and we're on at least a 10 GbE link, then we'll use eight soft rings
319 318 * per protocol bucket as a starting point. See mac_compute_soft_ring_count()
320 319 * for more information on the total number.
321 320 *
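As a hedged sketch of the starting point described above (the real logic is mac_compute_soft_ring_count(), which weighs CPU counts and link properties that this omits):

    #include <stdbool.h>
    #include <stdint.h>

    /*
     * Starting-point soft ring count per protocol bucket (TCP, UDP,
     * other): one ring per bucket for protocol-level fanout, eight per
     * bucket for full fanout on a non-bandwidth-enforced link of at
     * least 10 GbE. Illustrative only.
     */
    static unsigned int
    soft_rings_per_bucket(bool full_fanout, bool bw_enforced,
        uint64_t link_speed_mbps)
    {
        if (!full_fanout)
            return (1);
        if (!bw_enforced && link_speed_mbps >= 10000)
            return (8);
        return (1);   /* conservative fallback for the sketch */
    }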
322 321 * For each of these rings, we create a mac_soft_ring_t and an associated worker
323 322 * thread. Particularly when doing full software ring protocol fanout, we bind
324 323 * each of the worker threads to individual CPUs.
325 324 *
326 325 * The other advantage of these software rings is that it allows upper layers to
327 326 * optionally poll on them. For example, TCP can leverage an squeue to poll on
328 327 * the software ring, see squeue.c for more information.
329 328 *
330 329 * DLS BYPASS
331 330 *
332 331 * DLS is the data link services module. It interfaces with DLPI, which is the
333 332 * primary way that other parts of the system such as IP interface with the MAC
334 333 * layer. While DLS is traditionally a STREAMS-based interface, it allows for
335 334 * certain modules such as IP to negotiate various more modern interfaces to be
336 335 * used, which are useful for higher performance and allow it to use direct
337 336 * function calls to DLS instead of using STREAMS.
338 337 *
339 338 * When we have IPv4 TCP or UDP software rings, then traffic on those rings is
340 339 * eligible for what we call the dls bypass. In those cases, rather than going
341 340 * through mac_rx_deliver() to DLS, frames are instead passed directly to the
342 341 * callback registered with DLS, generally ip_input().
343 342 *
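The shape of the bypass can be sketched as a stored function pointer that replaces the STREAMS path (types here are simplified stand-ins for the kernel's, and soft_ring_deliver() is hypothetical):

    /* Opaque stand-in for the kernel's message block type. */
    typedef struct mblk mblk_t;
    typedef void (*rx_direct_cb_t)(void *arg, mblk_t *chain);

    typedef struct soft_ring {
        rx_direct_cb_t sr_direct;      /* e.g. ip_input() when bypassing */
        void           *sr_direct_arg;
    } soft_ring_t;

    static void
    soft_ring_deliver(soft_ring_t *sr, mblk_t *chain)
    {
        /* Direct call: no STREAMS queueing between MAC and IP. */
        sr->sr_direct(sr->sr_direct_arg, chain);
    }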
344 343 * HARDWARE RING POLLING
345 344 *
346 345 * GLDv3 devices with hardware rings generally deliver chains of messages
347 346 * (mblk_t chain) during the context of a single interrupt. However, interrupts
348 347 * are not the only way that these devices may be used. As part of implementing
349 348 * ring support, a GLDv3 device driver must have a way to disable the generation
350 349 * of that interrupt and allow for the operating system to poll on that ring.
351 350 *
352 351 * To implement this, every soft ring set has a worker thread and a polling
353 352 * thread. If a sufficient packet rate comes into the system, MAC will 'blank'
354 353 * (disable) interrupts on that specific ring and the polling thread will start
355 354 * consuming packets from the hardware device and deliver them to the soft ring
356 355 * set, where the worker thread will take over.
357 356 *
358 357 * Once the rate of packet intake drops below a certain threshold,
359 358 * polling on the hardware ring will be quiesced and interrupts will be
360 359 * re-enabled for the given ring. This effectively allows the system to shift
361 360 * how it handles a ring based on its load. At high packet rates, polling on the
362 361 * device as opposed to relying on interrupts can actually reduce overall system
363 362 * load due to the minimization of interrupt activity.
364 363 *
365 364 * Note the importance of each ring having its own interrupt source. The whole
366 365 * idea here is that we do not disable interrupts on the device as a whole, but
367 366 * rather each ring can be independently toggled.
368 367 *
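A distilled sketch of the per-ring toggle (stand-in types and illustrative thresholds, not the ones MAC actually uses):

    #include <stdbool.h>
    #include <stdint.h>

    /* Simplified stand-in for a GLDv3 ring's interrupt controls. */
    typedef struct hw_ring {
        bool hr_polling;
        void (*hr_intr_disable)(struct hw_ring *);
        void (*hr_intr_enable)(struct hw_ring *);
    } hw_ring_t;

    /*
     * Toggle a single ring between interrupt and poll mode based on its
     * backlog, leaving every other ring's interrupt source untouched.
     */
    static void
    ring_mode_adjust(hw_ring_t *hr, uint32_t backlog)
    {
        if (!hr->hr_polling && backlog > 64) {
            hr->hr_intr_disable(hr);   /* 'blank' this ring */
            hr->hr_polling = true;
        } else if (hr->hr_polling && backlog == 0) {
            hr->hr_intr_enable(hr);
            hr->hr_polling = false;
        }
    }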
369 368 * USE OF WORKER THREADS
370 369 *
371 370 * Both the soft ring set and individual soft rings have a worker thread
372 371 * associated with them that may be bound to a specific CPU in the system. Any
373 372 * such assignment will get reassessed as part of dynamic reconfiguration events
374 373 * in the system such as the onlining and offlining of CPUs and the creation of
375 374 * CPU partitions.
376 375 *
377 376 * In many cases, while in an interrupt, we try to deliver a frame all the way
378 377 * through the stack in the context of the interrupt itself. However, if the
379 378 * number of queued frames has exceeded a threshold, then we instead defer to
380 379 * the worker thread to do this work and signal it. This is particularly useful
381 380 * when you have the soft ring set delivering frames into multiple software
382 381 * rings. If it were only delivering frames into a single software ring, then
383 382 * there'd be no need to have another thread take over. However, if it's
384 383 * delivering chains of frames to multiple rings, then it's worthwhile to have
385 384 * the worker for the software ring take over so that the different software
386 385 * rings can be processed in parallel.
387 386 *
388 387 * In a similar fashion to the hardware polling thread, if we don't have a
389 388 * backlog or there's nothing to do, then the worker thread will go back to
390 389 * sleep and frames can be delivered all the way from an interrupt. This
391 390 * behavior is useful as it's designed to minimize latency and the default
392 391 * disposition of MAC is to optimize for latency.
393 392 *
394 393 * MAINTAINING CHAINS
395 394 *
396 395 * Another useful idea that MAC uses is to try and maintain frames in chains for
397 396 * as long as possible. The idea is that all of MAC can handle chains of frames
398 397 * structured as a series of mblk_t structures linked with the b_next pointer.
399 398 * When performing software classification and software fanout, MAC does not
400 399 * simply determine the destination and send the frame along. Instead, in the
401 400 * case of classification, it tries to maintain a chain for as long as possible
402 401 * before passing it along and performing additional processing.
403 402 *
404 403 * In the case of fanout, MAC first determines what the target software ring is
405 404 * for every frame in the original chain and constructs a new chain for each
406 405 * target. MAC then delivers the new chain to each software ring in succession.
407 406 *
408 407 * The whole rationale for doing this is that we want to try and maintain the
409 408 * pipe as much as possible and deliver as many frames through the stack at once
410 409 * as we can, rather than just pushing a single frame through. This can often
411 410 * help bring down latency and allows MAC to get a better sense of the overall
412 411 * activity in the system and properly engage worker threads.
413 412 *
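A minimal sketch of the per-target chain construction (an mblk stand-in with only the two link pointers; ring_of() is a hypothetical classifier assumed to return a value below NRINGS):

    #include <stddef.h>

    /* Minimal stand-in: b_next links frames, b_cont links buffers. */
    typedef struct mblk {
        struct mblk *b_next;
        struct mblk *b_cont;
    } mblk_t;

    #define NRINGS 4

    extern unsigned int ring_of(const mblk_t *);   /* hypothetical hash */

    /*
     * Split one received chain into per-target chains, preserving frame
     * order within each target. heads[] and tails[] must be initialized
     * to NULL by the caller; each sub-chain is then delivered once.
     */
    static void
    fanout_chain(mblk_t *chain, mblk_t *heads[NRINGS], mblk_t *tails[NRINGS])
    {
        mblk_t *mp, *next;

        for (mp = chain; mp != NULL; mp = next) {
            unsigned int r = ring_of(mp);

            next = mp->b_next;
            mp->b_next = NULL;
            if (heads[r] == NULL)
                heads[r] = mp;
            else
                tails[r]->b_next = mp;
            tails[r] = mp;
        }
    }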
414 413 * --------------------
415 414 * Bandwidth Management
416 415 * --------------------
417 416 *
418 417 * Bandwidth management is something that's built into the soft ring set itself.
419 418 * When bandwidth limits are placed on a flow, a corresponding soft ring set is
420 419 * toggled into bandwidth mode. This changes how we transmit and receive the
421 420 * frames in question.
422 421 *
423 422 * Bandwidth management is done on a per-tick basis. We translate the user's
424 423 * requested bandwidth from a quantity per-second into a quantity per-tick. MAC
425 424 * cannot process a frame across more than one tick, thus it sets a lower bound
426 425 * for the bandwidth cap to be a single MTU. This also means that when
427 426 * hires ticks are enabled (hz is set to 1000), the minimum amount of
428 427 * bandwidth is higher, because the number of ticks has increased and MAC has to
429 428 * go from accepting 100 packets / sec to 1000 / sec.
430 429 *
431 430 * The bandwidth counter is reset by either the soft ring set's worker thread or
432 431 * a thread that is doing an inline transmit or receive if they discover that
433 432 * the current tick is in the future from the recorded tick.
434 433 *
435 434 * Whenever we're receiving or transmitting data, we end up leaving most of the
436 435 * work to the soft ring set's worker thread. This forces data inserted into the
437 436 * soft ring set to be effectively serialized and allows us to consume bandwidth
438 437 * at a reasonable rate. If there is nothing in the soft ring set at the moment
439 438 * and the set has available bandwidth, then it may be processed inline.
440 439 * Otherwise, the worker is responsible for taking care of the soft ring set.
441 440 *
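The per-tick translation reduces to simple arithmetic; a sketch with the MTU clamp described above:

    #include <stdint.h>

    /*
     * Translate a requested bytes-per-second limit into a per-tick
     * quota, clamped below by one MTU since a frame cannot be processed
     * across ticks. With hz = 1000 the quota is a tenth of what it is
     * at hz = 100, which is why the effective minimum bandwidth rises
     * when hires ticks are enabled.
     */
    static uint64_t
    bw_bytes_per_tick(uint64_t bytes_per_sec, uint64_t hz, uint64_t mtu)
    {
        uint64_t quota = bytes_per_sec / hz;

        return (quota < mtu ? mtu : quota);
    }

For example, a 1 Gb/s limit (125,000,000 bytes/sec) at hz = 100 yields a quota of 1,250,000 bytes per tick.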
442 441 * ---------------------
443 442 * The Receive Data Path
444 443 * ---------------------
445 444 *
446 445 * The following series of ASCII art images breaks apart the way that a frame
447 446 * comes in and is processed in MAC.
448 447 *
449 448 * Part 1 -- Initial frame receipt, SRS classification
450 449 *
451 450 * Here, a frame is received by a GLDv3 driver, generally in the context of an
452 451 * interrupt, and it ends up in mac_rx_common(). A driver calls either mac_rx or
453 452 * mac_rx_ring, depending on whether or not it supports rings and can identify
454 453 * the interrupt as having come from a specific ring. Here we determine whether
455 454 * or not it's fully classified and perform software classification as
456 455 * appropriate. From here, everything always ends up going to either entry [A]
457 456 * or entry [B] based on whether or not subflow processing is needed. We
458 457 * leave via fanout or delivery.
459 458 *
460 459 * +===========+
461 460 * v hardware v
462 461 * v interrupt v
463 462 * +===========+
464 463 * |
465 464 * * . . appropriate
466 465 * | upcall made
467 466 * | by GLDv3 driver . . always
468 467 * | .
469 468 * +--------+ | +----------+ . +---------------+
470 469 * | GLDv3 | +---->| mac_rx |-----*--->| mac_rx_common |
471 470 * | Driver |-->--+ +----------+ +---------------+
472 471 * +--------+ | ^ |
473 472 * | | ^ v
474 473 * ^ | * . . always +----------------------+
475 474 * | | | | mac_promisc_dispatch |
476 475 * | | +-------------+ +----------------------+
477 476 * | +--->| mac_rx_ring | |
478 477 * | +-------------+ * . . hw classified
479 478 * | v or single flow?
480 479 * | |
481 480 * | +--------++--------------+
482 481 * | | | * hw class,
483 482 * | | * hw classified | subflows
484 483 * | no hw class and . * | or single | exist
485 484 * | subflows | | flow |
486 485 * | | v v
487 486 * | | +-----------+ +-----------+
488 487 * | | | goto | | goto |
489 488 * | | | entry [A] | | entry [B] |
490 489 * | | +-----------+ +-----------+
491 490 * | v ^
492 491 * | +-------------+ |
493 492 * | | mac_rx_flow | * SRS and flow found,
494 493 * | +-------------+ | call flow cb
495 494 * | | +------+
496 495 * | v |
497 496 * v +==========+ +-----------------+
498 497 * | v For each v--->| mac_rx_classify |
499 498 * +----------+ v mblk_t v +-----------------+
500 499 * | srs | +==========+
501 500 * | polling  |
502 501 * | thread |->------------------------------------------+
503 502 * +----------+ |
504 503 * v . inline
505 504 * +--------------------+ +----------+ +---------+ .
506 505 * [A]---->| mac_rx_srs_process |-->| check bw |-->| enqueue |--*---------+
507 506 * +--------------------+ | limits | | frames | |
508 507 * ^ +----------+ | to SRS | |
509 508 * | +---------+ |
510 509 * | send chain +--------+ | |
511 510 * * when classified | signal | * BW limits, |
512 511 * | flow changes | srs |<---+ loopback, |
513 512 * | | worker | stack too |
514 513 * | +--------+ deep |
515 514 * +-----------------+ +--------+ |
516 515 * | mac_flow_lookup | | srs | +---------------------+ |
517 516 * +-----------------+ | worker |---->| mac_rx_srs_drain |<---+
518 517 * ^ | thread | | mac_rx_srs_drain_bw |
519 518 * | +--------+ +---------------------+
520 519 * | |
521 520 * +----------------------------+ * software rings
522 521 * [B]-->| mac_rx_srs_subflow_process | | for fanout?
523 522 * +----------------------------+ |
524 523 * +----------+-----------+
525 524 * | |
526 525 * v v
527 526 * +--------+ +--------+
528 527 * | goto | | goto |
529 528 * | Part 2 | | Part 3 |
530 529 * +--------+ +--------+
531 530 *
532 531 * Part 2 -- Fanout
533 532 *
534 533 * This part is concerned with using software fanout to assign frames to
535 534 * software rings and then deliver them to MAC clients or allow those rings to
536 535 * be polled upon. While there are two different primary fanout entry points,
537 536 * mac_rx_fanout and mac_rx_proto_fanout, they behave in similar ways, and aside
538 537 * from some of the individual hashing techniques used, most of the general
539 538 * flow is the same.
540 539 *
541 540 * +--------+ +-------------------+
542 541 * | From |---+--------->| mac_rx_srs_fanout |----+
543 542 * | Part 1 | | +-------------------+ | +=================+
544 543 * +--------+ | | v for each mblk_t v
545 544 * * . . protocol only +--->v assign to new v
546 545 * | fanout | v chain based on v
547 546 * | | v hash % nrings v
548 547 * | +-------------------------+ | +=================+
549 548 * +--->| mac_rx_srs_proto_fanout |----+ |
550 549 * +-------------------------+ |
551 550 * v
552 551 * +------------+ +--------------------------+ +================+
553 552 * | enqueue in |<---| mac_rx_soft_ring_process |<------v for each chain v
554 553 * | soft ring | +--------------------------+ +================+
555 554 * +------------+
556 555 * | +-----------+
557 556 * * soft ring set | soft ring |
558 557 * | empty and no | worker |
559 558 * | worker? | thread |
560 559 * | +-----------+
561 560 * +------*----------------+ |
562 561 * | . | v
563 562 * No . * . Yes | +------------------------+
564 563 * | +----<--| mac_rx_soft_ring_drain |
565 564 * | | +------------------------+
566 565 * v |
567 566 * +-----------+ v
568 567 * | signal | +---------------+
569 568 * | soft ring | | Deliver chain |
570 569 * | worker | | goto Part 3 |
571 570 * +-----------+ +---------------+
572 571 *
573 572 *
574 573 * Part 3 -- Packet Delivery
575 574 *
576 575 * Here, we go through and deliver the mblk_t chain directly to a given
577 576 * processing function. In a lot of cases this is mac_rx_deliver(). In the case
578 577 * of DLS bypass being used, we instead deliver the chain directly
579 578 * to the callback registered with DLS, generally ip_input.
580 579 *
581 580 *
582 581 * +---------+ +----------------+ +------------------+
583 582 * | From |---+------->| mac_rx_deliver |--->| Off to DLS, or |
584 583 * | Parts 1 | | +----------------+ | other MAC client |
585 584 * | and 2 | * DLS bypass +------------------+
586 585 * +---------+ | enabled +----------+ +-------------+
587 586 * +---------->| ip_input |--->| To IP |
588 587 * +----------+ | and beyond! |
589 588 * +-------------+
590 589 *
591 590 * ----------------------
592 591 * The Transmit Data Path
593 592 * ----------------------
594 593 *
595 594 * Before we go into the images, it's worth talking about a problem that is a
596 595 * bit different from the receive data path. GLDv3 device drivers have a finite
597 596 * number of transmit descriptors. When they run out, they return unconsumed
598 597 * frames to MAC. MAC, at this point, has several options about what it will do,
599 598 * which vary based upon the settings that the client uses.
600 599 *
601 600 * When a device runs out of descriptors, the next thing that MAC does is
602 601 * enqueue them on the soft ring set or a software ring, depending on the
603 602 * configuration of the soft ring set. MAC will enqueue up to a high watermark
604 603 * of mblk_t chains, at which point it will indicate flow control back to the
605 604 * client. Once this condition is reached, any mblk_t chains that were not
606 605 * enqueued will be returned to the caller and they will have to decide what to
607 606 * do with them. There are various flags that control this behavior that a
608 607 * client may pass, which are discussed below.
609 608 *
610 609 * When this condition is hit, MAC also returns a cookie to the client in
611 610 * addition to unconsumed frames. Clients can poll on that cookie and register a
612 611 * callback with MAC to be notified when they are no longer subject to flow
613 612 * control, at which point they may continue to call mac_tx(). This flow control
614 613 * actually manages to work itself all the way up the stack, back through dls,
615 614 * to ip, through the various protocols, and to sockfs.
616 615 *
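The client-side contract can be sketched as a retry loop keyed off the returned cookie (types and both callees are hypothetical stand-ins, not the mac_tx() signature):

    #include <stdint.h>

    typedef struct mblk mblk_t;        /* opaque stand-in */
    typedef uintptr_t tx_cookie_t;

    /*
     * Hypothetical: returns 0 on success, else a flow-control cookie,
     * and hands back any unconsumed chain through *chain.
     */
    extern tx_cookie_t client_tx(void *mch, mblk_t **chain);
    extern void wait_for_tx_ready(tx_cookie_t);

    static void
    send_with_flow_control(void *mch, mblk_t *chain)
    {
        tx_cookie_t cookie;

        while ((cookie = client_tx(mch, &chain)) != 0) {
            /* Flow controlled: wait for the notification, then resend. */
            wait_for_tx_ready(cookie);
        }
    }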
617 616 * While the behavior described above is the default, this behavior can be
618 617 * modified. There are two alternate modes, described below, which are
619 618 * controlled with flags.
620 619 *
621 620 * DROP MODE
622 621 *
623 622 * This mode is controlled by having the client pass the MAC_DROP_ON_NO_DESC
624 623 * flag. When this is passed, if a device driver runs out of transmit
625 624 * descriptors, then the MAC layer will drop any unsent traffic. The client in
626 625 * this case will never have any frames returned to it.
627 626 *
628 627 * DON'T ENQUEUE
629 628 *
630 629 * This mode is controlled by having the client pass the MAC_TX_NO_ENQUEUE flag.
631 630 * If the MAC_DROP_ON_NO_DESC flag is also passed, it takes precedence. In this
632 631 * mode, when we hit a case where a driver runs out of transmit descriptors,
633 632 * then instead of enqueuing packets in a soft ring set or software ring, we
634 633 * instead return the mblk_t chain back to the caller and immediately put the
635 634 * soft ring set into flow control mode.
636 635 *
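The three dispositions and their precedence can be summarized in a short sketch (flag values here are placeholders, not the real MAC client flag definitions; the enqueue/drop helpers are hypothetical):

    #include <stddef.h>
    #include <stdint.h>

    typedef struct mblk mblk_t;        /* opaque stand-in */

    #define MAC_DROP_ON_NO_DESC 0x01   /* placeholder value */
    #define MAC_TX_NO_ENQUEUE   0x02   /* placeholder value */

    extern void drop_chain(mblk_t *);
    extern void enqueue_chain(mblk_t *);   /* up to the high watermark */

    /*
     * Disposition of an unsent chain: drop beats no-enqueue, which
     * beats the default of queueing in the SRS. Returns any chain the
     * caller must take back (it is then subject to flow control).
     */
    static mblk_t *
    tx_no_desc(mblk_t *chain, uint16_t flags)
    {
        if (flags & MAC_DROP_ON_NO_DESC) {
            drop_chain(chain);
            return (NULL);
        }
        if (flags & MAC_TX_NO_ENQUEUE)
            return (chain);
        enqueue_chain(chain);
        return (NULL);
    }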
637 636 * The following series of ASCII art images describe the transmit data path that
638 637 * MAC clients enter into based on calling into mac_tx(). A soft ring set has a
639 638 * transmission function associated with it. There are seven possible
640 639 * transmission modes, some of which share function entry points. The one that a
641 640 * soft ring set gets depends on properties such as whether there are
642 641 * transmission rings for fanout, whether the device involves aggregations,
643 642 * whether any bandwidth limits exist, etc.
644 643 *
645 644 *
646 645 * Part 1 -- Initial checks
647 646 *
648 647 * * . called by
649 648 * | MAC clients
650 649 * v . . No
651 650 * +--------+ +-----------+ . +-------------------+ +====================+
652 651 * | mac_tx |->| device |-*-->| mac_protect_check |->v Is this the simple v
653 652 * +--------+ | quiesced? | +-------------------+ v case? See [1] v
654 653 * +-----------+ | +====================+
655 654 * * . Yes * failed |
656 655 * v | frames |
657 656 * +--------------+ | +-------+---------+
658 657 * | freemsgchain |<---------+ Yes . * No . *
659 658 * +--------------+ v v
660 659 * +-----------+ +--------+
661 660 * | goto | | goto |
662 661 * | Part 2 | | SRS TX |
663 662 * | Entry [A] | | func |
664 663 * +-----------+ +--------+
665 664 * | |
666 665 * | v
667 666 * | +--------+
668 667 * +---------->| return |
669 668 * | cookie |
670 669 * +--------+
671 670 *
672 671 * [1] The simple case refers to the SRS being configured with the
673 672 * SRS_TX_DEFAULT transmission mode, having a single mblk_t (not a chain), there
674 673 * being only a single active client, and not having a backlog in the srs.
675 674 *
676 675 *
677 676 * Part 2 -- The SRS transmission functions
678 677 *
679 678 * This part is a bit more complicated. The different transmission paths often
680 679 * leverage one another. In this case, we'll draw out the more common ones
681 680 * before the parts that depend upon them. Here, we're going to start with the
682 681 * workings of mac_tx_send(), a common function that most of the others end up
683 682 * calling.
684 683 *
685 684 * +-------------+
686 685 * | mac_tx_send |
687 686 * +-------------+
688 687 * |
689 688 * v
690 689 * +=============+ +==============+
691 690 * v more than v--->v check v
692 691 * v one client? v v VLAN and add v
693 692 * +=============+ v VLAN tags v
694 693 * | +==============+
695 694 * | |
696 695 * +------------------+
697 696 * |
698 697 * | [A]
699 698 * v |
700 699 * +============+ . No v
701 700 * v more than v . +==========+ +--------------------------+
702 701 * v one active v-*---->v for each v---->| mac_promisc_dispatch_one |---+
703 702 * v client? v v mblk_t v +--------------------------+ |
704 703 * +============+ +==========+ ^ |
705 704 * | | +==========+ |
706 705 * * . Yes | v hardware v<-------+
707 706 * v +------------+ v rings? v
708 707 * +==========+ | +==========+
709 708 * v for each v No . . . * |
710 709 * v mblk_t v specific | |
711 710 * +==========+ flow | +-----+-----+
712 711 * | | | |
713 712 * v | v v
714 713 * +-----------------+ | +-------+ +---------+
715 714 * | mac_tx_classify |------------+ | GLDv3 | | GLDv3 |
716 715 * +-----------------+ |TX func| | ring tx |
717 716 * | +-------+ | func |
718 717 * * Specific flow, generally | +---------+
719 718 * | bcast, mcast, loopback | |
720 719 * v +-----+-----+
721 720 * +==========+ +---------+ |
722 721 * v valid L2 v--*--->| freemsg | v
723 722 * v header v . No +---------+ +-------------------+
724 723 * +==========+ | return unconsumed |
725 724 * * . Yes | frames to the |
726 725 * v | caller |
727 726 * +===========+ +-------------------+
728 727 * v broadcast v +----------------+ ^
729 728 * v flow? v--*-->| mac_bcast_send |------------------+
730 729 * +===========+ . +----------------+ |
731 730 * | . . Yes |
732 731 * No . * v
733 732 * | +---------------------+ +---------------+ +----------+
734 733 * +->|mac_promisc_dispatch |->| mac_fix_cksum |->| flow |
735 734 * +---------------------+ +---------------+ | callback |
736 735 * +----------+
737 736 *
738 737 *
739 738 * In addition, many, but not all, of the routines rely on
740 739 * mac_tx_soft_ring_process() as an entry point.
741 740 *
742 741 *
743 742 * . No . No
744 743 * +--------------------------+ +========+ . +===========+ . +-------------+
745 744 * | mac_tx_soft_ring_process |-->v worker v-*->v out of tx v-*->| goto |
746 745 * +--------------------------+ v only? v v descr.? v | mac_tx_send |
747 746 * +========+ +===========+ +-------------+
748 747 * Yes . * * . Yes |
749 748 * . No v | v
750 749 * +=========+ . +===========+ . Yes | Yes . +==========+
751 750 * v append  v<--*----------v out of tx v-*-------+---------*--v returned v
752 751 * v mblk_t v v descr.? v | v frames? v
753 752 * v chain v +===========+ | +==========+
754 753 * +=========+ | *. No
755 754 * | | v
756 755 * v v +------------+
757 756 * +===================+ +----------------------+ | done |
758 757 * v worker scheduled? v | mac_tx_sring_enqueue | | processing |
759 758 * v Out of tx descr? v +----------------------+ +------------+
760 759 * +===================+ |
761 760 * | | . Yes v
762 761 * * Yes * No . +============+
763 762 * | v +-*---------v drop on no v
764 763 * | +========+ v v TX desc? v
765 764 * | v wake v +----------+ +============+
766 765 * | v worker v | mac_pkt_ | * . No
767 766 * | +========+ | drop | | . Yes . No
768 767 * | | +----------+ v . .
769 768 * | | v ^ +===============+ . +========+ .
770 769 * +--+--------+---------+ | v Don't enqueue v-*->v ring v-*----+
771 770 * | | v Set? v v empty? v |
772 771 * | +---------------+ +===============+ +========+ |
773 772 * | | | | |
774 773 * | | +-------------------+ | |
775 774 * | *. Yes | +---------+ |
776 775 * | | v v v
777 776 * | | +===========+ +========+ +--------------+
778 777 * | +<-v At hiwat? v v append v | return |
779 778 * | +===========+ v mblk_t v | mblk_t chain |
780 779 * | * No v chain v | and flow |
781 780 * | v +========+ | control |
782 781 * | +=========+ | | cookie |
783 782 * | v append v v +--------------+
784 783 * | v mblk_t v +========+
785 784 * | v chain v v wake v +------------+
786 785 * | +=========+ v worker v-->| done |
787 786 * | | +========+ | processing |
788 787 * | v .. Yes +------------+
789 788 * | +=========+ . +========+
790 789 * | v first v--*-->v wake v
791 790 * | v append? v v worker v
792 791 * | +=========+ +========+
793 792 * | | |
794 793 * | No . * |
795 794 * | v |
796 795 * | +--------------+ |
797 796 * +------>| Return | |
798 797 * | flow control |<------------+
799 798 * | cookie |
800 799 * +--------------+
801 800 *
802 801 *
803 802 * The remaining images are all specific to each of the different transmission
804 803 * modes.
805 804 *
806 805 * SRS TX DEFAULT
807 806 *
808 807 * [ From Part 1 ]
809 808 * |
810 809 * v
811 810 * +-------------------------+
812 811 * | mac_tx_single_ring_mode |
813 812 * +-------------------------+
814 813 * |
815 814 * | . Yes
816 815 * v .
817 816 * +==========+ . +============+
818 817 * v SRS v-*->v Try to v---->---------------------+
819 818 * v backlog? v v enqueue in v |
820 819 * +==========+ v SRS v-->------+ * . . Queue too
821 820 * | +============+ * don't enqueue | deep or
822 821 * * . No ^ | | flag or at | drop flag
823 822 * | | v | hiwat, |
824 823 * v | | | return +---------+
825 824 * +-------------+ | | | cookie | freemsg |
826 825 * | goto |-*-----+ | | +---------+
827 826 * | mac_tx_send | . returned | | |
828 827 * +-------------+ mblk_t | | |
829 828 * | | | |
830 829 * | | | |
831 830 * * . . all mblk_t * queued, | |
832 831 * v consumed | may return | |
833 832 * +-------------+ | tx cookie | |
834 833 * | SRS TX func |<------------+------------+----------------+
835 834 * | completed |
836 835 * +-------------+
837 836 *
838 837 * SRS_TX_SERIALIZE
839 838 *
840 839 * +------------------------+
841 840 * | mac_tx_serializer_mode |
842 841 * +------------------------+
843 842 * |
844 843 * | . No
845 844 * v .
846 845 * +============+ . +============+ +-------------+ +============+
847 846 * v srs being v-*->v set SRS v--->| goto |-->v remove SRS v
848 847 * v processed? v v proc flags v | mac_tx_send | v proc flag v
849 848 * +============+ +============+ +-------------+ +============+
850 849 * | |
851 850 * * Yes |
852 851 * v . No v
853 852 * +--------------------+ . +==========+
854 853 * | mac_tx_srs_enqueue | +------------------------*-----<--v returned v
855 854 * +--------------------+ | v frames? v
856 855 * | | . Yes +==========+
857 856 * | | . |
858 857 * | | . +=========+ v
859 858 * v +-<-*-v queued v +--------------------+
860 859 * +-------------+ | v frames? v<----| mac_tx_srs_enqueue |
861 860 * | SRS TX func | | +=========+ +--------------------+
862 861 * | completed, |<------+ * . Yes
863 862 * | may return | | v
864 863 * | cookie | | +========+
865 864 * +-------------+ +-<---v wake v
866 865 * v worker v
867 866 * +========+
868 867 *
869 868 *
870 869 * SRS_TX_FANOUT
871 870 *
872 871 * . Yes
873 872 * +--------------------+ +=============+ . +--------------------------+
874 873 * | mac_tx_fanout_mode |--->v Have fanout v-*-->| goto |
875 874 * +--------------------+ v hint? v | mac_rx_soft_ring_process |
876 875 * +=============+ +--------------------------+
877 876 * * . No |
878 877 * v ^
879 878 * +===========+ |
880 879 * +--->v for each v +===============+
881 880 * | v mblk_t v v pick softring v
882 881 * same * +===========+ v from hash v
883 882 * hash | | +===============+
884 883 * | v |
885 884 * | +--------------+ |
886 885 * +---| mac_pkt_hash |--->*------------+
887 886 * +--------------+ . different
888 887 * hash or
889 888 * done proc.
890 889 * SRS_TX_AGGR chain
891 890 *
892 891 * +------------------+ +================================+
893 892 * | mac_tx_aggr_mode |--->v Use aggr capab function to v
894 893 * +------------------+ v find appropriate tx ring. v
895 894 * v Applies hash based on aggr v
896 895 * v policy, see mac_tx_aggr_mode() v
897 896 * +================================+
898 897 * |
899 898 * v
900 899 * +-------------------------------+
901 900 * | goto |
902 901 * | mac_rx_srs_soft_ring_process |
903 902 * +-------------------------------+
904 903 *
905 904 *
906 905 * SRS_TX_BW, SRS_TX_BW_FANOUT, SRS_TX_BW_AGGR
907 906 *
908 907 * Note, all three of these tx functions start from the same place --
909 908 * mac_tx_bw_mode().
910 909 *
911 910 * +----------------+
912 911 * | mac_tx_bw_mode |
913 912 * +----------------+
914 913 * |
915 914 * v . No . No . Yes
916 915 * +==============+ . +============+ . +=============+ . +=========+
917 916 * v Out of BW? v--*->v SRS empty? v--*->v reset BW v-*->v Bump BW v
918 917 * +==============+ +============+ v tick count? v v Usage v
919 918 * | | +=============+ +=========+
920 919 * | +---------+ | |
921 920 * | | +--------------------+ |
922 921 * | | | +----------------------+
923 922 * v | v v
924 923 * +===============+ | +==========+ +==========+ +------------------+
925 924 * v Don't enqueue v | v set bw v v Is aggr? v--*-->| goto |
926 925 * v flag set? v | v enforced v +==========+ . | mac_tx_aggr_mode |-+
927 926 * +===============+ | +==========+ | . +------------------+ |
928 927 * | Yes .* | | No . * . |
929 928 * | | | | | . Yes |
930 929 * * . No | | v | |
931 930 * | +---------+ | +========+ v +======+ |
932 931 * | | freemsg | | v append v +============+ . Yes v pick v |
933 932 * | +---------+ | v mblk_t v v Is fanout? v--*---->v ring v |
934 933 * | | | v chain v +============+ +======+ |
935 934 * +------+ | +========+ | | |
936 935 * v | | v v |
937 936 * +---------+ | v +-------------+ +--------------------+ |
938 937 * | return | | +========+ | goto | | goto | |
939 938 * | flow | | v wakeup v | mac_tx_send | | mac_tx_fanout_mode | |
940 939 * | control | | v worker v +-------------+ +--------------------+ |
941 940 * | cookie | | +========+ | | |
942 941 * +---------+ | | | +------+------+
943 942 * | v | |
944 943 * | +---------+ | v
945 944 * | | return | +============+ +------------+
946 945 * | | flow | v unconsumed v-------+ | done |
947 946 * | | control | v frames? v | | processing |
948 947 * | | cookie | +============+ | +------------+
949 948 * | +---------+ | |
950 949 * | Yes * |
951 950 * | | |
952 951 * | +===========+ |
953 952 * | v subtract v |
954 953 * | v unused bw v |
955 954 * | +===========+ |
956 955 * | | |
957 956 * | v |
958 957 * | +--------------------+ |
959 958 * +------------->| mac_tx_srs_enqueue | |
960 959 * +--------------------+ |
961 960 * | |
962 961 * | |
963 962 * +------------+ |
964 963 * | return fc | |
965 964 * | cookie and |<------+
966 965 * | mblk_t |
967 966 * +------------+
968 967 */
969 968
970 969 #include <sys/types.h>
971 970 #include <sys/callb.h>
972 971 #include <sys/sdt.h>
973 972 #include <sys/strsubr.h>
974 973 #include <sys/strsun.h>
975 974 #include <sys/vlan.h>
976 975 #include <sys/stack.h>
977 976 #include <sys/archsystm.h>
978 977 #include <inet/ipsec_impl.h>
979 978 #include <inet/ip_impl.h>
980 979 #include <inet/sadb.h>
981 980 #include <inet/ipsecesp.h>
982 981 #include <inet/ipsecah.h>
983 982 #include <inet/ip6.h>
984 983
985 984 #include <sys/mac_impl.h>
986 985 #include <sys/mac_client_impl.h>
987 986 #include <sys/mac_client_priv.h>
988 987 #include <sys/mac_soft_ring.h>
989 988 #include <sys/mac_flow_impl.h>
990 989
991 990 static mac_tx_cookie_t mac_tx_single_ring_mode(mac_soft_ring_set_t *, mblk_t *,
992 991 uintptr_t, uint16_t, mblk_t **);
993 992 static mac_tx_cookie_t mac_tx_serializer_mode(mac_soft_ring_set_t *, mblk_t *,
994 993 uintptr_t, uint16_t, mblk_t **);
995 994 static mac_tx_cookie_t mac_tx_fanout_mode(mac_soft_ring_set_t *, mblk_t *,
996 995 uintptr_t, uint16_t, mblk_t **);
997 996 static mac_tx_cookie_t mac_tx_bw_mode(mac_soft_ring_set_t *, mblk_t *,
998 997 uintptr_t, uint16_t, mblk_t **);
999 998 static mac_tx_cookie_t mac_tx_aggr_mode(mac_soft_ring_set_t *, mblk_t *,
1000 999 uintptr_t, uint16_t, mblk_t **);
1001 1000
1002 1001 typedef struct mac_tx_mode_s {
1003 1002 mac_tx_srs_mode_t mac_tx_mode;
1004 1003 mac_tx_func_t mac_tx_func;
1005 1004 } mac_tx_mode_t;
1006 1005
1007 1006 /*
1008 1007 * There are seven modes of operation on the Tx side. These modes get set
1009 1008 * in mac_tx_srs_setup(). Except for the experimental TX_SERIALIZE mode,
1010 1009 * none of the other modes are user configurable. They get selected by
1011 1010 * the system depending upon whether the link (or flow) has multiple Tx
1012 1011 * rings or a bandwidth configured, or if the link is an aggr, etc.
1013 1012 *
1014 1013 * When the Tx SRS is operating in aggr mode (st_mode) or if there are
1015 1014 * multiple Tx rings owned by Tx SRS, then each Tx ring (pseudo or
1016 1015 * otherwise) will have a soft ring associated with it. These soft rings
1017 1016 * are stored in srs_tx_soft_rings[] array.
1018 1017 *
1019 1018 * Additionally in the case of aggr, there is the st_soft_rings[] array
1020 1019 * in the mac_srs_tx_t structure. This array is used to store the same
1021 1020 * set of soft rings that are present in srs_tx_soft_rings[] array but
1022 1021 * in a different manner. The soft ring associated with the pseudo Tx
1023 1022 * ring is saved at mr_index (of the pseudo ring) in st_soft_rings[]
1024 1023 * array. This helps in quickly getting the soft ring associated with the
1025 1024 * Tx ring when aggr_find_tx_ring() returns the pseudo Tx ring that is to
1026 1025 * be used for transmit.
1027 1026 */
1028 1027 mac_tx_mode_t mac_tx_mode_list[] = {
1029 1028 {SRS_TX_DEFAULT, mac_tx_single_ring_mode},
1030 1029 {SRS_TX_SERIALIZE, mac_tx_serializer_mode},
1031 1030 {SRS_TX_FANOUT, mac_tx_fanout_mode},
1032 1031 {SRS_TX_BW, mac_tx_bw_mode},
1033 1032 {SRS_TX_BW_FANOUT, mac_tx_bw_mode},
1034 1033 {SRS_TX_AGGR, mac_tx_aggr_mode},
1035 1034 {SRS_TX_BW_AGGR, mac_tx_bw_mode}
1036 1035 };
1037 1036
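The point of the second array is O(1) lookup by pseudo ring index; a sketch with simplified stand-in types:

    /* Simplified stand-ins for the kernel structures. */
    typedef struct soft_ring soft_ring_t;

    typedef struct srs_tx {
        soft_ring_t **st_soft_rings;   /* indexed by mr_index */
    } srs_tx_t;

    typedef struct pseudo_ring {
        unsigned int mr_index;
    } pseudo_ring_t;

    /*
     * Given the pseudo Tx ring chosen by aggr_find_tx_ring(), find its
     * soft ring directly rather than searching srs_tx_soft_rings[].
     */
    static soft_ring_t *
    aggr_soft_ring_for(const srs_tx_t *st, const pseudo_ring_t *ring)
    {
        return (st->st_soft_rings[ring->mr_index]);
    }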
1038 1037 /*
1039 1038 * Soft Ring Set (SRS) - The Run time code that deals with
1040 1039 * dynamic polling from the hardware, bandwidth enforcement,
1041 1040 * fanout etc.
1042 1041 *
1043 1042 * We try to use H/W classification on NIC and assign traffic for
1044 1043 * a MAC address to a particular Rx ring or ring group. There is a
1045 1044 * 1-1 mapping between a SRS and a Rx ring. The SRS dynamically
1046 1045 * switches the underlying Rx ring between interrupt and
1047 1046 * polling mode and enforces any specified B/W control.
1048 1047 *
1049 1048 * There is always a SRS created and tied to each H/W and S/W rule.
1050 1049 * Whenever we create a H/W rule, we always add the same rule to the
1051 1050 * S/W classifier and tie a SRS to it.
1052 1051 *
1053 1052 * In case a B/W control is specified, it is broken into bytes
1054 1053 * per ticks and as soon as the quota for a tick is exhausted,
1055 1054 * the underlying Rx ring is forced into poll mode for remainder of
1056 1055 * the tick. The SRS poll thread only polls for bytes that are
1057 1056 * allowed to come in the SRS. We typically let 4x the configured
1058 1057 * B/W worth of packets come into the SRS (to prevent unnecessary
1059 1058 * drops due to bursts) but only process the specified amount.
1060 1059 *
1061 1060 * A MAC client (e.g. a VNIC or aggr) can have 1 or more
1062 1061 * Rx rings (and corresponding SRSs) assigned to it. The SRS
1063 1062 * in turn can have softrings to do protocol level fanout or
1064 1063 * softrings to do S/W based fanout or both. In case the NIC
1065 1064 * has no Rx rings, we do S/W classification to the respective SRS.
1066 1065 * The S/W classification rule is always setup and ready. This
1067 1066 * allows the MAC layer to reassign Rx rings whenever needed
1068 1067 * but packets still continue to flow via the default path and
1069 1068 * get S/W classified to the correct SRS.
1070 1069 *
1071 1070 * The SRS's are used on both Tx and Rx side. They use the same
1072 1071 * data structure but the processing routines have slightly different
1073 1072 * semantics due to the fact that Rx side needs to do dynamic
1074 1073 * polling etc.
1075 1074 *
1076 1075 * Dynamic Polling Notes
1077 1076 * =====================
1078 1077 *
1079 1078 * Each Soft ring set is capable of switching its Rx ring between
1080 1079 * interrupt and poll mode and actively 'polls' for packets in
1081 1080 * poll mode. If the SRS is implementing a B/W limit, it makes
1082 1081 * sure that only the maximum allowed packets are pulled in poll mode
1083 1082 * and goes to poll mode as soon as the B/W limit is exceeded. As
1084 1083 * such, there are no overheads to implement B/W limits.
1085 1084 *
1086 1085 * In poll mode, it's better to keep the pipeline going where the
1087 1086 * SRS worker thread keeps processing packets and the poll thread
1088 1087 * keeps bringing more packets (especially if they get to run
1089 1088 * on different CPUs). This also prevents the overheads associated
1090 1089 * with excessive signalling (on NUMA machines, this can be
1091 1090 * pretty devastating). The exception is latency optimized case
1092 1091 * where worker thread does no work and interrupt and poll thread
1093 1092 * are allowed to do their own drain.
1094 1093 *
1095 1094 * We use the following policy to control Dynamic Polling:
1096 1095 * 1) We switch to poll mode anytime the processing
1097 1096 * thread causes a backlog to build up in SRS and
1098 1097 * its associated Soft Rings (sr_poll_pkt_cnt > 0).
1099 1098 * 2) As long as the backlog stays under the low water
1100 1099 * mark (sr_lowat), we poll the H/W for more packets.
1101 1100 * 3) If the backlog (sr_poll_pkt_cnt) exceeds low
1102 1101 * water mark, we stay in poll mode but don't poll
1103 1102 * the H/W for more packets.
1104 1103 * 4) Anytime in polling mode, if we poll the H/W for
1105 1104 * packets and find nothing plus we have an existing
1106 1105 * backlog (sr_poll_pkt_cnt > 0), we stay in polling
1107 1106 * mode but don't poll the H/W for packets anymore
1108 1107 * (let the polling thread go to sleep).
1109 1108 * 5) Once the backlog is relieved (packets are processed)
1110 1109 * we reenable polling (by signalling the poll thread)
1111 1110 * only when the backlog dips below sr_poll_thres.
1112 1111 * 6) sr_hiwat is used exclusively when we are not
1113 1112 * polling capable and is used to decide when to
1114 1113 * drop packets so the SRS queue length doesn't grow
1115 1114 * infinitely.
1116 1115 *
1117 1116 * NOTE: Also see the block level comment on top of mac_soft_ring.c
1118 1117 */
1119 1118
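Policy points 1-3 and 5 above distill to backlog comparisons; a sketch follows (field names echo the SRS counters, and the empty-poll tracking of point 4 is omitted):

    #include <stdbool.h>
    #include <stdint.h>

    typedef struct srs_poll_state {
        uint32_t sr_poll_pkt_cnt;   /* unprocessed backlog */
        uint32_t sr_lowat;          /* low water mark */
        uint32_t sr_poll_thres;     /* re-enable threshold */
        bool     srs_polling;       /* currently in poll mode? */
    } srs_poll_state_t;

    /* Should the poll thread pull more packets from the H/W? */
    static bool
    should_poll_hw(const srs_poll_state_t *s)
    {
        if (!s->srs_polling)
            return (s->sr_poll_pkt_cnt > 0);          /* point 1 */
        return (s->sr_poll_pkt_cnt < s->sr_lowat);    /* points 2-3 */
    }

    /* Once the backlog is relieved, when do we signal the poller? */
    static bool
    should_wake_poller(const srs_poll_state_t *s)
    {
        return (s->sr_poll_pkt_cnt < s->sr_poll_thres);   /* point 5 */
    }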
1120 1119 /*
1121 1120 * mac_latency_optimize
1122 1121 *
1123 1122 * Controls whether the poll thread can process the packets inline
1124 1123 * or let the SRS worker thread do the processing. This applies if
1125 1124 * the SRS was not being processed. For latency sensitive traffic,
1126 1125 * this needs to be true to allow inline processing. For throughput
1127 1126 * under load, this should be false.
1128 1127 *
1129 1128  * This tunable (and other similar ones) should be rolled into a
1130 1129  * link- or flow-specific workload hint that can be set using dladm
1131 1130  * linkprop (instead of multiple such tunables).
1132 1131 */
1133 1132 boolean_t mac_latency_optimize = B_TRUE;
1134 1133
1135 1134 /*
1136 1135 * MAC_RX_SRS_ENQUEUE_CHAIN and MAC_TX_SRS_ENQUEUE_CHAIN
1137 1136 *
1138 1137  * Queue an mp or chain in the soft ring set and increment the
1139 1138 * local count (srs_count) for the SRS and the shared counter
1140 1139 * (srs_poll_pkt_cnt - shared between SRS and its soft rings
1141 1140 * to track the total unprocessed packets for polling to work
1142 1141 * correctly).
1143 1142 *
1144 1143 * The size (total bytes queued) counters are incremented only
1145 1144 * if we are doing B/W control.
1146 1145 */
1147 1146 #define MAC_SRS_ENQUEUE_CHAIN(mac_srs, head, tail, count, sz) { \
1148 1147 ASSERT(MUTEX_HELD(&(mac_srs)->srs_lock)); \
1149 1148 if ((mac_srs)->srs_last != NULL) \
1150 1149 (mac_srs)->srs_last->b_next = (head); \
1151 1150 else \
1152 1151 (mac_srs)->srs_first = (head); \
1153 1152 (mac_srs)->srs_last = (tail); \
1154 1153 (mac_srs)->srs_count += count; \
1155 1154 }
1156 1155
1157 1156 #define MAC_RX_SRS_ENQUEUE_CHAIN(mac_srs, head, tail, count, sz) { \
1158 1157 mac_srs_rx_t *srs_rx = &(mac_srs)->srs_rx; \
1159 1158 \
1160 1159 MAC_SRS_ENQUEUE_CHAIN(mac_srs, head, tail, count, sz); \
1161 1160 srs_rx->sr_poll_pkt_cnt += count; \
1162 1161 ASSERT(srs_rx->sr_poll_pkt_cnt > 0); \
1163 1162 if ((mac_srs)->srs_type & SRST_BW_CONTROL) { \
1164 1163 (mac_srs)->srs_size += (sz); \
1165 1164 mutex_enter(&(mac_srs)->srs_bw->mac_bw_lock); \
1166 1165 (mac_srs)->srs_bw->mac_bw_sz += (sz); \
1167 1166 mutex_exit(&(mac_srs)->srs_bw->mac_bw_lock); \
1168 1167 } \
1169 1168 }
1170 1169
1171 1170 #define MAC_TX_SRS_ENQUEUE_CHAIN(mac_srs, head, tail, count, sz) { \
1172 1171 mac_srs->srs_state |= SRS_ENQUEUED; \
1173 1172 MAC_SRS_ENQUEUE_CHAIN(mac_srs, head, tail, count, sz); \
1174 1173 if ((mac_srs)->srs_type & SRST_BW_CONTROL) { \
1175 1174 (mac_srs)->srs_size += (sz); \
1176 1175 (mac_srs)->srs_bw->mac_bw_sz += (sz); \
1177 1176 } \
1178 1177 }
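Both macros get their O(1) append from keeping a tail pointer (srs_last)
alongside the head. A small user-space sketch of the same logic follows;
node_t and srsq_t are hypothetical stand-ins for mblk_t and the SRS queue
fields:

#include <assert.h>
#include <stdio.h>

typedef struct node {
	struct node *b_next;
	int id;
} node_t;

typedef struct {
	node_t *first;	/* models srs_first */
	node_t *last;	/* models srs_last */
	int count;	/* models srs_count */
} srsq_t;

/* Append a pre-linked [head..tail] chain of cnt nodes in O(1). */
static void
enqueue_chain(srsq_t *q, node_t *head, node_t *tail, int cnt)
{
	assert(tail->b_next == NULL);
	if (q->last != NULL)
		q->last->b_next = head;	/* append after the current tail */
	else
		q->first = head;	/* queue was empty */
	q->last = tail;
	q->count += cnt;
}

int
main(void)
{
	node_t a = { NULL, 1 }, b = { NULL, 2 };
	srsq_t q = { NULL, NULL, 0 };

	a.b_next = &b;			/* two-packet chain: a -> b */
	enqueue_chain(&q, &a, &b, 2);
	printf("count=%d first=%d last=%d\n", q.count, q.first->id,
	    q.last->id);
	return (0);
}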
1179 1178
1180 1179 /*
1181 1180  * Macros to turn polling on.
1182 1181 */
1183 1182 #define MAC_SRS_POLLING_ON(mac_srs) { \
1184 1183 ASSERT(MUTEX_HELD(&(mac_srs)->srs_lock)); \
1185 1184 if (((mac_srs)->srs_state & \
1186 1185 (SRS_POLLING_CAPAB|SRS_POLLING)) == SRS_POLLING_CAPAB) { \
1187 1186 (mac_srs)->srs_state |= SRS_POLLING; \
1188 1187 (void) mac_hwring_disable_intr((mac_ring_handle_t) \
1189 1188 (mac_srs)->srs_ring); \
1190 1189 (mac_srs)->srs_rx.sr_poll_on++; \
1191 1190 } \
1192 1191 }
1193 1192
1194 1193 #define MAC_SRS_WORKER_POLLING_ON(mac_srs) { \
1195 1194 ASSERT(MUTEX_HELD(&(mac_srs)->srs_lock)); \
1196 1195 if (((mac_srs)->srs_state & \
1197 1196 (SRS_POLLING_CAPAB|SRS_WORKER|SRS_POLLING)) == \
1198 1197 (SRS_POLLING_CAPAB|SRS_WORKER)) { \
1199 1198 (mac_srs)->srs_state |= SRS_POLLING; \
1200 1199 (void) mac_hwring_disable_intr((mac_ring_handle_t) \
1201 1200 (mac_srs)->srs_ring); \
1202 1201 (mac_srs)->srs_rx.sr_worker_poll_on++; \
1203 1202 } \
1204 1203 }
1205 1204
1206 1205 /*
1207 1206 * MAC_SRS_POLL_RING
1208 1207 *
1209 1208 * Signal the SRS poll thread to poll the underlying H/W ring
1210 1209 * provided it wasn't already polling (SRS_GET_PKTS was set).
1211 1210 *
1212 1211  * The poll thread gets to run only from mac_rx_srs_drain() and only
1213 1212 * if the drain was being done by the worker thread.
1214 1213 */
1215 1214 #define MAC_SRS_POLL_RING(mac_srs) { \
1216 1215 mac_srs_rx_t *srs_rx = &(mac_srs)->srs_rx; \
1217 1216 \
1218 1217 ASSERT(MUTEX_HELD(&(mac_srs)->srs_lock)); \
1219 1218 srs_rx->sr_poll_thr_sig++; \
1220 1219 if (((mac_srs)->srs_state & \
1221 1220 (SRS_POLLING_CAPAB|SRS_WORKER|SRS_GET_PKTS)) == \
1222 1221 (SRS_WORKER|SRS_POLLING_CAPAB)) { \
1223 1222 (mac_srs)->srs_state |= SRS_GET_PKTS; \
1224 1223 cv_signal(&(mac_srs)->srs_cv); \
1225 1224 } else { \
1226 1225 srs_rx->sr_poll_thr_busy++; \
1227 1226 } \
1228 1227 }
1229 1228
1230 1229 /*
1231 1230 * MAC_SRS_CHECK_BW_CONTROL
1232 1231 *
1233 1232  * Check to see if the next tick has started so we can reset the
1234 1233  * SRS_BW_ENFORCED flag and allow more packets to come into the
1235 1234  * system.
1236 1235 */
1237 1236 #define MAC_SRS_CHECK_BW_CONTROL(mac_srs) { \
1238 1237 ASSERT(MUTEX_HELD(&(mac_srs)->srs_lock)); \
1239 1238 ASSERT(((mac_srs)->srs_type & SRST_TX) || \
1240 1239 MUTEX_HELD(&(mac_srs)->srs_bw->mac_bw_lock)); \
1241 1240 clock_t now = ddi_get_lbolt(); \
1242 1241 if ((mac_srs)->srs_bw->mac_bw_curr_time != now) { \
1243 1242 (mac_srs)->srs_bw->mac_bw_curr_time = now; \
1244 1243 (mac_srs)->srs_bw->mac_bw_used = 0; \
1245 1244 if ((mac_srs)->srs_bw->mac_bw_state & SRS_BW_ENFORCED) \
1246 1245 (mac_srs)->srs_bw->mac_bw_state &= ~SRS_BW_ENFORCED; \
1247 1246 } \
1248 1247 }
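The reset is what turns the B/W limit into a per-tick byte budget: usage
accumulates within one lbolt tick and is forgotten at the next. A minimal
user-space sketch follows; bw_t and bw_charge() are hypothetical stand-ins,
with time_t in place of clock_t/ddi_get_lbolt():

#include <stdio.h>
#include <stdbool.h>
#include <time.h>

typedef struct {
	time_t bw_curr_time;	/* tick the current window belongs to */
	long bw_used;		/* bytes consumed this tick */
	long bw_limit;		/* bytes allowed per tick */
	bool bw_enforced;	/* models SRS_BW_ENFORCED */
} bw_t;

/* Charge sz bytes; return false once the budget for this tick is spent. */
static bool
bw_charge(bw_t *bw, time_t now, long sz)
{
	if (bw->bw_curr_time != now) {	/* new tick: reset the window */
		bw->bw_curr_time = now;
		bw->bw_used = 0;
		bw->bw_enforced = false;
	}
	if (bw->bw_used + sz > bw->bw_limit) {
		bw->bw_enforced = true;
		return (false);
	}
	bw->bw_used += sz;
	return (true);
}

int
main(void)
{
	bw_t bw = { 0, 0, 1500, false };

	printf("%d\n", bw_charge(&bw, 1, 1000));	/* 1: fits */
	printf("%d\n", bw_charge(&bw, 1, 1000));	/* 0: limit hit */
	printf("%d\n", bw_charge(&bw, 2, 1000));	/* 1: new tick */
	return (0);
}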
1249 1248
1250 1249 /*
1251 1250 * MAC_SRS_WORKER_WAKEUP
1252 1251 *
1253 1252 * Wake up the SRS worker thread to process the queue as long as
1254 1253 * no one else is processing the queue. If we are optimizing for
1255 1254 * latency, we wake up the worker thread immediately or else we
1256 1255  * wait mac_srs_worker_wakeup_ticks before the worker thread is
1257 1256  * woken up.
1258 1257 */
1259 1258 int mac_srs_worker_wakeup_ticks = 0;
1260 1259 #define MAC_SRS_WORKER_WAKEUP(mac_srs) { \
1261 1260 ASSERT(MUTEX_HELD(&(mac_srs)->srs_lock)); \
1262 1261 if (!((mac_srs)->srs_state & SRS_PROC) && \
1263 1262 (mac_srs)->srs_tid == NULL) { \
1264 1263 if (((mac_srs)->srs_state & SRS_LATENCY_OPT) || \
1265 1264 (mac_srs_worker_wakeup_ticks == 0)) \
1266 1265 cv_signal(&(mac_srs)->srs_async); \
1267 1266 else \
1268 1267 (mac_srs)->srs_tid = \
1269 1268 timeout(mac_srs_fire, (mac_srs), \
1270 1269 mac_srs_worker_wakeup_ticks); \
1271 1270 } \
1272 1271 }
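A sketch of the immediate-versus-deferred choice the macro makes, assuming
hypothetical signal_now()/arm_timer() callbacks in place of cv_signal() and
timeout():

#include <stdio.h>
#include <stdbool.h>

typedef void (*wakeup_fn_t)(void);

static void signal_now(void) { (void) puts("wake the worker now"); }
static void arm_timer(void) { (void) puts("defer the wakeup"); }

/* Latency-optimized SRSes wake the worker immediately; others batch. */
static void
worker_wakeup(bool latency_opt, int wakeup_ticks, wakeup_fn_t now,
    wakeup_fn_t later)
{
	if (latency_opt || wakeup_ticks == 0)
		now();		/* cf. cv_signal(&srs_async) */
	else
		later();	/* cf. timeout(mac_srs_fire, ...) */
}

int
main(void)
{
	worker_wakeup(true, 10, signal_now, arm_timer);	 /* immediate */
	worker_wakeup(false, 10, signal_now, arm_timer); /* deferred */
	return (0);
}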
1273 1272
1274 1273 #define TX_BANDWIDTH_MODE(mac_srs) \
1275 1274 ((mac_srs)->srs_tx.st_mode == SRS_TX_BW || \
1276 1275 (mac_srs)->srs_tx.st_mode == SRS_TX_BW_FANOUT || \
1277 1276 (mac_srs)->srs_tx.st_mode == SRS_TX_BW_AGGR)
1278 1277
1279 1278 #define TX_SRS_TO_SOFT_RING(mac_srs, head, hint) { \
1280 1279 if (tx_mode == SRS_TX_BW_FANOUT) \
1281 1280 (void) mac_tx_fanout_mode(mac_srs, head, hint, 0, NULL);\
1282 1281 else \
1283 1282 (void) mac_tx_aggr_mode(mac_srs, head, hint, 0, NULL); \
1284 1283 }
1285 1284
1286 1285 /*
1287 1286 * MAC_TX_SRS_BLOCK
1288 1287 *
1289 1288  * Always called from the mac_tx_srs_drain() function. SRS_TX_BLOCKED
1290 1289  * will be set only if srs_tx_woken_up is FALSE. If
1291 1290  * srs_tx_woken_up is TRUE, it indicates that the wakeup arrived
1292 1291  * before we grabbed srs_lock to set SRS_TX_BLOCKED. We need to
1293 1292  * attempt the transmit again, and leaving SRS_TX_BLOCKED unset
1294 1293  * allows that.
1295 1294 */
1296 1295 #define MAC_TX_SRS_BLOCK(srs, mp) { \
1297 1296 ASSERT(MUTEX_HELD(&(srs)->srs_lock)); \
1298 1297 if ((srs)->srs_tx.st_woken_up) { \
1299 1298 (srs)->srs_tx.st_woken_up = B_FALSE; \
1300 1299 } else { \
1301 1300 ASSERT(!((srs)->srs_state & SRS_TX_BLOCKED)); \
1302 1301 (srs)->srs_state |= SRS_TX_BLOCKED; \
1303 1302 (srs)->srs_tx.st_stat.mts_blockcnt++; \
1304 1303 } \
1305 1304 }
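The reason this avoids a lost wakeup is that the flag is tested and consumed
under the same lock the waker takes. A user-space sketch of the handshake
follows; txstate_t and tx_block() are hypothetical names:

#include <stdio.h>
#include <stdbool.h>

typedef struct {
	bool woken_up;	/* models srs_tx.st_woken_up */
	bool blocked;	/* models SRS_TX_BLOCKED */
} txstate_t;

/* Returns true if the caller should retry the transmit immediately. */
static bool
tx_block(txstate_t *ts)	/* caller holds the SRS lock */
{
	if (ts->woken_up) {
		ts->woken_up = false;	/* consume the early wakeup */
		return (true);		/* retry instead of blocking */
	}
	ts->blocked = true;		/* wait for the next tx update */
	return (false);
}

int
main(void)
{
	txstate_t ts = { false, false };

	ts.woken_up = true;	/* the driver's wakeup raced ahead of us */
	printf("retry=%d\n", tx_block(&ts));	/* retry=1 */
	printf("retry=%d\n", tx_block(&ts));	/* retry=0, now blocked */
	return (0);
}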
1306 1305
1307 1306 /*
1308 1307 * MAC_TX_SRS_TEST_HIWAT
1309 1308 *
1310 1309 * Called before queueing a packet onto Tx SRS to test and set
1311 1310 * SRS_TX_HIWAT if srs_count exceeds srs_tx_hiwat.
1312 1311 */
1313 1312 #define MAC_TX_SRS_TEST_HIWAT(srs, mp, tail, cnt, sz, cookie) { \
1314 1313 boolean_t enqueue = 1; \
1315 1314 \
1316 1315 if ((srs)->srs_count > (srs)->srs_tx.st_hiwat) { \
1317 1316 /* \
1318 1317 * flow-controlled. Store srs in cookie so that it \
1319 1318 * can be returned as mac_tx_cookie_t to client \
1320 1319 */ \
1321 1320 (srs)->srs_state |= SRS_TX_HIWAT; \
1322 1321 cookie = (mac_tx_cookie_t)srs; \
1323 1322 (srs)->srs_tx.st_hiwat_cnt++; \
1324 1323 if ((srs)->srs_count > (srs)->srs_tx.st_max_q_cnt) { \
1325 1324 /* increment freed stats */ \
1326 1325 (srs)->srs_tx.st_stat.mts_sdrops += cnt; \
1327 1326 /* \
1328 1327 * b_prev may be set to the fanout hint \
1329 1328 * hence can't use freemsg directly \
1330 1329 */ \
1331 1330 mac_pkt_drop(NULL, NULL, mp_chain, B_FALSE); \
1332 1331 DTRACE_PROBE1(tx_queued_hiwat, \
1333 1332 mac_soft_ring_set_t *, srs); \
1334 1333 enqueue = 0; \
1335 1334 } \
1336 1335 } \
1337 1336 if (enqueue) \
1338 1337 MAC_TX_SRS_ENQUEUE_CHAIN(srs, mp, tail, cnt, sz); \
1339 1338 }
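The macro implements a two-level check: above st_hiwat the chain is still
queued but the client is flow-controlled via the returned cookie; above
st_max_q_cnt the chain is dropped instead. A sketch of just that decision;
tx_hiwat_check() is a hypothetical stand-in that returns the outcome rather
than setting SRS state:

#include <stdio.h>

typedef enum { TX_ENQUEUE, TX_FLOW_CTL, TX_DROP } tx_verdict_t;

static tx_verdict_t
tx_hiwat_check(int qlen, int hiwat, int max_q_cnt)
{
	if (qlen <= hiwat)
		return (TX_ENQUEUE);	/* normal path */
	if (qlen > max_q_cnt)
		return (TX_DROP);	/* queue out of control: drop */
	return (TX_FLOW_CTL);		/* enqueue, but push back */
}

int
main(void)
{
	printf("%d\n", tx_hiwat_check(10, 100, 1000));	 /* 0: enqueue */
	printf("%d\n", tx_hiwat_check(200, 100, 1000));	 /* 1: flow ctl */
	printf("%d\n", tx_hiwat_check(2000, 100, 1000)); /* 2: drop */
	return (0);
}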
1340 1339
1341 1340 /* Some utility macros */
1342 1341 #define MAC_SRS_BW_LOCK(srs) \
1343 1342 if (!(srs->srs_type & SRST_TX)) \
1344 1343 mutex_enter(&srs->srs_bw->mac_bw_lock);
1345 1344
1346 1345 #define MAC_SRS_BW_UNLOCK(srs) \
1347 1346 if (!(srs->srs_type & SRST_TX)) \
1348 1347 mutex_exit(&srs->srs_bw->mac_bw_lock);
1349 1348
1350 1349 #define MAC_TX_SRS_DROP_MESSAGE(srs, mp, cookie) { \
1351 1350 mac_pkt_drop(NULL, NULL, mp, B_FALSE); \
1352 1351 /* increment freed stats */ \
1353 1352 mac_srs->srs_tx.st_stat.mts_sdrops++; \
1354 1353 cookie = (mac_tx_cookie_t)srs; \
1355 1354 }
1356 1355
1357 1356 #define MAC_TX_SET_NO_ENQUEUE(srs, mp_chain, ret_mp, cookie) { \
1358 1357 mac_srs->srs_state |= SRS_TX_WAKEUP_CLIENT; \
1359 1358 cookie = (mac_tx_cookie_t)srs; \
1360 1359 *ret_mp = mp_chain; \
1361 1360 }
1362 1361
1363 1362 /*
1364 1363 * MAC_RX_SRS_TOODEEP
1365 1364 *
1366 1365 * Macro called as part of receive-side processing to determine if handling
1367 1366 * can occur in situ (in the interrupt thread) or if it should be left to a
1368 1367 * worker thread. Note that the constant used to make this determination is
1369 1368  * not entirely made-up, and is a result of some empirical validation. That
1370 1369 * said, the constant is left as a static variable to allow it to be
1371 1370 * dynamically tuned in the field if and as needed.
1372 1371 */
1373 1372 static uintptr_t mac_rx_srs_stack_needed = 10240;
1374 1373 static uint_t mac_rx_srs_stack_toodeep;
1375 1374
1376 1375 #ifndef STACK_GROWTH_DOWN
1377 1376 #error Downward stack growth assumed.
1378 1377 #endif
1379 1378
1380 1379 #define MAC_RX_SRS_TOODEEP() (STACK_BIAS + (uintptr_t)getfp() - \
1381 1380 (uintptr_t)curthread->t_stkbase < mac_rx_srs_stack_needed && \
1382 1381 ++mac_rx_srs_stack_toodeep)
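A user-space sketch of the same headroom test, assuming a downward-growing
stack; frame_ptr and stack_base are hypothetical stand-ins for getfp() and
curthread->t_stkbase (the STACK_BIAS term is elided):

#include <stdio.h>
#include <stdint.h>
#include <stdbool.h>

/*
 * With a downward-growing stack, the headroom left is the distance from
 * the current frame pointer down to the base of the stack.
 */
static bool
stack_too_deep(uintptr_t frame_ptr, uintptr_t stack_base, uintptr_t needed)
{
	return (frame_ptr - stack_base < needed);
}

int
main(void)
{
	/* A frame 4 KiB above the base with 10 KiB needed: too deep. */
	printf("%d\n", stack_too_deep(0x10001000, 0x10000000, 10240));
	return (0);
}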
1383 1382
1384 1383
1385 1384 /*
1386 1385 * Drop the rx packet and advance to the next one in the chain.
1387 1386 */
1388 1387 static void
1389 1388 mac_rx_drop_pkt(mac_soft_ring_set_t *srs, mblk_t *mp)
1390 1389 {
1391 1390 mac_srs_rx_t *srs_rx = &srs->srs_rx;
1392 1391
1393 1392 ASSERT(mp->b_next == NULL);
1394 1393 mutex_enter(&srs->srs_lock);
1395 1394 MAC_UPDATE_SRS_COUNT_LOCKED(srs, 1);
1396 1395 MAC_UPDATE_SRS_SIZE_LOCKED(srs, msgdsize(mp));
1397 1396 mutex_exit(&srs->srs_lock);
1398 1397
1399 1398 srs_rx->sr_stat.mrs_sdrops++;
1400 1399 freemsg(mp);
1401 1400 }
1402 1401
1403 1402 /* DATAPATH RUNTIME ROUTINES */
1404 1403
1405 1404 /*
1406 1405 * mac_srs_fire
1407 1406 *
1408 1407 * Timer callback routine for waking up the SRS worker thread.
1409 1408 */
1410 1409 static void
1411 1410 mac_srs_fire(void *arg)
1412 1411 {
1413 1412 mac_soft_ring_set_t *mac_srs = (mac_soft_ring_set_t *)arg;
1414 1413
1415 1414 mutex_enter(&mac_srs->srs_lock);
1416 1415 if (mac_srs->srs_tid == NULL) {
1417 1416 mutex_exit(&mac_srs->srs_lock);
1418 1417 return;
1419 1418 }
1420 1419
1421 1420 mac_srs->srs_tid = NULL;
1422 1421 if (!(mac_srs->srs_state & SRS_PROC))
1423 1422 cv_signal(&mac_srs->srs_async);
1424 1423
1425 1424 mutex_exit(&mac_srs->srs_lock);
1426 1425 }
1427 1426
1428 1427 /*
1429 1428  * 'hint' is the fanout hint (a uint64_t) given by the TCP/IP stack;
1430 1429  * it is used on the Tx path.
1431 1430 */
1432 1431 #define HASH_HINT(hint) \
1433 1432 ((hint) ^ ((hint) >> 24) ^ ((hint) >> 16) ^ ((hint) >> 8))
1434 1433
1435 1434
1436 1435 /*
1437 1436  * Hash based on the src address, dst address, and port information.
1438 1437 */
1439 1438 #define HASH_ADDR(src, dst, ports) \
1440 1439 (ntohl((src) + (dst)) ^ ((ports) >> 24) ^ ((ports) >> 16) ^ \
1441 1440 ((ports) >> 8) ^ (ports))
1442 1441
1443 1442 #define COMPUTE_INDEX(key, sz) (key % sz)
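For illustration, the two macros above can be exercised directly in user
space. The addresses and ports below are made up; in the kernel, src/dst
arrive in network byte order from the IP header and the 2x16-bit ports from
the transport header:

#include <stdio.h>
#include <stdint.h>
#include <arpa/inet.h>

#define	HASH_ADDR(src, dst, ports)				\
	(ntohl((src) + (dst)) ^ ((ports) >> 24) ^ ((ports) >> 16) ^ \
	((ports) >> 8) ^ (ports))

#define	COMPUTE_INDEX(key, sz)	(key % sz)

int
main(void)
{
	uint32_t src = inet_addr("192.168.1.10");
	uint32_t dst = inet_addr("192.168.1.20");
	uint32_t ports = (12345u << 16) | 80;	/* local, remote port */
	uint32_t hash = HASH_ADDR(src, dst, ports);

	/* With 8 soft rings, this flow always lands on the same ring. */
	printf("ring index = %u\n", (unsigned)COMPUTE_INDEX(hash, 8));
	return (0);
}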
1444 1443
1445 1444 #define FANOUT_ENQUEUE_MP(head, tail, cnt, bw_ctl, sz, sz0, mp) { \
1446 1445 if ((tail) != NULL) { \
1447 1446 ASSERT((tail)->b_next == NULL); \
1448 1447 (tail)->b_next = (mp); \
1449 1448 } else { \
1450 1449 ASSERT((head) == NULL); \
1451 1450 (head) = (mp); \
1452 1451 } \
1453 1452 (tail) = (mp); \
1454 1453 (cnt)++; \
1455 1454 if ((bw_ctl)) \
1456 1455 (sz) += (sz0); \
1457 1456 }
1458 1457
1459 1458 #define MAC_FANOUT_DEFAULT 0
1460 1459 #define MAC_FANOUT_RND_ROBIN 1
1461 1460 int mac_fanout_type = MAC_FANOUT_DEFAULT;
1462 1461
1463 1462 #define MAX_SR_TYPES 3
1464 1463 /* fanout types for port based hashing */
1465 1464 enum pkt_type {
1466 1465 V4_TCP = 0,
1467 1466 V4_UDP,
1468 1467 OTH,
1469 1468 UNDEF
1470 1469 };
1471 1470
1472 1471 /*
1473 1472 * Pair of local and remote ports in the transport header
1474 1473 */
1475 1474 #define PORTS_SIZE 4
1476 1475
1477 1476 /*
1478 - * mac_rx_srs_proto_fanout
1479 - *
1480 - * This routine delivers packets destined to an SRS into one of the
1477 + * This routine delivers packets destined for an SRS into one of the
1481 1478 * protocol soft rings.
1482 1479 *
1483 - * Given a chain of packets we need to split it up into multiple sub chains
1484 - * destined into TCP, UDP or OTH soft ring. Instead of entering
1485 - * the soft ring one packet at a time, we want to enter it in the form of a
1486 - * chain otherwise we get this start/stop behaviour where the worker thread
1487 - * goes to sleep and then next packets comes in forcing it to wake up etc.
1480 + * Given a chain of packets we need to split it up into multiple sub
1481 + * chains: TCP, UDP or OTH soft ring. Instead of entering the soft
1482 + * ring one packet at a time, we want to enter it in the form of a
1483 + * chain otherwise we get this start/stop behaviour where the worker
1484 + * thread goes to sleep and then next packet comes in forcing it to
1485 + * wake up.
1488 1486 */
1489 1487 static void
1490 1488 mac_rx_srs_proto_fanout(mac_soft_ring_set_t *mac_srs, mblk_t *head)
1491 1489 {
1492 1490 struct ether_header *ehp;
1493 1491 struct ether_vlan_header *evhp;
1494 1492 uint32_t sap;
1495 1493 ipha_t *ipha;
1496 1494 uint8_t *dstaddr;
1497 1495 size_t hdrsize;
1498 1496 mblk_t *mp;
1499 1497 mblk_t *headmp[MAX_SR_TYPES];
1500 1498 mblk_t *tailmp[MAX_SR_TYPES];
1501 1499 int cnt[MAX_SR_TYPES];
1502 1500 size_t sz[MAX_SR_TYPES];
1503 1501 size_t sz1;
1504 1502 boolean_t bw_ctl;
1505 1503 boolean_t hw_classified;
1506 1504 boolean_t dls_bypass;
1507 1505 boolean_t is_ether;
1508 1506 boolean_t is_unicast;
1509 1507 enum pkt_type type;
1510 1508 mac_client_impl_t *mcip = mac_srs->srs_mcip;
1511 1509
1512 1510 is_ether = (mcip->mci_mip->mi_info.mi_nativemedia == DL_ETHER);
1513 1511 bw_ctl = ((mac_srs->srs_type & SRST_BW_CONTROL) != 0);
1514 1512
1515 1513 /*
1516 1514 * If we don't have a Rx ring, S/W classification would have done
1517 1515  * its job and it's a packet meant for us. If we were polling on
1518 1516 * the default ring (i.e. there was a ring assigned to this SRS),
1519 1517 * then we need to make sure that the mac address really belongs
1520 1518 * to us.
1521 1519 */
1522 1520 hw_classified = mac_srs->srs_ring != NULL &&
1523 1521 mac_srs->srs_ring->mr_classify_type == MAC_HW_CLASSIFIER;
1524 1522
1525 1523 /*
1526 - * Special clients (eg. VLAN, non ether, etc) need DLS
1527 - * processing in the Rx path. SRST_DLS_BYPASS will be clear for
1528 - * such SRSs. Another way of disabling bypass is to set the
1524 + * Some clients, such as non-Ethernet, need DLS processing in
1525 + * the Rx path. Such clients clear the SRST_DLS_BYPASS flag.
1526 + * DLS bypass may also be disabled via the
1529 1527 * MCIS_RX_BYPASS_DISABLE flag.
1530 1528 */
1531 1529 dls_bypass = ((mac_srs->srs_type & SRST_DLS_BYPASS) != 0) &&
1532 1530 ((mcip->mci_state_flags & MCIS_RX_BYPASS_DISABLE) == 0);
1533 1531
1534 1532 bzero(headmp, MAX_SR_TYPES * sizeof (mblk_t *));
1535 1533 bzero(tailmp, MAX_SR_TYPES * sizeof (mblk_t *));
1536 1534 bzero(cnt, MAX_SR_TYPES * sizeof (int));
1537 1535 bzero(sz, MAX_SR_TYPES * sizeof (size_t));
1538 1536
1539 1537 /*
1540 - * We got a chain from SRS that we need to send to the soft rings.
1541 - * Since squeues for TCP & IPv4 sap poll their soft rings (for
1542 - * performance reasons), we need to separate out v4_tcp, v4_udp
1543 - * and the rest goes in other.
1538 + * We have a chain from SRS that we need to split across the
1539 + * soft rings. The squeues for the TCP and IPv4 SAPs use their
1540 + * own soft rings to allow polling from the squeue. The rest of
1541 + * the packets are delivered on the OTH soft ring which cannot
1542 + * be polled.
1544 1543 */
1545 1544 while (head != NULL) {
1546 1545 mp = head;
1547 1546 head = head->b_next;
1548 1547 mp->b_next = NULL;
1549 1548
1550 1549 type = OTH;
1551 1550 sz1 = (mp->b_cont == NULL) ? MBLKL(mp) : msgdsize(mp);
1552 1551
1553 1552 if (is_ether) {
1554 1553 /*
1555 1554 * At this point we can be sure the packet at least
1556 1555 * has an ether header.
1557 1556 */
1558 1557 if (sz1 < sizeof (struct ether_header)) {
1559 1558 mac_rx_drop_pkt(mac_srs, mp);
1560 1559 continue;
1561 1560 }
1562 1561 ehp = (struct ether_header *)mp->b_rptr;
1563 1562
1564 1563 /*
1565 1564 * Determine if this is a VLAN or non-VLAN packet.
1566 1565 */
1567 1566 if ((sap = ntohs(ehp->ether_type)) == VLAN_TPID) {
1568 1567 evhp = (struct ether_vlan_header *)mp->b_rptr;
1569 1568 sap = ntohs(evhp->ether_type);
1570 1569 hdrsize = sizeof (struct ether_vlan_header);
1570 +
1571 1571 /*
1572 - * Check if the VID of the packet, if any,
1573 - * belongs to this client.
1572 + * Check if the VID of the packet, if
1573 + * any, belongs to this client.
1574 + * Technically, if this packet came up
1575 + * via a HW classified ring then we
1576 + * don't need to perform this check.
1577 + * Perhaps a future optimization.
1574 1578 */
1575 1579 if (!mac_client_check_flow_vid(mcip,
1576 1580 VLAN_ID(ntohs(evhp->ether_tci)))) {
1577 1581 mac_rx_drop_pkt(mac_srs, mp);
1578 1582 continue;
1579 1583 }
1580 1584 } else {
1581 1585 hdrsize = sizeof (struct ether_header);
1582 1586 }
1583 1587 is_unicast =
1584 1588 ((((uint8_t *)&ehp->ether_dhost)[0] & 0x01) == 0);
1585 1589 dstaddr = (uint8_t *)&ehp->ether_dhost;
1586 1590 } else {
1587 1591 mac_header_info_t mhi;
1588 1592
1589 1593 if (mac_header_info((mac_handle_t)mcip->mci_mip,
1590 1594 mp, &mhi) != 0) {
1591 1595 mac_rx_drop_pkt(mac_srs, mp);
1592 1596 continue;
1593 1597 }
1594 1598 hdrsize = mhi.mhi_hdrsize;
1595 1599 sap = mhi.mhi_bindsap;
1596 1600 is_unicast = (mhi.mhi_dsttype == MAC_ADDRTYPE_UNICAST);
1597 1601 dstaddr = (uint8_t *)mhi.mhi_daddr;
1598 1602 }
1599 1603
1600 1604 if (!dls_bypass) {
1601 1605 FANOUT_ENQUEUE_MP(headmp[type], tailmp[type],
1602 1606 cnt[type], bw_ctl, sz[type], sz1, mp);
1603 1607 continue;
1604 1608 }
1605 1609
1606 1610 if (sap == ETHERTYPE_IP) {
1607 1611 /*
1608 1612 * If we are H/W classified, but we have promisc
1609 1613 * on, then we need to check for the unicast address.
1610 1614 */
1611 1615 if (hw_classified && mcip->mci_promisc_list != NULL) {
1612 1616 mac_address_t *map;
1613 1617
1614 1618 rw_enter(&mcip->mci_rw_lock, RW_READER);
1615 1619 map = mcip->mci_unicast;
1616 1620 if (bcmp(dstaddr, map->ma_addr,
1617 1621 map->ma_len) == 0)
1618 1622 type = UNDEF;
1619 1623 rw_exit(&mcip->mci_rw_lock);
1620 1624 } else if (is_unicast) {
1621 1625 type = UNDEF;
1622 1626 }
1623 1627 }
1624 1628
1625 1629 /*
1626 1630 * This needs to become a contract with the driver for
1627 1631 * the fast path.
1628 1632 *
1629 1633 * In the normal case the packet will have at least the L2
1630 1634 * header and the IP + Transport header in the same mblk.
1631 1635 * This is usually the case when the NIC driver sends up
1632 1636 * the packet. This is also true when the stack generates
1633 1637 * a packet that is looped back and when the stack uses the
1634 1638 * fastpath mechanism. The normal case is optimized for
1635 1639 * performance and may bypass DLS. All other cases go through
1636 1640 * the 'OTH' type path without DLS bypass.
1637 1641 */
1638 -
1639 1642 ipha = (ipha_t *)(mp->b_rptr + hdrsize);
1640 1643 if ((type != OTH) && MBLK_RX_FANOUT_SLOWPATH(mp, ipha))
1641 1644 type = OTH;
1642 1645
1643 1646 if (type == OTH) {
1644 1647 FANOUT_ENQUEUE_MP(headmp[type], tailmp[type],
1645 1648 cnt[type], bw_ctl, sz[type], sz1, mp);
1646 1649 continue;
1647 1650 }
1648 1651
1649 1652 ASSERT(type == UNDEF);
1653 +
1650 1654 /*
1651 - * We look for at least 4 bytes past the IP header to get
1652 - * the port information. If we get an IP fragment, we don't
1653 - * have the port information, and we use just the protocol
1654 - * information.
1655 + * Determine the type from the IP protocol value. If
1656 + * classified as TCP or UDP, then update the read
1657 + * pointer to the beginning of the IP header.
1658 + * Otherwise leave the message as is for further
1659 + * processing by DLS.
1655 1660 */
1656 1661 switch (ipha->ipha_protocol) {
1657 1662 case IPPROTO_TCP:
1658 1663 type = V4_TCP;
1659 1664 mp->b_rptr += hdrsize;
1660 1665 break;
1661 1666 case IPPROTO_UDP:
1662 1667 type = V4_UDP;
1663 1668 mp->b_rptr += hdrsize;
1664 1669 break;
1665 1670 default:
1666 1671 type = OTH;
1667 1672 break;
1668 1673 }
1669 1674
1670 1675 FANOUT_ENQUEUE_MP(headmp[type], tailmp[type], cnt[type],
1671 1676 bw_ctl, sz[type], sz1, mp);
1672 1677 }
1673 1678
1674 1679 for (type = V4_TCP; type < UNDEF; type++) {
1675 1680 if (headmp[type] != NULL) {
1676 1681 mac_soft_ring_t *softring;
1677 1682
1678 1683 ASSERT(tailmp[type]->b_next == NULL);
1679 1684 switch (type) {
1680 1685 case V4_TCP:
1681 1686 softring = mac_srs->srs_tcp_soft_rings[0];
1682 1687 break;
1683 1688 case V4_UDP:
1684 1689 softring = mac_srs->srs_udp_soft_rings[0];
1685 1690 break;
1686 1691 case OTH:
1687 1692 softring = mac_srs->srs_oth_soft_rings[0];
1688 1693 }
1689 1694 mac_rx_soft_ring_process(mcip, softring,
1690 1695 headmp[type], tailmp[type], cnt[type], sz[type]);
1691 1696 }
1692 1697 }
1693 1698 }
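The chain-splitting pattern used by this routine (and by mac_rx_srs_fanout()
below) reduces to the following user-space sketch. pkt_t, classify() and
deliver() are hypothetical stand-ins for mblk_t, the protocol checks and
mac_rx_soft_ring_process(); the point is one delivery per sub-chain instead
of one per packet:

#include <stdio.h>
#include <stddef.h>

enum { SK_TCP, SK_UDP, SK_OTH, SK_NTYPES };

typedef struct pkt {
	struct pkt *b_next;
	int proto;		/* stands in for ipha_protocol */
} pkt_t;

static int
classify(const pkt_t *p)
{
	switch (p->proto) {
	case 6:		return (SK_TCP);	/* IPPROTO_TCP */
	case 17:	return (SK_UDP);	/* IPPROTO_UDP */
	default:	return (SK_OTH);
	}
}

/* Stand-in for mac_rx_soft_ring_process(): one call per sub-chain. */
static void
deliver(int type, pkt_t *head, int cnt)
{
	(void) head;
	printf("type %d: chain of %d\n", type, cnt);
}

static void
fanout(pkt_t *head)
{
	pkt_t *headp[SK_NTYPES] = { NULL };
	pkt_t *tailp[SK_NTYPES] = { NULL };
	int cnt[SK_NTYPES] = { 0 };

	while (head != NULL) {
		pkt_t *mp = head;

		head = head->b_next;
		mp->b_next = NULL;

		int type = classify(mp);

		/* Append to the sub-chain, as FANOUT_ENQUEUE_MP does. */
		if (tailp[type] != NULL)
			tailp[type]->b_next = mp;
		else
			headp[type] = mp;
		tailp[type] = mp;
		cnt[type]++;
	}

	for (int t = 0; t < SK_NTYPES; t++) {
		if (headp[t] != NULL)
			deliver(t, headp[t], cnt[t]);
	}
}

int
main(void)
{
	pkt_t p[4] = {
		{ &p[1], 6 }, { &p[2], 17 }, { &p[3], 6 }, { NULL, 1 }
	};

	fanout(&p[0]);	/* prints: tcp chain of 2, udp of 1, oth of 1 */
	return (0);
}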
1694 1699
1695 1700 int fanout_unaligned = 0;
1696 1701
1697 1702 /*
1698 - * mac_rx_srs_long_fanout
1699 - *
1700 - * The fanout routine for VLANs, and for anything else that isn't performing
1701 - * explicit dls bypass. Returns -1 on an error (drop the packet due to a
1702 - * malformed packet), 0 on success, with values written in *indx and *type.
1703 + * The fanout routine for any clients with DLS bypass disabled or for
1704 + * traffic classified as "other". Returns -1 on an error (drop the
1705 + * packet due to a malformed packet), 0 on success, with values
1706 + * written in *indx and *type.
1703 1707 */
1704 1708 static int
1705 1709 mac_rx_srs_long_fanout(mac_soft_ring_set_t *mac_srs, mblk_t *mp,
1706 1710 uint32_t sap, size_t hdrsize, enum pkt_type *type, uint_t *indx)
1707 1711 {
1708 1712 ip6_t *ip6h;
1709 1713 ipha_t *ipha;
1710 1714 uint8_t *whereptr;
1711 1715 uint_t hash;
1712 1716 uint16_t remlen;
1713 1717 uint8_t nexthdr;
1714 1718 uint16_t hdr_len;
1715 1719 uint32_t src_val, dst_val;
1716 1720 boolean_t modifiable = B_TRUE;
1717 1721 boolean_t v6;
1718 1722
1719 1723 ASSERT(MBLKL(mp) >= hdrsize);
1720 1724
1721 1725 if (sap == ETHERTYPE_IPV6) {
1722 1726 v6 = B_TRUE;
1723 1727 hdr_len = IPV6_HDR_LEN;
1724 1728 } else if (sap == ETHERTYPE_IP) {
1725 1729 v6 = B_FALSE;
1726 1730 hdr_len = IP_SIMPLE_HDR_LENGTH;
1727 1731 } else {
1728 1732 *indx = 0;
1729 1733 *type = OTH;
1730 1734 return (0);
1731 1735 }
1732 1736
1733 1737 ip6h = (ip6_t *)(mp->b_rptr + hdrsize);
1734 1738 ipha = (ipha_t *)ip6h;
1735 1739
1736 1740 if ((uint8_t *)ip6h == mp->b_wptr) {
1737 1741 /*
1738 1742 * The first mblk_t only includes the mac header.
1739 1743 * Note that it is safe to change the mp pointer here,
1740 1744 * as the subsequent operation does not assume mp
1741 1745 * points to the start of the mac header.
1742 1746 */
1743 1747 mp = mp->b_cont;
1744 1748
1745 1749 /*
1746 1750 * Make sure the IP header points to an entire one.
1747 1751 */
1748 1752 if (mp == NULL)
1749 1753 return (-1);
1750 1754
1751 1755 if (MBLKL(mp) < hdr_len) {
1752 1756 modifiable = (DB_REF(mp) == 1);
1753 1757
1754 1758 if (modifiable && !pullupmsg(mp, hdr_len))
1755 1759 return (-1);
1756 1760 }
1757 1761
1758 1762 ip6h = (ip6_t *)mp->b_rptr;
1759 1763 ipha = (ipha_t *)ip6h;
1760 1764 }
1761 1765
1762 1766 if (!modifiable || !(OK_32PTR((char *)ip6h)) ||
1763 1767 ((uint8_t *)ip6h + hdr_len > mp->b_wptr)) {
1764 1768 /*
1765 1769 * If either the IP header is not aligned, or it does not hold
1766 1770 * the complete simple structure (a pullupmsg() is not an
1767 1771 * option since it would result in an unaligned IP header),
1768 1772 * fanout to the default ring.
1769 1773 *
1770 1774 * Note that this may cause packet reordering.
1771 1775 */
1772 1776 *indx = 0;
1773 1777 *type = OTH;
1774 1778 fanout_unaligned++;
1775 1779 return (0);
1776 1780 }
1777 1781
1778 1782 /*
1779 1783 * Extract next-header, full header length, and source-hash value
1780 1784 * using v4/v6 specific fields.
1781 1785 */
1782 1786 if (v6) {
1783 1787 remlen = ntohs(ip6h->ip6_plen);
1784 1788 nexthdr = ip6h->ip6_nxt;
1785 1789 src_val = V4_PART_OF_V6(ip6h->ip6_src);
1786 1790 dst_val = V4_PART_OF_V6(ip6h->ip6_dst);
1787 1791 /*
1788 1792 	 * Do src-based fanout if the below tunable is set to B_TRUE or
1789 1793 * when mac_ip_hdr_length_v6() fails because of malformed
1790 1794 * packets or because mblks need to be concatenated using
1791 1795 * pullupmsg().
1792 1796 *
1793 1797 * Perform a version check to prevent parsing weirdness...
1794 1798 */
1795 1799 if (IPH_HDR_VERSION(ip6h) != IPV6_VERSION ||
1796 1800 !mac_ip_hdr_length_v6(ip6h, mp->b_wptr, &hdr_len, &nexthdr,
1797 1801 NULL)) {
1798 1802 goto src_dst_based_fanout;
1799 1803 }
1800 1804 } else {
1801 1805 hdr_len = IPH_HDR_LENGTH(ipha);
1802 1806 remlen = ntohs(ipha->ipha_length) - hdr_len;
1803 1807 nexthdr = ipha->ipha_protocol;
1804 1808 src_val = (uint32_t)ipha->ipha_src;
1805 1809 dst_val = (uint32_t)ipha->ipha_dst;
1806 1810 /*
1807 1811 * Catch IPv4 fragment case here. IPv6 has nexthdr == FRAG
1808 1812 * for its equivalent case.
1809 1813 */
1810 1814 if ((ntohs(ipha->ipha_fragment_offset_and_flags) &
1811 1815 (IPH_MF | IPH_OFFSET)) != 0) {
1812 1816 goto src_dst_based_fanout;
1813 1817 }
1814 1818 }
1815 1819 if (remlen < MIN_EHDR_LEN)
1816 1820 return (-1);
1817 1821 whereptr = (uint8_t *)ip6h + hdr_len;
1818 1822
1819 1823 	/* If the transport is one of the below, we do port/SPI-based fanout */
1820 1824 switch (nexthdr) {
1821 1825 case IPPROTO_TCP:
1822 1826 case IPPROTO_UDP:
1823 1827 case IPPROTO_SCTP:
1824 1828 case IPPROTO_ESP:
1825 1829 /*
1826 1830 		 * If the ports or SPI in the transport header are not part of
1827 1831 		 * the mblk, do src_dst_based_fanout instead of calling
1828 1832 * pullupmsg().
1829 1833 */
1830 1834 if (mp->b_cont == NULL || whereptr + PORTS_SIZE <= mp->b_wptr)
1831 1835 break; /* out of switch... */
1832 1836 /* FALLTHRU */
1833 1837 default:
1834 1838 goto src_dst_based_fanout;
1835 1839 }
1836 1840
1837 1841 switch (nexthdr) {
1838 1842 case IPPROTO_TCP:
1839 1843 hash = HASH_ADDR(src_val, dst_val, *(uint32_t *)whereptr);
1840 1844 *indx = COMPUTE_INDEX(hash, mac_srs->srs_tcp_ring_count);
1841 1845 *type = OTH;
1842 1846 break;
1843 1847 case IPPROTO_UDP:
1844 1848 case IPPROTO_SCTP:
1845 1849 case IPPROTO_ESP:
1846 1850 if (mac_fanout_type == MAC_FANOUT_DEFAULT) {
1847 1851 hash = HASH_ADDR(src_val, dst_val,
1848 1852 *(uint32_t *)whereptr);
1849 1853 *indx = COMPUTE_INDEX(hash,
1850 1854 mac_srs->srs_udp_ring_count);
1851 1855 } else {
1852 1856 *indx = mac_srs->srs_ind % mac_srs->srs_udp_ring_count;
1853 1857 mac_srs->srs_ind++;
1854 1858 }
1855 1859 *type = OTH;
1856 1860 break;
1857 1861 }
1858 1862 return (0);
1859 1863
1860 1864 src_dst_based_fanout:
1861 1865 hash = HASH_ADDR(src_val, dst_val, (uint32_t)0);
1862 1866 *indx = COMPUTE_INDEX(hash, mac_srs->srs_oth_ring_count);
1863 1867 *type = OTH;
1864 1868 return (0);
1865 1869 }
1866 1870
1867 1871 /*
1868 - * mac_rx_srs_fanout
1869 - *
1870 - * This routine delivers packets destined to an SRS into a soft ring member
1872 + * This routine delivers packets destined for an SRS into a soft ring member
1871 1873 * of the set.
1872 1874 *
1873 - * Given a chain of packets we need to split it up into multiple sub chains
1874 - * destined for one of the TCP, UDP or OTH soft rings. Instead of entering
1875 - * the soft ring one packet at a time, we want to enter it in the form of a
1876 - * chain otherwise we get this start/stop behaviour where the worker thread
1877 - * goes to sleep and then next packets comes in forcing it to wake up etc.
1875 + * Given a chain of packets we need to split it up into multiple sub
1876 + * chains: TCP, UDP or OTH soft ring. Instead of entering the soft
1877 + * ring one packet at a time, we want to enter it in the form of a
1878 + * chain otherwise we get this start/stop behaviour where the worker
1879 + * thread goes to sleep and then next packet comes in forcing it to
1880 + * wake up.
1878 1881 *
1879 1882 * Note:
1880 1883 * Since we know what is the maximum fanout possible, we create a 2D array
1881 1884 * of 'softring types * MAX_SR_FANOUT' for the head, tail, cnt and sz
1882 1885 * variables so that we can enter the softrings with chain. We need the
1883 1886 * MAX_SR_FANOUT so we can allocate the arrays on the stack (a kmem_alloc
1884 1887 * for each packet would be expensive). If we ever want to have the
1885 1888 * ability to have unlimited fanout, we should probably declare a head,
1886 1889 * tail, cnt, sz with each soft ring (a data struct which contains a softring
1887 1890 * along with these members) and create an array of this uber struct so we
1888 1891 * don't have to do kmem_alloc.
1889 1892 */
1890 1893 int fanout_oth1 = 0;
1891 1894 int fanout_oth2 = 0;
1892 1895 int fanout_oth3 = 0;
1893 1896 int fanout_oth4 = 0;
1894 1897 int fanout_oth5 = 0;
1895 1898
1896 1899 static void
1897 1900 mac_rx_srs_fanout(mac_soft_ring_set_t *mac_srs, mblk_t *head)
1898 1901 {
1899 1902 struct ether_header *ehp;
1900 1903 struct ether_vlan_header *evhp;
1901 1904 uint32_t sap;
1902 1905 ipha_t *ipha;
1903 1906 uint8_t *dstaddr;
1904 1907 uint_t indx;
1905 1908 size_t ports_offset;
1906 1909 size_t ipha_len;
1907 1910 size_t hdrsize;
1908 1911 uint_t hash;
1909 1912 mblk_t *mp;
1910 1913 mblk_t *headmp[MAX_SR_TYPES][MAX_SR_FANOUT];
1911 1914 mblk_t *tailmp[MAX_SR_TYPES][MAX_SR_FANOUT];
1912 1915 int cnt[MAX_SR_TYPES][MAX_SR_FANOUT];
1913 1916 size_t sz[MAX_SR_TYPES][MAX_SR_FANOUT];
1914 1917 size_t sz1;
1915 1918 boolean_t bw_ctl;
1916 1919 boolean_t hw_classified;
1917 1920 boolean_t dls_bypass;
1918 1921 boolean_t is_ether;
1919 1922 boolean_t is_unicast;
1920 1923 int fanout_cnt;
1921 1924 enum pkt_type type;
1922 1925 mac_client_impl_t *mcip = mac_srs->srs_mcip;
1923 1926
1924 1927 is_ether = (mcip->mci_mip->mi_info.mi_nativemedia == DL_ETHER);
1925 1928 bw_ctl = ((mac_srs->srs_type & SRST_BW_CONTROL) != 0);
1926 1929
1927 1930 /*
1928 1931 * If we don't have a Rx ring, S/W classification would have done
1929 1932 	 * its job and it's a packet meant for us. If we were polling on
1930 1933 * the default ring (i.e. there was a ring assigned to this SRS),
1931 1934 * then we need to make sure that the mac address really belongs
1932 1935 * to us.
1933 1936 */
1934 1937 hw_classified = mac_srs->srs_ring != NULL &&
1935 1938 mac_srs->srs_ring->mr_classify_type == MAC_HW_CLASSIFIER;
1936 1939
1937 1940 /*
1938 - * Special clients (eg. VLAN, non ether, etc) need DLS
1939 - * processing in the Rx path. SRST_DLS_BYPASS will be clear for
1940 - * such SRSs. Another way of disabling bypass is to set the
1941 - * MCIS_RX_BYPASS_DISABLE flag.
1941 + * Some clients, such as non-Ethernet, need DLS processing in
1942 + * the Rx path. Such clients clear the SRST_DLS_BYPASS flag.
1943 + * DLS bypass may also be disabled via the
1944 + * MCIS_RX_BYPASS_DISABLE flag, but this is only consumed by
1945 + * sun4v vsw currently.
1942 1946 */
1943 1947 dls_bypass = ((mac_srs->srs_type & SRST_DLS_BYPASS) != 0) &&
1944 1948 ((mcip->mci_state_flags & MCIS_RX_BYPASS_DISABLE) == 0);
1945 1949
1946 1950 /*
1947 1951 * Since the softrings are never destroyed and we always
1948 1952 	 * create an equal number of softrings for TCP, UDP and the rest,
1949 1953 	 * it's OK to check one of them for the count and use it without
1950 1954 	 * any lock. In the future, if soft rings get destroyed because
1951 1955 * of reduction in fanout, we will need to ensure that happens
1952 1956 * behind the SRS_PROC.
1953 1957 */
1954 1958 fanout_cnt = mac_srs->srs_tcp_ring_count;
1955 1959
1956 1960 bzero(headmp, MAX_SR_TYPES * MAX_SR_FANOUT * sizeof (mblk_t *));
1957 1961 bzero(tailmp, MAX_SR_TYPES * MAX_SR_FANOUT * sizeof (mblk_t *));
1958 1962 bzero(cnt, MAX_SR_TYPES * MAX_SR_FANOUT * sizeof (int));
1959 1963 bzero(sz, MAX_SR_TYPES * MAX_SR_FANOUT * sizeof (size_t));
1960 1964
1961 1965 /*
1962 1966 * We got a chain from SRS that we need to send to the soft rings.
1963 - * Since squeues for TCP & IPv4 sap poll their soft rings (for
1967 + * Since squeues for TCP & IPv4 SAP poll their soft rings (for
1964 1968 * performance reasons), we need to separate out v4_tcp, v4_udp
1965 1969 * and the rest goes in other.
1966 1970 */
1967 1971 while (head != NULL) {
1968 1972 mp = head;
1969 1973 head = head->b_next;
1970 1974 mp->b_next = NULL;
1971 1975
1972 1976 type = OTH;
1973 1977 sz1 = (mp->b_cont == NULL) ? MBLKL(mp) : msgdsize(mp);
1974 1978
1975 1979 if (is_ether) {
1976 1980 /*
1977 1981 * At this point we can be sure the packet at least
1978 1982 * has an ether header.
1979 1983 */
1980 1984 if (sz1 < sizeof (struct ether_header)) {
1981 1985 mac_rx_drop_pkt(mac_srs, mp);
1982 1986 continue;
1983 1987 }
1984 1988 ehp = (struct ether_header *)mp->b_rptr;
1985 1989
1986 1990 /*
1987 1991 * Determine if this is a VLAN or non-VLAN packet.
1988 1992 */
1989 1993 if ((sap = ntohs(ehp->ether_type)) == VLAN_TPID) {
1990 1994 evhp = (struct ether_vlan_header *)mp->b_rptr;
1991 1995 sap = ntohs(evhp->ether_type);
1992 1996 hdrsize = sizeof (struct ether_vlan_header);
1997 +
1993 1998 /*
1994 - * Check if the VID of the packet, if any,
1995 - * belongs to this client.
1999 + * Check if the VID of the packet, if
2000 + * any, belongs to this client.
2001 + * Technically, if this packet came up
2002 + * via a HW classified ring then we
2003 + * don't need to perform this check.
2004 + * Perhaps a future optimization.
1996 2005 */
1997 2006 if (!mac_client_check_flow_vid(mcip,
1998 2007 VLAN_ID(ntohs(evhp->ether_tci)))) {
1999 2008 mac_rx_drop_pkt(mac_srs, mp);
2000 2009 continue;
2001 2010 }
2002 2011 } else {
2003 2012 hdrsize = sizeof (struct ether_header);
2004 2013 }
2005 2014 is_unicast =
2006 2015 ((((uint8_t *)&ehp->ether_dhost)[0] & 0x01) == 0);
2007 2016 dstaddr = (uint8_t *)&ehp->ether_dhost;
2008 2017 } else {
2009 2018 mac_header_info_t mhi;
2010 2019
2011 2020 if (mac_header_info((mac_handle_t)mcip->mci_mip,
2012 2021 mp, &mhi) != 0) {
2013 2022 mac_rx_drop_pkt(mac_srs, mp);
2014 2023 continue;
2015 2024 }
2016 2025 hdrsize = mhi.mhi_hdrsize;
2017 2026 sap = mhi.mhi_bindsap;
2018 2027 is_unicast = (mhi.mhi_dsttype == MAC_ADDRTYPE_UNICAST);
2019 2028 dstaddr = (uint8_t *)mhi.mhi_daddr;
2020 2029 }
2021 2030
2022 2031 if (!dls_bypass) {
2023 2032 if (mac_rx_srs_long_fanout(mac_srs, mp, sap,
2024 2033 hdrsize, &type, &indx) == -1) {
2025 2034 mac_rx_drop_pkt(mac_srs, mp);
2026 2035 continue;
2027 2036 }
2028 2037
2029 2038 FANOUT_ENQUEUE_MP(headmp[type][indx],
2030 2039 tailmp[type][indx], cnt[type][indx], bw_ctl,
2031 2040 sz[type][indx], sz1, mp);
2032 2041 continue;
2033 2042 }
2034 2043
2035 -
2036 2044 /*
2037 2045 * If we are using the default Rx ring where H/W or S/W
2038 2046 * classification has not happened, we need to verify if
2039 2047 * this unicast packet really belongs to us.
2040 2048 */
2041 2049 if (sap == ETHERTYPE_IP) {
2042 2050 /*
2043 2051 * If we are H/W classified, but we have promisc
2044 2052 * on, then we need to check for the unicast address.
2045 2053 */
2046 2054 if (hw_classified && mcip->mci_promisc_list != NULL) {
2047 2055 mac_address_t *map;
2048 2056
2049 2057 rw_enter(&mcip->mci_rw_lock, RW_READER);
2050 2058 map = mcip->mci_unicast;
2051 2059 if (bcmp(dstaddr, map->ma_addr,
2052 2060 map->ma_len) == 0)
2053 2061 type = UNDEF;
2054 2062 rw_exit(&mcip->mci_rw_lock);
2055 2063 } else if (is_unicast) {
2056 2064 type = UNDEF;
2057 2065 }
2058 2066 }
2059 2067
2060 2068 /*
2061 2069 * This needs to become a contract with the driver for
2062 2070 * the fast path.
2063 2071 */
2064 2072
2065 2073 ipha = (ipha_t *)(mp->b_rptr + hdrsize);
2066 2074 if ((type != OTH) && MBLK_RX_FANOUT_SLOWPATH(mp, ipha)) {
2067 2075 type = OTH;
2068 2076 fanout_oth1++;
2069 2077 }
2070 2078
2071 2079 if (type != OTH) {
2072 2080 uint16_t frag_offset_flags;
2073 2081
2074 2082 switch (ipha->ipha_protocol) {
2075 2083 case IPPROTO_TCP:
2076 2084 case IPPROTO_UDP:
2077 2085 case IPPROTO_SCTP:
2078 2086 case IPPROTO_ESP:
2079 2087 ipha_len = IPH_HDR_LENGTH(ipha);
2080 2088 if ((uchar_t *)ipha + ipha_len + PORTS_SIZE >
2081 2089 mp->b_wptr) {
2082 2090 type = OTH;
2083 2091 break;
2084 2092 }
2085 2093 frag_offset_flags =
2086 2094 ntohs(ipha->ipha_fragment_offset_and_flags);
2087 2095 if ((frag_offset_flags &
2088 2096 (IPH_MF | IPH_OFFSET)) != 0) {
2089 2097 type = OTH;
2090 2098 fanout_oth3++;
2091 2099 break;
2092 2100 }
2093 2101 ports_offset = hdrsize + ipha_len;
2094 2102 break;
2095 2103 default:
2096 2104 type = OTH;
2097 2105 fanout_oth4++;
2098 2106 break;
2099 2107 }
2100 2108 }
2101 2109
2102 2110 if (type == OTH) {
2103 2111 if (mac_rx_srs_long_fanout(mac_srs, mp, sap,
2104 2112 hdrsize, &type, &indx) == -1) {
2105 2113 mac_rx_drop_pkt(mac_srs, mp);
2106 2114 continue;
2107 2115 }
2108 2116
2109 2117 FANOUT_ENQUEUE_MP(headmp[type][indx],
2110 2118 tailmp[type][indx], cnt[type][indx], bw_ctl,
2111 2119 sz[type][indx], sz1, mp);
2112 2120 continue;
2113 2121 }
2114 2122
2115 2123 ASSERT(type == UNDEF);
2116 2124
2117 2125 /*
2118 2126 * XXX-Sunay: We should hold srs_lock since ring_count
2119 2127 * below can change. But if we are always called from
2120 2128 * mac_rx_srs_drain and SRS_PROC is set, then we can
2121 2129 * enforce that ring_count can't be changed i.e.
2122 2130 * to change fanout type or ring count, the calling
2123 2131 * thread needs to be behind SRS_PROC.
2124 2132 */
2125 2133 switch (ipha->ipha_protocol) {
2126 2134 case IPPROTO_TCP:
2127 2135 /*
2128 2136 * Note that for ESP, we fanout on SPI and it is at the
2129 2137 * same offset as the 2x16-bit ports. So it is clumped
2130 2138 * along with TCP, UDP and SCTP.
2131 2139 */
2132 2140 hash = HASH_ADDR(ipha->ipha_src, ipha->ipha_dst,
2133 2141 *(uint32_t *)(mp->b_rptr + ports_offset));
2134 2142 indx = COMPUTE_INDEX(hash, mac_srs->srs_tcp_ring_count);
2135 2143 type = V4_TCP;
2136 2144 mp->b_rptr += hdrsize;
2137 2145 break;
2138 2146 case IPPROTO_UDP:
2139 2147 case IPPROTO_SCTP:
2140 2148 case IPPROTO_ESP:
2141 2149 if (mac_fanout_type == MAC_FANOUT_DEFAULT) {
2142 2150 hash = HASH_ADDR(ipha->ipha_src, ipha->ipha_dst,
2143 2151 *(uint32_t *)(mp->b_rptr + ports_offset));
2144 2152 indx = COMPUTE_INDEX(hash,
2145 2153 mac_srs->srs_udp_ring_count);
2146 2154 } else {
2147 2155 indx = mac_srs->srs_ind %
2148 2156 mac_srs->srs_udp_ring_count;
2149 2157 mac_srs->srs_ind++;
2150 2158 }
2151 2159 type = V4_UDP;
2152 2160 mp->b_rptr += hdrsize;
2153 2161 break;
2154 2162 default:
2155 2163 indx = 0;
2156 2164 type = OTH;
2157 2165 }
2158 2166
2159 2167 FANOUT_ENQUEUE_MP(headmp[type][indx], tailmp[type][indx],
2160 2168 cnt[type][indx], bw_ctl, sz[type][indx], sz1, mp);
2161 2169 }
2162 2170
2163 2171 for (type = V4_TCP; type < UNDEF; type++) {
2164 2172 int i;
2165 2173
2166 2174 for (i = 0; i < fanout_cnt; i++) {
2167 2175 if (headmp[type][i] != NULL) {
2168 2176 mac_soft_ring_t *softring;
2169 2177
2170 2178 ASSERT(tailmp[type][i]->b_next == NULL);
2171 2179 switch (type) {
2172 2180 case V4_TCP:
2173 2181 softring =
2174 2182 mac_srs->srs_tcp_soft_rings[i];
2175 2183 break;
2176 2184 case V4_UDP:
2177 2185 softring =
2178 2186 mac_srs->srs_udp_soft_rings[i];
2179 2187 break;
2180 2188 case OTH:
2181 2189 softring =
2182 2190 mac_srs->srs_oth_soft_rings[i];
2183 2191 break;
2184 2192 }
2185 2193 mac_rx_soft_ring_process(mcip,
2186 2194 softring, headmp[type][i], tailmp[type][i],
2187 2195 cnt[type][i], sz[type][i]);
2188 2196 }
2189 2197 }
2190 2198 }
2191 2199 }
2192 2200
2193 2201 #define SRS_BYTES_TO_PICKUP 150000
2194 2202 ssize_t max_bytes_to_pickup = SRS_BYTES_TO_PICKUP;
2195 2203
2196 2204 /*
2197 2205 * mac_rx_srs_poll_ring
2198 2206 *
2199 2207 * This SRS Poll thread uses this routine to poll the underlying hardware
2200 2208  * Rx ring to get a chain of packets. It can process that chain
2201 2209  * inline if mac_latency_optimize is set (the default) or signal the
2202 2210  * SRS worker thread to do the remaining processing.
2203 2211  *
2204 2212  * Since packets come into the system via the interrupt or poll path,
2205 2213  * we also update the stats and deal with promiscuous clients here.
2206 2214 */
2207 2215 void
2208 2216 mac_rx_srs_poll_ring(mac_soft_ring_set_t *mac_srs)
2209 2217 {
2210 2218 kmutex_t *lock = &mac_srs->srs_lock;
2211 2219 kcondvar_t *async = &mac_srs->srs_cv;
2212 2220 mac_srs_rx_t *srs_rx = &mac_srs->srs_rx;
2213 2221 mblk_t *head, *tail, *mp;
2214 2222 callb_cpr_t cprinfo;
2215 2223 ssize_t bytes_to_pickup;
2216 2224 size_t sz;
2217 2225 int count;
2218 2226 mac_client_impl_t *smcip;
2219 2227
2220 2228 CALLB_CPR_INIT(&cprinfo, lock, callb_generic_cpr, "mac_srs_poll");
2221 2229 mutex_enter(lock);
2222 2230
2223 2231 start:
2224 2232 for (;;) {
2225 2233 if (mac_srs->srs_state & SRS_PAUSE)
2226 2234 goto done;
2227 2235
2228 2236 CALLB_CPR_SAFE_BEGIN(&cprinfo);
2229 2237 cv_wait(async, lock);
2230 2238 CALLB_CPR_SAFE_END(&cprinfo, lock);
2231 2239
2232 2240 if (mac_srs->srs_state & SRS_PAUSE)
2233 2241 goto done;
2234 2242
2235 2243 check_again:
2236 2244 if (mac_srs->srs_type & SRST_BW_CONTROL) {
2237 2245 /*
2238 2246 * We pick as many bytes as we are allowed to queue.
2239 2247 			 * It's possible that we will exceed the total
2240 2248 			 * packets queued in case this SRS is part of the
2241 2249 			 * Rx ring group since > 1 poll thread can be pulling
2242 2250 			 * up to the max allowed packets at the same time,
2243 2251 			 * but that should be OK.
2244 2252 */
2245 2253 mutex_enter(&mac_srs->srs_bw->mac_bw_lock);
2246 2254 bytes_to_pickup =
2247 2255 mac_srs->srs_bw->mac_bw_drop_threshold -
2248 2256 mac_srs->srs_bw->mac_bw_sz;
2249 2257 /*
2250 2258 * We shouldn't have been signalled if we
2251 2259 			 * have 0 or fewer bytes to pick, but since
2252 2260 			 * some of the byte accounting is driver
2253 2261 			 * dependent, we do the safety check.
2254 2262 */
2255 2263 if (bytes_to_pickup < 0)
2256 2264 bytes_to_pickup = 0;
2257 2265 mutex_exit(&mac_srs->srs_bw->mac_bw_lock);
2258 2266 } else {
2259 2267 /*
2260 2268 			 * TODO: Need to change the polling API
2261 2269 * to add a packet count and a flag which
2262 2270 * tells the driver whether we want packets
2263 2271 * based on a count, or bytes, or all the
2264 2272 * packets queued in the driver/HW. This
2265 2273 * way, we never have to check the limits
2266 2274 * on poll path. We truly let only as many
2267 2275 * packets enter the system as we are willing
2268 2276 * to process or queue.
2269 2277 *
2270 2278 * Something along the lines of
2271 2279 * pkts_to_pickup = mac_soft_ring_max_q_cnt -
2272 2280 * mac_srs->srs_poll_pkt_cnt
2273 2281 */
2274 2282
2275 2283 /*
2276 2284 * Since we are not doing B/W control, pick
2277 2285 * as many packets as allowed.
2278 2286 */
2279 2287 bytes_to_pickup = max_bytes_to_pickup;
2280 2288 }
2281 2289
2282 2290 /* Poll the underlying Hardware */
2283 2291 mutex_exit(lock);
2284 2292 head = MAC_HWRING_POLL(mac_srs->srs_ring, (int)bytes_to_pickup);
2285 2293 mutex_enter(lock);
2286 2294
2287 2295 ASSERT((mac_srs->srs_state & SRS_POLL_THR_OWNER) ==
2288 2296 SRS_POLL_THR_OWNER);
2289 2297
2290 2298 mp = tail = head;
2291 2299 count = 0;
2292 2300 sz = 0;
2293 2301 while (mp != NULL) {
2294 2302 tail = mp;
2295 2303 sz += msgdsize(mp);
2296 2304 mp = mp->b_next;
2297 2305 count++;
2298 2306 }
2299 2307
2300 2308 if (head != NULL) {
2301 2309 tail->b_next = NULL;
2302 2310 smcip = mac_srs->srs_mcip;
2303 2311
2304 2312 SRS_RX_STAT_UPDATE(mac_srs, pollbytes, sz);
2305 2313 SRS_RX_STAT_UPDATE(mac_srs, pollcnt, count);
2306 2314
2307 2315 /*
2308 2316 * If there are any promiscuous mode callbacks
2309 2317 * defined for this MAC client, pass them a copy
2310 2318 * if appropriate and also update the counters.
2311 2319 */
2312 2320 if (smcip != NULL) {
2313 2321 if (smcip->mci_mip->mi_promisc_list != NULL) {
2314 2322 mutex_exit(lock);
2315 2323 mac_promisc_dispatch(smcip->mci_mip,
2316 2324 head, NULL);
2317 2325 mutex_enter(lock);
2318 2326 }
2319 2327 }
2320 2328 if (mac_srs->srs_type & SRST_BW_CONTROL) {
2321 2329 mutex_enter(&mac_srs->srs_bw->mac_bw_lock);
2322 2330 mac_srs->srs_bw->mac_bw_polled += sz;
2323 2331 mutex_exit(&mac_srs->srs_bw->mac_bw_lock);
2324 2332 }
2325 2333 MAC_RX_SRS_ENQUEUE_CHAIN(mac_srs, head, tail,
2326 2334 count, sz);
2327 2335 if (count <= 10)
2328 2336 srs_rx->sr_stat.mrs_chaincntundr10++;
2329 2337 else if (count > 10 && count <= 50)
2330 2338 srs_rx->sr_stat.mrs_chaincnt10to50++;
2331 2339 else
2332 2340 srs_rx->sr_stat.mrs_chaincntover50++;
2333 2341 }
2334 2342
2335 2343 /*
2336 2344 * We are guaranteed that SRS_PROC will be set if we
2337 2345 * are here. Also, poll thread gets to run only if
2338 2346 * the drain was being done by a worker thread although
2339 2347 		 * it's possible that the worker thread is still running
2340 2348 * and poll thread was sent down to keep the pipeline
2341 2349 * going instead of doing a complete drain and then
2342 2350 * trying to poll the NIC.
2343 2351 *
2344 2352 * So we need to check SRS_WORKER flag to make sure
2345 2353 * that the worker thread is not processing the queue
2346 2354 * in parallel to us. The flags and conditions are
2347 2355 * protected by the srs_lock to prevent any race. We
2348 2356 * ensure that we don't drop the srs_lock from now
2349 2357 * till the end and similarly we don't drop the srs_lock
2350 2358 		 * in mac_rx_srs_drain() till similar condition checks
2351 2359 		 * are complete. The mac_rx_srs_drain() needs to ensure
2352 2360 		 * that the SRS_WORKER flag remains set as long as it is
2353 2361 		 * processing the queue.
2354 2362 */
2355 2363 if (!(mac_srs->srs_state & SRS_WORKER) &&
2356 2364 (mac_srs->srs_first != NULL)) {
2357 2365 /*
2358 2366 * We have packets to process and worker thread
2359 2367 * is not running. Check to see if poll thread is
2360 2368 * allowed to process.
2361 2369 */
2362 2370 if (mac_srs->srs_state & SRS_LATENCY_OPT) {
2363 2371 mac_srs->srs_drain_func(mac_srs, SRS_POLL_PROC);
2364 2372 if (!(mac_srs->srs_state & SRS_PAUSE) &&
2365 2373 srs_rx->sr_poll_pkt_cnt <=
2366 2374 srs_rx->sr_lowat) {
2367 2375 srs_rx->sr_poll_again++;
2368 2376 goto check_again;
2369 2377 }
2370 2378 /*
2371 2379 				 * We are already above the low water mark
2372 2380 				 * so stay in polling mode, but with no
2373 2381 * need to poll. Once we dip below
2374 2382 * the polling threshold, the processing
2375 2383 * thread (soft ring) will signal us
2376 2384 * to poll again (MAC_UPDATE_SRS_COUNT)
2377 2385 */
2378 2386 srs_rx->sr_poll_drain_no_poll++;
2379 2387 mac_srs->srs_state &= ~(SRS_PROC|SRS_GET_PKTS);
2380 2388 /*
2381 2389 				 * In the B/W control case, it's possible
2382 2390 				 * that the backlog built up due to the
2383 2391 				 * B/W limit being reached and packets
2384 2392 				 * are queued only in the SRS. In this case,
2385 2393 				 * we should schedule the worker thread
2386 2394 * since no one else will wake us up.
2387 2395 */
2388 2396 if ((mac_srs->srs_type & SRST_BW_CONTROL) &&
2389 2397 (mac_srs->srs_tid == NULL)) {
2390 2398 mac_srs->srs_tid =
2391 2399 timeout(mac_srs_fire, mac_srs, 1);
2392 2400 srs_rx->sr_poll_worker_wakeup++;
2393 2401 }
2394 2402 } else {
2395 2403 /*
2396 2404 * Wakeup the worker thread for more processing.
2397 2405 * We optimize for throughput in this case.
2398 2406 */
2399 2407 mac_srs->srs_state &= ~(SRS_PROC|SRS_GET_PKTS);
2400 2408 MAC_SRS_WORKER_WAKEUP(mac_srs);
2401 2409 srs_rx->sr_poll_sig_worker++;
2402 2410 }
2403 2411 } else if ((mac_srs->srs_first == NULL) &&
2404 2412 !(mac_srs->srs_state & SRS_WORKER)) {
2405 2413 /*
2406 2414 			 * There is nothing queued in the SRS and
2407 2415 			 * no worker thread running. Plus we
2408 2416 			 * didn't get anything from the H/W
2409 2417 			 * either (head == NULL).
2410 2418 */
2411 2419 ASSERT(head == NULL);
2412 2420 mac_srs->srs_state &=
2413 2421 ~(SRS_PROC|SRS_GET_PKTS);
2414 2422
2415 2423 /*
2416 2424 			 * If we have packets in the soft ring, don't allow
2417 2425 * more packets to come into this SRS by keeping the
2418 2426 * interrupts off but not polling the H/W. The
2419 2427 * poll thread will get signaled as soon as
2420 2428 * srs_poll_pkt_cnt dips below poll threshold.
2421 2429 */
2422 2430 if (srs_rx->sr_poll_pkt_cnt == 0) {
2423 2431 srs_rx->sr_poll_intr_enable++;
2424 2432 MAC_SRS_POLLING_OFF(mac_srs);
2425 2433 } else {
2426 2434 /*
2427 2435 * We know nothing is queued in SRS
2428 2436 * since we are here after checking
2429 2437 * srs_first is NULL. The backlog
2430 2438 * is entirely due to packets queued
2431 2439 * in Soft ring which will wake us up
2432 2440 * and get the interface out of polling
2433 2441 * mode once the backlog dips below
2434 2442 * sr_poll_thres.
2435 2443 */
2436 2444 srs_rx->sr_poll_no_poll++;
2437 2445 }
2438 2446 } else {
2439 2447 /*
2440 2448 * Worker thread is already running.
2441 2449 * Nothing much to do. If the polling
2442 2450 * was enabled, worker thread will deal
2443 2451 * with that.
2444 2452 */
2445 2453 mac_srs->srs_state &= ~SRS_GET_PKTS;
2446 2454 srs_rx->sr_poll_goto_sleep++;
2447 2455 }
2448 2456 }
2449 2457 done:
2450 2458 mac_srs->srs_state |= SRS_POLL_THR_QUIESCED;
2451 2459 cv_signal(&mac_srs->srs_async);
2452 2460 /*
2453 2461 * If this is a temporary quiesce then wait for the restart signal
2454 2462 * from the srs worker. Then clear the flags and signal the srs worker
2455 2463 * to ensure a positive handshake and go back to start.
2456 2464 */
2457 2465 while (!(mac_srs->srs_state & (SRS_CONDEMNED | SRS_POLL_THR_RESTART)))
2458 2466 cv_wait(async, lock);
2459 2467 if (mac_srs->srs_state & SRS_POLL_THR_RESTART) {
2460 2468 ASSERT(!(mac_srs->srs_state & SRS_CONDEMNED));
2461 2469 mac_srs->srs_state &=
2462 2470 ~(SRS_POLL_THR_QUIESCED | SRS_POLL_THR_RESTART);
2463 2471 cv_signal(&mac_srs->srs_async);
2464 2472 goto start;
2465 2473 } else {
2466 2474 mac_srs->srs_state |= SRS_POLL_THR_EXITED;
2467 2475 cv_signal(&mac_srs->srs_async);
2468 2476 CALLB_CPR_EXIT(&cprinfo);
2469 2477 thread_exit();
2470 2478 }
2471 2479 }
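The byte budget handed to MAC_HWRING_POLL() above reduces to this user-space
sketch; poll_budget() is a hypothetical stand-in for the inline computation:

#include <stdio.h>
#include <sys/types.h>

static ssize_t
poll_budget(int bw_control, ssize_t drop_threshold, ssize_t bytes_queued,
    ssize_t max_pickup)
{
	if (!bw_control)
		return (max_pickup);	/* cf. SRS_BYTES_TO_PICKUP */

	ssize_t budget = drop_threshold - bytes_queued;

	/* Driver byte accounting can overshoot; clamp at zero. */
	return (budget < 0 ? 0 : budget);
}

int
main(void)
{
	printf("%zd\n", poll_budget(0, 0, 0, 150000));	  /* 150000 */
	printf("%zd\n", poll_budget(1, 65536, 60000, 0)); /* 5536 */
	printf("%zd\n", poll_budget(1, 65536, 70000, 0)); /* 0 */
	return (0);
}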
2472 2480
2473 2481 /*
2474 2482 * mac_srs_pick_chain
2475 2483 *
2476 2484 * In Bandwidth control case, checks how many packets can be processed
2477 2485 * and return them in a sub chain.
2478 2486 */
2479 2487 static mblk_t *
2480 2488 mac_srs_pick_chain(mac_soft_ring_set_t *mac_srs, mblk_t **chain_tail,
2481 2489 size_t *chain_sz, int *chain_cnt)
2482 2490 {
2483 2491 mblk_t *head = NULL;
2484 2492 mblk_t *tail = NULL;
2485 2493 size_t sz;
2486 2494 size_t tsz = 0;
2487 2495 int cnt = 0;
2488 2496 mblk_t *mp;
2489 2497
2490 2498 ASSERT(MUTEX_HELD(&mac_srs->srs_lock));
2491 2499 mutex_enter(&mac_srs->srs_bw->mac_bw_lock);
2492 2500 if (((mac_srs->srs_bw->mac_bw_used + mac_srs->srs_size) <=
2493 2501 mac_srs->srs_bw->mac_bw_limit) ||
2494 2502 (mac_srs->srs_bw->mac_bw_limit == 0)) {
2495 2503 mutex_exit(&mac_srs->srs_bw->mac_bw_lock);
2496 2504 head = mac_srs->srs_first;
2497 2505 mac_srs->srs_first = NULL;
2498 2506 *chain_tail = mac_srs->srs_last;
2499 2507 mac_srs->srs_last = NULL;
2500 2508 *chain_sz = mac_srs->srs_size;
2501 2509 *chain_cnt = mac_srs->srs_count;
2502 2510 mac_srs->srs_count = 0;
2503 2511 mac_srs->srs_size = 0;
2504 2512 return (head);
2505 2513 }
2506 2514
2507 2515 /*
2508 2516 * Can't clear the entire backlog.
2509 2517 * Need to find how many packets to pick
2510 2518 */
2511 2519 ASSERT(MUTEX_HELD(&mac_srs->srs_bw->mac_bw_lock));
2512 2520 while ((mp = mac_srs->srs_first) != NULL) {
2513 2521 sz = msgdsize(mp);
2514 2522 if ((tsz + sz + mac_srs->srs_bw->mac_bw_used) >
2515 2523 mac_srs->srs_bw->mac_bw_limit) {
2516 2524 if (!(mac_srs->srs_bw->mac_bw_state & SRS_BW_ENFORCED))
2517 2525 mac_srs->srs_bw->mac_bw_state |=
2518 2526 SRS_BW_ENFORCED;
2519 2527 break;
2520 2528 }
2521 2529
2522 2530 /*
2523 2531 * The _size & cnt is decremented from the softrings
2524 2532 * when they send up the packet for polling to work
2525 2533 * properly.
2526 2534 */
2527 2535 tsz += sz;
2528 2536 cnt++;
2529 2537 mac_srs->srs_count--;
2530 2538 mac_srs->srs_size -= sz;
2531 2539 if (tail != NULL)
2532 2540 tail->b_next = mp;
2533 2541 else
2534 2542 head = mp;
2535 2543 tail = mp;
2536 2544 mac_srs->srs_first = mac_srs->srs_first->b_next;
2537 2545 }
2538 2546 mutex_exit(&mac_srs->srs_bw->mac_bw_lock);
2539 2547 if (mac_srs->srs_first == NULL)
2540 2548 mac_srs->srs_last = NULL;
2541 2549
2542 2550 if (tail != NULL)
2543 2551 tail->b_next = NULL;
2544 2552 *chain_tail = tail;
2545 2553 *chain_cnt = cnt;
2546 2554 *chain_sz = tsz;
2547 2555
2548 2556 return (head);
2549 2557 }
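The budget-limited carve-off reduces to this user-space sketch; pkt_t and
pick_chain() are hypothetical stand-ins for mblk_t and the routine above,
with a plain size field in place of msgdsize():

#include <stdio.h>
#include <stddef.h>

typedef struct pkt {
	struct pkt *b_next;
	size_t size;	/* stands in for msgdsize(mp) */
} pkt_t;

/* Detach packets from *qp until the byte budget is spent. */
static pkt_t *
pick_chain(pkt_t **qp, size_t budget, size_t *picked)
{
	pkt_t *head = NULL, *tail = NULL;
	size_t used = 0;

	while (*qp != NULL && used + (*qp)->size <= budget) {
		pkt_t *mp = *qp;

		*qp = mp->b_next;	/* pop from the queue head */
		used += mp->size;
		mp->b_next = NULL;
		if (tail != NULL)
			tail->b_next = mp;
		else
			head = mp;
		tail = mp;
	}
	*picked = used;
	return (head);	/* process this sub-chain; the rest stays queued */
}

int
main(void)
{
	pkt_t p[3] = { { &p[1], 600 }, { &p[2], 600 }, { NULL, 600 } };
	pkt_t *q = &p[0];
	size_t used;
	pkt_t *chain = pick_chain(&q, 1500, &used);

	/* Two packets (1200 bytes) picked; one stays queued. */
	printf("picked=%zu left=%zu\n", used, q != NULL ? q->size : 0);
	(void) chain;
	return (0);
}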
2550 2558
2551 2559 /*
2552 2560 * mac_rx_srs_drain
2553 2561 *
2554 2562 * The SRS drain routine. Gets to run to clear the queue. Any thread
2555 2563  * (worker, interrupt, poll) can call this based on the processing model.
2556 2564  * The first thing we do is disable interrupts if possible and then
2557 2565  * drain the queue. We also try to poll the underlying hardware if
2558 2566  * there is a dedicated hardware Rx ring assigned to this SRS.
2559 2567  *
2560 2568  * There is an equivalent drain routine in bandwidth control mode,
2561 2569 * mac_rx_srs_drain_bw. There is some code duplication between the two
2562 2570 * routines but they are highly performance sensitive and are easier
2563 2571 * to read/debug if they stay separate. Any code changes here might
2564 2572 * also apply to mac_rx_srs_drain_bw as well.
2565 2573 */
2566 2574 void
2567 2575 mac_rx_srs_drain(mac_soft_ring_set_t *mac_srs, uint_t proc_type)
2568 2576 {
2569 2577 mblk_t *head;
2570 2578 mblk_t *tail;
2571 2579 timeout_id_t tid;
2572 2580 int cnt = 0;
2573 2581 mac_client_impl_t *mcip = mac_srs->srs_mcip;
2574 2582 mac_srs_rx_t *srs_rx = &mac_srs->srs_rx;
2575 2583
2576 2584 ASSERT(MUTEX_HELD(&mac_srs->srs_lock));
2577 2585 ASSERT(!(mac_srs->srs_type & SRST_BW_CONTROL));
2578 2586
2579 2587 /* If we are blanked i.e. can't do upcalls, then we are done */
2580 2588 if (mac_srs->srs_state & (SRS_BLANK | SRS_PAUSE)) {
2581 2589 ASSERT((mac_srs->srs_type & SRST_NO_SOFT_RINGS) ||
2582 2590 (mac_srs->srs_state & SRS_PAUSE));
2583 2591 goto out;
2584 2592 }
2585 2593
2586 2594 if (mac_srs->srs_first == NULL)
2587 2595 goto out;
2588 2596
2589 2597 if (!(mac_srs->srs_state & SRS_LATENCY_OPT) &&
2590 2598 (srs_rx->sr_poll_pkt_cnt <= srs_rx->sr_lowat)) {
2591 2599 /*
2592 2600 * In the normal case, the SRS worker thread does no
2593 2601 * work and we wait for a backlog to build up before
2594 2602 * we switch into polling mode. In case we are
2595 2603 * optimizing for throughput, we use the worker thread
2596 2604 		 * as well. The goal is to let the worker thread process
2597 2605 		 * the queue and the poll thread feed packets into
2598 2606 * the queue. As such, we should signal the poll
2599 2607 * thread to try and get more packets.
2600 2608 *
2601 2609 * We could have pulled this check in the POLL_RING
2602 2610 		 * macro itself, but keeping it explicit here makes
2603 2611 		 * the architecture easier to understand.
2604 2612 */
2605 2613 MAC_SRS_POLL_RING(mac_srs);
2606 2614 }
2607 2615
2608 2616 again:
2609 2617 head = mac_srs->srs_first;
2610 2618 mac_srs->srs_first = NULL;
2611 2619 tail = mac_srs->srs_last;
2612 2620 mac_srs->srs_last = NULL;
2613 2621 cnt = mac_srs->srs_count;
2614 2622 mac_srs->srs_count = 0;
2615 2623
2616 2624 ASSERT(head != NULL);
2617 2625 ASSERT(tail != NULL);
2618 2626
2619 2627 if ((tid = mac_srs->srs_tid) != NULL)
2620 2628 mac_srs->srs_tid = NULL;
2621 2629
2622 2630 mac_srs->srs_state |= (SRS_PROC|proc_type);
2623 2631
2624 -
2625 2632 /*
2626 2633 * mcip is NULL for broadcast and multicast flows. The promisc
2627 2634 * callbacks for broadcast and multicast packets are delivered from
2628 2635 * mac_rx() and we don't need to worry about that case in this path
2629 2636 */
2630 2637 if (mcip != NULL) {
2631 2638 if (mcip->mci_promisc_list != NULL) {
2632 2639 mutex_exit(&mac_srs->srs_lock);
2633 2640 mac_promisc_client_dispatch(mcip, head);
2634 2641 mutex_enter(&mac_srs->srs_lock);
2635 2642 }
2636 2643 if (MAC_PROTECT_ENABLED(mcip, MPT_IPNOSPOOF)) {
2637 2644 mutex_exit(&mac_srs->srs_lock);
2638 2645 mac_protect_intercept_dynamic(mcip, head);
2639 2646 mutex_enter(&mac_srs->srs_lock);
2640 2647 }
2641 2648 }
2642 2649
2643 2650 /*
2644 - * Check if SRS itself is doing the processing
2645 - * This direct path does not apply when subflows are present. In this
2646 - * case, packets need to be dispatched to a soft ring according to the
2647 - * flow's bandwidth and other resources contraints.
2651 + * Check if SRS itself is doing the processing. This direct
2652 + * path applies only when subflows are present.
2648 2653 */
2649 2654 if (mac_srs->srs_type & SRST_NO_SOFT_RINGS) {
2650 2655 mac_direct_rx_t proc;
2651 2656 void *arg1;
2652 2657 mac_resource_handle_t arg2;
2653 2658
2654 2659 /*
2655 2660 * This is the case when a Rx is directly
2656 2661 * assigned and we have a fully classified
2657 2662 * protocol chain. We can deal with it in
2658 2663 * one shot.
2659 2664 */
2660 2665 proc = srs_rx->sr_func;
2661 2666 arg1 = srs_rx->sr_arg1;
2662 2667 arg2 = srs_rx->sr_arg2;
2663 2668
2664 2669 mac_srs->srs_state |= SRS_CLIENT_PROC;
2665 2670 mutex_exit(&mac_srs->srs_lock);
2666 2671 if (tid != NULL) {
2667 2672 (void) untimeout(tid);
2668 2673 tid = NULL;
2669 2674 }
2670 2675
2671 2676 proc(arg1, arg2, head, NULL);
2672 2677 /*
2673 2678 * Decrement the size and count here itself
2674 2679 * since the packet has been processed.
2675 2680 */
2676 2681 mutex_enter(&mac_srs->srs_lock);
2677 2682 MAC_UPDATE_SRS_COUNT_LOCKED(mac_srs, cnt);
2678 2683 if (mac_srs->srs_state & SRS_CLIENT_WAIT)
2679 2684 cv_signal(&mac_srs->srs_client_cv);
2680 2685 mac_srs->srs_state &= ~SRS_CLIENT_PROC;
2681 2686 } else {
2682 2687 /* Some kind of softrings based fanout is required */
2683 2688 mutex_exit(&mac_srs->srs_lock);
2684 2689 if (tid != NULL) {
2685 2690 (void) untimeout(tid);
2686 2691 tid = NULL;
2687 2692 }
2688 2693
2689 2694 /*
2690 2695 * Since the fanout routines can deal with chains,
2691 2696 * shoot the entire chain up.
2692 2697 */
2693 2698 if (mac_srs->srs_type & SRST_FANOUT_SRC_IP)
2694 2699 mac_rx_srs_fanout(mac_srs, head);
2695 2700 else
2696 2701 mac_rx_srs_proto_fanout(mac_srs, head);
2697 2702 mutex_enter(&mac_srs->srs_lock);
2698 2703 }
2699 2704
2700 2705 if (!(mac_srs->srs_state & (SRS_BLANK|SRS_PAUSE)) &&
2701 2706 (mac_srs->srs_first != NULL)) {
2702 2707 /*
2703 2708 * More packets arrived while we were clearing the
2704 2709 * SRS. This is possible because of one of
2705 2710 * three conditions below:
2706 2711 * 1) The driver is using multiple worker threads
2707 2712 * to send the packets to us.
2708 2713 * 2) The driver has a race in switching
2709 2714 * between interrupt and polling mode or
2710 2715 * 3) Packets are arriving in this SRS via the
2711 2716 * S/W classification as well.
2712 2717 *
2713 2718 * We should switch to polling mode and see if we
2714 2719 * need to send the poll thread down. Also, signal
2715 2720 * the worker thread to process what's just arrived.
2716 2721 */
2717 2722 MAC_SRS_POLLING_ON(mac_srs);
2718 2723 if (srs_rx->sr_poll_pkt_cnt <= srs_rx->sr_lowat) {
2719 2724 srs_rx->sr_drain_poll_sig++;
2720 2725 MAC_SRS_POLL_RING(mac_srs);
2721 2726 }
2722 2727
2723 2728 /*
2724 2729 * If we didn't signal the poll thread, we need
2725 2730 * to deal with the pending packets ourselves.
2726 2731 */
2727 2732 if (proc_type == SRS_WORKER) {
2728 2733 srs_rx->sr_drain_again++;
2729 2734 goto again;
2730 2735 } else {
2731 2736 srs_rx->sr_drain_worker_sig++;
2732 2737 cv_signal(&mac_srs->srs_async);
2733 2738 }
2734 2739 }
2735 2740
2736 2741 out:
2737 2742 if (mac_srs->srs_state & SRS_GET_PKTS) {
2738 2743 /*
2739 2744 * Poll thread is already running. Leave the
2740 2745 * SRS_PROC set and hand over control to the
2741 2746 * poll thread.
2742 2747 */
2743 2748 mac_srs->srs_state &= ~proc_type;
2744 2749 srs_rx->sr_drain_poll_running++;
2745 2750 return;
2746 2751 }
2747 2752
2748 2753 /*
2749 2754 * Even if there are no packets queued in SRS, we
2750 2755 * need to make sure that the shared counter is
2751 2756 * clear and any associated softrings have cleared
2752 2757 * all the backlog. Otherwise, leave the interface
2753 2758 * in polling mode and the poll thread will get
2754 2759 * signalled once the count goes down to zero.
2755 2760 *
2756 2761 * If someone is already draining the queue (SRS_PROC is
2757 2762 * set) when the srs_poll_pkt_cnt goes down to zero,
2758 2763 * then it means that drain is already running and we
2759 2764 * will turn off polling at that time if there is
2760 2765 * no backlog.
2761 2766 *
2762 2767 * As long as there are packets queued either
2763 2768 * in soft ring set or its soft rings, we will leave
2764 2769 * the interface in polling mode (even if the drain
2765 2770 * was done by the interrupt thread). We signal
2766 2771 * the poll thread as well if we have dipped below
2767 2772 * low water mark.
2768 2773 *
2769 2774 * NOTE: We can't use the MAC_SRS_POLLING_ON macro
2770 2775 * since that turns polling on only for the worker thread.
2771 2776 * It's not worth turning polling on for the interrupt
2772 2777 * thread (since NIC will not issue another interrupt)
2773 2778 * unless a backlog builds up.
2774 2779 */
2775 2780 if ((srs_rx->sr_poll_pkt_cnt > 0) &&
2776 2781 (mac_srs->srs_state & SRS_POLLING_CAPAB)) {
2777 2782 mac_srs->srs_state &= ~(SRS_PROC|proc_type);
2778 2783 srs_rx->sr_drain_keep_polling++;
2779 2784 MAC_SRS_POLLING_ON(mac_srs);
2780 2785 if (srs_rx->sr_poll_pkt_cnt <= srs_rx->sr_lowat)
2781 2786 MAC_SRS_POLL_RING(mac_srs);
2782 2787 return;
2783 2788 }
2784 2789
2785 2790 /* Nothing else to do. Get out of poll mode */
2786 2791 MAC_SRS_POLLING_OFF(mac_srs);
2787 2792 mac_srs->srs_state &= ~(SRS_PROC|proc_type);
2788 2793 srs_rx->sr_drain_finish_intr++;
2789 2794 }
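
Editor's sketch (hypothetical names, not from the webrev): the tail of mac_rx_srs_drain() decides whether to stay in poll mode on the way out. A minimal model of that decision, with srs_model standing in for the real SRS state flags:

    struct srs_model {
        int poll_pkt_cnt;       /* packets outstanding in SRS/softrings */
        int lowat;
        int polling_capable;
        int polling_on;
        int poll_signals;       /* times the poll thread was kicked */
    };

    /* Mirror of the exit logic: stay in poll mode while packets remain. */
    static void
    drain_exit(struct srs_model *s)
    {
        if (s->poll_pkt_cnt > 0 && s->polling_capable) {
            s->polling_on = 1;              /* keep the NIC polled */
            if (s->poll_pkt_cnt <= s->lowat)
                s->poll_signals++;          /* wake the poll thread */
            return;
        }
        s->polling_on = 0;                  /* back to interrupt mode */
    }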
2790 2795
2791 2796 /*
2792 2797 * mac_rx_srs_drain_bw
2793 2798 *
2794 2799 * The SRS BW drain routine. Gets to run to clear the queue. Any thread
2795 2800 * (worker, interrupt, poll) can call this based on processing model.
2796 2801 * The first thing we do is disable interrupts if possible and then
2797 2802 * drain the queue. We also try to poll the underlying hardware if
2798 2803 * there is a dedicated hardware Rx ring assigned to this SRS.
2799 2804 *
2800 2805 * There is an equivalent drain routine in non-bandwidth control mode,
2801 2806 * mac_rx_srs_drain. There is some code duplication between the two
2802 2807 * routines but they are highly performance sensitive and are easier
2803 2808 * to read/debug if they stay separate. Any code changes here might
2804 2809 * also apply to mac_rx_srs_drain as well.
2805 2810 */
2806 2811 void
2807 2812 mac_rx_srs_drain_bw(mac_soft_ring_set_t *mac_srs, uint_t proc_type)
2808 2813 {
2809 2814 mblk_t *head;
2810 2815 mblk_t *tail;
2811 2816 timeout_id_t tid;
2812 2817 size_t sz = 0;
2813 2818 int cnt = 0;
2814 2819 mac_client_impl_t *mcip = mac_srs->srs_mcip;
2815 2820 mac_srs_rx_t *srs_rx = &mac_srs->srs_rx;
2816 2821 clock_t now;
2817 2822
2818 2823 ASSERT(MUTEX_HELD(&mac_srs->srs_lock));
2819 2824 ASSERT(mac_srs->srs_type & SRST_BW_CONTROL);
2820 2825 again:
2821 2826 /* Check if we are doing B/W control */
2822 2827 mutex_enter(&mac_srs->srs_bw->mac_bw_lock);
2823 2828 now = ddi_get_lbolt();
2824 2829 if (mac_srs->srs_bw->mac_bw_curr_time != now) {
2825 2830 mac_srs->srs_bw->mac_bw_curr_time = now;
2826 2831 mac_srs->srs_bw->mac_bw_used = 0;
2827 2832 if (mac_srs->srs_bw->mac_bw_state & SRS_BW_ENFORCED)
2828 2833 mac_srs->srs_bw->mac_bw_state &= ~SRS_BW_ENFORCED;
2829 2834 } else if (mac_srs->srs_bw->mac_bw_state & SRS_BW_ENFORCED) {
2830 2835 mutex_exit(&mac_srs->srs_bw->mac_bw_lock);
2831 2836 goto done;
2832 2837 } else if (mac_srs->srs_bw->mac_bw_used >
2833 2838 mac_srs->srs_bw->mac_bw_limit) {
2834 2839 mac_srs->srs_bw->mac_bw_state |= SRS_BW_ENFORCED;
2835 2840 mutex_exit(&mac_srs->srs_bw->mac_bw_lock);
2836 2841 goto done;
2837 2842 }
2838 2843 mutex_exit(&mac_srs->srs_bw->mac_bw_lock);
2839 2844
2840 2845 /* If we are blanked i.e. can't do upcalls, then we are done */
2841 2846 if (mac_srs->srs_state & (SRS_BLANK | SRS_PAUSE)) {
2842 2847 ASSERT((mac_srs->srs_type & SRST_NO_SOFT_RINGS) ||
2843 2848 (mac_srs->srs_state & SRS_PAUSE));
2844 2849 goto done;
2845 2850 }
2846 2851
2847 2852 sz = 0;
2848 2853 cnt = 0;
2849 2854 if ((head = mac_srs_pick_chain(mac_srs, &tail, &sz, &cnt)) == NULL) {
2850 2855 /*
2851 2856 * We couldn't pick up a single packet.
2852 2857 */
2853 2858 mutex_enter(&mac_srs->srs_bw->mac_bw_lock);
2854 2859 if ((mac_srs->srs_bw->mac_bw_used == 0) &&
2855 2860 (mac_srs->srs_size != 0) &&
2856 2861 !(mac_srs->srs_bw->mac_bw_state & SRS_BW_ENFORCED)) {
2857 2862 /*
2858 2863 * Seems like configured B/W doesn't
2859 2864 * even allow processing of 1 packet
2860 2865 * per tick.
2861 2866 *
2862 2867 * XXX: raise the limit to processing
2863 2868 * at least 1 packet per tick.
2864 2869 */
2865 2870 mac_srs->srs_bw->mac_bw_limit +=
2866 2871 mac_srs->srs_bw->mac_bw_limit;
2867 2872 mac_srs->srs_bw->mac_bw_drop_threshold +=
2868 2873 mac_srs->srs_bw->mac_bw_drop_threshold;
2869 2874 cmn_err(CE_NOTE, "mac_rx_srs_drain: srs(%p) "
2870 2875 "raised B/W limit to %d since not even a "
2871 2876 "single packet can be processed per "
2872 2877 "tick %d\n", (void *)mac_srs,
2873 2878 (int)mac_srs->srs_bw->mac_bw_limit,
2874 2879 (int)msgdsize(mac_srs->srs_first));
2875 2880 }
2876 2881 mutex_exit(&mac_srs->srs_bw->mac_bw_lock);
2877 2882 goto done;
2878 2883 }
2879 2884
2880 2885 ASSERT(head != NULL);
2881 2886 ASSERT(tail != NULL);
2882 2887
2883 2888 /* zero bandwidth: drop all and return to interrupt mode */
2884 2889 mutex_enter(&mac_srs->srs_bw->mac_bw_lock);
2885 2890 if (mac_srs->srs_bw->mac_bw_limit == 0) {
2886 2891 srs_rx->sr_stat.mrs_sdrops += cnt;
2887 2892 ASSERT(mac_srs->srs_bw->mac_bw_sz >= sz);
2888 2893 mac_srs->srs_bw->mac_bw_sz -= sz;
2889 2894 mac_srs->srs_bw->mac_bw_drop_bytes += sz;
2890 2895 mutex_exit(&mac_srs->srs_bw->mac_bw_lock);
2891 2896 mac_pkt_drop(NULL, NULL, head, B_FALSE);
2892 2897 goto leave_poll;
2893 2898 } else {
2894 2899 mutex_exit(&mac_srs->srs_bw->mac_bw_lock);
2895 2900 }
2896 2901
2897 2902 if ((tid = mac_srs->srs_tid) != NULL)
2898 2903 mac_srs->srs_tid = NULL;
2899 2904
2900 2905 mac_srs->srs_state |= (SRS_PROC|proc_type);
2901 2906 MAC_SRS_WORKER_POLLING_ON(mac_srs);
2902 2907
2903 2908 /*
2904 2909 * mcip is NULL for broadcast and multicast flows. The promisc
2905 2910 * callbacks for broadcast and multicast packets are delivered from
2906 2911 * mac_rx() and we don't need to worry about that case in this path
2907 2912 */
2908 2913 if (mcip != NULL) {
2909 2914 if (mcip->mci_promisc_list != NULL) {
2910 2915 mutex_exit(&mac_srs->srs_lock);
2911 2916 mac_promisc_client_dispatch(mcip, head);
2912 2917 mutex_enter(&mac_srs->srs_lock);
2913 2918 }
2914 2919 if (MAC_PROTECT_ENABLED(mcip, MPT_IPNOSPOOF)) {
2915 2920 mutex_exit(&mac_srs->srs_lock);
2916 2921 mac_protect_intercept_dynamic(mcip, head);
2917 2922 mutex_enter(&mac_srs->srs_lock);
2918 2923 }
2919 2924 }
2920 2925
2921 2926 /*
2922 2927 * Check if SRS itself is doing the processing
2923 2928 * This direct path does not apply when subflows are present. In this
2924 2929 * case, packets need to be dispatched to a soft ring according to the
2925 2930 * flow's bandwidth and other resource constraints.
2926 2931 */
2927 2932 if (mac_srs->srs_type & SRST_NO_SOFT_RINGS) {
2928 2933 mac_direct_rx_t proc;
2929 2934 void *arg1;
2930 2935 mac_resource_handle_t arg2;
2931 2936
2932 2937 /*
2933 2938 * This is the case when a Rx is directly
2934 2939 * assigned and we have a fully classified
2935 2940 * protocol chain. We can deal with it in
2936 2941 * one shot.
2937 2942 */
2938 2943 proc = srs_rx->sr_func;
2939 2944 arg1 = srs_rx->sr_arg1;
2940 2945 arg2 = srs_rx->sr_arg2;
2941 2946
2942 2947 mac_srs->srs_state |= SRS_CLIENT_PROC;
2943 2948 mutex_exit(&mac_srs->srs_lock);
2944 2949 if (tid != NULL) {
2945 2950 (void) untimeout(tid);
2946 2951 tid = NULL;
2947 2952 }
2948 2953
2949 2954 proc(arg1, arg2, head, NULL);
2950 2955 /*
2951 2956 * Decrement the size and count here itself
2952 2957 * since the packet has been processed.
2953 2958 */
2954 2959 mutex_enter(&mac_srs->srs_lock);
2955 2960 MAC_UPDATE_SRS_COUNT_LOCKED(mac_srs, cnt);
2956 2961 MAC_UPDATE_SRS_SIZE_LOCKED(mac_srs, sz);
2957 2962
2958 2963 if (mac_srs->srs_state & SRS_CLIENT_WAIT)
2959 2964 cv_signal(&mac_srs->srs_client_cv);
2960 2965 mac_srs->srs_state &= ~SRS_CLIENT_PROC;
2961 2966 } else {
2962 2967 /* Some kind of softrings based fanout is required */
2963 2968 mutex_exit(&mac_srs->srs_lock);
2964 2969 if (tid != NULL) {
2965 2970 (void) untimeout(tid);
2966 2971 tid = NULL;
2967 2972 }
2968 2973
2969 2974 /*
2970 2975 * Since the fanout routines can deal with chains,
2971 2976 * shoot the entire chain up.
2972 2977 */
2973 2978 if (mac_srs->srs_type & SRST_FANOUT_SRC_IP)
2974 2979 mac_rx_srs_fanout(mac_srs, head);
2975 2980 else
2976 2981 mac_rx_srs_proto_fanout(mac_srs, head);
2977 2982 mutex_enter(&mac_srs->srs_lock);
2978 2983 }
2979 2984
2980 2985 /*
2981 2986 * Send the poll thread to pick up any packets arrived
2982 2987 * so far. This also serves as the last check in case
2983 2988 * nothing else is queued in the SRS. The poll thread
2984 2989 * is signalled only in the case the drain was done
2985 2990 * by the worker thread and SRS_WORKER is set. The
2986 2991 * worker thread can run in parallel as long as the
2987 2992 * SRS_WORKER flag is set. When we have nothing else to
2988 2993 * process, we can exit while leaving SRS_PROC set
2989 2994 * which gives the poll thread control to process and
2990 2995 * cleanup once it returns from the NIC.
2991 2996 *
2992 2997 * If we have nothing else to process, we need to
2993 2998 * ensure that we keep holding the srs_lock till
2994 2999 * all the checks below are done and control is
2995 3000 * handed to the poll thread if it was running.
2996 3001 */
2997 3002 mutex_enter(&mac_srs->srs_bw->mac_bw_lock);
2998 3003 if (!(mac_srs->srs_bw->mac_bw_state & SRS_BW_ENFORCED)) {
2999 3004 if (mac_srs->srs_first != NULL) {
3000 3005 if (proc_type == SRS_WORKER) {
3001 3006 mutex_exit(&mac_srs->srs_bw->mac_bw_lock);
3002 3007 if (srs_rx->sr_poll_pkt_cnt <=
3003 3008 srs_rx->sr_lowat)
3004 3009 MAC_SRS_POLL_RING(mac_srs);
3005 3010 goto again;
3006 3011 } else {
3007 3012 cv_signal(&mac_srs->srs_async);
3008 3013 }
3009 3014 }
3010 3015 }
3011 3016 mutex_exit(&mac_srs->srs_bw->mac_bw_lock);
3012 3017
3013 3018 done:
3014 3019
3015 3020 if (mac_srs->srs_state & SRS_GET_PKTS) {
3016 3021 /*
3017 3022 * Poll thread is already running. Leave the
3018 3023 * SRS_PROC set and hand over control to the
3019 3024 * poll thread.
3020 3025 */
3021 3026 mac_srs->srs_state &= ~proc_type;
3022 3027 return;
3023 3028 }
3024 3029
3025 3030 /*
3026 3031 * If we can't process packets because we have exceeded
3027 3032 * B/W limit for this tick, just set the timeout
3028 3033 * and leave.
3029 3034 *
3030 3035 * Even if there are no packets queued in SRS, we
3031 3036 * need to make sure that the shared counter is
3032 3037 * clear and any associated softrings have cleared
3033 3038 * all the backlog. Otherwise, leave the interface
3034 3039 * in polling mode and the poll thread will get
3035 3040 * signalled once the count goes down to zero.
3036 3041 *
3037 3042 * If someone is already draining the queue (SRS_PROC is
3038 3043 * set) when the srs_poll_pkt_cnt goes down to zero,
3039 3044 * then it means that drain is already running and we
3040 3045 * will turn off polling at that time if there is
3041 3046 * no backlog. As long as there are packets queued either
3042 3047 * in soft ring set or its soft rings, we will leave
3043 3048 * the interface in polling mode.
3044 3049 */
3045 3050 mutex_enter(&mac_srs->srs_bw->mac_bw_lock);
3046 3051 if ((mac_srs->srs_state & SRS_POLLING_CAPAB) &&
3047 3052 ((mac_srs->srs_bw->mac_bw_state & SRS_BW_ENFORCED) ||
3048 3053 (srs_rx->sr_poll_pkt_cnt > 0))) {
3049 3054 MAC_SRS_POLLING_ON(mac_srs);
3050 3055 mac_srs->srs_state &= ~(SRS_PROC|proc_type);
3051 3056 if ((mac_srs->srs_first != NULL) &&
3052 3057 (mac_srs->srs_tid == NULL))
3053 3058 mac_srs->srs_tid = timeout(mac_srs_fire,
3054 3059 mac_srs, 1);
3055 3060 mutex_exit(&mac_srs->srs_bw->mac_bw_lock);
3056 3061 return;
3057 3062 }
3058 3063 mutex_exit(&mac_srs->srs_bw->mac_bw_lock);
3059 3064
3060 3065 leave_poll:
3061 3066
3062 3067 /* Nothing else to do. Get out of poll mode */
3063 3068 MAC_SRS_POLLING_OFF(mac_srs);
3064 3069 mac_srs->srs_state &= ~(SRS_PROC|proc_type);
3065 3070 }
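
Editor's sketch (not from the webrev): the drain routine above meters traffic against a per-tick byte budget keyed off ddi_get_lbolt(). A user-space model of that window, with bw_ctl and bw_check() as hypothetical stand-ins for the mac_bw_ctl_t fields:

    #include <stdbool.h>
    #include <stddef.h>

    struct bw_ctl {
        long    bw_curr_tick;   /* last tick observed */
        size_t  bw_used;        /* bytes consumed within the tick */
        size_t  bw_limit;       /* bytes allowed per tick */
        bool    bw_enforced;
    };

    /* Returns true when the caller may keep processing in this tick. */
    static bool
    bw_check(struct bw_ctl *bw, long now_tick)
    {
        if (bw->bw_curr_tick != now_tick) {
            bw->bw_curr_tick = now_tick;    /* new tick: reset the window */
            bw->bw_used = 0;
            bw->bw_enforced = false;
        } else if (bw->bw_enforced || bw->bw_used > bw->bw_limit) {
            bw->bw_enforced = true;         /* budget gone until next tick */
            return (false);
        }
        return (true);
    }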
3066 3071
3067 3072 /*
3068 3073 * mac_srs_worker
3069 3074 *
3070 3075 * The SRS worker routine. Drains the queue when no one else is
3071 3076 * processing it.
3072 3077 */
3073 3078 void
3074 3079 mac_srs_worker(mac_soft_ring_set_t *mac_srs)
3075 3080 {
3076 3081 kmutex_t *lock = &mac_srs->srs_lock;
3077 3082 kcondvar_t *async = &mac_srs->srs_async;
3078 3083 callb_cpr_t cprinfo;
3079 3084 boolean_t bw_ctl_flag;
3080 3085
3081 3086 CALLB_CPR_INIT(&cprinfo, lock, callb_generic_cpr, "srs_worker");
3082 3087 mutex_enter(lock);
3083 3088
3084 3089 start:
3085 3090 for (;;) {
3086 3091 bw_ctl_flag = B_FALSE;
3087 3092 if (mac_srs->srs_type & SRST_BW_CONTROL) {
3088 3093 MAC_SRS_BW_LOCK(mac_srs);
3089 3094 MAC_SRS_CHECK_BW_CONTROL(mac_srs);
3090 3095 if (mac_srs->srs_bw->mac_bw_state & SRS_BW_ENFORCED)
3091 3096 bw_ctl_flag = B_TRUE;
3092 3097 MAC_SRS_BW_UNLOCK(mac_srs);
3093 3098 }
3094 3099 /*
3095 3100 * The SRS_BW_ENFORCED flag may change since we have dropped
3096 3101 * the mac_bw_lock. However the drain function can handle both
3097 3102 * a drainable SRS or a bandwidth controlled SRS, and the
3098 3103 * effect of scheduling a timeout is to wakeup the worker
3099 3104 * thread which in turn will call the drain function. Since
3100 3105 * we release the srs_lock atomically only in the cv_wait, there
3101 3106 * isn't a fear of waiting forever.
3102 3107 */
3103 3108 while (((mac_srs->srs_state & SRS_PROC) ||
3104 3109 (mac_srs->srs_first == NULL) || bw_ctl_flag ||
3105 3110 (mac_srs->srs_state & SRS_TX_BLOCKED)) &&
3106 3111 !(mac_srs->srs_state & SRS_PAUSE)) {
3107 3112 /*
3108 3113 * If we have packets queued and we are here
3109 3114 * because B/W control is in place, we better
3110 3115 * schedule the worker wakeup after 1 tick
3111 3116 * to see if bandwidth control can be relaxed.
3112 3117 */
3113 3118 if (bw_ctl_flag && mac_srs->srs_tid == NULL) {
3114 3119 /*
3115 3120 * We need to ensure that a timer is already
3116 3121 * scheduled or we force schedule one for
3117 3122 * later so that we can continue processing
3118 3123 * after this quanta is over.
3119 3124 */
3120 3125 mac_srs->srs_tid = timeout(mac_srs_fire,
3121 3126 mac_srs, 1);
3122 3127 }
3123 3128 wait:
3124 3129 CALLB_CPR_SAFE_BEGIN(&cprinfo);
3125 3130 cv_wait(async, lock);
3126 3131 CALLB_CPR_SAFE_END(&cprinfo, lock);
3127 3132
3128 3133 if (mac_srs->srs_state & SRS_PAUSE)
3129 3134 goto done;
3130 3135 if (mac_srs->srs_state & SRS_PROC)
3131 3136 goto wait;
3132 3137
3133 3138 if (mac_srs->srs_first != NULL &&
3134 3139 mac_srs->srs_type & SRST_BW_CONTROL) {
3135 3140 MAC_SRS_BW_LOCK(mac_srs);
3136 3141 if (mac_srs->srs_bw->mac_bw_state &
3137 3142 SRS_BW_ENFORCED) {
3138 3143 MAC_SRS_CHECK_BW_CONTROL(mac_srs);
3139 3144 }
3140 3145 bw_ctl_flag = mac_srs->srs_bw->mac_bw_state &
3141 3146 SRS_BW_ENFORCED;
3142 3147 MAC_SRS_BW_UNLOCK(mac_srs);
3143 3148 }
3144 3149 }
3145 3150
3146 3151 if (mac_srs->srs_state & SRS_PAUSE)
3147 3152 goto done;
3148 3153 mac_srs->srs_drain_func(mac_srs, SRS_WORKER);
3149 3154 }
3150 3155 done:
3151 3156 /*
3152 3157 * The Rx SRS quiesce logic first cuts off packet supply to the SRS
3153 3158 * from both hard and soft classifications and waits for such threads
3154 3159 * to finish before signaling the worker. So at this point the only
3155 3160 * thread left that could be competing with the worker is the poll
3156 3161 * thread. In the case of Tx, there shouldn't be any thread holding
3157 3162 * SRS_PROC at this point.
3158 3163 */
3159 3164 if (!(mac_srs->srs_state & SRS_PROC)) {
3160 3165 mac_srs->srs_state |= SRS_PROC;
3161 3166 } else {
3162 3167 ASSERT((mac_srs->srs_type & SRST_TX) == 0);
3163 3168 /*
3164 3169 * Poll thread still owns the SRS and is still running
3165 3170 */
3166 3171 ASSERT((mac_srs->srs_poll_thr == NULL) ||
3167 3172 ((mac_srs->srs_state & SRS_POLL_THR_OWNER) ==
3168 3173 SRS_POLL_THR_OWNER));
3169 3174 }
3170 3175 mac_srs_worker_quiesce(mac_srs);
3171 3176 /*
3172 3177 * Wait for the SRS_RESTART or SRS_CONDEMNED signal from the initiator
3173 3178 * of the quiesce operation
3174 3179 */
3175 3180 while (!(mac_srs->srs_state & (SRS_CONDEMNED | SRS_RESTART)))
3176 3181 cv_wait(&mac_srs->srs_async, &mac_srs->srs_lock);
3177 3182
3178 3183 if (mac_srs->srs_state & SRS_RESTART) {
3179 3184 ASSERT(!(mac_srs->srs_state & SRS_CONDEMNED));
3180 3185 mac_srs_worker_restart(mac_srs);
3181 3186 mac_srs->srs_state &= ~SRS_PROC;
3182 3187 goto start;
3183 3188 }
3184 3189
3185 3190 if (!(mac_srs->srs_state & SRS_CONDEMNED_DONE))
3186 3191 mac_srs_worker_quiesce(mac_srs);
3187 3192
3188 3193 mac_srs->srs_state &= ~SRS_PROC;
3189 3194 /* The macro drops the srs_lock */
3190 3195 CALLB_CPR_EXIT(&cprinfo);
3191 3196 thread_exit();
3192 3197 }
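
Editor's sketch (not from the webrev): the worker above is a classic condition-variable loop: sleep while someone else holds SRS_PROC or the queue is empty, drain otherwise. A minimal POSIX-threads model of the same shape, with CPR callbacks and bandwidth checks omitted; all names are hypothetical:

    #include <pthread.h>
    #include <stdbool.h>
    #include <stddef.h>

    struct worker {
        pthread_mutex_t w_lock;
        pthread_cond_t  w_async;
        bool            w_proc;     /* someone already draining */
        bool            w_pause;
        void            *w_first;   /* queue head, NULL when empty */
    };

    static void *
    worker_loop(void *arg)
    {
        struct worker *w = arg;

        pthread_mutex_lock(&w->w_lock);
        for (;;) {
            while ((w->w_proc || w->w_first == NULL) && !w->w_pause)
                pthread_cond_wait(&w->w_async, &w->w_lock);
            if (w->w_pause)
                break;
            w->w_proc = true;
            /* the drain function would run here, as in the SRS */
            w->w_proc = false;
        }
        pthread_mutex_unlock(&w->w_lock);
        return (NULL);
    }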
3193 3198
3194 3199 /*
3195 3200 * mac_rx_srs_subflow_process
3196 3201 *
3197 3202 * Receive side routine called from interrupt path when there are
3198 3203 * sub flows present on this SRS.
3199 3204 */
3200 3205 /* ARGSUSED */
3201 3206 void
3202 3207 mac_rx_srs_subflow_process(void *arg, mac_resource_handle_t srs,
3203 3208 mblk_t *mp_chain, boolean_t loopback)
3204 3209 {
3205 3210 flow_entry_t *flent = NULL;
3206 3211 flow_entry_t *prev_flent = NULL;
3207 3212 mblk_t *mp = NULL;
3208 3213 mblk_t *tail = NULL;
3209 3214 mac_soft_ring_set_t *mac_srs = (mac_soft_ring_set_t *)srs;
3210 3215 mac_client_impl_t *mcip;
3211 3216
3212 3217 mcip = mac_srs->srs_mcip;
3213 3218 ASSERT(mcip != NULL);
3214 3219
3215 3220 /*
3216 3221 * We need to determine the SRS for every packet
3217 3222 * by walking the flow table, if we don't get any,
3218 3223 * then we proceed using the SRS we came with.
3219 3224 */
3220 3225 mp = tail = mp_chain;
3221 3226 while (mp != NULL) {
3222 3227
3223 3228 /*
3224 3229 * We will increment the stats for the matching subflow
3225 3230 * when we get the bytes/pkt count for the classified packets
3226 3231 * later in mac_rx_srs_process.
3227 3232 */
3228 3233 (void) mac_flow_lookup(mcip->mci_subflow_tab, mp,
3229 3234 FLOW_INBOUND, &flent);
3230 3235
3231 3236 if (mp == mp_chain || flent == prev_flent) {
3232 3237 if (prev_flent != NULL)
3233 3238 FLOW_REFRELE(prev_flent);
3234 3239 prev_flent = flent;
3235 3240 flent = NULL;
3236 3241 tail = mp;
3237 3242 mp = mp->b_next;
3238 3243 continue;
3239 3244 }
3240 3245 tail->b_next = NULL;
3241 3246 /*
3242 3247 * A null indicates this is for the mac_srs itself.
3243 3248 * XXX-venu : probably assert for fe_rx_srs_cnt == 0.
3244 3249 */
3245 3250 if (prev_flent == NULL || prev_flent->fe_rx_srs_cnt == 0) {
3246 3251 mac_rx_srs_process(arg,
3247 3252 (mac_resource_handle_t)mac_srs, mp_chain,
3248 3253 loopback);
3249 3254 } else {
3250 3255 (prev_flent->fe_cb_fn)(prev_flent->fe_cb_arg1,
3251 3256 prev_flent->fe_cb_arg2, mp_chain, loopback);
3252 3257 FLOW_REFRELE(prev_flent);
3253 3258 }
3254 3259 prev_flent = flent;
3255 3260 flent = NULL;
3256 3261 mp_chain = mp;
3257 3262 tail = mp;
3258 3263 mp = mp->b_next;
3259 3264 }
3260 3265 /* Last chain */
3261 3266 ASSERT(mp_chain != NULL);
3262 3267 if (prev_flent == NULL || prev_flent->fe_rx_srs_cnt == 0) {
3263 3268 mac_rx_srs_process(arg,
3264 3269 (mac_resource_handle_t)mac_srs, mp_chain, loopback);
3265 3270 } else {
3266 3271 (prev_flent->fe_cb_fn)(prev_flent->fe_cb_arg1,
3267 3272 prev_flent->fe_cb_arg2, mp_chain, loopback);
3268 3273 FLOW_REFRELE(prev_flent);
3269 3274 }
3270 3275 }
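
Editor's sketch (not from the webrev): the walk above batches consecutive packets that classify to the same flow and hands each run off as one sub-chain. A compact user-space model of that run-splitting, with an integer p_flow standing in for the mac_flow_lookup() result:

    #include <stdio.h>

    struct pkt {
        struct pkt  *p_next;
        int         p_flow;         /* stand-in for the flow lookup */
    };

    /* Per-flow handler; here it just counts the run it was handed. */
    static void
    deliver(int flow, struct pkt *chain)
    {
        int n = 0;

        for (; chain != NULL; chain = chain->p_next)
            n++;
        printf("flow %d: %d packet(s)\n", flow, n);
    }

    /*
     * Walk the chain; whenever the next packet classifies differently,
     * close the current run and hand it off as one sub-chain.
     */
    static void
    split_by_flow(struct pkt *p)
    {
        struct pkt *run = p;

        while (p != NULL) {
            struct pkt *next = p->p_next;

            if (next == NULL || next->p_flow != p->p_flow) {
                p->p_next = NULL;       /* terminate this run */
                deliver(p->p_flow, run);
                run = next;
            }
            p = next;
        }
    }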
3271 3276
3272 3277 /*
3273 3278 * mac_rx_srs_process
3274 3279 *
3275 3280 * Receive side routine called from the interrupt path.
3276 3281 *
3277 3282 * loopback is set to force a context switch on the loopback
3278 3283 * path between MAC clients.
3279 3284 */
3280 3285 /* ARGSUSED */
3281 3286 void
3282 3287 mac_rx_srs_process(void *arg, mac_resource_handle_t srs, mblk_t *mp_chain,
3283 3288 boolean_t loopback)
3284 3289 {
3285 3290 mac_soft_ring_set_t *mac_srs = (mac_soft_ring_set_t *)srs;
3286 3291 mblk_t *mp, *tail, *head;
3287 3292 int count = 0;
3288 3293 int count1;
3289 3294 size_t sz = 0;
3290 3295 size_t chain_sz, sz1;
3291 3296 mac_bw_ctl_t *mac_bw;
3292 3297 mac_srs_rx_t *srs_rx = &mac_srs->srs_rx;
3293 3298
3294 3299 /*
3295 3300 * Set the tail, count and sz. We set the sz irrespective
3296 3301 * of whether we are doing B/W control or not for the
3297 3302 * purpose of updating the stats.
3298 3303 */
3299 3304 mp = tail = mp_chain;
3300 3305 while (mp != NULL) {
3301 3306 tail = mp;
3302 3307 count++;
3303 3308 sz += msgdsize(mp);
3304 3309 mp = mp->b_next;
3305 3310 }
3306 3311
3307 3312 mutex_enter(&mac_srs->srs_lock);
3308 3313
3309 3314 if (loopback) {
3310 3315 SRS_RX_STAT_UPDATE(mac_srs, lclbytes, sz);
3311 3316 SRS_RX_STAT_UPDATE(mac_srs, lclcnt, count);
3312 3317
3313 3318 } else {
3314 3319 SRS_RX_STAT_UPDATE(mac_srs, intrbytes, sz);
3315 3320 SRS_RX_STAT_UPDATE(mac_srs, intrcnt, count);
3316 3321 }
3317 3322
3318 3323 /*
3319 3324 * If the SRS is already being processed; has been blanked;
3320 3325 * can be processed by worker thread only; or the B/W limit
3321 3326 * has been reached, then queue the chain and check if
3322 3327 * worker thread needs to be awakened.
3323 3328 */
3324 3329 if (mac_srs->srs_type & SRST_BW_CONTROL) {
3325 3330 mac_bw = mac_srs->srs_bw;
3326 3331 ASSERT(mac_bw != NULL);
3327 3332 mutex_enter(&mac_bw->mac_bw_lock);
3328 3333 mac_bw->mac_bw_intr += sz;
3329 3334 if (mac_bw->mac_bw_limit == 0) {
3330 3335 /* zero bandwidth: drop all */
3331 3336 srs_rx->sr_stat.mrs_sdrops += count;
3332 3337 mac_bw->mac_bw_drop_bytes += sz;
3333 3338 mutex_exit(&mac_bw->mac_bw_lock);
3334 3339 mutex_exit(&mac_srs->srs_lock);
3335 3340 mac_pkt_drop(NULL, NULL, mp_chain, B_FALSE);
3336 3341 return;
3337 3342 } else {
3338 3343 if ((mac_bw->mac_bw_sz + sz) <=
3339 3344 mac_bw->mac_bw_drop_threshold) {
3340 3345 mutex_exit(&mac_bw->mac_bw_lock);
3341 3346 MAC_RX_SRS_ENQUEUE_CHAIN(mac_srs, mp_chain,
3342 3347 tail, count, sz);
3343 3348 } else {
3344 3349 mp = mp_chain;
3345 3350 chain_sz = 0;
3346 3351 count1 = 0;
3347 3352 tail = NULL;
3348 3353 head = NULL;
3349 3354 while (mp != NULL) {
3350 3355 sz1 = msgdsize(mp);
3351 3356 if (mac_bw->mac_bw_sz + chain_sz + sz1 >
3352 3357 mac_bw->mac_bw_drop_threshold)
3353 3358 break;
3354 3359 chain_sz += sz1;
3355 3360 count1++;
3356 3361 tail = mp;
3357 3362 mp = mp->b_next;
3358 3363 }
3359 3364 mutex_exit(&mac_bw->mac_bw_lock);
3360 3365 if (tail != NULL) {
3361 3366 head = tail->b_next;
3362 3367 tail->b_next = NULL;
3363 3368 MAC_RX_SRS_ENQUEUE_CHAIN(mac_srs,
3364 3369 mp_chain, tail, count1, chain_sz);
3365 3370 sz -= chain_sz;
3366 3371 count -= count1;
3367 3372 } else {
3368 3373 /* Can't pick up any */
3369 3374 head = mp_chain;
3370 3375 }
3371 3376 if (head != NULL) {
3372 3377 /* Drop any packet over the threshold */
3373 3378 srs_rx->sr_stat.mrs_sdrops += count;
3374 3379 mutex_enter(&mac_bw->mac_bw_lock);
3375 3380 mac_bw->mac_bw_drop_bytes += sz;
3376 3381 mutex_exit(&mac_bw->mac_bw_lock);
3377 3382 freemsgchain(head);
3378 3383 }
3379 3384 }
3380 3385 MAC_SRS_WORKER_WAKEUP(mac_srs);
3381 3386 mutex_exit(&mac_srs->srs_lock);
3382 3387 return;
3383 3388 }
3384 3389 }
3385 3390
3386 3391 /*
3387 3392 * If the total number of packets queued in the SRS and
3388 3393 * its associated soft rings exceeds the max allowed,
3389 3394 * then drop the chain. If we are polling capable, this
3390 3395 * shouldn't be happening.
3391 3396 */
3392 3397 if (!(mac_srs->srs_type & SRST_BW_CONTROL) &&
3393 3398 (srs_rx->sr_poll_pkt_cnt > srs_rx->sr_hiwat)) {
3394 3399 mac_bw = mac_srs->srs_bw;
3395 3400 srs_rx->sr_stat.mrs_sdrops += count;
3396 3401 mutex_enter(&mac_bw->mac_bw_lock);
3397 3402 mac_bw->mac_bw_drop_bytes += sz;
3398 3403 mutex_exit(&mac_bw->mac_bw_lock);
3399 3404 freemsgchain(mp_chain);
3400 3405 mutex_exit(&mac_srs->srs_lock);
3401 3406 return;
3402 3407 }
3403 3408
3404 3409 MAC_RX_SRS_ENQUEUE_CHAIN(mac_srs, mp_chain, tail, count, sz);
3405 3410
3406 3411 if (!(mac_srs->srs_state & SRS_PROC)) {
3407 3412 /*
3408 3413 * If we are coming via loopback, if we are not optimizing for
3409 3414 * latency, or if our stack is running deep, we should signal
3410 3415 * the worker thread.
3411 3416 */
3412 3417 if (loopback || !(mac_srs->srs_state & SRS_LATENCY_OPT) ||
3413 3418 MAC_RX_SRS_TOODEEP()) {
3414 3419 /*
3415 3420 * For loopback, we need to let the worker take
3416 3421 * over as we don't want to continue in the same
3417 3422 * thread even if we can. This could lead to stack
3418 3423 * overflows and may also end up using
3419 3424 * resources (cpu) incorrectly.
3420 3425 */
3421 3426 cv_signal(&mac_srs->srs_async);
3422 3427 } else {
3423 3428 /*
3424 3429 * Seems like no one is processing the SRS and
3425 3430 * there is no backlog. We also inline process
3426 3431 * our packet if it's a single packet in non
3427 3432 * latency optimized case (in latency optimized
3428 3433 * case, we inline process chains of any size).
3429 3434 */
3430 3435 mac_srs->srs_drain_func(mac_srs, SRS_PROC_FAST);
3431 3436 }
3432 3437 }
3433 3438 mutex_exit(&mac_srs->srs_lock);
3434 3439 }
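
Editor's sketch (hypothetical names, not from the webrev): under bandwidth control, mac_rx_srs_process() accepts only the prefix of an arriving chain that keeps the queue under mac_bw_drop_threshold and counts the rest as drops. A minimal model of that split:

    #include <stddef.h>

    struct pkt {
        struct pkt  *p_next;
        size_t      p_size;
    };

    struct bw_queue {
        size_t  q_bytes;            /* bytes already queued */
        size_t  q_drop_threshold;
        size_t  q_drop_bytes;       /* stat: bytes refused */
    };

    /* Return the accepted prefix; *restp gets the suffix to be dropped. */
    static struct pkt *
    accept_under_threshold(struct bw_queue *q, struct pkt *chain,
        struct pkt **restp)
    {
        struct pkt *tail = NULL, *p = chain;
        size_t chain_sz = 0;

        while (p != NULL &&
            q->q_bytes + chain_sz + p->p_size <= q->q_drop_threshold) {
            chain_sz += p->p_size;
            tail = p;
            p = p->p_next;
        }
        *restp = p;                 /* everything past the threshold */
        for (; p != NULL; p = p->p_next)
            q->q_drop_bytes += p->p_size;
        if (tail == NULL)
            return (NULL);          /* nothing fit */
        tail->p_next = NULL;
        q->q_bytes += chain_sz;
        return (chain);
    }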
3435 3440
3436 3441 /* TX SIDE ROUTINES (RUNTIME) */
3437 3442
3438 3443 /*
3439 3444 * mac_tx_srs_no_desc
3440 3445 *
3441 3446 * This routine is called in Tx single-ring default mode
3442 3447 * when the Tx ring runs out of descs.
3443 3448 */
3444 3449 mac_tx_cookie_t
3445 3450 mac_tx_srs_no_desc(mac_soft_ring_set_t *mac_srs, mblk_t *mp_chain,
3446 3451 uint16_t flag, mblk_t **ret_mp)
3447 3452 {
3448 3453 mac_tx_cookie_t cookie = 0;
3449 3454 mac_srs_tx_t *srs_tx = &mac_srs->srs_tx;
3450 3455 boolean_t wakeup_worker = B_TRUE;
3451 3456 uint32_t tx_mode = srs_tx->st_mode;
3452 3457 int cnt, sz;
3453 3458 mblk_t *tail;
3454 3459
3455 3460 ASSERT(tx_mode == SRS_TX_DEFAULT || tx_mode == SRS_TX_BW);
3456 3461 if (flag & MAC_DROP_ON_NO_DESC) {
3457 3462 MAC_TX_SRS_DROP_MESSAGE(mac_srs, mp_chain, cookie);
3458 3463 } else {
3459 3464 if (mac_srs->srs_first != NULL)
3460 3465 wakeup_worker = B_FALSE;
3461 3466 MAC_COUNT_CHAIN(mac_srs, mp_chain, tail, cnt, sz);
3462 3467 if (flag & MAC_TX_NO_ENQUEUE) {
3463 3468 /*
3464 3469 * If TX_QUEUED is not set, queue the
3465 3470 * packet and let mac_tx_srs_drain()
3466 3471 * set the TX_BLOCKED bit for the
3467 3472 * reasons explained above. Otherwise,
3468 3473 * return the mblks.
3469 3474 */
3470 3475 if (wakeup_worker) {
3471 3476 MAC_TX_SRS_ENQUEUE_CHAIN(mac_srs,
3472 3477 mp_chain, tail, cnt, sz);
3473 3478 } else {
3474 3479 MAC_TX_SET_NO_ENQUEUE(mac_srs,
3475 3480 mp_chain, ret_mp, cookie);
3476 3481 }
3477 3482 } else {
3478 3483 MAC_TX_SRS_TEST_HIWAT(mac_srs, mp_chain,
3479 3484 tail, cnt, sz, cookie);
3480 3485 }
3481 3486 if (wakeup_worker)
3482 3487 cv_signal(&mac_srs->srs_async);
3483 3488 }
3484 3489 return (cookie);
3485 3490 }
3486 3491
3487 3492 /*
3488 3493 * mac_tx_srs_enqueue
3489 3494 *
3490 3495 * This routine is called when Tx SRS is operating in either serializer
3491 3496 * or bandwidth mode. In serializer mode, a packet will get enqueued
3492 3497 * when a thread cannot enter SRS exclusively. In bandwidth mode,
3493 3498 * packets get queued if the allowed byte-count limit for a tick is
3494 3499 * exceeded. The action that gets taken when MAC_DROP_ON_NO_DESC and
3495 3500 * MAC_TX_NO_ENQUEUE is set is different than when operating in either
3496 3501 * the default mode or fanout mode. Here packets get dropped or
3497 3502 * returned back to the caller only after hi-watermark worth of data
3498 3503 * is queued.
3499 3504 */
3500 3505 static mac_tx_cookie_t
3501 3506 mac_tx_srs_enqueue(mac_soft_ring_set_t *mac_srs, mblk_t *mp_chain,
3502 3507 uint16_t flag, uintptr_t fanout_hint, mblk_t **ret_mp)
3503 3508 {
3504 3509 mac_tx_cookie_t cookie = 0;
3505 3510 int cnt, sz;
3506 3511 mblk_t *tail;
3507 3512 boolean_t wakeup_worker = B_TRUE;
3508 3513
3509 3514 /*
3510 3515 * Ignore fanout hint if we don't have multiple tx rings.
3511 3516 */
3512 3517 if (!MAC_TX_SOFT_RINGS(mac_srs))
3513 3518 fanout_hint = 0;
3514 3519
3515 3520 if (mac_srs->srs_first != NULL)
3516 3521 wakeup_worker = B_FALSE;
3517 3522 MAC_COUNT_CHAIN(mac_srs, mp_chain, tail, cnt, sz);
3518 3523 if (flag & MAC_DROP_ON_NO_DESC) {
3519 3524 if (mac_srs->srs_count > mac_srs->srs_tx.st_hiwat) {
3520 3525 MAC_TX_SRS_DROP_MESSAGE(mac_srs, mp_chain, cookie);
3521 3526 } else {
3522 3527 MAC_TX_SRS_ENQUEUE_CHAIN(mac_srs,
3523 3528 mp_chain, tail, cnt, sz);
3524 3529 }
3525 3530 } else if (flag & MAC_TX_NO_ENQUEUE) {
3526 3531 if ((mac_srs->srs_count > mac_srs->srs_tx.st_hiwat) ||
3527 3532 (mac_srs->srs_state & SRS_TX_WAKEUP_CLIENT)) {
3528 3533 MAC_TX_SET_NO_ENQUEUE(mac_srs, mp_chain,
3529 3534 ret_mp, cookie);
3530 3535 } else {
3531 3536 mp_chain->b_prev = (mblk_t *)fanout_hint;
3532 3537 MAC_TX_SRS_ENQUEUE_CHAIN(mac_srs,
3533 3538 mp_chain, tail, cnt, sz);
3534 3539 }
3535 3540 } else {
3536 3541 /*
3537 3542 * If you are BW_ENFORCED, just enqueue the
3538 3543 * packet. srs_worker will drain it at the
3539 3544 * prescribed rate. Before enqueueing, save
3540 3545 * the fanout hint.
3541 3546 */
3542 3547 mp_chain->b_prev = (mblk_t *)fanout_hint;
3543 3548 MAC_TX_SRS_TEST_HIWAT(mac_srs, mp_chain,
3544 3549 tail, cnt, sz, cookie);
3545 3550 }
3546 3551 if (wakeup_worker)
3547 3552 cv_signal(&mac_srs->srs_async);
3548 3553 return (cookie);
3549 3554 }
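
Editor's sketch (not from the webrev): the flag handling above reduces to a small policy: queue by default, and only once hiwat worth of data is pending either drop the chain (MAC_DROP_ON_NO_DESC) or hand it back (MAC_TX_NO_ENQUEUE). A model of that decision table, with hypothetical names:

    enum tx_flag { DROP_ON_NO_DESC = 0x1, TX_NO_ENQUEUE = 0x2 };
    enum tx_action { ACTION_QUEUE, ACTION_DROP, ACTION_RETURN };

    /*
     * What to do with a chain that cannot be sent right now: queue by
     * default, but once the backlog passes hiwat either drop it or
     * return it, depending on the caller's flag.
     */
    static enum tx_action
    enqueue_policy(int flags, int queued, int hiwat)
    {
        if (queued > hiwat) {
            if (flags & DROP_ON_NO_DESC)
                return (ACTION_DROP);
            if (flags & TX_NO_ENQUEUE)
                return (ACTION_RETURN);
        }
        return (ACTION_QUEUE);
    }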
3550 3555
3551 3556 /*
3552 3557 * There are seven tx modes:
3553 3558 *
3554 3559 * 1) Default mode (SRS_TX_DEFAULT)
3555 3560 * 2) Serialization mode (SRS_TX_SERIALIZE)
3556 3561 * 3) Fanout mode (SRS_TX_FANOUT)
3557 3562 * 4) Bandwidth mode (SRS_TX_BW)
3558 3563 * 5) Fanout and Bandwidth mode (SRS_TX_BW_FANOUT)
3559 3564 * 6) aggr Tx mode (SRS_TX_AGGR)
3560 3565 * 7) aggr Tx bw mode (SRS_TX_BW_AGGR)
3561 3566 *
3562 3567 * The tx mode in which an SRS operates is decided in mac_tx_srs_setup()
3563 3568 * based on the number of Tx rings requested for an SRS and whether
3564 3569 * bandwidth control is requested or not.
3565 3570 *
3566 3571 * The default mode (i.e., no fanout/no bandwidth) is used when the
3567 3572 * underlying NIC does not have Tx rings or just one Tx ring. In this mode,
3568 3573 * the SRS acts as a pass-thru. Packets will go directly to mac_tx_send().
3569 3574 * When the underlying Tx ring runs out of Tx descs, it starts queueing up
3570 3575 * packets in SRS. When flow-control is relieved, the srs_worker drains
3571 3576 * the queued packets and informs blocked clients to restart sending
3572 3577 * packets.
3573 3578 *
3574 3579 * In the SRS_TX_SERIALIZE mode, all calls to mac_tx() are serialized. This
3575 3580 * mode is used when the link has no Tx rings or only one Tx ring.
3576 3581 *
3577 3582 * In the SRS_TX_FANOUT mode, packets will be fanned out to multiple
3578 3583 * Tx rings. Each Tx ring will have a soft ring associated with it.
3579 3584 * These soft rings will be hung off the Tx SRS. Queueing if it happens
3580 3585 * due to lack of Tx desc will be in individual soft ring (and not srs)
3581 3586 * associated with Tx ring.
3582 3587 *
3583 3588 * In the TX_BW mode, tx srs will allow packets to go down to Tx ring
3584 3589 * only if bw is available. Otherwise the packets will be queued in
3585 3590 * SRS. If fanout to multiple Tx rings is configured, the packets will
3586 3591 * be fanned out among the soft rings associated with the Tx rings.
3587 3592 *
3588 3593 * In SRS_TX_AGGR mode, mac_tx_aggr_mode() routine is called. This routine
3589 3594 * invokes an aggr function, aggr_find_tx_ring(), to find a pseudo Tx ring
3590 3595 * belonging to a port on which the packet has to be sent. Aggr will
3591 3596 * always have a pseudo Tx ring associated with it even when it is an
3592 3597 * aggregation over a single NIC that has no Tx rings. Even in such a
3593 3598 * case, the single pseudo Tx ring will have a soft ring associated with
3594 3599 * it and the soft ring will hang off the SRS.
3595 3600 *
3596 3601 * If a bandwidth is specified for an aggr, SRS_TX_BW_AGGR mode is used.
3597 3602 * In this mode, the bandwidth is first applied on the outgoing packets
3598 3603 * and later mac_tx_aggr_mode() function is called to send the packet out
3599 3604 * of one of the pseudo Tx rings.
3600 3605 *
3601 3606 * Three flags are used in srs_state for indicating flow control
3602 3607 * conditions: SRS_TX_BLOCKED, SRS_TX_HIWAT, SRS_TX_WAKEUP_CLIENT.
3603 3608 * SRS_TX_BLOCKED indicates out of Tx descs. SRS expects a wakeup from the
3604 3609 * driver below.
3605 3610 * SRS_TX_HIWAT indicates packet count enqueued in Tx SRS exceeded Tx hiwat
3606 3611 * and flow-control pressure is applied back to clients. The clients expect
3607 3612 * wakeup when flow-control is relieved.
3608 3613 * SRS_TX_WAKEUP_CLIENT get set when (flag == MAC_TX_NO_ENQUEUE) and mblk
3609 3614 * got returned back to client either due to lack of Tx descs or due to bw
3610 3615 * control reasons. The clients expect a wakeup when condition is relieved.
3611 3616 *
3612 3617 * The fourth argument to mac_tx() is the flag. Normally it will be 0 but
3613 3618 * some clients set the following values too: MAC_DROP_ON_NO_DESC,
3614 3619 * MAC_TX_NO_ENQUEUE
3615 3620 * Mac clients that do not want packets to be enqueued in the mac layer set
3616 3621 * MAC_DROP_ON_NO_DESC value. The packets won't be queued in the Tx SRS or
3617 3622 * Tx soft rings but instead get dropped when the NIC runs out of desc. The
3618 3623 * behaviour of this flag is different when the Tx is running in serializer
3619 3624 * or bandwidth mode. Under these (serializer, bandwidth) modes, packets
3620 3625 * get dropped when the Tx high watermark is reached.
3621 3626 * There are some mac clients like vsw, aggr that want the mblks to be
3622 3627 * returned back to clients instead of being queued in Tx SRS (or Tx soft
3623 3628 * rings) under flow-control (i.e., out of desc or exceeding bw limits)
3624 3629 * conditions. These clients call mac_tx() with MAC_TX_NO_ENQUEUE flag set.
3625 3630 * In the default and Tx fanout mode, the un-transmitted mblks will be
3626 3631 * returned back to the clients when the driver runs out of Tx descs.
3627 3632 * SRS_TX_WAKEUP_CLIENT (or S_RING_WAKEUP_CLIENT) will be set in SRS (or
3628 3633 * soft ring) so that the clients can be woken up when Tx desc become
3629 3634 * available. When running in serializer or bandwidth mode,
3630 3635 * SRS_TX_WAKEUP_CLIENT will be set when tx hi-watermark is reached.
3631 3636 */
3632 3637
3633 3638 mac_tx_func_t
3634 3639 mac_tx_get_func(uint32_t mode)
3635 3640 {
3636 3641 return (mac_tx_mode_list[mode].mac_tx_func);
3637 3642 }
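
Editor's sketch (not from the webrev): mac_tx_get_func() is a lookup into a mode-indexed table of send routines, resolved once at SRS setup so the hot path makes a single indirect call. A self-contained model of the pattern; the tx_* names are illustrative, not the actual mac_tx_mode_list entries:

    typedef int (*tx_func_t)(void *srs, void *chain);

    enum tx_mode { TX_DEFAULT, TX_SERIALIZE, TX_FANOUT, TX_BW, TX_NMODES };

    /* Stub send routines; each would implement one mode's policy. */
    static int tx_default(void *s, void *c)   { (void)s; (void)c; return (0); }
    static int tx_serialize(void *s, void *c) { (void)s; (void)c; return (0); }
    static int tx_fanout(void *s, void *c)    { (void)s; (void)c; return (0); }
    static int tx_bw(void *s, void *c)        { (void)s; (void)c; return (0); }

    static const tx_func_t tx_mode_list[TX_NMODES] = {
        [TX_DEFAULT]   = tx_default,
        [TX_SERIALIZE] = tx_serialize,
        [TX_FANOUT]    = tx_fanout,
        [TX_BW]        = tx_bw,
    };

    /* Resolved once at setup; callers then invoke the returned pointer. */
    static tx_func_t
    tx_get_func(enum tx_mode mode)
    {
        return (tx_mode_list[mode]);
    }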
3638 3643
3639 3644 /* ARGSUSED */
3640 3645 static mac_tx_cookie_t
3641 3646 mac_tx_single_ring_mode(mac_soft_ring_set_t *mac_srs, mblk_t *mp_chain,
3642 3647 uintptr_t fanout_hint, uint16_t flag, mblk_t **ret_mp)
3643 3648 {
3644 3649 mac_srs_tx_t *srs_tx = &mac_srs->srs_tx;
3645 3650 mac_tx_stats_t stats;
3646 3651 mac_tx_cookie_t cookie = 0;
3647 3652
3648 3653 ASSERT(srs_tx->st_mode == SRS_TX_DEFAULT);
3649 3654
3650 3655 /* Regular case with a single Tx ring */
3651 3656 /*
3652 3657 * SRS_TX_BLOCKED is set when underlying NIC runs
3653 3658 * out of Tx descs and messages start getting
3654 3659 * queued. It won't get reset until
3655 3660 * mac_tx_srs_drain() completely drains out the
3656 3661 * messages.
3657 3662 */
3658 3663 if ((mac_srs->srs_state & SRS_ENQUEUED) != 0) {
3659 3664 /* Tx descs/resources not available */
3660 3665 mutex_enter(&mac_srs->srs_lock);
3661 3666 if ((mac_srs->srs_state & SRS_ENQUEUED) != 0) {
3662 3667 cookie = mac_tx_srs_no_desc(mac_srs, mp_chain,
3663 3668 flag, ret_mp);
3664 3669 mutex_exit(&mac_srs->srs_lock);
3665 3670 return (cookie);
3666 3671 }
3667 3672 /*
3668 3673 * While we were computing mblk count, the
3669 3674 * flow control condition got relieved.
3670 3675 * Continue with the transmission.
3671 3676 */
3672 3677 mutex_exit(&mac_srs->srs_lock);
3673 3678 }
3674 3679
3675 3680 mp_chain = mac_tx_send(srs_tx->st_arg1, srs_tx->st_arg2,
3676 3681 mp_chain, &stats);
3677 3682
3678 3683 /*
3679 3684 * Multiple threads could be here sending packets.
3680 3685 * Under such conditions, it is not possible to
3681 3686 * atomically set SRS_TX_BLOCKED bit to indicate
3682 3687 * out of tx desc condition. To atomically set
3683 3688 * this, we queue the returned packet and do
3684 3689 * the setting of SRS_TX_BLOCKED in
3685 3690 * mac_tx_srs_drain().
3686 3691 */
3687 3692 if (mp_chain != NULL) {
3688 3693 mutex_enter(&mac_srs->srs_lock);
3689 3694 cookie = mac_tx_srs_no_desc(mac_srs, mp_chain, flag, ret_mp);
3690 3695 mutex_exit(&mac_srs->srs_lock);
3691 3696 return (cookie);
3692 3697 }
3693 3698 SRS_TX_STATS_UPDATE(mac_srs, &stats);
3694 3699
3695 3700 return (0);
3696 3701 }
3697 3702
3698 3703 /*
3699 3704 * mac_tx_serialize_mode
3700 3705 *
3701 3706 * This is an experimental mode implemented as per the request of PAE.
3702 3707 * In this mode, all callers attempting to send a packet to the NIC
3703 3708 * will get serialized. Only one thread at any time will access the
3704 3709 * NIC to send the packet out.
3705 3710 */
3706 3711 /* ARGSUSED */
3707 3712 static mac_tx_cookie_t
3708 3713 mac_tx_serializer_mode(mac_soft_ring_set_t *mac_srs, mblk_t *mp_chain,
3709 3714 uintptr_t fanout_hint, uint16_t flag, mblk_t **ret_mp)
3710 3715 {
3711 3716 mac_tx_stats_t stats;
3712 3717 mac_tx_cookie_t cookie = 0;
3713 3718 mac_srs_tx_t *srs_tx = &mac_srs->srs_tx;
3714 3719
3715 3720 /* Single ring, serialize below */
3716 3721 ASSERT(srs_tx->st_mode == SRS_TX_SERIALIZE);
3717 3722 mutex_enter(&mac_srs->srs_lock);
3718 3723 if ((mac_srs->srs_first != NULL) ||
3719 3724 (mac_srs->srs_state & SRS_PROC)) {
3720 3725 /*
3721 3726 * In serialization mode, queue all packets until
3722 3727 * TX_HIWAT is set.
3723 3728 * If drop bit is set, drop if TX_HIWAT is set.
3724 3729 * If no_enqueue is set, still enqueue until hiwat
3725 3730 * is set and return mblks after TX_HIWAT is set.
3726 3731 */
3727 3732 cookie = mac_tx_srs_enqueue(mac_srs, mp_chain,
3728 3733 flag, 0, ret_mp);
3729 3734 mutex_exit(&mac_srs->srs_lock);
3730 3735 return (cookie);
3731 3736 }
3732 3737 /*
3733 3738 * No packets queued, nothing on proc and no flow
3734 3739 * control condition. Fast-path, ok. Do inline
3735 3740 * processing.
3736 3741 */
3737 3742 mac_srs->srs_state |= SRS_PROC;
3738 3743 mutex_exit(&mac_srs->srs_lock);
3739 3744
3740 3745 mp_chain = mac_tx_send(srs_tx->st_arg1, srs_tx->st_arg2,
3741 3746 mp_chain, &stats);
3742 3747
3743 3748 mutex_enter(&mac_srs->srs_lock);
3744 3749 mac_srs->srs_state &= ~SRS_PROC;
3745 3750 if (mp_chain != NULL) {
3746 3751 cookie = mac_tx_srs_enqueue(mac_srs,
3747 3752 mp_chain, flag, 0, ret_mp);
3748 3753 }
3749 3754 if (mac_srs->srs_first != NULL) {
3750 3755 /*
3751 3756 * We processed inline our packet and a new
3752 3757 * packet/s got queued while we were
3753 3758 * processing. Wakeup srs worker
3754 3759 */
3755 3760 cv_signal(&mac_srs->srs_async);
3756 3761 }
3757 3762 mutex_exit(&mac_srs->srs_lock);
3758 3763
3759 3764 if (cookie == 0)
3760 3765 SRS_TX_STATS_UPDATE(mac_srs, &stats);
3761 3766
3762 3767 return (cookie);
3763 3768 }
3764 3769
3765 3770 /*
3766 3771 * mac_tx_fanout_mode
3767 3772 *
3768 3773 * In this mode, the SRS will have access to multiple Tx rings to send
3769 3774 * the packet out. The fanout hint that is passed as an argument is
3770 3775 * used to find an appropriate ring to fanout the traffic. Each Tx
3771 3776 * ring, in turn, will have a soft ring associated with it. If a Tx
3772 3777 * ring runs out of Tx desc's the returned packet will be queued in
3773 3778 * the soft ring associated with that Tx ring. The srs itself will not
3774 3779 * queue any packets.
3775 3780 */
3776 3781
3777 3782 #define MAC_TX_SOFT_RING_PROCESS(chain) { \
3778 3783 index = COMPUTE_INDEX(hash, mac_srs->srs_tx_ring_count), \
3779 3784 softring = mac_srs->srs_tx_soft_rings[index]; \
3780 3785 cookie = mac_tx_soft_ring_process(softring, chain, flag, ret_mp); \
3781 3786 DTRACE_PROBE2(tx__fanout, uint64_t, hash, uint_t, index); \
3782 3787 }
3783 3788
3784 3789 static mac_tx_cookie_t
3785 3790 mac_tx_fanout_mode(mac_soft_ring_set_t *mac_srs, mblk_t *mp_chain,
3786 3791 uintptr_t fanout_hint, uint16_t flag, mblk_t **ret_mp)
3787 3792 {
3788 3793 mac_soft_ring_t *softring;
3789 3794 uint64_t hash;
3790 3795 uint_t index;
3791 3796 mac_tx_cookie_t cookie = 0;
3792 3797
3793 3798 ASSERT(mac_srs->srs_tx.st_mode == SRS_TX_FANOUT ||
3794 3799 mac_srs->srs_tx.st_mode == SRS_TX_BW_FANOUT);
3795 3800 if (fanout_hint != 0) {
3796 3801 /*
3797 3802 * The hint is specified by the caller, simply pass the
3798 3803 * whole chain to the soft ring.
3799 3804 */
3800 3805 hash = HASH_HINT(fanout_hint);
3801 3806 MAC_TX_SOFT_RING_PROCESS(mp_chain);
3802 3807 } else {
3803 3808 mblk_t *last_mp, *cur_mp, *sub_chain;
3804 3809 uint64_t last_hash = 0;
3805 3810 uint_t media = mac_srs->srs_mcip->mci_mip->mi_info.mi_media;
3806 3811
3807 3812 /*
3808 3813 * Compute the hash from the contents (headers) of the
3809 3814 * packets of the mblk chain. Split the chains into
3810 3815 * subchains of the same conversation.
3811 3816 *
3812 3817 * Since there may be more than one ring used for
3813 3818 * sub-chains of the same call, and since the caller
3814 3819 * does not maintain per-conversation state (it
3815 3820 * passed a zero hint), unsent subchains will be
3816 3821 * dropped.
3817 3822 */
3818 3823
3819 3824 flag |= MAC_DROP_ON_NO_DESC;
3820 3825 ret_mp = NULL;
3821 3826
3822 3827 ASSERT(ret_mp == NULL);
3823 3828
3824 3829 sub_chain = NULL;
3825 3830 last_mp = NULL;
3826 3831
3827 3832 for (cur_mp = mp_chain; cur_mp != NULL;
3828 3833 cur_mp = cur_mp->b_next) {
3829 3834 hash = mac_pkt_hash(media, cur_mp, MAC_PKT_HASH_L4,
3830 3835 B_TRUE);
3831 3836 if (last_hash != 0 && hash != last_hash) {
3832 3837 /*
3833 3838 * Starting a different subchain, send current
3834 3839 * chain out.
3835 3840 */
3836 3841 ASSERT(last_mp != NULL);
3837 3842 last_mp->b_next = NULL;
3838 3843 MAC_TX_SOFT_RING_PROCESS(sub_chain);
3839 3844 sub_chain = NULL;
3840 3845 }
3841 3846
3842 3847 /* add packet to subchain */
3843 3848 if (sub_chain == NULL)
3844 3849 sub_chain = cur_mp;
3845 3850 last_mp = cur_mp;
3846 3851 last_hash = hash;
3847 3852 }
3848 3853
3849 3854 if (sub_chain != NULL) {
3850 3855 /* send last subchain */
3851 3856 ASSERT(last_mp != NULL);
3852 3857 last_mp->b_next = NULL;
3853 3858 MAC_TX_SOFT_RING_PROCESS(sub_chain);
3854 3859 }
3855 3860
3856 3861 cookie = 0;
3857 3862 }
3858 3863
3859 3864 return (cookie);
3860 3865 }
3861 3866
3862 3867 /*
3863 3868 * mac_tx_bw_mode
3864 3869 *
3865 3870 * In the bandwidth mode, Tx srs will allow packets to go down to Tx ring
3866 3871 * only if bw is available. Otherwise the packets will be queued in
3867 3872 * SRS. If the SRS has multiple Tx rings, then packets will get fanned
3868 3873 * out to the Tx rings.
3869 3874 */
3870 3875 static mac_tx_cookie_t
3871 3876 mac_tx_bw_mode(mac_soft_ring_set_t *mac_srs, mblk_t *mp_chain,
3872 3877 uintptr_t fanout_hint, uint16_t flag, mblk_t **ret_mp)
3873 3878 {
3874 3879 int cnt, sz;
3875 3880 mblk_t *tail;
3876 3881 mac_tx_cookie_t cookie = 0;
3877 3882 mac_srs_tx_t *srs_tx = &mac_srs->srs_tx;
3878 3883 clock_t now;
3879 3884
3880 3885 ASSERT(TX_BANDWIDTH_MODE(mac_srs));
3881 3886 ASSERT(mac_srs->srs_type & SRST_BW_CONTROL);
3882 3887 mutex_enter(&mac_srs->srs_lock);
3883 3888 if (mac_srs->srs_bw->mac_bw_limit == 0) {
3884 3889 /*
3885 3890 * zero bandwidth, no traffic is sent: drop the packets,
3886 3891 * or return the whole chain if the caller requests all
3887 3892 * unsent packets back.
3888 3893 */
3889 3894 if (flag & MAC_TX_NO_ENQUEUE) {
3890 3895 cookie = (mac_tx_cookie_t)mac_srs;
3891 3896 *ret_mp = mp_chain;
3892 3897 } else {
3893 3898 MAC_TX_SRS_DROP_MESSAGE(mac_srs, mp_chain, cookie);
3894 3899 }
3895 3900 mutex_exit(&mac_srs->srs_lock);
3896 3901 return (cookie);
3897 3902 } else if ((mac_srs->srs_first != NULL) ||
3898 3903 (mac_srs->srs_bw->mac_bw_state & SRS_BW_ENFORCED)) {
3899 3904 cookie = mac_tx_srs_enqueue(mac_srs, mp_chain, flag,
3900 3905 fanout_hint, ret_mp);
3901 3906 mutex_exit(&mac_srs->srs_lock);
3902 3907 return (cookie);
3903 3908 }
3904 3909 MAC_COUNT_CHAIN(mac_srs, mp_chain, tail, cnt, sz);
3905 3910 now = ddi_get_lbolt();
3906 3911 if (mac_srs->srs_bw->mac_bw_curr_time != now) {
3907 3912 mac_srs->srs_bw->mac_bw_curr_time = now;
3908 3913 mac_srs->srs_bw->mac_bw_used = 0;
3909 3914 } else if (mac_srs->srs_bw->mac_bw_used >
3910 3915 mac_srs->srs_bw->mac_bw_limit) {
3911 3916 mac_srs->srs_bw->mac_bw_state |= SRS_BW_ENFORCED;
3912 3917 MAC_TX_SRS_ENQUEUE_CHAIN(mac_srs,
3913 3918 mp_chain, tail, cnt, sz);
3914 3919 /*
3915 3920 * Wakeup worker thread. Note that worker
3916 3921 * thread has to be woken up so that it
3917 3922 * can fire up the timer to be woken up
3918 3923 * on the next tick. Also once
3919 3924 * BW_ENFORCED is set, it can only be
3920 3925 * reset by srs_worker thread. Until then
3921 3926 * all packets will get queued up in SRS
3922 3927 * and hence this code path won't be
3923 3928 * entered until BW_ENFORCED is reset.
3924 3929 */
3925 3930 cv_signal(&mac_srs->srs_async);
3926 3931 mutex_exit(&mac_srs->srs_lock);
3927 3932 return (cookie);
3928 3933 }
3929 3934
3930 3935 mac_srs->srs_bw->mac_bw_used += sz;
3931 3936 mutex_exit(&mac_srs->srs_lock);
3932 3937
3933 3938 if (srs_tx->st_mode == SRS_TX_BW_FANOUT) {
3934 3939 mac_soft_ring_t *softring;
3935 3940 uint_t indx, hash;
3936 3941
3937 3942 hash = HASH_HINT(fanout_hint);
3938 3943 indx = COMPUTE_INDEX(hash,
3939 3944 mac_srs->srs_tx_ring_count);
3940 3945 softring = mac_srs->srs_tx_soft_rings[indx];
3941 3946 return (mac_tx_soft_ring_process(softring, mp_chain, flag,
3942 3947 ret_mp));
3943 3948 } else if (srs_tx->st_mode == SRS_TX_BW_AGGR) {
3944 3949 return (mac_tx_aggr_mode(mac_srs, mp_chain,
3945 3950 fanout_hint, flag, ret_mp));
3946 3951 } else {
3947 3952 mac_tx_stats_t stats;
3948 3953
3949 3954 mp_chain = mac_tx_send(srs_tx->st_arg1, srs_tx->st_arg2,
3950 3955 mp_chain, &stats);
3951 3956
3952 3957 if (mp_chain != NULL) {
3953 3958 mutex_enter(&mac_srs->srs_lock);
3954 3959 MAC_COUNT_CHAIN(mac_srs, mp_chain, tail, cnt, sz);
3955 3960 if (mac_srs->srs_bw->mac_bw_used > sz)
3956 3961 mac_srs->srs_bw->mac_bw_used -= sz;
3957 3962 else
3958 3963 mac_srs->srs_bw->mac_bw_used = 0;
3959 3964 cookie = mac_tx_srs_enqueue(mac_srs, mp_chain, flag,
3960 3965 fanout_hint, ret_mp);
3961 3966 mutex_exit(&mac_srs->srs_lock);
3962 3967 return (cookie);
3963 3968 }
3964 3969 SRS_TX_STATS_UPDATE(mac_srs, &stats);
3965 3970
3966 3971 return (0);
3967 3972 }
3968 3973 }
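
Editor's sketch (not from the webrev): mac_tx_bw_mode() charges the whole chain against the budget before sending, then credits back the bytes the driver returned unsent. A user-space model of that charge-then-refund accounting; driver_send() is a hypothetical stub, not a real hook:

    #include <stddef.h>

    struct pkt {
        struct pkt  *p_next;
        size_t      p_size;
    };

    struct bw_ctl {
        size_t  bw_used;        /* bytes charged in the current tick */
    };

    static size_t
    chain_bytes(struct pkt *p)
    {
        size_t sz = 0;

        for (; p != NULL; p = p->p_next)
            sz += p->p_size;
        return (sz);
    }

    /* Stub driver hook: pretend everything was sent. */
    static struct pkt *
    driver_send(struct pkt *chain)
    {
        (void) chain;
        return (NULL);
    }

    /* Charge the whole chain up front, then refund whatever came back. */
    static struct pkt *
    bw_send(struct bw_ctl *bw, struct pkt *chain)
    {
        struct pkt *unsent;

        bw->bw_used += chain_bytes(chain);
        unsent = driver_send(chain);
        if (unsent != NULL) {
            size_t back = chain_bytes(unsent);

            bw->bw_used = (bw->bw_used > back) ? bw->bw_used - back : 0;
        }
        return (unsent);
    }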
3969 3974
3970 3975 /*
3971 3976 * mac_tx_aggr_mode
3972 3977 *
3973 3978 * This routine invokes an aggr function, aggr_find_tx_ring(), to find
3974 3979 * a (pseudo) Tx ring belonging to a port on which the packet has to
3975 3980 * be sent. aggr_find_tx_ring() first finds the outgoing port based on
3976 3981 * L2/L3/L4 policy and then uses the fanout_hint passed to it to pick
3977 3982 * a Tx ring from the selected port.
3978 3983 *
3979 3984 * Note that a port can be deleted from the aggregation. In such a case,
3980 3985 * the aggregation layer first separates the port from the rest of the
3981 3986 * ports making sure that port (and thus any Tx rings associated with
3982 3987 * it) won't get selected in the call to aggr_find_tx_ring() function.
3983 3988 * Later calls are made to mac_group_rem_ring() passing pseudo Tx ring
3984 3989 * handles one by one which in turn will quiesce the Tx SRS and remove
3985 3990 * the soft ring associated with the pseudo Tx ring. Unlike Rx side
3986 3991 * where a cookie is used to protect against mac_rx_ring() calls on
3987 3992 * rings that have been removed, no such cookie is needed on the Tx
3988 3993 * side as the pseudo Tx ring won't be available anymore to
3989 3994 * aggr_find_tx_ring() once the port has been removed.
3990 3995 */
3991 3996 static mac_tx_cookie_t
3992 3997 mac_tx_aggr_mode(mac_soft_ring_set_t *mac_srs, mblk_t *mp_chain,
3993 3998 uintptr_t fanout_hint, uint16_t flag, mblk_t **ret_mp)
3994 3999 {
3995 4000 mac_srs_tx_t *srs_tx = &mac_srs->srs_tx;
3996 4001 mac_tx_ring_fn_t find_tx_ring_fn;
3997 4002 mac_ring_handle_t ring = NULL;
3998 4003 void *arg;
3999 4004 mac_soft_ring_t *sringp;
4000 4005
4001 4006 find_tx_ring_fn = srs_tx->st_capab_aggr.mca_find_tx_ring_fn;
4002 4007 arg = srs_tx->st_capab_aggr.mca_arg;
4003 4008 if (find_tx_ring_fn(arg, mp_chain, fanout_hint, &ring) == NULL)
4004 4009 return (0);
4005 4010 sringp = srs_tx->st_soft_rings[((mac_ring_t *)ring)->mr_index];
4006 4011 return (mac_tx_soft_ring_process(sringp, mp_chain, flag, ret_mp));
4007 4012 }
4008 4013
4009 4014 void
4010 4015 mac_tx_invoke_callbacks(mac_client_impl_t *mcip, mac_tx_cookie_t cookie)
4011 4016 {
4012 4017 mac_cb_t *mcb;
4013 4018 mac_tx_notify_cb_t *mtnfp;
4014 4019
4015 4020 /* Wakeup callback registered clients */
4016 4021 MAC_CALLBACK_WALKER_INC(&mcip->mci_tx_notify_cb_info);
4017 4022 for (mcb = mcip->mci_tx_notify_cb_list; mcb != NULL;
4018 4023 mcb = mcb->mcb_nextp) {
4019 4024 mtnfp = (mac_tx_notify_cb_t *)mcb->mcb_objp;
4020 4025 mtnfp->mtnf_fn(mtnfp->mtnf_arg, cookie);
4021 4026 }
4022 4027 MAC_CALLBACK_WALKER_DCR(&mcip->mci_tx_notify_cb_info,
4023 4028 &mcip->mci_tx_notify_cb_list);
4024 4029 }
4025 4030
4026 4031 /* ARGSUSED */
4027 4032 void
4028 4033 mac_tx_srs_drain(mac_soft_ring_set_t *mac_srs, uint_t proc_type)
4029 4034 {
4030 4035 mblk_t *head, *tail;
4031 4036 size_t sz;
4032 4037 uint32_t tx_mode;
4033 4038 uint_t saved_pkt_count;
4034 4039 mac_tx_stats_t stats;
4035 4040 mac_srs_tx_t *srs_tx = &mac_srs->srs_tx;
4036 4041 clock_t now;
4037 4042
4038 4043 saved_pkt_count = 0;
4039 4044 ASSERT(mutex_owned(&mac_srs->srs_lock));
4040 4045 ASSERT(!(mac_srs->srs_state & SRS_PROC));
4041 4046
4042 4047 mac_srs->srs_state |= SRS_PROC;
4043 4048
4044 4049 tx_mode = srs_tx->st_mode;
4045 4050 if (tx_mode == SRS_TX_DEFAULT || tx_mode == SRS_TX_SERIALIZE) {
4046 4051 if (mac_srs->srs_first != NULL) {
4047 4052 head = mac_srs->srs_first;
4048 4053 tail = mac_srs->srs_last;
4049 4054 saved_pkt_count = mac_srs->srs_count;
4050 4055 mac_srs->srs_first = NULL;
4051 4056 mac_srs->srs_last = NULL;
4052 4057 mac_srs->srs_count = 0;
4053 4058 mutex_exit(&mac_srs->srs_lock);
4054 4059
4055 4060 head = mac_tx_send(srs_tx->st_arg1, srs_tx->st_arg2,
4056 4061 head, &stats);
4057 4062
4058 4063 mutex_enter(&mac_srs->srs_lock);
4059 4064 if (head != NULL) {
4060 4065 /* Device out of tx desc, set block */
4061 4066 if (head->b_next == NULL)
4062 4067 VERIFY(head == tail);
4063 4068 tail->b_next = mac_srs->srs_first;
4064 4069 mac_srs->srs_first = head;
4065 4070 mac_srs->srs_count +=
4066 4071 (saved_pkt_count - stats.mts_opackets);
4067 4072 if (mac_srs->srs_last == NULL)
4068 4073 mac_srs->srs_last = tail;
4069 4074 MAC_TX_SRS_BLOCK(mac_srs, head);
4070 4075 } else {
4071 4076 srs_tx->st_woken_up = B_FALSE;
4072 4077 SRS_TX_STATS_UPDATE(mac_srs, &stats);
4073 4078 }
4074 4079 }
4075 4080 } else if (tx_mode == SRS_TX_BW) {
4076 4081 /*
4077 4082 * We are here because the timer fired and we have some data
4078 4083 * to transmit. Also mac_tx_srs_worker should have reset
4079 4084 * SRS_BW_ENFORCED flag
4080 4085 */
4081 4086 ASSERT(!(mac_srs->srs_bw->mac_bw_state & SRS_BW_ENFORCED));
4082 4087 head = tail = mac_srs->srs_first;
4083 4088 while (mac_srs->srs_first != NULL) {
4084 4089 tail = mac_srs->srs_first;
4085 4090 tail->b_prev = NULL;
4086 4091 mac_srs->srs_first = tail->b_next;
4087 4092 if (mac_srs->srs_first == NULL)
4088 4093 mac_srs->srs_last = NULL;
4089 4094 mac_srs->srs_count--;
4090 4095 sz = msgdsize(tail);
4091 4096 mac_srs->srs_size -= sz;
4092 4097 saved_pkt_count++;
4093 4098 MAC_TX_UPDATE_BW_INFO(mac_srs, sz);
4094 4099
4095 4100 if (mac_srs->srs_bw->mac_bw_used <
4096 4101 mac_srs->srs_bw->mac_bw_limit)
4097 4102 continue;
4098 4103
4099 4104 now = ddi_get_lbolt();
4100 4105 if (mac_srs->srs_bw->mac_bw_curr_time != now) {
4101 4106 mac_srs->srs_bw->mac_bw_curr_time = now;
4102 4107 mac_srs->srs_bw->mac_bw_used = sz;
4103 4108 continue;
4104 4109 }
4105 4110 mac_srs->srs_bw->mac_bw_state |= SRS_BW_ENFORCED;
4106 4111 break;
4107 4112 }
4108 4113
4109 4114 ASSERT((head == NULL && tail == NULL) ||
4110 4115 (head != NULL && tail != NULL));
4111 4116 if (tail != NULL) {
4112 4117 tail->b_next = NULL;
4113 4118 mutex_exit(&mac_srs->srs_lock);
4114 4119
4115 4120 head = mac_tx_send(srs_tx->st_arg1, srs_tx->st_arg2,
4116 4121 head, &stats);
4117 4122
4118 4123 mutex_enter(&mac_srs->srs_lock);
4119 4124 if (head != NULL) {
4120 4125 uint_t size_sent;
4121 4126
4122 4127 /* Device out of tx desc, set block */
4123 4128 if (head->b_next == NULL)
4124 4129 VERIFY(head == tail);
4125 4130 tail->b_next = mac_srs->srs_first;
4126 4131 mac_srs->srs_first = head;
4127 4132 mac_srs->srs_count +=
4128 4133 (saved_pkt_count - stats.mts_opackets);
4129 4134 if (mac_srs->srs_last == NULL)
4130 4135 mac_srs->srs_last = tail;
4131 4136 size_sent = sz - stats.mts_obytes;
4132 4137 mac_srs->srs_size += size_sent;
4133 4138 mac_srs->srs_bw->mac_bw_sz += size_sent;
4134 4139 if (mac_srs->srs_bw->mac_bw_used > size_sent) {
4135 4140 mac_srs->srs_bw->mac_bw_used -=
4136 4141 size_sent;
4137 4142 } else {
4138 4143 mac_srs->srs_bw->mac_bw_used = 0;
4139 4144 }
4140 4145 MAC_TX_SRS_BLOCK(mac_srs, head);
4141 4146 } else {
4142 4147 srs_tx->st_woken_up = B_FALSE;
4143 4148 SRS_TX_STATS_UPDATE(mac_srs, &stats);
4144 4149 }
4145 4150 }
4146 4151 } else if (tx_mode == SRS_TX_BW_FANOUT || tx_mode == SRS_TX_BW_AGGR) {
4147 4152 mblk_t *prev;
4148 4153 uint64_t hint;
4149 4154
4150 4155 /*
4151 4156 * We are here because the timer fired and we
4152 4157 		 * have some quota to transmit.
4153 4158 */
4154 4159 prev = NULL;
4155 4160 head = tail = mac_srs->srs_first;
4156 4161 while (mac_srs->srs_first != NULL) {
4157 4162 tail = mac_srs->srs_first;
4158 4163 mac_srs->srs_first = tail->b_next;
4159 4164 if (mac_srs->srs_first == NULL)
4160 4165 mac_srs->srs_last = NULL;
4161 4166 mac_srs->srs_count--;
4162 4167 sz = msgdsize(tail);
4163 4168 mac_srs->srs_size -= sz;
4164 4169 mac_srs->srs_bw->mac_bw_used += sz;
4165 4170 if (prev == NULL)
4166 4171 hint = (ulong_t)tail->b_prev;
4167 4172 if (hint != (ulong_t)tail->b_prev) {
4168 4173 prev->b_next = NULL;
4169 4174 mutex_exit(&mac_srs->srs_lock);
4170 4175 TX_SRS_TO_SOFT_RING(mac_srs, head, hint);
4171 4176 head = tail;
4172 4177 hint = (ulong_t)tail->b_prev;
4173 4178 mutex_enter(&mac_srs->srs_lock);
4174 4179 }
4175 4180
4176 4181 prev = tail;
4177 4182 tail->b_prev = NULL;
4178 4183 if (mac_srs->srs_bw->mac_bw_used <
4179 4184 mac_srs->srs_bw->mac_bw_limit)
4180 4185 continue;
4181 4186
4182 4187 now = ddi_get_lbolt();
4183 4188 if (mac_srs->srs_bw->mac_bw_curr_time != now) {
4184 4189 mac_srs->srs_bw->mac_bw_curr_time = now;
4185 4190 mac_srs->srs_bw->mac_bw_used = 0;
4186 4191 continue;
4187 4192 }
4188 4193 mac_srs->srs_bw->mac_bw_state |= SRS_BW_ENFORCED;
4189 4194 break;
4190 4195 }
4191 4196 ASSERT((head == NULL && tail == NULL) ||
4192 4197 (head != NULL && tail != NULL));
4193 4198 if (tail != NULL) {
4194 4199 tail->b_next = NULL;
4195 4200 mutex_exit(&mac_srs->srs_lock);
4196 4201 TX_SRS_TO_SOFT_RING(mac_srs, head, hint);
4197 4202 mutex_enter(&mac_srs->srs_lock);
4198 4203 }
4199 4204 }
4200 4205 /*
4201 4206 * SRS_TX_FANOUT case not considered here because packets
4202 4207 * won't be queued in the SRS for this case. Packets will
4203 4208 * be sent directly to soft rings underneath and if there
4204 4209 * is any queueing at all, it would be in Tx side soft
4205 4210 * rings.
4206 4211 */
4207 4212
4208 4213 /*
4209 4214 * When srs_count becomes 0, reset SRS_TX_HIWAT and
4210 4215 * SRS_TX_WAKEUP_CLIENT and wakeup registered clients.
4211 4216 */
4212 4217 if (mac_srs->srs_count == 0 && (mac_srs->srs_state &
4213 4218 (SRS_TX_HIWAT | SRS_TX_WAKEUP_CLIENT | SRS_ENQUEUED))) {
4214 4219 mac_client_impl_t *mcip = mac_srs->srs_mcip;
4215 4220 boolean_t wakeup_required = B_FALSE;
4216 4221
4217 4222 if (mac_srs->srs_state &
4218 4223 (SRS_TX_HIWAT|SRS_TX_WAKEUP_CLIENT)) {
4219 4224 wakeup_required = B_TRUE;
4220 4225 }
4221 4226 mac_srs->srs_state &= ~(SRS_TX_HIWAT |
4222 4227 SRS_TX_WAKEUP_CLIENT | SRS_ENQUEUED);
4223 4228 mutex_exit(&mac_srs->srs_lock);
4224 4229 if (wakeup_required) {
4225 4230 mac_tx_invoke_callbacks(mcip, (mac_tx_cookie_t)mac_srs);
4226 4231 /*
4227 4232 * If the client is not the primary MAC client, then we
4228 4233 * need to send the notification to the clients upper
4229 4234 * MAC, i.e. mci_upper_mip.
4230 4235 */
4231 4236 mac_tx_notify(mcip->mci_upper_mip != NULL ?
4232 4237 mcip->mci_upper_mip : mcip->mci_mip);
4233 4238 }
4234 4239 mutex_enter(&mac_srs->srs_lock);
4235 4240 }
4236 4241 mac_srs->srs_state &= ~SRS_PROC;
4237 4242 }
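
/*
 * Editor's sketch (hypothetical helper): both bandwidth branches above
 * meter usage per lbolt tick -- usage accumulates until it crosses
 * mac_bw_limit, and a new tick opens a fresh accounting window.
 * Condensed, the SRS_TX_BW check is roughly:
 */
static boolean_t
sample_bw_exceeded(mac_bw_ctl_t *bw, size_t sz)
{
	bw->mac_bw_used += sz;
	if (bw->mac_bw_used < bw->mac_bw_limit)
		return (B_FALSE);		/* still under quota */

	if (bw->mac_bw_curr_time != ddi_get_lbolt()) {
		/* A new tick: restart accounting for this window. */
		bw->mac_bw_curr_time = ddi_get_lbolt();
		bw->mac_bw_used = sz;
		return (B_FALSE);
	}
	return (B_TRUE);	/* caller sets SRS_BW_ENFORCED and stops */
}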
4238 4243
4239 4244 /*
4240 4245 * Given a packet, get the flow_entry that identifies the flow
4241 4246 * to which that packet belongs. The flow_entry will contain
4242 4247 * the transmit function to be used to send the packet. If the
4243 4248 * function returns NULL, the packet should be sent using the
4244 4249 * underlying NIC.
4245 4250 */
4246 4251 static flow_entry_t *
4247 4252 mac_tx_classify(mac_impl_t *mip, mblk_t *mp)
4248 4253 {
4249 4254 flow_entry_t *flent = NULL;
4250 4255 mac_client_impl_t *mcip;
4251 4256 int err;
4252 4257
4253 4258 /*
4254 4259 * Do classification on the packet.
4255 4260 */
4256 4261 err = mac_flow_lookup(mip->mi_flow_tab, mp, FLOW_OUTBOUND, &flent);
4257 4262 if (err != 0)
4258 4263 return (NULL);
4259 4264
4260 4265 /*
4261 4266 * This flent might just be an additional one on the MAC client,
4262 4267 	 * i.e. for classification purposes (different fdesc); however,
4263 4268 	 * the resources, SRS et al., are in the mci_flent, so if
4264 4269 * this isn't the mci_flent, we need to get it.
4265 4270 */
4266 4271 if ((mcip = flent->fe_mcip) != NULL && mcip->mci_flent != flent) {
4267 4272 FLOW_REFRELE(flent);
4268 4273 flent = mcip->mci_flent;
4269 4274 FLOW_TRY_REFHOLD(flent, err);
4270 4275 if (err != 0)
4271 4276 return (NULL);
4272 4277 }
4273 4278
4274 4279 return (flent);
4275 4280 }
4276 4281
4277 4282 /*
4278 4283 * This macro is only meant to be used by mac_tx_send().
4279 4284 */
4280 4285 #define CHECK_VID_AND_ADD_TAG(mp) { \
4281 4286 if (vid_check) { \
4282 4287 int err = 0; \
4283 4288 \
4284 4289 MAC_VID_CHECK(src_mcip, (mp), err); \
4285 4290 if (err != 0) { \
4286 4291 freemsg((mp)); \
4287 4292 (mp) = next; \
4288 4293 oerrors++; \
4289 4294 continue; \
4290 4295 } \
4291 4296 } \
4292 4297 if (add_tag) { \
4293 4298 (mp) = mac_add_vlan_tag((mp), 0, vid); \
4294 4299 if ((mp) == NULL) { \
4295 4300 (mp) = next; \
4296 4301 oerrors++; \
4297 4302 continue; \
4298 4303 } \
4299 4304 } \
4300 4305 }
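
/*
 * Editor's note (hypothetical helper, shown for illustration only):
 * the tag that mac_add_vlan_tag() inserts carries the VID in the low
 * 12 bits of the 16-bit TCI, with priority and CFI above it, similar
 * in spirit to the VLAN_TCI() macro:
 */
static uint16_t
sample_mk_tci(uint8_t pri, uint8_t cfi, uint16_t vid)
{
	return ((uint16_t)(((pri & 0x7) << 13) |
	    ((cfi & 0x1) << 12) | (vid & 0xfff)));
}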
4301 4306
4302 4307 mblk_t *
4303 4308 mac_tx_send(mac_client_handle_t mch, mac_ring_handle_t ring, mblk_t *mp_chain,
4304 4309 mac_tx_stats_t *stats)
4305 4310 {
4306 4311 mac_client_impl_t *src_mcip = (mac_client_impl_t *)mch;
4307 4312 mac_impl_t *mip = src_mcip->mci_mip;
4308 4313 uint_t obytes = 0, opackets = 0, oerrors = 0;
4309 4314 mblk_t *mp = NULL, *next;
4310 4315 boolean_t vid_check, add_tag;
4311 4316 uint16_t vid = 0;
4312 4317
4313 4318 if (mip->mi_nclients > 1) {
4314 4319 vid_check = MAC_VID_CHECK_NEEDED(src_mcip);
4315 4320 add_tag = MAC_TAG_NEEDED(src_mcip);
4316 4321 if (add_tag)
4317 4322 vid = mac_client_vid(mch);
4318 4323 } else {
4319 4324 ASSERT(mip->mi_nclients == 1);
4320 4325 vid_check = add_tag = B_FALSE;
4321 4326 }
4322 4327
4323 4328 /*
4324 4329 * Fastpath: if there's only one client, we simply send
4325 4330 * the packet down to the underlying NIC.
4326 4331 */
4327 4332 if (mip->mi_nactiveclients == 1) {
4328 4333 DTRACE_PROBE2(fastpath,
4329 4334 mac_client_impl_t *, src_mcip, mblk_t *, mp_chain);
4330 4335
4331 4336 mp = mp_chain;
4332 4337 while (mp != NULL) {
4333 4338 next = mp->b_next;
4334 4339 mp->b_next = NULL;
4335 4340 opackets++;
4336 4341 obytes += (mp->b_cont == NULL ? MBLKL(mp) :
4337 4342 msgdsize(mp));
4338 4343
4339 4344 CHECK_VID_AND_ADD_TAG(mp);
4340 4345 MAC_TX(mip, ring, mp, src_mcip);
4341 4346
4342 4347 /*
4343 4348 * If the driver is out of descriptors and does a
4344 4349 * partial send it will return a chain of unsent
4345 4350 * mblks. Adjust the accounting stats.
4346 4351 */
4347 4352 if (mp != NULL) {
4348 4353 opackets--;
4349 4354 obytes -= msgdsize(mp);
4350 4355 mp->b_next = next;
4351 4356 break;
4352 4357 }
4353 4358 mp = next;
4354 4359 }
4355 4360 goto done;
4356 4361 }
4357 4362
4358 4363 /*
4359 4364 	 * No fastpath: we either have more than one MAC client
4360 4365 	 * defined on top of the same MAC, or one or more MAC
4361 4366 	 * clients with promiscuous callbacks.
4362 4367 */
4363 4368 DTRACE_PROBE3(slowpath, mac_client_impl_t *,
4364 4369 src_mcip, int, mip->mi_nclients, mblk_t *, mp_chain);
4365 4370
4366 4371 mp = mp_chain;
4367 4372 while (mp != NULL) {
4368 4373 flow_entry_t *dst_flow_ent;
4369 4374 void *flow_cookie;
4370 4375 size_t pkt_size;
4371 4376 mblk_t *mp1;
4372 4377
4373 4378 next = mp->b_next;
4374 4379 mp->b_next = NULL;
4375 4380 opackets++;
4376 4381 pkt_size = (mp->b_cont == NULL ? MBLKL(mp) : msgdsize(mp));
4377 4382 obytes += pkt_size;
4378 4383 CHECK_VID_AND_ADD_TAG(mp);
4379 4384
4380 4385 /*
4381 4386 * Find the destination.
4382 4387 */
4383 4388 dst_flow_ent = mac_tx_classify(mip, mp);
4384 4389
4385 4390 if (dst_flow_ent != NULL) {
4386 4391 size_t hdrsize;
4387 4392 int err = 0;
4388 4393
4389 4394 if (mip->mi_info.mi_nativemedia == DL_ETHER) {
4390 4395 struct ether_vlan_header *evhp =
4391 4396 (struct ether_vlan_header *)mp->b_rptr;
4392 4397
4393 4398 if (ntohs(evhp->ether_tpid) == ETHERTYPE_VLAN)
4394 4399 hdrsize = sizeof (*evhp);
4395 4400 else
4396 4401 hdrsize = sizeof (struct ether_header);
4397 4402 } else {
4398 4403 mac_header_info_t mhi;
4399 4404
4400 4405 err = mac_header_info((mac_handle_t)mip,
4401 4406 mp, &mhi);
4402 4407 if (err == 0)
4403 4408 hdrsize = mhi.mhi_hdrsize;
4404 4409 }
4405 4410
4406 4411 /*
4407 4412 * Got a matching flow. It's either another
4408 4413 * MAC client, or a broadcast/multicast flow.
4409 4414 * Make sure the packet size is within the
4410 4415 			 * allowed size. If not, drop the packet and
4411 4416 * move to next packet.
4412 4417 */
4413 4418 if (err != 0 ||
4414 4419 (pkt_size - hdrsize) > mip->mi_sdu_max) {
4415 4420 oerrors++;
4416 4421 DTRACE_PROBE2(loopback__drop, size_t, pkt_size,
4417 4422 mblk_t *, mp);
4418 4423 freemsg(mp);
4419 4424 mp = next;
4420 4425 FLOW_REFRELE(dst_flow_ent);
4421 4426 continue;
4422 4427 }
4423 4428 flow_cookie = mac_flow_get_client_cookie(dst_flow_ent);
4424 4429 if (flow_cookie != NULL) {
4425 4430 /*
4426 4431 * The vnic_bcast_send function expects
4427 4432 * to receive the sender MAC client
4428 4433 * as value for arg2.
4429 4434 */
4430 4435 mac_bcast_send(flow_cookie, src_mcip, mp,
4431 4436 B_TRUE);
4432 4437 } else {
4433 4438 /*
4434 4439 				 * Loop back the packet to a local MAC
4435 4440 * client. We force a context switch
4436 4441 * if both source and destination MAC
4437 4442 * clients are used by IP, i.e.
4438 4443 * bypass is set.
4439 4444 */
4440 4445 boolean_t do_switch;
4441 4446 mac_client_impl_t *dst_mcip =
4442 4447 dst_flow_ent->fe_mcip;
4443 4448
4444 4449 /*
4445 4450 * Check if there are promiscuous mode
4446 4451 * callbacks defined. This check is
4447 4452 * done here in the 'else' case and
4448 4453 * not in other cases because this
4449 4454 * path is for local loopback
4450 4455 * communication which does not go
4451 4456 * through MAC_TX(). For paths that go
4452 4457 * through MAC_TX(), the promisc_list
4453 4458 * check is done inside the MAC_TX()
4454 4459 * macro.
4455 4460 */
4456 4461 if (mip->mi_promisc_list != NULL)
4457 4462 mac_promisc_dispatch(mip, mp, src_mcip);
4458 4463
4459 4464 do_switch = ((src_mcip->mci_state_flags &
4460 4465 dst_mcip->mci_state_flags &
4461 4466 MCIS_CLIENT_POLL_CAPABLE) != 0);
4462 4467
4463 4468 if ((mp1 = mac_fix_cksum(mp)) != NULL) {
4464 4469 (dst_flow_ent->fe_cb_fn)(
4465 4470 dst_flow_ent->fe_cb_arg1,
4466 4471 dst_flow_ent->fe_cb_arg2,
4467 4472 mp1, do_switch);
4468 4473 }
4469 4474 }
4470 4475 FLOW_REFRELE(dst_flow_ent);
4471 4476 } else {
4472 4477 /*
4473 4478 * Unknown destination, send via the underlying
4474 4479 * NIC.
4475 4480 */
4476 4481 MAC_TX(mip, ring, mp, src_mcip);
4477 4482 if (mp != NULL) {
4478 4483 /*
4479 4484 * Adjust for the last packet that
4480 4485 * could not be transmitted
4481 4486 */
4482 4487 opackets--;
4483 4488 obytes -= pkt_size;
4484 4489 mp->b_next = next;
4485 4490 break;
4486 4491 }
4487 4492 }
4488 4493 mp = next;
4489 4494 }
4490 4495
4491 4496 done:
4492 4497 stats->mts_obytes = obytes;
4493 4498 stats->mts_opackets = opackets;
4494 4499 stats->mts_oerrors = oerrors;
4495 4500 return (mp);
4496 4501 }
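
/*
 * Editor's note (hypothetical helper): the byte accounting above
 * avoids walking the mblk chain when it can -- a single-mblk packet
 * is measured with pointer arithmetic, while a multi-mblk packet
 * needs msgdsize():
 */
static size_t
sample_pkt_size(mblk_t *mp)
{
	return (mp->b_cont == NULL ? MBLKL(mp) : msgdsize(mp));
}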
4497 4502
4498 4503 /*
4499 4504 * mac_tx_srs_ring_present
4500 4505 *
4501 4506 * Returns whether the specified ring is part of the specified SRS.
4502 4507 */
4503 4508 boolean_t
4504 4509 mac_tx_srs_ring_present(mac_soft_ring_set_t *srs, mac_ring_t *tx_ring)
4505 4510 {
4506 4511 int i;
4507 4512 mac_soft_ring_t *soft_ring;
4508 4513
4509 4514 if (srs->srs_tx.st_arg2 == tx_ring)
4510 4515 return (B_TRUE);
4511 4516
4512 4517 for (i = 0; i < srs->srs_tx_ring_count; i++) {
4513 4518 soft_ring = srs->srs_tx_soft_rings[i];
4514 4519 if (soft_ring->s_ring_tx_arg2 == tx_ring)
4515 4520 return (B_TRUE);
4516 4521 }
4517 4522
4518 4523 return (B_FALSE);
4519 4524 }
4520 4525
4521 4526 /*
4522 4527 * mac_tx_srs_get_soft_ring
4523 4528 *
4524 4529 * Returns the TX soft ring associated with the given ring, if present.
4525 4530 */
4526 4531 mac_soft_ring_t *
4527 4532 mac_tx_srs_get_soft_ring(mac_soft_ring_set_t *srs, mac_ring_t *tx_ring)
4528 4533 {
4529 4534 int i;
4530 4535 mac_soft_ring_t *soft_ring;
4531 4536
4532 4537 if (srs->srs_tx.st_arg2 == tx_ring)
4533 4538 return (NULL);
4534 4539
4535 4540 for (i = 0; i < srs->srs_tx_ring_count; i++) {
4536 4541 soft_ring = srs->srs_tx_soft_rings[i];
4537 4542 if (soft_ring->s_ring_tx_arg2 == tx_ring)
4538 4543 return (soft_ring);
4539 4544 }
4540 4545
4541 4546 return (NULL);
4542 4547 }
4543 4548
4544 4549 /*
4545 4550 * mac_tx_srs_wakeup
4546 4551 *
4547 4552  * Called when Tx descriptors become available. Wake up the
4548 4553  * appropriate worker thread after resetting the
4549 4554  * SRS_TX_BLOCKED/S_RING_BLOCK bit in the state field.
4550 4555 */
4551 4556 void
4552 4557 mac_tx_srs_wakeup(mac_soft_ring_set_t *mac_srs, mac_ring_handle_t ring)
4553 4558 {
4554 4559 int i;
4555 4560 mac_soft_ring_t *sringp;
4556 4561 mac_srs_tx_t *srs_tx = &mac_srs->srs_tx;
4557 4562
4558 4563 mutex_enter(&mac_srs->srs_lock);
4559 4564 /*
4560 4565 * srs_tx_ring_count == 0 is the single ring mode case. In
4561 4566 	 * this mode, there will be no Tx soft rings associated
4562 4567 * with the SRS.
4563 4568 */
4564 4569 if (!MAC_TX_SOFT_RINGS(mac_srs)) {
4565 4570 if (srs_tx->st_arg2 == ring &&
4566 4571 mac_srs->srs_state & SRS_TX_BLOCKED) {
4567 4572 mac_srs->srs_state &= ~SRS_TX_BLOCKED;
4568 4573 srs_tx->st_stat.mts_unblockcnt++;
4569 4574 cv_signal(&mac_srs->srs_async);
4570 4575 }
4571 4576 /*
4572 4577 * A wakeup can come before tx_srs_drain() could
4573 4578 * grab srs lock and set SRS_TX_BLOCKED. So
4574 4579 * always set woken_up flag when we come here.
4575 4580 */
4576 4581 srs_tx->st_woken_up = B_TRUE;
4577 4582 mutex_exit(&mac_srs->srs_lock);
4578 4583 return;
4579 4584 }
4580 4585
4581 4586 /*
4582 4587 	 * If we are here, it is the FANOUT, BW_FANOUT,
4583 4588 	 * AGGR_MODE or AGGR_BW_MODE case.
4584 4589 */
4585 4590 for (i = 0; i < mac_srs->srs_tx_ring_count; i++) {
4586 4591 sringp = mac_srs->srs_tx_soft_rings[i];
4587 4592 mutex_enter(&sringp->s_ring_lock);
4588 4593 if (sringp->s_ring_tx_arg2 == ring) {
4589 4594 if (sringp->s_ring_state & S_RING_BLOCK) {
4590 4595 sringp->s_ring_state &= ~S_RING_BLOCK;
4591 4596 sringp->s_st_stat.mts_unblockcnt++;
4592 4597 cv_signal(&sringp->s_ring_async);
4593 4598 }
4594 4599 sringp->s_ring_tx_woken_up = B_TRUE;
4595 4600 }
4596 4601 mutex_exit(&sringp->s_ring_lock);
4597 4602 }
4598 4603 mutex_exit(&mac_srs->srs_lock);
4599 4604 }
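
/*
 * Editor's sketch (hypothetical driver code): the wakeup above is
 * driven from the GLDv3 driver side. When a driver reclaims Tx
 * descriptors it calls mac_tx_ring_update(), and MAC eventually
 * invokes mac_tx_srs_wakeup() for the SRSes using that ring.
 */
static void
sample_tx_reclaim_done(mac_handle_t mh, mac_ring_handle_t rh,
    uint_t descs_freed, boolean_t was_stalled)
{
	/* Only poke MAC when we actually relieved a stall. */
	if (descs_freed > 0 && was_stalled)
		mac_tx_ring_update(mh, rh);
}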
4600 4605
4601 4606 /*
4602 4607 * Once the driver is done draining, send a MAC_NOTE_TX notification to unleash
4603 4608 * the blocked clients again.
4604 4609 */
4605 4610 void
4606 4611 mac_tx_notify(mac_impl_t *mip)
4607 4612 {
4608 4613 i_mac_notify(mip, MAC_NOTE_TX);
4609 4614 }
4610 4615
4611 4616 /*
4612 4617 * RX SOFTRING RELATED FUNCTIONS
4613 4618 *
4614 4619  * These functions really belong in mac_soft_ring.c and are here
4615 4620  * only for a short period.
4616 4621 */
4617 4622
4618 4623 #define SOFT_RING_ENQUEUE_CHAIN(ringp, mp, tail, cnt, sz) { \
4619 4624 /* \
4620 4625 * Enqueue our mblk chain. \
4621 4626 */ \
4622 4627 ASSERT(MUTEX_HELD(&(ringp)->s_ring_lock)); \
4623 4628 \
4624 4629 if ((ringp)->s_ring_last != NULL) \
4625 4630 (ringp)->s_ring_last->b_next = (mp); \
4626 4631 else \
4627 4632 (ringp)->s_ring_first = (mp); \
4628 4633 (ringp)->s_ring_last = (tail); \
4629 4634 (ringp)->s_ring_count += (cnt); \
4630 4635 ASSERT((ringp)->s_ring_count > 0); \
4631 4636 if ((ringp)->s_ring_type & ST_RING_BW_CTL) { \
4632 4637 (ringp)->s_ring_size += sz; \
4633 4638 } \
4634 4639 }
4635 4640
4636 4641 /*
4637 4642 * Default entry point to deliver a packet chain to a MAC client.
4638 4643 * If the MAC client has flows, do the classification with these
4639 4644 * flows as well.
4640 4645 */
4641 4646 /* ARGSUSED */
4642 4647 void
4643 4648 mac_rx_deliver(void *arg1, mac_resource_handle_t mrh, mblk_t *mp_chain,
4644 4649 mac_header_info_t *arg3)
4645 4650 {
4646 4651 mac_client_impl_t *mcip = arg1;
4647 4652
4648 4653 if (mcip->mci_nvids == 1 &&
4649 4654 !(mcip->mci_state_flags & MCIS_STRIP_DISABLE)) {
4650 4655 /*
4651 4656 * If the client has exactly one VID associated with it
4652 4657 * and striping of VLAN header is not disabled,
4653 4658 * remove the VLAN tag from the packet before
4654 4659 * passing it on to the client's receive callback.
4655 4660 * Note that this needs to be done after we dispatch
4656 4661 * the packet to the promiscuous listeners of the
4657 4662 * client, since they expect to see the whole
4658 4663 * frame including the VLAN headers.
4664 + *
4665 + * The MCIS_STRIP_DISABLE is only issued when sun4v
4666 + * vsw is in play.
4659 4667 */
4660 4668 mp_chain = mac_strip_vlan_tag_chain(mp_chain);
4661 4669 }
4662 4670
4663 4671 mcip->mci_rx_fn(mcip->mci_rx_arg, mrh, mp_chain, B_FALSE);
4664 4672 }
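
/*
 * Editor's note (hypothetical helper): the strip decision above only
 * matters for tagged frames. A frame is recognized as tagged when the
 * TPID field of its ether_vlan_header holds ETHERTYPE_VLAN:
 */
static boolean_t
sample_is_tagged(mblk_t *mp)
{
	struct ether_vlan_header *evhp =
	    (struct ether_vlan_header *)mp->b_rptr;

	/* Need the full VLAN header in the first mblk to inspect it. */
	return (MBLKL(mp) >= sizeof (*evhp) &&
	    ntohs(evhp->ether_tpid) == ETHERTYPE_VLAN);
}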
4665 4673
4666 4674 /*
4667 - * mac_rx_soft_ring_process
4675 + * Process a chain for a given soft ring. If the number of packets
4676 + * queued in the SRS and its associated soft rings (including this
4677 + * one) is very small (tracked by srs_poll_pkt_cnt) then allow the
4678 + * entering thread (interrupt or poll thread) to process the chain
4679 + * inline. This is meant to reduce latency under low load.
4668 4680 *
4669 - * process a chain for a given soft ring. The number of packets queued
4670 - * in the SRS and its associated soft rings (including this one) is
4671 - * very small (tracked by srs_poll_pkt_cnt), then allow the entering
4672 - * thread (interrupt or poll thread) to do inline processing. This
4673 - * helps keep the latency down under low load.
4674 - *
4675 4681 * The proc and arg for each mblk is already stored in the mblk in
4676 4682 * appropriate places.
4677 4683 */
4678 4684 /* ARGSUSED */
4679 4685 void
4680 4686 mac_rx_soft_ring_process(mac_client_impl_t *mcip, mac_soft_ring_t *ringp,
4681 4687 mblk_t *mp_chain, mblk_t *tail, int cnt, size_t sz)
4682 4688 {
4683 4689 mac_direct_rx_t proc;
4684 4690 void *arg1;
4685 4691 mac_resource_handle_t arg2;
4686 4692 mac_soft_ring_set_t *mac_srs = ringp->s_ring_set;
4687 4693
4688 4694 ASSERT(ringp != NULL);
4689 4695 ASSERT(mp_chain != NULL);
4690 4696 ASSERT(tail != NULL);
4691 4697 ASSERT(MUTEX_NOT_HELD(&ringp->s_ring_lock));
4692 4698
4693 4699 mutex_enter(&ringp->s_ring_lock);
4694 4700 ringp->s_ring_total_inpkt += cnt;
4695 4701 ringp->s_ring_total_rbytes += sz;
4696 4702 if ((mac_srs->srs_rx.sr_poll_pkt_cnt <= 1) &&
4697 4703 !(ringp->s_ring_type & ST_RING_WORKER_ONLY)) {
4698 4704 		/* If already processing or blanking is on, enqueue and return */
4699 4705 if (ringp->s_ring_state & S_RING_BLANK ||
4700 4706 ringp->s_ring_state & S_RING_PROC) {
4701 4707 SOFT_RING_ENQUEUE_CHAIN(ringp, mp_chain, tail, cnt, sz);
4702 4708 mutex_exit(&ringp->s_ring_lock);
4703 4709 return;
4704 4710 }
4705 4711 proc = ringp->s_ring_rx_func;
4706 4712 arg1 = ringp->s_ring_rx_arg1;
4707 4713 arg2 = ringp->s_ring_rx_arg2;
4708 4714 /*
4709 4715 * See if anything is already queued. If we are the
4710 4716 * first packet, do inline processing else queue the
4711 4717 * packet and do the drain.
4712 4718 */
4713 4719 if (ringp->s_ring_first == NULL) {
4714 4720 /*
4715 4721 * Fast-path, ok to process and nothing queued.
4716 4722 */
4717 4723 ringp->s_ring_run = curthread;
4718 4724 ringp->s_ring_state |= (S_RING_PROC);
4719 4725
4720 4726 mutex_exit(&ringp->s_ring_lock);
4721 4727
4722 4728 /*
4723 4729 * We are the chain of 1 packet so
4724 4730 * go through this fast path.
4725 4731 */
4726 4732 ASSERT(mp_chain->b_next == NULL);
4727 4733
4728 4734 (*proc)(arg1, arg2, mp_chain, NULL);
4729 4735
4730 4736 ASSERT(MUTEX_NOT_HELD(&ringp->s_ring_lock));
4731 4737 /*
4732 - * If we have a soft ring set which is doing
4733 - * bandwidth control, we need to decrement
4734 - * srs_size and count so it the SRS can have a
4735 - * accurate idea of what is the real data
4736 - * queued between SRS and its soft rings. We
4737 - * decrement the counters only when the packet
4738 - * gets processed by both SRS and the soft ring.
4738 + * If we have an SRS performing bandwidth
4739 + * control then we need to decrement the size
4740 + * and count so the SRS has an accurate count
4741 + * of the data queued between the SRS and its
4742 + * soft rings. We decrement the counters only
4743 + * when the packet is processed by both the
4744 + * SRS and the soft ring.
4739 4745 */
4740 4746 mutex_enter(&mac_srs->srs_lock);
4741 4747 MAC_UPDATE_SRS_COUNT_LOCKED(mac_srs, cnt);
4742 4748 MAC_UPDATE_SRS_SIZE_LOCKED(mac_srs, sz);
4743 4749 mutex_exit(&mac_srs->srs_lock);
4744 4750
4745 4751 mutex_enter(&ringp->s_ring_lock);
4746 4752 ringp->s_ring_run = NULL;
4747 4753 ringp->s_ring_state &= ~S_RING_PROC;
4748 4754 if (ringp->s_ring_state & S_RING_CLIENT_WAIT)
4749 4755 cv_signal(&ringp->s_ring_client_cv);
4750 4756
4751 4757 if ((ringp->s_ring_first == NULL) ||
4752 4758 (ringp->s_ring_state & S_RING_BLANK)) {
4753 4759 /*
4754 - * We processed inline our packet and
4755 - * nothing new has arrived or our
4760 + * We processed a single packet inline
4761 + * and nothing new has arrived or our
4756 4762 * receiver doesn't want to receive
4757 4763 * any packets. We are done.
4758 4764 */
4759 4765 mutex_exit(&ringp->s_ring_lock);
4760 4766 return;
4761 4767 }
4762 4768 } else {
4763 4769 SOFT_RING_ENQUEUE_CHAIN(ringp,
4764 4770 mp_chain, tail, cnt, sz);
4765 4771 }
4766 4772
4767 4773 /*
4768 4774 * We are here because either we couldn't do inline
4769 4775 * processing (because something was already
4770 4776 * queued), or we had a chain of more than one
4771 4777 * packet, or something else arrived after we were
4772 4778 * done with inline processing.
4773 4779 */
4774 4780 ASSERT(MUTEX_HELD(&ringp->s_ring_lock));
4775 4781 ASSERT(ringp->s_ring_first != NULL);
4776 4782
4777 4783 ringp->s_ring_drain_func(ringp);
4778 4784 mutex_exit(&ringp->s_ring_lock);
4779 4785 return;
4780 4786 } else {
4781 4787 /* ST_RING_WORKER_ONLY case */
4782 4788 SOFT_RING_ENQUEUE_CHAIN(ringp, mp_chain, tail, cnt, sz);
4783 4789 mac_soft_ring_worker_wakeup(ringp);
4784 4790 mutex_exit(&ringp->s_ring_lock);
4785 4791 }
4786 4792 }
4787 4793
4788 4794 /*
4789 4795 * TX SOFTRING RELATED FUNCTIONS
4790 4796 *
4791 4797  * These functions really belong in mac_soft_ring.c and are here
4792 4798  * only for a short period.
4793 4799 */
4794 4800
4795 4801 #define TX_SOFT_RING_ENQUEUE_CHAIN(ringp, mp, tail, cnt, sz) { \
4796 4802 ASSERT(MUTEX_HELD(&ringp->s_ring_lock)); \
4797 4803 ringp->s_ring_state |= S_RING_ENQUEUED; \
4798 4804 	SOFT_RING_ENQUEUE_CHAIN(ringp, mp, tail, cnt, sz);		\
4799 4805 }
4800 4806
4801 4807 /*
4802 4808 * mac_tx_sring_queued
4803 4809 *
4804 4810 * When we are out of transmit descriptors and we already have a
4805 4811 * queue that exceeds hiwat (or the client called us with
4806 4812  * MAC_TX_NO_ENQUEUE or MAC_DROP_ON_NO_DESC flag), return the
4807 4813  * soft ring pointer as the opaque cookie so that the client can
4808 4814  * enable flow control.
4809 4815 */
4810 4816 static mac_tx_cookie_t
4811 4817 mac_tx_sring_enqueue(mac_soft_ring_t *ringp, mblk_t *mp_chain, uint16_t flag,
4812 4818 mblk_t **ret_mp)
4813 4819 {
4814 4820 int cnt;
4815 4821 size_t sz;
4816 4822 mblk_t *tail;
4817 4823 mac_soft_ring_set_t *mac_srs = ringp->s_ring_set;
4818 4824 mac_tx_cookie_t cookie = 0;
4819 4825 boolean_t wakeup_worker = B_TRUE;
4820 4826
4821 4827 ASSERT(MUTEX_HELD(&ringp->s_ring_lock));
4822 4828 MAC_COUNT_CHAIN(mac_srs, mp_chain, tail, cnt, sz);
4823 4829 if (flag & MAC_DROP_ON_NO_DESC) {
4824 4830 mac_pkt_drop(NULL, NULL, mp_chain, B_FALSE);
4825 4831 /* increment freed stats */
4826 4832 ringp->s_ring_drops += cnt;
4827 4833 cookie = (mac_tx_cookie_t)ringp;
4828 4834 } else {
4829 4835 if (ringp->s_ring_first != NULL)
4830 4836 wakeup_worker = B_FALSE;
4831 4837
4832 4838 if (flag & MAC_TX_NO_ENQUEUE) {
4833 4839 /*
4834 4840 * If QUEUED is not set, queue the packet
4835 4841 * and let mac_tx_soft_ring_drain() set
4836 4842 * the TX_BLOCKED bit for the reasons
4837 4843 * explained above. Otherwise, return the
4838 4844 * mblks.
4839 4845 */
4840 4846 if (wakeup_worker) {
4841 4847 TX_SOFT_RING_ENQUEUE_CHAIN(ringp,
4842 4848 mp_chain, tail, cnt, sz);
4843 4849 } else {
4844 4850 ringp->s_ring_state |= S_RING_WAKEUP_CLIENT;
4845 4851 cookie = (mac_tx_cookie_t)ringp;
4846 4852 *ret_mp = mp_chain;
4847 4853 }
4848 4854 } else {
4849 4855 boolean_t enqueue = B_TRUE;
4850 4856
4851 4857 if (ringp->s_ring_count > ringp->s_ring_tx_hiwat) {
4852 4858 /*
4853 4859 * flow-controlled. Store ringp in cookie
4854 4860 * so that it can be returned as
4855 4861 * mac_tx_cookie_t to client
4856 4862 */
4857 4863 ringp->s_ring_state |= S_RING_TX_HIWAT;
4858 4864 cookie = (mac_tx_cookie_t)ringp;
4859 4865 ringp->s_ring_hiwat_cnt++;
4860 4866 if (ringp->s_ring_count >
4861 4867 ringp->s_ring_tx_max_q_cnt) {
4862 4868 /* increment freed stats */
4863 4869 ringp->s_ring_drops += cnt;
4864 4870 /*
4865 4871 * b_prev may be set to the fanout hint
4866 4872 * hence can't use freemsg directly
4867 4873 */
4868 4874 mac_pkt_drop(NULL, NULL,
4869 4875 mp_chain, B_FALSE);
4870 4876 DTRACE_PROBE1(tx_queued_hiwat,
4871 4877 mac_soft_ring_t *, ringp);
4872 4878 enqueue = B_FALSE;
4873 4879 }
4874 4880 }
4875 4881 if (enqueue) {
4876 4882 TX_SOFT_RING_ENQUEUE_CHAIN(ringp, mp_chain,
4877 4883 tail, cnt, sz);
4878 4884 }
4879 4885 }
4880 4886 if (wakeup_worker)
4881 4887 cv_signal(&ringp->s_ring_async);
4882 4888 }
4883 4889 return (cookie);
4884 4890 }
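
/*
 * Editor's sketch (hypothetical caller, simplified): to the client
 * the returned cookie is opaque -- nonzero means this send path is
 * flow-controlled. A sender typically stashes the cookie and stops
 * transmitting until a Tx-notify callback fires with a matching
 * cookie (see mac_tx_invoke_callbacks() above). sample_block_on()
 * is a hypothetical stand-in for that client-side bookkeeping.
 */
static void sample_block_on(mac_tx_cookie_t, mblk_t *);

static void
sample_send(mac_client_handle_t mch, mblk_t *chain, uintptr_t hint)
{
	mblk_t *unsent = NULL;
	mac_tx_cookie_t cookie;

	cookie = mac_tx(mch, chain, hint, MAC_TX_NO_ENQUEUE, &unsent);
	if (cookie != 0)
		sample_block_on(cookie, unsent);
}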
4885 4891
4886 4892
4887 4893 /*
4888 4894 * mac_tx_soft_ring_process
4889 4895 *
4890 4896 * This routine is called when fanning out outgoing traffic among
4891 4897  * multiple Tx rings.
4892 4898 * Note that a soft ring is associated with a h/w Tx ring.
4893 4899 */
4894 4900 mac_tx_cookie_t
4895 4901 mac_tx_soft_ring_process(mac_soft_ring_t *ringp, mblk_t *mp_chain,
4896 4902 uint16_t flag, mblk_t **ret_mp)
4897 4903 {
4898 4904 mac_soft_ring_set_t *mac_srs = ringp->s_ring_set;
4899 4905 int cnt;
4900 4906 size_t sz;
4901 4907 mblk_t *tail;
4902 4908 mac_tx_cookie_t cookie = 0;
4903 4909
4904 4910 ASSERT(ringp != NULL);
4905 4911 ASSERT(mp_chain != NULL);
4906 4912 ASSERT(MUTEX_NOT_HELD(&ringp->s_ring_lock));
4907 4913 /*
4908 4914 * The following modes can come here: SRS_TX_BW_FANOUT,
4909 4915 * SRS_TX_FANOUT, SRS_TX_AGGR, SRS_TX_BW_AGGR.
4910 4916 */
4911 4917 ASSERT(MAC_TX_SOFT_RINGS(mac_srs));
4912 4918 ASSERT(mac_srs->srs_tx.st_mode == SRS_TX_FANOUT ||
4913 4919 mac_srs->srs_tx.st_mode == SRS_TX_BW_FANOUT ||
4914 4920 mac_srs->srs_tx.st_mode == SRS_TX_AGGR ||
4915 4921 mac_srs->srs_tx.st_mode == SRS_TX_BW_AGGR);
4916 4922
4917 4923 if (ringp->s_ring_type & ST_RING_WORKER_ONLY) {
4918 4924 /* Serialization mode */
4919 4925
4920 4926 mutex_enter(&ringp->s_ring_lock);
4921 4927 if (ringp->s_ring_count > ringp->s_ring_tx_hiwat) {
4922 4928 cookie = mac_tx_sring_enqueue(ringp, mp_chain,
4923 4929 flag, ret_mp);
4924 4930 mutex_exit(&ringp->s_ring_lock);
4925 4931 return (cookie);
4926 4932 }
4927 4933 MAC_COUNT_CHAIN(mac_srs, mp_chain, tail, cnt, sz);
4928 4934 TX_SOFT_RING_ENQUEUE_CHAIN(ringp, mp_chain, tail, cnt, sz);
4929 4935 if (ringp->s_ring_state & (S_RING_BLOCK | S_RING_PROC)) {
4930 4936 /*
4931 4937 * If ring is blocked due to lack of Tx
4932 4938 * descs, just return. Worker thread
4933 4939 * will get scheduled when Tx desc's
4934 4940 * become available.
4935 4941 */
4936 4942 mutex_exit(&ringp->s_ring_lock);
4937 4943 return (cookie);
4938 4944 }
4939 4945 mac_soft_ring_worker_wakeup(ringp);
4940 4946 mutex_exit(&ringp->s_ring_lock);
4941 4947 return (cookie);
4942 4948 } else {
4943 4949 /* Default fanout mode */
4944 4950 /*
4945 4951 		 * S_RING_BLOCK is set when the underlying NIC runs
4946 4952 * out of Tx descs and messages start getting
4947 4953 * queued. It won't get reset until
4948 4954 * tx_srs_drain() completely drains out the
4949 4955 * messages.
4950 4956 */
4951 4957 mac_tx_stats_t stats;
4952 4958
4953 4959 if (ringp->s_ring_state & S_RING_ENQUEUED) {
4954 4960 /* Tx descs/resources not available */
4955 4961 mutex_enter(&ringp->s_ring_lock);
4956 4962 if (ringp->s_ring_state & S_RING_ENQUEUED) {
4957 4963 cookie = mac_tx_sring_enqueue(ringp, mp_chain,
4958 4964 flag, ret_mp);
4959 4965 mutex_exit(&ringp->s_ring_lock);
4960 4966 return (cookie);
4961 4967 }
4962 4968 /*
4963 4969 * While we were computing mblk count, the
4964 4970 * flow control condition got relieved.
4965 4971 * Continue with the transmission.
4966 4972 */
4967 4973 mutex_exit(&ringp->s_ring_lock);
4968 4974 }
4969 4975
4970 4976 mp_chain = mac_tx_send(ringp->s_ring_tx_arg1,
4971 4977 ringp->s_ring_tx_arg2, mp_chain, &stats);
4972 4978
4973 4979 /*
4974 4980 * Multiple threads could be here sending packets.
4975 4981 		 * Under such conditions, it is not possible to
4976 4982 		 * atomically set the S_RING_BLOCK bit to indicate an
4977 4983 		 * out-of-tx-desc condition. To set it atomically,
4978 4984 		 * we queue the returned packet and do
4979 4985 		 * the setting of S_RING_BLOCK in
4980 4986 		 * mac_tx_soft_ring_drain().
4981 4987 */
4982 4988 if (mp_chain != NULL) {
4983 4989 mutex_enter(&ringp->s_ring_lock);
4984 4990 cookie =
4985 4991 mac_tx_sring_enqueue(ringp, mp_chain, flag, ret_mp);
4986 4992 mutex_exit(&ringp->s_ring_lock);
4987 4993 return (cookie);
4988 4994 }
4989 4995 SRS_TX_STATS_UPDATE(mac_srs, &stats);
4990 4996 SOFTRING_TX_STATS_UPDATE(ringp, &stats);
4991 4997
4992 4998 return (0);
4993 4999 }
4994 5000 }
229 lines elided