11490 SRS ring polling disabled for VLANs
11491 Want DLS bypass for VLAN traffic
11492 add VLVF bypass to ixgbe core
2869 duplicate packets with vnics over aggrs
11489 DLS stat delete and aggr kstat can deadlock
Portions contributed by: Theo Schlossnagle <jesus@omniti.com>
Reviewed by: Patrick Mooney <patrick.mooney@joyent.com>
Reviewed by: Robert Mustacchi <rm@joyent.com>
Reviewed by: Dan McDonald <danmcd@joyent.com>
--- old/usr/src/uts/common/io/mac/mac_sched.c
+++ new/usr/src/uts/common/io/mac/mac_sched.c
1 1 /*
2 2 * CDDL HEADER START
3 3 *
4 4 * The contents of this file are subject to the terms of the
5 5 * Common Development and Distribution License (the "License").
6 6 * You may not use this file except in compliance with the License.
7 7 *
8 8 * You can obtain a copy of the license at usr/src/OPENSOLARIS.LICENSE
9 9 * or http://www.opensolaris.org/os/licensing.
10 10 * See the License for the specific language governing permissions
11 11 * and limitations under the License.
12 12 *
13 13 * When distributing Covered Code, include this CDDL HEADER in each
[ 13 lines elided ]
14 14 * file and include the License file at usr/src/OPENSOLARIS.LICENSE.
15 15 * If applicable, add the following below this CDDL HEADER, with the
16 16 * fields enclosed by brackets "[]" replaced with your own identifying
17 17 * information: Portions Copyright [yyyy] [name of copyright owner]
18 18 *
19 19 * CDDL HEADER END
20 20 */
21 21 /*
22 22 * Copyright 2010 Sun Microsystems, Inc. All rights reserved.
23 23 * Use is subject to license terms.
24 - * Copyright 2017 Joyent, Inc.
24 + * Copyright 2018 Joyent, Inc.
25 25 * Copyright 2013 Nexenta Systems, Inc. All rights reserved.
26 26 */
27 27
28 28 /*
29 29 * MAC data path
30 30 *
31 31 * The MAC data path is concerned with the flow of traffic from mac clients --
32 32 * DLS, IP, etc. -- to various GLDv3 device drivers -- e1000g, vnic, aggr,
33 33 * ixgbe, etc. -- and from the GLDv3 device drivers back to clients.
34 34 *
35 35 * -----------
36 36 * Terminology
37 37 * -----------
38 38 *
39 39 * MAC uses a lot of different, but related terms that are associated with the
40 40 * design and structure of the data path. Before we cover other aspects, first
41 41 * let's review the terminology that MAC uses.
42 42 *
43 43 * MAC
44 44 *
45 45 * This driver. It interfaces with device drivers and provides abstractions
46 46 * that the rest of the system consumes. All data links -- things managed
47 47 * with dladm(1M) -- are accessed through MAC.
48 48 *
49 49 * GLDv3 DEVICE DRIVER
50 50 *
51 51 * A GLDv3 device driver refers to a driver, both for pseudo-devices and
52 52 * real devices, which implement the GLDv3 driver API. Common examples of
53 53 * these are igb and ixgbe, which are drivers for various Intel networking
54 54 * cards. These devices may or may not have various features, such as
55 55 * hardware rings and checksum offloading. For MAC, a GLDv3 device is the
56 56 * final point for the transmission of a packet and the starting point for
57 57 * the receipt of a packet.
58 58 *
59 59 * FLOWS
60 60 *
61 61 * At a high level, a flow refers to a series of packets that are related.
62 62 * Oftentimes the term is used in the context of TCP to indicate a unique
63 63 * TCP connection and the traffic over it. However, a flow can exist at
64 64 * other levels of the system as well. MAC has a notion of a default flow
65 65 * which is used for all unicast traffic addressed to the address of a MAC
66 66 * device. For example, when a VNIC is created, a default flow is created
67 67 * for the VNIC's MAC address. In addition, flows are created for broadcast
68 68 * groups and a user may create a flow with flowadm(1M).
69 69 *
70 70 * CLASSIFICATION
71 71 *
72 72 * Classification refers to the notion of identifying an incoming frame
73 73 * based on its destination address and optionally its source addresses and
74 74 * doing different processing based on that information. Classification can
75 75 * be done in both hardware and software. In general, we usually only
76 76 * classify based on the layer two destination, e.g. for Ethernet, the
77 77 * destination MAC address.
78 78 *
79 79 * The system also will do classification based on layer three and layer
80 80 * four properties. This is used to support things like flowadm(1M), which
81 81 * allows setting QoS and other properties on a per-flow basis.
82 82 *
83 83 * RING
84 84 *
85 85 * Conceptually, a ring represents a series of framed messages, often in a
86 86 * contiguous chunk of memory that acts as a circular buffer. Rings come in
87 87 * a couple of forms. Generally they are either a hardware construct (hw
88 88 * ring) or they are a software construct (sw ring) maintained by MAC.
89 89 *
90 90 * HW RING
91 91 *
92 92 * A hardware ring is a set of resources provided by a GLDv3 device driver
93 93 * (even if it is a pseudo-device). A hardware ring comes in two different
94 94 * forms: receive (rx) rings and transmit (tx) rings. An rx hw ring is
95 95 * something that has a unique DMA (direct memory access) region and
96 96 * generally supports some form of classification (though it isn't always
97 97 * used), as well as a means of generating an interrupt specific to that
98 98 * ring. For example, the device may generate a specific MSI-X for a PCI
99 99 * express device. A tx ring is similar, except that it is dedicated to
100 100 * transmission. It may also be a vector for enabling features such as VLAN
101 101 * tagging and large transmit offloading. It usually has its own dedicated
102 102 * interrupts for transmit being completed.
103 103 *
104 104 * SW RING
105 105 *
106 106 * A software ring is a construction of MAC. It represents the same thing
107 107 * that a hardware ring generally does, a collection of frames. However,
108 108 * instead of being in a contiguous ring of memory, they're instead linked
109 109 * by using the mblk_t's b_next pointer. Each frame may itself be multiple
110 110 * mblk_t's linked together by the b_cont pointer. A software ring always
111 111 * represents a collection of classified packets; however, it varies as to
112 112 * whether it uses only layer two information, or a combination of that and
113 113 * additional layer three and layer four data.
114 114 *
115 115 * FANOUT
116 116 *
117 117 * Fanout is the idea of spreading out the load of processing frames based
118 118 * on the source and destination information contained in the layer two,
119 119 * three, and four headers, such that the data can then be processed in
120 120 * parallel using multiple hardware threads.
121 121 *
122 122 * A fanout algorithm hashes the headers and uses that to place different
123 123 * flows into a bucket. The most important thing is that packets that are
124 124 * in the same flow end up in the same bucket. If they do not, performance
125 125 * can be adversely affected. Consider the case of TCP. TCP severely
126 126 * penalizes a connection if the data arrives out of order. If a given flow
127 127 * is processed on different CPUs, then the data will appear out of order,
128 128 * hence the invariant that fanout always hashes a given flow to the same
129 129 * bucket, ensuring that it is processed on the same CPU.
130 130 *
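As a sketch of that invariant (with simplified stand-in types; the real hashing is performed by mac_pkt_hash() on the actual frame headers), a fanout bucket function might look like:

    #include <stdint.h>

    /* Simplified 5-tuple; the real code extracts these from the frame. */
    typedef struct flow_tuple {
        uint32_t ft_saddr;   /* source IPv4 address */
        uint32_t ft_daddr;   /* destination IPv4 address */
        uint16_t ft_sport;   /* source port */
        uint16_t ft_dport;   /* destination port */
        uint8_t  ft_proto;   /* IP protocol number */
    } flow_tuple_t;

    /*
     * Hash a flow to one of nbuckets. Because the hash depends only on
     * fields that are constant for the life of a connection, every frame
     * of a given flow lands in the same bucket, and therefore on the
     * same CPU, preserving ordering.
     */
    static uint32_t
    flow_fanout_bucket(const flow_tuple_t *ft, uint32_t nbuckets)
    {
        uint32_t hash;

        hash = ft->ft_saddr ^ ft->ft_daddr ^ ft->ft_proto ^
            (((uint32_t)ft->ft_sport << 16) | ft->ft_dport);
        hash ^= hash >> 16;   /* mix the halves */
        return (hash % nbuckets);
    }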
131 131 * RECEIVE SIDE SCALING (RSS)
132 132 *
133 133 *
134 134 * Receive side scaling is a term that isn't common in illumos, but is used
135 135 * by vendors and was popularized by Microsoft. It refers to the idea of
136 136 * spreading the incoming receive load out across multiple interrupts which
137 137 * can be directed to different CPUs. This allows a device to leverage
138 138 * hardware rings even when it doesn't support hardware classification. The
139 139 * hardware uses an algorithm to perform fanout that ensures the flow
140 140 * invariant is maintained.
141 141 *
142 142 * SOFT RING SET
143 143 *
144 144 * A soft ring set, commonly abbreviated SRS, is a collection of rings and
145 145 * is used for both transmitting and receiving. It is maintained in the
146 146 * structure mac_soft_ring_set_t. A soft ring set is usually associated
147 147 * with flows, and coordinates both the use of hardware and software rings.
148 148 * Because the use of hardware rings can change as devices such as VNICs
149 149 * come and go, we always ensure that the set has software classification
150 150 * rules that correspond to the hardware classification rules from rings.
151 151 *
152 152 * Soft ring sets are also used for the enforcement of various QoS
153 153 * properties. For example, if a bandwidth limit has been placed on a
154 154 * specific flow or device, then that will be enforced by the soft ring
155 155 * set.
156 156 *
157 157 * SERVICE ATTACHMENT POINT (SAP)
158 158 *
159 159 * The service attachment point is a DLPI (Data Link Provider Interface)
160 160 * concept; however, it comes up quite often in MAC. Most MAC devices speak
161 161 * a protocol that has some notion of different channels or message type
162 162 * identifiers. For example, Ethernet defines an EtherType which is a part
163 163 * of the Ethernet header and defines the particular protocol of the data
164 164 * payload. If the EtherType is set to 0x0800, then it defines that the
165 165 * contents of that Ethernet frame is IPv4 traffic. For Ethernet, the
166 166 * EtherType is the SAP.
167 167 *
168 168 * In DLPI, a given consumer attaches to a specific SAP. In illumos, the ip
169 169 * and arp drivers attach to the EtherTypes for IPv4, IPv6, and ARP. Using
170 170 * libdlpi(3LIB) user software can attach to arbitrary SAPs. With the
171 171 * exception of 802.1Q VLAN tagged traffic, MAC itself does not directly
172 172 * consume the SAP; however, it uses that information as part of hashing
173 173 * and it may be used as part of the construction of flows.
174 174 *
175 175 * PRIMARY MAC CLIENT
176 176 *
177 177 * The primary mac client refers to a mac client whose unicast address
178 178 * matches the address of the device itself. For example, if the system has
179 179 * instances of the e1000g driver such as e1000g0, e1000g1, etc., the
180 180 * primary mac client is the one named after the device itself. VNICs that
181 181 * are created on top of such devices are not the primary client.
182 182 *
183 183 * TRANSMIT DESCRIPTORS
184 184 *
185 185 * Transmit descriptors are a resource that most GLDv3 device drivers have.
186 186 * Generally, a GLDv3 device driver takes a frame that's meant to be output
187 187 * and puts a copy of it into a region of memory. Each region of memory
188 188 * usually has an associated descriptor that the device uses to manage
189 189 * properties of the frames. Devices have a limited number of such
190 190 * descriptors. They get reclaimed once the device finishes putting the
191 191 * frame on the wire.
192 192 *
193 193 * If the driver runs out of transmit descriptors, for example, because the
194 194 * OS is generating more frames than it can put on the wire, then it will
195 195 * return the unsent frames to the MAC layer.
196 196 *
197 197 * ---------------------------------
198 198 * Rings, Classification, and Fanout
199 199 * ---------------------------------
200 200 *
201 201 * The heart of MAC is made up of rings, and not those that Elven-kings wear.
202 202 * When receiving a packet, MAC breaks the work into two different, though
203 203 * interrelated phases. The first phase is generally classification and then the
204 204 * second phase is generally fanout. When a frame comes in from a GLDv3 Device,
205 205 * MAC needs to determine where that frame should be delivered. If it's a
206 206 * unicast frame (say a normal TCP/IP packet), then it will be delivered to a
207 207 * single MAC client; however, if it's a broadcast or multicast frame, then MAC
208 208 * may need to deliver it to multiple MAC clients.
209 209 *
210 210 * On transmit, classification isn't quite as important, but may still be used.
211 211 * Unlike with the receive path, the classification is not used to determine
212 212 * devices that should transmit something, but rather is used for special
213 213 * properties of a flow, e.g. bandwidth limits for a given IP address, device, or
214 214 * connection.
215 215 *
216 216 * MAC employs a software classifier and leverages hardware classification as
217 217 * well. The software classifier can leverage the full layer two information,
218 218 * source, destination, VLAN, and SAP. If the SAP indicates that IP traffic is
219 219 * being sent, it can classify based on the IP header, and finally, it also
220 220 * knows how to classify based on the local and remote ports of TCP, UDP, and
221 221 * SCTP.
222 222 *
223 223 * Hardware classifiers vary in capability. Generally all hardware classifiers
224 224 * provide the capability to classify based on the destination MAC address. Some
225 225 * hardware has additional filters built in for performing more in-depth
226 226 * classification; however, it often has much more limited resources for these
227 227 * activities as compared to the layer two destination address classification.
228 228 *
229 229 * The modus operandi in MAC is to always ensure that we have software-based
230 230 * capabilities and rules in place and then to supplement that with hardware
231 231 * resources when available. In general, simple layer two classification is
232 232 * sufficient and nothing else is used, unless a specific flow is created with
233 233 * tools such as flowadm(1M) or bandwidth limits are set on a device with
234 234 * dladm(1M).
235 235 *
236 236 * RINGS AND GROUPS
237 237 *
238 238 * To get into how rings and classification play together, it's first important
239 239 * to understand how hardware devices commonly associate rings and allow them to
240 240 * be programmed. Recall that a hardware ring should be thought of as a DMA
241 241 * buffer and an interrupt resource. Rings are then collected into groups. A
242 242 * group itself has a series of classification rules. One or more MAC addresses
243 243 * are assigned to a group.
244 244 *
245 245 * Hardware devices vary in terms of what capabilities they provide. Sometimes
246 246 * they allow for a dynamic assignment of rings to a group and sometimes they
247 247 * have a static assignment of rings to a group. For example, the ixgbe driver
248 248 * has a static assignment of rings to groups such that every group has exactly
249 249 * one ring and the number of groups is equal to the number of rings.
250 250 *
251 251 * Classification and receive side scaling both come into play with how a device
252 252 * advertises itself to MAC and how MAC uses it. If a device supports layer two
253 253 * classification of frames, then MAC will assign MAC addresses to a group as a
254 254 * form of primary classification. If a single MAC address is assigned to a
255 255 * group, a common case, then MAC will consider packets that come in from rings
256 256 * on that group to be fully classified and will not need to do any software
257 257 * classification unless a specific flow has been created.
258 258 *
259 259 * If a device supports receive side scaling, then it may advertise or support
260 260 * groups with multiple rings. In those cases, then receive side scaling will
261 261 * come into play and MAC will use that as a means of fanning out received
262 262 * frames across multiple CPUs. This can also be combined with groups that
263 263 * support layer two classification.
264 264 *
265 265 * If a device supports dynamic assignments of rings to groups, then MAC will
266 266 * change around the way that rings are assigned to various groups as devices
267 267 * come and go from the system. For example, when a VNIC is created, a new flow
268 268 * will be created for the VNIC's MAC address. If a hardware ring is available,
269 269 * MAC may opt to reassign it from one group to another.
270 270 *
271 271 * ASSIGNMENT OF HARDWARE RINGS
272 272 *
273 273 * This is a bit of a complicated subject that varies depending on the device,
274 274 * the use of aggregations, and the special nature of the primary mac client.
275 275 * This section deserves to be fleshed out.
276 276 *
277 277 * FANOUT
278 278 *
279 279 * illumos uses fanout to help spread out the incoming processing load of chains
280 280 * of frames away from a single CPU. If a device supports receive side scaling,
281 281 * then that provides an initial form of fanout; however, what we're concerned
282 282 * with here all happens after a given set of frames has been classified
283 283 * to a soft ring set.
284 284 *
285 285 * After frames reach a soft ring set and any bandwidth-related accounting
286 286 * has been performed, they may be fanned out based on one of the following
287 287 * three modes:
288 288 *
289 289 * o No Fanout
290 290 * o Protocol level fanout
291 291 * o Full software ring protocol fanout
292 292 *
[ 258 lines elided ]
293 293 * MAC makes the determination as to which of these modes a given soft ring set
294 294 * obtains based on parameters such as whether or not it's the primary mac
295 295 * client, whether it's on a 10 GbE or faster device, user controlled dladm(1M)
296 296 * properties, and the nature of the hardware and the resources that it has.
297 297 *
298 298 * When there is no fanout, MAC does not create any soft rings for a device and
299 299 * the device has frames delivered directly to the MAC client.
300 300 *
301 301 * Otherwise, all fanout is performed by software. MAC divides incoming frames
302 302 * into one of three buckets -- IPv4 TCP traffic, IPv4 UDP traffic, and
303 - * everything else. Note, VLAN tagged traffic is considered other, regardless of
304 - * the interior EtherType. Regardless of the type of fanout, these three
305 - * categories or buckets are always used.
303 + * everything else. Regardless of the type of fanout, these three categories
304 + * or buckets are always used.
306 305 *
307 306 * The difference between protocol level fanout and full software ring protocol
308 307 * fanout is the number of software rings that end up getting created. The
309 308 * system always uses the same number of software rings per protocol bucket. So
310 309 * in the first case when we're just doing protocol level fanout, we just create
311 310 * one software ring each for IPv4 TCP traffic, IPv4 UDP traffic, and everything
312 311 * else.
313 312 *
314 313 * In the case where we do full software ring protocol fanout, we generally use
315 314 * mac_compute_soft_ring_count() to determine the number of rings. There are
316 315 * other combinations of properties and devices that may send us down other
317 316 * paths, but this is a common starting point. If it's a non-bandwidth enforced
318 317 * device and we're on at least a 10 GbE link, then we'll use eight soft rings
319 318 * per protocol bucket as a starting point. See mac_compute_soft_ring_count()
320 319 * for more information on the total number.
321 320 *
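As a hedged sketch of the starting point described above (the real logic is mac_compute_soft_ring_count(), which weighs CPU counts and link properties that this omits):

    #include <stdbool.h>
    #include <stdint.h>

    /*
     * Starting-point soft ring count per protocol bucket (TCP, UDP,
     * other): one ring per bucket for protocol-level fanout, eight per
     * bucket for full fanout on a non-bandwidth-enforced link of at
     * least 10 GbE. Illustrative only.
     */
    static unsigned int
    soft_rings_per_bucket(bool full_fanout, bool bw_enforced,
        uint64_t link_speed_mbps)
    {
        if (!full_fanout)
            return (1);
        if (!bw_enforced && link_speed_mbps >= 10000)
            return (8);
        return (1);   /* conservative fallback for the sketch */
    }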
322 321 * For each of these rings, we create a mac_soft_ring_t and an associated worker
323 322 * thread. Particularly when doing full software ring protocol fanout, we bind
324 323 * each of the worker threads to individual CPUs.
325 324 *
326 325 * The other advantage of these software rings is that it allows upper layers to
327 326 * optionally poll on them. For example, TCP can leverage an squeue to poll on
328 327 * the software ring, see squeue.c for more information.
329 328 *
330 329 * DLS BYPASS
331 330 *
332 331 * DLS is the data link services module. It interfaces with DLPI, which is the
333 332 * primary way that other parts of the system such as IP interface with the MAC
334 333 * layer. While DLS is traditionally a STREAMS-based interface, it allows for
335 334 * certain modules such as IP to negotiate various more modern interfaces to be
336 335 * used, which are useful for higher performance and allow it to use direct
337 336 * function calls to DLS instead of using STREAMS.
338 337 *
339 338 * When we have IPv4 TCP or UDP software rings, then traffic on those rings is
340 339 * eligible for what we call the dls bypass. In those cases, rather than going
341 340 * through mac_rx_deliver() to DLS, frames are instead passed directly to the
342 341 * callback registered with DLS, generally ip_input().
343 342 *
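The shape of the bypass can be sketched as a stored function pointer that replaces the STREAMS path (types here are simplified stand-ins for the kernel's, and soft_ring_deliver() is hypothetical):

    /* Opaque stand-in for the kernel's message block type. */
    typedef struct mblk mblk_t;
    typedef void (*rx_direct_cb_t)(void *arg, mblk_t *chain);

    typedef struct soft_ring {
        rx_direct_cb_t sr_direct;      /* e.g. ip_input() when bypassing */
        void           *sr_direct_arg;
    } soft_ring_t;

    static void
    soft_ring_deliver(soft_ring_t *sr, mblk_t *chain)
    {
        /* Direct call: no STREAMS queueing between MAC and IP. */
        sr->sr_direct(sr->sr_direct_arg, chain);
    }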
344 343 * HARDWARE RING POLLING
345 344 *
346 345 * GLDv3 devices with hardware rings generally deliver chains of messages
347 346 * (mblk_t chain) during the context of a single interrupt. However, interrupts
348 347 * are not the only way that these devices may be used. As part of implementing
349 348 * ring support, a GLDv3 device driver must have a way to disable the generation
350 349 * of that interrupt and allow for the operating system to poll on that ring.
351 350 *
352 351 * To implement this, every soft ring set has a worker thread and a polling
353 352 * thread. If a sufficient packet rate comes into the system, MAC will 'blank'
354 353 * (disable) interrupts on that specific ring and the polling thread will start
355 354 * consuming packets from the hardware device and deliver them to the soft ring
356 355 * set, where the worker thread will take over.
357 356 *
358 357 * Once the rate of packet intake drops below a certain threshold,
359 358 * polling on the hardware ring will be quiesced and interrupts will be
360 359 * re-enabled for the given ring. This effectively allows the system to shift
361 360 * how it handles a ring based on its load. At high packet rates, polling on the
362 361 * device as opposed to relying on interrupts can actually reduce overall system
363 362 * load due to the minimization of interrupt activity.
364 363 *
365 364 * Note the importance of each ring having its own interrupt source. The whole
366 365 * idea here is that we do not disable interrupts on the device as a whole, but
367 366 * rather each ring can be independently toggled.
368 367 *
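A distilled sketch of the per-ring toggle (stand-in types and illustrative thresholds, not the ones MAC actually uses):

    #include <stdbool.h>
    #include <stdint.h>

    /* Simplified stand-in for a GLDv3 ring's interrupt controls. */
    typedef struct hw_ring {
        bool hr_polling;
        void (*hr_intr_disable)(struct hw_ring *);
        void (*hr_intr_enable)(struct hw_ring *);
    } hw_ring_t;

    /*
     * Toggle a single ring between interrupt and poll mode based on its
     * backlog, leaving every other ring's interrupt source untouched.
     */
    static void
    ring_mode_adjust(hw_ring_t *hr, uint32_t backlog)
    {
        if (!hr->hr_polling && backlog > 64) {
            hr->hr_intr_disable(hr);   /* 'blank' this ring */
            hr->hr_polling = true;
        } else if (hr->hr_polling && backlog == 0) {
            hr->hr_intr_enable(hr);
            hr->hr_polling = false;
        }
    }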
369 368 * USE OF WORKER THREADS
370 369 *
371 370 * Both the soft ring set and individual soft rings have a worker thread
372 371 * associated with them that may be bound to a specific CPU in the system. Any
373 372 * such assignment will get reassessed as part of dynamic reconfiguration events
374 373 * in the system such as the onlining and offlining of CPUs and the creation of
375 374 * CPU partitions.
376 375 *
377 376 * In many cases, while in an interrupt, we try to deliver a frame all the way
378 377 * through the stack in the context of the interrupt itself. However, if the
379 378 * number of queued frames has exceeded a threshold, then we instead defer to
380 379 * the worker thread to do this work and signal it. This is particularly useful
381 380 * when you have the soft ring set delivering frames into multiple software
382 381 * rings. If it were only delivering frames into a single software ring, then
383 382 * there'd be no need to have another thread take over. However, if it's
384 383 * delivering chains of frames to multiple rings, then it's worthwhile to have
385 384 * the worker for the software ring take over so that the different software
386 385 * rings can be processed in parallel.
387 386 *
388 387 * In a similar fashion to the hardware polling thread, if we don't have a
389 388 * backlog or there's nothing to do, then the worker thread will go back to
390 389 * sleep and frames can be delivered all the way from an interrupt. This
391 390 * behavior is useful as it's designed to minimize latency and the default
392 391 * disposition of MAC is to optimize for latency.
393 392 *
394 393 * MAINTAINING CHAINS
395 394 *
396 395 * Another useful idea that MAC uses is to try and maintain frames in chains for
397 396 * as long as possible. The idea is that all of MAC can handle chains of frames
398 397 * structured as a series of mblk_t structures linked with the b_next pointer.
399 398 * When performing software classification and software fanout, MAC does not
400 399 * simply determine the destination and send the frame along. Instead, in the
401 400 * case of classification, it tries to maintain a chain for as long as possible
402 401 * before passing it along and performing additional processing.
403 402 *
404 403 * In the case of fanout, MAC first determines what the target software ring is
405 404 * for every frame in the original chain and constructs a new chain for each
406 405 * target. MAC then delivers the new chain to each software ring in succession.
407 406 *
408 407 * The whole rationale for doing this is that we want to try and maintain the
409 408 * pipe as much as possible and deliver as many frames through the stack at once
410 409 * as we can, rather than just pushing a single frame through. This can often
411 410 * help bring down latency and allows MAC to get a better sense of the overall
412 411 * activity in the system and properly engage worker threads.
413 412 *
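A minimal sketch of the per-target chain construction (an mblk stand-in with only the two link pointers; ring_of() is a hypothetical classifier assumed to return a value below NRINGS):

    #include <stddef.h>

    /* Minimal stand-in: b_next links frames, b_cont links buffers. */
    typedef struct mblk {
        struct mblk *b_next;
        struct mblk *b_cont;
    } mblk_t;

    #define NRINGS 4

    extern unsigned int ring_of(const mblk_t *);   /* hypothetical hash */

    /*
     * Split one received chain into per-target chains, preserving frame
     * order within each target. heads[] and tails[] must be initialized
     * to NULL by the caller; each sub-chain is then delivered once.
     */
    static void
    fanout_chain(mblk_t *chain, mblk_t *heads[NRINGS], mblk_t *tails[NRINGS])
    {
        mblk_t *mp, *next;

        for (mp = chain; mp != NULL; mp = next) {
            unsigned int r = ring_of(mp);

            next = mp->b_next;
            mp->b_next = NULL;
            if (heads[r] == NULL)
                heads[r] = mp;
            else
                tails[r]->b_next = mp;
            tails[r] = mp;
        }
    }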
414 413 * --------------------
415 414 * Bandwidth Management
416 415 * --------------------
417 416 *
418 417 * Bandwidth management is something that's built into the soft ring set itself.
419 418 * When bandwidth limits are placed on a flow, a corresponding soft ring set is
420 419 * toggled into bandwidth mode. This changes how we transmit and receive the
421 420 * frames in question.
422 421 *
423 422 * Bandwidth management is done on a per-tick basis. We translate the user's
424 423 * requested bandwidth from a quantity per-second into a quantity per-tick. MAC
425 424 * cannot process a frame across more than one tick, thus it sets a lower bound
426 425 * for the bandwidth cap to be a single MTU. This also means that when
427 426 * hires ticks are enabled (hz is set to 1000), the minimum amount of
428 427 * bandwidth is higher, because the number of ticks has increased and MAC has to
429 428 * go from accepting 100 packets / sec to 1000 / sec.
430 429 *
431 430 * The bandwidth counter is reset by either the soft ring set's worker thread or
432 431 * a thread that is doing an inline transmit or receive if they discover that
433 432 * the current tick is in the future from the recorded tick.
434 433 *
435 434 * Whenever we're receiving or transmitting data, we end up leaving most of the
436 435 * work to the soft ring set's worker thread. This forces data inserted into the
437 436 * soft ring set to be effectively serialized and allows us to consume bandwidth
438 437 * at a reasonable rate. If there is nothing in the soft ring set at the moment
439 438 * and the set has available bandwidth, then it may be processed inline.
440 439 * Otherwise, the worker is responsible for taking care of the soft ring set.
441 440 *
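The per-tick translation reduces to simple arithmetic; a sketch with the MTU clamp described above:

    #include <stdint.h>

    /*
     * Translate a requested bytes-per-second limit into a per-tick
     * quota, clamped below by one MTU since a frame cannot be processed
     * across ticks. With hz = 1000 the quota is a tenth of what it is
     * at hz = 100, which is why the effective minimum bandwidth rises
     * when hires ticks are enabled.
     */
    static uint64_t
    bw_bytes_per_tick(uint64_t bytes_per_sec, uint64_t hz, uint64_t mtu)
    {
        uint64_t quota = bytes_per_sec / hz;

        return (quota < mtu ? mtu : quota);
    }

For example, a 1 Gb/s limit (125,000,000 bytes/sec) at hz = 100 yields a quota of 1,250,000 bytes per tick.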
442 441 * ---------------------
443 442 * The Receive Data Path
444 443 * ---------------------
445 444 *
446 445 * The following series of ASCII art images breaks apart the way that a frame
447 446 * comes in and is processed in MAC.
448 447 *
449 448 * Part 1 -- Initial frame receipt, SRS classification
450 449 *
451 450 * Here, a frame is received by a GLDv3 driver, generally in the context of an
452 451 * interrupt, and it ends up in mac_rx_common(). A driver calls either mac_rx or
453 452 * mac_rx_ring, depending on whether or not it supports rings and can identify
454 453 * the interrupt as having come from a specific ring. Here we determine whether
455 454 * or not it's fully classified and perform software classification as
456 455 * appropriate. From here, everything always ends up going to either entry [A]
457 456 * or entry [B] based on whether or not subflow processing is needed. We
458 457 * leave via fanout or delivery.
459 458 *
460 459 * +===========+
461 460 * v hardware v
462 461 * v interrupt v
463 462 * +===========+
464 463 * |
465 464 * * . . appropriate
466 465 * | upcall made
467 466 * | by GLDv3 driver . . always
468 467 * | .
469 468 * +--------+ | +----------+ . +---------------+
470 469 * | GLDv3 | +---->| mac_rx |-----*--->| mac_rx_common |
471 470 * | Driver |-->--+ +----------+ +---------------+
472 471 * +--------+ | ^ |
473 472 * | | ^ v
474 473 * ^ | * . . always +----------------------+
475 474 * | | | | mac_promisc_dispatch |
476 475 * | | +-------------+ +----------------------+
477 476 * | +--->| mac_rx_ring | |
478 477 * | +-------------+ * . . hw classified
479 478 * | v or single flow?
480 479 * | |
481 480 * | +--------++--------------+
482 481 * | | | * hw class,
483 482 * | | * hw classified | subflows
484 483 * | no hw class and . * | or single | exist
485 484 * | subflows | | flow |
486 485 * | | v v
487 486 * | | +-----------+ +-----------+
488 487 * | | | goto | | goto |
489 488 * | | | entry [A] | | entry [B] |
490 489 * | | +-----------+ +-----------+
491 490 * | v ^
492 491 * | +-------------+ |
493 492 * | | mac_rx_flow | * SRS and flow found,
494 493 * | +-------------+ | call flow cb
495 494 * | | +------+
496 495 * | v |
497 496 * v +==========+ +-----------------+
498 497 * | v For each v--->| mac_rx_classify |
499 498 * +----------+ v mblk_t v +-----------------+
500 499 * | srs | +==========+
501 500 * | polling  |
502 501 * | thread |->------------------------------------------+
503 502 * +----------+ |
504 503 * v . inline
505 504 * +--------------------+ +----------+ +---------+ .
506 505 * [A]---->| mac_rx_srs_process |-->| check bw |-->| enqueue |--*---------+
507 506 * +--------------------+ | limits | | frames | |
508 507 * ^ +----------+ | to SRS | |
509 508 * | +---------+ |
510 509 * | send chain +--------+ | |
511 510 * * when classified | signal | * BW limits, |
512 511 * | flow changes | srs |<---+ loopback, |
513 512 * | | worker | stack too |
514 513 * | +--------+ deep |
515 514 * +-----------------+ +--------+ |
516 515 * | mac_flow_lookup | | srs | +---------------------+ |
517 516 * +-----------------+ | worker |---->| mac_rx_srs_drain |<---+
518 517 * ^ | thread | | mac_rx_srs_drain_bw |
519 518 * | +--------+ +---------------------+
520 519 * | |
521 520 * +----------------------------+ * software rings
522 521 * [B]-->| mac_rx_srs_subflow_process | | for fanout?
523 522 * +----------------------------+ |
524 523 * +----------+-----------+
525 524 * | |
526 525 * v v
527 526 * +--------+ +--------+
528 527 * | goto | | goto |
529 528 * | Part 2 | | Part 3 |
530 529 * +--------+ +--------+
531 530 *
532 531 * Part 2 -- Fanout
533 532 *
534 533 * This part is concerned with using software fanout to assign frames to
535 534 * software rings and then deliver them to MAC clients or allow those rings to
536 535 * be polled upon. While there are two different primary fanout entry points,
537 536 * mac_rx_fanout and mac_rx_proto_fanout, they behave in similar ways, and aside
538 537 * from some of the individual hashing techniques used, most of the general
539 538 * flow is the same.
540 539 *
541 540 * +--------+ +-------------------+
542 541 * | From |---+--------->| mac_rx_srs_fanout |----+
543 542 * | Part 1 | | +-------------------+ | +=================+
544 543 * +--------+ | | v for each mblk_t v
545 544 * * . . protocol only +--->v assign to new v
546 545 * | fanout | v chain based on v
547 546 * | | v hash % nrings v
548 547 * | +-------------------------+ | +=================+
549 548 * +--->| mac_rx_srs_proto_fanout |----+ |
550 549 * +-------------------------+ |
551 550 * v
552 551 * +------------+ +--------------------------+ +================+
553 552 * | enqueue in |<---| mac_rx_soft_ring_process |<------v for each chain v
554 553 * | soft ring | +--------------------------+ +================+
555 554 * +------------+
556 555 * | +-----------+
557 556 * * soft ring set | soft ring |
558 557 * | empty and no | worker |
559 558 * | worker? | thread |
560 559 * | +-----------+
561 560 * +------*----------------+ |
562 561 * | . | v
563 562 * No . * . Yes | +------------------------+
564 563 * | +----<--| mac_rx_soft_ring_drain |
565 564 * | | +------------------------+
566 565 * v |
567 566 * +-----------+ v
568 567 * | signal | +---------------+
569 568 * | soft ring | | Deliver chain |
570 569 * | worker | | goto Part 3 |
571 570 * +-----------+ +---------------+
572 571 *
573 572 *
574 573 * Part 3 -- Packet Delivery
575 574 *
576 575 * Here, we go through and deliver the mblk_t chain directly to a given
577 576 * processing function. In a lot of cases this is mac_rx_deliver(). In the case
578 577 * of DLS bypass being used, we instead deliver the chain directly
579 578 * to the callback registered with DLS, generally ip_input.
580 579 *
581 580 *
582 581 * +---------+ +----------------+ +------------------+
583 582 * | From |---+------->| mac_rx_deliver |--->| Off to DLS, or |
584 583 * | Parts 1 | | +----------------+ | other MAC client |
585 584 * | and 2 | * DLS bypass +------------------+
586 585 * +---------+ | enabled +----------+ +-------------+
587 586 * +---------->| ip_input |--->| To IP |
588 587 * +----------+ | and beyond! |
589 588 * +-------------+
590 589 *
591 590 * ----------------------
592 591 * The Transmit Data Path
593 592 * ----------------------
594 593 *
595 594 * Before we go into the images, it's worth talking about a problem that is a
596 595 * bit different from the receive data path. GLDv3 device drivers have a finite
597 596 * number of transmit descriptors. When they run out, they return unconsumed
598 597 * frames to MAC. MAC, at this point, has several options about what it will do,
599 598 * which vary based upon the settings that the client uses.
600 599 *
601 600 * When a device runs out of descriptors, the next thing that MAC does is
602 601 * enqueue them on the soft ring set or a software ring, depending on the
603 602 * configuration of the soft ring set. MAC will enqueue up to a high watermark
604 603 * of mblk_t chains, at which point it will indicate flow control back to the
605 604 * client. Once this condition is reached, any mblk_t chains that were not
606 605 * enqueued will be returned to the caller and they will have to decide what to
607 606 * do with them. There are various flags that control this behavior that a
608 607 * client may pass, which are discussed below.
609 608 *
610 609 * When this condition is hit, MAC also returns a cookie to the client in
611 610 * addition to unconsumed frames. Clients can poll on that cookie and register a
612 611 * callback with MAC to be notified when they are no longer subject to flow
613 612 * control, at which point they may continue to call mac_tx(). This flow control
614 613 * actually manages to work itself all the way up the stack, back through dls,
615 614 * to ip, through the various protocols, and to sockfs.
616 615 *
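The client-side contract can be sketched as a retry loop keyed off the returned cookie (types and both callees are hypothetical stand-ins, not the mac_tx() signature):

    #include <stdint.h>

    typedef struct mblk mblk_t;        /* opaque stand-in */
    typedef uintptr_t tx_cookie_t;

    /*
     * Hypothetical: returns 0 on success, else a flow-control cookie,
     * and hands back any unconsumed chain through *chain.
     */
    extern tx_cookie_t client_tx(void *mch, mblk_t **chain);
    extern void wait_for_tx_ready(tx_cookie_t);

    static void
    send_with_flow_control(void *mch, mblk_t *chain)
    {
        tx_cookie_t cookie;

        while ((cookie = client_tx(mch, &chain)) != 0) {
            /* Flow controlled: wait for the notification, then resend. */
            wait_for_tx_ready(cookie);
        }
    }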
617 616 * While the behavior described above is the default, this behavior can be
618 617 * modified. There are two alternate modes, described below, which are
619 618 * controlled with flags.
620 619 *
621 620 * DROP MODE
622 621 *
623 622 * This mode is controlled by having the client pass the MAC_DROP_ON_NO_DESC
624 623 * flag. When this is passed, if a device driver runs out of transmit
625 624 * descriptors, then the MAC layer will drop any unsent traffic. The client in
626 625 * this case will never have any frames returned to it.
627 626 *
628 627 * DON'T ENQUEUE
629 628 *
630 629 * This mode is controlled by having the client pass the MAC_TX_NO_ENQUEUE flag.
631 630 * If the MAC_DROP_ON_NO_DESC flag is also passed, it takes precedence. In this
632 631 * mode, when we hit a case where a driver runs out of transmit descriptors,
633 632 * then instead of enqueuing packets in a soft ring set or software ring, we
634 633 * instead return the mblk_t chain back to the caller and immediately put the
635 634 * soft ring set into flow control mode.
636 635 *
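The three dispositions and their precedence can be summarized in a short sketch (flag values here are placeholders, not the real MAC client flag definitions; the enqueue/drop helpers are hypothetical):

    #include <stddef.h>
    #include <stdint.h>

    typedef struct mblk mblk_t;        /* opaque stand-in */

    #define MAC_DROP_ON_NO_DESC 0x01   /* placeholder value */
    #define MAC_TX_NO_ENQUEUE   0x02   /* placeholder value */

    extern void drop_chain(mblk_t *);
    extern void enqueue_chain(mblk_t *);   /* up to the high watermark */

    /*
     * Disposition of an unsent chain: drop beats no-enqueue, which
     * beats the default of queueing in the SRS. Returns any chain the
     * caller must take back (it is then subject to flow control).
     */
    static mblk_t *
    tx_no_desc(mblk_t *chain, uint16_t flags)
    {
        if (flags & MAC_DROP_ON_NO_DESC) {
            drop_chain(chain);
            return (NULL);
        }
        if (flags & MAC_TX_NO_ENQUEUE)
            return (chain);
        enqueue_chain(chain);
        return (NULL);
    }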
637 636 * The following series of ASCII art images describe the transmit data path that
638 637 * MAC clients enter into based on calling into mac_tx(). A soft ring set has a
639 638 * transmission function associated with it. There are seven possible
640 639 * transmission modes, some of which share function entry points. The one that a
641 640 * soft ring set gets depends on properties such as whether there are
642 641 * transmission rings for fanout, whether the device involves aggregations,
643 642 * whether any bandwidth limits exist, etc.
644 643 *
645 644 *
646 645 * Part 1 -- Initial checks
647 646 *
648 647 * * . called by
649 648 * | MAC clients
650 649 * v . . No
651 650 * +--------+ +-----------+ . +-------------------+ +====================+
652 651 * | mac_tx |->| device |-*-->| mac_protect_check |->v Is this the simple v
653 652 * +--------+ | quiesced? | +-------------------+ v case? See [1] v
654 653 * +-----------+ | +====================+
655 654 * * . Yes * failed |
656 655 * v | frames |
657 656 * +--------------+ | +-------+---------+
658 657 * | freemsgchain |<---------+ Yes . * No . *
659 658 * +--------------+ v v
660 659 * +-----------+ +--------+
661 660 * | goto | | goto |
662 661 * | Part 2 | | SRS TX |
663 662 * | Entry [A] | | func |
664 663 * +-----------+ +--------+
665 664 * | |
666 665 * | v
667 666 * | +--------+
668 667 * +---------->| return |
669 668 * | cookie |
670 669 * +--------+
671 670 *
672 671 * [1] The simple case refers to the SRS being configured with the
673 672 * SRS_TX_DEFAULT transmission mode, having a single mblk_t (not a chain), there
674 673 * being only a single active client, and not having a backlog in the srs.
675 674 *
676 675 *
677 676 * Part 2 -- The SRS transmission functions
678 677 *
679 678 * This part is a bit more complicated. The different transmission paths often
680 679 * leverage one another. In this case, we'll draw out the more common ones
681 680 * before the parts that depend upon them. Here, we're going to start with the
682 681 * workings of mac_tx_send(), a common function that most of the others end up
683 682 * calling.
684 683 *
685 684 * +-------------+
686 685 * | mac_tx_send |
687 686 * +-------------+
688 687 * |
689 688 * v
690 689 * +=============+ +==============+
691 690 * v more than v--->v check v
692 691 * v one client? v v VLAN and add v
693 692 * +=============+ v VLAN tags v
694 693 * | +==============+
695 694 * | |
696 695 * +------------------+
697 696 * |
698 697 * | [A]
699 698 * v |
700 699 * +============+ . No v
701 700 * v more than v . +==========+ +--------------------------+
702 701 * v one active v-*---->v for each v---->| mac_promisc_dispatch_one |---+
703 702 * v client? v v mblk_t v +--------------------------+ |
704 703 * +============+ +==========+ ^ |
705 704 * | | +==========+ |
706 705 * * . Yes | v hardware v<-------+
707 706 * v +------------+ v rings? v
708 707 * +==========+ | +==========+
709 708 * v for each v No . . . * |
710 709 * v mblk_t v specific | |
711 710 * +==========+ flow | +-----+-----+
712 711 * | | | |
713 712 * v | v v
714 713 * +-----------------+ | +-------+ +---------+
715 714 * | mac_tx_classify |------------+ | GLDv3 | | GLDv3 |
716 715 * +-----------------+ |TX func| | ring tx |
717 716 * | +-------+ | func |
718 717 * * Specific flow, generally | +---------+
719 718 * | bcast, mcast, loopback | |
720 719 * v +-----+-----+
721 720 * +==========+ +---------+ |
722 721 * v valid L2 v--*--->| freemsg | v
723 722 * v header v . No +---------+ +-------------------+
724 723 * +==========+ | return unconsumed |
725 724 * * . Yes | frames to the |
726 725 * v | caller |
727 726 * +===========+ +-------------------+
728 727 * v broadcast v +----------------+ ^
729 728 * v flow? v--*-->| mac_bcast_send |------------------+
730 729 * +===========+ . +----------------+ |
731 730 * | . . Yes |
732 731 * No . * v
733 732 * | +---------------------+ +---------------+ +----------+
734 733 * +->|mac_promisc_dispatch |->| mac_fix_cksum |->| flow |
735 734 * +---------------------+ +---------------+ | callback |
736 735 * +----------+
737 736 *
738 737 *
739 738 * In addition, many, but not all, of the routines rely on
740 739 * mac_tx_soft_ring_process() as an entry point.
741 740 *
742 741 *
743 742 * . No . No
744 743 * +--------------------------+ +========+ . +===========+ . +-------------+
745 744 * | mac_tx_soft_ring_process |-->v worker v-*->v out of tx v-*->| goto |
746 745 * +--------------------------+ v only? v v descr.? v | mac_tx_send |
747 746 * +========+ +===========+ +-------------+
748 747 * Yes . * * . Yes |
749 748 * . No v | v
750 749 * +=========+ . +===========+ . Yes | Yes . +==========+
751 750 * v append  v<--*----------v out of tx v-*-------+---------*--v returned v
752 751 * v mblk_t v v descr.? v | v frames? v
753 752 * v chain v +===========+ | +==========+
754 753 * +=========+ | *. No
755 754 * | | v
756 755 * v v +------------+
757 756 * +===================+ +----------------------+ | done |
758 757 * v worker scheduled? v | mac_tx_sring_enqueue | | processing |
759 758 * v Out of tx descr? v +----------------------+ +------------+
760 759 * +===================+ |
761 760 * | | . Yes v
762 761 * * Yes * No . +============+
763 762 * | v +-*---------v drop on no v
764 763 * | +========+ v v TX desc? v
765 764 * | v wake v +----------+ +============+
766 765 * | v worker v | mac_pkt_ | * . No
767 766 * | +========+ | drop | | . Yes . No
768 767 * | | +----------+ v . .
769 768 * | | v ^ +===============+ . +========+ .
770 769 * +--+--------+---------+ | v Don't enqueue v-*->v ring v-*----+
771 770 * | | v Set? v v empty? v |
772 771 * | +---------------+ +===============+ +========+ |
773 772 * | | | | |
774 773 * | | +-------------------+ | |
775 774 * | *. Yes | +---------+ |
776 775 * | | v v v
777 776 * | | +===========+ +========+ +--------------+
778 777 * | +<-v At hiwat? v v append v | return |
779 778 * | +===========+ v mblk_t v | mblk_t chain |
780 779 * | * No v chain v | and flow |
781 780 * | v +========+ | control |
782 781 * | +=========+ | | cookie |
783 782 * | v append v v +--------------+
784 783 * | v mblk_t v +========+
785 784 * | v chain v v wake v +------------+
786 785 * | +=========+ v worker v-->| done |
787 786 * | | +========+ | processing |
788 787 * | v .. Yes +------------+
789 788 * | +=========+ . +========+
790 789 * | v first v--*-->v wake v
791 790 * | v append? v v worker v
792 791 * | +=========+ +========+
793 792 * | | |
794 793 * | No . * |
795 794 * | v |
796 795 * | +--------------+ |
797 796 * +------>| Return | |
798 797 * | flow control |<------------+
799 798 * | cookie |
800 799 * +--------------+
801 800 *
802 801 *
803 802 * The remaining images are all specific to each of the different transmission
804 803 * modes.
805 804 *
806 805 * SRS TX DEFAULT
807 806 *
808 807 * [ From Part 1 ]
809 808 * |
810 809 * v
811 810 * +-------------------------+
812 811 * | mac_tx_single_ring_mode |
813 812 * +-------------------------+
814 813 * |
815 814 * | . Yes
816 815 * v .
817 816 * +==========+ . +============+
818 817 * v SRS v-*->v Try to v---->---------------------+
819 818 * v backlog? v v enqueue in v |
820 819 * +==========+ v SRS v-->------+ * . . Queue too
821 820 * | +============+ * don't enqueue | deep or
822 821 * * . No ^ | | flag or at | drop flag
823 822 * | | v | hiwat, |
824 823 * v | | | return +---------+
825 824 * +-------------+ | | | cookie | freemsg |
826 825 * | goto |-*-----+ | | +---------+
827 826 * | mac_tx_send | . returned | | |
828 827 * +-------------+ mblk_t | | |
829 828 * | | | |
830 829 * | | | |
831 830 * * . . all mblk_t * queued, | |
832 831 * v consumed | may return | |
833 832 * +-------------+ | tx cookie | |
834 833 * | SRS TX func |<------------+------------+----------------+
835 834 * | completed |
836 835 * +-------------+
837 836 *
838 837 * SRS_TX_SERIALIZE
839 838 *
840 839 * +------------------------+
841 840 * | mac_tx_serializer_mode |
842 841 * +------------------------+
843 842 * |
844 843 * | . No
845 844 * v .
846 845 * +============+ . +============+ +-------------+ +============+
847 846 * v srs being v-*->v set SRS v--->| goto |-->v remove SRS v
848 847 * v processed? v v proc flags v | mac_tx_send | v proc flag v
849 848 * +============+ +============+ +-------------+ +============+
850 849 * | |
851 850 * * Yes |
852 851 * v . No v
853 852 * +--------------------+ . +==========+
854 853 * | mac_tx_srs_enqueue | +------------------------*-----<--v returned v
855 854 * +--------------------+ | v frames? v
856 855 * | | . Yes +==========+
857 856 * | | . |
858 857 * | | . +=========+ v
859 858 * v +-<-*-v queued v +--------------------+
860 859 * +-------------+ | v frames? v<----| mac_tx_srs_enqueue |
861 860 * | SRS TX func | | +=========+ +--------------------+
862 861 * | completed, |<------+ * . Yes
863 862 * | may return | | v
864 863 * | cookie | | +========+
865 864 * +-------------+ +-<---v wake v
866 865 * v worker v
867 866 * +========+
868 867 *
869 868 *
870 869 * SRS_TX_FANOUT
871 870 *
872 871 * . Yes
873 872 * +--------------------+ +=============+ . +--------------------------+
874 873 * | mac_tx_fanout_mode |--->v Have fanout v-*-->| goto |
875 874 * +--------------------+ v hint? v | mac_rx_soft_ring_process |
876 875 * +=============+ +--------------------------+
877 876 * * . No |
878 877 * v ^
879 878 * +===========+ |
880 879 * +--->v for each v +===============+
881 880 * | v mblk_t v v pick softring v
882 881 * same * +===========+ v from hash v
883 882 * hash | | +===============+
884 883 * | v |
885 884 * | +--------------+ |
886 885 * +---| mac_pkt_hash |--->*------------+
887 886 * +--------------+ . different
888 887 * hash or
889 888 * done proc.
890 889 * SRS_TX_AGGR chain
891 890 *
892 891 * +------------------+ +================================+
893 892 * | mac_tx_aggr_mode |--->v Use aggr capab function to v
894 893 * +------------------+ v find appropriate tx ring. v
895 894 * v Applies hash based on aggr v
896 895 * v policy, see mac_tx_aggr_mode() v
897 896 * +================================+
898 897 * |
899 898 * v
900 899 * +-------------------------------+
901 900 * | goto |
902 901 * | mac_rx_srs_soft_ring_process |
903 902 * +-------------------------------+
904 903 *
905 904 *
906 905 * SRS_TX_BW, SRS_TX_BW_FANOUT, SRS_TX_BW_AGGR
907 906 *
908 907 * Note, all three of these tx functions start from the same place --
909 908 * mac_tx_bw_mode().
910 909 *
911 910 * +----------------+
912 911 * | mac_tx_bw_mode |
913 912 * +----------------+
914 913 * |
915 914 * v . No . No . Yes
916 915 * +==============+ . +============+ . +=============+ . +=========+
917 916 * v Out of BW? v--*->v SRS empty? v--*->v reset BW v-*->v Bump BW v
918 917 * +==============+ +============+ v tick count? v v Usage v
919 918 * | | +=============+ +=========+
920 919 * | +---------+ | |
921 920 * | | +--------------------+ |
922 921 * | | | +----------------------+
923 922 * v | v v
924 923 * +===============+ | +==========+ +==========+ +------------------+
925 924 * v Don't enqueue v | v set bw v v Is aggr? v--*-->| goto |
926 925 * v flag set? v | v enforced v +==========+ . | mac_tx_aggr_mode |-+
927 926 * +===============+ | +==========+ | . +------------------+ |
928 927 * | Yes .* | | No . * . |
929 928 * | | | | | . Yes |
930 929 * * . No | | v | |
931 930 * | +---------+ | +========+ v +======+ |
932 931 * | | freemsg | | v append v +============+ . Yes v pick v |
933 932 * | +---------+ | v mblk_t v v Is fanout? v--*---->v ring v |
934 933 * | | | v chain v +============+ +======+ |
935 934 * +------+ | +========+ | | |
936 935 * v | | v v |
937 936 * +---------+ | v +-------------+ +--------------------+ |
938 937 * | return | | +========+ | goto | | goto | |
939 938 * | flow | | v wakeup v | mac_tx_send | | mac_tx_fanout_mode | |
940 939 * | control | | v worker v +-------------+ +--------------------+ |
941 940 * | cookie | | +========+ | | |
942 941 * +---------+ | | | +------+------+
943 942 * | v | |
944 943 * | +---------+ | v
945 944 * | | return | +============+ +------------+
946 945 * | | flow | v unconsumed v-------+ | done |
947 946 * | | control | v frames? v | | processing |
948 947 * | | cookie | +============+ | +------------+
949 948 * | +---------+ | |
950 949 * | Yes * |
951 950 * | | |
952 951 * | +===========+ |
953 952 * | v subtract v |
954 953 * | v unused bw v |
955 954 * | +===========+ |
956 955 * | | |
957 956 * | v |
958 957 * | +--------------------+ |
959 958 * +------------->| mac_tx_srs_enqueue | |
960 959 * +--------------------+ |
961 960 * | |
962 961 * | |
963 962 * +------------+ |
964 963 * | return fc | |
965 964 * | cookie and |<------+
966 965 * | mblk_t |
967 966 * +------------+
968 967 */
969 968
970 969 #include <sys/types.h>
971 970 #include <sys/callb.h>
972 971 #include <sys/sdt.h>
973 972 #include <sys/strsubr.h>
974 973 #include <sys/strsun.h>
975 974 #include <sys/vlan.h>
976 975 #include <sys/stack.h>
977 976 #include <sys/archsystm.h>
978 977 #include <inet/ipsec_impl.h>
979 978 #include <inet/ip_impl.h>
980 979 #include <inet/sadb.h>
981 980 #include <inet/ipsecesp.h>
982 981 #include <inet/ipsecah.h>
983 982 #include <inet/ip6.h>
984 983
985 984 #include <sys/mac_impl.h>
986 985 #include <sys/mac_client_impl.h>
987 986 #include <sys/mac_client_priv.h>
988 987 #include <sys/mac_soft_ring.h>
989 988 #include <sys/mac_flow_impl.h>
990 989
991 990 static mac_tx_cookie_t mac_tx_single_ring_mode(mac_soft_ring_set_t *, mblk_t *,
992 991 uintptr_t, uint16_t, mblk_t **);
993 992 static mac_tx_cookie_t mac_tx_serializer_mode(mac_soft_ring_set_t *, mblk_t *,
994 993 uintptr_t, uint16_t, mblk_t **);
995 994 static mac_tx_cookie_t mac_tx_fanout_mode(mac_soft_ring_set_t *, mblk_t *,
996 995 uintptr_t, uint16_t, mblk_t **);
997 996 static mac_tx_cookie_t mac_tx_bw_mode(mac_soft_ring_set_t *, mblk_t *,
998 997 uintptr_t, uint16_t, mblk_t **);
999 998 static mac_tx_cookie_t mac_tx_aggr_mode(mac_soft_ring_set_t *, mblk_t *,
1000 999 uintptr_t, uint16_t, mblk_t **);
1001 1000
1002 1001 typedef struct mac_tx_mode_s {
1003 1002 mac_tx_srs_mode_t mac_tx_mode;
1004 1003 mac_tx_func_t mac_tx_func;
1005 1004 } mac_tx_mode_t;
1006 1005
1007 1006 /*
1008 1007 * There are seven modes of operation on the Tx side. These modes get set
1009 1008 * in mac_tx_srs_setup(). Except for the experimental TX_SERIALIZE mode,
1010 1009 * none of the other modes are user configurable. They get selected by
1011 1010 * the system depending upon whether the link (or flow) has multiple Tx
1012 1011 * rings or a bandwidth configured, or if the link is an aggr, etc.
1013 1012 *
1014 1013 * When the Tx SRS is operating in aggr mode (st_mode) or if there are
1015 1014 * multiple Tx rings owned by Tx SRS, then each Tx ring (pseudo or
1016 1015 * otherwise) will have a soft ring associated with it. These soft rings
1017 1016 * are stored in srs_tx_soft_rings[] array.
1018 1017 *
1019 1018 * Additionally in the case of aggr, there is the st_soft_rings[] array
1020 1019 * in the mac_srs_tx_t structure. This array is used to store the same
1021 1020 * set of soft rings that are present in srs_tx_soft_rings[] array but
1022 1021 * in a different manner. The soft ring associated with the pseudo Tx
1023 1022 * ring is saved at mr_index (of the pseudo ring) in st_soft_rings[]
1024 1023 * array. This helps in quickly getting the soft ring associated with the
1025 1024 * Tx ring when aggr_find_tx_ring() returns the pseudo Tx ring that is to
1026 1025 * be used for transmit.
1027 1026 */
1028 1027 mac_tx_mode_t mac_tx_mode_list[] = {
1029 1028 {SRS_TX_DEFAULT, mac_tx_single_ring_mode},
1030 1029 {SRS_TX_SERIALIZE, mac_tx_serializer_mode},
1031 1030 {SRS_TX_FANOUT, mac_tx_fanout_mode},
1032 1031 {SRS_TX_BW, mac_tx_bw_mode},
1033 1032 {SRS_TX_BW_FANOUT, mac_tx_bw_mode},
1034 1033 {SRS_TX_AGGR, mac_tx_aggr_mode},
1035 1034 {SRS_TX_BW_AGGR, mac_tx_bw_mode}
1036 1035 };
1037 1036
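The point of the second array is O(1) lookup by pseudo ring index; a sketch with simplified stand-in types:

    /* Simplified stand-ins for the kernel structures. */
    typedef struct soft_ring soft_ring_t;

    typedef struct srs_tx {
        soft_ring_t **st_soft_rings;   /* indexed by mr_index */
    } srs_tx_t;

    typedef struct pseudo_ring {
        unsigned int mr_index;
    } pseudo_ring_t;

    /*
     * Given the pseudo Tx ring chosen by aggr_find_tx_ring(), find its
     * soft ring directly rather than searching srs_tx_soft_rings[].
     */
    static soft_ring_t *
    aggr_soft_ring_for(const srs_tx_t *st, const pseudo_ring_t *ring)
    {
        return (st->st_soft_rings[ring->mr_index]);
    }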
1038 1037 /*
1039 1038 * Soft Ring Set (SRS) - The Run time code that deals with
1040 1039 * dynamic polling from the hardware, bandwidth enforcement,
1041 1040 * fanout etc.
1042 1041 *
1043 1042 * We try to use H/W classification on NIC and assign traffic for
1044 1043 * a MAC address to a particular Rx ring or ring group. There is a
1045 1044 * 1-1 mapping between a SRS and a Rx ring. The SRS dynamically
1046 1045 * switches the underlying Rx ring between interrupt and
1047 1046 * polling mode and enforces any specified B/W control.
1048 1047 *
1049 1048 * There is always a SRS created and tied to each H/W and S/W rule.
1050 1049 * Whenever we create a H/W rule, we always add the same rule to the
1051 1050 * S/W classifier and tie a SRS to it.
1052 1051 *
1053 1052 * In case a B/W control is specified, it is broken into bytes
1054 1053 * per ticks and as soon as the quota for a tick is exhausted,
1055 1054 * the underlying Rx ring is forced into poll mode for remainder of
1056 1055 * the tick. The SRS poll thread only polls for bytes that are
1057 1056 * allowed to come in the SRS. We typically let 4x the configured
1058 1057 * B/W worth of packets come into the SRS (to prevent unnecessary
1059 1058 * drops due to bursts) but only process the specified amount.
1060 1059 *
1061 1060 * A MAC client (e.g. a VNIC or aggr) can have 1 or more
1062 1061 * Rx rings (and corresponding SRSs) assigned to it. The SRS
1063 1062 * in turn can have softrings to do protocol level fanout or
1064 1063 * softrings to do S/W based fanout or both. In case the NIC
1065 1064 * has no Rx rings, we do S/W classification to the respective SRS.
1066 1065 * The S/W classification rule is always setup and ready. This
1067 1066 * allows the MAC layer to reassign Rx rings whenever needed
1068 1067 * but packets still continue to flow via the default path and
1069 1068 * get S/W classified to the correct SRS.
1070 1069 *
1071 1070 * The SRS's are used on both Tx and Rx side. They use the same
1072 1071 * data structure but the processing routines have slightly different
1073 1072 * semantics due to the fact that Rx side needs to do dynamic
1074 1073 * polling etc.
1075 1074 *
1076 1075 * Dynamic Polling Notes
1077 1076 * =====================
1078 1077 *
1079 1078 * Each Soft ring set is capable of switching its Rx ring between
1080 1079 * interrupt and poll mode and actively 'polls' for packets in
1081 1080 * poll mode. If the SRS is implementing a B/W limit, it makes
1082 1081 * sure that only the maximum allowed packets are pulled in poll mode
1083 1082 * and goes to poll mode as soon as the B/W limit is exceeded. As
1084 1083 * such, there are no overheads to implement B/W limits.
1085 1084 *
1086 1085 * In poll mode, it's better to keep the pipeline going where the
1087 1086 * SRS worker thread keeps processing packets and the poll thread
1088 1087 * keeps bringing more packets (especially if they get to run
1089 1088 * on different CPUs). This also prevents the overheads associated
1090 1089 * with excessive signalling (on NUMA machines, this can be
1091 1090 * pretty devastating). The exception is latency optimized case
1092 1091 * where worker thread does no work and interrupt and poll thread
1093 1092 * are allowed to do their own drain.
1094 1093 *
1095 1094 * We use the following policy to control Dynamic Polling:
1096 1095 * 1) We switch to poll mode anytime the processing
1097 1096 * thread causes a backlog to build up in SRS and
1098 1097 * its associated Soft Rings (sr_poll_pkt_cnt > 0).
1099 1098 * 2) As long as the backlog stays under the low water
1100 1099 * mark (sr_lowat), we poll the H/W for more packets.
1101 1100 * 3) If the backlog (sr_poll_pkt_cnt) exceeds low
1102 1101 * water mark, we stay in poll mode but don't poll
1103 1102 * the H/W for more packets.
1104 1103 * 4) Anytime in polling mode, if we poll the H/W for
1105 1104 * packets and find nothing plus we have an existing
1106 1105 * backlog (sr_poll_pkt_cnt > 0), we stay in polling
1107 1106 * mode but don't poll the H/W for packets anymore
1108 1107 * (let the polling thread go to sleep).
1109 1108 * 5) Once the backlog is relieved (packets are processed)
1110 1109 * we reenable polling (by signalling the poll thread)
1111 1110 * only when the backlog dips below sr_poll_thres.
1112 1111 * 6) sr_hiwat is used exclusively when we are not
1113 1112 * polling capable and is used to decide when to
1114 1113 * drop packets so the SRS queue length doesn't grow
1115 1114 * infinitely.
1116 1115 *
1117 1116 * NOTE: Also see the block level comment on top of mac_soft_ring.c
1118 1117 */
1119 1118
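Policy points 1-3 and 5 above distill to backlog comparisons; a sketch follows (field names echo the SRS counters, and the empty-poll tracking of point 4 is omitted):

    #include <stdbool.h>
    #include <stdint.h>

    typedef struct srs_poll_state {
        uint32_t sr_poll_pkt_cnt;   /* unprocessed backlog */
        uint32_t sr_lowat;          /* low water mark */
        uint32_t sr_poll_thres;     /* re-enable threshold */
        bool     srs_polling;       /* currently in poll mode? */
    } srs_poll_state_t;

    /* Should the poll thread pull more packets from the H/W? */
    static bool
    should_poll_hw(const srs_poll_state_t *s)
    {
        if (!s->srs_polling)
            return (s->sr_poll_pkt_cnt > 0);          /* point 1 */
        return (s->sr_poll_pkt_cnt < s->sr_lowat);    /* points 2-3 */
    }

    /* Once the backlog is relieved, when do we signal the poller? */
    static bool
    should_wake_poller(const srs_poll_state_t *s)
    {
        return (s->sr_poll_pkt_cnt < s->sr_poll_thres);   /* point 5 */
    }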
1120 1119 /*
1121 1120 * mac_latency_optimize
1122 1121 *
1123 1122 * Controls whether the poll thread can process the packets inline
1124 1123 * or let the SRS worker thread do the processing. This applies if
1125 1124 * the SRS was not being processed. For latency sensitive traffic,
1126 1125 * this needs to be true to allow inline processing. For throughput
1127 1126 * under load, this should be false.
1128 1127 *
1129 1128  * This tunable (and other similar ones) should be rolled into a
1130 1129  * link- or flow-specific workload hint that can be set using dladm
1131 1130  * linkprop (instead of multiple such tunables).
1132 1131 */
1133 1132 boolean_t mac_latency_optimize = B_TRUE;
1134 1133
1135 1134 /*
1136 1135 * MAC_RX_SRS_ENQUEUE_CHAIN and MAC_TX_SRS_ENQUEUE_CHAIN
1137 1136 *
1138 1137  * Queue an mp or chain in the soft ring set and increment the
1139 1138 * local count (srs_count) for the SRS and the shared counter
1140 1139 * (srs_poll_pkt_cnt - shared between SRS and its soft rings
1141 1140 * to track the total unprocessed packets for polling to work
1142 1141 * correctly).
1143 1142 *
1144 1143 * The size (total bytes queued) counters are incremented only
1145 1144 * if we are doing B/W control.
1146 1145 */
1147 1146 #define MAC_SRS_ENQUEUE_CHAIN(mac_srs, head, tail, count, sz) { \
1148 1147 ASSERT(MUTEX_HELD(&(mac_srs)->srs_lock)); \
1149 1148 if ((mac_srs)->srs_last != NULL) \
1150 1149 (mac_srs)->srs_last->b_next = (head); \
1151 1150 else \
1152 1151 (mac_srs)->srs_first = (head); \
1153 1152 (mac_srs)->srs_last = (tail); \
1154 1153 (mac_srs)->srs_count += count; \
1155 1154 }
1156 1155
1157 1156 #define MAC_RX_SRS_ENQUEUE_CHAIN(mac_srs, head, tail, count, sz) { \
1158 1157 mac_srs_rx_t *srs_rx = &(mac_srs)->srs_rx; \
1159 1158 \
1160 1159 MAC_SRS_ENQUEUE_CHAIN(mac_srs, head, tail, count, sz); \
1161 1160 srs_rx->sr_poll_pkt_cnt += count; \
1162 1161 ASSERT(srs_rx->sr_poll_pkt_cnt > 0); \
1163 1162 if ((mac_srs)->srs_type & SRST_BW_CONTROL) { \
1164 1163 (mac_srs)->srs_size += (sz); \
1165 1164 mutex_enter(&(mac_srs)->srs_bw->mac_bw_lock); \
1166 1165 (mac_srs)->srs_bw->mac_bw_sz += (sz); \
1167 1166 mutex_exit(&(mac_srs)->srs_bw->mac_bw_lock); \
1168 1167 } \
1169 1168 }
1170 1169
1171 1170 #define MAC_TX_SRS_ENQUEUE_CHAIN(mac_srs, head, tail, count, sz) { \
1172 1171 mac_srs->srs_state |= SRS_ENQUEUED; \
1173 1172 MAC_SRS_ENQUEUE_CHAIN(mac_srs, head, tail, count, sz); \
1174 1173 if ((mac_srs)->srs_type & SRST_BW_CONTROL) { \
1175 1174 (mac_srs)->srs_size += (sz); \
1176 1175 (mac_srs)->srs_bw->mac_bw_sz += (sz); \
1177 1176 } \
1178 1177 }
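Both macros get their O(1) append from keeping a tail pointer (srs_last)
alongside the head. A small user-space sketch of the same logic follows;
node_t and srsq_t are hypothetical stand-ins for mblk_t and the SRS queue
fields:

#include <assert.h>
#include <stdio.h>

typedef struct node {
	struct node *b_next;
	int id;
} node_t;

typedef struct {
	node_t *first;	/* models srs_first */
	node_t *last;	/* models srs_last */
	int count;	/* models srs_count */
} srsq_t;

/* Append a pre-linked [head..tail] chain of cnt nodes in O(1). */
static void
enqueue_chain(srsq_t *q, node_t *head, node_t *tail, int cnt)
{
	assert(tail->b_next == NULL);
	if (q->last != NULL)
		q->last->b_next = head;	/* append after the current tail */
	else
		q->first = head;	/* queue was empty */
	q->last = tail;
	q->count += cnt;
}

int
main(void)
{
	node_t a = { NULL, 1 }, b = { NULL, 2 };
	srsq_t q = { NULL, NULL, 0 };

	a.b_next = &b;			/* two-packet chain: a -> b */
	enqueue_chain(&q, &a, &b, 2);
	printf("count=%d first=%d last=%d\n", q.count, q.first->id,
	    q.last->id);
	return (0);
}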
1179 1178
1180 1179 /*
1181 1180  * Macros to turn polling on.
1182 1181 */
1183 1182 #define MAC_SRS_POLLING_ON(mac_srs) { \
1184 1183 ASSERT(MUTEX_HELD(&(mac_srs)->srs_lock)); \
1185 1184 if (((mac_srs)->srs_state & \
1186 1185 (SRS_POLLING_CAPAB|SRS_POLLING)) == SRS_POLLING_CAPAB) { \
1187 1186 (mac_srs)->srs_state |= SRS_POLLING; \
1188 1187 (void) mac_hwring_disable_intr((mac_ring_handle_t) \
1189 1188 (mac_srs)->srs_ring); \
1190 1189 (mac_srs)->srs_rx.sr_poll_on++; \
1191 1190 } \
1192 1191 }
1193 1192
1194 1193 #define MAC_SRS_WORKER_POLLING_ON(mac_srs) { \
1195 1194 ASSERT(MUTEX_HELD(&(mac_srs)->srs_lock)); \
1196 1195 if (((mac_srs)->srs_state & \
1197 1196 (SRS_POLLING_CAPAB|SRS_WORKER|SRS_POLLING)) == \
1198 1197 (SRS_POLLING_CAPAB|SRS_WORKER)) { \
1199 1198 (mac_srs)->srs_state |= SRS_POLLING; \
1200 1199 (void) mac_hwring_disable_intr((mac_ring_handle_t) \
1201 1200 (mac_srs)->srs_ring); \
1202 1201 (mac_srs)->srs_rx.sr_worker_poll_on++; \
1203 1202 } \
1204 1203 }
1205 1204
1206 1205 /*
1207 1206 * MAC_SRS_POLL_RING
1208 1207 *
1209 1208 * Signal the SRS poll thread to poll the underlying H/W ring
1210 1209 * provided it wasn't already polling (SRS_GET_PKTS was set).
1211 1210 *
1212 1211  * The poll thread gets to run only from mac_rx_srs_drain() and only
1213 1212 * if the drain was being done by the worker thread.
1214 1213 */
1215 1214 #define MAC_SRS_POLL_RING(mac_srs) { \
1216 1215 mac_srs_rx_t *srs_rx = &(mac_srs)->srs_rx; \
1217 1216 \
1218 1217 ASSERT(MUTEX_HELD(&(mac_srs)->srs_lock)); \
1219 1218 srs_rx->sr_poll_thr_sig++; \
1220 1219 if (((mac_srs)->srs_state & \
1221 1220 (SRS_POLLING_CAPAB|SRS_WORKER|SRS_GET_PKTS)) == \
1222 1221 (SRS_WORKER|SRS_POLLING_CAPAB)) { \
1223 1222 (mac_srs)->srs_state |= SRS_GET_PKTS; \
1224 1223 cv_signal(&(mac_srs)->srs_cv); \
1225 1224 } else { \
1226 1225 srs_rx->sr_poll_thr_busy++; \
1227 1226 } \
1228 1227 }
1229 1228
1230 1229 /*
1231 1230 * MAC_SRS_CHECK_BW_CONTROL
1232 1231 *
1233 1232  * Check to see if the next tick has started so we can reset the
1234 1233  * SRS_BW_ENFORCED flag and allow more packets to come into the
1235 1234  * system.
1236 1235 */
1237 1236 #define MAC_SRS_CHECK_BW_CONTROL(mac_srs) { \
1238 1237 ASSERT(MUTEX_HELD(&(mac_srs)->srs_lock)); \
1239 1238 ASSERT(((mac_srs)->srs_type & SRST_TX) || \
1240 1239 MUTEX_HELD(&(mac_srs)->srs_bw->mac_bw_lock)); \
1241 1240 clock_t now = ddi_get_lbolt(); \
1242 1241 if ((mac_srs)->srs_bw->mac_bw_curr_time != now) { \
1243 1242 (mac_srs)->srs_bw->mac_bw_curr_time = now; \
1244 1243 (mac_srs)->srs_bw->mac_bw_used = 0; \
1245 1244 if ((mac_srs)->srs_bw->mac_bw_state & SRS_BW_ENFORCED) \
1246 1245 (mac_srs)->srs_bw->mac_bw_state &= ~SRS_BW_ENFORCED; \
1247 1246 } \
1248 1247 }
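The reset is what turns the B/W limit into a per-tick byte budget: usage
accumulates within one lbolt tick and is forgotten at the next. A minimal
user-space sketch follows; bw_t and bw_charge() are hypothetical stand-ins,
with time_t in place of clock_t/ddi_get_lbolt():

#include <stdio.h>
#include <stdbool.h>
#include <time.h>

typedef struct {
	time_t bw_curr_time;	/* tick the current window belongs to */
	long bw_used;		/* bytes consumed this tick */
	long bw_limit;		/* bytes allowed per tick */
	bool bw_enforced;	/* models SRS_BW_ENFORCED */
} bw_t;

/* Charge sz bytes; return false once the budget for this tick is spent. */
static bool
bw_charge(bw_t *bw, time_t now, long sz)
{
	if (bw->bw_curr_time != now) {	/* new tick: reset the window */
		bw->bw_curr_time = now;
		bw->bw_used = 0;
		bw->bw_enforced = false;
	}
	if (bw->bw_used + sz > bw->bw_limit) {
		bw->bw_enforced = true;
		return (false);
	}
	bw->bw_used += sz;
	return (true);
}

int
main(void)
{
	bw_t bw = { 0, 0, 1500, false };

	printf("%d\n", bw_charge(&bw, 1, 1000));	/* 1: fits */
	printf("%d\n", bw_charge(&bw, 1, 1000));	/* 0: limit hit */
	printf("%d\n", bw_charge(&bw, 2, 1000));	/* 1: new tick */
	return (0);
}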
1249 1248
1250 1249 /*
1251 1250 * MAC_SRS_WORKER_WAKEUP
1252 1251 *
1253 1252 * Wake up the SRS worker thread to process the queue as long as
1254 1253 * no one else is processing the queue. If we are optimizing for
1255 1254 * latency, we wake up the worker thread immediately or else we
1256 1255  * wait mac_srs_worker_wakeup_ticks before the worker thread is
1257 1256  * woken up.
1258 1257 */
1259 1258 int mac_srs_worker_wakeup_ticks = 0;
1260 1259 #define MAC_SRS_WORKER_WAKEUP(mac_srs) { \
1261 1260 ASSERT(MUTEX_HELD(&(mac_srs)->srs_lock)); \
1262 1261 if (!((mac_srs)->srs_state & SRS_PROC) && \
1263 1262 (mac_srs)->srs_tid == NULL) { \
1264 1263 if (((mac_srs)->srs_state & SRS_LATENCY_OPT) || \
1265 1264 (mac_srs_worker_wakeup_ticks == 0)) \
1266 1265 cv_signal(&(mac_srs)->srs_async); \
1267 1266 else \
1268 1267 (mac_srs)->srs_tid = \
1269 1268 timeout(mac_srs_fire, (mac_srs), \
1270 1269 mac_srs_worker_wakeup_ticks); \
1271 1270 } \
1272 1271 }
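A sketch of the immediate-versus-deferred choice the macro makes, assuming
hypothetical signal_now()/arm_timer() callbacks in place of cv_signal() and
timeout():

#include <stdio.h>
#include <stdbool.h>

typedef void (*wakeup_fn_t)(void);

static void signal_now(void) { (void) puts("wake the worker now"); }
static void arm_timer(void) { (void) puts("defer the wakeup"); }

/* Latency-optimized SRSes wake the worker immediately; others batch. */
static void
worker_wakeup(bool latency_opt, int wakeup_ticks, wakeup_fn_t now,
    wakeup_fn_t later)
{
	if (latency_opt || wakeup_ticks == 0)
		now();		/* cf. cv_signal(&srs_async) */
	else
		later();	/* cf. timeout(mac_srs_fire, ...) */
}

int
main(void)
{
	worker_wakeup(true, 10, signal_now, arm_timer);	 /* immediate */
	worker_wakeup(false, 10, signal_now, arm_timer); /* deferred */
	return (0);
}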
1273 1272
1274 1273 #define TX_BANDWIDTH_MODE(mac_srs) \
1275 1274 ((mac_srs)->srs_tx.st_mode == SRS_TX_BW || \
1276 1275 (mac_srs)->srs_tx.st_mode == SRS_TX_BW_FANOUT || \
1277 1276 (mac_srs)->srs_tx.st_mode == SRS_TX_BW_AGGR)
1278 1277
1279 1278 #define TX_SRS_TO_SOFT_RING(mac_srs, head, hint) { \
1280 1279 if (tx_mode == SRS_TX_BW_FANOUT) \
1281 1280 (void) mac_tx_fanout_mode(mac_srs, head, hint, 0, NULL);\
1282 1281 else \
1283 1282 (void) mac_tx_aggr_mode(mac_srs, head, hint, 0, NULL); \
1284 1283 }
1285 1284
1286 1285 /*
1287 1286 * MAC_TX_SRS_BLOCK
1288 1287 *
1289 1288  * Always called from the mac_tx_srs_drain() function. SRS_TX_BLOCKED
1290 1289  * will be set only if srs_tx_woken_up is FALSE. If
1291 1290  * srs_tx_woken_up is TRUE, it indicates that the wakeup arrived
1292 1291  * before we grabbed srs_lock to set SRS_TX_BLOCKED. We need to
1293 1292  * attempt the transmit again, and leaving SRS_TX_BLOCKED unset
1294 1293  * allows that.
1295 1294 */
1296 1295 #define MAC_TX_SRS_BLOCK(srs, mp) { \
1297 1296 ASSERT(MUTEX_HELD(&(srs)->srs_lock)); \
1298 1297 if ((srs)->srs_tx.st_woken_up) { \
1299 1298 (srs)->srs_tx.st_woken_up = B_FALSE; \
1300 1299 } else { \
1301 1300 ASSERT(!((srs)->srs_state & SRS_TX_BLOCKED)); \
1302 1301 (srs)->srs_state |= SRS_TX_BLOCKED; \
1303 1302 (srs)->srs_tx.st_stat.mts_blockcnt++; \
1304 1303 } \
1305 1304 }
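The reason this avoids a lost wakeup is that the flag is tested and consumed
under the same lock the waker takes. A user-space sketch of the handshake
follows; txstate_t and tx_block() are hypothetical names:

#include <stdio.h>
#include <stdbool.h>

typedef struct {
	bool woken_up;	/* models srs_tx.st_woken_up */
	bool blocked;	/* models SRS_TX_BLOCKED */
} txstate_t;

/* Returns true if the caller should retry the transmit immediately. */
static bool
tx_block(txstate_t *ts)	/* caller holds the SRS lock */
{
	if (ts->woken_up) {
		ts->woken_up = false;	/* consume the early wakeup */
		return (true);		/* retry instead of blocking */
	}
	ts->blocked = true;		/* wait for the next tx update */
	return (false);
}

int
main(void)
{
	txstate_t ts = { false, false };

	ts.woken_up = true;	/* the driver's wakeup raced ahead of us */
	printf("retry=%d\n", tx_block(&ts));	/* retry=1 */
	printf("retry=%d\n", tx_block(&ts));	/* retry=0, now blocked */
	return (0);
}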
1306 1305
1307 1306 /*
1308 1307 * MAC_TX_SRS_TEST_HIWAT
1309 1308 *
1310 1309 * Called before queueing a packet onto Tx SRS to test and set
1311 1310 * SRS_TX_HIWAT if srs_count exceeds srs_tx_hiwat.
1312 1311 */
1313 1312 #define MAC_TX_SRS_TEST_HIWAT(srs, mp, tail, cnt, sz, cookie) { \
1314 1313 boolean_t enqueue = 1; \
1315 1314 \
1316 1315 if ((srs)->srs_count > (srs)->srs_tx.st_hiwat) { \
1317 1316 /* \
1318 1317 * flow-controlled. Store srs in cookie so that it \
1319 1318 * can be returned as mac_tx_cookie_t to client \
1320 1319 */ \
1321 1320 (srs)->srs_state |= SRS_TX_HIWAT; \
1322 1321 cookie = (mac_tx_cookie_t)srs; \
1323 1322 (srs)->srs_tx.st_hiwat_cnt++; \
1324 1323 if ((srs)->srs_count > (srs)->srs_tx.st_max_q_cnt) { \
1325 1324 /* increment freed stats */ \
1326 1325 (srs)->srs_tx.st_stat.mts_sdrops += cnt; \
1327 1326 /* \
1328 1327 * b_prev may be set to the fanout hint \
1329 1328 * hence can't use freemsg directly \
1330 1329 */ \
1331 1330 mac_pkt_drop(NULL, NULL, mp_chain, B_FALSE); \
1332 1331 DTRACE_PROBE1(tx_queued_hiwat, \
1333 1332 mac_soft_ring_set_t *, srs); \
1334 1333 enqueue = 0; \
1335 1334 } \
1336 1335 } \
1337 1336 if (enqueue) \
1338 1337 MAC_TX_SRS_ENQUEUE_CHAIN(srs, mp, tail, cnt, sz); \
1339 1338 }
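The macro implements a two-level check: above st_hiwat the chain is still
queued but the client is flow-controlled via the returned cookie; above
st_max_q_cnt the chain is dropped instead. A sketch of just that decision;
tx_hiwat_check() is a hypothetical stand-in that returns the outcome rather
than setting SRS state:

#include <stdio.h>

typedef enum { TX_ENQUEUE, TX_FLOW_CTL, TX_DROP } tx_verdict_t;

static tx_verdict_t
tx_hiwat_check(int qlen, int hiwat, int max_q_cnt)
{
	if (qlen <= hiwat)
		return (TX_ENQUEUE);	/* normal path */
	if (qlen > max_q_cnt)
		return (TX_DROP);	/* queue out of control: drop */
	return (TX_FLOW_CTL);		/* enqueue, but push back */
}

int
main(void)
{
	printf("%d\n", tx_hiwat_check(10, 100, 1000));	 /* 0: enqueue */
	printf("%d\n", tx_hiwat_check(200, 100, 1000));	 /* 1: flow ctl */
	printf("%d\n", tx_hiwat_check(2000, 100, 1000)); /* 2: drop */
	return (0);
}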
1340 1339
1341 1340 /* Some utility macros */
1342 1341 #define MAC_SRS_BW_LOCK(srs) \
1343 1342 if (!(srs->srs_type & SRST_TX)) \
1344 1343 mutex_enter(&srs->srs_bw->mac_bw_lock);
1345 1344
1346 1345 #define MAC_SRS_BW_UNLOCK(srs) \
1347 1346 if (!(srs->srs_type & SRST_TX)) \
1348 1347 mutex_exit(&srs->srs_bw->mac_bw_lock);
1349 1348
1350 1349 #define MAC_TX_SRS_DROP_MESSAGE(srs, mp, cookie) { \
1351 1350 mac_pkt_drop(NULL, NULL, mp, B_FALSE); \
1352 1351 /* increment freed stats */ \
1353 1352 mac_srs->srs_tx.st_stat.mts_sdrops++; \
1354 1353 cookie = (mac_tx_cookie_t)srs; \
1355 1354 }
1356 1355
1357 1356 #define MAC_TX_SET_NO_ENQUEUE(srs, mp_chain, ret_mp, cookie) { \
1358 1357 mac_srs->srs_state |= SRS_TX_WAKEUP_CLIENT; \
1359 1358 cookie = (mac_tx_cookie_t)srs; \
1360 1359 *ret_mp = mp_chain; \
1361 1360 }
1362 1361
1363 1362 /*
1364 1363 * MAC_RX_SRS_TOODEEP
1365 1364 *
1366 1365 * Macro called as part of receive-side processing to determine if handling
1367 1366 * can occur in situ (in the interrupt thread) or if it should be left to a
1368 1367 * worker thread. Note that the constant used to make this determination is
1369 1368  * not entirely made-up, and is a result of some empirical validation. That
1370 1369 * said, the constant is left as a static variable to allow it to be
1371 1370 * dynamically tuned in the field if and as needed.
1372 1371 */
1373 1372 static uintptr_t mac_rx_srs_stack_needed = 10240;
1374 1373 static uint_t mac_rx_srs_stack_toodeep;
1375 1374
1376 1375 #ifndef STACK_GROWTH_DOWN
1377 1376 #error Downward stack growth assumed.
1378 1377 #endif
1379 1378
1380 1379 #define MAC_RX_SRS_TOODEEP() (STACK_BIAS + (uintptr_t)getfp() - \
1381 1380 (uintptr_t)curthread->t_stkbase < mac_rx_srs_stack_needed && \
1382 1381 ++mac_rx_srs_stack_toodeep)
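A user-space sketch of the same headroom test, assuming a downward-growing
stack; frame_ptr and stack_base are hypothetical stand-ins for getfp() and
curthread->t_stkbase (the STACK_BIAS term is elided):

#include <stdio.h>
#include <stdint.h>
#include <stdbool.h>

/*
 * With a downward-growing stack, the headroom left is the distance from
 * the current frame pointer down to the base of the stack.
 */
static bool
stack_too_deep(uintptr_t frame_ptr, uintptr_t stack_base, uintptr_t needed)
{
	return (frame_ptr - stack_base < needed);
}

int
main(void)
{
	/* A frame 4 KiB above the base with 10 KiB needed: too deep. */
	printf("%d\n", stack_too_deep(0x10001000, 0x10000000, 10240));
	return (0);
}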
1383 1382
1384 1383
1385 1384 /*
1386 1385 * Drop the rx packet and advance to the next one in the chain.
1387 1386 */
1388 1387 static void
1389 1388 mac_rx_drop_pkt(mac_soft_ring_set_t *srs, mblk_t *mp)
1390 1389 {
1391 1390 mac_srs_rx_t *srs_rx = &srs->srs_rx;
1392 1391
1393 1392 ASSERT(mp->b_next == NULL);
1394 1393 mutex_enter(&srs->srs_lock);
1395 1394 MAC_UPDATE_SRS_COUNT_LOCKED(srs, 1);
1396 1395 MAC_UPDATE_SRS_SIZE_LOCKED(srs, msgdsize(mp));
1397 1396 mutex_exit(&srs->srs_lock);
1398 1397
1399 1398 srs_rx->sr_stat.mrs_sdrops++;
1400 1399 freemsg(mp);
1401 1400 }
1402 1401
1403 1402 /* DATAPATH RUNTIME ROUTINES */
1404 1403
1405 1404 /*
1406 1405 * mac_srs_fire
1407 1406 *
1408 1407 * Timer callback routine for waking up the SRS worker thread.
1409 1408 */
1410 1409 static void
1411 1410 mac_srs_fire(void *arg)
1412 1411 {
1413 1412 mac_soft_ring_set_t *mac_srs = (mac_soft_ring_set_t *)arg;
1414 1413
1415 1414 mutex_enter(&mac_srs->srs_lock);
1416 1415 if (mac_srs->srs_tid == NULL) {
1417 1416 mutex_exit(&mac_srs->srs_lock);
1418 1417 return;
1419 1418 }
1420 1419
1421 1420 mac_srs->srs_tid = NULL;
1422 1421 if (!(mac_srs->srs_state & SRS_PROC))
1423 1422 cv_signal(&mac_srs->srs_async);
1424 1423
1425 1424 mutex_exit(&mac_srs->srs_lock);
1426 1425 }
1427 1426
1428 1427 /*
1429 1428  * 'hint' is the fanout hint (a uint64_t) given by the TCP/IP stack;
1430 1429  * it is used on the Tx path.
1431 1430 */
1432 1431 #define HASH_HINT(hint) \
1433 1432 ((hint) ^ ((hint) >> 24) ^ ((hint) >> 16) ^ ((hint) >> 8))
1434 1433
1435 1434
1436 1435 /*
1437 1436  * Hash based on the src address, dst address, and port information.
1438 1437 */
1439 1438 #define HASH_ADDR(src, dst, ports) \
1440 1439 (ntohl((src) + (dst)) ^ ((ports) >> 24) ^ ((ports) >> 16) ^ \
1441 1440 ((ports) >> 8) ^ (ports))
1442 1441
1443 1442 #define COMPUTE_INDEX(key, sz) (key % sz)
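For illustration, the two macros above can be exercised directly in user
space. The addresses and ports below are made up; in the kernel, src/dst
arrive in network byte order from the IP header and the 2x16-bit ports from
the transport header:

#include <stdio.h>
#include <stdint.h>
#include <arpa/inet.h>

#define	HASH_ADDR(src, dst, ports)				\
	(ntohl((src) + (dst)) ^ ((ports) >> 24) ^ ((ports) >> 16) ^ \
	((ports) >> 8) ^ (ports))

#define	COMPUTE_INDEX(key, sz)	(key % sz)

int
main(void)
{
	uint32_t src = inet_addr("192.168.1.10");
	uint32_t dst = inet_addr("192.168.1.20");
	uint32_t ports = (12345u << 16) | 80;	/* local, remote port */
	uint32_t hash = HASH_ADDR(src, dst, ports);

	/* With 8 soft rings, this flow always lands on the same ring. */
	printf("ring index = %u\n", (unsigned)COMPUTE_INDEX(hash, 8));
	return (0);
}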
1444 1443
1445 1444 #define FANOUT_ENQUEUE_MP(head, tail, cnt, bw_ctl, sz, sz0, mp) { \
1446 1445 if ((tail) != NULL) { \
1447 1446 ASSERT((tail)->b_next == NULL); \
1448 1447 (tail)->b_next = (mp); \
1449 1448 } else { \
1450 1449 ASSERT((head) == NULL); \
1451 1450 (head) = (mp); \
1452 1451 } \
1453 1452 (tail) = (mp); \
1454 1453 (cnt)++; \
1455 1454 if ((bw_ctl)) \
1456 1455 (sz) += (sz0); \
1457 1456 }
1458 1457
1459 1458 #define MAC_FANOUT_DEFAULT 0
1460 1459 #define MAC_FANOUT_RND_ROBIN 1
1461 1460 int mac_fanout_type = MAC_FANOUT_DEFAULT;
1462 1461
1463 1462 #define MAX_SR_TYPES 3
1464 1463 /* fanout types for port based hashing */
1465 1464 enum pkt_type {
1466 1465 V4_TCP = 0,
1467 1466 V4_UDP,
1468 1467 OTH,
1469 1468 UNDEF
1470 1469 };
1471 1470
1472 1471 /*
1473 1472 * Pair of local and remote ports in the transport header
1474 1473 */
1475 1474 #define PORTS_SIZE 4
1476 1475
1477 1476 /*
1478 - * mac_rx_srs_proto_fanout
1479 - *
1480 - * This routine delivers packets destined to an SRS into one of the
1477 + * This routine delivers packets destined for an SRS into one of the
1481 1478 * protocol soft rings.
1482 1479 *
1483 - * Given a chain of packets we need to split it up into multiple sub chains
1484 - * destined into TCP, UDP or OTH soft ring. Instead of entering
1485 - * the soft ring one packet at a time, we want to enter it in the form of a
1486 - * chain otherwise we get this start/stop behaviour where the worker thread
1487 - * goes to sleep and then next packets comes in forcing it to wake up etc.
1480 + * Given a chain of packets we need to split it up into multiple sub
1481 + * chains: TCP, UDP or OTH soft ring. Instead of entering the soft
1482 + * ring one packet at a time, we want to enter it in the form of a
1483 + * chain otherwise we get this start/stop behaviour where the worker
1484 + * thread goes to sleep and then next packet comes in forcing it to
1485 + * wake up.
1488 1486 */
1489 1487 static void
1490 1488 mac_rx_srs_proto_fanout(mac_soft_ring_set_t *mac_srs, mblk_t *head)
1491 1489 {
1492 1490 struct ether_header *ehp;
1493 1491 struct ether_vlan_header *evhp;
1494 1492 uint32_t sap;
1495 1493 ipha_t *ipha;
1496 1494 uint8_t *dstaddr;
1497 1495 size_t hdrsize;
1498 1496 mblk_t *mp;
1499 1497 mblk_t *headmp[MAX_SR_TYPES];
1500 1498 mblk_t *tailmp[MAX_SR_TYPES];
1501 1499 int cnt[MAX_SR_TYPES];
1502 1500 size_t sz[MAX_SR_TYPES];
1503 1501 size_t sz1;
1504 1502 boolean_t bw_ctl;
1505 1503 boolean_t hw_classified;
1506 1504 boolean_t dls_bypass;
1507 1505 boolean_t is_ether;
1508 1506 boolean_t is_unicast;
1509 1507 enum pkt_type type;
1510 1508 mac_client_impl_t *mcip = mac_srs->srs_mcip;
1511 1509
1512 1510 is_ether = (mcip->mci_mip->mi_info.mi_nativemedia == DL_ETHER);
1513 1511 bw_ctl = ((mac_srs->srs_type & SRST_BW_CONTROL) != 0);
1514 1512
1515 1513 /*
1516 1514 * If we don't have a Rx ring, S/W classification would have done
1517 1515  * its job and it's a packet meant for us. If we were polling on
1518 1516 * the default ring (i.e. there was a ring assigned to this SRS),
1519 1517 * then we need to make sure that the mac address really belongs
1520 1518 * to us.
1521 1519 */
1522 1520 hw_classified = mac_srs->srs_ring != NULL &&
1523 1521 mac_srs->srs_ring->mr_classify_type == MAC_HW_CLASSIFIER;
1524 1522
1525 1523 /*
1526 - * Special clients (eg. VLAN, non ether, etc) need DLS
1527 - * processing in the Rx path. SRST_DLS_BYPASS will be clear for
1528 - * such SRSs. Another way of disabling bypass is to set the
1524 + * Some clients, such as non-Ethernet, need DLS processing in
1525 + * the Rx path. Such clients clear the SRST_DLS_BYPASS flag.
1526 + * DLS bypass may also be disabled via the
1529 1527 * MCIS_RX_BYPASS_DISABLE flag.
1530 1528 */
1531 1529 dls_bypass = ((mac_srs->srs_type & SRST_DLS_BYPASS) != 0) &&
1532 1530 ((mcip->mci_state_flags & MCIS_RX_BYPASS_DISABLE) == 0);
1533 1531
1534 1532 bzero(headmp, MAX_SR_TYPES * sizeof (mblk_t *));
1535 1533 bzero(tailmp, MAX_SR_TYPES * sizeof (mblk_t *));
1536 1534 bzero(cnt, MAX_SR_TYPES * sizeof (int));
1537 1535 bzero(sz, MAX_SR_TYPES * sizeof (size_t));
1538 1536
1539 1537 /*
1540 - * We got a chain from SRS that we need to send to the soft rings.
1541 - * Since squeues for TCP & IPv4 sap poll their soft rings (for
1542 - * performance reasons), we need to separate out v4_tcp, v4_udp
1543 - * and the rest goes in other.
1538 + * We have a chain from SRS that we need to split across the
1539 + * soft rings. The squeues for the TCP and IPv4 SAPs use their
1540 + * own soft rings to allow polling from the squeue. The rest of
1541 + * the packets are delivered on the OTH soft ring which cannot
1542 + * be polled.
1544 1543 */
1545 1544 while (head != NULL) {
1546 1545 mp = head;
1547 1546 head = head->b_next;
1548 1547 mp->b_next = NULL;
1549 1548
1550 1549 type = OTH;
1551 1550 sz1 = (mp->b_cont == NULL) ? MBLKL(mp) : msgdsize(mp);
1552 1551
1553 1552 if (is_ether) {
1554 1553 /*
1555 1554 * At this point we can be sure the packet at least
1556 1555 * has an ether header.
1557 1556 */
1558 1557 if (sz1 < sizeof (struct ether_header)) {
1559 1558 mac_rx_drop_pkt(mac_srs, mp);
1560 1559 continue;
1561 1560 }
1562 1561 ehp = (struct ether_header *)mp->b_rptr;
1563 1562
1564 1563 /*
1565 1564 * Determine if this is a VLAN or non-VLAN packet.
1566 1565 */
1567 1566 if ((sap = ntohs(ehp->ether_type)) == VLAN_TPID) {
1568 1567 evhp = (struct ether_vlan_header *)mp->b_rptr;
1569 1568 sap = ntohs(evhp->ether_type);
1570 1569 hdrsize = sizeof (struct ether_vlan_header);
1570 +
1571 1571 /*
1572 - * Check if the VID of the packet, if any,
1573 - * belongs to this client.
1572 + * Check if the VID of the packet, if
1573 + * any, belongs to this client.
1574 + * Technically, if this packet came up
1575 + * via a HW classified ring then we
1576 + * don't need to perform this check.
1577 + * Perhaps a future optimization.
1574 1578 */
1575 1579 if (!mac_client_check_flow_vid(mcip,
1576 1580 VLAN_ID(ntohs(evhp->ether_tci)))) {
1577 1581 mac_rx_drop_pkt(mac_srs, mp);
1578 1582 continue;
1579 1583 }
1580 1584 } else {
1581 1585 hdrsize = sizeof (struct ether_header);
1582 1586 }
1583 1587 is_unicast =
1584 1588 ((((uint8_t *)&ehp->ether_dhost)[0] & 0x01) == 0);
1585 1589 dstaddr = (uint8_t *)&ehp->ether_dhost;
1586 1590 } else {
1587 1591 mac_header_info_t mhi;
1588 1592
1589 1593 if (mac_header_info((mac_handle_t)mcip->mci_mip,
1590 1594 mp, &mhi) != 0) {
1591 1595 mac_rx_drop_pkt(mac_srs, mp);
1592 1596 continue;
1593 1597 }
1594 1598 hdrsize = mhi.mhi_hdrsize;
1595 1599 sap = mhi.mhi_bindsap;
1596 1600 is_unicast = (mhi.mhi_dsttype == MAC_ADDRTYPE_UNICAST);
1597 1601 dstaddr = (uint8_t *)mhi.mhi_daddr;
1598 1602 }
1599 1603
1600 1604 if (!dls_bypass) {
1601 1605 FANOUT_ENQUEUE_MP(headmp[type], tailmp[type],
1602 1606 cnt[type], bw_ctl, sz[type], sz1, mp);
1603 1607 continue;
1604 1608 }
1605 1609
1606 1610 if (sap == ETHERTYPE_IP) {
1607 1611 /*
1608 1612 * If we are H/W classified, but we have promisc
1609 1613 * on, then we need to check for the unicast address.
1610 1614 */
1611 1615 if (hw_classified && mcip->mci_promisc_list != NULL) {
1612 1616 mac_address_t *map;
1613 1617
1614 1618 rw_enter(&mcip->mci_rw_lock, RW_READER);
1615 1619 map = mcip->mci_unicast;
1616 1620 if (bcmp(dstaddr, map->ma_addr,
1617 1621 map->ma_len) == 0)
1618 1622 type = UNDEF;
1619 1623 rw_exit(&mcip->mci_rw_lock);
1620 1624 } else if (is_unicast) {
1621 1625 type = UNDEF;
1622 1626 }
1623 1627 }
1624 1628
1625 1629 /*
1626 1630 * This needs to become a contract with the driver for
1627 1631 * the fast path.
1628 1632 *
1629 1633 * In the normal case the packet will have at least the L2
1630 1634 * header and the IP + Transport header in the same mblk.
1631 1635 * This is usually the case when the NIC driver sends up
1632 1636 * the packet. This is also true when the stack generates
1633 1637 * a packet that is looped back and when the stack uses the
1634 1638 * fastpath mechanism. The normal case is optimized for
1635 1639 * performance and may bypass DLS. All other cases go through
1636 1640 * the 'OTH' type path without DLS bypass.
1637 1641 */
1638 -
1639 1642 ipha = (ipha_t *)(mp->b_rptr + hdrsize);
1640 1643 if ((type != OTH) && MBLK_RX_FANOUT_SLOWPATH(mp, ipha))
1641 1644 type = OTH;
1642 1645
1643 1646 if (type == OTH) {
1644 1647 FANOUT_ENQUEUE_MP(headmp[type], tailmp[type],
1645 1648 cnt[type], bw_ctl, sz[type], sz1, mp);
1646 1649 continue;
1647 1650 }
1648 1651
1649 1652 ASSERT(type == UNDEF);
1653 +
1650 1654 /*
1651 - * We look for at least 4 bytes past the IP header to get
1652 - * the port information. If we get an IP fragment, we don't
1653 - * have the port information, and we use just the protocol
1654 - * information.
1655 + * Determine the type from the IP protocol value. If
1656 + * classified as TCP or UDP, then update the read
1657 + * pointer to the beginning of the IP header.
1658 + * Otherwise leave the message as is for further
1659 + * processing by DLS.
1655 1660 */
1656 1661 switch (ipha->ipha_protocol) {
1657 1662 case IPPROTO_TCP:
1658 1663 type = V4_TCP;
1659 1664 mp->b_rptr += hdrsize;
1660 1665 break;
1661 1666 case IPPROTO_UDP:
1662 1667 type = V4_UDP;
1663 1668 mp->b_rptr += hdrsize;
1664 1669 break;
1665 1670 default:
1666 1671 type = OTH;
1667 1672 break;
1668 1673 }
1669 1674
1670 1675 FANOUT_ENQUEUE_MP(headmp[type], tailmp[type], cnt[type],
1671 1676 bw_ctl, sz[type], sz1, mp);
1672 1677 }
1673 1678
1674 1679 for (type = V4_TCP; type < UNDEF; type++) {
1675 1680 if (headmp[type] != NULL) {
1676 1681 mac_soft_ring_t *softring;
1677 1682
1678 1683 ASSERT(tailmp[type]->b_next == NULL);
1679 1684 switch (type) {
1680 1685 case V4_TCP:
1681 1686 softring = mac_srs->srs_tcp_soft_rings[0];
1682 1687 break;
1683 1688 case V4_UDP:
1684 1689 softring = mac_srs->srs_udp_soft_rings[0];
1685 1690 break;
1686 1691 case OTH:
1687 1692 softring = mac_srs->srs_oth_soft_rings[0];
1688 1693 }
1689 1694 mac_rx_soft_ring_process(mcip, softring,
1690 1695 headmp[type], tailmp[type], cnt[type], sz[type]);
1691 1696 }
1692 1697 }
1693 1698 }
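The chain-splitting pattern used by this routine (and by mac_rx_srs_fanout()
below) reduces to the following user-space sketch. pkt_t, classify() and
deliver() are hypothetical stand-ins for mblk_t, the protocol checks and
mac_rx_soft_ring_process(); the point is one delivery per sub-chain instead
of one per packet:

#include <stdio.h>
#include <stddef.h>

enum { SK_TCP, SK_UDP, SK_OTH, SK_NTYPES };

typedef struct pkt {
	struct pkt *b_next;
	int proto;		/* stands in for ipha_protocol */
} pkt_t;

static int
classify(const pkt_t *p)
{
	switch (p->proto) {
	case 6:		return (SK_TCP);	/* IPPROTO_TCP */
	case 17:	return (SK_UDP);	/* IPPROTO_UDP */
	default:	return (SK_OTH);
	}
}

/* Stand-in for mac_rx_soft_ring_process(): one call per sub-chain. */
static void
deliver(int type, pkt_t *head, int cnt)
{
	(void) head;
	printf("type %d: chain of %d\n", type, cnt);
}

static void
fanout(pkt_t *head)
{
	pkt_t *headp[SK_NTYPES] = { NULL };
	pkt_t *tailp[SK_NTYPES] = { NULL };
	int cnt[SK_NTYPES] = { 0 };

	while (head != NULL) {
		pkt_t *mp = head;

		head = head->b_next;
		mp->b_next = NULL;

		int type = classify(mp);

		/* Append to the sub-chain, as FANOUT_ENQUEUE_MP does. */
		if (tailp[type] != NULL)
			tailp[type]->b_next = mp;
		else
			headp[type] = mp;
		tailp[type] = mp;
		cnt[type]++;
	}

	for (int t = 0; t < SK_NTYPES; t++) {
		if (headp[t] != NULL)
			deliver(t, headp[t], cnt[t]);
	}
}

int
main(void)
{
	pkt_t p[4] = {
		{ &p[1], 6 }, { &p[2], 17 }, { &p[3], 6 }, { NULL, 1 }
	};

	fanout(&p[0]);	/* prints: tcp chain of 2, udp of 1, oth of 1 */
	return (0);
}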
1694 1699
1695 1700 int fanout_unaligned = 0;
1696 1701
1697 1702 /*
1698 - * mac_rx_srs_long_fanout
1699 - *
1700 - * The fanout routine for VLANs, and for anything else that isn't performing
1701 - * explicit dls bypass. Returns -1 on an error (drop the packet due to a
1702 - * malformed packet), 0 on success, with values written in *indx and *type.
1703 + * The fanout routine for any clients with DLS bypass disabled or for
1704 + * traffic classified as "other". Returns -1 on an error (drop the
1705 + * packet due to a malformed packet), 0 on success, with values
1706 + * written in *indx and *type.
1703 1707 */
1704 1708 static int
1705 1709 mac_rx_srs_long_fanout(mac_soft_ring_set_t *mac_srs, mblk_t *mp,
1706 1710 uint32_t sap, size_t hdrsize, enum pkt_type *type, uint_t *indx)
1707 1711 {
1708 1712 ip6_t *ip6h;
1709 1713 ipha_t *ipha;
1710 1714 uint8_t *whereptr;
1711 1715 uint_t hash;
1712 1716 uint16_t remlen;
1713 1717 uint8_t nexthdr;
1714 1718 uint16_t hdr_len;
1715 1719 uint32_t src_val, dst_val;
1716 1720 boolean_t modifiable = B_TRUE;
1717 1721 boolean_t v6;
1718 1722
1719 1723 ASSERT(MBLKL(mp) >= hdrsize);
1720 1724
1721 1725 if (sap == ETHERTYPE_IPV6) {
1722 1726 v6 = B_TRUE;
1723 1727 hdr_len = IPV6_HDR_LEN;
1724 1728 } else if (sap == ETHERTYPE_IP) {
1725 1729 v6 = B_FALSE;
1726 1730 hdr_len = IP_SIMPLE_HDR_LENGTH;
1727 1731 } else {
1728 1732 *indx = 0;
1729 1733 *type = OTH;
1730 1734 return (0);
1731 1735 }
1732 1736
1733 1737 ip6h = (ip6_t *)(mp->b_rptr + hdrsize);
1734 1738 ipha = (ipha_t *)ip6h;
1735 1739
1736 1740 if ((uint8_t *)ip6h == mp->b_wptr) {
1737 1741 /*
1738 1742 * The first mblk_t only includes the mac header.
1739 1743 * Note that it is safe to change the mp pointer here,
1740 1744 * as the subsequent operation does not assume mp
1741 1745 * points to the start of the mac header.
1742 1746 */
1743 1747 mp = mp->b_cont;
1744 1748
1745 1749 /*
1746 1750 * Make sure the IP header points to an entire one.
1747 1751 */
1748 1752 if (mp == NULL)
1749 1753 return (-1);
1750 1754
1751 1755 if (MBLKL(mp) < hdr_len) {
1752 1756 modifiable = (DB_REF(mp) == 1);
1753 1757
1754 1758 if (modifiable && !pullupmsg(mp, hdr_len))
1755 1759 return (-1);
1756 1760 }
1757 1761
1758 1762 ip6h = (ip6_t *)mp->b_rptr;
1759 1763 ipha = (ipha_t *)ip6h;
1760 1764 }
1761 1765
1762 1766 if (!modifiable || !(OK_32PTR((char *)ip6h)) ||
1763 1767 ((uint8_t *)ip6h + hdr_len > mp->b_wptr)) {
1764 1768 /*
1765 1769 * If either the IP header is not aligned, or it does not hold
1766 1770 * the complete simple structure (a pullupmsg() is not an
1767 1771 * option since it would result in an unaligned IP header),
1768 1772 * fanout to the default ring.
1769 1773 *
1770 1774 * Note that this may cause packet reordering.
1771 1775 */
1772 1776 *indx = 0;
1773 1777 *type = OTH;
1774 1778 fanout_unaligned++;
1775 1779 return (0);
1776 1780 }
1777 1781
1778 1782 /*
1779 1783 * Extract next-header, full header length, and source-hash value
1780 1784 * using v4/v6 specific fields.
1781 1785 */
1782 1786 if (v6) {
1783 1787 remlen = ntohs(ip6h->ip6_plen);
1784 1788 nexthdr = ip6h->ip6_nxt;
1785 1789 src_val = V4_PART_OF_V6(ip6h->ip6_src);
1786 1790 dst_val = V4_PART_OF_V6(ip6h->ip6_dst);
1787 1791 /*
1788 1792 	 * Do src-based fanout if the below tunable is set to B_TRUE or
1789 1793 * when mac_ip_hdr_length_v6() fails because of malformed
1790 1794 * packets or because mblks need to be concatenated using
1791 1795 * pullupmsg().
1792 1796 *
1793 1797 * Perform a version check to prevent parsing weirdness...
1794 1798 */
1795 1799 if (IPH_HDR_VERSION(ip6h) != IPV6_VERSION ||
1796 1800 !mac_ip_hdr_length_v6(ip6h, mp->b_wptr, &hdr_len, &nexthdr,
1797 1801 NULL)) {
1798 1802 goto src_dst_based_fanout;
1799 1803 }
1800 1804 } else {
1801 1805 hdr_len = IPH_HDR_LENGTH(ipha);
1802 1806 remlen = ntohs(ipha->ipha_length) - hdr_len;
1803 1807 nexthdr = ipha->ipha_protocol;
1804 1808 src_val = (uint32_t)ipha->ipha_src;
1805 1809 dst_val = (uint32_t)ipha->ipha_dst;
1806 1810 /*
1807 1811 * Catch IPv4 fragment case here. IPv6 has nexthdr == FRAG
1808 1812 * for its equivalent case.
1809 1813 */
1810 1814 if ((ntohs(ipha->ipha_fragment_offset_and_flags) &
1811 1815 (IPH_MF | IPH_OFFSET)) != 0) {
1812 1816 goto src_dst_based_fanout;
1813 1817 }
1814 1818 }
1815 1819 if (remlen < MIN_EHDR_LEN)
1816 1820 return (-1);
1817 1821 whereptr = (uint8_t *)ip6h + hdr_len;
1818 1822
1819 1823 	/* If the transport is one of the below, we do port/SPI-based fanout */
1820 1824 switch (nexthdr) {
1821 1825 case IPPROTO_TCP:
1822 1826 case IPPROTO_UDP:
1823 1827 case IPPROTO_SCTP:
1824 1828 case IPPROTO_ESP:
1825 1829 /*
1826 1830 		 * If the ports or SPI in the transport header are not part of
1827 1831 		 * the mblk, do src_dst_based_fanout instead of calling
1828 1832 * pullupmsg().
1829 1833 */
1830 1834 if (mp->b_cont == NULL || whereptr + PORTS_SIZE <= mp->b_wptr)
1831 1835 break; /* out of switch... */
1832 1836 /* FALLTHRU */
1833 1837 default:
1834 1838 goto src_dst_based_fanout;
1835 1839 }
1836 1840
1837 1841 switch (nexthdr) {
1838 1842 case IPPROTO_TCP:
1839 1843 hash = HASH_ADDR(src_val, dst_val, *(uint32_t *)whereptr);
1840 1844 *indx = COMPUTE_INDEX(hash, mac_srs->srs_tcp_ring_count);
1841 1845 *type = OTH;
1842 1846 break;
1843 1847 case IPPROTO_UDP:
1844 1848 case IPPROTO_SCTP:
1845 1849 case IPPROTO_ESP:
1846 1850 if (mac_fanout_type == MAC_FANOUT_DEFAULT) {
1847 1851 hash = HASH_ADDR(src_val, dst_val,
1848 1852 *(uint32_t *)whereptr);
1849 1853 *indx = COMPUTE_INDEX(hash,
1850 1854 mac_srs->srs_udp_ring_count);
1851 1855 } else {
1852 1856 *indx = mac_srs->srs_ind % mac_srs->srs_udp_ring_count;
1853 1857 mac_srs->srs_ind++;
1854 1858 }
1855 1859 *type = OTH;
1856 1860 break;
1857 1861 }
1858 1862 return (0);
1859 1863
1860 1864 src_dst_based_fanout:
1861 1865 hash = HASH_ADDR(src_val, dst_val, (uint32_t)0);
1862 1866 *indx = COMPUTE_INDEX(hash, mac_srs->srs_oth_ring_count);
1863 1867 *type = OTH;
1864 1868 return (0);
1865 1869 }
1866 1870
1867 1871 /*
1868 - * mac_rx_srs_fanout
1869 - *
1870 - * This routine delivers packets destined to an SRS into a soft ring member
1872 + * This routine delivers packets destined for an SRS into a soft ring member
1871 1873 * of the set.
1872 1874 *
1873 - * Given a chain of packets we need to split it up into multiple sub chains
1874 - * destined for one of the TCP, UDP or OTH soft rings. Instead of entering
1875 - * the soft ring one packet at a time, we want to enter it in the form of a
1876 - * chain otherwise we get this start/stop behaviour where the worker thread
1877 - * goes to sleep and then next packets comes in forcing it to wake up etc.
1875 + * Given a chain of packets we need to split it up into multiple sub
1876 + * chains: TCP, UDP or OTH soft ring. Instead of entering the soft
1877 + * ring one packet at a time, we want to enter it in the form of a
1878 + * chain otherwise we get this start/stop behaviour where the worker
1879 + * thread goes to sleep and then next packet comes in forcing it to
1880 + * wake up.
1878 1881 *
1879 1882 * Note:
1880 1883 * Since we know what is the maximum fanout possible, we create a 2D array
1881 1884 * of 'softring types * MAX_SR_FANOUT' for the head, tail, cnt and sz
1882 1885 * variables so that we can enter the softrings with chain. We need the
1883 1886 * MAX_SR_FANOUT so we can allocate the arrays on the stack (a kmem_alloc
1884 1887 * for each packet would be expensive). If we ever want to have the
1885 1888 * ability to have unlimited fanout, we should probably declare a head,
1886 1889 * tail, cnt, sz with each soft ring (a data struct which contains a softring
1887 1890 * along with these members) and create an array of this uber struct so we
1888 1891 * don't have to do kmem_alloc.
1889 1892 */
1890 1893 int fanout_oth1 = 0;
1891 1894 int fanout_oth2 = 0;
1892 1895 int fanout_oth3 = 0;
1893 1896 int fanout_oth4 = 0;
1894 1897 int fanout_oth5 = 0;
1895 1898
1896 1899 static void
1897 1900 mac_rx_srs_fanout(mac_soft_ring_set_t *mac_srs, mblk_t *head)
1898 1901 {
1899 1902 struct ether_header *ehp;
1900 1903 struct ether_vlan_header *evhp;
1901 1904 uint32_t sap;
1902 1905 ipha_t *ipha;
1903 1906 uint8_t *dstaddr;
1904 1907 uint_t indx;
1905 1908 size_t ports_offset;
1906 1909 size_t ipha_len;
1907 1910 size_t hdrsize;
1908 1911 uint_t hash;
1909 1912 mblk_t *mp;
1910 1913 mblk_t *headmp[MAX_SR_TYPES][MAX_SR_FANOUT];
1911 1914 mblk_t *tailmp[MAX_SR_TYPES][MAX_SR_FANOUT];
1912 1915 int cnt[MAX_SR_TYPES][MAX_SR_FANOUT];
1913 1916 size_t sz[MAX_SR_TYPES][MAX_SR_FANOUT];
1914 1917 size_t sz1;
1915 1918 boolean_t bw_ctl;
1916 1919 boolean_t hw_classified;
1917 1920 boolean_t dls_bypass;
1918 1921 boolean_t is_ether;
1919 1922 boolean_t is_unicast;
1920 1923 int fanout_cnt;
1921 1924 enum pkt_type type;
1922 1925 mac_client_impl_t *mcip = mac_srs->srs_mcip;
1923 1926
1924 1927 is_ether = (mcip->mci_mip->mi_info.mi_nativemedia == DL_ETHER);
1925 1928 bw_ctl = ((mac_srs->srs_type & SRST_BW_CONTROL) != 0);
1926 1929
1927 1930 /*
1928 1931 * If we don't have a Rx ring, S/W classification would have done
1929 1932 	 * its job and it's a packet meant for us. If we were polling on
1930 1933 * the default ring (i.e. there was a ring assigned to this SRS),
1931 1934 * then we need to make sure that the mac address really belongs
1932 1935 * to us.
1933 1936 */
1934 1937 hw_classified = mac_srs->srs_ring != NULL &&
1935 1938 mac_srs->srs_ring->mr_classify_type == MAC_HW_CLASSIFIER;
1936 1939
1937 1940 /*
1938 - * Special clients (eg. VLAN, non ether, etc) need DLS
1939 - * processing in the Rx path. SRST_DLS_BYPASS will be clear for
1940 - * such SRSs. Another way of disabling bypass is to set the
1941 - * MCIS_RX_BYPASS_DISABLE flag.
1941 + * Some clients, such as non-Ethernet, need DLS processing in
1942 + * the Rx path. Such clients clear the SRST_DLS_BYPASS flag.
1943 + * DLS bypass may also be disabled via the
1944 + * MCIS_RX_BYPASS_DISABLE flag, but this is only consumed by
1945 + * sun4v vsw currently.
1942 1946 */
1943 1947 dls_bypass = ((mac_srs->srs_type & SRST_DLS_BYPASS) != 0) &&
1944 1948 ((mcip->mci_state_flags & MCIS_RX_BYPASS_DISABLE) == 0);
1945 1949
1946 1950 /*
1947 1951 * Since the softrings are never destroyed and we always
1948 1952 	 * create an equal number of softrings for TCP, UDP and the rest,
1949 1953 	 * it's OK to check one of them for the count and use it without
1950 1954 	 * any lock. In the future, if soft rings get destroyed because
1951 1955 * of reduction in fanout, we will need to ensure that happens
1952 1956 * behind the SRS_PROC.
1953 1957 */
1954 1958 fanout_cnt = mac_srs->srs_tcp_ring_count;
1955 1959
1956 1960 bzero(headmp, MAX_SR_TYPES * MAX_SR_FANOUT * sizeof (mblk_t *));
1957 1961 bzero(tailmp, MAX_SR_TYPES * MAX_SR_FANOUT * sizeof (mblk_t *));
1958 1962 bzero(cnt, MAX_SR_TYPES * MAX_SR_FANOUT * sizeof (int));
1959 1963 bzero(sz, MAX_SR_TYPES * MAX_SR_FANOUT * sizeof (size_t));
1960 1964
1961 1965 /*
1962 1966 * We got a chain from SRS that we need to send to the soft rings.
1963 - * Since squeues for TCP & IPv4 sap poll their soft rings (for
1967 + * Since squeues for TCP & IPv4 SAP poll their soft rings (for
1964 1968 * performance reasons), we need to separate out v4_tcp, v4_udp
1965 1969 * and the rest goes in other.
1966 1970 */
1967 1971 while (head != NULL) {
1968 1972 mp = head;
1969 1973 head = head->b_next;
1970 1974 mp->b_next = NULL;
1971 1975
1972 1976 type = OTH;
1973 1977 sz1 = (mp->b_cont == NULL) ? MBLKL(mp) : msgdsize(mp);
1974 1978
1975 1979 if (is_ether) {
1976 1980 /*
1977 1981 * At this point we can be sure the packet at least
1978 1982 * has an ether header.
1979 1983 */
1980 1984 if (sz1 < sizeof (struct ether_header)) {
1981 1985 mac_rx_drop_pkt(mac_srs, mp);
1982 1986 continue;
1983 1987 }
1984 1988 ehp = (struct ether_header *)mp->b_rptr;
1985 1989
1986 1990 /*
1987 1991 * Determine if this is a VLAN or non-VLAN packet.
1988 1992 */
1989 1993 if ((sap = ntohs(ehp->ether_type)) == VLAN_TPID) {
1990 1994 evhp = (struct ether_vlan_header *)mp->b_rptr;
1991 1995 sap = ntohs(evhp->ether_type);
1992 1996 hdrsize = sizeof (struct ether_vlan_header);
1997 +
1993 1998 /*
1994 - * Check if the VID of the packet, if any,
1995 - * belongs to this client.
1999 + * Check if the VID of the packet, if
2000 + * any, belongs to this client.
2001 + * Technically, if this packet came up
2002 + * via a HW classified ring then we
2003 + * don't need to perform this check.
2004 + * Perhaps a future optimization.
1996 2005 */
1997 2006 if (!mac_client_check_flow_vid(mcip,
1998 2007 VLAN_ID(ntohs(evhp->ether_tci)))) {
1999 2008 mac_rx_drop_pkt(mac_srs, mp);
2000 2009 continue;
2001 2010 }
2002 2011 } else {
2003 2012 hdrsize = sizeof (struct ether_header);
2004 2013 }
2005 2014 is_unicast =
2006 2015 ((((uint8_t *)&ehp->ether_dhost)[0] & 0x01) == 0);
2007 2016 dstaddr = (uint8_t *)&ehp->ether_dhost;
2008 2017 } else {
2009 2018 mac_header_info_t mhi;
2010 2019
2011 2020 if (mac_header_info((mac_handle_t)mcip->mci_mip,
2012 2021 mp, &mhi) != 0) {
2013 2022 mac_rx_drop_pkt(mac_srs, mp);
2014 2023 continue;
2015 2024 }
2016 2025 hdrsize = mhi.mhi_hdrsize;
2017 2026 sap = mhi.mhi_bindsap;
2018 2027 is_unicast = (mhi.mhi_dsttype == MAC_ADDRTYPE_UNICAST);
2019 2028 dstaddr = (uint8_t *)mhi.mhi_daddr;
2020 2029 }
2021 2030
2022 2031 if (!dls_bypass) {
2023 2032 if (mac_rx_srs_long_fanout(mac_srs, mp, sap,
2024 2033 hdrsize, &type, &indx) == -1) {
2025 2034 mac_rx_drop_pkt(mac_srs, mp);
2026 2035 continue;
2027 2036 }
2028 2037
2029 2038 FANOUT_ENQUEUE_MP(headmp[type][indx],
2030 2039 tailmp[type][indx], cnt[type][indx], bw_ctl,
2031 2040 sz[type][indx], sz1, mp);
2032 2041 continue;
2033 2042 }
2034 2043
2035 -
2036 2044 /*
2037 2045 * If we are using the default Rx ring where H/W or S/W
2038 2046 * classification has not happened, we need to verify if
2039 2047 * this unicast packet really belongs to us.
2040 2048 */
2041 2049 if (sap == ETHERTYPE_IP) {
2042 2050 /*
2043 2051 * If we are H/W classified, but we have promisc
2044 2052 * on, then we need to check for the unicast address.
2045 2053 */
2046 2054 if (hw_classified && mcip->mci_promisc_list != NULL) {
2047 2055 mac_address_t *map;
2048 2056
2049 2057 rw_enter(&mcip->mci_rw_lock, RW_READER);
2050 2058 map = mcip->mci_unicast;
2051 2059 if (bcmp(dstaddr, map->ma_addr,
2052 2060 map->ma_len) == 0)
2053 2061 type = UNDEF;
2054 2062 rw_exit(&mcip->mci_rw_lock);
2055 2063 } else if (is_unicast) {
2056 2064 type = UNDEF;
2057 2065 }
2058 2066 }
2059 2067
2060 2068 /*
2061 2069 * This needs to become a contract with the driver for
2062 2070 * the fast path.
2063 2071 */
2064 2072
2065 2073 ipha = (ipha_t *)(mp->b_rptr + hdrsize);
2066 2074 if ((type != OTH) && MBLK_RX_FANOUT_SLOWPATH(mp, ipha)) {
2067 2075 type = OTH;
2068 2076 fanout_oth1++;
2069 2077 }
2070 2078
2071 2079 if (type != OTH) {
2072 2080 uint16_t frag_offset_flags;
2073 2081
2074 2082 switch (ipha->ipha_protocol) {
2075 2083 case IPPROTO_TCP:
2076 2084 case IPPROTO_UDP:
2077 2085 case IPPROTO_SCTP:
2078 2086 case IPPROTO_ESP:
2079 2087 ipha_len = IPH_HDR_LENGTH(ipha);
2080 2088 if ((uchar_t *)ipha + ipha_len + PORTS_SIZE >
2081 2089 mp->b_wptr) {
2082 2090 type = OTH;
2083 2091 break;
2084 2092 }
2085 2093 frag_offset_flags =
2086 2094 ntohs(ipha->ipha_fragment_offset_and_flags);
2087 2095 if ((frag_offset_flags &
2088 2096 (IPH_MF | IPH_OFFSET)) != 0) {
2089 2097 type = OTH;
2090 2098 fanout_oth3++;
2091 2099 break;
2092 2100 }
2093 2101 ports_offset = hdrsize + ipha_len;
2094 2102 break;
2095 2103 default:
2096 2104 type = OTH;
2097 2105 fanout_oth4++;
2098 2106 break;
2099 2107 }
2100 2108 }
2101 2109
2102 2110 if (type == OTH) {
2103 2111 if (mac_rx_srs_long_fanout(mac_srs, mp, sap,
2104 2112 hdrsize, &type, &indx) == -1) {
2105 2113 mac_rx_drop_pkt(mac_srs, mp);
2106 2114 continue;
2107 2115 }
2108 2116
2109 2117 FANOUT_ENQUEUE_MP(headmp[type][indx],
2110 2118 tailmp[type][indx], cnt[type][indx], bw_ctl,
2111 2119 sz[type][indx], sz1, mp);
2112 2120 continue;
2113 2121 }
2114 2122
2115 2123 ASSERT(type == UNDEF);
2116 2124
2117 2125 /*
2118 2126 * XXX-Sunay: We should hold srs_lock since ring_count
2119 2127 * below can change. But if we are always called from
2120 2128 * mac_rx_srs_drain and SRS_PROC is set, then we can
2121 2129 * enforce that ring_count can't be changed i.e.
2122 2130 * to change fanout type or ring count, the calling
2123 2131 * thread needs to be behind SRS_PROC.
2124 2132 */
2125 2133 switch (ipha->ipha_protocol) {
2126 2134 case IPPROTO_TCP:
2127 2135 /*
2128 2136 * Note that for ESP, we fanout on SPI and it is at the
2129 2137 * same offset as the 2x16-bit ports. So it is clumped
2130 2138 * along with TCP, UDP and SCTP.
2131 2139 */
2132 2140 hash = HASH_ADDR(ipha->ipha_src, ipha->ipha_dst,
2133 2141 *(uint32_t *)(mp->b_rptr + ports_offset));
2134 2142 indx = COMPUTE_INDEX(hash, mac_srs->srs_tcp_ring_count);
2135 2143 type = V4_TCP;
2136 2144 mp->b_rptr += hdrsize;
2137 2145 break;
2138 2146 case IPPROTO_UDP:
2139 2147 case IPPROTO_SCTP:
2140 2148 case IPPROTO_ESP:
2141 2149 if (mac_fanout_type == MAC_FANOUT_DEFAULT) {
2142 2150 hash = HASH_ADDR(ipha->ipha_src, ipha->ipha_dst,
2143 2151 *(uint32_t *)(mp->b_rptr + ports_offset));
2144 2152 indx = COMPUTE_INDEX(hash,
2145 2153 mac_srs->srs_udp_ring_count);
2146 2154 } else {
2147 2155 indx = mac_srs->srs_ind %
2148 2156 mac_srs->srs_udp_ring_count;
2149 2157 mac_srs->srs_ind++;
2150 2158 }
2151 2159 type = V4_UDP;
2152 2160 mp->b_rptr += hdrsize;
2153 2161 break;
2154 2162 default:
2155 2163 indx = 0;
2156 2164 type = OTH;
2157 2165 }
2158 2166
2159 2167 FANOUT_ENQUEUE_MP(headmp[type][indx], tailmp[type][indx],
2160 2168 cnt[type][indx], bw_ctl, sz[type][indx], sz1, mp);
2161 2169 }
2162 2170
2163 2171 for (type = V4_TCP; type < UNDEF; type++) {
2164 2172 int i;
2165 2173
2166 2174 for (i = 0; i < fanout_cnt; i++) {
2167 2175 if (headmp[type][i] != NULL) {
2168 2176 mac_soft_ring_t *softring;
2169 2177
2170 2178 ASSERT(tailmp[type][i]->b_next == NULL);
2171 2179 switch (type) {
2172 2180 case V4_TCP:
2173 2181 softring =
2174 2182 mac_srs->srs_tcp_soft_rings[i];
2175 2183 break;
2176 2184 case V4_UDP:
2177 2185 softring =
2178 2186 mac_srs->srs_udp_soft_rings[i];
2179 2187 break;
2180 2188 case OTH:
2181 2189 softring =
2182 2190 mac_srs->srs_oth_soft_rings[i];
2183 2191 break;
2184 2192 }
2185 2193 mac_rx_soft_ring_process(mcip,
2186 2194 softring, headmp[type][i], tailmp[type][i],
2187 2195 cnt[type][i], sz[type][i]);
2188 2196 }
2189 2197 }
2190 2198 }
2191 2199 }
2192 2200
2193 2201 #define SRS_BYTES_TO_PICKUP 150000
2194 2202 ssize_t max_bytes_to_pickup = SRS_BYTES_TO_PICKUP;
2195 2203
2196 2204 /*
2197 2205 * mac_rx_srs_poll_ring
2198 2206 *
2199 2207 * This SRS Poll thread uses this routine to poll the underlying hardware
2200 2208  * Rx ring to get a chain of packets. It can process that chain
2201 2209  * inline if mac_latency_optimize is set (the default) or signal the
2202 2210  * SRS worker thread to do the remaining processing.
2203 2211  *
2204 2212  * Since packets come into the system via the interrupt or poll path,
2205 2213  * we also update the stats and deal with promiscuous clients here.
2206 2214 */
2207 2215 void
2208 2216 mac_rx_srs_poll_ring(mac_soft_ring_set_t *mac_srs)
2209 2217 {
2210 2218 kmutex_t *lock = &mac_srs->srs_lock;
2211 2219 kcondvar_t *async = &mac_srs->srs_cv;
2212 2220 mac_srs_rx_t *srs_rx = &mac_srs->srs_rx;
2213 2221 mblk_t *head, *tail, *mp;
2214 2222 callb_cpr_t cprinfo;
2215 2223 ssize_t bytes_to_pickup;
2216 2224 size_t sz;
2217 2225 int count;
2218 2226 mac_client_impl_t *smcip;
2219 2227
2220 2228 CALLB_CPR_INIT(&cprinfo, lock, callb_generic_cpr, "mac_srs_poll");
2221 2229 mutex_enter(lock);
2222 2230
2223 2231 start:
2224 2232 for (;;) {
2225 2233 if (mac_srs->srs_state & SRS_PAUSE)
2226 2234 goto done;
2227 2235
2228 2236 CALLB_CPR_SAFE_BEGIN(&cprinfo);
2229 2237 cv_wait(async, lock);
2230 2238 CALLB_CPR_SAFE_END(&cprinfo, lock);
2231 2239
2232 2240 if (mac_srs->srs_state & SRS_PAUSE)
2233 2241 goto done;
2234 2242
2235 2243 check_again:
2236 2244 if (mac_srs->srs_type & SRST_BW_CONTROL) {
2237 2245 /*
2238 2246 * We pick as many bytes as we are allowed to queue.
2239 2247 			 * It's possible that we will exceed the total
2240 2248 			 * packets queued in case this SRS is part of the
2241 2249 			 * Rx ring group since > 1 poll thread can be pulling
2242 2250 			 * up to the max allowed packets at the same time,
2243 2251 			 * but that should be OK.
2244 2252 */
2245 2253 mutex_enter(&mac_srs->srs_bw->mac_bw_lock);
2246 2254 bytes_to_pickup =
2247 2255 mac_srs->srs_bw->mac_bw_drop_threshold -
2248 2256 mac_srs->srs_bw->mac_bw_sz;
2249 2257 /*
2250 2258 * We shouldn't have been signalled if we
2251 2259 			 * have 0 or fewer bytes to pick, but since
2252 2260 			 * some of the byte accounting is driver
2253 2261 			 * dependent, we do the safety check.
2254 2262 */
2255 2263 if (bytes_to_pickup < 0)
2256 2264 bytes_to_pickup = 0;
2257 2265 mutex_exit(&mac_srs->srs_bw->mac_bw_lock);
2258 2266 } else {
2259 2267 /*
2260 2268 			 * TODO: Need to change the polling API
2261 2269 * to add a packet count and a flag which
2262 2270 * tells the driver whether we want packets
2263 2271 * based on a count, or bytes, or all the
2264 2272 * packets queued in the driver/HW. This
2265 2273 * way, we never have to check the limits
2266 2274 * on poll path. We truly let only as many
2267 2275 * packets enter the system as we are willing
2268 2276 * to process or queue.
2269 2277 *
2270 2278 * Something along the lines of
2271 2279 * pkts_to_pickup = mac_soft_ring_max_q_cnt -
2272 2280 * mac_srs->srs_poll_pkt_cnt
2273 2281 */
2274 2282
2275 2283 /*
2276 2284 * Since we are not doing B/W control, pick
2277 2285 * as many packets as allowed.
2278 2286 */
2279 2287 bytes_to_pickup = max_bytes_to_pickup;
2280 2288 }
2281 2289
2282 2290 /* Poll the underlying Hardware */
2283 2291 mutex_exit(lock);
2284 2292 head = MAC_HWRING_POLL(mac_srs->srs_ring, (int)bytes_to_pickup);
2285 2293 mutex_enter(lock);
2286 2294
2287 2295 ASSERT((mac_srs->srs_state & SRS_POLL_THR_OWNER) ==
2288 2296 SRS_POLL_THR_OWNER);
2289 2297
2290 2298 mp = tail = head;
2291 2299 count = 0;
2292 2300 sz = 0;
2293 2301 while (mp != NULL) {
2294 2302 tail = mp;
2295 2303 sz += msgdsize(mp);
2296 2304 mp = mp->b_next;
2297 2305 count++;
2298 2306 }
2299 2307
2300 2308 if (head != NULL) {
2301 2309 tail->b_next = NULL;
2302 2310 smcip = mac_srs->srs_mcip;
2303 2311
2304 2312 SRS_RX_STAT_UPDATE(mac_srs, pollbytes, sz);
2305 2313 SRS_RX_STAT_UPDATE(mac_srs, pollcnt, count);
2306 2314
2307 2315 /*
2308 2316 * If there are any promiscuous mode callbacks
2309 2317 * defined for this MAC client, pass them a copy
2310 2318 * if appropriate and also update the counters.
2311 2319 */
2312 2320 if (smcip != NULL) {
2313 2321 if (smcip->mci_mip->mi_promisc_list != NULL) {
2314 2322 mutex_exit(lock);
2315 2323 mac_promisc_dispatch(smcip->mci_mip,
2316 2324 head, NULL);
2317 2325 mutex_enter(lock);
2318 2326 }
2319 2327 }
2320 2328 if (mac_srs->srs_type & SRST_BW_CONTROL) {
2321 2329 mutex_enter(&mac_srs->srs_bw->mac_bw_lock);
2322 2330 mac_srs->srs_bw->mac_bw_polled += sz;
2323 2331 mutex_exit(&mac_srs->srs_bw->mac_bw_lock);
2324 2332 }
2325 2333 MAC_RX_SRS_ENQUEUE_CHAIN(mac_srs, head, tail,
2326 2334 count, sz);
2327 2335 if (count <= 10)
2328 2336 srs_rx->sr_stat.mrs_chaincntundr10++;
2329 2337 else if (count > 10 && count <= 50)
2330 2338 srs_rx->sr_stat.mrs_chaincnt10to50++;
2331 2339 else
2332 2340 srs_rx->sr_stat.mrs_chaincntover50++;
2333 2341 }
2334 2342
2335 2343 /*
2336 2344 * We are guaranteed that SRS_PROC will be set if we
2337 2345 * are here. Also, poll thread gets to run only if
2338 2346 * the drain was being done by a worker thread although
2339 2347 		 * it's possible that the worker thread is still running
2340 2348 * and poll thread was sent down to keep the pipeline
2341 2349 * going instead of doing a complete drain and then
2342 2350 * trying to poll the NIC.
2343 2351 *
2344 2352 * So we need to check SRS_WORKER flag to make sure
2345 2353 * that the worker thread is not processing the queue
2346 2354 * in parallel to us. The flags and conditions are
2347 2355 * protected by the srs_lock to prevent any race. We
2348 2356 * ensure that we don't drop the srs_lock from now
2349 2357 * till the end and similarly we don't drop the srs_lock
2350 2358 		 * in mac_rx_srs_drain() till similar condition checks
2351 2359 		 * are complete. The mac_rx_srs_drain() needs to ensure
2352 2360 		 * that the SRS_WORKER flag remains set as long as it is
2353 2361 		 * processing the queue.
2354 2362 */
2355 2363 if (!(mac_srs->srs_state & SRS_WORKER) &&
2356 2364 (mac_srs->srs_first != NULL)) {
2357 2365 /*
2358 2366 * We have packets to process and worker thread
2359 2367 * is not running. Check to see if poll thread is
2360 2368 * allowed to process.
2361 2369 */
2362 2370 if (mac_srs->srs_state & SRS_LATENCY_OPT) {
2363 2371 mac_srs->srs_drain_func(mac_srs, SRS_POLL_PROC);
2364 2372 if (!(mac_srs->srs_state & SRS_PAUSE) &&
2365 2373 srs_rx->sr_poll_pkt_cnt <=
2366 2374 srs_rx->sr_lowat) {
2367 2375 srs_rx->sr_poll_again++;
2368 2376 goto check_again;
2369 2377 }
2370 2378 /*
2371 2379 				 * We are already above the low water mark
2372 2380 				 * so stay in polling mode, but with no
2373 2381 * need to poll. Once we dip below
2374 2382 * the polling threshold, the processing
2375 2383 * thread (soft ring) will signal us
2376 2384 * to poll again (MAC_UPDATE_SRS_COUNT)
2377 2385 */
2378 2386 srs_rx->sr_poll_drain_no_poll++;
2379 2387 mac_srs->srs_state &= ~(SRS_PROC|SRS_GET_PKTS);
2380 2388 /*
2381 2389 				 * In the B/W control case, it's possible
2382 2390 				 * that the backlog built up due to the
2383 2391 				 * B/W limit being reached and packets
2384 2392 				 * are queued only in the SRS. In this case,
2385 2393 				 * we should schedule the worker thread
2386 2394 * since no one else will wake us up.
2387 2395 */
2388 2396 if ((mac_srs->srs_type & SRST_BW_CONTROL) &&
2389 2397 (mac_srs->srs_tid == NULL)) {
2390 2398 mac_srs->srs_tid =
2391 2399 timeout(mac_srs_fire, mac_srs, 1);
2392 2400 srs_rx->sr_poll_worker_wakeup++;
2393 2401 }
2394 2402 } else {
2395 2403 /*
2396 2404 * Wakeup the worker thread for more processing.
2397 2405 * We optimize for throughput in this case.
2398 2406 */
2399 2407 mac_srs->srs_state &= ~(SRS_PROC|SRS_GET_PKTS);
2400 2408 MAC_SRS_WORKER_WAKEUP(mac_srs);
2401 2409 srs_rx->sr_poll_sig_worker++;
2402 2410 }
2403 2411 } else if ((mac_srs->srs_first == NULL) &&
2404 2412 !(mac_srs->srs_state & SRS_WORKER)) {
2405 2413 /*
2406 2414 			 * There is nothing queued in the SRS and
2407 2415 			 * no worker thread running. Plus we
2408 2416 			 * didn't get anything from the H/W
2409 2417 			 * either (head == NULL).
2410 2418 */
2411 2419 ASSERT(head == NULL);
2412 2420 mac_srs->srs_state &=
2413 2421 ~(SRS_PROC|SRS_GET_PKTS);
2414 2422
2415 2423 /*
2416 2424 			 * If we have packets in the soft ring, don't allow
2417 2425 * more packets to come into this SRS by keeping the
2418 2426 * interrupts off but not polling the H/W. The
2419 2427 * poll thread will get signaled as soon as
2420 2428 * srs_poll_pkt_cnt dips below poll threshold.
2421 2429 */
2422 2430 if (srs_rx->sr_poll_pkt_cnt == 0) {
2423 2431 srs_rx->sr_poll_intr_enable++;
2424 2432 MAC_SRS_POLLING_OFF(mac_srs);
2425 2433 } else {
2426 2434 /*
2427 2435 * We know nothing is queued in SRS
2428 2436 * since we are here after checking
2429 2437 * srs_first is NULL. The backlog
2430 2438 * is entirely due to packets queued
2431 2439 * in Soft ring which will wake us up
2432 2440 * and get the interface out of polling
2433 2441 * mode once the backlog dips below
2434 2442 * sr_poll_thres.
2435 2443 */
2436 2444 srs_rx->sr_poll_no_poll++;
2437 2445 }
2438 2446 } else {
2439 2447 /*
2440 2448 * Worker thread is already running.
2441 2449 * Nothing much to do. If the polling
2442 2450 * was enabled, worker thread will deal
2443 2451 * with that.
2444 2452 */
2445 2453 mac_srs->srs_state &= ~SRS_GET_PKTS;
2446 2454 srs_rx->sr_poll_goto_sleep++;
2447 2455 }
2448 2456 }
2449 2457 done:
2450 2458 mac_srs->srs_state |= SRS_POLL_THR_QUIESCED;
2451 2459 cv_signal(&mac_srs->srs_async);
2452 2460 /*
2453 2461 * If this is a temporary quiesce then wait for the restart signal
2454 2462 * from the srs worker. Then clear the flags and signal the srs worker
2455 2463 * to ensure a positive handshake and go back to start.
2456 2464 */
2457 2465 while (!(mac_srs->srs_state & (SRS_CONDEMNED | SRS_POLL_THR_RESTART)))
2458 2466 cv_wait(async, lock);
2459 2467 if (mac_srs->srs_state & SRS_POLL_THR_RESTART) {
2460 2468 ASSERT(!(mac_srs->srs_state & SRS_CONDEMNED));
2461 2469 mac_srs->srs_state &=
2462 2470 ~(SRS_POLL_THR_QUIESCED | SRS_POLL_THR_RESTART);
2463 2471 cv_signal(&mac_srs->srs_async);
2464 2472 goto start;
2465 2473 } else {
2466 2474 mac_srs->srs_state |= SRS_POLL_THR_EXITED;
2467 2475 cv_signal(&mac_srs->srs_async);
2468 2476 CALLB_CPR_EXIT(&cprinfo);
2469 2477 thread_exit();
2470 2478 }
2471 2479 }
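The byte budget handed to MAC_HWRING_POLL() above reduces to this user-space
sketch; poll_budget() is a hypothetical stand-in for the inline computation:

#include <stdio.h>
#include <sys/types.h>

static ssize_t
poll_budget(int bw_control, ssize_t drop_threshold, ssize_t bytes_queued,
    ssize_t max_pickup)
{
	if (!bw_control)
		return (max_pickup);	/* cf. SRS_BYTES_TO_PICKUP */

	ssize_t budget = drop_threshold - bytes_queued;

	/* Driver byte accounting can overshoot; clamp at zero. */
	return (budget < 0 ? 0 : budget);
}

int
main(void)
{
	printf("%zd\n", poll_budget(0, 0, 0, 150000));	  /* 150000 */
	printf("%zd\n", poll_budget(1, 65536, 60000, 0)); /* 5536 */
	printf("%zd\n", poll_budget(1, 65536, 70000, 0)); /* 0 */
	return (0);
}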
2472 2480
2473 2481 /*
2474 2482 * mac_srs_pick_chain
2475 2483 *
2476 2484 * In Bandwidth control case, checks how many packets can be processed
2477 2485 * and return them in a sub chain.
2478 2486 */
2479 2487 static mblk_t *
2480 2488 mac_srs_pick_chain(mac_soft_ring_set_t *mac_srs, mblk_t **chain_tail,
2481 2489 size_t *chain_sz, int *chain_cnt)
2482 2490 {
2483 2491 mblk_t *head = NULL;
2484 2492 mblk_t *tail = NULL;
2485 2493 size_t sz;
2486 2494 size_t tsz = 0;
2487 2495 int cnt = 0;
2488 2496 mblk_t *mp;
2489 2497
2490 2498 ASSERT(MUTEX_HELD(&mac_srs->srs_lock));
2491 2499 mutex_enter(&mac_srs->srs_bw->mac_bw_lock);
2492 2500 if (((mac_srs->srs_bw->mac_bw_used + mac_srs->srs_size) <=
2493 2501 mac_srs->srs_bw->mac_bw_limit) ||
2494 2502 (mac_srs->srs_bw->mac_bw_limit == 0)) {
2495 2503 mutex_exit(&mac_srs->srs_bw->mac_bw_lock);
2496 2504 head = mac_srs->srs_first;
2497 2505 mac_srs->srs_first = NULL;
2498 2506 *chain_tail = mac_srs->srs_last;
2499 2507 mac_srs->srs_last = NULL;
2500 2508 *chain_sz = mac_srs->srs_size;
2501 2509 *chain_cnt = mac_srs->srs_count;
2502 2510 mac_srs->srs_count = 0;
2503 2511 mac_srs->srs_size = 0;
2504 2512 return (head);
2505 2513 }
2506 2514
2507 2515 /*
2508 2516 * Can't clear the entire backlog.
2509 2517 * Need to find how many packets to pick
2510 2518 */
2511 2519 ASSERT(MUTEX_HELD(&mac_srs->srs_bw->mac_bw_lock));
2512 2520 while ((mp = mac_srs->srs_first) != NULL) {
2513 2521 sz = msgdsize(mp);
2514 2522 if ((tsz + sz + mac_srs->srs_bw->mac_bw_used) >
2515 2523 mac_srs->srs_bw->mac_bw_limit) {
2516 2524 if (!(mac_srs->srs_bw->mac_bw_state & SRS_BW_ENFORCED))
2517 2525 mac_srs->srs_bw->mac_bw_state |=
2518 2526 SRS_BW_ENFORCED;
2519 2527 break;
2520 2528 }
2521 2529
2522 2530 /*
2523 2531 * The _size & cnt is decremented from the softrings
2524 2532 * when they send up the packet for polling to work
2525 2533 * properly.
2526 2534 */
2527 2535 tsz += sz;
2528 2536 cnt++;
2529 2537 mac_srs->srs_count--;
2530 2538 mac_srs->srs_size -= sz;
2531 2539 if (tail != NULL)
2532 2540 tail->b_next = mp;
2533 2541 else
2534 2542 head = mp;
2535 2543 tail = mp;
2536 2544 mac_srs->srs_first = mac_srs->srs_first->b_next;
2537 2545 }
2538 2546 mutex_exit(&mac_srs->srs_bw->mac_bw_lock);
2539 2547 if (mac_srs->srs_first == NULL)
2540 2548 mac_srs->srs_last = NULL;
2541 2549
2542 2550 if (tail != NULL)
2543 2551 tail->b_next = NULL;
2544 2552 *chain_tail = tail;
2545 2553 *chain_cnt = cnt;
2546 2554 *chain_sz = tsz;
2547 2555
2548 2556 return (head);
2549 2557 }
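The budget-limited carve-off reduces to this user-space sketch; pkt_t and
pick_chain() are hypothetical stand-ins for mblk_t and the routine above,
with a plain size field in place of msgdsize():

#include <stdio.h>
#include <stddef.h>

typedef struct pkt {
	struct pkt *b_next;
	size_t size;	/* stands in for msgdsize(mp) */
} pkt_t;

/* Detach packets from *qp until the byte budget is spent. */
static pkt_t *
pick_chain(pkt_t **qp, size_t budget, size_t *picked)
{
	pkt_t *head = NULL, *tail = NULL;
	size_t used = 0;

	while (*qp != NULL && used + (*qp)->size <= budget) {
		pkt_t *mp = *qp;

		*qp = mp->b_next;	/* pop from the queue head */
		used += mp->size;
		mp->b_next = NULL;
		if (tail != NULL)
			tail->b_next = mp;
		else
			head = mp;
		tail = mp;
	}
	*picked = used;
	return (head);	/* process this sub-chain; the rest stays queued */
}

int
main(void)
{
	pkt_t p[3] = { { &p[1], 600 }, { &p[2], 600 }, { NULL, 600 } };
	pkt_t *q = &p[0];
	size_t used;
	pkt_t *chain = pick_chain(&q, 1500, &used);

	/* Two packets (1200 bytes) picked; one stays queued. */
	printf("picked=%zu left=%zu\n", used, q != NULL ? q->size : 0);
	(void) chain;
	return (0);
}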
2550 2558
2551 2559 /*
2552 2560 * mac_rx_srs_drain
2553 2561 *
2554 2562 * The SRS drain routine. Gets to run to clear the queue. Any thread
2555 2563  * (worker, interrupt, poll) can call this based on the processing model.
2556 2564  * The first thing we do is disable interrupts if possible and then
2557 2565  * drain the queue. We also try to poll the underlying hardware if
2558 2566  * there is a dedicated hardware Rx ring assigned to this SRS.
2559 2567  *
2560 2568  * There is an equivalent drain routine in bandwidth control mode,
2561 2569 * mac_rx_srs_drain_bw. There is some code duplication between the two
2562 2570 * routines but they are highly performance sensitive and are easier
2563 2571 * to read/debug if they stay separate. Any code changes here might
2564 2572 * also apply to mac_rx_srs_drain_bw as well.
2565 2573 */
2566 2574 void
2567 2575 mac_rx_srs_drain(mac_soft_ring_set_t *mac_srs, uint_t proc_type)
2568 2576 {
2569 2577 mblk_t *head;
2570 2578 mblk_t *tail;
2571 2579 timeout_id_t tid;
2572 2580 int cnt = 0;
2573 2581 mac_client_impl_t *mcip = mac_srs->srs_mcip;
2574 2582 mac_srs_rx_t *srs_rx = &mac_srs->srs_rx;
2575 2583
2576 2584 ASSERT(MUTEX_HELD(&mac_srs->srs_lock));
2577 2585 ASSERT(!(mac_srs->srs_type & SRST_BW_CONTROL));
2578 2586
2579 2587 /* If we are blanked i.e. can't do upcalls, then we are done */
2580 2588 if (mac_srs->srs_state & (SRS_BLANK | SRS_PAUSE)) {
2581 2589 ASSERT((mac_srs->srs_type & SRST_NO_SOFT_RINGS) ||
2582 2590 (mac_srs->srs_state & SRS_PAUSE));
2583 2591 goto out;
2584 2592 }
2585 2593
2586 2594 if (mac_srs->srs_first == NULL)
2587 2595 goto out;
2588 2596
2589 2597 if (!(mac_srs->srs_state & SRS_LATENCY_OPT) &&
2590 2598 (srs_rx->sr_poll_pkt_cnt <= srs_rx->sr_lowat)) {
2591 2599 /*
2592 2600 * In the normal case, the SRS worker thread does no
2593 2601 * work and we wait for a backlog to build up before
2594 2602 * we switch into polling mode. In case we are
2595 2603 * optimizing for throughput, we use the worker thread
2596 2604 		 * as well. The goal is to let the worker thread process
2597 2605 		 * the queue and the poll thread feed packets into
2598 2606 * the queue. As such, we should signal the poll
2599 2607 * thread to try and get more packets.
2600 2608 *
2601 2609 * We could have pulled this check in the POLL_RING
2602 2610 		 * macro itself, but keeping it explicit here makes
2603 2611 		 * the architecture easier to understand.
2604 2612 */
2605 2613 MAC_SRS_POLL_RING(mac_srs);
2606 2614 }
2607 2615
2608 2616 again:
2609 2617 head = mac_srs->srs_first;
2610 2618 mac_srs->srs_first = NULL;
2611 2619 tail = mac_srs->srs_last;
2612 2620 mac_srs->srs_last = NULL;
2613 2621 cnt = mac_srs->srs_count;
2614 2622 mac_srs->srs_count = 0;
2615 2623
2616 2624 ASSERT(head != NULL);
2617 2625 ASSERT(tail != NULL);
2618 2626
2619 2627 if ((tid = mac_srs->srs_tid) != NULL)
2620 2628 mac_srs->srs_tid = NULL;
2621 2629
2622 2630 mac_srs->srs_state |= (SRS_PROC|proc_type);
2623 2631
2624 -
2625 2632 /*
2626 2633 * mcip is NULL for broadcast and multicast flows. The promisc
2627 2634 * callbacks for broadcast and multicast packets are delivered from
2628 2635 * mac_rx() and we don't need to worry about that case in this path
2629 2636 */
2630 2637 if (mcip != NULL) {
2631 2638 if (mcip->mci_promisc_list != NULL) {
2632 2639 mutex_exit(&mac_srs->srs_lock);
2633 2640 mac_promisc_client_dispatch(mcip, head);
2634 2641 mutex_enter(&mac_srs->srs_lock);
2635 2642 }
2636 2643 if (MAC_PROTECT_ENABLED(mcip, MPT_IPNOSPOOF)) {
2637 2644 mutex_exit(&mac_srs->srs_lock);
2638 2645 mac_protect_intercept_dynamic(mcip, head);
2639 2646 mutex_enter(&mac_srs->srs_lock);
2640 2647 }
2641 2648 }
2642 2649
2643 2650 /*
2644 - * Check if SRS itself is doing the processing
2645 - * This direct path does not apply when subflows are present. In this
2646 - * case, packets need to be dispatched to a soft ring according to the
2647 - * flow's bandwidth and other resources contraints.
2651 + * Check if SRS itself is doing the processing. This direct
2652 + * path applies only when subflows are present.
2648 2653 */
2649 2654 if (mac_srs->srs_type & SRST_NO_SOFT_RINGS) {
2650 2655 mac_direct_rx_t proc;
2651 2656 void *arg1;
2652 2657 mac_resource_handle_t arg2;
2653 2658
2654 2659 /*
2655 2660 * This is the case when a Rx is directly
2656 2661 * assigned and we have a fully classified
2657 2662 * protocol chain. We can deal with it in
2658 2663 * one shot.
2659 2664 */
2660 2665 proc = srs_rx->sr_func;
2661 2666 arg1 = srs_rx->sr_arg1;
2662 2667 arg2 = srs_rx->sr_arg2;
2663 2668
2664 2669 mac_srs->srs_state |= SRS_CLIENT_PROC;
2665 2670 mutex_exit(&mac_srs->srs_lock);
2666 2671 if (tid != NULL) {
2667 2672 (void) untimeout(tid);
2668 2673 tid = NULL;
2669 2674 }
2670 2675
2671 2676 proc(arg1, arg2, head, NULL);
2672 2677 /*
2673 2678 * Decrement the size and count here itself
2674 2679 * since the packet has been processed.
2675 2680 */
2676 2681 mutex_enter(&mac_srs->srs_lock);
2677 2682 MAC_UPDATE_SRS_COUNT_LOCKED(mac_srs, cnt);
2678 2683 if (mac_srs->srs_state & SRS_CLIENT_WAIT)
2679 2684 cv_signal(&mac_srs->srs_client_cv);
2680 2685 mac_srs->srs_state &= ~SRS_CLIENT_PROC;
2681 2686 } else {
2682 2687 /* Some kind of softrings based fanout is required */
2683 2688 mutex_exit(&mac_srs->srs_lock);
2684 2689 if (tid != NULL) {
2685 2690 (void) untimeout(tid);
2686 2691 tid = NULL;
2687 2692 }
2688 2693
2689 2694 /*
2690 2695 * Since the fanout routines can deal with chains,
2691 2696 * shoot the entire chain up.
2692 2697 */
2693 2698 if (mac_srs->srs_type & SRST_FANOUT_SRC_IP)
2694 2699 mac_rx_srs_fanout(mac_srs, head);
2695 2700 else
2696 2701 mac_rx_srs_proto_fanout(mac_srs, head);
2697 2702 mutex_enter(&mac_srs->srs_lock);
2698 2703 }
2699 2704
2700 2705 if (!(mac_srs->srs_state & (SRS_BLANK|SRS_PAUSE)) &&
2701 2706 (mac_srs->srs_first != NULL)) {
2702 2707 /*
2703 2708 * More packets arrived while we were clearing the
2704 2709 * SRS. This is possible because of one of
2705 2710 * three conditions below:
2706 2711 * 1) The driver is using multiple worker threads
2707 2712 * to send the packets to us.
2708 2713 * 2) The driver has a race in switching
2709 2714 * between interrupt and polling mode or
2710 2715 * 3) Packets are arriving in this SRS via the
2711 2716 * S/W classification as well.
2712 2717 *
2713 2718 * We should switch to polling mode and see if we
2714 2719 * need to send the poll thread down. Also, signal
2715 2720 * the worker thread to process what's just arrived.
2716 2721 */
2717 2722 MAC_SRS_POLLING_ON(mac_srs);
2718 2723 if (srs_rx->sr_poll_pkt_cnt <= srs_rx->sr_lowat) {
2719 2724 srs_rx->sr_drain_poll_sig++;
2720 2725 MAC_SRS_POLL_RING(mac_srs);
2721 2726 }
2722 2727
2723 2728 /*
2724 2729 * If we didn't signal the poll thread, we need
2725 2730 * to deal with the pending packets ourselves.
2726 2731 */
2727 2732 if (proc_type == SRS_WORKER) {
2728 2733 srs_rx->sr_drain_again++;
2729 2734 goto again;
2730 2735 } else {
2731 2736 srs_rx->sr_drain_worker_sig++;
2732 2737 cv_signal(&mac_srs->srs_async);
2733 2738 }
2734 2739 }
2735 2740
2736 2741 out:
2737 2742 if (mac_srs->srs_state & SRS_GET_PKTS) {
2738 2743 /*
2739 2744 * Poll thread is already running. Leave the
2740 2745 * SRS_PROC set and hand over control to the
2741 2746 * poll thread.
2742 2747 */
2743 2748 mac_srs->srs_state &= ~proc_type;
2744 2749 srs_rx->sr_drain_poll_running++;
2745 2750 return;
2746 2751 }
2747 2752
2748 2753 /*
2749 2754 * Even if there are no packets queued in SRS, we
2750 2755 * need to make sure that the shared counter is
2751 2756 * clear and any associated softrings have cleared
2752 2757 * all the backlog. Otherwise, leave the interface
2753 2758 * in polling mode and the poll thread will get
2754 2759 * signalled once the count goes down to zero.
2755 2760 *
2756 2761 * If someone is already draining the queue (SRS_PROC is
2757 2762 * set) when the srs_poll_pkt_cnt goes down to zero,
2758 2763 * then it means that drain is already running and we
2759 2764 * will turn off polling at that time if there is
2760 2765 * no backlog.
2761 2766 *
2762 2767 * As long as there are packets queued either
2763 2768 * in soft ring set or its soft rings, we will leave
2764 2769 * the interface in polling mode (even if the drain
2765 2770 * was done by the interrupt thread). We signal
2766 2771 * the poll thread as well if we have dipped below
2767 2772 * low water mark.
2768 2773 *
2769 2774 * NOTE: We can't use the MAC_SRS_POLLING_ON macro
2770 2775 * since that turns polling on only for the worker thread.
2771 2776 * It's not worth turning polling on for the interrupt
2772 2777 * thread (since NIC will not issue another interrupt)
2773 2778 * unless a backlog builds up.
2774 2779 */
2775 2780 if ((srs_rx->sr_poll_pkt_cnt > 0) &&
2776 2781 (mac_srs->srs_state & SRS_POLLING_CAPAB)) {
2777 2782 mac_srs->srs_state &= ~(SRS_PROC|proc_type);
2778 2783 srs_rx->sr_drain_keep_polling++;
2779 2784 MAC_SRS_POLLING_ON(mac_srs);
2780 2785 if (srs_rx->sr_poll_pkt_cnt <= srs_rx->sr_lowat)
2781 2786 MAC_SRS_POLL_RING(mac_srs);
2782 2787 return;
2783 2788 }
2784 2789
2785 2790 /* Nothing else to do. Get out of poll mode */
2786 2791 MAC_SRS_POLLING_OFF(mac_srs);
2787 2792 mac_srs->srs_state &= ~(SRS_PROC|proc_type);
2788 2793 srs_rx->sr_drain_finish_intr++;
2789 2794 }
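
Editor's sketch (hypothetical names, not from the webrev): the tail of mac_rx_srs_drain() decides whether to stay in poll mode on the way out. A minimal model of that decision, with srs_model standing in for the real SRS state flags:

    struct srs_model {
        int poll_pkt_cnt;       /* packets outstanding in SRS/softrings */
        int lowat;
        int polling_capable;
        int polling_on;
        int poll_signals;       /* times the poll thread was kicked */
    };

    /* Mirror of the exit logic: stay in poll mode while packets remain. */
    static void
    drain_exit(struct srs_model *s)
    {
        if (s->poll_pkt_cnt > 0 && s->polling_capable) {
            s->polling_on = 1;              /* keep the NIC polled */
            if (s->poll_pkt_cnt <= s->lowat)
                s->poll_signals++;          /* wake the poll thread */
            return;
        }
        s->polling_on = 0;                  /* back to interrupt mode */
    }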
2790 2795
2791 2796 /*
2792 2797 * mac_rx_srs_drain_bw
2793 2798 *
2794 2799 * The SRS BW drain routine. Gets to run to clear the queue. Any thread
2795 2800 * (worker, interrupt, poll) can call this based on processing model.
2796 2801 * The first thing we do is disable interrupts if possible and then
2797 2802 * drain the queue. We also try to poll the underlying hardware if
2798 2803 * there is a dedicated hardware Rx ring assigned to this SRS.
2799 2804 *
2800 2805 * There is an equivalent drain routine in non-bandwidth control mode,
2801 2806 * mac_rx_srs_drain. There is some code duplication between the two
2802 2807 * routines but they are highly performance sensitive and are easier
2803 2808 * to read/debug if they stay separate. Any code changes here might
2804 2809 * also apply to mac_rx_srs_drain as well.
2805 2810 */
2806 2811 void
2807 2812 mac_rx_srs_drain_bw(mac_soft_ring_set_t *mac_srs, uint_t proc_type)
2808 2813 {
2809 2814 mblk_t *head;
2810 2815 mblk_t *tail;
2811 2816 timeout_id_t tid;
2812 2817 size_t sz = 0;
2813 2818 int cnt = 0;
2814 2819 mac_client_impl_t *mcip = mac_srs->srs_mcip;
2815 2820 mac_srs_rx_t *srs_rx = &mac_srs->srs_rx;
2816 2821 clock_t now;
2817 2822
2818 2823 ASSERT(MUTEX_HELD(&mac_srs->srs_lock));
2819 2824 ASSERT(mac_srs->srs_type & SRST_BW_CONTROL);
2820 2825 again:
2821 2826 /* Check if we are doing B/W control */
2822 2827 mutex_enter(&mac_srs->srs_bw->mac_bw_lock);
2823 2828 now = ddi_get_lbolt();
2824 2829 if (mac_srs->srs_bw->mac_bw_curr_time != now) {
2825 2830 mac_srs->srs_bw->mac_bw_curr_time = now;
2826 2831 mac_srs->srs_bw->mac_bw_used = 0;
2827 2832 if (mac_srs->srs_bw->mac_bw_state & SRS_BW_ENFORCED)
2828 2833 mac_srs->srs_bw->mac_bw_state &= ~SRS_BW_ENFORCED;
2829 2834 } else if (mac_srs->srs_bw->mac_bw_state & SRS_BW_ENFORCED) {
2830 2835 mutex_exit(&mac_srs->srs_bw->mac_bw_lock);
2831 2836 goto done;
2832 2837 } else if (mac_srs->srs_bw->mac_bw_used >
2833 2838 mac_srs->srs_bw->mac_bw_limit) {
2834 2839 mac_srs->srs_bw->mac_bw_state |= SRS_BW_ENFORCED;
2835 2840 mutex_exit(&mac_srs->srs_bw->mac_bw_lock);
2836 2841 goto done;
2837 2842 }
2838 2843 mutex_exit(&mac_srs->srs_bw->mac_bw_lock);
2839 2844
2840 2845 /* If we are blanked i.e. can't do upcalls, then we are done */
2841 2846 if (mac_srs->srs_state & (SRS_BLANK | SRS_PAUSE)) {
2842 2847 ASSERT((mac_srs->srs_type & SRST_NO_SOFT_RINGS) ||
2843 2848 (mac_srs->srs_state & SRS_PAUSE));
2844 2849 goto done;
2845 2850 }
2846 2851
2847 2852 sz = 0;
2848 2853 cnt = 0;
2849 2854 if ((head = mac_srs_pick_chain(mac_srs, &tail, &sz, &cnt)) == NULL) {
2850 2855 /*
2851 2856 * We couldn't pick up a single packet.
2852 2857 */
2853 2858 mutex_enter(&mac_srs->srs_bw->mac_bw_lock);
2854 2859 if ((mac_srs->srs_bw->mac_bw_used == 0) &&
2855 2860 (mac_srs->srs_size != 0) &&
2856 2861 !(mac_srs->srs_bw->mac_bw_state & SRS_BW_ENFORCED)) {
2857 2862 /*
2858 2863 * Seems like configured B/W doesn't
2859 2864 * even allow processing of 1 packet
2860 2865 * per tick.
2861 2866 *
2862 2867 * XXX: raise the limit to processing
2863 2868 * at least 1 packet per tick.
2864 2869 */
2865 2870 mac_srs->srs_bw->mac_bw_limit +=
2866 2871 mac_srs->srs_bw->mac_bw_limit;
2867 2872 mac_srs->srs_bw->mac_bw_drop_threshold +=
2868 2873 mac_srs->srs_bw->mac_bw_drop_threshold;
2869 2874 cmn_err(CE_NOTE, "mac_rx_srs_drain: srs(%p) "
2870 2875 "raised B/W limit to %d since not even a "
2871 2876 "single packet can be processed per "
2872 2877 "tick %d\n", (void *)mac_srs,
2873 2878 (int)mac_srs->srs_bw->mac_bw_limit,
2874 2879 (int)msgdsize(mac_srs->srs_first));
2875 2880 }
2876 2881 mutex_exit(&mac_srs->srs_bw->mac_bw_lock);
2877 2882 goto done;
2878 2883 }
2879 2884
2880 2885 ASSERT(head != NULL);
2881 2886 ASSERT(tail != NULL);
2882 2887
2883 2888 /* zero bandwidth: drop all and return to interrupt mode */
2884 2889 mutex_enter(&mac_srs->srs_bw->mac_bw_lock);
2885 2890 if (mac_srs->srs_bw->mac_bw_limit == 0) {
2886 2891 srs_rx->sr_stat.mrs_sdrops += cnt;
2887 2892 ASSERT(mac_srs->srs_bw->mac_bw_sz >= sz);
2888 2893 mac_srs->srs_bw->mac_bw_sz -= sz;
2889 2894 mac_srs->srs_bw->mac_bw_drop_bytes += sz;
2890 2895 mutex_exit(&mac_srs->srs_bw->mac_bw_lock);
2891 2896 mac_pkt_drop(NULL, NULL, head, B_FALSE);
2892 2897 goto leave_poll;
2893 2898 } else {
2894 2899 mutex_exit(&mac_srs->srs_bw->mac_bw_lock);
2895 2900 }
2896 2901
2897 2902 if ((tid = mac_srs->srs_tid) != NULL)
2898 2903 mac_srs->srs_tid = NULL;
2899 2904
2900 2905 mac_srs->srs_state |= (SRS_PROC|proc_type);
2901 2906 MAC_SRS_WORKER_POLLING_ON(mac_srs);
2902 2907
2903 2908 /*
2904 2909 * mcip is NULL for broadcast and multicast flows. The promisc
2905 2910 * callbacks for broadcast and multicast packets are delivered from
2906 2911 * mac_rx() and we don't need to worry about that case in this path
2907 2912 */
2908 2913 if (mcip != NULL) {
2909 2914 if (mcip->mci_promisc_list != NULL) {
2910 2915 mutex_exit(&mac_srs->srs_lock);
2911 2916 mac_promisc_client_dispatch(mcip, head);
2912 2917 mutex_enter(&mac_srs->srs_lock);
2913 2918 }
2914 2919 if (MAC_PROTECT_ENABLED(mcip, MPT_IPNOSPOOF)) {
2915 2920 mutex_exit(&mac_srs->srs_lock);
2916 2921 mac_protect_intercept_dynamic(mcip, head);
2917 2922 mutex_enter(&mac_srs->srs_lock);
2918 2923 }
2919 2924 }
2920 2925
2921 2926 /*
2922 2927 * Check if SRS itself is doing the processing
2923 2928 * This direct path does not apply when subflows are present. In this
2924 2929 * case, packets need to be dispatched to a soft ring according to the
2925 2930 * flow's bandwidth and other resource constraints.
2926 2931 */
2927 2932 if (mac_srs->srs_type & SRST_NO_SOFT_RINGS) {
2928 2933 mac_direct_rx_t proc;
2929 2934 void *arg1;
2930 2935 mac_resource_handle_t arg2;
2931 2936
2932 2937 /*
2933 2938 * This is the case when a Rx is directly
2934 2939 * assigned and we have a fully classified
2935 2940 * protocol chain. We can deal with it in
2936 2941 * one shot.
2937 2942 */
2938 2943 proc = srs_rx->sr_func;
2939 2944 arg1 = srs_rx->sr_arg1;
2940 2945 arg2 = srs_rx->sr_arg2;
2941 2946
2942 2947 mac_srs->srs_state |= SRS_CLIENT_PROC;
2943 2948 mutex_exit(&mac_srs->srs_lock);
2944 2949 if (tid != NULL) {
2945 2950 (void) untimeout(tid);
2946 2951 tid = NULL;
2947 2952 }
2948 2953
2949 2954 proc(arg1, arg2, head, NULL);
2950 2955 /*
2951 2956 * Decrement the size and count here itself
2952 2957 * since the packet has been processed.
2953 2958 */
2954 2959 mutex_enter(&mac_srs->srs_lock);
2955 2960 MAC_UPDATE_SRS_COUNT_LOCKED(mac_srs, cnt);
2956 2961 MAC_UPDATE_SRS_SIZE_LOCKED(mac_srs, sz);
2957 2962
2958 2963 if (mac_srs->srs_state & SRS_CLIENT_WAIT)
2959 2964 cv_signal(&mac_srs->srs_client_cv);
2960 2965 mac_srs->srs_state &= ~SRS_CLIENT_PROC;
2961 2966 } else {
2962 2967 /* Some kind of softrings based fanout is required */
2963 2968 mutex_exit(&mac_srs->srs_lock);
2964 2969 if (tid != NULL) {
2965 2970 (void) untimeout(tid);
2966 2971 tid = NULL;
2967 2972 }
2968 2973
2969 2974 /*
2970 2975 * Since the fanout routines can deal with chains,
2971 2976 * shoot the entire chain up.
2972 2977 */
2973 2978 if (mac_srs->srs_type & SRST_FANOUT_SRC_IP)
2974 2979 mac_rx_srs_fanout(mac_srs, head);
2975 2980 else
2976 2981 mac_rx_srs_proto_fanout(mac_srs, head);
2977 2982 mutex_enter(&mac_srs->srs_lock);
2978 2983 }
2979 2984
2980 2985 /*
2981 2986 * Send the poll thread to pick up any packets arrived
2982 2987 * so far. This also serves as the last check in case
2983 2988 * nothing else is queued in the SRS. The poll thread
2984 2989 * is signalled only in the case the drain was done
2985 2990 * by the worker thread and SRS_WORKER is set. The
2986 2991 * worker thread can run in parallel as long as the
2987 2992 * SRS_WORKER flag is set. When we have nothing else to
2988 2993 * process, we can exit while leaving SRS_PROC set
2989 2994 * which gives the poll thread control to process and
2990 2995 * cleanup once it returns from the NIC.
2991 2996 *
2992 2997 * If we have nothing else to process, we need to
2993 2998 * ensure that we keep holding the srs_lock till
2994 2999 * all the checks below are done and control is
2995 3000 * handed to the poll thread if it was running.
2996 3001 */
2997 3002 mutex_enter(&mac_srs->srs_bw->mac_bw_lock);
2998 3003 if (!(mac_srs->srs_bw->mac_bw_state & SRS_BW_ENFORCED)) {
2999 3004 if (mac_srs->srs_first != NULL) {
3000 3005 if (proc_type == SRS_WORKER) {
3001 3006 mutex_exit(&mac_srs->srs_bw->mac_bw_lock);
3002 3007 if (srs_rx->sr_poll_pkt_cnt <=
3003 3008 srs_rx->sr_lowat)
3004 3009 MAC_SRS_POLL_RING(mac_srs);
3005 3010 goto again;
3006 3011 } else {
3007 3012 cv_signal(&mac_srs->srs_async);
3008 3013 }
3009 3014 }
3010 3015 }
3011 3016 mutex_exit(&mac_srs->srs_bw->mac_bw_lock);
3012 3017
3013 3018 done:
3014 3019
3015 3020 if (mac_srs->srs_state & SRS_GET_PKTS) {
3016 3021 /*
3017 3022 * Poll thread is already running. Leave the
3018 3023 * SRS_PROC set and hand over control to the
3019 3024 * poll thread.
3020 3025 */
3021 3026 mac_srs->srs_state &= ~proc_type;
3022 3027 return;
3023 3028 }
3024 3029
3025 3030 /*
3026 3031 * If we can't process packets because we have exceeded
3027 3032 * B/W limit for this tick, just set the timeout
3028 3033 * and leave.
3029 3034 *
3030 3035 * Even if there are no packets queued in SRS, we
3031 3036 * need to make sure that the shared counter is
3032 3037 * clear and any associated softrings have cleared
3033 3038 * all the backlog. Otherwise, leave the interface
3034 3039 * in polling mode and the poll thread will get
3035 3040 * signalled once the count goes down to zero.
3036 3041 *
3037 3042 * If someone is already draining the queue (SRS_PROC is
3038 3043 * set) when the srs_poll_pkt_cnt goes down to zero,
3039 3044 * then it means that drain is already running and we
3040 3045 * will turn off polling at that time if there is
3041 3046 * no backlog. As long as there are packets queued either
3042 3047 * in soft ring set or its soft rings, we will leave
3043 3048 * the interface in polling mode.
3044 3049 */
3045 3050 mutex_enter(&mac_srs->srs_bw->mac_bw_lock);
3046 3051 if ((mac_srs->srs_state & SRS_POLLING_CAPAB) &&
3047 3052 ((mac_srs->srs_bw->mac_bw_state & SRS_BW_ENFORCED) ||
3048 3053 (srs_rx->sr_poll_pkt_cnt > 0))) {
3049 3054 MAC_SRS_POLLING_ON(mac_srs);
3050 3055 mac_srs->srs_state &= ~(SRS_PROC|proc_type);
3051 3056 if ((mac_srs->srs_first != NULL) &&
3052 3057 (mac_srs->srs_tid == NULL))
3053 3058 mac_srs->srs_tid = timeout(mac_srs_fire,
3054 3059 mac_srs, 1);
3055 3060 mutex_exit(&mac_srs->srs_bw->mac_bw_lock);
3056 3061 return;
3057 3062 }
3058 3063 mutex_exit(&mac_srs->srs_bw->mac_bw_lock);
3059 3064
3060 3065 leave_poll:
3061 3066
3062 3067 /* Nothing else to do. Get out of poll mode */
3063 3068 MAC_SRS_POLLING_OFF(mac_srs);
3064 3069 mac_srs->srs_state &= ~(SRS_PROC|proc_type);
3065 3070 }
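
Editor's sketch (not from the webrev): the drain routine above meters traffic against a per-tick byte budget keyed off ddi_get_lbolt(). A user-space model of that window, with bw_ctl and bw_check() as hypothetical stand-ins for the mac_bw_ctl_t fields:

    #include <stdbool.h>
    #include <stddef.h>

    struct bw_ctl {
        long    bw_curr_tick;   /* last tick observed */
        size_t  bw_used;        /* bytes consumed within the tick */
        size_t  bw_limit;       /* bytes allowed per tick */
        bool    bw_enforced;
    };

    /* Returns true when the caller may keep processing in this tick. */
    static bool
    bw_check(struct bw_ctl *bw, long now_tick)
    {
        if (bw->bw_curr_tick != now_tick) {
            bw->bw_curr_tick = now_tick;    /* new tick: reset the window */
            bw->bw_used = 0;
            bw->bw_enforced = false;
        } else if (bw->bw_enforced || bw->bw_used > bw->bw_limit) {
            bw->bw_enforced = true;         /* budget gone until next tick */
            return (false);
        }
        return (true);
    }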
3066 3071
3067 3072 /*
3068 3073 * mac_srs_worker
3069 3074 *
3070 3075 * The SRS worker routine. Drains the queue when no one else is
3071 3076 * processing it.
3072 3077 */
3073 3078 void
3074 3079 mac_srs_worker(mac_soft_ring_set_t *mac_srs)
3075 3080 {
3076 3081 kmutex_t *lock = &mac_srs->srs_lock;
3077 3082 kcondvar_t *async = &mac_srs->srs_async;
3078 3083 callb_cpr_t cprinfo;
3079 3084 boolean_t bw_ctl_flag;
3080 3085
3081 3086 CALLB_CPR_INIT(&cprinfo, lock, callb_generic_cpr, "srs_worker");
3082 3087 mutex_enter(lock);
3083 3088
3084 3089 start:
3085 3090 for (;;) {
3086 3091 bw_ctl_flag = B_FALSE;
3087 3092 if (mac_srs->srs_type & SRST_BW_CONTROL) {
3088 3093 MAC_SRS_BW_LOCK(mac_srs);
3089 3094 MAC_SRS_CHECK_BW_CONTROL(mac_srs);
3090 3095 if (mac_srs->srs_bw->mac_bw_state & SRS_BW_ENFORCED)
3091 3096 bw_ctl_flag = B_TRUE;
3092 3097 MAC_SRS_BW_UNLOCK(mac_srs);
3093 3098 }
3094 3099 /*
3095 3100 * The SRS_BW_ENFORCED flag may change since we have dropped
3096 3101 * the mac_bw_lock. However the drain function can handle both
3097 3102 * a drainable SRS or a bandwidth controlled SRS, and the
3098 3103 * effect of scheduling a timeout is to wakeup the worker
3099 3104 * thread which in turn will call the drain function. Since
3100 3105 * we release the srs_lock atomically only in the cv_wait, there
3101 3106 * isn't a fear of waiting forever.
3102 3107 */
3103 3108 while (((mac_srs->srs_state & SRS_PROC) ||
3104 3109 (mac_srs->srs_first == NULL) || bw_ctl_flag ||
3105 3110 (mac_srs->srs_state & SRS_TX_BLOCKED)) &&
3106 3111 !(mac_srs->srs_state & SRS_PAUSE)) {
3107 3112 /*
3108 3113 * If we have packets queued and we are here
3109 3114 * because B/W control is in place, we better
3110 3115 * schedule the worker wakeup after 1 tick
3111 3116 * to see if bandwidth control can be relaxed.
3112 3117 */
3113 3118 if (bw_ctl_flag && mac_srs->srs_tid == NULL) {
3114 3119 /*
3115 3120 * We need to ensure that a timer is already
3116 3121 * scheduled or we force schedule one for
3117 3122 * later so that we can continue processing
3118 3123 * after this quanta is over.
3119 3124 */
3120 3125 mac_srs->srs_tid = timeout(mac_srs_fire,
3121 3126 mac_srs, 1);
3122 3127 }
3123 3128 wait:
3124 3129 CALLB_CPR_SAFE_BEGIN(&cprinfo);
3125 3130 cv_wait(async, lock);
3126 3131 CALLB_CPR_SAFE_END(&cprinfo, lock);
3127 3132
3128 3133 if (mac_srs->srs_state & SRS_PAUSE)
3129 3134 goto done;
3130 3135 if (mac_srs->srs_state & SRS_PROC)
3131 3136 goto wait;
3132 3137
3133 3138 if (mac_srs->srs_first != NULL &&
3134 3139 mac_srs->srs_type & SRST_BW_CONTROL) {
3135 3140 MAC_SRS_BW_LOCK(mac_srs);
3136 3141 if (mac_srs->srs_bw->mac_bw_state &
3137 3142 SRS_BW_ENFORCED) {
3138 3143 MAC_SRS_CHECK_BW_CONTROL(mac_srs);
3139 3144 }
3140 3145 bw_ctl_flag = mac_srs->srs_bw->mac_bw_state &
3141 3146 SRS_BW_ENFORCED;
3142 3147 MAC_SRS_BW_UNLOCK(mac_srs);
3143 3148 }
3144 3149 }
3145 3150
3146 3151 if (mac_srs->srs_state & SRS_PAUSE)
3147 3152 goto done;
3148 3153 mac_srs->srs_drain_func(mac_srs, SRS_WORKER);
3149 3154 }
3150 3155 done:
3151 3156 /*
3152 3157 * The Rx SRS quiesce logic first cuts off packet supply to the SRS
3153 3158 * from both hard and soft classifications and waits for such threads
3154 3159 * to finish before signaling the worker. So at this point the only
3155 3160 * thread left that could be competing with the worker is the poll
3156 3161 * thread. In the case of Tx, there shouldn't be any thread holding
3157 3162 * SRS_PROC at this point.
3158 3163 */
3159 3164 if (!(mac_srs->srs_state & SRS_PROC)) {
3160 3165 mac_srs->srs_state |= SRS_PROC;
3161 3166 } else {
3162 3167 ASSERT((mac_srs->srs_type & SRST_TX) == 0);
3163 3168 /*
3164 3169 * Poll thread still owns the SRS and is still running
3165 3170 */
3166 3171 ASSERT((mac_srs->srs_poll_thr == NULL) ||
3167 3172 ((mac_srs->srs_state & SRS_POLL_THR_OWNER) ==
3168 3173 SRS_POLL_THR_OWNER));
3169 3174 }
3170 3175 mac_srs_worker_quiesce(mac_srs);
3171 3176 /*
3172 3177 * Wait for the SRS_RESTART or SRS_CONDEMNED signal from the initiator
3173 3178 * of the quiesce operation
3174 3179 */
3175 3180 while (!(mac_srs->srs_state & (SRS_CONDEMNED | SRS_RESTART)))
3176 3181 cv_wait(&mac_srs->srs_async, &mac_srs->srs_lock);
3177 3182
3178 3183 if (mac_srs->srs_state & SRS_RESTART) {
3179 3184 ASSERT(!(mac_srs->srs_state & SRS_CONDEMNED));
3180 3185 mac_srs_worker_restart(mac_srs);
3181 3186 mac_srs->srs_state &= ~SRS_PROC;
3182 3187 goto start;
3183 3188 }
3184 3189
3185 3190 if (!(mac_srs->srs_state & SRS_CONDEMNED_DONE))
3186 3191 mac_srs_worker_quiesce(mac_srs);
3187 3192
3188 3193 mac_srs->srs_state &= ~SRS_PROC;
3189 3194 /* The macro drops the srs_lock */
3190 3195 CALLB_CPR_EXIT(&cprinfo);
3191 3196 thread_exit();
3192 3197 }
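
Editor's sketch (not from the webrev): the worker above is a classic condition-variable loop: sleep while someone else holds SRS_PROC or the queue is empty, drain otherwise. A minimal POSIX-threads model of the same shape, with CPR callbacks and bandwidth checks omitted; all names are hypothetical:

    #include <pthread.h>
    #include <stdbool.h>
    #include <stddef.h>

    struct worker {
        pthread_mutex_t w_lock;
        pthread_cond_t  w_async;
        bool            w_proc;     /* someone already draining */
        bool            w_pause;
        void            *w_first;   /* queue head, NULL when empty */
    };

    static void *
    worker_loop(void *arg)
    {
        struct worker *w = arg;

        pthread_mutex_lock(&w->w_lock);
        for (;;) {
            while ((w->w_proc || w->w_first == NULL) && !w->w_pause)
                pthread_cond_wait(&w->w_async, &w->w_lock);
            if (w->w_pause)
                break;
            w->w_proc = true;
            /* the drain function would run here, as in the SRS */
            w->w_proc = false;
        }
        pthread_mutex_unlock(&w->w_lock);
        return (NULL);
    }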
3193 3198
3194 3199 /*
3195 3200 * mac_rx_srs_subflow_process
3196 3201 *
3197 3202 * Receive side routine called from interrupt path when there are
3198 3203 * sub flows present on this SRS.
3199 3204 */
3200 3205 /* ARGSUSED */
3201 3206 void
3202 3207 mac_rx_srs_subflow_process(void *arg, mac_resource_handle_t srs,
3203 3208 mblk_t *mp_chain, boolean_t loopback)
3204 3209 {
3205 3210 flow_entry_t *flent = NULL;
3206 3211 flow_entry_t *prev_flent = NULL;
3207 3212 mblk_t *mp = NULL;
3208 3213 mblk_t *tail = NULL;
3209 3214 mac_soft_ring_set_t *mac_srs = (mac_soft_ring_set_t *)srs;
3210 3215 mac_client_impl_t *mcip;
3211 3216
3212 3217 mcip = mac_srs->srs_mcip;
3213 3218 ASSERT(mcip != NULL);
3214 3219
3215 3220 /*
3216 3221 * We need to determine the SRS for every packet
3217 3222 * by walking the flow table, if we don't get any,
3218 3223 * then we proceed using the SRS we came with.
3219 3224 */
3220 3225 mp = tail = mp_chain;
3221 3226 while (mp != NULL) {
3222 3227
3223 3228 /*
3224 3229 * We will increment the stats for the matching subflow
3225 3230 * when we get the bytes/pkt count for the classified packets
3226 3231 * later in mac_rx_srs_process.
3227 3232 */
3228 3233 (void) mac_flow_lookup(mcip->mci_subflow_tab, mp,
3229 3234 FLOW_INBOUND, &flent);
3230 3235
3231 3236 if (mp == mp_chain || flent == prev_flent) {
3232 3237 if (prev_flent != NULL)
3233 3238 FLOW_REFRELE(prev_flent);
3234 3239 prev_flent = flent;
3235 3240 flent = NULL;
3236 3241 tail = mp;
3237 3242 mp = mp->b_next;
3238 3243 continue;
3239 3244 }
3240 3245 tail->b_next = NULL;
3241 3246 /*
3242 3247 * A null indicates this is for the mac_srs itself.
3243 3248 * XXX-venu : probably assert for fe_rx_srs_cnt == 0.
3244 3249 */
3245 3250 if (prev_flent == NULL || prev_flent->fe_rx_srs_cnt == 0) {
3246 3251 mac_rx_srs_process(arg,
3247 3252 (mac_resource_handle_t)mac_srs, mp_chain,
3248 3253 loopback);
3249 3254 } else {
3250 3255 (prev_flent->fe_cb_fn)(prev_flent->fe_cb_arg1,
3251 3256 prev_flent->fe_cb_arg2, mp_chain, loopback);
3252 3257 FLOW_REFRELE(prev_flent);
3253 3258 }
3254 3259 prev_flent = flent;
3255 3260 flent = NULL;
3256 3261 mp_chain = mp;
3257 3262 tail = mp;
3258 3263 mp = mp->b_next;
3259 3264 }
3260 3265 /* Last chain */
3261 3266 ASSERT(mp_chain != NULL);
3262 3267 if (prev_flent == NULL || prev_flent->fe_rx_srs_cnt == 0) {
3263 3268 mac_rx_srs_process(arg,
3264 3269 (mac_resource_handle_t)mac_srs, mp_chain, loopback);
3265 3270 } else {
3266 3271 (prev_flent->fe_cb_fn)(prev_flent->fe_cb_arg1,
3267 3272 prev_flent->fe_cb_arg2, mp_chain, loopback);
3268 3273 FLOW_REFRELE(prev_flent);
3269 3274 }
3270 3275 }
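
Editor's sketch (not from the webrev): the walk above batches consecutive packets that classify to the same flow and hands each run off as one sub-chain. A compact user-space model of that run-splitting, with an integer p_flow standing in for the mac_flow_lookup() result:

    #include <stdio.h>

    struct pkt {
        struct pkt  *p_next;
        int         p_flow;         /* stand-in for the flow lookup */
    };

    /* Per-flow handler; here it just counts the run it was handed. */
    static void
    deliver(int flow, struct pkt *chain)
    {
        int n = 0;

        for (; chain != NULL; chain = chain->p_next)
            n++;
        printf("flow %d: %d packet(s)\n", flow, n);
    }

    /*
     * Walk the chain; whenever the next packet classifies differently,
     * close the current run and hand it off as one sub-chain.
     */
    static void
    split_by_flow(struct pkt *p)
    {
        struct pkt *run = p;

        while (p != NULL) {
            struct pkt *next = p->p_next;

            if (next == NULL || next->p_flow != p->p_flow) {
                p->p_next = NULL;       /* terminate this run */
                deliver(p->p_flow, run);
                run = next;
            }
            p = next;
        }
    }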
3271 3276
3272 3277 /*
3273 3278 * mac_rx_srs_process
3274 3279 *
3275 3280 * Receive side routine called from the interrupt path.
3276 3281 *
3277 3282 * loopback is set to force a context switch on the loopback
3278 3283 * path between MAC clients.
3279 3284 */
3280 3285 /* ARGSUSED */
3281 3286 void
3282 3287 mac_rx_srs_process(void *arg, mac_resource_handle_t srs, mblk_t *mp_chain,
3283 3288 boolean_t loopback)
3284 3289 {
3285 3290 mac_soft_ring_set_t *mac_srs = (mac_soft_ring_set_t *)srs;
3286 3291 mblk_t *mp, *tail, *head;
3287 3292 int count = 0;
3288 3293 int count1;
3289 3294 size_t sz = 0;
3290 3295 size_t chain_sz, sz1;
3291 3296 mac_bw_ctl_t *mac_bw;
3292 3297 mac_srs_rx_t *srs_rx = &mac_srs->srs_rx;
3293 3298
3294 3299 /*
3295 3300 * Set the tail, count and sz. We set the sz irrespective
3296 3301 * of whether we are doing B/W control or not for the
3297 3302 * purpose of updating the stats.
3298 3303 */
3299 3304 mp = tail = mp_chain;
3300 3305 while (mp != NULL) {
3301 3306 tail = mp;
3302 3307 count++;
3303 3308 sz += msgdsize(mp);
3304 3309 mp = mp->b_next;
3305 3310 }
3306 3311
3307 3312 mutex_enter(&mac_srs->srs_lock);
3308 3313
3309 3314 if (loopback) {
3310 3315 SRS_RX_STAT_UPDATE(mac_srs, lclbytes, sz);
3311 3316 SRS_RX_STAT_UPDATE(mac_srs, lclcnt, count);
3312 3317
3313 3318 } else {
3314 3319 SRS_RX_STAT_UPDATE(mac_srs, intrbytes, sz);
3315 3320 SRS_RX_STAT_UPDATE(mac_srs, intrcnt, count);
3316 3321 }
3317 3322
3318 3323 /*
3319 3324 * If the SRS is already being processed; has been blanked;
3320 3325 * can be processed by worker thread only; or the B/W limit
3321 3326 * has been reached, then queue the chain and check if
3322 3327 * worker thread needs to be awakened.
3323 3328 */
3324 3329 if (mac_srs->srs_type & SRST_BW_CONTROL) {
3325 3330 mac_bw = mac_srs->srs_bw;
3326 3331 ASSERT(mac_bw != NULL);
3327 3332 mutex_enter(&mac_bw->mac_bw_lock);
3328 3333 mac_bw->mac_bw_intr += sz;
3329 3334 if (mac_bw->mac_bw_limit == 0) {
3330 3335 /* zero bandwidth: drop all */
3331 3336 srs_rx->sr_stat.mrs_sdrops += count;
3332 3337 mac_bw->mac_bw_drop_bytes += sz;
3333 3338 mutex_exit(&mac_bw->mac_bw_lock);
3334 3339 mutex_exit(&mac_srs->srs_lock);
3335 3340 mac_pkt_drop(NULL, NULL, mp_chain, B_FALSE);
3336 3341 return;
3337 3342 } else {
3338 3343 if ((mac_bw->mac_bw_sz + sz) <=
3339 3344 mac_bw->mac_bw_drop_threshold) {
3340 3345 mutex_exit(&mac_bw->mac_bw_lock);
3341 3346 MAC_RX_SRS_ENQUEUE_CHAIN(mac_srs, mp_chain,
3342 3347 tail, count, sz);
3343 3348 } else {
3344 3349 mp = mp_chain;
3345 3350 chain_sz = 0;
3346 3351 count1 = 0;
3347 3352 tail = NULL;
3348 3353 head = NULL;
3349 3354 while (mp != NULL) {
3350 3355 sz1 = msgdsize(mp);
3351 3356 if (mac_bw->mac_bw_sz + chain_sz + sz1 >
3352 3357 mac_bw->mac_bw_drop_threshold)
3353 3358 break;
3354 3359 chain_sz += sz1;
3355 3360 count1++;
3356 3361 tail = mp;
3357 3362 mp = mp->b_next;
3358 3363 }
3359 3364 mutex_exit(&mac_bw->mac_bw_lock);
3360 3365 if (tail != NULL) {
3361 3366 head = tail->b_next;
3362 3367 tail->b_next = NULL;
3363 3368 MAC_RX_SRS_ENQUEUE_CHAIN(mac_srs,
3364 3369 mp_chain, tail, count1, chain_sz);
3365 3370 sz -= chain_sz;
3366 3371 count -= count1;
3367 3372 } else {
3368 3373 /* Can't pick up any */
3369 3374 head = mp_chain;
3370 3375 }
3371 3376 if (head != NULL) {
3372 3377 /* Drop any packet over the threshold */
3373 3378 srs_rx->sr_stat.mrs_sdrops += count;
3374 3379 mutex_enter(&mac_bw->mac_bw_lock);
3375 3380 mac_bw->mac_bw_drop_bytes += sz;
3376 3381 mutex_exit(&mac_bw->mac_bw_lock);
3377 3382 freemsgchain(head);
3378 3383 }
3379 3384 }
3380 3385 MAC_SRS_WORKER_WAKEUP(mac_srs);
3381 3386 mutex_exit(&mac_srs->srs_lock);
3382 3387 return;
3383 3388 }
3384 3389 }
3385 3390
3386 3391 /*
3387 3392 * If the total number of packets queued in the SRS and
3388 3393 * its associated soft rings exceeds the max allowed,
3389 3394 * then drop the chain. If we are polling capable, this
3390 3395 * shouldn't be happening.
3391 3396 */
3392 3397 if (!(mac_srs->srs_type & SRST_BW_CONTROL) &&
3393 3398 (srs_rx->sr_poll_pkt_cnt > srs_rx->sr_hiwat)) {
3394 3399 mac_bw = mac_srs->srs_bw;
3395 3400 srs_rx->sr_stat.mrs_sdrops += count;
3396 3401 mutex_enter(&mac_bw->mac_bw_lock);
3397 3402 mac_bw->mac_bw_drop_bytes += sz;
3398 3403 mutex_exit(&mac_bw->mac_bw_lock);
3399 3404 freemsgchain(mp_chain);
3400 3405 mutex_exit(&mac_srs->srs_lock);
3401 3406 return;
3402 3407 }
3403 3408
3404 3409 MAC_RX_SRS_ENQUEUE_CHAIN(mac_srs, mp_chain, tail, count, sz);
3405 3410
3406 3411 if (!(mac_srs->srs_state & SRS_PROC)) {
3407 3412 /*
3408 3413 * If we are coming via loopback, if we are not optimizing for
3409 3414 * latency, or if our stack is running deep, we should signal
3410 3415 * the worker thread.
3411 3416 */
3412 3417 if (loopback || !(mac_srs->srs_state & SRS_LATENCY_OPT) ||
3413 3418 MAC_RX_SRS_TOODEEP()) {
3414 3419 /*
3415 3420 * For loopback, we need to let the worker take
3416 3421 * over as we don't want to continue in the same
3417 3422 * thread even if we can. This could lead to stack
3418 3423 * overflows and may also end up using
3419 3424 * resources (cpu) incorrectly.
3420 3425 */
3421 3426 cv_signal(&mac_srs->srs_async);
3422 3427 } else {
3423 3428 /*
3424 3429 * Seems like no one is processing the SRS and
3425 3430 * there is no backlog. We also inline process
3426 3431 * our packet if it's a single packet in non
3427 3432 * latency optimized case (in latency optimized
3428 3433 * case, we inline process chains of any size).
3429 3434 */
3430 3435 mac_srs->srs_drain_func(mac_srs, SRS_PROC_FAST);
3431 3436 }
3432 3437 }
3433 3438 mutex_exit(&mac_srs->srs_lock);
3434 3439 }
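
Editor's sketch (hypothetical names, not from the webrev): under bandwidth control, mac_rx_srs_process() accepts only the prefix of an arriving chain that keeps the queue under mac_bw_drop_threshold and counts the rest as drops. A minimal model of that split:

    #include <stddef.h>

    struct pkt {
        struct pkt  *p_next;
        size_t      p_size;
    };

    struct bw_queue {
        size_t  q_bytes;            /* bytes already queued */
        size_t  q_drop_threshold;
        size_t  q_drop_bytes;       /* stat: bytes refused */
    };

    /* Return the accepted prefix; *restp gets the suffix to be dropped. */
    static struct pkt *
    accept_under_threshold(struct bw_queue *q, struct pkt *chain,
        struct pkt **restp)
    {
        struct pkt *tail = NULL, *p = chain;
        size_t chain_sz = 0;

        while (p != NULL &&
            q->q_bytes + chain_sz + p->p_size <= q->q_drop_threshold) {
            chain_sz += p->p_size;
            tail = p;
            p = p->p_next;
        }
        *restp = p;                 /* everything past the threshold */
        for (; p != NULL; p = p->p_next)
            q->q_drop_bytes += p->p_size;
        if (tail == NULL)
            return (NULL);          /* nothing fit */
        tail->p_next = NULL;
        q->q_bytes += chain_sz;
        return (chain);
    }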
3435 3440
3436 3441 /* TX SIDE ROUTINES (RUNTIME) */
3437 3442
3438 3443 /*
3439 3444 * mac_tx_srs_no_desc
3440 3445 *
3441 3446 * This routine is called in Tx single-ring default mode
3442 3447 * when the Tx ring runs out of descs.
3443 3448 */
3444 3449 mac_tx_cookie_t
3445 3450 mac_tx_srs_no_desc(mac_soft_ring_set_t *mac_srs, mblk_t *mp_chain,
3446 3451 uint16_t flag, mblk_t **ret_mp)
3447 3452 {
3448 3453 mac_tx_cookie_t cookie = 0;
3449 3454 mac_srs_tx_t *srs_tx = &mac_srs->srs_tx;
3450 3455 boolean_t wakeup_worker = B_TRUE;
3451 3456 uint32_t tx_mode = srs_tx->st_mode;
3452 3457 int cnt, sz;
3453 3458 mblk_t *tail;
3454 3459
3455 3460 ASSERT(tx_mode == SRS_TX_DEFAULT || tx_mode == SRS_TX_BW);
3456 3461 if (flag & MAC_DROP_ON_NO_DESC) {
3457 3462 MAC_TX_SRS_DROP_MESSAGE(mac_srs, mp_chain, cookie);
3458 3463 } else {
3459 3464 if (mac_srs->srs_first != NULL)
3460 3465 wakeup_worker = B_FALSE;
3461 3466 MAC_COUNT_CHAIN(mac_srs, mp_chain, tail, cnt, sz);
3462 3467 if (flag & MAC_TX_NO_ENQUEUE) {
3463 3468 /*
3464 3469 * If TX_QUEUED is not set, queue the
3465 3470 * packet and let mac_tx_srs_drain()
3466 3471 * set the TX_BLOCKED bit for the
3467 3472 * reasons explained above. Otherwise,
3468 3473 * return the mblks.
3469 3474 */
3470 3475 if (wakeup_worker) {
3471 3476 MAC_TX_SRS_ENQUEUE_CHAIN(mac_srs,
3472 3477 mp_chain, tail, cnt, sz);
3473 3478 } else {
3474 3479 MAC_TX_SET_NO_ENQUEUE(mac_srs,
3475 3480 mp_chain, ret_mp, cookie);
3476 3481 }
3477 3482 } else {
3478 3483 MAC_TX_SRS_TEST_HIWAT(mac_srs, mp_chain,
3479 3484 tail, cnt, sz, cookie);
3480 3485 }
3481 3486 if (wakeup_worker)
3482 3487 cv_signal(&mac_srs->srs_async);
3483 3488 }
3484 3489 return (cookie);
3485 3490 }
3486 3491
3487 3492 /*
3488 3493 * mac_tx_srs_enqueue
3489 3494 *
3490 3495 * This routine is called when Tx SRS is operating in either serializer
3491 3496 * or bandwidth mode. In serializer mode, a packet will get enqueued
3492 3497 * when a thread cannot enter SRS exclusively. In bandwidth mode,
3493 3498 * packets get queued if the allowed byte-count limit for a tick is
3494 3499 * exceeded. The action that gets taken when MAC_DROP_ON_NO_DESC and
3495 3500 * MAC_TX_NO_ENQUEUE is set is different than when operating in either
3496 3501 * the default mode or fanout mode. Here packets get dropped or
3497 3502 * returned back to the caller only after hi-watermark worth of data
3498 3503 * is queued.
3499 3504 */
3500 3505 static mac_tx_cookie_t
3501 3506 mac_tx_srs_enqueue(mac_soft_ring_set_t *mac_srs, mblk_t *mp_chain,
3502 3507 uint16_t flag, uintptr_t fanout_hint, mblk_t **ret_mp)
3503 3508 {
3504 3509 mac_tx_cookie_t cookie = 0;
3505 3510 int cnt, sz;
3506 3511 mblk_t *tail;
3507 3512 boolean_t wakeup_worker = B_TRUE;
3508 3513
3509 3514 /*
3510 3515 * Ignore fanout hint if we don't have multiple tx rings.
3511 3516 */
3512 3517 if (!MAC_TX_SOFT_RINGS(mac_srs))
3513 3518 fanout_hint = 0;
3514 3519
3515 3520 if (mac_srs->srs_first != NULL)
3516 3521 wakeup_worker = B_FALSE;
3517 3522 MAC_COUNT_CHAIN(mac_srs, mp_chain, tail, cnt, sz);
3518 3523 if (flag & MAC_DROP_ON_NO_DESC) {
3519 3524 if (mac_srs->srs_count > mac_srs->srs_tx.st_hiwat) {
3520 3525 MAC_TX_SRS_DROP_MESSAGE(mac_srs, mp_chain, cookie);
3521 3526 } else {
3522 3527 MAC_TX_SRS_ENQUEUE_CHAIN(mac_srs,
3523 3528 mp_chain, tail, cnt, sz);
3524 3529 }
3525 3530 } else if (flag & MAC_TX_NO_ENQUEUE) {
3526 3531 if ((mac_srs->srs_count > mac_srs->srs_tx.st_hiwat) ||
3527 3532 (mac_srs->srs_state & SRS_TX_WAKEUP_CLIENT)) {
3528 3533 MAC_TX_SET_NO_ENQUEUE(mac_srs, mp_chain,
3529 3534 ret_mp, cookie);
3530 3535 } else {
3531 3536 mp_chain->b_prev = (mblk_t *)fanout_hint;
3532 3537 MAC_TX_SRS_ENQUEUE_CHAIN(mac_srs,
3533 3538 mp_chain, tail, cnt, sz);
3534 3539 }
3535 3540 } else {
3536 3541 /*
3537 3542 * If you are BW_ENFORCED, just enqueue the
3538 3543 * packet. srs_worker will drain it at the
3539 3544 * prescribed rate. Before enqueueing, save
3540 3545 * the fanout hint.
3541 3546 */
3542 3547 mp_chain->b_prev = (mblk_t *)fanout_hint;
3543 3548 MAC_TX_SRS_TEST_HIWAT(mac_srs, mp_chain,
3544 3549 tail, cnt, sz, cookie);
3545 3550 }
3546 3551 if (wakeup_worker)
3547 3552 cv_signal(&mac_srs->srs_async);
3548 3553 return (cookie);
3549 3554 }
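
Editor's sketch (not from the webrev): the flag handling above reduces to a small policy: queue by default, and only once hiwat worth of data is pending either drop the chain (MAC_DROP_ON_NO_DESC) or hand it back (MAC_TX_NO_ENQUEUE). A model of that decision table, with hypothetical names:

    enum tx_flag { DROP_ON_NO_DESC = 0x1, TX_NO_ENQUEUE = 0x2 };
    enum tx_action { ACTION_QUEUE, ACTION_DROP, ACTION_RETURN };

    /*
     * What to do with a chain that cannot be sent right now: queue by
     * default, but once the backlog passes hiwat either drop it or
     * return it, depending on the caller's flag.
     */
    static enum tx_action
    enqueue_policy(int flags, int queued, int hiwat)
    {
        if (queued > hiwat) {
            if (flags & DROP_ON_NO_DESC)
                return (ACTION_DROP);
            if (flags & TX_NO_ENQUEUE)
                return (ACTION_RETURN);
        }
        return (ACTION_QUEUE);
    }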
3550 3555
3551 3556 /*
3552 3557 * There are seven tx modes:
3553 3558 *
3554 3559 * 1) Default mode (SRS_TX_DEFAULT)
3555 3560 * 2) Serialization mode (SRS_TX_SERIALIZE)
3556 3561 * 3) Fanout mode (SRS_TX_FANOUT)
3557 3562 * 4) Bandwidth mode (SRS_TX_BW)
3558 3563 * 5) Fanout and Bandwidth mode (SRS_TX_BW_FANOUT)
3559 3564 * 6) aggr Tx mode (SRS_TX_AGGR)
3560 3565 * 7) aggr Tx bw mode (SRS_TX_BW_AGGR)
3561 3566 *
3562 3567 * The tx mode in which an SRS operates is decided in mac_tx_srs_setup()
3563 3568 * based on the number of Tx rings requested for an SRS and whether
3564 3569 * bandwidth control is requested or not.
3565 3570 *
3566 3571 * The default mode (i.e., no fanout/no bandwidth) is used when the
3567 3572 * underlying NIC does not have Tx rings or just one Tx ring. In this mode,
3568 3573 * the SRS acts as a pass-thru. Packets will go directly to mac_tx_send().
3569 3574 * When the underlying Tx ring runs out of Tx descs, it starts queueing up
3570 3575 * packets in SRS. When flow-control is relieved, the srs_worker drains
3571 3576 * the queued packets and informs blocked clients to restart sending
3572 3577 * packets.
3573 3578 *
3574 3579 * In the SRS_TX_SERIALIZE mode, all calls to mac_tx() are serialized. This
3575 3580 * mode is used when the link has no Tx rings or only one Tx ring.
3576 3581 *
3577 3582 * In the SRS_TX_FANOUT mode, packets will be fanned out to multiple
3578 3583 * Tx rings. Each Tx ring will have a soft ring associated with it.
3579 3584 * These soft rings will be hung off the Tx SRS. Queueing if it happens
3580 3585 * due to lack of Tx desc will be in individual soft ring (and not srs)
3581 3586 * associated with Tx ring.
3582 3587 *
3583 3588 * In the TX_BW mode, tx srs will allow packets to go down to Tx ring
3584 3589 * only if bw is available. Otherwise the packets will be queued in
3585 3590 * SRS. If fanout to multiple Tx rings is configured, the packets will
3586 3591 * be fanned out among the soft rings associated with the Tx rings.
3587 3592 *
3588 3593 * In SRS_TX_AGGR mode, mac_tx_aggr_mode() routine is called. This routine
3589 3594 * invokes an aggr function, aggr_find_tx_ring(), to find a pseudo Tx ring
3590 3595 * belonging to a port on which the packet has to be sent. Aggr will
3591 3596 * always have a pseudo Tx ring associated with it even when it is an
3592 3597 * aggregation over a single NIC that has no Tx rings. Even in such a
3593 3598 * case, the single pseudo Tx ring will have a soft ring associated with
3594 3599 * it and the soft ring will hang off the SRS.
3595 3600 *
3596 3601 * If a bandwidth is specified for an aggr, SRS_TX_BW_AGGR mode is used.
3597 3602 * In this mode, the bandwidth is first applied on the outgoing packets
3598 3603 * and later mac_tx_aggr_mode() function is called to send the packet out
3599 3604 * of one of the pseudo Tx rings.
3600 3605 *
3601 3606 * Three flags are used in srs_state for indicating flow control
3602 3607 * conditions: SRS_TX_BLOCKED, SRS_TX_HIWAT, SRS_TX_WAKEUP_CLIENT.
3603 3608 * SRS_TX_BLOCKED indicates out of Tx descs. SRS expects a wakeup from the
3604 3609 * driver below.
3605 3610 * SRS_TX_HIWAT indicates packet count enqueued in Tx SRS exceeded Tx hiwat
3606 3611 * and flow-control pressure is applied back to clients. The clients expect
3607 3612 * wakeup when flow-control is relieved.
3608 3613 * SRS_TX_WAKEUP_CLIENT get set when (flag == MAC_TX_NO_ENQUEUE) and mblk
3609 3614 * got returned back to client either due to lack of Tx descs or due to bw
3610 3615 * control reasons. The clients expect a wakeup when condition is relieved.
3611 3616 *
3612 3617 * The fourth argument to mac_tx() is the flag. Normally it will be 0 but
3613 3618 * some clients set the following values too: MAC_DROP_ON_NO_DESC,
3614 3619 * MAC_TX_NO_ENQUEUE
3615 3620 * Mac clients that do not want packets to be enqueued in the mac layer set
3616 3621 * MAC_DROP_ON_NO_DESC value. The packets won't be queued in the Tx SRS or
3617 3622 * Tx soft rings but instead get dropped when the NIC runs out of desc. The
3618 3623 * behaviour of this flag is different when the Tx is running in serializer
3619 3624 * or bandwidth mode. Under these (serializer, bandwidth) modes, packets
3620 3625 * get dropped when the Tx high watermark is reached.
3621 3626 * There are some mac clients like vsw, aggr that want the mblks to be
3622 3627 * returned back to clients instead of being queued in Tx SRS (or Tx soft
3623 3628 * rings) under flow-control (i.e., out of desc or exceeding bw limits)
3624 3629 * conditions. These clients call mac_tx() with MAC_TX_NO_ENQUEUE flag set.
3625 3630 * In the default and Tx fanout mode, the un-transmitted mblks will be
3626 3631 * returned back to the clients when the driver runs out of Tx descs.
3627 3632 * SRS_TX_WAKEUP_CLIENT (or S_RING_WAKEUP_CLIENT) will be set in SRS (or
3628 3633 * soft ring) so that the clients can be woken up when Tx desc become
3629 3634 * available. When running in serializer or bandwidth mode,
3630 3635 * SRS_TX_WAKEUP_CLIENT will be set when tx hi-watermark is reached.
3631 3636 */
3632 3637
3633 3638 mac_tx_func_t
3634 3639 mac_tx_get_func(uint32_t mode)
3635 3640 {
3636 3641 return (mac_tx_mode_list[mode].mac_tx_func);
3637 3642 }
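
Editor's sketch (not from the webrev): mac_tx_get_func() is a lookup into a mode-indexed table of send routines, resolved once at SRS setup so the hot path makes a single indirect call. A self-contained model of the pattern; the tx_* names are illustrative, not the actual mac_tx_mode_list entries:

    typedef int (*tx_func_t)(void *srs, void *chain);

    enum tx_mode { TX_DEFAULT, TX_SERIALIZE, TX_FANOUT, TX_BW, TX_NMODES };

    /* Stub send routines; each would implement one mode's policy. */
    static int tx_default(void *s, void *c)   { (void)s; (void)c; return (0); }
    static int tx_serialize(void *s, void *c) { (void)s; (void)c; return (0); }
    static int tx_fanout(void *s, void *c)    { (void)s; (void)c; return (0); }
    static int tx_bw(void *s, void *c)        { (void)s; (void)c; return (0); }

    static const tx_func_t tx_mode_list[TX_NMODES] = {
        [TX_DEFAULT]   = tx_default,
        [TX_SERIALIZE] = tx_serialize,
        [TX_FANOUT]    = tx_fanout,
        [TX_BW]        = tx_bw,
    };

    /* Resolved once at setup; callers then invoke the returned pointer. */
    static tx_func_t
    tx_get_func(enum tx_mode mode)
    {
        return (tx_mode_list[mode]);
    }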
3638 3643
3639 3644 /* ARGSUSED */
3640 3645 static mac_tx_cookie_t
3641 3646 mac_tx_single_ring_mode(mac_soft_ring_set_t *mac_srs, mblk_t *mp_chain,
3642 3647 uintptr_t fanout_hint, uint16_t flag, mblk_t **ret_mp)
3643 3648 {
3644 3649 mac_srs_tx_t *srs_tx = &mac_srs->srs_tx;
3645 3650 mac_tx_stats_t stats;
3646 3651 mac_tx_cookie_t cookie = 0;
3647 3652
3648 3653 ASSERT(srs_tx->st_mode == SRS_TX_DEFAULT);
3649 3654
3650 3655 /* Regular case with a single Tx ring */
3651 3656 /*
3652 3657 * SRS_TX_BLOCKED is set when underlying NIC runs
3653 3658 * out of Tx descs and messages start getting
3654 3659 * queued. It won't get reset until
3655 3660 * mac_tx_srs_drain() completely drains out the
3656 3661 * messages.
3657 3662 */
3658 3663 if ((mac_srs->srs_state & SRS_ENQUEUED) != 0) {
3659 3664 /* Tx descs/resources not available */
3660 3665 mutex_enter(&mac_srs->srs_lock);
3661 3666 if ((mac_srs->srs_state & SRS_ENQUEUED) != 0) {
3662 3667 cookie = mac_tx_srs_no_desc(mac_srs, mp_chain,
3663 3668 flag, ret_mp);
3664 3669 mutex_exit(&mac_srs->srs_lock);
3665 3670 return (cookie);
3666 3671 }
3667 3672 /*
3668 3673 * While we were computing mblk count, the
3669 3674 * flow control condition got relieved.
3670 3675 * Continue with the transmission.
3671 3676 */
3672 3677 mutex_exit(&mac_srs->srs_lock);
3673 3678 }
3674 3679
3675 3680 mp_chain = mac_tx_send(srs_tx->st_arg1, srs_tx->st_arg2,
3676 3681 mp_chain, &stats);
3677 3682
3678 3683 /*
3679 3684 * Multiple threads could be here sending packets.
3680 3685 * Under such conditions, it is not possible to
3681 3686 * atomically set SRS_TX_BLOCKED bit to indicate
3682 3687 * out of tx desc condition. To atomically set
3683 3688 * this, we queue the returned packet and do
3684 3689 * the setting of SRS_TX_BLOCKED in
3685 3690 * mac_tx_srs_drain().
3686 3691 */
3687 3692 if (mp_chain != NULL) {
3688 3693 mutex_enter(&mac_srs->srs_lock);
3689 3694 cookie = mac_tx_srs_no_desc(mac_srs, mp_chain, flag, ret_mp);
3690 3695 mutex_exit(&mac_srs->srs_lock);
3691 3696 return (cookie);
3692 3697 }
3693 3698 SRS_TX_STATS_UPDATE(mac_srs, &stats);
3694 3699
3695 3700 return (0);
3696 3701 }
3697 3702
3698 3703 /*
3699 3704 * mac_tx_serialize_mode
3700 3705 *
3701 3706 * This is an experimental mode implemented as per the request of PAE.
3702 3707 * In this mode, all callers attempting to send a packet to the NIC
3703 3708 * will get serialized. Only one thread at any time will access the
3704 3709 * NIC to send the packet out.
3705 3710 */
3706 3711 /* ARGSUSED */
3707 3712 static mac_tx_cookie_t
3708 3713 mac_tx_serializer_mode(mac_soft_ring_set_t *mac_srs, mblk_t *mp_chain,
3709 3714 uintptr_t fanout_hint, uint16_t flag, mblk_t **ret_mp)
3710 3715 {
3711 3716 mac_tx_stats_t stats;
3712 3717 mac_tx_cookie_t cookie = 0;
3713 3718 mac_srs_tx_t *srs_tx = &mac_srs->srs_tx;
3714 3719
3715 3720 /* Single ring, serialize below */
3716 3721 ASSERT(srs_tx->st_mode == SRS_TX_SERIALIZE);
3717 3722 mutex_enter(&mac_srs->srs_lock);
3718 3723 if ((mac_srs->srs_first != NULL) ||
3719 3724 (mac_srs->srs_state & SRS_PROC)) {
3720 3725 /*
3721 3726 * In serialization mode, queue all packets until
3722 3727 * TX_HIWAT is set.
3723 3728 * If drop bit is set, drop if TX_HIWAT is set.
3724 3729 * If no_enqueue is set, still enqueue until hiwat
3725 3730 * is set and return mblks after TX_HIWAT is set.
3726 3731 */
3727 3732 cookie = mac_tx_srs_enqueue(mac_srs, mp_chain,
3728 3733 flag, 0, ret_mp);
3729 3734 mutex_exit(&mac_srs->srs_lock);
3730 3735 return (cookie);
3731 3736 }
3732 3737 /*
3733 3738 * No packets queued, nothing on proc and no flow
3734 3739 * control condition. Fast-path, ok. Do inline
3735 3740 * processing.
3736 3741 */
3737 3742 mac_srs->srs_state |= SRS_PROC;
3738 3743 mutex_exit(&mac_srs->srs_lock);
3739 3744
3740 3745 mp_chain = mac_tx_send(srs_tx->st_arg1, srs_tx->st_arg2,
3741 3746 mp_chain, &stats);
3742 3747
3743 3748 mutex_enter(&mac_srs->srs_lock);
3744 3749 mac_srs->srs_state &= ~SRS_PROC;
3745 3750 if (mp_chain != NULL) {
3746 3751 cookie = mac_tx_srs_enqueue(mac_srs,
3747 3752 mp_chain, flag, 0, ret_mp);
3748 3753 }
3749 3754 if (mac_srs->srs_first != NULL) {
3750 3755 /*
3751 3756 * We processed inline our packet and a new
3752 3757 * packet/s got queued while we were
3753 3758 * processing. Wakeup srs worker
3754 3759 */
3755 3760 cv_signal(&mac_srs->srs_async);
3756 3761 }
3757 3762 mutex_exit(&mac_srs->srs_lock);
3758 3763
3759 3764 if (cookie == 0)
3760 3765 SRS_TX_STATS_UPDATE(mac_srs, &stats);
3761 3766
3762 3767 return (cookie);
3763 3768 }
3764 3769
3765 3770 /*
3766 3771 * mac_tx_fanout_mode
3767 3772 *
3768 3773 * In this mode, the SRS will have access to multiple Tx rings to send
3769 3774 * the packet out. The fanout hint that is passed as an argument is
3770 3775 * used to find an appropriate ring to fanout the traffic. Each Tx
3771 3776 * ring, in turn, will have a soft ring associated with it. If a Tx
3772 3777 * ring runs out of Tx desc's the returned packet will be queued in
3773 3778 * the soft ring associated with that Tx ring. The srs itself will not
3774 3779 * queue any packets.
3775 3780 */
3776 3781
3777 3782 #define MAC_TX_SOFT_RING_PROCESS(chain) { \
3778 3783 index = COMPUTE_INDEX(hash, mac_srs->srs_tx_ring_count), \
3779 3784 softring = mac_srs->srs_tx_soft_rings[index]; \
3780 3785 cookie = mac_tx_soft_ring_process(softring, chain, flag, ret_mp); \
3781 3786 DTRACE_PROBE2(tx__fanout, uint64_t, hash, uint_t, index); \
3782 3787 }
3783 3788
3784 3789 static mac_tx_cookie_t
3785 3790 mac_tx_fanout_mode(mac_soft_ring_set_t *mac_srs, mblk_t *mp_chain,
3786 3791 uintptr_t fanout_hint, uint16_t flag, mblk_t **ret_mp)
3787 3792 {
3788 3793 mac_soft_ring_t *softring;
3789 3794 uint64_t hash;
3790 3795 uint_t index;
3791 3796 mac_tx_cookie_t cookie = 0;
3792 3797
3793 3798 ASSERT(mac_srs->srs_tx.st_mode == SRS_TX_FANOUT ||
3794 3799 mac_srs->srs_tx.st_mode == SRS_TX_BW_FANOUT);
3795 3800 if (fanout_hint != 0) {
3796 3801 /*
3797 3802 * The hint is specified by the caller, simply pass the
3798 3803 * whole chain to the soft ring.
3799 3804 */
3800 3805 hash = HASH_HINT(fanout_hint);
3801 3806 MAC_TX_SOFT_RING_PROCESS(mp_chain);
3802 3807 } else {
3803 3808 mblk_t *last_mp, *cur_mp, *sub_chain;
3804 3809 uint64_t last_hash = 0;
3805 3810 uint_t media = mac_srs->srs_mcip->mci_mip->mi_info.mi_media;
3806 3811
3807 3812 /*
3808 3813 * Compute the hash from the contents (headers) of the
3809 3814 * packets of the mblk chain. Split the chains into
3810 3815 * subchains of the same conversation.
3811 3816 *
3812 3817 * Since there may be more than one ring used for
3813 3818 * sub-chains of the same call, and since the caller
3814 3819 * does not maintain per-conversation state (it
3815 3820 * passed a zero hint), unsent subchains will be
3816 3821 * dropped.
3817 3822 */
3818 3823
3819 3824 flag |= MAC_DROP_ON_NO_DESC;
3820 3825 ret_mp = NULL;
3821 3826
3822 3827 ASSERT(ret_mp == NULL);
3823 3828
3824 3829 sub_chain = NULL;
3825 3830 last_mp = NULL;
3826 3831
3827 3832 for (cur_mp = mp_chain; cur_mp != NULL;
3828 3833 cur_mp = cur_mp->b_next) {
3829 3834 hash = mac_pkt_hash(media, cur_mp, MAC_PKT_HASH_L4,
3830 3835 B_TRUE);
3831 3836 if (last_hash != 0 && hash != last_hash) {
3832 3837 /*
3833 3838 * Starting a different subchain, send current
3834 3839 * chain out.
3835 3840 */
3836 3841 ASSERT(last_mp != NULL);
3837 3842 last_mp->b_next = NULL;
3838 3843 MAC_TX_SOFT_RING_PROCESS(sub_chain);
3839 3844 sub_chain = NULL;
3840 3845 }
3841 3846
3842 3847 /* add packet to subchain */
3843 3848 if (sub_chain == NULL)
3844 3849 sub_chain = cur_mp;
3845 3850 last_mp = cur_mp;
3846 3851 last_hash = hash;
3847 3852 }
3848 3853
3849 3854 if (sub_chain != NULL) {
3850 3855 /* send last subchain */
3851 3856 ASSERT(last_mp != NULL);
3852 3857 last_mp->b_next = NULL;
3853 3858 MAC_TX_SOFT_RING_PROCESS(sub_chain);
3854 3859 }
3855 3860
3856 3861 cookie = 0;
3857 3862 }
3858 3863
3859 3864 return (cookie);
3860 3865 }
3861 3866
3862 3867 /*
3863 3868 * mac_tx_bw_mode
3864 3869 *
3865 3870 * In the bandwidth mode, Tx srs will allow packets to go down to Tx ring
3866 3871 * only if bw is available. Otherwise the packets will be queued in
3867 3872 * SRS. If the SRS has multiple Tx rings, then packets will get fanned
3868 3873 * out to the Tx rings.
3869 3874 */
3870 3875 static mac_tx_cookie_t
3871 3876 mac_tx_bw_mode(mac_soft_ring_set_t *mac_srs, mblk_t *mp_chain,
3872 3877 uintptr_t fanout_hint, uint16_t flag, mblk_t **ret_mp)
3873 3878 {
3874 3879 int cnt, sz;
3875 3880 mblk_t *tail;
3876 3881 mac_tx_cookie_t cookie = 0;
3877 3882 mac_srs_tx_t *srs_tx = &mac_srs->srs_tx;
3878 3883 clock_t now;
3879 3884
3880 3885 ASSERT(TX_BANDWIDTH_MODE(mac_srs));
3881 3886 ASSERT(mac_srs->srs_type & SRST_BW_CONTROL);
3882 3887 mutex_enter(&mac_srs->srs_lock);
3883 3888 if (mac_srs->srs_bw->mac_bw_limit == 0) {
3884 3889 /*
3885 3890 * zero bandwidth, no traffic is sent: drop the packets,
3886 3891 * or return the whole chain if the caller requests all
3887 3892 * unsent packets back.
3888 3893 */
3889 3894 if (flag & MAC_TX_NO_ENQUEUE) {
3890 3895 cookie = (mac_tx_cookie_t)mac_srs;
3891 3896 *ret_mp = mp_chain;
3892 3897 } else {
3893 3898 MAC_TX_SRS_DROP_MESSAGE(mac_srs, mp_chain, cookie);
3894 3899 }
3895 3900 mutex_exit(&mac_srs->srs_lock);
3896 3901 return (cookie);
3897 3902 } else if ((mac_srs->srs_first != NULL) ||
3898 3903 (mac_srs->srs_bw->mac_bw_state & SRS_BW_ENFORCED)) {
3899 3904 cookie = mac_tx_srs_enqueue(mac_srs, mp_chain, flag,
3900 3905 fanout_hint, ret_mp);
3901 3906 mutex_exit(&mac_srs->srs_lock);
3902 3907 return (cookie);
3903 3908 }
3904 3909 MAC_COUNT_CHAIN(mac_srs, mp_chain, tail, cnt, sz);
3905 3910 now = ddi_get_lbolt();
3906 3911 if (mac_srs->srs_bw->mac_bw_curr_time != now) {
3907 3912 mac_srs->srs_bw->mac_bw_curr_time = now;
3908 3913 mac_srs->srs_bw->mac_bw_used = 0;
3909 3914 } else if (mac_srs->srs_bw->mac_bw_used >
3910 3915 mac_srs->srs_bw->mac_bw_limit) {
3911 3916 mac_srs->srs_bw->mac_bw_state |= SRS_BW_ENFORCED;
3912 3917 MAC_TX_SRS_ENQUEUE_CHAIN(mac_srs,
3913 3918 mp_chain, tail, cnt, sz);
3914 3919 /*
3915 3920 * Wakeup worker thread. Note that worker
3916 3921 * thread has to be woken up so that it
3917 3922 * can fire up the timer to be woken up
3918 3923 * on the next tick. Also once
3919 3924 * BW_ENFORCED is set, it can only be
3920 3925 * reset by srs_worker thread. Until then
3921 3926 * all packets will get queued up in SRS
3922 3927 * and hence this code path won't be
3923 3928 * entered until BW_ENFORCED is reset.
3924 3929 */
3925 3930 cv_signal(&mac_srs->srs_async);
3926 3931 mutex_exit(&mac_srs->srs_lock);
3927 3932 return (cookie);
3928 3933 }
3929 3934
3930 3935 mac_srs->srs_bw->mac_bw_used += sz;
3931 3936 mutex_exit(&mac_srs->srs_lock);
3932 3937
3933 3938 if (srs_tx->st_mode == SRS_TX_BW_FANOUT) {
3934 3939 mac_soft_ring_t *softring;
3935 3940 uint_t indx, hash;
3936 3941
3937 3942 hash = HASH_HINT(fanout_hint);
3938 3943 indx = COMPUTE_INDEX(hash,
3939 3944 mac_srs->srs_tx_ring_count);
3940 3945 softring = mac_srs->srs_tx_soft_rings[indx];
3941 3946 return (mac_tx_soft_ring_process(softring, mp_chain, flag,
3942 3947 ret_mp));
3943 3948 } else if (srs_tx->st_mode == SRS_TX_BW_AGGR) {
3944 3949 return (mac_tx_aggr_mode(mac_srs, mp_chain,
3945 3950 fanout_hint, flag, ret_mp));
3946 3951 } else {
3947 3952 mac_tx_stats_t stats;
3948 3953
3949 3954 mp_chain = mac_tx_send(srs_tx->st_arg1, srs_tx->st_arg2,
3950 3955 mp_chain, &stats);
3951 3956
3952 3957 if (mp_chain != NULL) {
3953 3958 mutex_enter(&mac_srs->srs_lock);
3954 3959 MAC_COUNT_CHAIN(mac_srs, mp_chain, tail, cnt, sz);
3955 3960 if (mac_srs->srs_bw->mac_bw_used > sz)
3956 3961 mac_srs->srs_bw->mac_bw_used -= sz;
3957 3962 else
3958 3963 mac_srs->srs_bw->mac_bw_used = 0;
3959 3964 cookie = mac_tx_srs_enqueue(mac_srs, mp_chain, flag,
3960 3965 fanout_hint, ret_mp);
3961 3966 mutex_exit(&mac_srs->srs_lock);
3962 3967 return (cookie);
3963 3968 }
3964 3969 SRS_TX_STATS_UPDATE(mac_srs, &stats);
3965 3970
3966 3971 return (0);
3967 3972 }
3968 3973 }
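
Editor's sketch (not from the webrev): mac_tx_bw_mode() charges the whole chain against the budget before sending, then credits back the bytes the driver returned unsent. A user-space model of that charge-then-refund accounting; driver_send() is a hypothetical stub, not a real hook:

    #include <stddef.h>

    struct pkt {
        struct pkt  *p_next;
        size_t      p_size;
    };

    struct bw_ctl {
        size_t  bw_used;        /* bytes charged in the current tick */
    };

    static size_t
    chain_bytes(struct pkt *p)
    {
        size_t sz = 0;

        for (; p != NULL; p = p->p_next)
            sz += p->p_size;
        return (sz);
    }

    /* Stub driver hook: pretend everything was sent. */
    static struct pkt *
    driver_send(struct pkt *chain)
    {
        (void) chain;
        return (NULL);
    }

    /* Charge the whole chain up front, then refund whatever came back. */
    static struct pkt *
    bw_send(struct bw_ctl *bw, struct pkt *chain)
    {
        struct pkt *unsent;

        bw->bw_used += chain_bytes(chain);
        unsent = driver_send(chain);
        if (unsent != NULL) {
            size_t back = chain_bytes(unsent);

            bw->bw_used = (bw->bw_used > back) ? bw->bw_used - back : 0;
        }
        return (unsent);
    }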
3969 3974
3970 3975 /*
3971 3976 * mac_tx_aggr_mode
3972 3977 *
3973 3978 * This routine invokes an aggr function, aggr_find_tx_ring(), to find
3974 3979 * a (pseudo) Tx ring belonging to a port on which the packet has to
3975 3980 * be sent. aggr_find_tx_ring() first finds the outgoing port based on
3976 3981 * L2/L3/L4 policy and then uses the fanout_hint passed to it to pick
3977 3982 * a Tx ring from the selected port.
3978 3983 *
3979 3984 * Note that a port can be deleted from the aggregation. In such a case,
3980 3985 * the aggregation layer first separates the port from the rest of the
3981 3986 * ports making sure that port (and thus any Tx rings associated with
3982 3987 * it) won't get selected in the call to aggr_find_tx_ring() function.
3983 3988 * Later calls are made to mac_group_rem_ring() passing pseudo Tx ring
3984 3989 * handles one by one which in turn will quiesce the Tx SRS and remove
3985 3990 * the soft ring associated with the pseudo Tx ring. Unlike Rx side
3986 3991 * where a cookie is used to protect against mac_rx_ring() calls on
3987 3992 * rings that have been removed, no such cookie is needed on the Tx
3988 3993 * side as the pseudo Tx ring won't be available anymore to
3989 3994 * aggr_find_tx_ring() once the port has been removed.
3990 3995 */
3991 3996 static mac_tx_cookie_t
3992 3997 mac_tx_aggr_mode(mac_soft_ring_set_t *mac_srs, mblk_t *mp_chain,
3993 3998 uintptr_t fanout_hint, uint16_t flag, mblk_t **ret_mp)
3994 3999 {
3995 4000 mac_srs_tx_t *srs_tx = &mac_srs->srs_tx;
3996 4001 mac_tx_ring_fn_t find_tx_ring_fn;
3997 4002 mac_ring_handle_t ring = NULL;
3998 4003 void *arg;
3999 4004 mac_soft_ring_t *sringp;
4000 4005
4001 4006 find_tx_ring_fn = srs_tx->st_capab_aggr.mca_find_tx_ring_fn;
4002 4007 arg = srs_tx->st_capab_aggr.mca_arg;
4003 4008 if (find_tx_ring_fn(arg, mp_chain, fanout_hint, &ring) == NULL)
4004 4009 return (0);
4005 4010 sringp = srs_tx->st_soft_rings[((mac_ring_t *)ring)->mr_index];
4006 4011 return (mac_tx_soft_ring_process(sringp, mp_chain, flag, ret_mp));
4007 4012 }
4008 4013
4009 4014 void
4010 4015 mac_tx_invoke_callbacks(mac_client_impl_t *mcip, mac_tx_cookie_t cookie)
4011 4016 {
4012 4017 mac_cb_t *mcb;
4013 4018 mac_tx_notify_cb_t *mtnfp;
4014 4019
4015 4020 /* Wakeup callback registered clients */
4016 4021 MAC_CALLBACK_WALKER_INC(&mcip->mci_tx_notify_cb_info);
4017 4022 for (mcb = mcip->mci_tx_notify_cb_list; mcb != NULL;
4018 4023 mcb = mcb->mcb_nextp) {
4019 4024 mtnfp = (mac_tx_notify_cb_t *)mcb->mcb_objp;
4020 4025 mtnfp->mtnf_fn(mtnfp->mtnf_arg, cookie);
4021 4026 }
4022 4027 MAC_CALLBACK_WALKER_DCR(&mcip->mci_tx_notify_cb_info,
4023 4028 &mcip->mci_tx_notify_cb_list);
4024 4029 }
4025 4030
4026 4031 /* ARGSUSED */
4027 4032 void
4028 4033 mac_tx_srs_drain(mac_soft_ring_set_t *mac_srs, uint_t proc_type)
4029 4034 {
4030 4035 mblk_t *head, *tail;
4031 4036 size_t sz;
4032 4037 uint32_t tx_mode;
4033 4038 uint_t saved_pkt_count;
4034 4039 mac_tx_stats_t stats;
4035 4040 mac_srs_tx_t *srs_tx = &mac_srs->srs_tx;
4036 4041 clock_t now;
4037 4042
4038 4043 saved_pkt_count = 0;
4039 4044 ASSERT(mutex_owned(&mac_srs->srs_lock));
4040 4045 ASSERT(!(mac_srs->srs_state & SRS_PROC));
4041 4046
4042 4047 mac_srs->srs_state |= SRS_PROC;
4043 4048
4044 4049 tx_mode = srs_tx->st_mode;
4045 4050 if (tx_mode == SRS_TX_DEFAULT || tx_mode == SRS_TX_SERIALIZE) {
4046 4051 if (mac_srs->srs_first != NULL) {
4047 4052 head = mac_srs->srs_first;
4048 4053 tail = mac_srs->srs_last;
4049 4054 saved_pkt_count = mac_srs->srs_count;
4050 4055 mac_srs->srs_first = NULL;
4051 4056 mac_srs->srs_last = NULL;
4052 4057 mac_srs->srs_count = 0;
4053 4058 mutex_exit(&mac_srs->srs_lock);
4054 4059
4055 4060 head = mac_tx_send(srs_tx->st_arg1, srs_tx->st_arg2,
4056 4061 head, &stats);
4057 4062
4058 4063 mutex_enter(&mac_srs->srs_lock);
4059 4064 if (head != NULL) {
4060 4065 /* Device out of tx desc, set block */
4061 4066 if (head->b_next == NULL)
4062 4067 VERIFY(head == tail);
4063 4068 tail->b_next = mac_srs->srs_first;
4064 4069 mac_srs->srs_first = head;
4065 4070 mac_srs->srs_count +=
4066 4071 (saved_pkt_count - stats.mts_opackets);
4067 4072 if (mac_srs->srs_last == NULL)
4068 4073 mac_srs->srs_last = tail;
4069 4074 MAC_TX_SRS_BLOCK(mac_srs, head);
4070 4075 } else {
4071 4076 srs_tx->st_woken_up = B_FALSE;
4072 4077 SRS_TX_STATS_UPDATE(mac_srs, &stats);
4073 4078 }
4074 4079 }
4075 4080 } else if (tx_mode == SRS_TX_BW) {
4076 4081 /*
4077 4082 * We are here because the timer fired and we have some data
4078 4083 * to transmit. Also mac_tx_srs_worker should have reset
4079 4084 * SRS_BW_ENFORCED flag
4080 4085 */
4081 4086 ASSERT(!(mac_srs->srs_bw->mac_bw_state & SRS_BW_ENFORCED));
4082 4087 head = tail = mac_srs->srs_first;
4083 4088 while (mac_srs->srs_first != NULL) {
4084 4089 tail = mac_srs->srs_first;
4085 4090 tail->b_prev = NULL;
4086 4091 mac_srs->srs_first = tail->b_next;
4087 4092 if (mac_srs->srs_first == NULL)
4088 4093 mac_srs->srs_last = NULL;
4089 4094 mac_srs->srs_count--;
4090 4095 sz = msgdsize(tail);
4091 4096 mac_srs->srs_size -= sz;
4092 4097 saved_pkt_count++;
4093 4098 MAC_TX_UPDATE_BW_INFO(mac_srs, sz);
4094 4099
4095 4100 if (mac_srs->srs_bw->mac_bw_used <
4096 4101 mac_srs->srs_bw->mac_bw_limit)
4097 4102 continue;
4098 4103
4099 4104 now = ddi_get_lbolt();
4100 4105 if (mac_srs->srs_bw->mac_bw_curr_time != now) {
4101 4106 mac_srs->srs_bw->mac_bw_curr_time = now;
4102 4107 mac_srs->srs_bw->mac_bw_used = sz;
4103 4108 continue;
4104 4109 }
4105 4110 mac_srs->srs_bw->mac_bw_state |= SRS_BW_ENFORCED;
4106 4111 break;
4107 4112 }
4108 4113
4109 4114 ASSERT((head == NULL && tail == NULL) ||
4110 4115 (head != NULL && tail != NULL));
4111 4116 if (tail != NULL) {
4112 4117 tail->b_next = NULL;
4113 4118 mutex_exit(&mac_srs->srs_lock);
4114 4119
4115 4120 head = mac_tx_send(srs_tx->st_arg1, srs_tx->st_arg2,
4116 4121 head, &stats);
4117 4122
4118 4123 mutex_enter(&mac_srs->srs_lock);
4119 4124 if (head != NULL) {
4120 4125 uint_t size_sent;
4121 4126
4122 4127 /* Device out of tx desc, set block */
4123 4128 if (head->b_next == NULL)
4124 4129 VERIFY(head == tail);
4125 4130 tail->b_next = mac_srs->srs_first;
4126 4131 mac_srs->srs_first = head;
4127 4132 mac_srs->srs_count +=
4128 4133 (saved_pkt_count - stats.mts_opackets);
4129 4134 if (mac_srs->srs_last == NULL)
4130 4135 mac_srs->srs_last = tail;
4131 4136 size_sent = sz - stats.mts_obytes;
4132 4137 mac_srs->srs_size += size_sent;
4133 4138 mac_srs->srs_bw->mac_bw_sz += size_sent;
4134 4139 if (mac_srs->srs_bw->mac_bw_used > size_sent) {
4135 4140 mac_srs->srs_bw->mac_bw_used -=
4136 4141 size_sent;
4137 4142 } else {
4138 4143 mac_srs->srs_bw->mac_bw_used = 0;
4139 4144 }
4140 4145 MAC_TX_SRS_BLOCK(mac_srs, head);
4141 4146 } else {
4142 4147 srs_tx->st_woken_up = B_FALSE;
4143 4148 SRS_TX_STATS_UPDATE(mac_srs, &stats);
4144 4149 }
4145 4150 }
4146 4151 } else if (tx_mode == SRS_TX_BW_FANOUT || tx_mode == SRS_TX_BW_AGGR) {
4147 4152 mblk_t *prev;
4148 4153 uint64_t hint;
4149 4154
4150 4155 /*
4151 4156 * We are here because the timer fired and we
4152 4157 		 * have some quota to transmit.
4153 4158 */
4154 4159 prev = NULL;
4155 4160 head = tail = mac_srs->srs_first;
4156 4161 while (mac_srs->srs_first != NULL) {
4157 4162 tail = mac_srs->srs_first;
4158 4163 mac_srs->srs_first = tail->b_next;
4159 4164 if (mac_srs->srs_first == NULL)
4160 4165 mac_srs->srs_last = NULL;
4161 4166 mac_srs->srs_count--;
4162 4167 sz = msgdsize(tail);
4163 4168 mac_srs->srs_size -= sz;
4164 4169 mac_srs->srs_bw->mac_bw_used += sz;
4165 4170 if (prev == NULL)
4166 4171 hint = (ulong_t)tail->b_prev;
4167 4172 if (hint != (ulong_t)tail->b_prev) {
4168 4173 prev->b_next = NULL;
4169 4174 mutex_exit(&mac_srs->srs_lock);
4170 4175 TX_SRS_TO_SOFT_RING(mac_srs, head, hint);
4171 4176 head = tail;
4172 4177 hint = (ulong_t)tail->b_prev;
4173 4178 mutex_enter(&mac_srs->srs_lock);
4174 4179 }
4175 4180
4176 4181 prev = tail;
4177 4182 tail->b_prev = NULL;
4178 4183 if (mac_srs->srs_bw->mac_bw_used <
4179 4184 mac_srs->srs_bw->mac_bw_limit)
4180 4185 continue;
4181 4186
4182 4187 now = ddi_get_lbolt();
4183 4188 if (mac_srs->srs_bw->mac_bw_curr_time != now) {
4184 4189 mac_srs->srs_bw->mac_bw_curr_time = now;
4185 4190 mac_srs->srs_bw->mac_bw_used = 0;
4186 4191 continue;
4187 4192 }
4188 4193 mac_srs->srs_bw->mac_bw_state |= SRS_BW_ENFORCED;
4189 4194 break;
4190 4195 }
4191 4196 ASSERT((head == NULL && tail == NULL) ||
4192 4197 (head != NULL && tail != NULL));
4193 4198 if (tail != NULL) {
4194 4199 tail->b_next = NULL;
4195 4200 mutex_exit(&mac_srs->srs_lock);
4196 4201 TX_SRS_TO_SOFT_RING(mac_srs, head, hint);
4197 4202 mutex_enter(&mac_srs->srs_lock);
4198 4203 }
4199 4204 }
4200 4205 /*
4201 4206 * SRS_TX_FANOUT case not considered here because packets
4202 4207 * won't be queued in the SRS for this case. Packets will
4203 4208 * be sent directly to soft rings underneath and if there
4204 4209 * is any queueing at all, it would be in Tx side soft
4205 4210 * rings.
4206 4211 */
4207 4212
4208 4213 /*
4209 4214 * When srs_count becomes 0, reset SRS_TX_HIWAT and
4210 4215 * SRS_TX_WAKEUP_CLIENT and wakeup registered clients.
4211 4216 */
4212 4217 if (mac_srs->srs_count == 0 && (mac_srs->srs_state &
4213 4218 (SRS_TX_HIWAT | SRS_TX_WAKEUP_CLIENT | SRS_ENQUEUED))) {
4214 4219 mac_client_impl_t *mcip = mac_srs->srs_mcip;
4215 4220 boolean_t wakeup_required = B_FALSE;
4216 4221
4217 4222 if (mac_srs->srs_state &
4218 4223 (SRS_TX_HIWAT|SRS_TX_WAKEUP_CLIENT)) {
4219 4224 wakeup_required = B_TRUE;
4220 4225 }
4221 4226 mac_srs->srs_state &= ~(SRS_TX_HIWAT |
4222 4227 SRS_TX_WAKEUP_CLIENT | SRS_ENQUEUED);
4223 4228 mutex_exit(&mac_srs->srs_lock);
4224 4229 if (wakeup_required) {
4225 4230 mac_tx_invoke_callbacks(mcip, (mac_tx_cookie_t)mac_srs);
4226 4231 /*
4227 4232 * If the client is not the primary MAC client, then we
4228 4233 * need to send the notification to the clients upper
4229 4234 * MAC, i.e. mci_upper_mip.
4230 4235 */
4231 4236 mac_tx_notify(mcip->mci_upper_mip != NULL ?
4232 4237 mcip->mci_upper_mip : mcip->mci_mip);
4233 4238 }
4234 4239 mutex_enter(&mac_srs->srs_lock);
4235 4240 }
4236 4241 mac_srs->srs_state &= ~SRS_PROC;
4237 4242 }
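
/*
 * Editor's sketch (hypothetical helper): both bandwidth branches above
 * meter usage per lbolt tick -- usage accumulates until it crosses
 * mac_bw_limit, and a new tick opens a fresh accounting window.
 * Condensed, the SRS_TX_BW check is roughly:
 */
static boolean_t
sample_bw_exceeded(mac_bw_ctl_t *bw, size_t sz)
{
	bw->mac_bw_used += sz;
	if (bw->mac_bw_used < bw->mac_bw_limit)
		return (B_FALSE);		/* still under quota */

	if (bw->mac_bw_curr_time != ddi_get_lbolt()) {
		/* A new tick: restart accounting for this window. */
		bw->mac_bw_curr_time = ddi_get_lbolt();
		bw->mac_bw_used = sz;
		return (B_FALSE);
	}
	return (B_TRUE);	/* caller sets SRS_BW_ENFORCED and stops */
}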
4238 4243
4239 4244 /*
4240 4245 * Given a packet, get the flow_entry that identifies the flow
4241 4246 * to which that packet belongs. The flow_entry will contain
4242 4247 * the transmit function to be used to send the packet. If the
4243 4248 * function returns NULL, the packet should be sent using the
4244 4249 * underlying NIC.
4245 4250 */
4246 4251 static flow_entry_t *
4247 4252 mac_tx_classify(mac_impl_t *mip, mblk_t *mp)
4248 4253 {
4249 4254 flow_entry_t *flent = NULL;
4250 4255 mac_client_impl_t *mcip;
4251 4256 int err;
4252 4257
4253 4258 /*
4254 4259 * Do classification on the packet.
4255 4260 */
4256 4261 err = mac_flow_lookup(mip->mi_flow_tab, mp, FLOW_OUTBOUND, &flent);
4257 4262 if (err != 0)
4258 4263 return (NULL);
4259 4264
4260 4265 /*
4261 4266 * This flent might just be an additional one on the MAC client,
4262 4267 	 * i.e. for classification purposes (different fdesc); however,
4263 4268 	 * the resources, SRS et al., are in the mci_flent, so if
4264 4269 * this isn't the mci_flent, we need to get it.
4265 4270 */
4266 4271 if ((mcip = flent->fe_mcip) != NULL && mcip->mci_flent != flent) {
4267 4272 FLOW_REFRELE(flent);
4268 4273 flent = mcip->mci_flent;
4269 4274 FLOW_TRY_REFHOLD(flent, err);
4270 4275 if (err != 0)
4271 4276 return (NULL);
4272 4277 }
4273 4278
4274 4279 return (flent);
4275 4280 }
4276 4281
4277 4282 /*
4278 4283 * This macro is only meant to be used by mac_tx_send().
4279 4284 */
4280 4285 #define CHECK_VID_AND_ADD_TAG(mp) { \
4281 4286 if (vid_check) { \
4282 4287 int err = 0; \
4283 4288 \
4284 4289 MAC_VID_CHECK(src_mcip, (mp), err); \
4285 4290 if (err != 0) { \
4286 4291 freemsg((mp)); \
4287 4292 (mp) = next; \
4288 4293 oerrors++; \
4289 4294 continue; \
4290 4295 } \
4291 4296 } \
4292 4297 if (add_tag) { \
4293 4298 (mp) = mac_add_vlan_tag((mp), 0, vid); \
4294 4299 if ((mp) == NULL) { \
4295 4300 (mp) = next; \
4296 4301 oerrors++; \
4297 4302 continue; \
4298 4303 } \
4299 4304 } \
4300 4305 }
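
/*
 * Editor's note (hypothetical helper, shown for illustration only):
 * the tag that mac_add_vlan_tag() inserts carries the VID in the low
 * 12 bits of the 16-bit TCI, with priority and CFI above it, similar
 * in spirit to the VLAN_TCI() macro:
 */
static uint16_t
sample_mk_tci(uint8_t pri, uint8_t cfi, uint16_t vid)
{
	return ((uint16_t)(((pri & 0x7) << 13) |
	    ((cfi & 0x1) << 12) | (vid & 0xfff)));
}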
4301 4306
4302 4307 mblk_t *
4303 4308 mac_tx_send(mac_client_handle_t mch, mac_ring_handle_t ring, mblk_t *mp_chain,
4304 4309 mac_tx_stats_t *stats)
4305 4310 {
4306 4311 mac_client_impl_t *src_mcip = (mac_client_impl_t *)mch;
4307 4312 mac_impl_t *mip = src_mcip->mci_mip;
4308 4313 uint_t obytes = 0, opackets = 0, oerrors = 0;
4309 4314 mblk_t *mp = NULL, *next;
4310 4315 boolean_t vid_check, add_tag;
4311 4316 uint16_t vid = 0;
4312 4317
4313 4318 if (mip->mi_nclients > 1) {
4314 4319 vid_check = MAC_VID_CHECK_NEEDED(src_mcip);
4315 4320 add_tag = MAC_TAG_NEEDED(src_mcip);
4316 4321 if (add_tag)
4317 4322 vid = mac_client_vid(mch);
4318 4323 } else {
4319 4324 ASSERT(mip->mi_nclients == 1);
4320 4325 vid_check = add_tag = B_FALSE;
4321 4326 }
4322 4327
4323 4328 /*
4324 4329 * Fastpath: if there's only one client, we simply send
4325 4330 * the packet down to the underlying NIC.
4326 4331 */
4327 4332 if (mip->mi_nactiveclients == 1) {
4328 4333 DTRACE_PROBE2(fastpath,
4329 4334 mac_client_impl_t *, src_mcip, mblk_t *, mp_chain);
4330 4335
4331 4336 mp = mp_chain;
4332 4337 while (mp != NULL) {
4333 4338 next = mp->b_next;
4334 4339 mp->b_next = NULL;
4335 4340 opackets++;
4336 4341 obytes += (mp->b_cont == NULL ? MBLKL(mp) :
4337 4342 msgdsize(mp));
4338 4343
4339 4344 CHECK_VID_AND_ADD_TAG(mp);
4340 4345 MAC_TX(mip, ring, mp, src_mcip);
4341 4346
4342 4347 /*
4343 4348 * If the driver is out of descriptors and does a
4344 4349 * partial send it will return a chain of unsent
4345 4350 * mblks. Adjust the accounting stats.
4346 4351 */
4347 4352 if (mp != NULL) {
4348 4353 opackets--;
4349 4354 obytes -= msgdsize(mp);
4350 4355 mp->b_next = next;
4351 4356 break;
4352 4357 }
4353 4358 mp = next;
4354 4359 }
4355 4360 goto done;
4356 4361 }
4357 4362
4358 4363 /*
4359 4364 	 * No fastpath: we either have more than one MAC client
4360 4365 	 * defined on top of the same MAC, or one or more MAC
4361 4366 	 * clients with promiscuous callbacks.
4362 4367 */
4363 4368 DTRACE_PROBE3(slowpath, mac_client_impl_t *,
4364 4369 src_mcip, int, mip->mi_nclients, mblk_t *, mp_chain);
4365 4370
4366 4371 mp = mp_chain;
4367 4372 while (mp != NULL) {
4368 4373 flow_entry_t *dst_flow_ent;
4369 4374 void *flow_cookie;
4370 4375 size_t pkt_size;
4371 4376 mblk_t *mp1;
4372 4377
4373 4378 next = mp->b_next;
4374 4379 mp->b_next = NULL;
4375 4380 opackets++;
4376 4381 pkt_size = (mp->b_cont == NULL ? MBLKL(mp) : msgdsize(mp));
4377 4382 obytes += pkt_size;
4378 4383 CHECK_VID_AND_ADD_TAG(mp);
4379 4384
4380 4385 /*
4381 4386 * Find the destination.
4382 4387 */
4383 4388 dst_flow_ent = mac_tx_classify(mip, mp);
4384 4389
4385 4390 if (dst_flow_ent != NULL) {
4386 4391 size_t hdrsize;
4387 4392 int err = 0;
4388 4393
4389 4394 if (mip->mi_info.mi_nativemedia == DL_ETHER) {
4390 4395 struct ether_vlan_header *evhp =
4391 4396 (struct ether_vlan_header *)mp->b_rptr;
4392 4397
4393 4398 if (ntohs(evhp->ether_tpid) == ETHERTYPE_VLAN)
4394 4399 hdrsize = sizeof (*evhp);
4395 4400 else
4396 4401 hdrsize = sizeof (struct ether_header);
4397 4402 } else {
4398 4403 mac_header_info_t mhi;
4399 4404
4400 4405 err = mac_header_info((mac_handle_t)mip,
4401 4406 mp, &mhi);
4402 4407 if (err == 0)
4403 4408 hdrsize = mhi.mhi_hdrsize;
4404 4409 }
4405 4410
4406 4411 /*
4407 4412 * Got a matching flow. It's either another
4408 4413 * MAC client, or a broadcast/multicast flow.
4409 4414 * Make sure the packet size is within the
4410 4415 			 * allowed size. If not, drop the packet and
4411 4416 * move to next packet.
4412 4417 */
4413 4418 if (err != 0 ||
4414 4419 (pkt_size - hdrsize) > mip->mi_sdu_max) {
4415 4420 oerrors++;
4416 4421 DTRACE_PROBE2(loopback__drop, size_t, pkt_size,
4417 4422 mblk_t *, mp);
4418 4423 freemsg(mp);
4419 4424 mp = next;
4420 4425 FLOW_REFRELE(dst_flow_ent);
4421 4426 continue;
4422 4427 }
4423 4428 flow_cookie = mac_flow_get_client_cookie(dst_flow_ent);
4424 4429 if (flow_cookie != NULL) {
4425 4430 /*
4426 4431 * The vnic_bcast_send function expects
4427 4432 * to receive the sender MAC client
4428 4433 * as value for arg2.
4429 4434 */
4430 4435 mac_bcast_send(flow_cookie, src_mcip, mp,
4431 4436 B_TRUE);
4432 4437 } else {
4433 4438 /*
4434 4439 				 * Loop back the packet to a local MAC
4435 4440 * client. We force a context switch
4436 4441 * if both source and destination MAC
4437 4442 * clients are used by IP, i.e.
4438 4443 * bypass is set.
4439 4444 */
4440 4445 boolean_t do_switch;
4441 4446 mac_client_impl_t *dst_mcip =
4442 4447 dst_flow_ent->fe_mcip;
4443 4448
4444 4449 /*
4445 4450 * Check if there are promiscuous mode
4446 4451 * callbacks defined. This check is
4447 4452 * done here in the 'else' case and
4448 4453 * not in other cases because this
4449 4454 * path is for local loopback
4450 4455 * communication which does not go
4451 4456 * through MAC_TX(). For paths that go
4452 4457 * through MAC_TX(), the promisc_list
4453 4458 * check is done inside the MAC_TX()
4454 4459 * macro.
4455 4460 */
4456 4461 if (mip->mi_promisc_list != NULL)
4457 4462 mac_promisc_dispatch(mip, mp, src_mcip);
4458 4463
4459 4464 do_switch = ((src_mcip->mci_state_flags &
4460 4465 dst_mcip->mci_state_flags &
4461 4466 MCIS_CLIENT_POLL_CAPABLE) != 0);
4462 4467
4463 4468 if ((mp1 = mac_fix_cksum(mp)) != NULL) {
4464 4469 (dst_flow_ent->fe_cb_fn)(
4465 4470 dst_flow_ent->fe_cb_arg1,
4466 4471 dst_flow_ent->fe_cb_arg2,
4467 4472 mp1, do_switch);
4468 4473 }
4469 4474 }
4470 4475 FLOW_REFRELE(dst_flow_ent);
4471 4476 } else {
4472 4477 /*
4473 4478 * Unknown destination, send via the underlying
4474 4479 * NIC.
4475 4480 */
4476 4481 MAC_TX(mip, ring, mp, src_mcip);
4477 4482 if (mp != NULL) {
4478 4483 /*
4479 4484 * Adjust for the last packet that
4480 4485 * could not be transmitted
4481 4486 */
4482 4487 opackets--;
4483 4488 obytes -= pkt_size;
4484 4489 mp->b_next = next;
4485 4490 break;
4486 4491 }
4487 4492 }
4488 4493 mp = next;
4489 4494 }
4490 4495
4491 4496 done:
4492 4497 stats->mts_obytes = obytes;
4493 4498 stats->mts_opackets = opackets;
4494 4499 stats->mts_oerrors = oerrors;
4495 4500 return (mp);
4496 4501 }
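
/*
 * Editor's note (hypothetical helper): the byte accounting above
 * avoids walking the mblk chain when it can -- a single-mblk packet
 * is measured with pointer arithmetic, while a multi-mblk packet
 * needs msgdsize():
 */
static size_t
sample_pkt_size(mblk_t *mp)
{
	return (mp->b_cont == NULL ? MBLKL(mp) : msgdsize(mp));
}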
4497 4502
4498 4503 /*
4499 4504 * mac_tx_srs_ring_present
4500 4505 *
4501 4506 * Returns whether the specified ring is part of the specified SRS.
4502 4507 */
4503 4508 boolean_t
4504 4509 mac_tx_srs_ring_present(mac_soft_ring_set_t *srs, mac_ring_t *tx_ring)
4505 4510 {
4506 4511 int i;
4507 4512 mac_soft_ring_t *soft_ring;
4508 4513
4509 4514 if (srs->srs_tx.st_arg2 == tx_ring)
4510 4515 return (B_TRUE);
4511 4516
4512 4517 for (i = 0; i < srs->srs_tx_ring_count; i++) {
4513 4518 soft_ring = srs->srs_tx_soft_rings[i];
4514 4519 if (soft_ring->s_ring_tx_arg2 == tx_ring)
4515 4520 return (B_TRUE);
4516 4521 }
4517 4522
4518 4523 return (B_FALSE);
4519 4524 }
4520 4525
4521 4526 /*
4522 4527 * mac_tx_srs_get_soft_ring
4523 4528 *
4524 4529 * Returns the TX soft ring associated with the given ring, if present.
4525 4530 */
4526 4531 mac_soft_ring_t *
4527 4532 mac_tx_srs_get_soft_ring(mac_soft_ring_set_t *srs, mac_ring_t *tx_ring)
4528 4533 {
4529 4534 int i;
4530 4535 mac_soft_ring_t *soft_ring;
4531 4536
4532 4537 if (srs->srs_tx.st_arg2 == tx_ring)
4533 4538 return (NULL);
4534 4539
4535 4540 for (i = 0; i < srs->srs_tx_ring_count; i++) {
4536 4541 soft_ring = srs->srs_tx_soft_rings[i];
4537 4542 if (soft_ring->s_ring_tx_arg2 == tx_ring)
4538 4543 return (soft_ring);
4539 4544 }
4540 4545
4541 4546 return (NULL);
4542 4547 }
4543 4548
4544 4549 /*
4545 4550 * mac_tx_srs_wakeup
4546 4551 *
4547 4552  * Called when Tx descriptors become available. Wake up the
4548 4553  * appropriate worker thread after resetting the
4549 4554  * SRS_TX_BLOCKED/S_RING_BLOCK bit in the state field.
4550 4555 */
4551 4556 void
4552 4557 mac_tx_srs_wakeup(mac_soft_ring_set_t *mac_srs, mac_ring_handle_t ring)
4553 4558 {
4554 4559 int i;
4555 4560 mac_soft_ring_t *sringp;
4556 4561 mac_srs_tx_t *srs_tx = &mac_srs->srs_tx;
4557 4562
4558 4563 mutex_enter(&mac_srs->srs_lock);
4559 4564 /*
4560 4565 * srs_tx_ring_count == 0 is the single ring mode case. In
4561 4566 	 * this mode, there will be no Tx soft rings associated
4562 4567 * with the SRS.
4563 4568 */
4564 4569 if (!MAC_TX_SOFT_RINGS(mac_srs)) {
4565 4570 if (srs_tx->st_arg2 == ring &&
4566 4571 mac_srs->srs_state & SRS_TX_BLOCKED) {
4567 4572 mac_srs->srs_state &= ~SRS_TX_BLOCKED;
4568 4573 srs_tx->st_stat.mts_unblockcnt++;
4569 4574 cv_signal(&mac_srs->srs_async);
4570 4575 }
4571 4576 /*
4572 4577 * A wakeup can come before tx_srs_drain() could
4573 4578 * grab srs lock and set SRS_TX_BLOCKED. So
4574 4579 * always set woken_up flag when we come here.
4575 4580 */
4576 4581 srs_tx->st_woken_up = B_TRUE;
4577 4582 mutex_exit(&mac_srs->srs_lock);
4578 4583 return;
4579 4584 }
4580 4585
4581 4586 /*
4582 4587 	 * If we are here, it is the FANOUT, BW_FANOUT,
4583 4588 	 * AGGR_MODE or AGGR_BW_MODE case.
4584 4589 */
4585 4590 for (i = 0; i < mac_srs->srs_tx_ring_count; i++) {
4586 4591 sringp = mac_srs->srs_tx_soft_rings[i];
4587 4592 mutex_enter(&sringp->s_ring_lock);
4588 4593 if (sringp->s_ring_tx_arg2 == ring) {
4589 4594 if (sringp->s_ring_state & S_RING_BLOCK) {
4590 4595 sringp->s_ring_state &= ~S_RING_BLOCK;
4591 4596 sringp->s_st_stat.mts_unblockcnt++;
4592 4597 cv_signal(&sringp->s_ring_async);
4593 4598 }
4594 4599 sringp->s_ring_tx_woken_up = B_TRUE;
4595 4600 }
4596 4601 mutex_exit(&sringp->s_ring_lock);
4597 4602 }
4598 4603 mutex_exit(&mac_srs->srs_lock);
4599 4604 }
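
/*
 * Editor's sketch (hypothetical driver code): the wakeup above is
 * driven from the GLDv3 driver side. When a driver reclaims Tx
 * descriptors it calls mac_tx_ring_update(), and MAC eventually
 * invokes mac_tx_srs_wakeup() for the SRSes using that ring.
 */
static void
sample_tx_reclaim_done(mac_handle_t mh, mac_ring_handle_t rh,
    uint_t descs_freed, boolean_t was_stalled)
{
	/* Only poke MAC when we actually relieved a stall. */
	if (descs_freed > 0 && was_stalled)
		mac_tx_ring_update(mh, rh);
}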
4600 4605
4601 4606 /*
4602 4607 * Once the driver is done draining, send a MAC_NOTE_TX notification to unleash
4603 4608 * the blocked clients again.
4604 4609 */
4605 4610 void
4606 4611 mac_tx_notify(mac_impl_t *mip)
4607 4612 {
4608 4613 i_mac_notify(mip, MAC_NOTE_TX);
4609 4614 }
4610 4615
4611 4616 /*
4612 4617 * RX SOFTRING RELATED FUNCTIONS
4613 4618 *
4614 4619  * These functions really belong in mac_soft_ring.c and are here
4615 4620  * only for a short period.
4616 4621 */
4617 4622
4618 4623 #define SOFT_RING_ENQUEUE_CHAIN(ringp, mp, tail, cnt, sz) { \
4619 4624 /* \
4620 4625 * Enqueue our mblk chain. \
4621 4626 */ \
4622 4627 ASSERT(MUTEX_HELD(&(ringp)->s_ring_lock)); \
4623 4628 \
4624 4629 if ((ringp)->s_ring_last != NULL) \
4625 4630 (ringp)->s_ring_last->b_next = (mp); \
4626 4631 else \
4627 4632 (ringp)->s_ring_first = (mp); \
4628 4633 (ringp)->s_ring_last = (tail); \
4629 4634 (ringp)->s_ring_count += (cnt); \
4630 4635 ASSERT((ringp)->s_ring_count > 0); \
4631 4636 if ((ringp)->s_ring_type & ST_RING_BW_CTL) { \
4632 4637 (ringp)->s_ring_size += sz; \
4633 4638 } \
4634 4639 }
4635 4640
4636 4641 /*
4637 4642 * Default entry point to deliver a packet chain to a MAC client.
4638 4643 * If the MAC client has flows, do the classification with these
4639 4644 * flows as well.
4640 4645 */
4641 4646 /* ARGSUSED */
4642 4647 void
4643 4648 mac_rx_deliver(void *arg1, mac_resource_handle_t mrh, mblk_t *mp_chain,
4644 4649 mac_header_info_t *arg3)
4645 4650 {
4646 4651 mac_client_impl_t *mcip = arg1;
4647 4652
4648 4653 if (mcip->mci_nvids == 1 &&
4649 4654 !(mcip->mci_state_flags & MCIS_STRIP_DISABLE)) {
4650 4655 /*
4651 4656 * If the client has exactly one VID associated with it
4652 4657 * and striping of VLAN header is not disabled,
4653 4658 * remove the VLAN tag from the packet before
4654 4659 * passing it on to the client's receive callback.
4655 4660 * Note that this needs to be done after we dispatch
4656 4661 * the packet to the promiscuous listeners of the
4657 4662 * client, since they expect to see the whole
4658 4663 * frame including the VLAN headers.
4664 + *
4665 + * The MCIS_STRIP_DISABLE is only issued when sun4v
4666 + * vsw is in play.
4659 4667 */
4660 4668 mp_chain = mac_strip_vlan_tag_chain(mp_chain);
4661 4669 }
4662 4670
4663 4671 mcip->mci_rx_fn(mcip->mci_rx_arg, mrh, mp_chain, B_FALSE);
4664 4672 }
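
/*
 * Editor's note (hypothetical helper): the strip decision above only
 * matters for tagged frames. A frame is recognized as tagged when the
 * TPID field of its ether_vlan_header holds ETHERTYPE_VLAN:
 */
static boolean_t
sample_is_tagged(mblk_t *mp)
{
	struct ether_vlan_header *evhp =
	    (struct ether_vlan_header *)mp->b_rptr;

	/* Need the full VLAN header in the first mblk to inspect it. */
	return (MBLKL(mp) >= sizeof (*evhp) &&
	    ntohs(evhp->ether_tpid) == ETHERTYPE_VLAN);
}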
4665 4673
4666 4674 /*
4667 - * mac_rx_soft_ring_process
4675 + * Process a chain for a given soft ring. If the number of packets
4676 + * queued in the SRS and its associated soft rings (including this
4677 + * one) is very small (tracked by srs_poll_pkt_cnt) then allow the
4678 + * entering thread (interrupt or poll thread) to process the chain
4679 + * inline. This is meant to reduce latency under low load.
4668 4680 *
4669 - * process a chain for a given soft ring. The number of packets queued
4670 - * in the SRS and its associated soft rings (including this one) is
4671 - * very small (tracked by srs_poll_pkt_cnt), then allow the entering
4672 - * thread (interrupt or poll thread) to do inline processing. This
4673 - * helps keep the latency down under low load.
4674 - *
4675 4681 * The proc and arg for each mblk is already stored in the mblk in
4676 4682 * appropriate places.
4677 4683 */
4678 4684 /* ARGSUSED */
4679 4685 void
4680 4686 mac_rx_soft_ring_process(mac_client_impl_t *mcip, mac_soft_ring_t *ringp,
4681 4687 mblk_t *mp_chain, mblk_t *tail, int cnt, size_t sz)
4682 4688 {
4683 4689 mac_direct_rx_t proc;
4684 4690 void *arg1;
4685 4691 mac_resource_handle_t arg2;
4686 4692 mac_soft_ring_set_t *mac_srs = ringp->s_ring_set;
4687 4693
4688 4694 ASSERT(ringp != NULL);
4689 4695 ASSERT(mp_chain != NULL);
4690 4696 ASSERT(tail != NULL);
4691 4697 ASSERT(MUTEX_NOT_HELD(&ringp->s_ring_lock));
4692 4698
4693 4699 mutex_enter(&ringp->s_ring_lock);
4694 4700 ringp->s_ring_total_inpkt += cnt;
4695 4701 ringp->s_ring_total_rbytes += sz;
4696 4702 if ((mac_srs->srs_rx.sr_poll_pkt_cnt <= 1) &&
4697 4703 !(ringp->s_ring_type & ST_RING_WORKER_ONLY)) {
4698 4704 		/* If already processing or blanking is on, enqueue and return */
4699 4705 if (ringp->s_ring_state & S_RING_BLANK ||
4700 4706 ringp->s_ring_state & S_RING_PROC) {
4701 4707 SOFT_RING_ENQUEUE_CHAIN(ringp, mp_chain, tail, cnt, sz);
4702 4708 mutex_exit(&ringp->s_ring_lock);
4703 4709 return;
4704 4710 }
4705 4711 proc = ringp->s_ring_rx_func;
4706 4712 arg1 = ringp->s_ring_rx_arg1;
4707 4713 arg2 = ringp->s_ring_rx_arg2;
4708 4714 /*
4709 4715 * See if anything is already queued. If we are the
4710 4716 * first packet, do inline processing else queue the
4711 4717 * packet and do the drain.
4712 4718 */
4713 4719 if (ringp->s_ring_first == NULL) {
4714 4720 /*
4715 4721 * Fast-path, ok to process and nothing queued.
4716 4722 */
4717 4723 ringp->s_ring_run = curthread;
4718 4724 ringp->s_ring_state |= (S_RING_PROC);
4719 4725
4720 4726 mutex_exit(&ringp->s_ring_lock);
4721 4727
4722 4728 /*
4723 4729 * We are the chain of 1 packet so
4724 4730 * go through this fast path.
4725 4731 */
4726 4732 ASSERT(mp_chain->b_next == NULL);
4727 4733
4728 4734 (*proc)(arg1, arg2, mp_chain, NULL);
4729 4735
4730 4736 ASSERT(MUTEX_NOT_HELD(&ringp->s_ring_lock));
4731 4737 /*
4732 - * If we have a soft ring set which is doing
4733 - * bandwidth control, we need to decrement
4734 - * srs_size and count so it the SRS can have a
4735 - * accurate idea of what is the real data
4736 - * queued between SRS and its soft rings. We
4737 - * decrement the counters only when the packet
4738 - * gets processed by both SRS and the soft ring.
4738 + * If we have an SRS performing bandwidth
4739 + * control then we need to decrement the size
4740 + * and count so the SRS has an accurate count
4741 + * of the data queued between the SRS and its
4742 + * soft rings. We decrement the counters only
4743 + * when the packet is processed by both the
4744 + * SRS and the soft ring.
4739 4745 */
4740 4746 mutex_enter(&mac_srs->srs_lock);
4741 4747 MAC_UPDATE_SRS_COUNT_LOCKED(mac_srs, cnt);
4742 4748 MAC_UPDATE_SRS_SIZE_LOCKED(mac_srs, sz);
4743 4749 mutex_exit(&mac_srs->srs_lock);
4744 4750
4745 4751 mutex_enter(&ringp->s_ring_lock);
4746 4752 ringp->s_ring_run = NULL;
4747 4753 ringp->s_ring_state &= ~S_RING_PROC;
4748 4754 if (ringp->s_ring_state & S_RING_CLIENT_WAIT)
4749 4755 cv_signal(&ringp->s_ring_client_cv);
4750 4756
4751 4757 if ((ringp->s_ring_first == NULL) ||
4752 4758 (ringp->s_ring_state & S_RING_BLANK)) {
4753 4759 /*
4754 - * We processed inline our packet and
4755 - * nothing new has arrived or our
4760 + * We processed a single packet inline
4761 + * and nothing new has arrived or our
4756 4762 * receiver doesn't want to receive
4757 4763 * any packets. We are done.
4758 4764 */
4759 4765 mutex_exit(&ringp->s_ring_lock);
4760 4766 return;
4761 4767 }
4762 4768 } else {
4763 4769 SOFT_RING_ENQUEUE_CHAIN(ringp,
4764 4770 mp_chain, tail, cnt, sz);
4765 4771 }
4766 4772
4767 4773 /*
4768 4774 * We are here because either we couldn't do inline
4769 4775 * processing (because something was already
4770 4776 * queued), or we had a chain of more than one
4771 4777 * packet, or something else arrived after we were
4772 4778 * done with inline processing.
4773 4779 */
4774 4780 ASSERT(MUTEX_HELD(&ringp->s_ring_lock));
4775 4781 ASSERT(ringp->s_ring_first != NULL);
4776 4782
4777 4783 ringp->s_ring_drain_func(ringp);
4778 4784 mutex_exit(&ringp->s_ring_lock);
4779 4785 return;
4780 4786 } else {
4781 4787 /* ST_RING_WORKER_ONLY case */
4782 4788 SOFT_RING_ENQUEUE_CHAIN(ringp, mp_chain, tail, cnt, sz);
4783 4789 mac_soft_ring_worker_wakeup(ringp);
4784 4790 mutex_exit(&ringp->s_ring_lock);
4785 4791 }
4786 4792 }
4787 4793
4788 4794 /*
4789 4795 * TX SOFTRING RELATED FUNCTIONS
4790 4796 *
4791 4797  * These functions really belong in mac_soft_ring.c and are here
4792 4798  * only for a short period.
4793 4799 */
4794 4800
4795 4801 #define TX_SOFT_RING_ENQUEUE_CHAIN(ringp, mp, tail, cnt, sz) { \
4796 4802 ASSERT(MUTEX_HELD(&ringp->s_ring_lock)); \
4797 4803 ringp->s_ring_state |= S_RING_ENQUEUED; \
4798 4804 	SOFT_RING_ENQUEUE_CHAIN(ringp, mp, tail, cnt, sz);		\
4799 4805 }
4800 4806
4801 4807 /*
4802 4808 * mac_tx_sring_queued
4803 4809 *
4804 4810 * When we are out of transmit descriptors and we already have a
4805 4811 * queue that exceeds hiwat (or the client called us with
4806 4812  * MAC_TX_NO_ENQUEUE or MAC_DROP_ON_NO_DESC flag), return the
4807 4813  * soft ring pointer as the opaque cookie so that the client can
4808 4814  * enable flow control.
4809 4815 */
4810 4816 static mac_tx_cookie_t
4811 4817 mac_tx_sring_enqueue(mac_soft_ring_t *ringp, mblk_t *mp_chain, uint16_t flag,
4812 4818 mblk_t **ret_mp)
4813 4819 {
4814 4820 int cnt;
4815 4821 size_t sz;
4816 4822 mblk_t *tail;
4817 4823 mac_soft_ring_set_t *mac_srs = ringp->s_ring_set;
4818 4824 mac_tx_cookie_t cookie = 0;
4819 4825 boolean_t wakeup_worker = B_TRUE;
4820 4826
4821 4827 ASSERT(MUTEX_HELD(&ringp->s_ring_lock));
4822 4828 MAC_COUNT_CHAIN(mac_srs, mp_chain, tail, cnt, sz);
4823 4829 if (flag & MAC_DROP_ON_NO_DESC) {
4824 4830 mac_pkt_drop(NULL, NULL, mp_chain, B_FALSE);
4825 4831 /* increment freed stats */
4826 4832 ringp->s_ring_drops += cnt;
4827 4833 cookie = (mac_tx_cookie_t)ringp;
4828 4834 } else {
4829 4835 if (ringp->s_ring_first != NULL)
4830 4836 wakeup_worker = B_FALSE;
4831 4837
4832 4838 if (flag & MAC_TX_NO_ENQUEUE) {
4833 4839 /*
4834 4840 * If QUEUED is not set, queue the packet
4835 4841 * and let mac_tx_soft_ring_drain() set
4836 4842 * the TX_BLOCKED bit for the reasons
4837 4843 * explained above. Otherwise, return the
4838 4844 * mblks.
4839 4845 */
4840 4846 if (wakeup_worker) {
4841 4847 TX_SOFT_RING_ENQUEUE_CHAIN(ringp,
4842 4848 mp_chain, tail, cnt, sz);
4843 4849 } else {
4844 4850 ringp->s_ring_state |= S_RING_WAKEUP_CLIENT;
4845 4851 cookie = (mac_tx_cookie_t)ringp;
4846 4852 *ret_mp = mp_chain;
4847 4853 }
4848 4854 } else {
4849 4855 boolean_t enqueue = B_TRUE;
4850 4856
4851 4857 if (ringp->s_ring_count > ringp->s_ring_tx_hiwat) {
4852 4858 /*
4853 4859 * flow-controlled. Store ringp in cookie
4854 4860 * so that it can be returned as
4855 4861 * mac_tx_cookie_t to client
4856 4862 */
4857 4863 ringp->s_ring_state |= S_RING_TX_HIWAT;
4858 4864 cookie = (mac_tx_cookie_t)ringp;
4859 4865 ringp->s_ring_hiwat_cnt++;
4860 4866 if (ringp->s_ring_count >
4861 4867 ringp->s_ring_tx_max_q_cnt) {
4862 4868 /* increment freed stats */
4863 4869 ringp->s_ring_drops += cnt;
4864 4870 /*
4865 4871 * b_prev may be set to the fanout hint
4866 4872 * hence can't use freemsg directly
4867 4873 */
4868 4874 mac_pkt_drop(NULL, NULL,
4869 4875 mp_chain, B_FALSE);
4870 4876 DTRACE_PROBE1(tx_queued_hiwat,
4871 4877 mac_soft_ring_t *, ringp);
4872 4878 enqueue = B_FALSE;
4873 4879 }
4874 4880 }
4875 4881 if (enqueue) {
4876 4882 TX_SOFT_RING_ENQUEUE_CHAIN(ringp, mp_chain,
4877 4883 tail, cnt, sz);
4878 4884 }
4879 4885 }
4880 4886 if (wakeup_worker)
4881 4887 cv_signal(&ringp->s_ring_async);
4882 4888 }
4883 4889 return (cookie);
4884 4890 }
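
/*
 * Editor's sketch (hypothetical caller, simplified): to the client
 * the returned cookie is opaque -- nonzero means this send path is
 * flow-controlled. A sender typically stashes the cookie and stops
 * transmitting until a Tx-notify callback fires with a matching
 * cookie (see mac_tx_invoke_callbacks() above). sample_block_on()
 * is a hypothetical stand-in for that client-side bookkeeping.
 */
static void sample_block_on(mac_tx_cookie_t, mblk_t *);

static void
sample_send(mac_client_handle_t mch, mblk_t *chain, uintptr_t hint)
{
	mblk_t *unsent = NULL;
	mac_tx_cookie_t cookie;

	cookie = mac_tx(mch, chain, hint, MAC_TX_NO_ENQUEUE, &unsent);
	if (cookie != 0)
		sample_block_on(cookie, unsent);
}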
4885 4891
4886 4892
4887 4893 /*
4888 4894 * mac_tx_soft_ring_process
4889 4895 *
4890 4896 * This routine is called when fanning out outgoing traffic among
4891 4897  * multiple Tx rings.
4892 4898 * Note that a soft ring is associated with a h/w Tx ring.
4893 4899 */
4894 4900 mac_tx_cookie_t
4895 4901 mac_tx_soft_ring_process(mac_soft_ring_t *ringp, mblk_t *mp_chain,
4896 4902 uint16_t flag, mblk_t **ret_mp)
4897 4903 {
4898 4904 mac_soft_ring_set_t *mac_srs = ringp->s_ring_set;
4899 4905 int cnt;
4900 4906 size_t sz;
4901 4907 mblk_t *tail;
4902 4908 mac_tx_cookie_t cookie = 0;
4903 4909
4904 4910 ASSERT(ringp != NULL);
4905 4911 ASSERT(mp_chain != NULL);
4906 4912 ASSERT(MUTEX_NOT_HELD(&ringp->s_ring_lock));
4907 4913 /*
4908 4914 * The following modes can come here: SRS_TX_BW_FANOUT,
4909 4915 * SRS_TX_FANOUT, SRS_TX_AGGR, SRS_TX_BW_AGGR.
4910 4916 */
4911 4917 ASSERT(MAC_TX_SOFT_RINGS(mac_srs));
4912 4918 ASSERT(mac_srs->srs_tx.st_mode == SRS_TX_FANOUT ||
4913 4919 mac_srs->srs_tx.st_mode == SRS_TX_BW_FANOUT ||
4914 4920 mac_srs->srs_tx.st_mode == SRS_TX_AGGR ||
4915 4921 mac_srs->srs_tx.st_mode == SRS_TX_BW_AGGR);
4916 4922
4917 4923 if (ringp->s_ring_type & ST_RING_WORKER_ONLY) {
4918 4924 /* Serialization mode */
4919 4925
4920 4926 mutex_enter(&ringp->s_ring_lock);
4921 4927 if (ringp->s_ring_count > ringp->s_ring_tx_hiwat) {
4922 4928 cookie = mac_tx_sring_enqueue(ringp, mp_chain,
4923 4929 flag, ret_mp);
4924 4930 mutex_exit(&ringp->s_ring_lock);
4925 4931 return (cookie);
4926 4932 }
4927 4933 MAC_COUNT_CHAIN(mac_srs, mp_chain, tail, cnt, sz);
4928 4934 TX_SOFT_RING_ENQUEUE_CHAIN(ringp, mp_chain, tail, cnt, sz);
4929 4935 if (ringp->s_ring_state & (S_RING_BLOCK | S_RING_PROC)) {
4930 4936 /*
4931 4937 * If ring is blocked due to lack of Tx
4932 4938 * descs, just return. Worker thread
4933 4939 * will get scheduled when Tx desc's
4934 4940 * become available.
4935 4941 */
4936 4942 mutex_exit(&ringp->s_ring_lock);
4937 4943 return (cookie);
4938 4944 }
4939 4945 mac_soft_ring_worker_wakeup(ringp);
4940 4946 mutex_exit(&ringp->s_ring_lock);
4941 4947 return (cookie);
4942 4948 } else {
4943 4949 /* Default fanout mode */
4944 4950 /*
4945 4951 		 * S_RING_BLOCK is set when the underlying NIC runs
4946 4952 * out of Tx descs and messages start getting
4947 4953 * queued. It won't get reset until
4948 4954 * tx_srs_drain() completely drains out the
4949 4955 * messages.
4950 4956 */
4951 4957 mac_tx_stats_t stats;
4952 4958
4953 4959 if (ringp->s_ring_state & S_RING_ENQUEUED) {
4954 4960 /* Tx descs/resources not available */
4955 4961 mutex_enter(&ringp->s_ring_lock);
4956 4962 if (ringp->s_ring_state & S_RING_ENQUEUED) {
4957 4963 cookie = mac_tx_sring_enqueue(ringp, mp_chain,
4958 4964 flag, ret_mp);
4959 4965 mutex_exit(&ringp->s_ring_lock);
4960 4966 return (cookie);
4961 4967 }
4962 4968 /*
4963 4969 * While we were computing mblk count, the
4964 4970 * flow control condition got relieved.
4965 4971 * Continue with the transmission.
4966 4972 */
4967 4973 mutex_exit(&ringp->s_ring_lock);
4968 4974 }
4969 4975
4970 4976 mp_chain = mac_tx_send(ringp->s_ring_tx_arg1,
4971 4977 ringp->s_ring_tx_arg2, mp_chain, &stats);
4972 4978
4973 4979 /*
4974 4980 * Multiple threads could be here sending packets.
4975 4981 		 * Under such conditions, it is not possible to
4976 4982 		 * atomically set the S_RING_BLOCK bit to indicate an
4977 4983 		 * out-of-tx-desc condition. To set it atomically,
4978 4984 		 * we queue the returned packet and do
4979 4985 		 * the setting of S_RING_BLOCK in
4980 4986 		 * mac_tx_soft_ring_drain().
4981 4987 */
4982 4988 if (mp_chain != NULL) {
4983 4989 mutex_enter(&ringp->s_ring_lock);
4984 4990 cookie =
4985 4991 mac_tx_sring_enqueue(ringp, mp_chain, flag, ret_mp);
4986 4992 mutex_exit(&ringp->s_ring_lock);
4987 4993 return (cookie);
4988 4994 }
4989 4995 SRS_TX_STATS_UPDATE(mac_srs, &stats);
4990 4996 SOFTRING_TX_STATS_UPDATE(ringp, &stats);
4991 4997
4992 4998 return (0);
4993 4999 }
4994 5000 }
229 lines elided