[PATCH v14 00/10] Use true MAC address of LAN local remote hosts
Bug #120 asks us to use the true MAC addresses of LAN local remote hosts, since some programs need this information. These commits introduces this for ARP, NDP, UDP, TCP and ICMP. --- v3: Updated according to feedback from Stefano and David: - Made the ARP/NDP lookup call filter out the requested address by itself, qualified by the index if the template interface - Moved the flow specific MAC address from struct flowside to struct flow_common. v4: - Updated according to feedback from David and Stefan - Added a cache table for ARP/NDP table contents v5: - Updated according to feedback from David and Stefan - Added cache table entries to FIFO/LRU queue - New criteria for when to consult ARP/NDP v6: - Simplified and merged mac cache table commits - Other changes after feedback from David. v7: - Fixes in patch #2 based on feedback from David and Stefano. v8: - Redesigned netlink and cache table part to be based on a subscription model. v8: - Small fix to patch #2 so that we cover the case when a MAC addess for a host has changed. - Added a commit where we send a gratuitous ARP/ unsolicitated NA to the guest when a new host is added to the neighbour cache table. v10: - Some fixes after feedback from David Gibson - Reordered: Moved patch #9 to position #3. - Added synchronization step between ARP/NDP table contents and the neigbour table at initialization. This reduces the number of "false" ARP/NDP replies drastically, but not completly. - (Next step could be to scan over the flow table and update affeced entries when we receive a MAC address update.) v11: - Corrected the gratuitous ARP implementation to use the "ARP Announcement" model instead of the "Gratuitous ARP reply" model. v12: - Updated based on feedback from David and Stefano - Added special handling of default GW and loopback addresses. v13: - Updated based on discussion with David and Stefano - Conceptually moved to only considering guest-side visible addresss. A lot of things became simpler and clearer through this change. Thank you, David. - Introduced a 'permanent' flag in the special entries representing addessed mapping to own host and conditionally the guest gw. This flag indicates those entries cannot be altered by possible remote hosts shadowed by these addresses. Suggested by Stefano. - Reordered patch ##4 and 5, since #5 cannot work correctly for NDP unsolicited NA until #4 is in place. - Added a new commit #2 to get later access to the flag no_map_gw. It was wrong to call fwd_neigh_table_init() from inside conf(), it has to be done in main() after random_init() and tap_backend_init(). v14: - Some fixes after feedback from David Gibson, notably - Moved the call to nat_inbound() from fwd.c to netlink.c - Added RFC quotes to explain the format of ARP announce messages. Jon Maloy (10): netlink: add subscription on changes in NDP/ARP table passt: add no_map_gw flag to struct ctx fwd: Add cache table for ARP/NDP contents arp/ndp: respond with true MAC address of LAN local remote hosts arp/ndp: send ARP announcement / unsolicited NA when neigbour entry added flow: add MAC address of LAN local remote hosts to flow udp: forward external source MAC address through tap interface tcp: forward external source MAC address through tap interface tap: change signature of function tap_push_l2h() icmp: let icmp use mac address from flowside structure arp.c | 58 ++++++++++++- arp.h | 2 + conf.c | 10 +-- epoll_type.h | 2 + flow.c | 2 + flow.h | 2 + fwd.c | 227 +++++++++++++++++++++++++++++++++++++++++++++++++ fwd.h | 7 ++ icmp.c | 8 +- inany.c | 1 + ndp.c | 16 +++- ndp.h | 1 + netlink.c | 215 +++++++++++++++++++++++++++++++++++++++++++++- netlink.h | 4 + passt.c | 17 ++-- passt.h | 4 +- pasta.c | 2 +- tap.c | 24 +++--- tap.h | 7 +- tcp.c | 20 ++++- tcp.h | 2 +- tcp_buf.c | 37 ++++---- tcp_internal.h | 4 +- tcp_vu.c | 5 +- udp.c | 57 ++++++++----- udp.h | 2 +- util.h | 2 + 27 files changed, 650 insertions(+), 88 deletions(-) -- 2.50.1
The solution to bug https://bugs.passt.top/show_bug.cgi?id=120
requires the ability to translate from an IP address to its
corresponding MAC address in cases where those are present in
the ARP or NDP tables.
To keep track of the contents of these tables we add a netlink
based neighbour subscription feature.
Signed-off-by: Jon Maloy
We add a cache table to keep track of the contents of the kernel ARP
and NDP tables. The table is fed from the just introduced netlink based
neigbour subscription function.
Signed-off-by: Jon Maloy
Later in this series we will introduce a new initialization function
which needs access to the 'no_map_gw' flag. This flag is currently
a local variable inside the conf() call, where it is too early to
call the new function. The new function needs both ctx->hash_secret
and a valid ctx->fd_tap to be set, neither of which are initialized
at that point.
We therefore add the 'no_map_gw' flag to struct ctx, so it can be
carried along and used by later functions.
Signed-off-by: Jon Maloy
When we receive an ARP request or NDP neigbour solicitation over
the tap interface for a host on the local network segment attached
to the template interface, we respond with that host's real MAC
address, if available.
Signed-off-by: Jon Maloy
ARP announcements and unsolicited NAs should be handled with caution
because of the risk of malignant users emitting them to disturb
network communication.
There is however one case we where we know it is legitimate
and safe for us to send out such messages: The one time we switch
from using ctx->own_tap_mac to a MAC address received via the
recently added neigbour subscription function. Later changes to
the MAC address of a host in an existing entry cannot be fully
trusted, so we abstain from doing it in such cases.
When sending this type of messages, we notice that the guest accepts
the update, but shortly later asks for a confirmation in the form of
a regular ARP/NS request. This is responded to with the new value,
and we have exactly the effect we wanted.
This commit adds this functionality.
Signed-off-by: Jon Maloy
When communicating with remote hosts on the local network, some guest
applications want to see the real MAC address of that host instead
of PASST/PASTA's own tap address. The flow_common structure is a
convenient location for storing that address, so we do that in this
commit.
Note that we don´t add actual usage of this address here, that will
be done in later commits.
Signed-off-by: Jon Maloy
In the next commit it must be possible for the callers of function
tap_push_l2h() to specify which source MAC address should be
added to the ethernet header sent over the tap interface. As a
preparation, we now add a new argument to that function, still
without any logical changes.
Signed-off-by: Jon Maloy
We forward the incoming MAC address through the tap interface when
receiving incoming packets from network local hosts.
This is a part of the solution to bug
https://bugs.passt.top/show_bug.cgi?id=120
Signed-off-by: Jon Maloy
Even ICMP needs to be updated to use the external MAC address instead
of just the own tap address when applicable. We do that here.
Signed-off-by: Jon Maloy
We forward the incoming mac address through the tap interface when
receiving incoming packets from network local hosts.
This is a part of the solution to bug
https://bugs.passt.top/show_bug.cgi?id=120
Signed-off-by: Jon Maloy
On Tue, Oct 14, 2025 at 10:55:12PM -0400, Jon Maloy wrote:
The solution to bug https://bugs.passt.top/show_bug.cgi?id=120 requires the ability to translate from an IP address to its corresponding MAC address in cases where those are present in the ARP or NDP tables.
To keep track of the contents of these tables we add a netlink based neighbour subscription feature.
Signed-off-by: Jon Maloy
Reviewed-by: David Gibson
--- v3: - Added an attribute contianing NDA_DST to sent message, so that we let the kernel do the filtering of the IP address and return only one entry. - Added interface index to the call signature. Since the only interface we know is the template interface, this limits the number of hosts that will be seen as 'network segment local' from a PASST viewpoint. v4: - Made loop independent of attribute order. - Ignoring L2 addresses which are not of size ETH_ALEN. v5: - Changed return value of new function, so caller can know if a MAC address really was found. v6: - Removed warning printout which had ended up in the wrong commit. v8: - Changed to neighbour event subscription model - netlink: arp/ndp table subscription v10:- Updated according to David's latest comments on v8 - Added functionaly where we initially read current state of ARP/NDP tables v12:- Updates based on feedback from David and Stefano v13:- Updates based on feedback from David and Stefano v14:- Updates based on feedback from David --- epoll_type.h | 2 + netlink.c | 209 ++++++++++++++++++++++++++++++++++++++++++++++++++- netlink.h | 4 + passt.c | 7 ++ 4 files changed, 219 insertions(+), 3 deletions(-)
diff --git a/epoll_type.h b/epoll_type.h index 12ac59b..a90ffb6 100644 --- a/epoll_type.h +++ b/epoll_type.h @@ -44,6 +44,8 @@ enum epoll_type { EPOLL_TYPE_REPAIR_LISTEN, /* TCP_REPAIR helper socket */ EPOLL_TYPE_REPAIR, + /* Netlink neighbour subscription socket */ + EPOLL_TYPE_NL_NEIGH,
EPOLL_NUM_TYPES, }; diff --git a/netlink.c b/netlink.c index 9fe70f2..186383c 100644 --- a/netlink.c +++ b/netlink.c @@ -26,6 +26,7 @@ #include
#include #include +#include #include
#include @@ -40,6 +41,10 @@ #define RTNH_NEXT_AND_DEC(rtnh, attrlen) \ ((attrlen) -= RTNH_ALIGN((rtnh)->rtnh_len), RTNH_NEXT(rtnh)) +/* Convenience macro borrowed from kernel */ +#define NUD_VALID \ + (NUD_PERMANENT | NUD_NOARP | NUD_REACHABLE | NUD_PROBE | NUD_STALE) + /* Netlink expects a buffer of at least 8kiB or the system page size, * whichever is larger. 32kiB is recommended for more efficient. * Since the largest page size on any remotely common Linux setup is @@ -50,9 +55,10 @@ #define NLBUFSIZ 65536
/* Socket in init, in target namespace, sequence (just needs to be monotonic) */ -int nl_sock = -1; -int nl_sock_ns = -1; -static int nl_seq = 1; +int nl_sock = -1; +int nl_sock_ns = -1; +static int nl_sock_neigh = -1; +static int nl_seq = 1;
/** * nl_sock_init_do() - Set up netlink sockets in init or target namespace @@ -1103,3 +1109,200 @@ int nl_link_set_flags(int s, unsigned int ifi,
return nl_do(s, &req, RTM_NEWLINK, 0, sizeof(req)); } + +/** + * nl_neigh_msg_read() - Interpret a neigbour state message from netlink + * @c: Execution context + * @nh: Message to be read + */ +static void nl_neigh_msg_read(const struct ctx *c, struct nlmsghdr *nh) +{ + struct ndmsg *ndm = NLMSG_DATA(nh); + struct rtattr *rta = (struct rtattr *)(ndm + 1); + size_t na = NLMSG_PAYLOAD(nh, sizeof(*ndm)); + char ip_str[INET6_ADDRSTRLEN]; + char mac_str[ETH_ADDRSTRLEN]; + const uint8_t *lladdr = NULL; + const void *dst = NULL; + size_t lladdr_len = 0; + uint8_t mac[ETH_ALEN]; + union inany_addr addr; + size_t dstlen = 0; + + if (nh->nlmsg_type == NLMSG_DONE) + return; + + if (nh->nlmsg_type == NLMSG_ERROR) { + struct nlmsgerr *errmsg = (struct nlmsgerr *)ndm; + + warn("neigh_msg_read() failed, error %i", errmsg->error); + return; + } + + if (nh->nlmsg_type != RTM_NEWNEIGH && + nh->nlmsg_type != RTM_DELNEIGH) + return; + + for (; RTA_OK(rta, na); rta = RTA_NEXT(rta, na)) { + switch (rta->rta_type) { + case NDA_DST: + dst = RTA_DATA(rta); + dstlen = RTA_PAYLOAD(rta); + break; + case NDA_LLADDR: + lladdr = RTA_DATA(rta); + lladdr_len = RTA_PAYLOAD(rta); + break; + default: + break; + } + } + + if (!dst) + return; + + if (ndm->ndm_family == AF_INET && + ndm->ndm_ifindex != c->ifi4) + return; + + if (ndm->ndm_family == AF_INET6 && + ndm->ndm_ifindex != c->ifi6) + return; + + if (ndm->ndm_family != AF_INET && + ndm->ndm_family != AF_INET6) + return; + + if (ndm->ndm_family == AF_INET && + dstlen != sizeof(struct in_addr)) { + warn("netlink: wrong address length in AF_INET notification"); + return; + } + if (ndm->ndm_family == AF_INET6 && + dstlen != sizeof(struct in6_addr)) { + warn("netlink: wrong address length in AF_INET6 notification"); + return; + } + inany_from_af(&addr, ndm->ndm_family, dst); + inany_ntop(dst, ip_str, sizeof(ip_str)); + + if (nh->nlmsg_type == RTM_DELNEIGH) { + trace("neigh table delete: %s", ip_str); + return; + } + if (!(ndm->ndm_state & NUD_VALID)) { + trace("neigh table: invalid state for %s", ip_str); + return; + } + if (nh->nlmsg_type != RTM_NEWNEIGH || !lladdr) { + warn("netlink: notification with illegal msg for %s", ip_str); + return; + } + if (lladdr_len != ETH_ALEN || ndm->ndm_type != ARPHRD_ETHER) + return; + + memcpy(mac, lladdr, ETH_ALEN); + eth_ntop(mac, mac_str, sizeof(mac_str)); + trace("neigh table update: %s / %s", ip_str, mac_str); +} + +/** + * nl_neigh_sync() - Read current contents ARP/NDP tables + * @c: Execution context + * @proto: Protocol, AF_INET or AF_INET6 + * @ifi: Interface index + */ +static void nl_neigh_sync(const struct ctx *c, int proto, int ifi) +{ + struct { + struct nlmsghdr nlh; + struct ndmsg ndm; + } req = { + .nlh = {0}, + .ndm.ndm_family = proto, + .ndm.ndm_ifindex = ifi, + }; + struct nlmsghdr *nh; + char buf[NLBUFSIZ]; + ssize_t status; + uint32_t seq; + + seq = nl_send(nl_sock_neigh, &req, RTM_GETNEIGH, + NLM_F_DUMP, sizeof(req)); + nl_foreach_oftype(nh, status, nl_sock_neigh, buf, seq, RTM_NEWNEIGH) + nl_neigh_msg_read(c, nh); + if (status < 0) + warn("netlink: RTM_GETNEIGH failed: %s", strerror_(-status)); +} + +/** + * nl_neigh_notify_handler() - Non-blocking drain of pending neighbour updates + * @c: Execution context + */ +void nl_neigh_notify_handler(const struct ctx *c) +{ + char buf[NLBUFSIZ]; + + for (;;) { + ssize_t n = recv(nl_sock_neigh, buf, sizeof(buf), MSG_DONTWAIT); + struct nlmsghdr *nh = (struct nlmsghdr *)buf; + + if (n < 0) { + if (errno != EAGAIN && errno != EINTR) + warn_perror("nl_neigh_notify sock read error"); + return; + } + for (; NLMSG_OK(nh, n); nh = NLMSG_NEXT(nh, n)) + nl_neigh_msg_read(c, nh); + } +} + +/** + * nl_neigh_notify_init() - Subscribe to neighbour events + * @c: Execution context + * + * Return: 0 on success, -1 on failure + */ +int nl_neigh_notify_init(const struct ctx *c) +{ + union epoll_ref ref = { + .type = EPOLL_TYPE_NL_NEIGH + }; + struct epoll_event ev = { + .events = EPOLLIN + }; + struct sockaddr_nl addr = { + .nl_family = AF_NETLINK, + .nl_groups = RTMGRP_NEIGH, + }; + + if (nl_sock_neigh >= 0) { + warn("RTMGRP_NEIGH already exists"); + return 0; + } + nl_sock_neigh = socket(AF_NETLINK, SOCK_RAW | SOCK_CLOEXEC, + NETLINK_ROUTE); + if (nl_sock_neigh < 0) { + warn("Failed create RTMGRP_NEIGH socket"); + return -1; + } + if (bind(nl_sock_neigh, (struct sockaddr *)&addr, sizeof(addr)) < 0) { + close(nl_sock_neigh); + nl_sock_neigh = -1; + warn("Failed bind RTMGRP_NEIGH socket"); + return -1; + } + + ev.data.u64 = ref.u64; + if (epoll_ctl(c->epollfd, EPOLL_CTL_ADD, nl_sock_neigh, &ev) == -1) { + close(nl_sock_neigh); + nl_sock_neigh = -1; + warn("Failed do epoll_ctrl() on RTMGRP_NEIGH socket"); + return -1; + } + + nl_neigh_sync(c, AF_INET, c->ifi4); + nl_neigh_sync(c, AF_INET6, c->ifi6); + + return 0; +} diff --git a/netlink.h b/netlink.h index b51e99c..8f1e9b9 100644 --- a/netlink.h +++ b/netlink.h @@ -17,6 +17,8 @@ int nl_route_dup(int s_src, unsigned int ifi_src, int s_dst, unsigned int ifi_dst, sa_family_t af); int nl_addr_get(int s, unsigned int ifi, sa_family_t af, void *addr, int *prefix_len, void *addr_l); +bool nl_neigh_mac_get(int s, const union inany_addr *addr, int ifi, + unsigned char *mac); int nl_addr_set(int s, unsigned int ifi, sa_family_t af, const void *addr, int prefix_len); int nl_addr_get_ll(int s, unsigned int ifi, struct in6_addr *addr); @@ -28,5 +30,7 @@ int nl_link_set_mac(int s, unsigned int ifi, const void *mac); int nl_link_set_mtu(int s, unsigned int ifi, int mtu); int nl_link_set_flags(int s, unsigned int ifi, unsigned int set, unsigned int change); +int nl_neigh_notify_init(const struct ctx *c); +void nl_neigh_notify_handler(const struct ctx *c);
#endif /* NETLINK_H */ diff --git a/passt.c b/passt.c index 31fbb75..e21d6ba 100644 --- a/passt.c +++ b/passt.c @@ -53,6 +53,7 @@ #include "vu_common.h" #include "migrate.h" #include "repair.h" +#include "netlink.h"
#define EPOLL_EVENTS 8
@@ -79,6 +80,7 @@ char *epoll_type_str[] = { [EPOLL_TYPE_VHOST_KICK] = "vhost-user kick socket", [EPOLL_TYPE_REPAIR_LISTEN] = "TCP_REPAIR helper listening socket", [EPOLL_TYPE_REPAIR] = "TCP_REPAIR helper socket", + [EPOLL_TYPE_NL_NEIGH] = "netlink neighbour subscription socket", }; static_assert(ARRAY_SIZE(epoll_type_str) == EPOLL_NUM_TYPES, "epoll_type_str[] doesn't match enum epoll_type"); @@ -322,6 +324,8 @@ int main(int argc, char **argv)
pcap_init(&c);
+ nl_neigh_notify_init(&c); + if (!c.foreground) { if ((devnull_fd = open("/dev/null", O_RDWR | O_CLOEXEC)) < 0) die_perror("Failed to open /dev/null"); @@ -414,6 +418,9 @@ loop: case EPOLL_TYPE_REPAIR: repair_handler(&c, eventmask); break; + case EPOLL_TYPE_NL_NEIGH: + nl_neigh_notify_handler(&c); + break; default: /* Can't happen */ ASSERT(0); -- 2.50.1
-- David Gibson (he or they) | I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist, thank you, not the other way | around. http://www.ozlabs.org/~dgibson
On Tue, Oct 14, 2025 at 10:55:14PM -0400, Jon Maloy wrote:
We add a cache table to keep track of the contents of the kernel ARP and NDP tables. The table is fed from the just introduced netlink based neigbour subscription function.
Signed-off-by: Jon Maloy
Reviewed-by: David Gibson
+ /* Blocker entries to stop events from hosts using these addresses */ + if (!inany_is_unspecified4(&mhl)) + fwd_neigh_table_update(c, &mhl, c->our_tap_mac, true); + + if (!inany_is_unspecified4(&ggw) && !c->no_map_gw) + fwd_neigh_table_update(c, &ggw, c->our_tap_mac, true); + + if (!inany_is_unspecified4(&mga) && !inany_equals(&mhl, &mga)) {
That made me realise that we should throw an error during configuration if map_host_loopback == map_guest_addr. It doesn't make sense for these to be the same - if they are, we have no way of knowing if a packet should be mapped to 127.0.0.1 or to guest_addr. *checks* looks like map_host_loopback will take precedence in this case, because of the way nat_outbound() is ordered. In any case I think you can drop the inany_equals() test - the permanent bit will stop the second update from clobbering the first, even if we are misconfigured.
+ uint8_t mac[ETH_ALEN]; + int rc; + + rc = nl_link_get_mac(nl_sock, c->ifi4, mac); + if (rc < 0) { + debug("Couldn't get ip4 MAC addr: %s", strerror_(-rc)); + memcpy(mac, c->our_tap_mac, ETH_ALEN); + }
Using the host's MAC for --map-guest-addr only makes sense if the guest address is the same as the host address. If -a is used to make the guest address different, then it may shadow some other random node, not the host. We don't need special handling for that case - the nat_inbound() you already have will do what we need. IIUC, the host itself doesn't appear in the neighbour table, so we do need special handling if we want to use the host MAC when --map-guest-addr *does* refer to the host. To handle that, I think what we want is pseudo-codishly: fwd_neigh_table_update(c, nat_inbound(host_addr), host_mac, true); The wrinkle is that while we do get the host address at some point, I'm not sure we keep it around (it's typically irrelevant after init). Strictly speaking 'permanent' isn't really correct here, but it's not worth the hassle of setting up a whole other netlink monitor to watch for changes in the host's MAC address. In fact.. I'm not sure it's worth handling this case at all. I think it would be ok to just drop this clause. That means we'll use our_tap_mac by default for things NATted to the (non loopback) host, which is probably fine. -- David Gibson (he or they) | I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist, thank you, not the other way | around. http://www.ozlabs.org/~dgibson
On Tue, Oct 14, 2025 at 10:55:16PM -0400, Jon Maloy wrote:
ARP announcements and unsolicited NAs should be handled with caution because of the risk of malignant users emitting them to disturb network communication.
There is however one case we where we know it is legitimate and safe for us to send out such messages: The one time we switch from using ctx->own_tap_mac to a MAC address received via the recently added neigbour subscription function. Later changes to the MAC address of a host in an existing entry cannot be fully trusted, so we abstain from doing it in such cases.
When sending this type of messages, we notice that the guest accepts the update, but shortly later asks for a confirmation in the form of a regular ARP/NS request. This is responded to with the new value, and we have exactly the effect we wanted.
This commit adds this functionality.
Signed-off-by: Jon Maloy
Reviewed-by: David Gibson
--- v10: -Made small changes based of feedback from David G. v11: -Moved from 'Gratuitous ARP reply' model to 'ARP Announcement' model. v12: -Excluding loopback and default GW addresses from the ARP/NA announcement to be sent to the guest v13: -Filtering out all announcements of our_tap_mac instead of explicitly comparing a set of IP addresses. -Changed annc.am.tha in arp_announce() to MAC_ZERO, which is 'unknown' according to RFC826. -Renamed ndp_send_unsolicited_na() to ndp_unsolicited_na() v14: -Added quotes from RFC5227 to explain the format of ARP announce messages, as suggested by David. --- arp.c | 50 ++++++++++++++++++++++++++++++++++++++++++++++++++ arp.h | 2 ++ fwd.c | 10 ++++++++++ ndp.c | 10 ++++++++++ ndp.h | 1 + 5 files changed, 73 insertions(+)
diff --git a/arp.c b/arp.c index b4a686f..0a89ae3 100644 --- a/arp.c +++ b/arp.c @@ -150,3 +150,53 @@ void arp_send_init_req(const struct ctx *c) debug("Sending initial ARP request for guest MAC address"); tap_send_single(c, &req, sizeof(req)); } + +/** + * arp_announce() - Send an ARP announcement for an IPv4 host + * @c: Execution context + * @ip: IPv4 address we announce as owned by @mac + * @mac: MAC address to advertise for @ip + */ +void arp_announce(const struct ctx *c, struct in_addr *ip, + const unsigned char *mac) +{ + char ip_str[INET_ADDRSTRLEN]; + char mac_str[ETH_ADDRSTRLEN]; + struct { + struct ethhdr eh; + struct arphdr ah; + struct arpmsg am; + } __attribute__((__packed__)) annc; + + /* Ethernet header */ + annc.eh.h_proto = htons(ETH_P_ARP); + memcpy(annc.eh.h_dest, MAC_BROADCAST, sizeof(annc.eh.h_dest)); + memcpy(annc.eh.h_source, mac, sizeof(annc.eh.h_source)); + + /* ARP header */ + annc.ah.ar_op = htons(ARPOP_REQUEST); + annc.ah.ar_hrd = htons(ARPHRD_ETHER); + annc.ah.ar_pro = htons(ETH_P_IP); + annc.ah.ar_hln = ETH_ALEN; + annc.ah.ar_pln = 4; + + /* RFC5227, section 2.1.1, about Probe messages: "The client MUST fill + * in the 'sender hardware address' field of the ARP Request with the + * hardware address of the interface through which it is sending the + * packet. [...] The 'target hardware address' field is ignored and + * SHOULD be set to all zeroes." + * RFC5227, section 2.3: "An ARP Announcement is identical to the ARP + * Probe described above, except that now the sender and target IP + * addresses are both set to the host's newly selected IPv4 address." + */ + memcpy(annc.am.sha, mac, sizeof(annc.am.sha)); + memcpy(annc.am.sip, ip, sizeof(annc.am.sip)); + memcpy(annc.am.tha, MAC_ZERO, sizeof(annc.am.tha)); + memcpy(annc.am.tip, ip, sizeof(annc.am.tip)); + + inet_ntop(AF_INET, ip, ip_str, sizeof(ip_str)); + eth_ntop(mac, mac_str, sizeof(mac_str)); + debug("Announcing ARP for %s / %s", ip_str, mac_str); + + tap_send_single(c, &annc, sizeof(annc)); +} diff --git a/arp.h b/arp.h index d5ad0e1..4862e90 100644 --- a/arp.h +++ b/arp.h @@ -22,5 +22,7 @@ struct arpmsg {
int arp(const struct ctx *c, struct iov_tail *data); void arp_send_init_req(const struct ctx *c); +void arp_announce(const struct ctx *c, struct in_addr *ip, + const unsigned char *mac);
#endif /* ARP_H */ diff --git a/fwd.c b/fwd.c index f70e4fc..df92db8 100644 --- a/fwd.c +++ b/fwd.c @@ -27,6 +27,8 @@ #include "lineread.h" #include "flow_table.h" #include "netlink.h" +#include "arp.h" +#include "ndp.h"
/* Empheral port range: values from RFC 6335 */ static in_port_t fwd_ephemeral_min = (1 << 15) + (1 << 14); @@ -140,6 +142,14 @@ void fwd_neigh_table_update(const struct ctx *c, const union inany_addr *addr, memcpy(&e->addr, addr, sizeof(*addr)); memcpy(e->mac, mac, ETH_ALEN); e->permanent = permanent; + + if (!memcmp(mac, c->our_tap_mac, ETH_ALEN)) + return; + + if (inany_v4(addr)) + arp_announce(c, inany_v4(addr), e->mac); + else + ndp_unsolicited_na(c, &addr->a6); }
/** diff --git a/ndp.c b/ndp.c index 7e2ae2a..430d420 100644 --- a/ndp.c +++ b/ndp.c @@ -220,6 +220,16 @@ static void ndp_na(const struct ctx *c, const struct in6_addr *dst, ndp_send(c, dst, &na, sizeof(na)); }
+/** + * ndp_unsolicited_na() - Send unsolicited NA + * @c: Execution context + * @addr: IPv6 address to advertise + */ +void ndp_unsolicited_na(const struct ctx *c, const struct in6_addr *addr) +{ + ndp_na(c, &in6addr_ll_all_nodes, addr); +} + /** * ndp_ra() - Send an NDP Router Advertisement (RA) message * @c: Execution context diff --git a/ndp.h b/ndp.h index 781ea86..56b756d 100644 --- a/ndp.h +++ b/ndp.h @@ -12,5 +12,6 @@ int ndp(const struct ctx *c, const struct in6_addr *saddr, struct iov_tail *data); void ndp_timer(const struct ctx *c, const struct timespec *now); void ndp_send_init_req(const struct ctx *c); +void ndp_unsolicited_na(const struct ctx *c, const struct in6_addr *addr);
#endif /* NDP_H */ -- 2.50.1
-- David Gibson (he or they) | I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist, thank you, not the other way | around. http://www.ozlabs.org/~dgibson
On Tue, Oct 14, 2025 at 10:55:15PM -0400, Jon Maloy wrote:
When we receive an ARP request or NDP neigbour solicitation over the tap interface for a host on the local network segment attached to the template interface, we respond with that host's real MAC address, if available.
Signed-off-by: Jon Maloy
Reviewed-by: David Gibson
--- v3: - Added helper function to find out if a remote ip address is subject to NAT. This filters out local host addresses which should be presented with the passt/pasta local MAC address 9a:55:9a:55:9a:55 even though it is on the local segment. - Adapted to the change in nl_mac_get() function, so that we now consider only the template interface when checking the ARP/NDP table. v4: - Moved NAT check into the function nat_outbound() to obtain more precise criteria for when NAT is used. We may in theory have NAT even if original and translated addresses are equal, and we want to catch this case. - I chose to keep the wrapper funtion inany_nat(), but moved it to fwd.h/fwd.c and renamed it to fwd_inany_nat(). v5: - Simplified criteria for when we do ARP/NDP lookup. Now, we just try with the potentially translated address after an attempted NAT check. - Using the new ARP/NDP cache table instead of using netlink directly. v6: - Fixes after feedback from David: - Renamed nat_outbound() to fwd_nat_outbound() - Eliminated unnecessary temporary variable in arp.c::arp() v12: - Some minor changes after feedback from Stefano v13: - Removed call to nat_outbound() before MAC resolution, as we are now handling guest-side visible addresses only. v14: - Removed obsolete memcpy() in ndp.c::ndp_na() --- arp.c | 8 ++++++-- inany.c | 1 + ndp.c | 4 +++- 3 files changed, 10 insertions(+), 3 deletions(-)
diff --git a/arp.c b/arp.c index ad088b1..b4a686f 100644 --- a/arp.c +++ b/arp.c @@ -69,6 +69,7 @@ static bool ignore_arp(const struct ctx *c, */ int arp(const struct ctx *c, struct iov_tail *data) { + union inany_addr tgt; struct { struct ethhdr eh; struct arphdr ah; @@ -102,8 +103,11 @@ int arp(const struct ctx *c, struct iov_tail *data) resp.ah.ar_hln = ah->ar_hln; resp.ah.ar_pln = ah->ar_pln;
- /* ARP message */ - memcpy(resp.am.sha, c->our_tap_mac, sizeof(resp.am.sha)); + /* MAC address to return in ARP message */ + inany_from_af(&tgt, AF_INET, am->tip); + fwd_neigh_mac_get(c, &tgt, resp.am.sha); + + /* Rest of ARP message */ memcpy(resp.am.sip, am->tip, sizeof(resp.am.sip)); memcpy(resp.am.tha, am->sha, sizeof(resp.am.tha)); memcpy(resp.am.tip, am->sip, sizeof(resp.am.tip)); diff --git a/inany.c b/inany.c index 65a39f9..7680439 100644 --- a/inany.c +++ b/inany.c @@ -16,6 +16,7 @@ #include "ip.h" #include "siphash.h" #include "inany.h" +#include "fwd.h"
const union inany_addr inany_loopback4 = INANY_INIT4(IN4ADDR_LOOPBACK_INIT); const union inany_addr inany_any4 = INANY_INIT4(IN4ADDR_ANY_INIT); diff --git a/ndp.c b/ndp.c index 588b48f..7e2ae2a 100644 --- a/ndp.c +++ b/ndp.c @@ -196,6 +196,7 @@ static void ndp_send(const struct ctx *c, const struct in6_addr *dst, static void ndp_na(const struct ctx *c, const struct in6_addr *dst, const struct in6_addr *addr) { + union inany_addr tgt; struct ndp_na na = { .ih = { .icmp6_type = NA, @@ -213,7 +214,8 @@ static void ndp_na(const struct ctx *c, const struct in6_addr *dst, } };
- memcpy(na.target_l2_addr.mac, c->our_tap_mac, ETH_ALEN); + inany_from_af(&tgt, AF_INET6, addr); + fwd_neigh_mac_get(c, &tgt, na.target_l2_addr.mac);
ndp_send(c, dst, &na, sizeof(na)); } -- 2.50.1
-- David Gibson (he or they) | I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist, thank you, not the other way | around. http://www.ozlabs.org/~dgibson
On 2025-10-16 23:05, David Gibson wrote:
On Tue, Oct 14, 2025 at 10:55:14PM -0400, Jon Maloy wrote:
We add a cache table to keep track of the contents of the kernel ARP and NDP tables. The table is fed from the just introduced netlink based neigbour subscription function.
Signed-off-by: Jon Maloy
Reviewed-by: David Gibson
I do see one error here, but it's fairly harmless, so I think a follow up makes more sense than a respin.
[snip]
+ /* Blocker entries to stop events from hosts using these addresses */ + if (!inany_is_unspecified4(&mhl)) + fwd_neigh_table_update(c, &mhl, c->our_tap_mac, true); + + if (!inany_is_unspecified4(&ggw) && !c->no_map_gw) + fwd_neigh_table_update(c, &ggw, c->our_tap_mac, true); + + if (!inany_is_unspecified4(&mga) && !inany_equals(&mhl, &mga)) {
That made me realise that we should throw an error during configuration if map_host_loopback == map_guest_addr. It doesn't make sense for these to be the same - if they are, we have no way of knowing if a packet should be mapped to 127.0.0.1 or to guest_addr. *checks* looks like map_host_loopback will take precedence in this case, because of the way nat_outbound() is ordered.
I also noticed it is possible to set guest_gw to some random address while at the same time setting no_map_gw. It seems to be harmless, since no_map_gw takes precedence, but it is an inconsistency we should fence against.
In any case I think you can drop the inany_equals() test - the permanent bit will stop the second update from clobbering the first, even if we are misconfigured.
+ uint8_t mac[ETH_ALEN]; + int rc; + + rc = nl_link_get_mac(nl_sock, c->ifi4, mac); + if (rc < 0) { + debug("Couldn't get ip4 MAC addr: %s", strerror_(-rc)); + memcpy(mac, c->our_tap_mac, ETH_ALEN); + }
Using the host's MAC for --map-guest-addr only makes sense if the guest address is the same as the host address. If -a is used to make the guest address different, then it may shadow some other random node, not the host. We don't need special handling for that case - the nat_inbound() you already have will do what we need.
IIUC, the host itself doesn't appear in the neighbour table, so we do need special handling if we want to use the host MAC when --map-guest-addr *does* refer to the host. To handle that, I think what we want is pseudo-codishly:
fwd_neigh_table_update(c, nat_inbound(host_addr), host_mac, true);
The wrinkle is that while we do get the host address at some point, I'm not sure we keep it around (it's typically irrelevant after init).
Strictly speaking 'permanent' isn't really correct here, but it's not worth the hassle of setting up a whole other netlink monitor to watch for changes in the host's MAC address.
In fact.. I'm not sure it's worth handling this case at all. I think it would be ok to just drop this clause. That means we'll use our_tap_mac by default for things NATted to the (non loopback) host, which is probably fine.
Fine with me, but I do think we need a blocker entry just in case somebody else comes up with the same address. I'll post a complementary commit once the series has been applied. ///jon
On Tue, 14 Oct 2025 22:55:12 -0400
Jon Maloy
The solution to bug https://bugs.passt.top/show_bug.cgi?id=120 requires the ability to translate from an IP address to its corresponding MAC address in cases where those are present in the ARP or NDP tables.
To keep track of the contents of these tables we add a netlink based neighbour subscription feature.
Signed-off-by: Jon Maloy
--- v3: - Added an attribute contianing NDA_DST to sent message, so that we let the kernel do the filtering of the IP address and return only one entry. - Added interface index to the call signature. Since the only interface we know is the template interface, this limits the number of hosts that will be seen as 'network segment local' from a PASST viewpoint. v4: - Made loop independent of attribute order. - Ignoring L2 addresses which are not of size ETH_ALEN. v5: - Changed return value of new function, so caller can know if a MAC address really was found. v6: - Removed warning printout which had ended up in the wrong commit. v8: - Changed to neighbour event subscription model - netlink: arp/ndp table subscription v10:- Updated according to David's latest comments on v8 - Added functionaly where we initially read current state of ARP/NDP tables v12:- Updates based on feedback from David and Stefano v13:- Updates based on feedback from David and Stefano v14:- Updates based on feedback from David --- epoll_type.h | 2 + netlink.c | 209 ++++++++++++++++++++++++++++++++++++++++++++++++++- netlink.h | 4 + passt.c | 7 ++ 4 files changed, 219 insertions(+), 3 deletions(-)
diff --git a/epoll_type.h b/epoll_type.h index 12ac59b..a90ffb6 100644 --- a/epoll_type.h +++ b/epoll_type.h @@ -44,6 +44,8 @@ enum epoll_type { EPOLL_TYPE_REPAIR_LISTEN, /* TCP_REPAIR helper socket */ EPOLL_TYPE_REPAIR, + /* Netlink neighbour subscription socket */ + EPOLL_TYPE_NL_NEIGH,
EPOLL_NUM_TYPES, }; diff --git a/netlink.c b/netlink.c index 9fe70f2..186383c 100644 --- a/netlink.c +++ b/netlink.c @@ -26,6 +26,7 @@ #include
#include #include +#include #include
#include @@ -40,6 +41,10 @@ #define RTNH_NEXT_AND_DEC(rtnh, attrlen) \ ((attrlen) -= RTNH_ALIGN((rtnh)->rtnh_len), RTNH_NEXT(rtnh)) +/* Convenience macro borrowed from kernel */ +#define NUD_VALID \ + (NUD_PERMANENT | NUD_NOARP | NUD_REACHABLE | NUD_PROBE | NUD_STALE) + /* Netlink expects a buffer of at least 8kiB or the system page size, * whichever is larger. 32kiB is recommended for more efficient. * Since the largest page size on any remotely common Linux setup is @@ -50,9 +55,10 @@ #define NLBUFSIZ 65536
/* Socket in init, in target namespace, sequence (just needs to be monotonic) */ -int nl_sock = -1; -int nl_sock_ns = -1; -static int nl_seq = 1; +int nl_sock = -1; +int nl_sock_ns = -1; +static int nl_sock_neigh = -1; +static int nl_seq = 1;
/** * nl_sock_init_do() - Set up netlink sockets in init or target namespace @@ -1103,3 +1109,200 @@ int nl_link_set_flags(int s, unsigned int ifi,
return nl_do(s, &req, RTM_NEWLINK, 0, sizeof(req)); } + +/** + * nl_neigh_msg_read() - Interpret a neigbour state message from netlink
For some reason I missed this on both v11 and v12: neighbour. I can fix it up on merge. I might also shut up, you might say, but: - I grep for things in the code, typos make it hard - typos invite "clean-up" patches that are just code churn and make git logs and blames harder to follow (while perhaps still better than leaving typos behind, I think)
+ * @c: Execution context + * @nh: Message to be read + */ +static void nl_neigh_msg_read(const struct ctx *c, struct nlmsghdr *nh) +{ + struct ndmsg *ndm = NLMSG_DATA(nh); + struct rtattr *rta = (struct rtattr *)(ndm + 1); + size_t na = NLMSG_PAYLOAD(nh, sizeof(*ndm)); + char ip_str[INET6_ADDRSTRLEN]; + char mac_str[ETH_ADDRSTRLEN]; + const uint8_t *lladdr = NULL; + const void *dst = NULL; + size_t lladdr_len = 0; + uint8_t mac[ETH_ALEN]; + union inany_addr addr; + size_t dstlen = 0; + + if (nh->nlmsg_type == NLMSG_DONE) + return; + + if (nh->nlmsg_type == NLMSG_ERROR) { + struct nlmsgerr *errmsg = (struct nlmsgerr *)ndm; + + warn("neigh_msg_read() failed, error %i", errmsg->error);
But neigh_msg_read() isn't a function name (it's nl_neigh_msg_read()). To avoid that, if you really need the function name, I'd suggest %s and __func__. However, here, I don't think you can say that the function failed. It's just that we got an error from netlink. The function didn't do almost anything. Maybe just "netlink error on neighbour notifier: ..."? By the way of which, usually, we print the error name with _strerror(): note that error in struct nlmsgerr is errno-like, so you can use that instead of printing a number that perhaps not everybody is familiar with.
+ return; + } + + if (nh->nlmsg_type != RTM_NEWNEIGH && + nh->nlmsg_type != RTM_DELNEIGH)
This fits on a single line now (same here, I missed this on both v11 and v12, sorry).
+ return; + + for (; RTA_OK(rta, na); rta = RTA_NEXT(rta, na)) { + switch (rta->rta_type) { + case NDA_DST: + dst = RTA_DATA(rta); + dstlen = RTA_PAYLOAD(rta); + break; + case NDA_LLADDR: + lladdr = RTA_DATA(rta); + lladdr_len = RTA_PAYLOAD(rta); + break; + default: + break;
With 'if' / 'else if' you save three lines, but not a strong preference, maybe not even a preference at all.
+ } + } + + if (!dst) + return; + + if (ndm->ndm_family == AF_INET && + ndm->ndm_ifindex != c->ifi4) + return; + + if (ndm->ndm_family == AF_INET6 && + ndm->ndm_ifindex != c->ifi6) + return; + + if (ndm->ndm_family != AF_INET && + ndm->ndm_family != AF_INET6) + return; + + if (ndm->ndm_family == AF_INET && + dstlen != sizeof(struct in_addr)) { + warn("netlink: wrong address length in AF_INET notification"); + return; + } + if (ndm->ndm_family == AF_INET6 && + dstlen != sizeof(struct in6_addr)) {
All these five conditions fit on a single line (missed on v12, but it wasn't the case on v11).
+ warn("netlink: wrong address length in AF_INET6 notification"); + return; + } + inany_from_af(&addr, ndm->ndm_family, dst); + inany_ntop(dst, ip_str, sizeof(ip_str)); + + if (nh->nlmsg_type == RTM_DELNEIGH) { + trace("neigh table delete: %s", ip_str); + return; + } + if (!(ndm->ndm_state & NUD_VALID)) { + trace("neigh table: invalid state for %s", ip_str);
It's not an invalid state. It's a state that represents that the neighbour information lost its validity, but the state is not invalid (that would look like an error to anybody who hasn't read the code). Maybe something like "neighbour table: %s unreachable, state: 0x%02x"?
+ return; + } + if (nh->nlmsg_type != RTM_NEWNEIGH || !lladdr) {
nh->nlmsg_type can only be RTM_NEWNEIGH at this point, because you have two early returns for: - nh->nlmsg_type != RTM_NEWNEIGH && nh->nlmsg_type != RTM_DELNEIGH - nh->nlmsg_type == RTM_DELNEIGH
+ warn("netlink: notification with illegal msg for %s", ip_str);
...meaning that you can finally be a bit more descriptive about what's "illegal" in the notification ("missing link-layer address"?).
+ return; + } + if (lladdr_len != ETH_ALEN || ndm->ndm_type != ARPHRD_ETHER) + return; + + memcpy(mac, lladdr, ETH_ALEN); + eth_ntop(mac, mac_str, sizeof(mac_str));
I read 3/10, but I don't understand anyway: why do we need a temporary 'mac' variable? Can't you use 'lladdr' directly?
+ trace("neigh table update: %s / %s", ip_str, mac_str); +} + +/** + * nl_neigh_sync() - Read current contents ARP/NDP tables
Is that a... reverse Saxon Genitive? :) ... content of ...
+ * @c: Execution context + * @proto: Protocol, AF_INET or AF_INET6 + * @ifi: Interface index + */ +static void nl_neigh_sync(const struct ctx *c, int proto, int ifi) +{ + struct { + struct nlmsghdr nlh; + struct ndmsg ndm; + } req = { + .nlh = {0},
We always leave spaces inside curly brackets (K&R / kernel style). By the way, I think David already suggested in another case that you don't need to initialise stuff explicitly to zero. It's also the case here.
+ .ndm.ndm_family = proto, + .ndm.ndm_ifindex = ifi, + }; + struct nlmsghdr *nh; + char buf[NLBUFSIZ]; + ssize_t status; + uint32_t seq; + + seq = nl_send(nl_sock_neigh, &req, RTM_GETNEIGH, + NLM_F_DUMP, sizeof(req)); + nl_foreach_oftype(nh, status, nl_sock_neigh, buf, seq, RTM_NEWNEIGH) + nl_neigh_msg_read(c, nh); + if (status < 0) + warn("netlink: RTM_GETNEIGH failed: %s", strerror_(-status)); +} + +/** + * nl_neigh_notify_handler() - Non-blocking drain of pending neighbour updates + * @c: Execution context + */ +void nl_neigh_notify_handler(const struct ctx *c) +{ + char buf[NLBUFSIZ]; + + for (;;) { + ssize_t n = recv(nl_sock_neigh, buf, sizeof(buf), MSG_DONTWAIT); + struct nlmsghdr *nh = (struct nlmsghdr *)buf; + + if (n < 0) { + if (errno != EAGAIN && errno != EINTR)
On EINTR, shouldn't we re-do the recv() call? We usually do that, because it's actually the right thing to do. See a concise example in vu_message_read_default(), vhost_user.c.
+ warn_perror("nl_neigh_notify sock read error");
Same as above with function names: there's no such function as nl_neigh_notify() and anyway it's not very descriptive to users. What about "netlink notifier read error"?
+ return; + } + for (; NLMSG_OK(nh, n); nh = NLMSG_NEXT(nh, n)) + nl_neigh_msg_read(c, nh); + } +} + +/** + * nl_neigh_notify_init() - Subscribe to neighbour events + * @c: Execution context + * + * Return: 0 on success, -1 on failure + */ +int nl_neigh_notify_init(const struct ctx *c) +{ + union epoll_ref ref = { + .type = EPOLL_TYPE_NL_NEIGH + }; + struct epoll_event ev = { + .events = EPOLLIN + }; + struct sockaddr_nl addr = { + .nl_family = AF_NETLINK, + .nl_groups = RTMGRP_NEIGH, + }; + + if (nl_sock_neigh >= 0) { + warn("RTMGRP_NEIGH already exists");
Strictly speaking, the constant existed for quite some years already. What already and unexpectedly exists here is the notifier socket.
+ return 0; + } + nl_sock_neigh = socket(AF_NETLINK, SOCK_RAW | SOCK_CLOEXEC,
Hint: extra newlines don't account for lines of code using cloc(1). :)
+ NETLINK_ROUTE); + if (nl_sock_neigh < 0) { + warn("Failed create RTMGRP_NEIGH socket");
Same here, it's a notifier socket.
+ return -1; + } + if (bind(nl_sock_neigh, (struct sockaddr *)&addr, sizeof(addr)) < 0) { + close(nl_sock_neigh); + nl_sock_neigh = -1; + warn("Failed bind RTMGRP_NEIGH socket");
If you move this to just after bind(), you can conveniently call warn_perror() and also explain why it failed for free. It's a notifier socket, and a preposition is missing.
+ return -1; + } + + ev.data.u64 = ref.u64; + if (epoll_ctl(c->epollfd, EPOLL_CTL_ADD, nl_sock_neigh, &ev) == -1) { + close(nl_sock_neigh); + nl_sock_neigh = -1; + warn("Failed do epoll_ctrl() on RTMGRP_NEIGH socket");
Same as for bind(). epoll_ctrl() doesn't exist, by the way.
+ return -1; + } + + nl_neigh_sync(c, AF_INET, c->ifi4); + nl_neigh_sync(c, AF_INET6, c->ifi6); + + return 0; +} diff --git a/netlink.h b/netlink.h index b51e99c..8f1e9b9 100644 --- a/netlink.h +++ b/netlink.h @@ -17,6 +17,8 @@ int nl_route_dup(int s_src, unsigned int ifi_src, int s_dst, unsigned int ifi_dst, sa_family_t af); int nl_addr_get(int s, unsigned int ifi, sa_family_t af, void *addr, int *prefix_len, void *addr_l); +bool nl_neigh_mac_get(int s, const union inany_addr *addr, int ifi, + unsigned char *mac); int nl_addr_set(int s, unsigned int ifi, sa_family_t af, const void *addr, int prefix_len); int nl_addr_get_ll(int s, unsigned int ifi, struct in6_addr *addr); @@ -28,5 +30,7 @@ int nl_link_set_mac(int s, unsigned int ifi, const void *mac); int nl_link_set_mtu(int s, unsigned int ifi, int mtu); int nl_link_set_flags(int s, unsigned int ifi, unsigned int set, unsigned int change); +int nl_neigh_notify_init(const struct ctx *c); +void nl_neigh_notify_handler(const struct ctx *c);
#endif /* NETLINK_H */ diff --git a/passt.c b/passt.c index 31fbb75..e21d6ba 100644 --- a/passt.c +++ b/passt.c @@ -53,6 +53,7 @@ #include "vu_common.h" #include "migrate.h" #include "repair.h" +#include "netlink.h"
#define EPOLL_EVENTS 8
@@ -79,6 +80,7 @@ char *epoll_type_str[] = { [EPOLL_TYPE_VHOST_KICK] = "vhost-user kick socket", [EPOLL_TYPE_REPAIR_LISTEN] = "TCP_REPAIR helper listening socket", [EPOLL_TYPE_REPAIR] = "TCP_REPAIR helper socket", + [EPOLL_TYPE_NL_NEIGH] = "netlink neighbour subscription socket",
I already commented on this in v11: even if you subscribe using the socket, it's functionally a notifier (or notification?) socket. In general, I know it takes a bit longer but I wonder if it's worth the effort: quote from: https://docs.kernel.org/process/6.Followthrough.html#working-with-reviewers Andrew Morton has suggested that every review comment which does not result in a code change should result in an additional code comment instead; that can help future reviewers avoid the questions which came up the first time around. David and myself do that very consistently, and I think it helps keeping the number of revisions down. It takes time and sometimes nerves but the return on the investment is substantial. I don't want to bother people more than I do (yes, I think it's possible), so I'm not proposing that we enforce that, but maybe it's something you should consider?
}; static_assert(ARRAY_SIZE(epoll_type_str) == EPOLL_NUM_TYPES, "epoll_type_str[] doesn't match enum epoll_type"); @@ -322,6 +324,8 @@ int main(int argc, char **argv)
pcap_init(&c);
+ nl_neigh_notify_init(&c); + if (!c.foreground) { if ((devnull_fd = open("/dev/null", O_RDWR | O_CLOEXEC)) < 0) die_perror("Failed to open /dev/null"); @@ -414,6 +418,9 @@ loop: case EPOLL_TYPE_REPAIR: repair_handler(&c, eventmask); break; + case EPOLL_TYPE_NL_NEIGH: + nl_neigh_notify_handler(&c); + break; default: /* Can't happen */ ASSERT(0);
-- Stefano
On Tue, 14 Oct 2025 22:55:13 -0400
Jon Maloy
Later in this series we will introduce a new initialization function which needs access to the 'no_map_gw' flag. This flag is currently a local variable inside the conf() call, where it is too early to call the new function. The new function needs both ctx->hash_secret and a valid ctx->fd_tap to be set, neither of which are initialized at that point.
We therefore add the 'no_map_gw' flag to struct ctx, so it can be carried along and used by later functions.
Signed-off-by: Jon Maloy
Reviewed-by: David Gibson --- v14: Fixed typo in commit log --- conf.c | 10 +++++----- passt.h | 1 + 2 files changed, 6 insertions(+), 5 deletions(-)
diff --git a/conf.c b/conf.c index 66b9e63..6143b54 100644 --- a/conf.c +++ b/conf.c @@ -1410,7 +1410,7 @@ fail: */ void conf(struct ctx *c, int argc, char **argv) { - int netns_only = 0, no_map_gw = 0; + int netns_only = 0; const struct option options[] = { {"debug", no_argument, NULL, 'd' }, {"quiet", no_argument, NULL, 'q' }, @@ -1442,7 +1442,7 @@ void conf(struct ctx *c, int argc, char **argv) {"no-ra", no_argument, &c->no_ra, 1 }, {"no-splice", no_argument, &c->no_splice, 1 }, {"freebind", no_argument, &c->freebind, 1 }, - {"no-map-gw", no_argument, &no_map_gw, 1 }, + {"no-map-gw", no_argument, &c->no_map_gw, 1 }, {"ipv4-only", no_argument, NULL, '4' }, {"ipv6-only", no_argument, NULL, '6' }, {"one-off", no_argument, NULL, '1' }, @@ -1656,7 +1656,7 @@ void conf(struct ctx *c, int argc, char **argv) break; case 21: conf_nat(optarg, &c->ip4.map_host_loopback, - &c->ip6.map_host_loopback, &no_map_gw); + &c->ip6.map_host_loopback, &c->no_map_gw);
This is by the way what causes the pre-existing issue you reported in fee75072-42c9-4bc6-959d-9641f0b27ebd@redhat.com as we should process the no_map_gw in a separate argument parsing loop first, and only then we can call conf_nat().
break; case 22: conf_nat(optarg, &c->ip4.map_guest_addr, @@ -2005,11 +2005,11 @@ void conf(struct ctx *c, int argc, char **argv) } }
- if (c->ifi4 && !no_map_gw && + if (c->ifi4 && !c->no_map_gw && IN4_IS_ADDR_UNSPECIFIED(&c->ip4.map_host_loopback)) c->ip4.map_host_loopback = c->ip4.guest_gw;
- if (c->ifi6 && !no_map_gw && + if (c->ifi6 && !c->no_map_gw && IN6_IS_ADDR_UNSPECIFIED(&c->ip6.map_host_loopback)) c->ip6.map_host_loopback = c->ip6.guest_gw;
diff --git a/passt.h b/passt.h index 0075eb4..830bc36 100644 --- a/passt.h +++ b/passt.h @@ -310,6 +310,7 @@ struct ctx { int no_ndp; int no_ra; int no_splice; + int no_map_gw;
This is now undocumented. See 935bd81936cf ("conf, fwd: Split notion of gateway/router from guest-visible host address") for the hunk you're reverting here.
int host_lo_to_ns_lo; int freebind;
-- Stefano
On Tue, 14 Oct 2025 22:55:14 -0400
Jon Maloy
We add a cache table to keep track of the contents of the kernel ARP and NDP tables. The table is fed from the just introduced netlink based neigbour subscription function.
Signed-off-by: Jon Maloy
--- v5: - Moved to earlier in series to reduce rebase conflicts v6: - Sqashed the hash list commit and the FIFO/LRU queue commit - Removed hash lookup. We now only use linear lookup in a linked list - Eliminated dynamic memory allocation. - Ensured there is only one call to clock_gettime() - Using MAC_ZERO instead of the previously dedicated definitions v7: - NOW using MAC_ZERO where needed - I am still using linear back-off for empty cache entries. Even an incoming, flow-creating packet from a local host gives no guarantee that its MAC address is in the ARP table, so we must allow for a few new attempts at first possible occasions. Only after several failed lookups can we conclude that we probably never will succeed. Hence the back-off. - Fixed a bug that David inadvertently made me aware of: I only intended to set the initial expiry value to MAC_CACHE_RENEWAL when an ARP/NDP table lookup was successful. - Improved struct and function description comments. v8: - Total re-design of table, adapting to the new, subscription based way of updating it. v9: - Catering for MAC address change for an existing host. v10: - Changes according to feedback from David Gibson v12: - Changes according to feedback from David and Stefano - Added dummy entries for loopback and default GW addresses v13: - Changes according to feedback and discussions with David and Stefano v14: - Moved the call to nat_inbound() to a much more sensible place in netlink.c, as suggested by David. --- fwd.c | 217 ++++++++++++++++++++++++++++++++++++++++++++++++++++++ fwd.h | 7 ++ netlink.c | 10 ++- passt.c | 1 + 4 files changed, 233 insertions(+), 2 deletions(-)
diff --git a/fwd.c b/fwd.c index 250cf56..f70e4fc 100644 --- a/fwd.c +++ b/fwd.c @@ -26,6 +26,7 @@ #include "passt.h" #include "lineread.h" #include "flow_table.h" +#include "netlink.h"
/* Empheral port range: values from RFC 6335 */ static in_port_t fwd_ephemeral_min = (1 << 15) + (1 << 14); @@ -33,6 +34,222 @@ static in_port_t fwd_ephemeral_max = NUM_PORTS - 1;
#define PORT_RANGE_SYSCTL "/proc/sys/net/ipv4/ip_local_port_range"
+#define NEIGH_TABLE_SLOTS 1024 +#define NEIGH_TABLE_SIZE (NEIGH_TABLE_SLOTS / 2) +static_assert((NEIGH_TABLE_SLOTS & (NEIGH_TABLE_SLOTS - 1)) == 0, + "NEIGH_TABLE_SLOTS must be a power of two"); + +/** + * struct neigh_table_entry - Entry in the ARP/NDP table + * @next: Next entry in slot or free list + * @addr: IP address of represented host + * @mac: MAC address of represented host + * @permanent: Entry cannot be altered or freed by notification + */ +struct neigh_table_entry { + struct neigh_table_entry *next; + union inany_addr addr; + uint8_t mac[ETH_ALEN]; + bool permanent; +}; + +/** + * struct neigh_table - Cache of ARP/NDP table contents + * @entries: Entries to be plugged into the hash slots when allocated + * @slots: Hash table slots + * @free: Linked list of unused entries + */ +struct neigh_table { + struct neigh_table_entry entries[NEIGH_TABLE_SIZE]; + struct neigh_table_entry *slots[NEIGH_TABLE_SLOTS]; + struct neigh_table_entry *free; +}; + +static struct neigh_table neigh_table; + +/** + * neigh_table_slot() - Hash key to a number within the table range + * @c: Execution context + * @key: The key to be used for the hash + * + * Return: the resulting hash value + */ +static size_t neigh_table_slot(const struct ctx *c, + const union inany_addr *key) +{ + struct siphash_state st = SIPHASH_INIT(c->hash_secret); + uint32_t i; + + inany_siphash_feed(&st, key); + i = siphash_final(&st, sizeof(*key), 0); + + return ((size_t)i) & (NEIGH_TABLE_SIZE - 1); +} + +/** + * fwd_neigh_table_find() - Find a MAC table entry + * @c: Execution context + * @addr: Neighbour address to be used as key for the lookup + * + * Return: the matching entry, if found. Otherwise NULL + */ +static struct neigh_table_entry *fwd_neigh_table_find(const struct ctx *c, + const union inany_addr *addr) +{ + size_t slot = neigh_table_slot(c, addr); + struct neigh_table_entry *e = neigh_table.slots[slot]; + + while (e && !inany_equals(&e->addr, addr)) + e = e->next; + + return e; +} + +/** + * fwd_neigh_table_update() - Allocate or update neighbour table entry + * @c: Execution context + * @addr: IP address used to determine insertion slot and store in entry + * @mac: The MAC address associated with the neighbour address + * @permanent: Created entry cannot be altered or freed + */ +void fwd_neigh_table_update(const struct ctx *c, const union inany_addr *addr, + const uint8_t *mac, bool permanent) +{ + struct neigh_table *t = &neigh_table; + struct neigh_table_entry *e; + ssize_t slot; + + /* MAC address might change sometimes */ + e = fwd_neigh_table_find(c, addr); + if (e) { + if (!e->permanent) + memcpy(e->mac, mac, ETH_ALEN); + return; + } + + e = t->free; + if (!e) { + debug("Failed to allocate neighbour table entry"); + return; + } + t->free = e->next; + slot = neigh_table_slot(c, addr); + e->next = t->slots[slot]; + t->slots[slot] = e; + + memcpy(&e->addr, addr, sizeof(*addr)); + memcpy(e->mac, mac, ETH_ALEN); + e->permanent = permanent; +} + +/** + * fwd_neigh_table_free() - Remove an entry from a slot and add it to free list + * @c: Execution context + * @addr: IP address used to find the slot for the entry + */ +void fwd_neigh_table_free(const struct ctx *c, const union inany_addr *addr) +{ + ssize_t slot = neigh_table_slot(c, addr); + struct neigh_table *t = &neigh_table; + struct neigh_table_entry *e, **prev; + + prev = &t->slots[slot]; + e = t->slots[slot]; + while (e && !inany_equals(&e->addr, addr)) { + prev = &e->next; + e = e->next; + } + + if (!e || e->permanent) + return; + + *prev = e->next; + e->next = t->free; + t->free = e; + memset(&e->addr, 0, sizeof(*addr)); + memset(e->mac, 0, ETH_ALEN); +} + +/** + * fwd_neigh_mac_get() - Look up MAC address in the ARP/NDP table + * @c: Execution context + * @addr: Neighbour IP address used as lookup key + * @mac: Buffer for returned MAC address + */ +void fwd_neigh_mac_get(const struct ctx *c, const union inany_addr *addr, + uint8_t *mac) +{ + const struct neigh_table_entry *e = fwd_neigh_table_find(c, addr); + + if (e) + memcpy(mac, e->mac, ETH_ALEN); + else + memcpy(mac, c->our_tap_mac, ETH_ALEN); +} + +/** + * fwd_neigh_table_init() - Initialize the neighbour table + * @c: Execution context + */ +void fwd_neigh_table_init(const struct ctx *c) +{ + union inany_addr mhl = inany_from_v4(c->ip4.map_host_loopback); + union inany_addr mga = inany_from_v4(c->ip4.map_guest_addr); + union inany_addr ggw = inany_from_v4(c->ip4.guest_gw);
I already mentioned this a couple of times: guest_gw is *not* *relevant*. Ignore it. You needed to add 2/10 and a bunch of lines below because of it, but you shouldn't use it at all. It's not relevant because, for the purposes of this series, everything you need is already reflected in map_guest_addr and map_host_loopback. If no_map_gw is false, and no other options are given, then map_host_loopback is set to guest_gw. If no_map_gw is true, and no other options are given, then map_host_loopback is *not* set to guest_gw. If zero to two of map_host_loopback and map_guest_addr are set, they will tell you which zero to two addresses are translated (for each IP version).
+ struct neigh_table *t = &neigh_table; + struct neigh_table_entry *e; + int i; + + memset(t, 0, sizeof(*t)); + + for (i = 0; i < NEIGH_TABLE_SIZE; i++) { + e = &t->entries[i]; + e->next = t->free; + t->free = e; + } + + /* Blocker entries to stop events from hosts using these addresses */ + if (!inany_is_unspecified4(&mhl)) + fwd_neigh_table_update(c, &mhl, c->our_tap_mac, true); + + if (!inany_is_unspecified4(&ggw) && !c->no_map_gw) + fwd_neigh_table_update(c, &ggw, c->our_tap_mac, true);
This is not needed, and in some cases wrong: - if guest_gw is set and no_map_gw is false, but map_host_loopback wasn't configured, mhl is the same as ggw, and you already added an entry. No harm done, just one excess entry here - if guest_gw is set and no_map_gw is false, but map_host_loopback was configured, mhl is different from ggw. You already added the right entry we'll map in that case (mhl), and now you're adding an entry that's plain wrong (ggw), because we're *not* mapping ggw if the user gave another address. Just ignore guest_gw altogether.
+ + if (!inany_is_unspecified4(&mga) && !inany_equals(&mhl, &mga)) { + uint8_t mac[ETH_ALEN]; + int rc; + + rc = nl_link_get_mac(nl_sock, c->ifi4, mac); + if (rc < 0) { + debug("Couldn't get ip4 MAC addr: %s", strerror_(-rc));
IPv4
+ memcpy(mac, c->our_tap_mac, ETH_ALEN); + } + fwd_neigh_table_update(c, &mga, mac, true); + } + + mhl = *(union inany_addr *)&c->ip6.map_host_loopback; + mga = *(union inany_addr *)&c->ip4.map_guest_addr;
Shouldn't this be c->ip6.map_guest_addr?
+ ggw = *(union inany_addr *)&c->ip4.guest_gw;
Not needed.
+ + if (!inany_is_unspecified6(&mhl)) + fwd_neigh_table_update(c, &mhl, c->our_tap_mac, true); + + if (!inany_is_unspecified6(&ggw) && !c->no_map_gw) + fwd_neigh_table_update(c, &ggw, c->our_tap_mac, true);
Not needed, possibly harmful.
+ + if (!inany_is_unspecified6(&mga) && !inany_equals(&mhl, &mga)) { + uint8_t mac[ETH_ALEN]; + int rc; + + rc = nl_link_get_mac(nl_sock, c->ifi6, mac); + if (rc < 0) { + debug("Couldn't get ip6 MAC addr: %s", strerror_(-rc));
IPv6
+ memcpy(mac, c->our_tap_mac, ETH_ALEN); + } + fwd_neigh_table_update(c, &mga, mac, true); + } +} + /** fwd_probe_ephemeral() - Determine what ports this host considers ephemeral * * Work out what ports the host thinks are emphemeral and record it for later diff --git a/fwd.h b/fwd.h index 65c7c96..352f3b5 100644 --- a/fwd.h +++ b/fwd.h @@ -56,5 +56,12 @@ uint8_t fwd_nat_from_splice(const struct ctx *c, uint8_t proto, const struct flowside *ini, struct flowside *tgt); uint8_t fwd_nat_from_host(const struct ctx *c, uint8_t proto, const struct flowside *ini, struct flowside *tgt); +void fwd_neigh_table_update(const struct ctx *c, const union inany_addr *addr, + const uint8_t *mac, bool permanent); +void fwd_neigh_table_free(const struct ctx *c, + const union inany_addr *addr); +void fwd_neigh_mac_get(const struct ctx *c, const union inany_addr *addr, + uint8_t *mac); +void fwd_neigh_table_init(const struct ctx *c);
#endif /* FWD_H */ diff --git a/netlink.c b/netlink.c index 186383c..85643bd 100644 --- a/netlink.c +++ b/netlink.c @@ -1123,10 +1123,10 @@ static void nl_neigh_msg_read(const struct ctx *c, struct nlmsghdr *nh) char ip_str[INET6_ADDRSTRLEN]; char mac_str[ETH_ADDRSTRLEN]; const uint8_t *lladdr = NULL; + union inany_addr addr, daddr; const void *dst = NULL; size_t lladdr_len = 0; uint8_t mac[ETH_ALEN]; - union inany_addr addr; size_t dstlen = 0;
if (nh->nlmsg_type == NLMSG_DONE) @@ -1183,15 +1183,20 @@ static void nl_neigh_msg_read(const struct ctx *c, struct nlmsghdr *nh) warn("netlink: wrong address length in AF_INET6 notification"); return; } + /* We only handle guest-side visible addresses */ inany_from_af(&addr, ndm->ndm_family, dst); - inany_ntop(dst, ip_str, sizeof(ip_str)); + if (!nat_inbound(c, &addr, &daddr)) + return; + inany_ntop(&daddr, ip_str, sizeof(ip_str));
if (nh->nlmsg_type == RTM_DELNEIGH) { trace("neigh table delete: %s", ip_str); + fwd_neigh_table_free(c, &daddr); return; } if (!(ndm->ndm_state & NUD_VALID)) { trace("neigh table: invalid state for %s", ip_str); + fwd_neigh_table_free(c, &daddr); return; } if (nh->nlmsg_type != RTM_NEWNEIGH || !lladdr) { @@ -1204,6 +1209,7 @@ static void nl_neigh_msg_read(const struct ctx *c, struct nlmsghdr *nh) memcpy(mac, lladdr, ETH_ALEN); eth_ntop(mac, mac_str, sizeof(mac_str)); trace("neigh table update: %s / %s", ip_str, mac_str); + fwd_neigh_table_update(c, &daddr, mac, false); }
/** diff --git a/passt.c b/passt.c index e21d6ba..98fc430 100644 --- a/passt.c +++ b/passt.c @@ -324,6 +324,7 @@ int main(int argc, char **argv)
pcap_init(&c);
+ fwd_neigh_table_init(&c); nl_neigh_notify_init(&c);
if (!c.foreground) {
-- Stefano
On Tue, 14 Oct 2025 22:55:16 -0400
Jon Maloy
ARP announcements and unsolicited NAs should be handled with caution because of the risk of malignant users emitting them to disturb network communication.
There is however one case we where we know it is legitimate and safe for us to send out such messages: The one time we switch from using ctx->own_tap_mac to a MAC address received via the recently added neigbour subscription function. Later changes to the MAC address of a host in an existing entry cannot be fully trusted, so we abstain from doing it in such cases.
When sending this type of messages, we notice that the guest accepts the update, but shortly later asks for a confirmation in the form of a regular ARP/NS request. This is responded to with the new value, and we have exactly the effect we wanted.
This commit adds this functionality.
Signed-off-by: Jon Maloy
--- v10: -Made small changes based of feedback from David G. v11: -Moved from 'Gratuitous ARP reply' model to 'ARP Announcement' model. v12: -Excluding loopback and default GW addresses from the ARP/NA announcement to be sent to the guest v13: -Filtering out all announcements of our_tap_mac instead of explicitly comparing a set of IP addresses. -Changed annc.am.tha in arp_announce() to MAC_ZERO, which is 'unknown' according to RFC826. -Renamed ndp_send_unsolicited_na() to ndp_unsolicited_na() v14: -Added quotes from RFC5227 to explain the format of ARP announce messages, as suggested by David. --- arp.c | 50 ++++++++++++++++++++++++++++++++++++++++++++++++++ arp.h | 2 ++ fwd.c | 10 ++++++++++ ndp.c | 10 ++++++++++ ndp.h | 1 + 5 files changed, 73 insertions(+)
diff --git a/arp.c b/arp.c index b4a686f..0a89ae3 100644 --- a/arp.c +++ b/arp.c @@ -150,3 +150,53 @@ void arp_send_init_req(const struct ctx *c) debug("Sending initial ARP request for guest MAC address"); tap_send_single(c, &req, sizeof(req)); } + +/** + * arp_announce() - Send an ARP announcement for an IPv4 host + * @c: Execution context + * @ip: IPv4 address we announce as owned by @mac + * @mac: MAC address to advertise for @ip + */ +void arp_announce(const struct ctx *c, struct in_addr *ip, + const unsigned char *mac) +{ + char ip_str[INET_ADDRSTRLEN]; + char mac_str[ETH_ADDRSTRLEN]; + struct { + struct ethhdr eh; + struct arphdr ah; + struct arpmsg am; + } __attribute__((__packed__)) annc;
I'm not sure if it's universally clear that "annc" refers to an announcement, I might sound dumb but it took me a bit to figure out. Perhaps 'msg' is more vague but actually less confusing.
+ + /* Ethernet header */ + annc.eh.h_proto = htons(ETH_P_ARP); + memcpy(annc.eh.h_dest, MAC_BROADCAST, sizeof(annc.eh.h_dest)); + memcpy(annc.eh.h_source, mac, sizeof(annc.eh.h_source)); + + /* ARP header */ + annc.ah.ar_op = htons(ARPOP_REQUEST); + annc.ah.ar_hrd = htons(ARPHRD_ETHER); + annc.ah.ar_pro = htons(ETH_P_IP); + annc.ah.ar_hln = ETH_ALEN; + annc.ah.ar_pln = 4; + + /* RFC5227, section 2.1.1, about Probe messages: "The client MUST fill + * in the 'sender hardware address' field of the ARP Request with the + * hardware address of the interface through which it is sending the + * packet. [...] The 'target hardware address' field is ignored and + * SHOULD be set to all zeroes."
Nit: an extra newline here would make it obvious that you're referring to two different sections.
+ * RFC5227, section 2.3: "An ARP Announcement is identical to the ARP + * Probe described above, except that now the sender and target IP + * addresses are both set to the host's newly selected IPv4 address." + */ + memcpy(annc.am.sha, mac, sizeof(annc.am.sha)); + memcpy(annc.am.sip, ip, sizeof(annc.am.sip)); + memcpy(annc.am.tha, MAC_ZERO, sizeof(annc.am.tha)); + memcpy(annc.am.tip, ip, sizeof(annc.am.tip)); + + inet_ntop(AF_INET, ip, ip_str, sizeof(ip_str)); + eth_ntop(mac, mac_str, sizeof(mac_str)); + debug("Announcing ARP for %s / %s", ip_str, mac_str);
Strictly speaking, you're not announcing a protocol. You're using a protocol to announce an address. Maybe "ARP announcement for ..."?
+ + tap_send_single(c, &annc, sizeof(annc)); +} diff --git a/arp.h b/arp.h index d5ad0e1..4862e90 100644 --- a/arp.h +++ b/arp.h @@ -22,5 +22,7 @@ struct arpmsg {
int arp(const struct ctx *c, struct iov_tail *data); void arp_send_init_req(const struct ctx *c); +void arp_announce(const struct ctx *c, struct in_addr *ip, + const unsigned char *mac);
#endif /* ARP_H */ diff --git a/fwd.c b/fwd.c index f70e4fc..df92db8 100644 --- a/fwd.c +++ b/fwd.c @@ -27,6 +27,8 @@ #include "lineread.h" #include "flow_table.h" #include "netlink.h" +#include "arp.h" +#include "ndp.h"
/* Empheral port range: values from RFC 6335 */ static in_port_t fwd_ephemeral_min = (1 << 15) + (1 << 14); @@ -140,6 +142,14 @@ void fwd_neigh_table_update(const struct ctx *c, const union inany_addr *addr, memcpy(&e->addr, addr, sizeof(*addr)); memcpy(e->mac, mac, ETH_ALEN); e->permanent = permanent; + + if (!memcmp(mac, c->our_tap_mac, ETH_ALEN)) + return; + + if (inany_v4(addr)) + arp_announce(c, inany_v4(addr), e->mac); + else + ndp_unsolicited_na(c, &addr->a6); }
/** diff --git a/ndp.c b/ndp.c index 7e2ae2a..430d420 100644 --- a/ndp.c +++ b/ndp.c @@ -220,6 +220,16 @@ static void ndp_na(const struct ctx *c, const struct in6_addr *dst, ndp_send(c, dst, &na, sizeof(na)); }
+/** + * ndp_unsolicited_na() - Send unsolicited NA + * @c: Execution context + * @addr: IPv6 address to advertise + */ +void ndp_unsolicited_na(const struct ctx *c, const struct in6_addr *addr) +{ + ndp_na(c, &in6addr_ll_all_nodes, addr); +} + /** * ndp_ra() - Send an NDP Router Advertisement (RA) message * @c: Execution context diff --git a/ndp.h b/ndp.h index 781ea86..56b756d 100644 --- a/ndp.h +++ b/ndp.h @@ -12,5 +12,6 @@ int ndp(const struct ctx *c, const struct in6_addr *saddr, struct iov_tail *data); void ndp_timer(const struct ctx *c, const struct timespec *now); void ndp_send_init_req(const struct ctx *c); +void ndp_unsolicited_na(const struct ctx *c, const struct in6_addr *addr);
#endif /* NDP_H */
-- Stefano
On Tue, 14 Oct 2025 22:55:21 -0400
Jon Maloy
Even ICMP needs to be updated to use the external MAC address instead of just the own tap address when applicable. We do that here.
Signed-off-by: Jon Maloy
Reviewed-by: David Gibson --- v3: - Adapted to the move of external MAC address from struct flowside to struct flow_common v4: - Adapted to name changes in previous commits in this series v5: - Added conditional lookup in ARP/NDP if the flow's tap_omac is undefined v6: - Looking up MAC of ICMP generating node in udp_send_tap_icmp4/6() when available, instead trusting the contents of flow->tap_omac. v12: - Using MAC_IS_UNDEF() instead of MAC_IS_ZERO() - Comment update after feedback from Stefano v13: - No changes --- icmp.c | 8 ++++++-- ndp.c | 2 +- tap.c | 10 ++++++---- tap.h | 4 ++-- udp.c | 12 ++++++++++-- 5 files changed, 25 insertions(+), 11 deletions(-)
diff --git a/icmp.c b/icmp.c index 6dffafb..93b394a 100644 --- a/icmp.c +++ b/icmp.c @@ -125,17 +125,21 @@ void icmp_sock_handler(const struct ctx *c, union epoll_ref ref) flow_dbg(pingf, "echo reply to tap, ID: %"PRIu16", seq: %"PRIu16, ini->eport, seq);
+ /* Find if neighbour table has a recorded MAC address */
Nit: "check if", or "find out if", or "find MAC address ...", but "find if" is not really a common way to express this. The rest of the series looks good to me. -- Stefano
On Sun, Oct 19, 2025 at 12:07:30PM +0200, Stefano Brivio wrote:
On Tue, 14 Oct 2025 22:55:12 -0400 Jon Maloy
wrote: [snip] + warn("neigh_msg_read() failed, error %i", errmsg->error);
But neigh_msg_read() isn't a function name (it's nl_neigh_msg_read()). To avoid that, if you really need the function name, I'd suggest %s and __func__.
However, here, I don't think you can say that the function failed. It's just that we got an error from netlink. The function didn't do almost anything. Maybe just "netlink error on neighbour notifier: ..."?
I didn't catch this when I reviewed, but I think it's probably best to avoid function names in messages above debug() level. Rationale: debug() and trace() messages are aimed in large part towards developers, and can have whatever information is useful to them, warn(), err() and die() are aimed at users to whom function names are meaningless. -- David Gibson (he or they) | I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist, thank you, not the other way | around. http://www.ozlabs.org/~dgibson
On Fri, Oct 17, 2025 at 02:49:33PM -0400, Jon Maloy wrote:
On 2025-10-16 23:05, David Gibson wrote:
On Tue, Oct 14, 2025 at 10:55:14PM -0400, Jon Maloy wrote:
We add a cache table to keep track of the contents of the kernel ARP and NDP tables. The table is fed from the just introduced netlink based neigbour subscription function.
Signed-off-by: Jon Maloy
Reviewed-by: David Gibson
I do see one error here, but it's fairly harmless, so I think a follow up makes more sense than a respin.
[snip]
+ /* Blocker entries to stop events from hosts using these addresses */ + if (!inany_is_unspecified4(&mhl)) + fwd_neigh_table_update(c, &mhl, c->our_tap_mac, true); + + if (!inany_is_unspecified4(&ggw) && !c->no_map_gw) + fwd_neigh_table_update(c, &ggw, c->our_tap_mac, true); + + if (!inany_is_unspecified4(&mga) && !inany_equals(&mhl, &mga)) {
That made me realise that we should throw an error during configuration if map_host_loopback == map_guest_addr. It doesn't make sense for these to be the same - if they are, we have no way of knowing if a packet should be mapped to 127.0.0.1 or to guest_addr. *checks* looks like map_host_loopback will take precedence in this case, because of the way nat_outbound() is ordered.
I also noticed it is possible to set guest_gw to some random address while at the same time setting no_map_gw. It seems to be harmless, since no_map_gw takes precedence, but it is an inconsistency we should fence against.
No, the existing behaviour is correct. It gets confusing, because the guest_gw is used for magic NATs, that's often what we're referring to when we discuss the guest_gw address. But the guest_gw is *also* exactly what it says on the tin: the gateway address the guest uses. Without that, the guest won't have connectivity at all, so we need it. --no-map-gw means we don't have the magic NATs, but there's still a gateway, and -g still does and should control its address.
In any case I think you can drop the inany_equals() test - the permanent bit will stop the second update from clobbering the first, even if we are misconfigured.
+ uint8_t mac[ETH_ALEN]; + int rc; + + rc = nl_link_get_mac(nl_sock, c->ifi4, mac); + if (rc < 0) { + debug("Couldn't get ip4 MAC addr: %s", strerror_(-rc)); + memcpy(mac, c->our_tap_mac, ETH_ALEN); + }
Using the host's MAC for --map-guest-addr only makes sense if the guest address is the same as the host address. If -a is used to make the guest address different, then it may shadow some other random node, not the host. We don't need special handling for that case - the nat_inbound() you already have will do what we need.
IIUC, the host itself doesn't appear in the neighbour table, so we do need special handling if we want to use the host MAC when --map-guest-addr *does* refer to the host. To handle that, I think what we want is pseudo-codishly:
fwd_neigh_table_update(c, nat_inbound(host_addr), host_mac, true);
The wrinkle is that while we do get the host address at some point, I'm not sure we keep it around (it's typically irrelevant after init).
Strictly speaking 'permanent' isn't really correct here, but it's not worth the hassle of setting up a whole other netlink monitor to watch for changes in the host's MAC address.
In fact.. I'm not sure it's worth handling this case at all. I think it would be ok to just drop this clause. That means we'll use our_tap_mac by default for things NATted to the (non loopback) host, which is probably fine.
Fine with me, but I do think we need a blocker entry just in case somebody else comes up with the same address.
We don't need a blocker. If someone else (let's call them X) comes up with that address, that implies it's not the host's address. If that's so then --map-guest-gw will NAT to X, and that's intended. In which case it also makes sense to use X's MAC address. We don't need to do anything special to make that happen - X will appear in the host neigh table, and the nat_inbound() call will put it into the slot of the --map-guest-gw address.
I'll post a complementary commit once the series has been applied.
///jon
-- David Gibson (he or they) | I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist, thank you, not the other way | around. http://www.ozlabs.org/~dgibson
On 2025-10-19 20:06, David Gibson wrote:
On Fri, Oct 17, 2025 at 02:49:33PM -0400, Jon Maloy wrote: [...]
I also noticed it is possible to set guest_gw to some random address while at the same time setting no_map_gw. It seems to be harmless, since no_map_gw takes precedence, but it is an inconsistency we should fence against.
No, the existing behaviour is correct. It gets confusing, because the guest_gw is used for magic NATs, that's often what we're referring to when we discuss the guest_gw address.
But the guest_gw is *also* exactly what it says on the tin: the gateway address the guest uses. Without that, the guest won't have connectivity at all, so we need it. --no-map-gw means we don't have the magic NATs, but there's still a gateway, and -g still does and should control its address.
So, your are telling me this is expected behaviour? jmaloy@mimir:~/passt/tests/udp_test$ /home/jmaloy/passt/passt/pasta --config-net --no-splice -d -a 192.168.2.2 -g 192.168.2.3 --no-map-gw Template interface: wlp1s0 (IPv4), wlp1s0 (IPv6) Namespace interface: wlp1s0 MAC: host: 9a:55:9a:55:9a:55 DHCP: assign: 192.168.2.2 mask: 255.255.255.0 router: 192.168.2.3 DNS: 10.11.5.19 10.2.32.1 DNS search list: redhat.com rmtcaqc.csb NDP/DHCPv6: assign: 2001:4958:2193:9901:6217:960c:2ef1:f0f3 router: fe80::c23c:4ff:fe04:4638 our link-local: fe80::c23c:4ff:fe04:4638 DNS search list: redhat.com rmtcaqc.csb Sending initial ARP request for guest MAC address Sending initial NDP NS request for guest MAC address SO_PEEK_OFF supported TCP_INFO tcpi_snd_wnd field supported TCP_INFO tcpi_bytes_acked field supported TCP_INFO tcpi_min_rtt field supported root@mimir:~/passt/tests/udp_test# ip r default via 192.168.2.3 dev wlp1s0 192.168.2.0/24 dev wlp1s0 proto kernel scope link src 192.168.2.2 root@mimir:~/passt/tests/udp_test# ping 192.168.2.3 PING 192.168.2.3 (192.168.2.3) 56(84) bytes of data. ^C^ --- 192.168.2.3 ping statistics --- 5 packets transmitted, 0 received, 100% packet loss, time 4080ms root@mimir:~/passt/tests/udp_test# ping 8.8.8.8 PING 8.8.8.8 (8.8.8.8) 56(84) bytes of data. 64 bytes from 8.8.8.8: icmp_seq=1 ttl=255 time=7.68 ms 64 bytes from 8.8.8.8: icmp_seq=2 ttl=255 time=20.9 ms 64 bytes from 8.8.8.8: icmp_seq=3 ttl=255 time=19.1 ms ^C --- 8.8.8.8 ping statistics --- 3 packets transmitted, 3 received, 0% packet loss, time 2002ms rtt min/avg/max/mdev = 7.676/15.889/20.876/5.851 ms root@mimir:~/passt/tests/udp_test# ///jon
In any case I think you can drop the inany_equals() test - the permanent bit will stop the second update from clobbering the first, even if we are misconfigured.
+ uint8_t mac[ETH_ALEN]; + int rc; + + rc = nl_link_get_mac(nl_sock, c->ifi4, mac); + if (rc < 0) { + debug("Couldn't get ip4 MAC addr: %s", strerror_(-rc)); + memcpy(mac, c->our_tap_mac, ETH_ALEN); + }
Using the host's MAC for --map-guest-addr only makes sense if the guest address is the same as the host address. If -a is used to make the guest address different, then it may shadow some other random node, not the host. We don't need special handling for that case - the nat_inbound() you already have will do what we need.
IIUC, the host itself doesn't appear in the neighbour table, so we do need special handling if we want to use the host MAC when --map-guest-addr *does* refer to the host. To handle that, I think what we want is pseudo-codishly:
fwd_neigh_table_update(c, nat_inbound(host_addr), host_mac, true);
The wrinkle is that while we do get the host address at some point, I'm not sure we keep it around (it's typically irrelevant after init).
Strictly speaking 'permanent' isn't really correct here, but it's not worth the hassle of setting up a whole other netlink monitor to watch for changes in the host's MAC address.
In fact.. I'm not sure it's worth handling this case at all. I think it would be ok to just drop this clause. That means we'll use our_tap_mac by default for things NATted to the (non loopback) host, which is probably fine.
Fine with me, but I do think we need a blocker entry just in case somebody else comes up with the same address.
We don't need a blocker. If someone else (let's call them X) comes up with that address, that implies it's not the host's address. If that's so then --map-guest-gw will NAT to X, and that's intended. In which case it also makes sense to use X's MAC address. We don't need to do anything special to make that happen - X will appear in the host neigh table, and the nat_inbound() call will put it into the slot of the --map-guest-gw address.
I'll post a complementary commit once the series has been applied.
///jon
On Mon, Oct 20, 2025 at 06:00:37AM -0400, Jon Maloy wrote:
On 2025-10-19 20:06, David Gibson wrote:
On Fri, Oct 17, 2025 at 02:49:33PM -0400, Jon Maloy wrote: [...]
I also noticed it is possible to set guest_gw to some random address while at the same time setting no_map_gw. It seems to be harmless, since no_map_gw takes precedence, but it is an inconsistency we should fence against.
No, the existing behaviour is correct. It gets confusing, because the guest_gw is used for magic NATs, that's often what we're referring to when we discuss the guest_gw address.
But the guest_gw is *also* exactly what it says on the tin: the gateway address the guest uses. Without that, the guest won't have connectivity at all, so we need it. --no-map-gw means we don't have the magic NATs, but there's still a gateway, and -g still does and should control its address.
So, your are telling me this is expected behaviour?
As discussed on our call, yes. It's weird, but it's the most logical thing we can do under the circumstances.
jmaloy@mimir:~/passt/tests/udp_test$ /home/jmaloy/passt/passt/pasta --config-net --no-splice -d -a 192.168.2.2 -g 192.168.2.3 --no-map-gw Template interface: wlp1s0 (IPv4), wlp1s0 (IPv6) Namespace interface: wlp1s0 MAC: host: 9a:55:9a:55:9a:55 DHCP: assign: 192.168.2.2 mask: 255.255.255.0 router: 192.168.2.3 DNS: 10.11.5.19 10.2.32.1 DNS search list: redhat.com rmtcaqc.csb NDP/DHCPv6: assign: 2001:4958:2193:9901:6217:960c:2ef1:f0f3 router: fe80::c23c:4ff:fe04:4638 our link-local: fe80::c23c:4ff:fe04:4638 DNS search list: redhat.com rmtcaqc.csb Sending initial ARP request for guest MAC address Sending initial NDP NS request for guest MAC address SO_PEEK_OFF supported TCP_INFO tcpi_snd_wnd field supported TCP_INFO tcpi_bytes_acked field supported TCP_INFO tcpi_min_rtt field supported root@mimir:~/passt/tests/udp_test# ip r default via 192.168.2.3 dev wlp1s0 192.168.2.0/24 dev wlp1s0 proto kernel scope link src 192.168.2.2 root@mimir:~/passt/tests/udp_test# ping 192.168.2.3 PING 192.168.2.3 (192.168.2.3) 56(84) bytes of data. ^C^ --- 192.168.2.3 ping statistics --- 5 packets transmitted, 0 received, 100% packet loss, time 4080ms
root@mimir:~/passt/tests/udp_test# ping 8.8.8.8 PING 8.8.8.8 (8.8.8.8) 56(84) bytes of data. 64 bytes from 8.8.8.8: icmp_seq=1 ttl=255 time=7.68 ms 64 bytes from 8.8.8.8: icmp_seq=2 ttl=255 time=20.9 ms 64 bytes from 8.8.8.8: icmp_seq=3 ttl=255 time=19.1 ms ^C --- 8.8.8.8 ping statistics --- 3 packets transmitted, 3 received, 0% packet loss, time 2002ms rtt min/avg/max/mdev = 7.676/15.889/20.876/5.851 ms root@mimir:~/passt/tests/udp_test#
///jon
In any case I think you can drop the inany_equals() test - the permanent bit will stop the second update from clobbering the first, even if we are misconfigured.
+ uint8_t mac[ETH_ALEN]; + int rc; + + rc = nl_link_get_mac(nl_sock, c->ifi4, mac); + if (rc < 0) { + debug("Couldn't get ip4 MAC addr: %s", strerror_(-rc)); + memcpy(mac, c->our_tap_mac, ETH_ALEN); + }
Using the host's MAC for --map-guest-addr only makes sense if the guest address is the same as the host address. If -a is used to make the guest address different, then it may shadow some other random node, not the host. We don't need special handling for that case - the nat_inbound() you already have will do what we need.
IIUC, the host itself doesn't appear in the neighbour table, so we do need special handling if we want to use the host MAC when --map-guest-addr *does* refer to the host. To handle that, I think what we want is pseudo-codishly:
fwd_neigh_table_update(c, nat_inbound(host_addr), host_mac, true);
The wrinkle is that while we do get the host address at some point, I'm not sure we keep it around (it's typically irrelevant after init).
Strictly speaking 'permanent' isn't really correct here, but it's not worth the hassle of setting up a whole other netlink monitor to watch for changes in the host's MAC address.
In fact.. I'm not sure it's worth handling this case at all. I think it would be ok to just drop this clause. That means we'll use our_tap_mac by default for things NATted to the (non loopback) host, which is probably fine.
Fine with me, but I do think we need a blocker entry just in case somebody else comes up with the same address.
We don't need a blocker. If someone else (let's call them X) comes up with that address, that implies it's not the host's address. If that's so then --map-guest-gw will NAT to X, and that's intended. In which case it also makes sense to use X's MAC address. We don't need to do anything special to make that happen - X will appear in the host neigh table, and the nat_inbound() call will put it into the slot of the --map-guest-gw address.
I'll post a complementary commit once the series has been applied.
///jon
-- David Gibson (he or they) | I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist, thank you, not the other way | around. http://www.ozlabs.org/~dgibson
participants (3)
-
David Gibson
-
Jon Maloy
-
Stefano Brivio