I have two BIND servers running BIND 9:
BIND 9.11.36-RedHat-9.11.36-3.el8 (Extended Support Version) <id:68dbd5b>
running on Linux x86_64 4.18.0-372.9.1.el8.x86_64 #1 SMP Tue May 10 08:57:35 EDT 2022
built by make with '--build=x86_64-redhat-linux-gnu' '--host=x86_64-redhat-linux-gnu' '--program-prefix=' '--disable-dependency-tracking' '--prefix=/usr' '--exec-prefix=/usr' '--bindir=/usr/bin' '--sbindir=/usr/sbin' '--sysconfdir=/etc' '--datadir=/usr/share' '--includedir=/usr/include' '--libdir=/usr/lib64' '--libexecdir=/usr/libexec' '--sharedstatedir=/var/lib' '--mandir=/usr/share/man' '--infodir=/usr/share/info' '--with-python=/usr/libexec/platform-python' '--with-libtool' '--localstatedir=/var' '--enable-threads' '--enable-ipv6' '--enable-filter-aaaa' '--with-pic' '--disable-static' '--includedir=/usr/include/bind9' '--with-tuning=large' '--with-libidn2' '--enable-openssl-hash' '--with-geoip2' '--enable-native-pkcs11' '--with-pkcs11=/usr/lib64/pkcs11/libsofthsm2.so' '--with-dlopen=yes' '--with-dlz-ldap=yes' '--with-dlz-postgres=yes' '--with-dlz-mysql=yes' '--with-dlz-filesystem=yes' '--with-dlz-bdb=yes' '--with-gssapi=yes' '--disable-isc-spnego' '--with-lmdb=no' '--with-libjson' '--enable-dnstap' '--with-cmocka' '--enable-fixed-rrset' '--with-docbook-xsl=/usr/share/sgml/docbook/xsl-stylesheets' '--enable-full-report' 'build_alias=x86_64-redhat-linux-gnu' 'host_alias=x86_64-redhat-linux-gnu' 'CFLAGS= -O2 -g -pipe -Wall -Werror=format-security -Wp,-D_FORTIFY_SOURCE=2 -Wp,-D_GLIBCXX_ASSERTIONS -fexceptions -fstack-protector-strong -grecord-gcc-switches -specs=/usr/lib/rpm/redhat/redhat-hardened-cc1 -specs=/usr/lib/rpm/redhat/redhat-annobin-cc1 -m64 -mtune=generic -fasynchronous-unwind-tables -fstack-clash-protection -fcf-protection' 'LDFLAGS=-Wl,-z,relro -Wl,-z,now -specs=/usr/lib/rpm/redhat/redhat-hardened-ld' 'CPPFLAGS= -DDIG_SIGCHASE' 'PKG_CONFIG_PATH=:/usr/lib64/pkgconfig:/usr/share/pkgconfig'
compiled by GCC 8.5.0 20210514 (Red Hat 8.5.0-10)
compiled with OpenSSL version: OpenSSL 1.1.1k FIPS 25 Mar 2021
linked to OpenSSL version: OpenSSL 1.1.1k FIPS 25 Mar 2021
compiled with libxml2 version: 2.9.7
linked to libxml2 version: 20907
compiled with libjson-c version: 0.13.1
linked to libjson-c version: 0.13.1
compiled with zlib version: 1.2.11
linked to zlib version: 1.2.11
linked to maxminddb version: 1.2.0
compiled with protobuf-c version: 1.3.0
linked to protobuf-c version: 1.3.0
threads support is enabled
default paths:
named configuration: /etc/named.conf
rndc configuration: /etc/rndc.conf
DNSSEC root key: /etc/bind.keys
nsupdate session key: /var/run/named/session.key
named PID file: /var/run/named/named.pid
named lock file: /var/run/named/named.lock
geoip-directory: /usr/share/GeoIP
The master server is at 172.16.19.243 and the secondary at 172.16.19.251. They can ping each other, and port 53 (UDP and TCP) is open on both. Both used to work, but after new code was pushed through our automation, both servers lost network access for around two hours. It is possible that the configuration was changed at the same time.
The secondary shows no zone files in /etc/named/. Zone transfers fail:
DNS-Secondary named[546308]: general: info: zone 19.16.172.in-addr.arpa/IN: refresh: unexpected rcode (SERVFAIL) from master 172.16.19.251#53 (source 0.0.0.0#0)
/var/log/named/zone_transfers on the primary shows:
xfer-out: info: client @0x7f48600ebf90 69.61.12.108#47302 (ns4.mydomain.example): bad zone transfer request: 'ns4.mydomain.example/IN': non-authoritative zone (NOTAUTH)
... 3 days later outage occurs, but no logs appear ...
... a few hours after the outage and repeating to present day ...
notify: info: zone mydomain.example/IN: sending notifies (serial 2022051909)
notify: info: zone 19.16.172.in-addr.arpa/IN: sending notifies (serial 2022051909)
notify: info: zone 16.16.172.in-addr.arpa/IN: sending notifies (serial 2022051909)
notify: info: zone 17.16.172.in-addr.arpa/IN: sending notifies (serial 2022051909)
notify: info: zone 18.16.172.in-addr.arpa/IN: sending notifies (serial 2022051909)
Running rndc retransfer mydomain.example does not resolve the problem. Requesting an AXFR with dig also fails:
dig -t axfr mydomain.example 172.16.19.243
; <<>> DiG 9.11.36-RedHat-9.11.36-3.el8 <<>> -t axfr mydomain.example 172.16.19.243
;; global options: +cmd
; Transfer failed.
; Transfer failed.
Querying A and PTR records on the master from the internet works. The same queries against the secondary now fail:
dig @172.16.19.251 191.19.16.172.in-addr.arpa ptr
; <<>> DiG 9.18.2 <<>> @172.16.19.251 191.19.16.172.in-addr.arpa ptr
; (1 server found)
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: SERVFAIL, id: 57626
;; flags: qr rd; QUERY: 1, ANSWER: 0, AUTHORITY: 0, ADDITIONAL: 1
;; WARNING: recursion requested but not available
;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 1232
; COOKIE: 204e72e23787aef415f9ec7562866219e93a158c23f1f323 (good)
;; QUESTION SECTION:
;191.19.16.172.in-addr.arpa. IN PTR
;; Query time: 48 msec
;; SERVER: 172.16.19.251#53(172.16.19.251) (UDP)
;; WHEN: Thu May 19 10:28:39 CDT 2022
;; MSG SIZE rcvd: 83
The /etc/named.conf of the master is shown below:
options {
allow-query {
none;
};
allow-transfer {
none;
};
recursion no;
auth-nxdomain no; # conform to RFC1035
minimal-responses yes;
minimal-any yes;
dnssec-enable yes;
dnssec-validation yes;
};
zone "." IN {
type hint;
file "named.ca";
};
include "/etc/named.rfc1912.zones";
include "/etc/named.root.key";
//System Zones
zone "mydomain.example" IN {
type master;
file "/etc/named/mydomain.example.db";
allow-query {any;
};
allow-transfer {
localhost;
172.16.19.243;
};
notify yes;
};
zone "16.16.172.in-addr.arpa" IN {
type master;
file "/etc/named/16.16.172.in-addr.arpa.rev";
allow-query {any;
};
allow-transfer {
localhost;
172.16.19.243;
};
notify yes;
};
// Zones for 17 - 19 are included in the config with the *exact* same format. Programmatically generated - if there's
// a typo here, then there is in all. No zone transfers work.
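Because every zone stanza comes from the same generator, one way to test the "a typo in one is a typo in all" assumption is to list each zone's allow-transfer entries and compare them with the address the secondary transfers from. The following is only a minimal sketch, not part of the actual automation: the names SECONDARY, SAMPLE_CONF, transfer_acls and zones_missing are hypothetical, the sample text is a trimmed copy of the stanza above, and the parser assumes flat stanzas with literal entries (no named ACLs, CIDR ranges, keys, or comments inside the list).

```python
import re

# Assumption: the secondary's address as stated in the question.
SECONDARY = "172.16.19.251"

# Assumption: a trimmed copy of one generated zone stanza from named.conf.
SAMPLE_CONF = '''
zone "mydomain.example" IN {
    type master;
    file "/etc/named/mydomain.example.db";
    allow-query { any; };
    allow-transfer {
        localhost;
        172.16.19.243;
    };
    notify yes;
};
'''

ZONE_RE = re.compile(r'zone\s+"([^"]+)"')
ACL_RE = re.compile(r'allow-transfer\s*\{([^}]*)\}')


def transfer_acls(conf_text):
    """Map each zone name to the literal entries of its allow-transfer list."""
    acls = {}
    matches = list(ZONE_RE.finditer(conf_text))
    for i, match in enumerate(matches):
        # A zone's body runs from its own header to the next zone header.
        end = matches[i + 1].start() if i + 1 < len(matches) else len(conf_text)
        body = conf_text[match.end():end]
        acl = ACL_RE.search(body)
        entries = []
        if acl:
            entries = [e.strip() for e in acl.group(1).split(";") if e.strip()]
        acls[match.group(1)] = entries
    return acls


def zones_missing(conf_text, address):
    """Return the zones whose allow-transfer list does not name the address."""
    return [zone for zone, entries in transfer_acls(conf_text).items()
            if address not in entries]


if __name__ == "__main__":
    for zone in zones_missing(SAMPLE_CONF, SECONDARY):
        print(f"{zone}: allow-transfer does not list {SECONDARY}")
```

A check like this only reports literal mismatches in the generated text. It does not by itself establish the cause of the failed transfers, and it would need extending before being pointed at a full named.conf.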
/etc/named/16.16.172.in-addr.arpa.rev on the master is as follows:
$TTL 86400
@ IN SOA ns3.mydomain.example. admin.mydomain.example. (
2022051917 ;Serial
3600 ;Refresh
1800 ;Retry
604800 ;Expire
86400 ;Minimum TTL
)
;; All Zone NS Records
@ IN NS ns3.mydomain.example.
@ IN NS ns4.mydomain.example.
;; All Zone PTR Records
* IN PTR HDN-UIDO
To summarize: no DNS lookup for any record works on the secondary, but all of them work on the master. No zones transfer from the master to the secondary. All zones and configurations are generated programmatically, so an error in one zone will be present in all of them. I have found no other errors of note in the logs, and there are no SELinux denials on either server. Permissions on /etc/named/ are 0770 root:named system_u:object_r:named_conf_t:s0 on both servers. Removing all .jnl files did not help (there was only one, on the master, and it was not in /etc/named).
What could be the cause? Thank you.
EDIT 5/19
I confirmed that both servers have 53/UDP and 53/TCP open to each other.
From the secondary:
dig @172.16.19.243 +tcp 200.18.16.172.in-addr.arpa ptr
; <<>> DiG 9.11.36-RedHat-9.11.36-3.el8 <<>> @172.16.19.243 +tcp 200.18.16.172.in-addr.arpa ptr
; (1 server found)
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 6246
;; flags: qr aa rd; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 1
;; WARNING: recursion requested but not available
;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 1232
; COOKIE: a9c777930fc7d210a2b794de6286b88d1d5f01b2a729a2f5 (good)
;; QUESTION SECTION:
;200.18.16.172.in-addr.arpa. IN PTR
;; ANSWER SECTION:
200.18.16.172.in-addr.arpa. 86400 IN PTR EHB-DYN.18.16.172.in-addr.arpa.
;; Query time: 0 msec
;; SERVER: 172.16.19.243#53(172.16.19.243)
;; WHEN: Thu May 19 16:37:15 CDT 2022
;; MSG SIZE rcvd: 110
named-checkzone was used to check all zones. The only issue it reported was a missing record for NS4, which was also pointed out in the comments. I noted this and fixed it on the servers, but the fix did not make it into this question when I wrote it. In any case, it did not resolve the issue.
named-checkconf has been run on both servers and returned status 0 on each.
The slave server has multiple addresses, but it uses the correct (shown) address to query the master, as confirmed with a packet capture on the master.
The A record has been removed from the reverse zone configuration file. The file snippet above is accurate.