12/12/2019 at 21:13 •
This isn't really related to this project, but we did figure out the root cause of the router instability. It's the fault of our Samsung smart TV. It's apparently using the hostname "localhost" in its DHCP requests. That causes a bunch of logging and other complaints from the DHCP server built into the router's dnsmasq instance. How that leads to the interfaces wedging is unclear, but giving the TV a static address made the problem go away.
EDIT: Well, it turns out that the jury has not entirely returned a verdict yet...
11/29/2019 at 18:53 •
We're up at the vacation villa this week and I installed the watchdog and set up the Pi that's in residence to take care of business. Well, last night in the middle of the night the router wedged itself and the Pi and box did their job flawlessly and brought everything back.
I've decided to run the script every 2 hours instead of 4 (so the cron spec is 0 */2 * * * $HOME/watchdog.py), and I've reduced the testing timeout to 30 minutes from 60 (a firmware update should take no longer than 15 minutes including time to get the system back up).
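A minimal sketch of the retry loop with the shorter timeout, not the script as checked in. The 30 minute limit is from above; the 5 minute pause between rounds is carried over from the earlier version of the script and may have changed. The `probe`, `clock` and `sleep` parameters are hypothetical hooks so the loop can be exercised without a network or a real wait.

```python
import time


def wait_for_internet(hosts, probe, timeout_s=30 * 60, retry_s=5 * 60,
                      clock=time.time, sleep=time.sleep):
    """Return True as soon as any host answers, False once timeout_s has passed.

    probe(host) is a hypothetical callable returning True if the host replied
    (in the real script this would wrap the ping call).
    """
    start = clock()
    while True:
        for host in hosts:
            if probe(host):
                return True
        # Every host failed this round; give up only after the full timeout.
        if clock() - start > timeout_s:
            return False
        sleep(retry_s)
```

With a 2 hour cron interval and a timeout of a bit over 30 minutes, one run is long finished before the next one starts.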
Of course, it would be better if this sort of thing just simply didn't happen and the router was reliable, but I don't think any router is going to be reliable enough that I'd not want to keep this system in place, frankly.
11/15/2019 at 17:41 •
I've changed the firmware so that the holdoff interval is no longer restarted when there's an input pulse. Before, if you pulsed every 59 minutes, then the reset would never happen. The more I thought about that, the less I liked it.
Now if you pulse during the holdoff it won't do anything, but if you did it every 59 minutes then every other one would still work.
In addition, just tying the line permanently to ground will reset the system once and never again. The input has to be successfully debounced to "off" before any attempt to turn it "on" will be recognized. And, again, the debounce interval is one second. The line has to stay either "on" or "off" for a full second before any change is recognized. And for "on," that state change is recognized exactly once and then it has to transition successfully to "off" before it can be recognized as "on" again (note that this description is of the input to the circuit - where "on" is shorted-to-ground and off is open).
That's about as robust a system as I can envision that doesn't involve sending commands or stuff like that.
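To make the behavior concrete, here is a small Python model of the input logic described above. It is only a simulation sketch, not the firmware itself: the one second debounce and one hour holdoff come from the description, while the class name, method names and the half-second sampling step are made up for illustration.

```python
DEBOUNCE_S = 1.0      # input must be stable this long before a change counts
HOLDOFF_S = 60 * 60   # pulses are ignored for this long after a reset


class WatchdogModel:
    """Hypothetical model of the debounce and holdoff rules."""

    def __init__(self):
        self.stable = False        # debounced state; True means "on" (shorted to ground)
        self.raw = False           # most recent raw sample
        self.raw_since = 0.0       # time of the last raw change
        self.holdoff_until = None  # end of the current holdoff, if any
        self.resets = []           # times at which power was cycled

    def sample(self, now, raw):
        if raw != self.raw:
            self.raw = raw
            self.raw_since = now
        # A change is recognized only after a full debounce interval.
        if raw != self.stable and now - self.raw_since >= DEBOUNCE_S:
            self.stable = raw
            if raw:
                # Debounced off-to-on edge: seen exactly once per transition.
                self._on_edge(now)

    def _on_edge(self, now):
        if self.holdoff_until is not None and now < self.holdoff_until:
            return  # ignored, and the holdoff is NOT restarted
        self.resets.append(now)
        self.holdoff_until = now + HOLDOFF_S


def run(model, level_at, duration_s, step_s=0.5):
    """Feed the model samples of level_at(t) every step_s seconds."""
    for i in range(int(duration_s / step_s) + 1):
        now = i * step_s
        model.sample(now, level_at(now))
    return model.resets
```

Running it with the line tied to ground gives a single reset, and a 2 second pulse every 59 minutes gets through on every other attempt.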
11/15/2019 at 05:44 •
It's occurred to me that if you power the circuit at 12 volts, Vgs for the P MOSFETs will be the full 12 volts when they're turned on. This is problematic as the MOSFETs I've designed with have an absolute maximum Vgs of ±8 volts. The fix is simple - each of the P MOSFETs needs a zener diode to limit Vgs. For the output MOSFET it's as simple as placing a zener diode from the gate to the positive rail. There's no need for a pull-up because the gate is normally pulled-down and the switching is across the zener. For the switching P MOSFET, we need to add the same zener to the positive rail, but also must add a pull-up across the zener because the switch pulls the gate momentarily down. There also needs to be a current limiting resistor added in series with the switch to limit the zener current. When the supply voltage is less than the zener voltage, then you can imagine that they're just not installed. When the supply voltage is higher, then you can sort of assume that the voltage on the anode is lower than the cathode by the zener voltage amount.
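As a rough check of the numbers, here is the idealized zener model from the paragraph above as a few lines of Python. The part values are assumptions for illustration only: a 5.1 V zener (anything comfortably under the 8 V limit would do) and a 10 k series resistor. Neither value is taken from the actual board.

```python
def clamp(v_supply, v_zener, r_series):
    """Return (|Vgs| in volts, zener current in amps) with the gate pulled low.

    Idealized: the zener is either not conducting at all or holds exactly
    v_zener across itself.
    """
    if v_supply <= v_zener:
        # Below the zener voltage it is as if the diode weren't installed.
        return v_supply, 0.0
    return v_zener, (v_supply - v_zener) / r_series


# Assumed example values, not the real bill of materials.
for v_supply in (5.0, 12.0):
    vgs, i_z = clamp(v_supply, v_zener=5.1, r_series=10e3)
    print("supply %.1f V: |Vgs| = %.1f V, zener current = %.2f mA"
          % (v_supply, vgs, i_z * 1000))
```

At 12 volts with those assumed values, Vgs is held to 5.1 volts and the resistor limits the zener current to about 0.69 mA.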
10/31/2019 at 21:13 •
I checked the watchdog script into the firmware repository. It now uses syslog rather than standard out/err for some rudimentary logging. My recommendation is that you run this from cron every 4-6 hours. You can't really run it more often than hourly, since there's a one hour hold-off in the firmware that is reset if you try to do the reset before the hour has elapsed (so if you pulse the action pin every 59 minutes, it will never actually work after the first time).
The reason you don't want to run it more frequently is that if everything goes terribly wrong and it's doing something crazy, you want to have a chance to get in and stop it. The one-hour firmware hold-off should give you a chance even if everything else goes sour.
The only thing left would be to make a 3D printed case for it, but I don't know if I care that much. We'll see.
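For reference, the syslog logging mentioned above could be as simple as the following sketch, which uses Python's standard `syslog` module. It isn't the code from the repository: the "watchdog" ident, the daemon facility, the priorities and the `log_result` helper are all assumptions.

```python
import syslog

# Tag every message with an ident and the PID (ident is an assumed name).
syslog.openlog("watchdog", syslog.LOG_PID, syslog.LOG_DAEMON)


def log_result(host, reachable):
    """Send one line per probed host to syslog and return the text."""
    msg = "%s is %s." % (host, "up" if reachable else "down")
    priority = syslog.LOG_INFO if reachable else syslog.LOG_WARNING
    syslog.syslog(priority, msg)
    return msg
```

Since cron jobs have no terminal, this puts the output in the system log instead of in cron's mail.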
10/24/2019 at 15:57 •
I thought of a couple of issues with the first script.
If the router is performing a firmware update, the internet may be unreachable for 10 minutes or so, and it would be a disaster to power-cycle the router then. So the script should keep trying for a solid hour to reach an external host before giving up.
I've also expanded the list of hosts. These are all public DNS servers. Again, using them as ping targets is probably not what their owners had in mind, but the script as written is being very gentle (and answering a ping is a lot less work than answering a DNS query). As long as you run this no more than once every 6 hours or so (and as long as everybody and their brother doesn't run it), I would think it would be acceptable.
#!/usr/bin/python
import RPi.GPIO as GPIO
import sys
import time
import subprocess
import os
import random

hosts = ["220.127.116.11", "18.104.22.168", "22.214.171.124", "126.96.36.199", "188.8.131.52", "184.108.40.206", "220.127.116.11", "18.104.22.168", "22.214.171.124", "126.96.36.199"]
random.shuffle(hosts)

# Opened for writing so ping's output can be discarded
FNULL = open(os.devnull, "w")

start = time.time()
while True:
    for host in hosts:
        res = subprocess.call(["ping", "-c", "3", "-W", "5", host], stdout=FNULL, stderr=FNULL)
        if (res == 0):
            print(host + " is up.")
            sys.exit(0) # it worked. Bail
        else:
            print(host + " is down.")
    # Every host failed this round; give up only after a full hour
    if time.time() - start > 60*60:
        break
    time.sleep(5 * 60) # wait 5 minutes

print("All hosts unreachable for 60 minutes - resetting router")

# physical pin 7
reset_pin = 4

# Perform the reset operation: drive the pin low for 2 seconds, then release it
GPIO.setmode(GPIO.BCM)
GPIO.setup(reset_pin, GPIO.OUT, initial=GPIO.LOW)
time.sleep(2)
GPIO.cleanup()
sys.exit(1)
10/23/2019 at 23:29 •
Just before someone brings it up... the two P MOSFETs could be replaced by a depletion mode P MOSFET. Depletion mode MOSFETs work just like the more ordinary enhancement mode devices, with the sense of the gate being backwards. Where increasing the amplitude of the gate-source voltage would turn an enhancement mode MOSFET on, doing so with a depletion mode device turns it off instead.
Unfortunately, depletion mode devices are out of the ordinary, so the prospects of using one - particularly one that can pass 2 amps continuous - are poor.
10/23/2019 at 21:50 •
You'd run this out of cron, like, every few hours:
#!/usr/bin/python
import RPi.GPIO as GPIO
import sys
import time
import subprocess
import os
import random

hosts = ["188.8.131.52", "184.108.40.206", "220.127.116.11"]
random.shuffle(hosts)

# Opened for writing so ping's output can be discarded
FNULL = open(os.devnull, "w")

for host in hosts:
    res = subprocess.call(["ping", "-c", "3", "-W", "5", host], stdout=FNULL, stderr=FNULL)
    if (res == 0):
        sys.exit(0) # it worked. Bail

print("All hosts unreachable - resetting router")

reset_pin = 18

# Perform the reset operation: drive the pin low for 2 seconds, then release it
GPIO.setmode(GPIO.BCM)
GPIO.setup(reset_pin, GPIO.OUT, initial=GPIO.LOW)
time.sleep(2)
GPIO.cleanup()
sys.exit(1)
This assumes you're using GPIO pin 18, but you can select any free one you like.
It's probably not entirely kosher to just ping those hosts, so to ensure that you don't wind up in hot water, you should only run this script VERY sparingly. And it wouldn't be a bad idea to maybe pick different hosts - hosts close enough to be a good test for whether your router is up or not.
10/23/2019 at 21:14 •
The first cut of the hardware design and firmware is done. The boards have been ordered and we'll see what comes of it.
The board's firmware is very paranoid about the signaling it receives. The input needs to be asserted for a full second before the power is cycled, and then it has to remain de-asserted for a full hour before a reboot can be reattempted.
The firmware is paranoid because the signaler in this case is going to be a Linux box, and in general my trust of such systems is... measured.
In any event, the two wires from the control input are ground and an arbitrary GPIO pin. If you want to reboot the router, you set the pin to an output and assert it low for 2 seconds then release it (the normal state for GPIO pins is high impedance, which for us is de-asserted).
Having done that, you must not attempt to do so again for at least an hour, as any attempts in the meantime will reset the hour hold-off timer, potentially extending it into perpetuity.
So in principle, what's called for here is a cron job. That job should attempt to ping a bunch of Internet places of interest, and if any of them succeed, you're done. If all of them fail, then you hit the history eraser button.
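One way to make sure the cron job never pulses inside the hold-off is to keep a timestamp file and check it before touching the pin. This is only a sketch of the idea: the one hour figure is from the firmware description above, but the stamp file and both helper names are hypothetical, and nothing like this is necessarily in the real script.

```python
import os
import time

HOLDOFF_S = 60 * 60  # one hour, per the firmware description


def may_pulse(stamp_path, now=None):
    """Return True if no pulse has been recorded within the last hour."""
    now = time.time() if now is None else now
    try:
        last = os.path.getmtime(stamp_path)
    except OSError:
        return True  # no stamp file yet, so no earlier pulse is known
    return now - last >= HOLDOFF_S


def record_pulse(stamp_path, now=None):
    """Create or touch the stamp file so its mtime marks the latest pulse."""
    now = time.time() if now is None else now
    with open(stamp_path, "a"):
        pass
    os.utime(stamp_path, (now, now))
```

The job would call `may_pulse()` just before driving the pin low and `record_pulse()` right after releasing it.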