
I was recently challenged with onboarding a customer to a client's digital signage system. I needed to migrate thousands of Linux IoT players from their existing stack to what we were developing in-house at the time. It was a challenge, to say the least. Their previous partner was shutting down their servers in a month! It was a worst-case scenario: screens would go black if nothing was done, and millions would be lost in ad revenue.
No pressure.
Thankfully, the API I had built was mostly finished, and we could deploy a stripped-down version of what we were developing at the time. Since my colleagues weren't super familiar with Linux itself or the automation tools surrounding the platform, I was tasked as the project lead to guide the direction for onboarding the existing Linux players.
In this case study, we're going to look at the problems I faced in the process, the initial solution I cooked up, the challenges that came up at scale, and the solutions that worked in the end.
The Problem and Initial Roadmap
The first thought that came to my mind was “how the hell do I migrate all of those players in a month?” You can’t just ssh your way out of a problem of that scale. It felt daunting, to say the least, but like most hairy problems that seem insurmountable, you just gotta take it one step at a time. So that’s exactly what I did. From a high level, I needed to solve these problems:
- Create an automated pipeline to provision and deploy our new application en masse, and have it be production-ready by the end of the month.
- Have some way to patch the players if need be, and pave the way for migration to our new management system down the road.
- Make the provisioning process easy enough that techs could install and provision on-site without developer support.
- Allow the support team and developers to remotely access the players if/when issues arise and for general troubleshooting.
Manually installing a Linux ISO across hardware of varying types and ages proved to be no easy task. We could have created one ISO image per player type and flashed that image across all of the devices, but there would have been no way to keep track of what changes were made, and if anything went wrong with that version of the image, we couldn’t patch the players with new updates when needed to fix issues.
Knowing full well this was going to be a problem, I had the forethought to use Ansible to automate, track, and pull in changes to the images, full IaC style. It allowed us to reproduce the image in a version-controlled way. We also opted to use Packer to spin up VMs on developers’ machines so we could quickly iterate and test images without having to install and flash over and over again. This would also be useful once we added the provisioning process to our CI/CD pipelines.
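To give a feel for the loop, here’s a rough sketch of how a local build round-trip can look. The template, playbook, and inventory names below are illustrative placeholders, not our exact repo layout:

```sh
# Illustrative local build loop (file and variable names are placeholders).
# Packer boots a throwaway VM from the stock ISO, runs the Ansible playbook
# against it, and produces an image we can flash or serve for provisioning.
packer init .
packer build -var "iso_url=file://$(pwd)/isos/base.iso" player.pkr.hcl

# The same playbook can later be pointed at live players for patching,
# so the image definition and the patch path stay in one place.
ansible-playbook -i inventory/players.ini site.yml --check --diff
```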
Ubuntu LTS became the base image for the project. While not my first choice, with the developers on the team being familiar with Ubuntu’s packaging system, Ubuntu’s reputation for stability in the company’s eyes, and our application not running well on Windows (which would also have needed a license per device), it became our pick.
Now that we had a reproducible and modifiable image, we could start deploying, but one problem lingered: we needed a way to automate provisioning the players with minimal effort from the developers. In comes PXE boot. Our plan was to send techs out with PXE boot servers that we could put on the client’s networks at various locations and re-provision their devices on-site.
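For reference, a portable PXE box of this kind usually boils down to a small dnsmasq config plus a TFTP root holding the boot files. The sketch below is illustrative only and assumes the box hands out its own addresses on the segment it provisions; it is not what our production servers actually ran:

```sh
# Minimal dnsmasq-based PXE server sketch (interface, addresses, and paths are illustrative).
# Assumes this box acts as the DHCP server for its own provisioning segment.
sudo tee /etc/dnsmasq.d/pxe.conf <<'EOF'
interface=eth0
dhcp-range=192.168.50.100,192.168.50.200,12h
dhcp-boot=pxelinux.0
enable-tftp
tftp-root=/srv/tftp
EOF
sudo systemctl restart dnsmasq
```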
Lastly, remote access for the support team. For legacy reasons, we opted to roll with TeamViewer. Our support staff was familiar with the software, and we had used it previously at scale on Windows-based players. TeamViewer claimed to have a stable, well-supported Linux version, and since we already had a business license, it became the go-to choice.
With a solid plan in place and our stack picked out, we were off to the races to deploy our application on their existing hardware and automate the process.
The First Attempt (and Why It Failed)
As per management’s requirements, the support team needed a full desktop environment to make changes to the system as needed, as well as remote access. I started by installing our application, bootstrapping Chromium to display the web application fullscreen, and setting up TeamViewer to run in the background for remote management.
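The kiosk portion was nothing exotic; roughly speaking, it looks like the sketch below. The URL and exact flag set are placeholders, not the production command line, and the service name is the standard TeamViewer daemon rather than anything custom of ours:

```sh
# Rough sketch of the kiosk launch under the first (GNOME/Xorg) image.
# URL and flags are placeholders, not the production command line.
chromium-browser --kiosk --no-first-run --noerrdialogs \
    --autoplay-policy=no-user-gesture-required \
    "https://player.example.com/display"

# TeamViewer's Linux host runs as a background daemon for remote access.
sudo systemctl enable --now teamviewerd
```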
This implementation had a few issues:
- Video playback was very laggy and screen tearing was rampant (Xorg issues).
- Despite our best efforts, notifications would still pop up in GNOME’s interface (I’m looking at you, Ubuntu updater).
- Since the players were so low-powered, whenever we remotely accessed a machine, the application would crash and the whole system would freeze up.
- An old PulseAudio bug where the last selected audio device would be reset during reboots (which occurred nightly to conserve energy).
- Another PulseAudio bug: over time, the audio would drift out of sync with the video.
- The Ubuntu LTS release had a major bug where one of the system packages’ logs would overfill the storage, causing kernel panics (the kind of thing the quick checks below surface).
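If you ever have to chase down a similar disk-filling-log problem, the usual first checks look something like this. These are generic diagnostics, not the patch we were after:

```sh
# Generic first checks for a log that's eating the disk (diagnosis only, not the fix).
df -h /                                    # is the root filesystem actually full?
journalctl --disk-usage                    # how much space the systemd journal is using
du -sh /var/log/* 2>/dev/null | sort -h    # which log files/dirs are the offenders
```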
I had put this image together in a week, and we had already deployed hundreds of these players into the wild when we discovered these issues. The last one was the nail in the coffin for Ubuntu. We could patch around most of the issues, and I tried to find a patch we could easily apply to the package in question, but at the time Ubuntu’s answer across the various forums was to roll back to a previous version or wait for the next release, which was months away. Neither was an option.
We decided to pull all of the players out of the field and re-provision them with something else. We went back to the whiteboard to figure out our next game plan.
Designing a More Minimal Approach
First and foremost, we needed to solve the log overflow issue. We could have rolled back to a previous version of Ubuntu, but that would still leave three of the same issues on the table, issues that could easily be solved by moving to a newer Linux stack.
I floated the idea of using Fedora, as it would solve the logging issue and knock out a few of the other issues as well:
- Fixes the logging storage overflow.
- Fixes the screen tearing and performance issues with GNOME (Wayland).
- Fixes the audio device switching on reboot and the audio going out of sync (PipeWire).
- Streamlines our provisioning process with better tooling (Kickstart files; see the sketch after this list).
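To show what that tooling buys you, here’s a minimal Kickstart along the lines of what drives an unattended Fedora install. The partitioning, user, and package set are illustrative placeholders, not our production file:

```sh
# Minimal unattended-install Kickstart, written out here for illustration.
# Partitioning, user, and package set are placeholders, not the production file.
cat > player-ks.cfg <<'EOF'
text
lang en_US.UTF-8
keyboard us
timezone UTC
zerombr
clearpart --all --initlabel
autopart
rootpw --lock
user --name=player --password=changeme --plaintext
%packages
@core
%end
reboot
EOF
```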
Perfect! That solved a lot of the issues, but migrating from Xorg to Wayland had one big problem that became a hot topic of heated debate with management: TeamViewer didn’t work at all under Wayland. This became a huge blocker for the newer setup. We couldn’t just ship out devices with zero visibility and support.
I ended up cooking up a solution using VNC, Tailscale, and Apache Guacamole. This solved our remote access blocker and gave our support staff an easy way to access devices across organizations, with OIDC support possible down the road.
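In broad strokes: each player joins the tailnet and runs a VNC server bound to its Tailscale address, and Guacamole sits in front as the browser-based entry point for support. The sketch below shows the per-player side; the auth key, hostname scheme, and the choice of wayvnc as the VNC server are assumptions for illustration, not the exact production pieces.

```sh
# Sketch of the per-player side (auth key, hostname scheme, and wayvnc are assumptions).
sudo tailscale up --authkey "$TS_AUTHKEY" \
    --hostname "player-$(cut -c1-8 /etc/machine-id)"

# wayvnc exposes the Wayland session over VNC; binding it to the tailnet address
# means the only way in is through Tailscale (and, for support, through Guacamole).
wayvnc "$(tailscale ip -4)" 5900
```

Guacamole connections then just point at each player’s tailnet address over VNC, and Guacamole’s OIDC extension leaves the door open for single sign-on later.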
The newer approach worked well on most devices except one: the lowest-powered model of them all struggled with the desktop environment. This was the last hurdle we needed to overcome before we could start deploying en masse, so I threw out GNOME for a minimal Wayland compositor called River. River needed a few more tools to get it up to speed, so I created the B.A.D. utility stack. When you’re down bad on a player, you just press Mod4 (the Windows/Cmd key) plus B, A, or D (or R or T); the keybindings are sketched after the list.
- B: opens a web browser for general troubleshooting.
- A: opens an audio device management GUI to set the volume and change the device.
- D: opens the display settings to change resolution, refresh rate, and rotation.
- (R): returns you to the application and closes out any open utilities.
- (T): opens a terminal if all else fails.
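The whole thing lives in River’s init script, which is just a shell script of riverctl calls. The utilities bound below are illustrative stand-ins rather than the exact production picks, and the “return to player” helper is hypothetical:

```sh
#!/bin/sh
# Excerpt-style sketch of ~/.config/river/init (bound utilities are illustrative).
riverctl map normal Super B spawn 'firefox'       # B: web browser for troubleshooting
riverctl map normal Super A spawn 'pavucontrol'   # A: audio device / volume GUI
riverctl map normal Super D spawn 'wdisplays'     # D: resolution, refresh rate, rotation
riverctl map normal Super R spawn "$HOME/bin/back-to-player"  # R: close utils, refocus the app (hypothetical helper)
riverctl map normal Super T spawn 'foot'          # T: terminal if all else fails
```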
I know, a terrible but cheeky acronym, but it was great for getting the techs and support staff to remember what to do if they hit a problem out in the field or while accessing a player remotely. With all of that, we had a golden image we could ship out to the players. It was capable of running 1080p content at 60 Hz no problem and kept the power usage of the devices low. Now to just scale up operations and ship all the devices.
The Struggle of Scaling PXE Boot
One issue we did not foresee was that we couldn’t modify the networks’ DNS on-site to point to our PXE boot servers out in the field. The devices had previously used PXE boot, and we thought we could just update the DNS server on the network to point to our PXE boot server, but that was off the table. With on-site installs now too complicated for the techs, we brought all the players into the office to be provisioned.
It was a painful process to bring in all the players and ship them back out, but it worked. Our next challenge was just over the horizon: the network at the office was struggling. Turns out, sending ISO images gigabytes in size over the network to hundreds of devices at once can be taxing. Who would have thunk! It was obvious once we discovered the issue, and we moved provisioning onto its own dedicated network with a beefy switch. Hurrah! We were cooking up and rolling out players like a sweaty pizzaiolo cranking out order after order on a hot summer day.
Lessons Learned
The biggest thing I learned was the importance of choosing the right tools. It’s hard to know what the right tool for the job is, and when to reach for it. Sometimes it’s a struggle to know what to choose and whether it’s the best solution for the problem you’re trying to solve.
I also learned the importance of iterating fast and breaking things early. It would have saved me so much time if we had tested on-site earlier and at a bigger scale in the office.
Automation can be undervalued by a lot of teams. It can be slow to get started and has its pain points, but the long-term payoff can be big. It can be the difference between spending weeks on a problem versus days.
The Payoff
Our rollout, although rocky, was a success. We managed to meet the deadline, and we saved the client millions in potentially lost ad revenue. It took a lot of long hours at the office, overnight weekends, and driving all over the state to onboard the client, but we got them out of a tough spot and hopefully made a lifetime customer.
Conclusion
If I had to do it again, I would have pushed harder for my ideas in the beginning and made deployment and remote management the first things we built out. It could have saved us a couple of long nights in the office and let us iterate on the image itself more quickly. If you ever find yourself in a similar deployment setting, take it from me: start with the automation, then move on to building the image. You’ll thank me later.
