SSH Connection Troubleshooting Lab

Interactive Hands-On Guide for Support Analysts

⚠️ Data Sanitization Notice

This lab has been sanitized for portfolio purposes. All sensitive information, including company names, client identifiers, IP addresses, and server hostnames, has been replaced with generic placeholders (Company X, xxx.xxx.com, etc.). The troubleshooting methodology and technical approach remain authentic.

📋 Lab Overview

This interactive lab simulates a real-world SSH connectivity issue reported by a customer. You'll walk through the complete troubleshooting process, from initial assessment to resolution, learning industry-standard diagnostic techniques along the way.

🚨 Customer Reported Issue

P2 - High Priority | SSH Connectivity

Ticket #78432: "Multiple users from Client ABC cannot establish SSH connections to the remote VPN gateway for administrative tasks. Connection attempts time out after 30 seconds. Affecting 5 administrators across different locations."

Customer Organization: Client ABC (Enterprise - 1,500 users)
Affected Service: SSH access to VPN gateway (vpn-gw-abc.xxx.com)
Error Message: "ssh: connect to host vpn-gw-abc.xxx.com port 22: Connection timed out"
Impact: Administrators cannot manage VPN configurations, impacting deployment schedule
Reported By: Client ABC (IT Manager) - clientabc@xxx-client.com

🔍 Investigation Steps

Work through each step below in order:

Step 1: Initial Assessment & Information Gathering

Objective

Gather critical information about the issue before diving into technical diagnostics.

Questions to Ask

  • When did the issue start? (Timeline helps identify changes/deployments)
  • Is it affecting all users or specific ones?
  • Are users connecting from corporate network or remote locations?
  • Has anything changed recently? (Firewall rules, network config, SSH server updates)
  • Can users SSH to other servers successfully?

Initial Findings

Customer Response:

  • Issue started approximately 2 hours ago (around 14:00 UTC)
  • Affecting users from multiple locations (HQ and 3 branch offices)
  • Users can access other internal servers via SSH
  • No recent changes reported by customer's IT team
  • VPN clients are connecting successfully (only SSH admin access affected)

Your Analysis

Based on these findings:

  • ✅ Issue is specific to this VPN gateway (not client-side SSH problem)
  • ✅ Started recently (suggests a change or failure occurred)
  • ✅ Affects multiple locations (rules out local network issue)
  • ⚠️ Need to check our infrastructure logs for changes around 14:00 UTC

Step 2: Check SSH Service Status

Objective

Verify that the SSH service (sshd) is running on the target server.

Access the Server

First, access the VPN gateway through our management console (since direct SSH is failing):

bash
# Connect via console access (alternative method)
ssh -i ~/.ssh/company_x_admin.pem admin@console.xxx.com

# Once connected to console, access the VPN gateway
console> connect vpn-gw-abc.xxx.com

Check Service Status

bash
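# Check if SSH service is running
# (on Debian/Ubuntu the unit is named "ssh"; "sshd" is an alias for it)
sudo systemctl status sshd

# Alternative: check if the process is running
# (pgrep avoids matching the grep command itself)
pgrep -a sshd

# Check if SSH port is listening
# (the trailing space keeps :2222 and similar ports from matching)
sudo netstat -tuln | grep ':22 '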

Expected Output

✅ Normal Output (Service Running):
● sshd.service - OpenSSH server daemon
   Loaded: loaded (/lib/systemd/system/ssh.service; enabled)
   Active: active (running) since Wed 2025-02-19 14:05:32 UTC
   Main PID: 1234 (sshd)
   
tcp  0  0  0.0.0.0:22  0.0.0.0:*  LISTEN

Findings

Result: SSH service is running normally. The service restarted at 14:05 UTC (5 minutes after reported issue start). This is suspicious and warrants further investigation.
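
The gap between the reported issue start and the service start time can be computed rather than eyeballed. The sketch below is illustrative only: the helper name seconds_after_issue is hypothetical, it assumes GNU date, and it expects a line in the form printed by systemctl show -p ActiveEnterTimestamp with a time zone that date can parse.

```shell
# Hypothetical helper: seconds between an issue start time and a
# systemd "ActiveEnterTimestamp=..." line. Assumes GNU date (-d).
seconds_after_issue() {
    issue_start=$1       # e.g. "2025-02-19 14:00:00 UTC"
    prop_line=$2         # e.g. "ActiveEnterTimestamp=Wed 2025-02-19 14:05:32 UTC"
    ts=${prop_line#*=}   # drop the "ActiveEnterTimestamp=" prefix
    ts=${ts#* }          # drop the weekday name; the date itself is enough
    [ -n "$ts" ] || return 1   # empty value: the unit has never been active
    start_s=$(date -u -d "$issue_start" +%s) || return 1
    active_s=$(date -u -d "$ts" +%s) || return 1
    echo $((active_s - start_s))
}

# Usage on the gateway (not run here):
#   seconds_after_issue "2025-02-19 14:00:00 UTC" \
#       "$(systemctl show -p ActiveEnterTimestamp sshd)"
```

With the timestamps from this lab (issue at 14:00:00, service active since 14:05:32) the result is 332 seconds, which matches the roughly 5-minute gap noted above.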

Step 3: Analyze SSH and System Logs

Objective

Examine logs to identify what caused the SSH service restart and any connection failures.

Check Authentication Logs

bash
# View recent SSH authentication attempts
sudo tail -n 100 /var/log/auth.log | grep sshd

# Look for errors around 14:00 UTC
sudo journalctl -u sshd --since "14:00" --until "14:10"

# Check for connections that were closed or dropped
sudo grep "Connection closed" /var/log/auth.log | tail -20

Critical Finding in Logs

⚠️ Error Found:
Feb 19 14:00:15 vpn-gw-abc sshd[1234]: error: Bind to port 22 failed: Address already in use.
Feb 19 14:00:15 vpn-gw-abc sshd[1234]: fatal: Cannot bind any address.
Feb 19 14:05:30 vpn-gw-abc systemd[1]: sshd.service: Main process exited
Feb 19 14:05:32 vpn-gw-abc systemd[1]: sshd.service: Automatic restart scheduled
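
When auth.log is large, it can help to cut out only the sshd-related lines inside the incident window. The following is a minimal sketch, assuming the classic syslog layout shown in the excerpt above (month, day, HH:MM:SS, host, process) and that all lines come from the same day; the function name log_window is hypothetical.

```shell
# Hypothetical helper: print lines mentioning sshd from a syslog-style file
# whose HH:MM:SS field (3rd column) falls inside [start, end].
# Times are compared as strings, which works for zero-padded HH:MM:SS.
log_window() {
    file=$1; start=$2; end=$3
    awk -v s="$start" -v e="$end" \
        '$3 >= s && $3 <= e && /sshd/' "$file"
}

# Usage on the gateway (not run here):
#   sudo cat /var/log/auth.log > /tmp/auth.copy
#   log_window /tmp/auth.copy 14:00:00 14:10:00
```

The same window can then be applied to a copy of /var/log/syslog, so that entries from both sources can be read side by side.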

Check System Logs for Related Events

bash
# Check what else happened at 14:00
sudo journalctl --since "14:00" --until "14:10" | grep -i error

# Check if configuration was changed
sudo grep "sshd_config" /var/log/syslog

Analysis

🔍 Root Cause Identified

The SSH service failed at 14:00 because of a port binding conflict and was auto-restarted at 14:05, which accounts for the initial 5-minute outage window. The next step is to find out what else was trying to use port 22.

Step 4: Network & Firewall Verification

Objective

Verify network connectivity and firewall rules to ensure SSH traffic can reach the server.

Test Network Connectivity

bash
# From your workstation, test basic connectivity
ping -c 4 vpn-gw-abc.xxx.com

# Test if port 22 is reachable
telnet vpn-gw-abc.xxx.com 22
# Or use nc (netcat)
nc -zv vpn-gw-abc.xxx.com 22

# Check route to server
traceroute vpn-gw-abc.xxx.com
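
The exact wording of the client error already narrows the search. As general TCP behaviour, "Connection timed out" means no reply came back at all (dropped packets, routing, or a host that is down), while "Connection refused" means the host answered but nothing was listening on the port. The sketch below maps common OpenSSH client messages to a first guess; the function name and the hint texts are illustrative, not a diagnosis.

```shell
# Hypothetical helper: map an OpenSSH client error line to a first guess.
classify_ssh_error() {
    case $1 in
        *"Connection timed out"*)
            echo "no-reply: check firewall, security groups, routing, host up" ;;
        *"Connection refused"*)
            echo "reset: host reachable, check that sshd is listening on the port" ;;
        *"Could not resolve hostname"*)
            echo "dns: check name resolution" ;;
        *"Permission denied"*)
            echo "auth: network path is fine, check keys and accounts" ;;
        *)
            echo "unknown: rerun with ssh -vvv for more detail" ;;
    esac
}

# Usage:
#   classify_ssh_error "ssh: connect to host vpn-gw-abc.xxx.com port 22: Connection timed out"
```

A hint like this only suggests where to look first; the checks in this step are still needed to confirm it.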

Check Firewall Rules on Server

bash
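# Check iptables rules for destination port 22
# (matching "dpt:22" as a whole word avoids hits on packet counters or port 2222)
sudo iptables -L -n -v | grep -w 'dpt:22'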

# If using firewalld
sudo firewall-cmd --list-all

# Check if port 22 is allowed
sudo firewall-cmd --query-port=22/tcp
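
For a scripted check, the rule-specification output of iptables -S is easier to parse than the table view. This is a rough sketch under stated assumptions: the function name input_accepts_port is hypothetical, it recognises only plain "--dport N" rules in the INPUT chain, and it ignores rule order, default policy, multiport rules, custom chains, and nftables rulesets.

```shell
# Hypothetical helper: read `iptables -S` output on stdin and succeed if an
# INPUT rule accepts traffic to the given destination port.
input_accepts_port() {
    port=$1
    grep -Eq -- "^-A INPUT .*--dport ${port}( .*)? -j ACCEPT"
}

# Usage on the gateway (not run here):
#   sudo iptables -S | input_accepts_port 22 && echo "SSH allowed by an INPUT rule"
```

Because earlier DROP or REJECT rules are not considered, a positive result means only that a matching ACCEPT rule exists, not that traffic actually reaches it.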

Check Security Groups (Cloud Provider)

Since this is hosted infrastructure, verify security group rules via API:

bash
# Query security groups via API
curl -X GET "https://api.xxx.com/v2/infrastructure/security-groups?server=vpn-gw-abc" \
  -H "Authorization: Bearer xxxTOKENxxx" \
  -H "Content-Type: application/json"

Findings

Network Check Results:

  • ✅ Server is reachable via ping (no packet loss)
  • ✅ Port 22 is accessible from external networks
  • ✅ Firewall rules correctly allow SSH traffic from authorized IPs
  • ✅ Security group configuration unchanged
  • ✅ No network-level issues detected

Step 5: SSH Configuration & Certificate Validation

Objective

Examine SSH daemon configuration and validate host keys/certificates.

Review SSH Configuration

bash
# Show the active (non-comment, non-blank) lines of the SSH daemon configuration
sudo grep -Ev '^[[:space:]]*(#|$)' /etc/ssh/sshd_config

# Verify configuration syntax
sudo sshd -t

# Check which port(s) SSH is configured to use
sudo grep -i '^[[:space:]]*Port' /etc/ssh/sshd_config

Validate Host Keys

bash
# List SSH host keys
ls -la /etc/ssh/ssh_host_*

# Check key permissions (should be 600 for private keys); prints mode, owner:group, name (GNU stat)
stat -c '%a %U:%G %n' /etc/ssh/ssh_host_rsa_key

# Verify key fingerprints
ssh-keygen -lf /etc/ssh/ssh_host_rsa_key.pub
ssh-keygen -lf /etc/ssh/ssh_host_ed25519_key.pub
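
The permission check can be repeated for every private host key in one pass. A minimal sketch, assuming GNU stat and the 600 mode stated above; the function name check_host_key_modes is hypothetical, and some distributions ship host keys with a different mode and a dedicated group, so a warning here is a prompt to look, not proof of a fault.

```shell
# Hypothetical helper: report private host keys whose mode is not 600.
# Pass the directory that holds the keys, normally /etc/ssh.
check_host_key_modes() {
    dir=$1
    status=0
    for key in "$dir"/ssh_host_*_key; do
        [ -e "$key" ] || continue      # the glob matched nothing
        mode=$(stat -c '%a' "$key")
        if [ "$mode" != "600" ]; then
            echo "WARN $key has mode $mode (expected 600)"
            status=1
        fi
    done
    return $status
}

# Usage on the gateway (not run here):
#   check_host_key_modes /etc/ssh
```

The glob ends in _key, so the .pub files, which are meant to be world-readable, are skipped.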

Critical Discovery

🔴 Configuration Issue Found:
Port 22
Port 2222  ← ADDITIONAL PORT DIRECTIVE (conflicts with an existing listener)

# This line was recently added by an automation script
# It caused a port binding conflict when the service restarted

Root Cause Analysis

💡 Problem Identified

What happened:

  1. An automation script added "Port 2222" to sshd_config at 13:58 UTC (2 minutes before the issue began)
  2. A configuration reload was triggered at 14:00 UTC
  3. The SSH daemon attempted to bind to both ports 22 and 2222
  4. A port binding conflict occurred (another process was already using port 2222)
  5. The SSH service crashed and auto-restarted after 5 minutes
  6. On restart, the extra Port line caused intermittent binding issues

Verify the Conflict

bash
# Check what's using port 2222
sudo lsof -i :2222

# Alternative command (the trailing space keeps :22220 and similar from matching)
sudo netstat -tulpn | grep ':2222 '

Output:
COMMAND    PID  USER   FD   TYPE DEVICE SIZE/OFF NODE NAME
openvpn   5678  root    8u  IPv4  12345      0t0  TCP *:2222 (LISTEN)

Explanation: OpenVPN management interface is already using port 2222, creating a conflict with the new SSH configuration.
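
The comparison between configured ports and ports already taken can also be scripted. The sketch below is illustrative: config_ports and conflicting_ports are hypothetical names, and the list of in-use ports is assumed to be prepared separately, one port per line, with sshd's own listeners excluded.

```shell
# Hypothetical helper: print every port set by a Port directive
# (keyword matching is case-insensitive, as in sshd_config).
config_ports() {
    awk 'tolower($1) == "port" && $2 ~ /^[0-9]+$/ { print $2 }' "$1"
}

# conflicting_ports CONFIG USED_FILE: print configured ports that also
# appear in USED_FILE (ports held by other processes, one per line).
conflicting_ports() {
    config_ports "$1" | grep -Fx -f "$2"
}

# One possible way to build USED_FILE on the gateway (not run here):
#   sudo ss -Htlnp | grep -v '"sshd"' |
#       awk '{ n = split($4, a, ":"); print a[n] }' | sort -u > /tmp/used_ports
#   conflicting_ports /etc/ssh/sshd_config /tmp/used_ports
```

For the configuration in this lab and a used-ports list containing 2222, the second function prints 2222 and exits with status 0; it exits non-zero when there is no overlap.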

Step 6: Resolution & Verification

Objective

Resolve the configuration conflict and verify SSH access is restored.

Solution Steps

1. Backup Current Configuration

bash
# Always backup before making changes
sudo cp /etc/ssh/sshd_config /etc/ssh/sshd_config.backup.$(date +%Y%m%d-%H%M%S)

2. Remove Duplicate Port Configuration

bash
# Edit SSH configuration
sudo nano /etc/ssh/sshd_config

# Remove or comment out the conflicting port line:
# Port 2222  ← DELETE THIS LINE

# Or use sed for an automated fix (anchored so that only "Port 2222" is deleted)
sudo sed -i -E '/^[[:space:]]*Port[[:space:]]+2222[[:space:]]*$/d' /etc/ssh/sshd_config

3. Validate Configuration

bash
# Test configuration syntax
sudo sshd -t

# Should return nothing if config is valid
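
Backup, edit, and validation can be combined so that an invalid file never replaces the live configuration. This is a sketch under assumptions, not the procedure used in the incident: remove_port_safely is a hypothetical name, and the validator is passed in as a command that receives the candidate file path (on a real host, something like sudo sshd -t -f).

```shell
# Hypothetical helper: remove "Port N" from CONFIG only if the edited copy
# passes validation. Usage: remove_port_safely CONFIG PORT VALIDATOR...
remove_port_safely() {
    config=$1; port=$2; shift 2
    backup="$config.backup.$(date +%Y%m%d-%H%M%S)"
    cp -p "$config" "$backup" || return 1
    # Build the candidate file next to the original without touching it yet.
    awk -v p="$port" '!(tolower($1) == "port" && $2 == p)' \
        "$backup" > "$config.new" || return 1
    if "$@" "$config.new"; then
        # Copy over the original so its owner and mode are preserved.
        cp "$config.new" "$config" && rm -f "$config.new"
    else
        rm -f "$config.new"
        echo "validation failed; $config left unchanged (backup: $backup)" >&2
        return 1
    fi
}

# Usage on the gateway (run as root; not run here):
#   remove_port_safely /etc/ssh/sshd_config 2222 sshd -t -f
```

Validating the candidate before it is copied into place means a failed check needs no rollback; the timestamped backup remains available either way.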

4. Restart SSH Service

bash
# Reload SSH configuration (safer, doesn't disconnect)
sudo systemctl reload sshd

# Or restart if reload doesn't work
sudo systemctl restart sshd

# Verify service is running
sudo systemctl status sshd

5. Test SSH Connectivity

bash
# From external system, test SSH connection
ssh -v admin@vpn-gw-abc.xxx.com

# Test from multiple sources to ensure it's working
ssh -i ~/.ssh/client_abc_key.pem admin@vpn-gw-abc.xxx.com "uptime"

✅ Success Output:
OpenSSH_8.9p1, OpenSSL 3.0.2
debug1: Connecting to vpn-gw-abc.xxx.com [10.x.x.x] port 22.
debug1: Connection established.
debug1: Authentication succeeded (publickey).
 15:23:45 up 5 days,  3:18,  2 users,  load average: 0.15, 0.12, 0.08

Post-Resolution Actions

1. Document the Incident

Required Documentation:

  • Update ticket #78432 with root cause and resolution
  • Create incident report for post-mortem
  • Add to knowledge base: "SSH Port Binding Conflicts"
  • Update runbook with validation steps

2. Preventive Measures

bash
# Add monitoring alert for SSH service failures
curl -X POST "https://api.xxx.com/v2/monitoring/alerts" \
  -H "Authorization: Bearer xxxTOKENxxx" \
  -H "Content-Type: application/json" \
  -d '{
    "name": "SSH Service Down - VPN Gateway",
    "condition": "sshd_status != running",
    "severity": "critical",
    "notify": ["oncall-l3@xxx.com"]
  }'

# Add configuration validation to automation script
# Before: sshd config modification
# After: sshd -t validation + port conflict check
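
The port conflict check mentioned in the comments above could look roughly like the following pre-flight guard. It is a sketch only: add_port_if_free is a hypothetical name, the automation script itself is not shown in this lab, and the list of in-use ports is assumed to be supplied as a file with one port per line.

```shell
# Hypothetical pre-flight for an automation script: append "Port N" only if
# N is not held by another process and is not already configured.
add_port_if_free() {
    config=$1; port=$2; used_file=$3
    if grep -Fxq -- "$port" "$used_file"; then
        echo "refusing: port $port is already in use" >&2
        return 1
    fi
    if awk -v p="$port" \
        'tolower($1) == "port" && $2 == p { found = 1 } END { exit !found }' \
        "$config"; then
        echo "port $port already configured; nothing to do" >&2
        return 0
    fi
    printf 'Port %s\n' "$port" >> "$config"
}

# Usage (not run here), followed by the syntax check:
#   add_port_if_free /etc/ssh/sshd_config 2222 /tmp/used_ports && sudo sshd -t
```

Appending at the end of the file is only safe when the configuration has no Match blocks, since Port is not allowed inside one; running sshd -t afterwards catches that case.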

3. Customer Communication

📧 Sample Resolution Email

Subject: [RESOLVED] Ticket #78432 - SSH Access Restored


Hi John,

I'm pleased to inform you that SSH access to vpn-gw-abc.xxx.com has been fully restored as of 15:25 UTC.

Root Cause: A configuration management script inadvertently added an extra port definition that conflicted with an existing service, causing SSH to fail during a routine reload.

Resolution: We corrected the configuration, restarted the SSH service, and verified connectivity from multiple locations.

Prevention: We've implemented additional validation checks in our automation scripts and added monitoring alerts to detect similar issues faster.

Total downtime: ~85 minutes (14:00 - 15:25 UTC)

Please verify that your team can now access the gateway. Let me know if you have any questions!

Best regards,
Gabriel Mazer
L2 Support Engineer

📖 Quick Commands Reference

Command                 Purpose                     Common Options
systemctl status sshd   Check SSH service status    start, stop, restart, reload (other subcommands)
sshd -t                 Test SSH config syntax      -T (dump config), -d (debug mode)
netstat -tuln           Show listening ports        -p (show PIDs), -a (all connections)
lsof -i :PORT           Check what's using a port   -i (internet connections)
journalctl -u sshd      View SSH service logs       -f (follow), --since, --until
ssh -v                  Verbose SSH connection      -vv, -vvv (more verbose)
nc -zv HOST PORT        Test port connectivity      -w (timeout)
iptables -L -n          List firewall rules         -v (verbose), -t nat (NAT table)

🎯 Key Takeaways

  • Systematic Approach: Always follow a structured troubleshooting methodology: gather info → check service → analyze logs → verify network → validate config → resolve.
  • Logs Are Your Friend: Authentication and system logs contain critical information. Learn to correlate timestamps between different log sources.
  • Configuration Validation: Always run sshd -t before restarting the SSH service. A syntax error can lock you out.
  • Port Conflicts: Use lsof and netstat to identify port binding conflicts. They are a common issue when multiple services compete for the same port.
  • Communication: Keep customers informed throughout the investigation. Set clear expectations on timeline and next steps.
  • Prevention > Cure: After resolving the issue, implement monitoring and validation to prevent recurrence. Update runbooks and automation scripts.
  • Document Everything: Create KB articles for future reference. Your solution today helps the team tomorrow.

🔗 Related Resources

📚 Knowledge Base
  • KB-1234: SSH Best Practices
  • KB-5678: Firewall Troubleshooting
  • KB-9012: Service Recovery Procedures

🎓 Training Labs
  • Lab 2.2: VPN Gateway Management
  • Lab 3.1: Advanced Log Analysis
  • Lab 4.3: Network Diagnostics

📖 External Docs
  • OpenSSH Manual (man sshd)
  • RFC 4253: The Secure Shell (SSH) Transport Layer Protocol
  • Linux System Admin Guide

✅ Lab Completion Checklist

Before marking this lab as complete, ensure you can:

  • ☐ Explain the SSH troubleshooting methodology
  • ☐ Use systemctl commands to manage services
  • ☐ Read and interpret authentication logs
  • ☐ Identify port conflicts using lsof/netstat
  • ☐ Edit and validate SSH configuration safely
  • ☐ Write clear customer-facing resolution emails
  • ☐ Document incidents for knowledge sharing