Troubleshooting - AITHA IoT Infrastructure
Common Problems and Fixes
Services Don't Start
Symptom: ./infra.sh up fails
Possible causes and fixes:
- Docker network does not exist
source .env
docker network inspect "${DOCKER_NETWORK_NAME}"
# If missing, create it
docker network create "${DOCKER_NETWORK_NAME}"
- Ports already in use
# Identify what is using the port
sudo lsof -i :8080
# Change port in .env
THINGSBOARD_PORT=8081
# Restart
./infra.sh restart thingsboard
- Volumes with wrong permissions
bash ./deployment/scripts/utils/fix_permissions.sh
./infra.sh restart all
- Missing environment variables
# Verify .env exists
ls -la .env
# If missing, copy from example
cp .env.example .env
# Edit .env with your values
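The last check can be automated. The sketch below lists variables that are missing or empty in an env file. The function name missing_env_vars is hypothetical, and the variable names in the usage comment are only the ones this guide mentions; the project's .env.example remains the authoritative list.

```shell
# Report variables that are missing or empty in an env file.
# Usage: missing_env_vars <env-file> <VAR>...
# Prints one missing name per line; returns 1 if any are missing,
# 2 if the env file itself does not exist.
missing_env_vars() {
  local env_file="$1" name missing=0
  shift
  if [ ! -f "$env_file" ]; then
    echo "env file not found: $env_file" >&2
    return 2
  fi
  for name in "$@"; do
    # Match NAME=<at least one character>, optionally prefixed by "export ".
    if ! grep -Eq "^(export[[:space:]]+)?${name}=.+" "$env_file"; then
      echo "$name"
      missing=1
    fi
  done
  return "$missing"
}

# Example (variable names taken from this guide, not an exhaustive list):
#   missing_env_vars .env DOCKER_NETWORK_NAME THINGSBOARD_PORT \
#     INFLUX_WRITE_TOKEN INFLUX_ORG INFLUX_BUCKET
```

An empty output and a zero exit status mean every listed variable has a value.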
Docker Compose Errors
Error: "service 'X' failed to build"
# Show full error
./infra.sh build <project>
# Clear cache and rebuild
docker builder prune -a
./infra.sh build <project>
Error: "pull access denied"
Some images require Docker Hub login:
docker login
./infra.sh up
Error: "context deadline exceeded"
Slow network or Docker timeout:
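# Limit parallel layer downloads so a slow link is less likely to time out
# (3 is Docker's default; try a lower value such as 1 if pulls still fail)
# Edit /etc/docker/daemon.json (Linux)
{
  "max-concurrent-downloads": 3
}
sudo systemctl restart docker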
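When the timeout comes from a slow or flaky link, simply retrying the command is often enough. A minimal sketch of a retry helper with a doubling delay follows; retry is a hypothetical shell function, not part of infra.sh, and the attempt count and delay in the example are arbitrary.

```shell
# Retry a command, doubling the delay after each failed attempt.
# Usage: retry <max-attempts> <initial-delay-seconds> <command> [args...]
# Returns 0 as soon as the command succeeds, 1 once attempts run out.
retry() {
  local max_attempts="$1" delay="$2" attempt=1
  shift 2
  until "$@"; do
    if [ "$attempt" -ge "$max_attempts" ]; then
      echo "giving up after ${attempt} attempts: $*" >&2
      return 1
    fi
    echo "attempt ${attempt} failed; retrying in ${delay}s" >&2
    sleep "$delay"
    attempt=$((attempt + 1))
    delay=$((delay * 2))
  done
}

# Example: up to 4 attempts, waiting 10s, 20s, then 40s between them.
#   retry 4 10 ./infra.sh up
```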
MQTT / Mosquitto
Symptom: Devices cannot connect
Checks:
- Mosquitto is running
./infra.sh status core
# Should show mosquitto as "Up"
- Port is open
# From the host
telnet localhost 1883
# Should connect (press Ctrl+] and type "quit" to exit)
- Credentials are correct
# Check credentials in .env
grep MOSQUITTO .env
# Test with mosquitto_sub
docker exec -it mosquitto mosquitto_sub \
-h localhost -t '#' -u mqttuser -P mqttpass -v
- Check logs
./infra.sh logs infrastructure/mosquitto
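telnet is not always installed, especially inside containers. As an alternative for the port check above, this sketch uses bash's built-in /dev/tcp redirection. port_open is a hypothetical helper, and it assumes that bash and the coreutils timeout command are available on the machine where it runs.

```shell
# Return 0 if a TCP connection to host:port succeeds within the timeout.
# Usage: port_open <host> <port> [timeout-seconds]   (default timeout: 3)
port_open() {
  local host="$1" port="$2" wait_s="${3:-3}"
  if [ -z "$host" ] || [ -z "$port" ]; then
    echo "usage: port_open <host> <port> [timeout-seconds]" >&2
    return 2
  fi
  # /dev/tcp is a bash feature, so the probe runs in an explicit bash child.
  # Inside the child, $0 is the host and $1 is the port.
  timeout "$wait_s" bash -c 'exec 3<>"/dev/tcp/$0/$1"' "$host" "$port" 2>/dev/null
}

# Example, from the host:
#   port_open localhost 1883 && echo "MQTT port reachable"
```

A successful connection only shows that something is listening on the port; credentials still need the mosquitto_sub test.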
Symptom: Messages do not arrive
Step-by-step debug:
- Publish a test message
docker exec -it mosquitto mosquitto_pub \
-h localhost -t 'core2/test/pin/gpio1/get/telemetry' \
-u mqttuser -P mqttpass -m '1'
- Subscribe to verify (start this in a second terminal before publishing; messages that are not retained are only delivered to active subscribers)
docker exec -it mosquitto mosquitto_sub \
-h localhost -t 'core2/#' \
-u mqttuser -P mqttpass -v
- Check Telegraf logs
./infra.sh logs ingestion/telegraf
Telegraf
Symptom: Data does not reach InfluxDB
Checks:
- Telegraf is running
./infra.sh status ingestion/telegraf
- Check Telegraf logs
./infra.sh logs ingestion/telegraf
What to look for:
  - E! indicates error
  - W! indicates warning
  - I! indicates info
- Check configuration
# Print Telegraf's full sample configuration (the loaded file is typically /etc/telegraf/telegraf.conf)
docker exec -it telegraf telegraf config
# Run the inputs once and print the metrics; configuration errors surface here
docker exec -it telegraf telegraf --test
- Check InfluxDB token
# In .env
grep INFLUX_WRITE_TOKEN .env
# Must match the one configured in InfluxDB
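To get a quick count of the E!, W! and I! markers mentioned in the log check above, the log output can be piped through a small filter. This is a sketch: telegraf_log_summary is a hypothetical helper, and it assumes each marker appears surrounded by spaces, as in Telegraf's default log format.

```shell
# Count Telegraf log lines by severity marker (E!, W!, I!).
# Reads log text on stdin and prints "errors=N warnings=N info=N".
telegraf_log_summary() {
  awk '
    / E! / { e++ }
    / W! / { w++ }
    / I! / { i++ }
    END { printf "errors=%d warnings=%d info=%d\n", e, w, i }
  '
}

# Example (container name "telegraf" as used in the docker exec commands above;
# 2>&1 is needed because docker logs replays the container's stderr separately):
#   docker logs telegraf 2>&1 | telegraf_log_summary
```

A non-zero error count is the cue to read the matching lines in full, for example with grep ' E! '.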
Symptom: ThingsBoard bridge does not work
Bridge debug:
- Check bridge logs
./infra.sh logs ingestion/telegraf | grep -i "tb-bridge\|thingsboard"
- Check environment variables
docker exec -it telegraf env | grep THINGSBOARD
docker exec -it telegraf env | grep MOSQUITTO
- Check connectivity to ThingsBoard
# From the Telegraf container
docker exec -it telegraf ping thingsboard
# Check ThingsBoard MQTT port
docker exec -it telegraf telnet thingsboard 1883
- Restart only Telegraf
./infra.sh restart ingestion/telegraf
InfluxDB
Symptom: "unauthorized access"
Fix:
- Check token in .env
grep INFLUX_WRITE_TOKEN .env
- Regenerate token in InfluxDB UI
  - Go to: http://localhost:8086
  - Log in with INFLUX_ADMIN_USER/INFLUX_ADMIN_PASS
  - Load Data → API Tokens → Generate API Token
  - Copy the token into .env
  - Restart Telegraf
- Verify organization and bucket
# Must match values in .env
INFLUX_ORG=unlix
INFLUX_BUCKET=core2_dev
Symptom: Very slow query
Optimizations:
- Reduce the time range
from(bucket: "core2_dev")
|> range(start: -1h) // Last hour only
- Filter by specific tags
from(bucket: "core2_dev")
|> range(start: -1h)
|> filter(fn: (r) => r.device_id == "esp32_001")
- Configure retention policy
  - Reduce INFLUX_RETENTION in .env for old data
  - Configure downsampling with Airflow
ThingsBoard
Symptom: Does not start / database error
Checks:
- PostgreSQL is running
./infra.sh status thingsboard
# Should show thingsboard-db as "Up"
- Check PostgreSQL logs
docker logs thingsboard-db
- Check ThingsBoard logs
./infra.sh logs platform/thingsboard
- Restart the ThingsBoard stack
./infra.sh down thingsboard
./infra.sh up thingsboard
- Full reset (DELETES DATA)
./infra.sh down thingsboard
rm -rf platform/thingsboard/data/postgres-data/
./infra.sh up thingsboard
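Because the full reset is irreversible, it can help to gate it behind an explicit confirmation. A minimal sketch follows; confirm_by_typing is a hypothetical helper and is not part of infra.sh.

```shell
# Ask the operator to type an exact word before a destructive step runs.
# Usage: confirm_by_typing <expected-word>
# Returns 0 only when the typed line matches the expected word exactly.
confirm_by_typing() {
  local expected="$1" answer=""
  printf 'This deletes data. Type "%s" to continue: ' "$expected" >&2
  # Tolerate end-of-input: an empty answer simply fails the comparison.
  IFS= read -r answer || true
  [ "$answer" = "$expected" ]
}

# Example: run the reset only after the operator types "thingsboard".
#   confirm_by_typing thingsboard && {
#     ./infra.sh down thingsboard
#     rm -rf platform/thingsboard/data/postgres-data/
#     ./infra.sh up thingsboard
#   }
```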
Symptom: "Tenant Administrator not found"
Fix:
- Create tenant admin from the container
docker exec -it thingsboard bash
# Inside the container
cd /usr/share/thingsboard/bin
./thingsboard.sh --loadDemo
- Log in with default credentials
  - User: tenant@thingsboard.org
  - Password: tenant
- Change the password immediately
Symptom: Devices do not appear in ThingsBoard
Checks:
- Bridge is working
./infra.sh logs ingestion/telegraf | grep "TB connected"
- Device sent data
# Verify in InfluxDB that there is data
# UI: http://localhost:8086 → Data Explorer
- Create the device manually in ThingsBoard
  - UI → Devices → + Add Device
  - Name: must match device_id in the MQTT topic
  - Device type: default
  - Copy the Access Token
  - Use it as device_id in MQTT topics
Kafka
Symptom: Kafka does not start
Checks:
- Zookeeper is running
./infra.sh status kafka
# Should show zookeeper as "Up" and healthy
- Check logs
./infra.sh logs kafka zookeeper
./infra.sh logs kafka kafka
- Remove corrupted data
./infra.sh down kafka
rm -rf infrastructure/kafka/broker/data/*
rm -rf infrastructure/kafka/zookeeper/data/*
./infra.sh up kafka
Symptom: Topics do not exist
Create topics manually:
# Exec into the Kafka container
docker exec -it kafka bash
# Create topic raw_data
kafka-topics.sh --create --topic raw_data \
--bootstrap-server localhost:9092 \
--partitions 3 --replication-factor 1
# Create topic processed_data
kafka-topics.sh --create --topic processed_data \
--bootstrap-server localhost:9092 \
--partitions 3 --replication-factor 1
# List topics
kafka-topics.sh --list --bootstrap-server localhost:9092
Symptom: Microservices are not consuming messages
Debug:
- Verify Kafka has messages
# Kafka UI: http://localhost:8089
# View topic raw_data → Messages
- Check microservice logs
./infra.sh logs processing/stream-processors
- Check consumer group
docker exec -it kafka kafka-consumer-groups.sh \
--bootstrap-server localhost:9092 \
--describe --group telegraf_unified_consumer
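The --describe output includes a LAG column per partition; a lag that keeps growing suggests the consumers are not keeping up or are not running. The sketch below totals that column. total_consumer_lag is a hypothetical helper, and the header-based column lookup assumes the output contains a header row with a LAG field.

```shell
# Sum the LAG column from `kafka-consumer-groups.sh --describe` output.
# The column position is read from the header row because it differs
# between Kafka versions. Non-numeric lag values (such as "-") are skipped.
total_consumer_lag() {
  awk '
    lag_col == 0 {
      # Still looking for the header: remember where LAG sits, then move on.
      for (i = 1; i <= NF; i++) if ($i == "LAG") lag_col = i
      next
    }
    $lag_col ~ /^[0-9]+$/ { total += $lag_col }
    END { print total + 0 }
  '
}

# Example (drop -it so the output can be piped cleanly):
#   docker exec kafka kafka-consumer-groups.sh \
#     --bootstrap-server localhost:9092 \
#     --describe --group telegraf_unified_consumer | total_consumer_lag
```

Running it twice a minute or so apart and comparing the totals shows whether the group is catching up.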
Airflow
Symptom: Airflow UI does not load
Checks:
- All Airflow services are UP
./infra.sh status airflow
Should be running:
  - postgres
  - redis
  - airflow-apiserver
  - airflow-scheduler
  - airflow-worker
- Check apiserver logs
./infra.sh logs airflow airflow-apiserver
- Restart all Airflow
./infra.sh restart airflow
Symptom: DAGs do not appear
Checks:
- DAG is in the correct folder
ls -la processing/airflow/dags/
- No syntax errors
# Scheduler logs
./infra.sh logs airflow airflow-scheduler
- Check permissions
bash ./deployment/scripts/utils/fix_permissions.sh
./infra.sh restart airflow
Grafana
Symptom: Cannot connect to InfluxDB
Fix:
- Verify the data source in Grafana
  - UI → Configuration → Data Sources
  - InfluxDB should be configured with:
    - URL: http://influxdb:8086
    - Organization: value of INFLUX_ORG
    - Token: value of INFLUX_WRITE_TOKEN
    - Default Bucket: value of INFLUX_BUCKET
- Test connection
  - Click "Save & Test"
  - Should show "Data source is working"
- Verify Docker network
docker exec -it grafana ping influxdb
Debug Tools
List all containers
docker ps -a
Resource usage
docker stats
Docker networks
docker network ls
source .env
docker network inspect "${DOCKER_NETWORK_NAME}"
Volumes
docker volume ls
Real-time logs
./infra.sh logs <project> <service>
# Or with Docker directly
docker logs -f <container_name>
Exec into a container
docker exec -it <container_name> bash
# Or sh if bash is not available
docker exec -it <container_name> sh
Inspect a container
docker inspect <container_name>
Cleanup and Maintenance
Remove stopped containers
docker container prune
Remove unused images
docker image prune -a
Remove orphaned volumes
docker volume prune
Full system cleanup
# CAUTION: Deletes everything unused
docker system prune -a --volumes
Full project reset
# THIS DELETES ALL DATA
./infra.sh clean --data
docker system prune -a --volumes
source .env
docker network create "${DOCKER_NETWORK_NAME}"
./infra.sh up
Getting Help
Logs to check (in order)
- ./infra.sh status - See what's running
- ./infra.sh logs <project> - Logs for the problematic project/service
- docker logs <container> - Full container logs
- Local log files inside each service folder
Information to collect
When reporting an issue, include:
- Output of ./infra.sh status
- Output of docker ps -a
- Relevant logs from ./infra.sh logs
- Contents of .env (WITHOUT credentials)
- Docker version: docker --version
- OS/kernel: uname -a
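To share .env without credentials, the values of sensitive keys can be masked first. A sketch, assuming that credential keys contain TOKEN, PASS, SECRET or KEY in their name; redact_env is a hypothetical helper, and the output should still be reviewed by eye before it is shared.

```shell
# Print an env file with the values of sensitive-looking keys masked.
# Keys whose name contains TOKEN, PASS, SECRET or KEY are treated as
# credentials; this pattern is an assumption, so extend it for your .env.
# Usage: redact_env <env-file>
redact_env() {
  sed -E 's/^([[:space:]]*(export[[:space:]]+)?[A-Za-z0-9_]*(TOKEN|PASS|SECRET|KEY)[A-Za-z0-9_]*)=.*/\1=***REDACTED***/' "$1"
}

# Example:
#   redact_env .env > env.redacted.txt
```

Usernames and hostnames are left untouched by this pattern, so remove them by hand if they are sensitive in your setup.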
Diagnostic commands
# Collect diagnostic information
{
echo "=== System Info ==="
uname -a
echo ""
echo "=== Docker Version ==="
docker --version
docker compose version
echo ""
echo "=== Container Status ==="
docker ps -a
echo ""
echo "=== Networks ==="
docker network ls
echo ""
echo "=== Volumes ==="
docker volume ls
echo ""
echo "=== Project Status ==="
./infra.sh status
} > diagnostic.txt
# Share diagnostic.txt for analysis
General Tips
- Check logs first: 90% of issues show up in logs
- Verify Docker networking: many issues are connectivity between containers
- File permissions: run bash ./deployment/scripts/utils/fix_permissions.sh when you see permission errors
- Ports in use: change ports in .env if there are conflicts
- Disk space: Docker uses a lot of space; prune regularly
- RAM: Airflow and ThingsBoard typically need at least 4 GB available
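
The RAM tip can be checked before starting the heavier stacks. This sketch reads MemAvailable from /proc/meminfo, so it applies to Linux hosts only; has_min_memory is a hypothetical helper, and the 4 GiB threshold in the example simply mirrors the tip above.

```shell
# Succeed if MemAvailable in a meminfo file is at least <min-gib> GiB.
# Usage: has_min_memory <min-gib> [meminfo-path]   (default: /proc/meminfo)
has_min_memory() {
  local min_gib="$1" meminfo="${2:-/proc/meminfo}" avail_kib
  # /proc/meminfo reports values in kB (KiB); field 2 is the number.
  avail_kib="$(awk '$1 == "MemAvailable:" { print $2 }' "$meminfo")"
  [ -n "$avail_kib" ] && [ "$avail_kib" -ge $((min_gib * 1024 * 1024)) ]
}

# Example:
#   has_min_memory 4 || echo "less than 4 GiB available; Airflow or ThingsBoard may struggle"
```

For the disk-space tip, docker system df reports how much space images, containers, and volumes are using.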