Troubleshooting - AITHA IoT Infrastructure

Common Problems and Fixes

Services Don't Start

Symptom: ./infra.sh up fails

Possible causes and fixes:

  1. Docker network does not exist

    source .env
    docker network inspect "${DOCKER_NETWORK_NAME}"

    # If missing, create it
    docker network create "${DOCKER_NETWORK_NAME}"
  2. Ports already in use

    # Identify what is using the port
    sudo lsof -i :8080

    # Change port in .env
    THINGSBOARD_PORT=8081

    # Restart
    ./infra.sh restart thingsboard
  3. Volumes with wrong permissions

    bash ./deployment/scripts/utils/fix_permissions.sh
    ./infra.sh restart all
  4. Missing environment variables

    # Verify .env exists
    ls -la .env

    # If missing, copy from example
    cp .env.example .env
    # Edit .env with your values
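
The fourth check can be partly automated. As a minimal sketch, assuming .env and .env.example contain plain KEY=value lines, the helper below lists keys that exist in the example file but are missing from .env. The function names env_keys and missing_env_keys are illustrative and are not part of infra.sh.

```shell
# Print the variable names defined in an env file
# (comment lines and blank lines are ignored).
env_keys() {
  grep -E '^[A-Za-z_][A-Za-z0-9_]*=' "$1" | cut -d= -f1 | sort -u
}

# Print every key that is defined in the example file ($1)
# but not in the real env file ($2), one per line.
missing_env_keys() {
  env_keys "$1" | while IFS= read -r key; do
    grep -q "^${key}=" "$2" 2>/dev/null || printf '%s\n' "$key"
  done
}

# Example:
#   missing_env_keys .env.example .env
```

The command prints nothing when every key from .env.example is also present in .env.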

Docker Compose Errors

Error: "service 'X' failed to build"

# Show full error
./infra.sh build <project>

# Clear cache and rebuild
docker builder prune -a
./infra.sh build <project>

Error: "pull access denied"

Some images require Docker Hub login:

docker login
./infra.sh up

Error: "context deadline exceeded"

Slow network or Docker timeout:

# Limit parallel layer downloads in the Docker daemon
# (the default is 3; a lower value can help on slow links)
# Edit /etc/docker/daemon.json (Linux)
{
  "max-concurrent-downloads": 1
}

sudo systemctl restart docker
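
On a slow link, a pull that timed out may succeed on a later attempt. A minimal sketch of a retry wrapper with exponential backoff follows. The retry function is a hypothetical helper, not an infra.sh feature, and the attempt count and delay in the example are arbitrary.

```shell
# retry ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
# Runs COMMAND until it succeeds or ATTEMPTS runs have failed,
# doubling the delay after each failed attempt.
retry() {
  attempts=$1
  delay=$2
  shift 2
  n=1
  while ! "$@"; do
    if [ "$n" -ge "$attempts" ]; then
      return 1
    fi
    sleep "$delay"
    n=$((n + 1))
    delay=$((delay * 2))
  done
  return 0
}

# Example (assumes infra.sh is in the current directory):
#   retry 3 10 ./infra.sh up
```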

MQTT / Mosquitto

Symptom: Devices cannot connect

Checks:

  1. Mosquitto is running

    ./infra.sh status core
    # Should show mosquitto as "Up"
  2. Port is open

    # From the host
    telnet localhost 1883

    # Should connect (press Ctrl+] and then type quit to exit)
  3. Credentials are correct

    # Check credentials in .env
    grep MOSQUITTO .env

    # Test with mosquitto_sub
    docker exec -it mosquitto mosquitto_sub \
    -h localhost -t '#' -u mqttuser -P mqttpass -v
  4. Check logs

    ./infra.sh logs infrastructure/mosquitto
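
The credential test above hard-codes mqttuser and mqttpass. As a sketch, the helper below reads a value from .env so the test uses whatever is actually configured. The variable names MOSQUITTO_USER and MOSQUITTO_PASS are assumptions; substitute the names that grep MOSQUITTO .env shows.

```shell
# env_value KEY [FILE]
# Print the value of KEY from an env file (default: .env) without sourcing it.
# Surrounding single or double quotes are stripped; the last definition wins.
env_value() {
  sed -n "s/^$1=//p" "${2:-.env}" \
    | tail -n 1 \
    | sed -e 's/^"\(.*\)"$/\1/' -e "s/^'\(.*\)'\$/\1/"
}

# Example (MOSQUITTO_USER / MOSQUITTO_PASS are assumed variable names):
#   docker exec -it mosquitto mosquitto_sub -h localhost -t '#' \
#     -u "$(env_value MOSQUITTO_USER)" -P "$(env_value MOSQUITTO_PASS)" -v
```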

Symptom: Messages do not arrive

Step-by-step debug:

  1. Publish a test message

    docker exec -it mosquitto mosquitto_pub \
    -h localhost -t 'core2/test/pin/gpio1/get/telemetry' \
    -u mqttuser -P mqttpass -m '1'
  2. Subscribe to verify

    docker exec -it mosquitto mosquitto_sub \
    -h localhost -t 'core2/#' \
    -u mqttuser -P mqttpass -v
  3. Check Telegraf logs

    ./infra.sh logs ingestion/telegraf
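
If a subscriber shows nothing, one thing to rule out is a subscription filter that does not cover the published topic. The sketch below applies the standard MQTT wildcard rules ('+' matches one level, '#' matches all remaining levels) in plain shell. mqtt_topic_matches is an illustrative helper, and it ignores the special handling of topics that begin with '$'.

```shell
# mqtt_topic_matches FILTER TOPIC
# Returns 0 if TOPIC is covered by the subscription FILTER, 1 otherwise.
mqtt_topic_matches() {
  filter=$1
  topic=$2
  while :; do
    f=${filter%%/*}   # current level of the filter
    t=${topic%%/*}    # current level of the topic
    if [ "$f" = "#" ]; then
      return 0
    fi
    if [ "$f" != "+" ] && [ "$f" != "$t" ]; then
      return 1
    fi
    # Move both strings to the next level and note whether one exists.
    case $filter in */*) filter=${filter#*/}; f_more=1 ;; *) f_more=0 ;; esac
    case $topic in */*) topic=${topic#*/}; t_more=1 ;; *) t_more=0 ;; esac
    if [ "$f_more" -eq 0 ] && [ "$t_more" -eq 0 ]; then
      return 0
    fi
    if [ "$f_more" -eq 0 ]; then
      return 1
    fi
    if [ "$t_more" -eq 0 ]; then
      # Topic is exhausted: only a trailing '#' still matches (a/# covers a).
      [ "$filter" = "#" ]
      return
    fi
  done
}

# Example:
#   mqtt_topic_matches 'core2/#' 'core2/test/pin/gpio1/get/telemetry' && echo covered
```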

Telegraf

Symptom: Data does not reach InfluxDB

Checks:

  1. Telegraf is running

    ./infra.sh status ingestion/telegraf
  2. Check Telegraf logs

    ./infra.sh logs ingestion/telegraf

    What to look for:

    • E! indicates error
    • W! indicates warning
    • I! indicates info
  3. Check configuration

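    # Print a full sample config (generated output, not the file Telegraf loaded;
    # to see the loaded file, cat the config path mounted into the container)
    docker exec -it telegraf telegraf config

    # Dry run: load the config, gather metrics once, and print them without writing to outputs
    docker exec -it telegraf telegraf --test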
  4. Check InfluxDB token

    # In .env
    grep INFLUX_WRITE_TOKEN .env

    # Must match the one configured in InfluxDB
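
To get a quick picture of how many errors and warnings Telegraf is logging, the level markers listed above can be counted. This is a minimal sketch: summarize_telegraf_levels is a hypothetical helper that reads log lines on standard input, and it assumes the log command exits rather than following the log.

```shell
# Count Telegraf log lines by level marker (E!, W!, I!) read from stdin.
summarize_telegraf_levels() {
  awk '
    / E! / { e++ }
    / W! / { w++ }
    / I! / { i++ }
    END { printf "errors=%d warnings=%d info=%d\n", e, w, i }
  '
}

# Example:
#   ./infra.sh logs ingestion/telegraf | summarize_telegraf_levels
#   ./infra.sh logs ingestion/telegraf | grep ' E! '   # show only the errors
```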

Symptom: ThingsBoard bridge does not work

Bridge debug:

  1. Check bridge logs

    ./infra.sh logs ingestion/telegraf | grep -i "tb-bridge\|thingsboard"
  2. Check environment variables

    docker exec -it telegraf env | grep THINGSBOARD
    docker exec -it telegraf env | grep MOSQUITTO
  3. Check connectivity to ThingsBoard

    # From the Telegraf container (-c 3 stops after three pings)
    docker exec -it telegraf ping -c 3 thingsboard

    # Check ThingsBoard MQTT port
    docker exec -it telegraf telnet thingsboard 1883
  4. Restart only Telegraf

    ./infra.sh restart ingestion/telegraf

InfluxDB

Symptom: "unauthorized access"

Fix:

  1. Check token in .env

    grep INFLUX_WRITE_TOKEN .env
  2. Regenerate token in InfluxDB UI

    • Go to: http://localhost:8086
    • Log in with INFLUX_ADMIN_USER / INFLUX_ADMIN_PASS
    • Load Data → API Tokens → Generate API Token
    • Copy the token into .env
    • Restart Telegraf
  3. Verify organization and bucket

    # Must match values in .env
    INFLUX_ORG=unlix
    INFLUX_BUCKET=core2_dev

Symptom: Queries are very slow

Optimizations:

  1. Reduce the time range

    from(bucket: "core2_dev")
    |> range(start: -1h) // Last hour only
  2. Filter by specific tags

    from(bucket: "core2_dev")
    |> range(start: -1h)
    |> filter(fn: (r) => r.device_id == "esp32_001")
  3. Configure retention policy

    • Lower INFLUX_RETENTION in .env so old data expires sooner
    • Configure downsampling with Airflow

ThingsBoard

Symptom: Does not start / database error

Checks:

  1. PostgreSQL is running

    ./infra.sh status thingsboard
    # Should show thingsboard-db as "Up"
  2. Check PostgreSQL logs

    docker logs thingsboard-db
  3. Check ThingsBoard logs

    ./infra.sh logs platform/thingsboard
  4. Restart the ThingsBoard stack

    ./infra.sh down thingsboard
    ./infra.sh up thingsboard
  5. Full reset (DELETES DATA)

    ./infra.sh down thingsboard
    rm -rf platform/thingsboard/data/postgres-data/
    ./infra.sh up thingsboard

Symptom: "Tenant Administrator not found"

Fix:

  1. Create tenant admin from the container

    docker exec -it thingsboard bash

    # Inside the container
    cd /usr/share/thingsboard/bin
    ./thingsboard.sh --loadDemo
  2. Log in with default credentials

    • User: tenant@thingsboard.org
    • Password: tenant
  3. Change the password immediately

Symptom: Devices do not appear in ThingsBoard

Checks:

  1. Bridge is working

    ./infra.sh logs ingestion/telegraf | grep "TB connected"
  2. Device sent data

    # Verify in InfluxDB that there is data
    # UI: http://localhost:8086 → Data Explorer
  3. Create the device manually in ThingsBoard

    • UI → Devices → + Add Device
    • Name: must match device_id in the MQTT topic
    • Device type: default
    • Copy the Access Token
    • Use it as device_id in MQTT topics

Kafka

Symptom: Kafka does not start

Checks:

  1. Zookeeper is running

    ./infra.sh status kafka
    # Should show zookeeper as "Up" and healthy
  2. Check logs

    ./infra.sh logs kafka zookeeper
    ./infra.sh logs kafka kafka
  3. Remove corrupted data

    ./infra.sh down kafka
    rm -rf infrastructure/kafka/broker/data/*
    rm -rf infrastructure/kafka/zookeeper/data/*
    ./infra.sh up kafka

Symptom: Topics do not exist

Create topics manually:

# Exec into the Kafka container
docker exec -it kafka bash

# Create topic raw_data
kafka-topics.sh --create --topic raw_data \
--bootstrap-server localhost:9092 \
--partitions 3 --replication-factor 1

# Create topic processed_data
kafka-topics.sh --create --topic processed_data \
--bootstrap-server localhost:9092 \
--partitions 3 --replication-factor 1

# List topics
kafka-topics.sh --list --bootstrap-server localhost:9092
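
Running the create commands a second time fails if the topic already exists. As a sketch, the loop below adds kafka-topics.sh's --if-not-exists flag so it is safe to re-run. The KAFKA_RUN variable is a hypothetical hook: it defaults to running inside the kafka container, and setting it to echo gives a dry run that only prints the commands.

```shell
# Command prefix used to reach kafka-topics.sh (assumed container name: kafka).
KAFKA_RUN=${KAFKA_RUN:-docker exec kafka}

# ensure_topics TOPIC...
# Create each topic unless it already exists; stop at the first failure.
ensure_topics() {
  for topic in "$@"; do
    # $KAFKA_RUN is intentionally unquoted so it splits into command + arguments.
    $KAFKA_RUN kafka-topics.sh --create --if-not-exists --topic "$topic" \
      --bootstrap-server localhost:9092 \
      --partitions 3 --replication-factor 1 || return 1
  done
}

# Example:
#   ensure_topics raw_data processed_data
#   KAFKA_RUN=echo ensure_topics raw_data   # dry run
```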

Symptom: Microservices are not consuming messages

Debug:

  1. Verify Kafka has messages

    # Kafka UI: http://localhost:8089
    # View topic raw_data → Messages
  2. Check microservice logs

    ./infra.sh logs processing/stream-processors
  3. Check consumer group

    docker exec -it kafka kafka-consumer-groups.sh \
    --bootstrap-server localhost:9092 \
    --describe --group telegraf_unified_consumer
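
The --describe output includes a LAG column per partition; a lag that keeps growing suggests the consumers are not keeping up or are not running. The sketch below totals that column. total_lag is an illustrative helper that locates the column by its header name, so it assumes the header line is present in the input.

```shell
# Sum the LAG column of `kafka-consumer-groups.sh --describe` output (stdin).
# Rows whose lag is not a number (for example "-") are skipped.
total_lag() {
  awk '
    !col { for (i = 1; i <= NF; i++) if ($i == "LAG") col = i; next }
    $col ~ /^[0-9]+$/ { sum += $col }
    END { print sum + 0 }
  '
}

# Example:
#   docker exec kafka kafka-consumer-groups.sh \
#     --bootstrap-server localhost:9092 \
#     --describe --group telegraf_unified_consumer | total_lag
```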

Airflow

Symptom: Airflow UI does not load

Checks:

  1. All Airflow services are UP

    ./infra.sh status airflow

    Should be running:

    • postgres
    • redis
    • airflow-apiserver
    • airflow-scheduler
    • airflow-worker
  2. Check apiserver logs

    ./infra.sh logs airflow airflow-apiserver
  3. Restart all Airflow services

    ./infra.sh restart airflow
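
The list of expected services can also be checked mechanically. As a rough sketch, the helper below reads running container names on standard input and prints any expected service that is absent. missing_services is a hypothetical helper, and it uses a loose substring match, so a name like postgres is also satisfied by another stack's postgres container.

```shell
# missing_services NAME...
# Read running container names from stdin and print each NAME
# that does not appear in any of them.
missing_services() {
  running=$(cat)
  for svc in "$@"; do
    printf '%s\n' "$running" | grep -q -- "$svc" || printf '%s\n' "$svc"
  done
}

# Example:
#   docker ps --format '{{.Names}}' |
#     missing_services postgres redis airflow-apiserver airflow-scheduler airflow-worker
```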

Symptom: DAGs do not appear

Checks:

  1. DAG is in the correct folder

    ls -la processing/airflow/dags/
  2. No syntax errors

    # Scheduler logs
    ./infra.sh logs airflow airflow-scheduler
  3. Check permissions

    bash ./deployment/scripts/utils/fix_permissions.sh
    ./infra.sh restart airflow

Grafana

Symptom: Cannot connect to InfluxDB

Fix:

  1. Verify the data source in Grafana

    • UI → Configuration → Data Sources
    • InfluxDB should be configured with:
      • URL: http://influxdb:8086
      • Organization: value of INFLUX_ORG
      • Token: value of INFLUX_WRITE_TOKEN
      • Default Bucket: value of INFLUX_BUCKET
  2. Test connection

    • Click "Save & Test"
    • Should show "Data source is working"
  3. Verify Docker network

    docker exec -it grafana ping -c 3 influxdb

Debug Tools

List all containers

docker ps -a

Resource usage

docker stats

Docker networks

docker network ls
source .env
docker network inspect "${DOCKER_NETWORK_NAME}"

Volumes

docker volume ls

Real-time logs

./infra.sh logs <project> <service>
# Or with Docker directly
docker logs -f <container_name>

Exec into a container

docker exec -it <container_name> bash
# Or sh if bash is not available
docker exec -it <container_name> sh

Inspect a container

docker inspect <container_name>

Cleanup and Maintenance

Remove stopped containers

docker container prune

Remove unused images

docker image prune -a

Remove orphaned volumes

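docker volume prune
# On Docker 23.0 and later this removes only anonymous volumes;
# add -a to also remove unused named volumes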

Full system cleanup

# CAUTION: Deletes everything unused
docker system prune -a --volumes

Full project reset

# THIS DELETES ALL DATA
./infra.sh clean --data
docker system prune -a --volumes
source .env
docker network create "${DOCKER_NETWORK_NAME}"
./infra.sh up

Getting Help

Logs to check (in order)

  1. ./infra.sh status - See what's running
  2. ./infra.sh logs <project> - Logs for the problematic project/service
  3. docker logs <container> - Full container logs
  4. Local log files inside each service folder

Information to collect

When reporting an issue, include:

  • Output of ./infra.sh status
  • Output of docker ps -a
  • Relevant logs from ./infra.sh logs
  • Contents of .env (WITHOUT credentials)
  • Docker version: docker --version
  • OS/kernel: uname -a
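
To share .env without credentials, the secret values can be masked before copying the file into a report. This is a minimal sketch: redact_env is an illustrative helper that masks any variable whose name contains TOKEN, PASS, SECRET, or KEY in upper case. Review the output before sharing, since a secret stored under a different name would not be caught.

```shell
# redact_env [FILE]
# Print an env file (default: .env) with the values of secret-looking
# variables replaced by <redacted>. Other lines are printed unchanged.
redact_env() {
  sed -E 's/^([A-Za-z0-9_]*(TOKEN|PASS|SECRET|KEY)[A-Za-z0-9_]*)=.*/\1=<redacted>/' "${1:-.env}"
}

# Example:
#   redact_env .env > env-redacted.txt
```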

Diagnostic commands

# Collect diagnostic information
{
echo "=== System Info ==="
uname -a
echo ""
echo "=== Docker Version ==="
docker --version
docker compose version
echo ""
echo "=== Container Status ==="
docker ps -a
echo ""
echo "=== Networks ==="
docker network ls
echo ""
echo "=== Volumes ==="
docker volume ls
echo ""
echo "=== Project Status ==="
./infra.sh status
} > diagnostic.txt

# Share diagnostic.txt for analysis

General Tips

  1. Check logs first: most issues show up there
  2. Verify Docker networking: many issues are connectivity problems between containers
  3. File permissions: run bash ./deployment/scripts/utils/fix_permissions.sh when you see permission errors
  4. Ports in use: change ports in .env if there are conflicts
  5. Disk space: Docker uses a lot of space; prune regularly
  6. RAM: Airflow and ThingsBoard typically need at least 4 GB available

Additional Documentation