Troubleshooting (AEN 4.1.2)#
Overview¶
This is a troubleshooting guide for an Anaconda Enterprise Notebooks deployment.
Normal Operation¶
Server¶
Anaconda Enterprise Notebooks Server is installed in /opt/wakari/wakari-server.
You can get the status of the server processes with:
# service wakari-server status
wk-server RUNNING pid 20758, uptime 5 days, 0:30:23
worker RUNNING pid 20757, uptime 5 days, 0:30:23
or:
root@server # ps -Hu wakari
PID TTY TIME CMD
20756 ? 00:02:26 .supervisord
20757 ? 00:05:58 mtq-worker
20758 ? 00:00:08 wk-server
20765 ? 00:02:00 wk-server
20766 ? 00:01:55 wk-server
20767 ? 00:02:20 wk-server
20770 ? 00:02:02 wk-server
supervisord | details |
---|---|
description | Manages wakari-worker and multiple processes of wk-server |
user | wakari |
configuration | /opt/wakari/wakari-server/etc/supervisord.conf |
log | /opt/wakari/wakari-server/var/log/supervisord.log |
control | service wakari-server |
ports | none |
wk-server | details |
---|---|
description | Handles user interaction and passing jobs on to the wakari gateway. Access to it is managed by nginx. |
user | wakari |
command | /opt/wakari/wakari-server/bin/wk-server |
configuration | /opt/wakari/wakari-server/etc/wakari/ |
control | service wakari-server |
logs | /opt/wakari/wakari-server/var/log/wakari/server.log |
ports | 5000 (only on localhost) |
wakari-worker | details |
---|---|
description | Asynchronously executes tasks from wk-server |
user | wakari |
logs | /opt/wakari/wakari-server/var/log/wakari/worker.log |
control | service wakari-server |
nginx | details |
---|---|
description | Serves static files and acts as proxy for all other requests which are passed to wk-server process running on port 5000. |
user | nginx |
configuration | /etc/nginx/nginx.conf, /opt/wakari/wakari-server/etc/conf.d/www.enterprise.conf |
logs | /var/log/nginx/woc.log, /var/log/nginx/woc-error.log |
control | service nginx status |
port | 80 |
Nginx runs at least two processes:
- a master process running as the root user
- worker processes running as the nginx user
Gateway¶
Anaconda Enterprise Notebooks Gateway is installed in /opt/wakari/wakari-gateway.
You can get the status of the gateway processes with:
# service wakari-gateway status
wk-gateway RUNNING pid 1137, uptime 5 days, 1:59:28
or:
root@gateway # ps -Hu wakari
PID TTY TIME CMD
1136 ? 00:01:59 .supervisord
1137 ? 00:00:02 wk-gateway
supervisord | details |
---|---|
description | Manages the wk-gateway process. |
user | wakari |
configuration | /opt/wakari/wakari-gateway/etc/supervisord.conf |
log | /opt/wakari/wakari-gateway/var/log/supervisord.log |
control | service wakari-gateway |
ports | none |
wakari-gateway | details |
---|---|
description | Passes requests from Anaconda Enterprise Notebooks Server to the Compute Nodes. |
user | wakari |
configuration | /opt/wakari/wakari-gateway/etc/wakari/wk-gateway-config.json |
logs | |
working dir | / (root) |
port | 8089 (webcache) |
Compute Node¶
Anaconda Enterprise Notebooks Compute is installed in /opt/wakari/wakari-compute.
You can get the status of the compute node processes with:
# service wakari-compute status
wk-compute RUNNING pid 22050, uptime 3 days, 1:03:19
or:
root@compute # ps -Hu wakari
PID TTY TIME CMD
1150 ? 00:02:01 .supervisord
1152 ? 00:00:01 wk-compute
wk-compute loads each of these configuration files, in order:
- /etc/wakari/config.json
- /etc/wakari/compute-launcher-config.json
- ./compute-launcher-config.json
- Config file specified by the -c option
If an option is specified in multiple files, the last one encountered takes precedence.
supervisord | details |
---|---|
description | Manages the wk-compute process. |
user | wakari |
configuration | /opt/wakari/wakari-compute/etc/supervisord.conf |
log | /opt/wakari/wakari-compute/var/log/supervisord.log |
control | service wakari-compute |
working dir | /opt/wakari/wakari-compute/etc |
ports | none |
wk-compute | details |
---|---|
description | Launches compute processes |
user | wakari |
configuration | /opt/wakari/wakari-compute/etc/wakari/wk-compute-launcher-config.json, /opt/wakari/wakari-compute/etc/wakari/scripts/config.json |
logs | /opt/wakari/wakari-compute/var/log/wakari/compute-launcher.application.log, /opt/wakari/wakari-compute/var/log/wakari/compute-launcher.log |
working dir | / (root) |
control | service wakari-compute |
port | 5002 (rfe) |
Projects and Permissions¶
Projects live in the projectRoot folder on the compute node (by default,
/projects). The project directory is created the first time the project
is started; the start-project script clones it from
/opt/wakari/wakari-compute/lib/node_modules/wakari-compute-launcher/skeleton.
Project directory permissions are as follows:
owner: rwx, user who created the project
group: rwx, owner's group
other: --x, to allow access to the Public folder
ACL: rwx for any other team members
Files and subdirectories within the project directory have the same permissions as the project directory, except:
- The public folder and everything in it are world readable.
- Any files hardlinked into the root anaconda environment (/opt/wakari/anaconda) remain owned by the root or wakari users.
Project file and directory permissions are maintained by the
start-project script. All files and directories in the project will have
their permissions set when the project is started, except for files
owned by root or the AEN_SRVC_ACCT user (usually wakari or aen_admin).
Files owned by root or the AEN_SRVC_ACCT user do not have their
permissions changed, in order to avoid changing the permissions of the
linked files in /opt/wakari/anaconda.
CAUTION: DO NOT start a project as the AEN_SRVC_ACCT user (usually wakari
or aen_admin). The permissions system will not correctly manage project
files owned by this user.
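The project directory permissions listed above can be spot-checked from the shell. The following is a minimal sketch, assuming GNU stat is available on the compute node; check_project_mode is a hypothetical helper name, not part of the product.

```shell
# Hypothetical helper: report whether a project directory has the expected
# base mode (owner rwx, group rwx, other --x).
check_project_mode() {
    dir=$1
    # GNU stat: %A prints the mode as a string such as drwxrwx--x
    mode=$(stat -c '%A' "$dir") || return 2
    case "$mode" in
        drwxrwx--x) echo "ok: $dir $mode" ;;
        *)          echo "unexpected: $dir $mode"; return 1 ;;
    esac
}
```

Call it with the path of one project directory under the project root. The ACL entries for other team members are not visible in the mode string; running getfacl on the same directory lists them.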
General Troubleshooting Steps¶
Ensure that the Anaconda Enterprise Notebooks services are set to start at boot¶
(on all 3 components: Server, Gateway, and Compute nodes)
chkconfig --list | grep wakari
If they are missing, you can try adding them with:
chkconfig --add [wakari-server|wakari-gateway|wakari-compute]
Then services can be started safely with the restart command as follows:
service wakari-server restart
service wakari-gateway restart
service wakari-compute restart
These commands need to be executed on the appropriate nodes.
Ensure that all services are running¶
(see Normal Operation, above).
# service wakari-server status
wk-server RUNNING pid 20758, uptime 5 days, 0:30:23
worker RUNNING pid 20757, uptime 5 days, 0:30:23
root@server # service nginx status
nginx (pid 26303) is running...
# service wakari-gateway status
wk-gateway RUNNING pid 1137, uptime 5 days, 1:59:28
# service wakari-compute status
wk-compute RUNNING pid 22050, uptime 3 days, 1:03:19
If any of the processes are missing, restart them using the commands above.
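When checking several nodes, it can help to reduce the status output to the processes that need attention. This is a small sketch, assuming the status output keeps the "name STATE ..." layout shown above; not_running is a hypothetical helper name.

```shell
# Hypothetical helper: print the name of every supervised process whose
# state column is not RUNNING. Reads "service ... status" output on stdin.
not_running() {
    awk 'NF >= 2 && $2 != "RUNNING" { print $1 }'
}
```

For example, `service wakari-server status | not_running` prints nothing when both wk-server and worker are running.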
Check for Extraneous Processes¶
Use ps -Hu wakari to get a complete list of the processes running
under the wakari user account.
root@server # ps -Hu wakari
PID TTY TIME CMD
20756 ? 00:02:26 .supervisord
20757 ? 00:05:58 mtq-worker
20758 ? 00:00:08 wk-server
20765 ? 00:02:00 wk-server
20766 ? 00:01:55 wk-server
20767 ? 00:02:20 wk-server
20770 ? 00:02:02 wk-server
root@server # ps -f -C nginx
UID PID PPID C STIME TTY TIME CMD
root 26303 1 0 12:18 ? 00:00:00 nginx: master process /usr/sbin/nginx -c /etc/nginx/nginx.conf
nginx 26305 26303 0 12:18 ? 00:00:00 nginx: worker process
root@gateway # ps -Hu wakari
PID TTY TIME CMD
1136 ? 00:01:59 .supervisord
1137 ? 00:00:02 wk-gateway
root@compute # ps -Hu wakari
PID TTY TIME CMD
1150 ? 00:02:01 .supervisord
1152 ? 00:00:01 wk-compute
What’s normal:
- The wk-server, wk-gateway, and wk-compute processes should have the PIDs reported by supervisorctl.
- The nginx master process should have the PID reported by service nginx status.
- If you have installed more than one Anaconda Enterprise Notebooks component on a single machine, the processes from all of the installed components will show up on that machine.
- On the Compute node, any Anaconda Enterprise Notebooks applications currently being run by users will be present. For example:
root@compute # ps -Hu wakari
PID TTY TIME CMD
1150 ? 00:00:00 .supervisord
1152 ? 00:00:00 wk-compute
1340 ? 00:00:00 bash
1341 ? 00:00:00 notebookwrapper
If extra wk-server, wk-gateway, wk-compute, or supervisord processes are
present, use the kill command to remove them. Then restart the
services using service SERVICE_NAME restart as described above.
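Counting processes by command name makes a stray supervisord easier to spot. A minimal sketch, assuming the ps output format shown above; count_cmd is a hypothetical helper name.

```shell
# Hypothetical helper: count processes with a given command name in
# "ps -Hu wakari" output read from stdin. The header line is skipped.
count_cmd() {
    awk -v name="$1" 'NR > 1 && $NF == name { n++ } END { print n + 0 }'
}
```

For example, `ps -Hu wakari | count_cmd .supervisord` would be expected to print one per installed component on that machine; a higher number suggests an extra process.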
Check connectivity between the servers¶
Server to Gateways¶
On the Server, navigate to Admin/Data Centers. For each data center in
the list, check connectivity from the server to that gateway (in this
example, the gateway is http://gateway.example.com:8089):
root@server # curl --connect-timeout 5 http://gateway.example.com:8089 > /dev/null
Gateways to Compute Nodes¶
On the Server, navigate to Admin/Enterprise Resources. For each compute
resource in the list, open it and check the contents of the URL field to
ensure that it begins with either “http” or “https”. Check connectivity
to that URL from the corresponding Gateway. For example, if the URL is
http://compute.example.com:5002:
root@gateway # curl --connect-timeout 5 http://compute.example.com:5002 > /dev/null
Gateways to server¶
This path is used by the gateway configuration command
wk-gateway-configure. First, ensure that the gateway is linked to
the correct server in the configuration file and that the full server
URL is specified. Then check connectivity to the server.
root@gateway # grep WAKARI_SERVER /opt/wakari/wakari-gateway/etc/wakari/wk-gateway-config.json
"WAKARI_SERVER": "http://wakari.example.com",
root@gateway # curl --connect-timeout 5 http://wakari.example.com > /dev/null
root@gateway # curl --connect-timeout 5 http://error.example.com > /dev/null
curl: (7) Failed to connect to error.example.com port 80: Connection refused
If a connection fails, check the following items:
- Ensure that Gateways (Data Centers) and Compute nodes (Enterprise Resources) are correctly configured on the server.
- Verify that processes are listening on the configured ports:
root@server # netstat -plt
Active Internet connections (only servers)
Proto Recv-Q Send-Q Local Address Foreign Address State PID/Program name
tcp 0 0 *:http *:* LISTEN 26409/nginx
tcp 0 0 *:ssh *:* LISTEN 986/sshd
tcp 0 0 localhost:smtp *:* LISTEN 1063/master
tcp 0 0 *:complex-main *:* LISTEN 26192/python
tcp 0 0 localhost:27017 *:* LISTEN 29261/mongod
tcp 0 0 *:ssh *:* LISTEN 986/sshd
tcp 0 0 localhost:smtp *:* LISTEN 1063/master
- Check firewall settings/logs on both hosts to ensure that packets are not being blocked or discarded.
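The curl checks in this section can be wrapped in a loop when there are several gateways or compute nodes to probe. This is a sketch only: check_urls is a hypothetical helper name and the URLs are placeholders to be replaced with the values from the Admin pages.

```shell
# Hypothetical helper: probe each URL given as an argument and report
# whether a connection could be made.
check_urls() {
    failed=0
    for url in "$@"; do
        if curl --silent --connect-timeout 5 --output /dev/null "$url"; then
            echo "ok   $url"
        else
            # In the else branch, $? still holds curl's exit status.
            echo "FAIL $url (curl exit $?)"
            failed=1
        fi
    done
    return "$failed"
}
```

For example, `check_urls http://gateway.example.com:8089 http://compute.example.com:5002`. Like the commands above, this tests only that a connection can be made; curl exits with 0 even when the server answers with an HTTP error status.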
Check Configuration File Syntax¶
Use these commands to verify that each configuration file contains valid
JSON. Because json.tool accepts a single input file (a second argument is
treated as an output file), each file is checked in its own invocation:
root@server # for f in /opt/wakari/wakari-server/etc/wakari/*.json; do python -m json.tool "$f"; done
root@gateway # for f in /opt/wakari/wakari-gateway/etc/wakari/*.json; do python -m json.tool "$f"; done
root@compute # for f in /opt/wakari/wakari-compute/etc/wakari/*.json; do python -m json.tool "$f"; done
If a file is correct, its contents will be displayed. If there is a
syntax error in a file, the message No JSON object could be decoded
will be displayed instead. Edit the configuration file, ensuring correct
JSON syntax.
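To see only which files pass or fail, without the file contents, the check can be wrapped as follows. This is a sketch: validate_json_files is a hypothetical helper name, and the PYTHON variable is an assumed override for systems where the interpreter is not called python.

```shell
# Hypothetical helper: validate each JSON file separately and report the
# result without printing the file contents.
validate_json_files() {
    status=0
    for f in "$@"; do
        # json.tool exits non-zero when the file is not valid JSON.
        if "${PYTHON:-python}" -m json.tool "$f" > /dev/null 2>&1; then
            echo "valid   $f"
        else
            echo "INVALID $f"
            status=1
        fi
    done
    return "$status"
}
```

For example, on the compute node: `validate_json_files /opt/wakari/wakari-compute/etc/wakari/*.json`.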
Check file ownership¶
Verify that all files in /opt/wakari/anaconda belong to user/group wakari:
root@server # find /opt/wakari/anaconda \! -user wakari -print
root@server # find /opt/wakari/anaconda \! -group wakari -print
If any files are listed in the output, fix their ownership:
chown -R wakari:wakari /opt/wakari/anaconda
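The two find commands above can be combined into one pass. A minimal sketch; list_foreign_files is a hypothetical helper name.

```shell
# Hypothetical helper: list everything under a directory that is not
# owned by the expected user or not in the expected group.
list_foreign_files() {
    dir=$1 user=$2 group=$3
    find "$dir" \( ! -user "$user" -o ! -group "$group" \) -print
}
```

For example, `list_foreign_files /opt/wakari/anaconda wakari wakari`. Empty output means there is nothing to fix.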
Verify that POSIX ACLs are enabled¶
The acl option must be enabled on the filesystem containing the
project root directory.
First, determine the project root directory. If a custom projectRoot is configured, you can determine it with:
root@compute # grep projectRoot /opt/wakari/wakari-compute/etc/wakari/config.json
If not, the project root is /projects.
Either the mount options or the default options listed by tune2fs
should indicate that the acl option is enabled.
root@compute # fs=`df /projects | tail -1 | cut -d " " -f 1`
root@compute # mount | grep $fs
/dev/vda on / type ext4 (rw)
root@compute # tune2fs -l $fs | grep options
Default mount options: user_xattr acl
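The check can be scripted by searching both outputs for the acl option as a whole word, so that an option such as noacl does not count. This is a sketch; has_acl_option is a hypothetical helper name, and it relies on the -w option of grep, which common grep implementations provide.

```shell
# Hypothetical helper: succeed if the text on stdin lists "acl" as a
# whole word, as in mount output or tune2fs default mount options.
has_acl_option() {
    grep -qw acl
}

# Usage on the compute node (as root), shown commented out:
# fs=$(df /projects | tail -1 | cut -d " " -f 1)
# { mount | grep "$fs"; tune2fs -l "$fs" | grep options; } | has_acl_option \
#     && echo "acl enabled"
```

In the example output above, the mount line shows only (rw), but the tune2fs default mount options include acl, so the check would succeed.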
Clear Browser Cookies¶
When the Anaconda Enterprise Notebooks configuration changes, or the software is upgraded, cookies remaining in the browser can cause issues. Clearing cookies and logging in again can help to resolve problems.
Specific Problems¶
Problem | Cause | Solution |
---|---|---|
Browser indicates “too many redirects” | Cookies are out of date | Clear your browser’s cookies and cache, then try again. |
supervisorctl error: “unix:////opt/wakari/wakari-server/etc/supervisor.sock no such file” | “supervisord” is not running on the Server | Ensure that supervisord is included in the crontab, as described above. Then start supervisord manually. |
Data Center Not Found message when deleting a project | Datacenter has already been removed | As root, run /opt/wakari/wakari-server/bin/wk-server-admin remove-project --db-only <user> <project> |
Forgotten administrator password | | Use ssh to log in to the server as root, and run the command /opt/wakari/wakari-server/bin/wk-server-admin add-user wakari --admin -p <new password> -e <your email>. You can then log in to Anaconda Enterprise Notebooks as the wakari user with the new password you chose. |
Logs¶
The locations of the Anaconda Enterprise Notebooks log files for each process and application are shown in the tables above.
The Anaconda Enterprise Notebooks installers write their logs to /tmp/wakari_{server,gateway,compute}.log.
If log files grow too large they can be deleted. To set the logs to be more or less verbose, the Jupyter Notebook system has a setting ‘Application.log_level’. Setting ‘Application.log_level’ to ‘ERROR’ will make the logs less verbose than the default but still fairly informative.
Killed supervisord and “Error: This socket is closed.”¶
When the supervisor daemon “supervisord” is killed, information sent to standard output “stdout” and standard error “stderr” is held in a pipe which eventually fills up. Then attempting to start any app fails with an error message saying “This socket is closed.”
To prevent this problem, always shut down and restart the processes cleanly and do not shut down or kill supervisord without first shutting down wk-compute and other processes that use it.
To recover from this problem, shut down the process “wk-compute” with sudo kill -9. Then restart the supervisord and wk-compute processes:
sudo /etc/init.d/wakari-compute stop
sudo /etc/init.d/wakari-compute start
Service Error 502: Can not connect to the application manager¶
When a gateway node shows this error it means that a compute resource is not responding.
This error is caused when the process “wk-compute” has been shut down. To recover from this problem, restart the supervisord and wk-compute processes:
sudo /etc/init.d/wakari-compute stop
sudo /etc/init.d/wakari-compute start
“502 Communication Error” on Amazon Web Services¶
If you see a page showing “502 Communication Error: This gateway could not communicate with the Wakari server” and the IP address of the Wakari server, configure the AEN gateway to use the DNS hostname of the server. On Amazon Web Services (AWS) this will be the DNS hostname of the Amazon Elastic Compute Cloud (EC2) instance.
Invalid usernames¶
The first character of a username must be a letter [a-z] or a digit [0-9].
Each other character in a username may be a letter [a-z], a digit [0-9], a period [.], an underscore [_], or a hyphen [-].
The POSIX standard specifies that these characters are the portable filename character set, and that portable usernames have the same character set.
An Anaconda Enterprise Notebooks username should be at least 3 characters and no more than 25 characters.
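The username rules above can be expressed as a single regular expression. A minimal sketch for checking a candidate name before creating the account; is_valid_username is a hypothetical helper name, and the check is not taken from the product's own validation code.

```shell
# Hypothetical helper: succeed if the argument satisfies the username
# rules above (first character a-z or 0-9, remaining characters from
# a-z 0-9 . _ -, total length 3 to 25 characters).
is_valid_username() {
    # LC_ALL=C keeps the a-z range limited to lowercase ASCII letters.
    printf '%s\n' "$1" | LC_ALL=C grep -Eq '^[a-z0-9][a-z0-9._-]{2,24}$'
}
```

For example, `is_valid_username jane.doe-1 && echo accepted` prints accepted, while names such as _jane (leading underscore) or ab (too short) are rejected.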