Installing calamari on Ubuntu 14.04

Ceph officially provides calamari packages for Ubuntu 14.04. In China the Aliyun mirror can be used; add a repository config calamari.list with the following content:

deb http://mirrors.aliyun.com/ceph/calamari/1.3.1/ubuntu/trusty/ trusty main
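
For example (a sketch; placing the file under /etc/apt/sources.list.d/ and refreshing the package index are the usual workflow, assumed here rather than taken from the packaging docs):

echo 'deb http://mirrors.aliyun.com/ceph/calamari/1.3.1/ubuntu/trusty/ trusty main' | sudo tee /etc/apt/sources.list.d/calamari.list
sudo apt-get update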

Install:

sudo aptitude install calamari-server calamari-clients -y

Initialize:

sudo calamari-ctl initialize

Initialization prompts for an administrator username and password.

calamari runs its wsgi service under apache2; /etc/apache2/sites-enabled/calamari.conf contains:

  WSGIScriptAlias / /opt/calamari/conf/calamari.wsgi
  WSGIDaemonProcess calamari display-name=calamari-httpd processes=8 threads=1 maximum-requests=32
  WSGIProcessGroup calamari
  WSGIApplicationGroup %{GLOBAL}

Visiting http://10.218.137.144/ returns Internal Server Error. The log /var/log/calamari/calamari.log shows the following error:

  File "/usr/lib/python2.7/logging/__init__.py", line 928, in _open
    stream = open(self.baseFilename, self.mode)
IOError: [Errno 13] Permission denied: '/var/log/calamari/cthulhu.log'

The file permissions:

# ll /var/log/calamari/
total 176
-rw-r--r-- 1 www-data www-data  37988 Feb 14 13:40 calamari.log
-rw-r--r-- 1 root     root      20480 Feb 14 13:44 cthulhu.log
-rw-r--r-- 1 www-data www-data      0 Feb 14 13:35 exception.log
-rw-r--r-- 1 root     root       1821 Feb 14 13:40 httpd_access.log
-rw-r--r-- 1 root     root     108040 Feb 14 13:40 httpd_error.log
-rw-r--r-- 1 www-data www-data    147 Feb 14 13:35 info.log

Find the process that owns the file:

# lsof /var/log/calamari/cthulhu.log
COMMAND     PID USER   FD   TYPE DEVICE SIZE/OFF   NODE NAME
cthulhu-m 12113 root    5w   REG  252,0    21789 928194 /var/log/calamari/cthulhu.log
cthulhu-m 12113 root    7w   REG  252,0    21789 928194 /var/log/calamari/cthulhu.log

# pstree -psa 12113
init,1
  └─supervisord,11647 /usr/bin/supervisord -c /etc/supervisor/supervisord.conf
      └─cthulhu-manager,12113 /opt/calamari/venv/bin/cthulhu-manager
#... ...

calamari manages its processes with supervisor; /etc/supervisor/conf.d/calamari.conf contains:

[program:carbon-cache]
command=/opt/calamari/venv/bin/carbon-cache.py --debug --config /etc/graphite/carbon.conf start

[program:cthulhu]
command=/opt/calamari/venv/bin/cthulhu-manager

Fix this by starting cthulhu-manager as the apache2 user (i.e. www-data): add user=www-data to the corresponding supervisor program section.
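
The amended section would then read:

[program:cthulhu]
command=/opt/calamari/venv/bin/cthulhu-manager
user=www-data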

Restart supervisor:

sudo service supervisor stop
rm -rf /var/log/calamari/cthulhu.log
sudo service supervisor start

Restart apache2:

sudo service apache2 restart

One more URL, /api/v2/key, still fails with Server Error (500); /var/log/calamari/calamari.log shows:

  File "/opt/calamari/venv/local/lib/python2.7/site-packages/calamari_cthulhu-0.1-py2.7.egg/cthulhu/manager/rpc.py", line 341, in minion_status
    keys = self._salt_key.list_keys()
  File "/usr/lib/python2.7/dist-packages/salt/key.py", line 403, in list_keys
    for fn_ in salt.utils.isorted(os.listdir(dir_)):
OSError: [Errno 13] Permission denied: '/etc/salt/pki/master/minions'

Check the directory permissions:

# ll -al /etc/salt/pki/master/
total 28
drwx------ 5 root root 4096 Feb 14 13:32 .
drwxr-xr-x 4 root root 4096 Feb 14 13:32 ..

The permissions on /etc/salt/pki/master/minions itself were fine; the problem is that a directory becomes inaccessible when a parent directory lacks permissions. Fix the parent's permissions:

sudo chmod +rx /etc/salt/pki/master/

It then turned out that running salt commands such as salt-key -L automatically resets the permissions on /etc/salt/pki/master/. So revert cthulhu to running as root, and instead temporarily make the /var/log/calamari/cthulhu.log file readable and writable by the apache2 process.
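
One way to grant that temporary access (a sketch, assuming the www-data group seen in the listing above):

sudo chown root:www-data /var/log/calamari/cthulhu.log
sudo chmod g+rw /var/log/calamari/cthulhu.log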

At this point the calamari web pages load normally.

The graphite statistics can be viewed at URL /graphite/dashboard/.

Next, add a node to manage.

Install and deploy ceph on another machine (also Ubuntu 14.04).

Install salt-minion:

sudo aptitude install salt-minion -y

Add the config file /etc/salt/minion.d/calamari.conf with:

master: 10.218.137.144

Restart salt-minion:

sudo service salt-minion restart

The console then sees the minion connection:

$ sudo salt-key -L
Accepted Keys:
Unaccepted Keys:
e010218137211.zmf.tbsite.net
Rejected Keys:

Accept the minion on the console (this can also be done in the web UI):

$ sudo salt-key -a e010218137211.zmf.tbsite.net
The following keys are going to be accepted:
Unaccepted Keys:
e010218137211.zmf.tbsite.net
Proceed? [n/Y] y
Key for minion e010218137211.zmf.tbsite.net accepted.

A little later the ceph cluster can be viewed and managed in the console web UI.

To remove a minion:

$ sudo salt-key -d e010218137211.zmf.tbsite.net
The following keys are going to be deleted:
Accepted Keys:
e010218137211.zmf.tbsite.net
Proceed? [N/y] y
Key for minion e010218137211.zmf.tbsite.net deleted.

In the console web UI the host state changes to pending and management operations (e.g. updating configuration) cannot be executed. After stopping salt-minion and deleting the salt-key again, the host connection disappears, yet the cluster information is still visible. A while later the page shows:

Cluster Updates Are Stale. The Cluster isn't updating Calamari. Please contact Administrator 

So the information shown is evidently cached; the cluster information is no longer being updated.

Also, the charts (URL /dashboard/#graph) show no data, and /graphite/dashboard/ only shows calamari.carbon. metrics; the diamond collector has presumably not been installed and configured. After setting diamond up, servers. metrics appear, but the calamari pages apparently also need ceph. metrics: they request URL /graphite/render/?format=json-array&from=-1d&target=sumSeries(ceph.cluster.0b1505d2-64b6-4731-8357-56a75fb3d8c1.pool.all.num_read,ceph.cluster.0b1505d2-64b6-4731-8357-56a75fb3d8c1.pool.all.num_write) . How are the ceph. metrics collected(?). Another catch: salt's minion_id defaults to the full DNS hostname while diamond reports under the short hostname by default, so the two must be made consistent; setting the content of /etc/salt/minion_id to the short hostname is enough.
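
For example (a sketch; hostname -s prints the short hostname, the minion must be restarted to pick up the new id, and the master then sees a new minion id whose key has to be accepted again):

hostname -s | sudo tee /etc/salt/minion_id
sudo service salt-minion restart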

Some googling turned up simulated metrics code in calamari, https://github.com/ceph/calamari/blob/master/minion-sim/minion_sim/ceph_cluster.py , while the real ceph data should be collected via salt-minion.

The salt-minion log /var/log/salt/minion shows the following error:

2017-02-15 14:21:18,455 [salt.loaded.int.module.cmdmod][ERROR   ] Command '['apt-get', '-q', '-y', '-o', 'DPkg::Options::=--force-confold', '-o', 'DPkg::Options::=--force-confdef', '--allow-unauthenticated', 'install', 'diamond']' failed with return code: 100
2017-02-15 14:21:18,456 [salt.loaded.int.module.cmdmod][ERROR   ] stdout: Reading package lists...
Building dependency tree...
Reading state information...
2017-02-15 14:21:18,457 [salt.loaded.int.module.cmdmod][ERROR   ] stderr: E: Unable to locate package diamond
2017-02-15 14:21:18,488 [salt.state       ][ERROR   ] The following packages failed to install/update: diamond.

salt-minion tried to install the diamond package, which does not exist, hence the failure. The timestamp of this log entry matches the time the salt-minion process started, so it presumably tries to install diamond automatically at startup.

To let salt-minion find diamond on the system, install python-stdeb and then install diamond with pypi-install:

sudo aptitude install python-stdeb -y
sudo pip install --upgrade stdeb
sudo pypi-install diamond

After restarting salt-minion, it indeed receives the command to install diamond on startup. But because the package produced by pypi-install is named python-diamond, salt-minion still fails to install diamond. Installing the diamond .deb copied from the vsm installation packages and restarting salt-minion, diamond gets configured automatically a little later, set to use the full hostname, consistent with salt-minion's default. Yet calamari still shows no ceph. metrics.

Deploying calamari manually

Reading the code, the module providing the wsgi service is calamari_web. Googling for a pure-python wsgi container suggests gunicorn; see https://docs.djangoproject.com/en/1.10/howto/deployment/wsgi/gunicorn/

After installing gunicorn, run:

pyenv python venv/bin/gunicorn calamari_web.wsgi

Accessing the URL fails with:

  File "/home/observer.hany/workspace/calamari/calamari-web/calamari_web/settings.py", line 9, in <module>
    config = CalamariConfig()
  File "/home/observer.hany/workspace/calamari/calamari-common/calamari_common/config.py", line 32, in __init__
    raise ConfigNotFound("Configuration not found at %s" % self.path)
ConfigNotFound: Configuration not found at /etc/calamari/calamari.conf

The code shows the config file path can be overridden via the CALAMARI_CONFIG environment variable. There is also a dev/configure.py script that generates configuration for development and testing; running it requires jinja2. With the config generated, run:

pyenv env CALAMARI_CONFIG=dev/calamari.conf python venv/bin/gunicorn calamari_web.wsgi

To simplify this, CALAMARI_CONFIG=dev/calamari.conf can be baked into env.sh.
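
env.sh is used as a command wrapper throughout what follows; a minimal sketch of such a wrapper (this exact content is an assumption, not the file from the repo):

#!/bin/sh
# export the dev config path, then exec the wrapped command
export CALAMARI_CONFIG="$PWD/dev/calamari.conf"
exec "$@"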

The next error:

  File "/home/observer.hany/workspace/calamari/venv/local/lib/python2.7/site-packages/django/conf/__init__.py", line 48, in _setup
    self._wrapped = Settings(settings_module)
  File "/home/observer.hany/workspace/calamari/venv/local/lib/python2.7/site-packages/django/conf/__init__.py", line 152, in __init__
    raise ImproperlyConfigured("The SECRET_KEY setting must not be empty.")
ImproperlyConfigured: The SECRET_KEY setting must not be empty.

The relevant code in calamari_web/settings.py:

# Make this unique, and don't share it with anybody.
try:
    SECRET_KEY = open(config.get('calamari_web', 'secret_key_path'), 'r').read()
except IOError:
    # calamari-ctl hasn't been run yet, nothing will work yet.
    SECRET_KEY = ""

So calamari has not been initialized yet.

calamari uses postgresql by default. Not being familiar with it, I switched to mysql by adjusting the database configuration and table column definitions, and added the MySQL-python dependency.

Set up the database user privileges manually:

GRANT ALL ON calamari.* TO calamari@'%' IDENTIFIED BY 'calamari123';
GRANT ALL ON calamari.* TO calamari@'localhost' IDENTIFIED BY 'calamari123';
FLUSH PRIVILEGES;

Run:

./env.sh python venv/bin/calamari-ctl --devmode clear --yes-i-am-sure

It fails:

  File "/home/observer.hany/workspace/calamari/venv/local/lib/python2.7/site-packages/MySQLdb/connections.py", line 193, in __init__
    super(Connection, self).__init__(*args, **kwargs2)
OperationalError: (OperationalError) (1049, "Unknown database 'calamari'") None None

Create the database manually:

CREATE DATABASE calamari DEFAULT CHARACTER SET utf8;

clear now succeeds; next run:

./env.sh python venv/bin/calamari-ctl --devmode initialize

It fails:

OperationalError: (1044, "Access denied for user ''@'localhost' to database 'calamari'")

Tracing the code shows django failing while initializing the database, down in venv/lib/python2.7/site-packages/django/db/__init__.py:

connections = ConnectionHandler(settings.DATABASES)

The database information comes from settings; back in calamari_web/settings.py:

  DATABASES['default'] = {
      'ENGINE': config.get("calamari_web", "db_engine"),
      'NAME': config.get("calamari_web", "db_name"),
  }

The default database settings carry no username or password; the original postgresql setup authenticated as the current system user.

Add the missing database settings:

  DATABASES['default'] = {
      'ENGINE': config.get("calamari_web", "db_engine"),
      'NAME': config.get("calamari_web", "db_name"),
      'HOST': config.get("calamari_web", "db_host"),
      'USER': config.get("calamari_web", "db_user"),
      'PASSWORD': config.get("calamari_web", "db_password"),
  }

Also, per the cthulhu/calamari_ctl.py code, skip the setup_supervisor() call. After that, clear and initialize both succeed.

Start calamari_web:

./env.sh python venv/bin/gunicorn calamari_web.wsgi

Visiting http://localhost:8000/, the log shows:

  File "/home/observer.hany/workspace/calamari/calamari-common/calamari_common/remote/mon_remote.py", line 50, in <module>
    import rados
ImportError: No module named rados

The cause is missing ceph dependencies in the venv; copy them in:

$ rsync -aOviR /usr/lib/python2.7/./dist-packages/{ceph,rados,rbd}* venv/lib/python2.7/
sending incremental file list
cd+++++++++ dist-packages/
>f+++++++++ dist-packages/ceph_argparse.py
>f+++++++++ dist-packages/ceph_argparse.pyc
>f+++++++++ dist-packages/ceph_daemon.py
>f+++++++++ dist-packages/ceph_daemon.pyc
>f+++++++++ dist-packages/ceph_rest_api.py
>f+++++++++ dist-packages/ceph_rest_api.pyc
>f+++++++++ dist-packages/ceph_volume_client.py
>f+++++++++ dist-packages/ceph_volume_client.pyc
>f+++++++++ dist-packages/cephfs.x86_64-linux-gnu.so
>f+++++++++ dist-packages/rados.x86_64-linux-gnu.so
>f+++++++++ dist-packages/rbd.x86_64-linux-gnu.so
# ... ...

So the ceph, rados, etc. modules are shipped as .so extensions. Later I found that installing the python-cephlibs dependency also solves this, though its version is old, 0.94.5.post1.
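
For reference, that alternative is a single command:

./env.sh pip install python-cephlibs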

Retrying, the next error:

  File "/home/observer.hany/workspace/calamari/venv/local/lib/python2.7/site-packages/django/db/backends/__init__.py", line 154, in validate_thread_sharing
    % (self.alias, self._thread_ident, thread.get_ident()))
DatabaseError: DatabaseWrapper objects created in a thread can only be used in that same thread. The object with alias 'default' was created in thread id 139880272557824 and this is thread id 139880181709712.

Googling shows this is a known issue, see https://github.com/benoitc/gunicorn/issues/879 . The description says "So, by the looks of things, this is an issue with --preload and --worker-class gevent". Per gunicorn --help, the defaults are --preload False and --worker-class sync. Since preload is off by default, try gevent:

./env.sh python venv/bin/gunicorn --worker-class gevent calamari_web.wsgi

That resolves it; no more errors.

Then URL /graphite/dashboard/ returns 404. The code in calamari_web/urls.py:

try:
    import graphite.metrics.urls
    import graphite.dashboard.urls
except ImportError:
    pass
else:
    urlpatterns.extend([
        url(r'^render/?', include('graphite.render.urls')),
        url(r'^metrics/?', include('graphite.metrics.urls')),
        url(r'^%s/dashboard/?' % GRAPHITE_API_PREFIX.lstrip('/'), include('graphite.dashboard.urls')),

        # XXX this is a hack to make graphite visible where the 1.x GUI expects it,
        url(r'^graphite/render/?', include('graphite.render.urls')),
        url(r'^graphite/metrics/?', include('graphite.metrics.urls')),
    ])

    patch_views(graphite.metrics.urls)
    patch_views(graphite.dashboard.urls)

So the graphite dependency is not being found. Searching:

$ ./env.sh pip search graphite | grep graphite
# ... ...
graphite (0.71)                            - 
graphite-pymetrics (0.1.1)                 - A simple Python metrics framework to use with carbon/graphite.
# ... ...
graphite-web (0.9.15)                      - Enterprise scalable realtime graphing
graphite-query (0.11.3)                    - Utilities for querying graphite's database
graphite-metrics (15.03.0)                 - Standalone Graphite metric data collectors for various stuff thats not (or poorly) handled by other monitoring daemons
# ... ...

Searching under /opt/calamari on the machine with the packaged deployment:

$ find venv/ -path '*graphite/dashboard/urls*' -o -path '*graphite/metrics/urls*'
venv/lib/python2.7/site-packages/graphite/dashboard/urls.py
venv/lib/python2.7/site-packages/graphite/metrics/urls.py

$ venv/bin/pip list | grep graphite
graphite-web (0.9.12)

After installing graphite-web (0.9.15) in the project directory:

$ find venv/ -path '*graphite/dashboard/urls*' -o -path '*graphite/metrics/urls*'
venv/lib/python2.7/site-packages/opt/graphite/webapp/graphite/dashboard/urls.py
venv/lib/python2.7/site-packages/opt/graphite/webapp/graphite/dashboard/urls.pyc
venv/lib/python2.7/site-packages/opt/graphite/webapp/graphite/metrics/urls.py
venv/lib/python2.7/site-packages/opt/graphite/webapp/graphite/metrics/urls.pyc

observer.hany@ali-59375n:~/workspace/calamari
$ ./env.sh python -c 'import graphite'
Traceback (most recent call last):
  File "<string>", line 1, in <module>
ImportError: No module named graphite

Downgrading to graphite-web (0.9.12) behaves the same. The graphite-web package is the right one; how do we customize its install directory(?). Googling turns up the graphite install docs: http://graphite.readthedocs.io/en/latest/install-pip.html#installing-graphite-web-in-a-custom-location

The relevant install options can be passed via --install-option:

--prefix=
--install-scripts=
--install-lib=
--install-data=

Uninstall graphite-web and reinstall:

./env.sh pip install --install-option=--prefix=opt graphite-web

It fails:

/home/observer.hany/workspace/calamari/venv/local/lib/python2.7/site-packages/pip/commands/install.py:194: UserWarning: Disabling all use of wheels due to the use of --build-options / --global-options / --install-options.
  cmdoptions.check_install_build_global(options)
# ... ...
    running install_lib
    creating /opt/graphite
    error: could not create '/opt/graphite': Permission denied

Run:

./env.sh pip install --install-option=--install-lib=opt graphite-web

It fails:

  running install_data
  creating /opt/graphite
  error: could not create '/opt/graphite': Permission denied

Set both --prefix and --install-lib:

./env.sh pip install --install-option=--prefix=opt --install-option=--install-lib=optlib graphite-web

graphite ends up installed under optlib/ in the current directory, and ./env.sh pip list does not show graphite. So the given directory is used as the final install directory; there is no opt/ directory (is there nothing outside lib?). Install again:

./env.sh pip install --install-option=--prefix=venv/opt/graphite/ --install-option=--install-lib=venv/lib/python2.7/site-packages/ graphite-web

It fails:

  File "/home/observer.hany/workspace/calamari/venv/local/lib/python2.7/site-packages/pip/req/req_install.py", line 922, in install
    with open(inst_files_path, 'w') as f:
IOError: [Errno 2] No such file or directory: 'venv/lib/python2.7/site-packages/graphite_web-0.9.15-py2.7.egg-info/installed-files.txt'

After uninstalling, install again with absolute paths:

./env.sh pip install --install-option=--prefix=$PWD/venv/opt/graphite/ --install-option=--install-lib=$PWD/venv/lib/python2.7/site-packages/ graphite-web

This succeeds, and venv/opt/graphite/ now contains plenty of files:

$ ./env.sh pip list | grep grap
DEPRECATION: The default format will switch to columns in the future. You can use --format=(legacy|columns) (or define a format=(legacy|columns) in your pip.conf under the [list] section) to disable this warning.
graphite-web (0.9.15)

$ tree -L 1 venv/opt/graphite/
venv/opt/graphite/
├── bin
├── conf
├── examples
├── storage
└── webapp

Start calamari_web again and visit /graphite/dashboard/; the error log shows:

  File "/home/observer.hany/workspace/calamari/venv/local/lib/python2.7/site-packages/graphite/storage.py", line 4, in <module>
    import whisper
  File "/home/observer.hany/workspace/calamari/venv/local/lib/python2.7/site-packages/gevent/builtins.py", line 93, in __import__
    result = _import(*args, **kwargs)
ImportError: No module named whisper

After installing whisper, the next visit fails with:

  File "/home/observer.hany/workspace/calamari/venv/local/lib/python2.7/site-packages/MySQLdb/connections.py", line 36, in defaulterrorhandler
    raise errorclass, errorvalue
DatabaseError: (1146, "Table 'calamari.account_profile' doesn't exist")

Re-running calamari initialization fails with:

OSError: [Errno 2] No such file or directory: '/home/observer.hany/workspace/calamari/venv/webapp/content'

From the code, this is django failing while running collectstatic.

Reinstall graphite-web into the venv:

./env.sh pip install --upgrade --force-reinstall --install-option=--prefix=$PWD/venv/ --install-option=--install-lib=$PWD/venv/lib/python2.7/site-packages/ graphite-web

requirements/2.7/requirements.txt contains a similar command, which confirms this is the intended way to install it.

Re-initialize:

./env.sh python venv/bin/calamari-ctl --devmode clear --yes-i-am-sure && ./env.sh python venv/bin/calamari-ctl --devmode initialize

The database now contains the account_profile table.

Start calamari_web again and visit /graphite/dashboard/; the log reports:

  File "/home/observer.hany/workspace/calamari/venv/local/lib/python2.7/site-packages/graphite/render/datalib.py", line 323, in <module>
    def fetchRemoteData(requestContext, pathExpr, usePrefetchCache=settings.REMOTE_PREFETCH_DATA):
  File "/home/observer.hany/workspace/calamari/venv/local/lib/python2.7/site-packages/django/conf/__init__.py", line 54, in __getattr__
    return getattr(self._wrapped, name)
AttributeError: 'Settings' object has no attribute 'REMOTE_PREFETCH_DATA'

The code using this setting is venv/lib/python2.7/site-packages/graphite/render/datalib.py. The graphite settings are defined in venv/lib/python2.7/site-packages/graphite/settings.py; how should calamari_web merge graphite's settings(?).

For now, downgrade to graphite-web==0.9.12; the retry fails with:

  File "/home/observer.hany/workspace/calamari/venv/local/lib/python2.7/site-packages/gevent/builtins.py", line 93, in __import__
    result = _import(*args, **kwargs)
ImportError: No module named cairo

Searching shows cairo cannot be installed via pip; copy the system package over:

$ rsync -av /usr/lib/python2.7/./dist-packages/cairo -Oi venv/lib/python2.7/
sending incremental file list
cd+++++++++ cairo/
>f+++++++++ cairo/__init__.py
>f+++++++++ cairo/__init__.pyc
>f+++++++++ cairo/_cairo.x86_64-linux-gnu.so

Visiting /graphite/dashboard/ no longer errors; it redirects to the /login/ page.

The code in calamari_web/views.py:

# No need for login_required behaviour if auth is switched off.
if 'django.contrib.auth' not in settings.INSTALLED_APPS:
    login_required = lambda x: x
else:
    from django.contrib.auth.decorators import login_required


@login_required
def serve_dir_or_index(request, path, document_root):
	# ... ...

Editing calamari_web/settings.py to not include django.contrib.auth disables login. But then /api/v2/ returns 403 Authentication credentials were not provided, while /graphite/dashboard/ still redirects to login; login cannot be disabled completely, so restore django.contrib.auth.

When visiting /api/v2/, the static assets under /static/rest_framework/** fail to load. These assets ship with the djangorestframework framework itself:

$ ls venv/lib/python2.7/site-packages/rest_framework/static/rest_framework/
css  img  js

Test a reinstall:

$ ./env.sh pip install --upgrade --force-reinstall --install-option=--prefix=$PWD/opt --install-option=--install-lib=$PWD/optlib djangorestframework==2.3.12
# ... ...

$ ls optlib/rest_framework/static/rest_framework/
css  img  js

So its lib and static files are packaged together and cannot be split. A comparison shows webapp/content/rest_framework/ is in fact a copy of rest_framework/static/rest_framework/ with identical content. Further, the content under webapp/content is not committed to git; django copies these static files over from the dependencies during initialization. Per the apache2 configuration in calamari.conf, the static files should be reachable under URL /static/, so add the following line to calamari_web/urls.py:

  url('^static/(?P<path>.*)$', 'django.views.static.serve', {'document_root': STATIC_ROOT}),

The static files on the api pages now load fine.

Also, from calamari_web/settings.py:

CONTENT_DIR = os.path.join(config.get('graphite', 'root'), "webapp/content/")
if graphite:
    STATICFILES_DIRS = STATICFILES_DIRS + (os.path.join(config.get('graphite', 'root'), "webapp/content/"),)

So CONTENT_DIR is the directory used by graphite and differs from STATIC_ROOT. The dev-environment defaults for the two are:

[calamari_web]
# ... ...
static_root = /home/observer.hany/workspace/calamari/webapp/content/

[graphite]
root = /home/observer.hany/workspace/calamari/venv

In the packaged Ubuntu 14.04 deployment, STATIC_ROOT and CONTENT_DIR point to the same directory, /opt/calamari/webapp/content/.

Try logging in. First create a user, as the initialization output suggests:

./env.sh python venv/bin/calamari-ctl --devmode add_user admin --password admin --email $USER@alibaba-inc.com
./env.sh python venv/bin/calamari-ctl --devmode assign_role admin --role superuser

After logging in through the rest api login page /api/rest_framework/login/, visiting /graphite/dashboard/ works too. So login for the different modules (e.g. rest-api and graphite) is managed uniformly across the application. But graphite still has no data.

A small python tip, an introduction to entry_point scripts: https://chriswarrick.com/blog/2014/09/15/python-apps-the-right-way-entry_points-and-scripts/ . To generate the python entry_points scripts, the code can be installed into a temporary directory:

./env.sh pip install --install-option=--prefix=$PWD/build/ --install-option=--install-script=$PWD/bin --upgrade cthulhu/

Do not install into the default location, since we will not use the installed lib; but for the pkg_resources check to pass, a symlink is still needed:

ln -sf $PWD/build/lib/python2.7/site-packages/calamari_cthulhu-0.1-py2.7.egg-info/ -t venv/lib/python2.7/site-packages/

Next, run carbon. On the deployed machine its command line is:

/opt/calamari/venv/bin/python /opt/calamari/venv/bin/carbon-cache.py --debug --config /etc/graphite/carbon.conf start

Locate the carbon config files in the project directory:

$ grep -F CACHE_QUERY_PORT * -R
conf/carbon/carbon.conf:CACHE_QUERY_PORT = 7002
dev/etc/graphite/carbon.conf:CACHE_QUERY_PORT = 7002
# ... ...

Per the Makefile, conf/carbon/carbon.conf is the production config and dev/ holds the development config.

Install carbon:

make build-venv-carbon

carbon also ships example configs, under conf/ in its install directory:

$ ls venv/conf/carbon.*
venv/conf/carbon.amqp.conf.example  venv/conf/carbon.conf.example

$ grep conf/ venv/lib/python2.7/site-packages/carbon-0.9.15-py2.7.egg-info/installed-files.txt 
../../../../conf/storage-schemas.conf.example
../../../../conf/carbon.amqp.conf.example
../../../../conf/carbon.conf.example

Run:

$ ./env.sh carbon-cache.py --help
Usage: carbon-cache.py [options] <start|stop|status>

Options:
  -h, --help            show this help message and exit
  --debug               Run in the foreground, log to stdout
  --syslog              Write logs to syslog
  --nodaemon            Run in the foreground
  --profile=PROFILE     Record performance profile data to the given file
  --profiler=PROFILER   Specify the profiler to use
  --pidfile=PIDFILE     Write pid to the given file
  --umask=UMASK         Use the given umask when creating files
  --config=CONFIG       Use the given config file
  --whitelist=WHITELIST
                        Use the given whitelist file
  --blacklist=BLACKLIST
                        Use the given blacklist file
  --logdir=LOGDIR       Write logs in the given directory
  --instance=INSTANCE   Manage a specific carbon instance

$ ./env.sh carbon-cache.py status
Error: missing required config '/home/hanyong/workspace/calamari/venv/conf/carbon.conf'

$ ./env.sh carbon-cache.py --config dev/etc/graphite/carbon.conf start
Starting carbon-cache (instance a)

$ ./env.sh carbon-cache.py --config dev/etc/graphite/carbon.conf status
carbon-cache (instance a) is running with pid 6512

$ ls venv/storage/
carbon-cache-a.pid  index  lists  log  rrd  whisper

carbon-cache.py is a script with built-in daemon management; --debug makes it run in the foreground.

Visiting URL /graphite/dashboard/ again, the carbon. metrics are now visible.

Visiting URL /dashboard/ returns Server Error (500) with nothing on the console; the error only appears in the log file. How to print logs to the console? calamari_web/settings.py contains the LOGGING configuration, and an article on logging to stdout in django, http://codeinthehole.com/writing/console-logging-to-stdout-in-django/ , shows the way; add a console handler:

      'console': {
          'level': 'INFO',
          'class': 'logging.StreamHandler',
          'stream': sys.stderr,
          'formatter': 'simple'
      },
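
For output to actually appear, the handler also has to be attached to a logger; a minimal sketch (attaching it to django.request here is an illustrative choice, not the stock config):

      'loggers': {
          'django.request': {
              # hypothetical wiring: route request-handling errors to the console handler above
              'handlers': ['console'],
              'level': 'INFO',
          },
      },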

Re-running, the error now appears on the console:

  File "/home/hanyong/workspace/calamari/venv/local/lib/python2.7/site-packages/gevent/hub.py", line 606, in switch
    return greenlet.switch(self)
LostRemote: Lost remote after 10s heartbeat

The stack shows a zerorpc call timing out; the code that fails:

@login_required
def dashboard(request, path, document_root):
    client = zerorpc.Client()
    client.connect(config.get('cthulhu', 'rpc_url'))
    try:
        clusters = client.list_clusters()
    finally:
        client.close()
    if not clusters:
        return redirect("/manage/")
    return serve_dir_or_index(request, path, document_root)

So the cause is that cthulhu is not running.

On the deployed machine, the cthulhu command line takes no arguments:

/opt/calamari/venv/bin/python /opt/calamari/venv/bin/cthulhu-manager

Check --help:

$ ./env.sh bin/cthulhu-manager --help
usage: cthulhu-manager [-h] [--debug]

Calamari management service

optional arguments:
  -h, --help  show this help message and exit
  --debug     print log to stdout

Run:

$ ./env.sh bin/cthulhu-manager --debug
Traceback (most recent call last):
  File "/home/hanyong/workspace/calamari/bin/cthulhu-manager", line 11, in <module>
    load_entry_point('calamari-cthulhu==0.1', 'console_scripts', 'cthulhu-manager')()
  File "/home/hanyong/workspace/calamari/cthulhu/cthulhu/manager/manager.py", line 362, in main
    import salt.utils.event
  File "/home/hanyong/workspace/calamari/venv/local/lib/python2.7/site-packages/gevent/builtins.py", line 93, in __import__
    result = _import(*args, **kwargs)
ImportError: No module named salt.utils.event

cthulhu evidently communicates with salt. Install salt:

./env.sh pip install salt

Run again:

$ ./env.sh bin/cthulhu-manager --debug
2017-02-18 18:01:17,685 - INFO - calamari MANHOLE: Not patching os.fork and os.forkpty. Oneshot activation is done by signal 10
2017-02-18 18:01:17,686 - INFO - calamari.remote.mon ceph_command 0 ceph version 10.2.5 (c461ee19ecbc0c5c330aca20f7392c9a00730367)
 
2017-02-18 18:01:17,686 - DEBUG - calamari.remote.mon server_heartbeat: {'services': {}, 'boot_time': 1487403530, 'ceph_version': '10.2.5'}
2017-02-18 18:01:17,687 - DEBUG - calamari.remote.mon cluster_heartbeat: {}
2017-02-18 18:01:17,693 - DEBUG - calamari Events will be emitted to salt event bus
2017-02-18 18:01:18,773 - INFO - calamari Manager starting
2017-02-18 18:01:18,773 - INFO - calamari RpcThread bind...
2017-02-18 18:01:18,774 - INFO - calamari RpcThread run...
2017-02-18 18:01:18,774 - INFO - calamari TopLevelEvents running
2017-02-18 18:01:18,774 - INFO - calamari Running ProcessMonitorThread
2017-02-18 18:01:18,774 - DEBUG - calamari Eventer running
2017-02-18 18:01:18,775 - INFO - calamari Eventer._emit: INFO/Calamari server started
2017-02-18 18:01:18,775 - DEBUG - calamari Eventer running _emit_salt
2017-02-18 18:01:18,775 - DEBUG - calamari Eventer running _emit_salt
2017-02-18 18:01:18,775 - DEBUG - calamari Eventer._emit_to_salt_bus: Tag:calamari/ceph/calamari/started | Data: {'message': 'Calamari server started', 'severity': 'INFO', 'tags': {}}
2017-02-18 18:01:18,776 - INFO - calamari.server_monitor Starting ServerMonitor
2017-02-18 18:01:27,688 - DEBUG - calamari.remote.mon get_heartbeats mon_sockets {}
2017-02-18 18:01:27,689 - INFO - calamari.remote.mon ceph_command ['ceph', '--version']
2017-02-18 18:01:27,786 - INFO - calamari.remote.mon ceph_command 0 ceph version 10.2.5 (c461ee19ecbc0c5c330aca20f7392c9a00730367)
 
2017-02-18 18:01:27,788 - DEBUG - calamari.remote.mon server_heartbeat: {'services': {}, 'boot_time': 1487403530, 'ceph_version': '10.2.5'}
2017-02-18 18:01:27,788 - DEBUG - calamari.remote.mon cluster_heartbeat: {}
2017-02-18 18:01:27,788 - DEBUG - calamari.remote.mon listen: ev: 2
2017-02-18 18:01:27,789 - DEBUG - calamari.server_monitor ServerMonitor.on_server_heartbeat: han2015dev
2017-02-18 18:01:27,789 - INFO - calamari.server_monitor Saw server <ServerState 'han2015dev'> for the first time
2017-02-18 18:01:27,789 - INFO - calamari Eventer._emit: INFO/Added server han2015dev
2017-02-18 18:01:27,789 - DEBUG - calamari Eventer running _emit_salt
2017-02-18 18:01:27,790 - DEBUG - calamari Eventer running _emit_salt
2017-02-18 18:01:27,790 - DEBUG - calamari Eventer._emit_to_salt_bus: Tag:calamari/ceph/server/added | Data: {'message': 'Added server han2015dev', 'severity': 'INFO', 'tags': {'fqdn': 'han2015dev', 'fsid': None}}
2017-02-18 18:01:27,790 - DEBUG - calamari.remote.mon listen: ev: 2
2017-02-18 18:01:37,789 - DEBUG - calamari.remote.mon get_heartbeats mon_sockets {}
2017-02-18 18:01:37,789 - INFO - calamari.remote.mon ceph_command ['ceph', '--version']
2017-02-18 18:01:37,869 - INFO - calamari.remote.mon ceph_command 0 ceph version 10.2.5 (c461ee19ecbc0c5c330aca20f7392c9a00730367)

It looks like cthulhu-manager registers the server automatically after startup(?). The web UI is not in place yet, so the related pages cannot be checked.

Next, build the calamari clients, i.e. the web UI; the code lives at git@github.com:ceph/romana.git . Its most recently updated branch is master, so use master; this differs from calamari, whose most active branch is 1.5. Per the Makefile, each romana module is an independent subdirectory and can be built on its own; running make build-real in the project root builds all modules in turn. The submodule build in Makefile.sub:

build-stamp: $(QUOTED_SRCS)
	npm install --loglevel warn
	bower --allow-root --config.interactive=false install
	grunt --no-color saveRevision
	grunt --no-color build
	touch build-stamp

Prepare the build environment and install the tooling.

Install nodejs:

git clone git@github.com:creationix/nvm.git ~/workspace/nvm/
export NVM_NODEJS_ORG_MIRROR=https://npm.taobao.org/dist
source ~/workspace/nvm/nvm.sh
nvm install 6

Configure the npm mirror by writing ~/.npmrc as follows:

cat <<EOF >~/.npmrc
registry=https://registry.npm.taobao.org
disturl=https://npm.taobao.org/dist
EOF

Install the tools:

npm install -g cnpm bower grunt

Trying to build the login module, the build also downloads phantomjs from github; a search shows its mirror URL needs to be set as well:

export PHANTOMJS_CDNURL=https://npm.taobao.org/dist/phantomjs
npm install -g phantomjs

The next build fails with:

Running "concurrent:dist" (concurrent) task
    Warning: Running "compass:dist" (compass) task
    Warning: You need to have Ruby and Compass installed and in your system PATH for this task to work. More info: https://github.com/gruntjs/grunt-contrib-compass Use --force to continue.

Set up the ruby environment:

sudo gem update --system
sudo gem sources --add https://gems.ruby-china.org/ --remove https://rubygems.org/
gem sources -l
sudo aptitude install ruby ruby-dev -y

Install:

sudo gem install --no-document compass

The build now succeeds.

Building the manage module fails:

> gifsicle@0.1.7 postinstall /home/hanyong/workspace/romana/manage/node_modules/gifsicle
> node index.js

path.js:7
    throw new TypeError('Path must be a string. Received ' + inspect(path));
    ^

TypeError: Path must be a string. Received { url: 'https://raw.github.com/imagemin/gifsicle-bin/v0.1.7/vendor/linux/x64/gifsicle',
  name: 'gifsicle',
  os: 'linux',
  arch: 'x64' }
    at assertPath (path.js:7:11)
    at Object.basename (path.js:1355:5)
    at /home/hanyong/workspace/romana/manage/node_modules/download/index.js:35:43
    at each (/home/hanyong/workspace/romana/manage/node_modules/each-async/each-async.js:63:4)
    at module.exports (/home/hanyong/workspace/romana/manage/node_modules/download/index.js:33:5)
    at /home/hanyong/workspace/romana/manage/node_modules/bin-wrapper/index.js:108:20
    at /home/hanyong/workspace/romana/manage/node_modules/bin-wrapper/index.js:141:24
    at /home/hanyong/workspace/romana/manage/node_modules/bin-check/index.js:30:20
    at /home/hanyong/workspace/romana/manage/node_modules/executable/index.js:39:20
    at FSReqWrap.oncomplete (fs.js:123:15)

A quick look shows bin-wrapper failing because its download dependency is too old to accept this argument format. gifsicle looks image-processing related, and I have seen similar errors caused by outdated imagemin dependencies. package.json has a grunt-contrib-imagemin dependency; bumping it from ~0.3.0 to the latest ~1.0.1 makes the build pass. admin and dashboard then build fine as well.

Running make build-real in the project root builds the submodules in turn and produces dashboard/dist/scripts/config.json.

Copy the built files into calamari's STATIC_ROOT directory. The script utils/11-cp-ui.sh does exactly this, but its directories differ from ours, so write the copy command by hand:

for e in login/ manage/ admin/ dashboard/ ; do rsync -avi $e/dist/ ../calamari/webapp/content/$e/ ; done

Visiting URL /dashboard/ redirects to /manage/#/first since there is no cluster information yet. The network requests show /api/v2/key returning 404.

This URL appeared earlier; it should fetch the salt key list. Locate the api code by its API doc text:

$ grep -F 'Ceph servers authentication with the Calamari using a key pair' -R *
# ... ...
rest-api/calamari_rest/views/v2.py:Ceph servers authentication with the Calamari using a key pair.  Before

This locates SaltKeyViewSet; search for its references:

grep -F SaltKeyViewSet rest-api/ -R

Nothing found. On the deployed machine, however, there are references:

$ grep -F SaltKeyViewSet -R venv/lib/python2.7/site-packages/calamari_rest_api-0.1-py2.7.egg/ --line-number
venv/lib/python2.7/site-packages/calamari_rest_api-0.1-py2.7.egg/calamari_rest/urls/v2.py:115:    url(r'^key$', calamari_rest.views.v2.SaltKeyViewSet.as_view(
venv/lib/python2.7/site-packages/calamari_rest_api-0.1-py2.7.egg/calamari_rest/urls/v2.py:118:        calamari_rest.views.v2.SaltKeyViewSet.as_view({'get': 'retrieve', 'patch': 'partial_update', 'delete': 'destroy'})),

The change history of rest-api/calamari_rest/urls/v2.py shows its most recent change removed this API:

Author: Gregory Meno <gmeno@redhat.com>  2016-06-11 06:43:25
Follows: v1.4.0-rc15
Precedes: v1.4.1

    remove urls for logs, keys, grains

Not sure why; it feels like the client code has fallen behind the rest-api code. Building the romana 1.3 branch gives the same result.

The calamari 1.3 branch

To work with the clients and understand how calamari operates and collects metrics, try building and running the calamari 1.3 branch:

git branch 1.3 github/1.3
git checkout 1.3
git reset 1.5 -- env.sh Makefile.dev
git checkout .

Installing the pip dependencies fails for lack of libpq-dev. To keep the changes small, stop switching to mysql this time.

Install:

sudo aptitude install postgresql libpq-dev -y

Add dev/requirements.txt:

gunicorn

Change Makefile.dev to:

.PHONY: venv

venv:
	make version
	./env.sh pip install -r requirements/2.7/requirements.production.txt
	./env.sh pip install -r dev/requirements.txt
	./env.sh pip install --install-option=--prefix=$(PWD)/venv/ --install-option=--install-lib=$(PWD)/venv/lib/python2.7/site-packages/ graphite-web==0.9.12
	./env.sh pip install --install-option=--prefix=$(PWD)/venv/ --install-option=--install-lib=$(PWD)/venv/lib/python2.7/site-packages/ carbon==0.9.15
	rsync -av /usr/lib/python2.7/./dist-packages/cairo -Oi venv/lib/python2.7/
	./env.sh pip install --install-option=--prefix=$(PWD)/build/ --install-option=--install-script=$(PWD)/bin --upgrade cthulhu/
	ln -sf $(PWD)/build/lib/python2.7/site-packages/calamari_cthulhu-0.1-py2.7.egg-info/ -t venv/lib/python2.7/site-packages/

Run:

make -f Makefile.dev 
./env.sh dev/configure.py

Next, set up the postgresql database; the approach can be found in salt/local/postgres.sls. postgresql manages databases per user and by default logs system users in via peer; this needs changing to md5 password login (see the sketch below). Create the calamari user:

sudo -u postgres createuser -s -P calamari

-s creates a superuser; -P prompts for the new user's password.
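
For reference, the peer-to-md5 switch mentioned above is a change to pg_hba.conf; a sketch, with the path assumed from the postgresql 9.5 banner seen below:

# in /etc/postgresql/9.5/main/pg_hba.conf, change the matching "local ... peer" line to:
local   all             all                                     md5

Then reload postgresql:

sudo service postgresql reload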

Create a database owned by the calamari user:

sudo -u postgres createdb --owner=calamari --locale=en_US.UTF-8 --encoding=UTF8 calamari

Log in to the database:

$ psql -hlocalhost -U calamari
Password for user calamari: 
psql (9.5.6)
Type "help" for help.

calamari=# 

Note that -hlocalhost must be given to log in with the user's password; otherwise peer login is used by default. postgresql usage differs quite a bit from mysql, e.g. there is no SHOW TABLES, use \dt instead; no need to go deeper here.

Initialize:

$ ./env.sh bin/calamari-ctl --devmode initialize
[INFO] Loading configuration..
[INFO] Initializing database...
[INFO] You will now be prompted for login details for the administrative user account.  This is the account you will use to log into the web interface once setup is complete.
Username (leave blank to use 'hanyong'): admin
Email address: 313982441@qq.com
Password: 
Password (again): 
Superuser created successfully.
[INFO] Initializing web interface...
[INFO] Restarting services...
[ERROR] [Errno 2] No such file or directory
[ERROR] We are sorry, an unexpected error occurred.  Debugging information has
been written to a file at '/tmp/2017-02-24_1635.txt', please include this when seeking technical
support.

On this error, change cthulhu/calamari_ctl.py to print the traceback directly to the console; it is the supervisorctl call that fails. Comment out that line and initialization succeeds.

Run in turn:

./env.sh venv/bin/carbon-cache.py --debug --config dev/etc/graphite/carbon.conf start
./env.sh bin/cthulhu-manager --debug
./env.sh gunicorn --worker-class gevent calamari_web.wsgi

Visiting http://localhost:8000/ shows the login page. After logging in it goes to /dashboard/, hangs for a while, then returns Server Error (500). cthulhu-manager logs:

2017-02-25 00:51:11,851 - WARNING - cthulhu.salt Re-opening connection to salt-master

Presumably it cannot connect to salt-master. Adding the console logger as above, the error is:

2017-02-24 11:00:06,277 - ERROR - django.request Internal Server Error: /dashboard/
Traceback (most recent call last):
  File "/home/hanyong/workspace/calamari/venv/local/lib/python2.7/site-packages/django/core/handlers/base.py", line 115, in get_response
    response = callback(request, *callback_args, **callback_kwargs)
  File "/home/hanyong/workspace/calamari/venv/local/lib/python2.7/site-packages/django/contrib/auth/decorators.py", line 25, in _wrapped_view
    return view_func(request, *args, **kwargs)
  File "/home/hanyong/workspace/calamari/calamari-web/calamari_web/views.py", line 38, in dashboard
    clusters = client.list_clusters()
  File "/home/hanyong/workspace/calamari/venv/local/lib/python2.7/site-packages/zerorpc/core.py", line 260, in <lambda>
    return lambda *args, **kargs: self(method, *args, **kargs)
# ... ...

The web process fails calling the cthulhu RPC to list clusters.

Search the code for the log message:

$ grep -F 'Re-opening connection to salt-master' * -R
calamari-common/calamari_common/salt_wrapper.py:                self._log.warning("Re-opening connection to salt-master")
# ... ...

Reading the code, the connection to salt-master is made here:

      self._master_event = MasterEvent(self._config['sock_dir'])

Raising an exception at this point reveals the stack:

File "/home/hanyong/workspace/calamari/venv/local/lib/python2.7/site-packages/gevent/greenlet.py", line 327, in run
  result = self._run(*self.args, **self.kwargs)
File "/home/hanyong/workspace/calamari/cthulhu/cthulhu/manager/server_monitor.py", line 147, in _run
  subscription = SaltEventSource(log, salt_config)

That leads to the config-reading code in cthulhu/manager/__init__.py:

# A salt config instance for places we'll need the sock_dir
salt_config = client_config(config.get('cthulhu', 'salt_config_path'))

The dev-environment config path is dev/etc/salt/master.

Run:

$ ./env.sh salt-master -c dev/etc/salt/ -l debug
[DEBUG   ] Reading configuration from /home/hanyong/workspace/calamari/dev/etc/salt/master
[DEBUG   ] Configuration file path: /home/hanyong/workspace/calamari/dev/etc/salt/master
Failed to create directory path "/var/cache/salt/master/queues" - [Errno 13] Permission denied: '/var/cache/salt'

Add to the config:

root_dir: /home/hanyong/workspace/calamari/dev/

It then starts successfully.

The web page still fails. Considering that calamari may require a specific salt version, downgrade to salt==2015.8.13 and restart salt-master and cthulhu-manager; the web pages indeed work now.

The files under salt/srv/salt/ should be what gets executed on the salt-minion side. How is salt configured to use them? The deployed machine has this config file:

$ cat /etc/salt/master.d/calamari.conf 

file_roots:
  base:
      - /opt/calamari/salt/salt/

pillar_roots:
  base:
      - /opt/calamari/salt/pillar/

reactor:
  - 'salt/minion/*/start':
    - /opt/calamari/salt/reactor/start.sls

# add the Debian, RedHat and SUSE default apache users to
# avoid making this file distro-dependent

client_acl:
  www-data:
    - log_tail.*
  apache:
    - log_tail.*
  wwwrun:
    - log_tail.*

The dev config dev/etc/salt/master contains similar settings.

calamari collects data with python-diamond, so install diamond on the ceph server before connecting it. To reduce surprises, use the diamond branch that calamari customized and integrated:

git clone git@github.com:ceph/Diamond.git
cd Diamond/
git remote add diamond git@github.com:python-diamond/Diamond.git
git fetch diamond
git checkout calamari_rebased_on_v3.5

Check whether calamari customized diamond for ceph by diffing the code:

git difftool HEAD..v3.5

It turns out only two files differ, the ceph data collector:

src/collectors/ceph/ceph.py
src/collectors/ceph/test/testceph.py

Some custom code appears to have been added; comparing against diamond's latest tag v4.0.515 also shows differences, so this code was apparently never merged into upstream diamond(?).

For the ceph server, try Ubuntu 16.04 this time, following the diamond install docs http://diamond.readthedocs.io/en/latest/ . To install diamond as a system package, pypi-install can be used; install the prerequisites:

sudo aptitude install python-stdeb python-dev -y

But pypi-install turns out not to support installing from a local source tree; its code:

      tarball_fname = get_source_tarball(package_name,verbose=options.verbose,
                                         release=options.release,
                                         allow_unsafe_download=options.allow_unsafe_download)

      cmd = ' '.join(['py2dsc-deb'] + py2dsc_args + [tarball_fname])
      if options.verbose >= 2:
          myprint('executing: %s'%cmd)
      subprocess.check_call(cmd, shell=True)

      os.chdir( 'deb_dist' )
      cmd = 'dpkg -i *.deb'

So pypi-install downloads a tarball and runs py2dsc-deb to produce the .deb. Googling shows sdist generates a source tarball from local code:

python setup.py sdist

py2dsc-deb fails:

$ py2dsc-deb dist/diamond-3.5.16.tar.gz
# ... ...
running build_scripts
creating build
creating build/scripts-2.7
copying and adjusting bin/diamond -> build/scripts-2.7
copying and adjusting bin/diamond-setup -> build/scripts-2.7
changing mode of build/scripts-2.7/diamond from 664 to 775
changing mode of build/scripts-2.7/diamond-setup from 664 to 775
   dh_auto_test -O--buildsystem=pybuild
I: pybuild base:184: python2.7 setup.py test 
/usr/lib/python2.7/distutils/dist.py:267: UserWarning: Unknown distribution option: 'install_requires'
  warnings.warn(msg)
usage: setup.py [global_opts] cmd1 [cmd1_opts] [cmd2 [cmd2_opts] ...]
   or: setup.py --help [cmd1 cmd2 ...]
   or: setup.py --help-commands
   or: setup.py cmd --help

error: invalid command 'test'
E: pybuild pybuild:274: test: plugin distutils failed with: exit code=1: python2.7 setup.py test 
dh_auto_test: pybuild --test -i python{version} -p 2.7 --dir . returned exit code 13
debian/rules:7: recipe for target 'build' failed
make: *** [build] Error 25
dpkg-buildpackage: error: debian/rules build gave error exit status 2
Traceback (most recent call last):
  File "setup.py", line 149, in <module>
    ** setup_kwargs
  File "/usr/lib/python2.7/distutils/core.py", line 151, in setup
    dist.run_commands()
  File "/usr/lib/python2.7/distutils/dist.py", line 953, in run_commands
    self.run_command(cmd)
  File "/usr/lib/python2.7/distutils/dist.py", line 972, in run_command
    cmd_obj.run()
  File "/usr/lib/python2.7/dist-packages/stdeb/command/bdist_deb.py", line 48, in run
    util.process_command(syscmd,cwd=target_dirs[0])
  File "/usr/lib/python2.7/dist-packages/stdeb/util.py", line 183, in process_command
    check_call(args, cwd=cwd)
  File "/usr/lib/python2.7/dist-packages/stdeb/util.py", line 46, in check_call
    raise CalledProcessError(retcode)
stdeb.util.CalledProcessError: 2
ERROR running: /usr/bin/python setup.py --command-packages stdeb.command sdist_dsc --dist-dir=/home/hanyong/workspace/Diamond/deb_dist --use-premade-distfile=/home/hanyong/workspace/Diamond/dist/diamond-3.5.16.tar.gz bdist_deb
ERROR in deb_dist/tmp_py2dsc/diamond-3.5.16

The latest tag v4.0.515, by contrast, builds successfully:

git checkout master
git reset --hard v4.0.515
git clean -fdx
python setup.py sdist
py2dsc-deb dist/diamond-4.0.515.tar.gz

The corresponding part of its log:

running build_scripts
creating build
creating build/scripts-2.7
copying and adjusting bin/diamond -> build/scripts-2.7
copying and adjusting bin/diamond-setup -> build/scripts-2.7
changing mode of build/scripts-2.7/diamond from 664 to 775
changing mode of build/scripts-2.7/diamond-setup from 664 to 775
   dh_auto_test -O--buildsystem=pybuild
I: pybuild base:184: cd /home/hanyong/workspace/Diamond/deb_dist/diamond-4.0.515/.pybuild/pythonX.Y_2.7/build; python2.7 -m unittest discover -v 

----------------------------------------------------------------------
Ran 0 tests in 0.000s

Testing shows the command that actually fails is:

$ dh_auto_test -O--buildsystem=pybuild
I: pybuild base:184: python2.7 setup.py test 
/usr/lib/python2.7/distutils/dist.py:267: UserWarning: Unknown distribution option: 'install_requires'
  warnings.warn(msg)
usage: setup.py [global_opts] cmd1 [cmd1_opts] [cmd2 [cmd2_opts] ...]
   or: setup.py --help [cmd1 cmd2 ...]
   or: setup.py --help-commands
   or: setup.py cmd --help

error: invalid command 'test'
E: pybuild pybuild:274: test: plugin distutils failed with: exit code=1: python2.7 setup.py test 
dh_auto_test: pybuild --test -i python{version} -p 2.7 --dir . returned exit code 13

dh_auto_test runs the tests automatically while building the .deb, and py2dsc-deb appears to have no option to skip them. Try running v4.0.515's test command on the calamari_rebased_on_v3.5 branch:

$ python2.7 -m unittest discover -v
test (unittest.loader.ModuleImportFailure) ... ERROR

======================================================================
ERROR: test (unittest.loader.ModuleImportFailure)
----------------------------------------------------------------------
ImportError: Failed to import test module: test
Traceback (most recent call last):
  File "/usr/lib/python2.7/unittest/loader.py", line 254, in _find_tests
    module = self._get_module_from_name(name)
  File "/usr/lib/python2.7/unittest/loader.py", line 232, in _get_module_from_name
    __import__(name)
  File "/home/hanyong/workspace/Diamond/test.py", line 12, in <module>
    import configobj
ImportError: No module named configobj


----------------------------------------------------------------------
Ran 1 test in 0.000s

FAILED (errors=1)

This recalls the earlier setup.py diff between the two versions: the configobj package name was updated, and the old calamari_rebased_on_v3.5 code still uses the old name. Yet test.py already imports the new name configobj, and the package exists on the system, but the import still fails. Meanwhile setup.py declares the old name, inconsistent with test.py:

  install_requires = ['ConfigObj', 'psutil', ],

Fixing the package name in setup.py does not help.

How does dh_auto_test derive the test command to use? Add -v:

$ dh_auto_test -O--buildsystem=pybuild -v
	pybuild --test -i python{version} -p 2.7 --dir .
I: pybuild base:184: cd /home/hanyong/workspace/Diamond/.pybuild/pythonX.Y_2.7/build; python2.7 -m unittest discover -v 

So it invokes the pybuild command directly, and from there the two branches diverge. Tracing the code eventually lands in /usr/share/dh-python/dhpython/build/plugin_distutils.py:

  @shell_command
  @create_pydistutils_cfg
  def test(self, context, args):
      if not self.cfg.custom_tests:
          fpath = join(args['dir'], args['setup_py'])
          with open(fpath, 'rb') as fp:
              if fp.read().find(b'test_suite') > 0:
                  # TODO: is that enough to detect if test target is available?
                  return '{interpreter} {setup_py} test {args}'
      return super(BuildSystem, self).test(context, args)

There is a crude heuristic here: if setup.py contains the string test_suite, it runs setup.py test. And calamari_rebased_on_v3.5 happens to contain the line:

  #test_suite='test.main',

Although commented out, it steers the test detection into a different branch. Delete the line and rebuild; the .deb build indeed succeeds, and the ConfigObj name in setup.py had nothing to do with it.

Also seen while tracing the pybuild code:

  nocheck = False
  if 'DEB_BUILD_OPTIONS' in environ:
      nocheck = 'nocheck' in environ['DEB_BUILD_OPTIONS']

So setting an environment variable can skip the tests directly:

DEB_BUILD_OPTIONS=nocheck py2dsc-deb dist/diamond-3.5.16.tar.gz

Even without skipping, it seems no tests actually run:

I: pybuild base:184: cd /home/hanyong/workspace/Diamond/deb_dist/diamond-3.5.16/.pybuild/pythonX.Y_2.7/build; python2.7 -m unittest discover -v 

----------------------------------------------------------------------
Ran 0 tests in 0.000s

The generated .deb is named python-diamond; install it:

sudo dpkg -i ./deb_dist/python-diamond_3.5.16-1_all.deb

In the calamari code, salt/srv/salt/diamond.sls configures and starts diamond; change the package name diamond in it to python-diamond. The /etc/default/diamond file it manages for ubuntu turns out not to exist, so delete the related states.
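
The rename amounts to something like this in the sls (a sketch assuming a pkg state; the actual file differs in detail):

diamond:
  pkg.installed:
    - name: python-diamond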

Next, run salt-minion on the ceph server. To switch versions easily and limit the impact on system libraries, install salt in a virtualenv; and to use third-party libraries such as ceph directly, give it access to the system site packages. The system virtualenv is old and somewhat buggy, so upgrade it first.

sudo pip2 install --upgrade virtualenv
python2 -m virtualenv --system-site-packages salt/
cd salt/
bin/pip install salt==2015.8.13

salt installed via pip has no config file templates. Per the docs https://docs.saltstack.com/en/latest/topics/development/hacking.html , the templates can be copied from the source to run salt from a non-system directory; mainly user: and root_dir: need changing in the config. Since it must modify the diamond system config and start services, salt-minion ultimately still needs root, but a normal user is fine for a first test. We do not care about the other options, so setting only root_dir: and master: is enough.

mkdir -p etc/salt/minion.d/
echo -e "root_dir: $PWD/\nmaster: han230dev" >etc/salt/minion.d/calamari.conf

Running it fails:

$ bin/salt-minion -c etc/salt/ -u $USER -l debug
[DEBUG   ] Missing configuration file: /home/hanyong/opt/salt/etc/salt/minion
[DEBUG   ] Including configuration from '/home/hanyong/opt/salt/etc/salt/minion.d/calamari.conf'
[DEBUG   ] Reading configuration from /home/hanyong/opt/salt/etc/salt/minion.d/calamari.conf
[DEBUG   ] Using cached minion ID from /home/hanyong/opt/salt/etc/salt/minion_id: han230
[DEBUG   ] Configuration file path: /etc/salt/minion
Failed to create directory path "/etc/salt/minion.d" - [Errno 13] Permission denied: '/etc/salt'
[INFO    ] The Salt Minion is shut down
[ERROR   ] 13

Presumably, when the etc/salt/minion config file is missing it falls back to the default system directory; touch an empty file:

touch etc/salt/minion

This time it runs fine, and the web page shows the salt connection. Quit and run again as root:

sudo bin/salt-minion -c etc/salt/ -l debug

After salt-master accepts the connection, the minion logs:

[INFO    ] Minion is ready to receive requests!
[INFO    ] User hanyong Executing command state.highstate with jid 20170225174351073263
[DEBUG   ] Command details {'tgt_type': 'glob', 'jid': '20170225174351073263', 'tgt': 'han230', 'ret': '', 'user': 'hanyong', 'arg': [], 'fun': 'state.highstate'}
[INFO    ] Starting a new job with PID 19410
[DEBUG   ] LazyLoaded state.highstate
[DEBUG   ] LazyLoaded grains.get
[DEBUG   ] LazyLoaded saltutil.is_running
# ... ...
[INFO    ] User hanyong Executing command saltutil.sync_modules with jid 20170225174351122123
[DEBUG   ] Command details {'tgt_type': 'glob', 'jid': '20170225174351122123', 'tgt': 'han230', 'ret': '', 'user': 'hanyong', 'arg': [], 'fun': 'saltutil.sync_modules'}
[INFO    ] Starting a new job with PID 19415
[DEBUG   ] LazyLoaded saltutil.sync_modules
# ... ...
[DEBUG   ] In saltenv 'base', looking at rel_path u'top.sls' to resolve u'salt://top.sls'
[DEBUG   ] In saltenv 'base', ** considering ** path u'/home/hanyong/opt/salt/var/cache/salt/minion/files/base/top.sls' to resolve u'salt://top.sls'
[DEBUG   ] Fetching file from saltenv 'base', ** attempting ** u'salt://top.sls'
[DEBUG   ] No dest file found 
[INFO    ] Fetching file from saltenv 'base', ** done ** u'top.sls'
[DEBUG   ] compile template: /home/hanyong/opt/salt/var/cache/salt/minion/files/base/top.sls
[DEBUG   ] Jinja search path: ['/home/hanyong/opt/salt/var/cache/salt/minion/files/base']
[PROFILE ] Time (in seconds) to render '/home/hanyong/opt/salt/var/cache/salt/minion/files/base/top.sls' using 'jinja' renderer: 0.00566506385803
[DEBUG   ] Rendered data from file: /home/hanyong/opt/salt/var/cache/salt/minion/files/base/top.sls:
base:
    '*':
        - diamond
        - osd_crush_location 
# ... ...
[DEBUG   ] In saltenv 'base', looking at rel_path u'_modules/ceph.py' to resolve u'salt://_modules/ceph.py'
[DEBUG   ] In saltenv 'base', ** considering ** path u'/home/hanyong/opt/salt/var/cache/salt/minion/files/base/_modules/ceph.py' to resolve u'salt://_modules/ceph.py'
[DEBUG   ] Fetching file from saltenv 'base', ** attempting ** u'salt://_modules/ceph.py'
[DEBUG   ] No dest file found 
[INFO    ] Fetching file from saltenv 'base', ** done ** u'_modules/ceph.py'
# ... ...
[DEBUG   ] Missing configuration file: /etc/salt/minion
[DEBUG   ] Including configuration from '/etc/salt/minion.d/_schedule.conf'
[DEBUG   ] Reading configuration from /etc/salt/minion.d/_schedule.conf
# ... ...
[DEBUG   ] In saltenv 'base', looking at rel_path u'diamond.sls' to resolve u'salt://diamond.sls'
[DEBUG   ] In saltenv 'base', ** considering ** path u'/home/hanyong/opt/salt/var/cache/salt/minion/files/base/diamond.sls' to resolve u'salt://diamond.sls'
[DEBUG   ] Fetching file from saltenv 'base', ** attempting ** u'salt://diamond.sls'
[DEBUG   ] No dest file found 
[INFO    ] Fetching file from saltenv 'base', ** done ** u'diamond.sls'
[DEBUG   ] compile template: /home/hanyong/opt/salt/var/cache/salt/minion/files/base/diamond.sls
# ... ...
[DEBUG   ] In saltenv 'base', looking at rel_path u'osd_crush_location.sls' to resolve u'salt://osd_crush_location.sls'
[DEBUG   ] In saltenv 'base', ** considering ** path u'/home/hanyong/opt/salt/var/cache/salt/minion/files/base/osd_crush_location.sls' to resolve u'salt://osd_crush_location.sls'
[DEBUG   ] Fetching file from saltenv 'base', ** attempting ** u'salt://osd_crush_location.sls'
[DEBUG   ] No dest file found 
[INFO    ] Fetching file from saltenv 'base', ** done ** u'osd_crush_location.sls'
[DEBUG   ] compile template: /home/hanyong/opt/salt/var/cache/salt/minion/files/base/osd_crush_location.sls
# ... ...
[ERROR   ] Command ['systemd-run', '--scope', 'apt-get', '-q', '-y', '-o', 'DPkg::Options::=--force-confold', '-o', 'DPkg::Options::=--force-confdef', '--allow-unauthenticated', 'install', 'diamond'] failed with return code: 100
[ERROR   ] output: Running scope as unit run-r11fa549119b34af6ab1989695c276429.scope.
Reading package lists...
Building dependency tree...
Reading state information...
E: Unable to locate package diamond
[INFO    ] Executing command ['dpkg-query', '--showformat', '${Status} ${Package} ${Version} ${Architecture}\n', '-W'] in directory '/home/hanyong'
[ERROR   ] The following packages failed to install/update: diamond
# ... ...
[INFO    ] Running state [/usr/bin/calamari-crush-location] at time 17:43:56.761598
[INFO    ] Executing state file.managed for /usr/bin/calamari-crush-location
[DEBUG   ] In saltenv 'base', looking at rel_path u'base/calamari-crush-location.py' to resolve u'salt://base/calamari-crush-location.py'
[DEBUG   ] In saltenv 'base', ** considering ** path u'/home/hanyong/opt/salt/var/cache/salt/minion/files/base/base/calamari-crush-location.py' to resolve u'salt://base/calamari-crush-location.py'
[DEBUG   ] Fetching file from saltenv 'base', ** attempting ** u'salt://base/calamari-crush-location.py'
[DEBUG   ] No dest file found 
[INFO    ] Fetching file from saltenv 'base', ** done ** u'base/calamari-crush-location.py'
[INFO    ] File changed:
New file
[INFO    ] Completed state [/usr/bin/calamari-crush-location] at time 17:43:56.794575
# ... ...
[INFO    ] Running state [find /etc/ceph -name '*.conf' | while read conf; do echo; cp "$conf" "$conf.orig"; echo "modifying $conf"; grep -EH 'osd crush update on start = false|osd crush location hook' "$conf" || sed 's/\[global\]/\[global\]\nosd crush location hook = \/usr\/bin\/calamari-crush-location/' -i "$conf"; done] at time 17:43:56.796781
[INFO    ] Executing state cmd.run for find /etc/ceph -name '*.conf' | while read conf; do echo; cp "$conf" "$conf.orig"; echo "modifying $conf"; grep -EH 'osd crush update on start = false|osd crush location hook' "$conf" || sed 's/\[global\]/\[global\]\nosd crush location hook = \/usr\/bin\/calamari-crush-location/' -i "$conf"; done
[INFO    ] Executing command 'find /etc/ceph -name \'*.conf\' | while read conf; do echo; cp "$conf" "$conf.orig"; echo "modifying $conf"; grep -EH \'osd crush update on start = false|osd crush location hook\' "$conf" || sed \'s/\\[global\\]/\\[global\\]\\nosd crush location hook = \\/usr\\/bin\\/calamari-crush-location/\' -i "$conf"; done' in directory '/home/hanyong'
[DEBUG   ] stdout: 
modifying /etc/ceph/ceph.conf
[INFO    ] {'pid': 19700, 'retcode': 0, 'stderr': '', 'stdout': '\nmodifying /etc/ceph/ceph.conf'}
[INFO    ] Completed state [find /etc/ceph -name '*.conf' | while read conf; do echo; cp "$conf" "$conf.orig"; echo "modifying $conf"; grep -EH 'osd crush update on start = false|osd crush location hook' "$conf" || sed 's/\[global\]/\[global\]\nosd crush location hook = \/usr\/bin\/calamari-crush-location/' -i "$conf"; done] at time 17:43:56.813751
# ... ...
[DEBUG   ] Sending event - data = {'pretag': None, '_stamp': '2017-02-25T09:44:42.070365', 'tag': 'ceph/server', 'data': {'services': {'ceph-osd.0': {'status': None, 'cluster': 'ceph', 'version': u'10.2.5', 'type': 'osd', 'id': '0', 'fsid': u'13e5f237-9387-4506-badd-c66cc25f9629'}}, 'boot_time': 1487997369, 'ceph_version': '10.2.5-0ubuntu0.16.04.1'}, 'events': None}

So salt-minion pulled the relevant config and scripts from the master and executed the commands. The files pulled or modified:

$ sudo tree var/cache/salt/minion/                                                                                                  
var/cache/salt/minion/
├── accumulator
├── extmods
│   └── modules
│       ├── ceph.py
│       ├── ceph.pyc
│       ├── log_tail.py
│       └── log_tail.pyc
├── files
│   └── base
│       ├── base
│       │   └── calamari-crush-location.py
│       ├── diamond.sls
│       ├── _modules
│       │   ├── ceph.py
│       │   └── log_tail.py
│       ├── osd_crush_location.sls
│       └── top.sls
├── highstate.cache.p
├── module_refresh
└── proc

$ sudo tree /etc/salt/
/etc/salt/
└── minion.d
    └── _schedule.conf
    
$ sudo cat /etc/salt/minion.d/_schedule.conf 
schedule:
  __mine_interval: {function: mine.update, jid_include: true, maxrunning: 2, minutes: 60,
    return_job: false}

$ ll /usr/bin/calamari-crush-location 
-rwxr-xr-x 1 root root 3319 Feb 25 17:43 /usr/bin/calamari-crush-location*

$ cat /etc/ceph/ceph.conf
[global]
osd crush location hook = /usr/bin/calamari-crush-location
fsid = 13e5f237-9387-4506-badd-c66cc25f9629
# ... ...

ceph itself also ships a ceph-crush-location command, apparently with the same function as calamari-crush-location(?):

$ dpkg -S /usr/bin/ceph-crush-location
ceph-common: /usr/bin/ceph-crush-location

$ ceph-crush-location 
must specify entity type
usage: /usr/bin/ceph-crush-location [--cluster <cluster>] --id <id> --type <osd|mds|client>

$ calamari-crush-location --help
usage: calamari-crush-location [-h] [--cluster CLUSTER] --id ID --type TYPE

Calamari setup tool.

optional arguments:
  -h, --help         show this help message and exit
  --cluster CLUSTER  ceph cluster to operate on
  --id ID            id to emit crush location for
  --type TYPE        osd

Per the ceph docs http://docs.ceph.com/docs/master/rados/operations/crush-map/#custom-location-hooks , this script generates the crush location description for a ceph node. The ceph default rules are probably better than calamari's here, so remove the corresponding salt configuration.
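
Concretely, that means dropping osd_crush_location from the top.sls pulled earlier, leaving:

base:
    '*':
        - diamond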

The log output above also shows two remaining problems:

  1. The diamond package name was missed somewhere in the config (the apt-get install diamond failure above).
  2. /etc/salt/minion.d/_schedule.conf should not be under the system directory; we are not using the system directory.

Searching for where _schedule.conf is configured, it turns out to be hard-coded:

$ grep -F '_schedule.conf' -R *
# ... ...
venv/lib/python2.7/site-packages/salt/utils/schedule.py:        Persist the modified schedule into <<configdir>>/minion.d/_schedule.conf

The code:

  def persist(self):
      '''
      Persist the modified schedule into <<configdir>>/minion.d/_schedule.conf
      '''
      schedule_conf = os.path.join(
              salt.syspaths.CONFIG_DIR,
              'minion.d',
              '_schedule.conf')
      log.debug('Persisting schedule')
      try:
          with salt.utils.fopen(schedule_conf, 'wb+') as fp_:
              fp_.write(yaml.dump({'schedule': self.opts['schedule']}))
      except (IOError, OSError):
          log.error('Failed to persist the updated schedule')

From the code, the file should be written under the salt config directory, not the system directory. Add some logging:

      log.error('Persisting schedule ' + str([salt.syspaths.ROOT_DIR, salt.syspaths.CONFIG_DIR,]))
      raise Exception()

Which prints:

[ERROR   ] Persisting schedule ['/', '/etc/salt']

So at this point ROOT_DIR is not read from the config file.

Tracing the code in venv/lib/python2.7/site-packages/salt/syspaths.py:

try:
    # Let's try loading the system paths from the generated module at
    # installation time.
    import salt._syspaths as __generated_syspaths  # pylint: disable=no-name-in-module
except ImportError:
    import imp
    __generated_syspaths = imp.new_module('salt._syspaths')
    for key in ('ROOT_DIR', 'CONFIG_DIR', 'CACHE_DIR', 'SOCK_DIR',
                'SRV_ROOT_DIR', 'BASE_FILE_ROOTS_DIR', 'BASE_PILLAR_ROOTS_DIR',
                'BASE_MASTER_ROOTS_DIR', 'LOGS_DIR', 'PIDFILE_DIR',
                'SPM_FORMULA_PATH', 'SPM_PILLAR_PATH', 'SPM_REACTOR_PATH'):
        setattr(__generated_syspaths, key, None)

ROOT_DIR = __generated_syspaths.ROOT_DIR
if ROOT_DIR is None:
    # The installation time value was not provided, let's define the default
    if __PLATFORM.startswith('win'):
        ROOT_DIR = r'c:\salt'
    else:
        ROOT_DIR = '/'

Test:

$ bin/python -c 'import salt._syspaths as __generated_syspaths ; print __generated_syspaths'                                        
Traceback (most recent call last):
  File "<string>", line 1, in <module>
ImportError: No module named _syspaths

Test on a system where salt was installed from the .deb package:

$ python2 -c 'import salt._syspaths as __generated_syspaths ; print __generated_syspaths'
<module 'salt._syspaths' from '/usr/lib/python2.7/dist-packages/salt/_syspaths.pyc'>

$ dpkg -S /usr/lib/python2.7/dist-packages/salt/_syspaths.py
salt-common: /usr/lib/python2.7/dist-packages/salt/_syspaths.py

$ cat /usr/lib/python2.7/dist-packages/salt/_syspaths.py
# This file was auto-generated by salt's setup on Friday, 07 February 2014 @ 05:02:24 UTC.

ROOT_DIR = '/'
# ... ...

So this path is maintained through a generated python module, salt._syspaths, not through the config file. ROOT_DIR therefore lives in two places, which is an easy pitfall.

Would the code-level ROOT_DIR serve as the default for the config file? Comment out the root_dir: setting in the config file and generate the python config:

( echo "ROOT_DIR = '$PWD/'" && for e in $( sed -r '/^__all__ =/,/^]/ { /(\[|\])/ d ; s#.*?\b(\w+).*#\1# ; /^ROOT_DIR$/ d ; p ; } ; d' lib/python2.7/site-packages/salt/syspaths.py ) ; do echo "$e = None" ; done ) > lib/python2.7/site-packages/salt/_syspaths.py

The test run is OK. So to change ROOT_DIR thoroughly, generate the python config and drop root_dir: from the config file; the -c command-line option is then no longer needed either.

Another finding: after modifying the diamond.sls file on the salt-master, the file on the salt-minion does not update automatically; a restart seems to be required (?).

Afterwards the diamond config file was generated as well, but the process did not come up:

$ sudo systemctl status diamond
● python-diamond.service - LSB: System statistics collector for Graphite.
   Loaded: loaded (/etc/init.d/python-diamond; bad; vendor preset: enabled)
   Active: active (exited) since Sat 2017-02-25 21:06:11 CST; 4min 4s ago
     Docs: man:systemd-sysv-generator(8)
  Process: 30276 ExecStop=/etc/init.d/python-diamond stop (code=exited, status=0/SUCCESS)
  Process: 30288 ExecStart=/etc/init.d/python-diamond start (code=exited, status=0/SUCCESS)

Feb 25 21:06:11 han230 systemd[1]: Starting LSB: System statistics collector for Graphite....
Feb 25 21:06:11 han230 python-diamond[30288]: Traceback (most recent call last):
Feb 25 21:06:11 han230 python-diamond[30288]:   File "/usr/bin/diamond", line 6, in <module>
Feb 25 21:06:11 han230 python-diamond[30288]:     import configobj
Feb 25 21:06:11 han230 python-diamond[30288]: ImportError: No module named configobj
Feb 25 21:06:11 han230 systemd[1]: Started LSB: System statistics collector for Graphite..

Per setup.py, diamond depends on configobj and psutil; install them:

sudo aptitude install python-configobj python-psutil -y

After stopping diamond manually and restarting salt-minion, the diamond service came up normally.

With salt-minion deployed on an OSD machine, calamari reported that no ceph cluster was found. After also deploying salt-minion on a ceph monitor machine, the cluster information showed up in the web UI. /graphite/dashboard/ had no ceph statistics though, and the calamari log showed:

2017-02-25 07:54:53,775 - metric_access - django.request No graphite data for ceph.cluster.13e5f237-9387-4506-badd-c66cc25f9629.pool.0.num_objects
2017-02-25 07:54:53,776 - metric_access - django.request No graphite data for ceph.cluster.13e5f237-9387-4506-badd-c66cc25f9629.pool.0.num_bytes
2017-02-25 07:54:54,056 - metric_access - django.request No graphite data for ceph.cluster.13e5f237-9387-4506-badd-c66cc25f9629.df.total_used_bytes
2017-02-25 07:54:54,057 - metric_access - django.request No graphite data for ceph.cluster.13e5f237-9387-4506-badd-c66cc25f9629.df.total_used
2017-02-25 07:54:54,058 - metric_access - django.request No graphite data for ceph.cluster.13e5f237-9387-4506-badd-c66cc25f9629.df.total_space
2017-02-25 07:54:54,059 - metric_access - django.request No graphite data for ceph.cluster.13e5f237-9387-4506-badd-c66cc25f9629.df.total_avail

The host monitoring charts at /#graph/han230 showed no data either. Their request uses /graphite/render/?format=json-array, but the response is an image; presumably another API change causing an incompatibility. The romana 1.3 branch and master both behave this way. After changing to format=json, JSON data came back.

Patch the frontend code:

$ find -type f -exec grep -Pq '\bjson-array\b' {} \; -print
./content/dashboard/scripts/main.js

$ sed -r -e 's#\bjson-array\b#json#g' ./content/dashboard/scripts/main.js -i

The charts still showed nothing; presumably the data format is still incompatible (?).

Comparing the local code with the code on the deployed ubuntu 14.04 machine: both report version 0.9.12, but the deployed machine supports both the json and json-array formats, while the local one only has json. Some googling shows json-array is a third-party modification that upstream graphite-web never accepted. See: https://github.com/graphite-project/graphite-web/pull/525/files . The third-party branch: https://github.com/jcsp/graphite-web/tree/json-array . After pulling that branch, installing it locally, and reverting the frontend change, the charts finally showed data.

/graphite/dashboard/ still showed no ceph monitoring data, and /var/log/diamond/diamond.log contained this error:

[2017-02-26 00:27:54,787] [Thread-1] Traceback (most recent call last):
  File "/usr/lib/python2.7/dist-packages/diamond/collector.py", line 491, in _run
    self.collect()
  File "/usr/share/diamond/collectors/ceph/ceph.py", line 464, in collect
    self._collect_service_stats(path)
  File "/usr/share/diamond/collectors/ceph/ceph.py", line 450, in _collect_service_stats
    self._publish_stats(counter_prefix, stats, schema, GlobalName)
  File "/usr/share/diamond/collectors/ceph/ceph.py", line 305, in _publish_stats
    assert path[-1] == 'type'
AssertionError

The failing code in collectors/ceph/ceph.py:

  def _publish_stats(self, counter_prefix, stats, schema, name_class):
      """Publish a set of Ceph performance counters, including schema.

      :param counter_prefix: string prefixed to metric names
      :param stats: dictionary containing performance counters
      :param schema: performance counter schema
      """
      for path, stat_type in flatten_dictionary(schema):
          # remove 'stat_type' component to get metric name
          assert path[-1] == 'type'
          del path[-1]

Add some logging:

      import json
      self.log.error([counter_prefix, name_class,])
      self.log.error(json.dumps(stats))
      self.log.error(json.dumps(schema))
      for path, stat_type in flatten_dictionary(schema):
          # remove 'stat_type' component to get metric name
          self.log.error([path, stat_type,])
          assert path[-1] == 'type'
          del path[-1]

Which prints:

[2017-02-26 01:12:47,784] [Thread-1] ['75df7d7c-2645-461e-834a-c838a67155c6.mon.han2015', <class 'ceph.GlobalName'>]
[2017-02-26 01:12:47,784] [Thread-1] {"leveldb": {"leveldb_get": 114961, "leveldb_submit_sync_latency": ... ...
[2017-02-26 01:12:47,785] [Thread-1] {"leveldb": {"leveldb_get": {"nick": "", "type": 10, "description": "Gets"}, ... ...
[2017-02-26 01:12:47,785] [Thread-1] [[u'cluster', u'mds_epoch', u'description'], u'Current epoch of MDS map']

The stats format maps counter names directly to values (as in the log output above):

"leveldb": {
  "leveldb_get": 114961,
  ... ...
}

The schema format wraps each counter in a dict of nick, type, and description:

"cluster": {
  "num_mds_up": {
    "nick": "",
    "type": 2,
    "description": "MDSs that are up"
  },
  "num_pg": {
    "nick": "",
    "type": 2,
    "description": "Placement groups"
  },
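
Given this schema, flatten_dictionary yields one entry per leaf node, so paths ending in nick and description trip the assert. A sketch of the yielded entries for a single counter (assuming depth-first traversal, cf. the mds_epoch line in the log above):

# sketch: entries yielded by flatten_dictionary(schema) for one counter
([u'cluster', u'num_mds_up', u'nick'], u'')
([u'cluster', u'num_mds_up', u'type'], 2)          # the only leaf we want
([u'cluster', u'num_mds_up', u'description'], u'MDSs that are up')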

From the surrounding code, the intent is to walk the schema and handle the stats according to their type; while flattening the schema it should skip leaf nodes whose key is not 'type'. Fixed code:

      for path, stat_type in flatten_dictionary(schema):
          # ignore leaves where key is not 'type'
          if path[-1] != 'type':
              continue
          # remove 'type' component to get metric name
          del path[-1]

After this change the ceph monitoring data finally showed up.

Comparing the diamond config shipped by calamari against the diamond defaults: apart from pointing at our graphite server, it only changes how the hostname is derived (kept consistent with salt-minion), the collection interval, and the log level. The main addition is the ceph collection config. Later it is worth checking whether the stock diamond ceph collectors can obtain the same data, so the salt logic that configures diamond can be dropped (or changed to run from the venv?).
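
For reference, those deltas boil down to a handful of diamond.conf settings; a rough sketch (section/key names per the stock diamond.conf.example, values illustrative):

[server]
handlers = diamond.handler.graphite.GraphiteHandler

[handlers]

[[GraphiteHandler]]
# point at the graphite service on the calamari server
host = 10.218.137.144
port = 2003

[collectors]

[[default]]
# derive the hostname the same way salt-minion does (value illustrative)
hostname_method = hostname_short
# tighter collection interval than the stock default; the log level
# lives in the logging sections (omitted here)
interval = 10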

Install diamond:

python2 -m virtualenv diamond/
cd diamond/
bin/pip install --install-option=--prefix=$PWD/ --install-option=--install-lib=$PWD/lib/python2.7/site-packages/ diamond
cp etc/diamond/diamond.conf{.example,}
sed -r -e "s#(\s)/usr/#\1/#g ; s#/(etc|var|share)/#$PWD/\1/#g" etc/diamond/diamond.conf -i

Adjust the graphite server address, the default collection interval, and the log level.

Configure:

bin/diamond-setup -c etc/diamond/diamond.conf -C NetworkCollector
bin/diamond-setup -c etc/diamond/diamond.conf -C CephCollector
bin/diamond-setup -c etc/diamond/diamond.conf -C CephStatsCollector

Enable the collection items, use byte as the unit, and keep the defaults for everything else.

Start it (collecting ceph data requires root):

mkdir -p var/log/diamond
sudo bin/diamond -c etc/diamond/diamond.conf -f -l --skip-fork

The salt-minion on the OSD machine logged the following:

[INFO    ] Executing command ['dpkg-query', '--showformat', '${Status} ${Package} ${Version} ${Architecture}\n', '-W'] in directory '/home/hanyong'
{'fun_args': [], 'jid': '20170226170612482237', 'return': None, 'success': True, 'schedule': 'ceph.heartbeat', 'pid': 31708, 'fun': 'ceph.heartbeat', 'id': 'han230'}
[INFO    ] Running scheduled job: ceph.heartbeat
no valid command found; 10 closest matches:
rbd cache flush rbd/test
perfcounters_schema
perf reset <var>
perf dump {<logger>} {<counter>}
perfcounters_dump

Why is dpkg-query executed so frequently, and where is ceph.heartbeat defined? Found it in salt/srv/pillar/schedules.sls:

schedule:
  ceph.heartbeat:
    function: ceph.heartbeat
    seconds: 10
    returner: local
    maxrunning: 1

There is also a heartbeat() function in salt/srv/salt/_modules/ceph.py, so the tasks salt executes are in fact defined in python.
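
As a minimal sketch of the mechanism (hello.py is a hypothetical module for illustration): any function in a python file under the _modules directory becomes a remotely callable salt function once synced to the minions:

# _modules/hello.py -- hypothetical custom execution module
def ping():
    # callable as `salt '*' hello.ping` after the module is synced
    # to the minions (e.g. via saltutil.sync_modules or a highstate)
    return 'pong'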

The salt-master config contains:

extension_modules: /home/hanyong/workspace/calamari/salt/srv/salt/_modules

On the salt-minion, ceph.py shows up in two places, but with identical timestamps and content:

$ sudo find -name ceph.py
./var/cache/salt/minion/files/base/_modules/ceph.py
./var/cache/salt/minion/extmods/modules/ceph.py

After modifying the file on the salt-master and restarting the salt-minion, both copies update together. Can they be updated without restarting the minion?

Find where dpkg-query is invoked:

$ grep -Fw dpkg-query -R *
# ... ...
venv/lib/python2.7/site-packages/salt/modules/dpkg.py:    cmd = "dpkg-query -W -f='package:" + bin_var + "\\n" \
venv/lib/python2.7/site-packages/salt/modules/aptpkg.py:    cmd = ['dpkg-query', '--showformat',

So it is defined inside salt itself.

Tracing further into the heartbeat() function:

  # Installed Ceph version (as oppose to per-service running ceph version)
  ceph_version_str = __salt__['pkg.version']('ceph')  # noqa
  if ceph_version_str:
      ceph_version = ceph_version_str
  else:
      ceph_version = None

  # For each ceph cluster with an in-quorum mon on this node, interrogate the cluster
  cluster_heartbeat = {}
  for fsid, socket_path in mon_sockets.items():
      try:
          cluster_handle = rados.Rados(name=RADOS_NAME, clustername=fsid_names[fsid], conffile='')
          cluster_handle.connect()
          cluster_heartbeat[fsid] = cluster_status(cluster_handle, fsid_names[fsid])
      except (rados.Error, MonitoringError):
          # Something went wrong getting data for this cluster, exclude it from our report
          pass

This must be where salt calls dpkg-query to look up the installed ceph version. Presumably the subsequent cluster_status() call fails; add some logging:

  import logging as log
  log.info(['--------------', mon_sockets])

It turns out mon_sockets is empty, which is fine here. The earlier error looks like stderr output from the ceph command line, which does not necessarily follow the log ordering; it may have happened earlier.

Trace service_status() and add logging:

def service_status(socket_path):
    """
    Given an admin socket path, learn all we can about that service
    """
    try:
        cluster_name, service_type, service_id = \
            re.match("^(.+?)-(.+?)\.(.+)\.asok$", os.path.basename(socket_path)).groups()
    except AttributeError:
        return None

    log.info(['---------------', socket_path, cluster_name, service_type, service_id,])
    status = None
    # Interrogate the service for its FSID
    if service_type != 'mon':
        try:
            fsid = json.loads(admin_socket(socket_path, ['status'], 'json'))['cluster_fsid']
            log.info(['---------------', " fsid = json.loads(admin_socket(socket_path, ['status'], 'json'))['cluster_fsid'] ", fsid])
        except AdminSocketError:
            # older osd/mds daemons don't support 'status'; try our best
            config = json.loads(admin_socket(socket_path, ['config', 'get', 'fsid'], 'json'))
            log.info(['---------------', " config = json.loads(admin_socket(socket_path, ['config', 'get', 'fsid'], 'json')) ", config])
            fsid = config['fsid']
    else:
        # For mons, we send some extra info here, because if they're out
        # of quorum we may not find out from the cluster heartbeats, so
        # need to use the service heartbeats to detect that.
        status = json.loads(admin_socket(socket_path, ['mon_status'], 'json'))
        fsid = status['monmap']['fsid']

From the log, the error occurs while handling /var/run/ceph/ceph-client.admin.asok.

Verify:

$ sudo ceph --admin-daemon /var/run/ceph/ceph-client.admin.asok status                                                            
no valid command found; 10 closest matches:
rbd cache flush rbd/test
perfcounters_schema
perf reset <var>
perf dump {<logger>} {<counter>}
perfcounters_dump
perf schema
log flush
log dump
objecter_requests
log reopen
admin_socket: invalid command
admin_socket: invalid command

$ sudo ceph --admin-daemon /var/run/ceph/ceph-osd.0.asok status
{
    "cluster_fsid": "75df7d7c-2645-461e-834a-c838a67155c6",
    "osd_fsid": "aa01749f-4e2a-4b98-855d-71d92fae25d7",
    "whoami": 0,
    "state": "active",
    "oldest_map": 1,
    "newest_map": 7,
    "num_pgs": 64
}

Patching the code to ignore client-type admin sockets solves the problem. This code also shows that cluster information is only obtainable from a mon admin socket, which is why salt-minion must be deployed on a mon node before calamari can detect the ceph cluster.
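
A minimal sketch of that fix, inserted right after the filename parse in service_status():

    # sketch: client admin sockets (e.g. ceph-client.admin.asok) support
    # neither 'status' nor 'mon_status', so skip them outright;
    # service_type comes from the regex match above
    if service_type == 'client':
        return None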

When changing configuration on the calamari manager page, the browser calls the API /api/v2/cluster/75df7d7c-2645-461e-834a-c838a67155c6/osd_config. Tracing the code with added logging shows it eventually reaches cthulhu/manager/user_request.py:

class RadosRequest(UserRequest):
    """
    A user request whose remote operations consist of librados mon commands
    """
    def __init__(self, headline, fsid, cluster_name, commands):
        self._commands = commands
        super(RadosRequest, self).__init__(headline, fsid, cluster_name)

    def _submit(self, commands=None):
        if commands is None:
            commands = self._commands

        self.log.debug("%s._submit: %s/%s/%s" % (self.__class__.__name__,
                                                 self._minion_id, self._cluster_name, commands))

        client = LocalClient(config.get('cthulhu', 'salt_config_path'))
        pub_data = client.run_job(self._minion_id, 'ceph.rados_commands',
                                  [self.fsid, self._cluster_name, commands])

That is, it uses salt.client.LocalClient to execute commands remotely. The command here is also defined in the ceph.py module file mentioned above.
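
For reference, a minimal sketch of driving a minion through salt.client.LocalClient (the target name is illustrative; test.ping stands in for ceph.rados_commands):

from salt.client import LocalClient

# cmd() is the synchronous counterpart of run_job(); the constructor
# takes the path to the master config file
client = LocalClient('/etc/salt/master')
print client.cmd('han230', 'test.ping')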

Collecting ceph monitoring data

From the diamond CephCollector code, the collected data is documented at: http://docs.ceph.com/docs/jewel/dev/perf_counters/

View the metrics:

ceph daemon mon.$HOSTNAME perf dump

View what the metrics mean:

ceph daemon mon.$HOSTNAME perf schema

Equivalent to:

ceph --admin-daemon /var/run/ceph/ceph-mon.$HOSTNAME.asok perf dump
ceph --admin-daemon /var/run/ceph/ceph-mon.$HOSTNAME.asok perf schema

This requires read/write access to the admin socket. A quick look at the mon and osd data suggests there are no iops or read/write bandwidth statistics (?).

diamond's CephStatsCollector simply parses the output of ceph -s. A quick check shows that when the cluster has no IO the corresponding line is missing, and the units and values may differ between situations, so this collection method feels unreliable.

ceph status can emit json, but it does not seem to contain iops or bandwidth data either.

ceph status -f json-pretty

Referring to the vsm ppt ( https://01.org/sites/default/files/documentation/03_-_intel_virtual_storage_manager_for_ceph_0.5_in_detail_v2.pptx ) for the commands behind its statistics, one stands out:

$ ceph osd pool stats -f json-pretty

[
    {
        "pool_name": "rbd",
        "pool_id": 0,
        "recovery": {},
        "recovery_rate": {},
        "client_io_rate": {
            "read_bytes_sec": 409,
            "write_bytes_sec": 2458,
            "read_op_per_sec": 0,
            "write_op_per_sec": 0
        }
    }
]

This output does include IO data. So in ceph, free space is viewed cluster-wide, while IO is viewed per pool (which also makes multi-pool data easier to read).
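
A quick sketch of pulling these per-pool rates programmatically (field names as in the output above):

import json
import subprocess

# parse `ceph osd pool stats` and print per-pool client IO rates
out = subprocess.check_output(['ceph', 'osd', 'pool', 'stats', '-f', 'json'])
for pool in json.loads(out):
    io = pool.get('client_io_rate', {})
    print pool['pool_name'], \
        io.get('read_bytes_sec', 0), io.get('write_bytes_sec', 0)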

Comparing the diamond calamari_rebased_on_v3.5 branch against v4.0.515, CephStatsCollector is essentially identical. The upstream CephCollector simply collects the perf dump results; calamari adds _collect_cluster_stats() on top.

Its logic first restricts collection to the mon leader:

      # We have a mon, see if it is the leader
      mon_status = self._admin_command(path, ['mon_status'])
      if mon_status['state'] != 'leader':
          return
      fsid = mon_status['monmap']['fsid']

This both ensures data correctness (?) and avoids duplicate collection across multiple mons.

It then collects:

ceph -f json-pretty pg dump pools
ceph -f json-pretty pg dump summary
ceph -f json-pretty df

The code:

      # Gather "ceph pg dump pools" and file the stats by pool
      for pool in self._mon_command(cluster_name, ['pg', 'dump', 'pools']):
          publish_pool_stats(pool['poolid'], pool['stat_sum'])

      all_pools_stats = self._mon_command(cluster_name, ['pg', 'dump', 'summary'])['pg_stats_sum']['stat_sum']
      publish_pool_stats('all', all_pools_stats)

      # Gather "ceph df"
      df = self._mon_command(cluster_name, ['df'])
      self._publish_cluster_stats(cluster_name, fsid, "df", df['stats'])

The cluster byte statistics returned by ceph df match the cluster section of ceph daemon mon.$HOSTNAME perf dump, so the same data can be collected via different channels.

The calamari display lacks read/write bandwidth. The iops page render request uses the num_read and num_write fields from ceph pg dump pools, with the data URL: http://localhost:8000/graphite/render/?format=json-array&target=ceph.cluster.75df7d7c-2645-461e-834a-c838a67155c6.pool.0.num_read&target=ceph.cluster.75df7d7c-2645-461e-834a-c838a67155c6.pool.0.num_write .

Watching these two values, they look like running totals rather than rates (per-second values); could it be that calamari computes the delta automatically to get a rate (?). The code indeed collects them as counters:

      def publish_pool_stats(pool_id, stats):
          # Some of these guys we treat as counters, some as gauges
          delta_fields = ['num_read', 'num_read_kb', 'num_write', 'num_write_kb', 'num_objects_recovered',
                          'num_bytes_recovered', 'num_keys_recovered']
          for k, v in stats.items():
              self._publish_cluster_stats(cluster_name, fsid, "pool.{0}".format(pool_id), {k: v},
                                          counter=k in delta_fields)

num_read_kb and num_write_kb are collected as well, which should be the read/write data sizes, so calamari ought to be able to compute bandwidth too (?). It turns out the automatic delta computation is done by diamond; see diamond/collector.py:

  def publish_counter(self, name, value, precision=0, max_value=0,
                      time_delta=True, interval=None, allow_negative=False,
                      instance=None):
      raw_value = value
      value = self.derivative(name, value, max_value=max_value,
                              time_delta=time_delta, interval=interval,
                              allow_negative=allow_negative,
                              instance=instance)
      return self.publish(name, value, raw_value=raw_value,
                          precision=precision, metric_type='COUNTER',
                          instance=instance)
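
In other words, for counters diamond stores the previous raw value and publishes the per-interval delta; a simplified sketch of the effective computation (ignoring derivative()'s wrap-around and max_value handling):

# simplified sketch of what derivative() computes for a counter,
# assuming time_delta=True
def rate(old_value, new_value, seconds_elapsed):
    return (new_value - old_value) / float(seconds_elapsed)

# e.g. num_read going 1000 -> 1600 over a 60s interval => 10.0 ops/s
print rate(1000, 1600, 60)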

Finally, diamond stores collected data under servers.$HOSTNAME. by default; how does calamari store it under ceph. ? From the code, it merely defines a custom GlobalName type and wraps the name once:

class GlobalName(str):
    pass

# ... ...

    def _publish_cluster_stats(self, cluster_name, fsid, prefix, stats, counter=False):
        """
        Given a stats dictionary, publish under the cluster path (respecting
        short_names and cluster_prefix)
        """

        for stat_name, stat_value in flatten_dictionary(
            stats,
            path=[self._cluster_id_prefix(cluster_name, fsid), prefix]
        ):
            stat_name = _PATH_SEP.join(stat_name)
            name = GlobalName(stat_name)
            if counter:
                self.publish_counter(name, stat_value)
            else:
                self.publish_gauge(name, stat_value)

Reading the metric-path construction code in diamond/collector.py, I could not find what this custom GlobalName actually accomplishes, so I stopped digging. From this logic, the proper way to define a global metric is to configure instance_prefix (e.g. ceph.cluster) and pass a non-None instance (e.g. the fsid). The diamond v4.0.515 branch still has similar logic:

  def get_metric_path(self, name, instance=None):
      """
      Get metric path.
      Instance indicates that this is a metric for a
          virtual machine and should have a different
          root prefix.
      """
      if 'path' in self.config:
          path = self.config['path']
      else:
          path = self.__class__.__name__

      if instance is not None:
          if 'instance_prefix' in self.config:
              prefix = self.config['instance_prefix']
          else:
              prefix = 'instances'
          if path == '.':
              return '.'.join([prefix, instance, name])
          else:
              return '.'.join([prefix, instance, path, name])

      if 'path_prefix' in self.config:
          prefix = self.config['path_prefix']
      else:
          prefix = 'systems'

      if 'path_suffix' in self.config:
          suffix = self.config['path_suffix']
      else:
          suffix = None

      hostname = get_hostname(self.config)
      if hostname is not None:
          if prefix:
              prefix = ".".join((prefix, hostname))
          else:
              prefix = hostname

      # if there is a suffix, add after the hostname
      if suffix:
          prefix = '.'.join((prefix, suffix))

      is_path_invalid = path == '.' or not path

      if is_path_invalid and prefix:
          return '.'.join([prefix, name])
      elif prefix:
          return '.'.join([prefix, path, name])
      elif is_path_invalid:
          return name
      else:
          return '.'.join([path, name])
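
A worked example of the instance branch above (the approach the previous paragraph recommends; config values illustrative):

# assuming config = {'instance_prefix': 'ceph.cluster', 'path': '.'}
# and a non-None instance:
#   get_metric_path('df.total_used', instance=fsid)
#   -> '.'.join(['ceph.cluster', fsid, 'df.total_used'])
#   -> 'ceph.cluster.<fsid>.df.total_used'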

This completes the analysis of calamari's customizations to diamond. These collections could be extracted into a new CephClusterCollector.

One problem remains: the disk usage chart shows no data. The browser fetches total_used, URL: http://localhost:8000/graphite/render/?format=json-array&target=scale(ceph.cluster.75df7d7c-2645-461e-834a-c838a67155c6.df.total_avail,%201024)&target=scale(ceph.cluster.75df7d7c-2645-461e-834a-c838a67155c6.df.total_used,%201024) , while the metric actually collected is total_used_bytes, so the frontend code needs a small fix here. After switching the romana code from master to the 1.3 branch, the data appeared; the 1.3 branch appears to be newer.