OpenStack Denver PTG – Sept 2018 – Notetaking

I take notes, especially in big meetings, as a way to keep my attention on the meeting (and avoid falling asleep). Sometimes it comes in handy, as my memory isn’t great anyway. Sometimes the benefit is questionable, as my notes might not make sense to anyone else, especially if they aren’t in an OpenStack session.

I took the majority of the notes for the Monasca sessions at the Denver 2018 OpenStack Project Teams Gathering. Take a look – https://etherpad.openstack.org/p/monasca-ptg-stein – and I did the same last time around – https://etherpad.openstack.org/p/monasca-ptg-rocky .

As a sample, I also took notes for the parts of the OpenStack-Helm sessions that I was interested in. I didn’t slap them into the OS-Helm etherpad as I am just peripherally part of the team and didn’t want to break their flow. But I’m including them below in raw format just for backup/reference.

Thursday openstack-helm
fluentd and fluentbit
they had to add some plugins and stuff, but Jae’s team got it more mature
fluentbit is used as aggregator, then to elasticsearch
handled through the values file
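To give a sense of what “handled through the values file” means, here is a rough sketch of the kind of Helm values override involved; the key names are illustrative only, not the chart’s actual schema:

```yaml
# Hypothetical values override for a fluentbit/fluentd logging chart.
# Key names are illustrative; check the chart's values.yaml for the real schema.
conf:
  fluentbit:
    inputs:
      - name: tail                          # collect container logs on each node
        path: /var/log/containers/*.log
  fluentd:
    outputs:
      - name: elasticsearch                 # aggregator ships to Elasticsearch
        host: elasticsearch-logging.osh-infra.svc.cluster.local
        port: 9200
```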

some connection to fluentd and kafka
chart includes job for bootstrapping templates to elasticsearch, can determine fields to index/normalize and search on
kibana goes hand in hand with elasticsearch, chart could use a bit of work
since not using elastic beats, have to create dashboards yourself
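The “bootstrapping templates” job above refers to loading an Elasticsearch index template before data arrives, so fields are mapped the way you want to search on them. A minimal sketch of what such a template defines (field names are examples only, and the exact mapping layout varies by Elasticsearch version; the chart would render something like this to JSON and PUT it to the template API):

```yaml
# Sketch of an Elasticsearch index template, shown as YAML for readability
index_patterns:
  - "logstash-*"
settings:
  number_of_shards: 1
mappings:
  properties:
    kubernetes.pod_name:
      type: keyword      # exact-match field, good for filtering/aggregation
    log:
      type: text         # full-text searchable message body
```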

monitoring – running Prometheus, the reference for mon in kube
typical exporters – node exporter, kube-state-metrics, process exporter, some exporters for rabbit, mysql (maria)
wanted to export openstack metrics, but there wasn’t a readily available os exporter (the Canonical one didn’t seem to work well, reached out to them). a separate exporter was developed and is available in att-comdev, would like to get it moved to the openstack namespace

vis in grafana (works with prometheus) only showing prometheus now but could add influx or elastic

exporter vs mon or ceil? exporter today primarily checks if api endpoints are up and responsive. since behind ingress, also can check number of requests and response time
goal and scope of mon is wider, additional project. not only about gathering infra metrics but can do app mon and is multi tenant and can expose to customers per-tenant.
metrics collected are flexible and can plug in. have a plugin for prom to push to monasca api. backend is influx or cassandra. have alerting. also have logging based on elk, use logstash or fluentd.
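To make the “are the API endpoints up” check a couple of lines up concrete, a Prometheus alerting rule along these lines is the usual way to turn that exporter signal into an alert (the job label and thresholds are illustrative, not from the session):

```yaml
# Sketch of a Prometheus alerting rule for the "API endpoint down" case
groups:
  - name: openstack-api
    rules:
      - alert: OpenstackAPIDown
        expr: up{job="openstack-exporter"} == 0   # illustrative job label
        for: 5m                                   # avoid alerting on a single failed scrape
        labels:
          severity: critical
        annotations:
          summary: "OpenStack API exporter target is down"
```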

want knowledge of what things to look at to check openstack status properly. summarized doc coming for current stuff. listing of all metrics and try to figure out important ones and come up with rules.
that’s half of the problem. up to now concerns have been operator focused – geared not to multi-tenant but just ops-based tools. attractive that mon has monitoring and alerting for customers

in terms of deployment, kolla has worked on containerizing and adding ansible roles for monasca, almost completely there. hpe team has created project monasca-helm with some charts, not actively maintained in last half year but should work. ?ci on that? think so, tempest tests. <osh guy> have looked at charts and overall good, easiest may be to look at charts or import to oshelm or cross-gate and integrate test so it can be a supported part of deployment. would help with keystone integration. pretty sure hpe doesn’t have resources, SUSE a maybe (hard to say).

what are the future plans for monasca? add support for os notifications to be persisted and can expose to 3rd parties outside openstack. upgrade kafka version. in addition to multitenancy want to add more security related to mon agent.

jae, what is your impression? different system, mon is a complete mon suite. prom and elasticsearch just about gathering and storing. can build a system to do analysis.
monitoring orgs internally want data, but scared to let them at it. don’t want to open prom up to mon teams because they will plunder and do expensive queries.
witek – influx is not exposed directly, all through mon apis. don’t have quotas but could add to mon api (how many metrics to store per project or how many queries).
att limiting mon team to just nagios, have lots of rich data but no clean way to solve corporate access issues and control what they can see. consider building tooling around it.
witek – yes, and if clear requirements we can try to address them.
would be interesting to do a mon poc and take/develop charts so we can address some of the support issues. oshelm team offers to help getting charts incorporated
witek – hpe did mon-docker and mon-helm, did that work a year or two ago. have worked on getting it upstreamed and building images. different images than the kolla ones. difference is that kolla is centos and ubuntu, these images are different and in process of having ci
requirements of images are driven by the charts.

good potential to work together
poc – take existing charts and write a script to bring up monasca alongside oshelm then kick tires.
should be able to try it out within a month

biggest diff in mon helm is that it deploys monasca as a fat chart (lots of requirements that it brings in). one of the things oshelm learned is that that model leads to being harder to maintain and manage. helm tends to touch everything to modify one thing, so breaking it up and not bundling everything in is better. so may need a shift there
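For context, a “fat chart” is one whose requirements pull in everything it needs as subcharts, roughly like the sketch below (Helm 2 style; the names and versions are placeholders, not the real monasca-helm layout). The oshelm experience is that independently managed charts are easier to live with:

```yaml
# Hypothetical requirements.yaml for a "fat" monasca chart (Helm 2 style).
# Names/versions are placeholders to illustrate the pattern only.
dependencies:
  - name: kafka
    version: 0.1.0
    repository: "file://../kafka"
  - name: zookeeper
    version: 0.1.0
    repository: "file://../zookeeper"
  - name: influxdb
    version: 0.1.0
    repository: "file://../influxdb"
```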

try to make openstack-helm infra more of an ‘enterprise ready’ set of components.

there is a monasca plugin for prometheus. change in chart to use it? should all be functional

may want to do a cassandra chart right in oshelm infra
may want an influxdb chart too

? service discovery? is there service discovery for mon? prom uses annotations on pods to determine what to monitor. will check for monasca-prom plugin
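The annotation-based discovery mentioned here is the common Prometheus-on-Kubernetes convention: the scrape config’s relabel rules keep only pods carrying certain annotations. Roughly (the prometheus.io annotation names are the widespread convention, but depend on how the scrape config is written):

```yaml
# Pod metadata opting in to Prometheus scraping (common convention, not universal)
metadata:
  annotations:
    prometheus.io/scrape: "true"    # kubernetes_sd relabel rules keep pods with this set
    prometheus.io/port: "9100"      # which port exposes metrics
    prometheus.io/path: "/metrics"  # override if the metrics path differs
```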


feedback and lessons from running current oshelm stack
learned from 50 node clusters that current stuff was pushing really hard and had to remediate
number of prom exporters and number of exports – found a catch-all config wasn’t realistic. cAdvisor for container metrics on each host. found on larger deployments that prom consumed a lot of mem and the pvcs for prom were filling up quickly (maybe in a week), ~500G per prom replica (mon for 45 days was about 1T). prom allows excluding some metrics, so started pruning unused metrics – removing about 45 metrics from prom dropped util by about 60% mem. each metric is 3-dimensional, so not being mindful of metrics you gather can be big
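The “exclude some metrics” knob is Prometheus’s metric_relabel_configs, applied per scrape job; a sketch of dropping unused series at scrape time (the metric names are only examples):

```yaml
# Example scrape job that drops unwanted series before they are stored
scrape_configs:
  - job_name: cadvisor
    metric_relabel_configs:
      - source_labels: [__name__]
        regex: 'container_tasks_state|container_memory_failures_total'  # example metrics to prune
        action: drop
```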
want to provide doc or op guide for using prom
can change scrape interval, though too large can get stale metrics. suggested about 60 seconds. retention is difficult to manage as data gets thrown out after the interval – need to push to other storage for longer. not granular (mon working on retention)
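The two knobs being discussed are the global scrape interval in prometheus.yml and the server’s retention flag; a minimal sketch (the exact retention flag name varies by Prometheus version):

```yaml
# prometheus.yml fragment: scrape every 60s as suggested in the session
global:
  scrape_interval: 60s
# Retention is a server flag rather than config, e.g. (version dependent):
#   --storage.tsdb.retention=7d
```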

thanos runs as a sidecar for prometheus and claims to give the ability to handle HA prom plus integration for storing metrics in S3 object storage. want to explore that integration. has smart query retrieval across both prom and obj storage.
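For reference, the Thanos sidecar reads an object-store config like the sketch below, uploads TSDB blocks to it, and serves queries alongside Prometheus; the bucket, endpoint, and credentials are placeholders:

```yaml
# Thanos objstore config sketch (S3-compatible storage; values are placeholders)
type: S3
config:
  bucket: prometheus-long-term
  endpoint: s3.example.com
  access_key: CHANGEME
  secret_key: CHANGEME
```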

some learning on elastic too. was getting multiple copies of same event into elastic – needed config in fluentd and fluentbit to discard stdout to avoid dups

elastic retention can be just 2 weeks. need your own system to store longer. skt has their own system, trying to push data to kafka for other systems
lma in general is such a big area. is this covered in doc? tuning/optimization for sizes and uses of clusters is really valuable. want to bring in richer examples
doc about what each exporter does and important variables
att looking at calico v3

filebeat and metricbeat – steve has been looking at
leverage for whatever deployment an operator is using. started playing with charts for filebeat as an alt for fluentd, metricbeat a possibility for metrics, packetbeat for network, others for app perf metrics in elasticapm config. benefit of packetbeat is it gives bootstrapped kibana dashboards with no additional conf. usable out of box. also a beta module for prom – scrape prom endpoints then can store in elastic. patch set WIP
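The “beta module for prom” is Metricbeat’s prometheus module, which scrapes a Prometheus-format endpoint and indexes the samples into Elasticsearch; roughly (host and period are placeholders):

```yaml
# Metricbeat prometheus module sketch (beta at the time)
metricbeat.modules:
  - module: prometheus
    metricsets: ["collector"]
    period: 60s
    hosts: ["node-exporter:9100"]   # any endpoint exposing Prometheus-format metrics
    metrics_path: /metrics
```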
elasticapm is an elastic service for app perf to try to identify bottlenecks. can add libs into your app to report perf.

elastic doesn’t have built-in rotation or maintenance or snapshotting of indexes – elastic curator, include as a cron job. curator supports multiple types of store – shared file store or s3. some WIP for using ceph rados gateway for es snapshots
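Curator is driven by an action file; something like the sketch below, run from a cron job, covers the index rotation the stack itself doesn’t do (the index prefix and age are placeholders):

```yaml
# Elasticsearch Curator action file sketch: delete log indices older than 14 days
actions:
  1:
    action: delete_indices
    description: prune old log indices
    options:
      ignore_empty_list: True
    filters:
      - filtertype: pattern
        kind: prefix
        value: logstash-
      - filtertype: age
        source: name
        direction: older
        timestring: '%Y.%m.%d'
        unit: days
        unit_count: 14
```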

You can see that the flow is hard to follow if you weren’t in the meeting. I took these notes in gedit (a simple text editor), so there isn’t any formatting like I would do on etherpad. As I was new to the OS-Helm group, I didn’t know everyone’s names; otherwise I would have prefaced all the questions with who was asking.

Here is another shorter chunk from when I dropped in on the OpenStack-Ansible meeting for a few minutes when I noticed they had “telemetry” on the schedule.

osa @ 11am

osquery and elasticsearch in repo now, zuul testing
just elastic stack – beat collectors: audit, journal (community), packetbeat, etc
collecting sensor data for most part. not openstack notifications, just how well underlying opsys is functioning
elasticapm also for plugin to horizon and keystone
doesn’t work with everything – doesn’t work with flask or other frameworks
all in osa ops repo, tested independently
all roles go through and enable based on detecting what’s deployed
storage-only non openstack deployments can still do the same mon
are enabling osprofiler to use the elasticsearch backend. nova boot --profile <hmac> will profile that task. good for debugging
tooling around curation of indexes, still wip
backend for many things monitored, expose to other teams to use data
no user restrictions
in use in some large environments

That was an interesting quick drop-in, though I still don’t know much about OSA. What they were calling “telemetry” actually had nothing to do with OpenStack Telemetry or Ceilometer. Rather, they were using Elasticsearch Beat collectors to gather operations data. While this may meet some operations monitoring needs, it was nowhere near a state that would work for allowing tenants to see the monitoring data, or to be usable for billing.