Monitoring workflows
scheduler interface
check the run interface
http://localhost:3000/runs
If a run failed, click on the run ID of that run to look at the run log.
portainer status
Gleaner and Nabu run as services whose names are prefixed with sch_. The pattern is:
sch_PROJECT_step
If something does not seem to be working, find the container whose name starts with sch_PROJECT_step,
then open a terminal in it
(you may need to use /bin/sh as the shell).
cd logs
ls -l
tail <log file name>
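The manual steps above (list the logs, then tail one of them) can also be sketched in Python. This is only a sketch: it assumes Python is available where the logs are (or that the logs directory has been copied out of the container), and the helper names `newest_log` and `tail` are hypothetical, not part of Gleaner or Nabu.

```python
from collections import deque
from pathlib import Path


def newest_log(log_dir, pattern="*.log"):
    """Return the most recently modified file matching pattern, or None."""
    files = [p for p in Path(log_dir).glob(pattern) if p.is_file()]
    return max(files, key=lambda p: p.stat().st_mtime, default=None)


def tail(path, n=20):
    """Return the last n lines of a text file, without trailing newlines."""
    with open(path, "r", encoding="utf-8", errors="replace") as fh:
        return [line.rstrip("\n") for line in deque(fh, maxlen=n)]


if __name__ == "__main__":
    log = newest_log("logs")  # assumes the current directory contains logs/
    if log is None:
        print("no log files found in ./logs")
    else:
        print(f"== {log} ==")
        print("\n".join(tail(log)))
```

Picking the newest file by modification time is a convenience; the timestamp embedded in the log file name could be used instead.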
Issues
why did it fail
Server:
* is the disk full... [find in minio logs]
Basic Source issues
* does the sitemap exist [check url]
* do the urls in the sitemap have jsonld [from the sitemap, check some urls on validator.schema.org]
* are the JSONLD types ones we handle [@type dataset, datacatalog]
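The basic source checks can be sketched offline once a sitemap and a sample page have been downloaded. This is a rough sketch, not how Gleaner itself does it: the function names and `HANDLED_TYPES` are hypothetical, the type comparison is assumed to be case-insensitive, and only top-level `@type` values are inspected (an `@graph` wrapper is not unpacked).

```python
import json
import re
import xml.etree.ElementTree as ET

# Types from the checklist above; compared case-insensitively (an assumption).
HANDLED_TYPES = {"dataset", "datacatalog"}

JSONLD_RE = re.compile(
    r'<script[^>]*type=["\']application/ld\+json["\'][^>]*>(.*?)</script>',
    re.IGNORECASE | re.DOTALL,
)


def sitemap_urls(xml_text):
    """Return the <loc> values of a sitemap, ignoring XML namespaces."""
    root = ET.fromstring(xml_text)
    return [
        el.text.strip()
        for el in root.iter()
        if el.tag.rsplit("}", 1)[-1] == "loc" and el.text
    ]


def jsonld_types(html_text):
    """Return the set of top-level @type values in JSON-LD script blocks."""
    types = set()
    for block in JSONLD_RE.findall(html_text):
        try:
            doc = json.loads(block)
        except json.JSONDecodeError:
            continue  # malformed JSON-LD is treated as absent
        for item in doc if isinstance(doc, list) else [doc]:
            if not isinstance(item, dict):
                continue
            found = item.get("@type")
            for value in found if isinstance(found, list) else [found]:
                if isinstance(value, str):
                    types.add(value)
    return types


def has_handled_type(html_text):
    """True if any JSON-LD block declares one of the handled types."""
    return any(t.lower() in HANDLED_TYPES for t in jsonld_types(html_text))
```

A page for which `has_handled_type` returns False is still worth pasting into validator.schema.org, since the regular expression here is far less thorough than a real validator.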
Basic Failure in gleaner
* Nonexistent partition keys [add source to gleaner configs, and run s3 job manually if needed]
* 409_error [HTTP 409 error: duplicate container running in portainer, remove old container]
* source_not_found [Missing source in a gleaner config (two for now)]
* logs [logs in minio]
* s3 [did data make it to s3]
release
* is there data in s3
* is there a release file in s3
summarize
* is there data in s3
* is this data a dataset?
Sitemap issue
{"file":"/home/runner/work/gleaner/gleaner/internal/summoner/acquire/resources.go:75","func":"github.com/gleanerio/gleaner/internal/summoner/acquire.ResourceURLs","level":"error","msg":"Error getting sitemap urls for: wodbXML syntax error on line 1800: element \u003clink\u003e closed by \u003c/head\u003e","time":"2024-11-21T17:44:43Z"}
logs/gleaner-runstats-2024-11-21-17-44-43.log
RunStats:
Start: 2024-11-21 17:44:41.875835094 +0000 UTC m=+0.627571534
Reason: Complete
Soruce:
- name: wodb
Start: 2024-11-21 17:44:43.81414505 +0000 UTC m=+2.565881545
End: 2024-11-21 17:44:43.874671482 +0000 UTC m=+2.626407982
SitemapCount: 0
SitemapHttpError: 0
SitemapIssues: 0
SitemapSummoned: 0
logs/repo-wodb-loaded-2024-11-21-17-44-43.log
level=info msg="Queuing URLs for wodb"
level=info msg="URL Count 0"
logs/repo-wodb-stats-2024-11-21-17-44-43.log
SourceStats:
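The error in the log above says the sitemap could not be parsed as XML, and the run then reports zero sitemap URLs. The `<link>` and `</head>` element names suggest the URL may be returning an HTML page rather than a sitemap, though that is an inference from the message. A minimal sketch for checking a downloaded sitemap for well-formedness follows; `check_sitemap_xml` is a hypothetical helper that uses Python's standard XML parser, so its error wording differs from Gleaner's.

```python
import xml.etree.ElementTree as ET


def check_sitemap_xml(xml_text):
    """Return None if the text is well-formed XML, else (line, column, message)."""
    try:
        ET.fromstring(xml_text)
    except ET.ParseError as err:
        line, column = err.position
        return line, column, str(err)
    return None


if __name__ == "__main__":
    # An HTML-like document: <link> is never closed, so </head> does not match.
    broken = "<html><head>\n<link rel='stylesheet'>\n</head></html>"
    print(check_sitemap_xml(broken))
```

If the check reports a problem, opening the sitemap URL in a browser usually shows quickly whether it is serving HTML instead of XML.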
409_error
This appears in the error message of the failed run in Dagster.
source_not_found
This is found in the Gleaner logs in MinIO:
{"file":"/home/runner/work/gleaner/gleaner/cmd/gleaner/main.go:125","func":"main.main","level":"error","msg":"CAUTION: no matching source, did your -source VALUE match a sources.name VALUE in your config file?","time":"2024-11-21T17:46:22Z"}
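Because Gleaner writes entries like the one above as one JSON object per line, the error-level entries can be pulled out of a downloaded log with a few lines of Python. This is a sketch under that assumption; `error_entries` is a hypothetical helper, and lines that are not JSON (such as the `level=info msg=...` lines) are skipped.

```python
import json


def error_entries(lines):
    """Yield (time, func, msg) for each JSON log line whose level is 'error'."""
    for line in lines:
        line = line.strip()
        if not line.startswith("{"):
            continue  # not a JSON log line
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue  # truncated or malformed line
        if isinstance(entry, dict) and entry.get("level") == "error":
            yield entry.get("time"), entry.get("func"), entry.get("msg")
```

For a log file copied out of MinIO, something like `for item in error_entries(open(path, encoding="utf-8")): print(item)` lists the errors, where `path` is wherever the file was saved locally.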
summarize: is there data
No data was returned from a summary query.
Error in Dagster:
loading Summary graph failed. argument of type 'numpy.float64' is not iterable