Loading Data for the Initial Installation
This is step 4 of 5 major steps:
- Install base containers on a server
- Setup services containers
- Setup Gleaner containers
- Initial setup of services and loading of data (this step)
- Setup Geocodes UI using datastores defined in Initial Setup
Step Overview:
- create data stores in minioadmin and graph
- install glcon, if not installed
- create a configuration file to install a small set of data
./glcon config init --cfgName gctest
- edit the generated configuration (localConfig.yaml and the sources CSV)
./glcon config generate --cfgName gctest
- set up and summon data using 'gleaner'
./glcon gleaner setup --cfgName gctest
./glcon gleaner batch --cfgName gctest
- load data to graph using 'nabu'
./glcon nabu prefix --cfgName gctest
./glcon nabu prune --cfgName gctest
- Test data in Graph
- Example of how to edit the source
- edit gctest.csv
- regenerate configs
- rerun batch
- Run the Summarize task (performance related).
Regenerate
If you edit localConfig.yaml, you need to regenerate the configs using
./glcon config generate --cfgName gctest
Step Details
Setup Datastores
There are several datastores required to enable data summoning (harvesting) and conversion to a graph. While production presently uses the earthcube repository convention, we suggest that tutorials and communities setting up an instance use the geocodes repository pattern. Earthcube/Decoder staff should use the 'A Community' pattern when setting up an instance for a community.
Repository | config | s3 Bucket | graph namespaces | notes |
---|---|---|---|---|
GeocodesTest | gctest | gctest | gctest, gctest_summary | samples of actual datasets |
geocodes | geocodes | geocodes | geocodes, geocodes_summary | suggested standalone repository |
earthcube | geocodes | gleaner | earthcube, summary | DEFAULT PRODUCTION NAME |
A COMMUNITY, e.g. | {acomm} | {acomm} | {acomm}, {acomm}_summary | a community's tenant repository |
Initial Setup
We will be setting up both the gctest and geocodes repositories.
Setup Minio buckets
Gleaner extracts JSON-LD from a web page and stores it in an S3 system (MinIO) in a bucket.
go to https://minioadmin.{your host}/
create buckets gctest and geocodes
go to the settings for each bucket and make it public.
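If you prefer the command line, the same buckets can be created with the MinIO client (mc). This is a minimal sketch: the geocodes-s3 alias name is arbitrary, and the host and credentials are placeholders for your own deployment's values (older mc releases use `mc policy set download` instead of `mc anonymous set download`).

```bash
# Placeholder host and credentials; substitute your deployment's values.
mc alias set geocodes-s3 https://oss.example.org worldsbestaccesskey worldsbestaccesskey

# Create the two buckets
mc mb geocodes-s3/gctest
mc mb geocodes-s3/geocodes

# Allow anonymous (public) reads, matching the "make public" console step
mc anonymous set download geocodes-s3/gctest
mc anonymous set download geocodes-s3/geocodes
```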
Setup Graph stores.
Nabu pulls from the s3 system, converts to RDF quads, and uploads to a graph store.
go to https://graph.{your host}
on the namespace tab, create namespaces "gctest" and "geocodes" with mode 'quads' and a full-text index
on the namespace tab, create namespaces "gctest_summary" and "geocodes_summary" with mode 'triples' and a full-text index
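The console steps above are the documented path. As a sketch, the same namespaces can also be created through Blazegraph's REST API, assuming a stock Blazegraph/NanoSparqlServer install that accepts a Java-properties payload; the property names below are the standard Blazegraph ones, and GRAPH is a placeholder for your graph host.

```bash
GRAPH=https://graph.example.org   # placeholder: your graph host

create_ns () {  # $1 = namespace name, $2 = "true" for quads mode, "false" for triples
  curl -X POST "${GRAPH}/blazegraph/namespace" \
    -H 'Content-Type: text/plain' \
    --data-binary @- <<EOF
com.bigdata.rdf.sail.namespace=$1
com.bigdata.rdf.store.AbstractTripleStore.quads=$2
com.bigdata.rdf.store.AbstractTripleStore.textIndex=true
EOF
}

create_ns gctest true
create_ns geocodes true
create_ns gctest_summary false
create_ns geocodes_summary false
```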
Install Indexing Software
glcon
It is a console application that combines the functionality of Gleaner and Nabu into a single application. It also has features to create and manage configurations for gleaner and nabu.
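Since glcon bundles everything behind subcommands that mirror the steps below, the built-in help is a quick way to explore them (exact output varies by glcon version):

```bash
./glcon --help
./glcon config --help
./glcon gleaner --help
./glcon nabu --help
```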
Harvest and load data
The goal is to create a configuration file to load the gctest data. The sitemap for these sources is referenced in gctest.csv (shown later on this page).
Create a configuration and load sample data
Create a configuration for Continuous Integration
./glcon config init --cfgName gctest
ubuntu@geocodes-dev:~/indexing$ ./glcon config init --cfgName gctest
2022/07/21 23:27:31 EarthCube Gleaner
init called
make a config template if there isn't already one
ubuntu@geocodes-dev:~/indexing$ ls configs/gctest
README_Configure_Template.md localConfig.yaml sources.csv
gleaner_base.yaml nabu_base.yaml
ubuntu@geocodes-dev:~/indexing$
Copy sources list to configs/gctest
Note
this assumes you are in the indexing directory and have put the geocodes repository at ~/geocodes (i.e., in your home directory)
cp ~/geocodes/deployment/ingestconfig/gctest.csv configs/gctest/
edit files:
You will need to change the localConfig.yaml
nano configs/gctest/localConfig.yaml
```yaml
---
minio:
  address: oss.{YOUR HOST}
  port: 443
  accessKey: worldsbestaccesskey
  secretKey: worldsbestaccesskey
  ssl: true
  bucket: gctest # can be overridden with MINIO_BUCKET
sparql:
  endpoint: https://graph.{YOUR HOST}/blazegraph/namespace/gctest/sparql
s3:
  bucket: gctest # sync with above... can be overridden with MINIO_BUCKET... gets zapped if it's not here.
  domain: us-east-1
# headless field in gleaner.summoner
headless: http://127.0.0.1:9222
sourcesSource:
  type: csv
  location: gctest.csv
  # this can be a remote csv
  # type: csv
  # location: https://docs.google.com/spreadsheets/d/1G7Wylo9dLlq3tmXe8E8lZDFNKFDuoIEeEZd3epS0ggQ/gviz/tq?tqx=out:csv&sheet=TestDatasetSources
```
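As the comments in the file note, the bucket can be overridden with MINIO_BUCKET. For example (a sketch based only on those comments), a one-off run can be pointed at the geocodes bucket without editing the file:

```bash
# Override the bucket for a single invocation; the MINIO_BUCKET variable name
# comes from the comments in localConfig.yaml above.
MINIO_BUCKET=geocodes ./glcon gleaner setup --cfgName gctest
```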
Regenerate
If you edit localConfig.yaml, you need to regenerate the configs using
./glcon config generate --cfgName gctest
Generate configs
./glcon config generate --cfgName gctest
./glcon config generate --cfgName gctest
2022/07/21 23:37:46 EarthCube Gleaner
generate called
{SourceType:sitemap Name:geocodes_demo_datasets Logo:https://github.com/earthcube/GeoCODES-Metadata/metadata/OtherResources URL:https://raw.githubusercontent.com/earthcube/GeoCODES-Metadata/gh-pages/metadata/Dataset/sitemap.xml Headless:false PID:https://www.earthcube.org/datasets/ ProperName:Geocodes Demo Datasets Domain:0 Active:true CredentialsFile: Other:map[] HeadlessWait:0}
make copy of servers.yaml
Regnerate gleaner
Regnerate nabu
Flight test
Run setup to see if you can connect to the minio store
./glcon gleaner setup --cfgName gctest
ubuntu@geocodes-dev:~/indexing$ ./glcon gleaner setup --cfgName gctest
2022/07/21 23:42:54 EarthCube Gleaner
Using gleaner config file: /home/ubuntu/indexing/configs/gctest/gleaner
Using nabu config file: /home/ubuntu/indexing/configs/gctest/nabu
setup called
2022/07/21 23:42:54 Validating access to object store
2022/07/21 23:42:54 Connection issue, make sure the minio server is running and accessible. The specified bucket does not exist.
ubuntu@geocodes-dev:~/indexing$
Access issues
{"file":"/github/workspace/internal/organizations/org.go:87","func":"github.com/gleanerio/gleaner/internal/organizations.BuildGraph","level":"error","msg":"orgs/geocodes_demo_datasets.nqThe Access Key Id you provided does not exist in our records.","time":"2023-01-31T15:27:39-06:00"}
- The access key or secret key could be incorrect
- The address may be incorrect. It should be a hostname or TCP/IP address, not a URL
- ssl may need to be true
- See setup issues
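Two quick checks outside of glcon can narrow these down. This is a sketch that assumes the hypothetical geocodes-s3 mc alias from the bucket-setup step; HOST is a placeholder for your domain.

```bash
HOST=example.org                # placeholder: your deployment's domain
mc ls geocodes-s3               # should list the gctest and geocodes buckets if the keys are right
curl -I "https://oss.${HOST}"   # any HTTP response means the address and TLS are reachable
```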
Load Data
Gleaner will harvest JSON-LD from the URLs listed in the sitemap.
Robots.txt
OK TO IGNORE: you can ignore errors about robots.txt being unavailable and about sitemap.xml not being a sitemap index.
{"file":"/github/workspace/internal/summoner/acquire/resources.go:204","func":"github.com/gleanerio/gleaner/internal/summoner/acquire.getRobotsForDomain","level":"error","msg":"error getting robots.txt for https://www.earthcube.org/datasets/allgood:Robots.txt unavailable at https://www.earthcube.org/datasets/allgood/robots.txt","time":"2023-01-30T20:45:53-06:00"}
{"file":"/github/workspace/internal/summoner/acquire/resources.go:66","func":"github.com/gleanerio/gleaner/internal/summoner/acquire.ResourceURLs","level":"error","msg":"Error getting robots.txt for geocodes_demo_datasets, continuing without it.","time":"2023-01-30T20:45:53-06:00"}
./glcon gleaner batch --cfgName gctest
ubuntu@geocodes-dev:~/indexing$ ./glcon gleaner batch --cfgName gctest
INFO[0000] EarthCube Gleaner
Using gleaner config file: /home/ubuntu/indexing/configs/gctest/gleaner
Using nabu config file: /home/ubuntu/indexing/configs/gctest/nabu
batch called
{"file":"/github/workspace/internal/organizations/org.go:55","func":"github.com/gleanerio/gleaner/internal/organizations.BuildGraph","level":"info","msg":"Building organization graph.","time":"2022-07-22T19:16:53Z"}
{"file":"/github/workspace/pkg/gleaner.go:35","func":"github.com/gleanerio/gleaner/pkg.Cli","level":"info","msg":"Sitegraph(s) processed","time":"2022-07-22T19:16:53Z"}
{"file":"/github/workspace/internal/summoner/summoner.go:17","func":"github.com/gleanerio/gleaner/internal/summoner.Summoner","level":"info","msg":"Summoner start time:2022-07-22 19:16:53.451745139 +0000 UTC m=+0.182100234","time":"2022-07-22T19:16:53Z"}
{"file":"/github/workspace/internal/summoner/acquire/resources.go:189","func":"github.com/gleanerio/gleaner/internal/summoner/acquire.getRobotsForDomain","level":"info","msg":"Getting robots.txt from 0/robots.txt","time":"2022-07-22T19:16:53Z"}
{"file":"/github/workspace/internal/summoner/acquire/utils.go:23","func":"github.com/gleanerio/gleaner/internal/summoner/acquire.getRobotsTxt","level":"error","msg":"error fetching robots.txt at 0/robots.txtGet \"0/robots.txt\": unsupported protocol scheme \"\"","time":"2022-07-22T19:16:53Z"}
{"file":"/github/workspace/internal/summoner/acquire/resources.go:192","func":"github.com/gleanerio/gleaner/internal/summoner/acquire.getRobotsForDomain","level":"error","msg":"error getting robots.txt for 0:Get \"0/robots.txt\": unsupported protocol scheme \"\"","time":"2022-07-22T19:16:53Z"}
{"file":"/github/workspace/internal/summoner/acquire/resources.go:63","func":"github.com/gleanerio/gleaner/internal/summoner/acquire.ResourceURLs","level":"error","msg":"Error getting robots.txt for geocodes_demo_datasetscontinuing without it.","time":"2022-07-22T19:16:53Z"}
{"file":"/github/workspace/internal/summoner/acquire/resources.go:127","func":"github.com/gleanerio/gleaner/internal/summoner/acquire.getSitemapURLList","level":"info","msg":"https://raw.githubusercontent.com/earthcube/GeoCODES-Metadata/gh-pages/metadata/Dataset/sitemap.xml is not a sitemap index, checking to see if it is a sitemap","time":"2022-07-22T19:16:53Z"}
{"file":"/github/workspace/internal/summoner/acquire/acquire.go:32","func":"github.com/gleanerio/gleaner/internal/summoner/acquire.ResRetrieve","level":"info","msg":"Queuing URLs for geocodes_demo_datasets","time":"2022-07-22T19:16:53Z"}
{"file":"/github/workspace/internal/summoner/acquire/acquire.go:74","func":"github.com/gleanerio/gleaner/internal/summoner/acquire.getConfig","level":"info","msg":"Thread count 5 delay 0","time":"2022-07-22T19:16:53Z"}
{"file":"/github/workspace/internal/summoner/acquire/jsonutils.go:89","func":"github.com/gleanerio/gleaner/internal/summoner/acquire.Upload","level":"info","msg":"context.strict is not set to true; doing json-ld fixups.","time":"2022-07-22T19:16:53Z"}
{"file":"/github/workspace/internal/summoner/acquire/jsonutils.go:89","func":"github.com/gleanerio/gleaner/internal/summoner/acquire.Upload","level":"info","msg":"context.strict is not set to true; doing json-ld fixups.","time":"2022-07-22T19:16:53Z"}
{"file":"/github/workspace/internal/summoner/acquire/jsonutils.go:89","func":"github.com/gleanerio/gleaner/internal/summoner/acquire.Upload","level":"info","msg":"context.strict is not set to true; doing json-ld fixups.","time":"2022-07-22T19:16:53Z"}
{"file":"/github/workspace/internal/summoner/acquire/jsonutils.go:89","func":"github.com/gleanerio/gleaner/internal/summoner/acquire.Upload","level":"info","msg":"context.strict is not set to true; doing json-ld fixups.","time":"2022-07-22T19:16:54Z"}
{"file":"/github/workspace/internal/summoner/acquire/jsonutils.go:89","func":"github.com/gleanerio/gleaner/internal/summoner/acquire.Upload","level":"info","msg":"context.strict is not set to true; doing json-ld fixups.","time":"2022-07-22T19:16:54Z"}
12% |██████ | (2/16, 2 it/s) [0s:7s]{"file":"/github/workspace/internal/summoner/acquire/jsonutils.go:89","func":"github.com/gleanerio/gleaner/internal/summoner/acquire.Upload","level":"info","msg":"context.strict is not set to true; doing json-ld fixups.","time":"2022-07-22T19:16:54Z"}
{"file":"/github/workspace/internal/summoner/acquire/jsonutils.go:89","func":"github.com/gleanerio/gleaner/internal/summoner/acquire.Upload","level":"info","msg":"context.strict is not set to true; doing json-ld fixups.","time":"2022-07-22T19:16:54Z"}
{"file":"/github/workspace/internal/summoner/acquire/jsonutils.go:89","func":"github.com/gleanerio/gleaner/internal/summoner/acquire.Upload","level":"info","msg":"context.strict is not set to true; doing json-ld fixups.","time":"2022-07-22T19:16:54Z"}
{"file":"/github/workspace/internal/summoner/acquire/jsonutils.go:89","func":"github.com/gleanerio/gleaner/internal/summoner/acquire.Upload","level":"info","msg":"context.strict is not set to true; doing json-ld fixups.","time":"2022-07-22T19:16:54Z"}
{"file":"/github/workspace/internal/summoner/acquire/jsonutils.go:89","func":"github.com/gleanerio/gleaner/internal/summoner/acquire.Upload","level":"info","msg":"context.strict is not set to true; doing json-ld fixups.","time":"2022-07-22T19:16:54Z"}
43% |███████████████████████ | (7/16, 6 it/s) [1s:1s]{"file":"/github/workspace/internal/summoner/acquire/jsonutils.go:89","func":"github.com/gleanerio/gleaner/internal/summoner/acquire.Upload","level":"info","msg":"context.strict is not set to true; doing json-ld fixups.","time":"2022-07-22T19:16:54Z"}
68% |████████████████████████████████████ | (11/16, 6 it/s) [1s:0s]{"file":"/github/workspace/internal/summoner/acquire/jsonutils.go:89","func":"github.com/gleanerio/gleaner/internal/summoner/acquire.Upload","level":"info","msg":"context.strict is not set to true; doing json-ld fixups.","time":"2022-07-22T19:16:55Z"}
{"file":"/github/workspace/internal/summoner/acquire/jsonutils.go:89","func":"github.com/gleanerio/gleaner/internal/summoner/acquire.Upload","level":"info","msg":"context.strict is not set to true; doing json-ld fixups.","time":"2022-07-22T19:16:55Z"}
{"file":"/github/workspace/internal/summoner/acquire/jsonutils.go:89","func":"github.com/gleanerio/gleaner/internal/summoner/acquire.Upload","level":"info","msg":"context.strict is not set to true; doing json-ld fixups.","time":"2022-07-22T19:16:55Z"}
75% |████████████████████████████████████████ | (12/16, 6 it/s) [1s:0s]{"file":"/github/workspace/internal/summoner/acquire/jsonutils.go:89","func":"github.com/gleanerio/gleaner/internal/summoner/acquire.Upload","level":"info","msg":"context.strict is not set to true; doing json-ld fixups.","time":"2022-07-22T19:16:55Z"}
{"file":"/github/workspace/internal/summoner/acquire/jsonutils.go:89","func":"github.com/gleanerio/gleaner/internal/summoner/acquire.Upload","level":"info","msg":"context.strict is not set to true; doing json-ld fixups.","time":"2022-07-22T19:16:55Z"}
100% |██████████████████████████████████████████████████████| (16/16, 9 it/s)
{"file":"/github/workspace/internal/summoner/summoner.go:37","func":"github.com/gleanerio/gleaner/internal/summoner.Summoner","level":"info","msg":"Summoner end time:2022-07-22 19:16:55.660367672 +0000 UTC m=+2.390721648","time":"2022-07-22T19:16:55Z"}
{"file":"/github/workspace/internal/summoner/summoner.go:38","func":"github.com/gleanerio/gleaner/internal/summoner.Summoner","level":"info","msg":"Summoner run time:0.0368103569","time":"2022-07-22T19:16:55Z"}
{"file":"/github/workspace/internal/millers/millers.go:27","func":"github.com/gleanerio/gleaner/internal/millers.Millers","level":"info","msg":"Miller start time2022-07-22 19:16:55.661434567 +0000 UTC m=+2.391819553","time":"2022-07-22T19:16:55Z"}
{"file":"/github/workspace/internal/millers/millers.go:44","func":"github.com/gleanerio/gleaner/internal/millers.Millers","level":"info","msg":"Adding bucket to milling list:summoned/geocodes_demo_datasets","time":"2022-07-22T19:16:55Z"}
{"file":"/github/workspace/internal/millers/millers.go:55","func":"github.com/gleanerio/gleaner/internal/millers.Millers","level":"info","msg":"Adding bucket to prov building list:prov/geocodes_demo_datasets","time":"2022-07-22T19:16:55Z"}
100% |█████████████████████████████████████████████████████| (15/15, 51 it/s)
{"file":"/github/workspace/internal/millers/graph/graphng.go:82","func":"github.com/gleanerio/gleaner/internal/millers/graph.GraphNG","level":"info","msg":"Assembling result graph for prefix:summoned/geocodes_demo_datasetsto:milled/geocodes_demo_datasets","time":"2022-07-22T19:16:56Z"}
{"file":"/github/workspace/internal/millers/graph/graphng.go:83","func":"github.com/gleanerio/gleaner/internal/millers/graph.GraphNG","level":"info","msg":"Result graph will be at:results/runX/geocodes_demo_datasets_graph.nq","time":"2022-07-22T19:16:56Z"}
{"file":"/github/workspace/internal/millers/graph/graphng.go:89","func":"github.com/gleanerio/gleaner/internal/millers/graph.GraphNG","level":"info","msg":"Pipe copy for graph done","time":"2022-07-22T19:16:56Z"}
{"file":"/github/workspace/internal/millers/millers.go:84","func":"github.com/gleanerio/gleaner/internal/millers.Millers","level":"info","msg":"Miller end time:2022-07-22 19:16:56.387639969 +0000 UTC m=+3.117994225","time":"2022-07-22T19:16:56Z"}
{"file":"/github/workspace/internal/millers/millers.go:85","func":"github.com/gleanerio/gleaner/internal/millers.Millers","level":"info","msg":"Miller run time:0.0121029112","time":"2022-07-22T19:16:56Z"}
See files in Minio
You can open the minioadmin console (https://minioadmin.{your host}/) and check that files have been uploaded into the bucket, in this case gctest, under summoned/geocodes_demo_datasets.
(NEED IMAGE HERE)
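A command-line alternative to the console check, assuming the hypothetical geocodes-s3 mc alias from the bucket-setup step:

```bash
# The summoned/ and milled/ prefixes match the paths in the batch log above.
mc ls --recursive geocodes-s3/gctest/summoned/
mc ls --recursive geocodes-s3/gctest/milled/
```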
Push to graph
Nabu will read files from the bucket, and push them to the graph store.
./glcon nabu prefix --cfgName gctest
```json lines
./glcon nabu prefix --cfgName gctest
INFO[0000] EarthCube Gleaner
Using gleaner config file: /home/ubuntu/indexing/configs/gctest/gleaner
Using nabu config file: /home/ubuntu/indexing/configs/gctest/nabu
check called
2022/07/22 19:23:16 Load graphs from prefix to triplestore
{"file":"/go/pkg/mod/github.com/gleanerio/nabu@v0.0.0-20220223141452-a01fa9352430/internal/sparqlapi/pipeload.go:41","func":"github.com/gleanerio/nabu/internal/sparqlapi.ObjectAssembly","level":"info","msg":"[milled/geocodes_demo_datasets prov/geocodes_demo_datasets org]","time":"2022-07-22T19:23:16Z"}
{"file":"/go/pkg/mod/github.com/gleanerio/nabu@v0.0.0-20220223141452-a01fa9352430/internal/sparqlapi/pipeload.go:61","func":"github.com/gleanerio/nabu/internal/sparqlapi.ObjectAssembly","level":"info","msg":"gleaner:milled/geocodes_demo_datasets object count: 15\n","time":"2022-07-22T19:23:16Z"}
{"file":"/go/pkg/mod/github.com/gleanerio/nabu@v0.0.0-20220223141452-a01fa9352430/internal/sparqlapi/pipeload.go:79","func":"github.com/gleanerio/nabu/internal/sparqlapi.PipeLoad","level":"info","msg":"Loading milled/geocodes_demo_datasets/11316929f925029101493e8a05d043b0ae829559.rdf \n","time":"2022-07-22T19:23:16Z"}
[snip]
{"file":"/go/pkg/mod/github.com/gleanerio/nabu@v0.0.0-20220223141452-a01fa9352430/internal/sparqlapi/pipeload.go:197","func":"github.com/gleanerio/nabu/internal/sparqlapi.Insert","level":"info","msg":"response Status: 200 OK","time":"2022-07-22T19:23:21Z"}
{"file":"/go/pkg/mod/github.com/gleanerio/nabu@v0.0.0-20220223141452-a01fa9352430/internal/sparqlapi/pipeload.go:198","func":"github.com/gleanerio/nabu/internal/sparqlapi.Insert","level":"info","msg":"response Headers: map[Access-Control-Allow-Credentials:[true] Access-Control-Allow-Headers:[Authorization,Origin,Content-Type,Accept] Access-Control-Allow-Origin:[*] Content-Length:[449] Content-Type:[text/html;charset=utf-8] Date:[Fri, 22 Jul 2022 19:23:21 GMT] Server:[Jetty(9.4.z-SNAPSHOT)] Vary:[Origin] X-Frame-Options:[SAMEORIGIN]]","time":"2022-07-22T19:23:21Z"}
100% |███████████████████████████████████████████████████████| (1/1, 15 it/s)
```
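The step overview also lists a prune pass after prefix; run it the same way. Prune is generally used to keep the namespace in sync with what is currently in the bucket, and its output has the same shape as the prefix run above.

```bash
./glcon nabu prune --cfgName gctest
```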
Test in Graph
Once the data is loaded into the graph store, open the Blazegraph workbench:
https://graph.{your host}/blazegraph/#query
- go to the namespace tab and select gctest
- go to the query tab and input the following query, which returns all triples:
select *
where {
?s ?p ?o
}
limit 1000
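The same query can also be run from the command line against the namespace's SPARQL endpoint (the path matches the sparql.endpoint value in localConfig.yaml; HOST is a placeholder):

```bash
HOST=example.org   # placeholder: your deployment's domain
curl -G "https://graph.${HOST}/blazegraph/namespace/gctest/sparql" \
  -H 'Accept: application/sparql-results+json' \
  --data-urlencode 'query=select * where { ?s ?p ?o } limit 10'
```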
A more complex query can be run to show what types are in the system:
prefix schema: <https://schema.org/>
SELECT ?type (count(distinct ?s ) as ?scount)
WHERE {
{
?s a ?type .
}
}
GROUP By ?type
ORDER By DESC(?scount)
Another query can be run to count just the datasets:
SELECT (count(?g ) as ?count)
WHERE { GRAPH ?g {?s a <https://schema.org/Dataset>}}
More SPARQL Examples
Example of how to edit the source
This demonstrates a feature that lets you ensure all data gets loaded even when sources contain duplicate identifiers. It's a bad idea to have the same @id, but it happens.
There are two lines in gctest.csv. The second source is [actual data](https://github.com/earthcube/GeoCODES-Metadata/tree/main/metadata/Dataset/actualdata). It contains three files; the two EarthChem files have the same @id (1, 2). The IdentifierType for this source is set to 'filesha', which generates a sha based on the entire file.
gctest.csv
hack,SourceType,Active,Name,ProperName,URL,Headless,HeadlessWait,IdentifierType,IdentifierPath,Domain,PID,Logo,validator link,NOTE
58,sitemap,TRUE,geocodes_demo_datasets,Geocodes Demo Datasets,https://earthcube.github.io/GeoCODES-Metadata/metadata/Dataset/allgood/sitemap.xml,FALSE,0,identifiersha,,https://www.earthcube.org/datasets/allgood,https://github.com/earthcube/GeoCODES-Metadata/metadata/OtherResources,,,
59,sitemap,FALSE,geocodes_actual_datasets,Geocodes Actual Datasets,https://earthcube.github.io/GeoCODES-Metadata/metadata/Dataset/actualdata/sitemap.xml,FALSE,0,filesha,,https://www.earthcube.org/datasets/actual,https://github.com/earthcube/GeoCODES-Metadata/metadata/,,,
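As a rough illustration of why 'filesha' resolves the duplicate: the object names gleaner writes (for example the 40-character hash in the nabu log above) look like SHA-1 digests, so hashing the whole file gives the two EarthChem records distinct names even though they share an @id. The filenames below are hypothetical local copies.

```bash
# Hypothetical local copies of the two EarthChem records; a whole-file hash
# differs even when the embedded @id is identical.
sha1sum earthchem-record-1.jsonld earthchem-record-2.jsonld
```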
edit gctest.csv
Set Active to TRUE on the second source line (geocodes_actual_datasets).
edited gctest.csv
hack,SourceType,Active,Name,ProperName,URL,Headless,HeadlessWait,IdentifierType,IdentifierPath,Domain,PID,Logo,validator link,NOTE
58,sitemap,TRUE,geocodes_demo_datasets,Geocodes Demo Datasets,https://earthcube.github.io/GeoCODES-Metadata/metadata/Dataset/allgood/sitemap.xml,FALSE,0,identifiersha,,https://www.earthcube.org/datasets/allgood,https://github.com/earthcube/GeoCODES-Metadata/metadata/OtherResources,,,
59,sitemap,TRUE,geocodes_actual_datasets,Geocodes Actual Datasets,https://earthcube.github.io/GeoCODES-Metadata/metadata/Dataset/actualdata/sitemap.xml,FALSE,0,filesha,,https://www.earthcube.org/datasets/actual,https://github.com/earthcube/GeoCODES-Metadata/metadata/,,,
regenerate configs
./glcon config generate --cfgName gctest
rerun batch
./glcon gleaner batch --cfgName gctest
ubuntu@geocodes:~/indexing$ ./glcon gleaner batch --cfgName gctest
version: v3.0.8-fix129
batch called
{"file":"/github/workspace/internal/summoner/acquire/resources.go:204","func":"github.com/gleanerio/gleaner/internal/summoner/acquire.getRobotsForDomain","level":"error","msg":"error getting robots.txt for https://www.earthcube.org/datasets/allgood:Robots.txt unavailable at https://www.earthcube.org/datasets/allgood/robots.txt","time":"2023-01-30T21:09:49-06:00"}
{"file":"/github/workspace/internal/summoner/acquire/resources.go:66","func":"github.com/gleanerio/gleaner/internal/summoner/acquire.ResourceURLs","level":"error","msg":"Error getting robots.txt for geocodes_demo_datasets, continuing without it.","time":"2023-01-30T21:09:49-06:00"}
{"file":"/github/workspace/internal/summoner/acquire/resources.go:204","func":"github.com/gleanerio/gleaner/internal/summoner/acquire.getRobotsForDomain","level":"error","msg":"error getting robots.txt for https://www.earthcube.org/datasets/actual:Robots.txt unavailable at https://www.earthcube.org/datasets/actual/robots.txt","time":"2023-01-30T21:09:49-06:00"}
{"file":"/github/workspace/internal/summoner/acquire/resources.go:66","func":"github.com/gleanerio/gleaner/internal/summoner/acquire.ResourceURLs","level":"error","msg":"Error getting robots.txt for geocodes_actual_datasets, continuing without it.","time":"2023-01-30T21:09:49-06:00"}
100% |███████████████████████████████████████████████████████████████████████████████████████████| (3/3, 10 it/s)
100% |███████████████████████████████████████████████████████████████████████████████████████████| (9/9, 25 it/s)
RunStats:
Start: 2023-01-30 21:09:49.120833598 -0600 CST m=+0.105789938
Repositories:
- name: geocodes_demo_datasets
SitemapCount: 9
SitemapHttpError: 0
SitemapIssues: 0
SitemapSummoned: 9
SitemapStored: 9
- name: geocodes_actual_datasets
SitemapSummoned: 3
SitemapStored: 3
SitemapCount: 3
SitemapHttpError: 0
SitemapIssues: 0
100% |██████████████████████████████████████████████████████████████████████████████████████████| (9/9, 168 it/s)
100% |██████████████████████████████████████████████████████████████████████████████████████████| (2/2, 123 it/s)
Create a materialized view of the data, using summarize, in the {repo}_summary namespace
DOCUMENTATION NEEDED
(TBD assigned to Mike Bobak)