Workflow

From TP3-STTCL

Steps are based on the metadata in the JSON file

Prepare keywords (done)

  1. Extract set of keywords ("keywords.csv")
  2. Add them to the Wikibase using Quick Statements
  3. Export keywords with QID and Label
  4. Match keywords and QIDs in the CSV file
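
Steps 3 and 4 can be sketched as a simple lookup join. This is a minimal, hedged sketch: the column names (`qid`, `label`, `keyword`) are assumptions, not the actual export headers.

```python
import csv
import io

# Hypothetical export from step 3: keyword QIDs and English labels.
exported = "qid,label\nQ101,modernism\nQ102,exile\n"
qid_by_label = {row["label"]: row["qid"] for row in csv.DictReader(io.StringIO(exported))}

# Hypothetical keywords.csv from step 1 (column name "keyword" is an assumption).
keywords_csv = "keyword\nmodernism\nexile\nunmatched term\n"

# Step 4: match each keyword string to its QID; None marks keywords
# that will need manual fixing later.
matched = [(row["keyword"], qid_by_label.get(row["keyword"]))
           for row in csv.DictReader(io.StringIO(keywords_csv))]
```

Keeping unmatched keywords as `None` rather than dropping them makes the later manual-fix step (see "Some fixes") easy to scope.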

Affiliations (done)

  1. Extract set of affiliations (don't merge near-identical strings!)
  2. Add affiliations to Wikibase
    1. Len = String
    2. Den = "An institution"
    3. Aen = String of label without "University", "of", "the", "The", "College", "at", double spaces
    4. P1 = Q14 (instance of institution)
    5. P20 = """exact string"""
  3. Export affiliations with QID, Label and P20 (for matching)
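
The item creation in step 2 can be sketched as QuickStatements v1 commands. This is a sketch under the property model described above (P1 = instance of, Q14 = institution, P20 = exact-string property); the filler-word list follows step 2.3.

```python
import re

def normalize_label(label: str) -> str:
    """Build the alias (Aen, step 2.3): drop filler words and collapse double spaces."""
    fillers = {"University", "of", "the", "The", "College", "at"}
    kept = [w for w in label.split() if w not in fillers]
    return re.sub(r"\s{2,}", " ", " ".join(kept)).strip()

def affiliation_qs(label: str) -> str:
    """One QuickStatements v1 block per affiliation item (steps 2.1-2.5)."""
    return "\n".join([
        "CREATE",
        f'LAST\tLen\t"{label}"',
        'LAST\tDen\t"An institution"',
        f'LAST\tAen\t"{normalize_label(label)}"',
        "LAST\tP1\tQ14",
        f'LAST\tP20\t"{label}"',
    ])
```

Storing the untouched string in P20 is what makes the re-matching in step 3 reliable even after labels are cleaned up.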

Authors (done)

  1. Extract set of authors, and identify first names, last names, additional name elements, affiliations, in CSV file ("authors.csv")
  2. Match affiliation QIDs and affiliation strings in the CSV file
  3. Add authors with first names, last name, additional name elements, affiliation-QIDs
  4. Match author QIDs to author strings; this also gives us the author QIDs relevant to each article
  5. TODO: Apply manual fixes for authors who appear multiple times, sometimes with changing affiliations
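
The authors needing the manual fixes in step 5 can be found mechanically: group the rows by name and flag names carrying more than one distinct affiliation QID. A sketch with hypothetical data:

```python
from collections import defaultdict

# Hypothetical rows from authors.csv: (first name, last name, affiliation QID).
authors = [
    ("Jane", "Doe", "Q14"),
    ("Jane", "Doe", "Q15"),   # same name, different affiliation over time
    ("John", "Smith", "Q14"),
]

# Collect the distinct affiliation QIDs seen per name.
affils = defaultdict(set)
for first, last, qid in authors:
    affils[(first, last)].add(qid)

# Names with more than one affiliation are the candidates for manual review.
needs_review = sorted(name for name, qids in affils.items() if len(qids) > 1)
```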

Add basic article information (done)

  1. Add articles with
    1. Len = title
    2. Den = "Article in STCCL"
    3. Aen = Article ID as String
    4. P1 = Q10 ("instance of" = Article)
  2. Verify that all articles are present: all 685 are there (or is one missing because of the duplicate Solzhenitsyn item?)
  3. Export QIDs and Article IDs for further matching using query
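
The completeness check in step 2 amounts to a set difference between the IDs in the JSON metadata and the IDs exported from the Wikibase in step 3. A sketch with hypothetical IDs:

```python
# Hypothetical Article IDs: one set from the QID/ArticleID export (step 3),
# one set from the JSON metadata.
exported_ids = {"A001", "A002", "A004"}
json_ids = {"A001", "A002", "A003", "A004"}

# Any ID present in the JSON but absent from the Wikibase still needs importing.
missing = sorted(json_ids - exported_ids)
```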

Authors for articles (done)

  1. Based on the match between the QID of each article and the author QIDs from the matching step above
  2. Import as a QuickStatements CSV with columns qid,P16 (one row per author statement)
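
Generating that import table can be sketched as follows; the article-to-authors mapping here is hypothetical, and the CSV layout assumes QuickStatements' `qid,P16` header convention named above.

```python
# Hypothetical mapping: article QID -> author QIDs from the earlier matching.
article_authors = {"Q100": ["Q51", "Q52"], "Q101": ["Q53"]}

# One row per statement, so multi-author articles produce several rows.
lines = ["qid,P16"]
for article_qid, author_qids in sorted(article_authors.items()):
    for author_qid in author_qids:
        lines.append(f"{article_qid},{author_qid}")
qs_csv = "\n".join(lines)
```

The same row-per-statement pattern covers the keyword statements (P15) in the next step.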

Add keywords to articles (done)

  1. Match keyword QID from Wikibase to keywords strings from JSON metadata
  2. P15 = Keywords (Items) – multiple statements per article, one for each keyword (varying numbers)

Some further article information (in progress)

  1. Build table with the following and import using QS
    1. P11 = Year (EDTF)
    2. P12 = DOI (URL)
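
Building that table can be sketched as below. The value syntax is an assumption: it quotes the year as a plain EDTF string and the DOI as a URL, which may need adjusting to whatever datatypes P11 and P12 are actually configured with.

```python
# Hypothetical per-article metadata (article QID, year, DOI) from the JSON.
articles = [("Q100", 1998, "https://doi.org/10.0000/example")]

# QuickStatements v1 rows: tab-separated qid / property / value.
rows = []
for qid, year, doi in articles:
    rows.append(f'{qid}\tP11\t"{year}"')   # P11 = Year (EDTF string, assumed)
    rows.append(f'{qid}\tP12\t"{doi}"')    # P12 = DOI (URL)
```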

Add abstracts to articles

  1. Use QID-ArticleID table to match articles in JSON using ID to Wikibase items via QID
  2. Add abstracts (PROBLEM: the string field limit stays at 400 characters!)
    1. P2 = Abstract as String (longer than 400 characters in 87% of the cases!)
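
Until the field limit is resolved, one workaround is to truncate at import time so the P2 statement goes in at all. A minimal sketch, assuming a hard 400-character limit; the full abstracts would still need a different datatype or an external store.

```python
LIMIT = 400  # string field limit encountered in step 2

def abstract_value(text: str) -> str:
    """Return the abstract unchanged if it fits, otherwise truncate
    and mark the cut with an ellipsis so truncation stays visible."""
    if len(text) <= LIMIT:
        return text
    return text[:LIMIT - 1].rstrip() + "…"
```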

Optional: Further article information

  1. Build table with the following and import using QS
    1. P4 = HTML (URL)
    2. P5 = PDF (URL)
    3. P14 = Volume (String)
    4. P10 = Issue (String)
    5. P9 = Number (String)
    6. P18 = Q1 (Journal = STTCL)

Some fixes

  1. Merge near-identical institution names? (see e.g.: Berkeley or Boulder)
  2. Fix keywords that could not be matched automatically, see links to: Item:Q8186
  3. Remove some items: Q7542 (Editor's note), Q8182 (Interview), Q7827 (poems?), Q7960, Q7961, Q7964, Q7965, Q7966, Q8067, Q8105, Q8141
  4. Check some items with short abstracts, and remove as required: Q7707, Q7729, Q7828, Q7830, Q7833, Q7916, Q7999, Q8000, Q8001, Q8002, Q8003, Q8004, Q8005, Q8006, Q8007, Q8008
  5. Add 39 "special topic" articles as articles.
  6. Take care of missing affiliations / change to "independent scholar" as appropriate: https://tp3-sttcl.wikibase.cloud/wiki/Special:WhatLinksHere/Item:Q6732
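
For fix 1, merge candidates among near-identical institution names can be surfaced with a similarity ratio before deciding anything by hand. A sketch using Python's standard-library `difflib` on hypothetical labels; the 0.9 threshold is an assumption to tune against the real data.

```python
from difflib import SequenceMatcher
from itertools import combinations

# Hypothetical affiliation labels, including a near-duplicate pair.
labels = [
    "University of California, Berkeley",
    "University of California Berkeley",
    "University of Colorado Boulder",
]

# Flag pairs above the similarity threshold as merge candidates
# for manual review -- do not merge automatically.
candidates = [(a, b) for a, b in combinations(labels, 2)
              if SequenceMatcher(None, a.lower(), b.lower()).ratio() > 0.9]
```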

Potential improvements

  1. Possibly: Improve keywords through corrections (e.g. "hisory", "198S") and merging (e.g. capitalization) – or keep as found?
  2. Add a "point in time" reference to institutional affilations based on publication data of relevant article(s) to better represent the record (e.g. Avital Ronnell) or to disambiguate multiple affiliations for one author
  3. Match institutions to some identifier (ROR, Wikidata ID, etc.) and then pull city, country, world region, geocoordinates
  4. Match authors to some identifier (ORCiD, Wikidata, GND, etc.)
  5. Match keywords to some identifier (Wikidata, GND, Dewey, etc.)

Further steps

  1. Add category annotations using Cradle form
  2. Add generated categories
  3. Add generated topics?