Current Schema: Difference between revisions
(Credited myself along with everyone else :-) and removed some of my more judgemental-sounding terms in descriptions.) |
No edit summary |
||
Line 341: | Line 341: | ||
| Price | | Price | ||
| price | | price | ||
| ISO code | | A decimal price followed by the ISO code, i.e. ''0.10 USD'' | ||
| | | [[Cover Price]] | ||
|- | |- | ||
| storycount | | storycount |
Revision as of 20:34, 6 April 2008
This wiki page attempts to document the current schema for the GCD. The public schema consists of four principle data tables as well as several supporting tables. The supporting tables are Countries, Languages (lists of codes for each of those things), Indexers (a list of indexers, but not of full account information) and IndexCredit (maps indexers to series for credit, but there are also credit notes in the series table). The four main tables are described below.
The original organization of the GCD data was into two files--or, more accurately, into a Series file (later two and then four Series files) and several dozen detail files. The two types of files were normalized by Lionel English in the mid-1990s in order to put them in to a real database, and then the two normalized files were further normalized in order to create the Publishers table from the Series table, and an Issues table from the Details table. The supporting tables were mostly created in the same fashion. Note in particular that normalization occurred after-the-fact and organization efforts were applied retroactively to a large existing data-set upon which no strict data type-enforcement had been been applied. Many, if not most, of the current schema's anomalies are a direct result of the database's free-form beginnings.
The "Name in New Code" column is for those who might be looking at the new Python/Django implementation. Since the existing table and column names are not even remotely consistent in capitalization, formatting or abbreviation, they were standardized in the Django model classes to make working with them easier (Django provides a convenient name mapping system for exactly this purpose).
Note that all foreign keys in this database are nullable. The current data set includes stories with no issue, issues with no series, and series with no publisher. And other stranger things. These are all errors; resulting from unenforced relational integrity and data formatting errors during early imports.
Please note that much of the issue and story table information on this page is based on earlier documentation by Peter Croome, with the remaining initial data contributed by Henry Andrews. Lionel English has attempted to provide clarification where possible. Jon has added some comments per April 1st.
Publishers
The publisher table is mostly straightforward. The largest area of confusion is the notion of imprints and of master publishers. Many of the larger publishers have had many names and/or many concurrent corporate identities during their life time (particularly true of Golden Age publishers, where multiple corporation names were used to reduce risk and hide ownership). In order to group related publishing companies together, the GCD uses what we have recently been calling "house" names, or names that have been established in fandom as umbrella names to refer to a particular, identifiable group of related companies (e.g. Detective Comics, Inc; National Periodicals; DC Comics Inc). In an effort to provide a place to note the actual publishers of given series, a "Publisher Notes" field was later added, which was used both to list actual company names and publishing imprints, eventually resulting in a rather ugly mess. Much attention was focused on correcting this mess in the new schema.
DB Column | Name in New Code | Description | Usage Notes |
---|---|---|---|
Core publisher data fields: | |||
ID | id | DB-generated primary key. | In the lasso-based site, this is the number you see in URLs like http://www.comics.org/publisher_details.lasso?id=4066 It is generated by the database and has no inherent meaning. |
PubName | name | Name of the publishing "family", publisher (legal entity), imprint, or set of imprints | Because imprints are not stored per-issue, in some cases a semicolon-separated list of imprints or corporate names corresponding to all imprints used for a given series has been given an entry in this table. Also, not all things that we term "Publishers" or "Imprints" strictly fall into those categories, and we do not accurately reflect the evolution of company names over time (for instance, "DC" was not until fairly recently the official name of the company, although it was long an informal name. It is still not exactly correct- "DC Comics" appears to be the current legal name. But "DC" is what everyone knows, so that is what appears in the database). |
YearBegan | year_began | Year (by cover date) of first publication. | |
YearEnded | year_ended | Year (by cover date) of last publication, if any. | |
CountryID | country | The country from which the publisher operates. | Foreign key into the Countries table. |
Notes | notes | Space for arbitrary notes. | |
web | url | The URL for the publisher's website. | While this column sometimes has data, it does not appear to be used by the Lasso-based site. |
AlphaSortCode | alpha_sort_code | A single character. | Appears to be used to sort related publishers near each other even if the more obvious alphabetical sort would not do so. It's a touch unclear. Intended to strip out "The" etc. in an earlier version of the database. |
Publisher / imprint relations and publication counts: | |||
ImprintCount | imprint_count | Count of how many imprints a given publisher has. | Does not appear to be uniformly accurate. Should be recalculated on updates. |
BookCount | series_count | Count of how many series this publisher publishes. | Does not appear to be uniformly accurate. Should be recalculated on updates. |
IssueCount | issue_count | Total number of issues in all series published by the publisher. | Does not appear to be uniformly accurate. Should be recalculated on updates. |
Master | is_master | Boolean field. | This field indicated the "master" or top level in the hierarchy of publisher/imprint in the days where both publishers and imprints were stored in the same, self-joined table in FileMaker. |
Connection | connection | Not used. | Was supposed to explain the nature of each publisher-to-publisher relationship. |
ParentID | parent | Self-join key for parent/child relationships. | This is how the imprints for a publisher are stored. |
NextID | next | Self-join key, was intended to provide links from one incarnation of a publisher to the next incarnation (e.g. National Periodical Publications -> DC Comics Inc). | Never implemented. |
Change tracking fields: | |||
created | created | Row creation timestamp. | |
Modified | modified | One of two change timestamps. | Actively used by the Lasso site. |
Updated | updated | One of two change timestamps. | Deprecated. |
Series
The series table includes a number of fields that are now obsolete, or perhaps represent possible new features that were abandoned. It also has fields related that may be related to cover scans, which are unclear as the cover tables are not included in the public schema.
DB Column | Name in New Code | Description | Usage Notes |
---|---|---|---|
Core series data fields: | |||
ID | id | DB-generated primary key. | In the lasso-based site, this is the number you see in URLs like http://www.comics.org/series.lasso?seriesid=72 It is generated by the database and has no inherent meaning. |
Bk_Name | name | Series title. | |
Format | format | Physical format of the series (size, color presence or absence, paper stock, etc.). | Since format is really an attribute of the issue rather than the whole series, this field may contain a fairly long description of how the format changed over the life of the series. In the new schema, this is being replace by a number of distinct fields, which individually describe the constituent attributes commonly listed here. |
Notes | notes | Space for arbitrary notes. | |
Yr_Began | year_began | First year (by cover date) of publication. | |
Yr_Ended | year_ended | Last year (by cover date) of publication, if any. | |
PubDates | publication_dates | First and last full cover publication dates of the series, separated by a '-'. No last date if still ongoing. | |
Pub_Note | publication_note | Arbitrary publication / publisher-related notes. Some records in the current DB include addresses, imprint information and/or general reprint information here. | Legacy field. Not used by the Lasso implementation. Replaced by a foreign key into the Publisher table that identified the imprint record. This foreign key does not appear to be listed here; this was different than PubID. I am not sure how the database currently associates a series with information about the publishing imprint(s). |
Frst_Iss | first_issue | Issue # (or other identifier) of the first issue in the series. | Note that it's not currently reliably possible to determine this from the issue table, since the series may only be partially indexed. |
Last_Iss | last_issue | Issue # (or other identifier) of the last issue in the series, if any. | Note that it's not currently reliably possible to determine this from the issue table, since the series may only be partially indexed. |
Issuecount | issue_count | Number of issues associated with the series. | Does not appear to be reliably accurate. |
PubID | publisher | Foreign key into the publishers table. | |
Pub_Name | publisher_name | Cached Publisher name and stems historically from the flat file format. | Reliability of this field has not been verified. |
CounCode | country_code | Not a foreign key into the Country table. Instead the country code is duplicated here, and does not necessarily match properly. | Historically it was done this way to save some joining time. |
LangCode | language_code | Not a foreign key into the Languages table. Instead the language code is duplicated here, and does not necessarily match properly. | Historically it was done this way to save some joining time. |
Fields relating series to each other (continuations, etc.): | |||
Tracking | tracking_notes | Notes about how a given series is continued in or continued from another series. | |
Crossref | crossref | Not in use. | Intended to indicate a cross reference to replace tracking |
CrossrefID | crossref_id | Not in use. | Intended to be a foreign key to a table of one-to-many references to other series to replace tracking. |
Field related to the (cover?) gallery: | |||
HasGallery | gallery_present | 3-character field, but used as boolean. | Gallery flag. Initially used for caching display of the gallery indicator. 3-char field stems from very early "YES/NO" values. |
Included | gallery_includes | Obsolete. | Legacy field that used to be used to indicate which issues had been indexed. Superceded by various matrices on the current site. |
Fields related to indexing: | |||
Indexers | indexers | Semicolon-separated string of indexers. | Originally a free-form text field recording who had worked on a given series. Used to create initial version of Indexers table, and then used to create the IndexCredit join table. Later updated periodically *from* the IndexCredit join table so that it could be exported into legacy format. Presumably no longer completely up-to-date, now that the lasso-site tracks indexing at the issue level. Not clear how the current site builds the display of indexers on a given page. |
OpenReserve | open_reserve | Integer field. | How many issues of this series are open for reservation. |
Fields apparently related to flat file management / distribution: | |||
File | file | Obsolete. The name of the file used for flat-file distribution. They were sized for floppies IIRC. | |
InitDist | initial_distribution | Obsolete. | Indicated the first distribution set the record was contained in (flat-file). |
UpdateDist | update_distribution | Obsolete. | Indicated the last distribution set the record was updated in (flat-file). |
Change tracking fields: | |||
Created | created | Row creation timestamp. | |
Modified | modified | Date of last row change. | |
ModTime | modification_time | Time of last row change (24 hour). | |
LstChang | last_change | Date of some sort, but apparently unused. | Only five non-NULL, non-zero values in the current data, all in the future (nearest is in 2010). |
oldID | old_id | The ID from a prior organization of the data. | Long deprecated. |
Miscellaneous unused and/or mysterious fields: | |||
Themes | themes | Does not appear to be used (always NULL). Was intended as an early incarnation of keywords. | |
SelfCount | self_count | Obsolete. | Stems from previous FileMaker incarnation of the database and used as a way to track duplicated during imports. |
Issues
The issues table contains large swaths of unused columns, due to it initally being a cloning of the sequence 0 story records when moving from flat file to a semi-relational database (FileMaker 4). Some fields stayed aboard instead of being kicked out due to a number of flat-file based files pending inclusion.
DB Column | Name in New Code | Description | Usage Notes |
---|---|---|---|
Core issue data fields: | |||
ID | id | DB-generated primary key. | In the lasso-based site, this is the number you see in URLs like http://www.comics.org/details.lasso?id=293 It is generated by the database and has no inherent meaning. |
VolumeNum | volume | Volume number from the indicia. | Not consistently used, and according to recent editor list comments, not planned for migration into the new schema. Where especially relevant, volume numbers are typically included in the Issue (number) field. NOTE: This should actually be used for proper indexing of some foreign titles which restart numbering every year. |
Issue | number | Issue number or other identifier from the indicia. | Use the value nn to indicate "no number" when there is no identifier. For issues where the volume number is important, the form v1#1 is used to specify both the volume and issue numbers. When the cover number or commonly assumed number is different from the indicia number, it goes in brackets after the indicia number: v3#2 [26]. The same format is used for disambiguation of identically numbered issues: 1 [1946] or nn [1947]. |
Pub_Date | publication_date | Publication date of the issue as given in the indicia, but fully spelled out. | Issues that list a publication date of December-January, Winter or Holiday should specify both years involved, i.e. Winter 1942-1943. |
Price | price | A decimal price followed by the ISO code, i.e. 0.10 USD | Cover Price |
storycount | story_count | Number of stories linked to this issue. | May not be 100% accurate for all issues, but is updated whenever the issue itself is updated. |
Key_Date | key_date | Specially formatted date string that (in theory) determines the ordering of issues within a series. | In practice, many issues are missing this field, meaning that if their issue numbers are not standard, there is no reliable programmatic way to sort the issues in the series. In practice, the cover sort codes are slightly more reliable (although not always), but they are not included in the public data dumps. |
Fields that would be core issue data but aren't actually used: | |||
Pg_Cnt | page_count | Not Used | Page count for the whole issue is taken from story sequence 0 (which otherwise represents the front cover). |
Editing | editor | Not Used | Editor(s) for the whole issue are taken from story sequence 0 (which otherwise represents the front cover). |
Notes | notes | Not Used | Notes for the whole issue are taken from story sequence 0 (which otherwise represents the front cover). |
Fields related to indexing: | |||
IndexStatus | index_status | Whether or not an issue is being indexed, and where it is in the process. | 0=not (white); 1=reserved (red); 2=pending (orange); 3=approved (green). Colors noted are how the issues appear in various index status grids. |
ReserveCheck | reserve_check | Not Used | The intention of this field is unknown. |
ReserveStatus | reserve_status | 1=indexed, pending or reserved; 0=not reserved | The intended relationship between this and IndexStatus/index_status is not known. |
Fields related to cover images: | |||
CoverCheck | has_cover | Not Used | |
CoverCount | num_covers | Not Used | Intended for variants |
Fields related to the series, including unused cache fields: | |||
SeriesID | series | Foreign key into the series table. | |
Bk_Name | series_name | Not Used (deprecated) | |
Yr_Began | year_began | Not Used | |
Pub_Name | publisher_name | Not Used | |
Fields apparently related to flat file management / distribution: | |||
InitDist | initial_distribution | Not Used | |
UpdateDist | update_distribution | Not Used | |
isUpdated | is_updated | Deprecated. Used for tracking updated indexes before processing each distribution. | |
Change tracking fields: | |||
created | created | Row creation timestamp. | |
Modified | modified | Date of last row change. | |
ModTime | modification_time | Time of last row change (24 hour). | |
LstChang | last_change | Date of some sort, but apparently unused? | Column has numerous values, but they do not line up with the values in the Modified column. NOTE: This field was butchered data-wise and then replaced by the modified/modtime combo. |
Miscellaneous unused and/or mysterious fields: | |||
rel_year | rel_year | Not Used | Intended for released year, caching of keydate year. |
SelfCount | self_count | Not Used | Historically used for duplcation checking. |
SeriesLink | series_link | Not Used | Deprecated. Early non-id based series link used to build the initial foreign keys into series. |
Unused fields that would apply only to the cover. Stemming from seq 0 clone.: | |||
Feature | cover_feature | Not Used | |
Char_App | cover_characters | Not Used | |
Script | cover_script | Not Used | |
Pencils | cover_pencils | Not Used | |
Inks | cover_inks | Not Used | |
Colors | cover_colors | Not Used | |
Letters | cover_letters | Not Used | |
Synopsis | cover_synopsis | Not Used | Yes, I realize cover_synopsis does not make a whole lot of sense. It's not used anyway. |
Reprints | cover_reprints | Not Used | Arguably this could go in the "whole issue" section and be used to indicates issues that reprint entire other issues instead of doing story-by-story reprint links. |
Title | cover_title | Not Used | Arguably this could go in the "whole issue" section. |
Genre | cover_genre | Not Used | Arguably this could go in the "whole issue" section. |
Type | cover_type | Not Used | Since the type of a cover is "cover", this also doesn't make much sense. But it doesn't make sense as applied to an entire issue either, as far as I can tell. |
Seq_No | cover_sequence_number | Not Used | Makes even less sense that the rest of these. Stemming from the cloning of the stories table... |
Stories
The stories table has a number of unused columns that are duplicates of columns from the series and issues tables (they're known to be unused because in some cases the data does not match, and the versions shown on the Lasso site are the ones in the issue and series tables, not these). They were initially used for caching data to avoid expensive joins in the FileMaker days. Also, sequence 0 (which represents the front cover) is treated specially, with its page count, notes, reprints and editor fields being applied to the whole issue by the current UI implementation.
DB Column | Name in New Code | Description | Usage Notes |
---|---|---|---|
Core story data fields: | |||
ID | id | DB-generated primary key. | Since there is no story page, this field does not show up in any URLs in the Lasso implementation. |
Seq_No | sequence_number | Order of the story within the issue. | By convention, sequence 0 is the front cover, and some of its fields are considered to apply to the whole issue. |
Feature | feature | The name of the feature, if any. | Often but not always the name of the primary character. Some stories do not have a feature. |
Pg_Cnt | page_count | Number of story pages. | Does not include ad pages that might be mixed in. |
Notes | notes | ||
Reprints | reprints | Reprint information (both reprint from and reprinted in). | |
Type | type | Type of "story", or more accurately sequence. | Actual comic story vs text story vs ad vs letters page, etc. |
Fields for creator credits: | |||
Script | script | ||
Pencils | pencils | ||
Inks | inks | ||
Colors | colors | ||
Letters | letters | ||
Editor | editor | Rarely used, as the editor for the book is typically assumed to have edited the whole book. However, this field can be used if individual stories have separate editors. | |
JobNo | job_number | Publisher-assigned job number for the story. | |
Fields related to content: | |||
Title | title | ||
Char_App | characters | ||
Synopsis | synopsis | ||
Genre | genre | ||
Field related to flat file management / distribution. See series/issues: | |||
InitDist | initial_distribution | Not Used | |
Fields relating to or copied from the issues table: | |||
IssueID | issue | Foreign key into the issues table. | |
Issue | issue_number | Not Used | Earlier indexes may have data in this field, but it is ignored. |
Pub_Date | publication_date | Not Used | Earlier indexes may have data in this field, but it is ignored. |
Key_Date | key_date | Not Used | Earlier indexes may have data in this field, but it is ignored. |
Price | price | Not Used | |
rel_year | rel_year | Not Used | |
Fields relating to or copied from the series table: | |||
SeriesID | series | Foreign key into the series table, duplicated from the issues table presumably to allow for direct stories <-> series joins. | Actually, it was used to help troubleshoot and QA the initial building of a relational database way back. Also use for statistics. |
Yr_Began | year_began | Not Used | Earlier indexes may have data in this field, but it is ignored. |
Pub_Name | publisher_name | Not Used | Earlier indexes may have data in this field, but it is ignored. |
Change tracking fields: | |||
created | created | Row creation timestamp. | |
Modified | modified | Date of last row change. | |
ModTime | modification_time | Time of last row change (24 hour). | |
LstChang | last_change | Date of some sort, but apparently unused? | Column has numerous values, but they do not line up with the values in the Modified column. Data got butchered through various ins and outs of flat files and between systems changing the date format. |