Current Schema

From GCD
Revision as of 09:29, 12 August 2011 by Adia (talk | contribs) (Fixing links after "Publication Date" move)
Jump to navigation Jump to search

Introduction

This wiki page attempts to documents the current schema for the GCD. As of October 2009, the schema is being actively maintained and enhanced, and as of December 2009 all efforts are being made to keep this page up to date. This page was last updated in February 2010 for the 0.3 Dr. Festus release.

Public and Private Tables

The GCD schema consists of public and private tables. The public tables contain the visible data records (except for cover information) and are available to anyone in the public dumps. No account or membership of any sort is required to access the data dumps. The GCD requires that any use of our data from the dumps be credited to us. On a web site, this should be done on each page where the data is used, and a link should be provided to our site. Commercial use is welcome as long as credit and (on the web) links are provided. For questions about using the dump please contact the Board.

The private tables consist of three types of tables:

  • account data (private for the obvious reasons)
  • data editing tables (private to avoid sending out unapproved data- we'd like to make the change history public at some point if folks want it)
  • the covers table (private for legal reasons involving the distributions of cover scans and copyright law).

Note: Please contact the Board if you have concerns about the policy regarding covers. Please do not contact the tech team as this is not a technical policy and complaining to them only wastes both your time and theirs.

A Bit of History

The original organization of the GCD data was into two files--or, more accurately, into a Series file (later two and then four Series files) and several dozen detail files. The two types of files were normalized by Lionel English in the mid-1990s in order to put them in to a real database, and then the two normalized files were further normalized in order to create the Publishers table from the Series table, and an Issues table from the Details table. During the server migration and site rewrite that took place during August-December 2009, Henry Andrews, Jochen Garcke and the rest of the tech team changed the database to a system supporting true foreign keys and transactions, and imposed a consistent naming scheme on the tables and columns. Several new public tables, as well as an entirely new set of data editing tables, were added at this time. The documentation for the pre-2009-migration schema is still available for those curious about it.

Future Plans

We're progressing towards our real "1.0" release of this generation of the site, code-named "New Fun". Much discussion on the lists references the "New Fun" schema which supports a much more complex and realistic view of the data. Please join the gcd-tech group if you would like to participate in that discussion. For the curious, the December 2009 release was 0.2 "Vieux Bois".

Supporting Tables

There are several supporting tables necessary to use the data which lay out fixed sets of options for certain fields. They are:

  • gcd_country
    • Has columns id, code and name, and provides the set of countries used by the various publisher and series tables. The code column is the ISO 3166 code for the country, and should arguably be used as the foreign key, but due to the very rushed development timeline of the December 2009 release (as the prior server was no longer viable), we went with the standard numeric id column for reasons which seemed like a good idea at the time, but involved a great deal of sleep deprivation :-) We'll probably revisit this eventually.
  • gcd_language
    • Has columns id, code and name, and provides the set of languages used by the series table. The code column is the ISO 639 code for the language. The two-letter ISO 639-1 code is used if available, three letter code from 639-2 or 639-3 otherwise. This might get standardized into 3-letter codes or 4-letter codes if we need to take ISO 639-6 into account once its code list is published. This table has the same foreign key oddness about code vs id as the gcd_country table, although the possibility of changing the codes makes it more sensible to stick with id for now.
  • gcd_story_type
    • Has columns id and name, and provides the set of types used by the story table. The type field is somewhat controversial in that its values actually cover various different attributes of story sequences. However, until late 2009 there was no way for the GCD to change the field (and it was in fact a plain text field within the story table) so discussions on changing it were set aside. This will be one of many things revisited along the path to the 1.0 New Fun release.

Publishers

The publisher table currently holds two sorts of records: "master publishers" and "imprints". Both of these concepts are less precise than one might expect.

"Master publishers" are the common names of publishers as typically grouped by comic book researchers. As this vague definition suggests, there is considerable room for disagreement over what constitutes a master publisher and how such entities should be named. This field is often referred to as the "argue 'till we're blue in the face" field. Part of the current definition of a series in the GCD is that all issues within the series are published by the same master publisher. This has led to the grouping of some companies together under a single master publisher because one took over several series from the other. Debate on the proper arrangement and definition of master publishers is ongoing. A master publisher is a record that has the is_master field set to 1. The name of the publisher is chosen primarily based on what is most likely to be expected by a user running a search.

As of late 2009, "imprints" are considered a deprecated concept and are being replaced by indicia publishers and brands (covered in their own tables below). The concept of an imprint may return in a more well-defined form in the future for formal imprints of publishing houses. The data records in this table that are considered "imprints" serve as a catch-all for several different publishing concepts, often requiring long semi-colon separated lists as the field is set per-series but the data often changes per-issue. An imprint is a record that has a non-NULL parent_id field.

Note that there are a very small number of records that are both master publishers and imprints. Explaining those records is beyond the scope of this documentation.

gcd_publisher [apps.gcd.models.Publisher class in the code]
Column Name Description Usage Notes
Core publisher data fields:
id DB-generated unique, primary key.
name Name of the master publishing "family", imprint, or set of imprints See the discussion above for how this name is chosen
year_began Year (by cover date) of first publication.
year_ended Year (by cover date) of last publication, if any.
notes Space for arbitrary notes.
url The URL for the publisher's website. Includes the protocol and leading slashes (e.g. 'http://').
Relations to other tables and counts of related objects:
imprint_count Count of how many imprints a given master publisher has. Recalculated by the code on data updates.
brand_count Count of how many brands a given publisher has. Recalculated by the code on data updates.
indicia_publisher_count Count of how many indicia publishers a given publisher has. Recalculated by the code on data updates.
series_count Count of how many series this publisher publishes. Recalculated by the code on data updates.
issue_count Total number of issues in all series published by the publisher. Recalculated by the code on data updates.
is_master Boolean field (1 or 0 in MySQL). See discussion of master publishers above.
country_id The country from which the publisher operates. Foreign key into the gcd_country table (id column).
parent_id Self-join key for parent/child relationships between master publishers and imprints. NULL for non-imprints.
Change tracking fields:
reserved If true (1 in MySQL, as opposed to 0 for false), there is an active change being made in the data editing tables.
created
modified

Brands

Brands represent emblems (graphic logos or text) that appear on the front or back covers or on the spine of an issue / book and identify the publisher in some way. While intended to be a fairly objective field, there is still considerable controversy over how to define brands and when to consider two similar emblems separate. The current working definition is that changes in the text of the emblem mean that the two emblems are different brands. Changing only graphic elements generally does not create a new brand. But there are exceptions to these rules depending on the nature of the text or whether there was a change in the common name of the brand / publisher / imprint accompanying a purely graphical change.

Brand effectively identifies publisher imprints as well as publisher branding other than imprints. However, it is not always a perfect match for imprints so a more formal imprint field / table may be introduced in the future.

Brands are only intended to be used for the publisher's brand. Distributor marks are not tracked here (they will be given their own field/table in a future release of the site), nor are brand emblems from partners, sponsors, etc. [for instance, a TV network's name or logo if the issue ties in to a TV show]).

The brand concept may change over the next few months, or may be given a new name in the UI. It is a new field, and therefore not filled out on most issues at this time.

gcd_brand [apps.gcd.models.Brand class in the code]
Column Name Description Usage Notes
Core brand data fields:
id DB-generated unique, primary key.
name Name of the brand, generally the text of the emblem. If there is no text or it does not clearly identify the emblem, the common name (if any) may be used, or a brief description of the emblem may be used.
year_began Year (by cover date) of first publication. May be NULL If unknown.
year_ended Year (by cover date) of last appearance, if any. May be NULL if unknown or still in use.
notes Space for arbitrary notes. Should include any description of the emblem or emblems involved in the brand to aid identification.
url The URL for the brand's website, if and only if it is distinct from the parent master publisher (this is common when the brand is a publisher's imprint). Includes the protocol and leading slashes (e.g. 'http://').
Relations to other tables and counts of related objects:
issue_count Total number of issues in all series published by the publisher. Recalculated by the code on data updates.
parent_id Foreign key to the gcd_publisher table (id column).. An issue can only be linked to a brand that is linked to the same master publisher to which the issue's series is linked.
Change tracking fields:
reserved If true (1 in MySQL, as opposed to 0 for false), there is an active change being made in the data editing tables.
created
modified

Indicia Publishers

Indicia publishers are the companies, identified by their legal names, that are actually recorded within the comic (in the indicia if the comic has such a thing) on an issue-by-issue basis. This is pretty much the only publishing field that is objectively clear, as master publisher relies on grouping these companies by somewhat variable rules and sometimes tenuous speculation, while brand is controversial in a number of ways as described above. Unlike master publisher, it is linked at the issue level and therefore a series can (and in certain countries and time periods, often does) span many indicia publishers.

This is a new table so it is not filled out for many issues yet.

gcd_indicia_publisher [apps.gcd.models.IndiciaPublisher class in the code]
Column Name Description Usage Notes
Core indicia publisher data fields:
id DB-generated unique, primary key.
name Name of the company exactly as it appears in the indicia, including abbreviations, punctuation, etc. (e.g. "Co." and "Inc."). On rare occasions minor variations in the name may be grouped as one indicia publisher. This should be explained in the notes field.
year_began Year (by cover date) of first publication. May be NULL if not known.
year_ended Year (by cover date) of last publication, if any. May be NULL if not known or if the company is still active.
notes Space for arbitrary notes.
url The URL for the company's website, if and only if distinct from the master publisher website. Includes the protocol and leading slashes (e.g. 'http://').
Relations to other tables and counts of related objects:
issue_count Total number of issues in all series published by the company. Recalculated by the code on data updates.
country_id The country in which this company is incorporated. Foreign key into the gcd_country table (id column).
parent_id Foreign key to the gcd_publisher table (id column). An issue can only be linked to an indicia publisher that is linked to the same master publisher to which the issue's series is linked.
is_surrogate If true (1 in MySQL, as opposed to 0 for false), this company is not properly part of its linked master publisher (exactly what "part of" means is somewhat variable). Instead, this company is an otherwise unrelated company that published material on behalf of the parent master publisher. This was common in the United States during World War II and explains a number of otherwise puzzling issues.
Change tracking fields:
reserved If true (1 in MySQL, as opposed to 0 for false), there is an active change being made in the data editing tables.
created
modified

Series

A series in the GCD is officially defined by the indicia title and publisher - when either of these changes a new series is created. Minor changes in the indicia title (such as the presence or absence of an article, i.e. "The") may be allowed within one series, especially if the article comes and goes. Some exceptions exist, such as Dell's Four Color (2nd series) which is commonly grouped in guides but actually uses a different indicia title for nearly every issue.

A series may be split due to a particularly long lapse in its publication schedule (on the order of years) or due to certain kinds of renumbering, but the exact rules in this area are still being debated.

gcd_series [apps.gcd.models.Series class in the code]
Column Name Description Usage Notes
Core series data fields:
id DB-generated unique, primary key.
name Series title, as it appears in the indicia (if any, as on the cover if there is no indicia), with any leading article moved to the end after a comma. The article-moving rule applies only if the title is in the language in which the series is written.
format Physical format of the series (size, color presence or absence, paper stock, etc.). Since format is really an attribute of the issue rather than the whole series, this field may contain a fairly long description of how the format changed over the life of the series. In the "New Fun" schema, this is being replace by a number of distinct fields, which individually describe the constituent attributes commonly listed here.
year_began First year (by cover date) of publication.
year_ended Last year (by cover date) of publication, if any.
is_current Set to true (1 in MySQL, as opposed to 0 for false) if new issues are still being produced for this series. This is a new field and may not be properly set for all series yet.
publication_dates First and last full cover publication dates of the series, separated by a hyphen (i.e. '-'). No last date if still ongoing. This field is being replaced by the first_issue_id and last_issue_id fields. It is not currently possible to edit it from the OI, so errors are not being actively corrected. Once all data is migrated to the new field, this one will be dropped.
tracking_notes Notes about how a given series is continued in or continued from another series. See the [Formatting Documentation] for how these notes should be structured. This will eventually be replaced with foreign key fields.
publication_notes Formerly the place where arbitrary publication details could be recorded. Now considered deprecated. This predates the imprint field, which predates the issue-level indicia publisher and brand fields. Notes here that are not migrated to those fields will most likely be merged into the general notes field and this field will be eventually dropped.
notes Space for arbitrary notes.
has_gallery If true (1 in MySQL, as opposed to 0 for false), the series has at least one cover image. Updated by the code as covers are added/changed/removed.
Relations to other tables and counts of related objects:
issue_count Number of issues associated with the series. Recalculated by the code when the data is updated.
first_issue_id Foreign key to the first issue of this series in the gcd_issue table (id column). Updated by the code when issues are added, deleted or moved. This is a new field and is not always properly set, nor is it yet used in the UI.
last_issue_id Foreign key to the last issue of this series in the gcd_issue table (id column). Updated by the code when issues are added, deleted or moved. This is a new field and is not always properly set, nor is it yet used in the UI.
publisher_id Foreign key into the gcd_publisher table (id column). Should only ever reference a master publisher record (enforced at the code level).
imprint_id Foreign key into the publishers table (id column) for the deprecated "imprint" concept. This is being set to NULL once its information has been migrated to the brand_id and indicia_publisher_id on the issues for the series.
country_id Foreign key into the gcd_country table (id column). This is properly an attribute of the indicia publisher (linked to the issue table) rather than the series, and will eventually be dropped once that field is sufficiently populated.
language_id Foreign key into the gcd_language table (id column). The language in which the comic is written. For comics that are purely visual, there is an ISO code for "no linguistic content".
Change tracking fields:
open_reserve Obsolete field that we forgot to drop during schema migration. This will be dropped next time we change the schema, and should be ignored.
reserved If true (1 in MySQL,as opposed to 0 for false), there is an active change being made in the data editing tables.
created
modified

Issues

Issues are what you would expect. They include both periodical issue (pamphlets) and bound volumes / collections, currently with no distinctions between the two. The "New Fun" schema will add classifications so that we can treat these items differently. As it is, books are sometimes awkwardly wedged into fields designed for periodicals, but it mostly works.

One notable exception to the sensibility of this table is that issue variants do *not* each get separate records at this time. A future schema (possibly "New Fun" but possibly a release some time after that) will address this inconsistency. For now, variant covers are recorded by adding multiple cover sequences to the story table for the issue. There are some exceptions to this arrangement when more than just the cover varies.

gcd_issue [apps.gcd.models.Issue class in the code]
Column Name Description Usage Notes
Core issue data fields:
id DB-generated unique, primary key.
number Issue number or other identifier from the indicia (if there is one) or cover (if there is no indicia) or in some cases both (if there are both and they disagree). See Issue Numbers for numerous special cases of missing / dual / annual / per-volume or other sorts of "numbers". Note that this is a text field rather than a numeric field because many issue "numbers" are partially or entirely non-numeric, i.e. "Summer Special".
volume Volume number from the indicia (if there is one) or cover (if there is no indicia). See Volume While this field has been present for a long time, it has only recently been displayed on the site. Therefore the data is not present as often as for other fields.
display_volume_with_number If true (1 in MySQL, as opposed to 0 for false), combine the volume number with the issue number in most places that the issue number would be displayed. This produces numbers like "v2#1" (although in many indexes that notation is still used directly in the issue number field - this will eventually be corrected). See Volume and Issue Numbers.
no_volume If true (1 in MySQL, as opposed to 0 for false), there is no volume number for the issue. This is used to confirm the meaning of an empty volume field since that field was ignored for many years and therefore is often empty when it should not be. See Volume and Issue Numbers.
sort_code Numeric field that determines how issues sort within each series. There is a uniqueness constraint on the combination of this column and series_id. This replaces the old cover sort code and is now the sole source of ordering for issues within a series. key_date (which is not always known) is now only used for chronological search results.
key_date Specially formatted date string that (in theory) determines the ordering of issues in search results that are organized by date. In practice, many issues are missing this field, meaning that they will just sort to the beginning of any chronological list. This is a data problem, not a coding error. For reliably sorting issues within a series, see sort_code. Also see Keydate.
publication_date Publication date of the issue as given in the indicia, but fully spelled out. See Publication Date.
indicia_frequency If the issue has an indicia and it specified a publication frequency (monthly, quarterly, one-shot, etc.) that is recorded here. This is a text field rather than a foreign key into a set of options because of frequencies that are totally non-standard (common in alternative/underground publications) or irregular in such ways as "monthly except in September and December". This is a new field so it is sparsely filled out at this time.
price A decimal price followed by the ISO code, i.e. 0.10 USD See Cover Price for details, including multiple prices and non-decimal currencies.
page_count Overall page count for the issue, counting covers but not inserts or dust jackets. A single sheet with one fold (the smallest possible "comic book") would count as 4 pages. See Page Count.
page_count_uncertain If true (1 in MySQL, as opposed to 0 for false), the page count is unknown (if page_count is NULL) or could not be reliably determined due to second-hand information or missing pages. This field allows the page_count field to be of numeric type. Previously, it was text so that a "?" could be appended for this purpose.
editing Editor for the whole issue, or for the overall anthology if there are different editors for each story in a collection. Also used for other issue-level credits such as assistant/associate editor, publisher, production, special thanks, etc. See Credits and Editing for how credits are structured.
no_editing If there is no editor (or similar credit) at the issue level, this field should be set to true (1 in MySQL, as opposed to 0 for false). Distinguishes from an unknown editor or from the field simply not having been filled out.
notes Space for arbitrary notes about the entire issue. Because the issue-level notes were originally (by convention) placed in the notes field of the first cover sequence (in the story table), the only reasonable option was to duplicate those notes into the issue table during migration. We are working through these duplicates, but for now it is very common to see identical notes in the issue record and in the sequence number 0 story record.
Fields relating to the other tables:
indicia_publisher_id Foreign key into the gcd_indicia_publisher table (id column, can be NULL for no indicia publisher or unknown indicia publisher [data not entered yet]). Indicates the actual company that published the issue as shown in the indicia or other formal publication data. This is a new field and therefore is sparsely filled out at this time.
brand_id Foreign key into the gcd_brand table (id column, can be NULL for no brand or unknown brand [data not entered yet]). Indicates the publisher's logo / name / tagline on the exterior (front cover, back cover or spine) of the issue. There is still a fair amount of debate about what constitutes a separate brand (how much change in the logo appearance and/or text). This is a new field and therefore it is sparsely filled out at this time.
no_brand If true (1 in MySQL as opposed to 0 for false), there are no brand markings on the issue. If this field is false and brand is NULL, the UI will display a question mark indicating that the branding is unknown.
series_id Foreign key into the gcd_series table (id column).
story_type_count Number of stories of type "story" linked to this issue. Used to determine whether an issue is "indexed" or not. Currently, an issue is indexed if there is at least one entry in the story table of type "story" (to prevent the existence of a story of type "cover" or other type from making the issue look complete). In the future we may also implement tracking of the expected number of stories to make this measurement more precise. Calculated whenever an issue's stories are updated.
Change tracking fields:
index_status Obsolete field kept to help debug potential migration problems (so far, so good...). This field will be dropped the next time we update the schema, and should be ignored.
reserve_status Obsolete field kept to help debug potential migration problems (so far, so good...). This field will be dropped the next time we update the schema, and should be ignored.
reserve_check Obsolete field kept to help debug potential migration problems (so far, so good...). This field will be dropped the next time we update the schema, and should be ignored.
reserved If true (1 in MySQL, as opposed to 0 for false), there is an active change being made in the data editing tables.
created
modified

Stories

The story table would more accurately be called the sequence table, and we plan to eventually rename it. It records the contents of an issue, whether they are actual sequential art stories, text stories, covers, advertisements or any number of other sequence types. Only sequential art stories, covers and text stories are technically required to "complete" an index, but cover-to-cover indexes are appreciated.

gcd_story [apps.gcd.models.Story class in the code]
Column Name Description Usage Notes
Core story data fields:
id DB-generated unique, primary key.
sequence_number Order of the story within the issue. See Sequence Number. Sequence zero is no longer treated specially (although it still represents the front cover, or at least one variant of it).
title The title of the story, or a quote from the beginning of the script if there is no title. See Title.
title_inferred True (1 in MySQL, as opposed to 0 for false) if the indexer chose the title either directly or by using the first line of the script. This causes the UI to display the title with [square brackets] around it.
feature The name of the feature, if any. Typically as from the splash or title page, but with some exceptions. Often but not always the name of the primary character. Some stories do not have a feature. See Feature.
page_count Number of pages. One side counts as one page. See Pages.
page_count_uncertain True (1 in MySQL, as opposed to 0 for false) if the page count is unknown or uncertain. This is a separate field so that page_count could be a numeric field instead of varchar.
notes Arbitrary notes from the indexer. See Notes.
reprint_notes Reprint information (both reprint from and reprinted in). See Reprints for the extremely detailed formatting requirements for these notes. This field will be replaced with database links in release 0.5 "Kin-der-kids".
type_id Foreign key to the data_story_type table (id column). Type of "story", or more accurately sequence. Actual comic story, text story, advertisement, or letters page, etc.
job_number Job Number
Fields for creator credits:
script Writer, scripter and/or (co-)plotter credits. See Credits and Script.
no_script True (1 in MySQL, as opposed to 0 for false) if there is no writing, such as for a single panel illustration with no text (pin-up) or a simple promo that shows the next issue's cover with no significant text (lines like "Continued next issue" do not count as significant text). Note that the absence of dialog or narration does not mean there is no script. If there is a sequence of events depicted in the artwork, that constitutes a "script".
pencils Pencils or other primary artwork (painting, photography, etc.) or layouts / breakdowns and finishing pencils. See Credits and Pencils.
no_pencils True (1 in MySQL, as opposed to 0 for false) if there is no artwork, for instance a text story with no illustrations or a plain text recap page with limited visual elements (such as a solid color background, which is more of a production choice than the sort of artwork credited in these fields).
inks Inks or other artwork that completes the primary artwork. Painting / photography should be credited both here and in pencils an possibly colors. See Credits and Inks.
no_inks True (1 in MySQL, as opposed to 0 for false) if there is no inking or similar finishing task that applies to the artwork, or there is no artwork.
colors Color added to non-colored artwork, or similar image processing. Painting / color photography should be credited both here and in pencils, and possibly inks. See Credits and Colors.
no_colors True (1 in MySQL, as opposed to 0 for false) if there is no colorist to credit, for instance because the sequence is printed in black and white, or if there is no artwork.
letters Lettering. For text not designed/produced by a letterer, use "typeset" (for instance typewritten text stories word processor output not done in word balloons or captions). See Credits and Letters.
no_letters True (1 in MySQL, as opposed to 0 for false) if there is no letterer (or "typeset") for the sequence. Covers with no significant text beyond the title and standard promotional taglines should set this to true.
editing Details regarding editing for this individual story, as opposed to the editing notes for the whole issue. Rarely used except for sequence 0, as the editor for the book is typically assumed to have edited the whole book. However, this field can be used if individual stories have separate editors such as certain original anthologies or collections covering diverse material.
no_editing True (1 in MySQL, as opposed to 0 for false) if there is no editor to credit for this story (as opposed to the single editor that commonly edits the whole issue).
Fields related to content:
characters Characters appearing should be populated in this field. Sometimes only recurring characters, but sometimes including things like "unnamed boy" or "Nazis". See Character Appearances. The formatting rules for this field are extensive, but have been changed several times and multiple sets of rules are currently in active use.
synopsis A brief synopsis of the content. The purpose of the field is to aid in identifying the story, not to provide a plot summary to replace reading the whole thing. Do not use text from other sites or sources as it may be assumed to be under copyright. See Synopsis.
genre One or more genres, separated by semicolons. Free-form, although there is an Official Genres List. See also Genre.
Fields relating to the other tables:
issue_id Foreign key into the issues table (id column).
type_id Foreign key to the data_story_type table (id column). Type of "story", or more accurately sequence. Actual comic story, text story, advertisement, or letters page, etc.
Change tracking fields:
reserved If true (1 in MySQL, as opposed to 0 for false), there is an active change being made in the data editing tables.
created
modified