Current Schema

From GCD
Revision as of 01:44, 21 December 2009 by Handrews (talk | contribs) (Convert all docs to the 0.2 Vieux Bois schema. Still need to add brand and indicia publisher tables.)
Jump to navigation Jump to search

Introduction

This wiki page attempts to documents the current schema for the GCD. As of October 2009, the schema is being actively maintained and enhanced, and as of December 2009 all efforts are being made to keep this page up to date.

Public and Private Tables

The GCD schema consists of public and private tables. The public tables contain the visible data records (except for cover information) and are available to anyone in the public dumps. No account or membership of any sort is required to access the data dumps. The GCD requires that any use of our data from the dumps be credited to us. On a web site, this should be done on each page where the data is used, and a link should be provided to our site. Commercial use is welcome as long as credit and (on the web) links are provided. For questions about using the dump please contact the Board.

The private tables consist of three types of tables:

  • account data (private for the obvious reasons)
  • data editing tables (private to avoid sending out unapproved data- we'd like to make the change history public at some point if folks want it)
  • the covers table (private for legal reasons involving the distributions of cover scans and copyright law. Please contact the Board if you have concerns about this policy. Please do not contact the tech team as this is not a technical policy and complaining to them only wastes both your time and theirs)

A Bit of History

The original organization of the GCD data was into two files--or, more accurately, into a Series file (later two and then four Series files) and several dozen detail files. The two types of files were normalized by Lionel English in the mid-1990s in order to put them in to a real database, and then the two normalized files were further normalized in order to create the Publishers table from the Series table, and an Issues table from the Details table. During the server migration and site rewrite the took place during August-December 2009, Henry Andrews, Jochen Garcke and the rest of the tech team changed the database to a system supporting true foreign keys and transactions, and imposed a consistent naming scheme on the tables and columns. Several new public tables, as well as an entirely new set of data editing tables, were added at this time. The documentation for the pre-2009-migration schema is [PreOctober2009_Schema still available] for those curious about it.

Future Plans

We're progressing towards our real "1.0" release of this generation of the site, code-named "New Fun". Much discussion on the lists references the "New Fun" schema which supports a much more complex and realistic view of the data. Please join the [Email_Lists_FAQ gcd-tech group] if you would like to participate in that discussion. For the curious, the December 2009 release was 0.2 "Vieux Bois", for

Supporting Tables

There are several supporting tables necessary to use the data which lay out fixed sets of options for certain fields. They are:

  • gcd_country
    • Has columns id', code and name, and provides the set of countries used by the various publisher and series tables. The code column is the ISO 3166 code for the country, and should arguably be used as the foreign key, but due to the very rushed development timeline of the December 2009 release (as the prior server was no longer viable), we went with the standard numeric id column for reasons which seemed like a good idea at the time, but involved a great deal of sleep deprivation :-) We'll probably revisit this eventually.
  • gcd_language
    • Has columns id, code and name, and provides the set of languages used by the series table. The code column is the ISO 639 code for the language. The two-letter ISO 639-1 code is used if available, three letter code from 639-2 or 639-3 otherwise. This might get standardized into 3-letter codes or 4-letter codes if we need to take ISO 639-6 into account once its code list is published. This table has the same foreign key oddness about code vs id as the gcd_country table, although the possibility of changing the codes makes it more sensible to stick with id for now.
  • gcd_story_type
    • Has columns id and name, and provides the set of types used by the story table. The type field is somewhat controversial in that its values actually cover various different attributes of story sequences. However, until late 2009 there was no way for the GCD to change the field (and it was in fact a plain text field within the story table) so discussions on changing it were set aside. This will be one of many things revisited along the path to the 1.0 New Fun release.

Publishers

The publisher table currently holds two sorts of records: "master publishers" and "imprints". Both of these concepts are less precise than one might expect.

"Master publishers" are the common names of publishers as typically grouped by comic book researchers. As this vague definition suggests, there is considerable room for disagreement over what constitutes a master publisher and how such entities should be named. This field is often referred to as the "argue 'till we're blue in the face" field. Part of the current definition of a series in the GCD is that all issues within the series are published by the same master publisher. This has led to the grouping some companies together under a single master publisher because one took over several series from the other. Debate on the proper arrangement and definition of master publishers is ongoing. A master publisher is a record that has the is_master field set to 1. The name of the publisher is chosen primarily based on what is most likely to be expected by a user running a search.

As of late 2009, "imprints" are considered a deprecated concept and are being replaced by indicia publishers and brands (covered in their own tables below). The concept of an imprint may return in a more well-defined form in the future for formal imprints of publishing houses. The data records in this table that are considered "imprints" serve as a catch-all for several different publishing concepts, often requiring long semi-colon separated lists as the field is set per-series but the data often changes per-issue. An imprint is a record that has a non-NULL parent_id field.

Note that there are a very small number of records that are both master publishers and imprints. Explaining those records is beyond the scope of this documentation.

gcd_publisher [apps.gcd.models.Publisher class in the code]
Column Name Description Usage Notes
Core publisher data fields:
id DB-generated primary key.
name Name of the master publishing "family", imprint, or set of imprints See the discussion above for how this name is chosen
year_began Year (by cover date) of first publication.
year_ended Year (by cover date) of last publication, if any.
country_id The country from which the publisher operates. Foreign key into the gcd_country table.
notes Space for arbitrary notes.
url The URL for the publisher's website.
Relations to other tables and counts of related objects:
imprint_count Count of how many imprints a given master publisher has. Recalculated by the code on data updates.
series_count Count of how many series this publisher publishes. Recalculated by the code on data updates.
issue_count Total number of issues in all series published by the publisher. Recalculated by the code on data updates.
is_master Boolean field (1 or 0 in MySQL). See discussion of master publishers above.
parent_id Self-join key for parent/child relationships between master publishers and imprints. NULL for non-imprints.
Change tracking fields:
reserved If true, there is an active change being made in the data editing tables.
created
modified

Brands

Indicia Publishers

Series

A series in the GCD is officially defined by the indicia title and publisher- when either of these changes a new series is created. Minor changes in the indicia title (such as the presence or absence of an article, i.e. "The") may be allowed within one series, especially if the article comes and goes. Some exceptions exist, such as Dell's Four Color (2nd series) which is commonly grouped in guides but actually uses a different indicia title for nearly every issue.

A series may be split due to a particularly long lapse in its publication schedule (on the order of years) or due to certain kinds of renumbering, but the exact rules in this area are still being debated.

gcd_series [apps.gcd.models.Series class in the code]
Column Name Description Usage Notes
Core series data fields:
id DB-generated primary key.
name Series title, as it appears in the indicia (if any, as on the cover if there is no indicia), with any leading article moved to the end after a comma. The article-moving rule applies only if the title is in the language in which the series is written.
format Physical format of the series (size, color presence or absence, paper stock, etc.). Since format is really an attribute of the issue rather than the whole series, this field may contain a fairly long description of how the format changed over the life of the series. In the "New Fun" schema, this is being replace by a number of distinct fields, which individually describe the constituent attributes commonly listed here.
year_began First year (by cover date) of publication.
year_ended Last year (by cover date) of publication, if any.
is_current Set to true if new issues are still being produced for this series. This is a new field and may not be properly set for all series yet.
first_issue_id Foreign key to the first issue of this series in the gcd_issue table. Updated by the code when issues are added, deleted or moved. This is a new field and is not always properly set, nor is it yet used in the UI.
last_issue_id Foreign key to the last issue of this series in the gcd_issue table. Updated by the code when issues are added, deleted or moved. This is a new field and is not always properly set, nor is it yet used in the UI.
publication_dates First and last full cover publication dates of the series, separated by a '-'. No last date if still ongoing. This field is being replaced by the first_issue_id and last_issue_id fields. It is not currently possible to edit it from the OI, so errors are not being actively corrected. Once all data is migrated to the new field, this one will be dropped.
tracking_notes Notes about how a given series is continued in or continued from another series. See the [Formatting Documentation] for how these notes should be structured. This will eventually be replaced with foreign key fields.
publication_note Formerly the place where arbitrary publication details could be recorded. Now considered deprecated. This predates the imprint field, which predates the issue-level indicia publisher and brand fields. Notes here that are not migrated to those fields will most likely be merged into the general notes field and this field will be eventually dropped.
notes Space for arbitrary notes.
publisher_id Foreign key into the gcd_publisher table. Should only ever reference a master publisher record (enforced at the code level).
imprint_id Foreign key into the publishers table for the depreated "imprint" concept. This is being set to NULL once its information has been migrated to the brand_id and indicia_publisher_id on the issues for the series.
country_id Foreign key into the gcd_country table. This is properly an attribute of the indicia publisher (linked to the issue table) rather than the series, and will eventually be dropped once that field is sufficiently populated.
language_id Foreign key into the gcd_language table. The language in which the comic is written. For comics that are purely visual, there is an ISO code for "no linguistic content".
issue_count Number of issues associated with the series. Recalculated by the code when the data is updated.
has_gallery If true, the series has at least one cover image. Updated by the code as covers are added/changed/removed.
Change tracking fields:
open_reserve Obsolete field that we forgot to drop during schema migration. This will be dropped next time we change the schema, and should be ignored.
reserved If true, there is an active change being made in the data editing tables.
created
modified

Issues

Issues are what you would expect. They include both periodical issue (pamphlets) and bound volumes / collections, currently with no distinctions between the two. The "New Fun" schema will add classifications so that we can treat these items differently. As it is books are sometimes awkwardly wedged into fields designed for periodicals, but it mostly works.

One notable exception to the sensibility of this table is that issue variants do *not* each get separate records at this time. A future schema (possibly "New Fun" but possibly a release some time after that) will address this inconsistency. For now, variant covers are recorded by adding multiple cover sequences to the story table for the issue. There are some exceptions to this arrangement when more than just the cover varies.

gcd_issue [apps.gcd.models.Issue class in the code]
Column Name Description Usage Notes
Core issue data fields:
id DB-generated primary key.
number Issue number or other identifier from the indicia (if there is one) or cover (if there is no indicia) or in some cases both (if there are both and they disagree). See Issue Numbers for numerous special cases of missing / dual / annual / per-volume or other sorts of "numbers". Note that this is a text field rather than a numeric field because many issue "numbers" are partially or entirely non-numeric, i.e. "Summer Special".
volume Volume number from the indicia (if there is one) or cover (if there is no indicia). See Volume While this field has been present for a long time, it has only recently been displayed on the site. Therefore the data is not present as often as for other fields.
display_volume_with_number If true, combine the volume number with the issue number in most places that the issue number would be displayed. This produces numbers like "v2#1" (although in many indexes that notation is still used directly in the issue number field- this will eventually be corrected). See Volume and Issue Numbers.
no_volume If true, there is no volume number for the issue. This is used to confirm the meaning of an empty volume field since that field was ignored for many years and therefore is often empty when it should not be. See Volume and Issue Numbers.
sort_code Numeric field that determines how issues sort within each series. There is a uniqueness constraint on the combination of this column and series_id. This replaces the old cover sort code and is now the sole source of ordering for issues within a series. key_date (which is not always known) is now only used for chronological search results.
key_date Specially formatted date string that (in theory) determines the ordering of issues in search results that are organized by date. In practice, many issues are missing this field, meaning that they will just sort to the beginning of any chronological list. This is a data problem, not a coding error. For reliably sorting issues within a series, see sort_code. Also see Keydate
publication_date Publication date of the issue as given in the indicia, but fully spelled out. See Published Date
indicia_frequency If the issue has an indicia and it specified a publication frequency (monthly, quarterly, one-shot, etc.) that is recorded here. This is a text field rather than a foreign key into a set of options because of frequencies that are totally non-standard (common in alternative/underground publications) or irregular in such ways as "monthly except in September and December". This is a new field so it is sparsely filled out at this time.
price A decimal price followed by the ISO code, i.e. 0.10 USD See Cover Price for details, including multiple prices and non-decimal currencies.
page_count Overall page count for the issue, counting covers but not inserts or dust jackets. A single sheet with one fold (the smallest possible "comic book") would count as 4 pages. See Page Count.
page_count_uncertain If true, the page count is unknown (if page_count is NULL) or could not be reliably determined due to second-hand information or missing pages. This field allows the page_count field to be of numeric type. Previously it was text so that a "?" could be appended for this purpose.
editing Editor for the whole issue, or for the overall anthology if there are different editors for each story in a collection. Also used for other issue-level credits such as assistant/associate editor, publisher, production, special thanks, etc. See Credits and Editing for how credits are structured.
no_editing If there is no editor (or similar credit) at the issue level, this field should be set to true. Distinguishes from an unknown editor or from the field simply not having been filled out.
notes Space for arbitrary notes about the entire issue. Because the issue-level notes were originally (by convention) placed in the notes field of the first cover sequence (in the story table), the only reasonable option was to duplicate those notes into the issue table during migration. We are working through these duplicates, but for now it is very common to see identical notes in the issue record and in the sequence number 0 story record.
Fields relating to the other tables:
indicia_publisher_id Foreign key into the gcd_indicia_publisher table (nullable). Indicates the actual company that published the issue as shown in the indicia or other formal publication data. This is a new field and therefore is sparsely filled out at this time.
brand_id Foreign key into the gcd_brand table (nullable). Indicates the publisher's logo / name/ tagline on the exterior (front cover, back cover or spine) of the issue. There is still a fair amount of debate about what constitutes a separate brand (how much change in the logo appearance and/or text). This is a new field and therefore it is sparsely filled out at this time.
series_id Foreign key into the gcd_series table.
story_type_count Number of stories of type "story" linked to this issue. Used to determine whether an issue is "indexed" or not. Currently, an issue is indexed if there is at least one entry in the story table of type "story" (to prevent the existence of a story of type "cover" or other type from making the issue look complete). In the future we may also implement tracking of the expected number of stories to make this measurement more precise. Calculated whenever an issue's stories are updated.
Change tracking fields:
index_status Obsolete field kept to help debug potential migration problems (so far, so good...). This field will be dropped the next time we update the schema, and should be ignored.
reserve_status Obsolete field kept to help debug potential migration problems (so far, so good...). This field will be dropped the next time we update the schema, and should be ignored.
reserve_check Obsolete field kept to help debug potential migration problems (so far, so good...). This field will be dropped the next time we update the schema, and should be ignored.
reserved If true, there is an active change being made in the data editing tables.
created
modified

Stories

The story table would more accurately be called the sequence table, and we plan to eventually rename it. It records the contents of an issue, whether they are actual sequential art stories, text stories, covers, advertisements or any number of other sequence types. Only sequential art stories, covers and text stories are technically required to "complete" an index, but cover-to-cover indexes are appreciated.

gcd_story [apps.gcd.models.Story class in the code]
Column Name Description Usage Notes
Core story data fields:
id DB-generated primary key.
sequence_number Order of the story within the issue. See Sequence Number. Sequence zero is no longer treated specially (although it still represents the front cover, or at least one variant of it).
title The title of the story, or a quote from the beginning of the script if there is no title. See Title
title_inferred True if the indexer chose the title either directly or by using the first line of the script. This causes the UI to display the title with [square brackets] around it.
feature The name of the feature, if any. Typically as from the splash or title page, but with some exceptions. Often but not always the name of the primary character. Some stories do not have a feature. See Feature
page_count Number of pages. One side counts as one page. Pages
page_count_uncertain True if the page count is unknown or uncertain. This is a separate field so that page_count could be a numeric field instead of varchar.
notes Arbitrary notes from the indexer. See Notes
reprint_notes Reprint information (both reprint from and reprinted in). See Reprints for the extremely detailed formatting requirements for these notes. This field will be replaced with database links in release 0.5 "Kin-der-kids".
type_id Foreign key to the data_story_type table. Type of "story", or more accurately sequence. Actual comic story vs text story vs ad vs letters page, etc.
job_number Job Number
Fields for creator credits:
script Writer, scripter and/or (co-)plotter credits. See Credits and Script.
no_script True if there is no writing, such as for a single panel illustration with no text (pin-up) or a simple promo that shows the next issue's cover with no significant text (Lines like "Continued next issue" do not count as significant text). Note that the absence of dialog or narration does not mean there is no script. If there is a sequence of events depicted in the artwork, that constitutes a "script".
pencils Pencils or other primary artwork (painting, photography, etc.) or layouts / breakdowns and finishing pencils. See Credits and Pencils.
no_pencils True if there is no artwork, for instance a text story with no illustrations or a plain text recap page with limited visual elements (such as a solid color background, which is more of a production choice than the sort of artwork credited in these fields.
inks Inks or other artwork that completes the primary artwork. Painting / photography should be credited both here and in pencils an possibly colors. See Credits and Inks.
no_inks True if there is no inking or similar finishing task that applies to the artwork, or there is no artwork.
colors Color added to non-colored artwork, or similar image processing. Painting / color photography should be credited both here and in pencils and inks. See Credits and Colors.
no_colors True if there is no colorist to credit, for instance because the sequence is printed in black and white, or if there is no artwork.
letters Lettering. For text not designed/produced by a letterer, use "typeset" (for instance typewritten text stories word processor output not done in word balloons or captions). See Credits and Letters
no_letters True if there is no letterer (or "typeset") for the sequence. Covers with no significant text beyond the title and standard promotional taglines should set this to true.
editing Rarely used except for sequence 0, as the editor for the book is typically assumed to have edited the whole book. However, this field can be used if individual stories have separate editors such as certain original anthologies or collections covering diverse material.
Fields related to content:
characters Characters appearing in the field. Sometimes only recurring characters, but sometimes including things like "unnamed boy" or "Nazis". See Character Appearances. The formatting rules for this field are extensive, but have been changed several times and multiple sets of rules are currently in active use.
synopsis A brief synopsis of the content. The purpose of the field is to aid in identifying the story, not to provide a plot summary to replace reading the whole thing. Do not use text from other sites or sources as it may be assumed to be under copyright. See Synopsis.
genre One or more genres, separated by semicolons. Free-form, although there is an Official Genres List. See also Genre.
issue_id Foreign key into the issues table.
Change tracking fields:
reserved If true, there is an active change being made in the data editing tables.
created
modified }