Current Schema: Difference between revisions

From GCD
Jump to navigation Jump to search
(Checkpointing work on the series table.)
(Finished series table (first round anyway).)
Line 1: Line 1:
This wiki page attempts to document the current schema for the GCD.  The public schema consists of four principle data tables as well as several supporting tables.  The supporting tables are '''Countries''', '''Languages''' (lists of codes for each of those things), '''Indexers''' (a list of indexers, but not of full account information) and '''IndexCredit''' (an incomplete selection of indexer credits, although where the other credits live remains something of a mystery to the writer of this guide).  The four main tables are described below.
This wiki page attempts to document the current schema for the GCD.  The public schema consists of four principle data tables as well as several supporting tables.  The supporting tables are '''Countries''', '''Languages''' (lists of codes for each of those things), '''Indexers''' (a list of indexers, but not of full account information) and '''IndexCredit''' (maps indexers to series for credit, but there are also credit notes int he series table).  The four main tables are described below.


The "Name in New Code" column is for those who might be looking at the new Python/Django implementation.  Since the existing table and column names are not even remotely consistent in capitalization, formatting or abbreviation, they were standardized in the Django model classes to make working with them easier (Django provides a convenient name mapping system for exactly this purpose).
The "Name in New Code" column is for those who might be looking at the new Python/Django implementation.  Since the existing table and column names are not even remotely consistent in capitalization, formatting or abbreviation, they were standardized in the Django model classes to make working with them easier (Django provides a convenient name mapping system for exactly this purpose).


Please note that much of this page is based on earlier documentation by Peter Croome.
Please note that much of the '''issue''' and '''story''' table information on this page is based on earlier documentation by Peter Croome.


{| border="1"
{| border="1"
Line 90: Line 90:
| created
| created
| Row creation timestamp.
| Row creation timestamp.
|
|-
|-
| Modified
| Modified
Line 102: Line 103:
|-
|-
|}
|}


{| border="1"
{| border="1"
Line 187: Line 190:
| indexers
| indexers
| Semicolon-separated string of indexers.
| Semicolon-separated string of indexers.
| This does not appear to be quite what gets displayed in the credits on the page in the Lasso site.  The exact relationship is unclear.
| This does not appear to be quite what gets displayed in the credits on the page in the Lasso site.  The exact relationship between this and the '''IndexCredit''' table is unclear.
|-
|-
| OpenReserve
| OpenReserve
Line 211: Line 214:
| LangCode
| LangCode
| language_code
| language_code
| Inexplicably '''not''' a foreign key into the Languages table.  Instead the language code is dupliated here, and does not necessarily match properly.
| Inexplicably '''not''' a foreign key into the Languages table.  Instead the language code is duplicated here, and does not necessarily match properly.
|
|-
| File
| file
| Presumably the name of the file used for flat-file indexing.  Or possibly distribution.
|
|-
| InitDist
| initial_distribution
| Presumably something to do with flat-file distribution of the data.
|
|-
| UpdateDist
| update_distribution
| Presumably something to do with flat-file distribution of the data.
|
|-
| PubID
| publisher
| Foreign key into the '''publishers''' table.
|
|-
| Pub_Name
| publisher_name
| Copied from publishers table, presumably to save a join.
| Reliability of this field has not been verified.
|-
| Created
| created
| Row creation timestamp.
|
|-
| Modified
| modified
| Date of last row change.
|
|-
| ModTime
| modification_time
| Time of last row change (24 hour).
|
|-
| LstChang
| last_change
| Date of some sort, but apparently unused.
| Only five non-NULL, non-zero values in the current data, all in the future (nearest is in 2010).
|-
| oldID
| old_id
| Not clear.  Presumably the ID from a prior organization of the data?
| Does not appear to be in active use currently.
|-
| SelfCount
| self_count
| Absolutely no clue.
|
|
|-
|-
|}
|}

Revision as of 01:47, 24 March 2008

This wiki page attempts to document the current schema for the GCD. The public schema consists of four principle data tables as well as several supporting tables. The supporting tables are Countries, Languages (lists of codes for each of those things), Indexers (a list of indexers, but not of full account information) and IndexCredit (maps indexers to series for credit, but there are also credit notes int he series table). The four main tables are described below.

The "Name in New Code" column is for those who might be looking at the new Python/Django implementation. Since the existing table and column names are not even remotely consistent in capitalization, formatting or abbreviation, they were standardized in the Django model classes to make working with them easier (Django provides a convenient name mapping system for exactly this purpose).

Please note that much of the issue and story table information on this page is based on earlier documentation by Peter Croome.

publishers [Publisher class in new site]
DB Column Name in New Code Description Usage Notes
ID id DB-generated primary key. In the lasso-based site, this is the number you see in URLs like http://www.comics.org/publisher_details.lasso?id=4066 It is generated by the database and has no inherent meaning.
PubName name Name of the publisher, imprint, or set of imprints Because imprints are not stored per-issue, in some cases a semicolon-separated list of imprints corresponding to all imprints used for a given series has been given an entry in this table. Also, not all things that we term "Publishers" or "Imprints" strictly fall into those categories, and we do not accurately reflect the evolution of company names over time (for instance, "DC" was not until fairly recently the official name of the company, although it was long an informal name. It is still not exactly correct- "DC Comics" appears to be the current legal name. But "DC" is what everyone knows, so that is what appears in the database).
YearBegan year_began Year (by cover date) of first publication.
YearEnded year_ended Year (by cover date) of last publication, if any.
CountryID country The country from which the publisher operates. Foreign key into the Countries table.
Notes notes Space for arbitrary notes.
web url The URL for the publisher's website. While this column sometimes has data, it does not appear to be used by the Lasso-based site.
AlphaSortCode alpha_sort_code A single character. Appears to be used to sort related publishers near each other even if the more obvious alphabetical sort would not do so. It's a touch unclear.
ImprintCount imprint_count Count of how many imprints a given publisher has. Does not appear to be uniformly accurate.
BookCount series_count Count of how many series this publisher publishes. Does not appear to be uniformly accurate.
IssueCount issue_count Total number of issues in all series published by the publisher. Does not appear to be uniformly accurate.
Master is_master Boolean field. Appears to indicate if something is a publisher or an imprint. Or more generally, a first-class publishing entity or one that is somehow part of a larger entity (because it was bought long ago, or is part of a group of companies commonly referred to by another name, etc). The new site mostly ignores this in favor of checking for rows with a matching parent id.
Connection connection No clue. It's a varchar, though. Perhaps it's supposed to explain the nature of each publisher-to-publisher relationship? Always null or in one case an empty string in the current data dump, so pretty clearly not used.
ParentID parent Self-join key for parent/child relationships. This is how the imprints for a publisher are stored.
NextID next Self-join key, purpose unknown. Does not appear to be in use.
created created Row creation timestamp.
Modified modified One of two change timestamps. No idea which one the Lasso site actually uses.
Updated updated One of two change timestamps. No idea which one the Lasso site actually uses.


series [Series class in new site]
DB Column Name in New Code Description Usage Notes
ID id DB-generated primary key. In the lasso-based site, this is the number you see in URLs like http://www.comics.org/series.lasso?seriesid=72 It is generated by the database and has no inherent meaning.
Bk_Name name Series title.
Format format Physical format of the series (size, color presence or absence, paper stock, etc.). Since format is really an attribute of the issue rather than the whole series, this field may contain a fairly long description of how the format changed over the life of the series.
Notes notes Space for arbitrary notes.
Yr_Began year_began First year (by cover date) of publication.
Yr_Ended year_ended Last year (by cover date) of publication, if any.
PubDates publication_dates First and last full cover publication dates of the series, separated by a '-'. No last date if still ongoing.
Pub_Note publication_note Arbitrary publication / publisher-related notes. Some records in the current DB include addresses, imprint information and/or general reprint information here. Does not appear to be used by the Lasso implementation.
Themes themes Does not appear to be used (always NULL).
Issuecount issue_count Number of issues associated with the series. Does not appear to be reliably accurate.
Tracking tracking_notes Notes about how a given series is continued in or continued from another series.
Crossref crossref Appears to be a boolean field flagging if there is a crossreference. Does not appear to be used by the Lasso implementation. Only one record where this is not NULL or 0.
CrossrefID crossref_id Apparently a self-join key for the cross-referenced series. Does not appear to be used by the Lasso implementation. Only one record where this is not NULL or 0.
HasGallery gallery_present 3-character field, but used as boolean. If true, there is a cover gallery. I think.
Included gallery_includes varchar field, seems to be comma-separated list (with hyphens for ranges) of issues included in something? I speculate the gallery, but it's not clear how this is used, if at all, by the current implementation. It is not the string that shows up in the "Issues published:" space on the series page.
Indexers indexers Semicolon-separated string of indexers. This does not appear to be quite what gets displayed in the credits on the page in the Lasso site. The exact relationship between this and the IndexCredit table is unclear.
OpenReserve open_reserve Integer field. Presumably something to do with online indexing, but since reservations are per-issue, it's not clear quite what.
Frst_Iss first_issue Issue # (or other identifier) of the first issue in the series. Note that it's not currently reliably possible to determine this from the issue table.
Last_Iss last_issue Issue # (or other identifier) of the last issue in the series, if any. Note that it's not currently reliably possible to determine this from the issue table.
CounCode country_code Inexplicably not a foreign key into the Country table. Instead the country code is duplicated here, and does not necessarily match properly.
LangCode language_code Inexplicably not a foreign key into the Languages table. Instead the language code is duplicated here, and does not necessarily match properly.
File file Presumably the name of the file used for flat-file indexing. Or possibly distribution.
InitDist initial_distribution Presumably something to do with flat-file distribution of the data.
UpdateDist update_distribution Presumably something to do with flat-file distribution of the data.
PubID publisher Foreign key into the publishers table.
Pub_Name publisher_name Copied from publishers table, presumably to save a join. Reliability of this field has not been verified.
Created created Row creation timestamp.
Modified modified Date of last row change.
ModTime modification_time Time of last row change (24 hour).
LstChang last_change Date of some sort, but apparently unused. Only five non-NULL, non-zero values in the current data, all in the future (nearest is in 2010).
oldID old_id Not clear. Presumably the ID from a prior organization of the data? Does not appear to be in active use currently.
SelfCount self_count Absolutely no clue.