New Fun Schema: Difference between revisions

From GCD
Jump to navigation Jump to search
(Start documenting the schema for New Fun, starting with the Publisher table. Has much of the old schema docs still hanging around to work from.)
 
No edit summary
 
(36 intermediate revisions by 3 users not shown)
Line 1: Line 1:
==Introduction== This page documents the proposed database schema for the "New Fun" release, i.e. the first release of the new web site that will go into productionIt is not set in stone, but is nearing its final formAt some point, we will declare a soft schema freeze, after which only minor changes (such as adding a column that doesn't affect table relationships or major display options) will be allowed.
[This was the direction the schema was going towards at the time of the server crisis in August of 2009. That crisis required a more modest migration of the database, followed by numerous gradual releases which may or may not end up at a release that looks like this schemaThe documentation here is largely untouched since the crisis, and therefore quite out of dateIt remains a useful reference for concepts that have not yet made it into production.]


A few notes on database vs code names:
This page documents the proposed database schema for the "New Fun" release, i.e. the first release of the new web site that will go into production.  It is not set in stone, but is nearing its final form.  At some point, we will declare a soft schema freeze, after which only minor changes (such as adding a column that doesn't affect table relationships or major display options) will be allowed.


Table names are lower case with underscores.  Model class names are capitalized with no spaces or underscores.  It should be obvious which model fits which table.
==Projects and Applications==
 
Django organizes projects into applications, which may be more or less independent of one another.  The project corresponds to the top-level Python package, with each application receiving a sub-package.  The package name appears as a prefix on database tables.  The GCD project will use the following applications (with possibly more to be added as needed):
 
* <tt>data</tt>: Holds the data for display.  In theory, dumping the data tables should provide all of the information that we wish to distribute publicly.
* <tt>oi</tt>: Holds the state of the online indexing system such as reservations, approval queues, indexing credits and space for data to be edited outside of the publicly visible tables.
* <tt>resource</tt>:  Holds files such as cover scans.  Also holds information about linking those scans to the public data, since that linkage is of no interest without the scans.  Also holds other files like logo scans or anything else we decide to include.  Some of this may be publicly accessible, but unlike the <tt>data</tt> application that is not the default assumption.
* <tt>account</tt>: User account information, holding user preferences and all data that doesn't fit in Django's standard authorization and prefs apps.
* <tt>migration</tt>: Space for temporary data and status tracking while we migrate fully to the new schema.  This keeps transient junk out of our "proper" data design.
 
==Storage Architecture vs Application Architecture==
 
This version of the schema is the first to dramatically separate the representation of the data in storage (the database) from the application layer (and the UI above it).  In addition to the standard database vs UI separations, this is enabled by our decision to use a separate search server to implement search.  The server will organize an index of our data tables in a way that optimizes for searches, and all end-user searches will run against that index.  This means that we can use a table structure that would otherwise result in very complex search queries without the performance impact of those queries.  So when looking at this schema, don't think to yourself "wow that will be hard/expensive to search against", because it won't.
 
There are several key concepts involved in this separation that you should understand before diving into the table layout.  Most notably, these involve dealing with what we have always called series and issues, and how they relate to each other.
 
==Series Classification==
 
One of the most important fields introduced by the New Fun schema is the series classification.  This selects exactly one of several very broad categories that supplies a context for interpreting the information associated with the series.  For the most part, this field tells the UI (and to a lesser extent, the application) how display things from this series, and what fields to show the indexers when they index or create something in the series.  This is our primary tool for cutting down the high number of potential fields and choices into something manageable and relevant for each task at hand.
 
An important thing to note is that while classification helps us '''optimize''' the UI for both display and indexing, all fields are available to all classifications.  You will just have to go to some sort of "advanced" option to use fields that are rare or that few indexers prefer to see for the given classification.  And some fields will probably always be "advanced" in any classification.
 
The tentative list of classifications is (names very much subject to change):
 
'''Periodical / pamphlet series''' These are series of things that are generally called "issues" and are more like magazines than books.
 
'''Periodical / pamphlet one-shot''' These are things that look like something from the periodical / pamphlet series classification but are only ever intended to have one physical item within the series.  The only difference between this classification and the series is that the UI may do some display optimization that it would *not* do for a series that only has one issue simply because the second one hasn't come out yet.  For instance, the indexing UI will never ask for a publishing frequency for a one-shot.  If your series lists a publishing frequency other than "published as one-shot", it should be classified as a "periodical pamphlet / series" that was cancelled after the first issue (for instance, Timely's '''Red Raven Comics''' should be a "periodical / pamphlet series" despite having only one issue).  If this proves too confusing, the distinction will be dropped and all such series will automatically become "periodical / pamphlet series".
 
'''Series of books''' Original books or collections of periodicals, but they look more like books than magazines.  Items are probably referred to as "books" or "volumes" or some similar concept.  They also quite likely have their own title in addition to some sort of numeric designation.
 
'''Single book''' Original book or collection, but not part of a real series.  Again, this is mostly for UI optimization.  Note that a book may appear in two series, one in this classification (to catalogue it stand-alone) and another in the "series of books" classification (as part of a larger series).  Books in this classification are very likely to have their own titles, and are less likely to have a volume or book number (or need to have "[nn]" displayed).  Again, if this proves too confusing, the distinction will be dropped and all such series will automatically become "series of books".
 
We could take this further and separate books that are collections of periodicals from other books, but that can easily be done later if the concept proves useful.  Sticking with the two choice areas that have come up over and over again on the lists (single vs true series, book vs periodical) seems like a good start.  There are clear demands for these- quite recently with the book vs periodical distinction, and during the last prototype work when Jochen implemented a view of single-issue series that combined the series and issue pages.  I objected at the time but the idea stuck with me and now I think there is a proper framework in which to implement it.
 
===Items vs Issues vs Books===
 
During the debates over handling books, it became extremely confusing to refer to things as "issues".  But using "books" is arguably more accurate (as in "comic books"), it is still confusing.  In the database, we will use the neutral term '''item''' for the smallest ''physical'' thing that is tracked in the database.  An item is anything you can't split further without undoing its binding (even if that binding is just folding, as with the current ''Wednesday Comics'' from DC.
 
At the application layer, the distinction will be made, based on the classification, whether a given item is an issue, a book, or some other term we add in the future.  As an item can be in more than one series, it can potentially be classified in different ways (although making something both an issue and a book is probably an error, but it's hard to predict for certain).
 
===What's a Descriptor?===
 
We all know that the term "issue number" is problematic.  They neither universally apply to "issues", nor are they all "numbers".  As has been discussed ad nauseum.  Sometimes they don't even serve as faux-numbers, but behave more like names.  And then there are volume numbers.  Which are sometimes Roman numerals since apparently the publishing industry didn't get the memo about these great new numbers Europe nicked from the Hindus by way of the Arabs and Persians.  And some volume numbers are probably completely non-numeric.  And then there are people who object to calling them volume numbers and say they should be book numbers.  Whether that is true depends on the series in question and whether you find the term "volume" pretentious or not, apparently.  Whatever, it's all a horrible mess.  Add to it the fact that these things can contradict each other depending on where they appear on the item, and you have at least six fields, three for each of the concepts (indicia, cover, and inferred a.k.a. indexer-assigned).  And someone came up with some example with three different sorts of numbers.  Or at least two plus a name.
 
So... what to do about this?  At one point, I proposed calling the issue-level ones labels, but that term is already needed for certain UI-related things, and would just add more confusion on the technical side.  And it didn't solve the larger problems of how to organize the fields clearly for the various puproses either.  Tossing in an issue name/title field brings the number of fields up to seven, eight or nine, depending on how you look at it.
 
So I decided to abandon the basic notion of a fixed set of fields, and allow us to link in whatever is needed to the item.  As with "item", I needed a neutral term for this.  Abandoning all pretense of human comprehensibility, I appropriated the techie term "descriptor" (as in "file descriptor", if you must know).  It is accurate, as these things describe the item.
 
===Descriptor Properities===
 
Descriptors have a few properties.
 
* They have a '''label''', by which I mean a term that a normal human would use by sticking it in front of the actual descriptor.  Like "number", or "volume" or "book".  Or "title" which you wouldn't actually ''say'', but would make sense as a label in the UI or the indexing form.
* They may apply to either a single item, or to groups of items.
** The group may be pathological in that it only has one item in it, but the intention is usually clear from the label.  Or in some cases, label plus classification (i.e. "volume" applies to groups in periodicals, but single items in series of books).
 
Another aspect of descriptors is that unlike many other things that are factored into their own table, we're mostly not trying to ensure that there's only one row per value and all uses link to that row.  For instance, the number "1" for the issue of one series is not meaningfully the same as the number "1" for another series.  The only reason to store them both as the same row is to reduce the storage and table scan time.  But as far as the application is concerned, descriptor's aren't really shared.
 
==Naming and Design Conventions==
 
===Formatting===
 
A few notes on database table and column names vs code class and variable names:
 
Table names are lower case with underscores.  Model class names are capitalized with no spaces or underscores.  It should be obvious which model fits which table, as the words in the names should be the same.  Column and variable names are all lower case with underscores, and should match exactly except for foreign keys.


All foreign keys end in <tt>_id</tt> in the database, but do not have that suffix in the code.  So wherever you see something like <tt>issue_id</tt> in this table, expect to see <tt>issue</tt> in the code.  This is because in the code you get an object, not just an ID.
All foreign keys end in <tt>_id</tt> in the database, but do not have that suffix in the code.  So wherever you see something like <tt>issue_id</tt> in this table, expect to see <tt>issue</tt> in the code.  This is because in the code you get an object, not just an ID.
Line 9: Line 73:
Tables are prefixed with an application name, which does not appear in the model class name.  The main app name is currently usually <tt>core</tt>, which is arguably a lousy name.  I'm contemplating changing it to <tt>data</tt> but I'm not sure about that.  It can't be <tt>gcd</tt> because that is the name of the overall project.  Currently, the only other application is <tt>migration</tt>, which is for tables that exist only to assist in migration of data from old to new forms.
Tables are prefixed with an application name, which does not appear in the model class name.  The main app name is currently usually <tt>core</tt>, which is arguably a lousy name.  I'm contemplating changing it to <tt>data</tt> but I'm not sure about that.  It can't be <tt>gcd</tt> because that is the name of the overall project.  Currently, the only other application is <tt>migration</tt>, which is for tables that exist only to assist in migration of data from old to new forms.


==Publisher (core_publisher)==
===Meaningful Terms===
 
Many of the names in the database and/or code attempt to follow some sort of systematic convention.  Here are some notable ones:
 
'''<tt>_inferred</tt>''' Many fields will have another field with the same name plus the <tt>_inferred</tt> suffix.  This other field will be a boolean that tells you whether the value of the primary field was really present on the item at hand (false) or was inferred (or just made up) by the indexer (true).  In the old schema, this was mostly handled by enclosing the value in brackets, which should no longer be done.  However, the UI may continue to display inferred values by enclosing them in brackets.  Note that "[nn]" or "nn" for the issue number should not appear at all, even with an inferred boolean- see the section about item descriptors to understand why this is and how the display will still be able to show "[nn]" when appropriate.
 
'''<tt>_uncertain</tt>''' Analagous to <tt>_inferred</tt>, but indicates that the value is a guess of some sort.  Both flags may be set on the same value.
 
'''<tt>sort_code</tt>''' always refers to a field that is *not* shown to the user, but is instead used behind the scenes to order things.  In the UI, rather than viewing and entering numbers, you will use a more intuitive system such as specifying that an item follows another item, or by dragging and dropping items in a list.  As auto-generated numbers, they may not make much sense if you look at them raw, except that they *will* define a clear ordering.
 
'''<tt>_links</tt>''' is a suffix that shows up in the code when many-to-many relationships are involved, which is very common in the new schema.  In general, the set of relationship objects (rows in the join table containing data about the relationship rather than about either related object) will end in <tt>_links</tt>.  This will help distinguish these sets from either set of related objects on the ends of the relationship.
 
'''<tt>label</tt>''' is used for fields that describe other fields.  The value of label fields is generally intended to be used by the UI to label a field that does not always have an obvious name.  This is hard to explain without an example, so look through the tables to see how this is used.  This did '''not''' end up being a term that replaces "issue number", although that is one area where a label field is used.
 
===Proxy Models===
 
Typically, each database table corresponds to one model class in the code.  However, several of our tables, instead of holding homogenous data, hold rows that may fall into one or more subtypes that are treated somewhat differently in the applications.  Publishers and Items (on which more is explained below) come to mind.  These are dealt with by defining Proxy Models, which are classes that provide different views of existing tables.  In particular, they are used to make models that act like they can only access a subset of the table that is relevant for a particular application-level concept.
 
===Indexes and Keys===
 
Don't read much into the presence or absence of indexes in the current table definitions.  Aside from all foreign keys having indexes, the presence or absence of an index for a column is mostly just left over from the old schema, plus a few nearly arbitrary decisions made as I worked on the tables.  The reason for this is that we'll be using an external search server, so many columns that would normally need an index may not- we'll have to see how the search server works.  Once we know that, and have a better feel for what sort of queries will actually be run, we will revisit the indexes and also consider multi-column indexes.
 
==Table Documentation==
 
'''Note:''' This documentation does '''not''' cover every table.  It focuses on the more complex new areas of the schema.  Hopefully we'll get the whole thing documented as we go along, but for the definitive schema, see the files in subversion.  Email the gcd-tech list if you have questions or would like to help write documentation.
 
===Publishers===


The Publisher table is unusual in that it stores several distinct types of entities, all of which fall under the general umbrella of "Publisher".  Several of them may play multiple roles, which is the primary reason for keeping them in one table.  Each type of publishing entity is represented by a proxy model in the Django code.  Except as noted, publishers are connected to each other via the PublisherRelationship table described further down.  Before going through the fields, here is the list of proxy models and their definitions:
The Publisher table is unusual in that it stores several distinct types of entities, all of which fall under the general umbrella of "Publisher".  Several of them may play multiple roles, which is the primary reason for keeping them in one table.  Each type of publishing entity is represented by a proxy model in the Django code.  Except as noted, publishers are connected to each other via the PublisherRelationship table described further down.  Before going through the fields, here is the list of proxy models and their definitions:


===MasterPublisher===
====MasterPublisher====


This is the legacy notion of a top-level publisher, linked at the series.  Defined by setting <tt>is_master = 1</tt>.  The nature and future role of master publishers is very much a matter of debate.
This is the legacy notion of a top-level publisher, linked at the series.  Defined by setting <tt>is_master = 1</tt>.  The nature and future role of master publishers is very much a matter of debate.


===Imprint===
====LegacyImprint====
 
This is the legacy notion of an imprint, effectively a hodge-podge of anything and everything that did not fit as a MasterPublisher.  These are linked at the series level, and will be retained '''only''' until they can be migrated to the other fields, after which they will be dropped from the application.  Something is a legacy imprint if it has a relationship of type LEGACY with a MasterPublisher.
 
====PublishingCompany====


This is the legacy notion of an imprint, effectively a hodge-podge of anything and everything that did not fit as a MasterPublisher.  These are linked at the series level, and will be retained '''only''' until they can be migrated to the other fields, after which they will be dropped from the application.  Something is an imprint if its <tt>parent</tt> field is not NULL (relationships between new publisher entries do not use the <tt>parent</tt> field).
An actual corporate entity of some sort involved in publication.  These are linked at the item level and defined as having <tt>is_company = 1</tt>.


===PublishingCompany===
====Brand====


An actual corporate entity of some sort involved in publication.  These are linked at the issue level and defined as having <tt>is_company = 1</tt>.
A marketing brand involving a particular logo or set of closely related logos.  These are linked at the item level and defined as having <tt>is_brand = 1</tt>.


===Brand===
====Distributor====


A marketing brand involving a particular logo or set of closely related logos.  These are linked at the issue level and defined as having <tt>is_brand = 1</tt>.
A company that distributes issues to retailers.  Technically not a publisher, but the same fields apply and several publishers have served as their own distributors.  These are linked at the item level and are defined as having <tt>is_distributor = 1</tt>.


===The Publisher Table===
====The Publisher Table====


{| border="1"
{| border="1"
|+ '''core_publisher'''
|+ '''data_publisher'''
! Column
! Column
! Type
! Type
Line 41: Line 135:
|-
|-
| id
| id
| int(11) auto_auto_increment
| int(11) auto_increment
| DB-generated primary key.
| DB-generated primary key.
|  
|  
Line 51: Line 145:
|-
|-
| country_id
| country_id
| foreign key to '''core_country'''
| foreign key to '''data_country'''
| The country from which the publisher operates.
| The country from which the publisher operates.
|
|
|-
|-
| year_began
| year_began
| int(11)
| year
| Year (by cover date) of first publication.
| Year (by cover date) of first publication.
|
|
|-
|-
| year_ended
| year_ended
| int(11)
| year
| Year (by cover date) of last publication, if any.
| Year (by cover date) of last publication, if any.
|  
|  
Line 78: Line 172:
|-
|-
| is_master
| is_master
| tinyint(1)
| boolean
| Defines which rows appear in the MasterPublisher proxy model.
| Defines which rows appear in the MasterPublisher proxy model.
|  
|  
|-
|-
| is_company
| is_company
| tinyint(1)
| boolean
| Defines which rows appear in the PublishingCompany proxy model.
| Defines which rows appear in the PublishingCompany proxy model.
|
|
|-
|-
| is_brand
| is_brand
| tinyint(1)
| boolean
| Defines which rows appear in the Brand proxy model.
| Defines which rows appear in the Brand proxy model.
|
|-
| is_distributor
| boolean
| Defines which rows appear in the Distributor proxy model.
|
|
|-
! colspan="4" | Change tracking fields:
|-
| series_count
| int(11)
| The number of series associated with the publishing entity.
| This may have to go due to possible confusion over using a row as more than one type of publishing entity (for instance, both a master publisher and a brand).
|-
| item_count
| int(11)
| The number of items associated with the publishing entity.
| This may have to go due to possible confusion over using a row as more than one type of publishing entity (for instance, both a master publisher and a brand).
|-
|-
! colspan="4" | Change tracking fields:
! colspan="4" | Change tracking fields:
|-
|-
| created
| created
| date
| datetime
| Row creation date.
| Row creation date and time.
|
| Previously NULL or zero entries are now set to January 1, 1901.
|-
|-
| modified
| modified
| date
| datetime
| Last modified date.
| Last modified date and time.
|
| Previously NULL or zero entries are now set equal to "created".
|-
|-
|}
|}


===Series===


==Series==
Series is fairly straightforward, except that series names are held in a separate table to allow multiple names to be assigned to a given seriesSee the code for details.
 
The series table includes a number of fields that are now obsolete, or perhaps represent possible new features that were abandonedIt also has fields related that may be related to cover scans, which are unclear as the cover tables are not included in the public schema.


{| border="1"
{| border="1"
|+ '''series''' ['''Series''' class in new site]
|+ '''data_series_name'''
! DB Column
! Column
! Name in New Code
! Type
! Description
! Description
! Usage Notes
! Usage Notes
|-
|-
! colspan="4" | Core series data fields:
|-
| ID
| id
| id
| int(11) auto_increment
| DB-generated primary key.
| DB-generated primary key.
| In the lasso-based site, this is the number you see in URLs like http://www.comics.org/series.lasso?seriesid=72  It is generated by the database and has no inherent meaning.
|  
|-
| value
| varchar(255) NOT NULL
| A name used for a series.
| The name may come from the cover, the indicia, common usage or any other source.
|-
|-
| Bk_Name
| source_id
| name
| Foreign key to '''data_source''
| Series title.
| Explains where (indicia, cover, gcd-assigned, etc.) the name comes from.
|
|
|-
|-
| Format
| is_primary
| format
| boolean
| Physical format of the series (size, color presence or absence, paper stock, etc.).
| Exactly one name per series should have this set, indicating that this is the most important name.
| Since format is really an attribute of the issue rather than the whole series, this field may contain a fairly long description of how the format changed over the life of the series.  In the new schema, this is being replace by a number of distinct fields, which individually describe the constituent attributes commonly listed here.
| This is the name that will be put at the top of the series' page, and will be used in display formats where only one name can be shown (and it's not a search result that matched one of the non-primary names).
|-
|}
 
{| border="1"
|+ '''data_series'''
! Column
! Type
! Description
! Usage Notes
|-
! colspan="4" | Core series data fields:
|-
|-
| Notes
| id
| notes
| int(11) auto_increment
| Space for arbitrary notes.
| DB-generated primary key.
|  
|  
|-
|-
| Yr_Began
| year_began
| year_began
| First year (by cover date) of publication.
| int(11)
| Year (by cover date) of first publication.
|
|
|-
|-
| Yr_Ended
| year_ended
| year_ended
| Last year (by cover date) of publication, if any.
| int(11)
|
| Year (by cover date) of last publication, if any.
|  
|-
|-
| PubDates
| classification_id
| publication_dates
| foreign key to '''data_classification'''
| First and last full cover publication dates of the series, separated by a '-'.  No last date if still ongoing.
| Used to guide the application and UI layer as to how best to display or index items in this series.
|
|
|-
|-
| Pub_Note
| language_id
| publication_note
| foreign key to '''data_country'''
| Arbitrary publication / publisher-related notes.  Some records in the current DB include addresses, imprint information and/or general reprint information here.
| The language in which the series is written.
| Legacy field.  Not used by the Lasso implementation.  Replaced by the imprint_id foreign key into the Publisher table that identified the imprint record.
| Currently there is no facility to record multi-lingual comics.
|-
|-
| Frst_Iss
| format
| first_issue
| varchar(255)
| Issue # (or other identifier) of the first issue in the series.
| legacy format field
| Note that it's not currently reliably possible to determine this from the issue table, since the series may only be partially indexed.
| This field will be retained only until its contents have been migrated to the issue and sequence-level format fields. It's contents are a mess of at least five different sorts of data (including size, binding, paper stock, coloration and frequency of publication).
|-
|-
| Last_Iss
| notes
| last_issue
| mediumtext
| Issue # (or other identifier) of the last issue in the series, if any.
| Space for arbitrary notes.
| Note that it's not currently reliably possible to determine this from the issue table, since the series may only be partially indexed.
|
|-
|-
| Issuecount
| tracking_notes
| issue_count
| mediumtext
| Number of issues associated with the series.
| Notes about how a given series is continued in or continued from another series.
| Does not appear to be reliably accurate.
| Largely replaced by the '''data_series_relationship''' table, after most data is migrated this field will only be used for tracking notes where one end of the relationship is not and cannot be in the database (tracking from a newspaper or web comic, for instance).
|-
|-
| PubID
| has_gallery
| publisher
| tinyint(1)
| Foreign key into the '''publishers''' table.
| True if there's a cover gallery to display for this series.
|
|
|-
|-
| Pub_Name
! colspan="4" | Change tracking fields:
| publisher_name
| Cached Publisher name and stems historically from the flat file format.
| Reliability of this field has not been verified.
|-
|-
| imprint_id
| created
| imprint
| datetime
| Foreign key into the '''publishers''' table for the "imprint" concept.  Which bears only occasional resemblance to any normal definition of the term.
| Row creation date.
|
|
|-
|-
| CounCode
| modified
| country_code
| datetime
| '''Not''' a foreign key into the Country table.  Instead the country code is duplicated here, and does not necessarily match properly.
| Last modified date.
| Historically it was done this way to save some joining time.
|
|-
|-
| LangCode
|}
| language_code
 
| '''Not''' a foreign key into the Languages tableInstead the language code is duplicated here, and does not necessarily match properly.
=Items (a.k.a. Issues)=
| Historically it was done this way to save some joining time.
 
As recent discussions delved more into the problems created by trying to treat books as issues, the terminology got quite confusing.  For that reason, and to better reflect the requirements of the project, I've renamed the "issues" to "items" at the database level.  Of course, they will still be "issues" (or in some cases "books") in the UI.  In general, this section has a '''lot''' of techy terminology that will not be at all visible in the UI, or even in the middle layers of the web applicationThis is intentional, with the aim to break faulty associations between objects as we conceive of them visually and the database-level storage requirements.  Within the web application there will be classes such as "Issue" that make sense out of this very complex sub-area of the schema.
 
For instance, it's difficult to see from looking at the next few tables how to figure out something as conceptually simple as "what is the number of this issue?"  However, in the typical case an indexer will just enter the issue number into a very obvious issue number field, or edit it there if they need to change something later.  The point being, unless you really care about database design, don't get hung up on how much this doesn't look anything like a comic book issue anymore.  Unless you squint really hard.
 
{| border="1"
|+ '''item_item_descriptor'''
! DB Column
! Type
! Description
! Usage Notes
|-
|-
! colspan="4" | Fields relating series to each other (continuations, etc.):
| id
| int(11) auto_increment
| DB-generated primary key.
|
|-
|-
| Tracking
| series_item_id
| tracking_notes
| foreign key to '''data_series_item'''
| Notes about how a given series is continued in or continued from another series.
| The item in a particular series to which this descriptor applies.
|
|
|-
|-
| Crossref
| scope
| crossref
| enum('item', 'group')
| Not in use.  
| Indicates whether this descriptor applies to one item or a group of items.
| Intended to indicate a cross reference to replace tracking
| A single item is typically an issue or a book, or more technically, one row in the '''data_item''' table.  A group is a set of items within the series (possibly the entire series, and possibly a set with only one member, but at least theoretically multiple items).  Volume numbers in periodicals are the most typical example of descriptors at group scope.  Indexers will never directly select or see this field as it will be handled behind the scenes.  But it will, for instance, determine whether the UI should display a descriptor of type "volume" as the main number for an item (as in a trade paperback collection) or as a prefix to the main item (as in a a periodical issue).
|-
|-
| CrossrefID
| source_id
| crossref_id
| Foreign key to '''data_source'''
| Not in use.  
| Where the descriptor originated.
| Intended to be a foreign key to a table of one-to-many references to other series to replace tracking.  
| Typical values include indicia, cover, title page, spine, etc. as well as "inferred", meaning the indexer made it up or got it from somewhere other than the actual item (long standing fan convention, for instance).
|-
|-
! colspan="4" | Field related to the (cover?) gallery:
| label_id
| Foreign key to '''data_descriptor_label'''
| The display label for this descriptor, such as "issue" or "volume".
|
|-
|-
| HasGallery
| value
| gallery_present
| varchar(255)
| 3-character field, but used as boolean.
| The actual issue/volume/book/ISBN/ISSN/etc. number/name/title/whatever.
| Gallery flag. Initially used for caching display of the gallery indicator. 3-char field stems from very early "YES/NO" values.
| Note that the special no-number value "[nn]" should never be used (nor should the older "nn"). An item with no number should simply not have an entry in this table, or possibly should have an entry with a NULL value (still working on that). Or should only have an entry that is marked as "inferred", meaning that it was assigned by the indexer and does not appear on the item. The UI will translate this condition to "[nn]" or whatever else is appropriate.
|-
|-
| Included
|}
| gallery_includes
 
| Obsolete.
 
| Legacy field that used to be used to indicate which issues had been indexed.  Superceded by various matrices on the current site.
 
{| border="1"
|+ '''data_descriptor_label'''
! DB Column
! Type
! Description
! Usage Notes
|-
|-
! colspan="4" | Fields related to indexing:
| id
|-
| int(11) auto_increment
| Indexers
| DB-generated primary key.
| indexers
| Semicolon-separated string of indexers.
| Originally a free-form text field recording who had worked on a given series.  Used to create initial version of '''Indexers''' table, and then used to create the '''IndexCredit''' join table.  Later updated periodically *from* the '''IndexCredit''' join table so that it could be exported into legacy format.  Presumably no longer completely up-to-date, now that the lasso-site tracks indexing at the issue level.  Not clear how the current site builds the display of indexers on a given page.
|-
| OpenReserve
| open_reserve
| Integer field.
| How many issues of this series are open for reservation.
|-
! colspan="4" | Fields apparently related to flat file management / distribution:
|-
| File
| file
| Obsolete. The name of the file used for flat-file distribution. They were sized for floppies IIRC.
|
|
|-
|-
| InitDist
| name
| initial_distribution
| varchar(255)
| Obsolete.  
| The human-readable display name for this kind of descriptor.
| Indicated the first distribution set the record was contained in (flat-file).
| Note that unlike most literal data, these labels may not all be in English and therefore may have unusual translation rules.  The exact strategy here is still TBD, but part of the need for this feature is dealing with different names in various countries that don't directly translate outside of that country/language.
|-
|-
| UpdateDist
| for_item
| update_distribution
| boolean
| Obsolete.  
| If true, may be used at the item level.
| Indicated the last distribution set the record was updated in (flat-file).
| i.e. descriptors where scope='item'
|-
|-
! colspan="4" | Change tracking fields:
| for_group
| boolean
| If true may be used at the group level.
| i.e. descriptors where scope='group'
|}
 
 
 
{| border="1"
|+ '''data_source'''
! DB Column
! Type
! Description
! Usage Notes
|-
|-
| Created
| id
| created
| int(11) auto_increment
| Row creation timestamp.
| DB-generated primary key.
|
|
|-
|-
| Modified
| name
| modified
| varchar(255)
| Date of last row change.
| The human-readable display name for the source location.
|
|
|-
|-
| ModTime
| is_inferred
| modification_time
| boolean
| Time of last row change (24 hour).
| If true, the descriptor marked with this source did not come from the item at hand, and should be displayed accordingly.
|
|-
| LstChang
| last_change
| Date of some sort, but apparently unused.
| Only five non-NULL, non-zero values in the current data, all in the future (nearest is in 2010).
|-
| oldID
| old_id
| The ID from a prior organization of the data.
| Long deprecated.
|-
! colspan="4" | Miscellaneous unused and/or mysterious fields:
|-
| Themes
| themes
|
|
| Does not appear to be used (always NULL). Was intended as an early incarnation of keywords.
|-
| SelfCount
| self_count
| Obsolete.
| Stems from previous FileMaker incarnation of the database and used as a way to track duplicated during imports.
|-
|}
|}


=Issues=


The issues table contains large swaths of unused columns, due to it initally being a cloning of the sequence 0 story records when moving from flat file to a semi-relational database (FileMaker 4). Some fields stayed aboard instead of being kicked out due to a number of flat-file based files pending inclusion.


{| border="1"
{| border="1"
|+ '''issues''' ['''Issue''' class in new site]
|+ '''data_item'''
! DB Column
! DB Column
! Name in New Code
! Type
! Description
! Description
! Usage Notes
! Usage Notes
|-
|-
! colspan="4" | Core issue data fields:
|-
| ID
| id
| id
| int(11) auto_increment
| DB-generated primary key.
| DB-generated primary key.
| In the lasso-based site, this is the number you see in URLs like http://www.comics.org/details.lasso?id=293  It is generated by the database and has no inherent meaning.
|  
|-
| VolumeNum
| volume
| Volume number from the indicia.
| Not consistently used, and according to recent editor list comments, not planned for migration into the new schema.  Where especially relevant, volume numbers are typically included in the Issue (number) field. NOTE: This should actually be used for proper indexing of some foreign titles which restart numbering every year.
|-
| Issue
| number
| Issue number or other identifier from the indicia.
| Use the value ''nn'' to indicate "no number" when there is no identifier.  For issues where the volume number is important, the form ''v1#1'' is used to specify both the volume and issue numbers.  When the cover number or commonly assumed number is different from the indicia number, it goes in brackets after the indicia number: ''v3#2 [26]''.  The same format is used for disambiguation of identically numbered issues:  ''1 [1946]'' or ''nn [1947]''.
|-
|-
| Pub_Date
! colspan="4" | Basic item information
| publication_date
| Publication date of the issue as given in the indicia, but fully spelled out.
| [[Published Date]]
|-
|-
| Price
| price
| price
| A decimal price followed by the ISO code, i.e. ''0.10 USD''
| varchar(255)
| [[Cover Price]]
| Legacy price field.  A decimal price followed by the ISO code, i.e. ''0.10 USD''
| See [[Cover Price]], but superseded by '''data_item_price'''
|-
|-
| storycount
| story_count
| Number of stories linked to this issue.
| May not be 100% accurate for all issues, but is updated whenever the issue itself is updated.
|-
| Key_Date
| key_date
| Specially formatted date string that (in theory) determines the ordering of issues within a series.
| In practice, many issues are missing this field, meaning that if their issue numbers are not standard, there is no reliable programmatic way to sort the issues in the series.  In practice, the cover sort codes are slightly more reliable (although not always), but they are not included in the public data dumps.  Also see [[Keydate]]
|-
! colspan="4" | Fields that would be core issue data but aren't actually used:
|-
| Pg_Cnt
| page_count
| page_count
| Not Used
| varchar(10)
| Page count for the whole issue is taken from story sequence 0 (which otherwise represents the front cover).
| Page count including covers but not inserts or dust jackets.
| Should be decimal, but needs migration work.
|-
|-
| Editing
| editing
| editor
| mediumtext
| Not Used
| Editor for the item as a whole.
| Editor(s) for the whole issue are taken from story sequence 0 (which otherwise represents the front cover).
|
|-
|-
| Notes
| notes
| notes
| Not Used
| mediumtext
| Notes for the whole issue are taken from story sequence 0 (which otherwise represents the front cover).
| Notes for the item.
|
|-
|-
! colspan="4" | Fields related to indexing:
! colspan="4" | Date fields:
|-
|-
| IndexStatus
| publication_date
| index_status
| varchar(255)
| Whether or not an issue is being indexed, and where it is in the process.
| Legacy field.  Publication date of the item as given in the indicia, but fully spelled out.
| 0=not (white); 1=reserved (red); 2=pending (orange); 3=approved (green).  Colors noted are how the issues appear in various index status grids.
| [[Publication Date]], but superseded by the other publication fields in this table.
|-
|-
| ReserveCheck
| publication_year
| reserve_check
| int(11)
| Not Used
| Year of publication.
| The intention of this field is unknown.
|
|-
|-
| ReserveStatus
| publication_year_inferred
| reserve_status
| boolean
| 1=indexed, pending or reserved; 0=not reserved
| True if the indexer guessed or got the year from somewhere other than the item.
| The intended relationship between this and IndexStatus/index_status is not known. from Jon: "0 = not reserved, 1 = reserved, 2 = reserved for editing. The value 2 has been used to a small extent to reserve books for re-editing or adding data after it's been indexed and approved."
| For use with an undated item, or for a book cover-dated something like "December-January 1950" that came out in early 1950 and would therefore have an inferred first year of 1949.
|-
|-
! colspan="4" | Fields related to cover images:
| publication_second_year
| int(11)
| For use with dates like "December-January 1941-1942"
|
|-
|-
| CoverCheck
| publication_year_inferred
| has_cover
| boolean
| Not Used
| True if the indexer inferred the 2nd year.
| For instance, a book cover-dated "Winter 1941" that came out in late 1941 would have an inferred 2nd year of 1942.
|-
| publication_month_id
| foreign key to '''data_publication_month'''
| The month or vaguely month-like part of the publication date.  May be two months, may be a season, may be a value like "Holiday"
|
|
|-
|-
| CoverCount
| publication_month_modifier
| num_covers
| enum('early', 'mid', 'late')
| Not Used
| Can be applied to the selected month.
| Intended for variants
| This cuts down substantially on the size of the drop down menu of "months".
|-
|-
! colspan="4" | Fields related to the series, including unused cache fields:
| publication_month_inferred
|-
| boolean
| SeriesID
| True if the indexer guessed or got the month from somewhere other than the item.
| series
| Foreign key into the '''series''' table.
|
|
|-
|-
| Bk_Name
| publication_day
| series_name
| int(11)
| Not Used (deprecated)
| Day of the month of the publication date, if any.
|  
|-
| Yr_Began
| year_began
| Not Used
|
|
|-
|-
| Pub_Name
| publication_day_inferred
| publisher_name
| int(11)
| Not Used
| True if the indexer guessed or got the day for somewhere other than the items.  This is not meant to be used for monthlies or other publications that should not have a publication day, only for weeklies or similar items that simply omitted the day from the cover or publication information.
|
|
|-
|-
! colspan="4" | Fields apparently related to flat file management / distribution:
! colspan="4" | Format fields:
|-
|-
| InitDist
| size_id
| initial_distribution
| Foreign key to '''data_size'''
| Not Used
| References a standard size (e.g. U.S. Golden Age) from a pre-defined set.
|
|
|-
|-
| UpdateDist
| height
| update_distribution
| decimal(10,3)
| Not Used
| Height in either inches or centimeters
|
|
|-
|-
| isUpdated
| width
| is_updated
| decimal(10,3)
| Deprecated. Used for tracking updated indexes before processing each distribution.
| Width in either inches or centimeters
|
|
|-
|-
! colspan="4" | Change tracking fields:
| size_in_metric
| boolean
| If true (which is the default), height and width are in centimeters.  If false, they are in inches.|
| This does not affect the standard sizes in the other table, which have their own metric flag.
|-
|-
| created
| interior_paper_id
| created
| Foreign key to '''data_paper'''
| Row creation timestamp.
| Paper stock of the interior of the publication.
|
|
|-
|-
| Modified
| cover_paper_id
| modified
| Foreign key to '''data_paper'''
| Date of last row change.
| Paper stock of the cover of the publication.
|
|
|-
|-
| ModTime
| binding_id
| modification_time
| Foreign key to '''data_binding'''
| Time of last row change (24 hour).
| How the publication is bound, although what that means remains a matter of some debate.
|
|
|-
|-
| LstChang
! colspan="4" | Field relating to sequences:
| last_change
| Date of some sort, but apparently unused?
| Column has numerous values, but they do not line up with the values in the Modified column. NOTE: This field was butchered data-wise and then replaced by the modified/modtime combo.
|-
|-
! colspan="4" | Miscellaneous unused and/or mysterious fields:
| sequence_count
| int(11)
| Number of stories linked to this issue.
|
|-
|-
| rel_year
! colspan="4" | Change tracking fields:
| rel_year
| Not Used
| Intended for released year, caching of keydate year.
|-
|-
| SelfCount
| created
| self_count
| datetime
| Not Used
| Row creation date.
| Historically used for duplcation checking.
|
|-
|-
| SeriesLink
| modified
| series_link
| datetime
| Not Used
| Last modified date.
| Deprecated. Early non-id based series link used to build the initial foreign keys into series.
|
|-
|-
! colspan="4" | Unused fields that would apply only to the cover. Stemming from seq 0 clone.:
|}
 
 
 
{| border="1"
|+ '''data_series_item'''
! DB Column
! Type
! Description
! Usage Notes
|-
|-
| Feature
| id
| cover_feature
| int(11) auto_increment
| Not Used
| DB-generated primary key.
|
|  
|-
|-
| Char_App
| series_id
| cover_characters
| Foreign key into '''data_series'''
| Not Used
|
|
|-
| Script
| cover_script
| Not Used
|
|
|-
|-
| Pencils
| item_id
| cover_pencils
| Foreign key into '''data_item'''
| Not Used
|
|
|-
| Inks
| cover_inks
| Not Used
|
|
|-
|-
| Colors
| sort_code
| cover_colors
| int(11)
| Not Used
| Non-user-visible number that orders items within the series.
|
|-
| Letters
| cover_letters
| Not Used
|
|
|-
| Synopsis
| cover_synopsis
| Not Used
| Yes, I realize cover_synopsis does not make a whole lot of sense.  It's not used anyway.
|-
| Reprints
| cover_reprints
| Not Used
| Arguably this could go in the "whole issue" section and be used to indicates issues that reprint entire other issues instead of doing story-by-story reprint links.
|-
| Title
| cover_title
| Not Used
| Arguably this could go in the "whole issue" section.
|-
| Genre
| cover_genre
| Not Used
| Arguably this could go in the "whole issue" section.
|-
| Type
| cover_type
| Not Used
| Since the type of a cover is "cover", this also doesn't make much sense.  But it doesn't make sense as applied to an entire issue either, as far as I can tell.
|-
| Seq_No
| cover_sequence_number
| Not Used
| Makes even less sense that the rest of these. Stemming from the cloning of the stories table...
|-
|-
|}
|}


=Stories=
=Sequences=
 
The stories table has a number of unused columns that are duplicates of columns from the series and issues tables (they're known to be unused because in some cases the data does not match, and the versions shown on the Lasso site are the ones in the issue and series tables, not these). They were initially used for caching data to avoid expensive joins in the FileMaker days.  Also, sequence 0 (which represents the front cover) is treated specially, with its page count, notes, reprints and editor fields being applied to the whole issue by the current UI implementation.
 


The '''core_sequence''' table is the new name of the old stories table, using a more generic term. 


{| border="1"
{| border="1"
|+ '''stories''' ['''Story''' class in new site]
|+ '''core_sequence'''
! DB Column
! DB Column
! Name in New Code
! Type
! Description
! Description
! Usage Notes
! Usage Notes
|-
|-
! colspan="4" | Core story data fields:
|-
| ID
| id
| id
| int(11) auto_increment
| DB-generated primary key.
| DB-generated primary key.
| Since there is no story page, this field does not show up in any URLs in the Lasso implementation.
|  
|-
|-
| Seq_No
| title
| sequence_number
| varchar(255)
| Order of the story within the issue.
| The story title.  In quotes if taken from the first line of the story.
| By convention, sequence 0 is the front cover, and some of its fields are considered to apply to the whole issue. [[Sequence Number]]
|
|-
| title_inferred
| boolean
| True if the title was chosen by the indexer (either made up or quoted from the script).
| This replaces the use of brackets, which should no longer appear in the title field.
|-
|-
| Feature
| feature
| feature
| varchar(255)
| The name of the feature, if any.
| The name of the feature, if any.
| Often but not always the name of the primary character. Some stories do not have a feature.
| Should match the logo on the splash page, or possibly the bolded text in the intro blurb (for stories that don't get fancy splash pages and logos).
|-
| type_id
| Foreign key to '''data_type'''
| Type of "story", or more accurately sequence.
| Actual comic story vs text story vs ad vs letters page, etc.
|-
|-
| Pg_Cnt
| page_count
| page_count
| varchar(10)
| Number of pages.
| Number of pages.
| [[Pages]]
| [[Page_Count|Page Count]].  Should be decimal but needs migration work.
|-
|-
| Notes
| notes
| notes
|
| mediumtext
|
| General notes about the sequence.
|-
| Reprints
| reprints
| Reprint information (both reprint from and reprinted in).
|
|
|-
|-
| Type
| reprint_notes
| type
| mediumtext
| Type of "story", or more accurately sequence.
| Reprint information that can't be stored as links because the other end can't be put in the database.
| Actual comic story vs text story vs ad vs letters page, etc.
| Most of the original purpose of this field has been superseded by the various reprint join tables.
|-
|-
! colspan="4" | Fields for creator credits:
! colspan="4" | Fields for creator credits:
|-
|-
| Script
| script
| script
|
| mediumtext
| Script, writer, plotter, etc.
|
|
|-
|-
| Pencils
| pencils
| pencils
|
| mediumtext
| Pencils, painted art, layouts, photographs, etc.
|
|
|-
|-
| Inks
| inks
| inks
|
| mediumtext
| Inks, painted art, photographs
|
|
|-
|-
| Colors
| colors
| colors
|
| mediumtext
| Colors, painted art, color photographs
|
|
|-
|-
| Letters
| letters
| letters
|
| mediumtext
| Lettering credit
|
|
|-
|-
| Editing
| editing
| editor
| mediumtext
|
| Editor credit for the sequence.
| Rarely used except for sequence 0, as the editor for the book is typically assumed to have edited the whole book.  However, this field can be used if individual stories have separate editors.
| Rarely used at the sequence level, but particularly useful for anthologies or compilations where individual sequences are drawn from original publications with various editors.
|-
|-
| JobNo
| job_number
| job_number
| varchar(25)
| [[Job Number]]
| [[Job Number]]
|
|
Line 629: Line 685:
! colspan="4" | Fields related to content:
! colspan="4" | Fields related to content:
|-
|-
| Title
| title
|
|
|-
| Char_App
| characters
| characters
| mediumtext
|
|
|
|
|-
|-
| Synopsis
| synopsis
| synopsis
| mediumtext
|
|
|
|
|-
|-
| Genre
| genre
| genre
| varchar(255)
|
|
|
|
|-
! colspan="4" | Field related to flat file management / distribution. See series/issues:
|-
| InitDist
| initial_distribution
| Not Used
|
|-
! colspan="4" | Fields relating to or copied from the '''issues''' table:
|-
| IssueID
| issue
| Foreign key into the '''issues''' table.
|
|-
| Issue
| issue_number
| Not Used
| Earlier indexes may have data in this field, but it is ignored.
|-
| Pub_Date
| publication_date
| Not Used
| Earlier indexes may have data in this field, but it is ignored.
|-
| Key_Date
| key_date
| Not Used
| Earlier indexes may have data in this field, but it is ignored.
|-
| Price
| price
| Not Used
|
|-
| rel_year
| rel_year
| Not Used
|
|-
! colspan="4" | Fields relating to or copied from the '''series''' table:
|-
| SeriesID
| series
| Foreign key into the '''series''' table, duplicated from the '''issues''' table presumably to allow for direct stories <-> series joins.
| Actually, it was used to help troubleshoot and QA the initial building of a relational database way back. Also use for statistics.
|-
| Yr_Began
| year_began
| Not Used
| Earlier indexes may have data in this field, but it is ignored.
|-
| Pub_Name
| publisher_name
| Not Used
| Earlier indexes may have data in this field, but it is ignored.
|-
|-
! colspan="4" | Change tracking fields:
! colspan="4" | Change tracking fields:
|-
|-
| created
| created
| created
| datetime
| Row creation timestamp.
| Row creation date.
|
|
|-
|-
| Modified
| modified
| modified
| Date of last row change.
| datetime
|
| Last modified date.
|-
| ModTime
| modification_time
| Time of last row change (24 hour).
|
|
|-
| LstChang
| last_change
| Date of some sort, but apparently unused?
| Column has numerous values, but they do not line up with the values in the Modified column. Data got butchered through various ins and outs of flat files and between systems changing the date format.
|-
|-
|}
|}
[[Category: GCD Technical]]

Latest revision as of 19:50, 2 January 2012

[This was the direction the schema was going towards at the time of the server crisis in August of 2009. That crisis required a more modest migration of the database, followed by numerous gradual releases which may or may not end up at a release that looks like this schema. The documentation here is largely untouched since the crisis, and therefore quite out of date. It remains a useful reference for concepts that have not yet made it into production.]

This page documents the proposed database schema for the "New Fun" release, i.e. the first release of the new web site that will go into production. It is not set in stone, but is nearing its final form. At some point, we will declare a soft schema freeze, after which only minor changes (such as adding a column that doesn't affect table relationships or major display options) will be allowed.

Projects and Applications

Django organizes projects into applications, which may be more or less independent of one another. The project corresponds to the top-level Python package, with each application receiving a sub-package. The package name appears as a prefix on database tables. The GCD project will use the following applications (with possibly more to be added as needed):

  • data: Holds the data for display. In theory, dumping the data tables should provide all of the information that we wish to distribute publicly.
  • oi: Holds the state of the online indexing system such as reservations, approval queues, indexing credits and space for data to be edited outside of the publicly visible tables.
  • resource: Holds files such as cover scans. Also holds information about linking those scans to the public data, since that linkage is of no interest without the scans. Also holds other files like logo scans or anything else we decide to include. Some of this may be publicly accessible, but unlike the data application that is not the default assumption.
  • account: User account information, holding user preferences and all data that doesn't fit in Django's standard authorization and prefs apps.
  • migration: Space for temporary data and status tracking while we migrate fully to the new schema. This keeps transient junk out of our "proper" data design.

Storage Architecture vs Application Architecture

This version of the schema is the first to dramatically separate the representation of the data in storage (the database) from the application layer (and the UI above it). In addition to the standard database vs UI separations, this is enabled by our decision to use a separate search server to implement search. The server will organize an index of our data tables in a way that optimizes for searches, and all end-user searches will run against that index. This means that we can use a table structure that would otherwise result in very complex search queries without the performance impact of those queries. So when looking at this schema, don't think to yourself "wow that will be hard/expensive to search against", because it won't.

There are several key concepts involved in this separation that you should understand before diving into the table layout. Most notably, these involve dealing with what we have always called series and issues, and how they relate to each other.

Series Classification

One of the most important fields introduced by the New Fun schema is the series classification. This selects exactly one of several very broad categories that supplies a context for interpreting the information associated with the series. For the most part, this field tells the UI (and to a lesser extent, the application) how display things from this series, and what fields to show the indexers when they index or create something in the series. This is our primary tool for cutting down the high number of potential fields and choices into something manageable and relevant for each task at hand.

An important thing to note is that while classification helps us optimize the UI for both display and indexing, all fields are available to all classifications. You will just have to go to some sort of "advanced" option to use fields that are rare or that few indexers prefer to see for the given classification. And some fields will probably always be "advanced" in any classification.

The tentative list of classifications is (names very much subject to change):

Periodical / pamphlet series These are series of things that are generally called "issues" and are more like magazines than books.

Periodical / pamphlet one-shot These are things that look like something from the periodical / pamphlet series classification but are only ever intended to have one physical item within the series. The only difference between this classification and the series is that the UI may do some display optimization that it would *not* do for a series that only has one issue simply because the second one hasn't come out yet. For instance, the indexing UI will never ask for a publishing frequency for a one-shot. If your series lists a publishing frequency other than "published as one-shot", it should be classified as a "periodical pamphlet / series" that was cancelled after the first issue (for instance, Timely's Red Raven Comics should be a "periodical / pamphlet series" despite having only one issue). If this proves too confusing, the distinction will be dropped and all such series will automatically become "periodical / pamphlet series".

Series of books Original books or collections of periodicals, but they look more like books than magazines. Items are probably referred to as "books" or "volumes" or some similar concept. They also quite likely have their own title in addition to some sort of numeric designation.

Single book Original book or collection, but not part of a real series. Again, this is mostly for UI optimization. Note that a book may appear in two series, one in this classification (to catalogue it stand-alone) and another in the "series of books" classification (as part of a larger series). Books in this classification are very likely to have their own titles, and are less likely to have a volume or book number (or need to have "[nn]" displayed). Again, if this proves too confusing, the distinction will be dropped and all such series will automatically become "series of books".

We could take this further and separate books that are collections of periodicals from other books, but that can easily be done later if the concept proves useful. Sticking with the two choice areas that have come up over and over again on the lists (single vs true series, book vs periodical) seems like a good start. There are clear demands for these- quite recently with the book vs periodical distinction, and during the last prototype work when Jochen implemented a view of single-issue series that combined the series and issue pages. I objected at the time but the idea stuck with me and now I think there is a proper framework in which to implement it.

Items vs Issues vs Books

During the debates over handling books, it became extremely confusing to refer to things as "issues". But using "books" is arguably more accurate (as in "comic books"), it is still confusing. In the database, we will use the neutral term item for the smallest physical thing that is tracked in the database. An item is anything you can't split further without undoing its binding (even if that binding is just folding, as with the current Wednesday Comics from DC.

At the application layer, the distinction will be made, based on the classification, whether a given item is an issue, a book, or some other term we add in the future. As an item can be in more than one series, it can potentially be classified in different ways (although making something both an issue and a book is probably an error, but it's hard to predict for certain).

What's a Descriptor?

We all know that the term "issue number" is problematic. They neither universally apply to "issues", nor are they all "numbers". As has been discussed ad nauseum. Sometimes they don't even serve as faux-numbers, but behave more like names. And then there are volume numbers. Which are sometimes Roman numerals since apparently the publishing industry didn't get the memo about these great new numbers Europe nicked from the Hindus by way of the Arabs and Persians. And some volume numbers are probably completely non-numeric. And then there are people who object to calling them volume numbers and say they should be book numbers. Whether that is true depends on the series in question and whether you find the term "volume" pretentious or not, apparently. Whatever, it's all a horrible mess. Add to it the fact that these things can contradict each other depending on where they appear on the item, and you have at least six fields, three for each of the concepts (indicia, cover, and inferred a.k.a. indexer-assigned). And someone came up with some example with three different sorts of numbers. Or at least two plus a name.

So... what to do about this? At one point, I proposed calling the issue-level ones labels, but that term is already needed for certain UI-related things, and would just add more confusion on the technical side. And it didn't solve the larger problems of how to organize the fields clearly for the various puproses either. Tossing in an issue name/title field brings the number of fields up to seven, eight or nine, depending on how you look at it.

So I decided to abandon the basic notion of a fixed set of fields, and allow us to link in whatever is needed to the item. As with "item", I needed a neutral term for this. Abandoning all pretense of human comprehensibility, I appropriated the techie term "descriptor" (as in "file descriptor", if you must know). It is accurate, as these things describe the item.

Descriptor Properities

Descriptors have a few properties.

  • They have a label, by which I mean a term that a normal human would use by sticking it in front of the actual descriptor. Like "number", or "volume" or "book". Or "title" which you wouldn't actually say, but would make sense as a label in the UI or the indexing form.
  • They may apply to either a single item, or to groups of items.
    • The group may be pathological in that it only has one item in it, but the intention is usually clear from the label. Or in some cases, label plus classification (i.e. "volume" applies to groups in periodicals, but single items in series of books).

Another aspect of descriptors is that unlike many other things that are factored into their own table, we're mostly not trying to ensure that there's only one row per value and all uses link to that row. For instance, the number "1" for the issue of one series is not meaningfully the same as the number "1" for another series. The only reason to store them both as the same row is to reduce the storage and table scan time. But as far as the application is concerned, descriptor's aren't really shared.

Naming and Design Conventions

Formatting

A few notes on database table and column names vs code class and variable names:

Table names are lower case with underscores. Model class names are capitalized with no spaces or underscores. It should be obvious which model fits which table, as the words in the names should be the same. Column and variable names are all lower case with underscores, and should match exactly except for foreign keys.

All foreign keys end in _id in the database, but do not have that suffix in the code. So wherever you see something like issue_id in this table, expect to see issue in the code. This is because in the code you get an object, not just an ID.

Tables are prefixed with an application name, which does not appear in the model class name. The main app name is currently usually core, which is arguably a lousy name. I'm contemplating changing it to data but I'm not sure about that. It can't be gcd because that is the name of the overall project. Currently, the only other application is migration, which is for tables that exist only to assist in migration of data from old to new forms.

Meaningful Terms

Many of the names in the database and/or code attempt to follow some sort of systematic convention. Here are some notable ones:

_inferred Many fields will have another field with the same name plus the _inferred suffix. This other field will be a boolean that tells you whether the value of the primary field was really present on the item at hand (false) or was inferred (or just made up) by the indexer (true). In the old schema, this was mostly handled by enclosing the value in brackets, which should no longer be done. However, the UI may continue to display inferred values by enclosing them in brackets. Note that "[nn]" or "nn" for the issue number should not appear at all, even with an inferred boolean- see the section about item descriptors to understand why this is and how the display will still be able to show "[nn]" when appropriate.

_uncertain Analagous to _inferred, but indicates that the value is a guess of some sort. Both flags may be set on the same value.

sort_code always refers to a field that is *not* shown to the user, but is instead used behind the scenes to order things. In the UI, rather than viewing and entering numbers, you will use a more intuitive system such as specifying that an item follows another item, or by dragging and dropping items in a list. As auto-generated numbers, they may not make much sense if you look at them raw, except that they *will* define a clear ordering.

_links is a suffix that shows up in the code when many-to-many relationships are involved, which is very common in the new schema. In general, the set of relationship objects (rows in the join table containing data about the relationship rather than about either related object) will end in _links. This will help distinguish these sets from either set of related objects on the ends of the relationship.

label is used for fields that describe other fields. The value of label fields is generally intended to be used by the UI to label a field that does not always have an obvious name. This is hard to explain without an example, so look through the tables to see how this is used. This did not end up being a term that replaces "issue number", although that is one area where a label field is used.

Proxy Models

Typically, each database table corresponds to one model class in the code. However, several of our tables, instead of holding homogenous data, hold rows that may fall into one or more subtypes that are treated somewhat differently in the applications. Publishers and Items (on which more is explained below) come to mind. These are dealt with by defining Proxy Models, which are classes that provide different views of existing tables. In particular, they are used to make models that act like they can only access a subset of the table that is relevant for a particular application-level concept.

Indexes and Keys

Don't read much into the presence or absence of indexes in the current table definitions. Aside from all foreign keys having indexes, the presence or absence of an index for a column is mostly just left over from the old schema, plus a few nearly arbitrary decisions made as I worked on the tables. The reason for this is that we'll be using an external search server, so many columns that would normally need an index may not- we'll have to see how the search server works. Once we know that, and have a better feel for what sort of queries will actually be run, we will revisit the indexes and also consider multi-column indexes.

Table Documentation

Note: This documentation does not cover every table. It focuses on the more complex new areas of the schema. Hopefully we'll get the whole thing documented as we go along, but for the definitive schema, see the files in subversion. Email the gcd-tech list if you have questions or would like to help write documentation.

Publishers

The Publisher table is unusual in that it stores several distinct types of entities, all of which fall under the general umbrella of "Publisher". Several of them may play multiple roles, which is the primary reason for keeping them in one table. Each type of publishing entity is represented by a proxy model in the Django code. Except as noted, publishers are connected to each other via the PublisherRelationship table described further down. Before going through the fields, here is the list of proxy models and their definitions:

MasterPublisher

This is the legacy notion of a top-level publisher, linked at the series. Defined by setting is_master = 1. The nature and future role of master publishers is very much a matter of debate.

LegacyImprint

This is the legacy notion of an imprint, effectively a hodge-podge of anything and everything that did not fit as a MasterPublisher. These are linked at the series level, and will be retained only until they can be migrated to the other fields, after which they will be dropped from the application. Something is a legacy imprint if it has a relationship of type LEGACY with a MasterPublisher.

PublishingCompany

An actual corporate entity of some sort involved in publication. These are linked at the item level and defined as having is_company = 1.

Brand

A marketing brand involving a particular logo or set of closely related logos. These are linked at the item level and defined as having is_brand = 1.

Distributor

A company that distributes issues to retailers. Technically not a publisher, but the same fields apply and several publishers have served as their own distributors. These are linked at the item level and are defined as having is_distributor = 1.

The Publisher Table

data_publisher
Column Type Description Usage Notes
Core publisher data fields:
id int(11) auto_increment DB-generated primary key.
name varchar(255) NOT NULL Name of the publishing entity. Different for each type. Master publishers use names chosen for easy recognition. Brand names are based on the logos that denote them. Publishing company names should be exactly as found in the indicia that reference them (minor variations, such as "Co." vs "Company" may be harmonized). Imprints have arbitrarily complicated garbage for names that mash together all of the above. Don't even try to figure it out.
country_id foreign key to data_country The country from which the publisher operates.
year_began year Year (by cover date) of first publication.
year_ended year Year (by cover date) of last publication, if any.
url varchar(255) The URL for the publisher's website. Should be a complete URL, including the protocol (i.e. "http://")
notes longtext Space for arbitrary notes.
Type flags. A row may have more than one set to true.
is_master boolean Defines which rows appear in the MasterPublisher proxy model.
is_company boolean Defines which rows appear in the PublishingCompany proxy model.
is_brand boolean Defines which rows appear in the Brand proxy model.
is_distributor boolean Defines which rows appear in the Distributor proxy model.
Change tracking fields:
series_count int(11) The number of series associated with the publishing entity. This may have to go due to possible confusion over using a row as more than one type of publishing entity (for instance, both a master publisher and a brand).
item_count int(11) The number of items associated with the publishing entity. This may have to go due to possible confusion over using a row as more than one type of publishing entity (for instance, both a master publisher and a brand).
Change tracking fields:
created datetime Row creation date and time. Previously NULL or zero entries are now set to January 1, 1901.
modified datetime Last modified date and time. Previously NULL or zero entries are now set equal to "created".

Series

Series is fairly straightforward, except that series names are held in a separate table to allow multiple names to be assigned to a given series. See the code for details.

data_series_name
Column Type Description Usage Notes
id int(11) auto_increment DB-generated primary key.
value varchar(255) NOT NULL A name used for a series. The name may come from the cover, the indicia, common usage or any other source.
source_id Foreign key to 'data_source Explains where (indicia, cover, gcd-assigned, etc.) the name comes from.
is_primary boolean Exactly one name per series should have this set, indicating that this is the most important name. This is the name that will be put at the top of the series' page, and will be used in display formats where only one name can be shown (and it's not a search result that matched one of the non-primary names).
data_series
Column Type Description Usage Notes
Core series data fields:
id int(11) auto_increment DB-generated primary key.
year_began int(11) Year (by cover date) of first publication.
year_ended int(11) Year (by cover date) of last publication, if any.
classification_id foreign key to data_classification Used to guide the application and UI layer as to how best to display or index items in this series.
language_id foreign key to data_country The language in which the series is written. Currently there is no facility to record multi-lingual comics.
format varchar(255) legacy format field This field will be retained only until its contents have been migrated to the issue and sequence-level format fields. It's contents are a mess of at least five different sorts of data (including size, binding, paper stock, coloration and frequency of publication).
notes mediumtext Space for arbitrary notes.
tracking_notes mediumtext Notes about how a given series is continued in or continued from another series. Largely replaced by the data_series_relationship table, after most data is migrated this field will only be used for tracking notes where one end of the relationship is not and cannot be in the database (tracking from a newspaper or web comic, for instance).
has_gallery tinyint(1) True if there's a cover gallery to display for this series.
Change tracking fields:
created datetime Row creation date.
modified datetime Last modified date.

Items (a.k.a. Issues)

As recent discussions delved more into the problems created by trying to treat books as issues, the terminology got quite confusing. For that reason, and to better reflect the requirements of the project, I've renamed the "issues" to "items" at the database level. Of course, they will still be "issues" (or in some cases "books") in the UI. In general, this section has a lot of techy terminology that will not be at all visible in the UI, or even in the middle layers of the web application. This is intentional, with the aim to break faulty associations between objects as we conceive of them visually and the database-level storage requirements. Within the web application there will be classes such as "Issue" that make sense out of this very complex sub-area of the schema.

For instance, it's difficult to see from looking at the next few tables how to figure out something as conceptually simple as "what is the number of this issue?" However, in the typical case an indexer will just enter the issue number into a very obvious issue number field, or edit it there if they need to change something later. The point being, unless you really care about database design, don't get hung up on how much this doesn't look anything like a comic book issue anymore. Unless you squint really hard.

item_item_descriptor
DB Column Type Description Usage Notes
id int(11) auto_increment DB-generated primary key.
series_item_id foreign key to data_series_item The item in a particular series to which this descriptor applies.
scope enum('item', 'group') Indicates whether this descriptor applies to one item or a group of items. A single item is typically an issue or a book, or more technically, one row in the data_item table. A group is a set of items within the series (possibly the entire series, and possibly a set with only one member, but at least theoretically multiple items). Volume numbers in periodicals are the most typical example of descriptors at group scope. Indexers will never directly select or see this field as it will be handled behind the scenes. But it will, for instance, determine whether the UI should display a descriptor of type "volume" as the main number for an item (as in a trade paperback collection) or as a prefix to the main item (as in a a periodical issue).
source_id Foreign key to data_source Where the descriptor originated. Typical values include indicia, cover, title page, spine, etc. as well as "inferred", meaning the indexer made it up or got it from somewhere other than the actual item (long standing fan convention, for instance).
label_id Foreign key to data_descriptor_label The display label for this descriptor, such as "issue" or "volume".
value varchar(255) The actual issue/volume/book/ISBN/ISSN/etc. number/name/title/whatever. Note that the special no-number value "[nn]" should never be used (nor should the older "nn"). An item with no number should simply not have an entry in this table, or possibly should have an entry with a NULL value (still working on that). Or should only have an entry that is marked as "inferred", meaning that it was assigned by the indexer and does not appear on the item. The UI will translate this condition to "[nn]" or whatever else is appropriate.


data_descriptor_label
DB Column Type Description Usage Notes
id int(11) auto_increment DB-generated primary key.
name varchar(255) The human-readable display name for this kind of descriptor. Note that unlike most literal data, these labels may not all be in English and therefore may have unusual translation rules. The exact strategy here is still TBD, but part of the need for this feature is dealing with different names in various countries that don't directly translate outside of that country/language.
for_item boolean If true, may be used at the item level. i.e. descriptors where scope='item'
for_group boolean If true may be used at the group level. i.e. descriptors where scope='group'


data_source
DB Column Type Description Usage Notes
id int(11) auto_increment DB-generated primary key.
name varchar(255) The human-readable display name for the source location.
is_inferred boolean If true, the descriptor marked with this source did not come from the item at hand, and should be displayed accordingly.


data_item
DB Column Type Description Usage Notes
id int(11) auto_increment DB-generated primary key.
Basic item information
price varchar(255) Legacy price field. A decimal price followed by the ISO code, i.e. 0.10 USD See Cover Price, but superseded by data_item_price
page_count varchar(10) Page count including covers but not inserts or dust jackets. Should be decimal, but needs migration work.
editing mediumtext Editor for the item as a whole.
notes mediumtext Notes for the item.
Date fields:
publication_date varchar(255) Legacy field. Publication date of the item as given in the indicia, but fully spelled out. Publication Date, but superseded by the other publication fields in this table.
publication_year int(11) Year of publication.
publication_year_inferred boolean True if the indexer guessed or got the year from somewhere other than the item. For use with an undated item, or for a book cover-dated something like "December-January 1950" that came out in early 1950 and would therefore have an inferred first year of 1949.
publication_second_year int(11) For use with dates like "December-January 1941-1942"
publication_year_inferred boolean True if the indexer inferred the 2nd year. For instance, a book cover-dated "Winter 1941" that came out in late 1941 would have an inferred 2nd year of 1942.
publication_month_id foreign key to data_publication_month The month or vaguely month-like part of the publication date. May be two months, may be a season, may be a value like "Holiday"
publication_month_modifier enum('early', 'mid', 'late') Can be applied to the selected month. This cuts down substantially on the size of the drop down menu of "months".
publication_month_inferred boolean True if the indexer guessed or got the month from somewhere other than the item.
publication_day int(11) Day of the month of the publication date, if any.
publication_day_inferred int(11) True if the indexer guessed or got the day for somewhere other than the items. This is not meant to be used for monthlies or other publications that should not have a publication day, only for weeklies or similar items that simply omitted the day from the cover or publication information.
Format fields:
size_id Foreign key to data_size References a standard size (e.g. U.S. Golden Age) from a pre-defined set.
height decimal(10,3) Height in either inches or centimeters
width decimal(10,3) Width in either inches or centimeters
size_in_metric boolean This does not affect the standard sizes in the other table, which have their own metric flag.
interior_paper_id Foreign key to data_paper Paper stock of the interior of the publication.
cover_paper_id Foreign key to data_paper Paper stock of the cover of the publication.
binding_id Foreign key to data_binding How the publication is bound, although what that means remains a matter of some debate.
Field relating to sequences:
sequence_count int(11) Number of stories linked to this issue.
Change tracking fields:
created datetime Row creation date.
modified datetime Last modified date.


data_series_item
DB Column Type Description Usage Notes
id int(11) auto_increment DB-generated primary key.
series_id Foreign key into data_series
item_id Foreign key into data_item
sort_code int(11) Non-user-visible number that orders items within the series.

Sequences

The core_sequence table is the new name of the old stories table, using a more generic term.

core_sequence
DB Column Type Description Usage Notes
id int(11) auto_increment DB-generated primary key.
title varchar(255) The story title. In quotes if taken from the first line of the story.
title_inferred boolean True if the title was chosen by the indexer (either made up or quoted from the script). This replaces the use of brackets, which should no longer appear in the title field.
feature varchar(255) The name of the feature, if any. Should match the logo on the splash page, or possibly the bolded text in the intro blurb (for stories that don't get fancy splash pages and logos).
type_id Foreign key to data_type Type of "story", or more accurately sequence. Actual comic story vs text story vs ad vs letters page, etc.
page_count varchar(10) Number of pages. Page Count. Should be decimal but needs migration work.
notes mediumtext General notes about the sequence.
reprint_notes mediumtext Reprint information that can't be stored as links because the other end can't be put in the database. Most of the original purpose of this field has been superseded by the various reprint join tables.
Fields for creator credits:
script mediumtext Script, writer, plotter, etc.
pencils mediumtext Pencils, painted art, layouts, photographs, etc.
inks mediumtext Inks, painted art, photographs
colors mediumtext Colors, painted art, color photographs
letters mediumtext Lettering credit
editing mediumtext Editor credit for the sequence. Rarely used at the sequence level, but particularly useful for anthologies or compilations where individual sequences are drawn from original publications with various editors.
job_number varchar(25) Job Number
Fields related to content:
characters mediumtext
synopsis mediumtext
genre varchar(255)
Change tracking fields:
created datetime Row creation date.
modified datetime Last modified date.