Format Field Refactoring

Background

In Nov, 2012 the GCD-Policy list agreed by consensus to split the current series level Format field into it's currently defined five fields.

Format currently is a semi-colon delimited format specified as "[color]; [dimensions]; [paper stock]; [binding]; [publishing format]" however since this is a free text field which has been manually maintained the refactoring from a single text field into five individual fields has a number of challenges, which include:

Variable number of fields: For many series there are more or less that five fields, some have a single value, others have as many as eleven. In some cases other delimiters were used (eg: commas or spaces) and in other cases semi-colons were used to further sub-divide a field (eg: "newsstand: magazine (#1-10: 27.5 cm high; 20.7 cm wide; #11-25: 26.7 cm high; 20.4 cm wide" is a single dimension value but is four semi-colon delimited fields)
Missing fields: not all series have all five fields defined
Ordering of fields: the first fields is not always color etc
Typos: many, many typos in fields
Data Inconsistency: Many ways of entering the same data (eg: US, U.S., us or saddle-stitch, saddle-stitched, saddlestitched)
Extraneous data: Data is contained in the Format field which doesn't belong in any of the five defined fields

The tech team is moving ahead with implementing the new separate data fields while in parallel the policy list is determine how these new fields should be defined.

There are a number of approaches to populating the new fields:

Do nothing: make the old format field read-only and individuals will populate the new fields manually over time. The challenge is that due to the large number of series and limited number of indexers this becomes a special project and could take years to complete.
Blindly auto-assign: a simple script could parse the data using the semi-colon delimiter and auto-populate the five new fields. The challenge is that due to the data inconsistency problems identified above there are going to be a lot of bad values
Intelligently auto-assign: create a more complex script to parse the data and populate the new fields. This is what I'm recommending.

Design

My design goal is fairly straight forward, to take the existing data in the format field, make some simple corrections and then populate the new fields. Since the full definition of the new fields is still being defined I'm not attempting to fully define the new fields but to populate them with their existing values. Doing this now before the new fields are finalized serves to establish a bit of a baseline that should make it easier to move to the newer definition when defined but also it provides data on what occupies these existing fields today to inform the disussion. Eventually I'd expect that the ultimate definition of these new fields will be a narrowly defined set of pre-determined values with an escape value to allow some special cases. Another migration script could be written to eventually move to that new format.

The final version of this script should be similar to those in /apps/gcd/migration/ however the initial approach is a brute force perl script working off a current MySQL data dump to determine the necessary text transformations and categorizations needed. If we are willing to modify the mySQL data structures directly then the resulting data could be manually imported as a one time action, however I understand this isn't the recommended process.

The design is fairly straightforward and brute force:

traverse the series table
- extract the format field data
- separate the format field data into individual fields based on the semi-colon delimiter
  - TODO: additionally determine if the resulting fields need to be further sub-divided based on a secondary delimiter
- for each resulting field
  - correct any typos in the field
  - normalize any special values as defined
  - categorize the fields as one of the new five fields
    - If there is extraneous data mark that to be appended to the notes field
    - I'm still evaluating if there is any data which should be discarded
  - if there are multiple values for a single field, then concatenate them into a single field with a comma delimiter
- output the resulting new series table values:
  - five new field values, some of which may still be empty
  - updated notes field as necessary
  - currently this is just a text file dump

while parsing the data record some additional metrics:
- total number of series processed
- total number of unique fields found
- how the data is categorized, eg: how many different color, paper, dimensions, binding, pubformat values are there
- how many series have a value for each of the five fields

Considerations for new Field rules

Just a few thoughts on items that need to be considered when defining the new field descriptions

General
- Language - GCD needs a general i18n policy as most text fields have multiple language data values which is a general site/data challenge. However this also applies to English language variations eg: color vs colour.
- handling series variants or changes in format. lots of notes in the data about specific issues only
Color
- delimiters for multiple colors
- how to encode "black and white" and "black and white with one additional color" (eg: black, white, red vs b&w with red etc)
- what is a color comic? - two color vs four color vs full color vs standard color etc
Paper
- using the word "paper" in the field eg: "baxter" vs "baxter paper"
- cardboard vs cardstock
Dimensions
- use of units (inches vs cm vs std page sizes (A4))
- separate fields for "length vs width" vs single text field and all the variations on "x" or "by" and how to describe the value (length, width, len, high, tall etc)
Binding

Pubformat

Rules

Typos

simple replacement rules to fix typos
- there are some transformations here, eg: dust cover to dustjacket that I'll clean up

"bi-montly" --> "bi-monthly" "book shelf" --> "bookshelf" "(carboard|carbook|cardtock|cardtsock|carstock|carstock)" --> "cardboard" "(clotbound|clothsbound|colthsbound)" --> "clothbound" "collecton" --> "collection" "coloes" --> "color " "colot" --> "color" "(diimensions|dmensions|dimentions)" --> "dimensions" "delux " --> "deluxe " "digestformaat" --> "digest format" "(duckjacket|dust-jacket|dustjacket|dustjackets|dust cover)" --> "dust jacket" "exterioir" --> "exterior" "fomat" --> "format" "(interiior |interio |interor )" --> "interior " "(glosy|gossy)" --> "glossy" "graphiv" --> "graphic" "isue " --> "issue " "jucket" --> "jacket" "limtied " --> "limited " "limuted" --> "limited" "(magazineformaat|magazinformat)" --> "magazine format" " matt " --> " matte" "monochroom" --> "monochrome" "monthy " --> "monthly " "newpaper" --> "newspaper" "newsprin " --> "newsprint " "(newprint|news print|newspirnt|newspront)" --> "newsprint" "newstand" --> "newsstand" "(papoerback|paper back|papreback|paperpack|paprback)" --> "paperback" "occassional " --> "occasional " "(on-shot|one- shot|one--shot|one=shot|one shot.*)" --> "one-shot" "pocketbok" --> "pocketbook" "(quartery|quaterly)" --> "quarterly" "atandard" --> "standard" "standar " --> "standard " "twleve" --> "twelve" "vairant" --> "variant" "varioius" --> "various"

Transformations

Most of these transformation rules are fairly basic and fix a few typos along the way

use the " (0x22) for all representations of inch

"\s*inches" --> "\"" "\s*inch" --> "\"" "\s*imches" --> "\"" " inces" --> "\"" "(\'\'|\x92+|\x94)" --> """

use "cm" for all centimeter values

"centimetres" --> "cm" " cms." --> "cm"

standardize on "U.S."

"U\s*S\s*" --> "U.S."

standardize on "ongoing"

"on-going" --> "ongoing" "^ngoing" --> "on-going" "(on going|on-goin)" --> "on-going"

standardize on "color" - which goes against my Canadian education but color is more prevalent in the data
- I'm going to remove this rule so that both colour and color are handled in my next update

"(colour|clour)" --> "color"

the rest are fairly self-explanatory. All of the variations on standard or saddle-stitched where amazing!

"miniseries" --> "mini-series"

"approx\.* " --> "approximately " "aproximately " --> "approximately"

"baxter color comic" --> "baxter paper color comic" "baxter paper comic\." --> "baxter paper color comic" "baxter paperstock" --> "baxter paper" "baxter stock paper" --> "baxter paper"

"(back and white|blac \& white|black \& whte|black and whiter|black and whtie|black\s*\&\s*white|black-and-white|black-white|blcak \& white|balck \& white)" --> "black and white" " lack and white" --> " black and white" "black and whit " --> "black and white " "black \& wite with color covers" --> "black and white with color covers" "black an white interior pages" --> "black and white interior pages" "black\,red" --> "black\, red" "blac\, white\, and red" --> "black\, white and red"

"greytones" --> "grey tones"

"oneshot" --> "one-shot"

"card\sstock" --> "card\-stock"

"(collect edition|collected edition\.|collected edtion)" --> "collected edition"

"colorcover" --> "color cover"

"(full-color |fulll color )" --> "full color " "full colo glossyr" --> "full color glossy"

"cover \& interior" --> "cover and interior"

"squarebond" --> "squarebound"

"slipcase" --> "slip case"

"(harccover|harcover|hardover)" --> "hardcover" "hardcover,b" --> "hardcover, b" "hardcover,d" --> "hardcover, d"

"paperstock" --> "paper-stock"

"(min-series|miniseries|miniseries\.)" --> "mini-series"

"mangaformaat" --> "manga format"

"maxiseries" --> "maxi-series"

"(minicomics|minicomic)" --> "mini-comic"

"(prestigeformat|pestige format|prestiege format)" --> "prestige format"

"(pefectbound|perfect binding)" --> "perfect bound"

"(printed on one side of page only|printed on onse side of each page only.)" --> "printed on one side of each page only)"

"(published erratically|published infrequently|published irregularily)" --> "published irregularly"

"(sofftcover|soft-cover|soft covers|softover)" --> "soft cover"

"(usa size|usa suze|ussize)" --> "U.S. size"

"(trade paprback|trade paprback|tradepaper back)" --> "tradepaperback"

"web\s*comic" --> "web-comic"

Categorization

These rules are used to determine which of the resulting fields should be populated. The regex values are very brute force and will be cleaned up and optimized down the road. Whether I created a new rule or an OR condition in an existing rule for a word match was very arbitrary and really was just dependent on when I found the case. Most of the rules are just looking for code word matches and occasionally check for location in the field.

Color

Dimensions

A bunch of rules to try to check for numbers and units

/cm\.*$/ /(cm|mm|in.+|\")\s*(x|by|\xD7)/ /(wide|high|h)\s*(x|by|\xD7)/ /cm (long|wide)/ /\d+x\d+mm$/ /^\(*\d+[\.\,\']*\d*\s*(x|by|\xD7)/ /^\(*\d+\s*\d+\/*\d*\s*(x|by|\xD7)/ /\d+\s*\d*(\.|\/)*\d*(\"|\')*\s*\x/ /\d+(\s|\-)\d\/\d/

Paper

Binding

Pub Format

Notes

Discard

Data that is garbage and is discarded.

none yet

Other

No additional special rules yet but I expect there to be a few

04 DEC 12 Status

First report back to the tech list. I plan on doing this weekly and hopefully wrapping up this project before the end of the year.

My goal with this first update is to get some feedback to make sure that this process is going to be useful to the group before I dive back in for round 2.

Current Status

Using the 01-Dec-2012 MySQL datadump I created the first iteration of this brute force script using perl
The challenge really isn't the programming is the data evaluation and hence I used perl out of convenience. The current version of the script is very brute force and is simply a set of string transforms and checks. I'll clean this up down the road.
The set of transforms and categorization checks are listed above
I set a goal to get to < 2000 unknowns and stopped there for this week. In reviewing the results there are still a number of errors in the rules which I need to go back and fix
There are two major remaining issues to address:
- handling the further subdividing of a field based on a secondary delimiter
- tightening the categorization checks which are currently execution order specific and as a result some fields which match multiple rules are going in the wrong final bucket. I just need to tighten the rules
I've avoided many of the foreign language entries but will tackle them next
I also want to double check that I'd not inadvertently changing character sets for the foreign language encoding

Current Stats

Metric	Value	Notes
Total number of series	67244
# of series with N fields based only on semi-colon delimiter
1	6919
2	6964
3	6624
4	18145
5	18316
6	8096
7	1922
8	226
9	15
10	12
11	2
12	3
Total number of unique fields	11187
Field category count distribution
Unknown	1919
Color	1237
Dimensions	5436
Paper	739
Binding	892
PubFormat	756
Notes	208
Series Field Count Distribution
Unknown	20380
Color	40614
Dimensions	43852
Paper	12775
Binding	38280
PubFormat	21980
Notes	626

Next Steps

continue to plow through the data with another update posted next weekend 09 Dec 2012

Current Data

121204 Format Refactoring.zip - all the data is in a single zip file for convenience. Hopefully there won't be any access issues as I hosted this on my google drive account.
- six separate text files with all of the individual values for each field: fields-binding.txt, fields-color.txt, fields-dimensions.txt, fields-notes.txt, fields-paper.txt, fields-pubformat.txt, fields-unknown.txt
- fields-log.txt - text file showing how all field values were transformed either to fix a typo or to normalize the data
- fields-stats.txt - text file with the stats above
- format.txt - tab separated text file containing
  - gcd_series.id - unique series id from gcd_series table
  - gcd_series.sort_name - use sort_name from gcd_series table
  - gcd_series.format - original format field data from gcd_series table
  - color - new data values
  - dimensions - new data values
  - paper - new data values
  - binding - new data values
  - pubformat - new data values
  - notes - new data value really this is just the data that will be appended to the current notes value. I prefaced each of the uncategorized fields with TODO:
- gcd-format.pl - the really ugly perl script that does the work

Format Field Refactoring

Contents

Background

Design

Considerations for new Field rules

Rules

Typos

Transformations

Categorization

Color

Dimensions

Paper

Binding

Pub Format

Notes

Discard

Other

04 DEC 12 Status

Current Status

Current Stats

Next Steps

Current Data

Navigation menu

Format Field Refactoring

Background

Design

Considerations for new Field rules

Rules

Typos

Transformations

Categorization

Color

Dimensions

Paper

Binding

Pub Format

Notes

Discard

Other

04 DEC 12 Status

Current Status

Current Stats

Next Steps

Current Data

Navigation menu

Search