For this project, you will be creating a music database along the lines of MusicBrainz (musicbrainz.org) or Discogs (www.discogs.com). These databases track an enormous amount of metadata on commercial and non-commercial music releases on various media. Your database will be considerably simpler, focusing just on albums (generically, ignoring media-specific variations), artists, and labels.
To see the data you will be working with, take a look at the table public.project6 in the course database. You will see there is a considerable amount of data on various artists and their albums, but of course, it is just a drop in the bucket compared to what the above websites manage. (The data, by the way, comes from the MusicBrainz database. I performed queries on their database to gather a limited dataset in the format I wanted, so any data errors should likely be ascribed to my querying and not to the MusicBrainz project.)
This project is similar to project 3 in that you are asked to create some tables and populate them with data from another table. However, the emphasis in project 3 was on basic SQL with a focus on SELECT queries. In this project the emphasis is on schema creation and data migration, and the fine details are left up to you. You are provided with an entity-relationship diagram (ERD) and asked to create a schema based on the ERD. Finally, you need to extract data for your schema from the raw data provided.
You will be asked to provide several SQL queries, not so much to test your skill at constructing SELECT queries (although practice is always good!), but to better illuminate for us your database schema.
This project requires the creation of SQL DDL to produce the required schema, and SQL queries to populate the schema tables from the project data. All of the tasks below should be done strictly via queries executed on the database, e.g., using psql or SQuirreL. That is, although you could query the database and do some of the tasks within a client programming environment, you are required for this project to do all of the work on the database (server) side, using pure SQL. The deliverable for this project is a SQL script (a text file containing your queries - traditionally with a .sql extension) that constructs, populates, and tests your schema. However, unlike previous projects, you may use multiple queries to achieve the end goals - you may even create intermediate tables or views, if you wish.
Build your SQL script carefully and test as you go. You can create a single script which drops tables and recreates them (note you can use something like DROP TABLE IF EXISTS foo CASCADE; in your script. This will drop foo and any objects dependent on it, but won't give an error if foo doesn't currently exist).
Please comment your SQL script (anything between /* and */ will be ignored by the SQL processor) so that we can figure out what is happening if something goes wrong when we test!
See the end of this document for the ERD and detailed information about the data items you will be working with. Your job is to turn this ERD into a database schema, complete with tables, primary keys, foreign keys, and anything else you feel is necessary. There is no one correct answer here; an ERD can give rise to different schemas, although some similar structure is expected. Please follow the algorithm presented in your book and in class to convert the ERD to a schema as closely as possible.
As presented, the ERD suggests you should use ID attributes as primary keys for albums, artists, and labels. This is generally good practice, since (a) natural key values may not exist due to the possibility of non-unique names (e.g., more than one artist or album of the same name) and (b) it can be cumbersome and costly in terms of space to use complex or text keys as foreign keys. Typically ID fields are just integers. In the PostgreSQL database, the easiest way to generate appropriate unique values in these fields is to declare them as type "serial". When you insert into the tables, insert into all columns except the ID column; the ID values will be generated sequentially by the database on your behalf. (Note that when you reference these columns using a foreign key in another table, the foreign key column should just be of type "integer", not "serial".)
You will need to decide what to name your tables and attributes in the tables, although naming them the same as the entities and attributes in the ERD is not a bad first choice. You will have to add additional attributes for foreign keys; generally these should be named like "foreignentity_id", where you are referencing the ID attribute of the foreign entity. You can name things any way you like, as long as they are semantically reasonable.
You next need to work out how to extract the data you need in your tables from the table public.project6. Take some time to explore this table and its data. This table contains a very flattened (de-normalized - what you get when you take a normalized database and join everything together into one big table) view of data about albums and artists, so there is a lot of redundancy. Working out how to eliminate the redundancy appropriately (hint: DISTINCT) and populate your tables correctly will be a challenge.
Here are some hints to get you started:
At the end of your SQL script, please include queries to answer each of the following questions. The purpose here isn't so much to test your SELECT skills as to let us see how your database is structured. These queries may also be a good test of your database (as well as of your skills!), so you may want to construct these as you do part II above and use them to test portions of your database as they come online.
Together with the questions below, I'm showing the answers I received when I went through this assignment. If your answer does not match mine, don't immediately assume it is an error (in your approach or mine), but spend a little time trying to work out what might cause the difference.
Questions and answers:
name | begin_year | end_year ------------------------+------------+---------- John Entwistle | 1964 | 2002 Keith Moon | 1964 | 1978 Pete Townshend | 1964 | Roger Daltrey | 1964 | Kenney Jones | 1978 | 1985 John "Rabbit" Bundrick | 1979 | 2011 (6 rows)
name ---------------- Punch Brothers Nickel Creek (2 rows)
title | year | label -------------------------------------+------+-------------- Talking Heads: 77 | 1977 | Philips More Songs About Buildings and Food | 1978 | Sire Records Remain in Light | 1980 | Sire Records Speaking in Tongues | 1983 | Sire Records Little Creatures | 1985 | Sire Records True Stories | 1986 | EMI Naked | 1988 | Fly (7 rows)
title | year | artist | label -------------------------------------+------+----------------+-------------------- Nickel Creek | 2000 | Nickel Creek | Sugar Hill Records This Side | 2002 | Nickel Creek | Sugar Hill Records Into the Cauldron | 2003 | Chris Thile | Sugar Hill Records Why Should the Fire Die? | 2005 | Nickel Creek | Sugar Hill Records How to Grow a Woman From the Ground | 2006 | Chris Thile | Sugar Hill Records Punch | 2008 | Punch Brothers | Nonesuch A Dotted Line | 2014 | Nickel Creek | Nonesuch (7 rows)
name | title | year ----------------+------------------------+------ Massive Attack | Blue Lines | 1991 Underworld | Dubnobasswithmyheadman | 1993 Massive Attack | Protection | 1994 Fluke | Risotto | 1997 Massive Attack | Mezzanine | 1998 Amon Tobin | Supermodified | 2000 Massive Attack | 100th Window | 2003 (7 rows)
name | number -----------------------------+-------- The Song Remains the Same | A1 The Rain Song | A2 Over the Hills and Far Away | A3 The Crunge | A4 Dancing Days | B1 D'yer Mak'er | B2 No Quarter | B3 The Ocean | B4 (8 rows)
genre --------- rock folk jazz country (4 rows)
name | title | year | name ------------+---------------------------------+------+------------- Supertramp | Breakfast in America | 1979 | A&M Records Supertramp | Crisis? What Crisis? | 1975 | A&M Records Supertramp | Crime of the Century | 1974 | A&M Records (3 rows)
Part I: Correct (or at least reasonable) conversion of ERD into SQL | 115 points |
Part II/III: 10 points for each query we can successfully run against your database | 80 points |
README | 5 points |
Total: | 200 points |
README:
Submit a zip file on Canvas containing:
ERD:
A printable version is available here. A detailed description of the entities, attributes, and relationships (i.e., a "data dictionary") is provided below. Because many of the attribute names are not unique in the ERD, I've prepended their entity names and a dot to make clear which attribute I'm describing.
Element | Description |
---|---|
Artist | An entity describing a person or group recording music albums. |
Artist.ID | Unique identifier for an artist. |
Artist.Name | The name of the person or group, e.g., "Led Zeppelin". |
Artist.Type | This is an indicator of whether the artist is a person or a group. Groups may have associated member artists, whereas persons may not. |
is member of | A relationship between artist groups and their members. A group can have 1 or more members who are also artists (persons). A person can be a member of 0 or more groups. Where known, the membership years of the person are also indicated by relationship attributes (described below). |
is member of.Begin Year | The year that the group member in the relationship began in the group. |
is member of.End Year | The year that the group member in the relationship stopped being in the group. |
Label | An entity describing a record label (a publisher of albums). |
Label.ID | Unique identifier for a record label. |
Label.Name | The label name, e.g., "Decca Records". |
Label.Location | The headquarters location of the record label. |
Album | An entity describing a single music album. This is an abstraction in that it tries to capture the notion of a single release by the artist of a collection of tracks under a specific album title. In real life, releases may be made at different times in different countries, even by different labels and with different tracks. The intent in this database is to capture information about the initial release as a proxy for all subsequent releases of the same title. |
Album.ID | Unique identifier for an album. |
Album.Title | The name assigned by the artist to this album. |
Album.Year | The year the album was released. |
Album.Genre | A genre that the album might be considered to be a part of. An album might be associated with multiple genres. The genre values allowed in this database are 'folk','rock','country','jazz', and 'electronica'. |
released | The relationship between artists and their albums. An album has exactly one principal artist (the group or person who released the album, not individual members of a group). An artist may release 0 or more albums. |
published | The relationship between labels and albums. An album has exactly one label, and a label can publish 0 or more albums. |
Track | An entity representing a single track (song) on an album. |
Track.Name | The name of the track or song. Note that this is marked as a partial key, suggesting that track names are unique per album. This assumption will work for the data provided, but in fact, does not work in a larger context - songs may be repeated on albums. |
Track.Number | A number or string indicating the position of the track in relation to the rest of the album. |
is song on | The relationship between albums and their tracks. An album has 1 or more tracks, whereas each track is only on 1 album. |