CSCI 403 Database Management

Spring 2019

Project 6: Convert

Purpose

Practice schema design and construction of a schema from an ERD
Practice taking raw data and factoring it into tables
More practice writing SELECT queries

Overview

For this project, you will be creating a music database along the lines of MusicBrainz (musicbrainz.org) or Discogs (www.discogs.com). These databases track an enormous amount of metadata on commercial and non-commercial music releases on various media. Your database will be considerably simpler, focusing just on albums (generically, ignoring media-specific variations), artists, and labels.

To see the data you will be working with, take a look at the table public.project6 in the course database. You will see there is a considerable amount of data on various artists and their albums, but of course, it is just a drop in the bucket compared to what the above websites manage. (The data, by the way, comes from the MusicBrainz database. I performed queries on their database to gather a limited dataset in the format I wanted, so any data errors should likely be ascribed to my querying and not to the MusicBrainz project.)

This project is similar to project 3 in that you are asked to create some tables and populate them with data from another table. However, the emphasis in project 3 was on basic SQL with a focus on SELECT queries. In this project the emphasis is on schema creation and data migration, and the fine details are left up to you. You are provided with an entity-relationship diagram (ERD) and asked to create a schema based on the ERD. Finally, you need to extract data for your schema from the raw data provided.

You will be asked to provide several SQL queries, not so much to test your skill at constructing SELECT queries (although practice is always good!), but to better illuminate for us your database schema.

Details

This project requires the creation of SQL DDL to produce the required schema, and SQL queries to populate the schema tables from the project data. All of the tasks below should be done strictly via queries executed on the database, e.g., using psql or SQuirreL. That is, although you could query the database and do some of the tasks within a client programming environment, you are required for this project to do all of the work on the database (server) side, using pure SQL. The deliverable for this project is a SQL script (a text file containing your queries - traditionally with a .sql extension) that constructs, populates, and tests your schema. However, unlike previous projects, you may use multiple queries to achieve the end goals - you may even create intermediate tables or views, if you wish.

Build your SQL script carefully and test as you go. You can create a single script which drops tables and recreates them (note you can use something like DROP TABLE IF EXISTS foo CASCADE; in your script. This will drop foo and any objects dependent on it, but won't give an error if foo doesn't currently exist).

Please comment your SQL script (anything between /* and */ will be ignored by the SQL processor) so that we can figure out what is happening if something goes wrong when we test!

Part I: Constructing your database schema

See the end of this document for the ERD and detailed information about the data items you will be working with. Your job is to turn this ERD into a database schema, complete with tables, primary keys, foreign keys, and anything else you feel is necessary. There is no one correct answer here; an ERD can give rise to different schemas, although some similar structure is expected. Please follow the algorithm presented in your book and in class to convert the ERD to a schema as closely as possible.

As presented, the ERD suggests you should use ID attributes as primary keys for albums, artists, and labels. This is generally good practice, since (a) natural key values may not exist due to the possibility of non-unique names (e.g., more than one artist or album of the same name) and (b) it can be cumbersome and costly in terms of space to use complex or text keys as foreign keys. Typically ID fields are just integers. In the PostgreSQL database, the easiest way to generate appropriate unique values in these fields is to declare them as type "serial". When you insert into the tables, insert into all columns except the ID column; the ID values will be generated sequentially by the database on your behalf. (Note that when you reference these columns using a foreign key in another table, the foreign key column should just be of type "integer", not "serial".)

You will need to decide what to name your tables and attributes in the tables, although naming them the same as the entities and attributes in the ERD is not a bad first choice. You will have to add additional attributes for foreign keys; generally these should be named like "foreignentity_id", where you are referencing the ID attribute of the foreign entity. You can name things any way you like, as long as they are semantically reasonable.

Part II: Populating your database

You next need to work out how to extract the data you need in your tables from the table public.project6. Take some time to explore this table and its data. This table contains a very flattened (de-normalized - what you get when you take a normalized database and join everything together into one big table) view of data about albums and artists, so there is a lot of redundancy. Working out how to eliminate the redundancy appropriately (hint: DISTINCT) and populate your tables correctly will be a challenge.

Here are some hints to get you started:

The most difficult table to work with is the artist table. This needs to be populated with data both from the artist_name column of project6 as well as the member_name column of project6. When you are done, your artist table should contain:

Solo artists
Groups
Group members

Note that solo artists and group members should both have the type "Person"; a tricky problem for you to solve is that some solo artists are also members of groups! Your artist table should list them only once.
In general, there is the potential with a dataset like this that you will be unable to distinguish two different instances of some entity that happen to have the same attributes - e.g., two artists with the same name. This is a hard problem! So, the dataset I've given you doesn't have that issue - you can assume artist name uniquely identifies an artist (this also applies to group member names).
Once you've populated the artist table, you need to work out how to associate members with groups in the artist table, as well as store the begin/end year for the artists. The best way to proceed here is to create a cross-reference table (as described in your book and in lecture) from artist back to artist; one of the foreign keys in the cross-reference table will correspond to a group's artist ID, and the other foreign key will correspond to the member's artist ID. You may need to take a few tries at this to get it populated correctly!
If you can get artist and the cross-reference table created successfully, then do label and album next. The label table is relatively simple, and once you have it, you can think about how to populate album correctly. The only tricky part is how to assign the foreign key values. In the case of this database, you can assume that artist names and label names are unique, which will make it much easier to make the appropriate associations.
If you've gotten this far, the rest should be pretty straightforward! Remember each time you modify your database, be sure to test it thoroughly! If you've scripted everything as suggested, it will cost you nothing to throw away what you've done and redo it as often as you need. Work incrementally!

Part III: Querying your database

At the end of your SQL script, please include queries to answer each of the following questions. The purpose here isn't so much to test your SELECT skills as to let us see how your database is structured. These queries may also be a good test of your database (as well as of your skills!), so you may want to construct these as you do part II above and use them to test portions of your database as they come online.

Together with the questions below, I'm showing the answers I received when I went through this assignment. If your answer does not match mine, don't immediately assume it is an error (in your approach or mine), but spend a little time trying to work out what might cause the difference.

Questions and answers:

Get all members of The Who and their begin/end years with the group ordered by their starting year and name. My solution looks like:

          name          | begin_year | end_year 
------------------------+------------+----------
 John Entwistle         |       1964 |     2002
 Keith Moon             |       1964 |     1978
 Pete Townshend         |       1964 |         
 Roger Daltrey          |       1964 |         
 Kenney Jones           |       1978 |     1985
 John "Rabbit" Bundrick |       1979 |     2011
(6 rows)

Get all groups that Chris Thile has been a part of:

      name      
----------------
 Punch Brothers
 Nickel Creek
(2 rows)

Get the album titles, years, and labels for all the albums put out by the band Talking Heads, ordered by year:

                title                | year |    label     
-------------------------------------+------+--------------
 Talking Heads: 77                   | 1977 | Philips
 More Songs About Buildings and Food | 1978 | Sire Records
 Remain in Light                     | 1980 | Sire Records
 Speaking in Tongues                 | 1983 | Sire Records
 Little Creatures                    | 1985 | Sire Records
 True Stories                        | 1986 | EMI
 Naked                               | 1988 | Fly
(7 rows)

Get all albums (album, year, artist, and label) that Chris Thile has performed on, ordered by year:

                title                | year |     artist     |       label        
-------------------------------------+------+----------------+--------------------
 Nickel Creek                        | 2000 | Nickel Creek   | Sugar Hill Records
 This Side                           | 2002 | Nickel Creek   | Sugar Hill Records
 Into the Cauldron                   | 2003 | Chris Thile    | Sugar Hill Records
 Why Should the Fire Die?            | 2005 | Nickel Creek   | Sugar Hill Records
 How to Grow a Woman From the Ground | 2006 | Chris Thile    | Sugar Hill Records
 Punch                               | 2008 | Punch Brothers | Nonesuch
 A Dotted Line                       | 2014 | Nickel Creek   | Nonesuch
(7 rows)

Get all albums (artist, album, year) in the 'electronica' genre ordered by year, artist:

      name      |         title          | year 
----------------+------------------------+------
 Massive Attack | Blue Lines             | 1991
 Underworld     | Dubnobasswithmyheadman | 1993
 Massive Attack | Protection             | 1994
 Fluke          | Risotto                | 1997
 Massive Attack | Mezzanine              | 1998
 Amon Tobin     | Supermodified          | 2000
 Massive Attack | 100th Window           | 2003
(7 rows)

Get all the tracks on Led Zeppelin's Houses of the Holy in order by track number:

            name             | number 
-----------------------------+--------
 The Song Remains the Same   | A1
 The Rain Song               | A2
 Over the Hills and Far Away | A3
 The Crunge                  | A4
 Dancing Days                | B1
 D'yer Mak'er                | B2
 No Quarter                  | B3
 The Ocean                   | B4
(8 rows)

Get all genres that James Taylor has performed in:

  genre  
---------
 rock
 folk
 jazz
 country
(4 rows)

Get all albums published by a label headquartered in Hollywood:

    name    |             title               | year |    name     
------------+---------------------------------+------+-------------
 Supertramp | Breakfast in America            | 1979 | A&M Records
 Supertramp | Crisis? What Crisis?            | 1975 | A&M Records
 Supertramp | Crime of the Century            | 1974 | A&M Records
(3 rows)

Grading:

Part I: Correct (or at least reasonable) conversion of ERD into SQL	115 points
Part II/III: 10 points for each query we can successfully run against your database	80 points
README	5 points
Total:	200 points

Documentation:

README:

Include names of all people who helped/collaborated as per the syllabus
Describe the challenges you encountered and how you surmounted them
What did you like/dislike about the assignment?
How long did you spend on this assignment?

Submit a zip file on Canvas containing:

README
Your SQL script.

ERD and Data Dictionary

ERD:

A printable version is available here. A detailed description of the entities, attributes, and relationships (i.e., a "data dictionary") is provided below. Because many of the attribute names are not unique in the ERD, I've prepended their entity names and a dot to make clear which attribute I'm describing.

Element Description

Artist An entity describing a person or group recording music albums.

Artist.ID Unique identifier for an artist.

Artist.Name The name of the person or group, e.g., "Led Zeppelin".

Artist.Type This is an indicator of whether the artist is a person or a group. Groups may have associated member artists, whereas persons may not.

is member of A relationship between artist groups and their members. A group can have 1 or more members who are also artists (persons). A person can be a member of 0 or more groups. Where known, the membership years of the person are also indicated by relationship attributes (described below).

is member of.Begin Year The year that the group member in the relationship began in the group.

is member of.End Year The year that the group member in the relationship stopped being in the group.

Label An entity describing a record label (a publisher of albums).

Label.ID Unique identifier for a record label.

Label.Name The label name, e.g., "Decca Records".

Label.Location The headquarters location of the record label.

Album An entity describing a single music album. This is an abstraction in that it tries to capture the notion of a single release by the artist of a collection of tracks under a specific album title. In real life, releases may be made at different times in different countries, even by different labels and with different tracks. The intent in this database is to capture information about the initial release as a proxy for all subsequent releases of the same title.

Album.ID Unique identifier for an album.

Album.Title The name assigned by the artist to this album.

Album.Year The year the album was released.

Album.Genre A genre that the album might be considered to be a part of. An album might be associated with multiple genres. The genre values allowed in this database are 'folk','rock','country','jazz', and 'electronica'.

released The relationship between artists and their albums. An album has exactly one principal artist (the group or person who released the album, not individual members of a group). An artist may release 0 or more albums.

published The relationship between labels and albums. An album has exactly one label, and a label can publish 0 or more albums.

Track An entity representing a single track (song) on an album.

Track.Name The name of the track or song. Note that this is marked as a partial key, suggesting that track names are unique per album. This assumption will work for the data provided, but in fact, does not work in a larger context - songs may be repeated on albums.

Track.Number A number or string indicating the position of the track in relation to the rest of the album.

is song on The relationship between albums and their tracks. An album has 1 or more tracks, whereas each track is only on 1 album.

Element	Description
Artist	An entity describing a person or group recording music albums.
Artist.ID	Unique identifier for an artist.
Artist.Name	The name of the person or group, e.g., "Led Zeppelin".
Artist.Type	This is an indicator of whether the artist is a person or a group. Groups may have associated member artists, whereas persons may not.
is member of	A relationship between artist groups and their members. A group can have 1 or more members who are also artists (persons). A person can be a member of 0 or more groups. Where known, the membership years of the person are also indicated by relationship attributes (described below).
is member of.Begin Year	The year that the group member in the relationship began in the group.
is member of.End Year	The year that the group member in the relationship stopped being in the group.
Label	An entity describing a record label (a publisher of albums).
Label.ID	Unique identifier for a record label.
Label.Name	The label name, e.g., "Decca Records".
Label.Location	The headquarters location of the record label.
Album	An entity describing a single music album. This is an abstraction in that it tries to capture the notion of a single release by the artist of a collection of tracks under a specific album title. In real life, releases may be made at different times in different countries, even by different labels and with different tracks. The intent in this database is to capture information about the initial release as a proxy for all subsequent releases of the same title.
Album.ID	Unique identifier for an album.
Album.Title	The name assigned by the artist to this album.
Album.Year	The year the album was released.
Album.Genre	A genre that the album might be considered to be a part of. An album might be associated with multiple genres. The genre values allowed in this database are 'folk','rock','country','jazz', and 'electronica'.
released	The relationship between artists and their albums. An album has exactly one principal artist (the group or person who released the album, not individual members of a group). An artist may release 0 or more albums.
published	The relationship between labels and albums. An album has exactly one label, and a label can publish 0 or more albums.
Track	An entity representing a single track (song) on an album.
Track.Name	The name of the track or song. Note that this is marked as a partial key, suggesting that track names are unique per album. This assumption will work for the data provided, but in fact, does not work in a larger context - songs may be repeated on albums.
Track.Number	A number or string indicating the position of the track in relation to the rest of the album.
is song on	The relationship between albums and their tracks. An album has 1 or more tracks, whereas each track is only on 1 album.