EDA of GoodReads Best Books Ever by Andy Trick
========================================================

## [1] 30583    16
##  [1] "rating"         "votes"          "isbn"           "author"        
##  [5] "series"         "page_count"     "publisher"      "reviews"       
##  [9] "setting"        "awards"         "year"           "genre"         
## [13] "title"          "rank"           "year.bucket"    "part_of_series"
## 'data.frame':    30583 obs. of  16 variables:
##  $ rating        : num  4.39 4.42 3.56 4.23 4.25 4.22 3.8 4.38 4.18 3.79 ...
##  $ votes         : num  6110 391 4471 540 10531 ...
##  $ isbn          : Factor w/ 21699 levels "",",","000100039X",..: 7458 7537 4206 11265 8262 1718 8926 374 4954 6757 ...
##  $ author        : Factor w/ 13884 levels "'Ali Ibn ABI Al-Hazm Ibn Al-Nafis",..: 12550 5260 12332 5600 8436 1734 4350 12005 3311 3702 ...
##  $ series        : Factor w/ 820 levels "","1-800-Where-R-You",..: 674 1 774 1 1 1 1 1 1 1 ...
##  $ page_count    : num  374 NA NA 279 NA 767 NA NA NA 464 ...
##  $ publisher     : Factor w/ 6720 levels "","'Hayastan' hratarakchutyu",..: 5377 5361 3537 3903 2427 2661 4801 2685 1569 6454 ...
##  $ reviews       : num  1003 1109 3423 2226 945 ...
##  $ setting       : Factor w/ 1246 levels "","ABA Indie Next Book",..: 341 206 388 1 1 1 1 1 1 9 ...
##  $ awards        : Factor w/ 359 levels "","Abraham Lincoln Award Nominee (2006)",..: 124 1 123 1 1 1 1 1 1 1 ...
##  $ year          : num  2008 2003 2005 1813 1936 ...
##  $ genre         : Factor w/ 147 levels "","Academic",..: 147 51 147 28 28 28 28 24 120 20 ...
##  $ title         : Factor w/ 28863 levels "","'Are these my basoomas I see before me?'",..: 23887 13231 27398 19034 12958 22436 6268 23523 23803 28713 ...
##  $ rank          : num  1 2 3 4 5 6 7 8 9 10 ...
##  $ year.bucket   : Factor w/ 8 levels "(0,1900]","(1900,1950]",..: 7 6 6 1 2 2 2 3 3 1 ...
##  $ part_of_series: logi  TRUE FALSE TRUE FALSE FALSE FALSE ...
##   [1] ""                        "Academic"               
##   [3] "Action"                  "Adult"                  
##   [5] "Adult Fiction"           "Adventure"              
##   [7] "Alcohol"                 "American"               
##   [9] "Amish"                   "Animals"                
##  [11] "Anthologies"             "Anthropology"           
##  [13] "Apocalyptic"             "Architecture"           
##  [15] "Art"                     "Asian Literature"       
##  [17] "Autobiography"           "Biography"              
##  [19] "Biology"                 "Book Club"              
##  [21] "Buddhism"                "Business"               
##  [23] "Category Romance"        "Childrens"              
##  [25] "Christian"               "Christian Fiction"      
##  [27] "Christianity"            "Classics"               
##  [29] "Comics"                  "Computer Science"       
##  [31] "Contemporary"            "Crafts"                 
##  [33] "Crime"                   "Criticism"              
##  [35] "Cultural"                "Culture"                
##  [37] "Dark"                    "Dc Comics"              
##  [39] "Design"                  "Disability"             
##  [41] "Drama"                   "Dungeons And Dragons"   
##  [43] "Economics"               "Education"              
##  [45] "Environment"             "Erotica"                
##  [47] "Esoterica"               "European Literature"    
##  [49] "Family"                  "Fan Fiction"            
##  [51] "Fantasy"                 "Feminism"               
##  [53] "Fiction"                 "Folklore"               
##  [55] "Food And Drink"          "Football"               
##  [57] "Games"                   "Gardening"              
##  [59] "Gender"                  "Glbt"                   
##  [61] "Gothic"                  "Graphic Novels Manga"   
##  [63] "Health"                  "Historical"             
##  [65] "Historical Fiction"      "History"                
##  [67] "History And Politics"    "Holiday"                
##  [69] "Horror"                  "Humanities"             
##  [71] "Humor"                   "Inspirational"          
##  [73] "Kids"                    "Language"               
##  [75] "Law"                     "Lds"                    
##  [77] "Leadership"              "Literary Fiction"       
##  [79] "Literature"              "Love"                   
##  [81] "Love Inspired"           "Magical Realism"        
##  [83] "Management"              "Manga"                  
##  [85] "Marriage"                "Marvel"                 
##  [87] "Media Tie In"            "Medical"                
##  [89] "Mental Health"           "Mermaids"               
##  [91] "Military"                "Military History"       
##  [93] "Modern"                  "Music"                  
##  [95] "Mystery"                 "Mythology"              
##  [97] "New Adult"               "Non Fiction"            
##  [99] "Novels"                  "Occult"                 
## [101] "Paranormal"              "Parenting"              
## [103] "Philosophy"              "Plays"                  
## [105] "Poetry"                  "Politics"               
## [107] "Polyamory"               "Productivity"           
## [109] "Psychology"              "Queer"                  
## [111] "Race"                    "Realistic Fiction"      
## [113] "Reference"               "Regency"                
## [115] "Relationships"           "Religion"               
## [117] "Roman"                   "Romance"                
## [119] "Science"                 "Science Fiction"        
## [121] "Science Fiction Fantasy" "Self Help"              
## [123] "Sequential Art"          "Sexuality"              
## [125] "Shapeshifters"           "Short Stories"          
## [127] "Social Science"          "Sociology"              
## [129] "Space"                   "Speculative Fiction"    
## [131] "Spirituality"            "Sports"                 
## [133] "Sports And Games"        "Spy Thriller"           
## [135] "Superheroes"             "Suspense"               
## [137] "Teaching"                "Textbooks"              
## [139] "Thriller"                "Travel"                 
## [141] "Urban"                   "War"                    
## [143] "Western"                 "Womens Fiction"         
## [145] "World War II"            "Writing"                
## [147] "Young Adult"
## [1] "(0,1900]"    "(1900,1950]" "(1950,1980]" "(1980,1990]" "(1990,2000]"
## [6] "(2000,2005]" "(2005,2010]" "(2010,2015]"
##      rating         votes      
##  Min.   :0.00   Min.   :    1  
##  1st Qu.:3.80   1st Qu.: 2990  
##  Median :4.00   Median : 6171  
##  Mean   :4.00   Mean   : 6179  
##  3rd Qu.:4.21   3rd Qu.: 9485  
##  Max.   :5.00   Max.   :12559  
##                                
##                                          isbn      
##                                            : 8864  
##  http://watergreen.wix.com/watersgreenhouse:    6  
##  ,                                         :    3  
##  http://forums.sennadar.com                :    3  
##  0061353450                                :    2  
##  0062248162                                :    2  
##  (Other)                                   :21703  
##              author                                 series     
##  James Patterson:   69                                 :29680  
##  Stephen King   :   68   Dark Saga                     :    8  
##  Nora Roberts   :   66   Kate Shugak                   :    7  
##  Francine Pascal:   59   Otherworld/Sisters of the Moon:    7  
##  Agatha Christie:   56   Breeds                        :    6  
##  Meg Cabot      :   53   Argeneau                      :    5  
##  (Other)        :30212   (Other)                       :  870  
##    page_count                publisher        reviews    
##  Min.   :   0.0                   : 1897   Min.   :   1  
##  1st Qu.: 210.0   Vintage         :  420   1st Qu.:1060  
##  Median : 306.0   Penguin Books   :  328   Median :1807  
##  Mean   : 356.1   HarperCollins   :  322   Mean   :1884  
##  3rd Qu.: 416.8   Ballantine Books:  257   3rd Qu.:2793  
##  Max.   :4892.0   Createspace     :  254   Max.   :3642  
##  NA's   :28253    (Other)         :27105                 
##                     setting     
##                         :28599  
##  United States          :   82  
##  London, England        :   58  
##  New York City, New York:   48  
##  United Kingdom         :   28  
##  New York               :   23  
##  (Other)                : 1745  
##                                                                                                                                                                                                                                                                                  awards     
##                                                                                                                                                                                                                                                                                     :30180  
##  Goodreads Choice Nominee for Romance (2010)                                                                                                                                                                                                                                        :    5  
##  Golden Duck Award for Hal Clement Award for Young Adult (2010), YALSA Teens' Top Ten (2010), Children's Choice Book Award for Teen Choice Book of the Year (2010), Indies Choice Book Award for Young Adult (2010), Teen Read Award Nominee for Best Read (2010)                   :    4  
##  Goodreads Choice Nominee (2013)                                                                                                                                                                                                                                                    :    4  
##  Goodreads Choice Nominee for Paranormal Fantasy (2010)                                                                                                                                                                                                                             :    4  
##  Locus Award Nominee for Best Young Adult Novel (2008), Mythopoeic Fantasy Award for Children's Literature (2008), Odyssey Award for Excellence in Audiobook Production Honor (2008), Books I Loved Best Yearly (BILBY) Awards for Older Readers (2008), YALSA Teens' Top Ten (2008):    4  
##  (Other)                                                                                                                                                                                                                                                                            :  382  
##       year              genre                            title      
##  Min.   :-800   Fantasy    : 3835   English                 :  479  
##  1st Qu.:1981   Fiction    : 3767   Arabic                  :   72  
##  Median :2000              : 2866   Urdu                    :   12  
##  Mean   :1975   Romance    : 2419   The Hobbit              :   10  
##  3rd Qu.:2008   Young Adult: 2200   ヴァンパイア騎士:    9  
##  Max.   :2800   Non Fiction: 1418   Spanish                 :    9  
##  NA's   :9105   (Other)    :14078   (Other)                 :29992  
##       rank            year.bucket   part_of_series 
##  Min.   :    1   (2005,2010]:4198   Mode :logical  
##  1st Qu.: 7646   (1990,2000]:3610   FALSE:29680    
##  Median :15292   (2010,2015]:3296   TRUE :903      
##  Mean   :15292   (2000,2005]:3102   NA's :0        
##  3rd Qu.:22938   (1950,1980]:2795                  
##  Max.   :30583   (Other)    :4424                  
##                  NA's       :9158

Univariate Plots Section

Taking a first look at the Best Books Ever dataset, it is not too surprising that the majority of the books on the list are recent. Let’s take a closer look:

There appears to be a steady increase in books released per year, peaking at 1,078 books in 2011. For comparison, the average yearly count for the 20 years before that is 545.5, roughly half.

## [1] 1078
## [1] 545.5
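
The two numbers above could be computed along these lines. This is a minimal sketch on made-up data: the data frame `books` and its values are hypothetical stand-ins for the real dataset, and treating years with no books as a count of zero is an assumption, not necessarily what the original chunk did.

```r
# Toy stand-in for the real data: one row per book (hypothetical values).
books <- data.frame(
  year = c(rep(2011, 6), rep(2010, 4), rep(2000, 3), 1991)
)

# Count books per release year and find the peak.
per_year   <- table(books$year)
peak_year  <- as.numeric(names(per_year)[which.max(per_year)])
peak_count <- as.numeric(max(per_year))

# Mean yearly count over the 20 years before the peak.
# Using a factor with explicit levels keeps years with zero books in the count.
prior_years  <- (peak_year - 20):(peak_year - 1)
prior_counts <- table(factor(books$year, levels = prior_years))
mean_prior   <- mean(as.vector(prior_counts))

peak_count
mean_prior
```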

One more histogram of books released, this time based on the year buckets created earlier.

Most books are between 200 and 500 pages long.

A closer look shows that a book is most likely to be between 275 and 340 pages long. Do the top 250 books fall into this grouping? That may help us determine whether page length is correlated with the popularity of a book.

It looks like the top 250 books in the dataset are pretty similar to the entire population, though there may be a slight increase in the overall numbers. Let’s look at summaries of the two groups.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##     0.0   210.0   306.0   356.1   416.8  4892.0   28253
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    32.0   248.8   374.0   485.8   481.2  2700.0     198
## 
##  Pearson's product-moment correlation
## 
## data:  gr$page_count and gr$rating
## t = 5.2171, df = 2328, p-value = 1.978e-07
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.06718583 0.14746469
## sample estimates:
##       cor 
## 0.1075005

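So it looks like the top 250 books have a slightly higher page count than the entire dataset at every quartile and on average (median 374 vs. 306 pages). There is a positive correlation of about 0.11 between the length and the average rating of a book, based on the 2,330 books that have a page count.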
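
A comparison like the one above can be sketched as follows. The data here is randomly generated and purely illustrative; the data frame name `books`, the subset name `top250`, and the rule `rank <= 250` for the top 250 are assumptions. Note that `cor.test` drops incomplete pairs on its own, which is why the degrees of freedom reflect only books with a known page count.

```r
set.seed(1)

# Hypothetical data: 1000 ranked books, 300 of them with no page count.
books <- data.frame(
  rank       = 1:1000,
  rating     = round(runif(1000, 3, 5), 2),
  page_count = round(runif(1000, 100, 900))
)
books$page_count[sample(1000, 300)] <- NA

# Assumed definition of the top 250 subset.
top250 <- subset(books, rank <= 250)

# Side-by-side summaries of page count (NA counts are reported too).
summary(books$page_count)
summary(top250$page_count)

# Pearson correlation between length and rating on complete pairs only.
ct <- cor.test(books$page_count, books$rating)
ct$estimate
ct$parameter  # df = number of complete pairs - 2
```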

A look at the most common genres in the dataset. Any genre-related plots from here on will only include genres with at least 500 occurrences in the data.

Fiction and Fantasy are by far the most popular genres. Romance and Young Adult also have surprisingly large counts.

Interestingly, when subsetting to the top 250 ranked books, Fantasy drops very low and Classics take over the top spot. Fiction stays popular, while Romance is nowhere on the list.

##            n
## 1 0.02952621

A look at how many books in the dataset are part of a series: only around 3 percent.
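
As a quick check, the share can be reproduced from the TRUE/FALSE counts shown in the summary output (903 and 29,680). The vector below is rebuilt from those counts for illustration rather than taken from the real data frame.

```r
# Rebuild a logical vector with the counts reported by summary():
# 903 books in a series, 29680 standalone.
part_of_series <- rep(c(TRUE, FALSE), times = c(903, 29680))

# The mean of a logical vector is the proportion of TRUE values.
share_in_series <- mean(part_of_series)
share_in_series
```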

Top authors in the book list. I’m very surprised to see that 8 authors have over 50 books in the data.

Top Publishers. Vintage Publishing has the most books on the list by a large margin at a count of 420. The next highest, Penguin Books, has 92 fewer.

Although technically different, if Penguin Books and Penguin Classics were combined, Penguin would come in at 566 books on the list.

Above are four histograms depicting what could be considered the popularity of a book: average rating (out of 5 stars) and number of votes for best book.

There is not much of a difference between the entire dataset and the top 250 subset. Rating tends to be between 3.5 and 4.5, while the number of votes is spread fairly evenly between 0 and 10,000. The only notable outlier in the votes plots is the spike at 0 for the entire dataset. This is due to a large number of books at the end of the list having under 100 votes.

Univariate Analysis

What is the structure of your dataset?

This data was obtained from GoodReads.com’s Best Books Ever list. There are 30,583 books in the dataset with 14 original features (rating, votes, isbn, author, series, page_count, publisher, reviews, setting, awards, year, genre, title, rank). I created two more variables (year.bucket and part_of_series). The variables isbn, author, series, publisher, setting, awards, genre, title, and year.bucket are factors; part_of_series is a boolean.

Observations:
-Most books are not part of a series.
-The majority of books on the list are from the past 20 years.
-Median book length is 306 pages.
-Mean rating is 3.99.
-There are 147 unique genre levels (one of which is blank).

What is/are the main feature(s) of interest in your dataset?

The main feature of interest in this dataset is genre. I am looking to discover any trends or patterns in people’s opinion of, and the popularity of, each genre.

What other features in the dataset do you think will help support your investigation into your feature(s) of interest?

Rank, rating, year, year.bucket, and page_count are all important features for discovering trends in genre. Year will help me discover any differences over time between genres, while page count will help me learn whether the length of books differs by genre. Rank and rating are both good measures of the popularity of, and opinion on, a book.

Did you create any new variables from existing variables in the dataset?

I created two new variables: year.bucket and part_of_series. year.bucket splits the books into 8 similarly sized year categories, with narrower buckets for recent years to counteract the large number of books released in the past 25 years. part_of_series is a boolean indicating whether the book is part of a series or a standalone.
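
The two derived variables could be built roughly as below. The break points are read off the bucket labels printed earlier; the toy data frame `books`, the use of `cut()`, and the rule "non-empty `series` means part of a series" are assumptions about how the original did it. Years outside (0, 2015], such as -800 or 2800, fall outside every bucket and become NA.

```r
# Hypothetical rows, including out-of-range and missing years.
books <- data.frame(
  year   = c(-800, 1813, 1936, 2003, 2008, 2011, 2800, NA),
  series = c("", "", "", "Some Series", "", "Another Series", "", ""),
  stringsAsFactors = FALSE
)

# Break points taken from the printed levels "(0,1900]" ... "(2010,2015]".
breaks <- c(0, 1900, 1950, 1980, 1990, 2000, 2005, 2010, 2015)

# Right-closed intervals; dig.lab = 4 keeps four-digit years in the labels.
books$year.bucket <- cut(books$year, breaks = breaks, dig.lab = 4)

# Assumed rule: a book is part of a series when the series field is not blank.
books$part_of_series <- books$series != ""

levels(books$year.bucket)
table(books$year.bucket, useNA = "ifany")
```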

Of the features you investigated, were there any unusual distributions? Did you perform any operations on the data to tidy, adjust, or change the form of the data? If so, why did you do this?

Instead of changing the form of the data, I opted to use several different subsets of the original dataset. This allowed me to keep a copy of the initial data while also cleaning the data for specific purposes.

author.sub, genre.sub, and publisher.sub are subsets specific to the variable each is named after. For example, author.sub is a dataset of only authors with 50 or more books in the dataset. genre.sub is similar in that it contains genres with at least 200 occurrences.
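
A frequency-based subset of this kind can be sketched in base R as follows. The toy data and the threshold variable `min_books` are hypothetical (the text uses 50 for authors); only the name `author.sub` comes from the write-up.

```r
# Toy data: author A has 3 books, B has 2, C has 1.
books <- data.frame(
  author = c("A", "A", "A", "B", "B", "C"),
  stringsAsFactors = FALSE
)

min_books <- 2  # hypothetical threshold for this toy example

# Count books per author and keep the names that meet the threshold.
counts <- table(books$author)
keep   <- names(counts)[as.vector(counts) >= min_books]

# Rows for frequent authors only; the original data frame is left untouched.
author.sub <- books[books$author %in% keep, , drop = FALSE]
author.sub
```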

I also subset the top 250 ranked books to see if a more immediate pattern emerged from a smaller, but more popular, selection. Along with these, there are several more subsets for specific genres (for example, sff.sub and fantasy.sub).

Bivariate Plots Section

Plotting length (page count) by rating shows a slight positive correlation between the two.

## 
##  Pearson's product-moment correlation
## 
## data:  gr$page_count and gr$rating
## t = 5.2171, df = 2328, p-value = 1.978e-07
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.06718583 0.14746469
## sample estimates:
##       cor 
## 0.1075005

There seems to be a slight correlation between rank and page count as well. This was expected after earlier finding the top 250 ranked books were typically longer. How correlated?

## 
##  Pearson's product-moment correlation
## 
## data:  gr$rank and gr$page_count
## t = -5.8784, df = 2328, p-value = 4.737e-09
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.16075838 -0.08072846
## sample estimates:
##        cor 
## -0.1209399

cor = -0.12. The sign must be read in reverse because a lower rank number means a better rank, so longer books tend to be ranked slightly better.
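
To make the sign reversal concrete, here is a small sketch on simulated data (all values are made up). Negating rank, so that larger means better, flips the sign of the Pearson correlation and leaves its magnitude unchanged.

```r
set.seed(42)

# Simulated data in which better-ranked (lower-numbered) books are longer.
rank       <- 1:200
page_count <- 500 - rank + rnorm(200, sd = 100)

r_raw     <- cor(rank, page_count)   # negative: rank number rises as length falls
r_flipped <- cor(-rank, page_count)  # same strength, sign reversed

c(raw = r_raw, flipped = r_flipped)
```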

Votes by length does not show any pattern. The only thing this plot shows is that most books are between 200 and 500 pages. Expect plots that include votes not to show any trends, since votes do not appear to be correlated with the other variables in any particular way.

Here is something I set out to discover: are particular genres typically longer or shorter than others?

In our dataset the genres History, Historical Fiction, and Fantasy are on average longer than most. Children’s books are, very predictably, the shortest.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##     0.0   210.0   306.0   356.1   416.8  4892.0   28253
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   104.0   324.0   463.0   512.1   597.0  1264.0     516

Above is a summary of length for the whole dataset, followed by a summary of length for History books only.
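
A per-genre length comparison can be sketched with `tapply`. The rows below are invented for illustration; `na.rm = TRUE` matters because most books in the real data have no page count.

```r
# Hypothetical rows; one History book is missing its page count.
books <- data.frame(
  genre      = c("History", "History", "Childrens", "Childrens", "Fantasy"),
  page_count = c(600, NA, 32, 48, 450),
  stringsAsFactors = FALSE
)

# Mean page count per genre, ignoring missing values.
mean_by_genre <- tapply(books$page_count, books$genre, mean, na.rm = TRUE)

# Longest genres first.
sorted <- sort(mean_by_genre, decreasing = TRUE)
sorted
```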

Still, votes do not show any pattern or trend in the data.

A scatterplot and boxplot of genre by average rating. All genres are typically around the 4 star rating. Sequential Art is abnormally high.

A scatterplot and boxplot of the same data: rank by genre. The scatterplot shows that there are very few Classics in the data. There are fewer of them, but they are on average ranked near the top: 27% of the Classics are ranked under 2,000 on the list.

##           n
## 1 0.2711058
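
The proportion above is the share of Classics whose rank is below 2,000. A base R sketch on invented rows is shown here; the printed output in the original appears to come from a dplyr summary, so this is an equivalent calculation rather than the original code.

```r
# Hypothetical rows for illustration.
books <- data.frame(
  genre = c("Classics", "Classics", "Classics", "Classics", "Fantasy", "Romance"),
  rank  = c(5, 150, 4000, 12000, 20, 25000),
  stringsAsFactors = FALSE
)

classics <- subset(books, genre == "Classics")

# Proportion of Classics ranked under 2000 (mean of a logical vector).
share_top <- mean(classics$rank < 2000)
share_top
```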

In the same scatterplot we can determine that Romance novels populate the end of the list more heavily than other genres.

The boxplot allows us to see that History and Childrens are on average lower on the list than most. It is interesting to see that Romance is in the middle of the boxplot, while the scatterplot shows that most Romance books are ranked very low. Are there a few Romance novels ranked very high to counteract this?

Does a higher rating correlate with a better rank?
Interestingly, no.

In both cases, the top 250 and the entire dataset, the higher-ranked genres are not the higher-rated ones. This is completely unexpected.

## 
##  Pearson's product-moment correlation
## 
## data:  gr$rating and gr$rank
## t = -13.3306, df = 30581, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.08714258 -0.06485694
## sample estimates:
##         cor 
## -0.07600925

There is a very slight relationship between rating and rank (cor = -0.076): higher-rated books tend to have slightly better, lower-numbered ranks. Again, the negative sign from cor.test must be read in reverse.

This is weird. Why does the plot of reviews by votes come out as a column-like image?

Once again, votes shows nothing informative about the data.

Is there a relationship between the length of a book and the year it was released? Over the past 200 years there does not seem to be a pattern. What if we cut this down to books from 2000 and later?

There is a very slight downward slope on length over the past 15 years.

Too far out to get any information. Let’s cut to 1950 to the present.

At first look it appears that more recent books are better quality (higher rating). This is probably just due to the majority of books in the data being released more recently.

## 
##  Pearson's product-moment correlation
## 
## data:  after1950$year and after1950$rating
## t = -2.86, df = 19039, p-value = 0.004241
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.034916656 -0.006521013
## sample estimates:
##         cor 
## -0.02072301

Nothing interesting here, just more proof that more books were released after 2000.

A look at genre trends over the years. A classic was most likely released between 1850 and 1920. Almost nothing released after 1950 is considered a classic.

It is interesting that there are no outliers of sequential art before the year 1930. History, Young Adult and Romance could also be considered ‘newer’ genres.

Nothing too informative here.

Publisher does not have much of an effect on rating. Createspace has a slightly above-average mean rating.

Bivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. How did the feature(s) of interest vary with other features in the dataset?

There is a definite difference in the typical rating and rank depending on genre.

Although they have the best mean rank, Classics have the lowest average rating. The History and Sequential Art categories have both high average ratings and some of the best overall rankings. Oddly, when subsetting to the top 250 books, these two categories do not make the list even once.

Genres also seem to become popular at different times. Classics were released almost solely before the 1920s, around the same time Fiction books started to become popular. Several categories (Young Adult, Sequential Art, History, Romance, and Mystery) saw a huge increase in popularity in the 1990s, which continues to the present.

The length of a book also seems to have a small correlation with genre. History books tend to be the longest in our dataset, while children’s books are, unsurprisingly, the shortest.

Did you observe any interesting relationships between the other features (not the main feature(s) of interest)?

It appears the length of a book has a small effect on standard public opinion. In the dataset, on average the longer a book is, the higher its rating and better its rank.

Although publisher does not show any significant trends in the data, people seem to enjoy ‘Createspace’ books more than others.

What was the strongest relationship you found?

The strongest relationship in the GoodReads.com Best Books Ever dataset is between the length of a book and its rank. page_count and rank are negatively correlated at around -0.12. Could this partially be due to certain genres typically being longer than others?

Multivariate Plots Section

## [1] "rating"      "votes"       "page_count"  "reviews"     "year"       
## [6] "rank"        "year.bucket"

Everything has a very small correlation. I had assumed that there would be more trends and patterns in the dataset.

Young Adult, Romance, and Fantasy skew recent (the majority of books in those genres come from after 2000).

No visible trend or pattern here.

Unfortunately no new information from this facet regarding length. What about rank or rating?

With such a low correlation it is hard to read much into the plots. Sci-fi and Sequential Art books are ranked slightly better the shorter they are. Long Fantasy books are typically ranked better.

Regarding rating: Mystery and Historical Fiction are flat across the board. All other genres tend to have a slight positive relationship between length and rating. This is especially visible in Sequential Art.

This is a mess. Will the same graph with year bucket instead of genre show anything?

This is also very noisy and hard to read. There doesn’t seem to be any correlation between length and rank with regard to genre or year released.

In the 90’s Sequential Art jumped up in average rating. Children’s books do the same for the years 2005 to 2010. Sci-fi books slightly increase in rating over the years.

Romance books are typically ranked higher the more recent they are. Classics do the opposite. Fantasy and Fiction are all over the place. Let’s look at the last 30 years.

Highly ranked Young Adult and Romance books are typically from the past 10 years. Fantasy books from around 1995 to 2005 seem to be the most popular for their genre. Sci-fi hits seem to have a gap: highly ranked ones are either from the 80’s or after 2010.

Fiction books dip in rating slightly after 1975.

Highly ranked Romance books are from after 2010. Highly rated Fiction tends to be from the 80’s and 90’s.

Since 2000 there is a large increase in Fantasy, Romance, Sequential Art, and Young Adult. Of these, both Romance and Fantasy see a huge spike in popularity from 2010 to current.

There is a massive increase in Fantasy, Young Adult, and Romance novels starting around the early 2000’s. Could the reason for this be the popularity of these genres in mainstream media such as movies or TV series?

## 'data.frame':    7 obs. of  3 variables:
##  $ movie_titles: Factor w/ 7 levels "Chronicles of Narnia",..: 3 2 7 1 4 6 5
##  $ movie_years : num  2001 2001 2008 2005 2004 ...
##  $ movie_genre : Factor w/ 3 levels "Fantasy","Romance",..: 1 1 3 3 2 2 2
## [1] 7 3
## [1] Lord of the Rings    Harry Potter         Twilight            
## [4] Chronicles of Narnia Notebook             PS I Love You       
## [7] Pride and Prejudice 
## 7 Levels: Chronicles of Narnia Harry Potter ... Twilight

After researching mainstream movie hits from the 2000’s that were based on novels, 7 movies stood out to me. These cover the three genres in question.
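
For reference, a data frame with the structure printed above could be built like this. The column names come from the `str()` output and the titles, years, and genres from the movie list in this write-up; the data frame name `movies` is hypothetical, and the third genre level being "Young Adult" is inferred from the text.

```r
# Seven movie adaptations with release year and the genre assigned in the data.
movies <- data.frame(
  movie_titles = c("Lord of the Rings", "Harry Potter", "Twilight",
                   "Chronicles of Narnia", "Notebook", "PS I Love You",
                   "Pride and Prejudice"),
  movie_years  = c(2001, 2001, 2008, 2005, 2004, 2007, 2005),
  movie_genre  = c("Fantasy", "Fantasy", "Young Adult", "Young Adult",
                   "Romance", "Romance", "Romance"),
  stringsAsFactors = TRUE  # titles and genres stored as factors, as in str()
)

str(movies)
dim(movies)
```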

A simple scatterplot of the release years of movies based on books. I will explore this data and its possible relationship to literature genre popularity further in the final plots section.

## Source: local data frame [1 x 1]
## 
##    n
## 1 57
## Source: local data frame [1 x 1]
## 
##    n
## 1 77
## [1] 1.350877

The count of Fantasy novels released in 2000 was 57. The year 2002, the year after the release of Lord of the Rings and Harry Potter, saw 77 Fantasy releases, a 35% increase. Unfortunately, it would take much more data and research to discover whether this is correlated or just coincidental.
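
The arithmetic behind the 35% figure, using the two counts printed above (the variable names are illustrative):

```r
fantasy_2000 <- 57  # Fantasy books on the list released in 2000
fantasy_2002 <- 77  # Fantasy books on the list released in 2002

# Ratio of the two counts, then the percent increase.
ratio        <- fantasy_2002 / fantasy_2000
pct_increase <- (ratio - 1) * 100

round(ratio, 6)
round(pct_increase)
```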

The above basically puts an earlier facet onto one graph via color. Same results.

Plotting Fantasy vs. Sci-fi over the years by average rank. The size of each point relates to the average length of the books.

Both saw a dip in high ranked books between 1980 and 2000.

Fantasy books spike in recent years. Sci-fi saw a lot of releases between the 50’s and 80’s, then drops in count until recent years.

The overall value of a book is defined as rating / rank. This gives a single value based on the rating of a book weighted by its rank.
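
As a small illustration of the rating / rank definition, here it is applied to the first three books shown in the `str()` output (ratings 4.39, 4.42, 3.56 at ranks 1, 2, 3). The column name `value` is an assumption. Because rank spans 1 to 30,583 while rating only spans 0 to 5, the value falls off quickly as rank grows.

```r
# First three rows of rating and rank from the str() output above.
books <- data.frame(
  rating = c(4.39, 4.42, 3.56),
  rank   = c(1, 2, 3)
)

# Value as defined in the text: rating divided by rank.
books$value <- books$rating / books$rank
books
```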

There is a positive increase in value over time.

All genres aside from History and Children’s books see an increase in value in the past 10 years, especially Mystery and Horror.

Is value affected by the length of a book?

Most books seem to have the largest grouping of positive outliers at between 300 and 400 pages.

It was rare for books before the 1900s to be longer than 500 pages.

This histogram adds the mean value per year bucket for each genre. Children’s books have the highest value over time. Is this because there are fewer children’s books with higher average values?
Maybe the only children’s books that made it onto the list are very memorable, and nostalgia for them is the reason for the large value.

Multivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. Were there features that strengthened each other in terms of looking at your feature(s) of interest?

There were very few features that strengthened each other when plotted together. One of these was genre by year by rank. Almost every genre category has a lower rank the newer a book is. Young Adult and Classics are exceptions to this. The rest of the genres, particularly Fantasy, tend to filter into the very high ranks after the year 2000. The value of a book, which combines popularity (rank) and quality (rating), also has a very large number of low scores around the year 2000.

Another group of features that saw a small correlation was Value by Year by Length. Longer books started to get much higher value scores around the 1980’s. This may be due to the increase in longer genre books (such as History and Fantasy) that have seen a large increase in popularity in the past 20 years.

Were there any interesting or surprising interactions between features?

It is surprising that there is not a more definite correlation anywhere in the dataset. When initially loading the data I had predicted that there would be a very distinct difference in popular genres over the years and in length by genre. Although there is a slight relationship, it is nowhere near what I expected.

It is interesting though that longer books tend to have a better rank and higher rating than shorter books.


Final Plots and Summary

The GoodReads.com website contains a page for users to vote for the best books of all time. This list contains over 30,000 books from almost 140,000 voters. From this list I discovered an interesting trend in book genre popularity (yearly release count on the list).

Three book genres (Fantasy, Romance, and Young Adult) see a large increase in yearly count in the 2000’s. Comparing books per year for the years 1990 and 2010, there is a 6x, 19x, and 22x increase respectively! These are the only genres to see such an increase.

Before looking into genre popularity, here are a few interesting observations of the dataset not delved into in the final plots:

-There is an overall 80% increase in yearly book releases in the past 15 years.
-Most popular books are between 250 and 350 pages long.
-Fantasy and Fiction are the most popular book genres.
-Classics are the highest ranked books
-Historical books are the longest
-There is a small correlation between book length and rating

Plot One

Description One

There is a prodigious difference over the past 30 years for the Fantasy, Romance, and Young Adult literature genres. Each genre sees a sharp increase starting around the year 2000 that peaks around 2010.

While this rise may be due to an increase in overall releases thanks to the quicker and simpler methods of book publishing available now, why are not all genres seeing a similar increase? Could there be an outside reason for this boom in popularity?

Plot Two

Description Two

Research into blockbuster hits based on novels from the Fantasy, Young Adult, and Romance genres resulted in a list of seven movies from the 2000’s that may be a reason for the massive increase in the popularity of these genres. These are movies from the relevant genres that saw good ratings and box office numbers.

The movies are:
‘Lord of the Rings’ 2001 (Fantasy),
‘Harry Potter’ 2001 (Fantasy/Young Adult),
‘Chronicles of Narnia’ 2005 (Young Adult),
‘Twilight’ 2008 (Young Adult),
‘Notebook’ 2004 (Romance),
‘PS I Love You’ 2007 (Romance),
‘Pride and Prejudice’ 2005 (Romance).

It is interesting to see that each of these genres sees a massive increase in popularity in the years after a corresponding movie release. Three times as many Fantasy novels were released in 2011 as 10 years earlier, when LOTR and Harry Potter were out in theatres. Young Adult saw a 2.54x increase six years after Chronicles of Narnia, while Romance yielded a 2.52x increase only five years after Notebook and Pride and Prejudice.

Does this increase in popularity and book release count equate to an overall lower quality of book?

Plot Three

Description Three

Surprisingly, the massive increase in book releases for the three genres in question did not result in a large drop in quality across the board. Although Fantasy saw a bit of a wave in rank over the past 30 years, both Romance and Young Adult saw a steady improvement in rank (the lower the rank number, the better).

Reflection

While it cannot be definitively said that movie adaptations or mainstream media exposure of books lead to growth in the respective literature genre, it is hard to deny there is some influence. There is a large peak of interest in Fantasy, Young Adult, and Romance books over the past 15 years that may be partly thanks to popular movies based on books from those genres.

Only a handful of years after the releases of hits like ‘Lord of the Rings’, ‘Harry Potter’, and ‘Twilight’, the Fantasy and Young Adult literature markets are seeing several times more popular releases per year than in previous years. At least, according to GoodReads.com voters.

It was very difficult to find any significant relationship in the data (most relationships had a correlation under 0.1). Although I struggled with this problem, it was interesting to find such an extreme change in the popularity of books around the early 2000s in only three genres. I hope to have the time to explore the movie-book relationship further in the future.