About bad words in movies
(French version of the article)
Introduction
Hide your kids: in this article we are going to talk about bad words, mainly the word "fuck" and all its derived forms. Needless to say, the level of language will not be very high. In American cinema, the word "fuck" is almost treated as a punctuation mark (in French, people from the south of France sometimes use the word "putaing" the same way), and some movies use it to immerse the audience in a context of ordinary verbal violence. As I never invent anything, I used the Wikipedia list of films that most frequently use the word "fuck".
The first two places are taken by "Swearnet: The Movie" and "Fuck: A documentary on the word". Both are unrivalled because their core concept is precisely to focus on the word "fuck". These two movies use the word around 9 times per minute (we will write that as 9 FPM, in the international system of units). The goal is roughly the same as in the South Park episode "It Hits the Fan", where the word "shit" is used around 200 times in a 25-minute episode (hence roughly 8 FPM, if you are following correctly).
So I prefer to focus the analysis on "regular" movies (i.e., widely known movies that want to tell a story unrelated to the word "fuck"). The top three is then:
1- The Wolf of Wall Street, with Leonardo DiCaprio, which contains 569 fucks at 3.16 FPM
2- Summer of Sam, by Spike Lee, with 435 fucks at 3.06 FPM
3- Casino, by Martin Scorsese, with Robert De Niro, with 422 fucks at 2.4 FPM
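For readers who want to check the arithmetic behind the FPM unit, here is a minimal Python sketch. It only uses the counts and rates quoted above. The function names `fpm` and `implied_runtime` are made up for this illustration, and the runtimes it prints are derived from the quoted figures, not taken from an official source.

```python
def fpm(count, minutes):
    """Occurrences per minute: the 'FPM' unit used in this article."""
    return count / minutes

def implied_runtime(count, rate):
    """Runtime in minutes implied by a total count and an FPM rate."""
    return count / rate

# South Park example from the text: about 200 occurrences in 25 minutes.
print(fpm(200, 25))

# Runtimes implied by the top three (count, FPM) pairs quoted above.
top_three = {
    "The Wolf of Wall Street": (569, 3.16),
    "Summer of Sam": (435, 3.06),
    "Casino": (422, 2.4),
}
for title, (count, rate) in top_three.items():
    print(title, round(implied_runtime(count, rate)), "minutes")
```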
Data
As we have seen, counting the number of occurrences has already been done, but we could take the analysis a bit further and look at how they are distributed over the running time of the film. The goal is to locate when each occurrence is spoken. To do so, I see two possible methods:
The first one (called "pen-and-paper") is to watch the whole movie and write down on a sheet every time the word is said. It is a bit tedious and looks a little like a drinking game (but I advise you not to try taking a sip at every f-word).
The second one (called the "lazy method") consists in downloading the subtitle file of the film. This file contains all we need: the spoken words as well as the (approximate) time at which they are spoken. So we just need to parse this file in search of the pattern we are interested in.
Example of a subtitle file:
124
00:06:06,089 --> 00:06:07,488
Hey, fuck him.
We search for the pattern with the help of regular expressions and deduce the timing. In the previous example, the word is spoken between second 366 and second 367.5 of the movie. Depending on the position of the word in the sentence (in this case, in the middle), one can estimate the moment it is said. This method is not perfectly accurate, as subtitles tend to appear slightly before the line is actually spoken in the movie, but we do not really need to be accurate to the second anyway.
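The timing estimate described above can be sketched in Python as follows (the original analysis was done in R, so this is an illustration, not the author's script). It assumes the word's moment is interpolated linearly from its character position in the subtitle text, which is only one possible reading of "depending on the position of the word in the sentence". The names `TIMING`, `WORD`, `to_seconds` and `estimate_times` are hypothetical.

```python
import re

# Matches an SRT timing line such as "00:06:06,089 --> 00:06:07,488".
TIMING = re.compile(
    r"(\d+):(\d{2}):(\d{2}),(\d{3})\s*-->\s*(\d+):(\d{2}):(\d{2}),(\d{3})"
)
WORD = re.compile(r"fuck", re.IGNORECASE)

def to_seconds(h, m, s, ms):
    """Convert the captured hour/minute/second/millisecond strings to seconds."""
    return int(h) * 3600 + int(m) * 60 + int(s) + int(ms) / 1000

def estimate_times(cue):
    """Return the estimated time (in seconds) of each match in one subtitle cue."""
    lines = cue.strip().splitlines()
    for i, line in enumerate(lines):
        match = TIMING.search(line)
        if match:
            break
    else:
        return []  # no timing line found in this cue
    groups = match.groups()
    start, end = to_seconds(*groups[:4]), to_seconds(*groups[4:])
    text = " ".join(lines[i + 1:])
    if not text:
        return []
    # Assumption: speech advances linearly through the characters of the cue.
    return [
        start + (end - start) * hit.start() / len(text)
        for hit in WORD.finditer(text)
    ]

cue = """124
00:06:06,089 --> 00:06:07,488
Hey, fuck him."""
print(estimate_times(cue))
```

On the example cue, this returns a single estimate that falls between the start (366.089 s) and the end (367.488 s) of the subtitle.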
I have also decided to include all the variations of the word, such as "fucking" and "motherfucker", as I think they follow the same principle. Finally, and just as arbitrarily, I have decided to present the results grouped into 10-minute blocks.
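Here is a minimal Python sketch of these two choices: matching the variations and grouping the timings into 10-minute blocks. The pattern `\w*fuck\w*` is an assumption about how the variations could be caught (the exact expression used for the analysis is not given), and `count_variants` and `bin_by_block` are hypothetical helper names.

```python
import re
from collections import Counter

# Any word containing "fuck", so "fucking" and "motherfucker" also match.
VARIANTS = re.compile(r"\w*fuck\w*", re.IGNORECASE)
BLOCK_SECONDS = 10 * 60  # 10-minute blocks

def count_variants(text):
    """Count every word of the text that contains the pattern."""
    return len(VARIANTS.findall(text))

def bin_by_block(times_in_seconds, block=BLOCK_SECONDS):
    """Return the number of occurrences per block, from block 0 to the last non-empty one."""
    counts = Counter(int(t // block) for t in times_in_seconds)
    n_blocks = max(counts) + 1 if counts else 0
    return [counts.get(i, 0) for i in range(n_blocks)]

print(count_variants("Fucking motherfucker, fuck him."))
print(bin_by_block([366.5, 599.9, 600.0, 1900.0]))
```

Empty blocks are kept as zeros so that the resulting list can be plotted directly as a histogram over the movie timeline.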
Pictures
I was about to present histograms of the distribution of the f-word across the movie timeline, but they were not particularly appealing (aesthetically speaking). So I decided to add an extra piece of information I had seen before, and which does not have a proper name: the "average color of every frame of a movie, compressed in one picture".
Basically, a film is a series of pictures displayed 24 times per second. Each picture is made of pixels of different colors. Hence, for each of these pictures, we just have to average the colors to summarize the picture, then put the result next to those of the other pictures to create a synthetic timeline of the colors.
The result obtained for five movies is shown below:
We can easily see that each movie has a different color signature, like a unique barcode that could identify it. (If you are colorblind, you can skip this part.)
This result is mainly aesthetic, but I was hoping that the color variations would match the frequency of use of the word "fuck": rising tension in the movie could show up both as different color patterns and as a more intense use of the word.
The method to create such an output is rather easy. In my case, I used ffmpeg, a piece of software that can extract pictures from a video at regular intervals with a single command line.
ffmpeg -i TheWolfOfWallStreet.avi -vf fps=1 thumbs%05d.jpg
In this example, the command takes the video file as input and creates one JPEG picture per second of video, each numbered sequentially. The second part of the job is then to read these pictures one by one with your favorite software (I used R with the jpeg package), convert each color matrix into a single color value, and assemble these values into the final pattern.
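The averaging step can be sketched in Python as follows (the original work was done in R with the jpeg package, so this is an illustration and not the author's code). The sketch assumes each picture has already been decoded into rows of (red, green, blue) tuples with values from 0 to 255. Decoding the JPEG files would need an image library and is left out here. `average_color` and `color_barcode` are hypothetical names.

```python
def average_color(frame):
    """Average an image given as rows of (r, g, b) tuples into one (r, g, b) color."""
    total_r = total_g = total_b = count = 0
    for row in frame:
        for r, g, b in row:
            total_r += r
            total_g += g
            total_b += b
            count += 1
    if count == 0:
        raise ValueError("empty frame")
    # Integer division keeps each channel a whole number (rounded down).
    return (total_r // count, total_g // count, total_b // count)

def color_barcode(frames):
    """One average color per frame, in chronological order."""
    return [average_color(frame) for frame in frames]

# Two tiny made-up "frames" of 1 x 2 pixels each.
frames = [
    [[(10, 20, 30), (30, 40, 50)]],
    [[(200, 0, 0), (100, 0, 0)]],
]
print(color_barcode(frames))
```

Drawing each color of the resulting list as a thin vertical stripe, from left to right, gives the barcode-like picture described above.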
Results
The Wolf of Wall Street
A nice peak around two thirds of the way through the movie, with 60 fucks in 10 minutes.
Summer of Sam
Not bad, especially at the end.
Casino
Again, a nice ending with 80 fucks in 10 minutes.
Conclusion
The three movies analyzed share the same pattern: a peak near the end of the movie, where the action is more intense. We can also see an overall greyish color in all of them. Casino is slightly more pinkish, while The Wolf of Wall Street is more grey-brown.
It would also have been nice to make these graphs interactive, so that the line and the picture of the scene appear when hovering the mouse over an item.