Can we find a generic oil price pattern for the whole year? Are there hidden patterns in oil price movements that still need to be investigated? Is the Chinese New Year more important than the US driving season or the Hajj period in Saudi Arabia? Questions about oil prices never cease. That is an attractive property of oil price discussions, and the zenith of such discussions is the oil price forecast. Experts refrain; the ignorant try their chances with a gambler's enthusiasm. In this article, oil price behaviour will be explored with Python. The code and the relevant parts are available on my web page.

Data
Data is downloaded from the EIA website at https://www.eia.gov/dnav/pet/hist_xls/RBRTEd.xls . It includes daily Brent prices from 1987 onwards and provides a good source for analysis.

Libraries
I do not enjoy Python's red alert warnings. They are important, no doubt. In my first run I let them scream, but for notebook purposes I choose to suppress them. In [85]:
# Supress warnings
import warnings
warnings.filterwarnings('ignore')
Then come the necessary libraries for the analysis. I use %pylab inline to load matplotlib (the plotting library) and numpy (for numerical analysis), and then import pandas. In [86]:
# use %pylab inline to load numpy and matplotlib
%pylab inline
# now include Pandas library for data analysis
import pandas as pd
A first look at data
For the rest of the exercise, I will use one data source, an Excel file. If it fails to load, check your internet connection and make sure the xlrd library is installed; from a terminal, use the command "pip install xlrd". My data source is named Brent_data. In [87]:
Brent_data="https://www.eia.gov/dnav/pet/hist_xls/RBRTEd.xls"
The read_excel command from the pandas library can directly download and convert the data into a data frame. The Excel file may have extra rows before the data starts, so we have to tell it how many rows to skip. In [88]:
prices=pd.read_excel("https://www.eia.gov/dnav/pet/hist_xls/RBRTEd.xls",sheet_name="Data 1", skiprows=2)
Let's see the shape of the data with the shape attribute. In [89]:
prices.shape
Out[89]:
Our data has 8043 rows and 2 columns. Now let's check the beginning and end of the data set with the head and tail commands. It is also possible to set the number of rows you want to see. In [90]:
#first 4 data points
prices.head(4)
Out[90]:
In [91]:
# the last 3 data points
prices.tail(3)
Out[91]:
And let's see the column titles with the list command. In [92]:
list(prices)
Out[92]:
Plotting
Plotting the whole series can be easily done with the plot command. WARNING: if you didn't run the initial %pylab inline command, you will see errors. In data frames, type ' after the square brackets and press TAB to see the list of columns. In [93]:
plot(prices['Europe Brent Spot Price FOB (Dollars per Barrel)'])
Out[93]:
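If you prefer not to rely on %pylab, the same plot can be produced with explicit imports. A minimal sketch (assuming matplotlib is installed alongside pandas):
# Explicit-import version of the plot above; no %pylab magic required
import matplotlib.pyplot as plt
plt.plot(prices['Europe Brent Spot Price FOB (Dollars per Barrel)'])
plt.show()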
Histogram
A histogram is a plot of an ordered list of frequencies. You can think of bins as the number of divisions on the X axis. With a simple hist command we can get the distribution of oil prices. In [94]:
hist(prices['Europe Brent Spot Price FOB (Dollars per Barrel)'])
Out[94]:
Now let's increase the resolution with the bins parameter. In [95]:
hist(prices['Europe Brent Spot Price FOB (Dollars per Barrel)'], bins=100);
See the difference? In the second plot we just used bins=100 to increase the resolution on the X axis. But these numbers are oil price levels; the histogram of changes can be much more meaningful. The percentage change of a data set is obtained with the .pct_change() method. In [96]:
hist(prices['Europe Brent Spot Price FOB (Dollars per Barrel)'].pct_change(), bins=100 );
This brings us to our first conclusion: oil prices do not change a lot daily. But what is the extent of these changes? Now we will investigate.

A Closer Look
Most of the percent changes in oil price data lie between -0.1 (-10%) and 0.1 (10%). The automatic range on the X axis shows us that there are extreme negatives (around -30%) and extreme positives (around +20%). We may want to look at these regions closely. As a first step, we will assign the percentage changes to a new variable called pc. Then we will produce the histogram of the pc (percent change) data. It should look similar to the one above. In [97]:
pc=prices['Europe Brent Spot Price FOB (Dollars per Barrel)'].pct_change()
hist(pc, bins=100);
Filtering Data
The graph above is too cumbersome to see most of the action. I need to filter the data. Specifically, I want to look at the 1% (=0.01) to -1% (=-0.01) range. With conditional and logical operators I can choose the numbers that obey a rule, such as larger than 0.01. If pc>0.01 is executed, it produces a series of True and False values. When these True/False values are fed back into the original data, only the cells where the value is True are selected. Let's do it step by step. Step 1: values larger than 0.01 (meaning >0.01); you can write it without parentheses. To save space, I am looking at the first 5 data points. In [98]:
(pc>0.01 ).head(5)
Out[98]:
Now, we will use the True/False values to select the data points bigger than 1% (=0.01). In [99]:
pc[pc>0.01].head(5)
Out[99]:
Now I can combine two conditions with the logical and (&) operator. For example, percent changes smaller than 1% and larger than -1% translate to (pc<0.01) & (pc>-0.01). In [100]:
pc[(pc<0.01) & (pc>-0.01)].head(3)
Out[100]:
Now we can plot the histogram of data between 1% and -1% again In [101]:
hist(pc[(pc<0.01) & (pc>-0.01)], bins=100);
This looks quite strange; let's look at the 5% range instead. In [102]:
hist(pc[(pc<0.05) & (pc>-0.05)], bins=100);
An easier way is to just pass a range to the hist command. The range=(-0.05, 0.05) parameter tells hist that we would like to see only the -5% to 5% range, without filtering the data. In [103]:
hist(pc, bins=100, range=(-0.05,0.05));
Calculating extreme and normal condition probabilities
So we can filter data and count how many values survive the filter. Now we can try to find the probability of oil price movements between -1% and 1% in the whole series. First we will use the cumbersome method, then the .count() command. In [104]:
# What was the shape of original data?
pc.shape
Out[104]:
Since shape returns a tuple, we can extract the number of rows with [0]. In [105]:
pc.shape[0]
Out[105]:
We know that there are 8043 rows (in pandas, (rows, columns) is the convention). We can also call a command named count. In [106]:
pc.count()
Out[106]:
Warning: with shape we got 8043, but with count we found 8042. The difference is because count() excludes NA values. For the probabilities I will show both ways. Probability of a move between -1% and 1% = (number of cells between -1% and 1%) / (total data population). Let's start with 1%. In [107]:
# For 1%
pc[(0.01>pc) & (-0.01<pc)].shape[0]/pc.shape[0]
Out[107]:
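The same probability can also be computed with count(), which sidesteps the NA issue. A small sketch of the alternative:
# count()-based version: count() ignores NaN in both numerator and denominator
pc[(pc > -0.01) & (pc < 0.01)].count() / pc.count()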
How many cells are in between 5% and -5%? In [108]:
pc[(0.05>pc) & (-0.05<pc)].shape[0]
Out[108]:
So probability for 5% should be 7765/8043 In [109]:
7765/8043
Out[109]:
Now we can calculate and print the ranges in order In [110]:
print("The probability of oil price changes between 1% and -1%% is ", (100*pc[(0.01>pc) & (-0.01<pc)].shape[0]/pc.shape[0]))
print("The probability of oil price changes between 5% and -5%% is " ,(100*pc[(0.05>pc) & (-0.05<pc)].shape[0]/pc.shape[0]))
print("The probability of oil price changes larger than 5% is " , (100*(pc[(0.05<pc)].shape[0])/pc.shape[0]))
print("The probability of oil price changes lower(extreme) than -5% is " , (100*pc[(-0.05>pc)].shape[0]/pc.shape[0]))
Unfortunately the output is not that pretty. The print command can also do formatting. Print formatting can be tricky, especially if you are using the % (percent) sign in the text part.
In [111]:
print("The probability of oil price changes between 1%% and -1%% is %.2f%%" % (100*pc[(0.01>pc) & (-0.01<pc)].shape[0]/pc.shape[0]))
print("The probability of oil price changes between 5%% and -5%% is %.2f%%" % (100*pc[(0.05>pc) & (-0.05<pc)].shape[0]/pc.shape[0]))
print("The probability of oil price changes larger than 5%% is %.2f%%" % (100*(pc[(0.05<pc)].shape[0])/pc.shape[0]))
print("The probability of oil price changes lower(extreme) than -5%% is %.2f%%" % (100*pc[(-0.05>pc)].shape[0]/pc.shape[0]))
Investigating Extremes - Minimum and Maximum values and their position
This time, the extreme values will be examined. As a start, we can plot the biggest drops and biggest increases in oil prices. The largest single-day price falls are as follows (checking the range -31% to -5%). In [112]:
hist(pc, bins=100, range=(-0.31,-0.05));
Are we sure that this is the minimum? We can check with the min command. In [113]:
pc.min()
Out[113]:
The largest price drop is -30% on a SINGLE DAY. Wow. What was the day? The command for the position of the minimum is idxmin(). In [114]:
pc.idxmin()
Out[114]:
That was dull: just the position of the minimum value. We could insert this number into the array to get the value, or we can do better and look at the dates around that position. To find those dates, we have to turn back to our initial data, prices, since pc does not carry the corresponding date values. For accessing data on a specific row or column we use the iloc notation, as in data.iloc[row, column]. There are two ways we can do it:
In [115]:
prices.iloc[[935,936,937]]
Out[115]:
In [116]:
prices.iloc[range(935,938)]
Out[116]:
Why not pc? Let's try the same with pc In [117]:
pc.iloc[[935,936,937]]
Out[117]:
As you see, there are no date values! We have to check the original data. There is another way to solve this problem: a new column can be created in the original dataset and filled with percent change values. We will do that in the next part. With .idxmax(), the maximum value's row can be found. In [118]:
# max value
pc.idxmax()
Out[118]:
In [119]:
prices.iloc[range(5498,5501)]
Out[119]:
As a result:
Increasing Complexity: Creating a bigger data set with Week, Month, Percent Change
Our original dataset prices includes all oil prices, but we may want to add more columns to it. As a first step, we can add the week number as a new column. It is straightforward: assign the data to the original set under a new title, in this case "Week_Number". In [120]:
# Add a new column Week_Number: convert the Date column into the week number
# Note: .dt.week works on older pandas; newer versions use prices["Date"].dt.isocalendar().week instead
prices["Week_Number"]=prices["Date"].dt.week
Now let's see the dataset's ending In [121]:
prices.tail(3)
Out[121]:
If there are no problems so far, we can continue with other relevant data In [122]:
prices["Month"]=prices["Date"].dt.month
prices["Year"]=prices["Date"].dt.year
prices["DoW"]=prices["Date"].dt.dayofweek
The dataset has been transformed into its new form. For the day of the week, 0 corresponds to Monday, 1 to Tuesday, and so on. Since there is no oil price trading on weekends, the numbers must be between 0 and 4. In [123]:
#let's see last 4 lines
prices.tail(4)
Out[123]:
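As a quick sanity check (a small sketch, not part of the original flow), we can confirm that only weekday codes appear:
# Should print only values 0-4 (Monday-Friday); weekends are absent from the price series
print(sorted(prices["DoW"].unique()))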
Now we can continue with our analysis.

Pivot Tables in Python Pandas
Grouping
One of the easiest starting points for creating aggregated data is grouping. We can group the data by weeks. In pandas you can use groupby to group data according to a column; with .mean() or .count(), the aggregate outputs are obtained. In [124]:
plot(prices.groupby("Week_Number")['Europe Brent Spot Price FOB (Dollars per Barrel)'].mean())
Out[124]:
Unfortunately, depending on the year, there may be a 53rd week. I prefer to drop it by taking only the first 52 weeks. In [125]:
plot(prices.groupby("Week_Number")['Europe Brent Spot Price FOB (Dollars per Barrel)'].mean().head(52))
Out[125]:
This graph needs a title and axis labels. We will use plt.title and plt.xlabel; %pylab inline already imports matplotlib.pyplot as plt. In [126]:
plot(prices.groupby("Week_Number")['Europe Brent Spot Price FOB (Dollars per Barrel)'].mean().head(52))
plt.title("52 Week average of 1987-2019 Oil Prices")
plt.xlabel("Week")
plt.ylabel("$/barrel")
Out[126]:
Now let's take it to another level. For example, let's see the weekly price movements for 2018. It requires two steps: 1) filter the data for 2018 with prices[prices.Year==2018]; 2) the filtered dataset still has hundreds of rows, so just peek at the first few lines with .head(4). In [127]:
pf=prices[prices.Year==2018]
pf.head(4)
Out[127]:
Now, we will group the filtered results In [128]:
# Just look at the averages for the first 4 week
pf.groupby("Week_Number")['Europe Brent Spot Price FOB (Dollars per Barrel)'].mean().head(4)
Out[128]:
Now we can choose the data for 2018 and view weekly means In [129]:
plot(pf.groupby("Week_Number")['Europe Brent Spot Price FOB (Dollars per Barrel)'].mean())
Out[129]:
Or we can write it in a longer, variable-free way. In [130]:
plot(prices[prices.Year==2018].groupby("Week_Number")['Europe Brent Spot Price FOB (Dollars per Barrel)'].mean())
Out[130]:
Now we can check the last 5 and 10 years by filtering on the Year column:
In [131]:
plot(prices[prices.Year>=2014].groupby("Week_Number")['Europe Brent Spot Price FOB (Dollars per Barrel)'].mean().head(52))
Out[131]:
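The same one-liner works for the last 10 years; a sketch (assuming a 2009 cutoff):
# Same grouping, filtered to the last 10 years of data (2009 onwards)
plot(prices[prices.Year>=2009].groupby("Week_Number")['Europe Brent Spot Price FOB (Dollars per Barrel)'].mean().head(52))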
Or we can do it in a more readable way: 1) assign the partial dataset (years greater than or equal to 2014) to pf (price frame); 2) then use the groupby command on pf. In [132]:
pf=prices[prices.Year>=2014]
In [133]:
plot(pf.groupby("Week_Number")['Europe Brent Spot Price FOB (Dollars per Barrel)'].mean().head(52))
# title for the plot and ";" for suppressing text output
title("Weekly oil prices (2014-2019)");
Pivot Table
Warning: I have to warn the reader that grouping and pivoting produced slightly different results for the averages. At the time of writing, I could not figure out why; I tried the NA (not available) data options and others, but with no result unfortunately. Pivot tables are an essential part of data analysis. Basically, a pivot table rearranges the dataset to the user's liking, and additional mathematical operations like averaging or summing can be applied. Now, from the original prices dataset, we will rearrange the values with Week_Number as the index, Year as the columns, and the mean Brent price as the values.
It may seem silly to the careful observer to use the long name "Europe Brent Spot Price FOB (Dollars per Barrel)" for the Brent price, but in Jupyter you just press TAB and it pops up. In [134]:
pt=prices.pivot_table( values='Europe Brent Spot Price FOB (Dollars per Barrel)', index=['Week_Number'],
columns=['Year'], aggfunc=np.mean)
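One plausible explanation for the warning above (my assumption, worth verifying): groupby("Week_Number").mean() averages every daily observation in a week, while taking the pivot table and then mean(axis=1) averages the per-year means, which weights each year equally even when it has fewer observations. A small sketch to compare the two:
# Compare: mean over all daily observations vs. mean of the per-year weekly means
col='Europe Brent Spot Price FOB (Dollars per Barrel)'
direct=prices.groupby("Week_Number")[col].mean()
of_yearly_means=pt.mean(axis=1)
print((direct-of_yearly_means).abs().max())  # a non-zero difference points to the weighting effect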
Now let's check pt (pt for pivot table) In [135]:
pt.head(4)
Out[135]:
More Filtering
The first thing that stands out in the table is 1987: there are lots of NaN values until May, because the dataset starts in May 1987. Apart from that, we now have weekly averages for every year in the columns. Now we can choose the years 2014 to 2019 and plot their first 52 weeks.
In [136]:
plot(pt.loc[:,range(2014,2020)].mean(axis=1).head(52))
title ("Weekly price changes (2014-2019)");
You can check what the range includes by unpacking it with * inside a print command. In [137]:
# range command does not include the last element
print(*range(2014,2020))
If everything is clear, we can produce the percent changes of weekly oil prices. Just call the pct_change() method on the data frame and it will work. In [138]:
ptpc=pt.pct_change()
Always check the data, or at least part of it (first 4 lines with head(4) ) In [139]:
ptpc.head(4)
Out[139]:
The first line is NaN for each year, since we did not prepend the previous year's December (the 12th month) to the head of January. But now let's graph the percent changes for the whole dataset again. In [140]:
plot(ptpc.mean(axis=1))
title("Weekly oil price changes(1987-2019)")
xlabel("Weeks")
ylabel("Percent change Week-on-Week");
5 year and 10 year average weekly prices
Instead of percent changes, let's turn back to the original dataset pt. For the last 5 and 10 years, we will graph the levels first; afterwards, percent changes will be examined.
In [141]:
# Last 5 years: 2014, 2015, 2016, 2017, 2018
plot(pt.loc[:,range(2014,2019)].mean(axis=1).head(52))
Out[141]:
Now let's see the 10 year data In [142]:
# 10 years of average weekly prices
plot(pt.loc[:,range(2009,2019)].mean(axis=1).head(52));
For the 10-year average data there is a huge jump before the 10th week, but I would rather use the last 5 years: US shale production has shown its real effect over that period. You can repeat the exercise with the 10-year window yourself. To look more closely, we will take the percent change and graph it again. In [143]:
pt_weekly=pt.loc[:,range(2014,2019)].mean(axis=1).head(52)
plot(pt_weekly.pct_change());
Let's assign percent changes to a new variable. We will do more filtering In [144]:
pt_weekly_pc=pt_weekly.pct_change()
This time we will filter the weeks where the change is bigger than 2.9% in absolute terms.
In [145]:
pt_weekly_pc[abs(pt_weekly_pc)>0.029]
Out[145]:
Converting week numbers to dates
Now we can convert these week numbers to dates with the time module. Converting week numbers to dates is tricky. We again choose the weeks with changes above 2.9% in absolute terms (an arbitrary cutoff), and then insert the corresponding week numbers (.index.values) into strptime. In [146]:
import time
# Interpret each week number as a week of 2019 (an arbitrary reference year) and print its Monday
for weeks in pt_weekly_pc[abs(pt_weekly_pc)>0.029].index.values:
    print(time.asctime(time.strptime('{} {} 1'.format(2019, weeks), '%Y %W %w')))
For the last 5 years, these are the weeks with the strongest average price moves.
Monthly data
That concludes the weekly data part. In my original paper, I did the analysis for the last 10 years; in this tutorial it is for the last 5. For the monthly data analysis, there is one more step to handle: the percent changes for January would appear as NA. Therefore, we have to copy each previous year's December value in front of January, shifted by one year (December 1987 goes before January 1988, and so on). Our steps are as follows:
In [147]:
# Prices monthly data to pm variable
pm=prices.pivot_table( values='Europe Brent Spot Price FOB (Dollars per Barrel)', index=['Month'],
columns=['Year'], aggfunc=np.mean)
The first 4 lines of the new variable: In [148]:
pm.head(4)
Out[148]:
Now comes the tricky part
In [149]:
# Take every year's December (row 12), shift it forward one year, and store it as new row 0
pm.loc[0]=pm.loc[12,:].shift(1)
pm.head(3)
Out[149]:
We cannot see the 0th row: the pasted row is appended as the last line. Our index is the month, so sort_index() will solve the problem. In [150]:
pm=pm.sort_index()
pm.head(3)
Out[150]:
Now we can calculate percent change In [151]:
pmc=pm.pct_change()
pmc.head(3)
Out[151]:
Now we can delete row 0, since it is full of NaN values. To delete row 0, the drop command is used; do not forget to assign the result back to the original variable. In [152]:
pmc=pmc.drop(pmc.index[0])
In [153]:
pmc.head()
Out[153]:
Now we can calculate the last 10 years of percent changes for each month.
In [154]:
pmc.loc[:,range(2009,2020)].mean(axis=1)
Out[154]:
For the last 10 years, February to April are most likely positive months, while November and December are probably negative months. Now let's check 1987-2019. In [155]:
pmc.mean(axis=1)
Out[155]:
Combining the two results:
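A small sketch of one way to combine them (my construction, not from the source): put the 10-year and full-period monthly averages side by side in one data frame.
# Average monthly percent changes, last 10 years vs. the whole history
comparison=pd.DataFrame({
    "2009-2019": pmc.loc[:,range(2009,2020)].mean(axis=1),
    "1987-2019": pmc.mean(axis=1),
})
comparison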
In more detail, we can calculate the probabilities for each month. The process is straightforward.
In [156]:
pmc["Positive"]=pmc[(pmc.iloc[:,:]>0)].count(axis=1)/(2020-1987+1)
Do the same for negative price movements In [157]:
pmc["Negative"]=pmc[(pmc.iloc[:,:]<0)].count(axis=1)/(2020-1987+1)
Let's see the total results for Positive and Negative columns In [158]:
pmc[["Positive","Negative"]]
Out[158]:
The results above mean that for April (month=4), the probability of a price increase (positive) is 70% (=0.705), whereas the probability of a price decrease is 26% (=0.26).

More Monthly Graphics
Now we can plot the average monthly prices for 2009-2019 and 1987-2019. In [159]:
plot(pm.loc[:,range(2009,2020)].mean(axis=1))
plt.title('2009-2019 Averaged Monthly Prices')
plt.xlabel("Months (0 = previous December)")
plt.ylabel("$/barrel");
In [160]:
plot(pm.loc[:,range(1987,2020)].mean(axis=1))
plt.title('1987-2019 Averaged Monthly Prices')
plt.xlabel("Months (0 = previous December)")
plt.ylabel("$/barrel");
Day of the Week effect
In oil prices, there may be Monday or Friday effects; "Never short on Friday" is a known saying among traders. We can test it. First let's look at the data again; then we will rearrange it with a pivot table where the columns are the day of the week. In [161]:
prices.head(3)
Out[161]:
We can add a percentage change column to the data In [162]:
prices["Percent"]=prices["Europe Brent Spot Price FOB (Dollars per Barrel)"].pct_change()
prices.head(3)
Out[162]:
Now we can use pivot_table to arrange our data. This time the index is Year and the columns will be the day of the week (0=Monday, 1=Tuesday, ..., 4=Friday). In [163]:
pday=prices.pivot_table( values='Percent', index=['Year'],columns=['DoW'], aggfunc=np.mean)
# This time check the last 5 lines
pday.tail(5)
Out[163]:
How many rows and columns are there in our new variable pday ? In [164]:
pday.shape
Out[164]:
We have yearly averages and we can count positive and negative values for each day In [165]:
pday[pday>0].count()
Out[165]:
To make sure, here is the number of negative percent changes. In [166]:
pday[pday<0].count()
Out[166]:
And the probability calculations are pretty straightforward: divide the count by 33, i.e. the number of rows. In [167]:
pday[pday>0].count()/pday.shape[0]
Out[167]:
Day of the week 3 has a probability of 72%. That means a Thursday most probably ends up with a positive oil price movement. In [168]:
pday[pday<0].count()/pday.shape[0]
Out[168]:
On the negative side, Mondays and Tuesdays have a higher probability of being negative.