Can we find a generic oil price pattern for the whole year? Are there hidden patterns in oil price movements that still need to be investigated? Is the Chinese New Year more important than the US driving season or the Hajj period in Saudi Arabia? Questions about oil prices never cease. That is an attractive property of oil price discussions, and the zenith of such discussions is oil price forecasts. Experts refrain; the ignorant try their chances with a gambler's enthusiasm. In this article, oil price behaviour will be explored with Python. The code and the relevant parts are available on my web page. ## Data¶Data is downloaded from the EIA website at https://www.eia.gov/dnav/pet/hist_xls/RBRTEd.xls . It includes daily Brent prices since 1987 and provides a good source for analysis. ## Libraries¶I do not enjoy Python's red alert warnings. They are important, no doubt. In my first run I let them scream, but for notebook purposes I choose to suppress them. In [85]:
```
# Suppress warnings
import warnings
warnings.filterwarnings('ignore')
``` Then come the necessary libraries for the analysis. I use %pylab inline to include matplotlib (the plotting library) and numpy (for numerical analysis), and then import pandas. In [86]:
```
# use pylab inline to load numpy and matplotlib
%pylab inline
# now include Pandas library for data analysis
import pandas as pd
``` ## A first look at data¶For the rest of the exercise, I will use one data source. It is an Excel file. If it fails to load, check your internet connection and make sure that you have installed xlrd (`pip install xlrd`).
My data source is named Brent_data. In [87]:
```
Brent_data="https://www.eia.gov/dnav/pet/hist_xls/RBRTEd.xls"
```
In [88]:
```
prices=pd.read_excel(Brent_data, sheet_name="Data 1", skiprows=2)
``` Let's see the shape of the data with the shape attribute. In [89]:
```
prices.shape
``` Out[89]:
Our data has 2 columns and 8043 rows. Now let's check the beginning and end of our data set.
We have to call head and tail. In [90]:
```
#first 4 data points
prices.head(4)
``` Out[90]:
In [91]:
```
# the last 3 data points
prices.tail(3)
``` Out[91]:
And let's see the column titles with the list command. In [92]:
```
list(prices)
``` Out[92]:
## Plotting¶Plotting the whole series can easily be done with plot. In data frames, after the opening square bracket type ' and then press TAB to see the list of column names. In [93]:
```
plot(prices['Europe Brent Spot Price FOB (Dollars per Barrel)'])
``` Out[93]:
## Histogram¶A histogram is a plot of an ordered list of frequencies. You can produce one with a simple hist call. In [94]:
```
hist(prices['Europe Brent Spot Price FOB (Dollars per Barrel)'])
``` Out[94]:
Now let's increase the resolution with the bins parameter. In [95]:
```
hist(prices['Europe Brent Spot Price FOB (Dollars per Barrel)'], bins=100);
``` See the difference? In the second plot we just used bins=100. But the numbers are price levels; the histogram of daily changes can be much more meaningful. The percentage change of a data set is computed with the pct_change method. In [96]:
```
hist(prices['Europe Brent Spot Price FOB (Dollars per Barrel)'].pct_change(), bins=100 );
``` This brings us to our first conclusion: ## A Closer Look¶Most of the percent changes in oil prices are concentrated between -0.1 (-10%) and 0.1 (10%). The automatic range on the X axis shows us that there are extreme negatives (around -30%) and extreme positives (around 20%). We may want to look at these regions closely. As a first step, we will assign the percentage changes to a new variable called pc. Then we will produce the histogram of the pc (percent change) data. It should look similar to the one above. In [97]:
```
pc=prices['Europe Brent Spot Price FOB (Dollars per Barrel)'].pct_change()
hist(pc, bins=100);
``` ## Filtering Data¶The graph above is too cumbersome to see most of the action. I need to filter the data. Specifically, I want to look at the 1% (=0.01) to -1% (=-0.01) range. With conditional and logical operators I can choose numbers that satisfy my rules, such as larger than 0.01. If pc>0.01 is executed, it produces a Series of True/False values. Let's do it one by one. Step 1. Values larger than 0.01 (meaning >0.01) - you can write it without parentheses. To save space, I am looking at the first 5 data points. In [98]:
```
(pc>0.01 ).head(5)
``` Out[98]:
Now we will use the True-False values to select the data points bigger than 1% (=0.01). In [99]:
```
pc[pc>0.01].head(5)
``` Out[99]:
Now I can combine two conditions with the logical and (&) operator. Percent changes smaller than 1% and larger than -1% translate to:
In [100]:
```
pc[(pc<0.01) & (pc>-0.01)].head(3)
``` Out[100]:
Now we can plot the histogram of the data between 1% and -1% again. In [101]:
```
hist(pc[(pc<0.01) & (pc>-0.01)], bins=100);
``` This looks quite strange; let's look at the 5% range. In [102]:
```
hist(pc[(pc<0.05) & (pc>-0.05)], bins=100);
``` An easier way is to just pass a range to the hist command.
In [103]:
```
hist(pc, bins=100, range=(-0.05,0.05));
``` ## Calculating extreme and normal condition probabilities¶So we can filter the data and count the rows that the filter returns. Now we can try to find the probability of oil price movements between -1% and 1% over the whole series. First we will use the cumbersome method, then a shorter one. In [104]:
```
# What was the shape of original data?
pc.shape
``` Out[104]:
Since shape returns a (rows, columns) tuple, the number of rows is shape[0]. In [105]:
```
pc.shape[0]
``` Out[105]:
We know that there are 8043 rows (in Python, (rows, columns) is the general usage). We can also call the count method. In [106]:
```
pc.count()
``` Out[106]:
For the probabilities I will illustrate both ways. Probability of -1% to 1% = (number of cells between 1% and -1%) / (total data population). Let's start with 1%. In [107]:
```
# For 1%
pc[(0.01>pc) & (-0.01<pc)].shape[0]/pc.shape[0]
``` Out[107]:
How many cells are in between 5% and -5%? In [108]:
```
pc[(0.05>pc) & (-0.05<pc)].shape[0]
``` Out[108]:
So the probability for 5% should be 7765/8043. In [109]:
```
7765/8043
``` Out[109]:
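The repeated filter-and-divide steps above can be wrapped in a small helper. This is a hypothetical convenience (prob_in_range is not part of the notebook), shown here on toy numbers rather than the Brent series:

```python
import pandas as pd

def prob_in_range(series, lo, hi):
    """Share of observations strictly between lo and hi."""
    inside = series[(series > lo) & (series < hi)]
    return inside.shape[0] / series.shape[0]

# toy example: five daily percent changes
pc_toy = pd.Series([0.005, -0.003, 0.02, -0.06, 0.0])
print(prob_in_range(pc_toy, -0.01, 0.01))  # → 0.6
```

On the real pc series, `prob_in_range(pc, -0.01, 0.01)` reproduces the first calculation above in one call.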
Now we can calculate and print the ranges in order In [110]:
```
print("The probability of oil price changes between 1% and -1%% is ", (100*pc[(0.01>pc) & (-0.01<pc)].shape[0]/pc.shape[0]))
print("The probability of oil price changes between 5% and -5%% is " ,(100*pc[(0.05>pc) & (-0.05<pc)].shape[0]/pc.shape[0]))
print("The probability of oil price changes larger than 5% is " , (100*(pc[(0.05<pc)].shape[0])/pc.shape[0]))
print("The probability of oil price changes lower(extreme) than -5% is " , (100*pc[(-0.05>pc)].shape[0]/pc.shape[0]))
``` Unfortunately the output is not that pretty. The print command can also do formatting, but print formatting can be tricky, especially if you are using the % (percent) sign in the text part. - You have to replace
**all '%' signs with '%%'** so Python does not confuse them with the formatting sign. - Instead of a comma (,) use
*%* after the closing quote " - Insert
**'%.2f'** (meaning 2 decimal digits) at the relevant position in the string to be printed
In [111]:
```
print("The probability of oil price changes between 1%% and -1%% is %.2f%%" % (100*pc[(0.01>pc) & (-0.01<pc)].shape[0]/pc.shape[0]))
print("The probability of oil price changes between 5%% and -5%% is %.2f%%" % (100*pc[(0.05>pc) & (-0.05<pc)].shape[0]/pc.shape[0]))
print("The probability of oil price changes larger than 5%% is %.2f%%" % (100*(pc[(0.05<pc)].shape[0])/pc.shape[0]))
print("The probability of oil price changes lower(extreme) than -5%% is %.2f%%" % (100*pc[(-0.05>pc)].shape[0]/pc.shape[0]))
``` ## Investigating Extremes - Minimum and Maximum values and their position¶This time, the extreme values will be looked at. As a start, we can plot the biggest drops and biggest increases in oil prices. The largest single-day price falls are as follows (checking the range -31% to -5%). In [112]:
```
hist(pc, bins=100, range=(-0.31,-0.05));
``` Are we sure that this is the minimum? We can check with the min method. In [113]:
```
pc.min()
``` Out[113]:
The position of the largest single-day price drop is found with idxmin. In [114]:
```
pc.idxmin()
``` Out[114]:
That was dull: just the position of the minimum value. We could insert this number into the array to get the value, or we can do better: look at the dates around that position. To find them, we turn back to our initial data. For accessing data on a specific row or column we use iloc. There are two ways we can do it: - Put in an array of days such as
**[935,936,937]** - Or use the range command to get the same result as in the 1st item:
**range(935,938)**
In [115]:
```
prices.iloc[[935,936,937]]
``` Out[115]:
In [116]:
```
prices.iloc[range(935,938)]
``` Out[116]:
Why not use pc directly? In [117]:
```
pc.iloc[[935,936,937]]
``` Out[117]:
As you see, there are no date values! We have to check the original data. There is another way to solve this problem: a new column can be created and filled with percent change values in the original dataset. We will do it in the next part. With idxmax we can find the position of the maximum as well. In [118]:
```
# max value
pc.idxmax()
``` Out[118]:
In [119]:
```
prices.iloc[range(5498,5501)]
``` Out[119]:
``` As a result: - The biggest fall in oil prices happened on 17 Jan 1991, when prices dropped from \$30 to \$21
- The biggest increase was observed on 2 Jan 2009, when prices jumped from \$35 to \$42
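Instead of browsing rows around the idxmin/idxmax positions by hand, the dates of the extremes can be looked up directly with loc. A minimal sketch on a toy stand-in for the dataset (made-up prices, not the full Brent series):

```python
import pandas as pd

# toy stand-in for the prices dataset: Date plus a price column
toy = pd.DataFrame({
    "Date": pd.to_datetime(["1991-01-16", "1991-01-17", "2009-01-02"]),
    "Price": [30.0, 21.0, 42.0],
})
pc_toy = toy["Price"].pct_change()
print(toy.loc[pc_toy.idxmin(), "Date"])  # date of the largest single-day drop
print(toy.loc[pc_toy.idxmax(), "Date"])  # date of the largest single-day jump
```

This works because pc_toy keeps the same index as the original frame, so the position returned by idxmin/idxmax can be fed straight back into loc.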
## Increasing Complexity : Creating a bigger data set with Week, Month, Percent Change¶Our original dataset contains only dates and prices. We will add - Day
- Week
- Month
- Year
- Percent Change
data to the original set. As a first step, we can add the week number as a new column to the original dataset. It is straightforward: add the data with a new title, in this case "Week_Number". In [120]:
```
# Add a new column Week_number and convert the Date data in the prices dataset to week number
# note: Series.dt.week was removed in pandas 2.x; use .dt.isocalendar().week there
prices["Week_Number"]=prices["Date"].dt.week
``` Now let's see the dataset's ending In [121]:
```
prices.tail(3)
``` Out[121]:
If there are no problems so far, we can continue with other relevant data In [122]:
```
prices["Month"]=prices["Date"].dt.month
prices["Year"]=prices["Date"].dt.year
prices["DoW"]=prices["Date"].dt.dayofweek
``` The dataset has been transformed into its new form. For a check, let's see the last 4 lines. In [123]:
```
#let's see last 4 lines
prices.tail(4)
``` Out[123]:
``` Now we can continue with our analysis. ## Pivot Tables in Python Pandas¶## Grouping¶One of the easiest starting points for creating aggregated data is grouping. We can group the data according to weeks. In [124]:
```
plot(prices.groupby("Week_Number")['Europe Brent Spot Price FOB (Dollars per Barrel)'].mean())
``` Out[124]:
``` Unfortunately, depending on the year, there may be a 53rd week. I prefer to drop it by taking the first 52 weeks. In [125]:
```
plot(prices.groupby("Week_Number")['Europe Brent Spot Price FOB (Dollars per Barrel)'].mean().head(52))
``` Out[125]:
``` This graph needs a title and x and y axis labels. We will use title, xlabel and ylabel. In [126]:
```
plot(prices.groupby("Week_Number")['Europe Brent Spot Price FOB (Dollars per Barrel)'].mean().head(52))
plt.title("52 Week average of 1987-2019 Oil Prices")
plt.xlabel("Week")
plt.ylabel("$/barrel")
``` Out[126]:
``` Now let's take it to another level. For example, let's see the weekly price movements for 2018. It requires two steps: 1) filter the data for 2018, and 2) since the whole dataset would take 365 lines, just view the first few lines with head. In [127]:
```
pf=prices[prices.Year==2018]
pf.head(4)
``` Out[127]:
Now, we will group the filtered results In [128]:
```
# Just look at the averages for the first 4 week
pf.groupby("Week_Number")['Europe Brent Spot Price FOB (Dollars per Barrel)'].mean().head(4)
``` Out[128]:
Now we can choose the data for 2018 and view weekly means In [129]:
```
plot(pf.groupby("Week_Number")['Europe Brent Spot Price FOB (Dollars per Barrel)'].mean())
``` Out[129]:
Or we can write it in a longer way, without an intermediate variable. In [130]:
```
plot(prices[prices.Year==2018].groupby("Week_Number")['Europe Brent Spot Price FOB (Dollars per Barrel)'].mean())
``` Out[130]:
Now we can check the last 5 and 10 years with the range command: - 5 years with
*range(2014,2019)* - 10 years with
*range(2009,2019)*
In [131]:
```
plot(prices[prices.Year>=2014].groupby("Week_Number")['Europe Brent Spot Price FOB (Dollars per Barrel)'].mean().head(52))
``` Out[131]:
``` Or we can do it in a more readable way: 1) assign the partial dataset (years greater than or equal to 2014) to pf; 2) then use the groupby command on pf. In [132]:
```
pf=prices[prices.Year>=2014]
``` In [133]:
```
plot(pf.groupby("Week_Number")['Europe Brent Spot Price FOB (Dollars per Barrel)'].mean().head(52))
# title for the plot and ";" for supressing text output
title("Weekly oil prices (2014-2019)");
``` ## Pivot Table¶
Pivot tables are an essential part of data analysis. Basically, a pivot table re-arranges a dataset to the user's liking, and additional mathematical operations like "averaging" and "summing" can be applied. Now, from the original prices dataset, we will rearrange the values: - Change the index of the dataset to week number ---->
**index=['Week_Number']** - Produce columns for each year with averages ---->
**columns=['Year']** - Use the average function from numpy ---->
**aggfunc=np.mean** - Fill the data cells with values from the Brent price column ---->
**values='Europe Brent Spot Price FOB (Dollars per Barrel)'**
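Before running it on the full dataset, the mechanics of pivot_table can be checked on a tiny toy frame with the same parameters (hypothetical numbers, shorter column names):

```python
import numpy as np
import pandas as pd

# toy frame: two years, two week numbers, made-up prices
toy = pd.DataFrame({
    "Year":        [2018, 2018, 2019, 2019],
    "Week_Number": [1,    1,    1,    2],
    "Price":       [60.0, 62.0, 55.0, 57.0],
})
pt_toy = toy.pivot_table(values="Price", index=["Week_Number"],
                         columns=["Year"], aggfunc=np.mean)
# week 1 / 2018 is the mean of 60 and 62; week 2 / 2018 has no data -> NaN
print(pt_toy)
```

Each (Week_Number, Year) pair collapses into one averaged cell, and missing combinations become NaN, which is exactly what happens with the 1987 rows below.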
It may look clumsy to a careful observer to keep using the long column name, but we pass it exactly as it appears in the dataset. In [134]:
```
pt=prices.pivot_table( values='Europe Brent Spot Price FOB (Dollars per Barrel)', index=['Week_Number'],
columns=['Year'], aggfunc=np.mean)
``` Now let's check pt (pt for pivot table) In [135]:
```
pt.head(4)
``` Out[135]:
``` ## More Filtering¶The first thing that strikes you in the table is 1987: there are lots of NaNs until May of that year, because the dataset starts in May 1987. Apart from that, there are now weekly averages for all years in the columns. Now we can choose the years 2014 to 2019 and plot their first 52 weeks. - range(2014, 2020) produces 2014, 2015, 2016, 2017, 2018, 2019; the last element is never included in range
- To take the mean of the weekly data, the
**mean** command with **(axis=1)** tells Python to take averages row-wise - A 53rd week rarely happens in a year, therefore I omit it with
**head(52)**
In [136]:
```
plot(pt.loc[:,range(2014,2020)].mean(axis=1).head(52))
title ("Weekly price changes (2014-2019)");
You can check what is included by combining * (unpacking) with range and print in one line. In [137]:
```
# range command does not include the last element
print(*range(2014,2020))
If everything is clear, we can produce percent changes of weekly oil prices. Just call pct_change. In [138]:
```
ptpc=pt.pct_change()
Always check the data, or at least part of it (the first 4 lines with head(4)). In [139]:
```
ptpc.head(4)
``` Out[139]:
The first line is NaN for each year, since we didn't prepend the previous year's December (the 12th month) ahead of January. But now let's graph the percent changes for the whole dataset again. In [140]:
```
plot(ptpc.mean(axis=1))
title("Weekly oil price changes(1987-2019)")
xlabel("Weeks")
ylabel("Percent change Week-on-Week");
``` ## 5 year and 10 year average weekly prices¶Instead of % changes, let's turn back to the original dataset of weekly averages (pt): - produce year data with range(2014, 2019) ---> 2014, 2015, 2016, 2017, 2018
- Filter
**pt** for these years ---> pt.loc[:,range(2014,2019)] - Take the mean of the same weeks ---> mean(axis=1)
- Skip the 53rd week ---> head(52)
In [141]:
```
# First 5 years 2014,2015,2016,2017,2018
plot(pt.loc[:,range(2014,2019)].mean(axis=1).head(52))
``` Out[141]:
Now let's see the 10 year data In [142]:
```
# 10 years of average weekly prices
plot(pt.loc[:,range(2009,2019)].mean(axis=1).head(52));
``` For the 10-year average data there is a huge jump before the 10th week. But I'd rather use the last 5 years: US shale production has shown its real effect in that period. You can do the 10-year analysis by yourself. To look more closely, we will take the percent change and graph it again. In [143]:
```
pt_weekly=pt.loc[:,range(2014,2019)].mean(axis=1).head(52)
plot(pt_weekly.pct_change());
``` Let's assign percent changes to a new variable. We will do more filtering In [144]:
```
pt_weekly_pc=pt_weekly.pct_change()
``` This time we will filter the weeks where the change is bigger than 2.9% in absolute terms. - Get the absolute value of the percent change ----> abs(pt_weekly_pc)>0.029
- Now filter the data points ----> pt_weekly_pc[abs(pt_weekly_pc)>0.029]
In [145]:
```
pt_weekly_pc[abs(pt_weekly_pc)>0.029]
``` Out[145]:
## Converting week numbers to dates¶Now we can convert these week numbers to dates with the time module. Converting week numbers to dates is tricky. We again choose the weeks with changes higher than 2.9% (arbitrary) and insert the corresponding week numbers (.index.values) into strptime. In [146]:
```
import time

for weeks in pt_weekly_pc[abs(pt_weekly_pc)>0.029].index.values:
    print(time.asctime(time.strptime('{} {} 1'.format(2019, weeks), '%Y %W %w')))
``` For the last 5 years, the strongest moves are: - Price increases in the second week of January, the last week of February and mid-April (positive values)
- Price decreases in November and December
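As an aside, Python 3.8+ offers date.fromisocalendar as an alternative to strptime for this conversion. Note that ISO week numbers can differ by one from strptime's %W convention, so the two methods are not interchangeable:

```python
from datetime import date

# Monday (day 1) of ISO week 2 of 2019
print(date.fromisocalendar(2019, 2, 1))  # → 2019-01-07
```

Under ISO numbering, week 1 is the week containing the first Thursday of the year, whereas %W counts weeks from the first Monday; keep one convention throughout an analysis.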
## Monthly data¶That concludes the weekly part. In my original paper I did the analysis for the last 10 years; in this tutorial it is for the last 5 years. For the monthly data analysis, there is one more step: the percent changes for each January would appear as NaN. Therefore we have to copy each year's December data ahead of the following January, shifted by one year, so that December 1987 comes before January 1988, and so on. Our steps are as follows: - Rearrange the data with a pivot table
- Copy the last row (December) and paste it before the January row with a shift of one column
- Calculate percent changes
In [147]:
```
# Prices monthly data to pm variable
pm=prices.pivot_table( values='Europe Brent Spot Price FOB (Dollars per Barrel)', index=['Month'],
columns=['Year'], aggfunc=np.mean)
``` First 4 lines of the new variable: In [148]:
```
pm.head(4)
``` Out[148]:
Now comes the tricky part: - **pm.loc[12,:]** gives us all the data in the 12th row (December) - **.shift(1)** shifts the data one column to the right, so December 1987 lands in the same column as 1988 - Pasting the shifted row into index 0 is the final touch
In [149]:
```
pm.loc[0]=pm.loc[12,:].shift(1)
pm.head(3)
``` Out[149]:
We cannot see the 0th row. That is because the pasted row is appended as the last line. Our index is the month, therefore we sort the index. In [150]:
```
pm=pm.sort_index()
pm.head(3)
``` Out[150]:
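The December-shift trick can be verified end to end on a toy pivot (a hypothetical two-year table with January and December rows only):

```python
import pandas as pd

# toy monthly pivot: rows = month, columns = year
pm_toy = pd.DataFrame({1987: [10.0, 12.0], 1988: [14.0, 16.0]},
                      index=[1, 12])  # January and December only
pm_toy.loc[0] = pm_toy.loc[12, :].shift(1)  # previous year's December as row 0
pm_toy = pm_toy.sort_index()
print(pm_toy)
# row 0 for 1988 now holds December 1987 (12.0); 1987 has no prior year -> NaN
```

With row 0 in place, pct_change gives January 1988 a proper month-on-month change of (14 - 12) / 12 instead of NaN.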
``` Now we can calculate the percent change. In [151]:
```
pmc=pm.pct_change()
pmc.head(3)
``` Out[151]:
``` Now we can delete row 0, since it is full of NaN. To delete row 0, the drop command is used, but do not forget to assign the result back to the original variable. In [152]:
```
pmc=pmc.drop(pmc.index[0])
``` In [153]:
```
pmc.head()
``` Out[153]:
``` Now we can calculate the last 10 years of percent changes for each month: - Filter pmc
- Take the mean of each row
In [154]:
```
pmc.loc[:,range(2009,2020)].mean(axis=1)
``` Out[154]:
``` For the last 10 years, February through April are most likely to be positive months; November and December are most likely negative. Now let's check 1987-2019. In [155]:
```
pmc.mean(axis=1)
``` Out[155]:
``` Combining the two results: - The rise starts in Feb-March
- April is the most probable month for price hikes
- Price changes turn negative in November and December
In more detail, we can calculate the probabilities for each month. The process is straightforward - if the cell value is bigger than 0 ---->
**pmc[(pmc.iloc[:,:]>0)]** - then count the number of cells in each row ---->
**.count(axis=1)** - divide the number of positive cells by the total ---->
**/(2020-1987+1)** - assign the values to a "Positive" labeled column -->
**pmc["Positive"]**
In [156]:
```
pmc["Positive"]=pmc[(pmc.iloc[:,:]>0)].count(axis=1)/(2020-1987+1)
``` Do the same for negative price movements In [157]:
```
pmc["Negative"]=pmc[(pmc.iloc[:,:]<0)].count(axis=1)/(2020-1987+1)
``` Let's see the total results for Positive and Negative columns In [158]:
```
pmc[["Positive","Negative"]]
``` Out[158]:
The results above mean that for April (month=4), the probability of a price increase (=positive) is 70% (=0.705), whereas the probability of a price decrease is 26% (=0.26). ## More Monthly Graphics¶Now we can plot average monthly prices for 2009-2019 and 1987-2019. In [159]:
```
plot(pm.loc[:,range(2009,2020)].mean(axis=1))
plt.title('2009-2019 Averaged Monthly Prices')
plt.xlabel("Months (0 = previous December)")
plt.ylabel("$/barrel");
``` In [160]:
```
plot(pm.loc[:,range(1987,2020)].mean(axis=1))
plt.title('1987-2019 Averaged Monthly Prices')
plt.xlabel("Months (0 = previous December)")
plt.ylabel("$/barrel");
``` ## Day of the Week effect¶In oil prices, there may be Monday or Friday effects. "Never short on Friday" is a known saying among traders. We can test it. First, let's look at the data again; then we will rearrange it with a pivot table whose columns are days of the week. In [161]:
```
prices.head(3)
``` Out[161]:
We can add a percentage change column to the data In [162]:
```
prices["Percent"]=prices["Europe Brent Spot Price FOB (Dollars per Barrel)"].pct_change()
prices.head(3)
``` Out[162]:
Now we can use pivot_table to arrange our data. This time the index is Year and the columns will be Day of Week (0=Monday, 1=Tue, ..., 4=Friday). In [163]:
```
pday=prices.pivot_table( values='Percent', index=['Year'],columns=['DoW'], aggfunc=np.mean)
# This time check the last 5 lines
pday.tail(5)
``` Out[163]:
How many rows and columns are there in our new variable? In [164]:
```
pday.shape
``` Out[164]:
We have yearly averages and we can count positive and negative values for each day In [165]:
```
pday[pday>0].count()
``` Out[165]:
To make sure, now the number of negative percent changes. In [166]:
```
pday[pday<0].count()
``` Out[166]:
``` And the probability calculations are pretty straightforward: divide the counts by the number of rows (33). In [167]:
```
pday[pday>0].count()/pday.shape[0]
``` Out[167]:
``` Day of the week 3 has a probability of 72%. That means Thursdays most probably end with a positive oil price movement. In [168]:
```
pday[pday<0].count()/pday.shape[0]
``` Out[168]:
On the negative side, Mondays and Tuesdays have a higher probability of being negative.
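The table above works on yearly averages per weekday. The same question can also be asked of the raw daily changes; a sketch on toy data (made-up percent changes over ten business days, not real Brent values):

```python
import pandas as pd

# ten business days starting Monday 2024-01-01, with made-up daily changes
dates = pd.date_range("2024-01-01", periods=10, freq="B")
pct = pd.Series([0.01, -0.02, 0.015, 0.03, -0.005,
                 0.02, 0.01, -0.01, 0.025, 0.004], index=dates)
# share of positive days per day of week (0=Monday ... 4=Friday)
share_positive = (pct > 0).groupby(pct.index.dayofweek).mean()
print(share_positive)
```

On the real series this counts every trading day rather than yearly averages, so the resulting probabilities need not agree with the table above.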