Miranda Auhl: PostgreSQL vs Python for data cleaning: A guide

Introduction

During analysis, you rarely – if ever – get to go directly from evaluating data to transforming and analyzing it. Sometimes to properly evaluate your data, you may need to do some pre-cleaning before you get to the main data cleaning, and that’s a lot of cleaning! In order to accomplish all this work, you may use Excel, R, or Python, but are these the best tools for data cleaning tasks?

In this blog post, I explore some classic data cleaning scenarios and show how you can perform them directly within your database using TimescaleDB and PostgreSQL, replacing the tasks that you may have done in Excel, R, or Python. TimescaleDB and PostgreSQL cannot replace these tools entirely, but they can help your data munging/cleaning tasks be more efficient and, in turn, let Excel, R, and Python shine where they do best: in visualizations, modeling, and machine learning.  

Cleaning is a very important part of the analysis process and, in my experience, generally the most grueling! By cleaning data directly within my database, I am able to perform many of my cleaning tasks a single time rather than repetitively within a script, saving me considerable time in the long run.

A recap of the data analysis process

I began this series of posts on data analysis by presenting the following summary of the analysis process: evaluate, clean, transform, and analyze.

The dataset I use throughout this post is household electric usage data, which I loaded into a raw staging table called energy_usage_staging. The data appears like this…

type date start_time end_time usage units cost notes
Electric usage 2016-10-22 00:00:00 00:14:00 0.01 kWh $0.00
Electric usage 2016-10-22 00:15:00 00:29:00 0.01 kWh $0.00
Electric usage 2016-10-22 00:30:00 00:44:00 0.01 kWh $0.00
Electric usage 2016-10-22 00:45:00 00:59:00 0.01 kWh $0.00
Electric usage 2016-10-22 01:00:00 01:14:00 0.01 kWh $0.00
Electric usage 2016-10-22 01:15:00 01:29:00 0.01 kWh $0.00
Electric usage 2016-10-22 01:30:00 01:44:00 0.01 kWh $0.00
Electric usage 2016-10-22 01:45:00 01:59:00 0.01 kWh $0.00

In order to do any type of analysis on this data set, I want to clean it up. A few things that quickly come to mind include:

  • The cost column is stored as a text data type, which will cause issues with any numeric operations.
  • The time columns are split apart which could cause some problems if I want to create plots over time or perform any type of modeling based on time.
  • I may also want to filter the data based on various parameters that have to do with time, such as day of the week or holiday identification (both potentially play into how energy is used within the household).

In order to fix all of these things and get more valuable data evaluation and analysis, I will have to clean the incoming data! So without further ado, let’s roll up our sleeves and dig in!

Cleaning the data

I will show most of the techniques I have used in the past while working in data science. While these examples are not exhaustive, I hope they will cover many of the cleaning steps you perform during your own analysis, helping to make your cleaning tasks more efficient by using PostgreSQL and TimescaleDB.

Please feel free to explore these various techniques and skip around if you need! There is a lot here, and I designed it to be a helpful glossary of tools that you could use as you need.

The techniques that I will cover include:

  • Correcting structural issues
  • Creating or generating relevant data
  • Adding data to a hypertable
  • Renaming values
  • Filling in missing data

Note on cleaning approach:

There are many ways that I could approach the cleaning process in PostgreSQL. I could create a table and then ALTER it as I clean, I could create multiple tables as I add or change data, or I could work with VIEWs. Depending on the size of my data, any of these approaches could make sense; however, they have different computational consequences.

You may have noticed above that my raw data table was called energy_usage_staging. This is because I decided that given the state of my raw data, it would be best for me to place the raw data in a staging table, clean it using VIEWs, then insert it into a more usable table as part of my cleaning process. This move from raw table to usable table could happen even before the evaluation step of analysis. As I discussed above, sometimes data cleaning has to occur after AND before evaluating your data. Regardless, this data needs to be cleaned and I wanted to use the most efficient method possible. In this case, that meant using a staging table and leveraging the efficiency and power of PostgreSQL VIEWs, something I will talk about later.

Generally, if you are dealing with a lot of data, altering an existing table in PostgreSQL can be costly. For this post I will show you how to build up clean data using VIEWs along with additional tables. This method of cleaning is more efficient and sets you up for the next blog post about data transformation which includes the use of scripts in PostgreSQL.

Correcting structural issues

Right off the bat, I know that I need to do some refactoring of my raw table because of its data types. Notice that the date and time columns are separated and that cost is recorded as a text data type. I need to combine the separated date and time columns into a single timestamp and convert the cost column to float4. But before I show that, I want to talk about why conversion to timestamp is beneficial.

TimescaleDB hypertables and why timestamp is important

For those of you not familiar with TimescaleDB, a hypertable is a PostgreSQL table that automatically partitions your data into chunks based on a designated time column. Most of TimescaleDB's time-series functionality, from partitioning to hyperfunctions like time_bucket(), keys off that column, so having a single, proper timestamp column rather than separate date and time columns is essential. Luckily, combining the two is simple: PostgreSQL lets you add a date and a time together directly.

--add the date column to the start_time column
SELECT date, start_time, (date + start_time) AS time 
FROM energy_usage_staging eus;

Results:

date start_time time
2016-10-22 00:00:00 2016-10-22 00:00:00.000
2016-10-22 00:15:00 2016-10-22 00:15:00.000
2016-10-22 00:30:00 2016-10-22 00:30:00.000
2016-10-22 00:45:00 2016-10-22 00:45:00.000
2016-10-22 01:00:00 2016-10-22 01:00:00.000
2016-10-22 01:15:00 2016-10-22 01:15:00.000

Python code:

In Python, the easiest way to do this is to add a new column to the dataframe. Notice that in Python I would have to concatenate the two columns along with a defined space, then convert that column to datetime.

energy_stage_df['time'] = pd.to_datetime(energy_stage_df['date'] + ' ' + energy_stage_df['start_time'])
print(energy_stage_df[['date', 'start_time', 'time']])

Changing column data types

Next, I want to change the data type of my cost column from text to float. Again, this is straightforward in PostgreSQL with the TO_NUMBER() function.

The format of the function is as follows: TO_NUMBER('text', 'format'). The 'format' input is a PostgreSQL-specific template string that you build depending on the type of text you want to convert. In our case, we have a '$' symbol followed by a numeric value like 0.00, so I can use a format string like 'L99D99': the L tells PostgreSQL there is a currency symbol at the beginning of the text, each 9 stands for a digit, and the D stands for the decimal point.

I decided to cap the conversion at values less than or equal to '$99.99' because the cost column has no values greater than 0.65. If you were planning to convert a column with larger numeric values, you would want to account for that by adding a G for each comma (group) separator. For example, if you had a cost column with text values like '$1,672,278.23', you would use a format string like 'L9G999G999D99'. (The query below uses 'L9G999D99', which covers values up to '$9,999.99'.)
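For comparison, here is a quick Python sketch of the same idea: stripping both the currency symbol and the group separators before converting to numbers. The dollar values here are hypothetical examples, not rows from the dataset.

```python
import pandas as pd

# Hypothetical dollar strings, one with thousands separators.
costs = pd.Series(['$0.65', '$1,672,278.23'])

# A single regex pass removes both the '$' symbol and the commas,
# leaving a plain numeric string that pandas can convert.
costs_numeric = pd.to_numeric(costs.str.replace(r'[$,]', '', regex=True))
print(costs_numeric.tolist())  # [0.65, 1672278.23]
```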

PostgreSQL code:

--create a new column called cost_new with the to_number() function
SELECT cost, TO_NUMBER("cost", 'L9G999D99') AS cost_new
FROM energy_usage_staging eus  
ORDER BY cost_new DESC

Results:

cost cost_new
$0.65 0.65
$0.65 0.65
$0.65 0.65
$0.57 0.57
$0.46 0.46
$0.46 0.46
$0.46 0.46
$0.46 0.46

Python code:

For Python, I used a lambda function which systematically replaces all the ‘$’ signs with empty strings. This can be fairly inefficient.

energy_stage_df['cost_new'] = pd.to_numeric(energy_stage_df.cost.apply(lambda x: x.replace('$','')))
print(energy_stage_df[['cost', 'cost_new']])
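If the column is large, a vectorized version using the pandas .str accessor avoids calling a Python lambda on every row. This is a sketch using a small stand-in frame in place of the real energy_stage_df:

```python
import pandas as pd

# Small stand-in for energy_stage_df, to keep the snippet self-contained.
energy_stage_df = pd.DataFrame({'cost': ['$0.65', '$0.57', '$0.46']})

# Vectorized replacement: strip the literal '$' across the whole column
# at once rather than row by row.
energy_stage_df['cost_new'] = pd.to_numeric(
    energy_stage_df['cost'].str.replace('$', '', regex=False))
print(energy_stage_df['cost_new'].tolist())  # [0.65, 0.57, 0.46]
```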

Creating a VIEW

Now that I know how to convert my columns, I can combine the two queries and create a VIEW of my new restructured table. A VIEW is a PostgreSQL object that allows you to define a query and call it by the VIEW's name, as if it were a table within your database. I can use the following query to generate the data I want, and then create a VIEW so I can query the result as if it were a table.

PostgreSQL code:

-- query the right data that I want
SELECT type, 
(date + start_time) AS time, 
"usage", 
units, 
TO_NUMBER("cost", 'L9G999D99') AS cost, 
notes 
FROM energy_usage_staging

Results:

type time usage units cost notes
Electric usage 2016-10-22 00:00:00.000 0.01 kWh 0.00
Electric usage 2016-10-22 00:15:00.000 0.01 kWh 0.00
Electric usage 2016-10-22 00:30:00.000 0.01 kWh 0.00
Electric usage 2016-10-22 00:45:00.000 0.01 kWh 0.00
Electric usage 2016-10-22 01:00:00.000 0.01 kWh 0.00
Electric usage 2016-10-22 01:15:00.000 0.01 kWh 0.00
Electric usage 2016-10-22 01:30:00.000 0.01 kWh 0.00
Electric usage 2016-10-22 01:45:00.000 0.01 kWh 0.00
Electric usage 2016-10-22 02:00:00.000 0.02 kWh 0.00
Electric usage 2016-10-22 02:15:00.000 0.02 kWh 0.00

I decided to call my VIEW energy_view. Now, when I want to do further cleaning, I can just specify its name in the FROM statement.

--create view from the query above
CREATE VIEW energy_view AS
SELECT type, 
(date + start_time) AS time, 
"usage", 
units, 
TO_NUMBER("cost", 'L9G999D99') AS cost, 
notes 
FROM energy_usage_staging

Python code:

energy_df = energy_stage_df[['type','time','usage','units','cost_new','notes']]
energy_df.rename(columns={'cost_new':'cost'}, inplace = True)
print(energy_df.head(20))

It is important to note that the data behind a PostgreSQL VIEW has to be recalculated every time you query it. This is why we want to insert our VIEW's data into a hypertable once we have the data set up just right. You can think of a VIEW as a reusable version of the CTE (WITH … AS) statement I discussed in my last post.

We are now one step closer to cleaner data!

Creating or generating relevant data

With some quick investigation, we can see that the notes column is blank for this data set. To check this I just need to include a WHERE clause and specify where notes are not equal to an empty string.

PostgreSQL code:

SELECT * 
FROM energy_view ew
-- where notes are not equal to an empty string
WHERE notes!='';

Results come out empty

Python code:

print(energy_df[energy_df['notes'].notnull()])

Since the notes are blank, I would like to replace the column with various sets of additional information that I could use later during modeling. One thing I would like to add in particular is a column that specifies the day of the week. To do this I can use the EXTRACT() command, a PostgreSQL date/time function that allows you to extract various date/time elements. For our column, PostgreSQL has the specifier DOW (day-of-week), which maps 0 for Sunday through 6 for Saturday.

PostgreSQL code:

--extract day-of-week from date column and cast the output to an int
SELECT *,
EXTRACT(DOW FROM time)::int AS day_of_week
FROM energy_view ew

Results:

type time usage units cost notes day_of_week
Electric usage 2016-10-22 00:00:00.000 0.01 kWh 0.00 6
Electric usage 2016-10-22 00:15:00.000 0.01 kWh 0.00 6
Electric usage 2016-10-22 00:30:00.000 0.01 kWh 0.00 6
Electric usage 2016-10-22 00:45:00.000 0.01 kWh 0.00 6
Electric usage 2016-10-22 01:00:00.000 0.01 kWh 0.00 6
Electric usage 2016-10-22 01:15:00.000 0.01 kWh 0.00 6

Python code:

energy_df['day_of_week'] = energy_df['time'].dt.dayofweek
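One caveat worth a quick sketch: pandas numbers the days Monday = 0 through Sunday = 6, while PostgreSQL's DOW runs Sunday = 0 through Saturday = 6. If you need the two systems to agree, a simple shift converts the pandas convention to PostgreSQL's:

```python
import pandas as pd

# 2016-10-22 is a Saturday, 2016-10-23 a Sunday, 2016-10-24 a Monday.
times = pd.Series(pd.to_datetime(['2016-10-22', '2016-10-23', '2016-10-24']))

# pandas convention: Monday=0 ... Sunday=6.
pandas_dow = times.dt.dayofweek

# Shift so Sunday=0 ... Saturday=6, matching PostgreSQL's DOW.
postgres_dow = (pandas_dow + 1) % 7
print(postgres_dow.tolist())  # [6, 0, 1]
```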

Additionally, we may want another column that specifies whether a day falls on a weekend or a weekday. I will do this by creating a boolean column, where true represents a weekend and false represents a weekday. To do this, I will apply a CASE statement. With this command I can specify “when-then” conditions (similar to “if-then” statements in programming): WHEN a day_of_week value is IN the set (0,6), THEN the output should be true; ELSE the value should be false.

PostgreSQL code:

SELECT type, time, usage, units, cost,
EXTRACT(DOW FROM time)::int AS day_of_week, 
--use the case statement to make a column true when records fall on a weekend aka 0 and 6
CASE WHEN (EXTRACT(DOW FROM time)::int) IN (0,6) then true
	ELSE false
END AS is_weekend
FROM energy_view ew

Results:

type time usage units cost day_of_week is_weekend
Electric usage 2016-10-22 00:00:00.000 0.01 kWh 0.00 6 true
Electric usage 2016-10-22 00:15:00.000 0.01 kWh 0.00 6 true
Electric usage 2016-10-22 00:30:00.000 0.01 kWh 0.00 6 true
Electric usage 2016-10-22 00:45:00.000 0.01 kWh 0.00 6 true
Electric usage 2016-10-22 01:00:00.000 0.01 kWh 0.00 6 true

Fun fact: you can get the same result without a CASE statement; however, this shortcut only works for boolean columns.

--another method to create a binary column
SELECT type, time, usage, units, cost,
EXTRACT(DOW FROM time)::int AS day_of_week, 
EXTRACT(DOW FROM time)::int IN (0,6) AS is_weekend
FROM energy_view ew

Python code:

Notice that in Python the weekends are represented by the numbers 5 and 6 (pandas counts Monday as 0), versus the PostgreSQL weekend values of 0 and 6.

energy_df['is_weekend'] = np.where(energy_df['day_of_week'].isin([5,6]), 1, 0)
print(energy_df.head(20))

And maybe then things start getting really interesting: maybe you want to add more parameters!

Let’s consider holidays. You may be asking, “Why in the world would we do that?!”, but people in the US often have time off around holidays. Since this individual lives in the US, they likely have at least some holidays off, whether on the day itself or on the observed federal holiday. Where there are days off, there could be a difference in energy usage. To help guide my analysis, I want to identify holidays, so I will create another boolean column that flags when a federal holiday occurs.

To do this, I am going to use TimescaleDB’s time_bucket() function, one of the functions I discussed in detail in my previous post. Essentially, I need this function to make sure all time values within a single day are matched against the holiday date. Without the time_bucket() function, only the row associated with the 12 am time period would be flagged.

PostgreSQL code:

After I create a holiday table, I can use its data within my query. I also decided to use the non-CASE syntax for this query; note that you can use either!

--create table for the holidays
CREATE TABLE holidays (
date date)

--insert the holidays into table
INSERT INTO holidays 
VALUES ('2016-11-11'), 
('2016-11-24'), 
('2016-12-24'), 
('2016-12-25'), 
('2016-12-26'), 
('2017-01-01'),  
('2017-01-02'), 
('2017-01-16'), 
('2017-02-20'), 
('2017-05-29'), 
('2017-07-04'), 
('2017-09-04'), 
('2017-10-9'), 
('2017-11-10'), 
('2017-11-23'), 
('2017-11-24'), 
('2017-12-24'), 
('2017-12-25'), 
('2018-01-01'), 
('2018-01-15'), 
('2018-02-19'), 
('2018-05-28'), 
('2018-07-4'), 
('2018-09-03'), 
('2018-10-8')

SELECT type, time, usage, units, cost,
EXTRACT(DOW FROM time)::int AS day_of_week, 
EXTRACT(DOW FROM time)::int IN (0,6) AS is_weekend,
-- I can then select the data from the holidays table directly within my IN statement
time_bucket('1 day', time) IN (SELECT date FROM holidays) AS is_holiday
FROM energy_view ew

Results:

type time usage units cost day_of_week is_weekend is_holiday
Electric usage 2016-10-22 00:00:00.000 0.01 kWh 0.00 6 true false
Electric usage 2016-10-22 00:15:00.000 0.01 kWh 0.00 6 true false
Electric usage 2016-10-22 00:30:00.000 0.01 kWh 0.00 6 true false
Electric usage 2016-10-22 00:45:00.000 0.01 kWh 0.00 6 true false
Electric usage 2016-10-22 01:00:00.000 0.01 kWh 0.00 6 true false
Electric usage 2016-10-22 01:15:00.000 0.01 kWh 0.00 6 true false

Python code:

holidays = ['2016-11-11', '2016-11-24', '2016-12-24', '2016-12-25', '2016-12-26', '2017-01-01',  '2017-01-02', '2017-01-16', '2017-02-20', '2017-05-29', '2017-07-04', '2017-09-04', '2017-10-9', '2017-11-10', '2017-11-23', '2017-11-24', '2017-12-24', '2017-12-25', '2018-01-01', '2018-01-15', '2018-02-19', '2018-05-28', '2018-07-4', '2018-09-03', '2018-10-8']
# compare calendar dates (not day_of_week) against the holiday list
energy_df['is_holiday'] = np.where(energy_df['time'].dt.normalize().isin(pd.to_datetime(holidays)), 1, 0)
print(energy_df.head(20))

At this point, I’m going to save this expanded table into another VIEW so that I can call the data without writing out the query.

PostgreSQL code:

--create another view with the data from our first round of cleaning
CREATE VIEW energy_view_exp AS
SELECT type, time, usage, units, cost,
EXTRACT(DOW FROM time)::int AS day_of_week, 
EXTRACT(DOW FROM time)::int IN (0,6) AS is_weekend,
time_bucket('1 day', time) IN (select date from holidays) AS is_holiday
FROM energy_view ew

You may be asking, “Why did you create these as boolean columns??”, a very fair question! I often want to use columns like these for filtering during analysis, and in PostgreSQL, boolean columns make filtering extremely easy. For example, say I want to show only the data that occurs over a weekend AND on a holiday. I can do this simply by adding a WHERE statement with the specified columns.

PostgreSQL code:

--if you use binary columns, then you can filter with a simple WHERE statement
SELECT *
FROM energy_view_exp
WHERE is_weekend AND is_holiday

Results:

type time usage units cost day_of_week is_weekend is_holiday
Electric usage 2016-12-24 00:00:00.000 0.34 kWh 0.06 6 true true
Electric usage 2016-12-24 00:15:00.000 0.34 kWh 0.06 6 true true
Electric usage 2016-12-24 00:30:00.000 0.34 kWh 0.06 6 true true
Electric usage 2016-12-24 00:45:00.000 0.34 kWh 0.06 6 true true
Electric usage 2016-12-24 01:00:00.000 0.34 kWh 0.06 6 true true
Electric usage 2016-12-24 01:15:00.000 0.34 kWh 0.06 6 true true

Python code:

print(energy_df[(energy_df['is_weekend']==1) & (energy_df['is_holiday']==1)].head(10))

Adding data to a hypertable

Now that I have new columns ready to go and I know how I would like my table to be structured, I can create a new hypertable and insert my cleaned data. In my own analysis with this data set, I may have done the cleaning up to this point BEFORE evaluating my data so that I can get a more meaningful evaluation step in analysis. What’s great is that you can use any of these techniques for general cleaning, whether that is before or after evaluation.

PostgreSQL:

CREATE TABLE energy_usage (
type text,
time timestamptz,
usage float4,
units text,
cost float4,
day_of_week int,
is_weekend bool,
is_holiday bool
) 

--command to create a hypertable
SELECT create_hypertable('energy_usage', 'time')

INSERT INTO energy_usage 
SELECT *
FROM energy_view_exp

Note that if you have data continually coming in, you could create a script within your database that automatically applies these changes on import. That way, you have cleaned data ready to go in your database rather than processing and cleaning the data in your scripts every time you want to perform analysis.

We will discuss this in detail in my next post, so stay tuned if you want to know how to create scripts and keep data automatically updated!

Renaming values

Another valuable technique for cleaning data is being able to rename various items or remap categorical values. The importance of this skill is amplified by the popularity of this Python data analysis question on StackOverflow. The question states “How do I change a single index value in a pandas dataframe?”. Since PostgreSQL and TimescaleDB use relational table structures, renaming unique values can be fairly simple.

When renaming specific index values within a table, you can do this “on the fly” by using PostgreSQL’s CASE statement within the SELECT query. Let’s say I don’t like Sunday being represented by a 0 in the day_of_week column, but would prefer it to be a 7. I can do this with the following query.

PostgreSQL code:

SELECT type, time, usage, cost, is_weekend,
-- you can use case to recode column values 
CASE WHEN day_of_week = 0 THEN 7
ELSE day_of_week 
END AS day_of_week
FROM energy_usage

Python code:

Caveat: this code would make Monday = 7, because pandas’ day-of-week numbering sets Monday to 0 and Sunday to 6. You likely would not want this exact action; I just wanted to show the Python equivalent for reference. Note the use of .loc, which avoids pandas’ chained-assignment warning.

energy_df.loc[energy_df['day_of_week'] == 0, 'day_of_week'] = 7
print(energy_df.head(250))

Now, let’s say that I want to use the names of the days of the week instead of numeric values. For this example, I will ditch the CASE statement and create a mapping table. When you need to change many values, it is usually more efficient to create a mapping table and join to it with the JOIN command.

PostgreSQL:

--first I need to create the table
CREATE TABLE day_of_week_mapping (
day_of_week_int int,
day_of_week_name text
)

--then I want to add data to my table
INSERT INTO day_of_week_mapping
VALUES (0, 'Sunday'),
(1, 'Monday'),
(2, 'Tuesday'),
(3, 'Wednesday'),
(4, 'Thursday'),
(5, 'Friday'),
(6, 'Saturday')

--then I can join this table to my cleaning table to remap the days of the week
SELECT type, time, usage, units, cost, dowm.day_of_week_name, is_weekend
FROM energy_usage eu
LEFT JOIN day_of_week_mapping dowm ON dowm.day_of_week_int = eu.day_of_week

Results:

type time usage units cost day_of_week_name is_weekend
Electric usage 2018-07-22 00:45:00.000 0.1 kWh 0.03 Sunday true
Electric usage 2018-07-22 00:30:00.000 0.1 kWh 0.03 Sunday true
Electric usage 2018-07-22 00:15:00.000 0.1 kWh 0.03 Sunday true
Electric usage 2018-07-22 00:00:00.000 0.1 kWh 0.03 Sunday true
Electric usage 2018-02-11 23:00:00.000 0.04 kWh 0.01 Sunday true

Python:

In this case, Python has a similar mapping function.

energy_df['day_of_week_name'] = energy_df['day_of_week'].map({0 : 'Sunday', 1 : 'Monday', 2: 'Tuesday', 3: 'Wednesday', 4: 'Thursday', 5: 'Friday', 6: 'Saturday'})
print(energy_df.head(20))

Hopefully, one of these techniques will be useful for you as you approach data renaming!

Additionally, remember that if you would like to change the name of a column in your table, it is truly as easy as AS (I couldn’t resist such a ridiculous statement 😂). When you use the SELECT statement, you can rename your columns like so:

PostgreSQL code:

SELECT type AS usage_type,
time AS time_stamp,
usage,
units, 
cost AS dollar_amount
FROM energy_view_exp
LIMIT 20;

Results:

usage_type time_stamp usage units dollar_amount
Electric usage 2016-10-22 00:00:00.000 0.01 kWh 0.00
Electric usage 2016-10-22 00:15:00.000 0.01 kWh 0.00
Electric usage 2016-10-22 00:30:00.000 0.01 kWh 0.00
Electric usage 2016-10-22 00:45:00.000 0.01 kWh 0.00

Python code:

Comparatively, renaming columns in Python can be a huge pain. This is an area where SQL is not only faster but also more elegant in its code.

energy_df.rename(columns={'type':'usage_type', 'time':'time_stamp', 'cost':'dollar_amount'}, inplace=True)
print(energy_df[['usage_type','time_stamp','usage','units','dollar_amount']].head(20))

Fill in missing data

Another common problem in the data cleaning process is missing data. For the dataset we are using, there are no obviously missing data points; however, it is very possible that, upon evaluation, we could find missing hourly data from a power outage or some other phenomenon. This is where the gap-filling functions TimescaleDB offers come in handy. Missing data can have a significant negative impact on the accuracy or dependability of a model, and you can often navigate this problem by filling in the gaps with reasonable estimates; TimescaleDB has built-in functions to help you do exactly this.

For example, let’s say that you are modeling the energy usage over individual days of the week and a handful of days have missing energy data due to a power outage or an issue with the sensor. We could remove the data, or try to fill in the missing values with reasonable estimations. For today, let’s assume that the model I want to use would benefit more from filling in the missing values.

As an example, I created some data. I called this table energy_data and it is missing both time and energy readings for the timestamps between 7:45am and 11:30am.

time energy
2021-01-01 07:00:00.000 0
2021-01-01 07:15:00.000 0.1
2021-01-01 07:30:00.000 0.1
2021-01-01 07:45:00.000 0.2
2021-01-01 11:30:00.000 0.04
2021-01-01 11:45:00.000 0.04
2021-01-01 12:00:00.000 0.03
2021-01-01 12:15:00.000 0.02
2021-01-01 12:30:00.000 0.03
2021-01-01 12:45:00.000 0.02
2021-01-01 13:00:00.000 0.03
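If you want to follow along in pandas, here is a sketch that reconstructs the sample table above as a dataframe (the Python snippets in this section assume it is named energy_test_df):

```python
import pandas as pd

# The sample table above: note there are no rows between 07:45 and 11:30.
energy_test_df = pd.DataFrame({
    'time': ['2021-01-01 07:00:00', '2021-01-01 07:15:00',
             '2021-01-01 07:30:00', '2021-01-01 07:45:00',
             '2021-01-01 11:30:00', '2021-01-01 11:45:00',
             '2021-01-01 12:00:00', '2021-01-01 12:15:00',
             '2021-01-01 12:30:00', '2021-01-01 12:45:00',
             '2021-01-01 13:00:00'],
    'energy': [0, 0.1, 0.1, 0.2, 0.04, 0.04, 0.03, 0.02, 0.03, 0.02, 0.03],
})
print(len(energy_test_df))  # 11 rows
```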

I can use TimescaleDB’s gapfilling hyperfunctions to fill in these missing values. The interpolate() function is another of TimescaleDB’s hyperfunctions; it creates data points that follow a linear approximation given the data points before and after the missing range. Alternatively, you could use the locf() hyperfunction, which carries the last recorded value forward to fill the gap (locf stands for last observation carried forward). Both of these functions must be used in conjunction with the time_bucket_gapfill() function.

PostgreSQL code:

SELECT
--here I specified 30-minute buckets, matching the results below
  time_bucket_gapfill('30 min', time) AS timestamp,
  interpolate(avg(energy)),
  locf(avg(energy))
FROM energy_data
--to use gapfill, filter out rows with null values and bound the time range in the WHERE clause
WHERE energy IS NOT NULL 
AND time > '2021-01-01 07:00:00.000' 
AND time < '2021-01-01 13:00:00.000'

Results:

timestamp interpolate locf
2021-01-01 07:00:00.000 0.1 0.10000000000000000000
2021-01-01 07:30:00.000 0.15 0.15000000000000000000
2021-01-01 08:00:00.000 0.13625 0.15000000000000000000
2021-01-01 08:30:00.000 0.1225 0.15000000000000000000
2021-01-01 09:00:00.000 0.10875 0.15000000000000000000
2021-01-01 09:30:00.000 0.095 0.15000000000000000000
2021-01-01 10:00:00.000 0.08125 0.15000000000000000000
2021-01-01 10:30:00.000 0.0675 0.15000000000000000000
2021-01-01 11:00:00.000 0.05375 0.15000000000000000000
2021-01-01 11:30:00.000 0.04 0.04000000000000000000
2021-01-01 12:00:00.000 0.025 0.02500000000000000000
2021-01-01 12:30:00.000 0.025 0.02500000000000000000
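As a sanity check on the interpolate column, the filled values fall on a straight line between the last bucket before the gap (07:30, average 0.15) and the first bucket after it (11:30, 0.04). A small sketch reproduces the arithmetic:

```python
# Endpoints of the gap, as hours since midnight and bucket averages.
t0, v0 = 7.5, 0.15    # 07:30 bucket, avg of the 0.1 and 0.2 readings
t1, v1 = 11.5, 0.04   # 11:30 bucket

def interpolate(t):
    """Linear interpolation between (t0, v0) and (t1, v1)."""
    return v0 + (v1 - v0) * (t - t0) / (t1 - t0)

print(round(interpolate(8.0), 5))  # 0.13625, matching the 08:00 row
print(round(interpolate(8.5), 5))  # 0.1225, matching the 08:30 row
```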

Python code:

energy_test_df['time'] = pd.to_datetime(energy_test_df['time'])
energy_test_df_locf = energy_test_df.set_index('time').resample('15 min').ffill().reset_index()
energy_test_df = energy_test_df.set_index('time').resample('15 min').interpolate().reset_index()
energy_test_df['locf'] = energy_test_df_locf['energy']
print(energy_test_df)

Bonus:

The following query shows how I could simply ignore the missing data. I wanted to include this to show you just how easy it can be to exclude null data. Alternatively, I could use a WHERE clause on the time column to select or exclude specific timestamps (the second query).

SELECT * 
FROM energy_data 
WHERE energy IS NOT NULL

SELECT * 
FROM energy_data
WHERE time = '2021-01-01 11:30:00.000'

Wrap Up

After reading through these various cleaning techniques, I hope you feel more comfortable with exploring some of the possibilities that PostgreSQL and TimescaleDB provide. By cleaning data directly within my database, I am able to perform a lot of my cleaning tasks a single time rather than repetitively within a script, thus saving me time in the long run. If you are looking to save time and effort while cleaning your data for analysis, definitely consider using PostgreSQL and TimescaleDB.

In my next posts, I will go over techniques on how to transform data using PostgreSQL and TimescaleDB. I'll then take everything we've learned together to benchmark data munging tasks in PostgreSQL and TimescaleDB vs. Python and pandas. The final blog post will walk you through the full process on a real dataset by conducting a deep-dive into data analysis with TimescaleDB (for data munging) and Python (for modeling and visualizations).

If you have questions about TimescaleDB, time-series data, or any of the functionality mentioned above, join our community Slack, where you'll find an active community of time-series enthusiasts and various Timescale team members (including me!).

If you’re ready to see the power of TimescaleDB and PostgreSQL right away, you can sign up for a free 30-day trial or install TimescaleDB and manage it on your current PostgreSQL instances. We also have a bunch of great tutorials to help get you started.

Until next time!
