I have a dataframe containing daily data
import pandas as pd
import numpy as np
# Set the random seed for reproducibility
np.random.seed(42)
# Generate random data
dates = pd.date_range(start='2020-06-25', periods=1679, freq='D')
open_prices = np.random.uniform(low=100, high=200, size=len(dates))
high_prices = open_prices + np.random.uniform(low=0, high=10, size=len(dates))
low_prices = open_prices - np.random.uniform(low=0, high=10, size=len(dates))
close_prices = np.random.uniform(low=low_prices, high=high_prices)
# Create the DataFrame
ohlc_data = pd.DataFrame({
'Open': open_prices,
'High': high_prices,
'Low': low_prices,
'Close': close_prices
}, index=dates)
>>> ohlc_data
Open High Low Close
Date
2020-06-25 137.454012 144.403523 129.235702 143.945741
2020-06-26 195.071431 198.532428 188.476323 195.793458
2020-06-27 173.199394 182.955496 165.236980 181.584150
2020-06-28 159.865848 166.275569 157.146388 164.104760
2020-06-29 115.601864 123.826670 108.678275 117.641837
... ... ... ... ...
2025-01-24 179.003044 184.073640 173.358878 175.845545
2025-01-25 130.467914 132.347118 124.462082 130.051784
2025-01-26 108.091928 108.861645 106.429375 108.432944
2025-01-27 140.298018 147.259579 136.500644 145.721364
2025-01-28 117.352451 121.180439 111.180509 115.331552
I need to resample data to 3 days starting from the first day for each year in DataFrame
agg = {'Open': 'first', 'High': 'max', 'Low': 'min', 'Close': 'last'}
resampled = ohlc_data.resample('3D').agg(agg)
>>> resampled
Open High Low Close
Date
2020-06-25 137.454012 198.532428 129.235702 181.584150
2020-06-28 159.865848 166.275569 108.678275 113.720371
2020-07-01 105.808361 195.845186 96.417676 161.755865
2020-07-04 170.807258 198.739371 100.149184 192.952485
2020-07-07 183.244264 188.269925 115.280878 118.845486
... ... ... ... ...
2025-01-15 185.142496 195.768948 111.549399 191.451661
2025-01-18 111.636640 184.883239 109.047597 136.075629
2025-01-21 187.797432 191.054177 175.267831 176.687211
2025-01-24 179.003044 184.073640 106.429375 108.432944
2025-01-27 140.298018 147.259579 111.180509 115.331552
first year:
>>> resampled.loc['2020-01-01': '2020-06-26']
Open High Low Close
Date
2020-06-25 137.454012 198.532428 129.235702 181.584150
2020-06-28 159.865848 166.275569 108.678275 113.720371
This is okay for now because I don't have data before 2020-06-25
second year:
>>> resampled.loc['2021-01-01': '2021-01-06']
Open High Low Close
Date
2021-01-03 190.041806 192.176919 128.095988 134.056420
2021-01-06 134.920957 195.614503 129.865870 189.599085
resampling here start from 2021-01-03 , I need it to start from 2021-01-01
third year:
>>> resampled.loc['2022-01-01': '2022-01-06']
Open High Low Close
Date
2022-01-01 140.348287 147.026333 98.926715 103.562334
2022-01-04 175.513726 177.367027 158.572894 169.912020
resampling in this year work as what I need, starting from 2022-01-01
I tried using origin parameter:
agg = {'Open': 'first', 'High': 'max', 'Low': 'min', 'Close': 'last'}
ori = str(ohlc_data.index[0].date().replace(month=1, day=1))
resampled = ohlc_data.resample('3D', origin=ori).agg(agg)
but this works only with first year in DataFrame
I have a dataframe containing daily data
import pandas as pd
import numpy as np
# Set the random seed for reproducibility
np.random.seed(42)
# Generate random data
dates = pd.date_range(start='2020-06-25', periods=1679, freq='D')
open_prices = np.random.uniform(low=100, high=200, size=len(dates))
high_prices = open_prices + np.random.uniform(low=0, high=10, size=len(dates))
low_prices = open_prices - np.random.uniform(low=0, high=10, size=len(dates))
close_prices = np.random.uniform(low=low_prices, high=high_prices)
# Create the DataFrame
ohlc_data = pd.DataFrame({
'Open': open_prices,
'High': high_prices,
'Low': low_prices,
'Close': close_prices
}, index=dates)
>>> ohlc_data
Open High Low Close
Date
2020-06-25 137.454012 144.403523 129.235702 143.945741
2020-06-26 195.071431 198.532428 188.476323 195.793458
2020-06-27 173.199394 182.955496 165.236980 181.584150
2020-06-28 159.865848 166.275569 157.146388 164.104760
2020-06-29 115.601864 123.826670 108.678275 117.641837
... ... ... ... ...
2025-01-24 179.003044 184.073640 173.358878 175.845545
2025-01-25 130.467914 132.347118 124.462082 130.051784
2025-01-26 108.091928 108.861645 106.429375 108.432944
2025-01-27 140.298018 147.259579 136.500644 145.721364
2025-01-28 117.352451 121.180439 111.180509 115.331552
I need to resample data to 3 days starting from the first day for each year in DataFrame
agg = {'Open': 'first', 'High': 'max', 'Low': 'min', 'Close': 'last'}
resampled = ohlc_data.resample('3D').agg(agg)
>>> resampled
Open High Low Close
Date
2020-06-25 137.454012 198.532428 129.235702 181.584150
2020-06-28 159.865848 166.275569 108.678275 113.720371
2020-07-01 105.808361 195.845186 96.417676 161.755865
2020-07-04 170.807258 198.739371 100.149184 192.952485
2020-07-07 183.244264 188.269925 115.280878 118.845486
... ... ... ... ...
2025-01-15 185.142496 195.768948 111.549399 191.451661
2025-01-18 111.636640 184.883239 109.047597 136.075629
2025-01-21 187.797432 191.054177 175.267831 176.687211
2025-01-24 179.003044 184.073640 106.429375 108.432944
2025-01-27 140.298018 147.259579 111.180509 115.331552
first year:
>>> resampled.loc['2020-01-01': '2020-06-26']
Open High Low Close
Date
2020-06-25 137.454012 198.532428 129.235702 181.584150
2020-06-28 159.865848 166.275569 108.678275 113.720371
This is okay for now because I don't have data before 2020-06-25
second year:
>>> resampled.loc['2021-01-01': '2021-01-06']
Open High Low Close
Date
2021-01-03 190.041806 192.176919 128.095988 134.056420
2021-01-06 134.920957 195.614503 129.865870 189.599085
resampling here start from 2021-01-03 , I need it to start from 2021-01-01
third year:
>>> resampled.loc['2022-01-01': '2022-01-06']
Open High Low Close
Date
2022-01-01 140.348287 147.026333 98.926715 103.562334
2022-01-04 175.513726 177.367027 158.572894 169.912020
resampling in this year work as what I need, starting from 2022-01-01
I tried using origin parameter:
agg = {'Open': 'first', 'High': 'max', 'Low': 'min', 'Close': 'last'}
ori = str(ohlc_data.index[0].date().replace(month=1, day=1))
resampled = ohlc_data.resample('3D', origin=ori).agg(agg)
but this works only with first year in DataFrame
Here's one approach:
agg = {'Open': 'first', 'High': 'max', 'Low': 'min', 'Close': 'last'}
out = (
ohlc_data.groupby(ohlc_data.index.year, group_keys=False)
.apply(
lambda g: g.resample('3D', origin=pd.Timestamp(g.name, 1, 1))
.agg(agg)
)
)
Output:
out[~out.index.year.duplicated()]
Open High Low Close
2020-06-23 137.454012 144.403523 129.235702 143.945741
2021-01-01 109.310277 197.899429 106.286082 188.085355
2022-01-01 140.348287 147.026333 98.926715 103.562334
2023-01-01 186.846798 189.998543 144.825695 189.588286
2024-01-01 151.771164 160.336537 101.736241 128.839795
2025-01-01 130.312836 137.712451 124.656259 135.872983
Explanation
df.groupby
to group by year (DatetimeIndex.year
).groupby.apply
+ df.resample
. This way we can access .name
for each group to create the appropriate origin
.Edit (in response to comment by @wjandrea):
The .name
attribute doesn't appear to be documented (cf. this post). It gets set when the apply_groupwise
method of the BaseGrouper
is called, which happens when you use groupby.apply
(groupby.py#L1785) via _python_apply_general
(e.g. #L1851 and then #L1885).
The relevent part for apply_groupwise
is in ops.py#L1006-L1014:
zipped = zip(group_keys, splitter)
for key, group in zipped:
# Pinning name is needed for
# test_group_apply_once_per_group,
# test_inconsistent_return_type, test_set_group_name,
# test_group_name_available_in_inference_pass,
# test_groupby_multi_timezone
object.__setattr__(group, "name", key)
Seems safe to assume that the attribute is not going anywhere soon, given the reasons for pinning it. Using g.index.year[0]
to create the appropriate timestamp for origin
will normally be a suitable alternative. It certainly is here, but one can of course use group_keys
that aren't directly retrievable from the data contained in g
.
A more verbose, but generic and "documented" alternative could then be:
gr_by = ohlc_data.groupby(ohlc_data.index.year, group_keys=False)
keys_iter = iter(gr_by.groups.keys())
out2 = (
gr_by
.apply(
lambda g: g.resample('3D', origin=pd.Timestamp(next(keys_iter), 1, 1))
.agg(agg)
)
)
out2.equals(out)
# True
Using groupby.groups
+ iter
+ next
.
If I'm reading this right, you're generating a daily model, then only taking every third day. The reason it doesn't work clean is because there are 365 days in non-leap years, which doesn't divide cleanly by 3, offsetting the resample dates by one for every year.
The easiest solution to implement would be to break the data down by year:
year1 = ohlc_data.loc['2020-01-01': '2020-12-31']
year2 = ohlc_data.loc['2021-01-01': '2021-12-31']
etc.....
and then sort them each by three day increments:
agg = {'Open': 'first', 'High': 'max', 'Low': 'min', 'Close': 'last'}
resample1 = year1.resample('3D').agg(agg)
resample2 = year2.resample('3D').agg(agg)
etc...
Then collapse them into a single pandas file:
resample = pd.concat([resample1, resample2, resample3, etc])
By breaking it down this way, you also avoid having to write unique code for pulling leap years.
If you wanted to be fancy with it, write a loop along the lines of:
# Empty Dataframe
resample = pd.Dataframe()
# Breakdown control
agg = {'Open': 'first', 'High': 'max', 'Low': 'min', 'Close': 'last'}
# List of years in data
years_in_data = ['2020', '2021', '2022', '2023', '2024', '2025']
for year in years_in_data:
# Jan01 Dec31
temporary_1D = ohlc_data.loc['%year-01-01', '%year-12-31']
temporary_3D = year1.resample('3D').agg(agg)
resample = pd.concat([resample, temporary_3D])
This should give you the output where every years starts on the first of January. You can simplify it or make it more reactive from there.
origin
as "starting from the first day for each year". The suggested duplicates combine to the answer:(ohlc_data.groupby(ohlc_data.index.year, group_keys=False).resample('3d', origin='start_day').agg(agg))
, which leads to the correct A only if time series per year actually starts on 1 Jan. For the first year, that is not the case. – ouroboros1 Commented Jan 30 at 9:33:'2020-06-26'
shouldn't include2020-06-28
– wjandrea Commented Feb 4 at 21:21