python - How do I make Pandas resample starting first day of each year in DataFrame - Stack Overflow


I have a DataFrame containing daily data:

import pandas as pd
import numpy as np

# Set the random seed for reproducibility
np.random.seed(42)

# Generate random data
dates = pd.date_range(start='2020-06-25', periods=1679, freq='D')
open_prices = np.random.uniform(low=100, high=200, size=len(dates))
high_prices = open_prices + np.random.uniform(low=0, high=10, size=len(dates))
low_prices = open_prices - np.random.uniform(low=0, high=10, size=len(dates))
close_prices = np.random.uniform(low=low_prices, high=high_prices)

# Create the DataFrame
ohlc_data = pd.DataFrame({
    'Open': open_prices,
    'High': high_prices,
    'Low': low_prices,
    'Close': close_prices
}, index=dates)
>>> ohlc_data
            Open        High        Low         Close
Date        
2020-06-25  137.454012  144.403523  129.235702  143.945741
2020-06-26  195.071431  198.532428  188.476323  195.793458
2020-06-27  173.199394  182.955496  165.236980  181.584150
2020-06-28  159.865848  166.275569  157.146388  164.104760
2020-06-29  115.601864  123.826670  108.678275  117.641837
... ... ... ... ...
2025-01-24  179.003044  184.073640  173.358878  175.845545
2025-01-25  130.467914  132.347118  124.462082  130.051784
2025-01-26  108.091928  108.861645  106.429375  108.432944
2025-01-27  140.298018  147.259579  136.500644  145.721364
2025-01-28  117.352451  121.180439  111.180509  115.331552

I need to resample the data into 3-day bins that start from the first day of each year in the DataFrame:

agg = {'Open': 'first', 'High': 'max', 'Low': 'min', 'Close': 'last'}
resampled = ohlc_data.resample('3D').agg(agg)
>>> resampled
            Open        High        Low         Close
Date
2020-06-25  137.454012  198.532428  129.235702  181.584150
2020-06-28  159.865848  166.275569  108.678275  113.720371
2020-07-01  105.808361  195.845186  96.417676   161.755865
2020-07-04  170.807258  198.739371  100.149184  192.952485
2020-07-07  183.244264  188.269925  115.280878  118.845486
... ... ... ... ...
2025-01-15  185.142496  195.768948  111.549399  191.451661
2025-01-18  111.636640  184.883239  109.047597  136.075629
2025-01-21  187.797432  191.054177  175.267831  176.687211
2025-01-24  179.003044  184.073640  106.429375  108.432944
2025-01-27  140.298018  147.259579  111.180509  115.331552

first year:

>>> resampled.loc['2020-01-01': '2020-06-26']
            Open        High        Low         Close
Date
2020-06-25  137.454012  198.532428  129.235702  181.584150
2020-06-28  159.865848  166.275569  108.678275  113.720371

This is okay for now because I don't have data before 2020-06-25

second year:

>>> resampled.loc['2021-01-01': '2021-01-06']
            Open        High        Low         Close
Date
2021-01-03  190.041806  192.176919  128.095988  134.056420
2021-01-06  134.920957  195.614503  129.865870  189.599085

Resampling here starts from 2021-01-03, but I need it to start from 2021-01-01.

third year:

>>> resampled.loc['2022-01-01': '2022-01-06']
            Open        High        Low         Close
Date
2022-01-01  140.348287  147.026333  98.926715   103.562334
2022-01-04  175.513726  177.367027  158.572894  169.912020

Resampling in this year works the way I need, starting from 2022-01-01.

I tried using the origin parameter:

agg = {'Open': 'first', 'High': 'max', 'Low': 'min', 'Close': 'last'}
ori = str(ohlc_data.index[0].date().replace(month=1, day=1))
resampled = ohlc_data.resample('3D', origin=ori).agg(agg)

but this works only for the first year in the DataFrame.

asked Jan 29 at 16:05 by x_crypto, edited Feb 4 at 21:54 by wjandrea
  • 1 Isn't this essentially the same as your previous question, just on a different timescale? Did you try groupby like mozway answered there? – wjandrea Commented Jan 29 at 20:14
  • 1 Voting to reopen. @wjandrea: I don't believe that to be true. OP is asking for origin as "starting from the first day for each year". The suggested duplicates combine to the answer: (ohlc_data.groupby(ohlc_data.index.year, group_keys=False).resample('3d', origin='start_day').agg(agg)), which leads to the correct answer only if the time series for each year actually starts on 1 Jan. For the first year, that is not the case. – ouroboros1 Commented Jan 30 at 9:33
  • 1 Hello @wjandrea No it's not the same question and like what ouroboros1 said that is a different case – x_crypto Commented Jan 30 at 16:17
  • Minor detail, but :'2020-06-26' shouldn't include 2020-06-28 – wjandrea Commented Feb 4 at 21:21
  • @x_crypto What I mean is, there are a lot of similarities and it doesn't seem like you've tried applying what you learned there to this situation, but regardless, I can see how the question is different now. I reopened it. – wjandrea Commented Feb 4 at 21:57
2 Answers

Here's one approach:

agg = {'Open': 'first', 'High': 'max', 'Low': 'min', 'Close': 'last'}

out = (
    ohlc_data.groupby(ohlc_data.index.year, group_keys=False)
    .apply(
        lambda g: g.resample('3D', origin=pd.Timestamp(g.name, 1, 1))
        .agg(agg)
    )
)

Output:

out[~out.index.year.duplicated()]

                  Open        High         Low       Close
2020-06-23  137.454012  144.403523  129.235702  143.945741
2021-01-01  109.310277  197.899429  106.286082  188.085355
2022-01-01  140.348287  147.026333   98.926715  103.562334
2023-01-01  186.846798  189.998543  144.825695  189.588286
2024-01-01  151.771164  160.336537  101.736241  128.839795
2025-01-01  130.312836  137.712451  124.656259  135.872983

Explanation

  • Use df.groupby to group by year (DatetimeIndex.year).
  • Use groupby.apply + df.resample. This way we can access .name for each group to create the appropriate origin.
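To see what origin changes, here is a minimal sketch on a toy series (the series s and its values are made up for illustration, they are not the question's data). With the default origin='start_day', the first bin is labelled with the first date in the data. With an explicit January 1 origin, the labels fall on a 3-day grid counted from January 1, so the first label can precede the first data point, as with 2020-06-23 in the output above.

```python
import pandas as pd

# Toy daily series starting mid-year, like the question's data (values are arbitrary)
s = pd.Series(1, index=pd.date_range('2020-06-25', periods=10, freq='D'))

# Default: bins are anchored to midnight of the first timestamp
default_bins = s.resample('3D', origin='start_day').sum()

# Explicit origin: bins are anchored to a 3-day grid counted from January 1
jan1_bins = s.resample('3D', origin=pd.Timestamp(2020, 1, 1)).sum()

print(default_bins.index[0].date())  # 2020-06-25
print(jan1_bins.index[0].date())     # 2020-06-23
```

The first January-anchored bin covers 2020-06-23 through 2020-06-25, but only the last of those days has data.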

Edit (in response to comment by @wjandrea):

The .name attribute doesn't appear to be documented (cf. this post). It gets set when the apply_groupwise method of the BaseGrouper is called, which happens when you use groupby.apply (groupby.py#L1785) via _python_apply_general (e.g. #L1851 and then #L1885).

The relevant part for apply_groupwise is in ops.py#L1006-L1014:

        zipped = zip(group_keys, splitter)

        for key, group in zipped:
            # Pinning name is needed for
            #  test_group_apply_once_per_group,
            #  test_inconsistent_return_type, test_set_group_name,
            #  test_group_name_available_in_inference_pass,
            #  test_groupby_multi_timezone
            object.__setattr__(group, "name", key)

Seems safe to assume that the attribute is not going anywhere soon, given the reasons for pinning it. Using g.index.year[0] to create the appropriate timestamp for origin will normally be a suitable alternative. It certainly is here, but one can of course use group_keys that aren't directly retrievable from the data contained in g.

A more verbose, but generic and "documented" alternative could then be:

gr_by = ohlc_data.groupby(ohlc_data.index.year, group_keys=False)
keys_iter = iter(gr_by.groups.keys())

out2 = (
    gr_by
    .apply(
        lambda g: g.resample('3D', origin=pd.Timestamp(next(keys_iter), 1, 1))
        .agg(agg)
    )
)

out2.equals(out)
# True

Using groupby.groups + iter + next.
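Another option, sketched here as an assumption rather than as part of the approach above, is to iterate over the groupby object directly. Iteration yields (key, group) pairs, so the year is available without .name and without a separate iterator. The stand-in data and the names pieces and out3 are hypothetical, and the random values do not respect High >= Low.

```python
import numpy as np
import pandas as pd

# Stand-in data with the same index as the question's ohlc_data (values are arbitrary)
dates = pd.date_range(start='2020-06-25', periods=1679, freq='D')
rng = np.random.default_rng(0)
ohlc_data = pd.DataFrame(
    rng.uniform(100, 200, size=(len(dates), 4)),
    columns=['Open', 'High', 'Low', 'Close'],
    index=dates,
)
agg = {'Open': 'first', 'High': 'max', 'Low': 'min', 'Close': 'last'}

# Iterating over a groupby yields (key, group) pairs, so the year comes with each group
pieces = [
    g.resample('3D', origin=pd.Timestamp(int(year), 1, 1)).agg(agg)
    for year, g in ohlc_data.groupby(ohlc_data.index.year)
]
out3 = pd.concat(pieces)

# First bin label of each year
first_labels = out3[~out3.index.year.duplicated()].index
print(first_labels)
```

This trades the single chained expression for an explicit list of per-year results, which can be easier to inspect while debugging.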

If I'm reading this right, you're generating daily data and then resampling it into 3-day bins. It doesn't line up cleanly because a non-leap year has 365 days, which is not divisible by 3 (365 = 3 × 121 + 2), so the bin boundaries shift relative to January 1 from one year to the next. A leap year has 366 days, which is divisible by 3, so it adds no further shift.
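The drift can be checked with a little date arithmetic. The sketch below assumes the default origin='start_day', which anchors the 3-day grid to the first date in the data (2020-06-25 in the question). The names offsets and bin_starts are illustrative only.

```python
import pandas as pd

# With the default origin, every bin starts a multiple of 3 days after the first date
origin = pd.Timestamp('2020-06-25')

offsets = {}
bin_starts = {}
for year in range(2021, 2026):
    jan1 = pd.Timestamp(year, 1, 1)
    # 0 means a bin starts exactly on January 1
    offsets[year] = (jan1 - origin).days % 3
    # Start of the bin that contains January 1
    bin_starts[year] = jan1 - pd.Timedelta(days=offsets[year])
    print(year, offsets[year], bin_starts[year].date())
```

Only 2022 has an offset of 0, which matches the question: that is the one year where the plain resample('3D') already starts on January 1.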

The easiest solution to implement would be to break the data down by year:

year1 = ohlc_data.loc['2020-01-01': '2020-12-31'] 
year2 = ohlc_data.loc['2021-01-01': '2021-12-31']

etc.....

and then resample each of them into three-day bins:

agg = {'Open': 'first', 'High': 'max', 'Low': 'min', 'Close': 'last'}
resample1 = year1.resample('3D').agg(agg)
resample2 = year2.resample('3D').agg(agg) 

etc...

Then concatenate them into a single DataFrame:

resample = pd.concat([resample1, resample2])  # plus the results for the remaining years

By breaking it down this way, you also avoid having to write unique code for pulling leap years.

If you wanted to be fancy with it, write a loop along the lines of:

# Per-year results are collected here
pieces = []
# Aggregation rules for the OHLC columns
agg = {'Open': 'first', 'High': 'max', 'Low': 'min', 'Close': 'last'}
# List of years in data
years_in_data = ['2020', '2021', '2022', '2023', '2024', '2025']

for year in years_in_data:
    # Slice one calendar year, Jan 01 through Dec 31
    temporary_1D = ohlc_data.loc[f'{year}-01-01':f'{year}-12-31']
    # Resample that year on its own, so its bins start at its first row
    temporary_3D = temporary_1D.resample('3D').agg(agg)
    pieces.append(temporary_3D)

# Combine the per-year results into one DataFrame
resample = pd.concat(pieces)

This should give you output where every year starts on the first of January (or on the first available date, for a year whose data begins later). You can simplify it or make it more dynamic from there.
