python - How can I change the format of a string date in pandas dataframe to an integer? - Stack Overflow

admin2025-05-01  1

Python (venv) 3.12.2 from VSCode (1.96.2) within Anaconda (2.6.4) on an intel iMac, Sequoia 15.1.1.

I'm working on a downloaded model (not my original work) for predicting MLB scores. One of the 17 or 18 different programs in the process retrieves pitcher data by game for multiple seasons. There are 3200+ individual csv files, each with up to several hundred rows x 32 columns.

Each pitchers' games pitched are to be matched to the specific game in which they appeared. This list of games is in a second (single) csv with ALL games for ALL teams listed (100k+ rows x 200 columns).

The match is done by date. This is where my problem begins.

The date format for the "games" file is, for example, 19800421.

The format for the "pitchers" dates are, for example, 4-11-1998 (strings, I believe). However there are also some formatted as 4- 5-1998 (note the space after day-) for games that aren't two digit months. (Not sure why, but there is no way to address the problem at the source given the sheer number of files and rows).

There is a line in the original code that is intended to clean up the first case (dates formatted as 4-11-1998). But there is no code to clean up the second case (4- 5-1998).

When I run the original code, I get the following error message:

ValueError: invalid literal for int() with base 10: '1977-07-26'

This date is the date in the first row of the first pitchers' stats.

I've tried several possible solutions that I found online and those are included in the attached code, commented out.

Here is the minimum amount of code that I believe will demonstrate the problem. This works (or fails, as it were) because I have the entirety of both data sets (pitcher and game data) available, however it's impossible to include all that here. So if more is needed, I'll need to figure out a way to get it to you.

import numpy as np
import pandas as pd
from datetime import datetime

filepath=('/Users/richardcartier/Documents/PythonProjects/pybaseball/BBall_Prediction/data/pitcher_data/')

def load_and_process_pitch_df(p_id, filepath):
    fname = filepath + 'pitching_data_'+p_id+'.csv'
    pitch_df = pd.read_csv(fname)

    #ADDED Convert the 'date' column to datetime format
    pitch_df['Date'] = pd.to_datetime(pitch_df['Date'])

    #ADDED Change the date format to 'YYYYMMDD'
    pitch_df['Date'] = pitch_df['Date'].dt.strftime('%Y%m%d')

   # Convert date, fix dblhead_num to be 0,1,2
   # This line below (Original code) was giving me the following error: ValueError:     invalid literal for int() with base 10: '1977-07-26'.  This is the first date in the  first pitchers' dataframe.  There are other dates in the dataframe that are formatted as '8- 6-1977' with a space where a '1' would be for months 10, 11, 12.

    # Original code
    #pitch_df['Date'] = (pd.to_datetime(pitch_df.Date).astype(str).str.replace('-','')).astype(int)
    
    # My first attempt at replacing either "-" or "- "
    #pitch_df['Date'] = (pd.to_datetime(pitch_df.Date).astype(str).str.multireplace([('- ','-'), ('','')])).astype(int)
    
    # A second attempt
    #for r in (("- ", "-"), ("","")):
    #    Date = Date.replace(*r).astype(int)

    # A third attempt
    pitch_df['Date'] = datetime('Date')
    pitch_df['Date'].strftime('%Y%m%d').astype(int)

    pitch_df.dblhead_num.fillna(0, inplace=True)
    pitch_df['dblhead_num'] = pitch_df['dblhead_num'].astype(int)
    df=pd.read_csv('/Users/richardcartier/Documents/PythonProjects/pybaseball/BBall_Prediction/df_bp3.csv')
 
    start_pitchers_h = df.pitcher_start_id_h.unique()
    start_pitchers_v = df.pitcher_start_id_v.unique()
    start_pitchers_all = np.union1d(start_pitchers_h, start_pitchers_v)
    pitcher_data_dict = {}

    for i, p_id in enumerate(start_pitchers_all):
        if i%100==0:
            print(i)
        pitcher_data_dict[p_id] = load_and_process_pitch_df(p_id, filepath)

This is a sample of the pitching data I'm using:

at_vs,Opponent,League,GS,CG,SHO,GF,SV,IP,H,BFP,HR,R,ER,BB,IB,SO,SH,SF,WP,HBP,BK,2B,3B,GDP,ROE,W,L,ERA,Date,dblhead_num
VS,MIL,A,1,1,0,0,0,9,9,38,0,3,2,2,0,11,0,1,0,0,0,2,0,0,0,1,0,2.00,7-26-1977,
AT,CAL,A,1,1,1,0,0,9,3,31,0,0,0,1,0,7,0,0,0,0,0,1,0,0,0,1,0,1.00,7-31-1977,
AT,OAK,A,1,0,0,0,0,7,5,28,0,1,1,2,0,1,1,0,0,1,0,1,0,0,0,1,0,1.08,8- 6-1977,
VS,CAL,A,1,0,0,0,0,5,8,22,2,6,5,1,0,1,0,1,0,0,0,1,0,2,0,0,1,2.40,8-11-1977,
VS,KC,A,1,0,0,0,0,6.1,7,27,0,3,3,0,0,5,0,0,0,0,0,2,0,0,1,0,0,2.72,8-16-1977,
AT,KC,A,1,0,0,0,0,6.1,8,28,0,3,2,3,1,2,0,1,0,0,0,3,1,0,1,0,0,2.74,8-21-1977,
VS,MIN,A,1,0,0,0,0,7,6,28,1,4,4,2,0,4,0,0,0,0,0,2,0,0,0,0,0,3.08,8-26-1977,
VS,CLE,A,1,0,0,0,0,6,7,24,0,3,3,0,0,5,0,0,0,0,0,1,0,0,0,0,0,3.23,8-31-1977,
AT,TOR,A,1,1,1,0,0,9,3,33,0,0,0,2,0,4,0,0,0,0,0,0,1,0,1,1,0,2.78,9- 5-1977,1
VS,DET,A,1,1,0,0,0,9,5,33,0,1,1,2,0,3,0,0,0,0,0,0,0,1,0,1,0,2.57,9-10-1977,
AT,BAL,A,1,0,0,0,0,5,7,21,1,4,4,2,0,1,0,0,0,0,0,2,1,2,0,0,1,2.86,9-16-1977,
AT,DET,A,1,0,0,0,0,6.1,7,28,1,3,3,1,0,2,1,0,0,0,0,1,1,0,1,0,0,2.96,9-22-1977,
VS,TOR,A,1,0,0,0,0,7.1,10,32,1,5,4,1,0,3,0,0,0,0,0,1,0,2,1,1,0,3.12,9-27-1977,1
VS,MIN,A,1,0,0,0,0,4.1,6,20,2,5,5,2,0,2,0,0,1,0,0,1,0,0,0,0,1,10.38,4-11-1978,
AT,OAK,A,1,0,0,0,0,7,4,26,1,1,1,3,1,2,0,0,0,0,0,0,0,0,2,0,0,4.76,4-16-1978,
AT,MIN,A,1,0,0,0,0,6.1,6,29,0,2,2,3,1,3,1,0,0,0,0,1,0,0,1,1,0,4.08,4-21-1978,
VS,TOR,A,1,0,0,0,0,3.1,7,18,0,3,3,1,0,4,0,0,0,0,0,0,0,0,0,0,0,4.71,4-28-1978,
VS,CLE,A,1,0,0,0,0,6,8,27,1,3,3,2,0,3,0,0,0,0,0,2,0,1,0,1,0,4.67,5- 7-1978,
AT,TOR,A,1,0,0,0,0,7.1,8,34,0,5,5,5,1,3,1,0,1,0,0,1,2,1,0,0,1,4.98,5-16-1978,
AT,MIL,A,1,0,0,0,0,7,7,30,0,1,1,3,0,3,0,0,0,0,0,2,0,1,0,0,0,4.35,5-21-1978,
VS,MIL,A,1,0,0,0,0,4.2,6,23,1,4,4,2,0,5,0,0,0,1,0,1,0,0,0,0,0,4.70,5-26-1978,
AT,CHI,A,1,0,0,0,0,4.1,8,20,0,4,2,2,0,0,0,0,0,0,0,1,0,1,1,0,1,4.65,5-31-1978,

And here is a sample of the game data to which I need to match the pitcher data (file df_bp3.csv in the above code):

date,dblheader_code,day_of_week,team_v,league_v,game_no_v,team_h,league_h,game_no_h,runs_v,runs_h,outs_total,day_night,completion_info,forfeit_info,protest_info,ballpark_id,attendance,game_minutes,linescore_v,linescore_h,AB_v,H_v,2B_v,3B_v,HR_v,RBI_v,SH_v,SF_v,HBP_v,BB_v,IBB_v,SO_v,SB_v,CS_v,GIDP_v,CI_v,LOB_v,P_num_v,ERind_v,ERteam_v,WP_v,balk_v,PO_v,ASST_v,ERR_v,PB_v,DP_v,TP_v,AB_h,H_h,2B_h,3B_h,HR_h,RBI_h,SH_h,SF_h,HBP_h,BB_h,IBB_h,SO_h,SB_h,CS_h,GIDP_h,CI_h,LOB_h,P_num_h,ERind_h,ERteam_h,WP_h,balk_h,PO_h,ASST_h,ERR_h,PB_h,DP_h,TP_h,ump_HB_id,ump_HB_name,ump_1B_id,ump_1B_name,ump_2B_id,ump_2B_name,ump_3B_id,ump_3B_name,ump_LF_id,ump_LF_name,ump_RF_id,ump_RF_name,mgr_id_v,mgr_name_v,mgr_id_h,mgr_name_h,pitcher_id_w,pitcher_name_w,pitcher_id_l,pitcher_name_l,pitcher_id_s,pitcher_name_s,GWRBI_id,GWRBI_name,pitcher_start_id_v,pitcher_start_name_v,pitcher_start_id_h,pitcher_start_name_h,batter1_name_v,batter1_id_v,batter1_pos_v,batter2_name_v,batter2_id_v,batter2_pos_v,batter3_name_v,batter3_id_v,batter3_pos_v,batter4_name_v,batter4_id_v,batter4_pos_v,batter5_name_v,batter5_id_v,batter5_pos_v,batter6_name_v,batter6_id_v,batter6_pos_v,batter7_name_v,batter7_id_v,batter7_pos_v,batter8_name_v,batter8_id_v,batter8_pos_v,batter9_name_v,batter9_id_v,batter9_pos_v,batter1_name_h,batter1_id_h,batter1_pos_h,batter2_name_h,batter2_id_h,batter2_pos_h,batter3_name_h,batter3_id_h,batter3_pos_h,batter4_name_h,batter4_id_h,batter4_pos_h,batter5_name_h,batter5_id_h,batter5_pos_h,batter6_name_h,batter6_id_h,batter6_pos_h,batter7_name_h,batter7_id_h,batter7_pos_h,batter8_name_h,batter8_id_h,batter8_pos_h,batter9_name_h,batter9_id_h,batter9_pos_h,misc_info,acqui_info,season,run_diff,home_victory,run_total,date_dblhead,BATAVG_162_h,BATAVG_162_v,OBP_162_h,OBP_162_v,SLG_162_h,SLG_162_v,OBS_162_h,OBS_162_v,SB_162_h,SB_162_v,CS_162_h,CS_162_v,ERR_162_h,ERR_162_v,BATAVG_30_h,BATAVG_30_v,OBP_30_h,OBP_30_v,SLG_30_h,SLG_30_v,OBS_30_h,OBS_30_v,SB_30_h,SB_30_v,CS_30_h,CS_30_v,ERR_30_h,ERR_30_v,implied_prob_h,implied_prob_v,implied_prob_h_mid,over_under_line,over_under_result
19800409,0,Wed,TOR,AL,1,SEA,AL,1,6,8,51,N,,,,SEA02,22588.0,154,200001030,41010020x,36,9,1,1,2,5,0,0,0,5,0,4,2,0,1,0,8,5,6,6,0,0,24,14,1,0,1,0,34,12,2,1,1,8,2,0,0,2,1,4,0,0,1,0,6,3,6,6,0,0,27,16,2,1,1,0,barnl901,Larry Barnett,evanj901,Jim Evans,mckej901,Jim McKean,hendt901,Ted Hendry,,(none),,(none),mattb101,Bobby Mattick,johnd106,Darrell Johnson,parrm101,Mike Parrott,lemad101,Dave Lemanczyk,heavd101,Dave Heaverlo,cox-t101,Ted Cox,lemad101,Dave Lemanczyk,parrm101,Mike Parrott,grifa001,Alfredo Griffin,6,bailb001,Bob Bailor,9,maybj101,John Mayberry,3,veleo101,Otto Velez,10,hower001,Roy Howell,5,bonnb001,Barry Bonnell,7,boser101,Rick Bosetti,8,garcd001,Damaso Garcia,4,white002,Ernie Whitt,2,cruzj002,Julio Cruz,4,crair001,Rod Craig,8,meyed001,Dan Meyer,7,bochb001,Bruce Bochte,3,hortw101,Willie Horton,10,simpj101,Joe Simpson,9,cox-t101,Ted Cox,5,stinb101,Bob Stinson,2,mendm101,Mario Mendoza,6,,Y,1980,2,1,14,

Because of the size of the game file, I can't include more than this one row, so hopefully this will be enough to show what I'm trying to match.

Python (venv) 3.12.2 from VSCode (1.96.2) within Anaconda (2.6.4) on an intel iMac, Sequoia 15.1.1.

I'm working on a downloaded model (not my original work) for predicting MLB scores. One of the 17 or 18 different programs in the process retrieves pitcher data by game for multiple seasons. There are 3200+ individual csv files, each with up to several hundred rows x 32 columns.

Each pitchers' games pitched are to be matched to the specific game in which they appeared. This list of games is in a second (single) csv with ALL games for ALL teams listed (100k+ rows x 200 columns).

The match is done by date. This is where my problem begins.

The date format for the "games" file is, for example, 19800421.

The format for the "pitchers" dates are, for example, 4-11-1998 (strings, I believe). However there are also some formatted as 4- 5-1998 (note the space after day-) for games that aren't two digit months. (Not sure why, but there is no way to address the problem at the source given the sheer number of files and rows).

There is a line in the original code that is intended to clean up the first case (dates formatted as 4-11-1998). But there is no code to clean up the second case (4- 5-1998).

When I run the original code, I get the following error message:

ValueError: invalid literal for int() with base 10: '1977-07-26'

This date is the date in the first row of the first pitchers' stats.

I've tried several possible solutions that I found online and those are included in the attached code, commented out.

Here is the minimum amount of code that I believe will demonstrate the problem. This works (or fails, as it were) because I have the entirety of both data sets (pitcher and game data) available, however it's impossible to include all that here. So if more is needed, I'll need to figure out a way to get it to you.

import numpy as np
import pandas as pd
from datetime import datetime

filepath=('/Users/richardcartier/Documents/PythonProjects/pybaseball/BBall_Prediction/data/pitcher_data/')

def load_and_process_pitch_df(p_id, filepath):
    fname = filepath + 'pitching_data_'+p_id+'.csv'
    pitch_df = pd.read_csv(fname)

    #ADDED Convert the 'date' column to datetime format
    pitch_df['Date'] = pd.to_datetime(pitch_df['Date'])

    #ADDED Change the date format to 'YYYYMMDD'
    pitch_df['Date'] = pitch_df['Date'].dt.strftime('%Y%m%d')

   # Convert date, fix dblhead_num to be 0,1,2
   # This line below (Original code) was giving me the following error: ValueError:     invalid literal for int() with base 10: '1977-07-26'.  This is the first date in the  first pitchers' dataframe.  There are other dates in the dataframe that are formatted as '8- 6-1977' with a space where a '1' would be for months 10, 11, 12.

    # Original code
    #pitch_df['Date'] = (pd.to_datetime(pitch_df.Date).astype(str).str.replace('-','')).astype(int)
    
    # My first attempt at replacing either "-" or "- "
    #pitch_df['Date'] = (pd.to_datetime(pitch_df.Date).astype(str).str.multireplace([('- ','-'), ('','')])).astype(int)
    
    # A second attempt
    #for r in (("- ", "-"), ("","")):
    #    Date = Date.replace(*r).astype(int)

    # A third attempt
    pitch_df['Date'] = datetime('Date')
    pitch_df['Date'].strftime('%Y%m%d').astype(int)

    pitch_df.dblhead_num.fillna(0, inplace=True)
    pitch_df['dblhead_num'] = pitch_df['dblhead_num'].astype(int)
    df=pd.read_csv('/Users/richardcartier/Documents/PythonProjects/pybaseball/BBall_Prediction/df_bp3.csv')
 
    start_pitchers_h = df.pitcher_start_id_h.unique()
    start_pitchers_v = df.pitcher_start_id_v.unique()
    start_pitchers_all = np.union1d(start_pitchers_h, start_pitchers_v)
    pitcher_data_dict = {}

    for i, p_id in enumerate(start_pitchers_all):
        if i%100==0:
            print(i)
        pitcher_data_dict[p_id] = load_and_process_pitch_df(p_id, filepath)

This is a sample of the pitching data I'm using:

at_vs,Opponent,League,GS,CG,SHO,GF,SV,IP,H,BFP,HR,R,ER,BB,IB,SO,SH,SF,WP,HBP,BK,2B,3B,GDP,ROE,W,L,ERA,Date,dblhead_num
VS,MIL,A,1,1,0,0,0,9,9,38,0,3,2,2,0,11,0,1,0,0,0,2,0,0,0,1,0,2.00,7-26-1977,
AT,CAL,A,1,1,1,0,0,9,3,31,0,0,0,1,0,7,0,0,0,0,0,1,0,0,0,1,0,1.00,7-31-1977,
AT,OAK,A,1,0,0,0,0,7,5,28,0,1,1,2,0,1,1,0,0,1,0,1,0,0,0,1,0,1.08,8- 6-1977,
VS,CAL,A,1,0,0,0,0,5,8,22,2,6,5,1,0,1,0,1,0,0,0,1,0,2,0,0,1,2.40,8-11-1977,
VS,KC,A,1,0,0,0,0,6.1,7,27,0,3,3,0,0,5,0,0,0,0,0,2,0,0,1,0,0,2.72,8-16-1977,
AT,KC,A,1,0,0,0,0,6.1,8,28,0,3,2,3,1,2,0,1,0,0,0,3,1,0,1,0,0,2.74,8-21-1977,
VS,MIN,A,1,0,0,0,0,7,6,28,1,4,4,2,0,4,0,0,0,0,0,2,0,0,0,0,0,3.08,8-26-1977,
VS,CLE,A,1,0,0,0,0,6,7,24,0,3,3,0,0,5,0,0,0,0,0,1,0,0,0,0,0,3.23,8-31-1977,
AT,TOR,A,1,1,1,0,0,9,3,33,0,0,0,2,0,4,0,0,0,0,0,0,1,0,1,1,0,2.78,9- 5-1977,1
VS,DET,A,1,1,0,0,0,9,5,33,0,1,1,2,0,3,0,0,0,0,0,0,0,1,0,1,0,2.57,9-10-1977,
AT,BAL,A,1,0,0,0,0,5,7,21,1,4,4,2,0,1,0,0,0,0,0,2,1,2,0,0,1,2.86,9-16-1977,
AT,DET,A,1,0,0,0,0,6.1,7,28,1,3,3,1,0,2,1,0,0,0,0,1,1,0,1,0,0,2.96,9-22-1977,
VS,TOR,A,1,0,0,0,0,7.1,10,32,1,5,4,1,0,3,0,0,0,0,0,1,0,2,1,1,0,3.12,9-27-1977,1
VS,MIN,A,1,0,0,0,0,4.1,6,20,2,5,5,2,0,2,0,0,1,0,0,1,0,0,0,0,1,10.38,4-11-1978,
AT,OAK,A,1,0,0,0,0,7,4,26,1,1,1,3,1,2,0,0,0,0,0,0,0,0,2,0,0,4.76,4-16-1978,
AT,MIN,A,1,0,0,0,0,6.1,6,29,0,2,2,3,1,3,1,0,0,0,0,1,0,0,1,1,0,4.08,4-21-1978,
VS,TOR,A,1,0,0,0,0,3.1,7,18,0,3,3,1,0,4,0,0,0,0,0,0,0,0,0,0,0,4.71,4-28-1978,
VS,CLE,A,1,0,0,0,0,6,8,27,1,3,3,2,0,3,0,0,0,0,0,2,0,1,0,1,0,4.67,5- 7-1978,
AT,TOR,A,1,0,0,0,0,7.1,8,34,0,5,5,5,1,3,1,0,1,0,0,1,2,1,0,0,1,4.98,5-16-1978,
AT,MIL,A,1,0,0,0,0,7,7,30,0,1,1,3,0,3,0,0,0,0,0,2,0,1,0,0,0,4.35,5-21-1978,
VS,MIL,A,1,0,0,0,0,4.2,6,23,1,4,4,2,0,5,0,0,0,1,0,1,0,0,0,0,0,4.70,5-26-1978,
AT,CHI,A,1,0,0,0,0,4.1,8,20,0,4,2,2,0,0,0,0,0,0,0,1,0,1,1,0,1,4.65,5-31-1978,

And here is a sample of the game data to which I need to match the pitcher data (file df_bp3.csv in the above code):

date,dblheader_code,day_of_week,team_v,league_v,game_no_v,team_h,league_h,game_no_h,runs_v,runs_h,outs_total,day_night,completion_info,forfeit_info,protest_info,ballpark_id,attendance,game_minutes,linescore_v,linescore_h,AB_v,H_v,2B_v,3B_v,HR_v,RBI_v,SH_v,SF_v,HBP_v,BB_v,IBB_v,SO_v,SB_v,CS_v,GIDP_v,CI_v,LOB_v,P_num_v,ERind_v,ERteam_v,WP_v,balk_v,PO_v,ASST_v,ERR_v,PB_v,DP_v,TP_v,AB_h,H_h,2B_h,3B_h,HR_h,RBI_h,SH_h,SF_h,HBP_h,BB_h,IBB_h,SO_h,SB_h,CS_h,GIDP_h,CI_h,LOB_h,P_num_h,ERind_h,ERteam_h,WP_h,balk_h,PO_h,ASST_h,ERR_h,PB_h,DP_h,TP_h,ump_HB_id,ump_HB_name,ump_1B_id,ump_1B_name,ump_2B_id,ump_2B_name,ump_3B_id,ump_3B_name,ump_LF_id,ump_LF_name,ump_RF_id,ump_RF_name,mgr_id_v,mgr_name_v,mgr_id_h,mgr_name_h,pitcher_id_w,pitcher_name_w,pitcher_id_l,pitcher_name_l,pitcher_id_s,pitcher_name_s,GWRBI_id,GWRBI_name,pitcher_start_id_v,pitcher_start_name_v,pitcher_start_id_h,pitcher_start_name_h,batter1_name_v,batter1_id_v,batter1_pos_v,batter2_name_v,batter2_id_v,batter2_pos_v,batter3_name_v,batter3_id_v,batter3_pos_v,batter4_name_v,batter4_id_v,batter4_pos_v,batter5_name_v,batter5_id_v,batter5_pos_v,batter6_name_v,batter6_id_v,batter6_pos_v,batter7_name_v,batter7_id_v,batter7_pos_v,batter8_name_v,batter8_id_v,batter8_pos_v,batter9_name_v,batter9_id_v,batter9_pos_v,batter1_name_h,batter1_id_h,batter1_pos_h,batter2_name_h,batter2_id_h,batter2_pos_h,batter3_name_h,batter3_id_h,batter3_pos_h,batter4_name_h,batter4_id_h,batter4_pos_h,batter5_name_h,batter5_id_h,batter5_pos_h,batter6_name_h,batter6_id_h,batter6_pos_h,batter7_name_h,batter7_id_h,batter7_pos_h,batter8_name_h,batter8_id_h,batter8_pos_h,batter9_name_h,batter9_id_h,batter9_pos_h,misc_info,acqui_info,season,run_diff,home_victory,run_total,date_dblhead,BATAVG_162_h,BATAVG_162_v,OBP_162_h,OBP_162_v,SLG_162_h,SLG_162_v,OBS_162_h,OBS_162_v,SB_162_h,SB_162_v,CS_162_h,CS_162_v,ERR_162_h,ERR_162_v,BATAVG_30_h,BATAVG_30_v,OBP_30_h,OBP_30_v,SLG_30_h,SLG_30_v,OBS_30_h,OBS_30_v,SB_30_h,SB_30_v,CS_30_h,CS_30_v,ERR_30_h,ERR_30_v,implied_prob_h,implied_prob_v,implied_prob_h_mid,over_under_line,over_under_result
19800409,0,Wed,TOR,AL,1,SEA,AL,1,6,8,51,N,,,,SEA02,22588.0,154,200001030,41010020x,36,9,1,1,2,5,0,0,0,5,0,4,2,0,1,0,8,5,6,6,0,0,24,14,1,0,1,0,34,12,2,1,1,8,2,0,0,2,1,4,0,0,1,0,6,3,6,6,0,0,27,16,2,1,1,0,barnl901,Larry Barnett,evanj901,Jim Evans,mckej901,Jim McKean,hendt901,Ted Hendry,,(none),,(none),mattb101,Bobby Mattick,johnd106,Darrell Johnson,parrm101,Mike Parrott,lemad101,Dave Lemanczyk,heavd101,Dave Heaverlo,cox-t101,Ted Cox,lemad101,Dave Lemanczyk,parrm101,Mike Parrott,grifa001,Alfredo Griffin,6,bailb001,Bob Bailor,9,maybj101,John Mayberry,3,veleo101,Otto Velez,10,hower001,Roy Howell,5,bonnb001,Barry Bonnell,7,boser101,Rick Bosetti,8,garcd001,Damaso Garcia,4,white002,Ernie Whitt,2,cruzj002,Julio Cruz,4,crair001,Rod Craig,8,meyed001,Dan Meyer,7,bochb001,Bruce Bochte,3,hortw101,Willie Horton,10,simpj101,Joe Simpson,9,cox-t101,Ted Cox,5,stinb101,Bob Stinson,2,mendm101,Mario Mendoza,6,,Y,1980,2,1,14,

Because of the size of the game file, I can't include more than this one row, so hopefully this will be enough to show what I'm trying to match.

Share Improve this question edited Jan 3 at 0:00 Richard C asked Jan 2 at 20:59 Richard CRichard C 11 silver badge4 bronze badges 10
  • Is the code indentation correct in the question? Everything after #A third attempt is not inside the function definition. – Barmar Commented Jan 2 at 21:06
  • You need to remove the spaces before you call to_datetime() – Barmar Commented Jan 2 at 21:08
  • Yes. My bad apparently. Although the indentation was overlooked when I posted the code, it is properly indented in the actual code. It doesn't look like I can correct that here, however. – Richard C Commented Jan 2 at 21:23
  • 1 Why can't you correct it here? Click on the Edit link and paste it again. – Barmar Commented Jan 2 at 21:24
  • 1 @Vin He's not calling strptime() himself, it's happening internally to pandas. – Barmar Commented Jan 2 at 21:47
 |  Show 5 more comments

2 Answers 2

Reset to default 0

Based on what is given in your description you can try to parse the different date formats with the pandas to_datetime method. The trickiest format are the dates with the extra space like in '4- 5-1998'. This can be parsed when setting the dayfirst argument to True.

Making use of the fact that dayfirst=True is not strict, it will try to parse with day first but won't force it on dates that don't start with day first.

Combining the parsing with calling the assign method you can parse the dates as you needed like so:

import pandas as pd
    
dates = ['20011221', '19800409', '5- 1-1998', '10-12-2021', '31-12-2020']
df = pd.DataFrame(dates, columns=['date'])

df = df.assign(
  parsed=pd.to_datetime(df.date, dayfirst=True),
  date=lambda x: x.parsed.dt.strftime('%Y%m%d')
)

assign executes in order resulting in the following dataframe with the extra column 'parsed' that can be dropped if not needed.

       date     parsed
0  20011221 2001-12-21
1  19800409 1980-04-09
2  19980105 1998-01-05
3  20211210 2021-12-10
4  20201231 2020-12-31

Although there may be alternatives, I found a solution that solves the original problem and updated the code with that solution. I added the following lines in the function definition that properly converts "d/m/YYYY" into the "YYYYmd" format I need. That raises a different problem which I'll address separately, but it solves this question so it can be (if appropriate) flagged as answered.

Solution:

def load_and_process_pitch_df(p_id, filepath):
    fname = filepath + 'pitching_data_'+p_id+'.csv'
    pitch_df = pd.read_csv(fname)

    #ADDED Convert the 'date' column to datetime format
    pitch_df['Date'] = pd.to_datetime(pitch_df['Date'])

    #ADDED Change the date format to 'YYYYMMDD'
    pitch_df['Date'] = pitch_df['Date'].dt.strftime('%Y%m%d')
转载请注明原文地址:http://anycun.com/QandA/1746096854a91621.html