Python (venv) 3.12.2 from VSCode (1.96.2) within Anaconda (2.6.4) on an intel iMac, Sequoia 15.1.1.
I'm working on a downloaded model (not my original work) for predicting MLB scores. One of the 17 or 18 different programs in the process retrieves pitcher data by game for multiple seasons. There are 3200+ individual csv files, each with up to several hundred rows x 32 columns.
Each pitchers' games pitched are to be matched to the specific game in which they appeared. This list of games is in a second (single) csv with ALL games for ALL teams listed (100k+ rows x 200 columns).
The match is done by date. This is where my problem begins.
The date format for the "games" file is, for example, 19800421.
The format for the "pitchers" dates are, for example, 4-11-1998 (strings, I believe). However there are also some formatted as 4- 5-1998 (note the space after day-) for games that aren't two digit months. (Not sure why, but there is no way to address the problem at the source given the sheer number of files and rows).
There is a line in the original code that is intended to clean up the first case (dates formatted as 4-11-1998). But there is no code to clean up the second case (4- 5-1998).
When I run the original code, I get the following error message:
ValueError: invalid literal for int() with base 10: '1977-07-26'
This date is the date in the first row of the first pitchers' stats.
I've tried several possible solutions that I found online and those are included in the attached code, commented out.
Here is the minimum amount of code that I believe will demonstrate the problem. This works (or fails, as it were) because I have the entirety of both data sets (pitcher and game data) available, however it's impossible to include all that here. So if more is needed, I'll need to figure out a way to get it to you.
import numpy as np
import pandas as pd
from datetime import datetime
filepath=('/Users/richardcartier/Documents/PythonProjects/pybaseball/BBall_Prediction/data/pitcher_data/')
def load_and_process_pitch_df(p_id, filepath):
fname = filepath + 'pitching_data_'+p_id+'.csv'
pitch_df = pd.read_csv(fname)
#ADDED Convert the 'date' column to datetime format
pitch_df['Date'] = pd.to_datetime(pitch_df['Date'])
#ADDED Change the date format to 'YYYYMMDD'
pitch_df['Date'] = pitch_df['Date'].dt.strftime('%Y%m%d')
# Convert date, fix dblhead_num to be 0,1,2
# This line below (Original code) was giving me the following error: ValueError: invalid literal for int() with base 10: '1977-07-26'. This is the first date in the first pitchers' dataframe. There are other dates in the dataframe that are formatted as '8- 6-1977' with a space where a '1' would be for months 10, 11, 12.
# Original code
#pitch_df['Date'] = (pd.to_datetime(pitch_df.Date).astype(str).str.replace('-','')).astype(int)
# My first attempt at replacing either "-" or "- "
#pitch_df['Date'] = (pd.to_datetime(pitch_df.Date).astype(str).str.multireplace([('- ','-'), ('','')])).astype(int)
# A second attempt
#for r in (("- ", "-"), ("","")):
# Date = Date.replace(*r).astype(int)
# A third attempt
pitch_df['Date'] = datetime('Date')
pitch_df['Date'].strftime('%Y%m%d').astype(int)
pitch_df.dblhead_num.fillna(0, inplace=True)
pitch_df['dblhead_num'] = pitch_df['dblhead_num'].astype(int)
df=pd.read_csv('/Users/richardcartier/Documents/PythonProjects/pybaseball/BBall_Prediction/df_bp3.csv')
start_pitchers_h = df.pitcher_start_id_h.unique()
start_pitchers_v = df.pitcher_start_id_v.unique()
start_pitchers_all = np.union1d(start_pitchers_h, start_pitchers_v)
pitcher_data_dict = {}
for i, p_id in enumerate(start_pitchers_all):
if i%100==0:
print(i)
pitcher_data_dict[p_id] = load_and_process_pitch_df(p_id, filepath)
This is a sample of the pitching data I'm using:
at_vs,Opponent,League,GS,CG,SHO,GF,SV,IP,H,BFP,HR,R,ER,BB,IB,SO,SH,SF,WP,HBP,BK,2B,3B,GDP,ROE,W,L,ERA,Date,dblhead_num
VS,MIL,A,1,1,0,0,0,9,9,38,0,3,2,2,0,11,0,1,0,0,0,2,0,0,0,1,0,2.00,7-26-1977,
AT,CAL,A,1,1,1,0,0,9,3,31,0,0,0,1,0,7,0,0,0,0,0,1,0,0,0,1,0,1.00,7-31-1977,
AT,OAK,A,1,0,0,0,0,7,5,28,0,1,1,2,0,1,1,0,0,1,0,1,0,0,0,1,0,1.08,8- 6-1977,
VS,CAL,A,1,0,0,0,0,5,8,22,2,6,5,1,0,1,0,1,0,0,0,1,0,2,0,0,1,2.40,8-11-1977,
VS,KC,A,1,0,0,0,0,6.1,7,27,0,3,3,0,0,5,0,0,0,0,0,2,0,0,1,0,0,2.72,8-16-1977,
AT,KC,A,1,0,0,0,0,6.1,8,28,0,3,2,3,1,2,0,1,0,0,0,3,1,0,1,0,0,2.74,8-21-1977,
VS,MIN,A,1,0,0,0,0,7,6,28,1,4,4,2,0,4,0,0,0,0,0,2,0,0,0,0,0,3.08,8-26-1977,
VS,CLE,A,1,0,0,0,0,6,7,24,0,3,3,0,0,5,0,0,0,0,0,1,0,0,0,0,0,3.23,8-31-1977,
AT,TOR,A,1,1,1,0,0,9,3,33,0,0,0,2,0,4,0,0,0,0,0,0,1,0,1,1,0,2.78,9- 5-1977,1
VS,DET,A,1,1,0,0,0,9,5,33,0,1,1,2,0,3,0,0,0,0,0,0,0,1,0,1,0,2.57,9-10-1977,
AT,BAL,A,1,0,0,0,0,5,7,21,1,4,4,2,0,1,0,0,0,0,0,2,1,2,0,0,1,2.86,9-16-1977,
AT,DET,A,1,0,0,0,0,6.1,7,28,1,3,3,1,0,2,1,0,0,0,0,1,1,0,1,0,0,2.96,9-22-1977,
VS,TOR,A,1,0,0,0,0,7.1,10,32,1,5,4,1,0,3,0,0,0,0,0,1,0,2,1,1,0,3.12,9-27-1977,1
VS,MIN,A,1,0,0,0,0,4.1,6,20,2,5,5,2,0,2,0,0,1,0,0,1,0,0,0,0,1,10.38,4-11-1978,
AT,OAK,A,1,0,0,0,0,7,4,26,1,1,1,3,1,2,0,0,0,0,0,0,0,0,2,0,0,4.76,4-16-1978,
AT,MIN,A,1,0,0,0,0,6.1,6,29,0,2,2,3,1,3,1,0,0,0,0,1,0,0,1,1,0,4.08,4-21-1978,
VS,TOR,A,1,0,0,0,0,3.1,7,18,0,3,3,1,0,4,0,0,0,0,0,0,0,0,0,0,0,4.71,4-28-1978,
VS,CLE,A,1,0,0,0,0,6,8,27,1,3,3,2,0,3,0,0,0,0,0,2,0,1,0,1,0,4.67,5- 7-1978,
AT,TOR,A,1,0,0,0,0,7.1,8,34,0,5,5,5,1,3,1,0,1,0,0,1,2,1,0,0,1,4.98,5-16-1978,
AT,MIL,A,1,0,0,0,0,7,7,30,0,1,1,3,0,3,0,0,0,0,0,2,0,1,0,0,0,4.35,5-21-1978,
VS,MIL,A,1,0,0,0,0,4.2,6,23,1,4,4,2,0,5,0,0,0,1,0,1,0,0,0,0,0,4.70,5-26-1978,
AT,CHI,A,1,0,0,0,0,4.1,8,20,0,4,2,2,0,0,0,0,0,0,0,1,0,1,1,0,1,4.65,5-31-1978,
And here is a sample of the game data to which I need to match the pitcher data (file df_bp3.csv in the above code):
date,dblheader_code,day_of_week,team_v,league_v,game_no_v,team_h,league_h,game_no_h,runs_v,runs_h,outs_total,day_night,completion_info,forfeit_info,protest_info,ballpark_id,attendance,game_minutes,linescore_v,linescore_h,AB_v,H_v,2B_v,3B_v,HR_v,RBI_v,SH_v,SF_v,HBP_v,BB_v,IBB_v,SO_v,SB_v,CS_v,GIDP_v,CI_v,LOB_v,P_num_v,ERind_v,ERteam_v,WP_v,balk_v,PO_v,ASST_v,ERR_v,PB_v,DP_v,TP_v,AB_h,H_h,2B_h,3B_h,HR_h,RBI_h,SH_h,SF_h,HBP_h,BB_h,IBB_h,SO_h,SB_h,CS_h,GIDP_h,CI_h,LOB_h,P_num_h,ERind_h,ERteam_h,WP_h,balk_h,PO_h,ASST_h,ERR_h,PB_h,DP_h,TP_h,ump_HB_id,ump_HB_name,ump_1B_id,ump_1B_name,ump_2B_id,ump_2B_name,ump_3B_id,ump_3B_name,ump_LF_id,ump_LF_name,ump_RF_id,ump_RF_name,mgr_id_v,mgr_name_v,mgr_id_h,mgr_name_h,pitcher_id_w,pitcher_name_w,pitcher_id_l,pitcher_name_l,pitcher_id_s,pitcher_name_s,GWRBI_id,GWRBI_name,pitcher_start_id_v,pitcher_start_name_v,pitcher_start_id_h,pitcher_start_name_h,batter1_name_v,batter1_id_v,batter1_pos_v,batter2_name_v,batter2_id_v,batter2_pos_v,batter3_name_v,batter3_id_v,batter3_pos_v,batter4_name_v,batter4_id_v,batter4_pos_v,batter5_name_v,batter5_id_v,batter5_pos_v,batter6_name_v,batter6_id_v,batter6_pos_v,batter7_name_v,batter7_id_v,batter7_pos_v,batter8_name_v,batter8_id_v,batter8_pos_v,batter9_name_v,batter9_id_v,batter9_pos_v,batter1_name_h,batter1_id_h,batter1_pos_h,batter2_name_h,batter2_id_h,batter2_pos_h,batter3_name_h,batter3_id_h,batter3_pos_h,batter4_name_h,batter4_id_h,batter4_pos_h,batter5_name_h,batter5_id_h,batter5_pos_h,batter6_name_h,batter6_id_h,batter6_pos_h,batter7_name_h,batter7_id_h,batter7_pos_h,batter8_name_h,batter8_id_h,batter8_pos_h,batter9_name_h,batter9_id_h,batter9_pos_h,misc_info,acqui_info,season,run_diff,home_victory,run_total,date_dblhead,BATAVG_162_h,BATAVG_162_v,OBP_162_h,OBP_162_v,SLG_162_h,SLG_162_v,OBS_162_h,OBS_162_v,SB_162_h,SB_162_v,CS_162_h,CS_162_v,ERR_162_h,ERR_162_v,BATAVG_30_h,BATAVG_30_v,OBP_30_h,OBP_30_v,SLG_30_h,SLG_30_v,OBS_30_h,OBS_30_v,SB_30_h,SB_30_v,CS_30_h,CS_30_v,ERR_30_h,ERR_30_v,implied_prob_h,implied_prob_v,implied_prob_h_mid,over_under_line,over_under_result
19800409,0,Wed,TOR,AL,1,SEA,AL,1,6,8,51,N,,,,SEA02,22588.0,154,200001030,41010020x,36,9,1,1,2,5,0,0,0,5,0,4,2,0,1,0,8,5,6,6,0,0,24,14,1,0,1,0,34,12,2,1,1,8,2,0,0,2,1,4,0,0,1,0,6,3,6,6,0,0,27,16,2,1,1,0,barnl901,Larry Barnett,evanj901,Jim Evans,mckej901,Jim McKean,hendt901,Ted Hendry,,(none),,(none),mattb101,Bobby Mattick,johnd106,Darrell Johnson,parrm101,Mike Parrott,lemad101,Dave Lemanczyk,heavd101,Dave Heaverlo,cox-t101,Ted Cox,lemad101,Dave Lemanczyk,parrm101,Mike Parrott,grifa001,Alfredo Griffin,6,bailb001,Bob Bailor,9,maybj101,John Mayberry,3,veleo101,Otto Velez,10,hower001,Roy Howell,5,bonnb001,Barry Bonnell,7,boser101,Rick Bosetti,8,garcd001,Damaso Garcia,4,white002,Ernie Whitt,2,cruzj002,Julio Cruz,4,crair001,Rod Craig,8,meyed001,Dan Meyer,7,bochb001,Bruce Bochte,3,hortw101,Willie Horton,10,simpj101,Joe Simpson,9,cox-t101,Ted Cox,5,stinb101,Bob Stinson,2,mendm101,Mario Mendoza,6,,Y,1980,2,1,14,
Because of the size of the game file, I can't include more than this one row, so hopefully this will be enough to show what I'm trying to match.
Python (venv) 3.12.2 from VSCode (1.96.2) within Anaconda (2.6.4) on an intel iMac, Sequoia 15.1.1.
I'm working on a downloaded model (not my original work) for predicting MLB scores. One of the 17 or 18 different programs in the process retrieves pitcher data by game for multiple seasons. There are 3200+ individual csv files, each with up to several hundred rows x 32 columns.
Each pitchers' games pitched are to be matched to the specific game in which they appeared. This list of games is in a second (single) csv with ALL games for ALL teams listed (100k+ rows x 200 columns).
The match is done by date. This is where my problem begins.
The date format for the "games" file is, for example, 19800421.
The format for the "pitchers" dates are, for example, 4-11-1998 (strings, I believe). However there are also some formatted as 4- 5-1998 (note the space after day-) for games that aren't two digit months. (Not sure why, but there is no way to address the problem at the source given the sheer number of files and rows).
There is a line in the original code that is intended to clean up the first case (dates formatted as 4-11-1998). But there is no code to clean up the second case (4- 5-1998).
When I run the original code, I get the following error message:
ValueError: invalid literal for int() with base 10: '1977-07-26'
This date is the date in the first row of the first pitchers' stats.
I've tried several possible solutions that I found online and those are included in the attached code, commented out.
Here is the minimum amount of code that I believe will demonstrate the problem. This works (or fails, as it were) because I have the entirety of both data sets (pitcher and game data) available, however it's impossible to include all that here. So if more is needed, I'll need to figure out a way to get it to you.
import numpy as np
import pandas as pd
from datetime import datetime
filepath=('/Users/richardcartier/Documents/PythonProjects/pybaseball/BBall_Prediction/data/pitcher_data/')
def load_and_process_pitch_df(p_id, filepath):
fname = filepath + 'pitching_data_'+p_id+'.csv'
pitch_df = pd.read_csv(fname)
#ADDED Convert the 'date' column to datetime format
pitch_df['Date'] = pd.to_datetime(pitch_df['Date'])
#ADDED Change the date format to 'YYYYMMDD'
pitch_df['Date'] = pitch_df['Date'].dt.strftime('%Y%m%d')
# Convert date, fix dblhead_num to be 0,1,2
# This line below (Original code) was giving me the following error: ValueError: invalid literal for int() with base 10: '1977-07-26'. This is the first date in the first pitchers' dataframe. There are other dates in the dataframe that are formatted as '8- 6-1977' with a space where a '1' would be for months 10, 11, 12.
# Original code
#pitch_df['Date'] = (pd.to_datetime(pitch_df.Date).astype(str).str.replace('-','')).astype(int)
# My first attempt at replacing either "-" or "- "
#pitch_df['Date'] = (pd.to_datetime(pitch_df.Date).astype(str).str.multireplace([('- ','-'), ('','')])).astype(int)
# A second attempt
#for r in (("- ", "-"), ("","")):
# Date = Date.replace(*r).astype(int)
# A third attempt
pitch_df['Date'] = datetime('Date')
pitch_df['Date'].strftime('%Y%m%d').astype(int)
pitch_df.dblhead_num.fillna(0, inplace=True)
pitch_df['dblhead_num'] = pitch_df['dblhead_num'].astype(int)
df=pd.read_csv('/Users/richardcartier/Documents/PythonProjects/pybaseball/BBall_Prediction/df_bp3.csv')
start_pitchers_h = df.pitcher_start_id_h.unique()
start_pitchers_v = df.pitcher_start_id_v.unique()
start_pitchers_all = np.union1d(start_pitchers_h, start_pitchers_v)
pitcher_data_dict = {}
for i, p_id in enumerate(start_pitchers_all):
if i%100==0:
print(i)
pitcher_data_dict[p_id] = load_and_process_pitch_df(p_id, filepath)
This is a sample of the pitching data I'm using:
at_vs,Opponent,League,GS,CG,SHO,GF,SV,IP,H,BFP,HR,R,ER,BB,IB,SO,SH,SF,WP,HBP,BK,2B,3B,GDP,ROE,W,L,ERA,Date,dblhead_num
VS,MIL,A,1,1,0,0,0,9,9,38,0,3,2,2,0,11,0,1,0,0,0,2,0,0,0,1,0,2.00,7-26-1977,
AT,CAL,A,1,1,1,0,0,9,3,31,0,0,0,1,0,7,0,0,0,0,0,1,0,0,0,1,0,1.00,7-31-1977,
AT,OAK,A,1,0,0,0,0,7,5,28,0,1,1,2,0,1,1,0,0,1,0,1,0,0,0,1,0,1.08,8- 6-1977,
VS,CAL,A,1,0,0,0,0,5,8,22,2,6,5,1,0,1,0,1,0,0,0,1,0,2,0,0,1,2.40,8-11-1977,
VS,KC,A,1,0,0,0,0,6.1,7,27,0,3,3,0,0,5,0,0,0,0,0,2,0,0,1,0,0,2.72,8-16-1977,
AT,KC,A,1,0,0,0,0,6.1,8,28,0,3,2,3,1,2,0,1,0,0,0,3,1,0,1,0,0,2.74,8-21-1977,
VS,MIN,A,1,0,0,0,0,7,6,28,1,4,4,2,0,4,0,0,0,0,0,2,0,0,0,0,0,3.08,8-26-1977,
VS,CLE,A,1,0,0,0,0,6,7,24,0,3,3,0,0,5,0,0,0,0,0,1,0,0,0,0,0,3.23,8-31-1977,
AT,TOR,A,1,1,1,0,0,9,3,33,0,0,0,2,0,4,0,0,0,0,0,0,1,0,1,1,0,2.78,9- 5-1977,1
VS,DET,A,1,1,0,0,0,9,5,33,0,1,1,2,0,3,0,0,0,0,0,0,0,1,0,1,0,2.57,9-10-1977,
AT,BAL,A,1,0,0,0,0,5,7,21,1,4,4,2,0,1,0,0,0,0,0,2,1,2,0,0,1,2.86,9-16-1977,
AT,DET,A,1,0,0,0,0,6.1,7,28,1,3,3,1,0,2,1,0,0,0,0,1,1,0,1,0,0,2.96,9-22-1977,
VS,TOR,A,1,0,0,0,0,7.1,10,32,1,5,4,1,0,3,0,0,0,0,0,1,0,2,1,1,0,3.12,9-27-1977,1
VS,MIN,A,1,0,0,0,0,4.1,6,20,2,5,5,2,0,2,0,0,1,0,0,1,0,0,0,0,1,10.38,4-11-1978,
AT,OAK,A,1,0,0,0,0,7,4,26,1,1,1,3,1,2,0,0,0,0,0,0,0,0,2,0,0,4.76,4-16-1978,
AT,MIN,A,1,0,0,0,0,6.1,6,29,0,2,2,3,1,3,1,0,0,0,0,1,0,0,1,1,0,4.08,4-21-1978,
VS,TOR,A,1,0,0,0,0,3.1,7,18,0,3,3,1,0,4,0,0,0,0,0,0,0,0,0,0,0,4.71,4-28-1978,
VS,CLE,A,1,0,0,0,0,6,8,27,1,3,3,2,0,3,0,0,0,0,0,2,0,1,0,1,0,4.67,5- 7-1978,
AT,TOR,A,1,0,0,0,0,7.1,8,34,0,5,5,5,1,3,1,0,1,0,0,1,2,1,0,0,1,4.98,5-16-1978,
AT,MIL,A,1,0,0,0,0,7,7,30,0,1,1,3,0,3,0,0,0,0,0,2,0,1,0,0,0,4.35,5-21-1978,
VS,MIL,A,1,0,0,0,0,4.2,6,23,1,4,4,2,0,5,0,0,0,1,0,1,0,0,0,0,0,4.70,5-26-1978,
AT,CHI,A,1,0,0,0,0,4.1,8,20,0,4,2,2,0,0,0,0,0,0,0,1,0,1,1,0,1,4.65,5-31-1978,
And here is a sample of the game data to which I need to match the pitcher data (file df_bp3.csv in the above code):
date,dblheader_code,day_of_week,team_v,league_v,game_no_v,team_h,league_h,game_no_h,runs_v,runs_h,outs_total,day_night,completion_info,forfeit_info,protest_info,ballpark_id,attendance,game_minutes,linescore_v,linescore_h,AB_v,H_v,2B_v,3B_v,HR_v,RBI_v,SH_v,SF_v,HBP_v,BB_v,IBB_v,SO_v,SB_v,CS_v,GIDP_v,CI_v,LOB_v,P_num_v,ERind_v,ERteam_v,WP_v,balk_v,PO_v,ASST_v,ERR_v,PB_v,DP_v,TP_v,AB_h,H_h,2B_h,3B_h,HR_h,RBI_h,SH_h,SF_h,HBP_h,BB_h,IBB_h,SO_h,SB_h,CS_h,GIDP_h,CI_h,LOB_h,P_num_h,ERind_h,ERteam_h,WP_h,balk_h,PO_h,ASST_h,ERR_h,PB_h,DP_h,TP_h,ump_HB_id,ump_HB_name,ump_1B_id,ump_1B_name,ump_2B_id,ump_2B_name,ump_3B_id,ump_3B_name,ump_LF_id,ump_LF_name,ump_RF_id,ump_RF_name,mgr_id_v,mgr_name_v,mgr_id_h,mgr_name_h,pitcher_id_w,pitcher_name_w,pitcher_id_l,pitcher_name_l,pitcher_id_s,pitcher_name_s,GWRBI_id,GWRBI_name,pitcher_start_id_v,pitcher_start_name_v,pitcher_start_id_h,pitcher_start_name_h,batter1_name_v,batter1_id_v,batter1_pos_v,batter2_name_v,batter2_id_v,batter2_pos_v,batter3_name_v,batter3_id_v,batter3_pos_v,batter4_name_v,batter4_id_v,batter4_pos_v,batter5_name_v,batter5_id_v,batter5_pos_v,batter6_name_v,batter6_id_v,batter6_pos_v,batter7_name_v,batter7_id_v,batter7_pos_v,batter8_name_v,batter8_id_v,batter8_pos_v,batter9_name_v,batter9_id_v,batter9_pos_v,batter1_name_h,batter1_id_h,batter1_pos_h,batter2_name_h,batter2_id_h,batter2_pos_h,batter3_name_h,batter3_id_h,batter3_pos_h,batter4_name_h,batter4_id_h,batter4_pos_h,batter5_name_h,batter5_id_h,batter5_pos_h,batter6_name_h,batter6_id_h,batter6_pos_h,batter7_name_h,batter7_id_h,batter7_pos_h,batter8_name_h,batter8_id_h,batter8_pos_h,batter9_name_h,batter9_id_h,batter9_pos_h,misc_info,acqui_info,season,run_diff,home_victory,run_total,date_dblhead,BATAVG_162_h,BATAVG_162_v,OBP_162_h,OBP_162_v,SLG_162_h,SLG_162_v,OBS_162_h,OBS_162_v,SB_162_h,SB_162_v,CS_162_h,CS_162_v,ERR_162_h,ERR_162_v,BATAVG_30_h,BATAVG_30_v,OBP_30_h,OBP_30_v,SLG_30_h,SLG_30_v,OBS_30_h,OBS_30_v,SB_30_h,SB_30_v,CS_30_h,CS_30_v,ERR_30_h,ERR_30_v,implied_prob_h,implied_prob_v,implied_prob_h_mid,over_under_line,over_under_result
19800409,0,Wed,TOR,AL,1,SEA,AL,1,6,8,51,N,,,,SEA02,22588.0,154,200001030,41010020x,36,9,1,1,2,5,0,0,0,5,0,4,2,0,1,0,8,5,6,6,0,0,24,14,1,0,1,0,34,12,2,1,1,8,2,0,0,2,1,4,0,0,1,0,6,3,6,6,0,0,27,16,2,1,1,0,barnl901,Larry Barnett,evanj901,Jim Evans,mckej901,Jim McKean,hendt901,Ted Hendry,,(none),,(none),mattb101,Bobby Mattick,johnd106,Darrell Johnson,parrm101,Mike Parrott,lemad101,Dave Lemanczyk,heavd101,Dave Heaverlo,cox-t101,Ted Cox,lemad101,Dave Lemanczyk,parrm101,Mike Parrott,grifa001,Alfredo Griffin,6,bailb001,Bob Bailor,9,maybj101,John Mayberry,3,veleo101,Otto Velez,10,hower001,Roy Howell,5,bonnb001,Barry Bonnell,7,boser101,Rick Bosetti,8,garcd001,Damaso Garcia,4,white002,Ernie Whitt,2,cruzj002,Julio Cruz,4,crair001,Rod Craig,8,meyed001,Dan Meyer,7,bochb001,Bruce Bochte,3,hortw101,Willie Horton,10,simpj101,Joe Simpson,9,cox-t101,Ted Cox,5,stinb101,Bob Stinson,2,mendm101,Mario Mendoza,6,,Y,1980,2,1,14,
Because of the size of the game file, I can't include more than this one row, so hopefully this will be enough to show what I'm trying to match.
Based on what is given in your description you can try to parse the different date formats with the pandas to_datetime
method.
The trickiest format are the dates with the extra space like in '4- 5-1998'
.
This can be parsed when setting the dayfirst
argument to True
.
Making use of the fact that dayfirst=True
is not strict, it will try to parse with day first but won't force it on dates that don't start with day first.
Combining the parsing with calling the assign
method you can parse the dates as you needed like so:
import pandas as pd
dates = ['20011221', '19800409', '5- 1-1998', '10-12-2021', '31-12-2020']
df = pd.DataFrame(dates, columns=['date'])
df = df.assign(
parsed=pd.to_datetime(df.date, dayfirst=True),
date=lambda x: x.parsed.dt.strftime('%Y%m%d')
)
assign
executes in order resulting in the following dataframe with the extra column 'parsed'
that can be dropped if not needed.
date parsed
0 20011221 2001-12-21
1 19800409 1980-04-09
2 19980105 1998-01-05
3 20211210 2021-12-10
4 20201231 2020-12-31
Although there may be alternatives, I found a solution that solves the original problem and updated the code with that solution. I added the following lines in the function definition that properly converts "d/m/YYYY" into the "YYYYmd" format I need. That raises a different problem which I'll address separately, but it solves this question so it can be (if appropriate) flagged as answered.
Solution:
def load_and_process_pitch_df(p_id, filepath):
fname = filepath + 'pitching_data_'+p_id+'.csv'
pitch_df = pd.read_csv(fname)
#ADDED Convert the 'date' column to datetime format
pitch_df['Date'] = pd.to_datetime(pitch_df['Date'])
#ADDED Change the date format to 'YYYYMMDD'
pitch_df['Date'] = pitch_df['Date'].dt.strftime('%Y%m%d')
#A third attempt
is not inside the function definition. – Barmar Commented Jan 2 at 21:06to_datetime()
– Barmar Commented Jan 2 at 21:08Edit
link and paste it again. – Barmar Commented Jan 2 at 21:24strptime()
himself, it's happening internally to pandas. – Barmar Commented Jan 2 at 21:47