ETL with Python and Pandas for Soccer Scheduling
New Platforms, New Features, Missing Cheese
As the Information Officer for the South Jersey Soccer League, I'm tasked with making sure our platforms run and run well. SJSL has over 500 teams from over 70 clubs in South Jersey. Our clubs and teams depend on these systems for registration, game schedules, referees and week to week operations to get their players on the pitch.
This Fall, our platform provider for registration, schedules and clubs, GotSport required SJSL to move off their legacy GotSoccer platform to their new modernized system. This was a heavy lift for our scheduling process. Scheduling works with data provided from each team registration, each club, each venue and their related fields to create an optimal schedule.
To assign referees to this schedule, our assignors use a referee scheduling platform, Arbiter and GotSoccer, the legacy system was integrated with this platform. From GotSoccer, the schedule could be synchronized at the press of a button.
That cheese was moved. No such feature to synchronize between GotSport and Arbiter.
Arbiter has an import feature and I can export the schedule from GotSport. The tricky part is the Arbiter import validates the upload and Sites, Subsites - Venues and Pitches or Fields, need to map to the Arbiter data. I need to get 2549 games into Arbiter for the referee assignors to start assigning referees.
Python and Pandas to the rescue.
UPDATES WILL FOLLOW
Need to clean up code for public consumption
Points of interest:
- pseudo code the process
- reading excel files
- replace with dictionary and set_index
- merge 2 dataframes
- clean up key values
- add and remove columns for the export
- date value with insert_space from lambda
- final export to_csv using timestamp to rerun as needed
Results from the ETL
|Total Exported from GotSport
|Total Import Actual
|Total Import Errors