A Journey into Python & Data Pipelines
Embarking on a Python programming journey, I jumped into data pipelines, focusing on the vital first step: data collection and storage.
The purpose of this project was to delve into the world of Python programming and gain an understanding of data pipelines. While the project did not encompass the creation of a full data pipeline, it primarily focused on the essential first step: data collection and storage.
Inspiration for this project was drawn from a blog post by Simon Späti: https://www.sspaeti.com/blog/data-engineering-project-in-twenty-minutes/ This insightful post presents an end-to-end data pipeline utilizing real estate data.
Below is a summary of the steps undertaken:
Scrape real estate listing URLs
Loop over URLs and extract all data of interest
Clean the extracted DataFrame
Store the DataFrame in MinIO
As a newcomer to Python, the initial step involved selecting an appropriate IDE. Google Colab was the first choice, with the intention of keeping the project cloud-based. However, difficulties arose when attempting to mount a Chrome driver to the project for Selenium. After numerous unsuccessful attempts, the decision was made to switch to PyCharm, which proved to be an effective solution.
With the IDE established, the required packages, including Selenium and BeautifulSoup, were installed. The next step involved identifying a suitable website for the project. Helton Real Estate Group's website was chosen due to its clear HTML hierarchy, which simplified locating listing data and performing web scraping.
Scraping the first 25 visible listing URLs was relatively easy; however, a method to load additional results was necessary. Upon examining the webpage, it was discovered that more results appeared after scrolling several times, and that accessing further results required clicking a "Load More" button. To overcome this challenge, functions were created to scroll the page several times and then click the "Load More" button. Both were integrated into a while True loop, allowing the script to continue until no more new results loaded (i.e., when the new page length equaled the old page length).
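For reference, a minimal sketch of that scroll-and-click loop is shown below, assuming Selenium with Chrome. The search URL, the "Load More" locator, and the a.listing-link selector are illustrative placeholders rather than the exact values used for the Helton site.

```python
import time

from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()
driver.get("https://www.example.com/listings")  # placeholder search page

while True:
    old_page_length = len(driver.page_source)

    # Scroll to the bottom a few times so lazy-loaded results can appear
    for _ in range(3):
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(2)

    # Click "Load More" if the button is still on the page
    load_more = driver.find_elements(By.LINK_TEXT, "Load More")
    if load_more:
        load_more[0].click()
        time.sleep(2)

    # Stop once nothing new loaded, i.e. the page length did not change
    if len(driver.page_source) == old_page_length:
        break

# Collect the listing URLs; the CSS selector is a hypothetical placeholder
listing_urls = [
    a.get_attribute("href")
    for a in driver.find_elements(By.CSS_SELECTOR, "a.listing-link")
]
driver.quit()
```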
After completing this step, the script successfully extracted all listing URLs. Each URL was then scraped in a loop and the results were stored in a DataFrame. The final DataFrame required minimal cleaning, as the data was already relatively well organized.
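The per-listing loop looked roughly like the sketch below, using requests and BeautifulSoup for simplicity. The field names and CSS classes are hypothetical stand-ins for whatever the site's HTML actually exposes, and the cleaning step simply illustrates turning a price string into a number.

```python
import pandas as pd
import requests
from bs4 import BeautifulSoup

# listing_urls comes from the scrolling step above; a single placeholder is shown here
listing_urls = ["https://www.example.com/listing/123"]

records = []
for url in listing_urls:
    soup = BeautifulSoup(requests.get(url, timeout=30).text, "html.parser")

    # Class names are illustrative; the real ones come from inspecting the site's HTML
    price = soup.find("span", class_="listing-price")
    address = soup.find("h1", class_="listing-address")
    records.append({
        "url": url,
        "price": price.get_text(strip=True) if price else None,
        "address": address.get_text(strip=True) if address else None,
    })

df = pd.DataFrame(records)

# Light cleaning: strip currency symbols and commas so the price becomes numeric
df["price"] = pd.to_numeric(
    df["price"].str.replace(r"[^0-9.]", "", regex=True), errors="coerce"
)
```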
The concluding step involved uploading all data to MinIO, an S3-compatible object storage service. Storing data in object storage like MinIO offers many benefits, including scalability and easy access to the data from anywhere that can reach the server.
The process of uploading the data was straightforward. MinIO was installed using Homebrew, a local session was launched in Terminal, the API endpoint and credentials were retrieved from the session output, and finally the client was connected in my script. The following article provided guidance for this step: https://medium.com/featurepreneur/upload-files-in-minio-using-python-4f987f902076
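The connection-and-upload code followed the pattern below, using the minio Python client. The endpoint, access keys, bucket name, and object name are placeholders standing in for the values reported by the local MinIO session, and the stub DataFrame stands in for the cleaned one built above.

```python
from io import BytesIO

import pandas as pd
from minio import Minio

# df is the cleaned listings DataFrame from the previous step; a stub is shown here
df = pd.DataFrame({"url": ["https://www.example.com/listing/123"], "price": [450000.0]})

# Endpoint and credentials are placeholders; the local MinIO session prints its own
client = Minio(
    "127.0.0.1:9000",
    access_key="minioadmin",
    secret_key="minioadmin",
    secure=False,  # local session over plain HTTP
)

bucket = "real-estate-listings"
if not client.bucket_exists(bucket):
    client.make_bucket(bucket)

# Serialize the DataFrame to CSV in memory and upload it as a single object
csv_bytes = df.to_csv(index=False).encode("utf-8")
client.put_object(
    bucket,
    "listings.csv",
    data=BytesIO(csv_bytes),
    length=len(csv_bytes),
    content_type="text/csv",
)
```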
In summary, this project offered valuable hands-on experience in Python programming, web scraping, data cleaning, and object storage using MinIO. In my next project, I plan to explore Change Data Capture (CDC), Delta Lake, and PySpark.
The code for this project can be found on my GitHub here: https://github.com/GrahamChalfant/A-Journey-into-Python-Data-Pipelines