## Understanding the Challenge of Financial Data Accessibility
In the digital age, accessing structured financial information about public companies can be challenging, especially in emerging markets like Indonesia. While the Indonesia Stock Exchange (IDX) provides comprehensive company profiles, the data is often:
- Scattered across multiple web pages
- Primarily in the Indonesian language
- Not readily available in machine-readable formats
## The Need for Automated Data Collection
Financial analysts, researchers, and investors frequently encounter barriers when trying to:
- Compile comprehensive company information
- Translate and standardize company data
- Create datasets for market research or investment analysis
## Theoretical Approach to Web Scraping
Web scraping is a powerful technique for extracting structured data from websites. Our approach focuses on several key principles:
1. Automated Data Extraction
- Eliminate manual data entry
- Reduce human error
- Enable rapid, repeatable data collection

2. Dynamic Web Interaction
- Use Selenium WebDriver to simulate human-like browser interactions
- Handle dynamic content loading
- Navigate complex web structures

3. Data Translation and Standardization
- Convert Indonesian field names to English
- Create a consistent, machine-readable data format
- Improve data interoperability
## Technical Challenges and Solutions
Challenge: Multilingual Data Extraction
- Problem: Company information is primarily in Indonesian
- Solution: Implement a translation mapping for key terms

Challenge: Dynamic Web Content
- Problem: Websites use JavaScript to load content
- Solution: Use WebDriverWait to ensure the page loads completely

Challenge: Robust Error Handling
- Problem: Inconsistent web page structures
- Solution: Implement flexible data extraction with fallback mechanisms
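The fallback idea behind the last solution can be sketched in plain Python before involving Selenium at all. The helper name `first_match` is my own illustration, not part of the script below; it simply tries a list of extraction callables in order and keeps the first one that succeeds:

```python
def first_match(extractors):
    """Return the first extractor result that succeeds, else None.

    Each extractor is a zero-argument callable that may raise an
    exception (e.g. Selenium's NoSuchElementException) when the
    expected element is missing from the page.
    """
    for extract in extractors:
        try:
            return extract()
        except Exception:
            continue  # This extractor failed; try the next one
    return None

# Plain callables standing in for Selenium element lookups:
value = first_match([
    lambda: {"a": 1}["missing"],   # raises KeyError -> skipped
    lambda: "fallback text",       # succeeds
])
print(value)  # fallback text
```

The scraper below applies the same pattern inline: it tries the link text of a cell first and falls back to the span text.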
## Implementation Strategy
Our Python script will:
- Use Selenium WebDriver for web automation
- Extract company profile data
- Translate field names
- Save data in a standardized JSON format
## Python Code Implementation
```python
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import NoSuchElementException
import time
import json


def translate_key(key):
    """
    Translate a key from Indonesian to English using a predefined dictionary.

    Args:
        key (str): The key in Indonesian to be translated.

    Returns:
        str: The translated key in English, or the original key if no
        translation is found.
    """
    # Dictionary mapping Indonesian keys to English
    translations = {
        "Nama": "name",
        "Kode": "code",
        "Alamat Kantor": "office_address",
        "Alamat Email": "email",
        "Telepon": "phone",
        "Fax": "fax",
        "NPWP": "tax_id",
        "Situs": "website",
        "Tanggal Pencatatan": "listing_date",
        "Papan Pencatatan": "board",
        "Bidang Usaha Utama": "main_business",
        "Sektor": "sector",
        "Subsektor": "subsector",
        "Industri": "industry",
        "Subindustri": "subindustry",
        "Biro Administrasi Efek": "share_registrar",
    }
    return translations.get(key, key)  # Return original key if no translation found


def scrape_idx_profile(code_stock):
    """
    Scrape company profile data from the IDX website.

    Args:
        code_stock (str): Stock code (e.g., 'BBCA') to scrape data for.

    Returns:
        dict: The scraped company data with English keys, or None on failure.
    """
    chrome_options = Options()
    chrome_options.add_argument("--headless")
    chrome_options.add_argument("--no-sandbox")
    chrome_options.add_argument("--disable-dev-shm-usage")
    chrome_options.add_argument("--window-size=1920,1080")
    chrome_options.add_argument(
        "user-agent=Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
        "AppleWebKit/537.36 (KHTML, like Gecko) Chrome/122.0.0.0 Safari/537.36"
    )

    driver = None
    try:
        driver = webdriver.Chrome(options=chrome_options)
        url = f"https://www.idx.co.id/id/perusahaan-tercatat/profil-perusahaan-tercatat/{code_stock}"
        print(f"Accessing URL: {url}")
        driver.get(url)
        time.sleep(5)

        # Dictionary to store the scraped data
        company_data = {}

        try:
            # Wait for the element with class 'bzg' to appear
            wait = WebDriverWait(driver, 10)
            bzg_element = wait.until(
                EC.presence_of_element_located((By.CLASS_NAME, "bzg"))
            )

            # Get all tables within 'bzg' and process each one
            tables = bzg_element.find_elements(By.TAG_NAME, "table")
            for table in tables:
                rows = table.find_elements(By.TAG_NAME, "tr")
                for row in rows:
                    try:
                        # Field name (td with class 'td-name')
                        field_name = row.find_element(By.CLASS_NAME, "td-name").text.strip()
                        # Content (td with class 'td-content')
                        content_element = row.find_element(By.CLASS_NAME, "td-content")
                        # Prefer link text; fall back to span text
                        try:
                            content = content_element.find_element(By.TAG_NAME, "a").text.strip()
                        except NoSuchElementException:
                            content = content_element.find_element(By.TAG_NAME, "span").text.strip()
                        # Translate the key to English and store the value
                        english_key = translate_key(field_name)
                        company_data[english_key] = content
                    except NoSuchElementException:
                        continue  # Skip rows without the expected cells
        except Exception as e:
            print(f"Error processing bzg element: {str(e)}")

        # Save the data to a JSON file
        with open(f"data_{code_stock}.json", "w", encoding="utf-8") as f:
            json.dump(company_data, f, ensure_ascii=False, indent=4)
        print(f"\nData successfully saved to data_{code_stock}.json")

        # Print a data preview
        print("\nScraped data preview:")
        for key, value in company_data.items():
            print(f"{key}: {value}")

        return company_data

    except Exception as e:
        print(f"An error occurred: {str(e)}")
        return None
    finally:
        if driver:
            driver.quit()


if __name__ == "__main__":
    code_stock = "BBCA"  # Can be changed to other stock codes
    result = scrape_idx_profile(code_stock)
```
## Code Walkthrough and Design Patterns
1. Translation Mechanism
The `translate_key()` function demonstrates a dictionary-based translation approach:
- Maps Indonesian financial terms to English
- Provides a fallback for unmapped terms
- Ensures consistent terminology across extracted data
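The fallback behavior comes from `dict.get`, which returns the original key whenever no mapping exists. A minimal sketch with a two-entry mapping (the field `Modal Dasar` is just an example of an unmapped key):

```python
translations = {"Nama": "name", "Sektor": "sector"}

def translate_key(key):
    # dict.get returns the original key when no mapping exists,
    # so untranslated fields still survive in the output.
    return translations.get(key, key)

print(translate_key("Nama"))         # name
print(translate_key("Modal Dasar"))  # Modal Dasar (no mapping -> unchanged)
```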
2. Robust Web Scraping
The `scrape_idx_profile()` function implements several resilience strategies:
- Headless browser configuration
- Explicit waits for page elements
- Flexible content extraction
- Comprehensive error handling
3. Data Standardization
- Converts multilingual data to a uniform format
- Generates machine-readable JSON output
- Preserves original data integrity
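Because the output is plain JSON, downstream tools can consume it without any Selenium dependency. This sketch writes and reloads a sample record the same way the script does (the field values are illustrative, not real scraped data):

```python
import json

company_data = {"name": "PT Bank Central Asia Tbk", "code": "BBCA", "sector": "Keuangan"}

# ensure_ascii=False keeps Indonesian characters readable in the file
with open("data_BBCA.json", "w", encoding="utf-8") as f:
    json.dump(company_data, f, ensure_ascii=False, indent=4)

# Any later analysis step can reload the standardized record
with open("data_BBCA.json", encoding="utf-8") as f:
    reloaded = json.load(f)

print(reloaded["code"])  # BBCA
```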
## Practical Applications
This script can be used for:
- Financial research
- Market analysis
- Investment due diligence
- Academic research on Indonesian public companies
## Ethical Considerations and Limitations
### Responsible Scraping
- Respect website terms of service
- Implement rate limiting
- Use scraping ethically and legally
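Rate limiting can be as simple as pausing between requests when scraping several tickers. In this sketch, `scrape` is a stand-in for `scrape_idx_profile`, and the 10-second default delay is an illustrative choice, not an IDX requirement:

```python
import time

def scrape_all(codes, scrape, delay_seconds=10):
    """Call `scrape` for each stock code, sleeping between requests."""
    results = {}
    for i, code in enumerate(codes):
        if i > 0:
            time.sleep(delay_seconds)  # be polite: one request per interval
        results[code] = scrape(code)
    return results

# Stub scraper for demonstration; real use: scrape_all(codes, scrape_idx_profile)
demo = scrape_all(["BBCA", "BBRI"], lambda c: {"code": c}, delay_seconds=0)
print(list(demo))  # ['BBCA', 'BBRI']
```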
### Disclaimer
This tool is for educational purposes. Always verify data accuracy and comply with legal and ethical guidelines when scraping web content.