Project Overview

Project Purpose

This project develops a Large Language Model (LLM) pipeline for summarizing Arabic legal texts into a fixed template, making legal document processing more accessible and efficient.

Dataset

We've compiled a comprehensive dataset of 25,000 legal cases from Morocco, originally in PDF format. To make this data usable, we employed Optical Character Recognition (OCR) for accurate text extraction.
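
A minimal sketch of this extraction step, assuming Tesseract with its Arabic language pack ("ara") via pytesseract, plus pdf2image for page rendering. The tooling choice is an assumption, as the overview does not name the OCR engine:

```python
import re

def clean_ocr_text(text: str) -> str:
    """Collapse runs of whitespace and drop empty lines left behind by OCR."""
    lines = (re.sub(r"\s+", " ", line).strip() for line in text.splitlines())
    return "\n".join(line for line in lines if line)

def extract_case_text(pdf_path: str, dpi: int = 300) -> str:
    """Render each page of a scanned case PDF and run Arabic OCR on it."""
    # Imported lazily so the cleaning helper above works without the OCR stack.
    from pdf2image import convert_from_path  # requires poppler
    import pytesseract                       # requires the "ara" language pack

    pages = convert_from_path(pdf_path, dpi=dpi)
    raw = "\n".join(pytesseract.image_to_string(p, lang="ara") for p in pages)
    return clean_ocr_text(raw)
```

Rendering at 300 DPI is a common trade-off between OCR accuracy and processing time for scanned legal documents.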

Data Annotation and Gold Standard

Using GPT-4, we generated summaries of the extracted texts following a predefined template. These summaries serve as the gold standard for model training, balancing annotation cost with quality.
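
A minimal sketch of this annotation step, assuming the OpenAI chat completions API. The template fields shown are illustrative placeholders, not the project's actual template:

```python
# The field names in this template are assumptions; the project's real
# summary template is not reproduced in this overview.
SUMMARY_TEMPLATE = (
    "لخص النص القانوني التالي وفق القالب:\n"
    "- موضوع القضية:\n"
    "- المبادئ القانونية المطبقة:\n"
    "- منطوق الحكم:\n\n"
    "النص:\n{case_text}"
)

def build_prompt(case_text: str) -> str:
    """Fill the fixed summary template with one case's extracted text."""
    return SUMMARY_TEMPLATE.format(case_text=case_text)

def summarize_with_gpt4(case_text: str) -> str:
    """Ask GPT-4 for a template-following summary of one case."""
    from openai import OpenAI  # imported lazily; needs OPENAI_API_KEY set
    client = OpenAI()
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": build_prompt(case_text)}],
    )
    return response.choices[0].message.content
```

Keeping the template in a single constant makes it easy to guarantee that every gold-standard summary follows the same structure.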

Model Fine-Tuning

We chose the LLaMA 3.2 3B Instruct model for fine-tuning on the gold-standard summaries. LLaMA was selected for its strong performance, Arabic-language comprehension, and openly available weights, making it well suited to this specialized task.
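
As a sketch of how the gold summaries can feed fine-tuning, each (case, summary) pair is typically serialized as an instruction-style JSON Lines record. The field names and instruction string below are assumptions, not the project's actual schema:

```python
import json

def to_instruction_record(case_text: str, gold_summary: str) -> dict:
    """Pair one extracted case with its GPT-4 gold summary in the
    instruction/input/output shape common for LLaMA-style fine-tuning.
    The key names here are illustrative, not the project's real schema."""
    return {
        "instruction": "لخص النص القانوني التالي وفق القالب المحدد.",
        "input": case_text,
        "output": gold_summary,
    }

def write_training_file(pairs, path="train.jsonl"):
    """Serialize (case_text, summary) pairs as JSON Lines for the trainer."""
    with open(path, "w", encoding="utf-8") as f:
        for case_text, summary in pairs:
            record = to_instruction_record(case_text, summary)
            f.write(json.dumps(record, ensure_ascii=False) + "\n")
```

`ensure_ascii=False` keeps the Arabic text readable in the training file instead of escaping it to \uXXXX sequences.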

Results

Post-fine-tuning evaluation on unseen test data showed significant improvements:

  • Performance gains of 10% to 26% over the base model
  • Strong results given the model's modest size (3 billion parameters)
  • A robust proof of concept, with room for further gains from larger models
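
The overview does not name the metric behind the 10%–26% figure; summarization work typically reports ROUGE-style overlap scores. A purely illustrative sketch of one such score, unigram F1:

```python
from collections import Counter

def unigram_f1(reference: str, candidate: str) -> float:
    """Word-overlap F1 between a reference and a candidate summary,
    a simplified stand-in for ROUGE-1 (illustrative only; the project's
    actual evaluation metric is not stated in this overview)."""
    ref_counts = Counter(reference.split())
    cand_counts = Counter(candidate.split())
    overlap = sum((ref_counts & cand_counts).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)
```

Comparing this score for the base and fine-tuned models on the same held-out cases is one way such a relative improvement could be measured.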

Knowledge Graph

To maximize the value of the summarized and extracted data, we constructed a knowledge graph. This graph visually represents the intricate relationships between cases, based on extracted properties such as case topics and applied legal principles. Explore this interactive tool in the "Created Knowledge Graph" tab.
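
A minimal sketch of how such a graph can be assembled from the extracted properties, using only the standard library. The dict-of-sets adjacency representation and property values are illustrative; the project's interactive graph likely uses a dedicated graph tool:

```python
from collections import defaultdict
from itertools import combinations

def build_case_graph(cases):
    """Link every pair of cases that share an extracted property
    (e.g. a case topic or an applied legal principle).

    `cases` maps case id -> set of property strings; returns an
    adjacency mapping of case id -> set of connected case ids."""
    by_property = defaultdict(set)
    for case_id, props in cases.items():
        for prop in props:
            by_property[prop].add(case_id)

    graph = defaultdict(set)
    for shared_cases in by_property.values():
        for a, b in combinations(sorted(shared_cases), 2):
            graph[a].add(b)
            graph[b].add(a)
    return graph
```

Grouping cases by property first keeps the construction linear in the number of (case, property) pairs rather than comparing every case against every other.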

Model Availability

The fine-tuned model is now publicly accessible on Hugging Face. You can interact with and explore the model's capabilities through the "Chat with Fine-tuned Model" tab.
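
A minimal sketch of loading the checkpoint with the transformers library. The repository id below is a placeholder; substitute the id listed on the project's Hugging Face page:

```python
def chat_with_model(prompt: str,
                    model_id: str = "your-org/llama-3.2-3b-arabic-legal"):
    """Generate a summary from the fine-tuned checkpoint on Hugging Face.
    The model_id is a placeholder, not the project's real repository id."""
    from transformers import pipeline  # imported lazily; downloads the model
    generator = pipeline("text-generation", model=model_id)
    return generator(prompt, max_new_tokens=512)[0]["generated_text"]
```

Running a 3B-parameter model locally typically needs a few GB of memory; the "Chat with Fine-tuned Model" tab avoids that by hosting inference for you.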