Using OpenAI to summarise PDF

Andrew Fletcher published: 19 October 2023 (updated) 27 October 2023 10 minutes read

20.04

To use OpenAI to summarise text from a PDF using Python 3.11.6, you'll first need to extract the text from the PDF and then send it to the OpenAI API for summarisation.

Preparation

Set-up

pip install python-dotenv langchain openai tiktoken pypdf pymupdf

Code

The current code is on my Summaries GitHub page.

Elements

Adjusting the parameters - you can adjust the max_tokens and temperature parameters in the summarize_with_openai function to control the length and randomness of the generated.

Summarization

As a reference, review the following documentation - Summarization - (https://python.langchain.com/docs/use_cases/summarization).

Model

Review the existing models see OpenAI Models - (https://platform.openai.com/docs/models/gpt-4).

Temperature

In the context of OpenAI's GPT-based models, the temperature parameter is a value that controls the randomness of the model's output. When temperature is set to 0, it means that the output will be completely deterministic, and the model will generate the most likely and certain response given the input. This results in the most focused and least random output.

Here's how temperature affects the output:

Temperature = 0

The output is highly deterministic. The model will consistently provide the same response for the same input. It chooses the most likely completion.

Temperature > 0

When temperature is greater than 0, it introduces randomness into the output. Higher values like 1.0 make the output more diverse and less deterministic, while lower values like 0.2 make the output more focused and less random.

So, when you use temperature=0 in an OpenAI model, you're essentially asking for a deterministic response, which can be useful in situations where you need highly controlled and predictable model outputs.

Summaries - document one

First cut

Words: 163
Characters: 1,113

llm = select_llm()

This publication is a research project conducted by Robin Thomson, Mark Bravington, Pierre Feutry, Rasanthi Gunasekera, Peter Grewe, Paavo Jumpanen, Claudio Castillo Jordán, Elizabeth Brewer, Floriaan Devloo-Delva, Simon Robertson, and James Marthick in August 2020. It focuses on the close kin mark recapture (CKMR) method for estimating the abundance of school shark (Galeorhinus galeus) in the SESSF. Samples were collected by fishers, fish processors (Toumazos; Pitliangas Foods: Nick and Chris Pitliangas) and AFMA’s Observer Program (approximately 1,000 samples), and were sent to CSIRO, mainly by refrigerated truck. The authors collected approximately 3,000 samples from three broad locations (700 samples from South Australia, 900 from Bass Strait, and 400 from Tasmania) in proportion to fishing activity. Collections made by the fishing industry were all taken between 2015 and 2018, with no more than 50 animals from any fishing trip to guard against any sampling bias that might arise from close relatives schooling together. In addition to the commercially caught sharks, tissue samples were sourced

2nd summary cut

Words: 188
Characters: 1,225

llm = OpenAI(temperature=0)

This publication is a research project conducted by Robin Thomson, Mark Bravington, Pierre Feutry, Rasanthi Gunasekera, and Peter Grewe in August 2020. It focuses on the close kin mark recapture (CKMR) method for school shark (Galeorhinus galeus) in the SESSF, as well as in New Zealand, Australian and Chilean waters. The research project was funded by Fisheries Research and Development Corporation (FRDC 2014‐024) and Commonwealth Scientific and Industrial Research Organization (CSIRO). It included the collection of approximately 3,000 samples, all of which were aged by the Fish Ageing Service (FAS) using counts of contrasting bands of material in the vertebra, the full mitochondrial genome sequencing for those sharks found to belong to close kin pairs, the identification of sex from genetic data using a subset of five sex markers from the fifteen found in the initial sequencing investigation, the compilation of fishery dependent data and biological parameters, the development of simple models and a close kin model to estimate the absolute abundance of school shark in the SESSF, the analysis of same cohort siblings, ring counts, trawl CPUE, likelihood, coefficient of variation (CV) for ageing error, and the

3rd summary cut

Words: 96
Characters: 658

llm = ChatOpenAI(model_name="gpt-3.5-turbo-16k", temperature=0)

chain = load_summarize_chain(
    llm,
    /.../
    input_key="input_documents",
    output_key="output_text",
)

The use of CKMR without a B0 can still estimate replacement yield and set conservative RBCs below that level. Future projections using the CKMR model and fixed fishing mortality levels were used to calculate increasing catches that account for the expected increase in unavoidable bycatch as the stock rebuilds. The successful use of the genotyping technology "DartCap" on school shark and southern bluefin tuna has provided valuable insights and cost-effective methods for future CKMR projects. Additionally, the experience gained in designing CKMR projects, particularly in power calculations for sample sizes, has proven beneficial for long-term planning.

4th summary cut

Words: 99
Characters: 675

llm = ChatOpenAI(model_name="gpt-3.5-turbo-16k", temperature=0, request_timeout=120)

The use of CKMR without a B0 can still estimate replacement yield and set conservative RBCs below that level. Future projections using the CKMR model and fixed fishing mortality levels were used to calculate increasing catches that account for the expected increase in unavoidable bycatch as the stock rebuilds. The successful use of the genotyping technology "DartCap" on school shark and southern bluefin tuna has provided valuable insights and cost-effective methods for future CKMR projects on other species. Additionally, the experience gained in designing CKMR projects, particularly in power calculations for sample sizes, has proven beneficial for long-term planning.

5th summary cut

Words: 139
Characters: 1,000

llm = ChatOpenAI(model_name="gpt-4", temperature=0, request_timeout=120)

The CKMR model can be utilized to estimate replacement yield and set conservative RBCs, even without a B0. Future projections using this model, in conjunction with fixed future levels of fishing mortality, can calculate time series of increasing catches, taking into account the likely increase in unavoidable bycatch as the stock rebuilds. SharkRAG recommends using catches that relate to the average fishing mortality rate from 2013-2017. The genotyping technology "DartCap" from Diversity Arrays Technology, tested on school shark and southern bluefin tuna, was successful and cost-effective, reliably distinguishing half-sibling pairs. This technology will reduce costs and development overhead for future CKMR projects on other species. Additionally, experience has been gained in designing future CKMR projects, particularly in power calculations for choosing sample sizes to achieve specific assessment goals. This approach requires significant preliminary work but aids in long-term planning.

6th summary cut

Words: 144
Characters: 1,025

llm = ChatOpenAI(model_name="gpt-4", temperature=0.5, request_timeout=120)

The CKMR model can be utilized to estimate replacement yield and set conservative RBCs, even without a B0. Future projections using median results from the CKMR model, along with fixed future levels of fishing mortality, were used to calculate time series of increasing catches, taking into account the likely increase in unavoidable bycatch as the stock rebuilds. SharkRAG recommended using catches that relate to the average fishing mortality rate from 2013-2017. The genotyping technology "DartCap" from Diversity Arrays Technology, tested on school shark and southern bluefin tuna, was successful and cost-effective, reliably distinguishing half-sibling pairs. This technology will reduce costs and development overhead for future CKMR projects on other species. Additionally, experience has been gained in designing future CKMR projects, particularly in power calculations for choosing sample sizes to achieve specific assessment goals. This approach requires significant preliminary work but aids in long-term planning.

7th summary cut

Words: 147
Characters: 1,028

llm = ChatOpenAI(model_name="gpt-4", temperature=0.7, request_timeout=120)

The Close-Kin Mark-Recapture (CKMR) model has proven effective in estimating replacement yield and setting conservative RBCs, even without a B0. Future projections using median results from this model, coupled with fixed future levels of fishing mortality, enable the calculation of a time series of increasing catches, accounting for likely bycatch increases as the stock rebuilds. The recommendation is to base catches on the average fishing mortality rate from 2013-2017. The novel genotyping technology "DartCap" from Diversity Arrays Technology, though being tested for the first time on school shark and southern bluefin tuna, has shown significant success, notably in distinguishing half-sibling pairs, a tough task for genotyping. This technology will reduce costs for future CKMR projects on other species. Additionally, the experience gained from this project will aid in designing future projects, particularly in deciding sample sizes for specific assessment goals, despite the need for substantial preliminary work.

8th summary cut

Words: 175
Characters: 1,238

llm = ChatOpenAI(model_name="gpt-4", temperature=1.0, request_timeout=120)

The CKMR (Close Kin Mark-Recapture) model can effectively estimate replacement yield and allow conservative RBCs (Responsible Business Conduct) below this level even without a B0. The model used future projections of fixed fishing mortality rates and median results to predict trends of increasing catches, considering the probable rise in unavoidable bycatch with stock rebuilds. The SharkRAG recommended setting catch amounts based on the average fishing mortality rate over the years 2013-2017.

A new genotyping technology known as "DartCap" from Diversity Arrays Technology was also tested on school shark and southern bluefin tuna. The technology was successful in economically and reliably distinguishing half-sibling pairs, a complex task for genotyping methods. The use of the same technology in future CKMR projects on other species can significantly cut down on development overhead and unit costs.

The project has also broadened the experience in designing future CKMR projects with particular attention to "power calculations" for selecting sample sizes that achieve specific assessment objectives. Although the developed approach necessitates a substantial amount of preliminary work, it considerably aids longer-term planning.

9th summary cut

Words: 140
Characters: 1,030

llm = ChatOpenAI(model_name="gpt-4", temperature=1.0, request_timeout=120)

CKMR, even without a B0, can be used to estimate yield replacement and set conservative RBCs below that level. Future projections using median results from the CKMR model, alongside fixed levels of fishing mortality, allow for the calculation of increasing catches accounting for the potential increase in unavoidable bycatch as the stock rebuilds. The average fishing mortality rate from 2013-2017 is recommended for catch calculations. The novel genotyping technology "DartCap", from Diversity Arrays Technology, proved to be cost-effective and successful in distinguishing half-sibling pairs, which is a challenging task for any genotyping method. This technology represents future savings in CKMR projects for different species. Additionally, valuable experience was gained on designing future CKMR projects, specifically in conducting "power calculations" for selecting sample sizes to achieve specific assessment goals. This approach entails considerable preliminary work but contributes significantly to long-term planning.

10th summary cut

Words: 158
Characters: 1,108

llm = ChatOpenAI(model_name="gpt-4", temperature=0.0, request_timeout=120)

The genotyping technology "DartCap" from Diversity Arrays Technology, which was tested for the first time on school shark and southern bluefin tuna, was successful and cost-effective, reliably distinguishing half-sibling pairs, a difficult task for a genotyping method. Future CKMR projects using the same technology will benefit from lower development overhead and unit costs.

Additionally, valuable experience was gained in designing future CKMR projects, particularly in "power calculations" for choosing sample sizes to achieve specific assessment goals. This approach requires significant preliminary work but aids in long-term planning.

11th summary cut

Words: 143
Characters: 1,030

llm = ChatOpenAI(model_name="gpt-4", temperature=0.3, request_timeout=120)

The CKMR model can be used to estimate replacement yield and set conservative RBCs, even without a B0. Future projections using this model, in conjunction with fixed future levels of fishing mortality, can calculate increasing catches that account for the likely rise in unavoidable bycatch as the stock rebuilds. SharkRAG recommended using catches that relate to the average fishing mortality rate from 2013-2017. The genotyping technology "DartCap" from Diversity Arrays Technology was successfully and economically tested on school shark and southern bluefin tuna, reliably distinguishing half-sibling pairs, a challenging task for genotyping methods. This technology will reduce costs and development overhead for future CKMR projects on other species. Additionally, we have gained insights into designing future CKMR projects, specifically in conducting "power calculations" for determining sample sizes to achieve specific assessment goals. This approach requires substantial preliminary work but aids in long-term planning.

Summaries - document two

First cut

Words: 96
Characters: 610

llm = ChatOpenAI(model_name="gpt-4", temperature=1.0, request_timeout=120)

The original summary provides brief descriptions and full names for various terms related to food industry regulations and sectors, including takeaway, hotel, hospitality, entertainment, and prisons; codes such as the Food Standards Code in Australia and New Zealand, the Franchising Code of Conduct, and the Horticulture Code of Conduct; the Meat Industry Act 1993 and Seafood Safety Act 2003 in Victoria; the Marine Stewardship Council (MSC); and the retail sector. This information can help those in the food industry understand the regulations they may need to follow and the sectors they may be a part of.

2nd summary cut

Words: 98
Characters: 619

llm = ChatOpenAI(model_name="gpt-4", temperature=0.0, request_timeout=120)

The original summary provides a list of various codes, acts, and sectors related to the food industry in Australia and New Zealand. These include the Food Standards Code, the Franchising Code, the Horticulture Code, the Meat Industry Act, the Marine Stewardship Council, the retail sector, and the Seafood Safety Act. These regulations and sectors cover a wide range of food-related areas, from ready-to-eat food served in hotels and prisons to the sale of packaged and unpackaged food products in retail markets. The summary could be improved by providing a brief explanation of what each code, act, or sector entails.

3rd summary cut

Words: 86
Characters: 592

llm = ChatOpenAI(model_name="gpt-4", temperature=0.3, request_timeout=120)

The original summary provides a list of terms and their full forms or definitions, mostly related to the food and hospitality industry in Australia and New Zealand. These include the Food Standards Code, the Franchising Code, the Horticulture Code, the Meat Industry Act, the Marine Stewardship Council, the Retail sector, and the Seafood Safety Act. The summary also mentions the sectors where ready-to-eat food is served, such as takeaways, hotels, hospitality, entertainment, and prisons. Without additional context, the original summary seems comprehensive and doesn't require refinement.

Andrew Fletcher • 17 Mar 2025

Refining text analysis for research data from regex to Python automation

regex
Python

IntroductionData extraction and filtering are crucial for developers working with large research datasets. Whether you're working on government archives, industry reports, or academic research projects, extracting meaningful insights efficiently can be challenging.  I'm going to explore how we...

Andrew Fletcher • 13 Feb 2025

Deploying a Python project from UAT to production using Git

Python

When deploying a Python project from a User Acceptance Testing (UAT) environment to Production, it’s essential to ensure that all dependencies and configurations remain consistent. Particularly in our situation where this was going to be the first deployment of AI semantic search functionality to...

Andrew Fletcher • 16 Jan 2025

get IP address from terminal OSX

Terminal

When troubleshooting network issues or configuring devices, knowing your IP address can be essential. Whether you're connected via Wi-Fi, Ethernet, or tethering through a mobile provider, macOS offers powerful built-in tools to quickly identify your IP address. Here's a practical guide tailored to...

Preparation

Set-up

Code

Elements

Summarization

Model

Temperature

Temperature = 0

Temperature > 0

Summaries - document one

First cut

2nd summary cut

3rd summary cut

4th summary cut

5th summary cut

6th summary cut

7th summary cut

8th summary cut

9th summary cut

10th summary cut

11th summary cut

Summaries - document two

First cut

2nd summary cut

3rd summary cut

Related articles