DEV Community πŸ‘©β€πŸ’»πŸ‘¨β€πŸ’»

Cover image for Streaming Large JSON Response in Spring.
Ratul sharker
Ratul sharker

Posted on • Updated on

Streaming Large JSON Response in Spring.

Introduction

Reading data from the database and generating response in Spring works like a charm. But things get messy when we tried to return large data set as a response.

Setup

As a sample data set i am using datacharmer/test_db. It has following tables

+--------------+------------------+
| table_name   |          records |
+--------------+------------------+
| employees    |           300024 |
| departments  |                9 |
| dept_manager |               24 |
| dept_emp     |           331603 |
| titles       |           443308 |
| salaries     |          2844047 |
+--------------+------------------+

Our goal is to return ~300K employees in the json response.

I am using:

  1. MySQL (v8.0.19) as database
  2. Spring Boot (2.7.4)
  3. Spring Web.
  4. Spring Data JPA.
  5. Lombok & Spring boot dev tools.

To talk with database, following is a enum and a entity

Gender.java

public enum Gender {
    M,F;
}
Enter fullscreen mode Exit fullscreen mode

Employee.java

@Entity
@Getter
@Setter
@Table(name = "employees")
public class Employee {

    @Id
    private Long empNo;

    @Column(nullable = false)
    private Date birthDate;

    @Column(nullable = false)
    private String firstName;

    @Column(nullable = false)
    private String lastName;

    @Column(nullable = false)
    @Enumerated(EnumType.STRING)
    private Gender gender;

    @Column(nullable = false)
    private Date hireDate;
}
Enter fullscreen mode Exit fullscreen mode

Here isn't much to describe about the entity. One thing is intentional that is, i just specified the @Id annotation here, without specifying the generation strategy, because i am interested in reading the already generated rows, instead of writing one.

Traditional Way

Declare everything layer by layer. At first declare a Repository, Service and a Controller as follows

EmployeeRepository.java

public interface EmployeeRepository extends JpaRepository<Employee, Long> {

}
Enter fullscreen mode Exit fullscreen mode

EmployeeService.java

@Service
@Transactional(readOnly = true)
@RequiredArgsConstructor
public class EmployeeService {

    private final EmployeeRepository employeeRepository;

    public List<Employee> findAllAtOnce() {
        return employeeRepository.findAll();
    }
}
Enter fullscreen mode Exit fullscreen mode

EmployeeController.java

@RestController
@RequiredArgsConstructor
public class EmployeeController {

    private final EmployeeService employeeService;

    @GetMapping(value = "/employees", params = {"at-once"})
    public List<Employee> allEmployeeAtOnce() {
        return employeeService.findAllAtOnce();
    }
}
Enter fullscreen mode Exit fullscreen mode

Now run the spring boot project and start your favourite terminal to do some curl request

curl http://localhost:8080/employees?at-once=true > response.json
Enter fullscreen mode Exit fullscreen mode

We are requesting all 300K employees and save it into response.json file.

Initial success

So we are getting a 45MB of response. Part of this response looks like following

[
  {
    "empNo":10001,
    "birthDate":"1953-09-01T18:00:00.000+00:00",
    "firstName":"Georgi",
    "lastName":"Facello",
    "gender":"M",
    "hireDate":"1986-06-25T18:00:00.000+00:00"
  },
  ...
]
Enter fullscreen mode Exit fullscreen mode

So it seems we are done here. But wait, if the response itself is 45MB alone, then how much memory is needed to generate this response ? All the employee objects and the response itself needed to be in memory at once.

Let's try our program, limiting the heap size by providing the jvm argument -Xmx64M. We are limiting jvm heap size upto 64MB. Now trying to curl again

JVM Strikes with heap oveflow

Overall logs outputs following:

2022-10-07 02:22:52.322 ERROR 6010 --- [nio-8080-exec-1] o.h.engine.jdbc.spi.SqlExceptionHelper   : Java heap space
2022-10-07 02:23:15.497 ERROR 6010 --- [alina-utility-1] o.a.coyote.http11.Http11NioProtocol      : Error processing async timeouts

java.util.concurrent.ExecutionException: java.lang.OutOfMemoryError: Java heap space
        at java.base/java.util.concurrent.FutureTask.report(FutureTask.java:122) ~[na:na]
        at java.base/java.util.concurrent.FutureTask.get(FutureTask.java:191) ~[na:na]
        at org.apache.coyote.AbstractProtocol.startAsyncTimeout(AbstractProtocol.java:632) ~[tomcat-embed-core-9.0.65.jar:9.0.65]
        at org.apache.coyote.AbstractProtocol.lambda$start$0(AbstractProtocol.java:617) ~[tomcat-embed-core-9.0.65.jar:9.0.65]
        at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515) ~[na:na]
        at java.base/java.util.concurrent.FutureTask.runAndReset(FutureTask.java:305) ~[na:na]
        at java.base/java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:305) ~[na:na]
        at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) ~[na:na]
        at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) ~[na:na]
        at org.apache.tomcat.util.threads.TaskThread$WrappingRunnable.run(TaskThread.java:61) ~[tomcat-embed-core-9.0.65.jar:9.0.65]
        at java.base/java.lang.Thread.run(Thread.java:834) ~[na:na]
Caused by: java.lang.OutOfMemoryError: Java heap space
        at java.base/java.util.concurrent.ConcurrentHashMap$KeySetView.iterator(ConcurrentHashMap.java:4627) ~[na:na]
        at java.base/java.util.Collections$SetFromMap.iterator(Collections.java:5567) ~[na:na]
        at org.apache.coyote.AbstractProtocol.lambda$startAsyncTimeout$1(AbstractProtocol.java:640) ~[tomcat-embed-core-9.0.65.jar:9.0.65]
        at org.apache.coyote.AbstractProtocol$$Lambda$1112/0x00000008008c2440.run(Unknown Source) ~[na:na]
        ... 7 common frames omitted
Enter fullscreen mode Exit fullscreen mode

Clearly loading so much objects at once and keeping large serialised json buffer in memory are the culprits here.

We can do better.

Streaming Way

Let's point out, where are so much memories are used ?

  1. When we read the managed entities from the database via JPA.
  2. When generating the json response.

Now can we read the managed entities at application ease ?

Easiest way to do so is reading the employee list, page by page. Let's say page size is 10K. First read the 1-10K employee, then read the 10001-20K employee and so forth. Reading all the 300K entries. In this approach we are keeping 10K records in memory at once. Downside of this approach is that, database has to run the query every time. If the employee list is filtered by complicated condition and ordered, then it will run that complicated condition and ordering each time a page is retrieved.

Another way is to stream the result set into the application as per application ease, instead of reading the whole result set at once. Spring data jpa's JpaRepository already supports streaming result set.

EmployeeRepository.java

public interface EmployeeRepository extends JpaRepository<Employee, Long> {

    @QueryHints(
        value = {
            @QueryHint(name = "org.hibernate.fetchSize", value = Integer.MIN_VALUE + "" )
        }
    )
    @Query("SELECT e FROM Employee e")
    public Stream<Employee> streamAll();
}
Enter fullscreen mode Exit fullscreen mode

Few word about fetch size
In general fetch size defines, how many records to pull from database to the application server. But it depends on the database vendor and the database driver you are using. If nothing specified MySql fetch all the rows at once. For fetching one row at a time β€˜Integer.MIN_VALUE’ is expected in the fetch size parameter.

More about MySQL fetch size.

More about fetch size in MySql and other databases are specified in a future post

So now we are fetching one record at a time from our database. Pressure from our java heap space is relieved.

Now time to design the service layer to safely fetch the records and output it into the http response in json format.

EmployeeService.java

@Service
@Transactional(readOnly = true)
@RequiredArgsConstructor
public class EmployeeService {

    private final EmployeeRepository employeeRepository;
    private ObjectMapper objectMapper;


    @PostConstruct
    public void postConstruct() {
        objectMapper = new ObjectMapper();
    }

        .... previous code

    public void findAllStreaming(OutputStream outputStream) throws IOException {
        Stream<Employee> employees = employeeRepository.streamAll();

        JsonGenerator jsonGenerator = objectMapper.getFactory().createGenerator(outputStream);

        try {

            jsonGenerator.writeStartArray();

            Iterator<Employee> employeeIterator = employees.iterator();
            while(employeeIterator.hasNext()) {
                Employee employee = employeeIterator.next();
                jsonGenerator.writeObject(employee);
            }

            jsonGenerator.writeEndArray();

        } catch(Exception ex) {
            throw ex;
        } finally {

            if(employees != null) {
                employees.close();
            }

            if(jsonGenerator != null)  {
                jsonGenerator.close();
            }
        }
    }
}
Enter fullscreen mode Exit fullscreen mode

To put things simply, this method findAllStreaming() takes an OutputStream to write the whole json response into. To write json into the stream chunk by chunk we are using JsonGenerator. JsonGenerator is provided by Jackson-Core. More detailed about how to write stream event is covered in this blog post by the author of the library.

We start by writing the start of the array token into the string which outputs json array start ([) into the output stream. Then we write the employee object which we get by stream from the database. After writing all the employee object into the stream we then close the array operator by writing json array end (]) into the stream.

Now time to connect the controller with the newly written method.

@RestController
@RequiredArgsConstructor
public class EmployeeController {

    private final EmployeeService employeeService;

        ... Previous Code

    @GetMapping(value = "/employees", params = "stream")
    public void streamAllEmployee(HttpServletResponse response) throws IOException {
        response.setHeader(HttpHeaders.CONTENT_ENCODING, MediaType.APPLICATION_JSON_VALUE);
        employeeService.findAllStreaming(response.getOutputStream());
    }
}
Enter fullscreen mode Exit fullscreen mode

Now trying with the -Xmx64m jvm arg

JVM is still unhappy

We are still struggling, with the memory overflow issue. What could go wrong here ?

If we look closely, inside the streaming portion

Iterator<Employee> employeeIterator = employees.iterator();
while(employeeIterator.hasNext()) {
    Employee employee = employeeIterator.next();
    jsonGenerator.writeObject(employee);
}
Enter fullscreen mode Exit fullscreen mode

We are fetching entities here from database. Each time a new entity is being generated. Hibernate return's persistent entity every time. So our persistent context is full of entity objects.

We can directly @Autowired the EntityManager and call detach on it. But we can do better. We can introduce our own custom BaseRepository equipped with detach method.

BaseRepository.java

@NoRepositoryBean
public interface BaseRepository<T, ID> extends JpaRepository<T, ID> {

    public void detach(T entity);
}
Enter fullscreen mode Exit fullscreen mode

BaseRepositoryImpl.java

public class BaseRepositoryImpl<T, ID> extends SimpleJpaRepository<T, ID> implements BaseRepository<T, ID> {

    private final EntityManager entityManager;

    public BaseRepositoryImpl(JpaEntityInformation<T, ID> entityInformation, EntityManager entityManager) {
        super(entityInformation, entityManager);
        this.entityManager = entityManager;
    }

    public BaseRepositoryImpl(Class<T> domainClass, EntityManager em) {
        super(domainClass, em);
        this.entityManager = em;
    }

    @Override
    public void detach(T entity) {
        entityManager.detach(entity);
    }   
}
Enter fullscreen mode Exit fullscreen mode

Register the BaseRepositoryImpl class as the repository base class

StreamJsonResponseServerApplication.java

@SpringBootApplication
@EnableJpaRepositories(repositoryBaseClass = BaseRepositoryImpl.class)
public class StreamJsonResponseServerApplication {

    ... Previous Code

}
Enter fullscreen mode Exit fullscreen mode

Now update the EmployeeRepository.java

public interface EmployeeRepository extends BaseRepository<Employee, Long> {
    ... Previous Code
}
Enter fullscreen mode Exit fullscreen mode

And inside the service

Iterator<Employee> employeeIterator = employees.iterator();
while(employeeIterator.hasNext()) {
    Employee employee = employeeIterator.next();     
    employeeRepository.detach(employee); // Removing entity object from persistent context.
    jsonGenerator.writeObject(employee);
}
Enter fullscreen mode Exit fullscreen mode

Now try again with the curl

Finally Success

So we the response landed successfully.

I even tried with extreme situation by passing -Xmx32m as jvm argument. It also works like charm.

Portion of the downloaded json looks like following

[
    {
        "empNo":10001,
        "birthDate":-515397600000,
        "firstName":"Georgi",
        "lastName":"Facello",
        "gender":"M",
        "hireDate":"26 June 1986"
    }
    ...
]
Enter fullscreen mode Exit fullscreen mode

Downsides

In this approach, servlet containers thread pool can be drained out under heavy load, as these responses taking longer that usual. Application managed thread can be used in returning the response, which will be covered in details in a future post.

Bonus

In order to bring some customisation in the json format is quite easy.

Say we want a custom format in the date then inside the EmployeeService.java

@PostConstruct
public void postConstruct() {
    objectMapper = new ObjectMapper();
    objectMapper.setDateFormat(new SimpleDateFormat("dd MMMM YYYY"));
}
Enter fullscreen mode Exit fullscreen mode

It outputs following json

[
    {
        "empNo":10001,
        "birthDate":"02 September 1953",
        "firstName":"Georgi",
        "lastName":"Facello",
        "gender":"M",
        "hireDate":"26 June 1986"
    }
    ...
]
Enter fullscreen mode Exit fullscreen mode

Another thing like we want to restructure the json, so that instead of returning the employees inside of an array, we want a json object inside which a employees keys will be present holding to the list of employees. For that we could customise the method findAllStreaming accordingly

try {
    jsonGenerator.writeStartObject();            
    jsonGenerator.writeArrayFieldStart("employees");
    Iterator<Employee> employeeIterator = 
    employees.iterator();
    while(employeeIterator.hasNext()) {
        Employee employee = employeeIterator.next();
        employeeRepository.detach(employee);
        jsonGenerator.writeObject(employee);
    }

    jsonGenerator.writeEndArray();
    jsonGenerator.writeEndObject(); 
} ... Previous Code
Enter fullscreen mode Exit fullscreen mode

Which generates following

{
    "employees" : [
        {
            "empNo":10001,
            "birthDate":"02 September 1953",
            "firstName":"Georgi",
            "lastName":"Facello",
            "gender":"M",
            "hireDate":"26 June 1986"
        }
        ...
    ]
}
Enter fullscreen mode Exit fullscreen mode

All the code is shared into my github repository.

Want to know, how to consume this large json responses without hitting the jvm heap boundary ? Read Consume Large JSON Response in Spring.

Top comments (4)

Collapse
 
unixdev profile image
Mojahedul Hoque Abul Hasanat

Nice and useful article! Personally, I would switch to raw SQL queries to reduce memory footprint further. I would also do HTTP response streaming.

BTW, is there a type regarding MySQL fetchsize and Integer.MAX_VALUE? It should probably be Integer.MIN_VALUE?

Collapse
 
ratulsharker profile image
Ratul sharker • Edited on

As per my understanding, copying content to the response's output stream achieves similar thing to the HTTP Response Streaming. Doing so is a anti-pattern, because service layer needed to know about the response serialization formats and others, which meant to be invisible to the service layer and sole responsibility of controller layer. I tried to keep things decoupled by introducing OutputStream as a parameter.

In case of the raw SQL queries, we will loose the capabilities of dynamic query support provided by JPA Specification. I have build up the dynamic queries myself. To reduce the memory footprint further, projection can be used so that persistent context is not used by hibernate and we don't need to detach the entity (if using projection, then then it won't be entity anymore) every time.

Issue with spring provided StreamingResponseBody functional interface is that, it requires to keep transaction session opened. Then we cannot set spring.jpa.open-in-view to false in the application.properties.

You are right vaiya, about the Integer.MIN_VALUE as a value of fetchSize hint. More controlled option is using useCursorFetch=true connection property to customize the fetchSize according to use case.

MySQL JDBC Implementation note

conn = DriverManager.getConnection("jdbc:mysql://localhost/?useCursorFetch=true", "user", "s3cr3t");
stmt = conn.createStatement();
stmt.setFetchSize(100);
rs = stmt.executeQuery("SELECT * FROM your_table_here");
Enter fullscreen mode Exit fullscreen mode
Collapse
 
sunoyon profile image
Sheikh Al Amin Sunoyon • Edited on

Good article. One possible solution of the downfall might be limiting concurrent requests. As the purpose of this approach is to serve heavy flow of data via API, the service might limit concurrent requests.
Besides, the API should be able to serve data with an offset. If the request stops in the middle, the client can request the API with the next offset. (in my POV though)

Collapse
 
ratulsharker profile image
Ratul sharker

The idea of supporting the offset is great. Doing so, consumer can consume at it's ease. But the problem is that, JPQL directly does not support offset only feature, JPQL queries with page and pageSize. So i cannot specify offset alone in JPQL.

It can be achieved using the entityManager directly. Searching through the javadoc, found setFirstResult(int startPosition). I haven't tested using this method, but according to the documentation, it should set the start offset of returning result set.

Implementation will look like following:

@Override
    public <T> Stream<T> findAll(..., int offset) {
        CriteriaBuilder builder = entityManager.getCriteriaBuilder();
        CriteriaQuery<T> query = builder.createQuery(clazz);

                // ... Necessary implementation specific codes

        TypedQuery<T> typedQuery = entityManager.createQuery(query);
                typedQuery.setFirstResult(offset); // Here we are setting the offset.

        return typedQuery.getResultStream();
    }
Enter fullscreen mode Exit fullscreen mode

Here is a post you might want to check out:

Regex for lazy developers

regex for lazy devs

Sorry for the callout πŸ˜†